Intro
This article introduces Hugging Face and the Transformers library, the most popular NLP library in Python. It provides state-of-the-art implementations of various models and a clean API that makes it simple to build NLP pipelines, even for beginners.
This article covers how to use the pipeline, model, and tokenizer classes, how to combine them with PyTorch or TensorFlow, how to save and load models, how to use models from the official Model Hub, and how to fine-tune your own models.
Installation
First of all, we need to install the Transformers library.
This library should be combined with your favorite deep learning library, which might be PyTorch, TensorFlow, or Flax, so you should install one of them first.
Then we can install this library with the following command:
pip install transformers
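If you choose PyTorch, for example, it can usually be installed with pip as well. This is a minimal sketch; the PyTorch website lists the exact command for your platform and CUDA version:

pip install torch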
Pipeline
A pipeline makes it super simple to apply an NLP task, because it abstracts a lot of things away. The way it works is shown in the following code:
from transformers import pipeline

# create a pipeline object and put in a task
classifier = pipeline("sentiment-analysis")

# apply the classifier and put in the data
res = classifier("I've been waiting for a cat my whole life.")

print(res)
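The printed result is a list with one dict per input, each containing a label and a confidence score, something like [{'label': 'POSITIVE', 'score': ...}] (the exact score depends on the model).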
The pipeline will do three things for us:
Pre-processing: The first task is pre-processing the text; in this case it applies a tokenizer.
Feed the model: Then it feeds the pre-processed text to the model.
Post-processing: Finally, it shows us the result we expect.
Here is another example of a pipeline, this time for text generation:
from transformers import pipeline

# create a pipeline object, put in a task and give a specific model
generator = pipeline("text-generation", model="distilgpt2")

# apply the generator and put in the data
res = generator(
    "I will name my cat",
    max_length=30,
    num_return_sequences=2,
)

print(res)
The model can be either one we saved locally or one from the Model Hub.
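For instance, a minimal sketch of loading a locally saved model, assuming a hypothetical directory ./saved created by save_pretrained (see the Save Model section below):

from transformers import pipeline

# "./saved" is a hypothetical local directory created by save_pretrained
classifier = pipeline("sentiment-analysis", model="./saved")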
We can also do zero-shot classification, which means we can give the model a text without knowing the corresponding label; we pass in different candidate labels, and the model classifies the text automatically.
from transformers import pipeline

# create a pipeline object and put in a task
classifier = pipeline("zero-shot-classification")

# apply the classifier and put in the data
res = classifier(
    "This is a course about Python.",
    candidate_labels=["education", "politics", "business"],
)
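Here res is a dict containing the input sequence, the candidate labels sorted from most to least likely, and a score for each label; the top-scoring label (presumably "education" for this sentence) is the model's prediction.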
Tokenizer and Model
Now let's have a look behind the pipeline and understand the different steps a little bit better. For this, let's look at the Tokenizer and Model classes.
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# create a pipeline object and put in a task
classifier = pipeline("sentiment-analysis")

# apply the classifier and put in the data
res = classifier("I've been waiting for a cat my whole life.")

print(res)

# specify a model name, in this case it is the default model
model_name = "distilbert-base-uncased-finetuned-sst-2-english"

# get model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# build the same pipeline explicitly from the model and tokenizer
classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
res = classifier("I've been waiting for a cat my whole life.")
print(res)
Since the model we used is the default model, the code above should produce the very same result.
A tokenizer is in charge of preparing the inputs for a model. What it does is basically turn the text into a mathematical representation that the model understands. To use it, we can call the tokenizer directly and give it a text or a list of texts as input.
from transformers import AutoTokenizer

# specify a model name
model_name = "distilbert-base-uncased-finetuned-sst-2-english"

# get the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

text = "I've been waiting for a cat my whole life."

# call the tokenizer directly on the text
res = tokenizer(text)
print(res)

# or do it step by step: split into tokens, then map tokens to ids
tokens = tokenizer.tokenize(text)
print(tokens)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

# decode the ids back into a string
decoded = tokenizer.decode(ids)
print(decoded)
We can see the ids produced by print(res) are a little different from those produced by print(ids): the first looks like [101, …, 102], where 101 and 102 mark the beginning and the end of the sentence (special tokens the tokenizer adds when called directly).
Combine with PyTorch or TensorFlow
In this tutorial we will work with PyTorch, but the code is very similar for TensorFlow. It's noteworthy that with TensorFlow the class names usually get a TF prefix, e.g. TFAutoModelForSequenceClassification.
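For reference, a minimal sketch of what the TensorFlow flavor of the code below looks like (assuming TensorFlow is installed; note the TF prefix and return_tensors="tf"):

from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = TFAutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# ask the tokenizer for TensorFlow tensors instead of PyTorch ones
batch = tokenizer(["Python is great!"], padding=True, truncation=True, return_tensors="tf")
outputs = model(batch)
print(outputs.logits)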
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F

X_train = [
    "I've been waiting for a cat my whole life.",
    "Python is great!",
]

######################## Apply the pipeline ########################
# specify a model name, in this case it is the default model
model_name = "distilbert-base-uncased-finetuned-sst-2-english"

# get model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
res = classifier(X_train)
print(res)

######################## Do it separately ########################
batch = tokenizer(X_train, padding=True, truncation=True, max_length=512, return_tensors="pt")
print(batch)

with torch.no_grad():
    # unpack the batch because it is a dictionary
    outputs = model(**batch)
    print(outputs)
    predictions = F.softmax(outputs.logits, dim=1)
    print(predictions)
    labels = torch.argmax(predictions, dim=1)
    print(labels)
This could be useful if we want to fine-tune our model with a PyTorch training loop, as sketched below.
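A minimal sketch of such a loop, reusing the model, tokenizer, and batch from above; the labels tensor here is made up for illustration (two examples, both labeled positive = 1):

import torch
from torch.optim import AdamW

# hypothetical labels for the two X_train sentences (1 = positive)
labels = torch.tensor([1, 1])

optimizer = AdamW(model.parameters(), lr=5e-5)
model.train()

for epoch in range(3):
    optimizer.zero_grad()
    # passing labels makes the model return the loss alongside the logits
    outputs = model(**batch, labels=labels)
    outputs.loss.backward()
    optimizer.step()
    print(f"epoch {epoch}: loss {outputs.loss.item():.4f}")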
Save Model
The following code shows an example of saving and loading a model and tokenizer.
save_directory = "saved"

# save
tokenizer.save_pretrained(save_directory)
model.save_pretrained(save_directory)

# load
tok = AutoTokenizer.from_pretrained(save_directory)
mod = AutoModelForSequenceClassification.from_pretrained(save_directory)
Model Hub
Now let's look at how we can use different models from the Model Hub.
All we need to do is choose a model on the hub, copy its name, and paste the name into the code.
txt = """Game design is not an easy subject to write about. lenses and fundamentals are useful tools, but to truly understand game design is to understand an incredibly complex web of creativity, psychology, art, technology, and business. everything in this web is connected to everything else. Changing one element affects all the others, and the understanding of one element influences the understanding of all of the others. Most experienced designers have built up this web in their minds, slowly, over many years, learning the elements and relationships by trial and error. and this is what makes game design so hard to write about. books are necessarily linear. one idea must be presented at a time. for this reason, many game design books have an incomplete feeling to them—like a guided nighttime tour with a flashlight, the reader sees a lot of interesting things, but can’t really comprehend how they all fit together. """
Finetune
The typical fine-tuning workflow looks like this:

# 1. prepare dataset
# 2. load pretrained Tokenizer, call it with dataset -> encoding
# 3. build PyTorch dataset with encodings
# 4. load pretrained model
# 5. a) load Trainer and train it
#    b) native PyTorch training loop
from transformers import Trainer, TrainingArguments
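A minimal sketch of step 5 a), assuming train_dataset is a PyTorch dataset of encodings built in steps 1-3 (hypothetical here, since this article does not show those steps):

from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")

training_args = TrainingArguments(
    output_dir="./results",          # where checkpoints are written
    num_train_epochs=3,
    per_device_train_batch_size=16,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,     # hypothetical dataset from steps 1-3
)

trainer.train()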