Intro

This article introduces Hugging Face and its Transformers library, the most popular NLP library in Python. It provides state-of-the-art pretrained models and a clean API that makes it simple to build NLP pipelines, even for beginners.

This article covers how to use the pipeline, model, and tokenizer classes, how to combine them with PyTorch or TensorFlow, how to save and load models, how to use models from the official Model Hub, and how to finetune your own models.

Installation

First of all, we need to install the Transformers library.

This library should be combined with your favorite deep learning library, which might be PyTorch, TensorFlow, or Flax, so you should install one of them first.
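
For example, PyTorch can usually be installed via pip (check the official PyTorch website for the command that matches your platform):

pip install torch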

Then we can install this library with the following command:

pip install transformers

Pipeline

A pipeline makes it very simple to apply an NLP task because it abstracts a lot of things away. The following code shows how it works:

from transformers import pipeline

# create a pipeline object and put in a task
classifier = pipeline("sentiment-analysis")

# apply the classifier and put in the data
res = classifier("I've been waiting for a cat my whole life.")

print(res)
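
The result is a list with one dictionary per input, roughly of this shape (the exact score will vary):

[{'label': 'POSITIVE', 'score': 0.99}]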

The pipeline will do three things for us:

  1. Pre-processing: it first pre-processes the text; in this case it applies a tokenizer.
  2. Feeding the model: it then feeds the pre-processed text to the model.
  3. Post-processing: it shows us the result we expect.

Here is another pipeline example:

from transformers import pipeline

# create a pipeline object, put in a task and give a specific model
generator = pipeline("text-generation", model="distilgpt2")

# apply the generator and put in the data
res = generator(
    "I will name my cat",
    max_length=30,
    num_return_sequences=2,
)

print(res)

The model can be either one we saved locally or one from the model hub.
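
For example, a locally saved model directory (a hypothetical path here) can be passed in the same way as a hub name:

from transformers import pipeline

# "./saved" is a hypothetical local directory created with save_pretrained()
generator = pipeline("text-generation", model="./saved")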

We can also do zero-shot classification: we give the pipeline a text whose label is unknown together with a set of candidate labels, and the model classifies the text automatically.

from transformers import pipeline

# create a pipeline object and put in a task
classifier = pipeline("zero-shot-classification")

# apply the classifier and put in the data
res = classifier(
"This is a course about Python.",
candidate_labels=["education", "politics", "business"],
)

print(res)
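
The result is a dictionary containing the input sequence, the candidate labels ranked from most to least likely, and the corresponding scores, roughly of this shape (scores and ordering are illustrative):

{'sequence': 'This is a course about Python.', 'labels': ['education', 'business', 'politics'], 'scores': [0.89, 0.06, 0.05]}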

All available pipelines can be found in the official documentation.

Tokenizer and Model

Now let’s look behind the pipeline and understand the different steps a little better. For this, let’s look at the Tokenizer and Model classes.

from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# create a pipeline object and put in a task
classifier = pipeline("sentiment-analysis")

# apply the classifier and put in the data
res = classifier("I've been waiting for a cat my whole life.")

print(res)

# specify a model name, in this case it is the default model
model_name = "distilbert-base-uncased-finetuned-sst-2-english"

# get model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
res = classifier("I've been waiting for a cat my whole life.")
print(res)

Since the model we used is the default model, the code above should produce exactly the same result.

A tokenizer is in charge of preparing the inputs for a model: it converts the text into a numerical representation that the model understands. To use it, we can call the tokenizer directly and give it a text or a list of texts as input.

from transformers import AutoTokenizer

# specify a model name
model_name = "distilbert-base-uncased-finetuned-sst-2-english"

# get the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

sequence = "Winter is coming."
res = tokenizer(sequence)
print(res)
tokens = tokenizer.tokenize(sequence)
print(tokens)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)
decoded_string = tokenizer.decode(ids)
print(decoded_string)

We can see that the ids produced by print(res) differ slightly from those produced by print(ids): the first looks like [101, …, 102], where 101 and 102 are the special tokens marking the beginning and the end of the sentence ([CLS] and [SEP]).
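
We can verify this with a quick check using the same tokenizer:

print(tokenizer.decode([101, 102]))  # [CLS] [SEP]
print(tokenizer.cls_token_id, tokenizer.sep_token_id)  # 101 102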

Combine with PyTorch or TensorFlow

In this tutorial we will work with PyTorch, but the code is very similar for TensorFlow. It is noteworthy that with TensorFlow the class names usually carry a 'TF' prefix, like TFAutoModelForSequenceClassification.
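
For example, loading the TensorFlow version of the model is a one-line change (a minimal sketch; it requires TensorFlow to be installed):

from transformers import TFAutoModelForSequenceClassification  # note the TF prefix

tf_model = TFAutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

The PyTorch version below goes through the full workflow in detail.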

from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F

X_train = [
"I've been waiting for a cat my whole life.",
"Python is great!"
]

######################## Apply the pipeline ########################
# specify a model name, in this case it is the default model
model_name = "distilbert-base-uncased-finetuned-sst-2-english"

# get model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

res = classifier(X_train)
print(res)

######################## Do it separately ########################
batch = tokenizer(X_train, padding=True, truncation=True, max_length=512, return_tensors="pt")
print(batch)

with torch.no_grad():
    # unpack the batch because it is a dictionary
    outputs = model(**batch)
    print(outputs)
    predictions = F.softmax(outputs.logits, dim=1)
    print(predictions)
    labels = torch.argmax(predictions, dim=1)
    print(labels)

This could be useful if we want to finetune our model with a PyTorch training loop.
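
As a minimal sketch of what one step of such a loop could look like (the labels here are hypothetical, with 1 meaning positive):

import torch

labels = torch.tensor([1, 1])  # hypothetical labels for the two sentences in X_train
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

model.train()
outputs = model(**batch, labels=labels)  # the model returns a loss when labels are provided
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()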

Save and Load Model

The following code shows an example of saving and loading a model and tokenizer.

from transformers import AutoTokenizer, AutoModelForSequenceClassification

save_directory = "saved"

# save
tokenizer.save_pretrained(save_directory)
model.save_pretrained(save_directory)

# load
tok = AutoTokenizer.from_pretrained(save_directory)
mod = AutoModelForSequenceClassification.from_pretrained(save_directory)
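
The reloaded tokenizer and model can then be plugged back into a pipeline:

from transformers import pipeline

classifier = pipeline("sentiment-analysis", model=mod, tokenizer=tok)
print(classifier("I've been waiting for a cat my whole life."))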

Model Hub

Now let’s look at how we can use different models from the Model Hub.

All we need to do is choose a model, copy its name, and paste the name into the code.

from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

txt = """Game design is not an easy subject to write about. lenses and fundamentals are useful tools, but to truly understand game design is to understand an incredibly complex web of creativity, psychology, art, technology, and business. everything in this web is connected to everything else. Changing one element affects all the others, and the understanding of one element influences the understanding of all of the others. Most experienced designers have built up this web in their minds, slowly, over many years, learning the elements and relationships by trial and error. and this is what makes game design so hard to write about. books are necessarily linear. one idea must be presented at a time. for this reason, many game design books have an incomplete feeling to them—like a guided nighttime tour with a flashlight, the reader sees a lot of interesting things, but can’t really comprehend how they all fit together.
"""

print(summarizer(txt, max_length=130, min_length=30, do_sample=False))

Finetune

Now we will briefly go over how we can finetune our own model.

Official documentation on finetuning is available on the Hugging Face website; both PyTorch and TensorFlow examples are provided in Colab.

The following code sketches how we can finetune the model with our own dataset.

# 1. prepare dataset
# 2. load pretrained Tokenizer, call it with dataset -> encoding
# 3. build PyTorch dataset with encodings
# 4. load pretrained model
# 5. a) load Trainer and train it
#    b) native PyTorch training loop

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments("test-trainer")

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

trainer.train()
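
For option 5 b), a minimal sketch of a native PyTorch training loop could look like this; it assumes train_dataset is the PyTorch dataset built in step 3 and that each item contains a "labels" entry:

import torch
from torch.utils.data import DataLoader

loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

model.train()
for epoch in range(3):
    for batch in loader:
        outputs = model(**batch)  # the model returns a loss because the batch contains labels
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()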
