Intro

Segmentation is the prediction of a mask that covers an area of interest, which may be an object, a person, etc. What sets Segment Anything apart from general segmentation is the introduction of prompts that specify where to segment. These prompts can be quite loose, such as a point, a box, etc.
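
As a concrete illustration, the official repo (the first link in the references) exposes roughly the following interface. This sketch follows the repo's README; the checkpoint filename, the stand-in image, and the example point are placeholders:

```python
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Load a pretrained SAM model (the checkpoint path is a placeholder).
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)

# `image` is an HxWx3 uint8 RGB array; a blank stand-in is used here.
image = np.zeros((480, 640, 3), dtype=np.uint8)
predictor.set_image(image)

# The prompt: a single foreground point (label 1 = foreground, 0 = background).
masks, iou_scores, low_res_logits = predictor.predict(
    point_coords=np.array([[320, 240]]),
    point_labels=np.array([1]),
)
```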

Model

Segment Anything’s model consists of three main components:

  1. Prompt encoder: This part extracts features from the prompt, using positional encodings for points and boxes and convolutions for dense mask prompts.
  2. Image encoder: This part directly uses a Vision Transformer (ViT).
  3. Lightweight mask decoder: This part mainly fuses the prompt embeddings with the image embedding. It is called lightweight because it has only a small number of layers.

All three parts are standard designs; as the paper puts it, “Surprisingly, we find that a simple design satisfies all three constraints”. The key point of this paper is therefore not the model, but the data.
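
For orientation, the three components are visible directly as submodules of the model object in the official repo (module and class names as of the current codebase; the checkpoint path is again a placeholder):

```python
from segment_anything import sam_model_registry

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")

# The three components described above, as submodules of the model:
print(type(sam.image_encoder).__name__)   # ImageEncoderViT
print(type(sam.prompt_encoder).__name__)  # PromptEncoder
print(type(sam.mask_decoder).__name__)    # MaskDecoder
```

A practical consequence of this split is that the heavy image encoder runs once per image, while the lightweight decoder can be re-run cheaply for each new prompt.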

Data Engine

The core of training Segment Anything lies in a large amount of labeled data. The most striking contribution of this paper is the annotation of a dataset containing more than 1 billion masks, and a Data Engine is used to produce it.

The process of using the Data Engine to annotate data is roughly divided into three stages:

  1. Assisted-manual stage: manual annotation with the help of a model. This stage can be understood as a process of active learning.

    1. First, an initial model is trained on public segmentation datasets.
    2. Second, annotators correct the masks predicted by the model.
    3. Finally, the model is retrained on the newly labeled data.

    The above three steps are repeated 6 times, resulting in 4.3 million mask annotations.

  2. Semi-automatic stage: The goal of this stage is to improve the diversity of the masks, and it can still be understood as an active learning process. Simply put, masks the model is confident about are labeled automatically, and human effort is focused on the masks the model is not confident enough about.

    The way confident masks are found here is quite clever: an object detector is trained on the masks obtained in the first stage, so if a predicted mask region is consistent enough to be recognized as an object, it is likely to be a good mask.

    For example, suppose a picture contains 20 possible masks (that is, it can be divided into 20 regions). We first run the current model to do the segmentation, but this will probably produce good masks for only some of the regions, while others are labeled poorly. We now need to automatically identify which masks are good (confident). The approach in this paper is to run object detection on the predicted masks: if a mask is detected as an object, we consider it confident. Suppose this finds 8 confident masks; then the annotators only have to annotate the remaining 12, which saves manpower.

    The model is retrained after each round of labeling, and this is repeated 5 times, eventually adding another 5.9 million mask labels.

  3. Fully automatic stage: Simply put, the model trained in the previous stage is used to annotate the data on its own. A number of strategies are used to improve the quality of annotation (a code sketch of these filters follows this list), including

    1. Using the model's predicted IoU values to filter out less confident masks.
    2. Considering only stable masks. To be more specific, for each pixel the model outputs a value from 0 to 1, and generally 0.5 is used as the threshold to decide whether that pixel belongs to the mask. Stable means that when the threshold is moved a little above or below 0.5 (e.g., from 0.45 to 0.55), the resulting mask stays basically unchanged. This means the values predicted by the model are well separated on the two sides of the boundary. I personally think of this as another kind of confidence.
    3. Filtering out duplicates with non-maximum suppression (NMS).

    This stage labeled another 1.1 billion masks (an increase of more than 100×).
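
To make the three filters above concrete, here is a minimal PyTorch sketch of the fully automatic stage's post-processing. The default threshold values mirror those of the official repo's automatic mask generator, but the function itself, the tensor shapes, and the use of probabilities (rather than logits) are simplifications for illustration:

```python
import torch
from torchvision.ops import masks_to_boxes, nms

def stability_score(probs: torch.Tensor, offset: float = 0.05) -> torch.Tensor:
    """IoU between the masks thresholded at 0.5 - offset and 0.5 + offset.

    probs: (N, H, W) per-pixel mask probabilities in [0, 1].
    A score near 1 means the mask barely changes as the threshold moves,
    i.e. the model is decisive on both sides of the mask boundary.
    """
    strict = probs > (0.5 + offset)  # shrinks the mask
    loose = probs > (0.5 - offset)   # grows the mask
    inter = (strict & loose).flatten(1).sum(-1).float()
    union = (strict | loose).flatten(1).sum(-1).float()
    return inter / union.clamp(min=1)

def filter_masks(probs, pred_iou, iou_thresh=0.88, stab_thresh=0.95, nms_thresh=0.7):
    # 1. Drop masks whose predicted IoU (the model's own confidence) is low.
    keep = pred_iou > iou_thresh
    # 2. Drop unstable masks.
    keep &= stability_score(probs) > stab_thresh
    probs, pred_iou = probs[keep], pred_iou[keep]
    masks = probs > 0.5
    # 3. Remove near-duplicates with NMS on the masks' bounding boxes
    #    (assumes every surviving mask is non-empty).
    boxes = masks_to_boxes(masks)
    kept = nms(boxes.float(), pred_iou, nms_thresh)
    return masks[kept]
```

Masks that survive all three filters go into the dataset; in the paper, this stage alone produced the 1.1 billion masks mentioned above.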

References

  1. https://github.com/facebookresearch/segment-anything/tree/main
  2. https://ai.facebook.com/research/publications/segment-anything/