Semi-supervised object detection uses both labeled data and unlabeled data for training. It not only reduces the annotation burden for training high-performance object detectors but also further improves the object detector by using a large number of unlabeled data.
A typical procedure to train a semi-supervised object detector is as below:
We provide a dataset download script, which downloads the coco2017 dataset by default and decompresses it automatically.
python tools/misc/download_dataset.py
The decompressed dataset directory structure is as below:
mmdetection
├── data
│ ├── coco
│ │ ├── annotations
│ │ │ ├── image_info_unlabeled2017.json
│ │ │ ├── instances_train2017.json
│ │ │ ├── instances_val2017.json
│ │ ├── test2017
│ │ ├── train2017
│ │ ├── unlabeled2017
│ │ ├── val2017
There are two common experimental settings for semi-supervised object detection on the coco2017 dataset:
(1) Split train2017
according to a fixed percentage (1%, 2%, 5% and 10%) as a labeled dataset, and the rest of train2017
as an unlabeled dataset. Because the different splits of train2017
as labeled datasets will cause significant fluctuation on the accuracy of the semi-supervised detectors, five-fold cross-validation is used in practice to evaluate the algorithm. We provide the dataset split script:
python tools/misc/split_coco.py
By default, the script will split train2017
according to the labeled data ratio 1%, 2%, 5% and 10%, and each split will be randomly repeated 5 times for cross-validation. The generated semi-supervised annotation file name format is as below:
instances_train2017.{fold}@{percent}.json
instances_train2017.{fold}@{percent}-unlabeled.json
Here, fold
is used for cross-validation, and percent
represents the ratio of labeled data. The directory structure of the divided dataset is as below:
mmdetection
├── data
│ ├── coco
│ │ ├── annotations
│ │ │ ├── image_info_unlabeled2017.json
│ │ │ ├── instances_train2017.json
│ │ │ ├── instances_val2017.json
│ │ ├── semi_anns
│ │ │ ├── instances_train2017.1@1.json
│ │ │ ├── instances_train2017.1@1-unlabeled.json
│ │ │ ├── instances_train2017.1@2.json
│ │ │ ├── instances_train2017.1@2-unlabeled.json
│ │ │ ├── instances_train2017.1@5.json
│ │ │ ├── instances_train2017.1@5-unlabeled.json
│ │ │ ├── instances_train2017.1@10.json
│ │ │ ├── instances_train2017.1@10-unlabeled.json
│ │ │ ├── instances_train2017.2@1.json
│ │ │ ├── instances_train2017.2@1-unlabeled.json
│ │ ├── test2017
│ │ ├── train2017
│ │ ├── unlabeled2017
│ │ ├── val2017
(2) Use train2017
as the labeled dataset and unlabeled2017
as the unlabeled dataset. Since image_info_unlabeled2017.json
does not contain categories
information, the CocoDataset
cannot be initialized, so you need to write the categories
of instances_train2017.json
into image_info_unlabeled2017.json
and save it as instances_unlabeled2017.json
, the relevant script is as below:
from mmengine.fileio import load, dump
anns_train = load('instances_train2017.json')
anns_unlabeled = load('image_info_unlabeled2017.json')
anns_unlabeled['categories'] = anns_train['categories']
dump(anns_unlabeled, 'instances_unlabeled2017.json')
The processed dataset directory is as below:
mmdetection
├── data
│ ├── coco
│ │ ├── annotations
│ │ │ ├── image_info_unlabeled2017.json
│ │ │ ├── instances_train2017.json
│ │ │ ├── instances_unlabeled2017.json
│ │ │ ├── instances_val2017.json
│ │ ├── test2017
│ │ ├── train2017
│ │ ├── unlabeled2017
│ │ ├── val2017
There are two main approaches to semi-supervised learning, consistency regularization and pseudo label. Consistency regularization often requires some careful design, while pseudo label have a simpler form and are easier to extend to downstream tasks. We adopt a teacher-student joint training semi-supervised object detection framework based on pseudo label, so labeled data and unlabeled data need to configure different data pipeline:
(1) Pipeline for labeled data:
# pipeline used to augment labeled data,
# which will be sent to student model for supervised training.
sup_pipeline = [
dict(type='LoadImageFromFile', backend_args=backend_args),
dict(type='LoadAnnotations', with_bbox=True),
dict(type='RandomResize', scale=scale, keep_ratio=True),
dict(type='RandomFlip', prob=0.5),
dict(type='RandAugment', aug_space=color_space, aug_num=1),
dict(type='FilterAnnotations', min_gt_bbox_wh=(1e-2, 1e-2)),
dict(type='MultiBranch', sup=dict(type='PackDetInputs'))
]
(2) Pipeline for unlabeled data:
# pipeline used to augment unlabeled data weakly,
# which will be sent to teacher model for predicting pseudo instances.
weak_pipeline = [
dict(type='RandomResize', scale=scale, keep_ratio=True),
dict(type='RandomFlip', prob=0.5),
dict(
type='PackDetInputs',
meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
'scale_factor', 'flip', 'flip_direction',
'homography_matrix')),
]
# pipeline used to augment unlabeled data strongly,
# which will be sent to student model for unsupervised training.
strong_pipeline = [
dict(type='RandomResize', scale=scale, keep_ratio=True),
dict(type='RandomFlip', prob=0.5),
dict(
type='RandomOrder',
transforms=[
dict(type='RandAugment', aug_space=color_space, aug_num=1),
dict(type='RandAugment', aug_space=geometric, aug_num=1),
]),
dict(type='RandomErasing', n_patches=(1, 5), ratio=(0, 0.2)),
dict(type='FilterAnnotations', min_gt_bbox_wh=(1e-2, 1e-2)),
dict(
type='PackDetInputs',
meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
'scale_factor', 'flip', 'flip_direction',
'homography_matrix')),
]
# pipeline used to augment unlabeled data into different views
unsup_pipeline = [
dict(type='LoadImageFromFile', backend_args=backend_args),
dict(type='LoadEmptyAnnotations'),
dict(
type='MultiBranch',
unsup_teacher=weak_pipeline,
unsup_student=strong_pipeline,
)
]
(1) Build a semi-supervised dataset. Use ConcatDataset
to concatenate labeled and unlabeled datasets.
labeled_dataset = dict(
type=dataset_type,
data_root=data_root,
ann_file='annotations/instances_train2017.json',
data_prefix=dict(img='train2017/'),
filter_cfg=dict(filter_empty_gt=True, min_size=32),
pipeline=sup_pipeline)
unlabeled_dataset = dict(
type=dataset_type,
data_root=data_root,
ann_file='annotations/instances_unlabeled2017.json',
data_prefix=dict(img='unlabeled2017/'),
filter_cfg=dict(filter_empty_gt=False),
pipeline=unsup_pipeline)
train_dataloader = dict(
batch_size=batch_size,
num_workers=num_workers,
persistent_workers=True,
sampler=dict(
type='GroupMultiSourceSampler',
batch_size=batch_size,
source_ratio=[1, 4]),
dataset=dict(
type='ConcatDataset', datasets=[labeled_dataset, unlabeled_dataset]))
(2) Use multi-source dataset sampler. Use GroupMultiSourceSampler
to sample data form batches from labeled_dataset
and labeled_dataset
, source_ratio
controls the proportion of labeled data and unlabeled data in the batch. GroupMultiSourceSampler
also ensures that the images in the same batch have similar aspect ratios. If you don't need to guarantee the aspect ratio of the images in the batch, you can use MultiSourceSampler
. The sampling diagram of GroupMultiSourceSampler
is as below:
sup=1000
indicates that the scale of the labeled dataset is 1000, sup_h=200
indicates that the scale of the images with an aspect ratio greater than or equal to 1 in the labeled dataset is 200, and sup_w=800
indicates that the scale of the images with an aspect ratio less than 1 in the labeled dataset is 800,
unsup=9000
indicates that the scale of the unlabeled dataset is 9000, unsup_h=1800
indicates that the scale of the images with an aspect ratio greater than or equal to 1 in the unlabeled dataset is 1800, and unsup_w=7200
indicates the scale of the images with an aspect ratio less than 1 in the unlabeled dataset is 7200.
GroupMultiSourceSampler
randomly selects a group according to the overall aspect ratio distribution of the images in the labeled dataset and the unlabeled dataset, and then sample data to form batches from the two datasets according to source_ratio
, so labeled datasets and unlabeled datasets have different repetitions.
We choose Faster R-CNN
as detector
for semi-supervised training. Take the semi-supervised object detection algorithm SoftTeacher
as an example,
the model configuration can be inherited from _base_/models/faster-rcnn_r50_fpn.py
, replacing the backbone network of the detector with caffe
style.
Note that unlike the supervised training configs, Faster R-CNN
as detector
is an attribute of model
, not model
.
In addition, data_preprocessor
needs to be set to MultiBranchDataPreprocessor
, which is used to pad and normalize images from different pipelines.
Finally, parameters required for semi-supervised training and testing can be configured via semi_train_cfg
and semi_test_cfg
.
_base_ = [
'../_base_/models/faster-rcnn_r50_fpn.py', '../_base_/default_runtime.py',
'../_base_/datasets/semi_coco_detection.py'
]
detector = _base_.model
detector.data_preprocessor = dict(
type='DetDataPreprocessor',
mean=[103.530, 116.280, 123.675],
std=[1.0, 1.0, 1.0],
bgr_to_rgb=False,
pad_size_divisor=32)
detector.backbone = dict(
type='ResNet',
depth=50,
num_stages=4,
out_indices=(0, 1, 2, 3),
frozen_stages=1,
norm_cfg=dict(type='BN', requires_grad=False),
norm_eval=True,
style='caffe',
init_cfg=dict(
type='Pretrained',
checkpoint='open-mmlab://detectron2/resnet50_caffe'))
model = dict(
_delete_=True,
type='SoftTeacher',
detector=detector,
data_preprocessor=dict(
type='MultiBranchDataPreprocessor',
data_preprocessor=detector.data_preprocessor),
semi_train_cfg=dict(
freeze_teacher=True,
sup_weight=1.0,
unsup_weight=4.0,
pseudo_label_initial_score_thr=0.5,
rpn_pseudo_thr=0.9,
cls_pseudo_thr=0.9,
reg_pseudo_thr=0.02,
jitter_times=10,
jitter_scale=0.06,
min_pseudo_bbox_wh=(1e-2, 1e-2)),
semi_test_cfg=dict(predict_on='teacher'))
In addition, we also support semi-supervised training for other detection models, such as RetinaNet
and Cascade R-CNN
. Since SoftTeacher
only supports Faster R-CNN
, it needs to be replaced with SemiBaseDetector
, example is as below:
_base_ = [
'../_base_/models/retinanet_r50_fpn.py', '../_base_/default_runtime.py',
'../_base_/datasets/semi_coco_detection.py'
]
detector = _base_.model
model = dict(
_delete_=True,
type='SemiBaseDetector',
detector=detector,
data_preprocessor=dict(
type='MultiBranchDataPreprocessor',
data_preprocessor=detector.data_preprocessor),
semi_train_cfg=dict(
freeze_teacher=True,
sup_weight=1.0,
unsup_weight=1.0,
cls_pseudo_thr=0.9,
min_pseudo_bbox_wh=(1e-2, 1e-2)),
semi_test_cfg=dict(predict_on='teacher'))
Following the semi-supervised training configuration of SoftTeacher
, change batch_size
to 2 and source_ratio
to [1, 1]
, the experimental results of supervised and semi-supervised training of RetinaNet
, Faster R-CNN
, Cascade R-CNN
and SoftTeacher
on the 10% coco train2017
are as below:
Model | Detector | BackBone | Style | sup-0.1-coco mAP | semi-0.1-coco mAP |
---|---|---|---|---|---|
SemiBaseDetector | RetinaNet | R-50-FPN | caffe | 23.5 | 27.7 |
SemiBaseDetector | Faster R-CNN | R-50-FPN | caffe | 26.7 | 28.4 |
SemiBaseDetector | Cascade R-CNN | R-50-FPN | caffe | 28.0 | 29.7 |
SoftTeacher | Faster R-CNN | R-50-FPN | caffe | 26.7 | 31.1 |
Usually, the teacher model is updated by Exponential Moving Average (EMA) the student model, and then the teacher model is optimized with the optimization of the student model, which can be achieved by configuring custom_hooks
:
custom_hooks = [dict(type='MeanTeacherHook')]
Since there are two models in the teacher-student joint training framework, we can replace ValLoop
with TeacherStudentValLoop
to test the accuracy of both models during the training process.
val_cfg = dict(type='TeacherStudentValLoop')