
Training and Testing

Launch training

Train with your PC

You can use tools/train.py to train a model on a single machine with a CPU and optionally a GPU.

Here is the full usage of the script:

python tools/train.py ${CONFIG_FILE} [ARGS]
By default, MMPose prefers GPU to CPU. If you want to train a model on CPU, please empty `CUDA_VISIBLE_DEVICES` or set it to -1 to make GPU invisible to the program.

CUDA_VISIBLE_DEVICES=-1 python tools/train.py ${CONFIG_FILE} [ARGS]
| ARGS | Description |
| --- | --- |
| `CONFIG_FILE` | The path to the config file. |
| `--work-dir WORK_DIR` | The target folder to save logs and checkpoints. Defaults to a folder with the same name as the config file under `./work_dirs`. |
| `--resume [RESUME]` | Resume training. If a path is specified, resume from that checkpoint; otherwise, try to auto-resume from the latest checkpoint in the work directory. |
| `--amp` | Enable automatic mixed precision training. |
| `--no-validate` | Not suggested. Disable checkpoint evaluation during training. |
| `--auto-scale-lr` | Automatically rescale the learning rate according to the actual batch size and the original batch size. |
| `--cfg-options CFG_OPTIONS` | Override some settings in the used config; key-value pairs in `xxx=yyy` format will be merged into the config file. If the value to be overwritten is a list, it should be in the form of either `key="[a,b]"` or `key=a,b`. The argument also allows nested list/tuple values, e.g. `key="[(a,b),(c,d)]"`. Note that the quotation marks are necessary and that no white space is allowed. |
| `--show-dir SHOW_DIR` | The directory to save the result visualization images generated during validation. |
| `--show` | Visualize the prediction results in a window. |
| `--interval INTERVAL` | The interval of samples to visualize. |
| `--wait-time WAIT_TIME` | The display time of every window (in seconds). Defaults to 1. |
| `--launcher {none,pytorch,slurm,mpi}` | Options for the job launcher. |
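
For illustration, a single-machine run combining several of these options might look like the following sketch. The config path is the example used elsewhere in this document, and the work directory is a placeholder; the `--cfg-options` override assumes the config defines `train_dataloader.batch_size`:

```shell
# Train with a custom work directory, mixed precision, and an overridden batch size.
python tools/train.py \
    configs/body_2d_keypoint/topdown_heatmap/coco/td-hm_res50_8xb64-210e_coco-256x192.py \
    --work-dir work_dirs/my_experiment \
    --amp \
    --cfg-options train_dataloader.batch_size=32
```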

Train with multiple GPUs

We provide a shell script to start a multi-GPU task with torch.distributed.launch.

bash ./tools/dist_train.sh ${CONFIG_FILE} ${GPU_NUM} [PY_ARGS]
| ARGS | Description |
| --- | --- |
| `CONFIG_FILE` | The path to the config file. |
| `GPU_NUM` | The number of GPUs to be used. |
| `[PY_ARGS]` | The other optional arguments of `tools/train.py`; see the argument list above. |

You can also specify extra arguments of the launcher through environment variables. For example, change the communication port of the launcher to 29666 with the command below:

PORT=29666 bash ./tools/dist_train.sh ${CONFIG_FILE} ${GPU_NUM} [PY_ARGS]

If you want to start multiple training jobs on different GPUs, you can launch them by specifying different communication ports and visible devices.

CUDA_VISIBLE_DEVICES=0,1,2,3 PORT=29500 bash ./tools/dist_train.sh ${CONFIG_FILE1} 4 [PY_ARGS]
CUDA_VISIBLE_DEVICES=4,5,6,7 PORT=29501 bash ./tools/dist_train.sh ${CONFIG_FILE2} 4 [PY_ARGS]

Train with multiple machines

Multiple machines in the same network

If you launch a training job on multiple machines connected via Ethernet, you can run the following commands:

On the first machine:

NNODES=2 NODE_RANK=0 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR bash tools/dist_train.sh $CONFIG $GPUS

On the second machine:

NNODES=2 NODE_RANK=1 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR bash tools/dist_train.sh $CONFIG $GPUS

Compared with multi-GPU training on a single machine, you need to specify some extra environment variables:

| ENV_VARS | Description |
| --- | --- |
| `NNODES` | The total number of machines. |
| `NODE_RANK` | The index of the local machine. |
| `PORT` | The communication port; it should be the same on all machines. |
| `MASTER_ADDR` | The IP address of the master machine; it should be the same on all machines. |

Usually, it is slow if you do not have high-speed networking like InfiniBand.
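
For concreteness, a hypothetical two-machine job with 8 GPUs per node could be launched as follows; the IP address 10.1.1.1 and port 29500 are placeholders for your own master node, and the config path is the example used elsewhere in this document:

```shell
# On the master machine (rank 0), whose IP we assume is 10.1.1.1:
NNODES=2 NODE_RANK=0 PORT=29500 MASTER_ADDR=10.1.1.1 \
    bash tools/dist_train.sh configs/body_2d_keypoint/topdown_heatmap/coco/td-hm_res50_8xb64-210e_coco-256x192.py 8

# On the second machine (rank 1), pointing at the same master address and port:
NNODES=2 NODE_RANK=1 PORT=29500 MASTER_ADDR=10.1.1.1 \
    bash tools/dist_train.sh configs/body_2d_keypoint/topdown_heatmap/coco/td-hm_res50_8xb64-210e_coco-256x192.py 8
```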

Multiple machines managed with slurm

If you run MMPose on a cluster managed with slurm, you can use the script slurm_train.sh.

[ENV_VARS] ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} ${WORK_DIR} [PY_ARGS]

Here are the argument descriptions of the script.

| ARGS | Description |
| --- | --- |
| `PARTITION` | The partition to use in your cluster. |
| `JOB_NAME` | The name of your job; you can name it as you like. |
| `CONFIG_FILE` | The path to the config file. |
| `WORK_DIR` | The target folder to save logs and checkpoints. |
| `[PY_ARGS]` | The other optional arguments of `tools/train.py`; see the argument list above. |

Here are the environment variables that can be used to configure the slurm job.

| ENV_VARS | Description |
| --- | --- |
| `GPUS` | The total number of GPUs to be used. Defaults to 8. |
| `GPUS_PER_NODE` | The number of GPUs to be allocated per node. Defaults to 8. |
| `CPUS_PER_TASK` | The number of CPUs to be allocated per task (usually one GPU corresponds to one task). Defaults to 5. |
| `SRUN_ARGS` | The other arguments of `srun`. Available options can be found in the srun documentation. |
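
Putting these together, a hypothetical 16-GPU job spread over two nodes could be submitted like this; the partition name, job name, and work directory are placeholders, and the config path is the example used elsewhere in this document:

```shell
# 16 GPUs in total, 8 per node, 5 CPUs per task, on a partition named "gpu_partition".
GPUS=16 GPUS_PER_NODE=8 CPUS_PER_TASK=5 \
    ./tools/slurm_train.sh gpu_partition my_pose_job \
    configs/body_2d_keypoint/topdown_heatmap/coco/td-hm_res50_8xb64-210e_coco-256x192.py \
    work_dirs/my_pose_job
```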

Resume training

Resuming training means continuing training from the state saved in a previous run, where the state includes the model weights, the optimizer state, and the state of the parameter scheduler (e.g. the learning rate schedule).

Automatically resume training

Users can add --resume to the end of the training command to resume training. If there is a latest checkpoint in work_dirs (e.g. the previous run was interrupted), the program automatically loads it and resumes training from that state. Otherwise (e.g. the previous run did not save a checkpoint in time, or a new training task was started), training restarts from scratch.

Here is an example of resuming training:

python tools/train.py configs/body_2d_keypoint/topdown_heatmap/coco/td-hm_res50_8xb64-210e_coco-256x192.py --resume

Specify the checkpoint to resume training

You can also specify the checkpoint path for --resume. MMPose will automatically read the checkpoint and resume training from it. The command is as follows:

python tools/train.py configs/body_2d_keypoint/topdown_heatmap/coco/td-hm_res50_8xb64-210e_coco-256x192.py \
    --resume work_dirs/td-hm_res50_8xb64-210e_coco-256x192/latest.pth

If you want to specify the checkpoint path in the config file instead, you need to set load_from to the checkpoint path in addition to setting resume=True.

It should be noted that if only load_from is set without setting resume=True, only the weights in the checkpoint will be loaded and the training will be restarted from scratch, instead of continuing from the previous state.

The following example is equivalent to the example above that specifies the --resume parameter:

resume = True
load_from = 'work_dirs/td-hm_res50_8xb64-210e_coco-256x192/latest.pth'
# model settings
model = dict(
    ## omitted ##
    )
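
For contrast, a config that sets only load_from (without resume=True) corresponds to the behavior described above: the checkpoint weights are loaded, but the optimizer state and learning rate schedule start fresh. A minimal sketch:

```python
# Load weights from a checkpoint but start a new training run from scratch:
# the optimizer state and parameter scheduler are re-initialized.
resume = False  # or simply omit `resume`
load_from = 'work_dirs/td-hm_res50_8xb64-210e_coco-256x192/latest.pth'
```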

Freeze partial parameters during training

In some scenarios, it might be desirable to freeze certain parameters of a model during training to fine-tune specific parts or to prevent overfitting. In MMPose, you can set different hyperparameters for any module in the model by setting custom_keys in paramwise_cfg. This allows you to control the learning rate and decay coefficient for specific parts of the model.

For example, if you want to freeze the parameters in backbone.layer0 and backbone.layer1, you can modify the optimizer wrapper in the config file as:

optim_wrapper = dict(
    optimizer=dict(...),
    paramwise_cfg=dict(
        custom_keys={
            'backbone.layer0': dict(lr_mult=0, decay_mult=0),
            'backbone.layer1': dict(lr_mult=0, decay_mult=0),
        }))

This configuration will freeze the parameters in backbone.layer0 and backbone.layer1 by setting their learning rate and decay coefficient to 0. By using this approach, you can effectively control the training process and fine-tune specific parts of your model as needed.
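
The same mechanism also supports partial fine-tuning rather than full freezing. The sketch below assumes a model with a module named `backbone`, as in the top-down configs used elsewhere in this document, and trains it with a reduced learning rate while the rest of the model uses the base learning rate; the optimizer settings shown are illustrative:

```python
optim_wrapper = dict(
    optimizer=dict(type='Adam', lr=5e-4),  # illustrative base optimizer settings
    paramwise_cfg=dict(
        custom_keys={
            # Parameters under `backbone` use 10% of the base learning rate
            # and the normal weight decay.
            'backbone': dict(lr_mult=0.1, decay_mult=1.0),
        }))
```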

Automatic Mixed Precision (AMP) training

Mixed precision training can reduce training time and storage requirements without changing the model or reducing the model training accuracy, thus supporting larger batch sizes, larger models, and larger input sizes.

To enable Automatic Mixed Precision (AMP) training, add --amp to the end of the training command, as follows:

python tools/train.py ${CONFIG_FILE} --amp

Specific examples are as follows:

python tools/train.py configs/body_2d_keypoint/topdown_heatmap/coco/td-hm_res50_8xb64-210e_coco-256x192.py  --amp

Set the random seed

If you want to specify the random seed during training, you can use the following command:

python ./tools/train.py \
    ${CONFIG} \                               # config file
    --cfg-options randomness.seed=2023 \      # set the random seed = 2023
    [randomness.diff_rank_seed=True] \        # Set different seeds according to rank.
    [randomness.deterministic=True]           # Set the cuDNN backend deterministic option to True
# `[]` stands for optional parameters, when actually entering the command line, you do not need to enter `[]`

randomness has three parameters that can be set, with the following meanings.

  • randomness.seed=2023, set the random seed to 2023.

  • randomness.diff_rank_seed=True, set different seeds according to global rank. Defaults to False.

  • randomness.deterministic=True, set the deterministic option for the cuDNN backend, i.e., set torch.backends.cudnn.deterministic to True and torch.backends.cudnn.benchmark to False. Defaults to False. See PyTorch Randomness for more details.
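
If you prefer to keep these settings in the config file instead of passing them through --cfg-options, a minimal sketch would be:

```python
# In the config file: fix the random seed and make cuDNN deterministic.
randomness = dict(seed=2023, diff_rank_seed=False, deterministic=True)
```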

Visualize training process

Monitoring the training process is essential for understanding the performance of your model and making necessary adjustments. In this section, we will introduce two methods to visualize the training process of your MMPose model: TensorBoard and the MMEngine Visualizer.

TensorBoard

TensorBoard is a powerful tool that allows you to visualize the changes in losses during training. To enable TensorBoard visualization, you may need to:

  1. Install the TensorBoard environment

     pip install tensorboard

  2. Enable TensorBoard in the config file

     visualizer = dict(vis_backends=[
         dict(type='LocalVisBackend'),
         dict(type='TensorboardVisBackend'),
     ])

The event file for TensorBoard will be saved under the experiment log folder ${WORK_DIR}, which defaults to work_dirs/${CONFIG} or can be specified with the --work-dir option. To visualize the training process, use the following command:

tensorboard --logdir ${WORK_DIR}/${TIMESTAMP}/vis_data

MMEngine visualizer

MMPose also supports visualizing model inference results during validation. To activate this function, please use the --show option or set --show-dir when launching training. This feature provides an effective way to analyze the model's performance on specific examples and make any necessary adjustments.
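
For example, the following sketch (using the same example config as above) saves a rendered prediction for every 50th validation sample instead of displaying it in a window; the output folder name is a placeholder:

```shell
# Save validation visualizations to a folder rather than showing them interactively.
python tools/train.py \
    configs/body_2d_keypoint/topdown_heatmap/coco/td-hm_res50_8xb64-210e_coco-256x192.py \
    --show-dir vis_results \
    --interval 50
```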

Test your model

Test with your PC

You can use tools/test.py to test a model on a single machine with a CPU and optionally a GPU.

Here is the full usage of the script:

python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [ARGS]
By default, MMPose prefers GPU to CPU. If you want to test a model on CPU, please empty `CUDA_VISIBLE_DEVICES` or set it to -1 to make GPU invisible to the program.

CUDA_VISIBLE_DEVICES=-1 python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [ARGS]
| ARGS | Description |
| --- | --- |
| `CONFIG_FILE` | The path to the config file. |
| `CHECKPOINT_FILE` | The path to the checkpoint file (it can be an HTTP link; checkpoints are available in the MMPose model zoo). |
| `--work-dir WORK_DIR` | The directory to save the file containing evaluation metrics. |
| `--out OUT` | The path to save the file containing evaluation metrics. |
| `--dump DUMP` | The path to dump all outputs of the model for offline evaluation. |
| `--cfg-options CFG_OPTIONS` | Override some settings in the used config; key-value pairs in `xxx=yyy` format will be merged into the config file. If the value to be overwritten is a list, it should be in the form of either `key="[a,b]"` or `key=a,b`. The argument also allows nested list/tuple values, e.g. `key="[(a,b),(c,d)]"`. Note that the quotation marks are necessary and that no white space is allowed. |
| `--show-dir SHOW_DIR` | The directory to save the result visualization images. |
| `--show` | Visualize the prediction results in a window. |
| `--interval INTERVAL` | The interval of samples to visualize. |
| `--wait-time WAIT_TIME` | The display time of every window (in seconds). Defaults to 1. |
| `--launcher {none,pytorch,slurm,mpi}` | Options for the job launcher. |
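
As an example, a single-machine evaluation that also dumps the raw model outputs for offline analysis might look like the sketch below; the checkpoint filename and output paths are placeholders:

```shell
# Evaluate a trained checkpoint, save metrics to a file, and dump predictions.
python tools/test.py \
    configs/body_2d_keypoint/topdown_heatmap/coco/td-hm_res50_8xb64-210e_coco-256x192.py \
    work_dirs/td-hm_res50_8xb64-210e_coco-256x192/epoch_210.pth \
    --out results/metrics.json \
    --dump results/predictions.pkl
```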

Test with multiple GPUs

We provide a shell script to start a multi-GPU task with torch.distributed.launch.

bash ./tools/dist_test.sh ${CONFIG_FILE} ${CHECKPOINT_FILE} ${GPU_NUM} [PY_ARGS]
| ARGS | Description |
| --- | --- |
| `CONFIG_FILE` | The path to the config file. |
| `CHECKPOINT_FILE` | The path to the checkpoint file (it can be an HTTP link; checkpoints are available in the MMPose model zoo). |
| `GPU_NUM` | The number of GPUs to be used. |
| `[PY_ARGS]` | The other optional arguments of `tools/test.py`; see the argument list above. |

You can also specify extra arguments of the launcher through environment variables. For example, change the communication port of the launcher to 29666 with the command below:

PORT=29666 bash ./tools/dist_test.sh ${CONFIG_FILE} ${CHECKPOINT_FILE} ${GPU_NUM} [PY_ARGS]

If you want to start multiple test jobs on different GPUs, you can launch them by specifying different communication ports and visible devices.

CUDA_VISIBLE_DEVICES=0,1,2,3 PORT=29500 bash ./tools/dist_test.sh ${CONFIG_FILE1} ${CHECKPOINT_FILE} 4 [PY_ARGS]
CUDA_VISIBLE_DEVICES=4,5,6,7 PORT=29501 bash ./tools/dist_test.sh ${CONFIG_FILE2} ${CHECKPOINT_FILE} 4 [PY_ARGS]

Test with multiple machines

Multiple machines in the same network

If you launch a test job on multiple machines connected via Ethernet, you can run the following commands:

On the first machine:

NNODES=2 NODE_RANK=0 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR bash tools/dist_test.sh $CONFIG $CHECKPOINT_FILE $GPUS

On the second machine:

NNODES=2 NODE_RANK=1 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR bash tools/dist_test.sh $CONFIG $CHECKPOINT_FILE $GPUS

Compared with multi-GPU testing on a single machine, you need to specify some extra environment variables:

| ENV_VARS | Description |
| --- | --- |
| `NNODES` | The total number of machines. |
| `NODE_RANK` | The index of the local machine. |
| `PORT` | The communication port; it should be the same on all machines. |
| `MASTER_ADDR` | The IP address of the master machine; it should be the same on all machines. |

Usually, it is slow if you do not have high-speed networking like InfiniBand.

Multiple machines managed with slurm

If you run MMPose on a cluster managed with slurm, you can use the script slurm_test.sh.

[ENV_VARS] ./tools/slurm_test.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} ${CHECKPOINT_FILE} [PY_ARGS]

Here are the argument descriptions of the script.

| ARGS | Description |
| --- | --- |
| `PARTITION` | The partition to use in your cluster. |
| `JOB_NAME` | The name of your job; you can name it as you like. |
| `CONFIG_FILE` | The path to the config file. |
| `CHECKPOINT_FILE` | The path to the checkpoint file (it can be an HTTP link; checkpoints are available in the MMPose model zoo). |
| `[PY_ARGS]` | The other optional arguments of `tools/test.py`; see the argument list above. |

Here are the environment variables that can be used to configure the slurm job.

| ENV_VARS | Description |
| --- | --- |
| `GPUS` | The total number of GPUs to be used. Defaults to 8. |
| `GPUS_PER_NODE` | The number of GPUs to be allocated per node. Defaults to 8. |
| `CPUS_PER_TASK` | The number of CPUs to be allocated per task (usually one GPU corresponds to one task). Defaults to 5. |
| `SRUN_ARGS` | The other arguments of `srun`. Available options can be found in the srun documentation. |
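
For instance, a hypothetical single-node evaluation of the example checkpoint used earlier could be submitted like this; the partition name, job name, and checkpoint path are placeholders:

```shell
# Evaluate a checkpoint with 8 GPUs on one node of a partition named "gpu_partition".
GPUS=8 GPUS_PER_NODE=8 \
    ./tools/slurm_test.sh gpu_partition my_eval_job \
    configs/body_2d_keypoint/topdown_heatmap/coco/td-hm_res50_8xb64-210e_coco-256x192.py \
    work_dirs/td-hm_res50_8xb64-210e_coco-256x192/epoch_210.pth
```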