# HRNet

*Deep High-Resolution Representation Learning for Human Pose Estimation*

## Abstract
This is an official PyTorch implementation of Deep High-Resolution Representation Learning for Human Pose Estimation. In this work, we are interested in the human pose estimation problem with a focus on learning reliable high-resolution representations. Most existing methods recover high-resolution representations from low-resolution representations produced by a high-to-low resolution network. Instead, our proposed network maintains high-resolution representations through the whole process. We start from a high-resolution subnetwork as the first stage, gradually add high-to-low resolution subnetworks one by one to form more stages, and connect the multi-resolution subnetworks in parallel. We conduct repeated multi-scale fusions such that each of the high-to-low resolution representations repeatedly receives information from the other parallel representations, leading to rich high-resolution representations. As a result, the predicted keypoint heatmap is potentially more accurate and spatially more precise. We empirically demonstrate the effectiveness of our network through superior pose estimation results on two benchmark datasets: the COCO keypoint detection dataset and the MPII Human Pose dataset.
High-resolution representation learning plays an essential role in many vision problems, e.g., pose estimation and semantic segmentation. The high-resolution network (HRNet), recently developed for human pose estimation, maintains high-resolution representations through the whole process by connecting high-to-low resolution convolutions in parallel and produces strong high-resolution representations by repeatedly conducting fusions across parallel convolutions.
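As a rough illustration of the repeated multi-scale fusion described above, the toy sketch below resizes every parallel branch to every other branch's resolution and sums the results, so each branch exchanges information with all the others. This is a simplified, hypothetical version, not the authors' implementation: the real network adapts channels and resolutions with strided 3x3 convolutions (downsampling) and 1x1 convolutions plus upsampling, and `fuse_branches` is an illustrative name.

```python
import torch
import torch.nn.functional as F


def fuse_branches(feats):
    """Toy multi-resolution fusion.

    feats: list of tensors [N, C, H_i, W_i], highest resolution first.
    Returns one fused tensor per branch: the sum of all branches
    bilinearly resized (up or down) to that branch's spatial size.
    Channel adaptation is omitted, so all branches are assumed to
    share the same channel count in this sketch.
    """
    fused = []
    for i, target in enumerate(feats):
        h, w = target.shape[-2:]
        acc = target.clone()
        for j, other in enumerate(feats):
            if j == i:
                continue
            acc = acc + F.interpolate(
                other, size=(h, w), mode="bilinear", align_corners=False
            )
        fused.append(acc)
    return fused


# Three parallel branches at 1x, 1/2x, and 1/4x resolution.
branches = [
    torch.randn(1, 32, 64, 64),
    torch.randn(1, 32, 32, 32),
    torch.randn(1, 32, 16, 16),
]
out = fuse_branches(branches)  # spatial sizes are preserved per branch
```

Each output keeps its branch's resolution, which is how the high-resolution stream stays high-resolution while still absorbing low-resolution context.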
In this paper, we conduct a further study on high-resolution representations by introducing a simple yet effective modification and apply it to a wide range of vision tasks. We augment the high-resolution representation by aggregating the (upsampled) representations from all the parallel convolutions rather than only the representation from the high-resolution convolution as done in HRNet. This simple modification leads to stronger representations, evidenced by superior results. We show top results in semantic segmentation on Cityscapes, LIP, and PASCAL Context, and facial landmark detection on AFLW, COFW, 300W, and WFLW. In addition, we build a multi-level representation from the high-resolution representation and apply it to the Faster R-CNN object detection framework and the extended frameworks. The proposed approach achieves superior results to existing single-model networks on COCO object detection.
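The HRNetV2 modification described above (aggregating all parallel branches instead of keeping only the highest-resolution one) can be sketched as follows. This is a minimal, hypothetical version assuming bilinear upsampling and channel concatenation; `hrnetv2_aggregate` is an illustrative name, not the repository's API.

```python
import torch
import torch.nn.functional as F


def hrnetv2_aggregate(feats):
    """Upsample every branch to the highest resolution and concatenate.

    feats: list of tensors [N, C_i, H_i, W_i], highest resolution first.
    Returns a single tensor [N, sum(C_i), H_0, W_0].
    """
    h, w = feats[0].shape[-2:]
    upsampled = [feats[0]] + [
        F.interpolate(f, size=(h, w), mode="bilinear", align_corners=False)
        for f in feats[1:]
    ]
    return torch.cat(upsampled, dim=1)


# Example: three branches with widths 32, 64, 128 at 1x, 1/2x, 1/4x.
feats = [
    torch.randn(2, 32, 64, 64),
    torch.randn(2, 64, 32, 32),
    torch.randn(2, 128, 16, 16),
]
agg = hrnetv2_aggregate(feats)  # channels: 32 + 64 + 128 = 224
```

The concatenated map carries information from every resolution at full spatial size, which is what the dense-prediction tasks (segmentation, landmarks) and the multi-level FPN-style heads build on.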
## Results and Models

### Faster R-CNN

### Mask R-CNN
| Backbone | Style | Lr schd | Mem (GB) | Inf time (fps) | box AP | mask AP | Config | Download |
| :------: | :---: | :-----: | :------: | :------------: | :----: | :-----: | :----: | :------: |
| HRNetV2p-W18 | pytorch | 1x | 7.0 | 11.7 | 37.7 | 34.2 | config | model \| log |
| HRNetV2p-W18 | pytorch | 2x | 7.0 | - | 39.8 | 36.0 | config | model \| log |
| HRNetV2p-W32 | pytorch | 1x | 9.4 | 11.3 | 41.2 | 37.1 | config | model \| log |
| HRNetV2p-W32 | pytorch | 2x | 9.4 | - | 42.5 | 37.8 | config | model \| log |
| HRNetV2p-W40 | pytorch | 1x | 10.9 | | 42.1 | 37.5 | config | model \| log |
| HRNetV2p-W40 | pytorch | 2x | 10.9 | | 42.8 | 38.2 | config | model \| log |
### Cascade R-CNN

| Backbone | Style | Lr schd | Mem (GB) | Inf time (fps) | box AP | Config | Download |
| :------: | :---: | :-----: | :------: | :------------: | :----: | :----: | :------: |
| HRNetV2p-W18 | pytorch | 20e | 7.0 | 11.0 | 41.2 | config | model \| log |
| HRNetV2p-W32 | pytorch | 20e | 9.4 | 11.0 | 43.3 | config | model \| log |
| HRNetV2p-W40 | pytorch | 20e | 10.8 | | 43.8 | config | model \| log |
### Cascade Mask R-CNN

| Backbone | Style | Lr schd | Mem (GB) | Inf time (fps) | box AP | mask AP | Config | Download |
| :------: | :---: | :-----: | :------: | :------------: | :----: | :-----: | :----: | :------: |
| HRNetV2p-W18 | pytorch | 20e | 8.5 | 8.5 | 41.6 | 36.4 | config | model \| log |
| HRNetV2p-W32 | pytorch | 20e | | 8.3 | 44.3 | 38.6 | config | model \| log |
| HRNetV2p-W40 | pytorch | 20e | 12.5 | | 45.1 | 39.3 | config | model \| log |
### Hybrid Task Cascade (HTC)

| Backbone | Style | Lr schd | Mem (GB) | Inf time (fps) | box AP | mask AP | Config | Download |
| :------: | :---: | :-----: | :------: | :------------: | :----: | :-----: | :----: | :------: |
| HRNetV2p-W18 | pytorch | 20e | 10.8 | 4.7 | 42.8 | 37.9 | config | model \| log |
| HRNetV2p-W32 | pytorch | 20e | 13.1 | 4.9 | 45.4 | 39.9 | config | model \| log |
| HRNetV2p-W40 | pytorch | 20e | 14.6 | | 46.4 | 40.8 | config | model \| log |
### FCOS

| Backbone | Style | GN | MS train | Lr schd | Mem (GB) | Inf time (fps) | box AP | Config | Download |
| :------: | :---: | :-: | :------: | :-----: | :------: | :------------: | :----: | :----: | :------: |
| HRNetV2p-W18 | pytorch | Y | N | 1x | 13.0 | 12.9 | 35.3 | config | model \| log |
| HRNetV2p-W18 | pytorch | Y | N | 2x | 13.0 | - | 38.2 | config | model \| log |
| HRNetV2p-W32 | pytorch | Y | N | 1x | 17.5 | 12.9 | 39.5 | config | model \| log |
| HRNetV2p-W32 | pytorch | Y | N | 2x | 17.5 | - | 40.8 | config | model \| log |
| HRNetV2p-W18 | pytorch | Y | Y | 2x | 13.0 | 12.9 | 38.3 | config | model \| log |
| HRNetV2p-W32 | pytorch | Y | Y | 2x | 17.5 | 12.4 | 41.9 | config | model \| log |
| HRNetV2p-W48 | pytorch | Y | Y | 2x | 20.3 | 10.8 | 42.7 | config | model \| log |
**Note:**

- The `28e` schedule in HTC indicates decreasing the lr at 24 and 27 epochs, with a total of 28 epochs.
- HRNetV2 ImageNet pretrained models are available in HRNets for Image Classification.
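For readers unfamiliar with the schedule notation, the `28e` note above translates into a schedule fragment like the following, written in MMDetection-2.x-style config syntax. This is illustrative only; consult the linked config files for the authoritative settings.

```python
# Step decay of the learning rate at epochs 24 and 27, 28 epochs total.
lr_config = dict(policy='step', step=[24, 27])
runner = dict(type='EpochBasedRunner', max_epochs=28)
```

The `1x`, `2x`, and `20e` schedules in the tables follow the same pattern with 12, 24, and 20 total epochs respectively.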
## Citation

@inproceedings{SunXLW19,
  title={Deep High-Resolution Representation Learning for Human Pose Estimation},
  author={Ke Sun and Bin Xiao and Dong Liu and Jingdong Wang},
  booktitle={CVPR},
  year={2019}
}

@article{SunZJCXLMWLW19,
  title={High-Resolution Representations for Labeling Pixels and Regions},
  author={Ke Sun and Yang Zhao and Borui Jiang and Tianheng Cheng and Bin Xiao
          and Dong Liu and Yadong Mu and Xinggang Wang and Wenyu Liu and Jingdong Wang},
  journal={CoRR},
  volume={abs/1904.04514},
  year={2019}
}