Explicit Box Detection Unifies End-to-End Multi-Person Pose Estimation

Jie Yang, Ailing Zeng, Shilong Liu, Feng Li, Ruimao Zhang, Lei Zhang
Feb 2023
This paper presents a novel end-to-end framework with Explicit box Detection for multi-person Pose estimation, called ED-Pose, which unifies the contextual learning between human-level (global) and keypoint-level (local) information. Different from previous one-stage methods, ED-Pose re-considers this task as two explicit box detection processes with a unified representation and regression supervision. First, we introduce a human detection decoder from encoded tokens to extract global features. It provides a good initialization for the subsequent keypoint detection, making the training process converge fast. Second, to bring in contextual information near keypoints, we regard pose estimation as a keypoint box detection problem to learn both box positions and contents for each keypoint. A human-to-keypoint detection decoder adopts an interactive learning strategy between human and keypoint features to further enhance global and local feature aggregation. In general, ED-Pose is conceptually simple, requiring neither post-processing nor dense heatmap supervision. It demonstrates its effectiveness and efficiency compared with both two-stage and one-stage methods. Notably, explicit box detection boosts the pose estimation performance by 4.5 AP on COCO and 9.9 AP on CrowdPose. For the first time, as a fully end-to-end framework with an L1 regression loss, ED-Pose surpasses heatmap-based top-down methods under the same backbone by 1.2 AP on COCO and achieves the state-of-the-art with 76.6 AP on CrowdPose without bells and whistles. Code is available at
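To make the idea of treating keypoints as boxes under L1 regression supervision more concrete, the following is a minimal illustrative sketch, not the authors' implementation. The box format `(cx, cy, w, h)`, the fixed square box size, and the helper names `keypoint_to_box` and `l1_box_loss` are all assumptions introduced here for illustration; the abstract does not specify them.

```python
from typing import List, Tuple

# Assumed box format: (cx, cy, w, h), with coordinates normalized to [0, 1].
Box = Tuple[float, float, float, float]


def keypoint_to_box(x: float, y: float, size: float) -> Box:
    """Wrap a keypoint in a square box centered on it.

    Hypothetical helper: a fixed square size is an assumption made for
    this sketch, not something stated in the abstract.
    """
    return (x, y, size, size)


def l1_box_loss(pred: List[Box], target: List[Box]) -> float:
    """Mean absolute error over all coordinates of paired boxes."""
    if len(pred) != len(target):
        raise ValueError("pred and target must contain the same number of boxes")
    if not pred:
        return 0.0
    total = sum(
        abs(p - t)
        for pred_box, target_box in zip(pred, target)
        for p, t in zip(pred_box, target_box)
    )
    return total / (4 * len(pred))


if __name__ == "__main__":
    # One ground-truth keypoint and one slightly offset prediction.
    target = [keypoint_to_box(0.50, 0.25, 0.10)]
    pred = [(0.52, 0.24, 0.10, 0.10)]
    print(l1_box_loss(pred, target))
```

Because both human boxes and keypoint boxes share the same four-number representation in this sketch, the same regression loss can be applied to either, which is one way to read the "unified representation and regression supervision" described above.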