This paper proposes an autonomous hand pose estimation method for in-the-wild datasets that cascades a pretrained YOLO image segmentation module with a Keypoint Transformer. The pretrained YOLO model first segments the hand from the unconstrained scene. The segmented hand is then centered in the input image, which reduces background interference and gives the subsequent keypoint detection and pose estimation stages a cleaner input. Experimental results show that the method significantly improves both the accuracy and the efficiency of pose estimation, and that accurate hand segmentation plays a crucial role in the overall image processing pipeline.
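
The centering step described above can be illustrated with a minimal sketch. It assumes that the segmentation stage has already produced a binary hand mask with the same height and width as the image; the function name `center_hand`, the zero-valued background, and the integer-translation strategy are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def center_hand(image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Suppress the background and move the masked hand to the image centre.

    Hypothetical helper: `image` is H x W or H x W x C, `mask` is an H x W
    array whose non-zero pixels mark the hand (assumed to come from a
    segmentation model such as YOLO).
    """
    if image.shape[:2] != mask.shape:
        raise ValueError("mask must match the image height and width")

    mask = mask.astype(bool)
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        # No hand detected: return an empty canvas of the same shape.
        return np.zeros_like(image)

    # Tight bounding box around the hand pixels.
    top, bottom = ys.min(), ys.max() + 1
    left, right = xs.min(), xs.max() + 1

    # Crop the hand and zero out background pixels inside the box.
    crop = image[top:bottom, left:right].copy()
    crop[~mask[top:bottom, left:right]] = 0

    # Paste the crop so that its box is centred on a blank canvas.
    h, w = image.shape[:2]
    ch, cw = crop.shape[:2]
    y0 = (h - ch) // 2
    x0 = (w - cw) // 2
    out = np.zeros_like(image)
    out[y0:y0 + ch, x0:x0 + cw] = crop
    return out
```

The output keeps the original image size, so it could be passed to a keypoint network without changing that network's input shape. Whether the actual pipeline also rescales the hand or pads with a different background value is not stated in the abstract.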