UViT: Efficient and lightweight U-shaped hybrid vision transformer for human pose estimation

Li, Biao; Tang, Shoufeng; Li, Wenyi

doi:10.3233/JIFS-231440

Article type: Research Article

Authors: Li, Biao^{a; b} | Tang, Shoufeng^{a; *} | Li, Wenyi^{a; b}

Affiliations: [a] School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, China | [b] School of Mechanical and Electronic Engineering, Suzhou University, Suzhou, China

Correspondence: [*] Corresponding author. Shoufeng Tang, School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, China. Email: sieedksft@cumt.edu.cn.

Abstract: Pose estimation plays a crucial role in human-centered vision applications and has advanced significantly in recent years. However, prevailing approaches rely on highly complex architectures to achieve high scores on benchmark datasets, which hampers deployment on edge devices. This study investigates efficient and lightweight human pose estimation. The context enhancement module of the U-shaped structure is improved to strengthen multi-scale local modeling. Building on the transformer structure, a lightweight transformer block is designed to enhance both local feature extraction and global modeling. Together, these components form a lightweight pose estimation network, the U-shaped Hybrid Vision Transformer (UViT). The smallest network, UViT-T, improves the AP score on the COCO validation set by 3.9% over the best-performing V2 version of the MobileNet series, while using fewer parameters and less computation. Specifically, with an input size of 384×288, UViT-T achieves an AP score of 70.2 on the COCO test-dev set with only 1.52 M parameters and 2.32 GFLOPs. Its inference speed is approximately twice that of general-purpose networks. This study offers an efficient and lightweight design approach for human pose estimation and provides theoretical support for deploying such models on edge devices.

Keywords: Pose estimation, multi-branch structure, lightweight network, context enhancement, attention mechanism

Journal: Journal of Intelligent & Fuzzy Systems, vol. 46, no. 4, pp. 8345-8359, 2024

Published: 18 April 2024
