Hand/HCI

Hand

From the moment we wake up and pick up our phones, to the ways we gesture in a virtual meeting or assemble parts in a smart factory, our hands are the primary interface between humans and both the physical and digital worlds. Yet capturing the full richness of hand motion and hand interaction with objects and other hands, and rendering those interactions in real time with photo-realistic detail, remains a towering challenge. Complex kinematics, frequent self-occlusions, varied lighting, and the need for millisecond-scale responsiveness all conspire to make high-fidelity hand perception and reconstruction a frontier problem in computer vision and graphics.

Our Hand Group focuses on multiple tightly connected research directions to enable comprehensive hand understanding and interaction modeling. (1) We study hand–object interaction, aiming to model how hands grasp, manipulate, and physically interact with objects under complex real-world conditions. (2) We develop methods for bimanual pose estimation and reconstruction, addressing the unique challenges posed by two-hand coordination, occlusion, and mutual influence. (3) We explore RGB, depth, and RGB-D multimodal fusion, designing systems that robustly integrate complementary sensing modalities to achieve precise and reliable pose tracking. (4) We advance photo-realistic hand–object rendering, leveraging neural implicit representations to synthesize realistic visual appearances of hands interacting with objects. By integrating these efforts, we strive to build a unified intelligent hand interaction system that combines accuracy, robustness, and realism to support next-generation AR/VR, teleoperation, and embodied AI applications.
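
As a concrete illustration of the multimodal fusion direction, the sketch below shows a late-fusion design in which RGB and depth streams are encoded separately and the fused feature regresses 3D joint positions. This is a minimal example under assumed shapes and module choices, not our actual architecture.

    # Toy late-fusion network for 3D hand pose from RGB + depth.
    # Everything below is an illustrative assumption, not the group's
    # published architecture.
    import torch
    import torch.nn as nn

    class RGBDHandPoseNet(nn.Module):
        def __init__(self, num_joints=21):
            super().__init__()
            self.num_joints = num_joints
            # One lightweight encoder per modality.
            self.rgb_enc = nn.Sequential(
                nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),
            )
            self.depth_enc = nn.Sequential(
                nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),
            )
            # Concatenated features regress (x, y, z) per joint.
            self.head = nn.Sequential(
                nn.Linear(128, 256), nn.ReLU(),
                nn.Linear(256, num_joints * 3),
            )

        def forward(self, rgb, depth):
            fused = torch.cat([self.rgb_enc(rgb).flatten(1),
                               self.depth_enc(depth).flatten(1)], dim=1)
            return self.head(fused).view(-1, self.num_joints, 3)

    # Example: one 128x128 RGB-D frame -> (1, 21, 3) joint coordinates.
    joints = RGBDHandPoseNet()(torch.randn(1, 3, 128, 128),
                               torch.randn(1, 1, 128, 128))

Late fusion keeps each encoder specialized to its modality; intermediate or attention-based fusion is a natural extension when the modalities need to interact earlier in the network.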

Human-Computer Interaction

Today’s human-computer interaction systems are still far from being truly “natural” conduits of experience. The process of information exchange between humans and virtual environments is far more complex than it appears—especially in immersive settings, where traditional input methods often feel clunky and disjointed, limiting freedom of expression and creativity. We believe that rather than continuously adding more devices or forcing users to adapt to systems, we should design systems that understand, align with, and serve people.

Guided by this vision, we explore bare-hand interaction as an entry point to transform input experiences in immersive environments. Take, for instance, the seemingly simple task of writing in virtual reality. Traditional approaches rely on handheld controllers or styluses, which not only reduce the naturalness of interaction but also fail to accommodate individual differences.

To address this, our group proposes several solutions:

  • Air-writing: leverages spatial tracking to dynamically adjust the user's position relative to the virtual whiteboard, enabling a stable and freeform mid-air writing experience.
  • Physical-writing: a virtual-to-physical mapping approach that allows users to write on real surfaces, offering higher precision and haptic feedback.
  • MARS system: a machine learning model trained on personalized motion trajectories that detects writing intent from continuous hand movements alone, as sketched after this list, removing the need for explicit gestures or button presses while preserving the natural flow of air-writing.
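
The sketch below illustrates the intent-detection idea behind the MARS system in its simplest form: a recurrent classifier labels every frame of a continuous fingertip trajectory as writing or not writing. The feature set (position plus finite-difference velocity) and the GRU model are illustrative assumptions rather than the system's actual design.

    # Per-frame writing-intent detection from a fingertip trajectory.
    # The model and features are illustrative assumptions in the spirit
    # of MARS, not its actual design.
    import torch
    import torch.nn as nn

    class IntentDetector(nn.Module):
        def __init__(self, feat_dim=6, hidden=64):
            super().__init__()
            self.gru = nn.GRU(feat_dim, hidden, batch_first=True)
            self.cls = nn.Linear(hidden, 1)  # writing vs. not writing

        def forward(self, traj):               # traj: (B, T, feat_dim)
            h, _ = self.gru(traj)
            return torch.sigmoid(self.cls(h))  # (B, T, 1) probabilities

    def to_features(positions):
        """Augment (B, T, 3) fingertip positions with velocities."""
        vel = torch.diff(positions, dim=1, prepend=positions[:, :1])
        return torch.cat([positions, vel], dim=-1)

    # Example: a 120-frame trajectory -> per-frame intent probability.
    probs = IntentDetector()(to_features(torch.randn(1, 120, 3)))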

Human Body Reconstruction

Today, creating digital humans that accurately replicate real-world human motion and expression has become a central pursuit in immersive applications. From AR/VR to filmmaking, from gaming to human-computer interaction, digital humans are no longer just avatars—they are extensions of the self in virtual spaces. However, traditional pipelines for building these avatars rely heavily on expensive equipment, complex capture setups, and labor-intensive post-processing, greatly hindering their scalability and accessibility.

We believe that instead of chasing ever-larger datasets or higher-end devices, the key lies in rethinking the methodology: how to achieve greater realism and expressiveness using fewer resources.

To this end, our group proposes E3-Avatar: it accelerates the construction of coarse body models using pretrained Gaussian generation models, while integrating detail-enhancement modules dedicated to the hands and face. This enables precise capture and natural reproduction of subtle motions, resulting in highly expressive and faithful digital humans: created swiftly, rendered efficiently, and ready to power the next generation of human-centered applications.
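
The sketch below shows the coarse-plus-detail composition at the heart of this pipeline: Gaussians produced by the pretrained body generator are kept everywhere except the hands and face, where refined Gaussians are swapped in. The parameter layout and region-labeling scheme are assumptions made for illustration, not E3-Avatar's actual representation.

    # Composing coarse body Gaussians with refined hand/face Gaussians.
    # Data layout and region labels are illustrative assumptions only.
    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class Gaussians:
        means: np.ndarray    # (N, 3) centers
        scales: np.ndarray   # (N, 3) per-axis extents
        colors: np.ndarray   # (N, 3) RGB
        region: np.ndarray   # (N,)   0 = body, 1 = hands, 2 = face

    def compose(coarse: Gaussians, refined: Gaussians) -> Gaussians:
        """Keep coarse body Gaussians; replace hand/face regions with
        the output of the detail-enhancement modules."""
        keep = coarse.region == 0
        cat = np.concatenate
        return Gaussians(
            means=cat([coarse.means[keep], refined.means]),
            scales=cat([coarse.scales[keep], refined.scales]),
            colors=cat([coarse.colors[keep], refined.colors]),
            region=cat([coarse.region[keep], refined.region]),
        )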

Video Streaming

Today’s video streaming systems, though widely adopted, have yet to deliver a truly seamless and universally accessible viewing experience. The process of streaming is far more intricate than simply pressing play—network fluctuations can cause stutters, and device differences can lead to inconsistent playback quality. There remains a gap between technical capability and human expectation.

Rather than endlessly increasing bitrates or expanding bandwidth, we believe the system itself should become more intelligent—understanding the network, the content, and, most importantly, the user. Guided by this principle, we bring together expertise in network architecture, media systems, machine learning, and human-centered design to tackle the systemic challenges of video transmission.

We propose several solutions. PsyQoE captures subtle signals of perceived quality and adapts to changing bandwidth in real time, while STAR-VP prioritizes and preloads the most relevant segments of immersive 360-degree content, as sketched below. By combining real-time telemetry with predictive analytics and edge-based control, we are building a delivery pipeline that ensures smooth, responsive, and cost-effective playback across diverse devices and network conditions.
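
To make the tile-prioritization idea concrete, the sketch below spends a bandwidth budget on 360-degree tiles in descending order of predicted viewing probability, in the spirit of STAR-VP. The greedy upgrade rule and the budget model are simplified assumptions, not the published algorithm.

    # Viewport-aware bitrate allocation across 360-degree tiles.
    # Greedy heuristic for illustration only, not STAR-VP's algorithm.

    def allocate_bitrates(view_probs, ladder, budget):
        """view_probs: per-tile probability of viewport coverage.
        ladder: available bitrates in kbps, lowest first.
        budget: total bandwidth in kbps for the next segment."""
        # Start every tile on the lowest rung, then upgrade tiles in
        # descending order of predicted viewing probability.
        rates = [ladder[0]] * len(view_probs)
        spent = sum(rates)
        order = sorted(range(len(view_probs)),
                       key=lambda i: -view_probs[i])
        for i in order:
            for rung in ladder[1:]:
                if spent + (rung - rates[i]) > budget:
                    break
                spent += rung - rates[i]
                rates[i] = rung
        return rates

    # Example: 8 tiles, a 3-rung ladder, and a 6 Mbps budget.
    probs = [0.9, 0.7, 0.4, 0.2, 0.1, 0.05, 0.03, 0.02]
    print(allocate_bitrates(probs, [300, 800, 2000], 6000))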

Active Members


Ph.D. Students

  • Daoxu Sheng (盛道旭)
  • Geng Chen (陈耿)
  • Shuang Hao (郝爽)
  • Shule Cao (曹书乐)
  • Xingyu Liu (刘星宇)

Master’s Students

  • Baoqi Gao (高宝琪)
  • Bingshuai Fu (付冰帅)
  • Guangsheng Wang (王广胜)
  • Guangtian Liu (刘广田)
  • Hanbo Sun (孙菡伯)
  • Haonan Su (苏浩南)
  • Kaixin Fan (范开心)
  • Lei Zhou (周垒)
  • Linpei Zhang (张林培)
  • Tianyi Wang (王天一)
  • Ting Pan (潘婷)
  • Wenxing Xu (许文星)
  • Xiang Chen (陈祥)
  • Xiayang Zhou (周夏阳)
  • Xufeng Jian (简旭锋)
  • Yulan Wang (王语兰)

Publications