Synthesizing realistic human-object interaction (HOI) motions is essential for creating believable digital characters and intelligent robots. Existing approaches rely on data-intensive learning models that struggle with the compositional structure of everyday HOI motions, particularly for complex multi-object manipulation tasks. Because the space of possible interaction scenarios grows exponentially, comprehensive data collection is prohibitively expensive. The fundamental challenge is therefore to synthesize unseen, complex HOI sequences without extensive task-specific training data. Here we show that PrimHOI generates complex HOI motions through the spatial and temporal composition of generalizable interaction primitives defined by relative geometry. Our approach demonstrates that recurring local contact patterns (grasping, clamping, and supporting) serve as reusable building blocks for diverse interaction sequences. Unlike previous data-driven methods that require end-to-end training for each task variant, PrimHOI achieves zero-shot transfer to unseen scenarios through hierarchical primitive planning. Experiments demonstrate substantial improvements in adaptability, diversity, and motion quality over existing approaches.
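As a concrete illustration of what a primitive "defined by relative geometry" could look like in code, here is a minimal Python sketch. The names (PrimitiveType, ContactPrimitive, and their fields) are our own hypothetical assumptions for illustration, not PrimHOI's actual data structures.

# Hypothetical sketch: a contact primitive defined by relative geometry.
# The class and field names below are illustrative assumptions, not
# PrimHOI's actual data structures.
from dataclasses import dataclass
from enum import Enum, auto

import numpy as np


class PrimitiveType(Enum):
    GRASP = auto()         # hand wraps around the object
    CLAMP = auto()         # object squeezed between two body surfaces
    SUPPORT = auto()       # object rests on a single surface
    DUAL_SUPPORT = auto()  # object rests on two surfaces, e.g. shoulder + hand


@dataclass
class ContactPrimitive:
    """A reusable contact pattern expressed in the object's local frame,
    so the same primitive transfers across object poses and tasks."""
    kind: PrimitiveType
    body_part: str                  # e.g. "right_hand", "chest", "left_shoulder"
    contact_point_obj: np.ndarray   # contact location, object frame, shape (3,)
    contact_normal_obj: np.ndarray  # desired contact normal, object frame, shape (3,)

    def world_contact(self, T_obj: np.ndarray) -> np.ndarray:
        """Map the object-frame contact point into the world via a 4x4 pose."""
        p_h = np.append(self.contact_point_obj, 1.0)  # homogeneous coordinates
        return (T_obj @ p_h)[:3]

Because the contact is stored relative to the object rather than in world coordinates, the same primitive can be re-instantiated for any object pose, which is what makes it reusable across tasks.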
Overview of PrimHOI. (I) We define HOI planning tasks in the Planning Domain Definition Language (PDDL); the resulting formulation, PDDL-HOI, serves as prior knowledge for planning. An LLM translates the task description into a PDDL-formatted planning problem, and the symbolic planner in PDDL-HOI then searches for a plausible solution, represented as subgoal graphs {sg_k} that outline each key step. (II) After high-level planning, low-level motion generation proceeds in two steps: key-pose generation, followed by intermediate motion generation between adjacent key poses. (II.i) Each key pose C_k is generated by optimizing the contact constraints specified in the subgoal graph sg_k; these constraints are derived by sampling each contact-primitive factor. (II.ii) To connect adjacent key poses, we first plan the object's motion, then generate the intermediate human motion guided by the contact trajectory attached to the object's path. A post-optimization step refines the result, enforcing the subgoal-graph constraints while addressing penetration, smoothness, and other requirements.
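To make key-pose generation (step II.i) concrete, here is a minimal sketch of optimizing a pose against a subgoal's sampled contact targets. The scipy-based solver, the forward_kinematics stub, and the cost weights are our own illustrative assumptions, not the paper's actual implementation.

# Hypothetical sketch: key-pose generation (step II.i) as an optimization
# over pose parameters that satisfies sampled contact constraints.
# forward_kinematics is a stub and the cost weights are illustrative.
import numpy as np
from scipy.optimize import minimize


def forward_kinematics(theta: np.ndarray, body_part: str) -> np.ndarray:
    """Stub: map pose parameters theta to the world position of body_part.
    A real system would evaluate an articulated body model here."""
    return theta[:3]  # toy placeholder so the sketch runs end to end


def key_pose_cost(theta, contacts, w_contact=1.0, w_reg=1e-3):
    """Penalize distance between each body part and its target contact point
    (sampled from a contact primitive), plus a small pose regularizer."""
    cost = 0.0
    for body_part, target_world in contacts:
        diff = forward_kinematics(theta, body_part) - target_world
        cost += w_contact * float(diff @ diff)
    cost += w_reg * float(theta @ theta)  # stay close to a rest pose
    return cost


# Contact targets for one subgoal, e.g. a Clamp with chest and elbow.
contacts = [
    ("chest", np.array([0.4, 0.0, 1.2])),
    ("right_elbow", np.array([0.4, 0.3, 1.1])),
]
theta0 = np.zeros(63)  # e.g. flattened joint rotations of a body model
result = minimize(key_pose_cost, theta0, args=(contacts,), method="L-BFGS-B")
key_pose = result.x

A full system would replace the stub with a real body model and could add penetration, balance, and smoothness terms to the cost, mirroring the post-optimization described above.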
The person clamps the second box using the chest and elbow.
The person uses the free hand to open the door.
The person clamps the second box between the chest and the first box.
The person uses the free hand to open the door.
Grasp
Clamp
Support
Dual Support
Long Box
The person uses the shoulder and hand to Dual Support the long box before passing through the door.
Trashbin
The human uses the right hand (Grasp) to place the trashbin on the left hand, which supports the object.
Monitor
The human lifts the monitor using Clamp.
Plastic Container
The human lifts the plastic container using Clamp.
Trashbin Grasp Placement #1
Trashbin Grasp Placement #2
The human lifts the monitor using Clamp.
The human lifts the plastic container using Clamp.
Comparison of contact-guided motion generation: Local Control (Ours) vs. IK.
Comparison of contact-guided motion generation: Local Control (Ours) vs. ProgMoGen [1].
@inproceedings{kai2025primhoi,
title={PrimHOI: Compositional Human-Object Interaction via Reusable Primitives},
author={Jia, Kai and Liu, Tengyu and Zhu, Yixin and Pei, Mingtao and Huang, Siyuan},
booktitle={Proceedings of the International Conference on Computer Vision (ICCV)},
year={2025}
}