Synthesizing realistic human-object interaction (HOI) motions is essential for creating believable digital characters and intelligent robots. Existing approaches rely on data-intensive learning models that struggle with the compositional structure of everyday HOI motions, particularly for complex multi-object manipulation tasks. Because the space of possible interaction scenarios grows exponentially, comprehensive data collection is prohibitively expensive. The fundamental challenge is therefore to synthesize unseen, complex HOI sequences without extensive task-specific training data. Here we show that PrimHOI generates complex HOI motions through the spatial and temporal composition of generalizable interaction primitives defined by relative geometry. Our approach demonstrates that recurring local contact patterns (grasping, clamping, and supporting) serve as reusable building blocks for diverse interaction sequences. Unlike previous data-driven methods that require end-to-end training for each task variant, PrimHOI achieves zero-shot transfer to unseen scenarios through hierarchical primitive planning. Experiments demonstrate substantial improvements in adaptability, diversity, and motion quality over existing approaches.
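As a concrete illustration of what a primitive "defined by relative geometry" could look like in code, here is a minimal Python sketch. The names (PrimitiveType, ContactPrimitive, and their fields) are our own hypothetical assumptions for illustration, not PrimHOI's actual data structures.

# Hypothetical sketch: a contact primitive defined by relative geometry.
# The class and field names below are illustrative assumptions, not
# PrimHOI's actual data structures.
from dataclasses import dataclass
from enum import Enum, auto

import numpy as np


class PrimitiveType(Enum):
    GRASP = auto()         # hand wraps around the object
    CLAMP = auto()         # object squeezed between two body surfaces
    SUPPORT = auto()       # object rests on a single surface
    DUAL_SUPPORT = auto()  # object rests on two surfaces, e.g. shoulder + hand


@dataclass
class ContactPrimitive:
    """A reusable contact pattern expressed in the object's local frame,
    so the same primitive transfers across object poses and tasks."""
    kind: PrimitiveType
    body_part: str                  # e.g. "right_hand", "chest", "left_shoulder"
    contact_point_obj: np.ndarray   # contact location, object frame, shape (3,)
    contact_normal_obj: np.ndarray  # desired contact normal, object frame, shape (3,)

    def world_contact(self, T_obj: np.ndarray) -> np.ndarray:
        """Map the object-frame contact point into the world via a 4x4 pose."""
        p_h = np.append(self.contact_point_obj, 1.0)  # homogeneous coordinates
        return (T_obj @ p_h)[:3]

Because the contact is stored relative to the object rather than in world coordinates, the same primitive can be re-instantiated for any object pose, which is what makes it reusable across tasks.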
Overview of PrimHOI. (I) We define HOI planning tasks in the Planning Domain Definition Language (PDDL); the resulting formulation, PDDL-HOI, serves as prior knowledge for planning. An LLM translates the task description into a PDDL-formatted planning problem, and the symbolic planner in PDDL-HOI then searches for a plausible solution, represented as subgoal graphs {sg_k} that outline each key step. (II) After high-level planning, low-level motion generation proceeds in two steps: key-pose generation, followed by intermediate motion generation between adjacent key poses. (II.i) Each key pose C_k is generated by optimizing the contact constraints specified in the subgoal graph sg_k; these constraints are derived by sampling each contact-primitive factor. (II.ii) To connect adjacent key poses, we first plan the object's motion, then generate the intermediate human motion guided by the contact trajectory attached to the object's path. A post-optimization step refines the result, enforcing the subgoal-graph constraints while addressing penetration, smoothness, and other requirements.
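To make key-pose generation (step II.i) concrete, here is a minimal sketch of optimizing a pose against a subgoal's sampled contact targets. The scipy-based solver, the forward_kinematics stub, and the cost weights are our own illustrative assumptions, not the paper's actual implementation.

# Hypothetical sketch: key-pose generation (step II.i) as an optimization
# over pose parameters that satisfies sampled contact constraints.
# forward_kinematics is a stub and the cost weights are illustrative.
import numpy as np
from scipy.optimize import minimize


def forward_kinematics(theta: np.ndarray, body_part: str) -> np.ndarray:
    """Stub: map pose parameters theta to the world position of body_part.
    A real system would evaluate an articulated body model here."""
    return theta[:3]  # toy placeholder so the sketch runs end to end


def key_pose_cost(theta, contacts, w_contact=1.0, w_reg=1e-3):
    """Penalize distance between each body part and its target contact point
    (sampled from a contact primitive), plus a small pose regularizer."""
    cost = 0.0
    for body_part, target_world in contacts:
        diff = forward_kinematics(theta, body_part) - target_world
        cost += w_contact * float(diff @ diff)
    cost += w_reg * float(theta @ theta)  # stay close to a rest pose
    return cost


# Contact targets for one subgoal, e.g. a Clamp with chest and elbow.
contacts = [
    ("chest", np.array([0.4, 0.0, 1.2])),
    ("right_elbow", np.array([0.4, 0.3, 1.1])),
]
theta0 = np.zeros(63)  # e.g. flattened joint rotations of a body model
result = minimize(key_pose_cost, theta0, args=(contacts,), method="L-BFGS-B")
key_pose = result.x

A full system would replace the stub with a real body model and could add penetration, balance, and smoothness terms to the cost, mirroring the post-optimization described above.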
The person clamps the second box using the chest and elbow.
The person uses the free hand to open the door.
The person clamps the second box between the chest and the first box.
The person uses the free hand to open the door.
Grasp
Clamp
Support
Dual Support
Long Box
The person uses the shoulder and hand to Dual Support the long box before passing through the door.
Trashbin
The human uses the right hand (Grasp) to place the trashbin on the left hand, which supports the object.
Monitor
The human lifts the monitor using Clamp.
Plastic Container
The human lifts the plastic container using Clamp.
Trashbin Grasp Placement #1
Trashbin Grasp Placement #2
The human lifts the monitor using Clamp.
The human lifts the plastic container using Clamp.
Comparison of contact-guided motion generation: Local Control (Ours) vs. IK.
Comparison of contact-guided motion generation: Local Control (Ours) vs. ProgMoGen [1].
@inproceedings{kai2025primhoi,
title={PrimHOI: Compositional Human-Object Interaction via Reusable Primitives},
author={Jia, Kai and Liu, Tengyu and Zhu, Yixin and Pei, Mingtao and Huang, Siyuan},
booktitle={Proceedings of the International Conference on Computer Vision (ICCV)},
year={2025}
}