
Articulated objects? I’ll Let My Robot Figure That Out! (continued)


Authors: Claire Chen (Stanford), Nick Heppert (Stanford), Yijia Weng (Stanford), Kaichun Mo (Stanford), Brent Yi (Stanford), Toki Migimatsu (Stanford), Jeannette Bohg (Stanford), Leonidas Guibas (Stanford), Jeremy Ma (TRI)

In this continuation of our research highlight, we share more details on the PartNet-Mobility dataset, grasp generation, compliant control, and object pose tracking and estimation.

PartNet-Mobility Dataset

To aid the design of generalizable computer vision and manipulation methods, we developed a set of 3D assets comprising the PartNet-Mobility Dataset, which contains a collection of 2000 common articulated objects with motion annotations and rendering assets [1,2]. Figure 3 shows a selection of objects from PartNet-Mobility. The dataset can be downloaded here. These models allow us to simulate perception and manipulation of articulated object interactions in a virtual setting, which facilitates data collection.

Figure 3: A selection of articulated objects from the PartNet-Mobility dataset [2].

Grasp Generation

The first important step for interacting with any articulated object is choosing where and how to grasp it. Articulated objects are typically designed with features where they are meant to be grasped, often a handle or knob; however, the position, color, and geometry of these features can vary greatly. Once the robot has determined where on the object it should grasp, it must determine what kind of grasp to use. Again, the best grasp depends heavily on the geometry of the object, the robot’s kinematic constraints, and the task to be completed. For example, a narrow cylindrical cabinet pull may require a hook grasp, as shown in Figure 4 (top), whereas a wider pull may be more securely grasped with a pinch grasp, as shown in Figure 4 (bottom). The problem of finding a secure grasp for interacting with an unknown articulated object has often been overlooked in previous works; instead, these works assume that the robot begins the manipulation task with a secure grasp on the object.

Figure 4: A hook grasp (top) and pinch grasp (bottom) on two different cabinet pulls.

In this ongoing project, we are designing a method that takes in a single color and depth image of an articulated object and outputs the optimal grasp posture and the optimal spatial location for grasping. In planning grasps for articulated object manipulation, it is also critical to consider the kinematic constraints of the robot, to ensure that a grasp is not only feasible upon initial contact but remains feasible throughout the entire task. We account for this in training by excluding grasps that are not kinematically feasible.
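To make the idea of "feasible throughout the entire task" concrete, here is a minimal sketch of such a filter. It is not the method under development in this project: it assumes a top-down planar scene, a door swinging about a hinge, and a crude reachability test (the grasp point must stay within a distance band around the arm base) in place of real inverse kinematics. All names and numbers (`handle_path`, `within_reach`, `feasible_grasps`, the scene geometry) are hypothetical.

```python
import math

def handle_path(hinge, radius, closed_angle, swing, steps=20):
    """Positions of a grasp point on a door as it swings about its hinge (top-down view)."""
    path = []
    for i in range(steps + 1):
        angle = closed_angle + swing * i / steps
        path.append((hinge[0] + radius * math.cos(angle),
                     hinge[1] + radius * math.sin(angle)))
    return path

def within_reach(point, base, r_min, r_max):
    # Crude stand-in for inverse kinematics: the distance from the arm base
    # to the grasp point must lie in [r_min, r_max].
    return r_min <= math.dist(point, base) <= r_max

def feasible_grasps(candidates, base, r_min, r_max):
    """Keep only grasps whose whole path stays reachable, not just the first contact."""
    return [name for name, path in candidates
            if all(within_reach(p, base, r_min, r_max) for p in path)]

# Hypothetical scene: arm base at the origin, door hinged at (0.5, 0.5),
# opening by a quarter turn toward the robot.
hinge, base = (0.5, 0.5), (0.0, 0.0)
candidates = [
    ("near_hinge", handle_path(hinge, 0.2, math.pi, math.pi / 2)),
    ("door_edge", handle_path(hinge, 0.6, math.pi, math.pi / 2)),
]
print(feasible_grasps(candidates, base, r_min=0.2, r_max=0.9))
```

In this toy scene both candidates are reachable at first contact, but the grasp at the door edge sweeps too close to the arm base halfway through the swing, so only `near_hinge` survives the filter.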

Compliant Control

Once the robot has successfully grasped the movable part of an object, it must move this part using a controller. This requires knowing the direction in which the robot should apply force. The controller must allow the robot to conform to the way the object part moves, to avoid breaking the robot or the object. Ideally, we would also want a controller that is robust to errors in predicting the joint types in objects. The most common joint types found in articulated objects are prismatic, where the movable part of the object slides along an axis (think of a drawer), and revolute, where the movable part rotates around an axis (think of a refrigerator door). For example, a robot should be able to safely open the revolute door shown in Figure 5, even though it may initially perceive it to be a set of prismatic drawers.

Figure 5: Object that appears to be a set of prismatic drawers, revealed to be a revolute door.

To make our robots safe and robust to perception errors, we implement a compliant operational space controller, which applies motion along one axis of a local task frame, highlighted by the yellow arrow in Figure 6, while being compliant in the other axes of the task frame, shown in purple. To interact with an articulated object, we then simply need to specify a local motion axis for the controller.

Figure 6: Local task frame overlaid on a drawer object.
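The split between a motion axis and compliant axes can be sketched with a simple Cartesian stiffness law. This is only an illustration under stated assumptions, not the controller used on our robot: a full operational space controller also accounts for the robot's dynamics, and the function name `compliant_force` and the gains `k_motion`, `k_compliant`, and `damping` are hypothetical values chosen for the example.

```python
import math

def compliant_force(pos_error, velocity, motion_axis,
                    k_motion=200.0, k_compliant=5.0, damping=10.0):
    """Task-space force: stiff along motion_axis, soft in the remaining directions.

    pos_error:   desired minus current end-effector position (x, y, z)
    velocity:    current end-effector velocity (x, y, z)
    motion_axis: direction of commanded motion, expressed in the same frame
    """
    norm = math.sqrt(sum(a * a for a in motion_axis))
    axis = [a / norm for a in motion_axis]
    # Scalar component of the position error along the motion axis.
    along = sum(e * a for e, a in zip(pos_error, axis))
    force = []
    for e, v, a in zip(pos_error, velocity, axis):
        e_along = along * a   # error component along the motion axis
        e_off = e - e_along   # error component in the compliant directions
        force.append(k_motion * e_along + k_compliant * e_off - damping * v)
    return force

# The same 10 cm error produces a strong pull along the motion axis
# but only a gentle force sideways, so the object can guide the hand.
pull = compliant_force((0.1, 0.0, 0.0), (0.0, 0.0, 0.0), motion_axis=(1.0, 0.0, 0.0))
side = compliant_force((0.0, 0.1, 0.0), (0.0, 0.0, 0.0), motion_axis=(1.0, 0.0, 0.0))
print(pull, side)
```

Because the off-axis stiffness is low, a door that swings on an arc can pull the end effector sideways without the controller fighting it, which is the behavior that makes a wrong joint-type guess survivable.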

We demonstrate this controller on a real robot that opens both a door with a revolute joint and a drawer with a prismatic joint, shown in Figure 7. Note that the same controller is used in both videos, and requires no additional tuning or replanning to switch between the door and drawer.

Figure 7: With the compliant controller, a robot can switch between opening a prismatic drawer (top) and revolute door (bottom) without needing to retune or replan.

Object Pose Tracking and Estimation

Finally, to verify whether the robot has successfully moved the object to the target configuration and to choose the next actions to take, the robot needs to track and estimate the object state over time. Specifically, we are interested in tracking object part poses, as well as estimating joint states and joint types. In the case of a refrigerator, these quantities correspond to the position and orientation of the door, the angle the door has been opened to, and whether the joint is revolute or prismatic. The challenge of this perception problem comes from the diverse structure, geometry, and appearance of articulated objects. One way to mitigate this challenge is to group articulated objects into different categories, such as refrigerators, ovens, and drawers. Object instances within a category are often less varied in geometry and appearance than objects across categories. However, identifying the category of an object can be challenging. As such, we explore two methods: a category-level tracking method, where a single perception model generalizes to all different object instances from a known category, and a category-independent method, where a single perception model applies to articulated objects across different categories. Both methods take as input a sequence of observations of the object in motion. Thus, these methods fit nicely with the compliant controller, which enables a robot to generate sequences of observations by moving the object autonomously.

CAtegory-level Pose Tracking for Rigid and Articulated Objects from Point Clouds (CAPTRA)

In this published work, we propose the first method for category-level articulated object pose tracking. As shown in Figure 8, our model takes in a partial point cloud sequence of an unseen object and a rough initial pose prediction of the first frame, and outputs the per-part 9 degrees-of-freedom (DOF) poses over time. Here the 9 DOF pose contains the 3D position, 3D rotation, and 3D size of the part, which can be visualized as a bounding box.

Figure 8: CAPTRA takes, as input, a sequence of depth point clouds and a rough initial pose prediction, and outputs per-part 9 degrees-of-freedom (DOF) poses over time.
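To illustrate how a 9 DOF pose maps to a bounding box, the sketch below converts a position, a rotation, and a size into the eight box corners. The conventions here are assumptions made for the example and may differ from those in the paper: the position is taken as the box center, the rotation is a 3x3 matrix stored as nested lists, and the size holds the full extents along the box's local axes. The helper name `box_corners` is hypothetical.

```python
from itertools import product

def box_corners(position, rotation, size):
    """Return the 8 corners of an oriented bounding box.

    position: box center (x, y, z)
    rotation: 3x3 rotation matrix as nested lists (row-major)
    size:     full extents along the box's local x, y, z axes
    """
    corners = []
    for signs in product((-0.5, 0.5), repeat=3):
        # Corner in the box's local frame: half the extent in each direction.
        local = [s * d for s, d in zip(signs, size)]
        # Rotate into the world frame, then translate to the box center.
        world = tuple(
            sum(rotation[r][c] * local[c] for c in range(3)) + position[r]
            for r in range(3)
        )
        corners.append(world)
    return corners

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
corners = box_corners((1.0, 2.0, 3.0), identity, (2.0, 4.0, 6.0))
print(len(corners), corners[0], corners[-1])
```

Three numbers for position, three for rotation (here expanded into a matrix), and three for size account for the nine degrees of freedom.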

We perform tracking in an online frame-by-frame fashion, where we leverage the temporal smoothness of object motions by conditioning the current frame’s pose prediction on the previous frame’s estimated pose. For more details, please refer to our paper [3]. Figure 9 shows our model’s tracking results on real robotic manipulation sequences, where pose estimation can be used for closed-loop control.

Figure 9: Estimates of drawer poses from CAPTRA during a robotic manipulation task.

Category-independent articulated object tracking

While we can classify many object instances into a specific category, the unstructured nature of homes makes this challenging, as not every object may fit a prescribed category and kinematic structure. Hence, in this ongoing work, we use insights gained from CAPTRA to allow category-independent tracking without needing prior information such as a known segmentation of parts. We take advantage of the fact that articulated objects consist of a set of rigidly moving parts. Therefore, grouping pixels that rigidly move together while the robot interacts with the articulated object automatically leads to a pixel-wise part segmentation. Exploiting part motion over multiple frames makes category-independent articulated object tracking much easier than tracking from a single image.
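The intuition that points on the same rigid part keep their mutual distances can be sketched in a few lines. This is a deliberately simple stand-in, not our learned model: it assumes 3D points with known correspondences between two frames, and it uses a greedy grouping with a fixed tolerance. Distance preservation is also only a necessary condition for rigidity, so coincidental matches are possible in general. The names `group_rigid_points` and `rotate_about_vertical` and the toy scene are hypothetical.

```python
import math

def group_rigid_points(before, after, tol=1e-3):
    """Greedily group point indices whose pairwise distances are preserved across frames."""
    groups = []
    for i in range(len(before)):
        for group in groups:
            # Point i joins a group only if it keeps its distance to every member.
            if all(abs(math.dist(before[i], before[j]) - math.dist(after[i], after[j])) <= tol
                   for j in group):
                group.append(i)
                break
        else:
            groups.append([i])
    return groups

def rotate_about_vertical(point, hinge_xy, angle):
    """Rotate a 3D point about a vertical axis passing through hinge_xy."""
    dx, dy = point[0] - hinge_xy[0], point[1] - hinge_xy[1]
    c, s = math.cos(angle), math.sin(angle)
    return (hinge_xy[0] + c * dx - s * dy, hinge_xy[1] + s * dx + c * dy, point[2])

# Toy scene: three static cabinet-body points and three points on a door
# that swings 30 degrees about a vertical hinge at x = 2, y = 0.
body = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
door = [(2.5, 0.0, 0.0), (3.0, 0.0, 0.0), (3.0, 0.0, 1.0)]
before = body + door
after = body + [rotate_about_vertical(p, (2.0, 0.0), math.radians(30)) for p in door]
print(group_rigid_points(before, after))
```

On this toy scene the points split into the static body and the moving door, which is the part segmentation that motion provides for free.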

Figure 10: Two-stage pipeline for category-independent object tracking and estimation. Given a sequence of color and depth camera images (a), a learned model detects each part of the object from pairs of images (b). Then, a factor graph estimation module (c) uses the output of the part detection to estimate the time-invariant joint type and parameters (d.1) and time-varying joint states (d.2) for each image in the input sequence.

The pipeline in Figure 10 illustrates our approach. We first process a sequence of color and depth camera images (Figure 10a) to detect all moving parts. Using a sequence allows us to predict and cluster each part’s center and movement for an image pair on a pixel level (Figure 10b). Here, by training the prediction model on a wide range of different articulated objects, we can successfully distinguish between moving and stationary parts, as our model learns geometric characteristics of objects. Next, we feed the output of our part detection to a factor graph estimation module (Figure 10c). The estimation method outputs the time-invariant joint type and parameters (Figure 10d.1) as well as the time-varying joint states (Figure 10d.2) for each image in the input sequence. These estimated quantities can be used as feedback to the compliant controller that moves the object part to a goal state.
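To show what "time-invariant joint parameters" and "time-varying joint states" mean in practice, the sketch below recovers both from the tracked 2D positions of a single point on a moving part, seen from above. It swaps the factor graph for a plain geometric fit and should be read only as an illustration: it uses just three samples of the track, assumes noise-free positions, and assumes the part actually moved. The function name `estimate_joint` and the tolerance are hypothetical.

```python
import math

def estimate_joint(track, tol=1e-6):
    """Classify a 2D point track as prismatic or revolute and recover joint states."""
    a, b, c = track[0], track[len(track) // 2], track[-1]
    # Twice the signed area of triangle (a, b, c); zero means the points are collinear.
    cross = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    if abs(cross) < tol:
        # Prismatic: the parameter is the sliding axis, the state is the displacement.
        length = math.dist(a, c)
        axis = ((c[0] - a[0]) / length, (c[1] - a[1]) / length)
        states = [(p[0] - a[0]) * axis[0] + (p[1] - a[1]) * axis[1] for p in track]
        return {"type": "prismatic", "axis": axis, "states": states}
    # Revolute: the parameter is the hinge location (circumcenter of the three samples).
    sa, sb, sc = (p[0] ** 2 + p[1] ** 2 for p in (a, b, c))
    d = 2.0 * cross
    cx = (sa * (b[1] - c[1]) + sb * (c[1] - a[1]) + sc * (a[1] - b[1])) / d
    cy = (sa * (c[0] - b[0]) + sb * (a[0] - c[0]) + sc * (b[0] - a[0])) / d
    # The state is the signed angle swept since the first frame.
    r0 = (a[0] - cx, a[1] - cy)
    states = []
    for p in track:
        r = (p[0] - cx, p[1] - cy)
        states.append(math.atan2(r0[0] * r[1] - r0[1] * r[0], r0[0] * r[0] + r0[1] * r[1]))
    return {"type": "revolute", "center": (cx, cy), "states": states}

drawer = estimate_joint([(0.0, 0.0), (0.1, 0.0), (0.2, 0.0)])
door = estimate_joint([(1 + 0.5 * math.cos(t), 1 + 0.5 * math.sin(t)) for t in (0.0, 0.3, 0.6)])
print(drawer["type"], door["type"])
```

The joint type and the axis or hinge location are estimated once for the whole sequence, while one state value is produced per frame. A factor graph serves the same purpose but fuses all frames and handles measurement noise, which this three-point fit does not.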


Our work in grasp generation, compliant control, and object tracking for articulated objects brings us closer to deploying assistive robots in any home. Following these works, we plan to develop tightly coupled perception and planning algorithms that will enable robots to perform common household tasks given high-level specifications, like “place cutlery in the drawer” or “get milk from the refrigerator”. On the perception side, we will develop models that infer high-level properties of objects relevant to task completion, such as whether a drawer has been opened wide enough to fit an object. On the planning side, we will develop task and motion planning (TAMP) algorithms that extend the method introduced in [4] to output a sequence of symbolic actions, such as “first open drawer, then place object in drawer”. We are excited to continue collaborating with TRI to discover research breakthroughs that will make assistive home robots a reality.


[1] Kaichun Mo, Shilin Zhu, Angel X. Chang, Li Yi, Subarna Tripathi, Leonidas J. Guibas, and Hao Su. PartNet: A large-scale benchmark for fine-grained and hierarchical part-level 3D object understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.

[2] Fanbo Xiang, Yuzhe Qin, Kaichun Mo, Yikuan Xia, Hao Zhu, Fangchen Liu, Minghua Liu, Hanxiao Jiang, Yifu Yuan, He Wang, et al. SAPIEN: A simulated part-based interactive environment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.

[3] Yijia Weng*, He Wang*, Qiang Zhou, Yuzhe Qin, Yueqi Duan, Qingnan Fan, Baoquan Chen, Hao Su, and Leonidas Guibas. CAPTRA: CAtegory-level Pose Tracking for Rigid and Articulated Objects from Point Clouds. ICCV, 2021.

[4] Toki Migimatsu and Jeannette Bohg. Object-Centric Task and Motion Planning in Dynamic Environments. IEEE Robotics and Automation Letters and ICRA, 2020.
