Creating a baseline object recognition algorithm

To lay the software foundation for our work with the MIT half of Team MIT+Princeton (creative name, I know), we need to establish an API. We're working on the vision and perception systems (i.e., getting visual information from an RGB-D color+depth camera and identifying objects and their locations/orientations), and they're working on the robot.

A natural hand-off point is for our system to provide a constant feed of objects the vision system is seeing, their poses (position and orientation), what the objects are, and how confident our system is. The robot can then use this information to make a grab.

Great. The MIT team is using the Robot Operating System (ROS; great introduction here), which isn't an operating system so much as a loosely coupled framework that lots of people in robotics use. The framework makes it really easy to do inter-process communication within a single machine or across networks, among other things (most robots and assorted hardware have some plug-in to ROS).

So now, with an API defined (again: pose, label, confidence, and some other metadata), we can create the process that will provide this API whenever MIT's process needs an update. To get everything hooked up, we need a baseline (a.k.a. dummy, but that sounds less fancy) algorithm that will recognize objects. Of course, we'll eventually make this algorithm much, much more sophisticated (competitors beware), but it's good software practice to get the connection points in place first.
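To make that hand-off concrete, here's a minimal sketch of what one entry in that feed could look like, written as plain Python. The fields (label, pose, confidence, metadata) come straight from the API described above; everything else, like the `DetectedObject` name and the quaternion convention, is just an assumption for illustration. In practice this gets wrapped up as a ROS message or service.

```python
# A minimal sketch of the per-object payload our vision process hands to
# MIT's robot process. The class name and field layout are illustrative,
# not an actual ROS message definition.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class DetectedObject:
    label: str                                       # which known object we think this is
    position: Tuple[float, float, float]             # x, y, z in the camera frame (meters)
    orientation: Tuple[float, float, float, float]   # orientation as a quaternion (x, y, z, w)
    confidence: float                                # 0.0 to 1.0, how sure we are
    metadata: Dict[str, str] = field(default_factory=dict)  # e.g. timestamp, camera ID


def detect_objects(rgb_image, depth_image) -> List[DetectedObject]:
    """Baseline ('dummy') recognizer, to be replaced by the real pipeline.

    For now it returns an empty list so the plumbing between the vision
    process and the robot process can be exercised end to end.
    """
    return []
```

The empty `detect_objects` stub is the whole point of the baseline: the connection between our process and MIT's can be tested before the recognition gets smart.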

Here's the information we've got (remember, we know all the objects ahead of time and there aren't that many):
1. Existing 3D models of all the objects
2. Existing 2D images of all the objects, taken from various angles
3. Incoming depth (distance to camera, from which we can infer 3D) and color data

Our goal is to compare the incoming RGB-D data to the 3D and 2D data we already have to determine what objects we're looking at and what their poses are.

Roughly speaking, we first look for important features in the incoming color data (i.e., 2D photo) and compare them to important features in the existing images we have of the various objects (SIFT keypoint matching with random sample consensus for those who are curious). This gives us a first estimate of what objects might be in the scene.
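For the curious, here's roughly what that looks like in code, sketched with OpenCV. The image paths and thresholds are placeholders, and the SIFT constructor varies by OpenCV version (`cv2.SIFT_create` in recent builds, `cv2.xfeatures2d.SIFT_create` in older ones), but the flow is the same: find keypoints, match descriptors, then let RANSAC throw out the matches that don't agree on a single geometric transform.

```python
# Rough sketch of the 2D step: SIFT keypoints + ratio-test matching,
# then RANSAC (via findHomography) to keep only geometrically
# consistent matches. Paths and thresholds are placeholders.
import cv2
import numpy as np

ref = cv2.imread("cheezit_reference.jpg", cv2.IMREAD_GRAYSCALE)   # known photo of the object
scene = cv2.imread("shelf_scene.jpg", cv2.IMREAD_GRAYSCALE)       # incoming camera frame

sift = cv2.SIFT_create()  # cv2.xfeatures2d.SIFT_create() on older OpenCV builds
kp_ref, des_ref = sift.detectAndCompute(ref, None)      # mask=None; the yellow-rectangle mask could go here
kp_scene, des_scene = sift.detectAndCompute(scene, None)

# Match descriptors and keep only the clearly better matches (Lowe's ratio test).
matcher = cv2.BFMatcher()
matches = matcher.knnMatch(des_ref, des_scene, k=2)
good = []
for pair in matches:
    if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance:
        good.append(pair[0])

if len(good) >= 4:  # need at least 4 point pairs to fit a homography
    src = np.float32([kp_ref[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp_scene[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    # RANSAC rejects matches that don't agree on one geometric transform.
    H, inlier_mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    n_inliers = int(inlier_mask.sum()) if inlier_mask is not None else 0
    print(f"{n_inliers} consistent matches -> rough evidence the object is in the scene")
```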

Then we take the point cloud (what we call lots of 3D points in space; it looks like a cloud of points) we get from the depth sensor and compare it to the known 3D model (the iterative closest point algorithm, a.k.a. ICP; no relation to the Insane Clown Posse).
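Here's a sketch of that step too, using Open3D's ICP implementation as one example library (PCL is another common choice). The file names, voxel size, and distance threshold are placeholders, and in the real pipeline the 2D matching step above would supply the initial pose guess instead of the identity matrix used here.

```python
# Rough sketch of the 3D step: iterative closest point (ICP) alignment of
# the sensed point cloud against the known 3D model, using Open3D as an
# example library. File names and thresholds are placeholders.
import numpy as np
import open3d as o3d

model = o3d.io.read_point_cloud("cheezit_model.ply")     # known 3D model of the object
sensed = o3d.io.read_point_cloud("depth_snapshot.ply")   # point cloud from the depth sensor

# Downsample so ICP runs at a reasonable speed on a full scene.
model_down = model.voxel_down_sample(voxel_size=0.005)
sensed_down = sensed.voxel_down_sample(voxel_size=0.005)

# In the real pipeline the 2D matching step would supply a rough initial
# pose; the identity matrix is just a stand-in here.
init_guess = np.eye(4)

result = o3d.pipelines.registration.registration_icp(
    model_down, sensed_down,
    0.02,        # max correspondence distance (2 cm), a placeholder
    init_guess,
    o3d.pipelines.registration.TransformationEstimationPointToPoint(),
)

print("Estimated pose (4x4 transform):")
print(result.transformation)
print(f"Fitness (fraction of points matched): {result.fitness:.2f}")
```

The transform ICP spits out is exactly the pose we promised in the API, and its fitness score is a reasonable stand-in for the confidence value.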

Getting the important features (green dots) and removing those that fall outside the yellow rectangle (a mask) to get the keypoints for a particular object.

Checking the keypoints from our known photo above against a new image.

The red dots at the top represent the back of the shelf and the bright red cluster in the bottom-right represents the Cheez-It box. The slightly offset blue dots are the 'right' answer. Not bad for a really simplistic algorithm.