Calibrating depth and color cameras

A fully-reconstructed 3D model:

In order for Mobot to reconstruct a model of the world it sees, it has to record what it sees. But what does it see? A typical phone will have a camera that captures color information arranged in a two-dimensional array. Each pixel, or picture element, contains some color information.

There is a lot of research on reconstructing a three-dimensional world from a bunch of these two-dimensional images. But maybe we can make our lives a lot easier if we also have some three-dimensional information at each pixel (i.e., how far away is each pixel from the camera?).
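To make that concrete, here is a tiny sketch (NumPy, with made-up resolutions) of what a color frame and a matching depth frame look like as arrays: the color image stores a red, green, and blue value per pixel, and the depth image stores a distance per pixel.

```python
import numpy as np

# Illustrative shapes only -- the real cameras have their own resolutions.
HEIGHT, WIDTH = 480, 640

# A color image: one red, green, and blue value (0-255) per pixel.
color = np.zeros((HEIGHT, WIDTH, 3), dtype=np.uint8)

# A depth image: one distance per pixel, here in millimeters.
depth = np.zeros((HEIGHT, WIDTH), dtype=np.uint16)

# The pixel at row 240, column 320 has a color and, ideally, a depth.
print(color[240, 320], depth[240, 320])
```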

Mobot is equipped with eight cameras (see picture below) arranged in four depth-color pairs. We use the color cameras from four iPod touches and attach a special depth sensor to each, for eight cameras in total.

The depth sensors are made by a small company called Occipital and use a technique called structured light to figure out how far away pixels are from the camera. In structured light, we use a projector to project a known pattern onto a scene and then use a camera to read back the deformed pattern. Algorithms then compare the known pattern with the deformed pattern and infer the shapes the camera saw. We can see a simple example with an overhead projector and normal, visible light:

The problem with visible light, of course, is that you need a dark room so the camera can see the pattern! So most structured light depth sensors use infrared light (there are some issues with this as well, but we'll save those for another time). Below, you can see the depth sensor we use, including its infrared (IR) projector, camera, and LED. We'll talk about the LED lamps in a bit.
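Whichever kind of light the pattern uses, the distance ultimately comes from triangulation: the farther away a surface is, the less the projected pattern appears shifted between where the projector put it and where the camera sees it. Here is a minimal sketch of that disparity-to-depth conversion; the focal length and projector-to-camera baseline below are made-up numbers, not the sensor's real specs, and the real pipeline does far more (pattern matching, filtering, and so on).

```python
import numpy as np

# Assumed, illustrative values -- not the actual sensor's parameters.
FOCAL_LENGTH_PX = 570.0   # IR camera focal length, in pixels
BASELINE_M = 0.075        # projector-to-camera distance, in meters

def depth_from_disparity(disparity_px):
    """Convert the pattern's pixel shift (disparity) into depth in meters."""
    disparity_px = np.asarray(disparity_px, dtype=np.float64)
    depth = np.full_like(disparity_px, np.nan)   # no measurable shift -> no depth
    valid = disparity_px > 0
    depth[valid] = FOCAL_LENGTH_PX * BASELINE_M / disparity_px[valid]
    return depth

# A pattern feature shifted by 20 px is about twice as far away as one shifted by 40 px.
print(depth_from_disparity([20.0, 40.0]))   # -> roughly [2.14, 1.07] meters
```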

But even though we can now capture both color and depth images, we still have more work to do. If we naively try to line up the image taken by the infrared camera and the image taken by the color camera, we're going to have a mismatch. From the left red box to the right in the image below, we see the monitor got 'painted' onto the wall, the wall onto the chair, and the hallway onto the door.

We see this problem because the infrared and color cameras aren't physically in the same place (and also have different optical properties and therefore different camera matrices).
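To see why, here is a toy projection with the standard pinhole model; the intrinsics and the 2.5 cm offset are invented for illustration, not our measured values. The same 3D point lands on different pixels in two cameras that differ in camera matrix and position.

```python
import numpy as np

def project(point_3d, K, R, t):
    """Project a 3D point (world coordinates) to pixel coordinates with a pinhole camera."""
    p_cam = R @ point_3d + t        # world -> camera coordinates
    p_img = K @ p_cam               # apply the camera matrix
    return p_img[:2] / p_img[2]     # perspective divide

point = np.array([0.1, 0.0, 2.0])   # a point 2 m in front of the rig

# Invented intrinsics: slightly different focal lengths for the two cameras.
K_ir    = np.array([[570.0, 0.0, 320.0], [0.0, 570.0, 240.0], [0.0, 0.0, 1.0]])
K_color = np.array([[600.0, 0.0, 320.0], [0.0, 600.0, 240.0], [0.0, 0.0, 1.0]])

R_none  = np.eye(3)                       # same orientation, for simplicity
t_ir    = np.zeros(3)
t_color = np.array([-0.025, 0.0, 0.0])    # color camera offset by 2.5 cm (assumed)

print(project(point, K_ir, R_none, t_ir))        # [348.5 240. ]
print(project(point, K_color, R_none, t_color))  # [342.5 240. ] -- a 6-pixel mismatch
```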

So we take a lot of pictures of a known pattern with both the infrared and color cameras (which are now held at a constant relative transformation by the custom case we milled). Checkerboards are particularly easy to detect using automated methods, so we can figure out not only what properties each individual camera has (e.g., focal length), but also where the two cameras are relative to each other.
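For the curious, here is a minimal sketch of that calibration step using OpenCV. The board dimensions, square size, and file names are assumptions, both image streams are assumed to be the same resolution, and our actual pipeline differs in the details; but the shape of the computation is the same: detect checkerboard corners in both cameras, recover each camera's intrinsics, then solve for the rotation and translation between the two.

```python
import glob
import cv2
import numpy as np

PATTERN = (9, 6)        # interior checkerboard corners (assumed)
SQUARE_SIZE = 0.025     # square size in meters (assumed)

# 3D corner positions in the checkerboard's own coordinate frame (z = 0 plane).
board = np.zeros((PATTERN[0] * PATTERN[1], 3), np.float32)
board[:, :2] = np.mgrid[0:PATTERN[0], 0:PATTERN[1]].T.reshape(-1, 2) * SQUARE_SIZE

obj_pts, ir_pts, color_pts = [], [], []
for ir_path, color_path in zip(sorted(glob.glob("ir_*.png")),
                               sorted(glob.glob("color_*.png"))):
    ir = cv2.imread(ir_path, cv2.IMREAD_GRAYSCALE)
    color = cv2.imread(color_path, cv2.IMREAD_GRAYSCALE)
    ok_ir, corners_ir = cv2.findChessboardCorners(ir, PATTERN)
    ok_color, corners_color = cv2.findChessboardCorners(color, PATTERN)
    if ok_ir and ok_color:      # keep only frames where both cameras saw the board
        obj_pts.append(board)
        ir_pts.append(corners_ir)
        color_pts.append(corners_color)

size = ir.shape[::-1]   # (width, height), assuming both cameras share it

# Per-camera intrinsics: camera matrix (focal length, principal point) and distortion.
_, K_ir, dist_ir, _, _ = cv2.calibrateCamera(obj_pts, ir_pts, size, None, None)
_, K_color, dist_color, _, _ = cv2.calibrateCamera(obj_pts, color_pts, size, None, None)

# The rigid transform (R, T) taking points from the IR camera's frame
# into the color camera's frame -- the relative pose we're after.
_, _, _, _, _, R, T, _, _ = cv2.stereoCalibrate(
    obj_pts, ir_pts, color_pts, K_ir, dist_ir, K_color, dist_color, size,
    flags=cv2.CALIB_FIX_INTRINSIC)

print("Rotation between cameras:\n", R)
print("Translation between cameras (m):\n", T)
```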

Here is where our algorithm believes the two cameras saw each of the images (the mess of multi-colored planes):

And with this calibration, an improved (but not perfect) alignment:

Ok. So now a short paragraph about what I did this morning: Once we have the relative transformation between the two camera views, we have to transform each pair of depth and color images we capture so they line up. Originally, we were aligning the images in software (i.e., using the CPU, a general-purpose processor, to do the matrix math). But Occipital recently released an update that allows us to do this computation in hardware on the depth sensor package itself! The whole system gets a speed bump.
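For completeness, here is a rough sketch of the software-side registration we used to do on the CPU: back-project each depth pixel into a 3D point using the IR camera's intrinsics, apply the calibrated relative transform (R, T), and project into the color camera. The parameter names here are placeholders, not our calibrated values.

```python
import numpy as np

def register_depth_to_color(depth_m, K_ir, K_color, R, T):
    """For every depth pixel, return the (u, v) coordinates it maps to in the color image.

    depth_m: H x W array of depths in meters (0 where the sensor saw nothing).
    K_ir, K_color: 3x3 camera matrices; R (3x3), T (3,): IR -> color transform.
    """
    h, w = depth_m.shape
    us, vs = np.meshgrid(np.arange(w), np.arange(h))
    fx, fy = K_ir[0, 0], K_ir[1, 1]
    cx, cy = K_ir[0, 2], K_ir[1, 2]

    # Back-project every depth pixel into a 3D point in the IR camera's frame.
    z = depth_m
    x = (us - cx) * z / fx
    y = (vs - cy) * z / fy
    points_ir = np.stack([x, y, z], axis=-1).reshape(-1, 3)

    # Move the points into the color camera's frame, then project with its camera matrix.
    points_color = points_ir @ R.T + T
    projected = points_color @ K_color.T
    uv = projected[:, :2] / projected[:, 2:3]   # perspective divide
    return uv.reshape(h, w, 2)
```

In practice you would also mask out zero-depth pixels (they divide by zero above) and round and clamp the resulting coordinates before sampling colors; the hardware path on the sensor now does the equivalent work for us.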