Putting it all together

Update: I was able to finish part 1 (receiving data and doing a naive 3D reconstruction in real time). More on this next week. I guess part 2 (making that reconstruction less naive) is going to take longer than 0 days :)

I've been working on Mobot (original post explaining the project here) in some form or another for about four months now. Most of the major pieces exist, but they aren't connected or optimized for speed. It's time to pull all the parts together. So here's an aggressive timeline plus a map that explains the updates to come.

The project has three major parts: 1) collecting data, 2) processing data, and 3) planning. Part 1 is essentially complete, so I'd like to finish Part 2 by the end of this week for a single depth / color camera pair. Depending on how close we are to real-time performance, I'll decide whether to work on optimizing or on adding more cameras / moving on to planning.

A strikethrough means complete as of this writing. Optimization opportunities are marked OO (i.e., skip unless you're curious).

Hm, after re-reading, I see that everything's crossed out. I guess that wasn't a useful way to present a to-do list. The two major connections I need to make are 1) between receiving data and extracting feature descriptors and 2) between aligning consecutive frames and bundle adjusting.

Collecting data

Calibrating cameras: We need to know how the depth and color cameras are related (the extrinsics) and what intrinsic properties each has. More details here; there's a small sketch of how the calibration gets used after this list.
Capturing data: We use an iPod Touch (6th generation) for color data and custom-designed cases to attach a Structure Sensor for depth data. Currently we capture both color and depth data in VGA (640x480) resolution at 15 frames per second (FPS). If the rest of the system is willing, we'll increase this framerate.
Streaming data: Unfortunately, iPod touches don't have the computational ability to process the data they collect (yet!), so we must stream to a more powerful server. We use JPEG compression for color and 16-bit grayscale for depth, and dump raw TCP packets (a sketch of a simple framing scheme also follows this list). OO: iOS devices have hardware-accelerated h.264 processing, which means we can compress and decompress high-quality video much faster than compressing individual images for each frame. Briefly, h.264 belongs to a class of video compression that uses reference frames (I-frames) that can be decoded by themselves and inter frames (P- and B-frames) that depend on the reference frames for decoding. You could think of these inter frames as diffs relative to a reference frame. Anyway, encoding on the iOS device, decoding on the server, and then synchronizing the color stream with the depth stream would be a non-trivial amount of work.
Receiving data: We are able to successfully receive 4 color-depth pairs (640x480 at 15 FPS) indefinitely. Of course, we have to rely on this monster router, and on keeping data in memory or writing to a single large file per camera pair. An intro to wireless networking (just stop and think how crazy it is that you can transmit so much data so reliably) another time, maybe. OO: the router we have has a 1 GHz dual-core processor and beamforming. We don't take advantage of either.
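To make the calibration item concrete: registering a depth pixel into the color image needs both cameras' intrinsics and the extrinsic transform between them. Here's a minimal sketch, assuming made-up intrinsics and a made-up ~2.5 cm depth-to-color baseline (Mobot's actual calibration values aren't in this post):

```python
import numpy as np

# Placeholder intrinsics (focal lengths and principal points) for 640x480 images.
K_depth = np.array([[575.0, 0.0, 319.5],
                    [0.0, 575.0, 239.5],
                    [0.0, 0.0, 1.0]])
K_color = np.array([[525.0, 0.0, 319.5],
                    [0.0, 525.0, 239.5],
                    [0.0, 0.0, 1.0]])

# Placeholder extrinsics: rotation and translation from the depth camera to the color camera.
R = np.eye(3)
t = np.array([0.025, 0.0, 0.0])  # made-up ~2.5 cm baseline

def depth_pixel_to_color_pixel(u, v, z):
    """Backproject pixel (u, v) with depth z (meters) from the depth camera,
    move it into the color camera's frame, and reproject into color pixels."""
    point_depth = z * (np.linalg.inv(K_depth) @ np.array([u, v, 1.0]))
    point_color = R @ point_depth + t
    uvw = K_color @ point_color
    return uvw[0] / uvw[2], uvw[1] / uvw[2]

print(depth_pixel_to_color_pixel(320, 240, 1.5))
```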
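The post doesn't spell out the wire format behind "dump raw TCP packets," so here's a hedged sketch of one simple way to frame a color/depth pair on a raw TCP stream: a fixed-size length header, the JPEG bytes for color, then the raw 16-bit depth buffer. The header layout and the 640x480 shape are assumptions.

```python
import struct

import cv2
import numpy as np

# Fixed-size header: JPEG byte count, then depth byte count (network byte order).
HEADER = struct.Struct("!II")

def send_pair(sock, color_bgr, depth_u16):
    ok, jpeg = cv2.imencode(".jpg", color_bgr)
    depth_bytes = depth_u16.astype(np.uint16).tobytes()
    sock.sendall(HEADER.pack(len(jpeg), len(depth_bytes)))
    sock.sendall(jpeg.tobytes())
    sock.sendall(depth_bytes)

def recv_exact(sock, n):
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("stream closed")
        buf += chunk
    return buf

def recv_pair(sock):
    jpeg_len, depth_len = HEADER.unpack(recv_exact(sock, HEADER.size))
    color = cv2.imdecode(np.frombuffer(recv_exact(sock, jpeg_len), np.uint8),
                         cv2.IMREAD_COLOR)
    depth = np.frombuffer(recv_exact(sock, depth_len), np.uint16).reshape(480, 640)
    return color, depth
```

For scale: uncompressed, one VGA color + depth pair at 15 FPS is roughly 640 x 480 x (3 + 2) bytes x 15 ≈ 23 MB/s, so four pairs would be around 92 MB/s. The JPEG step (and the monster router) is what makes this workable.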

Processing data

In a nutshell, this part means taking each new frame the camera pair sees and fusing it onto what it has already seen.

Extract feature descriptors: Given an image, we want to find key points, or features, that describe it. Descriptors encode information about each feature (e.g., rotation, scale), and we can use descriptors to match features between two images. Image search and panorama stitching both often use feature descriptors. We use the scale-invariant feature transform (SIFT); a short extraction sketch follows this list. OO: we can change a few things about SIFT to improve performance according to this paper.
Align two consecutive frames: Given two frames taken one after another (i.e., close in physical space), return the relative transform between the camera poses that captured them. As we collect frames, we build a long series of such transforms and combine them to place each frame in a 3D world. This is sort of like stitching a three-dimensional panorama, analogous to how we might stitch normal panoramas (see example here). A sketch of one common approach follows this list.
Identify loop closure candidates: Of course, as we compute more relative transforms, we build up errors over time (a good explanation I borrowed from the internet is here). As we collect frames, we should constantly check whether we've seen the same place earlier (earlier than the immediately preceding frames) and use those relationships as corrections; a brute-force sketch follows this list. OO: move computation to the GPU (e.g., here). OO: pre-train a bag of words model.
Bundle adjustment: Given the relative transformations between all the frames and the loop closure candidates, spread the accumulated trajectory error across all the camera poses (where we think all the frames were taken); a toy sketch follows this list. OO: use the SuiteSparse GPU implementation for fast sparse matrix math. OO: continue investigating pose graphs.
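Feature extraction first. The post names SIFT, so here's a minimal sketch using OpenCV (4.4+, where SIFT lives in the main module); the filename is a placeholder.

```python
import cv2

# Placeholder filename; any grayscale frame from the color stream works.
image = cv2.imread("frame_0001.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(image, None)

# Each keypoint carries location, scale, and orientation; each descriptor is a
# 128-dimensional vector we can match against descriptors from other frames.
print(len(keypoints), descriptors.shape)
```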
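Next, aligning two consecutive frames. The post doesn't say exactly how the relative transform is computed, so this is a sketch of one common RGB-D approach rather than Mobot's actual pipeline: match SIFT descriptors between the two color frames, lift the matched pixels in the first frame to 3D using its depth image and intrinsics, then run RANSAC PnP against the matched pixels in the second frame. The intrinsics, depth scale, and thresholds below are placeholders.

```python
import cv2
import numpy as np

# Placeholder intrinsics for the 640x480 color camera and an assumed depth scale
# (depth stored as millimeters in the 16-bit image).
fx, fy, cx, cy = 525.0, 525.0, 319.5, 239.5
depth_scale = 0.001
K = np.array([[fx, 0.0, cx], [0.0, fy, cy], [0.0, 0.0, 1.0]])

def relative_transform(gray_a, depth_a, gray_b):
    """Estimate the rotation and translation taking frame A's camera frame to frame B's."""
    sift = cv2.SIFT_create()
    kp_a, des_a = sift.detectAndCompute(gray_a, None)
    kp_b, des_b = sift.detectAndCompute(gray_b, None)

    # Lowe's ratio test keeps only confident matches.
    matcher = cv2.BFMatcher()
    matches = [m for m, n in matcher.knnMatch(des_a, des_b, k=2)
               if m.distance < 0.7 * n.distance]

    points_3d, points_2d = [], []
    for m in matches:
        u, v = kp_a[m.queryIdx].pt
        z = depth_a[int(v), int(u)] * depth_scale
        if z == 0:  # no depth reading at this pixel
            continue
        points_3d.append([(u - cx) * z / fx, (v - cy) * z / fy, z])
        points_2d.append(kp_b[m.trainIdx].pt)

    # RANSAC PnP: find the pose that projects frame A's 3D points onto
    # frame B's matched pixels.
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        np.array(points_3d, dtype=np.float64),
        np.array(points_2d, dtype=np.float64), K, None)
    R, _ = cv2.Rodrigues(rvec)
    return R, tvec
```

Chaining the returned (R, t) from frame to frame is what places each new frame into the shared 3D world mentioned above.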
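For loop closure candidates, here's a brute-force stand-in for the bag-of-words model from the OO note: count strong descriptor matches between the current frame and every stored keyframe, and keep the keyframes that clear a threshold. The keyframe bookkeeping and the threshold are assumptions, and the linear scan is exactly the kind of cost a pre-trained bag-of-words model (or the GPU) would avoid.

```python
import cv2

matcher = cv2.BFMatcher()

def loop_closure_candidates(current_descriptors, keyframes, min_matches=50):
    """keyframes: list of (frame_id, descriptors) kept from earlier in the run.
    Returns (frame_id, match_count) pairs sorted by match count."""
    candidates = []
    for frame_id, descriptors in keyframes:
        pairs = matcher.knnMatch(current_descriptors, descriptors, k=2)
        good = [m for m, n in pairs if m.distance < 0.7 * n.distance]
        if len(good) >= min_matches:
            candidates.append((frame_id, len(good)))
    return sorted(candidates, key=lambda c: -c[1])
```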
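Finally, a toy version of the error-spreading idea behind bundle adjustment: treat each camera pose as a variable, each relative transform (including the loop closure) as a constraint, and let a least-squares solver distribute the drift. This is deliberately stripped down to 2D translations with SciPy; the real problem also involves rotations, 3D structure, and the sparse solvers (e.g., SuiteSparse) from the OO note.

```python
import numpy as np
from scipy.optimize import least_squares

def residuals(flat_poses, edges):
    """Each edge (i, j, measured) says pose j should sit at pose i + measured."""
    poses = flat_poses.reshape(-1, 2)
    res = [(poses[j] - poses[i]) - measured for i, j, measured in edges]
    res.append(poses[0])  # pin the first pose at the origin
    return np.concatenate(res)

# Sequential odometry edges plus one loop-closure edge saying frame 3 is
# roughly back where frame 0 was; the numbers are made up to show drift.
edges = [
    (0, 1, np.array([1.0, 0.0])),
    (1, 2, np.array([1.0, 0.1])),    # a little drift sneaks in here
    (2, 3, np.array([0.0, 1.0])),
    (3, 0, np.array([-2.0, -1.0])),  # loop closure
]

initial = np.zeros(8)  # four 2D poses, flattened
optimized = least_squares(residuals, initial, args=(edges,)).x.reshape(-1, 2)
print(optimized)  # the drift gets spread across the whole trajectory
```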

Planning

More on this next week?

Multiple cameras

Also more on this later