Teaching computers how to drive by watching some stuff but not all stuff

DeepDriving: Learning Affordance for Direct Perception in Autonomous Driving
Chenyi Chen, Ari Seff, Alain Kornhauser, Jianxiong Xiao

This post is part of an ongoing series in which I read the papers my group at Princeton has produced.

What's new?

The authors use a few indicators (called affordances) that tell a car how to turn and how to brake or accelerate for given road and traffic conditions.

Earlier approaches fall into one of two camps:

  1. Mediated perception: first, we identify relevant objects in a scene, then we reconstruct the world from those objects, and finally we take the measurements we need in order to decide how to drive. The term ‘mediated perception’ comes from psychology and says we perceive things via mediators (e.g., images, symbols, memories).
  2. Behavior reflex: directly map an image to a driving decision.

The authors' approach, direct perception, chooses a few affordances that directly relate to driving decisions (e.g., distance to cars and lanes, angle between the current direction and the tangent of the road).
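
To make the idea concrete, here is a minimal sketch of what an affordance vector might look like. The field names and values are my own illustrative assumptions, not the paper's exact set of indicators.

```python
from dataclasses import dataclass

@dataclass
class Affordances:
    """Hypothetical affordance vector regressed from a single frame.
    Field names are illustrative, not the paper's exact indicators."""
    angle: float            # angle between car heading and road tangent (rad)
    dist_left_lane: float   # distance to the left lane marking (m)
    dist_right_lane: float  # distance to the right lane marking (m)
    dist_car_ahead: float   # distance to the preceding car in our lane (m)

# Example: slightly right of lane center, drifting left, car 30 m ahead
state = Affordances(angle=-0.02, dist_left_lane=2.1,
                    dist_right_lane=1.4, dist_car_ahead=30.0)
```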

Why is this important?

Mediated perception may require too much computation and forces us to solve the general scene-understanding problem. Behavior reflex may fail to exploit higher-level features that could dramatically help. Ultimately, no one really uses computer vision in autonomous cars today.

Direct perception is a middle path that may bring computer vision to autonomous vehicles sooner.

Implementation
  1. Record data from The Open Racing Car Simulator (TORCS): images from the driver's view, speed of the car, and affordances.
  2. Map images to affordances via a convolutional neural network and the training set collected in 1.
  3. Map the affordances predicted during a test run to driving actions (see the sketch after this list).
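
Below is a minimal sketch of steps 2 and 3, assuming a PyTorch-style setup. The network architecture, layer sizes, and the steering rule are my own illustrative assumptions; the paper trains a much larger CNN on TORCS screenshots and uses a more involved controller.

```python
import torch
import torch.nn as nn

class AffordanceNet(nn.Module):
    """Toy CNN that regresses a 4-dimensional affordance vector from an image.
    The real model in the paper is far larger; this is only a sketch."""
    def __init__(self, n_affordances=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.head = nn.Linear(32 * 4 * 4, n_affordances)

    def forward(self, x):
        return self.head(self.features(x).flatten(1))

def steering_command(angle, dist_left, dist_right, gain=0.5):
    """Step 3, crudely: steer to cancel the heading error and re-center in the
    lane. The gain and the centering term are made-up illustrative choices."""
    centering = (dist_right - dist_left) / max(abs(dist_left) + abs(dist_right), 1e-6)
    return -gain * (angle + centering)

# Example: run one driver's-view frame through the (untrained) network
frame = torch.rand(1, 3, 210, 280)   # dummy RGB image, roughly TORCS-sized
angle, d_left, d_right, d_ahead = AffordanceNet()(frame).squeeze(0).tolist()
print(steering_command(angle, d_left, d_right))
```
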
Results

This new method reduces the mean absolute error by 2/3 to 5/6 relative to existing methods.

Questions & Comments
  • What if I need information I must gather over more than one frame (e.g., velocity, acceleration)? (See the sketch after this list for one way to stitch frames together.)
  • In an era when everyone is talking about learning representations from data, it is still valuable to hand-craft features. True, we learn the mapping from images to affordances, but we choose which affordances to learn rather than letting the model go directly from image to decision.
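
One obvious workaround for the first question: if each frame yields a distance estimate, quantities like relative velocity and acceleration can be approximated by finite differences over consecutive frames. This is my own sketch, not something the paper does; the frame rate and window size are assumptions.

```python
from collections import deque

class FiniteDifferenceEstimator:
    """Estimate relative velocity and acceleration of the car ahead from
    per-frame distance estimates. Purely illustrative; not from the paper."""
    def __init__(self, fps=10.0, window=3):
        self.dt = 1.0 / fps
        self.history = deque(maxlen=window)  # recent distance estimates (m)

    def update(self, dist_car_ahead):
        self.history.append(dist_car_ahead)
        if len(self.history) < 3:
            return None, None
        d0, d1, d2 = list(self.history)[-3:]
        velocity = (d2 - d1) / self.dt                    # m/s, relative to us
        acceleration = (d2 - 2 * d1 + d0) / self.dt ** 2  # m/s^2, second difference
        return velocity, acceleration

# Example: the car ahead closes in by 0.5 m per frame at 10 fps
est = FiniteDifferenceEstimator(fps=10.0)
for d in [30.0, 29.5, 29.0, 28.5]:
    v, a = est.update(d)
print(v, a)  # ~ -5.0 m/s closing speed, ~0.0 m/s^2
```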