Lift, Splat, Shoot: Encoding Images from Arbitrary Camera Rigs by Implicitly Unprojecting to 3D

ECCV 2020 Submission

1. Efficient Frustum Pooling (PyTorch)

During the forward pass of the lift-splat architecture, our network produces a featurized point cloud and performs sum pooling over each bird's-eye-view pillar. Each pillar is assigned a variable number of points, determined by the configuration of the cameras. We use the "cumulative sum" trick to perform this pooling efficiently. The Jacobian-vector product of the pooled output is a simple lookup that bypasses multiple autograd steps, speeding up training by 1.5x.

      
    import torch

    class QuickCumsum(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x, geom_feats, ranks):
            """Perform sum pooling where each bin has a variable
            number of points to be pooled.

            x:          N x D tensor of features to be pooled
            geom_feats: N x 4 tensor of bin coordinates per point
            ranks:      N tensor of bin ids; points must be sorted so
                        that points with equal ranks are contiguous
            """
            x = x.cumsum(0)
            # flag the last point that falls in each bin
            kept = torch.ones(x.shape[0], device=x.device, dtype=torch.bool)
            kept[:-1] = (ranks[1:] != ranks[:-1])

            x, geom_feats = x[kept], geom_feats[kept]
            # subtract adjacent cumulative sums to recover per-bin sums
            x = torch.cat((x[:1], x[1:] - x[:-1]))

            # save kept for backward
            ctx.save_for_backward(kept)

            # no gradient for geom_feats
            ctx.mark_non_differentiable(geom_feats)

            return x, geom_feats

        @staticmethod
        def backward(ctx, gradx, gradgeom):
            kept, = ctx.saved_tensors
            # map every input point to the index of its bin's output
            back = torch.cumsum(kept, 0)
            back[kept] -= 1

            # sum pooling copies each bin's gradient to every point
            # that was pooled into that bin
            val = gradx[back]

            return val, None, None
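
For context, a minimal usage sketch, not taken from the paper's released code: it assumes the lifted points are sorted by pillar id so that equal ranks are contiguous. The grid size, tensor names, and the rank formula below are illustrative assumptions.

    import torch

    N, D = 10000, 64                  # number of lifted points, feature dim
    nx, ny = 200, 200                 # assumed BEV grid resolution

    x = torch.randn(N, D, requires_grad=True)    # point features
    xbin = torch.randint(0, nx, (N,))            # BEV x index of each point
    ybin = torch.randint(0, ny, (N,))            # BEV y index of each point
    zbin = torch.zeros(N, dtype=torch.long)      # single z bin
    bbin = torch.zeros(N, dtype=torch.long)      # batch index (batch size 1)
    geom_feats = torch.stack([xbin, ybin, zbin, bbin], 1)

    # one scalar id per pillar; points with equal ids must be adjacent
    ranks = xbin * ny + ybin
    order = ranks.argsort()

    pooled, pooled_geom = QuickCumsum.apply(
        x[order], geom_feats[order], ranks[order])
    # pooled holds one summed feature vector per occupied pillar;
    # pooled_geom gives the BEV cell into which each vector is scattered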

2. Full Visualization (nuScenes Validation)

We visualize the predictions of lift-splat models trained for vehicle segmentation (blue), road segmentation (orange), lane boundary segmentation (green), and ego-motion planning (red). For planning, the top 10 trajectories are visualized. The bird's-eye-view predictions output by the networks are shown on the right. We then project the bird's-eye-view representation back onto the input images, shown on the left, using a per-frame estimate of the ground surface.
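
As a rough illustration of this back-projection, the sketch below places BEV cell centers on the estimated ground plane and projects them into one camera. The 4x4 ego-to-camera extrinsic and 3x3 pinhole intrinsic formats are assumptions about the calibration, not code from the paper.

    import torch

    def project_bev_to_image(bev_xy, ground_z, cam_from_ego, intrins):
        """Project BEV cell centers onto one camera image.

        bev_xy:       M x 2 cell centers in the ego frame (meters)
        ground_z:     scalar estimate of ground height in the ego frame
        cam_from_ego: 4 x 4 extrinsic matrix mapping ego -> camera
        intrins:      3 x 3 pinhole intrinsic matrix

        Returns M x 2 pixel coordinates and a mask of points that lie
        in front of the camera.
        """
        M = bev_xy.shape[0]
        pts = torch.cat([bev_xy,
                         torch.full((M, 1), ground_z),
                         torch.ones(M, 1)], dim=1)       # M x 4 homogeneous
        cam_pts = (cam_from_ego @ pts.T).T[:, :3]        # M x 3, camera frame
        in_front = cam_pts[:, 2] > 0.1                   # positive depth only
        pix = (intrins @ cam_pts.T).T
        pix = pix[:, :2] / pix[:, 2:3].clamp(min=1e-6)   # perspective divide
        return pix, in_front

The BEV predictions can then be drawn at the returned pixel locations in each of the input images.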

3. Camera Dropout Robustness (nuScenes Validation)

We perform a simple simulation of different camera rigs by dropping each camera in turn from the nuScenes rig on the validation set. We visualize the predictions of each network as in Section 2.
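
One way such a dropout simulation could look, assuming a batch x camera tensor layout with illustrative argument names: because frustum pooling accepts a variable number of points, the same trained weights run unchanged on the reduced rig.

    import torch

    def drop_camera(imgs, rots, trans, intrins, k):
        """Remove camera k from an N-camera rig (hypothetical layout:
        batch x N cameras x ...). No retraining is required because
        the pooling step accepts any number of cameras."""
        keep = torch.tensor([i for i in range(imgs.shape[1]) if i != k])
        return imgs[:, keep], rots[:, keep], trans[:, keep], intrins[:, keep]

    # evaluate once per left-out camera, e.g. for the 6-camera nuScenes rig:
    # for k in range(6):
    #     preds = model(*drop_camera(imgs, rots, trans, intrins, k))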

4. Zero-Shot Transfer to Lyft (Lyft Validation)

To test transfer to an entirely different camera rig, we evaluate the model trained on nuScenes on the Lyft Level 5 dataset. The models must generalize across different extrinsics and intrinsics to perform well on this task. We visualize the predictions of each network as in Section 2.
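
A minimal evaluation-loop sketch, with hypothetical loader and argument names: the key point is that the forward pass is unchanged, since the network consumes each camera's calibration as input, so swapping rigs only changes the calibration tensors that are fed in.

    import torch

    def evaluate_zero_shot(model, lyft_val_loader):
        """Run a nuScenes-trained lift-splat model on Lyft validation
        data. Loader and tensor names are illustrative assumptions."""
        model.eval()
        with torch.no_grad():
            for imgs, rots, trans, intrins, bev_gt in lyft_val_loader:
                preds = model(imgs, rots, trans, intrins)  # same forward pass
                # compare preds against bev_gt (e.g. IoU) here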