BEVFormer: A Learning Paradigm from Multi-View Images to Bird's-Eye-View Features

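A close reading of the BEVFormer paper: it uses spatiotemporal attention to unify multi-view image features with temporal information, and it reported state-of-the-art camera-only results on nuScenes at publication.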

Paper Information

Title: BEVFormer: Learning Bird’s-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers
Venue: ECCV 2022
Authors: Zhiqi Li, Wenhai Wang, et al.
Core contribution: a camera-only BEV perception paradigm that uses a spatiotemporal Transformer to unify multi-view camera image features in BEV space

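In autonomous-driving perception, the bird's-eye-view (BEV) representation is a natural unified representation that connects perception, prediction, and planning. Generating BEV features from cameras alone, however, faces two major challenges: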

Earlier approaches such as LSS rely on explicit depth estimation, whereas BEVFormer learns the image-to-BEV mapping implicitly through attention.

$$Q \in \mathbb{R}^{H \times W \times C} \quad (\text{BEV queries})$$

At the core of BEVFormer is a Transformer built around a grid of $H \times W$ learnable queries. Each query corresponds to one grid cell in BEV space.
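To make the query grid concrete, here is a minimal plain-Python sketch of an $H \times W \times C$ grid and of how a grid index can be mapped to a metric position around the ego vehicle. The grid size (200 × 200), the cell resolution `s = 0.512` m, and the helper names `make_bev_queries` and `bev_cell_center` are assumptions chosen for illustration, not identifiers from the BEVFormer codebase. Zero initialization stands in for parameters that would be learned in practice.

```python
def make_bev_queries(H, W, C, init=0.0):
    """Build an H x W grid of C-dimensional query vectors.

    In a real model these would be learnable parameters; here they are
    plain nested lists filled with a constant, for illustration only.
    """
    return [[[init] * C for _ in range(W)] for _ in range(H)]


def bev_cell_center(row, col, H=200, W=200, s=0.512):
    """Map a BEV grid index to metric (x, y) in the ego frame.

    The grid is centered on the ego vehicle, so the cell at (H/2, W/2)
    maps to the origin. s is the assumed size of one cell in meters.
    """
    x = (col - W / 2) * s
    y = (row - H / 2) * s
    return x, y


queries = make_bev_queries(H=4, W=4, C=8)
print(len(queries), len(queries[0]), len(queries[0][0]))
print(bev_cell_center(100, 100))
```

With these assumed values the grid spans roughly 102.4 m along each axis, with the ego vehicle at the center.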

$$\text{TSA}(Q_p, \{Q_{t-1}, Q_t\}) = \text{DeformAttn}(Q_p, \text{Align}(Q_{t-1}), \dots)$$
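The $\text{Align}$ step in the TSA formula can be illustrated with a deliberately simplified sketch. It assumes that ego motion between frames is a pure translation by a whole number of grid cells, and it replaces deformable attention with a plain average of each query and its aligned history. Both are stand-ins for illustration, and `align_prev_bev` and `fuse_with_history` are hypothetical names.

```python
def align_prev_bev(prev_bev, shift_rows, shift_cols):
    """Shift the previous BEV grid into the current ego frame.

    Assumption: the ego vehicle moved by (shift_rows, shift_cols) whole
    cells and did not rotate. Cells with no history become None.
    """
    H, W = len(prev_bev), len(prev_bev[0])
    aligned = [[None] * W for _ in range(H)]
    for r in range(H):
        for c in range(W):
            src_r, src_c = r + shift_rows, c + shift_cols
            if 0 <= src_r < H and 0 <= src_c < W:
                aligned[r][c] = prev_bev[src_r][src_c]
    return aligned


def fuse_with_history(query, aligned):
    """Stand-in for DeformAttn: average each query with its history.

    Where no history exists, the query is averaged with itself, so it
    passes through unchanged.
    """
    H, W = len(query), len(query[0])
    fused = [[None] * W for _ in range(H)]
    for r in range(H):
        for c in range(W):
            q = query[r][c]
            h = aligned[r][c] if aligned[r][c] is not None else q
            fused[r][c] = [(a + b) / 2 for a, b in zip(q, h)]
    return fused


prev_bev = [[[1.0], [2.0], [3.0]]]   # 1 x 3 grid, C = 1
query = [[[0.0], [0.0], [4.0]]]
aligned = align_prev_bev(prev_bev, shift_rows=0, shift_cols=1)
print(fuse_with_history(query, aligned))
```

A real implementation would also handle rotation and sub-cell motion, and would let each query attend to learned sampling locations instead of a single aligned cell.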

$$\text{SCA}(Q_p, I_t) = \frac{1}{|\mathcal{V}|} \sum_{i \in \mathcal{V}} \text{DeformAttn}(Q_p, \text{Proj}(p, i), I_t^{(i)})$$

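Here $\mathcal{V}$ is the set of cameras that can "see" the query point $p$, $\text{Proj}(p, i)$ projects the BEV location onto the image plane of the $i$-th camera to serve as the reference point, and $I_t^{(i)}$ denotes the image features of camera $i$ at time $t$.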
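The SCA formula has two parts: finding the cameras in $\mathcal{V}$ and averaging over them. A rough plain-Python sketch of both follows. It assumes a pinhole camera model, a single 3D point per query already expressed in each camera's coordinate frame, and per-view attention outputs computed elsewhere. The function names and the intrinsics tuple `(fx, fy, cx, cy)` are illustrative assumptions.

```python
def project_to_image(point_cam, fx, fy, cx, cy):
    """Pinhole projection of a 3D point in camera coordinates to pixels.

    Returns None when the point lies behind the camera (z <= 0).
    """
    x, y, z = point_cam
    if z <= 0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)


def hit_views(points_cam, intrinsics, image_size):
    """Return {camera index: (u, v)} for cameras that see the point.

    points_cam[i] is the query point in camera i's frame (assumed to be
    computed upstream from the extrinsics).
    """
    width, height = image_size
    hits = {}
    for i, (point, K) in enumerate(zip(points_cam, intrinsics)):
        uv = project_to_image(point, *K)
        if uv is not None and 0 <= uv[0] < width and 0 <= uv[1] < height:
            hits[i] = uv
    return hits


def average_over_hits(per_view_outputs, hits):
    """Average per-view outputs over the hit views only (the 1/|V| sum)."""
    views = [per_view_outputs[i] for i in hits]
    if not views:
        return None
    return [sum(vals) / len(views) for vals in zip(*views)]


K = (100.0, 100.0, 50.0, 50.0)
points = [(0.0, 0.0, 10.0), (0.0, 0.0, -5.0), (1.0, 0.0, 10.0)]
hits = hit_views(points, [K, K, K], image_size=(100, 100))
outputs = {0: [2.0, 4.0], 1: [9.0, 9.0], 2: [4.0, 8.0]}
print(sorted(hits), average_over_hits(outputs, hits))
```

Restricting the average to the hit views means a query is not diluted by cameras that cannot observe its location. Here camera 1 is excluded because the point lies behind it.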

$$\text{nuScenes Detection NDS: } 51.7\% \quad (\text{camera-only, val set})$$

$$\text{nuScenes Segmentation mIoU: } 37.5\%$$

These results outperform all previous camera-based BEV methods. Inference runs at roughly 2.5 FPS on an A100 (with ResNet-50 as the image backbone).

Pros:

Cons: