To address the difficulty of identifying occluded objects and the insufficient robustness to noise and viewpoint changes in existing instance segmentation methods, this paper proposes multi-source spatio-temporal information based fine-grained bird's-eye view generation (MSTFB). Starting from a rasterized bird's-eye view of the scene, the method uses a self-attention mechanism to fuse temporal bird's-eye view features into a fine-grained bird's-eye view of the scene, and employs a spatio-temporal cross-domain convolutional network to capture the relative position information between instances and to fuse multi-scale features. On this basis, a bird's-eye view instance segmentation prediction method of encoding and sample fusion (ESF-BISP) is proposed. ConvGRU encodes the temporal semantics of the historical frames to obtain temporal features; CVAE models the state feature distribution of the fine-grained bird's-eye view of the current frame and draws sample features from it; GMM fuses the temporal features with the sample features, and the fused features are decoded into the fine-grained bird's-eye view of the future frames. Experimental results on the public nuScenes dataset show that, compared with the baseline algorithm LSS, MSTFB improves the vehicle segmentation IoU by 7.09% and can effectively segment distant and occluded vehicles. ESF-BISP better captures the changes of dynamic instances in the scene, and it significantly outperforms the baseline algorithm in both instance segmentation and future instance segmentation prediction.
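The IoU figure quoted above is the standard intersection-over-union between a predicted and a ground-truth mask. As a minimal sketch, assuming binary vehicle masks on the bird's-eye view grid stored as nested lists of 0/1 (the function name and the toy masks are illustrative and are not taken from the paper):

```python
def binary_iou(pred, target):
    """Intersection over union of two equally sized binary masks."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else intersection / union


# Toy 2x3 grids: 1 marks a cell labelled as vehicle.
pred = [[0, 1, 1],
        [0, 1, 0]]
target = [[0, 1, 0],
          [0, 1, 1]]
iou = binary_iou(pred, target)
```

Here two cells are labelled vehicle in both masks and four cells are labelled vehicle in at least one, so `iou` is 0.5.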
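(1) A fine-grained scene bird's-eye view generation method that fuses multi-source spatio-temporal information is proposed (Multi-source spatio-temporal information based fine-grained bird's-eye view generation, MSTFB). Based on a rasterized bird's-eye view representation of the scene, the method uses a self-attention mechanism to fuse the temporal features of the scene and designs a spatio-temporal cross-domain convolutional network to fuse the global key features that contain spatial information, finally generating a fine-grained bird's-eye view representation of the scene. The method achieves accurate scene perception without expensive radar information, providing a new option for autonomous driving.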
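The temporal fusion step in (1) can be illustrated with plain scaled dot-product attention over the feature vectors that one bird's-eye view cell takes across frames. This is only a sketch under simplifying assumptions: the latest frame acts as the query, the learned query/key/value projections are replaced by the identity, and the names `fuse_temporal` and `softmax` are hypothetical. It is not the network used in MSTFB.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_temporal(frames):
    """Fuse T feature vectors (oldest first) of one BEV cell.

    Returns the attention-weighted feature and the per-frame weights.
    """
    query = frames[-1]              # assumption: current frame is the query
    scale = math.sqrt(len(query))   # standard 1/sqrt(d) scaling
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in frames]
    weights = softmax(scores)
    fused = [sum(w * frame[d] for w, frame in zip(weights, frames))
             for d in range(len(query))]
    return fused, weights


# Three frames of a 2-dimensional feature; the first and last agree.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
fused, weights = fuse_temporal(frames)
```

Frames whose features resemble the current frame receive larger weights, so the fused feature leans toward temporally consistent evidence.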
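(2) A bird's-eye view instance segmentation prediction method that fuses temporal encoding and sample features is proposed (Bird's-eye view instance segmentation prediction method of encoding and sample fusion, ESF-BISP). ConvGRU encodes the temporal semantic information of the historical frames to obtain the temporal features of the future frames, CVAE fits the state feature distribution of the future frames, and GMM fuses the future-frame temporal features with the sample features to predict dynamic instance segmentation in the bird's-eye view. The method can effectively segment occluded vehicle instances and instances far from the ego vehicle.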
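Drawing sample features from the distribution fitted by a CVAE is commonly done with the reparameterization trick. The sketch below assumes a diagonal Gaussian latent parameterized by a mean and a log-variance per dimension; the function name `sample_latent` and the numeric values are illustrative, and the paper's actual latent parameterization and GMM fusion step are not reproduced here.

```python
import math
import random


def sample_latent(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), one value per dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


rng = random.Random(0)       # fixed seed so the draws are repeatable
mu = [0.5, -1.0]             # assumed latent mean
log_var = [-2.0, -2.0]       # assumed log-variance (sigma = exp(-1))
samples = [sample_latent(mu, log_var, rng) for _ in range(5)]
```

Each call returns one candidate latent vector; decoding several of them would give several plausible futures for the same history.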