A brief survey of deep learning based 3D object detection algorithms for autonomous driving scenarios

1. Introduction:

With the rise of artificial intelligence systems, the race to convert conventional systems into automated ones is on the rise too. One such domain is automated transportation powered by autonomous vehicles. More than forty companies have joined the race to manufacture and deploy autonomous vehicles on the road [1]. This surge was driven by the availability of relatively cheap LIDAR sensors, the rise of deep learning, and the availability of large public datasets. The task is complex because 2D data is not sufficient to properly detect and localize objects on the ground plane. 3D data acquired by sensors such as LIDAR (Light Detection and Ranging) and RGB-D sensors such as Microsoft Kinect and Intel RealSense provides ample information about surrounding objects, such as their scale, distance, and geometry. Both technologies have recently been used to build models for object detection and tracking in autonomous vehicles. Point clouds obtained from LIDAR are irregular, continuous representations of geometrical information in 3D space. Depth images obtained from an RGB-D sensor contain three color channels and depth information for each pixel, which makes them spatial data.
Deep learning has emerged as a leading approach in computer vision, natural language processing, machine translation, and other fields. Most of its success is in 2D computer vision, with state-of-the-art models for object classification, localization, image translation, and related tasks. Despite this success in 2D, deep learning still has a long way to go in 3D computer vision. The key challenge is to properly learn the geometrical structure of objects; moreover, data formats such as point clouds are sparse and unstructured, which makes it hard to learn from them directly. In recent years, researchers have worked extensively on models that use multiple data streams, both 2D and 3D, to perform object detection in autonomous vehicles. Several datasets have been made publicly available, such as the KITTI Vision Benchmark Suite [2], nuScenes [3], Ford AV [4], and SemanticKITTI [5], boosting research on object detection for autonomous vehicles. A few survey papers have covered the domain of 3D computer vision [6] [7] [8], but this will be the first survey that focuses specifically on how deep learning models are evolving in the field of 3D object detection for autonomous vehicles. The questions that this study focuses on are as follows.

Why does “representation matter” more in 3D computer vision than the deep model itself [9]?
Which deep learning architecture performs better on the task of 3D object detection in autonomous vehicles?

2. Point Cloud:

A point cloud, generally obtained from a 3D scanner, is a set of data points, each representing a location in 3D space. Point clouds contain a raw representation of the geometry of objects in continuous space. Their unstructured and sparse nature makes it hard to apply deep learning to them directly [10]. Researchers have employed different schemes to convert raw points into a representation more suitable for deep learning models, specifically one that can be fed to a convolutional neural network for feature extraction [11] [12] [13] [14]. While most of the literature focuses on converting raw point clouds into a more structured form, Qi et al. [10], in their pioneering work, presented a novel deep learning architecture based on fully connected layers that takes a raw point cloud as input. Inspired by this, many researchers have explored similar possibilities. In a recent effort, Shi et al. [15] proposed a complex deep learning architecture that uses both structured and point-based methods to extract spatial and geometrical features for localizing and detecting objects, and achieved benchmark results on the KITTI dataset [2]. We further divide these studies according to the representation they choose for point cloud data.
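
To make the "unstructured" property concrete, the sketch below treats a point cloud as an unordered list of (x, y, z) tuples. It applies the same small fully connected layer to every point and aggregates the results with an element-wise maximum, so the output does not depend on the order of the points. This is only a toy illustration of consuming raw points, not the architecture of [10]; the weights, the function names, and the synthetic cloud are all hypothetical.

```python
import random

# Hypothetical fixed weights for a tiny shared per-point layer (3 inputs -> 4 features).
WEIGHTS = [
    [0.5, -0.2, 0.1],
    [-0.3, 0.8, 0.4],
    [0.2, 0.1, -0.6],
    [0.7, -0.5, 0.3],
]
BIAS = [0.0, 0.1, -0.1, 0.05]


def per_point_features(point):
    """Apply the same fully connected layer with ReLU to one (x, y, z) point."""
    return [
        max(0.0, sum(w * c for w, c in zip(row, point)) + b)
        for row, b in zip(WEIGHTS, BIAS)
    ]


def global_feature(points):
    """Max-pool per-point features over the cloud; max ignores point order."""
    feats = [per_point_features(p) for p in points]
    return [max(column) for column in zip(*feats)]


# A small synthetic cloud (assumed data, for illustration only).
rng = random.Random(0)
cloud = [
    (rng.uniform(-1, 1), rng.uniform(-1, 1), rng.uniform(-1, 1))
    for _ in range(100)
]
shuffled = cloud[:]
rng.shuffle(shuffled)

# The same set of points in a different order yields the same global feature.
print(global_feature(cloud) == global_feature(shuffled))
```

A grid-based network, by contrast, needs the points placed into fixed cells before a convolution can be applied, which is what motivates the representations discussed next.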

2.1 Grid-Based:

Grid-based methods convert a point cloud either to a volumetric representation or to a 2D grid representation. Converting a point cloud into a volumetric representation made of voxel cells produces a large number of empty cells. Preprocessing a point cloud into a grid representation makes it suitable as input to any 2D image-based deep network, but it loses important geometrical information about the objects.
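
Both effects can be seen in a small sketch. It bins a synthetic cloud into voxels and reports how many cells stay empty, then collapses the height axis to obtain a 2D grid. The scene, the 40 m extent, the 0.5 m voxel size, and the function names are assumptions chosen for illustration, and projecting along the height axis is only one possible way to build a 2D grid.

```python
import random


def voxelize(points, voxel_size, extent):
    """Map points in [0, extent)^3 to the set of occupied voxel indices."""
    occupied = set()
    for x, y, z in points:
        if 0 <= x < extent and 0 <= y < extent and 0 <= z < extent:
            occupied.add(
                (int(x // voxel_size), int(y // voxel_size), int(z // voxel_size))
            )
    return occupied


def birds_eye_view(occupied):
    """Collapse the height axis: keep only the (x, y) cells that hold a voxel."""
    return {(i, j) for i, j, _ in occupied}


# Hypothetical scene: 2,000 points within 2 m of the ground inside a 40 m cube.
rng = random.Random(0)
extent, voxel_size = 40.0, 0.5
points = [
    (rng.uniform(0, extent), rng.uniform(0, extent), rng.uniform(0, 2.0))
    for _ in range(2000)
]

occupied = voxelize(points, voxel_size, extent)
cells_per_axis = int(extent / voxel_size)
total_voxels = cells_per_axis ** 3
empty_fraction = 1 - len(occupied) / total_voxels
bev = birds_eye_view(occupied)

print(f"occupied voxels: {len(occupied)} of {total_voxels}")
print(f"empty fraction: {empty_fraction:.4f}")
print(f"2D grid cells after dropping height: {len(bev)}")
```

With at most 2,000 occupied cells out of 512,000, more than 99% of the voxel grid is empty in this setup. The 2D grid is far more compact, but voxels stacked at different heights in the same column become indistinguishable, which is the loss of geometrical information noted above.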