The future of autonomous driving and flying lies in giving machines the power of sight. Visual SLAM (vSLAM) is a critical technology toward this goal: it uses visual sensors to perceive the environment, offering a low-cost, lightweight alternative to heavier sensor suites. This tutorial motivates its focus on visual information by highlighting how vSLAM provides rich, pixel-level environmental data, mirroring the way humans navigate the world. We tackle the core challenge of real-time pose estimation and mapping: enabling a vehicle or drone to localize itself and construct a 3D map of its surroundings simultaneously using only visual inputs. This capability is essential for applications where GPS signals are unreliable and pre-existing maps are unavailable. The tutorial traces the evolution from classical geometric vSLAM methods to modern, learning-based techniques that address persistent challenges such as dynamic environments, poor lighting conditions, and the scale ambiguity inherent in monocular camera setups.
The tutorial is divided into three major sections, followed by a final section on open challenges and research directions. In the first section (Introduction to Visual SLAM), we introduce and motivate the concept of SLAM, its categorization, and the various sensor configurations used in autonomous driving and flying. We also discuss in detail the building blocks of a typical visual SLAM system, e.g., pose estimation, 3D mapping, and loop closure. In the second section (Classical SLAM Systems: Feature-based/Sparse to Direct/Dense SLAM), we cover the two major families of classical approaches, feature-based/sparse and direct/dense SLAM, and examine the workings of key systems in each category, namely the ORB-SLAM series and LSD-SLAM. In the third section (Learning-based SLAM Systems), we discuss the building blocks of learning-based visual SLAM systems and key methods such as CNN-SLAM, pseudo-RGBD SLAM, and DROID-SLAM, including methods that represent the 3D world using implicit representations such as NeRFs, Gaussian Splatting, and Signed Distance Fields (SDFs). In the last section, we conclude with open challenges and research directions in the visual SLAM domain.
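To make the pose-estimation building block concrete, the sketch below shows a minimal feature-based front end of the kind used in ORB-SLAM-style systems, written against OpenCV's Python API. This is an illustrative sketch, not code from any system covered in the tutorial; the intrinsics matrix `K` and the `estimate_relative_pose` helper are assumptions introduced here for demonstration.

```python
import numpy as np
import cv2

# Placeholder camera intrinsics for illustration only;
# a real system would use calibrated values.
K = np.array([[700.0, 0.0, 640.0],
              [0.0, 700.0, 360.0],
              [0.0, 0.0, 1.0]])

def estimate_relative_pose(img1, img2, K):
    """Estimate the relative camera pose between two grayscale frames
    using ORB features and epipolar geometry (the sparse, feature-based
    approach discussed in the second section)."""
    orb = cv2.ORB_create(nfeatures=2000)
    kp1, des1 = orb.detectAndCompute(img1, None)
    kp2, des2 = orb.detectAndCompute(img2, None)

    # Brute-force Hamming matching, since ORB descriptors are binary.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)

    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

    # RANSAC rejects outlier correspondences while fitting the essential matrix.
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                                   prob=0.999, threshold=1.0)

    # Decompose E into rotation R and translation t. For a monocular camera,
    # t is recovered only up to scale -- the scale ambiguity noted above.
    _, R, t, mask = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
    return R, t
```

Chaining such two-view estimates over consecutive frames yields visual odometry; a full vSLAM system then adds the remaining building blocks, 3D mapping and loop closure, to correct the drift that pure frame-to-frame tracking accumulates.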