Monocular Robot Navigation
with Self-Supervised Pretrained Vision Transformers


Abstract

In this work, we consider the problem of learning a perception model for monocular robot navigation using few annotated images. Using a Vision Transformer (ViT) pretrained with a label-free self-supervised method, we successfully train a coarse image segmentation model for the Duckietown environment using 70 training images. Our model performs coarse image segmentation at the 8x8 patch level, and the inference resolution can be adjusted to balance prediction granularity and real-time perception constraints. We study how best to adapt a ViT to our task and environment, and find that some lightweight architectures can yield good single-image segmentation at a usable frame rate, even on CPU. The resulting perception model is used as the backbone for a simple yet robust visual servoing agent, which we deploy on a differential drive mobile robot to perform two tasks: lane following and obstacle avoidance.

Pipeline

Predictor

We train a classifier to predict a label for every 8x8 patch in an image. The classifier is a fully-connected network applied to the ViT patch encodings, and its per-patch predictions form a coarse segmentation mask.
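The per-patch classification step can be sketched roughly as follows. This is a minimal illustration, not the trained model: it uses a single fully-connected layer with random weights, and random arrays stand in for the ViT patch encodings. The input resolution (240x320), embedding size (384), number of classes (3), and the function names are all assumptions made for the example.

```python
import numpy as np

PATCH = 8  # patch size in pixels, as stated in the text


def patch_grid(height, width, patch=PATCH):
    """Return (rows, cols) of the patch grid for an image of the given size."""
    return height // patch, width // patch


def predict_coarse_mask(patch_encodings, weights, bias, grid):
    """Classify every patch with one fully-connected layer.

    patch_encodings: (num_patches, embed_dim) array, one row per patch.
    weights: (embed_dim, num_classes); bias: (num_classes,).
    Returns a (rows, cols) array of predicted class indices.
    """
    logits = patch_encodings @ weights + bias
    labels = logits.argmax(axis=1)
    return labels.reshape(grid)


rng = np.random.default_rng(0)
rows, cols = patch_grid(240, 320)        # hypothetical inference resolution
embed_dim, num_classes = 384, 3          # hypothetical sizes
encodings = rng.normal(size=(rows * cols, embed_dim))  # stand-in for ViT output
weights = rng.normal(size=(embed_dim, num_classes))    # untrained, random
bias = np.zeros(num_classes)

mask = predict_coarse_mask(encodings, weights, bias, (rows, cols))
print(mask.shape)
```

Halving the input resolution in each dimension cuts the number of patches to a quarter, which is the trade-off between prediction granularity and inference speed described in the abstract.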


Controller

The coarse segmentation output is used to compute left (blue) and right (red) masks, which are passed to a potential-field-based controller. The controller treats the masks as a "repulsive" potential and steers away from the half of the image that contains the most obstacle patches.
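The steering idea can be illustrated with a simplified sketch that counts obstacle patches in each half of the mask and turns away from the heavier side. This is not the actual potential-field controller: the function names, the obstacle label, the gain, and the sign convention (positive means turn left) are assumptions chosen for the example.

```python
def count_obstacles(mask, col_start, col_end, obstacle_label):
    """Count obstacle patches in columns [col_start, col_end) of the mask."""
    return sum(row[col_start:col_end].count(obstacle_label) for row in mask)


def steering_command(mask, obstacle_label=1, gain=1.0):
    """Return a steering value in [-gain, gain].

    mask: list of rows, each a list of per-patch class labels.
    Positive output means turn left (obstacles mostly on the right);
    negative output means turn right. A middle column, if any, is ignored.
    """
    rows, cols = len(mask), len(mask[0])
    half = cols // 2
    half_area = rows * half
    if half_area == 0:
        return 0.0  # mask too narrow to split into two halves
    left = count_obstacles(mask, 0, half, obstacle_label)
    right = count_obstacles(mask, cols - half, cols, obstacle_label)
    # Normalize by the number of patches per half so the output is bounded.
    return gain * (right - left) / half_area


# Toy 3x4 mask: 1 marks an obstacle patch, 0 marks free space.
toy_mask = [
    [0, 0, 0, 1],
    [0, 0, 1, 1],
    [1, 0, 0, 1],
]
print(steering_command(toy_mask))
```

In the toy mask the right half holds four obstacle patches and the left half holds one, so the command is positive and the agent would turn left, away from the more crowded side.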


Predictions


Lane Following

Citation

Acknowledgements

Thanks to Gustavo Salazar and Lilibeth Escobar for their help labeling the dataset. Special thanks to Charlie Gauthier for her help setting up the Duckietown experiments.
The website template was borrowed from Michaël Gharbi.