New AI Benchmark Tests Self-Driving Cars in Real-Time Scenarios

A new framework evaluates vision-language models in dynamic driving environments, revealing gaps in current AI's ability to handle unexpected situations and recover from errors.

AI Research
April 04, 2026
4 min read

Autonomous driving systems increasingly rely on vision-language models (VLMs) that combine visual perception with natural language reasoning to make decisions. Existing benchmarks for evaluating these systems have a critical flaw, however: they test models in static, open-loop scenarios that don't account for how errors accumulate over time or how vehicles respond to unexpected situations. As a result, current evaluations may overestimate how well these systems would perform in real-world driving, where mistakes have consequences and recovery is essential. A new framework called **Bench2Drive-VL**, described in a paper on arXiv, addresses this gap by introducing closed-loop evaluation that tests VLMs in interactive simulations where their actions directly affect future states.

## How Bench2Drive-VL Works

The researchers developed Bench2Drive-VL as an extension of the existing Bench2Drive framework, which was published at **NeurIPS 2024** in the Datasets and Benchmarks Track. The new system is specifically designed to evaluate vision-language models in autonomous driving applications. Their key innovation is **DriveCommenter**, an expert system that automatically generates question-answer pairs about driving situations in real time within the CARLA simulator. Unlike previous benchmarks that only evaluated models on pre-collected data, this system can assess performance even when vehicles deviate from expected paths or encounter rare scenarios. The framework supports multiple input formats including RGB images, bird's-eye-view maps, and text descriptions, and allows for configurable reasoning chains where answers to some questions inform responses to others.
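To make the idea concrete, here is a minimal sketch of how an expert system like DriveCommenter could turn privileged simulator state into ground-truth question-answer pairs each frame. All names (`SimState`, `generate_qa`, the fields and thresholds) are illustrative assumptions, not the framework's actual API.

```python
# Hypothetical sketch of expert QA generation from privileged simulator
# state, in the spirit of DriveCommenter. Names and thresholds are
# illustrative, not the paper's actual implementation.
from dataclasses import dataclass


@dataclass
class SimState:
    """Privileged ground truth the simulator exposes to the expert."""
    ego_speed: float      # m/s
    lead_distance: float  # metres to the nearest vehicle ahead
    traffic_light: str    # "red" | "yellow" | "green" | "none"


def generate_qa(state: SimState) -> list[tuple[str, str]]:
    """Emit ground-truth question-answer pairs for the current frame."""
    qa = []
    qa.append(("What is the state of the traffic light ahead?",
               state.traffic_light))
    # A simple planning-style question derived from privileged state:
    should_brake = state.traffic_light == "red" or state.lead_distance < 10.0
    qa.append(("Should the ego vehicle brake now?",
               "yes" if should_brake else "no"))
    return qa


pairs = generate_qa(SimState(ego_speed=8.0, lead_distance=6.5,
                             traffic_light="green"))
```

Because the answers are derived from the simulator's internal state rather than pre-collected labels, the same procedure keeps producing valid ground truth even when the vehicle drifts into states no dataset anticipated.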

The system works by having DriveCommenter use privileged information from the simulator to generate ground-truth answers to **50 different types of questions** covering perception, prediction, planning, and behavior. These questions range from identifying important objects in a scene to determining whether to brake or change lanes. Simultaneously, the VLM being evaluated processes sensor data to answer the same questions and control the vehicle. The framework includes a graph-based reasoning system that allows researchers to configure how questions depend on each other, supporting chain-of-thought approaches where intermediate reasoning steps inform final decisions. The action module then converts the VLM's natural language responses into control signals for the simulated vehicle.
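The configurable reasoning chain described above can be sketched as a dependency graph walked in topological order, with each question receiving the answers it depends on. This is a hedged illustration of the general pattern; the function and question names are assumptions, not the framework's real interface.

```python
# Illustrative sketch of a configurable question-dependency graph:
# questions are answered in dependency order, and each question may
# consume earlier answers (chain-of-thought style). Names are
# assumptions, not Bench2Drive-VL's actual API.
from graphlib import TopologicalSorter


def run_chain(questions, deps, answer_fn):
    """Answer questions in dependency order, feeding prior answers in."""
    order = TopologicalSorter(deps).static_order()
    answers = {}
    for q in order:
        context = {d: answers[d] for d in deps.get(q, ())}
        answers[q] = answer_fn(questions[q], context)
    return answers


questions = {
    "perception": "List the important objects in the scene.",
    "prediction": "Will the pedestrian cross?",
    "planning":   "Brake, keep speed, or change lane?",
}
# Each key depends on the answers of its listed predecessors.
deps = {"prediction": {"perception"},
        "planning": {"perception", "prediction"}}

# Stub in place of a real VLM call, just to show the data flow.
answers = run_chain(questions, deps,
                    lambda q, ctx: f"answer({len(ctx)} inputs)")
```

In the real framework, `answer_fn` would query the VLM, and a final planning answer would be handed to the action module for conversion into steering, throttle, and brake signals.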

## VLM Performance Results

Testing three popular VLMs (Qwen2.5VL-3B-Instruct, Gemma3-4b-it, and InternVL3-2B) revealed significant limitations in current models. When evaluated on planning metrics under the Bench2Drive protocol, none of the models came close to the expert DriveCommenter system, which achieved a perfect **100 on both driving score and success rate**. **Qwen2.5VL-3B-Instruct** performed best among the tested models but still showed substantial gaps, particularly in vision-language question answering, where it scored between **46.51 and 81.81** across question categories against the expert's perfect scores.

The researchers found that chain-of-thought reasoning, where models explicitly work through intermediate steps, actually worsened performance in some cases due to context accumulation issues and hallucinations. For example, models sometimes invented non-existent speed limits when using chain-of-thought approaches.

## Implications for Safer Autonomous Vehicles

The implications of this research are substantial for the development of safer autonomous vehicles. By providing a more realistic evaluation framework, Bench2Drive-VL helps identify weaknesses in current AI systems that traditional benchmarks miss. The system's ability to generate annotations for out-of-distribution scenarios—like when a vehicle goes off-road or into the wrong lane—means it can support reinforcement learning where AI systems need feedback even when they make mistakes. This is crucial for developing systems that can recover from errors rather than simply following pre-programmed paths. The framework, with its full source code available on GitHub, also includes visualization tools and a complete development ecosystem that researchers can use to analyze model failures and improve their systems.

## Limitations and Future Directions

Despite its advancements, Bench2Drive-VL has limitations that the researchers acknowledge. The evaluation currently relies on simulation rather than real-world data, though **CARLA** is widely used in autonomous driving research. The framework's question-answer approach, while comprehensive, may not capture all aspects of driving competence, and the LLM-based evaluation of answers has known subjectivity issues that the researchers attempted to mitigate with task-specific scoring rules.
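One way to mitigate LLM-judge subjectivity, as the task-specific scoring rules aim to do, is to score closed-vocabulary questions by exact match and reserve fuzzier metrics for open-ended ones. The sketch below is purely illustrative; the task names and the token-overlap fallback are assumptions, not the paper's actual scoring rules.

```python
# Hedged sketch of task-specific answer scoring: closed-vocabulary
# tasks (e.g. brake / keep / change-lane decisions) use exact match,
# sidestepping LLM-judge subjectivity; open-ended tasks fall back to
# token overlap as a cheap proxy. Purely illustrative.
def score_answer(task: str, predicted: str, truth: str) -> float:
    closed_vocab = {"behavior", "planning"}  # assumed task names
    if task in closed_vocab:
        return 1.0 if predicted.strip().lower() == truth.strip().lower() else 0.0
    # Open-ended tasks: fraction of ground-truth tokens recovered.
    pred_tokens = set(predicted.lower().split())
    true_tokens = set(truth.lower().split())
    return len(pred_tokens & true_tokens) / max(len(true_tokens), 1)
```

Deterministic rules like these trade nuance for reproducibility, which is why the researchers combine them with LLM-based judging rather than relying on either alone.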

Additionally, the tested VLMs showed conservative behaviors that, while safe in simulation, might be impractical in real traffic where efficiency matters. The researchers note that medium-scale VLMs still exhibit hallucinations and struggle with long-context reasoning, indicating that significant improvements are needed before these systems can be reliably deployed in autonomous vehicles.

Original Source: the complete research paper is available on arXiv.

About the Author

Guilherme A.

Former dentist from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.
