3D Scene Understanding with VLMs

Agentic system for chatbot interaction with 3D scenes via 3D-grounded open-vocabulary detection with VLMs.

Enabled general chatbot interaction with 3D scenes by designing a benchmark and metrics, and by developing a pipeline for 3D open-vocabulary detection with VLMs, improving F1-score from 19% (top prior method) to 49%. Enhanced the 3D awareness of general-purpose LLMs by combining in-context learning, tool calling, and vision LLM 3D conditioning.

At: SE3 Labs, Munich