A unified Vision-Language Transformer fuses language, vision, tactile sensing, and proprioception for autoregressive embodied control.
Vision, language, tactile, and proprioceptive inputs attend jointly in a single autoregressive sequence.
Keyframe selection preserves critical decision points across long-horizon episodes.
Forecasts future states in visual latent space using a built-in dynamics prior.
Decodes multi-step continuous trajectories for temporally coherent execution.
Runs on-device with asynchronous inference for low-latency, real-world closed-loop control.
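To make the last two points concrete, here is a minimal sketch of how multi-step trajectory decoding and asynchronous inference can fit together in a control loop. It is an illustration under stated assumptions, not this model's actual runtime: the names `run_async_control`, `PendingChunk`, and `toy_policy`, the scalar actions, and the tick-based latency model are all hypothetical. The idea is that the controller keeps executing the current multi-step chunk while the next one is being computed, then switches to the newer chunk and skips the steps that went stale during inference.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

Action = float  # hypothetical stand-in for a continuous action vector


@dataclass
class PendingChunk:
    ready_at: int           # tick at which the inference result becomes available
    start_tick: int         # tick whose observation the chunk was planned from
    actions: List[Action]   # actions planned for start_tick, start_tick + 1, ...


def run_async_control(
    policy: Callable[[int, int], List[Action]],
    horizon_ticks: int,
    chunk_len: int,
    latency_ticks: int,
) -> List[Action]:
    """Simulate closed-loop control with simulated inference latency.

    Assumes latency_ticks < chunk_len, so the current chunk never runs
    out before the next one arrives.
    """
    executed: List[Action] = []
    current = policy(0, chunk_len)  # first chunk is computed before motion starts
    current_start = 0
    pending: Optional[PendingChunk] = None

    for t in range(horizon_ticks):
        # Swap in a finished inference result as soon as it is available.
        if pending is not None and pending.ready_at <= t:
            current, current_start = pending.actions, pending.start_tick
            pending = None
        # Keep exactly one inference request in flight.
        if pending is None:
            pending = PendingChunk(t + latency_ticks, t, policy(t, chunk_len))
        # Index by elapsed ticks so steps that went stale during inference are skipped.
        idx = min(t - current_start, len(current) - 1)
        executed.append(current[idx])
    return executed


def toy_policy(t: int, n: int) -> List[Action]:
    # Hypothetical policy: the action planned for tick k is simply float(k).
    return [float(t + i) for i in range(n)]


trajectory = run_async_control(toy_policy, horizon_ticks=8, chunk_len=5, latency_ticks=2)
```

With the toy policy, every tick executes the action that was planned for that exact tick, which is the property that keeps execution temporally coherent across chunk boundaries. A real deployment would run inference in a separate thread or process and measure latency rather than assume it.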
One VLA model, adapted and deployed across real industrial and commercial scenarios.
Picking and sorting industrial autonomous-driving domain controllers on the production line.
Grasping and arranging delicate flexible printed circuit boards.