The research team developed an architecture that unifies various data types, including camera images, language instructions, and depth maps. HPT utilises a transformer model, similar to those powering advanced language models, to process visual and proprioceptive inputs.
You are viewing a single comment's thread from: