3a:[["$","audio",null,{"id":"tts"}],["$","$L3f",null,{"paperID":"33544","publisher":"cvpr","paperJSON":{"title":"VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos","paperID":"33544","avgLineHeight":11.96,"imgScale":4,"sections":[{"heading":"Abstract","paragraphs":[[{"id":"id-65","style":{"fontStyle":"italic"},"text":"Fine-grained alignment between videos and text is ","element":"span"},{"style":{"fontStyle":"italic"},"text":"challenging due to complex spatial and temporal dynamics in videos. Existing video-based Large Multimodal Models (LMMs) handle basic conversations but struggle with precise pixel-level grounding in videos. To address this, we introduce VideoGLaMM, an LMM designed for fine-grained pixel-level grounding in videos based on user-provided textual inputs. Our design seamlessly connects three key components: a Large Language Model, a dual vision encoder that emphasizes both spatial and temporal details, and a spatio-temporal decoder for accurate mask generation. This connection is facilitated via tunable V","element":"span"},{"style":{"height":8.4},"width":36,"height":21,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/33544/images/0-0.png","element":"img","alt":"→","inline":true},{"style":{"fontStyle":"italic"},"text":"L and L","element":"span"},{"style":{"height":8.4},"width":36,"height":21,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/33544/images/0-1.png","element":"img","alt":"→","inline":true},{"style":{"fontStyle":"italic"},"text":"V adapters that enable close Vision-Language (VL) alignment. The architecture is trained to synchronize both spatial and temporal elements of video content with textual instructions. To enable fine-grained grounding, we curate a multimodal dataset featuring detailed visually-grounded conversations using a semi-automatic annotation pipeline, resulting in a diverse set of 38k video-QA triplets along with 83k objects and 671k masks. 
We evaluate VideoGLaMM on three challenging tasks: Grounded Conversation Generation, Visual Grounding, and Referring Video Segmentation. Experimental results show that our model consistently outperforms existing approaches across all three tasks.","element":"span"}]]},{"heading":"1. Introduction","paragraphs":[[{"text":"The rise of Large Language Models (LLMs) has significantly advanced progress in language-based tasks [","element":"span"},{"href":"#id-0","referenceIndex":7,"text":"7","element":"a"},{"text":", ","element":"span"},{"href":"#id-1","referenceIndex":10,"text":"10","element":"a"},{"text":", ","element":"span"},{"href":"#id-2","referenceIndex":13,"text":"13","element":"a"},{"text":", ","element":"span"},{"href":"#id-3","referenceIndex":35,"text":"35","element":"a"},{"text":", ","element":"span"},{"href":"#id-4","referenceIndex":47,"text":"47","element":"a"},{"text":"]. Their success in solving complex language-based reasoning tasks has led to their adoption in visual domains, resulting in Large Multimodal Models (LMMs). To align textual and visual modalities, previous works [","element":"span"},{"href":"#id-5","referenceIndex":11,"text":"11","element":"a"},{"text":", ","element":"span"},{"href":"#id-6","referenceIndex":20,"text":"20","element":"a"},{"text":", ","element":"span"},{"href":"#id-7","referenceIndex":21,"text":"21","element":"a"},{"text":",","element":"span"}],[{"id":"id-33","style":{"width":"100%"},"width":946,"height":1023,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/33544/images/0-2.png","element":"img"}],[{"text":"Figure 1. ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"Grounded Conversation with VideoGLaMM. ","element":"figcaption","subtype":"caption"},{"text":"Our proposed multimodal video conversational model provides text responses grounded at the pixel level in the input video. 
","element":"figcaption","subtype":"caption"},{"text":"The generated masks are spatio-temporally consistent across frames. The fine-grained grounded outputs from VideoGLaMM describe different levels of granularity, e.g., person, objects (bike), stuff (road), and explain object and scene attributes. Existing VideoLMMs do not offer pixel-level grounded conversational capability.","element":"figcaption","subtype":"caption"}],[{"href":"#id-8","referenceIndex":28,"text":"28","element":"a"},{"text":", ","element":"span"},{"href":"#id-9","referenceIndex":57,"text":"57","element":"a"},{"text":"] train a projection layer or a cross-attention block that maps visual features into the latent space of an LLM. This straightforward adaptation has enabled advanced spatial understanding, allowing detailed conversations about image content. ","element":"span"},{"text":"Recently, these models have been extended to video, aligning textual instructions with the spatio-temporal","element":"span"}],[{"id":"id-69","text":"inputs, leading to the development of Video-LMMs.","element":"span"}],[{"text":"Existing Video-LMMs [","element":"span"},{"href":"#id-10","referenceIndex":9,"text":"9","element":"a"},{"text":", ","element":"span"},{"href":"#id-11","referenceIndex":22,"text":"22","element":"a"},{"text":", ","element":"span"},{"href":"#id-12","referenceIndex":24,"text":"24","element":"a"},{"text":", ","element":"span"},{"href":"#id-13","referenceIndex":30,"text":"30","element":"a"},{"text":"–","element":"span"},{"href":"#id-14","referenceIndex":32,"text":"32","element":"a"},{"text":", ","element":"span"},{"href":"#id-15","referenceIndex":52,"text":"52","element":"a"},{"text":"], similar to image-based LMMs, tune single or multiple projection layers to align videos with the language modality using the conventional visual instruction tuning paradigm. 
Although this simple alignment aids in understanding the global content of videos, it struggles to capture localized, object-specific context. Consequently, while existing works [","element":"span"},{"href":"#id-11","referenceIndex":22,"text":"22","element":"a"},{"text":", ","element":"span"},{"href":"#id-12","referenceIndex":24,"text":"24","element":"a"},{"text":", ","element":"span"},{"href":"#id-14","referenceIndex":32,"text":"32","element":"a"},{"text":", ","element":"span"},{"href":"#id-15","referenceIndex":52,"text":"52","element":"a"},{"text":"] have demonstrated capabilities in video comprehension and dialogue, they lack the crucial feature of fine-grained visual grounding, which aims to associate the LMM’s response with specific objects within the video input. The ability of an LMM to generate visually grounded responses ensures that the model understands fine-grained spatial and temporal details in a video and can relate them to the generated text.","element":"span"}],[{"text":"To bridge this gap, we introduce VideoGLaMM, a large video multimodal model capable of pixel-level spatio-temporal grounding. The model responds to natural language queries from the user and intertwines spatio-temporal object masks in its generated textual responses to provide a detailed understanding of video content. VideoGLaMM seamlessly connects three key components: a Large Language Model (LLM); dual vision encoders; and a spatio-temporal pixel decoder. The dual vision encoders extract spatial and temporal features separately, which are jointly passed to the LLM to output responses rich in both spatial and temporal cues. Our spatio-temporal pixel decoder outputs the fine-grained object masks corresponding to the specific objects in the LLM output to visually ground its responses. 
","element":"span"},{"text":"These components are integrated via tunable Vision-to-Language (V","element":"span"},{"style":{"height":8},"width":40,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/33544/images/1-0.png","element":"img","alt":"→","inline":true},{"text":"L) and Language-to-Vision (L","element":"span"},{"style":{"height":8},"width":40,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/33544/images/1-1.png","element":"img","alt":"→","inline":true},{"text":"V) adapters that enable close vision-language alignment, unlike existing works that perform alignment with a single adapter.","element":"span"}],[{"text":"As there currently exists no instruction-tuning dataset with fine-grained masks associated with video conversations, we present a benchmark instruction tuning dataset curated through a semi-automatic pipeline (Sec. ","element":"span"},{"text":"4","element":"span"},{"text":"). The dataset consists of 38k grounded video-QA triplet pairs with 83k objects and 671k fine-grained masks. 
The proposed benchmark dataset enables spatio-temporal modeling and significantly augments the capacity of the model to understand videos comprehensively, leading to state-of-the-art performance in grounded conversation generation, temporal grounding, and referring video segmentation tasks under zero-shot settings.","element":"span"}],[{"text":"In summary, our contributions are as follows:","element":"span"}],[{"text":"• We introduce VideoGLaMM, a video large multimodal model, capable of pixel-level spatio-temporal grounding, featuring an end-to-end alignment mechanism.","element":"span"}],[{"text":"• To achieve fine-grained spatio-temporal alignment, we introduce a benchmark instruction tuning dataset consisting","element":"span"}],[{"text":"of 38k grounded video-QA triplet pairs and 83k objects and roughly 671k fine-grained spatio-temporal masks.","element":"span"}],[{"text":"• We assess the performance of VideoGLaMM across diverse tasks spanning grounded conversation generation (GCG), visual grounding, and referring video segmentation, where it achieves state-of-the-art performance.","element":"span"}]]},{"heading":"2. Related work","paragraphs":[[{"style":{"fontWeight":"bold"},"text":"Large Multi-modal Models (LMMs). ","element":"span"},{"text":"Vision-language models like [","element":"span"},{"href":"#id-16","referenceIndex":37,"text":"37","element":"a"},{"text":"] have made notable advancements, demonstrating impressive zero-shot capabilities using millions of noisy image-text pairs during training. 
These models have been effective in various applications, from detection and segmentation [","element":"span"},{"href":"#id-17","referenceIndex":6,"text":"6","element":"a"},{"text":", ","element":"span"},{"href":"#id-18","referenceIndex":23,"text":"23","element":"a"},{"text":"] to more complex tasks such as 3D understanding and video analysis [","element":"span"},{"href":"#id-19","referenceIndex":29,"text":"29","element":"a"},{"text":", ","element":"span"},{"href":"#id-20","referenceIndex":33,"text":"33","element":"a"},{"text":", ","element":"span"},{"href":"#id-21","referenceIndex":43,"text":"43","element":"a"},{"text":", ","element":"span"},{"href":"#id-22","referenceIndex":48,"text":"48","element":"a"},{"text":"]. The rise of LLMs has driven significant progress in Natural Language Processing (NLP) tasks and sparked interest in developing LMMs. Early models [","element":"span"},{"href":"#id-23","referenceIndex":2,"text":"2","element":"a"},{"text":", ","element":"span"},{"href":"#id-24","referenceIndex":4,"text":"4","element":"a"},{"text":"] incorporate visual information into intermediate embeddings for a frozen LLM using a cross-attention mechanism, trained on billions of image-text pairs to align visual and linguistic modalities. Similarly, BLIP-2 [","element":"span"},{"href":"#id-7","referenceIndex":21,"text":"21","element":"a"},{"text":"] introduces Q-Former to better align visual features with language space. MiniGPT-4 [","element":"span"},{"href":"#id-9","referenceIndex":57,"text":"57","element":"a"},{"text":"] and LLAVA [","element":"span"},{"href":"#id-8","referenceIndex":28,"text":"28","element":"a"},{"text":"] finetune on detailed image descriptions using a single projection layer to align a frozen visual encoder with a frozen LLM. 
Subsequent LLaVA series models [","element":"span"},{"href":"#id-25","referenceIndex":26,"text":"26","element":"a"},{"text":"] employ a multi-layer perceptron and two-stage instruction tuning to refine the alignment process. While these works operate on static images, ours focuses on efficiently aligning videos with linguistic cues.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Video LMMs. ","element":"span"},{"text":"Recent advancements in image-based multimodal models have paved the way for video LMMs, which are essential for handling spatiotemporal sequences. Models such as VideoChat [","element":"span"},{"href":"#id-11","referenceIndex":22,"text":"22","element":"a"},{"text":"], Video-LLaMA, VideoChatGPT [","element":"span"},{"href":"#id-15","referenceIndex":52,"text":"52","element":"a"},{"text":"], Video-LLaVA [","element":"span"},{"href":"#id-12","referenceIndex":24,"text":"24","element":"a"},{"text":"] and Video-GPT+ [","element":"span"},{"href":"#id-26","referenceIndex":31,"text":"31","element":"a"},{"text":"] extend the capabilities of LLMs to the video domain by aligning video features with language, followed by instruction tuning on datasets annotated by either GPT models or humans. ","element":"span"},{"text":"While these models have shown effectiveness in video comprehension, they still face limitations in fine-grained spatio-temporal modeling and visual grounding. This restricts their ability to accurately understand or localize specific objects and detailed segments within videos, highlighting the need for further advancements in developing better multimodal models capable of visual grounding. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Visual Grounding. 
","element":"span"},{"text":"Recently, grounded LMMs [","element":"span"},{"href":"#id-10","referenceIndex":9,"text":"9","element":"a"},{"text":", ","element":"span"},{"href":"#id-6","referenceIndex":20,"text":"20","element":"a"},{"text":", ","element":"span"},{"href":"#id-27","referenceIndex":36,"text":"36","element":"a"},{"text":", ","element":"span"},{"href":"#id-28","referenceIndex":39,"text":"39","element":"a"},{"text":", ","element":"span"},{"href":"#id-29","referenceIndex":49,"text":"49","element":"a"},{"text":", ","element":"span"},{"href":"#id-30","referenceIndex":51,"text":"51","element":"a"},{"text":", ","element":"span"},{"href":"#id-31","referenceIndex":54,"text":"54","element":"a"},{"text":"] have made significant strides in enhancing visual and language comprehension and excel in complex localization tasks. These models demonstrate proficiency in tasks such as referring expression comprehension and image segmentation, highlighting the advanced image understanding capabilities of LLMs. ","element":"span"},{"text":"Approaches such as [","element":"span"},{"href":"#id-10","referenceIndex":9,"text":"9","element":"a"},{"text":", ","element":"span"},{"href":"#id-27","referenceIndex":36,"text":"36","element":"a"},{"text":", ","element":"span"},{"href":"#id-29","referenceIndex":49,"text":"49","element":"a"},{"text":"] primarily focus on creating a language-","element":"span"}],[{"id":"id-34","style":{"width":"93%"},"width":1857,"height":652,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/33544/images/2-0.png","element":"img"}],[{"text":"Figure 2. ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"Working of VideoGLaMM","element":"figcaption","subtype":"caption"},{"text":". VideoGLaMM consists of a dual spatio-temporal encoder for encoding image- and video-level features. The spatial features represent the local information and the temporal features represent global information. 
The spatial and temporal tokens are passed through V-L adapters and concatenated with the text tokens before being fed to the LLM. An L-V projector is employed to align the LLM’s response with the visual space of the pixel decoder. Finally, the aligned LLM features along with the frame features from a frame encoder are passed to a grounded pixel decoder to obtain the fine-grained object masks corresponding to the LLM response.","element":"figcaption","subtype":"caption"}],[{"text":"based context for visual grounding. In contrast, [","element":"span"},{"href":"#id-31","referenceIndex":54,"text":"54","element":"a"},{"text":"] integrates visual elements with language, while [","element":"span"},{"href":"#id-6","referenceIndex":20,"text":"20","element":"a"},{"text":"] leverages vision-language embeddings to produce segmentation masks. Additionally, [","element":"span"},{"href":"#id-28","referenceIndex":39,"text":"39","element":"a"},{"text":"] is adept at generating natural language responses linked with object segmentation masks, facilitating detailed visual-textual interactions. ","element":"span"},{"text":"However, these models are limited to image-based applications and do not extend to video understanding. Recently, [","element":"span"},{"href":"#id-14","referenceIndex":32,"text":"32","element":"a"},{"text":"] incorporates audio transcripts alongside visual and textual data for a more detailed video understanding. However, it combines pre-trained modules that cannot be trained end-to-end, which results in a lack of fine-grained spatiotemporal modeling. Similarly, [","element":"span"},{"href":"#id-32","referenceIndex":5,"text":"5","element":"a"},{"text":"] introduced a new video grounding model, but their architecture employs only a spatial encoder-decoder setup and does not address the GCG task. 
To this end, we propose VideoGLaMM, which leverages a novel fine-grained alignment strategy to align language instructions across both spatial and temporal dimensions, facilitating more fine-grained video understanding.","element":"span"}]]},{"heading":"3. VideoGLaMM","paragraphs":[[{"style":{"fontWeight":"bold"},"text":"3.1. Overview","element":"span"}],[{"text":"In this work, we introduce VideoGLaMM, a multi-modal video LMM with spatio-temporal pixel grounding capability. The task of spatio-temporal visual grounding focuses on linking a model’s response to a user-specific text query with particular objects and regions within a video, ensuring that both spatial (what’s happening in each frame) and temporal (how things change over time) details are accurately reflected in the generated output (Fig. ","element":"span"},{"href":"#id-33","text":"1","element":"a"},{"text":"). By grounding responses in specific objects and actions across frames, the model demonstrates an understanding of both the evolving and static elements in a video, enabling it to produce responses that align closely with the visual narrative.","element":"span"}],[{"text":"Our proposed VideoGLaMM achieves effective spatio-temporal grounding by processing spatial and temporal features simultaneously. VideoGLaMM’s architecture (Fig. ","element":"span"},{"href":"#id-34","text":"2","element":"a"},{"text":") leverages a dual-encoder structure: one encoder focuses on extracting spatial details from images, while the other captures temporal information from video sequences, ensuring complementary representation from both modalities. The visual features from both encoders are then integrated with an LLM using separate spatial and temporal adapters (V","element":"span"},{"style":{"height":8},"width":35.48,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/33544/images/2-1.png","element":"img","alt":"→","inline":true},{"text":"L), guided by specific textual instructions. 
The LLM outputs are aligned back with the visual space using an L","element":"span"},{"style":{"height":8},"width":35.48,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/33544/images/2-2.png","element":"img","alt":"→","inline":true},{"text":"V adapter and further processed by a pixel decoder, which also takes video frames as input to produce the final grounded outputs.","element":"span"}],[{"text":"For end-to-end spatio-temporal alignment, we train VideoGLaMM on our proposed fine-grained benchmark dataset. During training, we finetune the LoRA parameters of the LLM, along with the V","element":"span"},{"style":{"height":8},"width":35.48,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/33544/images/2-3.png","element":"img","alt":"→","inline":true},{"text":"L and L","element":"span"},{"style":{"height":8},"width":35.48,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/33544/images/2-4.png","element":"img","alt":"→","inline":true},{"text":"V adapters. This approach seamlessly combines spatial and temporal data through an improved alignment mechanism and a precise grounding framework, enhancing the model’s capability for visual grounding and understanding.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"3.2. Architecture","element":"span"}],[{"text":"The overall architecture of VideoGLaMM consists of the following components: (i) Spatio-Temporal Dual Encoder, (ii) Dual Alignment V-L Adapters, (iii) Large Language Model (LLM), and (iv) Pixel Decoder. Below, we describe each component and how it works.","element":"span"}],[{"id":"id-70","style":{"fontWeight":"bold"},"text":"Spatio-Temporal Dual Encoder. ","element":"span"},{"text":"Our architecture consists of separate image and video encoders for extracting spatial and temporal features, thus leveraging the complementary strengths of both. This lets the model capture both local and global properties. 
The image encoder ","element":"span"},{"style":{"height":16},"width":44.64,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/33544/images/3-0.png","element":"img","alt":" Fg","inline":true},{"text":" processes the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"video frames separately, where the input video ","element":"span"},{"style":{"height":11.6},"width":78.08,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/33544/images/3-1.png","element":"img","alt":" V ∈","inline":true},{"style":{"height":13.79},"width":212.48,"height":34.48,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/33544/images/3-2.png","element":"img","alt":"R^{T×H×W×C}","inline":true},{"text":". The output of the image encoder, represented by ","element":"span"},{"style":{"height":15.58},"width":35.52,"height":38.96,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/33544/images/3-3.png","element":"img","alt":" fg","inline":true},{"text":", consists of local spatial features that provide frame-level context.","element":"span"}],[{"style":{"width":"79%"},"width":754,"height":56,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/33544/images/3-4.png","element":"img"}],[{"text":"Meanwhile, for extracting video features, we use segment-wise sampling following [","element":"span"},{"href":"#id-26","referenceIndex":31,"text":"31","element":"a"},{"text":"] to obtain fine-grained temporal cues. 
Given an input video ","element":"span"},{"style":{"height":14.4},"width":292,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/33544/images/3-5.png","element":"img","alt":" V ∈ R^{T×H×W×C}","inline":true},{"text":", we divide it into ","element":"span"},{"style":{"fontStyle":"italic"},"text":"K ","element":"span"},{"text":"segments, where each segment consists of ","element":"span"},{"style":{"height":19.41},"width":104,"height":48.52,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/33544/images/3-6.png","element":"img","alt":" s = T/K","inline":true,"padRight":true},{"text":"frames. The video encoder ","element":"span"},{"style":{"height":13.6},"width":47.64,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/33544/images/3-7.png","element":"img","alt":" Fh","inline":true},{"text":" operates on low-resolution video segments ","element":"span"},{"style":{"height":16.19},"width":295.48,"height":40.48,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/33544/images/3-8.png","element":"img","alt":" V_k ∈ R^{s×H×W×C}","inline":true,"padRight":true},{"text":"yielding global features that provide segment-wise temporal context.","element":"span"}],[{"style":{"width":"80%"},"width":764,"height":52,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/33544/images/3-9.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Dual Alignment (V","element":"span"},{"style":{"height":8},"width":36,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/33544/images/3-10.png","element":"img","alt":"→","inline":true},{"style":{"fontWeight":"bold"},"text":"L) Adapters. ","element":"span"},{"text":"To align visual features with the LLM space, we use two separate V","element":"span"},{"style":{"height":8},"width":40,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/33544/images/3-11.png","element":"img","alt":"→","inline":true},{"text":"L adapters for the image and video encoders. 
","element":"span"},{"style":{"height":15.58},"width":55.36,"height":38.96,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/33544/images/3-12.png","element":"img","alt":" Wg","inline":true,"padRight":true},{"text":"represents the spatial adapter, and ","element":"span"},{"style":{"height":13.6},"width":56,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/33544/images/3-13.png","element":"img","alt":" Wh","inline":true,"padRight":true},{"text":"represents the temporal adapter. These adapters project the visual features into the LLM’s projection space, thus aligning the two modalities. The spatial and temporal features obtained from image and video inputs after projection through ","element":"span"},{"style":{"height":15.81},"width":51,"height":39.52,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/33544/images/3-14.png","element":"img","alt":" Wg","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.6},"width":53,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/33544/images/3-15.png","element":"img","alt":" Wh","inline":true,"padRight":true},{"text":"are represented by ","element":"span"},{"style":{"height":15.81},"width":40.48,"height":39.52,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/33544/images/3-16.png","element":"img","alt":"Zg","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.6},"width":43,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/33544/images/3-17.png","element":"img","alt":" Zh","inline":true},{"text":", respectively.","element":"span"}],[{"style":{"width":"84%"},"width":800,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/33544/images/3-18.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Large Language Model. ","element":"span"},{"text":"The tokenized spatio-temporal visual features are then concatenated with the textual tokens 
","element":"span"},{"style":{"height":15.81},"width":271.48,"height":39.52,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/33544/images/3-19.png","element":"img","alt":" Ztext ↑ RL→Dt","inline":true,"padRight":true},{"text":"to obtain final feature embedding ","element":"span"},{"style":{"height":16.8},"width":324.52,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/33544/images/3-20.png","element":"img","alt":"Z = [Zg, Zh, Ztext]","inline":true,"padRight":true},{"text":"which is fed into the ","element":"span"},{"style":{"fontWeight":"bold"},"text":"LLM","element":"span"},{"text":". Thus, input to the ","element":"span"},{"style":{"fontWeight":"bold"},"text":"LLM ","element":"span"},{"text":"contains both the spatial and temporal cues for robust video understanding. We further expand the original ","element":"span"},{"style":{"fontWeight":"bold"},"text":"LLM ","element":"span"},{"text":"vocabulary with a new token, i.e., ","element":"span"},{"text":"","element":"span"},{"text":", which signifies the request for the segmentation output. 
Thus the ","element":"span"},{"style":{"fontWeight":"bold"},"text":"LLM ","element":"span"},{"text":"response ","element":"span"},{"style":{"fontWeight":"bold"},"text":"E ","element":"span"},{"text":"can be described as,","element":"span"}],[{"style":{"width":"84%"},"width":802,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/33544/images/3-21.png","element":"img"}],[{"text":"The LLM output ","element":"span"},{"style":{"fontWeight":"bold"},"text":"E ","element":"span"},{"text":"contains the ","element":"span"},{"text":" ","element":"span"},{"text":"whenever the task requires to generate the segmentation mask.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Pixel Decoder ","element":"span"},{"text":"Our Pixel decoder consists of a prompt encoder (","element":"span"},{"style":{"fontStyle":"italic"},"text":"H","element":"span"},{"text":") and a mask decoder ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D","element":"span"},{"text":", capable of predicting masks with spatio-temporal grounding. The pixel decoder is adapted to videos and can implicitly process temporal information. The last layer embeddings from the LLM denoted as ","element":"span"},{"style":{"fontStyle":"italic"},"text":"l","element":"span"},{"style":{"height":7.39},"width":40,"height":18.48,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/33544/images/3-22.png","element":"img","alt":"seg","inline":true,"padRight":true},{"text":"corresponding to ","element":"span"},{"text":" ","element":"span"},{"text":"token is extracted, which is enriched with both spatial and temporal cues. The LLM embeddings act as prompts for the mask decoder and are processed by the prompt encoder. 
Simultaneously, we extract visual features of the input frames ","element":"span"},{"style":{"fontStyle":"italic"},"text":"V ","element":"span"},{"text":"using a grounded frame encoder ","element":"span"},{"style":{"fontStyle":"italic"},"text":"P ","element":"span"},{"text":"which is aligned with pixel decoder and is further equipped with the ability to produce multi-scale features during training. For aligning the output embeddings from LLM with the pixel decoder, we train an (L","element":"span"},{"style":{"height":8},"width":36,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/33544/images/3-23.png","element":"img","alt":"→","inline":true},{"text":"V) adapter layer ","element":"span"},{"style":{"height":15.6},"width":54,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/33544/images/3-24.png","element":"img","alt":" Wp","inline":true,"padRight":true},{"text":"between the LLM and prompt encoder such that the output from the adapter is denoted as ","element":"span"},{"style":{"fontWeight":"bold"},"text":"e","element":"span"},{"style":{"height":17.2},"width":246,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/33544/images/3-25.png","element":"img","alt":"pseg = Wp(lseg)","inline":true},{"text":". The ","element":"span"},{"style":{"fontWeight":"bold"},"text":"e","element":"span"},{"style":{"height":16.8},"width":41.48,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/33544/images/3-26.png","element":"img","alt":"pseg","inline":true},{"text":"is fed to prompt encoder ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H","element":"span"},{"text":", ","element":"span"},{"text":"such that the encoded output ","element":"span"},{"style":{"height":16.74},"width":127.08,"height":41.84,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/33544/images/3-27.png","element":"img","alt":" H(epseg)","inline":true,"padRight":true},{"text":"is used to prompt the ","element":"span"},{"text":"mask decoder. 
The encoded prompts ","element":"span"},{"style":{"height":16.99},"width":122.48,"height":42.48,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/33544/images/3-28.png","element":"img","alt":" H(epseg)","inline":true,"padRight":true},{"text":"along with the ","element":"span"},{"text":"grounded visual features ","element":"span"},{"style":{"fontStyle":"italic"},"text":"P","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"V ","element":"span"},{"text":") ","element":"span"},{"text":"are passed to mask decoder ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D","element":"span"},{"text":". Subsequently, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D ","element":"span"},{"text":"produces the output mask ","element":"span"},{"style":{"fontWeight":"bold"},"text":"M","element":"span"},{"text":".","element":"span"}],[{"style":{"width":"72%"},"width":684,"height":78,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/33544/images/3-29.png","element":"img"}],[{"id":"id-51","style":{"fontWeight":"bold"},"text":"3.3. Training Strategy","element":"span"}],[{"text":"We train VideoGLaMM end-to-end in a single stage. As stated above, we use a dual encoder consisting of separate image and video encoders for processing spatial and temporal inputs to obtain local and global features, respectively. These encoders are initialized with weights of strong pre-trained encoders. 
During training, we keep the encoders fixed and only train the V","element":"span"},{"style":{"height":8},"width":35.48,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/33544/images/3-30.png","element":"img","alt":"→","inline":true},{"text":"L adapters ","element":"span"},{"style":{"height":15.79},"width":53.48,"height":39.48,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/33544/images/3-31.png","element":"img","alt":" Wg","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.39},"width":56,"height":33.48,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/33544/images/3-32.png","element":"img","alt":" Wh","inline":true,"padRight":true},{"text":"associated with these encoders. These adapters are used to project the spatio-temporal visual features in the space of LLM and align the two modules. The spatio-temporal encoder is kept frozen and only the V","element":"span"},{"style":{"height":7.81},"width":35.48,"height":19.52,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/33544/images/3-33.png","element":"img","alt":"→","inline":true},{"text":"L adapters are updated. The textual features from the last layer of LLM, rich in spatial and temporal cues, are projected into the space of the pixel decoder using a multi-layer projection L","element":"span"},{"style":{"height":8},"width":38.48,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/33544/images/3-34.png","element":"img","alt":"→","inline":true},{"text":"V adapter ","element":"span"},{"style":{"height":15.6},"width":54,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/33544/images/3-35.png","element":"img","alt":" Wp","inline":true},{"text":". For the LLM, we keep its weights frozen and only finetune LoRA [","element":"span"},{"href":"#id-35","referenceIndex":14,"text":"14","element":"a"},{"text":"] parameters during training. 
Both the frame encoder and pixel decoder are instantiated with pre-trained weights. We keep the frame encoder and pixel decoder frozen and only train the L","element":"span"},{"style":{"height":8},"width":40,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/33544/images/3-36.png","element":"img","alt":"→","inline":true},{"text":"V adapter layer. We optimize the output of the LLM by minimizing the cross-entropy ","element":"span"},{"style":{"fontWeight":"bold"},"text":"CE ","element":"span"},{"text":"objective between the autoregressively obtained text output and the dense grounded ground-truth caption. For the output of the mask decoder, we optimize the intersection over union (IOU) between the predictions of the mask decoder and the ground-truth masks, denoted as ","element":"span"},{"style":{"height":13.18},"width":137.84,"height":32.96,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/33544/images/3-37.png","element":"img","alt":" Lmasked","inline":true},{"text":". The total loss is the sum of the ","element":"span"},{"style":{"fontWeight":"bold"},"text":"CE ","element":"span"},{"text":"loss and the masked loss.","element":"span"}],[{"style":{"width":"71%"},"width":674,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/33544/images/3-38.png","element":"img"}],[{"text":"The first component of ","element":"span"},{"style":{"height":13.6},"width":92,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/33544/images/3-39.png","element":"img","alt":" Ltotal","inline":true,"padRight":true},{"text":"ensures that the LLM generates textual embeddings that not only align with the ground truth but also offer informative spatio-temporal cues to the mask decoder for effective grounding. 
The second component facilitates efficient grounding by leveraging these textual cues from the LLM.","element":"span"}],[{"id":"id-43","style":{"width":"99%"},"width":1965,"height":351,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/33544/images/4-0.png","element":"img"}],[{"text":"Figure 3. ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"Proposed Semi-automatic Annotation Pipeline","element":"figcaption","subtype":"caption"},{"text":". Our dataset for grounded conversation generation (GCG) is built from three video dataset types: i) ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"Videos having masks only: ","element":"figcaption","subtype":"caption"},{"text":"Object patches are extracted from video frames using masks and processed by the Gemini-Pro model for initial object descriptions, which are then refined to produce detailed object captions. These refined captions and masks are again fed to Gemini-Pro model to create dense grounded captions. ii) ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"Videos having bbox annotations and captions: ","element":"figcaption","subtype":"caption"},{"text":"Frames are first processed with a Video-LMM to generate a comprehensive caption which is combined with the original caption and fed to GPT-4o to obtain dense grounded captions. Masks are generated using frames and ground-truth bounding boxes with the SAM model. iii) ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"Videos having object bboxes and referring expressions: ","element":"figcaption","subtype":"caption"},{"text":"Frames, bounding boxes, and referring expressions are input to GPT-4o for dense grounded captions, while masks are generated by feeding frames and bounding boxes to the SAM model.","element":"figcaption","subtype":"caption"}]]},{"heading":"4. 
Our Benchmark & Annotation Pipeline","paragraphs":[[{"text":"Our benchmark video dataset comes from different sources: YTVIS [","element":"span"},{"href":"#id-36","referenceIndex":16,"text":"16","element":"a"},{"text":"], BURST [","element":"span"},{"href":"#id-37","referenceIndex":3,"text":"3","element":"a"},{"text":"], ActivityNet entities [","element":"span"},{"href":"#id-38","referenceIndex":56,"text":"56","element":"a"},{"text":"], ReferYTVOS [","element":"span"},{"href":"#id-39","referenceIndex":44,"text":"44","element":"a"},{"text":"], MeViS [","element":"span"},{"href":"#id-40","referenceIndex":12,"text":"12","element":"a"},{"text":"], VidSTG [","element":"span"},{"href":"#id-41","referenceIndex":53,"text":"53","element":"a"},{"text":"] and HCSTVG [","element":"span"},{"href":"#id-42","referenceIndex":46,"text":"46","element":"a"},{"text":"]. To create fine-grained grounded captions, we develop a semi-automatic pipeline (Fig. ","element":"span"},{"href":"#id-43","text":"3","element":"a"},{"text":") that ensures high-quality and scalable annotation. Our annotation pipeline is divided into three streams based on the availability of the ground truth annotations. We explain each stream below.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"a) Videos with only Mask annotations: ","element":"span"},{"text":"Fig. ","element":"span"},{"href":"#id-43","text":"3","element":"a"},{"text":"(a) shows the annotation process for the videos having only masks as ground truth labels. To generate the corresponding dense grounded caption, we use the following steps: ","element":"span"},{"style":{"fontWeight":"bold"},"text":"i) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Object Description Generation: ","element":"span"},{"text":"For each object in the video, we begin by creating a bounding box based on the ground truth mask provided in the annotation file. 
This bounding box allows us to crop the object from each frame, producing a sequence of image patches that capture the object throughout the video. We then feed these image patches to the Gemini-Pro model [","element":"span"},{"href":"#id-44","referenceIndex":42,"text":"42","element":"a"},{"text":"] to obtain a rough description of each object in the video. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"ii) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Object Description Refinement: ","element":"span"},{"text":"The bounding boxes from the previous stage are superimposed on the corresponding video frames, and the entire video is then fed into the Gemini-Pro model to obtain a more accurate and detailed description of the objects. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"iii) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Caption Generation: ","element":"span"},{"text":"The bounding boxes of corresponding objects overlaid across the video frames are labeled according to their object IDs. Then, we input these frames into the Gemini-Pro model to obtain dense captions. This results in a comprehensive description of the video. Finally, we manually review the {obj_id} in the generated video captions based on the video content. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"iv) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Detailed Dense Captions. ","element":"span"},{"text":"To enhance the detail and accuracy of the video captions, we leverage two advanced Video LMMs: Video-LLAVA [","element":"span"},{"href":"#id-15","referenceIndex":52,"text":"52","element":"a"},{"text":"] and LLAVA-NeXT [","element":"span"},{"href":"#id-45","referenceIndex":27,"text":"27","element":"a"},{"text":"]. 
Using the semi-automatically generated captions as a reference, we integrate and refine the outputs from these models, merging their results to produce the final, comprehensive dense captions.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"b) Videos with Bounding Box annotations and Captions: ","element":"span"},{"text":"Fig. ","element":"span"},{"href":"#id-43","text":"3","element":"a"},{"text":"(b) shows the annotation process for the videos having both captions and object bounding box (Bbox) annotations. To obtain the corresponding dense grounded caption, the video frames are first passed to an open-source Video-LMM [","element":"span"},{"href":"#id-45","referenceIndex":27,"text":"27","element":"a"},{"text":"] to obtain a detailed caption, which is fed along with the reference ground truth caption to GPT-4o mini [","element":"span"},{"href":"#id-46","referenceIndex":34,"text":"34","element":"a"},{"text":"] to obtain the final dense grounded caption. The Bbox annotations are used as prompts to the SAM model [","element":"span"},{"href":"#id-47","referenceIndex":19,"text":"19","element":"a"},{"text":"], which takes the video frames as input and provides the masks corresponding to the objects.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"c) Videos with Bounding Box annotations and Referring Expressions: ","element":"span"},{"text":"Fig. ","element":"span"},{"href":"#id-43","text":"3","element":"a"},{"text":"(c) shows the annotation process for the videos having object bounding box (Bbox) annotations and referring expressions corresponding to different objects. ","element":"span"},{"text":"The video frames along with referring expressions and Bbox annotations are prompted to GPT-4o mini, which provides the corresponding dense grounded caption. To obtain the masks corresponding to the objects, the video frames are fed to the SAM model, which is prompted with Bbox annotations of the objects. 
Overall, our proposed GCG dataset has 38,788 grounded video-QA triplets along with 83,877 objects and 671,016 fine-grained masks in total. We further curate a separate test set of 308 refined video-QA triplets with 826 objects and 22,762 fine-grained masks for the grounded conversation generation evaluation task.","element":"span"}]]},{"heading":"5. Experimental Setup","paragraphs":[[{"style":{"fontWeight":"bold"},"text":"Implementation details. ","element":"span"},{"text":"Our spatio-temporal dual encoders follow the design of image and video encoders from [","element":"span"},{"href":"#id-26","referenceIndex":31,"text":"31","element":"a"},{"text":"]. For the image encoder, we use a pretrained CLIP ViT-L/14 (336 × 336) [","element":"span"},{"href":"#id-16","referenceIndex":37,"text":"37","element":"a"},{"text":"] model, and for the temporal encoder, we select the pretrained encoder of InternVideov2 (224 ×","element":"span"}],[{"id":"id-60","style":{"width":"100%"},"width":1980,"height":721,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/33544/images/5-0.png","element":"img"}],[{"text":"Figure 4. ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"Qualitative results of VideoGLaMM on grounded conversation generation (GCG)","element":"figcaption","subtype":"caption"},{"text":". Given user queries, VideoGLaMM generates textual responses and grounds objects and phrases using pixel-level masks, showing its detailed understanding of the video.","element":"figcaption","subtype":"caption"}],[{"text":"224) [","element":"span"},{"href":"#id-48","referenceIndex":50,"text":"50","element":"a"},{"text":"]. 
","element":"span"},{"text":"The V","element":"span"},{"style":{"height":8},"width":36,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/33544/images/5-1.png","element":"img","alt":"→","inline":true},{"text":"L projectors are initialized with the weights of MLP adapter from [","element":"span"},{"href":"#id-26","referenceIndex":31,"text":"31","element":"a"},{"text":"]. The LLM is instantiated with Phi3-Mini-3.8B [","element":"span"},{"href":"#id-49","referenceIndex":1,"text":"1","element":"a"},{"text":"] weights. Both the frame encoder and pixel decoder are initialized with SAM2 [","element":"span"},{"href":"#id-50","referenceIndex":41,"text":"41","element":"a"},{"text":"] encoder-decoder weights. The training (Sec. ","element":"span"},{"href":"#id-51","text":"3.3","element":"a"},{"text":") is carried out end-to-end on 4 Nvidia A100 40GB GPUs with a distributed training based on DeepSpeed [","element":"span"},{"href":"#id-52","referenceIndex":40,"text":"40","element":"a"},{"text":"].","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Datasets. ","element":"span"},{"text":"We train the model on our proposed grounded conversation (GCG) dataset containing 38k grounded video-QA triplets along with 83k objects and 671k fine-grained masks. During training, we also include a variety of other image and video segmentation datasets with our proposed benchmark dataset for more robust alignment. 
Our image segmentation datasets include: ADE20K [","element":"span"},{"href":"#id-53","referenceIndex":55,"text":"55","element":"a"},{"text":"], COCO-Stuff [","element":"span"},{"href":"#id-54","referenceIndex":8,"text":"8","element":"a"},{"text":"], LVIS-PACO [","element":"span"},{"href":"#id-55","referenceIndex":38,"text":"38","element":"a"},{"text":"], refCOCO, refCOCO+, refCLEF, refCOCOg [","element":"span"},{"href":"#id-56","referenceIndex":15,"text":"15","element":"a"},{"text":"], LLaVA-Instruct-150k [","element":"span"},{"href":"#id-57","referenceIndex":25,"text":"25","element":"a"},{"text":"], ReasonSeg [","element":"span"},{"href":"#id-6","referenceIndex":20,"text":"20","element":"a"},{"text":"] and GranDf [","element":"span"},{"href":"#id-28","referenceIndex":39,"text":"39","element":"a"},{"text":"]. For video segmentation datasets, we include training samples from Refer-DAVIS17 [","element":"span"},{"href":"#id-58","referenceIndex":17,"text":"17","element":"a"},{"text":"] and VideoInstruct100K [","element":"span"},{"href":"#id-13","referenceIndex":30,"text":"30","element":"a"},{"text":"].","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Tasks. ","element":"span"},{"text":"We evaluate VideoGLaMM on three challenging tasks: ","element":"span"},{"text":"grounded conversation generation (GCG), visual grounding, and referring video segmentation. For grounded conversation generation, we curate a separate dataset of 308 refined video-QA triplets containing 826 objects and 22,762 fine-grained masks, following our proposed annotation pipeline. For visual grounding, we evaluate our model on the challenging VidSTG [","element":"span"},{"href":"#id-41","referenceIndex":53,"text":"53","element":"a"},{"text":"] dataset, considering only the interrogative sentences as done by [","element":"span"},{"href":"#id-14","referenceIndex":32,"text":"32","element":"a"},{"text":"]. 
","element":"span"},{"text":"In the case of motion-guided video object segmentation, we leverage the MeViS [","element":"span"},{"href":"#id-40","referenceIndex":12,"text":"12","element":"a"},{"text":"] validation dataset. All the results on the MeViS dataset are obtained via the official CodaLab evaluation suite. We also report referring video object segmentation results on the additional Ref-DAVIS-17 [","element":"span"},{"href":"#id-59","referenceIndex":18,"text":"18","element":"a"},{"text":"] dataset.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Evaluation metrics. ","element":"span"},{"text":"For the GCG task, we use mean Intersection over Union (mIOU) and Recall to determine the correctness of generated masks, and METEOR, CIDEr and CLAIR scores to assess the quality of the conversational output. In the case of visual grounding, we report mean Intersection over Union (mIOU) to quantify the performance. Finally, for referring video segmentation, we report Region Jaccard ","element":"span"},{"style":{"fontStyle":"italic"},"text":"J ","element":"span"},{"text":", Boundary F measure ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F","element":"span"},{"text":", and their mean ","element":"span"},{"style":{"fontStyle":"italic"},"text":"J ","element":"span"},{"text":"&","element":"span"},{"style":{"fontStyle":"italic"},"text":"F","element":"span"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Baselines. ","element":"span"},{"text":"We compare our VideoGLaMM with two challenging baselines employing LLMs capable of visual grounding: PG-Video-LLaVA [","element":"span"},{"href":"#id-14","referenceIndex":32,"text":"32","element":"a"},{"text":"] and GLaMM [","element":"span"},{"href":"#id-28","referenceIndex":39,"text":"39","element":"a"},{"text":"]. Since GLaMM is designed for pixel grounding in images, we enable temporal properties in GLaMM by augmenting its architecture with SAM2. 
","element":"span"},{"text":"For referring segmentation, we also compare VideoGLaMM with the recently released Video-LISA [","element":"span"},{"href":"#id-32","referenceIndex":5,"text":"5","element":"a"},{"text":"].","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Training Recipe. ","element":"span"},{"text":"VideoGLaMM follows a gradual training schedule. We do not train VideoGLaMM on our GCG dataset directly from the start, rather we take a gradual approach. We first train the model on image and video segmentation datasets until epoch 20 and then introduce our GCG dataset and train the model until epoch 30. This training recipe ensures that model learns both the spatial and temporal cues effectively.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"5.1. Grounded Conversation Generation","element":"span"}],[{"text":"The Grounded Conversation Generation (GCG) task aims to provide video-level detailed captions with specific phrases directly tied to corresponding segmentation masks in the video frames. ","element":"span"},{"text":"For example, “","element":"span"},{"text":" ","element":"span"},{"text":"in ","element":"span"},{"text":"white clothes ","element":"span"},{"text":"holds ","element":"span"},{"text":"a ","element":"span"},{"text":" ","element":"span"},{"text":"in ","element":"span"},{"text":"the ","element":"span"},{"text":"room","element":"span"},{"text":"\", as shown in the first row of Fig. ","element":"span"},{"href":"#id-60","text":"4","element":"a"},{"text":", features how each bracketed phrase is anchored to a unique segmentation mask. This creates a densely annotated caption that aligns textual descriptions with visual regions in the frames, enriching the video’s contextual interpretation. To obtain GCG output, we query the model with the following sample prompt: ","element":"span"},{"text":"“","element":"span"},{"text":"Provide a detailed description of the image. 
Respond with interleaved segmentation masks for the corresponding parts of the answer.","element":"span"},{"text":"” The model generates a detailed caption along with interleaved segmentation masks, employing the format “","element":"span"},{"text":"<p> An adult woman in brown </p> is talking to another <p> adult man wearing jacket </p>","element":"span"},{"text":"” as shown in the third row of Fig. ","element":"span"},{"href":"#id-60","text":"4","element":"a"},{"text":". We use special tokens, namely ","element":"span"},{"text":"<p> and </p>","element":"span"},{"text":", to delineate the start and end of each phrase and its corresponding region mask, respectively.","element":"span"}],[{"text":"As shown in Table ","element":"span"},{"href":"#id-61","text":"1","element":"a"},{"text":", our proposed VideoGLaMM performs better in generating detailed captions containing references to objects in the video frames, as is evident from high METEOR, CIDEr and CLAIR scores. Regarding the quality of masks, VideoGLaMM consistently outperforms baselines in terms of mIOU and Recall scores, signifying a higher overlap with ground-truth masks. Fig. ","element":"span"},{"href":"#id-60","text":"4 ","element":"a"},{"text":"further shows the qualitative visualizations of VideoGLaMM on GCG samples.","element":"span"}],[{"id":"id-61","style":{"width":"100%"},"width":946,"height":165,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/33544/images/6-0.png","element":"img"}],[{"text":"Table 1. ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"Evaluation on grounded conversation generation (GCG): ","element":"figcaption","subtype":"caption"},{"text":"VideoGLaMM shows superior performance in generating accurate video-level captions which are tied to corresponding segmentation masks in the video frames.","element":"figcaption","subtype":"caption"}],[{"style":{"fontWeight":"bold"},"text":"5.2. Referring Video Segmentation","element":"span"}],[{"text":"For referring video segmentation, the output should be grounded as per the given phrase, pointing towards specific instances in the video. Given a sentence or referring expression containing a specific object instance, the goal is to localize the object instances present across the video frames. This task operates in an open vocabulary setting, assessing the model’s ability to localize objects both spatially and temporally. 
Given a referring phrase expression ","element":"span"},{"text":"Phrase","element":"span"},{"text":", we prompt the model using the following instruction prompt to obtain the instance masks: “","element":"span"},{"text":"What is {Phrase} in this video? ","element":"span"},{"text":"Respond with segmentation masks","element":"span"},{"text":"”. Table ","element":"span"},{"href":"#id-62","text":"2 ","element":"a"},{"text":"shows results on the challenging MeViS dataset for motion-guided referring video segmentation. Both the region Jaccard ","element":"span"},{"style":{"fontStyle":"italic"},"text":"J ","element":"span"},{"text":"and boundary F-measure ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F ","element":"span"},{"text":"are high in the case of VideoGLaMM, significantly outperforming the baselines. Similarly, the mean ","element":"span"},{"style":{"fontStyle":"italic"},"text":"J ","element":"span"},{"text":"&","element":"span"},{"style":{"fontStyle":"italic"},"text":"F ","element":"span"},{"text":"follows the same trend. ","element":"span"},{"text":"Note that the VideoLISA scores are reported with its additional post-processing step, which boosts performance. Therefore, we further fine-tune VideoGLaMM on the task of referring segmentation from epoch 30 until epoch 40. Clearly, VideoGLaMM outperforms VideoLISA (post-processed) on both ","element":"span"},{"style":{"fontStyle":"italic"},"text":"J ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F","element":"span"},{"text":", including the mean ","element":"span"},{"style":{"fontStyle":"italic"},"text":"J ","element":"span"},{"text":"&","element":"span"},{"style":{"fontStyle":"italic"},"text":"F","element":"span"},{"text":". ","element":"span"},{"text":"Additionally, VideoGLaMM outperforms baselines on the Ref-DAVIS-17 dataset. 
","element":"span"},{"text":"The improved performance of VideoGLaMM can be credited to its training pipeline, which seamlessly integrates spatio-temporal dynamics into the model.","element":"span"}],[{"id":"id-62","style":{"width":"88%"},"width":837,"height":244,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/33544/images/6-1.png","element":"img"}],[{"text":"Table 2. ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"Performance comparison of VideoGLaMM on MeViS: ","element":"figcaption","subtype":"caption"},{"text":"VideoGLaMM shows superior performance on motion grounding and segmenting referring objects in the videos.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"90%"},"width":857,"height":289,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/33544/images/6-2.png","element":"img"}],[{"text":"Table 3. ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"Performance comparison of VideoGLaMM on Ref-DAVIS-17: ","element":"figcaption","subtype":"caption"},{"text":"VideoGLaMM shows superior performance on segmenting referring objects in the videos.","element":"figcaption","subtype":"caption"}],[{"style":{"fontWeight":"bold"},"text":"5.3. Visual Grounding","element":"span"}],[{"id":"id-63","style":{"width":"92%"},"width":875,"height":237,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/33544/images/6-3.png","element":"img"}],[{"text":"Table 4. 
","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"Performance comparison of VideoGLaMM with other models on spatial grounding: ","element":"figcaption","subtype":"caption"},{"text":"Results on the VidSTG (interrogative) benchmark highlight VideoGLaMM’s superior ability in correlating textual instructions with the visual frames.","element":"figcaption","subtype":"caption"}],[{"text":"To assess VideoGLaMM’s visual grounding capability, we conduct quantitative evaluations on the benchmark test set of the VidSTG dataset. The visual grounding task measures the adeptness of the model at correlating textual descriptions with visual elements in the video, a critical aspect of contextual comprehension. This ability is crucial for applications that integrate continuous visual data with language. The output of this task is refined masks that correlate with the given caption ","element":"span"},{"text":"{caption}","element":"span"},{"text":". To obtain the visual grounding output, we query the model with interrogative captions. For these captions, the prompt format follows “","element":"span"},{"text":"{caption} Please respond with a segmentation masks.","element":"span"},{"text":"”. Table ","element":"span"},{"href":"#id-63","text":"4 ","element":"a"},{"text":"shows VideoGLaMM’s improved visual grounding precision as it outperforms the baselines, demonstrating its fine-grained understanding.","element":"span"}],[{"text":"In addition to the above downstream tasks, in Sec. ","element":"span"},{"text":"B ","element":"span"},{"text":"of supplementary, we also integrate VideoGLaMM into a conditional video generation model [","element":"span"},{"href":"#id-64","referenceIndex":45,"text":"45","element":"a"},{"text":"]. VideoGLaMM provides temporally coherent masks that guide the generative model in editing videos effectively. 
Please refer to Sections ","element":"span"},{"href":"#id-65","text":"A ","element":"a"},{"text":"and ","element":"span"},{"text":"B ","element":"span"},{"text":"of supplementary for additional quantitative and qualitative results respectively.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"5.4. Ablation studies","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Effect of Spatio-Temporal Dual Encoder. ","element":"span"},{"text":"We employ separate image and video encoders to process spatial and temporal information. While spatial processing induces local information, temporal processing helps learn global features. Both are necessary from the perspective of grounding. To verify the effectiveness of dual spatio-temporal encoder, we conduct an ablation study to measure the effectiveness of each encoder for grounded conversation generation (GCG) task (see Table ","element":"span"},{"href":"#id-66","text":"5","element":"a"},{"text":"). We notice that using only an image encoder gives suboptimal results, as we notice a drop in both the localization and captioning metrics. Using only video branch leads to the highest mIOU; however, relatively lower METEOR, CIDEr, and CLAIR scores. To obtain an optimal mIOU and good conversational abilities, VideoGLaMM uses both image and video encoders.","element":"span"}],[{"id":"id-66","style":{"width":"100%"},"width":946,"height":172,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/33544/images/7-0.png","element":"img"}],[{"text":"Table 5. ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"Effect of Spatio-Temporal Dual Encoder: ","element":"figcaption","subtype":"caption"},{"text":"We obtain low performance using only spatial (image) encoder. Using only a video encoder gives the highest mIOU but lower scores on CLAIR, METEOR and CIDEr. 
For a better trade-off, we employ dual (image and video) encoders to have accurate, grounded conversations.","element":"figcaption","subtype":"caption"}],[{"style":{"fontWeight":"bold"},"text":"Spatial vs Spatio-temporal Pixel decoder. ","element":"span"},{"text":"Pixel decoder in VideoGLaMM can operate in two configurations. The first configuration processes video frames individually, ignoring temporal consistency. ","element":"span"},{"text":"The second configuration employs both spatial and temporal branches for spatio-temporal context. Table ","element":"span"},{"href":"#id-67","text":"6 ","element":"a"},{"text":"demonstrates the impact of spatiotemporal decoder on the GCG task. Results indicate that using only the spatial configuration reduces performance, with a nearly 3% drop in mIOU scores compared to the spatio-temporal configuration. Similarly, metrics like METEOR, CIDEr, and CLAIR also show a decline, underscoring the importance of using spatio-temporal configuration for pixel decoder.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Effect of number of frames for Pixel Decoder. ","element":"span"},{"text":"The pixel decoder receives the raw input frames encoded via frame encoder as input for predicting fine-grained grounded masks. ","element":"span"},{"text":"During training, the pixel decoder also receives ground-truth masks which act as supervision signals. To provide more temporal supervision, we feed the pixel decoder with multiple input frames to enhance its temporal understanding. This allows it to learn semantic information","element":"span"}],[{"id":"id-67","style":{"width":"100%"},"width":946,"height":136,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/33544/images/7-1.png","element":"img"}],[{"text":"Table 6. 
","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"Spatial vs Spatio-temporal Pixel decoder: ","element":"figcaption","subtype":"caption"},{"text":"We observe that using Pixel decoder without the temporal branch gives limited performance as the model faces difficulties in temporal grounding. When using temporal branch, the performance on both the temporal grounding and grounded LLM response improves indicating the importance of temporal processing in VideoGLaMM.","element":"figcaption","subtype":"caption"}],[{"text":"that generalizes across frames. Table ","element":"span"},{"href":"#id-68","text":"7 ","element":"a"},{"text":"shows the performance when 4 and 8 frames are input to the decoder. We observe that while the mIOU with 8 frames is slightly lower compared to 4 frames, the conversational quality measured by METEOR and CLAIR is higher. Hence, to achieve a decent mIOU with higher conversational output, we stick to 8 frames in the paper.","element":"span"}],[{"id":"id-68","style":{"width":"100%"},"width":946,"height":139,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/33544/images/7-2.png","element":"img"}],[{"text":"Table 7. ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"Effect of number of frames for Pixel Decoder: ","element":"figcaption","subtype":"caption"},{"text":"We observe that using 4 supervision frames for pixel decoder gives better mIOU but relatively modest conversation quality measured by METEOR and CLAIR. With 8 supervision frames, mIOU slightly decreases while the conversational quality increases.","element":"figcaption","subtype":"caption"}],[{"style":{"fontWeight":"bold"},"text":"Limitations and Future Work: ","element":"span"},{"text":"Our GCG dataset plays a key role in enhancing the model’s grounding capabilities. ","element":"span"},{"text":"While we validated annotations manually, some noise may still be present. 
Also, each scene contains several objects, and the video descriptions do not exhaustively cover all of them. A higher-quality, densely annotated set could further boost model performance but would require substantial annotation resources. Additionally, VideoGLaMM struggles with objects of varying granularities, likely due to limited representation in the training data. Another direction is to extend VideoGLaMM to longer videos, as the current GCG dataset mainly covers short- to medium-duration clips.","element":"span"}]]},{"heading":"6. Conclusion","paragraphs":[[{"text":"We introduce VideoGLaMM, an LMM specifically designed to address the challenge of fine-grained pixel-level grounding in videos. By integrating a dual vision encoder with a spatio-temporal decoder and employing tunable Vision-Language adapters, our model achieves precise alignment between video content and textual instructions. To facilitate this alignment, we introduce a refined instruction-tuning dataset curated via a semi-automatic annotation pipeline. Our experimental evaluations across Grounded Conversation Generation, Visual Grounding, and Referring Video Segmentation tasks demonstrate that VideoGLaMM consistently outperforms existing models.","element":"span"}]]},{"heading":"References","paragraphs":[[{"id":"id-49","text":"[1] Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti ","element":"span"},{"text":"Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, et al. Phi-3 technical report: A highly capable language model locally on your phone. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2404.14219","element":"span"},{"text":", 2024. 
","element":"span"},{"href":"#id-60","text":"6","element":"a"}],[{"id":"id-23","text":"[2] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine ","element":"span"},{"text":"Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in neural information processing systems","element":"span"},{"text":", 35:23716–23736, 2022. ","element":"span"},{"href":"#id-69","text":"2","element":"a"}],[{"id":"id-37","text":"[3] Ali Athar, Jonathon Luiten, Paul Voigtlaender, Tarasha Khu- ","element":"span"},{"text":"rana, Achal Dave, Bastian Leibe, and Deva Ramanan. Burst: A benchmark for unifying object recognition, segmentation and tracking in video. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"WACV","element":"span"},{"text":", 2023. ","element":"span"},{"href":"#id-43","text":"5","element":"a"}],[{"id":"id-24","text":"[4] Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf ","element":"span"},{"text":"Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, et al. Openflamingo: An open-source framework for training large autoregressive vision-language models. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2308.01390","element":"span"},{"text":", 2023. ","element":"span"},{"href":"#id-69","text":"2","element":"a"}],[{"id":"id-32","text":"[5] Zechen Bai, Tong He, Haiyang Mei, Pichao Wang, Ziteng ","element":"span"},{"text":"Gao, Joya Chen, Lei Liu, Zheng Zhang, and Mike Zheng Shou. ","element":"span"},{"text":"One token to seg them all: ","element":"span"},{"text":"Language instructed reasoning segmentation in videos. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2409.19603","element":"span"},{"text":", 2024. 
","element":"span"},{"href":"#id-34","text":"3","element":"a"},{"text":", ","element":"span"},{"href":"#id-60","text":"6","element":"a"},{"text":", ","element":"span"},{"text":"7","element":"span"},{"text":", ","element":"span"},{"text":"1","element":"span"}],[{"id":"id-17","text":"[6] Hanoona Bangalath, Muhammad Maaz, Muhammad Uzair ","element":"span"},{"text":"Khattak, Salman H Khan, and Fahad Shahbaz Khan. Bridging the gap between object and image-level representations for open-vocabulary detection. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"NeurIPS","element":"span"},{"text":", 2022. ","element":"span"},{"href":"#id-69","text":"2","element":"a"}],[{"id":"id-0","text":"[7] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Sub- ","element":"span"},{"text":"biah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proc. of NeurIPS","element":"span"},{"text":", 2020. ","element":"span"},{"text":"1","element":"span"}],[{"id":"id-54","text":"[8] Holger Caesar, Jasper Uijlings, and Vittorio Ferrari. Coco- ","element":"span"},{"text":"stuff: Thing and stuff classes in context. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the IEEE conference on computer vision and pattern recognition","element":"span"},{"text":", pages 1209–1218, 2018. ","element":"span"},{"href":"#id-60","text":"6","element":"a"}],[{"id":"id-10","text":"[9] Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, ","element":"span"},{"text":"Feng Zhu, and Rui Zhao. 
","element":"span"},{"text":"Shikra: ","element":"span"},{"text":"Unleashing multi-modal llm’s referential dialogue magic. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2306.15195","element":"span"},{"text":", 2023. ","element":"span"},{"href":"#id-69","text":"2","element":"a"}],[{"id":"id-1","text":"[10] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhang- ","element":"span"},{"text":"hao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. ","element":"span"},{"text":"Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, 2023. ","element":"span"},{"text":"1","element":"span"}],[{"id":"id-5","text":"[11] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat ","element":"span"},{"text":"Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale","element":"span"}],[{"text":"Fung, and Steven C. H. Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proc. of NeurIPS","element":"span"},{"text":", 2023. ","element":"span"},{"text":"1","element":"span"}],[{"id":"id-40","text":"[12] Henghui Ding, Chang Liu, Shuting He, Xudong Jiang, and ","element":"span"},{"text":"Chen Change Loy. ","element":"span"},{"text":"MeViS: A large-scale benchmark for video segmentation with motion expressions. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ICCV","element":"span"},{"text":", 2023. ","element":"span"},{"href":"#id-43","text":"5","element":"a"},{"text":", ","element":"span"},{"href":"#id-60","text":"6","element":"a"},{"text":", ","element":"span"},{"text":"7","element":"span"}],[{"id":"id-2","text":"[13] Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong ","element":"span"},{"text":"Qiu, Zhilin Yang, and Jie Tang. ","element":"span"},{"text":"GLM: general language model pretraining with autoregressive blank infilling. 
","element":"span"},{"text":"In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proc. of ACL","element":"span"},{"text":", pages 320–335, 2022. ","element":"span"},{"text":"1","element":"span"}],[{"id":"id-35","text":"[14] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- ","element":"span"},{"text":"Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2106.09685","element":"span"},{"text":", 2021. ","element":"span"},{"href":"#id-70","text":"4","element":"a"}],[{"id":"id-56","text":"[15] Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and ","element":"span"},{"text":"Tamara Berg. ","element":"span"},{"text":"ReferItGame: Referring to objects in photographs of natural scenes. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)","element":"span"},{"text":", pages 787–798, Doha, Qatar, 2014. Association for Computational Linguistics. ","element":"span"},{"href":"#id-60","text":"6","element":"a"}],[{"id":"id-36","text":"[16] Lei Ke, Henghui Ding, Martin Danelljan, Yu-Wing Tai, Chi- ","element":"span"},{"text":"Keung Tang, and Fisher Yu. Video mask transfiner for high-quality video instance segmentation. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"European Conference on Computer Vision (ECCV)","element":"span"},{"text":", 2022. ","element":"span"},{"href":"#id-43","text":"5","element":"a"}],[{"id":"id-58","text":"[17] Anna Khoreva, Anna Rohrbach, and Bernt Schiele. Video ","element":"span"},{"text":"object segmentation with language referring expressions. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arxiv: 1803.08006","element":"span"},{"text":", 2018. 
","element":"span"},{"href":"#id-60","text":"6","element":"a"}],[{"id":"id-59","text":"[18] Anna Khoreva, Anna Rohrbach, and Bernt Schiele. Video ","element":"span"},{"text":"object segmentation with language referring expressions. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Computer Vision–ACCV 2018: 14th Asian Conference on Computer Vision, Perth, Australia, December 2–6, 2018, Revised Selected Papers, Part IV 14","element":"span"},{"text":", pages 123–141. Springer, 2019. ","element":"span"},{"href":"#id-60","text":"6","element":"a"}],[{"id":"id-47","text":"[19] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, ","element":"span"},{"text":"Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the IEEE/CVF International Conference on Computer Vision","element":"span"},{"text":", pages 4015–4026, 2023. ","element":"span"},{"href":"#id-43","text":"5","element":"a"}],[{"id":"id-6","text":"[20] Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui ","element":"span"},{"text":"Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation via large language model. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2308.00692","element":"span"},{"text":", 2023. ","element":"span"},{"text":"1","element":"span"},{"text":", ","element":"span"},{"href":"#id-69","text":"2","element":"a"},{"text":", ","element":"span"},{"href":"#id-34","text":"3","element":"a"},{"text":", ","element":"span"},{"href":"#id-60","text":"6","element":"a"},{"text":", ","element":"span"},{"text":"7","element":"span"}],[{"id":"id-7","text":"[21] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. ","element":"span"},{"text":"Blip-2: ","element":"span"},{"text":"Bootstrapping language-image pre-training with frozen image encoders and large language models. 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International conference on machine learning","element":"span"},{"text":", pages 19730–19742. PMLR, 2023. ","element":"span"},{"text":"1","element":"span"},{"text":", ","element":"span"},{"href":"#id-69","text":"2","element":"a"}],[{"id":"id-11","text":"[22] Kunchang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, ","element":"span"},{"text":"Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv:2305.06355","element":"span"},{"text":", 2023. ","element":"span"},{"href":"#id-69","text":"2","element":"a"}],[{"id":"id-18","text":"[23] Feng Liang, Bichen Wu, Xiaoliang Dai, Kunpeng Li, Yinan ","element":"span"},{"text":"Zhao, Hang Zhang, Peizhao Zhang, Peter Vajda, and Diana Marculescu. Open-vocabulary semantic segmentation with mask-adapted clip. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition","element":"span"},{"text":", pages 7061–7070, 2023. ","element":"span"},{"href":"#id-69","text":"2","element":"a"}],[{"id":"id-12","text":"[24] Bin Lin, Bin Zhu, Yang Ye, Munan Ning, Peng Jin, and ","element":"span"},{"text":"Li Yuan. ","element":"span"},{"text":"Video-llava: Learning united visual representation by alignment before projection. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2311.10122","element":"span"},{"text":", 2023. ","element":"span"},{"href":"#id-69","text":"2","element":"a"}],[{"id":"id-57","text":"[25] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. ","element":"span"},{"text":"Visual instruction tuning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2304.08485","element":"span"},{"text":", 2023. 
","element":"span"},{"href":"#id-60","text":"6","element":"a"}],[{"id":"id-25","text":"[26] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. ","element":"span"},{"text":"Improved baselines with visual instruction tuning. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition","element":"span"},{"text":", pages 26296–26306, 2024. ","element":"span"},{"href":"#id-69","text":"2","element":"a"}],[{"id":"id-45","text":"[27] Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan ","element":"span"},{"text":"Zhang, Sheng Shen, and Yong Jae Lee. ","element":"span"},{"text":"Llava-next: Improved reasoning, ocr, and world knowledge, 2024. ","element":"span"},{"href":"#id-43","text":"5","element":"a"}],[{"id":"id-8","text":"[28] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. ","element":"span"},{"text":"Visual instruction tuning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in neural information processing systems","element":"span"},{"text":", 36, 2024. ","element":"span"},{"text":"1","element":"span"},{"text":", ","element":"span"},{"href":"#id-69","text":"2","element":"a"}],[{"id":"id-19","text":"[29] Zhaoyang Liu, Yinan He, Wenhai Wang, Weiyun Wang, Yi ","element":"span"},{"text":"Wang, Shoufa Chen, Qinglong Zhang, Yang Yang, Qingyun Li, Jiashuo Yu, et al. Internchat: Solving vision-centric tasks by interacting with chatbots beyond language. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2305.05662","element":"span"},{"text":", 2023. ","element":"span"},{"href":"#id-69","text":"2","element":"a"}],[{"id":"id-13","text":"[30] Muhammad Maaz, Hanoona Rasheed, Salman Khan, and ","element":"span"},{"text":"Fahad Shahbaz Khan. ","element":"span"},{"text":"Video-chatgpt: Towards detailed video understanding via large vision and language models. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv:2306.05424","element":"span"},{"text":", 2023. ","element":"span"},{"href":"#id-69","text":"2","element":"a"},{"text":", ","element":"span"},{"href":"#id-60","text":"6","element":"a"}],[{"id":"id-26","text":"[31] Muhammad Maaz, Hanoona Rasheed, Salman Khan, and ","element":"span"},{"text":"Fahad Khan. Videogpt+: Integrating image and video encoders for enhanced video understanding. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2406.09418","element":"span"},{"text":", 2024. ","element":"span"},{"href":"#id-69","text":"2","element":"a"},{"text":", ","element":"span"},{"href":"#id-70","text":"4","element":"a"},{"text":", ","element":"span"},{"href":"#id-43","text":"5","element":"a"},{"text":", ","element":"span"},{"href":"#id-60","text":"6","element":"a"}],[{"id":"id-14","text":"[32] Shehan Munasinghe, Rusiru Thushara, Muhammad Maaz, ","element":"span"},{"text":"Hanoona Abdul Rasheed, Salman Khan, Mubarak Shah, and Fahad Khan. Pg-video-llava: Pixel grounding large video-language models. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2311.13435","element":"span"},{"text":", 2023. ","element":"span"},{"href":"#id-69","text":"2","element":"a"},{"text":", ","element":"span"},{"href":"#id-34","text":"3","element":"a"},{"text":", ","element":"span"},{"href":"#id-60","text":"6","element":"a"},{"text":", ","element":"span"},{"text":"7","element":"span"}],[{"id":"id-20","text":"[33] Bolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, ","element":"span"},{"text":"Gaofeng Meng, Jianlong Fu, Shiming Xiang, and Haibin Ling. Expanding language-image pretrained models for general video recognition. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"European Conference on Computer Vision","element":"span"},{"text":", pages 1–18. Springer, 2022. 
","element":"span"},{"href":"#id-69","text":"2","element":"a"}],[{"id":"id-46","text":"[34] OpenAI. Gpt-4v(ision) system card. ","element":"span"},{"href":"https://openai.com/research/gpt-4v-system-card","text":"https://openai.com/ ","element":"a"},{"href":"https://openai.com/research/gpt-4v-system-card","text":"research/gpt-4v-system-card","element":"a"},{"text":", 2023. ","element":"span"},{"href":"#id-43","text":"5","element":"a"}],[{"id":"id-3","text":"[35] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, ","element":"span"},{"text":"Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proc. of NeurIPS","element":"span"},{"text":", 2022. ","element":"span"},{"text":"1","element":"span"}],[{"id":"id-27","text":"[36] Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan ","element":"span"},{"text":"Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2306.14824","element":"span"},{"text":", 2023. ","element":"span"},{"href":"#id-69","text":"2","element":"a"}],[{"id":"id-16","text":"[37] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya ","element":"span"},{"text":"Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. ","element":"span"}],[{"text":"In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International conference on machine learning","element":"span"},{"text":", pages 8748–8763. PMLR, 2021. 
","element":"span"},{"href":"#id-69","text":"2","element":"a"},{"text":", ","element":"span"},{"href":"#id-43","text":"5","element":"a"}],[{"id":"id-55","text":"[38] Vignesh Ramanathan, Anmol Kalia, Vladan Petrovic, Yi ","element":"span"},{"text":"Wen, Baixue Zheng, Baishan Guo, Rui Wang, Aaron Marquez, Rama Kovvuri, Abhishek Kadian, et al. Paco: Parts and attributes of common objects. ","element":"span"},{"text":"In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition","element":"span"},{"text":", pages 7141–7151, 2023. ","element":"span"},{"href":"#id-60","text":"6","element":"a"}],[{"id":"id-28","text":"[39] Hanoona Rasheed, Muhammad Maaz, Sahal Shaji, ","element":"span"},{"text":"Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M Anwer, Eric Xing, Ming-Hsuan Yang, and Fahad S Khan. Glamm: Pixel grounding large multimodal model. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2311.03356","element":"span"},{"text":", 2023. ","element":"span"},{"href":"#id-69","text":"2","element":"a"},{"text":", ","element":"span"},{"href":"#id-34","text":"3","element":"a"},{"text":", ","element":"span"},{"href":"#id-60","text":"6","element":"a"},{"text":", ","element":"span"},{"text":"7","element":"span"}],[{"id":"id-52","text":"[40] Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and ","element":"span"},{"text":"Yuxiong He. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"KDD","element":"span"},{"text":", pages 3505–3506, 2020. 
","element":"span"},{"href":"#id-60","text":"6","element":"a"}],[{"id":"id-50","text":"[41] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang ","element":"span"},{"text":"Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer. Sam 2: Segment anything in images and videos. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2408.00714","element":"span"},{"text":", 2024. ","element":"span"},{"href":"#id-60","text":"6","element":"a"},{"text":", ","element":"span"},{"text":"7","element":"span"}],[{"id":"id-44","text":"[42] Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry ","element":"span"},{"text":"Lepikhin, Timothy Lillicrap, Jean-baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, et al. ","element":"span"},{"text":"Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2403.05530","element":"span"},{"text":", 2024. ","element":"span"},{"href":"#id-43","text":"5","element":"a"}],[{"id":"id-21","text":"[43] David Rozenberszki, Or Litany, and Angela Dai. ","element":"span"},{"text":"Language-grounded indoor 3d semantic segmentation in the wild. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"European Conference on Computer Vision","element":"span"},{"text":", pages 125–141. Springer, 2022. ","element":"span"},{"href":"#id-69","text":"2","element":"a"}],[{"id":"id-39","text":"[44] Seonguk Seo, Joon-Young Lee, and Bohyung Han. Urvos: ","element":"span"},{"text":"Unified referring video object segmentation network with a large-scale benchmark. 
","element":"span"},{"text":"In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XV 16","element":"span"},{"text":", pages 208–223. Springer, 2020. ","element":"span"},{"href":"#id-43","text":"5","element":"a"}],[{"id":"id-64","text":"[45] Fengyuan Shi, Jiaxi Gu, Hang Xu, Songcen Xu, Wei Zhang, ","element":"span"},{"text":"and Limin Wang. ","element":"span"},{"text":"Bivdiff: A training-free framework for general-purpose video synthesis via bridging image and video diffusion models. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","element":"span"},{"text":", pages 7393–7402, 2024. ","element":"span"},{"text":"8","element":"span"},{"text":", ","element":"span"},{"text":"1","element":"span"}],[{"id":"id-42","text":"[46] Zongheng Tang, Yue Liao, Si Liu, Guanbin Li, Xiaojie Jin, ","element":"span"},{"text":"Hongxu Jiang, Qian Yu, and Dong Xu. ","element":"span"},{"text":"Human-centric spatio-temporal video grounding with visual transformers. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"IEEE Transactions on Circuits and Systems for Video Technology","element":"span"},{"text":", 32(12):8238–8249, 2021. ","element":"span"},{"href":"#id-43","text":"5","element":"a"}],[{"id":"id-4","text":"[47] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier ","element":"span"},{"text":"Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurélien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. ","element":"span"},{"text":"Llama: Open and efficient foundation language models. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CoRR","element":"span"},{"text":", 2023. 
","element":"span"},{"text":"1","element":"span"}],[{"id":"id-22","text":"[48] Mengmeng Wang, Jiazheng Xing, and Yong Liu. Actionclip: ","element":"span"},{"text":"A new paradigm for video action recognition. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2109.08472","element":"span"},{"text":", 2021. ","element":"span"},{"href":"#id-69","text":"2","element":"a"}],[{"id":"id-29","text":"[49] Weiyun Wang, Min Shi, Qingyun Li, Wenhai Wang, Zhen- ","element":"span"},{"text":"hang Huang, Linjie Xing, Zhe Chen, Hao Li, Xizhou Zhu, Zhiguo Cao, et al. The all-seeing project: Towards panoptic visual recognition and understanding of the open world. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2308.01907","element":"span"},{"text":", 2023. ","element":"span"},{"href":"#id-69","text":"2","element":"a"}],[{"id":"id-48","text":"[50] Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan ","element":"span"},{"text":"He, Guo Chen, Baoqi Pei, Rongkun Zheng, Jilan Xu, Zun Wang, et al. Internvideo2: Scaling video foundation models for multimodal video understanding. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2403.15377","element":"span"},{"text":", 2024. ","element":"span"},{"href":"#id-60","text":"6","element":"a"}],[{"id":"id-30","text":"[51] Haoxuan You, Haotian Zhang, Zhe Gan, Xianzhi Du, Bowen ","element":"span"},{"text":"Zhang, Zirui Wang, Liangliang Cao, Shih-Fu Chang, and Yinfei Yang. Ferret: Refer and ground anything anywhere at any granularity. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2310.07704","element":"span"},{"text":", 2023. ","element":"span"},{"href":"#id-69","text":"2","element":"a"}],[{"id":"id-15","text":"[52] Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An ","element":"span"},{"text":"instruction-tuned audio-visual language model for video understanding. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv:2306.02858","element":"span"},{"text":", 2023. ","element":"span"},{"href":"#id-69","text":"2","element":"a"},{"text":", ","element":"span"},{"href":"#id-43","text":"5","element":"a"}],[{"id":"id-41","text":"[53] Zhu Zhang, Zhou Zhao, Yang Zhao, Qi Wang, Huasheng ","element":"span"},{"text":"Liu, and Lianli Gao. Where does it exist: Spatio-temporal video grounding for multi-form sentences. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CVPR","element":"span"},{"text":", 2020. ","element":"span"},{"href":"#id-43","text":"5","element":"a"},{"text":", ","element":"span"},{"href":"#id-60","text":"6","element":"a"}],[{"id":"id-31","text":"[54] Yang Zhao, Zhijie Lin, Daquan Zhou, Zilong Huang, Jiashi ","element":"span"},{"text":"Feng, and Bingyi Kang. Bubogpt: Enabling visual grounding in multi-modal llms. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2307.08581","element":"span"},{"text":", 2023. ","element":"span"},{"href":"#id-69","text":"2","element":"a"},{"text":", ","element":"span"},{"href":"#id-34","text":"3","element":"a"}],[{"id":"id-53","text":"[55] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela ","element":"span"},{"text":"Barriuso, and Antonio Torralba. ","element":"span"},{"text":"Scene parsing through ade20k dataset. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","element":"span"},{"text":", 2017. ","element":"span"},{"href":"#id-60","text":"6","element":"a"}],[{"id":"id-38","text":"[56] Luowei Zhou, Yannis Kalantidis, Xinlei Chen, Jason J ","element":"span"},{"text":"Corso, and Marcus Rohrbach. Grounded video description. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CVPR","element":"span"},{"text":", 2019. 
","element":"span"},{"href":"#id-43","text":"5","element":"a"}],[{"id":"id-9","text":"[57] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mo- ","element":"span"},{"text":"hamed Elhoseiny. ","element":"span"},{"text":"Minigpt-4: Enhancing vision-language understanding with advanced large language models. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2304.10592","element":"span"},{"text":", 2023. ","element":"span"},{"text":"1","element":"span"},{"text":", ","element":"span"},{"href":"#id-69","text":"2","element":"a"}],[{"text":"[58] Jiawen Zhu, Zhi-Qi Cheng, Jun-Yan He, Chenyang Li, ","element":"span"},{"text":"Bin Luo, Huchuan Lu, Yifeng Geng, and Xuansong Xie. Tracking with human-intent reasoning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2312.17448","element":"span"},{"text":", 2023. ","element":"span"},{"text":"7","element":"span"},{"text":", ","element":"span"},{"text":"1","element":"span"}]]}],"_version":"3.3.4"},"paperNode":"$28:props:children:props:children:0:props:product"}]]