36:[["$","audio",null,{"id":"tts"}],["$","$L3b",null,{"paperID":"32394","publisher":"cvpr","paperJSON":{"title":"ViCaS: A Dataset for Combining Holistic and Pixel-level Video Understanding using Captions with Grounded Segmentation","paperID":"32394","avgLineHeight":11.96,"imgScale":4,"sections":[{"heading":"Abstract","paragraphs":[[{"style":{"fontStyle":"italic"},"text":"Recent advances in multimodal large language models (MLLMs) have expanded research in video understanding, primarily focusing on high-level tasks such as video captioning and question-answering. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Meanwhile, a smaller body of work addresses dense, pixel-precise segmentation tasks, which typically involve category-guided or referral-based object segmentation. Although both directions are essential for developing models with human-level video comprehension, they have largely evolved separately, with distinct benchmarks and architectures. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"This paper aims to unify these efforts by introducing ViCaS, a new dataset containing thousands of challenging videos, each annotated with detailed, human-written captions and temporally consistent, pixel-accurate masks for multiple objects with phrase grounding. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Our benchmark evaluates models on both holistic/high-level understanding and language-guided, pixel-precise segmentation. We also present carefully validated evaluation measures and propose an effective model architecture that can tackle our benchmark.","element":"span"}]]},{"heading":"1. 
Introduction","paragraphs":[[{"text":"The emergence of open-source large language models (LLMs) [","element":"span"},{"text":"9","element":"span"},{"text":", ","element":"span"},{"text":"32","element":"span"},{"text":", ","element":"span"},{"text":"96","element":"span"},{"text":"], coupled with progress in aligning vision and language feature spaces [","element":"span"},{"text":"55","element":"span"},{"text":", ","element":"span"},{"text":"58","element":"span"},{"text":", ","element":"span"},{"text":"83","element":"span"},{"text":"], has enabled significant research into vision-language models that can jointly reason over visual and language inputs. Earlier advances in image-language models [","element":"span"},{"text":"8","element":"span"},{"text":", ","element":"span"},{"text":"56","element":"span"},{"text":", ","element":"span"},{"text":"66","element":"span"},{"text":"] have spurred development of multimodal LLMs capable of reasoning over other modalities such as videos [","element":"span"},{"text":"63","element":"span"},{"text":"] and 3D point-clouds [","element":"span"},{"text":"128","element":"span"},{"text":"].","element":"span"}],[{"text":"In the field of video-language models, research has mainly focused on holistic video understanding tasks such as video captioning and question-answering [","element":"span"},{"text":"63","element":"span"},{"text":", ","element":"span"},{"text":"67","element":"span"},{"text":", ","element":"span"},{"text":"72","element":"span"},{"text":", ","element":"span"},{"text":"73","element":"span"},{"text":", ","element":"span"},{"text":"113","element":"span"},{"text":", ","element":"span"},{"text":"127","element":"span"},{"text":"], with a particular emphasis on generating temporally dense text output for long video sequences 
[","element":"span"},{"text":"18","element":"span"},{"text":", ","element":"span"},{"text":"62","element":"span"},{"text":", ","element":"span"},{"text":"92","element":"span"},{"text":", ","element":"span"},{"text":"118","element":"span"},{"text":", ","element":"span"},{"text":"132","element":"span"},{"text":"]. By contrast, comparatively less attention has been given to predicting spatially dense outputs (such as bounding boxes or segmentation masks) that can localize key objects and actors in videos based on text prompts, even though this has important practical applications, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"e.g","element":"span"},{"text":"., autonomous robots and video editing. Although recent language-driven benchmarks [","element":"span"},{"text":"29","element":"span"},{"text":", ","element":"span"},{"text":"47","element":"span"},{"text":", ","element":"span"},{"text":"89","element":"span"},{"text":", ","element":"span"},{"text":"117","element":"span"},{"text":"] in the video segmentation community have made progress in this area, these datasets are largely object-centric and lack evaluation of high-level, holistic video understanding.","element":"span"}],[{"style":{"width":"100%"},"width":948,"height":918,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/32394/images/0-0.png","element":"img"}],[{"text":"Figure 1. ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"ViCaS Dataset/Benchmark. ","element":"figcaption","subtype":"caption"},{"text":"Our dataset contains detailed video captions with phrase-level grounding for accurate object segmentation masks. The benchmark comprises two tasks to evaluate holistic and pixel-level video understanding, respectively.","element":"figcaption","subtype":"caption"}],[{"text":"In this paper, we seek to bridge the gap between benchmarks that evaluate holistic video understanding and those focused on pixel-level localization. 
We introduce ViCaS, a new video dataset and benchmark containing thousands of videos with detailed, human-annotated captions. In these captions, words and phrases corresponding to key objects are grounded in human-drawn, pixel-precise segmentation masks that span the entire video duration. To the best of our knowledge, this is the first video dataset to offer human-labeled annotations of this kind.","element":"span"}],[{"text":"Our benchmark comprises two tasks: (1) Video Captioning, which requires describing the video events and objects in detail, and (2) our newly proposed Language-Guided Video Instance Segmentation (LG-VIS) task, which requires predicting temporally consistent segmentation masks for multiple objects based on a text prompt. An example of our dataset annotations and tasks is given in Fig. ","element":"span"},{"text":"1","element":"span"},{"text":". To effectively evaluate Video Captioning, we conduct a comprehensive user study to validate the evaluation measures used for open-ended text similarity by comparing several existing approaches. Finally, we introduce Video-LLaVA-Seg, an end-to-end architecture designed to effectively tackle our benchmark by integrating insights from recent vision-language models [","element":"span"},{"text":"63","element":"span"},{"text":", ","element":"span"},{"text":"67","element":"span"},{"text":"] and prompt-based video segmentation approaches [","element":"span"},{"text":"53","element":"span"},{"text":", ","element":"span"},{"text":"86","element":"span"},{"text":"]. In summary, our contributions are as follows:","element":"span"}],[{"text":"• We introduce a large, human-annotated video dataset of detailed text captions with phrase-level grounding for objects, accompanied by pixel-precise segmentation masks. 
This dataset enables the evaluation of both holistic and pixel-level video understanding.","element":"span"}],[{"text":"• We propose accurate, reproducible evaluation measures for open-ended text similarity, which are verified by a comprehensive user study.","element":"span"}],[{"text":"• We present Video-LLaVA-Seg, an effective, end-to-end trained architecture that can tackle our benchmark.","element":"span"}]]},{"heading":"2. Related Work","paragraphs":[[{"text":"Although video understanding is a well-researched topic in computer vision, researchers have traditionally approached holistic and pixel-level video understanding as separate streams, each with its own datasets and benchmarks. Below, we review related work in these areas.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"2.1. Holistic Video Understanding","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Video Classification. ","element":"span"},{"text":"This is one of the earliest tasks in video understanding, popularized by activity recognition datasets [","element":"span"},{"text":"52","element":"span"},{"text":", ","element":"span"},{"text":"93","element":"span"},{"text":", ","element":"span"},{"text":"125","element":"span"},{"text":"], and later expanded by larger datasets [","element":"span"},{"text":"37","element":"span"},{"text":", ","element":"span"},{"text":"38","element":"span"},{"text":", ","element":"span"},{"text":"91","element":"span"},{"text":"], ","element":"span"},{"style":{"fontStyle":"italic"},"text":"e.g","element":"span"},{"text":"., Kinetics [","element":"span"},{"text":"46","element":"span"},{"text":"]. Early deep learning approaches employed 3D CNNs [","element":"span"},{"text":"39","element":"span"},{"text":", ","element":"span"},{"text":"45","element":"span"},{"text":", ","element":"span"},{"text":"97","element":"span"},{"text":"–","element":"span"},{"text":"99","element":"span"},{"text":"] to capture spatio-temporal information. 
","element":"span"},{"text":"With the popularization of transformers [","element":"span"},{"text":"100","element":"span"},{"text":"], attention-based architectures [","element":"span"},{"text":"4","element":"span"},{"text":", ","element":"span"},{"text":"68","element":"span"},{"text":", ","element":"span"},{"text":"79","element":"span"},{"text":"] emerged, offering improved performance. Although these datasets and architectures have greatly advanced video understanding, they are limited from a language perspective, typically assigning a single, predefined label to each video.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Video Captioning and Question-Answering. ","element":"span"},{"text":"Alongside classification, language-oriented tasks such as video captioning and question-answering (Q/A) have gained research attention. Early datasets like MSVD [","element":"span"},{"text":"17","element":"span"},{"text":"], MSR-VTT [","element":"span"},{"text":"111","element":"span"},{"text":"], and TGIF-QA [","element":"span"},{"text":"61","element":"span"},{"text":"] laid the groundwork for video captioning and were later adapted as Q/A benchmarks [","element":"span"},{"text":"110","element":"span"},{"text":"]. Initial approaches [","element":"span"},{"text":"31","element":"span"},{"text":", ","element":"span"},{"text":"112","element":"span"},{"text":"] combined CNNs for visual reasoning with RNNs for text generation. As transformers gained popularity, architectures like VideoBERT [","element":"span"},{"text":"94","element":"span"},{"text":"] and UniVL [","element":"span"},{"text":"71","element":"span"},{"text":"] were among the first to unify vision and language by learning shared representations for both modalities.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Multimodal Large Language Models (MLLMs). 
","element":"span"},{"text":"The recent popularization of LLMs [","element":"span"},{"text":"9","element":"span"},{"text":", ","element":"span"},{"text":"24","element":"span"},{"text":", ","element":"span"},{"text":"32","element":"span"},{"text":", ","element":"span"},{"text":"96","element":"span"},{"text":"] has enabled research on multimodal models which extend LLMs to process visual inputs such as images [","element":"span"},{"text":"14","element":"span"},{"text":", ","element":"span"},{"text":"19","element":"span"},{"text":", ","element":"span"},{"text":"28","element":"span"},{"text":", ","element":"span"},{"text":"66","element":"span"},{"text":", ","element":"span"},{"text":"124","element":"span"},{"text":"] and videos [","element":"span"},{"text":"18","element":"span"},{"text":", ","element":"span"},{"text":"62","element":"span"},{"text":", ","element":"span"},{"text":"63","element":"span"},{"text":", ","element":"span"},{"text":"72","element":"span"},{"text":", ","element":"span"},{"text":"73","element":"span"},{"text":", ","element":"span"},{"text":"92","element":"span"},{"text":"]. This research has been supported by large-scale video captioning datasets [","element":"span"},{"text":"11","element":"span"},{"text":", ","element":"span"},{"text":"20","element":"span"},{"text":", ","element":"span"},{"text":"116","element":"span"},{"text":"] and multi-task video understanding benchmarks [","element":"span"},{"text":"25","element":"span"},{"text":", ","element":"span"},{"text":"57","element":"span"},{"text":"].","element":"span"}],[{"text":"Overall, the datasets and models discussed above focus primarily on high-level, holistic video understanding and do not address fine-grained, pixel-level localization.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"2.2. Pixel-level Video Understanding","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Object Tracking and Segmentation. 
","element":"span"},{"text":"Object localization and tracking is a deeply studied problem, even prior to the deep learning era [","element":"span"},{"text":"2","element":"span"},{"text":", ","element":"span"},{"text":"78","element":"span"},{"text":"]. This research has been propelled by several datasets, with early efforts focused on bounding-box-level object tracking [","element":"span"},{"text":"26","element":"span"},{"text":", ","element":"span"},{"text":"27","element":"span"},{"text":", ","element":"span"},{"text":"34","element":"span"},{"text":", ","element":"span"},{"text":"36","element":"span"},{"text":", ","element":"span"},{"text":"44","element":"span"},{"text":", ","element":"span"},{"text":"51","element":"span"},{"text":", ","element":"span"},{"text":"122","element":"span"},{"text":"]. As model architectures advanced, benchmarks for pixel-precise video segmentation emerged, covering tasks such as object segmentation based on predefined categories [","element":"span"},{"text":"7","element":"span"},{"text":", ","element":"span"},{"text":"48","element":"span"},{"text":", ","element":"span"},{"text":"74","element":"span"},{"text":", ","element":"span"},{"text":"75","element":"span"},{"text":", ","element":"span"},{"text":"81","element":"span"},{"text":", ","element":"span"},{"text":"82","element":"span"},{"text":", ","element":"span"},{"text":"102","element":"span"},{"text":", ","element":"span"},{"text":"108","element":"span"},{"text":", ","element":"span"},{"text":"119","element":"span"},{"text":"] or segmenting specific objects given their first-frame ground-truth masks [","element":"span"},{"text":"30","element":"span"},{"text":", ","element":"span"},{"text":"80","element":"span"},{"text":", ","element":"span"},{"text":"115","element":"span"},{"text":"] or points [","element":"span"},{"text":"7","element":"span"},{"text":", ","element":"span"},{"text":"43","element":"span"},{"text":", ","element":"span"},{"text":"133","element":"span"},{"text":"]. 
Popular approaches have evolved from tracking-by-detection methods [","element":"span"},{"text":"16","element":"span"},{"text":", ","element":"span"},{"text":"69","element":"span"},{"text":", ","element":"span"},{"text":"95","element":"span"},{"text":", ","element":"span"},{"text":"102","element":"span"},{"text":"] to end-to-end trainable transformer-based [","element":"span"},{"text":"6","element":"span"},{"text":", ","element":"span"},{"text":"40","element":"span"},{"text":", ","element":"span"},{"text":"42","element":"span"},{"text":", ","element":"span"},{"text":"49","element":"span"},{"text":", ","element":"span"},{"text":"90","element":"span"},{"text":", ","element":"span"},{"text":"107","element":"span"},{"text":", ","element":"span"},{"text":"109","element":"span"},{"text":", ","element":"span"},{"text":"130","element":"span"},{"text":"], and object-level attention-based architectures [","element":"span"},{"text":"5","element":"span"},{"text":", ","element":"span"},{"text":"21","element":"span"},{"text":"– ","element":"span"},{"text":"23","element":"span"},{"text":", ","element":"span"},{"text":"76","element":"span"},{"text":"]. Despite substantial progress, these works focus solely on tracking and segmentation, without addressing holistic video understanding.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Language-Guided Segmentation. ","element":"span"},{"text":"Recent progress in vision-language models has motivated multiple language-guided video segmentation datasets [","element":"span"},{"text":"103","element":"span"},{"text":"]. 
These include “open-vocabulary” datasets [","element":"span"},{"text":"104","element":"span"},{"text":", ","element":"span"},{"text":"105","element":"span"},{"text":"] that cover a large set of object classes, and referral-based video segmentation benchmarks [","element":"span"},{"text":"29","element":"span"},{"text":", ","element":"span"},{"text":"47","element":"span"},{"text":", ","element":"span"},{"text":"89","element":"span"},{"text":"], which require segmenting objects based on text prompts. Although more language-oriented, these datasets are still object-centric and do not involve predicting a high-level description of the entire video.","element":"span"}],[{"text":"Some approaches [","element":"span"},{"text":"29","element":"span"},{"text":", ","element":"span"},{"text":"41","element":"span"},{"text":", ","element":"span"},{"text":"123","element":"span"},{"text":"] for these tasks use text and image backbones to encode the text prompt and video frames, followed by a transformer decoder and segmentation head to generate masks. Meanwhile, recent LLM-based approaches [","element":"span"},{"text":"10","element":"span"},{"text":", ","element":"span"},{"text":"117","element":"span"},{"text":"] use a special ","element":"span"},{"text":"SEG ","element":"span"},{"text":"vocabulary token in conjunction with a segmentation network to predict the target masks. 
These approaches typically use several image-level and video segmentation datasets for training, and rely on other models [","element":"span"},{"text":"13","element":"span"},{"text":", ","element":"span"},{"text":"62","element":"span"},{"text":"] for dynamic frame selection or temporal mask propagation.","element":"span"}],[{"style":{"width":"98%"},"width":1956,"height":689,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/32394/images/2-0.png","element":"img"}],[{"text":"Figure 2. ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"ViCaS Examples. ","element":"figcaption","subtype":"caption"},{"text":"Our dataset showcases diverse scenes with a variety of objects and video events, along with detailed captions. 
Phrases referring to multiple objects are written with multiple colors, ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"e.g","element":"figcaption","subtype":"caption"},{"text":"., “three yellow balls” in row 2 references three different objects.","element":"figcaption","subtype":"caption"}],[{"text":"In contrast to these two categories of work, we propose a unified dataset that provides the annotations as well as benchmark tasks to evaluate both holistic/high-level and pixel-level video understanding. Additionally, we propose a baseline architecture that is end-to-end trainable and can be effectively trained for segmentation using only our dataset, thereby making it easy to set up and extend.","element":"span"}]]},{"heading":"3. ViCaS Dataset","paragraphs":[[{"text":"Our dataset is designed to evaluate both holistic and pixel-level video understanding. ","element":"span"},{"text":"To this end, we annotate 20,416 videos with detailed captions in which words/phrases referencing salient objects are grounded with temporally consistent segmentation masks. All annotations are done by professional human annotators.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"3.1. Video Source","element":"span"}],[{"text":"For our objective, it is crucial to annotate videos which contain meaningful events that can only be explained through effective temporal reasoning. 
At the same time, to effectively evaluate pixel-level understanding, the videos should feature challenging scenes with multiple moving objects.","element":"span"}],[{"text":"To meet these requirements, we annotate videos from three sources: (1) the Oops dataset [","element":"span"},{"text":"33","element":"span"},{"text":"], a collection of ‘fail videos’ from the internet, (2) Unidentified Video Object (UVO) [","element":"span"},{"text":"105","element":"span"},{"text":"], which is a video segmentation dataset, and (3) Kinetics-700 [","element":"span"},{"text":"46","element":"span"},{"text":"], a popular human action recognition dataset. These sources are suitable for our dataset since they contain in-the-wild videos with diverse objects and backgrounds. Moreover, they contain videos with multiple objects undergoing appearance and shape changes, and fast motion. This combination of attributes makes the videos challenging for both captioning and segmentation.","element":"span"}],[{"text":"We annotate 20,416 videos with durations ranging from 4 to 30 seconds (distribution illustrated in Fig. ","element":"span"},{"text":"3c","element":"span"},{"text":"). Note that we use only the raw videos from these datasets and disregard their annotations. More details on video selection are given in the supplementary material.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"3.2. Annotation Process","element":"span"}],[{"text":"The annotation process consists of two main steps: first, a detailed caption is written (Step 1a) in which objects selected for segmentation masks are identified and marked using a specific syntax (Step 1b). In the second step, a different annotator reviews the video and its text, and draws the corresponding segmentation masks (Step 2). The process is illustrated in Fig. ","element":"span"},{"text":"4","element":"span"},{"text":". Further details are given below.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Step 1a: Text Captions. 
","element":"span"},{"text":"Professional annotators fluent in English were tasked with writing a detailed caption covering the events in each video. To ensure consistency in the level of detail and writing style, they received comprehensive guidelines with several examples. Annotators were advised to think as if they had ","element":"span"},{"style":{"height":4},"width":27,"height":10,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/32394/images/2-1.png","element":"img","alt":" ∼","inline":true},{"text":"30 seconds to convey the video content in as much detail as possible over an audio call. They were instructed to describe the ‘fail’ event [","element":"span"},{"text":"33","element":"span"},{"text":"], the actions and movements of objects, as well as the background scene and any additional, interesting elements. Since our dataset is video-centric, annotators were explicitly guided to include temporal information, with the instruction: “","element":"span"},{"style":{"fontStyle":"italic"},"text":"Focus on details that require watching the entire video and cannot be inferred from a few frames.","element":"span"},{"text":"”","element":"span"}],[{"style":{"width":"100%"},"width":1980,"height":528,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/32394/images/3-0.png","element":"img"}],[{"text":"Figure 3. ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"ViCaS Dataset. ","element":"figcaption","subtype":"caption"},{"text":"(a): Word cloud for nouns in grounding phrases. (b), (c): Histograms for caption length and video duration, respectively. 
Since these distributions are long-tailed, the right-most bar labeled ‘","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"n","element":"figcaption","subtype":"caption"},{"text":"+","element":"figcaption","subtype":"caption"},{"text":"’ captures all values ","element":"figcaption","subtype":"caption"},{"style":{"height":11.79},"width":57.48,"height":29.48,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/32394/images/3-1.png","element":"img","alt":" ≥ n","inline":true,"padRight":true},{"text":"to prevent visual distortion.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"100%"},"width":950,"height":472,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/32394/images/3-2.png","element":"img"}],[{"text":"Figure 4. ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"Annotation Process. ","element":"figcaption","subtype":"caption"},{"text":"Our annotation process consists of two steps. Step 1: A detailed video caption is written by human annotators. Salient objects are identified and marked using a special syntax (highlighted in ","element":"figcaption","subtype":"caption"},{"text":"blue","element":"figcaption","subtype":"caption"},{"text":"). Step 2: Segmentation masks are drawn for the salient objects throughout the video.","element":"figcaption","subtype":"caption"}],[{"style":{"fontWeight":"bold"},"text":"Step 1b: ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Phrase Grounding. ","element":"span"},{"text":"Alongside writing the text caption, annotators were instructed to identify salient objects/actors in the video, and mark the corresponding phrases or words using a specific syntax, as illustrated in Fig. ","element":"span"},{"text":"4","element":"span"},{"text":". 
","element":"span"},{"text":"This syntax is programmatically parsable to extract object IDs, and can be stripped to produce a standard, human-readable caption. To streamline the mask-drawing process, annotators were advised to mark no more than 10 objects per video.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Step 2: Segmentation Masks. ","element":"span"},{"text":"In the second step, a different annotator reviews the video and its caption, and uses a polygon tool to draw segmentation masks for each marked object. The annotators were instructed to read the text carefully and watch the video multiple times to ensure that correct masks were drawn for each grounding phrase. To balance annotation accuracy and cost, we annotate masks at 1 frame per second (fps) and then use an off-the-shelf SAM2 [","element":"span"},{"text":"86","element":"span"},{"text":"] model to propagate these masks to adjacent frames, producing temporally dense mask annotations at 30 fps. A similar approach has been successfully applied in other video segmentation datasets [","element":"span"},{"text":"7","element":"span"},{"text":", ","element":"span"},{"text":"105","element":"span"},{"text":"].","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Annotation Statistics and Quality Control. ","element":"span"},{"text":"For Step 1, we employed 24 annotators and 4 quality inspectors who reviewed/corrected the text captions and language grounding. On average, each video required 9 minutes for annotation and quality checking. For Step 2, we hired 40 annotators and 5 quality inspectors, with each video frame requiring an average of 7 minutes to annotate.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"3.3. Benchmark Design and Evaluation Tasks","element":"span"}],[{"text":"Leveraging our annotations, we create a benchmark to evaluate both high-level video understanding and pixel-precise segmentation. 
Consequently, it includes two tasks:","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"1) Video Captioning. ","element":"span"},{"text":"In this task, the model is expected to produce an open-ended text summary that explains the events in the video, including descriptions of salient objects and background elements.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"2) Language-Guided Video Instance Segmentation (LG-VIS). ","element":"span"},{"text":"For this task, the model receives a text prompt describing a specific set of objects, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"e.g","element":"span"},{"text":"., ‘","element":"span"},{"style":{"fontStyle":"italic"},"text":"Where is the person who walks from left to right?","element":"span"},{"text":"’, and is expected to predict segmentation masks for the corresponding objects. For cases where multiple objects are referenced, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"e.g","element":"span"},{"text":"., ‘","element":"span"},{"style":{"fontStyle":"italic"},"text":"Where is the group of kittens that are playing?","element":"span"},{"text":"’, a separate mask is required for each object. This is in contrast to Referral Video Object Segmentation [","element":"span"},{"text":"29","element":"span"},{"text":", ","element":"span"},{"text":"47","element":"span"},{"text":", ","element":"span"},{"text":"89","element":"span"},{"text":"], where only a single, binary mask covering all target objects is required.","element":"span"}],[{"text":"From a vision-language perspective, both tasks are forms of Video Question-Answering. For Video Captioning, the question asks for a video description, with the answer provided as open-ended text. 
For LG-VIS, the question is a ‘","element":"span"},{"style":{"fontStyle":"italic"},"text":"Where is?","element":"span"},{"text":"’ query regarding specific objects, and the answer is a video-length segmentation mask for each object.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Converting Grounded Captions to LG-VIS Prompts. ","element":"span"},{"text":"As outlined in Sec. ","element":"span"},{"text":"3.2","element":"span"},{"text":", our annotations contain phrase grounding for labeled objects. We use GPT-4 [","element":"span"},{"text":"1","element":"span"},{"text":"] to convert these grounded captions into ‘","element":"span"},{"style":{"fontStyle":"italic"},"text":"Where is?","element":"span"},{"text":"’ style questions, which serve as text prompts for the LG-VIS task. Further details about this process are provided in the supplementary material.","element":"span"}],[{"style":{"width":"100%"},"width":1986,"height":1032,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/32394/images/4-0.png","element":"img"}],[{"text":"Table 1. ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"Dataset Comparison. ","element":"figcaption","subtype":"caption"},{"text":"In the ‘Videos’ column, entries formatted as ‘X / Y’ indicate a total of X videos with Y labeled sub-clips, with the average sub-clip duration being reported.","element":"figcaption","subtype":"caption"}],[{"style":{"fontWeight":"bold"},"text":"Dataset Splits. ","element":"span"},{"text":"We split the 20,416 videos into train, validation, and test sets with 14,516 videos for training, and 2,950 each for validation and testing. Statistics for each set are given in the supplementary material.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"3.4. 
Comparison with Related Datasets","element":"span"}],[{"text":"Table ","element":"span"},{"text":"1 ","element":"span"},{"text":"provides a quantitative comparison between ViCaS and existing datasets, showing that ViCaS uniquely combines text captions and pixel-precise segmentation masks.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Captioning and Q/A. ","element":"span"},{"text":"For text descriptions, ViCaS prioritizes quality over quantity, offering detailed, high-quality, human-written captions with phrase grounding for objects. ","element":"span"},{"text":"Our captions average 39 words (distribution illustrated in Fig. ","element":"span"},{"text":"3b","element":"span"},{"text":"), significantly longer than those in other datasets. While ViCaS annotates more videos than early datasets like MSVD [","element":"span"},{"text":"17","element":"span"},{"text":"] and MSR-VTT [","element":"span"},{"text":"111","element":"span"},{"text":"], later datasets with temporally dense captions, such as DiDeMo [","element":"span"},{"text":"3","element":"span"},{"text":"] and YouCook2 [","element":"span"},{"text":"131","element":"span"},{"text":"], include more labeled video segments. Recently, large-scale datasets such as WebVid10M [","element":"span"},{"text":"11","element":"span"},{"text":"] and Panda70M [","element":"span"},{"text":"20","element":"span"},{"text":"] have emerged, containing millions of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"automatically ","element":"span"},{"text":"captioned videos sourced from the internet. While useful for (pre-)training large models, these datasets offer shorter, less detailed descriptions and have neither segmentation masks nor phrase grounding.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Referral VOS. 
","element":"span"},{"text":"Compared to Referral Video Object Segmentation (VOS) datasets, ViCaS stands out by providing holistic, video-level captions in addition to object-level language expressions. With 20,416 annotated videos, ViCaS is significantly larger than the second-largest ReferYouTubeVOS [","element":"span"},{"text":"89","element":"span"},{"text":"], which contains only 3,978 videos. Although MeViS [","element":"span"},{"text":"29","element":"span"},{"text":"] features longer videos, averaging 13.2s compared to our 9.1s, ViCaS offers significantly more annotated objects (65,588) than MeViS (8,171).","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Class-Guided Segmentation. ","element":"span"},{"text":"Compared to class-guided segmentation datasets, ViCaS provides more nuanced language labels, going beyond single-word category labels typical in Video Instance Segmentation (VIS) [","element":"span"},{"text":"81","element":"span"},{"text":", ","element":"span"},{"text":"119","element":"span"},{"text":"] datasets. Using NLTK [","element":"span"},{"text":"15","element":"span"},{"text":"], we found that our object-grounding phrases contain 11,492 unique nouns/noun-phrases (illustrated in Fig. ","element":"span"},{"text":"3a","element":"span"},{"text":"), which is significantly higher than the 1,196 object categories in LVVIS [","element":"span"},{"text":"104","element":"span"},{"text":"]. Furthermore, ViCaS contains more videos than the others, and also contains more labeled object tracks than all other datasets except UVO [","element":"span"},{"text":"105","element":"span"},{"text":"]. Lastly, ViCaS stands out by providing detailed text captions in addition to the segmentation masks.","element":"span"}]]},{"heading":"4. Evaluation Measures","paragraphs":[[{"text":"As outlined in Sec. ","element":"span"},{"text":"3.3","element":"span"},{"text":", our benchmark comprises two tasks: Video Captioning and LG-VIS. 
Evaluating video caption quality is challenging since it requires computing the similarity between two open-ended text passages: the human-written ground truth and the model’s prediction. To address this, we first conduct a comprehensive user study for scoring open-ended text similarity in Sec. ","element":"span"},{"text":"4.1","element":"span"},{"text":", before introducing the selected evaluation measures in Sec. ","element":"span"},{"text":"4.2","element":"span"},{"text":".","element":"span"}],[{"style":{"width":"100%"},"width":952,"height":678,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/32394/images/5-0.png","element":"img"}],[{"text":"Table 2. ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"Caption Scoring User Study. ","element":"figcaption","subtype":"caption"},{"text":"Results for Part 1 (Correlation with Human Scores) and Part 2 (Robustness to Hard Positives/Negatives). ","element":"figcaption","subtype":"caption"},{"style":{"height":13.41},"width":260.52,"height":33.52,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/32394/images/5-1.png","element":"img","alt":" r, rmax and ravg","inline":true,"padRight":true},{"text":"denote Pearson correlation coefficients. ","element":"figcaption","subtype":"caption"},{"style":{"height":12},"width":76,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/32394/images/5-2.png","element":"img","alt":" ∆PN","inline":true,"padRight":true},{"text":"is the absolute difference in predicted scores.","element":"figcaption","subtype":"caption"}],[{"style":{"fontWeight":"bold"},"text":"4.1. 
User Study for Video Caption Scoring","element":"span"}],[{"text":"To decide evaluation measures for video captioning, we carried out a two-part user study comparing several scoring methods which are categorized into three groups: (1) classical text similarity metrics, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"e.g","element":"span"},{"text":"., METEOR [","element":"span"},{"text":"12","element":"span"},{"text":"] and BLEU4 [","element":"span"},{"text":"77","element":"span"},{"text":"], which rely on word/phrase matching; (2) embedding similarity-based measures, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"e.g","element":"span"},{"text":"., BERTScore [","element":"span"},{"text":"129","element":"span"},{"text":"] and AnglE [","element":"span"},{"text":"60","element":"span"},{"text":"], which summarize the ground-truth and prediction as an embedding using a language model, followed by computing the cosine similarity between the two; and (3) recent LLM-based scoring methods [","element":"span"},{"text":"73","element":"span"},{"text":"], where an LLM is provided with the ground-truth and predicted captions along with instructions for assessing similarity, and it outputs a similarity score directly as text.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Part 1: Correlation with Human Scores. ","element":"span"},{"text":"In this part, 20 participants rated the accuracy of various model-predicted video captions on a scale of 0 to 10, with a minimum interval of 0.5. Similar to the evaluation methods, participants could only read the ground-truth caption and could not watch the video itself. They represented 10 different nationalities and had either native, or professional English proficiency. 
A total of 131 video captions were evaluated, with each sample scored by two different individuals.","element":"span"}],[{"text":"To filter samples with diverging human scores, we calculated the difference ","element":"span"},{"style":{"height":11.6},"width":30,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/32394/images/5-3.png","element":"img","alt":" ∆","inline":true,"padRight":true},{"text":"between the two assigned scores for each sample, discarding any with ","element":"span"},{"style":{"height":12.4},"width":121.52,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/32394/images/5-4.png","element":"img","alt":" ∆ > α","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"i.e","element":"span"},{"text":"., outlier removal). We used multiple thresholds ","element":"span"},{"style":{"height":16},"width":73.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/32394/images/5-5.png","element":"img","alt":" {αi}","inline":true,"padRight":true},{"text":"ranging from 0.5 to 2.5 in steps of 0.5. For each ","element":"span"},{"style":{"height":9.2},"width":36.48,"height":23,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/32394/images/5-6.png","element":"img","alt":" αi","inline":true,"padRight":true},{"text":"and each method, we calculated the Pearson correlation coefficient ","element":"span"},{"style":{"height":9.6},"width":27,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/32394/images/5-7.png","element":"img","alt":" ri","inline":true,"padRight":true},{"text":"between averaged human scores and the predicted scores. 
The final correlation score ","element":"span"},{"style":{"fontStyle":"italic"},"text":"r ","element":"span"},{"text":"for each method was computed as the mean of the set ","element":"span"},{"style":{"height":16},"width":66,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/32394/images/5-8.png","element":"img","alt":" {ri}","inline":true},{"text":", and is reported in Table ","element":"span"},{"text":"2","element":"span"},{"text":".","element":"span"}],[{"text":"Results indicate that LLM-based scoring methods performed best, with Llama3-70B [","element":"span"},{"text":"32","element":"span"},{"text":"] achieving ","element":"span"},{"style":{"fontStyle":"italic"},"text":"r ","element":"span"},{"text":"= 59","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"3%","element":"span"},{"text":", closely followed by GPT-4 [","element":"span"},{"text":"1","element":"span"},{"text":"] with 58.9%. Model size significantly impacted performance, as evidenced by Llama3-8B’s lower ","element":"span"},{"style":{"fontStyle":"italic"},"text":"r ","element":"span"},{"text":"= 31","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"3%","element":"span"},{"text":". Among embedding-based methods, AnglE Llama2-7B [","element":"span"},{"text":"60","element":"span"},{"text":"] performed best at 50.5%, while classical word/phrase matching metrics lagged behind, with METEOR [","element":"span"},{"text":"12","element":"span"},{"text":"] achieving only ","element":"span"},{"style":{"fontStyle":"italic"},"text":"r ","element":"span"},{"text":"= 22","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"3%","element":"span"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Pooling Scores Across Multiple Ground Truths. 
","element":"span"},{"text":"A common strategy to enhance robustness in text similarity evaluation is to compare predictions with multiple, synonymous ground-truths. Using GPT-4 [","element":"span"},{"text":"1","element":"span"},{"text":"], we generated four rephrasings for each human-written caption, resulting in five ground-truth variants per video. For each video, we calculated the similarity between the prediction and all ground-truths, followed by applying either average or maximum pooling to obtain the final score. We report the resulting correlations as ","element":"span"},{"style":{"height":12.19},"width":56,"height":30.48,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/32394/images/5-9.png","element":"img","alt":" ravg","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":9.79},"width":65,"height":24.48,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/32394/images/5-10.png","element":"img","alt":" rmax","inline":true,"padRight":true},{"text":"in Table ","element":"span"},{"text":"2","element":"span"},{"text":". As shown in the table, LLM-based scoring methods benefited most from average pooling, with Llama3-70B improving from 59.3% to 65.6% and GPT-4 from 58.9% to 67.3%. Embedding-based methods generally saw minimal improvements for both types of pooling, with some, like MPNet-B and AnglE-UAE-L, experiencing slight declines. Word/phrase matching metrics similarly showed no substantial improvement.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Part 2: Robustness to Hard Positives/Negatives. ","element":"span"},{"text":"Here, participants watched videos from our dataset and read the ground-truth captions (GT), and were then asked to write a hard-positive (HP) and a hard-negative (HN) caption for each video. For the HP, they rephrased the caption using different wording and style while retaining all original information. 
For the HN, they altered critical details while retaining most wording and sentence structure. Each of the 10 participants processed 5 videos, yielding 50 samples.","element":"span"}],[{"text":"We computed the similarity between GT-HP and GT-HN pairs for each video using different methods. The scores were scaled to ","element":"span"},{"style":{"height":17.39},"width":126.52,"height":43.48,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/32394/images/5-11.png","element":"img","alt":" [0, 100]1","inline":true},{"text":"and the difference ","element":"span"},{"style":{"height":14.19},"width":67,"height":35.48,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/32394/images/5-12.png","element":"img","alt":" ∆PN","inline":true,"padRight":true},{"text":"between the GT-HP and GT-HN pairs was calculated, and reported in Table ","element":"span"},{"text":"2","element":"span"},{"text":". Ideally, ","element":"span"},{"style":{"height":14.19},"width":67,"height":35.48,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/32394/images/5-13.png","element":"img","alt":" ∆PN","inline":true,"padRight":true},{"text":"should be high, indicating that methods assign higher similarity to GT-HP pairs and lower similarity to GT-HN pairs. GPT-4 achieved the highest ","element":"span"},{"style":{"height":14},"width":67.52,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/32394/images/5-14.png","element":"img","alt":" ∆PN","inline":true,"padRight":true},{"text":"(37.6), followed by Llama3-70B (29.2). 
Most embedding-based models showed low ","element":"span"},{"style":{"height":14.19},"width":67,"height":35.48,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/32394/images/5-15.png","element":"img","alt":" ∆PN","inline":true},{"text":", indicating similar scores for GT-HP and GT-HN pairs, while word/phrase matching metrics performed worst, with negative ","element":"span"},{"style":{"height":14.19},"width":67,"height":35.48,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/32394/images/5-16.png","element":"img","alt":" ∆PN","inline":true,"padRight":true},{"text":"values suggesting higher scores for GT-HN pairs than GT-HP pairs.","element":"span"}],[{"style":{"width":"99%"},"width":1979,"height":542,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/32394/images/6-0.png","element":"img"}],[{"text":"Figure 5. ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"Video-LLaVA-Seg Architecture. ","element":"figcaption","subtype":"caption"},{"text":"The vision backbone and projection MLP encode the input video frames into a set of features ","element":"figcaption","subtype":"caption"},{"style":{"height":13.81},"width":55.52,"height":34.52,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/32394/images/6-1.png","element":"img","alt":"Fsf","inline":true,"padRight":true},{"text":"which are concatenated with the text embeddings and input to the LLM. For ","element":"figcaption","subtype":"caption"},{"text":"Video Captioning","element":"figcaption","subtype":"caption"},{"text":", the output ","element":"figcaption","subtype":"caption"},{"style":{"height":11.58},"width":41.68,"height":28.96,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/32394/images/6-2.png","element":"img","alt":" Fo","inline":true,"padRight":true},{"text":"is decoded into text. 
For ","element":"figcaption","subtype":"caption"},{"text":"LG-VIS","element":"figcaption","subtype":"caption"},{"text":", the ","element":"figcaption","subtype":"caption"},{"style":{"height":11.6},"width":280,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/32394/images/6-3.png","element":"img","alt":" token in Fo","inline":true,"padRight":true},{"text":"is applied to the mask decoder along with multi-scale features from the segmentation backbone.","element":"figcaption","subtype":"caption"}],[{"style":{"fontWeight":"bold"},"text":"4.2. Selected Evaluation Measures","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Video Captioning. ","element":"span"},{"text":"Although GPT-4 achieved the highest ","element":"span"},{"style":{"height":11.6},"width":56,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/32394/images/6-4.png","element":"img","alt":"ravg","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":14.21},"width":67.48,"height":35.52,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/32394/images/6-5.png","element":"img","alt":" ∆PN","inline":true,"padRight":true},{"text":"in our study, it is proprietary and subject to updates, which complicates reproducibility. Moreover, its runtime can be slow due to token limits, and it incurs monetary costs, which effectively make the benchmark ‘pay-to-evaluate’. To support open-source, reproducible research, we accept a slight performance trade-off and select Llama3-70B as the evaluation measure for our video captioning task. 
Since ","element":"span"},{"style":{"height":12},"width":56.52,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/32394/images/6-6.png","element":"img","alt":" ravg","inline":true,"padRight":true},{"text":"is noticeably higher than ","element":"span"},{"style":{"fontStyle":"italic"},"text":"r ","element":"span"},{"text":"for Llama3-70B (Table ","element":"span"},{"text":"2","element":"span"},{"text":"), we generate 4 reworded variants of each ground-truth caption and compute the final accuracy by averaging the prediction-GT scores over all five ground-truth variants for each video. This score lies in ","element":"span"},{"text":"[0","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"5] ","element":"span"},{"text":"and is abbreviated as CA (Caption Accuracy).","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Language-Guided Video Instance Segmentation (LG-VIS). ","element":"span"},{"text":"Unlike text similarity, segmentation mask accuracy is a well-studied problem with several feasible metrics [","element":"span"},{"text":"59","element":"span"},{"text":", ","element":"span"},{"text":"70","element":"span"},{"text":", ","element":"span"},{"text":"119","element":"span"},{"text":"]. We select Track mean Average Precision (mAP) as our primary metric since it is widely used by video segmentation benchmarks [","element":"span"},{"text":"81","element":"span"},{"text":", ","element":"span"},{"text":"105","element":"span"},{"text":", ","element":"span"},{"text":"119","element":"span"},{"text":"] and accommodates multi-object prediction.","element":"span"}]]},{"heading":"5. Video-LLaVA-Seg Model","paragraphs":[[{"text":"We propose an effective baseline called Video-LLaVA-Seg that can tackle both Video Captioning and LG-VIS with a single, end-to-end trained model.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"5.1. 
Architecture","element":"span"}],[{"text":"Video-LLaVA-Seg extends the popular LLaVA [","element":"span"},{"text":"65","element":"span"},{"text":"–","element":"span"},{"text":"67","element":"span"},{"text":"] architecture with segmentation ability. ","element":"span"},{"text":"It is illustrated in Fig. ","element":"span"},{"text":"5 ","element":"span"},{"text":"and comprises three main parts: (1) a multi-modal LLM, (2) a vision backbone, and (3) a segmentation network.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Vision Features. ","element":"span"},{"text":"We input ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"uniformly sampled frames from the video to the vision backbone, yielding a set of video features ","element":"span"},{"style":{"height":16},"width":331.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/32394/images/6-7.png","element":"img","alt":" Fv ∈ R^{T×H×W×C}","inline":true},{"text":". Next, we adopt the workflow proposed by Xu ","element":"span"},{"style":{"fontStyle":"italic"},"text":"et al","element":"span"},{"text":". [","element":"span"},{"text":"114","element":"span"},{"text":"] and split ","element":"span"},{"style":{"height":13.39},"width":43,"height":33.48,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/32394/images/6-8.png","element":"img","alt":" Fv","inline":true,"padRight":true},{"text":"into two sets of features, namely ‘slow’ features ","element":"span"},{"style":{"height":17.33},"width":320.64,"height":43.32,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/32394/images/6-9.png","element":"img","alt":" Fsv ∈ R^{Ts×H×W×C}","inline":true,"padRight":true},{"text":"and ‘fast’ features ","element":"span"},{"style":{"height":17.33},"width":347.56,"height":43.32,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/32394/images/6-10.png","element":"img","alt":" Ffv ∈ R^{T×Hf×Wf×C}","inline":true},{"text":". 
Fast features are ","element":"span"},{"text":"generated by aggressively downsampling the spatial dimensions in ","element":"span"},{"style":{"height":13.18},"width":44.84,"height":32.96,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/32394/images/6-11.png","element":"img","alt":" Fv","inline":true,"padRight":true},{"text":"using adaptive average pooling, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i.e","element":"span"},{"text":". ","element":"span"},{"style":{"height":15.58},"width":176.28,"height":38.96,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/32394/images/6-12.png","element":"img","alt":" Hf ≪ H","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":15.81},"width":183.52,"height":39.52,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/32394/images/6-13.png","element":"img","alt":" Wf ≪ W","inline":true},{"text":". Meanwhile, slow features are obtained by uniformly sampling ","element":"span"},{"style":{"height":13.41},"width":121,"height":33.52,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/32394/images/6-14.png","element":"img","alt":" Ts < T","inline":true,"padRight":true},{"text":"frame features from ","element":"span"},{"style":{"height":13.6},"width":43,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/32394/images/6-15.png","element":"img","alt":" Fv","inline":true},{"text":". 
These two sets of features are then flattened and concatenated to yield the set of ‘slow-fast’ video features ","element":"span"},{"style":{"height":18.21},"width":248.48,"height":45.52,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/32394/images/6-16.png","element":"img","alt":" Fsf ∈ R^{Nv×C}","inline":true,"padRight":true},{"text":"where ","element":"span"},{"style":{"height":16.8},"width":669,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/32394/images/6-17.png","element":"img","alt":" Nv = (T × Hf × Wf) + (Ts × H × W)","inline":true},{"text":". 
The idea is to retain fine-grained information along both temporal and spatial dimensions: the fast features are temporally dense but spatially condensed, whereas the slow features are spatially dense but temporally sparse. This design also reduces the number of video tokens ","element":"span"},{"style":{"height":13.6},"width":46,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/32394/images/6-18.png","element":"img","alt":" Nv","inline":true,"padRight":true},{"text":"input to the multimodal LLM.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Multi-modal LLM. ","element":"span"},{"text":"The slow-fast video features are then passed through a projection MLP, concatenated with the text prompt embeddings, and input to the LLM, which outputs a set of embeddings ","element":"span"},{"style":{"height":13.6},"width":42.48,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/32394/images/6-19.png","element":"img","alt":" Fo","inline":true},{"text":". For the Video Captioning task, ","element":"span"},{"style":{"height":13.6},"width":43,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/32394/images/6-20.png","element":"img","alt":"Fo","inline":true,"padRight":true},{"text":"is decoded into text. For LG-VIS, the model is trained to output a special ","element":"span"},{"text":" ","element":"span"},{"text":"token which is used by the segmentation network to regress the mask. Note that when the text prompt references multiple objects, the model generates a separate ","element":"span"},{"text":" ","element":"span"},{"text":"token for each target object.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Segmentation Network. 
","element":"span"},{"text":"The video frames are encoded at high resolution by a segmentation backbone to yield multi-scale features which, together with ","element":"span"},{"text":" ","element":"span"},{"text":"tokens, are input to a mask decoder to obtain the final segmentation masks. 
The mask decoder architecture is borrowed from SAM2 [","element":"span"},{"text":"86","element":"span"},{"text":"] and consists of multiple layers of bi-directional cross-attention between the ","element":"span"},{"text":" ","element":"span"},{"text":"tokens and the segmentation backbone features. Finally, the dot product between the two is computed to obtain the mask logits. We refer the reader to Ravi ","element":"span"},{"style":{"fontStyle":"italic"},"text":"et al","element":"span"},{"text":". [","element":"span"},{"text":"86","element":"span"},{"text":"] for more details. Our approach for mask prediction using ","element":"span"},{"text":" ","element":"span"},{"text":"tokens generated by an MLLM is inspired by recent image segmentation [","element":"span"},{"text":"53","element":"span"},{"text":", ","element":"span"},{"text":"85","element":"span"},{"text":", ","element":"span"},{"text":"120","element":"span"},{"text":", ","element":"span"},{"text":"126","element":"span"},{"text":"] and video segmentation works [","element":"span"},{"text":"10","element":"span"},{"text":", ","element":"span"},{"text":"117","element":"span"},{"text":"].","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Comparison to Existing Methods. 
","element":"span"},{"text":"Video-LLaVA-Seg is a compact baseline model that is easy to set up and extend, unlike existing LLM-based video segmentation approaches [","element":"span"},{"text":"10","element":"span"},{"text":", ","element":"span"},{"text":"117","element":"span"},{"text":"]: VISA [","element":"span"},{"text":"117","element":"span"},{"text":"] involves inference with a pretrained Llama-Vid [","element":"span"},{"text":"62","element":"span"},{"text":"] to select key frames for its own model, and both VISA and VideoLISA [","element":"span"},{"text":"10","element":"span"},{"text":"] rely on pretrained models [","element":"span"},{"text":"13","element":"span"},{"text":", ","element":"span"},{"text":"21","element":"span"},{"text":"] to perform/improve temporal mask propagation. Moreover, these approaches tackle only segmentation tasks, whereas our model is trained to tackle both captioning and segmentation.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"5.2. Implementation Details","element":"span"}],[{"text":"We employ an AM-RADIO [","element":"span"},{"text":"84","element":"span"},{"text":"] ViT-H [","element":"span"},{"text":"50","element":"span"},{"text":"] model as the vision backbone and Llama3-8B [","element":"span"},{"text":"32","element":"span"},{"text":"] as the LLM. The segmentation backbone is a Hiera-Small [","element":"span"},{"text":"88","element":"span"},{"text":"] network which, together with the mask decoder, is initialized with weights from SAM2 [","element":"span"},{"text":"86","element":"span"},{"text":"]. ","element":"span"},{"text":"We sample ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"= 32 ","element":"span"},{"text":"frames and use ","element":"span"},{"style":{"height":13.39},"width":124,"height":33.48,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/32394/images/7-0.png","element":"img","alt":"Ts = 8","inline":true,"padRight":true},{"text":"slow frames. 
We train our model in three stages: (1) a vision-language alignment stage to optimize the projection MLP, (2) a finetuning stage to optimize the MLLM, projection MLP and vision backbone for video captioning, and (3) a final stage in which the entire model, including the segmentation network, is finetuned for both tasks. For the first two stages, we utilize a total of 3.5M video caption samples from WebVid10M [","element":"span"},{"text":"11","element":"span"},{"text":"] and Panda70M [","element":"span"},{"text":"20","element":"span"},{"text":"]. In stage 3, we finetune the model on the ViCaS training set for both tasks. Training takes a total of ","element":"span"},{"style":{"height":10.8},"width":68.68,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/32394/images/7-1.png","element":"img","alt":" ∼ 4","inline":true,"padRight":true},{"text":"days on 32 A100 GPUs. Further implementation details can be found in the supplementary material.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"5.3. Ablation Studies","element":"span"}],[{"text":"We report ablations in Table ","element":"span"},{"text":"3 ","element":"span"},{"text":"and discuss them below.","element":"span"}],[{"style":{"width":"100%"},"width":952,"height":422,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/32394/images/7-2.png","element":"img"}],[{"text":"Table 3. Ablation experiments on the ViCaS validation set. CA: Caption Accuracy. SF-Pool: Slow-Fast Pooling.","element":"figcaption","subtype":"caption"}],[{"style":{"fontWeight":"bold"},"text":"Input Frames. 
","element":"span"},{"text":"In rows 1-3, we train our model using ","element":"span"},{"style":{"height":16},"width":256.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/32394/images/7-3.png","element":"img","alt":"Ts = {1, 2, 4}","inline":true,"padRight":true},{"text":"frames instead of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Ts ","element":"span"},{"text":"= 8 ","element":"span"},{"text":"in the final setting. 
We see that using fewer frames strongly reduces performance, especially for captioning, showing that our dataset/benchmark is video-centric and requires temporal context to tackle effectively.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Task Synergy. ","element":"span"},{"text":"For rows 4 and 5, we train the model for only one of our benchmark tasks. ","element":"span"},{"text":"We see that the model trained only for LG-VIS achieves 18.5 mAP, which is much worse than the 20.5 achieved by the multi-task model. Meanwhile, the Caption Accuracy of the captioning-only model remains unchanged at 3.0. This indicates that pixel-level segmentation benefits greatly from holistic video understanding.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"LLaVA Baseline. ","element":"span"},{"text":"Row 6 shows results for a LLaVA-NeXT baseline, trained with our recipe, which uses a CLIP vision encoder and no Slow-Fast Pooling. This achieves 19.9 mAP and 2.9 CA, both of which are worse than our final setting.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"5.4. Benchmark Results","element":"span"}],[{"style":{"width":"100%"},"width":952,"height":336,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/32394/images/7-4.png","element":"img"}],[{"text":"Table 4. Benchmark Results on our validation set. Refer to the supplementary material for test set results. CA: Caption Accuracy.","element":"figcaption","subtype":"caption"}],[{"text":"Table ","element":"span"},{"text":"4 ","element":"span"},{"text":"compares Video-LLaVA-Seg to other methods on the validation set. For video captioning, we evaluate off-the-shelf LLaVA-OneVision [","element":"span"},{"text":"54","element":"span"},{"text":"] and MiniCPM-o [","element":"span"},{"text":"121","element":"span"},{"text":"] models which achieve 2.9 and 3.0 CA scores, respectively. This is on par with Video-LLaVA-Seg, but these models are trained on significantly more data and cannot tackle LG-VIS. 
For LG-VIS, we evaluate multiple existing Referral-VOS approaches. LMPM is an earlier transformer-based approach which only achieves 8.4 mAP. DsHmp [","element":"span"},{"text":"41","element":"span"},{"text":"] performs better with 13.3 mAP since the architecture decouples static and motion cues to improve video segmentation performance. Finally, VideoLISA [","element":"span"},{"text":"10","element":"span"},{"text":"] is a recent LLM-based approach which only achieves 10.7 mAP when finetuned on ViCaS. By contrast, Video-LLaVA-Seg achieves 20.5 mAP while tackling both captioning and LG-VIS. Furthermore, unlike existing Referral-VOS approaches, it can predict multiple segmentation masks for a single prompt.","element":"span"}]]},{"heading":"6. Conclusion","paragraphs":[[{"text":"We introduce ViCaS, a first-of-its-kind, human-annotated dataset that provides detailed captions for videos along with phrase-grounded segmentation masks for salient objects. Our associated benchmark comprises two tasks: (1) Video Captioning, which evaluates high-level, holistic understanding, and (2) Language-Guided Video Instance Segmentation (LG-VIS), which evaluates pixel-level understanding. We propose evaluation measures for open-ended caption accuracy that are based on open-source models and experimentally verified through a comprehensive user study. Furthermore, we propose Video-LLaVA-Seg, a compact baseline which effectively tackles both tasks.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Acknowledgements. ","element":"span"},{"text":"We thank the participants of the user study and all the annotation personnel for their effort.","element":"span"}]]},{"heading":"References","paragraphs":[[{"text":"[1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Arxiv","element":"span"},{"text":", 2023. ","element":"span"},{"text":"5","element":"span"},{"text":", ","element":"span"},{"text":"6","element":"span"}],[{"text":"[2] Mykhaylo Andriluka, Stefan Roth, and Bernt Schiele. People-tracking-by-detection ","element":"span"},{"text":"and ","element":"span"},{"text":"people-detection-by-tracking. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CVPR","element":"span"},{"text":", 2008. ","element":"span"},{"text":"2","element":"span"}],[{"text":"[3] Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. Localizing moments in video with natural language. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ICCV","element":"span"},{"text":", 2017. ","element":"span"},{"text":"5","element":"span"}],[{"text":"[4] Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and Cordelia Schmid. Vivit: A video vision transformer. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ICCV","element":"span"},{"text":", 2021. ","element":"span"},{"text":"2","element":"span"}],[{"text":"[5] Ali Athar, Jonathon Luiten, Alexander Hermans, Deva Ramanan, and Bastian Leibe. Hodor: High-level object descriptors for object re-segmentation in video learned from static images. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CVPR","element":"span"},{"text":", 2022. ","element":"span"},{"text":"2","element":"span"}],[{"text":"[6] Ali Athar, Alexander Hermans, Jonathon Luiten, Deva Ramanan, and Bastian Leibe. Tarvis: A unified architecture for target-based video segmentation. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CVPR","element":"span"},{"text":", 2023. ","element":"span"},{"text":"2","element":"span"}],[{"text":"[7] Ali Athar, Jonathon Luiten, Paul Voigtlaender, Tarasha Khurana, Achal Dave, Bastian Leibe, and Deva Ramanan. 
Burst: A benchmark for unifying object recognition, segmentation and tracking in video. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"WACV","element":"span"},{"text":", 2023. ","element":"span"},{"text":"2","element":"span"},{"text":", ","element":"span"},{"text":"4","element":"span"},{"text":", ","element":"span"},{"text":"5","element":"span"}],[{"text":"[8] Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, et al. Openflamingo: An open-source framework for training large autoregressive vision-language models. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Arxiv","element":"span"},{"text":", 2023. ","element":"span"},{"text":"1","element":"span"}],[{"text":"[9] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Arxiv","element":"span"},{"text":", 2023. ","element":"span"},{"text":"1","element":"span"},{"text":", ","element":"span"},{"text":"2","element":"span"}],[{"text":"[10] Zechen Bai, Tong He, Haiyang Mei, Pichao Wang, Ziteng Gao, Joya Chen, Lei Liu, Zheng Zhang, and Mike Zheng Shou. One token to seg them all: Language instructed reasoning segmentation in videos. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"NeurIPS","element":"span"},{"text":", 2024. ","element":"span"},{"text":"3","element":"span"},{"text":", ","element":"span"},{"text":"7","element":"span"},{"text":", ","element":"span"},{"text":"8","element":"span"}],[{"text":"[11] Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ICCV","element":"span"},{"text":", 2021. 
","element":"span"},{"text":"2","element":"span"},{"text":", ","element":"span"},{"text":"5","element":"span"},{"text":", ","element":"span"},{"text":"8","element":"span"}],[{"text":"[12] Satanjeev Banerjee and Alon Lavie. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ACL Workshop","element":"span"},{"text":", 2005. ","element":"span"},{"text":"6","element":"span"}],[{"text":"[13] Maksym Bekuzarov, Ariana Bermudez, Joon-Young Lee, and Hao Li. Xmem++: Production-level video segmentation from few annotated frames. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Arxiv","element":"span"},{"text":", 2023. ","element":"span"},{"text":"3","element":"span"},{"text":", ","element":"span"},{"text":"8","element":"span"}],[{"text":"[14] Lucas Beyer, ","element":"span"},{"text":"Andreas Steiner, ","element":"span"},{"text":"André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Arxiv","element":"span"},{"text":", 2024. ","element":"span"},{"text":"2","element":"span"}],[{"text":"[15] Steven Bird, Ewan Klein, and Edward Loper. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Natural language processing with Python: analyzing text with the natural language toolkit","element":"span"},{"text":". O’Reilly Media, Inc., 2009. ","element":"span"},{"text":"5","element":"span"}],[{"text":"[16] Guillem Brasó and Laura Leal-Taixé. ","element":"span"},{"text":"Learning a neural solver for multiple object tracking. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CVPR","element":"span"},{"text":", 2020. ","element":"span"},{"text":"2","element":"span"}],[{"text":"[17] David Chen and William B Dolan. 
Collecting highly parallel data for paraphrase evaluation. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ACL","element":"span"},{"text":", 2011. ","element":"span"},{"text":"2","element":"span"},{"text":", ","element":"span"},{"text":"5","element":"span"}],[{"text":"[18] Joya Chen, Zhaoyang Lv, Shiwei Wu, Kevin Qinghong Lin, Chenan Song, Difei Gao, Jia-Wei Liu, Ziteng Gao, Dongxing Mao, and Mike Zheng Shou. Videollm-online: Online video large language model for streaming video. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CVPR","element":"span"},{"text":", 2024. ","element":"span"},{"text":"1","element":"span"},{"text":", ","element":"span"},{"text":"2","element":"span"}],[{"text":"[19] Jieneng Chen, Qihang Yu, Xiaohui Shen, Alan Yuille, and Liang-Chieh Chen. ","element":"span"},{"text":"Vitamin: Designing scalable vision models in the vision-language era. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CVPR","element":"span"},{"text":", 2024. ","element":"span"},{"text":"2","element":"span"}],[{"text":"[20] Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, et al. Panda-70m: Captioning 70m videos with multiple cross-modality teachers. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CVPR","element":"span"},{"text":", 2024. ","element":"span"},{"text":"2","element":"span"},{"text":", ","element":"span"},{"text":"5","element":"span"},{"text":", ","element":"span"},{"text":"8","element":"span"}],[{"text":"[21] Ho Kei Cheng and Alexander G Schwing. Xmem: Long-term video object segmentation with an atkinson-shiffrin memory model. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ECCV","element":"span"},{"text":", 2022. 
","element":"span"},{"text":"2","element":"span"},{"text":", ","element":"span"},{"text":"8","element":"span"}],[{"text":"[22] Ho Kei Cheng, Yu-Wing Tai, and Chi-Keung Tang. Rethinking space-time networks with improved memory coverage for efficient video object segmentation. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"NeurIPS","element":"span"},{"text":", 2021.","element":"span"}],[{"text":"[23] Ho Kei Cheng, Seoung Wug Oh, Brian Price, Joon-Young Lee, and Alexander Schwing. Putting the object back into video object segmentation. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CVPR","element":"span"},{"text":", 2024. ","element":"span"},{"text":"2","element":"span"}],[{"text":"[24] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, 2023. ","element":"span"},{"text":"2","element":"span"}],[{"text":"[25] Daniel Cores, Michael Dorkenwald, Manuel Mucientes, Cees GM Snoek, and Yuki M Asano. Tvbench: Redesigning video-language evaluation. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Arxiv","element":"span"},{"text":", 2024. ","element":"span"},{"text":"2","element":"span"}],[{"text":"[26] Achal Dave, Tarasha Khurana, Pavel Tokmakov, Cordelia Schmid, and Deva Ramanan. Tao: A large-scale benchmark for tracking any object. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ECCV","element":"span"},{"text":", 2020. ","element":"span"},{"text":"2","element":"span"}],[{"text":"[27] Patrick Dendorfer, Aljosa Osep, Anton Milan, Konrad Schindler, Daniel Cremers, Ian Reid, Stefan Roth, and Laura Leal-Taixé. Motchallenge: A benchmark for single-camera multiple target tracking. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"IJCV","element":"span"},{"text":", 2021. 
","element":"span"},{"text":"2","element":"span"}],[{"text":"[28] Xueqing Deng, Qihang Yu, Peng Wang, Xiaohui Shen, and Liang-Chieh Chen. Coconut: Modernizing coco segmentation. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CVPR","element":"span"},{"text":", 2024. ","element":"span"},{"text":"2","element":"span"}],[{"text":"[29] Henghui Ding, Chang Liu, Shuting He, Xudong Jiang, and Chen Change Loy. Mevis: A large-scale benchmark for video segmentation with motion expressions. ","element":"span"},{"text":"In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ICCV","element":"span"},{"text":", 2023. ","element":"span"},{"text":"1","element":"span"},{"text":", ","element":"span"},{"text":"2","element":"span"},{"text":", ","element":"span"},{"text":"4","element":"span"},{"text":", ","element":"span"},{"text":"5","element":"span"},{"text":", ","element":"span"},{"text":"8","element":"span"}],[{"text":"[30] Henghui Ding, Chang Liu, Shuting He, Xudong Jiang, Philip HS Torr, and Song Bai. Mose: A new dataset for video object segmentation in complex scenes. ","element":"span"},{"text":"In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ICCV","element":"span"},{"text":", 2023. ","element":"span"},{"text":"2","element":"span"}],[{"text":"[31] Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CVPR","element":"span"},{"text":", 2015. ","element":"span"},{"text":"2","element":"span"}],[{"text":"[32] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Arxiv","element":"span"},{"text":", 2024. 
","element":"span"},{"text":"1","element":"span"},{"text":", ","element":"span"},{"text":"2","element":"span"},{"text":", ","element":"span"},{"text":"6","element":"span"},{"text":", ","element":"span"},{"text":"8","element":"span"}],[{"text":"[33] Dave Epstein, Boyuan Chen, and Carl Vondrick. Oops! predicting unintentional action in video. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CVPR","element":"span"},{"text":", 2020. ","element":"span"},{"text":"3","element":"span"}],[{"text":"[34] Heng Fan, Liting Lin, Fan Yang, Peng Chu, Ge Deng, Sijia Yu, Hexin Bai, Yong Xu, Chunyuan Liao, and Haibin Ling. Lasot: A high-quality benchmark for large-scale single object tracking. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CVPR","element":"span"},{"text":", 2019. ","element":"span"},{"text":"2","element":"span"}],[{"text":"[35] Kirill Gavrilyuk, ","element":"span"},{"text":"Amir Ghodrati, ","element":"span"},{"text":"Zhenyang Li, ","element":"span"},{"text":"and Cees GM Snoek. Actor and action video segmentation from a sentence. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CVPR","element":"span"},{"text":", 2018. ","element":"span"},{"text":"5","element":"span"}],[{"text":"[36] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CVPR","element":"span"},{"text":", 2012. ","element":"span"},{"text":"2","element":"span"}],[{"text":"[37] Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The “something something” video database for learning and evaluating visual common sense. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ICCV","element":"span"},{"text":", 2017. 
","element":"span"},{"text":"2","element":"span"}],[{"text":"[38] Chunhui Gu, Chen Sun, David A Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, et al. Ava: A video dataset of spatio-temporally localized atomic visual actions. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CVPR","element":"span"},{"text":", 2018. ","element":"span"},{"text":"2","element":"span"}],[{"text":"[39] Kensho Hara, Hirokatsu Kataoka, and Yutaka Satoh. Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet? In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CVPR","element":"span"},{"text":", 2018. ","element":"span"},{"text":"2","element":"span"}],[{"text":"[40] Ju He, Qihang Yu, Inkyu Shin, Xueqing Deng, Alan Yuille, Xiaohui Shen, and Liang-Chieh Chen. ","element":"span"},{"text":"Maxtron: Mask transformer with trajectory attention for video panoptic segmentation. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Arxiv","element":"span"},{"text":", 2023. ","element":"span"},{"text":"2","element":"span"}],[{"text":"[41] Shuting He and Henghui Ding. Decoupling static and hierarchical motion perception for referring video segmentation. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CVPR","element":"span"},{"text":", 2024. ","element":"span"},{"text":"2","element":"span"},{"text":", ","element":"span"},{"text":"8","element":"span"}],[{"text":"[42] Miran Heo, Sukjun Hwang, Seoung Wug Oh, Joon-Young Lee, and Seon Joo Kim. Vita: Video instance segmentation via object token association. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"NeurIPS","element":"span"},{"text":", 2022. ","element":"span"},{"text":"2","element":"span"}],[{"text":"[43] Namdar Homayounfar, Justin Liang, Wei-Chiu Ma, and Raquel Urtasun. ","element":"span"},{"text":"Videoclick: Video object segmentation with a single click. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Arxiv","element":"span"},{"text":", 2021. ","element":"span"},{"text":"2","element":"span"}],[{"text":"[44] Lianghua Huang, Xin Zhao, and Kaiqi Huang. Got-10k: A large high-diversity benchmark for generic object tracking in the wild. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"IEEE TPAMI","element":"span"},{"text":", 2021. ","element":"span"},{"text":"2","element":"span"}],[{"text":"[45] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. ","element":"span"},{"text":"Large-scale video classification with convolutional neural networks. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CVPR","element":"span"},{"text":", 2014. ","element":"span"},{"text":"2","element":"span"}],[{"text":"[46] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action video dataset. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Arxiv","element":"span"},{"text":", 2017. ","element":"span"},{"text":"2","element":"span"},{"text":", ","element":"span"},{"text":"3","element":"span"}],[{"text":"[47] Anna Khoreva, Anna Rohrbach, and Bernt Schiele. Video object segmentation with language referring expressions. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ECCV","element":"span"},{"text":", 2019. ","element":"span"},{"text":"1","element":"span"},{"text":", ","element":"span"},{"text":"2","element":"span"},{"text":", ","element":"span"},{"text":"4","element":"span"},{"text":", ","element":"span"},{"text":"5","element":"span"}],[{"text":"[48] Dahun Kim, Sanghyun Woo, Joon-Young Lee, and In So Kweon. Video panoptic segmentation. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CVPR","element":"span"},{"text":", 2020. 
","element":"span"},{"text":"2","element":"span"}],[{"text":"[49] Dahun Kim, Jun Xie, Huiyu Wang, Siyuan Qiao, Qihang Yu, Hong-Seok Kim, Hartwig Adam, In So Kweon, and Liang-Chieh Chen. Tubeformer-deeplab: Video mask transformer. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CVPR","element":"span"},{"text":", 2022. ","element":"span"},{"text":"2","element":"span"}],[{"text":"[50] Alexander Kolesnikov, Alexey Dosovitskiy, Dirk Weissenborn, Georg Heigold, Jakob Uszkoreit, Lucas Beyer, Matthias Minderer, Mostafa Dehghani, Neil Houlsby, Sylvain Gelly, Thomas Unterthiner, and Xiaohua Zhai. An image is worth 16x16 words: Transformers for image recognition at scale. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ICLR","element":"span"},{"text":", 2021. ","element":"span"},{"text":"8","element":"span"}],[{"text":"[51] Matej Kristan, Jiri Matas, Aleš Leonardis, Tomas Vojir, Roman Pflugfelder, Gustavo Fernandez, Georg Nebehay, Fatih Porikli, and Luka Čehovin. A novel performance evaluation methodology for single-target trackers. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"IEEE TPAMI","element":"span"},{"text":", 2016. ","element":"span"},{"text":"2","element":"span"}],[{"text":"[52] Hildegard Kuehne, Hueihan Jhuang, Estíbaliz Garrote, Tomaso Poggio, and Thomas Serre. Hmdb: a large video database for human motion recognition. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ICCV","element":"span"},{"text":", 2011. ","element":"span"},{"text":"2","element":"span"}],[{"text":"[53] Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation via large language model. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CVPR","element":"span"},{"text":", 2024. 
","element":"span"},{"text":"2","element":"span"},{"text":", ","element":"span"},{"text":"7","element":"span"}],[{"text":"[54] Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Arxiv","element":"span"},{"text":", 2024. ","element":"span"},{"text":"8","element":"span"}],[{"text":"[55] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. ","element":"span"},{"text":"In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ICML","element":"span"},{"text":", 2022. ","element":"span"},{"text":"1","element":"span"}],[{"text":"[56] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. ","element":"span"},{"text":"In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ICML","element":"span"},{"text":", 2023. ","element":"span"},{"text":"1","element":"span"}],[{"text":"[57] Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: ","element":"span"},{"text":"A comprehensive multi-modal video understanding benchmark. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CVPR","element":"span"},{"text":", 2024. ","element":"span"},{"text":"2","element":"span"}],[{"text":"[58] Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. Visualbert: A simple and performant baseline for vision and language. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Arxiv","element":"span"},{"text":", 2019. ","element":"span"},{"text":"1","element":"span"}],[{"text":"[59] Siyuan Li, Martin Danelljan, Henghui Ding, Thomas E. Huang, and Fisher Yu. Tracking every thing in the wild. 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ECCV","element":"span"},{"text":", 2022. ","element":"span"},{"text":"7","element":"span"}],[{"text":"[60] Xianming Li and Jing Li. ","element":"span"},{"text":"Angle-optimized text embeddings. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Arxiv","element":"span"},{"text":", 2023. ","element":"span"},{"text":"6","element":"span"}],[{"text":"[61] Yuncheng Li, Yale Song, Liangliang Cao, Joel Tetreault, Larry Goldberg, Alejandro Jaimes, and Jiebo Luo. Tgif: A new dataset and benchmark on animated gif description. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CVPR","element":"span"},{"text":", 2016. ","element":"span"},{"text":"2","element":"span"}],[{"text":"[62] Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ECCV","element":"span"},{"text":", 2024. ","element":"span"},{"text":"1","element":"span"},{"text":", ","element":"span"},{"text":"2","element":"span"},{"text":", ","element":"span"},{"text":"3","element":"span"},{"text":", ","element":"span"},{"text":"8","element":"span"}],[{"text":"[63] Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"EMNLP","element":"span"},{"text":", 2024. ","element":"span"},{"text":"1","element":"span"},{"text":", ","element":"span"},{"text":"2","element":"span"}],[{"text":"[64] Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ACL Workshop","element":"span"},{"text":", 2004. ","element":"span"},{"text":"6","element":"span"}],[{"text":"[65] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Arxiv","element":"span"},{"text":", 2023. ","element":"span"},{"text":"7","element":"span"}],[{"text":"[66] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"NeurIPS","element":"span"},{"text":", 2023. ","element":"span"},{"text":"1","element":"span"},{"text":", ","element":"span"},{"text":"2","element":"span"}],[{"text":"[67] Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, 2024. ","element":"span"},{"text":"1","element":"span"},{"text":", ","element":"span"},{"text":"2","element":"span"},{"text":", ","element":"span"},{"text":"7","element":"span"}],[{"text":"[68] Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and Han Hu. ","element":"span"},{"text":"Video swin transformer. ","element":"span"},{"text":"In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CVPR","element":"span"},{"text":", 2022. ","element":"span"},{"text":"2","element":"span"}],[{"text":"[69] Jonathon Luiten, Paul Voigtlaender, and Bastian Leibe. Premvos: Proposal-generation, refinement and merging for video object segmentation. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ACCV","element":"span"},{"text":", 2018. ","element":"span"},{"text":"2","element":"span"}],[{"text":"[70] Jonathon Luiten, Aljosa Osep, Patrick Dendorfer, Philip Torr, Andreas Geiger, Laura Leal-Taixé, and Bastian Leibe. Hota: A higher order metric for evaluating multi-object tracking. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"IJCV","element":"span"},{"text":", 2020. ","element":"span"},{"text":"7","element":"span"}],[{"text":"[71] Huaishao Luo, Lei Ji, Botian Shi, Haoyang Huang, Nan Duan, Tianrui Li, Jason Li, Taroon Bharti, and Ming Zhou. 
Univl: A unified video and language pre-training model for multimodal understanding and generation. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Arxiv","element":"span"},{"text":", 2020. ","element":"span"},{"text":"2","element":"span"}],[{"text":"[72] Ruipu Luo, Ziwang Zhao, Min Yang, Junwei Dong, Da Li, Pengcheng Lu, Tao Wang, Linmei Hu, Minghui Qiu, and Zhongyu Wei. Valley: Video assistant with large language model enhanced ability. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Arxiv","element":"span"},{"text":", 2023. ","element":"span"},{"text":"1","element":"span"},{"text":", ","element":"span"},{"text":"2","element":"span"}],[{"text":"[73] Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. ","element":"span"},{"text":"Video-chatgpt: Towards detailed video understanding via large vision and language models. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ACL","element":"span"},{"text":", 2024. ","element":"span"},{"text":"1","element":"span"},{"text":", ","element":"span"},{"text":"2","element":"span"},{"text":", ","element":"span"},{"text":"6","element":"span"}],[{"text":"[74] Jieru Mei, Alex Zihao Zhu, Xinchen Yan, Hang Yan, Siyuan Qiao, Liang-Chieh Chen, and Henrik Kretzschmar. Waymo open dataset: Panoramic video panoptic segmentation. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ECCV","element":"span"},{"text":", 2022. ","element":"span"},{"text":"2","element":"span"}],[{"text":"[75] Jiaxu Miao, Xiaohan Wang, Yu Wu, Wei Li, Xu Zhang, Yunchao Wei, and Yi Yang. ","element":"span"},{"text":"Large-scale video panoptic segmentation in the wild: A benchmark. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CVPR","element":"span"},{"text":", 2022. ","element":"span"},{"text":"2","element":"span"}],[{"text":"[76] Seoung Wug Oh, Joon-Young Lee, Ning Xu, and Seon Joo Kim. Video object segmentation using space-time memory networks. 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ICCV","element":"span"},{"text":", 2019. ","element":"span"},{"text":"2","element":"span"}],[{"text":"[77] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ACL","element":"span"},{"text":", 2002. ","element":"span"},{"text":"6","element":"span"}],[{"text":"[78] Nikos Paragios and Rachid Deriche. Geodesic active contours and level sets for the detection and tracking of moving objects. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"IEEE TPAMI","element":"span"},{"text":", 2000. ","element":"span"},{"text":"2","element":"span"}],[{"text":"[79] AJ Piergiovanni, Weicheng Kuo, and Anelia Angelova. Rethinking video vits: Sparse video tubes for joint image and video learning. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CVPR","element":"span"},{"text":", 2023. ","element":"span"},{"text":"2","element":"span"}],[{"text":"[80] Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Arbeláez, Alex Sorkine-Hornung, and Luc Van Gool. ","element":"span"},{"text":"The 2017 davis challenge on video object segmentation. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Arxiv","element":"span"},{"text":", 2017. ","element":"span"},{"text":"2","element":"span"}],[{"text":"[81] Jiyang Qi, Yan Gao, Yao Hu, Xinggang Wang, Xiaoyu Liu, Xiang Bai, Serge Belongie, Alan Yuille, Philip HS Torr, and Song Bai. Occluded video instance segmentation: A benchmark. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"IJCV","element":"span"},{"text":", 2022. ","element":"span"},{"text":"2","element":"span"},{"text":", ","element":"span"},{"text":"5","element":"span"},{"text":", ","element":"span"},{"text":"7","element":"span"}],[{"text":"[82] Siyuan Qiao, Yukun Zhu, Hartwig Adam, Alan Yuille, and Liang-Chieh Chen. 
Vip-deeplab: Learning visual perception with depth-aware video panoptic segmentation. ","element":"span"},{"text":"In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CVPR","element":"span"},{"text":", 2021. ","element":"span"},{"text":"2","element":"span"}],[{"text":"[83] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ICML","element":"span"},{"text":", 2021. ","element":"span"},{"text":"1","element":"span"}],[{"text":"[84] Mike Ranzinger, Greg Heinrich, Jan Kautz, and Pavlo Molchanov. Am-radio: Agglomerative vision foundation model reduce all domains into one. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CVPR","element":"span"},{"text":", 2024. ","element":"span"},{"text":"8","element":"span"}],[{"text":"[85] Hanoona Rasheed, Muhammad Maaz, Sahal Shaji, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M Anwer, Eric Xing, Ming-Hsuan Yang, and Fahad S Khan. Glamm: ","element":"span"},{"text":"Pixel grounding large multimodal model. ","element":"span"},{"text":"In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CVPR","element":"span"},{"text":", 2024. ","element":"span"},{"text":"7","element":"span"}],[{"text":"[86] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman R¨adle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Arxiv","element":"span"},{"text":", 2024. 
","element":"span"},{"text":"2","element":"span"},{"text":", ","element":"span"},{"text":"4","element":"span"},{"text":", ","element":"span"},{"text":"7","element":"span"},{"text":", ","element":"span"},{"text":"8","element":"span"}],[{"text":"[87] Nils Reimers and Iryna Gurevych. ","element":"span"},{"text":"Sentence-bert: Sentence embeddings using siamese bert-networks. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"EMNLP","element":"span"},{"text":", 2019. ","element":"span"},{"text":"6","element":"span"}],[{"text":"[88] Chaitanya Ryali, Yuan-Ting Hu, Daniel Bolya, Chen Wei, Haoqi Fan, Po-Yao Huang, Vaibhav Aggarwal, Arkabandhu Chowdhury, Omid Poursaeed, Judy Hoffman, et al. ","element":"span"},{"text":"Hiera: A hierarchical vision transformer without the bells-and-whistles. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ICML","element":"span"},{"text":", 2023. ","element":"span"},{"text":"8","element":"span"}],[{"text":"[89] Seonguk Seo, Joon-Young Lee, and Bohyung Han. Urvos: Unified referring video object segmentation network with a large-scale benchmark. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ECCV","element":"span"},{"text":", 2020. ","element":"span"},{"text":"1","element":"span"},{"text":", ","element":"span"},{"text":"2","element":"span"},{"text":", ","element":"span"},{"text":"4","element":"span"},{"text":", ","element":"span"},{"text":"5","element":"span"}],[{"text":"[90] Inkyu Shin, Dahun Kim, Qihang Yu, Jun Xie, Hong-Seok Kim, Bradley Green, In So Kweon, Kuk-Jin Yoon, and Liang-Chieh Chen. ","element":"span"},{"text":"Video-kmax: A simple unified approach for online and near-online video panoptic segmentation. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"WACV","element":"span"},{"text":", 2024. ","element":"span"},{"text":"2","element":"span"}],[{"text":"[91] Gunnar A Sigurdsson, Gül Varol, Xiaolong Wang, Ali Farhadi, Ivan Laptev, and Abhinav Gupta. 
Hollywood in homes: Crowdsourcing data collection for activity understanding. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ECCV","element":"span"},{"text":", 2016. ","element":"span"},{"text":"2","element":"span"}],[{"text":"[92] Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, et al. Moviechat: From dense token to sparse memory for long video understanding. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CVPR","element":"span"},{"text":", 2024. ","element":"span"},{"text":"1","element":"span"},{"text":", ","element":"span"},{"text":"2","element":"span"}],[{"text":"[93] K Soomro. Ucf101: A dataset of 101 human actions classes from videos in the wild. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Arxiv","element":"span"},{"text":", 2012. ","element":"span"},{"text":"2","element":"span"}],[{"text":"[94] Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. Videobert: A joint model for video and language representation learning. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ICCV","element":"span"},{"text":", 2019. ","element":"span"},{"text":"2","element":"span"}],[{"text":"[95] Siyu Tang, Mykhaylo Andriluka, Bjoern Andres, and Bernt Schiele. ","element":"span"},{"text":"Multiple people tracking by lifted multicut and person re-identification. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CVPR","element":"span"},{"text":", 2017. ","element":"span"},{"text":"2","element":"span"}],[{"text":"[96] Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on gemini research and technology. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Arxiv","element":"span"},{"text":", 2024. 
","element":"span"},{"text":"1","element":"span"},{"text":", ","element":"span"},{"text":"2","element":"span"}],[{"text":"[97] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ICCV","element":"span"},{"text":", 2015. ","element":"span"},{"text":"2","element":"span"}],[{"text":"[98] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. A closer look at spatiotemporal convolutions for action recognition. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CVPR","element":"span"},{"text":", 2018.","element":"span"}],[{"text":"[99] Du Tran, Heng Wang, Lorenzo Torresani, and Matt Feiszli. Video classification with channel-separated convolutional networks. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ICCV","element":"span"},{"text":", 2019. ","element":"span"},{"text":"2","element":"span"}],[{"text":"[100] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"NeurIPS","element":"span"},{"text":", 2017. ","element":"span"},{"text":"2","element":"span"}],[{"text":"[101] Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evaluation. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CVPR","element":"span"},{"text":", 2015. ","element":"span"},{"text":"6","element":"span"}],[{"text":"[102] Paul Voigtlaender, Michael Krause, Aljosa Osep, Jonathon Luiten, Berin Balachandar Gnana Sekar, Andreas Geiger, and Bastian Leibe. Mots: Multi-object tracking and segmentation. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CVPR","element":"span"},{"text":", 2019. 
","element":"span"},{"text":"2","element":"span"}],[{"text":"[103] Paul Voigtlaender, Soravit Changpinyo, Jordi Pont-Tuset, Radu Soricut, and Vittorio Ferrari. Connecting vision and language with video localized narratives. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CVPR","element":"span"},{"text":", 2023. ","element":"span"},{"text":"2","element":"span"}],[{"text":"[104] Haochen Wang, Cilin Yan, Shuai Wang, Xiaolong Jiang, Xu Tang, Yao Hu, Weidi Xie, and Efstratios Gavves. Towards open-vocabulary video instance segmentation. ","element":"span"},{"text":"In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ICCV","element":"span"},{"text":", 2023. ","element":"span"},{"text":"2","element":"span"},{"text":", ","element":"span"},{"text":"5","element":"span"}],[{"text":"[105] Weiyao Wang, Matt Feiszli, Heng Wang, and Du Tran. Unidentified video objects: A benchmark for dense, open-world segmentation. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ICCV","element":"span"},{"text":", 2021. ","element":"span"},{"text":"2","element":"span"},{"text":", ","element":"span"},{"text":"3","element":"span"},{"text":", ","element":"span"},{"text":"4","element":"span"},{"text":", ","element":"span"},{"text":"5","element":"span"},{"text":", ","element":"span"},{"text":"7","element":"span"}],[{"text":"[106] Xin Wang, Jiawei Wu, Junkun Chen, Lei Li, Yuan-Fang Wang, and William Yang Wang. ","element":"span"},{"text":"Vatex: A large-scale, high-quality multilingual dataset for video-and-language research. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ICCV","element":"span"},{"text":", 2019. ","element":"span"},{"text":"5","element":"span"}],[{"text":"[107] Yuqing Wang, Zhaoliang Xu, Xinlong Wang, Chunhua Shen, Baoshan Cheng, Hao Shen, and Huaxia Xia. End-to-end video instance segmentation with transformers. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CVPR","element":"span"},{"text":", 2021. 
","element":"span"},{"text":"2","element":"span"}],[{"text":"[108] Mark Weber, Jun Xie, Maxwell Collins, Yukun Zhu, Paul Voigtlaender, Hartwig Adam, Bradley Green, Andreas Geiger, Bastian Leibe, Daniel Cremers, Aljoša Ošep, Laura Leal-Taixé, and Liang-Chieh Chen. Step: Segmenting and tracking every pixel. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"NeurIPS","element":"span"},{"text":", 2021. ","element":"span"},{"text":"2","element":"span"}],[{"text":"[109] Junfeng Wu, Qihao Liu, Yi Jiang, Song Bai, Alan Yuille, and Xiang Bai. In defense of online models for video instance segmentation. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ECCV","element":"span"},{"text":", 2022. ","element":"span"},{"text":"2","element":"span"}],[{"text":"[110] Dejing Xu, Zhou Zhao, Jun Xiao, Fei Wu, Hanwang Zhang, Xiangnan He, and Yueting Zhuang. Video question answering via gradually refined attention over appearance and motion. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ACM MM","element":"span"},{"text":", 2017. ","element":"span"},{"text":"2","element":"span"},{"text":", ","element":"span"},{"text":"5","element":"span"}],[{"text":"[111] Jun Xu, Tao Mei, Ting Yao, and Yong Rui. Msr-vtt: A large video description dataset for bridging video and language. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CVPR","element":"span"},{"text":", 2016. ","element":"span"},{"text":"2","element":"span"},{"text":", ","element":"span"},{"text":"5","element":"span"}],[{"text":"[112] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ICML","element":"span"},{"text":", 2015. 
","element":"span"},{"text":"2","element":"span"}],[{"text":"[113] Lin Xu, Yilin Zhao, Daquan Zhou, Zhijie Lin, See Kiong Ng, and Jiashi Feng. Pllava: Parameter-free llava extension from images to videos for video dense captioning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Arxiv","element":"span"},{"text":", 2024. ","element":"span"},{"text":"1","element":"span"}],[{"text":"[114] Mingze Xu, Mingfei Gao, Zhe Gan, Hong-You Chen, Zhengfeng Lai, Haiming Gang, Kai Kang, and Afshin Dehghan. Slowfast-llava: A strong training-free baseline for video large language models. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Arxiv","element":"span"},{"text":", 2024. ","element":"span"},{"text":"7","element":"span"}],[{"text":"[115] Ning Xu, Linjie Yang, Yuchen Fan, Dingcheng Yue, Yuchen Liang, ","element":"span"},{"text":"Jianchao Yang, ","element":"span"},{"text":"and Thomas Huang. Youtube-vos: ","element":"span"},{"text":"A large-scale video object segmentation benchmark. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Arxiv","element":"span"},{"text":", 2018. ","element":"span"},{"text":"2","element":"span"}],[{"text":"[116] Hongwei Xue, Tiankai Hang, Yanhong Zeng, Yuchong Sun, Bei Liu, Huan Yang, Jianlong Fu, and Baining Guo. Advancing high-resolution video-language representation with large-scale video transcriptions. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CVPR","element":"span"},{"text":", 2022. ","element":"span"},{"text":"2","element":"span"}],[{"text":"[117] Cilin Yan, Haochen Wang, Shilin Yan, Xiaolong Jiang, Yao Hu, Guoliang Kang, Weidi Xie, and Efstratios Gavves. Visa: Reasoning video object segmentation via large language models. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ECCV","element":"span"},{"text":", 2024. 
","element":"span"},{"text":"1","element":"span"},{"text":", ","element":"span"},{"text":"3","element":"span"},{"text":", ","element":"span"},{"text":"5","element":"span"},{"text":", ","element":"span"},{"text":"7","element":"span"},{"text":", ","element":"span"},{"text":"8","element":"span"}],[{"text":"[118] Antoine Yang, Arsha Nagrani, Paul Hongsuck Seo, Antoine Miech, Jordi Pont-Tuset, Ivan Laptev, Josef Sivic, and Cordelia Schmid. Vid2seq: Large-scale pretraining of a visual language model for dense video captioning. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CVPR","element":"span"},{"text":", 2023. ","element":"span"},{"text":"1","element":"span"}],[{"text":"[119] Linjie Yang, Yuchen Fan, and Ning Xu. ","element":"span"},{"text":"Video instance segmentation. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ICCV","element":"span"},{"text":", 2019. ","element":"span"},{"text":"2","element":"span"},{"text":", ","element":"span"},{"text":"5","element":"span"},{"text":", ","element":"span"},{"text":"7","element":"span"}],[{"text":"[120] Senqiao Yang, Tianyuan Qu, Xin Lai, Zhuotao Tian, Bohao Peng, Shu Liu, and Jiaya Jia. An improved baseline for reasoning segmentation with large language model. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Arxiv","element":"span"},{"text":", 2023. ","element":"span"},{"text":"7","element":"span"}],[{"text":"[121] Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. Minicpm-v: A gpt-4v level mllm on your phone. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Arxiv","element":"span"},{"text":", 2024. ","element":"span"},{"text":"8","element":"span"}],[{"text":"[122] Fisher Yu, Haofeng Chen, Xin Wang, Wenqi Xian, Yingying Chen, Fangchen Liu, Vashisht Madhavan, and Trevor Darrell. Bdd100k: A diverse driving dataset for heterogeneous multitask learning. 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CVPR","element":"span"},{"text":", 2020. ","element":"span"},{"text":"2","element":"span"}],[{"text":"[123] Qihang Yu, Ju He, Xueqing Deng, Xiaohui Shen, and Liang-Chieh Chen. ","element":"span"},{"text":"Convolutions die hard: ","element":"span"},{"text":"Open-vocabulary segmentation with single frozen convolutional clip. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"NeurIPS","element":"span"},{"text":", 2023. ","element":"span"},{"text":"2","element":"span"}],[{"text":"[124] Qihang Yu, Xiaohui Shen, and Liang-Chieh Chen. Towards open-ended visual recognition with large language models. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ECCV","element":"span"},{"text":", 2024. ","element":"span"},{"text":"2","element":"span"}],[{"text":"[125] Zhou Yu, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yueting Zhuang, and Dacheng Tao. Activitynet-qa: A dataset for understanding complex web videos via question answering. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"AAAI","element":"span"},{"text":", 2019. ","element":"span"},{"text":"2","element":"span"},{"text":", ","element":"span"},{"text":"5","element":"span"}],[{"text":"[126] Yuqian Yuan, Wentong Li, Jian Liu, Dongqi Tang, Xinjie Luo, Chi Qin, Lei Zhang, and Jianke Zhu. Osprey: Pixel understanding with visual instruction tuning. ","element":"span"},{"text":"In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CVPR","element":"span"},{"text":", 2024. ","element":"span"},{"text":"7","element":"span"}],[{"text":"[127] Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video understanding. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Arxiv","element":"span"},{"text":", 2023. 
","element":"span"},{"text":"1","element":"span"}],[{"text":"[128] Renrui Zhang, Ziyu Guo, Wei Zhang, Kunchang Li, Xupeng Miao, Bin Cui, Yu Qiao, Peng Gao, and Hongsheng Li. Pointclip: Point cloud understanding by clip. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CVPR","element":"span"},{"text":", 2022. ","element":"span"},{"text":"1","element":"span"}],[{"text":"[129] Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. ","element":"span"},{"text":"Bertscore: Evaluating text generation with bert. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ICLR","element":"span"},{"text":", 2020. ","element":"span"},{"text":"6","element":"span"}],[{"text":"[130] Tao Zhang, Xingye Tian, Yikang Zhou, Shunping Ji, Xuebo Wang, Xin Tao, Yuan Zhang, Pengfei Wan, Zhongyuan Wang, and Yu Wu. Dvis++: Improved decoupled framework for universal video segmentation. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Arxiv","element":"span"},{"text":", 2023. ","element":"span"},{"text":"2","element":"span"}],[{"text":"[131] Luowei Zhou, Chenliang Xu, and Jason Corso. Towards automatic learning of procedures from web instructional videos. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"AAAI","element":"span"},{"text":", 2018. ","element":"span"},{"text":"5","element":"span"}],[{"text":"[132] Xingyi Zhou, Anurag Arnab, Shyamal Buch, Shen Yan, Austin Myers, Xuehan Xiong, Arsha Nagrani, and Cordelia Schmid. ","element":"span"},{"text":"Streaming dense video captioning. ","element":"span"},{"text":"In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CVPR","element":"span"},{"text":", 2024. ","element":"span"},{"text":"1","element":"span"}],[{"text":"[133] Idil Esen Zulfikar, Sabarinath Mahadevan, Paul Voigtlaender, and Bastian Leibe. Point-vos: Pointing up video object segmentation. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CVPR","element":"span"},{"text":", 2024. 
","element":"span"},{"text":"2","element":"span"}]]}],"_version":"3.3.4"},"paperNode":"$28:props:children:props:children:0:props:product"}]]