38:[["$","audio",null,{"id":"tts"}],["$","$L3d",null,{"paperID":"31513","publisher":"cvpr","paperJSON":{"title":"Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception","paperID":"31513","avgLineHeight":11.96,"imgScale":4,"sections":[{"heading":"Abstract","paragraphs":[[{"style":{"fontStyle":"italic"},"text":"Multimodal Large Language Models (MLLMs) leverage Large Language Models as a cognitive framework for diverse visual-language tasks. Recent efforts have been made to equip MLLMs with visual perception and grounding capabilities. However, there still remains a gap in providing fine-grained pixel-level perceptions and extending interactions beyond text-specific inputs. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"In this work, we propose ","element":"span"},{"style":{"fontWeight":"bold"},"text":"AnyRef","element":"span"},{"style":{"fontStyle":"italic"},"text":", a general MLLM that can generate pixel-wise object perceptions and natural language descriptions from multi-modality references, such as texts, boxes, images, or audio. This innovation empowers users with greater flexibility to engage with the model beyond textual and regional prompts, without modality-specific designs. Through our proposed refocusing mechanism, the generated grounding output is guided to better focus on the referenced object, implicitly incorporating additional pixel-level supervision. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"This simple modification utilizes attention scores generated during LLM inference, eliminating the need for extra computations while exhibiting performance enhancements in both grounding masks and referring expressions. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"With only publicly available training data, our model achieves state-of-the-art results across multiple benchmarks, including diverse modality referring segmentation and region-level referring expression generation. 
Code and models are available at ","element":"span"},{"style":{"fontStyle":"italic"},"text":"https://github.com/jwh97nn/AnyRef","element":"span"}]]},{"heading":"1. Introduction","paragraphs":[[{"text":"Large language models (LLMs) have garnered widespread influence across various domains, and advancements have been achieved by augmenting LLMs with visual perception ","element":"span"}],[{"style":{"width":"96%"},"width":910,"height":748,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/31513/images/0-0.png","element":"img"}],[{"text":"Figure 1. ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"Multi-modality Referring Segmentation ","element":"figcaption","subtype":"caption"},{"text":"and ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"Expression Generation ","element":"figcaption","subtype":"caption"},{"text":"with ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"AnyRef","element":"figcaption","subtype":"caption"},{"text":". Our model possesses the capacity to generate natural language descriptions as well as pixel-wise grounding masks for the referred object. 
It accommodates various referring modalities such as ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"text","element":"figcaption","subtype":"caption"},{"text":", ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"bounding boxes","element":"figcaption","subtype":"caption"},{"text":", ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"images ","element":"figcaption","subtype":"caption"},{"text":"and ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"audio","element":"figcaption","subtype":"caption"},{"text":", enabling more flexible user interactions.","element":"figcaption","subtype":"caption"}],[{"text":"modules to bridge the gap between vision and language tasks [","element":"span"},{"text":"6","element":"span"},{"text":", ","element":"span"},{"text":"18","element":"span"},{"text":", ","element":"span"},{"text":"23","element":"span"},{"text":", ","element":"span"},{"text":"61","element":"span"},{"text":"], thereby transforming them into Multimodal Large Language Models (MLLMs). Most recent research aims to further endow MLLMs with finer-grained visual understanding abilities, like visual grounding and referring expression generation, through user-defined formats (","element":"span"},{"style":{"fontStyle":"italic"},"text":"e.g.","element":"span"},{"text":", coordinates, bounding boxes, etc.) 
[","element":"span"},{"text":"4","element":"span"},{"text":", ","element":"span"},{"text":"31","element":"span"},{"text":", ","element":"span"},{"text":"57","element":"span"},{"text":"], surpassing the confines of textual responses alone.","element":"span"}],[{"text":"Despite the encouraging results demonstrated by existing MLLMs in grounding linguistic expressions to visual scenes, their capacity for precise localization remains restricted to coarse-grained levels (bounding boxes), falling short of pixel-level perceptions (as illustrated in Tab. ","element":"span"},{"text":"1","element":"span"},{"text":"). The most recent work, as exemplified by [","element":"span"},{"text":"16","element":"span"},{"text":"], has focused on enhancing MLLMs by integrating segmentation models that generate binary segmentation masks based on textual descriptions. However, this approach is constrained by its reliance solely on textual referring instructions, thereby limiting the versatility of MLLMs in various multimodal interaction scenarios, such as region-based referring or audio comprehension tasks. The interactive segmentation model SEEM [","element":"span"},{"text":"63","element":"span"},{"text":"] attempts to receive audio inputs, but it turns audio into textual prompts with the off-the-shelf speech recognition model Whisper [","element":"span"},{"text":"34","element":"span"},{"text":"], so in essence it still relies on textual references.","element":"span"}],[{"text":"In light of the above observation, we propose ","element":"span"},{"style":{"fontWeight":"bold"},"text":"AnyRef","element":"span"},{"text":", a novel multi-modal instruction-tuned LLM with fine-grained visual perception. As shown in Tab. 
","element":"span"},{"text":"1","element":"span"},{"text":", ","element":"span"},{"style":{"fontWeight":"bold"},"text":"AnyRef ","element":"span"},{"text":"advances existing MLLMs with the strong capability to perform pixel-level object grounding and generate region-aware expressions derived from references of diverse modalities, including text, bounding boxes, images, and audio inputs, (See Fig. ","element":"span"},{"text":"1 ","element":"span"},{"text":"as an example). To this end, we first propose a unified representation for referring across different modalities and map them to the token space of LLMs. We extract features from all the modalities mentioned above to form the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Unified Referring Representation","element":"span"},{"text":", which can be processed uniformly by the LLM, utilizing its ability of understanding and reasoning in generating the grounded output. This enables flexible referring beyond textual descriptions, without requiring modality-specific designs or changes to the existing model.","element":"span"}],[{"text":"To perform pixel-level grounding with LLMs, a possible solution [","element":"span"},{"text":"16","element":"span"},{"text":"] is to trigger the segmentation action by generating a special token ","element":"span"},{"style":{"fontStyle":"italic"},"text":"<","element":"span"},{"text":"obj","element":"span"},{"style":{"fontStyle":"italic"},"text":">","element":"span"},{"text":", whose embedding will be subsequently employed as the input to the segmentation model. 
As opposed to using coordinates sequence of polygons [","element":"span"},{"text":"5","element":"span"},{"text":", ","element":"span"},{"text":"41","element":"span"},{"text":"] to represent segmentation results, the introduction of the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"<","element":"span"},{"text":"obj","element":"span"},{"style":{"fontStyle":"italic"},"text":"> ","element":"span"},{"text":"token effectively simplifies pixel-level visual grounding. Nevertheless, the embedding of the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"<","element":"span"},{"text":"obj","element":"span"},{"style":{"fontStyle":"italic"},"text":"> ","element":"span"},{"text":"token is confined in a fixed feature space, due to the nature of next token prediction, leading to limited representational capacity and thus inaccurate segmentation results. To address this constraint, we propose a simple yet effective ","element":"span"},{"style":{"fontStyle":"italic"},"text":"refocusing mechanism","element":"span"},{"text":", which takes into account the correlation between the grounded expression and the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"<","element":"span"},{"text":"obj","element":"span"},{"style":{"fontStyle":"italic"},"text":"> ","element":"span"},{"text":"token. This mechanism utilizes attention scores to weight such correlation, enhancing the mask embedding with additional grounded embeddings, and since the attention scores are intermediate outputs of the self-attention layers, the additional computation introduced by the refocusing mechanism is minimal. 
Furthermore, the refocusing mechanism also provides a short-cut connection between the generated grounded expression and the segmentation results, allowing pixel-level labels to implicitly supervise the learning process of language expression generation, thereby enhancing","element":"span"}],[{"text":"the model’s regional understanding capability.","element":"span"}],[{"text":"To summarize, our contributions are threefold: • We introduce ","element":"span"},{"style":{"fontWeight":"bold"},"text":"AnyRef","element":"span"},{"text":", the first general MLLM capable of producing pixel-level object perceptions as well as region-aware referring descriptions. It adeptly accommodates multi-modality references including texts, bounding boxes, images or audio in a general manner, fostering more flexible interactions for users.","element":"span"}],[{"text":"• We propose a simple yet effective ","element":"span"},{"style":{"fontStyle":"italic"},"text":"refocusing mechanism ","element":"span"},{"text":"to enhance the grounded mask predictions, leveraging the correlations of generated tokens without incurring additional computational overhead, and concurrently yields improvements in regional expression referring.","element":"span"}],[{"text":"• Thorough experiments conducted on multiple datasets demonstrate the efficacy of the proposed method, resulting in state-of-the-art performance across a diverse range of multi-modality tasks.","element":"span"}],[{"text":"Our model is built upon LLaVA-7B [","element":"span"},{"text":"23","element":"span"},{"text":"], which can be efficiently fine-tuned with 8 NVIDIA 32G V100 GPUs, making our method easily reproducible at a reasonable computational cost.","element":"span"}]]},{"heading":"2. Related Works","paragraphs":[[{"style":{"fontWeight":"bold"},"text":"2.1. 
Multi-modal Large Language Model","element":"span"}],[{"text":"Multi-modal Large Language Models (MLLMs), built upon large language models (LLMs) as their foundations, extend their capabilities beyond traditional textual understanding to incorporate various modalities such as images, videos, and audio. Building upon the concept of instruction tuning, Flamingo [","element":"span"},{"text":"1","element":"span"},{"text":"] utilizes visual feature inputs as prompts, resulting in impressive performance across diverse visual-language tasks such as image captioning and visual question answering (VQA). Subsequent models, including BLIP-2 [","element":"span"},{"text":"19","element":"span"},{"text":"], LLaVA [","element":"span"},{"text":"23","element":"span"},{"text":"], InstructBLIP [","element":"span"},{"text":"6","element":"span"},{"text":"], Otter [","element":"span"},{"text":"18","element":"span"},{"text":"] and LLaMA-Adapter [","element":"span"},{"text":"56","element":"span"},{"text":"], utilize additional generated visual instruction-following data for better visual-language alignment, and demonstrate impressive multi-modal chat abilities.","element":"span"}],[{"text":"Recent studies expand the capabilities of MLLMs to address localization tasks with region-aware functionalities. KOSMOS-2 [","element":"span"},{"text":"31","element":"span"},{"text":"] and VisionLLM [","element":"span"},{"text":"41","element":"span"},{"text":"] introduce additional location tokens to the vocabulary, enabling the conversion of coordinates into textual representations. ","element":"span"},{"text":"These representations are then fed into LLMs to enhance region understanding. On the other hand, Shikra [","element":"span"},{"text":"4","element":"span"},{"text":"] represents coordinates directly in natural language form. 
In contrast, GPT4RoI [","element":"span"},{"text":"57","element":"span"},{"text":"] streamlines the process by employing RoI-aligned visual features without incorporating explicit positional information.","element":"span"}],[{"text":"Nevertheless, these models lack the capacity to produce fine-grained perceptions (","element":"span"},{"style":{"fontStyle":"italic"},"text":"e.g.","element":"span"},{"text":", pixel-level masks), and restrict their referring expressions to textual descriptions and","element":"span"}],[{"style":{"width":"84%"},"width":1670,"height":726,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/31513/images/2-0.png","element":"img"}],[{"text":"Table 1. ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"Comparisons of recent Multi-modal Large Language Models. ","element":"figcaption","subtype":"caption"},{"text":"The term ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"Referring Format ","element":"figcaption","subtype":"caption"},{"text":"emphasizes the acceptable modalities used for referencing, whereas ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"Image* ","element":"figcaption","subtype":"caption"},{"text":"indicates visual references derived from another image.","element":"figcaption","subtype":"caption"}],[{"text":"regions within the image. Our model, leveraging the best of both worlds, not only generates pixel-level grounding masks, but also accommodates a broader range of referring formats (","element":"span"},{"style":{"fontStyle":"italic"},"text":"e.g.","element":"span"},{"text":", visual references from other images or audio) in a unified manner.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"2.2. 
Referring Segmentation","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Referring Expression Segmentation ","element":"span"},{"text":"translates explicit textual descriptions into corresponding pixel-level segmentations, requiring a comprehensive understanding of both visual content and linguistic expression. Recent methods including SAM [","element":"span"},{"text":"15","element":"span"},{"text":"], X-Decoder [","element":"span"},{"text":"62","element":"span"},{"text":"] and SEEM [","element":"span"},{"text":"63","element":"span"},{"text":"] unify multiple segmentation tasks within a single model, supporting various human interaction methods. LISA [","element":"span"},{"text":"16","element":"span"},{"text":"], in contrast, utilizes the powerful reasoning and comprehension abilities of LLMs to process textual instructions and generate masks through the SAM [","element":"span"},{"text":"15","element":"span"},{"text":"] decoder.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Visual Referring Segmentation ","element":"span"},{"text":"can be related to one/few-shot segmentation, where an example of a certain object with its corresponding mask is provided to segment the same object in the query image [","element":"span"},{"text":"12","element":"span"},{"text":", ","element":"span"},{"text":"30","element":"span"},{"text":", ","element":"span"},{"text":"43","element":"span"},{"text":", ","element":"span"},{"text":"44","element":"span"},{"text":", ","element":"span"},{"text":"55","element":"span"},{"text":"]. Recently, CLIPSeg [","element":"span"},{"text":"28","element":"span"},{"text":"] builds upon the CLIP model to treat the example image as a visual prompt, which can generalize to novel forms of prompts. 
Painter [","element":"span"},{"text":"43","element":"span"},{"text":"] and SegGPT [","element":"span"},{"text":"44","element":"span"},{"text":"] utilize in-context learning to perform general vision tasks using input task prompts.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Audio-Visual Segmentation ","element":"span"},{"text":"aims to generate pixel-level masks for object(s) emitting sound, initially introduced in [","element":"span"},{"text":"60","element":"span"},{"text":"]. AVSegFormer [","element":"span"},{"text":"8","element":"span"},{"text":"] innovatively incorporates learnable audio queries, enabling selective attention to relevant visual features. Additionally, AUSS [","element":"span"},{"text":"21","element":"span"},{"text":"] proposes unmixing self-supervised losses to bridge the gap between audio signals and visual semantics.","element":"span"}],[{"text":"While these models have achieved satisfactory results in","element":"span"}],[{"text":"their respective domains, there is currently a gap in addressing all referring tasks within a single model. Most of the aforementioned methods rely on modality-specific or task-specific designs, which may not generalize well beyond their intended tasks. ","element":"span"},{"text":"Our approach leverages the robust comprehension ability of LLMs to concurrently tackle all these tasks while preserving the region-level reasoning capacity. Additionally, the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"refocusing mechanism ","element":"span"},{"text":"aids in enhancing region-level referring expression through implicit pixel-level supervision.","element":"span"}]]},{"heading":"3. Methods","paragraphs":[[{"text":"The overall framework of ","element":"span"},{"style":{"fontWeight":"bold"},"text":"AnyRef ","element":"span"},{"text":"comprises a vision encoder, multi-modal feature projection layers, an LLM, and a mask decoder, as illustrated in Fig. 
","element":"span"},{"text":"2","element":"span"},{"text":". ","element":"span"},{"text":"The first three components together form a multi-modality LLM, enabling support for various reference formats and generating region-aware grounded textual responses. Additionally, a distinctive ","element":"span"},{"style":{"fontStyle":"italic"},"text":"<","element":"span"},{"text":"obj","element":"span"},{"style":{"fontStyle":"italic"},"text":"> ","element":"span"},{"text":"token is introduced to the vocabulary, which provides the input for the mask decoder through a refocusing mechanism, facilitating the generation of pixel-level perceptions.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"3.1. Model Architecture","element":"span"}],[{"text":"We adopt the pretrained ViT-L/14 from CLIP [","element":"span"},{"text":"33","element":"span"},{"text":"] as the vision encoder, and LLaMA-7B [","element":"span"},{"text":"39","element":"span"},{"text":"] as our LLM. For audio inputs, we choose the pretrained audio encoder from ImageBind [","element":"span"},{"text":"9","element":"span"},{"text":"] to extract audio features. To connect multi-modality information beyond texts to the existing LLM, such as images and audio, we adopt vision-language and audio-language projection layers to project image and audio features to the language space. The input image is converted into a fixed number of ","element":"span"},{"style":{"height":11.2},"width":129.5,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/31513/images/2-1.png","element":"img","alt":" 16 × 16","inline":true,"padRight":true},{"text":"patch embeddings, while the audio is represented as ","element":"span"},{"text":"3 ","element":"span"},{"text":"patch embeddings. Both","element":"span"}],[{"style":{"width":"98%"},"width":1948,"height":846,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/31513/images/3-0.png","element":"img"}],[{"text":"Figure 2. 
","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"Overall pipeline of AnyRef. ","element":"figcaption","subtype":"caption"},{"text":"The vision-language and audio-language projection layers and the MLP layers are omitted for simplicity and clarity. The ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"Unified Referring Representation ","element":"figcaption","subtype":"caption"},{"text":"(Sec. ","element":"figcaption","subtype":"caption"},{"text":"3.1.1","element":"figcaption","subtype":"caption"},{"text":") receives references from diverse types of modalities and transforms them into embeddings aligned with the LLM. The ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"Refocusing Mechanism ","element":"figcaption","subtype":"caption"},{"text":"(Sec. ","element":"figcaption","subtype":"caption"},{"text":"3.1.2","element":"figcaption","subtype":"caption"},{"text":") enhances the embedding from the single ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"<","element":"figcaption","subtype":"caption"},{"text":"obj","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"> ","element":"figcaption","subtype":"caption"},{"text":"token with grounded textual embeddings, thus providing a broader representational capacity.","element":"figcaption","subtype":"caption"}],[{"text":"the image and audio embeddings are then projected to the same dimension as word embeddings. 
The LLM takes the interleaved embeddings in the same way as language tokens to generate outputs via an auto-regressive manner.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"3.1.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Unified Referring Representation","element":"span"}],[{"text":"To receive multi-modality referring prompts beyond texts, we convert them into fixed-sized tokens and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"quote ","element":"span"},{"text":"them between newly introduced special tokens.","element":"span"}],[{"text":"For visual prompts including regional bounding boxes or visual examples from another image, we introduce ","element":"span"},{"style":{"fontStyle":"italic"},"text":"<","element":"span"},{"text":"img ref","element":"span"},{"style":{"fontStyle":"italic"},"text":"> ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"<","element":"span"},{"text":"/img ref","element":"span"},{"style":{"fontStyle":"italic"},"text":">","element":"span"},{"text":", where visual features will be inserted in between. ","element":"span"},{"text":"Drawing inspiration from [","element":"span"},{"text":"57","element":"span"},{"text":"], we represent bounding boxes using extracted region-level features from RoIAlign [","element":"span"},{"text":"11","element":"span"},{"text":"] with a fixed size of ","element":"span"},{"style":{"height":10.8},"width":97.5,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/31513/images/3-1.png","element":"img","alt":"4 × 4","inline":true},{"text":". ","element":"span"},{"text":"For processing image-level visual examples, we use the same CLIP vision encoder to extract visual features, which are then pooled to ","element":"span"},{"style":{"height":10.8},"width":99.5,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/31513/images/3-2.png","element":"img","alt":" 4 × 4","inline":true,"padRight":true},{"text":"as well. 
","element":"span"},{"text":"To refer to them in the same way as textual descriptions, we build prompts such as: “Can you provide a description of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"<","element":"span"},{"text":"img ref","element":"span"},{"style":{"fontStyle":"italic"},"text":"><","element":"span"},{"text":"img feat","element":"span"},{"style":{"fontStyle":"italic"},"text":"><","element":"span"},{"text":"/img ref","element":"span"},{"style":{"fontStyle":"italic"},"text":"> ","element":"span"},{"text":"in this image?”, where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"<","element":"span"},{"text":"img feat","element":"span"},{"style":{"fontStyle":"italic"},"text":"> ","element":"span"},{"text":"will be replaced by the extracted visual features.","element":"span"}],[{"text":"For audio prompts, we introduce ","element":"span"},{"style":{"fontStyle":"italic"},"text":"<","element":"span"},{"text":"aud ref","element":"span"},{"style":{"fontStyle":"italic"},"text":"> ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"<","element":"span"},{"text":"/aud ref","element":"span"},{"style":{"fontStyle":"italic"},"text":"> ","element":"span"},{"text":"for LLM to be aware of audio referring inputs, and the extracted audio features will be projected through audio-language projection layer and then inserted in between. And the audio prompted instruction will be built like: “Can you segment the object that makes sound of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"<","element":"span"},{"text":"aud ref","element":"span"},{"style":{"fontStyle":"italic"},"text":"><","element":"span"},{"text":"aud feat","element":"span"},{"style":{"fontStyle":"italic"},"text":"><","element":"span"},{"text":"/aud ref","element":"span"},{"style":{"fontStyle":"italic"},"text":"> ","element":"span"},{"text":"in this image?”. 
In this way, the referring representation from different modalities is unified, which can be treated the same way as language instructions and easily handled by the LLM.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"3.1.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Refocusing Mechanism","element":"span"}],[{"text":"Inspired by [","element":"span"},{"text":"16","element":"span"},{"text":"], we employ another special token ","element":"span"},{"style":{"fontStyle":"italic"},"text":"<","element":"span"},{"text":"obj","element":"span"},{"style":{"fontStyle":"italic"},"text":"> ","element":"span"},{"text":"to succinctly represent the instance segmentation mask as an embedding. This embedding ","element":"span"},{"style":{"height":15.8},"width":67,"height":39.5,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/31513/images/3-3.png","element":"img","alt":" hobj","inline":true,"padRight":true},{"text":"is derived from the last-layer of LLM associated with the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"<","element":"span"},{"text":"obj","element":"span"},{"style":{"fontStyle":"italic"},"text":"> ","element":"span"},{"text":"token. It is then projected through an MLP layer ","element":"span"},{"style":{"height":10.6},"width":21.5,"height":26.5,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/31513/images/3-4.png","element":"img","alt":" γ","inline":true},{"text":", before being fed into the segmentation model ","element":"span"},{"style":{"fontStyle":"italic"},"text":"S","element":"span"},{"text":". 
Subsequently, the binary segmentation mask ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"text":"can be expressed mathematically as","element":"span"}],[{"style":{"width":"77%"},"width":734,"height":78,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/31513/images/3-5.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":12},"width":80,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/31513/images/3-6.png","element":"img","alt":" ximg","inline":true,"padRight":true},{"text":"indicates the input image, and ","element":"span"},{"style":{"height":15.8},"width":68.5,"height":39.5,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/31513/images/3-7.png","element":"img","alt":" Vseg","inline":true,"padRight":true},{"text":"denotes the vision encoder of the segmentation model.","element":"span"}],[{"text":"However, since ","element":"span"},{"style":{"fontStyle":"italic"},"text":"<","element":"span"},{"text":"obj","element":"span"},{"style":{"fontStyle":"italic"},"text":"> ","element":"span"},{"text":"is a token in the LLM vocabulary, its representation is confined to a fixed feature range, which may limit its representational capacity and degrade the decoded mask quality. Therefore, we propose a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"refocusing mechanism ","element":"span"},{"text":"which augments the original mask embedding with grounded text embeddings. The motivation is to explicitly force the final mask embedding to focus more on the referred or grounded object through its textual expression. 
The updated mask embedding can be formulated as","element":"span"}],[{"style":{"width":"77%"},"width":728,"height":120,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/31513/images/3-8.png","element":"img"}],[{"style":{"width":"99%"},"width":1972,"height":666,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/31513/images/4-0.png","element":"img"}],[{"text":"Figure 3. ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"Qualitative results of AnyRef’s capabilities ","element":"figcaption","subtype":"caption"},{"text":"on multiple tasks, including ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"(a) ","element":"figcaption","subtype":"caption"},{"text":"referring expression segmentation, ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"(b) ","element":"figcaption","subtype":"caption"},{"text":"region-level captioning and grounding, ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"(c) ","element":"figcaption","subtype":"caption"},{"text":"image-level referring segmentation and ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"(d) ","element":"figcaption","subtype":"caption"},{"text":"audio-visual segmentation. 
","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"AnyRef ","element":"figcaption","subtype":"caption"},{"text":"demonstrates proficiency in generating both textual responses and pixel-level perceptions across diverse modality instructions.","element":"figcaption","subtype":"caption"}],[{"text":"where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"token, ","element":"span"},{"style":{"height":12.4},"width":34,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/31513/images/4-1.png","element":"img","alt":" ¯ai","inline":true,"padRight":true},{"text":"indicates the normalized attention score between the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":"-th token and the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"<","element":"span"},{"text":"obj","element":"span"},{"style":{"fontStyle":"italic"},"text":"> ","element":"span"},{"text":"token, and ","element":"span"},{"style":{"height":16},"width":143,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/31513/images/4-2.png","element":"img","alt":"λf = 0.1","inline":true,"padRight":true},{"text":"controls the focusing weight of the augmentation embeddings. This approach enhances the mask embedding, providing a more adaptable feature range compared to the original, thereby expanding its representational capacity.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"3.1.3 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Training Objectives","element":"span"}],[{"text":"The model is trained in an end-to-end manner with a combination of text loss and mask loss. 
The text loss follows the next word prediction loss [","element":"span"},{"text":"23","element":"span"},{"text":"], and the mask loss includes binary cross-entropy loss and dice loss [","element":"span"},{"text":"29","element":"span"},{"text":"], as","element":"span"}],[{"style":{"width":"85%"},"width":812,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/31513/images/4-3.png","element":"img"}],[{"text":"where we choose ","element":"span"},{"style":{"height":13.8},"width":379.5,"height":34.5,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/31513/images/4-4.png","element":"img","alt":" λtext = 1.0, λbce = 2.0","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.8},"width":194,"height":34.5,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/31513/images/4-5.png","element":"img","alt":" λdice = 0.5.","inline":true,"padRight":true},{"text":"Due to the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"refocusing mechanism","element":"span"},{"text":", tokens generated before the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"<","element":"span"},{"text":"obj","element":"span"},{"style":{"fontStyle":"italic"},"text":"> ","element":"span"},{"text":"token can receive additional supervisory signals from pixel-level ground truth. This mutual interaction can further benefit the vision-language understanding ability of ","element":"span"},{"style":{"fontWeight":"bold"},"text":"AnyRef","element":"span"},{"text":", given the interrelated nature of referring expressions and grounding masks.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"3.2. 
Implementation Details.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"3.2.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Training Setup","element":"span"}],[{"text":"Unless otherwise specified, we employ the pre-trained CLIP ViT-L/14 as the vision encoder, ImageBind-H [","element":"span"},{"text":"9","element":"span"},{"text":"] as the audio encoder, and LLaMA-7B as the LLM. The vision-language projection layer is initialized from LLaVA [","element":"span"},{"text":"23","element":"span"},{"text":"], while the audio-language projection layer is randomly initialized. The word embeddings of newly introduced special tokens are initialized randomly. Furthermore, the segmentation model utilizes the pre-trained SAM-H [","element":"span"},{"text":"15","element":"span"},{"text":"]. The image resolution is ","element":"span"},{"style":{"height":10.8},"width":173,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/31513/images/4-6.png","element":"img","alt":" 224 × 224","inline":true,"padRight":true},{"text":"for MLLM and ","element":"span"},{"style":{"height":11.2},"width":211.5,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/31513/images/4-7.png","element":"img","alt":" 1024 × 1024","inline":true,"padRight":true},{"text":"by rescaling and padding for the segmentation model. For audio inputs, we follow settings in [","element":"span"},{"text":"60","element":"span"},{"text":"] to use 5-second audio clips and convert them to 3 fixed-size embeddings after padding, since the ImageBind [","element":"span"},{"text":"9","element":"span"},{"text":"] audio encoder samples 2-second audio each time.","element":"span"}],[{"text":"To ensure training efficiency and preserve generalization ability, we freeze the vision encoders and audio encoder. 
Fine-tuning of the LLM is conducted using LoRA [","element":"span"},{"text":"13","element":"span"},{"text":"], and the trainable parameters comprise the mask decoder and projection layers, accounting for approximately 7% of the total parameters.","element":"span"}],[{"text":"We conduct training using 8 NVIDIA V100 GPUs, each with a batch size of 6, and use 8 gradient accumulation steps. The training utilizes mixed precision, converting both the vision and audio encoders to float16 precision. The AdamW [","element":"span"},{"text":"26","element":"span"},{"text":"] optimizer with a learning rate of 5e-5 and weight decay of 0.01 is employed, alongside a cosine annealing scheduler incorporating 200 warmup steps. LoRA uses a rank of 8 and an alpha of 16 and is applied exclusively to query and value projections within the LLM. We employ ZeRO stage-2 [","element":"span"},{"text":"35","element":"span"},{"text":"] with DeepSpeed [","element":"span"},{"text":"37","element":"span"},{"text":"], and training completes in 10K steps.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"3.2.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Datasets","element":"span"}],[{"text":"The training process involves a diverse range of datasets. For general semantic and instance segmentation, COCO-Stuff [","element":"span"},{"text":"3","element":"span"},{"text":"], ADE20K [","element":"span"},{"text":"59","element":"span"},{"text":"], and PACO-LVIS [","element":"span"},{"text":"36","element":"span"},{"text":"] are utilized, with one category chosen per batch. Referring expression segmentation incorporates RefClef, RefCOCO, RefCOCO+ [","element":"span"},{"text":"14","element":"span"},{"text":"], RefCOCOg [","element":"span"},{"text":"52","element":"span"},{"text":"], and PhraseCut [","element":"span"},{"text":"46","element":"span"},{"text":"]. 
Image-level referring segmentation adopts the method outlined in [","element":"span"},{"text":"27","element":"span"},{"text":"], where samples are chosen from COCO [","element":"span"},{"text":"20","element":"span"},{"text":"], PascalVOC [","element":"span"},{"text":"7","element":"span"},{"text":"], and PhraseCut [","element":"span"},{"text":"46","element":"span"},{"text":"] datasets. Randomly cropped samples","element":"span"}],[{"style":{"width":"74%"},"width":1474,"height":714,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/31513/images/5-0.png","element":"img"}],[{"text":"Table 2. ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"Referring expression segmentation ","element":"figcaption","subtype":"caption"},{"text":"results (cIoU) on RefCOCO(+/g) datasets. (ft) denotes finetuning the model on RefCOCO(+/g) datasets. Our model surpasses all generalist models and most specialist (segmentation-oriented) models.","element":"figcaption","subtype":"caption"}],[{"text":"are drawn from images that contain the same category as their corresponding linguistic expressions. Region-level captioning involves RefCOCO(+/g) and Flickr30K Entities [","element":"span"},{"text":"32","element":"span"},{"text":"]. Audio-visual segmentation employs AVSBench [","element":"span"},{"text":"60","element":"span"},{"text":"] with both single and multiple sound sources. To prevent data leakage, samples with images in the validation or test splits are excluded.","element":"span"}]]},{"heading":"4. Experiments","paragraphs":[[{"text":"We assess the capabilities of our model through evaluations on various benchmarks, including different modality referring segmentation (text/image/audio) for pixel-level perception and referring expression generation for regional understanding. 
Models are categorized as ","element":"span"},{"style":{"fontStyle":"italic"},"text":"specialists ","element":"span"},{"text":"or ","element":"span"},{"style":{"fontStyle":"italic"},"text":"generalists","element":"span"},{"text":", with the former designed exclusively for specific tasks. We provide examples for each task in Fig. ","element":"span"},{"text":"3","element":"span"},{"text":", and more illustrations can be found in the supplementary material.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"4.1. Multi-modality Referring Segmentation","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"4.1.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Referring Expression Segmentation","element":"span"}],[{"text":"The task involves labeling pixels within an image corresponding to an object instance referred to by a linguistic expression. ","element":"span"},{"text":"We instruct our model as: “","element":"span"},{"text":"Can you segment ","element":"span"},{"style":{"fontStyle":"italic"},"text":"{","element":"span"},{"text":"exp","element":"span"},{"style":{"fontStyle":"italic"},"text":"} ","element":"span"},{"text":"in this image?","element":"span"},{"text":"”, where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"{","element":"span"},{"text":"exp","element":"span"},{"style":{"fontStyle":"italic"},"text":"} ","element":"span"},{"text":"is the given explicit description. Evaluation is conducted using Cumulative-IoU (cIoU) as the metric. We make comparisons with state-of-the-art models on validation and test sets of RefCOCO, RefCOCO+ and RefCOCOg [","element":"span"},{"text":"14","element":"span"},{"text":", ","element":"span"},{"text":"52","element":"span"},{"text":"]. As shown in Tab. 
","element":"span"},{"text":"2","element":"span"},{"text":", our performance surpasses all generalist models and most specialist models except UNINEXT-H [","element":"span"},{"text":"48","element":"span"},{"text":"], which is trained using a considerably larger dataset that includes video samples. Specialist models excel solely at segmentation-related tasks, while generalist models possess additional capabilities for generating textual descriptions and are capable of handling more complex references.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"4.1.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Image Referring Segmentation","element":"span"}],[{"text":"Predicting masks using image examples is akin to one- or few-shot segmentation, where regions corresponding to the highlighted object in the example image must be located in a query image. ","element":"span"},{"text":"We prompt our model with queries like “","element":"span"},{"text":"Can you find similar object of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"<","element":"span"},{"text":"img ref","element":"span"},{"style":{"fontStyle":"italic"},"text":"><","element":"span"},{"text":"img feat","element":"span"},{"style":{"fontStyle":"italic"},"text":"><","element":"span"},{"text":"/img ref","element":"span"},{"style":{"fontStyle":"italic"},"text":"> ","element":"span"},{"text":"in this image?","element":"span"},{"text":"”, where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"<","element":"span"},{"text":"img feat","element":"span"},{"style":{"fontStyle":"italic"},"text":"> ","element":"span"},{"text":"denotes pooled features from example images as detailed in Sec. ","element":"span"},{"text":"3.1.1","element":"span"},{"text":". 
","element":"span"},{"text":"The evaluation takes place under the in-domain setting on COCO-20","element":"span"},{"style":{"height":7.6},"width":8.5,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/31513/images/5-1.png","element":"img","alt":"i","inline":true,"padRight":true},{"text":"[","element":"span"},{"text":"20","element":"span"},{"text":"] and PASCAL-5","element":"span"},{"style":{"height":7.6},"width":9,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/31513/images/5-2.png","element":"img","alt":"i","inline":true,"padRight":true},{"text":"[","element":"span"},{"text":"7","element":"span"},{"text":"] for a fair comparison, as most classes are encountered during the training stages. In the few-shot evaluation, the model runs inference multiple times with different example images, with the averaged mask serving as the final prediction. ","element":"span"},{"text":"In our referring examples, we do not have corresponding mask examples, which is different from the standard setting. ","element":"span"},{"text":"We follow [","element":"span"},{"text":"28","element":"span"},{"text":"] to crop out the target object for highlighting, using their segmentation masks. ","element":"span"},{"text":"As demonstrated in Tab. 
","element":"span"},{"text":"3","element":"span"},{"text":", our model achieves competitive performance compared to state-of-the-art methods.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"4.1.3 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Audio-Visual Segmentation","element":"span"}],[{"text":"The AVS benchmark comprises single- and multi-source subsets based on the number of sounding objects. We utilize prompts like, “","element":"span"},{"text":"Can you segment the object(s) that produce sound of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"<","element":"span"},{"text":"aud ref","element":"span"},{"style":{"fontStyle":"italic"},"text":"><","element":"span"},{"text":"aud feat","element":"span"},{"style":{"fontStyle":"italic"},"text":"><","element":"span"},{"text":"/aud ref","element":"span"},{"style":{"fontStyle":"italic"},"text":"> ","element":"span"},{"text":"in this image?","element":"span"},{"text":"”, to instruct the model for mask predictions. Following [","element":"span"},{"text":"60","element":"span"},{"text":"], evaluation metrics include mean IoU","element":"span"}],[{"style":{"width":"99%"},"width":936,"height":516,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/31513/images/6-0.png","element":"img"}],[{"text":"Table 3. Quantitative results of ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"example-based few-shot segmentation","element":"figcaption","subtype":"caption"},{"text":". 
* indicates that the categories in training cover those in testing as in [","element":"figcaption","subtype":"caption"},{"text":"44","element":"figcaption","subtype":"caption"},{"text":"], and ","element":"figcaption","subtype":"caption"},{"style":{"height":13.4},"width":13,"height":33.5,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/31513/images/6-1.png","element":"img","alt":" †","inline":true,"padRight":true},{"text":"denotes using mask cropping setting.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"88%"},"width":840,"height":324,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/31513/images/6-2.png","element":"img"}],[{"text":"Table 4. Quantitative results of ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"audio-visual segmentation","element":"figcaption","subtype":"caption"},{"text":".","element":"figcaption","subtype":"caption"}],[{"text":"(mIoU) for region similarity and F-score","element":"span"},{"text":"1 ","element":"span"},{"text":"for contour accuracy. ","element":"span"},{"text":"The quantitative results in Tab. ","element":"span"},{"text":"4 ","element":"span"},{"text":"demonstrate that our model consistently outperforms most methods on the single-source split, indicating successful alignment of audio features with the LLM during fine-tuning. However, when confronted with audio containing multiple sound sources, our model encounters challenges in producing masks that cover more than one object. Moreover, owing to the capabilities of the LLM, our model can determine the textual category of the sounding objects, as depicted in Fig. ","element":"span"},{"text":"3 ","element":"span"},{"text":"(d).","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"4.2. Referring Expression Generation","element":"span"}],[{"text":"This task involves generating a textual description associated with an object based on its location (bounding box). 
We evaluate our generated expressions using automatic caption generation metrics, including CIDEr [","element":"span"},{"text":"40","element":"span"},{"text":"] and METEOR [","element":"span"},{"text":"17","element":"span"},{"text":"], on RefCOCO, RefCOCO+ and RefCOCOg. ","element":"span"},{"text":"Our model achieves remarkable performance among generalist LLM-based models and is competitive with specialist models, as shown in Tab. ","element":"span"},{"text":"5","element":"span"},{"text":".","element":"span"}],[{"text":"Nonetheless, as stated in [","element":"span"},{"text":"2","element":"span"},{"text":", ","element":"span"},{"text":"24","element":"span"},{"text":", ","element":"span"},{"text":"54","element":"span"},{"text":"], standard automated evaluation metrics do not authentically capture generation quality due to the constraints of ground-truth expressions. This scenario is particularly pronounced in open-text generation, especially for LLM-based models. These models have the ability to generate rich, natural sentences, while","element":"span"}],[{"style":{"width":"99%"},"width":936,"height":586,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/31513/images/6-3.png","element":"img"}],[{"text":"Figure 4. Comparison of ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"generated expressions ","element":"figcaption","subtype":"caption"},{"text":"between ground-truth and LLM-based methods.","element":"figcaption","subtype":"caption"}],[{"text":"the provided ground-truth expressions often tend to be concise, as indicated in Fig. ","element":"span"},{"text":"4","element":"span"},{"text":".","element":"span"}],[{"text":"To further evaluate the quality of the generated expressions, we conduct human evaluations following [","element":"span"},{"text":"2","element":"span"},{"text":", ","element":"span"},{"text":"51","element":"span"},{"text":", ","element":"span"},{"text":"54","element":"span"},{"text":"]. 
We randomly select 100 images from the validation datasets and ask five human raters to choose the bounding box that best matches the generated expression, and the averaged score is considered the final result. In Tab. ","element":"span"},{"text":"6","element":"span"},{"text":", we present the results of the human evaluations, including both traditional methods and LLM-based methods. The LLM-based methods produce more detailed descriptions, closely resembling human behavior, which are preferred by the human raters. We provide more examples in the supplementary material.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"4.3. Ablation Study","element":"span"}],[{"text":"We conduct extensive ablation studies to reveal the contribution of each component.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Refocusing Mechanism. ","element":"span"},{"text":"We first investigate the effectiveness of enhancing the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"<","element":"span"},{"text":"obj","element":"span"},{"style":{"fontStyle":"italic"},"text":"> ","element":"span"},{"text":"token through ","element":"span"},{"style":{"fontStyle":"italic"},"text":"refocusing","element":"span"}],[{"style":{"width":"76%"},"width":720,"height":540,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/31513/images/6-4.png","element":"img"}],[{"text":"Figure 5. Visualization of mask embeddings before and after the ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"refocusing mechanism","element":"figcaption","subtype":"caption"},{"text":". 
","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"original ","element":"figcaption","subtype":"caption"},{"text":"denotes original mask embeddings, while ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"vehicle","element":"figcaption","subtype":"caption"},{"text":", ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"person","element":"figcaption","subtype":"caption"},{"text":", and ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"animal ","element":"figcaption","subtype":"caption"},{"text":"represent the updated mask embeddings corresponding to their respective referring objects contained in the textual expression.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"89%"},"width":1774,"height":554,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/31513/images/7-0.png","element":"img"}],[{"text":"Table 5. Quantitative results on ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"region-level referring expression generation","element":"figcaption","subtype":"caption"},{"text":". ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"Generalist models ","element":"figcaption","subtype":"caption"},{"text":"(LLM-based) perform poorly on automated evaluation metrics due to the limitation of constrained ground-truth expressions, as stated in Sec. ","element":"figcaption","subtype":"caption"},{"text":"4.2","element":"figcaption","subtype":"caption"},{"text":".","element":"figcaption","subtype":"caption"}],[{"style":{"width":"88%"},"width":834,"height":333,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/31513/images/7-1.png","element":"img"}],[{"text":"Table 6. 
","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"Human evaluation ","element":"figcaption","subtype":"caption"},{"text":"on referring expression generation.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"88%"},"width":840,"height":344,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/31513/images/7-2.png","element":"img"}],[{"text":"Table 7. Ablation study on ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"refocusing weight ","element":"figcaption","subtype":"caption"},{"style":{"height":13.8},"width":244.5,"height":34.5,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/31513/images/7-3.png","element":"img","alt":" λf. x† indicates","inline":true,"padRight":true},{"text":"trainable ","element":"figcaption","subtype":"caption"},{"style":{"height":13.6},"width":35.5,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/31513/images/7-4.png","element":"img","alt":" λf","inline":true,"padRight":true},{"text":"initialized with ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"x","element":"figcaption","subtype":"caption"},{"text":".","element":"figcaption","subtype":"caption"}],[{"style":{"fontStyle":"italic"},"text":"mechanism","element":"span"},{"text":", and explore the impact of different refocusing weights ","element":"span"},{"style":{"height":16},"width":38,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/31513/images/7-5.png","element":"img","alt":" λf","inline":true},{"text":". We evaluate setting different values for ","element":"span"},{"style":{"height":16},"width":38.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/31513/images/7-6.png","element":"img","alt":" λf","inline":true,"padRight":true},{"text":"and also try setting it as a learnable parameter along with the model. We conduct evaluations on both referring segmentation and expression generation tasks. Results in Tab. 
","element":"span"},{"text":"7 ","element":"span"},{"text":"reveal that the refocusing weight significantly affects performance in both tasks. A small weight of 0.1 improves performance, while a larger weight can have detrimental effects, particularly in expression generation. We also experiment with learning ","element":"span"},{"style":{"height":16},"width":38.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/31513/images/7-7.png","element":"img","alt":" λf","inline":true,"padRight":true},{"text":"as a parameter along with the model, but we find that the performance varies greatly depending on the initialized value. Thus, for simplicity and stability, we empirically select ","element":"span"},{"style":{"height":16},"width":143,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/31513/images/7-8.png","element":"img","alt":" λf = 0.1","inline":true,"padRight":true},{"text":"for our experiments.","element":"span"}],[{"text":"We further employ PCA to visualize the mask embeddings before and after implementing the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"refocusing mechanism ","element":"span"},{"text":"in Fig. ","element":"span"},{"text":"5. ","element":"span"},{"text":"We choose three subsets representing different referring objects including vehicles, persons and animals (","element":"span"},{"style":{"fontStyle":"italic"},"text":"e.g.","element":"span"},{"text":", the person subset comprises output expressions","element":"span"}],[{"style":{"width":"89%"},"width":842,"height":260,"src":"https://cdn.bytez.com/mobilePapers/v2/cvpr/31513/images/7-9.png","element":"img"}],[{"text":"Table 8. Ablation study on ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"training datasets","element":"figcaption","subtype":"caption"},{"text":".","element":"figcaption","subtype":"caption"}],[{"text":"containing “person,” “man,” “woman,” etc.). 
The visualization illustrates that the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"refocusing mechanism ","element":"span"},{"text":"results in a wider representation range of the mask embedding. Moreover, the updated embeddings demonstrate a clustering pattern aligned with the associated textual expressions, contributing to a more precise decoding of masks.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Training Datasets. ","element":"span"},{"text":"The impact of different types of datasets is validated in Tab. ","element":"span"},{"text":"8","element":"span"},{"text":", and evaluation is carried out on RefCOCOg validation split. Region/Image Ref. refers to region-level and image-level referring data, as explained in Sec. ","element":"span"},{"text":"3.2.2","element":"span"},{"text":". It becomes apparent that the model’s generalization improves as more types of datasets are included.","element":"span"}]]},{"heading":"5. Conclusion","paragraphs":[[{"text":"We present ","element":"span"},{"style":{"fontWeight":"bold"},"text":"AnyRef","element":"span"},{"text":", a pioneering MLLM capable of generating pixel-level object perceptions and language descriptions from various modality references, including texts, regions, images, and audio. ","element":"span"},{"text":"This is made possible by the unified referring representation, which connects different types of inputs to the LLM. We further propose a refocusing mechanism that uses attention scores to improve the segmentation embedding and enhance pixel-level vision perception. Across various downstream tasks, our model exhibits remarkable performance while providing users with greater interaction flexibility.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Acknowledgements. 
","element":"span"},{"text":"This work is supported by the National Natural Science Foundation of China (U23A20386, 62276045, 62293540, 62293542), Dalian Science and Technology Talent Innovation Support Plan (2022RY17).","element":"span"}]]},{"heading":"References","paragraphs":[[{"text":"[1] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 35:23716–23736, 2022. ","element":"span"},{"text":"2","element":"span"}],[{"text":"[2] Lior Bracha, Eitan Shaar, Aviv Shamsian, Ethan Fetaya, and Gal Chechik. Disclip: Open-vocabulary referring expression generation. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2305.19108","element":"span"},{"text":", 2023. ","element":"span"},{"text":"7","element":"span"}],[{"text":"[3] Holger Caesar, Jasper Uijlings, and Vittorio Ferrari. Cocostuff: Thing and stuff classes in context. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the IEEE conference on computer vision and pattern recognition","element":"span"},{"text":", pages 1209–1218, 2018. ","element":"span"},{"text":"5","element":"span"}],[{"text":"[4] Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. ","element":"span"},{"text":"Shikra: ","element":"span"},{"text":"Unleashing multi-modal llm’s referential dialogue magic. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2306.15195","element":"span"},{"text":", 2023. 
","element":"span"},{"text":"1","element":"span"},{"text":", ","element":"span"},{"text":"2","element":"span"},{"text":", ","element":"span"},{"text":"3","element":"span"},{"text":", ","element":"span"},{"text":"8","element":"span"}],[{"text":"[5] Ting Chen, Saurabh Saxena, Lala Li, Tsung-Yi Lin, David J Fleet, and Geoffrey E Hinton. A unified sequence interface for vision tasks. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 35:31333–31346, 2022. ","element":"span"},{"text":"2","element":"span"}],[{"text":"[6] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. ","element":"span"},{"text":"Instructblip: ","element":"span"},{"text":"Towards general-purpose vision-language models with instruction tuning, 2023. ","element":"span"},{"text":"1","element":"span"},{"text":", ","element":"span"},{"text":"2","element":"span"}],[{"text":"[7] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International journal of computer vision","element":"span"},{"text":", 88:303–338, 2010. ","element":"span"},{"text":"5","element":"span"},{"text":", ","element":"span"},{"text":"6","element":"span"}],[{"text":"[8] Shengyi Gao, Zhe Chen, Guo Chen, Wenhai Wang, and Tong Lu. ","element":"span"},{"text":"Avsegformer: Audio-visual segmentation with transformer. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2307.01146","element":"span"},{"text":", 2023. ","element":"span"},{"text":"3","element":"span"},{"text":", ","element":"span"},{"text":"7","element":"span"}],[{"text":"[9] Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. Imagebind: One embedding space to bind them all. 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CVPR","element":"span"},{"text":", 2023. ","element":"span"},{"text":"3","element":"span"},{"text":", ","element":"span"},{"text":"5","element":"span"}],[{"text":"[10] Dawei Hao, Yuxin Mao, Bowen He, Xiaodong Han, Yuchao Dai, and Yiran Zhong. ","element":"span"},{"text":"Improving audio-visual segmentation with bidirectional generation. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2308.08288","element":"span"},{"text":", 2023. ","element":"span"},{"text":"7","element":"span"}],[{"text":"[11] Kaiming He, Georgia Gkioxari, Piotr Doll´ar, and Ross Girshick. Mask r-cnn. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the IEEE international conference on computer vision","element":"span"},{"text":", pages 2961–2969, 2017. ","element":"span"},{"text":"4","element":"span"}],[{"text":"[12] Sunghwan Hong, Seokju Cho, Jisu Nam, Stephen Lin, and Seungryong Kim. Cost aggregation with 4d convolutional swin transformer for few-shot segmentation. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"European Conference on Computer Vision","element":"span"},{"text":", pages 108–126. Springer, 2022. ","element":"span"},{"text":"3","element":"span"},{"text":", ","element":"span"},{"text":"7","element":"span"}],[{"text":"[13] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan AllenZhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2106.09685","element":"span"},{"text":", 2021. ","element":"span"},{"text":"5","element":"span"}],[{"text":"[14] Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. ","element":"span"},{"text":"Referitgame: Referring to objects in photographs of natural scenes. 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP)","element":"span"},{"text":", pages 787–798, 2014. ","element":"span"},{"text":"5","element":"span"},{"text":", ","element":"span"},{"text":"6","element":"span"}],[{"text":"[15] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment anything. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv:2304.02643","element":"span"},{"text":", 2023. ","element":"span"},{"text":"3","element":"span"},{"text":", ","element":"span"},{"text":"5","element":"span"}],[{"text":"[16] Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation via large language model. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2308.00692","element":"span"},{"text":", 2023. ","element":"span"},{"text":"2","element":"span"},{"text":", ","element":"span"},{"text":"3","element":"span"},{"text":", ","element":"span"},{"text":"4","element":"span"},{"text":", ","element":"span"},{"text":"6","element":"span"}],[{"text":"[17] Alon Lavie and Abhaya Agarwal. METEOR: An automatic metric for MT evaluation with high levels of correlation with human judgments. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the Second Workshop on Statistical Machine Translation","element":"span"},{"text":", pages 228–231, Prague, Czech Republic, 2007. Association for Computational Linguistics. ","element":"span"},{"text":"7","element":"span"}],[{"text":"[18] Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. 
","element":"span"},{"text":"Otter: ","element":"span"},{"text":"A multi-modal model with in-context instruction tuning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2305.03726","element":"span"},{"text":", 2023. ","element":"span"},{"text":"1","element":"span"},{"text":", ","element":"span"},{"text":"2","element":"span"}],[{"text":"[19] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: ","element":"span"},{"text":"bootstrapping language-image pre-training with frozen image encoders and large language models. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ICML","element":"span"},{"text":", 2023. ","element":"span"},{"text":"2","element":"span"}],[{"text":"[20] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll´ar, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13","element":"span"},{"text":", pages 740–755. Springer, 2014. ","element":"span"},{"text":"5","element":"span"},{"text":", ","element":"span"},{"text":"6","element":"span"}],[{"text":"[21] Yuhang Ling, Yuxi Li, Zhenye Gan, Jiangning Zhang, Mingmin Chi, and Yabiao Wang. Hear to segment: Unmixing the audio to guide the semantic segmentation. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2305.07223","element":"span"},{"text":", 2023. ","element":"span"},{"text":"3","element":"span"},{"text":", ","element":"span"},{"text":"7","element":"span"}],[{"text":"[22] Chang Liu, Henghui Ding, and Xudong Jiang. Gres: Generalized referring expression segmentation. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition","element":"span"},{"text":", pages 23592–23601, 2023. 
","element":"span"},{"text":"6","element":"span"}],[{"text":"[23] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"NeurIPS","element":"span"},{"text":", 2023. ","element":"span"},{"text":"1","element":"span"},{"text":", ","element":"span"},{"text":"2","element":"span"},{"text":", ","element":"span"},{"text":"3","element":"span"},{"text":", ","element":"span"},{"text":"5","element":"span"}],[{"text":"[24] Jingyu Liu, Liang Wang, and Ming-Hsuan Yang. Referring expression generation and comprehension via attributes. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the IEEE International Conference on Computer Vision","element":"span"},{"text":", pages 4856–4864, 2017. ","element":"span"},{"text":"7","element":"span"}],[{"text":"[25] Jiang Liu, Hui Ding, Zhaowei Cai, Yuting Zhang, Ravi Kumar Satzoda, Vijay Mahadevan, and R Manmatha. ","element":"span"},{"text":"Polyformer: Referring image segmentation as sequential polygon generation. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition","element":"span"},{"text":", pages 18653– 18663, 2023. ","element":"span"},{"text":"6","element":"span"}],[{"text":"[26] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1711.05101","element":"span"},{"text":", 2017. ","element":"span"},{"text":"5","element":"span"}],[{"text":"[27] Timo L¨uddecke and Alexander Ecker. ","element":"span"},{"text":"Image segmentation using text and image prompts. ","element":"span"},{"text":"In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition","element":"span"},{"text":", pages 7086–7096, 2022. 
","element":"span"},{"text":"5","element":"span"}],[{"text":"[28] Timo L¨uddecke and Alexander Ecker. ","element":"span"},{"text":"Image segmentation using text and image prompts. ","element":"span"},{"text":"In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","element":"span"},{"text":", pages 7086–7096, 2022. ","element":"span"},{"text":"3","element":"span"},{"text":", ","element":"span"},{"text":"6","element":"span"},{"text":", ","element":"span"},{"text":"7","element":"span"}],[{"text":"[29] Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"2016 fourth international conference on 3D vision (3DV)","element":"span"},{"text":", pages 565–571. IEEE, 2016. ","element":"span"},{"text":"5","element":"span"}],[{"text":"[30] Juhong Min, Dahyun Kang, and Minsu Cho. Hypercorrelation squeeze for few-shot segmentation. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)","element":"span"},{"text":", 2021. ","element":"span"},{"text":"3","element":"span"},{"text":", ","element":"span"},{"text":"7","element":"span"}],[{"text":"[31] Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2306.14824","element":"span"},{"text":", 2023. ","element":"span"},{"text":"1","element":"span"},{"text":", ","element":"span"},{"text":"2","element":"span"},{"text":", ","element":"span"},{"text":"3","element":"span"},{"text":", ","element":"span"},{"text":"8","element":"span"}],[{"text":"[32] Bryan A. Plummer, Liwei Wang, Christopher M. Cervantes, Juan C. 
Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"IJCV","element":"span"},{"text":", 123(1):74–93, 2017. ","element":"span"},{"text":"6","element":"span"}],[{"text":"[33] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International conference on machine learning","element":"span"},{"text":", pages 8748–8763. PMLR, 2021. ","element":"span"},{"text":"3","element":"span"}],[{"text":"[34] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. ","element":"span"},{"text":"Robust speech recognition via large-scale weak supervision. ","element":"span"},{"text":"In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pages 28492–28518. PMLR, 2023. ","element":"span"},{"text":"2","element":"span"}],[{"text":"[35] Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","element":"span"},{"text":", pages 1–16. IEEE, 2020. ","element":"span"},{"text":"5","element":"span"}],[{"text":"[36] Vignesh Ramanathan, Anmol Kalia, Vladan Petrovic, Yi Wen, Baixue Zheng, Baishan Guo, Rui Wang, Aaron Marquez, Rama Kovvuri, Abhishek Kadian, Amir Mousavi, Yiwen Song, Abhimanyu Dubey, and Dhruv Mahajan. PACO: Parts and attributes of common objects. 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2301.01795","element":"span"},{"text":", 2023. ","element":"span"},{"text":"5","element":"span"}],[{"text":"[37] Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining","element":"span"},{"text":", pages 3505–3506, 2020. ","element":"span"},{"text":"5","element":"span"}],[{"text":"[38] Mikihiro Tanaka, Takayuki Itamochi, Kenichi Narioka, Ikuro Sato, Yoshitaka Ushiku, and Tatsuya Harada. ","element":"span"},{"text":"Generating easy-to-understand referring expressions for target identifications. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the IEEE/CVF International Conference on Computer Vision","element":"span"},{"text":", pages 5794–5803, 2019. ","element":"span"},{"text":"8","element":"span"}],[{"text":"[39] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: ","element":"span"},{"text":"Open and efficient foundation language models. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2302.13971","element":"span"},{"text":", 2023. ","element":"span"},{"text":"3","element":"span"}],[{"text":"[40] Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evaluation. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the IEEE conference on computer vision and pattern recognition","element":"span"},{"text":", pages 4566–4575, 2015. 
","element":"span"},{"text":"7","element":"span"}],[{"text":"[41] Wenhai Wang, Zhe Chen, Xiaokang Chen, Jiannan Wu, Xizhou Zhu, Gang Zeng, Ping Luo, Tong Lu, Jie Zhou, Yu Qiao, et al. ","element":"span"},{"text":"Visionllm: Large language model is also an open-ended decoder for vision-centric tasks. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2305.11175","element":"span"},{"text":", 2023. ","element":"span"},{"text":"2","element":"span"},{"text":", ","element":"span"},{"text":"3","element":"span"}],[{"text":"[42] Weiyun Wang, Min Shi, Qingyun Li, Wenhai Wang, Zhenhang Huang, Linjie Xing, Zhe Chen, Hao Li, Xizhou Zhu, Zhiguo Cao, et al. The all-seeing project: Towards panoptic visual recognition and understanding of the open world. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2308.01907","element":"span"},{"text":", 2023. ","element":"span"},{"text":"3","element":"span"}],[{"text":"[43] Xinlong Wang, Wen Wang, Yue Cao, Chunhua Shen, and Tiejun Huang. ","element":"span"},{"text":"Images speak in images: ","element":"span"},{"text":"A generalist painter for in-context visual learning. ","element":"span"},{"text":"In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition","element":"span"},{"text":", pages 6830–6839, 2023. ","element":"span"},{"text":"3","element":"span"},{"text":", ","element":"span"},{"text":"7","element":"span"}],[{"text":"[44] Xinlong Wang, Xiaosong Zhang, Yue Cao, Wen Wang, Chunhua Shen, and Tiejun Huang. Seggpt: Segmenting everything in context. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2304.03284","element":"span"},{"text":", 2023. ","element":"span"},{"text":"3","element":"span"},{"text":", ","element":"span"},{"text":"7","element":"span"}],[{"text":"[45] Zhaoqing Wang, Yu Lu, Qiang Li, Xunqiang Tao, Yandong Guo, Mingming Gong, and Tongliang Liu. 
","element":"span"},{"text":"Cris: Clipdriven referring image segmentation. ","element":"span"},{"text":"In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the IEEE/CVF conference on computer vision and pattern recognition","element":"span"},{"text":", pages 11686–11695, 2022. ","element":"span"},{"text":"6","element":"span"}],[{"text":"[46] Chenyun Wu, Zhe Lin, Scott Cohen, Trung Bui, and Subhransu Maji. Phrasecut: Language-based image segmentation in the wild. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition","element":"span"},{"text":", pages 10216–10225, 2020. ","element":"span"},{"text":"5","element":"span"}],[{"text":"[47] Jialian Wu, Jianfeng Wang, Zhengyuan Yang, Zhe Gan, Zicheng Liu, Junsong Yuan, and Lijuan Wang. Grit: A generative region-to-text transformer for object understanding. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2212.00280","element":"span"},{"text":", 2022. ","element":"span"},{"text":"8","element":"span"}],[{"text":"[48] Bin Yan, Yi Jiang, Jiannan Wu, Dong Wang, Zehuan Yuan, Ping Luo, and Huchuan Lu. Universal instance perception as object discovery and retrieval. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CVPR","element":"span"},{"text":", 2023. ","element":"span"},{"text":"6","element":"span"}],[{"text":"[49] Zhao Yang, Jiaqi Wang, Yansong Tang, Kai Chen, Hengshuang Zhao, and Philip HS Torr. Lavt: Language-aware vision transformer for referring image segmentation. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CVPR","element":"span"},{"text":", 2022. ","element":"span"},{"text":"6","element":"span"}],[{"text":"[50] Fulong Ye, Yuxing Long, Fangxiang Feng, and Xiaojie Wang. ","element":"span"},{"text":"Whether you can locate or not? interactive referring expression generation. 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 31st ACM International Conference on Multimedia","element":"span"},{"text":", pages 4697–4706, 2023. ","element":"span"},{"text":"8","element":"span"}],[{"text":"[51] Licheng Yu, Patrick Poirson, Shan Yang, Alexander C Berg, and Tamara L Berg. Modeling context in referring expressions. ","element":"span"},{"text":"In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14","element":"span"},{"text":", pages 69–85. Springer, 2016. ","element":"span"},{"text":"7","element":"span"}],[{"text":"[52] Licheng Yu, Patrick Poirson, Shan Yang, Alexander C Berg, and Tamara L Berg. Modeling context in referring expressions. ","element":"span"},{"text":"In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14","element":"span"},{"text":", pages 69–85. Springer, 2016. ","element":"span"},{"text":"5","element":"span"},{"text":", ","element":"span"},{"text":"6","element":"span"}],[{"text":"[53] Licheng Yu, Patrick Poirson, Shan Yang, Alexander C Berg, and Tamara L Berg. Modeling context in referring expressions. ","element":"span"},{"text":"In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14","element":"span"},{"text":", pages 69–85. Springer, 2016. ","element":"span"},{"text":"8","element":"span"}],[{"text":"[54] Licheng Yu, Hao Tan, Mohit Bansal, and Tamara L Berg. A joint speaker-listener-reinforcer model for referring expressions. 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the IEEE conference on computer vision and pattern recognition","element":"span"},{"text":", pages 7282–7290, 2017. ","element":"span"},{"text":"7","element":"span"},{"text":", ","element":"span"},{"text":"8","element":"span"}],[{"text":"[55] Jian-Wei Zhang, Yifan Sun, Yi Yang, and Wei Chen. Feature-proxy transformer for few-shot segmentation. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 35:6575–6588, 2022. ","element":"span"},{"text":"3","element":"span"}],[{"text":"[56] Renrui Zhang, Jiaming Han, Chris Liu, Peng Gao, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, and Yu Qiao. ","element":"span"},{"text":"Llama-adapter: Efficient fine-tuning of language models with zero-init attention. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2303.16199","element":"span"},{"text":", 2023. ","element":"span"},{"text":"2","element":"span"}],[{"text":"[57] Shilong Zhang, Peize Sun, Shoufa Chen, Min Xiao, Wenqi Shao, Wenwei Zhang, Kai Chen, and Ping Luo. Gpt4roi: Instruction tuning large language model on region-of-interest, 2023. ","element":"span"},{"text":"1","element":"span"},{"text":", ","element":"span"},{"text":"2","element":"span"},{"text":", ","element":"span"},{"text":"3","element":"span"},{"text":", ","element":"span"},{"text":"4","element":"span"}],[{"text":"[58] Yang Zhao, Zhijie Lin, Daquan Zhou, Zilong Huang, Jiashi Feng, and Bingyi Kang. Bubogpt: Enabling visual grounding in multi-modal llms. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2307.08581","element":"span"},{"text":", 2023. ","element":"span"},{"text":"3","element":"span"}],[{"text":"[59] Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Semantic understanding of scenes through the ade20k dataset. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Journal of Computer Vision","element":"span"},{"text":", 127:302–321, 2019. ","element":"span"},{"text":"5","element":"span"}],[{"text":"[60] Jinxing Zhou, Jianyuan Wang, Jiayi Zhang, Weixuan Sun, Jing Zhang, Stan Birchfield, Dan Guo, Lingpeng Kong, Meng Wang, and Yiran Zhong. Audio–visual segmentation. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"European Conference on Computer Vision","element":"span"},{"text":", pages 386– 403. Springer, 2022. ","element":"span"},{"text":"3","element":"span"},{"text":", ","element":"span"},{"text":"5","element":"span"},{"text":", ","element":"span"},{"text":"6","element":"span"},{"text":", ","element":"span"},{"text":"7","element":"span"}],[{"text":"[61] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. ","element":"span"},{"text":"Minigpt-4: Enhancing vision-language understanding with advanced large language models. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2304.10592","element":"span"},{"text":", 2023. ","element":"span"},{"text":"1","element":"span"}],[{"text":"[62] Xueyan Zou, Zi-Yi Dou, Jianwei Yang, Zhe Gan, Linjie Li, Chunyuan Li, Xiyang Dai, Harkirat Behl, Jianfeng Wang, Lu Yuan, et al. Generalized decoding for pixel, image, and language. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition","element":"span"},{"text":", pages 15116–15127, 2023. ","element":"span"},{"text":"3","element":"span"},{"text":", ","element":"span"},{"text":"6","element":"span"}],[{"text":"[63] Xueyan Zou, Jianwei Yang, Hao Zhang, Feng Li, Linjie Li, Jianfeng Gao, and Yong Jae Lee. Segment everything everywhere all at once. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2304.06718","element":"span"},{"text":", 2023. 
","element":"span"},{"text":"2","element":"span"},{"text":", ","element":"span"},{"text":"3","element":"span"},{"text":", ","element":"span"},{"text":"6","element":"span"}]]}],"_version":"3.3.4"},"paperNode":"$28:props:children:props:children:0:props:product"}]]