FOCUS: Unified Vision-Language Modeling for Interactive Editing Driven by Referential Segmentation