Multi-Modal Representation Learning With Text-Driven Soft Masks