Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions