Unsupervised Visual-Linguistic Reference Resolution in Instructional Videos