Aligning where to see and what to tell: image caption with region-based attention and scene factorization