Towards Local Visual Modeling for Image Captioning