What matters when building vision-language models?