Advancing Semantic Future Prediction through Multimodal Visual Sequence Transformers