Compositional Video Understanding with Spatiotemporal Structure-based Transformers