Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations