HENASY: Learning to Assemble Scene-Entities for Interpretable Egocentric Video-Language Model