When I was going through the abstract of the HEAR 2021 challenge, one of the questions posed there (“where does the sound come from?”) made me think about the spatial cues (ITDs, ILDs) present in an audio file. Having worked with HRTFs before, I thought it could be of great value if the learnt audio representation also captured the originating location of the sound in space (azimuthal direction of arrival, elevation, etc.). Extracting such spatial cues usually requires the audio file to have at least two channels (stereo). However, the API specifies that models accept only mono audio.
Hence, I wanted to check whether any of the evaluation tasks place emphasis on extracting spatial information from the embeddings.
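To illustrate the point about two channels being needed: ITD and ILD are defined by comparing the left and right channels, so they vanish once the signal is downmixed to mono. Below is a minimal sketch (not part of the challenge API, just my own toy example with a synthetic stereo signal) of estimating ITD via cross-correlation and ILD via an RMS level ratio.

```python
import numpy as np

def estimate_itd_ild(left, right, sr):
    """Estimate ITD (seconds, positive when the right channel lags)
    via cross-correlation, and ILD (dB) via the RMS level ratio."""
    # Peak of the cross-correlation gives the lag of right vs. left.
    corr = np.correlate(right, left, mode="full")
    lag = np.argmax(corr) - (len(left) - 1)
    itd = lag / sr
    # ILD: level difference between channels, in decibels.
    rms = lambda x: np.sqrt(np.mean(x ** 2))
    ild = 20 * np.log10(rms(left) / rms(right))
    return itd, ild

# Synthetic stereo pair: the right channel is the left channel
# delayed by 20 samples and attenuated to half amplitude,
# mimicking a source located to the listener's left.
sr = 16000
rng = np.random.default_rng(0)
left = rng.standard_normal(4096)
right = 0.5 * np.roll(left, 20)

itd, ild = estimate_itd_ild(left, right, sr)
print(round(itd * 1000, 3))  # → 1.25  (ms; 20 samples at 16 kHz)
print(round(ild, 1))         # → 6.0   (dB; amplitude ratio of 2)
```

Downmixing `0.5 * (left + right)` to mono destroys both cues, which is why I was curious whether the mono-only API forecloses this kind of spatial task.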