@jonnor Thank you for your message. It is a great suggestion, and it is also something that we, the organizers, intend to do: it will be part of our summary PMLR paper. We will do our best.
Nonetheless, since our focus and our time are mainly devoted to constructing the secret tasks and a fair, consistent evaluation pipeline, we are not able to go as deep into benchmarking existing audio models as we would like.
For example, with our wav2vec2 baseline and the other existing models we have been examining for porting to the HEAR API, the key question is:
For an audio model that is event (timestamp) based, what is the best generic way to transform its output into a scene embedding? For example, w2v2 outputs 768 or 1024 dimensions every 20 milliseconds. Taking the mean pool over this (as we currently do) washes away any possible temporal structure in the scene embedding and makes the beginning of the audio contribute identically to the end. We believe this, not the train/test split, is the main reason our Google Speech Commands results don’t match those in the literature.
We are definitely open to any suggestions here, and will also be reviewing the submissions after the final deadline to see if there is a superior timestamp=>scene technique that we can try.
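To make the timestamp=>scene question concrete, here is a minimal NumPy sketch (function names are ours, purely illustrative, and not part of the HEAR API) contrasting plain mean pooling with one naive alternative that keeps coarse temporal structure by pooling over contiguous chunks and concatenating:

```python
import numpy as np

def mean_pool(timestamp_embeddings):
    # Baseline: average over all frames; discards temporal order entirely.
    return timestamp_embeddings.mean(axis=0)

def chunked_mean_pool(timestamp_embeddings, n_chunks=4):
    # Hypothetical alternative: split frames into contiguous chunks,
    # mean-pool each chunk, and concatenate, so the scene embedding
    # still distinguishes the start of the clip from the end.
    chunks = np.array_split(timestamp_embeddings, n_chunks, axis=0)
    return np.concatenate([c.mean(axis=0) for c in chunks])

# e.g. 500 frames of 1024-dim w2v2 embeddings (~10 s of audio at a 20 ms hop)
frames = np.random.randn(500, 1024)
scene_a = mean_pool(frames)          # shape (1024,)
scene_b = chunked_mean_pool(frames)  # shape (4096,)
```

The obvious trade-off is that the scene embedding grows by a factor of `n_chunks`, and the best choice of chunking (if any) is exactly the kind of thing we hope submissions will explore.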
Other less crucial issues are:
- It is not always obvious how to infer the timestamps from an audio model that does not output them explicitly. We make our best guesses by probing the model with a variety of audio lengths.
- Some models are difficult to port in a way that achieves high throughput, e.g. models that move intermediate results off the GPU to the CPU (and perhaps convert them to numpy) instead of keeping them on the GPU. So the speed profiles may be pessimistic.
- The best model parameters (e.g. the choice of w2v2 weights, in our case facebook/wav2vec2-large-100k-voxpopuli) are not obvious.
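On the timestamp-inference point above, our "try a variety of audio lengths" guessing can be sketched roughly as follows (this is an illustrative helper we are inventing here, not part of the HEAR API; `model_fn` stands in for whatever maps a waveform to a `(frames, dim)` array):

```python
import numpy as np

def infer_hop_ms(model_fn, sample_rate=16000, durations_s=(1.0, 2.0, 4.0)):
    # Probe the model with silence of several lengths and fit a line to
    # frames-vs-samples; the slope gives frames per sample, from which we
    # can estimate the effective hop size in milliseconds.
    samples = [int(d * sample_rate) for d in durations_s]
    frames = [model_fn(np.zeros(n, dtype=np.float32)).shape[0] for n in samples]
    slope, _ = np.polyfit(samples, frames, 1)
    return 1000.0 / (slope * sample_rate)

# Toy stand-in model with a 20 ms hop (320 samples at 16 kHz):
toy = lambda wav: np.zeros((len(wav) // 320, 768))
hop = infer_hop_ms(toy)  # approximately 20.0
```

Edge effects (padding, receptive-field truncation) mean the intercept is usually nonzero, which is why fitting over several lengths is more reliable than probing a single one.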
For these reasons, we have actively been encouraging participation by authors in the community. Some people we reached out to are gearing up to submit later, some are still on the fence, and some are busy with other commitments.
Getting the broadest, strongest baselines from the existing literature into HEAR 2021 works best as a community effort, with model authors nurturing their individual submissions (e.g. by experimenting with different techniques for combining timestamp embeddings into a scene embedding).
So we encourage you and your teammates to reach out to authors of important models and encourage them to participate, perhaps explaining your reasoning above and mentioning your team’s participation! As I said, everything is better when the community pulls together.
p.s. feel free also to reach out to us directly if you want to participate in the post-deadline experimental work!