Strong specialized baselines?

Hi. Congrats everyone on getting the leaderboard up!

The goal of the challenge is to develop methods for universal sound embeddings. But I also think it is important to evaluate these against the current task-specific SOTA, to see how close any universal method comes to a specialized solution. Of course one can do this by referring to the results reported in the existing literature. But it seems that the dataset splitting and evaluation protocol might differ slightly from what has been done before on these tasks, which may affect how comparable the results are. And of course, reproducing a method with known expected results is a strong validation of the evaluation protocol.

For example, I know that even highly computationally constrained methods (running in real time on a low-cost, low-power microcontroller) report 90-95% accuracy on Google Speech Commands; see Google Speech Commands Benchmark (Keyword Spotting) | Papers With Code

What do you think of this idea? Are there any plans (by the organizers, or by other teams) to submit such methods?

@jonnor Thank you for your message. It is a great suggestion, and it is also something that we, the organizers, intend to do. It will be part of our summary PMLR paper. We will do our best.

Nonetheless, as our focus and our time are mainly devoted to the construction of the secret tasks and a fair, consistent evaluation pipeline, we are not able to go as deep into benchmarking existing audio models as we would like.

As an example, with our wav2vec2 baseline and other existing models we have been examining for porting to the HEAR API, the key question is:

  • For an audio model that is event (timestamp) based, what is the best generic way to transform its output into a scene embedding? For example, w2v2 outputs 768 or 1024 dimensions every 20 milliseconds. Taking the mean pool over these frames (as we currently do) washes away any possible temporal aspects of the scene embedding and makes the beginning of the audio contribute to the embedding in the same way as the end. We believe this, not the train/test split, is the main reason our Google Speech Commands results don’t match those in the literature.
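To make the pooling question concrete, here is a minimal PyTorch sketch of the two options: the mean pooling we currently use, versus one simple alternative that keeps coarse temporal structure. The function names, the chunked-pooling idea, and the example shapes (50 frames of 768 dimensions, i.e. one second of w2v2 output at a 20 ms hop) are purely illustrative, not part of our actual pipeline:

```python
import torch

def scene_from_frames_mean(frames: torch.Tensor) -> torch.Tensor:
    # frames: (n_frames, dim) timestamp embeddings.
    # Mean pooling over time collapses them into a single (dim,) vector,
    # so the ordering of events within the clip is lost.
    return frames.mean(dim=0)

def scene_from_frames_chunked(frames: torch.Tensor, n_chunks: int = 4) -> torch.Tensor:
    # Illustrative alternative: split the frames into contiguous chunks,
    # mean-pool each chunk, and concatenate. This yields a (n_chunks * dim,)
    # embedding that retains coarse temporal structure, at the cost of a
    # larger embedding dimension.
    chunks = torch.chunk(frames, n_chunks, dim=0)
    return torch.cat([c.mean(dim=0) for c in chunks])

frames = torch.randn(50, 768)                   # dummy w2v2-style frames
print(scene_from_frames_mean(frames).shape)     # torch.Size([768])
print(scene_from_frames_chunked(frames).shape)  # torch.Size([3072])
```

Whether a larger, order-aware embedding like this actually helps on the downstream tasks is exactly the kind of question we hope submissions will answer.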

We are definitely open to any suggestions here, and after the final deadline we will also be reviewing the submissions to see if there is a superior timestamp-to-scene technique that we can try.

Other less crucial issues are:

  1. It is not always 100% obvious how to infer the timestamps from an audio model that does not output them explicitly. We make our best guesses by feeding in a variety of audio lengths and comparing the outputs (see the sketch after this list).
  2. Some models are difficult to port in a way that achieves high throughput, e.g. models that don’t keep results on the GPU but instead move them to the CPU and perhaps convert them to numpy. So the reported profiles may be pessimistic.
  3. The best model parameters (e.g. choice of w2v2 weights, in our case facebook/wav2vec2-large-100k-voxpopuli) are not obvious.
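Regarding point 1, the probing we mean looks roughly like the sketch below. It assumes a hypothetical model callable that returns a (batch, n_frames, dim) tensor; the helper name, signature, and output shape are made up for illustration:

```python
import torch

def estimate_hop_ms(model, sample_rate: int = 16000,
                    durations_s=(1.0, 2.0, 4.0)) -> float:
    # Feed silent clips of several lengths and count the frames returned.
    frame_counts = []
    for dur in durations_s:
        audio = torch.zeros(1, int(dur * sample_rate))
        with torch.no_grad():
            frames = model(audio)  # assumed output shape: (1, n_frames, dim)
        frame_counts.append(frames.shape[1])
    # The slope of frame count vs. duration gives frames per second,
    # from which the hop in milliseconds follows.
    extra_frames = frame_counts[-1] - frame_counts[0]
    extra_seconds = durations_s[-1] - durations_s[0]
    return 1000.0 * extra_seconds / extra_frames
```

Even then, assigning each frame a timestamp (e.g. at the start or centre of its hop) remains a guess about how the model aligns its frames to the input audio.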

For these reasons, we have actively been encouraging participation by authors in the community. Some people we reached out to are gearing up to submit later, some are still on the fence, and some are busy with other commitments.

Getting the broadest, strongest baselines from the existing literature into HEAR 2021 works best when it’s a community effort, and when model authors nurture their individual submissions (e.g. by experimenting with different techniques for combining timestamp embeddings into a scene embedding).

So we encourage you and your teammates to reach out to authors of important models and encourage them to participate, maybe explaining your reasoning above and mentioning your team’s participation! As I said, everything is better when the community pulls together :slight_smile:

p.s. feel free also to reach out to us directly if you want to participate in the post-deadline experimental work!

@jonnor By the way, our Google Speech Commands full dataset should have the same train/test splits as the literature.