Facebook AI’s Demucs teaches AI to hear in a more human-like way
Demucs is a new research project by Facebook AI. It is designed to separate music tracks into individual instruments and vocals, much as a human listener can pick out specific instruments, and to address the shortcomings of existing approaches. In the long run, Demucs could be applied to other AI tasks as well.
Music source separation is a tricky task for machines, whereas humans find it comparatively easy to distinguish the vocals, bass, or drums. To help with this task, Facebook AI research scientist Alexandre Defossez has developed Demucs (deep extractor for music sources).
As the well-known “cocktail party effect” describes, humans can home in on a single conversation in a loud environment. This kind of sound source separation is difficult for machines, though. Let’s see how AI tools handle the task and what sets Demucs apart.
Spectrograms vs. waveforms
Most commonly, as Defossez points out, AI separates music sources by analyzing spectrograms. While this approach is well suited to instruments that resonate on a single frequency, it has weaknesses. For example, saxophone and guitar frequencies may cancel each other out.
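To make the cancellation issue concrete, here is a minimal NumPy sketch. It is not taken from Demucs or from the blog post: the helper `magnitude_spectrogram`, the 440 Hz test tones, and the frame sizes are all assumptions chosen for illustration. It shows that a mixture is a plain sum of its sources in the waveform domain, while magnitude spectrograms of overlapping sources do not add up, because phase information has been discarded.

```python
import numpy as np

def magnitude_spectrogram(wave, frame=256, hop=128):
    """Magnitudes of a short-time Fourier transform, shape (frames, frame // 2 + 1)."""
    window = np.hanning(frame)
    starts = range(0, len(wave) - frame + 1, hop)
    frames = np.stack([wave[s:s + frame] * window for s in starts])
    return np.abs(np.fft.rfft(frames, axis=1))

sample_rate = 8000                      # hypothetical sample rate
t = np.arange(sample_rate) / sample_rate

# Two hypothetical sources that share a frequency but are in opposite phase.
source_a = np.sin(2 * np.pi * 440 * t)
source_b = 0.8 * np.sin(2 * np.pi * 440 * t + np.pi)
mix = source_a + source_b               # the sources partly cancel in the mix

spec_a = magnitude_spectrogram(source_a)
spec_b = magnitude_spectrogram(source_b)
spec_mix = magnitude_spectrogram(mix)

# In the waveform domain the mixture is exactly the sum of its sources.
waveform_is_additive = np.allclose(mix - source_b, source_a)
# In the magnitude-spectrogram domain it is not: energy has cancelled.
spectrogram_is_additive = np.allclose(spec_mix, spec_a + spec_b)

print(waveform_is_additive, spectrogram_is_additive)
```

In this toy case the mixture's spectrogram is weaker than that of either source alone, so a method that only looks at magnitudes has little left to recover from.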
This is where Demucs comes into play—an AI-based waveform model that is designed to work in a similar way to how computer vision detects patterns in images. “It detects patterns in the waveforms and then adds higher-scale structure,” as Defossez explains. Or in other words: “Demucs can re-create the audio that it thinks is there but got lost in the mix.”
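The computer-vision analogy can be sketched with stacked strided convolutions applied directly to a waveform. This is only an illustration of the general idea, not the Demucs architecture: the function `strided_conv1d`, the random untrained filter, and the kernel size and stride are assumptions. Each layer shortens the signal, so deeper layers summarize progressively longer stretches of audio.

```python
import numpy as np

def strided_conv1d(x, kernel, stride):
    """Valid 1-D cross-correlation of x with kernel, sampled every `stride` steps."""
    n_out = (len(x) - len(kernel)) // stride + 1
    return np.array([
        np.dot(x[i * stride:i * stride + len(kernel)], kernel)
        for i in range(n_out)
    ])

rng = np.random.default_rng(0)
wave = rng.standard_normal(1024)    # stand-in for a mono waveform
kernel = rng.standard_normal(8)     # random, untrained filter (illustrative only)
stride = 4                          # hypothetical downsampling factor per layer

layers = [wave]
receptive_fields = []
field, jump = 1, 1
for _ in range(3):
    out = strided_conv1d(layers[-1], kernel, stride)
    layers.append(np.maximum(out, 0.0))      # ReLU non-linearity
    field += (len(kernel) - 1) * jump        # input samples seen by one output
    jump *= stride
    receptive_fields.append(field)

lengths = [len(layer) for layer in layers]
print(lengths)            # signal length after each layer
print(receptive_fields)   # input samples covered by one output value per layer
```

A real model would learn many filters per layer and pair this downsampling path with an upsampling path that rebuilds a waveform for each source.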
Defossez based Demucs on Wave-U-Net, an earlier AI-powered waveform model, and then went on to fine-tune his model. It now not only outperforms Wave-U-Net but is also reported to be “way beyond” state-of-the-art spectrogram-based methods.
In the future, technology like Demucs may improve the ability of AI assistants to hear voice commands in loud environments. It could also be used in hearing aids or noise-canceling headphones.
See the Tech@Facebook blog post for further details and sound samples.