In his latest deep dive, security expert Michał Zalewski explores a fascinating phenomenon in modern generative AI: ‘Phonetica.’ The term refers to models' ability to render audio waveforms that mimic human speech and intonation, often bypassing the traditional tokenization process entirely.
Zalewski demonstrates how current diffusion- and transformer-based models can hallucinate sounds, such as music or specific voices, directly from noise or vague textual cues. While technically impressive, he warns that this creates a new attack surface: adversarial audio could, in theory, embed inaudible commands in speech that humans hear as normal but that AI assistants parse as executable instructions.
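The attack class alluded to here is well documented in the adversarial machine learning literature, and a rough sketch may make it concrete. The example below (an illustration only, not the article's own demonstration) applies a single FGSM-style step against torchaudio's pretrained wav2vec2 CTC recognizer; the file names, the target phrase, and the perturbation budget are all hypothetical.

```python
# Hedged sketch, not from the article: one FGSM-style step that nudges a
# waveform toward an attacker-chosen transcription for torchaudio's pretrained
# wav2vec2 CTC recognizer. File names, TARGET_TEXT, and EPSILON are illustrative.
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
model = bundle.get_model().eval()
labels = bundle.get_labels()                        # CTC label set; index 0 is the blank token

waveform, sr = torchaudio.load("benign.wav")        # hypothetical clean recording
waveform = waveform.mean(dim=0, keepdim=True)       # mix down to mono, shape (1, time)
waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)
waveform.requires_grad_(True)

# Command the attacker wants the recognizer to output ('|' is the word separator).
TARGET_TEXT = "OPEN|THE|DOOR"
target = torch.tensor([[labels.index(c) for c in TARGET_TEXT]])

emissions, _ = model(waveform)                      # (batch, frames, vocab) logits
log_probs = torch.log_softmax(emissions, dim=-1).transpose(0, 1)  # CTC wants (frames, batch, vocab)
loss = torch.nn.functional.ctc_loss(
    log_probs, target,
    input_lengths=torch.tensor([log_probs.size(0)]),
    target_lengths=torch.tensor([target.size(1)]),
)

# Step against the gradient to make the target transcription more likely,
# keeping the perturbation small enough to be hard for a listener to notice.
loss.backward()
EPSILON = 1e-3
adversarial = (waveform - EPSILON * waveform.grad.sign()).detach().clamp(-1.0, 1.0)
torchaudio.save("adversarial.wav", adversarial, bundle.sample_rate)
```

Published attacks of this kind iterate many such steps under a psychoacoustic constraint rather than a single amplitude bound, but the principle is the same: the perturbation is optimized against the model's loss function, not against human hearing.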
As the line between synthesized and organic audio blurs, the implications for deepfakes and social engineering are profound. The article suggests that while our ears may be ‘lying’ to us about the authenticity of what we hear, the underlying mathematics is brutally honest about the technology's potential for misuse.