Fig. 1. Architecture overview of the three proposed transferable attacks: (1) speech-aware gradient optimization (SAGO), (2) MI-FGSM, and (3) VMI-FGSM.

We present audio samples, transcription outputs, and word error rates (WER) for the following attack approaches.


Clean : Original clean audio from the datasets.

White Noise : Random white noise (a naive baseline) is added to the clean audio before decoding.

PGD : Adversarial samples are generated with the projected gradient descent (PGD) attack (the strong baseline).

SAGO : Adversarial samples are generated with the proposed speech-aware gradient optimization (SAGO) attack.

MI-FGSM : Adversarial samples are generated with the proposed momentum iterative fast gradient sign method (MI-FGSM).

VMI-FGSM : Adversarial samples are generated with the proposed variance tuning momentum iterative fast gradient sign method (VMI-FGSM).
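
To make the relation between the gradient-based attacks listed above more concrete, here is a minimal NumPy sketch of the momentum-based sign-gradient update as it is commonly formulated in the literature. It is not the implementation behind these samples: the function name, the `grad_fn` callback, and all hyperparameter values (`eps`, `steps`, `mu`, `n_neighbors`, `beta`) are assumptions chosen for illustration, and SAGO is not covered because this page does not describe its update rule. With `mu = 0` and no neighbours the loop reduces to plain iterative sign-gradient steps (PGD-style, without a random start); with `mu > 0` it is an MI-FGSM-style update; with `n_neighbors > 0` it adds VMI-FGSM-style variance tuning.

```python
import numpy as np


def momentum_sign_attack(x, grad_fn, eps=0.01, steps=10, mu=1.0,
                         n_neighbors=0, beta=1.5, seed=0):
    """Iterative sign-gradient attack inside an L-infinity ball of radius eps.

    grad_fn(x_adv) is a hypothetical callback returning the gradient of the
    attack loss with respect to the input waveform.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=np.float64)
    alpha = eps / steps                 # per-step size (assumed schedule)
    x_adv = x.copy()
    g = np.zeros_like(x)                # accumulated momentum
    v = np.zeros_like(x)                # gradient variance term (VMI only)

    for _ in range(steps):
        grad = grad_fn(x_adv)
        tuned = grad + v                # v stays zero when n_neighbors == 0
        # Accumulate the L1-normalised gradient into the momentum buffer.
        g = mu * g + tuned / (np.sum(np.abs(tuned)) + 1e-12)

        if n_neighbors > 0:
            # Variance tuning: average gradient over random neighbours of the
            # current point, minus the gradient at the point itself.
            radius = beta * eps
            neighbor_grads = [
                grad_fn(x_adv + rng.uniform(-radius, radius, size=x.shape))
                for _ in range(n_neighbors)
            ]
            v = np.mean(neighbor_grads, axis=0) - grad

        # Sign step, then projection back into the eps-ball around x.
        x_adv = np.clip(x_adv + alpha * np.sign(g), x - eps, x + eps)

    return x_adv


if __name__ == "__main__":
    # Toy stand-in for an ASR loss: 0.5 * ||z - target||^2, gradient z - target.
    target = np.ones(8)
    clean = np.zeros(8)
    adv = momentum_sign_attack(clean, lambda z: z - target, n_neighbors=5)
    print("max perturbation:", np.max(np.abs(adv - clean)))
```

In a real setting `grad_fn` would backpropagate the recognizer's loss to the waveform; the quadratic toy loss here only exercises the update rule and the projection step.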


Decoded Examples from Different Attack Approaches
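
For readers who want to check the WER figures attached to each hypothesis, the following is a small sketch of word error rate computed as word-level edit distance divided by the reference length. It assumes transcripts are already lowercased and stripped of punctuation, and it is not necessarily the scoring tool used to produce the numbers on this page; the function name `wer` is ours.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent: (subs + dels + ins) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # match or substitution
        prev = curr
    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    reference = "for some time after that i remembered nothing distinctly"
    pgd_hypothesis = "for some time after that i remembered nothing disdainfully"
    print(f"{wer(reference, pgd_hypothesis):.2f} %")  # one substitution in nine words: 11.11 %
```

The usage example takes the reference and PGD hypothesis of Example 5, where a single substituted word out of nine gives the 11.11 % shown there.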


Example 1
Reference: so uncas you had better take the lead while i will put on the skin again and trust to cunning for want of speed.
White Noise Hypothesis (WER 0.00 %): so uncas you had better take the lead while i will put on the skin again and trust to cunning for want of speed.
PGD Hypothesis (WER 0.00 %): so uncas you had better take the lead while i will put on the skin again and trust to cunning for want of speed.
SAGO Hypothesis (WER 50.00 %): so uncas be it better take to leave that the fire would put on this gid again and trust depending for want of speed.
MI-FGSM Hypothesis (WER 41.67 %): and so one couldst be a better take to leave while i would put on the skin again and trust to cunning of her want of speed.
VMI-FGSM Hypothesis (WER 41.67 %): so when 1st you had better take to leave for i will put on the skin again and trust you cunning for one to sweep with.
Samples: Clean, White Noise, PGD, SAGO, MI-FGSM, VMI-FGSM


Example 2
Reference: but in this friendly pressure raoul could detect the nervous agitation of a great internal conflict.
White Noise Hypothesis (WER 0.00 %): but in this friendly pressure raoul could detect the nervous agitation of a great internal conflict.
PGD Hypothesis (WER 43.75 %): while in this friendly treasure raoul could detect a nervous agitation out of great internocondland.
SAGO Hypothesis (WER 81.25 %): by amos friendly cardinal robbed with the technine nervous agitation out of dray into no conduined.
MI-FGSM Hypothesis (WER 75.00 %): by in this ranly treasure robbed with detecting nervous agitation out of gray internal pond life.
VMI-FGSM Hypothesis (WER 81.25 %): but in the friendly creature ralph could detect being merely is education at her grave interlook on link.
Samples: Clean, White Noise, PGD, SAGO, MI-FGSM, VMI-FGSM


Example 3
Reference: i thought we were stumped again when i 1st saw that picture but it has been of some use after all.
White Noise Hypothesis (WER 0.00 %): i thought we were stumped again when i 1st saw that picture but it has been of some use after all.
PGD Hypothesis (WER 33.33 %): i thought we were stumped again when i 1st saw that fiction of bennett and a snail of some native after all.
SAGO Hypothesis (WER 76.19 %): i thought we were stumpy getting and i 1st was all that vexioner the vedder and it would have sung me to fantaker wall.
MI-FGSM Hypothesis (WER 66.67 %): i thought we were something having in my 1st saw map pension or credit spend of some need of fast parov.
VMI-FGSM Hypothesis (WER 52.38 %): i thought we were something in then if you saw that bachelor betting a stand of some needes after all.
Samples: Clean, White Noise, PGD, SAGO, MI-FGSM, VMI-FGSM


Example 4
Reference: i say you do know what this means and you must tell us.
White Noise Hypothesis (WER 0.00 %): i say you do know what this means and you must tell us.
PGD Hypothesis (WER 15.38 %): i said you do know what this means and you must tell us good.
SAGO Hypothesis (WER 69.23 %): i say you do know what bess me losing eh give mars a towel after.
MI-FGSM Hypothesis (WER 53.85 %): i say you am clearer wi this minx ere you must halves.
VMI-FGSM Hypothesis (WER 30.77 %): i said you do know what this realist ere do you must tell us.
Samples: Clean, White Noise, PGD, SAGO, MI-FGSM, VMI-FGSM


Example 5
Reference: for some time after that i remembered nothing distinctly.
White Noise Hypothesis (WER 0.00 %): for some time after that i remembered nothing distinctly.
PGD Hypothesis (WER 11.11 %): for some time after that i remembered nothing disdainfully.
SAGO Hypothesis (WER 44.44 %): and for some time after that i remembered when nothing is daintily.
MI-FGSM Hypothesis (WER 44.44 %): for some time after that i remembered nothing to stay with me.
VMI-FGSM Hypothesis (WER 44.44 %): for some time after that i remember nothing of his dignity.
Samples: Clean, White Noise, PGD, SAGO, MI-FGSM, VMI-FGSM


Example 6
Reference: the delawares are children of the tortoise and they outstrip the deer.
White Noise Hypothesis (WER 0.00 %): the delawares are children of the tortoise and they out strip the deer.
PGD Hypothesis (WER 41.67 %): the delawares are children in the tortoise and ants drifted youth.
SAGO Hypothesis (WER 75.00 %): the delawares i have children in the northern savoyance drift in youth.
MI-FGSM Hypothesis (WER 83.33 %): a delaware as i children of the torres announced drifted youth.
VMI-FGSM Hypothesis (WER 66.67 %): but delawares our children of the tortuous amounts drifted here.
Samples: Clean, White Noise, PGD, SAGO, MI-FGSM, VMI-FGSM