We present listening samples, transcription outputs, and transcription performance, measured as word error rate (WER), for the following models.


Direct Modeling (DM) : The original polyphonic music audio from the polyphonic music test sets is provided below as the DM input. DM is trained directly on polyphonic music data.

pre-and-fine sRes : The singing vocal audio extracted by the simplified Residual-Unet extractor is provided below as the pre-and-fine sRes input. pre-and-fine sRes is trained on singing vocals extracted by the simplified Residual-Unet.

PoLyScriber-NoAug : The intermediate singing vocal audio produced by the PoLyScriber-NoAug extractor is provided below. PoLyScriber-NoAug is trained on real polyphonic music data with a joint-training approach.

PoLyScriber : The intermediate singing vocal audio produced by the PoLyScriber extractor is provided below. PoLyScriber is trained on real and simulated polyphonic music data with a joint-training approach.

PoLyScriber-L : The intermediate singing vocal audio produced by the PoLyScriber-L extractor is provided below. PoLyScriber-L is trained on real and simulated polyphonic music data with a joint-training approach that adds an extraction loss.
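
For readers who want to check the WER figures reported with the decoded examples, the following is a minimal sketch of a word-level WER computation. It assumes the conventional definition (word-level edit distance divided by the number of reference words) and applies no text normalization, so `I'M` and `IM` count as different words. This page does not describe the scoring tool actually used, so the function name `wer` and the whitespace tokenization are illustrative assumptions, not the authors' implementation.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length.

    Assumption: tokens are split on whitespace and compared verbatim,
    with no case folding or punctuation stripping.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Example 1 from this page, PoLyScriber hypothesis.
reference = "YOU THINK YOU CAN DO ME SO WRONG I'M NOT THE ONE NO I'M NOT THE ONE"
hypothesis = "YOU THINK YOU CAN DO ME SO WRONG IM NOT THE ONE IM NOT THE ONE"
print(f"WER {100 * wer(reference, hypothesis):.2f} %")  # WER 17.65 %
```

Under these assumptions the sketch gives 3 errors over 17 reference words for this pair (two `I'M`/`IM` substitutions and one deleted `NO`), which agrees with the 17.65 % listed for it in Example 1.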


Decoded Examples from Different Models


Example 1: Reference YOU THINK YOU CAN DO ME SO WRONG I'M NOT THE ONE NO I'M NOT THE ONE
DM Hypothesis (Hyp) (WER 52.94 %): YOU MAKE YOU CANT DO ME THE RIGHT ITS NOT THE ONE
pre-and-fine sRes Hyp (WER 41.18 %): YOU MAKE YOU CAN DO ME SO WRONG ITS NOT THE ONE
PoLyScriber-NoAug Hyp (WER 35.29 %): YOU THINK YOU CAN DO ME SO WRONG ITS NOT THE ONE
PoLyScriber Hyp (WER 17.65 %): YOU THINK YOU CAN DO ME SO WRONG IM NOT THE ONE IM NOT THE ONE
PoLyScriber-L Hyp (WER 29.41 %): YOU THINK YOU CAN DO ME SO WRONG JUST NOT NO I KNOW ITS NOT THE ONE
DM pre-and-fine sRes PoLyScriber-NoAug PoLyScriber PoLyScriber-L
Sample 1


Example 2: Reference GET A TASTE OF MY BAD SIDE JUST A TASTE OF MY BAD SIDE JUST A TASTE OF MY BAD SIDE
DM Hyp (WER 47.62 %): GOT A TASTE OF MY BAD TIMES IM JUST A TASTE OF MY BAD TIME
pre-and-fine sRes Hyp (WER 85.71 %): GET YOUR TEXT UP AND TOUCH THE TASTE OF
PoLyScriber-NoAug Hyp (WER 42.86 %): GET A TASTE OF MY BAD TIME JUST A TASTE OF MY BAD TIME
PoLyScriber Hyp (WER 33.33 %): GET A TASTE OF MY BAD TIMES INTO THE TASTE OF MY BODY JUST A TASTE OF MY
PoLyScriber-L Hyp (WER 23.81 %): GET A TASTE OF MY OUTSIDE AND JUST A TASTE OF MY BAD TIME JUST A TASTE OF MY
DM pre-and-fine sRes PoLyScriber-NoAug PoLyScriber PoLyScriber-L
Sample 2


Example 3: Reference WHAT'S LOVE BUT A SECOND HAND EMOTION WHAT'S LOVE GOT TO DO GOT TO DO WITH IT
DM Hyp (WER 52.94 %): LOVE FOR THE SECONDS HAND DEVOTION WHATS LOVE GOT TO DO THIS
pre-and-fine sRes Hyp (WER 64.71 %): A SECOND AND EMOTION TWICE LOVE GOT TO DO THIS
PoLyScriber-NoAug Hyp (WER 47.06 %): WHATS LOVE BUT A SECOND AND EMOTION ONCE LOVE GOT TO DO THIS
PoLyScriber Hyp (WER 23.53 %): WHATS LOVE BUT A SECOND ANY POTION WHATS LOVE GOT TO DO GOT TO DO WITH IT
PoLyScriber-L Hyp (WER 35.29 %): LAST LOVE BUT A SECOND ANY POTION TWICE LOVE GOT TO DO GOT TO DO THIS
DM pre-and-fine sRes PoLyScriber-NoAug PoLyScriber PoLyScriber-L
Sample 3


Example 4: Reference TEARS ARE LIKE DIAMONDS IN MY EYES BLIND ME SO NOW YOURE OUT OF SIGHT ITS ENOUGH
DM Hyp (WER 29.41 %): YOURS I LIKE DEMANDS IN MY EYES LIKE YOURSELF NOW YOURE OUT OF SIGHT ITS ENOUGH
pre-and-fine sRes Hyp (WER 35.29 %): YEARS I LIKE DEMANDS IN MY EYES LIKENESS SO NOW YOURE OUT OF SIGHT ITS ENOUGH
PoLyScriber-NoAug Hyp (WER 23.53 %): TEARS I LIKE DIAMONDS IN MY EYES LIGHT ME SO NOW YOURE OUT OF SIGHT ITS ENOUGH
PoLyScriber Hyp (WER 11.76 %): TEARS I LIKE DIAMONDS IN MY EYES LIGHT YOURSELF NOW YOURE OUT OF SIGHT ITS ENOUGH
PoLyScriber-L Hyp (WER 41.18 %): YOURS I LIKE THE MANS IN MY EYES LIGHT YOUR SOUL NOW YOURE OUT OF SIGHT ITS ENOUGH
DM pre-and-fine sRes PoLyScriber-NoAug PoLyScriber PoLyScriber-L
Sample 4


Example 5: Reference YOU HAD MY HEART INSIDE YOUR HAND BUT YOU PLAYED IT WITH A BEATING
DM Hyp (WER 35.71 %): YOU HAVE MY HEART INSIDE OF YOUR HEAD BUT YOU PAINTED WITH THE BEATING
pre-and-fine sRes Hyp (WER 42.86 %): YOU HAD MY HEART AND GET YOUR HEAD BUT YOU PLAYED IT BEATING
PoLyScriber-NoAug Hyp (WER 28.57 %): YOU HAVE MY HEART AND GET YOUR HEAD BUT YOU PLAYED IT WITH A BEATING
PoLyScriber Hyp (WER 14.29 %): YOU HAD MY HEART INSIDE OF YOUR HEAD BUT YOU PLAYED IT WITH A BEATING
PoLyScriber-L Hyp (WER 28.57 %): YOU HAVE MY HEART AND CARRY YOUR HEAD BUT YOU PLAYED IT WITH A BEATING
DM pre-and-fine sRes PoLyScriber-NoAug PoLyScriber PoLyScriber-L
Sample 5


Example 6: Reference GO UP AND EVERYBODY IN THE WHOLE DAMN COURT GOES NUTS PEOPLE GONNA HATE LET THEM DO IT SHINE LIKE IT AINT NOTHING TO
DM Hyp (WER 54.17 %): GO UP AND EVERYBODY IN THE WHOLE DAMN CAR GOES NUGGS PEOPLE GON HATE LET EM TOO
pre-and-fine sRes Hyp (WER 50.00 %): YOU HAD MY HEART AND GET YOUR HEAD BUT YOU PLAYED IT BEATING
PoLyScriber-NoAug Hyp (WER 37.50 %): THE WOP AND EVERYBODY IN THE WHOLE DAMN CALL GHOST NOTHINGS PEOPLE GON HATE LET EM TOO I SHINE LIKE IT AINT NOTHING TO
PoLyScriber Hyp (WER 20.83 %): GO UP AND EVERYBODY IN THE WHOLE DAMN CAR GOES NOTHING PEOPLE GON HATE LET EM DO I SHINE LIKE IT AINT NOTHING TO
PoLyScriber-L Hyp (WER 25.00 %): GO UP AND EVERYBODY IN THE WHOLE DAMN CAR GOES US PEOPLE GON HATE LET EM TOO I SHINE LIKE IT AINT NOTHING TO
DM pre-and-fine sRes PoLyScriber-NoAug PoLyScriber PoLyScriber-L
Sample 6