Self-Transcriber: Few-shot Lyrics Transcription with Self-training

Fig. 1. The network architecture and training flow of the proposed Self-Transcriber with (a) supervised learning and (b) self-training techniques.

We show the transcription outputs, the word error rate (WER), and the error patterns (insertions, deletions, and substitutions) for the following models.

Supervised model: The supervised model is trained on labeled solo-singing data from DS1.

Self-TranscriberS2: Self-TranscriberS2 applies self-training for two iterations, using a smaller amount of unlabeled data (DS31) together with the labeled data DS1.

Self-Transcriber2: Self-Transcriber2 applies self-training for two iterations, using a larger amount of unlabeled data (DS301) together with the labeled data DS1.
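This page does not spell out the self-training procedure. As a rough illustration only, the sketch below shows a generic pseudo-labeling loop, which is one common reading of "self-training with two iterations". The function `self_train` and the `train` and `transcribe` callables are hypothetical placeholders, not the authors' implementation.

```python
from typing import Any, Callable, List, Tuple

# Hypothetical types for illustration: an utterance is identified by a string,
# and a "model" is whatever the supplied train function returns.
Pair = Tuple[str, str]  # (utterance id, transcript)


def self_train(
    labeled: List[Pair],
    unlabeled: List[str],
    train: Callable[[List[Pair]], Any],
    transcribe: Callable[[Any, str], str],
    iterations: int = 2,
) -> Any:
    """Generic pseudo-labeling loop (an assumed reading of "self-training")."""
    # Seed model: trained on the labeled data only.
    model = train(list(labeled))
    for _ in range(iterations):
        # Transcribe the unlabeled pool with the current model ...
        pseudo = [(utt, transcribe(model, utt)) for utt in unlabeled]
        # ... then retrain on the labeled plus pseudo-labeled data.
        model = train(list(labeled) + pseudo)
    return model
```

In this sketch the labeled data would play the role of DS1 and the unlabeled pool the role of DS31 or DS301. Whether pseudo-labels are filtered, or whether the model is retrained from scratch in each iteration, is left open here.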

Decoded Examples from Different Models

Example 1

Reference: YOU ARE GETTING CLOSER IN SLOW MOTION

Supervised model Hypothesis (WER 28.57 % with 0 ins, 1 del, 1 sub): YOU ARE GETTING CLOSER IN SOMOTION

Self-TranscriberS2 Hypothesis (WER 14.29 % with 0 ins, 0 del, 1 sub): YOU WERE GETTING CLOSER IN SLOW MOTION

Self-Transcriber2 Hypothesis (WER 0.00 % with 0 ins, 0 del, 0 sub): YOU ARE GETTING CLOSER IN SLOW MOTION

Example 2

Reference: OVER YOUR SHOULDER I WAS STONE COLD SOBER I PULLED YOU CLOSER TO MY CHEST

Supervised model Hypothesis (WER 33.33 % with 0 ins, 2 del, 3 sub): OVER YOUR SHOULDER ONE AND OVER I PULLED YOU CLOSER TO MY CHEST

Self-TranscriberS2 Hypothesis (WER 20.00 % with 1 ins, 0 del, 2 sub): OVER YOUR SHOULDERS I WAS STONE AND COLD BUT I PULLED YOU CLOSER TO MY CHEST

Self-Transcriber2 Hypothesis (WER 13.33 % with 0 ins, 0 del, 2 sub): OVER YOUR SHOULDER I WAS STONE AND SOME I PULLED YOU CLOSER TO MY CHEST

Example 3

Reference: LIVING JUST TO FIND EMOTION HIDING SOMEWHERE IN THE NIGHT

Supervised model Hypothesis (WER 30.00 % with 1 ins, 0 del, 2 sub): LIVING JUST TO FIND EMOTIONS HIGH HEADING SOMEWHERE IN THE NIGHT

Self-TranscriberS2 Hypothesis (WER 20.00 % with 1 ins, 0 del, 1 sub): A LIVING JUST TO FIND EMOTIONS HIDING SOMEWHERE IN THE NIGHT

Self-Transcriber2 Hypothesis (WER 0.00 % with 0 ins, 0 del, 0 sub): LIVING JUST TO FIND EMOTION HIDING SOMEWHERE IN THE NIGHT

Example 4

Reference: THROUGH IT WE GON' DO IT LAINIE UNCLE'S CRAZY AIN'T HE YEAH BUT HE LOVES YOU GIRL AND YOU BETTER KNOW IT

Supervised model Hypothesis (WER 59.09 % with 1 ins, 3 del, 9 sub): FOR WE WERE ON THE LINE YOUNG WAS CRAZY AND HEAR BUT HE LOVES A GIRL AND YOU BETTER KNOW

Self-TranscriberS2 Hypothesis (WER 31.82 % with 1 ins, 2 del, 4 sub): THROUGH IT WE GONNA DO IT LIKE IT WAS CRAZY AIN'T YEAH BUT HE LOVES A GIRL AND YOU BETTER KNOW

Self-Transcriber2 Hypothesis (WER 22.73 % with 0 ins, 1 del, 4 sub): THROUGH IT WE GON' DO IT LANE YOUNG CRAZY AND HY YEAH BUT HE LOVES YOU GIRL AND YOU BETTER KNOW
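The WER and error counts listed with each hypothesis follow from a word-level edit-distance alignment against the reference: WER = (ins + del + sub) / number of reference words. The sketch below is a minimal illustration, assuming whitespace tokenization and unit cost for every error type. `wer_counts` is a hypothetical helper, not the scoring tool used to produce the numbers on this page. When several alignments share the same minimal cost, the split among insertions, deletions, and substitutions may differ from another tool's, although the WER itself does not.

```python
from typing import Tuple


def wer_counts(reference: str, hypothesis: str) -> Tuple[float, int, int, int]:
    """Return (WER in percent, insertions, deletions, substitutions)."""
    ref, hyp = reference.split(), hypothesis.split()
    n, m = len(ref), len(hyp)

    # d[i][j] = minimum number of edits turning ref[:i] into hyp[:j].
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(1, m + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j - 1] + cost,  # match or substitution
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
            )

    # Backtrace one optimal alignment to count each error type.
    ins = dele = sub = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and ref[i - 1] == hyp[j - 1] and d[i][j] == d[i - 1][j - 1]:
            i, j = i - 1, j - 1          # correct word
        elif i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + 1:
            sub += 1
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            dele += 1
            i -= 1
        else:
            ins += 1
            j -= 1

    wer = 100.0 * (ins + dele + sub) / n
    return wer, ins, dele, sub


# Example 1, supervised model: "SLOW MOTION" was recognized as "SOMOTION".
wer, ins, dele, sub = wer_counts(
    "YOU ARE GETTING CLOSER IN SLOW MOTION",
    "YOU ARE GETTING CLOSER IN SOMOTION",
)
print(f"{wer:.2f} {ins} {dele} {sub}")  # 28.57 0 1 1
```

For Example 1 the reference has 7 words, so one deletion plus one substitution gives 2 / 7 = 28.57 %, matching the figure reported for the supervised model.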