Automatic Speech Recognition for Maltese - Part 1
September 22, 2024
As part of my Master’s Degree in AI, one of my assignments was to fine-tune the OpenAI Whisper model for Maltese.
Maltese is a low-resource language spoken by approximately half a million people, most of whom reside on the tiny island of Malta.
I was inspired by this paper that did the same process for Javanese, another low-resource language spoken mainly by residents of the island of Java.
Using my trusty and toasty Nvidia 1080 Ti and a Jupyter notebook provided by Hugging Face, I was able to fine-tune whisper-large-v2 in around 20 hours, using 32.9 hours of transcribed Maltese audio obtained by combining the Mozilla Common Voice and FLEURS datasets.
With LoRA parameters of r=32 and alpha=64, approximately 1% of the original parameters were retrained, and the results were very encouraging:
| Model | Maltese WER % on original Whisper | Maltese WER % on fine-tuned Whisper | Diff |
|---|---|---|---|
| Base | 114.66 | 67.88 | -46.78 |
| Small | 111.18 | 44.67 | -66.51 |
| Large V2 | 89.67 | 32.00 | -57.67 |
After completing my assignment, I decided to take it further and see how far I could push the LoRA parameters to improve the model. From my research, the current wisdom is to always set the alpha value to double the rank value (at least for LLMs). I could not find any supporting evidence for this in ASR, so I varied that assumption too.
Here are the results of tuning whisper-large-v3:
| Rank | Alpha | Dropout | WER % | CER % |
|---|---|---|---|---|
| 512 | 1024 | 0.05 | 38.98 | 11.43 |
| 1024 | 1024 | 0.05 | 285.1 | 210.98 |
| 1024 | 16 | 0.05 | 34.74 | 9.70 |
| 1024 | 24 | 0.1 | 32.67 | 8.93 |
| 1024 | 48 | 0.1 | 31.68 | 8.61 |
| 1024 | 48 | 0.2 | 29.59 | 7.86 |
| 1024 | 64 | 0.3 | 29.51 | 7.94 |
Increasing the number of trainable parameters was expected to bring an improvement (rank 1024 retrains about 25% of the original model's parameters), but in the end increasing the dropout was what improved both WER and CER the most.
Testing on data outside the test set produces output that sounds phonetically like Maltese but is not correct. My current intuition is that a lack of data is what is stopping the model from improving further, which I will address in part 2.