Automatic Speech Recognition for Maltese - Part 1
September 22, 2024
As part of my Master’s Degree in AI, one of my assignments was to fine-tune the OpenAI Whisper model for Maltese.
Maltese is a low-resource language spoken by approximately half a million people, most of whom reside on the tiny island of Malta.
I was inspired by this paper that did the same process for Javanese, another low-resource language spoken mainly by residents of the island of Java.
Using my trusty and toasty Nvidia 1080 Ti and a Jupyter notebook provided by Hugging Face, I was able to fine-tune whisper-large-v2 in around 20 hours, using 32.9 hours of transcribed Maltese audio obtained by combining the Mozilla Common Voice and FLEURS datasets.
With LoRA parameters of r=32 and alpha=64, approximately 1% of the original parameters were retrained, and the results were very encouraging:
| Model | Maltese WER % on original Whisper | Maltese WER % on fine-tuned Whisper | Diff |
|---|---|---|---|
| Base | 114.66 | 67.88 | -46.78 |
| Small | 111.18 | 44.67 | -66.51 |
| Large V2 | 89.67 | 32.00 | -57.67 |
After completing my assignment, I decided to take it further and see how far I could push the LoRA parameters to improve the model. From my research, the current wisdom is to always set the alpha value to double the rank value (at least for LLMs). I could not find any supporting evidence for this in ASR, so I varied that assumption too.
Here are the results of tuning whisper-large-v3:
| Rank | Alpha | Dropout | WER % | CER % |
|---|---|---|---|---|
| 512 | 1024 | 0.05 | 38.98 | 11.43 |
| 1024 | 1024 | 0.05 | 285.1 | 210.98 |
| 1024 | 16 | 0.05 | 34.74 | 9.70 |
| 1024 | 24 | 0.1 | 32.67 | 8.93 |
| 1024 | 48 | 0.1 | 31.68 | 8.61 |
| 1024 | 48 | 0.2 | 29.59 | 7.86 |
| 1024 | 64 | 0.3 | 29.51 | 7.94 |
Increasing the number of trainable parameters was expected to bring an improvement (rank 1024 retrains about 25% of the original model's parameters), but in the end increasing the dropout was what improved both WER and CER the most.
Testing on data outside the test set produces output that sounds phonetically like Maltese but is not correct. My current intuition is that a lack of data is what is stopping the model from improving further, which I will address in part 2.