Speech Recognition in Any Language
Anjana Vakil & Max Paulus
Department of Computational Linguistics, Saarland University, Germany
With contributions from: Kayokwa Chibuye (University of Cape Town, South Africa)
Developers trying to incorporate speech recognition interfaces in a low-resource language (LRL) into their applications currently face the hurdle of not finding recognition engines trained on their target language. However, for small-vocabulary applications, an existing recognizer for a high-resource language (HRL) can be used to perform recognition in the target language. This requires a pronunciation lexicon mapping the relevant words in the target language into sequences of sounds in the HRL.
lex4all is an easy-to-use desktop application for Windows that allows non-expert users to automatically create a pronunciation lexicon for words in any language, using a small number of audio recordings and a pre-existing recognition engine in an HRL such as English. The resulting lexicon can then be used to add small-vocabulary speech recognition functionality to applications in the LRL.
lex4all lets you...
- Build pronunciation lexicons for any language
- Use existing .wav audio files, or use the built-in audio recorder
- Fine-tune parameters to improve recognition accuracy
- Evaluate lexicons for testing/research
- Choose from 5 built-in source languages for recognition
Walkthrough (with screenshots)
How it works
A simple user interface allows the user to easily specify one written form (text string)
and one or more audio samples (.wav files) for each word in the target vocabulary,
and to set other options (e.g. number of pronunciations per word, name/save location of lexicon file).
The audio is then passed to a speech recognition engine for an HRL (English).
An automatic pronunciation generation algorithm (the Salaam method, [2–3])
finds the best pronunciation(s) for each word in the LRL vocabulary.
The program outputs a pronunciation lexicon (.pls XML file).
This lexicon file follows the Pronunciation Lexicon Specification,
so it can be directly included in a speech recognition application,
e.g. one built using the Microsoft Speech Platform API.
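To give a feel for the output step above, here is a minimal sketch of how a lexicon in the Pronunciation Lexicon Specification format could be serialized. This is not lex4all's own code (lex4all is written in C#); the function name `build_pls`, the example word, and its phoneme string are hypothetical placeholders, and the default alphabet name `x-microsoft-ups` and language tag `en-US` are assumptions about what an English Microsoft Speech Platform recognizer would expect.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the W3C Pronunciation Lexicon Specification 1.0.
PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def build_pls(entries, lang="en-US", alphabet="x-microsoft-ups"):
    """Serialize {written form: [phoneme strings]} as a PLS XML string.

    `lang` is the language of the source (HRL) recognizer, not of the
    target words; both defaults are assumptions made for illustration.
    """
    ET.register_namespace("", PLS_NS)  # emit PLS as the default namespace
    lexicon = ET.Element(
        f"{{{PLS_NS}}}lexicon",
        {"version": "1.0", "alphabet": alphabet, f"{{{XML_NS}}}lang": lang},
    )
    for grapheme, pronunciations in entries.items():
        lexeme = ET.SubElement(lexicon, f"{{{PLS_NS}}}lexeme")
        ET.SubElement(lexeme, f"{{{PLS_NS}}}grapheme").text = grapheme
        # One <phoneme> element per alternative pronunciation of the word.
        for phones in pronunciations:
            ET.SubElement(lexeme, f"{{{PLS_NS}}}phoneme").text = phones
    return ET.tostring(lexicon, encoding="unicode")


if __name__ == "__main__":
    # Placeholder entry: the phoneme string is illustrative, not verified.
    print(build_pls({"beeni": ["B EH N IY"]}))
```

Each word becomes one lexeme with a single grapheme and one phoneme element per discovered pronunciation, which is how the "number of pronunciations per word" option would show up in the file.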
This approach to language-independent recognition requires an existing high-quality speech recognition engine with a usable API; we chose to use the English recognition engine of the Microsoft Speech Platform, so lex4all is written in C#. The audio recording feature was built using the NAudio API.
To automatically discover the pronunciation mappings we implement the Salaam algorithm as presented in [2-3]; a slight modification was made to reduce the algorithm's running time. In addition to the basic discovery algorithm [2], users have the choice of applying the discriminative training algorithm [3] as well.
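The final selection of pronunciations can be pictured with a deliberately simplified sketch. It is not the Salaam algorithm as published: the iterative discovery procedure of [2] and the discriminative training of [3] are omitted. It only assumes that some recognizer has already returned, for each audio sample of a word, a list of candidate phoneme strings with confidence scores, and it keeps the `n` candidates with the highest average confidence across samples. The function name, the input format, and the averaging rule are all assumptions made for illustration.

```python
from collections import defaultdict


def top_pronunciations(nbest_per_sample, n=2):
    """Pick the n candidates with the highest mean confidence.

    nbest_per_sample: one list per audio sample, each containing
    (phoneme_string, confidence) pairs from a hypothetical recognizer.
    A candidate missing from a sample contributes 0 for that sample.
    """
    if not nbest_per_sample:
        return []
    totals = defaultdict(float)
    for nbest in nbest_per_sample:
        for phones, confidence in nbest:
            totals[phones] += confidence
    num_samples = len(nbest_per_sample)
    # Sort by total confidence (descending), breaking ties alphabetically.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [(phones, total / num_samples) for phones, total in ranked[:n]]


if __name__ == "__main__":
    # Made-up candidates and scores for two recordings of the same word.
    samples = [
        [("A B", 0.5), ("A C", 0.25)],
        [("A B", 0.5), ("A D", 0.75)],
    ]
    print(top_pronunciations(samples, n=2))
```

Averaging over samples favors candidates that the recognizer proposes consistently across recordings over ones that score well on a single recording only.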
Publications
Anjana Vakil, Max Paulus, Alexis Palmer and Michaela Regneri. 2014. "lex4all: A language-independent tool for building and evaluating pronunciation lexicons for small-vocabulary speech recognition." In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL 2014): System Demonstrations. [pdf]
Anjana Vakil and Alexis Palmer. 2014. "Cross-language mapping for small-vocabulary ASR in under-resourced languages: investigating the impact of source language choice." In: Proceedings of the 4th Workshop on Spoken Language Technologies for Under-resourced Languages (SLTU'14). [pdf]
Chibuye, N.K., Rosenstock, T. and DeRenzi, B., 2018. "Cross-language Phoneme Mapping for Low-resource Languages: An Exploration of Benefits and Trade-offs." In: INTERSPEECH (pp. 2623-2627). [pdf]
References
[1] Jahanzeb Sherwani. 2009. “Speech interfaces for information access by low literate users”. PhD thesis. Pittsburgh, PA, USA: Carnegie Mellon University. [pdf]
[2] Fang Qiao, Jahanzeb Sherwani, and Roni Rosenfeld. 2010. “Small-vocabulary speech recognition for resource-scarce languages”. In: Proceedings of the First ACM Symposium on Computing for Development (ACM DEV ’10). [pdf]
[3] Hao Yee Chan and Roni Rosenfeld. 2012. “Discriminative pronunciation learning for speech recognition for resource scarce languages”. In: Proceedings of the 2nd ACM Symposium on Computing for Development (ACM DEV ’12). [pdf]