Accurate and Efficient Pronunciation Evaluation using CMUSphinx for Spoken Language Learning
Troy
Short description: Pronunciation learning is one of the most important parts of second language acquisition. The aim of this project is to use automatic speech recognition technology to facilitate spoken language learning. The project will focus on developing an accurate and efficient pronunciation evaluation system using CMU Sphinx3 and on reaching as many learners as possible by implementing mobile apps on top of the evaluation system. Additionally, we plan to design and implement game-based pronunciation learning to make the learning process more fun. The project involves four sub-tasks: automatic edit-distance-based grammar generation, building an exemplar pronunciation database, implementing an Android interface for pronunciation evaluation, and developing a game-based learning interface.
Personal information
- Name
Li Bo (English Name: Troy)
troy.lee2008@gmail.com
- Country, city during summer period
Singapore
- Current university and course, year started, length, expected completion date.
National University of Singapore, Ph.D. research on acoustic model adaptation for automatic speech recognition, started in 2008, expected completion in 2013.
- IRC nick on freenode
troy.lee2008
- Are you subscribed to cmusphinx-devel mailing list
Yes
- Google talk ID (optional)
troy.lee2008
- Phone number (optional)
+65 96285609
- Provide a link to your CV (optional)
http://www.comp.nus.edu.sg/~li-bo/files/BoLI.pdf
- Provide a link to your personal blog/homepage (optional)
Homepage: http://www.comp.nus.edu.sg/~li-bo/
Blog: http://troylee2008.blogspot.com/
- Project title
Accurate and Efficient Pronunciation Evaluation using CMU Sphinx3 for Spoken Language Learning
- Description
Pronunciation learning is one of the most important parts of second language acquisition. The aim of this project is to use automatic speech recognition technology to facilitate spoken language learning. The project will focus on developing an accurate and efficient pronunciation evaluation system using CMU Sphinx3 and on reaching as many learners as possible by implementing mobile apps on top of the evaluation system. Additionally, we plan to design and implement game-based pronunciation learning to make the learning process more fun. The project involves four sub-tasks: automatic edit-distance-based grammar generation, building an exemplar pronunciation database, implementing an Android interface for pronunciation evaluation, and developing a game-based learning interface.
- Why did you choose this idea?
From my personal experience of learning English, speaking is the most difficult part, and progress in it is hard to assess. I believe automatic pronunciation evaluation would benefit many English learners. As a Ph.D. student in automatic speech recognition, I have a strong background for related projects, and I have experience in modifying other speech toolkits such as HTK and Kaldi. I have wanted to work on open source projects for a long time but could not find a good starting point. Working with mentors on an open source project in GSoC would be the best way to join this open source community.
- Show us that you've thought about (and/or discussed) what would really be involved in your chosen project
Although I started my application quite late, almost at the last minute before the deadline, I have already contacted the mentor for this project by email and the discussion is still going on. Based on my previous research on phoneme verification based pronunciation evaluation, a successful pronunciation evaluation system must address two major problems: 1) the feedback mechanism must identify severe mispronunciations with high reliability, and 2) the learning system must be fun enough that learners keep coming back. Early pronunciation evaluation systems such as FLUENCY [1], LISTEN [2] and EduSpeak [3] are based on Automatic Speech Recognition (ASR) in a text-independent paradigm, with the goal of providing feedback on any utterance the learner may produce. However, even now, after more than ten years of development, the performance of ASR systems degrades dramatically with accents, noise and similar factors. Text-dependent approaches therefore remain preferable for pronunciation evaluation.
In [4], the authors built a vocabulary tutoring system that uses the output of an ASR engine in a text-dependent paradigm. Three types of phoneme-level language models were investigated: the FIXED grammar based on the linear phonetic transcription, the FREE phone loop network, and the BIPHONE back-off statistical language model. With the FIXED model, the learner’s phonetic realization was classified as “acceptable” or “unacceptable” based on acoustic confidence scores normalized by phoneme duration. For the other two models, the edit distance between the phoneme sequence in the ASR output and the dictionary pronunciation was used for evaluation. The authors showed that the BIPHONE model yielded the best evaluation performance. A likely explanation is that the hard constraints of the FIXED model cannot capture insertion and deletion errors in the learner’s pronunciation, while the FREE model has too large a search space, which makes error detection more difficult. However, their study was conducted on the CMU Kids corpus. Besides children, there is a large population of foreign speakers learning English who have quite different language backgrounds. Influences from the mother tongue and from accents may lead to mispronunciation patterns that are rarely seen in young native English learners. To address this problem, we propose to automatically extract edit-distance-based decoding grammars for text-dependent pronunciation evaluation. The grammar would be based on common mispronunciation patterns mined from non-native speech data and represented as a phoneme network (lattice). In [5], recognition phoneme lattices are used for non-native pronunciation evaluation, an approach that shares the same limitation as the BIPHONE model.
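To make the edit-distance scoring described above concrete, here is a minimal sketch of a Levenshtein distance computed over phoneme sequences rather than characters. The function name, the uniform cost of 1 per error, and the example transcriptions are assumptions made for illustration; they are not taken from [4] or from any existing evaluation system.

```python
def phoneme_edit_distance(reference, hypothesis):
    """Levenshtein distance between two phoneme sequences (lists of strings)."""
    # previous[j] is the distance between the reference prefix processed
    # so far and hypothesis[:j].
    previous = list(range(len(hypothesis) + 1))
    for i, ref_phone in enumerate(reference, start=1):
        current = [i]
        for j, hyp_phone in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_phone != hyp_phone)
            deletion = previous[j] + 1       # learner dropped ref_phone
            insertion = current[j - 1] + 1   # learner added hyp_phone
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


# Hypothetical example: dictionary pronunciation vs. recognizer output.
canonical = ["TH", "IH", "NG", "K"]
recognized = ["S", "IH", "NG"]
print(phoneme_edit_distance(canonical, recognized))  # 2: one substitution, one deletion
```

A real system would likely weight substitutions by phonetic similarity instead of charging every error the same cost.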
In our proposed approach, the grammar is combined with the acoustic model to find the best phoneme sequence among all competing hypotheses. The statistically learnt grammar covers both the possible mispronunciation patterns and the correct phonetic transcription (i.e. the FIXED model), and is much smaller than the FREE model. Additionally, because the grammar is extracted from non-native speech data, it could capture different mispronunciation patterns for learners from different language backgrounds.
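The idea of mining mispronunciation patterns and folding them into a compact phoneme network can be sketched as follows. This is only an illustration under several assumptions: the alignment uses Python's difflib.SequenceMatcher as a convenient stand-in for a proper minimum-edit-distance alignment, only one-for-one substitutions are mined, the network is a plain list of alternative sets rather than a Sphinx3 grammar file, and the names mine_substitutions, build_network and min_count are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def mine_substitutions(pairs):
    """Count (canonical, spoken) phoneme substitutions over utterance pairs."""
    counts = Counter()
    for canonical, spoken in pairs:
        matcher = SequenceMatcher(a=canonical, b=spoken, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            # Keep only one-for-one replacements; insertions and deletions
            # would need their own handling in a fuller system.
            if tag == "replace" and (i2 - i1) == (j2 - j1):
                counts.update(zip(canonical[i1:i2], spoken[j1:j2]))
    return counts


def build_network(canonical, counts, min_count=2):
    """Return one set of allowed phonemes per canonical position."""
    network = []
    for phone in canonical:
        alternatives = {phone}  # the correct phoneme is always allowed
        for (ref, hyp), n in counts.items():
            if ref == phone and n >= min_count:
                alternatives.add(hyp)  # frequent mispronunciation of this phone
        network.append(alternatives)
    return network


# Hypothetical (canonical, recognized) pairs from non-native speech.
pairs = [
    (["TH", "IH", "NG", "K"], ["S", "IH", "NG", "K"]),
    (["TH", "AE", "NG", "K"], ["S", "AE", "NG", "K"]),
    (["R", "AY", "T"], ["L", "AY", "T"]),
]
counts = mine_substitutions(pairs)
network = build_network(["TH", "IH", "N"], counts)
print(network)
```

With these toy pairs, only the TH to S substitution is frequent enough to be added, so the first position of the network allows both TH and S while the remaining positions keep only the canonical phoneme. The search space therefore stays close to the FIXED model while still admitting the common error.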
Besides developing more advanced language models for pronunciation evaluation, improving the acoustic models could further benefit the evaluation system. Apart from standard Gaussian Mixture Models, hybrid Neural Network / Hidden Markov Model systems, especially Deep Belief Network based acoustic models, have become much more popular in the speech community in recent years. Speaker adaptation techniques, such as Maximum Likelihood Linear Regression and Maximum a Posteriori for HMMs, and Linear Input Network and Linear Hidden Network for Neural Networks, could also be explored [6]. Another possible future direction is to adopt verification-based acoustic scoring instead of using acoustic likelihood scores directly, which has been shown to yield better performance for pronunciation evaluation in [7].
Apart from the technical aspects, from the engineering and design perspective the pronunciation evaluation system must be attractive so that learners are willing to participate in the learning process. Automatic learning solutions cannot force users to learn and usually struggle to build a loyal user base. To address this problem, we propose to implement an Android interface for our evaluation system, so that users can practice anytime and anywhere they wish. Additionally, a game-based interface is also planned in this project.
- Research papers on the project you have read (Titles and short resume)
[1]. Eskenazi, M.: Using Automatic Speech Processing for Foreign Language Pronunciation Tutoring: Some Issues and a Prototype. Language Learning and Technology 2, 62–67 (1999)
This article discusses the early attempt to build the CMU FLUENCY system for improving students’ accents in a foreign language. Both phonetic and prosodic aspects were investigated. The author also pointed out the importance of adaptation in such learning systems for accurate assessment.
[2]. Mostow, J., Roth, S., Hauptmann, A., Kane, M.: A Prototype Reading Coach that Listens. In: Proceedings of the Twelfth National Conference on Artificial Intelligence, vol. 1, pp. 785–792 (1994)
This paper presents the development of Project LISTEN, which is developing a novel weapon against illiteracy: an automated reading coach that displays a story on a computer screen, listens to a child read it aloud, and helps where needed. The coach provides a combination of reading and listening, in which the child reads wherever possible, and the coach helps wherever necessary. The project was based on CMU’s Sphinx-II speech recognizer.
[3]. Franco, H., Neumeyer, L., Digalakis, V., Weintraub, M.: Automatic Scoring of Pronunciation Quality. Speech Communication 30, 83–93 (1999)
In this paper, the authors present their work on automatic assessment of pronunciation quality by machine. Both native and non-native speech data were collected, together with human-expert ratings. The evaluation problem is reformulated as a prediction problem: predicting the grade a human expert would assign to a particular speaker. Different scores are investigated, and the authors show that the log-posterior and the normalized duration scores achieve a correlation with the target human grades that is comparable to the human-to-human correlation when a sufficient amount of speech data is available.
[4]. Pakhomov et al. (2008) “Forced-Alignment and Edit-Distance Scoring for Vocabulary Tutoring Applications” Lecture Notes in Computer Science 5246:443-50
In this paper, the authors built their vocabulary tutoring system utilizing the output of an ASR engine in a text-dependent paradigm. Three types of phoneme level language models are investigated, namely, the FIXED linear phonetic transcription based grammar, the FREE phone loop network and the BIPHONE back-off statistical language model. The classification of the learner’s phonetic realization into “acceptable” and “unacceptable” using the FIXED model was performed based on the acoustic confidence scores normalized by the duration of the phoneme. For the other two types of language models, the edit distance between the sequence of phonemes in the ASR output and those in the dictionary pronunciation is adopted for pronunciation evaluation.
[5]. Silke Witt and Steve Young. Language Learning Based on Non-Native Speech Recognition. In Proceedings of EUROSPEECH 1997, pages 633-636, Rhodes, Greece, 1997.
This work presents methods of assessing non-native speech to aid automatic pronunciation evaluation. A Goodness of Pronunciation score is defined, which was based on the log-likelihood of each phone segment in an HMM lattice, normalized by the number of frames in the segment. Phone dependent thresholds were defined to indicate the presence of a mispronunciation. These were empirically derived based on hand analysis. Using results from forced alignment recognition, the most common substitution errors were discovered and the phone models augmented to allow for additional paths through the lattice during decoding. Speaker dependent phone thresholds also yielded slightly better performance.
[6]. Bo Li, Khe Chai Sim; Comparison of Discriminative Input and Output Transformations for Speaker Adaptation in the Hybrid NN/HMM Systems; Interspeech 2010.
Speaker variability is one of the major error sources for ASR systems. Speaker adaptation estimates speaker-specific models from the speaker-independent ones to minimize the mismatch between the training and testing conditions arising from speaker variability. One of the commonly adopted approaches is the transformation-based method. In this paper, discriminative input and output transforms for speaker adaptation in hybrid NN/HMM systems are compared and further investigated with both structural and data-driven constraints. Experimental results show that the data-driven constrained discriminative transforms are much more robust for unsupervised adaptation.
[7]. Bo Li, Khe Chai Sim; Hidden Logistic Linear Regression for Support Vector Machine based Phone Verification; Interspeech 2010.
The phone verification approach to mispronunciation detection, using a combination of a Neural Network (NN) and a Support Vector Machine (SVM), has been shown to yield improved verification performance. This approach uses an NN to predict the HMM state posterior probabilities. The average posterior probability vectors computed over each phone segment are used as input features to an SVM back-end to generate the final verification scores. In this paper, a novel Hidden Logistic Feature (HLF) for the SVM back-end is proposed, in which the sigmoid activations of the hidden layer, which contain rich information from the NN, are used instead of the output layer. The generation of HLFs can be interpreted as a Hidden Logistic Linear Regression process. Experiments on the TIMIT database show that the proposed HLF gives the lowest Equal Error Rate of 3.63%.
- What are the goals of your project?
1) To automatically generate edit distance grammars for pronunciation evaluation. Language learners tend to make similar pronunciation mistakes, especially learners from the same region or sharing the same mother tongue. Identifying these mispronunciation patterns would greatly reduce the search space and improve evaluation efficiency while maintaining evaluation accuracy. This task involves automatically learning mispronunciation patterns from native and non-native speech data, generating grammars from the mined patterns, and testing the resulting recognition grammar. The grammar will finally be represented as a phoneme network/lattice. This would be done in Python using Sphinx3 and would probably take 3-5 weeks.
2) To build an exemplar pronunciation database for pronunciation evaluation. As mentioned in the first task, efficient and accurate pronunciation evaluation requires non-native speech data for mispronunciation pattern mining. In this task, we need to recruit people to visit a website and record their pronunciation of phrases. We then need to post-process the recorded speech samples and build a database for future system development. The recording website will be provided by the mentor. I will invite my friends to contribute their speech to this database and post-process the data with automatic approaches, such as outlier analysis for rejecting obviously bad pronunciations and speech detection for cropping the signal. This would be an ongoing task taking 4-6 hours per week.
3) To implement an Android interface to a pronunciation evaluation system. To make our pronunciation evaluation system accessible to more users, we plan to build an Android app for it. This would be a Java task: taking an existing web-based pronunciation evaluation system and building an Android interface to it. I will implement audio recording and playback on the Android platform and the client-server interaction for transferring recorded speech signals and evaluation results. Simple HTTP-based client-server communication will be adopted for this task. This would take 2-4 weeks.
4) To develop a game front end for a pronunciation evaluation system. An effective way to attract users is to present the product as a game, and we will explore this idea to make our system more popular. This task covers the design and implementation of a simple game front end on the web and/or Android that makes pronunciation practice tasks more attractive to students. This task is much more challenging than the others. The final game mechanism and implementation will be discussed further with the mentor. Generally speaking, I will design the basic gameplay functionality and the interfaces. Client-server communication similar to that of the Android app will be adopted here. This would take the remainder of the time, with the level of polish depending on the time available.
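As an illustration of the speech-detection step mentioned in goal 2 (cropping silence from recorded samples), here is a minimal energy-based sketch. It is not the planned implementation: the function name crop_speech, the frame length of 160 samples (10 ms at an assumed 16 kHz sampling rate) and the relative threshold are all hypothetical choices.

```python
def crop_speech(samples, frame_len=160, threshold_ratio=0.1):
    """Trim leading and trailing low-energy frames from a list of samples."""
    # Split the signal into consecutive, non-overlapping frames.
    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
    # Mean squared amplitude of each frame.
    energies = [sum(s * s for s in frame) / len(frame) for frame in frames]
    if not energies or max(energies) == 0:
        return []  # empty or completely silent recording
    threshold = threshold_ratio * max(energies)
    active = [k for k, energy in enumerate(energies) if energy >= threshold]
    if not active:
        return []  # only possible if threshold_ratio is greater than 1
    start = active[0] * frame_len
    end = min((active[-1] + 1) * frame_len, len(samples))
    return samples[start:end]


# Toy signal: silence, a constant "speech" burst, then silence again.
signal = [0] * 320 + [1000] * 480 + [0] * 160
cropped = crop_speech(signal)
print(len(cropped))  # 480
```

Recordings that come back empty, or much shorter than expected for the prompted phrase, could then be flagged for the outlier analysis.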
- What is the measure of success for each goal?
1) The automatically extracted edit distance grammars will be evaluated in the existing pronunciation evaluation system, which will be provided by the mentor. The evaluation performance will be compared with the standard approach, and based on the comparison I will try to increase the gain from using our automatically generated grammars.
2) The exemplar pronunciation data collection will be measured by the amount of data collected (number of speakers and amount of speech per speaker) and by the quality of the data (whether adding the data could improve our automatically generated grammars). In this project, we will mainly use the amount of data for evaluation; quality will simply be measured by the amount of usable data (i.e. after outliers and very bad pronunciations are discarded).
3) The Android evaluation interface will be judged on the usability of the app, which includes correct word/sentence display, speech recording, client-server communication and feedback display.
4) In this project, the evaluation of the game will focus mainly on usability, as for the Android app. Measuring attractiveness would require long-term market testing that is far beyond the scope of this project, and I will discuss it further with the mentor.
- Milestones (at least 3)
1) Get familiar with Sphinx3 and set up the baseline of the existing pronunciation evaluation system to be provided by the mentor.
2) Get both the native and non-native speech data from the mentor and extract MFCC features for recognition.
3) Automatically decode the speech data with acoustic model and language model provided by the mentor.
4) Extract mispronunciation patterns in those data.
5) Automatically construct grammars to include both the correct pronunciation and possible mispronunciation patterns.
6) Test the generated grammars in the baseline evaluation system and compare the performance.
7) Analyze the results; if necessary, repeat 3-6 to improve the evaluation performance until a better grammar is learnt.
8) Collect information from the mentor about the exemplar pronunciation data collection website and process; if needed, help set up the website and promote the data collection among my friends.
9) Implement automatic post-processing programs for the recorded data.
10) Monitor the post-processing and occasionally verify the results manually if necessary.
11) Finalize the functionality design for the Android app.
12) Finalize the interface design for the Android app.
13) Implement the audio recording and playback on Android platform.
14) Implement the HTTP based client-server file communication.
15) Test the Android app and fix bugs.
16) Discuss with the mentor about the game mechanism.
17) Finalize the gameplay logic.
18) Finalize the functionality design for the game.
19) Finalize the interface design for the game.
20) Implement and test the game.
21) Prepare a project report.
- What is your planning schedule for completing these goals? (preliminary, for further discussion)
1) Now to May 21: milestones 1 and 2, if the application is accepted.
2) May 22 to May 28: milestones 3 and 4
3) May 29 to June 4: milestones 5 and 6
4) June 5 to June 11: milestone 7; if needed, milestone 7 may last longer.
5) June 12 to June 18: milestones 8 and 9
6) June 19 to project end: milestone 10
7) June 19 to June 25: milestones 11 and 12
8) June 26 to July 2: milestone 13
9) July 3 to July 9: milestone 14
10) July 10 to July 16: milestone 15; if needed, this may also last longer.
11) July 17 to July 23: milestones 16 and 17
12) July 24 to July 30: milestone 18
13) July 31 to August 6: milestone 19
14) August 7 to (to be decided): milestone 20; the development time may vary depending on the complexity of the game logic and functionality.
15) August 1: start preparing the project report (milestone 21).
- What are your plans after the project
I am willing to continue contributing to the CMU Sphinx open source project.
- We expect you to work on the project 30 hours per week. Are you ready for that?
Yes!
- Do you have any commitments during the summer?
No.
- Exams or other events you expect to have to deal with during the GSOC period
No. I have cleared all my modules.
- How you plan to juggle the competing demands on your time
My time schedule is usually quite flexible; I can thus guarantee sufficient time for the work.
- Note that we require a minimum of weekly contact from all our students, unless forewarned
No problem.
- We expect you to blog about project success each week. Are you ready for that?
I would love to share my progress with others through blog posts.
- Will you have Internet access during the summer
Definitely.
- Programming languages you have learnt, and how many lines of code, approximately, you have written in each
C/C++: used since 2005. It is the main programming language for most of my course projects; during my Ph.D. study, C/C++ has mainly been used to modify toolkits such as HTK and Kaldi for testing ideas.
Java: learned and used during 2006~2008, mainly for the course projects of the Object-Oriented Language module.
Python: used since 2008.
Shell: used since 2006. Together with Python, it is one of the two main scripting languages I use to drive the tools written in C/C++.
Objective-C: used since 2010 to implement small demo systems for iOS to showcase research ideas.
CUDA: used since 2010 to speed up Neural Network training on GPUs.
- Have you ever been involved in scientific research? Do you read scientific papers?
Yes.
- Describe your math experience
I have a good math background in probability, statistics, calculus, linear algebra, and signal processing.
- Describe your machine learning experience
I have a good understanding of Hidden Markov Models, Gaussian Mixture Models, Subspace Gaussian Mixture Models, Neural Networks, Support Vector Machines, Deep Belief Networks, Restricted Boltzmann Machines, and so on.
- Have you succeeded in building pocketsphinx from subversion
Yes.
- If not, did you report a bug or request support on our mailing list?
- Provide a link to the log of pocketsphinx speech recognition session on your computer (THIS IS A STRONG REQUIREMENT)
The configuration log: http://www.comp.nus.edu.sg/~li-bo/files/config.log
The compilation log: http://www.comp.nus.edu.sg/~li-bo/files/log_make
Open source development experience
- Is this your first contact with the CMUSphinx project?
Yes.
- List or link to any code, patches, or bug reports contributed to other projects
No.
- List or link to any code, patches, bug reports contributed to the CMUSphinx project
No.
- Why CMUSphinx
It is an open source project for speech recognition, and my main research interest is automatic speech recognition.
