Friday, April 5, 2019
Speaker Independent Speech Recognizer Development
Chapter 4
Methodology and Implementation

This chapter describes the methodology and implementation of the speaker-independent speech recognizer for Sinhala words and the Android mobile application for voice dialing. The research has two main phases. The first is to build the speaker-independent Sinhala speech recognizer to recognize digits spoken in the Sinhala language. The second is to build an Android application by integrating the trained speech recognizer. This chapter covers the tools, algorithms, theoretical aspects, models and file structures used throughout the research process.

4.1 Research phase 1: Build the speaker-independent Sinhala speech recognizer for recognizing the digits

In this section the development of the speaker-independent Sinhala speech recognizer is described step by step. It includes the phonetic dictionary, language model, grammar file, acoustic speech database and the creation of the trained acoustic model.

4.1.1 Data preparation

This system is a Sinhala speech recognition voice dialer. Since no previously built speech database of this kind was available, the speech had to be collected from scratch to develop the system.

Data collection

The first stage of any speech recognizer is the collection of sound signals. The database should contain recordings from a sufficient number of speakers, and its size depends on the task at hand. For this application only a small number of words was considered. This research targets only the written Sinhala vocabulary that can be applied to voice dialing. Altogether twelve words were considered: the ten digits and two initial calling words, amatanna and katakaranna. The database has two parts, the training part and the testing part.
Usually about one tenth of the full speech data is used for the testing part. In this research 3000 speech samples were used for training and 150 speech samples were used for testing.

Speech database

Before collecting data, a speech database was set up. It consists of Sinhala speech samples taken from a variety of people of different ages. Since no database relevant to voice dialing had been published for the Sinhala language, speech had to be collected from native Sinhala speakers.

Prompt sheet

To create the speech database, the first step was to prepare the prompt sheet containing a list of sentences for all the recordings. It contains 100 sentences that differ from one another, produced by generating the numbers randomly. Fifty sentences start with the word amatanna and the other fifty start with the word katakaranna. The prompt sheet used for this research is given in Appendix A.

Recording

The sentences in the prompt sheet were recorded by thirty (30) native speakers, since this is a speaker-independent application. The speakers were selected by age and divided into eight age groups. Four people, two females and two males, were selected from each group except one, which contained only two people, one female and one male. Each speaker was given 100 sentences to speak, and altogether 3000 speech samples were recorded for training. Details of the speakers, such as gender and age, can be found in Appendix A. If a recording contained an error due to background noise or filler sounds, the speaker was asked to repeat it until a correct sound signal was obtained. Since the proposed system is a discrete system, the speakers had to pause briefly at the start and end of each recording and also between words.
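The prompt sheet described above (100 distinct sentences, half per calling word, with randomly generated numbers) could be produced along the following lines. This is only a sketch under stated assumptions: the digit tokens are hypothetical placeholders for the ten Sinhala digit words listed in Appendix A, and the length of ten digits per sentence, the seed and the function name are illustrative choices, not values taken from this research.

```python
import random

CALL_WORDS = ("amatanna", "katakaranna")
# Hypothetical placeholder tokens; the real sheet uses the ten Sinhala digit words.
DIGIT_WORDS = tuple(f"digit{n}" for n in range(10))


def make_prompt_sheet(per_call_word=50, digits_per_sentence=10, seed=2019):
    """Return unique prompt sentences, per_call_word of them for each calling word."""
    rng = random.Random(seed)  # fixed seed (assumed) so the sheet is reproducible
    sheet = []
    for call_word in CALL_WORDS:
        seen = set()
        while len(seen) < per_call_word:
            digits = tuple(rng.choice(DIGIT_WORDS) for _ in range(digits_per_sentence))
            if digits in seen:
                continue  # skip repeats so every sentence differs from the others
            seen.add(digits)
            sheet.append(" ".join((call_word,) + digits))
    return sheet
```

Calling make_prompt_sheet() with its defaults returns 100 sentences, 50 for each calling word, which can then be numbered and printed for the speakers.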
Speech was recorded in a quiet room, and the recordings were made at night using a condenser recording microphone. The sounds were recorded at a sampling rate of 44.1 kHz using a mono channel and saved in *.wav format.

Sampling frequency and format of speech audio files

Speech recording files were saved in the MS WAV file format. The Praat software was used to convert the 44.1 kHz signals to 16 kHz, since the training samples must have a sampling frequency of 16 kHz. Audio files were recorded with an average length of 11 seconds. Since there should be silence at the beginning and the end of each utterance, and it should not exceed 0.2 seconds, the Praat software was used to edit all 3000 sound signals.

4.1.2 Pronunciation dictionary

The pronunciation dictionary was created by hand, since the number of words used for the voice dialing system is very small. Only 12 words from the Sinhala vocabulary are used. To create the dictionary, the International Phonetic Alphabet for the Sinhala language and dictionaries previously created by CMU Sphinx were used. The acoustic phones, however, were taken mostly by studying the different types of databases provided through Carnegie Mellon University's Sphinx Forum (CMU Sphinx Forum).

Two dictionaries were implemented for this system, one for the speech utterances and the other for filler sounds. The filler sounds comprise the silences at the beginning, in the middle and at the end of the speech utterances. Both dictionaries can be found in Appendix A. They are referred to as the language dictionary and the filler dictionary.

4.1.3 Creating the grammar file

The grammar file was also created by hand, since the number of words used for the system is very small. The JSGF (JSpeech Grammar Format) was used to implement the grammar file.
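To illustrate what a JSGF grammar for this task might look like, the sketch below assembles one from word lists. It is not the grammar used in this research: the rule names (dial, call, digit), the rule structure (one calling word followed by one or more digits) and the digit tokens are assumptions made for illustration, with placeholders standing in for the ten Sinhala digit words.

```python
CALL_WORDS = ("amatanna", "katakaranna")
# Hypothetical placeholders standing in for the ten Sinhala digit words.
DIGIT_WORDS = tuple(f"digit{n}" for n in range(10))


def build_jsgf(grammar_name, call_words, digit_words):
    """Return JSGF text accepting one calling word followed by one or more digits."""
    lines = [
        "#JSGF V1.0;",                            # mandatory JSGF header
        f"grammar {grammar_name};",               # grammar declaration
        "public <dial> = <call> <digit>+;",       # assumed top-level rule
        f"<call> = {' | '.join(call_words)};",    # alternatives for the calling word
        f"<digit> = {' | '.join(digit_words)};",  # alternatives for a single digit
    ]
    return "\n".join(lines) + "\n"
```

For example, build_jsgf("digits", CALL_WORDS, DIGIT_WORDS) returns text that could be saved to a .gram file once the placeholders are replaced with the real digit words.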
The grammar file can be found in Appendix A.

4.1.4 Building the language model

Word search is restricted by a language model. It identifies matching words by comparing against the words previously recognized by the model, and restricts the matching process by eliminating words that are not possible. N-gram language models are the most common language models in use today. An N-gram model is a finite-state language model that contains statistics of word sequences. When the search space is restricted, a good accuracy rate can be obtained if the language model is a successful one, that is, one that can predict the next word correctly. It usually restricts the word search to words that are included in the vocabulary.

The language model was built using the cmuclmtk software. First, the reference text was created; that text (svd.text) can be found in Appendix A. It was written in a specific format in which the speech sentences are delimited by <s> and </s> tags.

Then the vocabulary file was generated with the following command.

    text2wfreq < svd.text | wfreq2vocab > svd.vocab

The generated vocabulary file was then edited to remove unwanted words (numbers and misspellings). When misspellings were found, they were fixed in the input reference text. The generated vocabulary file (svd.vocab) can be found in Appendix A.

Then the ARPA format language model was generated using these commands.

    text2idngram -vocab svd.vocab -idngram svd.idngram < svd.text
    idngram2lm -vocab_type 0 -idngram svd.idngram -vocab svd.vocab -arpa svd.arpa

Finally the CMU binary form of the language model (DMP file) was generated using the following command.

    sphinx_lm_convert -i svd.arpa -o svd.lm.DMP

The final output containing the language model needed for the training process is the svd.lm.DMP file. This is a binary file.

4.1.5 Acoustic model

Before starting the acoustic model creation, the following file structure was arranged as described in the CMU Sphinx toolkit guide. The name of the speech database is svd (Sinhala Voice Dial).
The content of these files is given in Appendix A.

svd.dic - Phonetic dictionary
svd.phone - Phoneset file
svd.lm.DMP - Language model
svd.filler - List of fillers
svd_train.fileids - List of files for training
svd_train.transcription - Transcription for training
svd_test.fileids - List of files for testing
svd_test.transcription - Transcription for testing

All these files were placed in one directory named etc. The wav speech samples were placed in another directory named wav. These two directories were placed in a further directory named after the database (svd). Before starting the training process, there must be one more directory that contains svd together with the required packages, the pocketsphinx, sphinxbase and sphinxtrain directories. All the packages and the svd directory were put into this directory, and the training process was started from there.

Setting up the training scripts

The command prompt terminal is used to run the scripts of the training process. Before starting, the terminal's working directory was changed to the database directory svd and the following command was run.

    python ../sphinxtrain/scripts/sphinxtrain -t svd setup

This command copied all the required configuration files into the etc subdirectory of the database directory and prepared the database for training. The two configuration files created were feat.params and sphinx_train.cfg. These two are given in Appendix A.

Set up the database

These values were filled in at configuration time.
The experiment name is used to name the model files and log files in the database.

    $CFG_DB_NAME = "svd";
    $CFG_EXPTNAME = "$CFG_DB_NAME";

Set up the format of database audio

Since the database contains speech utterances in wav format recorded as MS WAV, the extension and the type were set to wav and mswav respectively.

    $CFG_WAVFILES_DIR = "$CFG_BASE_DIR/wav";
    $CFG_WAVFILE_EXTENSION = 'wav';
    $CFG_WAVFILE_TYPE = 'mswav'; # one of nist, mswav, raw

Configure path to files

These paths are resolved automatically when the right file structure is present in the working directory. The files must be named exactly as expected. The paths are assigned to the variables used in the main training of the models.

    $CFG_DICTIONARY = "$CFG_LIST_DIR/$CFG_DB_NAME.dic";
    $CFG_RAWPHONEFILE = "$CFG_LIST_DIR/$CFG_DB_NAME.phone";
    $CFG_FILLERDICT = "$CFG_LIST_DIR/$CFG_DB_NAME.filler";
    $CFG_LISTOFFILES = "$CFG_LIST_DIR/${CFG_DB_NAME}_train.fileids";
    $CFG_TRANSCRIPTFILE = "$CFG_LIST_DIR/${CFG_DB_NAME}_train.transcription";
    $CFG_FEATPARAMS = "$CFG_LIST_DIR/feat.params";

Configure model type and model parameters

The continuous and semi-continuous model types can be used in PocketSphinx. The continuous type is used for continuous speech recognition, and the semi-continuous type is used for discrete speech recognition. Since this application uses discrete speech, semi-continuous model training was used.

    $CFG_HMM_TYPE = '.cont.'; # Sphinx 4, PocketSphinx
    $CFG_HMM_TYPE = '.semi.'; # PocketSphinx
    $CFG_FINAL_NUM_DENSITIES = 8;
    # Number of tied states (senones) to create in decision-tree clustering
    $CFG_N_TIED_STATES = 1000;

The last value indicates the number of senones used to train the model. Sounds can be discriminated more accurately when the number of senones is higher. But if too many senones are used, the model may not be able to recognize unseen sounds.
As a result, the word error rate can be much higher on unseen sounds.

The approximate number of senones and number of densities is provided in the table below.

Configure sound feature parameters

The default for sound files in Sphinx is a rate of 16 thousand samples per second (16 kHz). If this is the case, the etc/feat.params file is generated automatically with the recommended values. The recommended values are:

    # Feature extraction parameters
    $CFG_WAVFILE_SRATE = 16000.0;
    $CFG_NUM_FILT = 40; # For wideband speech it's 40, for telephone 8khz reasonable value is 31
    $CFG_LO_FILT = 133.3334; # For telephone 8kHz speech value is 200
    $CFG_HI_FILT = 6855.4976; # For telephone 8kHz speech value is 3500

Configure decoding parameters

The following were properly configured in etc/sphinx_train.cfg.

    $DEC_CFG_DICTIONARY = "$DEC_CFG_BASE_DIR/etc/$DEC_CFG_DB_NAME.dic";
    $DEC_CFG_FILLERDICT = "$DEC_CFG_BASE_DIR/etc/$DEC_CFG_DB_NAME.filler";
    $DEC_CFG_LISTOFFILES = "$DEC_CFG_BASE_DIR/etc/${DEC_CFG_DB_NAME}_test.fileids";
    $DEC_CFG_TRANSCRIPTFILE = "$DEC_CFG_BASE_DIR/etc/${DEC_CFG_DB_NAME}_test.transcription";
    $DEC_CFG_RESULT_DIR = "$DEC_CFG_BASE_DIR/result";
    # These variables, used by the decoder, have to be user defined, and
    # may affect the decoder output
    $DEC_CFG_LANGUAGEMODEL_DIR = "$DEC_CFG_BASE_DIR/etc";
    $DEC_CFG_LANGUAGEMODEL = "$DEC_CFG_LANGUAGEMODEL_DIR/$CFG_DB_NAME.lm.DMP";

Training

After all these paths and parameters had been set in the configuration file as described above, training proceeded. To start the training process the following command was run.

    python ../sphinxtrain/scripts/sphinxtrain run

The scripts launched jobs on the machine, and they took a few minutes to run.

Acoustic Model

After the training process, the acoustic model was located in the following path in the database directory.

    model_parameters/svd.cd_semi_200

Only this folder is needed for the speech recognition tasks.

4.1.6 Testing results

150 speech samples were used as testing data. The alignment results could be obtained after the training process. They were located at the following path in the database directory.

    result/svd.align

4.1.7 Parameters to be optimized

Word error rate

The WER was given as a percentage value. It was calculated according to the following equation, where S, D and I are the numbers of substituted, deleted and inserted words and N is the number of words in the reference transcription.

$$\mathrm{WER} = \frac{S + D + I}{N} \times 100\%$$

Accuracy

Accuracy was also given as a percentage. It is the complement of the WER and was calculated using the following equation.

$$\mathrm{Accuracy} = 100\% - \mathrm{WER}$$

To obtain an optimal recognition system, the WER should be minimized and the accuracy maximized. The parameters of the configuration file were changed from time to time until an optimal recognition system was obtained, one where the WER was at its minimum with a high accuracy rate.

4.2 Research phase 2: Build the voice dialing mobile application

In this section, the implementation of the voice dialer Android mobile application is described. The application was developed in the Java programming language using the Eclipse IDE. It was tested on both the emulator and an actual device. The application is able to recognize digits spoken by any speaker and dial the recognized number. This requires the trained acoustic model, the pronunciation dictionary, the language model and the grammar files. Speech recognition is performed with these models on the mobile device itself using the pocketsphinx library, a library written in C for embedded speech recognition, used here on the Android platform.

The step-by-step implementation and integration of the necessary components are discussed in detail in this section.

Resource Files

The resource files for the Android application were added to the assets/ directory of the project.
Then the physical path was given to make them available to pocketsphinx. After adding them, the assets directory contained the following resource files.

Dictionary
svd.dic
svd.dic.md5

Grammar
digits.gram
digits.gram.md5
menu.gram
menu.gram.md5

Language model
svd.lm.DMP
svd.lm.DMP.md5

Acoustic Model
feat.params
feat.params.md5
mdef
mdef.md5
means
means.md5
mixture_weights
mixture_weights.md5
noisedict
noisedict.md5
transition_matrices
transition_matrices.md5
variances
variances.md5

Assets.lst
models/dict/svd.dic
models/grammar/digits.gram
models/grammar/menu.gram
models/hmm/en-us-semi/feat.params
models/hmm/en-us-semi/mdef
models/hmm/en-us-semi/means
models/hmm/en-us-semi/mixture_weights
models/hmm/en-us-semi/noisedict
models/hmm/en-us-semi/sendump
models/hmm/en-us-semi/transition_matrices
models/hmm/en-us-semi/variances
models/lm/svd.lm.DMP

Setup the Recognizer

First of all, the recognizer must be set up by adding the resource files. The model parameters obtained from the training process were added as the HMM in the application. The recognition process depends mainly on these resource files. Since the grammar files and the language model were added as assets, they can be used in the application's recognition process alongside the HMM. Utterances can be recognized using either the grammar files or the language model. The whole process is coded in the Java programming language.

4.3 Architecture of the developed Speech Recognition System