Research on Working of Speech Recognition

Added on - 09 Oct 2019

Chapter 1: Introduction

1.1 Motivation

In recent years the usage of computers has grown tremendously, and it greatly influences human life. At present the keyboard and mouse are used as input devices, but these devices have limitations. Many researchers are therefore trying to develop methods by which information can be exchanged with computers more easily. If speech is used for exchanging information with computers, all of these limitations can be overcome.

The ability of a machine to identify words or phrases of a spoken language and convert them into a machine-readable format, so that speech can be given as input to the machine, is called speech recognition. With the use of advanced speech recognition techniques it is possible to issue voice commands to a computer and to understand human languages.

A reliable algorithm for the detection of voiced and non-voiced segments of a speech signal is a crucial pre-processing step in many speech processing systems and is essential for speech recognition. In addition, many applications, such as the diagnosis of pathological disorders and speech enhancement [1, 2], require extracting information from the voiced segments of speech signals in undesired noisy environments. The discrimination of a speech signal into voiced and non-voiced segments is therefore very important, and many researchers have worked on it during the last three decades. After years of research and development, the accuracy of Automatic Speech Recognition (ASR) still remains one of the important research challenges.
The design of recognition systems requires careful attention to issues such as speech representation, pre-processing and feature extraction.

1.2 Research Objectives

The main goal of this work is to develop an algorithm based on advanced signal processing techniques for the automatic detection of voiced and non-voiced segments in a given speech signal, with which the accuracy of a speech recognition system can be increased. This research addresses the following objectives:
Researchers across the world have been working on speech recognition for many years. The first objective of this thesis is to provide a comprehensive survey of research on the classification of speech into voiced and non-voiced segments.

A speech signal is highly variable in nature: when examined over a short duration of time it shows similar characteristics, but when examined over a long duration it shows significant changes. For this reason, rather than working on the whole speech signal, the signal is divided into frames. The second objective is therefore to select a suitable windowing technique to divide the speech into segments.

Any speech recognition system involves the extraction of features from the speech signal. A speech signal has several parameters, and these parameters change to different degrees in voiced and non-voiced segments, so they can be used to discriminate the voiced and unvoiced regions of the speech. The choice of features is a critical design decision in any classification problem. The third objective is to select a signal feature robust enough to make the voiced/non-voiced classification more accurate.

The most fundamental and difficult problem encountered in a speech recognition system is the classification of the speech signal into voiced and non-voiced segments; the accuracy of the speech recognition system can be increased to a great extent if the accuracy of this classification is increased. The fourth objective is therefore to classify the speech signal into voiced and non-voiced segments as accurately as possible.
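The framing-and-windowing step behind the second objective can be sketched briefly. The snippet below is an illustrative sketch, not the implementation used in this thesis; the 25 ms frame / 10 ms hop at a 16 kHz sampling rate and the Hamming window are common choices assumed here purely for demonstration.

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Split a signal into overlapping frames and apply a Hamming window.

    At 16 kHz, frame_len=400 and hop=160 correspond to the common
    25 ms frame / 10 ms hop choice (illustrative values only).
    """
    n_frames = 1 + (len(x) - frame_len) // hop
    window = np.hamming(frame_len)  # tapers frame edges to reduce spectral leakage
    return np.stack([x[i * hop : i * hop + frame_len] * window
                     for i in range(n_frames)])

# One second of a dummy signal at 16 kHz yields 98 frames of 400 samples each
x = np.random.randn(16000)
frames = frame_signal(x)
```

Overlapping frames ensure that samples attenuated at the edge of one window appear near the centre of a neighbouring one.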
The fifth objective is to develop an algorithm with maximum discriminative power and noise robustness.

1.3 Thesis Outline

The proposed study is organized in seven chapters, beginning with this introductory chapter (Chapter 1). A brief description of the contents of the remaining chapters is as follows:

Chapter 2: This chapter contains a comprehensive survey of research on the classification of speech into voiced and non-voiced segments. From this literature review, conclusions are drawn which lead to the presented thesis.

Chapter 3: This chapter provides general background on the speech signal and explains the necessity of pre-processing the recorded speech signal.
Chapter 4: The process of classifying a speech signal into voiced and non-voiced segments involves two steps. In the first step, features of the recorded speech signal are extracted. The extraction of these features from the speech signal is explained in this chapter.

Chapter 5: By analysing the changes in the features it is possible to distinguish the voiced and non-voiced segments of speech. Algorithms related to this process are explained in this chapter.

Chapter 6: Graphs for different speech samples are derived from the calculation of the features. This chapter discusses the results and explains how they are useful for identification of the voiced and non-voiced segments. The overall accuracy of the algorithm is also discussed.

Chapter 7: This chapter gives the conclusions of this research work and presents an outlook for the future research required in the field of speech recognition.
Chapter 2: Literature Review

Work on speech recognition has been going on for the past 50-60 years. Many advances have been made, and many more are needed to improve the quality of recognition and to make systems independent of speaker, voice quality and noise. The first step in any automatic voice activity detection system is the extraction of acoustic features from the speech signal. Various approaches have been developed to define and extract signal features in different domains for classifying speech into voiced and unvoiced regions.

Energy is a simple measure of the loudness of the signal. Energy-based methods assume that speech is always louder than the background noise, and then assign high-energy frames to speech and low-energy frames to noise [3]. However, when the loudness of speech and noise are at similar levels, for example due to increasing background noise in the environment or during soft speech segments, the simple energy feature fails to discriminate speech from noise.

One way to improve noise robustness is to combine energy-based features with other features, such as the zero-crossing rate (ZCR) [4] or the line spectral frequencies (LSF). Generally, these features work well with clean speech or under high-SNR conditions. However, under high noise levels, such as when the SNR falls below 10 dB, their discriminative power drops drastically.

To increase the discriminative power of the features under noisy conditions, many approaches consider the relative power of speech over the estimated noise across the different frequency bands of the spectrum.
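The short-time energy and zero-crossing rate features discussed above can be sketched in a few lines. This is a minimal illustration using assumed toy signals (a 120 Hz tone standing in for voiced speech and quiet white noise standing in for background), not a complete voice activity detector.

```python
import numpy as np

def short_time_energy(frame):
    """Sum of squared amplitudes; tends to be high for loud (voiced) frames."""
    return float(np.sum(np.asarray(frame, dtype=float) ** 2))

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs with differing signs;
    tends to be high for noise-like (unvoiced) frames."""
    signs = np.sign(frame)
    return float(np.mean(signs[:-1] != signs[1:]))

fs = 8000
t = np.arange(400) / fs
voiced_like = np.sin(2 * np.pi * 120 * t)    # low-frequency periodic tone
unvoiced_like = 0.1 * np.random.randn(400)   # quiet, noise-like signal

# Energy separates the two by loudness; ZCR separates them by spectral character
assert short_time_energy(voiced_like) > short_time_energy(unvoiced_like)
assert zero_crossing_rate(unvoiced_like) > zero_crossing_rate(voiced_like)
```

A pure energy threshold fails exactly when the two energies become comparable, which is why the literature pairs it with the ZCR or spectral features.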
A number of earlier studies focused on the auto-correlation of the signal to search for self-repeating components [5]. One such feature is the Maximum Autocorrelation Peak, which finds the magnitude of the maximum peak within the range of lags corresponding to the range of pitch of male and female voices (50 Hz-400 Hz). Another measure is the Autocorrelation Peak Count [6], which counts the number of peaks found in a range of lags. In environments that contain repetitive noise, such as motor and car noise, auto-correlation-based features fail because the loudest
frequency component is usually the noise itself. This motivates more noise-robust techniques.

Most of the techniques mentioned above suffer in environments containing cross-talk speech, such as babble noise or overlapped speaking. The interference of other speakers causes the spectral frames of voiced speech to contain possibly more than one fundamental frequency, which confuses the estimator. Until now, most algorithms have used either the zero-crossing rate or energy for the classification, and their results were not greater than 65%.

Some researchers classify speech samples into voiced and non-voiced segments using neural networks. Qi and Hunt [7] used non-parametric methods to classify voiced and unvoiced speech, employing a multilayer feed-forward network for the classification of speech samples. The network was compared with a maximum-likelihood classifier; classification accuracy of up to 96% was obtained, and from the results it was observed that the size of the training set does not affect network performance. Algabri et al. [8] proposed a system for automatic classification of speech in which a fuzzy logic controller is used to distinguish three categories of speech (silence, voiced and unvoiced); the experiments were performed on the Arabic KSU database in the MATLAB environment. Speech recognition using the zero-crossing feature is presented by Aye and Yin [9]. The zero-crossing features are extracted during speaking in the training phase and stored in a database; using the same technique, the features of the test data are extracted and compared with the templates in the database during the recognition phase.

Pattern recognition techniques and acoustical features have also been used to separate speech samples into voiced and unvoiced. Ahmadi and Spanias [10] explained two such methods using pattern recognition and acoustical features.
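The Maximum Autocorrelation Peak feature described earlier in this chapter can be sketched as follows. The 50-400 Hz pitch range comes from the text; the signals, sampling rate and all other values are illustrative assumptions.

```python
import numpy as np

def max_autocorr_peak(frame, fs, f_lo=50.0, f_hi=400.0):
    """Largest normalized autocorrelation value within the lag range
    corresponding to a pitch between f_lo and f_hi."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    if ac[0] <= 0:                      # silent frame: no energy at all
        return 0.0
    ac = ac / ac[0]                     # normalize so the lag-0 value equals 1
    lag_lo = int(fs / f_hi)             # short lags correspond to high pitch
    lag_hi = min(int(fs / f_lo), len(ac) - 1)
    return float(ac[lag_lo : lag_hi + 1].max())

fs = 8000
t = np.arange(800) / fs
voiced = np.sin(2 * np.pi * 150 * t)                   # periodic: strong peak near lag fs/150
noise = np.random.default_rng(0).standard_normal(800)  # aperiodic: only weak peaks
assert max_autocorr_peak(voiced, fs) > 0.8
assert max_autocorr_peak(voiced, fs) > max_autocorr_peak(noise, fs)
```

As the text observes, a repetitive noise source such as an engine would itself produce a strong peak in this lag range, which is the weakness of purely autocorrelation-based features.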
The first method uses Mel-frequency cepstral coefficients with a Gaussian mixture model classifier, and its accuracy is approximately 90%. The second method is based on reduced-dimensional LPC residuals and LPC coefficients with a Gaussian mixture model classifier, and gives classification accuracy of up to 92%. The performances of the two methods were compared at different noise levels, and the optimum condition for training was obtained.
Using neural networks, the accuracy of the results can be increased, but a great deal of training is required. Because neural networks have the limitation of needing large amounts of training data, a novel voiced-unvoiced-silence classification based on unsupervised learning was also proposed in [11]. The class-dependent statistics (feature means, covariance matrices, and occurrence frequencies of the voiced, unvoiced and silence classes) needed for the classification were estimated directly from the signal to be classified, via Gaussian mixture models and the expectation-maximization algorithm. The classification was evaluated and the results were encouraging, with voiced-unvoiced-silence classification accuracy greater than 91.15%. However, these classification- and distribution-based methods rely heavily on the distribution of the first few thousand samples of the signal, which are assumed to be part of the noise and which can change with the passage of time. On the other hand, neural networks, where used, gave very good results of more than 92%, but they required a very large amount of training data and worked only for a limited vocabulary. Furthermore, little work has been done for noisy environments, where the accuracy is very low.

Based on this review, it is observed that most existing classification methods select only one feature for discriminating between voiced and unvoiced speech segments. If a single parameter is used, there is a chance that its value overlaps between the categories, so the classification accuracy of such a method is very low, particularly when the speech signal is not recorded in a high-fidelity environment. It is therefore difficult to differentiate between voiced and non-voiced speech segments using a single parameter. This motivates more noise-robust techniques with increased discriminating power.
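The unsupervised EM idea behind [11] can be illustrated on a one-dimensional toy problem. The sketch below is not the method of [11], which estimates full class-dependent statistics over several features; it merely fits a two-component Gaussian mixture to hypothetical frame energies with a hand-written EM loop, assuming well-separated "quiet" and "loud" clusters.

```python
import numpy as np

def em_two_gaussians(x, n_iter=50):
    """Fit a 2-component 1-D Gaussian mixture to x with EM and return each
    point's posterior probability of belonging to the higher-mean component."""
    x = np.asarray(x, dtype=float)
    mu = np.array([x.min(), x.max()])            # crude but effective initialization
    var = np.array([x.var(), x.var()]) + 1e-6
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibilities of each component for each point
        pdf = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        r = pi * pdf
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and variances
        nk = r.sum(axis=0)
        pi = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
    return r[:, int(np.argmax(mu))]              # posterior of the louder class

# Toy data: 200 low-energy "silence/unvoiced" frames and 100 high-energy "voiced" frames
rng = np.random.default_rng(0)
energies = np.concatenate([rng.normal(1.0, 0.3, 200),
                           rng.normal(8.0, 1.0, 100)])
p_voiced = em_two_gaussians(energies)
labels = p_voiced > 0.5
```

Because the labels come from the fitted mixture itself, no hand-labelled training data is needed, which is exactly the appeal of the unsupervised approach.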
The main objective of this dissertation is to develop a generic optimal classification algorithm to improve the accuracy of classifying speech into voiced and non-voiced segments.
Chapter 3: Speech Signal - General Overview

3.1 Introduction

Speech is the primary means of communication between humans, and it is the dominance of this medium that persuades research endeavours to use speech as a viable means of human-computer interaction [12]. The study of speech signals and the different methods used to process them is called speech processing. Speech processing is employed in various applications such as speech synthesis, speech recognition, speech coding and speaker recognition. Among these applications, the most important is speech recognition: the ability of a machine to identify the phrases or words of a spoken language and convert them to a machine-readable format, so that speech can be given as input to the machine.

Speech recognition has been researched for many years and has attracted many researchers across the world. With the rapid evolution of computer hardware and software, speech recognition has gained considerable interest in many fields, such as medical services, voice-activated telephone exchanges, industrial control and banking services, touching every side of society and people's lives.

Detection of voiced and non-voiced segments, silence detection, word boundary detection, the effects of voice quality, and noise removal are the prominent problems in achieving a high degree of accuracy in speech recognition. The categorization of the speech signal into voiced and non-voiced segments is the preliminary step for speech recognition. Speech is produced by the vocal tract and the vocal cords, and it is composed of phonemes. The vocal cords vibrate during the pronunciation of a phoneme, i.e. during the production of speech.

Voiced speech: In this type of speech, vibration of the vocal folds produces periodic or quasi-periodic excitation of the vocal tract [13, 24].
This voiced excitation results in a pulse train of more or less constant frequency, called the fundamental frequency.

Unvoiced speech: In this type of speech, the vocal cords do not vibrate, so the resulting speech is random in nature; non-voiced speech is therefore a noise-like, non-periodic sound. In these cases, usually no fundamental frequency can be detected.
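The presence or absence of a detectable fundamental frequency can be demonstrated with a small autocorrelation-based pitch estimator. This is an illustrative sketch: the 50-400 Hz search range matches the pitch range mentioned in Chapter 2, while the signals, the 0.3 peak threshold and the sampling rate are assumptions chosen for demonstration.

```python
import numpy as np

def estimate_f0(frame, fs, f_lo=50.0, f_hi=400.0, threshold=0.3):
    """Return the fundamental frequency in Hz, or None when the frame shows
    no clear periodicity (unvoiced). The threshold is an illustrative choice."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    if ac[0] <= 0:
        return None
    ac = ac / ac[0]                                   # normalized autocorrelation
    lo, hi = int(fs / f_hi), min(int(fs / f_lo), len(ac) - 1)
    lag = lo + int(np.argmax(ac[lo : hi + 1]))        # best candidate pitch period
    return fs / lag if ac[lag] >= threshold else None

fs = 16000
t = np.arange(1600) / fs
# Voiced-like: 200 Hz fundamental plus one harmonic; unvoiced-like: white noise
voiced = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 400 * t)
unvoiced = np.random.default_rng(1).standard_normal(1600)

f0 = estimate_f0(voiced, fs)
```

For the periodic signal the estimator recovers a fundamental near 200 Hz; for the noise no autocorrelation peak clears the threshold, matching the statement that no fundamental frequency can be detected in unvoiced speech.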
Depending on the surrounding environment of the recording, non-speech can be music, noise, silence or a variety of other acoustic signals such as door knocking, coughing, paper shuffling, or even background speech.

Segmentation of the waveform into well-defined voiced and unvoiced regions is not exact, as the two are difficult to distinguish. In recent years, considerable effort has been spent by researchers on the problem of classifying speech into voiced and unvoiced segments [7-11, 25]. The voiced and non-voiced components of speech are illustrated in Figure 3.1.

Fig. 3.1 Speech waveform showing its different regions (Rabiner, L. and Juang, B., 1993)

The main aim of this classification is to extract only the voiced segments from the recorded speech signal, in order to improve the performance of the speech recognition system. In any speech processing system, the first step is the extraction of features from the speech signal. A speech signal has several parameters, and these parameters change to different degrees in voiced and unvoiced segments; using these parameters, the voiced and unvoiced regions of speech can be detected (Aye and Yin [9]).

Three major steps are involved in the process of classifying a speech signal into voiced and non-voiced segments. The process of speech classification is summarized in Figure 3.2. The steps are: