Frequency of Occurrence of Phonemes in Hindi

Hindi, an Indo-Aryan language, is an official language of India and the state language of several North Indian states such as Madhya Pradesh, Delhi and Uttar Pradesh. Statistics on the phonemes of a language provide useful information in the fields of speech-language pathology, audiology, linguistics and communication engineering. The data can be used to assess and select target phonemes for the treatment of various communication disorders, to develop phonetically balanced word lists for audiological testing, and to teach a foreign language. They also provide valuable information for devising text-to-speech and automatic speech recognition systems. Earlier data on frequently occurring phonemes in numerous Indian languages were derived from written sources. However, information from spoken language may be more relevant than information from written language. The aim of the present study was to determine the frequently occurring phonemes in spoken Hindi. Participants were native speakers of Hindi in the age range of 20 to 70 years. Eighteen group conversation samples were recorded and transcribed using the IPA. Systematic Analysis of Language Transcripts (SALT) software was used to analyse the samples and obtain the frequently occurring phonemes, and descriptive statistics were computed. Results revealed that the phonemes /n, a, e, f, h, k/ were the most frequently occurring phonemes in Hindi. Aspirated phonemes (/gh/, /ʈh/, /ph/, /ɖh/) were the least frequent phonemes in the data. High and front vowels were more frequent in spoken Hindi. Considering the manner of articulation, nasals and stops had higher occurrence. Considering the place of articulation, alveolars dominated. The findings have extensive applications in these fields.


Introduction
Language helps us communicate effectively through speech by delivering and receiving meaningful messages in a structured and conventional way. It includes both spoken and written forms as well as several non-verbal cues. The world's languages consist of sets of spoken or written symbols with a definite number of phonemes. Each language has its own phoneme inventory or phonological system, and also varies with culture, geography and other factors. These variants of a standard form of a language are known as dialects. The standard form of a language, used mainly for official purposes, may differ from its dialectal variants, and the spoken form may vary as well.
Hindi is an Indo-Aryan language spoken in various states of India, namely Madhya Pradesh, Delhi, Uttar Pradesh, Uttarakhand, Bihar, Rajasthan, Chhattisgarh, Haryana, Himachal Pradesh and Jharkhand. A variety of dialects of Hindi are spoken widely across India. Hindi is also spoken in regions where it is not the state language, such as Maharashtra and the North Eastern states of India. It is also the lingua franca of countries such as Fiji (known as Fiji Hindi), Nepal, Bangladesh and Pakistan, and a minority language in Mauritius, Surinam, Guyana, South Africa, and Trinidad and Tobago (Meena, 2015). Modern Standard Hindi is the standardized form of Hindi. Khari boli, Haryanvi and Bagheli are a few of the dialects of the language. Hindi phonology includes 12 vowels and 38 consonants. Among the vowels, [æ] and [ɒ] are borrowed from English. The consonants [f, z, ʃ], despite being loan phonemes, are well established in Modern Standard Hindi (Ohala, 2004). Hindi has an Akshara system, which uses a combination of alphabetic and syllabic systems (Pandey, 2014).
Information on the frequency of occurrence of phonemes is used extensively in areas such as speech-language pathology, audiology, linguistics, and speech engineering. In speech-language pathology and audiology, data on frequently occurring phonemes are used to develop various assessment tools (e.g., PB word lists) and speech therapy materials (e.g., articulation drill materials). Speech engineers can use the information to devise speech recognition and text-to-speech systems, which are used in Augmentative and Alternative Communication for the rehabilitation of individuals with communication disorders (e.g., cerebral palsy and aphasia). It can also be used effectively to teach a foreign language.
Hindi is a widely used language in India, spoken by over three million people (Kachru, 2006). As discussed earlier, it has several variations as well. The spoken form of a language differs from its written form. Earlier studies, such as Ghatage and Madhav (1964), were based on written materials. Moreover, with the recent wide use of English, the spoken form of any language now contains many new modified and borrowed words. The frequency counts of phonemes may therefore differ between written and spoken language. Research on spoken Hindi is limited. Hence, there is a need to create a database of spoken Hindi and to gather information on frequently occurring phonemes in conversational Hindi.

Methods
Participants: A total of 91 native speakers of Hindi (33 males and 58 females) in the age range of 20 to 70 years participated in the study. The participants had been exposed to Hindi and used the language in daily conversation. The data were collected from individuals from the major Hindi-speaking belt: Madhya Pradesh, Delhi, Chhattisgarh, Jharkhand, Uttar Pradesh, and Uttarakhand. Each group recording included a minimum of 4 to 5 participants, and each conversation was recorded for 20 minutes.
Instrumentation: An Olympus (LS 100) digital recorder was used to record the group conversations. The recordings were transcribed using a Toshiba (Satellite C665) laptop and Philips (Shl3095) headphones. Systematic Analysis of Language Transcripts (SALT-Clinical demo version 2012.4.5) was used to carry out the analysis.
Procedure: The selected participants, in groups of 4-5, were asked to sit in a circle, and the digital recorder was placed at the centre, equidistant from each participant. No specific topic was provided for conversation; the participants were encouraged to speak freely on any topic of their interest. The conversations were to be carried out as naturally as possible and in Hindi only, with loan words from English used where necessary. A total of 18 recordings of spoken Hindi were made.
Data analysis: Conversation samples were transcribed using the International Phonetic Alphabet (IPA) conventions for Hindi given by Ohala (1994). Ten percent of each recording sample was selected, transcribed and analysed to test both inter-judge and intra-judge reliability. Cronbach's alpha values of 0.87 and 0.90 were obtained for inter-judge and intra-judge reliability respectively.
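The reliability index reported above can be illustrated with a minimal sketch of Cronbach's alpha. The study does not state which scores entered the calculation, so the sketch assumes, hypothetically, one numeric score per sample from each judge (for example, a phoneme count for the same stretch of recording). The function name `cronbach_alpha` and the example figures are illustrative and are not taken from the study.

```python
from statistics import variance


def cronbach_alpha(ratings):
    """Cronbach's alpha for a table of scores.

    ratings: list of rows, one row per sample; each row holds one score
    per judge (assumed layout, not specified in the study).
    """
    k = len(ratings[0])  # number of judges (columns)
    if k < 2:
        raise ValueError("at least two judges are required")
    # Sample variance of each judge's scores across the samples.
    judge_vars = [variance([row[j] for row in ratings]) for j in range(k)]
    # Sample variance of the per-sample totals (sum over judges).
    total_var = variance([sum(row) for row in ratings])
    return (k / (k - 1)) * (1 - sum(judge_vars) / total_var)


# Hypothetical scores from two judges for four transcribed samples.
example = [[10, 12], [20, 19], [30, 33], [40, 38]]
print(round(cronbach_alpha(example), 3))
```

Values close to 1 indicate that the judges' scores vary together across samples; the same function could be applied to two passes by one judge to mirror the intra-judge case.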
Statistical analysis: The mean percentage of occurrence of each phoneme was determined using descriptive statistics. Subsequently, the Wilcoxon signed-rank test was performed to establish pair-wise significance.
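As an illustration of the descriptive step, the sketch below computes the percentage of occurrence of each phoneme within a recording and then averages those percentages across recordings. It assumes that each transcript has already been split into a list of phoneme symbols, so that an aspirated consonant counts as a single unit. The study used SALT for the counting; the function names and the toy data here are hypothetical.

```python
from collections import Counter


def percent_occurrence(phonemes):
    """Percentage of occurrence of each phoneme in one recording."""
    counts = Counter(phonemes)
    total = sum(counts.values())
    return {p: 100.0 * c / total for p, c in counts.items()}


def mean_percent(recordings):
    """Mean percentage of occurrence of each phoneme across recordings.

    A phoneme absent from a recording contributes 0% for that recording.
    """
    per_recording = [percent_occurrence(r) for r in recordings]
    inventory = set().union(*per_recording)  # every phoneme seen anywhere
    n = len(per_recording)
    return {
        p: sum(d.get(p, 0.0) for d in per_recording) / n
        for p in inventory
    }


# Toy data: two very short "recordings" of already tokenized phonemes.
toy = [["n", "a", "k", "a"], ["n", "e", "n", "a"]]
for phoneme, pct in sorted(mean_percent(toy).items(), key=lambda kv: -kv[1]):
    print(phoneme, pct)
```

Averaging per-recording percentages gives each recording equal weight regardless of its length, which is one reasonable reading of "mean percentage of occurrence"; pooling all phonemes before computing percentages would weight longer recordings more heavily.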

Results and Discussion
The study aimed to identify the frequency of occurrence of phonemes in conversational Hindi from major Hindi-speaking states, namely Madhya Pradesh, Delhi, Chhattisgarh, Jharkhand, Uttar Pradesh, and Uttarakhand. The corpus from the 18 recordings contained a total of 1,48,862 phonemes, and the number of phonemes per recording varied from 6000 to 12000. Figure 1 shows the total number of phonemes in each recording session. The corpus included consonants, vowels and diphthongs. The mean percentage of vowels (54.42%) was higher than that of consonants (44.50%), as depicted in Figure 2. Diphthongs accounted for 1.08% of the total phonemes. Similar results were obtained in spoken English (Denes, 1959; Delattre, 1965), American English, spoken Cantonese, Mandarin and Italian (Thomas, 2005) and written English (2011). The findings are also in consonance with those for several Indian languages: Ghatage & Madhav (1964) in Hindi, Pandit (1965) in written and spoken Gujarati, Ranganatha (1982), Jayaram (1985) and Sreedevi et al. (2012) in Kannada, Vasanthakumari (1989) in Tamil, Kumar & Mohanty (2012) in Telugu and Sreedevi & Irfana (2013) in Malayalam.
Considering manner of articulation, nasals were predominant, followed by stops and fricatives. The phoneme /n/ was the most frequent nasal, and /f/ and /k/ were the most frequent fricative and stop, respectively. These results are similar to those for other languages such as Malayalam, Kannada, Tamil, Telugu (Ramakrishna, Nair, Chiplunkar, Atal & Rajaraman, 1957) and Cantonese and Mandarin (Thomas, 2005), except for the occurrence of stops. The occurrence of stops was higher in many Indian languages (Jayaram, 1985; Kalyani & Sunitha, 2009) and non-Indian languages (Denes, 1957; Guirao & Jurado, 1990; Thomas, 2005). Figure 6 depicts the percentage of occurrence of consonants based on manner of articulation. A Friedman test revealed a significant difference across the categories. Pair-wise tests revealed significant differences for all pairs except fricatives and stops (|Z| = 1.786; p = 0.074).
Considering place of articulation, alveolars occurred most often, while retroflexes occurred least. Bilabials and labiodentals had almost equal percentages of occurrence. The voiced dental /d/, voiceless bilabial /p/ and voiceless fricative /f/ were the most frequent among dentals, bilabials and labiodentals respectively. Among velars, the phoneme /k/ had higher occurrence. Unlike Malayalam (Sreedevi & Irfana, 2013), Telugu (Kalyani & Sunitha, 2009) and Marathi (Berkson & Nelson, 2015), Hindi had a relatively higher occurrence of the glottal sound /h/. As in Hindi, alveolars had higher occurrence in Telugu (Kalyani & Sunitha, 2009; Kumar & Mohanty, 2012), Cantonese, Mandarin, Italian, German and American English (Thomas, 2005). Dentals were more frequent in Malayalam and Kannada. However, Khan (1990) reported dentals to be more frequent than alveolars in Hindi. Figure 7 illustrates the mean percentage of occurrence of consonants based on place of articulation. A Friedman test revealed statistically significant differences among the various types of consonants. Pair-wise comparison revealed that palatals-dentals (|Z| = 2.345; p = 0.019), bilabials-labiodentals (|Z| = 1.024; p = 0.306) and glottals-palatals (|Z| = 1.847; p = 0.065) did not differ significantly, which indicates that these categories had similar percentages of occurrence in Hindi.
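The pair-wise comparisons above report |Z| and p values. A minimal sketch of how such values can be obtained is given below, assuming a Wilcoxon signed-rank test with the large-sample normal approximation and no correction for ties or continuity. The statistical package and exact settings used in the study are not stated, so the function names and this choice of approximation are assumptions. Each input list would hold, for one category, its percentage of occurrence in each recording; the figures in the example are invented.

```python
import math


def average_ranks(values):
    """1-based ranks of values, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def wilcoxon_z(x, y):
    """Z statistic of the Wilcoxon signed-rank test (normal approximation)."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences dropped
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return (w_plus - mean) / sd


def two_sided_p(z):
    """Two-sided p value for a standard normal Z."""
    return math.erfc(abs(z) / math.sqrt(2))


# Invented percentages for two categories across six recordings.
bilabials = [8.1, 7.9, 8.4, 8.0, 7.6, 8.3]
labiodentals = [8.0, 8.2, 8.1, 7.7, 7.9, 8.5]
z = wilcoxon_z(bilabials, labiodentals)
print(abs(z), two_sided_p(z))
```

Under this approximation, a two-sided p value for |Z| = 1.786 comes out at about 0.074, which agrees with the value reported above for fricatives and stops.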

Conclusions
To conclude, consonants had higher occurrence than vowels. Vowel /i/ and consonant /n/ had the maximum occurrence among vowels and consonants respectively. Diphthongs and aspirated consonants had the lowest frequency counts in the data. Nasals occurred most frequently considering manner of articulation, while alveolars were most frequent considering place of articulation. Unlike several other Indian languages (Malayalam, Telugu and Marathi), Hindi had a relatively higher occurrence of the glottal /h/. The results of the current study will help audiologists and speech-language pathologists develop assessment tools (PB word lists) and intervention tools (speech sound targets for articulation therapy) for the rehabilitation of individuals with communication disorders. The information is valuable to speech engineers and linguists as well. Because Hindi has a large number of native and non-native speakers, it is necessary to create a database of the phonemes of the language.