EASY OCR SYSTEM FOR INDIAN LANGUAGES
What’s in the news?
Taking a cue from European languages, several of which have the same (Roman letter–based) script, a team at IIT Madras has, over the last decade, developed a unified script for nine Indian languages, named the Bharati script.
The team has now gone a step further: it has developed a method for reading documents in Bharati script using a multilingual optical character recognition (OCR) scheme.
A Look at Specifics:
- The team has also created a finger-spelling method that can be used to generate a sign language for hearing-impaired persons.
- In collaboration with TCS Mumbai, the researchers have found a way for persons with hearing disabilities to generate signatures using this finger-spelling technique.
- The scripts that have been integrated are Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Telugu, Kannada, Malayalam and Tamil. English and Urdu have not been integrated so far.
- The Urdu and English alphabets have a very different phonetic organisation from these scripts, but a mapping is still possible.
What does OCR Involve?
- In general, optical character recognition schemes involve first separating (or segmenting) the document into text and non-text.
- The text is then segmented into paragraphs, sentences, words and letters.
- Each letter then has to be recognised as a character in a standard encoding such as ASCII or Unicode.
- The letter has various components such as the basic consonant, consonant modifiers, vowels etc.
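To make the last two points concrete, here is a minimal Python sketch showing how one written Indic syllable breaks down into separate Unicode code points. It uses only the standard `unicodedata` module. The helper name `describe_components` and the role labels are illustrative choices, not part of the IIT Madras system.

```python
import unicodedata


def describe_components(syllable: str) -> list[tuple[str, str, str]]:
    """Return (code point, Unicode name, role) for each code point in a syllable."""
    # Role labels are illustrative; they map Unicode general categories
    # to the informal terms used in the text above.
    roles = {
        "Lo": "base letter",                    # consonants and independent vowels
        "Mc": "dependent sign (spacing)",       # e.g. many vowel signs
        "Mn": "dependent sign (non-spacing)",   # e.g. virama, some vowel signs
    }
    described = []
    for ch in syllable:
        category = unicodedata.category(ch)
        described.append(
            (f"U+{ord(ch):04X}", unicodedata.name(ch), roles.get(category, category))
        )
    return described


# Devanagari "ki": the consonant KA followed by the vowel sign I.
components = describe_components("\u0915\u093F")
for code_point, name, role in components:
    print(code_point, name, role)
```

On paper the two parts are drawn as one joined shape, yet the encoded text stores them as two characters, so an OCR system has to recover both from a single glyph.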
Easy to read:
- The scripts of Indian languages pose a problem for such character recognition because the vowel and consonant-modifier components are attached to the main consonant part.
- This difficulty is removed in the Bharati script, which is designed to be easy to read.
- In Bharati characters, these different components are segmentable by design. So OCR works quite accurately.
Three-tiered structure:
- The ease in design comes about because the Bharati characters are made up of three tiers stacked vertically.
- The consonant at the root of the letter is placed in the centre and the modifiers are in the top and bottom tiers.
- So far, the team has developed a universal finger-spelling language for the nine Indian languages.
- The team is now working on a system that can help people sign documents using a finger-spelling method. Future plans include developing a new Braille system with the Bharati script.
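The three-tier layout described above suggests why segmentation becomes simple. The Python sketch below is only an illustration, not the IIT Madras team's method. It assumes a character arrives as a small binary bitmap and that the three tiers are equal-height horizontal bands. The source does not specify the real tier boundaries, and the names `split_into_tiers` and `has_ink` are hypothetical.

```python
def split_into_tiers(glyph: list[list[int]]) -> dict[str, list[list[int]]]:
    """Split a glyph bitmap (rows of 0/1 pixels) into three horizontal bands.

    Assumption: the tiers have (roughly) equal height. A real system would
    use the script's actual tier boundaries.
    """
    rows = len(glyph)
    if rows < 3:
        raise ValueError("glyph needs at least three rows")
    first_cut = rows // 3
    second_cut = 2 * rows // 3
    return {
        "top": glyph[:first_cut],               # modifier tier
        "middle": glyph[first_cut:second_cut],  # root consonant tier
        "bottom": glyph[second_cut:],           # modifier tier
    }


def has_ink(band: list[list[int]]) -> bool:
    """Return True if any pixel in the band is set."""
    return any(any(row) for row in band)


# Toy 6x4 bitmap: a mark in the top tier, a shape in the middle, empty bottom.
glyph = [
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
tiers = split_into_tiers(glyph)
print({name: has_ink(band) for name, band in tiers.items()})
```

Because each tier can be cropped by position alone, the consonant and its modifiers can be passed to a recogniser separately. Conventional Indian scripts do not allow this, since the components are fused into one shape.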