ATAPY Software - OCR, Document Imaging, Document Management, Data Capture, Data Conversion

Development of International Computer Dictionaries

‘ATAPY reached 99.992% text accuracy in the German-Russian Dictionary (1 mistake per 8 760 symbols), and 99.997% quality for the Spanish-Russian Dictionary project (1 mistake per 31 500 symbols). They also corrected many mistakes in the source dictionary text, including typographical misprints and even mistakes in special marks that are almost impossible to detect without special programming tools and profound knowledge of linguistics.’
Anna Zhavoronkova
Project Manager,
ABBYY Software House

Electronic dictionaries and translation systems are an area of great practical importance in an ever-globalizing world. ABBYY Software House, the world leader in OCR/ICR and linguistic technologies, develops and sells Lingvo electronic dictionaries. For many years Lingvo has been known as the best English-Russian dictionary on the market. Version 8.0 added support for three more languages. For the next version, ABBYY decided to use the world’s latest best-of-breed dictionaries, reflecting the modern state of the supported languages.

ABBYY turned to ATAPY Software, its outsourcing partner in Novosibirsk, for digitisation of two dictionaries from the list selected by the Linguistics Department. The 3-volume, 1750-page Leping German-Russian Dictionary and the 830-page Narumov Spanish-Russian Dictionary were to be recognized and proofread for automatic conversion to the Lingvo database.

The highest possible text recognition accuracy was obviously a must. A single mistake could break the alphabetical order of the headwords and tear a word away from its paradigm. If the number of such mistakes rose above a very modest threshold, the dictionary would become unsearchable. Accurate interpretation of the special dictionary marks was no less vital for the project. These marks served as field delimiters in the automatic database conversion process and had to be recognized with 100% accuracy. They appeared as text formatting (bold or italics), as special symbols (brackets, asterisks), or as a combination of the two (for example, italic brackets indicated a dictionary comment). Omitting a single bracket or missing an italicization would break the article’s structure. This is why the project required both intelligent programming and highly qualified manual effort, a true challenge for any contractor in the media service area.
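The two failure modes described above, a missing bracket and a headword knocked out of alphabetical order, lend themselves to simple automated checks. The following is a minimal sketch of such checks, not ATAPY's actual tooling: the function names, the sample entries and the plain case-insensitive comparison are all assumptions made for illustration (a real dictionary would need its own collation rules, for example for umlauts).

```python
def brackets_balanced(entry: str) -> bool:
    """Return True if every round or square bracket in the entry is properly paired."""
    closing_to_opening = {")": "(", "]": "["}
    stack = []
    for ch in entry:
        if ch in "([":
            stack.append(ch)
        elif ch in closing_to_opening:
            # A closing bracket must match the most recent unclosed opening bracket.
            if not stack or stack.pop() != closing_to_opening[ch]:
                return False
    return not stack  # anything left on the stack was never closed


def out_of_order(headwords: list[str]) -> list[int]:
    """Return the indices of headwords that sort before their predecessor.

    Uses a naive case-insensitive comparison (an assumption, not real
    dictionary collation).
    """
    return [
        i
        for i in range(1, len(headwords))
        if headwords[i].casefold() < headwords[i - 1].casefold()
    ]


if __name__ == "__main__":
    # Hypothetical entries: the second one has lost its closing bracket.
    print(brackets_balanced("Haus n (-es, Häuser)"))
    print(brackets_balanced("Haus n (-es, Häuser"))
    # "Hant" stands in for a misrecognized "Haut" and breaks the ordering.
    print(out_of_order(["Hase", "Haus", "Hant", "Hut"]))
```

Checks of this kind only flag suspicious places; deciding whether the source or the recognition is at fault still falls to a human proofreader.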

The dictionaries were scanned and automatically recognized with ABBYY FineReader OCR, specially tuned for this material. A team of qualified operators then proofread and cross-checked the results using the double verification technique to ensure recognition accuracy. Double verification also exposed certain unexpected cases, such as typos in the source dictionary text, which were corrected according to the customer’s guidelines. To automate the proofreading work as far as possible, ATAPY developed and customized a number of in-house utilities such as Glyphica, a tool for quick input of characters that cannot be found on the keyboard. For Leping’s Dictionary, ATAPY developed a custom converter with built-in spelling and punctuation checkers, which made it possible to weed out mistakes that had gone unnoticed during the previous stages and finally to convert the material into the Lingvo vocabulary database.
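The text does not spell out how double verification was carried out. One common reading, assumed here, is that two operators proofread the same recognized text independently and only the places where their versions disagree are sent for a further check. A minimal sketch of that comparison step, using Python's standard `difflib` module (the `disagreements` function and the sample strings are hypothetical):

```python
import difflib


def disagreements(version_a: str, version_b: str) -> list[tuple[int, str, str]]:
    """List the places where two independently proofread versions differ.

    Each item is (offset in version_a, text in version_a, text in version_b).
    """
    matcher = difflib.SequenceMatcher(None, version_a, version_b, autojunk=False)
    return [
        (i1, version_a[i1:i2], version_b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


if __name__ == "__main__":
    # Hypothetical outputs of two operators for the same dictionary line.
    operator_1 = "Haus n (-es, Häuser)"
    operator_2 = "Hans n (-es, Häuser)"
    for offset, text_a, text_b in disagreements(operator_1, operator_2):
        print(f"offset {offset}: {text_a!r} vs {text_b!r}")
```

Under this reading, an error survives only if both operators make the same mistake in the same place, which is far less likely than a single operator missing it.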

About ABBYY Software House:

ABBYY Software House is based in Moscow, Russia. The company was founded in 1989 by David Yang. ABBYY has over 600 employees and offices in Russia (Moscow), the USA (Fremont, CA), Ukraine (Kiev), the UK (Bracknell), Germany (Munich) and Japan (Tokyo). ABBYY develops software in the fields of artificial intelligence, document recognition and applied linguistics, and is best known for its optical character recognition software, FineReader.

ATAPY Software

All rights reserved © 2001-2010