Projects
Theme: Public and private authorities worldwide apply emerging technologies to database search in order to identify data records that refer to the same entity, and information technology plays a crucial role in this challenge. Real-world applications deal with names of various origins, which demand a fast, language-independent name matching solution that can handle large amounts of data. An efficient matching algorithm must allow for legitimate spelling or phonetic variations, transliteration variations in each language, and common spelling errors: substitution, omission, insertion, and transposition.

Spurred by the need to search efficiently in a large database, we have developed a real-world integrated system for searching name variations in different languages. The requirements of our application were: (1) to create associations between database records by comparing names that vary morphologically or in other ways; (2) to offer language-transparent access at the query level, so that a user can retrieve names in different languages and character sets; (3) to incorporate variants and nicknames of the same name into the system; and (4) to achieve very high recall (ideally 100%), since we cannot afford to miss a hit.

Within the project, we have developed a novel algorithm that learns transliterations from our large collection of unlabeled data, using minimal information to bootstrap the procedure. Bootstrapping is a general framework for improving a learning algorithm using unlabeled data. We have also developed a module for mining name variations from web data. Finally, we have applied approximate name matching by finite-state recognition in order to search the name database efficiently.
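As an illustration of the four spelling errors listed above (substitution, omission, insertion, transposition), the following minimal sketch computes an edit distance that counts each of them as one edit and uses it to scan a list of names. It is not the project's method: the system described here relies on finite-state recognition, whereas this sketch swaps in a plain dynamic-programming distance (optimal string alignment) and a linear scan, which would be too slow for a large database. The function names `osa_distance` and `find_matches`, the `max_edits` threshold, and the sample names are all hypothetical.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance between a and b.

    Counts substitutions, omissions (deletions), insertions, and
    transpositions of adjacent characters, each as a single edit.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # omission
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


def find_matches(query: str, names: list[str], max_edits: int = 1) -> list[str]:
    """Return the names within max_edits of the query, ignoring case.

    Hypothetical helper: a linear scan, used here only for illustration.
    """
    q = query.casefold()
    return [n for n in names if osa_distance(q, n.casefold()) <= max_edits]


if __name__ == "__main__":
    # Illustrative data only; "Jonh" is one transposition away from "John".
    print(find_matches("Jonh", ["John", "Joan", "Jane"]))
```

A larger `max_edits` raises recall at the cost of more false hits, which is the trade-off behind the high-recall requirement stated above.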
Qualia, SingularLogic (integrator, http://www.singularlogic.eu/index_en.html).