Using NLP / Entity Recognition for text parsing

This is an ongoing ML project, in conjunction with NYP’s Applied AI Programme, to help identify the keywords restaurant owners use when ordering food from a distributor and to match them to an online catalog.


by Shen Nan

Jun 2023

This article discusses the challenges and importance of fuzzy name matching, the process of finding the most similar name(s) in another list. It is particularly significant for large datasets, such as those of an international logistics company or a large corporate bank, where manually matching names or strings becomes infeasible. The article then outlines several complexities involved in the process, including varying prefixes or suffixes, abbreviations, regional names, misspelled names, and varying formats of data.

Existing solutions for this problem include phonetic or information-similarity algorithms, edit-distance methods, statistical-similarity approaches, and word-embedding methods. However, these approaches face limitations, such as the inability to handle multiple scripts, trade-offs between recall and precision, high computational requirements, a lack of inherent feedback, and difficulty interpreting stop words, handling multiple joining keys, and capturing contextual similarities.

The authors describe their experience with a large corporate bank client using an "ensemble" approach for fuzzy name matching, which improved the quality and quantity of customer leads significantly within a month of implementation. The bank was able to reduce operational expenses, improve efficiency, and allocate data-scientist time more effectively.

The article discusses five main challenges encountered while building a comprehensive data asset for companies:

  1. Identifying the same companies across datasets: This involved normalizing company names and addresses, removing noise from names, matching company names across datasets, and accounting for language differences in data from multiple regions (see the normalization sketch after this list).
  2. Choosing the most meaningful information for each record: This required interpreting the usefulness of a field in terms of its completeness, coverage, usability, and correctness. It was difficult as much of the information was duplicated across different fields.
  3. Processing text across each individual dataset to normalize the fields: Each dataset had unique challenges, including differing text formats for key information, incomplete or unclear addresses, and diverse text delimiters and data formats.
  4. Aggregating information from different data sources: This involved consolidating over 1,000 fields of information for more than 10 million companies into a master database, once the same companies had been identified across datasets.
  5. Building an audit and feedback mechanism: This task was necessary to assure data quality and to continuously enhance the logic and improve the quality of leads. The system needed to generate leads in a self-sustaining manner.
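
To make the first challenge concrete, here is a minimal, hypothetical sketch of company-name normalization: lowercasing, stripping accents and punctuation, and removing common legal-form suffixes. The suffix list and rules are illustrative assumptions, not the authors' actual logic.

```python
import re
import unicodedata

# Illustrative list of legal-form suffixes; the real project would use a
# much larger, per-region list.
LEGAL_SUFFIXES = {"ltd", "limited", "llc", "inc", "incorporated", "pte", "co", "corp"}

def normalize_company_name(name: str) -> str:
    """Normalize a raw company name for matching (hypothetical rules)."""
    # Strip accents and normalize unicode
    name = unicodedata.normalize("NFKD", name)
    name = "".join(ch for ch in name if not unicodedata.combining(ch))
    # Lowercase and remove punctuation
    name = re.sub(r"[^\w\s]", " ", name.lower())
    # Drop legal-form suffixes and collapse whitespace
    tokens = [t for t in name.split() if t not in LEGAL_SUFFIXES]
    return " ".join(tokens)

print(normalize_company_name("Acme Foods Pte. Ltd."))   # -> "acme foods"
print(normalize_company_name("ACME  FOODS, Inc"))       # -> "acme foods"
```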

The article discusses the development of a large-scale fuzzy matching system using a hybrid approach, with more than 50 components developed and tested over three months. The final system consists of five core modules:

  1. Language detection and transliteration: The system identifies the language of a given string and generates translations. It includes language detection of mixed-language strings, normalization of languages through a transliteration layer, plain-text translation using the Google Translate API, syllable-based maximum matching for word alignment, and preparation of word/character embeddings for multi-lingual comparison of secondary words.
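
    As a rough illustration of this module (not the authors' code), the sketch below detects the language of a name and produces an ASCII transliteration; langdetect and unidecode are stand-ins for the production detection and transliteration layers, and the translation call is omitted.

    ```python
    # Hypothetical stand-ins: `langdetect` for language detection and
    # `unidecode` for transliteration; the Google Translate API call used
    # in the real system is omitted here.
    from langdetect import detect
    from unidecode import unidecode

    def detect_and_transliterate(name: str) -> dict:
        try:
            lang = detect(name)          # e.g. "zh-cn", "en", "fr"
        except Exception:
            lang = "unknown"             # very short strings may fail detection
        return {
            "original": name,
            "language": lang,
            "transliterated": unidecode(name),  # "北京烤鸭" -> roughly "Bei Jing Kao Ya"
        }

    print(detect_and_transliterate("Société Générale"))
    print(detect_and_transliterate("北京烤鸭有限公司"))
    ```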

  2. Pre-processing textual data: This module cleans the text by normalizing case, tokenizing text depending on the script, treating special characters using a custom list for each language, stemming with the Porter Stemmer (or a Regexp Stemmer for non-Latin languages), and treating stop words. It also includes a dynamic word/character importance calculator, a list of standard prefixes/suffixes/geographies, and replacement words to normalize the content.
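
    A minimal sketch of this kind of pre-processing, using NLTK's Porter Stemmer and stop-word list as stand-ins for the custom components described above (the replacement map is an assumption):

    ```python
    # Stand-ins: NLTK stop words and Porter Stemmer; the production system
    # uses custom per-language lists and a Regexp Stemmer for non-Latin scripts.
    import nltk
    from nltk.stem import PorterStemmer
    from nltk.corpus import stopwords

    nltk.download("stopwords", quiet=True)

    STOP_WORDS = set(stopwords.words("english"))
    # Hypothetical replacement map to normalize common abbreviations
    REPLACEMENTS = {"intl": "international", "mfg": "manufacturing"}
    stemmer = PorterStemmer()

    def preprocess(text: str) -> list[str]:
        tokens = text.lower().split()                      # simple whitespace tokenization
        tokens = [REPLACEMENTS.get(t, t) for t in tokens]  # normalize abbreviations
        tokens = [t for t in tokens if t not in STOP_WORDS]
        return [stemmer.stem(t) for t in tokens]

    print(preprocess("Intl Trading and Mfg Services"))
    # -> roughly ['intern', 'trade', 'manufactur', 'servic']
    ```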

  3. Named Entity Recognition & Classification (NERC): This process identifies entities in names and descriptions to link companies across the supply chain and establish buyer-supplier relationships. The system automatically detects named entities with high accuracy, using a custom-built Named Entity Recognition engine to classify information into fields such as Product, Company Feature, Service Offered or “Other” attributes.
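
    The sketch below illustrates the idea with spaCy's off-the-shelf English model as a stand-in; the article's engine is custom-built, and the mapping to Company / Product / Geography labels here is hypothetical.

    ```python
    # spaCy's pretrained model stands in for the custom NERC engine.
    # Requires: python -m spacy download en_core_web_sm
    import spacy

    nlp = spacy.load("en_core_web_sm")

    # Hypothetical mapping from spaCy labels to the article's target fields
    LABEL_MAP = {"ORG": "Company", "PRODUCT": "Product", "GPE": "Geography"}

    def extract_entities(description: str) -> list[tuple[str, str]]:
        doc = nlp(description)
        return [(ent.text, LABEL_MAP.get(ent.label_, "Other")) for ent in doc.ents]

    print(extract_entities("Acme Foods supplies frozen seafood to restaurants in Singapore"))
    # e.g. [('Acme Foods', 'Company'), ('Singapore', 'Geography')]
    ```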

  4. Hashing Pipeline: This stage implements a comprehensive process for large-scale fuzzy name matching, an essential tool for businesses with extensive databases of customers, products, or services. It is divided into two parts: a hashing pipeline for preliminary matching and a high-precision pipeline for refined, more accurate matching.

    The hashing pipeline is divided into four threads that run in parallel, performing:

    1. HDBSCAN clustering on company names and addresses using a tf-idf vectorizer, producing clusters that scale with the vocabulary size (see the first sketch after this list).
    2. Fasttext clustering, enhanced by indexing dictionary entries by the n-grams they contain for fast lookup and limiting the matching process to words with at least one n-gram in common with others.
    3. Phonetic hashing using NYSIIS and Double Metaphone, which together generate a pair of phonetic hashes for each string (see the second sketch after this list).
    4. Auxiliary matching for products and services, combining NERC and Word embedding to create hashes for fields with product/service descriptions.
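
    A rough sketch of the first thread: character n-gram tf-idf followed by HDBSCAN. The TruncatedSVD step is my assumption, added only because HDBSCAN works on dense vectors; parameters are illustrative.

    ```python
    # Sketch of thread 1: tf-idf vectorization of names + HDBSCAN clustering.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    import hdbscan

    names = [
        "acme foods pte", "acme food pte ltd", "global seafood supplies",
        "globel seafood supply", "sunrise bakery",
    ]

    vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
    X = vectorizer.fit_transform(names)                 # sparse tf-idf matrix
    X_dense = TruncatedSVD(n_components=3).fit_transform(X)

    clusterer = hdbscan.HDBSCAN(min_cluster_size=2)
    labels = clusterer.fit_predict(X_dense)             # -1 marks noise points
    for name, label in zip(names, labels):
        print(label, name)
    ```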
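
    And a sketch of the phonetic thread, with the jellyfish and metaphone packages standing in for the production implementation:

    ```python
    # Sketch of thread 3: phonetic hashing with NYSIIS and Double Metaphone.
    import jellyfish
    from metaphone import doublemetaphone

    def phonetic_hash(name: str) -> tuple[str, str]:
        """Return a (NYSIIS, primary Double Metaphone) pair for a name."""
        return jellyfish.nysiis(name), doublemetaphone(name)[0]

    for name in ["Acme Foods", "Akme Fuds", "Global Seafood"]:
        print(name, "->", phonetic_hash(name))
    # Names that sound alike hash to the same (or overlapping) codes,
    # so they land in the same preliminary bucket.
    ```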

    The second-stage processor combines the output matrices from the first stage, through a rule-management engine that tests for pairwise co-occurrences across the different approaches, to generate the final hash for stage 2.
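
    A toy version of such a rule, counting how many of the independent hash threads place two records in the same bucket; the agreement threshold of 2 is an illustrative assumption.

    ```python
    # Toy second-stage rule: two records become a candidate pair when enough
    # of the independent hashing threads agree. The threshold is hypothetical.
    AGREEMENT_THRESHOLD = 2

    def is_candidate_pair(hashes_a: dict, hashes_b: dict) -> bool:
        """hashes_* map thread name -> hash value, e.g. {'tfidf_cluster': 3, ...}."""
        agreements = sum(
            1 for thread in hashes_a
            if thread in hashes_b and hashes_a[thread] == hashes_b[thread]
        )
        return agreements >= AGREEMENT_THRESHOLD

    a = {"tfidf_cluster": 3, "phonetic": "ACN", "fasttext_cluster": 7}
    b = {"tfidf_cluster": 3, "phonetic": "ACN", "fasttext_cluster": 9}
    print(is_candidate_pair(a, b))   # True: two of three threads agree
    ```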

    The high-precision pipeline consists of two ensembles:

    1. Ensemble 1 applies four edit-distance methods (Jaro-Winkler, Hamming, Damerau-Levenshtein, and Levenshtein) and computes a net score (see the sketch after this list).
    2. Ensemble 2 uses machine learning, a combination of Support Vector Machines and Logistic Regression, to eliminate false matches.
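
    A compressed sketch of both ensembles, with jellyfish supplying the four string metrics and a small scikit-learn voting model standing in for the SVM / Logistic Regression combination; feature design, weights, and training pairs are all illustrative assumptions.

    ```python
    # Ensemble 1: four string-similarity scores combined into a net score.
    # Ensemble 2: a small SVM + Logistic Regression voter over those scores.
    import jellyfish   # jaro_winkler_similarity is named jaro_winkler in older versions
    import numpy as np
    from sklearn.svm import SVC
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import VotingClassifier

    def similarity_features(a: str, b: str) -> list[float]:
        max_len = max(len(a), len(b), 1)
        return [
            jellyfish.jaro_winkler_similarity(a, b),
            1 - jellyfish.hamming_distance(a, b) / max_len,
            1 - jellyfish.damerau_levenshtein_distance(a, b) / max_len,
            1 - jellyfish.levenshtein_distance(a, b) / max_len,
        ]

    def net_score(a: str, b: str) -> float:
        return float(np.mean(similarity_features(a, b)))   # Ensemble 1 output

    # Ensemble 2: tiny illustrative training set of (pair, is_same_company)
    pairs = [("acme foods", "acme food"), ("acme foods", "sunrise bakery"),
             ("global seafood", "globel seafood"), ("global seafood", "acme foods")]
    y = [1, 0, 1, 0]
    X = np.array([similarity_features(a, b) for a, b in pairs])

    clf = VotingClassifier([("svm", SVC()), ("lr", LogisticRegression())],
                           voting="hard").fit(X, y)

    print(net_score("acme foods", "acme food"))
    print(clf.predict([similarity_features("globel seafod", "global seafood")]))
    ```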

    A post-processor is used to apply a dynamic threshold for each pair, assigning a strict, approximate, or failed tag. Pairs in the "approximate" bucket are passed to Ensemble 2.
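
    The post-processing step might look roughly like this; the threshold values and the per-pair adjustment are assumptions for illustration only.

    ```python
    # Illustrative post-processor: assign strict / approximate / failed tags
    # using a threshold that tightens for short strings (a hypothetical rule).
    def tag_pair(score: float, a: str, b: str,
                 strict: float = 0.92, approximate: float = 0.75) -> str:
        # Short strings carry less evidence, so require a higher score (assumption)
        if min(len(a), len(b)) < 6:
            strict += 0.03
            approximate += 0.05
        if score >= strict:
            return "strict"          # accepted as a definite match
        if score >= approximate:
            return "approximate"     # forwarded to Ensemble 2 for a second opinion
        return "failed"

    print(tag_pair(0.95, "acme foods", "acme food"))     # strict
    print(tag_pair(0.80, "acme foods", "acme trading"))  # approximate
    ```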

  5. The feedback layer: Advises the team on potential modifications and sends alerts in case of severe imbalance. It generates reports on named entity recognition, clustering parameters, and thresholds for approximate and definite matches.

    The entire process is automated using Luigi and Python's core logging module, with the performance reports generated automatically. Flask is used to create a high-performance serving layer that can handle thousands of requests per second.
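
    For flavour, a minimal Flask serving endpoint of the kind described; the route, payload shape, and match_name placeholder are hypothetical, not the authors' API.

    ```python
    # Minimal Flask serving-layer sketch; in production this would sit behind
    # a WSGI server such as gunicorn to reach the quoted throughput.
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    def match_name(query: str) -> list[dict]:
        """Placeholder for the full matching engine described above."""
        return [{"match": query.lower(), "score": 1.0}]   # dummy result

    @app.route("/match", methods=["POST"])
    def match():
        payload = request.get_json(force=True)
        return jsonify(match_name(payload["name"]))

    if __name__ == "__main__":
        app.run(port=5000)
    ```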

    The process dramatically increased sales leads by a factor of 500, achieving a precision of 0.99 on the same-language test set and 0.96 on the cross-lingual test set. In daily operations, the engine performs matching for more than 5 million strings in less than two hours.

    The authors learned from this project that no single name-matching method can handle all the nuances of real data, that the road to such designs is challenging, and that it is crucial to invest effort in designing good test/validation datasets, managing small building blocks such as stop words, and setting up automated tests from the start. Chinese strings were a particular challenge in this project.