Roy Tang

Programmer, engineer, scientist, critic, gamer, dreamer, and kid-at-heart.

I have a number of documents from different sources. Many of them reference a company name, but each source may have stored the name slightly differently. The name is a field in the documents.

I’d like to be able to detect variations on the same name, something like:

  • Ajax Company Incorporated
  • Ajax Co. Inc.
  • Ajax Company Inc.
  • Ajax Company
  • Ajax Company (formerly Ajax Unlimited)
  • etc.

Does MarkLogic have any facility to query for documents that have “similar” names, as above? I’m not sure if there’s a more technical term that I should be searching for. Preferably for either the Node.js Client API or server-side JavaScript.

Comments

There are several options you could try, or combine:

  • Use thesaurus expansion to expand a search for one of these terms to any of the others. You can use semantics for that, with owl:sameAs triples linking the variants, or you could make use of the MarkLogic thesaurus library (thsr).
  • Normalize your data at ingest with a reverse lookup in the thesaurus or ontology mentioned above. You could potentially tag the matches you find, and add the normalized name as an attribute so that searches can run against the normalized term. You would normalize the search terms in the same manner.
  • Apply spell:double-metaphone to each token in the name at ingest, and also to the search terms, then search on the resulting phonetic codes instead of the real name.
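
To make the normalization option a bit more concrete, here is a minimal sketch in plain TypeScript. It does not use any MarkLogic API; the `CANONICAL` table and the `normalizeCompanyName` helper are hypothetical names made up for illustration, and in practice the mapping would more likely come from your thesaurus or ontology. The idea is that the same function runs at ingest and on the search terms.

```typescript
// Hypothetical lookup table: variant token -> canonical token.
// In a real setup this would be a reverse lookup in a thesaurus or ontology.
const CANONICAL = new Map<string, string>([
  ["co", "company"],
  ["inc", "incorporated"],
  ["corp", "corporation"],
  ["ltd", "limited"],
]);

// Reduce a company name to a canonical, comparable form.
function normalizeCompanyName(name: string): string {
  return name
    .toLowerCase()
    .replace(/\(.*?\)/g, " ")      // drop parenthetical notes such as "(formerly ...)"
    .replace(/[^a-z0-9\s]/g, " ")  // strip punctuation like "." and ","
    .split(/\s+/)
    .filter((token) => token.length > 0)
    .map((token) => CANONICAL.get(token) ?? token)
    .join(" ");
}

console.log(normalizeCompanyName("Ajax Co. Inc."));
console.log(normalizeCompanyName("Ajax Company (formerly Ajax Unlimited)"));
```

With this table, “Ajax Co. Inc.”, “Ajax Company Inc.” and “Ajax Company Incorporated” all collapse to the same string. “Ajax Company” still differs because it has no legal suffix, so you may want to drop such suffixes entirely, depending on how strict the match should be.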

Search term expansion sounds like the most straightforward option in this case, particularly since you are talking about mere spelling differences in terms like ‘Company’ and ‘Incorporated’.
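
As a rough illustration of what term expansion does, the sketch below generates the spelling variants of a name in plain TypeScript. It is not the thsr library or any other MarkLogic API; `SYNONYM_GROUPS` and `expandName` are hypothetical names for this example, and the groups stand in for thesaurus entries. The resulting phrases would then be combined into a single "any of these" search.

```typescript
// Hypothetical synonym groups; each inner array lists interchangeable tokens.
const SYNONYM_GROUPS: string[][] = [
  ["company", "co"],
  ["incorporated", "inc"],
];

// Return every variant of the name obtained by swapping tokens for their synonyms.
function expandName(name: string): string[] {
  const tokens = name
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, " ") // strip punctuation
    .split(/\s+/)
    .filter((token) => token.length > 0);

  let variants: string[][] = [[]];
  for (const token of tokens) {
    // Tokens without a synonym group only "expand" to themselves.
    const group =
      SYNONYM_GROUPS.find((g) => g.indexOf(token) !== -1) ?? [token];
    const next: string[][] = [];
    for (const prefix of variants) {
      for (const alternative of group) {
        next.push([...prefix, alternative]);
      }
    }
    variants = next;
  }
  return variants.map((v) => v.join(" "));
}

console.log(expandName("Ajax Co. Inc."));
```

The number of variants grows multiplicatively with each token that has synonyms, which is fine for short company names but is one reason normalizing at ingest can be attractive for longer fields.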

HTH!