TextMap juxtaposition algorithm

TextMap, the entity search engine, just published their juxtaposition algorithm.

The paper is dense in ideas, on top of being entertaining:

Concordance-Based Entity-Oriented Search, by Mikhail Bautin and Steven Skiena.

Very roughly, the algorithm goes as follows:

  1. Annotate every entity in every document;
  2. Extract all sentences containing an entity;
  3. Delete duplicate sentences corpus-wide (using MD5 hashes for duplicate detection);
  4. Use Lucene to index tuples [entity, concatenation of all sentences containing it];
  5. Use a special ranking function.
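
Steps 1 to 4 can be sketched in a few lines of Python. This is only an illustration under loud assumptions: entity annotation is reduced to plain substring matching against a known entity list, the sentence splitter is a naive regex, and a dictionary stands in for the Lucene index. The names `split_sentences` and `build_concordances` are mine, not the paper's.

```python
import hashlib
import re
from collections import defaultdict


def split_sentences(text):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def build_concordances(documents, entities):
    """Map each entity to the concatenation of the unique sentences mentioning it."""
    seen_hashes = set()               # MD5 digests of sentences already kept
    concordances = defaultdict(list)  # entity -> list of sentences

    for doc in documents:
        for sentence in split_sentences(doc):
            # Step 3: corpus-wide duplicate detection via MD5.
            digest = hashlib.md5(sentence.encode("utf-8")).hexdigest()
            if digest in seen_hashes:
                continue
            seen_hashes.add(digest)

            # Steps 1 and 2: stand-in "annotation" by substring matching.
            for entity in entities:
                if entity in sentence:
                    concordances[entity].append(sentence)

    # Step 4: one pseudo-document per entity, ready to be indexed.
    return {entity: " ".join(sents) for entity, sents in concordances.items()}


if __name__ == "__main__":
    docs = [
        "Saku Koivu plays for the Montreal Canadiens. It rained.",
        "Saku Koivu plays for the Montreal Canadiens. Montreal is cold.",
    ]
    print(build_concordances(docs, ["Montreal Canadiens", "Saku Koivu"]))
```

A real system would use a proper named-entity tagger, and each resulting pseudo-document would be handed to Lucene rather than kept in a dictionary.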

The search is conducted with a special scoring scheme (tf-idf minus the sensitivity to document length), and the result of a query (e.g., ‘Montreal’) is a list of entities that are closely related to it (‘Montreal Canadiens’, ‘Saku Koivu’, etc.).
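
To make the scoring idea concrete, here is a rough Python sketch of tf-idf without length normalization. It is my own reading of "tf-idf minus the sensitivity to document length", not the formula from the paper: the raw term frequency is multiplied by a smoothed idf, and the score is never divided by the size of the entity's pseudo-document. The function names and the toy data are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercased word tokens; a stand-in for a real analyzer.
    return re.findall(r"\w+", text.lower())


def rank_entities(query, concordances):
    """Rank entities by tf-idf of the query terms, with no length normalization."""
    n = len(concordances)
    term_counts = {e: Counter(tokenize(t)) for e, t in concordances.items()}

    scores = {}
    for entity, counts in term_counts.items():
        score = 0.0
        for term in tokenize(query):
            # Document frequency: number of entity pseudo-documents with the term.
            df = sum(1 for c in term_counts.values() if term in c)
            if df == 0:
                continue
            idf = math.log(1 + n / df)    # smoothed idf (assumption)
            score += counts[term] * idf   # raw tf, not divided by length
        if score > 0:
            scores[entity] = score

    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    toy = {
        "Montreal Canadiens": "The Montreal Canadiens play in Montreal.",
        "Saku Koivu": "Saku Koivu was captain in Montreal.",
        "Toronto": "Toronto is in Ontario.",
    }
    print(rank_entities("Montreal", toy))
```

Dropping the length normalization matters here because a popular entity accumulates a very long pseudo-document, and a length-normalized score would penalize it for exactly that popularity.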
