Top 5 Natural Language Processing Applications

May 13, 2008

In recent decades, Natural Language Processing (NLP) has been hyped and criticized in equal measure. All in all, many applications have emerged in the real world following intense and sustained research and development. Here’s a list of the most prominent success stories.

Given that this blog is about named entity recognition (NER), itself an NLP application, we would be biased in adding NER to this list. As such, we’ve excluded ourselves from the chart-toppers ;)

#5: Chat bots


The first time I chatted with Dr. Sbaitso, I was about 12 years old. Probably more than anything else, it influenced my career path. Since then, chat bots such as ELIZA, A.L.I.C.E. and Jabberwacky have propelled the art of conversational robots, leading to automated service agent applications (see NextIT).

For their lasting impact on generations of NLP developers, and for the interesting improvements that ensued, chat bots rank #5.

#4: NLP-based search engines

Ask Jeeves pioneered it, Powerset redefined it, but we are all somewhat skeptical when it comes to beating Google’s classic vector space models and ranking techniques. Do we really need shallow NLP parsing to answer “When did Einstein die,” or will statistical fact extraction suffice?

Though it is the Holy Grail of NLPers, NLP-based search has not yet surpassed current information retrieval techniques. As such, NLP-based search engines rank #4.

#3: Speech recognition

Microsoft and Ford just teamed up to develop in-car speech recognition. But they forgot to include Electronic Voice Alert, a feature of mid-80s luxury Chrysler cars!

In all seriousness, automatic speech recognition (ASR) is a vital application for hands-free computing (for disabled persons, or in circumstances such as driving) and for transcription. It is also poised to revolutionize audio-video content retrieval.

For where it came from, and for where it’s going, ASR ranks #3.

#2: Machine translation

“It is apparent to me that the possibilities of the aeroplane, which two or three years ago were thought to hold the solution to the [flying machine] problem, have been exhausted, and that we must turn elsewhere.”

(Thomas Edison, inventor, 1895)

The “heavier-than-air” problem that once plagued flight technology is probably the best comparison we can make to AI and machine translation (MT). It was long believed that MT would require completely automatic understanding of human language before it could ever work. Yet today’s Google and Government of Canada systems cover language pairs beyond most humans’ abilities (can you translate from French to Chinese? Not me.), and their level of precision makes them useful in many applications.

People are constantly pinpointing these systems’ shortcomings, but nobody would contest their second-place ranking on this list.

#1: Knowledge discovery in texts

Have you ever heard of software that finds new relationships and interactions between genes, proteins or cells? By mining large collections of scientific literature, NLP agents can discover and highlight novel and surprising knowledge.

What makes knowledge discovery so promising is the hope that, in the near future, we may monitor all those documents that are simply too abundant to be processed manually. Early forms of knowledge discovery, such as data mining, are already used for Business Intelligence (BI), and outside the NLP world, examples of machine-made inventions already exist.

As a form of technological singularity, and as an emerging field of research for NLP, knowledge discovery gets first place on this list of top NLP applications.


YooName’s creator honored at 2008 OCRI awards

April 8, 2008

OTTAWA, Canada, April 3, 2008: OCRI, Ottawa’s economic development agency, honoured Ottawa’s best and brightest companies, executives and students for their innovative work and contributions to the city’s knowledge-based sector at the 13th annual OCRI Awards gala.

David Nadeau from the University of Ottawa received the Student Researcher of the Year award for inspired research resulting in a more intelligent online search engine, and for his commercialization efforts, which launched last November.

[see the full press release]

NER Demos on the Web

March 8, 2008

Here’s a list of demos for Named Entity Recognition technologies:

Are you aware of any other demos? Send us the links!

What is a Named Entity?

February 12, 2008

To our surprise, when it comes to defining the task of Named Entity Recognition (NER), nobody seems to question the inclusion of temporal expressions and measures. This probably deserves some historical consideration, since the domain was popularized by information extraction competitions where, clearly, the date of an event and the money it generated were crucial. But we receive a lot of questions about the inclusion of some types, specifically those written as common nouns. Think of sports, minerals, or species. Should they be included in the task? And what about genes and proteins, which don’t refer to individual entities but are often included as well?

It seems that anyone who tries to define the task eventually falls back on practical considerations, like filling templates and answering questions.

**Let’s try to sort things out before we, too, fall back on practical considerations.**

We identified five different criteria for determining the essence of named entities:

Orthographic criteria: named entities are usually capitalized. Capitalization rules for multiword proper nouns change from one language to the next (e.g., ‘House of Representatives’ vs ‘Chambre des communes’). In German, all nouns are capitalized. [source]

Translation criteria: named entities usually do not translate from one language to the next. However, the transcribed names of places, monarchs, popes, and non-contemporary authors often share spelling, and are sometimes universal. [source]

Generic/specific criteria: named entities usually refer to single individuals. A mention of “John Smith” refers to an individual, but the gene “P53-wt” and the product “Canon EOS Rebel Xti” refer to multiple instances of entities. [source]

Rigid designation criteria: named entities usually designate rigidly. Proper names and certain natural terms, including biological taxa and types of natural substances (most famously “water” and “H2O”), are rigid designators. [source]

Information Extraction (IE) criteria: named entities fill some predefined “Who? What? Where? When?” template. This surely includes money, measure, date, time, proper names, and themes such as accident, murder, election, etc. [source]

Let’s take a closer look at some examples and the criteria they meet:

God: capitalized, translatable, single individual*, rigid, useful in IE

London: capitalized, translatable, single individual, rigid, useful in IE

John Smith: capitalized, not translatable, single individual, rigid, useful in IE

water: not capitalized, translatable, not a single individual, rigid, useful in IE

Miss America: capitalized, translatable, not a single individual, not rigid, useful in IE

the first Chancellor of the German Empire: not capitalized, translatable, single individual, not rigid*, useful in IE

Canon EOS Rebel Xti: capitalized, not translatable, not a single individual, not rigid, useful in IE

iPhone: not capitalized*, not translatable, not a single individual, not rigid, useful in IE

hockey: not capitalized, translatable, not a single individual, not rigid, useful in IE

$10: not capitalized, not translatable, not a single individual, not rigid, useful in IE

* Alright, it could be up for debate…

No single criterion accurately covers the named entity class. Capitalization is language-specific and sometimes falls short. Translatability is inconsistent. Specificity and rigid designation miss important types, such as money and products. The only criterion that encompasses them all is usefulness in information extraction, but it’s far too broad.

Our definition is a practical one. It stems from the way YooName works:

“The types recognized by NER are any sets of words that intersect with an NER type.”

This is ugly and circular, but it is practical!

We started by including Person, Location and Organization. These sets were ambiguous with products, songs, book titles, fruits, etc., so we added those sets too. Guided by our definition, we expanded the number of types to 100. We calculated that less than 1% of the millions of entities we have are ambiguous with sets of words not yet handled. The problem is that this 1% is so diverse that we’ll need to add thousands of new types.
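The ambiguity measurement described above boils down to intersecting type lists. Here is a minimal sketch of that idea; the lists, type names, and entities are invented for illustration and are far smaller than YooName’s real lists:

```python
# Sketch: measuring ambiguity between entity-type lists via set
# intersection. All lists and type names below are invented examples.

type_lists = {
    "country": {"France", "Chad", "China", "Georgia"},
    "first_name": {"France", "Chad", "Georgia", "Alice"},
    "us_state": {"Georgia", "Texas"},
}

def ambiguous_entities(lists):
    """Return entities that appear in more than one type list."""
    seen = {}
    for type_name, entities in lists.items():
        for e in entities:
            seen.setdefault(e, set()).add(type_name)
    return {e: types for e, types in seen.items() if len(types) > 1}

amb = ambiguous_entities(type_lists)
total = len({e for s in type_lists.values() for e in s})
print(f"{len(amb)} of {total} entities are ambiguous")
# → 3 of 6 entities are ambiguous
```

On real lists, the same intersection reveals which new type sets are worth adding: each ambiguous entity names the competing types that a disambiguation rule must separate.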

TextMap juxtaposition algorithm

February 2, 2008

TextMap, the entity search engine, just published its juxtaposition algorithm.

The paper is dense with ideas, on top of being entertaining:

Concordance-Based Entity-Oriented Search, by Mikhail Bautin and Steven Skiena.

The algorithm very roughly goes as follows:

  1. Annotate every entity in every document;
  2. Extract all sentences containing an entity;
  3. Delete duplicate sentences corpus-wide (using MD5 hashing for duplicate detection);
  4. Use Lucene to index tuples [entity, concatenation of all sentences containing it];
  5. Use a special ranking function.

The search is conducted with a special scoring scheme (tf-idf minus a sensitivity to document length), and the result for a query (e.g., ‘Montreal’) is a list of entities that are closely related to it (‘Montreal Canadiens’, ‘Saku Koivu’, etc.).
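The five steps can be sketched end-to-end in a few lines. The corpus below is invented, a dict stands in for Lucene, and plain tf-idf stands in for the paper’s length-compensated ranking function:

```python
import hashlib
import math
from collections import defaultdict

# Toy (entity, sentence) pairs such as step 1's annotator would emit.
annotated = [
    ("Montreal", "Montreal hosted the game last night."),
    ("Montreal Canadiens", "Montreal hosted the game last night."),
    ("Montreal", "Montreal hosted the game last night."),  # duplicate
    ("Saku Koivu", "Saku Koivu captained the team in Montreal."),
    ("Montreal", "Saku Koivu captained the team in Montreal."),
    ("Boston Bruins", "The Boston Bruins lost again."),
]

# Steps 2-3: drop duplicate sentences, with an MD5 digest as the dedup
# key (per (entity, sentence) pair here, so shared sentences survive).
seen, pairs = set(), []
for entity, sentence in annotated:
    key = (entity, hashlib.md5(sentence.encode()).hexdigest())
    if key not in seen:
        seen.add(key)
        pairs.append((entity, sentence))

# Step 4: build [entity -> concatenation of its sentences], the tuple
# the paper indexes with Lucene; a plain dict stands in for the index.
docs = defaultdict(list)
for entity, sentence in pairs:
    docs[entity].append(sentence)
index = {e: " ".join(s) for e, s in docs.items()}

# Step 5: plain tf-idf as a stand-in for the special ranking function.
def tokens(text):
    return [w.strip(".,").lower() for w in text.split()]

def score(query, entity):
    tf = tokens(index[entity]).count(query.lower())
    df = sum(query.lower() in tokens(d) for d in index.values())
    return tf * math.log(len(index) / df) if df else 0.0

ranked = sorted(index, key=lambda e: score("Montreal", e), reverse=True)
```

Querying ‘Montreal’ then ranks the entities whose concordances mention it most distinctively, which is how related entities like ‘Montreal Canadiens’ surface near the top.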

Semi-Supervised Named Entity Recognition

December 16, 2007

YooName originates from the PhD research titled Semi-Supervised Named Entity Recognition: Learning to Recognize 100 Entity Types with Little Supervision. The thesis was successfully defended at the University of Ottawa, Canada, and is now available online.

Here’s the abstract:

* * *

Named Entity Recognition (NER) aims to extract and to classify rigid designators in text such as proper names, biological species, and temporal expressions. There has been growing interest in this field of research since the early 1990s. In this thesis, we document a trend moving away from handcrafted rules, and towards machine learning approaches. Still, recent machine learning approaches have a problem with annotated data availability, which is a serious shortcoming in building and maintaining large-scale NER systems.

In this thesis, we present an NER system built with very little supervision. Human supervision is indeed limited to listing a few examples of each named entity (NE) type. First, we introduce a proof-of-concept semi-supervised system that can recognize four NE types. Then, we expand its capacities by improving key technologies, and we apply the system to an entire hierarchy comprised of 100 NE types.

Our work makes the following contributions: the creation of a proof-of-concept semi-supervised NER system; the demonstration of an innovative noise filtering technique for generating NE lists; the validation of a strategy for learning disambiguation rules using automatically identified, unambiguous NEs; and finally, the development of an acronym detection algorithm, thus solving a rare but very difficult problem in alias resolution.

We believe semi-supervised learning techniques are about to break new ground in the machine learning community. In this thesis, we show that limited supervision can build complete NER systems. On standard evaluation corpora, we report performances that compare to baseline supervised systems in the task of annotating NEs in texts.

If YooName Doesn’t Know It, It Doesn’t Exist

November 20, 2007

A demo log entry attracted our attention:

“I talked with Frfrfrf yesterday.”

This made-up sentence was clearly sent by a researcher in the field. It’s one way to test an important feature of a Named Entity Recognition (NER) system, and to determine the paradigm behind it.

NER systems are either based on “lists&rules” or on sequence labelling algorithms. Take this sentence, for example:

“I talked with France yesterday.”

The first kind of system requires massive lists of entities along with contextual rules to resolve ambiguity. Using lists&rules, “France” is matched to list entries for both countries and first names. A disambiguation rule then assigns the right type, using contextual cues such as “talked.” This strategy works well with “France” but cannot handle “Frfrfrf,” which is not in the vocabulary.
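A minimal lists&rules disambiguator along these lines can be sketched as follows; the lists, the cue words, and the fallback rule are all invented for the example:

```python
# Sketch of the lists&rules paradigm: entity lists plus contextual
# rules. The lists and cue words below are tiny, invented examples.

LISTS = {
    "France": {"country", "first_name"},
    "Canada": {"country"},
    "Alice": {"first_name"},
}

# A contextual cue like "talked with X" suggests X is a person.
PERSON_CUES = {"talked", "spoke", "met"}

def label(tokens, i):
    word = tokens[i]
    types = LISTS.get(word)
    if types is None:
        return None          # unknown word: lists&rules stays silent
    if len(types) == 1:
        return next(iter(types))
    # Disambiguation rule: a person cue nearby selects first_name.
    window = tokens[max(0, i - 3):i]
    if PERSON_CUES & set(window) and "first_name" in types:
        return "first_name"
    return "country"         # arbitrary fallback for the sketch

sent = "I talked with France yesterday".split()
print(label(sent, 3))      # "France" after "talked" -> first_name
unknown = "I talked with Frfrfrf yesterday".split()
print(label(unknown, 3))   # not in any list -> None
```

The `None` on the last line is the point of the “Frfrfrf” test: a pure lists&rules system has nothing to say about out-of-vocabulary words, no matter how strong the context.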

The second kind relies on learned probabilities of word sequences and their inner features. Using sequence labelling, the word “France” is assigned the probability of a given type using features such as capitalization, presence in lists of entities, prefixes, suffixes, and contextual word features. List lookup is optional, though it is recognized as a very good feature. This strategy works well with “France.” It can also handle unknown words such as “Frfrfrf” if the contextual cues are strong enough. This is exactly what happens with the LingPipe system, which annotates “Frfrfrf” as a person.
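The kind of per-token feature vector a sequence labeller consumes can be sketched like this. The feature names and the tiny name list are invented; a real system would feed these features to a learned model such as an HMM or CRF rather than inspect them directly:

```python
# Sketch: per-token features of the kind a sequence labeller uses.
# Feature names and the name list are invented for illustration.

KNOWN_FIRST_NAMES = {"france", "alice"}  # optional list-lookup feature

def features(tokens, i):
    w = tokens[i]
    return {
        "word": w.lower(),
        "capitalized": w[0].isupper(),
        "prefix2": w[:2].lower(),
        "suffix2": w[-2:].lower(),
        "in_name_list": w.lower() in KNOWN_FIRST_NAMES,
        "prev_word": tokens[i - 1].lower() if i > 0 else "<s>",
        "next_word": tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>",
    }

sent = "I talked with Frfrfrf yesterday".split()
f = features(sent, 3)
# "Frfrfrf" fails the list lookup, but capitalization plus the
# "with" context can still push a learned model toward "person".
```

This is why a sequence labeller can annotate “Frfrfrf” as a person while a lists&rules system cannot: the decision rests on features of the word and its context, not on vocabulary membership.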

YooName is based on finite lists and rules. Therefore, if YooName doesn’t know it, it doesn’t exist. YooName constantly scouts the Web for new entities. So far, it has never found “Frfrfrf.” We believe that if it runs long enough, it may end up knowing every single named entity out there.

The Same Antique Web

November 12, 2007

It’s in the air.

We’re flooded by catchy phrases announcing it.

It’s all about semantics, AI and Web 3.0:

“Web3 is closer than you think!”
“You ain’t seen nothing yet!”
“The Web as artificial intelligence supplanting the human race!”

Some years ago, “you” were the superstar of Web 2.0 and its social networks.

In the late ’90s, the dot-com boom had everything going Web-based, from grocery delivery to movie rentals. It was also when Google made its debut.

Overall, it’s difficult to say whether these Web movements were successful or whether the whole thing was a waste of time and money. Judging from Web 1.0’s Google and Web 2.0’s Wikipedia, I’d say humanity has a positive balance.

Web 3.0 should be just as disruptive as its predecessors. There will be tons of new companies bridging natural language technologies and current Web content to provide what we could call “semantic hyperlinks.” If just one of these companies finds a way to resolve ambiguity, which affects as much as half of everything we write, it will totally change the face of the Web. Right now, we should recognize that all the noise and irrelevance in search engine hit lists is abnormal.

In the end, “versioning” is just part of the Web hype. What really matters is what’s always been fundamental about the Web:

People – a lot of people – sharing content.

Combining NEs with Social Networks

October 22, 2007

What happens when you combine Named Entity Recognition with Social Networks? Do they “blend”? We may have some insight when Twine reveals its platform. Until then, O’Reilly Radar provides some information on the idea.


October 11, 2007

Google and Microsoft are both active in the Named Entity Recognition (NER) field, and more notably, in Named Entity Disambiguation. This task consists of “disambiguating between multiple named entities that can be denoted by the same proper name” (Bunescu and Pasca 2006). For instance, politicians, Internet entrepreneurs and criminals share the name of James Clark. And yes, these are all distinct entities.

Well-known NER researchers at Google and Microsoft published the following papers:

These are two very nice pieces of work that deserve an attentive read. What motivates this research is clear:

“A frequent case are queries about named entities, which constitute a significant fraction of popular Web queries according to search engine logs. When submitting queries such as John Williams or Python, search engine users could also be presented with a compilation of facts and specific attributes about those named entities, rather than a set of best-matching Web pages.”