NER Demos on the Web

March 8, 2008 by yooname

Here’s a list of demos for Named Entity Recognition technologies:

Are you aware of any other demos? Send us the links!

What is a Named Entity?

February 12, 2008 by yooname

To our surprise, when it comes to defining the task of Named Entity Recognition (NER), nobody seems to question including temporal expressions and measures. This probably deserves some historical consideration, since the domain was popularized by information extraction competitions where, clearly, the date and the money generated by the event were crucial. But we receive a lot of questions about the inclusion of some types, specifically those written as common nouns. Think about sports, minerals, or species. Should they be included in the task? What about genes and proteins that don’t refer to individual entities, but are often included as well?

It seems that anyone who tries to define the task eventually falls back on practical considerations, like filling templates and answering questions.

** Let’s try to sort things out and let’s fall back on practical considerations. **

We discovered 5 different criteria to determine the essence of named entities:

Orthographic criteria: named entities are usually capitalized. Capitalization rules for multiword proper nouns change from one language to the next (e.g., ‘House of Representatives’ vs ‘Chambre des communes’). In German, all nouns are capitalized. [source]

Translation criteria: named entities usually do not translate from one language to the next. However, the transcribed names of places, monarchs, popes, and non-contemporary authors often share spelling, and are sometimes universal. [source]

Generic/specific criteria: named entities usually refer to single individuals. A mention of “John Smith” refers to an individual, but the gene “P53-wt” and the product “Canon EOS Rebel Xti” each refer to multiple instances of an entity. [source]

Rigid designation criteria: named entities usually designate rigidly. Proper names and certain natural-kind terms, including biological taxa and types of natural substances (most famously “water” and “H2O”), are rigid designators. [source]

Information Extraction (IE) criteria: named entities fill some predefined “Who? What? Where? When?” template. This surely includes money, measure, date, time, proper names, and themes such as accident, murder, election, etc. [source]

Let’s take a closer look at some examples and the criteria they meet:

God: capitalized, translatable, single individual*, rigid, useful in IE

London: capitalized, translatable, single individual, rigid, useful in IE

John Smith: capitalized, not translatable, single individual, rigid, useful in IE

water: not capitalized, translatable, not a single individual, rigid, useful in IE

Miss America: capitalized, translatable, not a single individual, not rigid, useful in IE

the first Chancellor of the German Empire: not capitalized, translatable, single individual, not rigid*, useful in IE

Canon EOS Rebel Xti: capitalized, not translatable, not single individual, not rigid, useful in IE

iPhone: not capitalized*, not translatable, not single individual, not rigid, useful in IE

hockey: not capitalized, translatable, not a single individual, not rigid, useful in IE

10$: not capitalized, not translatable, not a single individual, not rigid, useful in IE

* Alright, it could be up for debate…

No single criterion accurately covers the named entity class. Capitalization is language-specific and sometimes falls short. Translatability is inconsistent. Specificity and rigid designation miss important types, such as money and product. The only criterion that encompasses them all is usefulness in information extraction, but it’s way too broad.
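To make the comparison concrete, the ten judgments listed above can be put in a small table and counted. This is only a sketch: the criterion names and the `coverage` helper are our own labels, and each boolean simply transcribes the verdict given in the list (the starred, debatable ones included).

```python
# Criterion names are illustrative labels for the five criteria above.
CRITERIA = ("orthographic", "translation", "specificity", "rigidity", "ie_usefulness")

# True means the example meets the criterion as judged in the list above.
# For "translation", True means "not translatable".
EXAMPLES = {
    "God": (True, False, True, True, True),
    "London": (True, False, True, True, True),
    "John Smith": (True, True, True, True, True),
    "water": (False, False, False, True, True),
    "Miss America": (True, False, False, False, True),
    "the first Chancellor of the German Empire": (False, False, True, False, True),
    "Canon EOS Rebel Xti": (True, True, False, False, True),
    "iPhone": (False, True, False, False, True),
    "hockey": (False, False, False, False, True),
    "10$": (False, True, False, False, True),
}


def coverage(examples):
    """Return, for each criterion, the fraction of examples that meet it."""
    total = len(examples)
    return {
        name: sum(row[i] for row in examples.values()) / total
        for i, name in enumerate(CRITERIA)
    }


def universal_criteria(examples):
    """Return the criteria met by every example."""
    return [name for name, share in coverage(examples).items() if share == 1.0]


print(coverage(EXAMPLES))
print(universal_criteria(EXAMPLES))
```

On these ten examples, only usefulness in IE is met every time, while each of the other four criteria is met by half of the examples or fewer.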

Our definition is a practical one. It stems from the way YooName works:

“The types recognized by NER are any sets of words that intersect with an NER type.”

This is ugly and circular, but it is practical!

We started by including Person, Location and Organization. These sets were ambiguous with products, songs, book titles, fruits, etc. So we’ve added these new sets. We expanded the number of types to 100, as guided by our definition. We calculated that less than 1% of the millions of entities we have are ambiguous with sets of words that are not handled so far. The problem is that this 1% is so diverse, we’ll need to add thousands of new types.
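The “intersecting sets” idea can be pictured with a few lines of code. This is a toy sketch: the type names and entries in `TYPE_LISTS` are made up for illustration and are not YooName’s actual lists.

```python
from itertools import combinations

# Hypothetical type lists; real lists would hold millions of entries.
TYPE_LISTS = {
    "location": {"France", "Paris", "Orange"},
    "first_name": {"France", "Paris"},
    "fruit": {"Orange", "Apple"},
    "organization": {"Apple", "Orange"},
}


def ambiguous_entries(type_lists):
    """Map each pair of types to the surface forms the two lists share."""
    overlaps = {}
    for (type_a, names_a), (type_b, names_b) in combinations(sorted(type_lists.items()), 2):
        shared = names_a & names_b
        if shared:
            overlaps[(type_a, type_b)] = shared
    return overlaps


for pair, shared in ambiguous_entries(TYPE_LISTS).items():
    print(pair, sorted(shared))
```

Under this view, a set such as “fruit” earns its place as a type simply because it shares entries with a type that is already handled.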

TextMap juxtaposition algorithm

February 2, 2008 by yooname

TextMap, the entity search engine, just published their juxtaposition algorithm.

The paper is dense in ideas, on top of being entertaining:

Concordance-Based Entity-Oriented Search, by Mikhail Bautin and Steven Skiena.

The algorithm, very roughly, goes as follows:

  1. Annotate every entity in every document;
  2. Extract all sentences containing an entity;
  3. Delete duplicate sentences corpus-wide (using MD5 hashing for duplicate detection);
  4. Use Lucene to index tuples [entity, concatenation of all sentences containing it];
  5. Use a special ranking function.

The search is conducted with a special scoring scheme (tf-idf minus sensitivity to document length), and the result of a query (e.g., ‘Montreal’) is a list of entities that are closely related to it (‘Montreal Canadiens’, ‘Saku Koivu’, etc.).
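The steps above can be sketched in a few lines of Python. This is a rough illustration under several assumptions, not the authors’ implementation: entity annotation is replaced by exact string matching, Lucene by an in-memory dictionary, and the paper’s ranking function by a plain tf-idf sum with no length normalization. All function names are ours.

```python
import hashlib
import math
import re
from collections import Counter, defaultdict


def split_sentences(text):
    """Rough splitter: assumes '.', '!' or '?' followed by whitespace ends a sentence."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(text):
    return re.findall(r"\w+", text.lower())


def build_concordances(documents, entities):
    """Steps 1 to 4: map each entity to the deduplicated sentences that mention it."""
    seen = set()
    concordances = defaultdict(list)
    for doc in documents:
        for sentence in split_sentences(doc):
            digest = hashlib.md5(sentence.encode("utf-8")).hexdigest()
            if digest in seen:  # step 3: drop corpus-wide duplicates
                continue
            seen.add(digest)
            for entity in entities:  # step 1 stand-in: exact string match
                if entity in sentence:
                    concordances[entity].append(sentence)
    return concordances


def search(query, concordances):
    """Step 5 stand-in: rank entities by a tf-idf sum over their concordance."""
    term_counts = {e: Counter(tokenize(" ".join(s))) for e, s in concordances.items()}
    n = len(term_counts)
    scores = {}
    for entity, counts in term_counts.items():
        score = 0.0
        for term in tokenize(query):
            df = sum(1 for c in term_counts.values() if term in c)
            if df and counts[term]:
                score += counts[term] * math.log(1 + n / df)
        if score > 0:
            scores[entity] = score
    return sorted(scores, key=scores.get, reverse=True)


DOCUMENTS = [
    "Saku Koivu scored twice in Montreal. The Montreal Canadiens won the game in Montreal.",
    "Saku Koivu scored twice in Montreal. Ottawa hosted a concert.",
]
ENTITIES = ["Montreal Canadiens", "Saku Koivu", "Ottawa"]

concordances = build_concordances(DOCUMENTS, ENTITIES)
print(search("Montreal", concordances))
```

Note that the repeated “Saku Koivu” sentence is indexed only once, which keeps widely syndicated text from inflating an entity’s score.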

Semi-Supervised Named Entity Recognition

December 16, 2007 by yooname

YooName originates from the PhD research titled Semi-Supervised Named Entity Recognition: Learning to Recognize 100 Entity Types with Little Supervision. The thesis was successfully defended at the University of Ottawa, Canada, and is now available online.

Here’s the abstract:

* * *

Named Entity Recognition (NER) aims to extract and to classify rigid designators in text such as proper names, biological species, and temporal expressions. There has been growing interest in this field of research since the early 1990s. In this thesis, we document a trend moving away from handcrafted rules, and towards machine learning approaches. Still, recent machine learning approaches have a problem with annotated data availability, which is a serious shortcoming in building and maintaining large-scale NER systems.

In this thesis, we present an NER system built with very little supervision. Human supervision is indeed limited to listing a few examples of each named entity (NE) type. First, we introduce a proof-of-concept semi-supervised system that can recognize four NE types. Then, we expand its capacities by improving key technologies, and we apply the system to an entire hierarchy comprised of 100 NE types.

Our work makes the following contributions: the creation of a proof-of-concept semi-supervised NER system; the demonstration of an innovative noise filtering technique for generating NE lists; the validation of a strategy for learning disambiguation rules using automatically identified, unambiguous NEs; and finally, the development of an acronym detection algorithm, thus solving a rare but very difficult problem in alias resolution.

We believe semi-supervised learning techniques are about to break new ground in the machine learning community. In this thesis, we show that limited supervision can build complete NER systems. On standard evaluation corpora, we report performances that compare to baseline supervised systems in the task of annotating NEs in texts.

If YooName Doesn’t Know It, It Doesn’t Exist

November 20, 2007 by yooname

A demo log entry attracted our attention:

“I talked with Frfrfrf yesterday.”

This made-up sentence was clearly sent by a researcher in the field. This is one way to test an important feature of a Named Entity Recognition (NER) system, and to determine the paradigm behind it.

NER systems are either based on “lists&rules” or on sequence labelling algorithms. Take this sentence, for example:

“I talked with France yesterday.”

The first system requires massive lists of entities along with contextual rules to resolve ambiguity. Using lists&rules, “France” is matched to entries in lists for countries and first names. A disambiguation rule will assign the right type, using contextual cues such as “talked.” This strategy works well with “France” but doesn’t handle “Frfrfrf,” as it is not in the vocabulary.
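A minimal sketch of the lists&rules behaviour might look like this. The gazetteer entries, the cue words and the `classify` function are all hypothetical; they only illustrate the lookup-then-disambiguate pattern, not YooName’s actual rules.

```python
# Hypothetical gazetteer: surface form -> set of candidate types.
GAZETTEER = {
    "France": {"country", "first_name"},
    "London": {"city"},
}

# Hypothetical contextual cue: a verb of communication shortly before the
# name suggests a person rather than a place.
PERSON_CUES = {"talked", "spoke", "chatted"}


def classify(tokens, index):
    """Return the entity type of tokens[index], or None if it is in no list."""
    candidates = GAZETTEER.get(tokens[index])
    if not candidates:
        return None  # out of vocabulary: the system has nothing to say
    if len(candidates) == 1:
        return next(iter(candidates))
    context = {t.lower() for t in tokens[max(0, index - 2):index]}
    if "first_name" in candidates and context & PERSON_CUES:
        return "first_name"
    return sorted(candidates)[0]  # arbitrary but deterministic fallback


print(classify("I talked with France yesterday".split(), 3))
print(classify("I talked with Frfrfrf yesterday".split(), 3))
```

The second call returns `None` because “Frfrfrf” never reaches the disambiguation rule: without a list entry, there are no candidate types to choose from.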

The second system relies on learned probabilities of word sequences and their inner features. Using sequence labelling, the word “France” is assigned a probability of a given type using features such as capitalization, presence in lists of entities, prefix, suffix, and all contextual word features. List lookup is only optional, though it is recognized as a very good feature. This strategy works well with “France.” It can also handle unknown words such as “Frfrfrf” if the contextual cues are strong enough. This is exactly what happens with the LingPipe system, which annotates “Frfrfrf” as a person.
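The kind of features described here can be sketched as follows. This is not LingPipe’s API: `token_features` and `person_score` are hypothetical helpers, and the weights are set by hand purely for illustration, whereas a real tagger would learn them from annotated text.

```python
def token_features(tokens, index, gazetteer=frozenset()):
    """Extract simple word-level and contextual features for one token."""
    word = tokens[index]
    prev_word = tokens[index - 1].lower() if index > 0 else "<s>"
    next_word = tokens[index + 1].lower() if index + 1 < len(tokens) else "</s>"
    return {
        "is_capitalized": word[:1].isupper(),
        "in_gazetteer": word in gazetteer,  # optional list lookup
        "prefix3": word[:3].lower(),
        "suffix3": word[-3:].lower(),
        "prev_word": prev_word,
        "next_word": next_word,
    }


# Hand-set weights, for illustration only.
PERSON_WEIGHTS = {
    ("is_capitalized", True): 1.0,
    ("prev_word", "with"): 1.5,
    ("in_gazetteer", True): 0.5,
}


def person_score(features):
    """Sum the weights of the features that fire for this token."""
    return sum(
        weight
        for (name, value), weight in PERSON_WEIGHTS.items()
        if features.get(name) == value
    )


tokens = "I talked with Frfrfrf yesterday".split()
features = token_features(tokens, 3)
print(features)
print(person_score(features))
```

Even with no list entry, “Frfrfrf” still collects evidence from its capitalization and from the preceding word, which is why an unknown word can end up labelled as a person.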

YooName is based on finite lists and rules. Therefore, if YooName doesn’t know it, it doesn’t exist. YooName constantly scouts the Web, searching for new entities. So far, it has never found “Frfrfrf.” We believe that if it runs long enough, it may end up knowing every single named entity out there.

The Same Antique Web

November 12, 2007 by yooname

It’s in the air.

We’re flooded by catchy phrases announcing it.

It’s all about semantics, AI and Web 3.0:

“Web3 is closer than you think!”
“You ain’t seen nothing yet!”
“Web as artificial intelligence supplanting human race!”

Some years ago, “you” were the superstar of Web 2.0 and its social networks.

In the late ’90s, the dot-com boom had everything going Web-based, from grocery delivery to movie rentals. It was also when Google made its debut.

Overall, it’s difficult to say if these Web movements were successful or if the whole thing was a waste of time and money. Judging from Web 1.0 Google and Web 2.0 Wikipedia, I’d say humanity has a positive balance.

Web 3.0 should be just as disruptive as its predecessors. There will be tons of new companies bridging natural language technologies and current Web content to provide what we could call “semantic hyperlinks.” If just one of these companies can find a way to resolve ambiguity, which accounts for 50% of everything we write, it will totally change the face of the Web. Right now, we must realize that all that noise and irrelevance in search engine hit lists is abnormal.

In the end, “versioning” is just part of the Web hype. What really matters is what’s always been fundamental about the Web:

People – a lot of people – sharing content.

Combining NEs with Social Networks

October 22, 2007 by yooname

What happens when you combine Named Entity Recognition with Social Networks? Do they “blend”? We may have some insight when Twine reveals its platform. Until then, O’Reilly Radar provides some information on the idea.

Synchronicity

October 11, 2007 by yooname

Google and Microsoft are both active in the Named Entity Recognition (NER) field, and more notably, in Named Entity Disambiguation. This task consists of “disambiguating between multiple named entities that can be denoted by the same proper name” (Bunescu and Pasca 2006). For instance, politicians, Internet entrepreneurs and criminals share the name of James Clark. And yes, these are all distinct entities.

Well-known NER researchers at Google and Microsoft published the following papers:

These are two very nice pieces of work that deserve an attentive read. What motivates this research is clear:

“A frequent case are queries about named entities, which constitute a significant fraction of popular Web queries according to search engine logs. When submitting queries such as John Williams or Python, search engine users could also be presented with a compilation of facts and specific attributes about those named entities, rather than a set of best-matching Web pages.”

The Need for a Prescriptive Ontology

October 9, 2007 by yooname

A great deal of effort is invested in universities, research labs and companies to create prescriptive ontologies. Just think about large-scale projects such as Cyc/OpenCyc or smaller projects built around OWL.

I use the term “prescriptive” to emphasize the fact that ontologies are usually defined in a hard-coded and formal manner. Let’s use the “Hotel” type, for example. Elements of this hypothetical ontology are capitalized and relations are in angle brackets:

| Hotel <is a> Building

| Hotel <is located in> City, State/Province, Country

| Hotel <is located near> Attractions

| Hotel <offers service> Parking, Pool, Gym, InternetAccess, etc.

| Hotel <has parts> HotelRoom

| HotelRoom <price> MoneyQuantity

| HotelRoom <rent> TimePeriod
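To see what “hard-coded” means in practice, the relations above can be written down as data and checked against. This is only a sketch: the `Relation` tuple and the `allowed` helper are our own illustrative names, and the open-ended “etc.” in the service list is dropped, so only the four named services are encoded.

```python
from collections import namedtuple

Relation = namedtuple("Relation", "subject predicate objects")

# Direct transcription of the hypothetical Hotel ontology above.
HOTEL_ONTOLOGY = [
    Relation("Hotel", "is a", ("Building",)),
    Relation("Hotel", "is located in", ("City", "State/Province", "Country")),
    Relation("Hotel", "is located near", ("Attractions",)),
    Relation("Hotel", "offers service", ("Parking", "Pool", "Gym", "InternetAccess")),
    Relation("Hotel", "has parts", ("HotelRoom",)),
    Relation("HotelRoom", "price", ("MoneyQuantity",)),
    Relation("HotelRoom", "rent", ("TimePeriod",)),
]


def allowed(subject, predicate, obj):
    """Return True if the ontology explicitly permits this statement."""
    return any(
        r.subject == subject and r.predicate == predicate and obj in r.objects
        for r in HOTEL_ONTOLOGY
    )


print(allowed("Hotel", "offers service", "Pool"))
print(allowed("Hotel", "offers service", "PetCare"))
```

Anything not listed, such as a pet care service, is simply rejected until someone edits the ontology by hand.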

 

The majority of semantic Web developers would agree this ontology is quite handy in the development of a hotel-related-semantic-web-2.0 application. But is it always handy? For how long? And is it really necessary?

Is it always handy?

Is this prototypal Hotel representative of all hotels? Clearly not. What about an ice hotel that melts (we would need a start and end date, as we do with an event)? What about cultural, local and special services (e.g., pet care, special shuttles, places of worship)? We can argue that there will always be a hotel with atypical characteristics.

For how long?

How long can these relations remain valid and when will new relations develop? Take the smart phone example. The first ontology for portable phones probably had no place for features such as “mail client, Internet browser, music player.” What about the recent trend of “boutique hotels?” Does our ontology represent it? What modifications must be made now and in the future?

Is it really necessary?

That’s the real question. Are prescriptive ontologies really necessary? What if we try to develop a semantic Web application without such an ontology? Could a descriptive/soft/bottom-up/empirical ontology be sufficient?

To return to the original scenario, let’s imagine an information extraction system that crawls the Web to try to fill the Hotel template for “City, Attractions and Parking.” Using a prescriptive ontology, we can literally attach a pattern to these slots and hope it will work well. With the help of good programmers, we can be sure the problem will be elegantly resolved, and with high accuracy. The advantage here is the predictability of a template-filling task. The disadvantage is the ontology’s incomplete nature and the maintenance it requires over time. Chances are that new features will simply go unnoticed, and this new ice hotel with sleigh-only parking will not be adequately represented in the current model.

In machine learning, the idea of a descriptive ontology and that of clustering are analogous. Instead of starting with a sharp definition of the world, we invest the time of our good programmers in identifying pages on hotels and clustering information in order to find typical patterns. Not surprisingly, the word “parking” would appear with common co-occurring words such as “not available,” “free,” and “$25 a day.” Moreover, other named entities such as city, museum and monument would also co-occur frequently. We can imagine quickly generating a template containing these frequent and distinctive elements. As time goes by, new features may become prominent, indicating that maintenance is required. The advantage of this empirical ontology is the boundary-free description of entity relations. The disadvantage is a higher noise potential and concept drift that would require manual post-editing.
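The co-occurrence idea can be illustrated with a simple window count, which is a much cruder stand-in for real clustering. The sample pages, the `cooccurring_terms` function and the window size are all assumptions made for this sketch.

```python
import re
from collections import Counter


def cooccurring_terms(pages, target, window=2):
    """Count the words found within `window` tokens of each mention of `target`."""
    counts = Counter()
    for page in pages:
        tokens = re.findall(r"[\w$]+", page.lower())
        for i, token in enumerate(tokens):
            if token == target:
                neighbours = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
                counts.update(neighbours)
    return counts


# Made-up snippets standing in for crawled hotel pages.
PAGES = [
    "Free parking is available for guests.",
    "Parking: $25 a day, pool and gym included.",
    "Parking not available; free shuttle to the museum.",
]

counts = cooccurring_terms(PAGES, "parking")
print(counts.most_common(3))
```

The most frequent neighbours suggest which values a “Parking” slot tends to take, without anyone having declared that slot in advance. The same counts also surface function words such as “is” and “a”, a small taste of the noise mentioned above.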

Recent interest and successes in unsupervised learning techniques suggest the second option, or a combination of both options, is viable and promising.

Bootstrapped learning beats AI

October 2, 2007 by yooname

The EE Times has a story about a program called “Bootstrapped Learning,” developed by DARPA:

“The goal of Bootstrapped Learning (BL) is to develop an ‘electronic student’ that can be taught complex concepts incrementally over a very wide range of problem domains, without designing domain knowledge into algorithms.”

This kind of algorithm is strongly related to semi-supervised and active learning. Even when conventional AI outperforms them, these approaches boast one important benefit. When a system learns incrementally with little human knowledge, it essentially maintains itself. Or, it can at least be maintained or extended at a very low cost.
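To give a feel for learning from very little human input, here is a toy bootstrapping loop of the kind used in semi-supervised entity list building. It is not the DARPA program, and it is far simpler than any real system: the corpus, the single seed and the `bootstrap` function are all made up, and a “pattern” is just the one word preceding a known entity.

```python
import re


def bootstrap(corpus, seeds, rounds=2):
    """Grow an entity list from seeds by alternating pattern and entity discovery."""
    known = set(seeds)
    for _ in range(rounds):
        # 1. The word right before each known entity becomes a context pattern.
        patterns = set()
        for sentence in corpus:
            for entity in known:
                match = re.search(r"(\w+) " + re.escape(entity) + r"\b", sentence)
                if match:
                    patterns.add(match.group(1))
        # 2. Any capitalized word following a pattern becomes a new entity.
        for sentence in corpus:
            for pattern in patterns:
                for match in re.finditer(r"\b" + re.escape(pattern) + r" ([A-Z]\w+)", sentence):
                    known.add(match.group(1))
    return known


CORPUS = [
    "We flew to Ottawa in May.",
    "They flew to Montreal last week.",
    "She moved to Toronto.",
    "He lives near Toronto and near Quebec.",
]

print(sorted(bootstrap(CORPUS, {"Ottawa"}, rounds=1)))
print(sorted(bootstrap(CORPUS, {"Ottawa"}, rounds=2)))
```

The only human input is the single seed “Ottawa”. The first round picks up the cities that follow “to”, and the second round learns “near” from “Toronto” and uses it to find “Quebec”. The same mechanism can just as easily pick up noise, which is why filtering matters in practice.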