
Pushing Automation a Step Forward

December 13, 2008

(and hope not to fall off the cliff)

* * *

I recently worked on an implementation of the ‘Stacked Skews Model’, an algorithm proposed by Andrew Carlson and Charles Schafer.

The idea is to train a web page wrapper induction algorithm (let’s call that a ‘wrapper’) to extract information, using a small number of already trained wrappers for sites in the same domain. For instance, if you already have four wrappers on hand for hotel booking web sites, you can use them to bootstrap new wrappers for virtually any hotel booking web site out there.


sample web page wrapper annotations

What’s clever in Carlson and Schafer’s solution is how it overcomes the lack of annotated examples, given the huge search space for such a problem: it works on feature distributions and distribution divergences instead of relying directly on surface evidence. In other words, when the system learns what the name of a hotel is, it learns how each feature is distributed and how closely a candidate must match that distribution (e.g., hotel name length is around 20 characters, hotel names often contain the token ‘hotel’ or ‘resort’, etc.). It is basically equivalent to creating one classifier for each feature and, as the authors suggest, stacking them using linear regression.
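To make the idea of comparing feature distributions concrete, here is a minimal sketch for a single feature (string length). The sample data, the bucket width, the additive smoothing and the choice of KL divergence are all assumptions made for illustration, not necessarily the formulation used in the paper, and the stacking step (linear regression over many such scores) is left out.

```python
import math
from collections import Counter

def length_bucket(text, width=5, max_bucket=9):
    """Map a string to a coarse length bucket (0-4 chars -> 0, 5-9 -> 1, ...)."""
    return min(len(text) // width, max_bucket)

def distribution(values, support, alpha=1.0):
    """Additively smoothed probability distribution of `values` over `support`."""
    counts = Counter(values)
    total = len(values) + alpha * len(support)
    return {v: (counts[v] + alpha) / total for v in support}

def kl_divergence(p, q):
    """KL(p || q) for two distributions defined on the same support."""
    return sum(p[k] * math.log(p[k] / q[k]) for k in p)

# Hypothetical hotel names already extracted by trained wrappers.
known_names = ["Grand Plaza Hotel", "Seaside Resort and Spa",
               "Hotel Bellevue", "Mountain View Resort"]

# Two hypothetical candidate fields found on a new, unseen site.
candidate_names = ["Hotel du Parc", "Royal Garden Resort", "Lakeside Hotel"]
candidate_prices = ["$120", "$89", "$240"]

support = range(10)
reference = distribution([length_bucket(s) for s in known_names], support)

def skew(candidate_values):
    """Divergence between a candidate field and the reference distribution."""
    observed = distribution([length_bucket(s) for s in candidate_values], support)
    return kl_divergence(observed, reference)

name_skew = skew(candidate_names)
price_skew = skew(candidate_prices)
print(f"names: {name_skew:.3f}  prices: {price_skew:.3f}")
```

The candidate field with the smaller divergence looks more like a hotel name as far as this one feature is concerned. A full system would compute one such score per feature and learn how to weight them.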

My implementation didn’t exactly work as advertised, which is normal ;) Even if stacked models reduce the feature space and diminish overfitting, the problem is still enormous and one or two features tend to dominate the stack. However, I made some important progress by playing around with the published ideas.

First, do connect to an ontology. OK, I’m not a big fan of ontological features and only use them as a last resort, but here it made a real difference. When wrapping hotel web sites, connect to the WordNet synset ‘hotel’ and use all its synonyms and related words as features.
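As a rough illustration of what an ontological feature can look like, the sketch below checks text against a small word list. The list is a hand-written stand-in for whatever one would actually collect from the WordNet synset, and the feature names are hypothetical.

```python
# Hand-written stand-in for words gathered from the WordNet synset 'hotel'
# (synonyms and related words). This is NOT real WordNet output.
HOTEL_LEXICON = {"hotel", "resort", "inn", "lodge", "motel", "hostel"}

def ontology_features(text, lexicon=HOTEL_LEXICON):
    """Return simple lexicon-overlap features for a piece of text."""
    tokens = [t.strip(".,;:!?()").lower() for t in text.split()]
    hits = [t for t in tokens if t in lexicon]
    return {
        "has_lexicon_word": bool(hits),
        "lexicon_word_ratio": len(hits) / len(tokens) if tokens else 0.0,
    }

print(ontology_features("Seaside Resort and Spa"))
print(ontology_features("Free parking"))
```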

Also, do use DOM tree features. In their article, Carlson and Schafer limit the learning to features on textual information (the current node text and the previous node text). However, the DOM tree is very useful here. For instance, desirable information tends to be deep in the tree and almost juxtaposed. Also, a hotel name is more likely to be in its own HTML tag (bold, header, etc.) while amenities are often enumerated (lists, tables, etc.).
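Here is a minimal sketch of how such DOM features could be collected with Python's standard `html.parser`. The feature names (`depth`, `parent_tag`, `in_list_or_table`) and the sample page are assumptions for illustration, not the features I actually used.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}

class TextNodeFeatures(HTMLParser):
    """Record depth and enclosing tags for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.nodes = []   # one feature dict per text node

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop up to and including the matching open tag (tolerates sloppy HTML).
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append({
                "text": text,
                "depth": len(self.stack),
                "parent_tag": self.stack[-1] if self.stack else None,
                "in_list_or_table": any(
                    t in ("ul", "ol", "table") for t in self.stack),
            })

# Hypothetical page fragment.
page = ("<html><body><div><h1>Hotel Bellevue</h1>"
        "<ul><li>Pool</li><li>Free WiFi</li></ul></div></body></html>")

parser = TextNodeFeatures()
parser.feed(page)
for node in parser.nodes:
    print(node)
```

On this fragment the hotel name sits alone in a header tag, while the amenities are flagged as list content, which is exactly the kind of contrast described above.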

Finally, in order to reduce overfitting further, I split the feature space into independent groups and applied a voting scheme over the ensemble.
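A voting scheme over feature groups can be sketched as follows. The three voters, their thresholds and the simple majority rule are assumptions for illustration; each voter only looks at its own group of features.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most frequent label (ties go to the label seen first)."""
    return Counter(predictions).most_common(1)[0][0]

# One hypothetical classifier per independent feature group.
def text_vote(node):
    return "name" if "hotel" in node["text"].lower() else "other"

def length_vote(node):
    return "name" if 10 <= len(node["text"]) <= 30 else "other"

def dom_vote(node):
    return "name" if node["parent_tag"] in ("h1", "h2", "b", "strong") else "other"

VOTERS = [text_vote, length_vote, dom_vote]

def classify(node):
    """Label a text node by majority vote over the feature groups."""
    return majority_vote([vote(node) for vote in VOTERS])

print(classify({"text": "Hotel Bellevue", "parent_tag": "h1"}))
print(classify({"text": "Grand Plaza", "parent_tag": "h1"}))
print(classify({"text": "Pool", "parent_tag": "li"}))
```

Because no single group can decide alone, a dominant feature in one group can be outvoted by the others.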


What is a Named Entity?

February 12, 2008

To our surprise, when it comes to defining the task of Named Entity Recognition (NER), nobody seems to question including temporal expressions and measures. This probably deserves some historical consideration, since the domain was popularized by information extraction competitions where, clearly, the date and the money generated by the event were crucial. But we receive a lot of questions about the inclusion of some types, specifically those written as common nouns. Think about sports, minerals, or species. Should they be included in the task? What about genes and proteins that don’t refer to individual entities, but are often included as well?

It seems that anyone who tries to define the task eventually falls back on practical considerations, like filling templates and answering questions.

**Let’s try to sort things out, and then fall back on practical considerations.**

We identified five different criteria to determine the essence of named entities:

Orthographic criteria: named entities are usually capitalized. Capitalization rules for multiword proper nouns change from one language to the next (e.g., ‘House of Representatives’ vs ‘Chambre des communes’). In German, all nouns are capitalized. [source]

Translation criteria: named entities usually do not translate from one language to the next. However, the transcribed names of places, monarchs, popes, and non-contemporary authors often share spelling, and are sometimes universal. [source]

Generic/specific criteria: named entities usually refer to single individuals. A mention of “John Smith” refers to an individual, but the gene “P53-wt” or the product “Canon EOS Rebel Xti” refer to multiple instances of entities. [source]

Rigid designation criteria: named entities usually designate rigidly. Proper names and certain natural-kind terms, including biological taxa and types of natural substances (most famously “water” and “H2O”), are rigid designators. [source]

Information Extraction (IE) criteria: named entities fill some predefined “Who? What? Where? When?” template. This surely includes money, measure, date, time, proper names, and themes such as accident, murder, election, etc. [source]

Let’s take a closer look at some examples and the criteria they meet:

God: capitalized, translatable, single individual*, rigid, useful in IE

London: capitalized, translatable, single individual, rigid, useful in IE

John Smith: capitalized, not translatable, single individual, rigid, useful in IE

water: not capitalized, translatable, not a single individual, rigid, useful in IE

Miss America: capitalized, translatable, not a single individual, not rigid, useful in IE

the first Chancellor of the German Empire: not capitalized, translatable, single individual, not rigid*, useful in IE

Canon EOS Rebel Xti: capitalized, not translatable, not a single individual, not rigid, useful in IE

iPhone: not capitalized*, not translatable, not a single individual, not rigid, useful in IE

hockey: not capitalized, translatable, not a single individual, not rigid, useful in IE

10$: not capitalized, not translatable, not a single individual, not rigid, useful in IE

* Alright, it could be up for debate…

No single criterion accurately covers the named entity class. Capitalization is language-specific and sometimes falls short. Translatability is inconsistent. Specificity and rigid designation miss important types, such as money and products. The only criterion that encompasses them all is usefulness in information extraction, but it’s way too broad.
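The comparison can be checked mechanically by writing the examples down as data. The sketch below copies the values from the list above as they are (the starred ones remain debatable) and asks which property is shared by every example; the property names are my own labels.

```python
PROPERTIES = ("capitalized", "translatable", "single_individual",
              "rigid", "useful_in_ie")

# One row per example, in the same order as PROPERTIES.
EXAMPLES = {
    "God":                 (True,  True,  True,  True,  True),
    "London":              (True,  True,  True,  True,  True),
    "John Smith":          (True,  False, True,  True,  True),
    "water":               (False, True,  False, True,  True),
    "Miss America":        (True,  True,  False, False, True),
    "the first Chancellor of the German Empire":
                           (False, True,  True,  False, True),
    "Canon EOS Rebel Xti": (True,  False, False, False, True),
    "iPhone":              (False, False, False, False, True),
    "hockey":              (False, True,  False, False, True),
    "10$":                 (False, False, False, False, True),
}

def shared_by_all(examples):
    """Return the properties that hold for every example."""
    return [name for i, name in enumerate(PROPERTIES)
            if all(row[i] for row in examples.values())]

def count_per_property(examples):
    """Count how many examples have each property."""
    return {name: sum(row[i] for row in examples.values())
            for i, name in enumerate(PROPERTIES)}

print(shared_by_all(EXAMPLES))
print(count_per_property(EXAMPLES))
```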

Our definition is a practical one. It stems from the way YooName works:

“The types recognized by NER are any sets of words that intersect with an NER type.”

This is ugly and circular, but it is practical!

We started by including Person, Location and Organization. These sets were ambiguous with products, songs, book titles, fruits, etc., so we added these new sets. Guided by our definition, we expanded the number of types to 100. We calculated that less than 1% of the millions of entities we have are ambiguous with sets of words that are not handled so far. The problem is that this 1% is so diverse that we’ll need to add thousands of new types.
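The circular definition can be read as a fixed-point expansion: start from a few seed types and keep adding any set of words that shares an entry with a type already handled. The sketch below is only an illustration of that reading, with tiny made-up lexicons; it is not how YooName is implemented.

```python
# Tiny hypothetical lexicons; real ones would hold millions of entries.
LEXICONS = {
    "Person":       {"madonna", "jordan", "washington"},
    "Location":     {"jordan", "washington", "paris", "orange"},
    "Organization": {"apple", "amazon"},
    "Fruit":        {"apple", "orange", "kiwi"},
    "Bird":         {"kiwi", "robin"},
    "Mineral":      {"quartz", "mica"},
}

def expand_types(seed_types, lexicons):
    """Add every lexicon that intersects a handled type, until nothing changes."""
    handled = set(seed_types)
    changed = True
    while changed:
        changed = False
        for name, words in lexicons.items():
            if name in handled:
                continue
            if any(words & lexicons[h] for h in handled):
                handled.add(name)
                changed = True
    return handled

types = expand_types({"Person", "Location", "Organization"}, LEXICONS)
print(sorted(types))
```

Here Fruit is pulled in because it overlaps Organization and Location, Bird follows because it overlaps Fruit, and Mineral stays out because it is ambiguous with nothing handled so far.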