Archive for June, 2007

Scientific Contributions that Shaped NER

June 18, 2007

The Named Entity Recognition (NER) field was born over fifteen years ago. It is often suggested that Lisa F. Rau’s paper, “Extracting Company Names from Text” (1991), is the root of all NER work. The Message Understanding Conferences (MUC) are also noteworthy for coining much of the terminology (the expression “named entity recognition” itself; entity classes such as enamex, numex, and timex; etc.). For this reason, all MUC participants deserve credit for many structuring ideas. Even so, here’s a list of five scientific contributions that simply stand out from the crowd.

1993: McDonald’s internal and external evidence.

In his original paper (a version published in 1996 is also available), McDonald argues that NER needs both internal evidence (the structure of the name itself) and external evidence (the textual context surrounding the name). He also introduces the first NER paradigm, which is still in use today: Delimit, Classify, Record.

[ACL1][ACL2]
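
To make the Delimit, Classify, Record paradigm concrete, here is a minimal Python sketch. It is not McDonald's system: the tiny lexicons, the capitalization heuristic, and the function names are all assumptions made for illustration. Internal evidence is reduced to a company suffix inside the name, and external evidence to the single word on the left of the name.

```python
# Hypothetical toy lexicons, for illustration only.
ORG_SUFFIXES = {"Inc.", "Corp.", "Ltd."}   # internal evidence (part of the name)
PERSON_TITLES = {"Mr.", "Mrs.", "Dr."}     # external evidence (left context)
LOCATION_CUES = {"in", "at", "near"}       # external evidence (left context)


def delimit(tokens):
    """Delimit: return (start, end) spans of runs of capitalized tokens."""
    spans, start = [], None
    for i, tok in enumerate(tokens):
        if tok[0].isupper() and tok not in PERSON_TITLES:
            if start is None:
                start = i
        elif start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(tokens)))
    return spans


def classify(tokens, span):
    """Classify: try internal evidence first, then external evidence."""
    start, end = span
    if tokens[end - 1] in ORG_SUFFIXES:
        return "ORGANIZATION"
    previous = tokens[start - 1] if start > 0 else ""
    if previous in PERSON_TITLES:
        return "PERSON"
    if previous in LOCATION_CUES:
        return "LOCATION"
    return "UNKNOWN"


def recognize(tokens, record):
    """Record: remember classified names so later mentions can reuse the type."""
    results = []
    for start, end in delimit(tokens):
        name = " ".join(tokens[start:end])
        label = classify(tokens, (start, end))
        if label == "UNKNOWN":
            label = record.get(name, "UNKNOWN")
        else:
            record[name] = label
        results.append((name, label))
    return results


record = {}
first = recognize("yesterday Dr. Lisa Rau joined Acme Corp. in Boston".split(), record)
second = recognize("later Boston welcomed Lisa Rau".split(), record)
```

In the second sentence neither name has any local evidence, so both types come from the record built while reading the first sentence. A real system would also need to handle sentence-initial capitals and name variants, which this sketch ignores.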

1997: Bikel et al.’s HMM system.

Nymble is widely cited as the prototypical HMM-based system (a second NER paradigm). With a good set of features and sound HMM training and decoding, Nymble’s performance rivals human annotators’ precision on a specific corpus. Nymble is the foundation of the commercial BBN IdentiFinder system.

[Citeseer][ACM]
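
For readers who have never seen HMM decoding, here is a generic Viterbi sketch in Python. It is not Nymble's model: the two states, the "cap"/"lower" word features, and every probability below are made-up assumptions chosen only to show how the most likely sequence of labels is recovered.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for the observations."""
    # Each column maps state -> (best log-probability so far, previous state).
    best = [{
        s: (math.log(start_p[s]) + math.log(emit_p[s][observations[0]]), None)
        for s in states
    }]
    for obs in observations[1:]:
        column = {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p][0] + math.log(trans_p[p][s]))
            score = best[-1][prev][0] + math.log(trans_p[prev][s]) + math.log(emit_p[s][obs])
            column[s] = (score, prev)
        best.append(column)
    # Backtrack from the best final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]


# Hypothetical toy parameters (not taken from any trained system).
STATES = ("NAME", "OTHER")
START = {"NAME": 0.2, "OTHER": 0.8}
TRANS = {"NAME": {"NAME": 0.6, "OTHER": 0.4},
         "OTHER": {"NAME": 0.2, "OTHER": 0.8}}
EMIT = {"NAME": {"cap": 0.9, "lower": 0.1},
        "OTHER": {"cap": 0.1, "lower": 0.9}}

path = viterbi(["lower", "cap", "cap", "lower"], STATES, START, TRANS, EMIT)
# path == ["OTHER", "NAME", "NAME", "OTHER"]
```

In a trained system, the probabilities would be estimated from an annotated corpus and the states would be the name classes themselves.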

1997: Palmer and Day’s statistical profile of the task.

Palmer and Day’s paper is a valuable resource for any background work in NER. It addresses the crucial properties of names in text: the mean length of NEs; the relative proportion of basic types (enamex); the vocabulary transfer in a typical annotated corpus; and the baseline strategy for NER. The work covers six languages.

[Citeseer][ACL]
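
The kind of statistics listed above is easy to compute once you have lists of annotated names. The Python sketch below is my own simplified reading, not the paper's exact definitions: "transfer rate" is taken here as the share of distinct test names already seen in training, and the baseline simply tags any test mention that was memorized from training. The name lists are invented.

```python
def mean_length(names):
    """Average number of whitespace-separated tokens per name mention."""
    return sum(len(name.split()) for name in names) / len(names)


def transfer_rate(train_names, test_names):
    """Share of distinct test names that also occur in the training names."""
    train, test = set(train_names), set(test_names)
    return len(train & test) / len(test)


def baseline_recall(train_names, test_names):
    """Recall of a 'memorize the training names' baseline on test mentions."""
    known = set(train_names)
    return sum(1 for name in test_names if name in known) / len(test_names)


# Hypothetical name mentions, for illustration only.
train = ["Acme Corp.", "Boston", "Lisa Rau"]
test = ["Boston", "New York", "Lisa Rau", "Boston"]

print(mean_length(test))            # 1.5
print(transfer_rate(train, test))   # 2 of 3 distinct test names are known
print(baseline_recall(train, test)) # 0.75
```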

1999: Cucerzan and Yarowsky’s bootstrapping.

There is a paradigm shift towards unsupervised and semi-supervised techniques in the NER field, and in the machine-learning field in general. This work pioneered the idea by showing how a very small seed of exemplar NEs can be bootstrapped into large and precise NE lists, paired with contextual evidence.

[Citeseer][ACL]
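
The general bootstrapping idea can be sketched in a few lines of Python. This is a deliberately naive illustration, not Cucerzan and Yarowsky's algorithm: names are single tokens, a context is just the (left word, right word) pair, there is no scoring or filtering, and the corpus is invented.

```python
def bootstrap(sentences, seeds, rounds=3):
    """Grow a name list from seeds by alternating context and name harvesting."""
    names, contexts = set(seeds), set()
    for _ in range(rounds):
        # Step 1: collect the (left, right) contexts around known names.
        for tokens in sentences:
            for i in range(1, len(tokens) - 1):
                if tokens[i] in names:
                    contexts.add((tokens[i - 1], tokens[i + 1]))
        # Step 2: collect every token that occurs in a known context.
        found = {
            tokens[i]
            for tokens in sentences
            for i in range(1, len(tokens) - 1)
            if (tokens[i - 1], tokens[i + 1]) in contexts
        }
        if found <= names:  # nothing new was learned, so stop early
            break
        names |= found
    return names, contexts


# Hypothetical toy corpus, for illustration only.
corpus = [s.split() for s in [
    "the city of Paris is large",
    "the city of Lyon is old",
    "she visited Lyon last year",
    "she visited Quebec last year",
    "he ate bread last year",
]]

names, contexts = bootstrap(corpus, {"Paris"})
# names == {"Paris", "Lyon", "Quebec"}
```

Starting from "Paris" alone, the first round learns the context ("of", "is") and finds "Lyon"; the second round learns ("visited", "last") from "Lyon" and finds "Quebec". Without scoring, a single bad context would quickly pollute the list, which is why real bootstrapping methods rank contexts and candidates.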

2002: Sekine et al.’s type hierarchy.

Gone are the days when the world was divided into three NE types (person, location, organization). The real world is divided into hundreds of types, all with important fine-grained distinctions. Sekine et al.’s hierarchy was designed to reflect well-known thesauri divisions of proper names, as well as the current scope of NER systems and common NE types found in newspapers.

[Citeseer] [NYU]

The Rise of Named Entity-based Applications

June 10, 2007

Named Entity (NE) technology has gained momentum in the information world. Let’s take a look at two innovative applications:

NE-Powered News Aggregation

(actors: Daylife, TextMap, EMM NewsExplorer, Kipcast, etc.)

News aggregation dates back to early academic demos such as NewsBlaster and NewsInEssence. Then, two major commercial services were launched by a small company and a big player (see if you can tell which one is which): Topix.net and Google News.

The new generation of news aggregation platforms is powered by named entity recognition. NEs are the heart and soul of news: NEs are the who, the when, and the where. A lot of information can be analyzed using NEs. Just think about plotting the popularity of entities over time and generating geospatial heat maps (take a look at TextMap, for instance). However, the main improvement NEs bring to traditional news aggregation is the way they connect people and things.

Daylife is exactly about NE connections and, as a bonus, it has an elegant Web interface. For instance, there is news on the front page about Glaxo (a pharmaceutical company) defending the virtues of Avandia (a diabetes drug). In the connections, we find the FDA (government regulator), David Nathan (a diabetes specialist), Henry Waxman (the politician who announced the hearing), and so forth. It is an ultra-summary; a starting point for analysis; news extracted from its static media and connected to the world.

NE-Powered People Search

(actors: ZoomInfo, Spock, Wink, etc.)

Are you googling your colleagues, your date, your boss, your old friends, or even yourself? We are all doing it! And what we find is ambiguous. My homonym writes sick poems, my colleague was a governor in the late 1890s, and this guy I wanted to hire published weird photos on Flickr – or was that someone else?

NE-powered people searches are all about resolving this ambiguity, a problem known as “personal name disambiguation”. ZoomInfo clusters mentions of people and generates profiles specifying past employment, education, and geographic location. But let’s be honest, this is a very difficult problem! That’s why ZoomInfo often gets mixed up. Spock offers $50K to whoever can send them the best disambiguation algorithm.

As Aldous Huxley said, “the author of the Iliad is either Homer or, if not Homer, somebody else of the same name”.
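
For the curious, personal name disambiguation is often framed as a clustering problem: each mention of the same name comes with some context, and mentions with similar contexts are grouped into one person. The Python sketch below is a naive greedy version built on Jaccard overlap. It is not how ZoomInfo or Spock work (I have no inside knowledge of either); the threshold, the feature sets, and the function names are assumptions.

```python
def jaccard(a, b):
    """Overlap between two feature sets, from 0.0 (disjoint) to 1.0 (equal)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_mentions(mentions, threshold=0.3):
    """Greedily assign each mention to the most similar existing cluster."""
    clusters = []  # each cluster: {"members": [mention indices], "features": set}
    for index, features in enumerate(mentions):
        best = max(clusters, key=lambda c: jaccard(c["features"], features), default=None)
        if best is not None and jaccard(best["features"], features) >= threshold:
            best["members"].append(index)
            best["features"] |= features
        else:
            clusters.append({"members": [index], "features": set(features)})
    return [c["members"] for c in clusters]


# Hypothetical context words found around four mentions of the same name.
mentions = [
    {"poetry", "poems", "ottawa"},
    {"governor", "1890s", "politics"},
    {"poems", "ottawa", "readings"},
    {"governor", "politics", "election"},
]

groups = cluster_mentions(mentions)
# groups == [[0, 2], [1, 3]]: the poet and the governor are kept apart
```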

Named Entity Recognition Technologies Are Often Nameless

June 3, 2007

It is always an interesting paradox when experts in a given field live in contradiction with the principles that guide their expertise. The popular proverb “the shoemaker’s children are often shoeless” comes to mind. Recently, I was following a very rusty Econoline vehicle with an anti-rust company name written in big rust-tainted letters. I also used to work with an information-retrieval genius who worked in a cluttered cubicle among piles of paper.

During its four years of research and development, YooName remained nameless. At some point, it boasted five different temporary codenames, including NERF (Named Entity Recognition Framework). For obvious reasons, it would not have been a good idea to use this name on the market. Someone might have hit us on the head with a foam gun.

So we finally named our technology “YooName.” In terms of naming techniques, it is

  1. an amalgam of “Yoo” (pronounced “you”) and “Name,”
  2. a suggestive name that refers to the idea that you (the developer) get the power to identify named things in text, and
  3. a name in the associative field of the many Internet companies with two “Os” in their names.

YooName at DemoCamp Ottawa

June 2, 2007

YooName will make its first public appearance on June 18th, at DemoCamp Ottawa.
