What is a Named Entity?

To our surprise, when it comes to defining the task of Named Entity Recognition (NER), nobody seems to question including temporal expressions and measures. This probably deserves some historical consideration, since the domain was popularized by information extraction competitions where, clearly, the date and the money generated by the event were crucial. But we receive a lot of questions about the inclusion of some types, specifically those written as common nouns. Think about sports, minerals, or species. Should they be included in the task? What about genes and proteins that don’t refer to individual entities, but are often included as well?

It seems that anyone who tries to define the task eventually falls back on practical considerations, like filling templates and answering questions.

** Let’s try to sort things out and let’s fall back on practical considerations. **

We discovered 5 different criteria to determine the essence of named entities:

Orthographic criteria: named entities are usually capitalized. Capitalization rules for multiword proper nouns change from one language to the next (e.g., ‘House of Representatives’ vs ‘Chambre des communes’). In German, all nouns are capitalized. [source]

Translation criteria: named entities usually do not translate from one language to the next. However, the transcribed names of places, monarchs, popes, and non-contemporary authors often share spelling, and are sometimes universal. [source]

Generic/specific criteria: named entities usually refer to single individuals. A mention of “John Smith” refers to an individual, but the gene “P53-wt” or the product “Canon EOS Rebel Xti” refers to multiple instances of an entity. [source]

Rigid designation criteria: named entities usually designate rigidly. Proper names and certain natural kind terms, including biological taxa and types of natural substances (most famously “water” and “H2O”), are rigid designators. [source]

Information Extraction (IE) criteria: named entities fill some predefined “Who? What? Where? When?” template. This surely includes money, measure, date, time, proper names, and themes such as accident, murder, election, etc. [source]
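The orthographic criterion is the easiest one to try out, and doing so shows quickly where it falls short. Here is a minimal sketch, assuming a naive word tokenizer and a hypothetical helper named `capitalized_candidates` (neither is part of YooName), that simply groups consecutive capitalized words:

```python
import re


def capitalized_candidates(sentence: str) -> list[str]:
    """Group runs of consecutive capitalized words into candidate mentions.

    Illustration only: it knows nothing about sentence-initial words,
    function words inside names, or language-specific capitalization rules.
    """
    tokens = re.findall(r"\w+", sentence)
    candidates: list[str] = []
    current: list[str] = []
    for token in tokens:
        if token[0].isupper():
            current.append(token)
        elif current:
            # A lowercase word closes the current run.
            candidates.append(" ".join(current))
            current = []
    if current:
        candidates.append(" ".join(current))
    return candidates


print(capitalized_candidates("John Smith lives in London"))
# ['John Smith', 'London']
print(capitalized_candidates("the House of Representatives"))
# ['House', 'Representatives']
print(capitalized_candidates("Der Hund trinkt Wasser in London"))
# ['Der Hund', 'Wasser', 'London']
```

The English sentence works, the multiword name is split at the lowercase “of”, and the German sentence returns common nouns alongside the one real name.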

Let’s take a closer look at some examples and the criteria they meet:

God: capitalized, translatable, single individual*, rigid, useful in IE

London: capitalized, translatable, single individual, rigid, useful in IE

John Smith: capitalized, not translatable, single individual, rigid, useful in IE

water: not capitalized, translatable, not a single individual, rigid, useful in IE

Miss America: capitalized, translatable, not a single individual, not rigid, useful in IE

the first Chancellor of the German Empire: not capitalized, translatable, single individual, not rigid*, useful in IE

Canon EOS Rebel Xti: capitalized, not translatable, not single individual, not rigid, useful in IE

iPhone: not capitalized*, not translatable, not single individual, not rigid, useful in IE

hockey: not capitalized, translatable, not a single individual, not rigid, useful in IE

10$: not capitalized, not translatable, not a single individual, not rigid, useful in IE

* Alright, it could be up for debate…

No single criterion accurately covers the named entity class. Capitalization is language-specific and sometimes falls short. Translatability is inconsistent. Specificity and rigid designation miss important types, such as money and product. The only criterion that encompasses them all is usefulness in information extraction, but it’s way too broad.
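To make that tally explicit, the ten examples above can be written down as data and counted. This is only a sketch: the `Example` record and the two helper functions are our own illustrative names, and the starred (debatable) judgments are taken at face value.

```python
from typing import NamedTuple


class Example(NamedTuple):
    mention: str
    capitalized: bool
    translatable: bool
    single_individual: bool
    rigid: bool
    useful_in_ie: bool


# Values copied from the list above; starred entries are taken as written.
EXAMPLES = [
    Example("God", True, True, True, True, True),
    Example("London", True, True, True, True, True),
    Example("John Smith", True, False, True, True, True),
    Example("water", False, True, False, True, True),
    Example("Miss America", True, True, False, False, True),
    Example("the first Chancellor of the German Empire", False, True, True, False, True),
    Example("Canon EOS Rebel Xti", True, False, False, False, True),
    Example("iPhone", False, False, False, False, True),
    Example("hockey", False, True, False, False, True),
    Example("10$", False, False, False, False, True),
]

PROPERTIES = Example._fields[1:]  # every field except the mention itself


def coverage(examples):
    """Count, for each property, how many examples have it."""
    return {p: sum(getattr(e, p) for e in examples) for p in PROPERTIES}


def universal_properties(examples):
    """Return the properties shared by every example."""
    return [p for p in PROPERTIES if all(getattr(e, p) for e in examples)]


print(coverage(EXAMPLES))
# {'capitalized': 5, 'translatable': 6, 'single_individual': 4, 'rigid': 4, 'useful_in_ie': 10}
print(universal_properties(EXAMPLES))
# ['useful_in_ie']
```

Only usefulness in IE holds for all ten examples. Reading the translation criterion the other way round (named entities do not translate) does no better, since just four of the ten are marked not translatable.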

Our definition is a practical one. It stems from the way YooName works:

“The types recognized by NER are any sets of words that intersect with an NER type.”

This is ugly and circular, but it is practical!

We started by including Person, Location and Organization. These sets were ambiguous with products, songs, book titles, fruits, etc. So we’ve added these new sets. We expanded the number of types to 100, as guided by our definition. We calculated that less than 1% of the millions of entities we have are ambiguous with sets of words that are not handled so far. The problem is that this 1% is so diverse, we’ll need to add thousands of new types.
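The circular definition reads naturally as a fixed-point computation: keep adding any set of words that overlaps a type already included, until nothing new overlaps. The sketch below is an illustration under our own assumptions, not YooName’s actual code; the function name `expand_types` and the toy lexicons are hypothetical, and words are lowercased so that “Apple” and “apple” collide.

```python
def expand_types(seed_types, lexicons):
    """Return every type reachable from the seeds through shared words.

    `lexicons` maps a type name to its set of (lowercased) words. A type is
    added as soon as it shares a word with any type already included.
    """
    included = set(seed_types)
    changed = True
    while changed:
        changed = False
        # All words covered by the types included so far.
        covered = set().union(*(lexicons[name] for name in included))
        for name, words in lexicons.items():
            if name not in included and words & covered:
                included.add(name)
                changed = True
    return included


# Toy data, invented for this example.
LEXICONS = {
    "Person": {"washington", "jordan", "madonna"},
    "Location": {"washington", "jordan", "london", "orange"},
    "Organization": {"apple", "amazon", "orange"},
    "Fruit": {"apple", "orange", "kiwi"},
    "Bird": {"kiwi", "robin"},
    "Mineral": {"quartz", "gypsum"},
}

print(sorted(expand_types({"Person", "Location", "Organization"}, LEXICONS)))
# ['Bird', 'Fruit', 'Location', 'Organization', 'Person']
```

In this toy run, Fruit comes in because of “apple” and “orange”, Bird follows only because it shares “kiwi” with Fruit, and Mineral stays out because none of its words is ambiguous with an included type. The chain reaction is why the number of types keeps growing.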


3 Responses to “What is a Named Entity?”

  1. Pascal Says:

    I would tend to disagree with some examples you provided. For instance:

    iPhone: not single individual according to you.

    Suppose that iPhone was in the dictionary, I would expect it to be defined as something like: “Popular model of mobile phone marketed by Apple” (or something like that), which would be different from the definition for a mobile phone “Electronic device that allows the transmission of voice using wave frequencies” or something like that.

    Thus, while people tend to refer to their iPhone as any device of this brand (my iPhone, your iPhone), I still see it as a unique entity: iPhone is a brand name (not the actual devices), just like Apple is a company name (and not the actual computers). iPhone is thus a unique individual, which is: THIS specific and unique brand. Yet I think this is debatable, so I’d put an * next to the examples that contain brands in your list :)

    In the case of Miss America, I also believe it is a single individual in a given context (time frame). Miss America refers to a single individual at a specific time, just as “the President of the United States” does, which could be replaced by “George W. Bush” in a sentence in 2008, but should be replaced by “Ronald Reagan” in a sentence written in 1985. On the other hand, I’m not sure that it is a *named* entity, since it is not actually named at all.

    I think that the generic/specific criterion is the best one to define a named entity, but it lacks an additional criterion: the “named” part. In the sentence: “John Smith said that [...] while *he* was in Toronto”, the word *he* refers to an individual, but *he* is not a named entity. For this reason “The president of the United States” would not qualify as a real *named* entity, at least linguistically.

    A geekier way to define a Named Entity: something that would need a GUID. Using this analogy, a C++ pointer to an individual identified with a GUID would be akin to an anaphoric expression:

    “Fido is my dog and he is happy”

    Dog Fido("7855E60A-D97A-11DC-A110-85C856D89593");  // the named entity, identified by its GUID
    Dog* he = &Fido;                                   // the anaphor: a pointer to that entity
    he->State = Dog::Mood::HAPPY;                      // "he is happy"

    :)

  2. Bob Carpenter Says:

    Any plans to share your corpus?

    Though I like philosophy of language more than most (I taught it when I was a professor at Carnegie Mellon), I’m not sure there’s a place for Kripke’s possible-worlds semantics notion of rigid designation in an engineering discipline. I find its dependence on possible worlds rather circular. And in that theory, “water” is typically taken to be non-rigid, whereas H2O is of debatable rigidity depending on your beliefs about the behavior of physics in other possible worlds.

    Even names like “Ronald Reagan” and “George W. Bush” are not unique, even in 1985 or 2008. So any notion of specificity is difficult to quantify. Check out Russell’s original theory and Strawson’s replies about contextual dependence (not to mention Kaplan’s cool early work on demonstrative pronouns and other indexicals).

    Pascal makes a good point about brands: they’re specific (abstract) entities. How things get names and what names mean is another long detour through the philosophy of language that isn’t particularly relevant for engineering. It has an epistemological component concerned with how you learn what the names of things are. And an ontological component because you can give the same thing (an ontological notion) different names (e.g. “Morning Star” and “Evening Star”, both of which refer to the planet Venus).

  3. Molino de Ideas Says:

    Seems like I’m reliving this post…

    Well, from my point of view, part of the problem is that, while it’s easy (or at least not impossible) to detect “Pride and Prejudice” as a named entity, detecting “The most famous book by Jane Austen” (and not just “Jane Austen”) is a harder task, even when it refers to the same object. Actually, that same object could also be referred to as “The English novel that was turned into a film starring Keira Knightley” or even “My mom’s favorite book”.

    These last two don’t have the typical named entity structure, but despite being less orthodox-looking, they still refer to the same object, and therefore should be considered named entities and detected as such by a perfect named entity recognizer.

    We still have a long way to go… (we happen to be working in the same field, but in Spanish)

    Greetings from Spain and congratulations on your blog! :)
