A demo log entry attracted our attention:
“I talked with Frfrfrf yesterday.”
This made-up sentence was clearly sent by a researcher in the field. This is one way to test an important feature of a Named Entity Recognition (NER) system, and to determine the paradigm behind it.
NER systems are either based on “lists&rules” or on sequence labelling algorithms. Take this sentence, for example:
“I talked with France yesterday.”
The first kind of system requires massive lists of entities along with contextual rules to resolve ambiguity. Using lists&rules, “France” is matched against entries in the lists for countries and first names. A disambiguation rule then assigns the right type, using contextual cues such as “talked.” This strategy works well with “France” but doesn’t handle “Frfrfrf,” as it is not in the vocabulary.
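The lookup-then-disambiguate idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two tiny lists, the cue words, the two-word context window, and the function names are all hypothetical, and no real system's rule set is implied.

```python
# Hypothetical gazetteers; a real system would hold far larger lists.
GAZETTEERS = {
    "country": {"France", "Canada"},
    "first_name": {"France", "Marie"},
}

# Hypothetical contextual cues suggesting that a person follows.
PERSON_CUES = {"talked", "spoke", "met"}


def candidate_types(word):
    """Return every entity type whose list contains the word."""
    return {etype for etype, entries in GAZETTEERS.items() if word in entries}


def tag_entity(tokens, index):
    """Assign a type to tokens[index], or None if it is in no list."""
    candidates = candidate_types(tokens[index])
    if not candidates:
        return None  # out of vocabulary: the lists cannot help
    if len(candidates) == 1:
        return next(iter(candidates))
    # Ambiguous entry: look at the two preceding words for a person cue.
    left_context = {t.lower() for t in tokens[max(0, index - 2):index]}
    if "first_name" in candidates and left_context & PERSON_CUES:
        return "first_name"
    return sorted(candidates)[0]  # arbitrary fallback for this sketch


print(tag_entity("I talked with France yesterday".split(), 3))
print(tag_entity("I talked with Frfrfrf yesterday".split(), 3))
```

The first call resolves the ambiguous “France” through the cue “talked”; the second returns `None` because “Frfrfrf” appears in no list, which is the limitation described above.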
The second kind of system relies on learned probabilities of word sequences and their inner features. Using sequence labelling, the word “France” is assigned a probability for each candidate type, using features such as capitalization, presence in lists of entities, prefix, suffix, and the surrounding context words. List lookup is optional, though it is recognized as a very good feature. This strategy works well with “France.” It can also handle unknown words such as “Frfrfrf” if the contextual cues are strong enough. This is exactly what happens with the LingPipe system, which annotates “Frfrfrf” as a person.
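To make the feature list concrete, here is a minimal sketch of a per-token feature extractor of the kind a sequence labeller might consume. It is not LingPipe's actual feature set: the feature names, the three-character prefix and suffix length, the boundary markers, and the small entity list are assumptions made for illustration, and the trained model that turns these features into probabilities is left out.

```python
# Hypothetical entity list; list lookup is just one optional feature here.
KNOWN_ENTITIES = {"France", "Canada", "Marie"}


def token_features(tokens, index):
    """Describe tokens[index] by its own shape and its neighbours."""
    word = tokens[index]
    has_next = index + 1 < len(tokens)
    return {
        "is_capitalized": word[:1].isupper(),
        "in_entity_list": word in KNOWN_ENTITIES,
        "prefix3": word[:3].lower(),
        "suffix3": word[-3:].lower(),
        "prev_word": tokens[index - 1].lower() if index > 0 else "<START>",
        "next_word": tokens[index + 1].lower() if has_next else "<END>",
    }


features = token_features("I talked with Frfrfrf yesterday".split(), 3)
for name, value in features.items():
    print(name, value)
```

For “Frfrfrf” the list feature is false, but the capitalization and the preceding word “with” are still available, which is the kind of contextual evidence a trained model can use to label an unseen word.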
YooName is based on finite lists and rules. Therefore, if YooName doesn’t know it, it doesn’t exist. YooName constantly scouts the Web, searching for new entities. So far, it has never found “Frfrfrf.” We believe that if it runs long enough, it may end up knowing every single named entity out there.