Archive for August, 2007

Wikipedia’s Inside

August 28, 2007

To many, ourselves included, Wikipedia is the most wonderful Web site ever created. That’s why we’re very excited to announce that we’ve included much of Wikipedia’s latent knowledge in YooName. The short story is that analyzing Wikipedia boosted YooName’s knowledge to roughly 225% of its previous size. It went from about 175,000 to about 400,000 entities.

The long story is as follows…

One of YooName’s most valuable assets is its automatically generated named entity lists. These lists are created with set expansion techniques [our paper], like those seen in Google Sets and, more recently, in Seal.

This iterative algorithm has a random factor: it bootstraps its knowledge by scouting the Web, starting from the highest-confidence knowledge of the latest iteration. Sometimes the algorithm hits Wikipedia, sometimes it doesn’t. However, we know that Wikipedia is full of collaboratively maintained, high-quality lists of named entities. All in all, we found 32,000 pages in the Wikipedia dump that present lists with at least 10 elements. Of these lists, 1,500 overlapped heavily with YooName’s knowledge, so we forced our set expansion algorithm to hit them.
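
The list-selection step described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not YooName’s actual code: the function names, the overlap measure (the fraction of a list’s items that are already known), and the 0.5 threshold are all hypothetical. The post itself only says that the lists had at least 10 elements and overlapped heavily with existing knowledge.

```python
def overlap_ratio(items, known_entities):
    """Fraction of a list's distinct items that are already known."""
    distinct = set(items)
    if not distinct:
        return 0.0
    return len(distinct & known_entities) / len(distinct)


def select_seed_lists(candidate_lists, known_entities, min_size=10, min_overlap=0.5):
    """Return the pages whose lists are large enough and overlap strongly
    with the known entities.

    min_size=10 follows the post; min_overlap=0.5 is an assumed threshold.
    candidate_lists maps a page title to the list items found on that page.
    """
    selected = []
    for page, items in candidate_lists.items():
        if len(set(items)) < min_size:
            continue  # too short to count as a useful list
        if overlap_ratio(items, known_entities) >= min_overlap:
            selected.append(page)
    return selected
```

Pages returned by a filter like this would then be handed to the set expansion step as pages it must visit, instead of leaving the visit to chance.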

3 New Types

August 14, 2007

Based on users’ suggestions and our own gut feeling, we’ve added three types to the YooName NER engine:

Blog title: Wanna recognize occurrences of entities such as “Boing Boing,” “I CAN HAS CHEEZBURGER,” “Engadget,” or “TreeHugger?” Go ahead, now you’re ready to scout the Web in search of new feeds.

Musical genre: Blues? Jazz? Techno? Or maybe you prefer Death Metal and Minimalist Experimental? YooName knows them all and will annotate texts for you.

Operating system: The inner geek emerges. YooName knows OSs that most people haven’t even heard of (are we overdosing on Windows?): Darwin, Oberon, REBOL-IOS, Plan 9, etc.

What’s next? You tell us!

YooName Statistics

August 3, 2007

YooName’s self-maintenance routine was completed this morning, so we thought it was time to gather some statistics:

  • YooName can recognize 175,552 unique named entities.
  • 373,925 candidate entities were quarantined because of insufficient statistical and lexical evidence.
  • When combining first names and last names, YooName recognizes 450 million personal names.
  • YooName’s knowledge is gathered from 54,989 English-language Web pages.
  • Our crawler examined 710,901 files (~50 GB) to find the knowledge-rich pages above.
  • Disambiguation rules are created from ~300k textual passages that mention 11,652 representative named entities, collected from 1 TB of English text.

Compared to statistics compiled three months ago, YooName’s knowledge grew by 17%.
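
The personal-name figure in the list above comes from combining two lexicons rather than storing every full name. A minimal sketch of that idea, assuming simple two-token names and made-up sample lexicons (the function names and data here are hypothetical, and a real system would also need to handle middle names, initials, and so on):

```python
def count_full_names(first_names, last_names):
    """Number of distinct 'First Last' combinations the two lexicons cover."""
    return len(first_names) * len(last_names)


def is_personal_name(text, first_names, last_names):
    """True if text is exactly a known first name followed by a known last name."""
    parts = text.split()
    return (
        len(parts) == 2
        and parts[0] in first_names
        and parts[1] in last_names
    )


# Tiny illustrative lexicons (assumed data, not YooName's).
first_names = {"Ada", "Alan"}
last_names = {"Lovelace", "Turing", "Hopper"}
```

With lexicons of this shape, two set lookups are enough to recognize a name, and the number of names covered grows as the product of the two lexicon sizes.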

Balie – Ungava release

August 1, 2007

Balie is the open source NLP engine powering YooName. The latest release, called “Ungava,” is entirely compatible with YooName. This means that by installing Balie:

  1. you get a fully functional subset of YooName;
  2. you are ready to upgrade to YooName (all you need is the extended lexicon and the disambiguation rules).

The Ungava release includes a named entity recognition model for persons, locations, and organizations. Balie scores an average F-measure of 78% on the standard MUC-7 dataset (see state-of-the-art results).
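
For readers unfamiliar with the metric, the F-measure is the weighted harmonic mean of precision and recall. A small sketch of the standard formula follows; the precision and recall values in the usage line are purely illustrative and are not Balie’s actual figures (the post reports only the 78% average), and the official MUC scorer has its own rules for matching entities that are not reproduced here.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (balanced F1 when beta == 1)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid dividing by zero when nothing was correct
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Illustrative values only: precision 0.80 and recall 0.76.
example_score = f_measure(0.80, 0.76)
```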