Archive for the ‘YooName News’ Category

8 Sentiments

April 7, 2011

8Sentiments is a semi-supervised sentiment analysis engine.

The 8Sentiments model is trained every day on a large volume of unannotated Twitter data and can learn emotions related to current topics.

For instance, on April Fools’ Day, the phrase ‘April Fool’ was learned and associated with the emotion ‘Surprise’.

The current eight emotions are anger, fear, sadness, joy, anticipation, surprise, disgust, and acceptance.

A very simple API is available, and sample code is provided for Java, Ruby, and Python.



New Information Extraction Projects

February 8, 2010

YooName’s named entity recognition technology is now at the heart of new projects in the domain of Online Reputation Management and Monitoring.

  • InfoGlutton aggregates restaurant reviews and classifies them by sentiment (positive, neutral, negative). InfoGlutton is aimed at helping restaurant owners get a complete overview of the ‘digital word-of-mouth’ around their brand.
  • FoodFu reuses InfoGlutton data into a restaurant directory for foodies in search of the best tables in town.
  • DingDining leverages YooName entity recognition trained for the food industry domain and offers a directory of restaurants ranked by awards and distinctions.

And there’s more to come!

YooName is *not* a search engine

April 30, 2009

(and other frequently given answers)

In the last few weeks, YooName traffic increased dramatically (tenfold), and so did the volume of emails. Don’t be offended if I answer your email by linking to this post. I think this is a good place and a good time to address the most frequent concerns:

1. YooName is not a search engine

Don’t expect YooName to return a list of web sites when you issue a query in the demo page. YooName is not a search engine. The confusion arises because we often describe YooName as a potential search engine component, or a novel algorithm for improving web search.

YooName is a self-improving named entity recognition (NER) system. If you know what NER is, then you probably have an idea of how it relates to search engines. If not, then this is less obvious. In short, NER allows structuring textual information, and structured information is important for semantic search technologies.

2. YooName is not a commercial project per se

YooName is a technology showcase for my PhD project.

3. No, I didn’t hire a lawyer to write a formal privacy policy

In order to sign up for the YooName demo, we collect your email. This is the simplest form of verification we could imagine to avoid being scraped by robots and/or Mechanical Turk. Also, when you send a text to the demo, it is stored in the system for statistics and quality assurance. These are two frequent privacy concerns expressed by the demo users.

E-mails: I use the demo user email database with the greatest diligence. I do not share it and I do not mass-mail for fun. In fact, in the two years of existence of the demo site, I haven’t used it yet. As the sign-up form says: “We will not share your e-mail. We may send you news about YooName developments. We will promptly remove your e-mail from our database upon request.”

Texts: The texts you send to the demo are stored and used internally. This information is not shared and is destroyed periodically. Again, if you think that you sent sensitive information to the system and want it to be destroyed, drop me a line and I’ll wipe out the information linked to your username.

Difficult to Pwn IM Language iykwimaityd

June 1, 2008

Researchers at the University of Toronto, Canada, suggest that instant messaging represents “an expansive new linguistic renaissance” (story from New Scientist.)

We’ve tried seeding YooName with a list of well-known internet slang expressions such as: LOL, brb, and OMG.

YooName found 993 pages on the Internet containing a lexicon (or structured repository) of Internet slang, and it collected a list of 1,718 unique expressions. Interestingly, more than a quarter of these expressions are ambiguous with other categories of words: for example, brb (be right back) is also a ticker symbol, lol (laugh out loud) is a place in Papua New Guinea, and asap (as soon as possible) is also the name of a company.
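This kind of ambiguity can be detected by indexing every lexicon entry by its lowercased form and looking for entries that fall under more than one category. The following is a minimal Python sketch: the three ambiguous expressions come from the examples above, but the category names, the tiny lexicons, and the functions `build_index` and `ambiguous_entries` are assumptions made for illustration, not YooName’s actual data structures.

```python
from collections import defaultdict

# Toy lexicons (hypothetical); real lexicons hold thousands of entries.
LEXICONS = {
    "internet slang": ["LOL", "brb", "OMG", "asap"],
    "ticker symbol": ["BRB"],
    "place": ["Lol"],
    "company": ["ASAP"],
}

def build_index(lexicons):
    """Map each lowercased entry to the set of categories that list it."""
    index = defaultdict(set)
    for category, entries in lexicons.items():
        for entry in entries:
            index[entry.lower()].add(category)
    return index

def ambiguous_entries(index, category):
    """Return entries of `category` that also belong to another category."""
    return {
        entry: categories
        for entry, categories in index.items()
        if category in categories and len(categories) > 1
    }

index = build_index(LEXICONS)
ambiguous = ambiguous_entries(index, "internet slang")
for entry in sorted(ambiguous):
    print(entry, sorted(ambiguous[entry]))
```

In this toy data, brb, lol, and asap come out as ambiguous, while OMG belongs to a single category.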

We’ve updated YooName’s lexicon and rule system to recognize and annotate Internet slang… but because of its high ambiguity and unconventional syntax, it is very difficult to pwn!

YooName’s creator honored at 2008 OCRI awards

April 8, 2008

OTTAWA, Canada, April 3, 2008 - OCRI, Ottawa’s economic development agency, honoured Ottawa’s best and brightest companies, executives and students for their innovative work and contributions to the city’s knowledge-based sector at the 13th annual OCRI Awards gala.

David Nadeau from the University of Ottawa received the Student Researcher of the Year award for inspired research resulting in a more intelligent on-line search engine and his commercialization efforts which launched last November.

[see the full press release]

Semi-Supervised Named Entity Recognition

December 16, 2007

YooName originates from the PhD research titled Semi-Supervised Named Entity Recognition: Learning to Recognize 100 Entity Types with Little Supervision. The thesis was successfully defended at the University of Ottawa, Canada, and is now available online.

Here’s the abstract:

* * *

Named Entity Recognition (NER) aims to extract and to classify rigid designators in text such as proper names, biological species, and temporal expressions. There has been growing interest in this field of research since the early 1990s. In this thesis, we document a trend moving away from handcrafted rules, and towards machine learning approaches. Still, recent machine learning approaches have a problem with annotated data availability, which is a serious shortcoming in building and maintaining large-scale NER systems.

In this thesis, we present an NER system built with very little supervision. Human supervision is indeed limited to listing a few examples of each named entity (NE) type. First, we introduce a proof-of-concept semi-supervised system that can recognize four NE types. Then, we expand its capacities by improving key technologies, and we apply the system to an entire hierarchy comprised of 100 NE types.

Our work makes the following contributions: the creation of a proof-of-concept semi-supervised NER system; the demonstration of an innovative noise filtering technique for generating NE lists; the validation of a strategy for learning disambiguation rules using automatically identified, unambiguous NEs; and finally, the development of an acronym detection algorithm, thus solving a rare but very difficult problem in alias resolution.

We believe semi-supervised learning techniques are about to break new ground in the machine learning community. In this thesis, we show that limited supervision can build complete NER systems. On standard evaluation corpora, we report performances that compare to baseline supervised systems in the task of annotating NEs in texts.

GeoNames’ Inside

October 1, 2007

Using the same modus operandi as for Wikipedia, we just included much of GeoNames’ 8 million entries into YooName. Our lexicons grew to about 750% of their previous size, to more than 3,000,000 named entities.

YooName is now powered by GeoNames and Wikipedia.

Did you know these common words could stand for city/town names?

The Intrinsic Likelihood of Things

September 13, 2007

The latest addition to the YooName engine is prior probability correction. Without going into statistical details, it creates a bias that makes a word belong to a given named entity class by default. This is very important because contextual disambiguation is not always enough to classify named entities. Here is an example. Consider this sentence:

David wants to finish the sculpture so he puts a lot of pressure on Pascal.

For the human reader, David and Pascal are obviously names of persons.

For the machine, David is the name of

  • a person,
  • a sculpture,
  • a food brand.

Pascal is the name of

  • a person,
  • a measurement unit.

Using statistical analysis of the context, disambiguation rules can be fooled into thinking that David is a sculpture (because the word “sculpture” is used in this context) and that Pascal is a measurement unit (because the word “pressure” is used in this context). That’s why knowledge of prior probabilities is important. More often than not, “David” stands for a person rather than anything else. The same goes for “Pascal.” That means these words will only be classified as something else if there is strong contextual evidence to support it (e.g., a number precedes “Pascal”).
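The effect of a prior can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the prior values, the context scores, and the naive-Bayes-style combination (adding the log prior to the log context score) are invented and do not describe YooName’s actual rule system.

```python
import math

# Hypothetical prior probabilities of each class for an ambiguous word.
PRIORS = {
    "David": {"person": 0.90, "sculpture": 0.07, "food brand": 0.03},
    "Pascal": {"person": 0.85, "measurement unit": 0.15},
}

def classify(word, context_scores, use_prior=True):
    """Return the class with the highest combined log score.

    context_scores maps each candidate class to a likelihood-like score
    coming from contextual disambiguation (made-up numbers here).
    """
    best_class, best_score = None, -math.inf
    for label, context in context_scores.items():
        score = math.log(context)
        if use_prior:
            score += math.log(PRIORS[word][label])
        if score > best_score:
            best_class, best_score = label, score
    return best_class

# The nearby word "sculpture" mildly favours the sculpture reading.
david_context = {"person": 0.3, "sculpture": 0.6, "food brand": 0.1}
print(classify("David", david_context, use_prior=False))  # sculpture
print(classify("David", david_context))                   # person

# Strong evidence (e.g., a number precedes "Pascal") still overrides the prior.
pascal_context = {"person": 0.02, "measurement unit": 0.98}
print(classify("Pascal", pascal_context))                 # measurement unit
```

With the prior switched off, the context alone picks the sculpture reading. With the prior applied, the default class wins unless the contextual evidence is strong.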

In conventional supervised (machine) learning, prior probabilities are estimated according to their frequency in the training corpus of annotated data. In semi-supervised learning, the annotated data is artificially created by the judicious sampling of a large collection of unannotated data. In YooName, this mechanism cancels out prior probabilities calculated from the training corpus. That’s why prior probability correction is necessary. We’ve used a variant of PMI-IR to approximate prior probabilities in an unsupervised manner, and YooName’s precision has significantly improved.
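The post does not describe the exact PMI-IR variant, so the following Python sketch only shows the general idea under stated assumptions: hit counts for a word appearing near a class cue are divided by the hit counts of the cue alone, and the resulting scores are normalized into prior probabilities. The hit counts, the cue words, and the function names are hypothetical; a real system would obtain the counts from search-engine queries.

```python
# Hypothetical hit counts standing in for search-engine queries.
# Tuple keys mean "word NEAR cue"; string keys mean the cue alone.
HITS = {
    ("Pascal", "person"): 42_000,
    ("Pascal", "unit"): 6_000,
    "person": 9_000_000,
    "unit": 3_000_000,
}

def pmi_ir_scores(word, cues):
    """PMI-IR style score for each cue: hits(word NEAR cue) / hits(cue)."""
    return {cue: HITS[(word, cue)] / HITS[cue] for cue in cues}

def estimate_priors(word, cues):
    """Normalize the scores so that they sum to 1 over the candidate classes."""
    scores = pmi_ir_scores(word, cues)
    total = sum(scores.values())
    return {cue: score / total for cue, score in scores.items()}

priors = estimate_priors("Pascal", ["person", "unit"])
for cue, prior in priors.items():
    print(f"{cue}: {prior:.2f}")
```

Dividing by the hit count of the cue alone keeps a very frequent cue word from dominating simply because it appears everywhere.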

Wikipedia’s Inside

August 28, 2007

To many, ourselves included, Wikipedia is the most wonderful Web site ever created. That’s why we’re very excited to announce that we’ve included much of Wikipedia’s latent knowledge in YooName. The short story is that analyzing Wikipedia grew YooName’s knowledge to roughly 225% of its previous size. It went from 175,000 to 400,000 entities.

The long story is as follows…

One of YooName’s most valuable assets is its automatically generated named entity lists. Lists are created by set expansion techniques [our paper], as seen on Google Sets, and, more recently, on Seal.

This iterative algorithm has a random factor: it bootstraps its knowledge by scouting the Web, starting with knowledge with the highest confidence from the latest iteration. Sometimes the algorithm hits Wikipedia, sometimes it doesn’t. However, we know that Wikipedia is full of collaboratively maintained high-quality lists of named entities. All in all, we found 32,000 pages in the Wikipedia dump that present lists with at least 10 elements. Of these lists, 1,500 intersected highly with YooName knowledge and we forced our set expansion algorithm to hit them.
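The selection step can be sketched in Python as a filter on list size and on overlap with already known entities. The minimum size of 10 comes from the description above; the overlap threshold of 0.5, the toy data, and the function name `select_lists` are assumptions made for illustration, since the post does not say how “intersected highly” was measured.

```python
def select_lists(candidate_lists, known_entities, min_size=10, min_overlap=0.5):
    """Keep lists that are long enough and overlap strongly with known entities.

    min_size=10 follows the post; min_overlap=0.5 is an assumed threshold.
    Returns (title, overlap ratio, new entities) tuples.
    """
    selected = []
    for title, items in candidate_lists.items():
        unique = set(items)
        if len(unique) < min_size:
            continue  # too short to count as a useful list
        overlap = len(unique & known_entities) / len(unique)
        if overlap >= min_overlap:
            selected.append((title, overlap, sorted(unique - known_entities)))
    return selected

# Toy data (hypothetical): a few known entities and three candidate pages.
known = {"Blues", "Jazz", "Techno", "Rock", "Reggae", "Folk"}
candidates = {
    "List of music genres": ["Blues", "Jazz", "Techno", "Rock", "Reggae",
                             "Folk", "Ska", "Funk", "Soul", "Disco"],
    "Short list": ["Blues", "Jazz", "Techno"],
    "List of chemical elements": ["Hydrogen", "Helium", "Lithium", "Beryllium",
                                  "Boron", "Carbon", "Nitrogen", "Oxygen",
                                  "Fluorine", "Neon"],
}

result = select_lists(candidates, known)
for title, overlap, new_entities in result:
    print(title, overlap, new_entities)
```

Only the music genre list passes both filters here, and its unknown members (Disco, Funk, Ska, Soul) become candidates for expanding the lexicon.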

3 New Types

August 14, 2007

According to users’ suggestions and our own gut feeling, we’ve added three types to the YooName NER engine:

Blog title: Wanna recognize occurrences of entities such as “Boing Boing,” “I CAN HAS CHEEZBURGER,” “Engadget,” or “TreeHugger?” Go ahead, now you’re ready to scout the Web in search of new feeds.

Musical genre: Blues? Jazz? Techno? Or maybe you prefer Death Metal and Minimalist Experimental? YooName knows them all and will annotate texts for you.

Operating system: The inner geek emerges. YooName knows OSs that most people haven’t even heard of (are we overdosing on Windows?): Darwin, Oberon, REBOL-IOS, Plan 9, etc.

What’s next? You tell us!