Archive for June, 2008

Interesting new research problem

June 25, 2008

This article, found via digg, highlights an inherent ‘by-design’ flaw of automatic news aggregators, including Google News: they need a significant amount of press coverage before promoting news to their front page. As a result, automatic news aggregators are often hours late in covering breaking news.

The solution to the problem of “finding the most important news right now” cannot rely on an hour or so of news history. After an hour, the story is no longer breaking news; it is late and repetitive.

Let’s formulate a challenging research problem from that: “Given a novel and unique news item, can you predict that there will be thousands of repetitions and reformulations?”
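One way to make the target of that prediction concrete is to count, after the fact, how many later items look like reformulations of the original story. The sketch below is only an illustration under stated assumptions: the word-overlap (Jaccard) measure, the 0.5 threshold, the function names and the headlines are all made up for the example and are not taken from any aggregator.

```python
import re

def tokens(text):
    """Lowercase word set of a headline or article."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Jaccard similarity between two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def count_reformulations(seed, later_items, threshold=0.5):
    """Count later items whose word overlap with the seed meets the threshold.

    The threshold is an arbitrary assumption, not a tuned value.
    """
    seed_tokens = tokens(seed)
    return sum(1 for item in later_items
               if jaccard(seed_tokens, tokens(item)) >= threshold)

# Hypothetical headlines, invented for illustration only.
seed = "Earthquake strikes coastal city"
later = [
    "Earthquake strikes coastal city overnight",
    "Strong earthquake strikes the coastal city",
    "Local team wins championship",
]
print(count_reformulations(seed, later))  # 2
```

The research problem is then to estimate this count from the seed item alone, before the later items exist.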


DayLife Developer Challenge

June 13, 2008

DayLife is staging a challenge from June 3rd to July 25th (extended date):

Build the future of news, in software!

Build an application that uses the Daylife API. No limits here: mashups, portals, widgets, iphone apps, blogging plugins, you name it.

DayLife challenge

DayLife is a news aggregation platform with strong named entity (NE) recognition capability. NEs are also called ‘Topics’, and they fall under the types ‘Person’, ‘Place’ and ‘Organization’.

Ontology is Overrated

June 5, 2008

This is an extract (a summary by sentence extraction, the way old-school text summarizers used to do it ;) of Clay Shirky’s blog post titled ‘Ontology is Overrated’.

* * *

Today I want to talk about categorization, and […] I want to convince you that many of the ways we’re attempting to apply categorization to the electronic world are actually a bad fit.

What I think is coming instead are […] organic ways of organizing information […], based on two units — the link, which can point to anything, and the tag, which is a way of attaching labels to links.

PART I: Classification and Its Discontents

The question ontology asks is: What kinds of things exist or can exist in the world, and what manner of relations can those things have to each other?

If you’ve got a large, ill-defined corpus, if you’ve got naive users, if your cataloguers aren’t expert, if there’s no one to say authoritatively what’s going on, then ontology is going to be a bad strategy.

One of the biggest problems with categorizing things in advance is that it forces the categorizers […] to guess what their users are thinking, and to make predictions about the future.

When people [are] offered search [e.g., Web search] and categorization [e.g., Web directory] side-by-side, fewer and fewer people [are] using categorization to find things.

Part II: The Only Group That Can Categorize Everything Is Everybody

Now imagine a world where everything can have a unique identifier. This should be easy, since that’s the world we currently live in — the URL gives us a way to create a globally unique ID for anything we need to point to.

And once you can do that, anyone can label those pointers, can tag those URLs, in ways that make them more valuable, and all without requiring top-down organization schemes.

As [Joshua] Schachter says of del.icio.us, “Each individual categorization scheme is worth less than a professional categorization scheme. But there are many, many more of them.” If you find a way to make it valuable to individuals to tag their stuff, you’ll generate a lot more data about any given object than if you pay a professional to tag it once and only once.

Well-managed, well-groomed organizational schemes get worse with scale, both because the costs of supporting such schemes at large volumes are prohibitive, and, as I noted earlier, scaling over time is also a serious problem. Tagging, by contrast, gets better with scale. With a multiplicity of points of view the question isn’t “Is everyone tagging any given link ‘correctly’”, but rather “Is anyone tagging it the way I do?” As long as at least one other person tags something the way you would, you’ll find it […].

We are moving away from binary categorization — books either are or are not entertainment — and into this probabilistic world, where N% of users think books are entertainment.

* * *
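The closing idea of the extract, that “N% of users think books are entertainment”, can be illustrated with a few lines of Python. This is a minimal sketch under assumptions: the user names, tags and the `tag_shares` helper are hypothetical and do not come from Shirky’s post or from any tagging service.

```python
from collections import Counter

def tag_shares(taggings):
    """Map each tag to the fraction of distinct users who applied it.

    `taggings` is a list of (user, tag) pairs for a single item;
    repeated pairs from the same user are counted once.
    """
    users = {user for user, _ in taggings}
    counts = Counter(tag for _, tag in set(taggings))
    return {tag: n / len(users) for tag, n in counts.items()}

# Hypothetical taggings of one item, invented for illustration.
taggings = [
    ("alice", "entertainment"),
    ("bob", "entertainment"),
    ("bob", "education"),
    ("carol", "education"),
    ("dave", "entertainment"),
]
shares = tag_shares(taggings)
print(shares["entertainment"])  # 0.75
print(shares["education"])      # 0.5
```

Here the item is not either “entertainment” or “education”; it is entertainment for 75% of the taggers and education for 50% of them, and the shares need not sum to one because a user may apply several tags.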

Difficult to Pwn IM Language iykwimaityd

June 1, 2008

Researchers at the University of Toronto, Canada, suggest that instant messaging represents “an expansive new linguistic renaissance” (story from New Scientist).

We’ve tried seeding YooName with a list of well-known internet slang expressions such as: LOL, brb, and OMG.

YooName found 993 pages on the Internet containing a lexicon (or structured repository) of Internet slang, and it collected a list of 1,718 unique expressions. Interestingly, more than a quarter of these expressions are ambiguous with other categories of words. For example, brb (be right back) is also a ticker symbol, lol (laugh out loud) is a place in Papua New Guinea, and asap (as soon as possible) is also the name of a company.
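The kind of ambiguity check described above can be sketched as a lookup of each slang expression in the lexicons of other categories. This is not YooName’s implementation: the `ambiguous_entries` helper and the tiny lexicons are hypothetical, built only from the three examples given in this post, and the other lexicons are assumed to be stored in lowercase.

```python
def ambiguous_entries(slang, other_lexicons):
    """Return slang expressions that also appear in another lexicon,
    mapped to the sorted names of the categories they collide with.

    Assumes entries in `other_lexicons` are already lowercase.
    """
    result = {}
    for expression in slang:
        key = expression.lower()
        hits = sorted(name for name, entries in other_lexicons.items()
                      if key in entries)
        if hits:
            result[key] = hits
    return result

# Toy data: only the examples mentioned in the post, not a real lexicon.
slang = {"LOL", "brb", "asap", "OMG"}
other = {
    "ticker": {"brb"},
    "place": {"lol"},
    "company": {"asap"},
}
amb = ambiguous_entries(slang, other)
for expression in sorted(amb):
    print(expression, amb[expression])
```

With this toy data the expressions asap, brb and lol are flagged and omg is not. The proportion flagged here says nothing about the real figure, which depends on the full 1,718-expression list.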

We’ve updated YooName’s lexicon and rule system to recognize and annotate Internet slang… but because of its high ambiguity and unconventional syntax, it is very difficult to pwn!