Interesting new research problem

This article, found via digg, highlights an inherent ‘by-design’ flaw of automatic news aggregators, including Google News: they need a significant amount of press coverage before promoting news to their front page. As a result, automatic news aggregators are often hours late in covering breaking news.

The solution to the problem of “finding the most important news right now” cannot rely on an hour or so of news history. After an hour, the story is no longer breaking news; it is late and repetitive.

Let’s formulate a challenging research problem from that: “Given a novel and unique news item, can you predict that it will trigger thousands of repetitions and reformulations?”

Tags: google news, news aggregation, research problem

This entry was posted on June 25, 2008 at 8:32 am and is filed under NE Ecosystem.

One Response to “Interesting new research problem”

Brian Bean Says:
December 7, 2008 at 6:54 pm
How about this:

Filter news for temporal immediacy and against “reformulation” (if possible).

Evaluate a reasonably large sample of original news stories that resulted in repetitions/reformulations for attribute commonality, e.g., natural disasters, significant loss of life, etc., and characterize these attributes by frequency of occurrence, impact (some attributes, for example, may result in a bigger storm of follow-on articles than other attributes) AND “consistency” of predictability (some attributes may be associated not only with “important” news but also with news that did not result in significant repetition/reformulation). Cross-correlate these attributes.
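As a rough illustration of this attribute evaluation, here is a minimal Python sketch. Everything in it is an assumption made for the example: the attribute names, the sample counts, the function name `attribute_profile`, and the threshold of 500 follow-on articles used to call a story "widely repeated". It computes, per attribute, how often the attribute occurs, the average number of follow-on articles, and how consistently the attribute goes with widely repeated stories.

```python
from collections import defaultdict

# Hypothetical sample: (attributes of the original story, number of follow-on articles).
SAMPLE = [
    ({"natural_disaster", "loss_of_life"}, 1200),
    ({"natural_disaster"}, 40),
    ({"celebrity"}, 15),
    ({"loss_of_life"}, 800),
    ({"celebrity", "loss_of_life"}, 950),
]


def attribute_profile(stories, threshold=500):
    """Return {attribute: (frequency, mean_follow_ons, consistency)}.

    frequency       share of all stories that carry the attribute
    mean_follow_ons average follow-on count for stories with the attribute
    consistency     share of stories with the attribute that reached `threshold`
    """
    counts = defaultdict(int)
    follow_ons = defaultdict(int)
    hits = defaultdict(int)
    for attributes, n_follow_ons in stories:
        for attribute in attributes:
            counts[attribute] += 1
            follow_ons[attribute] += n_follow_ons
            if n_follow_ons >= threshold:
                hits[attribute] += 1
    total = len(stories)
    return {
        a: (counts[a] / total, follow_ons[a] / counts[a], hits[a] / counts[a])
        for a in counts
    }


profile = attribute_profile(SAMPLE)
for name, (freq, impact, consistency) in sorted(profile.items()):
    print(f"{name}: frequency={freq:.2f} impact={impact:.1f} consistency={consistency:.2f}")
```

In this made-up sample, an attribute with high impact but low consistency would be a weak predictor on its own, which is the distinction the evaluation step is meant to expose.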

Construct an algorithm based on the attribute evaluation of the prior paragraph, and apply it to the news stream that meets the criteria of the first paragraph.

Ascertain the efficacy of the algorithm by observing how flagged news resulted or did not result in repetition/reformulation over the evaluator’s time frame of interest.
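One simple way to quantify this check is precision and recall over story identifiers. The sketch below is only an illustration: the story ids and the function name `precision_recall` are hypothetical, and "went viral" stands for whatever repetition threshold the evaluator picks for the time frame of interest.

```python
def precision_recall(flagged, went_viral):
    """Compare flagged story ids against the ids that were actually repeated.

    precision: share of flagged stories that were widely repeated
    recall:    share of widely repeated stories that had been flagged
    """
    true_positives = len(flagged & went_viral)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(went_viral) if went_viral else 0.0
    return precision, recall


# Hypothetical ids: four stories were flagged, three were widely repeated.
p, r = precision_recall({"a", "b", "c", "d"}, {"b", "c", "e"})
print(f"precision={p:.2f} recall={r:.2f}")
```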

Modify the algorithm to increase accuracy by repeating the attribute evaluation and cross-correlation of the second paragraph at some selected interval. This will address “attribute drift”. For example, I suspect the attribute “terrorist act” exhibited a different predictive profile on September 12, 2001 than it did on September 10th of that year.
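A sketch of how such periodic re-evaluation might look, assuming a trailing time window: the attribute's hit rate is recomputed using only recent stories, so a shift in its predictive profile shows up as a change in the rate. The dates, the attribute name, the counts, the 30-day window, and the threshold are all hypothetical values chosen for the example.

```python
from datetime import date, timedelta

# Hypothetical stories: (publication date, attributes, number of follow-on articles).
STORIES = [
    (date(2008, 1, 5), {"storm"}, 20),
    (date(2008, 1, 20), {"storm"}, 30),
    (date(2008, 3, 2), {"storm"}, 900),
    (date(2008, 3, 10), {"storm"}, 700),
    (date(2008, 3, 15), {"storm"}, 100),
]


def windowed_hit_rate(stories, attribute, as_of, window_days=30, threshold=500):
    """Share of recent stories with `attribute` that reached `threshold` follow-ons.

    Only stories published in the `window_days` up to and including `as_of`
    are counted. Returns None when no such story exists.
    """
    start = as_of - timedelta(days=window_days)
    recent = [
        n for published, attributes, n in stories
        if start <= published <= as_of and attribute in attributes
    ]
    if not recent:
        return None
    return sum(n >= threshold for n in recent) / len(recent)


early = windowed_hit_rate(STORIES, "storm", date(2008, 1, 31))
late = windowed_hit_rate(STORIES, "storm", date(2008, 3, 31))
print(early, late)
```

The window length is the "selected interval" trade-off: a short window reacts quickly to drift but rests on few stories, while a long one is more stable but slower to adapt.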

You have posed an interesting problem. Good luck!

Brian

YooName – named entity recognition