Archive for September, 2007

Business Intelligence and Text Analytics

September 25, 2007

With this kind of news becoming more frequent, it’s safe to say that named entity recognition technologies are playing an increasingly significant role in business intelligence (BI) and enterprise search (ES):

“The marriage of business intelligence and text analytics is starting to have a profound impact on companies in several industries, including health care, insurance and finance, which are just waking up to the benefits of tying structured BI data to unstructured text.” – Computerworld

“Biggle” (when BI meets Google) means that, “beyond returning information to users, some new search technologies act as ETL [extract, transform and load] for unstructured content.”

Examples of this trend can also be seen in BI and ES companies’ recent acquisitions of text analytics providers.

The Intrinsic Likelihood of Things

September 13, 2007

The latest addition to the YooName engine is prior probability correction. Without going into statistical details, it biases each word toward its most likely named entity class by default. This matters because contextual disambiguation is not always enough to classify named entities. Here is an example. Consider this sentence:

David wants to finish the sculpture so he puts a lot of pressure on Pascal.

For the human reader, David and Pascal are obviously names of persons.

For the machine, David is the name of

  • a person,
  • a sculpture,
  • a food brand.

Pascal is the name of

  • a person,
  • a measurement unit.

Using statistical analysis of the context, disambiguation rules can be fooled into thinking that David is a sculpture (because the word “sculpture” appears in this context) and that Pascal is a measurement unit (because the word “pressure” appears in this context). That’s why knowledge of prior probabilities is important. More often than not, “David” refers to a person rather than anything else, and the same goes for “Pascal.” These words will only be classified as something else if there is strong contextual evidence to support it (e.g., a number preceding “Pascal”).
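To make the idea concrete, here is a minimal sketch of prior-biased disambiguation. All the probabilities, class names, and cue words are made-up illustrative values, not YooName’s actual figures; the point is only to show how a strong prior keeps weak contextual cues from flipping the classification.

```python
# Hypothetical prior probabilities: how often each surface form denotes
# each named entity class, independent of context (invented numbers).
PRIORS = {
    "David":  {"person": 0.90, "artwork": 0.05, "brand": 0.05},
    "Pascal": {"person": 0.85, "unit": 0.15},
}

def contextual_likelihood(word, cls, context):
    """Toy contextual evidence: boost a class when a cue word appears.
    A real system would use statistical features; this just illustrates
    how a weak contextual cue alone could mislead the classifier."""
    cues = {"artwork": {"sculpture"}, "unit": {"pressure", "measure"}}
    if cls in cues and cues[cls] & set(context):
        return 2.0  # cue present: double that class's score
    return 1.0

def classify(word, context):
    """Pick the class maximising prior * contextual evidence."""
    scores = {cls: p * contextual_likelihood(word, cls, context)
              for cls, p in PRIORS[word].items()}
    return max(scores, key=scores.get)

sentence = ("David wants to finish the sculpture so he puts "
            "a lot of pressure on Pascal").split()
print(classify("David", sentence))   # person (0.90 beats artwork 0.05*2)
print(classify("Pascal", sentence))  # person (0.85 beats unit 0.15*2)
```

Without the prior (i.e., with uniform `PRIORS`), the cue words “sculpture” and “pressure” would tip both names into the wrong class; the prior is what makes the contextual evidence have to be *strong* to win.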

In conventional supervised (machine) learning, prior probabilities are estimated from class frequencies in the training corpus of annotated data. In semi-supervised learning, the annotated data is created artificially by judicious sampling from a large collection of unannotated data. In YooName, this sampling cancels out the prior probabilities that would otherwise be estimated from the training corpus, which is why prior probability correction is necessary. We’ve used a variant of PMI-IR to approximate prior probabilities in an unsupervised manner, and YooName’s precision has improved significantly.
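The post doesn’t give the details of its PMI-IR variant, but the general Turney-style idea can be sketched as follows: estimate the association between a word and each candidate class from co-occurrence counts (originally, search engine hit counts), then normalise those associations into a prior distribution. The hit counts and corpus size below are invented placeholders; a real system would query a search engine or large corpus.

```python
import math

# Invented co-occurrence counts: documents containing both terms,
# and documents containing each term alone (placeholders, not real data).
HITS = {
    ("David", "person"): 9_000_000,
    ("David", "sculpture"): 400_000,
    ("David", "brand"): 150_000,
    "David": 20_000_000,
    "person": 500_000_000,
    "sculpture": 30_000_000,
    "brand": 80_000_000,
}
TOTAL_DOCS = 10_000_000_000  # assumed corpus size

def pmi(word, cls):
    """Pointwise mutual information estimated from hit counts:
    log2( P(word, cls) / (P(word) * P(cls)) )."""
    return math.log2(HITS[(word, cls)] * TOTAL_DOCS
                     / (HITS[word] * HITS[cls]))

def priors(word, classes):
    """Turn PMI scores back into ratios and normalise them
    into a prior probability distribution over classes."""
    scores = {c: 2 ** pmi(word, c) for c in classes}
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

print(priors("David", ["person", "sculpture", "brand"]))
```

The normalisation step is one plausible way to turn association scores into priors; the actual variant used in YooName may weight or smooth these scores differently.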