The New York Times just released (through LDC) a gigantic corpus including:
Over 1.5 million articles manually tagged by The New York Times Index Department with a normalized indexing vocabulary of people, organizations, locations and topic descriptors. [...] Articles are tagged for persons, places, organizations, titles and topics using a controlled vocabulary that is applied consistently across articles. For instance if one article mentions “Bill Clinton” and another refers to “President William Jefferson Clinton”, both articles will be tagged with “CLINTON, BILL”.
According to the documentation, there are hand-assigned meta annotations (describing text content) using a controlled vocabulary:
- 1.3M persons
- 600k locations
- 600k organizations
as well as algorithmically assigned and manually verified online annotations (tagged within the text):
- 114k persons
- 124k locations
- 136k organizations
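To make the value of a controlled vocabulary concrete, here is a minimal Python sketch of grouping articles by normalized tag and counting which tags appear together. The records, the field names (`persons`, `locations`), and every tag string other than “CLINTON, BILL” are made up for illustration. This is not the corpus's actual file format, and the helpers `index_by_tag` and `cooccurrence_counts` are hypothetical names, not part of any released tooling.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Toy records invented for illustration; NOT the real corpus format.
# Only the tag "CLINTON, BILL" comes from the corpus description.
articles = [
    {"id": 1, "text": "Bill Clinton ...",
     "persons": ["CLINTON, BILL"], "locations": ["ARKANSAS"]},
    {"id": 2, "text": "President William Jefferson Clinton ...",
     "persons": ["CLINTON, BILL"], "locations": ["ARKANSAS"]},
    {"id": 3, "text": "Al Gore ...",
     "persons": ["GORE, AL"], "locations": ["TENNESSEE"]},
]


def index_by_tag(records, field):
    """Map each normalized tag in `field` to the ids of the articles carrying it."""
    index = defaultdict(list)
    for record in records:
        for tag in record.get(field, []):
            index[tag].append(record["id"])
    return dict(index)


def cooccurrence_counts(records, fields):
    """Count how often two distinct tags are attached to the same article."""
    counts = Counter()
    for record in records:
        # Deduplicate and sort so each pair has one canonical ordering.
        tags = sorted({tag for f in fields for tag in record.get(f, [])})
        counts.update(combinations(tags, 2))
    return counts


person_index = index_by_tag(articles, "persons")
pair_counts = cooccurrence_counts(articles, ["persons", "locations"])
```

Because both Clinton articles carry the same normalized tag, a plain dictionary lookup groups them even though their texts use different surface forms. No name matching or coreference resolution is needed.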
Thanks, Peter, for forwarding the news.
Tags: corpus, nyt, text mining
November 1, 2008 at 12:19 pm |
I got the news from Daniel Lemire:
http://www.daniel-lemire.com/blog/
November 3, 2008 at 2:35 pm |
This could be a boon for training common-sense AI systems. Correlations could easily be mined from this kind of structure.
Exciting!
November 16, 2008 at 10:28 pm |
Wait till you can play with the search API that leverages all that data.