The New York Times Annotated Corpus

The New York Times just released (through LDC) a gigantic corpus including:

Over 1.5 million articles manually tagged by The New York Times Index Department with a normalized indexing vocabulary of people, organizations, locations and topic descriptors. […] Articles are tagged for persons, places, organizations, titles and topics using a controlled vocabulary that is applied consistently across articles. For instance if one article mentions “Bill Clinton” and another refers to “President William Jefferson Clinton”, both articles will be tagged with “CLINTON, BILL”.
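To illustrate what a controlled vocabulary buys you, here is a minimal sketch in Python. It is not how the Index Department works (the corpus tags were assigned by hand) and it does not read the corpus format. The `ALIASES` table and the `normalize` and `index_by_tag` helpers are hypothetical names made up for this example. The point is only that once different surface forms map to one tag, articles can be grouped by that tag.

```python
from collections import defaultdict

# Hypothetical alias table: surface form (lowercased) -> controlled-vocabulary tag.
# The two entries mirror the example quoted above; a real vocabulary is far larger.
ALIASES = {
    "bill clinton": "CLINTON, BILL",
    "president william jefferson clinton": "CLINTON, BILL",
}


def normalize(mention):
    """Return the controlled-vocabulary tag for a mention, or None if unknown."""
    # Lowercase and collapse whitespace before the lookup.
    key = " ".join(mention.lower().split())
    return ALIASES.get(key)


def index_by_tag(articles):
    """Group article ids by the normalized tags of the mentions they contain."""
    index = defaultdict(set)
    for article_id, mentions in articles.items():
        for mention in mentions:
            tag = normalize(mention)
            if tag is not None:
                index[tag].add(article_id)
    return index


# Two toy articles that refer to the same person in different ways.
articles = {
    "a1": ["Bill Clinton"],
    "a2": ["President William Jefferson Clinton"],
}
index = index_by_tag(articles)
```

Both toy articles end up under the single key "CLINTON, BILL", which is the consistency property the description emphasizes.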

According to the documentation, there are hand-assigned meta annotations (describing text content) using a controlled vocabulary:

  • 1.3M persons
  • 600k locations
  • 600k organizations

as well as algorithmically assigned and manually verified online annotations (tagged within the text):

  • 114k persons
  • 124k locations
  • 136k organizations

Thanks, Peter, for forwarding the news.


3 Responses to “The New York Times Annotated Corpus”

  1. Peter Turney Says:

    I got the news from Daniel Lemire:

  2. Michael Says:

    This could be a boon for training common-sense AI systems. The correlations could easily be mined from this kind of structure.


  3. Derek Gottfrid Says:

    wait til you can play w/ the search api that leverages all that data.
