The New York Times Annotated Corpus

November 1, 2008

The New York Times just released, through the Linguistic Data Consortium (LDC), a gigantic corpus including:

Over 1.5 million articles manually tagged by The New York Times Index Department with a normalized indexing vocabulary of people, organizations, locations and topic descriptors. [...] Articles are tagged for persons, places, organizations, titles and topics using a controlled vocabulary that is applied consistently across articles. For instance if one article mentions “Bill Clinton” and another refers to “President William Jefferson Clinton”, both articles will be tagged with “CLINTON, BILL”.
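To make the idea of a normalized indexing vocabulary concrete, here is a minimal Python sketch. It is only an illustration: the corpus was tagged by hand by the Index Department, not by a lookup table, and the `CONTROLLED_VOCABULARY` mapping and `tag_article` function are hypothetical names made up for this example.

```python
# Hypothetical controlled vocabulary: lowercase surface form -> canonical tag.
# The two entries mirror the example from the announcement.
CONTROLLED_VOCABULARY = {
    "bill clinton": "CLINTON, BILL",
    "president william jefferson clinton": "CLINTON, BILL",
}


def tag_article(text, vocabulary=CONTROLLED_VOCABULARY):
    """Return the sorted canonical tags whose surface forms occur in text."""
    lowered = text.lower()
    return sorted({tag for form, tag in vocabulary.items() if form in lowered})


article_a = "Bill Clinton spoke in New York on Friday."
article_b = "President William Jefferson Clinton signed the bill."

# Both articles end up with the same normalized tag.
print(tag_article(article_a))
print(tag_article(article_b))
```

The point is that different surface forms collapse to a single tag, so a search for one tag retrieves every article about the same person.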

According to the documentation, there are hand-assigned meta annotations (describing text content) using a controlled vocabulary:

  • 1.3M persons
  • 600k locations
  • 600k organizations

as well as algorithmically assigned and manually verified online annotations (tagged within the text):

  • 114k persons
  • 124k locations
  • 136k organizations

Thanks, Peter, for forwarding the news.
