SALTS Lab

Laboratory for the Study of Applied Language Technology and Society


LDC Corpora

We have held Linguistic Data Consortium (LDC) membership for the following years: 1993-1996, 1999-2001, 2006, 2009-2010. LDC corpora are available to members of the Rutgers University community for non-commercial education, research, and technology development purposes only.

The table below shows the corpora that we have available. To download them, please visit our LDC Corpora Download Information Form page. When you click the link, you will be prompted to log in. Please log in with your NetID to verify your Rutgers University affiliation, and then proceed to receive download instructions.

You can sort the table below by clicking the column header. Clicking the same header again will change the sort order (ascending/descending). Please click the Catalog ID link to view more information about the corpus on the LDC website.

Please contact us at salts-admin [at] rutgers.edu if you have any questions, comments, or feedback.

Catalog ID Release Year Name Data Source(s) Description
LDC2011T03 2011 OntoNotes Release 4.0 broadcast conversation, broadcast news, newsgroups, newswire, weblogs

The goal of the OntoNotes project is to annotate a large corpus comprising various genres of text (news, conversational telephone speech, weblogs, usenet newsgroups, broadcast, talk shows) in three languages (English, Chinese, and Arabic) with structural information (syntax and predicate argument structure) and shallow semantics (word sense linked to an ontology and coreference).

This release contains the content of earlier releases, and adds newswire, broadcast news, broadcast conversation and web data in English and Chinese and newswire data in Arabic. This cumulative publication consists of 2.4 million words as follows: 300k words of Arabic newswire; 250k words of Chinese newswire, 250k words of Chinese broadcast news, 150k words of Chinese broadcast conversation and 150k words of Chinese web text; and 600k words of English newswire, 200k words of English broadcast news, 200k words of English broadcast conversation and 300k words of English web text.

LDC2010T22 2010 Manually Annotated Sub-Corpus First Release broadcast conversation, broadcast news, email, newswire, telephone speech, transcribed speech, varied, web collection This is the first of three releases of 500,000 words of MASC data developed as part of the American National Corpus (ANC) project. MASC I consists of approximately 80,000 words of contemporary spoken and written American English annotated for a variety of linguistic phenomena. The MASC project is sponsored by the National Science Foundation and was established to address, to the extent possible, many of the obstacles to the creation of large-scale, robust, multiply-annotated corpora of English covering a wide range of genres of written and spoken language data.
LDC2010T18 2010 ACE Time Normalization (TERN) 2004 English Evaluation Data V1.0 broadcast news, newswire This contains the English evaluation data prepared for the 2004 Time Expression Recognition and Normalization (TERN) Evaluation, specifically, English broadcast news and newswire data collected by LDC.
LDC2010T05 2010 NPS Internet Chatroom Conversations, Release 1.0 text chat conversations This corpus consists of 10,567 English posts (45,068 tokens) gathered from age-specific chat rooms of various online chat services in October and November 2006. It is one of the first text-based chat corpora tagged with lexical and discourse information. The corpus might be used to develop stochastic NLP applications that perform tasks such as conversation thread topic detection, author profiling, entity identification, and social network analysis.
LDC2009T29 2009 ACL Anthology Reference Corpus journal articles This corpus is a digital archive of 10,291 research papers in computational linguistics sponsored by the Association for Computational Linguistics (ACL). Also available from the ACL, this release contains most of the papers that appear up to February 2007 in the web-based ACL Anthology, a dynamic repository that currently hosts over 16,500 articles drawn from a range of conferences and workshops as well as past issues of the Computational Linguistics journal. This is designed to be a standard, real-world digital collection testbed for experiments in bibliographic and bibliometric research.
LDC2009T23 2009 FactBank 1.0 newswire FactBank 1.0 consists of 208 documents (over 77,000 tokens) from newswire and broadcast news reports in which event mentions are annotated with their degree of factuality, that is, the degree to which they correspond to real-world events.
LDC2009T13 2009 English Gigaword Fourth Edition newswire

English Gigaword, now being released in its fourth edition, is a comprehensive archive of newswire text data that has been acquired over several years by the LDC at the University of Pennsylvania. The six distinct international sources of English newswire included in this edition are the following:

  • Agence France-Presse, English Service (afp_eng)
  • Associated Press Worldstream, English Service (apw_eng)
  • Central News Agency of Taiwan, English Service (cna_eng)
  • Los Angeles Times/Washington Post Newswire Service (ltw_eng)
  • New York Times Newswire Service (nyt_eng)
  • Xinhua News Agency, English Service (xin_eng)
LDC2009T12 2009 2008 CoNLL Shared Task Data news magazine, newswire This corpus contains the trial corpus, training corpus, development and test data for the 2008 CoNLL (Conference on Computational Natural Language Learning) Shared Task Evaluation. The 2008 Shared Task developed syntactic dependency annotations, including information such as named-entity boundaries, and semantic dependencies that model the roles of both verbal and nominal predicates.
LDC2009T07 2009 Unified Linguistic Annotation Text Collection broadcast conversation, broadcast news, email, newswire, telephone speech, varied

This collection consists of two separate corpora: The Language Understanding Annotation Corpus (LDC2009T10) and REFLEX EntityTranslation Training/DevTest (LDC2009T11).

LDC2009T10 contains over 9000 words of English text (6949 words) and Arabic text (2183 words) annotated for committed belief, event and entity coreference, dialog acts and temporal relations. Committed belief annotation distinguishes between statements which assert belief or opinion, those which contain speculation, and statements which convey facts or otherwise do not convey belief. Dialog act annotation seeks to determine the forward and backward links between pairs of dialog acts.

LDC2009T11 consists of approximately 67.5k words of newswire and weblog text for each of three languages: English, Chinese and Arabic. It is annotated for entities (Person, Organization, Location, Facility, Weapon, Vehicle and GeoPolitical Entity) and includes TIMEX2 annotation of events and temporal relations.

LDC2008T24 2008 COMNOM v 1.0 newswire This collection consists of the following: COMLEX English Syntax Lexicon (LDC98L21), an English dictionary consisting of approximately 38,000 lemmas with detailed information about the syntactic characteristics of each lexical item and subcategorization (complement structures); and COMLEX Syntax Text Corpus Version 2.0 (LDC96T11).
LDC2008T19 2008 The New York Times Annotated Corpus newswire

The New York Times Annotated Corpus contains over 1.8 million articles written and published by the New York Times between January 1, 1987 and June 19, 2007 with article metadata provided by the New York Times Newsroom, the New York Times Indexing Service and the online production staff at nytimes.com. The corpus includes:

  • Over 1.8 million articles (excluding wire services articles that appeared during the covered period).
  • Over 650,000 article summaries written by library scientists.
  • Over 1,500,000 articles manually tagged by library scientists with tags drawn from a normalized indexing vocabulary of people, organizations, locations and topic descriptors.
  • Over 275,000 algorithmically-tagged articles that have been hand verified by the online production staff at nytimes.com.
  • Java tools for parsing corpus documents from .xml into a memory resident object.
LDC2006T13 2006 Web 1T 5-gram Version 1 web collection This data set, contributed by Google Inc., contains English word n-grams and their observed frequency counts. The length of the n-grams ranges from unigrams (single words) to five-grams. This data would be useful for statistical language modeling, e.g., for machine translation or speech recognition, as well as for other uses.
LDC2006T08 2006 TimeBank 1.2 newswire The TimeBank Corpus contains 183 news articles that have been annotated with temporal information, adding events, times and temporal links between events and times.
LDC2006T06 2006 ACE 2005 Multilingual Training Corpus broadcast conversation, broadcast news, newsgroups, weblogs This publication contains the complete set of English, Arabic and Chinese training data for the 2005 Automatic Content Extraction (ACE) technology evaluation. The corpus, which consists of data of various types annotated for entities, relations and events, was created by the Linguistic Data Consortium with support from the ACE Program and additional assistance from LDC. This data was previously distributed as an e-corpus (LDC2005E18) to participants in the 2005 ACE evaluation. The objective of the ACE program is to develop automatic content extraction technology to support automatic processing of human language in text form. A complete description of the ACE 2005 Evaluation can be found on the ACE Program website maintained by the National Institute of Standards and Technology (NIST).
LDC2004T14 2004 Proposition Bank I newswire This is a semantic annotation of the Wall Street Journal section of Treebank-2. More specifically, each verb occurring in the Treebank has been treated as a semantic predicate and the surrounding text has been annotated for arguments and adjuncts of the predicate. The verbs have also been tagged with coarse-grained senses and with inflectional information.
LDC99T42 1999 Treebank-3 microphone speech, newswire, telephone speech, transcribed speech, varied Treebank-3 contains the materials of Treebank-2 and the following new materials: Switchboard tagged, dysfluency-annotated, and parsed text; and Brown parsed text. The Penn Treebank (PTB) project selected 2,499 stories from a three-year Wall Street Journal (WSJ) collection of 98,732 stories for syntactic annotation, and these 2,499 stories have been distributed in both the Treebank-2 and Treebank-3 releases of PTB. Treebank-2 includes the raw text for each story.
LDC97S62 1997 Switchboard-1 Release 2 telephone conversations Switchboard is a collection of about 2,400 two-sided telephone conversations among 543 speakers (302 male, 241 female) from all areas of the United States. A computer-driven “robot operator” system handled the calls, giving the caller appropriate recorded prompts, selecting and dialing another person (the callee) to take part in a conversation, introducing a topic for discussion and recording the speech from the two subjects into separate channels until the conversation was finished. About 70 topics were provided, of which about 50 were used frequently. Selection of topics and callees was constrained so that: (1) no two speakers would converse together more than once and (2) no one spoke more than once on a given topic.
LDC95T7 1995 Treebank-2 microphone speech, newswire, transcribed speech, varied This corpus contains over 1.6 million words of hand-parsed material from the Dow Jones News Service (one million words of 1989 Wall Street Journal material annotated in Treebank-2 style), plus an additional one million words tagged for part-of-speech. It also contains the first fully parsed version of the Brown Corpus, which has also been completely retagged using the Penn Treebank (PTB) tag set. Also included are tagged and parsed data from Department of Energy abstracts, IBM computer manuals, MUC-3 and ATIS.
LDC93T3A 1993 TIPSTER Complete (3CDs) newswire, varied

The TIPSTER project was developed in an effort to significantly advance the state of the art in effective document detection (information retrieval) and data extraction from large, real-world data collections. The detection data comprises a new test collection built at NIST to be used both for the TIPSTER project and the related TREC project.

The documents in the test collection are varied in style, size and subject domain. The first disk contains material from the Wall Street Journal (1986, 1987, 1988, 1989), the AP Newswire (1989), the Federal Register (1989), information from Computer Select disks (Ziff-Davis Publishing) and short abstracts from the Department of Energy. The second disk contains information from the same sources, but from different years. The third disk contains more information from the Computer Select disks, plus material from the San Jose Mercury News (1991), more AP newswire (1990) and about 250 megabytes of formatted U.S. Patents. The format of all the documents is relatively clean and easy to use, with SGML-like tags separating documents and document fields. There is no part-of-speech tagging or breakdown into individual sentences or paragraphs, as the purpose of this collection is to test retrieval against real-world data.
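As a rough illustration of the SGML-like layout described for TIPSTER above, the following Python sketch splits a text stream into documents and fields. It is only a minimal sketch under stated assumptions: the tag names (DOC, DOCNO, HL, TEXT), the sample content, and the helper name split_documents are hypothetical and chosen for illustration, so please check the documentation shipped with the corpus for the actual tag set of each source.

```python
import re

# Made-up sample text. The tag names and document IDs are assumptions
# for illustration only, not copied from the corpus.
SAMPLE = """<DOC>
<DOCNO> EXAMPLE-0001 </DOCNO>
<HL> Example headline </HL>
<TEXT>
First example document body.
</TEXT>
</DOC>
<DOC>
<DOCNO> EXAMPLE-0002 </DOCNO>
<TEXT>
Second example document body.
</TEXT>
</DOC>
"""

# One match per document; DOTALL lets "." span line breaks.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
# One match per field; the backreference \1 requires a matching close tag.
FIELD_RE = re.compile(r"<([A-Z0-9]+)>(.*?)</\1>", re.DOTALL)


def split_documents(raw):
    """Return one dict per <DOC> block, mapping field tag to stripped text."""
    documents = []
    for match in DOC_RE.finditer(raw):
        fields = {tag: body.strip()
                  for tag, body in FIELD_RE.findall(match.group(1))}
        documents.append(fields)
    return documents


docs = split_documents(SAMPLE)
for doc in docs:
    print(doc["DOCNO"], "->", doc["TEXT"])
```

Regular expressions are used here only because the sample is flat and well formed. Files with nested or unclosed tags would likely need a more tolerant parser.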