LDC Corpora
We have been a Linguistic Data Consortium (LDC) member for the following years: 1993-1996, 1999-2001, 2006, 2009-2010. LDC corpora are available to members of the Rutgers University community for non-commercial education, research and technology development purposes only. The table below shows the corpora that we have available. To download them, please visit our LDC Corpora Download Information Form page. When you click the link, you will be prompted to log in. Please log in with your NetID to verify your Rutgers University affiliation, and then proceed to receive download instructions. You can sort the table below by clicking a column header. Clicking the same header again will toggle the sort order (ascending/descending). Please click the Catalog ID link to view more information about the corpus on the LDC website. Please contact us at salts-admin [at] rutgers.edu if you have any questions, comments, or feedback.
Catalog ID | Release Year | Name | Data Source(s) | Description |
---|---|---|---|---|
LDC2011T03 | 2011 | OntoNotes Release 4.0 | broadcast conversation, broadcast news, newsgroups, newswire, weblogs | The goal of the OntoNotes project is to annotate a large corpus comprising various genres of text (news, conversational telephone speech, weblogs, usenet newsgroups, broadcast, talk shows) in three languages (English, Chinese, and Arabic) with structural information (syntax and predicate argument structure) and shallow semantics (word sense linked to an ontology and coreference). This release contains the content of earlier releases, and adds newswire, broadcast news, broadcast conversation and web data in English and Chinese and newswire data in Arabic. This cumulative publication consists of 2.4 million words as follows: 300k words of Arabic newswire; 250k words of Chinese newswire, 250k words of Chinese broadcast news, 150k words of Chinese broadcast conversation and 150k words of Chinese web text; and 600k words of English newswire, 200k words of English broadcast news, 200k words of English broadcast conversation and 300k words of English web text. |
LDC2010T22 | 2010 | Manually Annotated Sub-Corpus First Release | broadcast conversation, broadcast news, email, newswire, telephone speech, transcribed speech, varied, web collection | This is the first of three releases of 500,000 words of MASC data developed as part of the American National Corpus (ANC) project. MASC I consists of approximately 80,000 words of contemporary spoken and written American English annotated for a variety of linguistic phenomena. The MASC project is sponsored by the National Science Foundation and was established to address, to the extent possible, many of the obstacles to the creation of large-scale, robust, multiply-annotated corpora of English covering a wide range of genres of written and spoken language data. |
LDC2010T18 | 2010 | ACE Time Normalization (TERN) 2004 English Evaluation Data V1.0 | broadcast news, newswire | This contains the English evaluation data prepared for the 2004 Time Expression Recognition and Normalization (TERN) Evaluation, specifically, English broadcast news and newswire data collected by LDC. |
LDC2010T05 | 2010 | NPS Internet Chatroom Conversations, Release 1.0 | text chat conversations | This corpus consists of 10,567 English posts (45,068 tokens) gathered from age-specific chat rooms of various online chat services in October and November 2006. It is one of the first text-based chat corpora tagged with lexical and discourse information. This might be used to develop stochastic NLP applications that perform tasks such as conversation thread topic detection, author profiling, entity identification, and social network analysis. |
LDC2009T29 | 2009 | ACL Anthology Reference Corpus | journal articles | This corpus is a digital archive of 10,291 research papers in computational linguistics sponsored by the Association for Computational Linguistics (ACL). Also available from the ACL, this release contains most of the papers that appear up to February 2007 in the web-based ACL Anthology, a dynamic repository that currently hosts over 16,500 articles drawn from a range of conferences and workshops as well as past issues of the Computational Linguistics journal. This is designed to be a standard, real-world digital collection testbed for experiments in bibliographic and bibliometric research. |
LDC2009T23 | 2009 | FactBank 1.0 | newswire | FactBank 1.0 consists of 208 documents (over 77,000 tokens) from newswire and broadcast news reports in which event mentions are annotated with their degree of factuality, that is, the degree to which they correspond to those events. |
LDC2009T13 | 2009 | English Gigaword Fourth Edition | newswire | English Gigaword, now being released in its fourth edition, is a comprehensive archive of newswire text data that has been acquired over several years by the LDC at the University of Pennsylvania. This edition includes six distinct international sources of English newswire. |
LDC2009T12 | 2009 | 2008 CoNLL Shared Task Data | news magazine, newswire | This release contains the trial corpus, training corpus, development and test data for the 2008 CoNLL (Conference on Computational Natural Language Learning) Shared Task Evaluation. The 2008 Shared Task developed syntactic dependency annotations, including information such as named-entity boundaries and semantic dependencies that model the roles of both verbal and nominal predicates. |
LDC2009T07 | 2009 | Unified Linguistic Annotation Text Collection | broadcast conversation, broadcast news, email, newswire, telephone speech, varied | This collection consists of two separate corpora: The Language Understanding Annotation Corpus (LDC2009T10) and REFLEX EntityTranslation Training/DevTest (LDC2009T11). LDC2009T10 contains over 9000 words of English text (6949 words) and Arabic text (2183 words) annotated for committed belief, event and entity coreference, dialog acts and temporal relations. Committed belief annotation distinguishes between statements which assert belief or opinion, those which contain speculation, and statements which convey facts or otherwise do not convey belief. Dialog act annotation seeks to determine the forward and backward links between pairs of dialog acts. LDC2009T11 consists of approximately 67.5k words of newswire and weblog text for each of three languages: English, Chinese and Arabic, and is annotated for entities (Person, Organization, Location, Facility, Weapon, Vehicle and GeoPolitical Entity) and TIMEX2 annotation of events and temporal relations. |
LDC2008T24 | 2008 | COMNOM v 1.0 | newswire | This collection consists of the following: COMLEX English Syntax Lexicon (LDC98L21), an English dictionary consisting of approximately 38,000 lemmas with detailed information about the syntactic characteristics of each lexical item and subcategorization (complement structures); and COMLEX Syntax Text Corpus Version 2.0 (LDC96T11). |
LDC2008T19 | 2008 | The New York Times Annotated Corpus | newswire | The New York Times Annotated Corpus contains over 1.8 million articles written and published by the New York Times between January 1, 1987 and June 19, 2007 with article metadata provided by the New York Times Newsroom, the New York Times Indexing Service and the online production staff at nytimes.com. |
LDC2006T13 | 2006 | Web 1T 5-gram Version 1 | web collection | This data set, contributed by Google Inc., contains English word n-grams and their observed frequency counts. The length of the n-grams ranges from unigrams (single words) to five-grams. This data would be useful for statistical language modeling, e.g., for machine translation or speech recognition, as well as for other uses. |
LDC2006T08 | 2006 | TimeBank 1.2 | newswire | The TimeBank Corpus contains 183 news articles that have been annotated with temporal information, adding events, times and temporal links between events and times. |
LDC2006T06 | 2006 | ACE 2005 Multilingual Training Corpus | broadcast conversation, broadcast news, newsgroups, weblogs | This publication contains the complete set of English, Arabic and Chinese training data for the 2005 Automatic Content Extraction (ACE) technology evaluation. The corpus consists of data of various types annotated for entities, relations and events, and was created by the Linguistic Data Consortium with support from the ACE Program. This data was previously distributed as an e-corpus (LDC2005E18) to participants in the 2005 ACE evaluation. The objective of the ACE program is to develop automatic content extraction technology to support automatic processing of human language in text form. A complete description of the ACE 2005 Evaluation can be found on the ACE Program website maintained by the National Institute of Standards and Technology (NIST). |
LDC2004T14 | 2004 | Proposition Bank I | newswire | This is a semantic annotation of the Wall Street Journal section of Treebank-2. More specifically, each verb occurring in the Treebank has been treated as a semantic predicate and the surrounding text has been annotated for arguments and adjuncts of the predicate. The verbs have also been tagged with coarse-grained senses and with inflectional information. |
LDC99T42 | 1999 | Treebank-3 | microphone speech, newswire, telephone speech, transcribed speech, varied | Treebank-3 contains the materials of Treebank-2 and the following new materials: Switchboard tagged, dysfluency-annotated, and parsed text; and Brown parsed text. The Penn Treebank (PTB) project selected 2,499 stories from a three-year Wall Street Journal (WSJ) collection of 98,732 stories for syntactic annotation, and these 2,499 stories have been distributed in both Treebank-2 and Treebank-3 releases of PTB. Treebank-2 includes the raw text for each story. |
LDC97S62 | 1997 | Switchboard-1 Release 2 | telephone conversations | Switchboard is a collection of about 2,400 two-sided telephone conversations among 543 speakers (302 male, 241 female) from all areas of the United States. A computer-driven “robot operator” system handled the calls, giving the caller appropriate recorded prompts, selecting and dialing another person (the callee) to take part in a conversation, introducing a topic for discussion and recording the speech from the two subjects into separate channels until the conversation was finished. About 70 topics were provided, of which about 50 were used frequently. Selection of topics and callees was constrained so that: (1) no two speakers would converse together more than once and (2) no one spoke more than once on a given topic. |
LDC95T7 | 1995 | Treebank-2 | microphone speech, newswire, transcribed speech, varied | This corpus contains over 1.6 million words of hand-parsed material from the Dow Jones News Service (one million words of 1989 Wall Street Journal material annotated in Treebank-2 style), plus an additional one million words tagged for part-of-speech. It also contains the first fully parsed version of the Brown Corpus, which has also been completely retagged using the Penn Treebank (PTB) tag set. Also included are tagged and parsed data from Department of Energy abstracts, IBM computer manuals, MUC-3 and ATIS. |
LDC93T3A | 1993 | TIPSTER Complete (3CDs) | newswire, varied | The TIPSTER project was developed in an effort to significantly advance the state of the art in effective document detection (information retrieval) and data extraction from large, real-world data collections. The detection data comprises a new test collection built at NIST to be used both for the TIPSTER project and the related TREC project. The documents in the test collection are varied in style, size and subject domain. The first disk contains material from the Wall Street Journal (1986, 1987, 1988, 1989), the AP Newswire (1989), the Federal Register (1989), information from Computer Select disks (Ziff-Davis Publishing) and short abstracts from the Department of Energy. The second disk contains information from the same sources, but from different years. The third disk contains more information from the Computer Select disks, plus material from the San Jose Mercury News (1991), more AP newswire (1990) and about 250 megabytes of formatted U.S. Patents. The format of all the documents is relatively clean and easy to use, with SGML-like tags separating documents and document fields. There is no part-of-speech tagging or breakdown into individual sentences or paragraphs, as the purpose of this collection is to test retrieval against real-world data. |