<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1507"> <Title>Multilingual Resources for Entity Extraction</Title> <Section position="3" start_page="2" end_page="2" type="intro"> <SectionTitle> 4 Corpora </SectionTitle> <Paragraph position="0"> As part of the ACE program, and to further support both the DARPA TIDES and DARPA EELD Programs, LDC has developed a number of annotated corpora. These corpora all draw on broadcast news, newspaper and newswire data.</Paragraph> <Paragraph position="1"> Sources include data from the Topic Detection and Tracking corpora, Chinese Treebank, Arabic Treebank and other news materials.</Paragraph> <Paragraph position="2"> Corpus development for the ACE program began in 1999. Initially, the Pilot Phase was designed to develop a basic task definition for entity detection and tracking. Multiple research sites including MITRE, BBN, NYU, and LDC annotated the same set of 15,000 words of English data to establish a shared understanding of the annotation guidelines and resolve any inter-annotator discrepancies. This data supported technology evaluations in May and November 2000.</Paragraph> <Paragraph position="3"> In ACE Phase 1, the research and annotation tasks were expanded to address metonymy and generic entities. Multiple research sites joined LDC in annotating 180,000 words of training data to support a February 2002 evaluation. LDC was solely responsible for annotating an additional 45,000 words of evaluation data.</Paragraph> <Paragraph position="4"> ACE Phase 2 required research sites to additionally detect and characterize relations between entities. During this phase of ACE, LDC acted as sole annotation site and also took on responsibility for developing and maintaining annotation guidelines. Phase 2 used the entire ACE Phase 1 corpus as training data, and added an additional 45,000 words of new evaluation data. Both training and evaluation data were annotated for entities plus relations. 
In support of the EELD Program, LDC annotators tagged another 30,000 words of domain-specific training data plus 20,000 words of test data for entities and relations. A September 2002 evaluation tested system performance for both Entities and Relations.</Paragraph> <Paragraph position="5"> LDC is currently producing English test data to augment the existing corpora in support of a Fall 2003 TIDES extraction evaluation; in addition, LDC is creating data and annotations for multilingual extraction research in Chinese and Arabic. 100,000 words of Chinese Treebank and 10,000 words of Arabic Treebank have already been annotated for entities.</Paragraph> <Paragraph position="6"> Alongside corpus development, LDC is working to expand and refine the existing set of ACE tasks. These modifications are being made with input from both the TIDES Extraction and ACE communities. For ACE Phase 3, LDC will annotate 300,000 words of data in each of three languages: English, Chinese and Arabic; pilot annotation in Farsi is also targeted. Ultimately, all three annotation tasks -- entities, relations and events -- will be represented in the data. The corpora developed by LDC to support ACE, EELD, and TIDES Extraction are currently available to program participants only (LDC 2003c). General publication of the ACE Pilot and ACE Phase 1 Corpora is slated for Summer 2003; upon publication, the data will be available to LDC members as well as non-members. The remaining ACE and related corpora will be published after the conclusion of these programs' evaluation cycles.</Paragraph> <Paragraph position="7"> Outside of the ACE program, LDC has developed a handful of additional resources for multilingual extraction research. As part of the TIDES Surprise Language Exercise, LDC collects and creates linguistic resources in a previously untargeted language in an extremely compressed time span. 
During a two-week dry run in March 2003, the target was Cebuano, a language of the Philippines. Within the span of a few days, LDC created 250,000 words of monolingual text, built a 20,000-word lexicon, created 25,000 words of parallel text, built a morphological parser, and completed named entity tagging of 32,000 words of text.</Paragraph> <Paragraph position="8"> Given the severe time constraints of the exercise, named entity annotators used a trimmed-down version of the MUC Named Entity Guidelines rather than the more complex full MUC or ACE guidelines. Despite the time constraints, inter-annotator consistency remained high when LDC-tagged data was compared with data tagged by annotators at BBN. A similar set of resources for a new surprise language will be developed during the Surprise Language evaluation in June 2003. All of the data developed for Surprise Language is currently available to TIDES participants, and will be released as a general publication at the conclusion of the Exercise.</Paragraph> <Paragraph position="9"> A final resource created to support named entities within information extraction more broadly is the Xinhua Chinese-English Named Entity list, created from Xinhua Newswire's proper name and who's who databases. This corpus contains nearly one million proper names of various kinds, including approximately 500,000 person names, 300,000 place names, 30,000 organization names, and tens of thousands of other name types. The data provides both Chinese to English and English to Chinese name pairs. This corpus, slated for publication in Summer 2003, is currently available to TIDES participants.</Paragraph> <Paragraph position="10"> Much of the material described above is based upon large volumes of text and speech best collected from commercial providers. 
Commercial sources may require the negotiation of agreements that permit the distribution of data to researchers while constraining the use of the material to linguistic education, research, and technology development. LDC coordinates all necessary intellectual property arrangements for data developed under multiple research programs including TIDES, ACE, and EELD to make resources gathered in this way available to the broader research communities. Sponsored common task research programs like TIDES and ACE rely heavily upon such shared resources. LDC was in fact created specifically to facilitate research sharing. In order to allow for expedited delivery of data to a group of researchers participating in a common task evaluation, LDC has developed a new data distribution method known as ECorpora. ECorpora target expedited delivery of training and devtest data in support of formal evaluations. Upon the conclusion of the formal task evaluation, pending negotiations with research sponsors and program coordinators, LDC publishes data more broadly to give all communities working in linguistic education, research, and technology development access to these valuable resources.</Paragraph> </Section> </Paper>