<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1507">
<Title>Multilingual Resources for Entity Extraction</Title>
<Section position="1" start_page="0" end_page="2" type="abstr">
<SectionTitle> Abstract </SectionTitle>
<Paragraph position="0"> Progress in human language technology requires increasing amounts of data and annotation in a growing variety of languages. Research in Named Entity extraction is no exception. The Linguistic Data Consortium is creating annotated corpora to support information extraction in English, Chinese, Arabic, and other languages for a variety of US Government-sponsored programs. This paper covers the scope of annotation and research tasks within these programs, describes some of the challenges of multilingual corpus development for entity extraction, and concludes with a description of the corpora developed to support this research, technology development and education.</Paragraph>
<SectionTitle> Introduction </SectionTitle>
<Paragraph position="1"> Ongoing research in human language technology (HLT) requires vast amounts of data for system training and development, plus stable benchmark data to measure ongoing progress. Researchers require ever greater volumes of data, representing a broadening inventory of human languages and ever more sophisticated annotation. This presents a substantial challenge to the HLT community because human annotation and corpus creation are quite costly. New approaches to research require not tens but hundreds and thousands of hours of speech data, and millions of words of text. The availability of high-quality language resources remains a central issue for the many communities involved in basic technology development and education related to language. The role of international data centers continues to evolve to accommodate emerging needs in the speech and language technology community (Liberman and Cieri 2002).</Paragraph>
<Paragraph position="2"> The Linguistic Data Consortium (LDC) was founded in 1992 at the University of Pennsylvania, with seed money from DARPA, specifically to address the need for shared language resources. Since then, LDC has created and published more than 241 linguistic databases and has accumulated considerable experience and skill in managing large-scale, multilingual data collection and annotation projects. LDC has established itself as a center for research into standards and best practices in linguistic resource development, while participating actively in ongoing HLT research.</Paragraph>
<Paragraph position="3"> LDC has had a major role in creating annotated corpora and other resources to support named entity extraction, as well as larger information extraction activities, for a number of years. Current work in this area falls under a handful of research programs. The DARPA Program in Translingual Information Detection, Extraction, and Summarization (TIDES 2002) combines technologies in detection, extraction, summarization, and translation to create systems capable of searching a wide range of streaming multilingual text and speech sources, in real time, to provide effective access for English-speaking users. TIDES core languages are English, Mandarin, and Arabic; second-tier languages are Korean, Spanish, and Japanese. The primary medium is text, though this includes speech recognition output. The TIDES research tasks require broadcast transcripts and news texts to be annotated for entities, relations, and events; categorized by topic; translated; summarized; and processed in a variety of other ways.</Paragraph>
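To make "annotated for entities" concrete, the following minimal sketch shows one common way such annotation can be represented: stand-off character spans over the source text, each labeled with an entity type. This is an invented illustration, not the actual TIDES or ACE annotation format; the sentence, offsets, and type labels are assumptions for the example.

    # Hypothetical stand-off entity annotation: each entity is a character
    # span over the unmodified source text plus a type label. The sentence,
    # offsets, and labels below are invented for illustration.
    text = "LDC was founded in 1992 at the University of Pennsylvania."

    entities = [
        {"start": 0,  "end": 3,  "type": "ORGANIZATION", "string": "LDC"},
        {"start": 19, "end": 23, "type": "DATE", "string": "1992"},
        {"start": 31, "end": 57, "type": "ORGANIZATION",
         "string": "University of Pennsylvania"},
    ]

    # Verify that each recorded span actually matches its surface string.
    for entity in entities:
        assert text[entity["start"]:entity["end"]] == entity["string"]

Stand-off annotation of this kind leaves the source text untouched, which matters when the same documents must also be categorized by topic, translated, and summarized by other tools.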
<Paragraph position="4"> Another of the TIDES Program goals is to produce technology that can be easily ported to handle new natural languages. To this end, the TIDES Surprise Language Exercise (LDC 2003b) challenges researchers to produce working systems for a previously untargeted language within a constrained time span (for instance, a single calendar month).</Paragraph>
<Paragraph position="5"> Currently operating under the TIDES umbrella, the Automatic Content Extraction (ACE) program (NIST 2002) builds on the successes of previous extraction research programs. The objective of the ACE Program is to develop extraction technology to support automatic processing of source language data (in the form of natural text, and as text derived from Optical Character Recognition and Automatic Speech Recognition output). This includes classification, filtering, and selection based on the language content of the source data, i.e., the meaning conveyed by the data. Thus the ACE program requires the development of technologies that automatically detect and characterize this meaning. The ACE research objectives are viewed as the detection and characterization of Entities, Relations, and Events. LDC provides data and annotations to support these program goals.</Paragraph>
<Paragraph position="6"> Another DARPA program, Evidence Extraction and Link Detection (EELD 2002), draws on linguistic resources created by LDC to promote its research goals. The EELD program aims to develop technologies and tools for the automated discovery, extraction, and linking of sparse evidence contained in large amounts of classified and unclassified data sources. EELD is developing detection capabilities to extract relevant data and relationships about people, organizations, and activities from message traffic and open-source data. LDC has provided domain-specific entity-tagged corpora in support of the EELD technology evaluation.</Paragraph>
</Section>
</Paper>