File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/concl/04/p04-1028_concl.xml
Size: 4,957 bytes
Last Modified: 2025-10-06 13:54:02
<?xml version="1.0" standalone="yes"?> <Paper uid="P04-1028"> <Title>Mining metalinguistic activity in corpora to create lexical resources using Information Extraction techniques: the MOP system</Title> <Section position="7" start_page="0" end_page="0" type="concl"> <SectionTitle> 5 Results, comparisons and discussion </SectionTitle> <Paragraph position="0"> The DEFINDER system (Klavans et al., 2001) at Columbia University is, to my knowledge, the only one fully comparable with MOP in both scope and goals, but some basic differences between them exist. First, DEFINDER examines user-oriented documents that are bound to contain fully developed definitions for the layman, as the general goal of the PERSIVAL project is to present medical information to patients in less technical language than that of the reference literature. MOP focuses on leading-edge research papers that present the less predictable informational templates of highly technical language. Secondly, by the very nature of DEFINDER's goals, their qualitative evaluation criteria include readability, usefulness and completeness as judged by lay subjects, criteria which we have not adopted here. Neither have we determined coverage against existing on-line dictionaries, as they have done. Taking into account the above-mentioned differences between the two systems' methods and goals, MOP compares well with the 0.8 Precision and 0.75 Recall of DEFINDER. 
While the resulting MOP &quot;definitions&quot; generally do not present high readability or completeness, these informational segments are not meant to be read by laymen. They are meant to be used by domain lexicographers reviewing existing glossaries for neological change or, for example, in machine-readable form by applications that attempt automatic categorization for semantic rerendering of an expert ontology, since definitional contexts provide sortal information as a natural part of the process of precisely situating a term or concept against the meaning network of interrelated lexical items. The Metalinguistic Information Databases (MIDs) in their present form are not, in all fairness, lexical knowledge bases comparable with the highly structured and sophisticated resources that use inheritance and typed features, like LKB (Copestake et al., 1993). MIDs are semi-structured resources (midway between raw corpora and structured lexical bases) that can be further processed to convert them into usable data sources, along the lines suggested by Vossen and Copestake (1993) for the syntactic kernels of lexicographic definitions, or by Pustejovsky et al. (2002) using corpus analytics to increase the semantic type coverage of the NLM UMLS ontology. Another interesting possibility is to use a dynamically updated MID to trace the conceptual and terminological evolution of a discipline.</Paragraph> <Paragraph position="1"> We believe that the low recall rates in our tests are in part due to the fact that we are dealing with the wider realm of metalinguistic information, as opposed to structured definitional sentences that have been distilled by an expert for consumer-oriented documents. We have opted in favor of exploiting less standardized, non-default metalinguistic information that is being put forward in text because it cannot be assumed to be part of the collective expert-domain competence (Section 2.1). 
In doing so, we have exposed our system to the less predictable and highly charged lexical environment of leading-edge research literature, the cauldron where knowledge and terminological systems are forged in real time, and where scientific meaning and interpretation are constantly debated, modified and agreed upon. We have not performed major customization of the system (like enriching the tagging lexicon with medical terms), in order to preserve the ability to use the system across different domains. Domain customization may improve metrics, but at a cost to portability.</Paragraph> <Paragraph position="2"> The implementation we have described here undoubtedly shows room for improvement in some areas, including adding other patterns for better overall recall rates and deeper parsing for more accurate semantic typing of sentence arguments. Also, the issue of which learning algorithms can better perform the initial filtering of EMO candidates is still very much an open question.</Paragraph> <Paragraph position="3"> Applications that can turn MIDs into truly useful lexical resources by further processing them need to be written. We plan to continue development of our proof-of-concept system to explore those areas. DEFINDER and MOP both show great potential as robust lexical acquisition systems capable of handling the vast electronic resources available today to researchers and laymen alike, helping to make them more accessible and useful. In doing so, they are also fulfilling the promise of NLP techniques as mature and practical technologies.</Paragraph> </Section> </Paper>