<?xml version="1.0" standalone="yes"?>
<Paper uid="H05-2006">
  <Title>NooJ: A Linguistic Annotation System For Corpus Processing</Title>
  <Section position="2" start_page="1" end_page="1" type="metho">
    <SectionTitle>
2 A new software architecture
</SectionTitle>
    <Paragraph position="0"> NooJ's architecture is based on the .NET &quot;Component programming&quot; technology, which goes a step beyond the Object-Oriented approach (Silberztein 2004). This architecture gives it several advantages, including: (1) it allows NooJ to read any document that can be managed on the user's computer. For instance, on a typical MS-Windows computer, NooJ can process corpora in 100+ file formats, including all variants of ASCII, ISO and Unicode, HTML, RTF, XML, MS-WORD, etc.</Paragraph>
    <Paragraph position="1"> (2) it allows other .NET applications to access all of NooJ's public methods via its software component library. For instance, a programmer can easily run a NooJ method to extract sequences of text that match a NooJ grammar from a document that is currently open in the calling application (e.g. MS-WORD). Note: cf. (Silberztein 1999a) for a description of the INTEX toolbox, and (Silberztein 1999b) for a description of its application as a corpus processing system. See various INTEX Web sites for references and information on its applications, workshops and communities: http://intex.univ-fcomte.fr; see the NooJ Web site for a description of NooJ: http://www.nooj4nlp.net.</Paragraph>
  </Section>
  <Section position="3" start_page="1" end_page="10" type="metho">
    <SectionTitle>
3 A new linguistic engine
</SectionTitle>
    <Paragraph position="0"> As a corpus processing system, NooJ's most important characteristic is its linguistic engine, which is based on an annotation system. An annotation is a pair (position, information) stating that, at a certain position in the text, a sequence is associated with a certain piece of information. NooJ processes annotated texts; annotations are stored in each text's annotation structure, which is synchronized with the text buffer. Text annotations that are represented as XML tags can easily be imported into NooJ; for instance, importing the XML text: &lt;N Hum&gt; Mr. John Smith &lt;/N&gt; will produce an annotated text in which the sequence &quot;Mr. John Smith&quot; is annotated with the tag &quot;N+Hum&quot; (annotation category &quot;N&quot;; property &quot;Hum&quot;). NooJ also provides several powerful tools to annotate texts: -- NooJ's morphological parser is capable of analyzing complex word forms, such as Hungarian words and Germanic compounds, as well as tokenizing Asian languages. The morphological parser annotates complex word forms as sequences of annotations. For instance, the contracted word form &quot;don't&quot; is associated with a sequence of two annotations: &lt;do,V+Aux+PR&gt; and &lt;not,ADV+Neg&gt;.</Paragraph>
    <Paragraph position="1"> -- NooJ's lexical parser can process the inflection of large dictionaries for simple and compound words. For instance, the English dictionary contains 100,000+ simple words and 70,000+ compound nouns. NooJ contains large-coverage dictionaries for Arabic, Armenian, Chinese, Danish, English, French, Hungarian, Italian and Spanish. In general, running NooJ's lexical parser adds multiple lexical annotations to a text. The annotation system can represent all types of lexical ambiguities, such as between compounds and sequences of simple words (e.g. &quot;round table&quot;), overlapping or embedded compounds (e.g. &quot;round table mat&quot;), etc. -- NooJ's local grammars are Recursive Transition Networks; they allow users to recognize certain sequences of text, and to associate them with annotations. NooJ's graphical editor contains a dozen development tools to edit, test and debug local grammars, to organize them in libraries, and to apply them to texts, either as queries or to add (or filter out) annotations.</Paragraph>
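    <Paragraph> The annotation structure described above can be pictured with a minimal Python sketch. It is not NooJ's implementation: the Annotation class, the annotate and starting_at helpers, the sample sentence and the tags given to the words of the round table example are all assumptions made for illustration. The sketch only shows that a contracted form can carry two annotations over the same span, and that simple words and overlapping compounds can be stored side by side without choosing between them.
<![CDATA[
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    start: int   # offset of the first character of the annotated sequence
    end: int     # offset just past its last character
    info: str    # linguistic information, e.g. "N+Hum"


def annotate(text, phrase, info):
    """Return one Annotation for each occurrence of phrase in text."""
    found = []
    pos = text.find(phrase)
    while pos != -1:
        found.append(Annotation(pos, pos + len(phrase), info))
        pos = text.find(phrase, pos + 1)
    return found


def starting_at(annotations, position):
    """All annotations whose sequence begins at the given offset."""
    return [a for a in annotations if a.start == position]


# Hypothetical sample sentence; the tags for "round", "table" and the
# compounds are invented for this sketch.
TEXT = "They don't use the round table mat."

ANNOTATIONS = (
    # One contracted word form, two annotations over the same span.
    annotate(TEXT, "don't", "do,V+Aux+PR")
    + annotate(TEXT, "don't", "not,ADV+Neg")
    # Simple words and overlapping compounds are all kept side by side.
    + annotate(TEXT, "round", "round,A")
    + annotate(TEXT, "table", "table,N")
    + annotate(TEXT, "round table", "round table,N")
    + annotate(TEXT, "table mat", "table mat,N")
)
```
]]>
 Asking for the annotations that start at the offset of a word then returns every competing analysis at that position, which is the property the lexical ambiguities above rely on.</Paragraph>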
    <Paragraph position="2"> NooJ's query system and parsers can access any previously inserted annotation. For instance, the following query includes references to word forms (e.g. &quot;mind&quot;) as well as to two annotations (written between angle brackets): (the + these) &lt;N+Hum&gt; &lt;lose&gt; their (mind + temper). Here, &lt;N+Hum&gt; matches all sequences in the text that are associated with an &quot;N&quot; annotation with property &quot;Hum&quot;; these annotations might have been added by NooJ's lexical parser (e.g. for the word &quot;director&quot;), or by a local grammar used to recognize human entities (e.g. for the sequence &quot;head of this company&quot;). Similarly, &lt;lose&gt; matches all sequences of the text that are associated with an annotation whose lemma is &quot;lose&quot;; these annotations might have been added by the lexical parser (for all conjugated forms of &quot;to lose&quot;, e.g. &quot;lost&quot;), or by a local grammar that recognizes compound tenses, e.g. &quot;have not yet lost&quot;. Once all resulting matching sequences, e.g. &quot;These men have not yet lost their mind&quot;, have been indexed, they can be annotated, and their annotations are then instantly available either for other queries or for further cascaded parsing.</Paragraph>
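    <Paragraph> As a rough illustration of how such a query can be evaluated against previously inserted annotations, the Python sketch below matches the query over a tokenized sentence. It is a simplification under stated assumptions, not NooJ's query engine: positions are token indices, and the Annotation class, the words, tag and lemma helpers, the find_matches function and the sample annotations (including the lemma and tag given to the word men) are hypothetical. The point it shows is that a query element may consume a multi-token annotation, such as a compound tense, in a single step.
<![CDATA[
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    start: int                       # index of the first token covered
    end: int                         # index just past the last token covered
    lemma: str
    category: str
    properties: frozenset = frozenset()


def words(*forms):
    """Query element matching one token by its literal word form."""
    def step(tokens, annotations, i):
        if i < len(tokens) and tokens[i].lower() in forms:
            yield i + 1
    return step


def tag(category, *properties):
    """Query element matching annotations with this category and properties."""
    wanted = frozenset(properties)
    def step(tokens, annotations, i):
        for a in annotations:
            if a.start == i and a.category == category and wanted <= a.properties:
                yield a.end
    return step


def lemma(name):
    """Query element matching annotations whose lemma is name."""
    def step(tokens, annotations, i):
        for a in annotations:
            if a.start == i and a.lemma == name:
                yield a.end
    return step


def find_matches(query, tokens, annotations):
    """Return the (start, end) token spans matched by the whole query."""
    matches = []
    for start in range(len(tokens)):
        ends = {start}
        for step in query:
            # Advance every partial match by one query element.
            ends = {e for i in ends for e in step(tokens, annotations, i)}
        matches.extend((start, e) for e in sorted(ends))
    return matches


TOKENS = "These men have not yet lost their mind".split()
ANNOTATIONS = [
    # As if added by a lexical parser (tag invented for this sketch).
    Annotation(1, 2, "man", "N", frozenset({"Hum"})),
    # As if added by a local grammar recognizing the compound tense.
    Annotation(2, 6, "lose", "V"),
]
QUERY = [words("the", "these"), tag("N", "Hum"), lemma("lose"),
         words("their"), words("mind", "temper")]
```
]]>
 Calling find_matches(QUERY, TOKENS, ANNOTATIONS) yields the span covering the whole sentence; that span could in turn be stored as a new annotation and reused by later queries, in the spirit of the cascaded parsing mentioned above.</Paragraph>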
    <Paragraph position="3"> Annotated texts can be used to build complex concordances, annotate or color texts, perform a syntactic or semantic analysis, etc.</Paragraph>
    <Paragraph position="4"> NooJ's linguistic engine, dictionaries and grammars are multilingual; that should allow users to implement translation functionalities.</Paragraph>
  </Section>
  <Section position="4" start_page="10" end_page="10" type="metho">
    <SectionTitle>
4 Conclusion
</SectionTitle>
    <Paragraph position="0"> Although NooJ has just come out and its technology is quite new, it is already being used by several research teams in a variety of projects. See the proceedings of the &quot;Eighth INTEX/NooJ workshop&quot; at NooJ's Web site: http://www.nooj4nlp.net.</Paragraph>
  </Section>
  <Section position="5" start_page="10" end_page="10" type="metho">
    <SectionTitle>
5 Demo
</SectionTitle>
    <Paragraph position="0"> Participants will use NooJ in order to build a named-entity recognizer from the ground up.</Paragraph>
    <Paragraph position="1"> Participants will learn how to apply a simple query to a corpus and build its corresponding concordance. Then I will demonstrate the building of a local grammar with NooJ's graphical editor, followed by a presentation of the organization of local grammars in re-usable libraries that can be shared and integrated into larger grammars.</Paragraph>
  </Section>
</Paper>