<?xml version="1.0" standalone="yes"?>
<Paper uid="W98-0611">
  <Title>Tools for locating noun phrases with finite state transducers.</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Starting from scratch
</SectionTitle>
    <Paragraph position="0"> Let us suppose we want to describe (in order to search for them) the semantic classes corresponding to the occupation officer (whose French equivalent, officier, is less ambiguous).</Paragraph>
    <Paragraph position="1"> We start by searching for the sentences containing the word officer alone, with the automaton of figure 1. In our corpus, 85 occurrences are found; the concordance is then sorted according to right contexts:
  senior cabinet officer after an embarrassing
  here in "An Officer and a Gentleman"
  as a paratroop officer and had held
  chief operating officer and make available
  former Marine officer and National Security
  the executive officer asks from topside.</Paragraph>
    <Paragraph position="2">
  African loan officer at one leading
  a four star officer based in Hawaii
  a former police officer convicted of murder
We first observe that the meaning army is mixed with the meanings chief executive, loan or chief operating officer. These meanings are not relevant at this stage. The left context seems more interesting than the right context for disambiguation. So, we sort the concordance by left contexts, and extract all the different adjectives or nouns that make explicit the occupation of army: intelligence, police, Marine... We group these modifiers into four semantic categories: intelligence, army, custom and police. Depending on the need, this grouping could be made more precise.</Paragraph>
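The sorting step described above can be sketched in a few lines of Python. This is only an illustration of the method, not the actual INTEX implementation: the function names, the toy corpus and the context width are all our own.

```python
# Sketch of a KWIC concordance sorted by left context.  Sorting on the
# reversed left context groups together lines that share the word
# immediately before the keyword (e.g. all "police officer" lines).
def concordance(corpus, keyword, width=4):
    """Return (left, keyword, right) word triples for each occurrence."""
    words = corpus.split()
    hits = []
    for i, w in enumerate(words):
        if w.lower().strip('.,;:"\'') == keyword:
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            hits.append((left, w, right))
    return hits

def sort_by_left(hits):
    # Compare left contexts from the keyword outward.
    return sorted(hits, key=lambda h: h[0].split()[::-1])

corpus = ("a former police officer convicted of murder ; "
          "a four star officer based in Hawaii ; "
          "a senior police officer said today")
for left, kw, right in sort_by_left(concordance(corpus, "officer")):
    print(f"{left:>25} | {kw} | {right}")
```

With this sort, the two "police officer" lines land next to each other, which is exactly what makes the modifiers easy to collect by eye.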
    <Paragraph position="3"> Then we construct a transducer incorporating these modifiers, which gives as output, for each sequence, the associated class (figure 2). Notice that two categories can be combined: army and intelligence.</Paragraph>
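The behaviour of such a transducer can be sketched as a lookup table from modifier to emitted class. The table below is illustrative only; the actual modifier inventory and class names of figure 2 may differ.

```python
# Minimal sketch of a transducer that reads "<modifier> officer" and
# outputs the associated semantic class (cf. figure 2).  The
# modifier-to-class table is a hypothetical stand-in.
MODIFIER_CLASS = {
    "intelligence": "intelligence",
    "military": "army",
    "marine": "army",
    "police": "police",
    "customs": "custom",
}

def classify(phrase):
    """Return the class emitted for '<modifier> officer', or None."""
    tokens = phrase.lower().split()
    if len(tokens) == 2 and tokens[1] == "officer":
        return MODIFIER_CLASS.get(tokens[0])
    return None

print(classify("police officer"))
print(classify("Marine officer"))
```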
    <Paragraph position="4"> We find the same 85 occurrences, but about one third of them make the word officer more precise:
  the highest ranking army officer killed in
  the French Army officer who commands
  engineer and army officer who served as
  a senior intelligence officer who defected
  a national intelligence officer, George Kolt,
  intelligence officer, Mr. Inman's
  the former Marine officer and National
  As a Soviet military intelligence officer in London
  ranking American military officer on active duty
  highest ranking U.S. military officer to visit Vietnam
  a former military officer with the vision
In the same way, looking at the alphabetically sorted left and right contexts brings out new modifiers: to the right on active duty, and to the left Soviet, American, U.S., ...</Paragraph>
    <Paragraph position="5"> former and highest ranking. Adding these possibilities, plus equivalent forms that come instantly to mind 1, we obtain the transducer of figure 3. The number of different paths of this automaton is now 3,348, whereas the number of concordance lines examined is roughly 20. We could go on, but as the automaton becomes larger, it is more and more difficult to handle on one screen. Moreover, semantically, we clearly see that some categories (e.g. nationality) will become far too important to be represented in one graph. The automaton must be clustered.</Paragraph>
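The path count quoted above (3,348) is the standard measure of how many distinct sequences an acyclic automaton recognizes; it can be computed by dynamic programming over the states. The tiny automaton below is our own illustration, not the paper's figure 3.

```python
# Counting the distinct paths of an acyclic automaton by memoized
# recursion: the number of paths from a state is the sum over its
# outgoing transitions of the paths from each target state.
from functools import lru_cache

def count_paths(edges, start, final):
    """Count distinct start->final paths in an acyclic automaton."""
    @lru_cache(maxsize=None)
    def paths_from(state):
        if state == final:
            return 1
        return sum(paths_from(t) for t in edges.get(state, []))
    return paths_from(start)

# Illustrative shape:
# 0 --(Soviet|American|U.S.)--> 1 --(military|intelligence)--> 2 --officer--> 3
edges = {0: [1, 1, 1], 1: [2, 2], 2: [3]}
print(count_paths(edges, 0, 3))  # 3 * 2 * 1 = 6
```

This multiplicative growth is why a handful of boxes (roughly 20 concordance lines' worth of modifiers) already yields thousands of paths.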
  </Section>
  <Section position="4" start_page="0" end_page="82" type="metho">
    <SectionTitle>
3 Partitioning (manual clustering)
</SectionTitle>
    <Paragraph position="0"> Nationality is a homogeneous semantic category, composed of simple words or compounds (South Korean); it can easily be stored in a list. We begin to construct it, starting with the three nationalities found: American, Soviet, and French. To be compatible with the search engine described in (Senellart, 1998b), an output is also associated with these words, for example the nationality itself, as in: American/American, French/French, Soviet/Soviet. We must also construct a list of countries, as we find productive forms such as U.S military officer, France's officer... The list of countries is thus initialized from the few countries we find instantly, or can think of. To use lists, we define special nodes in the automaton (gray nodes) containing the name of the dictionary.</Paragraph>
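A gray node can be sketched as a lookup into an external word/output dictionary instead of an inline list of boxes. The dictionary names and entries below mirror the examples in the text but are otherwise our own illustration.

```python
# Sketch of a "gray node": the automaton refers to a named external
# dictionary whose entries carry an output (word/output pairs such as
# American/American), rather than listing the words inline.
DICTIONARIES = {
    "Nationality": {"American": "American", "Soviet": "Soviet",
                    "French": "French", "South Korean": "South Korean"},
    "Country": {"U.S": "U.S", "France": "France"},
}

def match_gray_node(dict_name, token):
    """Return the output associated with token in the named list, or None."""
    return DICTIONARIES[dict_name].get(token)

print(match_gray_node("Nationality", "French"))
print(match_gray_node("Country", "France"))
```

Separating the list from the graph this way means the Nationality dictionary can grow without the automaton itself changing.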
    <Paragraph position="1"> We show in the automaton of figure 4 an example of such use.</Paragraph>
    <Paragraph position="2"> In fact, we can localize other homogeneous sub-categories in the automaton of figure 3: senior, high ranking, four star... indicate the rank of the officer. Here it is more difficult to gather them into a single list, as their combinations are not trivial; we thus construct a sub-automaton 1 (figure 5). 1 At this stage, we do not pretend to be complete; we are only building the main structure.</Paragraph>
    <Paragraph position="3"> All the paths of this automaton combined with officer correspond approximately to the senior officer rank. It will be sufficient for the search engine. Finally, we specialize our initial automaton to military officers only, as we clearly see that the rank and other modifiers associated with the police category will probably not be the same as those associated with the military category. We then create the military officer automaton (figure 6). Two main ideas should be kept in mind about the clustering methodology: - homogeneous semantic sub-categories in an automaton should be put into sub-automata, and - an automaton containing two semantic notions that can be separated should be transformed into a simpler automaton containing only one semantic category. This clustering has many advantages: first, it allows a better readability of automata at their different levels; second, it factorizes homogeneous semantic units that we can reuse for other purposes.</Paragraph>
    <Paragraph position="4"> And third, it is very useful for processing, as we will show later.</Paragraph>
    <Paragraph position="5"> Now that we have established the main structure, we must complete the different automata and dictionaries. We show in the next section a general method that permits us to quickly enrich 2 the database.</Paragraph>
  </Section>
  <Section position="5" start_page="82" end_page="82" type="metho">
    <SectionTitle>
4 Use of variables
</SectionTitle>
    <Paragraph position="0"> The automaton of figure 6 represents the largest structure we found around the word officer. The initial goal was to construct the nominals of the army officer semantic category.</Paragraph>
    <Paragraph position="1"> Such categories certainly contain sequences that do not contain the base word officer at all (synonyms, hyponyms, or any semantically related nouns). To find such words, we nevertheless now have an important clue: their right and left contexts are close to the contexts of the word officer. We replace the word officer by a variable in the automaton of figure 6. A variable is a word or a sequence of words. We find on the same corpus 1,239 occurrences. The concordance we obtain is sorted alphabetically on that variable. This allows us to locate very quickly (in less than one minute) new terms that should be put in parallel with the word officer. We give here an extract:
  the East German intelligence chief at the time.
  Mr Park an army general who seized
  A retired army general, Mr. Chung
  Communist army General, sent his
  Croatian Army helicopter directed fighter
  the former intelligence official and his
  former top intelligence official named
  to Korea a Japanese intelligence official said.</Paragraph>
    <Paragraph position="2">
  which military officials and federal
  be determined military officials said.</Paragraph>
    <Paragraph position="3"> the UN military Officials said.</Paragraph>
    <Paragraph position="4"> for them', military officials said.</Paragraph>
    <Paragraph position="5"> police and military officials.</Paragraph>
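The variable technique above can be sketched with a named regular-expression group standing in for the variable slot: the context pattern constrains the match, and the concordance is sorted on what the variable captured. The pattern and toy corpus are a drastically simplified stand-in for the automaton of figure 6.

```python
# Sketch of the "variable" technique: the slot occupied by "officer"
# is replaced by a variable (here the named group X); sorting on the
# captured fillers surfaces candidate terms to put in parallel with
# "officer" (general, official, ... but also noise like helicopter).
import re

PATTERN = re.compile(
    r"\b(?:army|military|intelligence)\s+(?P<X>\w+)", re.IGNORECASE)

corpus = ("a retired army general spoke ; senior intelligence official "
          "said ; ranking military officer on duty ; army helicopter flew")

fillers = sorted(m.group("X").lower() for m in PATTERN.finditer(corpus))
print(fillers)
```

Note that the fillers still need manual validation: helicopter matches the context but is not an occupation noun, which is exactly why the concordance is inspected by hand.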
  </Section>
  <Section position="6" start_page="82" end_page="83" type="metho">
    <SectionTitle>
2 And more
</SectionTitle>
    <Paragraph position="0"> We also observe new right and left contexts relevant to our description (but we do not take them into account here, in order to keep our steps in sequence). With these new words, we complete the automaton and obtain the automaton of figure 7. To keep the automaton clear, we always put equivalent units or terms in the same column. For that reason, officer is vertically aligned with general, official... Country is vertically aligned with national and Nationality, etc. When we add new boxes, we add the necessary transitions we can think of. Sometimes, choices are difficult: for example, can we find career general? In those cases, we permit it by default. The final automaton may recognize sequences that are not stylistically or semantically perfect, but that are always well formed, and their presence will not generate any noise.</Paragraph>
    <Paragraph position="1"> We skip a few stages to arrive at the proper noun description. Contrary to the French case, where the full name is always put in an external apposition to the occupation noun phrase, in English both can be mixed; we find in our corpus, for example:
  General Jean Cot of the French Army said the idea of...
  ... the French commander of troops there, General Jean Cot,
  General Jean Cot, the French Army officer who commands the
  And as General John Shalikashvili has said, the Partnership...
We describe these four different structures (left and right apposition, proper nouns and occupations mixed, and proper noun only) in a simplified way 3 (to keep the automata small for the presentation) by the automaton of figure 8. This automaton contains FirstName and SurName boxes. These two dictionaries are initially empty; they need to be filled, and we can easily imagine that their size will quickly become very large. To that aim, we replace them by variables. In that case, we can make the morphological form of these variables more precise. (As a first approximation, let us say that they are both single words with a capital first letter.) We apply this automaton to the corpus, and obtain a list of 582 occurrences.</Paragraph>
    <Paragraph position="2"> These occurrences are sorted according to the FirstName and SurName variables, and we only need to confirm that these terms are actual first names and surnames. A first draft of both dictionaries is thus constructed: 15 entries of FirstName and 67 of SurName.</Paragraph>
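The "single word with a capital first letter" approximation for the two name variables can be sketched as follows; the "General <First> <Sur>" context is a hypothetical simplification of the automaton of figure 8, and the corpus lines echo the examples above.

```python
# Sketch of harvesting FirstName/SurName candidates: both variables
# are approximated as single capitalized words, and a title context
# (here just "General") anchors the match.
import re

NAME = r"[A-Z][a-z]+"
PATTERN = re.compile(rf"General\s+(?P<First>{NAME})\s+(?P<Sur>{NAME})")

corpus = ("General Jean Cot of the French Army said ... "
          "And as General John Shalikashvili has said ...")

first_names = sorted({m.group("First") for m in PATTERN.finditer(corpus)})
sur_names = sorted({m.group("Sur") for m in PATTERN.finditer(corpus)})
print(first_names, sur_names)
```

As in the paper, the harvested candidates still have to be confirmed by hand before entering the dictionaries.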
    <Paragraph position="3"> The whole construction is dynamic: once we have some proper nouns, we seek their contexts, and find new structures for the officer automaton. These new structures allow us to obtain new proper nouns, new countries, new nationalities... This is the bootstrap effect.</Paragraph>
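The bootstrap effect amounts to a fixed-point iteration: known names yield new contexts, known contexts yield new names, until nothing new is found. The following loop is an illustrative sketch over a toy corpus of (context, name) pairs, not the paper's procedure.

```python
# Sketch of the bootstrap effect as a fixed-point loop: starting from
# one seed context, alternately acquire names from known contexts and
# contexts from known names.  Corpus and seed are illustrative.
corpus = [("General", "Cot"), ("General", "Shalikashvili"),
          ("Colonel", "Cot"), ("Colonel", "Smith")]

contexts, names = {"General"}, set()
changed = True
while changed:
    changed = False
    for ctx, name in corpus:
        if ctx in contexts and name not in names:
            names.add(name)
            changed = True
        if name in names and ctx not in contexts:
            contexts.add(ctx)
            changed = True

print(sorted(contexts), sorted(names))
```

Here the seed context General yields Cot, Cot yields the new context Colonel, and Colonel in turn yields Smith: two rounds suffice to reach the fixed point.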
    <Paragraph position="4"> 3 In particular, this automaton recognizes the British General of France! We have built, in one hour, a large-coverage automaton (more than 200,000 different paths, without counting proper noun combinations) for this single semantic category. The final number of recognized items is 863 (to compare with the 85 that we started from, which were in the majority used in another, economic, context). The number of potentially recognized sequences, as compared to the effectively recognized sequences, is a guarantee that, on a totally new corpus, we will instantly obtain a good coverage and new potential SurNames and FirstNames (and this, at a rate proportional to the size of the existing dictionaries).</Paragraph>
  </Section>
  <Section position="7" start_page="83" end_page="84" type="metho">
    <SectionTitle>
5 Software
</SectionTitle>
    <Paragraph position="0"> The tools needed for such a methodology are: 1. Graph editor. The first tool (included in INTEX (1994)) is a graphical graph editor.</Paragraph>
    <Paragraph position="1"> This editor allows the user to create and copy automata, and to align parts of them. It handles the inputs and outputs of transducers. It also allows one, with a simple click, to open a sub-automaton.</Paragraph>
    <Paragraph position="2"> 2. Index-based parsing algorithm. A key feature of text-based automaton construction is the possibility of obtaining immediate concordances. It is hardly thinkable to parse a 10 million word corpus sequentially. Moreover, with many levels of automata (sometimes more than 20), the size of the fully expanded main automaton quickly becomes huge, and we cannot re-compute it for each concordance. Thus, we have chosen to index each sub-automaton independently; with a dependency graph (cf. below), as with a makefile, we only re-compute the modified graph and those depending on it.</Paragraph>
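The makefile-like strategy can be sketched as a reachability computation over the dependency graph: a modified sub-automaton dirties itself and everything that depends on it, directly or indirectly. The graph names below are illustrative.

```python
# Sketch of makefile-style recompilation over a dependency graph of
# sub-automata: given which graph was modified, return the set of
# graphs that must be re-compiled (the modified one plus all its
# direct and indirect dependents).
def needs_recompile(depends_on, modified):
    """Return every graph to re-compile after `modified` changes."""
    # Invert the relation: who depends on whom.
    dependents = {}
    for g, deps in depends_on.items():
        for d in deps:
            dependents.setdefault(d, set()).add(g)
    dirty, stack = set(), [modified]
    while stack:
        g = stack.pop()
        if g not in dirty:
            dirty.add(g)
            stack.extend(dependents.get(g, ()))
    return dirty

# Illustrative hierarchy: Officer uses MilitaryOfficer, which uses
# Rank and Nationality.
depends_on = {"Officer": ["MilitaryOfficer"],
              "MilitaryOfficer": ["Rank", "Nationality"]}
print(sorted(needs_recompile(depends_on, "Rank")))
```

Editing a leaf list such as Rank thus triggers three recompilations, while editing the top-level Officer graph triggers only one.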
    <Paragraph position="3"> The index-based parsing algorithm is described in (Senellart, 1998a). It allows us to obtain concordances on the whole corpus in a mean time of less than 1 s on an average personal computer.</Paragraph>
    <Paragraph position="4"> 3. Concordance manager. We have shown throughout this paper that the way concordances are sorted is of great importance. Under INTEX, we can sort concordances according to the recognized sequence, to the right or left context, and any combination of these three parameters.</Paragraph>
    <Paragraph position="5"> Moreover, when we put variables in the automaton, we must be able to validate the recognized sequences linked to the variable with one click.</Paragraph>
    <Paragraph position="6"> 4. Debugging tools. Maintaining a large number of automata is not simple: some automata depend on others, whose exact names and functions must be recalled. Exactly as when compiling programs with a large number of source files, we can compute and represent (graphically) the dependencies between the different automata. For example, the graph of figure 9 represents the sub-automata used in the Officer automaton, and the same for each of the sub-automata. The depth of this graph is four.</Paragraph>
  </Section>
</Paper>