<?xml version="1.0" standalone="yes"?>
<Paper uid="C73-1018">
  <Title>RESULTS OBTAINED WITH A NEW METHOD FOR THE AUTOMATIC ANALYSIS OF SENTENCE STRUCTURES</Title>
  <Section position="1" start_page="0" end_page="0" type="metho">
    <SectionTitle>
ANDRÉE TRETIAKOFF
RESULTS OBTAINED WITH A NEW METHOD
FOR THE AUTOMATIC ANALYSIS OF
SENTENCE STRUCTURES
</SectionTitle>
    <Paragraph position="0"> We present in this paper a method for the automatic analysis of sentence structures.</Paragraph>
    <Paragraph position="1"> Our purpose is to build a frequency dictionary of the different structures used in the language. This dictionary will enable us to select the most useful sentence structures and to recommend their exclusive use in texts intended for automatic translation. We think that automatic translation will be possible only if the texts obey rules which limit the complexity of their syntax. An author will hardly notice these limitations, since only the most unusual structures will have been left out. Of course, the number of permitted structures will increase as the automatic translation codes are improved.</Paragraph>
    <Paragraph position="2"> The sentence structures are obtained by a statistical analysis of the word strings, following procedures developed in information theory.</Paragraph>
    <Paragraph position="3"> In the present paper we have analysed only groups of two consecutive words as an example of our method.</Paragraph>
    <Paragraph position="4"> The same type of analysis can be generalized by considering non-consecutive words and groups of more than two words.</Paragraph>
  </Section>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
1. GROUPS
</SectionTitle>
    <Paragraph position="0"> The first step of the analysis is to put the words into groups according to their grammatical properties, for example: noun, adjective, article and so on. The number of groups has been limited so that the frequencies remain significant given the length of the corpus (3500 words). In the text under study, we have used 67 groups. A list of these groups is given in Table 3.</Paragraph>
  </Section>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <Paragraph position="0"> Of course, our classification is somewhat arbitrary, as it is based on prior knowledge of the language. We will show later how the results of the analysis can help us to detect inadequate classifications. Each word of the corpus has been replaced by a symbol (a two-digit integer) representing its grammatical group. We consider the words inside a sentence, that is to say between two strong punctuation marks (. ; ! ?). Inside the sentence all punctuation marks are suppressed.</Paragraph>
    <Paragraph position="1"> From now on we will call these symbols &quot;words&quot;.</Paragraph>
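The coding step can be illustrated with a minimal Python sketch. The function name `encode` and the tiny `LEXICON` are hypothetical; the group codes are the ones the paper assigns to its own example sentence, and a real lexicon would have to cover the whole corpus.

```python
import re

# Hypothetical lexicon: the codes are those of the paper's example sentence.
LEXICON = {
    "her": "55", "daughter": "04", "gave": "01", "me": "44", "an": "45",
    "italian": "05", "lesson": "04", "every": "85", "day": "04",
}

def encode(text, lexicon):
    """Split text at strong punctuation (. ; ! ?) and map each word to its group code."""
    sentences = []
    for chunk in re.split(r"[.;!?]", text):
        # Keeping alphabetic tokens only drops the punctuation inside the sentence.
        tokens = re.findall(r"[A-Za-z']+", chunk)
        if tokens:
            sentences.append([lexicon[t.lower()] for t in tokens])
    return sentences

coded = encode("Her daughter gave me an Italian lesson every day.", LEXICON)
print(" ".join(coded[0]))
```

A word missing from the lexicon raises a `KeyError` here, which seems a reasonable default for a sketch, since every word of the corpus is supposed to belong to a group.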
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2. DICTIONARY OF STRINGS
</SectionTitle>
    <Paragraph position="0"> The second step is the constitution of a string dictionary.</Paragraph>
    <Paragraph position="1"> A sentence containing N words produces (N - 1) strings. For instance, the sentence Her daughter gave me an Italian lesson every day, represented by the string &quot;55 04 01 44 45 05 04 85 04&quot;, produces the following strings:</Paragraph>
    <Paragraph position="3"> Each string is obtained by suppressing the first word of the preceding string.</Paragraph>
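The list of strings for the example sentence did not survive in this copy, but the stated rule is enough to regenerate it. The sketch below assumes that the first string is the whole sentence and that the last one contains two words, which is what the count of N - 1 strings suggests; `strings_of` is a hypothetical name.

```python
def strings_of(sentence):
    """Return the N - 1 strings of an N-word sentence: the sentence itself,
    then each string obtained by dropping the first word of the previous one,
    stopping at the last two words."""
    return [sentence[start:] for start in range(len(sentence) - 1)]

sentence = "55 04 01 44 45 05 04 85 04".split()
strings = strings_of(sentence)
for s in strings:
    print(" ".join(s))
```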
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <Paragraph position="0"> The dictionary emphasizes the identical strings whatever their position in the sentence might be. A sample of the dictionary is given in Table 1.</Paragraph>
    <Paragraph position="1"> For example, the string 05 04, which means an adjective followed by a common noun at the end of a sentence, has rank number 244 and occurs 9 times, in sentences number 9, 10, 35 and so on.</Paragraph>
    <Paragraph position="2"> All the strings beginning with the groups 05 04 are also listed.</Paragraph>
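A string dictionary of this kind might be sketched as follows. The names `build_dictionary` and `beginning_with` are hypothetical, the three-sentence corpus is invented for illustration, and the way the paper assigns rank numbers is not modelled because the text does not specify it.

```python
from collections import defaultdict

def build_dictionary(sentences):
    """Map each string to the numbers of the sentences in which it occurs."""
    entries = defaultdict(list)
    for number, sentence in enumerate(sentences, start=1):
        for start in range(len(sentence) - 1):
            entries[tuple(sentence[start:])].append(number)
    return entries

def beginning_with(entries, prefix):
    """List, in sorted order, the strings that begin with the given groups."""
    prefix = tuple(prefix)
    return sorted(s for s in entries if s[:len(prefix)] == prefix)

# Toy corpus of three coded sentences (invented, not the paper's data).
corpus = [
    "55 04 01 45 05 04".split(),
    "44 01 45 05 04".split(),
    "45 05 04 01 44".split(),
]
entries = build_dictionary(corpus)
print(entries[("05", "04")])                   # sentences ending in 05 04
print(beginning_with(entries, ["05", "04"]))   # all strings starting with 05 04
```

Storing the strings as tuples keeps them hashable, so identical strings from different sentences fall under the same entry whatever their position.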
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3. SENTENCE STRUCTURE
</SectionTitle>
    <Paragraph position="0"> The last step of the analysis is the production of sentence structures, using the correlations between two consecutive words.</Paragraph>
    <Paragraph position="1"> We can compare the probability $P_j$ of a word $j$ in the corpus with the conditional probability $P_j(\text{if } i)$ of the same word when the preceding word is known to be $i$. In this paper we shall call &quot;degree of correlation&quot; the logarithm of the ratio of the conditional probability to the probability: $C_{ij} = \log_2 \left( P_j(\text{if } i) / P_j \right)$. The degree of correlation is positive when the probability of getting a word is increased by the knowledge of the preceding word, and negative when this probability is decreased. It is a measure of the &quot;affinity&quot; of two consecutive words.</Paragraph>
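As an illustration, the degree of correlation can be estimated from simple counts. This is a sketch under stated assumptions: probabilities are taken as relative frequencies, the logarithm is assumed to be base 2, and `degrees_of_correlation` is a hypothetical name. The toy corpus uses letters in place of group codes.

```python
import math
from collections import Counter

def degrees_of_correlation(sentences):
    """Return {(i, j): log2(P_j(if i) / P_j)} for every observed pair of consecutive words."""
    word_counts = Counter(w for s in sentences for w in s)
    pair_counts = Counter(p for s in sentences for p in zip(s, s[1:]))
    first_counts = Counter()  # how often each word is the first member of a pair
    for (i, _), n in pair_counts.items():
        first_counts[i] += n
    total_words = sum(word_counts.values())
    result = {}
    for (i, j), n in pair_counts.items():
        p_j = word_counts[j] / total_words   # probability of j in the corpus
        p_j_if_i = n / first_counts[i]       # probability of j when i precedes
        result[(i, j)] = math.log2(p_j_if_i / p_j)
    return result

toy = [["a", "b"], ["a", "b"], ["a", "a"]]
corr = degrees_of_correlation(toy)
print(corr)
```

In this toy corpus the pair (a, b) gets a positive degree of correlation and the pair (a, a) a negative one, matching the sign convention described in the text.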
    <Paragraph position="2"> This procedure can be generalized by considering groups of more than two words, not necessarily consecutive.</Paragraph>
    <Paragraph position="3"> For each sentence of the corpus we can build a structure based on the correlation between two consecutive words in the following way. Inside the sentence, consecutive words are connected two by two in order of decreasing degree of correlation. For instance, in the sentence She loved a good laugh we have the following degrees of correlation: She loved = 2.56; loved a = 1.23; a good = 2.38; good laugh = 1.86.</Paragraph>
  </Section>
  <Section position="7" start_page="0" end_page="2" type="metho">
    <Paragraph position="0"> Therefore the first words to be connected are She and loved, then a and good. We consider their union to be the first level. Then the word laugh is connected to the group a good; this union is the second level. Finally the two halves of the sentence are connected, and this union is the third level.</Paragraph>
    <Paragraph position="1"> This structure can be represented by the following graph, automatically produced by the computer, and by the string 1 3 1 2 obtained by writing the sequence of the successive levels.</Paragraph>
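The level string can be reproduced with a short sketch. The rule used here is inferred from the worked example and is an assumption: a single word has level 0, and each connection gets a level one higher than the higher of the two groups it joins. The name `structure_levels` is hypothetical.

```python
def structure_levels(correlations):
    """correlations[k] is the degree of correlation between words k and k + 1.
    Returns the level of each connection, read from left to right."""
    n_words = len(correlations) + 1
    group_of = list(range(n_words))             # group identifier of each word
    level_of_group = {g: 0 for g in group_of}   # assumed: a single word has level 0
    levels = [0] * len(correlations)
    # Connect pairs in order of decreasing degree of correlation.
    order = sorted(range(len(correlations)), key=lambda k: correlations[k], reverse=True)
    for k in order:
        left, right = group_of[k], group_of[k + 1]
        level = 1 + max(level_of_group[left], level_of_group[right])
        levels[k] = level
        # Merge the right-hand group into the left-hand one.
        group_of = [left if g == right else g for g in group_of]
        level_of_group[left] = level
    return levels

# She loved a good laugh, with the degrees of correlation given in the text.
print(structure_levels([2.56, 1.23, 2.38, 1.86]))  # [1, 3, 1, 2]
```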
  </Section>
  <Section position="8" start_page="2" end_page="2" type="metho">
    <SectionTitle>
4. DICTIONARY OF STRUCTURES
</SectionTitle>
    <Paragraph position="0"> This procedure has been applied for all the sentences of the text, producing strings of numbers which represent the structure of these sentences.</Paragraph>
    <Paragraph position="1"> From each string of numbers, by suppressing the highest number, we obtain 2 strings representing 2 substructures of the sentence. We carry on this procedure until the string has only 1 number, that is to say until it represents the structure of a group of 2 words.</Paragraph>
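The decomposition into substructures lends itself to a recursive sketch. The function name `substructures` is hypothetical, and treating an empty side (a single word left over after the split) as contributing nothing is an assumption about a case the text does not spell out.

```python
def substructures(levels):
    """Return the structure followed by all of its substructures."""
    if not levels:
        return []            # a single word carries no structure
    if len(levels) == 1:
        return [levels]      # a group of two words: stop here
    top = levels.index(max(levels))   # position of the highest number
    return [levels] + substructures(levels[:top]) + substructures(levels[top + 1:])

for s in substructures([1, 3, 1, 2]):
    print(" ".join(map(str, s)))
```

For the structure 1 3 1 2 this yields the structure itself, then 1, then 1 2, then 1, each of which would be entered in the dictionary of structures.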
    <Paragraph position="2"> For instance the structure of the sentence Her daughter gave me an Italian lesson every day is represented by the following string of numbers:</Paragraph>
  </Section>
  <Section position="9" start_page="2" end_page="2" type="metho">
    <SectionTitle>
SENTENCE NO 5
</SectionTitle>
    <Paragraph position="0"/>
    <Paragraph position="2"> All the structures and substructures are classified in a dictionary, giving their frequencies and the positions of the sentences containing the corresponding word strings (Table 2).</Paragraph>
    <Paragraph position="3"> For example, the structure 1 4 2 1 3 has rank number 65 and is found 5 times, in sentences number 12, 16, 21, 24 and 41.</Paragraph>
  </Section>
  <Section position="10" start_page="2" end_page="2" type="metho">
    <SectionTitle>
5. CLASSIFICATION ERRORS
</SectionTitle>
    <Paragraph position="0"> If the structure of a sentence is unsatisfactory, this can be due to an error in the classification of a word of this sentence. This observation is used to detect and correct classification errors. For example in the sentence: But come mother</Paragraph>
    <Paragraph position="2"/>
  </Section>
  <Section position="11" start_page="2" end_page="2" type="metho">
    <Paragraph position="0"> the word come had been classified in the wrong group, 02 (indicative of intransitive verbs). When corrected (22 = infinitive of intransitive verbs) we obtain the following structure: But I like to come mother</Paragraph>
    <Paragraph position="2"> Another way to check the classification of words into groups is to use the quantity of information associated with the law of succession of two consecutive words. It is known from communication theory that the average amount of information per word is reduced when we know the law of succession of two consecutive words. This reduction is precisely equal to the average degree of correlation over all pairs of groups: $\sum_{i,j} P_{ij} C_{ij}$, where $C_{ij}$ is the degree of correlation of the pair $(i, j)$ and $P_{ij}$ its probability. We shall call it the quantity of information associated with the law of succession of two consecutive words.</Paragraph>
    <Paragraph position="3"> In order to check the validity of the choice of the grammatical group for a word, the quantity of information associated with the law of succession of the groups is measured. Then the choice of group is changed and the quantity of information is measured again for the new classification. The greater the quantity of information associated with the law of succession of the groups, the better the distribution of the words into these groups.</Paragraph>
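The comparison of two classifications can be sketched as follows. Everything here is illustrative: `quantity_of_information` is a hypothetical name, probabilities are relative frequencies, the logarithm is assumed to be base 2, and the two tiny coded corpora are invented (the code TO stands for a group that is not in the part of Table 3 preserved in this copy). Only the codes 02 and 22 come from the text.

```python
import math
from collections import Counter

def quantity_of_information(sentences):
    """Average degree of correlation of consecutive words, weighted by pair frequency."""
    word_counts = Counter(w for s in sentences for w in s)
    pair_counts = Counter(p for s in sentences for p in zip(s, s[1:]))
    first_counts = Counter()
    for (i, _), n in pair_counts.items():
        first_counts[i] += n
    total_words = sum(word_counts.values())
    total_pairs = sum(pair_counts.values())
    info = 0.0
    for (i, j), n in pair_counts.items():
        c_ij = math.log2((n / first_counts[i]) / (word_counts[j] / total_words))
        info += (n / total_pairs) * c_ij
    return info

# Two codings of the same two invented sentences; only the last word of the
# first sentence differs (02 = indicative, 22 = infinitive of an intransitive verb).
wrong = [["44", "02", "TO", "02"], ["44", "02", "TO", "22"]]
corrected = [["44", "02", "TO", "22"], ["44", "02", "TO", "22"]]
info_wrong = quantity_of_information(wrong)
info_corrected = quantity_of_information(corrected)
print(round(info_wrong, 3), round(info_corrected, 3))
```

On this toy data the corrected coding gives the larger quantity of information, which is the direction of the criterion stated above; a corpus of realistic size would be needed for the comparison to mean anything.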
  </Section>
  <Section position="12" start_page="2" end_page="2" type="metho">
    <SectionTitle>
6. CONCLUSION
</SectionTitle>
    <Paragraph position="0"> The sample chosen here (a novel by S. Maugham of 3500 words) is too short to obtain significant frequencies for the different structures. This sample contains 200 sentences of an average length of 17 words.</Paragraph>
    <Paragraph position="1"> In spite of the simplicity of the method of analysis employed, 72 sentences of an average length of 10 words have been correctly analysed. This shows that the correlation of 2 consecutive words, although insufficient, will play an important part in the more elaborate methods of analysis that we are now developing.</Paragraph>
  </Section>
  <Section position="13" start_page="2" end_page="63" type="metho">
    <Paragraph position="0"/>
    <Paragraph position="2"> 57 1 0 1 3 2 1 6 4 3 1 2 5 1 3 2 1  60 1 0 14 12 13 5 1</Paragraph>
    <Paragraph position="4"/>
  </Section>
  <Section position="14" start_page="63" end_page="233" type="metho">
    <SectionTitle>
PAST PARTICIPLE (TRANSITIVE VERBS)
PRESENT PARTICIPLE (TRANSITIVE VERBS)
GERUND (TRANSITIVE VERBS)
02 INDICATIVE (INTRANSITIVE VERBS)
22 INFINITIVE (INTRANSITIVE VERBS)
32 PAST PARTICIPLE (INTRANSITIVE VERBS)
42 PRESENT PARTICIPLE (INTRANSITIVE VERBS)
52 GERUND (INTRANSITIVE VERBS)
03 INDICATIVE (STATE VERBS)
23 INFINITIVE (STATE VERBS)
33 PAST PARTICIPLE (STATE VERBS)
43 PRESENT PARTICIPLE (STATE VERBS)
53 GERUND (STATE VERBS)
08 INDICATIVE (AUXILIARY VERBS)
28 INFINITIVE (AUXILIARY VERB)
</SectionTitle>
    <Paragraph position="0"/>
  </Section>
</Paper>