<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2001">
  <Title>Using Machine Learning Techniques to Build a Comma Checker for Basque</Title>
  <Section position="4" start_page="1" end_page="1" type="metho">
    <SectionTitle>
2 Learning commas
</SectionTitle>
    <Paragraph position="0"> We have designed two different but combinable ways to get the comma checker: based on clause boundaries based directly on corpus Bearing in mind the formalised theory of Aldezabal et al. (2003)1, we realised that if we got to split the sentence into clauses, it would be quite easy to develop rules for detecting the exact places where commas would have to go. Thus, the best way to build a comma checker would be to get, first, a clause identification tool.</Paragraph>
    <Paragraph position="1"> Recent papers in this area report quite good results using machine learning techniques. Car reras and Marquez (2003) get one of the best per formances in this task (84.36% in test). There fore, we decided to adopt this as a basis in order to get an automatic clause splitting tool for Basque. But as it is known, machine learning techniques cannot be applied if no training cor pus is available, and one year ago, when we star ted this process, Basque texts with this tagged clause splits were not available.</Paragraph>
    <Paragraph position="2"> Therefore, we decided to use the second al ternative. We had available some corpora of Basque, and we decided to try learning commas from raw text, since a previous tagging was not needed. The problem with the raw text is that its commas are not the result of applying consistent rules.</Paragraph>
    <Paragraph position="3"> 1 From now on, we will speak about this as &amp;quot;the accepted theory of Basque punctuation&amp;quot;.</Paragraph>
    <Paragraph position="4"> Related work Machine learning techniques have been applied in many fields and for many purposes, but we have found only one reference in the literature related to the use of machine learning techniques to assign commas automatically.</Paragraph>
    <Paragraph position="5"> Hardt (2001) describes research in using the Brill tagger (Brill 1994; Brill, 1995) to learn to identify incorrect commas in Danish. The system was developed by randomly inserting commas in a text, which were tagged as incorrect, while the original commas were tagged as correct. This system identifies incorrect commas with a preci sion of 91% and a recall of 77%, but Hardt (2001) does not mention anything about identify ing correct commas.</Paragraph>
    <Paragraph position="6"> In our proposal, we have tried to carry out both aspects, taking as a basis other works that also use machine learning techniques in similar problems such as clause splitting (Tjong Kim Sang E.F. and Dejean H., 2001) or detection of chunks (Tjong Kim Sang E.F. and Buchholz S., 2000).</Paragraph>
  </Section>
  <Section position="5" start_page="1" end_page="3" type="metho">
    <SectionTitle>
3 Experimental setup
Corpora
</SectionTitle>
    <Paragraph position="0"> As we have mentioned before, some corpora in Basque are available. Therefore, our first task was to select the training corpora, taking into ac count that well punctuated corpora were needed to train the machine correctly. For that purpose, we looked for corpora that satisfied as much as possible our &amp;quot;accepted theory of Basque punctu ation&amp;quot;. The corpora of the unique newspaper written in Basque, called Egunkaria (nowadays Berria), were chosen, since they are supposed to use the &amp;quot;accepted theory of Basque punctuation&amp;quot;. Nevertheless, after some brief verifications, we realised that the texts of the corpora do not fully match with our theory. This can be understood considering that a lot of people work in a news paper. That is, every journalist can use his own interpretation of the &amp;quot;accepted theory&amp;quot;, even if all of them were instructed to use it in the same way. Therefore, doing this research, we had in mind that the results we would get were not go ing to be perfect.</Paragraph>
    <Paragraph position="1"> To counteract this problem, we also collected more homogeneous corpora from prestigious writers: a translation of a book of philosophy and a novel. Details about these corpora are shown in  Size of the corpora Corpora from the newspaper Egunkaria 420,000 words Philosophy texts written by one unique author 25,000 words Literature texts written by one unique author 25,000 words  A short version of the first corpus was used in different experiments in order to tune the system (see section 4). The differences between the re sults depending on the type of the corpora are shown in section 5.</Paragraph>
    <Section position="1" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
Evaluation
</SectionTitle>
      <Paragraph position="0"> Results are shown using the standard measures in this area: precision, recall and fmeasure2, which are calculated based on the test corpus. The res ults are shown in two colums (&amp;quot;0&amp;quot; and &amp;quot;1&amp;quot;) that correspond to the result categories used. The res ults for the column &amp;quot;0&amp;quot; are the ones for the in stances that are not followed by a comma. On the contrary, the results for the column &amp;quot;1&amp;quot; are the results for the instances that should be followed by a comma.</Paragraph>
      <Paragraph position="1"> Since our final goal is to build a comma checker, the precision in the column &amp;quot;1&amp;quot; is the most important data for us, although the recall for the same column is also relevant. In this kind of tools, the most important thing is to first ob tain all the comma proposals right (precision in columns &amp;quot;1&amp;quot;), and then to obtain all the possible commas (recall in columns &amp;quot;1&amp;quot;).</Paragraph>
      <Paragraph position="2"> Baselines In the beginning, we calculated two possible baselines based on a big part of the newspaper corpora in order to choose the best one.</Paragraph>
      <Paragraph position="3"> The first one was based on the number of commas that appeared in these texts. In other words, we calculated how many commas ap peared in the corpora (8% out of all words), and then we put commas randomly in this proportion in the test corpus. The results obtained were not very good (see Table 2, baseline1), especially for the instances &amp;quot;followed by a comma&amp;quot; (column &amp;quot;1&amp;quot;).</Paragraph>
      <Paragraph position="4"> The second baseline was developed using the list of words appearing before a comma in the training corpora. In the test corpus, a word was tagged as &amp;quot;followed by a comma&amp;quot; if it was one of the words of the mentioned list. The results (see baseline 2, in Table 2) were better, in this case, for the instances followed by a comma (column named &amp;quot;1&amp;quot;). But, on the contrary, baseline 1 provided us with better results for the instances not followed by a comma (column named &amp;quot;0&amp;quot;). That is why we decided to take, as our baseline,  2 fmeasure = 2*precision*recall / (precision+recall) the best data offered by each baseline (the ones in bold in table 2).</Paragraph>
      <Paragraph position="5"> 0 1 Prec. Rec. Meas. Prec. Rec. Meas.</Paragraph>
    </Section>
    <Section position="2" start_page="2" end_page="3" type="sub_section">
      <SectionTitle>
Methods and attributes
</SectionTitle>
      <Paragraph position="0"> We use the WEKA3 implementation of these classifiers: the Naive Bayes based classifier (Na iveBayes), the support vector machine based classifier (SMO) and the decisiontree (C4.5) based one (j48).</Paragraph>
      <Paragraph position="1"> It has to be pointed out that commas were taken away from the original corpora. At the same time, for each token, we stored whether it was followed by a comma or not. That is, for each word (token), it was stored whether a comma was placed next to it or not. Therefore, each token in the corpus is equivalent to an ex ample (an instance). The attributes of each token are based on the token itself and some surround ing ones. The application window describes the number of tokens considered as information for each token.</Paragraph>
      <Paragraph position="2"> Our initial application window was [5, +5]; that means we took into account the previous and following 5 words (with their corresponding at tributes) as valid information for each word.</Paragraph>
      <Paragraph position="3"> However, we tuned the system with different ap plication windows (see section 4).</Paragraph>
      <Paragraph position="4"> Nevertheless, the attributes managed for each word can be as complex as we want. We could only use words, but we thought some morpho syntactic information would be beneficial for the machine to learn. Hence, we decided to include as much information as we could extract using the shallow syntactic parser of Basque (Aduriz et al., 2004). This parser uses the tokeniser, the lemmatiser, the chunker and the morphosyntactic disambiguator developed by the IXA4 research group.</Paragraph>
      <Paragraph position="5"> The attributes we chose to use for each token were the following:  beginning of chunk (verb, nominal, enti ty, postposition) end of chunk (verb, nominal, entity, post position) part of an apposition other binary features: multiple word to ken, full stop, suspension points, colon, semicolon, exclamation mark and ques tion mark We also included some additional attributes which were automatically calculated: number of verb chunks to the beginning and to the end of the sentence number of nominal chunks to the begin ning and to the end of the sentence number of subordinateclause marks to the beginning and to the end of the sen tence distance (in tokens) to the beginning and to the end of the sentence We also did other experiments using binary attributes that correspond to most used colloca tions (see section 4).</Paragraph>
      <Paragraph position="6"> Besides, we used the result attribute &amp;quot;comma&amp;quot; to store whether a comma was placed after each token.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>