<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2001">
  <Title>Using Machine Learning Techniques to Build a Comma Checker for Basque</Title>
  <Section position="7" start_page="5" end_page="6" type="concl">
    <SectionTitle>
6 Conclusions and future work
</SectionTitle>
    <Paragraph position="0"> We have used machine learning techniques for the task of placing commas automatically in texts. As far as we know, this is a fairly novel application field. Hardt (2001) described a system that identified incorrect commas with a precision of 91% and a recall of 77% (trained on 600,000 words). These results are comparable to those we obtain for the task of deciding correctly when not to place a comma (see column &quot;0&quot; in the tables). Training on 100,000 words, we obtain a precision of 96% and a recall of 98.3%.</Paragraph>
    <Paragraph position="1"> The main reason could be that we use more information for learning.</Paragraph>
    <Paragraph position="2"> However, our results in the task of placing commas are not as good as we had hoped (we get a precision of 69.6% and a recall of 48.6%). Nevertheless, in this particular task we have improved considerably with the designed tests, and further improvements could be obtained by using more corpora and more specific corpora, such as texts written by a single author or scientific texts.</Paragraph>
    <Paragraph position="3"> Moreover, we have detected two possible problems that could explain the mediocre results in this task: (1) there are no fixed rules for commas in the Basque language, and (2) training on corpora from different writers has a negative influence. To examine this, we carried out a small experiment with some English corpora. Our hypothesis was that a completely settled language like English, where comma rules are more or less fixed, would obtain better results. Taking a comparable English corpus and learning attributes similar to those used for Basque, we got, for the instances followed by a comma (column &quot;1&quot; in the tables), a better precision (83.3%) than the best one obtained for the Basque language. However, the recall was worse than ours: 38.7%. We have to take into account that we used fewer learning attributes with the English corpus and that we did not change the application window chosen for the Basque experiment. A different application window would probably have been more suitable for English. Therefore, we believe that with a few tests we could easily achieve a better recall.</Paragraph>
    <Paragraph position="4"> In any case, these results confirm our hypothesis and our diagnosis of the detected problems.</Paragraph>
    <Paragraph position="5"> Nevertheless, we think the presented results for the Basque language could be improved. One way would be to use &quot;information gain&quot; techniques to carry out the feature selection. In addition, we think that more syntactic information, specifically clause-split tags, would be especially beneficial for detecting the commas that Nunberg (1990) calls delimiters.</Paragraph>
    <Paragraph position="6"> In fact, our main future research will focus on clause identification. Based on the &quot;accepted theory of the comma&quot;, we are confident that a good identification of clauses (together with some significant linguistic information we already have) would enable us to place commas correctly in any text by implementing just a few simple rules. Besides, a combination of both methods, learning commas and placing commas after identifying clauses, would probably improve the results even more.</Paragraph>
    <Paragraph position="7"> Finally, we are considering building an ICALL (Intelligent Computer Assisted Language Learning) system to help learners place commas correctly.</Paragraph>
  </Section>
</Paper>