<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2001">
<Title>Using Machine Learning Techniques to Build a Comma Checker for Basque</Title>
<Section position="3" start_page="0" end_page="1" type="intro">
<SectionTitle> 1 Introduction </SectionTitle>
<Paragraph position="0"> In recent years, there have been many studies aimed at building a grammar checker for the Basque language (Ansa et al., 2004; Diaz De Ilarraza et al., 2005). These works have focused mainly on building rule sets, taking into account syntactic information extracted automatically from the corpus, that detect some erroneous grammar forms. The research presented here aims to complement the earlier work by focusing on the style and punctuation of texts. To be precise, we have experimented with machine learning techniques for the special case of the comma, to evaluate their performance and to analyse the possibility of applying them to other tasks of the grammar checker.</Paragraph>
<Paragraph position="1"> However, developing a punctuation checker encounters one problem in particular: punctuation rules are not fully established. In general, there is no problem with the full stop, the question mark or the exclamation mark. Santos (1998) points out that these are reliable punctuation marks, while all the rest are unreliable. Errors related to the reliable ones (for instance, whether or not to place the initial question or exclamation mark, depending on the language) are not so hard to treat. A rule set to correct some of these has already been defined for the Basque language (Ansa et al., 2004). In contrast, the comma is the most polyvalent and, thus, the least defined punctuation mark (Bayraktar et al., 1998; Hill and Murray, 1998). The ambiguity of the comma, in fact, has often been shown (Bayraktar et al., 1998; Beeferman et al., 1998; Van Delden S. 
and Gomez F., 2002).</Paragraph>
<Paragraph position="2"> These works have shown the lack of fixed rules for the comma. There are only some intuitive and generally accepted rules, but they are not applied in a standard way. In Basque, this problem is even more evident, since the standardisation and normalisation of the language began only about twenty-five years ago and is not yet finished. Morphology is mostly defined but, as far as syntax is concerned, there is still quite a lot of work to do. In punctuation and style, some basic rules have been defined and accepted by the Basque Language Academy (Zubimendi, 2004). However, no final decisions have been made about the comma.</Paragraph>
<Paragraph position="3"> Nevertheless, since Nunberg's monograph (Nunberg, 1990), the importance of the comma has been undeniable, mainly in these two aspects: i) as a clue to the syntax of the sentence (Nunberg, 1990; Bayraktar et al., 1998; Garzia, 1997), and ii) as a basis for improving some natural language processing tools (syntactic analysers, error detection tools...), as well as for developing new ones (Briscoe and Carroll, 1995; Jones, 1996). The relevance of the comma for the syntax of the sentence can easily be shown with examples in which the sentence is understood in one way or another depending on whether a comma is placed or not (Nunberg, In the same sense, it is obvious that a well-punctuated text or, more concretely, a correct placement of commas would help considerably in the automatic syntactic analysis of the sentence and, therefore, in the development of more and better tools in the NLP field. 
Say and Akman (1997) summarise the research efforts in this direction.</Paragraph>
<Paragraph position="4"> As important background for our work, we note where the linguistic information on the comma for the Basque language was formalised.</Paragraph>
<Paragraph position="5"> This information was extracted after analysing the theories of some experts in Basque syntax and punctuation (Aldezabal et al., 2003). In fact, although no final decisions have been taken by the Basque Language Academy yet, the theory formalised in the above-mentioned work has succeeded in unifying the main points of view on punctuation in Basque. Naturally, this has been the basis for our work.</Paragraph>
</Section>
</Paper>