<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2001">
<Title>Using Machine Learning Techniques to Build a Comma Checker for Basque</Title>
<Section position="6" start_page="3" end_page="5" type="evalu">
<SectionTitle> 4 Experiments </SectionTitle>
<Paragraph position="0"> Dimension of the corpus In this test, we employed the attributes described in section 3 and an initial window of [-5, +5], which means that we took into account the previous 5 tokens and the following 5. We also used the C4.5 algorithm initially, since it obtains very good results in other, similar machine learning tasks related to surface syntax. As can be seen in table 3, the larger the corpus, the better the results, but the time needed to obtain them also increases considerably. That is why we chose the smallest corpus for the remaining tests (100,000 words for training and 30,000 words for testing). We considered this size sufficient to obtain good comparative results. In any case, this test suggested that whatever results we obtained could still be improved by using a larger corpus.</Paragraph>
<Paragraph position="1"> Selecting the window Using the corpus and the attributes described before, we ran some tests to decide the best application window for our problem. As we have already mentioned, in problems of this type the surrounding words may carry important information for deciding the class of the current word.</Paragraph>
<Paragraph position="2"> [Table 4. Precision, recall and f-measure for class 0 (no comma) and class 1 (comma), for each application window tested.]</Paragraph>
<Paragraph position="4"> As can be seen, the best f-measure for the instances followed by a comma was obtained using the application window [-4, +5]. However, as we have said before, we are more interested in precision. The [-5, +2] window obtains the best precision and, besides, its f-measure is almost the same as the best one. This is why we decided to choose the [-5, +2] application window.</Paragraph>
<Paragraph position="5"> Selecting the classifier With the selected attributes, the corpus of 130,000 words and the [-5, +2] application window, the next step was to select the best classifier for our problem. We tried the WEKA implementations of the following classifiers: the Naive Bayes classifier (NaiveBayes), the support vector machine classifier (SMO) and the decision tree classifier (J48). Table 5 shows the results obtained. As we can see, the f-measure for the instances not followed by a comma (column &quot;0&quot;) is almost the same for the three classifiers, but there is a considerable difference for the instances followed by a comma (column &quot;1&quot;). The C4.5-based classifier (J48) gives the best f-measure thanks to its better recall, although the best precision belongs to the support vector machine classifier (SMO). The Naive Bayes classifier was definitely discarded, and we had to consider the final goal of our research to choose between the other two. Since that goal is to build a comma checker, we should in principle have chosen the classifier with the best precision, that is, the support vector machine; but its recall was not good enough for it to be selected.</Paragraph>
<Paragraph position="6"> Consequently, we decided to choose the C4.5-based classifier.</Paragraph>
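The paper gives no code, so the sketch below is only an illustration of the experimental setup described above, under several assumptions: toy English sentences stand in for the Basque corpus, only word forms are used as attributes (the real system also uses lemmas, part of speech and other morphosyntactic information, as described in section 3), and scikit-learn analogues replace the WEKA classifiers (MultinomialNB for NaiveBayes, LinearSVC for SMO, DecisionTreeClassifier for J48/C4.5). It shows how instances with a [-5, +2] token window can be built and how the per-class precision, recall and f-measure of the three classifiers can be compared.

```python
# Illustrative sketch only (not the authors' code): build window attributes for each
# token and compare scikit-learn analogues of the WEKA classifiers used in the paper.
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

WINDOW = (-5, 2)  # the [-5, +2] application window selected above


def window_features(words, i, window=WINDOW):
    """Attributes of token i: the word forms found inside the window."""
    feats = {}
    for offset in range(window[0], window[1] + 1):
        j = i + offset
        feats["w%+d" % offset] = words[j] if 0 <= j < len(words) else "_PAD_"
    return feats


def corpus_to_instances(sentences):
    """Each non-comma token is an instance; class 1 means 'followed by a comma'."""
    X, y = [], []
    for sent in sentences:
        pairs = [(tok, int(i + 1 < len(sent) and sent[i + 1] == ","))
                 for i, tok in enumerate(sent) if tok != ","]
        words = [w for w, _ in pairs]
        for i, (_, label) in enumerate(pairs):
            X.append(window_features(words, i))
            y.append(label)
    return X, y


# Toy data, only so that the sketch runs end to end.
train_sents = ["when it rains , the streets get wet".split(),
               "however , the results were good".split(),
               "the corpus is small but useful".split()] * 50
test_sents = ["when it snows , the roads get icy".split(),
              "therefore , we chose another classifier".split()] * 10

X_train, y_train = corpus_to_instances(train_sents)
X_test, y_test = corpus_to_instances(test_sents)

vec = DictVectorizer()  # one-hot encodes the categorical window attributes
Xtr = vec.fit_transform(X_train)
Xte = vec.transform(X_test)

for name, clf in [("NaiveBayes", MultinomialNB()),
                  ("SMO (SVM)", LinearSVC()),
                  ("J48 (C4.5)", DecisionTreeClassifier())]:
    clf.fit(Xtr, y_train)
    print(name)
    # Per-class precision, recall and f-measure, as in the tables of this section.
    print(classification_report(y_test, clf.predict(Xte), digits=3, zero_division=0))
```

In WEKA the same information would be expressed as nominal attributes in an ARFF file; the one-hot encoding above is simply the scikit-learn counterpart of that representation.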
<Paragraph position="7"> Selecting examples At this point, the results we obtain seem to be quite good for the instances not followed by a comma, but not so good for the instances that should be followed by one. This can be explained by the fact that our training corpus is not balanced. In other words, in a normal text there are many instances not followed by a comma and far fewer that are. Thus, our training corpus naturally contains very different numbers of instances of the two classes, and that is why the system learns to avoid unnecessary commas more easily than to place the necessary ones.</Paragraph>
<Paragraph position="8"> Therefore, we decided to train the system with a corpus in which the numbers of instances followed and not followed by a comma were balanced. For that purpose, we prepared a Perl program that transformed the initial corpus, keeping only x words not followed by a comma for each word followed by one.</Paragraph>
<Paragraph position="9"> Table 6 shows the results obtained.</Paragraph>
<Paragraph position="10"> One to one means that the training corpus had one instance not followed by a comma for each instance followed by a comma.</Paragraph>
<Paragraph position="11"> One to two means that the training corpus had two instances not followed by a comma for each instance followed by a comma, and so on.</Paragraph>
<Paragraph position="12">
                class 0 (no comma)          class 1 (comma)
                Prec.   Rec.   F-meas.      Prec.   Rec.   F-meas.
  normal        0.955   0.981  0.968        0.635   0.417  0.503
  one to one    0.989   0.633  0.772        0.164   0.912  0.277
  one to two    0.977   0.902  0.938        0.367   0.725  0.487
  one to three  0.969   0.934  0.951        0.427   0.621  0.506
  one to four   0.966   0.952  0.959        0.484   0.575  0.526
  one to five   0.966   0.961  0.963        0.534   0.568  0.550
  one to six    0.963   0.966  0.964        0.550   0.524  0.537
</Paragraph>
<Paragraph position="13"> Table 6. Results depending on the number of words kept for each comma (C4.5 algorithm; 100,000-word training corpus / 30,000-word test corpus; [-5, +2] window).</Paragraph>
<Paragraph position="14"> As observed in table 6, the best precision for the instances followed by a comma is obtained with the original training corpus, in which no instances were removed (referred to as normal in the table). The corpus in which a single instance not followed by a comma is kept for each instance followed by a comma obtains the best recall, but its precision decreases notably.</Paragraph>
<Paragraph position="15"> The best f-measure for the instances that should be followed by a comma is obtained with the one to five scheme, but, as mentioned before, a comma checker must above all offer correct comma proposals. In other words, since the precision of the original corpus is considerably better (ten points higher), we decided to continue our work with the first option: the corpus in which no instances were removed.</Paragraph>
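A minimal sketch of this one-to-N subsampling is given below. The (features, label) instance list, the helper name and the random selection of the kept instances are assumptions: the paper only says that a Perl program kept x words for each word followed by a comma, without detailing how they were chosen.

```python
import random


def subsample_one_to_n(instances, n, seed=0):
    """Keep every class-1 instance (followed by a comma) and n class-0 instances
    (not followed by a comma) for each of them.  `instances` is a list of
    (features, label) pairs; random selection of the kept class-0 instances is
    an assumption, since the paper's Perl script is not described in detail."""
    rng = random.Random(seed)
    positives = [inst for inst in instances if inst[1] == 1]
    negatives = [inst for inst in instances if inst[1] == 0]
    kept = rng.sample(negatives, min(len(negatives), n * len(positives)))
    balanced = positives + kept
    rng.shuffle(balanced)
    return balanced


# Example: the "one to five" scheme of Table 6 would correspond to
#   balanced_train = subsample_one_to_n(train_instances, n=5)
```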
<Paragraph position="16"> Adding new attributes Keeping the best results obtained in the tests described above (C4.5 with the [-5, +2] window, and without removing any &quot;not comma&quot; instances), we thought that giving more weight to the words that usually appear before a comma would improve our results. Therefore, we carried out the following tests: 1) To search a large corpus in order to extract the one hundred most frequent words, the one hundred most frequent pairs of words (bigrams) and the one hundred most frequent sequences of three words (trigrams) that precede a comma, and to use them as attributes in the learning process.</Paragraph>
<Paragraph position="17"> 2) To use only three attributes instead of the three hundred mentioned above to encode the information about preceding words: the first attribute indicates whether the word is one of the hundred most frequent words, the second whether it is the last word of one of the hundred most frequent pairs, and the third whether it is the last word of one of the hundred most frequent sequences of three words.</Paragraph>
<Paragraph position="18"> As shown in table 7, the first test (using these 300 items as attributes) improves the precision of placing commas (column &quot;1&quot;) by more than 4 points. It also improves the recall, and thus the f-measure improves by almost 6 points.</Paragraph>
<Paragraph position="19"> The third case in table 7 gives the best precision, but its recall decreases considerably. Hence, we decided to choose case number 1 in table 7.</Paragraph>
<Paragraph position="20"> 5 Effect of the corpus type As we said before (see section 3), the results may differ depending on the quality of the texts.</Paragraph>
<Paragraph position="21"> Table 8 shows the results obtained with the different types of corpus described in table 1. To make the comparison fair, we used the same size for all the corpora (20,000 instances for training and 5,000 instances for testing, which is the maximum size we were able to acquire for the three corpora).</Paragraph>
<Paragraph position="22"> [Table 8. Results for the three corpora (20,000 instances to train / 5,000 to test): precision, recall and f-measure for class 0 (no comma) and class 1 (comma).]</Paragraph>
<Paragraph position="24"> The first line shows the results obtained with the short version of the newspaper corpus. The second line corresponds to the translation of a book of philosophy, written entirely by one author, and the third to a novel written in Basque.</Paragraph>
<Paragraph position="25"> In any case, the results support our hypothesis: using texts written by a single author improves the results. The book of philosophy obtains the best precision and the best recall, possibly because it has very long sentences and because philosophical texts follow a stricter syntax than the free style of a literary writer.</Paragraph>
<Paragraph position="26"> As it was impossible for us to collect a sufficient amount of single-author corpora, we could not take these tests further.</Paragraph>
</Section>
</Paper>