<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1056">
  <Title>Semi-Supervised Learning of Partial Cognates using Bilingual Bootstrapping</Title>
  <Section position="6" start_page="442" end_page="443" type="metho">
    <SectionTitle>
4 Data for Partial Cognates
</SectionTitle>
    <Paragraph position="0"> We performed experiments with ten pairs of partial cognates. We list them in Table 1. For a French partial cognate we list its English cognate and several false friends in English. Often the French partial cognate has two senses (one for cognate, one for false friend), but sometimes it has more than two senses: one for cognate and several for false friends (nonetheless, we treat them together). For example, the false friend words for note have one sense for grades and one for bills.</Paragraph>
    <Paragraph position="1"> The partial cognate (PC), the cognate (COG) and false-friend (FF) words were collected from a web resource  . The resource contained a list of 400 false-friends with 64 partial cognates. All partial cognates are words frequently used in the language. We selected ten partial cognates presented in Table 1 according to the number of extracted sentences (a balance between the two meanings), to evaluate and experiment our proposed methods.</Paragraph>
    <Paragraph position="2"> The human effort that we required for our methods was to add more false-friend English words, than the ones we found in the web resource. We wanted to be able to distinguish the</Paragraph>
    <Paragraph position="4"> senses of cognate and false-friends for a wider variety of senses. This task was done using a bi- null client client customer, patron, patient, spectator, user, shopper corps corps body, corpse detail detail retail mode mode fashion, trend, style, vogue note note mark, grade, bill, check, account police police policy, insurance, font, face responsable responsible null in charge, responsible party, official, representative, person in charge, executive, officer route route road, roadside</Paragraph>
    <Section position="1" start_page="442" end_page="443" type="sub_section">
      <SectionTitle>
4.1 Seed Set Collection
</SectionTitle>
      <Paragraph position="0"> Both the supervised and the semi-supervised method that we will describe in Section 5 are using a set of seeds. The seeds are parallel sentences, French and English, which contain the partial cognate. For each partial-cognate word, a part of the set contains the cognate sense and another part the false-friend sense.</Paragraph>
      <Paragraph position="1"> As we mentioned in Section 3, the seed sentences that we use are not hand-tagged with the sense (the cognate sense or the false-friend sense); they are automatically annotated by the way we collect them. To collect the set of seed sentences we use parallel corpora from Hansard  The cognate sense sentences were created by extracting parallel sentences that had on the French side the French cognate and on the English side the English cognate. See the upper part of Table 2 for an example.</Paragraph>
      <Paragraph position="2"> The same approach was used to extract sentences with the false-friend sense of the partial cognate, only this time we used the false-friend English words. See lower the part of Table 2.</Paragraph>
      <Paragraph position="4"> Je note, par exemple, que l'accuse a fait une autre declaration tres incriminante a Hall environ deux mois plus tard.</Paragraph>
      <Paragraph position="5">  If there is a hard frost, people are unable to pay their bills.</Paragraph>
      <Paragraph position="6"> To keep the methods simple and languageindependent, no lemmatization was used. We took only sentences that had the exact form of the French and English word as described in Table 1. Some improvement might be achieved when using lemmatization. We wanted to see how well we can do by using sentences as they are extracted from the parallel corpus, with no additional pre-processing and without removing any noise that might be introduced during the collection process.</Paragraph>
      <Paragraph position="7"> From the extracted sentences, we used 2/3 of the sentences for training (seeds) and 1/3 for testing when applying both the supervised and semi-supervised approach. In Table 3 we present the number of seeds used for training and testing. We will show in Section 6, that even though we started with a small amount of seeds from a certain domain - the nature of the parallel corpus that we had, an improvement can be obtained in discriminating the senses of partial cognates using free text from other domains.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="443" end_page="445" type="metho">
    <SectionTitle>
5 Methods
</SectionTitle>
    <Paragraph position="0"> In this section we describe the supervised and the semi-supervised methods that we use in our experiments. We will also describe the data sets that we used for the monolingual and bilingual bootstrapping technique.</Paragraph>
    <Paragraph position="1"> For both methods we have the same goal: to determine which of the two senses (the cognate or the false-friend sense) of a partial-cognate word is present in a test sentence. The classes in which we classify a sentence that contains a partial cognate are: COG (cognate) and FF (falsefriend). null</Paragraph>
    <Section position="1" start_page="443" end_page="444" type="sub_section">
      <SectionTitle>
5.1 Supervised Method
</SectionTitle>
      <Paragraph position="0"> For both the supervised and semi-supervised method we used the bag-of-words (BOW) approach of modeling context, with binary values for the features. The features were words from the training corpus that appeared at least 3 times in the training sentences. We removed the stop-words from the features. A list of stopwords for English and one for French was used. We ran experiments when we kept the stopwords as features but the results did not improve.</Paragraph>
      <Paragraph position="1"> Since we wanted to learn the contexts in which a partial cognate has a cognate sense and the contexts in which it has a false-friend sense, the cognate and false friend words were not taken into account as features. Leaving them in would mean to indicate the classes, when applying the methods for the English sentences since all the sentences with the cognate sense contain the cognate word and all the false-friend sentences do not contain it. For the French side all collected sentences contain the partial cognate word, the same for both senses.</Paragraph>
      <Paragraph position="2"> As a baseline for the experiments that we present we used the ZeroR classifier from WEKA  , which predicts the class that is the most frequent in the training corpus. The classifiers for which we report results are: Naive Bayes with a kernel estimator, Decision Trees - J48, and a Support Vector Machine implementation - SMO. All the classifiers can be found in the WEKA package.</Paragraph>
      <Paragraph position="3"> We used these classifiers because we wanted to have a probabilistic, a decision-based and a functional classifier. The decision tree classifier allows us to see which features are most discriminative.</Paragraph>
      <Paragraph position="4"> Experiments were performed with other classifiers and with different levels of tuning, on a 10-fold cross validation approach as well; the classifiers we mentioned above were consistently the ones that obtained the best accuracy results.</Paragraph>
      <Paragraph position="5"> The supervised method used in our experiments consists in training the classifiers on the  automatically-collected training seed sentences, for each partial cognate, and then test their performance on the testing set. Results for this method are presented later, in Table 5.</Paragraph>
    </Section>
    <Section position="2" start_page="444" end_page="445" type="sub_section">
      <SectionTitle>
5.2 Semi-Supervised Method
</SectionTitle>
      <Paragraph position="0"> For the semi-supervised method we add unlabelled examples from monolingual corpora: the French newspaper LeMonde  1994, 1995 (LM), and the BNC  corpus, different domain corpora than the seeds. The procedure of adding and using this unlabeled data is described in the Mono-lingual Bootstrapping (MB) and Bilingual Bootstrapping (BB) sections.</Paragraph>
      <Paragraph position="1">  The monolingual bootstrapping algorithm that we used for experiments on French sentences  (MB-F) and on English sentences (MB-E) is: For each pair of partial cognates (PC) 1. Train a classifier on the training seeds - using the BOW approach and a NB-K classifier with attribute selection on the features. 2. Apply the classifier on unlabeled data sentences that contain the PC word, extracted from LeMonde (MB-F) or from BNC (MB-E) 3. Take the first k newly classified sentences, both from the COG and FF class and add them to the training seeds (the most confident ones - the prediction accuracy greater or equal than a threshold =0.85) 4. Rerun the experiments training on the new training set 5. Repeat steps 2 and 3 for t times endFor  For the first step of the algorithm we used NB-K classifier because it was the classifier that consistently performed better. We chose to perform attribute selection on the features after we tried the method without attribute selection. We obtained better results when using attribute selection. This sub-step was performed with the WEKA tool, the Chi-Square attribute selection was chosen.</Paragraph>
      <Paragraph position="2"> In the second step of the MB algorithm the classifier that was trained on the training seeds was then used to classify the unlabeled data that was collected from the two additional resources. For the MB algorithm on the French side we trained the classifier on the French side of the</Paragraph>
      <Paragraph position="4"> training seeds and then we applied the classifier to classify the sentences that were extracted from LeMonde and contained the partial cognate. The same approach was used for the MB on the English side only this time we were using the English side of the training seeds for training the classifier and the BNC corpus to extract new examples. In fact, the MB-E step is needed only for the BB method.</Paragraph>
      <Paragraph position="5"> Only the sentences that were classified with a probability greater than 0.85 were selected for later use in the bootstrapping algorithm.</Paragraph>
      <Paragraph position="6"> The number of sentences that were chosen from the new corpora and used in the first step of the MB and BB are presented in Table 4.</Paragraph>
      <Paragraph position="7">  For the partial-cognate Blanc with the cognate sense, the number of sentences that had a probability distribution greater or equal with the threshold was low. For the rest of partial cognates the number of selected sentences was limited by the value of parameter k in the algorithm.  The algorithm for bilingual bootstrapping that we propose and tried in our experiments is:  1. Translate the English sentences that were collected in the MB-E step into French using an online MT 9 tool and add them to the French seed training data.</Paragraph>
      <Paragraph position="8"> 2. Repeat the MB-F and MB-E steps for T times.  For the both monolingual and bilingual bootstrapping techniques the value of the parameters t and T is 1 in our experiments.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>