Question Pre-Processing in a QA System on Internet Discussion Groups

2 Garbage Text Removal

2.1 Garbage Texts

Articles in discussion groups are colloquial. Users often write articles as if they were talking to other users. For this reason, phrases expressing appreciation, begging, or the writer's emotions are often seen in postings. For example:

    You Guan powerpoint Wen Ti, Wo Xiang Qing Wen [?] Xia
    (I would like to ask about a PowerPoint problem)

The phrases "Wo Xiang Qing Wen [?] Xia" ("I'd like to ask") and "Xie Xie" ("Thank you") are unimportant to the question itself. Such phrases often contain content words rather than stop words, and are therefore hard to distinguish from the real question. If they are not removed, two questions may be judged "similar" merely because one of these phrases appears in both.

A phrase that contributes no information about a question is called a garbage text in this paper and should be removed beforehand to reduce noise. The term theme text refers to the remaining text.

After examining real querying postings, we observed several characteristics of garbage texts:

1. Some words strongly suggest that they belong to a garbage text, such as "thank" in "thank you so much" or "help" in "who can help me".

2. Some words appear in both theme texts and garbage texts, so ambiguity arises. For example:

    "Qing Jiao Gao Shou" (Any expert please help)
    "Kuai Shan Gao Shou" (Flash expert)

The first phrase is a garbage text, while the second is a product name. The word "expert" suggests the presence of a garbage text, but not in all cases.

Because punctuation marks are not reliable in Chinese, we use the sentence fragment as the unit to be processed. A sentence fragment is defined as a stretch of text delimited by commas, periods, question marks, exclamation marks, or space marks. A space mark is a fragment boundary only when neither the character preceding it nor the character following it is an English letter, digit, or punctuation mark.
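The fragment-segmentation rule is concrete enough to sketch in code. Below is a minimal Python sketch under our own assumptions; the paper does not publish an implementation, and the handling of full-width punctuation and all helper names are ours.

    import string

    # Punctuation that always ends a sentence fragment. We assume both the
    # ASCII and the full-width (Chinese) forms occur in postings.
    HARD_BREAKS = set(",.?!") | set("，。？！")

    # Characters that keep an adjacent space from acting as a boundary.
    ASCII_TOKEN = set(string.ascii_letters + string.digits + string.punctuation)

    def split_fragments(text: str) -> list[str]:
        """Split a posting into sentence fragments.

        Commas, periods, question marks, and exclamation marks always end
        a fragment. A space ends a fragment only when neither neighbouring
        character is an English letter, digit, or punctuation mark, so
        "Office 2003" stays inside one fragment.
        """
        fragments, current = [], []
        for i, ch in enumerate(text):
            if ch in HARD_BREAKS:
                if current:
                    fragments.append("".join(current))
                    current = []
            elif ch == " ":
                prev = text[i - 1] if i > 0 else ""
                nxt = text[i + 1] if i + 1 < len(text) else ""
                if prev in ASCII_TOKEN or nxt in ASCII_TOKEN:
                    current.append(ch)  # space inside an English/number run
                elif current:
                    fragments.append("".join(current))
                    current = []
            else:
                current.append(ch)
        if current:
            fragments.append("".join(current))
        return fragments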
2.2 Strategies to Remove Garbage Texts

Frequent terms seen in garbage texts are collected as garbage keywords and grouped into classes according to their meanings and usages. Table 1 gives some examples of classes of garbage keywords collected from the training set. To handle ambiguity, this paper proposes a length-based strategy for detecting garbage texts: if a sentence fragment contains a garbage keyword and the length of the fragment after removing the keyword is less than a threshold, the whole fragment is judged to be a garbage text. Otherwise, only the garbage keyword itself is judged to be garbage text, and only if it is never involved in an ambiguous case.

Different length thresholds are assigned to different classes of garbage keywords. If more than one garbage keyword occurs in a fragment, all the keywords are discarded first, and the length of the remaining fragment is then compared with the maximal threshold among those of the garbage keywords involved. A sketch of this procedure is given below.

In order to increase the coverage of garbage keywords, other linguistic resources are used to expand the keyword list. Synonyms from Tongyici Cilin (Tong Yi Ci Ci Lin), a thesaurus of Chinese words, are added to the list, and further garbage keywords are added based on common knowledge.
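To make the two rules concrete, here is a minimal Python sketch under stated assumptions: the keyword classes, the threshold values, and the ambiguity flags are illustrative placeholders for the paper's Table 1, not its actual resources.

    from dataclasses import dataclass

    @dataclass
    class KeywordClass:
        """One class of garbage keywords with its own length threshold."""
        keywords: list[str]
        length_threshold: int     # characters, Chinese or English alike
        ambiguous: bool = False   # True if keywords also occur in theme text

    # Illustrative classes only; the real ones are listed in Table 1.
    CLASSES = [
        KeywordClass(["thank you", "thanks a lot"], length_threshold=4),
        KeywordClass(["please help", "help me"], length_threshold=6),
        KeywordClass(["expert"], length_threshold=5, ambiguous=True),
    ]

    def remove_garbage(fragment: str) -> str:
        """Apply the length-threshold strategy to one sentence fragment.

        Returns the fragment with garbage removed; an empty string means
        the whole fragment was judged to be a garbage text.
        """
        hits = [(kw, cls) for cls in CLASSES
                for kw in cls.keywords if kw in fragment]
        if not hits:
            return fragment

        # Discard every matched keyword, then compare what remains with
        # the maximal threshold among the classes involved.
        remainder = fragment
        for kw, _ in hits:
            remainder = remainder.replace(kw, "")
        max_threshold = max(cls.length_threshold for _, cls in hits)
        if len(remainder.strip()) < max_threshold:
            return ""  # the whole fragment is garbage

        # Otherwise remove only keywords that are never ambiguous.
        kept = fragment
        for kw, cls in hits:
            if not cls.ambiguous:
                kept = kept.replace(kw, "")
        return kept.strip()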
3 Question Segmentation

When a user posts an article to a discussion group, he may pose more than one question at a time. For example, the following posting contains three questions submitted at once:

    Office 2003 He XP You He Bu Tong Zhi Chu Ni? Na [?] Ge Bi Jiao Xin Ni? Zui Xin De Ban Ben Shi ???????????
    (Office 2003 and XP - what are the differences between them? Which version is newer? What is the latest version???????????)

If a new user wants to know the latest version of Office, the responses to this earlier posting will give the answer.

Table 2 lists statistics on the number of questions per posting in the training set. The first column is the number of questions in one posting; the second and third columns give the number and the percentage of postings containing that many questions, respectively. As Table 2 shows, nearly half (43.02%) of the postings contain two or more questions. That is why question segmentation is necessary.

3.1 Characteristics of Questions in a Posting

Several characteristics of question texts were observed in real discussion groups:

1. Some people use '?' (question mark) at the end of a question while others do not. In Chinese, some people even separate sentences only by spaces instead of punctuation marks. (Note that there are no space marks between words in Chinese text.)

2. Questions are usually in interrogative form: either interrogatives or question marks appear in them.

3. One question may occur repeatedly in the same posting. It is often the case that a question appears both in the title and in the content, and sometimes a user repeats a sentence several times to show his anxiety.

4. One question may be expressed in different ways in the same posting. The sentences may be similar; for example:

    A: Office2000 De Jian Tie Bo Zhi Neng Wei Chi 12 Ge Xiang Mu?
    B: Office2000 De Jian Tie Bo Zhi Neng Bao Chi 12 Ge Xiang Mu?
    (Can the clipboard of Office2000 only keep 12 items?)

"Wei Chi" and "Bao Chi" are synonyms meaning "keep". Dissimilar sentences may also refer to the same question; for example:

    (1) How to use automatic text wrapping in Excel?
    (2) If I want to put two or more lines in one cell, what can I do?
    (3) How to use it?

These three sentences ask the same question: "How to use automatic text wrapping in Excel?" The second sentence gives a detailed description of what the user wants to do. The topic of the third sentence is the same as that of the first and is therefore omitted; topic ellipsis is quite common in Chinese.

5. Some users give examples to explain their questions. Such sentences often start with phrases like "for example" or "such as".

3.2 Strategies to Separate Questions

According to the observations in Section 3.1, several strategies are proposed to separate questions.

(1) Separating by Question Mark ('?')

This is the simplest method; we use it as a baseline strategy.

(2) Identifying Questions by Interrogative Forms

Questions are usually in interrogative form, including subject inversion ("is he...", "does it..."), interrogatives ("who is..."), or a declarative sentence followed by a question mark ("Office2000 is better?"). Only the third form requires a question mark; the first two mark themselves as questions by text alone. Moreover, some Chinese particles, such as "Ma" or "Ni", indicate a question as well.

If a sentence fragment is in interrogative form, it is judged to be a question and separated from the others. A fragment not in interrogative form is merged with the nearest question fragment preceding it (or following it, if there is no preceding one). Note that garbage texts have been removed before question separation.

(3) Merging or Removing Similar Sentences

If two sentence fragments are exactly the same, one of them is removed. If two sentence fragments are similar, they are merged into one question fragment.

Similarity is measured by the Dice coefficient (Dice, 1945) over the weights of the words the two sentence fragments share. The similarity of two sentence fragments X and Y is defined as

    Sim(X, Y) = \frac{2 \sum_{k \in X \cap Y} Wt(k)}{\sum_{w \in X} Wt(w) + \sum_{w \in Y} Wt(w)}    (1)

where Wt(w) is the weight of a word w and k ranges over the words appearing in both X and Y. Fragments whose similarity exceeds a threshold are merged together.

The weight of a word is the weight of its part of speech, as listed in Table 3. Nouns and verbs have higher weights, while adverbs and particles have lower ones. Note that foreign words are assigned a rather high weight, because names of software products such as "Office" or "Oracle" are often written in English, i.e., as foreign words with respect to Chinese.

Before computing similarity, word segmentation is performed to identify the words in the Chinese text; a part-of-speech tagger is then used to obtain the POS of each word.
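Equation 1 translates directly into code. A minimal sketch, assuming word segmentation and POS tagging have already produced (word, POS) pairs; the weight values are placeholders standing in for the paper's Table 3, and the threshold value is illustrative.

    # Placeholder POS weights standing in for Table 3: nouns, verbs, and
    # foreign words high; adverbs and particles low.
    POS_WEIGHT = {
        "noun": 10.0, "verb": 10.0, "foreign": 9.0,
        "adjective": 5.0, "adverb": 1.0, "particle": 0.5,
    }

    def dice_similarity(x: list[tuple[str, str]],
                        y: list[tuple[str, str]]) -> float:
        """Weighted Dice coefficient (Equation 1) between two fragments,
        each given as a list of (word, POS) pairs."""
        def wt(pos: str) -> float:
            return POS_WEIGHT.get(pos, 1.0)

        x_pos = {w: p for w, p in x}
        y_pos = {w: p for w, p in y}
        shared = set(x_pos) & set(y_pos)
        denom = sum(wt(p) for _, p in x) + sum(wt(p) for _, p in y)
        if denom == 0.0:
            return 0.0
        return 2.0 * sum(wt(x_pos[k]) for k in shared) / denom

    SIM_THRESHOLD = 0.75  # illustrative; fragments above it are merged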
(4) Merging Questions with the Same Type

Question-type information has been widely adopted in QA systems (Zhang and Lee, 2003; Hovy et al., 2002; Harabagiu et al., 2001). The question type usually refers to the expected type of the answer, such as a person name, a location name, or a temporal expression. The question types used in this paper are PERSON, LOCATION, REASON, QUANTITY, TEMPORAL, COMPARISON, DEFINITION, METHOD, SELECTION, YESNO, and OTHER. The rules that determine question types were created manually.

This strategy tries to merge two question fragments of the same question type. This paper proposes two features for setting the merging threshold: the length of a fragment and the sum of its term weights. Length is measured in characters, and term weights are assigned as in Table 3.

The merging algorithm is as follows: if the feature value of a question fragment is smaller than a threshold, the fragment is merged into the preceding question fragment (or the following one, if there is no preceding fragment). The strategy is applied recursively until no question fragment has a feature value below the threshold.

(5) Merging Example Fragments

If a fragment starts with a phrase such as "for example" or "such as", it is merged into its preceding question fragment.
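A sketch of the recursive merging step in strategy (4), under our own assumptions: question_type stands in for the paper's manually written type rules, and feature_value for either of the two features (character length or sum of term weights).

    from typing import Callable

    def merge_same_type(fragments: list[str],
                        question_type: Callable[[str], str],
                        feature_value: Callable[[str], float],
                        threshold: float) -> list[str]:
        """Recursively merge low-feature fragments into a same-type neighbour.

        A fragment whose feature value falls below the threshold is merged
        into the preceding fragment of the same question type, or into the
        following one if there is no preceding fragment; this repeats
        until every remaining fragment is at or above the threshold.
        """
        changed = True
        while changed:
            changed = False
            for i, frag in enumerate(fragments):
                if feature_value(frag) >= threshold:
                    continue
                j = i - 1 if i > 0 else i + 1  # prefer the preceding fragment
                if 0 <= j < len(fragments) and \
                        question_type(fragments[j]) == question_type(frag):
                    lo, hi = sorted((i, j))
                    merged = fragments[lo] + fragments[hi]
                    fragments = fragments[:lo] + [merged] + fragments[hi + 1:]
                    changed = True
                    break
        return fragments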
4 Experiments

4.1 Experimental Data

All the experimental data were collected from discussion groups similar to Yahoo! Answers (http://tw.knowledge.yahoo.com/) but in Chinese instead of English. Three discussion groups, "Business Application" (Shang Wu Ying Yong), "Website Building" (Wang Zhan Jia She), and "Image Processing" (Ying Xiang Chu Li), were selected as sources of querying postings. We chose these three groups because of their moderate growth rates: we could collect a sufficient number of querying postings published within the same period of time.

The following kinds of postings were excluded from the experimental data:

1. Postings containing no questions
2. Postings full of algorithms or program code
3. Postings full of emoticons or Martian texts (Huo Xing Wen, a jocular Chinese term for a writing style that replaces the original text with similarly pronounced words)
4. Redundant postings

In total, 598 querying postings were collected as the training set and 269 postings as the test set. The numbers of postings collected from each group are listed in Table 4, where "BA", "WB", and "IP" stand for "Business Application", "Website Building", and "Image Processing", respectively.

Two annotators were asked to mark garbage texts and separate questions in the whole data set. Any conflicting case was resolved by a third person (one of the authors of this paper).

4.2 Garbage Text Removal

The first factor examined in garbage text removal is the length threshold. Table 5 lists the experimental results on the training set and Table 6 those on the test set. All garbage keywords were collected from the training set.

Eight experiments were conducted with different length thresholds. The strategy Lenk sets the length threshold to k characters (Chinese or English alike). Len0 is thus one baseline strategy, which removes only the garbage keyword itself; LenS is the other baseline strategy, which removes the whole sentence fragment in which a garbage keyword appears. The strategy Heu uses a different length threshold for each class of garbage keywords; the thresholds are heuristic values chosen after examining many examples in the training set.

Accuracy is defined as the percentage of successful removals. For one posting, if all real garbage texts are correctly removed and no other text is wrongly deleted, it counts as one successful removal.

As both tables show, the two baseline strategies are poorer than every other strategy, which means that a length threshold is useful for deciding whether garbage is present. Heu is the best strategy (99.67% on the training set and 87.73% on the test set). Len3 is the best among the Lenk strategies (80.60% on the training set and 75.49% on the test set), but it is far worse than Heu. We conclude that the length threshold should be assigned individually for each class of garbage keywords; if it is assigned carefully, garbage removal performs well.

The second factor is the expansion of the garbage keyword list. The strategy HeuExp is the same as Heu except that the list of garbage keywords was expanded as described in Section 2.2. Comparing the last two rows in Table 6, HeuExp improves the performance from 87.73% to 92.57%. This shows that, after keyword expansion using available linguistic resources, a small number of postings can provide good coverage of garbage keywords.

The results of HeuExp and Heu on the training set are identical. This makes sense, because the expanded list cannot suggest garbage in the training set beyond what the original list, collected from that very set, already does.
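The posting-level accuracy used here is strict and easy to pin down in code. A small sketch, assuming gold and system garbage spans are represented as sets of (start, end) character offsets per posting (a representation of our own choosing):

    def removal_accuracy(gold: list[set[tuple[int, int]]],
                         system: list[set[tuple[int, int]]]) -> float:
        """Fraction of postings whose garbage was removed exactly right.

        A posting counts as one successful removal only when the system
        removed precisely the gold garbage spans: nothing missed and
        nothing wrongly deleted.
        """
        assert len(gold) == len(system)
        hits = sum(1 for g, s in zip(gold, system) if g == s)
        return hits / len(gold) if gold else 0.0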
4.3 Question Segmentation

Overall Strategies

Six experiments were conducted to examine the performance of different strategies for question segmentation:

Baseline: using only '?' (question mark) to separate questions

SameS: removing repeated sentence fragments, then separating by '?'

Interrg: after removing repeated sentence fragments, separating questions that are in interrogative form

SimlrS: following the strategy Interrg, removing or merging similar sentence fragments of the same question type

ForInst: following the strategy SimlrS, merging a sentence fragment beginning with "for instance" or the like into its preceding question fragment

SameQT: following the strategy ForInst, merging question fragments of the same question type without considering similarity

Tables 7 and 8 present the results of the six experiments on the training set and the test set, respectively. The second column in each table lists the accuracy, defined as the percentage of postings separated into the same number of questions as manually tagged. The third column gives the number of postings correctly separated; the fourth and fifth columns contain the numbers of postings separated into more and fewer questions, respectively.

As Table 7 shows, performance improves gradually as strategies are added. SameQT achieves the best performance, with 88.29% accuracy. The same conclusion holds on the test set, where SameQT is again the best, with 85.87% accuracy.

In Table 7, Baseline achieves only 50.67% accuracy. This matches our observations: (1) one question is often stated several times within one posting, in sentences ending with question marks (213 postings were separated into too many questions); and (2) some users do not use '?' in writing (82 postings were separated into too few questions).

SameS greatly reduces the cases of separation into too many questions (57 postings) by removing repeated sentences. Interrg, on the other hand, greatly reduces the cases of separation into too few questions (76 postings); many question sentences without question marks were successfully captured by detecting interrogative forms.

SimlrS also helps considerably (successfully reducing the number of questions in 63 postings), but ForInst improves performance only slightly: expressing one question several times in different ways is much more common than giving examples.

SameQT achieves the best performance, which indicates that question type is a good cue: different formulations of a question usually share the same question type. Compared with SimlrS, which also considers sentence fragments of the same question type, the additional improvement comes from the successful merging of fragments involving topic ellipsis, co-reference, or paraphrase. However, other questions of the same question type may be wrongly merged together (49 such failures in the training set).

On the test set, Interrg does not improve the overall performance over SameS, because its gains equal its losses. ForInst does not help either; giving examples seems uncommon in these discussion groups.
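The six strategies are cumulative: each applies all the steps of the previous one plus one more. Below is a sketch of that composition under our own naming, where the input fragments are assumed to be already split on '?' (the Baseline operation) and the placeholder steps stand in for the operations sketched in Section 3.2:

    from typing import Callable

    Step = Callable[[list[str]], list[str]]

    def remove_repeated(frags: list[str]) -> list[str]:
        """SameS step: drop exact duplicate fragments, keeping the first."""
        seen: set[str] = set()
        out: list[str] = []
        for f in frags:
            if f not in seen:
                seen.add(f)
                out.append(f)
        return out

    # Placeholder steps; runnable versions of the similarity and
    # same-type merges are sketched in Section 3.2 above.
    split_interrogative: Step = lambda frags: frags
    merge_similar: Step = lambda frags: frags
    merge_examples: Step = lambda frags: frags
    merge_by_type: Step = lambda frags: frags

    PIPELINES: dict[str, list[Step]] = {
        "Baseline": [],
        "SameS":    [remove_repeated],
        "Interrg":  [remove_repeated, split_interrogative],
        "SimlrS":   [remove_repeated, split_interrogative, merge_similar],
        "ForInst":  [remove_repeated, split_interrogative, merge_similar,
                     merge_examples],
        "SameQT":   [remove_repeated, split_interrogative, merge_similar,
                     merge_examples, merge_by_type],
    }

    def run_strategy(name: str, fragments: list[str]) -> list[str]:
        questions = fragments
        for step in PIPELINES[name]:
            questions = step(questions)
        return questions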
Thresholds in SameQT

In the strategy SameQT, two features, length and the sum of term weights, are used to set the threshold for merging question fragments, as mentioned in Section 3.2. To decide which feature is better and which threshold value should be used, two experiments were conducted.

Table 9 presents the results of using the length of sentence fragments as the merging threshold. The column "LenThr" lists the different length-threshold settings and the column "Acc" gives the accuracy. Performance improves gradually as the length threshold increases. The best setting is LenThr=30, with 88.63% accuracy. However, "Always Merging" (LenThr=∞) achieves 88.29% accuracy, which is acceptable compared with the best setting. Fig. 1 plots accuracy against the length threshold.

Table 10 presents the results of using the sum of term weights as the merging threshold. The column "WgtThr" lists the different weight-threshold settings and the column "Acc" gives the accuracy. Performance likewise improves gradually as the weight threshold increases. When WgtThr is set to 500, 700, or 900, the performance is the best, with 88.46% accuracy. As with the length feature, the best setting does not outperform the "Always Merging" strategy (WgtThr=∞, 88.29% accuracy) by much. Fig. 2 plots accuracy against the weight threshold.

From the above results we can see that, although the length feature with threshold LenThr=30 achieves the best performance, "Always Merging" is preferable for an online system: no feature extraction or computation is needed, at only a small cost in performance. Hence we choose "Always Merging" as the merging strategy in SameQT.
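In terms of the merge_same_type sketch in Section 3.2, "Always Merging" simply means an infinite threshold, so no feature value is ever computed at run time. A toy usage, where classify_question_type is a hypothetical stand-in for the paper's manually written type rules:

    def classify_question_type(fragment: str) -> str:
        """Toy stand-in for the manually written question-type rules."""
        keywords = ("how", "what")
        return "METHOD" if any(k in fragment.lower() for k in keywords) else "OTHER"

    fragments = [
        "how to use automatic text wrapping in excel",
        "if I want to put two or more lines in one cell what can I do",
        "how to use it",
    ]

    # "Always Merging": every fragment is below an infinite threshold, so
    # all same-type fragments end up merged, with no feature extraction.
    merged = merge_same_type(fragments, classify_question_type,
                             feature_value=len, threshold=float("inf"))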