<?xml version="1.0" standalone="yes"?>
<Paper uid="I05-3007">
  <Title>Chinese Sketch Engine and the Extraction of Grammatical Collocations</Title>
  <Section position="3" start_page="0" end_page="49" type="metho">
    <SectionTitle>
2. Online Chinese Corpora: The State of the Arts
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="48" type="sub_section">
      <SectionTitle>
2.1 Chinese Corpora
</SectionTitle>
      <Paragraph position="0"> The first online tagged Chinese corpus is the Academia Sinica Balanced Corpus of Modern Chinese (Sinica Corpus), which has been web-accessible since November 1996. The current version contains 5.2028 million words (7.8927 million characters). The corpus data was collected between 1990 and 1996 (CKIP, 1995/1998). Two additional Chinese corpora were made available online in 2003. The first is the Sinorama Chinese-English Parallel Text Corpus (Sinorama Corpus). The Sinorama Corpus is composed of 2,373 parallel texts in Chinese and English that were published between 1976 and 2000. There are 103,252 pairs of sentences, composed of roughly 3.2 million English words and 5.3 million Chinese characters. The second is the modern Chinese corpus developed by the Center for Chinese Linguistics (CCL Corpus) at Peking University. It contains about eighty-five million (85,398,433) simplified Chinese characters from texts published after 1919.</Paragraph>
    </Section>
    <Section position="2" start_page="48" end_page="49" type="sub_section">
      <SectionTitle>
2.2 Extracting Linguistic Information from Online Chinese Corpora: Tools and Interfaces
</SectionTitle>
      <Paragraph position="0"> The Chinese corpora discussed above are all equipped with an online interface that allows users to extract linguistic generalizations. Both Sinica Corpus and CCL Corpus offer KWIC-based functions, while Sinorama Corpus gives sentence- and paragraph-aligned output.</Paragraph>
      <Paragraph position="1"> The basic unit of query that a corpus allows defines the set of information that can be extracted from that corpus. While there is no doubt that a segmented corpus allows more precise linguistic generalizations, string-based collocation still affords a corpus a robustness that is not restricted by an arbitrary word list or segmentation algorithm. This robustness is of greatest value when extracting neologisms or sub-lexical collocations. Since CCL Corpus is neither segmented nor tagged, string-based KWIC is its main tool for extracting generalizations. This comes with the familiar pitfall of word boundary ambiguity. For instance, a query for the string ci.yao can return a sentence such as (1b), in which ci 'time' and the first syllable of yao.qiu 'ask' are merely adjacent and do not form a word: (1b) ta ji ci yao.qiu ta da.fu (he several time ask her answer) 'He asked her several times to answer.' Sinica Corpus, on the other hand, is fully segmented and allows word-based generalizations. In addition, Sinica Corpus allows wildcards in its search. Users can specify a wildcard of arbitrary length (*) or of fixed length (?). This allows a search for a class of words sharing some character strings.</Paragraph>
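The two wildcard types can be illustrated with a minimal Python sketch. This is not the Sinica Corpus implementation: the function name wildcard_search and the word list are hypothetical, the words are written in this paper's dotted pinyin notation, and the standard-library fnmatch module stands in for the corpus query engine. Note that in romanized text each ? matches one Latin letter, whereas in the real interface it would presumably match one Chinese character.

```python
from fnmatch import fnmatchcase


def wildcard_search(words, pattern):
    """Return the words that match a wildcard pattern.

    '*' stands for any number of characters and '?' for exactly one,
    mirroring the arbitrary-length and fixed-length wildcards above.
    """
    return [w for w in words if fnmatchcase(w, pattern)]


# Hypothetical segmented word list in dotted pinyin (an assumption).
words = ["yao.qiu", "yao.shi", "xu.yao", "ci.yao", "da.fu"]

print(wildcard_search(words, "yao.*"))   # words that begin with yao
print(wildcard_search(words, "*.yao"))   # words that end in yao
print(wildcard_search(words, "??.yao"))  # exactly two letters, then .yao
```

Because the search runs over a list of already segmented words, a pattern can never match across a word boundary, which is the advantage of word-based queries noted above.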
      <Paragraph position="2"> Formal restrictions on the display of extracted data also constrain the type of information that can be obtained from a corpus. Sinica Corpus allows users to change the window size from about 25 to 57 Chinese characters. However, since a Chinese sentence may be longer than 57 characters, Sinica Corpus cannot guarantee that a full sentence is displayed. CCL Corpus, on the other hand, is able to show a full output sentence, which may be up to 200 Chinese characters long. However, it does not display more than a full sentence, so it cannot show discourse information. Sinorama Corpus with the TOTALrecall interface is the most versatile in this respect: aligned bilingual full sentences are shown with an easy link to the full text.</Paragraph>
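A string-based KWIC display with an adjustable window can be sketched in a few lines of Python. This is a minimal illustration and not the code behind any of the interfaces discussed: the function name kwic and the per-side window parameter are assumptions, and the test string is the romanized example sentence from this section with word boundaries removed to imitate an unsegmented corpus.

```python
def kwic(text, query, window=25):
    """Return (left, match, right) triples for every occurrence of query.

    window is the number of characters of context kept on each side
    (an assumed convention; the real interfaces may count differently).
    """
    hits = []
    start = text.find(query)
    while start != -1:
        end = start + len(query)
        left = text[max(0, start - window):start]
        right = text[end:end + window]
        hits.append((left, query, right))
        # Continue one character later so overlapping hits are kept.
        start = text.find(query, start + 1)
    return hits


# Unsegmented stand-in for: ta ji ci yao.qiu ta da.fu
unsegmented = "tajiciyaoqiutadafu"
for left, match, right in kwic(unsegmented, "ciyao", window=4):
    print(left, "[" + match + "]", right)
```

The query string is found even though it straddles a word boundary in the segmented sentence, which reproduces the ambiguity described for string-based KWIC. A small window also shows why a fixed-width display cannot guarantee a full sentence.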
      <Paragraph position="3"> In terms of size and completeness of extracted data, Sinica Corpus returns all matched examples. However, users must cut and paste the output to build their own dataset. CCL Corpus, on the other hand, limits data to 500 lines per page, but allows easy download of output data. Lastly, Sinorama/TOTALrecall provides a choice of 5 to 100 sentences per page. Both Sinica Corpus and CCL Corpus allow users to process extracted information using linguistic and contextual filters or sorters. CCL Corpus requires users to remember the rules, while Sinica Corpus allows users to fill in blanks and/or choose from pull-down menus. In particular, Sinica Corpus allows users to refine their generalizations by quantitatively characterizing the left and right contexts. The quantitative sorting functions include both word and POS frequency, as well as word mutual information.</Paragraph>
      <Paragraph position="4"> Availability of grammatical information depends on corpus annotation. The CCL and Sinorama corpora do not have POS tags. Sinica Corpus is the only Chinese corpus that gives users an overview of a keyword's syntactic behavior. Users can obtain a list of the types and distribution of the keyword's syntactic categories. In addition, users can find possible collocations of the keyword from the Mutual Information (MI) output.</Paragraph>
      <Paragraph position="5"> The most salient grammatical information, such as grammatical functions (subject, object, adjunct, etc.), is beyond the scope of traditional corpus interface tools. Traditional corpora rely on human users to arrive at these kinds of generalizations.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="49" end_page="51" type="metho">
    <SectionTitle>
3. Sketch Engine: A New Corpus-based Approach to Grammatical Information
</SectionTitle>
    <Paragraph position="0"> Several existing linguistically annotated corpora of Chinese, e.g. Penn Chinese Tree Bank (Xia et al., 2000), Sinica Treebank (Chen et al., 2003), Proposition Bank (Xue and Palmer, 2003, 2005) and Mandarin VerbNet (Wu and Liu, 2003), suffer from the same problem. They are all extremely labor-intensive to build and typically have narrow coverage. In addition, since structural assignment is theory-dependent and abstract, inter-annotator consistency is difficult to achieve. Since there is also no general consensus on the annotation scheme in Chinese NLP and linguistics, building an effective interface for public use is almost impossible.</Paragraph>
    <Paragraph position="1"> The Sketch Engine offers an answer to the above issues.</Paragraph>
    <Section position="1" start_page="49" end_page="50" type="sub_section">
      <SectionTitle>
3.1 Initial Implementation and Design of the
Sketch Engine
</SectionTitle>
      <Paragraph position="0"> The Sketch Engine is a corpus processing system developed in 2002 (Kilgarriff and Tugwell, 2002; Kilgarriff et al., 2004). The main components of the Sketch Engine are KWIC concordances, word sketches, grammatical relations, and a distributional thesaurus. In its first implementation, it takes as input basic BNC (British National Corpus; Leech, 1992) data: the annotated corpus, as well as a list of lemmas with frequencies. In other words, the Sketch Engine has a relatively low threshold for the complexity of the input corpus.</Paragraph>
      <Paragraph position="1"> The Sketch Engine has a versatile query system. Users can restrict their query to any sub-corpus of the BNC. A query string may be a word (with or without POS specification) or a phrasal segment. A query can also be performed using Corpus Query Language (CQL). The output display format can be adjusted, and the displayed window of a specific item can be freely expanded left and right. Most importantly, the Sketch Engine produces a Word Sketch (Kilgarriff and Tugwell, 2002), an automatically generated grammatical description of a lemma in terms of corpus collocations. All items in each collocation are linked back to the original corpus data. Hence it is similar to a Linguistic Knowledge Net anchored by a lexicon (Huang et al., 2001).</Paragraph>
      <Paragraph position="2"> A Word Sketch is a one-page list of a keyword's functional distribution and collocations in the corpus. The functional distribution includes subject, object, prepositional object, and modifier. Its collocations are described by a list of linguistically significant patterns in the language. Word Sketch uses regular expressions over POS tags to formalize rules of collocation patterns. For example, (2) is used to retrieve the verb-object relation in English: (2) 1:&quot;V&quot; &quot;(DET|NUM|ADJ|ADV|N)&quot;* 2:&quot;N&quot; The expression in (2) says: extract the data containing a verb followed by a noun, regardless of how many determiners, numerals, adjectives, adverbs and nouns precede the noun. It can extract data containing cook meals, cooking a five-course gala dinner, cooked the/his/two surprisingly good meals, etc.</Paragraph>
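The behavior of pattern (2) can be approximated with a short Python sketch over a POS-tagged sentence. This is an illustration under stated assumptions, not the Sketch Engine's own matcher: the function name verb_object_pairs, the simplified tag set, and the tagged example sentences are hypothetical, and the greedy reading (the last noun in the run is taken as the object) is an assumption about how the starred group is resolved.

```python
# Tags allowed between the verb and the object noun in pattern (2).
FILLER_TAGS = {"DET", "NUM", "ADJ", "ADV", "N"}


def verb_object_pairs(tagged):
    """Return (verb, noun) pairs from a list of (word, tag) tuples.

    Approximates  1:"V" "(DET|NUM|ADJ|ADV|N)"* 2:"N"  with greedy matching.
    """
    pairs = []
    for i, (word, tag) in enumerate(tagged):
        if tag != "V":
            continue
        j = i + 1
        # Consume the longest run of filler tags that follows the verb.
        while j != len(tagged) and tagged[j][1] in FILLER_TAGS:
            j += 1
        # Back off to the last noun inside that run; it fills slot 2.
        k = j - 1
        while k > i and tagged[k][1] != "N":
            k -= 1
        if k > i:
            pairs.append((word, tagged[k][0]))
    return pairs


# Hypothetical tagging of two of the examples quoted above.
dinner = [("cooking", "V"), ("a", "DET"), ("five-course", "ADJ"),
          ("gala", "N"), ("dinner", "N")]
meals = [("cooked", "V"), ("two", "NUM"), ("surprisingly", "ADV"),
         ("good", "ADJ"), ("meals", "N")]

print(verb_object_pairs(dinner))
print(verb_object_pairs(meals))
```

A verb followed by a preposition or another non-filler tag yields no pair, since the run of permitted tags is empty and no noun can fill the second slot.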
      <Paragraph position="3"> The Sketch Engine also produces thesaurus lists: for an adjective, a noun or a verb, it lists the other words most similar to it in their use in the language (Kilgarriff et al., 2004). For instance, the top five synonym candidates for the verb kill are shoot (0.249), murder (0.23), injure (0.229), attack (0.223), and die (0.212); similarity is measured and ranked following Lin (1998). It also provides direct links to the Sketch Difference, which lists the similar and different patterns between a keyword and a similar word. For example, both kill and murder can occur with objects such as people and wife, but murder usually occurs with personal proper names and seldom selects animal nouns as its complement, whereas kill can take fox, whale, dolphin, and guerrilla, etc. as its object.</Paragraph>
      <Paragraph position="4"> The Sketch Engine adopts Mutual Information (MI) to measure the salience of a collocation. Salience data are shown against each collocation in Word Sketches and other Sketch Engine output. MI provides a measure of the degree of association of a given segment with others. Pointwise MI, calculated by Equation 3, is used in lexical processing to measure the degree of association between two words x and y (a collocation).</Paragraph>
      <Paragraph position="5"> (3) $I(x;y) = \log \frac{P(x|y)}{P(x)}$</Paragraph>
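As a worked illustration of Equation 3, pointwise MI can be estimated from raw corpus counts. The sketch below is not the Sketch Engine's salience computation: the function name pointwise_mi, the maximum-likelihood probability estimates, the base-2 logarithm, and the counts in the example are all assumptions. It relies on the identity that log P(x|y)/P(x) equals log P(x,y)/(P(x)P(y)).

```python
import math


def pointwise_mi(count_xy, count_x, count_y, total):
    """Pointwise MI of x and y estimated from counts.

    count_xy: co-occurrences of x and y; count_x, count_y: occurrences
    of each word; total: number of observations. Base 2 is assumed.
    """
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    # log P(x|y)/P(x) rewritten as log P(x,y) / (P(x) P(y)).
    return math.log2(p_xy / (p_x * p_y))


# Hypothetical counts: x occurs 10 times, y 20 times, together 5 times
# in 1000 observations, so the pair co-occurs 25 times more often
# than chance would predict.
print(pointwise_mi(5, 10, 20, 1000))
```

A positive value indicates that the two words co-occur more often than expected if they were independent, a value near zero indicates independence, and a negative value indicates that they tend to avoid each other.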
    </Section>
    <Section position="2" start_page="50" end_page="51" type="sub_section">
      <SectionTitle>
3.2 Application to Chinese Corpus
</SectionTitle>
      <Paragraph position="0"> In order to show the cross-lingual robustness of the Sketch Engine, as well as to propose a powerful tool for collocation extraction based on a large-scale corpus with minimal pre-processing, we constructed the Chinese Sketch Engine (CSE) by loading the Chinese Gigaword corpus into the Sketch Engine (Kilgarriff et al., 2005). The Chinese Gigaword contains about 1.12 billion Chinese characters, including 735 million characters from Taiwan's Central News Agency and 380 million characters from China's Xinhua News Agency. Before loading Chinese Gigaword into the Sketch Engine, all the simplified characters were converted into traditional characters, and the texts were segmented and POS tagged using the Academia Sinica segmentation and tagging system (Huang et al., 1997). An array of machines was used to process the 1.12 billion characters, which took over 3 days. All components of the Sketch Engine were implemented, including Concordance, Word Sketch, Thesaurus and Sketch Difference.</Paragraph>
      <Paragraph position="1"> In our initial in-house testing, this prototype of the Chinese Sketch Engine produces the expected results with an easy-to-use interface. For instance, the Chinese Word Sketch correctly shows that the most common and salient object of dai.bu 'to arrest' is xian.fan 'suspect'; the most common subject is jing.fang 'police'; and the most common modifier is dang.chang.</Paragraph>
      <Paragraph position="2"> The output data of Thesaurus correctly verify the following set of synonyms from the Chinese VerbNet Project: that ren.wei 'to think' behaves most like biao.shi 'to express, to state' (salience 0.451), while yi.wei 'to take somebody/something as' is more like jue.de 'to feel, think' (salience 0.488). The synonymous relation can be illustrated by (4) and</Paragraph>
      <Paragraph position="4"> ta ren.wei dao hai.wai tou.zi you yi ge guan.nian hen zhong.yao, jiu shi yao zhi.dao dang.di de you.xi gui.ze 'He believes that for those investing overseas, there is a very important principle: one must know the local rules of the game, and accept them.'</Paragraph>
      <Paragraph position="6"> 'The KMT also commented that due to the many controversies surrounding PTV, it could not</Paragraph>
      <Paragraph position="8"> he wen.fa, yao jiang.jiu mai.dian he shi.chang 'Ho Chia-chu says, &quot;Television has its own fundamental language and grammar. You must consider selling points and the market.&quot;'</Paragraph>
      <Paragraph position="10"> 'She says, &quot;I hope that followers of Buddhism can realize that a patriarchal society is incompatible with an enlightened society.&quot;' The above examples show that ren.wei and biao.shi can take both direct and indirect quotation. Yi.wei and jue.de, on the other hand, can only be used in reportage and cannot introduce direct quotation.</Paragraph>
      <Paragraph position="11"> Distinctions between near-synonymous pairs can be obtained from Sketch Difference. This function is verified against the results of Tsai et al.'s study of gao.xing 'glad' and kuai.le 'happy' (Tsai et al., 1998). Patterns specific to gao.xing 'glad' include the negative imperative bie 'don't'. It also has a dominant collocation with the potentiality complement marker de (e.g. ta gao.xing de you jiao you tiao 'she was so happy that she cried and danced'). In contrast, kuai.le 'happy' has a specific collocation with holiday nouns such as qiu.jie 'Autumn Festival'. The Sketch Difference result is consistent with the account that the gao.xing/kuai.le contrast is one of inchoative state vs. homogeneous state.</Paragraph>
    </Section>
  </Section>
</Paper>