<?xml version="1.0" standalone="yes"?> <Paper uid="W06-0207"> <Title>LoLo: A System based on Terminology for Multilingual Extraction</Title> <Section position="5" start_page="57" end_page="59" type="metho"> <SectionTitle> ALGORITHM: DISCOVER LOCAL GRAMMAR </SectionTitle> <Paragraph position="0">
1. SELECT a special language corpus (SL, comprising N_special tokens and vocabulary V_special).
   i. USE a frequency list of single words from a corpus of texts used in day-to-day communication (SG, comprising N_general tokens and vocabulary V_general) - for example, the British National Corpus for English: F_general := {f(w_1), f(w_2), f(w_3), ..., f(w_{V_general})}
   ii. CREATE a frequency-ordered list of the words in the SL texts: F_special := {f(w_1), f(w_2), f(w_3), ...}
   iii. COMPUTE the difference in the distribution of the same words in the two corpora, SG and SL: weirdness(w_i) = (f(w_i)_special / f(w_i)_general) × (N_general / N_special)
   iv. CALCULATE the z-score over F_special: z_f(w_i) = (f(w_i) - f_avg_special) / sigma_special
2. CREATE KEY, a set of N_key keywords ordered according to the magnitude of the two z-scores, KEY := {key_1, key_2, key_3, ..., key_{N_key}}, such that z(f_{key_i}) > 1 and z(weirdness_{key_i}) > 1.
   i. EXTRACT collocates of each key in SL over a window of M word neighbourhood.
   ii. COMPUTE the strength of collocation using the three measures due to Smadja (1994): U-score, k and z-score.
   iii. EXTRACT the sentences in the corpus that comprise highly collocating keywords ((U, k0, k1) >= (10, 1, 1)).
   iv. FORM the corpus SL':
      a. FOR each Sentence_i in SL':
      b. COMPUTE the frequency of every word in Sentence_i;
      c. REPLACE words with frequency less than a threshold value (f_threshold) by the place marker #;
      d. FOR more than one contiguous place marker, use *.
3. GENERATE trigrams in SL'; note the frequency of each trigram together with its position in the sentences:
   i. FIND all the longest possible contiguous trigrams across all sentences in SL' and note their frequency.
   ii. ORDER the (contiguous) trigrams according to frequency of occurrence.
   iii. (CONTIGUOUS) TRIGRAMS with frequency above a threshold form THE LOCAL GRAMMAR.
Briefly, given a specialist corpus (SL), keywords are identified and collocates of the keywords are extracted. Sentences containing key collocates are then used to construct a sub-corpus (SL'). The sub-corpus SL' is then analysed, and trigrams above a frequency threshold in the sub-corpus are extracted; the position of each trigram in the sentences is also noted. The sub-corpus is searched again for contiguous trigrams across the sentences: the sentences are analysed for the existence of the trigrams in the correct positions. If a trigram that is, for example, frequent in sentence-initial position is found to co-occur with another frequent trigram at the next position, the two trigrams are deemed to form a pattern. This process is continued until all the trigrams in the sentence have been matched against the significant trigrams.</Paragraph> <Paragraph position="1"> The local grammar then comprises the significant contiguous trigrams that are found. These domain-specific patterns, extracted from the specialist corpus SL (and its constituent sub-corpus SL'), are then used to extract similar patterns and information from a test corpus to validate the patterns found in the training corpus. Following is a demonstration of how the algorithm works using English and Arabic texts.</Paragraph>
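To make step 1 concrete, the sketch below scores candidate keywords by weirdness and by the two z-scores. It is a minimal reconstruction from the definitions above, assuming whitespace-tokenised corpora and a simple smoothing of words unseen in the general corpus; the function and variable names are illustrative and not taken from LoLo.

```python
from collections import Counter
from statistics import mean, pstdev

def keyword_candidates(special_tokens, general_tokens, z_threshold=1.0):
    """Rank special-language (SL) words by weirdness and frequency z-scores
    relative to a general-language (SG) corpus (Algorithm, steps 1-2)."""
    f_special = Counter(special_tokens)
    f_general = Counter(general_tokens)
    n_special, n_general = len(special_tokens), len(general_tokens)

    # weirdness(w) = (f_special(w) / f_general(w)) * (N_general / N_special)
    weirdness = {}
    for w, f_sl in f_special.items():
        f_sg = f_general.get(w, 0.5)   # smooth unseen words to avoid division by zero
        weirdness[w] = (f_sl / f_sg) * (n_general / n_special)

    def z(values):
        # z-score each value against the mean and standard deviation of the set
        mu, sigma = mean(values.values()), pstdev(values.values())
        return {w: (v - mu) / sigma if sigma else 0.0 for w, v in values.items()}

    z_freq, z_weird = z(dict(f_special)), z(weirdness)

    # KEY: words whose frequency and weirdness z-scores both exceed the threshold
    keys = [w for w in f_special if z_freq[w] > z_threshold and z_weird[w] > z_threshold]
    return sorted(keys, key=lambda w: z_freq[w], reverse=True)
```

On corpora such as those described in the next section, a call like keyword_candidates(reuters_tokens, bnc_tokens) would be expected to rank percent near the top, in line with the weirdness and z-score figures reported below.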
<Section position="1" start_page="58" end_page="59" type="sub_section"> <SectionTitle> 2.1 Extracting Patterns in English </SectionTitle> <Paragraph position="0"> We present an analysis of a corpus of financial news wire texts: 1,204 news reports produced by Reuters UK Financial News, comprising 431,850 tokens. One of the frequent words in the corpus is percent, with 3,622 occurrences, a relative frequency of 0.0084 (0.84%). When the frequency of this keyword is looked up in the British National Corpus (100 million words), percent turns out to be 287 times more frequent in the financial corpus than in the British National Corpus; this ratio is sometimes termed the weirdness (of the special language). The weirdness of grammatical words such as the and to is unity, as these tokens are distributed with the same (relative) frequency in Reuters Financial and the BNC. The z-score computed from the frequency of the token in Reuters Financial is 12.64: the distribution of percent is more than 12 standard deviations above the mean of all words in the financial corpus. (The z-score computed for weirdness is positive as well.) The heuristic here is this: a token is a candidate keyword if both its z-scores are greater than a small positive number. So percent, the most frequent token with frequency and weirdness z-scores above zero, was accepted as a keyword.</Paragraph> <Paragraph position="1"> The collocates of the keyword percent were then extracted using the mutual information statistics presented by Smadja (1994). A collocate in this terminology can occur anywhere within a vicinity of +/- N words. The frequency at each neighbourhood position is calculated and then used to compute the 'peaks' in the histogram formed by the neighbourhood frequencies; the strength of the collocation is calculated on a similar basis.</Paragraph> <Paragraph position="2"> A keyword generally collocates with certain words that have frequencies higher than its own - the upward collocates - and with certain words that have lower frequency - the downward collocates (these terms were coined by John Sinclair). Upward collocates are usually grammatical words, and downward collocates are lexical words (nouns, adjectives); hence the downward collocates are treated as candidate compound words. There were 46 collocates of percent in our corpus: 34 downward collocates and 12 upward collocates. A selection of 5 downward and 5 upward collocates is shown in Tables 1 and 2 respectively (collocates of percent in a corpus of 431,850 words).</Paragraph> <Paragraph position="3"> The financial texts comprise a large number of numerals (integers and decimals), which we denote as <no>. The numerals collocate strongly with percent for obvious reasons. The collocates are then used to extract trigrams comprising the collocates that occur at particular positions in the various sentences of our corpus. There are many other frequent patterns in which the frequency of individual tokens is quite low but at least one member of the trigram has a higher frequency: such low-frequency tokens are omitted and marked by the symbol #. All the trigrams containing such tokens together with at least two others are used to extract other significant trigrams. Sometimes more than one low-frequency token precedes or succeeds the high-frequency tokens; these runs are denoted by the symbol *, as shown in Table 4.</Paragraph>
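The masking and trigram steps just described (2.iv and 3) can be sketched as follows. This is a reconstruction under the assumption that "the next position" means the trigram starting one token later; the thresholds and names are illustrative and not LoLo's.

```python
from collections import Counter
from itertools import groupby

def build_sl_prime(sentences, f_threshold=5):
    """Mask low-frequency words with '#' and collapse runs of '#' to '*' (step 2.iv)."""
    freq = Counter(w for s in sentences for w in s)
    masked = []
    for s in sentences:
        tokens = ["#" if freq[w] < f_threshold else w for w in s]
        collapsed = []
        for tok, run in groupby(tokens):
            n = len(list(run))
            if tok == "#":
                collapsed.append("*" if n > 1 else "#")   # run of markers becomes '*'
            else:
                collapsed.extend([tok] * n)
        masked.append(collapsed)
    return masked

def positional_trigrams(masked_sentences):
    """Count each trigram together with its position in the sentence (step 3)."""
    counts = Counter()
    for sent in masked_sentences:
        for pos in range(len(sent) - 2):
            counts[(tuple(sent[pos:pos + 3]), pos)] += 1
    return counts

def chain_contiguous(masked_sentences, counts, trigram_threshold=10):
    """Grow longer patterns by chaining significant trigrams found at successive
    positions in the same sentence."""
    significant = {k for k, c in counts.items() if c >= trigram_threshold}
    patterns = Counter()
    for sent in masked_sentences:
        pos = 0
        while pos <= len(sent) - 3:
            if (tuple(sent[pos:pos + 3]), pos) not in significant:
                pos += 1
                continue
            end = pos + 3
            # extend while the trigram starting one token later is also significant
            while end < len(sent) and (tuple(sent[end - 2:end + 1]), end - 2) in significant:
                end += 1
            patterns[tuple(sent[pos:end])] += 1
            pos = end
    return patterns
```

If sentence-boundary markers such as <s> and </s> are included as tokens, a pattern like <s> the * <no> percent (the most frequent Training1 pattern reported in Section 3.2) would be the kind of output expected from chain_contiguous.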
<Paragraph position="5"> The search for contiguous trigrams leads to larger and more complex patterns; Table 5 provides some examples. (Table 5: patterns containing low-frequency words, denoted by * for multiple tokens and # for a single token.)</Paragraph> </Section> <Section position="2" start_page="59" end_page="59" type="sub_section"> <SectionTitle> 2.2 Extracting Patterns in Arabic </SectionTitle> <Paragraph position="0"> Arabic is written from right to left and its writing system does not employ capitalisation. The language is highly inflected compared with English; words are generated using a root-and-pattern morphology. Prefixes and suffixes can be attached to the morphological patterns for grammatical purposes; for example, the conjunction &quot;and&quot; in Arabic is attached to the beginning of the following word. Words are also sensitive to the gender and number they refer to, and their lexical structure changes accordingly.</Paragraph> <Paragraph position="1"> As a result, more word types are found in Arabic corpora than in English corpora of the same size and type. Short vowels, which are represented as diacritical marks, are also omitted from most Arabic texts, so some words share the same lexical structure but have different semantics.</Paragraph> <Paragraph position="2"> These grammatical and lexical features of Arabic cause additional complexity and ambiguity, especially for NLP systems designed for thorough processing of Arabic texts compared with English. A shallow, statistical approach to IE using texts of a specialism can be useful in abstracting away many of the complexities of Arabic texts.</Paragraph> <Paragraph position="3"> Given a 431,563-word corpus comprising 2,559 texts of Reuters Arabic Financial News, and the same thresholds we used with the English corpus, percent (al-meaa) is again the most frequent term with frequency and weirdness z-scores greater than zero. It has 3,125 occurrences (a relative frequency of 0.0072), a frequency z-score of 19.03 and a weirdness of 76 computed against our Modern Standard Arabic Corpus (MSAC).</Paragraph> <Paragraph position="4"> There were 31 collocates of percent: 7 upward and 23 downward. The downward collocates of percent appear to be names of instruments, i.e. shares and indices (Table 6).</Paragraph> <Paragraph position="5"> The upward collocates are the so-called closed-class words, as in English, such as in and on.</Paragraph> <Paragraph position="6"> Using the same thresholds, the trigrams (Table 8) appear to differ from the English trigrams in that the words of movement are not included here. This is because Arabic has a richer morphological system than English and Financial Arabic is not as standardised as Financial English; however, it would not be difficult to train the system to recognise the variants of rose and fell in Financial Arabic. Table 9 lists some of the patterns.</Paragraph> </Section> </Section> <Section position="6" start_page="59" end_page="59" type="metho"> <SectionTitle> 3 Experimental Results </SectionTitle> <Paragraph position="0"> We have argued that a method focused on frequency at the lexical level(s) of linguistic description - single words, compounds and N-grams - will perhaps lead to patterns that are idiosyncratic of a specialist domain without recourse to a thesaurus.
There are a number of linguistic methods, focusing on the syntactic and semantic levels of description, which might be of equal or better use.</Paragraph> <Paragraph position="1"> In order to show the effectiveness of our method we apply it to sentiment analysis: an analysis that attempts to extract the qualitative opinion expressed about a range of human and natural artefacts - films, cars and financial instruments, for instance. Broadly speaking, sentiments in financial markets relate to the 'rise' and 'fall' of financial instruments (shares, currencies, commodities and energy prices); inextricably, these sentiments relate to changes in the prices of the instruments. In both English and Arabic, we have found that percent (or its equivalent) is a keyword, and that trigrams and longer N-grams embedding this keyword relate to metaphorical movement words: up, down, rise, fall. However, in English this association is further contextualised with other keywords (shares, stocks), while in Arabic the contextualisation is with shares and with the principal commodity of many Arab states' economies - oil. Our system 'discovered' both by following a lexical level of linguistic description. For each of the two languages of interest to us, we have created corpora of 1.72 million tokens.</Paragraph> <Paragraph position="2"> Each corpus was then divided into two (roughly) equal-sized sub-corpora: a training corpus and a testing corpus; the testing corpus is further sub-divided into two testing corpora, Test1 and Test2 (Table 10).</Paragraph> <Paragraph position="3"> First, we extract patterns from the Training corpus using the discover local grammar algorithm (Figure 1), and also from Test1. Next, the Training1 and Test1 corpora are merged and patterns are extracted from the merged corpus. Our intuition is that as the size of the corpus increases, the patterns extracted from the smaller corpus will be elaborated: some of the patterns that are idiosyncratic of the smaller corpus will become statistically insignificant and hence will be ignored. The conventional way of testing would be to see how many patterns discovered in the training corpus are found in the testing corpora; we are currently quantifying these results. In the following we describe an initial test of our method, after introducing LoLo.</Paragraph>
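One way of quantifying this conventional test - counting how many training-corpus patterns recur at least once in a test corpus - is sketched below. The sketch assumes that patterns and sentences are token lists in which numerals have already been replaced by <no>, and it treats # as a single-token slot and * as a multi-token slot; the names are illustrative, not LoLo's.

```python
def matches_at(pattern, tokens, i):
    """Match a pattern with '#' (one token) and '*' (one or more tokens) slots
    against the token list starting at position i."""
    p, t = 0, i
    while p < len(pattern):
        if pattern[p] == "*":
            # let '*' absorb one or more tokens, then try the rest of the pattern
            return any(matches_at(pattern[p + 1:], tokens, j)
                       for j in range(t + 1, len(tokens) + 1))
        if t >= len(tokens):
            return False
        if pattern[p] == "#" or pattern[p] == tokens[t]:
            p, t = p + 1, t + 1
        else:
            return False
    return True

def pattern_coverage(patterns, test_sentences):
    """Count how many training patterns occur at least once in the test corpus."""
    covered = 0
    for pat in patterns:
        if any(matches_at(pat, sent, i)
               for sent in test_sentences
               for i in range(len(sent))):
            covered += 1
    return covered, len(patterns)
```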
<Section position="1" start_page="59" end_page="59" type="sub_section"> <SectionTitle> 3.1 LoLo </SectionTitle> <Paragraph position="0"> LoLo (which stands for Local-Grammar for Learning Terminology, and means 'pearl' in Arabic) is developed on the .NET platform. It contains four components, summarised in Table 11. Table 11. LoLo components and their functionality: CORPUS ANALYSER - discover domain-specific extraction patterns; RULES EDITOR - group, label and evaluate patterns and slots; INFORMATION EXTRACTOR - extract information; INFORMATION VISUALISER - visualise patterns over time.</Paragraph> <Paragraph position="1"> The various components of LoLo - the Analyser, Editor, Extractor and Visualiser - can be used to extract and present patterns; the system has utilities to change the script and the direction of writing (Arabic is right-to-left and English left-to-right).</Paragraph> <Paragraph position="2"> Table 12 is an exemplar output from LoLo: &quot;rise in profit&quot; event patterns expressed similarly in English and Arabic financial news headlines, found by the Corpus Analyser.</Paragraph> <Paragraph position="3"> Table 12. &quot;Rise in profit&quot; patterns in Arabic and English, where the * usually comprises names of organisations or enterprises. English: * profit up <no> percent. Arabic: the corresponding right-to-left pattern with the same <no> and * slots, glossed word for word as percent / in / profit / rise (up).</Paragraph> <Paragraph position="4"> The pattern acquisition algorithm presented earlier is implemented in the Corpus Analyser component, which is the focus of this paper. It can be used to discover frequent patterns in corpora. The user has the option to filter out smaller patterns contained in larger ones and to mine for interrupted or non-interrupted patterns. It can also distinguish between single-word and multi-word slots.</Paragraph> <Paragraph position="5"> Before mining for patterns, a corpus pre-processor routine performs a few operations to improve pattern discovery. It identifies any punctuation marks attached to words and separates them. It also identifies sentence boundaries and converts all numerical tokens to the single tag &quot;<no>&quot;, since numbers can be part of some patterns, especially in the domain of financial news.</Paragraph> <Paragraph position="6"> The Rules Editor is at an initial stage of development; currently it can export the extraction patterns discovered by the Corpus Analyser as regular expressions.</Paragraph> <Paragraph position="7"> A time-stamped corpus can be visualised using the Information Visualiser. The Visualiser can display a time-series showing how the extracted events emerge, repeat and fade over time in relation to other events or to imported time-series, e.g. of financial instruments. This can be useful for analysing relations between different events or for detecting trends in one or more corpora or against other time-series.</Paragraph> <Paragraph position="8"> LoLo also facilitates other corpus and computational linguistics tasks, including generating concordances and finding collocations in texts encoded in UTF-8. This is particularly useful for Arabic and for languages that use the Arabic writing system, such as Persian and Urdu, which lack such resources.</Paragraph> </Section>
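A minimal sketch of two of the operations described above - the pre-processor's punctuation, sentence-boundary and numeral handling, and the Rules Editor's export of a pattern as a regular expression - is given below. The regular-expression conventions chosen for the #, * and <no> slots are our assumptions, not necessarily LoLo's.

```python
import re

NUM = r"\d+(?:[.,]\d+)*"

def preprocess(text):
    """Separate punctuation attached to words, split sentences, and replace
    numerical tokens with the tag '<no>' (the pre-processor's operations)."""
    # detach punctuation that is not part of a number (e.g. 'percent.' -> 'percent .')
    text = re.sub(r"(?<![\d\s])([.,;:!?()\"'])", r" \1 ", text)
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())   # crude sentence boundaries
    return [["<no>" if re.fullmatch(NUM, tok) else tok for tok in s.split()]
            for s in sentences]

def pattern_to_regex(pattern_tokens):
    """Export a discovered pattern as a regular expression, in the spirit of the
    Rules Editor: '#' matches one token, '*' one or more, '<no>' a numeral."""
    parts = []
    for tok in pattern_tokens:
        if tok == "#":
            parts.append(r"\S+")
        elif tok == "*":
            parts.append(r"\S+(?:\s+\S+)*")
        elif tok == "<no>":
            parts.append(NUM)
        elif tok in ("<s>", "</s>"):
            continue                     # sentence-boundary markers carry no text
        else:
            parts.append(re.escape(tok))
    return re.compile(r"\s+".join(parts), re.IGNORECASE)

# e.g. pattern_to_regex("<s> the * was up <no> percent at <no>".split())
# matches a headline such as "The FTSE index was up 1.3 percent at 5034.2".
```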
<Section position="2" start_page="61" end_page="62" type="sub_section"> <SectionTitle> 3.2 Training and Testing </SectionTitle> <Paragraph position="0"> 3.2.1 English. We consider the English Training1 corpus first. We extracted the significant collocates of all the high-frequency/high-weirdness words in the training corpus, where 'high' is defined using the associated z-scores. Trigrams were then extracted, the high-frequency trigrams were chosen, and all sentences comprising those trigrams were used to form a (training) sub-corpus. The sub-corpus was then analysed to extract the local grammar.</Paragraph> <Paragraph position="1"> The 10 high-frequency N-grams extracted automatically from the Training1 corpus (861,492 tokens) are listed in Table 13. The Test1 corpus contains most of the trigrams found in the Training1 corpus; in particular, some of the larger N-grams of the Test1 corpus are found as sub-patterns in Training1. We then merged the Training1 and Test1 corpora and created the Training2 corpus, comprising 3,612 texts and 1,293,342 tokens.</Paragraph> <Paragraph position="2"> The Algorithm was executed on the merged corpus and a new set of patterns was extracted; in particular, the most frequent pattern in the Training1 corpus, <s> the * <no> percent, was elaborated by the Algorithm, as were the patterns shown in Table 15, for example <s> the * was up <no> percent at <no> , <no> </s> (frequency 33) and <s> the * index was up <no> ... The patterns related to the collocations of shares and percent from Training1 were preserved in Training2. The test on the Test2 corpus showed similar results: the smaller N-grams related to the movement of instruments were similar to those of the Test1 corpus. The analysis of the Arabic texts, shown below, gives similar results.</Paragraph> <Paragraph position="3"> Some of the frequent N-grams extracted automatically from the Training1 Arabic corpus (860,020 tokens) are shown in Table 16. As with the English corpora, the Test1 Arabic corpus contains most of the trigrams of the Training1 corpus and some larger N-grams (Table 17).</Paragraph> <Paragraph position="4"> After merging the Training1 and Test1 Arabic corpora into a corpus of 7,677 texts and 1,293,342 tokens, a new set of patterns was extracted as well. Some of the frequent patterns in the training corpus were elaborated further, like the pattern shown in Table 18, where the token and-rise (wa-ertifaa) was added to the pattern.</Paragraph> </Section> </Section> </Paper>