<?xml version="1.0" standalone="yes"?> <Paper uid="J92-3004"> <Title>Technical Correspondence Automatic Clustering of Languages</Title> <Section position="6" start_page="341" end_page="342" type="metho"> <SectionTitle> SL GERMANIC ROMANCE INDIC OTHERS </SectionTitle> <Paragraph position="0"> Vladimir Batagelj et al. Automatic Clustering of Languages The clusters we found with cluster analysis are very close to the language families established in linguistics (Kruskal, Dyen, and Black 1971).</Paragraph> <Paragraph position="1"> Several questions naturally arise that can only be answered by a large-scale project.</Paragraph> <Paragraph position="2"> 1. In our analysis all words have equal weight. The similarity measure between two languages could instead be defined so that different weights (based on linguistic theory) are given to the words and/or transformations.</Paragraph> <Paragraph position="3"> 2. How much does the choice of words influence the final tree structure? In our analysis English belongs to the Germanic cluster, although we know that it also has a strong Romance component.</Paragraph> <Paragraph position="4"> 3. A larger number of words would give a more accurate picture. The question is how much, and in what way, the results would vary if we increased the number of words. 4. How much would the results differ if we studied spoken language instead of written language? We could consider, for example, some phonetic properties of written letters or strings of letters.</Paragraph> <Paragraph position="5"> 5. Any choice of transliteration introduces a &quot;systematic error&quot; in the results. One way of eliminating such an error would be to test for recurring patterns and then not penalize patterns that occur often. 
For example, if we find that &quot;tch&quot; ~ &quot;zh&quot; occurs very often, then we would count it only once rather than every time it occurs.</Paragraph> <Paragraph position="6"> Of course, such a precise analysis requires much better knowledge of linguistics than we have as laypersons.</Paragraph> </Section> </Paper>