<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-2112">
  <Title>R{j}ecnik.com: English--Serbo-Croatian Electronic Dictionary</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Project Description
</SectionTitle>
    <Paragraph position="0"> Project history. The on-line dictionary R{j}ecnik.com has been active since 1999. One of its most visible characteristics, also noted by its users, is the simplicity of the user interface.</Paragraph>
    <Paragraph position="1"> There is a single text search field in which the user enters the query, and the dictionary reports all entries matching the query on either the English or the SC side. The search mechanism is efficient, returning results within a second.</Paragraph>
    <Paragraph position="2"> Lexical resources. As a lexicographic resource, this is a wide-coverage, up-to-date, bidirectional, bilingual dictionary covering not only general, frequently used terms, but also over 8,000 computer and Internet terms (footnote 8), as well as healthcare and medical vocabulary, including useful abbreviations. The entries are grouped by meaning and part of speech, in the WordNet fashion. The English lexemes are associated with their phonetic representations, and the entries are marked by domain of usage (e.g., computers, business, finance, medicine).</Paragraph>
    <Paragraph position="3"> Colloquial and informal expressions are marked with special symbols so that they can be easily identified. In addition, the dictionary contains plenty of illustrative examples showing the language in use. A suitable text encoding for SC is used so that the software generates both Latin (Roman) and Cyrillic script versions. Dialectal and geographical differences are also marked.</Paragraph>
    <Paragraph position="4"> Footnote 8: [...] through public discussion at the e-mail list Serbian Terminology, maintained by Danko Sipka (http://main.amu.edu.pl/mailman/listinfo/st-l).</Paragraph>
    <Paragraph position="5"> Software overview. The dictionary software is written in the Perl programming language.</Paragraph>
    <Paragraph position="6"> The searchable on-line resource file is generated from the source dictionary file. It is a text file, indexed through an inverted file index over the searchable terms in English and SC. The searchable terms are chosen selectively: tags and descriptions are not searchable, since indexing them would produce spurious search results.</Paragraph>
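    <Paragraph> The inverted file index described above can be illustrated with a minimal Python sketch. It is not the dictionary's Perl implementation: the function names, the in-memory layout, and the sample entries are all hypothetical, and the choice to index both whole lexemes and their individual words is an assumption. Tags and descriptions are assumed to have been stripped before indexing, so they never become searchable.

```python
from collections import defaultdict


def build_inverted_index(entries):
    """Map each searchable term to the set of entry numbers containing it.

    `entries` is a list of (english_lexemes, sc_lexemes) pairs. Tags and
    descriptions are assumed to be stripped already (hypothetical layout).
    """
    index = defaultdict(set)
    for entry_no, (english, sc) in enumerate(entries):
        for lexeme in list(english) + list(sc):
            # Assumption: index the whole lexeme and each of its words.
            for term in [lexeme] + lexeme.split():
                index[term.lower()].add(entry_no)
    return index


def search(index, query):
    """Return sorted entry numbers matching the query on either side."""
    return sorted(index.get(query.strip().lower(), set()))


# Hypothetical sample data, not taken from the dictionary itself.
entries = [
    (["milk"], ["mleko", "mlijeko"]),
    (["example", "instance"], ["primer", "primjer"]),
    (["milk shake"], ["frape"]),
]
index = build_inverted_index(entries)
print(search(index, "milk"))     # [0, 2]
print(search(index, "primjer"))  # [1]
```

    A single lookup table serves both directions, which mirrors the single search field that matches a query on either the English or the SC side.</Paragraph>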
    <Paragraph position="7"> Dictionary structure. Following the ideas from the OED (Tompa and Gonnet, 1999), we adopted the philosophy of modern text markup systems that &quot;a computer-processable version of text is well-represented by interleaving 'tags' with the text of the original document, still leaving the original words in proper sequence.&quot; Additionally, we adopted the ideas from the WordNet project (Miller, 2004) in structuring our knowledge base around the basic entry unit being a meaning; i.e., one meaning = one entry. One source dictionary entry (as opposed to a printed or on-line dictionary entry) corresponds to one synset in WordNet. It is represented in one physical line of a text file, or it may be stored in several lines, each of which except the last ends with a backslash (\) character. An entry starts with the English lexemes separated by commas, followed by an equal sign (=) and the corresponding SC lexemes, also separated by commas. Additional pertinent information is encoded using tags.</Paragraph>
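    <Paragraph> As a rough illustration of this line format, the following Python sketch joins backslash-continued lines and splits an entry at the equal sign. It is only a sketch under stated assumptions: the function names and the sample entries are hypothetical, tag handling is omitted entirely because the tag syntax is not detailed at this point, and surrounding whitespace is assumed to be insignificant.

```python
def read_logical_lines(text):
    """Join physical lines ending in a backslash into one logical line."""
    logical, pending = [], ""
    for line in text.splitlines():
        if line.endswith("\\"):
            # Continuation: drop the backslash and keep accumulating.
            pending += line[:-1]
            continue
        pending += line
        if pending.strip():
            logical.append(pending)
        pending = ""
    if pending.strip():
        logical.append(pending)
    return logical


def split_lexemes(side):
    """Split one side of an entry on commas, trimming whitespace."""
    return [lex.strip() for lex in side.split(",") if lex.strip()]


def parse_entry(line):
    """Split 'english, ... = sc, ...' into two lists of lexemes."""
    english, sc = line.split("=", 1)
    return split_lexemes(english), split_lexemes(sc)


# Hypothetical tag-free sample entries; the second spans two physical lines.
source = "\n".join([
    "milk = mleko, mlijeko",
    "example, \\",
    "instance = primer, primjer",
])
for line in read_logical_lines(source):
    print(parse_entry(line))
```

    Keeping one meaning per logical line means a plain line-oriented reader such as this one is enough to recover the entry units before any tag processing.</Paragraph>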
    <Paragraph position="8"> This representation is conceptually simple and efficient in terms of manual maintenance and memory use. It is also flexible, since it allows tags to define features that refer either to the whole entry or to individual lexemes. The representation deviates from the commonly used XML notation because we find XML to be more &quot;machine-friendly&quot; than user-friendly, but it can be automatically converted to XML. To illustrate the difference between TEI (Sperberg-McQueen and Burnard, 2003), the standard XML-based markup scheme, and our markup scheme, we adopt an example from Erjavec (1999), which is shown in Fig. 1.</Paragraph>
    <Paragraph position="9"> Part (A) of Fig. 1 shows an entry with TEI markup; part (B) gives the corresponding entry in our scheme. Tags are preceded by a colon (:).</Paragraph>
    <Paragraph position="10"> English lexemes are associated with their phonetic representations, given within square brackets. The phonetic representation is encoded using the vfon encoding (footnote 9). All changes to the dictionary can be easily tracked using the key :id tag and the standard CVS (Concurrent Versions System) system. The ipp encoding is used to encode SC text fragments, since they include additional letters beyond the standard 7-bit ASCII set. The on-line version of the dictionary is encoded using the dual1 encoding for simplicity and efficiency. The input query can be entered using the ipp encoding, and it is translated into the dual1 encoding before matching. The krascii encoding (footnote 10) is additionally accepted in the input query as the most common transcribing scheme, although it inherently leads to some incorrect matches.</Paragraph>
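    <Paragraph> The reason a diacritic-free transcription such as krascii leads to incorrect matches can be shown with a short Python sketch. This is not the dictionary's own conversion code: the function name is hypothetical, the mapping of the barred d to a plain d is an assumption (it has no Unicode decomposition, and other conventions exist), and the word pair is an illustrative choice.

```python
import unicodedata

# Assumption: the barred d (U+0111, U+0110) is mapped by hand to d / D.
EXTRA = {"\u0111": "d", "\u0110": "D"}


def strip_diacritics(text):
    """Remove diacritics, as a stand-in for a krascii-style transcription."""
    replaced = "".join(EXTRA.get(ch, ch) for ch in text)
    decomposed = unicodedata.normalize("NFD", replaced)
    # Drop combining marks such as the caron and the acute accent.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# "ku\u0107a" (house) and "kuca" are different words, yet both reduce to
# the same diacritic-free string, so a query cannot tell them apart.
print(strip_diacritics("ku\u0107a"))  # kuca
print(strip_diacritics("kuca"))       # kuca
```

    Because the mapping is many-to-one, matching on the stripped form trades precision for convenience: users without an SC keyboard layout can still search, at the cost of some spurious hits.</Paragraph>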
    <Paragraph position="11"> A very systematic variation in SC is the ekavian vs. ijekavian dialect; for example: mleko/mlijeko (milk) and primeri/primjeri (examples), but also hteo/htio (wanted). The text is converted via the following regular expression substitutions for ekavian and ijekavian, respectively: s/\{(([^\|\}]*)\|)?([^\}]*)\}/$2/g and s/\{(([^\|\}]*)\|)?([^\}]*)\}/$3/g. The list of tags used in the dictionary is given in Fig. 2.</Paragraph>
    <Paragraph position="12"> Footnote 9: The details about the different encodings, such as ipp, vfon, and dual1, are provided in (Kešelj et al., 2004). Footnote 10: Krascii is a simple transcribing scheme that ignores diacritics.</Paragraph>
    <Paragraph position="13"> Fig. 2. POS tags: noun (n), verb (v), adjective (a), adverb (adv), article (art), preposition (prep), conjunction (conj), interjection (interj), pronoun (pron), numeral (num), noun phrase (np), verb phrase (vp), symbol or special character (sym), and idiom (idiom).</Paragraph>
    <Paragraph position="14"> Morpho-syntactic features: diminutive (dim), feminine (fm), imperfective (ipf), intransitive (itv), masculine (m), neuter (nt), past participle (pp), perfective (pf), plural (pl), preterite or past tense (pret), singular (sl), and transitive (tv).</Paragraph>
    <Paragraph position="15"> Dialect tags: American (am), Bosnian (bos), British (br), Croatian (hr), Serbian (sr), and Old Slavic (ssl). Domain tags: agriculture (agr), archaeological (archl), architecture (archt), biology (bio), botany (bot), computer (c), diplomacy (dipl), electrical (elect), chemistry (chem), culinary (cul), law (law), linguistic (ling), mathematics (mat), medicine (med), military (mil), mythology (myt), music (mus), religion (rel), sports (sp), and zoology (zoo).</Paragraph>
    <Paragraph position="16"> Computer science subareas, cob tag (e.g., cob pl): internet (int), programming languages (pl), computational linguistics (cl), graph theory (gt), cryptography (crypt), data structures (ds), formal languages (fl), computer networks (cn), information retrieval (ir), and object oriented programming (oop).</Paragraph>
    <Paragraph position="17"> Misc.: abbreviation (abb), abbreviation expansion (abbE), colloquial (coll), description (desc), example [...]</Paragraph>
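    <Paragraph> The two Perl substitutions given for ekavian and ijekavian can be transcribed into Python to see how they behave. In this sketch the function names are hypothetical, and the braced sample spellings are assumptions about how a source entry might encode the variation; only the pattern itself is taken from the text. Group 2 captures the part before the vertical bar, and group 3 the part after it, or the whole braced text when no bar is present.

```python
import re

# Python form of the pattern used in both Perl substitutions.
VARIANT = re.compile(r"\{(([^|}]*)\|)?([^}]*)\}")


def to_ekavian(text):
    """Keep group 2 (the first alternative); empty when there is no bar."""
    return VARIANT.sub(lambda m: m.group(2) or "", text)


def to_ijekavian(text):
    """Keep group 3 (the second alternative, or the only braced text)."""
    return VARIANT.sub(lambda m: m.group(3), text)


# Assumed source spellings for the examples named in the text.
for word in ["ml{e|ije}ko", "prim{e|je}ri", "ht{eo|io}", "r{j}ecnik"]:
    print(to_ekavian(word), to_ijekavian(word))
# mleko mlijeko
# primeri primjeri
# hteo htio
# recnik rjecnik
```

    The optional first group is what makes a bar-free form such as {j} work: the ekavian output simply omits the braced text, which is consistent with the spelling R{j}ecnik used in the title.</Paragraph>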
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Dictionary and Usage Statistics
</SectionTitle>
    <Paragraph position="0"> The dictionary has been on-line for five years (since 22-Jul-99). As of 28-Apr-2004, it has 60,338 lexemes, organized in 20,911 entries. The average system response time is 0.4 sec. Some site statistics are given in Fig. 3. The interface is intended only for short queries, but long queries are also submitted in the hope that the system will perform machine translation. As can be seen from the figure, the longest submitted query had a length of 4,958 bytes.</Paragraph>
    <Paragraph position="1"> Still, the vast majority of queries are shorter than 100 bytes: the share of submitted queries longer than 100 bytes was 0.03% in 1999, 0.05% in 2000 and 2001, 0.14% in 2002, 0.27% in 2003, and 0.12% in 2004. The distribution of query lengths below 30 bytes is given in Fig. 4. The most frequently submitted queries are given in Fig. 5.</Paragraph>
  </Section>
</Paper>