<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1004">
  <Title>The LexALP Information System: Term Bank and Corpus for Multilingual Legal Terminology Consolidated</Title>
  <Section position="4" start_page="1" end_page="25" type="metho">
    <SectionTitle>
2 Multilingual legal information system
</SectionTitle>
    <Paragraph position="0"> The information system for the terminology of the Alpine Convention, with a specific focus on spatial planning and sustainable development, will make it possible to search for relevant terms and their (harmonised or rejected) translations in all four official languages of the Alpine Convention in the first module, the term bank. In addition to retrieving synonyms and translation equivalents within each legal system, the user will be provided with a representative context and a valid definition of the concept under consideration. Source information will be provided for each text field in the terminological entry.</Paragraph>
    <Paragraph position="1"> Via a link from the terminological database to the second module, the corpus facility, the information system will make it possible to search the corpus for further contexts.</Paragraph>
    <Paragraph position="2"> Finally, both term bank and corpus will interact with a third module, the bibliographic database, so as to allow retrieval of full information on text excerpts cited in the term bank and to store important metadata on corpus documents.</Paragraph>
  </Section>
  <Section position="5" start_page="25" end_page="26" type="metho">
    <SectionTitle>
3 Terminological data
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="25" end_page="25" type="sub_section">
      <SectionTitle>
3.1 Data categories and motivations
</SectionTitle>
      <Paragraph position="0"> The data categories present in the terminology database allow entering and organising relevant information on the concept under analysis. The term bank interface allows the following terminological data categories to be entered: denomination/term, definition, context, note, sources (text fields), grammatical information on the term, harmonisation status, processing status, geographical usage, frequency and domain, according to the purpose-built domain classification structure (pull-down menus). Again by means of pull-down menus, the terminologist will be able to signal to users which terms have already been processed (i.e. checked by legal experts), which are harmonised or rejected and, most importantly, to which legal system they belong (the geographical usage menu specifies this information). Furthermore, it is possible to specify synonyms, short forms, abbreviations etc. in the terminological entry and, if necessary, link them to the corresponding full information already present in the term bank (however, these linked data cannot be accessed directly; this must be done via the search interface). Finally, the terminologist can write general comments on the entry. At the very end of a language entry, the terminologist can decide whether to release the data to the public (by clicking the 'finish' button) or keep it for further fine-tuning ('update' button).</Paragraph>
      <Paragraph position="1"> Each term is created in its 'language volume' and described by means of all necessary information. As soon as one or all equivalents in the other languages are available too, the single entries can be linked to each other with the help of an axie (see detailed description below).</Paragraph>
      <Paragraph position="2"> Searches can be run over all languages or over a user-defined selection of source and target languages. At present the database allows global searching in all text fields and filtering by source, author, date of creation, as well as by axie name and ids. Results can be displayed in full form, as a short list of terms only or in XML. Some export/import functions are provided.</Paragraph>
      <Paragraph position="3"> As the term bank mainly serves to disseminate harmonised terminology, the four translation equivalents (validated by a group of experts) are displayed together, whereas rejected synonyms are displayed separately for each search language. In this way the user may well look for a non-validated synonym and find it in the database, but will be warned as to which is the preferred term and which are its harmonised equivalents in the other languages. Figure 1 shows such a situation, where the French rejected term &quot;transport intra-alpin&quot; is linked to the harmonised term. (Footnote: See also 4.1.)</Paragraph>
    </Section>
    <Section position="2" start_page="25" end_page="26" type="sub_section">
      <SectionTitle>
3.2 Monolingual data
</SectionTitle>
      <Paragraph position="0"> The LexALP term bank consists of five volumes for French, German, Italian, Slovene and English (no data is being entered for this fifth language at the moment), which contain the term descriptions. The set of data categories is represented in an XML structure that follows a common schema.</Paragraph>
      <Paragraph position="1"> Each entry represents a unique term/meaning. Terms with the same denomination but belonging to different legal systems have, de facto, different meanings. Hence, different entries are created. Terms with different denominations but conveying the same 'meaning' (concept) are also represented using different entries. In this case, the entries are linked through a synonymy relation. Figure 2 shows the XML structure of the French term &quot;trafic intra-alpin&quot;, as defined in the Alpine Convention. The term entry is associated with a unique identifier used to establish relations between volume entries.</Paragraph>
      <Paragraph position="2"> The example term belongs to the Alpine Convention legal system (code AC). The entry also carries information on its status (harmonised or rejected) and its processing status (to be processed, provisionally processed or finalised). In addition, a definition (along with its source) and a context may be given. The definition and context should be extracted from a legal text, which must be identified in the source field.</Paragraph>
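The data categories listed for a volume entry can be pictured as a small record type. The following Python sketch is only an illustration under stated assumptions: the class name, field names, language code and identifier format are hypothetical and do not reproduce the project's XML schema (Figure 2 is not available here). Only the status values and the rule that a definition or context needs a source come from the text.

```python
from dataclasses import dataclass
from typing import List, Optional

# Allowed values are taken from the text; the constant names are assumptions.
HARMONISATION_STATUSES = {"harmonised", "rejected"}
PROCESSING_STATUSES = {"to be processed", "provisionally processed", "finalised"}


@dataclass
class TermEntry:
    entry_id: str                 # unique identifier used to link volume entries
    language: str                 # language volume the entry lives in
    term: str
    legal_system: str             # e.g. "AC" for the Alpine Convention
    harmonisation_status: str
    processing_status: str
    definition: Optional[str] = None
    definition_source: Optional[str] = None
    context: Optional[str] = None
    context_source: Optional[str] = None

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the entry is consistent."""
        problems = []
        if self.harmonisation_status not in HARMONISATION_STATUSES:
            problems.append("unknown harmonisation status")
        if self.processing_status not in PROCESSING_STATUSES:
            problems.append("unknown processing status")
        # A definition or context must name the legal text it was taken from.
        if self.definition and not self.definition_source:
            problems.append("definition without source")
        if self.context and not self.context_source:
            problems.append("context without source")
        return problems


entry = TermEntry(
    entry_id="fra.trafic_intra-alpin.1",   # hypothetical identifier format
    language="fra",                        # hypothetical language code
    term="trafic intra-alpin",
    legal_system="AC",
    harmonisation_status="harmonised",     # illustrative value, not from the paper
    processing_status="finalised",         # illustrative value, not from the paper
)
print(entry.validate())
```

Returning a list of problems rather than raising an error fits the workflow described above, in which an entry can be kept for further fine-tuning before it is released.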
    </Section>
    <Section position="3" start_page="26" end_page="26" type="sub_section">
      <SectionTitle>
3.3 Achieving language/legal system interoperability
</SectionTitle>
      <Paragraph position="0"> As the project deals with several different legal terms, standard techniques used in multilingual terminology management need to be adapted to the peculiarities of the specialised language of the law. Indeed, terms in different languages are (generally) defined according to different legal systems and these legal systems cannot be changed. Hence, it is not possible to define a common 'meaning' that could be used as a pivot for language interoperability. In this respect, legal terminology is closer to general lexicography than to standard terminology.</Paragraph>
      <Paragraph position="1"> In order to achieve language/legal system interoperability we had several options that are used in general lexicography.</Paragraph>
      <Paragraph position="2"> Using a set of bilingual dictionaries is not an option here, as we have to deal with at least 16 language/legal system couples (with Alpine Convention and EU levels, but without taking into account regional levels). Moreover, such a solution will not reflect the multilingual aspect of the Alpine Convention or the Swiss legal system.</Paragraph>
      <Paragraph position="3"> Footnote: Variants, acronyms, etc. are not considered as different denominations.</Paragraph>
      <Paragraph position="4"> Footnote: Strictly speaking, the Alpine Convention does not constitute a legal system per se.</Paragraph>
      <Paragraph position="5"> Footnote: Consider for instance the difference between the Italian and the Austrian concepts of journalists' professional confidentiality. Whereas the Redaktionsgeheimnis explicitly underlines that the journalist can refuse to testify in court in order to keep the professional secret, in Italy the segreto giornalistico must be lifted on a judge's request. The two concepts have overlapping meanings in the two states; however, they diverge greatly with respect to the behaviour in court.</Paragraph>
      <Paragraph position="6"> Finally, building bilingual volumes between the French and Italian legal systems is far beyond the objectives of the LexALP project.</Paragraph>
      <Paragraph position="7"> Another solution would be to use a &quot;EuroWordNet-like&quot; approach (Vossen, 1998), where a specific language/legal system is used as a pivot and elements of the other systems are linked to it by equivalent or near-equivalent links. As such an approach artificially puts one language in the pivot position, it generally leads to an &quot;ethnocentric&quot; view of the other languages. Its advantage is that the architecture uses the bilingual competence of lexicographers to achieve multilingualism.</Paragraph>
      <Paragraph position="8"> In this project, we chose to use 'interlingual acceptions' (a.k.a. axies) as defined in (Sérasset, 1994) to represent the complex contrastive phenomena generally described in general lexicography work. In this approach, each 'term meaning' is associated with an interlingual acception (or axie). These axies are used to achieve interoperability, acting as a pivot that links terms of different languages bearing the same meaning.</Paragraph>
      <Paragraph position="9"> However, as we are dealing with legal terms (bound to different legal systems), it is generally not possible to find terms in different languages that bear the same meaning. In fact, such terms can only be found in the Alpine Convention (which is considered as a legal system expressed in all the considered languages). Hence, we use these terms to achieve interoperability between languages. In this respect, we are close to EuroWordNet's approach, as we use a specific legal system as a pivot, but in our case the pivot itself is generally a quadrilingual set of entries.</Paragraph>
      <Paragraph position="10"> These harmonised Alpine Convention terms are linked through an interlingual acception. An axie is a placeholder for relations. Each interlingual acception may be linked to several term entries in the language volumes through termref elements and to other interlingual acceptions through axieref elements, as illustrated in Figure 1. The termref relation establishes a direct translation relation between these harmonised equivalents. National legal terms are then indirectly linked to Alpine Convention terms through the axieref relation, as illustrated in Figure 4.</Paragraph>
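The linking scheme described above (termref for direct links between harmonised equivalents, axieref for indirect links from national terms) can be sketched as a small graph lookup. This is a minimal Python illustration, not the project's implementation: the class, the function, all identifiers and the assumption that a national term sits in its own axie pointing to the Alpine Convention axie are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Axie:
    axie_id: str
    termrefs: List[str] = field(default_factory=list)   # ids of linked term entries
    axierefs: List[str] = field(default_factory=list)   # ids of linked axies


def linked_terms(term_id: str, axies: Dict[str, Axie]) -> List[str]:
    """Collect terms reachable from term_id: first the terms sharing its axie
    (termref), then the terms of every axie that axie refers to (axieref)."""
    found: List[str] = []
    for axie in axies.values():
        if term_id not in axie.termrefs:
            continue
        reachable = [axie] + [axies[a] for a in axie.axierefs if a in axies]
        for candidate in reachable:
            for ref in candidate.termrefs:
                if ref != term_id and ref not in found:
                    found.append(ref)
    return found


# Hypothetical identifiers: one quadrilingual Alpine Convention axie and one
# national Italian term linked to it indirectly through an axieref.
axies = {
    "axie.ac.1": Axie("axie.ac.1",
                      termrefs=["ac.fra.1", "ac.deu.1", "ac.ita.1", "ac.slv.1"]),
    "axie.it.7": Axie("axie.it.7", termrefs=["it.ita.7"], axierefs=["axie.ac.1"]),
}
print(linked_terms("it.ita.7", axies))
print(linked_terms("ac.fra.1", axies))
```

In this sketch the national term reaches the four harmonised entries only via the referenced axie, while a harmonised entry reaches its three equivalents directly, which mirrors the direct and indirect relations distinguished in the text.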
    </Section>
  </Section>
  <Section position="6" start_page="26" end_page="29" type="metho">
    <SectionTitle>
4 Corpus
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="26" end_page="28" type="sub_section">
      <SectionTitle>
4.1 Corpus content
</SectionTitle>
      <Paragraph position="0"> The corpus comprises around 300 legal documents from eight legal systems (Germany, Italy, France, Switzerland, Austria, Slovenia, European law and international law, with the specific framework of the Alpine Convention). Documents of the supranational level are provided in up to four languages (subject to availability). National legislation is generally added in the national language (monolingual documents) and, in the case of Switzerland (multilingual documents), in the three official languages of that nation (French, German and Italian).</Paragraph>
      <Paragraph position="1"> The documents are selected by legal experts of the respective legal systems following predefined criteria: * entire documents (no single paragraphs, excerpts etc.); * strong relevance to the subjects 'spatial planning and sustainable development' as described in art. 9 of the relevant Alpine Convention Protocol; * primary sources of the law for every system at national and international/EU level, i.e. normative texts only (laws, codes etc.); * latest amendments and versions of all legislation (at time of collection: June-August 2005); * terminological relevance.</Paragraph>
      <Paragraph position="2"> Each document is classified according to the following (bibliographical) categories: full title, short title, abbreviation, legal system, language, legal hierarchy, legal text type, subfield (1, 2 and 3), official date, official number, published in official journal (date, number, page), ... The bibliographical information of all documents is stored in a database and can be consulted by the user at any time.</Paragraph>
      <Paragraph position="3"> The subfields have been elaborated and selected by a team of legal experts, taking into account the classification specificities followed by the Alpine Convention and the need to classify texts from several different legal systems according to one common structure. For this reason, the legal experts have subdivided the fields spatial planning and sustainable development into 5 main areas, in accordance with the Alpine Convention Protocol dealing with these subjects, and subsequently adopted an EU-based model for further subdividing the 5 main topics in such a way that all countries involved could classify their selected documents under a maximum of 3 main items, the first of which is mandatory. This classification allows an easy selection of all subsets of documents according to subject field.</Paragraph>
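The subfield rule stated above (at most three items per document, the first one mandatory) and the selection of document subsets by subject field can be sketched in a few lines of Python. The function names, document ids and subfield codes below are hypothetical; the actual classification scheme is not reproduced in this section.

```python
from typing import Dict, List

MAX_SUBFIELDS = 3


def valid_subfields(subfields: List[str]) -> bool:
    """At most three subfields, and the first one must be filled in."""
    if not subfields or len(subfields) > MAX_SUBFIELDS:
        return False
    return bool(subfields[0].strip())


def select_by_subfield(documents: Dict[str, List[str]], subfield: str) -> List[str]:
    """Return the ids of all documents classified under the given subfield."""
    return sorted(doc_id for doc_id, fields in documents.items() if subfield in fields)


# Hypothetical document ids and subfield codes, for illustration only.
documents = {
    "doc-1": ["A.1", "B.2"],
    "doc-2": ["B.2"],
    "doc-3": ["C.1", "A.1", "B.3"],
}
print(all(valid_subfields(fields) for fields in documents.values()))
print(select_by_subfield(documents, "A.1"))
```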
    </Section>
    <Section position="2" start_page="28" end_page="28" type="sub_section">
      <SectionTitle>
4.2 Structural organization of corpus data
</SectionTitle>
      <Paragraph position="0"> Collected in raw text format (one file for each legal text), the documents are first transformed into XML-structured files and, in a second step, inserted into the database.</Paragraph>
      <Paragraph position="1"> The XML annotation is done in compliance with the Corpus Encoding Standard for XML (XCES, http://www.cs.vassar.edu/XCES/; schema: http://www.cs.vassar.edu/XCES/schema/xcesDoc.xsd) and serves to add structural information to the documents. Each text is segmented into sub-sections such as preamble, chapter, section, paragraph, title and sentence. Furthermore, a link to the classification data (bibliographic database) is inserted and, in the case of multilingual documents, alignment is done at sentence level.</Paragraph>
      <Paragraph position="2"> The XML-annotated documents hold all the information needed for insertion into the corpus database, such as structural mark-up and bibliographical information. The full-text documents are transformed into sets of database entries, which can be imported into the database.</Paragraph>
    </Section>
    <Section position="3" start_page="28" end_page="29" type="sub_section">
      <SectionTitle>
4.3 Technical organization of corpus data
</SectionTitle>
      <Paragraph position="0"> Following the bistro approach as realized for the Corpus Ladin dl'Eurac (CLE) (Streiter et al. 2004), the corpus data is stored in a relational database (PostgreSQL).</Paragraph>
      <Paragraph position="1"> The information present in the XML-annotated documents is distributed among four main tables: document_info, corpus_words, corpus_structure, corpus_alignment.</Paragraph>
      <Paragraph position="2"> The four tables can be described as follows: document_info: This table holds the meta-information about the documents; each category (full title, short title, abbreviation, legal system, language, etc.) is represented by a separate column. For each legal document, one entry (one row) with a unique identification number is added to the table. These identification numbers are cited in the XML header of the corpus documents. corpus_words: This table holds the actual text of the collected documents. Instead of storing entire paragraphs, as was done during the creation of CLE, a different approach is being tested for this corpus. Every annotated text is split into an indexed sequence of words, starting with counter one. Once inserted into the database, a text is stored as a set of tuples composed of word, position in text and document id (as a reference to the document information).</Paragraph>
      <Paragraph position="3"> corpus_structure: This table holds all information about the internal structure of the documents. Titles, sentences, paragraphs etc. are stored by indicating starting and ending point of the section. For each segment a tuple of segment type, segment id, starting point (indicated by the index of the first word), ending point (indicated by the index of the last word) and document id is added.</Paragraph>
      <Paragraph position="4"> corpus_alignment: This table defines the alignment of multilingual documents. By providing one column for each language, the texts are aligned via the document ids or via the ids of single segments.</Paragraph>
      <Paragraph position="5"> The tables are interconnected by explicitly stated references, i.e. the columns of one table refer to the values of a certain column of another table. As shown in Figure 8, all tables hold a column document_id that refers to the document id of the table document_info. Furthermore, the table corpus_structure holds references to the column position of the table corpus_words.</Paragraph>
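The word-indexed storage described above can be tried out with a small script. The sketch below is an assumption-laden illustration, not the LexALP schema: it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, the column names other than document_id and position are guesses, the sample sentences are invented, and corpus_alignment is left out for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # SQLite stands in for PostgreSQL here
conn.executescript("""
CREATE TABLE document_info (
    document_id  INTEGER PRIMARY KEY,
    full_title   TEXT,
    legal_system TEXT,
    language     TEXT
);
CREATE TABLE corpus_words (
    document_id INTEGER REFERENCES document_info(document_id),
    position    INTEGER,          -- word index, starting at 1
    word        TEXT,
    PRIMARY KEY (document_id, position)
);
CREATE TABLE corpus_structure (
    document_id  INTEGER REFERENCES document_info(document_id),
    segment_type TEXT,            -- e.g. 's' for sentence
    segment_id   INTEGER,
    start_pos    INTEGER,         -- index of the first word
    end_pos      INTEGER          -- index of the last word
);
""")


def insert_document(doc_id, title, legal_system, language, sentences):
    """Store a document as indexed words plus one structure row per sentence."""
    conn.execute("INSERT INTO document_info VALUES (?, ?, ?, ?)",
                 (doc_id, title, legal_system, language))
    position = 0
    for seg_id, sentence in enumerate(sentences, start=1):
        start = position + 1
        for word in sentence.split():
            position += 1
            conn.execute("INSERT INTO corpus_words VALUES (?, ?, ?)",
                         (doc_id, position, word))
        conn.execute("INSERT INTO corpus_structure VALUES (?, 's', ?, ?, ?)",
                     (doc_id, seg_id, start, position))


def segment_text(doc_id, segment_type, segment_id):
    """Rebuild a segment by joining the words between its start and end index."""
    start, end = conn.execute(
        "SELECT start_pos, end_pos FROM corpus_structure "
        "WHERE document_id = ? AND segment_type = ? AND segment_id = ?",
        (doc_id, segment_type, segment_id)).fetchone()
    rows = conn.execute(
        "SELECT word FROM corpus_words "
        "WHERE document_id = ? AND position BETWEEN ? AND ? ORDER BY position",
        (doc_id, start, end))
    return " ".join(word for (word,) in rows)


# Invented sample document; tokens are separated by spaces for simplicity.
insert_document(1, "Beispieltext", "AC", "de",
                ["Erster Satz des Textes .", "Zweiter Satz ."])
print(segment_text(1, "s", 2))
```

Storing segments as index ranges over the word table means a sentence, a title or a paragraph can be rebuilt with the same query, which is presumably what makes the choice of display unit at search time cheap.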
    </Section>
  </Section>
  <Section position="7" start_page="29" end_page="29" type="metho">
    <SectionTitle>
5 Searching the corpus
</SectionTitle>
    <Paragraph position="0"> Due to the fine-grained classification (see section 4.1) and the structural mark-up (see section 4.2) of all corpus documents, corpus searches can be restricted in the following ways: * by specifying a subset of corpus documents over which the search should be carried out (e.g. all documents of legal system CH with language French); * by choosing the type of unit to be displayed (whole paragraphs &lt;p&gt;, sentences &lt;s&gt;, titles &lt;title&gt;, ...); * by searching for whole words only (exact match) or parts of words (fuzzy match); * by restricting the number of hits to be displayed at a time.</Paragraph>
    <Paragraph position="1"> For searches in multilingual documents it will be possible to search for aligned segments, specifying the search word as well as the target translation. For example, the user could search for all alignments of German-Italian sentences that contain the word Umweltschutz translated as tutela ambientale (and not as protezione dell'ambiente). Figure 9 shows a simple interface for searching</Paragraph>
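The aligned search in the Umweltschutz example can be sketched as a filter over sentence pairs. This Python illustration works on an in-memory list rather than on the corpus database; the function name, its parameters and the sample sentences are invented for the purpose of the example, and the whole_word switch only approximates the exact/fuzzy distinction mentioned above.

```python
from typing import List, Tuple

# Invented German-Italian sentence pairs, standing in for aligned corpus segments.
ALIGNED: List[Tuple[str, str]] = [
    ("Der Umweltschutz hat Vorrang.", "La tutela ambientale ha la priorità."),
    ("Umweltschutz ist ein Ziel.", "La protezione dell'ambiente è un obiettivo."),
    ("Die Raumplanung wird koordiniert.", "La pianificazione territoriale viene coordinata."),
]


def search_aligned(pairs, source_word, target_phrase, whole_word=True):
    """Return the pairs whose source side contains source_word and whose
    target side contains target_phrase (both compared case-insensitively)."""
    hits = []
    for source, target in pairs:
        if whole_word:
            tokens = [token.strip(".,;:!?").lower() for token in source.split()]
            source_ok = source_word.lower() in tokens
        else:
            source_ok = source_word.lower() in source.lower()
        if source_ok and target_phrase.lower() in target.lower():
            hits.append((source, target))
    return hits


for source, target in search_aligned(ALIGNED, "Umweltschutz", "tutela ambientale"):
    print(source, "|", target)
```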
  </Section>
  <Section position="8" start_page="29" end_page="30" type="metho">
    <SectionTitle>
6 Interaction term bank and corpus
</SectionTitle>
    <Paragraph position="0"> Term bank and corpus are independent components which together form the LexALP Information System.</Paragraph>
    <Paragraph position="1"> The interaction between corpus and term bank will concern in particular 1) corpus segments used as contexts and definitions in the terminological entries, 2) short source references in the term bank (and the associated sets of bibliographical information) and 3) legal terms.</Paragraph>
    <Paragraph position="2"> 6.1 Entering data into term bank: When adding citations to a term bank entry, the corresponding bibliographic information will automatically be cross-checked against the contents of the bibliographical database. If the information about the cited document is already present in the database, a link to the term bank can be added. Otherwise the terminologist is asked to enter all information about the new source in the bibliographic database and create the link afterwards.</Paragraph>
    <Paragraph position="3"> In addition to the static contexts and definitions present for each terminological entry, each entry will show a button for the dynamic creation of contexts. Clicking the button will start a context search in the corpus and return all sentences containing the term under consideration.</Paragraph>
    <Section position="1" start_page="29" end_page="30" type="sub_section">
      <SectionTitle>
6.2 Searching the corpus
</SectionTitle>
      <Paragraph position="0"> When searching the corpus, the user will have the opportunity to highlight terms present in the term bank. In the same way, standardised or rejected terms can be highlighted. Via a link it will then be possible to directly access the term bank entry for the term found in the corpus.</Paragraph>
      <Paragraph position="1"> In general, each corpus segment is linked to the full set of bibliographic information of the document that the segment is part of. Accessing the source information will lead the user to a detailed overview as shown in Figure 4.</Paragraph>
    </Section>
  </Section>
</Paper>