
<?xml version="1.0" standalone="yes"?>
<Paper uid="N03-4007">
  <Title>Demonstration of the CROSSMARC System</Title>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 System Architecture
</SectionTitle>
    <Paragraph position="0"> The CROSSMARC architecture consists of the following main processing stages: a7 Collection of domain-specific web pages, involving two sub-stages: - domain-specific web crawling (focused crawling) for the identification of web sites that are of relevance to the particular domain (e.g. retailers of electronic products).</Paragraph>
    <Paragraph position="1"> - domain-specific spidering of the retrieved web sites in order to identify web pages of interest (e.g. laptop product descriptions).</Paragraph>
    <Paragraph position="2"> a7 Information extraction from the domain-specific web pages, which involves two main sub-stages: - named entity recognition to identify named entities such as product manufacturer name or company name in descriptions inside the web page written in any of the project's four languages (English, Greek, French, Italian) (Grover et al. 2002). Cross-lingual name matching techniques are also employed in order to link expressions referring to the same named entities across languages.</Paragraph>
    <Paragraph position="3"> - fact extraction to identify those named entities that fill the slots of the template specifying the information to be extracted from each web page. To achieve this the project combines wrapper-induction approaches for fact extraction with language-based information extraction in order to develop site independent wrappers for the domain examined.</Paragraph>
    <Paragraph position="4"> a7 Data Storage, to store the extracted information (from the web page descriptions in any of the project's four languages) into a common database.</Paragraph>
    <Paragraph position="5"> a7 Data Presentation, to present the extracted information to the end-user through a multilingual user interface, in accordance with the user's language and preferences.</Paragraph>
    <Paragraph position="6"> As a cross-lingual multi-domain system, the goal of CROSSMARC is to cover a wide area of possible knowledge domains and a wide range of conceivable facts in each domain. To achieve this we construct an ontology of each domain which reflects a certain degree of domain expert knowledge (Pazienza et al. 2003). Cross-linguality is achieved with the lexica, which provide language specific synonyms for all the ontology entries. During information extraction, web pages are matched against the domain ontology and an abstract representation of this real world information (facts) is generated.</Paragraph>
    <Paragraph position="7"> As shown in Figure 1, the CROSSMARC multi-agent architecture includes agents for web page collection (crawling agent, spidering agent), information extraction, data storage and data presentation. These agents communicate through the blackboard. The Crawling Agent defines a schedule for invoking the focused crawler which is  software component, which retrieves sites to spider from the blackboard and locates interesting web pages within them by traversing their links. Again, status information is written to the blackboard.</Paragraph>
    <Paragraph position="8"> The multi-lingual IE system is a distributed one where the individual monolingual components are autonomous processors, which need not all be installed on the same machine. (These components have been developed using a wide range of base technologies: see, for example, Petasis et al. (2002), Mikheev et al. (1998), Pazienza and Vindigni (2000)). The IE systems are not offered as web services, therefore a proxy mechanism is required, utilising established remote access mechanisms (e.g. HTTP) to act as a front-end for every IE system in the project. In effect, this proxy mechanism turns every IE system into a web service. For this purpose, we have developed an Information Extraction Remote Invocation module (IERI) which takes XHTML pages as input and routes them to the corresponding monolingual IE system according to the language they are written in. The Information Extraction Agent retrieves pages stored on the blackboard by the Spidering Agent, invokes the Information Extraction system (through IERI) for each language and writes the extracted facts (or error messages) on the blackboard. This information will then be used by the Data Storage Agent in order to read the extracted facts and to store them in the product database.</Paragraph>
  </Section>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 The CROSSMARC Demonstration
</SectionTitle>
    <Paragraph position="0"> The first part of the CROSSMARC demonstration is the user-interface accessed via a web-page. The user is presented with the prototype user-interface which supports menu-driven querying of the product databases for the two domains. The user enters his/her preferences and is presented with information about matching products including links to the pages which contain the offers.</Paragraph>
    <Paragraph position="1"> The main part of the demonstration shows the full information extraction system including web crawling, site spidering and Information Extraction. The demonstration show the results of the individual modules including real-time spidering of web-sites to find pages which contain product offers and real-time information extraction from the pages in the four project languages, English, French, Italian and Greek. Screen shots of various parts of the system are available at http://www.iit.demokritos.gr/ skel/crossmarc/demo-images.htm</Paragraph>
  </Section>
class="xml-element"></Paper>