<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-2206">
  <Title>Spotting the 'Odd-one-out': Data-Driven Error Detection and Correction in Textual Databases</Title>
  <Section position="3" start_page="40" end_page="41" type="relat">
    <SectionTitle>
2 Related Work
</SectionTitle>
    <Paragraph position="0"> There is a considerable body of previous work on the generic issue of data cleaning. Much of the research directed specifically at databases focuses on identifying identical records when two databases are merged (Hern'andez and Stolfo, 1998; Galhardas et al., 1999). This is a non-trivial problem as records of the same objects coming from different sources typically differ in their primary keys. There may also be subtle differences in other database fields. For example, names may be entered in different formats (e.g., John Smith vs. Smith, J.) or there may be typos which make it difficult to match fields (e.g., John Smith vs. Jon Smith).4 In a wider context, a lot of research has been dedicated to the identification of outliers in datasets. Various strategies have been proposed.</Paragraph>
    <Paragraph position="1"> The earliest work uses probability distributions to model the data; all instances which deviate too much from the distributions are flagged as outliers (Hawkins, 1980). This approach is called distribution-based. In clustering-based methods, a clustering algorithm is applied to the data and  instanceswhichcannotbegroupedunderanycluster, or clusters which only contain very few instances are assumed to be outliers (e.g., Jiang et al. (2001)). Depth-based methods (e.g., Ruts and Rousseeuw (1996)) use some definition of depth to organise instances in layers in the data space; outliers are assumed to occupy shallow layers.</Paragraph>
    <Paragraph position="2"> Distance-based methods (Knorr and Ng, 1998) utilise a k-nearest neighbour approach where outliers are defined, for example, as those instances whose distance to their nearest neighbour exceeds a certain threshold. Finally, Marcus and Maletic (2000) propose a method which learns association rules for the data; records that do not conform to any rules are then assumed to be potential outliers.</Paragraph>
    <Paragraph position="3"> Inprinciple, techniquesdeveloped todetectoutliers can be applied to databases as well, for instancetoidentifycellvaluesthatareexceptionalin null the context of other values in a given column, or to identify database entries that seem unlikely compared to other entries. However, most methods are not particularly suited for textual databases.</Paragraph>
    <Paragraph position="4"> Some approaches only work with numeric data (e.g., distribution-based methods), others can deal with categorical data (e.g., distance-based methods) but treat all database fields as atoms. For databases with free text fields it can be fruitful to look at individual tokens within a text string. For instance, units of measurement (m, ft, etc.) may be very common in one column (such as ALTITUDE) but may indicate an error when they occur in another column (such as COLLECTOR).</Paragraph>
    <Paragraph position="5"> 4The problem of whether two proper noun phrases refer to the same entity has also received attention outside the database community (Bagga, 1998).</Paragraph>
  </Section>
class="xml-element"></Paper>