File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/abstr/06/n06-1009_abstr.xml
Size: 1,678 bytes
Last Modified: 2025-10-06 13:44:48
<?xml version="1.0" standalone="yes"?> <Paper uid="N06-1009"> <Title>Role of Local Context in Automatic Deidentification of Ungrammatical, Fragmented Text</Title> <Section position="1" start_page="0" end_page="0" type="abstr"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> Deidentification of clinical records is a crucial step before these records can be distributed to non-hospital researchers.</Paragraph> <Paragraph position="1"> Most approaches to deidentification rely heavily on dictionaries and heuristic rules; these approaches fail to remove most personal health information (PHI) that cannot be found in dictionaries. They also can fail to remove PHI that is ambiguous between PHI and non-PHI.</Paragraph> <Paragraph position="2"> Named entity recognition (NER) technologies can be used for deidentification. Some of these technologies exploit both local and global context of a word to identify its entity type. When documents are grammatically written, global context can improve NER.</Paragraph> <Paragraph position="3"> In this paper, we show that we can deidentify medical discharge summaries using support vector machines that rely on a statistical representation of local context. We compare our approach with three different systems. Comparison with a rule-based approach shows that a statistical representation of local context contributes more to deidentification than dictionaries and hand-tailored heuristics. Comparison with two well-known systems, SNoW and IdentiFinder, shows that when the language of documents is fragmented, local context contributes more to deidentification than global context.</Paragraph> </Section> </Paper>