<?xml version="1.0" standalone="yes"?>
<Paper uid="N06-1009">
<Title>Role of Local Context in Automatic Deidentification of Ungrammatical, Fragmented Text</Title>
<Section position="4" start_page="66" end_page="66" type="metho">
<SectionTitle> 3 Corpora </SectionTitle>
<Paragraph position="0"> Discharge summaries are the reports generated by medical personnel at the end of a patient's hospital stay and contain important information about the patient's health. Linguistic processing of these documents is challenging, mainly because these reports are full of medical jargon, acronyms, shorthand notations, misspellings, ad hoc language, and sentence fragments. Our goal is to identify the PHI used in discharge summaries even when the text is fragmented and ad hoc, even when many words in the summaries are ambiguous between PHI and non-PHI, and even when many PHI contain misspelled or foreign words.</Paragraph>
<Paragraph position="1"> In this study, we worked with various corpora consisting of discharge summaries. One of these corpora was obtained already deidentified; i.e., (many) PHI (and some non-PHI) found in this corpus had been replaced with the generic placeholder [REMOVED]. An excerpt from this corpus is below:</Paragraph>
<Paragraph position="2"> HISTORY OF PRESENT ILLNESS: The patient is a 77-year-old-woman with long standing hypertension who presented as a Walk-in to me at the [REMOVED] Health Center on [REMOVED]. Recently had been started q.o.d. on Clonidine since [REMOVED] to taper off of the drug. Was told to start Zestril 20 mg. q.d. again. The patient was sent to the [REMOVED] Unit for direct admission for cardioversion and anticoagulation, with the Cardiologist, Dr. [REMOVED] to follow.</Paragraph>
<Paragraph position="3"> SOCIAL HISTORY: Lives alone, has one daughter living in [REMOVED]. Is a non-smoker, and does not drink alcohol.</Paragraph>
<Paragraph position="4"> HOSPITAL COURSE AND TREATMENT: During admission, the patient was seen by Cardiology, Dr. [REMOVED], was started on IV Heparin, Sotalol 40 mg PO b.i.d. increased to 80 mg b.i.d., and had an echocardiogram. By [REMOVED] the patient had better rate control and blood pressure control but remained in atrial fibrillation. On [REMOVED], the patient was felt to be medically stable. ...</Paragraph>
<Paragraph position="5"> We hand-annotated this corpus and experimented with it in several ways. (Authentic clinical data is very difficult to obtain for privacy reasons; therefore, the initial implementation of our system was tested on previously deidentified data that we reidentified.) We used it to generate a corpus of discharge summaries in which the [REMOVED] tokens were replaced with appropriate, fake PHI obtained from dictionaries (Douglass, 2005), e.g., "John Smith initiated radiation therapy ..."; we used it to generate a second corpus in which most of the [REMOVED] tokens and some of the remaining text were appropriately replaced with lexical items that were ambiguous between PHI and non-PHI, e.g., "D. Sessions initiated radiation therapy ..."; and we used it to generate another corpus in which all of the [REMOVED] tokens corresponding to names were replaced with appropriately formatted entries that could not be found in dictionaries, e.g., "O. Ymfgkstjj initiated radiation therapy ...". For all of these corpora, we generated realistic substitutes for the [REMOVED] tokens using dictionaries (e.g., a dictionary of names from the US Census Bureau) and patterns (e.g., names of people could be of the formats "Mr. F. Lastname", "Firstname Lastname", "Lastname", "F. M. Lastname", etc.; dates could appear as "dd/mm/yy", "dd MonthName, yyyy", "ddth of MonthName, yyyy", etc.).</Paragraph>
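The surrogate-generation step just described reduces to dictionary sampling plus surface patterns. The Python fragment below is a minimal, hypothetical illustration of it, using toy word lists and only the name and date formats named above; it is not the code the corpora were actually built with, and for simplicity it picks between names and dates at random rather than by the original token's type.

```python
# Hypothetical sketch of the reidentification step: replacing [REMOVED]
# placeholders with surrogate PHI drawn from dictionaries and surface
# patterns. Word lists and formats are illustrative placeholders.
import random
import re

FIRST_NAMES = ["John", "Mary", "Ahmed"]   # e.g., from US Census Bureau lists
LAST_NAMES = ["Smith", "Garcia", "Nguyen"]
MONTHS = ["January", "February", "March"]

def fake_name() -> str:
    """Generate a name in one of the surface formats noted in the text."""
    first, last = random.choice(FIRST_NAMES), random.choice(LAST_NAMES)
    patterns = [
        f"Mr. {first[0]}. {last}",    # "Mr. F. Lastname"
        f"{first} {last}",            # "Firstname Lastname"
        last,                         # "Lastname"
        f"{first[0]}. M. {last}",     # "F. M. Lastname"
    ]
    return random.choice(patterns)

def fake_date() -> str:
    """Generate a date in one of the formats noted in the text."""
    d, m, y = random.randint(1, 28), random.randint(1, 12), random.randint(0, 99)
    patterns = [
        f"{d:02d}/{m:02d}/{y:02d}",                        # "dd/mm/yy"
        f"{d:02d} {random.choice(MONTHS)}, 19{y:02d}",     # "dd MonthName, yyyy"
        f"{d}th of {random.choice(MONTHS)}, 19{y:02d}",    # simplified ordinal
    ]
    return random.choice(patterns)

def reidentify(text: str) -> str:
    """Replace every [REMOVED] token with a random surrogate name or date."""
    return re.sub(r"\[REMOVED\]",
                  lambda _: random.choice([fake_name, fake_date])(),
                  text)

print(reidentify("seen by Cardiology, Dr. [REMOVED], on [REMOVED]."))
```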
<Paragraph position="6"> In addition to these reidentified corpora (i.e., corpora generated from previously deidentified data), we also experimented with authentic discharge summaries, which we obtained, complete with real PHI, in the final stages of this project. The approximate distributions of PHI in the reidentified corpora and in the authentic corpus are shown in Table 1.</Paragraph>
<Paragraph position="7"> Table 1: Distribution of PHI (in words) in the corpora; columns: Class, No. in reidentified summaries, No. in authentic summaries.</Paragraph>
</Section>
<Section position="6" start_page="66" end_page="67" type="metho">
<SectionTitle> 4 Baseline Approaches </SectionTitle>
<Paragraph position="0"/>
<Section position="1" start_page="66" end_page="67" type="sub_section">
<SectionTitle> 4.1 Rule-Based Baseline: Heuristic+Dictionary </SectionTitle>
<Paragraph position="0"> Traditional deidentification approaches rely heavily on dictionaries and hand-tailored heuristics.</Paragraph>
<Paragraph position="1"> We obtained one such system (Douglass, 2005) that used three kinds of dictionaries:
* PHI lookup tables for female and male first names, last names, last name prefixes, hospital names, locations, and states.
* A dictionary of "common words" that should never be classified as PHI.
* Lookup tables for context clues such as titles, e.g., Mr.; name indicators, e.g., proxy, daughter; location indicators, e.g., lives in.
Given these dictionaries, this system identifies keywords that appear in the PHI lookup tables but do not occur in the common words list, finds approximate matches for possibly misspelled words, and uses patterns and indicators to find PHI, as sketched below.</Paragraph>
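As an illustration, a minimal, hypothetical version of such a heuristic+dictionary classifier might look like the following Python fragment. The word lists are toy placeholders, and the approximate-matching and indicator logic is far simpler than in the actual system.

```python
# Hypothetical sketch of a heuristic+dictionary deidentifier: dictionary
# lookup, a common-words filter, approximate matching for misspellings,
# and local indicator words. All word lists are invented placeholders.
import difflib

PHI_LOOKUP = {"smith", "garcia", "boston"}   # names, hospitals, locations, ...
COMMON_WORDS = {"may", "white", "lives"}     # words never classified as PHI
TITLES = {"mr.", "mrs.", "dr."}              # context clues: titles
NAME_INDICATORS = {"proxy", "daughter"}      # context clues: name indicators

def is_phi(tokens: list[str], i: int) -> bool:
    """Flag tokens[i] as PHI via lookup tables and context indicators."""
    word = tokens[i].lower()
    if word in COMMON_WORDS:
        return False                         # common words are never PHI
    if word in PHI_LOOKUP:
        return True                          # exact dictionary hit
    if difflib.get_close_matches(word, PHI_LOOKUP, n=1, cutoff=0.8):
        return True                          # approximate match: misspelling
    prev = tokens[i - 1].lower() if i > 0 else ""
    # a word right after a title or a name indicator is likely a name
    return prev in TITLES or prev in NAME_INDICATORS

tokens = "seen by Dr. Smiht at Boston".split()
print([t for i, t in enumerate(tokens) if is_phi(tokens, i)])
# prints ['Smiht', 'Boston']: the misspelled name is caught both by the
# approximate match and by the preceding title "Dr."
```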
</Section>
<Section position="2" start_page="67" end_page="67" type="sub_section">
<SectionTitle> 4.2 SNoW </SectionTitle>
<Paragraph position="0"> SNoW is a statistical classifier that includes an NER component for recognizing entities and their relations. To create a hypothesis about the entity type of a word, SNoW first takes advantage of "words, tags, conjunctions of words and tags, bigram and trigram of words and tags", the number of words in the entity, bigrams of words in the entity, and some attributes such as the prefix and suffix, as well as information about the presence of the word in a dictionary of people, organization, and location names (Roth and Yih, 2002). After this initial step, it uses the possible relations of the entity with other entities in the sentence to strengthen or weaken its hypothesis about the entity's type. The constraints imposed on the entities and their relationships constitute the global context of inference. Intuitively, information about global context and constraints imposed on the relationships of entities should improve recognition of both entities and relations. Roth and Yih (2002) present results that support this hypothesis.</Paragraph>
<Paragraph position="1"> SNoW can recognize entities that correspond to people, locations, and organizations. For deidentification purposes, all of these entities correspond to PHI; however, they do not constitute a comprehensive set. We therefore evaluated SNoW only on the PHI it is built to recognize. We trained and tested its NER component using ten-fold cross-validation on each of our corpora.</Paragraph>
</Section>
<Section position="3" start_page="67" end_page="67" type="sub_section">
<SectionTitle> 4.3 IdentiFinder </SectionTitle>
<Paragraph position="0"> IdentiFinder uses Hidden Markov Models to learn the characteristics of names of entities, including people, locations, geographic jurisdictions, organizations, dates, and contact information (Bikel et al., 1999). For each named entity class, this system learns a bigram language model which indicates the likelihood that a sequence of words belongs to that class. This model takes into consideration features of words, such as whether the word is capitalized, all upper case, or all lower case, whether it is the first word of the sentence, and whether it contains digits and punctuation. Thus, it captures the local context of the target word (i.e., the word to be classified; also referred to as TW). To find the names of all entities, the system finds the most likely sequence of entity types in a sentence given a sequence of words; thus, it captures the global context of the entities in a sentence.</Paragraph>
<Paragraph position="1"> We obtained this system pre-trained on a news corpus and applied it to our corpora. We mapped its entity tags to our PHI and non-PHI labels. Admittedly, testing IdentiFinder on the discharge summaries puts this system at a disadvantage compared to the other statistical approaches. However, despite this shortcoming, IdentiFinder helps us evaluate the contribution of global context to deidentification.</Paragraph>
</Section>
</Section>
<Section position="7" start_page="67" end_page="68" type="metho">
<SectionTitle> 5 SVMs with Local Context </SectionTitle>
<Paragraph position="0"> We hypothesize that systems relying on dictionaries and hand-tailored heuristics face a major challenge when particular PHI can be used in many different contexts, when PHI are ambiguous, or when PHI cannot be found in dictionaries. We further hypothesize that, given the ungrammatical and ad hoc nature of our data, IdentiFinder and SNoW, despite being very powerful systems, may not provide perfect deidentification. In addition to being very fragmented, discharge summaries do not present information in the form of relations between entities, and many sentences contain only one entity. Therefore, the global context utilized by IdentiFinder and SNoW cannot contribute reliably to deidentification. When run on discharge summaries, the strength of these systems comes from their ability to recognize the structure of the names of different entity types and the local contexts of these entities.</Paragraph>
<Paragraph position="1"> Discharge summaries contain patterns that can serve as local context. Therefore, we built an SVM-based system that, given a target word (TW), would accurately predict whether the TW was part of PHI.</Paragraph>
<Paragraph position="2"> We used a development corpus to find features that captured as much of the immediate context of the TW as possible, paying particular attention to cues human annotators found useful for deidentification. We added to these some surface characteristics of the TW itself and obtained the following features:
* the TW itself, the word before, and the word after (all lemmatized);
* the bigram before and the bigram after the TW (lemmatized);
* the part of speech of the TW, of the word before, and of the word after;
* capitalization of the TW;
* length of the TW;
* the MeSH ID of the noun phrase containing the TW (MeSH is a dictionary of Medical Subject Headings and is a subset of the Unified Medical Language System (UMLS) of the National Library of Medicine);
* presence of the TW, of the word before, and of the word after in the name, location, hospital, and month dictionaries;
* the heading of the section in which the TW appears, e.g., "History of Present Illness";
* whether the TW contains "-" or "/" characters.
Note that some of these features, e.g., capitalization and punctuation within the TW, were also used in IdentiFinder.</Paragraph>
<Paragraph position="3"> We used the SVM implementation provided by LIBSVM (Chang and Lin, 2001) with a linear kernel to classify each word in the summaries as either PHI or non-PHI based on the above-listed features. We evaluated this system using ten-fold cross-validation. A simplified sketch of this feature-based setup appears below.</Paragraph>
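To make the setup concrete, here is a minimal, hypothetical sketch in Python. It implements only a handful of the listed features (no lemmatization, part-of-speech tags, bigrams, MeSH IDs, or section headings), and it substitutes scikit-learn's LinearSVC, also a linear-kernel SVM, for LIBSVM; the toy sentences, labels, and dictionary are invented.

```python
# Hypothetical sketch of a local-context SVM deidentifier: a few per-token
# features plus a linear-kernel SVM over them. All data here is invented.
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

NAME_DICT = {"smith", "jones"}  # toy stand-in for the name dictionary

def features(tokens: list[str], i: int) -> dict:
    """Local-context features for the target word tokens[i]."""
    tw = tokens[i]
    prev = tokens[i - 1] if i > 0 else "<s>"
    nxt = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
    return {
        "tw": tw.lower(), "prev": prev.lower(), "next": nxt.lower(),
        "tw_capitalized": tw[:1].isupper(),          # surface shape of TW
        "tw_length": len(tw),
        "tw_in_name_dict": tw.lower() in NAME_DICT,  # dictionary membership
        "tw_has_dash_or_slash": any(c in tw for c in "-/"),
    }

# Toy training data: (tokenized sentence, per-token PHI labels).
train = [
    ("seen by Dr. Smith today".split(), [0, 0, 0, 1, 0]),
    ("admitted on 03/12/99 stable".split(), [0, 0, 1, 0]),
]
X, y = [], []
for tokens, labels in train:
    X.extend(features(tokens, i) for i in range(len(tokens)))
    y.extend(labels)

vec = DictVectorizer()                       # one-hot strings, keep numerics
clf = LinearSVC().fit(vec.fit_transform(X), y)

test = "followed by Dr. Jones".split()
print(clf.predict(vec.transform([features(test, i) for i in range(len(test))])))
```

In the actual system, such vectors are computed for every word of every summary and evaluated with ten-fold cross-validation rather than on a fixed toy split.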
</Section>
</Paper>