<?xml version="1.0" standalone="yes"?> <Paper uid="W04-0914"> <Title>Semantic Forensics: An Application of Ontological Semantics to Information Assurance</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Applications of NLP to Information </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Assurance and Security </SectionTitle> <Paragraph position="0"> In the last 5 years, a CERIAS-based team led by a computer scientist and an NLP expert has steadily expanded its groundbreaking effort in improving, focusing, and strengthening information assurance and security by applying the NLP resources to them. The result has been a growing number of applications, some of them NL counterparts of pre-existing applications, others NL extensions and developments of known applications, and still others unique to NL IAS. In the most implemented one, NL watermarking (see Atallah et al. 2002), a sophisticated mathematical procedure, based on a secret large prime number, selects certain sentences in a text for watermark bearing and transforms their TMRs into bitstrings that contribute up to 4 bits per sentence to the watermark. The goal of the software is, of course, to embed a robust watermark in the hidden semantic meaning of NL text, represented as its TMR in tree structure. The NLP role is to &quot;torture&quot; the TMR tree of the sentence, whose contributing bits do not fit the watermark, so that they do. The tool for that is a number of minuscule TMR tree transformations, resulting in such surface changes as The coalition forces bombed Kabul The coalition forces bombed the capital of Afghanistan. The applications are summarized in table 1.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Human Deception Detection and Its NLP Modeling </SectionTitle> <Paragraph position="0"> Like all NLP systems, a Semantic Forensic (SF) NLP system models a human faculty. In this case, it is the human ability to detect deception (DD), i.e., to know when they are being lied to and to attempt to reconstruct the truth. The former ability is a highly desirable but, interestingly, not necessary precondition for DD (see an explanation below, in the Feasibility section). The latter functionality is the ultimate goal of SF NLP but, like all full automation in NLP, it may not be easily attainable.</Paragraph> <Paragraph position="1"> Humans detect lying by analyzing meaning of what they hear or read and compare that meaning to other parts of the same discourse, to their previously set expectations, and to their knowledge of the world. Perhaps the easiest lie to detect is a direct contradiction: If one hears first that John is in Barcelona today and then that he is not, one should suspect that one of the two statements is incorrect and to investigate--if one is interested, a crucial point. 
<Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Human Deception Detection and Its NLP Modeling </SectionTitle> <Paragraph position="0"> Like all NLP systems, a Semantic Forensic (SF) NLP system models a human faculty. In this case, it is the human ability to detect deception (DD), i.e., to know when one is being lied to, and to attempt to reconstruct the truth. The former ability is a highly desirable but, interestingly, not a necessary precondition for DD (see the explanation below, in the Feasibility section). The latter functionality is the ultimate goal of SF NLP but, like all full automation in NLP, it may not be easily attainable.</Paragraph> <Paragraph position="1"> Humans detect lying by analyzing the meaning of what they hear or read and comparing that meaning to other parts of the same discourse, to their previously set expectations, and to their knowledge of the world. Perhaps the easiest lie to detect is a direct contradiction: if one hears first that John is in Barcelona today and then that he is not, one should suspect that one of the two statements is incorrect and investigate--if one is interested, which is a crucial point. A harder type of deception to perceive is deception by omission: the first author was pushed into SF after reading a detailed profile of Howard Dean, then a leading contender for the Democratic nomination in the US 2004 presidential election, and noticing that the occupation of every single adult mentioned in the article was indicated, with the exception of the candidate's father, who had been a stockbroker.</Paragraph> <Paragraph position="2"> Glossing over, such as saying that one has not had much opportunity to talk to John lately, which may be technically true, while covering up a major falling-out with John, is yet more complicated.</Paragraph> <Paragraph position="3"> And perhaps topping the hierarchy is lying by telling the truth: when a loyal secretary tells the boss' jealous wife that her husband is not in the office because he is running errands downtown, she may well be telling the truth, though not the whole truth (but, realistically, can one ever tell the whole truth, and is it even a useful notion, especially given that language underdetermines reality (cf. Barwise and Perry 1983)?); what she wants to accomplish, however, is for the wife to infer, incorrectly, that this is all the boss is doing downtown. It is the latter, linguistically interesting type of deception that has been the initial focus of this work. A new TMR contradicting a previously processed one should lead to a fact repository flag, and this is where we are moving next.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Using the Fact Repository for </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Deception Detection </SectionTitle> <Paragraph position="0"> The fact repository (FR--see Nirenburg and Raskin 2004: 350-1), so far the least developed static resource in ONSE, records the remembered event instances. In principle, it should record all of them. Realistically, it records them selectively, to suit the needs of an implementation. Thus, in CRESP, a small QA system for queries about the 2000 Olympics in Sydney, the FR remembered all the nations, all the participants, all the competitive events, and all the results. An SF NLP system may start at the same level of prominence (and detect one's lie about having participated in the Games and/or achieved a better result), but, like almost all NLP systems with reasoning abilities, it will be only as powerful as its FR allows.</Paragraph> <Paragraph position="1"> A contradiction will be flagged when two conflicting TMRs (or TMR fragments) are discovered: for example, one just processed for John is/was/will be in Barcelona at noon on the 25th (of July 2004) and the other, in the FR, for John is/was/will be in Valencia at noon on the 25th. In the case of Papa Dean's occupation--apparently too shameful for the reporter to mention, even after he had divulged the Park Avenue childhood, the hereditary Republicanism, and the discriminatory country club, and even though there are still a few stockbrokers on this side of the bars--the FR will easily detect the omission by representing this information, very simplistically, as in figure 6.</Paragraph>
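<Paragraph position="2"> Reduced to its skeleton, the contradiction flag of the Barcelona/Valencia kind is a keyed lookup against the FR. The sketch below illustrates only that skeleton, under the loud simplification that a TMR fragment is flattened to a subject-property-time triple with a value; the classes and names are ours for illustration, not ONSE's.

from dataclasses import dataclass

@dataclass(frozen=True)
class Fact:
    subject: str    # e.g. an FR instance such as "human-John"
    prop: str       # e.g. "location"
    time: str       # e.g. "2004-07-25T12:00"
    value: str      # e.g. "city-Barcelona"

class FactRepository:
    def __init__(self):
        self._facts = {}  # (subject, prop, time) -> value

    def assimilate(self, fact: Fact):
        """Store a new TMR-derived fact; flag it if it contradicts a stored one."""
        key = (fact.subject, fact.prop, fact.time)
        old = self._facts.get(key)
        if old is not None and old != fact.value:
            # The DD flag: same subject, same property, same time, different value.
            return f"CONTRADICTION: {fact.subject}.{fact.prop} at {fact.time}: {old} vs {fact.value}"
        self._facts[key] = fact.value
        return None

fr = FactRepository()
fr.assimilate(Fact("human-John", "location", "2004-07-25T12:00", "city-Barcelona"))
flag = fr.assimilate(Fact("human-John", "location", "2004-07-25T12:00", "city-Valencia"))
</Paragraph>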
<Paragraph position="3"> To detect a gloss-over, it is not quite enough to receive a new TMR that contains an event involving a different interaction between these two individuals at the same time. The co-reference microtheory (see Nirenburg and Raskin 2004: 301-5) will have to be able to determine, or at least to suspect, that these events are indeed one and the same event rather than two consecutive or even parallel events. Even the time parameters are not trivial to equate, as in the case of I have not had much opportunity to talk to John lately and John insulted me last May. It would be trivial, of course, if the temporal adverbials were since that night at Maude's and that night at Maude's, respectively, but a human sleuth does not get such incredibly easy clues most of the time and has to operate on crude proximity and hypothesizing.</Paragraph> <Paragraph position="4"> Also helping him or her is a powerful inferencing engine, obviously a must for an NLP system of any reasonable complexity, reinforced by a microtheory of euphemisms, which must contain representative sets of event types that people lie about and of the fossilized, cliche-like ways of lying about them, as in How is this paper?--Well... it's... The reason we think that the loyal secretary's type of lying is harder to detect is not that it may involve more inferencing of a more complex kind--this is not necessarily so. It has to do with the notion of the whole truth: it is not realistic to expect a human, let alone an SF NLP system, to suspect every piece of information of being incomplete and to subject every single TMR to the 'and what else did he do downtown' type of query. But, in many cases, this is necessary to do, which brings up the useful distinction between general and targeted SF.</Paragraph> </Section> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Feasibility of Semantic Forensic </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Systems </SectionTitle> <Paragraph position="0"> A general SF (GSF) task is, basically, a fishing expedition. An SF NLP system may indeed expose obvious contradictions and many omissions. It is, however, a long and expensive process, and one that definitely overloads the system's FR. Inferring from every established contradiction or omission, while possibly valuable forensically, is an unaffordable luxury in this kind of task. It may, however, be a necessary evil: for instance, if an SF NLP system is to address a source that is known to be tainted, or if it is to be used to classify texts by the degree of their trustworthiness--quite a possible assignment.</Paragraph> <Paragraph position="1"> Humans do a degree of general SF under similar circumstances. But even in an exchange without a prior agenda, such as a conversation with a stranger under neutral, casual, indifferent circumstances, the SF/DD module may not be activated unless flagged by, again, a contradiction, an omission, etc. And such a flag will transform general SF into targeted SF (TSF). Now, TSF is what professional forensics does for a living, and there is no reason why entry-level SF NLP systems should not be all TSF.</Paragraph> <Paragraph position="2"> Even in the case of the Dean text, a TSF system (&quot;look for anything compromising in the candidate's background&quot;) will be able to detect the occupation omission much faster, as the sketch below illustrates. A TSF system is simpler and cheaper, and its use of the FR is much more reasonable and manageable: the FR can store only very selective, limited material.</Paragraph>
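<Paragraph position="3"> The sketch assumes, very simplistically, that FR person instances have been flattened to property dictionaries and that a TSF probe amounts to asking whether a property the text fills for most mentioned adults is missing for one of them; the names, values, and threshold are purely illustrative.

def omission_flags(instances: dict, prop: str, threshold: float = 0.7):
    """Flag instances lacking a property that the text fills for most others."""
    filled = [name for name, props in instances.items() if prop in props]
    if len(filled) / len(instances) >= threshold:
        return [name for name in instances if prop not in instances[name]]
    return []  # the property is not systematically given; nothing to flag

# Illustrative stand-in for the FR contents extracted from the Dean profile.
profile = {
    "candidate":        {"occupation": "governor"},
    "candidate-wife":   {"occupation": "physician"},
    "candidate-mother": {"occupation": "art-appraiser"},
    "candidate-father": {},  # the one adult whose occupation the article omits
}
suspicious = omission_flags(profile, "occupation")  # -> ["candidate-father"]
</Paragraph>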
<Paragraph position="4"> The flip side of a TSF system is the ease with which it can overlook highly related information that is just an inference away, so we have reason to suspect that a quality TSF NLP system is not that much simpler than, say, a limited-domain GSF system.</Paragraph> <Paragraph position="5"> What is important to realize is that some NLP systems with SF capabilities are within reach in ONSE, using the already available resources, possibly with some modifications, primarily if not entirely on the static side, and that is not much different from changing domains for a &quot;regular&quot; NLP system (see Raskin et al. 2002b).</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 7 Using Scripts of Complex Events for Deception Detection </SectionTitle> <Paragraph position="0"> A main tool for DD, and for TSF in particular, is the expansion of the ontology by acquiring scripts of complex events, already found necessary for other higher-end NLP tasks (see Raskin et al. 2003).</Paragraph> <Paragraph position="1"> There are strong connections among the elements of many texts. These have to do with the understanding that individual propositions may hold well-defined places in &quot;routine,&quot; &quot;typical&quot; sequences of events (often called complex events, scripts, or scenarios) that happen in the world, with a well-specified set of object-like entities that appear in different roles throughout the sequence. A script captures the entities of such an event and their temporal and causal sequences, as shown for the complex event BANKRUPTCY in figure 7.</Paragraph> <Paragraph position="2"> As a general tool in ONSE, the scripts that get instantiated from the text input provide expectations for processing further sentences in the text. Indeed, if a sentence in a text can be seen as instantiating a script in the nascent TMR, the analysis and disambiguation of subsequent sentences can be aided by the expectation that the propositions contained in them are instantiations of event types listed as components of the activated script.</Paragraph>
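<Paragraph position="3"> The sketch below illustrates scripts as expectation generators, with a script reduced to an ordered list of component event types and a text's propositions reduced to event-type labels; the BANKRUPTCY components listed are a plausible simplification for illustration, not the actual ONSE script.

# A script reduced to an ordered list of component event types (illustrative).
SCRIPTS = {
    "BANKRUPTCY": ["financial-distress", "sell-assets", "lay-off-employees",
                   "file-petition", "appoint-trustee", "liquidate-or-reorganize"],
}

class ScriptTracker:
    def __init__(self, scripts=SCRIPTS):
        self.scripts = scripts
        self.active = {}  # script name -> set of instantiated components

    def see(self, event_type: str):
        """Record a proposition; activate any script it is a component of."""
        for name, components in self.scripts.items():
            if event_type in components:
                self.active.setdefault(name, set()).add(event_type)

    def expectations(self, name: str):
        """Event types the activated script still leads the analyzer to expect."""
        seen = self.active.get(name, set())
        return [e for e in self.scripts[name] if e not in seen]

tracker = ScriptTracker()
tracker.see("sell-assets")  # activates BANKRUPTCY
upcoming = tracker.expectations("BANKRUPTCY")
# "file-petition", "appoint-trustee", etc. are now expected, which helps
# disambiguate subsequent sentences that may instantiate them.
</Paragraph>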
<Paragraph position="4"> In addition, the expectations that scripts provide play a crucial role for DD, namely in the detection of omission, in two complementary ways.</Paragraph> <Paragraph position="5"> The more obvious one is the need for an expectation of what information is to be found in a text in order to be able to infer gaps. A common attempt at deception in bankruptcy cases, for example, is the concealment of pre-bankruptcy conversions of property from creditors, which is a major factor considered by the courts in determining whether there was an intent to hinder, delay, or defraud in a bankruptcy. Thus, if a sale of assets by a company prior to its filing for bankruptcy is found in a text and there is no mention of how close to the filing this conversion took place, this needs to raise a flag that concealment may have taken place. This can be established because CONCEALMENT is defined as part of the script BANKRUPTCY, which is instantiated for the TMR of the text. If it can be established, from the text itself or from the FR, that the sale of the assets took place while the company was approaching the state of bankruptcy, the omission of the specific time of the sale from the report constitutes deception.</Paragraph> <Paragraph position="6"> Here, the script facilitates the targeting of SF (see the previous section) by mapping out where omissions in the text point to the omission of crucial information.</Paragraph> <Paragraph position="7"> The second mechanism by which scripts facilitate DD applies when a text mentions an event that occurs commonly or exclusively as a subevent of a script that is otherwise not mentioned. Here, the inference should be that the larger context of this subevent, captured by the script, is being concealed. If, for example, a company issues a report that mentions the layoff of some of its employees, this should lead to the inference that it is approaching the state of bankruptcy, for which layoffs are a possible subevent.</Paragraph> <Paragraph position="8"> Simplified to a few subevents, these two script-based DD mechanisms can be summarized as follows (cf. figure 8) and sketched in code below: 1. If a necessary element of a script is missing, it is likely to be intentionally omitted. 2. If an element that commonly occurs as part of a script is found in a text, but no other element of the script is--that is, if the script is underinstantiated--the script itself is likely to be intentionally omitted.</Paragraph>
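<Paragraph position="9"> Reduced to the same toy representation as above, the two rules can be sketched as follows; the split into components and required details, and the single-component test for underinstantiation, are our illustrative assumptions rather than the ONSE formalism.

# Toy script: its reportable components, plus, for rule 1, the detail that
# must accompany a reported component (e.g. the timing of an asset sale).
SCRIPT = {
    "name": "BANKRUPTCY",
    "components": {"financial-distress", "sell-assets", "lay-off-employees",
                   "file-petition", "appoint-trustee"},
    "required_details": {"sell-assets": "sale-timing"},
}

def dd_flags(script, reported):
    flags = []
    instantiated = reported.intersection(script["components"])
    # Rule 1: a necessary element (here, a required detail of a reported
    # component) is missing -- likely an intentional omission.
    for component, detail in script["required_details"].items():
        if component in instantiated and detail not in reported:
            flags.append(f"omission: {detail} of {component} not reported")
    # Rule 2: a lone common subevent with the rest of the script
    # underinstantiated -- the script itself may be concealed.
    if len(instantiated) == 1:
        flags.append(f"underinstantiation: {script['name']} itself may be concealed")
    return flags

print(dd_flags(SCRIPT, {"sell-assets"}))        # rules 1 and 2 both fire
print(dd_flags(SCRIPT, {"lay-off-employees"}))  # only rule 2 fires
</Paragraph> </Section> </Section> </Paper>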