<?xml version="1.0" standalone="yes"?> <Paper uid="H89-2024"> <Title>Text on Tap: the ACL/DCI</Title> <Section position="1" start_page="0" end_page="183" type="abstr"> <SectionTitle> AT&T Bell Laboratories Introduction </SectionTitle> <Paragraph position="0"> There has been a recent upsurge of interest in computational studies of large bodies of text. The aim of such studies varies widely, from lexicography and studies of language change to automatic indexing methods and statistical models for improving the performance of speech recognition systems and optical character readers. In general, corpus-based studies are critical for the development of adequate models of linguistic structure and for insights into the nature of language use. However, research workers have been severely hampered by the lack of appropriate materials, and specifically by the lack of a large enough body of text on which published results can be replicated or extended by others.</Paragraph> <Paragraph position="1"> Recognizing this problem, the Association for Computational Linguistics has established the ACL Data Collection Initiative (ACL/DCI). It provides the aegis of a not-for-profit scientific society to oversee the acquisition and preparation of a large text corpus to be made available for scientific research at cost and without royalties. All materials submitted for inclusion in the collection will remain the exclusive property of the copyright holders (if any) for all other purposes. Each applicant for data from the ACL/DCI will be required to sign an agreement not to redistribute the data or make any direct commercial use; however, commercial application of &quot;analytical materials&quot; derived from the text, such as statistical tables or grammar rules, is explicitly permitted. 
There may be special restrictions on some materials, but only if the restrictions do not compromise the central objective of providing general long-term access for research.</Paragraph> <Paragraph position="2"> The material in the ACL/DCI text corpus will be coded in a standard form based on SGML, the Standard Generalized Markup Language. Over time, we hope to be able to incorporate annotations reflecting consensually approved linguistic features like part of speech and various aspects of syntactic and perhaps semantic structure. Both the coding and the annotations will be coordinated with the work of the Text Encoding Initiative (TEI), a project to develop standards for coding and tagging a broad range of different classes of texts to facilitate data interchange and further both research and the language industries. The TEI is jointly sponsored by the ACL, the Association for Computers and the Humanities, and the Association for Literary and Linguistic Computing.</Paragraph> <Paragraph position="3"> Although our initial efforts will concentrate on the collection of American English, we are interacting with groups in other countries with respect to British English and other European languages, and we hope to extend the effort to other language families as well.</Paragraph> <Paragraph position="4"> History and Current Membership The ACL/DCI Committee was established in February of 1989. Its current members are Robert Amsler (Bellcore), Bran Boguraev (IBM T.J. Watson Research Center and Cambridge University), Ken Church (AT&T Bell Laboratories), Ed Fox (Virginia Polytechnic Institute & State University), Jim Gallagher (U.S. Department of Justice), Carole Hafner (Northeastern University), Judy Klavans (IBM T.J.</Paragraph> <Paragraph position="5"> Watson Research Center), Mark Liberman (AT&T Bell Laboratories), Mitch Marcus (University of Pennsylvania), Paul Martin (SRI International & MCC), Bob Mercer (IBM T.J. 
Watson Research Center), Jan Pedersen (Xerox PARC), Paul Roossin (IBM T.J. Watson Research Center), Don Walker (Bellcore), Susan Warwick (ISSCO), and Antonio Zampolli (University of Pisa). Liberman is chairing the committee.</Paragraph> <Paragraph position="6"> So far, no funding has been obtained or applied for, other than what is implicit in the pro bono efforts of the committee members. Most business has been transacted by email or telephone, thus avoiding travel expenses, and various small out-of-pocket expenses (such as the cost of tapes) have been donated by individual members. However, the problems of acquisition, maintenance and distribution of such a large body of material are beginning to outgrow the bounds of an all-volunteer effort.</Paragraph> <Paragraph position="7"> Current Status of Collection Efforts During the eight months since the committee was formed, we have obtained several hundred million words of diverse text. Our current holdings are listed in Appendix A. Perhaps a quarter of this material has been (at least roughly) translated into the SGML-derived format in which it will be distributed. We have been given the right to distribute (the 1979 edition of) the Collins English Dictionary in electronic form.</Paragraph> <Paragraph position="8"> Now that we have a fairly large quantity of text, we are beginning to turn our attention to issues of balance in style, genre and topic: what Don Walker calls the &quot;ecology of language.&quot; Although what we have is quite diverse, there is obviously a lot left out -- we have no business letters, for instance; no repair manuals; no tabloid newspapers; no movie scripts; no poetry.</Paragraph> <Paragraph position="9"> Two areas where we plan to concentrate some effort next year are spoken materials, and non-English or multi-lingual text.</Paragraph> <Paragraph position="10"> We plan to release a &quot;sampler&quot; tape of about 30 million words quite soon. 
This is as much as will fit on one 12-inch reel of 9-track tape, using easily available (Lempel-Ziv) compression techniques. Of course, we will include the source code to the compression/uncompression programs.</Paragraph> <Paragraph position="11"> CD-ROM is probably the most appropriate format for distribution of the full database, although 8-mm digital tape also has some fans. We also would like to establish a clearing house for distributing appropriate results of research based on the collection; one promising example is the prospective &quot;tree bank&quot; project at the University of Pennsylvania, which Mitch Marcus is describing at this workshop. Finally, we hope to distribute some simple programs for indexing, word concordancing, manipulating the SGML format, and so forth.</Paragraph> <Paragraph position="12"> Text Acquisition and Clean-up: Motivations for A Common Effort In addition to the positive value of having a standard, generally available collection of text (and eventually speech), it's nice to avoid duplicating the often-painful process of obtaining and cleaning text materials. Obtaining the text in the first place may require quite a bit of negotiating, once the right person to negotiate with is found. Even if permission is easy to get, actually getting the tapes made and sent may require some pestering and pleading. Once the material arrives, it can sometimes be quite a bit of work to make it usable: the tapes are typically in an undocumented and somewhat complex format, and the text itself may be encoded in an undocumented (and usually proprietary) typesetting language. Also, the correspondence between the logical structure of the text and its typographical structure is often approximate and errorful, so that some intelligence is required if the logical structure is to be recovered. 
In Appendix B, I've documented one example where decrypting the donated material was a non-trivial task, namely the Penta dump format of the Library of America volumes. Luckily, most cases are much more straightforward than this; but even in more ordinary examples, the donated text must usually be massaged quite a bit (by programs, of course!).</Paragraph> <Paragraph position="13"> What is SGML? SGML stands for Standard Generalized Markup Language. As in the famous joke about the Holy Roman Empire, one may question whether SGML is truly standard, generalized, etc. It is, at least, designated as an international standard by ISO: specifically, ISO 8879. It's designed to allow structural information to be added to a document by embedding user-defined sequences of text characters within the text stream.</Paragraph> <Paragraph position="14"> &quot;Markup&quot; is the term used to describe codes added to electronically prepared text to define the structure of the text or the format in which it is to appear. &quot;Generalized markup&quot; rises above the details of font definition, type size, exact page layout, etc., to specify more structural concepts such as &quot;heading,&quot; &quot;footnote,&quot; &quot;emphasis,&quot; etc. SGML also makes it possible to define the characters used in a document. Overall, SGML provides not so much a system for markup as a framework for defining such systems: it is almost infinitely flexible, with the benefits and drawbacks that this entails. The standard itself is far too complex and abstract to permit a brief description -- even its syntax is almost arbitrarily redefinable. 
It descends towards mortal ken in the form of a reference concrete syntax defined in ISO 8879, and one version has acquired the beginnings of a semantics through the publication of proposed tag sets and document type declarations (DTD's) by the Association of American Publishers.</Paragraph> <Paragraph position="15"> In the examples appended to this paper, the relevant aspects of SGML are mainly its default method for encoding labeled brackets, and its default set of representations for characters outside the ascii (actually ISO 646) set. The start of a unit of type foo is coded as <foo>, and its end (in full form -- we will avoid the thorny issue of shortrefs) is </foo>. This obviously makes it easy to find foos in a text, at least for a computer, and a simple computer program can transform such a representation into other formats that are easier for humans to read, if desired, as exemplified in samples 3 and 4. Character sequences of the form &quot;&X;&quot; (where X is some ascii string not containing ';') are used to encode non-ascii characters. Thus 'e' with an acute accent is &quot;&eacute;&quot;: there are plenty of such examples in the French part of sample 8.</Paragraph> <Paragraph position="16"> At a minimum, SGML provides a useful interchange format; each researcher can easily write programs to transform such materials into his or her preferred local form. In association with the TEI, we hope to provide a consistent, complete, and well documented tag set, extending the AAP set; this has certainly not yet been done, however.</Paragraph> <Paragraph position="17"> Appendix B: Empirical structure of Penta tape format First level of record structure: The last 4 of every 514 bytes are two copies of the number of the tape file, in binary: 0 0 0 0 for the first file, 0 1 0 1 for the second file, etc. They must be removed as they cross-cut everything that follows (text, headers, padding, whatever...). 
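The first-level clean-up just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the whole dump is assumed to have been read into memory as a byte string, and the function name strip_file_number_trailers and the two constants are hypothetical names chosen here, not part of any original program.

```python
RECORD_SIZE = 514   # bytes per physical record, as stated in the text
TRAILER_SIZE = 4    # trailing bytes holding two copies of the tape file number


def strip_file_number_trailers(raw):
    """Drop the last 4 bytes of every 514-byte record and join the rest.

    Hypothetical helper: `raw` is assumed to be the complete dump as bytes.
    """
    payload_size = RECORD_SIZE - TRAILER_SIZE  # 510 bytes of real data
    out = bytearray()
    for start in range(0, len(raw), RECORD_SIZE):
        record = raw[start:start + RECORD_SIZE]
        # Keep at most the first 510 bytes; a short final record is kept
        # as far as it goes (an assumption, the text does not cover it).
        out += record[:payload_size]
    return bytes(out)
```

The result is the byte sequence on which the later levels of analysis operate; the file-number bytes are simply discarded here rather than checked.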
Second level of record structure: The remaining byte sequence is analyzed as a sequence of chunks like this: FF 0 4 : this begins a null-terminated label or filename; it is usually 14 characters long, but not always.</Paragraph> <Paragraph position="18"> FB : this introduces 7 bytes of obscure signification; there is almost always one of these following a null-terminated label.</Paragraph> <Paragraph position="20"> variable and of obscure signification. There are 0 or more of these following each filename/FB sequence.</Paragraph> <Paragraph position="21"> FC : following this there are nulls to EOF.</Paragraph> <Paragraph position="22"> Each labelled chunk is a chapter, a title, a table of contents, etc. The chapters are not in their normal order in the dump tape; in fact different books may well be intermingled promiscuously; so it is necessary to break out each labelled chunk into a separate file.</Paragraph> <Paragraph position="23"> Third level of record structure: Within each labelled chunk (i.e. file), ignore the first 478 bytes. The remainder is divided into records of either 512 or 1024 bytes, depending on whether byte 40 of the record is 0 or 1. In either case, discard the first 46 bytes of the record. If the record is of length 512, byte 41 encodes the useful length of the record as (466 - (239 - X)*2), where X is the value of byte 41. In other words, for each decrement of byte 41 from 239, ignore two additional bytes at the end of the record. If the record is of length 1024, then there is a first subrecord of 46+500 bytes which is always valid, and then a second subrecord of 12+466 bytes. 
The 12-byte secondary header should be discarded; the amount of valid data in the following 466 bytes is determined by byte 7 in the second subrecord, according to the same formula used above.</Paragraph> <Paragraph position="24"> Note that the end of each 1024- or 512-byte unit is padded, not with nulls, but with a repetition of the corresponding portion of the previous full record.</Paragraph> <Paragraph position="25"> Last level of record structure.</Paragraph> <Paragraph position="26"> Within the useful portion of the file, as defined by the above procedure, discard 16 bytes every time FF FC occurs. These little cookies seem to function as counters.</Paragraph> <Paragraph position="27"> The file ends completely when FF FD occurs -- stuff following this seems to be random garbage repeating some earlier material.</Paragraph> <Paragraph position="28"> (Optionally) discard bytes valued 0, 0201, 0202, 0240: they seem to be redundant demarcators of typographical codes.</Paragraph> <Paragraph position="29"> Results: merely an ordinary unpleasant typographer's tape...</Paragraph> <Paragraph position="31"> erary Classics of the United States \[in\[ell2\] \[qc\[ap\[ef\[j800\] \[j200 \]II\[ot0\] \[j18\] \[qc\[jll9\]S\[cm\[cf5\]aturday morning\[cfl\] was come, and all the summer world was bright and fresh, and brimming with lif e. There was a song in every heart; and if the heart was young th e music issued at the lips. There was cheer in every face and a s pring in every step. The locust trees were in bloom and the fra~g rance of the blossoms filled the air. Cardiff Hill, beyond the vi llage and above it, was green with vegetation, and it lay just fa r enough away to seem a Delectable Land, dreamy, repose~ful and i nviting. \[epTom appeared on the sidewalk with a bucket of whitewas h and a long-handled brush. He surveyed the fence, and all gladne ss left him and a deep melancholy settled down upon his spirit. T hirty yards of board fence, nine feet high. 
Life to him seemed ho llow, and existence but a burden. Sighing, he dipped his brush an d passed it along the topmost plank; re~peated the operation; did it again; compared the insignificant whitewashed streak with the far-reaching continent of un-\[fjwhitewashed fence, and sat down on a tree-box discouraged. Jim \[cj22,23,24\]came skipping out at t he gate with a tin pail, and sing~ing ~'Buffalo Gals.'' Bringing water from the town pump had always been hateful work in Tom's ey es, before, but now it did not strike him so. He remembered that there was company at the pump. White, mulatto and negro boys and girls were always there waiting their turns, resting, trading pla ythings, quarreling, fighting, skylarking. And he remembered that although the pump \[nbwas only a hundred and fifty yards off, Jim never got back with a bucket of water under an hour=+=+=m=+=+and even then somebody generally had to go after him. Tom said: \[ep'' Say, Jim, I'll fetch the water if you'll whitewash some.'' \[ep <s>There was a song in every heart; and if the heart was young the music issued at the lips.</s> <s>There was cheer in every face and a spring in every step.</s> <s>The locust trees were in bloom and the fragrance of the blossoms filled the air.</s> <s>Cardiff Hill, beyond the village and above it, was green with vegetation, and it lay just far enough away to seem a Delectable Land, dreamy, reposeful and inviting.</s></p> <p><s>Tom appeared on the sidewalk with a bucket of whitewash and a long-handled brush.</s> <s>He surveyed the fence, and all gladness left him and a deep melancholy settled down upon his spirit.</s> <s>Thirty yards of board fence, nine feet high.</s> <s>Life to him seemed hollow, and existence but a burden.</s> <s>Sighing, he dipped his brush and passed it along the topmost plank; repeated the operation; did it again; compared the insignificant whitewashed streak with the far-reaching continent of unwhitewashed fence, and sat down on a tree-box 
discouraged.</s> <s>Jim came skipping out at the gate with a tin pail, and singing <q>Buffalo Gals.</q></s> <s>Bringing water from the town pump had always been hateful work in Tom's eyes, before, but now it did not strike him so.</s> <s>He remembered that there was company at the pump.</s> <s>White, mulatto and negro boys and girls were always there waiting their turns, resting, trading playthings, quarreling, fighting, skylarking.</s> <s>And he remembered that although the pump was only a hundred and fifty yards off, Jim never got back with a bucket of water under an hour--and even then somebody generally had to go after him.</s> <s>Tom said:</s></p> <p><s><q>Say, Jim, I'll fetch the water if you'll whitewash some.</q></s></p> Sample 3: USDA Fact Sheets, SGML Version <doc> <h>Silverfish and Earwigs</h> <p><s>Silverfish and earwigs are completely unrelated insects; however, they are two pests which are frequently found in houses.</s></p> <p><s>The silverfish is an insect which grows continually throughout its life and feeds primarily on the glue used in book bindings, cardboard boxes and the like.</s> <s>So, the places you're likely to have problems with silverfish are around books and bookshelves.</s> <s>Silverfish are capable of destroying books and other valuable papers so if you find them in your house, you'll need to control them.</s> <s>Any of the so-called household insecticides are capable of killing silverfish.</s></p> <p><s>The earwig is a different problem.</s> <s>It's usually a pest in newer subdivisions where houses have been built in what was a wooded area.</s> <s>Actually, it's not a case of the earwigs moving in on people, but people moving in on the earwigs.</s></p> <p><s>Earwigs don't harm people.</s> <s>They don't bite, sting, or pinch.</s> <s>But, they do look bad and they have an unpleasant odor when you step on them.</s> <s>Of course, they can be a nuisance when they get inside the house.</s> <s>They feed exclusively on insects and related 
animals.</s></p> <p><s>Your first line of defense against earwigs should be physically keeping the insects out of your house by blocking areas around doors, windows and other places where they might enter the house.</s></p> <p><s>If you see that you need to resort to chemicals, spray in and around doorways, around patio areas, and indoors along baseboards.</s></p> advantages breast-feeding offers both mother and baby.</s> <s>Because the nutrients in breast milk are easily digested and ideally suited to a baby's needs, breast milk alone can provide every nutrient a baby needs for the first six months of life.</s> <s>Usually, no vitamin or mineral supplements are needed, but sometimes doctors may prescribe vitamin D and iron for some babies, and they may urge mothers to give their breast-fed babies fluoride supplements.</s></p> <p><s>Besides getting the nutrients they need, nursing babies also get other health benefits.</s> <s>Nursing infants have fewer gastronomical illnesses, such as vomiting and diarrhea, and fewer respiratory illnesses, such as colds and infections.</s> <s>Breast-fed infants also have fewer allergic reactions than do babies fed with bottles.</s></p> <p><s>Another advantage of breast-feeding is that nursing mothers don't have to buy, prepare and sterilize bottles and formula.</s> <s>Breast-feeding also helps new mothers return to their pre-pregnancy weights because of the extra calories needed for milk production.</s></p> <p><s>Nursing mothers also report that the time spent breast-feeding their babies is a special time for mother and baby.</s> <s>This closeness is another advantage of breast-feeding.</s></p> <p><s>Breast-feeding gives babies a good start on life.</s> <s>It can be a wonderful experience for mother and baby.</s></p> </doc> Sample 5: USDA Fact Sheets, Formatted Version.</Paragraph> <Paragraph position="32"> (produced by program from SGML version via troff) Silverfish and Earwigs Silverfish and earwigs are completely unrelated 
insects; however, they are two pests which are frequently found in houses.</Paragraph> <Paragraph position="33"> The silverfish is an insect which grows continually throughout its life and feeds primarily on the glue used in book bindings, cardboard boxes and the like. So, the places you're likely to have problems with silverfish are around books and bookshelves. Silverfish are capable of destroying books and other valuable papers so if you find them in your house, you'll need to control them. Any of the so-called household insecticides are capable of killing silverfish.</Paragraph> <Paragraph position="34"> The earwig is a different problem. It's usually a pest in newer subdivisions where houses have been built in what was a wooded area. Actually, it's not a case of the earwigs moving in on people, but people moving in on the earwigs.</Paragraph> <Paragraph position="35"> Earwigs don't harm people. They don't bite, sting, or pinch. But, they do look bad and they have an unpleasant odor when you step on them. Of course, they can be a nuisance when they get inside the house. They feed exclusively on insects and related animals.</Paragraph> <Paragraph position="36"> Your first line of defense against earwigs should be physically keeping the insects out of your house by blocking areas around doors, windows and other places where they might enter the house. If you see that you need to resort to chemicals, spray in and around doorways, around patio areas, and indoors along baseboards.</Paragraph> <Paragraph position="37"> Advantages of Breast-feeding Your Baby The American Pediatric Society recommends breast-feeding for infants because of the advantages breast-feeding offers both mother and baby. Because the nutrients in breast milk are easily digested and ideally suited to a baby's needs, breast milk alone can provide every nutrient a baby needs for the first six months of life. 
Usually, no vitamin or mineral supplements are needed, but sometimes doctors may prescribe vitamin D and iron for some babies, and they may urge mothers to give their breast-fed babies fluoride supplements.</Paragraph> <Paragraph position="38"> Besides getting the nutrients they need, nursing babies also get other health benefits. Nursing infants have fewer gastronomical illnesses, such as vomiting and diarrhea, and fewer respiratory illnesses, such as colds and infections. Breast-fed infants also have fewer allergic reactions than do babies fed with bottles.</Paragraph> <Paragraph position="39"> Another advantage of breast-feeding is that nursing mothers don't have to buy, prepare and sterilize bottles and formula. Breast-feeding also helps new mothers return to their pre-pregnancy weights because of the extra calories needed for milk production.</Paragraph> <Paragraph position="40"> Nursing mothers also report that the time spent breast-feeding their babies is a special time for mother and baby. This closeness is another advantage of breast-feeding.</Paragraph> <Paragraph position="41"> Breast-feeding gives babies a good start on life. It can be a wonderful experience for mother and baby.</Paragraph> <Paragraph position="42"> today, as Hon. Members are no doubt aware, we are celebrating the anniversary of the proclamation of the Canadian Charter of Rights and Freedoms which took place on April 17, 1982, and also of the coming into effect a year ago of the provisions guaranteeing equality for all members of our society.</s></p> <p><s>It is a day on which Hon. Members will come together to commemorate a commitment to equality, social justice, tolerance and fairness for all Canadians in keeping with basic standards of human rights and fundamental freedoms.</s></p> odeg.</Paragraph> <Paragraph position="43"> </timestamp id=canp~_860417.E> <timestamp id=canpa~860417.F></Paragraph> </Section> </Paper>