<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0602">
  <Title>Words and Pictures in the News</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 The Collection
</SectionTitle>
    <Paragraph position="0"> We have been collecting news photos with associated captions since Feb. 2002. These photographs come from Reuters and the AP, and we collect between 800 and 1500 per day. To date, we have collected more than 410,000 unique images. The vast majority of these photos are of people, and the majority of that subset focuses on one or two individuals. Since November, we have also collected more than 75,000 illustrated news articles from the BBC and CNN. Both CNN and the BBC, along with the majority of all news outlets, make heavy use of news photographs from the AP and Reuters. In fact, the field is dominated by just three agencies: Reuters, the Associated Press and AFP.</Paragraph>
    <Paragraph position="1"> Journalistic writing guidelines emphasize clarity. This is certainly necessary in caption writing, where an author has an average of 55 words to convey all salient details about an image. A caption writer's first responsibility is to identify those portrayed. A caption will contain the full names of those represented if they are known. Second, the writer must describe the activity portrayed in the photo. Finally, with whatever space is left, the author may provide some broader context.</Paragraph>
    <Paragraph position="2"> Our collection's vocabulary reflects this ordering of priorities. First, there is a heavy bias towards proper names, and thus the number of unique terms in this set is very large (see Figure 1). The vocabulary of other word classes, however, reflects the journalistic emphasis on simplicity. People hug rather than embrace, talk rather than converse.</Paragraph>
    <Paragraph position="3"> Captions are also easy for humans to parse because writers make use of stylized sentence structures. Just over 50% of the captions in our collection, for instance, begin with a noun phrase that describes the central figure of the photograph, followed by a present tense verb that describes the action they are performing (or, in the passive voice, the action performed upon them). Textual cues such as "left", "right" or a jersey number help to resolve any correspondence problems between multiple names in the caption and people in the image.</Paragraph>
    <Paragraph position="4"> Figure 1: The preponderance of proper names creates a very large vocabulary of capitalized words whose frequency distribution has heavy tails. On the other hand, other word classes are somewhat restricted. In a six-month period from July to December, 2002, the corpus had 93,457 distinct terms. Of these, 60,182 were capitalized and 32,059 uncapitalized (with 1,216 numerical entries). About 1000 terms occur in both forms. Almost a third of all vocabulary terms occur only once. On the left we plot term occurrence ordered by frequency for the 1000 most frequent capitalized (solid) and uncapitalized (dotted) terms. On the right we analyze the heavy tail, flipping the axes to show the number of words that occur between one and twenty times. Here, capitalized (solid line) words outnumber uncapitalized (dotted line) ones 2 to 1. We have used capitalization as a proxy for proper names. This obviously misrepresents initial words of sentences, but these represent on average only 2 words out of every 50-60 word caption. The richness of our vocabulary is thus largely due to proper names.</Paragraph>
    <Paragraph position="5"> The news photos themselves, like their captions, are also quite stylized with respect to choice of subject, placement within the image, and photographic techniques. A single individual will generally be centered in the photo. Two people will be equally offset to the left and right of center. A basketball player will most likely have the ball in his hands. Each of these photographs will employ a narrow depth of field to blur the background and emphasize their subjects. Like the caption writer, the photo journalist must convey a great deal of information in a small amount of space. These conventions amount to a shared language between photographer and reader that places this single image in a far richer context. We present example captions and their associated images in figure 2.</Paragraph>
    <Paragraph position="6"> The textual and photographic conventions we have illustrated are simply trends we have noticed in our dataset, ones that many individual captions and images in our collection break. One of the benefits of scale, however, is that we can throw away a great deal of data and still have meaningful sets with which to work. These simpler sets in turn may act as foundations from which to attack the exceptions.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Linking Articles
</SectionTitle>
    <Paragraph position="0"> We have linked and clustered articles in a variety of ways.</Paragraph>
    <Paragraph position="1"> We have looked at how the specialized vocabulary and statistics of our dataset affect the performance of two different textual clustering methods. We have also examined how images might help to link together articles whose relationships purely text based clustering overlooks or undervalues. We have also used the clusterings built from our captions to provide a topic structure for navigating the news in general, and investigated various interfaces to make this navigation more natural.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Iconic Matching
</SectionTitle>
      <Paragraph position="0"> Sometimes the same photograph is used to illustrate different articles. This establishes links between articles that might be difficult to discern solely from the text. To establish these links, we built an iconic matcher to find copies of the same image in a database, even if that image has been moderately cropped or given a border.</Paragraph>
      <Paragraph position="1"> Images can change in many subtle ways between their distribution and their actual publication. They may be cropped, watermarked, and acquire compression artifacts. The news agency may overlay text or add borders.</Paragraph>
      <Paragraph position="2"> An iconic matcher needs to be robust to these changes.</Paragraph>
      <Paragraph position="3"> In addition, given a collection of this size, only minimal image processing is practical.</Paragraph>
      <Paragraph position="4"> The iconic matcher we have designed first resizes images to 128x128 pixels in order to accommodate different aspect ratios. It then applies a Haar wavelet transformation separately to each color channel of the image (Jacobs et al., 1995).</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <Paragraph position="0"> Caption 1: U.S. President George W. Bush (news - web sites) ponders a question at a news conference following his meeting with Czech President Vaclav Havel at Prague Castle, Wednesday, Nov. 20, 2002. Bush urged the NATO (news - web sites) allies to join a U.S.-led 'coalition of the willing' to ensure Iraq disarms. (AP Photo/CTK/Michal Dolezal) Nov 20 5:52 AM. Caption 2: Sacramento Monarchs' Ruthie Bolton, center, grabs a loose ball from Charlotte Sting's Erin Buescher, bottom, as Monarchs' Andrea Nagy moves in during the first half Friday, July 19, 2002, in Charlotte, N.C. (AP Photo/Rick Havner) Jul 19 8:24 PM. Figure 2: News captions follow many conventions. Individuals are identified with their full names whenever possible. The first sentence describes the actors and what they are doing. By looking for proper names immediately followed by present tense verbs, we can reliably find names that reference people actually in the image. This is important, as captions are also full of names (e.g. Vaclav Havel in caption 1) of people who do not actually appear. Often, the captions also help place the person in the frame (e.g. "center" in the second image).</Paragraph>
    <Paragraph position="1"> The first coefficient in each channel, then, represents the average image intensity; the second is the deviation from the mean for the right and left side; the third splits the left hand side similarly, and so on. Rotating through the channels, we use the first 64 largest coefficients as a signature for the image. We say a pair of coefficients matches if the two values are within a given threshold of each other. Images match if their first three coefficients (average color) all match, and if the number of additional matching coefficients is above a certain count. Identical images will of course receive a perfect score using this measure.</Paragraph>
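    <Paragraph> The coefficient ordering described above can be illustrated with a one-dimensional, unnormalized Haar decomposition. This is a minimal sketch rather than the matcher's actual code: the function name `haar_1d` is hypothetical, and the real matcher works on two-dimensional 128x128 color channels, which this example does not attempt.

```python
def haar_1d(values):
    """Unnormalized 1D Haar decomposition (illustrative; length must be a power of two)."""
    if not values:
        raise ValueError("input must not be empty")
    size = len(values)
    while size > 1:
        if size % 2:
            raise ValueError("input length must be a power of two")
        size //= 2

    current = [float(v) for v in values]
    details = []
    while len(current) > 1:
        pairs = range(0, len(current), 2)
        averages = [(current[i] + current[i + 1]) / 2 for i in pairs]
        diffs = [(current[i] - current[i + 1]) / 2 for i in pairs]
        # Coarser detail coefficients are placed in front of finer ones.
        details = diffs + details
        current = averages
    # current[0] is the overall average; details run from coarse to fine.
    return current + details
```

For the input [9, 7, 3, 5] the result is [6.0, 2.0, 1.0, -1.0]: the overall average, then half the difference between the left and right halves, then the finer splits of the left half and of the right half.</Paragraph>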
    <Paragraph position="2"> To tune the parameters of the matcher, we found sample photos to which CNN or the BBC had added borders and the corresponding borderless originals from AP and Reuters. We then examined how these changes affected the coefficients, and we set the matching thresholds to respond positively even with these changes. This led to fairly broad ranges, ±5 for the average color coefficients and ±3 for the rest (on a 0-255 scale). However, we also insisted that at least 42 coefficients actually match.</Paragraph>
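    <Paragraph> The matching rule and its tuned parameters can be sketched as follows. This is an illustration under stated assumptions, not the implementation used for the experiments: the function names are hypothetical, a signature is assumed to be a flat list of 64 coefficient values whose first three entries are the per-channel averages, and the minimum of 42 is assumed to count only the coefficients after those three, which the text does not specify.

```python
AVG_TOLERANCE = 5     # tolerance for the three average-color coefficients (from the text)
OTHER_TOLERANCE = 3   # tolerance for the remaining coefficients (from the text)
MIN_MATCHES = 42      # minimum number of matching coefficients (from the text)


def coefficients_match(a, b, tolerance):
    """Return True when two coefficient values differ by at most `tolerance`."""
    return tolerance >= abs(a - b)


def signatures_match(sig_a, sig_b):
    """Compare two signatures (assumed layout: 3 averages, then the other coefficients)."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    # All three average-color coefficients must agree.
    for a, b in zip(sig_a[:3], sig_b[:3]):
        if not coefficients_match(a, b, AVG_TOLERANCE):
            return False
    # Count agreement among the remaining coefficients.
    matches = sum(
        1
        for a, b in zip(sig_a[3:], sig_b[3:])
        if coefficients_match(a, b, OTHER_TOLERANCE)
    )
    return matches >= MIN_MATCHES
```

Requiring the averages to agree first gives a cheap early rejection, which matters when every image must be compared against a large database.</Paragraph>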
    <Paragraph position="3"> A visual scan of the sets returned by our iconic matcher shows very few false positives. The matcher frequently accommodates borders or croppings that change up to 10% of the pixels in the original image. Its response to these changes, however, is somewhat unpredictable. Two crops of similar appearance can have very different effects upon the Haar parameters if one pushes many features into different quadrants of the image than the other.</Paragraph>
    <Paragraph position="4"> As Table 1 indicates, a significant percentage (10%) of those articles we have collected from the BBC and CNN share an image with another article. This phenomenon seems to stem from three main sources. Across collections, different authors may use the same image to illustrate similar stories. Authors may also use the same image over time to provide temporal continuity as the underlying story changes. Finally, the same image may be used to indicate a broad theme, while the articles themselves discuss quite different topics. A series of articles on an oil spill off the coast of Spain, for example, moved from simply reporting the incident, to investigating the captain, to discussing the environmental impact, always using an illustration of an oil-drenched cormorant.</Paragraph>
    <Paragraph position="5"> Table 1: A shared picture is a very good indication that two articles are in some way topically related. Our iconic matcher establishes those links. On the left, the table gives the total number of articles we collected in a three-month period (Oct-Dec, 2002) from the BBC and CNN collection, and on the right, how often the same image was used for multiple stories. We have split these iconic matches out into inter-agency and intra-agency totals.</Paragraph>
    <Paragraph position="6"> More than 10% of the articles in our collection share an image.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Text Clustering
</SectionTitle>
      <Paragraph position="0"> Our second tool for grouping articles comes from automated clustering of the AP and Reuters caption texts. We implemented two different clustering algorithms. Each generates a probability distribution over K topics for each document. Documents are then assigned to the maximum likelihood cluster.</Paragraph>
      <Paragraph position="1"> We fit two models to the data. The first was a simple unigram mixture model. This model posits that caption words are generated by sampling from a multinomial distribution over the corpus vocabulary V. Topics are separate distributions over V. For a given number of topics, we have $|K| \times |V|$ parameters, where K is the set of topics. The probability of a caption c in this model is</Paragraph>
      <Paragraph position="2"> $$p(c) = \sum_{k \in K} p(k) \prod_{w \in c} p(w \mid k) \qquad (1)$$</Paragraph>
      <Paragraph position="3"> We can fit this model using EM, with the assignments of captions to topics as our hidden variables. In our implementation, we initialize $p(k)$ and $p(w \mid k)$ with random distributions.</Paragraph>
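      <Paragraph> The EM fit of the unigram mixture can be sketched in a few lines of plain Python. This is a minimal illustration, not the code used for the experiments: the function name `fit_unigram_mixture`, the fixed iteration count, and the small additive `smoothing` term (added here only to keep every probability strictly positive) are assumptions of this example.

```python
import math
import random


def fit_unigram_mixture(captions, num_topics, iterations=20, seed=0, smoothing=1e-3):
    """Fit p(k) and p(w|k) by EM; `captions` is a list of lists of word strings."""
    if not captions:
        raise ValueError("need at least one caption")
    rng = random.Random(seed)
    vocab = sorted({w for caption in captions for w in caption})

    def random_distribution(keys):
        weights = {key: rng.random() + 0.1 for key in keys}
        total = sum(weights.values())
        return {key: value / total for key, value in weights.items()}

    # Random initialization, as described in the text.
    prior = random_distribution(range(num_topics))                       # p(k)
    word_given_topic = [random_distribution(vocab) for _ in range(num_topics)]  # p(w|k)
    posteriors = []

    for _ in range(iterations):
        # E-step: posterior over topics for each caption, computed in log space.
        posteriors = []
        for caption in captions:
            log_joint = [
                math.log(prior[k])
                + sum(math.log(word_given_topic[k][w]) for w in caption)
                for k in range(num_topics)
            ]
            peak = max(log_joint)
            unnormalized = [math.exp(value - peak) for value in log_joint]
            total = sum(unnormalized)
            posteriors.append([value / total for value in unnormalized])

        # M-step: re-estimate p(k) and p(w|k) from expected counts.
        for k in range(num_topics):
            expected = sum(post[k] for post in posteriors)
            prior[k] = (expected + smoothing) / (len(captions) + num_topics * smoothing)
            counts = {w: smoothing for w in vocab}
            for caption, post in zip(captions, posteriors):
                for w in caption:
                    counts[w] += post[k]
            total = sum(counts.values())
            word_given_topic[k] = {w: c / total for w, c in counts.items()}

    return prior, word_given_topic, posteriors
```

Each caption can then be assigned to the topic with the largest posterior, which corresponds to the maximum likelihood cluster assignment described earlier.</Paragraph>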
      <Paragraph position="4"> Effectively, we treat each caption as an unordered set of keywords. Although this is a simple language model, we expected it to fit the captions fairly well. Given their extremely short length, we believed captions to have a far higher percentage of topically important (and thus discriminative) words than one would find in a more generic article collection. In longer documents researchers have investigated modelling documents as mixtures of topics (Blei et al., 2001), but we believed captions truly were narrowly focused around a single topic.</Paragraph>
      <Paragraph position="5"> Still, our original vocabulary contained over 90,000 terms. To fit this model, we trimmed the tail end of the vocabulary. We applied two heuristics. First, we removed all words that occurred fewer than 200 times. This seems a drastic reduction, but as figure 1 illustrates, the tail of our vocabulary is primarily proper names. Moreover, we collect more than 1000 captions a day. A single word, especially a name, could therefore easily occur 200 times in a single day. Since we are interested in topics of larger temporal extent than a day, this reduction seems at least somewhat justified. Second, we removed all words of three characters or fewer. During fitting, we normalized the document word count vectors to a constant length.</Paragraph>
      <Paragraph position="6"> Unfortunately, the model was still overwhelmed by common words, and the maximum likelihood configuration was invariably driven to make every topic equally likely and every topic distribution almost exactly the same. We were therefore forced to additionally remove very common words, namely the 2000 most frequent words from the web at large. This heuristic almost certainly removes some words strongly associated with specific topics (e.g. "ball" with sports).</Paragraph>
      <Paragraph position="7"> Still, the remaining middle frequency words do a good job of separating captions into qualitatively good topics. Our contention is that this middle frequency is closely aligned with the true statistics of the entire corpus.</Paragraph>
      <Paragraph position="8"> Our second algorithm, which we call a "two-level" mixture model, attempts to deal with very common words in a more principled way. The top level is a single multinomial distribution shared by all captions. The second level has K topic distributions, equivalent to the simple unigram model. For each caption word, we sample a Bernoulli random variable to decide whether to draw from the top-level or second-level distribution. This model shunts common shared words into the top-level "junk" distribution, leaving the topic distributions to reflect truly distinctive words. The probability of a document in this model is given by Equation (2).</Paragraph>
      <Paragraph position="10"> We used EM once again to estimate these parameters. Regardless of starting position or $|K|$, the Bernoulli parameter consistently converges to just over .5 (.53 with std .004).</Paragraph>
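      <Paragraph> One plausible reading of the two-level model is $p(d) = \sum_{k} p(k) \prod_{w \in d} [\lambda\, p_0(w) + (1 - \lambda)\, p(w \mid k)]$, where $p_0$ is the shared top-level distribution and $\lambda$ is the Bernoulli parameter. Both symbols, and the choice that $\lambda$ is the probability of drawing from the top level, are assumptions of this sketch; the text does not say which outcome the reported value of .53 refers to. Under that reading, the log likelihood of a single caption could be computed as follows (all names are hypothetical).

```python
import math


def two_level_log_likelihood(caption, prior, topic_dists, shared_dist, shared_weight):
    """log p(caption) under the assumed two-level mixture.

    prior         -- list of topic probabilities p(k)
    topic_dists   -- list of dicts, one p(w|k) per topic
    shared_dist   -- dict for the shared top-level ("junk") distribution
    shared_weight -- assumed probability of drawing a word from the top level
    """
    log_terms = []
    for k, p_k in enumerate(prior):
        log_p = math.log(p_k)
        for w in caption:
            mixed = (shared_weight * shared_dist[w]
                     + (1.0 - shared_weight) * topic_dists[k][w])
            log_p += math.log(mixed)
        log_terms.append(log_p)
    # Log-sum-exp over topics for numerical stability.
    peak = max(log_terms)
    return peak + math.log(sum(math.exp(t - peak) for t in log_terms))
```

Because every word probability is a blend with the shared distribution, a word that is rare in a given topic no longer drives that topic's likelihood toward zero, which is the effect the "junk" distribution is meant to have.</Paragraph>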
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Quantitative Analysis
</SectionTitle>
      <Paragraph position="0"> We fit each of these models 10 times for each $K = 10, 20, \ldots, 100$, where K is the number of topics. To compare the quality of fit between different values of K, we held out 10% of the captions and compared the negative log likelihood of the held-out data for each run. The model that assigns the highest probability to the test set will have the lowest negative log likelihood. In figure 3, we plot the average negative log likelihood for each of these runs.</Paragraph>
      <Paragraph position="1"> Across all K, EM converged in just a few iterations.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.4 Evaluation and Interfaces
</SectionTitle>
      <Paragraph position="0"> Quantitatively our models appear to fit well, but an analysis of their usefulness for interacting with these collections is more difficult. Our collection has no canonical topical structure against which we can compare our results. However, we have built various tools to aid in the qualitative evaluation of our results.</Paragraph>
      <Paragraph position="1"> The clustering algorithms discussed in the previous section take no account of the time at which an article or caption was published. News topics, however, certainly have temporal structure. Some, like baseball, happen during a specific season. Others, like an election day, are events that are heavily discussed for a brief period of time and then fade. Our first interface illustrates these temporal relationships. We plot each topic over time. Each timestep is approximately a day. Each element in the plot of a cluster represents the percentage of captions in that cluster collected during each time period. We have constructed a web interface that lays out all K topics as columns in a matrix, moving through time from top to bottom. One may click on any individual element to bring up a list of captions specific to that period and cluster, as well as the word distribution that characterizes this topic (Figure 4).</Paragraph>
      <Paragraph> Figure 3: The "two-level" model extended the simple unigram model with a global distribution such that words could choose to come either from the topic distribution associated with that document or from the global one. This figure plots the average and minimum negative log likelihood attained for values of $K = 10, 20, 30, \ldots, 100$. The simple unigram model uses a vocabulary filtered of common words, 2000 terms smaller than that used by the "two-level" model. This accounts for the higher log likelihood numbers on the right. When fit with the full vocabulary, the simple unigram model invariably creates K identical topics, losing all topic structure to the noise generated from common words.</Paragraph>
      <Paragraph position="2"> The example in Figure 4 utilizes the simple unigram model. One striking aspect of this view is that topics clearly do appear to have time signatures. Some are periodic, others ramp up, others are extremely peaked.</Paragraph>
      <Paragraph position="3"> We are contemplating methods to integrate these temporal features into our clustering methods.</Paragraph>
      <Paragraph position="4"> Our second interface attempts to lay out topics on the plane in such a way that distances between them are semantically meaningful. First, we use a symmetrized KL divergence to define a distance metric between topic distribution pairs. We define a symmetric KL divergence between two topic distributions $t_i$ and $t_j$ as</Paragraph>
      <Paragraph position="6"> where V is the corpus vocabulary.</Paragraph>
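      <Paragraph> A common way to symmetrize the KL divergence is to sum the two directed divergences over the vocabulary; the sketch below uses that form. Whether the definition used here includes an additional factor of one half is not recoverable from the text, so the unscaled sum is an assumption, as are the function names. Topic distributions are represented as dictionaries from words to probabilities.

```python
import math


def kl_divergence(p, q, vocab):
    """Directed KL divergence sum_w p(w) log(p(w) / q(w)) over `vocab`.

    Assumes q(w) is positive wherever p(w) is positive.
    """
    return sum(p[w] * math.log(p[w] / q[w]) for w in vocab if p[w] > 0)


def symmetric_kl(t_i, t_j, vocab):
    """Symmetrized KL divergence: the sum of both directed divergences."""
    return kl_divergence(t_i, t_j, vocab) + kl_divergence(t_j, t_i, vocab)
```

The pairwise values form a symmetric matrix with a zero diagonal, which is the kind of input a distance-preserving projection into the plane expects.</Paragraph>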
      <Paragraph position="7"> We then use Principal Coordinate Analysis to project into the plane in a way that in various senses "best" preserves distances. We finally calculate the likelihood for all captions in a given cluster and illustrate the topic with the image associated with the maximum likelihood caption. In our interface (Figure 5), one may click on any token to bring up a list of all images and articles associated with this topic. In this example, we have actually used the topic structure generated with captions to organize the BBC and CNN combined dataset. The clusters in this figure were built using the simple unigram clustering model.</Paragraph>
      <Paragraph position="8"> One aspect of Principal Coordinate Analysis that is undesirable in our case is that it tends to emphasize accurately representing large distances at the expense of small ones. In topic space, distances between topics only seem to have semantic meaning up to some threshold. Beyond this threshold they are simply unrelated. We are actively working on implementing a modified version of Principal Coordinate Analysis that will lend more weight to smaller distances for this interface.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 A Celebrity Face Detector
</SectionTitle>
    <Paragraph position="0"> Our second area of investigation with this dataset is automated methods for establishing correspondences between textual words or phrases and image regions. To this end, we are investigating the creation of an automated celebrity classifier from the AP and Reuters photographs.</Paragraph>
    <Paragraph position="1"> Brown et al. (P. Brown and Mercer, 1993) have effectively used co-occurrence statistics to build bilingual translation models. Duygulu, Barnard and Forsyth (Duygulu et al., 2002) have effectively used these machine translation algorithms to establish correspondences between keywords and types of image regions in the Corel image collection.</Paragraph>
    <Paragraph position="2"> Unlike Corel, our photos are primarily of people, and so it seemed natural to focus first on proper names as opposed to more general noun classes. Proper nouns are relatively easy to extract from our captions. Caption writers are quite consistent in how they record individuals' names and titles. Simply selecting strings of capitalized words, with just a few heuristics to accommodate the beginnings of sentences, middle initials, etc., performs well. Common names like the President's will be written in multiple ways, but even in these cases a single form is overwhelmingly predominant. As for the images, face detectors are a relatively mature piece of vision technology and we can reliably extract a large set of face cutouts.</Paragraph>
    <Paragraph position="3"> Figure 4: In this figure we see a clustering of 50 topics laid out temporally. The captions are clustered without respect to time using EM and the simple unigram language model. Each column represents a single topic, and each row is approximately a one-day time slice, starting July 19, 2002 at the top and ending on Dec 8, 2002 at the bottom. The brightness of the entry reflects the percentage of all captions in this topic that occurred during this time slice. To illustrate, we have labeled certain portions of the figure with the real-world topics or events with which certain topics and time periods seem most associated. Topics appear to have time signatures. Some, like football or championship baseball, are periodic. Others, like election day, slowly build to a peak and then rapidly fade. Other, unexpected events, such as the arrest of the D.C. area snipers and the Moscow hostage situation, ramp up suddenly and then slowly fade over time. We are investigating adding this temporal information into our clustering models.</Paragraph>
    <Paragraph position="4"> We could not directly apply the co-occurrence methods used in previous work. First, our captions are full of proper nouns (institutions, locations, and other people) that have no visual counterpart in the image. Duygulu et al. faced a similar problem for certain keywords in the Corel dataset, but to a far smaller degree. They could treat it as a noise problem. Here, it overwhelms the actual signal.</Paragraph>
    <Paragraph position="5"> The image side compounds our problem. Duygulu et al. were able to cluster image regions into equivalence classes based on a set of extracted features. We have no similarity metric for faces. Typically, one induces a metric by fitting a parametric model to the data and exploring differences in parameter space. Parametric models perform poorly on faces, however, due to the variability of facial expressions.</Paragraph>
    <Paragraph position="6"> If one were to manually label our collection with names and locations of the individuals portrayed in each photo, we believe we could generate hundreds or thousands of unique face images for many individuals. Given this supervised dataset, it might be possible to generate a non-parametric model of faces, fitting a set of models for each expression. Manually annotating half a million photographs is impractical. Instead we have leveraged the special linguistic structure of our captions to create a "photographic entity" detector: a proper name finder optimized to return only the names of people who actually appear in the photograph.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Who Is In The Picture?
</SectionTitle>
      <Paragraph position="0"> We are confronted with the problem of trying to identify those proper names that actually identify people in the photo. A small amount of linguistic analysis has been extremely helpful. We tried various proper name detectors, but it proved difficult to return only people, let alone people who actually appear in the images. Once again, however, we can exploit journalistic conventions. With few exceptions, the first sentence of a caption describes the activity and participants in the picture itself. In 51% of our captions, the beginning of the first sentence follows the pattern [Noun Phrase][Present Tense Action Verb], or less commonly [Noun Phrase][Present Tense Passive Verb] (see Figure 2). The detector we have built identifies potential proper names as strings of two or more capitalized words followed by a potential present tense verb. Words are classified as possible verbs by first applying a list of morphological rules to possible present tense singular forms, and then comparing these to a database of known verbs. Both the morphological rules and the list of verbs are from WordNet (Wor, 2003). When there is more than one person in the photo, the author often inserts a term directly following the proper name to help disambiguate the correspondence. This will either be a position such as left or right, or an identifying characteristic. This second form is most frequently used with sports figures and gives a jersey number. Our proper name finder returns the name, the disambiguator if it exists, and the verb.</Paragraph>
      <Paragraph> Figure 5: In this interface we have defined a two dimensional distance metric between topics and laid them out on the plane, illustrated with representative photos from the collection. The right hand side illustrates a closeup of the upper right corner of this space, an area dominated by sports related topics. In our interface one may click on any token to bring up a list of additional images and articles associated with this topic. In this example, the topics have been generated using captions from the AP and Reuters photos, but we are actually using this topic structure to navigate articles from CNN and the BBC. We contend that clustering with these captions generates topic distributions that focus on words highly relevant to the topic. Topics are defined as probability distributions across the corpus vocabulary. Our 2D embedding is derived by calculating the symmetrized KL divergence between each pair of topics and using principal coordinate analysis to project into two dimensions in a manner that "best" preserves the distances in the original high dimensional space (Ripley, 1996).</Paragraph>
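      <Paragraph> The name detector described above can be approximated with a regular expression and a verb lookup. The sketch below is a simplified illustration only: `KNOWN_VERBS` is a tiny hypothetical stand-in for the WordNet verb list, the suffix-stripping rules in `base_forms` are a crude substitute for WordNet's morphological rules, and the set of disambiguators is an assumption of this example.

```python
import re

# Tiny hypothetical stand-in for the WordNet verb list.
KNOWN_VERBS = {"ponder", "grab", "speak", "wave", "walk", "celebrate"}

NAME_PATTERN = re.compile(
    r"^((?:[A-Z][\w.'-]*\s+)+[A-Z][\w.'-]*)"    # two or more capitalized words
    r"(?:,\s*(left|right|center|No\. \d+),)?"    # optional disambiguator
    r"\s+(\w+)"                                  # candidate verb
)


def base_forms(word):
    """Candidate base forms for a third-person singular present tense verb."""
    forms = []
    if word.endswith("ies"):
        forms.append(word[:-3] + "y")
    if word.endswith("es"):
        forms.append(word[:-2])
    if word.endswith("s"):
        forms.append(word[:-1])
    return forms


def find_pictured_name(caption):
    """Return (name, disambiguator, verb), or None to reject the caption."""
    match = NAME_PATTERN.match(caption)
    if match is None:
        return None
    name, disambiguator, verb = match.groups()
    if not any(form in KNOWN_VERBS for form in base_forms(verb)):
        return None
    return name, disambiguator, verb
```

Note that this simple pattern keeps titles and team names as part of the returned string (for example "Sacramento Monarchs' Ruthie Bolton"), so a practical version would need further rules to isolate the personal name.</Paragraph>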
      <Paragraph position="1"> Our classifier either accepts a caption and returns a proposed name or rejects the caption. We tested it on the same sample we used for clustering (146,870 captions).</Paragraph>
      <Paragraph position="2"> The name finder accepts 47% of these captions. We manually examined 400 rejected captions. Of these, 50% were true misses, where the caption contained a name that matched a face in the image. Another 35% were images of people but the caption either contained no proper names or only proper names of people that did not appear in the image. The final 15% were images that contained no people.</Paragraph>
      <Paragraph position="3"> We also examined 1000 accepted captions. In 85% of this sample, the classifier accurately extracted the name of at least one person in the image. Of the errors in the remaining 15%, the vast majority still followed our pattern of [Noun Phrase][Present Tense Verb], and the subject of the noun phrase actually did appear in the picture. Our rules simply failed to accurately parse the noun phrase. Over half of these mistakes, for instance, were due to the phrase "[Proper Name] of the United States [Verb]", for which our classifier returned "United States" instead of the correct individual. More robust proper name finders should largely eliminate these sources of error. The more important point is that captions are so carefully structured that very simple rules accurately model most of the collection. We could conceivably even use our face classifier to learn some of these structural rules. If we could use the images to posit names, and even positions, we might be able to use this as a handle for learning certain types of caption structure. If a photo were to have two strong face responses, for instance, we might learn to look for those "left" and "right" indicators in the text. We would also like to investigate the possible clustering of verbs from image data.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Iterative Improvements
</SectionTitle>
      <Paragraph position="0"> With aggressive pruning of questionable outputs from the name and face finders, we believe we can generate an effectively supervised dataset of faces for thousands of individuals, many of whom will have hundreds or even thousands of distinct face images. From these we hope to build a non-parametric model of faces. The next interesting task will be to investigate whether we can then return to the textual side, using our face models to learn more about linguistic structures of the captions. Bootstrapping by alternating between the two sides of a mixed dataset seems a very powerful model.</Paragraph>
      <Paragraph position="1"> Figure 6: Not every person named in a caption appears in a picture. However, quite simple syntactic analysis of captions yields some names of persons whose faces are very likely to appear in the picture. By looking for items where the picture has a single, large face detector response (so there is only one face present) and analysis of the caption produces a single name, we produce a directory of news personalities that is quite accurate. The top row shows some entries from this directory: there are relatively few instances of each face, because our tests are quite restrictive, but the faces corresponding to each name are correct and are seen from a variety of angles, meaning we may be able to build a non-parametric face model for some individuals by a completely automatic analysis. The next three rows show some possible failure modes of our approach. First, our analysis of the caption could yield more than one name. Second, there may be more than one large face in the image, with only the wrong face producing a face detector response. Third, the caption may occasionally follow a syntactic pattern our algorithm does not handle. We are able to extract proper names from 68,496 of 146,870 captions, with an estimated 85% of these actually naming a person in the image. Restricting ourselves solely to large face responses, we are able to produce a gazetteer of 452 distinct names (621 images total), containing only 60 incorrectly filed images.</Paragraph>
    </Section>
  </Section>
</Paper>