<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-2801">
<Title>Text Linkage in the Wiki Medium - A Comparative Study Alexander Mehler Department of Computational Linguistics &amp; Text Technology Bielefeld University</Title>
<Section position="4" start_page="1" end_page="2" type="metho">
<SectionTitle> 3 A Webgenre Structure Model </SectionTitle>
<Paragraph position="0"> Linguistic structures vary with the functions of the discourses in which they are manifested (Biber, 1995; Karlgren and Cutting, 1994). In analogy to the weak contextual hypothesis (Miller and Charles, 1991) one might state that structural differences reflect functional ones as far as they are confirmed by a significantly high number of textual units and thus are identifiable as recurrent patterns. In this sense, we expect web documents to be distinguishable by the functional structures they manifest. More specifically, we agree with the notion of webgenre (Yoshioka and Herman, 2000) according to which the functional structure of web documents is determined by their membership in genres (e.g. of conference websites, personal home pages or electronic encyclopedias).</Paragraph>
<Paragraph position="1"> Our hypothesis is that what is common to instances of different webgenres is the existence of an implicit logical document structure (LDS) - in analogy to textual units whose LDS is described in terms of section, paragraph and sentence categories (Power et al., 2003). In the case of web documents we hypothesize that their LDS comprises four levels: * Document networks consist of documents which serve possibly heterogeneous functions, if necessary independently of each other. A web document network is given, for example, by the system of websites of a university.</Paragraph>
<Paragraph position="2"> * Web documents manifest - typically in the form of websites - pragmatically closed acts of web-based communication (e.g. conference organization or online presentation).
Each web document is seen to organize a system of dependent subfunctions which in turn are manifested by modules.</Paragraph>
<Paragraph position="3"> * Document modules are, ideally, functionally homogeneous subunits of web documents which manifest single, but dependent subfunctions in the sense that their realization is bound to the realization of other subfunctions manifested by the same encompassing document. Examples of such subfunctions are call for papers, program presentation or conference venue organization as subfunctions of the function of web-based conference organization. * Finally, elementary building blocks (e.g. lists, tables, sections) only occur as dependent parts of document modules.</Paragraph>
<Paragraph position="4"> This enumeration does not imply a one-to-one mapping between functionally demarcated manifested units (e.g. modules) and manifesting (layout) units (e.g. web pages). Obviously, the same functional variety (e.g. of a personal academic home page) which is mapped by a website of dozens of interlinked pages may also be manifested by a single page. The many-to-many relation induced by this and related examples is described in more detail in Mehler &amp; Gleim (2005).</Paragraph>
<Paragraph position="5"> The central hypothesis of this paper is that genre-specific structure formation also concerns document networks. That is, we expect them to vary with respect to structural characteristics according to the varying functions they meet. Thus, we do not expect that different types of document networks (e.g. systems of genre-specific websites vs. wiki-based networks vs. online citation networks) manifest homogeneous characteristics, but significant variations thereof. As we concentrate on coefficients which were originally introduced in the context of small world analyses, we expect, more concretely, that different network types vary according to their fitting to or deviation from the small world model. As we analyze only a couple of networks, this observation is bound to the corpus of networks considered in this study.
It nevertheless hints at how to rethink network analysis in the context of newly emerging network types such as, for example, Wikipedia.</Paragraph>
<Paragraph position="6"> In order to support this argumentation, the following section presents a model for representing and extracting document networks. After that, the SW characteristics of these networks are computed and discussed.</Paragraph>
</Section>
<Section position="5" start_page="2" end_page="6" type="metho">
<SectionTitle> 4 Network Modeling and Analysis </SectionTitle>
<Paragraph position="0"/>
<Section position="1" start_page="2" end_page="4" type="sub_section">
<SectionTitle> 4.1 Graph Modeling </SectionTitle>
<Paragraph position="0"> In order to analyse the characteristics of document networks, a format for uniformly representing their structure is needed. In this section, we present generalized trees for this task. Generalized trees are graphs with a kernel tree-like structure - henceforth called kernel hierarchy - superimposed by graph-forming edges as models of hyperlinks.</Paragraph>
<Paragraph position="1"> Figure (1) illustrates this graph model. It distinguishes three levels of structure formation: 1. According to the webgenre model of section (3), L1-graphs map document networks and thus corpora of interlinked (web) documents.</Paragraph>
<Paragraph position="2"> In section (4.3), four sources of such networks are explored: wiki document networks, citation networks, webgenre corpora and, for comparison with a more traditional medium, networks of newspaper articles. [...] modules.</Paragraph>
<Paragraph position="3"> In the case of webgenre corpora, L3-graphs map the DOM3-based structure of the web pages of the websites involved.
In the case of all other networks distinguished above they represent the logical structure of single text units (e.g. the section and paragraph structuring of a lexicon, newspaper or scientific article). Note that the tree-like structure of a document module may be superimposed by hyperlinks, too, as illustrated in figure (1) by the vertices m and n.</Paragraph>
<Paragraph position="4"> 3 I.e. Document Object Model.</Paragraph>
<Paragraph position="5"> The kernel hierarchy of an L2-graph is constituted by kernel links which are distinguished from across, up, down and outside links (Amitay et al., 2003; Eiron and McCurley, 2003; Mehler and [...] kernel hierarchy with nodes of other documents. Kernel hierarchies are exemplified by a conference website headed by a title and menu page referring to, for example, the corresponding call for papers which in turn leads to pages on the different conference sessions etc. so that finally a hierarchical structure evolves. In this example the kernel hierarchy evidently reflects navigational constraints. That is, the position of a page in the tree reflects the probability to be navigated by a reader starting from the root page and following kernel links only.</Paragraph>
<Paragraph position="6"> The kernel hierarchy of a wiki document is spanned by an article page in conjunction with the corresponding discussion (or talk), history and edit this or view source pages which altogether form a flatly structured tree. Likewise in the case of citation networks as the CiteSeer system (Lawrence et al., 1999), a document consists of the various (e.g. PDF or PS) versions of the focal article as well as of one or more web pages manifesting its citations by means of hyperlinks. From the point of view of document network analysis, L2-graphs and interlinks (see fig. 1) are most relevant as they span the corresponding network mediated by documents (e.g.
websites) and modules (e.g. web pages). This allows specifying which links of which type in which network are examined in the present study: * In the case of citation networks, citation links are modeled as interlinks as they relate (scientific) articles encapsulated by documents of this network type. Citation networks are explored by example of the CiteSeer system: We analyze a sample of more than 550,000 articles (see table 1) - the basic population covers up to 800,000 documents.</Paragraph>
<Paragraph position="7"> * In the case of newspaper article networks, content-based links are explored as resources of networking. This is done by example of the 1997 volume of the German newspaper Süddeutsche Zeitung (see table 1). That is, firstly, nodes are given by articles where two nodes are interlinked if the corresponding articles contain see also links to each other. In the online and ePaper issue of this newspaper these links are manifested as hyperlinks. Secondly, articles are linked if they appear on the same page of the same issue so that they belong to the same thematic field. By means of these criteria, a bipartite network (Watts, 2003) is built in which the top-mode is spanned by topic and page units, whereas the bottom-mode consists of text units. In such a network, two texts are interlinked whenever they relate to at least one common topic or appear on the same page of the same issue.</Paragraph>
<Paragraph position="8"> * In the case of webgenres we explore a corpus of 1,096 conference websites (see table [...] on the level of L2-graphs. This is done in order to get a base line for our comparative study, since WWW-based networks are well known for their small world behavior. More specifically, this relates to estimations of the exponent g of power laws fitted to their degree distributions (Newman, 2003). * These three networks are explored in order to comparatively study networking in Wikipedia which is analyzed by example of its German release de.wikipedia.org (see table 1).
Because of the rich system of its node and link types (see section 4.2) we explore three variants thereof. Further, in order to get a more reliable picture of wiki-based structure formation, we also analyze wikis in the area of technical documentation. This is done by example of three wikis on open source projects of the Apache Software Foundation (cf. wiki.apache.org). In the following section, the extraction of Wikipedia-based networks is explained in more detail.</Paragraph>
</Section>
<Section position="2" start_page="4" end_page="5" type="sub_section">
<SectionTitle> 4.2 Graph Extraction - the Case of Wiki-based Document Networks </SectionTitle>
<Paragraph position="0"> In the following section we analyze the network spanned by document modules of the German Wikipedia and their interlinks.5 This cannot simply be done by extracting all its article pages. The reason is that Wikipedia documents consist [...] quencies) as found in the wiki or additionally introduced into the study in order to organize the type system into a hierarchy. One heuristic for extracting instances of node types relates to the URL of the corresponding page. Category, portal and mediawiki pages, for example, contain the prefix Kategorie, Portal and MediaWiki, respectively, separated by a colon from its page name suffix (as in http://de.wikipedia.org/wiki/Kategorie:Musik).</Paragraph>
<Paragraph position="1"/>
<Paragraph position="2"> Analogously, table (4) lists the edge types either found within the wiki or additionally introduced into the study. Of special interest are redirect nodes and links which manifest transitive and, thus, mediate links of content-based units. An article node v may be linked, for example, with a redirect node r which in turn redirects to an article w.
In this case, the document network contains two edges (v,r), (r,w) which have to be resolved to a single edge (v,w) if redirects are to be excluded in accordance with what the MediaWiki system does when processing them.</Paragraph>
<Paragraph position="3"> Based on these considerations, we compute network characteristics of three extractions of the German Wikipedia (see table 1): Variant I consists of a graph whose vertex set contains all Article nodes and whose edge set is based on Interlinks and appropriately resolved Redirect links. Variant II enlarges variant I by including other content-related wiki units, i.e. ArticleTalk, Portal, PortalTalk, and Disambiguation pages (multiply typed nodes were excluded). Variant III consists of a graph whose vertex and edge sets cover all vertices and edges found in the extraction.</Paragraph>
</Section>
<Section position="3" start_page="5" end_page="6" type="sub_section">
<SectionTitle> 4.3 Network Analysis </SectionTitle>
<Paragraph position="0"> Based on the input networks described in the previous section we compute the SW coefficients described in section (2). Average geodesic distances are computed by means of the Dijkstra algorithm based on samples of 1,000 vertices of the input networks (or the whole vertex set if it is of smaller cardinality). Power law fittings were computed based on the model $P(x) = a x^{-g} + b$. Note that table (1) does not list the cardinalities of multisets of edges and, thus, does not count multiple edges connecting the same pair of vertices within the corresponding input network - therefore, the numbers in table (1) do not necessarily conform to the counts of link types in table (4). Note further that we compute, as usually done in SW analyses, characteristics of undirected graphs. In the case of wiki-based networks, this is justified by the possibility to process back links in MediaWiki systems.
In the case of the CiteSeer system this is justified by the fact that it always displays citation and cited by links.</Paragraph>
<Paragraph position="1"> Finally, in the case of the newspaper article network, this is due to the fact that it is based on a bipartite graph (see above). Note that the indogram corpus consists of predominantly unrelated websites and thus does not allow computing cluster and distance coefficients.</Paragraph>
</Section>
</Section>
<Section position="6" start_page="6" end_page="6" type="metho">
<SectionTitle> 5 Discussion </SectionTitle>
<Paragraph position="0"> The numerical results in table (5) are remarkable as they allow identifying three types of networks: * On the one hand, we observe the extreme case of the Süddeutsche Zeitung, that is, of the newspaper article network. It is the only network which, at the same time, has very high cluster values, short geodesic distances and a high degree of assortative mixing. Thus, its values support the assertion that it behaves as a small world in the sense of the model of Watts &amp; Strogatz. The only exception is the remarkably low g value, where, according to the model of Barabási &amp; Albert (1999), a higher value was expected.</Paragraph>
</Section>
<Section position="7" start_page="6" end_page="6" type="metho">
<SectionTitle></SectionTitle>
<Paragraph position="0"> * On the other hand, the CiteSeer sample is the reverse case: It has very low values of C1 and C2, tends to show neither assortative nor disassortative mixing, and at the same time has a low g value. The small cluster values can be explained by the low probability with which two authors cited by a focal article are related by a citation relation on their own.6 [...] also tend to show stochastic mixing and short geodesic distances. The cluster values are confirmed by the wikis of technical documentation (also w.r.t. their numerical order).
Thus, these wikis tend to be small worlds according to the model of Watts &amp; Strogatz, but also show disassortative mixing - comparable to technical networks but in departure from social networks. Consequently, they are ranked in-between the citation and the newspaper article network. All these networks show rather short geodesic distances. Thus, l seems to be inappropriate with respect to distinguishing them in terms of SW characteristics. Further, all these examples show remarkably low values of the g coefficient. In contrast to this, power laws as fitted in the analyses reported by Newman (2003) tend to have much higher exponents - Newman reports on values which range between 1.4 and 3.0. This result is only realized by the indogram corpus of conference websites, thus, by a sample of WWW documents whose out degree distribution is fitted by a power law with exponent g = 2.562.</Paragraph>
<Paragraph position="1"> These findings support the view that compared to WWW-based networks wiki systems behave more like &quot;traditional&quot; networks of textual units, but are new in the sense that their topology neither approximates the one of citation networks nor of content-based networks of newspaper articles.</Paragraph>
<Paragraph position="2"> In other words: As intertextual relations are genre sensitive (e.g. citations in scientific communication vs. content-based relations in press communication vs. hyperlinks in online encyclopedias), networks based on such relations seem to inherit this genre sensitivity. That is, for varying genres (e.g. of scientific, technical or press communication) differences in topological characteristics of their instance networks are expected. The study presents results in support of this view of the genre sensitivity of text-based networks.</Paragraph>
</Section>
</Paper>