Using a Corpus of Sentence Orderings Defined by Many Experts to Evaluate Metrics of Coherence for Text Structuring

Nikiforos Karamanis
Computational Linguistics Research Group
University of Wolverhampton, UK
N.Karamanis@wlv.ac.uk

Chris Mellish
Department of Computing Science
University of Aberdeen, UK
cmellish@csd.abdn.ac.uk

Abstract

This paper addresses two previously unresolved issues in the automatic evaluation of Text Structuring (TS) in Natural Language Generation (NLG). First, we describe how to verify the generality of an existing collection of sentence orderings defined by one domain expert using data provided by additional experts. Second, a general evaluation methodology is outlined which investigates the previously unaddressed possibility that there may exist many optimal solutions for TS in the employed domain. This methodology is implemented in a set of experiments which identify the most promising candidate for TS among several metrics of coherence previously suggested in the literature.¹

¹ Chapter 9 of [Karamanis, 2003] reports the study in more detail.

1 Introduction

Research in NLG focused on problems related to TS from very early on, [McKeown, 1985] being a classic example. Nowadays, TS continues to be an extremely fruitful field of diverse active research. In this paper, we assume the so-called search-based approach to TS [Karamanis et al., 2004] which employs a metric of coherence to select a text structure among various alternatives. The TS module is hypothesised to simply order a preselected set of information-bearing items such as sentences [Barzilay et al., 2002; Lapata, 2003; Barzilay and Lee, 2004] or database facts [Dimitromanolaki and Androutsopoulos, 2003; Karamanis et al., 2004].

Empirical work on the evaluation of TS has become increasingly automatic and corpus-based. As pointed out by [Karamanis, 2003; Barzilay and Lee, 2004] inter alia, using corpora for automatic evaluation is motivated by the fact that employing human informants in extended psycholinguistic experiments is often simply unfeasible. By contrast, large-scale automatic corpus-based experimentation takes place much more easily.

[Lapata, 2003] was the first to present an experimental setting which employs the distance between two orderings to estimate automatically how close a sentence ordering produced by her probabilistic TS model stands in comparison to orderings provided by several human judges.

[Dimitromanolaki and Androutsopoulos, 2003] derived sets of facts from the database of MPIRO, an NLG system that generates short descriptions of museum artefacts [Isard et al., 2003]. Each set consists of 6 facts each of which corresponds to a sentence as shown in Figure 1. The facts in each set were manually assigned an order to reflect what a domain expert, i.e. an archaeologist trained in museum labelling, considered to be the most natural ordering of the corresponding sentences. Patterns of ordering facts were automatically learned from the corpus created by the expert. Then, a classification-based TS approach was implemented and evaluated in comparison to the expert’s orderings.

Database fact → Sentence
subclass(ex1, amph) → This exhibit is an amphora.
painted-by(ex1, p-Kleo) → This exhibit was decorated by the Painter of Kleofrades.
painter-story(p-Kleo, en4049) → The Painter of Kleofrades used to decorate big vases.
exhibit-depicts(ex1, en914) → This exhibit depicts a warrior performing splachnoscopy before leaving for the battle.
current-location(ex1, wag-mus) → This exhibit is currently displayed in the Martin von Wagner Museum.
museum-country(wag-mus, ger) → The Martin von Wagner Museum is in Germany.

Figure 1: MPIRO database facts corresponding to sentences

A subset of the corpus created by the expert in the previous study (to whom we will henceforth refer as E0) is employed by [Karamanis et al., 2004] who attempt to distinguish between many metrics of coherence with respect to their usefulness for TS in the same domain. Each human ordering of facts in the corpus is scored by each of these metrics which are then penalised proportionally to the amount of alternative orderings of the same material that are found to score equally to or better than the human ordering. The few metrics which manage to outperform two simple baselines in their overall performance across the corpus emerge as the most suitable candidates for TS in the investigated domain. This methodology is very similar to the way [Barzilay and Lee, 2004] evaluate their probabilistic TS model in comparison to the approach of [Lapata, 2003].

Because the data used in the studies of [Dimitromanolaki and Androutsopoulos, 2003] and [Karamanis et al., 2004] are based on the insights of just one expert, an obvious unresolved question is whether they reflect general strategies for ordering facts in the domain of interest. This paper addresses this issue by enhancing the dataset used in the two studies with orderings provided by three additional experts. These orderings are then compared with the orders of E0 using the methodology of [Lapata, 2003]. Since E0 is found to share a lot of common ground with two of her colleagues in the ordering task, her reliability is verified, while a fourth “stand-alone” expert who uses strategies not shared by any other expert is identified as well.

As in [Lapata, 2003], the same dependent variable which allows us to estimate how different the orders of E0 are from the orders of her colleagues is used to evaluate some of the metrics which perform best in [Karamanis et al., 2004]. As explained in the next section, in this way we investigate the previously unaddressed possibility that there may exist many optimal solutions for TS in our domain. The results of this additional evaluation experiment are presented and emphasis is laid on their relation with the previous findings.

Overall, this paper addresses two general issues: a) how to verify the generality of a dataset defined by one expert using sentence orderings provided by other experts and b) how to employ these data for the automatic evaluation of a TS approach. Given that the methodology discussed in this paper does not rely on the employed metrics of coherence or the assumed TS approach, our work can be of interest to any NLG researcher facing these questions.

The next section discusses how the methodology implemented in this study complements the methods of [Karamanis et al., 2004]. After briefly introducing the employed metrics of coherence, we describe the data collected for our experiments. Then, we present the employed dependent variable and formulate our predictions. In the results section, we state which of these predictions were verified. The paper is concluded with a discussion of the main findings.

2 An additional evaluation test

As [Barzilay et al., 2002] report, different humans often order sentences in distinct ways. Thus, there might exist more than one equally good solution for TS, a view shared by almost all TS researchers, but which has not been accounted for in the evaluation methodologies of [Karamanis et al., 2004] and [Barzilay and Lee, 2004].²

² A more detailed discussion of existing corpus-based methods for evaluating TS appears in [Karamanis and Mellish, 2005].

Collecting sentence orderings defined by many experts in our domain enables us to investigate the possibility that there might exist many good solutions for TS. Then, the measure of [Lapata, 2003], which estimates how close two orderings stand, can be employed not only to verify the reliability of E0 but also to compare the orderings preferred by the assumed TS approach with the orderings of the experts.

However, this evaluation methodology has its limitations as well. Being engaged in other obligations, the experts normally have just a limited amount of time to devote to the NLG researcher. As with standard psycholinguistic experiments, consulting these informants is difficult to extend to a larger corpus like the one used e.g. by [Karamanis et al., 2004] (122 sets of facts).

In this paper, we reach a reasonable compromise by showing how the methodology of [Lapata, 2003] supplements the evaluation efforts of [Karamanis et al., 2004] using a similar (yet by necessity smaller) dataset. Clearly, a metric of coherence that has already done well in the previous study gains an extra bonus by passing this additional test.

3 Metrics of coherence

[Karamanis, 2003] discusses how a few basic notions of coherence captured by Centering Theory (CT) can be used to define a large range of metrics which might be useful for TS in our domain of interest.³ The metrics employed in the experiments of [Karamanis et al., 2004] include:

M.NOCB which penalises NOCBs, i.e. pairs of adjacent facts without any arguments in common [Karamanis and Manurung, 2002]. Because of its simplicity M.NOCB serves as the first baseline in the experiments of [Karamanis et al., 2004].

PF.NOCB, a second baseline, which enhances M.NOCB with a global constraint on coherence that [Karamanis, 2003] calls the PageFocus (PF).

PF.BFP which is based on PF as well as the original formulation of CT in [Brennan et al., 1987].

PF.KP which makes use of PF as well as the recent reformulation of CT in [Kibble and Power, 2000].

[Karamanis et al., 2004] report that PF.NOCB outperformed M.NOCB but was overtaken by PF.BFP and PF.KP. The two metrics beating PF.NOCB were not found to differ significantly from each other.

This study employs PF.BFP and PF.KP, i.e. two of the best performing metrics of the experiments in [Karamanis et al., 2004], as well as M.NOCB and PF.NOCB, the two previously used baselines. An additional random baseline is also defined following [Lapata, 2003].

³ Since discussing the metrics in detail is well beyond the scope of this paper, the reader is referred to Chapter 3 of [Karamanis, 2003] for more information on this issue.

4 Data collection

16 sets of facts were randomly selected from the corpus of [Dimitromanolaki and Androutsopoulos, 2003].⁴ The sentences that each fact corresponds to and the order defined by E0 were made available to us as well. We will subsequently refer to an unordered set of facts (or sentences that the facts correspond to) as a Testitem.

⁴ These are distinct from, yet very similar to, the sets of facts used in [Karamanis et al., 2004].

4.1 Generating the BestOrders for each metric

Following [Karamanis et al., 2004], we envisage a TS approach in which a metric of coherence M assigns a score to each possible ordering of the input set of facts and selects the best scoring ordering as the output. When many orderings score best, M chooses randomly between them. Crucially, our hypothetical TS component only considers orderings starting with the subclass fact (e.g. subclass(ex1, amph) in Figure 1) following the suggestion of [Dimitromanolaki and Androutsopoulos, 2003]. This gives rise to 5! = 120 orderings to be scored by M for each Testitem.

For the purposes of this experiment, a simple algorithm was implemented that first produces the 120 possible orderings of facts in a Testitem and subsequently ranks them according to the scores given by M. The algorithm outputs the set of BestOrders for the Testitem, i.e. the orderings which score best according to M. This procedure was repeated for each metric and all Testitems employed in the experiment.
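The procedure just described can be sketched as follows. This is only an illustration and not the implementation used in the study: the encoding of facts as (predicate, arguments) pairs, the function names and the simple NOCB count used as the scoring function are all assumptions, chosen because M.NOCB is the one metric whose definition is fully stated in this paper.

```python
from itertools import permutations

# Facts from Figure 1, encoded as (predicate, arguments). The encoding is an
# assumption made for this sketch; the subclass fact comes first.
FACTS = [
    ("subclass", ("ex1", "amph")),
    ("painted-by", ("ex1", "p-Kleo")),
    ("painter-story", ("p-Kleo", "en4049")),
    ("exhibit-depicts", ("ex1", "en914")),
    ("current-location", ("ex1", "wag-mus")),
    ("museum-country", ("wag-mus", "ger")),
]


def candidate_orderings(facts):
    """All orderings that keep the first (subclass) fact in first position."""
    first, rest = facts[0], facts[1:]
    return [(first,) + tail for tail in permutations(rest)]


def nocb_penalty(ordering):
    """Number of adjacent fact pairs with no argument in common (NOCBs)."""
    return sum(
        1
        for a, b in zip(ordering, ordering[1:])
        if not set(a[1]) & set(b[1])
    )


def best_orders(facts, penalty):
    """Orderings with the lowest penalty; ties are all kept."""
    scored = [(penalty(o), o) for o in candidate_orderings(facts)]
    best = min(score for score, _ in scored)
    return [o for score, o in scored if score == best]


if __name__ == "__main__":
    orders = best_orders(FACTS, nocb_penalty)
    print(len(candidate_orderings(FACTS)), "candidates,", len(orders), "best")
```

Keeping every tied ordering, rather than picking one at random, matches the description of BestOrders as a set and is what the later averaging over BestOrders relies on.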

4.2 Random baseline

Following [Lapata, 2003], a random baseline (RB) was implemented as the lower bound of the analysis. The random baseline consists of 10 randomly selected orderings for each Testitem. The orderings are selected irrespective of their scores for the various metrics.
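A minimal sketch of such a baseline is given below. The paper does not say whether the random orderings also start with the subclass fact or whether they are drawn without replacement; both are assumed here, and the function name and the `seed` parameter are hypothetical.

```python
import random
from itertools import permutations


def random_baseline(facts, k=10, seed=None):
    """Draw k distinct orderings at random, ignoring any metric score.

    Assumptions of this sketch: the first fact stays in first position and
    orderings are sampled without replacement.
    """
    rng = random.Random(seed)
    first, rest = facts[0], facts[1:]
    candidates = [(first,) + tail for tail in permutations(rest)]
    return rng.sample(candidates, k)


if __name__ == "__main__":
    items = ["subclass", "painted-by", "painter-story",
             "exhibit-depicts", "current-location", "museum-country"]
    for ordering in random_baseline(items, seed=0):
        print(ordering)
```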

4.3 Consulting domain experts

Three archaeologists (E1, E2, E3), one male and two females, between 28 and 45 years of age, all trained in cataloguing and museum labelling, were recruited from the Department of Classics at the University of Edinburgh.

Each expert was consulted by the first author in a separate interview. First, she was presented with a set of six sentences, each of which corresponded to a database fact and was printed on a different file card, as well as with written instructions describing the ordering task.⁵ The instructions mention that the sentences come from a computer program that generates descriptions of artefacts in a virtual museum. The first sentence for each set was given by the experimenter.⁶ Then, the expert was asked to order the remaining five sentences in a coherent text.

When ordering the sentences, the expert was instructed to consider which ones should be together and which should come before another in the text without using hints other than the sentences themselves. She could revise her ordering at any time by moving the sentences around. When she was satisfied with the ordering she produced, she was asked to write next to each sentence its position, and give them to the experimenter in order to perform the same task with the next randomly selected set of sentences. The expert was encouraged to comment on the difficulty of the task, the strategies she followed, etc.

⁵ The instructions are given in Appendix D of [Karamanis, 2003] and are adapted from the ones used in [Barzilay et al., 2002].

⁶ This is the sentence corresponding to the subclass fact.

5 Dependent variable

Given an unordered set of sentences and two possible orderings, a number of measures can be employed to calculate the distance between them. Based on the argumentation in [Howell, 2002], [Lapata, 2003] selects Kendall’s τ as the most appropriate measure and this was what we used for our analysis as well. Kendall’s τ is based on the number of inversions between the two orderings and is calculated as follows:

(1) $\tau = 1 - \frac{2I}{P_N} = 1 - \frac{2I}{N(N-1)/2}$

$P_N$ stands for the number of pairs of sentences and $N$ is the number of sentences to be ordered.⁷ $I$ stands for the number of inversions, that is, the number of adjacent transpositions necessary to bring one ordering to another. Kendall’s τ ranges from −1 (inverse ranks) to 1 (identical ranks). The higher the τ value, the smaller the distance between the two orderings.

Following [Lapata, 2003], the Tukey test is employed to investigate significant differences between average τ scores.⁸

First, the average distance between (the orderings of)⁹ two experts e.g. E0 and E1, denoted as T(E0E1), is calculated as the mean τ value between the ordering of E0 and the ordering of E1 taken across all 16 Testitems. Then, we compute T(EXPEXP) which expresses the overall average distance between all expert pairs and serves as the upper bound for the evaluation of the metrics. Since a total of $E$ experts gives rise to $P_E = \frac{E(E-1)}{2}$ expert pairs, T(EXPEXP) is computed by summing up the average distances between all expert pairs and dividing the sum by $P_E$.
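Formula (1) can be illustrated with a short sketch. It counts inversions as the number of item pairs that appear in opposite relative order in the two orderings, which equals the number of adjacent transpositions needed to turn one ordering into the other. The function names are hypothetical, and this is not the script used for the reported results.

```python
def count_inversions(reference, candidate):
    """Number of item pairs whose relative order differs between the two."""
    position = {item: i for i, item in enumerate(reference)}
    ranks = [position[item] for item in candidate]
    return sum(
        1
        for i in range(len(ranks))
        for j in range(i + 1, len(ranks))
        if ranks[i] > ranks[j]
    )


def kendall_tau(reference, candidate):
    """Kendall's tau as in formula (1): 1 - 2I / (N(N-1)/2)."""
    n = len(reference)
    pairs = n * (n - 1) / 2
    return 1 - 2 * count_inversions(reference, candidate) / pairs


if __name__ == "__main__":
    a = ["s1", "s2", "s3", "s4", "s5", "s6"]
    print(kendall_tau(a, a))                                     # identical
    print(kendall_tau(a, a[::-1]))                               # reversed
    print(kendall_tau(a, ["s2", "s1", "s3", "s4", "s5", "s6"]))  # one swap
```

With N = 6 there are 15 pairs, so a single adjacent swap gives τ = 1 − 2/15.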

While [Lapata, 2003] always appears to single out a unique best scoring ordering, we often have to deal with many best scoring orderings. To account for this, we first compute the average distance between e.g. the ordering of an expert E0 and the BestOrders of a metric M for a given Testitem. In this way, M is rewarded for a BestOrder that is close to the expert’s ordering, but penalised for every BestOrder that is not. Then, the average T(E0M) between the expert E0 and the metric M is calculated as their mean distance across all 16 Testitems. Finally, yet most importantly, T(EXPM) is the average distance between all experts and M. It is calculated by summing up the average distances between each expert and M and dividing the sum by the number of experts. As the next section explains in more detail, T(EXPM) is compared with the upper bound of the evaluation T(EXPEXP) to estimate the performance of M in our experiments.

RB is evaluated in a similar way as M using the 10 randomly selected orderings instead of the BestOrders for each Testitem. T(EXPRB) is the average distance between all experts and RB and is used as the lower bound of the evaluation.
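The three levels of averaging described above (over the BestOrders of one Testitem, over Testitems, and over experts) can be sketched as follows. The data layout, with one list of orderings per expert and one list of BestOrders per Testitem, and all function names are assumptions made for illustration.

```python
from statistics import mean


def kendall_tau(reference, candidate):
    """Kendall's tau: 1 - 2I / (N(N-1)/2), with I the number of inversions."""
    position = {item: i for i, item in enumerate(reference)}
    ranks = [position[item] for item in candidate]
    n = len(ranks)
    inversions = sum(
        1 for i in range(n) for j in range(i + 1, n) if ranks[i] > ranks[j]
    )
    return 1 - 2 * inversions / (n * (n - 1) / 2)


def item_distance(expert_order, orders):
    """Mean tau between one expert ordering and every ordering in `orders`."""
    return mean(kendall_tau(expert_order, o) for o in orders)


def t_expert_metric(expert_orders, orders_per_item):
    """T(E M): mean of the per-item distances across all Testitems."""
    return mean(
        item_distance(e, orders)
        for e, orders in zip(expert_orders, orders_per_item)
    )


def t_exp_metric(orders_by_expert, orders_per_item):
    """T(EXP M): mean of T(E M) over the experts."""
    return mean(
        t_expert_metric(expert_orders, orders_per_item)
        for expert_orders in orders_by_expert
    )


if __name__ == "__main__":
    abc, cba = ("a", "b", "c"), ("c", "b", "a")
    # Two toy Testitems: the metric has one BestOrder for the first item and
    # two tied BestOrders for the second.
    best_per_item = [[abc], [abc, cba]]
    print(t_expert_metric([abc, abc], best_per_item))
```

Passing the 10 random orderings of each Testitem as `orders_per_item` gives the corresponding score for RB.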

⁷ In our data, N is always equal to 6.

⁸ Provided that an omnibus ANOVA is significant, the Tukey test can be used to specify which of the conditions c1, ..., cn measured by the dependent variable differ significantly. It uses the set of means m1, ..., mn (corresponding to conditions c1, ..., cn) and the mean square error of the scores that contribute to these means to calculate a critical difference between any two means. An observed difference between any two means is significant if it exceeds the critical difference.

⁹ Throughout the paper we often refer to e.g. “the distance between the orderings of the experts” with the phrase “the distance between the experts” for the sake of brevity.

               E0E2: 0.717   E1E2: 0.758   E0E3: 0.258   E1E3: 0.300   E2E3: 0.192
E0E1: 0.692                                **            **            **
E0E2: 0.717                                **            **            **
E1E2: 0.758                                **            **            **
E0E3: 0.258
E1E3: 0.300
CD at 0.01: 0.338; CD at 0.05: 0.282
F(5,75) = 14.931, p < 0.000

Table 1: Comparison of distances between the expert pairs

6 Predictions

Despite any potential differences between the experts, one expects them to share some common ground in the way they order sentences. In this sense, a particularly welcome result for our purposes is to show that the average distances between E0 and most of her colleagues are short and not significantly different from the distances between the other expert pairs, which in turn indicates that she is not a “stand-alone” expert.

Moreover, we expect the average distance between the expert pairs to be significantly smaller than the average distance between the experts and RB. This is again based on the assumption that even though the experts might not follow completely identical strategies, they do not operate with absolute diversity either. Hence, we predict that T(EXPEXP) will be significantly greater than T(EXPRB).

Due to the small number of Testitems employed in this study, it is likely that the metrics do not differ significantly from each other with respect to their average distance from the experts. Rather than comparing the metrics directly with each other (as [Karamanis et al., 2004] do), this study compares them indirectly by examining their behaviour with respect to the upper and the lower bound. For instance, although T(EXPPF.KP) and T(EXPPF.BFP) might not be significantly different from each other, one score could be significantly different from T(EXPEXP) (upper bound) and/or T(EXPRB) (lower bound) while the other is not.

We identify the best metrics in this study as the ones whose average distance from the experts (i) is significantly greater than the lower bound and (ii) does not differ significantly from the upper bound.¹⁰

7 Results

7.1 Distances between the expert pairs

As the first step in our analysis, we computed the T score for each expert pair, namely T(E0E1), T(E0E2), T(E0E3), T(E1E2), T(E1E3) and T(E2E3). Then we performed all 15 pairwise comparisons between them using the Tukey test, the results of which are summarised in Table 1.¹¹

The cells in the Table report the level of significance returned by the Tukey test when the difference between two distances exceeds the critical difference (CD). Significance beyond the 0.05 threshold is reported with one asterisk (*), while significance beyond the 0.01 threshold is reported with two asterisks (**). A cell remains empty when the difference between two distances does not exceed the critical difference. For example, the value of T(E0E1) is 0.692 and the value of T(E0E3) is 0.258. Since their difference exceeds the CD at the 0.01 threshold, it is reported to be significant beyond that level by the Tukey test, as shown in the top cell of the third column in Table 1.

¹⁰ Criterion (ii) can only be applied provided that the average distance between the experts and at least one metric Mx is found to be significantly lower than T(EXPEXP). Then, if the average distance between the experts and another metric My does not differ significantly from T(EXPEXP), My performs better than Mx.

¹¹ The Table also reports the result of the omnibus ANOVA, which is significant: F(5,75) = 14.931, p < 0.000.

               E0E2: 0.717   E1E2: 0.758   E0RB: 0.323   E1RB: 0.347   E2RB: 0.352
E0E1: 0.692                                **            **            **
E0E2: 0.717                                **            **            **
E1E2: 0.758                                **            **            **
E0RB: 0.323
E1RB: 0.347
CD at 0.01: 0.242; CD at 0.05: 0.202
F(5,75) = 18.762, p < 0.000

               E1E3: 0.300   E2E3: 0.192   E3RB: 0.302
E0E3: 0.258
E1E3: 0.300
E2E3: 0.192
CD at 0.01: 0.219; CD at 0.05: 0.177
F(3,45) = 1.223, p = 0.312

Table 2: Comparison of distances between the experts (E0, E1, E2, E3) and the random baseline (RB)

As the Table shows, the T scores for the distance between E0 and E1 or E2, i.e. T(E0E1) and T(E0E2), as well as the T for the distance between E1 and E2, i.e. T(E1E2), are quite high which indicates that on average the orderings of the three experts are quite close to each other. Moreover, these T scores are not significantly different from each other which suggests that E0, E1 and E2 share quite a lot of common ground in the ordering task. Hence, E0 is found to give rise to similar orderings to the ones of E1 and E2.

However, when any of the previous distances is compared with a distance that involves the orderings of E3 the difference is significant, as shown by the cells containing two asterisks in Table 1. In other words, although the orderings of E1 and E2 seem to deviate from each other and the orderings of E0 to more or less the same extent, the orderings of E3 stand much further away from all of them. Hence, there exists a “stand-alone” expert among the ones consulted in our studies, yet this is not E0 but E3.

This finding can be easily explained by the fact that by contrast to the other three experts, E3 followed a very schematic way for ordering sentences. Because the orderings of E3 manifest rather peculiar strategies, at least compared to the orderings of E0, E1 and E2, the upper bound of the analysis, i.e. the average distance between the expert pairs T(EXPEXP), is computed without taking into account these orderings:

(2) $T(\mathrm{EXPEXP}) = 0.722 = \frac{T(\mathrm{E0E1}) + T(\mathrm{E0E2}) + T(\mathrm{E1E2})}{3}$

7.2 Distances between the experts and RB

As the upper part of Table 2 shows, the T score between any two experts other than E3 is significantly greater than their distance from RB beyond the 0.01 threshold. Only the distances between E3 and another expert, shown in the lower section of Table 2, are not significantly different from the distance between E3 and RB.

Although this result does not mean that the orders of E3 are similar to the orders of RB,¹² it shows that E3 is roughly as far away from e.g. E0 as she is from RB. By contrast, E0 stands significantly closer to E1 than to RB, and the same holds for the other distances in the upper part of the Table. In accordance with the discussion in the previous section, the lower bound, i.e. the overall average distance between the experts (excluding E3) and RB T(EXPRB), is computed as shown in (3):

(3) $T(\mathrm{EXPRB}) = 0.341 = \frac{T(\mathrm{E0RB}) + T(\mathrm{E1RB}) + T(\mathrm{E2RB})}{3}$

7.3 Distances between the experts and each metric

So far, E3 was identified as a “stand-alone” expert standing further away from the other three experts than they stand from each other. We also identified the distance between E3 and each expert as similar to her distance from RB.

Similarly, E3 was found to stand further away from the metrics compared to their distance from the other three experts.¹³ This result gives rise to the set of formulas in (4) for calculating the overall average distance between the experts (excluding E3) and each metric.

(4)

(4.1): $T(\mathrm{EXPPF.BFP}) = 0.629 = \frac{T(\mathrm{E0PF.BFP}) + T(\mathrm{E1PF.BFP}) + T(\mathrm{E2PF.BFP})}{3}$

(4.2): $T(\mathrm{EXPPF.KP}) = 0.571 = \frac{T(\mathrm{E0PF.KP}) + T(\mathrm{E1PF.KP}) + T(\mathrm{E2PF.KP})}{3}$

(4.3): $T(\mathrm{EXPPF.NOCB}) = 0.606 = \frac{T(\mathrm{E0PF.NOCB}) + T(\mathrm{E1PF.NOCB}) + T(\mathrm{E2PF.NOCB})}{3}$

(4.4): $T(\mathrm{EXPM.NOCB}) = 0.487 = \frac{T(\mathrm{E0M.NOCB}) + T(\mathrm{E1M.NOCB}) + T(\mathrm{E2M.NOCB})}{3}$

In the next section, we present the concluding analysis for this study which compares the overall distances in formulas (2), (3) and (4) with each other. As we have already mentioned, T(EXPEXP) serves as the upper bound of the analysis whereas T(EXPRB) is the lower bound. The aim is to specify which scores in (4) are significantly greater than T(EXPRB), but not significantly lower than T(EXPEXP).

¹² This could have been argued, if the value of T(E3RB) had been much closer to 1.

¹³ Due to space restrictions, we cannot report the scores for these comparisons here. The reader is referred to Table 9.4 on page 175 of Chapter 9 in [Karamanis, 2003].

7.4 Concluding analysis

The results of the comparisons of the scores in (2), (3) and (4) are shown in Table 3. As the top cell in the last column of the Table shows, the T score between the experts and RB, T(EXPRB), is significantly lower than the average distance between the expert pairs, T(EXPEXP), at the 0.01 level. This result verifies one of our main predictions showing that the orderings of the experts (modulo E3) stand much closer to each other compared to their distance from randomly assembled orderings.

As expected, most of the scores that involve the metrics are not significantly different from each other, except for T(EXPPF.BFP) which is significantly greater than T(EXPM.NOCB) at the 0.05 level. Yet, what we are mainly interested in is how the distance between the experts and each metric compares with T(EXPEXP) and T(EXPRB). This is shown in the first row and the last column of Table 3.

Crucially, T(EXPRB) is significantly lower than T(EXPPF.BFP) as well as T(EXPPF.NOCB) and T(EXPPF.KP) at the 0.01 level. Notably, even the distance of the experts from M.NOCB, T(EXPM.NOCB), is significantly greater than T(EXPRB), albeit at the 0.05 level. These results show that the distance from the experts is significantly reduced when using the best scoring orderings of any metric, even M.NOCB, instead of the orderings of RB. Hence, all metrics score significantly better than RB in this experiment.

However, simply using M.NOCB to output the best scoring orders is not enough to yield a distance from the experts which is comparable to T(EXPEXP). Although the PF constraint appears to help in this direction, T(EXPPF.KP) remains significantly lower than T(EXPEXP), whereas the difference between T(EXPPF.NOCB) and T(EXPEXP) falls only 0.009 points short of the CD at the 0.05 threshold. Hence, PF.BFP is the most robust metric, as the difference between T(EXPPF.BFP) and T(EXPEXP) is clearly not significant.

Finally, the difference between T(EXPPF.NOCB) and T(EXPM.NOCB) is only 0.006 points away from the CD. This result shows that the distance from the experts is reduced to a great extent when the best scoring orderings are computed according to PF.NOCB instead of simply M.NOCB. Hence, this experiment provides additional evidence in favour of enhancing M.NOCB with the PF constraint of coherence, as suggested in [Karamanis, 2003].
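The comparisons in this section can be retraced from the rounded scores and critical differences reported in Table 3. The sketch below only compares absolute differences with the two CD values; it does not recompute the Tukey test, and because it works on rounded figures a borderline difference could in principle come out differently from the original analysis. The names `SCORES`, `significance` and `passes_both_criteria` are hypothetical.

```python
# Rounded values as reported in formulas (2), (3), (4) and Table 3.
SCORES = {
    "EXPEXP": 0.722,   # upper bound
    "PF.BFP": 0.629,
    "PF.NOCB": 0.606,
    "PF.KP": 0.571,
    "M.NOCB": 0.487,
    "RB": 0.341,       # lower bound
}
CD_05, CD_01 = 0.125, 0.150


def significance(a, b, cd_05=CD_05, cd_01=CD_01):
    """Return '**', '*' or '' depending on which CD the difference exceeds."""
    diff = abs(a - b)
    if diff > cd_01:
        return "**"
    if diff > cd_05:
        return "*"
    return ""


def passes_both_criteria(metric):
    """Criterion (i): significantly above RB; (ii): not below EXPEXP."""
    score = SCORES[metric]
    above_lower = score > SCORES["RB"] and significance(score, SCORES["RB"]) != ""
    near_upper = significance(score, SCORES["EXPEXP"]) == ""
    return above_lower and near_upper


if __name__ == "__main__":
    for name in ["PF.BFP", "PF.NOCB", "PF.KP", "M.NOCB"]:
        print(name,
              repr(significance(SCORES[name], SCORES["EXPEXP"])),
              repr(significance(SCORES[name], SCORES["RB"])),
              passes_both_criteria(name))
```

On these rounded figures both PF.BFP and PF.NOCB satisfy the two criteria, but PF.NOCB does so by a margin of only 0.009, which is why the text singles out PF.BFP as the more robust candidate.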

8 Discussion

A question not addressed by previous studies making use of a certain collection of orderings of facts is whether the strategies reflected there are specific to E0, the expert who created the dataset. In this paper, we address this question by enhancing E0’s dataset with orderings provided by three additional experts. Then, the distance between E0 and her colleagues is computed and compared to the distance between the other expert pairs. The results indicate that E0 shares a lot of common ground with two of her colleagues in the ordering task deviating from them as much as they deviate from each other, while the orderings of a fourth “stand-alone” expert are found to manifest rather individualistic strategies.

The same variable used to investigate the distance between the experts is employed to automatically evaluate the best scoring orderings of some of the best performing metrics in [Karamanis et al., 2004]. Despite its limitations due to the necessarily restricted size of the employed dataset, this evaluation task allows us to explore the previously unaddressed possibility that there exist many good solutions for TS in the employed domain.

                    EXPPF.BFP: 0.629   EXPPF.NOCB: 0.606   EXPPF.KP: 0.571   EXPM.NOCB: 0.487   EXPRB: 0.341
EXPEXP: 0.722                                              **                **                 **
EXPPF.BFP: 0.629                                                             *                  **
EXPPF.NOCB: 0.606                                                                               **
EXPPF.KP: 0.571                                                                                 **
EXPM.NOCB: 0.487                                                                                *
CD at 0.01: 0.150; CD at 0.05: 0.125
F(5,75) = 19.111, p < 0.000

Table 3: Results of the concluding analysis comparing the distance between the expert pairs (EXPEXP) with the distance between the experts and each metric (PF.BFP, PF.NOCB, PF.KP, M.NOCB) and the random baseline (RB)

Out of a much larger set of possibilities, 10 metrics were evaluated in [Karamanis et al., 2004], only a handful of which were found to overtake two simple baselines. The additional test in this study carries on the elimination process by pointing out PF.BFP as the single most promising metric to be used for TS in the explored domain, since this is the metric that manages to clearly survive both tests.

Equally crucially, our analysis shows that all employed metrics are superior to a random baseline. Additional evidence in favour of the PF constraint on coherence introduced in [Karamanis, 2003] is provided as well. The general evaluation methodology as well as the specific results of this study will be useful for any subsequent attempt to automatically evaluate a TS approach using a corpus of sentence orderings defined by many experts.

As [Reiter and Sripada, 2002] suggest, the best way to treat the results of a corpus-based study is as hypotheses which eventually need to be integrated with other types of evaluation. Although we followed the ongoing argumentation that using perceptual experiments to choose between many possible metrics is unfeasible, our efforts have resulted in a single preferred candidate which is much easier to evaluate with the help of psycholinguistic techniques (instead of having to deal with a large number of metrics from very early on). This is indeed our main direction for future work in this domain.

Acknowledgments

We are grateful to Aggeliki Dimitromanolaki for entrusting us with her data and for helpful clarifications on their use; to Mirella Lapata for providing us with the scripts for the computation of τ together with her extensive and prompt advice; to Katerina Kolotourou for her invaluable assistance in recruiting the experts; and to the experts for their participation. This work took place while the first author was studying at the University of Edinburgh, supported by the Greek State Scholarship Foundation (IKY).

References

[Barzilay and Lee, 2004] Regina Barzilay and Lillian Lee. Catching the drift: Probabilistic content models with applications to generation and summarization. In Proceedings of HLT-NAACL 2004, pages 113–120, 2004.

[Barzilay et al., 2002] Regina Barzilay, Noemie Elhadad, and Kathleen McKeown. Inferring strategies for sentence ordering in multidocument news summarization. Journal of Artificial Intelligence Research, 17:35–55, 2002.

[Brennan et al., 1987] Susan E. Brennan, Marilyn A. Friedman [Walker], and Carl J. Pollard. A centering approach to pronouns. In Proceedings of ACL 1987, pages 155–162, Stanford, California, 1987.

[Dimitromanolaki and Androutsopoulos, 2003] Aggeliki Dimitromanolaki and Ion Androutsopoulos. Learning to order facts for discourse planning in natural language generation. In Proceedings of the 9th European Workshop on Natural Language Generation, Budapest, Hungary, 2003.

[Howell, 2002] David C. Howell. Statistical Methods for Psychology. Duxbury, Pacific Grove, CA, 5th edition, 2002.

[Isard et al., 2003] Amy Isard, Jon Oberlander, Ion Androutsopoulos, and Colin Matheson. Speaking the users’ languages. IEEE Intelligent Systems Magazine, 18(1):40–45, 2003.

[Karamanis and Manurung, 2002] Nikiforos Karamanis and Hisar Maruli Manurung. Stochastic text structuring using the principle of continuity. In Proceedings of INLG 2002, pages 81–88, Harriman, NY, USA, July 2002.

[Karamanis and Mellish, 2005] Nikiforos Karamanis and Chris Mellish. A review of recent corpus-based methods for evaluating text structuring in NLG. 2005. Submitted to Using Corpora for NLG workshop.

[Karamanis et al., 2004] Nikiforos Karamanis, Chris Mellish, Jon Oberlander, and Massimo Poesio. A corpus-based methodology for evaluating metrics of coherence for text structuring. In Proceedings of INLG04, pages 90–99, Brockenhurst, UK, 2004.

[Karamanis, 2003] Nikiforos Karamanis. Entity Coherence for Descriptive Text Structuring. PhD thesis, Division of Informatics, University of Edinburgh, 2003.

[Kibble and Power, 2000] Rodger Kibble and Richard Power. An integrated framework for text planning and pronominalisation. In Proceedings of INLG 2000, pages 77–84, Israel, 2000.

[Lapata, 2003] Mirella Lapata. Probabilistic text structuring: Experiments with sentence ordering. In Proceedings of ACL 2003, pages 545–552, Sapporo, Japan, July 2003.

[McKeown, 1985] Kathleen McKeown. Text Generation: Using Discourse Strategies and Focus Constraints to Generate Natural Language Text. Studies in Natural Language Processing. Cambridge University Press, 1985.

[Reiter and Sripada, 2002] Ehud Reiter and Somayajulu Sripada. Should corpora texts be gold standards for NLG? In Proceedings of INLG 2002, pages 97–104, Harriman, NY, USA, July 2002.
