Using a Corpus of Sentence Orderings Defined by Many Experts to Evaluate Metrics of Coherence for Text Structuring
Nikiforos Karamanis
Computational Linguistics Research Group
University of Wolverhampton, UK
N.Karamanis@wlv.ac.uk
Chris Mellish
Department of Computing Science
University of Aberdeen, UK
cmellish@csd.abdn.ac.uk
Abstract
This paper addresses two previously unresolved issues in the automatic evaluation of Text Structuring (TS) in Natural Language Generation (NLG). First, we describe how to verify the generality of an existing collection of sentence orderings defined by one domain expert using data provided by additional experts. Second, a general evaluation methodology is outlined which investigates the previously unaddressed possibility that there may exist many optimal solutions for TS in the employed domain. This methodology is implemented in a set of experiments which identify the most promising candidate for TS among several metrics of coherence previously suggested in the literature.¹
¹ Chapter 9 of [Karamanis, 2003] reports the study in more detail.
1 Introduction
Research in NLG focused on problems related to TS from very early on, [McKeown, 1985] being a classic example. Nowadays, TS continues to be an extremely fruitful field of diverse active research. In this paper, we assume the so-called search-based approach to TS [Karamanis et al., 2004] which employs a metric of coherence to select a text structure among various alternatives. The TS module is hypothesised to simply order a preselected set of information-bearing items such as sentences [Barzilay et al., 2002; Lapata, 2003; Barzilay and Lee, 2004] or database facts [Dimitromanolaki and Androutsopoulos, 2003; Karamanis et al., 2004].
Empirical work on the evaluation of TS has become increasingly automatic and corpus-based. As pointed out by [Karamanis, 2003; Barzilay and Lee, 2004] inter alia, using corpora for automatic evaluation is motivated by the fact that employing human informants in extended psycholinguistic experiments is often simply unfeasible. By contrast, large-scale automatic corpus-based experimentation takes place much more easily.
[Lapata, 2003] was the first to present an experimental setting which employs the distance between two orderings to estimate automatically how close a sentence ordering produced by her probabilistic TS model stands in comparison to orderings provided by several human judges.
[Dimitromanolaki and Androutsopoulos, 2003] derived sets of facts from the database of MPIRO, an NLG system that generates short descriptions of museum artefacts [Isard et al., 2003]. Each set consists of 6 facts each of which corresponds to a sentence as shown in Figure 1. The facts in each set were manually assigned an order to reflect what a domain expert, i.e. an archaeologist trained in museum labelling, considered to be the most natural ordering of the corresponding sentences. Patterns of ordering facts were automatically learned from the corpus created by the expert. Then, a classification-based TS approach was implemented and evaluated in comparison to the expert’s orderings.
Database fact → Sentence
subclass(ex1, amph) → This exhibit is an amphora.
painted-by(ex1, p-Kleo) → This exhibit was decorated by the Painter of Kleofrades.
painter-story(p-Kleo, en4049) → The Painter of Kleofrades used to decorate big vases.
exhibit-depicts(ex1, en914) → This exhibit depicts a warrior performing splachnoscopy before leaving for the battle.
current-location(ex1, wag-mus) → This exhibit is currently displayed in the Martin von Wagner Museum.
museum-country(wag-mus, ger) → The Martin von Wagner Museum is in Germany.
Figure 1: MPIRO database facts corresponding to sentences
A subset of the corpus created by the expert in the previous study (to whom we will henceforth refer as E0) is employed by [Karamanis et al., 2004] who attempt to distinguish between many metrics of coherence with respect to their usefulness for TS in the same domain. Each human ordering of facts in the corpus is scored by each of these metrics, which are then penalised proportionally to the number of alternative orderings of the same material that are found to score equally to or better than the human ordering. The few metrics which manage to outperform two simple baselines in their overall performance across the corpus emerge as the most suitable candidates for TS in the investigated domain. This methodology is very similar to the way [Barzilay and Lee, 2004] evaluate their probabilistic TS model in comparison to the approach of [Lapata, 2003].
Because the data used in the studies of [Dimitromanolaki and Androutsopoulos, 2003] and [Karamanis et al., 2004] are based on the insights of just one expert, an obvious unresolved question is whether they reflect general strategies for ordering facts in the domain of interest. This paper addresses this issue by enhancing the dataset used in the two studies with orderings provided by three additional experts. These orderings are then compared with the orders of E0 using the methodology of [Lapata, 2003]. Since E0 is found to share a lot of common ground with two of her colleagues in the ordering task, her reliability is verified, while a fourth “stand-alone” expert who uses strategies not shared by any other expert is identified as well.
As in [Lapata, 2003], the same dependent variable which allows us to estimate how different the orders of E0 are from the orders of her colleagues is used to evaluate some of the metrics which perform best in [Karamanis et al., 2004]. As explained in the next section, in this way we investigate the previously unaddressed possibility that there may exist many optimal solutions for TS in our domain. The results of this additional evaluation experiment are presented and emphasis is laid on their relation with the previous findings.
Overall, this paper addresses two general issues: a) how to verify the generality of a dataset defined by one expert using sentence orderings provided by other experts and b) how to employ these data for the automatic evaluation of a TS approach. Given that the methodology discussed in this paper does not rely on the employed metrics of coherence or the assumed TS approach, our work can be of interest to any NLG researcher facing these questions.
The next section discusses how the methodology implemented in this study complements the methods of [Karamanis et al., 2004]. After briefly introducing the employed metrics of coherence, we describe the data collected for our experiments. Then, we present the employed dependent variable and formulate our predictions. In the results section, we state which of these predictions were verified. The paper is concluded with a discussion of the main findings.
2 An additional evaluation test
As [Barzilay et al., 2002] report, different humans often order sentences in distinct ways. Thus, there might exist more than one equally good solution for TS, a view shared by almost all TS researchers, but which has not been accounted for in the evaluation methodologies of [Karamanis et al., 2004] and [Barzilay and Lee, 2004].²
Collecting sentence orderings defined by many experts in our domain enables us to investigate the possibility that there might exist many good solutions for TS. Then, the measure of [Lapata, 2003], which estimates how close two orderings stand, can be employed not only to verify the reliability of E0 but also to compare the orderings preferred by the assumed TS approach with the orderings of the experts.
However, this evaluation methodology has its limitations as well. Being engaged in other obligations, the experts normally have just a limited amount of time to devote to the NLG researcher. Similarly to standard psycholinguistic experiments, consulting these informants is difficult to extend to a larger corpus like the one used e.g. by [Karamanis et al., 2004] (122 sets of facts).
² A more detailed discussion of existing corpus-based methods for evaluating TS appears in [Karamanis and Mellish, 2005].
In this paper, we reach a reasonable compromise by showing how the methodology of [Lapata, 2003] supplements the evaluation efforts of [Karamanis et al., 2004] using a similar (yet by necessity smaller) dataset. Clearly, a metric of coherence that has already done well in the previous study gains extra bonus by passing this additional test.
3 Metrics of coherence
[Karamanis, 2003] discusses how a few basic notions of coherence captured by Centering Theory (CT) can be used to define a large range of metrics which might be useful for TS in our domain of interest.³ The metrics employed in the experiments of [Karamanis et al., 2004] include:
M.NOCB which penalises NOCBs, i.e. pairs of adjacent facts without any arguments in common [Karamanis and Manurung, 2002]. Because of its simplicity M.NOCB serves as the first baseline in the experiments of [Karamanis et al., 2004].
PF.NOCB, a second baseline, which enhances M.NOCB with a global constraint on coherence that [Karamanis, 2003] calls the PageFocus (PF).
PF.BFP which is based on PF as well as the original formulation of CT in [Brennan et al., 1987].
PF.KP which makes use of PF as well as the recent reformulation of CT in [Kibble and Power, 2000].
[Karamanis et al., 2004] report that PF.NOCB outperformed M.NOCB but was overtaken by PF.BFP and PF.KP. The two metrics beating PF.NOCB were not found to differ significantly from each other.
This study employs PF.BFP and PF.KP, i.e. two of the best performing metrics of the experiments in [Karamanis et al., 2004], as well as M.NOCB and PF.NOCB, the two previously used baselines. An additional random baseline is also defined following [Lapata, 2003].
4 Data collection
16 sets of facts were randomly selected from the corpus of [Dimitromanolaki and Androutsopoulos, 2003].⁴ The sentences that each fact corresponds to and the order defined by E0 were made available to us as well. We will subsequently refer to an unordered set of facts (or sentences that the facts correspond to) as a Testitem.
4.1 Generating the BestOrders for each metric
Following [Karamanis et al., 2004], we envisage a TS approach in which a metric of coherence M assigns a score to each possible ordering of the input set of facts and selects the best scoring ordering as the output. When many orderings score best, M chooses randomly between them. Crucially, our hypothetical TS component only considers orderings starting with the subclass fact (e.g. subclass(ex1, amph) in Figure 1) following the suggestion of [Dimitromanolaki and Androutsopoulos, 2003]. This gives rise to 5! = 120 orderings to be scored by M for each Testitem.
³ Since discussing the metrics in detail is well beyond the scope of this paper, the reader is referred to Chapter 3 of [Karamanis, 2003] for more information on this issue.
⁴ These are distinct from, yet very similar to, the sets of facts used in [Karamanis et al., 2004].
For the purposes of this experiment, a simple algorithm was implemented that first produces the 120 possible orderings of facts in a Testitem and subsequently ranks them according to the scores given by M. The algorithm outputs the set of BestOrders for the Testitem, i.e. the orderings which score best according to M. This procedure was repeated for each metric and all Testitems employed in the experiment.
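The enumeration described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the study: the tuple encoding of the Figure 1 facts and the names `count_nocbs` and `best_orders` are assumptions, and the penalty is a simplified stand-in for M.NOCB that merely counts adjacent facts sharing no argument.

```python
from itertools import permutations

# Facts from Figure 1, encoded as (predicate, arguments).
# This encoding is an assumption made for illustration only.
FACTS = [
    ("subclass", ("ex1", "amph")),
    ("painted-by", ("ex1", "p-Kleo")),
    ("painter-story", ("p-Kleo", "en4049")),
    ("exhibit-depicts", ("ex1", "en914")),
    ("current-location", ("ex1", "wag-mus")),
    ("museum-country", ("wag-mus", "ger")),
]

def count_nocbs(ordering):
    """Count adjacent fact pairs with no argument in common (lower is better)."""
    return sum(
        1
        for (_, args_a), (_, args_b) in zip(ordering, ordering[1:])
        if not set(args_a) & set(args_b)
    )

def best_orders(facts, penalty=count_nocbs):
    """Enumerate all orderings that keep the first fact in place and
    return (all candidates, the candidates with minimal penalty)."""
    first, rest = facts[0], facts[1:]
    candidates = [(first,) + p for p in permutations(rest)]
    lowest = min(penalty(c) for c in candidates)
    return candidates, [c for c in candidates if penalty(c) == lowest]

candidates, best = best_orders(FACTS)
```

With the Figure 1 facts this simplified penalty cannot reach zero, because painter-story and museum-country each share an argument with only one other fact, so several orderings tie at one NOCB and all of them are returned as the set of best orders.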
4.2 Random baseline
Following [Lapata, 2003], a random baseline (RB) was implemented as the lower bound of the analysis. The random baseline consists of 10 randomly selected orderings for each Testitem. The orderings are selected irrespective of their scores for the various metrics.
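A random baseline of this kind might be drawn as in the sketch below. Two details are assumptions that the text does not settle: that the 10 orderings are sampled without replacement, and that they come from the same 120 candidates that keep the first item fixed. The function name `random_baseline` and the `seed` parameter are likewise hypothetical.

```python
import random
from itertools import permutations

def random_baseline(items, k=10, seed=None):
    """Pick k distinct orderings at random, ignoring any metric score.

    Assumption: the first item (the subclass fact) stays in first position,
    so the sample is drawn from the (n-1)! remaining permutations.
    """
    rng = random.Random(seed)
    first, rest = items[0], items[1:]
    candidates = [(first,) + p for p in permutations(rest)]
    return rng.sample(candidates, k)

rb = random_baseline(list("ABCDEF"), seed=1)
```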
4.3 Consulting domain experts
Three archaeologists (E1, E2, E3), one male and two females, between 28 and 45 years of age, all trained in cataloguing and museum labelling, were recruited from the Department of Classics at the University of Edinburgh.
Each expert was consulted by the first author in a separate interview. First, she was presented with a set of six sentences, each of which corresponded to a database fact and was printed on a different filecard, as well as with written instructions describing the ordering task.⁵ The instructions mention that the sentences come from a computer program that generates descriptions of artefacts in a virtual museum. The first sentence for each set was given by the experimenter.⁶ Then, the expert was asked to order the remaining five sentences in a coherent text.
When ordering the sentences, the expert was instructed to consider which ones should be together and which should come before another in the text without using hints other than the sentences themselves. She could revise her ordering at any time by moving the sentences around. When she was satisfied with the ordering she produced, she was asked to write next to each sentence its position, and give them to the experimenter in order to perform the same task with the next randomly selected set of sentences. The expert was encouraged to comment on the difficulty of the task, the strategies she followed, etc.
5 Dependent variable
Given an unordered set of sentences and two possible orderings, a number of measures can be employed to calculate the distance between them. Based on the argumentation in [Howell, 2002], [Lapata, 2003] selects Kendall’s τ as the most appropriate measure and this was what we used for our analysis as well. Kendall’s τ is based on the number of inversions between the two orderings and is calculated as follows:
(1) $\tau = 1 - \frac{2I}{P_N} = 1 - \frac{2I}{N(N-1)/2}$
⁵ The instructions are given in Appendix D of [Karamanis, 2003] and are adapted from the ones used in [Barzilay et al., 2002].
⁶ This is the sentence corresponding to the subclass fact.
P_N stands for the number of pairs of sentences and N is the number of sentences to be ordered.⁷ I stands for the number of inversions, that is, the number of adjacent transpositions necessary to bring one ordering to another. Kendall’s τ ranges from −1 (inverse ranks) to 1 (identical ranks). The higher the τ value, the smaller the distance between the two orderings.
Following [Lapata, 2003], the Tukey test is employed to investigate significant differences between average τ scores.⁸
First, the average distance between (the orderings of)⁹ two experts e.g. E0 and E1, denoted as T(E0E1), is calculated as the mean τ value between the ordering of E0 and the ordering of E1 taken across all 16 Testitems. Then, we compute T(EXPEXP) which expresses the overall average distance between all expert pairs and serves as the upper bound for the evaluation of the metrics. Since a total of E experts gives rise to $P_E = \frac{E(E-1)}{2}$ expert pairs, T(EXPEXP) is computed by summing up the average distances between all expert pairs and dividing the sum by P_E.
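Formula (1) can be made concrete with the following sketch, which counts inversions as the number of item pairs whose relative order differs between the two orderings (this equals the minimum number of adjacent transpositions). The function name `kendall_tau` is an assumption; it is not the script used by the authors.

```python
from itertools import combinations

def kendall_tau(order_a, order_b):
    """Kendall's tau between two orderings of the same items, as in (1)."""
    n = len(order_a)
    if n < 2 or len(order_b) != n or set(order_a) != set(order_b):
        raise ValueError("orderings must contain the same two or more items")
    position_in_b = {item: i for i, item in enumerate(order_b)}
    # A pair (x, y), with x before y in order_a, is an inversion
    # when x comes after y in order_b.
    inversions = sum(
        1
        for x, y in combinations(order_a, 2)
        if position_in_b[x] > position_in_b[y]
    )
    pairs = n * (n - 1) // 2  # P_N
    return 1 - 2 * inversions / pairs

identical = kendall_tau("ABCDEF", "ABCDEF")
reversed_order = kendall_tau("ABCDEF", "FEDCBA")
one_swap = kendall_tau("ABCDEF", "ABDCEF")
```

With N = 6 there are 15 pairs, so a single adjacent swap gives I = 1 and τ = 1 − 2/15, while a full reversal gives I = 15 and τ = −1.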
While [Lapata, 2003] always appears to single out a unique best scoring ordering, we often have to deal with many best scoring orderings. To account for this, we first compute the average distance between e.g. the ordering of an expert E0 and the BestOrders of a metric M for a given Testitem. In this way, M is rewarded for a BestOrder that is close to the expert’s ordering, but penalised for every BestOrder that is not. Then, the average T(E0M) between the expert E0 and the metric M is calculated as their mean distance across all 16 Testitems. Finally, yet most importantly, T(EXPM) is the average distance between all experts and M. It is calculated by summing up the average distances between each expert and M and dividing the sum by the number of experts. As the next section explains in more detail, T(EXPM) is compared with the upper bound of the evaluation T(EXPEXP) to estimate the performance of M in our experiments.
RB is evaluated in a similar way as M using the 10 randomly selected orderings instead of the BestOrders for each Testitem. T(EXPRB) is the average distance between all experts and RB and is used as the lower bound of the evaluation.
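The two levels of averaging described above can be sketched as follows. The function names and the toy data are assumptions for illustration; the orderings are invented and do not come from the corpus.

```python
from itertools import combinations
from statistics import mean

def kendall_tau(order_a, order_b):
    """Kendall's tau between two orderings of the same items."""
    n = len(order_a)
    pos = {item: i for i, item in enumerate(order_b)}
    inversions = sum(1 for x, y in combinations(order_a, 2) if pos[x] > pos[y])
    return 1 - 2 * inversions / (n * (n - 1) // 2)

def t_expert_metric(expert_orders, best_orders_per_item):
    """T(E M): for each test item, average tau between the expert's ordering
    and every best order of the metric; then average across test items."""
    return mean(
        mean(kendall_tau(expert, best) for best in bests)
        for expert, bests in zip(expert_orders, best_orders_per_item)
    )

def t_experts_metric(orders_by_expert, best_orders_per_item):
    """T(EXP M): mean of T(E M) over all experts."""
    return mean(
        t_expert_metric(orders, best_orders_per_item)
        for orders in orders_by_expert
    )

# Toy data (hypothetical): two test items of three sentences each.
expert = ["ABC", "ABC"]
bests = [["ABC", "CBA"], ["ABC"]]  # item 1 has two tied best orders
score = t_expert_metric(expert, bests)
```

In the toy data the first item contributes (1 + (−1)) / 2 = 0 and the second contributes 1, which shows how a metric is penalised for every tied best order that is far from the expert's ordering. Passing the 10 random orderings in place of the best orders gives the corresponding score for RB.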
⁷ In our data, N is always equal to 6.
⁸ Provided that an omnibus ANOVA is significant, the Tukey test can be used to specify which of the conditions c1, ..., cn measured by the dependent variable differ significantly. It uses the set of means m1, ..., mn (corresponding to conditions c1, ..., cn) and the mean square error of the scores that contribute to these means to calculate a critical difference between any two means. An observed difference between any two means is significant if it exceeds the critical difference.
⁹ Throughout the paper we often refer to e.g. “the distance between the orderings of the experts” with the phrase “the distance between the experts” for the sake of brevity.
              E0E2   E1E2   E0E3   E1E3   E2E3
E0E1: 0.692                 **     **     **
E0E2: 0.717                 **     **     **
E1E2: 0.758                 **     **     **
E0E3: 0.258
E1E3: 0.300
E2E3: 0.192
CD at 0.01: 0.338; CD at 0.05: 0.282
F(5,75) = 14.931, p < 0.000
Table 1: Comparison of distances between the expert pairs
6 Predictions
Despite any potential differences between the experts, one expects them to share some common ground in the way they order sentences. In this sense, a particularly welcome result for our purposes is to show that the average distances between E0 and most of her colleagues are short and not significantly different from the distances between the other expert pairs, which in turn indicates that she is not a “stand-alone” expert.
Moreover, we expect the average distance between the expert pairs to be significantly smaller than the average distance between the experts and RB. This is again based on the assumption that even though the experts might not follow completely identical strategies, they do not operate with absolute diversity either. Since a higher τ value corresponds to a smaller distance, we predict that T(EXPEXP) will be significantly greater than T(EXPRB).
Due to the small number of Testitems employed in this study, it is likely that the metrics do not differ significantly from each other with respect to their average distance from the experts. Rather than comparing the metrics directly with each other (as [Karamanis et al., 2004] do), this study compares them indirectly by examining their behaviour with respect to the upper and the lower bound. For instance, although T(EXPPF.KP) and T(EXPPF.BFP) might not be significantly different from each other, one score could be significantly different from T(EXPEXP) (upper bound) and/or T(EXPRB) (lower bound) while the other is not.
We identify the best metrics in this study as the ones whose average distance from the experts (i) is significantly greater than the lower bound and (ii) does not differ significantly from the upper bound.¹⁰
7 Results
7.1 Distances between the expert pairs
As the first step in our analysis, we computed the T score for each expert pair, namely T(E0E1), T(E0E2), T(E0E3), T(E1E2), T(E1E3) and T(E2E3). Then we performed all 15 pairwise comparisons between them using the Tukey test, the results of which are summarised in Table 1.¹¹
The cells in the Table report the level of significance returned by the Tukey test when the difference between two distances exceeds the critical difference (CD). Significance beyond the 0.05 threshold is reported with one asterisk (*), while significance beyond the 0.01 threshold is reported with two asterisks (**). A cell remains empty when the difference between two distances does not exceed the critical difference. For example, the value of T(E0E1) is 0.692 and the value of T(E0E3) is 0.258. Since their difference exceeds the CD at the 0.01 threshold, it is reported to be significant beyond that level by the Tukey test, as shown in the top cell of the third column in Table 1.
¹⁰ Criterion (ii) can only be applied provided that the average distance between the experts and at least one metric Mx is found to be significantly lower than T(EXPEXP). Then, if the average distance between the experts and another metric My does not differ significantly from T(EXPEXP), My performs better than Mx.
¹¹ The Table also reports the result of the omnibus ANOVA, which is significant: F(5,75) = 14.931, p < 0.000.
              E0E2   E1E2   E0RB   E1RB   E2RB
E0E1: 0.692                 **     **     **
E0E2: 0.717                 **     **     **
E1E2: 0.758                 **     **     **
E0RB: 0.323
E1RB: 0.347
E2RB: 0.352
CD at 0.01: 0.242; CD at 0.05: 0.202
F(5,75) = 18.762, p < 0.000
              E1E3   E2E3   E3RB
E0E3: 0.258
E1E3: 0.300
E2E3: 0.192
E3RB: 0.302
CD at 0.01: 0.219; CD at 0.05: 0.177
F(3,45) = 1.223, p = 0.312
Table 2: Comparison of distances between the experts (E0, E1, E2, E3) and the random baseline (RB)
As the Table shows, the T scores for the distance between E0 and E1 or E2, i.e. T(E0E1) and T(E0E2), as well as the T for the distance between E1 and E2, i.e. T(E1E2), are quite high which indicates that on average the orderings of the three experts are quite close to each other. Moreover, these T scores are not significantly different from each other which suggests that E0, E1 and E2 share quite a lot of common ground in the ordering task. Hence, E0 is found to give rise to similar orderings to the ones of E1 and E2.
However, when any of the previous distances is compared with a distance that involves the orderings of E3 the difference is significant, as shown by the cells containing two asterisks in Table 1. In other words, although the orderings of E1 and E2 seem to deviate from each other and the orderings of E0 to more or less the same extent, the orderings of E3 stand much further away from all of them. Hence, there exists a “stand-alone” expert among the ones consulted in our studies, yet this is not E0 but E3.
This finding can be easily explained by the fact that by contrast to the other three experts, E3 followed a very schematic way for ordering sentences. Because the orderings of E3 manifest rather peculiar strategies, at least compared to the orderings of E0, E1 and E2, the upper bound of the analysis, i.e. the average distance between the expert pairs T(EXPEXP), is computed without taking into account these orderings:
(2) $T(\mathrm{EXP}\,\mathrm{EXP}) = 0.722 = \frac{T(E_0E_1) + T(E_0E_2) + T(E_1E_2)}{3}$
7.2 Distances between the experts and RB
As the upper part of Table 2 shows, the T score between any two experts other than E3 is significantly greater than their distance from RB beyond the 0.01 threshold. Only the distances between E3 and another expert, shown in the lower section of Table 2, are not significantly different from the distance between E3 and RB.
Although this result does not mean that the orders of E3 are similar to the orders of RB,¹² it shows that E3 is roughly as far away from e.g. E0 as she is from RB. By contrast, E0 stands significantly closer to E1 than to RB, and the same holds for the other distances in the upper part of the Table. In accordance with the discussion in the previous section, the lower bound, i.e. the overall average distance between the experts (excluding E3) and RB T(EXPRB), is computed as shown in (3):
(3) $T(\mathrm{EXP}\,\mathrm{RB}) = 0.341 = \frac{T(E_0\,\mathrm{RB}) + T(E_1\,\mathrm{RB}) + T(E_2\,\mathrm{RB})}{3}$
7.3 Distances between the experts and each metric
So far, E3 was identified as a “stand-alone” expert standing further away from the other three experts than they stand from each other. We also identified the distance between E3 and each expert as similar to her distance from RB.
Similarly, E3 was found to stand further away from the metrics compared to their distance from the other three experts.¹³ This result gives rise to the set of formulas in (4) for calculating the overall average distance between the experts (excluding E3) and each metric.
(4.1) $T(\mathrm{EXP}\,\mathrm{PF.BFP}) = 0.629 = \frac{T(E_0\,\mathrm{PF.BFP}) + T(E_1\,\mathrm{PF.BFP}) + T(E_2\,\mathrm{PF.BFP})}{3}$
(4.2) $T(\mathrm{EXP}\,\mathrm{PF.KP}) = 0.571 = \frac{T(E_0\,\mathrm{PF.KP}) + T(E_1\,\mathrm{PF.KP}) + T(E_2\,\mathrm{PF.KP})}{3}$
(4.3) $T(\mathrm{EXP}\,\mathrm{PF.NOCB}) = 0.606 = \frac{T(E_0\,\mathrm{PF.NOCB}) + T(E_1\,\mathrm{PF.NOCB}) + T(E_2\,\mathrm{PF.NOCB})}{3}$
(4.4) $T(\mathrm{EXP}\,\mathrm{M.NOCB}) = 0.487 = \frac{T(E_0\,\mathrm{M.NOCB}) + T(E_1\,\mathrm{M.NOCB}) + T(E_2\,\mathrm{M.NOCB})}{3}$
In the next section, we present the concluding analysis for this study which compares the overall distances in formulas (2), (3) and (4) with each other. As we have already mentioned, T(EXPEXP) serves as the upper bound of the analysis whereas T(EXPRB) is the lower bound. The aim is to specify which scores in (4) are significantly greater than T(EXPRB), but not significantly lower than T(EXPEXP).
7.4 Concluding analysis
The results of the comparisons of the scores in (2), (3) and (4) are shown in Table 3. As the top cell in the last column of the Table shows, the T score between the experts and RB, T(EXPRB), is significantly lower than the average distance between the expert pairs, T(EXPEXP), at the 0.01 level. This result verifies one of our main predictions showing that the orderings of the experts (modulo E3) stand much closer to each other compared to their distance from randomly assembled orderings.
¹² This could have been argued, if the value of T(E3RB) had been much closer to 1.
¹³ Due to space restrictions, we cannot report the scores for these comparisons here. The reader is referred to Table 9.4 on page 175 of Chapter 9 in [Karamanis, 2003].
As expected, most of the scores that involve the metrics are not significantly different from each other, except for T(EXPPF.BFP) which is significantly greater than T(EXPM.NOCB) at the 0.05 level. Yet, what we are mainly interested in is how the distance between the experts and each metric compares with T(EXPEXP) and T(EXPRB). This is shown in the first row and the last column of Table 3.
Crucially, T(EXPRB) is significantly lower than T(EXPPF.BFP) as well as T(EXPPF.NOCB) and T(EXPPF.KP) at the 0.01 level. Notably, even the distance of the experts from M.NOCB, T(EXPM.NOCB), is significantly greater than T(EXPRB), albeit at the 0.05 level. These results show that the distance from the experts is significantly reduced when using the best scoring orderings of any metric, even M.NOCB, instead of the orderings of RB. Hence, all metrics score significantly better than RB in this experiment.
However, simply using M.NOCB to output the best scoring orders is not enough to yield a distance from the experts which is comparable to T(EXPEXP). Although the PF constraint appears to help towards this direction, T(EXPPF.KP) remains significantly lower than T(EXPEXP), whereas T(EXPPF.NOCB) falls only 0.009 points short of CD at the 0.05 threshold. Hence, PF.BFP is the most robust metric, as the difference between T(EXPPF.BFP) and T(EXPEXP) is clearly not significant.
Finally, the difference between T(EXPPF.NOCB) and T(EXPM.NOCB) is only 0.006 points away from the CD at the 0.05 threshold. This result shows that the distance from the experts is reduced to a great extent when the best scoring orderings are computed according to PF.NOCB instead of simply M.NOCB. Hence, this experiment provides additional evidence in favour of enhancing M.NOCB with the PF constraint of coherence, as suggested in [Karamanis, 2003].
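The comparisons in this section can be retraced from the reported scores and critical differences alone. The sketch below does not rerun the Tukey test; it only applies the reported CD values (0.150 at the 0.01 level, 0.125 at the 0.05 level) to the reported T scores. The dictionary layout and the names `significance` and `passes_both_criteria` are assumptions.

```python
# Overall T scores as reported in formulas (2), (3) and (4).
T = {
    "EXPEXP": 0.722,
    "PF.BFP": 0.629,
    "PF.NOCB": 0.606,
    "PF.KP": 0.571,
    "M.NOCB": 0.487,
    "RB": 0.341,
}
CD_01, CD_05 = 0.150, 0.125  # reported critical differences

def significance(a, b):
    """Return '**', '*' or '' depending on which CD the difference exceeds."""
    diff = round(abs(T[a] - T[b]), 3)  # rounding avoids float noise
    if diff > CD_01:
        return "**"
    if diff > CD_05:
        return "*"
    return ""

def passes_both_criteria(metric):
    """(i) significantly above the lower bound, (ii) not significantly
    different from the upper bound."""
    return significance(metric, "RB") != "" and significance(metric, "EXPEXP") == ""

metrics = ["PF.BFP", "PF.NOCB", "PF.KP", "M.NOCB"]
survivors = [m for m in metrics if passes_both_criteria(m)]
```

Read this way, PF.BFP and PF.NOCB both satisfy the two criteria, but PF.NOCB does so only narrowly (0.722 − 0.606 = 0.116 against a CD of 0.125), which is consistent with the text singling out PF.BFP as the most robust metric.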
8 Discussion
A question not addressed by previous studies making use of a certain collection of orderings of facts is whether the strategies reflected there are specific to E0, the expert who created the dataset. In this paper, we address this question by enhancing E0’s dataset with orderings provided by three additional experts. Then, the distance between E0 and her colleagues is computed and compared to the distance between the other expert pairs. The results indicate that E0 shares a lot of common ground with two of her colleagues in the ordering task, deviating from them as much as they deviate from each other, while the orderings of a fourth “stand-alone” expert are found to manifest rather individualistic strategies.
The same variable used to investigate the distance between the experts is employed to automatically evaluate the best scoring orderings of some of the best performing metrics in [Karamanis et al., 2004]. Despite its limitations due to the necessarily restricted size of the employed dataset, this evaluation task allows us to explore the previously unaddressed possibility that there exist many good solutions for TS in the employed domain.
                     EXPPF.BFP   EXPPF.NOCB   EXPPF.KP   EXPM.NOCB   EXPRB
EXPEXP: 0.722                                 **         **          **
EXPPF.BFP: 0.629                                         *           **
EXPPF.NOCB: 0.606                                                    **
EXPPF.KP: 0.571                                                      **
EXPM.NOCB: 0.487                                                     *
EXPRB: 0.341
CD at 0.01: 0.150; CD at 0.05: 0.125
F(5,75) = 19.111, p < 0.000
Table 3: Results of the concluding analysis comparing the distance between the expert pairs (EXPEXP) with the distance between the experts and each metric (PF.BFP, PF.NOCB, PF.KP, M.NOCB) and the random baseline (RB)
Out of a much larger set of possibilities, 10 metrics were evaluated in [Karamanis et al., 2004], only a handful of which were found to overtake two simple baselines. The additional test in this study carries on the elimination process by pointing out PF.BFP as the single most promising metric to be used for TS in the explored domain, since this is the metric that manages to clearly survive both tests.
Equally crucially, our analysis shows that all employed metrics are superior to a random baseline. Additional evidence in favour of the PF constraint on coherence introduced in [Karamanis, 2003] is provided as well. The general evaluation methodology as well as the specific results of this study will be useful for any subsequent attempt to automatically evaluate a TS approach using a corpus of sentence orderings defined by many experts.
As [Reiter and Sripada, 2002] suggest, the best way to treat the results of a corpus-based study is as hypotheses which eventually need to be integrated with other types of evaluation. Although we followed the ongoing argumentation that using perceptual experiments to choose between many possible metrics is unfeasible, our efforts have resulted in a single preferred candidate which is much easier to evaluate with the help of psycholinguistic techniques (instead of having to deal with a large number of metrics from very early on). This is indeed our main direction for future work in this domain.
Acknowledgments
We are grateful to Aggeliki Dimitromanolaki for entrusting us with her data and for helpful clarifications on their use; to Mirella Lapata for providing us with the scripts for the computation of τ together with her extensive and prompt advice; to Katerina Kolotourou for her invaluable assistance in recruiting the experts; and to the experts for their participation. This work took place while the first author was studying at the University of Edinburgh, supported by the Greek State Scholarship Foundation (IKY).
References
[Barzilay and Lee, 2004] Regina Barzilay and Lillian Lee. Catching the drift: Probabilistic content models with applications to generation and summarization. In Proceedings of HLT-NAACL 2004, pages 113–120, 2004.
[Barzilay et al., 2002] Regina Barzilay, Noemie Elhadad, and Kathleen McKeown. Inferring strategies for sentence ordering in multidocument news summarization. Journal of Artificial Intelligence Research, 17:35–55, 2002.
[Brennan et al., 1987] Susan E. Brennan, Marilyn A. Friedman [Walker], and Carl J. Pollard. A centering approach to pronouns. In Proceedings of ACL 1987, pages 155–162, Stanford, California, 1987.
[Dimitromanolaki and Androutsopoulos, 2003] Aggeliki Dimitromanolaki and Ion Androutsopoulos. Learning to order facts for discourse planning in natural language generation. In Proceedings of the 9th European Workshop on Natural Language Generation, Budapest, Hungary, 2003.
[Howell, 2002] David C. Howell. Statistical Methods for Psychology. Duxbury, Pacific Grove, CA, 5th edition, 2002.
[Isard et al., 2003] Amy Isard, Jon Oberlander, Ion Androutsopoulos, and Colin Matheson. Speaking the users’ languages. IEEE Intelligent Systems Magazine, 18(1):40–45, 2003.
[Karamanis and Manurung, 2002] Nikiforos Karamanis and Hisar Maruli Manurung. Stochastic text structuring using the principle of continuity. In Proceedings of INLG 2002, pages 81–88, Harriman, NY, USA, July 2002.
[Karamanis and Mellish, 2005] Nikiforos Karamanis and Chris Mellish. A review of recent corpus-based methods for evaluating text structuring in NLG. 2005. Submitted to Using Corpora for NLG workshop.
[Karamanis et al., 2004] Nikiforos Karamanis, Chris Mellish, Jon Oberlander, and Massimo Poesio. A corpus-based methodology for evaluating metrics of coherence for text structuring. In Proceedings of INLG04, pages 90–99, Brockenhurst, UK, 2004.
[Karamanis, 2003] Nikiforos Karamanis. Entity Coherence for Descriptive Text Structuring. PhD thesis, Division of Informatics, University of Edinburgh, 2003.
[Kibble and Power, 2000] Rodger Kibble and Richard Power. An integrated framework for text planning and pronominalisation. In Proceedings of INLG 2000, pages 77–84, Israel, 2000.
[Lapata, 2003] Mirella Lapata. Probabilistic text structuring: Experiments with sentence ordering. In Proceedings of ACL 2003, pages 545–552, Sapporo, Japan, July 2003.
[McKeown, 1985] Kathleen McKeown. Text Generation: Using Discourse Strategies and Focus Constraints to Generate Natural Language Text. Studies in Natural Language Processing. Cambridge University Press, 1985.
[Reiter and Sripada, 2002] Ehud Reiter and Somayajulu Sripada. Should corpora texts be gold standards for NLG? In Proceedings of INLG 2002, pages 97–104, Harriman, NY, USA, July 2002.