Daqing He1, Jianqiang Wang2, Douglas W. Oard2, Michael Nossal1
1
Institute for Advanced Computer Studies, University of Maryland, College Park, MD 20742 USA
{daqingd,nossal}@umiacs.umd.edu
2
College of Information Studies & Institute for Advanced Computer Studies, University of Maryland, College Park, MD 20742 USA
{wangjq,oard}@glue.umd.edu
Abstract. For the 2002 Cross-Language Evaluation Forum Interactive Track, the University of Maryland team focused on query formulation and reformulation. Twelve people performed a total of forty-eight searches in the German document collection using English queries. Half of the searches were with user-assisted query translation, and half with fully automatic query translation. For the user-assisted query translation condition, participants were provided two types of cues about the meaning of each translation: a list of other terms with the same translation (potential synonyms), and a sentence in which the word was used in a translation-appropriate context. Four searchers performed the official iCLEF task, the other eight searched a smaller collection. Searchers performing the official task were able to make more accurate relevance judgments with user-assisted query translation for three of the four topics. We observed that the number of query iterations seems to vary systematically with topic, system, and collection, and we are analyzing query content and ranked retrieval measures to obtain further insight into these variations in search behavior.
1 Introduction
Interactive Cross-Language Information Retrieval (CLIR) is an iterative process in which searcher and system collaborate to find documents that satisfy an information need, regardless of whether they are written in the same language as the query. Humans and machines bring complementary strengths to this process. Machines are excellent at repetitive tasks that are well specified; humans bring creativity and exceptional pattern recognition capabilities. Properly coupling these capabilities can result in a synergy that greatly exceeds the ability of either human or machine alone. The design of the fully automated components to support cross-language searching (e.g., structured query translation and ranked retrieval) has been well researched, but achieving true synergy requires that the machine also provide tools that will allow its human partners to exercise their skills to the greatest possible degree. Such tools are the focus of our work in the Cross-Language Evaluation Forum’s (CLEF) interactive track (iCLEF). In 2001, we began by exploring support for document selection [5]. This year, our focus is on query formulation.
Cross-language retrieval techniques can generally be classified as query translation, document translation, or interlingual designs [2]. We adopted a query translation design because the query translation stage provides an additional interaction opportunity not present in document-translation-based systems. Our searchers first formulate a query in English, then the system translates that query into the document language (German, in our case). The translated query is used to search the document collection, and a ranked list of document surrogates (first 40 words, in our case) is displayed. The searcher can examine individual documents, and can optionally repeat the process by reformulating the query. Although there are only three possible interaction points (query formulation, query translation, and document selection), the iterative nature of the process introduces significant complexity. We therefore performed extensive exploratory data analysis to understand how searchers employ the systems that we provided.
Our study was motivated by the following questions:
These questions are, of course, far too broad to be answered completely by any single experiment. For the experiments reported in this paper, we chose to provide our searchers with two variants on a single retrieval system, one with support for interaction during query translation (which we call “manual”), and the other with fully automatic query translation (which we call “auto”). This design allowed us to test a hypothesis derived from our third question above. We relied on observations, questionnaires, semi-structured interviews, and exploratory data analysis to augment the insight gained through hypothesis testing, and to begin our exploration of the other questions.
In the next section, we describe the design of our system. Section 3 then describes our experiment, and Section 4 presents the results that we obtained. Section 5 concludes the paper with a brief discussion of future work.
2 System Design
In this section, we describe the resources that we used, the design of our cross-language retrieval system, and our user interface design.
We chose English as the query language and German as the document language because our population of potential searchers was generally skilled in English but not German. The full German document collection contained 71,677 news stories from the Swiss News Agency (SDA) and 13,979 news stories from Der Spiegel. We used the German-to-English translations provided by the iCLEF organizers for construction of document surrogates (for display in a ranked list) and for display of full document translations (when selected for viewing by the searcher). The translations were created using Systran Professional 3.0.
We obtained a German-English bilingual term list from the Chemnitz University of Technology (footnote 3), and used the German stemmer from the “snowball” project (footnote 4). Our Keyword in Context (KWIC) technique requires parallel (i.e., translation-equivalent) German/English texts; we obtained those from the Foreign Broadcast Information Service (FBIS) TIDES data disk, release 2.
We used the InQuery text retrieval system (version 3.1p1) from the University of Massachusetts, along with locally implemented extensions to support cross-language retrieval between German and English. We used Pirkola’s structured query technique for query translation [4], which aggregates German term frequencies and document frequencies separately before computing the weight for each English query term. This tends to suppress the contribution to the ranking computations of those English terms that have at least one translation that is a common German word (i.e., that occurs in many documents). For the automatic condition, all known translations were used. For the manual condition, only translations selected by the searcher were used. We employed a backoff translation strategy to maximize the coverage of the bilingual term list [3]. If no translation was found for the surface form of an English term, we stemmed the term (using the Porter stemmer) and tried again. If that failed, we also stemmed the English side of the bilingual term list and tried a third time. If that still failed, we treated the untranslated term as its own translation in the hope that it might be a proper name.
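The four-stage backoff just described can be illustrated with a minimal sketch. It is not the code used in the experiments: the function names, the toy term list, and the stage labels are hypothetical, and `crude_stem` is only a stand-in for the Porter stemmer so that the example stays self-contained.

```python
def crude_stem(word):
    """Stand-in for the Porter stemmer (assumption): strips a few common suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def translate_with_backoff(term, term_list, stem=crude_stem):
    """Return (translations, stage) for one English query term.

    term_list maps an English term to a set of German translations.
    """
    # Stage 1: look up the surface form as typed.
    if term in term_list:
        return sorted(term_list[term]), "surface"
    # Stage 2: stem the query term and look it up again.
    stemmed = stem(term)
    if stemmed in term_list:
        return sorted(term_list[stemmed]), "stemmed-term"
    # Stage 3: also stem the English side of the term list.
    matches = set()
    for english, german in term_list.items():
        if stem(english) == stemmed:
            matches.update(german)
    if matches:
        return sorted(matches), "stemmed-list"
    # Stage 4: keep the term untranslated; it may be a proper name.
    return [term], "untranslated"


# Toy term list, invented for illustration only.
TERM_LIST = {
    "treasure": {"Schatz"},
    "hunt": {"Jagd", "jagen"},
    "genes": {"Gene"},
}

for query_term in ("treasure", "hunting", "gene", "Maryland"):
    print(query_term, translate_with_backoff(query_term, TERM_LIST))
```

Returning the stage alongside the translations is a design choice of this sketch; it makes it easy to see how often each backoff step is actually needed for a given term list.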
For our automatic condition, we adopted an interface design similar to that of present Web search engines. Searchers entered English query terms in a one-line text field, based on their understanding of a full CLEF topic description (title, description, and narrative). We provided that topic description on paper in order to encourage a more natural query formulation process than might have been the case if cut-and-paste from the topic description were available. When the search button was clicked, a ranked list of document surrogates was displayed below the query field, thus allowing the query to serve as context when interpreting the ranked list. Ten surrogates were displayed simultaneously as a page, and up to 10 pages (100 surrogates in total) could be viewed by clicking the “next” button. Our surrogates consisted of the first 40 words in the TEXT field of the translated document. English words in the surrogate that shared a common stem with any query term (using the Porter stemmer) were highlighted in red. See Figure 1 for an illustration of the automatic user interface.
3 http://dict.tuchemnitz.de/
4 http://snowball.sourceforge.net
Fig. 1. User interface, automatic condition.
Each surrogate is labeled with a numeric rank (1, 2, 3, ...), which is displayed as a numbered button to the left of the surrogate. If the searcher selected the button, the full text of that document would be displayed in a separate window, with query terms highlighted in the same manner. In order to provide context, we repeated the numeric rank and the surrogate at the top of the document examination window. Figure 2 illustrates a document examination window.
We collected three types of information about relevance judgments. First, searchers could indicate whether the document was not relevant (“N”), somewhat relevant (“S”), or highly relevant (“H”). A fourth value, “?” (indicating unjudged), was initially selected by the system. Second, searchers could indicate their degree of confidence in their judgment as low (“L”), medium (“M”), or high (“H”), with a fourth value (“?”) being initially selected by the system.
Both relevance judgments and confidence values were recorded incrementally in a log file. Searchers could record relevance judgments and confidence values in either the main search window or in a document examination window (when that window was displayed). Finally, we recorded the times at which documents were selected for examination and the times at which relevance judgments for those documents were recorded. This allowed us to later compute the (approximate) examination time for each document. For documents that were judged without examination (e.g., based solely on the surrogate), we assigned zero as the examination time.
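As a rough illustration of how examination times might be derived from such a log, consider the following sketch. The event format (timestamp in seconds, document identifier, event kind) and the rule that a later judgment overwrites an earlier one are assumptions made for this example; the actual log format is not described here.

```python
def examination_times(events):
    """Compute approximate examination time per document.

    events: iterable of (timestamp_seconds, doc_id, kind), where kind is
    "open" (document selected for examination) or "judge" (relevance
    judgment recorded). This tuple layout is hypothetical.
    """
    opened = {}  # doc_id -> time the document was most recently opened
    times = {}   # doc_id -> examination time in seconds
    for timestamp, doc_id, kind in sorted(events):
        if kind == "open":
            opened[doc_id] = timestamp
        elif kind == "judge":
            start = opened.get(doc_id)
            # Judged from the surrogate alone: examination time is zero.
            times[doc_id] = 0.0 if start is None else timestamp - start
    return times


# Toy log, invented for illustration only.
LOG = [
    (10.0, "doc-a", "open"),
    (25.0, "doc-a", "judge"),
    (30.0, "doc-b", "judge"),
]
print(examination_times(LOG))
```

The result is only approximate, as noted above, because the interval between opening and judging a document may include time spent on other activities.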
Fig. 2. Document examination window.
For the manual interface, we used a variant of the same interface with two additional items: 1) term-by-term control over the query translation process, and 2) a summary of the translations chosen for all query terms. We used a tabbed pane to allow the user to examine alternative translations for one English query term at a time. Each possible translation was shown on a separate line, and a checkbox to the left of each line allowed the user to deselect or reselect that translation. All translations were initially selected, so the manual and automatic conditions would be identical if the user did not deselect any translation.
Since we designed our interface to support searchers with no knowledge of German, we provided cues in English about the meaning of each German translation. For these experiments, searchers were able to view two types of cues: (1) back translation, and (2) Keyword In Context (KWIC). Each was created automatically, using techniques described below. Searchers were able to alternate between the two types of cues using tabs. The query translation summary area provided additional context for interpretation of the ranked list, simultaneously showing all selected translations (with one back translation each). In order to emphasize that two steps were involved (query translation, followed by search), we provided both “translate query” and “search” buttons. All other functions
Fig. 4. Back translations of “religious.”
Back Translation. Ideally, we would prefer to provide the searcher with English definitions for each German translation alternative. Dictionaries with these types of definitions do exist for some language pairs (although rights management considerations may limit their availability in electronic form), but bilingual term lists are much more easily available. What we call “back translations” are English terms that share a specific German translation, something that we can determine with a simple bilingual term list. For example, the English word religious has several German translations in the term list that we used, two of which are fromm and gewissenhaft. Looking in the same term list for cues to the meaning of fromm, we see that it can be translated into English as religious, godly, pious, piously, or godlier. Thus fromm seems to clearly correspond to the literal use of religious. By contrast, gewissenhaft’s back translations are religious, sedulous, precise, conscientious, faithful, or conscientiousness. This seems as if it might correspond with a more figurative use of religious, as in “he rode his bike to work religiously.” Of course, many German translations will themselves have multiple senses, so detecting a reliable signal in the noisy cues provided by back translation sometimes requires common-sense reasoning. Fortunately, that is a task for which humans are uniquely well suited. The original English term will always be its own back translation, so we suppress its display. Sometimes this results in an empty (and therefore uninformative) set of back translations. Figure 4 shows the back translation display for “religious” in our manual condition.
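The back translation lookup can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, the term list is a small subset of the entries quoted above, and the pair ("religious", "religiös") is invented here solely to show the empty case.

```python
from collections import defaultdict


def build_indexes(pairs):
    """Index (english, german) term-list pairs in both directions."""
    en_to_de = defaultdict(set)
    de_to_en = defaultdict(set)
    for english, german in pairs:
        en_to_de[english].add(german)
        de_to_en[german].add(english)
    return en_to_de, de_to_en


def back_translations(term, en_to_de, de_to_en):
    """Map each German translation of `term` to its other English translations.

    The original term is always its own back translation, so it is suppressed.
    """
    return {
        german: sorted(de_to_en[german] - {term})
        for german in sorted(en_to_de.get(term, ()))
    }


PAIRS = [
    ("religious", "fromm"), ("godly", "fromm"),
    ("pious", "fromm"), ("piously", "fromm"),
    ("religious", "gewissenhaft"), ("sedulous", "gewissenhaft"),
    ("precise", "gewissenhaft"), ("conscientious", "gewissenhaft"),
    ("faithful", "gewissenhaft"),
    ("religious", "religiös"),  # invented entry: yields an empty cue
]

EN_TO_DE, DE_TO_EN = build_indexes(PAIRS)
for german, cues in back_translations("religious", EN_TO_DE, DE_TO_EN).items():
    print(german, cues)
```

The last entry illustrates the uninformative case mentioned above: once the original term is suppressed, nothing remains to display.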
Fig. 5. Constructing cross-language KWIC using a sentence-aligned parallel corpus.
Keyword in Context. One way to compensate for the weaknesses of back translation is to draw additional evidence from examples of usage. In keeping with the common usage in monolingual contexts [1], we call this approach “keyword in context” or “KWIC.” For each German translation of an English term, our goal is to find a brief passage in which the English term is used in a manner appropriate to the translation in question. To do this, we started with a collection of document pairs that are translations of each other. We used German news stories that had previously been manually translated into English by the Foreign Broadcast Information Service (FBIS) and distributed as a standard research corpus. We segmented the FBIS documents into sentences using rule-based software based on punctuation and capitalization patterns, and then produced aligned sentence pairs using the GSA algorithm (which uses dynamic programming to discover a plausible mapping of sentences within a pair of documents based upon known translation relationships from the bilingual term list, sentence lengths, and relative positions in each document). We presented the entire English sentence, favoring the shortest one if multiple sentence pairs contained the same English term (footnote 5).
Formally, let t_e be an English term for which we seek an example of usage, and let t_g be the German translation from the bilingual term list that is of interest. Let S_e and S_g be the shortest pair of sentences that contain t_e and t_g respectively. We then present S_e as the example of usage for translation t_g. Figure 5 illustrates this process.
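The selection rule above can be sketched as follows. This is a simplified illustration, not the implementation used in the experiments: it assumes that sentence alignment has already been done, matches terms by exact lowercase token (no stemming), and measures the length of a pair by the number of tokens in the English sentence. The function names and the sentence pairs are invented for the example.

```python
import re


def tokens(sentence):
    """Lowercased word tokens (a simplifying assumption about tokenization)."""
    return re.findall(r"\w+", sentence.lower())


def kwic_example(t_e, t_g, aligned_pairs):
    """Return the English sentence S_e from the shortest aligned pair
    (S_e, S_g) such that S_e contains t_e and S_g contains t_g."""
    candidates = [
        (english, german)
        for english, german in aligned_pairs
        if t_e.lower() in tokens(english) and t_g.lower() in tokens(german)
    ]
    if not candidates:
        return None  # no usage example available for this translation
    english, _ = min(candidates, key=lambda pair: len(tokens(pair[0])))
    return english


# Invented sentence pairs, for illustration only.
ALIGNED = [
    ("The old man was religious and kind to everyone.",
     "Der alte Mann war fromm und freundlich zu allen."),
    ("He is religious.", "Er ist fromm."),
    ("She kept religious records of every payment.",
     "Sie führte gewissenhaft Buch über jede Zahlung."),
]

print(kwic_example("religious", "fromm", ALIGNED))
print(kwic_example("religious", "gewissenhaft", ALIGNED))
```

Requiring the German translation to appear in the aligned German sentence is what ties the English example to the intended sense of the term.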
3 Experiment Design
Our experiment is designed to test the utility of user-assisted query translation in an interactive cross-language retrieval system. We were motivated to explore this question by two potential benefits that we foresaw:
Formally, we sought to reject the null hypothesis that there is no difference between the F_{α=0.8} values achieved using the automatic and manual systems. The F measure is an outcome measure, however, and we were also interested in understanding process issues. We used exploratory data analysis to improve our understanding of how the searchers used the cues we provided.
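For readers unfamiliar with the measure, the following sketch computes an F value with α = 0.8. It assumes the van Rijsbergen form F_α = 1 / (α/P + (1 − α)/R), in which α weights precision P against recall R; the formula is not spelled out in this paper, so treat it as an assumption, and the function name is hypothetical.

```python
def f_alpha(precision, recall, alpha=0.8):
    """van Rijsbergen F measure (assumed form); alpha > 0.5 favors precision."""
    if precision <= 0.0 or recall <= 0.0:
        return 0.0  # convention used here when either component is zero
    return 1.0 / (alpha / precision + (1.0 - alpha) / recall)


# Swapping precision and recall shows the emphasis on precision.
print(f_alpha(1.0, 0.25))  # high precision, low recall
print(f_alpha(0.25, 1.0))  # low precision, high recall
```

With α = 0.8, a searcher who selects only a few documents, all of them relevant, scores higher than one who finds every relevant document among many irrelevant ones, which matches the instruction to emphasize precision over recall.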
We followed the standard protocol for iCLEF 2002 experiments. Searchers were sequentially given four topics (stated in English), two for use with the manual system and two for use with the automatic system. Presentation order for topics and system was varied systematically across searchers as specified in the track guidelines. After an initial training session, they were given 20 minutes for each search to identify relevant documents using the radio buttons provided for that purpose in our user interface. The searchers were asked to emphasize precision over recall (by telling them that it was more important that the documents that they selected be truly relevant than that they find every possible relevant document). We asked each searcher to fill out brief questionnaires before the first search (for demographic data), after each search, and after using each system. Each searcher used the same system at a different time, so we were able to observe each individually and make extensive observational notes. We also conducted a semi-structured interview (in which we tailored our questions based on our observations) after all searches were completed.
5 We did not highlight the query term in the current version due to time constraints. Another limitation of the current implementation is that a briefer passage may serve our purpose better in some cases.
We conducted a pilot study with a single searcher (umd01) to exercise our new system and refine our data collection procedures. Eight searchers (umd02-umd09) then performed the experiment using the eight-subject design specified in the track guidelines (footnote 6). While preparing our results for submission, we noticed that no SDA document appeared in any ranked list. Investigation revealed that InQuery had failed to index those documents because we had not configured the SGML parsing correctly for that collection. We therefore corrected that problem, recruited four new searchers (umd10-umd13), and repeated the experiment, this time using the four-subject design specified in the track guidelines.
We submitted all twelve runs for use in forming relevance pools, but designated the second experiment as our official submission because the first experiment did not comply with one requirement of the track guidelines (the collections to be searched). Our results from the first experiment are, however, interesting for several reasons. First, it turned out that topic 3 had no relevant documents in the collection searched in the first experiment (footnote 7). This happens in real applications, of course, but the situation is rarely studied in information retrieval experiments because the typical evaluation measures are unable to discriminate between systems when no relevant documents exist. Second, the number of relevant documents for the remaining three topics was smaller in the first experiment than in the second. This provided an opportunity to study the effect of collection characteristics on searcher behavior.
For convenience, we refer to the first experiment as the small collection experiment, and the second as the large collection experiment.
We computed the following measures in order to gain insight into search behavior and search results:
6 http://terral.lsi.uned.es/iclef/2002/
7 In this paper, we number the topics 1, 2, 3, and 4 in keeping with the track guidelines. These correspond to CLEF topic numbers c053, c065, c056 and c080, respectively.
The set-oriented measures (strict and loose F) are designed to characterize end-to-end task performance using the system. The rank-oriented measures (MAP, MAPS, MAPL and MAPR) are designed to offer indirect insight into the query formulation process by characterizing the effect of a query based on the density of relevant documents near the top of the ranked list produced for that query (or for queries up through that iteration, viewed either from the subject’s own sense of performance, in the case of MAPS and MAPL, or from actual performance, in the case of MAPR). Examination time is intended for use in conjunction with relevance judgment categories, in order to gain some insight into the relevance judgment process. We have not yet finished our trajectory analysis or the analysis of examination duration, so in this paper we report results only for the final values of F_{α=0.8} and for the number of iterations.
4 Results
Our searcher population was relatively homogeneous. Specifically, they were:
Affiliated with a university. Every one of our searchers was a student, staff member or faculty member at the University of Maryland.
Highly educated. Ten of the 12 searchers were either enrolled in a Masters degree program or had earned a Masters degree or higher. The remaining two were undergraduate students, and both were in the small collection experiment.
Mature. The average age over all 12 searchers was 31, with the youngest being 19 and the oldest being 43. The average age of the four searchers in the large collection experiment was 32.
Mostly female. There were three times as many female searchers as males, both overall and in the large collection experiment.
Experienced searchers. Six of the 12 searchers held degrees in library science. The searchers reported an average of about 6 years of online searching experience, with a minimum of 4 years and a maximum of 10 years. Most searchers reported extensive experience with Web search services, and all reported at least some experience searching computerized library catalogs (ranging from “some” to “a great deal”). Eleven of the 12 reported that they search at least once or twice a day. The search experience reported by the four participants in the large collection experiment was slightly greater than that of the 12 searchers as a whole.
Not previous study participants. None of the 12 subjects had previously participated in a TREC or iCLEF study.
Inexperienced with machine translation. Nine of the 12 participants reported never having used any machine translation software or free Web translation service. The other 3 reported “very little experience” with machine translation software or services. The four participants in the large collection experiment reported the same ratio.
Native English speakers. All 12 searchers were native speakers of English.
Not skilled in German. Eight of the 12 searchers reported no reading skills in German at all. Another 3 reported poor reading skills in German, and one (umd12) reported good reading skills in German. Among the four searchers in the large collection experiment, 3 reported no German skills, with the fourth reporting good reading skills in German.
Fig. 6. F_{α=0.8}, large collection, by condition and topic.
Our official results on the large collection experiment found that the manual system achieved a 48% larger value for F_{α=0.8} than the automatic system (0.4995 vs. 0.3371). However, the difference is not statistically significant, and the most likely reason is the small sample size. The presence of a searcher with good reading skills in German is also potentially troublesome given the hypothesis that we wished to test. We have not yet conducted searcher-by-searcher analysis to determine whether searcher umd12 exhibited search behaviors markedly different from the other 11 searchers. For contrast, we recomputed the same results with loose relevance. In that case, the searchers in our large collection experiment achieved a 22% increase in F_{α=0.8} with the manual system over the automatic system (0.5095 vs. 0.4176).
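The two percentages are relative gains of the manual system over the automatic system. A short check of the arithmetic, using only the four F values reported above (the function name is hypothetical):

```python
def relative_gain(manual, automatic):
    """Relative improvement of the manual value over the automatic value."""
    return (manual - automatic) / automatic


strict_gain = relative_gain(0.4995, 0.3371)  # strict relevance
loose_gain = relative_gain(0.5095, 0.4176)   # loose relevance
print(f"strict: {strict_gain:.1%}, loose: {loose_gain:.1%}")
```

Both results round to the figures quoted in the text (about 48% and 22%).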
As Figure 6 shows, the manual system achieved the largest improvements for topics 1 (Genes and Diseases) and 4 (Hunger Strikes) with strict relevance, but the automatic system actually outperformed the manual system on topic 2 (Treasure Hunting). Loose relevance judgments exhibited a similar pattern. Searchers who were presented with topic 2 in the manual condition reported (in the questionnaire) that it was more difficult to identify appropriate translations for topic 2 than for any other topic, and searchers generally indicated that they were less familiar with topic 2 than with other topics. We have not yet completed our analysis of observational notes, so we are not able to say whether this resulted in any differences in search behavior. But it seems likely that without useful cues, searchers removed translations that they would have been better off keeping. If confirmed through further analysis, this may have implications for user training.
Fig. 7. F_{α=0.8}, small collection group, by condition.
The results of the small collection experiment shown in Figure 7 are quite different. The situation is reversed for topic 1, with automatic now outperforming manual, and topic 4 no longer discriminates between the two systems (footnote 8). Overall, the manual and automatic systems could not be distinguished using loose relevance (0.2889 vs. 0.2931), but the automatic system seemed to do better with strict relevance (0.2268 vs. 0.3206). Again, we did not find the difference to be statistically significant. The data that we have analyzed does, however, seem to suggest that our manual system is better suited to cases in which there are a substantial number of relevant documents. We plan to use this question to guide some of our further data analysis.
8 Topic 3, with no relevant documents in the small collection, is not shown.
We analyzed questionnaire data and interview responses in an effort to understand how participants employed the systems and to better understand their impressions about the systems. Questionnaire responses are on a 1-5 scale (with 1 being “not at all,” and 5 being “strongly agree”).
Searchers in the large collection experiment reported that the manual and automatic systems were equally easy to search with (average 3.5), but searchers in the small collection experiment reported that the automatic system was easier to use than the manual system (3.4 vs. 2.75).
Searchers in the large collection experiment reported an equal need to reformulate their initial queries with both systems (average 3.25), but searchers in the small collection experiment reported that this was somewhat less necessary with the automatic system (3.9 vs. 4.1). One searcher, umd07, reported that it was “extremely necessary” to reformulate queries with both systems. We noticed from this searcher’s answers to our open questions that he or she thought the query translations were “usually very poor,” and would like both systems to support Boolean queries, proximity operators and truncation so that “noise” could be removed.
Searchers in the large collection experiment reported that they were able to find relevant documents more easily using the manual system than the automatic system (4.0 vs. 3.5), but searchers in the small collection experiment had the opposite opinion (2.6 vs. 3.0).
For questions unique to the manual system, the large collection group reported positive reactions to the usefulness of user-assisted query translation (with everyone choosing a value of 4). They generally felt that it was possible to identify unintended translations (an average of 3.5), and that most of the time the system provided appropriate translations (average of 3.9).
Most participants reported that they were not familiar with the topics, with topic 3 (European Campaigns against Racism) having the most familiarity, and topics 1 and 2 having the least.
We determined the number of iterations for each search through log file analysis. In the large collection experiment, searchers averaged 9 query iterations per search across all conditions. Topic 2 had the largest number of iterations (averaging 16), and topic 4 had the fewest (averaging 6). Topics 1 and 2 exhibited little difference in the average number of iterations across systems, but topics 3 and 4 had substantially fewer iterations with the manual system. In the small collection experiment, searchers performed substantially more iterations per search than in the large collection experiment, averaging 13 iterations per search across all conditions. Topic 2 again had the greatest number of iterations (averaging 16), while topic 1 had the fewest (averaging 8).
The unexpected problem with indexing the SDA collection reduced the number of searchers that contributed to our official results, but it provided us with an extra dimension for our analysis. Searchers in the large collection and small collection experiments were generally drawn from the same population, were given the same topics, used the same systems, and performed the same tasks. The main difference is the nature of the collection that they searched, and in particular the number of relevant documents that were available to be found. Summarizing the results above from this perspective, we observed the following differences between the two experiments:
We have not yet finished our analysis, but the preponderance of the evidence that is presently available suggests that collection characteristics may be an important variable in the design of interactive CLIR systems. We believe that this factor should receive attention in future work on this subject.
5 Conclusion and future work
We focused on supporting user participation in the query translation process, and tested the effectiveness of two types of cues, back translation and keyword in context, in an interactive CLIR application. Our preliminary analysis suggests that together these cues can sometimes be helpful, but that the degree of utility that is obtained is dependent on the characteristics of the topic, the collection, and the available translation resources.
Our experiments suggest a number of promising directions for future work. First, mean average precision is a commonly reported measure for the quality of a ranked list (and, by extension, for the quality of the query that led to the creation of that ranked list). We have found that it is difficult to draw insights from MAP trajectories (variations across sequential query refinement iterations), in part because we do not yet have a good way to describe the strategies that a searcher might employ. We are presently working to characterize these strategies in a useful way, and to develop variants of the MAP measure (three of which were described above) that may offer additional insight. Second, our initial experiments with using KWIC for user-assisted query translation seem promising, but there are several things that we might improve. For example, it would be better if we could find the examples of usage in a comparable corpus (or even the Web) rather than a parallel corpus because parallel corpora are difficult to obtain. Finally, we observed far more query reformulation activity in this study than we had expected to see. Our present system provides some support for reformulation by allowing the user to see which query term translations are being used in the search. But we do not yet provide the searcher with any insight into the second half of that process: which German words correspond to potentially useful English terms that are learned by examining the translations? If we used the same resources for document translation as for query translation, this might not be a serious problem. But we don’t, so it is an issue that we need to think about how to support.
The CLEF interactive track has proven to be an excellent source of insight into both system design and experiment design. We look forward to next year’s experiments!
Acknowledgments
The authors would like to thank Julio Gonzalo and Fernando López-Ostenero for their tireless efforts to coordinate iCLEF. This work has been supported in part by DARPA cooperative agreement N660010028910.
References