ebook img

ERIC ED610008: Outlying Observation Diagnostics in Growth Curve Modeling PDF

2017·0.52 MB·English
by  ERIC
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview ERIC ED610008: Outlying Observation Diagnostics in Growth Curve Modeling

Runninghead: OUTLYINGOBSERVATIONDIAGNOSTICS 1 OutlyingObservationDiagnosticsinGrowthCurveModeling 1 XinTong 2 UniversityofVirginia 3 ZhiyongZhang 4 UniversityofNotreDame 5 Tong, X., & Zhang, Z. (2017). Outlying Observation Diagnostics in Growth Curve Modeling. Multivariate Behavioral Research, 52(6), 768–788. The research is supported by through the grant program on Statistical and Research Methodology in Education from the Institute of Education Sciences of U.S. Department of Education (R305D140037). OUTLYINGOBSERVATIONDIAGNOSTICS 2 Abstract 6 Growthcurvemodelsarewidelyusedforinvestigatinggrowthandchangephenomena. Many 7 studiesinsocialandbehavioralscienceshavedemonstratedthatdatawithoutanyoutlying 8 observationareratheranexception,especiallyfordatacollectedlongitudinally. Ignoringthe 9 existenceofoutlyingobservationsmayleadtoinaccurateorevenincorrectstatisticalinferences. 10 Therefore,itiscrucialtoidentifyoutlyingobservationsingrowthcurvemodeling. Thisstudy 11 comparativelyevaluatesixmethodsinoutlyingobservationdiagnosticsthroughaMonteCarlo 12 simulationstudyonalineargrowthcurvemodel,byvaryingfactorsofsamplesize,numberof 13 measurementoccasions,aswellasproportion,geometry,andtypeofoutlyingobservations. Itis 14 suggestedthatthegreatestchanceofsuccessindetectingoutlyingobservationscomesfromuse 15 ofmultiplemethods,comparingtheirresultsandmakingadecisionbasedonresearchpurposes. 16 Arealdataanalysisexampleisalsoprovidedtoillustratetheapplicationofthesixoutlying 17 observationdiagnosticmethods. 18 OUTLYINGOBSERVATIONDIAGNOSTICS 3 OutlyingObservationDiagnosticsinGrowthCurveModeling 19 Growthcurve(GC)models,asoneofthefundamentaltoolsfordealingwithlongitudinal 20 dataaswellasrepeatedmeasures,arefrequentlyusedforinvestigatinggrowthandchange 21 phenomenainsocial,behavioral,andeducationalsciences(e.g.,McArdleandNesselroade,2014; 22 Zhangetal.,2012). GCmodelingallowsexaminationsofintraindividualchangeovertimeas 23 wellasinterindividualvariabilityinintraindividualchange. Itisappealingnotonlybecauseofits 24 abilitytomodelchangebutalsobecauseitallowsinvestigationintotheantecedentsand 25 consequentsofchange. AmongmethodsdevelopedforGCmodeling,the 26 normal-distribution-basedmaximumlikelihood(NML)isroutinelyusedandisincorporatedin 27 almostallstatisticalsoftware. Whenasamplecomefromanormalpopulation,NMLgenerates 28 consistentandefficientparameterestimates. However,practicaldatausuallyviolatethenormal 29 assumption. Forexample,Micceri(1989)investigated440largescaledatasetsinpsychologyand 30 foundthatalmostallofthemweresignificantlynonnormal. Theoccurrenceofoutlying 31 observationsinGCmodelingisnaturallymorecommonbecauseoftheinvolvementof 32 longitudinaldata. Whendataarecontaminatedorcontainoutlyingobservations,NMLestimates 33 canbeveryinefficientorevenbiased(YuanandBentler,2001),andHeywoodcasesorimproper 34 solutionsmaybecreated(Bollen,1987). 35 Strategiestohandleoutlyingobservationshavebeendeveloped. First,sinceoutlying 36 observationscauseaproblemespeciallyencounteredinmodelsbasedonalimitednumberof 37 individuals,astraightforwardstrategyistoobservemoreindividualsinthepopulationofinterest. 38 Withmoredatacollected,theunderlyingdistributionofthesamplecanbebetterdescribed,andit 39 mayturnoutthatweobserveseveraladditionaldatawithextremevaluessothattheoriginal 40 outlyingobservationisnolongeranoutlyingobservation. Second,besidescollectingmore 41 individuals,obtainingadditionalmeasurementsforeachindividualmayalsoaccountforthe 42 outlyingobservations,becausethepresenceofmultivariateoutlyingobservationsmayindicate 43 oneormoreimportantvariableswereomittedfromthemodel(Lieberman,2005). Third,human 44 erroroftenoccursincollectingdataorprocessingtherawdata,suchaserrorsinentry,coding, 45 OUTLYINGOBSERVATIONDIAGNOSTICS 4 andtranscription,andtheseerrorsmayleadtoextremescoresononeormorevariablesinthe 46 dataset. Thus,checkingdataconsistencymightbeasolutiontodealwithoutlyingobservations. 47 Thefourthstrategyistoimprovethemodelspecification. Ifthedataareusedtoestimatetoo 48 complexmodels,oriftheparameterizationisincorrect,outlyingobservationsaremorelikelyto 49 havelargereffects. Thefifthstrategyistoconductdatatransformationordirectlyremove 50 outlyingobservationsbeforedataanalysis(seeOsborneandOverbay,2004foramorethorough 51 discussion). Sixth,insteadofdirecttransformationortruncation,researchershavedeveloped 52 variousrobustprocedurestoprotecttheirdatafrombeingdistortedbythepresenceofoutlying 53 observations. Thesemethodseitherdownweightthepotentialoutlyingobservationsasa 54 transformationtechnique(e.g.,YuanandBentler,2000;YuanandZhang,2012a)orassumethat 55 thedatacomefromcertainnonnormaldistributionssuchastdistributionoramixtureofnormal 56 distribution(e.g.,MuthénandShedden,1999;TongandZhang,2012). Amongthesestrategies, 57 thefirstfourcannotbegenerallyandeasilyapplied. Itisnotalwaysfeasibletocollectmoredata, 58 obtainadditionalmeasurements,returntorawdatatocheckconsistency,oradaptmodel 59 complexityandchangeparameterization. Inpractice,researchersusuallytransformthedataso 60 thattheyareclosetobeingnormallydistributedorsimplydeleteoutlyingobservationspriorto 61 fittingamodeltotheirdatasets. Recently,moreandmoreresearchers(e.g.,SavaleiandFalk, 62 2014;YuanandZhang,2012a)recommendedtheapplicationofrobustmethodsandstatistics. 63 Regardlessofthestrategyused,itiscrucialtoidentifyoutlyingobservationsinadatasetinthe 64 firstplaceinordertoobtainabettermodelestimationorinterprettheextremescores. Notethat 65 twomethodologieswithvaryingpurposesarerelatedtooutlyingobservationdetection. Oneis 66 sensitivityanalysiswheredataareassumedtobecorrectandwecalibratethemodelaccordingly. 67 Incontrast,wemayassumethatthemodeliscorrect. Ifthepersonfitisnotgood,the 68 correspondingcaseisidentifiedasanoutlyingobservation. Thisarticlealignswiththesecond 69 methodology. Inpsychology,confirmatorydataanalysesareoftenconductedandamodelisbuilt 70 basedonasubstantivetheory. Sowebelievethemodeltobecorrectoratleastuseful,butdata 71 canbeproblematic. Weareinterestedindetectingobservationsthataremostunlikelytooccur 72 OUTLYINGOBSERVATIONDIAGNOSTICS 5 underthehypothesizedmodel. Theoutlyingvaluesinthedatamayleadtobiasedparameter 73 estimatesforthemodelandmisleadingmodelfitindicesandteststatistics. 74 Theimportanceofoutlyingobservationdetectioninmultivariatedataanalysishasbeen 75 recognizedandvariousstudiesfordetectingmultivariateoutlyingobservationshavebeen 76 conducted(e.g. BeckerandGather,2001;Filzmoseretal.,2005;PeñaandPrieto,2001;Rocke 77 andWoodruff,1996;RousseeuwandvanZomeren,1990). Acommonlyappliedmethodinthose 78 studiesistocalculateadistance(i.e.,Mahalanobisdistance)fromeachpointtothe“center”ofthe 79 data. Anoutlyingobservationisapointwithadistancelargerthansomepredeterminedcutoff. 80 ForGCmodeling,outlyingobservationdetectionisevenmoreimportantbecausenotonlyitcan 81 helpimprovetheaccuracyandprecisionofthemodelestimation,butalsothedetectionprocedure 82 itselfisverymeaningful. Itmayhelpidentifyindividualswhobehavedifferentlyfromthe 83 majorityofthecasesinalongitudinalstudy. Furthermore,itcantellwhetheranindividual’s 84 growthpatternisdifferentfromtheoverallpatternandwhetherthisindividualonlyhasextreme 85 scoresatsometimepoints,e.g.,talentedstudentsinthelongrun,orcheatersinasingletest. 86 DespitetheincreasingpopularityofGCmodelsandthefastgrowinginterestinmultivariate 87 outlyingobservationdetection,diagnostictoolstodetectoutlyingobservationinGCmodelinglag 88 behind. Asfarasweareaware,onlyPanandFang(2002)havespecificallydiscussedhowto 89 detectoutlyingobservationsintheGCmodelingframework. AlthoughMcArdle(1998)pointed 90 outthatanindividual-levelstructuralequationmodelingapproachpermitsathoroughanalysisof 91 outliersorsubgroups,nosystematicalanalysishasbeenconducted. 92 BecauseGCmodelscanbefittedunderthestructuralequationmodelingframework 93 (MeredithandTisak,1990),modeldiagnosticmethodsinstructuralequationmodelingcanbe 94 applied. Intheframeworkofstructuralequationmodeling,BollenandArminger(1991) 95 developedaprocedureusingcase-levelresidualstoidentifyoutliers. Cadigan(1995)andLeeand 96 Wang(1996)identifiedthemostinfluentialcasesforthelikelihoodratiostatisticsbyapplyingthe 97 localperturbationprocedureofCook(1986)tostructuralequationmodeling. TheEQSsoftware 98 (Bentler,1995)identifiescasesthatcontributemosttoMardia’smeasureofmultivariatekurtosis 99 OUTLYINGOBSERVATIONDIAGNOSTICS 6 andallowsuserstodeletecasesfromanalysis. Toavoidtheso-calledmaskingeffectwherean 100 outlyingobservationsexistsbutisnotidentifiedormultipleoutlyingobservationsexistbutnotall 101 ofthemareidentified,YuanandZhong(2008)formallydefinedleverageobservationsand 102 outliersinfactoranalysisandshowedthatrobustproceduresovercomethemaskingeffect 103 associatedwithproceduresbasedonsamplemoments. YuanandHayashi(2010)thenintroduced 104 twoscatterplotsformodeldiagnosisinstructuralequationmodelingandYuanandZhang 105 (2012b)furtherdevelopedanRpackagesemdiagtoeasilydrawthetwoplots. 106 Basedonthepreviousliterature,weinvestigatesixrepresentativemethodsformultivariate 107 outlyingobservationdetectioninGCmodelinginthisarticle. Aunivariatedetectiontoolisfirst 108 appliedasabaselineforcomparison. Atraditionalmultivariateoutlyingobservationdiagnostic 109 toolbasedonMahalanobisdistanceandthemethodinPanandFang(2002)areappliedtoGC 110 modelsaswell. Then,weproposeandapplythreemethodstostudyindividual-levelresidualsand 111 latentgrowthcoefficientstonotonlyidentifyoutlyingobservations,butalsodistinguishtwo 112 differenttypesofoutlyingobservations: leverageobservationsandoutliers. Weaimtoevaluate 113 andcomparetheperformanceofthesixmethodsunderdifferentconditions. Asfarasweknow, 114 nostudyhassystematicallyinvestigatedandcomparedoutlyingobservationdiagnosticmethods 115 inGCmodelingormultilevelmodeling,letalonedistinguishingleverageobservationsand 116 outliersinthatframework. Tomakethisarticleself-contained,inthenextsection,weintroduce 117 thedefinitionoftwodifferenttypesofoutlyingobservationinGCmodels. Thedistinction 118 betweenoutlyingobservationsandinfluentialobservationsishighlighted. Thesubsequentsection 119 discussesthesixmethodsthatweusetodetectmultivariateoutlyingobservations. Then,focusing 120 onalinearGCmodel,aMonteCarlosimulationstudyisimplementedtoevaluatethe 121 performanceofthosemethods. Anexampleisalsoprovidedtoillustratetheapplicationofthem, 122 usingdataonthePeabodyIndividualAchievementTest(PIAT)mathematicsassessmentfromthe 123 NationalLongitudinalSurveyofYouth1997Cohort(BureauofLaborStatistics,U.S.Department 124 ofLabor,2005). Weconcludethearticlebydiscussingthemeritofeachmethodandproviding 125 recommendations. 126 OUTLYINGOBSERVATIONDIAGNOSTICS 7 OutlyingObservationsinGCModeling 127 AGCmodelrepresentsrepeatedmeasuresofdependentvariablesasafunctionoftime. In 128 GCmodeling,therelativestandingofanindividualateachtimeismodeledasafunctionofan 129 underlyinggrowthprocess,withrandomcoefficients(e.g.,initiallevelandrateofchange)forthat 130 growthprocessbeingfittedtoeachindividual. Lety = (y ,...,y )0 beaT ×1randomvector 131 i i1 iT andy beanobservationforindividualiattimej (i = 1,...,N;j = 1,...,T),whereN isthe 132 ij samplesizeandT isthetotalnumberofmeasurementoccasions. Atypicalformofunconditional 133 GCmodelscanbeexpressedas 134 y = Λb +e , (1) i i i b = β +u , (2) i i whereΛisaT ×q factorloadingmatrixdeterminingthegrowthtrajectories,b isaq ×1vector 135 i ofrandomeffects,ande isavectorofintraindividualmeasurementerrors. Thevectorofrandom 136 i effectsb variesforeachindividual,anditsmean,β,representsthefixedeffects. Theresidual 137 i vectoru representstherandomcomponentofb . IntraditionalGCanalysis,itisassumedthat 138 i i therandomeffectsu andintraindividualmeasurementerrorse arenormallydistributed. 139 i i However,TongandZhang(2012)claimedthatbothrandomeffectsandintraindividual 140 measurementerrorsmaybenonnormal. 141 TwoTypesofOutlyingObservations 142 Althoughthereisnorigidmathematicaldefinitionofwhatconstitutesanoutlying 143 observation,acommonlyacceptedcharacterizationisthatoutlyingobservationsareobservations 144 thatdonotfollowthedistributionalpatternofthemajorityofdata. Theexistenceofoutlying 145 observationsinGCmodelingmayduetoextremescoresineitherorbothofe andu . Because 146 i i extremescoresine oru affectthemodelestimationdifferently(TongandBoker,2016),itis 147 i i necessarytodistinguishdifferenttypesofoutlyingobservationsinGCmodeling. Infactor 148 analysis,YuanandZhong(2008)definedobservationswhosefactorscoresarefarfromthecenter 149 OUTLYINGOBSERVATIONDIAGNOSTICS 8 ofthemajorityofthefactorscoresasleverageobservations,anddefinedoutliersasobservations 150 whosemeasurementerrorsarelarge,regardlessofthevaluesofthecorrespondingfactorscores. 151 Theysuggestedthatsimilardefinitionscanbeusedinotherstructuralequationmodels. Following 152 thedefinitionsinYuanandZhong(2008),wedistinguishtwotypesofoutlyingobservationsin 153 GCmodeling. First,whenanoutlyingobservationiscausedbyextremescoresinrandomeffects 154 (u ),itiscalledaleverageobservation. Theintraindividualmeasurementerrorsforaleverage 155 i observationmaybesmallorlarge. Theobservationcorrespondingtoasmallmeasurementerror 156 iscalledagoodleverageobservation. Itiscalledabadleverageobservationwhenthe 157 measurementerrorislarge. Second,whenanoutlyingobservationiscausedbyextremescoresin 158 intraindividualmeasurementerrors(e ),itiscalledanoutlier. Notethatitispossiblethatthere 159 i mightbeindividualswithunusualvaluesinboththeirmeasurementerrorsandgrowth 160 coefficients. Theseindividualsarebothaleverageobservationandanoutlier. 161 Tofurtherillustratethepatterndifferencesamongoutlyingobservationcausedby 162 nonnormalrandomeffectsu and/ornonnormalmeasurementerrorse ,wegenerateandplotdata 163 i i fromfourtypesofdistributionalmodels(seeFigure1). Foreachtypeofdistributionalmodel, 164 dataon20individualsaregeneratedatfourequallyspacedtimepointswithalineargrowthtrend. 165 Figure1(1)displaysthetrajectoriesofthedatageneratedwithoutanyleverageobservationsor 166 outliers. Theoveralltrendlookscleanandsmooth. Figure1(2)plotsthedatageneratedwith 167 outliers(i.e.,intraindividualmeasurementerrorscontainextremescores). Noticeably,some 168 observationsstandoutoftheoveralltrajectorysuchasthoselabeledbyaandb. Acloselookat 169 thetwoobservationsrevealsthattheydeviatefromtheoveralltrajectorybecausetheyareofftheir 170 ownexpectedgrowthtrajectories. Forexample,anindividualmightperformunexpectedlywellin 171 atestbecauses/hecheated,buthis/heroverallrateofchangewasnotsubstantiallydifferentfrom 172 otherindividuals’. Figure1(3)portraysdatageneratedwithleverageobservations(i.e.,random 173 effectscontainextremescores). Someobservationsalsodeviatefromtheoverallgrowth 174 trajectory. However,thoseobservationsarestillontheirownexpectedgrowthtrajectories. The 175 reasonwhytheystandoutisthattherateofgrowthforthespecificindividualisverydifferent 176 OUTLYINGOBSERVATIONDIAGNOSTICS 9 fromothers’. Anexamplecouldbethatsometalentedindividualsmaylearnfasterthanthe 177 others. Figure1(4)drawsthetrajectoriesfordatageneratedwithobservationsbeingbothleverage 178 observationsandoutliers(i.e.,bothintraindividualmeasurementerrorsandrandomeffects 179 containextremescoressimultaneously). Obviously,theobservationswhichstandoutaredueto 180 twosources-thetrajectoryofanindividualdeviatesfromtheoveralltrajectoryandthe 181 observationforthisspecificindividualisoffitsownexpectedtrajectory. Forexample,the 182 observationestandsoutbecauseitisofftheexpectedtrajectoryofthecaseandthecaseitselfhas 183 ahigherinitiallevel. 184 AsclearlyshowninFigure1,leverageobservationsandoutliersleadtodifferentpatternsof 185 growthtrajectories. Thisemphasizesagainwhyitisimportanttodistinguishthetwotypesof 186 outlyingobservationsinGCmodeling. Insum,leverageobservationsarecausedbyextreme 187 scoresinu andoutliersarecausedbyextremescoresine ,andingeneral,wecallleverage 188 i i observationsandoutlierstogetherasoutlyingobservationsinGCmodeling. Wewouldliketo 189 notethatinthisarticle,weusetheterm“outlier”whenonlymeasurementerrorsinGC 190 modelshaveextremescores,andtheterm“outlyingobservation”ismoregeneralandused 191 wheneveranobservationhasextremescores. 192 193 InsertFigure1here 194 195 DiagnosticsofoutlyingobservationsinGCmodelingareveryimportantinordertoobtaina 196 bettermodelestimation. Itisequallyimportantandmaybemoremeaningfultoidentifyleverage 197 observationsandoutliers. Forexample,TongandBoker(2016)claimedthatsomerobustmethods 198 mayperformwellwhendatacontainoutliers,buttheyshouldbeusedmorecarefullywhendata 199 containleverageobservations. Inaddition,leverageobservationdetectioncanbeusedtoidentify 200 talentedstudentswhosegrowthtrajectoriesaredifferentfromtheaveragetrajectory,andoutlier 201 detectioncanbeusedtodetecttestfraud,averyseriousandpopularpracticaltask. Ifastudent 202 OUTLYINGOBSERVATIONDIAGNOSTICS 10 tookaseriesoftestsinaperiodoftimeandgotpreternaturalscoresinoneortwotests,s/hemight 203 beasuspectedcheater. Sincethesetopicsareimportantinsocial,behavioralandeducational 204 researches,weapplymethodstodistinguishthetwotypesofoutlyingobservationsinourstudy. 205 OutlyingObservationsVersusInfluentialObservations 206 Observationsmayalsobeexaminedforinfluentialstatus. Influentialobservationsare 207 definedbytheirimpactonparameterestimatesor/andtheoverallmodelfit. Incontrast,an 208 outlyingobservationisobservedtobedistributionallyaberrantwhencomparingwithother 209 observationsandisconsideredasbeingcontaminatedorcomingfromadifferentpopulation. It 210 hasbeendemonstratedthataninfluentialobservationmaynotnecessarilybeanoutlying 211 observation,andviceversa. Therefore,theideasofhowtodetectinfluentialobservationsand 212 outlyingobservationsaredifferent. Acommonlyappliedmethodtodetectinfluential 213 observationsistodeletethesuspectedobservationsandseehowresultsareaffectedeitheratthe 214 levelofoverallmodelfitoratthelevelofparameterestimates. Whereasformethodsusedto 215 detectoutlyingobservations,aMahalanobisdistanceiscalculatedfromeachpointtothe“center” 216 ofthedataandanoutlyingobservationisapointwithalargedistance. 217 Themotivationofdetectinginfluentialobservationsandoutlyingobservationsismainlyto 218 checkwhetherthereareobservationsthatmaypotentiallyinfluencethemodelestimationandthen 219 determinesomestrategiestodealwiththeseobservationsifnecessary. Studiesoninfluentialcase 220 detectionhavebeenconductedinmultilevelmodelswherecasedeletiondiagnosticswereapplied 221 (e.g.,ShiandChen2008;VanderMeeretal.2006). PekandMacCallum(2011)suggestedtouse 222 multiplemeasuresofcaseinfluencebecausecasesmayinfluencedifferentaspectsofresults,and 223 casesthatexertlittleornoinfluenceononeaspectmayshowastronginfluenceonanotheraspect. 224 Anotherissuewithcasedeletionisthatitisaffectedbysamplesize(PekandMacCallum,2011). 225 AlargesamplesizeleadstoahighcomputationburdenbecauseN (N =totalsamplesize)setsof 226 modelresultsneedtobecomputedfromN delete-one-casesamples,witheachsetofresultsthen 227 comparedwithresultsobtainedfromthefullsample. Moreimportantly,someobservationsmay 228

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.