ebook img

DTIC ADA453849: Confidence Bands for ROC Curves PDF

10 Pages·0.21 MB·English
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview DTIC ADA453849: Confidence Bands for ROC Curves

(cid:3) Con(cid:12)dence Bands for ROC Curves Sofus A. Macskassy [email protected] Foster J. Provost [email protected] Department of Information, Operations, & Management Sciences Leonard N. Stern School of Business, New York University 44 W. 4th Street, Suite 8-82, New York, NY 10012-1126 Michael L. Littman [email protected] Department of Computer Science Rutgers University 110 Frelinghuysen Rd, Piscataway,NJ 08854-8019 Abstract tribution. Often the comparison of two or more ROC We address the problem of comparing the performance of curves consists of either looking at the Area Under classi(cid:12)ers. In this paper we study techniques for generat- the Curve (AUC) or focusing on a particular part of ing and evaluating con(cid:12)dence bands on ROC curves. His- the curves and identifying which curve dominates the torically this has been done using one-dimensional con(cid:12)- otherinordertoselectthebest-performingalgorithm. denceintervalsbyfreezing onevariable|thefalse-positive Much less attention has been given to robust statisti- rate,orthresholdontheclassi(cid:12)cationscoringfunction. We cal comparisonsof ROC curves. This paper addresses adapttwopriormethodsandintroduceanewradialsweep thecreationofcon(cid:12)dencebandsonROCcurves. Prior method to generate con(cid:12)dence bands. We show, through workhasconsideredsweepingacrossthresholdsonthe empirical studies, that the bands are too tight and in- classi(cid:12)cation scoring function, creating con(cid:12)dence in- troduce a general optimization methodology for creating tervals around the TP/FP points for various thresh- bandsthatbetter(cid:12)tthedata,aswellasmethodsforeval- olds (Fawcett, 2003), or sweeping across the FP rates uating con(cid:12)dence bands. We show empirically that the and creating vertical con(cid:12)dence intervals around av- optimized con(cid:12)dence bands (cid:12)t much better and that, us- eraged TP levels (Provost et al., 1998). Con(cid:12)dence ing ournew evaluation method, it is possible togauge the bandscouldbecreatedbyconnectingthesecon(cid:12)dence relative (cid:12)t of di(cid:11)erent con(cid:12)dencebands. intervals (as we will show). We examine 1−(cid:14) con(cid:12)- 1. Introduction/Motivation dencebandsonamodel’sROCcurve. Weaskwhether, assuming test examples are drawn from the same, Weaddresstheproblemofcomparingtheperformance (cid:12)xed distribution, one indeed should expect that the ofclassi(cid:12)ers. Receiver-OperatorCharacteristic(ROC) model’s ROC curves will fall within the bands with analysisisanevaluationtechniqueusedinsignaldetec- probability 1−(cid:14). tion theory, which in recent yearshas seen an increas- ing use for types of diagnostic, machine-learning, and Figure 1 shows an example of what such prototypical information-retrievalsystems (Swets, 1988;Provost& con(cid:12)dencebandsshouldlooklikewith(cid:14) =0:05. Inthe Fawcett,1997;Ng&Kantor,2000;Provost&Fawcett, (cid:12)gure, any ROC curve that does not lie completely in 2001; Macskassy et al., 2001). ROC graphs plot false- the shadedareawouldbe saidto be di(cid:11)erentfromthe positive (FP) rates on the x-axis and true-positive mean curve with a 95% con(cid:12)dence. (TP)ratesonthey-axis. ROCcurvesaregeneratedin In this paper we examine methods for creating and asimilarfashionto precision/recallcurves,byvarying evaluating such con(cid:12)dence bands for a given learned athresholdacrosstheoutputrangeofascoringmodel, model. As we will show, the bands created by prior and observing the corresponding classi(cid:12)cation perfor- techniques are too tight. We introduce a new tech- mances. Although ROC curves are isomorphic to pre- nique that creates more realistic bands based on an cision/recallcurves,they havethe addedbene(cid:12)ts that empirical distribution. To these ends, we describe a they are insensitive to changes in marginal class dis- framework for evaluating the (cid:12)t of ROC con(cid:12)dence (cid:3) Macskassy, S.A., Provost, F.J., and Littman, M.L., bands. \Con(cid:12)denceBandsforROCCurves,"CeDERWorkingPa- per IS-03-04, Stern School of Business, New York Univer- Therestofthepaperisorganizedasfollows. Thenext sity,NY, NY10012. Jan 2003. section discusses related work on creating con(cid:12)dence Report Documentation Page Form Approved OMB No. 0704-0188 Public reporting burden for the collection of information is estimated to average 1 hour per response, including the time for reviewing instructions, searching existing data sources, gathering and maintaining the data needed, and completing and reviewing the collection of information. Send comments regarding this burden estimate or any other aspect of this collection of information, including suggestions for reducing this burden, to Washington Headquarters Services, Directorate for Information Operations and Reports, 1215 Jefferson Davis Highway, Suite 1204, Arlington VA 22202-4302. Respondents should be aware that notwithstanding any other provision of law, no person shall be subject to a penalty for failing to comply with a collection of information if it does not display a currently valid OMB control number. 1. REPORT DATE 3. DATES COVERED JAN 2003 2. REPORT TYPE 00-00-2003 to 00-00-2003 4. TITLE AND SUBTITLE 5a. CONTRACT NUMBER Confidence Bands for ROC Curves 5b. GRANT NUMBER 5c. PROGRAM ELEMENT NUMBER 6. AUTHOR(S) 5d. PROJECT NUMBER 5e. TASK NUMBER 5f. WORK UNIT NUMBER 7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES) 8. PERFORMING ORGANIZATION New York University ,Department of Information, Operations & REPORT NUMBER Management Business,44 West 4th Street,New York,NY,10012-1126 9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES) 10. SPONSOR/MONITOR’S ACRONYM(S) 11. SPONSOR/MONITOR’S REPORT NUMBER(S) 12. DISTRIBUTION/AVAILABILITY STATEMENT Approved for public release; distribution unlimited 13. SUPPLEMENTARY NOTES 14. ABSTRACT see report 15. SUBJECT TERMS 16. SECURITY CLASSIFICATION OF: 17. LIMITATION OF 18. NUMBER 19a. NAME OF ABSTRACT OF PAGES RESPONSIBLE PERSON a. REPORT b. ABSTRACT c. THIS PAGE 9 unclassified unclassified unclassified Standard Form 298 (Rev. 8-98) Prescribed by ANSI Std Z39-18 subsetofthresholdsamongthesortedsetofallthresh- Prototype Confidence Bands olds seen across the set of ROC curves in the sam- 1 ple. For each of these thresholds, it identi(cid:12)es the set of ROC points that would be generated using that 0.8 thresholdoneachoftheROCcurves. FromtheseROC points, the mean and standard deviations are gener- e sitiv0.6 aptoeidntfoarstwheellFaPs vaenrdticTaPl arnadtehs,orgiizvoinngtatlhceonm(cid:12)edaenncReOinC- o ue P0.4 tervals. Tr Medical researchers also have examined the use of ROC curves and have introduced perhaps the 0.2 mostcomprehensivetechniquesforcreatingcon(cid:12)dence mean curve 95% confidence bands boundaries. One such technique is similar to that 0 of threshold averages in that it creates a con(cid:12)dence 0 0.2 0.4 0.6 0.8 1 False Positive boundary around each of the N ROC points associ- Figure 1. PrototypeROCCon(cid:12)dence Bands ated with N discrete events in an underlying model (Tilburyetal.,2000). Itdoesthis byconsideringeach axisasindependentandconsideringanN-dimensional vector along each axis, where the i-th element in the intervals for ROC curves, followed by a section de- vectors represent the i-th point in the ROC curve. scribing our methods for generating ROC con(cid:12)dence Discretizing the values and assuming a binomial dis- bands from con(cid:12)dence intervals. We then describe tribution, it then generates a probability distribution our evaluationmethodologyand a case study showing of the likelihood that the j-th value lies in each dis- that our initial methods do not perform as well as ex- cretizedcell. Itmapthis probabilitydensitybackinto pected. Wethendescribeageneraloptimization-based ROC space thereby generating con(cid:12)dence boundaries methodology that canbe applied to each of the band- for each point in the ROC curve. These models are generatingtechniques,anddiscussaperhapsmorerea- very complex and are not tractable for a large set of sonableevaluationmeasureand(cid:12)nallyrevisitthe case ROC points as is typically found in the ROC curves study using the optimized method. common in machine learning studies. 2. Related Work Others have looked the simpler problem of comparing Prior work on creating con(cid:12)dence intervals for ROC an ROC curve to that of the expected performance of curves has for the most part been in the context of a random model (Macskassy, 2003). As the true theo- creating one-dimensional con(cid:12)dence intervals. retical bands can be generated under the assumption of a random predictor, this method was used to gen- Pooling isatechniqueinwhichthei-thpointsfromall erate an ROC con(cid:12)dence band around the expected the ROC curves in the sample are averaged (Bradley, random performance given a speci(cid:12)c test set. 1997). This makes a strong assumption that the i-th pointsfromallthesecurvesareactuallyestimatingthe Use of the bootstrap (Efron & Tibshirani, 1993) as a same point in ROC space, which is at best a doubtful morerobustwaytoevaluateexpectedperformancehas assumption. previouslybeenusedforevaluatingcost-sensitiveclas- si(cid:12)ers(Margineantu&Dietterich,2000). Inthiswork, VerticalaveraginglooksatsuccessiveFPratesandav- bootstrappingwasusedtorepeatedlydrawpredictions eragestheTPsofmultipleROCcurvesatthatFPrate p(i;j),wherep(i;j)istheprobabilitythataninstance (Provost et al., 1998). By freezing the FP rate, it is of class j was predicted to be in class i. Using these possible to generate a (parametric) con(cid:12)dence inter- sample predictions, it was possible to generate a (cid:12)nal val for the TP rate based on the mean and variance; cost based on a cost-matrix. They did this repeatedly multiplecurvesaregeneratedusingcross-validationor to generate a set of estimated costs, which they then other sampling techniques. A potential weakness of used to generate con(cid:12)dence bounds on expected cost. this method is the practical lack of independent con- troloveramodel’sfalse-positiverates(Fawcett,2003). (Wealsoshowthatthedistributionalassumptionstyp- 3. Generating Con(cid:12)dence Bands icallyusedwiththistechniqueareviolatedinourcase In this section we describe our methodology for gen- study.) erating con(cid:12)dence bands for a classi(cid:12)cation model or modeling algorithm. The main assumption we make Threshold averaging seeks to overcome the potential for being able to generate these con(cid:12)dence bands is weakness of the vertical averaging by freezing the that we can generate (or are given) a set of ROC thresholdsofthescoringmodelratherthantheFPrate curves. These can be generated by running a learning (Fawcett, 2003). It chooses a uniformly distributed algorithmonmultipletrainingsets,testingonmultiple For an empirical distribution we sort the values of D testingsets,orresamplingthe samedata. TheseROC andchoosevlandvu,suchthatvlisthevalueissmaller curveswillbeusedtogeneratecon(cid:12)dencebandsabout than 1−(cid:14) of all values and vu is larger than 1−(cid:14) of all 2 2 an average curve. We adapt two existing methods: values, thus 1−(cid:14) of all values lie between vl and vu. vertical averaging and threshold averaging for gener- Wewillexaminethesethreetechniquesforcalculating ating con(cid:12)dence intervals. We also introduce a new 1-dimensional intervals (i.e., given a sample distribu- radial-sweepmethod, which generates bands basedon tionofvaluesforonevariable). Ifnotstatedotherwise, a radial sweep of the curves as we describe below. resultspresentedwillbe basedonthe empiricaldistri- Our methodology comprises the following steps. bution. 3.2.2. Sweep Methods 1. Creating a distribution of ROC Curves So what are these dimensions along which the con(cid:12)- 2. Generating 1-dimensional con(cid:12)dence intervals (cid:15) Choosing a distribution dence intervals will be created? These are de(cid:12)ned by howone\sweeps"acrossthecollectionofROCcurves. (cid:15) Sweeping across the ROC curves A sweep samples the set of points that de(cid:12)ne a point 3. Creatingcon(cid:12)dencebandsfromthecon(cid:12)dencein- on the average ROC curve and the con(cid:12)dence inter- tervals valaboutit. Weusethreedi(cid:11)erentsweeporientations to sample ROC points. The (cid:12)rst two are adaptations 3.1. Creating the Distribution of ROC Curves from existing methods and the last, the radial sweep, is a method we introduce in this paper. There exist various ways of generating a distribution of instances from which to generate a con(cid:12)dence in- The vertical sweep method sweeps a verticalline from terval. The most common methods, including Cross- FP = 0 to FP = 1, sampling the distribution of TPs validation (Kohavi, 1995), repeatedly split a data set from the collection of ROC curves at regular points intotrainingandtestsets. Eachsuchsplitgivesriseto along the sweep. For each such sampling at a (cid:12)xed a learned model, which can be evaluated against the FP, TP con(cid:12)dence intervals can be created using any test set|thereby generatingone ROC curve per split. of the distribution assumptions mentioned above. Although to our knowledge it has not been used be- The threshold sweep method works a little di(cid:11)erently fore to generate multiple ROC curves, bootstrapping than the vertical sweep. It sweeps along the thresh- (Efron & Tibshirani, 1993) is a standard statistical olds on the model scores from −1 to +1, sampling technique that creates multiple samples by randomly the distribution of ROC points generated with each drawinginstances,withreplacement,fromahostsam- threshold. It then generates the mean (FP,TP) point ple (the host sample is a surrogate for the true popu- for each sampled threshold and (cid:12)nds the con(cid:12)dence lation). We will describe how we use bootstrapping in intervals of the FPs and TPs, using any of the distri- Section 5.3. bution assumptions mentioned above. 3.2. Generating 1-Dimensional Con(cid:12)dence Intervals Both of these consider only the x or y axis as the 3.2.1. Distribution Assumption axes for orienting the con(cid:12)dence intervals. The draw- back with both of these is that they do not take the Mostmethodologiesassumeanormaldistribution,but curvature into account. For example, vertical inter- it may be that ROC points are not distributed nor- vals will tend to be much wider for smaller FP rates mally. For example, for a given x-value (FP rate) the than for larger FP rates (due to the slopes of the y-value (TP rate) is a proportion. So a binomial dis- curves). In fact, for cost-sensitive classi(cid:12)cation cor- tribution may be appropriate. We consider three dis- responding points on di(cid:11)erent ROC curves are points tributions for creating con(cid:12)dence intervals: normal, where the tangent lines to the curves have the same binomial, and empirical. Let us assume that we are slope (Provost& Fawcett, 2001). Thus, one might ar- given a sample distribution D of points along some gue that it is proper to have con(cid:12)dence intervals that dimension and a con(cid:12)dence threshold of (cid:14). are normal to an average curve. Producing intervals We generate con(cid:12)dence intervals under the assump- normal to an average curve is not easy (nor even well tion of a normal distribution by calculating the mean de(cid:12)ned);forthispaperweintroduceastraightforward, (cid:22) and standard deviation (cid:27) of D. We then look up intuitive approximation. the statistical constant, z, for a two-sided bound of For the radial sweep method, rather than freezing the (cid:14) con(cid:12)dence on a distribution size of jDj giving us a thresholdorthe FPrate,weinsteaddoaradialsweep con(cid:12)dence interval of (cid:22)(cid:6)(cid:1)z(cid:1)(cid:27). of the given curves by a(cid:14)xing one end of a vector to Forthebinomialdistribution,wecalculatethevariance thelowerrightcorner(atposition(1,0))andsweeping as V =q(cid:22) (cid:1) (1 − (cid:22)), thus giving con(cid:12)dence interval it radially from (0,0) to (1,1). At (cid:12)xed angular inter- (cid:22)(cid:6)z(cid:1) V . vals, we sample the points where all the given ROC jDj Vertical Sweep Confidence Intervals Vertical Sweep Confidence Bands and Intervals Threshold Sweep Confidence Intervals Threshold Confidence Bands and Intervals 1 1 1 1 0.8 0.8 0.8 0.8 Positive0.6 Positive0.6 Positive0.6 Positive0.6 True 0.4 True 0.4 True 0.4 True 0.4 0.2 0.2 0.2 0.2 mean curve mean curve 0 0 95% confidence bands 0 0 95% confidence bands 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 False Positive False Positive False Positive False Positive Figure 2. Transforming vertical sweep into con(cid:12)dence Figure 3. Transforming threshold sweep into con(cid:12)dence bands. bands. curves intersectthe vector. For eachsuch sampling at angle (cid:18)|which ranges from 0 at (0,0) to (cid:25) at (1,1)| 1 Radial Confidence Intervals 1Radial Confidence Bands and Intervals 2 and for each ROC curve, we get a polar coordinate 0.8 0.8 ((cid:18),length)wherethecurveintersectsthesweepvector. Tpohientlenfrgotmhitnhtehelopwoelrarricgohotrdcionranteers)(itshethdeisvtaanricaeboleftfhoer Positive0.6 Positive0.6 which we will compute the con(cid:12)dence interval|again True 0.4 True 0.4 using any of the distribution assumptions mentioned 0.2 0.2 mean curve above. Although the sweep vector rarely is truly or- 95% confidence bands 0 0 thogonal to the ROC curve tangent at any given in- 0 0.2 Fa0l.s4e Pos0it.i6ve 0.8 1 0 0.2 Fa0l.s4e Pos0it.i6ve 0.8 1 tersection, the sweep method does provide us with a Figure 4. Transformingradialsweepintocon(cid:12)dencebands. straightforwardapproximation. two con(cid:12)dence intervals. In this paper we chose the All of our sweep methods require three parameters: 1. The con(cid:12)dence (cid:14), which we set to 0.05 for a 95% simplest approach: discount the con(cid:12)dence interval con(cid:12)dence bound throughout this paper. We did for FP and only use the con(cid:12)dence interval for TP. test with other (cid:14)’s (0.10 and 0.01) with similar Because of this, the bands we generate turn out to be results as those presented below. somewhat conservative and containment probably is underestimated. Figure 3 illustrates the transforma- 2. Thedistributionassumptionunderwhichthecon- tion as well as the drawback. In the (cid:12)gure, we clearly (cid:12)dence intervals are generated. We test under all seethatsomeFPintervalsreachoutsidethecon(cid:12)dence three distribution assumptions mentioned above: bands (opposite to the vertical intervals, the horizon- normal, binomial, and empirical. talintervalswilltendtobelargerforhigherFPrates). 3. Thesetofpointstosamplealongthesweep,which Wearecurrentlyinvestigatingmorerobustandbetter- wesettoauniformlydistributed100points. This performing ways to generate con(cid:12)dence bands from number can be changed depending on how (cid:12)ne- threshold sweeps. grained a curve is needed.1 3.3.3. Radial Sweep 3.3. Creating Con(cid:12)dence Bands from Aswiththeverticalsweepmethod,generatingthecon- Con(cid:12)dence Intervals (cid:12)dencebandsfromthismethodisstraightforward. For 3.3.1. Vertical Sweep eachsampledvectoratangle(cid:18),wecangeneratethefar (near)point fromthe polar con(cid:12)dence intervals which Verticalsweepcanbeadapteddirectlytogeneratecon- we then map back into ROC space to generate the (cid:12)dence bands rather than a set of distinct con(cid:12)dence pointsfortheupper(lower)con(cid:12)denceband. Figure4 intervals. What we do is to consider all the upper illustrates how this method is applied. (lower) interval points as the points making up the upper (lower) band. Figure 2 illustrates this method- 4. Evaluation ology. ForeachFP(0.00through0.99|1.0alwayshas The keyquestionwe askinthis paperis howgoodare a TP of 1.00), we generate a distribution of possible these bands? As with con(cid:12)dence intervals on a single TPs across all the sampled ROC curves and generate variable, we would like to be able to say that given a the bands based on this distribution. (cid:14), the bands generated can be expected to fully con- 3.3.2. Threshold Sweep tainthecurvefromagivenmodelwithaprobabilityof This method is a little more problematic to adapt to 1−(cid:14) (assumingthatnewtestinstancescomefromthe our framework as there are various ways to deal with same distribution). As we will show, for none of the methodsproposedabovedoesthishold. Later,wewill 1While this is a free variable that will havesome e(cid:11)ect introduce an optimization method below for generat- on the overall (cid:12)t of the bands, we do not investigate its ing better bands, as well as new evaluation measures e(cid:11)ect in this paper. that give a sense of how well the bands do (cid:12)t. 5. Case Study Radial Sweep Bands - Empirical Distribution 1 5.1. Data WenowpresentacasestudyusingtheCovertypedata set from the UCI repository (Blake & Merz, 1998). 0.8 We chose this data set because its large size enabled e us to do more in-depth testing, acrossa wide range of v training- and test-set sizes. The Covertype data set siti0.6 o consists of 581;012 instances having 54 features, 10 P e 0.4 being numerical and the rest being ordinal or binary. u 125 instances r While it has seven classes, there is a large variation T 625 instances in class membership sizes. To study the ROC curves, 0.2 1250 instances we chose examples of the two classes with the most 12500 instances instances, giving us a data set of 495;141 instances 25000 instances 0 (57:2% base error rate). 0 0.2 0.4 0.6 0.8 1 False Positive 5.2. Learning Method Figure 5. ROCBands using various test sizes. We use a modi(cid:12)ed C4.5R8 (Quinlan, 1993) that gen- ROC curves fall completely within the gen- erates a Probability Estimation Tree (PET) (Provost erated con(cid:12)dence bands. & Domingos, 2002). PETs are generated by consid- This methodology has three parameters: the training ering the predictions made for each leaf in a decision size, the test size, and the number of sampling runs tree. If a leaf matches p positive examples and n neg- usedinstep(b)togeneratethe con(cid:12)dencecurves. We ative examples, the probability of class membership examine the sensitivity to each of these parameters in p in the positive example is p+n. Further, to produce the next section. Note that for this paper, we do not a better class-probability estimate, we apply a simple consider variance in curves due to the training set| Laplace correction (Niblett, 1987) under the assump- only con(cid:12)dence bands on the ROC curve of a partic- tion of uniform class distribution 1 for C classes| ular (learned) classi(cid:12)er. However,a similar methodol- C giving us a (cid:12)nal probability estimate of p+1 , as we ogywouldapplytothegenerationofcon(cid:12)dencebands p+n+2 for a learning algorithm. have 2 classes. Further, we do no pruning of the tree, as standard pruning does not consider di(cid:11)erences in 5.4. Trends in Con(cid:12)dence Bands scores that do not a(cid:11)ect 0/1 loss (but may deflate the In this section we examine the experimental parame- ROC curve) (Provost& Domingos, 2002). ters identi(cid:12)ed above, and choose values for our eval- 5.3. Bootstrap-based Evaluation uation. Unless stated otherwise, we will use the ra- dial sweep method under the empirical distribution Togenerateandevaluatecon(cid:12)dencebands,weusethe assumptionforthe(cid:12)gurespresented. Allothersweeps following method based on a bootstrapped empirical and distributions had similar performances, though sampling distribution. this combination is the best performer among the 1. Randomlysplitthecompletedatasetintoatrain- methods described thus far. ing set of 256,000 instances and a test set of 5.4.1. Training Size 125,000instances,keepingthese twosetsdisjoint. Thisparameteristheleastinterestingforthisparticu- 2. Sample with replacement from each of these two larcasestudy. Asthetrainingsizeincreases,theROC sets to generate a training set, multiple \(cid:12)tting" curvesbecomehigheraswouldbe expected. However, sets, and multiple test sets: while this has some e(cid:11)ect on the width of the con(cid:12)- (a) Fix thetrainingsize,sampleatrainingsetof dence bands, it is more a matter of considering di(cid:11)er- that size, and learn a classi(cid:12)er. entlearnedmodelsthanofhowtogenerategoodbands (b) Fixthetestsizeandrepeatedlygenerate\(cid:12)t- for a givenmodel. As such, we do not considerthis to ting" sets of that size. For each (cid:12)tting set, be an important dimension for further discussion here generate an ROC curve for the model. The and (cid:12)x the training size to 1000 instances. result is a set of ROC curves, one per (cid:12)tting 5.4.2. Test size set. Test-setsizeshouldhaveanobviouse(cid:11)ectonthebands (c) Generatecon(cid:12)dencebandsbasedontheROC generated. We (cid:12)xed the test size to 125, 625, 1250, curves generated in the (cid:12)tting step (b). 6250,12500and25000instances(0:1%,0:5%,1%,5%, (d) Do1000samplingruns. Foreachrunwepick 10% and 20%, respectively, of the complete test set). a test set using the same size as in (b), from As the test-set size increases, the approximate con(cid:12)- which we then generate an ROC curve. We denceintervalsgeneratedbyanyofoursweepmethods thencalculatehowmanyoftheresulting1000 become narrower and therefore so do our con(cid:12)dence distribution assumption Radial Sweep Bands --- Empirical Distribution 1 empirical normal binomial Method (cid:14)^ (cid:14)^ (cid:14)^ (cid:14)^ (cid:14)^ (cid:14)^ (cid:12)tting test (cid:12)tting test (cid:12)tting test radial 73.5 63.9 51.9 41.5 00.0 00.0 0.8 vertical 31.6 01.7 42.7 00.0 00.0 00.0 threshold 00.9 00.0 00.8 00.0 00.0 00.0 ve Table 1. How many ROC curves fall within the bands siti0.6 of each method using a given distribution for generating Po bands? (cid:14)^(cid:12)tting is the percentage of samples used to gener- e 0.4 atethebandsand(cid:14)^ isthepercentageofsamplesdrawn u test Tr 10 runs afterwards. 0.2 100 runs 1000 runs Radial Sweep Bands - Empirical DistributionRadial Sweep Bands - Normal Distribution 1 1 5000 runs 0 0 0.2 0.4 0.6 0.8 1 0.8 0.8 Figure 6. ROC Bands uFsianlgsev Paorysiintigvenumber of sampling ositive0.6 ositive0.6 P P runs. True 0.4 162255 iinnssttaanncceess True 0.4 162255 iinnssttaanncceess 0.2 1250 instances 0.2 1250 instances 12500 instances 12500 instances bands. Thisisageneralstatisticalproperty|withtoo 0 25000 instances 0 25000 instances few samples, the estimate of the con(cid:12)dence interval 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 False Positive False Positive tendstobe inaccurateandbiasedtobetoowide. The Figure 7. Comparisonofbandsgeneratedundertheempir- same thing is happening in the ROC space. Figure 5 ical and normal distribution assumptions. illustrates this e(cid:11)ect clearly. sampledtestsetsofsize12,500withreplacementfrom Tolimitourpresentationforthispaper,we(cid:12)xthetest theoriginaltestsetof125,000andcountedhowmany size to 12500, though the results hold for other sizes of the 1000 ROC curves fell within each band. We as well. did this for each of our three methods using each of 5.4.3. Number of Sampling Runs thethreedistributionassumptions. Table1showshow The number of sampling runs used to create the em- manyROCcurvesfallwithinthebandsofeachmethod pirical distribution (step 2(b) in Section 5.3) is the using a given distribution assumption for generating last free parameter that we consider. In order to gen- the bands. (cid:14)^ is the percentage based on the \(cid:12)t- (cid:12)tting erate the ROC bands, we need to have a sample of ting" samples that were used to generate the bands ROC curves from which to generate these bands. The and (cid:14)^ is the percentage of ROC curves based on test question to answer is how many such ROC curves| samples drawn after the bands had been generated. the number of sampling runs|are needed to generate reasonable bands. While the e(cid:11)ect of this variable is Surprisingly,none ofthe bands getanywhere near the notasintuitiveasthetestortrainingsize,itstilldoes 95% that we would expect. In particular, we see that have an e(cid:11)ect as can be seen in Figure 6. While the the binomial distribution assumption generates very lower band is fairly stable we see that the upper band bad bands and that neither the vertical sweep nor widens with more sampling runs. (This would be ex- thresholdsweepmethods performaswellasthe radial pected from a distribution with a long tail.) sweep method.2 Interestingly, bands generated under thenormaldistributionassumptiondidnotperformas AsweobservefromFigure6,theupperbandsbetween well as the bands generated under the empirical dis- using 1000 and 5000 sampling runs were very similar. tribution. Figure 7 shows the bands generated under Based on this observation, we (cid:12)x the number of sam- these two distribution assumptions side by side. Note pling runs to 1000, though our results hold for other that they are very similar in shape, though the em- values as well. pirical distribution bands are much more jagged. The 5.5. How Good Are The Bands? empirical bands are noticeably wider (on the \high" side). WouldoneexpectROCcurvestobedistributed Having (cid:12)xed our experimental parameters, let us now ask our main question: do the 1−(cid:14) con(cid:12)dence bands normallywithrespecttothevertical,threshold,orra- actually contain 1−(cid:14) of the empirical distribution? dial dimensions? We do not have a good answer, but the empirical bands do seem to (cid:12)t better. Ourmechanismallowsustoasktwovariationsonthis question: do the bands contain 1−(cid:14) of the \(cid:12)tting" Whatremainstobeaddressedisthepoorcontainment distribution? Do the bands contain1−(cid:14) ofthe \test" distribution? 2Recallthatthebandsgeneratedbythethresholdsweep methodareoverlyconservativeandthatbetterbandsmay Asperourbootstrap-basedmethodology,werandomly be found with a betterconnecting method. Method (cid:14)^(cid:12)tting (cid:14)^test Train opt-radial 96.8% 86.2% Test Table 2. Percentage of curves contained within the bands generated by theoptimized radial sweep method. ofthebands. Whiletheradialsweepmethodproduced the best (cid:12)t, it still fell far short of the expected con- tainment of the empirical distribution of ROC curves. Is it possible to produce better bands? Is there a bet- terwaytoevaluateROCbands? Therestofthepaper Figure 8. Example of point outside thecurve. presents (cid:12)rst steps toward answering these questions. 6. Optimized ROC Bands a better containment than the non-optimized meth- ods. However,it still did not (cid:12)t the test set as well as None of the methods performed as expected, even on expected. the ROC curves (the (cid:12)tting curves) that were used to generate the bands in the (cid:12)rst place. We propose to 7. Evaluation Revisited revisit the way in which these bands were generated One possible explanation for the below-expected con- and optimize them such that they (cid:12)t the empirical tainment evenof the optimized method is thatmaybe distribution of curves better. To do so, we use the there is no good way to generate bands that (cid:12)t well following optimization methodology: duetothechaoticbehavioroftenfoundinROCcurves wheretheycrisscrossmanytimes(asseeninFigure6). 1. Generate an empirical distribution using a Withcurvessuchastheseitmaybeunlikelytobeable methodappropriatefortheproblemdomain(e.g., to do any better than the convex hull in order to get our bootstrap mechanism). the expected containment. Looking more closely, the 2. Select a method for generating bands (e.g., ra- convexhullofthe (cid:12)tting samplesused to generatethe dialsweep)basedonsomeunderlyingdistribution bands might still not be enough. If even one point (e.g., the empirical distribution). falls outside the convex hull as shown in Figure 8, the 3. Optimize the bands with respect to an objective complete curveis notcontained. If the (cid:12)tting samples function that is suitable for the problem domain. arechaoticandcrisscrossmanytimes, whywouldnew samples behave di(cid:11)erently? They may be very likely We instantiate this methodology by generating the haveat leastone point outside the bands found in the sampling distribution as given before. Because the original samples. Maybe we should not require the radialsweep method performed well using the empiri- bandsfullycontainanROCcurve,butinsteadtocon- cal distribution, we choose these as the baseline from tain\almostall"oftheROCcurve. Ifwecanquantify which we will optimize. For the optimization step, for \almost all" then we can evaluate how well the bands this paper we adopt a very simple method: (cid:12)t the data with respect to this measure. 1. For each sampling in the radial sweep generate a tsheteovfecptoolraursceodotrodidnraatwest.hiLsestam(cid:18)(cid:11)plbee,atnhdelaentgNleboef Tpehrecemnteaagseur(cid:15)eowfethuesepofionrtsthoifsaenvaRluOaCtiocnurivsebtahsaetdfathlles outside the bands. For a set of con(cid:12)dence bands, we the number of ROC curves in the distribution. calculate(cid:15)foreachoftheROCcurvesintheempirical 2. Sortthe valuesby length,givingusthe sortedset 3. lS(cid:18)t(cid:11)a;1rt<ing::a:t<thl(cid:18)e(cid:11)o;Nu.termost bands (L=1 and U = dcuisrtvreibsuhtaiovne,(cid:15)a(cid:20)nd^(cid:15).idTeonutisfeys(cid:15)^ucshuc(cid:14)h;(cid:15)thcoant(cid:12)1d−en(cid:14)ceobfaanlldst,hae N),wede(cid:12)ne thecandidatelowerbandasthe set new ROC curve would be considered statistically dif- Wuofpppteooritnhbtesannld(cid:18)uim;Lasbfeotrhreoifsc=eutr1vl(cid:18):ei:s;U:iNnfooraunrids=atmh1ep:lce:a:tnNhda.itdSfaaetletl Wafesrseeencstsaininfgmthiotesrne^(cid:15)et.vhaalnua(cid:15)^toeftihtsep(cid:12)otinnetsssfaolfloauttyspideeotfhbeabnadndbsy. completely within (or lie on) these bands. 8. Case Study Revisited 4. Increase L by 1 and decrease U by one andrecal- We now revisit our case study and compute the (cid:15)^’s culate W. for each method. Figure 9 graphs, for our four sweep 5. Continue until the candidate bands containfewer methods using the empirical distribution, the percent than 1−(cid:14) of the \(cid:12)tting" curves and use U +1 of curves contained as we increase (cid:15). The vertical and L−1 to generate the (cid:12)nal bands. line is 95% (1−(cid:14)) containment. As is clear from the Table 2 shows the performance of this Optimized Ra- graph,theoptimizedradialsweepoutperformedallthe dial Sweep method, opt-radial, using the same evalua- othermethodsthoughallmethodswereabletoachieve tionasbeforewithsameparametersettings. Aswecan 95% containment at varying (cid:15)s. Table 3 shows the (cid:15)^’s see, this method was able to generate bands that had needed by each method using the normal and empiri- found aretoo loose in certainregionsof the curveand Epsilons --- Empirical Distribution (12500 instances, 1000 runs) too tight in others. These are all issues that we hope 1 to investigate further. 0.8 Wehopethisworktakesasigni(cid:12)cantsteptowardmore robustcomparisonsofmachinelearningmethodsusing e 0.6 ROC analysis. g a er Acknowledgments ov 0.4 confidence threshold c We would like to thank Tom Fawcett for his pointers to optimized radial sweep related work andfor manydiscussions about ROCcurves, radial sweep 0.2 HaymHirshforhisfeedbackearlyinthedesignstagesand vertical sweep MatthewStonewhoinitiallysuggestedusingthebootstrap threshold sweep 0 for evaluating ROCcurves. 0 0.2 0.4 0.6 0.8 1 This work is sponsored in part by the Defense Advanced epsilon Research Projects Agency (DARPA) and Air Force Re- Figure 9. (cid:15) coverage for di(cid:11)erent sweep methods. search Laboratory, Air Force Materiel Command, USAF, distribution assumption underagreementnumberF30602-01-2-585. TheU.S.Gov- empirical normal ernmentisauthorizedtoreproduceanddistributereprints Method (cid:15)^train (cid:15)^ (cid:15)^train (cid:15)^ forGovernmentalpurposesnotwithstandinganycopyright test test opt-radial 0.0000 0.0208 | | annotation thereon. The views and conclusions contained radial 0.1765 0.1250 0.2353 0.2083 herein are those of the authors and should not be inter- vertical 0.2843 0.2500 0.2451 0.2245 preted as necessarily representing the o(cid:14)cial policies or threshold 0.5588 0.5417 0.5588 0.5306 endorsements, either expressed or implied, of the Defense Advanced Research Projects Agency (DARPA), the Air Table 3. What (cid:15)’s are needed to achieve a 95% contain- Force Research Laboratory, or the U.S.Government. ment. References cal distributions.3 For example, the optimized sweep Blake, C. L., & Merz, C. J. (1998). UCI Repository of completelycontained(byconstruction)the95%ofthe machine learning databases. (cid:12)tting curves,andrequired(cid:15)=0:02to contain95%of Bradley, A. P. (1997). The use of the area under the roc the test curves. The other methods requiredconsider- curve in the evaluation of machine learning algorithms. ably higher (cid:15) values to achieve 95% containment. Pattern Recognition, 7, 1145{1159. Efron, B., & Tibshirani, R.(1993). An introduction to the 9. Discussion and Limitations bootstrap. Chapman & Hall. In this paper we evaluated various methods for gen- Fawcett, T. (2003). Roc graphs: Notes and practical con- erating con(cid:12)dence bands for ROC curves. We intro- siderations fordata miningresearchersTechnical Report duced a new radial sweep method for generating con- HPL-2003-4). HP Labs. (cid:12)dence bands around the ROC curve and developed Kohavi, R. (1995). A study of cross-validation and boot- strap foraccuracy estimation and modelselection. Pro- a general framework for optimizing such bands using ceedings of the 14th International Joint Conference on bootstrapping techniques. We showed that methods Arti(cid:12)cial Intelligence (pp. 1137{1143). San Francisco, basedonexistingtechniquesproducedbandsthatwere CA. fartoonarrow. Theoptimizedmethodperformedcon- Macskassy, S. A. (2003). New techniques in intelligent in- siderablybetter,butstillwastoonarrow. Wethenin- formation (cid:12)ltering. Doctoral dissertation, Rutgers Uni- troduced a new measure to evaluate the containment versity. of ROC con(cid:12)dence bands and showed how our opti- Macskassy, S. A., Hirsh, H., Provost, F. J., Sankara- mized radial sweep method required relatively little narayanan, R., & Dhar, V. (2001). Intelligent informa- tion triage. The 24th Annoual Internationl Conference leeway to achieve proper containment. on Research and Development in Information Retrieval However, although we introduced the radial sweep (SIGIR). NewOrleans, LA. methodtoapproximatecon(cid:12)dencebandsthatarenor- Margineantu,D.D.,&Dietterich,T.G.(2000). Bootstrap methods for the cost-sensitive evaluation of classi(cid:12)ers. mal to an ROC curve at any given point, a better International Conference on Machine Learning, ICML- technique might yield improved results. One question 2000 (pp. 582{590). that we did not investigate here washow sensitive the Ng, K.-B., & Kantor, P. (2000). Predicting the e(cid:11)ective- bands are to the number of points sampled along the ness of naive data fusion on the basis of system charac- sweep. Further, although we introduced the notion of teristics. Journal of American Society for Information optimizing the bands, we only considered a straight- Science, 51, 1177{1189. forward and simplistic optimization in this paper. Fi- Niblett,T.(1987). Constructingdecisiontreesinnoisydo- nally, it is still an open question whether the bands mains.ProceedingsoftheSecondEuropeanWorkingSes- sion on Learning (pp.67{78). Sigma, Bled, Yugoslavia. 3Note that we have dropped the comparison to the bi- Provost, F. J., & Domingos, P. (2002). Tree induction nomialdistributionasitperformedsobadlyintheprevious for probability-based rankings. Machine Learning. to evaluation. appear. Provost, F. J., & Fawcett, T. (1997). Analysis and vi- sualization of classi(cid:12)er performance: Comparison under impreciseclassandcostdistributions. Proceedingsofthe ThirdInternationalConferenceonKnowledgeDiscovery and Data Mining (pp.445{453). Provost,F.J., &Fawcett,T.(2001). Robustclassi(cid:12)cation forimpreciseenvironments. MachineLearning,42,203{ 231. Provost, F. J., Fawcett, T., & Kohavi, R. (1998). The case against accuracy estimation for comparing induc- tion algorithms. Proceedings of the Fifteenth Interna- tional Conference on Machine Learning (pp. 445{453). San Francisco, CA: Morgan Kaufman. Quinlan, J. R.(1993). C4.5: Programs for machine learn- ing. San Mateo, CA: Morgan Kaufmann. Swets, J. A.(1988). Measuring theaccuracy of diagnostic systems. Science, 240, 1285{1293. Tilbury, J., Eetvelt, P. V., Garibaldi, J., Curnow, J., & Ifeachor, E. (2000). Receiver operating characteristic analysisforintelligentmedicalsystems{anewapproach for (cid:12)nding non-parametric con(cid:12)dence intervals. IEEE Transactions on Biomedical Engineering, 47, 952{963.

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.