ebook img

Zero-inflated truncated generalized Pareto distribution for the analysis of radio audience data PDF

1.8 MB·English
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Zero-inflated truncated generalized Pareto distribution for the analysis of radio audience data

TheAnnalsofAppliedStatistics 2010,Vol.4,No.4,1824–1846 DOI:10.1214/10-AOAS358 (cid:13)c InstituteofMathematicalStatistics,2010 ZERO-INFLATED TRUNCATED GENERALIZED PARETO DISTRIBUTION FOR THE ANALYSIS OF RADIO AUDIENCE DATA 1 1 By Dominique-Laurent Couturier and Maria-Pia 0 Victoria-Feser1 2 n University of Geneva a J Extreme value data with a high clump-at-zero occur in many 6 domains. Moreover, it might happen that the observed data are ei- thertruncated below a given threshold and/or might not bereliable ] enoughbelowthatthresholdbecauseoftherecordingdevices.These P situations occur, in particular, with radio audience data measured A usingpersonalmetersthatrecordenvironmentalnoiseeveryminute, . t thatisthenmatchedtooneoftheseveralradioprograms.Thereare a thereforegenuinezerosforrespondentsnotlisteningtotheradio,but t s alsozeroscorrespondingtoreallistenersforwhomthematchbetween [ therecordednoiseandtheradioprogramcouldnotbeachieved.Since 1 radioaudiencesareimportantforradiobroadcastersinorder,forex- v ample, to determine advertisement price policies, possibly according 3 to the type of audience at different time points, it is essential to be 6 abletoexplainnotonlytheprobabilityoflisteningtoaradiobutalso 1 theaveragetimespentlisteningtotheradiobymeansofthecharac- 1 teristicsofthelisteners.Inthispaperweproposeageneralizedlinear 1. model for zero-inflated truncated Pareto distribution (ZITPo) that 0 we use to fit audience radio data. Because it is based on the gener- 1 alizedParetodistribution,theZITPomodelhasnicepropertiessuch 1 as model invariance to the choice of the threshold and from which : a natural residual measure can be derived to assess the model fit to v i thedata.Fromageneralformulation ofthemostpopularmodelsfor X zero-inflated data, we derive our model by considering successively r thetruncated case, thegeneralized Pareto distribution and then the a inclusion of covariates to explain the nonzero proportion of listeners and their average listening time. By means of simulations, we study the performance of the maximum likelihood estimator (and derived inference) and use the model to fully analyze the audience data of a radio station in a certain area of Switzerland. Received April 2009; revised January 2010. 1Supportedby theSwiss National Science Foundation Grant PP001-106465. Key words and phrases. Extreme values, logistic regression, generalized linear models, residual analysis, model fit. This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in The Annals of Applied Statistics, 2010,Vol. 4, No. 4, 1824–1846. This reprint differs from the original in pagination and typographic detail. 1 2 D.-L.COUTURIER ANDM.-P. VICTORIA-FESER 1. Introduction. Audienceindicators—likerating,2 timespentlistening3 and market share—are essential for radio stations managers and advertis- ers. They give important indications on public profiles and on radio stations benchmarking, allowing proper radio programming and optimization of ad- vertising strategies. The weaknesses of traditional audience measurements methods based on individual recollection of the time spent listening to all radio stations led to the development of individual, portable and passive electronic measurement systems providing more reliable and detailed mea- sures [refer to Webster, Phalen and Lichty (2006) for a complete overview of audience measurement methods]. Telecontrol4 thus developed a “wrist- watch meter,” which records 4 seconds of ambient sound at fix time delays and compares these sequences to the corresponding ones of all available ra- dios. The “people portable meter” of Arbitron5 or the “Eurisko multimedia monitor” of Gfk6 consist in a pager-sized device which detects inaudible codes that broadcasters embed in their programs. Hence,thefundamentalaudiencemeasureavailablethroughtheseportable andpassive measurementsystemsis adichotomous variable Y indicating ismt if the participant i was listening to the radio station s at the measurement m of the day t. Most used audience indicators for a given radio station are all functions of the sum of those quantities over a day part,mostly 24 hours, that is, Y = Y . ist ∀m ismt We have at our disposal radio audience data of the Swiss measurement P system“Radiocontrol” in2007 [refertoD¨ahler(2006)foracomplete presen- tationofthismeasurementsysteminSwitzerland].AsillustratedinFigure1, the distribution of the daily number of listening minutes y for a given ra- ist dioisextremely skewed,left-truncated andclumped-at-zero. Inotherwords, first, the empirical distribution of the data appears monotonically decreas- ing. The probability to listen to a radio during a time interval decreases with the time interval length. Second, because of contact validation rules of the Swiss measurement system, the listening times y are recorded as zeros ist if none of the contacts of the participant i to the radio station s last 3–4 minutes or more on day t. This means that the smallest observed (recorded as such) listening times are 3–4 minutes. This ensures that the probabil- ity to observe false positive contact is negligible over a time interval of 4 or more consecutive minutes. Third, the data contain a high clump-at-zero correspondingtothepercentageofpeoplethathadnorecordedcontact with that radio station. 2Percentage of people who tunein to a given radio station duringa day. 3Averagelistening time to a given radio station per listener. 4http://www.telecontrol.ch. 5http://www.arbitron.com. 6http://www.gfk.com. ZERO-INFLATEDTRUNCATEDGENERALIZED PARETO DISTRIBUTION 3 Data with a clump-at-zero and an asymmetric heavy-tail distribution oc- cur in numerous disciplines. Examples are the daily levels of precipitation in an area [Weglarczyk, Strupczewski and Singh (2005)], the yearly amount of car insurance claims per client [Chapados et al. (2002); Christmann (2004)] or the length of overnight stays at hospital per patient [Chen, Jiang and Mao (2007)]. However, no model has been proposed so far for data with a clump-at-zero together with a truncation of small values under a threshold, a model that is necessary to describe, in particular, radio audience data like in our example, but also any other type of data that might, for example for recording reasons, have unreliable measurements at small values of the variable of interest. Hence, the purpose of this paper is to develop a model able to fit truncated heavy-tailed data with excess zeros and to explain, by means of covariates, both the probability associated with a nonzero value and the expectation of positive outcomes. Such a model particularly makes senseinthecontextofradioaudience:Theprobabilityofanonnullvalueand the expectation of positive outcomes, respectively, correspond to the rating and time spent listening audience indicators. Market shares are a function of these expectations. Models for data with excess zeros have received much attention in the literature. The most popular ones include the two-part model of Duan et al. (1983) and the zero-inflated count models initiated by Lambert (1992) for Fig. 1. Empirical distribution of the daily listening times to a national radio in an area of the French part of Switzerland duringthe first semester of2006. 1382participants were measured by means of the Radiocontrol system during one day of the period of interest. Zerosrepresent65.7%ofthedata.Thedistributionofthepositivedataisextremelyskewed withamaximumdailylisteningtimeof1136minutes.Thelowestpossiblepositivelistening time is 3 minutes. 4 D.-L.COUTURIER ANDM.-P. VICTORIA-FESER continuous data, or the hurdle model of Mullahy (1986) for count data. In Section 2 we describe our model as a natural extension of these models that take into account the left truncation of the outcome variable. To model the positivepartoftheradiolisteningtimes,weproposeazeromodalPareto-like distribution. Choice has been made for the generalized Pareto distribution because of its ability to fit heavy tails, to be “model invariant” to the choice of the threshold for the left truncation, and because it can be used to only model the tail of the distribution. The resulting model we propose is hence a zero-inflated truncated Pareto (ZITPo) model in which the probability of nonzero outcomes and the mean of the positive outcomes are linked to a set of covariates in a generalized linear model framework. The ZITPo has great fitting flexibility and useful properties as argued in Section 2.5. In Section 3 we investigate by means of simulations the sample properties of the maximum likelihood estimator and inferential procedures. Since ZITPo models are new, it is also important to be able to check the fit of the model and, therefore, we propose in Section 4 a new data analysis tool based on Pareto residuals that is derived in a natural manner from the properties of the ZITPo model. The data from a radio station in a certain area of Switzerland are then fully analyzed in Section 5 by means of the ZITPo which provides an excellent fit to the data and hence good explanatory power for the probability of nonzero outcomes and the mean of the positive outcomes. 2. The ZITPo model. The generalized Pareto distribution, introduced byPickands(1975),isalimitdistributionfortheexcessovera(large)thresh- old α for data coming from generalized extreme value distributions, as well as a generalization of the Pareto distribution. The three parameter general- ized Pareto distribution has the following cumulative distribution function: −1/ξ y−α 1− 1+ξ , if ξ6=0, τ (1) FY(y|α,τ,ξ)= (cid:18) (cid:19)  y−α 1−exp − , if ξ=0, τ (cid:18) (cid:19)  where α, τ and ξ are location, scale and shape parameters, α≥0 and τ >0.  The range of y is ]α,− τ [ if ξ<0, and ]α,∞[ otherwise. The exponential ξ+α distributionwithmean τ occurs for ξ=0.Pareto-like distributionsoccurfor ξ >0. The generalized Pareto distribution has been widely used to model rare events in several fields. Applications for environmental extremes are especially numerous (river flow, ozone levels, earthquakes). For modeling audience radio data, it is also important to be able to link moments or parameters of the generalized Pareto distribution to a set of explanatory variables. The generalized linear models (GLM) framework, in- troduced by Nelder and Wedderburn (1972), provides a general setting to ZERO-INFLATEDTRUNCATEDGENERALIZED PARETO DISTRIBUTION 5 achieve this aim. GLM are a generalization of the linear regression model in which the assumption of normality of the conditional distribution of the response vector y given a set of covariates X, y|X, is relaxed. These models assumethat the ith unitresponse, y , follows a distributionbelonging to the i exponential family, and the expectation of the ith response,y ,is linked to a i set of fixedcovariates x throughan invertible linear predictor function ν(·), i by means of E[Y ]=ν−1(x β), with β a set of regression coefficients. The i i generalized Pareto distribution falls outside the exponential family frame- work and, hence, the advantages associated with this framework—like well- known iterative estimation procedures and mathematical properties—are not available. However, extension of the GLM to distributions outside the exponential family is pretty straightforward. Actually, generalized linear modeling has existed for a long time with responses following extreme value distributions, but not in the traditional scheme that directly relates the response expectation to the explanatory variables through a linear predictor. Indeed, in extreme value analyses, very often the parameters of the response distribution instead of the response expectation are linked to the covariates. Davison and Smith [(1990), page 395] consider that this represents “a more fruitful approach” than the usual one that links the distribution moments to the regressors, as the moments of generalized extreme value distribution do not exist for all values of their parameters. We refer to Coles [(2001), Section 6.4] for a review. In survival analysis, depending on the choice of the hazard function h(t), the survival function f(t) may follow an extreme value distribution. In this context, f(t) the hazard function h(t)= is then related to the covariates through 1−F(t) a linear predictor instead of the response expectation. Such developments may be found in Aitkin and Clayton (1980). As we will see in more details below, for the purpose of modeling radio audience data, it is more sensible to link the expected value of the response to a set of covariates. Before adapting the generalized Pareto distribution to handle clump-at- zero and left truncation of the positive part of the data, as well as incorpo- rating in the resulting model covariates in order to explain the probability of a zero outcome and the mean of the positive part, we briefly describe models proposed so far for data with excess zeros. The aim is to propose a general formulation from which different models for different situations can be deduced, and, in particular, from which we build our zero-inflated trun- cated Pareto (ZITPo) model. We then also describe in details the ZITPo model assumptions and discuss some possible extensions. 2.1. Models for nonnegative data with excess zeros. There is a rich liter- ature about adaptation of statistical models to the case of data with excess zeros. We refer to Min and Agresti (2002, 2005) and Ridout, Dem´etrio and 6 D.-L.COUTURIER ANDM.-P. VICTORIA-FESER Hinde (1998) for a review. Min and Agresti (2002) compare the advantages and disadvantages of existing approaches and note that the most appeal- ing modeling for continuous data with excess zeros is the two-part model of Duan et al. (1983), and the zero-inflated count models initiated by Lam- bert (1992) or the hurdle model of Mullahy (1986) in the case of count data with a clump-at-zero. These models are similar. Their key idea is to mix two random variables: A first one, Y , that handles the excess of zeros, and 1 a second one, Y , that models the other part of the data. Y typically fol- 2 1 lows a Bernoulli distribution where P (0)=1−π denotes the probability Y1 to observe a zero outcome. In the hurdle and two-part models (also called conditional models), the probability of the data being equal to zero only depends on Y and the positive data are all modeled by Y , which may fol- 1 2 low a zero-truncated distribution in the case of count data (hurdle model) or a continuous distribution (two-part model). In these cases, P (0)=0. Y2 In zero-inflated models (also called mixture models), Y does not follow a 2 zero-truncated distribution.Theprobability associated tozerothusdepends on both Y and Y . 1 2 Let Y be a random variable with probability distribution P for the Y clump-at-zero and the positive part, when the latter is discrete, that is, Y 2 is discrete, then P may be expressed in the following way: Y P (y)=[P (0)+(1−P (0))P (y)]ι(y=0) Y Y1 Y1 Y2 (2) +[(1−P (0))P (y)]ι(y>0), Y1 Y2 where y=0,1,2,..., the indicator function ι(·) equals one if the condition is true and zero otherwise. Let us refer to a variable as semicontinuous when it has a point mass in zero and a continuous distribution for the positive values [definition of Min and Agresti (2002), page 7]. Then (2) may easily be generalized to continuous or semicontinuous Y : 2 f (y)=[P (0)+(1−P (0))P (0)]δ(y) Y Y1 Y1 Y2 (3) +[(1−P (0))f (y)]∆ (y), Y1 Y2 0 where δ(y) is a Dirac delta function which equals zero for y6=0, ∆ (y) is 0 a step function taking the value of one for y >0 and zero otherwise, and y ∈ [0,∞[. Note that when P (0) = 0, we have the hurdle or two parts Y2 models, while we have zero-inflated models when this is not the case. Theuseof the generalized Pareto distribution to modelzero-inflated data is not common, one exception being Weglarczyk, Strupczewski and Singh (2005). The authors compare the fitting ability of some semicontinuous dis- tributions to fit hydrological data with excess zeros and consider a Dirac generalized Pareto distribution with density function −1/ξ−1 π y (4) f (y|π,τ,ξ)=(1−π)δ(y)+ 1+ξ ∆ (y), Y 0 τ τ (cid:18) (cid:19) ZERO-INFLATEDTRUNCATEDGENERALIZED PARETO DISTRIBUTION 7 where τ > 0, ξ 6= 0, 0 ≤ (1 − π) ≤ 1 corresponds to the probability of a zero event. Note that compared to (1), α=0. The Dirac generalized Pareto distribution in (4) thus corresponds to a two-part model with P (0)=0, in Y2 which f (y) is the density function of the generalized Pareto distribution. Y2 In the following sections we propose to extend (3) [and (4)] to take into account the possible truncation of small values, as well as to incorporate covariates to explain (a function of) the probability of zero outcomes and the mean distribution of positive outcomes. 2.2. The ZITPo distribution. LetY∗ denotetheeffective (butunknown) daily listening time for a given radio. Y∗ is to the sum over the day of the dichotomous variables indicating a contact to that radio station minute by minute. The probability and cumulative distribution functions of Y∗, f (y∗) and F (y∗), are semicontinuous with a point mass in zero and a Y∗ Y∗ continuous distribution for the positive values. Let Y denote the observed listening times with density function f (y). As listening times smaller than Y a given value y◦ (considered as known) are recorded as zeros, observed zeros are then a mixture between the effective zero listening times and the pos- itive listening times reported as zeros because of the measurement system. Accordingly, F (0)=F (y◦). Y Y∗ Asemicontinuousversionofthezero-inflatedcountmodeldescribedin(3) is indeed adequate to model the double origins of the zeros in the clump-at- zero and the positive values of the observed listening times. Let us assume that the unknown and true proportion of zero listening times is 1−π, with 0≤π≤1, and that the effective positive listening times follow a two param- eter generalized Pareto distribution (with α=0), Y∗|(Y∗>0)∼GPD(τ,ξ). Then, in (3), P (0)=1−π corresponds to the effective proportion of non- Y1 listeners, and P (0)=F (y◦) corresponds to the part of the two Y2 (Y∗|Y∗>0) parameter generalized Pareto distribution that cannot be observed because ofthemeasurementsystemlimitations.Thedensityfunctionsoftheeffective listening times Y∗ and of the observed listening times Y are π y∗ −1/ξ−1 (5) f (y∗|π,τ,ξ)=[1−π]δ(y∗)+ 1+ξ ∆ (y∗), Y∗ 0 τ τ (cid:20) (cid:18) (cid:19) (cid:21) f (y|π,τ,ξ)=[(1−π)+πF (y◦)]δ(y) Y (Y∗|Y∗>0) +[πf(Y∗|Y∗>0)(y)]∆y◦(y) (6) y◦ −1/ξ = 1−π 1+ξ δ(y) τ (cid:20) (cid:18) (cid:19) (cid:21) −1/ξ−1 π y + 1+ξ ∆ (y), y◦ τ τ (cid:20) (cid:18) (cid:19) (cid:21) 8 D.-L.COUTURIER ANDM.-P. VICTORIA-FESER Fig. 2. Empirical distribution function of a data set simulated from a ZITPo model with parameters π=0.5, µ=ξ=0.25 and y◦=F(−Y1∗|Y∗>0)(0.25). The theoretical trun- cated and untruncated density functions are superimposed to the plot with dashed gray and black lines. The value of the expectations of the positive values of the truncated and untruncated distributions, µ◦ and µ, are indicated on the x-axis. On the discrete part of the plot, the surfaces within the dashed gray and black boxes correspond to the theoretical probabilities to observe zeros when there is (dashed gray) and when there is no (black) left truncation of the positive part of the data. Those probabilities respectively equal 1−π and 1−π◦=(1−π)+πF−1 (y◦). (Y∗|Y∗>0) where 0≤π ≤1, τ >0, ξ 6=0 and y◦ ≥0. For y◦ =0, (6) reduces to the Dirac generalized Pareto described in (4). Finally, note that if the observed listening times distribution in (6) has the disadvantage of being a mixture distribution which makes it more complex to fit, its underlying distribution in (5) takes the advantages of the orthogonal parameterization of the hurdle and two-part models and is thus easier to interpret [for a discussion on the orthogonal parameterization see, e.g., Welsh et al. (1996)]. Indeed, the zeros depend on π, while the positive outcomes rely on the generalized Pareto parameters, τ and ξ. Figure 2 shows the distribution of a data set simulated from a ZITPo dis- tribution.Thetheoretical untruncatedandtruncateddistributionfunctions, respectively corresponding to (5) and (6), are respectively superimposed to the plot in black and dashed gray lines. On the discrete part of the plot, the surfaces within the dashed gray and black boxes correspond to the theoreti- calprobabilitiestoobservezeroswhenthereis(dashedgray)andwhenthere is no (black) left truncation of the positive part of the data. Those probabil- ities respectively equal 1−π and 1−π◦=(1−π)+πF−1 (y◦). On the (Y∗|Y∗>0) continuous part of the plot, the expectations of the truncated (µ◦) and un- ZERO-INFLATEDTRUNCATEDGENERALIZED PARETO DISTRIBUTION 9 truncated (µ) distributions are indicated. It is then clear that the expected value for the true listening time Y∗, µ, is different from the expected value of the truncated distribution, µ◦. For the audience data, one quantity of interest is µ for the untruncated distribution. 2.3. Covariates modeling inZITPodistribution. AdaptationoftheGLM to models for data with excess zeros is very intuitive. The expectations of the distributions of Y and Y in (2) and (3) are linked to the covariates 1 2 through adapted link functions. The logit link is often chosen to relate the expectation of Y , corresponding to the probability to observe positive val- 1 ues, to the covariates. The log link makes sense to connect the expectation of Y , corresponding to the mean of the positive data, to the covariates, as 2 this last is necessarily positive. For the ith observation, we then have exp(xTβ ) (7) π =P(Y∗>0)=ν−1(xTβ )= i1 1 , i i 1 i1 1 1+exp(xTβ ) i1 1 (8) µ =E[Y∗|Y∗>0]=ν−1(xTβ )=exp(xTβ ), i i i 2 i2 2 i2 2 where ν−1(·) and ν−1(·) are the inverse of the linear predictor functions 1 2 linking the expectations of Y and Y in (2) and (3) to the covariates, x 1 2 i1 and x are the covariates of the ith observation that may contain the same i2 predictors, and β and β are the correspondingparameters. Because of the 1 2 orthogonal parameterization of the underlying model in (5), if we use in (7) and(8)twodifferentanduncorrelatedsetsofcovariates, X andX ,wethen 1 2 assume that the processes that explain the probability to observe a positive outcome and the expectation of a positive outcome are independent. If part of the covariates of X are present in (or correlated to) X , π and µ will 1 2 i i possiblybelinked. Noassumptionis doneabouttheformof therelationship between these quantities. Inclusion of covariates in (4) requires that we express the distribution f (y) in terms of the expectation of the positive values of the data. Let Y (Y∗|Y∗>0)∼GPD(τ,ξ). Then τ µ=E[Y∗|Y∗>0]= for 1−ξ>0. 1−ξ The first moment of the generalized Pareto distribution, µ, thus exists for values of ξ lower than one. Substituting τ by µ(1−ξ) in (6) gives ξ y◦ −1/ξ f (y|π,µ,ξ)= 1−π 1+ δ(y) Y 1−ξ µ (cid:20) (cid:18) (cid:18) (cid:19) (cid:19) (cid:21) (9) −1/ξ−1 π ξ y + 1+ ∆ (y), y◦ µ(1−ξ) 1−ξ µ (cid:20) (cid:18) (cid:18) (cid:19) (cid:19) (cid:21) 10 D.-L.COUTURIER ANDM.-P. VICTORIA-FESER with0≤π≤1,µ>0,ξ6=0andξ<1,y◦≥0.Theinclusionofthecovariates as described in (7) and (8) is now straightforward. For the ith observation, we have f (y |x ,x ,β ,β ,ξ) Yi i i1 i2 1 2 exp(xTβ ) ξ y◦ −1/ξ = 1− i1 1 1+ δ(y) 1+exp(xTβ ) 1−ξ exp(xTβ ) (cid:20) i1 1 (cid:18) (cid:18) (cid:19) i2 2 (cid:19) (cid:21) (10) exp(xTβ ) 1 + i1 1 1+exp(xTβ )exp(xTβ )(1−ξ) (cid:20) i1 1 i2 2 −1/ξ−1 ξ y i × 1+ 1−ξ exp(xTβ ) ∆y◦(y). (cid:18) (cid:18) (cid:19) i2 2 (cid:19) (cid:21) 2.4. Assumptions of ZITPo models. The form of the ZITPo model im- plies a number of assumptions on the distribution of the positive values: First,theunobservedpositivelisteningtimesbelongingtotherange]0,y◦[ correspond to the nonobserved part of a left-truncated generalized Pareto distribution. As the generalized Pareto density function is zero modal and monotonically decreasing,thisassumptionimpliesthat,conditionally onthe covariates, the probability of positive listening times in the interval ]0,y◦[ is higher than in any other interval of the same size. As zapping through radio is frequent, we believe that this assumption is realistic. Second, the expectation µ always corresponds to the quantile 1−(1− i ξ)1/ξ of a GPD(µ ,ξ). Indeed, conditionally on the covariates, as the real i positive listening times follow generalized Pareto distributions having differ- entexpectations µ butsharingthesameξ-value,Y∗|(Y∗>0)∼GPD(µ ,ξ), i i i i we can observe that −1/ξ µ (11) F (µ )=1− 1+ξ i =1−(1−ξ)1/ξ. (Yi∗|Yi∗>0) i µ (1−ξ) (cid:18) i (cid:19) Figure 3 shows examples of two parameter generalized Pareto density func- tions sharing the same ξ-value (within the same graph) but having different expectations. For the same ξ-value, the density functions show a great vari- ety of forms and thus a high ability to model different data sets with more or less heavy tails. Third,because of the reparametrization of the generalized Pareto density formulated in (9), the shape parameter is restricted to values lower than one. This does not seem problematic in regard to (11). Indeed, for ξ>0.95, µ corresponds to quantiles of the distribution higher than 0.95. We do not expect cases in which the theoretical mean belongs to the last 5% of the distribution at least with radio listening data.

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.