Psicológica (2013), 34, 383-406.

Measuring anxiety in visually-impaired people: A comparison between the linear and the nonlinear IRT approaches

Pere J. Ferrando*¹, Rafael Pallero², Cristina Anguiano-Carrasco¹
¹ Universidad 'Rovira i Virgili' (Spain); ² ONCE (Spain)

* The research was partially supported by grants from the Spanish National Organization of the Blind (ONCE) and the Spanish Ministry of Economy and Competitiveness (PSI2011-22683). Correspondence: Pere Joan Ferrando. Universidad 'Rovira i Virgili'. Facultad de Psicologia. Carretera Valls s/n. 43007 Tarragona (Spain). E-mail: [email protected]

The present study has two main aims. First, some pending issues about the psychometric properties of the CTAC (an anxiety questionnaire for blind and visually-impaired people) are assessed using item response theory (IRT). Second, the linear model is compared to the graded response model (GRM) in terms of measurement precision, sensitivity to change, and person fit, and the results are also used to illustrate the functioning and advantages of IRT models. The participants were 670 blind or visually-impaired people from different Spanish cities. The results showed that the CTAC scores are accurate enough for practical purposes, and that respondents are quite consistent in their responses. Model-data fit was acceptable in both cases, and both models led to similar results regarding the trait estimates, with the exception of extreme respondents, who were better assessed with the linear model. The GRM assessed measurement precision better, and both models showed high sensitivity to change around cut-off values. Person-fit results were also similar in both models.

Visual impairment is expected to have a substantial impact on individuals because of the shortcomings and restrictions it entails in their everyday life activities (WHO, 2001). Their self-assessment of the situation is usually made in a context of severe anxiety, which can give rise to a self-perception of inefficacy (e.g. "I am unable to do the things I used to do"). This perception may, in turn, be associated with a series of anxiety responses that are both physiological (e.g. sweaty hands) and cognitive (negative worries, recurring thoughts; see e.g. Lazarus, 2000). Anxiety responses of the types just described are likely to appear in everyday situations, even in people who have already acquired coping resources (Welsh, 1997).

Psychological assessment of the anxiety responses discussed above would make it possible to design intervention programs aimed at reducing these high anxiety levels. This reduction is expected to have three main effects: first, quality of life would be improved; second, the process of learning the adaptive skills needed to cope with visual impairment would be facilitated; and third, the skills acquired would be easier to maintain and generalize.

At present there are very few instruments available to measure psychological variables related to visual loss. In particular, anxiety scales specifically intended for blind and visually-impaired people are very scarce. In English, (a) some existing measures intended for the general population have been adapted to create more specific anxiety scales (Hardy, 1968), and (b) some questionnaires that measure general adjustment have included a set of anxiety items (Bauman, 1963; Dodds, 1991; Fitting, 1954).
The state of affairs outlined above prompted our research group to develop a Spanish instrument specifically intended to measure anxiety in the blind and visually-impaired. The instrument was called the CTAC (the Spanish acronym for 'Tarragona Anxiety Questionnaire for the Blind'), and was designed to measure specific anxiety related to visual impairment in a range of everyday situations of the type discussed above. More specifically, and as mentioned above, the CTAC aimed to measure two related (physiological and cognitive) components of anxiety, so it was conceived as a bi-dimensional instrument (Pallero, Ferrando, & Lorenzo-Seva, 2006; Ferrando, Lorenzo-Seva, & Pallero, 2009).

Since its initial conception we have assessed the dimensionality of the CTAC scores in a series of studies and, so far, the results can be summarized as follows. First, the unidimensional model already fits the data reasonably well. Second, fitting the bi-dimensional model slightly improves the model-data fit and leads to a clearly identifiable solution with two highly correlated factors. So, we believe that the use of the test either as an essentially unidimensional instrument or as a bi-dimensional instrument (as e.g. in Ferrando et al., 2009) is justifiable. And, in fact, the scoring procedure of the CTAC allows both a double scoring and a single general scoring to be computed. Although it will not be discussed any further in this article, a plausible approach for integrating both views is to fit a bifactor solution (e.g. Reise, 2012).

The target population for which the CTAC is intended is relatively small, and all the studies that have been made on how it performs (including the dimensionality studies discussed above) have been based on small samples. This is the main reason why, so far, relatively simple approaches have been used to assess its psychometric properties: classical test theory (CTT) and exploratory factor analysis (FA). The results obtained so far are positive: the CTAC scores show acceptable reliability levels, and are useful for assessment and decision purposes (Pallero et al., 2006).

The usefulness of the test has led to its being used more widely, and a relatively large sample is now available. This situation can be exploited to assess some issues that could not be satisfactorily addressed with the simple methodology used so far. The main issues that need to be assessed are: (a) the amount of individual precision of the trait estimates, particularly around the potential cut-off points that are used to decide the need for psychological treatment; (b) the sensitivity of the trait estimates for detecting changes (mainly treatment-induced); and (c) the assessment of individual consistency when responding to the questionnaire.

Given the aims of the test, the relevance of the first and second issues is clear. The CTAC scores are mainly intended to be used for flagging respondents with high anxiety levels, and also in follow-up studies to assess improvement due to psychological treatment. As for the third issue, given the importance of the decisions derived from the interpretation of the CTAC scores, it is critical to assess whether the participant is responding consistently to the questionnaire. If he/she is not, the score obtained must be considered uninterpretable. For the three issues we aim to study, test length is critical in the appropriate assessment of the corresponding properties.
Measurement precision, sensitivity for detecting change, and the power for detecting person misfit all improve as the number of items increases. For this reason, in the present study we shall treat the CTAC scores as essentially unidimensional, so they will be taken from the complete item set.

Methodological Basis and Purposes of the Study

Methodologically, the three issues discussed above are better addressed within an item response theory (IRT) framework. Issues (a) and (b) are better assessed by using information curves and (potentially) optimal trait level estimates (instead of raw scores). Issue (c) can be addressed by using IRT-based person-fit assessment.

So far, IRT applications to the assessment of personality variables in visually-impaired populations have been relatively scarce, and most of the reported studies are conventional calibrations of existing binary-item instruments using a standard model (Ferrando, Pallero, Anguiano-Carrasco & Montorio, 2010; Lamoureux et al., 2007; Gothwal, Wright, Lamoureux, & Pesudovs, 2009; Cochrane, Marella, Keeffe, & Lamoureux, 2011). Unlike these reported studies, however, the CTAC items use a 5-point graded response format. Not only is five a reasonable number of response points for fitting a non-linear IRT graded response model, but evidence also suggests that responses on 5-point response scales are, in most cases, well fitted by linear models (Hofstee, ten Berge & Hendriks, 1998). More specifically, both theoretical (Lord, 1952, 1953) and empirical (Muthén & Kaplan, 1985; Olsson, 1979) evidence suggests that the linear model works well with this type of item when (a) the discriminatory power of the items is moderate or low, and (b) the items have no extreme locations. This is because, in these conditions, the item-trait regressions are essentially linear and homoscedastic for the range of trait values that contains most of the respondents (Ferrando, 2002). Previous analyses based on CTT obtained moderate discriminations and rather symmetrical distributions for most of the CTAC items (Pallero et al., 2006).

The discussion above provides two starting points for this study: (a) both linear and non-linear IRT models are expected to be appropriate for assessing the three critical issues above, and (b) the comparison between the results provided by both approaches is of both substantive and theoretical interest. As for point (b), comparisons between the linear and the nonlinear approaches have already been made in the literature (e.g. Ferrando, 1999; McDonald, 1982, 1999). However, in most cases these comparisons are purely theoretical or have focused on issues such as item estimates or effects on external validity. In contrast, our study aims to make two new contributions. First, when possible, we shall make theoretical predictions that we shall contrast with the empirical data and discuss. Second, our study will focus on the three points discussed above. As far as linear/nonlinear comparisons in terms of conditional precision, sensitivity to change, and person fit are concerned, the present study appears to be new.

In the rest of this section we shall briefly discuss the two models to be compared in the study, and derive the predictions to be contrasted with the empirical results. More specific information will be provided in the Method section. The linear model used in our study is Spearman's factor analysis (FA) model, usually known as the congeneric model in the psychometric literature (Jöreskog, 1971).
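In generic notation (the symbols below are the usual textbook ones and are not quoted from the sources just cited), the congeneric model states that each observed item score is a linear function of a single common trait plus error:

    X_{ij} = \mu_j + \lambda_j \theta_i + \varepsilon_{ij}, \qquad E(\varepsilon_{ij} \mid \theta_i) = 0, \quad \mathrm{Var}(\varepsilon_{ij}) = \sigma^2_{\varepsilon_j}

where X_{ij} is the observed score of respondent i on item j, \mu_j and \lambda_j are the item intercept and loading (discrimination), and \theta_i is the trait level of the respondent.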
In this paper, we shall use the terms linear and congeneric interchangeably. As for the non-linear model, we shall consider Samejima's (1969) normal ogive version of the graded response model (GRM). This version (or its virtually indistinguishable logistic counterpart) is the one that is most used in practical applications (Baker, 1992; Samejima, 1969, 1997). Although the initial conceptualization of the GRM is clearly different from FA modeling, both models can be related by using a general FA formulation based on an underlying-variable approach (see Ferrando, 1999, 2002). Essentially, in the linear modeling it is assumed that the congeneric model holds directly for the observed item scores. In the GRM it is assumed that the congeneric model holds for the response variables that underlie the observed scores.

Because the responses to the CTAC items are discrete and bounded, the linear model cannot be strictly true and must be taken as an approximation (Mellenbergh, 1994). On the other hand, the GRM is theoretically more plausible because it correctly treats the item scores as discrete and bounded variables. As discussed above, given the properties of the CTAC items, the linear model is expected to provide a good approximation in our study. However, even if we accept this point, why should we not use only the theoretically superior GRM? There are reasons not to discard the linear model from the outset. The GRM is a complex model that makes strong assumptions which might not be met. Furthermore, its complexity makes both the calibration and the scoring processes prone to instability. Overall, in the conditions that we assume 'a priori' for the CTAC case, our starting prediction is that both the linear model and the GRM will lead to very similar results and fit the data equally well. Furthermore, because there is a sizeable number of response points and the samples are not too large, the estimates provided by the simple linear model are likely to be more stable.

We turn now to more specific predictions, and we shall start with those concerning the scoring of the individuals. In this study the chosen scores are the maximum likelihood (ML) individual trait level estimates. In the linear case they can be obtained in closed form and are the well-known Bartlett factor scores (e.g. McDonald, 1982; Mellenbergh, 1994). In the case of the GRM, no closed-form estimator exists, so trait estimates must be obtained iteratively.

Our first prediction regarding individual scores is that the regression of the linear estimates on the GRM estimates will be S-shaped, but the nonlinearity will only be apparent at the ends of the curve. This prediction is based on the following results. First, the congeneric ML estimate (i.e. Bartlett's score) is a linear combination of the raw item scores. Second, in the GRM the relation between the ML trait estimates and the 'true' trait levels is linear with unit slope. And, if the test is reasonably long, the estimates are related to the 'true' trait levels according to the assumptions of an error-in-variables model (e.g. Samejima, 1977). So, the regression of the linear estimates on the GRM estimates is expected to have essentially the same shape (i.e. S-shaped) as the regression of the test scores on the true θ levels (possibly with a slight attenuation due to the measurement error).
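To make the closed-form versus iterative contrast concrete, the sketch below computes both types of score for a single response pattern: Bartlett's score under the congeneric model and a numerically maximized ML estimate under a normal-ogive GRM. All parameter values and function names are invented for illustration; this is a minimal sketch of the two scoring rules under those assumptions, not the code used in the study.

import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize_scalar

def bartlett_score(x, mu, lam, res_var):
    """Closed-form ML (Bartlett) trait estimate under the congeneric model:
    a weighted composite of the centered raw item scores."""
    w = lam / res_var                       # weights lambda_j / var(e_j)
    return np.sum(w * (x - mu)) / np.sum(lam * w)

def grm_category_probs(theta, a, b):
    """Normal-ogive GRM probabilities for the 5 categories of one item.
    a: discrimination; b: array of 4 ordered thresholds (locations)."""
    p_star = norm.cdf(a * (theta - b))      # P(X >= k), k = 1..4
    upper = np.concatenate(([1.0], p_star))
    lower = np.concatenate((p_star, [0.0]))
    return upper - lower                    # P(X = k), k = 0..4

def grm_ml_score(x, a, b):
    """Iterative ML trait estimate under the GRM (no closed form exists)."""
    def neg_loglik(theta):
        ll = 0.0
        for xj, aj, bj in zip(x, a, b):
            ll += np.log(grm_category_probs(theta, aj, bj)[xj] + 1e-12)
        return -ll
    return minimize_scalar(neg_loglik, bounds=(-4, 4), method="bounded").x

# Toy example: 4 items scored 0-4, with invented item parameters
x = np.array([3, 2, 4, 1])
mu, lam, res_var = np.full(4, 2.0), np.array([.6, .5, .7, .4]), np.ones(4)
a = np.array([1.0, 0.8, 1.2, 0.7])
b = np.array([[-1.5, -0.5, 0.5, 1.5]] * 4)
print(bartlett_score(x, mu, lam, res_var), grm_ml_score(x, a, b))

Because bartlett_score is a fixed weighted composite of the raw scores it returns a finite value even for all-extreme patterns, whereas the bounded search in grm_ml_score hints at the instability the GRM estimator can show near the extremes.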
Furthermore, given that the CTAC items are expected to have moderate discrimination and non-extreme locations, the regression is expected to be essentially linear in the trait range that contains most of the respondents.

Our second prediction is that, at both trait ends, the linear estimates will be closer to zero than the GRM estimates and that the latter will have a greater dispersion. The basis for this prediction is as follows. First, finite ML estimates based on the GRM do not exist for totally extreme patterns. Furthermore, the estimates may take very extreme values for near-extreme patterns, particularly when the spread of item locations is relatively small and the item discriminations are high (Kim & Nicewander, 1993). Although these conditions are not expected in the CTAC, some appreciable instability at the extremes is still expected. In the linear model, however, because the estimate is a weighted composite of the raw scores, finite estimates exist even for the totally extreme patterns. Furthermore, the changes that occur in the trait estimate as the pattern becomes extreme are gradual: no instability exists for near-extreme patterns.

We shall now discuss the predictions regarding measurement precision and sensitivity to change. As far as the latter point is concerned, the situation we consider here is a repeated-measures design in which the individual is administered the CTAC on two occasions with a retest interval that is long enough to avoid retest effects. The basic measure used to derive predictions in both cases (precision and sensitivity) is the test information, understood as a measure of conditional precision (Mellenbergh, 1996). In both the linear model and the GRM, the amount of information is related to the precision of the ML estimate of the trait level. So, it assesses both the accuracy of our chosen ML scores as estimates of the 'true' trait levels and the sensitivity of these scores for detecting change.

In the linear model the test information does not depend on the trait level. So, the plot of the amount of information, which we shall term the test information curve (TIC), is flat, with constant information throughout the trait range. The amount of constant information depends only on (a) the number of items, and (b) their discriminating power. On the other hand, in the GRM the amount of information is a complex function of the trait level which depends on (a) the number of items, (b) the number of response categories (five in our case), (c) the items' discriminating power, and (d) the distances between the item locations. The amount of information increases with the number of items, the number of categories, and the discriminating power (Samejima, 1969, chapter 6). However, the impact of determinant (d) is not so clear (Baker, 1992). Therefore, it is very difficult to predict the relation between the TIC provided by the GRM and the constant amount of information provided by the linear model. In both cases, the information increases with the number of items and their discriminating power. However, it is difficult to go any further. As discussed above, the constant information predicted by the linear model cannot be a strictly correct result, so the information is expected to be approximately constant only for the range of trait values in which the item response function is essentially linear. Therefore, only in this range are the estimated precision and the measurement of change expected to be approximately correct.
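For reference, the two test information functions being compared can be written in generic notation as follows (these are the standard textbook expressions, stated with the symbols of the congeneric formulation above rather than quoted from the original sources):

    I_{lin} = \sum_{j=1}^{n} \frac{\lambda_j^2}{\sigma^2_{\varepsilon_j}}  (constant over θ)

    I_{GRM}(\theta) = \sum_{j=1}^{n} \sum_{k} \frac{[P'_{jk}(\theta)]^2}{P_{jk}(\theta)}

where P_{jk}(θ) is the probability of responding in category k of item j at trait level θ, and P'_{jk}(θ) is its derivative. The first expression is a single constant; the second varies with θ and therefore produces a curved TIC.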
As for the GRM, from previous results we can assume that the CTAC item locations are generally well spread and centered around the population mean of θ, and that the items' discriminating power is only moderate. If this is so, it follows that the GRM-based TIC should be relatively flat, centered around zero, and should provide a reasonable amount of information over a wide range of trait values. From this result, two predictions can be made. First, measurement precision and sensitivity to change are expected to be maximal around the zero trait mean. Second, precision and sensitivity to change are expected to be acceptable over a wide range of trait values. Finally, we are unable to predict the relation between the amounts of information provided by the two models, so we propose to assess this issue empirically.

Finally we turn to person-fit assessment. Of the various types of parametric person-fit procedures (see e.g. Meijer & Sijtsma, 1995, 2001 for reviews), this study focuses on global scalar-valued indices, which assess the extent to which a response pattern is consistent given the chosen model (the linear model or the GRM in our case) and the estimated trait value of the respondent. More specifically, we shall use global indices based on the likelihood function. Like all person-fit indices developed so far, likelihood-based indices have both theoretical (they are approximations) and practical shortcomings (e.g. Magis, Raîche & Béland, 2012). However, they are simple and practical, and perform reasonably well when used as first-step devices for flagging potentially inconsistent respondents (Ferrando, 2007; Meijer & Sijtsma, 1995, 2001).

The specific indices we shall use in this study are (a) the polytomous extension of Levine and Rubin's (1979) index ($l_{z_{GRM}}$; Drasgow, Levine & Williams, 1985) for the GRM-based analyses, and (b) the $l_{co}$ index proposed by Ferrando (2007) for the congeneric model. The results of both indices are expected to be comparable for three reasons. First, both indices are likelihood-based. Second, they are independent of the trait level, and are therefore expected to detect misfitting patterns equally well at all trait levels. Finally, they can both be referred to a theoretical distribution ($l_{z_{GRM}}$ to the standard normal and $l_{co}$ to a chi-square distribution). In spite of this comparability, however, it is hard to make predictions about the relation between $l_{z_{GRM}}$ and $l_{co}$ because of the approximate nature of both indices. In a substantive study such as the present one, the indices are mainly intended to be used as screening devices for flagging potentially inconsistent respondents. So, in addition to assessing the degree of relation between the indices, we shall also assess whether they both flag mostly the same respondents as inconsistent.

METHOD

Participants. The participants were 670 visually impaired or blind people (39.7% men and 60.3% women; mean age 73.32 years, standard deviation 6.88, ranging from 59 to 92 years). They were all members of ONCE, and met the conditions under which the CTAC is intended to be used: a residual vision of 0.1 or lower on the Wecker scale and/or a visual field of 10 degrees or lower. They had no other pathologies. Participants came from different Spanish cities (18.5% Tarragona, 23.6% Barcelona, 12.2% Sevilla, 13.7% Valencia, 13.4% Madrid, 14% other, and 4.5% missing data). None of the participants were living in assisted centers.
For all the participants, one psychologist per city read them the items and wrote down their answers on a paper-and-pencil questionnaire. It is perhaps relevant to note that the CTAC sample is regularly updated, and that the first 350 of this sample of 670 had been used in the previous studies referred to in this paper.

Instruments. The CTAC (Pallero et al., 2006) is made up of 35 items with a 5-point response format and, as discussed above, aims to measure the physiological and emotional behaviors that reflect anxiety. Each item has two parts. In the first part the respondent is asked to imagine him/herself in a situation, related to their visual deficiency, that the researcher describes. In the second part the respondent has to indicate the degree of anxiety the imagined situation would evoke nowadays, using a suggested adjective that may refer to emotional or cognitive anxiety. An example item is:

"Imagine that you are home alone, you drop a spoon and you can't find it. To what extent do you feel helpless?"

In previous studies on the dimensionality of the CTAC, a pair of items (9 and 27) that differ in the evoked degree of anxiety but which are very similar in both form and content were flagged as problematic. This pair is likely to behave as a locally dependent doublet, thus giving rise to biased estimates and distorted goodness-of-fit results. For this reason item 9 (the least discriminating) was omitted in the present study, and all the analyses that follow were based on the remaining 34-item set.

Procedures

Model Estimation and Scoring

Both the congeneric model and the GRM were fitted using an FA approach. The congeneric model was fitted by using a standard FA based on the mean vector and the inter-item covariance matrix. The GRM was fitted by using a factor-analytic limited-information estimation procedure based on the bivariate polychoric tables between pairs of item scores. To make the results as comparable as possible, both models were fitted using a robust estimation procedure with mean- and variance-corrected goodness-of-fit statistics. In the congeneric model we used robust maximum likelihood estimation. In the GRM we used robust weighted least squares estimation. In both cases the models were estimated using the program Mplus 6.11 (B. Muthén & L.K. Muthén, 2010). Once the models had been fitted and their appropriateness had been assessed (item calibration), the item parameters were taken as fixed and known, and used to obtain ML estimates of the trait level for each individual (individual scoring).

Assessment of Measurement Precision and Sensitivity to Change

Measurement precision was assessed by computing the amount of test information as a function of the trait level. The general expression we used to obtain the expected information, which is applicable to both models (e.g. Kendall & Stuart, 1977), is

    I(\theta) = -E\left[ \frac{\partial^2 \log L}{\partial \theta^2} \right]    (1)

where log L is the log-likelihood for the corresponding response vector according to the model. As mentioned above, the amount of information is related to the precision of the ML estimate of the trait level. More specifically, as the number of items increases without limit, the standard error of the ML estimate is

    s.e.(\hat{\theta} \mid \theta) = \left( \frac{1}{I(\theta)} \right)^{1/2}    (2)

For both models, the information values obtained were next used to plot the TICs and check the predictions discussed above.
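As an illustration of how expressions (1) and (2) translate into a computation, the sketch below evaluates the GRM test information (via the standard category-probability form of the item information) and the constant congeneric information for a small set of invented item parameters, and converts each into a standard error. The parameter values and function names are hypothetical; this is a minimal sketch under those assumptions, not the routine actually used in the study.

import numpy as np
from scipy.stats import norm

def grm_item_information(theta, a, b):
    """Information of one normal-ogive GRM item with discrimination a and
    ordered thresholds b: I_j(theta) = sum_k P'_jk(theta)^2 / P_jk(theta)."""
    z = a * (theta - b)
    p_star = norm.cdf(z)                    # P(X >= k), k = 1..4
    d_star = a * norm.pdf(z)                # derivatives of P(X >= k)
    p = -np.diff(np.concatenate(([1.0], p_star, [0.0])))   # P(X = k)
    dp = -np.diff(np.concatenate(([0.0], d_star, [0.0])))  # P'(X = k)
    return np.sum(dp**2 / np.maximum(p, 1e-12))

def grm_test_information(theta, a, b):
    return sum(grm_item_information(theta, aj, bj) for aj, bj in zip(a, b))

def linear_test_information(lam, res_var):
    """Constant congeneric information: sum of lambda_j^2 / var(e_j)."""
    return np.sum(lam**2 / res_var)

# Invented parameters for a short 5-item illustration
a = np.array([1.0, 0.8, 1.2, 0.9, 0.7])
b = np.tile(np.array([-1.5, -0.5, 0.5, 1.5]), (5, 1))
lam, res_var = np.array([.6, .5, .7, .55, .45]), np.ones(5)

for th in np.linspace(-3, 3, 7):
    info = grm_test_information(th, a, b)
    print(f"theta={th:+.1f}  GRM info={info:.2f}  s.e.={info**-0.5:.2f}  "
          f"linear info={linear_test_information(lam, res_var):.2f}")

Evaluating grm_test_information over a fine grid of θ values yields the TIC, while the congeneric value plots as a horizontal line.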
Finally, the relation between the amounts of information provided by the two models was assessed by using the concept of relative efficiency (Lord, 1974), which, in our case, is simply the ratio of the amounts of information provided by the two models, obtained as a function of θ.

We turn now to the assessment of change. Two procedures were used to decide whether change was statistically significant. The first one (Speer, 1992; Reise & Haviland, 2005) is approximate but very simple. It consists of (a) setting a confidence band around the test score obtained at Time 1, and (b) considering change as significant if the score at Time 2 falls beyond this band. Let $\hat{\theta}_1$ be the ML trait estimate obtained for the individual at Time 1. A 90% confidence band is then computed as

    \hat{\theta}_1 \pm 1.65 \; s.e.(\hat{\theta}_1 \mid \theta_1)    (3)

where s.e. is the standard error of estimate in (2). The second, more complete, procedure takes into account that both the Time 1 and the Time 2 estimates contain measurement error (Finkelman, Weiss, & Kim-Kang, 2010). Using the same critical value as above, the minimum difference in ML estimates that is required to consider change as significant is

    D = 1.65 \sqrt{ s.e.^2(\hat{\theta}_1 \mid \theta_1) + s.e.^2(\hat{\theta}_2 \mid \theta_2) }    (4)

Person-Fit Assessment

For each respondent, the $l_{z_{GRM}}$ and $l_{co}$ indices were computed by using the ML trait estimates described above. If we denote by $l_{0GRM}$ the log-