Research Report No. 2004-2
ETS RR-04-01

A Simulation Study to Explore Configuring the New SAT® Critical Reading Section Without Analogy Items

Jinghua Liu, Miriam Feigenbaum, and Linda Cook
College Entrance Examination Board, New York, 2004

Jinghua Liu is a measurement statistician at Educational Testing Service. Miriam Feigenbaum is a principal statistical associate level II at Educational Testing Service. Linda Cook is a principal research scientist level II at Educational Testing Service.

Researchers are encouraged to freely express their professional judgment. Therefore, points of view or opinions stated in College Board Reports do not necessarily represent official College Board position or policy.

The College Board: Expanding College Opportunity

The College Board is a not-for-profit membership association whose mission is to connect students to college success and opportunity. Founded in 1900, the association is composed of more than 4,500 schools, colleges, universities, and other educational organizations. Each year, the College Board serves over three million students and their parents, 23,000 high schools, and 3,500 colleges through major programs and services in college admissions, guidance, assessment, financial aid, enrollment, and teaching and learning. Among its best-known programs are the SAT®, the PSAT/NMSQT®, and the Advanced Placement Program® (AP®). The College Board is committed to the principles of excellence and equity, and that commitment is embodied in all of its programs, services, activities, and concerns. For further information, visit www.collegeboard.com.

Additional copies of this report (item #030481025) may be obtained from College Board Publications, Box 886, New York, NY 10101-0886, 800 323-7155. The price is $15. Please include $4 for postage and handling.

Copyright © 2004 by College Entrance Examination Board. All rights reserved. College Board, SAT, and the acorn logo are registered trademarks of the College Entrance Examination Board. SAT Reasoning Test is a trademark owned by the College Entrance Examination Board. PSAT/NMSQT is a registered trademark of the College Entrance Examination Board and National Merit Scholarship Corporation. Other products and services may be trademarks of their respective owners. Visit College Board on the Web: www.collegeboard.com. Printed in the United States of America.

Contents

Abstract
Introduction
Phase 1
    Method
        Design of Prototypes
        Item Pool Construction
        Assembly of Simulated Forms
        Analyses of Prototypes: Item Statistics
        Analyses of Prototypes: Test Statistics
        Establishing Psychometric Criteria
        Establishing Nonpsychometric Criteria
    Results
        Item Statistics—Equated Deltas and r-biserial
        Test Statistics—CSEMs, SEMs, and Reliabilities
    Discussion
Phase 2
    Method
    Results
        Revised Delta Distribution
        Item Statistics
        Test Statistics
    Discussion
Phase 3
    Method
        Estimation of the Length and Delta Distribution of Three Hypothetical Tests
        Computation of CSEM for Three Hypothetical Tests
    Results
General Discussion
References

Tables
1. Configuration of the Current SAT Verbal Section (SAT–V) and Prototypes (Number of Items by Item Types)
2. Specified Delta Distributions for the SAT–V and Prototypes
3. Summary of Item Statistics for the SAT–V and Prototypes
4. Reliability Estimates and Standard Error of Measurement (SEM) of Scaled Scores for the SAT–V and Prototypes
5. Specified Delta Distributions for the SAT–V, Original and Revised Prototypes
6. Summary of Item Statistics for the SAT–V and Revised Prototypes
7. Reliability Estimates and Standard Error of Measurement (SEM) of Scaled Scores for the SAT–V and Revised Prototypes
8. Specified Delta Distributions for Different Item Types in the Item Pool
9. Estimation of Hypothetical Item-Type Test

Figures
1. Interpolation of frequency distribution for the equated new form from the base form
2. The average scaled score CSEMs: Prototype A compared to the SAT–V
3. The scaled score CSEMs: Multiple versions of Prototype A compared to verbal sections of 13 recent SAT tests
4. The average scaled score CSEMs: Prototype B compared to the SAT–V
5. The scaled score CSEMs: Multiple versions of Prototype B compared to verbal sections of 13 recent SAT tests
6. The average scaled score CSEMs: Prototype C compared to the SAT–V
7. The scaled score CSEMs: Multiple versions of Prototype C compared to verbal sections of 13 recent SAT tests
8. The average scaled score CSEMs: Prototype D compared to the SAT–V
9. The scaled score CSEMs: Multiple versions of Prototype D compared to verbal sections of 13 recent SAT tests
10. The average scaled score CSEMs: Revised Prototype A compared to the SAT–V and the original Prototype A
11. The average scaled score CSEMs: Revised Prototype D compared to the SAT–V and the original Prototype D
12. The average scaled score CSEMs for tests composed solely of a single SAT–V item type

Abstract

This study explored possible configurations of the new SAT® critical reading section without analogy items. The item pool contained items from SAT verbal (SAT–V) sections of 14 previously administered SAT tests, calibrated using the three-parameter logistic IRT model. Multiple versions of several prototypes that do not contain analogy items were assembled. Item statistics and test statistics for the simulated forms were compared to the average of 13 forms of the SAT–V. These statistics included: IRT scaled score reliability, scaled score standard error of measurement, conditional scaled score standard error of measurement, r-biserial, and equated deltas. The results indicated that it is possible to maintain measurement precision for the new SAT critical reading section without analogy items, but it may be necessary to modify the distribution of item difficulty in order to obtain adequate precision at the ends of the score scale.

Key words: SAT verbal section (SAT–V), new critical reading section, data mining, analogy

Introduction

The SAT Reasoning Test™ (hereinafter called the SAT) is an objective and standardized test that measures verbal and mathematical reasoning abilities that students develop over time, both in and out of school. The current verbal portion of the test (SAT–V) measures verbal reasoning abilities, with emphasis on critical reasoning and vocabulary abilities in the context of reading passage, analogy, and sentence completion questions. The SAT–V includes 78 items: 19 Analogy (AN) items, 19 Sentence Completion (SC) items, and 40 Critical Reading (CR) items. Each of the three item types allows measurement of vocabulary knowledge, and all include a range of words of various levels of difficulty.

In order to strengthen the alignment of the SAT to curriculum and instructional practices in high schools and colleges, the College Board will be making substantial changes to the SAT. For the verbal portion, the upcoming changes include eliminating AN items, adding paragraph-length critical reading passages, and changing the name of the test section from verbal to critical reading. The new SAT critical reading section "measures knowledge of genre, cause and effect, rhetorical devices, and comparative arguments and ability to recognize relationships among parts of a text" (College Board, 2002). Vocabulary knowledge will continue to be measured, but through the use of "vocabulary in context" questions, based either on reading passages or independent sentences.

This study explored the configuration of the new SAT critical reading section without AN items. The study was carried out as a simulation study using data from past administrations of the SAT–V. Simulated forms with a reduced number of or no AN items were assembled. Item and test statistics for the simulated forms were analyzed and compared to the current forms. An underlying assumption in the reconfiguration was that the overall difficulty of the SAT–V and the score reporting scale for the SAT–V would not be changed. Consequently, an important constraint for the study was the maintenance, as closely as possible, of the current test specifications, including delta distributions. The current specifications were established using IRT methods and were endorsed by the SAT Committee when the SAT–V and mathematical (SAT–M) sections were revised in 1994. In this study, we worked carefully to maintain the overall difficulty level of the SAT–V and to mirror, as much as possible, the psychometric characteristics of the SAT–V.

This study represents an application of IRT to explore the implications of revising an existing test section, such as the SAT–V. IRT provides a powerful data simulation tool to evaluate the impact of revising a test section, as long as item responses exist for all of the items involved in the revision. Previous experience using the three-parameter logistic model with the SAT–V and SAT–M indicates that this model fits the data well (see Cook and Petersen, 1987; Petersen, Cook, and Stocking, 1983). Consequently, it was possible to use this model to simulate the impact on test and item statistics, such as test reliability and conditional standard errors of measurement, to evaluate the efficacy of various test configurations that were simulated without the analogy item type.
There were three phases in this study. During Phase 1, multiple versions of four prototypes without or with a reduced number of AN items were assembled and analyzed. However, none of the prototypes produced test simulations with measurement errors as small as those produced by the SAT–V for scores below 300 or above 700. As a result, a second phase of the study was carried out. Two of the most promising prototypes were selected and revised, and additional versions of each of these two prototypes were simulated and analyzed. Further investigations were explored during Phase 3. In an effort to better understand and compare the measurement errors resulting from different item types, three hypothetical tests were formed using IRT item statistics: an all-AN item test, an all-SC item test, and an all-CR item test. The conditional standard error of measurement of each hypothetical test was computed and compared.

Phase 1

Method

Design of Prototypes

Table 1 provides the configuration of the current verbal section (SAT–V) as well as the four prototypes for the new critical reading section evaluated for the study. The prototypes were created by ETS experts, with consideration given to face validity, speededness, alignment with the current test, etc., and in consultation with the SAT Test Development Committee. All prototypes, with the exception of Prototype C, were assembled without AN items. Prototype C, which contained 10 AN items, was assembled to provide additional baseline data for the study and also as a possible alternative in the event that it was found that the omission of AN items seriously impacted the ability to produce a viable replacement for the current verbal section. Prototypes A and B represented increasingly heavier reliance on a reading construct. Prototype A contained approximately 56 percent CR items, as compared to 51 percent CR items in the SAT–V. Prototype B contained approximately 71 percent CR items. Prototype D also contained a high percentage of CR items (approximately 72 percent), and included a simulated item type, Discrete Reading (DR) items, which do not appear in the SAT–V. DR items each have a stimulus of 60–80 words (two or three sentences) followed by a single multiple-choice question. When the study was conducted, there was no information available on how these items would function. ETS experts estimated the item statistics based on their experience and the characteristics of the current critical reading items. These DR items were pretested later in a regular SAT administration; the item statistics turned out to be very close to the estimations.

TABLE 1
Configuration of the Current SAT® Verbal Section (SAT–V) and Prototypes (Number of Items by Item Types)

Item Type              Current        Prototype
                                   A     B     C     D
Analogy                   19       -     -    10     -
Sentence Completion       19      32    19    25    19
Critical Reading          40      40    46    40    40
Discrete Reading           -       -     -     -     8
Total                     78      72    65    75    67

Note: Discrete Reading items each have a "passage" of 60–80 words (two or three sentences). This item type does not exist in the current configuration of the test section (SAT–V).

All prototypes were shorter than the SAT–V because the administration time for the new critical reading section will be reduced to 70 minutes from the current 75 minutes. In addition, the amount of time needed to answer different types of items is different. It was estimated that the average time to answer each item type is 0.5 min/AN, 0.7 min/SC, and 1.0–1.2 min/CR, depending on the length of the passage (Bridgeman, Cahalan, and Cline, 2003). As can be seen, the AN item type is the least time consuming. When the testing time is shorter and the least time-consuming items are removed, it is necessary to shorten the test in order to ensure that the sections are not speeded.
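To make the speededness point concrete, the short sketch below combines the per-item timing estimates quoted above with the item counts in Table 1. The 1.1 min/CR figure (the midpoint of the quoted 1.0–1.2 range) and the 1.0 min/DR figure are assumptions made here for illustration; the report gives no timing estimate for DR items.

```python
# Per-item timing estimates from Bridgeman, Cahalan, and Cline (2003):
# 0.5 min/AN, 0.7 min/SC, 1.0-1.2 min/CR. The CR midpoint (1.1) and the
# DR figure (1.0) are assumptions for this illustration only.
MINUTES_PER_ITEM = {"AN": 0.5, "SC": 0.7, "CR": 1.1, "DR": 1.0}

# Item counts by type, taken from Table 1.
CONFIGS = {
    "Current SAT-V": {"AN": 19, "SC": 19, "CR": 40},
    "Prototype A":   {"SC": 32, "CR": 40},
    "Prototype B":   {"SC": 19, "CR": 46},
    "Prototype C":   {"AN": 10, "SC": 25, "CR": 40},
    "Prototype D":   {"SC": 19, "CR": 40, "DR": 8},
}

def estimated_minutes(config):
    """Rough expected response time for one configuration."""
    return sum(count * MINUTES_PER_ITEM[item_type]
               for item_type, count in config.items())

for name, config in CONFIGS.items():
    print(f"{name}: {estimated_minutes(config):.1f} min")
```

These rough totals are only illustrative; the actual prototype configurations in Table 1 were set by ETS test development experts in consultation with the SAT Test Development Committee.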
Item Pool Construction

The item pool was formed by linking the item parameter estimates from the Item Response Theory (IRT) calibrations of operational and equating SAT items that had been administered over the past several years. Fourteen SAT–V operational forms and several SAT–V equating tests from previous administrations were calibrated using the three-parameter logistic IRT model, and the resulting parameter estimates were placed on the same scale by linking them back to the same base form. The linking procedure used to place item parameter estimates from the 14 tests on the same scale was developed by Stocking and Lord (1983), and was found by Petersen, Cook, and Stocking (1983) to work well with SAT data. These calibrated items formed the SAT–V item pool of more than 1,300 items, which provided the basis for the simulation forms.

Assembly of Simulated Forms

Automated item selection (AIS). To construct tests, items are selected and assembled into intact test forms. Item selection is usually subject to various rules that constrain the selection of items for test forms. These rules are called test specifications. Test specifications for the SAT can be classified into several categories: content constraints, statistical constraints, item sensitivity, and item overlap. When constructing tests, test developers provide a set of constraints to a computer and then evaluate the results of the selected items.

The SAT Program currently employs test creation software that uses the automated item selection (AIS) algorithm to assemble tests. This method requires a variety of constraints with different weights, and these constraints are used as rules to select items. The model attempts to satisfy the target test properties by minimizing the aggregate failures, and attempts to provide some control over allowable failures by weighting each constraint (Stocking and Swanson, 1992). When allowable failures happen, lower-weighted constraints are violated first. Consequently, by using AIS, the quality of the assembled test is usually assured. When building multiple versions of the tests, AIS is capable of developing unique tests with no or very low item overlap.
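The report describes AIS only at this high level. As a rough illustration of how weighted constraints can steer item selection, the toy greedy routine below always picks the item that fills the most heavily weighted, still-unmet targets, so that lightly weighted constraints are the first to go unsatisfied when the pool cannot meet everything. It is a deliberately simplified stand-in written for this summary, not the weighted-deviations model of Stocking and Swanson (1992) or the actual ETS software; the constraint keys and weights are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Item:
    item_id: str
    item_type: str    # "AN", "SC", "CR", or "DR"
    delta_level: int  # specified delta interval, e.g., 11

def constraint_keys(item):
    # Each item counts toward one item-type target and one delta-level target.
    return [("type", item.item_type), ("delta", item.delta_level)]

def greedy_select(pool, targets, weights, test_length):
    """Toy weighted-constraint selection: `targets` maps a constraint key to a
    desired count; `weights` maps the same keys to penalties for missing them.
    At each step the item filling the largest total weight of unmet targets is
    chosen, so low-weight constraints are the first to be violated."""
    counts = {key: 0 for key in targets}
    chosen, available = [], list(pool)
    while len(chosen) < test_length and available:
        def gain(item):
            return sum(weights.get(key, 1.0)
                       for key in constraint_keys(item)
                       if key in targets and counts[key] < targets[key])
        best = max(available, key=gain)
        available.remove(best)
        chosen.append(best)
        for key in constraint_keys(best):
            if key in counts:
                counts[key] += 1
    return chosen
```

Because an item's gain drops to zero once its targets are met, items that would overshoot a filled target are passed over in favor of those that still contribute to heavily weighted, unmet constraints, which is the behavior the report attributes to AIS when allowable failures occur.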
It was decided to use AIS to assemble the prototypes for this study. All of the constraints, including content constraints, item sensitivity, and item overlap, remained the same as for the SAT–V. Therefore, the modified prototypes could be assembled exactly the same way as the SAT–V is assembled.

Setting statistical specifications. Statistical specifications provide the guide for building tests that discriminate effectively at the ability levels where discrimination is most needed. SAT statistical specifications call for specific numbers of items across a range of intervals on the item difficulty scale. At ETS, the index of item difficulty used for the SAT Program is the delta statistic. The delta index is based on the percent of test-takers who attempt to answer the item and who answer the item correctly (i.e., the p-value): 1 minus the p-value is converted to a normalized z-score and transformed to a scale with a mean of 13 and a standard deviation of 4. A higher delta value represents a harder item.

This conversion of p-values provides raw delta values that reflect the difficulty of the items taken by particular examinees from a particular administration. This measure of item difficulty must then be adjusted to correct for differences in the ability of different test-taking populations. Delta equating is a statistical procedure used to convert raw delta values to equated delta values. The procedure involves administering some old items with known equated delta values along with new items. Each old item then has two difficulty measures: the equated delta, which is on the scale, and the observed delta from the current group of examinees. The linear relationship between the pairs of observed and equated deltas on the old items is used to determine scaled values for each of the new items. Delta equating is essential because the groups taking a particular test may differ substantially in ability from one administration to another. Through delta equating, the difficulty of items taken by different groups can be expressed on a single scale so they can be more appropriately compared. The delta values discussed in this paper are equated deltas.

As mentioned previously, SAT test specifications have historically been set using equated delta distributions. This practice was continued when the test was revised in 1994. Consequently, the test is assembled using classical test theory statistics (equated deltas and biserial correlation coefficients). All test assembly software developed for the SAT operates using target distributions of these classical test theory statistics.

The delta distribution for the SAT–V is shown in Table 2. As can be seen, it is a unimodal distribution with more middle-difficulty items and fewer very easy or very difficult items. Rather than proposing a new delta distribution, the delta distributions were obtained for each prototype by proportionally reducing the number of items at each delta level to reflect the reduced total number of items in the prototypes. Since the same proportion of items was maintained at each delta level, the mean and standard deviation of each prototype were very close to the specified equated delta mean and standard deviation of the SAT–V. As mentioned previously, an important constraint of this study was the maintenance of the overall difficulty level of the current SAT, so matching the prototype specifications to the current test specifications was an important step in the study.

TABLE 2
Specified Delta Distributions for the SAT–V and Prototypes

Delta Level                       Prototype (10 forms/prototype)
(Specified)      Current      A            B            C            D
>=19                -         -            -            -            -
18                  -         -            -            -            -
17                  -         -            -            -            -
16                  1         1            1            1 (1–2)      1 (0–1)
15                  4         3            3            4 (3–4)      3 (2–3)
14                  6         6            5            6 (5–6)      5 (4–6)
13                  8         8 (7–8)      7 (6–7)      8 (7–8)      7 (6–8)
12                 12        11 (11–12)   10 (10–12)   11 (11–12)   10 (8–13)
11                 14        13 (13–14)   11 (10–14)   13 (13–14)   13 (11–15)
10                 12        11           10 (9–10)    11 (11–12)   10 (9–12)
9                   9         8            8 (7–9)      9            8 (7–9)
8                   7         6 (5–6)      6 (5–6)      7 (7–8)      6 (5–7)
7                   4         4            3            4 (4–5)      3 (3–4)
6                   1         1            1            1 (1–2)      1 (0–1)
<=5.9               -         - (0–1)      - (0–1)      - (0–1)      - (0–1)
Number of Items    78        72           65           75           67
Mean               11.4      11.4         11.4         11.4         11.4
S.D.                2.2       2.2          2.2          2.3          2.2

Note: The numbers in parentheses are the actual ranges used in the prototypes.
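The delta transformation and delta equating steps described above can be illustrated in a few lines. The least-squares fit and the sample numbers below are assumptions made for illustration; the report says only that the linear relationship between observed and equated deltas on the old items is used.

```python
from statistics import NormalDist

def observed_delta(p_value: float) -> float:
    """Map an item's p-value to the delta scale: 1 - p is converted to a
    normal deviate and rescaled to mean 13, SD 4, so harder items
    (smaller p) receive larger deltas."""
    z = NormalDist().inv_cdf(1.0 - p_value)
    return 13.0 + 4.0 * z

def delta_equating_line(observed, equated):
    """Fit the linear relationship between observed and equated deltas on
    the old (anchor) items; an ordinary least-squares line is assumed here."""
    n = len(observed)
    mean_x = sum(observed) / n
    mean_y = sum(equated) / n
    sxx = sum((x - mean_x) ** 2 for x in observed)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(observed, equated))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical anchor items: observed deltas from the current administration
# paired with their known equated deltas.
slope, intercept = delta_equating_line([9.8, 11.2, 12.5, 14.1],
                                       [10.3, 11.8, 13.0, 14.7])
# Place a new item on the equated delta scale from its observed delta.
new_equated_delta = intercept + slope * observed_delta(p_value=0.62)
```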
Assembly of simulation forms. Once statistical specifications for all prototypes were set, the test assembly software, AIS, was used to assemble simulated test forms from the item pool. Ten versions of each of the four prototypes were assembled, for a total of 40 experimental test versions. Each of the 10 versions under a prototype was a unique test without any item overlap. The prototypes were evaluated in terms of item statistics, test statistics, and nonpsychometric criteria. The results were compared to criteria established based on selected statistics from 13 SAT forms administered from March 1999 to May 2001.

Analyses of Prototypes: Item Statistics

Item statistics are statistical descriptions of how a particular item functions in a test. Typically, the analyses provide information about the difficulty of the item and the ability of the item to discriminate among the examinees. As described previously, equated delta is the difficulty index reported in this study. The other item statistic evaluated in this study is item discrimination power. Each item in a test should be able to distinguish between higher ability and lower ability examinees with respect to the trait being measured. The degree to which an item can discriminate between higher ability and lower ability examinees is known as its power of discrimination. There are a number of methods of assessing the discriminating power of a test item. The one currently used at ETS is the r-biserial, which measures the strength of the relationship between a dichotomous variable (item right versus item wrong) and a criterion variable that is continuous (a total test score along the score scale with many possible values).

For each of the prototypes, the item statistics (equated deltas and r-biserials) were produced by averaging the item statistics (10 forms/prototype) obtained when the individual items were administered as part of the operational test forms.
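The report describes the r-biserial only in words. As a rough illustration, the sketch below computes the textbook biserial correlation between item correctness and total score, which is one common way to operationalize the relationship just described; the specific formula and the sample data are assumptions for illustration, not ETS's operational procedure.

```python
import numpy as np
from statistics import NormalDist

def r_biserial(item_correct, total_score):
    """Textbook biserial correlation between a dichotomous item score
    (1 = right, 0 = wrong) and a continuous criterion (total score)."""
    item_correct = np.asarray(item_correct, dtype=float)
    total_score = np.asarray(total_score, dtype=float)
    p = item_correct.mean()            # proportion answering correctly
    q = 1.0 - p
    mean_right = total_score[item_correct == 1].mean()
    mean_wrong = total_score[item_correct == 0].mean()
    sd_total = total_score.std()
    # Ordinate of the standard normal at the point cutting off proportion p.
    y = NormalDist().pdf(NormalDist().inv_cdf(p))
    return (mean_right - mean_wrong) / sd_total * (p * q / y)

# Tiny invented example: eight examinees' item scores and total scores.
item = [1, 1, 0, 1, 0, 1, 0, 1]
total = [50, 55, 47, 52, 49, 58, 41, 46]
print(r_biserial(item, total))
```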
Analyses of Prototypes: Test Statistics

Test statistics provide information on the precision of measurement. The estimates of test statistics reported in this paper are IRT scaled score reliability, IRT scaled score standard error of measurement (SEM), and IRT scaled score conditional standard error of measurement (CSEM). The psychometric properties of the prototypes were evaluated by comparing the IRT scaled score reliability, SEM, and CSEM to these same statistics obtained from the SAT–V. The statistics were obtained using algorithms described by Dorans (1984). Dorans described the computation of CSEMs that can be combined with ability distributions to produce IRT-based estimates of the unconditional standard error of measurement as well as a reliability coefficient. The formulas developed by Dorans are described below.

Dorans (1984) employed the three-parameter IRT model as follows,

    P_i(\theta_j) = c_i + (1 - c_i) / (1 + \exp[-1.7 a_i (\theta_j - b_i)]),    (1)

where P_i(\theta_j) is the probability of a correct response at a given ability level \theta_j, given three item properties: discrimination (a_i), difficulty (b_i), and guessing (c_i).

The CSEM for a given number-right score, based on the binomial model, is

    CSEM(\theta_j) = \sqrt{ \sum_{i=1}^{K_T} P_i(\theta_j) Q_i(\theta_j) },    (2)

where Q_i(\theta_j) = 1 - P_i(\theta_j) and K_T is the total number of items on the test. When formula scoring is used, the CSEM may be computed by

    CSEM(\theta_j) = \sqrt{ \sum_{i=1}^{K_T} P_i(\theta_j) Q_i(\theta_j) [K_i / (K_i - 1)]^2 },    (3)

where K_i is the number of alternatives associated with item i.

An overall SEM is calculated based on the CSEMs and the number of test-takers,

    SEM = \sqrt{ \sum_j (N_j / N_T) \, CSEM_j^2 },    (4)

where N_j is the number of examinees obtaining score j in the analysis sample, and N_T is the total number of examinees in the analysis sample.

IRT scaled score reliability is calculated from the SEM,

    reliability = 1 - SEM^2 / \sigma^2,    (5)

where \sigma^2 is the variance of the scores.

The curves for the CSEMs were produced as part of the IRT equating analyses available using the GENASYS software.¹ Scores on all prototype forms were equated to scores on the same base form administered in March 2001. The criterion forms used for graphical displays of CSEMs are the 13 SAT–V forms that were previously mentioned.

¹ GENASYS (Generalized Analysis System) is a comprehensive statistical analysis system that combines current computer technology with modern psychometric practices to perform various statistical analyses.

The frequency distributions required to compute these reliability estimates were constructed using the distribution of scores obtained from the base form administration and the equating relationship between the prototypes and the base form. Figure 1 shows the frequency distribution for a hypothetical base form, to which total scores on all simulated forms are equated as a result of having item parameter estimates on the same scale as the base form. In this example, we equate a new simulated form to the base form. Suppose we want to know the frequency at a score of 56 on the new form. The corresponding equated raw score (i.e., the equivalent score on the base form) is 57.7. We use the base form frequency distribution to estimate the frequency at this point. The graph shows that 332 people received a score of 57 on the base form, whereas 320 people received a score of 58. Interpolating from the graph, we would estimate that 328 people would receive a score of 56 on the new form.

[Figure 1. Interpolation of frequency distribution for the equated new form from the base form.]
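A minimal numerical sketch of equations (1), (2), (4), and (5), together with the interpolation step illustrated in Figure 1, is given below; equation (3) follows the formula-score weighting stated above. The item parameters, the example values, and the function names are invented for illustration, and this is only a sketch of the computations, not the GENASYS implementation. The exact rounding or interpolation convention behind the quoted figure of 328 is not specified in the report, so the interpolated value here is approximate.

```python
import numpy as np

def p_correct(theta, a, b, c):
    """Equation (1): three-parameter logistic probability of a correct
    response at ability theta; a, b, c are arrays of item parameters."""
    return c + (1.0 - c) / (1.0 + np.exp(-1.7 * a * (theta - b)))

def csem_number_right(theta, a, b, c):
    """Equation (2): binomial-model CSEM of the number-right score at theta."""
    p = p_correct(theta, a, b, c)
    return np.sqrt(np.sum(p * (1.0 - p)))

def csem_formula_score(theta, a, b, c, n_alternatives):
    """Equation (3): formula-score CSEM, weighting each item by
    K_i / (K_i - 1), where K_i is its number of alternatives."""
    p = p_correct(theta, a, b, c)
    k = np.asarray(n_alternatives, dtype=float)
    return np.sqrt(np.sum(p * (1.0 - p) * (k / (k - 1.0)) ** 2))

def sem_and_reliability(csems, n_examinees, score_variance):
    """Equations (4) and (5): overall SEM as the frequency-weighted root
    mean square of the score-level CSEMs, and reliability = 1 - SEM^2/var."""
    w = np.asarray(n_examinees) / np.sum(n_examinees)
    sem = np.sqrt(np.sum(w * np.asarray(csems) ** 2))
    return sem, 1.0 - sem ** 2 / score_variance

def interpolated_frequency(base_form_freq, equated_score):
    """Figure 1: linearly interpolate the base-form frequency distribution
    at a (generally non-integer) equated raw score."""
    lower = int(equated_score)
    frac = equated_score - lower
    return (1.0 - frac) * base_form_freq[lower] + frac * base_form_freq[lower + 1]

# Invented five-item example.
a = np.array([0.8, 1.0, 1.2, 0.9, 1.1])    # discrimination
b = np.array([-1.0, -0.3, 0.2, 0.8, 1.5])  # difficulty
c = np.full(5, 0.2)                        # guessing
print(csem_number_right(0.0, a, b, c))

# Base-form counts quoted in the text: 332 examinees at 57, 320 at 58.
print(interpolated_frequency({57: 332, 58: 320}, 57.7))
```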
Establishing Psychometric Criteria

The criteria used for the evaluation of the prototypes were constructed in several ways. The means and standard deviations of the equated deltas, r-biserials, SEMs, and the reliability coefficients were computed by averaging values taken from the analyses of 13 SAT–V forms. Criteria for the CSEMs were the average values at each scaled score level of these 13 forms.

Establishing Nonpsychometric Criteria

In addition to the psychometric analyses, ETS content experts were asked to evaluate each of the prototypes according to the following nonstatistical criteria: face validity; educational relevance; ease of development; ease of configuring the SAT–V into separately timed sections; the cost of transitioning to the new critical reading section; the ongoing operational costs once the transition period was over; the ability to sustain subscores (should they be desired at some point in time); and the ease of aligning the PSAT/NMSQT® with the recommended changes to the SAT–V.

Results

Item Statistics—Equated Deltas and r-biserial

Table 3 provides information about the item statistics for the prototypes. It can be seen that the mean and standard deviation of the equated deltas and the mean and standard deviation of the r-biserials obtained for the prototypes are very similar to those for the criteria.

TABLE 3
Summary of Item Statistics for the SAT–V and Prototypes

                            Prototype
                 Current    A      B      C      D
Number of Items     78      72     65     75     67
Equated Delta
  Mean             11.4    11.4   11.4   11.3   11.3
  S.D.              2.2     2.2    2.2    2.3    2.2
r-biserial
  Mean             0.51    0.53   0.51   0.52   0.52
  S.D.             0.10    0.10   0.09   0.09   0.10

Note: Current criteria are based on 13 SAT–V forms.

Test Statistics—CSEMs, SEMs, and Reliabilities

Plots of IRT scaled score CSEM. Plots of IRT scaled score conditional standard errors of measurement can be found in Figures 2 through 9. These plots show the average CSEM values for the multiple versions of each prototype compared to the average CSEM values for the criterion obtained by averaging across 13 SAT–V forms, as well as the CSEM values for the 10 versions of each prototype compared to the CSEM values for the 13 SAT–V forms.

An examination of the average CSEM of Prototype A compared to the criterion, found in Figure 2, shows that slightly greater precision of measurement was gained for scores between about 300 and 700, where the majority of the scores are located. However, some measurement power was lost below 300 and above 700, where the CSEMs for the prototype were larger than those for the criterion. In addition, the CSEM values for all of the 10 versions under Prototype A were compared to the CSEM values for all 13 SAT–V forms that were used as criteria (see Figure 3). The results indicated that although there was some variation across individual forms, the trend was the same: most of the simulated forms had larger CSEMs than the 13 criterion forms for scaled scores below about 300 and above approximately 700.

Figures 4 and 5 show plots of CSEMs for Prototype B, average and multiple versions, respectively. The average of Prototype B appeared to result in slightly larger CSEMs throughout the mid-portion of the score range and larger CSEMs over the ends of the score range when compared to the criterion. The 10 individual forms under Prototype B followed a similar pattern.

Plots for the average of the 10 versions of Prototype C are found in Figure 6, and plots for multiple versions of Prototype C are found in Figure 7. Prototype C versions appeared to produce slightly smaller CSEMs than the criterion throughout the mid-portion
