Some Notes on Blinded Sample Size Re-Estimation

Ekkehard Glimm¹ and Jürgen Läuter²

¹ Novartis Pharma AG, CH-4002 Basel, Switzerland
² Otto-von-Guericke-Universität Magdeburg, 39114 Magdeburg, Germany

Abstract

This note investigates a number of scenarios in which unadjusted testing following a blinded sample size re-estimation leads to type I error violations. For superiority testing, this occurs in certain small-sample borderline cases. We discuss a number of alternative approaches that keep the type I error rate. The paper also gives a reason why the type I error inflation in the superiority context might have been missed in previous publications and investigates why it is more marked in the case of non-inferiority testing.

1 Introduction

Sample size re-estimation (SSR) in clinical trials has a long history that dates back to Stein (1945). A sample size review at an interim analysis aims at correcting assumptions which were made at the planning stage of the trial but turn out to be unrealistic. When the sample units are considered to be normally distributed, this typically concerns the initial assumption about the variation of responses. Wittes and Brittain (1990) and Gould and Shih (1992, 1998), among others, discussed methods of blinded SSR. In contrast to unblinded SSR, blinded SSR assumes that the actually realized effect size estimate is not disclosed to the decision makers who do the SSR. Wittes et al. (1999) and Zucker et al. (1999) investigated the performance of various blinded and unblinded SSR methods by simulation. They observed some slight type I error violations in cases with small sample size and gave explanations for this phenomenon for some of the unblinded approaches available at that time.

Slightly later, Kieser and Friede ([1], [2]) suggested a method of blinded sample size review which is particularly easy to implement. In a trial with normally distributed sample units with the aim of testing for a significant treatment effect ("superiority testing") at the final analysis, it estimates the variance under the null hypothesis of no treatment effect and then proceeds to an unmodified t-test in the final analysis, i.e. a test that ignores the fact that the final sample size was not fixed from the onset of the trial. Kieser and Friede investigated the type I error control of their suggestion by simulation. They conclude that no additional measures to control the significance level are required in these designs if the study is evaluated with the common t-test and the sample size is recalculated with any of these simple blind variance estimators.

Although Kieser and Friede explicitly stated that they provide no formal proof of type I error control, it seems to us that many statisticians in the pharmaceutical industry are under the impression that such a proof is available. This, however, is not the case. In this paper, we show that in certain situations, the method suggested by Kieser and Friede does not control the type I error.

It should be emphasized that asymptotic type I error control with blinded SSR is guaranteed. If the sample size of only one of the two stages tends to infinity, the other stage is obviously irrelevant for the asymptotic value of the final test statistic, and thus the method asymptotically keeps $\alpha$. If the sample size in both stages goes to infinity, then the stage-1 estimate of the variance converges to a constant value.
Hence, whatever sample size re-estimation rule is used, it implicitly fixes the total sample size in advance (though its precise value is not yet known before the interim). In any case, $\alpha$ is asymptotically kept. Govindarajulu (2003) has formalized this thought and extended it to non-normally distributed data. As a consequence, the type I error violations discussed in this note are very small and occur in cases with small samples. We still believe, however, that the statistical community should be made aware of these limitations of blinded sample size review methodology.

While sections 2-4 focus on the common case of testing for treatment differences in clinical trials, section 5 briefly discusses the case of testing for non-inferiority of one of the two treatments. It had been noted in another paper by Friede and Kieser [13] that type I error inflations from SSR can be more marked in this situation. We give an explanation of this phenomenon.

2 A scenario leading to type I error violation

In this section we show that in certain cases, a blinded sample size review as suggested by [1] leads to a type I error which is larger than the nominal level $\alpha$.

In general, blinded sample size review is characterized by the fact that the final sample size of the study may be changed at interim analyses, but that this change depends on the data only via the total variance, which is the variance estimate under the null hypothesis of interest. If $x_i$, $i = 1, \dots, n_1$, are stochastically independent normally distributed observations, this total variance is proportional to $\sum_{i=1}^{n_1} x_i^2$ in the one-sample case and to $\sum_{i=1}^{n_1} x_i^2 - n_1 \bar{x}^2$ in the two-sample case.

We consider the one-sample t test of $H_0\colon \mu = 0$ at level $\alpha$ applied to $x_i \sim N(\mu, \sigma^2)$. The reasons for this are simplicity of notation and the fact that the geometric considerations given below cannot be imagined for the two-sample case, which would have to deal with a dimension larger than three even in the simplest setup. However, the restriction to the one-sample case entails no loss of generality, as it is conceptually the same as the two-sample case; we will briefly comment on this further below. In addition, a blinded sample size review may also be of practical relevance in the one-sample situation, for example in cross-over trials.

Assume a blinded sample size review after $n_1 = 2$ observations. If the total variance is small, we stop sampling and test with the $n = n_1 = 2$ observations we have obtained. If it is large, we take another sample element $x_3$ and do the test with $n = 3$ observations. This rule implies that $n = 2$ for $x_1^2 + x_2^2 \le r^2$ and $n = 3$ otherwise, for some fixed scalar $r$. Geometrically, the rejection region of the (one-sided) t test for $n = 3$ is a spherical cone with the equiangular line $x_1 = x_2 = x_3$ as its central axis in the three-dimensional space. By definition, the probability mass of this cone is $\alpha$ under $H_0$. For the case of $n = 2$, the rejection region is a segment of the circle $x_1^2 + x_2^2 \le r^2$ around the equiangular line $x_1 = x_2$. Hence, in three dimensions, the rejection region is a segment of the spherical cylinder $x_1^2 + x_2^2 \le r^2$, $x_3$ arbitrary. The probability mass covered by this segment again is $\alpha$ inside the cylinder. The rejection region of the entire procedure is the segment of the cylinder plus the spherical cone minus the intersection of the cone with the cylinder. We now approximate the probability mass of these components. For $r^2$ small, we approximately have $P(x_1^2 + x_2^2 \le r^2) = \frac{r^2}{2\sigma^2}$.
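This approximation is the leading term of the $\chi^2(2)$ distribution function; spelled out, since $(x_1^2 + x_2^2)/\sigma^2 \sim \chi^2(2)$ under $H_0$:

```latex
P(x_1^2 + x_2^2 \le r^2)
  = P\!\left(\chi^2(2) \le \tfrac{r^2}{\sigma^2}\right)
  = 1 - e^{-r^2/(2\sigma^2)}
  = \frac{r^2}{2\sigma^2}
    - \frac{1}{2}\left(\frac{r^2}{2\sigma^2}\right)^2 + \cdots
  \approx \frac{r^2}{2\sigma^2}.
```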
Hence, under $H_0$, the probability mass of this part of the rejection region is approximately $\frac{r^2}{2\sigma^2} \cdot \alpha$. The volume of the intersection of the cone with the cylinder can be approximated as follows: the central axis $x_1 = x_2 = x_3$ of the cone intersects with the cylinder in one of the points $\pm\left(\frac{r}{\sqrt{2}}, \frac{r}{\sqrt{2}}, \frac{r}{\sqrt{2}}\right)$. The distance of this point to the origin is thus $h = \sqrt{\frac{3}{2}}\, r$. The approximate volume of the intersection is $\frac{4\pi h^3}{3} = \sqrt{6}\,\pi r^3$. To conservatively approximate the probability mass of this intersection, we assume that every point in it has the same probability density as the origin (in reality, it of course has a lower density). Then the probability mass of the intersection is approximated by $\sqrt{6}\,\pi r^3 \cdot \alpha \cdot (\sqrt{2\pi}\,\sigma)^{-3}$, where $(\sqrt{2\pi}\,\sigma)^{-3}$ is the value of the normal density $N_3(0, \sigma^2 I_3)$ at the point 0. Combining these results, a conservative approximation of the probability mass of the rejection region for the entire procedure is

$$\alpha \left(1 + \frac{r^2}{2\sigma^2} - \frac{\sqrt{6}\,\pi r^3}{(\sqrt{2\pi}\,\sigma)^3}\right) = \alpha \left(1 + \frac{r^2}{2\sigma^2} - \frac{\sqrt{3}\, r^3}{2\sqrt{\pi}\,\sigma^3}\right). \qquad (1)$$

Obviously, this is larger than $\alpha$ for small $r$.

For the more general case of a stage-1 sample size of $n_1$, possibly followed by a stage 2 with $n_2$ further observations, the rejection region of the "sample size reviewed" t test has an approximate null probability following the same basic principle as (1):

$$\alpha \left(1 + \mathrm{const}_1 \cdot \left(\frac{r}{\sqrt{n_1}\,\sigma}\right)^{n_1} - \mathrm{const}_2 \cdot \left(\frac{r}{\sqrt{n_1}\,\sigma}\right)^{n_1+n_2}\right) \quad \text{if } \frac{r}{\sqrt{n_1}\,\sigma},\ n_1 \text{ and } n_2 \text{ are small.}$$

Consequently, there must be situations with small $\frac{r}{\sqrt{n_1}\,\sigma}$ where the blinded review procedure cannot keep the type I error level $\alpha$ exactly. Due to the symmetry of the rejection region, this statement holds for both the one- and the two-sided test of $H_0$.

Note that in this example, the test keeps $\alpha$ exactly if $\sum_{i=1}^{n_1} x_i^2 \le r^2$. This is due to the sphericity of the conditional null distribution of $(x_1, \dots, x_{n_1})$ given $\sum_{i=1}^{n_1} x_i^2 \le r^2$ (see [3], Theorem 2.5.8). The type I error violation stems from the fact that the test does not keep $\alpha$ conditional on $\sum_{i=1}^{n_1} x_i^2 > r^2$, i.e. if a second stage of sampling more observations is done.

To investigate the magnitude of the ensuing type I error violation, we simulated 10,000,000 cases with $n_1 = 2$ initial observations and $n_2 = 2$ additional observations that are only taken if $x_1^2 + x_2^2 \ge 0.5$. The true type I error of the two-sided combined t test turned out to be 0.0542 for a nominal $\alpha = 0.05$. As expected, this is caused by the situations where stage-2 data are obtained. Since $x_1^2 + x_2^2 \sim \chi^2(2)$, we have $P(x_1^2 + x_2^2 \ge 0.5) = 0.779$. This was also the value observed in the simulations. The rejection rate for these cases alone was 0.0553. If $x_1^2 + x_2^2 < 0.5$, we know that conditionally the rejection rate is exactly $\alpha$. Accordingly, this conditional rejection rate in the simulations was 0.0500.

If $n_1$ and $n_2$ are increased, the true type I error rate converges rather quickly to $\alpha$. For example, in the case of $n_1 = n_2 = 5$ and $r^2 = 2.5$, the simulated error rate is 0.0508, with 77.6% of cases leading to stage 2 and a conditional error rate of 0.0510 in case stage 2 applies.

We also performed some simulations where $n_2$ is determined with the algorithm suggested by [1]. For this purpose, we generated 10,000,000 simulation runs of a blinded sample size review after $n_1 = 2$ observations following the rule given in section 3 of [1] with a very large assumed effect of $\delta = 2.2$. This produces an average of 3.09 additional observations $n_2$. The simulated type I error was 0.05077.
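This small-sample simulation is easy to reproduce. The following is a minimal sketch of the $n_1 = n_2 = 2$, $r^2 = 0.5$ scenario (our illustration, not the authors' code; it assumes $\sigma = 1$, uses NumPy and SciPy, and all function names are ours). With enough runs, the three reported rates come out near 0.054, 0.055 and 0.050, matching the figures above:

```python
import numpy as np
from scipy import stats

def simulate_blinded_ssr(n1=2, n2=2, r2=0.5, alpha=0.05,
                         nsim=1_000_000, seed=1):
    """Two-sided one-sample t test after a blinded sample size review:
    stop with n = n1 observations if sum(x_i^2) < r2, otherwise take
    n2 further observations. Data are N(0, 1), i.e. H0 is true, so the
    rejection rates estimate the actual type I error."""
    rng = np.random.default_rng(seed)
    x1 = rng.standard_normal((nsim, n1))     # stage-1 data
    x2 = rng.standard_normal((nsim, n2))     # stage-2 data, used if needed
    stage2 = np.sum(x1**2, axis=1) >= r2     # blinded rule: total variance large

    def rejects(x):                          # unmodified t test on all data
        n = x.shape[1]
        t = np.sqrt(n) * x.mean(axis=1) / x.std(axis=1, ddof=1)
        return np.abs(t) > stats.t.ppf(1 - alpha / 2, df=n - 1)

    rej = np.where(stage2, rejects(np.hstack([x1, x2])), rejects(x1))
    return rej.mean(), rej[stage2].mean(), rej[~stage2].mean()

print(simulate_blinded_ssr())   # overall, given stage 2, given early stop
```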
To see that the two-sample case is also covered by these investigations, note that the ordinary t-test statistic can be viewed as $X/\sqrt{Y/s}$, where $X \sim N(\delta, 1)$ is stochastically independent of $Y \sim \chi^2(s)$. Regarding any investigation of the properties of this quantity, it obviously does not matter if the random variables $X$ and $Y$ arise as mean and variance estimate from a one-sample situation or as difference in means and common within-group variance estimate in the two-sample case. The same is true here: according to [1], p. 3575, the "resampled" t-test statistic consists of the four components $D_1$, $V_1$, $D_2 \mid (V_1, D_1)$ and $V_2^* \mid (V_1, D_1)$ (loosely speaking, these correspond to the differences in means and variance estimates of the two stages). Comparing the distributions of $D_1$ and $V_1$ and the conditional distributions of $D_2$ and $V_2^*$ given $D_1$ and $V_1$ (and hence $n_2$), one immediately sees that these are the same for the one- and the balanced two-sample case when we replace $n_i$ by $n_i/2$ and the means of the two stages by the corresponding differences in means between the two treatment groups. For the conditional distribution of $V_2^* \mid (V_1, D_1)$, see section 4.

3 Approaches that control the type I error

3.1 Permutation and rotation tests

If the considerations from the previous section are of concern, then a simple alternative is to do the test as a permutation test. In the one-sample case, one would generate all permutations (or a large number of random permutations) of the signs onto the absolute values of the observations. For each permutation, the t test would be calculated, and the $(1-\alpha)$-quantile of the resulting empirical distribution of t-test values gives the critical value of an exact level-$\alpha$ test of $H_0$. Alternatively, a p-value can be obtained by counting the percentage of values from the permutation distribution which are larger than or equal to the actually observed value of the test statistic. After determining the additional sample size $n_2$ from the first $n_1$ observations, we apply the permutation method to all $n_1 + n_2$ observations. The special case of $n_2 = 0$ is possible, and then the parametric (non-permutation) t-test can also be used. This strategy keeps the $\alpha$-level exactly, because the total variance $\frac{1}{n_1}\sum_{i=1}^{n_1} x_i^2$ is invariant to the permutations.

In the two-sample case, the approach would permute the treatment allocations of the observations. In order to preserve the observed total variance, the permutations have to be done separately for the $n_1$ observations of stage 1 and the $n_2$ observations of stage 2, respectively.

If sample sizes are small, permutation tests suffer from the discreteness of the resampling distribution and the associated loss of power. In this case, rotation tests [4, 5] offer an attractive alternative. These replace the random permutations of the sample units by random rotations. This renders the support of the corresponding empirical distribution continuous and thus avoids the discreteness problem of the permutation strategy. To facilitate this, rotation tests require the assumption of a spherical null distribution, which holds in this context. Stage-1 and stage-2 data have to be rotated separately even in the one-sample case in order to keep the observed stage-1 value of the total variance fixed.
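As an illustration of the one-sample permutation strategy, here is a minimal sketch (ours, not from the paper): once the blinded review has fixed the final sample size, random sign assignments to the absolute values of all observations generate the reference distribution of the t statistic. Every sign flip leaves the total variance $\sum_i x_i^2$ unchanged, so the review rule would have produced the same $n_2$ for each permuted data set, which is exactly why the test is exact:

```python
import numpy as np

def sign_flip_pvalue(x, n_perm=100_000, seed=1):
    """One-sided permutation p-value for H0: mu = 0 in the one-sample
    case. Random signs are assigned to the absolute values |x_i|; this
    leaves the total variance sum(x_i^2) invariant, so the blinded
    sample size rule is unaffected by the permutations."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)

    def tstat(data):                       # ordinary one-sample t statistic
        n = data.shape[-1]
        return np.sqrt(n) * data.mean(axis=-1) / data.std(axis=-1, ddof=1)

    signs = rng.choice([-1.0, 1.0], size=(n_perm, len(x)))
    t_perm = tstat(signs * np.abs(x))      # permutation distribution
    return np.mean(t_perm >= tstat(x))     # share of values >= observed
```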
Permutation and rotation strategies emulate the true distribution of the t test including the sample size review. Hence, they will "automatically" correct any type I error inflation as outlined in the previous section, but will otherwise have almost identical properties (e.g. with respect to power) as their "parametric" counterpart. We did some simulations of the permutation and rotation strategies under null and non-null scenarios. These, however, just backed up the statements made here and are thus not reported.

3.2 Combinations of test statistics from the two stages

Methods that use a combination of test statistics from the two stages are another alternative if one is looking for an exact test. For example, we might use Fisher's p-value combination $-2\log(p_1 p_2)$ [6], where $p_j = P(T_j > t_j)$ with $T_j$ being the test statistic from stage-$j$ data only and $t_j$ its observation from the concrete data at hand. As $-2\log(p_1 p_2) \sim \chi^2(4)$ for independent test statistics $T_1$ and $T_2$ under $H_0$, the combination p-value test rejects $H_0$ if $-2\log(p_1 p_2)$ is larger than the $(1-\alpha)$-quantile of this distribution. In this application, we use the true null distributions of the test statistics $T_j$ to determine the p-values. For example, in the case of the one-sample t-test these are the t-distributions $T_1 \sim t(n_1 - 1)$ and $T_2 \sim t(n_2 - 1)$.

The stage-2 sample size $n_2$ is uniquely determined by $\sum_{i=1}^{n_1} x_i^2$. Since $T_1$ is a test statistic for which Theorem 2.5.8 of [3] holds under $H_0$, the null distribution of $T_1$ is valid also conditionally on $\sum_{i=1}^{n_1} x_i^2$. As a consequence, $T_1 \sim t(n_1 - 1)$ and $T_2 \sim t(n_2 - 1)$ are stochastically independent under $H_0$ for given $\sum_{i=1}^{n_1} x_i^2$. Any combination of them can be used as the test statistic for $H_0$. Of course, one still has to find critical values of the null distribution for the selected combination.

The statement about the conditional null distributions of the test statistics given the total variance $\sum_{i=1}^{n_1} x_i^2$ allows us to go beyond Fisher's p-value combination and similar methods that combine p-values using fixed weights or calculate conditional error functions with an "intended" stage-2 sample size. The weights used to combine the two stages may also depend on the observed stage-1 data. For example, if the variance were known (and hence a z-test for $H_0$ could be done), then the optimal (standardized) weights for combining the z-statistics from the two stages would be $\sqrt{\frac{n_1}{n_1+n_2}}$ and $\sqrt{\frac{n_2}{n_1+n_2}}$ in the one-sample case. Hence,

$$t_{\mathrm{comb}} = \sqrt{\frac{n_1}{n_1+n_2}}\, t_1 + \sqrt{\frac{n_2}{n_1+n_2}}\, t_2$$

seems a promising candidate for a combination test statistic. The fact that $T_j$, $j = 1, 2$, retain their $t(n_j - 1)$ null distributions if we condition on $s_1^2 = \sum_{i=1}^{n_1} x_i^2$ means that critical values for this test can be obtained from the distribution of the weighted sum of two stochastically independent t-distributed random variables with $(n_1 - 1)$ and $(n_2 - 1)$ degrees of freedom, respectively. This is very easy with numerical integration or a simulation. Comparing $t_{\mathrm{comb}}$ with these critical values (which depend only on $n_1$ and $n_2$) to decide about the rejection of $H_0$ gives an exact level-$\alpha$ test.
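Obtaining these critical values by simulation is indeed straightforward. A minimal sketch (our illustration; the function name is an assumption) draws independent $t(n_1 - 1)$ and $t(n_2 - 1)$ variates, forms the weighted sum and takes its empirical $(1-\alpha)$-quantile:

```python
import numpy as np

def tcomb_critical_value(n1, n2, alpha=0.05, nsim=1_000_000, seed=1):
    """Monte Carlo (1 - alpha)-quantile of
    t_comb = sqrt(n1/(n1+n2)) * T1 + sqrt(n2/(n1+n2)) * T2,
    with T1 ~ t(n1-1) and T2 ~ t(n2-1) stochastically independent
    under H0 (conditionally on the observed total variance)."""
    rng = np.random.default_rng(seed)
    t1 = rng.standard_t(n1 - 1, size=nsim)
    t2 = rng.standard_t(n2 - 1, size=nsim)
    w1, w2 = np.sqrt(n1 / (n1 + n2)), np.sqrt(n2 / (n1 + n2))
    return np.quantile(w1 * t1 + w2 * t2, 1 - alpha)

# Critical values depend only on n1 and n2, so they can be tabulated in
# advance for every stage-2 sample size the review rule can select.
print(tcomb_critical_value(2, 3, alpha=0.05))
```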

To investigate the performance of the introduced suggestions, we did sev-
