THE BELL SYSTEM TECHNICAL JOURNAL

Sampling From Structured Populations: Some Issues and Answers

By V. N. NAIR and T. E. DALENIUS

(Manuscript received February 25, 1981)

This paper reviews some sampling issues that are common to many Bell System surveys. We discuss various aspects of two-stage sampling designs, and emphasize sampling from populations with multiple characteristics. The hierarchical structure of the population in many surveys makes the use of multistage sampling techniques attractive. In populations with multiple characteristics, often not every characteristic is common to every unit. We consider some special designs for sampling from such populations. Finally, we discuss some issues in network sampling. Two recent Bell System surveys are used to illustrate most of the ideas discussed. One of the surveys deals with the estimation of traffic characteristics for various classes of service, while the other one is a survey of baseband transmission impairments.

I. INTRODUCTION

Sample surveys have played an increasingly important role in the Bell System in recent years as a means of providing an objective basis for decision making. To an extent, this has been due to the growing awareness among users of the survey results that, in most surveys, sampling is not the only source of error and often not the primary source. Even if a presumably complete census were taken instead of a sample, serious errors might exist in the results, arising from various causes such as measurement
and response errors.

The growth in numbers in recent years has also been accompanied by a widening of the range (both in type and complexity) of the surveys. For many of these surveys, a simple and readily available sampling design can easily be adapted to the needs of the prevailing situation. More often, however, the problem at hand is sufficiently complex and nonstandard so that various parts of existing sampling theory have to be modified and pieced together to arrive at a reasonable solution.

Nevertheless, some sampling issues are common to a number of Bell System surveys. Most of these surveys involve sampling from populations that are highly structured, and any cost-efficient sampling design must take this structure into account. In this paper, we review some sampling issues that arose in two surveys currently under implementation. Both surveys possess some common features as well as features unique to themselves. Since these features are common to a large number of other surveys, an exposition of both the theoretical and practical considerations involved may prove beneficial to other survey practitioners. Let us first consider the two examples.

Example 1. Cost of service-traffic usage studies (COSTUS)

The various Bell operating telephone companies (OTCs) carry out these surveys periodically to obtain an objective basis for distributing the traffic-sensitive costs for a jurisdiction, typically a state within an OTC, among its various classes of telephone service. Measurements of three traffic characteristics (busy-hour CCS, busy-hour peg count, and day peg count) from the sampled telephone lines are used to calculate the relative magnitudes of the traffic characteristics for each class of service.
[CCS is a traditional unit for measuring the usage of channels (it stands for hundred call seconds per hour). Peg count is the number of calls actually handled.] These values are then used as inputs to the "embedded direct costs" analysis, which allocates most traffic-sensitive investments and expenses among the various classes of service.

The elementary units in this study are telephone lines corresponding to the various classes of service. These units, however, are clustered into central offices. In fact, each central office has a number of clusters associated with it, one cluster for each class of service. A reasonably cost-efficient design should take this hierarchical clustering into account, since the major portion of the costs in observing a line arises from visiting the central office and setting up the measuring equipment. Thus, a two-stage sampling design with central offices serving as primary sampling units (PSUs) and telephone lines serving as secondary sampling units (SSUs) seems attractive. This is even more so since the central offices provide service in a number of classes of service, so that from each sampled central office we can further subsample telephone lines from all the available classes of service. Hence, COSTUS are examples of the use of a two-stage sampling design for a population with multiple characteristics. The different characteristics here correspond to the different classes of service. The parameters of the sampling design in COSTUS are determined so that the busy-hour CCS parameter for each class is estimated with a prescribed accuracy.

1236 THE BELL SYSTEM TECHNICAL JOURNAL, SEPTEMBER 1981

One additional complication in these studies is the fact that not all central offices provide service in every available class. In some jurisdictions there are some classes of service (such as ein) that are provided in only a few offices. The sampling literature refers to this as the problem of "partial variate pattern" (PVP).
The presence of PVP causes difficulties in selecting an appropriate sample of central offices for the estimation of the parameters of all the classes of service.

Example 2. Survey of baseband transmission impairments

The aim of this survey, currently under development at Bell Laboratories, is to measure baseband transmission impairments for various trunk facility types. From each sampled trunk, estimates of various impairment characteristics, such as signal-to-C-notched-noise ratio (S/N) and second- and third-order harmonic distortion (R2 and R3), are to be obtained. Although the near (transmitting) and far (receiving) end drop equipment, in addition to the carrier system, determines the trunk type, it is known from past experience that the contribution from the carrier system is the dominant factor. Thus, we do not consider the influence of the end-drop equipment in this study. Six different measurement characteristics are to be measured from each sampled trunk, and the parameters of seven different trunk types are to be estimated.

The elementary unit in this survey is the trunk. While the trunks are again clustered into central offices, this clustering is not unique, since one trunk is common to a pair (transmitting and receiving) of central offices. In fact, the structure of the population here resembles a graph (network) with the central offices as nodes and trunks as edges (arcs). This survey is an example of network (graph) sampling (see Ref. 1, for example). In this survey, if we sample a particular trunk, we have to visit the pair of end offices connected to the trunk to set up the measuring equipment.
This implies that it is cheaper to sample additional trunks connected to those two end offices. Hence, taking the structure of the population into account results in considerable cost savings. One possible approach to this problem is to use multistage sampling to select pairs of offices and trunks connected to those offices. Since we are interested in different trunk types, this study also involves multiple characteristics.

SAMPLING TECHNIQUES 1237

Both the above examples involve using multistage sampling to study populations with multiple characteristics. Multistage sampling is not an uncommon phenomenon in Bell System surveys, where the natural administrative and geographic clustering of units makes it very cost efficient. In Sections II and III we review various issues that confront a survey statistician in developing a two-stage sampling design for studying multiple characteristics. Some of the issues discussed in Section II are also common to other sampling designs. Section III deals primarily with determining the parameters of the sample design. In Section IV, we consider some sampling designs for populations with PVP. Section V is a brief review of issues in network sampling. We conclude the paper with a summary in Section VI. Throughout the paper we try to balance theoretical considerations with practical guidelines gained from our own experience. One of the two examples is used, wherever possible, to illustrate the ideas discussed.

II. TWO-STAGE SAMPLING: SOME PRELIMINARIES

This section deals with some preliminary considerations in developing a two-stage sampling design. Some of the discussion deals with issues that are common to sample surveys in general. We begin with a discussion of the rationale for using two-stage or multistage sampling designs. After an introduction to some notation, we examine how prescribed accuracy requirements are implemented in a sample survey and discuss the use of prior information. Section 2.6 examines the use of varying probability sampling schemes. Section 2.7 discusses ratio
estimators with specific emphasis on two-stage sampling situations.

2.1 Why two-stage sampling?

The individuals whose characteristics are to be measured in a study are called elementary units. Observational access to the elementary units, in many cases, is provided by multistage sampling. Let the elementary units be grouped into a number of suitable clusters. In two-stage sampling, the clusters are used as PSUs and a sample of PSUs is selected in the first stage. The PSUs selected are divided into a number of SSUs, and a sample of SSUs is selected from each PSU selected in the first stage. (The elementary units themselves can serve as SSUs.) All elementary units in the selected SSUs are observed with respect to the variables of interest.

There are various reasons why multistage sampling is attractive. For instance, in many studies a complete list ("frame") of elementary units is not available and it may be prohibitively expensive to create such a list. If it is relatively cheap to construct a list of clusters, the clusters can be used as PSUs in a two-stage sampling scheme. Then, only a list of the elementary units in the sampled clusters needs to be constructed. This results in considerable cost savings.

Often, the population of elementary units in a survey is dispersed over a large geographical area. If we have to visit each sampled unit to collect measurements, sampling from the list of elementary units can lead to high costs per elementary unit. A more cost-efficient scheme may be obtained by grouping the elementary units into geographically compact clusters and using multistage sampling with the clusters as PSUs.

Typically, the cost reduction in multistage sampling is accompanied by an increase in the variance of the estimate over the variance of an estimate from a simple random sampling (SRS) of the same number of elementary units. However, the "accuracy" per unit cost may be higher.
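The two-stage selection procedure described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, none of which come from the surveys themselves: both stages use simple random sampling without replacement, the frame is a hypothetical dictionary mapping each PSU to the list of its elementary units, and the names `two_stage_sample`, `frame`, and `n_per_psu` are made up for the example.

```python
import random

def two_stage_sample(frame, m, n_per_psu, seed=None):
    """Sketch of two-stage sampling (assumption: SRS without replacement at both stages).

    frame      -- hypothetical dict: PSU id -> list of elementary unit ids
    m          -- number of PSUs to select in stage one
    n_per_psu  -- number of units to select from each sampled PSU in stage two
    """
    rng = random.Random(seed)
    # Stage one: select m PSUs from the list of clusters.
    psu_ids = rng.sample(sorted(frame), m)
    sample = {}
    for psu in psu_ids:
        # Stage two: only the sampled PSUs need a list of their units.
        units = frame[psu]
        k = min(n_per_psu, len(units))
        sample[psu] = rng.sample(units, k)
    return sample

# Hypothetical frame: 5 central offices with 3 to 7 lines each.
frame = {f"office_{i}": [f"line_{i}_{j}" for j in range(3 + i)] for i in range(5)}
sample = two_stage_sample(frame, m=2, n_per_psu=3, seed=1)
```

Note that the unit lists are touched only for the PSUs drawn in stage one, which mirrors the frame-construction saving discussed above.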
If we have some control over the formation of the clusters, we can actually reduce the variance (relative to SRS) by grouping the units so that there is more variation within clusters than between clusters. In most Bell System surveys, however, the clusters are predetermined.

2.2 Notation

We use the following notation throughout the remainder of this paper:

$M$ = number of PSUs in the population,
$m$ = number of PSUs sampled,
$N_i$ = number of SSUs in PSU $i$, $i = 1, \ldots, M$,
$n_i$ = number of SSUs selected from the $i$th sampled PSU, $i = 1, \ldots, m$,
$\Pi_i$ = probability of selecting the $i$th PSU in a sample of size $m$, $\sum_{i=1}^{M} \Pi_i = m$,
$Y$ = characteristic to be measured,
$Y_{ij}$ = value corresponding to unit $j$ in PSU $i$, $j = 1, \ldots, N_i$,

$$Y_i = \sum_{j=1}^{N_i} Y_{ij}, \quad \bar{Y}_i = Y_i / N_i, \quad \bar{y}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} y_{ij}, \quad N = \sum_{i=1}^{M} N_i, \quad n = \sum_{i=1}^{m} n_i.$$

We consider only equal probability sampling schemes in stage two in this paper. The parameter of interest is the overall total $Y = \sum_{i=1}^{M} N_i \bar{Y}_i$; $\hat{Y}$ denotes an arbitrary estimator of $Y$. The same considerations can be used for estimating the average $\bar{Y}$ if we rewrite $\bar{Y} = \sum_{i=1}^{M} W_i \bar{Y}_i$, $W_i = N_i / N$.

2.3 Accuracy requirements

The sampling design in a carefully planned survey is determined so that either (i) the total cost of the survey is minimized subject to prescribed requirements on the accuracy of the estimators or (ii) the accuracy of the estimators is maximized subject to a constraint on the cost. Since both approaches involve essentially the same considerations (see Section III), let us consider in some detail just the problem of minimizing cost subject to accuracy requirements.

A sampling design, where the units are randomly selected according to given probabilities of selection, permits us to make quantitative statements about the error involved in the estimators. This in turn allows us to determine the sample sizes so that the prescribed accuracy requirements are met.
These requirements are typically stated in terms of the error $e = \hat{Y} - Y$ or some function of the error, $f(e)$, such as relative error, and can be expressed as

$$\Pr\{|f(e)| \le \delta\} \ge 1 - \alpha \qquad (1)$$

for some constants $\alpha$ and $\delta$. In COSTUS, for instance, the sample sizes are determined so that the absolute value of the relative error is less than or equal to 0.1 with probability at least 0.9, i.e., $\alpha = \delta = 0.1$. To implement the accuracy condition (1), large-sample theory is usually used to claim that $\hat{Y}$ is approximately normally distributed. It is beyond the scope of this paper to discuss the adequacy of this normal approximation. The interested reader is referred to Refs. 2 to 6. Equation (1) is equivalent to an expression of an upper bound on the variance [or mean-square error (MSE) if $\hat{Y}$ is biased] of $\hat{Y}$.

When estimating several parameters, as in populations with multiple characteristics, we may require that several accuracy criteria be satisfied simultaneously. By using normal approximations, we can state this problem, in essence, as minimizing the total cost of the survey subject to a constraint on the variances (or MSEs) of the form

$$A V \le b,$$

where $V$ is the vector of variances (MSEs) of the $p$ estimators, $A$ is a $k \times p$ matrix specifying the $k$ linear combinations of the variances that have to meet the accuracy conditions, and $b$ represents the bounds on the accuracies. For example, if $k = p$ and $A$ is the identity matrix, all $p$ parameters need to be estimated with prescribed accuracy. If $k = 1$, then only one particular linear combination of the variances is needed to satisfy an accuracy criterion.

2.4 Variance components

Since the accuracy specifications can be stated in terms of the variances of the individual estimators, we need to examine the components of the variance of the estimator in a two-stage sampling scheme.
This will aid us later (Section III) in determining the relative contribution to the variance from stages one and two and the trade-off in increasing the sample size in stage one versus that in stage two. If we restrict our attention to linear estimators of the form $\hat{Y} = \sum_{i=1}^{m} a_i \bar{y}_i$ for estimating $Y$, we see that $a_i$ must equal $N_i / \Pi_i$ for the estimator to be unbiased. With this choice of $a_i$, $\hat{Y}$ is the well-known Horvitz-Thompson (HT) estimator. A discussion of some of the properties of this estimator can be found in Ref. 7. Let us restrict our attention to the HT estimator and examine its variance.

If we select $m$ PSUs with replacement (WR) in stage one with inclusion probabilities $\Pi_i$, we have a multinomial sample of size $m$ with selection probabilities $Z_i = \Pi_i / m$. If the second-stage units are chosen without replacement, the variance of $\hat{Y}$ can be written as the sum of two components, the within-PSU variation $W$ and the between-PSU variation $B$:

$$\mathrm{Var}(\hat{Y}) = W + B, \qquad W = \sum_{i=1}^{M} \frac{N_i^2 (1 - f_i) S_i^2}{n_i \Pi_i}, \qquad (2)$$

$$B = \frac{1}{m} \sum_{i=1}^{M} Z_i \left( \frac{Y_i}{Z_i} - Y \right)^2. \qquad (3)$$

Here, $S_i^2$ is the within-cluster variance and $1 - f_i = (N_i - n_i)/N_i$ is the finite population correction.

If the sampling is done without replacement (WOR) in stage one with varying selection probabilities, the within-PSU variation remains the same. The between-PSU variation, however, depends on the joint inclusion probabilities, which are extremely hard to calculate. Hartley and Rao provide some approximations. One possible approximation is, of course, the use of eq. (3), valid for the WR scheme, in the WOR situation. If the sampling fraction $m/M$ is large (say $> 0.25$), this approximation may be unreasonable. When the sampling is done WOR with equal selection probabilities in stage one, i.e., SRSWOR, the $B$ component is given by

$$B = \frac{M^2 (1 - f)}{m (M - 1)} \sum_{i=1}^{M} \left( Y_i - \frac{Y}{M} \right)^2, \qquad f = m/M.$$

For a discussion of variance estimation in two-stage sampling, see Refs. 7, 8, or 9, for example. Some approximate but "quick and easy" methods of variance estimation are discussed in Refs. 10 and 12.
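As a rough numerical illustration of the HT estimator and of the two variance components under with-replacement selection of PSUs, the following Python sketch may help. It assumes the forms $\hat{Y} = \sum (N_i/\Pi_i)\bar{y}_i$, $W = \sum N_i^2 (1 - f_i) S_i^2 / (n_i \Pi_i)$ and $B = (1/m) \sum Z_i (Y_i/Z_i - Y)^2$ with $\Pi_i = m Z_i$, and it takes $S_i^2$ with divisor $N_i - 1$, which is an assumption since the text only calls it the within-cluster variance. The function names and the toy population are hypothetical.

```python
from statistics import variance

def ht_estimate(sample, N, Pi):
    """HT-type estimate of the total: sum of (N_i / Pi_i) * sample mean.

    sample -- list of (psu, observed_values) pairs, one per stage-one draw
    N, Pi  -- dicts: PSU -> N_i and PSU -> Pi_i
    """
    return sum(N[i] / Pi[i] * (sum(ys) / len(ys)) for i, ys in sample)

def variance_components_wr(pop, Z, n, m):
    """Within (W) and between (B) components for WR selection of m PSUs.

    pop -- dict: PSU -> list of all Y_ij (a fully known toy population)
    Z   -- dict: PSU -> per-draw selection probability, summing to 1
    n   -- dict: PSU -> second-stage sample size n_i
    """
    Y = sum(sum(ys) for ys in pop.values())
    W = 0.0
    B = 0.0
    for i, ys in pop.items():
        N_i = len(ys)
        # Assumption: S_i^2 uses divisor N_i - 1; taken as 0 for a single-unit PSU.
        S2 = variance(ys) if N_i > 1 else 0.0
        fpc = (N_i - n[i]) / N_i
        W += N_i ** 2 * fpc * S2 / (n[i] * m * Z[i])
        B += Z[i] * (sum(ys) / Z[i] - Y) ** 2 / m
    return W, B

# Hypothetical toy population with two PSUs.
pop = {"a": [1, 2, 3], "b": [4, 5, 6, 7]}
W, B = variance_components_wr(pop, Z={"a": 0.5, "b": 0.5}, n={"a": 1, "b": 4}, m=1)
```

In this sketch the within component of a PSU vanishes when it is completely enumerated ($n_i = N_i$), and the between component vanishes when the $Z_i$ are exactly proportional to the PSU totals $Y_i$.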
If the variance estimator is intended only to provide a rough guide as to the accuracy of the estimator, an approximate, but quick and easy, method is adequate. If the accuracy of the estimator is of great importance and must be demonstrated through the variance estimator, we have to use a "good" variance estimator, such as one with small MSE.

2.5 Prior information

We need prior information on the variance of the various estimators and on the sampling costs to determine the sample sizes in a survey. It is rare that we have very good prior information, particularly concerning the variances of the estimators. Preliminary estimates can be obtained from prior surveys or pilot studies. One practice commonly found in the Bell System is the use of data for the entire Bell System to develop preliminary estimates for specific jurisdictions.

To implement the accuracy conditions exactly in a two-stage sampling scheme, we need to know each one of the components of W and B in eq. (2) exactly. Since this is rather unlikely, we usually just use two numbers, one for W and one for B, instead of the individual values for each PSU. These numbers can be interpreted as either the average or the maximum over all PSUs.

When the quality of the prior information is poor (as a consequence of one or more of the above reasons), little can be gained in developing a complex design that may (or may not) be "optimum" for the problem at hand. A simpler design which is less sensitive to the preliminary estimates of the design parameters is more desirable. Also, when the preliminary variance estimators are unreliable, an estimate of the accuracy achieved should always be calculated after the fact from the sample to compare with the prescribed accuracy.

2.6 Varying probability sampling

The sample selection schemes in stages one and two can be based on equal or varying probability sampling techniques. For simplicity, we consider varying probability sampling only in stage one.
The considerations here also carry over to other stages. Let us examine how the selection probabilities $\{\Pi_i\}$ should be determined so that the variance of the HT estimator $\sum_{i=1}^{m} (N_i/\Pi_i)\,\bar{y}_i$ for estimating $Y = \sum_{i=1}^{M} Y_i$ is minimized. In the simpler situation of one-stage cluster sampling, i.e., $n_i = N_i$, if we take $\Pi_i$ proportional to $Y_i$, the variance of the HT estimator is zero. Hence, if there exists an auxiliary variable $X$ which is approximately proportional to $Y$, we can use this auxiliary information to select the $\Pi_i$'s. In some two-stage sampling situations, we can use the measures of size of the PSUs, $\{N_i\}$, to obtain "optimal" selection probabilities. To see this, note that the parameter $Y$ can be written as $\sum_{i=1}^{M} N_i \bar{Y}_i$, where $\bar{Y}_i$ is the PSU mean, and often the $\bar{Y}_i$'s are roughly of the same order of magnitude. In this case, the $Y_i = N_i \bar{Y}_i$ will be roughly proportional to the $N_i$, so that we can take the $\Pi_i$ proportional to the $N_i$. This is known as probability proportional to size (PPS) sampling. (In COSTUS, for example, a priori, we expect the average busy-hour CCS per main station to be about the same across central offices.) When sampling from populations with multiple characteristics, there are multiple measures of size, one associated with each characteristic. The optimal selection probabilities are some function of the size measures, depending on the particular accuracy criteria of interest. In addition, there are also cases in which the exact size measures are unknown and we have to use estimated measures.

To develop a cost-efficient design, we need to minimize variance per unit cost rather than the actual variance. The optimal selection probabilities must therefore take the cost structure into account. In COSTUS, where the PSUs are central offices, the sampling costs depend on the type of switching equipment in the office. For example, it is considerably more expensive to visit and set up the measuring equipment in an electronic switching system (ESS) office than in a non-ESS office.
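To make the PPS rule concrete, here is a small Python sketch that turns size measures $N_i$ into inclusion probabilities $\Pi_i = m N_i / N$, so that they sum to $m$. The function name `pps_inclusion_probs`, the office labels, and the line counts are hypothetical. The check that no $\Pi_i$ exceeds 1 is an added safeguard for without-replacement selection and is not something the text prescribes.

```python
def pps_inclusion_probs(sizes, m):
    """Inclusion probabilities proportional to size: Pi_i = m * N_i / N.

    sizes -- hypothetical dict: PSU -> size measure N_i
    m     -- number of PSUs to be sampled
    """
    total = sum(sizes.values())
    probs = {psu: m * size / total for psu, size in sizes.items()}
    # Added safeguard (an assumption, not from the text): a probability above 1
    # cannot be realized when PSUs are selected without replacement.
    too_large = [psu for psu, p in probs.items() if p > 1]
    if too_large:
        raise ValueError(f"Pi_i > 1 for PSUs: {too_large}")
    return probs

# Hypothetical central offices with their numbers of lines.
sizes = {"office_A": 100, "office_B": 200, "office_C": 300, "office_D": 400}
probs = pps_inclusion_probs(sizes, m=2)
```

When a single very large PSU triggers the check, one common remedy is to include that PSU with certainty and apply the rule to the rest; that remedy is mentioned here only as a general practice, not as part of the surveys described.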
If we make formal optimality calculations, we find that, with other factors held constant, the optimal selection probability for each PSU is inversely proportional to the square root of the cost of sampling that PSU.

One or more of the above considerations may indicate that even if the PSUs vary greatly in size, the optimal selection probabilities are not too unequal. In such a case, we may be better off using SRS, i.e., equal selection probabilities, since (i) the selection scheme is simpler and (ii) exact variance formulas are available. In addition, we are [...] from using varying selection probability schemes. If the gain is not substantial in these situations, the use of SRS seems preferable. Also, even if we use SRS when the PSUs vary greatly in size, we can use ratio estimators, which take into account the variation, to estimate the parameters. This is discussed in Section 2.7.

Finally, we briefly discuss a simple scheme for selecting PSUs with unequal probabilities. Many schemes for unequal probability selection exist and, in fact, several procedures may lead to the same inclusion probabilities $\{\Pi_i\}$. The scheme we consider here is for sampling WOR and is known as PPS systematic sampling. Let $\{T_i\}$ denote the cumulative totals of the desired selection probabilities $\{\Pi_i\}$:

$$T_i = \sum_{j=1}^{i} \Pi_j, \quad i = 1, \ldots, M, \qquad T_0 = 0.$$

To select $m$ PSUs, first select a random number $u \in [0, 1)$ and then select the $m$ PSUs for which

$$T_{i-1} \le u + j < T_i, \qquad j = 0, 1, \ldots, m - 1.$$

Hartley and Rao consider this procedure with random arrangement of the PSUs and develop approximate variance expressions for the estimator.

2.7 Use of ratio estimation

So far we have considered only unbiased estimators of the total $Y$. In some situations we can exploit information available from an auxiliary variable and use a biased estimator, such as the ratio estimator, with a smaller MSE than the unbiased estimator. To see this, let $X_i$ be the known auxiliary variable, let $X$ denote the total corresponding to this variable, and let $\hat{X}$ denote the estimator of $X$ based on the sample. Since we know the error $\hat{X} - X$, we know how this sample performs in estimating $X$.
Hence, if $\{X_i\}$ and $\{Y_i\}$ are highly correlated, it is likely that we can improve our original estimator $\hat{Y}$ by exploiting our knowledge of how well the sample estimates $X$.

The ratio estimator is a special case of the general difference estimator $\hat{Y}_a = \hat{Y} + a(\hat{X} - X)$ and is obtained by taking $a = -\hat{Y}/\hat{X}$. This results in the estimator $\hat{Y}_R = (\hat{Y}/\hat{X})X$. There are other ways of exploiting the information about $\hat{X} - X$. For instance, $a$ can be a prespecified constant (if $a = 0$, we get the original estimator $\hat{Y}$ based on the $Y$ measurements alone). We can also base $a$ on the regression coefficient obtained by regressing the $Y$'s on the $X$'s. For the ratio estimator $\hat{Y}_R$, the MSE of $\hat{Y}_R$ can be approximated up to a linear term by