Variance reduction using a non-informative sampling design
Abstract
Official Statistics commonly conducts sample surveys to produce estimates of aggregate statistics with a desired level of precision. For this purpose, design-based methods are used, which are suitable for the estimation of finite population quantities such as totals or means. In most cases, however, model-based analyses are applied to the survey data as well. Examples include small area estimation techniques, which allow for reliable estimates of finite population quantities in the presence of small sample sizes, and socio-econometric models used in academia to test scientific hypotheses. This may cause problems, as model-based methods frequently assume a non-informative sampling design, and a violation of this assumption can lead to erroneous statistical inferences. We argue in this work that if the application of model-based methods can be anticipated before the sample is drawn, then this knowledge should be incorporated in the survey design. We propose a method called antithetic clustering that enables precise estimates for aggregate figures using design-based estimation methods and automatically leads to non-informative sampling designs. In a simulation study, our method is compared against other sampling plans designed to achieve precise design-based estimates for aggregates.
1. Introduction
Traditionally, Official Statistics has adopted a design-based approach to produce estimates using sample data. Thus, the sample is collected by means of probability sampling, i.e. each unit in the population has a known and positive probability of being included in the sample [1, p. 32]. Moreover, an estimator is chosen which possesses certain desirable properties such as design-consistency. Hence, very precise estimates can be obtained provided the sample size is large. While this prerequisite will be met for national statistics or for large subgroups that have been incorporated in the sampling design as strata, it might not be true for some small subgroups. A potential remedy in this case is the use of model-based small area estimation methods [2, 3, 4]. A caveat regarding the application of many model-based small area estimation techniques is that they are not design-consistent under general sampling designs. Hence, their design-bias does not vanish as the sample size increases and these methods are consequently not robust against a potential model misspecification. Furthermore, the sampling design can even induce biases of model-based estimates when the model is correctly specified. This phenomenon is known as informative sampling and arises whenever a model that can be validated for the sample differs from the model which holds for the population [5, p. 455]. As a consequence, the sample model cannot be used for inference on the population model without further adjustments. Ignoring this fact may lead to erroneous statistical inferences.
In most applications, estimates for subgroups with small sample sizes as well as estimates on aggregate levels with large sample sizes are needed at the same time. This poses a challenge to the survey planner, as the sampling design has to reflect different and potentially conflicting requirements simultaneously. On the one hand, the sampling design should be built on information related to the variable of interest to enable efficient design-based estimates for aggregate statistics. This could be achieved via stratification [6, p. 450] or sampling with probabilities proportional to size, where a proportional relationship between the size variable and the variable of interest is desirable [1, p. 88]. On the other hand, these optimised designs may lead to informative sampling and thereby invalidate conclusions drawn from model-based estimation procedures. For those estimators, non-informative designs such as simple random sampling (SRS), which do not interfere with the model, would be beneficial. However, plain SRS schemes do not use auxiliary information at the design stage and are thus not well suited to design-based estimation of aggregate figures.
The preceding discussion indicates that the trade-off between design-optimisation and modelling should already be addressed in the sampling design. Even though both design- and model-based estimates are regularly published by statistical offices [7, 8], designs reflecting the needs of both philosophies have rarely been discussed. A notable exception is [9], who propose a box-constraint optimal allocation in stratified random sampling (StrRS), where the variance of a national statistic is minimised under an implicit restriction on the range of the sampling weights.
In Section 2, we propose a sampling method that allows for precise design-based estimates and is non-informative by construction. Our approach is based on the technique of antithetic variates, which is a well-known method to reduce the variance in Monte-Carlo simulations [10]. We adapt this approach to the context of survey sampling and derive conditions under which it will yield estimates with a higher precision than SRS.
Section 3 presents the results of a design-based simulation study, where we compare our method against various alternative sampling designs for both design- and model-based estimators.
Finally, concluding remarks are given in Section 4.
2. Antithetic clustering
2.1 Notation
Following [11], we consider a fixed and finite population
2.2 Our approach
Our aim is to construct a sampling design that enables precise design-based estimates but does not distort the properties of statistical models. To do so, we combine single stage cluster sampling with the idea of antithetic variates, where pairs of negatively correlated random variables are drawn to reduce the variance in Monte-Carlo simulations [10, Chapter 5]. We call our proposed method antithetic clustering (ATC). It is summarised in Table 1.
Table 1
1. | Order the elements according to the values of |
---|---|
2. | Set |
3. | Increase |
4. | Repeat step 3 until all units have been assigned to a cluster. The procedure yields |
5. | Draw |
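The steps in Table 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the author's implementation: it assumes an even population size, clusters of size two, and that the unit with the i-th smallest value of the size variable is paired with the unit with the i-th largest value, in line with the pairing of above-median and below-median units described in Section 2.3. The function names `antithetic_clusters` and `draw_atc_sample` are hypothetical.

```python
import numpy as np


def antithetic_clusters(x):
    """Pair the i-th smallest with the i-th largest value of x (assumed rule).

    Returns an array of shape (N/2, 2) holding the unit indices per cluster.
    """
    x = np.asarray(x, dtype=float)
    if x.size % 2 != 0:
        raise ValueError("this sketch assumes an even population size")
    order = np.argsort(x, kind="stable")  # unit indices sorted by size variable
    half = x.size // 2
    low = order[:half]                    # below-median units, ascending
    high = order[::-1][:half]             # above-median units, descending
    return np.column_stack((low, high))


def draw_atc_sample(x, n_clusters, rng):
    """Draw a simple random sample of clusters and return the unit indices."""
    clusters = antithetic_clusters(x)
    chosen = rng.choice(clusters.shape[0], size=n_clusters, replace=False)
    return np.sort(clusters[chosen].ravel())


x = np.array([5.0, 1.0, 4.0, 2.0, 3.0, 6.0])   # toy size variable
clusters = antithetic_clusters(x)
pair_sums = x[clusters].sum(axis=1)             # size-variable total per cluster
sample = draw_atc_sample(x, n_clusters=2, rng=np.random.default_rng(1))
```

In this sketch every unit belongs to exactly one cluster and clusters are drawn with equal probability, so each unit is included with probability `n_clusters / (N / 2)`.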
Since the sample is drawn using a simple random sample of clusters, all clusters and hence all units
The question that remains is whether our sampling mechanism is suitable for design-based estimation. Therefore, we study the properties of the sample mean under single stage cluster sampling.
2.3 The efficiency of single stage cluster sampling
A sampling design yields estimates with a higher precision than simple random sampling provided its design effect is less than one. This design effect (DEFF) under single stage cluster sampling is closely related to the intraclass correlation coefficient (ICC) in the case of evenly sized clusters where
(1)
From Eq. (1), it follows that
(2)
where
(3)
denote the sum of squares between clusters and the sum of squares within clusters, respectively. Using Eqs (2) and (3), we get the following condition for a variance reduction compared to SRS:
(4)
Eq. (4) implies that clusters should be formed such that most of the variation of the dependent variable is due to variation within the clusters, not between clusters. What does this imply for our ATC approach? Intuitively, the clusters will have a large ratio of the within to the between variation for the size variable. It can be shown that our approach is optimal among all possible combinations of PSUs that are exhaustive and mutually exclusive and in which one unit with an above-median value of the size variable is clustered with a unit with a below-median value. This follows from applying the rearrangement inequality, which is given in [14, p. 261]. Having established a certain optimality of ATC for the size variable, we need to examine the implications for our variable of interest. To do so, we consider models specifying the data generating process.
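The role of the between-cluster variation can be illustrated numerically. The sketch below does not reproduce Eqs (1) to (4); it assumes equally sized clusters that partition the population and uses the standard textbook variances of the sample mean under simple random sampling without replacement of clusters and of elements. The function name `cluster_design_effect` and the toy data are hypothetical.

```python
import numpy as np


def cluster_design_effect(y, clusters, n_clusters):
    """Exact DEFF of the sample mean: SRS of clusters vs. SRS of elements.

    `clusters` has shape (K, M): K equally sized clusters of M unit indices
    that together cover every unit of the population exactly once.
    """
    y = np.asarray(y, dtype=float)
    K, M = clusters.shape
    N = K * M
    cluster_means = y[clusters].mean(axis=1)
    # Variance of the mean of n_clusters cluster means (without replacement).
    var_cluster = (1 - n_clusters / K) * cluster_means.var(ddof=1) / n_clusters
    # Variance of the mean under SRS of the same number of elements.
    n = n_clusters * M
    var_srs = (1 - n / N) * y.var(ddof=1) / n
    return var_cluster / var_srs


y = np.arange(1.0, 7.0)                           # values 1, ..., 6
antithetic = np.array([[0, 5], [1, 4], [2, 3]])   # small paired with large
adjacent = np.array([[0, 1], [2, 3], [4, 5]])     # similar units together
deff_antithetic = cluster_design_effect(y, antithetic, n_clusters=1)
deff_adjacent = cluster_design_effect(y, adjacent, n_clusters=1)
```

In this toy population the antithetic pairing gives identical cluster means, so all variation lies within clusters and the design effect is zero, whereas grouping similar units pushes the design effect above one.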
2.4 ATC under a single level model
Suppose that the relationship between the dependent variable and the size variable used for clustering is given by the simple linear regression model
(5)
where
(6)
where
(7)
Inserting expressions Eq. (2.4) in Eq. (6) yields the following equations:
(8)
Equation (2.4) and utilising
as well as
(9)
If
(10)
Hence, ATC is expected to perform better than SRS under a linear model provided the correlation between the variable of interest and the size variable is non-zero and the ratio of the within to the between variation in the size variable is greater than
(11)
In this case, constructing clusters based on
2.5 ATC under a model with domain effects
While the developments from the previous sections are based on a simple linear regression model, the condition applies as well to a model with domain-specific effects
(12)
provided that the sampling design is a two stage design with the domains as strata on the first stage (planned domains) and within domains the ATC procedure is applied. The reason why this holds is that within a domain
Now suppose that the model governing the population is indeed given by Eq. (12), but ATC is applied on the population level directly. This changes the expectations needed to compute the between and within sums of squares, as a cluster can now be composed of units from different domains. Hence, the expected values are given by
(13)
where
(14)
Note that expressions Eq. (2.5) are approximations, since cross-product terms between the domain-specific effects and the clustering variable as well as those between the domain-specific effects and the individual error terms are ignored. These approximations can be motivated as in many applications the cross-product terms are negligible compared to the terms present in Eq. (2.5). Thus, ATC will be more precise than SRS if
(15)
where we use
and
Hence, the domain-specific effects
3. Simulation study
3.1 Simulation set-up
In this section, we present results from a simulation study that compares the proposed ATC approach with other sampling designs which are known to be suitable for design-based estimation. In addition to studying the impact of the sampling designs on aggregate design-based estimates, we also analyse the influence of the designs on design- and model-based estimates for small domains. We consider a fixed and finite population comprising
(16)
Following [11, Section 5.2], the values of the explanatory variables were generated as
(17)
where
(18)
where
Table 2
Estimator | Expected domain sample size | ATC | Cube- | Cube-SRS | Pivotal | Rejective | SRS | StrRS |
---|---|---|---|---|---|---|---|---|
HT | | 0.003 | 0.005 | 0.004 | 0.005 | 0.002 | 0.002 | 0.003 |
| 10–30 | 0.001 | 0.003 | 0.002 | 0.003 | 0.002 | 0.001 | 0.002 |
| | 0.001 | 0.002 | 0.002 | 0.002 | 0.002 | 0.001 | 0.001 |
GREG | | 0.019 | 0.019 | 0.018 | 0.019 | 0.018 | 0.018 | 0.019 |
| 10–30 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 |
| | 0.001 | 0.001 | 0.001 | 0.002 | 0.001 | 0.001 | 0.001 |
BHF | | 0.061 | 0.132 | 0.061 | 0.132 | 0.062 | 0.061 | 0.061 |
| 10–30 | 0.034 | 0.090 | 0.034 | 0.090 | 0.034 | 0.034 | 0.034 |
| | 0.013 | 0.085 | 0.012 | 0.086 | 0.013 | 0.013 | 0.012 |
As estimators for the national mean we focus on the HT and GREG estimators, as the sample design is typically constructed to enable design-based estimates on the national level. The HT estimator for the national mean is given by
while the GREG estimator follows as
where
To introduce variation in the domain sizes, we proceed in a similar fashion to [11, Section 5.2] and allocate the units to domains with probabilities proportional to
We apply a variety of sampling designs in order to compare our proposed method with other sampling designs that are frequently used to obtain precise design-based estimates. The first two designs that are used in our study are SRS and the ATC approach described in Section 2.2, where the latter uses
It can be seen that
Table 3
Estimator | Expected domain sample size | ATC | Cube- | Cube-SRS | Pivotal | Rejective | SRS | StrRS |
---|---|---|---|---|---|---|---|---|
HT | | 0.470 | 0.509 | 0.468 | 0.507 | 0.472 | 0.470 | 0.470 |
| 10–30 | 0.261 | 0.281 | 0.261 | 0.282 | 0.260 | 0.260 | 0.260 |
| | 0.163 | 0.177 | 0.163 | 0.177 | 0.163 | 0.163 | 0.163 |
GREG | | 0.214 | 0.265 | 0.212 | 0.267 | 0.214 | 0.214 | 0.213 |
| 10–30 | 0.098 | 0.122 | 0.098 | 0.123 | 0.097 | 0.098 | 0.098 |
| | 0.060 | 0.075 | 0.060 | 0.076 | 0.060 | 0.061 | 0.060 |
BHF | | 0.118 | 0.166 | 0.118 | 0.167 | 0.118 | 0.118 | 0.118 |
| 10–30 | 0.081 | 0.116 | 0.081 | 0.116 | 0.081 | 0.081 | 0.081 |
| | 0.054 | 0.099 | 0.053 | 0.100 | 0.054 | 0.054 | 0.054 |
3.2 Results
The simulation results for domain estimates in terms of the average absolute relative bias (AARB) are summarised in Table 2. We average the results according to the expected sample size in the domains,
In order to assess the precision of the domain estimates, we consider the average relative root mean squared error (ARRMSE) over domains reported in Table 3. The results show an interesting pattern for any estimation method and domain size. On the one hand, there is the group of equal probability sampling designs that yield very similar results for a particular choice of an estimator and domain size. On the other hand, the Cube-
The results for the national estimates are shown in Table 4, where RBias refers to the relative bias of the national estimates, while RRMSE indicates the relative root mean squared error and ACR denotes the average confidence interval coverage rate. All numerical entries in Table 4 are rounded to three decimal places. Regarding the biases, we see that all combinations of an estimator and a design yield unbiased estimates. A closer look at the precision of the national estimates reveals that the equal probability sampling designs which use auxiliary information at the design stage yield the best results for both estimators. The RRMSE under these designs is about 10 per cent smaller than the RRMSE under SRS for a given estimator. Hence, incorporating auxiliary information at the design stage helps to achieve a variance reduction as compared to SRS. Furthermore, we see that designs using inclusion probabilities proportional to
Table 4
Estimator | Design | RBias | RRMSE | ACR |
---|---|---|---|---|
HT | SRS | 0 | 0.019 | 0.951 |
| Rejective | 0 | 0.017 | – |
| Pivotal | | 0.028 | – |
| StrRS | 0 | 0.017 | 0.950 |
| Cube- | 0 | 0.021 | 0.942 |
| Cube-SRS | 0 | 0.017 | 0.948 |
| ATC | 0 | 0.017 | 0.947 |
GREG | SRS | 0 | 0.017 | 0.949 |
| Rejective | 0 | 0.015 | – |
| Pivotal | 0.001 | 0.021 | – |
| StrRS | 0 | 0.015 | 0.951 |
| Cube- | 0 | 0.018 | – |
| Cube-SRS | 0 | 0.015 | – |
| ATC | 0 | 0.015 | 0.945 |
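The performance measures in Table 4 can be computed from simulation replicates along the following lines. This is a sketch that assumes the usual definitions (bias and root mean squared error divided by the true value, and the share of confidence intervals containing the true value); the exact formulas used in the study are not reproduced here. The function names and the toy numbers are hypothetical and are not results from the study.

```python
import numpy as np


def relative_bias(estimates, truth):
    """Mean estimation error relative to the true value."""
    return (np.mean(estimates) - truth) / truth


def relative_rmse(estimates, truth):
    """Root mean squared error relative to the true value."""
    errors = np.asarray(estimates, dtype=float) - truth
    return np.sqrt(np.mean(errors ** 2)) / truth


def coverage_rate(lower, upper, truth):
    """Share of replicates whose confidence interval contains the true value."""
    lower, upper = np.asarray(lower), np.asarray(upper)
    return np.mean((lower <= truth) & (truth <= upper))


# Toy replicates (made-up numbers for illustration only).
truth = 10.0
estimates = np.array([9.0, 11.0, 10.0, 10.0])
half_width = 0.5
rbias = relative_bias(estimates, truth)
rrmse = relative_rmse(estimates, truth)
acr = coverage_rate(estimates - half_width, estimates + half_width, truth)
```

Averaging the absolute relative bias or the relative root mean squared error over domains would give domain-level summaries in the spirit of Tables 2 and 3.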
4. Concluding remarks
We have proposed a novel mechanism for allocating ultimate sampling units to clusters which, in connection with single stage cluster sampling, allows variance reductions relative to SRS to be realised for design-based estimation methods. Moreover, this allocation mechanism yields equal inclusion probabilities and therefore avoids the issue of informative sampling. Thus, our approach does not distort the properties of model-based estimation procedures. Our method is therefore well suited to modern surveys, where design-based estimates are produced at aggregate levels and, at the same time, model-based estimates are published for domains with small sample sizes. Further advantages of our proposal are that it is both very simple to implement and, perhaps even more importantly, very easy to communicate to the public.
In a simulation study under a misspecified model, we compared our proposed method against a number of alternative sampling designs aiming at variance reduction for design-based estimation methods. The study showed very similar results for the ATC method, the cube method with equal inclusion probabilities, StrRS where the strata are defined by the deciles of the auxiliary variable, and a rejective sampling procedure. All of these methods make use of the auxiliary information at the design stage and use equal (initial) inclusion probabilities. Sampling designs based on sampling with probabilities proportional to size were shown to be less efficient for the estimation of national figures and led to biased model-based small domain estimates due to informative sampling.
In comparison to the rejective sampling procedure, our approach allows fixing the inclusion probabilities in advance and it permits the use of simple unbiased design-based variance estimators. Furthermore, our sampling procedure is an SRS of clusters and, thus, very fast even for large populations. This is a distinct advantage over the cube method, which can be time-consuming for large populations. Moreover, using ATC we avoid the need for approximations to second-order inclusion probabilities.
In contrast to sampling with probabilities proportional to size, our proposal is more robust with respect to a misspecification of the implicitly assumed model. This is highlighted by the results of the simulation study, where designs based on sampling with probabilities proportional to size led to inefficiencies owing to the presence of an intercept term in the population model. Additionally, sampling with probabilities proportional to size is clearly suboptimal for HT estimation in situations where the size variable is negatively correlated with the variable of interest.
Alternatively, one could consider StrRS approaches towards optimal model-based stratification for the GREG estimator, which are discussed in Section 12.4 of [1]. However, they require knowledge about the error structure of the assisting regression model and a rule to determine the stratum membership. Thus, the survey planner needs comprehensive knowledge about the model, which is far more demanding than knowing the values of some size variable.
Future research may focus on a generalisation of the ATC approach to account for multiple auxiliary variables simultaneously when constructing antithetic clusters. One option in this regard could be to apply a principal component analysis to the standardised matrix of auxiliary information and to base the clustering on the values of an appropriate distance of the principal components from their origin.
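One way this suggestion might look in practice is sketched below. It is purely illustrative and untested as a sampling design: the Euclidean norm of the leading principal component scores is an assumed choice of distance, the number of retained components is arbitrary, and the pairing rule (i-th smallest with i-th largest distance, even population size) is likewise an assumption. The function names and the simulated auxiliary matrix are hypothetical.

```python
import numpy as np


def pc_distance(aux, n_components):
    """Distance of each unit's leading principal component scores from the origin."""
    aux = np.asarray(aux, dtype=float)
    z = (aux - aux.mean(axis=0)) / aux.std(axis=0, ddof=1)  # standardise columns
    # Principal component scores via the SVD of the standardised matrix.
    u, s, _ = np.linalg.svd(z, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    return np.linalg.norm(scores, axis=1)                   # Euclidean norm (assumed)


def antithetic_pairs(values):
    """Pair the i-th smallest with the i-th largest value (even length assumed)."""
    order = np.argsort(values, kind="stable")
    half = len(order) // 2
    return np.column_stack((order[:half], order[::-1][:half]))


rng = np.random.default_rng(0)
aux = rng.normal(size=(10, 3))          # hypothetical auxiliary matrix
dist = pc_distance(aux, n_components=2)
pairs = antithetic_pairs(dist)          # clusters of two unit indices each
```

Whether such a scalar summary retains enough of the relationship between the auxiliary variables and the variable of interest would have to be examined separately.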
Acknowledgments
The author is very grateful to the associate editor and two anonymous referees for their comments and suggestions, which helped to improve the paper substantially.
References
[1] Särndal CE, Swensson B, Wretman J. Model assisted survey sampling. New York: Springer; 1992.
[2] Pfeffermann D. New important developments in small area estimation. Statistical Science. 2013; 28(1): 40-68.
[3] Jiang J, Lahiri P. Mixed model prediction and small area estimation. Test. 2006; 15(1): 1-96.
[4] Rao JNK, Molina I. Small area estimation. Hoboken: John Wiley & Sons, Inc; 2015.
[5] Pfeffermann D, Sverchkov M. Inference under informative sampling. In: Pfeffermann D, Rao CR, eds. Handbook of statistics vol 29B: Sample Surveys: Inference and Analysis. New York: Elsevier; 2009. p. 455-487.
[6] Hidiroglou MA, Lavallee P. Sampling and estimation in business surveys. In: Pfeffermann D, Rao CR, eds. Handbook of statistics vol 29A: Sample Surveys: Design, Methods, and Applications. New York: Elsevier; 2009. p. 441-470.
[7] Little RJ. Calibrated Bayes, an alternative inferential paradigm for official statistics. Journal of Official Statistics. 2012; 28(3): 309-334.
[8] Little RJ. Calibrated Bayes, an alternative inferential paradigm for official statistics in the era of big data. Statistical Journal of the IAOS. 2015; 31(4): 555-563.
[9] Gabler S, Ganninger M, Münnich R. Optimal allocation of the sample size to strata under box constraints. Metrika. 2012; 75(2): 151-161.
[10] Rizzo ML. Statistical computing with R. Boca Raton: CRC Press; 2007.
[11] Lehtonen R, Veijanen A. Design-based methods of estimation for domains and small areas. In: Pfeffermann D, Rao CR, eds. Handbook of statistics vol 29B: Sample Surveys: Inference and Analysis. New York: Elsevier; 2009. p. 219-249.
[12] Valliant R, Dorfman AH, Royall RM. Finite population sampling and inference: a prediction approach. New York: Wiley; 2000.
[13] Lohr S. Sampling: Design and Analysis. Pacific Grove: Duxbury Press; 1999.
[14] Hardy GH, Littlewood JE, Polya G. Inequalities. Cambridge: Cambridge University Press; 1952.
[15] Battese GE, Harter RM, Fuller WA. An error component model for prediction of county crop areas using survey and satellite data. Journal of the American Statistical Association. 1988; 83(401): 28-36.
[16] Fuller WA. Some design properties of a rejective sampling procedure. Biometrika. 2009; 96(4): 933-944.
[17] Deville JC, Tillé Y. Unequal probability sampling without replacement through a splitting method. Biometrika. 1998; 85(1): 89-101.
[18] Deville JC, Tillé Y. Efficient balanced sampling: The cube method. Biometrika. 2004; 91(4): 893-912.
[19] Deville JC, Tillé Y. Variance approximation under balanced sampling. Journal of Statistical Planning and Inference. 2005; 128(2): 569-591.