BEST LINEAR UNBIASED LATENT VALUES PREDICTORS FOR FINITE POPULATION LINEAR MODELS WITH DIFFERENT ERROR SOURCES

We develop best linear unbiased predictors (BLUP) of the latent values of labeled sample units selected from a finite population when the responses are subject to endogenous measurement errors, exogenous measurement errors or both. The usual target parameters are the population mean, the latent value associated with a labeled unit or the latent value of the unit that will appear in a given position in the sample. We show how both types of measurement errors affect the within-unit covariance matrices and indicate how the finite population BLUP may be obtained via standard software packages employed to fit mixed models in situations with either heteroskedastic or homoskedastic exogenous and endogenous measurement errors.


Introduction
Predicting the latent value (expected value) of a variable for a sample unit on which some measurements are made is a common problem in Applied Statistics. Sometimes, the response variable is subject to different sources of variability associated with measurement errors, as indicated in Cochran (1977), or observation errors, as termed by Sukhatme et al. (1984). Two sources of measurement errors can be identified. The first is related to the natural variability of the unit responses and is referred to as inherent variability in the terminology introduced by Buonaccorsi (2006) or response error by Särndal, Swensson and Wretman (1992). The second is associated with the measuring conditions and corresponds to the variability of the measurements around a fixed value (the latent value), produced by the measurement instruments or interviewers, for example. To clearly differentiate between the two types of measurement errors, we refer to the first as endogenous measurement errors and to the second as exogenous measurement errors.
Endogenous measurement errors may occur even if the measurement is made with absolute precision (i.e., with no exogenous measurement error). The monthly food expenditure of a given family is an example; the expenditure may vary from month to month around a latent value, but can be measured without error. Measurement of an adult's height by different observers is an example of a situation where there are only exogenous errors. Daily measurement of a patient's cholesterol level is an example of a situation where both endogenous and exogenous measurement errors are present.
As an example, we consider data for a subset of 13 participants in the project Seasonal Variability of Blood Lipids, NHLBI, number R01-HL52745 (MERRIAM et al., 1999). Data in this study were collected with the goal of identifying and quantifying factors that relate to seasonal changes in cholesterol. For each participant, triplicate measures of cholesterol were collected in each of four quarters. In each quarter, the data were not necessarily collected by the same examiner. We reproduce part of the data in Table 1; for illustrative purposes, these participants will represent our target population. We let $y_s$ denote the latent cholesterol level for the unit labeled $s$, $s = 1, \ldots, N$, i.e., the expected value of the cholesterol level over the 4 quarters, and represent the corresponding endogenous measurement error variance by $\sigma_s^2$. The population mean cholesterol level is $\mu = N^{-1}\sum_{s=1}^{N} y_s$ and the population variance is $\gamma^2 = (N-1)^{-1}\sum_{s=1}^{N}(y_s-\mu)^2$. We assume that the variability in the response introduced by the examiner is the exogenous measurement error. For the unit labeled $s$, measured in quarter $q$ by the $j$-th examiner, we represent the observed response by
$$Y_{sqj} = y_{sq} + \tilde W_j, \qquad (1)$$
where $y_{sq}$ represents the latent level of cholesterol for unit $s$ in quarter $q$ and $\tilde W_j$ represents the exogenous measurement error, assumed to have mean zero and variance $\tilde\sigma_j^2$. The question is how to estimate the latent value $y_s$ of unit $s$ in the population.
Linear mixed models have been extensively used for such purposes in an infinite population setup as indicated in Goldberger (1962), Verbeke and Molenberghs (2000), McCulloch and Searle (2001), Diggle, Heagerty, Liang and Zeger (2002), Demidenko (2013), Fitzmaurice, Davidian, Verbeke and Molenberghs (2008), among others. The standard linear mixed model for the response from the $i$-th unit selected from a population can be represented as
$$Y_i = \mu + B_i + E_i, \qquad (2)$$
where $\mu$ is the population mean response, $B_i$ is a random effect corresponding to the $i$-th selected unit, assumed to have mean zero and variance $\gamma^2$, and $E_i$ is a measurement error, assumed to have mean zero and variance $\sigma^2$ (or $\sigma_i^2$, for heteroskedastic models).
What does $E_i$ represent? The answer depends on the manner in which the response error is associated with the realized units. If we assume that there are no exogenous errors, then $E_i$ represents the inherent variability of the response of the $i$-th selected unit. If, instead, we assume that there is no variability in the selected unit's response and that all variability is due to the measurement process, then $E_i$ is associated with the exogenous variability.
What happens with $E_i$ when the two types of variability are present simultaneously? If $W_i$ represents the endogenous measurement error and $\tilde W_i$ the exogenous measurement error, then $E_i = f(W_i, \tilde W_i)$. The standard linear mixed model (2) does not consider the distinction between the two sources of measurement errors. It also does not retain the identifiability of the units in the population. Our objective is to clarify such issues in a finite population setup.
In Section 2, we describe the finite population mixed model with endogenous and exogenous measurement errors and derive optimal estimators or predictors using the expanded variable approach considered in Singer et al. (2012) along with the methodology employed in standard linear mixed models. We also discuss the relationship between the predictors obtained under both approaches for different covariance structures. In Section 3, we compare latent value predictors obtained via finite population and standard linear mixed models. In Section 4, we analyse the cholesterol data described in the Introduction and indicate how the function lme in the statistical software package R may be employed to fit finite population linear mixed models in situations with either heteroskedastic or homoskedastic exogenous and endogenous errors. We conclude with a brief discussion in Section 5.

The finite population mixed model
We define a finite population as a collection of $N$ identifiable units labeled $s = 1, \ldots, N$. Let $y = (y_1, \ldots, y_N)'$ denote a vector for which the $s$-th element is the latent value $y_s$ of the response associated with unit $s$. The population mean response is $\mu = N^{-1}\sum_{s=1}^{N} y_s$, and the population response variance is
$$\gamma^2 = (N-1)^{-1}\sum_{s=1}^{N}(y_s-\mu)^2.$$
Letting $b_s = y_s - \mu$, we may write $y_s = \mu + b_s$. Note that in this setup, $b_s$ is a constant and not a random effect.
Following Singer et al. (2012), we define the random permutation model as an ordered list of $N$ random variables obtained by randomly permuting the units, all permutations being equally likely. For each permutation, we assign a new label, $i = 1, \ldots, N$, to the units according to their position in the permuted list, letting $Y = (Y_1, \ldots, Y_N)'$ denote the random vector of latent permuted values. Simple random sampling without replacement is introduced via a set of correlated indicator random variables, $U_{is}$, that take the value one if unit $s$ is selected in position $i$ (an event with probability $1/N$) and zero otherwise. Consequently, $Y_i = \sum_{s=1}^{N} U_{is} y_s$, or $Y = Uy$ with $U = (U_{is})$.
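As an illustration, the following R sketch enumerates all equally likely permutations of a very small population and computes the mean vector and covariance matrix of the permuted latent values. Under the random permutation model these should equal $\mu 1_N$ and $\gamma^2 (I_N - N^{-1} J_N)$, respectively. The latent values are those of the three-unit example in Section 3; the helper `perms` is a hypothetical name introduced only for this illustration.

```r
# Enumerate all N! equally likely permutations of a small population and
# compute the first two moments of the permuted latent vector Y = U y.
y      <- c(10, 3, 2)          # latent values of the Section 3 example
N      <- length(y)
mu     <- mean(y)
gamma2 <- var(y)               # var() uses the divisor N - 1

# Illustrative helper: all permutations of v, one per row
perms <- function(v) {
  if (length(v) <= 1) return(matrix(v, nrow = 1))
  out <- NULL
  for (k in seq_along(v)) out <- rbind(out, cbind(v[k], perms(v[-k])))
  out
}
idx  <- perms(seq_len(N))                     # row r: labels in positions 1..N
Yall <- matrix(y[idx], nrow = nrow(idx))      # each row is one realization of Y

EY  <- colMeans(Yall)                         # permutation expectation
dev <- sweep(Yall, 2, EY)                     # centered realizations
VY  <- crossprod(dev) / nrow(Yall)            # permutation covariance matrix

P_N <- diag(N) - matrix(1 / N, N, N)
print(EY)                                     # compare with mu * 1_N
print(VY)                                     # compare with gamma2 * P_N
print(gamma2 * P_N)
```

The covariance is divided by the number of permutations (not by that number minus one) because the enumeration covers the whole permutation distribution rather than a sample from it.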
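Letting the subscript $\xi_1$ denote expectation with respect to the permutation distribution and $U = (U_{is})$ denote the $N \times N$ matrix of indicator variables, it follows that
$$E_{\xi_1}(U) = N^{-1} J_N \quad \text{and} \quad Var_{\xi_1}[\mathrm{vec}(U)] = (N-1)^{-1} P_N \otimes P_N, \qquad (3)$$
where $P_N = I_N - N^{-1} J_N$ with $J_N = 1_N 1_N'$, $I_N$ denotes an identity matrix of dimension $N$ and $1_N$ denotes an $N \times 1$ vector with all elements equal to one. Then, it follows that
$$E_{\xi_1}(Y) = 1_N \mu \quad \text{and} \quad Var_{\xi_1}(Y) = \gamma^2 P_N. \qquad (4)$$
Suppose that a simple random sample without replacement is to be selected from the population. Without loss of generality, we let the sample (indexed by $i = 1, \ldots, n$) consist of the elements occupying the first $n \le N$ positions in a permutation. If we assume that only one observation is made on the $i$-th selected unit and no measurement errors are considered, the model for the observable response $Y_i$ in the $i$-th position is
$$Y_i = \mu + \sum_{s=1}^{N} U_{is} b_s. \qquad (5)$$
When an endogenous measurement error $W_s$ associated with unit $s$ is present, the model for the observable response $Y_i$ may be specified as
$$Y_i = \mu + \sum_{s=1}^{N} U_{is} b_s + \sum_{s=1}^{N} U_{is} W_s. \qquad (6)$$
If, in addition, an exogenous measurement error $\tilde W_i$ is considered for the measurement condition associated with the $i$-th position, the model for the observable response $Y_i$ is
$$Y_i = \mu + \sum_{s=1}^{N} U_{is} b_s + \sum_{s=1}^{N} U_{is} W_s + \tilde W_i. \qquad (7)$$
Expression (7) may be written as
$$Y_i = \mu + B_i + W_i + \tilde W_i, \qquad (8)$$
where $W_i = \sum_{s=1}^{N} U_{is} W_s$ denotes the endogenous measurement error associated with the $i$-th selected unit and $B_i = \sum_{s=1}^{N} U_{is} b_s$ denotes a random effect (the index $i$ refers to positions and the index $s$ to labels). In (6)-(8) and in what follows, $Y_i$ denotes the observable response, which coincides with the latent value $\mu + B_i$ only when no measurement errors are present. Note that (8) has a form similar to that of the standard linear mixed model (2), except that the two sources of measurement error (endogenous and exogenous) are explicit in the former. The standard linear mixed model cannot distinguish these two sources of measurement error since the subscript $i$ simultaneously indexes the position and the selected unit in the sample. Since $\sum_{s=1}^{N} U_{is} = 1$ for all $i = 1, \ldots, N$, and in each row of $U$ there exists a single value equal to 1, all the others being zero, it follows that
$$U 1_N = 1_N, \quad U_{is}^2 = U_{is} \quad \text{and} \quad U_{is} U_{is'} = 0 \ \text{for} \ s \ne s'. \qquad (9)$$
Then, when both endogenous and exogenous measurement errors are present, a model for the response on the $N$ positions in the permuted population is
$$Y = 1_N \mu + U b + U W + \tilde W, \qquad (10)$$
where $b = (b_1, \ldots, b_N)'$, $W = (W_1, \ldots, W_N)'$ and $\tilde W = (\tilde W_1, \ldots, \tilde W_N)'$. Letting the subscript $\xi_2$ represent expectation with respect to the endogenous measurement error and the subscript $\xi_3$ represent expectation with respect to the exogenous measurement error, we consider the following assumptions:
$$E_{\xi_2}(W_s) = 0, \quad Var_{\xi_2}(W_s) = \sigma_s^2, \quad E_{\xi_3}(\tilde W_i) = 0, \quad Var_{\xi_3}(\tilde W_i) = \tilde\sigma_i^2, \qquad (11)$$
with all measurement errors mutually uncorrelated and independent of the permutation. It follows that the expected value and variance of the random vector $Y$ are
$$E_{\xi_1 \xi_2 \xi_3}(Y) = 1_N \mu \qquad (12)$$
and
$$Var_{\xi_1 \xi_2 \xi_3}(Y) = \gamma^2 P_N + \sigma^2 I_N + \bigoplus_{i=1}^{N} \tilde\sigma_i^2, \qquad (13)$$
where $\sigma^2 = N^{-1}\sum_{s=1}^{N} \sigma_s^2$ and $\bigoplus$ denotes the direct sum operator. We are interested in developing an optimal linear unbiased predictor (or estimator) of target quantities of the form $P = g'(1_N \mu + U b)$, i.e., linear combinations of the latent values in the permuted population, where $g$ is an $N \times 1$ vector of constants. The usual targets are i) the population mean $\mu$, ii) the latent value $y_s$ of the unit labeled $s$ and iii) the latent value of the unit in a given position. For example, if $g = e_i$, with $e_i$ denoting a vector with null elements except for the $i$-th, which is equal to 1, then $P = \mu + B_i$, the latent value of the unit in the $i$-th position of the random permutation.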
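The appearance of the average endogenous variance $\sigma^2 = N^{-1}\sum_{s=1}^{N}\sigma_s^2$ in the variance of the permuted responses can be checked with a short worked derivation. The sketch below assumes that the endogenous errors $W_s$ have mean zero, are mutually uncorrelated and are independent of the permutation; it uses only the facts that each indicator satisfies $U_{is}^2 = U_{is}$, that a position holds a single unit ($U_{is}U_{is'} = 0$ for $s \ne s'$) and that a unit occupies a single position ($U_{is}U_{i's} = 0$ for $i \ne i'$).

```latex
\begin{align*}
Var\Big(\sum_{s=1}^{N} U_{is} W_s\Big)
  &= E_{\xi_1}\Big[ E_{\xi_2}\Big\{ \Big(\sum_{s=1}^{N} U_{is} W_s\Big)^2 \,\Big|\, U \Big\} \Big]
   = E_{\xi_1}\Big( \sum_{s=1}^{N} U_{is}\, \sigma_s^2 \Big)
   = \frac{1}{N} \sum_{s=1}^{N} \sigma_s^2 = \sigma^2, \\
Cov\Big(\sum_{s=1}^{N} U_{is} W_s,\ \sum_{s=1}^{N} U_{i's} W_s\Big)
  &= E_{\xi_1}\Big( \sum_{s=1}^{N} U_{is} U_{i's}\, \sigma_s^2 \Big) = 0,
  \qquad i \ne i'.
\end{align*}
```

Hence, after permutation, the endogenous errors contribute the homoskedastic term $\sigma^2 I_N$ to the covariance matrix of the responses, even when the unit-specific variances $\sigma_s^2$ differ.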
Note that i) and ii) represent fixed values but iii) refers to a random variable. We are interested in predicting the random variable in iii). For this purpose, we follow the ideas of Singer et al. (2012) to develop the BLUP of the target quantity. First, we represent a simple random sample without replacement by the first $n \le N$ random variables in $Y$ and let the remaining $(N - n)$ random variables denote the responses of the non-sampled elements. Explicitly, we let $Y = (Y_S', Y_R')'$ and express the predictor as a linear combination of the sample random variables, $Y_S$. To determine the coefficients of these random variables that lead to the optimal predictor, we specify an unbiasedness constraint and then minimize the mean squared error, subject to this constraint. This leads to the BLUP of the target.
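Taking (3) and (9) into account, we have
$$E(Y_S) = 1_n \mu, \quad E(Y_R) = 1_{N-n}\, \mu \qquad (14)$$
and
$$Var\begin{pmatrix} Y_S \\ Y_R \end{pmatrix} = \begin{pmatrix} V_S & V_{SR} \\ V_{SR}' & V_R \end{pmatrix}, \qquad (15)$$
with $V_S = \gamma^2 (I_n - N^{-1} J_n) + \sigma^2 I_n + \bigoplus_{i=1}^{n} \tilde\sigma_i^2$, $V_{SR} = -\gamma^2 N^{-1} 1_n 1_{N-n}'$ and $V_R = \gamma^2 (I_{N-n} - N^{-1} J_{N-n}) + \sigma^2 I_{N-n} + \bigoplus_{i=n+1}^{N} \tilde\sigma_i^2$, where $\sigma^2$ denotes the average endogenous measurement error variance and $\tilde\sigma_i^2$ denotes the exogenous measurement error variance in position $i$. Now, we let $g = (g_S', g_R')'$ so that the quantity to predict is $P = g_S'(1_n \mu + B_S) + g_R'(1_{N-n}\, \mu + B_R)$, where $B_S$ and $B_R$ contain the random effects $B_i$ of the sampled and non-sampled positions, respectively. The BLUP of $P$ must satisfy the following criteria considered in Royall (1976), i.e., it must:
• be a linear combination of the sample data: $\hat P = c' Y_S$;
• be unbiased: $E(\hat P - P) = 0$;
• have minimum variance $Var(\hat P - P)$ among the predictors satisfying the two previous conditions.
The unbiasedness constraint implies that $c' E(Y_S) = g_S' E(Y_S) + g_R' E(Y_R)$, i.e., $c' 1_n = g' 1_N$, given that $g_S' E(Y_S) + g_R' E(Y_R) = g' 1_N \mu$, and from (12) and recalling (4) and (13), we obtain
$$Var(\hat P - P) = c' V_S c - 2 c' V_{SP} + \gamma^2 g' P_N g,$$
where $V_{SP} = Cov(Y_S, P) = \gamma^2 (g_S - N^{-1} 1_n 1_N' g)$. Therefore, using Lagrangian multipliers, we seek the value of $c$ that will minimize
$$\Phi(c, \lambda) = c' V_S c - 2 c' V_{SP} + \gamma^2 g' P_N g + 2\lambda (c' 1_n - g' 1_N).$$
Differentiating with respect to $c$ and $\lambda$, setting these derivatives to zero and solving for $c$ we obtain the BLUP of $P$ as
$$\hat P = g' 1_N \hat\mu + V_{SP}' V_S^{-1} (Y_S - 1_n \hat\mu), \quad \text{with} \quad \hat\mu = (1_n' V_S^{-1} 1_n)^{-1} 1_n' V_S^{-1} Y_S. \qquad (16)$$
For details, see Singer et al. (2012).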
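In particular, to obtain the BLUP of the latent value $P_i = \mu + B_i$ associated with the $i$-th selected unit in the sample, first observe that $g_S = e_i$ and $g_R = 0$, so that $g' 1_N = 1$ and $V_{SP} = \gamma^2 (e_i - N^{-1} 1_n)$. Given that
$$V_S^{-1} = A^{-1} + \frac{\gamma^2 / N}{1 - \gamma^2 L / N}\, A^{-1} 1_n 1_n' A^{-1}, \quad \text{with} \quad A = \bigoplus_{i'=1}^{n} (\gamma^2 + \sigma^2 + \tilde\sigma_{i'}^2), \qquad (17)$$
where $L = \sum_{i'=1}^{n} (\gamma^2 + \sigma^2 + \tilde\sigma_{i'}^2)^{-1}$ and $m$ is an $n \times 1$ vector with the $i$-th component equal to $(\gamma^2 + \sigma^2 + \tilde\sigma_i^2)^{-1}$, the remaining ones equal to zero, from (17), it follows that
$$\hat\mu = \bar Y_w = L^{-1} \sum_{i'=1}^{n} (\gamma^2 + \sigma^2 + \tilde\sigma_{i'}^2)^{-1} Y_{i'} \quad \text{and} \quad V_{SP}' V_S^{-1} (Y_S - 1_n \bar Y_w) = \gamma^2 m' (Y_S - 1_n \bar Y_w).$$
Then, (16) simplifies to
$$\hat P_i = \bar Y_w + \frac{\gamma^2}{\gamma^2 + \sigma^2 + \tilde\sigma_i^2} (Y_i - \bar Y_w), \qquad (18)$$
where $\gamma^2 / (\gamma^2 + \sigma^2 + \tilde\sigma_i^2)$ is a shrinkage constant, $\sigma^2$ is the average endogenous measurement error variance and $\tilde\sigma_i^2$ is the exogenous measurement error variance in position $i$. When there are only endogenous measurement errors, the shrinkage constant is $\gamma^2 / (\gamma^2 + \sigma^2)$ and the BLUP is
$$\hat P_i = \bar Y + \frac{\gamma^2}{\gamma^2 + \sigma^2} (Y_i - \bar Y), \qquad (19)$$
with $\bar Y = n^{-1} \sum_{i'=1}^{n} Y_{i'}$. When there are only exogenous measurement errors, the shrinkage constant is $\gamma^2 / (\gamma^2 + \tilde\sigma_i^2)$ and the BLUP is
$$\hat P_i = \bar Y_w + \frac{\gamma^2}{\gamma^2 + \tilde\sigma_i^2} (Y_i - \bar Y_w), \qquad (20)$$
with the weights in $\bar Y_w$ now proportional to $(\gamma^2 + \tilde\sigma_{i'}^2)^{-1}$. When neither type of measurement error is present, the shrinkage constant equals one and the BLUP reduces to $\hat P_i = Y_i$. In practice, the variance components must be estimated, leading to the so-called empirical BLUP.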
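The shrinkage form of the predictor can be compared numerically with the general matrix form of a BLUP. The R sketch below is an illustration under stated assumptions, not the authors' implementation: it takes the sample covariance matrix to be $\gamma^2 (I_n - N^{-1} J_n)$ plus a diagonal matrix with elements $\sigma^2$ plus the exogenous variance of each position, and the covariance between the sample and the target $\mu + B_i$ to be $\gamma^2 (e_i - N^{-1} 1_n)$. The function names `blup_closed` and `blup_matrix`, as well as the numerical inputs, are hypothetical.

```r
# Shrinkage (closed) form: weighted mean plus shrunken deviation
blup_closed <- function(Y, gamma2, sigma2, sigma2_exo) {
  # total variance attached to each position (recycled to the sample size)
  a <- rep_len(gamma2 + sigma2 + sigma2_exo, length(Y))
  ybar_w <- sum(Y / a) / sum(1 / a)          # weights proportional to 1 / a
  ybar_w + (gamma2 / a) * (Y - ybar_w)
}

# General form: mu_hat + Cov(Y_S, P_i)' V_S^{-1} (Y_S - mu_hat), for all i
blup_matrix <- function(Y, gamma2, sigma2, sigma2_exo, N) {
  n    <- length(Y)
  Pn   <- diag(n) - matrix(1 / N, n, n)
  V    <- gamma2 * Pn + diag(rep_len(sigma2 + sigma2_exo, n), nrow = n)
  Vinv <- solve(V)
  mu_hat <- sum(Vinv %*% Y) / sum(Vinv)      # generalized least squares mean
  C <- gamma2 * Pn                           # column i: Cov(Y_S, mu + B_i)
  as.vector(mu_hat + t(C) %*% Vinv %*% (Y - mu_hat))
}

# Hypothetical inputs: n = 4 sampled positions from a population of N = 10
Y          <- c(12, 1, 4, 7)
sigma2_exo <- c(1, 4, 9, 16)
p_closed <- blup_closed(Y, gamma2 = 19, sigma2 = 35, sigma2_exo = sigma2_exo)
p_matrix <- blup_matrix(Y, gamma2 = 19, sigma2 = 35, sigma2_exo = sigma2_exo, N = 10)
print(cbind(p_closed, p_matrix))

# Without measurement errors the shrinkage constant is one
p_noerr <- blup_closed(Y, gamma2 = 19, sigma2 = 0, sigma2_exo = 0)
print(p_noerr)
```

Setting `sigma2_exo = 0` gives the endogenous-only case and setting `sigma2 = 0` gives the exogenous-only case.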

A comparison of finite population and standard mixed models predictors
To clarify the effect of different sources of measurement errors on the prediction of latent values under mixed models, we reproduce a simple example from Singer et al. (2012). For this purpose, we compare predictors of latent values of sampled units in the presence of endogenous heteroskedastic measurement errors. We consider a population of size $N = 3$ from which a sample of size $n = 2$ is selected. A single measurement of a response variable with two possible values (equal to the latent value ± the endogenous standard deviation) is obtained on each sampled unit. The population parameters are presented in Table 2.

Table 2 - Population parameters

  Unit s         y_s    σ_s²    k_s      w_s
  1              10     1       0.491    0.950
  2 (Juliana)    3      100     0.082    0.160
  3 (Laura)      2      4       0.427    0.826
  Population: µ = 5, γ² = 19, σ² = 35, w = 0.352

The tabulated constants agree with $w_s = \gamma^2 / (\gamma^2 + \sigma_s^2)$, $k_s = w_s / \sum_{s'} w_{s'}$ and $w = \gamma^2 / (\gamma^2 + \sigma^2)$. The idea is to compare the performance of the usual heteroskedastic linear mixed model BLUP, namely,
$$\hat Q_i = \bar Y_w + \frac{\gamma^2}{\gamma^2 + \sigma_i^2} (Y_i - \bar Y_w), \qquad (21)$$
where $\bar Y_w$ is the weighted sample mean with weights proportional to $(\gamma^2 + \sigma_i^2)^{-1}$, with that of the corresponding heteroskedastic finite population mixed model BLUP,
$$\hat P_i = \bar Y + \frac{\gamma^2}{\gamma^2 + \sigma^2} (Y_i - \bar Y). \qquad (22)$$
In Table 3 we present all the possible results for samples of size $n = 2$ along with the corresponding BLUP $\hat Q_i$ and $\hat P_i$ as well as their squared errors.

Table 3 - Possible results obtained with a sample of size n = 2 from the population described in Table 2 along with the corresponding BLUP ($\hat Q_i$ and $\hat P_i$) and their respective squared errors

Note that the finite population mixed model predictor $\hat P_i$ is unbiased but the standard linear mixed model predictor $\hat Q_i$ is not. We adopted the usual interpretation for $\hat Q_i$, i.e., as a predictor of the response for the $i$-th selected subject assuming that the associated variance corresponds to the subject-specific endogenous variance, which changes with the subject selected in the $i$-th position. However, we call attention to the mistake in doing so: according to the standard linear mixed model, the shrinkage constant $\gamma^2 / (\gamma^2 + \sigma_i^2)$ is attached to the position $i$ in the sample and not to the subject selected in that position, as in the example. This does not occur with the shrinkage constant $\gamma^2 / (\gamma^2 + \sigma^2)$ considered in the finite population mixed model predictor. Nevertheless, the squared errors associated with $\hat Q_i$ are consistently smaller than the corresponding squared errors associated with $\hat P_i$. The mean squared error of the finite population mixed model predictor is 23.7 while the mean squared error of the misinterpreted linear mixed model predictor is 9.1. This suggests that the unbiasedness condition considered in the derivation of $\hat P_i$ may not be appropriate.
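The comparison in this example is small enough to be enumerated completely. The R sketch below lists the six ordered samples of size two and the four equally likely response patterns for each, and evaluates both predictors for the units in the two sample positions. It assumes that the standard linear mixed model predictor uses the variance of the unit actually selected (the interpretation discussed above) together with a weighted sample mean, and that the finite population predictor uses the average variance together with the simple sample mean; the object names are chosen for this illustration only.

```r
y      <- c(10, 3, 2)            # latent values of the three units
s2     <- c(1, 100, 4)           # unit-specific endogenous variances
N      <- length(y)
gamma2 <- var(y)                 # population variance, divisor N - 1
sigma2 <- mean(s2)               # average endogenous variance
w_s    <- gamma2 / (gamma2 + s2) # unit-specific shrinkage constants
w      <- gamma2 / (gamma2 + sigma2)
k_s    <- w_s / sum(w_s)

res <- NULL
for (s in 1:N) for (u in setdiff(1:N, s)) {        # ordered samples (s, u)
  for (e1 in c(-1, 1)) for (e2 in c(-1, 1)) {      # latent value -/+ one sd
    lab <- c(s, u)
    Y   <- y[lab] + c(e1, e2) * sqrt(s2[lab])
    ws  <- w_s[lab]
    ybar_w <- sum(ws * Y) / sum(ws)                # weights prop. to 1/(gamma2 + s2)
    Q <- ybar_w + ws * (Y - ybar_w)                # standard mixed model predictor
    P <- mean(Y) + w * (Y - mean(Y))               # finite population predictor
    res <- rbind(res, data.frame(unit = lab, errQ = Q - y[lab], errP = P - y[lab]))
  }
}

mse_Q  <- mean(res$errQ^2)
mse_P  <- mean(res$errP^2)
bias_Q <- mean(res$errQ)
bias_P <- mean(res$errP)
print(round(c(mse_P = mse_P, mse_Q = mse_Q), 1))
print(c(bias_P = bias_P, bias_Q = bias_Q))
```

Under these assumptions the enumeration gives mean squared errors that round to 23.7 for the finite population predictor and 9.1 for the standard linear mixed model predictor, with zero average error for the former only.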
Extensive simulations were conducted by Moreno (2009) to examine the behaviour of both predictors under different setups involving underlying distributions as well as sample sizes. In general, the standard linear mixed model predictor performed better than the finite population mixed model predictor. In practical applications it is possible to fit finite population mixed models to data with endogenous and exogenous measurement errors using routines developed for standard mixed models and implemented in commonly used statistical software packages, such as SAS or R.

Analysis of the cholesterol data in Table 1
The standard linear mixed model representation for the $j$-th measurement on the $i$-th unit in the selected sample is
$$Y_{ij} = \mu + B_i + E_{ij},$$
with $B_i \overset{iid}{\sim} N(0, \gamma^2)$, and $E_{ij} \overset{iid}{\sim} N(0, \sigma_i^2)$ for heteroskedastic measurement errors or $E_{ij} \overset{iid}{\sim} N(0, \sigma^2)$ for homoskedastic measurement errors. The BLUP for $Y_i = \mu + B_i$ under this model has the form (19) in the homoskedastic case or (20) in the heteroskedastic case.
As an example of how the computation might be carried out, consider the data set described in the Introduction.
In Table 4 we display the means of the 12 cholesterol measurements of each subject and assume, for illustrative purposes, that they are the corresponding "true" latent values. It follows that the "true" latent value variance is $\gamma^2 = (1/13)\sum_{s=1}^{13} (y_s - \bar Y)^2 = 2939.9$, where $\bar Y = (1/13)\sum_{s=1}^{13} y_s$. Additionally, we take $\sigma_s^2 = (1/3)\sum_{q=1}^{4} (y_{sq} - y_s)^2$, $s = 1, \ldots, 13$, where $y_{sq}$ denotes the mean cholesterol level of subject $s$ in quarter $q$, as the "true" variance of the endogenous measurement error (see Table 4). Based on these values, the average "true" endogenous variance is $\sigma^2 = (1/13)\sum_{s=1}^{13} \sigma_s^2 = 1836.5$. In this setup, both the endogenous and exogenous measurement errors are heteroskedastic. Note that when only heteroskedastic endogenous measurement errors are present, the finite population mixed model predictor (19) has the same form as the standard linear mixed model predictor with homoskedastic measurement error variances.
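The "true" variance components defined above are simple functions of the subject-by-quarter means. As the values in Table 4 are not reproduced here, the R sketch below applies the same definitions to a hypothetical 13 × 4 matrix of simulated quarter means; the matrix `ysq` and its contents are assumptions made only for illustration and do not correspond to the cholesterol data.

```r
set.seed(1)
# Hypothetical quarter means: 13 subjects (rows) by 4 quarters (columns)
ysq <- matrix(rnorm(13 * 4, mean = 200, sd = 40), nrow = 13)

ys       <- rowMeans(ysq)                  # "true" latent value of each subject
gamma2   <- mean((ys - mean(ys))^2)        # divisor 13, as in the text
sigma2_s <- rowSums((ysq - ys)^2) / 3      # endogenous variance of each subject
sigma2   <- mean(sigma2_s)                 # average endogenous variance

print(round(c(gamma2 = gamma2, sigma2 = sigma2), 1))
```

Note that the latent value variance uses the divisor 13 (the number of subjects), whereas each subject-specific variance uses the divisor 3 (the number of quarters minus one).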
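When both heteroskedastic endogenous and exogenous measurement errors are present, the finite population mixed model predictors are equivalent to the standard linear mixed model predictors generated from the model
$$Y_{ijk} = \mu + B_i + D_{ik} + E_{ijk}, \qquad (23)$$
where $B_i \sim N(0, \gamma^2)$, $D_{ik} \sim N(0, \sigma^2)$ and $E_{ijk} \sim N(0, \tilde\sigma_j^2)$ are independent, with $j$ indexing the measurement condition. In (23), $D_{ik}$ represents the endogenous measurement error and $E_{ijk}$ represents the exogenous measurement error.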
We assume that different interviewers correspond to the different evaluation conditions and consequently that the associated measurement errors may be considered as exogenous measurement errors. The corresponding "true" exogenous measurement error variances are presented in Table 5. The corresponding predictors for the cholesterol example may be obtained via the following commands:

    library(nlme)  # provides groupedData(), lme() and varIdent()
    BD4 <- groupedData(Cholesterol ~ Interv | Patient/Trim, data = BLUP)
    # random intercepts for Patient and Trim within Patient; residual variance differs by interviewer
    fit3 <- lme(Cholesterol ~ 1, data = BD4, random = ~1, weights = varIdent(form = ~1 | Interv))
    # estimated mean plus predicted patient effects
    fit3$coefficients$fixed + fit3$coefficients$random$Patient

The lme predictors are displayed in the sixth column of Table 4.
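Since the data frame used in the commands above is not reproduced here, the following R sketch applies the same type of fit to simulated data and shows one way of extracting the quantities of interest: the variance components, the interviewer-specific residual variances and the predicted latent values. Everything about the simulated data set (the object `sim`, the three interviewers A, B and C, the variance values and the random seed) is hypothetical; only the structure, with quarters nested in patients and residual variances varying with the interviewer, mimics the cholesterol example.

```r
library(nlme)

set.seed(123)
n_pat <- 13; n_trim <- 4; n_rep <- 3
sim <- expand.grid(Rep = 1:n_rep, Trim = 1:n_trim, Patient = 1:n_pat)
sim$Interv <- factor(sample(c("A", "B", "C"), nrow(sim), replace = TRUE))

b      <- rnorm(n_pat, sd = 50)               # patient effects
d      <- rnorm(n_pat * n_trim, sd = 35)      # quarter-within-patient (endogenous) effects
sd_exo <- c(A = 5, B = 10, C = 20)            # exogenous sd per interviewer
sim$Cholesterol <- 200 + b[sim$Patient] +
  d[(sim$Patient - 1) * n_trim + sim$Trim] +
  rnorm(nrow(sim), sd = sd_exo[as.character(sim$Interv)])
sim$Patient <- factor(sim$Patient)
sim$Trim    <- factor(sim$Trim)

fit <- lme(Cholesterol ~ 1, data = sim, random = ~ 1 | Patient/Trim,
           weights = varIdent(form = ~ 1 | Interv))

print(VarCorr(fit))      # variances: Patient, Trim within Patient, reference residual

# varIdent stores standard deviation ratios relative to a reference interviewer
ratios  <- coef(fit$modelStruct$varStruct, unconstrained = FALSE, allCoef = TRUE)
exo_var <- (fit$sigma * ratios)^2             # residual variance per interviewer
print(exo_var)

# Predicted latent values: estimated mean plus predicted patient effects
latent <- as.numeric(fixef(fit)) + ranef(fit, level = 1)[, 1]
print(latent)
```

Here the `Patient` variance plays the role of the latent value variance, the `Trim` within `Patient` variance that of the average endogenous variance, and the interviewer-specific residual variances those of the exogenous variances.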
The lme estimate of the population variance of the latent values is $\hat\gamma^2 = 2455.1$; the lme estimate of the mean endogenous measurement error variance is $\hat\sigma^2 = 1312.8$ and the lme estimates of the exogenous measurement error variances are respectively,

Discussion
By means of the example in Section 3, we showed that, contrary to the usual interpretation, the heteroskedastic standard linear mixed model predictor (21) does not take heteroskedastic subject-specific (endogenous) variances into account. Since the step that links a unit label to its position in a response vector is omitted in the standard linear mixed model, this interpretation is erroneous. Finite population mixed models prevent such an erroneous switch of concepts. The problem is aggravated by the fact that (21) corresponds to the BLUP obtained when exogenous heteroskedastic measurement errors are considered. By explicitly considering both types of measurement errors, we clarify this issue and extend the results of Singer et al. (2012).
Given that the expressions for best linear unbiased predictors for finite population mixed models may be matched to those obtained with standard linear mixed models with homoskedastic measurement errors, heteroskedastic measurement errors or both, we may use standard software designed for the latter to obtain predictors for the former, provided that the differences in interpretation are kept in mind. The advantage is that the covariance matrix is explicitly related to the exogenous or endogenous measurement errors so that the choice of the model may take advantage of the physical characteristics of the measurement process.
Finally, we note that neither model can account for the unit-specific endogenous measurement error variances when the interest is to predict the latent values of labelled selected units. One of the reasons for this may be related to the unbiasedness condition, which relates to the overall expected response and not to the specific unit latent value. This issue has been raised by Robinson (1991) and by Buonaccorsi (2006) in a slightly different context.