Covariate Measurement Error Adjustment for Multilevel Models With Application to Educational Data

This article proposes a multilevel model for the assessment of school effectiveness, where the intake achievement is a predictor and the response variable is the achievement in the subsequent periods. The achievement is a latent variable that can be estimated on the basis of an item response theory model and is hence subject to measurement error. Ignoring covariate measurement error leads to biased parameter estimates. To address this problem, a likelihood-based measurement error adjustment for multilevel models is proposed. In particular, the method deals with a covariate measured with error that has a random coefficient. An application to educational data from the Italian region of Lombardy illustrates the method.

Manuscript received January 21, 2009; revision received November 24, 2009; accepted January 17, 2010.


Introduction
The assessment of school effectiveness has become a central aim of educational policymakers. School effectiveness refers to the performance of schools, measured in terms of the achievement attained by pupils at the end of a period of formal schooling (Scheerens, 2000). The Coleman report (Coleman et al., 1966) represents the earliest significant research on the topic. The methodology used was deficient, as it applied ordinary regression models, ignoring the data structure of students nested within classrooms, and used data for only one school year.
The important contribution of Aitkin and Longford (1986) stated the minimum requirements for an adequate analysis of school differences, namely, adjusting for intake knowledge and modeling the multilevel structure of the data through variance components. The initial achievement level, in fact, might be a confounder of students' performance, and a fair comparison of school outcomes should account for differences in this variable. This fundamental requirement moved research on school performance toward models that analyze change rather than status. Such models are generally referred to as "value-added models" (McCaffrey, Lockwood, Koretz, Louis, & Hamilton, 2004; Tekwe et al., 2004). Multilevel models are essential for hierarchically structured data, in which observations within clusters are likely to be correlated. See, for example, Snijders and Bosker (1999) or Goldstein (2003) for broad reviews of multilevel techniques.
The achievement level, however, cannot be directly observed in the population; it can only be estimated, usually on the basis of item response theory models (see, e.g., Baker & Kim, 2004; Bartholomew & Knott, 1999). When the estimated intake level is included in the model as a predictor, appropriate techniques for adjusting for measurement error in the covariates are required to obtain unbiased parameter estimates. It is well known that the use of covariates measured with error can lead to biased estimates of model parameters and to misleading inferences (see, e.g., Carroll, Ruppert, Stefanski, & Crainiceanu, 2006). While several authors have raised this point, adjustments for measurement error are seldom made in educational data analyses. One likely explanation is that methods adjusting for measurement error are not straightforward to apply to multilevel data, in particular when the measurement error affects the random part of the model.
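The bias induced by an error-prone covariate can be illustrated with a small simulation (a generic sketch with illustrative parameter values, not the model used later in the article): in simple linear regression, the naive slope is attenuated toward zero by the reliability ratio σ_x²/(σ_x² + σ_u²).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
beta = 1.0
s2_x, s2_u = 1.0, 0.5                            # true-score and measurement error variances

x = rng.normal(0.0, np.sqrt(s2_x), n)            # latent covariate
y = beta * x + rng.normal(0.0, 0.3, n)           # outcome depends on the true score
y0 = x + rng.normal(0.0, np.sqrt(s2_u), n)       # error-prone observed covariate

beta_true_fit = np.polyfit(x, y, 1)[0]           # slope using the latent covariate
beta_naive = np.polyfit(y0, y, 1)[0]             # slope using the noisy covariate

reliability = s2_x / (s2_x + s2_u)               # attenuation factor, here 2/3
```

With these values the naive slope is close to 2/3 of the true coefficient, while the fit on the latent covariate is essentially unbiased.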
The motivating application is part of a project co-funded by the Italian region of Lombardy and the European Social Fund to assess the level of students' achievement in Italian and mathematics in that region. Suitable tests in Italian and mathematics were designed and administered to every annual cohort of students at the end of each educational year. A total of 1,106 students who were 11 years old in 2003 were administered tests in April 2003, April 2004, and May 2005. The students were nested in 77 classes, with a median class size of 15. The ages of 11-14 years cover the middle-school cycle in Italy, during which students keep the same teachers. Every test was composed of about 30 items, including some items in common between two subsequent tests, anchoring the test scores to the same scale. The study also collected some socioeconomic variables.
The results of the tests were analyzed with the Rasch model (Molenaar, 1995b) using specialized software that provides a measure of the achievement together with the standard error of the estimate. Item parameters were estimated by conditional maximum likelihood (Molenaar, 1995a), while person parameters were estimated by maximum likelihood (Hoijtink & Boomsma, 1995).
Prompted by this particular application, this article proposes a model specification for estimation of the value added, which includes the initial achievement as predictor and where the response variable is the achievement in the subsequent periods. Moreover, the baseline achievement has a coefficient that is random at the level of the class. The specification considered results in a multilevel model where both the response variable and a covariate associated with a random effect are affected by measurement error. The adjustment for the measurement error in the covariate is performed following a structural approach, implementing a Monte Carlo EM algorithm to obtain the maximum likelihood estimates.
It is shown that the proposed method is rather effective in adjusting the model estimates, with adjustments that are sometimes rather large. The output of the procedure comprises the same quantities usually available to users of multilevel models, including random effects predictions that can be used to compare classrooms or schools (Snijders & Bosker, 1999, §4.7).
The article is organized as follows. Section 2 presents some background on value-added models and measurement error adjustments in multilevel models. Section 3 proposes a new model specification and develops the method for measurement error adjustment, while Section 4 applies this methodology to the data of the motivating example. A simulation study of the performance of the proposed method is presented in Section 5. Discussion and concluding remarks are given in Section 6. All technical derivations are provided in an appendix.

Value-Added Models
A common approach to modeling two subsequent observations over time is to use the prior level as predictor of the current level (see, e.g., Goldstein, 2003, chap. 2). Denoting by x_{tij} the achievement at time t of student i in class j, a model that includes class random effects for the intercept and the prior level takes the form

x_{1ij} = β_0 + β_1 x_{0ij} + u_{0j} + u_{1j} x_{0ij} + e_{ij},   (1)

where u_{0j} and u_{1j} are normally distributed random effects and e_{ij} is an independent normally distributed error term. When the achievement scores in two subsequent periods are on the same scale, gain scores can be defined as the differences D_{ij} = x_{1ij} − x_{0ij} (McCaffrey et al., 2004), which measure the achievement growth over the educational period considered. The random intercept gain score model is then given by

D_{ij} = β_0 + u_{0j} + e_{ij}.   (2)

Note that Equation 2 is a special case of Equation 1 with β_1 = 1 and Var(u_{1j}) = 0. When at least three longitudinal observations on the same scale are available, it is possible to estimate a growth model, that is, a model for the growth of achievement over time (Raudenbush & Bryk, 2002, chap. 6). The inclusion of class-level random effects in the model may take different forms, depending on the structure of the data. When students do not change class during the period considered, the data present a nested hierarchical structure: classes group students, and students group different observations over time. A typical model is then

x_{tij} = β_0 + β_1 t + u_{0j} + u_{1j} t + v_{0ij} + v_{1ij} t + e_{tij},   (3)

where t denotes time, the class-level random effects (u_{0j}, u_{1j}) and the student-level random effects (v_{0ij}, v_{1ij}) are independent normally distributed bivariate variables, and the residual term e_{tij} is also normally distributed, independent of the random effects. With fixed measurement occasions, the effect of time can be modeled by means of time dummies (Snijders & Bosker, 1999, chap. 12).
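The claim that the gain score model is the special case β_1 = 1, Var(u_{1j}) = 0 of Equation 1 can be checked numerically. The sketch below (illustrative parameter values, not the Lombardy data) simulates under that restriction and recovers β_0 and Var(u_{0j}) from the gain scores with simple moment estimators.

```python
import numpy as np

rng = np.random.default_rng(1)
J, n_j = 1_000, 30                     # classes and students per class
beta0, tau2, sig2 = 0.4, 0.25, 0.5     # intercept, Var(u_0j), Var(e_ij)

u0 = rng.normal(0.0, np.sqrt(tau2), J)            # class random intercepts
x0 = rng.normal(0.0, 1.0, (J, n_j))               # intake achievement
e = rng.normal(0.0, np.sqrt(sig2), (J, n_j))
x1 = beta0 + 1.0 * x0 + u0[:, None] + e           # Equation 1 with beta_1 = 1, Var(u_1j) = 0

D = x1 - x0                                       # gain scores, as in Equation 2
grand_mean = D.mean()                             # moment estimate of beta_0
between = D.mean(axis=1).var(ddof=1) - sig2 / n_j # moment estimate of Var(u_0j)
```

The intake level cancels exactly in the differences, so the gain scores follow the random intercept model of Equation 2.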
When students change class across years, the data no longer present a nested structure, as observations on the same student do not necessarily refer to the same class. Two examples of multilevel models for cross-classified data in the educational context are the model proposed by Raudenbush and Bryk (2002, chap. 12) and the Tennessee Value Added Assessment System (TVAAS; Ballou, Sanders, & Wright, 2004).

Measurement Errors in Multilevel Models
With independent observations, various methods to adjust for measurement error have been proposed in the literature for linear, generalized linear, and nonlinear models; Fuller (1987) and Carroll et al. (2006) are the two main references on the subject. The literature distinguishes between structural modeling, where specific assumptions are made about the distribution of the unobserved variables, and functional modeling, in which nothing is assumed about the unobserved variables.
Structural methods can be applied to virtually any situation, including multilevel models. Aside from computational aspects, which may be challenging, a great deal of care is required of the analyst, as distributional assumptions about unobserved variables cannot be tested empirically. Longford (1993) proposes a structural multilevel model where the covariates measured with error do not have random coefficients. Two other examples of applications using multilevel models are given by Browne, Goldstein, Woodhouse, and Yang (2001) and Goldstein, Kounali, and Robinson (2008), who assumed a Gaussian distribution for the unobserved predictor and then applied a Bayesian approach. Both articles performed a sensitivity analysis of the effect of measurement error, avoiding the estimation of the measurement error variance.
Functional methods may be difficult to apply, as they need to be tailored to the specific model under analysis. Several functional methods have been proposed for the case where the measurement error affects only covariates entering the fixed part of the model, that is, random intercept models. For instance, Woodhouse, Yang, Goldstein, and Rasbash (1996) applied simple methods of moments for measurement error adjustment in value-added models. Their methods can be seen as a particular instance of the corrected score methodology (see, e.g., Carroll et al., 2006, §7.4), which has been studied more formally for random intercept models by Zhong, Fung, and Wei (2002). Another approach is that of Buonaccorsi, Demidenko, and Tosteson (2000), which leads to a regression calibration method. Finally, Wang, Lin, Gutierrez, and Carroll (1998) propose an application of the SIMEX method (Cook & Stefanski, 1995) to random intercept models.
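As a concrete illustration of the regression calibration idea (a generic sketch for simple regression with known error variance, not the estimator of Buonaccorsi et al.), the error-prone measure is replaced by the best linear predictor of the true covariate before fitting:

```python
import numpy as np

rng = np.random.default_rng(2)
n, beta, s2_u = 20_000, 1.0, 0.5       # s2_u assumed known from the measurement process

x0 = rng.normal(0.0, 1.0, n)                       # latent covariate
y = beta * x0 + rng.normal(0.0, 0.5, n)
y0 = x0 + rng.normal(0.0, np.sqrt(s2_u), n)        # error-prone measure

# Regression calibration: replace y0 with the best linear predictor E[x0 | y0].
lam = (y0.var(ddof=1) - s2_u) / y0.var(ddof=1)     # estimated reliability
x0_hat = y0.mean() + lam * (y0 - y0.mean())

beta_naive = np.polyfit(y0, y, 1)[0]               # attenuated slope
beta_rc = np.polyfit(x0_hat, y, 1)[0]              # calibrated slope, approx. beta
```

Regressing on the calibrated values rescales the naive slope by 1/λ and thereby removes the attenuation, at least in this linear homoscedastic setting.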

The Model
The model proposed here can be applied to any situation where there are T + 1 longitudinal observations (T ≥ 1) for students nested within the same grouping level.
The longitudinal model in Equation 3 explicitly models the achievement in the first year of data collection. Since students' achievement status results not only from the experiences students had in particular classrooms during the year of testing but also from previous experiences (Rowan, Correnti, & Miller, 2002), we prefer not to model the initial knowledge but rather to condition on this level. The result is an extension of Equation 1 to multiple occasions, where only the growth developed in the period considered is explicitly modeled. Let t be the index for time, and let the initial achievement measured at time t = t_0 be denoted by x_0. In order to achieve greater flexibility, following Snijders and Bosker (1999, §12.1), we represent each occasion of assessment, other than the first one, by an indicator variable, with a random coefficient associated with each time dummy. Here, we make the assumption that the effect of the initial achievement is the same on each subsequent achievement. This seems a sensible assumption when the baseline level corresponds to the beginning of an educational cycle, and it will be further elaborated in the empirical application. The model is then given by

x_{tij} = ∑_{s=1}^{T} β_s h_{st} + β_0 x_{0ij} + γᵀ z_{tij} + ∑_{s=1}^{T} b_{sj} h_{st} + b_{0j} x_{0ij} + e_{tij},   (4)

where j = 1, …, J; i = 1, …, n_j; t = 1, …, T; J is the number of groups; and n_j is the number of units within group j. The time dummies are defined as

h_{st} = 1 if s = t, and h_{st} = 0 otherwise, s = 1, …, T.   (5)

The vector z_{tij} includes the covariates measured without error, β_0, …, β_T are fixed coefficients, and γ is a vector of fixed coefficients associated with z_{tij}. The random effects follow the usual Gaussian assumption, namely,

b_j = (b_{0j}, b_{1j}, …, b_{Tj})ᵀ ∼ N(0, D).   (6)

Furthermore, the residual error terms are assumed to be correlated within subjects, with variance changing over time, and independent of b_j. The variance matrix of the random effects is D = D(c_b), while the variance matrix of the residual errors is Σ = Σ(c_e), with c_b and c_e suitable vectors of parameters.
To sum up, Equation 4 defines a random intercept and slope model, with random intercepts varying over time.
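For concreteness, this specification can be simulated as follows (a sketch with T = 2 and illustrative parameter values; the covariates z_tij are omitted). The random vector (b_0j, b_1j, b_2j) collects the class random slope on the baseline and the two time-specific random intercepts.

```python
import numpy as np

rng = np.random.default_rng(3)
J, n_j, T = 77, 15, 2                    # classes, students per class, occasions after baseline
beta_t = np.array([0.9, 1.4])            # fixed effects of the two time dummies (illustrative)
beta_0 = 0.7                             # common effect of the baseline achievement
D = np.array([[0.04, 0.01, 0.01],        # var-cov of (b_0j, b_1j, b_2j)
              [0.01, 0.09, 0.05],
              [0.01, 0.05, 0.12]])
Sigma = np.array([[0.50, 0.20],          # within-student residual covariance over time
                  [0.20, 0.60]])

b = rng.multivariate_normal(np.zeros(3), D, size=J)          # class random effects
x0 = rng.normal(0.0, 1.0, (J, n_j))                          # latent baseline achievement
e = rng.multivariate_normal(np.zeros(T), Sigma, (J, n_j))    # correlated residuals

# x_tij = beta_t + beta_0 x_0ij + b_tj + b_0j x_0ij + e_tij,  t = 1, 2
x = (beta_t[None, None, :]
     + (beta_0 + b[:, 0])[:, None, None] * x0[:, :, None]
     + b[:, None, 1:] + e)
```

Each class contributes its own intercept at every occasion and its own slope on the (latent) intake level, which is the structure the adjustment method must handle.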
The latent values x_{0ij} and x_{tij} are observed indirectly, subject to measurement error, as y_{0ij} and y_{tij}. We assume a normally distributed additive measurement error, that is,

y_{0ij} = x_{0ij} + u_{0ij},  u_{0ij} ∼ N(0, σ_0²),
y_{tij} = x_{tij} + u_{tij},  u_{tij} ∼ N(0, σ_t²),  t = 1, …, T,   (7)

where the true variables are assumed independent of the measurement errors and measurement errors at different times are uncorrelated. The latter hypothesis is supported by the assumption of local independence usually made in Rasch models and other item response theory models (Baker & Kim, 2004; Molenaar, 1995b). Note that

Cov(y_{tij}, y_{t′ij}) = Cov(x_{tij}, x_{t′ij}) + Cov(u_{tij}, u_{t′ij}),

which, since the correlation between the true scores cannot exceed 1 in absolute value, imposes a constraint on Cov(u_{tij}, u_{t′ij}) for given values of the remaining two covariances:

Cov(y_{tij}, y_{t′ij}) − [(Var(y_{tij}) − σ_t²)(Var(y_{t′ij}) − σ_{t′}²)]^{1/2} ≤ Cov(u_{tij}, u_{t′ij}) ≤ Cov(y_{tij}, y_{t′ij}) + [(Var(y_{tij}) − σ_t²)(Var(y_{t′ij}) − σ_{t′}²)]^{1/2}.   (8)

Goldstein et al. (2008) provide an example where the lower limit of Equation 8 is greater than zero. However, this is a consequence of the particular choices made by these authors about the reliability of the measures and the correlation between the measures. In many applications, the values of the quantities involved would not lead to any severe restriction, and this will be further illustrated in the empirical example. An important case is that of achievement estimated by the Rasch model. In particular, when the estimates are based on maximum likelihood (Hoijtink & Boomsma, 1995), the error term is asymptotically unbiased and normally distributed, thus supporting the assumption in Equation 7. In this case, there is the additional proviso that the measurement error variances are not likely to be the same for each subject, so the assumption of homoscedasticity is merely an approximation. For the moment, however, we assume the model in Equation 7, as it will be shown later that the heteroscedasticity typical of Rasch model estimates can be addressed in a simple manner.

Battauz et al.
No matter which method is used to estimate the achievement level, it is essential that the estimation procedure also provide an estimate of the measurement error variances, σ_0² and σ_t², t = 1, …, T, as methods correcting for measurement error generally require additional information on the measurement process (Carroll et al., 2006, chap. 2). In the following, the measurement error variances will be treated as known quantities.

Measurement Error Adjustment
The variable y_{tij} is said to be a surrogate response if its distribution depends only on the true variable (see Carroll et al., 2006, §15.4.1), that is,

f(y | x, x_0, b, Z, y_0) = f(y | x),   (9)

where y denotes the vector (y_{tij})_{t,i,j}, and similarly x denotes (x_{tij})_{t,i,j}. When the estimated achievement y is the output of an item response theory model, we can safely assume it is a surrogate response, in view of the aforementioned local independence assumption and the absence of differential item functioning (Molenaar, 1995b). With a surrogate response, the introduction of measurement error in the response variable simply leads to a decomposition of the total Level 1 errors of the model into a measurement error plus a residual part, which will be denoted as

ε_{tij} = e_{tij} + u_{tij},

with each residual error defined in Equations 4 and 7. The challenging part remains the adjustment for measurement error in the predictor x_{0ij} of Equation 4. The structural approach for covariate error adjustment requires the specification of a distribution for the latent predictor, and the Gaussian distribution is a typical choice:

x_{0ij} ∼ N(μ_x, σ_x²).   (10)

The method used here to obtain maximum likelihood estimates is the EM algorithm (Dempster, Laird, & Rubin, 1977), justified by the latent variable structure of the model when the additional assumption in Equation 10 is made (e.g., McLachlan & Krishnan, 1997). The missing data for the algorithm are then given by the random effects b and the latent covariate x_0.
Letting ψ = (β_0, …, β_T, γᵀ, c_bᵀ, c_eᵀ, μ_x, σ_x²)ᵀ denote the vector of unknown parameters, the joint density of the complete data can be factorized as

f(y, y_0, b, x_0 | Z) = f(y | y_0, b, x_0, Z) f(y_0 | b, x_0, Z) f(b, x_0 | Z).   (11)

In the following, we consider each term of Equation 11 separately. The nondifferential measurement error hypothesis (Carroll et al., 2006, §2.5) is valid when the response variable y and the measured predictor y_0 are conditionally independent, given the true predictor x_0. This appears plausible for the setting under study, and under this hypothesis the first term in Equation 11 becomes

f(y | y_0, b, x_0, Z) = f(y | b, x_0, Z),

which, in view of what was stated above, is Gaussian with independent components across subjects. Note that the independence between the components of y follows from conditioning on the random effects and the latent covariates. As in the case of y, it also seems sensible to assume that y_0 is a surrogate, giving

f(y_0 | b, x_0, Z) = f(y_0 | x_0),

where each component is independent Gaussian with mean x_{0ij} and variance σ_0². Since the measurement error variance is considered known, this term does not contain any component of ψ and can be ignored in the complete-data log likelihood. Finally, we assume that the true covariate x_0 is independent of the random effects b and that both are independent of the covariates measured without error Z; hence

f(b, x_0 | Z) = f(b) f(x_0).

The resulting complete-data log likelihood ℓ_c(ψ), corresponding to the density in Equation 11, is given in the appendix. The computation of the expected value of ℓ_c(ψ) given the observed data (y, y_0, Z), needed in the E-step of the algorithm, requires the determination of the joint distribution of the missing data, (b, x_0), given the observed data. However, the product between the random slopes and the latent covariate in Equation 4 makes the marginal distribution of the response variable non-Gaussian. Consequently, the distribution of the missing data given the observed data cannot be evaluated analytically.
A tractable approach makes use of Gibbs sampling (e.g., Robert & Casella, 1999) to integrate out both b and x_0 from the joint distribution of the data and the latent variables. In fact, under the assumptions made so far, both the distribution of the latent covariate given the data and the random effects, and the distribution of the random effects given the data and the latent predictor, are Gaussian. This makes Gibbs sampling an appealing choice. The resulting algorithm is a particular instance of a Monte Carlo expectation maximization (MCEM) algorithm (Wei & Tanner, 1990). To speed up convergence, we improved the accuracy of the Monte Carlo approximations using a quasi-Monte Carlo EM, as used, for example, by Jank (2005). In particular, we followed the method proposed by Liao (1998) for incorporating quasi-Monte Carlo draws within the Gibbs sampling; see also Tribble and Owen (2008) for a theoretical justification of the method. The computation of the M-step of the algorithm can be carried out in closed form, taking advantage of the normality of each term in Equation 11. The appendix provides the details of the algorithm.

The EM algorithm does not automatically produce an estimate of the variance matrix of the maximum likelihood estimates. Here we use the supplemented EM (SEM) algorithm proposed by Meng and Rubin (1991). The procedure also provides a diagnostic tool for the implementation of the algorithm: the variance matrix that results from the procedure is not numerically constrained to be symmetric, so Meng and Rubin suggest using the symmetry of the matrix to check for programming errors and numerical imprecision.
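The structure of the sampler can be conveyed with a stripped-down sketch: one grouping level, a single random intercept, a scalar latent covariate, and all variances treated as known. This illustrates the two Gaussian full conditionals only; it is not the article's implementation, which also handles the random slopes, the M-step updates, and the quasi-Monte Carlo refinement.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy model: y_ij = beta*x_ij + b_j + e_ij,  y0_ij = x_ij + u_ij,
# x_ij ~ N(mu_x, s2_x),  b_j ~ N(0, tau2).  Variances known for the sketch.
beta, s2_e, s2_0, mu_x, s2_x, tau2 = 0.8, 0.4, 0.3, 0.0, 1.0, 0.2
J, n_j = 50, 20

x_true = rng.normal(mu_x, np.sqrt(s2_x), (J, n_j))
b_true = rng.normal(0.0, np.sqrt(tau2), J)
y = beta * x_true + b_true[:, None] + rng.normal(0, np.sqrt(s2_e), (J, n_j))
y0 = x_true + rng.normal(0, np.sqrt(s2_0), (J, n_j))

def gibbs_sweep(x, b):
    """One Gibbs sweep; both full conditionals are Gaussian, which is what
    makes the sampler attractive in this setting."""
    # x | b, y, y0: precision-weighted combination of three Gaussian sources
    prec = beta**2 / s2_e + 1.0 / s2_0 + 1.0 / s2_x
    mean = (beta * (y - b[:, None]) / s2_e + y0 / s2_0 + mu_x / s2_x) / prec
    x = mean + rng.normal(0, np.sqrt(1.0 / prec), x.shape)
    # b | x, y: Gaussian with shrinkage toward zero
    prec_b = n_j / s2_e + 1.0 / tau2
    mean_b = (y - beta * x).sum(axis=1) / s2_e / prec_b
    b = mean_b + rng.normal(0, np.sqrt(1.0 / prec_b), J)
    return x, b

x, b = y0.copy(), np.zeros(J)
draws = []
for it in range(600):
    x, b = gibbs_sweep(x, b)
    if it >= 100:                     # discard burn-in, as in the article
        draws.append(x)
post_mean_x = np.mean(draws, axis=0)  # Monte Carlo E[x | data] for the E-step
```

In the full algorithm, averages of such draws (and of their squares and cross-products) replace the intractable conditional expectations in the E-step.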

Main Results
In this section, the methods presented in Section 3 will be applied to the data previously introduced, focusing on the case of mathematics.
First of all, standard exploratory analyses conducted on the model fitted with the unadjusted predictor did not highlight any particular problem with the model specification in Equation 4, in particular as concerns the assumptions of normality of residuals and linearity.
As anticipated, both the response variables y and the predictor y_0 were estimated by means of the Rasch model. It should be noted that the standard errors provided as output of the Rasch method are a function of the estimated achievements (Hoijtink & Boomsma, 1995): students with very good or very poor performances present higher measurement error variance. A full treatment of this point would render the structural approach very complicated and impractical, and for this reason a simplified solution was adopted here. The measurement error variances were taken as constant, obtained as sample means of the squared standard errors from the Rasch model. As the estimated measures provided by the Rasch method exhibit a limited amount of variation for these data, taking constant measurement error variances was likely to represent an acceptable approximation. The validity of this approximation was further confirmed by the simulation studies reported in Section 5. By taking the measurement error variances as constant, the assumption in Equation 10 was fulfilled, and hence we could proceed with estimation of the parameters of the structural model.
We verified that the constraint in Equation 8 does not rule out the assumption of uncorrelated measurement errors. In fact, for t = 0 and t′ = 1, the inequality is

−0.35 ≤ Cov(u_{0ij}, u_{1ij}) ≤ 1.73,

using sample moments of the observed measures and the measurement error variances. Similar values were obtained for the other times.
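These bounds are simple to compute from the sample moments. The sketch below uses the published error variances but hypothetical values for the observed-score moments, chosen only to reproduce the order of magnitude of the interval above:

```python
import numpy as np

def cov_u_bounds(var_y0, var_y1, cov_y01, s2_0, s2_1):
    """Range of Cov(u_0, u_1) compatible with |Corr(x_0, x_1)| <= 1,
    given observed second moments and known measurement error variances."""
    half = np.sqrt((var_y0 - s2_0) * (var_y1 - s2_1))
    return cov_y01 - half, cov_y01 + half

# Observed-score moments below are illustrative, not the Lombardy values.
lo, hi = cov_u_bounds(var_y0=1.60, var_y1=1.65, cov_y01=0.69,
                      s2_0=0.582, s2_1=0.569)
# Zero lies inside (lo, hi): uncorrelated measurement errors are not ruled out.
```

Whenever zero falls inside the interval, the local independence assumption behind Equation 7 is compatible with the data.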
The EM algorithm used for the estimation of the structural model was programmed in the R statistical environment (R Development Core Team, 2009), with some parts implemented as C routines to speed up the estimation procedure. The deterministic sequence used was the good point set sequence described in Liao (1998). The Gibbs sampling chain was run for 1,000 iterations, discarding the first 500 as burn-in. We verified that using 2,000 iterations (1,000 as burn-in) did not lead to any significant change in the parameter estimates. Standard errors were obtained by applying the SEM algorithm. To check for programming errors or numerical imprecision, we verified the symmetry of the estimated variance matrix of the estimates, which was indeed quite symmetric.
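The symmetry diagnostic amounts to a one-line check on the estimated variance matrix; a minimal sketch (the matrix below is hypothetical):

```python
import numpy as np

def check_sem_matrix(V):
    """Diagnostic in the spirit of Meng and Rubin (1991): the SEM variance
    matrix is not numerically constrained to be symmetric, so a large
    asymmetry flags programming errors or numerical imprecision."""
    asym = np.max(np.abs(V - V.T))     # worst-case asymmetry
    return asym, (V + V.T) / 2.0       # report it, return the symmetrized matrix

V = np.array([[0.50, 0.120],
              [0.121, 0.30]])          # hypothetical SEM output, slightly asymmetric
asym, V_sym = check_sem_matrix(V)
```

A small `asym` relative to the size of the entries supports the implementation; reporting the symmetrized matrix is then a common convention.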
Parameter estimates corrected for both the response and the covariate errors with the structural method are reported in Table 1 (ML), together with the unadjusted estimates (Naive). They were obtained using σ̂_0² = .582, σ̂_1² = .569, and σ̂_2² = .482.
We verified that the class mean of the baseline level is not significant (p = .49); hence, it is not included in the results. Following the suggestion of Raudenbush and Bryk (2002, p. 138), grand-mean centering of the baseline level was employed. Grand-mean centering is also useful to facilitate the interpretation of the results (Snijders & Bosker, 1999, p. 70). As a consequence, the achievement values in the subsequent periods were also transformed by subtracting the mean of the baseline level.
Adjusting for measurement error leads to an increase in the coefficients of the two time-indicator variables. The coefficient of the baseline achievement is strongly corrected upward. Most of the socioeconomic variables have a positive effect on the achievement level, but adjustment for measurement error only slightly changes their estimated coefficients. The same holds for the variances of the random coefficients and the covariances between random effects. Finally, the elements of the Level 1 variance matrix are corrected downward. Figure 1 shows a comparison of the predicted values of the random effects obtained with the naive and the adjusted model. The main relative adjustment concerns the random slope of the initial achievement, while the random intercepts present very little variation. Figure 2 displays the intake achievement versus the achievement at Times 1 and 2 for each class. The plots clearly show the heterogeneity existing among classes, which is of primary importance in this kind of study.

Comparison With the SIMEX Method
For comparison, we also fitted the model using the SIMEX functional method. SIMEX estimates are obtained by adding additional measurement error to the data in a resampling-like stage, establishing a trend of measurement error-induced bias versus the variance of the added measurement error, and extrapolating this trend back to the case of no measurement error. The technique was proposed by Cook and Stefanski (1995); a full treatment is given in Carroll et al. (2006).
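The mechanics of SIMEX can be sketched for a simple regression with one error-prone covariate (illustrative parameter values; the article applies the same idea to the full multilevel model):

```python
import numpy as np

rng = np.random.default_rng(5)
n, beta, s2_u = 5_000, 1.0, 0.5

x = rng.normal(0.0, 1.0, n)
y = beta * x + rng.normal(0.0, 0.3, n)
y0 = x + rng.normal(0.0, np.sqrt(s2_u), n)        # error-prone covariate

lambdas = np.linspace(0.0, 2.0, 8)                # eight equispaced values in [0, 2]
B = 200                                           # simulated data sets per lambda
est = []
for lam in lambdas:
    # add extra error with variance lam * s2_u, refit, and average over B replicates
    slopes = [np.polyfit(y0 + rng.normal(0, np.sqrt(lam * s2_u), n), y, 1)[0]
              for _ in range(B)]
    est.append(np.mean(slopes))

# Quadratic extrapolation of the bias trend back to lambda = -1 (no measurement error)
coef = np.polyfit(lambdas, est, 2)
beta_simex = np.polyval(coef, -1.0)
beta_naive = est[0]
```

The extrapolated estimate moves most of the way back toward the true slope, but the residual gap illustrates why the extrapolation step makes SIMEX an approximate method.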
In our application, we generated 500 simulated data sets at each level of additional measurement error, determined by an extra parameter λ. We took eight equispaced values in the interval [0, 2] for λ and used a quadratic extrapolant function. Standard errors were computed using the simulation extrapolation variance estimation (Carroll et al., 2006, Appendix B.4.1). The results for SIMEX are also reported in Table 1. It seems that the adjustment of the naive estimates goes in the same direction as that of the structural method, in particular for those parameters for which the adjustment is more substantial. For some parameters, such as the coefficient of the baseline level, the adjusted estimates provided by the two methods are very close. For some other parameters, such as the coefficients of the covariates observed without error, the SIMEX-based adjustment seems larger than that of the structural approach. Such a discrepancy is somewhat odd and hence is elaborated in the following.

SIMEX is generally fast to implement and a useful method for obtaining adjusted estimates in a simple way, with the additional advantage of being free of untestable distributional assumptions about the unobservable predictor. However, it is an approximate method in practice, owing to the uncertainty associated with the extrapolation step (Carroll et al., 2006, p. 108). In our application, this is particularly true for some of the parameters. Figure 3 represents the parameter estimates of each simulated data set and the extrapolant function obtained with the quadratic extrapolant for β_0 and the covariance c_{b10} between b_{1j} and b_{0j}. It is apparent that the extrapolation for the latter parameter is not without difficulties. Furthermore, the results obtained with the rational linear extrapolant function often used for SIMEX (Carroll et al., 2006, §5.3.2) were even worse, presenting very erratic behavior. A further important drawback for multilevel applications is that the SIMEX method provides neither an estimate of the random effects nor their standard errors, and it does not seem straightforward to develop a method to obtain them. For many educational applications, this is a major limitation. For these reasons, we prefer the estimates obtained with the structural method.

Alternative Models
Some other models have been fitted and compared with the model presented in Section 4.1, which will be referred to as Model 1. The first alternative has uncorrelated residuals within subjects and constant variance across time (Model 2). Two further models without a random slope for the baseline level have been fitted (Models 3 and 4). Both these models have an unconstrained variance matrix for the residuals, and the second one allows a different effect of the baseline level on the achievements in the two subsequent periods.

[Figure 2. Scatter plots of y_1 (solid circles) and y_2 (empty circles) versus y_0 for each class. The lines represent the fitted values adjusted for measurement error for t = 1 (solid) and t = 2 (dashed), setting the other covariates equal to their sample modal value.]

The log likelihood values evaluated at the maximum point are reported in Table 2, together with Akaike information criterion (AIC) values. As the MCEM algorithm does not automatically provide the evaluation of the log likelihood function at the maximum point, the log likelihood has been computed by integrating out the missing data from the complete-data log likelihood. For estimating Models 3 and 4, a simpler EM algorithm has been implemented, as the response variable has a Gaussian distribution and the expectation step can be performed in closed form. The likelihood ratio test strongly indicates that Model 1 is preferable to Model 2 (p < .001). The comparison between Model 1 and Model 3, performed using the likelihood ratio test, indicates that the variance of the random slope is significantly greater than zero (p = .020). Note that here the p value has been computed using an asymptotic distribution that is a mixture of chi-squared distributions with 2 and 3 degrees of freedom, with equal weights .5 (Verbeke & Molenberghs, 2000, §6.3). Finally, the likelihood ratio test for comparing Models 3 and 4 indicates that the difference between the coefficients of the initial achievement at the two occasions is significant. The AIC value is slightly smaller for Model 1 than for Model 4, though this has to be taken with some care, as a variance component is zero in Model 4.
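The mixture p value used for the variance component test is straightforward to compute; a minimal sketch (the statistic w = 9.0 is hypothetical, not the value behind Table 2):

```python
from scipy.stats import chi2

def pvalue_chibar(w, df_small=2, df_large=3):
    """p value for a likelihood ratio statistic w under an equal-weight
    mixture of chi-squared distributions, as used when testing whether a
    random-slope variance (on the boundary of the parameter space) is zero."""
    return 0.5 * chi2.sf(w, df_small) + 0.5 * chi2.sf(w, df_large)

p = pvalue_chibar(9.0)   # hypothetical LRT statistic
```

Using the naive chi-squared reference with the larger degrees of freedom would be conservative; the mixture p value always lies between the two component tail probabilities.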
In conclusion, we end up with two models, Models 1 and 4, that provide a similar fit. However, if the research aims to evaluate the different random slope of each class, Model 1 should be preferred. A natural extension of these models would include a different effect of the baseline level as well as different random slopes. On the basis of the naive and the adjusted estimates, one can safely say that such random slopes would have quite small variances and a very high correlation. This makes the estimation of this further model quite challenging, at least for the data at hand.

Table 3 reports the estimates of main interest for Models 2, 3, and 4. It is apparent that not allowing for correlated residuals has a very strong effect, as the parameter estimates of Model 2 are quite different from those of all the other models. Models 3 and 4 present similar estimates, with the exception of the baseline-level effect, which in Model 4 is not constrained to be constant over time. The coefficient of the baseline achievement of Model 3 lies halfway between the two coefficients of Model 4 and is nearly equal to that of Model 1. The fact that the effect of the baseline achievement appears larger at Time 2 than at Time 1 in Model 4 has no straightforward explanation, and it may be due to the lack of other time-varying covariates. This is a further reason for choosing Model 1 as a good compromise between parsimony and model fit.

Simulation Studies
To evaluate the performance of the proposed algorithm, we conducted a simulation study with settings similar to those of the application. In particular, we simulated 200 data sets with 70 classes of 15 students, each student presenting 2 observations over time plus the baseline level, leading to a total sample size of 2,100 observations. The quasi-MCEM algorithm used 1,000 iterations of the Gibbs sampling chain, discarding the first 500 as burn-in. For the sake of simplicity, we did not consider any covariate measured without error; hence, the parameter values used in the simulations were obtained by estimating the original model without those covariates. All the plots reported in this section display the distribution of parameter estimates minus the true values. The elements of the vectors c_b and c_e are denoted using the subscript k for the variance of the kth component, and a similar notation is used for covariances. The first simulation study was obtained by generating the data under the assumptions of Section 3, the same as those used to define the MCEM algorithm; the results are represented in Figure 4. The first panel shows the serious bias of the naive estimator, which disappears for the infeasible estimator that uses the true baseline level, and is nearly nil for the adjusted estimator.
The second simulation study was performed to assess the effect of heteroscedastic measurement error. In particular, we generated normally distributed baseline achievements x_0 and took the standard error of measurement to be a quadratic function of x_0, with coefficients a, b, and c obtained from the original data through a regression model based on y_0ij. In all the simulated samples, the value of s²_0 used in the MCEM algorithm was given by the sample mean of the simulated s²_0ij, thus mimicking to a large extent what was done in the application. The other elements of this second design were equal to those of the first.
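The heteroscedastic design and its homoscedastic approximation can be sketched as follows; the coefficients a, b, and c below are hypothetical stand-ins, not the values estimated from the original data:

```python
import numpy as np

rng = np.random.default_rng(1)
a, b, c = 0.35, -0.05, 0.02    # hypothetical quadratic coefficients
x0 = rng.normal(size=2100)     # simulated baseline achievements

# standard error of measurement as a quadratic function of the true level
se = a + b * x0 + c * x0**2
y0 = x0 + rng.normal(size=x0.size) * se   # heteroscedastic observed baseline

# homoscedastic approximation used in the algorithm:
# replace the individual error variances by their sample mean
s2_0 = np.mean(se**2)
print(s2_0)
```

The simulation results suggest this single-variance approximation has little effect on the adjusted estimator.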
The results of the simulation study for the adjusted model are represented in Figure 5. The naive estimator and the estimator based on the true baseline level are not reported, as their results were very similar to those of the previous study. It is apparent that the homoscedastic approximation has only a small effect on the adjusted estimator, whose performance is similar to that under the previous design.

Discussion
This article proposes a model specification for value-added estimation that requires fitting a multilevel model in which both the response and a covariate with a random slope are measured with error. The problem is addressed following a structural approach, which can be rather effective in providing all the quantities of interest for multilevel data analysis, including prediction of the random effects. Indeed, the inferential results of Section 4.1 are the main output of the motivating example, and the important point is that they may look somewhat familiar to any investigator who routinely uses multilevel models. An alternative structural approach is given by joint modeling of the item responses and the value-added regression, as done in Fox and Glas (2003).

Battauz et al.
In the motivating example, latent achievements were estimated by the Rasch model. This is an important case but not the only possibility: structural modeling of measurement error can also be applied when other methods are used to estimate the latent abilities. The simulation study shows that the estimation procedure is not affected by taking the measurement error variance to be constant. At any rate, note that measurement error variances are heterogeneous only when all students are tested with the same items. When computerized adaptive testing is used to estimate the achievement (see, e.g., Meijer & Sijtsma, 1999), items can be selected so as to attain the same measurement error variance for all subjects, thus leading to an error structure exactly of the form of Equation 7.

The longitudinal model proposed here uses the initial level as a predictor for all the subsequent years. A possible alternative is to include the achievement at Time 1 as a predictor of the achievement at Time 2, and so on. However, including both the achievement at Time 1 and the baseline level introduces multicollinearity into the model and alters the interpretation of the estimated effects; hence it may be advisable to choose between these two predictors. Although the best predictor of the achievement is likely to be the level at the previous measurement point rather than the one measured at baseline, there are cases where the use of the initial achievement seems the right choice. This occurs, for example, when the analysis starts at the beginning of a schooling period, as in the motivating application, or when successive repeated assessments are made within a short time. Furthermore, it might be argued that the principal aim of value-added models is not the prediction of future achievement levels, but the estimation of the contribution of teachers to student development.
In that respect, it is important to adjust for the level that students present when they first enter the class.
We implemented a particular kind of MCEM algorithm, sped up by employing Quasi-Monte Carlo draws. As usual for numerical algorithms, improvements are always possible; for example, one could increase the number of Quasi-Monte Carlo points at each iteration (see Jank, 2005, and the references therein). Although an improved algorithm might give better results, the simulation studies show that the current implementation provides satisfactory results, at least for the kind of data of interest here.
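As a sketch of why Quasi-Monte Carlo draws can speed up the E-step, the following compares plain Monte Carlo with a scrambled Sobol' sequence for approximating a Gaussian expectation. The integrand is purely illustrative; it merely stands in for the smooth conditional expectations arising in the algorithm:

```python
import numpy as np
from scipy.stats import norm, qmc

rng = np.random.default_rng(2)
d, n = 2, 1024   # dimension and number of draws (Sobol' prefers powers of 2)

def integrand(z):
    # smooth function whose exact Gaussian expectation is known:
    # E[exp(z1 + ... + zd)] = exp(d/2) for independent standard normals
    return np.exp(z.sum(axis=1))

exact = np.exp(d / 2)

# plain Monte Carlo
z_mc = rng.normal(size=(n, d))
est_mc = integrand(z_mc).mean()

# Quasi-Monte Carlo: Sobol' points mapped to Gaussians by the inverse CDF
sobol = qmc.Sobol(d=d, scramble=True, seed=2)
z_qmc = norm.ppf(sobol.random(n))
est_qmc = integrand(z_qmc).mean()

print(exact, est_mc, est_qmc)
```

For smooth integrands of this kind, the Quasi-Monte Carlo error typically decays much faster than the O(n^{-1/2}) Monte Carlo rate, so fewer points per iteration are needed for a given accuracy.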
The literature on measurement error is often concerned with the specification of the distribution of the latent variable. In fact, there are examples where the results of structural modeling depend on this distribution, although this does not seem to be the case here.

The model proposed can be further extended by including school effects or further covariates affected by measurement error. For those applications where such extensions appear appropriate, the methodology presented here could be extended in a straightforward way.

Details for the MCEM Algorithm
As explained in Section 3, the joint distribution f(y, y_0, x_0, b | Z, θ) given in Equation 11 can be rewritten in a simplified form that is separable, each term depending on a different set of parameters. By setting w_tij = (d_1t, d_2t, ..., d_Tt, x_0ij)^T, it follows from Equations 4 and 9 that the model for the observed response y given the latent variables x_0 and b can be written with β = (β_1, ..., β_T, β_0)^T as the vector of fixed coefficients associated with w_tij. Letting N = Σ_j n_j denote the total number of Level 1 units, the complete-data log likelihood follows from Equation A1.

For the computation of the conditional distributions required in the expectation step of the EM algorithm, we recall some useful results. Let Y = (Y_1, ..., Y_d)^T be a vector of random variables that follows a multivariate Gaussian distribution, namely Y ~ N_d(μ, Σ). Consider the block partition Y^T = (Y_A^T, Y_B^T) and the corresponding partitions of μ and Σ. Using standard results on matrix algebra and on the multivariate Gaussian distribution, it follows that Y_A | Y_B ~ N(μ_{A|B}, Σ_{A|B}), with

  μ_{A|B} = μ_A + Σ_{AB} Σ_{BB}^{-1} (Y_B − μ_B),
  Σ_{A|B} = Σ_{AA} − Σ_{AB} Σ_{BB}^{-1} Σ_{BA}.

For the jth group, the model in Equation 15 can be conveniently expressed as

  Y_j = W_j β + Z_j b_j + e*_j,

where Y_j = (y_tij)_{t,i} is the vector of responses with length m_j = n_j × T, W_j = (w_tij)_{t,i} is a matrix of order m_j × p containing the regressors entering Equation A2, Z_j = (z_tij)_{t,i} is the matrix of order m_j × q associated with the random effects, e*_j = (e*_tij)_{t,i} is a vector of Level 1 residuals with length m_j, and x_0j = (x_0ij)_i. As the random effects are independent across groups, it is possible to sample from the distributions f(b_j | Y_j, W_j, Z_j, x_0j), j = 1, ..., J. Let W*_j be the matrix of regressors corresponding to W_j in which the elements x_0ij are replaced by the values generated from the Gibbs sampling chain at the previous step.
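The conditional Gaussian formulas used in the expectation step can be implemented directly. A minimal sketch, with purely illustrative numbers:

```python
import numpy as np

def gaussian_conditional(mu, Sigma, idx_a, idx_b, y_b):
    """Mean and covariance of Y_A | Y_B = y_b, for Y ~ N(mu, Sigma)."""
    mu_a, mu_b = mu[idx_a], mu[idx_b]
    S_aa = Sigma[np.ix_(idx_a, idx_a)]
    S_ab = Sigma[np.ix_(idx_a, idx_b)]
    S_bb = Sigma[np.ix_(idx_b, idx_b)]
    K = S_ab @ np.linalg.inv(S_bb)          # regression matrix Sigma_AB Sigma_BB^{-1}
    mu_cond = mu_a + K @ (y_b - mu_b)       # mu_A + K (y_B - mu_B)
    Sigma_cond = S_aa - K @ S_ab.T          # Sigma_AA - K Sigma_BA
    return mu_cond, Sigma_cond

# small illustration: condition the first component on the other two
mu = np.array([0.0, 1.0, 2.0])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
m, S = gaussian_conditional(mu, Sigma, [0], [1, 2], np.array([1.5, 2.5]))
print(m, S)
```

As expected, conditioning reduces the variance of Y_A, and the conditional mean shifts toward the observed components in proportion to the cross-covariances.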
The distribution of the random effects given the data and the latent covariate can then be obtained using Equations A5 and A6 (Robinson, 1991), where e*_tL denotes the vector of residuals for time t. At convergence of the algorithm, estimates of the random effects are given by the average of the retained draws, b̂ = L^{-1} Σ_l b_l, while the estimates of the posterior variances are given by L^{-1} Σ_l (b_l − b̂)(b_l − b̂)^T.
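The random-effects estimates at convergence are simple averages over the retained Gibbs draws. A sketch, with simulated draws standing in for the Gibbs output of one group:

```python
import numpy as np

rng = np.random.default_rng(3)
L, q = 500, 2                       # retained draws, random-effects dimension
b_draws = rng.normal(size=(L, q))   # stand-in for the Gibbs draws of b_j

# posterior mean: average of the draws
b_hat = b_draws.mean(axis=0)

# posterior variance: average outer product of the centered draws
dev = b_draws - b_hat
V_hat = dev.T @ dev / L
print(b_hat, V_hat)
```

The same averages, computed group by group, yield the predicted random effects and their posterior variances reported in the application.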