Scale linking in multidimensional bifactor item response models
The bifactor model provides a viable representation of the structure of items and their relationships in multidimensional item response models. The linking of different forms of test items that show bifactor structure poses a challenging problem. In this project, methods for linking multidimensional items with a bifactor structure are provided and evaluated.
A Bayesian approach to assessing differential item functioning
A Bayesian framework for assessing DIF is provided in this project. The complete posterior distribution of the difference in the parameters of interest in the reference and focal groups is obtained. A more detailed analysis of DIF than is possible with current procedures can be carried out by examining the complete posterior distribution of the parameters in the groups of interest.
Assessing invariance in structural equation models: A Bayesian approach
A general Bayesian framework for examining invariance of parameters in populations of interest in the context of structural equation models is provided. The framework encompasses continuous and discrete endogenous and exogenous variables. An MCMC procedure is employed to obtain the posterior distributions of pairwise differences of the parameters in the populations.
Using response time information to improve item parameter estimation in IRT models.
With the advent of computer-based and computer adaptive testing, information regarding the time taken for an examinee to respond to an item can be routinely collected. Such auxiliary information can be used to improve the estimation of item and ability parameters. The use of response time as auxiliary information for improving item and ability parameter estimation in polytomous response models is evaluated.
Determining standard errors of linking using a bootstrap resampling approach.
Determining standard errors when test forms are linked is critical in the measurement context. While standard errors using IRT can be obtained, the information function approach may not be appropriate when the tests are short or if the model does not fit the data well. Bootstrap sampling approaches may provide better estimates. The bootstrap and information function approaches for computing standard errors of linking are compared.
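As a rough illustration of the bootstrap idea, the sketch below estimates standard errors for mean-sigma linking coefficients. It rests on assumptions not stated in the project description: the linking method (mean-sigma), the function names, and the choice to resample common items rather than examinees, which avoids re-estimating item parameters in each replicate. It is a minimal sketch, not the procedure used in the project.

```python
import numpy as np

def mean_sigma(b_x, b_y):
    """Mean-sigma linking: slope A and intercept B that place
    form X difficulties on the form Y scale (b_y ~ A * b_x + B)."""
    a = np.std(b_y, ddof=1) / np.std(b_x, ddof=1)
    b = np.mean(b_y) - a * np.mean(b_x)
    return a, b

def bootstrap_linking_se(b_x, b_y, n_boot=2000, seed=0):
    """Bootstrap standard errors of (A, B), resampling common items
    with replacement (an illustrative simplification)."""
    b_x = np.asarray(b_x, dtype=float)
    b_y = np.asarray(b_y, dtype=float)
    if b_x.max() == b_x.min():
        raise ValueError("form X difficulties must not all be equal")
    rng = np.random.default_rng(seed)
    n = b_x.size
    draws = []
    while len(draws) < n_boot:
        idx = rng.integers(0, n, size=n)
        # Skip degenerate resamples in which every drawn item is identical.
        if b_x[idx].max() == b_x[idx].min():
            continue
        draws.append(mean_sigma(b_x[idx], b_y[idx]))
    return np.std(np.asarray(draws), axis=0, ddof=1)
```

Calling `bootstrap_linking_se(b_x, b_y)` with the common-item difficulty estimates from the two forms returns a two-element array holding the bootstrap standard errors of the slope and intercept.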
Comparison of item response theory linking procedures for vertical scaling of test forms.
The development of vertical scales is crucial for the assessment of growth in children. However, several methodological issues must be resolved before vertical scales can be developed. Scale linking is one such issue. Several methods for scale linking are compared.
Smoothing procedures for the tails of vertical scales.
A problem that faces the development of scales is their behavior at the extremes. Vertical scales, in general, behave well in the middle of the scale but show instability at the low and high ends of the scales. Methods for smoothing the scale so that the scale shows stability at the extremes are investigated.
Assessment of growth and prediction of future performance of students using vertical scales.
Given that the impetus for the development of vertical scales is to assess student growth, it is important to develop methods for projecting student performance level/growth at a future grade so that students at risk can be identified and remedial action taken. In this project, several methods for projecting student growth/proficiency level, including growth models as well as regression models that do not require a vertical scale, are compared using statewide assessment data.
The effect of multidimensionality on the classification of students into proficiency categories.
Unidimensional latent proficiency must be assumed in using common item response models to estimate examinee ability and classify examinees into proficiency categories. The consequences of violation of this assumption are examined in this project. Since it is impractical to provide and use a score on each dimension measured by the test to classify students, (a) a multidimensional model is fitted to the response data, and (b) a unitary score is derived by weighting the dimensions using the test characteristic curve. The effect on classification accuracy of using this score compared to fitting a unidimensional model is examined.
Assessing the dimensionality of a set of dichotomous and polytomous test items.
While several procedures have been developed for assessing the fit of item response models, very few directly address the issue of dimensionality. A direct method for assessing dimensionality that takes into account the nonlinearity in the item responses is developed in this project and compared with available methods.
Scoring and combining multiple choice and constructed response items in mixed format tests.
Composite scores based on differentially weighting examination sections are often used for assigning grades in large-scale achievement, licensure and certification testing programs. To the extent that the sections measure different constructs, different weighting schemes may change the relative standing of an examinee and alter the obtained grade. In this project different weighting schemes based on classical and IRT-based methods are compared and evaluated with respect to accuracy of classification of examinees into proficiency categories.
Evaluation of the feasibility of computer adaptive testing and multi-stage testing in low incidence fields.
The advantages of computer adaptive tests over traditional paper and pencil tests have been well documented. However, computer adaptive tests require large item banks, a requirement that cannot be met in low incidence fields. Multistage testing may provide a solution to this problem. In this project, multistage testing is compared with fully adaptive testing through simulation studies with respect to the number of items required, the accuracy of the estimate of an examinee’s ability, and the accuracy of classification into proficiency categories.
Goodness of fit statistics for polytomously scored items.
In fitting any model to data, the evaluation of the fit of the model to the data is critical. While several procedures exist for assessing the goodness of fit for dichotomous data, very little information is available for polytomously scored items. In this project, classical and Bayesian goodness of fit measures suitable for polytomous models are developed and evaluated.
The effect of errors in item parameter estimates on the estimation of population characteristics.
The validity of large-scale assessment results used to describe and compare populations is predicated upon the accuracy with which the parameters of the population proficiency distributions are estimated. Ignoring errors in the estimation of item parameters can lead to serious problems in estimating population parameters. The effect of errors in item parameter estimation on the estimation of population characteristics is examined in this paper. A joint estimation procedure employing MCMC methods is compared with the plausible value approach with respect to accuracy and bias of estimation.
Quantitative Research Methods
Optimal design for regression discontinuity (RD) studies
RD studies have become an increasingly popular tool for researchers in recent years. Recent work has clarified the conditions necessary to design RD studies with sufficient statistical power in education policy studies. Despite existing work describing the conditions for optimal design of multi-level studies employing random assignment, there has not been work extending these optimality results to RD studies. This project derives results that will inform the optimal design of RD studies.
Improving experimental designs in the presence of contamination
A critical issue in designing experiments to estimate the causal effects of interventions is to account for the possibility that features of the intervention intended for the treatment group may unintentionally be experienced by the control group. This phenomenon is variously referred to as “contamination” or “treatment diffusion”. Contamination can bias estimates of treatment effects. Existing work has clarified that, even when contamination is substantial, the variance penalty incurred by opting for a cluster randomized design often outweighs the biasing effects of contamination. Current work involves extending existing results to more complex multi-level designs and to situations where outcomes are dichotomous, and applying the idea of pseudo-cluster randomization, originally proposed in the context of health studies, to educational contexts.
Using prior information about the ICC to improve power in research designs with clustering
Research designs that randomly assign entire clusters of individuals (such as schools) to treatments are common in studies of educational policy and practice. A major problem for these studies is that it is difficult to obtain a sufficient number of schools to conduct studies with adequate statistical power. This project involves deriving a new method of utilizing prior information about the intracluster correlation coefficient to improve power.
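To illustrate why the intracluster correlation coefficient (ICC) matters for power, the sketch below computes the standard design effect for equal-sized clusters, 1 + (m - 1) × ICC, and the resulting effective sample size. The function names and example values are illustrative assumptions. This shows only the well-known textbook relationship, not the new method derived in the project.

```python
def design_effect(cluster_size, icc):
    """Variance inflation from clustering, assuming equal cluster sizes."""
    return 1.0 + (cluster_size - 1) * icc

def effective_sample_size(n_clusters, cluster_size, icc):
    """Number of independent observations that would carry the same
    information as the clustered sample."""
    return n_clusters * cluster_size / design_effect(cluster_size, icc)

# Hypothetical example: 40 schools with 25 students each and an ICC of 0.2.
n_eff = effective_sample_size(n_clusters=40, cluster_size=25, icc=0.2)
```

Because the design effect grows with the ICC, even a modest ICC sharply reduces the effective sample size, which is why more precise knowledge of the ICC can change the number of schools a study needs.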
Statistical models for implementation fidelity and the implications of fidelity for statistical power.
One important aspect of interpreting the results of experiments is understanding the role that fidelity of treatment implementation plays in interpreting experimental results. Recently, evaluation researchers have created a model for formalizing and quantifying treatment fidelity. This project involves applying this model to understand the impact of treatment fidelity on statistical power.
The consequences of a mismatch between analytic and data generating models in education research.
A frequent mistake in the analysis of cluster randomized trials is made when the data are analyzed as if assignment was carried out at the level of individuals. It is useful to understand how to interpret the results of such analyses. This project derives actual (as opposed to nominal) type I and type II error rates under a variety of scenarios for a mismatch between the true data generating model and the model used for data analysis.
School structure science success: Organization and leadership influences on student achievement.
The emerging consensus is that school organization and leadership have quantifiable influences on student achievement. This mixed methods project seeks to understand the complex interrelationships between school climate and culture, administrator and teacher perceptions, values, and beliefs, and students’ science achievement. The project involves the development and administration of a large teacher questionnaire, which measures over a dozen attitudinal and climate variables, as well as the pairing of the survey data with achievement data. The major quantitative analyses consist of a series of two-level structural equation models, which will allow for the exploration of mediational pathways of school climate and organizational influences on student achievement, as well as a series of school-level growth curve analyses.
Early Vocabulary Intervention Project
The Early Vocabulary Intervention project is a multi-site, randomized cluster design study that looks at the effects of a tiered vocabulary intervention on kindergarten students. Primarily, the research team will use hierarchical linear models to answer questions about the effectiveness of the intervention across schools and growth curve models to explore the long-term effects of the intervention on outcomes in reading and language arts. A graduate student in MEA serves as the data specialist on the team, under the supervision of an MEA faculty member.
Assessment of Teacher Effectiveness
This research explores the instruments used to measure teacher effectiveness. A review of the research on the underlying construct, classroom observation protocols, and value-added models has been carried out. The present focus is on individualized goal-based measures of student growth, currently being implemented in places such as Rhode Island and New Haven, CT. Goal-based measures of effectiveness, such as Goal Attainment Scaling, have been used successfully in a variety of health care settings, but there is little research on their use in an educational context. We are in the process of validating a measure that uses individualized goals set by cooperating teachers as a measure of a student teacher’s contribution to student growth. We are also in the process of obtaining access to the Measures of Effective Teaching (MET) Project database, funded by the Bill and Melinda Gates Foundation. This is the largest database of its kind, and would provide opportunities to examine many different aspects of teacher effectiveness.
Development and Validation of the Challenges to Scholastic Achievement Scale (CSAS)
This ongoing research project involves the development and validation of the Challenges to Scholastic Achievement Scale (CSAS), which is designed to identify negative manifestations of underachievement among high school students. The CSAS measures five constructs related to underachievement: alienation, negative academic self-perception, negative attitudes toward teachers and school, low motivation/self-regulation, and low goal valuation. Our goal is that educators, researchers, and clinicians will be able to use the CSAS to identify the students who are at the greatest risk of underachieving and to understand the reasons that certain able students underachieve so that they can target appropriate interventions.
Evaluating Content Alignment Methods
There are many ways to examine whether test content matches the intended objectives; most depend on test reviews by subject matter experts (SMEs). This series of studies compared how the results of content validity analyses differed depending on the instructions provided to one group of SMEs. Results of asking SMEs to rate items as aligned or not aligned to test objectives were compared to results generated when SMEs were allowed to rate alignment across a continuum. The studies also compared the way test developers grouped items into subtests with the groupings SMEs generated when asked to group items in ways that made sense to them.
Instructional Sensitivity of Large Scale Tests
Instructional sensitivity refers to the ability of tests to detect instructional efforts. For school accountability to work, it is vital that tests reflect instructional efforts instead of student aptitude or other non-school factors that affect achievement. However, most state assessment and accountability systems do not evaluate the ability of tests to capture instructional efforts. This study examined the instructional sensitivity of one state test by collecting detailed information about the way that teachers operationalized state standards, rating teachers according to their alignment to and emphasis on tested topics, and using this information to predict student performance using multilevel models.
Teaching to the Test Under Standards-based Reform
Although teaching to the test is a ubiquitous term, there is no commonly adopted definition of what constitutes teaching to the test in a standards-based environment, nor is there a commonly accepted list of appropriate and inappropriate test preparation practices. This study proposes both a definition and a list, reviews the stated practices of a group of third and fifth grade teachers, and uses multilevel modeling to determine whether teaching to the test affects test performance after controlling for prior performance.
Validation of Standards-based Report Cards
Many school districts require that teachers complete report cards using the same performance level categories as reported on the state test instead of assigning letter grades, in an attempt to increase teacher focus on state standards and to improve the quality of information provided to parents on report cards. This study compares report card grades and test scores to determine their level of concurrence. Teachers’ explanations of their grading methods are used to determine whether the use of reliable grading methods results in greater consistency with state test scores. Finally, the study examines the contribution of teacher, student, and content area to inconsistencies between report card grades and test scores.
CSDE/UConn Measurement, Evaluation, and Assessment Partnership
The Connecticut State Department of Education (CSDE)/UConn partnership was established in 2003. The purpose of the partnership is to provide additional technical resources to the CSDE student assessment office to develop, administer, and report results from statewide measures of student achievement. Support services are provided for the main assessment program, which includes the Connecticut Mastery Test (CMT) and the Connecticut Academic Performance Test (CAPT), as well as for smaller scale assessment initiatives such as the CMT/CAPT Skills Checklist, the modified assessment program, the kindergarten inventories, and formative assessment programs. Examples of services provided include: (1) independent analysis of testing data to confirm analyses performed by the CSDE and/or its contractors to ensure data accuracy and program quality, as well as to resolve technical issues; (2) item review and test form review for developing instrumentation; (3) research to monitor the effectiveness of the student assessment programs; (4) research on current assessment issues such as the Core Content Standards, growth modeling, closing the achievement gap, the use of technology to enhance student learning, and teacher quality.
Project VIABLE-II: Unified validation of Direct Behavior Rating (DBR)
This project involves validating an 11-point behavior rating scale that teachers complete at the end of instructional periods using teacher ratings of approximately 2000 students in grades 1-2, 4-5, and 7-8 across three states. Teachers complete three separate behavioral measures for each student, reflecting on their behavior over the course of a week, in the Fall, Winter and Spring of four school years. Analyses include setting cut scores using ROC analyses, analysis of multitrait-multimethod matrices, and examining the predictive validity of the measures in determining behavioral risk in later school years. In addition, we will use verbal protocol analyses to determine what information teachers consider in rating students and will examine the efficiency and usability of the instrument based on survey data.
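As one illustration of setting a cut score from an ROC analysis, the sketch below scans every observed score as a candidate cut and keeps the one that maximizes Youden's J (sensitivity + specificity - 1). The criterion, the function name, and the assumption that higher ratings indicate greater risk are illustrative choices. The project description does not say which ROC criterion is used.

```python
import numpy as np

def youden_cut_score(scores, at_risk):
    """Return (cut, J) where students with score >= cut are flagged.
    Assumes higher scores indicate greater risk and that both
    at-risk and not-at-risk students are present."""
    scores = np.asarray(scores, dtype=float)
    at_risk = np.asarray(at_risk, dtype=bool)
    best_cut, best_j = None, -np.inf
    for cut in np.unique(scores):
        flagged = scores >= cut
        sensitivity = np.mean(flagged[at_risk])      # true positive rate
        specificity = np.mean(~flagged[~at_risk])    # true negative rate
        j = sensitivity + specificity - 1.0
        if j > best_j:
            best_cut, best_j = cut, j
    return best_cut, best_j
```

When several cuts tie on J, this sketch keeps the lowest one. Other criteria, such as fixing a minimum sensitivity, would lead to different cut scores.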