Volume 1, Issue 1 • Fall 2011

Assessing and Improving the Reliability of Risk Instruments: The New Mexico Juvenile Justice Reliability Model

Katherine Ortega Courtney and Jeremy Howard
New Mexico Children, Youth, and Families Department, Santa Fe, New Mexico

Katherine Ortega Courtney, New Mexico Children, Youth, and Families Department, Juvenile Justice Data Analysis Unit; Jeremy Howard, New Mexico Children, Youth, and Families Department, Juvenile Justice Data Analysis Unit.

Correspondence concerning this article should be addressed to Katherine Ortega Courtney, New Mexico Children, Youth, and Families Department, Juvenile Justice Data Analysis Unit, Santa Fe, NM 87502. E-mail: Katherine.courtney@state.nm.us

Key words: reliability, risk assessment, model programs, structured decision making


Reliability is a critical feature of any screening or assessment instrument, yet the reliability of juvenile justice risk instruments is rarely assessed. We therefore developed a method for examining the reliability of the New Mexico Structured Decision Making Risk Instrument. This method involved creating sample cases that would include information needed to complete the instrument. Two Juvenile Probation Officers (JPOs) from each district in New Mexico were asked to rate ten sample cases. Upon completion of the initial reliability study, we determined that the instrument’s reliability was unacceptable. We then undertook an intensive effort to increase its reliability, which included revising definitions and instructions for the instrument and retraining workers statewide. After revising and retraining, we reassessed the instrument’s reliability. The results indicated substantial improvement in the instrument’s reliability, supporting equitable application and scoring of risk for youth throughout the state’s cultural landscape. The method we used to improve the instrument’s reliability resulted in the creation of the New Mexico Juvenile Justice Reliability Model. This method, although new, is effective and relatively simple to use. The resulting model can be used by others to assess and improve the reliability of their instruments.


As standardized tools, including risk assessment instruments, are used with increasing frequency in the juvenile justice system, it is more important than ever to establish a systematic method for testing their reliability. While there are many definitions of this term, reliability generally refers to the consistency or repeatability of measures (e.g., LeBreton & Senter, 2008; Bliese, 2000). Of particular interest for the purposes of risk assessments is inter-rater reliability, which measures the degree of agreement among raters. Sufficient inter-rater reliability ensures that the same individual would be scored consistently by different raters in different locations. Inter-rater reliability is especially important in the juvenile justice system because these instruments are used to assist the JPO with case management decision making. It is vital to ensure that any youth receiving a risk assessment would receive the same score no matter who administers the instrument and no matter where the youth is located. This is a particular concern in the state of New Mexico, which is culturally and geographically diverse. With such a wide range of urban and rural settings, a youth should receive the same scores in the urban setting of Albuquerque as in the rural community of Reserve.

Many studies focus solely on instruments’ validity. According to Baird, however, “If there is little or no consistency among staff members completing risk instruments, the validity of the system cannot be assumed” (Baird, 2009, p. 7). If an instrument is not reliable, it is difficult to argue that it is valid. It is therefore recommended that the reliability of an instrument be tested before its validity is assessed (Austin, 2003).

Despite this methodological necessity, relatively sparse information is available regarding the reliability of risk instruments, and what little information is available often does not adequately measure inter-rater reliability. Many studies measuring the reliability of risk instruments use measures of internal consistency rather than inter-rater reliability. For example, some studies that assess reliability calculate internal consistency using measures such as Cronbach’s alpha (e.g., Connolly, 2003; Schwalbe, Fraser, Day, & Arnold, 2005; Schmidt, Hoge, & Gomes, 2005). Some studies also examine reliability by investigating whether similar cases are categorized similarly, or whether classifications using the instrument are similar to classifications using clinical judgment (Jones & Baird, 2001; Schwalbe et al., 2005). While these measures may be useful in determining the appropriateness of an instrument, simple measures of internal consistency do not properly measure the reliability of risk assessments (see Baird, 2009).

We examined the inter-rater reliability of New Mexico’s Structured Decision Making Risk Instrument. This effort resulted in the creation of the New Mexico Juvenile Justice Reliability Model, which serves as a model for others wishing to assess and improve the inter-rater reliability of their risk assessment instruments.

Structured Decision Making Risk Assessment

In 1998, with the assistance of the National Council on Crime and Delinquency (NCCD), the New Mexico Children, Youth, and Families Department (CYFD) implemented Structured Decision Making (SDM) as the risk and needs classification instrument for juvenile offenders in New Mexico. In 2004, NCCD completed a validation of the risk assessment, and recommendations from that study were implemented, tailoring the SDM instrument for New Mexico youth. In 2008, CYFD incorporated the SDM system for field supervision into the Family Automated Client Tracking System (FACTS), the agency’s client management database. Due to this change, and because 10 years had elapsed since the initial validation study, we began a new validation study in 2008 and completed it in 2010 (Courtney, Howard, & Bunker, 2010). As part of the preparation for the validation study, we determined that it was necessary to also complete a reliability study, since reliability had never been evaluated for the SDM instrument.

The SDM instrument in New Mexico comprises a risk assessment and risk re-assessment, both of which include an assessment of needs. When a disposition is ordered for an adjudicated juvenile offender, a risk assessment and a needs assessment are completed. Risk and needs reassessments are then completed according to a set schedule, which depends on the youth’s type and intensity of probation supervision and on whether there is a significant change in the youth’s situation or behavior. These reassessments continue until the youth is discharged from supervision by the department.

CYFD uses the SDM instrument to guide disposition recommendations, define which set of minimum contact standards to utilize when supervising a youth in the community, and assist in the classification process of youth committed to CYFD facilities. The SDM risk instrument plays an important role in decision making, and it is therefore critical to assess reliability and validity on a regular basis.

The SDM risk instrument consists of the following six items: number of referrals, age at first juvenile referral, petition offense history, affiliation with delinquent gang, education issues, and substance abuse. The first three items are automatically scored by FACTS, so the scoring of those items is fully consistent. The reliability study therefore focused on the three remaining items, which require a rating by the JPO (gang involvement, education issues, and substance abuse). Since the reliability of risk instruments is not commonly tested, we developed a new methodology for testing the reliability of these three risk assessment items.

Study 1


For most youths, CYFD JPOs complete a “baseline” assessment. Baseline assessments include information related to the youth’s referral(s), social history, educational background, and substance abuse issues. We used these assessments as the basis for creating case samples that were part of the SDM reliability study. We summarized these assessments to remove any identifiers from the sample. Since, as mentioned above, the first three SDM risk variables (the number of referrals, age at first referral, and petitioned offense history) were automated when implemented in FACTS, the sample focused on information related to the remaining risk variables: gang involvement, education issues, and substance abuse.

Creating sample cases was a vital part of the study. Arranging for duplicate ratings is often one of the greatest obstacles when conducting reliability studies (Walter, Eliasziw, & Donner, 1998). Previous studies (e.g., Austin, Coleman, Peyton, & Johnson, 2003) have addressed this problem by using actual cases that were rated by separate people at different times. Although this method may be useful for static measures (those that do not change over time), it is not effective for dynamic measures such as those in the SDM risk instrument. To properly assess inter-rater reliability, it was important to create samples based on real cases and to have two staff members rate each case for the same time period without interfering with the processing of actual cases. The following is a case sample that was used in the reliability study:

“The youth is an active member of the ‘Westside’ street gang. The client is enrolled in public school and is experiencing significant behavior and attendance issues. He/she has been suspended twice since the beginning of the semester as a result of leaving school without permission and threatening to kill his/her teacher. The client had previously reported daily use of marijuana and occasional use of alcohol. The current disposition resulted from a drug screen submitted a month ago by the client, which tested positive for marijuana and amphetamines.”

Each of the 14 judicial districts in New Mexico was asked to provide two JPOs to volunteer as raters. To test inter-rater reliability, each of the 100 sample cases was rated in early 2009 by two different, randomly chosen JPOs.
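The rating design described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the article does not say how raters were matched to cases, so the simple random draw below, the rater and case identifiers, and the `assign_raters` helper are all hypothetical.

```python
import random


def assign_raters(case_ids, rater_ids, raters_per_case=2, seed=None):
    """Give each sample case a random set of distinct raters.

    Assumption: a plain random draw per case. It does not balance
    workload across raters, which a real study might want to do.
    """
    rng = random.Random(seed)
    return {case: rng.sample(rater_ids, raters_per_case) for case in case_ids}


# Hypothetical identifiers: 14 districts with 2 volunteer JPOs each.
raters = [f"D{district:02d}-JPO{n}" for district in range(1, 15) for n in (1, 2)]
cases = [f"case-{i:03d}" for i in range(1, 101)]  # 100 sample cases

assignment = assign_raters(cases, raters, seed=2009)
print(assignment["case-001"])
```

Because `random.sample` draws without replacement, no case can be rated twice by the same JPO, which is the property the design needs.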


Agreement between the two raters for each sample case was tested using Cohen’s Kappa (Cohen, 1960) as well as percent agreement. A Cohen’s Kappa of 1 indicates perfect agreement, and a Cohen’s Kappa of 0 indicates an agreement level no better than chance (Landis & Koch, 1977). The results of the original reliability study found room for improvement in reliability scores (see Table 1). The gang item was found to have substantial agreement (Kappa = 0.800), while the education item and the substance abuse item were found to have moderate agreement (Kappa = 0.496 and 0.592, respectively).
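For readers who want to reproduce the two statistics, the following is a minimal sketch of percent agreement and Cohen's Kappa for two raters, using the standard definition Kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is the agreement expected by chance from each rater's marginal rating frequencies. The ratings shown are invented for illustration and are not data from the study.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Share of cases on which the two raters gave the same rating."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must rate the same non-empty set of cases")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Cohen's Kappa for two raters: (p_o - p_e) / (1 - p_e)."""
    n = len(rater_a)
    p_o = percent_agreement(rater_a, rater_b)
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    categories = set(counts_a) | set(counts_b)
    # Chance agreement: product of the raters' marginal proportions per category.
    p_e = sum((counts_a[c] / n) * (counts_b[c] / n) for c in categories)
    if p_e == 1.0:
        raise ValueError("Kappa is undefined when chance agreement equals 1")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical yes/no ratings of ten sample cases (not study data).
rater_1 = ["yes", "yes", "no", "no", "yes", "no", "no", "yes", "no", "no"]
rater_2 = ["yes", "no", "no", "no", "yes", "no", "no", "yes", "no", "yes"]

agreement = percent_agreement(rater_1, rater_2)
kappa = cohens_kappa(rater_1, rater_2)
print(f"percent agreement = {agreement:.0%}, Kappa = {kappa:.3f}")
```

In this invented example the raters agree on 8 of 10 cases, but because each rater says "no" 60 percent of the time, a good part of that agreement would be expected by chance, so Kappa is noticeably lower than percent agreement.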


The results of Study 1 indicated that the reliability of the SDM risk instrument was lower than expected and may not be acceptable. The relatively low levels of agreement were especially troubling given that the ratings were of sample cases designed specifically to address each of the areas to be rated. Furthermore, two of the three items (gang involvement and substance abuse) were yes-or-no items. Because ratings of actual cases would not be as straightforward, actual reliability was likely lower than that found in the study. We therefore determined that it was necessary to improve the instrument’s reliability. Previous research has found that additional training can improve reliability results (e.g., Austin et al., 2003; Baird, 2009). We determined that improving the reliability of the SDM risk instrument would require clarifying and revising the item definitions and providing intensive training on use of the instrument.

Study 2

Following the completion of Study 1, we increased our efforts to more clearly define the risk variables being evaluated by CYFD staff. The rationale behind this decision was that the definitions were too open to interpretation and this interpretive element may have contributed to the disagreement observed in the first reliability study. For example, at the time of the first reliability study, the risk variable for education issues required the JPO to categorize the youth being evaluated as follows:

A work group consisting of members of the New Mexico CYFD Data Analysis Unit, a Regional Administrator, a community behavioral health clinician, and other staff members from Juvenile Justice Services revised the definitions and rating instructions with the goal of maximizing consistency statewide. The resulting revised definitions did not change what the variables measured, but did make use of language that was more specific, definitive, and identifiable. This is demonstrated by the revised definitions for education issues presented in Figure 1.

Figure 1 Revised Definitions for Education Issues Risk Item

R5

No School Problems
  Is enrolled in and attending school
  Has no unexcused absences
  Has no behavior problems
  Has no work effort problems
  Has a GED or High School Diploma

Occasional School Problems
  Is enrolled in school but has some unexcused absences that have not impacted performance
  Has occasional behavior problems that have not impacted performance
  Has occasional work effort problems that have not impacted performance
  Has been referred to in-school detention

Frequent School Problems
  Has enrolled in school but frequent to chronic unexcused absences have impacted performance
  Has frequent to chronic behavior problems that have impacted performance
  Has frequent to chronic work effort problems that have impacted performance
  Is failing all or most classes
  Has been suspended for short or long term
  Has dropped out, un-enrolled, or been expelled
  Has refused to engage in recommended education services

Once members of the work group revised the definitions, we modified the SDM module of FACTS, the CYFD statewide client tracking database. Specifically, we reworded for clarity the dropdown selections for specific variables in the needs assessment and the categories in the risk reassessment. We scheduled comprehensive training sessions throughout New Mexico for June and July 2009 to operationalize these new definitions; the SDM dropdown modifications were implemented in FACTS during the same period.

For the training, the SDM coordinator traveled to JPO offices throughout the state, provided handouts of the revised definitions, and led an in-depth, four-hour review of each risk and needs variable of the SDM as redefined. The coordinator met with small groups of 10 to 15 individuals, including JPOs, supervisors, and chief JPOs, and reviewed various SDM protocols and the revised definitions using a 58-slide PowerPoint presentation. The presentation, augmented by interactive question-and-answer sessions between the coordinator and the training group, covered not only the revised definitions but also youth classification. The discussions that took place during training led to further revisions of the definitions, which in turn led to uniformity of understanding and interpretation across the state. The revised definitions were finalized in November 2009. CYFD then distributed them statewide by posting them on the CYFD intranet and sending a statewide email that identified and linked to the revisions.


After finalizing the new definitions and training staff members on scoring the Risk Assessment using the new definitions, we repeated the reliability study. In early 2010, we developed new sample cases using the same procedure as in Study 1. Each of the 14 judicial districts was again asked for two JPOs to volunteer as raters. Once again, to test inter-rater reliability, each of 100 sample cases was rated by two separate JPOs in January 2010.


As in the first study, we used Cohen’s Kappa and percent agreement to examine the level of agreement between the two raters for each sample case. Inter-rater reliability substantially improved for each of the items (see Table 1). For the gang item, the Kappa improved from 0.800 to 0.940, indicating an improvement from substantial agreement to almost perfect agreement. For the education item, the Kappa improved from 0.496, indicating moderate agreement, to 0.715, indicating substantial agreement. The Kappa for the substance abuse item improved from 0.592 to 0.917, indicating an improvement from moderate to almost perfect agreement.

Table 1 Study 1 and Study 2 Results

                 Study 1                                     Study 2
                 Kappa                          % Agreement  Kappa                             % Agreement
Gang             0.800 (Substantial agreement)  90           0.940 (Almost perfect agreement)  97
Education        0.496 (Moderate agreement)     70           0.715 (Substantial agreement)     83
Substance Abuse  0.592 (Moderate agreement)     90           0.917 (Almost perfect agreement)  98


The results of Study 2 showed substantial improvement over the results of Study 1. Reliability improved for each of the three items of interest. These results indicate that the process of revising the definitions and training staff was effective in improving the reliability of the risk instrument.
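The descriptive labels in Table 1 follow the Landis and Koch (1977) bands, and the table also shows that the same percent agreement can correspond to different Kappa values (in Study 1, the gang and substance abuse items both show 90 percent agreement). The sketch below maps each reported Kappa to its band and back-calculates the chance agreement implied by each Kappa and percent agreement pair. The back-calculation is our own arithmetic on the rounded figures in Table 1, not a result reported in the study, so the implied values are approximate.

```python
def landis_koch_label(kappa):
    """Map a Kappa value to the Landis and Koch (1977) descriptive band."""
    if kappa < 0:
        return "Poor"
    bands = [
        (0.20, "Slight"),
        (0.40, "Fair"),
        (0.60, "Moderate"),
        (0.80, "Substantial"),
        (1.00, "Almost perfect"),
    ]
    for upper, label in bands:
        if kappa <= upper:
            return label
    raise ValueError("Kappa cannot exceed 1")


def implied_chance_agreement(kappa, observed):
    """Solve Kappa = (p_o - p_e) / (1 - p_e) for p_e."""
    return (observed - kappa) / (1 - kappa)


# (Kappa, percent agreement as a proportion) copied from Table 1.
table_1 = {
    "Gang": {"Study 1": (0.800, 0.90), "Study 2": (0.940, 0.97)},
    "Education": {"Study 1": (0.496, 0.70), "Study 2": (0.715, 0.83)},
    "Substance Abuse": {"Study 1": (0.592, 0.90), "Study 2": (0.917, 0.98)},
}

for item, studies in table_1.items():
    for study, (kappa, observed) in studies.items():
        p_e = implied_chance_agreement(kappa, observed)
        print(f"{item:16s}{study}: Kappa {kappa:.3f} "
              f"({landis_koch_label(kappa)}), implied chance agreement {p_e:.2f}")
```

Under these assumptions, the Study 1 substance abuse ratings imply a much higher chance agreement than the gang ratings, which is consistent with one rating category dominating the substance abuse sample cases. That is why identical percent agreement yields a lower Kappa for that item.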

General Discussion

The New Mexico Juvenile Justice Reliability Model

The relatively low reliability of the New Mexico Structured Decision Making Risk Instrument found in the first study yielded some unexpectedly positive effects for the instrument and the New Mexico Juvenile Justice system as a whole. Due to the low reliability found in Study 1, the agency was required to address the problem before assessing validity (for a discussion of the validity of this instrument, see Courtney et al., 2010). In doing so, it was necessary to revisit the instructions and definitions for each of the items on the risk instrument. This was an important exercise, and the resulting discussions proved useful and informative. We assessed the definitions and instructions for each item in depth, and provided training on the subsequent changes and revisions to workers statewide.

When we reassessed the reliability of this risk instrument after revising the instructions and definitions and training workers throughout the state, results indicated that the process improved the reliability of the instrument. The reliability study resulted in the creation of the New Mexico Juvenile Justice Reliability Model (see Figure 2). This model consists of a simple yet effective process for assessing and improving the reliability of any instrument.

Figure 2 The New Mexico Juvenile Justice Reliability Model

The first step in the process of evaluating any risk instrument is assessing its initial reliability. One of the most difficult factors to address in reliability studies is arranging for the replication of cases (Walter et al., 1998). The creation of sample cases based on actual information allows for the testing of reliability without interfering with the processing of actual cases. The sample cases should be rated by workers who actually use the instrument in the field. After each case is rated by two independent raters, researchers can assess the reliability of the instrument. Based on the results, definitions and instructions for each item should be revised by a work group, including field workers, researchers, and supervisors. The goal of the revised definitions should be to maximize consistency.

The next step is to train workers to use the new definitions. During this training process, it is important to solicit their feedback and incorporate this feedback into the final definitions and instructions for each item on the instrument. The final definitions and instructions should then be disseminated to the field. To determine the effectiveness of the training and new definitions/instructions, researchers should then reassess the instrument’s reliability. It may be necessary to repeat this process several times to achieve acceptable levels of reliability.
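The cycle described above (assess, revise and train, reassess, repeat as needed) can be expressed as a short loop. This is a sketch under stated assumptions rather than a prescribed procedure: the function names are hypothetical, the `assess` and `revise_and_train` callables stand in for the real field work, and the target Kappa of 0.61 (the lower edge of the "substantial" band) is our assumption, since the article does not name a numeric threshold for acceptable reliability.

```python
def run_reliability_model(assess, revise_and_train, target_kappa, max_rounds=5):
    """Repeat assess -> revise/train -> reassess until every item meets the target.

    assess(round_number) returns a dict of item name -> Kappa.
    revise_and_train(kappas) stands in for revising definitions with a
    work group, training workers, and disseminating final definitions.
    """
    history = []
    for round_number in range(1, max_rounds + 1):
        kappas = assess(round_number)
        history.append(kappas)
        if all(k >= target_kappa for k in kappas.values()):
            break  # reliability is acceptable; stop iterating
        revise_and_train(kappas)
    return history


# Scripted stand-ins that replay the Kappa values reported in Table 1.
study_results = {
    1: {"Gang": 0.800, "Education": 0.496, "Substance Abuse": 0.592},
    2: {"Gang": 0.940, "Education": 0.715, "Substance Abuse": 0.917},
}
training_log = []


def assess(round_number):
    return study_results[round_number]


def revise_and_train(kappas):
    # Record the weakest item in this round, purely for illustration.
    training_log.append(min(kappas, key=kappas.get))


history = run_reliability_model(assess, revise_and_train, target_kappa=0.61)
print(f"rounds run: {len(history)}, weakest item before training: {training_log}")
```

With the assumed threshold, the loop stops after the second assessment, mirroring the two studies reported here. A stricter threshold would call for another round of revision and training.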


Although the reliability of risk instruments is rarely tested, it is widely agreed that an instrument’s reliability is important and cannot be assumed (e.g., Austin, 2003; Baird, 2009). If inter-rater reliability is unsatisfactory, an instrument’s validity cannot be adequately assessed. Results of the current study illustrate the value of thoroughly examining the reliability of any risk instrument. Because there is no widely agreed-upon methodology for assessing inter-rater reliability of risk instruments in the field of juvenile justice, we developed a new method for assessing the inter-rater reliability of the New Mexico Juvenile Justice SDM risk assessment instrument.

Results of the initial reliability study indicated that the instrument’s reliability needed improvement. This finding was somewhat surprising, given that the sample cases were designed to specifically address the information needed to make a rating. This indicates that the instrument’s reliability in the field was probably even lower than we initially found.

In response to the relatively poor results of the initial reliability study, we revised the definitions of the items that were being evaluated to be more concise and to encourage consistency statewide. After providing training, receiving feedback, and finalizing the new definitions, we reassessed the instrument’s reliability. The second study indicated that the process was an effective method for improving reliability, and the result was the creation of the New Mexico Juvenile Justice Reliability Model.

Although results indicate that the model is effective in determining an instrument’s reliability, this model should now be applied to evaluating the reliability of another instrument or be repeated in New Mexico so researchers can validate it. In addition, the reliability of the risk instrument should be revisited in one year to determine whether the improvement in the instrument’s reliability has been sustained. We began plans for this study in summer of 2011.

It is interesting to note that the only variable that was not dichotomous, education, had the lowest inter-rater reliability both before and after training. It may be useful for future studies to examine whether it is beneficial for all variables to be dichotomous. Another direction for future research should include investigating whether rater characteristics such as gender, ethnicity, job experience, or regional differences have any impact on inter-rater reliability.

The method used in the study described here resulted in an effective and useful model for assessing and improving the reliability of a risk instrument. Because there is relatively little research on the reliability of risk instruments, this much-needed model fills a gap in risk instrument research. The findings of this study have important implications for the evaluation of risk instruments as a whole. Reliability should not simply be assumed. The model used in this study to assess reliability represented a new and innovative process, was relatively easy to implement, and can easily be adopted by other agencies interested in assessing the reliability of their instruments.

About the Authors

Katherine Ortega Courtney, Ph.D., is a research epidemiologist at the New Mexico Children, Youth, and Families Department, Juvenile Justice Data Analysis Unit.

Jeremy Howard, B.A., is a structured decision making coordinator at the New Mexico Children, Youth, and Families Department, Juvenile Justice FACTS and Juvenile Justice Data Analysis Unit.


References

Austin, J. (2003, June 25). Findings in prison classification and risk assessment (Issues in Brief). Washington, D.C.: National Institute of Corrections Prisons Division.

Austin, J., Coleman, D., Peyton, J., & Johnson, K. D. (2003). Reliability and validity study of the LSI-R Risk Assessment Instrument. Washington, D.C.: Institute on Crime, Justice and Corrections at The George Washington University.

Baird, C. (2009). A question of evidence: A critique of risk assessment models used in the justice system. Oakland, CA: National Council on Crime and Delinquency.

Bliese, P. D. (2000). Within-group agreement, non-independence, and reliability: Implications for data aggregation and analysis. In K. J. Klein, & S. W. Kozlowski (Eds.), Multilevel theory, research, and methods in organizations: Foundations, extensions, and new directions (pp. 349–381). San Francisco: Jossey-Bass.

Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37–46.

Connolly, M. M. (2003). A critical examination of actuarial offender-based prediction assessments: Guidance for the next generation of assessments. Unpublished doctoral dissertation. University of Texas at Austin, Austin, TX.

Courtney, K. O., Howard, J., & Bunker, F. (2010). Validation of the juvenile justice risk assessment. Santa Fe, New Mexico: New Mexico Children, Youth and Families Department.

Jones, S., & Baird, C. (2001). Alameda County placement risk assessment validation, final report. Washington, D.C.: U.S. Department of Justice.

Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33, 159–174.

LeBreton, J. M., & Senter, J. L. (2008). Answers to 20 questions about interrater reliability and interrater agreement. Organizational Research Methods, 11(4), 815–852.

Schmidt, F., Hoge, R. D., & Gomes, L. (2005). Reliability and validity analyses of the youth level of service/case management inventory. Criminal Justice and Behavior, 32(3), 329–344.

Schwalbe, C. S., Fraser, M. W., Day, S. H., & Arnold, E. M. (2005). North Carolina assessment of risk (NCAR): Reliability and predictive validity with juvenile offenders. Journal of Offender Rehabilitation, 40(1), 1–22.

Walter, S. D., Eliasziw, M., & Donner, A. (1998). Sample size and optimal designs for reliability studies. Statistics in Medicine, 17(1), 101–110.
