Session I - Basic Science


Thurs., 10/9/03 Basic Science, Paper #8, 3:35 PM

The Reliability of Reliability Studies of Fracture Classification Systems and a Proposal for a Structured Methodological Approach

Laurent J. Audigé, PhD1; Mohit Bhandari, MD, MSc2; Beate Hanson, MD, MPH1; James F. Kellam, MD3

1AO Clinical Investigation and Documentation (AOCID);
2Department of Clinical Epidemiology and Biostatistics, McMaster Health Sciences, Hamilton, Ontario, Canada;
3Carolinas Medical Center, Charlotte, North Carolina, USA

Purpose: Fracture classification systems can play an important role in guiding clinical management. However, they have often been adopted in orthopaedics without prior validation. In addition, there is no accepted structured approach by which classification systems should be developed, validated, and introduced into practice. Our purpose was therefore threefold: 1) to conduct a systematic literature review of the methods used to assess the "reliability" of fracture classification systems, 2) to discuss determinants of quality among reliability studies, and 3) to propose a methodologic approach for the development and validation of fracture classification systems.

Methods: We conducted a systematic literature review of studies reporting on the reliability (inter-observer agreement) of fracture classification systems. Two independent reviewers searched MEDLINE and EMBASE for published studies. Data were extracted on classifications, image modalities, fracture selection processes, sample sizes and their justification, type and number of raters, practical issues for the classification sessions, statistical methods, and results. A 10-item checklist was devised for quality assessment of methodologies.

Results: Forty-four studies assessing 32 fracture classification systems were included. These studies were conducted after the classification systems had been used in practice. A wide variation in methodologies was observed. The study population was not defined by clear inclusion and exclusion criteria in 41% of studies (18 of 44). The selection of cases was representative (consecutive series or random selection of cases) in only 39% (17 of 44). None of the studies justified the number of cases included. Participating raters were representative of the eventual users of the classification in only four studies (9%). The type of raters, however, was reported in 39 studies, 19 of which included only orthopaedic surgeons. The number of raters ranged from 2 to 36, with a median of 5. In 23% of studies (10 of 44), a group of at least five raters was involved in any evaluation. An indication that raters classified each case independently of other raters was found in 70% of studies (31 of 44). The true distribution of classification categories in the sample (an attempt to define a "gold standard" classification) was estimated in only six studies (14%). The kappa coefficient was the most frequently used measure (39 of 44, 89%) to quantify agreement for intra- and inter-observer reliability or to investigate potential influencing factors. Of 86 kappa coefficients reported for inter-observer reliability, four (5%) were above 0.80, 17 were between 0.60 and 0.80, 32 were between 0.40 and 0.60, and 33 were below 0.40. Most authors (35 of 39, 90%) used one of several proposed guidelines for the interpretation of the kappa coefficient, or a modification of one. Statistical analyses appeared to have been adapted to the study objectives in only 39% of studies (17 of 44).
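
For readers unfamiliar with the statistic, the following is a minimal sketch of how a kappa coefficient for two raters might be computed and sorted into the ranges reported above. It assumes the unweighted two-rater form (Cohen's kappa); the reviewed studies may have used other variants, such as weighted or multi-rater kappa. The function names, the example ratings, and the treatment of values falling exactly on a boundary are illustrative assumptions, not taken from the reviewed studies.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters classifying the same cases."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must classify the same non-empty set of cases")
    n = len(rater_a)
    # Observed agreement: proportion of cases given the same category.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal proportions,
    # summed over all categories either rater used.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    categories = set(count_a) | set(count_b)
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use a single category")
    return (observed - expected) / (1.0 - expected)


def kappa_band(kappa):
    """Assign a kappa value to one of the four ranges used in the Results.

    Assumption: a value exactly on a boundary goes to the higher range.
    """
    if kappa > 0.80:
        return ">0.80"
    if kappa >= 0.60:
        return "0.60-0.80"
    if kappa >= 0.40:
        return "0.40-0.60"
    return "<0.40"


# Hypothetical ratings of ten fractures into types "A" and "B".
surgeon_1 = ["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"]
surgeon_2 = ["A", "A", "A", "A", "B", "B", "B", "B", "B", "A"]
print(round(cohen_kappa(surgeon_1, surgeon_2), 2))
```

In this hypothetical example the raters agree on 8 of 10 cases, but half of that agreement is expected by chance alone, which is why kappa is lower than the raw agreement proportion.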

Conclusions: Despite increased efforts to validate fracture classification systems, the methodologies of reliability studies varied considerably, and the studies were conducted late, after the systems were already in use. The overall reliability results were poor. This review, however, showed that the methods applied have their limitations, such as the frequent use of the kappa coefficient and the way it is interpreted. For the development and validation of fracture classification systems, we propose a three-phase methodologic approach: 1) development and pilot agreement studies, 2) pragmatic agreement studies, and 3) clinical studies. Reliability studies should be implemented in the early phase of development, before classification systems are promoted for use in practice.