Concept: Fleiss' kappa


BACKGROUND: Systematic reviews have been challenged to consider effects on disadvantaged groups. A priori specification of subgroup analyses is recommended to increase the credibility of these analyses. This study aimed to develop and assess inter-rater agreement for an algorithm for systematic review authors to predict whether differences in effect measures are likely for disadvantaged populations relative to advantaged populations (only relative effect measures were addressed). METHODS: A health equity plausibility algorithm was developed using clinimetric methods with three items based on literature review, key informant interviews and methodology studies. The three items dealt with the plausibility of differences in relative effects across sex or socioeconomic status (SES) due to: 1) patient characteristics; 2) intervention delivery (i.e., implementation); and 3) comparators. Thirty-five respondents (consisting of clinicians, methodologists and research users) assessed the likelihood of differences across sex and SES for ten systematic reviews with these questions. We assessed inter-rater reliability using Fleiss' multi-rater kappa. RESULTS: The proportion agreement was 66% for patient characteristics (95% confidence interval: 61% to 71%), 67% for intervention delivery (95% confidence interval: 62% to 72%) and 55% for the comparator (95% confidence interval: 50% to 60%). Inter-rater agreement, assessed with Fleiss' kappa, ranged from 0 to 0.199, representing very low agreement beyond chance. CONCLUSIONS: Users of systematic reviews rated that important differences in relative effects across sex and socioeconomic status were plausible for a range of individual and population-level interventions. However, there was very low inter-rater agreement for these assessments. There is an unmet need for discussion of plausibility of differential effects in systematic reviews. Increased consideration of external validity and applicability to different populations and settings is warranted in systematic reviews to meet this need.
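
The multi-rater statistic named in this abstract can be illustrated with a minimal sketch. It assumes the textbook form of Fleiss' kappa, computed from a table that counts, for each rated item, how many raters chose each category. The function name and the toy counts are made up for illustration and are not the study's data.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a subjects-by-categories table of rating counts.

    counts[i][j] is the number of raters who assigned subject i to category j.
    Every subject must have been rated by the same number of raters (>= 2).
    """
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each subject needs the same number (>= 2) of ratings")
    n_categories = len(counts[0])

    # Share of all ratings that fell into each category.
    p_cat = [sum(row[j] for row in counts) / (n_subjects * n_raters)
             for j in range(n_categories)]
    # Per-subject agreement: fraction of rater pairs that agree on that subject.
    p_subj = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
              for row in counts]

    p_observed = sum(p_subj) / n_subjects
    p_expected = sum(p * p for p in p_cat)
    if p_expected == 1.0:
        return float("nan")  # every rating in one category: kappa is undefined
    return (p_observed - p_expected) / (1 - p_expected)


# Made-up example: 4 reviews, 5 raters, categories (plausible, not plausible).
toy_counts = [[5, 0], [3, 2], [2, 3], [4, 1]]
print(round(fleiss_kappa(toy_counts), 3))
```

In this toy table the raw agreement is well above one half, yet kappa stays close to zero because most of that agreement is what chance alone would produce given the skewed category totals.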

Concepts: Meta-analysis, Evidence-based medicine, Contract, Assessment, Interval finite element, Cohen's kappa, Inter-rater reliability, Fleiss' kappa


BACKGROUND: Physical activity is assumed to be important in the prevention and treatment of frailty. It is however unclear to what extent frailty can be influenced, because an outcome instrument is lacking. OBJECTIVES: An Evaluative Frailty Index for Physical activity (EFIP) was developed based on the Frailty Index Accumulation of Deficits and clinimetric properties were tested. DESIGN: The content of the EFIP was determined in a written Delphi procedure. Intra-rater reliability, inter-rater reliability, and construct validity were determined in an observational study (n=24) and, to determine responsiveness, the EFIP was used in a physical therapy intervention study (n=12). METHOD: Intra-rater reliability and inter-rater reliability were calculated using Cohen’s kappa; construct validity was determined by correlating the score on the EFIP with those on the Timed Up & Go Test (TUG), the Performance Oriented Mobility Assessment (POMA), and the Cumulative Illness Rating Scale for geriatrics (CIRS-G). Responsiveness was calculated by means of the Effect Size (ES), the Standardized Response Mean (SRM), and a paired sample t-test. RESULTS: Fifty items were included in the EFIP. Inter-rater (Cohen’s kappa: 0.72) and intra-rater reliability (Cohen’s kappa: 0.77 and 0.80) were good. A moderate correlation with the TUG, POMA, and CIRS-G was found (0.68, -0.66, and 0.61, respectively; P < 0.001). Responsiveness was moderate to good (ES: -0.72 and SRM: -1.14) for an intervention with a significant effect (P < 0.01). LIMITATIONS: The clinimetric properties of the EFIP have been tested in a small sample and anchor-based responsiveness could not be determined. CONCLUSIONS: The EFIP is a reliable, valid, and responsive instrument to evaluate the effect of physical activity on frailty in research and clinical practice.
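
The abstract does not define its two responsiveness indices, so the sketch below assumes the definitions commonly used in clinimetrics: ES as the mean change divided by the standard deviation of the baseline scores, and SRM as the mean change divided by the standard deviation of the change scores. The function names and the scores are hypothetical.

```python
from statistics import mean, stdev


def effect_size(baseline, follow_up):
    """Mean change divided by the SD of the baseline scores (assumed definition)."""
    changes = [after - before for before, after in zip(baseline, follow_up)]
    return mean(changes) / stdev(baseline)


def standardized_response_mean(baseline, follow_up):
    """Mean change divided by the SD of the change scores (assumed definition)."""
    changes = [after - before for before, after in zip(baseline, follow_up)]
    return mean(changes) / stdev(changes)


# Hypothetical frailty scores for four participants before and after treatment.
before = [10, 12, 14, 16]
after = [8, 11, 11, 14]
print(round(effect_size(before, after), 2))
print(round(standardized_response_mean(before, after), 2))
```

With change computed as follow-up minus baseline, a negative value means scores went down after the intervention. The SRM is larger in magnitude than the ES whenever individual changes vary less than the baseline scores do.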

Concepts: Scientific method, Inter-rater reliability, Fleiss' kappa, Cohen's kappa, Student's t-test, Jacob Cohen, Reliability, Psychometrics


Clinical evaluation of scapular dyskinesis (SD) aims to identify abnormal scapulothoracic movement, underlying causal factors, and the potential relationship with shoulder symptoms. The literature proposes different methods of dynamic clinical evaluation of SD, but improved reliability and agreement values are needed. The present study aimed to evaluate the intrarater and interrater agreement and reliability of three SD classifications: 1) 4-type classification, 2) Yes/No classification, and 3) scapular dyskinesis test (SDT). Seventy-five young athletes, including 45 men and 30 women, were evaluated. Two raters trained to diagnose SD evaluated each athlete with the three methods during one series of 8 to 10 cycles of forward flexion and abduction with an external load. The evaluation protocol was repeated after 3 h for intrarater analysis. The agreement percentage was calculated by dividing the observed agreement by the total number of observations. Reliability was calculated using Cohen's kappa coefficient, with a 95% confidence interval (CI) defined as the kappa coefficient ±1.96 multiplied by the measurement standard error. The interrater analyses showed an agreement percentage between 80% and 95.9% and almost perfect reliability (k>0.81) for the three classification methods in all the test conditions, except the 4-type and SDT classification methods, which had substantial reliability (k<0.80) in shoulder abduction. Intrarater analyses showed agreement percentages between 80.7% and 89.3% and substantial reliability (0.67 to 0.81) for both raters in the three classifications. CIs ranged from moderate to almost perfect categories. This indicates that the three SD classification methods investigated in this study showed high reliability values for both intrarater and interrater evaluation throughout a protocol that provided SD evaluation training of raters and included several repetitions of arm movements with external load during a live assessment.
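
The three quantities described in this abstract (agreement percentage, Cohen's kappa, and a CI of kappa ±1.96 standard errors) can be sketched as follows. The abstract does not say which standard error formula was used, so the simple large-sample approximation below is an assumption. The function name and the ratings are invented for illustration.

```python
from collections import Counter
from math import sqrt


def cohen_kappa_with_ci(ratings_a, ratings_b, z=1.96):
    """Return (observed agreement, kappa, (ci_low, ci_high)) for two raters."""
    n = len(ratings_a)
    # Observed agreement: matching pairs divided by the number of observations.
    p_observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement from each rater's own category frequencies.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    p_expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when both raters use one category")
    kappa = (p_observed - p_expected) / (1 - p_expected)
    # Assumed approximation of the standard error of kappa.
    se = sqrt(p_observed * (1 - p_observed) / (n * (1 - p_expected) ** 2))
    return p_observed, kappa, (kappa - z * se, kappa + z * se)


# Invented Yes/No dyskinesis ratings for 50 athletes.
rater_a = ["yes"] * 25 + ["no"] * 25
rater_b = ["yes"] * 20 + ["no"] * 5 + ["yes"] * 10 + ["no"] * 15
agreement, kappa, (low, high) = cohen_kappa_with_ci(rater_a, rater_b)
print(f"agreement={agreement:.2f} kappa={kappa:.2f} CI=({low:.2f}, {high:.2f})")
```

Because the interval is built from a single standard error, it can span more than one verbal category (for example from moderate to almost perfect), which is what the abstract reports for its CIs.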

Concepts: Normal distribution, Observation, Shoulder, Multiplication, Fleiss' kappa, Scientific method, Cohen's kappa, Inter-rater reliability


Due to the unpredictable, varied and often physical nature of law enforcement duties, police officers are at a high risk of work-related physical injury. The aim of this critical narrative review was to identify and synthesize key findings of studies that have investigated musculoskeletal injuries sustained by law enforcement officers during occupational tasks. A systematic search of four databases using key search terms was conducted to identify potentially relevant studies, which were assessed against key inclusion and exclusion criteria to determine studies to be included in the review. Included studies were critically appraised and the level of evidence determined. Relevant data were extracted, tabulated and synthesized. The 16 identified studies ranged in percentage quality scores from 25.00% to 65.00%, with a mean score of 41.25% and high interrater agreement in scores reflected in a Cohen’s Kappa coefficient, κ = 0.977. The most common body site of injury was the upper extremity, the most common injury types were soft-tissue sprains and strains and the most common cause of injury was a non-compliant offender, often involving assault. However, there was limited peer reviewed research in this area and the published research had a narrow focus and was of low to fair methodological quality.

Concepts: Sprain, Physical trauma, Fleiss' kappa, Peer review, Injury, Cohen's kappa, Inter-rater reliability, Police


Objectives: The standard approach to the assessment of occupational exposures is through the manual collection and coding of job histories. This method is time-consuming and costly and makes it potentially unfeasible to perform high quality analyses on occupational exposures in large population-based studies. Our aim was to develop a novel, efficient web-based tool to collect and code lifetime job histories in the UK Biobank, a population-based cohort of over 500 000 participants. Methods: We developed OSCAR (occupations self-coding automatic recording) based on the hierarchical structure of the UK Standard Occupational Classification (SOC) 2000, which allows individuals to collect and automatically code their lifetime job histories via a simple decision-tree model. Participants were asked to find each of their jobs by selecting appropriate job categories until they identified their job title, which was linked to a hidden 4-digit SOC code. For each occupation a job title in free text was also collected to estimate Cohen’s kappa (κ) inter-rater agreement between SOC codes assigned by OSCAR and an expert manual coder. Results: OSCAR was administered to 324 653 UK Biobank participants with an existing email address between June and September 2015. Complete 4-digit SOC-coded lifetime job histories were collected for 108 784 participants (response rate: 34%). Agreement between the 4-digit SOC codes assigned by OSCAR and the manual coder for a random sample of 400 job titles was moderately good [κ=0.45, 95% confidence interval (95% CI) 0.42-0.49], and improved when broader job categories were considered (κ=0.64, 95% CI 0.61-0.69 at a 1-digit SOC-code level). Conclusions: OSCAR is a novel, efficient, and reasonably reliable web-based tool for collecting and automatically coding lifetime job histories in large population-based studies. Further application in other research projects for external validation purposes is warranted.
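
Why agreement improves at broader levels of a hierarchical code can be shown with a small sketch. It assumes that a broader category is simply a shorter prefix of the 4-digit code, and it reports raw agreement only. The function name and the code strings are made up and are not real SOC assignments from the study.

```python
def agreement_by_level(codes_a, codes_b, levels=(4, 3, 2, 1)):
    """Share of items on which two coders agree, per number of leading digits."""
    n = len(codes_a)
    return {
        digits: sum(a[:digits] == b[:digits] for a, b in zip(codes_a, codes_b)) / n
        for digits in levels
    }


# Made-up 4-digit codes from a self-coding tool and from a manual coder.
tool_codes = ["1121", "2314", "5223", "9111"]
manual_codes = ["1122", "2314", "5312", "8111"]
for digits, share in agreement_by_level(tool_codes, manual_codes).items():
    print(f"{digits}-digit level: {share:.2f}")
```

Two codes that differ only in their last digit count as a disagreement at the 4-digit level but as an agreement at every broader level, so agreement can only stay the same or rise as digits are dropped. A chance-corrected statistic such as kappa could be computed on the truncated codes in the same way.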

Concepts: Collected, Fleiss' kappa, Inter-rater reliability, United Kingdom, Estimator, Source code, Cohen's kappa, Code


OBJECTIVE. The purpose of this study was to compare the diagnostic performance of four radiographic signs of gastric band slippage: abnormal phi angle, the “O sign,” inferior displacement of the superolateral gastric band margin, and presence of an air-fluid level above the gastric band. MATERIALS AND METHODS. A search of the electronic medical record identified 21 patients with a surgically proven slipped gastric band and 63 randomly-selected asymptomatic gastric band patients who had undergone barium swallow studies. These studies were evaluated for the four signs of band slippage by two independent radiologists who were blinded to clinical data. Sensitivity, specificity, and positive and negative predictive values were calculated for each radiographic sign of band slippage. Interobserver agreement between radiologists was assessed using the Fleiss kappa statistic. RESULTS. In evaluating for gastric band slippage, an abnormal phi angle greater than 58° was 91-95% sensitive and 52-62% specific (κ = 0.78), the O sign was 33-48% sensitive but 97% specific (κ = 0.84), inferior displacement of the superolateral band margin by more than 2.4 cm from the diaphragm was 95% sensitive and 97-98% specific (κ = 0.97), and the presence of an air-fluid level was 95% sensitive and 100% specific (κ = 1.00). CONCLUSION. We report two previously undescribed radiographic signs of gastric band slippage that are both sensitive and specific for this important surgical complication and recommend that these signs should be incorporated into the imaging evaluation of gastric band patients.
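
The four accuracy measures named in this abstract follow directly from a 2x2 table of test result against surgical reference. The sketch below uses hypothetical counts. Only the group sizes (21 slipped bands, 63 controls) come from the abstract; the function name and the split within each group are invented.

```python
def diagnostic_metrics(true_pos, false_neg, false_pos, true_neg):
    """Sensitivity, specificity, PPV and NPV from 2x2 counts."""
    return {
        "sensitivity": true_pos / (true_pos + false_neg),  # among diseased
        "specificity": true_neg / (true_neg + false_pos),  # among non-diseased
        "ppv": true_pos / (true_pos + false_pos),          # among positive tests
        "npv": true_neg / (true_neg + false_neg),          # among negative tests
    }


# Hypothetical reading of one sign: 18 of 21 slipped bands flagged,
# 6 of 63 controls flagged.
metrics = diagnostic_metrics(true_pos=18, false_neg=3, false_pos=6, true_neg=57)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Sensitivity and specificity depend only on performance within each group, whereas the predictive values also depend on the 1:3 case-to-control ratio of the sample and would change in a population with a different prevalence of slippage.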

Concepts: Joseph L. Fleiss, Medical terms, Positive predictive value, Fluoroscopy, Fleiss' kappa, Barium swallow, Medical imaging, Negative predictive value


BACKGROUND: Rater agreement is important in clinical research, and Cohen’s Kappa is a widely used method for assessing inter-rater reliability; however, there are well documented statistical problems associated with the measure. In order to assess its utility, we evaluated it against Gwet’s AC1 and compared the results. METHODS: This study was carried out across 67 patients (56% males) aged 18 to 67, with a mean ± SD age of 44.13 ± 12.68 years. Nine raters (7 psychiatrists, a psychiatry resident and a social worker) participated as interviewers, either for the first or the second interviews, which were held 4 to 6 weeks apart. The interviews were held in order to establish a personality disorder (PD) diagnosis using DSM-IV criteria. Cohen’s Kappa and Gwet’s AC1 were used and the level of agreement between raters was assessed in terms of a simple categorical diagnosis (i.e., the presence or absence of a disorder). Data were also compared with a previous analysis in order to evaluate the effects of trait prevalence. RESULTS: Gwet’s AC1 was shown to have higher inter-rater reliability coefficients for all the PD criteria, ranging from .752 to 1.000, whereas Cohen’s Kappa ranged from 0 to 1.00. Cohen’s Kappa values were high and close to the percentage of agreement when the prevalence was high, whereas Gwet’s AC1 values appeared not to change much with a change in prevalence, but remained close to the percentage of agreement. A Schizoid sample revealed a mean Cohen’s Kappa of .726 and a Gwet’s AC1 of .853, which fell within different levels of agreement according to criteria developed by Landis and Koch, and Altman and Fleiss. CONCLUSIONS: Based on the different formulae used to calculate the level of chance-corrected agreement, Gwet’s AC1 was shown to provide a more stable inter-rater reliability coefficient than Cohen’s Kappa. It was also found to be less affected by prevalence and marginal probability than Cohen’s Kappa, and therefore should be considered for use with inter-rater reliability analysis.
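
The "different formulae" for chance agreement mentioned in the conclusions can be compared in a short sketch for two raters and a present/absent diagnosis. It assumes the standard two-category forms: Cohen's chance term multiplies each rater's own marginals, while Gwet's AC1 uses 2π(1 − π), with π the average proportion of "present" ratings. The function name and the counts are hypothetical.

```python
def kappa_and_ac1(both_present, only_a_present, only_b_present, both_absent):
    """Cohen's kappa and Gwet's AC1 from a 2x2 table of two raters' diagnoses."""
    n = both_present + only_a_present + only_b_present + both_absent
    p_observed = (both_present + both_absent) / n
    present_a = (both_present + only_a_present) / n  # rater A's "present" rate
    present_b = (both_present + only_b_present) / n  # rater B's "present" rate

    # Cohen: chance agreement from the product of the raters' marginals.
    chance_kappa = present_a * present_b + (1 - present_a) * (1 - present_b)
    # Gwet: chance agreement from the averaged marginal (never exceeds 0.5).
    mean_present = (present_a + present_b) / 2
    chance_ac1 = 2 * mean_present * (1 - mean_present)

    if chance_kappa == 1.0:
        kappa = float("nan")  # both raters always give the same single answer
    else:
        kappa = (p_observed - chance_kappa) / (1 - chance_kappa)
    ac1 = (p_observed - chance_ac1) / (1 - chance_ac1)
    return kappa, ac1


# Two hypothetical tables with the same 90% raw agreement.
print(kappa_and_ac1(45, 5, 5, 45))  # ratings split evenly between categories
print(kappa_and_ac1(90, 5, 5, 0))   # one category dominates the ratings
```

Both tables have identical raw agreement, but in the second one Cohen's chance term is nearly as large as the observed agreement, so kappa collapses while AC1 stays close to the raw figure. This is one way to see the sensitivity to marginal probabilities that the abstract describes.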

Concepts: Schizoid personality disorder, Diagnostic and Statistical Manual of Mental Disorders, Scott's Pi, Categorical data, Joseph L. Fleiss, Inter-rater reliability, Fleiss' kappa, Cohen's kappa


INTRODUCTION: Many intensive care patients experience sleep disruption potentially related to noise, light and treatment interventions. The purpose of this study was to characterise, in terms of quantity and quality, the sleep of intensive care patients, taking into account the impact of environmental factors. METHODS: This observational study was conducted in the adult ICU of a tertiary referral hospital in Australia, enrolling 57 patients. Polysomnography (PSG) was performed over a 24 hour period to assess the quantity (total sleep time: hh:mm) and quality (percentage per stage, duration of sleep episode) of patients' sleep while in ICU. Rechtschaffen and Kales criteria were used to categorise sleep. Interrater checks were performed. Sound pressure and illuminance levels and care events were simultaneously recorded. Patients reported on their sleep quality in ICU using the Richards Campbell Sleep Questionnaire and the Sleep in Intensive Care Questionnaire. Data were summarized using frequencies and proportions or measures of central tendency and dispersion as appropriate and Cohen’s Kappa statistic was used for interrater reliability of the sleep data analysis. RESULTS: Patients' median total sleep time was 05:00 (IQR: 02:52-07:14). The majority of sleep was stage 1 and 2 (medians: 19 and 73%) with scant slow wave and REM sleep. The median duration of sleep without waking was 00:03. Sound levels were high (mean Leq 53.95 dB(A) during the day and 50.20 dB(A) at night) and illuminance levels were appropriate at night (median <2 lux) but low during the day (median: 74.20 lux). There was a median 1.7 care events/h. Patients' mean self-reported sleep quality was poor. Interrater reliability of sleep staging was highest for slow wave sleep and lowest for stage 1 sleep. CONCLUSIONS: The quantity and quality of sleep in intensive care patients are poor and may be related to noise, critical illness itself and treatment events that disturb sleep. The study highlights the challenge of quantifying sleep in the critical care setting and the need for alternative methods of measuring sleep. The results suggest that a sound reduction program and other interventions to improve clinical practices are required to promote sleep in intensive care patients. Trial registration: Australian New Zealand clinical trial registry (ACTRN12610000688088).

Concepts: Sound pressure, Intensive care medicine, Scott's Pi, Arithmetic mean, Fleiss' kappa, Cohen's kappa, Inter-rater reliability, Sleep


Genetically engineered mouse models offer essential opportunities to investigate the mechanisms of initiation and progression in melanoma. Here we report a new simplified histopathology classification of mouse melanocytic lesions in Tyr::NRASQ61K derived models, using an interactive decision tree that produces homogeneous categories. Reproducibility for this classification system was evaluated on a panel of representative cases of murine melanocytic lesions by pathologists and basic scientists. Reproducibility, measured as inter-rater agreement between evaluators using a modified Fleiss' kappa statistic revealed a very good agreement between observers. Should this new simplified classification be adopted, it would create a robust system of communication between researchers in the field of mouse melanoma models.

Concepts: Anatomical pathology, Pathology, All rights reserved, Copyright, Joseph L. Fleiss, Inter-rater reliability, Cohen's kappa, Fleiss' kappa


Purpose: To estimate the inter-observer reliability and agreement of offline analyses of three different ultrasound techniques for assessing tubal patency. Methods: 100 tubes (n = 100) in 50 women were evaluated for tubal patency between November 2013 and July 2015 using ultrasound as index tests and laparoscopy as the reference standard. Three different ultrasound techniques were applied: two-dimensional grayscale ultrasound using air + saline as the contrast media (2D-HyCoSy); two- and three-dimensional grayscale ultrasound using foam as the contrast media (2D/3D-HyFoSy); and the same technique but adding bi-directional power Doppler (2D/3D-Doppler-HyFoSy). The videos containing full standardized exams using these three techniques were split into three parts, anonymized, encoded, randomized and reassessed in November 2015 by two observers who assessed tubal patency using standardized criteria. These observers were blinded to any clinical information and each other’s results. Proportions of observed agreement (po) and Cohen’s Kappa (κ) including the 95% confidence intervals (CI) were calculated. Results: The inter-observer reliability/agreement in 2D/3D-Doppler-HyFoSy (po = 0.99, κ = 0.95, 95% CI: 0.93 - 0.97) was higher compared to 2D-air/saline-HyCoSy (po = 0.83, κ = 0.55, 95% CI: 0.40 - 0.68) and 2D/3D-HyFoSy (po = 0.92, κ = 0.67, 95% CI: 0.54 - 0.76). Conclusion: The inter-observer reliability and agreement of the diagnosis of tubal patency evaluating stored videos are improved when foam and power Doppler are used during acquisition. Therefore, this technique may be preferred to minimize misclassification and misdiagnosis.

Concepts: Statistics, Scientific method, Normal distribution, Medical imaging, Magnetic resonance imaging, Fleiss' kappa, Inter-rater reliability, Cohen's kappa