Concept: Inter-rater reliability


BACKGROUND: Systematic reviews have been challenged to consider effects on disadvantaged groups. A priori specification of subgroup analyses is recommended to increase the credibility of these analyses. This study aimed to develop and assess inter-rater agreement for an algorithm for systematic review authors to predict whether differences in effect measures are likely for disadvantaged populations relative to advantaged populations (only relative effect measures were addressed). METHODS: A health equity plausibility algorithm was developed using clinimetric methods with three items based on literature review, key informant interviews and methodology studies. The three items dealt with the plausibility of differences in relative effects across sex or socioeconomic status (SES) due to: 1) patient characteristics; 2) intervention delivery (i.e., implementation); and 3) comparators. Thirty-five respondents (consisting of clinicians, methodologists and research users) assessed the likelihood of differences across sex and SES for ten systematic reviews with these questions. We assessed inter-rater reliability using Fleiss multi-rater kappa. RESULTS: The proportion agreement was 66% for patient characteristics (95% confidence interval: 61% to 71%), 67% for intervention delivery (95% confidence interval: 62% to 72%) and 55% for the comparator (95% confidence interval: 50% to 60%). Inter-rater kappa, assessed with Fleiss kappa, ranged from 0 to 0.199, representing very low agreement beyond chance. CONCLUSIONS: Users of systematic reviews rated that important differences in relative effects across sex and socioeconomic status were plausible for a range of individual and population-level interventions. However, there was very low inter-rater agreement for these assessments. There is an unmet need for discussion of plausibility of differential effects in systematic reviews. Increased consideration of external validity and applicability to different populations and settings is warranted in systematic reviews to meet this need.
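As an illustration of the statistic named in this abstract, the following is a minimal sketch of Fleiss' multi-rater kappa using the standard textbook formula. The function name and the input layout (one row per rated item, one column per category, each cell holding the number of raters who chose that category) are assumptions made for this example; the abstract does not describe the authors' software or data layout.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    who assigned item i to category j (hypothetical input layout)."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must be rated by the same number of raters")
    n_categories = len(counts[0])

    # Share of all ratings that fell into each category.
    p_j = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Agreement among rater pairs for each item.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items          # mean observed agreement
    p_e = sum(p * p for p in p_j)       # agreement expected by chance
    # Undefined (division by zero) when all ratings fall in one category.
    return (p_bar - p_e) / (1 - p_e)
```

A value near 0, as reported above, means the raters agreed about as often as chance alone would predict, even when the raw proportion agreement looks moderate.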

Concepts: Evidence-based medicine, Assessment, Interval finite element, Meta-analysis, Cohen's kappa, Inter-rater reliability, Contract, Fleiss' kappa


Purpose: the Thai PPS Adult Suandok tool was translated from the Palliative Performance Scale (PPSv2) and had been used in Chiang Mai, Thailand for several years. Aim: to test the reliability and validity of the Thai translation of PPSv2. Design: a set of 22 palliative cases were used to determine a PPS score on Time-1, and repeated two weeks later as Time-2. A survey questionnaire was also completed for qualitative analysis. Participants: a total of 70 nurses and physicians from Maharaj Nakorn Hospital in Chiang Mai participated. Results: The Time-1 intraclass correlation coefficient (ICC) for absolute agreement is 0.911 (95% CI 0.86-0.96) and for consistency is 0.92 (95% CI 0.87-0.96). The Time-2 ICC for agreement is 0.905 (95% CI 0.85-0.95) and for consistency is 0.912 (95% CI 0.86-0.96). These findings indicate good agreement among participants and also were somewhat higher in the Time-2 re-test phase. Cohen’s kappa score is 0.55, demonstrating a moderate agreement. Thematic analysis from the surveys showed that 91% felt PPS to be a valuable clinical tool overall, with it being ‘very useful’ or ‘useful’ in several areas, including care planning (78% and 20%), disease monitoring (69% and 27%) and prognostication (61% and 31%), respectively. Some respondents noted difficulty in determining appropriate scores in paraplegic patients or those with feeding tubes, while others found the instructions long or difficult. Conclusion: the Thai PPS Adult Suandok translated tool has good inter- and intra-rater reliability and can be used regularly for clinical care.

Concepts: Psychometrics, Reliability, Covariance and correlation, Thailand, Cohen's kappa, Inter-rater reliability, Translation, Chiang Mai


BACKGROUND: Physical activity is assumed to be important in the prevention and treatment of frailty. It is however unclear to what extent frailty can be influenced, because an outcome instrument is lacking. OBJECTIVES: An Evaluative Frailty Index for Physical activity (EFIP) was developed based on the Frailty Index Accumulation of Deficits and clinimetric properties were tested. DESIGN: The content of the EFIP was determined in a written Delphi procedure. Intra-rater reliability, inter-rater reliability, and construct validity were determined in an observational study (n=24) and to determine responsiveness, the EFIP was used in a physical therapy intervention study (n=12). METHOD: Intra-rater reliability and inter-rater reliability were calculated using Cohen’s kappa, construct validity was determined by correlating the score on the EFIP with those on the Timed Up & Go Test (TUG), the Performance Oriented Mobility Assessment (POMA), and the Cumulative Illness Rating Scale for geriatrics (CIRS-G). Responsiveness was calculated by means of the Effect Size (ES), the Standardized Response Mean (SRM), and a paired sample t-test. RESULTS: Fifty items were included in the EFIP. Inter-rater (Cohen’s kappa: 0.72) and intra-rater reliability (Cohen’s kappa: 0.77 and 0.80) were good. A moderate correlation with the TUG, POMA, and CIRS-G was found (0.68, -0.66 and 0.61 respectively, P < 0.001). Responsiveness was moderate to good (ES: -0.72 and SRM: -1.14) for an intervention with a significant effect (P < 0.01). LIMITATIONS: The clinimetric properties of the EFIP have been tested in a small sample and anchor based responsiveness could not be determined. CONCLUSIONS: The EFIP is a reliable, valid, and responsive instrument to evaluate the effect of physical activity on frailty in research and clinical practice.
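The two responsiveness indices mentioned in this abstract can be sketched as follows. This assumes the commonly used definitions (ES as mean change divided by the standard deviation of baseline scores, SRM as mean change divided by the standard deviation of the change scores); the abstract does not state which variants the authors applied, and the function names are hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence


def effect_size(baseline: Sequence[float], follow_up: Sequence[float]) -> float:
    """ES: mean change divided by the SD of the baseline scores (assumed definition)."""
    change = [after - before for before, after in zip(baseline, follow_up)]
    return mean(change) / stdev(baseline)


def standardized_response_mean(
    baseline: Sequence[float], follow_up: Sequence[float]
) -> float:
    """SRM: mean change divided by the SD of the change scores (assumed definition)."""
    change = [after - before for before, after in zip(baseline, follow_up)]
    return mean(change) / stdev(change)
```

Under these definitions a negative value means scores fell after the intervention, which is consistent with the negative ES and SRM reported above if a lower EFIP score indicates less frailty.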

Concepts: Scientific method, Psychometrics, Student's t-test, Reliability, Cohen's kappa, Inter-rater reliability, Jacob Cohen, Fleiss' kappa


Aim  The aims of this study were to examine whether objective measurements of the 10-minute drooling quotient (DQ10) and the 5-minute drooling quotient (DQ5) are interchangeable; to assess agreement between the measurements and their accuracy in classifying drooling severity; and to develop a time-efficient clinical assessment. Method  The study cohort included 162 children (61 females, 101 males; mean age 11y 6mo, SD 4y 5mo, range 3y 9mo-22y 1mo) suffering from moderate to profuse drooling. One hundred and twenty-four had cerebral palsy and 38 had other developmental disabilities. Seventy-four of the participants were ambulant and 88 non-ambulant. The original DQ10 was recalculated into a 5-minute score (DQ5). Assessments were undertaken while the participants were in a rest situation (DQ(R)) and while they were active (DQ(A)). Agreement in scores was quantified using intraclass correlations and Bland-Altman plots. To classify drooling, area under the receiver operating characteristic curve analysis was used to compare accuracy of the DQ10 and DQ5 at rest and during activity. Results  Agreement between DQ10(A) and DQ5(A), and between DQ10(R) and DQ5(R), was high (intraclass correlation coefficient >0.90). Moderate agreement existed between DQ(A) and DQ(R). DQ(A) scores were more accurate in classifying children’s drooling behaviour. For DQ5(A), a cut-off point of 18 or more (drooling episodes/observation time) might indicate ‘constant drooling’. Interpretation  The DQ10 and DQ5 can be used interchangeably. DQ(A) is most discriminative for drooling severity. For evaluating treatment efficiency the cut-off point can be used. For clinical and research purposes, the DQ5 is time efficient and cost saving while validity, and intrarater and interrater reliability are preserved.
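The Bland-Altman analysis mentioned in this abstract rests on a simple calculation: the mean difference between two paired measurements (the bias) and limits of agreement placed 1.96 standard deviations either side of it. The sketch below uses that conventional formulation; the function name is an assumption for this example and no study data are reproduced.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def bland_altman_limits(
    method_a: Sequence[float], method_b: Sequence[float]
) -> Tuple[float, float, float]:
    """Return (bias, lower limit, upper limit) for paired measurements,
    using the conventional bias +/- 1.96 * SD of the differences."""
    diffs = [a - b for a, b in zip(method_a, method_b)]
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)
    return bias, bias - spread, bias + spread
```

Two measures are usually treated as interchangeable when the bias is close to zero and the limits of agreement are narrow enough to be clinically unimportant.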

Concepts: Evaluation, Assessment, Psychometrics, Correlation and dependence, Pearson product-moment correlation coefficient, Covariance and correlation, Receiver operating characteristic, Inter-rater reliability


BACKGROUND: Our aim is to implement a simple, rapid, and reliable method using computed tomography perfusion imaging and clinical judgment to target patients for reperfusion therapy in the hyper-acute stroke setting. We introduce a novel formula ((1 - infarct volume [CBV]/penumbra volume [MTT]) × 100%) to quantify mismatch percentage. METHODS: Twenty patients with anterior circulation strokes who underwent CT perfusion and received intravenous tissue plasminogen activator (IV tPA) were analyzed retrospectively. Nine blinded viewers determined volume of infarct and ischemic penumbra using the ABC/2 method and also the mismatch percentage. RESULTS: Interrater reliability using the volumetric formula (ABC/2) was very good (intraclass correlation [ICC] = .9440 and ICC = .8510) for hemodynamic parameters infarct (CBV) and penumbra (MTT). ICC coefficient using the mismatch formula ((1 - CBV/MTT) × 100%) was good (ICC of .635). CONCLUSIONS: The ABC/2 method of volume estimation on CT perfusion is a reliable and efficient approach to determine infarct and penumbra volumes. The (1 - CBV/MTT) × 100% formula produces a mismatch percentage assisting providers in communicating the proportion of salvageable brain and guides therapy in the setting of patients with unclear time of onset with potentially salvageable tissue who can undergo mechanical retrieval or intraarterial thrombolytics.
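The two calculations in this abstract are simple enough to write out. The sketch below assumes the usual reading of ABC/2 (the product of three perpendicular lesion diameters divided by two) and takes the mismatch percentage as one minus the ratio of infarct (CBV) volume to penumbra (MTT) volume, expressed as a percentage. Function and parameter names are hypothetical, and the example values are invented for illustration.

```python
def abc2_volume(a_cm: float, b_cm: float, c_cm: float) -> float:
    """Ellipsoid-style volume estimate from three perpendicular diameters (cm)."""
    return a_cm * b_cm * c_cm / 2


def mismatch_percentage(infarct_volume: float, penumbra_volume: float) -> float:
    """(1 - infarct [CBV] volume / penumbra [MTT] volume) * 100."""
    if penumbra_volume <= 0:
        raise ValueError("penumbra volume must be positive")
    return (1 - infarct_volume / penumbra_volume) * 100


# Invented example: a 20 mL infarct inside an 80 mL penumbra.
example = mismatch_percentage(20, 80)  # 75.0
```

In this reading, a higher percentage corresponds to a larger share of the perfusion abnormality that is not yet infarcted.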

Concepts: Volume, Stroke, Thermodynamics, Tissue plasminogen activator, Thrombolysis, Plasmin, Inter-rater reliability, Formula


BACKGROUND: While “diagrammatic” evaluation of finger joint angles using two folded paper strips as goniometric arms has been proposed and could be an alternative to standard goniometry and a means for self-evaluation, the measurement differences and reliability are unknown. QUESTIONS/PURPOSES: This study assessed the standard and diagrammatic finger goniometry performed by an experienced examiner on patients in terms of (1) intragoniometer and intergoniometer (ie, intrarater) differences and reliability; (2) interrater differences and reliability relative to patients' diagrammatic self-evaluation; and (3) the interrater differences related to patients' hand dominance. METHODS: Sixty-one patients without previous training self-evaluated active extension of all joints of the fifth finger of one hand once using two rectangular strips of paper. A practitioner used a goniometer and a diagram to perform parallel evaluations once in 12 patients and three times in 49 patients. The diagrams were scanned and measured. All evaluations and proportions of differences between the paired measurements of 5° or less were combined for analysis. RESULTS: Intrarater intraclass correlation coefficients (ICC) based on the second and third practitioner’s trials for the proximal interphalangeal joint were greater than 0.99. Reliability was poor when calculations involved the first measurement of the practitioner (ICCs < 0.38). Interrater reliability was poor regardless of the practitioner's trial (ICCs < 0.033). The proportions of the absolute differences of 5° or less between all paired practitioner's measurements were similar. The proportions of the acceptable differences between paired practitioner's and patients' measurements were nonequivalent for the interphalangeal joints. The interrater differences did not depend on patients' handedness. CONCLUSIONS: In experienced hands both techniques produce clinically comparable reliability, but patients' performance in extempore diagrammatic self-evaluation is inadequate. Further studies are necessary to explore whether appropriate training of patients can improve consistency of diagrammatic self-evaluation. LEVEL OF EVIDENCE: Level III, diagnostic study. See Guidelines for Authors for a complete description of levels of evidence.

Concepts: Measurement, Joints, Diagram, Contact angle, Finger, Hand, Inter-rater reliability, Goniometer


To establish the intrarater and interrater reliability of Wisconsin Gait Scale (WGS) in hemiplegic patients.

Concepts: Stroke, Inter-rater reliability


There are over 165,000 mHealth apps currently available to patients, but few have undergone an external quality review. Furthermore, no standardized review method exists, and little has been done to examine the consistency of the evaluation systems themselves.

Concepts: Scientific method, Evaluation, Smoking, Nicotine, Smoking cessation, Systems engineering, Inter-rater reliability


Mobile eye-trackers are currently used during real-world tasks (e.g. gait) to monitor visual and cognitive processes, particularly in ageing and Parkinson’s disease (PD). However, contextual analysis involving fixation locations during such tasks is rarely performed due to its complexity. This study adapted a validated algorithm and developed a classification method to semi-automate contextual analysis of mobile eye-tracking data. We further assessed inter-rater reliability of the proposed classification method. A mobile eye-tracker recorded eye-movements during walking in five healthy older adult controls (HC) and five people with PD. Fixations were identified using a previously validated algorithm, which was adapted to provide still images of fixation locations (n = 116). The fixation location was manually identified by two raters (DH, JN), who classified the locations. Cohen’s kappa correlation coefficients determined the inter-rater reliability. The algorithm successfully provided still images for each fixation, allowing manual contextual analysis to be performed. The inter-rater reliability for classifying the fixation location was high for both PD (kappa = 0.80, 95% agreement) and HC groups (kappa = 0.80, 91% agreement), which indicated a reliable classification method. This study developed a reliable semi-automated contextual analysis method for gait studies in HC and PD. Future studies could adapt this methodology for various gait-related eye-tracking studies.
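For two raters classifying the same set of items, as in this abstract, raw percentage agreement and Cohen's kappa can be computed as in the minimal sketch below. It uses the standard unweighted kappa formula; the function names are assumptions for this example, and the authors' own software is not described in the abstract.

```python
from collections import Counter
from typing import Hashable, Sequence


def percent_agreement(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Fraction of items on which the two raters chose the same category."""
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Unweighted Cohen's kappa: observed agreement corrected for chance."""
    n = len(rater_a)
    observed = percent_agreement(rater_a, rater_b)
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement from each rater's marginal category frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    # Undefined (division by zero) if both raters use a single category throughout.
    return (observed - expected) / (1 - expected)
```

Because kappa discounts agreement expected from the raters' category frequencies, two groups can share the same kappa while showing different raw agreement, as the PD and HC figures above do.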

Concepts: Scientific method, Reliability, Cohen's kappa, Inter-rater reliability, Fleiss' kappa


An archival descriptive study of public figure attackers in the United States between 1995 and 2015 was undertaken. Fifty-six incidents were identified, primarily through exhaustive internet searches, composed of 58 attackers and 58 victims. A code book was developed which focused upon victims, offenders, pre-attack behaviors including direct threats, attack characteristics, post-offense and other outcomes, motivations and psychological abstracts. The average interrater agreement for coding of bivariate variables was 0.835 (intraclass correlation coefficient). The three most likely victim categories were politicians, judges, and athletes. Attackers were males, many with a psychiatric disorder, most were grandiose, and most had both a violent and nonviolent criminal history. The known motivations for the attacks were often angry and personal, the most common being dissatisfaction with a judicial or other governmental process (23%). In only one case was the primary motivation to achieve notoriety. Lethality risk during an attack was 55%. Collateral injury or death occurred in 29% of the incidents. Only 5% communicated a direct threat to the target beforehand. The term “publicly intimate figure” is introduced to describe the sociocultural blurring of public and private lives among the targets, and its possible role in some attackers' perceptions and motivations. Copyright © 2016 John Wiley & Sons, Ltd.

Concepts: Psychology, United States, English language, Covariance and correlation, Attack, Motivation, Inter-rater reliability, Attack!