A common approach for determining musical competence is to rely on information about individuals' extent of musical training, but relying on musicianship status fails to identify musically untrained individuals with musical skill, as well as those who, despite extensive musical training, may not be as skilled. To counteract this limitation, we developed a new test battery (Profile of Music Perception Skills; PROMS) that measures perceptual musical skills across multiple domains: tonal (melody, pitch), qualitative (timbre, tuning), temporal (rhythm, rhythm-to-melody, accent, tempo), and dynamic (loudness). The PROMS has satisfactory psychometric properties for the composite score (internal consistency and test-retest r > .85) and fair to good coefficients for the individual subtests (.56 to .85). Convergent validity was established with the relevant dimensions of Gordon’s Advanced Measures of Music Audiation and Musical Aptitude Profile (melody, rhythm, tempo), the Musical Ear Test (rhythm), and sample instrumental sounds (timbre). Criterion validity was evidenced by consistently sizeable and significant relationships between test performance and external musical proficiency indicators in all three studies (.38 to .62, p < .05 to p < .01). An absence of correlations between test scores and a nonmusical auditory discrimination task supports the battery's discriminant validity (-.05, ns). The interrelationships among the various subtests could be accounted for by two higher-order factors, sequential and sensory music processing. A brief version of the PROMS is introduced as a time-efficient approximation of the full battery.
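As an illustration of what an internal-consistency coefficient measures, the sketch below computes Cronbach's alpha from a participants-by-items score table. The abstract does not say which coefficient was used for the PROMS, so the choice of alpha, the function name `cronbach_alpha`, and the sample scores are all assumptions made for illustration.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of rows (one row per participant,
    one column per item). Illustrative helper, not the PROMS procedure."""
    n_items = len(scores[0])
    if n_items < 2 or len(scores) < 2:
        raise ValueError("need at least two items and two participants")
    # Sample variance of each item (column) across participants.
    item_vars = [variance(col) for col in zip(*scores)]
    # Sample variance of each participant's total score.
    total_var = variance([sum(row) for row in scores])
    if total_var == 0:
        raise ValueError("total scores have zero variance")
    return n_items / (n_items - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical scores: 4 participants, 3 items.
example = [[2, 3, 3], [4, 4, 5], [1, 2, 1], [5, 4, 5]]
print(round(cronbach_alpha(example), 3))
```

Alpha approaches 1 when items rise and fall together across participants, which is the property the composite-score coefficient above .85 describes.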

Concepts: Psychometrics, Skill, Validity, Reliability, Sound, Music, Test


Chronic polysubstance abuse, a form of substance use disorder (SUD), is associated with neurophysiological and neuroanatomical changes. Neurocognitive impairment tends to affect quality of life, occupational functioning, and the ability to benefit from therapy. Neurocognitive assessment is thus important, but it is costly and not widely available. Therefore, in a busy clinical setting, procedures that include readily available measures targeting core cognitive deficits would be beneficial. This paper investigates the utility of psychometric tests and a questionnaire-based inventory to assess “hot” and “cold” neurocognitive measures of executive functions (EF) in adults with a substance use disorder. Hot decision-making processes are associated with emotional, affective, and visceral responses, while cold executive functions are associated with rational decision-making.

Concepts: Evaluation, Assessment, Educational psychology, Psychometrics, Neuropsychology, Emotion, Test, Psychological testing


Many US biomedical PhD programs receive more applications for admissions than they can accept each year, necessitating a selective admissions process. Typical selection criteria include standardized test scores, undergraduate grade point average, letters of recommendation, a resume and/or personal statement highlighting relevant research or professional experience, and feedback from interviews with training faculty. Admissions decisions are often founded on assumptions that these application components correlate with research success in graduate school, but these assumptions have not been rigorously tested. We sought to determine if any application components were predictive of student productivity measured by first-author student publications and time to degree completion. We collected productivity metrics for graduate students who entered the umbrella first-year biomedical PhD program at the University of North Carolina at Chapel Hill between 2008 and 2010 and analyzed components of their admissions applications. We found no correlations of test scores, grades, amount of previous research experience, or faculty interview ratings with high or low productivity among those applicants who were admitted and chose to matriculate at UNC. In contrast, ratings from recommendation letter writers were significantly stronger for students who published multiple first-author papers in graduate school than for those who published no first-author papers during the same timeframe. We conclude that the most commonly used standardized test (the general GRE) is a particularly ineffective predictive tool, but that qualitative assessments by previous mentors are more likely to identify students who will succeed in biomedical graduate research. Based on these results, we conclude that admissions committees should avoid over-reliance on any single component of the application and de-emphasize metrics that are minimally predictive of student productivity.
We recommend continual tracking of desired training outcomes combined with retrospective analysis of admissions practices to guide both application requirements and holistic application review.

Concepts: College, School, Graduate school, Test, Bachelor's degree, Standardized test, Graduate Record Examination, Statement of purpose


Using test data for all children attending Danish public schools between school years 2009/10 and 2012/13, we examine how the time of the test affects performance. Test time is determined by the weekly class schedule and computer availability at the school. We find that, for every hour later in the day, test performance decreases by 0.9% of an SD (95% CI, 0.7-1.0%). However, a 20- to 30-minute break improves average test performance by 1.7% of an SD (95% CI, 1.2-2.2%). These findings have two important policy implications: First, cognitive fatigue should be taken into consideration when deciding on the length of the school day and the frequency and duration of breaks throughout the day. Second, school accountability systems should control for the influence of external factors on test scores.
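The two point estimates above can be combined in a small worked calculation. The sketch below assumes the effects are linear and additive, which the abstract does not state, and the function name `expected_shift_sd` is hypothetical. It uses only the reported point estimates (0.9% of an SD per hour, 1.7% of an SD for a break) and ignores the confidence intervals.

```python
# Point estimates reported in the abstract, expressed as fractions of an SD.
HOUR_EFFECT_SD = -0.009   # change per hour later in the day
BREAK_EFFECT_SD = 0.017   # change after a 20- to 30-minute break


def expected_shift_sd(hours_later, after_break=False):
    """Expected change in test performance (in SD units) relative to a
    test taken `hours_later` hours earlier with no preceding break.
    Assumes linear, additive effects (an illustrative simplification)."""
    shift = HOUR_EFFECT_SD * hours_later
    if after_break:
        shift += BREAK_EFFECT_SD
    return shift


# A test taken three hours later, without and with a preceding break.
print(round(expected_shift_sd(3), 3))                    # -0.027
print(round(expected_shift_sd(3, after_break=True), 3))  # -0.01
```

Under these assumptions, a single break offsets a little under two hours of time-of-day decline.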

Concepts: Education, Psychometrics, High school, Test, Standardized test


Using a multilevel approach, we estimated the effects of classroom ventilation rate and temperature on academic achievement. The analysis is based on measurement data from a school district with 70 elementary schools (140 fifth-grade classrooms) in the Southwestern United States, and on student-level data (N = 3109) on socioeconomic variables and standardized test scores. There was a statistically significant association between ventilation rates and mathematics scores, and it was stronger when the six classrooms with high ventilation rates (> 7.1 l/s per person) that were flagged as outliers were excluded. The association remained significant when prior-year test scores were included in the model, resulting in less unexplained variability. Students' mean mathematics scores (average 2286 points) increased by up to eleven points (0.5%) for each 1 l/s per person increase in ventilation rate within the range of 0.9-7.1 l/s per person (estimated effect size 74 points). There was an additional increase of 12-13 points for each 1°C decrease in temperature within the observed range of 20-25°C (estimated effect size 67 points). Effects of similar magnitude but higher variability were observed for reading and science scores. In conclusion, maintaining adequate ventilation and thermal comfort in classrooms could significantly improve the academic achievement of students.
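The percentage quoted for the ventilation effect can be checked with simple arithmetic. The sketch below converts the eleven-point figure into a share of the 2286-point mean and extrapolates linearly across the stated ventilation range. The linear extrapolation and the helper names are assumptions for illustration; the abstract's own effect-size estimates (74 and 67 points) come from the fitted multilevel model and are not reproduced exactly by this simplification.

```python
MEAN_MATH_SCORE = 2286.0   # average mathematics score from the abstract
POINTS_PER_LPS = 11.0      # upper-end gain per 1 l/s per person
VENT_RANGE = (0.9, 7.1)    # l/s per person, range over which the effect holds


def percent_of_mean(points):
    """Express a point gain as a percentage of the mean mathematics score."""
    return 100.0 * points / MEAN_MATH_SCORE


def linear_gain(vent_from, vent_to):
    """Point gain from raising ventilation, clamped to the studied range.
    Assumes a constant slope of POINTS_PER_LPS (illustrative only)."""
    low, high = VENT_RANGE
    start = min(max(vent_from, low), high)
    end = min(max(vent_to, low), high)
    return POINTS_PER_LPS * (end - start)


print(round(percent_of_mean(POINTS_PER_LPS), 1))  # 0.5
print(round(linear_gain(0.9, 7.1), 1))            # 68.2
```

The first result matches the 0.5% reported in the abstract; the second shows that a constant eleven-point slope over the full range gives a figure of the same order as, but not identical to, the reported 74 points.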

Concepts: Statistics, Mathematics, Statistical significance, Psychometrics, Effect size, Statistical power, Test, Standardized test


Children living in poverty generally perform poorly in school, with markedly lower standardized test scores and lower educational attainment. The longer children live in poverty, the greater their academic deficits. These patterns persist to adulthood, contributing to lifetime-reduced occupational attainment.

Concepts: Psychometrics, Test, Standardized test


Many factors have been proposed to explain the attrition of women in science, technology, engineering and math fields, among them the lower performance of women in introductory courses resulting from deficits in incoming preparation. We focus on the impact of mixed methods of assessment, which minimize the impact of high-stakes exams and reward other methods of assessment such as group participation, low-stakes quizzes and assignments, and in-class activities. We hypothesized that these mixed methods would benefit individuals who otherwise underperform on high-stakes tests. Here, we analyze gender-based performance trends in nine large (N > 1000 students) introductory biology courses in fall 2016. Females underperformed on exams compared to their male counterparts, a difference that does not exist with the other methods of assessment that compose the course grade. Further, we analyzed three case studies of courses that transitioned their grading schemes to either de-emphasize or emphasize exams as a proportion of total course grade. We demonstrate that the shift away from an exam emphasis benefits female students, thereby closing gaps in overall performance. Further, the exam performance gap itself is reduced when the exams contribute less to overall course grade. We discuss testable predictions that follow from our hypothesis, and advocate for the use of mixed methods of assessment (possibly as part of an overall shift to active learning techniques). We conclude by challenging the student deficit model, and suggest a course deficit model as explanatory of these performance gaps, whereby the microclimate of the classroom can either raise or lower barriers to success for underrepresented groups in STEM.

Concepts: Scientific method, Prediction, Assessment, Educational psychology, Psychometrics, Hypothesis, Theory, Test


Handling laboratory animals during test procedures is an important source of stress that may impair reliability of test responses. Picking up mice by the tail is aversive, stimulating stress and anxiety. Responses among anxious animals can be confounded further by neophobia towards novel test environments and avoidance of test stimuli in open areas. However, handling stress can be reduced substantially by using a handling tunnel, or cupping mice without restraint on the open hand. Here we establish whether non-aversive handling, brief prior familiarisation with the test arena and alternative stimulus placement could significantly improve performance of mice in behavioural tests. We use a simple habituation-dishabituation paradigm in which animals must discriminate between two urine stimuli in successive trials, a task that mice can easily perform. Tail handled mice showed little willingness to explore and investigate test stimuli, leading to poor test performance that was only slightly improved by prior familiarisation. By contrast, those handled by tunnel explored readily and showed robust responses to test stimuli regardless of prior familiarisation or stimulus location, though responses were more variable for cup handling. Our study shows that non-aversive tunnel handling can substantially improve mouse performance in behavioural tests compared to traditional tail handling.

Concepts: Anxiety, Better, Psychology, Improve, Performance, Mouse, Fear, Test


Working dog organisations, such as Guide Dogs, need to regularly assess the behaviour of the dogs they train. In this study we developed a questionnaire-style behaviour assessment completed by training supervisors of juvenile guide dogs aged 5, 8 and 12 months old (n = 1,401), and evaluated aspects of its reliability and validity. Specifically, internal reliability, temporal consistency, construct validity, predictive criterion validity (comparing against later training outcome) and concurrent criterion validity (comparing against a standardised behaviour test) were evaluated. Thirty-nine questions were sourced either from previously published literature or created to meet requirements identified via Guide Dogs staff surveys and staff feedback. Internal reliability analyses revealed seven reliable and interpretable trait scales named according to the questions within them as: Adaptability; Body Sensitivity; Distractibility; Excitability; General Anxiety; Trainability and Stair Anxiety. Intra-individual temporal consistency of the scale scores between 5-8, 8-12 and 5-12 months was high. All scales except Body Sensitivity showed some degree of concurrent criterion validity. Predictive criterion validity was supported for all seven scales, since associations were found with training outcome at at least one age. Thresholds of z-scores on the scales were identified that were able to distinguish later training outcome by identifying 8.4% of all dogs withdrawn for behaviour and 8.5% of all qualified dogs, with 84% and 85% specificity. The questionnaire assessment was reliable and could detect traits that are consistent within individuals over time, despite juvenile dogs undergoing development during the study period. By applying thresholds to scores produced from the questionnaire this assessment could prove to be a highly valuable decision-making tool for Guide Dogs.
This is the first questionnaire-style assessment of juvenile dogs that has shown value in predicting the training outcome of individual working dogs.
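To make the threshold-based reporting concrete, the sketch below shows how the share of withdrawn dogs identified (sensitivity) and the specificity follow from applying a cut-off to z-scores. The data, the threshold, the function name, and the assumption that higher scores indicate a higher risk of withdrawal are all hypothetical; none of them come from the study.

```python
def sensitivity_specificity(z_scores, withdrawn, threshold):
    """Flag dogs whose z-score is at or above `threshold` and compare the
    flags with the known outcome. Returns (sensitivity, specificity)."""
    tp = fn = tn = fp = 0
    for score, was_withdrawn in zip(z_scores, withdrawn):
        flagged = score >= threshold
        if was_withdrawn and flagged:
            tp += 1          # withdrawn dog correctly flagged
        elif was_withdrawn:
            fn += 1          # withdrawn dog missed
        elif flagged:
            fp += 1          # qualified dog wrongly flagged
        else:
            tn += 1          # qualified dog correctly not flagged
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one dog in each outcome group")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical z-scores on one trait scale and hypothetical outcomes.
scores = [2.1, 0.3, -0.5, 1.8, 0.0, -1.2, 2.5, 0.4]
outcome = [True, True, False, False, False, False, True, False]
sens, spec = sensitivity_specificity(scores, outcome, threshold=1.5)
print(round(sens, 3), round(spec, 3))  # 0.667 0.8
```

Raising the threshold trades sensitivity for specificity: fewer qualified dogs are wrongly flagged, but a smaller share of the dogs later withdrawn is identified.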

Concepts: Scientific method, Psychometrics, Validity, Test validity, Criterion validity, Construct validity, Dog, Test


The primary purpose of this study was to investigate the intra-tester and inter-tester reliability of the dial test using a handheld digital inclinometer. Additionally, we examined the responsiveness of the test, and side-to-side differences for meaningful comparison.

Concepts: Test, The Dial