I scraped competitive results from a few hundred ballroom dance events to estimate the distribution of concordance rates between judges. As expected, judging is quite subjective but significantly better than random.
Collegiate Ballroom Dance
A ballroom dance event takes place with a number of couples (usually too many for the given floor size) assembled on a dance floor. Around 3-7 judges stand at the periphery, compare all the couples, and issue a set number of callbacks (i.e., votes for whom they want to see in the next round). In the very last round, judges rank all N couples from 1 to N. The specific rules come from the Skating System of Judging, which has many more rules for breaking ties. Other than that, judging is a fairly simple process. All results are recorded in a (pretty poorly designed) system called o2cm.
Example final round (full rankings)
7 judges have each ranked all 8 competitors
Example semifinal round (callbacks)
5 judges have each voted for their favorite 8 competitors (X); the top 8 advance to the final
It’s easy after a competition to pore over one’s callback numbers and final rankings and get caught up in which dances need the most work, or why some events went well and others didn’t. My coach was skeptical of the reliability of these conclusions, saying that there was probably a lot of noise from different judge preferences, moods, or levels of paying attention. To test this hypothesis, I decided to take a look at the data.
I used the package scrapeR to extract tables from the finals round of all 239 events I’ve participated in (as of March 2, 2019). The extracted raw data looked something like this (from my first event to my most recent):
Once the data was properly collected and cleaned (this took a while), the statistical analysis was quite straightforward. There are many methods of estimating interrater reliability, but I chose Kendall’s W. Kendall’s W is equivalent to the Friedman test statistic normalized to the range 0-1 (0 indicating no agreement at all between judges and 1 indicating unanimous rankings). It has several properties that make it a good choice for this problem.
- Ordinal: accounts for magnitude of disagreement with discordant rankings
- Non-parametric: makes no assumptions about the underlying probability distribution
- Generalizes and scales naturally to any number of raters
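For readers who want to see the arithmetic, here is a minimal Python sketch of Kendall’s W (not the code used for the analysis in this post). It assumes every judge ranks every couple with no ties, which is what a finals round looks like, and uses the standard formula W = 12S / (m²(n³ − n)), where m is the number of judges, n the number of couples, and S the sum of squared deviations of each couple’s rank sum from the mean rank sum. The function name `kendalls_w` and the input layout are my own choices for illustration.

```python
def kendalls_w(rankings):
    """Kendall's W for complete, tie-free rankings.

    rankings[j][i] is the place (1..n) that judge j gave couple i.
    """
    m = len(rankings)       # number of judges
    n = len(rankings[0])    # number of couples
    # Total of the places each couple received across all judges
    rank_sums = [sum(judge[i] for judge in rankings) for i in range(n)]
    mean_sum = m * (n + 1) / 2
    # Squared spread of the rank sums around their mean
    s = sum((r - mean_sum) ** 2 for r in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Three judges in full agreement on four couples
unanimous = [[1, 2, 3, 4]] * 3
# Two judges with exactly opposite opinions on three couples
opposite = [[1, 2, 3], [3, 2, 1]]

print(kendalls_w(unanimous))  # 1.0
print(kendalls_w(opposite))   # 0.0
```

Rounds with tied places would need the tie-corrected version of the formula, which this sketch leaves out.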
It might be good here to give an intuition of what each Kendall’s W actually looks like. Here are some sample tables with Kendall’s W = 0.2, 0.5, and 0.8.
|W = 0.2||J1||J2||J3||J4||J5|

|W = 0.5||J1||J2||J3||J4||J5|

|W = 0.8||J1||J2||J3||J4||J5|
To estimate the distribution of concordance values, I computed Kendall’s W using measured data and randomly simulated data (with the same couple:judge dimensions) for all 239 finals rounds. The two distributions are shown below.
- The mean observed concordance between judges (red curve) was 0.43.
- The mean simulated concordance between random rankings (blue curve) was 0.13.
The two curves are obviously quite different, suggesting that judges almost certainly do better than random. However, there’s some overlap that indicates that at least some rounds end up being pretty indistinguishable from random.
Since we’ve explicitly modeled the null distribution here, we can call a round significantly better than random if its observed W exceeds the 95th percentile of the null distribution (in this case, 0.283). Using this method, we find that 78.7% of finals rounds have judge concordances that are better than random (at significance level α = 0.05).
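The simulate-then-threshold idea can be sketched in a few lines of Python. This is an illustration under my own assumptions, not the original analysis code: the round dimensions and observed W values below are made up, the helper names are hypothetical, and the quantile is a simple order-statistic estimate rather than an interpolated one.

```python
import random


def kendalls_w(rankings):
    """Kendall's W for complete, tie-free rankings (rows are judges)."""
    m, n = len(rankings), len(rankings[0])
    rank_sums = [sum(judge[i] for judge in rankings) for i in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((r - mean_sum) ** 2 for r in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


def simulate_null(dims, rng):
    """One random round per (n_couples, n_judges) pair in dims.

    Each simulated judge ranks the couples uniformly at random.
    """
    ws = []
    for n_couples, n_judges in dims:
        places = list(range(1, n_couples + 1))
        round_ = [rng.sample(places, n_couples) for _ in range(n_judges)]
        ws.append(kendalls_w(round_))
    return ws


def quantile(values, q):
    """Empirical quantile: the sorted value at index int(q * len(values))."""
    ordered = sorted(values)
    idx = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[idx]


rng = random.Random(0)
# Hypothetical dimensions: 2000 rounds of 8 couples and 7 judges
dims = [(8, 7)] * 2000
null_ws = simulate_null(dims, rng)
threshold = quantile(null_ws, 0.95)

# Hypothetical observed W values, one per real finals round
observed_ws = [0.62, 0.21, 0.45, 0.10, 0.71]
frac = sum(w > threshold for w in observed_ws) / len(observed_ws)
print(f"threshold = {threshold:.3f}, fraction above = {frac:.2f}")
```

One sanity check on a simulation like this: for m judges ranking at random, the expected value of W is exactly 1/m, so the simulated mean for 7 judges should land near 0.14.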
Just for kicks, we can also compute p-values based on a chi-squared approximation to the distribution of the Friedman test statistic. The p-values for all finals rounds are shown below.
This approximation finds that 69.9% of finals rounds have judge concordances that are better than random (at significance level α = 0.05), which is pretty close to the estimate derived from our simulated null distribution.
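As a rough sketch of the chi-squared route: the Friedman statistic is m(n − 1)W for m judges and n couples, and it is compared against a chi-squared distribution with n − 1 degrees of freedom. The Python below is my own stdlib-only illustration, not the code behind the numbers above; `chi2_sf` and `friedman_p_value` are hypothetical helper names, and the upper-tail probability is built from the recurrence Q(a + 1, x) = Q(a, x) + xᵃe⁻ˣ / Γ(a + 1) for the regularized incomplete gamma function.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for chi-squared with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    # Closed-form starting points: df = 2 (a = 1) or df = 1 (a = 1/2)
    if df % 2 == 0:
        a, q = 1.0, math.exp(-half)
    else:
        a, q = 0.5, math.erfc(math.sqrt(half))
    # Step a up by 1 until it reaches df / 2
    while a < df / 2.0:
        q += half ** a * math.exp(-half) / math.gamma(a + 1.0)
        a += 1.0
    return q


def friedman_p_value(w, n_couples, n_judges):
    """Approximate p-value for Kendall's W via the Friedman statistic."""
    chi2 = n_judges * (n_couples - 1) * w
    return chi2_sf(chi2, n_couples - 1)


# Hypothetical round: 8 couples, 7 judges, W = 0.43
p = friedman_p_value(0.43, n_couples=8, n_judges=7)
print(f"p = {p:.4f}")
```

The approximation is known to be rough when there are few judges and few couples, which is one plausible reason it disagrees somewhat with the simulated null.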
For the 20-30% of ranked rounds that are indistinguishable from random, we might suspect that this is driven by some lurking variable. The ones that come most readily to mind (and that are most easily accessible from o2cm) are level and dance style. For example, it might be the case that at lower levels, all dancers are making many mistakes and it may be hard to distinguish them. It is also possible that the additional styling that happens in Rhythm and Latin dances makes it easier to distinguish dancers. I just made up those hypotheses, but let’s see what the data has to say.
Qualitatively, I’m convinced that there’s no meaningful difference between categories in either level or dance style. Quantitatively, we can turn to an F-test for analysis of variance.
|df||Sum Sq||Mean Sq||F value||Pr(>F)|
The F-test shows a significant difference in the means across levels, and a nonsignificant but suggestive difference in means across dances. Given the low sample sizes and potential non-normality within categories, I’m still not convinced that either difference is meaningful.
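For anyone curious what goes into the F value column, here is a small Python sketch of the one-way ANOVA F statistic. It is illustrative only: the group labels and W values are invented, `one_way_f` is a hypothetical helper, and it stops at the F statistic and degrees of freedom without computing Pr(>F).

```python
def one_way_f(groups):
    """One-way ANOVA F statistic for a list of groups of values.

    Returns (F, df_between, df_within).
    """
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(g) / len(g) for g in groups]
    # Variation of group means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Variation of values around their own group mean
    ss_within = sum((v - m) ** 2
                    for g, m in zip(groups, means) for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within


# Hypothetical Kendall's W values grouped by level (labels are made up)
w_by_level = {
    "level A": [0.31, 0.42, 0.38, 0.47],
    "level B": [0.45, 0.52, 0.40, 0.49],
    "level C": [0.50, 0.44, 0.58, 0.55],
}
f, df1, df2 = one_way_f(list(w_by_level.values()))
print(f"F({df1}, {df2}) = {f:.2f}")
```

A large F means the group means differ by more than the within-group scatter would suggest, but with only a handful of rounds per category the statistic is sensitive to a few unusual rounds.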
While my coach was probably just trying to keep me from overthinking each result, it’s pretty clear that judging (at least for finals rounds) is significantly better than random. That being said, it’s not wildly better; 20-30% of all finals rounds are discordant enough that they are not statistically different from random.
There are some caveats here. First, I only looked at events that I compete in; this might not generalize to higher levels of competition (e.g., Pre-Champ and Championship). Second, there may still be value in poring over judging results for rounds that are particularly concordant (e.g., everyone ranked you last). But for most rounds, the data confirms what we pretty much already knew: judging results have value, but are certainly not reliable enough to stress over.