Opponent-Adjusted Performance Score: An Alternative to Speaker Points as the First Tiebreaker

Introduction

Speaker points suck – we all know it, and we all complain about it. Their problems are why more and more tournaments have started minimizing their influence by breaking all 4-2’s or 5-2’s.

In this article, I will propose a way to make the use of speaker points as a tiebreaker better – hopefully significantly better. I call it Opponent-Adjusted Performance Score, and it looks like this:

$$\mathrm{OAPS} = \mathrm{TPS} + \frac{1}{n}\sum_{r=1}^{n} \mathrm{TPS}_{\mathrm{opp}(r)}$$

I promise, it’s easier than it looks. Please bear with me. 🙂

The Problem with Speaker Points (If you already know this you can skip to the next section)

1. Inconsistent scales

Anyone who has ever looked at a tournament result packet or read a few judge paradigms realizes that the scales judges use are entirely inconsistent. Some judges regularly assign double 30’s, while others go seasons without assigning a single one. While some judges will not give below a 27 unless something horrific takes place in the round, other judges regularly drop as low as 24 or 25. Because debaters will likely be judged by very different sets of judges, this inconsistency makes speaker points (SP) largely useless as a metric for comparing the quality of debaters in a tournament.

Some tournaments have attempted to address this problem by explicitly endorsing certain scales. This strategy has always failed. Judges either fail to effectively implement the change, slide back into old habits, or completely ignore the suggestions.

2. Point Inflation

A related but distinct problem is that average speaker points have been steadily rising over time. Inflation has many potential causes, but I think the most commonly cited is the desire of judges to make debaters happy. If a judge has to make a decision in a close round, has an RFD they are not confident in, or wants to keep a debater’s coach happy, they will give higher than normal points. Some judges also feel compelled to give high points to good debaters even in rounds where they perform poorly, because giving a 28 might keep someone otherwise deserving from breaking. These incentives also create upward pressure on the scales of all other judges. As points rise, suddenly anything below a 29 can “screw” a debater, causing them not to break, so other judges adjust their scales upward to avoid being the judge that ruined a debater’s tournament. Point inflation makes SP less useful as a tiebreaker because it compresses the scale (meaning there is less room to distinguish debaters of different seeds), and SP cease to be indicators of a debater’s skill.

3. Fails to Account for Strength of Opponent  

SP are intended to measure the strength of a debater compared to the rest of the field at a tournament, based on their in-round performance. They do not take into account the strength of a debater’s opponents. This is a big problem. A debater who goes 4-2, while only losing to debaters who finish undefeated, is clearly more deserving of breaking than a 4-2 who loses to average debaters. While both have the same number of wins, the losses are not comparable. It seems wrong that the second debater should break, while the first does not.

Potential Solutions

1. Opponent Wins

On the surface it appears that, at the very least, Opp Wins (OW) account better for the strength of opponents than SP, so perhaps OW are a good candidate to be the new first tiebreaker. However, while OW do account for opponent strength, they are often completely out of the control of debaters. If a debater happens to be paired against two really good debaters in presets, they could be advantaged over another debater who, through no fault of their own, hit easier opponents in presets.

2. OW + SP

A few tournaments have used OW + SP to break ties – MBA comes to mind, for example. While this does a better job taking into account strength of opponent than SP alone, and a better job taking into account performance than OW, the faults of both still exist.

3.  Judge Variance

I think in an ideal world Judge Variance is the best alternative to SP, other than OAPS. For those who do not know, Judge Variance (JV) is a measure of how many more (or fewer) speaker points a debater receives from a judge than that judge’s average over the course of a tournament. So, if all of a debater’s judges collectively average 28.5 and the debater receives an average of 29, their JV would be .5.
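As a rough illustration of that arithmetic, here is a minimal Python sketch. The function name `judge_variance`, the input layout, and the sample numbers are assumptions made for this example; it is not how any tab software actually computes JV.

```python
from statistics import mean


def judge_variance(points_received, judge_averages):
    """Average of (points received - that judge's tournament average).

    Both lists are ordered by round; each judge average is assumed to be
    that judge's mean speaker points across the whole tournament.
    """
    if len(points_received) != len(judge_averages):
        raise ValueError("need exactly one judge average per round")
    return mean(sp - avg for sp, avg in zip(points_received, judge_averages))


# Hypothetical three-round example: the judges average 28.5 overall,
# the debater averages 29, so JV comes out to 0.5.
received = [29.0, 29.5, 28.5]
judge_avgs = [28.5, 29.0, 28.0]
print(judge_variance(received, judge_avgs))  # 0.5
```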

While JV seems like a reasonable solution to the tiebreaker problem in theory, there are still a few problems. For instance, JV is meaningless without a large sample size. Some judges only judge a few rounds at a tournament, and if those rounds are unusually good or bad, JV scores could be completely skewed.

Some have suggested creating a database that would allow us to use all the rounds a judge has judged in a season (or ever?) to calculate JV. Beyond the obvious logistical difficulties of this solution, a database still would not solve the problem of judges who judge very infrequently, or of first-year-out judges.

Even if these difficulties could be overcome, problems remain. First, point inflation creates a major problem for JV: as scales become compressed, differences in JV do as well. Second, and this is the biggest problem, JV does not take into account opponent strength at all.

So, at least on its own, JV is likely not the solution to the SP problem. However, it is possible that if the sample size problem could be overcome, JV could be incorporated into OAPS as a way of more effectively solving the problem of inconsistent scales. This could be done by substituting both debaters’ JV scores in each round for their SP.

 

Introducing Performance Score

1. Explanation

Performance Score (PS) is defined as a debater’s speaker points (SP) in a given round minus the speaker points of their opponent (OSP) in that round:

$$\mathrm{PS} = \mathrm{SP} - \mathrm{OSP}$$

For example, if in round 1 Debater A receives a 29 and their opponent, Debater B, receives a 28, then Debater A has a round 1 PS of +1, and Debater B has a PS of -1.

PS differs in a number of ways from raw SP. Instead of measuring the skill of a debater in an absolute sense, PS is only a measurement of the difference between two debaters in a given round. Basically, PS is a measure of the “margin of victory” or the “skill gap” demonstrated by debaters in a given round. The difference between PS and SP is the same as the difference between saying the Minnesota Vikings scored 42 points, and saying the Minnesota Vikings beat the Green Bay Packers by 3 touchdowns (21 points).

Over the course of a tournament a debater’s PS for each round can be added to find their Total Performance Score (TPS):

$$\mathrm{TPS} = \sum_{r=1}^{n} \mathrm{PS}_r$$

Below are hypothetical tournament results for a debater.  The results show what TPS would look like in practice:

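| Round | 1 | 2 | 3 | 4 | 5 | 6 | Total |
|---|---|---|---|---|---|---|---|
| SP | 28.5 | 29 | 30 | 29 | 29.5 | 28 | 174 |
| OSP | 27 | 28.5 | 29 | 30 | 27.5 | 29 | 171 |
| PS | 1.5 | 0.5 | 1 | -1 | 2 | -1 | 3 |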
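For readers who prefer to see the arithmetic spelled out, the hypothetical results above can be reproduced with a few lines of Python. The helper name `performance_scores` is an assumption made for this sketch, not part of any existing tab software.

```python
def performance_scores(sp, osp):
    """Per-round PS: the debater's points minus the opponent's points."""
    return [own - opp for own, opp in zip(sp, osp)]


# Rounds 1-6 from the hypothetical results above.
sp = [28.5, 29.0, 30.0, 29.0, 29.5, 28.0]
osp = [27.0, 28.5, 29.0, 30.0, 27.5, 29.0]

ps = performance_scores(sp, osp)
tps = sum(ps)  # Total Performance Score

print(ps)   # [1.5, 0.5, 1.0, -1.0, 2.0, -1.0]
print(tps)  # 3.0
```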

2. Advantages of PS

  • PS limits the impact of inconsistent judge scales. As explained above, using SP as a tie-breaker requires us to treat SP from different judges the same, even though we know that the scales used by judges are completely inconsistent. PS allows us to use SP assigned by different judges while diminishing the impact of inconsistent judge scales. A PS of +1 can be achieved when Debater A receives 30 SP while Debater B receives 29 SP, or when Debater A receives 28 SP and Debater B receives 27 SP. I believe that the difference in SP is more consistent than the absolute SP given by judges. The main exceptions are judges who assign both debaters high speaks in an effort to make everyone happy, which brings me to the second advantage of PS.
  • PS removes the incentive to inflate speaker points. In a world where PS is used as the first tiebreaker instead of SP, the incentives that drive point inflation are much less powerful. The desire to make both debaters happy, or to cover an uncertain/bad decision with good speaks, no longer makes sense. PS is zero-sum, so by assigning one debater higher SP than they deserve, a judge is punishing the better debater. PS would mean that in most cases double 30’s would no longer be a cause for celebration for both debaters, unless of course both debaters end up with very positive TPS numbers. In a case where both debaters are actually very strong, the double 30 could be slightly positive for both.
  • Better measures the skill demonstrated in a given round. Absolute speaks do not fully account for a debater’s performance in the round. Often decisions made by one debater, like which positions to run, can affect the performance of the other debater. If a debater knows their opponent is bad at framework debate, then choosing a philosophy-heavy position can cause their opponent to debate worse than they otherwise would. Even the persuasive skills of a debater can negatively affect their opponent’s debating by convincing them they are behind in certain parts of the debate. Thus, point differential is a better way of grading debaters – it takes into account both debaters’ performances and their mutual influence on each other’s performance.

3. Remaining Issues

PS largely fails to account for the strength of opponents. Like SP and OW, PS still fails to adequately address differences in the strength of opponents. If Debater A debates significantly worse debaters in rounds 1 and 2, their PS might be artificially inflated. In fact, it is possible that PS would be more skewed by opponent strength than SP.

Refining Performance Score: Opponent-Adjusted PS

1. Explanation

Opponent-Adjusted Performance Score attempts to adjust for the strength of opponents by using the average of the Total PS scores of a debater’s opponents:

$$\mathrm{OAPS} = \mathrm{TPS} + \frac{1}{n}\sum_{r=1}^{n} \mathrm{TPS}_{\mathrm{opp}(r)}$$

While the above equation might look complicated, the concept breaks down simply: you take a debater’s Round 1 opponent’s TPS, add it to the Round 2 opponent’s TPS … until you add the TPS score of all opponents that a debater hits in prelims of a tournament. Then you divide by the total number of rounds (aka, take the average).

This process takes into account a couple of things: first, if you sum an opponent’s PS scores, you take into account whether, in general, that opponent performs well or poorly. An opponent that performs well might consistently get +1 or +1.5 as a PS score in rounds – the sum of their PS over prelims would thus be a positive number. Similarly, an opponent that performs poorly might get a lower positive number or a negative number as a result of summing their PS’s across prelims. Averaging the aggregates of a debater’s opponents’ PS scores would then say whether, on average, that debater faced “good” or “bad” opponents and would accordingly adjust the debater’s score.

So, if a debater has a TPS of +5 but the average of their opponents’ TPS is -5, then that debater has performed exactly as expected and would earn an OAPS of 0. If, on the other hand, a debater has a TPS of +5 but their opponents have an average TPS of +1, then that debater would receive an OAPS of 6, to account for their above-average competition.
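The calculation described above can be sketched in Python. This is only an illustration under stated assumptions: the function names `oaps` and `field_oaps`, the row layout, and the four-debater results are all made up for this example, and the sketch reads OAPS as a debater's TPS plus the average TPS of the opponents they faced, which is what the two cases in the text imply.

```python
from collections import defaultdict
from statistics import mean


def oaps(tps, opponent_tps):
    """Debater's TPS plus the average TPS of the opponents they faced."""
    return tps + mean(opponent_tps)


# The two cases from the text: same TPS, different strength of schedule.
print(oaps(5.0, [-5.0] * 6))                       # 0.0
print(oaps(5.0, [3.0, -1.0, 2.0, 0.0, 1.0, 1.0]))  # 6.0


def field_oaps(rounds):
    """Compute OAPS for everyone from (debater, opponent, SP, OSP) rows."""
    tps = defaultdict(float)
    faced = defaultdict(list)
    for debater, opponent, sp, osp in rounds:
        tps[debater] += sp - osp    # PS for the first debater
        tps[opponent] += osp - sp   # PS is zero-sum within a round
        faced[debater].append(opponent)
        faced[opponent].append(debater)
    return {d: oaps(tps[d], [tps[o] for o in faced[d]]) for d in tps}


# Made-up three-round results for four hypothetical debaters.
results = [
    ("A", "B", 29.0, 28.0), ("C", "D", 28.5, 28.0),
    ("A", "C", 29.5, 29.0), ("B", "D", 28.0, 28.5),
    ("A", "D", 29.0, 28.5), ("B", "C", 27.5, 28.5),
]
scores = field_oaps(results)
for name in sorted(scores, key=scores.get, reverse=True):
    print(name, round(scores[name], 2))
# A 1.33
# C 0.67
# D -0.33
# B -1.67
```

Note that every opponent's TPS has to be final before anyone's OAPS can be computed, so in this sketch the adjustment only happens after all prelim rounds are entered.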


2. OAPS takes into account strength of opponent without the problems of OW

While OW are hindered by the randomness of presets, and unfairly reward debaters for hitting good opponents regardless of their performance, OAPS avoids both problems. Debaters are only rewarded for performing better than average against a particular debater and only punished for performing below average. If a debater hits a great opponent, who averages a PS of +3, but debates them well enough to earn a PS of -1, they are rewarded for debating better than the average debater against that opponent. On the other hand, if a debater hits that same opponent and receives a PS of -4, they are punished for debating below average relative to that opponent. Unlike OW, they do not receive a bonus because they happened to debate a talented opponent.
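One way to read the example above is round by round: add the opponent's average PS to the PS earned against them. The sketch below shows that arithmetic; the name `adjusted_ps` is an assumption introduced here for illustration.

```python
def adjusted_ps(ps, opponent_avg_ps):
    """PS in one round, adjusted by the opponent's average PS."""
    return ps + opponent_avg_ps


# Against an opponent who averages a PS of +3:
print(adjusted_ps(-1.0, 3.0))  # 2.0  -> lost by less than that opponent's usual margin
print(adjusted_ps(-4.0, 3.0))  # -1.0 -> lost by more than that opponent's usual margin
```

If every opponent debates the same number of prelim rounds, summing these per-round adjusted scores gives the same number as TPS plus the average of the opponents' TPS.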

3. Potential Objections

  • “This seems too complicated.” – Eh, not really. While the formula may seem complicated, it is really not more difficult to calculate than JV, something tab software already does. While TRPC and Tabroom.com do not currently calculate OAPS, I suspect that it would not take much to add that to the software.
  • “That is not what speaker points are for.” – Some might object that PS doesn’t make sense because SP are not about relative skill, but rather speaking ability or something else entirely. Sure, that’s fair, but it really doesn’t make a huge difference. We already primarily use SP to break ties, so even if we believe that speaking ability is how ties ought to be broken, all of the same reasoning offered above still applies.
  • “This doesn’t solve for inconsistent judge scales. If one judge has a bigger range than another, doesn’t that mess things up?” – This was sort of addressed above, but it is a big enough point to address again. It is true that nothing can really be done to solve this issue completely. However, OAPS does a few things to address it. First, OAPS creates an incentive for judges to use more of the scale. Some of the same reasons speaks have become inflated could work in favor of OAPS. Judges who use a compressed scale would end up punishing good debaters, so over time judges would adapt in order to protect the best debaters and to avoid making people angry. Second, OAPS should be more consistent between judges than SP. I believe that there is less inconsistency in the separation judges create between debaters than in the absolute SP they assign.
  • “I don’t know how, but it seems like low-point wins would mess everything up, right?” – Not at all. Under the current system, a low-point win is a signal that even though a debater lost this specific round, the judge believes they are “better” (at whatever set of skills a judge believes is relevant for SP), and if the two debaters are tied at the end of the tournament, the losing debater should be favored. The same would be true if we chose to switch to OAPS.
  • “OAPS? That name sucks!” — I agree. Any suggestions for a better name? If this is going to stick, we need something catchy.

 

Conclusion

I believe that our current method of determining debater seeding is broken; tournaments should use OAPS instead. SP are too flawed to be the first tiebreaker, and OAPS provides an alternative that solves most if not all of the problems presented by SP.

I would love to hear thoughts from readers on how to improve OAPS, additional objections, or alternative metrics in the comments section.

Chris Theis is the owner and Co-Director of Victory Briefs. He won the 2008 and 2009 TOC and currently coaches at Peninsula High School (CA) and Apple Valley High School (MN). 

  • Mark Ahlstrom

    So I might tackle this at some point (not tonight), but we have old tournament results. Can someone feed the numbers into a few of those to show what the actual changes would be in seeding/breaks?

    • Mark, I think I am going to enter the data from the 2014 TOC and give this a shot. I am also working on a revised version of the formula in light of some discussions I have had recently. I would like to show what seeding would look like using the current system, using opp wins, using OAPS, and using the revised metric I am working on.

    • After too many hours spent putting this data together I found that this is how seeding by OAPS and not H/L SP would have changed the elim seeding at the 2014 TOC:

  • Advik Shreekumar

    Another idea: top- and bottom-code point spreads. Zack brings up the case of tanking a debater’s speaks for some reason, which would give their opponent an absurdly large PS for hitting somebody running an abusive position. That may be too kind to the debater who wins.

    In general, we may be worried about spread-fairies, who always award large spread wins.

    We could always bound PS at +/- 4. If a judge tanks somebody’s speaks, resulting in a round having a PS of 8 (say, 28-20), we could always treat it as PS = 4 in the calculation. We’d basically be saying there isn’t a significant difference between beating somebody by 4 speaks and by 5 (or more) – they’re all “large” victories.

    This’d be analogous to tab giving a debater 20 speaks even if the judge awarded 0 for abuse.
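    The bounding idea in this comment can be sketched in Python as a simple clamp. The function name `bounded_ps` and the default cap are assumptions for illustration only; the +/- 4 value is the one suggested above.

```python
def bounded_ps(sp, osp, cap=4.0):
    """PS for one round, clamped to the range [-cap, +cap]."""
    raw = sp - osp
    return max(-cap, min(cap, raw))


print(bounded_ps(28.0, 20.0))  # 4.0  (raw PS of 8 is treated as 4)
print(bounded_ps(20.0, 28.0))  # -4.0 (the tanked debater's side of the same round)
print(bounded_ps(29.0, 28.0))  # 1.0  (ordinary spreads are unchanged)
```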

  • Advik Shreekumar

    Another way to think about OAPS that may be helpful to understanding how it works is this:

    • Chris Theis

      This is an interesting point. It seems that removing the round should make the ratings more accurate, but it also seems like this may become way too hard to calculate if the same opponent’s TPS is different depending on which debater we are doing the OAPS calculation for.

      Also, upon reflection it seems that OAPS would be significantly improved by adjusting based on a debater’s opponents’ OAPS instead of their PS. This would also take into account the opponents’ strength of schedule. This would mean that adjusting one debater’s OAPS would affect every other debater. If we adjusted this way the formula would be iterative, which may be too much for tab software to handle. I have no idea.

      It would be similar to the “Simple Ranking System” used by some sports gamblers. There is a run-down here:

      http://www.pro-football-reference.com/blog/?p=37

  • Zack Vrana

    First, I really, really like the idea of a new second tie breaker (first is always wins) but I have some questions that perhaps someone who’s better at math can answer.

    1) Ties in points must be illegal. If we’re sorting by high OAPS, adding a zero to the top of a fraction should make it smaller, thus, a tie would be harmful.

    2) This doesn’t take into account where in the distribution the spread is. If I understand this correctly, a 22-25 round would affect a debater in the same way that a 27-30 does. Also, judges who tank speaks to check abuse would be helping some kids disproportionately.

    3) You’re penalizing power-matched debates. Logically, debates should be more even as the better debaters hit each other. If we’re in round 7 and it’s two similarly skilled 5-1 debaters, the spread shouldn’t be that far off.

    Thanks for this though. Always good to have a discussion.

    • Chris Theis

      Zack, a lot of this is fair. I don’t think we are ever going to find a perfect solution; the relevant question is whether this is better than H/L SP. Unless of course, there is an even better alternative. If so, I would love to hear it.

      1. A tie would not necessarily be harmful. The adjustment functionally makes this a measure of how much better you perform against a debater than their average opponent. So, if your opponent is usually beating debaters by 1 point then after the adjustment a tie would translate to +1 OAPS.

      2. You are absolutely right that a 22-25 would be the same as a 27-30 pre-adjustment. This is potentially an issue. However, I think there are two factors that make this less of a problem.

      A.) Judges’ scales are inconsistent. This is a major problem with just using raw points. For example, I bet the points I give are 1-1.5 lower than what the average judge would give. That makes treating a 29-28 the same as a 28-27 a feature and not a bug.

      B.) I think the adjustment mitigates this problem a bit. Mathew brought this up on Facebook so I will just show that discussion here:

      “Mathew Pregasen: actually I am now a bit confused. Does this mean a 28.5 – 27 round is equal to a 30 – 28.5 round?

      Chris Theis: Mathew, sort of. The reason they would most likely end up being different is that debaters receiving 27.5’s are likely to be losing quite a few of their rounds, and getting lower points than their opponent. So, when the adjustment happens the debater receiving the 30 is likely to receive a boost based on their opponent while the debater in the second example might be negatively affected by that round.”

      3. Power-matching should not be an issue.
      A.) OAPS would be used to break ties between debaters with the same records so the power-matching effect should be non-unique.
      B.) The adjustment solves. You are only rewarded or penalized based on how much better or worse you debate an opponent than their average opponent. So if the spread is close in an up 5 debate that should not matter because it is likely that both debaters have positive TPS values. So, if I win by .2 SP that will be adjusted to 2.2 if my opponent has been beating up on weaker debaters early in the tournament and has an average PS of +2. Likewise, debaters in the down 5 bracket will likely have negative TPS scores. So, if a down 5 debater beats their opponent by .2 it might be adjusted down to a -1.8, for example.

      • Chris Theis

        Also, I agree it is always good to have the discussion. It is something we really need to hash out as a community.

  • akskdkf

    Compared to speaker points, I think this system sounds much better (regardless of how unrealistic the Vikings scenario may be). Very innovative to think of something like this, and I applaud you.

    One potential problem still comes to mind, though. A lot of judges adjust speaker points not based on how good a debater was compared to their opponent that round, but rather based on subjective preferences about that debater’s in-round strategy or behavior (e.g. your paradigm says you’ll give lower speaks if debaters go too hard for the framework since it’s “frustrating and boring”; there are also plenty of judges who claim to give higher speaks for humor or lower speaks for rudeness). This doesn’t seem to be a problem if we stick to a system that relies on debaters’ own speaker points, as debaters can adapt to their judges’ preferences. In this OAPS system, though, it would seem to me that good debaters could be vulnerable to hitting bad debaters who happen to be really funny (and thus get higher speaks, hurting the good debaters’ OAPS scores), in addition to being arbitrarily advantaged if their opponent happens to piss off a judge one round.

    Not saying this is a big issue, I’m just interested to hear your thoughts.

    • Chris Theis

      (Thought I posted a response earlier but I am not seeing it here)

      Thanks for the input. I agree this is potentially an issue. To a certain extent we will never be able to solve the stupidity you describe entirely. However, I think OAPS creates incentives that make the situations you outline less likely.

      The same incentives that cause SP inflation now are actually benefits under the OAPS system. If judges gave high speaks to a bad debater just because they were funny or said the right “word of the day” they would very quickly feel pushback from the community. Right now there is no harm to giving both debaters high speaks, but the world of PS is more zero-sum. Giving bad debaters good speaks would punish the good debater in the round. So the incentives to protect good debaters and avoid conflict with coaches/students that drive inflation now should discourage the behavior you describe.

  • Chris Theis

    One objection has been raised with me in private and on Facebook a couple of times. I will quote Mathew Pregasen from Facebook (he said it best):

    “There needs to be a way to account for extremes, albeit very probable extremes. For instance, let’s say I hit the top speaker. They get almost a perfect speaker point score. I get 28.5’s. 1.5 (30-28.5) x 6 = -9 PS. Which means hitting a better opponent hurts my chances of breaking which seems a bit counter intuitive.”

    I think this misunderstands the way the adjustment works. Debaters are rewarded or punished based on whether they outperform or underperform relative to their opponent’s average opponent. So if I hit the top speaker and receive 1.5 fewer speaker points than they do, and they usually outspeak their opponents by 2 points, then my adjusted PS is actually +.5 not -1.5. On the other hand if I debate a “bad” opponent who usually receives a -2 PS and I only beat them by 1.5 speaker points then I receive an adjusted PS of -.5 instead of +1.5.

  • Former Debater

    If it takes above a 29 to clear at a tournament, then LD needs to have a larger conversation about this Lake Wobegon-esque “all our debaters are above average” condition…

    • Joey Schnide

      As Chris’ article explains pretty clearly, the problem isn’t some incarnation of special snowflake syndrome that causes judges to believe that all debaters are above average. Speaker points are inflated because judges face pressures to raise speaks (which Chris does a great job outlining). As a result, one judge might give 29s to “average” debaters because a 29 means something different to them than it does to you. Having a conversation about the issue isn’t going to change the problem overnight, and it certainly doesn’t replace a tangible solution. There are too many judges in the community (with varying levels of involvement) to reach everyone quickly. Even if our conversation was reasonably successful in changing judges’ attitudes to speaks, it doesn’t do anything to solve the root cause of the problem. People respond to incentives, and a change of attitude doesn’t remove the pressures that judges feel to inflate speaks in the first place.

      • Former Debater

        Or the LD community could be grown-up adults and have a reasonable scale that is set nationally and shame those who disobey it. Policy on the HS level and College level has done it that way for years and it involves no calculus.

        • Joey Schnide

          Because instead of finding real solutions to problems, mature adults just shame people they disagree with.