Flaws in the Daniel Ratings
The Daniel Ratings have a couple of flaws that make the results
unacceptable for official use (other than choosing BCS teams). One
flaw is serious and one is not so serious. The not-so-serious flaw
is a dependence on sparsity; the serious flaw is sensitivity to
schedule structure.
Dependence on Sparsity
Imagine this rating system applied to Major League Baseball.
Major League Baseball teams play 162 games, and play every team in
their league at least six times. (We'll disregard the interleague
games for the moment.) In most cases, every team would have a rating
of zero, because typically every team would have at least one victory
over and one loss to every other team. Or, if one team sweeps a
season series, it would be the only team with a positive rating.
It is obvious that the Daniel ratings become meaningless if there are
too many games. In other words, this system depends on sparse
data.
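As a rough illustration of the sparsity argument, here is a minimal Python sketch. It is not the actual Daniel Ratings code. It assumes a simplified scoring rule that this section only hints at: 100 points for a direct win, half as much for each extra step in the shortest transitive path, with the two directions of each team-to-team comparison subtracted from each other. The names win_distances, credit, and toy_rating are made up for this example.

```python
from collections import deque
from itertools import permutations

def win_distances(games, start):
    """Breadth-first search over 'beat' edges: length of the shortest
    transitive win path from start to every team it can reach."""
    beaten = {}
    for winner, loser in games:
        beaten.setdefault(winner, set()).add(loser)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        team = queue.popleft()
        for loser in beaten.get(team, ()):
            if loser not in dist:
                dist[loser] = dist[team] + 1
                queue.append(loser)
    return dist

def credit(length):
    # ASSUMED scoring: 100 for a direct win, halved for each extra step.
    # A length of 0 is used here to mean "no win path exists".
    return 100 / 2 ** (length - 1) if length else 0.0

def toy_rating(games, team):
    """Sum, over every other team, the credit for our shortest win path
    to them minus the credit for their shortest win path to us."""
    teams = {t for game in games for t in game}
    ours = win_distances(games, team)
    total = 0.0
    for other in teams - {team}:
        theirs = win_distances(games, other)
        total += credit(ours.get(other, 0)) - credit(theirs.get(team, 0))
    return total

# Dense schedule: every team beats every other team once and loses once.
dense = list(permutations("ABCD", 2))
print({t: toy_rating(dense, t) for t in "ABCD"})   # every rating is 0.0

# Sparse schedule: only two games, so the comparisons do not cancel.
sparse = [("A", "B"), ("B", "C")]
print({t: toy_rating(sparse, t) for t in "ABC"})
```

In the dense case every pair of teams has a direct win in both directions, so every comparison cancels and all ratings come out as zero. In the sparse case the same rule separates the three teams.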
Of course, in college football, teams only play a dozen or so games,
so the requisite sparsity of the data is there. Nevertheless, I
wonder if there could be some residual effect. For example, most
teams play a conference schedule, where most teams play each other.
Does this distort the results any?
Having said that, I feel that college football would have to add a lot
of games before the results start to be meaningless, so I classify this
as a not-so-serious flaw.
Sensitivity to Schedule Structure
Differences in the schedule structure between two teams can lead to an
advantage for one of the teams.
- Number of Games
An obvious flaw in the system is the fact that teams play
different numbers of games. Suppose two teams finish the season
undefeated, but one played only 11 games, whereas the other played
13. The second team gets 1300 points for its undefeated schedule, while
the first gets only 1100.
In many cases such as this, the solution is to just divide by the
number of games played. In this case, however, that won't work. The
Daniel Ratings are a point-to-point comparison between teams; every
team is compared to every other team, and these team comparisons are
summed to give the total rating. Thus, there is no "per game" rating,
and so dividing by the number of games is not appropriate.
- Conference vs. Nonconference
A less obvious flaw can be seen by considering UCLA and Notre
Dame. Suppose they finish the season undefeated, each playing 11
games. Even though they play the same number of games, their schedules
appear quite different to the computer. The reason for this is that
UCLA plays in a conference, the Pac 10, while Notre Dame is
independent.
In the Pac 10, every team plays almost every other team. Suppose
UCLA defeats Stanford, and Stanford beats everyone else it plays in
the conference. UCLA would not get many half-credit rating points by
virtue of Stanford's victories, because the Daniel Ratings consider
only the shortest transitive path. The shortest path from UCLA to most
other Pac 10 teams has length 1, the direct victory. So UCLA would not
benefit much from other Pac 10 teams beating each other.
On the other hand, suppose Stanford is also on Notre Dame's
schedule, and Notre Dame beats them. Notre Dame will get all the
half-credit points from Stanford's victories in the Pac 10. In fact,
because Notre Dame is an independent, many of the teams on their
schedule won't play each other. Thus, almost any win by Notre Dame's
opponents would result in half a rating point for Notre Dame.
By virtue of their schedule, Notre Dame has much more opportunity
to gain half-points from their opponents' victories than UCLA does.
In fairness, Notre Dame would also have more opportunity to be
ranked lower if they lose.
- Cliques
A third flaw, related to the previous one, is that college
football teams tend to form cliques: groups of teams that have
relatively few games scheduled against teams outside the group. (A
conference is one kind of clique, but there are larger cliques and
smaller cliques, too.)
When teams form a clique, it restricts the flow of information.
There are not so many games between teams of different cliques, so it
is hard to compare teams from different cliques. This is a problem
for any ratings system. It is an especially hard problem for the
Daniel Ratings, however, because they depend on shortness of path. If
one clique is better than another clique, then the teams from the bad
clique still have a short path to the other teams from the bad clique,
whereas teams from the good clique have a long path to those teams.
If the bad clique has more teams, they might end up outranking teams
from the good clique.
Division I-A is a good example of a large clique. NCAA rules
specify that, in order for a team to play in a bowl, that team must
have six wins over Division I-A opponents (excepting one I-AA team
every four years or something like that). Because of this, I-A teams
do not often play I-AA teams, and as such, Division I-AA teams have a
certain shortness of path advantage over I-A teams. This caused some
I-AA teams to appear very high in the ratings before I decided to rank
only I-A.
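The conference versus independent effect described above can be seen in a toy schedule. The sketch below is only an illustration under assumptions: the results are invented, and every team name other than UCLA, Stanford, and Notre Dame is a placeholder. It counts how many teams sit at shortest win-path length 1 (direct victories) and length 2 (the half-credit case) for each of the two undefeated teams. The function names are hypothetical.

```python
from collections import deque

def win_distances(games, start):
    """Breadth-first search over 'beat' edges: length of the shortest
    transitive win path from start to every team it can reach."""
    beaten = {}
    for winner, loser in games:
        beaten.setdefault(winner, set()).add(loser)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        team = queue.popleft()
        for loser in beaten.get(team, ()):
            if loser not in dist:
                dist[loser] = dist[team] + 1
                queue.append(loser)
    return dist

def path_length_counts(games, team):
    """Return (teams at shortest path length 1, teams at length 2)."""
    dist = win_distances(games, team)
    direct = sum(1 for d in dist.values() if d == 1)
    second = sum(1 for d in dist.values() if d == 2)
    return direct, second

# Invented results, for illustration only.
games = [
    # Round-robin conference: every pair of members meets once.
    ("UCLA", "Stanford"), ("UCLA", "PacC"), ("UCLA", "PacD"),
    ("Stanford", "PacC"), ("Stanford", "PacD"), ("PacC", "PacD"),
    # Independent schedule: opponents that do not play each other.
    ("Notre Dame", "Stanford"), ("Notre Dame", "OppX"), ("Notre Dame", "OppY"),
    ("OppX", "X1"), ("OppY", "Y1"),
]

print(path_length_counts(games, "UCLA"))        # (3, 0)
print(path_length_counts(games, "Notre Dame"))  # (3, 4)
```

Both teams have three direct wins, but every team Stanford beat is already a direct UCLA victim, so UCLA picks up no length-2 paths. Notre Dame's opponents come from separate groups, so each of their wins opens a new length-2 path.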
Possible Solutions
Here are some ideas I have for solutions.
It seems that the fatal flaw in the Daniel Ratings system is that
much information is thrown out when the system decides to only account
for the Shortest Transitive Path. So, the obvious remedy is to
consider all paths.
The result of this line of thinking is perhaps a little obvious. If
we consider all paths, the system degenerates into an RPI-like system.
That is, it would be some number times your winning percentage, plus
some number times your opponents' winning percentage, etc. It would be
slightly different from the RPI because it goes deeper than opponents'
opponents, and the numbers would be different.
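To make the "consider all paths" idea concrete, here is a minimal Python sketch. It rests on assumptions that are not part of the ratings as described: it counts raw win chains rather than winning percentages, it ignores losses, and the depth and decay parameters (and the function name all_paths_rating) are invented for illustration.

```python
def all_paths_rating(games, depth=3, decay=0.5):
    """Weighted count of every win chain (not just the shortest one) of
    up to `depth` games starting at each team."""
    beaten = {}
    teams = set()
    for winner, loser in games:
        beaten.setdefault(winner, []).append(loser)
        teams.update((winner, loser))

    # chains[t] = number of win chains of the current length starting at t.
    # Chains of length 1 are simply the team's own wins.
    chains = {t: len(beaten.get(t, [])) for t in teams}
    rating = dict.fromkeys(teams, 0.0)
    weight = 1.0  # ASSUMED weighting: each extra step counts `decay` as much
    for _ in range(depth):
        for t in teams:
            rating[t] += weight * chains[t]
        # A chain of length k+1 is one win followed by a chain of length k.
        chains = {t: sum(chains[l] for l in beaten.get(t, [])) for t in teams}
        weight *= decay
    return rating

# Invented results: A beat B and C, and B beat C.
print(all_paths_rating([("A", "B"), ("B", "C"), ("A", "C")]))
```

The first pass adds a team's own wins, the second adds the wins of the teams it beat, and so on. That is the same "own record, then opponents' record" shape as the RPI, only carried to more levels and with different weights.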
A second idea is to consider only shortest paths, as before, but to
compensate for the hiding effect caused by selecting the shortest
path. For example, in the UCLA case with the Pac 10, we could detect
the fact that UCLA can't benefit from other Pac 10 games, and
compensate for it.
The basic idea is to determine the total number of possible
shortest transitive paths a team can take (possible meaning that we
assume that the right teams win), and then to give the team rating
points based on the percentage of those paths that actually are
transitive paths.
One Other Thought
Almost everything written in this section is a result of my
intuition about the ratings. With more careful study, I might discover
cancellation effects that make the flaws in the system not as severe
as I thought. Or I might uncover unforeseen problems with my ideas for
solutions.