Which city in America has the worst drivers? When I asked this question, many people guessed New York or Los Angeles. How about the city with the safest drivers? According to a recent report released by The Allstate Insurance Company, Fort Collins, Colorado is the safest driving city in America, meaning that its drivers are the least accident-prone, while Washington DC is the least safe. Making sense of this report is an excellent opportunity to demonstrate elementary methods in descriptive statistics.
The report, called Allstate America’s Best Drivers Report, ranks America’s 200 largest cities in terms of car collision frequency with the goal of identifying which cities have the safest drivers. The report is produced using Allstate claim data. The report can be downloaded from this link or can be seen here (a copy of the report). News coverage of the report can be found here and here.
The numerical summary that is used to measure the safety of the drivers in a city is the average time (in years) between auto collisions among the drivers in that city. According to the report, the average driver in Fort Collins, Colorado will experience an auto collision every 14 years (the national average is 10 years). In contrast, Washington DC is at the bottom of the report, where the average driver will experience an auto collision every 4.8 years. The top 5 safest cities are:
Obviously, the higher the measure of average years between collisions, the safer the city. The lower the measure, the less safe the city on average. In addition to looking at the top and bottom of the report, we can look at a few other cities that we care about, perhaps several metropolitan areas. For example, Los Angeles’ measure of average years between collisions is 6.6 years while New York’s is 7.3 years, both worse than the national average, but not as bad as some people imagine. Glancing at the report, one can see that larger cities tend to have smaller measures of average years between collisions. However, in order to make the data in the report come to life, we need more organized ways of looking at the data.
To make sense of data, we start with a graph and then use numerical summaries. A good visual representation of the data can reveal overall patterns and relationships. A graph shows important features of the data: the shape of the data (how the data are distributed over their entire range), where the data are centered, how small or large the spread is (whether the data are closely clustered around the center or widely dispersed from it), and whether there are extreme data values that stand apart from the overall shape (these are called outliers). We will also look at some numerical summaries to confirm what we see in the graph. All of these activities help us tease out the patterns and meaning of the data.
The following shows the measurements of average years between collisions for the largest 193 cities in America. The data are ordered from smallest to largest. The full report from Allstate can be downloaded here. A copy of the report can also be found here.
Using a Graph
Using a histogram is an effective way of conveying a visual sense of all the data. A histogram divides the entire range of the data into intervals of equal length and shows the number of data values that fall within each interval. In the case of average years between collisions, we use an interval length of 1. Since the data range is 4.8 to 14, we use the intervals 4 to 4.99, 5 to 5.99 and so on. To aid the construction of the histogram, we first summarize the counts in a frequency distribution. For example, there is 1 data value that falls within 4 to 4.99, there are 3 data values that fall within 5 to 5.99, and so on.
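The tallying step can be sketched in Python. This is only a minimal illustration, not the procedure used to produce the counts quoted above: the `sample` list is a made-up handful of values chosen to exercise the code (it is not the Allstate data), and the function name `frequency_distribution` is hypothetical. Each value is assigned to the unit-length interval that starts at its integer part.

```python
import math
from collections import Counter


def frequency_distribution(values, width=1.0):
    """Count how many values fall in each interval [k*width, (k+1)*width)."""
    counts = Counter(math.floor(v / width) * width for v in values)
    # Return (interval start, count) pairs in increasing order of interval.
    return sorted(counts.items())


# Hypothetical sample values, NOT the figures from the report.
sample = [4.8, 5.2, 5.9, 7.3, 9.1, 9.4, 9.9, 10.2, 14.0]

for start, count in frequency_distribution(sample):
    print(f"{start:g} to {start + 0.99:g}: {count}")
```

Intervals that contain no values are simply absent from the result; a plotting routine would draw them as bars of height zero.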
The following is the frequency histogram of the average years between collisions. The height of a bar represents the number of cities whose average years measures are within that range.
The shape of the histogram is essentially symmetrical. It is symmetrical around its peak and it tapers down on each side. The height of the left-most bar is one, and so is the height of the right-most bar. These two bars are the least safe and the safest cities, Washington DC and Fort Collins, Colorado, respectively. The other cities fall somewhere between these two extremes and are distributed more or less according to a bell-shaped curve.
Using Numerical Summaries
The center of the distribution is at the peak of the histogram, the bar over the interval of 9 to 9.99. So in a typical city, the average years between collisions for a driver is about 9 to 10 years. The following shows several numerical summaries of interest.
The first two numerical summaries in the above table are measures of center. A measure of center is a numerical summary that attempts to describe what a typical data value might look like. It is an “average” or representative value of the data distribution. A measure of spread is a numerical summary that describes the degree to which the data are spread out.
Because the shape of the distribution is symmetrical and because there do not seem to be outliers in the data, the mean and the standard deviation are the representative measures of center and spread, respectively. Thus the distribution of the measure of average years between collisions is centered at 9.22 years. So the average driver in a typical large city in America will experience a collision every 9.22 years. The standard deviation shows how much variation or dispersion there is from the mean. The standard deviation of the measure of average years between collisions is 1.639 years.
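For readers who want to compute these two summaries themselves, the following is a minimal Python sketch using the standard library. The `sample` list is hypothetical and is not the report's data, so the output will not match 9.22 and 1.639. Note also an assumption: the text does not say whether the sample (n - 1) or population (n) formula was used for the standard deviation, and the sketch uses the sample version.

```python
import statistics

# Hypothetical sample values, NOT the figures from the report.
sample = [4.8, 6.6, 7.3, 8.9, 9.3, 9.6, 10.2, 11.1, 14.0]

mean = statistics.mean(sample)
# statistics.stdev uses the sample (n - 1) formula; statistics.pstdev
# would give the population (n) version instead.
std_dev = statistics.stdev(sample)

print(f"mean = {mean:.2f} years, standard deviation = {std_dev:.3f} years")
```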
The following summarizes the description of the overall patterns of the data found in the Allstate America’s Best Drivers Report.
- The measures of average years between collisions for the cities in the report follow an approximately bell-shaped distribution.
- Because the shape of the distribution is symmetrical, the more representative measure of center is the mean, in this case, 9.22 years. In other words, the average driver in the larger cities of America is expected to experience a collision every 9.22 years.
- Because the shape of the distribution is symmetrical, the more representative measure of spread is the standard deviation, in this case, 1.639 years.
Using Numerical Summaries to Confirm What We See in the Graph
Note that the mean and the median are very similar (9.22 years versus 9.3 years), which is an indication that the distribution is symmetric. If the mean and median are very far apart, it is an indication that there are extreme data values in one of the tails causing the mean to be much different from the median. But this is not the case in this example.
We can also determine the shape of the distribution by looking at the 5-number summary. The key is to look at the first quartile (Q1), the median (Med) and the third quartile (Q3) in relation to one another. Specifically, look at the distance between the median and Q1 (9.3-8.05=1.25 years) and the distance between Q3 and the median (10.2-9.3=0.9 years). Whenever these two distances are significantly different, we have a skewed distribution. In the case of the measure of average years between collisions, these two distances are reasonably close, making it a roughly symmetric distribution, in fact, a bell-shaped one.
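The quartile comparison is simple arithmetic, sketched below in Python with the three quartile values quoted in the text. The helper name `quartile_gaps` is hypothetical, and how large a difference between the two gaps counts as "significant" remains a judgment call that the code does not settle.

```python
def quartile_gaps(q1, median, q3):
    """Return (median - Q1, Q3 - median), the two halves of the middle 50%."""
    return median - q1, q3 - median


# Quartiles quoted in the text (in years).
lower_gap, upper_gap = quartile_gaps(8.05, 9.3, 10.2)

print(f"median - Q1 = {lower_gap:.2f} years")
print(f"Q3 - median = {upper_gap:.2f} years")
```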
Another indication of a bell-shaped distribution comes from the empirical rule, a rule of thumb derived from the normal distribution.
All normal distributions (aka bell-shaped distributions) have common properties. For example, any bell-shaped distribution obeys this rule (called the empirical rule): about 68% of the data are within one standard deviation of the mean, about 95% of the data are within two standard deviations of the mean, and about 99.7% of the data are within three standard deviations of the mean.
If the data follow a bell-shaped distribution, there are not many extreme data values (as defined by being more than three standard deviations away from the mean). Such extreme data values can only make up about 0.3% of the data (100% minus 99.7%). According to the same rule, even data values more than two standard deviations from the mean do not occur very often. Such data values show up about 5% of the time (100% minus 95%). What about the data for the Allstate America’s Best Drivers Report?
Do the data in the report for average years between collisions satisfy this rule? First, we need to find the bounds that are one standard deviation from the mean, the bounds that are two standard deviations from the mean, and the bounds that are three standard deviations from the mean. They are:
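These bounds follow directly from the mean (9.22 years) and standard deviation (1.639 years) quoted earlier. A short Python sketch of the arithmetic, with a hypothetical helper named `bounds`:

```python
mean = 9.22      # mean reported in the text (years)
std_dev = 1.639  # standard deviation reported in the text (years)


def bounds(mean, std_dev, k):
    """Return the (lower, upper) bounds k standard deviations from the mean."""
    return mean - k * std_dev, mean + k * std_dev


for k in (1, 2, 3):
    lower, upper = bounds(mean, std_dev, k)
    print(f"within {k} SD: {lower:.3f} to {upper:.3f} years")
```

With the reported values this works out to roughly 7.58 to 10.86 years, 5.94 to 12.50 years, and 4.30 to 14.14 years. The smallest and largest values in the data, 4.8 and 14, both lie inside the three-standard-deviation bounds.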
The next step is to count the number of data values that fall into each of these three intervals in order to find the percentage of the data in each interval. We have the following results.
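The counting step can be sketched in Python as follows. The `sample` list is hypothetical and is not the report's data, so the percentages it prints are not the results for the Allstate report. The function name `share_within` is also an assumption, as is the use of the sample (n - 1) standard deviation.

```python
import statistics


def share_within(values, k):
    """Return the percentage of values within k standard deviations of the mean."""
    mean = statistics.mean(values)
    std_dev = statistics.stdev(values)  # sample (n - 1) standard deviation
    lower, upper = mean - k * std_dev, mean + k * std_dev
    inside = sum(1 for v in values if lower <= v <= upper)
    return 100.0 * inside / len(values)


# Hypothetical sample values, NOT the figures from the report.
sample = [4.8, 6.6, 7.3, 8.1, 8.9, 9.3, 9.6, 10.2, 11.1, 14.0]

for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(sample, k):.1f}%")
```

A sample this small cannot be expected to reproduce 68%, 95% and 99.7% closely; the comparison becomes meaningful with a data set the size of the one in the report.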
There is remarkable agreement between the percentages in the data and the percentages of 68%, 95% and 99.7%. This agreement is a further indication that the distribution of average years between collisions in the Allstate America’s Best Drivers Report is very close to a bell-shaped distribution.
- Moore, D. S., McCabe, G. P., Craig, B. A., Introduction to the Practice of Statistics, 6th ed., W. H. Freeman and Company, New York, 2009.