To answer this question, we focus on cities with a population of 100,000 and up. In this group of cities, are the drivers in the smaller cities safer than the drivers in large metropolitan areas? A recent report from Allstate Insurance Company provides some answers to this question. The answer is not a resounding yes. We use scatterplots to demonstrate that the picture is somewhat muddled.

The report, called Allstate America’s Best Drivers Report, ranks America’s 200 largest cities by car collision frequency, with the goal of identifying which cities have the safest drivers. The report is produced using Allstate claim data. The report can be downloaded from this link or can be seen here (a copy of the report). A previous discussion of this report can be found here.

The numerical summary used in the report to measure the safety of the drivers in a city is the average time (in years) between auto collisions among the drivers in that city. According to the report, the average driver in Fort Collins, Colorado (estimated population 138,733) will experience an auto collision every 14 years (the national average is about 10 years). In contrast, Washington DC (estimated population 599,657) is at the bottom of the report, where the average driver will experience an auto collision every 4.8 years. Though Los Angeles (estimated population 3,831,868) is not at the bottom of the report, its average time between collisions is only 6.6 years. In New York (estimated population 8,391,881), the largest city in the United States, the average driver will experience a collision about every 7.3 years.
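One way to read these figures is to turn them into a relative collision rate. A driver who averages one collision every y years has roughly 1/y collisions per year, so the rate relative to the national average is about 10/y. The following Python sketch does this for the four cities just mentioned. The function name and the decision to treat the national average as exactly 10 years are assumptions made for illustration; they are not part of the Allstate report.

```python
# Average years between collisions, as quoted in the text above.
YEARS_BETWEEN_COLLISIONS = {
    "Fort Collins, CO": 14.0,
    "Washington, DC": 4.8,
    "Los Angeles, CA": 6.6,
    "New York, NY": 7.3,
}

# Assumption: the "about 10 years" national average is treated as exactly 10.
NATIONAL_AVERAGE_YEARS = 10.0


def relative_collision_rate(years, baseline=NATIONAL_AVERAGE_YEARS):
    """Return a city's yearly collision rate divided by the baseline rate.

    A rate of 1/years divided by a rate of 1/baseline is baseline/years.
    """
    if years <= 0:
        raise ValueError("years between collisions must be positive")
    return baseline / years


if __name__ == "__main__":
    for city, years in YEARS_BETWEEN_COLLISIONS.items():
        ratio = relative_collision_rate(years)
        print(f"{city}: {ratio:.2f} times the national collision rate")
```

Under this reading, a score of 4.8 years corresponds to a little more than twice the national collision rate, while a score of 14 years corresponds to roughly 0.71 times the national rate.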

The measure of average years between accidents is like a score for driving safety. The lower this score, the less safe the city (Washington DC has the lowest score). The higher the score, the safer the city (Fort Collins, CO has the highest score). Is there an association between the population of a city and its safety score? For example, do larger cities tend to have lower scores and smaller cities tend to have higher scores?

Looking at Los Angeles, New York, Washington DC, and Fort Collins, we get the impression that the larger cities have lower scores (less safe). Taking another look, we see that Providence, Rhode Island (estimated population 171,909) has a score of 6.0 in the report, which is lower than the score for Los Angeles. Hialeah, Florida (estimated population 218,896), another medium-sized city, has the same score as Los Angeles (6.6 years between accidents). Phoenix, Arizona (estimated population 1,593,659) has the highest score (10.1 years between accidents) among the cities with population over 1 million. So the picture is not entirely clear cut.

We have two quantitative variables, namely population and the safety score (average years between accidents), and we wish to see whether there is any relationship between these two variables in the Allstate report (population is in the report). The most common way to describe the relationship between two quantitative variables is a scatterplot.
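As a rough illustration of how such a scatterplot can be produced, here is a minimal Python sketch. It uses only the seven cities whose figures are quoted in this post, not the full report, and the helper names (to_xy, plot_scatter) are hypothetical. It assumes matplotlib is installed; the import sits inside the plotting function so that the data helper works without it.

```python
# (population, average years between collisions) for cities quoted in this post.
CITIES = {
    "Fort Collins, CO": (138_733, 14.0),
    "Providence, RI": (171_909, 6.0),
    "Hialeah, FL": (218_896, 6.6),
    "Washington, DC": (599_657, 4.8),
    "Phoenix, AZ": (1_593_659, 10.1),
    "Los Angeles, CA": (3_831_868, 6.6),
    "New York, NY": (8_391_881, 7.3),
}


def to_xy(cities):
    """Split the city table into parallel lists: populations (x) and scores (y)."""
    populations = [pop for pop, _ in cities.values()]
    scores = [score for _, score in cities.values()]
    return populations, scores


def plot_scatter(cities, title="Population vs. average years between collisions"):
    """Draw one point per city and return the matplotlib figure."""
    # Imported here (an assumption that matplotlib is available) so that
    # to_xy() can be used even when the plotting library is missing.
    import matplotlib.pyplot as plt

    x, y = to_xy(cities)
    fig, ax = plt.subplots()
    ax.scatter(x, y)
    ax.set_xlabel("Population")
    ax.set_ylabel("Average years between collisions")
    ax.set_title(title)
    return fig
```

Calling plot_scatter(CITIES) and then matplotlib's plt.show() would display the figure. With the full list of cities from the report in place of the seven above, the same function would give the kind of plot discussed in this post.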

The following is a scatterplot of population versus the safety score of average years between accidents for the 193 cities in the report.

In the above scatterplot, each city in the report appears as a point whose position is fixed by the values of both variables. The rightmost point is the most populous city (New York, population about 8.4 million and safety score 7.3). Los Angeles is the second rightmost data point (population about 3.8 million and safety score 6.6). The scatter of the points has the shape of a horn. It appears that, for the largest cities at least, the larger the population, the lower the score (this is the part of the horn that leads to the mouthpiece). To make the horn shape easier to see, the following is the scatterplot without New York.

Both of the above scatterplots show a relationship between the two variables. However, the relationship is not a linear one. In fact, the most interesting part of the scatterplots is the part with the smaller cities. The following is a scatterplot with only the cities with a population of half a million or under.

The above scatterplot is no longer shaped like a horn. It is just a random scatter of points. Thus the cities with a population of half a million or under can have low scores or high scores. Population size does not seem to make a difference. In fact, the correlation coefficient r for these cities is 0.008367 (essentially zero). This numerical summary confirms what we see in the scatterplot.
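For readers who want to see how a correlation coefficient like this is obtained, here is a minimal Python sketch of the Pearson correlation coefficient r, applied to a population range. The function names are hypothetical, and the data are only the seven cities quoted in this post, so the sketch will not reproduce the value 0.008367; that would require the full list of cities from the report.

```python
import math

# (city, population, average years between collisions), as quoted in this post.
CITIES = [
    ("Fort Collins, CO", 138_733, 14.0),
    ("Providence, RI", 171_909, 6.0),
    ("Hialeah, FL", 218_896, 6.6),
    ("Washington, DC", 599_657, 4.8),
    ("Phoenix, AZ", 1_593_659, 10.1),
    ("Los Angeles, CA", 3_831_868, 6.6),
    ("New York, NY", 8_391_881, 7.3),
]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products and squares of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when one variable is constant")
    return sxy / math.sqrt(sxx * syy)


def r_for_population_range(cities, low=0, high=float("inf")):
    """Compute r using only the cities with low <= population <= high."""
    subset = [(pop, score) for _, pop, score in cities if low <= pop <= high]
    populations = [pop for pop, _ in subset]
    scores = [score for _, score in subset]
    return pearson_r(populations, scores)


if __name__ == "__main__":
    # Illustration only: three cities are far too few for a meaningful r.
    print(r_for_population_range(CITIES, high=500_000))
```

A value of r near 0 indicates no straight-line association between the two variables, while values near 1 or -1 indicate a strong one. Note that r measures only linear association, so it is best read alongside the scatterplot rather than in place of it.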

The following is another scatterplot, this time restricted to cities with population under 250,000 (a quarter of a million).

The scatter in the “under 250,000” scatterplot is still a random cluster of points. The population size does not have any bearing on the safety score. The average years between accidents can be high or low for this group of cities. The correlation coefficient r for these cities is 0.075498, slightly higher than the value for the previous scatterplot, but still essentially zero.

As a contrast, the following is a scatterplot with cities of population over half a million. To make the pattern clearer, the scatterplot excludes New York.

The “over half a million” scatterplot does show some kind of relationship. The larger the city, the lower the safety score. However, the relationship between population and average years between accidents is not strong and does not appear to be a straight-line relationship.

So the takeaway here is that for the smaller cities in the Allstate report, there is no relationship between the safety score and population. For example, there is no clear indication that cities of 100,000 are safer than cities of 200,000. However, when the population goes above half a million, there seems to be a relationship, in that the larger the city, the lower the score (less safe). But the relationship for the “over half a million” cities is not a linear one (not a straight-line relationship). Thus linear regression is not appropriate for the “over half a million” cities.