## San Francisco Annual Rainfall Data – A Regression Example

This is an introductory article on simple linear regression. For a good introductory statistics reference, see [1].

1. The Data
The following is a table containing the annual rainfall data (the number of rain days and the amount of rainfall in inches) in San Francisco from 1960 to 2009. The data are obtained from the Golden Gate Weather Services. The years in the data are fiscal years (July 1 to June 30). Thus the year 2009 refers to the period July 1, 2008 to June 30, 2009.

We expect a positive correlation between the number of rain days in a year and the annual amount of rainfall even without doing any correlation or regression analysis. The questions to be answered are how strong the relationship is and whether a linear regression model is appropriate for describing the data. This example serves as an introductory discussion of linear regression with only two variables (this is called simple linear regression).

The individuals in this investigation are the rain seasons from 1960 to 2009. Two measurements were recorded for each individual, the number of rain days in a year and the annual rainfall (in inches) in San Francisco. Both of these are quantitative variables.

In exploring the data, there are two questions that can be asked. Is the purpose simply to explore the relationship between the two variables? Is the purpose also to determine whether one of the variables can help explain variation in the other variable? That is, one of the variables is the explanatory variable and the other variable is the response variable. In this post, we wish to explore the relationship between the number of rain days and the annual rainfall in San Francisco (e.g. to determine the form of the relationship, its direction and its strength). As an illustration of linear regression, we use the number of rain days as the explanatory variable to explain the annual rainfall.

In examining the rainfall data, we use this strategy for data analysis: (1) start with a graphical display of the data, (2) look for the overall pattern and note any deviations from that pattern, and (3) use numerical summaries to describe aspects of the data (see [1]).

The most common way to describe the relationship between two quantitative variables is a scatterplot. In examining a scatterplot, describe the overall pattern by looking for the form, direction, and strength of the relationship. An important kind of deviation is an outlier, an individual value that falls outside the overall pattern of the relationship. See [1] for more details.

The scatterplots are produced in Excel, and the numerical calculations are performed in Excel and on a TI-83 Plus calculator.

2. Looking at the Data Graphically
In a scatterplot, the explanatory variable is always plotted on the horizontal axis (the x axis). The explanatory variable is usually called x and the response variable y. Figure 1 gives a scatterplot that displays the relationship between the annual number of rain days (x) and the annual rainfall in San Francisco (y).

How does this scatterplot reflect the rainfall data? In this example, each individual in the data (i.e. each rain season) appears as a point in the plot fixed by the values of both variables. Two of the driest seasons (the two lowest points in the plot) are the fiscal years 1975 (47 days and 7.97 inches) and 1976 (41 days and 11.06 inches). The wettest season is the fiscal year 1997 (119 days and 47.22 inches).

Now consider the overall pattern of the scatterplot.

Form. The plot suggests that the relationship between the two variables is linear (the form is linear). The points in the plot roughly follow a straight line.

Direction. There is a positive association between the two variables (the direction is a positive association). A rain season in San Francisco with a higher than average number of annual rain days tends to be associated with above average annual rainfall. In other words, above average values of the two variables tend to go together. Likewise, below average values of the two variables tend to go together.

Strength. The relationship is fairly strong. The strength of a relationship is determined by how closely the points in the scatterplot follow the form of the relationship (in this case a straight line).

There seem to be no obvious outliers (individuals that fall outside the overall linear pattern).

3. The Calculation
After entering the data into a TI-83 Plus, the following results are obtained.

$\displaystyle \text{Means and standard deviations:} \ \ \ \begin{pmatrix} \text{ }&\text{Mean}&\text{Standard Deviation} \\\text{ }&\text{ }&\text{ } \\\text{Annual Rain Days (x)}&\overline{x}=69.96&s_{x}=15.69 \\\text{Annual Rainfall (y)}&\overline{y}=22.06&s_{y}=7.88 \end{pmatrix}$

$\displaystyle \begin{aligned}\text{Linear regression results: } \ \ \ \ \ \ & \ \ \ r=0.8814 \ \ \ \text{(correlation)} \\&\ \ \ r^2=0.7769 \ \ \ \text{(coefficient of determination)} \\&\ \ \ \hat{y}=0.4427 \ x - 8.909 \ \ \ \text{(regression line)} \end{aligned}$
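The reported slope and intercept can be cross-checked from the summary statistics alone. The Python sketch below is not part of the original TI-83 Plus workflow. It assumes the standard identities for a least-squares line $\hat{y}=a+bx$, namely $b=r \, s_y/s_x$ and $a=\overline{y}-b \, \overline{x}$; the variable names are illustrative. Because the inputs are rounded, the results should agree with the reported line only up to rounding.

```python
# Summary statistics as reported above (rounded values from the article).
x_bar, s_x = 69.96, 15.69   # annual rain days
y_bar, s_y = 22.06, 7.88    # annual rainfall (inches)
r = 0.8814                  # correlation

# Standard identities for the least-squares line y_hat = a + b * x.
slope = r * s_y / s_x               # b = r * s_y / s_x
intercept = y_bar - slope * x_bar   # a = y_bar - b * x_bar

print(f"slope     = {slope:.4f}")
print(f"intercept = {intercept:.3f}")
```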

4. Looking at the Data Numerically
Figure 1 shows that there is a strong linear relationship between the annual number of rain days and the annual rainfall in San Francisco. Now we go beyond visual inspection to numerical summaries, since our eyes are not always good judges of how strong a relationship is. For example, a weak relationship can be made to look stronger by using a different scale.

4.1 Correlation
The correlation $r$ measures the direction and strength of the linear relationship between two quantitative variables. In this example, the correlation $r=0.8814$ confirms the overall pattern displayed in the scatterplot. The sign of $r$ indicates the direction of the association. The correlation $r$ is always between $-1$ and $1$. Values of $r$ near $0$ mean a very weak linear relationship. Values of $r$ near either $-1$ or $1$ indicate a strong linear relationship between the two variables. In this example, $r=0.8814$ indicates a strong positive association between the annual rain days and the annual rainfall in San Francisco.
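As a rough illustration of how $r$ is computed, the following Python sketch averages the products of standardized values (with $n-1$ as the divisor). It uses only the nine seasons listed in Table 1 later in this article, not the full 50-season data set, so the value it prints will differ from $r=0.8814$. The function names are illustrative.

```python
from math import sqrt

# Nine (rain days, rainfall in inches) pairs taken from Table 1.
# This is only a subset of the 50 seasons, so r here will differ from 0.8814.
rain_days = [119, 104, 100, 100, 93, 68, 49, 47, 41]
rainfall = [47.22, 34.43, 38.17, 34.02, 25.09, 13.87, 14.32, 7.97, 11.06]


def mean(values):
    return sum(values) / len(values)


def sample_sd(values):
    """Sample standard deviation (divisor n - 1)."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def correlation(xs, ys):
    """Sum of products of standardized values, divided by n - 1."""
    mx, my = mean(xs), mean(ys)
    sx, sy = sample_sd(xs), sample_sd(ys)
    products = (((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys))
    return sum(products) / (len(xs) - 1)


r_subset = correlation(rain_days, rainfall)
print(f"r on the nine-season subset = {r_subset:.4f}")
```

Swapping the roles of the two variables gives the same value, since the correlation does not distinguish between explanatory and response variables.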

4.2 Least-Squares Regression Line
If a scatterplot and the correlation $r$ show a linear relationship, we would like to summarize this overall pattern by drawing a line through the data. A regression line is a straight line that allows us to predict the value of $y$ (the response variable) for a given value of $x$.

More than one line can be drawn on the scatterplot, especially when the points are widely scattered. So we need a way to draw a regression line that does not depend on guesswork. In addition, we want a line that is as close as possible to the points in the scatterplot. The regression line we use is called the least-squares regression line, which is a line that is as close as possible to the points in the vertical direction (more about this point below).

The least-squares regression line in this example is obtained from a TI-83 Plus and is: $\hat{y}=0.4427 \ x - 8.909$. Figure 2 below is the scatterplot with the least-squares regression line.

In the absence of the least-squares regression line, the best estimate of the annual rainfall for the next fiscal year would simply be the mean, $\overline{y}=22.06$ inches. This estimate does not take account of the additional information in the number of rain days. Dry year or wet year, the estimate is 22.06 inches. The least-squares regression line instead gives a predicted value $\hat{y}$ for each value of $x$.

Suppose that the total number of rain days in fiscal year 2010 is estimated to be 50. Then the predicted annual rainfall for San Francisco is $\hat{y}=0.4427(50)-8.909=13.226$ inches. If the number of rain days for 2010 is revised to 65, then the annual rainfall estimate is $\hat{y}=0.4427(65)-8.909=19.8665 \approx 20$ inches.
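The same arithmetic can be wrapped in a small helper. This is a minimal Python sketch; the function name `predict_rainfall` is illustrative, and the default coefficients are the rounded values of the fitted line reported above.

```python
def predict_rainfall(rain_days, slope=0.4427, intercept=-8.909):
    """Predicted annual rainfall (inches) from the article's fitted line."""
    return slope * rain_days + intercept


# The two scenarios discussed above: 50 and 65 rain days.
for days in (50, 65):
    print(f"{days} rain days -> {predict_rainfall(days):.4f} inches")
```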

The following table (Table 1) compares the observed annual rainfall amounts and the predicted rainfall amounts for selected fiscal years.

$\displaystyle \text{Table 1} \begin{pmatrix} \text{Year}&\text{ }&\text{Rain}&\text{ }&\text{Observed}&\text{ }&\text{Predicted} \\\text{ }&\text{ }&\text{Days (x)}&\text{ }&\text{Rainfall (y)}&\text{ }&\text{Rainfall (}\hat{y} \text{)} \\\text{ }&\text{ }&\text{ } \\1997&\text{ }&119&\text{ }&47.22&\text{ }&43.7723 \\2005&\text{ }&104&\text{ }&34.43&\text{ }&37.1318 \\1982&\text{ }&100&\text{ }&38.17&\text{ }&35.3610 \\1994&\text{ }&100&\text{ }&34.02&\text{ }&35.3610 \\1968&\text{ }&93&\text{ }&25.09&\text{ }&32.2621 \\1960&\text{ }&68&\text{ }&13.87&\text{ }&21.1946 \\1989&\text{ }&49&\text{ }&14.32&\text{ }&12.7833 \\1975&\text{ }&47&\text{ }&7.97&\text{ }&11.8979 \\1976&\text{ }&41&\text{ }&11.06&\text{ }&9.2417 \\\cdots&\text{ }&\cdots&\text{ }&\cdots&\text{ }&\cdots \\\text{Avg Year}&\text{ }&69.96&\text{ }&22.06&\text{ }&22.06 \end{pmatrix}$

Note the last row of Table 1. When the value of the explanatory variable ($x$) is the mean of the explanatory variable ($\overline{x}$), the predicted value $\hat{y}$ is the same as the mean of the response. In other words, when $x=\overline{x}$, $\hat{y}=\overline{y}$. This means that the point $(\overline{x},\overline{y})$ is on the regression line. This fact is true for all least-squares regression lines.
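The reason can be sketched in one line. Write the fitted line as $\hat{y}=a+bx$ and assume the standard least-squares intercept formula $a=\overline{y}-b \, \overline{x}$ (the symbols $a$ and $b$ are introduced here only for this check). Then at $x=\overline{x}$:

$\displaystyle \hat{y}=a+b \, \overline{x}=(\overline{y}-b \, \overline{x})+b \, \overline{x}=\overline{y}$

With the rounded values in this example, $0.4427(69.96)-8.909=22.0623 \approx 22.06$.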

The regression line is identified by its slope and its y-intercept. In this example, the slope is $0.4427$ and the y-intercept is $-8.909$. The slope is the rate of change in the response $y$ as the explanatory variable $x$ changes. In other words, the slope is the change (increase or decrease) in $y$ for an increase of one unit in $x$. Thus, on an annual basis, for each additional rain day, we can expect an increase of $0.4427$ inches in rainfall in San Francisco.

Technically, the y-intercept is the predicted value of $y$ when the explanatory variable is $0$. In this case, the y-intercept has no practical meaning, since it is negative and rainfall cannot be negative. In general, this prediction is of no statistical use if the explanatory variable cannot take on values near zero.

4.3 The Residuals
In the fiscal year 1975 (one of the driest on record), the number of rain days in San Francisco was $x=47$. The predicted rainfall is $\hat{y}=0.4427(47)-8.909=11.8979$ inches, while the actual record for San Francisco was $y=7.97$ inches. For the fiscal year 1976 (another dry year), the recorded number of rain days for San Francisco was $x=41$. The predicted rainfall is $\hat{y}=0.4427(41)-8.909=9.2417$ inches, while the actual rainfall record for San Francisco was $y=11.06$ inches.

In the case of 1975, the predicted value $\hat{y}$ is larger than the actual observed value of $y$ (the regression line overpredicts or overestimates, i.e. the data point is below the regression line). On the other hand, in 1976, the regression line underpredicts or underestimates (i.e. the data point is above the regression line). Looking at Figure 2, we see that the least-squares regression line overpredicts for some data points (the line is above these points) and underpredicts for others (the line is below these points).

For 1975, look at the difference $y-\hat{y}=7.97-11.8979=-3.9279$ inches. In 1976, consider $y-\hat{y}=11.06-9.2417=1.8183$ inches. These are called residuals. Let’s look at this difference more closely.

$\text{residual}=\text{observed y}-\text{predicted y} \ \ \ \ \ \ \ \ \ \ i.e. \ \ (y-\hat{y})$

A residual is the difference between an observed value of the response variable and the value predicted by the regression line. In other words, a residual for a given point in a scatterplot is obtained by residual = observed y minus predicted y ($y-\hat{y}$). Graphically, a residual is simply the vertical distance in the scatterplot between a data point and the regression line. Table 2 below lists the residuals for selected data points in Figure 2.

$\displaystyle \text{Table 2} \begin{pmatrix} \text{Year}&\text{ }&\text{Rain}&\text{ }&\text{Observed}&\text{ }&\text{Predicted}&\text{ }&\text{Residual} \\\text{ }&\text{ }&\text{Days (x)}&\text{ }&\text{Rainfall (y)}&\text{ }&\text{Rainfall (}\hat{y} \text{)}&\text{ }&y-\hat{y} \\\text{ }&\text{ }&\text{ }&\text{ } \\1997&\text{ }&119&\text{ }&47.22&\text{ }&43.7723&\text{ }&3.4477 \\2005&\text{ }&104&\text{ }&34.43&\text{ }&37.1318&\text{ }&-2.7018 \\1982&\text{ }&100&\text{ }&38.17&\text{ }&35.3610&\text{ }&2.8090 \\1994&\text{ }&100&\text{ }&34.02&\text{ }&35.3610&\text{ }&-1.3410 \\1968&\text{ }&93&\text{ }&25.09&\text{ }&32.2621&\text{ }&-7.1721 \\1960&\text{ }&68&\text{ }&13.87&\text{ }&21.1946&\text{ }&-7.3246 \\1989&\text{ }&49&\text{ }&14.32&\text{ }&12.7833&\text{ }&1.5367 \\1975&\text{ }&47&\text{ }&7.97&\text{ }&11.8979&\text{ }&-3.9279 \\1976&\text{ }&41&\text{ }&11.06&\text{ }&9.2417&\text{ }&1.8183 \\\cdots&\text{ }&\cdots&\text{ }&\cdots&\text{ }&\cdots&\text{ }&\cdots \\\text{Avg Year}&\text{ }&69.96&\text{ }&22.06&\text{ }&22.06&\text{ }&0 \end{pmatrix}$
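The residual column of Table 2 can be reproduced with a few lines of Python. This is a sketch using the rounded coefficients of the fitted line, with illustrative names; a positive residual means the line underpredicts and a negative residual means it overpredicts.

```python
# (year, rain days, observed rainfall in inches) for the seasons in Table 2.
seasons = [
    (1997, 119, 47.22),
    (2005, 104, 34.43),
    (1982, 100, 38.17),
    (1994, 100, 34.02),
    (1968, 93, 25.09),
    (1960, 68, 13.87),
    (1989, 49, 14.32),
    (1975, 47, 7.97),
    (1976, 41, 11.06),
]

SLOPE, INTERCEPT = 0.4427, -8.909  # fitted line reported in the article


def residual(rain_days, observed):
    """Observed y minus predicted y for one season."""
    predicted = SLOPE * rain_days + INTERCEPT
    return observed - predicted


residuals = {year: residual(x, y) for year, x, y in seasons}
for year, res in residuals.items():
    label = "underpredicts" if res > 0 else "overpredicts"
    print(f"{year}: residual = {res:+.4f} (line {label})")
```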

The least-squares regression line is precisely the line that makes the sum of the squares of the vertical distances (residuals) as small as possible. If we think of the residuals as errors, the least-squares regression line seeks to make the sum of squared errors as small as possible. There are other types of regression lines that we can fit in a scatterplot, which we do not consider. In our discussion, we only focus on the least-squares regression line.
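To see the "as small as possible" property in action, the Python sketch below fits a least-squares line to the nine seasons of Table 1 using the standard closed-form formulas, then compares its sum of squared residuals with that of slightly nudged lines. Because it uses only nine of the 50 seasons, the fitted coefficients here are not the article's $0.4427$ and $-8.909$. The nudge sizes and function names are arbitrary choices for illustration.

```python
# Nine (rain days, rainfall) pairs from Table 1; a subset of the full data.
rain_days = [119, 104, 100, 100, 93, 68, 49, 47, 41]
rainfall = [47.22, 34.43, 38.17, 34.02, 25.09, 13.87, 14.32, 7.97, 11.06]


def sum_squared_residuals(slope, intercept):
    """Sum of squared vertical distances from the points to the line."""
    return sum(
        (y - (slope * x + intercept)) ** 2 for x, y in zip(rain_days, rainfall)
    )


def least_squares_fit(xs, ys):
    """Closed-form least-squares slope and intercept."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


best_slope, best_intercept = least_squares_fit(rain_days, rainfall)
best_sse = sum_squared_residuals(best_slope, best_intercept)

# Nudging the slope and/or intercept should never lower the sum.
nudged = [
    sum_squared_residuals(best_slope + ds, best_intercept + di)
    for ds in (-0.05, 0.0, 0.05)
    for di in (-1.0, 0.0, 1.0)
    if (ds, di) != (0.0, 0.0)
]

print(f"least-squares sum of squares:      {best_sse:.3f}")
print(f"smallest sum among nudged lines:   {min(nudged):.3f}")
```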

4.4 The Coefficient of Determination
Another regression result from the TI-83 Plus is the coefficient of determination (the square of $r$), which is the fraction of the variation in the response variable $y$ that is explained by the least-squares regression line. In this example, $r^2=0.7769$. So 77.69% of the variation in the San Francisco annual rainfall is explained by the least-squares regression line $\hat{y}=0.4427 \ x - 8.909$. The other $100-77.69=22.31$% of the variation is not explained by this model and is attributable to other variables that the model does not account for.
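The phrase "fraction of the variation explained" can be made concrete. The Python sketch below, again on the nine-season subset from Table 1 (so its numbers differ from $r^2=0.7769$), computes $1-\text{SSE}/\text{SST}$, where SSE is the sum of squared residuals and SST is the total sum of squared deviations of $y$ from $\overline{y}$, and confirms that it matches the square of the correlation. The abbreviations SSE and SST are not used elsewhere in this article.

```python
# Nine (rain days, rainfall) pairs from Table 1; a subset of the full data.
rain_days = [119, 104, 100, 100, 93, 68, 49, 47, 41]
rainfall = [47.22, 34.43, 38.17, 34.02, 25.09, 13.87, 14.32, 7.97, 11.06]

n = len(rain_days)
x_bar = sum(rain_days) / n
y_bar = sum(rainfall) / n

sxx = sum((x - x_bar) ** 2 for x in rain_days)
syy = sum((y - y_bar) ** 2 for y in rainfall)  # total variation in y (SST)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(rain_days, rainfall))

# Least-squares line for this subset.
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Variation left over after fitting the line (SSE).
sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(rain_days, rainfall))

r_squared_from_variation = 1 - sse / syy
r_squared_from_correlation = sxy ** 2 / (sxx * syy)  # square of r

print(f"1 - SSE/SST = {r_squared_from_variation:.4f}")
print(f"r squared   = {r_squared_from_correlation:.4f}")
```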

5. One More Look at the Residuals
The relationship between the number of rain days and the annual rainfall in San Francisco appears to be a linear one (a straight line relationship). The relationship is a positive association (as expected) and is quite strong. Because the scatterplot and the correlation both point to a strong linear relationship, we use a least-squares regression line to predict the annual rainfall in San Francisco given the number of rain days. The least-squares regression line is also called the linear model. Is the linear model the right model to describe the annual rainfall data of San Francisco? In other words, how well does the least-squares regression line capture the overall pattern of the data?

To answer these questions, we can examine the residuals again. As noted above, a residual is the difference between an observed value of the response variable ($y$) and the value predicted by the regression line ($\hat{y}$). Graphically, a residual is the vertical distance in the scatterplot between a data point and the regression line. The term residual comes from the fact that it is the “left over” variation in the response after fitting the regression line.

Examining the residuals helps assess how well the regression line describes the data. If the least-squares regression line catches the overall pattern of the data, there should be no pattern in the residuals. To aid this analysis, we construct a residual plot of the San Francisco rainfall data.

A residual plot for a linear regression problem is a scatterplot of the regression residuals against the explanatory variable. In other words, we plot the explanatory variable of the regression problem on the horizontal axis and the residuals on the vertical axis. The residual plot helps assess how well the least-squares regression line fits the data. Figure 3 below is the residual plot for the regression displayed in Figures 1 and 2.

The residual plot (Figure 3) turns the regression line into a horizontal line (at zero). It magnifies the residuals to make the patterns easier to see. As noted before, if the regression line catches the overall pattern of the data, there should be no pattern in the residuals, that is, the residual plot should show an unstructured horizontal band centered at zero. This is the case in Figure 3. The points in Figure 3 scatter around the horizontal axis with no regular pattern. Thus we conclude that the least-squares regression line $\hat{y}=0.4427 \ x-8.909$ fits the San Francisco rainfall data quite well.
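The figures in this article were produced in Excel. As an alternative, a residual plot could be sketched in Python roughly as follows. The code assumes matplotlib is installed (it is imported only inside the plotting function), uses only the nine seasons from Table 2 rather than the full data set, and writes to a hypothetical file name `residual_plot.png`.

```python
SLOPE, INTERCEPT = 0.4427, -8.909  # fitted line reported in the article

# (rain days, observed rainfall) for the nine seasons listed in Table 2.
seasons = [
    (119, 47.22), (104, 34.43), (100, 38.17), (100, 34.02), (93, 25.09),
    (68, 13.87), (49, 14.32), (47, 7.97), (41, 11.06),
]


def residual_points(data, slope=SLOPE, intercept=INTERCEPT):
    """Return (x, residual) pairs: the coordinates of a residual plot."""
    return [(x, y - (slope * x + intercept)) for x, y in data]


def plot_residuals(points, path="residual_plot.png"):
    """Save a residual plot to `path`; requires matplotlib."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.scatter([x for x, _ in points], [res for _, res in points])
    ax.axhline(0, linestyle="--")  # the regression line becomes y = 0
    ax.set_xlabel("Annual rain days (x)")
    ax.set_ylabel("Residual (inches)")
    fig.savefig(path)
    plt.close(fig)


points = residual_points(seasons)
```

Calling `plot_residuals(points)` then writes the image. With only nine points the plot is too sparse to judge the fit, so the conclusion above rests on the full residual plot in Figure 3.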

In sum, there is a strong correlation between the number of rain days in a year and the annual amount of rainfall in San Francisco. We can use the least-squares linear regression line to predict the amount of rainfall given the number of rain days. Furthermore, the linear regression model is an appropriate model to describe the rainfall data in San Francisco.

Reference

1. Moore, D. S., McCabe, G. P., Craig, B. A., Introduction to the Practice of Statistics, 6th ed., W. H. Freeman and Company, New York, 2009.