This is an introduction to data analysis for two-way tables, using passenger data from the Titanic disaster of almost one hundred years ago.

The sinking of the Titanic on April 15, 1912 was one of the deadliest peacetime maritime disasters in history, resulting in over 1,500 deaths. Because of the “Women and Children First” mentality, women and children far outnumbered men among the surviving passengers. As a rule, the first class received the first of everything, including the precious seats in the lifeboats. The first class passengers outnumbered the other passenger classes among the survivors. These are well-known facts. But they also highlight a broader statistical idea: a relationship (or association) between two variables. In the case of the Titanic, we can say that there is a relationship between the categorical variable of gender group (women, children, men) and the categorical variable of survival (yes/no). In this post, I use data on the Titanic passengers to demonstrate this idea. Along the way, I discuss the concepts of two-way tables, joint distributions, marginal distributions and conditional distributions.

A categorical variable records which of several groups (categories) an individual belongs to. Variables such as gender, ethnicity and occupation are clearly categorical. Sometimes a categorical variable (e.g. age group, income level) is created by grouping the values of a quantitative variable into classes. In statistics, we often study the relationship between two variables by measuring both on the same individuals in a sample. When both variables are categorical, we can analyze the relationship using *two-way tables*.

Data for Titanic passengers are easily found on the Internet (this is the link of the site I use). The following two-way table shows the survival status of the Titanic passengers by gender group.

This two-way table involves the categorical variables of gender group and survival. In a two-way table, we summarize the categorical data by counting the number of observations that fall into each group for the two variables. For example, the total count for women passengers who survived was 304. We call gender group the column variable since each column describes one gender group. On the other hand, we call survival status the row variable since each horizontal row in the table contains the counts of passengers in that survival status across the gender groups.

In this two-way table, our goal is to describe the relationship (or association) between the two categorical variables. The totals in the table (the rightmost column and the bottom row) make the table easier to analyze. The rightmost column contains the counts of passengers by survival status, without accounting for the other variable (gender group). Likewise, the bottom row contains the counts of passengers by gender group, without accounting for survival status.
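As a minimal sketch in Python, the counts and their totals can be laid out as follows. The cell counts here are an assumption: they are not copied from the original table but reconstructed from figures quoted in this post (304 of 416 women survived, 56 children survived at a 50% rate, 130 men survived, 1,296 passengers in all). The names `counts`, `row_totals` and `col_totals` are illustrative only.

```python
# Two-way table of counts: rows are survival status, columns are gender group.
# Assumption: cell counts are reconstructed from figures quoted in the post.
counts = {
    "yes": {"women": 304, "children": 56, "men": 130},
    "no": {"women": 112, "children": 56, "men": 638},
}
groups = ["women", "children", "men"]

# Rightmost column: total for each survival status, ignoring gender group.
row_totals = {status: sum(row.values()) for status, row in counts.items()}

# Bottom row: total for each gender group, ignoring survival status.
col_totals = {g: sum(counts[s][g] for s in counts) for g in groups}

grand_total = sum(row_totals.values())

# Print the table together with its margins.
print(f"{'':10}" + "".join(f"{g:>10}" for g in groups) + f"{'total':>10}")
for status in counts:
    cells = "".join(f"{counts[status][g]:>10}" for g in groups)
    print(f"{status:10}" + cells + f"{row_totals[status]:>10}")
footer = "".join(f"{col_totals[g]:>10}" for g in groups)
print(f"{'total':10}" + footer + f"{grand_total:>10}")
```

The row totals and the column totals each add up to the same grand total, which is a quick sanity check on any two-way table.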

The two-way table presented here is a summarization of raw data. Imagine that there is a data file containing 1,296 lines of data, one for each passenger, indicating the information about gender group, survival status and other information. Each line of this data file is counted once and only once in the two-way table.
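The tallying step can be sketched in a few lines of Python. The raw data file is not reproduced in this post, so the `records` list below is a hypothetical stand-in: one `(gender group, survived)` pair per passenger, rebuilt from the counts quoted in the text. A real file would carry more fields per line.

```python
from collections import Counter

# Hypothetical stand-in for the raw data file: one (gender group, survived)
# record per passenger. Assumption: rebuilt from counts quoted in the post.
cell_counts = {
    ("women", "yes"): 304, ("women", "no"): 112,
    ("children", "yes"): 56, ("children", "no"): 56,
    ("men", "yes"): 130, ("men", "no"): 638,
}
records = [cell for cell, n in cell_counts.items() for _ in range(n)]

# Tally: each record lands in exactly one cell of the two-way table.
table = Counter(records)

print(len(records))             # number of passengers (lines in the file)
print(table[("women", "yes")])  # women who survived
print(sum(table.values()))      # every record is counted exactly once
```

Because every record falls into exactly one cell, the cell counts always sum to the number of lines in the file.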

It is usually difficult to analyze two-way tables consisting of counts. So we convert the above two-way table into proportions or percentages. First, we divide each count in the two-way table by the grand total 1,296 and obtain the following two-way table.

The numbers in two-way table (2) are proportions or probabilities. For example, what is the proportion of the passengers who were female and survived the sinking? The proportion is 0.235 (23.5%), simply the result of 304 divided by 1,296. Excluding the total column on the right and the total row at the bottom, each entry in the table is the proportion of the Titanic passengers that fall into one combination of gender group and survival status. The collection of these proportions is called the *joint distribution* of the two categorical variables.
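The conversion from counts to a joint distribution is a single division per cell. The sketch below assumes the cell counts reconstructed from figures quoted in this post (they are not copied from the original table), and the variable names are illustrative.

```python
# Assumption: cell counts are reconstructed from figures quoted in the post.
counts = {
    "yes": {"women": 304, "children": 56, "men": 130},
    "no": {"women": 112, "children": 56, "men": 638},
}

grand_total = sum(sum(row.values()) for row in counts.values())

# Joint distribution: each cell count divided by the grand total.
joint = {
    status: {g: n / grand_total for g, n in row.items()}
    for status, row in counts.items()
}

for status, row in joint.items():
    print(status, {g: round(p, 3) for g, p in row.items()})

# The six joint proportions cover every passenger, so they sum to 1.
print(sum(p for row in joint.values() for p in row.values()))
```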

Let’s look at the rightmost column and the bottom row in table (2). The rightmost column tells us that 37.8% of the passengers were survivors. The bottom row tells us that 32.1% of the passengers were women, 8.6% were children and 59.3% were men. Note that each of these is a probability distribution of a single variable. The rightmost column is a probability distribution of survival status and the bottom row is a probability distribution of gender group. Because these distributions are placed at the margins of the two-way table, they are called *marginal distributions*. In a two-way table, a marginal distribution of one variable does not take the other variable into account (it sums out the other variable).
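"Summing out" the other variable can be shown directly. This is a sketch under the same assumption as before: the cell counts are reconstructed from figures quoted in this post rather than copied from the original table, and the names are illustrative.

```python
# Assumption: cell counts are reconstructed from figures quoted in the post.
counts = {
    "yes": {"women": 304, "children": 56, "men": 130},
    "no": {"women": 112, "children": 56, "men": 638},
}
groups = ["women", "children", "men"]
grand_total = sum(sum(row.values()) for row in counts.values())

# Marginal distribution of survival status: sum each row over gender groups.
survival_marginal = {
    status: sum(row.values()) / grand_total for status, row in counts.items()
}

# Marginal distribution of gender group: sum each column over survival status.
gender_marginal = {
    g: sum(counts[status][g] for status in counts) / grand_total for g in groups
}

print({s: round(p, 3) for s, p in survival_marginal.items()})
print({g: round(p, 3) for g, p in gender_marginal.items()})
```

Each marginal distribution sums to 1 on its own, since it describes all passengers along a single variable.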

The marginal distributions in table (2) only give information about one variable at a time, not about the joint behavior of the two variables. The joint distribution in table (2) does not, by itself, help us tease out the relationship between the row variable and the column variable. For example, there were more men survivors than children survivors (130 vs. 56). In percentage terms, 10% of the passengers were men who survived and 4.3% of the passengers were children who survived. But the survival rate of children was 50% and the survival rate of men was only 17%. To see this, we need to look at the survival status separately for women, children and men.

Out of 416 women passengers, 304 survived. Thus the survival rate is 304/416 = 0.731, or 73.1%. Since half the children passengers survived, their survival rate is 0.50, or 50%. The following table shows the survival yes/no percentages for each of the gender groups.

In table (3), the column for women is a probability distribution of survival status that is conditioned on the value of gender group being women. When we condition on the value of one variable and calculate the probability distribution of the other variable, we obtain a conditional distribution. For example, the column in table (3) for men is the conditional distribution of survival status conditional on the gender group being men.
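Conditioning on gender group means dividing each column of counts by its own column total rather than by the grand total. A minimal sketch, assuming cell counts reconstructed from figures quoted in this post (not copied from the original table) and using illustrative names:

```python
# Assumption: cell counts are reconstructed from figures quoted in the post.
counts = {
    "yes": {"women": 304, "children": 56, "men": 130},
    "no": {"women": 112, "children": 56, "men": 638},
}
groups = ["women", "children", "men"]

# Column totals: the number of passengers in each gender group.
col_totals = {g: sum(counts[status][g] for status in counts) for g in groups}

# Conditional distribution of survival status given each gender group:
# divide each cell by the total of its own column.
conditional = {
    g: {status: counts[status][g] / col_totals[g] for status in counts}
    for g in groups
}

for g in groups:
    print(g, {status: round(p, 3) for status, p in conditional[g].items()})
```

Each gender group's pair of proportions sums to 1, which is what makes every column a probability distribution in its own right.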

From table (3), we see that 73.1% of the women passengers survived, 50% of the children passengers survived and only 16.9% of the men passengers survived. By comparing the three conditional distributions, we see the nature of the relationship (association) between gender group and survival status. We see that women and children were more likely to survive and men were more likely to die.

In analyzing a two-way table, there is a relationship between the two variables when the percent of cases for one variable differs reliably across the values or levels of the other variable. In other words, the row variable and the column variable are related (or dependent) when the conditional distributions of one variable (conditioning on the other variable) are significantly different. When there is a relationship between two variables, they are said to be dependent. Otherwise, they are said to be independent.
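One way to make this concrete, offered as a sketch rather than as this post's own method: if gender group and survival status were independent, every gender group would share the overall survival rate. The expected count in each cell would then be (row total × column total) / grand total. The code below compares those expected counts with the observed ones. The cell counts are an assumption, reconstructed from figures quoted in this post, and the names are illustrative.

```python
# Assumption: cell counts are reconstructed from figures quoted in the post.
counts = {
    "yes": {"women": 304, "children": 56, "men": 130},
    "no": {"women": 112, "children": 56, "men": 638},
}
groups = ["women", "children", "men"]

row_totals = {status: sum(row.values()) for status, row in counts.items()}
col_totals = {g: sum(counts[status][g] for status in counts) for g in groups}
grand_total = sum(row_totals.values())

# Expected counts if the two variables were independent:
# (row total * column total) / grand total for each cell.
expected = {
    status: {g: row_totals[status] * col_totals[g] / grand_total for g in groups}
    for status in counts
}

for status in counts:
    for g in groups:
        obs, exp = counts[status][g], expected[status][g]
        print(f"{status:4}{g:>10}  observed={obs:4d}  expected={exp:7.1f}")
```

Large gaps between observed and expected counts, such as far more women survivors and far fewer men survivors than independence would predict, point to a relationship between the two variables. Whether a gap is statistically significant would need a formal test, which is outside the scope of this sketch.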

In the Titanic disaster, women and children were more likely to survive and men were more likely to die. This is a familiar fact, to be sure. Even so, the Titanic passenger data make an excellent introduction to data analysis for two-way tables.
