It is well known that among the passengers on the Titanic, women and children were more likely to survive. It is also well known that first class passengers were more likely to survive than passengers in the other classes. We continue the data analysis for two-way tables started in a previous post (A look at Titanic passenger data using two-way tables). The two-way table analysis here is on the categorical variables of passenger class (first, second and third) and survival status (yes and no). The main idea we wish to demonstrate is the statistical idea of a relationship (or association) between two categorical variables. In our example, passenger class and surviving the sinking of the Titanic are related. See the previous post for a more detailed discussion of the basic terminology.

Data for Titanic passengers are easily found on the Internet (this is the link to the site I use). The following is a two-way table showing the survival status of the passengers on the Titanic by passenger class.

It is easier to analyze proportions (or percents) than absolute counts, so we first convert the counts into proportions. The following shows the joint distribution and the marginal distributions.

From the total column on the right, we see that 38% of the passengers survived the sinking of the Titanic. From the total row at the bottom, we see that 24.79% of the passengers were in the first class, 20.9% were in the second class and 54.31% were in the third class. Both the total column and the total row are marginal distributions, so called because they are displayed at the margins of the two-way table. Each is a probability distribution of a single variable. The total column is a probability distribution of survival status while the total row at the bottom is a probability distribution of passenger class.

The collection of the proportions in table (2) that are not in the total column and the total row is a joint distribution of the two categorical variables. The joint distribution describes the joint behavior of the two variables and it gives the proportion of the passengers that belong to a combination of passenger class and survival status. For example, the proportion of the passengers who were in first class and survived the sinking is 0.1554 (15.54%).
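As a quick check on how the joint and marginal distributions fit together, adding the joint proportions for survivors across the three classes should recover the overall survival proportion in the total column. The following is a minimal Python sketch using the survivor entries of table (2) as quoted in this post (0.1554, 0.0909 and 0.1336); the variable names are made up for illustration.

```python
# Joint proportions for (class, survived), as quoted in this post from table (2).
joint_survived = {"first": 0.1554, "second": 0.0909, "third": 0.1336}

# Summing the joint proportions over the classes gives the marginal
# proportion of passengers who survived.
marginal_survived = sum(joint_survived.values())

print(f"Overall proportion surviving: {marginal_survived:.4f}")
```

The sum comes to about 0.38, which agrees with the 38% in the total column.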

Even though the joint distribution in table (2) describes the joint behavior of the two variables, it does not help us see the relationship between passenger class and survival status. Table (2) tells us that the proportion of the passengers who were in third class and survived the sinking is 0.1336 (13.36%) while the proportion of the passengers who were in second class and survived the sinking is 0.0909 (9.09%). This does not mean that the third class passengers had a higher survival rate. In fact the opposite was the case. The third class figure is larger only because third class was by far the largest group (54.31% of all passengers vs. 20.9% for second class).

We need to look at the survival status separately for each passenger class. In other words, we need to look at the proportion of survival within each passenger class. Put another way, we calculate the probability of survival by conditioning on the value of the passenger class.

The three columns in table (3) that are labeled First Class, Second Class and Third Class are the conditional distributions of survival status conditional on the value of the passenger class. Now the relationship between passenger class and survival status is clear. We see that 62.7% of the first class passengers survived, while only 24.61% of the third class passengers survived.
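The conditioning step can be sketched in a few lines of Python: each conditional proportion is the joint proportion divided by the marginal proportion of the class being conditioned on. The inputs are the rounded proportions quoted in this post, and the function and variable names are hypothetical, chosen only for this illustration.

```python
# Joint proportions (class and survived) and marginal proportions (class),
# as quoted in this post from table (2).
joint_survived = {"first": 0.1554, "second": 0.0909, "third": 0.1336}
class_share = {"first": 0.2479, "second": 0.2090, "third": 0.5431}


def conditional_survival(joint, marginal):
    """Return P(survived | class) = P(class and survived) / P(class)."""
    return {cls: joint[cls] / marginal[cls] for cls in joint}


survival_by_class = conditional_survival(joint_survived, class_share)

for cls, rate in survival_by_class.items():
    print(f"{cls:>6} class: {rate:.1%} survived")
```

Because the inputs are rounded to four decimal places, the results can differ from table (3) in the last digit (for example, about 24.6% rather than 24.61% for third class). The second class rate of roughly 43% is implied by the quoted proportions.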

We compare the three conditional distributions of survival status, one for each passenger class. They are clearly different (the higher the class, the more likely survival was). There is a clear relationship between passenger class and survival. Though this may already be a well known fact, the Titanic data example presented here is another excellent demonstration of the data analysis for two-way tables.
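One way to picture what "no relationship" would look like: if survival did not depend on passenger class, every class would share the overall 38% survival rate, and each joint proportion would simply be the product of the two marginal proportions. The sketch below, a rough illustration using only the rounded proportions quoted in this post, compares that hypothetical product with the joint proportions from table (2). The variable names are assumptions made for the example.

```python
# Marginal and joint proportions as quoted in this post.
overall_survival = 0.38
class_share = {"first": 0.2479, "second": 0.2090, "third": 0.5431}
joint_survived = {"first": 0.1554, "second": 0.0909, "third": 0.1336}

# Hypothetical joint proportions if class and survival were unrelated:
# P(class and survived) = P(class) * P(survived).
expected_if_unrelated = {
    cls: overall_survival * share for cls, share in class_share.items()
}

# Positive gap: more survivors in that class than "no relationship" predicts.
gap = {cls: joint_survived[cls] - expected_if_unrelated[cls] for cls in class_share}

for cls in class_share:
    print(
        f"{cls:>6}: quoted {joint_survived[cls]:.4f}, "
        f"if unrelated {expected_if_unrelated[cls]:.4f}, gap {gap[cls]:+.4f}"
    )
```

First and second class have more survivors than the "unrelated" product would give and third class has fewer, which is another way of seeing that the two variables are associated.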

We can add to the above analysis by separating the same data into women (combined with children) and men. We then calculate the conditional distributions for these two groups. We have the following pair of two-way tables.

We now calculate the conditional distributions of survival status (conditioning on passenger class) within each of the gender groups of women plus children and men.

We make two observations. One is that women and children fared much better than men (a result seen in a previous post). Note the 68.7% survival rate for women and children vs. the 16.91% survival rate for men. The second is that passenger class made an even bigger difference in survival rates, both within the women plus children group and in the comparison between women plus children and men.

Within the group of women and children, almost all of the first class passengers survived (about 97%). Even the second class passengers among women and children had a survival rate close to 90%.

On the other hand, men did not fare well in comparison to women and children. Among the first class passengers, only one in three men survived, while the survival rates for men in the second class and third class were dismal. Among first class passengers, the women and children had a survival rate of close to 97% (vs. 33.33% for men). The differential was even larger for the second class (close to 90% for women and children vs. close to 9% for men). In the third class, the survival rates were 44% for women and children vs. 13% for men.
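The differentials described above can be tabulated with a short Python sketch. The rates are the approximate figures quoted in the text (several are "close to" values), so the gaps are rough, and the variable names are made up for illustration.

```python
# Approximate survival rates (in percent) as quoted in the text above.
# Several are "close to" figures, so the computed gaps are only rough.
women_children = {"first": 97.0, "second": 90.0, "third": 44.0}
men = {"first": 33.33, "second": 9.0, "third": 13.0}

# Gap in percentage points between the two groups within each class.
gap_points = {cls: women_children[cls] - men[cls] for cls in women_children}

for cls, gap in gap_points.items():
    print(f"{cls:>6} class: gap of about {gap:.0f} percentage points")
```

With these rounded inputs the gap is largest in second class, followed by first class and then third class.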
