This is the legendary Titanic ML competition on Kaggle (https://www.kaggle.com/c/titanic).

In this challenge, I built several predictive models to answer the question: “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.). I used ridge regression, lasso regression, a random forest, and a gradient boosted model to predict survival and compared the error rates across models. The dataset provided has already been separated into a training set and a test set. We can take a look at the two datasets.

train_data
## # A tibble: 891 x 12
##    PassengerId Survived Pclass Name  Sex     Age SibSp Parch Ticket  Fare
##          <dbl>    <dbl>  <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr>  <dbl>
##  1           1        0      3 Brau… male     22     1     0 A/5 2…  7.25
##  2           2        1      1 Cumi… fema…    38     1     0 PC 17… 71.3 
##  3           3        1      3 Heik… fema…    26     0     0 STON/…  7.92
##  4           4        1      1 Futr… fema…    35     1     0 113803 53.1 
##  5           5        0      3 Alle… male     35     0     0 373450  8.05
##  6           6        0      3 Mora… male     NA     0     0 330877  8.46
##  7           7        0      1 McCa… male     54     0     0 17463  51.9 
##  8           8        0      3 Pals… male      2     3     1 349909 21.1 
##  9           9        1      3 John… fema…    27     0     2 347742 11.1 
## 10          10        1      2 Nass… fema…    14     1     0 237736 30.1 
## # … with 881 more rows, and 2 more variables: Cabin <chr>, Embarked <chr>
test_data
## # A tibble: 418 x 11
##    PassengerId Pclass Name  Sex     Age SibSp Parch Ticket  Fare Cabin
##          <dbl>  <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr>  <dbl> <chr>
##  1         892      3 Kell… male   34.5     0     0 330911  7.83 <NA> 
##  2         893      3 Wilk… fema…  47       1     0 363272  7    <NA> 
##  3         894      2 Myle… male   62       0     0 240276  9.69 <NA> 
##  4         895      3 Wirz… male   27       0     0 315154  8.66 <NA> 
##  5         896      3 Hirv… fema…  22       1     1 31012… 12.3  <NA> 
##  6         897      3 Sven… male   14       0     0 7538    9.22 <NA> 
##  7         898      3 Conn… fema…  30       0     0 330972  7.63 <NA> 
##  8         899      2 Cald… male   26       1     1 248738 29    <NA> 
##  9         900      3 Abra… fema…  18       0     0 2657    7.23 <NA> 
## 10         901      3 Davi… male   21       2     0 A/4 4… 24.2  <NA> 
## # … with 408 more rows, and 1 more variable: Embarked <chr>

There are 891 samples in the training set and 11 explanatory variables. A description of each variable, including its key values, is given below.

Variable Name  Description                        Key
Survived       Survival                           0 = No, 1 = Yes
Pclass         Passenger’s class                  1 = 1st, 2 = 2nd, 3 = 3rd
Name           Passenger’s name
Sex            Passenger’s sex
Age            Passenger’s age
SibSp          Number of siblings/spouses aboard
Parch          Number of parents/children aboard
Ticket         Ticket number
Fare           Passenger fare
Cabin          Cabin number
Embarked       Port of embarkation                C = Cherbourg, Q = Queenstown, S = Southampton


1. Data Visualization

We will first look at some raw summaries of the variables in the training data. While Pclass and Fare can be considered moderately correlated (correlation coefficient of -0.549), no other pair of variables is highly correlated.
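The chunk that produced these summaries is not echoed above; a minimal sketch of the correlation check, assuming train_data is already loaded (num_vars is just a helper vector introduced here), would be:

# correlation among the numeric variables, using pairwise-complete
# observations so the missing Age values do not drop whole rows
num_vars <- c("Survived", "Pclass", "Age", "SibSp", "Parch", "Fare")
round(cor(train_data[num_vars], use = "pairwise.complete.obs"), 3)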



1.1 Did age influence survival?

Going a step further and omitting the missing values, we can get a sense of the average age of those who survived. At first glance, it is interesting that male survivors have a lower average age than the males who perished, whereas, on the contrary, female survivors have a higher average age than the females who did not survive.

## # A tibble: 4 x 5
## # Groups:   Sex [2]
##   Sex    Survived Count Portion Avg.age
##   <fct>  <fct>    <int>   <dbl>   <dbl>
## 1 male   N          468  0.525     31.6
## 2 male   Y          109  0.122     27.3
## 3 female N           81  0.0909    25.0
## 4 female Y          233  0.262     28.8
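The chunk behind this table is not shown; a sketch of how it could be reproduced with dplyr (column names chosen to match the output above) is:

# survival counts, proportions, and average age by sex, ignoring missing ages
train_data %>%
  mutate(Sex = factor(Sex, levels = c("male", "female")),
         Survived = factor(Survived, levels = c(0, 1), labels = c("N", "Y"))) %>%
  group_by(Sex, Survived) %>%
  summarize(Count = n(),
            Portion = n() / nrow(train_data),
            Avg.age = mean(Age, na.rm = TRUE))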


2. Data Wrangling

Now put the training and the test data together to wrangle them more efficiently, but add a column called data to label whether each row belongs to the training or the test set.

n1 <- nrow(train_data)
n2 <- nrow(test_data)
test_data$Survived <- NA
all <- rbind(train_data, test_data) 
all$data <- c(rep("train", n1), rep("test", n2))
all <- all %>% mutate(Sex = factor(Sex, levels = c("male","female")))

The table below shows the number of missing values in each variable; for example, the variable Age (including the test data) has 263 missing values in total. Different measures are taken to fill the missing values of different variables in the following sections.

## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0         418           0           0           0         263 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           1        1014           2 
##        data 
##           0
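The counts above can be reproduced with a one-line check (a sketch; the original chunk was not echoed):

# number of missing values in each column of the combined data
colSums(is.na(all))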


2.1 Titles of Passengers

A quick glance at the name data.

head(all$Name, n=15)
##  [1] "Braund, Mr. Owen Harris"                            
##  [2] "Cumings, Mrs. John Bradley (Florence Briggs Thayer)"
##  [3] "Heikkinen, Miss. Laina"                             
##  [4] "Futrelle, Mrs. Jacques Heath (Lily May Peel)"       
##  [5] "Allen, Mr. William Henry"                           
##  [6] "Moran, Mr. James"                                   
##  [7] "McCarthy, Mr. Timothy J"                            
##  [8] "Palsson, Master. Gosta Leonard"                     
##  [9] "Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)"  
## [10] "Nasser, Mrs. Nicholas (Adele Achem)"                
## [11] "Sandstrom, Miss. Marguerite Rut"                    
## [12] "Bonnell, Miss. Elizabeth"                           
## [13] "Saundercock, Mr. William Henry"                     
## [14] "Andersson, Mr. Anders Johan"                        
## [15] "Vestrom, Miss. Hulda Amanda Adolfina"
all$Name[40]
## [1] "Nicola-Yarred, Miss. Jamila"

The Name variable is complete (no NAs). In addition to first name and surname, it also contains a title (Mr., Mrs., Miss, etc.) for each individual. I separated this information from the original variable to obtain tidy data, and it can be used to investigate the effect of families traveling together on survival.

#Extracting Title and Surname from Name
all$Surname <- sapply(all$Name, function(x) strsplit(x, split='[,.]')[[1]][1] )
all$Surname[40]
## [1] "Nicola-Yarred"
#Extracting surnames that also include a maiden name
all$Surname <- sapply(all$Surname, function(x) strsplit(x, split='[-]')[[1]][1] )
all$Surname[40]
## [1] "Nicola"
all$Title <- sapply(all$Name, function(x) strsplit(x,', ')[[1]][2])
all$Title <- sapply(all$Title, function(x) strsplit(x,'\\. ')[[1]][1])

all$Title <- sub(' ', '', all$Title) #remove spaces before title
kable(table(all$Sex, all$Title))
          Capt Col Don Dona Dr Jonkheer Lady Major Master Miss Mlle Mme  Mr Mrs Ms Rev Sir theCountess
male         1   4   1    0  7        1    0     2     61    0    0   0 757   0  0   8   1           0
female       0   0   0    1  1        0    1     0      0  260    2   1   0 197  2   0   0           1

Let’s arrange and inspect the data before deciding on the next step. The variable Number counts the number of people in each Title group, and Age_missing is the number of missing ages in that Title group.

title_age <- all %>% mutate(Title = as.factor(Title)) %>% dplyr::select(Sex, Age, Title) %>% group_by(Title) %>% 
                     summarize(Number = n(), Age_missing = sum(is.na(Age)), 
                               Median_age = median(Age, na.rm = TRUE), 
                               Min_age = min(Age, na.rm = TRUE), 
                               Max_age = max(Age, na.rm = TRUE)) %>% arrange(desc(Number))
title_age
## # A tibble: 18 x 6
##    Title       Number Age_missing Median_age Min_age Max_age
##    <fct>        <int>       <int>      <dbl>   <dbl>   <dbl>
##  1 Mr             757         176       29     11       80  
##  2 Miss           260          50       22      0.17    63  
##  3 Mrs            197          27       35.5   14       76  
##  4 Master          61           8        4      0.33    14.5
##  5 Dr               8           1       49     23       54  
##  6 Rev              8           0       41.5   27       57  
##  7 Col              4           0       54.5   47       60  
##  8 Major            2           0       48.5   45       52  
##  9 Mlle             2           0       24     24       24  
## 10 Ms               2           1       28     28       28  
## 11 Capt             1           0       70     70       70  
## 12 Don              1           0       40     40       40  
## 13 Dona             1           0       39     39       39  
## 14 Jonkheer         1           0       38     38       38  
## 15 Lady             1           0       48     48       48  
## 16 Mme              1           0       24     24       24  
## 17 Sir              1           0       49     49       49  
## 18 theCountess      1           0       33     33       33

To reduce the dimensionality and make the variable more manageable, I combined the less common of the 18 categories into the more popular ones according to each passenger’s gender and age. I classify them in the following ways.

Title    Description
Mr       Male: Mr
Master   Male: Master
Miss     Female: includes Miss, Mlle, Ms, Mme
Mrs      Female: Mrs
Rare     Others

all$Title[all$Title %in% c("Mlle", "Ms", "Mme")] <- "Miss"
all$Title[!(all$Title %in% c('Mr', 'Master', 'Miss', 'Mrs'))] <- "Rare"
all$Title <- as.factor(all$Title)
#table(all$Sex, all$Title)
all %>% dplyr::select(Sex, Title) %>% group_by(Sex,Title)%>% summarize(Number = n()) %>% 
        pivot_wider(names_from = Title, values_from = Number, values_fill = list(Number=0)) %>%
        dplyr::select(Sex, Mr, Master, Miss, Mrs, Rare)
## # A tibble: 2 x 6
## # Groups:   Sex [2]
##   Sex       Mr Master  Miss   Mrs  Rare
##   <fct>  <int>  <int> <int> <int> <int>
## 1 male     757     61     0     0    25
## 2 female     0      0   265   197     4

We can take a look at how different titles relate to survival.

ggplot(all[(all$data=="train"),], aes(x = Title, 
                                       fill = factor(Survived, 
                                                     levels = c(0,1), 
                                                     labels = c("N", "Y")))) +
  geom_bar(stat='count', position='stack') +
  labs(title = "Titles of Passengers and Survival Status",
       x = "Title",
       y = "Number of Passengers",
       fill="Survival Status") 


2.2 People Traveling in Groups

Intuitively, people traveling with family may have a higher chance of surviving. Thus, I created a new variable, familySize, which adds up the number of the passenger’s parents/children (Parch), the number of siblings/spouses (SibSp), and one (the passenger himself/herself). From the plot below, it is obvious that passengers traveling alone had a much higher chance of dying, and people traveling in families of 2 to 4 people were more likely to survive, but a larger family size does not guarantee better survival chances.

#creating new variable, familySize  
all$familySize <- all$SibSp + all$Parch + 1 

all[(all$data=="train"),] %>% ggplot(aes(x = familySize, fill = factor(Survived,
                                                                       levels = c(0,1),
                                                                       labels = c("N", "Y")))) +
        geom_bar(stat='count', position='dodge') +
        scale_x_continuous(breaks=c(1:max(all$familySize))) +
        labs(title = "Family Size and Survival Status",
             x = 'Family Size',
             y = "Number of Passengers",
             fill="Survival Status")



3. Missing Values

3.1 Fill Missing Age Values with Linear Regression

There are 263 ages missing. We first look at the survival probability density over age. From the density plot, we can see that survival chances were relatively high for male children, whereas male passengers aged 20-30 were less likely to survive. Passengers in the 20-30 age range also had a higher percentage of travelling alone, which could explain their lower chances of survival.

all[(!is.na(all$Survived) & !is.na(all$Age)),] %>% ggplot(aes(x = Age, fill = Survived)) +
        geom_density(alpha=0.5, aes(fill=factor(Survived, levels = c(0,1),labels = c("N", "Y")))) + 
        facet_wrap(.~Sex)+
        labs(title="Survival Density and Age",
             y = "Density",
             fill="Survival Status") +
        scale_x_continuous(breaks = scales::pretty_breaks(n = 10)) 

I predicted missing values in age with linear regression.

#predicting Age with Linear Regression
set.seed(0)
AgeLM <- lm(Age ~ Pclass + Sex + Embarked + Title + familySize, data=all[!is.na(all$Age),])
summary(AgeLM)
## 
## Call:
## lm(formula = Age ~ Pclass + Sex + Embarked + Title + familySize, 
##     data = all[!is.na(all$Age), ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.096  -7.905  -1.262   6.398  45.738 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  23.3874     2.2353  10.463  < 2e-16 ***
## Pclass       -6.2780     0.4456 -14.090  < 2e-16 ***
## Sexfemale    -5.4681     5.9567  -0.918  0.35885    
## EmbarkedQ     7.2097     1.8116   3.980 7.38e-05 ***
## EmbarkedS     1.1970     0.8990   1.331  0.18333    
## TitleMiss    17.6105     6.2130   2.834  0.00468 ** 
## TitleMr      23.2191     1.7849  13.009  < 2e-16 ***
## TitleMrs     31.6726     6.2158   5.095 4.13e-07 ***
## TitleRare    30.8947     2.8604  10.801  < 2e-16 ***
## familySize   -0.7077     0.2711  -2.610  0.00917 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.02 on 1034 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.4189, Adjusted R-squared:  0.4138 
## F-statistic: 82.81 on 9 and 1034 DF,  p-value: < 2.2e-16
all <- all %>% add_predictions(AgeLM, var = "pred_age")

#filling Linear Regression predictions to missing Ages
all$Age[which(is.na(all$Age))] <- all$pred_age[which(is.na(all$Age))]
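As a quick sanity check (a minimal sketch), we can confirm that no missing ages remain after the imputation:

# should be 0 now that every missing Age has been replaced by its prediction
sum(is.na(all$Age))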


3.2 Fill Missing Fare Value and Embarked Value with KNN

There are two missing values in Embarked and one in Fare. To fill in the missing Fare value, we may need to consider the relationship between the variables Embarked and Fare, since, intuitively, the different ports at which passengers embarked imply longer or shorter journeys.

# passengers with missing Embarked
all[which(is.na(all$Embarked)),]
## # A tibble: 2 x 17
##   PassengerId Survived Pclass Name  Sex     Age SibSp Parch Ticket  Fare
##         <dbl>    <dbl>  <dbl> <chr> <fct> <dbl> <dbl> <dbl> <chr>  <dbl>
## 1          62        1      1 Icar… fema…    38     0     0 113572    80
## 2         830        1      1 Ston… fema…    62     0     0 113572    80
## # … with 7 more variables: Cabin <chr>, Embarked <chr>, data <chr>,
## #   Surname <chr>, Title <fct>, familySize <dbl>, pred_age <dbl>
# passengers with missing Fare
all[which(is.na(all$Fare)),]
## # A tibble: 1 x 17
##   PassengerId Survived Pclass Name  Sex     Age SibSp Parch Ticket  Fare
##         <dbl>    <dbl>  <dbl> <chr> <fct> <dbl> <dbl> <dbl> <chr>  <dbl>
## 1        1044       NA      3 Stor… male   60.5     0     0 3701      NA
## # … with 7 more variables: Cabin <chr>, Embarked <chr>, data <chr>,
## #   Surname <chr>, Title <fct>, familySize <dbl>, pred_age <dbl>

Let’s first look at the average fare for each combination of passenger class and port of embarkation. Since the two passengers with missing Embarked had the same ticket number, the same fare, and travelled in the same cabin, they probably bought their tickets together and embarked at the same port. I also filtered the data to check whether the two passengers were travelling with their families.

missing_emb <- which(is.na(all$Embarked))
missing_fare <- which(is.na(all$Fare))
sum(all$Surname == all$Surname[missing_emb[1]]) + sum(all$Surname == all$Surname[missing_emb[2]]) 
## [1] 2

They were not travelling with other family members. Therefore, I use KNN to fill the missing embarkation values based on their “Pclass”, “Sex”, “Age”, “Fare”, and “familySize”.

# fill the missing Embarked with KNN
all_knn <- all %>% mutate(Sex = ifelse(Sex=="male", 0, 1)) #set female as 1
# columns 3, 5, 6, 10 and 16 are Pclass, Sex, Age, Fare and familySize
knn.pred.emb <- knn(train = all_knn[-c(missing_emb, missing_fare), c(3, 5, 6, 10, 16)],
                    test = all_knn[missing_emb, c(3, 5, 6, 10, 16)],
                    cl = all_knn$Embarked[-c(missing_emb, missing_fare)],
                    k = 2, l = 0, prob = FALSE, use.all = TRUE)

all$Embarked[missing_emb] <- as.character(knn.pred.emb)
all$Embarked <- as.factor(all$Embarked)
knn.pred.emb
## [1] S S
## Levels: C Q S
# see the average fare of different classes and Embarked
fare_class <- all %>% group_by(Embarked, Pclass) %>% summarize(Avg_price = mean(Fare, na.rm = TRUE )) %>% arrange(desc(Avg_price))
fare_class
## # A tibble: 9 x 3
## # Groups:   Embarked [3]
##   Embarked Pclass Avg_price
##   <fct>     <dbl>     <dbl>
## 1 C             1     107. 
## 2 Q             1      90  
## 3 S             1      72.2
## 4 C             2      23.3
## 5 S             2      21.2
## 6 S             3      14.4
## 7 Q             2      11.7
## 8 C             3      11.0
## 9 Q             3      10.4
#all[missing_fare,]
# fill the missing Fare with the average fare for that passenger's Pclass and Embarked
all$Fare[missing_fare] <- (fare_class %>% 
                             filter(Embarked == all[missing_fare,]$Embarked, 
                                    Pclass == all[missing_fare,]$Pclass))$Avg_price



4. Model and Prediction

As the response variable is binary (survived or not), I use logistic regression. But since my purpose is prediction rather than inference, I add penalty terms to the coefficients of the explanatory variables to avoid overfitting, which would result in poor prediction on the test set; in other words, I seek to add bias and reduce variance. I therefore fitted logistic lasso and logistic ridge regression. In addition, I deployed two nonparametric models, a random forest and a gradient boosting model, to compare their generalization results.
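For reference, both penalized fits minimize the negative binomial log-likelihood plus an elastic-net penalty, which is the objective glmnet uses:

$$\min_{\beta_0,\beta}\; -\frac{1}{n}\sum_{i=1}^{n}\Big[y_i(\beta_0 + x_i^\top\beta) - \log\big(1 + e^{\beta_0 + x_i^\top\beta}\big)\Big] + \lambda\Big[(1-\alpha)\tfrac{1}{2}\lVert\beta\rVert_2^2 + \alpha\lVert\beta\rVert_1\Big]$$

Setting alpha = 0 gives the ridge penalty and alpha = 1 gives the lasso penalty; lambda is chosen by cross-validation below.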

I randomly chose 30 percent of the training data to use as a validation set for selecting the best-fitting model.

set.seed(0) 
val.n <- round(dim(train_data)[1] * 0.3)
val <- sample(1:dim(train_data)[1], val.n)
all$data[val] <- "val"

4.1 Logistic Ridge and Lasso Regression

lambda <- .1
colnames(all)[c(3, 5, 6, 7, 8, 10, 12, 15, 16)]
## [1] "Pclass"     "Sex"        "Age"        "SibSp"      "Parch"     
## [6] "Fare"       "Embarked"   "Title"      "familySize"
lasso.data <- all
lasso.data$Sex <- droplevels(all$Sex)
lasso.data$Title <- droplevels(all$Title)
lasso.data$Embarked <- droplevels(all$Embarked)
# 10 fold cross validation
set.seed(0) 
glm.ridge <- cv.glmnet(x=data.matrix(lasso.data[(lasso.data$data=="train"), c(3, 5, 6, 7, 8, 10, 12, 15, 16)]),           
                       y=data.matrix(lasso.data[(lasso.data$data=="train"), 2]),family="binomial", alpha = 0, intercept=TRUE, type.measure = "class")

glm.lasso <- cv.glmnet(x=data.matrix(lasso.data[(lasso.data$data=="train"), c(3, 5, 6, 7, 8, 10, 12, 15, 16)]),
                       y=data.matrix(lasso.data[(lasso.data$data=="train"), 2]),family="binomial", alpha = 1, intercept=TRUE, type.measure = "class")
par(mfrow=c(1,2))
plot(glm.ridge, main = "Ridge") ; plot(glm.lasso, main = "Lasso")

glm.ridge$lambda.min; coef(glm.ridge, s = "lambda.min")
## [1] 0.06815303
## 10 x 1 sparse Matrix of class "dgCMatrix"
##                        1
## (Intercept) -0.334576469
## Pclass      -0.680151567
## Sex          1.788485543
## Age         -0.016491660
## SibSp       -0.149275772
## Parch        0.057390117
## Fare         0.003578326
## Embarked    -0.138826036
## Title       -0.060529274
## familySize  -0.059620025
glm.lasso$lambda.min; coef(glm.lasso, s = "lambda.min")
## [1] 0.003090657
## 10 x 1 sparse Matrix of class "dgCMatrix"
##                        1
## (Intercept)  0.496216854
## Pclass      -1.258248694
## Sex          2.598839652
## Age         -0.038078066
## SibSp       -0.355327227
## Parch        .          
## Fare         .          
## Embarked    -0.148377009
## Title       -0.022033558
## familySize  -0.002656274

The minimum misclassification rates of both ridge and lasso are around 20%. With 10-fold cross-validation, the best lambda value is about 0.068 for the ridge model and 0.00309 for the lasso model. In the lasso model, the coefficients of Parch (number of parents/children aboard) and Fare are shrunk to 0, so these variables are considered least important for explaining the survival status of the passengers aboard.

4.2 Random Forest

all$Sex <- as.factor(all$Sex)
all$Embarked <- as.factor(all$Embarked)
all$Survived <- as.factor(all$Survived)
all$Surname <- as.factor(all$Surname)
set.seed(0)
rf <- randomForest(Survived ~ Pclass + Sex + Age + familySize + Fare + SibSp + Parch + 
                              Embarked + Title , data = all[all$data=="train", ])
rf
## 
## Call:
##  randomForest(formula = Survived ~ Pclass + Sex + Age + familySize +      Fare + SibSp + Parch + Embarked + Title, data = all[all$data ==      "train", ]) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 18.11%
## Confusion matrix:
##     0   1 class.error
## 0 334  51   0.1324675
## 1  62 177   0.2594142
all <- all %>% add_predictions(rf, var = "pred_surv")
# all %>% filter(data == "train") %>% group_by( Survived, pred_surv) %>% summarize(number = n())
error <- table((all %>% filter(data == "train"))$Survived, (all %>% filter(data == "train"))$pred_surv)

#Confusion Matrix
error
##    
##       0   1
##   0 376   9
##   1  29 210
# training error False Positive
cat("Training Error False Positive Rate:",error[1,2] / sum(error[1,]))
## Training Error False Positive Rate: 0.02337662
# training error False Negative
cat("Training Error False Negative Rate:",error[2,1] / sum(error[2,]))
## Training Error False Negative Rate: 0.1213389
# training error rate
cat("Training Error Rate:",(error[1,2] + error[2,1] )/ sum(error))
## Training Error Rate: 0.06089744
# Show model error
{plot(rf, ylim=c(0,0.36), main = "Random Forest Prediction")
legend('topright', colnames(rf$err.rate), col=1:3, fill=1:3)}

# Get importance
importance    <- importance(rf)
varImportance <- data.frame(Variables = row.names(importance), 
                            Importance = round(importance[ ,'MeanDecreaseGini'],2))

# Create a rank variable based on importance
rankImportance <- varImportance %>%
  mutate(Rank = paste0('#',dense_rank(desc(Importance))))

# Use ggplot2 to visualize the relative importance of variables
ggplot(rankImportance, aes(x = reorder(Variables, Importance), 
    y = Importance, fill = Importance)) +
  geom_bar(stat='identity') + 
  geom_text(aes(x = Variables, y = 0.5, label = Rank),
    hjust=0, vjust=0.55, size = 4, colour = 'red') +
  labs(x = 'Variables') +
  coord_flip() 

4.3 Boosting

Boosting builds lots of smaller trees. Unlike random forests, each new tree in boosting tries to patch up the deficiencies of the current ensemble.

set.seed(0)
boost.train=gbm(Survived ~ Pclass + Sex + Age + SibSp + Parch + familySize + Fare + 
                              Embarked + Title , data = all[all$data=="train",],
                 distribution="multinomial",n.trees=10000,shrinkage=0.05,interaction.depth=9)
summary(boost.train)

##                   var    rel.inf
## Fare             Fare 40.1631349
## Age               Age 32.1598029
## Embarked     Embarked 10.2273700
## Title           Title  9.2621859
## Pclass         Pclass  3.5505256
## Sex               Sex  1.8401615
## familySize familySize  1.7565892
## SibSp           SibSp  0.5365829
## Parch           Parch  0.5036472
pred <- predict(boost.train, all, n.trees=10000, type = "response")
labels <-  colnames(pred)[apply(pred, 1, which.max)]
all$pred_surv_boost <-  as.factor(labels)
#all %>% filter(data == "train") %>% group_by( Survived, pred_surv_boost) %>% summarize(number = n())
error <- table((all[1:n1,])$Survived, (all[1:n1,])$pred_surv_boost)

#Confusion Matrix
error
##    
##       0   1
##   0 516  33
##   1  37 305
# training error False Positive
cat("Training Error False Positive Rate:",error[1,2] / sum(error[1,]))
## Training Error False Positive Rate: 0.06010929
# training error False Negative
cat("Training Error False Negative Rate:",error[2,1] / sum(error[2,]))
## Training Error False Negative Rate: 0.1081871
# training error rate
train.error <- (error[1,2] + error[2,1] )/ sum(error)
cat("Training Error Rate:", train.error)
## Training Error Rate: 0.07856341
# tree number and error
n.trees=seq(from=100,to=10000,by=100)
predmat <- predict(boost.train, all[1:n1,], n.trees=n.trees, type = "response")
dim(predmat)
## [1] 891   2 100
labels <- vector()
result <- vector()
for(i in 1:n1){
  labels <- as.double(colnames(predmat)[apply(predmat[i,,], 2, which.max)])
  if(i>1){
    result <- rbind(result, labels)
  }else{
    result <- labels
  }
}
# convert the factor Survived back to numeric 0/1 so it matches the predicted labels
Survived <- as.double(as.character(all$Survived[1:n1]))
berr <-  apply((result - Survived)^2, 2, mean)

{plot(n.trees,berr,pch=19,ylab="Error", xlab="Number of Trees", main="Boosting Train Error (0-1 Loss)",type="l")
abline(h=train.error,col="red")}

5. Final Validation

Before fitting the models, I randomly selected 30% of the training data, 267 passengers, as a validation set and left it out from the start. After fitting the models described in the previous section, I test which model makes the best predictions, i.e. has the lowest classification error, on the validation set.

pred.glm.ridge <- predict(glm.ridge, newx = data.matrix(all[(all$data == "val"),c(3, 5, 6, 7, 8, 10, 12, 15, 16)]), type = "class")
pred.glm.ridge <- as.integer(ifelse(pred.glm.ridge==1, 1, 0))
pred.glm.lasso <- predict(glm.lasso, data.matrix(all[(all$data == "val"),c(3, 5, 6, 7, 8, 10, 12, 15, 16)]),  type = "class")

pred.rf <- predict(rf, all[(all$data == "val"),c(3, 5, 6, 7, 8, 10, 12, 15, 16)],  type = "class")
pred.boost <- predict(boost.train, all[(all$data == "val"),c(3, 5, 6, 7, 8, 10, 12, 15, 16)],  type = "response", n.trees=10000)
labels <-  colnames(pred.boost)[apply(pred.boost, 1, which.max)]
pred.boost <-  as.factor(labels)

val.err.ridge <- sum(pred.glm.ridge != all$Survived[(all$data == "val")]) / val.n
val.err.lasso <- sum(pred.glm.lasso != all$Survived[(all$data == "val")]) / val.n
val.err.rf <- sum(pred.rf != all$Survived[(all$data == "val")]) / val.n
val.err.boost <- sum(pred.boost != all$Survived[(all$data == "val")]) / val.n
## The ridge regression classification error for validation set is 0.2359551
## The lasso regression classification error for validation set is 0.2247191
## The random forest classification error for validation set is 0.1872659
## The boosting classification error for validation set is 0.2359551
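The chunk that printed these lines is not echoed; a minimal sketch that collects the validation errors into a small comparison table (using the objects computed above) could be:

# gather the validation errors and rank the models
val_errors <- tibble(Model = c("Ridge", "Lasso", "Random forest", "Boosting"),
                     Val_error = c(val.err.ridge, val.err.lasso, val.err.rf, val.err.boost))
val_errors %>% arrange(Val_error)

On this particular split, the random forest achieves the lowest validation classification error (about 0.187).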