This is the legendary Titanic ML competition on Kaggle (https://www.kaggle.com/c/titanic).
In this challenge, I built several predictive models to answer the question: “what sorts of people were more likely to survive?” using passenger data (i.e. name, age, gender, socio-economic class, etc.). I used ridge regression, lasso regression, a random forest, and a gradient boosting model to predict survival and compared the error rates of the different models. The provided dataset has already been separated into a training set and a test set. We can take a look at the two datasets.
train_data
## # A tibble: 891 x 12
## PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare
## <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 1 0 3 Brau… male 22 1 0 A/5 2… 7.25
## 2 2 1 1 Cumi… fema… 38 1 0 PC 17… 71.3
## 3 3 1 3 Heik… fema… 26 0 0 STON/… 7.92
## 4 4 1 1 Futr… fema… 35 1 0 113803 53.1
## 5 5 0 3 Alle… male 35 0 0 373450 8.05
## 6 6 0 3 Mora… male NA 0 0 330877 8.46
## 7 7 0 1 McCa… male 54 0 0 17463 51.9
## 8 8 0 3 Pals… male 2 3 1 349909 21.1
## 9 9 1 3 John… fema… 27 0 2 347742 11.1
## 10 10 1 2 Nass… fema… 14 1 0 237736 30.1
## # … with 881 more rows, and 2 more variables: Cabin <chr>, Embarked <chr>
test_data
## # A tibble: 418 x 11
## PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin
## <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <chr>
## 1 892 3 Kell… male 34.5 0 0 330911 7.83 <NA>
## 2 893 3 Wilk… fema… 47 1 0 363272 7 <NA>
## 3 894 2 Myle… male 62 0 0 240276 9.69 <NA>
## 4 895 3 Wirz… male 27 0 0 315154 8.66 <NA>
## 5 896 3 Hirv… fema… 22 1 1 31012… 12.3 <NA>
## 6 897 3 Sven… male 14 0 0 7538 9.22 <NA>
## 7 898 3 Conn… fema… 30 0 0 330972 7.63 <NA>
## 8 899 2 Cald… male 26 1 1 248738 29 <NA>
## 9 900 3 Abra… fema… 18 0 0 2657 7.23 <NA>
## 10 901 3 Davi… male 21 2 0 A/4 4… 24.2 <NA>
## # … with 408 more rows, and 1 more variable: Embarked <chr>
There are 891 samples in the training set and 11 explanatory variables. A description of each variable, including its key values where applicable, is given below.
Variable Name | Description | Key |
---|---|---|
Survived | Survival | 0 = No, 1 = Yes |
Pclass | Passenger’s class | 1 = 1st, 2 = 2nd, 3 = 3rd |
Name | Passenger’s name | |
Sex | Passenger’s sex | |
Age | Passenger’s age | |
SibSp | Number of siblings/spouses aboard | |
Parch | Number of parents/children aboard | |
Ticket | Ticket number | |
Fare | Fare | |
Cabin | Cabin | |
Embarked | Port of embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
We will first look at some raw summaries of the variables in the training data. While Pclass and Fare can be considered moderately correlated (correlation coefficient of -0.549), no other pair of variables is highly correlated.
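As a quick check, the pairwise correlations among the numeric variables can be computed directly (a minimal sketch, assuming the tidyverse is loaded):
# correlations among the numeric columns of the training data,
# using pairwise complete observations because Age has missing values
train_data %>%
  dplyr::select(Survived, Pclass, Age, SibSp, Parch, Fare) %>%
  cor(use = "pairwise.complete.obs") %>%
  round(3)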
Going a step further and omitting the missing values, we can get a sense of the average age of the passengers who survived. At first glance, it is interesting that male survivors have a lower average age than the males who perished, whereas, on the contrary, female survivors have a higher average age than the females who did not survive.
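The grouped summary below can be produced with a group_by/summarize pipeline; a minimal sketch (the column names follow the output shown):
# count, share of all training passengers, and mean age (NAs dropped) by sex and survival
train_data %>%
  mutate(Sex = factor(Sex, levels = c("male", "female")),
         Survived = factor(Survived, levels = c(0, 1), labels = c("N", "Y"))) %>%
  group_by(Sex, Survived) %>%
  summarize(Count = n(),
            Portion = n() / nrow(train_data),
            Avg.age = mean(Age, na.rm = TRUE))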
## # A tibble: 4 x 5
## # Groups: Sex [2]
## Sex Survived Count Portion Avg.age
## <fct> <fct> <int> <dbl> <dbl>
## 1 male N 468 0.525 31.6
## 2 male Y 109 0.122 27.3
## 3 female N 81 0.0909 25.0
## 4 female Y 233 0.262 28.8
Now we put the training and the test data together to wrangle them more efficiently, but retain a column called data to label whether each row comes from the training or the test set.
n1 <- nrow(train_data)
n2 <- nrow(test_data)
test_data$Survived <- NA
all <- rbind(train_data, test_data)
all$data <- c(rep("train", n1), rep("test", n2))
all <- all %>% mutate(Sex = factor(Sex, levels = c("male","female")))
The table below shows the number of missing values in each variable; for example, the variable Age (training and test data combined) has 263 missing values in total. Different measures are taken to fill the missing values of the different variables in the following sections.
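A minimal sketch of how these counts can be obtained:
# number of NA values in every column of the combined data
sapply(all, function(x) sum(is.na(x)))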
## PassengerId Survived Pclass Name Sex Age
## 0 418 0 0 0 263
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 1 1014 2
## data
## 0
A quick glance at the name data.
head(all$Name, n=15)
## [1] "Braund, Mr. Owen Harris"
## [2] "Cumings, Mrs. John Bradley (Florence Briggs Thayer)"
## [3] "Heikkinen, Miss. Laina"
## [4] "Futrelle, Mrs. Jacques Heath (Lily May Peel)"
## [5] "Allen, Mr. William Henry"
## [6] "Moran, Mr. James"
## [7] "McCarthy, Mr. Timothy J"
## [8] "Palsson, Master. Gosta Leonard"
## [9] "Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)"
## [10] "Nasser, Mrs. Nicholas (Adele Achem)"
## [11] "Sandstrom, Miss. Marguerite Rut"
## [12] "Bonnell, Miss. Elizabeth"
## [13] "Saundercock, Mr. William Henry"
## [14] "Andersson, Mr. Anders Johan"
## [15] "Vestrom, Miss. Hulda Amanda Adolfina"
all$Name[40]
## [1] "Nicola-Yarred, Miss. Jamila"
The passengers' names are complete (no NAs). In addition to a first name and surname, the variable also contains a title (Mr., Mrs., Miss, etc.) for each individual. I separated these pieces of information from the original variable to obtain tidy data; they can also be used to investigate the effect of families traveling together on survival.
#Extracting Title and Surname from Name
all$Surname <- sapply(all$Name, function(x) strsplit(x, split='[,.]')[[1]][1] )
all$Surname[40]
## [1] "Nicola-Yarred"
#Extracting surnames that also include a maiden name
all$Surname <- sapply(all$Surname, function(x) strsplit(x, split='[-]')[[1]][1] )
all$Surname[40]
## [1] "Nicola"
all$Title <- sapply(all$Name, function(x) strsplit(x,', ')[[1]][2])
all$Title <- sapply(all$Title, function(x) strsplit(x,'\\. ')[[1]][1])
all$Title <- sub(' ', '', all$Title) #remove spaces before title
kable(table(all$Sex, all$Title))
Sex | Capt | Col | Don | Dona | Dr | Jonkheer | Lady | Major | Master | Miss | Mlle | Mme | Mr | Mrs | Ms | Rev | Sir | theCountess |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
male | 1 | 4 | 1 | 0 | 7 | 1 | 0 | 2 | 61 | 0 | 0 | 0 | 757 | 0 | 0 | 8 | 1 | 0 |
female | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 260 | 2 | 1 | 0 | 197 | 2 | 0 | 0 | 1 |
Arrange and inspect the data before deciding on the next step. The variable Number counts the number of people in each Title group, and Age_missing is the number of missing ages in that group.
title_age <- all %>% mutate(Title = as.factor(Title)) %>% dplyr::select(Sex, Age, Title) %>% group_by(Title) %>%
summarize(Number = n(), Age_missing = sum(is.na(Age)),
Median_age = median(Age, na.rm = TRUE),
Min_age = min(Age, na.rm = TRUE),
Max_age = max(Age, na.rm = TRUE)) %>% arrange(desc(Number))
title_age
## # A tibble: 18 x 6
## Title Number Age_missing Median_age Min_age Max_age
## <fct> <int> <int> <dbl> <dbl> <dbl>
## 1 Mr 757 176 29 11 80
## 2 Miss 260 50 22 0.17 63
## 3 Mrs 197 27 35.5 14 76
## 4 Master 61 8 4 0.33 14.5
## 5 Dr 8 1 49 23 54
## 6 Rev 8 0 41.5 27 57
## 7 Col 4 0 54.5 47 60
## 8 Major 2 0 48.5 45 52
## 9 Mlle 2 0 24 24 24
## 10 Ms 2 1 28 28 28
## 11 Capt 1 0 70 70 70
## 12 Don 1 0 40 40 40
## 13 Dona 1 0 39 39 39
## 14 Jonkheer 1 0 38 38 38
## 15 Lady 1 0 48 48 48
## 16 Mme 1 0 24 24 24
## 17 Sir 1 0 49 49 49
## 18 theCountess 1 0 33 33 33
To reduce the dimension and make the variable more manageable, I combine the less common of the 18 categories into the more popular ones according to gender and age, classifying them as follows.
Title | Description |
---|---|
Mr | Male: Mr |
Master | Male: Master |
Miss | Female: Includes Mlle, Ms, Miss, Mme |
Mrs | Female: Mrs |
Rare | Others |
all$Title[all$Title %in% c("Mlle", "Ms", "Mme")] <- "Miss"
all$Title[!(all$Title %in% c('Mr', 'Master', 'Miss', 'Mrs'))] <- "Rare"
all$Title <- as.factor(all$Title)
#table(all$Sex, all$Title)
all %>% dplyr::select(Sex, Title) %>% group_by(Sex,Title)%>% summarize(Number = n()) %>%
pivot_wider(names_from = Title, values_from = Number, values_fill = list(Number=0)) %>%
dplyr::select(Sex, Mr, Master, Miss, Mrs, Rare)
## # A tibble: 2 x 6
## # Groups: Sex [2]
## Sex Mr Master Miss Mrs Rare
## <fct> <int> <int> <int> <int> <int>
## 1 male 757 61 0 0 25
## 2 female 0 0 265 197 4
We can take a look at how different titles are related to survival.
ggplot(all[(all$data=="train"),], aes(x = Title,
fill = factor(Survived,
levels = c(0,1),
labels = c("N", "Y")))) +
geom_bar(stat='count', position='stack') +
labs(title = "Titles of Passengers and Survival Status",
x = "Title",
y = "Number of Passengers",
fill="Survival Status")
Intuitively, people traveling with family may have a higher chance of surviving. Thus, I created a new variable, familySize, which adds up the number of the passenger's parents/children (Parch), the number of siblings/spouses (SibSp), and one for the passenger himself/herself. From the plot below, it is clear that passengers traveling alone were much more likely to die, and people traveling in families of 2 to 4 were more likely to survive, but a larger family size does not guarantee better odds of survival.
#creating new variable, familySize
all$familySize <- all$SibSp + all$Parch + 1
all[(all$data=="train"),] %>% ggplot(aes(x = familySize, fill = factor(Survived,
levels = c(0,1),
labels = c("N", "Y")))) +
geom_bar(stat='count', position='dodge') +
scale_x_continuous(breaks=c(1:max(all$familySize))) +
labs(title = "Family Size and Survival Status",
x = 'Family Size',
y = "Number of Passengers",
fill="Survival Status")
There are 263 ages missing. We first look at the survival probability density over age. From the density plot, we can see that survival chances were relatively high for young boys, while male passengers aged 20-30 were less likely to survive. Passengers in the 20-30 age range also had a higher percentage of travelling alone, which could explain their lower chances of survival (a quick check of this follows the density plot).
all[(!is.na(all$Survived) & !is.na(all$Age)),] %>% ggplot(aes(x = Age, fill = Survived)) +
geom_density(alpha=0.5, aes(fill=factor(Survived, levels = c(0,1),labels = c("N", "Y")))) +
facet_wrap(.~Sex)+
labs(title="Survival Density and Age",
y = "Density",
fill="Survival Status") +
scale_x_continuous(breaks = scales::pretty_breaks(n = 10))
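As a rough check of the travelling-alone claim above, we can tabulate the share of solo travellers by age band (a sketch; the age bands are my own choice, not part of the original analysis):
# proportion of passengers travelling alone (familySize == 1) within each age band, training data only
all %>%
  filter(data == "train", !is.na(Age)) %>%
  mutate(AgeBand = cut(Age, breaks = c(0, 20, 30, 45, 80), include.lowest = TRUE)) %>%
  group_by(AgeBand) %>%
  summarize(PropAlone = mean(familySize == 1), Count = n())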
I predicted the missing values of Age with a linear regression on the other variables and filled them in with the fitted values.
#predicting Age with Linear Regression
set.seed(0)
AgeLM <- lm(Age ~ Pclass + Sex + Embarked + Title + familySize, data=all[!is.na(all$Age),])
summary(AgeLM)
##
## Call:
## lm(formula = Age ~ Pclass + Sex + Embarked + Title + familySize,
## data = all[!is.na(all$Age), ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.096 -7.905 -1.262 6.398 45.738
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.3874 2.2353 10.463 < 2e-16 ***
## Pclass -6.2780 0.4456 -14.090 < 2e-16 ***
## Sexfemale -5.4681 5.9567 -0.918 0.35885
## EmbarkedQ 7.2097 1.8116 3.980 7.38e-05 ***
## EmbarkedS 1.1970 0.8990 1.331 0.18333
## TitleMiss 17.6105 6.2130 2.834 0.00468 **
## TitleMr 23.2191 1.7849 13.009 < 2e-16 ***
## TitleMrs 31.6726 6.2158 5.095 4.13e-07 ***
## TitleRare 30.8947 2.8604 10.801 < 2e-16 ***
## familySize -0.7077 0.2711 -2.610 0.00917 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.02 on 1034 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.4189, Adjusted R-squared: 0.4138
## F-statistic: 82.81 on 9 and 1034 DF, p-value: < 2.2e-16
all <- all %>% add_predictions(AgeLM, var = "pred_age")
#filling Linear Regression predictions to missing Ages
all$Age[which(is.na(all$Age))] <- all$pred_age[which(is.na(all$Age))]
There are two missing values in Embarked and one in Fare. To fill in the missing value of Fare, we may need to consider the relationship between Embarked and Fare, since intuitively the port at which a passenger embarked implies a longer or shorter journey.
# passengers with missing Embarked
all[which(is.na(all$Embarked)),]
## # A tibble: 2 x 17
## PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare
## <dbl> <dbl> <dbl> <chr> <fct> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 62 1 1 Icar… fema… 38 0 0 113572 80
## 2 830 1 1 Ston… fema… 62 0 0 113572 80
## # … with 7 more variables: Cabin <chr>, Embarked <chr>, data <chr>,
## # Surname <chr>, Title <fct>, familySize <dbl>, pred_age <dbl>
# passengers with missing Fare
all[which(is.na(all$Fare)),]
## # A tibble: 1 x 17
## PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare
## <dbl> <dbl> <dbl> <chr> <fct> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 1044 NA 3 Stor… male 60.5 0 0 3701 NA
## # … with 7 more variables: Cabin <chr>, Embarked <chr>, data <chr>,
## # Surname <chr>, Title <fct>, familySize <dbl>, pred_age <dbl>
Let’s first see the average fare for different classes and ports of embarkation. As the two passengers with missing Embarked had the same ticket number, the same fare, and travelled in the same cabin, they probably bought their tickets together and embarked at the same port. I also filtered the data to check whether the two passengers were travelling with their families.
missing_emb <- which(is.na(all$Embarked))
missing_fare <- which(is.na(all$Fare))
sum(all$Surname == all$Surname[missing_emb[1]]) + sum(all$Surname == all$Surname[missing_emb[2]])
## [1] 2
They were not travelling with other family members. Therefore, I use KNN to fill the missing embarkation values based on Pclass, Sex, Age, Fare, and familySize.
# fill the missing Embarked with KNN
all_knn <- all %>% mutate(Sex = ifelse(Sex=="male", 0, 1)) #set female as 1
# columns 3, 5, 6, 10, 16 are Pclass, Sex, Age, Fare, familySize
knn.pred.emb <- knn(train = all_knn[-c(missing_emb, missing_fare), c(3, 5, 6, 10, 16)],
                    test = all_knn[missing_emb, c(3, 5, 6, 10, 16)],
                    cl = all_knn$Embarked[-c(missing_emb, missing_fare)],
                    k = 2, l = 0, prob = FALSE, use.all = TRUE)
all$Embarked[missing_emb] <- as.character(knn.pred.emb)
all$Embarked <- as.factor(all$Embarked)
knn.pred.emb
## [1] S S
## Levels: C Q S
# see the avearaage fare of different classes and Embarked
fare_class <- all %>% group_by(Embarked, Pclass) %>% summarize(Avg_price = mean(Fare, na.rm = TRUE )) %>% arrange(desc(Avg_price))
fare_class
## # A tibble: 9 x 3
## # Groups: Embarked [3]
## Embarked Pclass Avg_price
## <fct> <dbl> <dbl>
## 1 C 1 107.
## 2 Q 1 90
## 3 S 1 72.2
## 4 C 2 23.3
## 5 S 2 21.2
## 6 S 3 14.4
## 7 Q 2 11.7
## 8 C 3 11.0
## 9 Q 3 10.4
#all[missing_fare,]
all$Fare[missing_fare] <- (fare_class %>%
  filter(Embarked == all[missing_fare, ]$Embarked) %>%
  filter(Pclass == all[missing_fare, ]$Pclass))$Avg_price
As the response variable is binary (survived or not), I use logistic regression; but since my purpose is prediction rather than inference, I add penalty terms on the coefficients of the explanatory variables to avoid overfitting, which would result in poor prediction on the test set. In other words, I seek to add a little bias and reduce variance. I therefore fitted logistic lasso and logistic ridge regressions. In addition, I deployed two nonparametric models, a random forest and a gradient boosting model, to compare their generalization results.
I randomly chose 30 percent of the training data to use as a validation set for selecting the best fitting model.
set.seed(0)
val.n <- round(dim(train_data)[1] * 0.3)
val <- sample(1:dim(train_data)[1], val.n)
all$data[val] <- "val"
lambda <- .1
colnames(all)[c(3, 5, 6, 7, 8, 10, 12, 15, 16)]
## [1] "Pclass" "Sex" "Age" "SibSp" "Parch"
## [6] "Fare" "Embarked" "Title" "familySize"
lasso.data <- all
lasso.data$Sex <- droplevels(all$Sex)
lasso.data$Title <- droplevels(all$Title)
lasso.data$Embarked <- droplevels(all$Embarked)
# 10 fold cross validation
set.seed(0)
glm.ridge <- cv.glmnet(x=data.matrix(lasso.data[(lasso.data$data=="train"), c(3, 5, 6, 7, 8, 10, 12, 15, 16)]),
y=data.matrix(lasso.data[(lasso.data$data=="train"), 2]),family="binomial", alpha = 0, intercept=TRUE, type.measure = "class")
glm.lasso <- cv.glmnet(x=data.matrix(lasso.data[(lasso.data$data=="train"), c(3, 5, 6, 7, 8, 10, 12, 15, 16)]),
y=data.matrix(lasso.data[(lasso.data$data=="train"), 2]),family="binomial", alpha = 1, intercept=TRUE, type.measure = "class")
par(mfrow=c(1,2))
plot(glm.ridge, main = "Ridge") ; plot(glm.lasso, main = "Lasso")
glm.ridge$lambda.min; coef(glm.ridge, s = "lambda.min")
## [1] 0.06815303
## 10 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -0.334576469
## Pclass -0.680151567
## Sex 1.788485543
## Age -0.016491660
## SibSp -0.149275772
## Parch 0.057390117
## Fare 0.003578326
## Embarked -0.138826036
## Title -0.060529274
## familySize -0.059620025
glm.lasso$lambda.min; coef(glm.lasso, s = "lambda.min")
## [1] 0.003090657
## 10 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 0.496216854
## Pclass -1.258248694
## Sex 2.598839652
## Age -0.038078066
## SibSp -0.355327227
## Parch .
## Fare .
## Embarked -0.148377009
## Title -0.022033558
## familySize -0.002656274
The minimum misclassification rates of both ridge and lasso are around 20%. With 10-fold cross-validation, the best lambda value is about 0.068 for the ridge model and 0.00309 for the lasso model. In the lasso model, the coefficients of Parch (number of parents/children aboard) and Fare are shrunk to 0, so these variables are considered the least important for explaining the survival status of the passengers aboard.
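The minimum cross-validated misclassification rates quoted above can be read off the cv.glmnet objects directly; a minimal sketch (cvm stores the mean cross-validated error for each lambda):
# smallest cross-validated misclassification rate for the ridge and lasso fits
min(glm.ridge$cvm)
min(glm.lasso$cvm)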
all$Sex <- as.factor(all$Sex)
all$Embarked <- as.factor(all$Embarked)
all$Survived <- as.factor(all$Survived)
all$Surname <- as.factor(all$Surname)
set.seed(0)
rf <- randomForest(Survived ~ Pclass + Sex + Age + familySize + Fare + SibSp + Parch +
Embarked + Title , data = all[all$data=="train", ])
rf
##
## Call:
## randomForest(formula = Survived ~ Pclass + Sex + Age + familySize + Fare + SibSp + Parch + Embarked + Title, data = all[all$data == "train", ])
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 18.11%
## Confusion matrix:
## 0 1 class.error
## 0 334 51 0.1324675
## 1 62 177 0.2594142
all <- all %>% add_predictions(rf, var = "pred_surv")
# all %>% filter(data == "train") %>% group_by( Survived, pred_surv) %>% summarize(number = n())
error <- table((all %>% filter(data == "train"))$Survived, (all %>% filter(data == "train"))$pred_surv)
#Confusion Matrix
error
##
## 0 1
## 0 376 9
## 1 29 210
# training error False Positive
cat("Training Error False Positive Rate:",error[1,2] / sum(error[1,]))
## Training Error False Positive Rate: 0.02337662
# training error False Negative
cat("Training Error False Negative Rate:",error[2,1] / sum(error[2,]))
## Training Error False Negative Rate: 0.1213389
# training error rate
cat("Training Error Rate:",(error[1,2] + error[2,1] )/ sum(error))
## Training Error Rate: 0.06089744
# Show model error
{plot(rf, ylim=c(0,0.36), main = "Random Forest Prediction")
legend('topright', colnames(rf$err.rate), col=1:3, fill=1:3)}
# Get importance
importance <- importance(rf)
varImportance <- data.frame(Variables = row.names(importance),
Importance = round(importance[ ,'MeanDecreaseGini'],2))
# Create a rank variable based on importance
rankImportance <- varImportance %>%
mutate(Rank = paste0('#',dense_rank(desc(Importance))))
# Use ggplot2 to visualize the relative importance of variables
ggplot(rankImportance, aes(x = reorder(Variables, Importance),
y = Importance, fill = Importance)) +
geom_bar(stat='identity') +
geom_text(aes(x = Variables, y = 0.5, label = Rank),
hjust=0, vjust=0.55, size = 4, colour = 'red') +
labs(x = 'Variables') +
coord_flip()
Boosting builds many small trees sequentially. Unlike a random forest, each new tree in boosting tries to patch up the deficiencies of the current ensemble.
set.seed(0)
boost.train=gbm(Survived ~ Pclass + Sex + Age + SibSp + Parch + familySize + Fare +
Embarked + Title , data = all[all$data=="train",],
distribution="multinomial",n.trees=10000,shrinkage=0.05,interaction.depth=9)
summary(boost.train)
## var rel.inf
## Fare Fare 40.1631349
## Age Age 32.1598029
## Embarked Embarked 10.2273700
## Title Title 9.2621859
## Pclass Pclass 3.5505256
## Sex Sex 1.8401615
## familySize familySize 1.7565892
## SibSp SibSp 0.5365829
## Parch Parch 0.5036472
pred <- predict(boost.train, all, n.trees=10000, type = "response")
labels <- colnames(pred)[apply(pred, 1, which.max)]
all$pred_surv_boost <- as.factor(labels)
#all %>% filter(data == "train") %>% group_by( Survived, pred_surv_boost) %>% summarize(number = n())
error <- table((all[1:n1,])$Survived, (all[1:n1,])$pred_surv_boost)
#Confusion Matrix
error
##
## 0 1
## 0 516 33
## 1 37 305
# training error False Positive
cat("Training Error False Positive Rate:",error[1,2] / sum(error[1,]))
## Training Error False Positive Rate: 0.06010929
# training error False Negative
cat("Training Error False Negative Rate:",error[2,1] / sum(error[2,]))
## Training Error False Negative Rate: 0.1081871
# training error rate
train.error <- (error[1,2] + error[2,1] )/ sum(error)
cat("Training Error Rate:", train.error)
## Training Error Rate: 0.07856341
# tree number and error
n.trees=seq(from=100,to=10000,by=100)
predmat <- predict(boost.train, all[1:n1,], n.trees=n.trees, type = "response")
dim(predmat)
## [1] 891 2 100
labels <- vector()
result <- vector()
for(i in 1:n1){
labels <- as.double(colnames(predmat)[apply(predmat[i,,], 2, which.max)])
if(i>1){
result <- rbind(result, labels)
}else{
result <- labels
}
}
# Survived was converted to a factor earlier; go through character so we get 0/1 rather than the factor codes 1/2
Survived <- as.double(as.character(all$Survived[1:n1]))
berr <- apply((result - Survived)^2, 2, mean)
{plot(n.trees,berr,pch=19,ylab="Error", xlab="Number of Trees", main="Boosting Train Error (0-1 Loss)",type="l")
abline(h=train.error,col="red")}
Before fitting the models, I randomly selected 30% of the training data (267 passengers) as a validation set and left it out from the start. After fitting the models described in the previous sections, I test which model makes the best predictions, i.e. has the lowest classification error, on the validation set.
pred.glm.ridge <- predict(glm.ridge, newx = data.matrix(all[(all$data == "val"),c(3, 5, 6, 7, 8, 10, 12, 15, 16)]), type = "class")
pred.glm.ridge <- as.integer(ifelse(pred.glm.ridge==1, 1, 0))
pred.glm.lasso <- predict(glm.lasso, data.matrix(all[(all$data == "val"),c(3, 5, 6, 7, 8, 10, 12, 15, 16)]), type = "class")
pred.rf <- predict(rf, all[(all$data == "val"),c(3, 5, 6, 7, 8, 10, 12, 15, 16)], type = "class")
pred.boost <- predict(boost.train, all[(all$data == "val"),c(3, 5, 6, 7, 8, 10, 12, 15, 16)], type = "response", n.trees=10000)
labels <- colnames(pred.boost)[apply(pred.boost, 1, which.max)]
pred.boost <- as.factor(labels)
val.err.ridge <- sum(pred.glm.ridge != all$Survived[(all$data == "val")]) / val.n
val.err.lasso <- sum(pred.glm.lasso != all$Survived[(all$data == "val")]) / val.n
val.err.rf <- sum(pred.rf != all$Survived[(all$data == "val")]) / val.n
val.err.boost <- sum(pred.boost != all$Survived[(all$data == "val")]) / val.n
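The error rates below were presumably printed with cat(); a minimal sketch:
# report the validation classification error of each model
cat("The ridge regression classification error for validation set is", val.err.ridge, "\n")
cat("The lasso regression classification error for validation set is", val.err.lasso, "\n")
cat("The random forest classification error for validation set is", val.err.rf, "\n")
cat("The boosting classification error for validation set is", val.err.boost, "\n")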
## The ridge regression classification error for validation set is 0.2359551
## The lasso regression classification error for validation set is 0.2247191
## The random forest classification error for validation set is 0.1872659
## The boosting classification error for validation set is 0.2359551