The CASdatasets package, which contains a large variety of actuarial datasets, is maintained by Christophe Dutang; the dataset used here is also available on OpenML at https://www.openml.org/d/41214.
The freMTPL2freq dataset I’m going to study consists of risk features collected from motor third-party liability policies in France. There are 678,013 policies in the dataset and 12 variables. A description of each variable, including its key values, is given below.
library("CASdatasets")
data(freMTPL2freq)
data <- as_tibble(freMTPL2freq)
Variable Name | Description | Key |
---|---|---|
IDpol | Policy ID | (links to the claims dataset) |
ClaimNb | Number of claims during the exposure period | |
Exposure | Period of exposure (in years) | |
VehPower | Power of the car | |
VehAge | Vehicle age (in years) | |
DrivAge | Driver age (in years) | |
BonusMalus | Bonus/malus level, between 50 and 350 | <100: bonus; >100: malus in France |
VehBrand | Car brand | Unknown (anonymized) categories B1–B14 |
VehGas | Car fuel type | Diesel or Regular |
Area | Density class of the city where the driver lives | “A” for rural to “F” for urban centre |
Density | Density of inhabitants in the city where the driver lives | Inhabitants per square kilometre |
Region | Policy region in France | |
str(data)
## tibble [678,013 × 12] (S3: tbl_df/tbl/data.frame)
## $ IDpol : num [1:678013] 1 3 5 10 11 13 15 17 18 21 ...
## $ ClaimNb : 'table' num [1:678013(1d)] 1 1 1 1 1 1 1 1 1 1 ...
## $ Exposure : num [1:678013] 0.1 0.77 0.75 0.09 0.84 0.52 0.45 0.27 0.71 0.15 ...
## $ VehPower : int [1:678013] 5 5 6 7 7 6 6 7 7 7 ...
## $ VehAge : int [1:678013] 0 0 2 0 0 2 2 0 0 0 ...
## $ DrivAge : int [1:678013] 55 55 52 46 46 38 38 33 33 41 ...
## $ BonusMalus: int [1:678013] 50 50 50 50 50 50 50 68 68 50 ...
## $ VehBrand : Factor w/ 11 levels "B1","B10","B11",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ VehGas : chr [1:678013] "Regular" "Regular" "Diesel" "Diesel" ...
## $ Area : Factor w/ 6 levels "A","B","C","D",..: 4 4 2 2 2 5 5 3 3 2 ...
## $ Density : int [1:678013] 1217 1217 54 76 76 3003 3003 137 137 60 ...
## $ Region : Factor w/ 21 levels "Alsace","Aquitaine",..: 21 21 18 2 2 16 16 13 13 17 ...
# convert VehGas from character to factor, and ClaimNb from a 1-d 'table' to a plain double vector
data <- data %>% mutate(VehGas = factor(VehGas, levels = c("Diesel","Regular")), ClaimNb = as.double(ClaimNb))
data
## # A tibble: 678,013 x 12
## IDpol ClaimNb Exposure VehPower VehAge DrivAge BonusMalus VehBrand
## <dbl> <dbl> <dbl> <int> <int> <int> <int> <fct>
## 1 1 1 0.1 5 0 55 50 B12
## 2 3 1 0.77 5 0 55 50 B12
## 3 5 1 0.75 6 2 52 50 B12
## 4 10 1 0.09 7 0 46 50 B12
## 5 11 1 0.84 7 0 46 50 B12
## 6 13 1 0.52 6 2 38 50 B12
## 7 15 1 0.45 6 2 38 50 B12
## 8 17 1 0.27 7 0 33 68 B12
## 9 18 1 0.71 7 0 33 68 B12
## 10 21 1 0.15 7 0 41 50 B12
## # … with 678,003 more rows, and 4 more variables: VehGas <fct>,
## # Area <fct>, Density <int>, Region <fct>
# identify rows that are identical except for IDpol, ClaimNb and Exposure
to_combine <- data %>%
  distinct_at(vars(-c(IDpol, ClaimNb, Exposure))) %>%
  mutate(group_id = row_number())
# tag each policy with its group_id (the rows are not aggregated), add the
# claim frequency, and log-transform Density
data <- data %>%
  left_join(to_combine) %>%
  mutate(Freq = as.double(ClaimNb / Exposure),
         Density = log(Density))
## Joining, by = c("VehPower", "VehAge", "DrivAge", "BonusMalus", "VehBrand", "VehGas", "Area", "Density", "Region")
The exposure is the duration of the insurance coverage of a given policy. The claim frequency is the number of claims divided by the exposure, i.e. the claim count per unit of exposure. Among the 678,013 policies, 5.02% (34,060 policies) had at least one notified claim.
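A quick sanity check of that figure (a minimal sketch using the tibble built above):

# count policies with at least one claim and their share of the portfolio
data %>% summarize(n_policies = n(),
                   n_with_claim = sum(ClaimNb > 0),
                   pct_with_claim = mean(ClaimNb > 0) * 100)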
claimpercent <- data %>% group_by(ClaimNb) %>% summarize(count = n()) %>% mutate(percent = count / sum(count) * 100)
## `summarise()` ungrouping output (override with `.groups` argument)
claimpercent
## # A tibble: 11 x 3
## ClaimNb count percent
## <dbl> <int> <dbl>
## 1 0 643953 95.0
## 2 1 32178 4.75
## 3 2 1784 0.263
## 4 3 82 0.0121
## 5 4 7 0.00103
## 6 5 2 0.000295
## 7 6 1 0.000147
## 8 8 1 0.000147
## 9 9 1 0.000147
## 10 11 3 0.000442
## 11 16 1 0.000147
# collapse ClaimNb into a binary indicator of claim occurrence
total <- data %>%
  mutate(ClaimNb = factor(ifelse(ClaimNb == 0, 0, 1))) %>%
  group_by(ClaimNb) %>%
  summarize(total_percent = n() / nrow(data) * 100)
total
## # A tibble: 2 x 2
## ClaimNb total_percent
## <fct> <dbl>
## 1 0 95.0
## 2 1 5.02
ggplot(total, aes(x = factor(ClaimNb, labels = c("No", "Yes")), y = total_percent, fill = factor(ClaimNb, labels = c("No", "Yes")))) +
  geom_col() +
  labs(title = "Claim Occurrence Indicator (%)",
       x = "Claim Occurrence",
       y = "Percentage of Policies",
       fill = "Claim Occurrence")
Before fitting any models, three considerations follow from the data above:

- In a Poisson distribution the mean should equal the variance. We run an overdispersion test and switch to a negative binomial distribution if the test confirms overdispersion.
- There may be more zeros than a Poisson model expects, in which case a logit component (a zero-inflated model) can predict the excess zeros.
- Observations with different exposure periods are not directly comparable, so we add an offset of log(Exposure) to the model to scale for the exposure effect.

A sketch of the first two alternatives follows this list.
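For illustration only, here is a minimal sketch of those two alternatives fitted on a few predictors, assuming the MASS and pscl packages are installed; the post itself proceeds with the overdispersion test further below.

library(MASS)  # glm.nb() for negative binomial regression (note: MASS masks dplyr::select)
library(pscl)  # zeroinfl() for zero-inflated count models
# negative binomial with the same log(Exposure) offset
nb.fit <- glm.nb(ClaimNb ~ DrivAge + BonusMalus + VehAge + offset(log(Exposure)), data = data)
# zero-inflated Poisson: count part | zero (logit) part
zip.fit <- zeroinfl(ClaimNb ~ DrivAge + BonusMalus + VehAge + offset(log(Exposure)) | BonusMalus,
                    data = data, dist = "poisson")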
We now look at some raw summaries of the variables in the dataset. Before visualizing them, we first check whether there are any missing values. Fortunately, there are none.
sapply(data, function(x) {sum(is.na(x))})
## IDpol ClaimNb Exposure VehPower VehAge DrivAge
## 0 0 0 0 0 0
## BonusMalus VehBrand VehGas Area Density Region
## 0 0 0 0 0 0
## group_id Freq
## 0 0
The Freq variable added above is defined as the number of claims per unit of exposure. We graph the average frequency by driver age. The minimum average frequency occurs at age 92 and the maximum at age 94, though both extremes likely reflect the very small number of policies at those ages.
p2 <- data %>% group_by(DrivAge) %>% summarize(avgFreq = mean(Freq))
## `summarise()` ungrouping output (override with `.groups` argument)
head(p2,10)
## # A tibble: 10 x 2
## DrivAge avgFreq
## <int> <dbl>
## 1 18 0.701
## 2 19 0.468
## 3 20 0.490
## 4 21 0.442
## 5 22 0.419
## 6 23 0.610
## 7 24 0.277
## 8 25 0.259
## 9 26 0.403
## 10 27 0.298
p2$DrivAge[which.min(p2$avgFreq)]
## [1] 92
p2$DrivAge[which.max(p2$avgFreq)]
## [1] 94
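A quick check of how thin the data are behind those extremes (a sketch; the age-90 cutoff is arbitrary):

# number of policies at each of the oldest driver ages
data %>% filter(DrivAge >= 90) %>% count(DrivAge)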
p2 %>% ggplot(mapping=aes(x = DrivAge, y= avgFreq))+
#geom_col(position='dodge') +
geom_area(fill="lightblue", color="black")+
labs(title = "Average Frequency at Each Age",
x = "Driver Age",
y = "Average Frequency")
p3 <- data %>% mutate(vehagegroup = as.factor(floor(VehAge/10))) %>% mutate(agegroup = factor(vehagegroup, levels = c("0-9","10-19","20-29","30-39","40-49","50-59","60-69","70-79","80-89","90-99","100"))) %>% group_by(VehBrand, vehagegroup) %>% summarize(avgFreq = mean(Freq))
## `summarise()` regrouping output by 'VehBrand' (override with `.groups` argument)
plot3 <- p3 %>% ggplot(mapping=aes(x = factor(vehagegroup, labels = c("0-9","10-19","20-29","30-39","40-49","50-59","60-69","70-79","80-89","90-99", "100")), y = avgFreq, fill = VehBrand))+
geom_col(position='dodge') +
#geom_boxplot() +
facet_wrap(VehBrand~.)+
labs(title = "Vehicle brand and age on frequency",
x = "Vehicle Age (in years)",
y = "Average Frequency")+
coord_flip()
plot3
The Area variable represents the density class of the city community where the driver lives, ranging from “A” for rural areas to “F” for the urban centre. Below, we compare the BonusMalus distribution across areas.
p4 <- data
levels(p4$Area)[levels(p4$Area)=="A"] <- "A : Rural"
levels(p4$Area)[levels(p4$Area)=="F"] <- "F : Urban"
plot4 <- p4 %>% ggplot(mapping=aes(x = Area, y = BonusMalus))+
#geom_col(position='dodge') +
geom_boxplot() +
#facet_wrap(VehBrand~.)+
labs(title = "Area on BonusMalus",
x = "Area",
y = "Log BonusMalus")+
scale_y_log10()+
coord_flip()
plot4
I randomly hold out 30 percent of the data as a test set for comparing the fitted models.
set.seed(0)
test.n <- round(dim(data)[1] * 0.3)
n1 <- dim(data)[1] - test.n # train data number
test <- sample(1:dim(data)[1], test.n)
data$data <- "train"
data$data[test] <- "test"
cat("The number of training set is", n1)
## The number of training set is 474609
cat("The number of test set is", test.n)
## The number of test set is 203404
poissonglm<- glm(ClaimNb ~ VehPower + VehAge + DrivAge + BonusMalus
+ VehBrand + VehGas + Density + Region + Area, data=data[(data$data=="train"),], family = "poisson", offset=log(Exposure))
summary(poissonglm)
##
## Call:
## glm(formula = ClaimNb ~ VehPower + VehAge + DrivAge + BonusMalus +
## VehBrand + VehGas + Density + Region + Area, family = "poisson",
## data = data[(data$data == "train"), ], offset = log(Exposure))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1059 -0.3860 -0.2953 -0.1613 13.3922
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.1735648 0.1254057 -33.280 < 2e-16
## VehPower 0.0141975 0.0033018 4.300 1.71e-05
## VehAge -0.0380864 0.0013949 -27.305 < 2e-16
## DrivAge 0.0062648 0.0004707 13.310 < 2e-16
## BonusMalus 0.0227045 0.0003639 62.396 < 2e-16
## VehBrandB10 -0.0073934 0.0439954 -0.168 0.86654
## VehBrandB11 0.0433983 0.0478811 0.906 0.36474
## VehBrandB12 0.1712531 0.0209126 8.189 2.63e-16
## VehBrandB13 0.0140278 0.0491767 0.285 0.77545
## VehBrandB14 -0.1167775 0.0930572 -1.255 0.20951
## VehBrandB2 -0.0004928 0.0182293 -0.027 0.97843
## VehBrandB3 -0.0041910 0.0262130 -0.160 0.87297
## VehBrandB4 -0.0655094 0.0358522 -1.827 0.06767
## VehBrandB5 0.0703785 0.0295369 2.383 0.01718
## VehBrandB6 -0.0546279 0.0341846 -1.598 0.11004
## VehGasRegular 0.0383847 0.0130754 2.936 0.00333
## Density 0.0399956 0.0150920 2.650 0.00805
## RegionAquitaine -0.0517645 0.1099885 -0.471 0.63790
## RegionAuvergne -0.2826195 0.1375760 -2.054 0.03995
## RegionBasse-Normandie 0.0304173 0.1155773 0.263 0.79241
## RegionBourgogne 0.0313282 0.1178533 0.266 0.79037
## RegionBretagne 0.1567049 0.1075443 1.457 0.14508
## RegionCentre 0.1273406 0.1060485 1.201 0.22984
## RegionChampagne-Ardenne 0.0928946 0.1455315 0.638 0.52327
## RegionCorse 0.1706513 0.1304968 1.308 0.19097
## RegionFranche-Comte -0.0266387 0.1898815 -0.140 0.88843
## RegionHaute-Normandie 0.0408281 0.1245511 0.328 0.74306
## RegionIle-de-France 0.0411092 0.1073911 0.383 0.70187
## RegionLanguedoc-Roussillon 0.0383289 0.1093758 0.350 0.72601
## RegionLimousin 0.2373672 0.1292935 1.836 0.06638
## RegionMidi-Pyrenees -0.0648700 0.1144030 -0.567 0.57069
## RegionNord-Pas-de-Calais -0.1151478 0.1088077 -1.058 0.28993
## RegionPays-de-la-Loire 0.0536552 0.1082971 0.495 0.62029
## RegionPicardie 0.1276932 0.1193011 1.070 0.28446
## RegionPoitou-Charentes 0.0158620 0.1119924 0.142 0.88737
## RegionProvence-Alpes-Cotes-D'Azur 0.0424099 0.1066098 0.398 0.69077
## RegionRhone-Alpes 0.1434866 0.1062340 1.351 0.17680
## AreaB 0.0399998 0.0304434 1.314 0.18888
## AreaC 0.0032159 0.0396951 0.081 0.93543
## AreaD 0.0454594 0.0599546 0.758 0.44831
## AreaE 0.0470237 0.0798501 0.589 0.55593
## AreaF -0.0552729 0.1100743 -0.502 0.61557
##
## (Intercept) ***
## VehPower ***
## VehAge ***
## DrivAge ***
## BonusMalus ***
## VehBrandB10
## VehBrandB11
## VehBrandB12 ***
## VehBrandB13
## VehBrandB14
## VehBrandB2
## VehBrandB3
## VehBrandB4 .
## VehBrandB5 *
## VehBrandB6
## VehGasRegular **
## Density **
## RegionAquitaine
## RegionAuvergne *
## RegionBasse-Normandie
## RegionBourgogne
## RegionBretagne
## RegionCentre
## RegionChampagne-Ardenne
## RegionCorse
## RegionFranche-Comte
## RegionHaute-Normandie
## RegionIle-de-France
## RegionLanguedoc-Roussillon
## RegionLimousin .
## RegionMidi-Pyrenees
## RegionNord-Pas-de-Calais
## RegionPays-de-la-Loire
## RegionPicardie
## RegionPoitou-Charentes
## RegionProvence-Alpes-Cotes-D'Azur
## RegionRhone-Alpes
## AreaB
## AreaC
## AreaD
## AreaE
## AreaF
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 156918 on 474608 degrees of freedom
## Residual deviance: 151734 on 474567 degrees of freedom
## AIC: 200294
##
## Number of Fisher Scoring iterations: 6
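Since the Poisson GLM is multiplicative on the response scale, each coefficient acts through exp(). For example, the BonusMalus estimate of 0.0227 means that each additional bonus/malus point multiplies the expected claim frequency by exp(0.0227) ≈ 1.023, i.e. roughly a 2.3% increase, holding the other covariates fixed.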
The dispersion test examines the null hypothesis of equidispersion in a Poisson GLM against the alternative of overdispersion and/or underdispersion; overdispersion corresponds to alpha > 0 and underdispersion to alpha < 0, and with trafo = 1 the alternative variance specification is Var(Y) = (1 + alpha) * mu. In our study the test rejects the null hypothesis with a very small p-value, confirming overdispersion, but the estimated alpha of about 0.24 is fairly close to zero, suggesting that overdispersion may not be a serious concern here.
# Test for dispersion
dispersiontest(poissonglm,trafo=1)
##
## Overdispersion test
##
## data: poissonglm
## z = 3.9191, p-value = 4.444e-05
## alternative hypothesis: true alpha is greater than 0
## sample estimates:
## alpha
## 0.243334
data <- data %>% add_predictions(poissonglm, var = "pred_claim_glm", type = "response" )
#lambda <- .1
colnames(data)[c(3,4,5,6,7,8,9,10,11,12)]
## [1] "Exposure" "VehPower" "VehAge" "DrivAge" "BonusMalus"
## [6] "VehBrand" "VehGas" "Area" "Density" "Region"
lasso.data <- data
#the data.matrix changes factors into numeric matrrix
#transform <- data[, -which(colnames(data) == "VehBrand")]
#transform$VehBrand <- as.numeric(data$VehBrand)
# 10 fold cross validation
# cv.glmnet expects a matrix of predictors, not a data frame.
set.seed(0)
glm.ridge <- cv.glmnet(x=data.matrix(lasso.data[(lasso.data$data=="train"), c(3,4,5,6,7,8,9,10,11,12)]),
y=data.matrix(lasso.data[(lasso.data$data=="train"), 2]),family="poisson", offset =data.matrix(log(lasso.data[(lasso.data$data=="train"), ]['Exposure'])) ,alpha = 0, intercept=TRUE, type.measure = "mse")
glm.lasso <- cv.glmnet(x=data.matrix(lasso.data[(lasso.data$data=="train"), c(3,4,5,6,7,8,9,10,11,12)]),
y=data.matrix(lasso.data[(lasso.data$data=="train"), 2]),family="poisson",offset = data.matrix(log(lasso.data[(lasso.data$data=="train"), ]['Exposure'])), alpha = 1, intercept=TRUE, type.measure = "mse")
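As flagged in the comment above, a minimal sketch of the dummy-coded alternative (illustration only; not used in the fits below):

# model.matrix() expands factors into 0/1 dummy columns; drop the intercept column
x.train <- model.matrix(~ VehPower + VehAge + DrivAge + BonusMalus +
                          VehBrand + VehGas + Density + Region + Area,
                        data = lasso.data[lasso.data$data == "train", ])[, -1]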
par(mfrow = c(1, 2))
plot(glm.ridge, main = "Ridge"); plot(glm.lasso, main = "Lasso")
glm.ridge$lambda.min; coef(glm.ridge, s = "lambda.min")
## [1] 0.009804138
## 11 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -2.9270950727
## Exposure -1.0400993812
## VehPower 0.0061349023
## VehAge -0.0263678397
## DrivAge 0.0060848768
## BonusMalus 0.0169722817
## VehBrand -0.0010265539
## VehGas 0.0502432492
## Area 0.0169615264
## Density 0.0194397589
## Region -0.0009442549
glm.lasso$lambda.min; coef(glm.lasso, s = "lambda.min")
## [1] 0.001635429
## 11 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -2.696642397
## Exposure -1.193913018
## VehPower .
## VehAge -0.024586132
## DrivAge 0.006071144
## BonusMalus 0.017390359
## VehBrand .
## VehGas 0.004379296
## Area .
## Density 0.016603390
## Region .
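Note how the two penalties behave differently at lambda.min: ridge keeps every coefficient nonzero, while the lasso shrinks VehPower, VehBrand, Area and Region exactly to zero.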
set.seed(0)
boost.train=gbm(ClaimNb ~ offset(log(Exposure)) + VehPower + VehAge + DrivAge + BonusMalus
+ VehBrand + VehGas + Density + Region + Area , data = data[data$data=="train",],
distribution="poisson",n.trees=3000,shrinkage=0.01,interaction.depth=20)
summary(boost.train)
## var rel.inf
## BonusMalus BonusMalus 17.014808
## Region Region 15.372979
## VehAge VehAge 14.459134
## DrivAge DrivAge 13.862481
## VehBrand VehBrand 12.328304
## VehPower VehPower 10.782009
## Density Density 8.396521
## VehGas VehGas 4.728894
## Area Area 3.054871
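The 3,000 trees and interaction depth of 20 above are fixed by hand; a minimal sketch of letting cross-validation choose the number of trees instead (this refits the model, so it is slow):

set.seed(0)
boost.cv <- gbm(ClaimNb ~ offset(log(Exposure)) + VehPower + VehAge + DrivAge + BonusMalus
                + VehBrand + VehGas + Density + Region + Area,
                data = data[data$data == "train", ], distribution = "poisson",
                n.trees = 3000, shrinkage = 0.01, interaction.depth = 20, cv.folds = 5)
best.iter <- gbm.perf(boost.cv, method = "cv")  # CV-optimal number of trees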
pred.glm <- predict(poissonglm, data[(data$data == "test"),c(3,4,5,6,7,8,9,10,11,12)], type = "response", newoffset=log(Exposure))
#pred.glm <- round(pred.glm)
pred.glm.ridge <- predict(glm.ridge, newx = data.matrix(data[(data$data == "test"),c(3,4,5,6,7,8,9,10,11,12)]), type = "response", s = "lambda.min", newoffset = data.matrix(log(lasso.data[(lasso.data$data=="test"), ]['Exposure'])))
#pred.glm.ridge <- as.integer(ifelse(pred.glm.ridge==1, 1, 0))
pred.glm.lasso <- predict(glm.lasso, data.matrix(data[(data$data == "test"),c(3,4,5,6,7,8,9,10,11,12)]), type = "response", s = "lambda.min", newoffset = data.matrix(log(lasso.data[(lasso.data$data=="test"), ]['Exposure'])))
pred.boost <- predict(boost.train, data[(data$data == "test"),c(3,4,5,6,7,8,9,10,11,12)], type = "response", n.trees=3000)
## Warning in predict.gbm(boost.train, data[(data$data == "test"), c(3, 4, :
## predict.gbm does not add the offset to the predicted values.
pred.boost <- pred.boost* data[(data$data == "test"),]$Exposure
#pred.boost <- round(pred.boost)
#val.err.glm <- sum(pred.glm != data$ClaimNb[(data$data == "test")]) / test.n
#val.err.boost <- sum( pred.boost != data$ClaimNb[(data$data == "test")]) / test.n
val.err.glm <- sum( abs(pred.glm - data$ClaimNb[(data$data == "test")]) ) / test.n
val.err.ridge <- sum( abs(pred.glm.ridge - data$ClaimNb[(data$data == "test")]) ) / test.n
val.err.lasso <- sum( abs(pred.glm.lasso - data$ClaimNb[(data$data == "test")]) ) / test.n
val.err.boost <- sum( abs(pred.boost - data$ClaimNb[(data$data == "test")]) ) / test.n
val.err <- tibble(val.err.glm = val.err.glm, val.err.ridge = val.err.ridge,
                  val.err.lasso = val.err.lasso, val.err.boost = val.err.boost)
val.err <- val.err %>%
  pivot_longer(cols = c("val.err.glm", "val.err.ridge", "val.err.lasso", "val.err.boost")) %>%
  mutate(name = factor(name,
                       levels = c("val.err.glm", "val.err.ridge", "val.err.lasso", "val.err.boost"),
                       labels = c("glm", "ridge", "lasso", "gbm"), ordered = FALSE))
val.err %>% ggplot(mapping=aes(x=name, y=value)) +
geom_col(aes(fill = name))+
labs(title = "Models and Prediction Mean Absolute Error (MAE)",
x = "Models",
y = "MAE",
fill="Models")
## The claim number prediction MAE for test set with Poisson GLM is 0.09905573
## The claim number prediction MAE for test set with Poisson Ridge GLM is 0.09988506
## The claim number prediction MAE for test set with Poisson Lasso GLM is 0.09996999
## The claim number prediction MAE for test set with Gradient Boosting Model is 0.09630762
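The gradient boosting model achieves the lowest test MAE (0.0963), modestly better than the plain Poisson GLM (0.0991); the ridge and lasso fits trail slightly behind, likely because the integer-coded factors lose information.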
freq_graph<- data[(data$data == "test"),]
freq_graph$pred.glm = pred.glm
freq_graph$pred.glm.ridge = pred.glm.ridge[,"1"]
freq_graph$pred.glm.lasso = pred.glm.lasso[,"1"]
freq_graph$pred.boost = pred.boost
freq_graph <- freq_graph %>% mutate(pred.freq.glm = pred.glm/ Exposure,
pred.freq.glm.ridge = pred.glm.ridge / Exposure,
pred.freq.glm.lasso = pred.glm.lasso / Exposure,
pred.freq.boost = pred.boost / Exposure)
no_per_group <- test.n %/% 10 + 1  # test policies per decile bucket
no_per_group
## [1] 20341
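The chunk that builds the decile comparison is not echoed; a minimal sketch of what it presumably does for each model (the bucket logic is an assumption):

# sort test policies by predicted frequency, cut them into buckets of
# no_per_group, and compare average actual vs. predicted frequency per bucket
decile.glm <- freq_graph %>%
  arrange(pred.freq.glm) %>%
  mutate(bucket = (row_number() - 1) %/% no_per_group) %>%
  group_by(bucket) %>%
  summarize(actualFreq = mean(Freq), predFreq = mean(pred.freq.glm))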
The actual claim numbers in the test set are drawn as the black curves below.