Find the summary slides for this project here.


1. Data Exploration

1.1 CASdatasets

The CASdatasets package, which contains a large variety of actuarial datasets, is maintained by Christophe Dutang; the freMTPL2freq data are also mirrored on OpenML at https://www.openml.org/d/41214.

The freMTPL2freq dataset I am going to study consists of risk features collected from motor third-party liability policies in France. The dataset contains 678,013 policies and 12 variables. A description of each variable, including its key values where relevant, is given below.

library("CASdatasets")
data(freMTPL2freq)
data <- as_tibble(freMTPL2freq)
Variable Name   Description                                                  Key
IDpol           Policy ID (links with the claims dataset)
ClaimNb         Number of claims during the exposure period
Exposure        Period of exposure (in years)
VehPower        Power of the car
VehAge          Vehicle age (in years)
DrivAge         Driver age (in years)
BonusMalus      Bonus/malus level, between 50 and 350                        <100: bonus; >100: malus in France
VehBrand        Car brand                                                    Unknown categories
VehGas          Car fuel                                                     Diesel or Regular
Area            Density class of the city where the car driver lives        "A" for rural to "F" for urban centre
Density         Density of inhabitants of the city where the driver lives   Number of inhabitants per square kilometre
Region          Policy region in France

str(data)
## tibble [678,013 × 12] (S3: tbl_df/tbl/data.frame)
##  $ IDpol     : num [1:678013] 1 3 5 10 11 13 15 17 18 21 ...
##  $ ClaimNb   : 'table' num [1:678013(1d)] 1 1 1 1 1 1 1 1 1 1 ...
##  $ Exposure  : num [1:678013] 0.1 0.77 0.75 0.09 0.84 0.52 0.45 0.27 0.71 0.15 ...
##  $ VehPower  : int [1:678013] 5 5 6 7 7 6 6 7 7 7 ...
##  $ VehAge    : int [1:678013] 0 0 2 0 0 2 2 0 0 0 ...
##  $ DrivAge   : int [1:678013] 55 55 52 46 46 38 38 33 33 41 ...
##  $ BonusMalus: int [1:678013] 50 50 50 50 50 50 50 68 68 50 ...
##  $ VehBrand  : Factor w/ 11 levels "B1","B10","B11",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ VehGas    : chr [1:678013] "Regular" "Regular" "Diesel" "Diesel" ...
##  $ Area      : Factor w/ 6 levels "A","B","C","D",..: 4 4 2 2 2 5 5 3 3 2 ...
##  $ Density   : int [1:678013] 1217 1217 54 76 76 3003 3003 137 137 60 ...
##  $ Region    : Factor w/ 21 levels "Alsace","Aquitaine",..: 21 21 18 2 2 16 16 13 13 17 ...
# Change VehGas to a factor and ClaimNb (stored as a 1-d table) to a double
data <- data %>%
  mutate(VehGas = factor(VehGas, levels = c("Diesel", "Regular")),
         ClaimNb = as.double(ClaimNb))
data
## # A tibble: 678,013 x 12
##    IDpol ClaimNb Exposure VehPower VehAge DrivAge BonusMalus VehBrand
##    <dbl>   <dbl>    <dbl>    <int>  <int>   <int>      <int> <fct>   
##  1     1       1     0.1         5      0      55         50 B12     
##  2     3       1     0.77        5      0      55         50 B12     
##  3     5       1     0.75        6      2      52         50 B12     
##  4    10       1     0.09        7      0      46         50 B12     
##  5    11       1     0.84        7      0      46         50 B12     
##  6    13       1     0.52        6      2      38         50 B12     
##  7    15       1     0.45        6      2      38         50 B12     
##  8    17       1     0.27        7      0      33         68 B12     
##  9    18       1     0.71        7      0      33         68 B12     
## 10    21       1     0.15        7      0      41         50 B12     
## # … with 678,003 more rows, and 4 more variables: VehGas <fct>,
## #   Area <fct>, Density <int>, Region <fct>



1.2 Data Preprocessing

# Identify rows that are identical except for IDpol, ClaimNb and Exposure
to_combine <- data %>%
  distinct_at(vars(-c(IDpol, ClaimNb, Exposure))) %>%
  mutate(group_id = row_number())

# Attach the group id, add the claim frequency, and log-transform Density
data <- data %>%
  left_join(to_combine) %>%
  mutate(Freq = as.double(ClaimNb / Exposure),
         Density = log(Density))
## Joining, by = c("VehPower", "VehAge", "DrivAge", "BonusMalus", "VehBrand", "VehGas", "Area", "Density", "Region")


The exposure is the duration of the insurance coverage of a given policy. The claim frequency is the number of claims divided by the exposure, i.e. the claim count per unit of exposure. Among the 678,013 policies, 5.02% (34,060 policies) notified at least one claim.
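Since frequency is a rate, the portfolio-level claim frequency should weight policies by exposure rather than average the per-policy Freq values. A quick check along those lines (a sketch, not part of the original analysis):

# Exposure-weighted portfolio claim frequency: total claims per year of exposure
sum(data$ClaimNb) / sum(data$Exposure)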

claimpercent <- data %>% group_by(ClaimNb) %>% summarize(count = n()) %>% mutate(percent = count / sum(count) * 100)
## `summarise()` ungrouping output (override with `.groups` argument)
claimpercent
## # A tibble: 11 x 3
##    ClaimNb  count   percent
##      <dbl>  <int>     <dbl>
##  1       0 643953 95.0     
##  2       1  32178  4.75    
##  3       2   1784  0.263   
##  4       3     82  0.0121  
##  5       4      7  0.00103 
##  6       5      2  0.000295
##  7       6      1  0.000147
##  8       8      1  0.000147
##  9       9      1  0.000147
## 10      11      3  0.000442
## 11      16      1  0.000147
# Collapse the claim-count table into a binary "any claim" indicator
total <- claimpercent %>%
  group_by(ClaimNb = factor(as.integer(ClaimNb > 0))) %>%
  summarize(total_percent = sum(percent))
total
## # A tibble: 2 x 2
##   ClaimNb total_percent
##   <fct>           <dbl>
## 1 0               95.0 
## 2 1                5.02
ggplot(total, aes(x = factor(ClaimNb, labels = c("No", "Yes")), y = total_percent,
                  fill = factor(ClaimNb, labels = c("No", "Yes")))) +
  geom_col() +
  labs(title = "Claim Occurrence Indicator (%)",
       x = "Claim Occurrence",
       y = "Percentage of Policies",
       fill = "Claim Occurrence")


Potential problems and strategies:

  1. The Poisson distribution assumes the mean equals the variance.
     Run an overdispersion test and switch to the negative binomial distribution if the test confirms overdispersion.

  2. More zeros than a Poisson regression expects.
     Consider a zero-inflated model that adds a logit component for the excess zeros; see the sketch after this list.

  3. Observations with different exposure periods are not directly comparable.
     Add log(Exposure) as an offset so the model scales with each policy's exposure.
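As an illustration of strategy 2, here is a minimal zero-inflated Poisson sketch, assuming the pscl package is installed; the predictors chosen for each component are illustrative only, and this model is not used in the later comparison:

library("pscl")  # assumed available; provides zeroinfl()

# Count component before "|", logit component for the excess zeros after it;
# the exposure offset applies to the count component.
zip <- zeroinfl(ClaimNb ~ DrivAge + BonusMalus + VehAge + offset(log(Exposure)) | BonusMalus,
                data = data, dist = "poisson")
summary(zip)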



2. Data Visualization

We now look at some raw summaries of the variables in the dataset. Before visualizing the variables, we first check whether there are any missing values in the data. Fortunately, there are none.

sapply(data, function(x) {sum(is.na(x))})
##      IDpol    ClaimNb   Exposure   VehPower     VehAge    DrivAge 
##          0          0          0          0          0          0 
## BonusMalus   VehBrand     VehGas       Area    Density     Region 
##          0          0          0          0          0          0 
##   group_id       Freq 
##          0          0



2.1 Did driver age influence frequency?

The Freq variable added above is defined as the number of claims per unit of exposure. We graph the average frequency at each driver age. The minimum average frequency occurs at age 92 and the maximum at age 94, though both extremes likely reflect the very small number of policies at those ages.

p2 <- data %>% group_by(DrivAge) %>% summarize(avgFreq = mean(Freq))
## `summarise()` ungrouping output (override with `.groups` argument)
head(p2,10)
## # A tibble: 10 x 2
##    DrivAge avgFreq
##      <int>   <dbl>
##  1      18   0.701
##  2      19   0.468
##  3      20   0.490
##  4      21   0.442
##  5      22   0.419
##  6      23   0.610
##  7      24   0.277
##  8      25   0.259
##  9      26   0.403
## 10      27   0.298
p2$DrivAge[which.min(p2$avgFreq)]
## [1] 92
p2$DrivAge[which.max(p2$avgFreq)]
## [1] 94
p2 %>% ggplot(mapping = aes(x = DrivAge, y = avgFreq)) +
  geom_area(fill = "lightblue", color = "black") +
  labs(title = "Average Frequency at Each Age",
       x = "Driver Age",
       y = "Average Frequency")



2.2 Did vehicle brand and age influence frequency?

# Bucket vehicle age into decades: floor(VehAge / 10) yields the 11 groups labelled below
p3 <- data %>%
  mutate(vehagegroup = factor(floor(VehAge / 10),
                              labels = c("0-9","10-19","20-29","30-39","40-49",
                                         "50-59","60-69","70-79","80-89","90-99","100"))) %>%
  group_by(VehBrand, vehagegroup) %>%
  summarize(avgFreq = mean(Freq))
## `summarise()` regrouping output by 'VehBrand' (override with `.groups` argument)
plot3 <- p3 %>%
  ggplot(mapping = aes(x = vehagegroup, y = avgFreq, fill = VehBrand)) +
  geom_col(position = 'dodge') +
  facet_wrap(VehBrand ~ .) +
  labs(title = "Vehicle brand and age on frequency",
       x = "Vehicle Age (in years)",
       y = "Average Frequency") +
  coord_flip()
plot3



2.3 What is the relationship between area and bonus-malus?

The Area variable represents the density class of the city where the car driver lives. The value ranges from "A" for rural areas to "F" for the urban centre.

p4 <- data
levels(p4$Area)[levels(p4$Area)=="A"] <- "A : Rural"
levels(p4$Area)[levels(p4$Area)=="F"] <- "F : Urban"
plot4 <- p4 %>%
  ggplot(mapping = aes(x = Area, y = BonusMalus)) +
  geom_boxplot() +
  labs(title = "Area on BonusMalus",
       x = "Area",
       y = "BonusMalus (log scale)") +
  scale_y_log10() +
  coord_flip()
plot4



3. Model and Prediction

I randomly set aside 30 percent of the data as a test set for comparing the fitted models.

set.seed(0)
test.n <- round(dim(data)[1] * 0.3)
n1 <- dim(data)[1] - test.n # training set size
test <- sample(1:dim(data)[1], test.n)
data$data <- "train"
data$data[test] <- "test"
cat("The training set size is", n1)
## The training set size is 474609
cat("The test set size is", test.n)
## The test set size is 203404



3.1 Classics: Generalized Linear Models - Poisson

poissonglm<- glm(ClaimNb ~  VehPower + VehAge + DrivAge + BonusMalus
              + VehBrand + VehGas + Density + Region + Area, data=data[(data$data=="train"),], family = "poisson", offset=log(Exposure))
summary(poissonglm)
## 
## Call:
## glm(formula = ClaimNb ~ VehPower + VehAge + DrivAge + BonusMalus + 
##     VehBrand + VehGas + Density + Region + Area, family = "poisson", 
##     data = data[(data$data == "train"), ], offset = log(Exposure))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1059  -0.3860  -0.2953  -0.1613  13.3922  
## 
## Coefficients:
##                                     Estimate Std. Error z value Pr(>|z|)
## (Intercept)                       -4.1735648  0.1254057 -33.280  < 2e-16
## VehPower                           0.0141975  0.0033018   4.300 1.71e-05
## VehAge                            -0.0380864  0.0013949 -27.305  < 2e-16
## DrivAge                            0.0062648  0.0004707  13.310  < 2e-16
## BonusMalus                         0.0227045  0.0003639  62.396  < 2e-16
## VehBrandB10                       -0.0073934  0.0439954  -0.168  0.86654
## VehBrandB11                        0.0433983  0.0478811   0.906  0.36474
## VehBrandB12                        0.1712531  0.0209126   8.189 2.63e-16
## VehBrandB13                        0.0140278  0.0491767   0.285  0.77545
## VehBrandB14                       -0.1167775  0.0930572  -1.255  0.20951
## VehBrandB2                        -0.0004928  0.0182293  -0.027  0.97843
## VehBrandB3                        -0.0041910  0.0262130  -0.160  0.87297
## VehBrandB4                        -0.0655094  0.0358522  -1.827  0.06767
## VehBrandB5                         0.0703785  0.0295369   2.383  0.01718
## VehBrandB6                        -0.0546279  0.0341846  -1.598  0.11004
## VehGasRegular                      0.0383847  0.0130754   2.936  0.00333
## Density                            0.0399956  0.0150920   2.650  0.00805
## RegionAquitaine                   -0.0517645  0.1099885  -0.471  0.63790
## RegionAuvergne                    -0.2826195  0.1375760  -2.054  0.03995
## RegionBasse-Normandie              0.0304173  0.1155773   0.263  0.79241
## RegionBourgogne                    0.0313282  0.1178533   0.266  0.79037
## RegionBretagne                     0.1567049  0.1075443   1.457  0.14508
## RegionCentre                       0.1273406  0.1060485   1.201  0.22984
## RegionChampagne-Ardenne            0.0928946  0.1455315   0.638  0.52327
## RegionCorse                        0.1706513  0.1304968   1.308  0.19097
## RegionFranche-Comte               -0.0266387  0.1898815  -0.140  0.88843
## RegionHaute-Normandie              0.0408281  0.1245511   0.328  0.74306
## RegionIle-de-France                0.0411092  0.1073911   0.383  0.70187
## RegionLanguedoc-Roussillon         0.0383289  0.1093758   0.350  0.72601
## RegionLimousin                     0.2373672  0.1292935   1.836  0.06638
## RegionMidi-Pyrenees               -0.0648700  0.1144030  -0.567  0.57069
## RegionNord-Pas-de-Calais          -0.1151478  0.1088077  -1.058  0.28993
## RegionPays-de-la-Loire             0.0536552  0.1082971   0.495  0.62029
## RegionPicardie                     0.1276932  0.1193011   1.070  0.28446
## RegionPoitou-Charentes             0.0158620  0.1119924   0.142  0.88737
## RegionProvence-Alpes-Cotes-D'Azur  0.0424099  0.1066098   0.398  0.69077
## RegionRhone-Alpes                  0.1434866  0.1062340   1.351  0.17680
## AreaB                              0.0399998  0.0304434   1.314  0.18888
## AreaC                              0.0032159  0.0396951   0.081  0.93543
## AreaD                              0.0454594  0.0599546   0.758  0.44831
## AreaE                              0.0470237  0.0798501   0.589  0.55593
## AreaF                             -0.0552729  0.1100743  -0.502  0.61557
##                                      
## (Intercept)                       ***
## VehPower                          ***
## VehAge                            ***
## DrivAge                           ***
## BonusMalus                        ***
## VehBrandB10                          
## VehBrandB11                          
## VehBrandB12                       ***
## VehBrandB13                          
## VehBrandB14                          
## VehBrandB2                           
## VehBrandB3                           
## VehBrandB4                        .  
## VehBrandB5                        *  
## VehBrandB6                           
## VehGasRegular                     ** 
## Density                           ** 
## RegionAquitaine                      
## RegionAuvergne                    *  
## RegionBasse-Normandie                
## RegionBourgogne                      
## RegionBretagne                       
## RegionCentre                         
## RegionChampagne-Ardenne              
## RegionCorse                          
## RegionFranche-Comte                  
## RegionHaute-Normandie                
## RegionIle-de-France                  
## RegionLanguedoc-Roussillon           
## RegionLimousin                    .  
## RegionMidi-Pyrenees                  
## RegionNord-Pas-de-Calais             
## RegionPays-de-la-Loire               
## RegionPicardie                       
## RegionPoitou-Charentes               
## RegionProvence-Alpes-Cotes-D'Azur    
## RegionRhone-Alpes                    
## AreaB                                
## AreaC                                
## AreaD                                
## AreaE                                
## AreaF                                
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 156918  on 474608  degrees of freedom
## Residual deviance: 151734  on 474567  degrees of freedom
## AIC: 200294
## 
## Number of Fisher Scoring iterations: 6



The dispersion test evaluates the null hypothesis of equidispersion in Poisson GLMs against the alternative of overdispersion and/or underdispersion; overdispersion corresponds to alpha > 0 and underdispersion to alpha < 0. In our study the test rejects the null hypothesis with a very small p-value, confirming overdispersion, but the estimated alpha of about 0.24 is close to zero, suggesting that overdispersion may not be a serious concern here.

# Test for dispersion
dispersiontest(poissonglm,trafo=1)
## 
##  Overdispersion test
## 
## data:  poissonglm
## z = 3.9191, p-value = 4.444e-05
## alternative hypothesis: true alpha is greater than 0
## sample estimates:
##    alpha 
## 0.243334
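
Had the overdispersion been judged material, strategy 1 above calls for a negative binomial model. A minimal sketch, assuming the MASS package is available; this refit is illustrative and is not used in the later comparison:

library("MASS")  # assumed available; provides glm.nb(); note MASS masks dplyr::select()

# Same linear predictor as the Poisson GLM, with the exposure offset moved
# into the formula; glm.nb() additionally estimates the dispersion parameter theta.
nbglm <- glm.nb(ClaimNb ~ VehPower + VehAge + DrivAge + BonusMalus + VehBrand +
                  VehGas + Density + Region + Area + offset(log(Exposure)),
                data = data[data$data == "train", ])
summary(nbglm)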

data <- data %>% add_predictions(poissonglm, var = "pred_claim_glm", type = "response" )



3.2 Poisson Lasso and Ridge Regression

colnames(data)[3:12]  # the predictor columns fed to glmnet below
##  [1] "Exposure"   "VehPower"   "VehAge"     "DrivAge"    "BonusMalus"
##  [6] "VehBrand"   "VehGas"     "Area"       "Density"    "Region"
lasso.data <- data
# data.matrix() changes factors into their integer codes; note that column 3
# (Exposure) also enters as a predictor here, alongside the offset.

# 10-fold cross-validation; cv.glmnet expects a matrix of predictors, not a data frame.
set.seed(0)
glm.ridge <- cv.glmnet(x = data.matrix(lasso.data[lasso.data$data == "train", 3:12]),
                       y = data.matrix(lasso.data[lasso.data$data == "train", 2]),
                       family = "poisson",
                       offset = data.matrix(log(lasso.data[lasso.data$data == "train", ]['Exposure'])),
                       alpha = 0, intercept = TRUE, type.measure = "mse")

glm.lasso <- cv.glmnet(x = data.matrix(lasso.data[lasso.data$data == "train", 3:12]),
                       y = data.matrix(lasso.data[lasso.data$data == "train", 2]),
                       family = "poisson",
                       offset = data.matrix(log(lasso.data[lasso.data$data == "train", ]['Exposure'])),
                       alpha = 1, intercept = TRUE, type.measure = "mse")
par(mfrow = c(1, 2))
plot(glm.ridge, main = "Ridge"); plot(glm.lasso, main = "Lasso")

glm.ridge$lambda.min; coef(glm.ridge, s = "lambda.min")
## [1] 0.009804138
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept) -2.9270950727
## Exposure    -1.0400993812
## VehPower     0.0061349023
## VehAge      -0.0263678397
## DrivAge      0.0060848768
## BonusMalus   0.0169722817
## VehBrand    -0.0010265539
## VehGas       0.0502432492
## Area         0.0169615264
## Density      0.0194397589
## Region      -0.0009442549
glm.lasso$lambda.min; coef(glm.lasso, s = "lambda.min")
## [1] 0.001635429
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                        1
## (Intercept) -2.696642397
## Exposure    -1.193913018
## VehPower     .          
## VehAge      -0.024586132
## DrivAge      0.006071144
## BonusMalus   0.017390359
## VehBrand     .          
## VehGas       0.004379296
## Area         .          
## Density      0.016603390
## Region       .
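
Because data.matrix() encodes each factor as a single integer column, the penalized models above treat VehBrand, Area and Region as if their levels were ordered scores, which is why their coefficients are hard to interpret. Dummy encoding via model.matrix() is the more usual way to feed glmnet categorical predictors; a sketch (not the fit reported above):

# One-hot design matrix; [, -1] drops the intercept column so that each
# factor level gets its own coefficient, as in the Poisson GLM of section 3.1.
x.train <- model.matrix(~ VehPower + VehAge + DrivAge + BonusMalus + VehBrand +
                          VehGas + Density + Region + Area,
                        data = lasso.data[lasso.data$data == "train", ])[, -1]
glm.lasso.dummy <- cv.glmnet(x = x.train,
                             y = lasso.data$ClaimNb[lasso.data$data == "train"],
                             family = "poisson",
                             offset = log(lasso.data$Exposure[lasso.data$data == "train"]),
                             alpha = 1)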



3.3 Gradient Boosting Model

set.seed(0)
# interaction.depth = 20 grows very deep trees, so this fit takes a while;
# the exposure offset enters through the formula.
boost.train <- gbm(ClaimNb ~ offset(log(Exposure)) + VehPower + VehAge + DrivAge + BonusMalus +
                     VehBrand + VehGas + Density + Region + Area,
                   data = data[data$data == "train", ],
                   distribution = "poisson", n.trees = 3000, shrinkage = 0.01,
                   interaction.depth = 20)
summary(boost.train)

##                   var   rel.inf
## BonusMalus BonusMalus 17.014808
## Region         Region 15.372979
## VehAge         VehAge 14.459134
## DrivAge       DrivAge 13.862481
## VehBrand     VehBrand 12.328304
## VehPower     VehPower 10.782009
## Density       Density  8.396521
## VehGas         VehGas  4.728894
## Area             Area  3.054871



4. Final Validation

# predict.glm ignores an offset supplied via glm()'s offset argument when newdata
# is given, and it has no `newoffset` argument (that belongs to glmnet's predict),
# so the response prediction is a rate per unit of exposure; scale it by the
# test-set Exposure to obtain expected claim counts.
pred.glm <- predict(poissonglm, data[(data$data == "test"), ], type = "response") *
  data$Exposure[data$data == "test"]

pred.glm.ridge <- predict(glm.ridge, newx = data.matrix(data[(data$data == "test"), 3:12]),
                          type = "response", s = "lambda.min",
                          newoffset = data.matrix(log(lasso.data[lasso.data$data == "test", ]['Exposure'])))
pred.glm.lasso <- predict(glm.lasso, data.matrix(data[(data$data == "test"), 3:12]),
                          type = "response", s = "lambda.min",
                          newoffset = data.matrix(log(lasso.data[lasso.data$data == "test", ]['Exposure'])))




pred.boost <- predict(boost.train, data[(data$data == "test"),c(3,4,5,6,7,8,9,10,11,12)], type = "response", n.trees=3000) 
## Warning in predict.gbm(boost.train, data[(data$data == "test"), c(3, 4, :
## predict.gbm does not add the offset to the predicted values.
# predict.gbm does not apply the offset, so scale by the test-set exposure manually
pred.boost <- pred.boost * data[(data$data == "test"), ]$Exposure



val.err.glm <- sum( abs(pred.glm - data$ClaimNb[(data$data == "test")]) ) / test.n

val.err.ridge <- sum( abs(pred.glm.ridge - data$ClaimNb[(data$data == "test")]) ) / test.n
val.err.lasso <- sum( abs(pred.glm.lasso - data$ClaimNb[(data$data == "test")]) ) / test.n

val.err.boost <- sum( abs(pred.boost - data$ClaimNb[(data$data == "test")]) ) / test.n


val.err <- tibble(val.err.glm = val.err.glm, val.err.ridge = val.err.ridge,
                  val.err.lasso = val.err.lasso, val.err.boost = val.err.boost)

val.err <- val.err %>%
  pivot_longer(cols = c("val.err.glm", "val.err.ridge", "val.err.lasso", "val.err.boost")) %>%
  mutate(name = factor(name,
                       levels = c("val.err.glm", "val.err.ridge", "val.err.lasso", "val.err.boost"),
                       labels = c("glm", "ridge", "lasso", "gbm")))

val.err %>% ggplot(mapping=aes(x=name, y=value)) + 
  geom_col(aes(fill = name))+
  labs(title = "Models and Prediction Mean Absolute Error (MAE)",
        x = "Models",
        y = "MAE",
        fill="Models")

## The claim number prediction MAE for test set with Poisson GLM is 0.09905573
## The claim number prediction MAE for test set with Poisson Ridge GLM is 0.09988506
## The claim number prediction MAE for test set with Poisson Lasso GLM is 0.09996999
## The claim number prediction MAE for test set with Gradient Boosting Model is 0.09630762
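
MAE weights every policy equally and is dominated by the mass of zero-claim policies; for count models the Poisson deviance is the more conventional comparison. A minimal sketch, assuming the prediction vectors above are aligned with the test rows:

# Mean Poisson deviance: 2 * mean(y*log(y/mu) - (y - mu)), taking y*log(y/mu) = 0 when y = 0
poisson_deviance <- function(y, mu) {
  2 * mean(ifelse(y == 0, 0, y * log(y / mu)) - (y - mu))
}
obs <- data$ClaimNb[data$data == "test"]
poisson_deviance(obs, pred.glm)
poisson_deviance(obs, pred.boost)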

freq_graph<- data[(data$data == "test"),]
freq_graph$pred.glm = pred.glm
freq_graph$pred.glm.ridge =  pred.glm.ridge[,"1"]
freq_graph$pred.glm.lasso = pred.glm.lasso[,"1"]
freq_graph$pred.boost = pred.boost

freq_graph <- freq_graph %>% mutate(pred.freq.glm  = pred.glm/ Exposure, 
                      pred.freq.glm.ridge = pred.glm.ridge / Exposure, 
                      pred.freq.glm.lasso = pred.glm.lasso / Exposure, 
                      pred.freq.boost = pred.boost / Exposure) 

no_per_group = test.n %/% 10 + 1

no_per_group
## [1] 20341
## `summarise()` ungrouping output (override with `.groups` argument)
## `summarise()` ungrouping output (override with `.groups` argument)
## `summarise()` ungrouping output (override with `.groups` argument)
## `summarise()` ungrouping output (override with `.groups` argument)
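
The four `summarise()` messages above come from a hidden chunk that groups the test set for the comparison plotted next. A sketch of what one such grouping might look like for the GLM predictions; ordering by predicted frequency is an assumption, since the original grouping variable is not shown:

# Sort test policies by predicted frequency, cut into groups of no_per_group,
# then compare the average observed and predicted claim numbers per group.
decile_glm <- freq_graph %>%
  arrange(pred.freq.glm) %>%
  mutate(group = (row_number() - 1) %/% no_per_group + 1) %>%
  group_by(group) %>%
  summarize(mean_obs = mean(ClaimNb), mean_pred = mean(pred.glm))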


Evaluation of the Predicted Number of Claims in the Test Set

The actual claim numbers in the test set are drawn as the black lines below.