Analysis of HIV Drug Resistance Data

The scientific goal is to determine which mutations of the Human Immunodeficiency Virus Type 1 (HIV-1) are associated with drug resistance. The data set, publicly available from the Stanford HIV Drug Resistance Database http://hivdb.stanford.edu/pages/published_analysis/genophenoPNAS2006/, was originally analyzed in (Rhee et al. 2006).

Preparing the data

The data set consists of measurements for three classes of drugs: protease inhibitors (PIs), nucleoside reverse transcriptase (RT) inhibitors (NRTIs), and nonnucleoside RT inhibitors (NNRTIs). Protease and reverse transcriptase are two enzymes in HIV-1 that are crucial to the function of the virus. This data set seeks associations between mutations in the HIV-1 protease and drug resistance to different PI type drugs, and between mutations in the HIV-1 reverse transcriptase and drug resistance to different NRTI and NNRTI type drugs (The raw data are saved as gene_df).

In order to evaluate our results, we compare with the treatment-selected mutation panels created by (Rhee et al. 2005), which can be viewed as the ground true. These panels give lists of HIV-1 mutations appearing more frequently in patients who have previously been treated with PI, NRTI, or NNRTI type drugs, than in patients with no previous exposure to that drug type. Increased frequency of a mutation among patients treated with a certain drug type implies that the mutation confers resistance to that drug type (The raw data are saved as tsm_df).

To simplify the analysis, in this project we will confine our attention to the PI drugs.

drug_class = 'PI' # Possible drug types are 'PI', 'NRTI', and 'NNRTI'. 

Fetching and cleaning the data

First, we download the data and read it into data frames.

base_url = 'http://hivdb.stanford.edu/pages/published_analysis/genophenoPNAS2006'
gene_url = paste(base_url, 'DATA', paste0(drug_class, '_DATA.txt'), sep='/')
tsm_url = paste(base_url, 'MUTATIONLISTS', 'NP_TSM', drug_class, sep='/')

gene_df = read.delim(gene_url, na.string = c('NA', ''), stringsAsFactors = FALSE)
tsm_df = read.delim(tsm_url, header = FALSE, stringsAsFactors = FALSE)
names(tsm_df) = c('Position', 'Mutations')

A small sample of the data is shown below.

head(gene_df, n=6)
##   IsolateName PseudoName MedlineID  APV  ATV   IDV   LPV   NFV   RTV   SQV
## 1     CA10676      CA622  10839657  2.3   NA  32.7    NA  23.4  51.6  37.8
## 2     CA37880      CA622  15995959 76.0   NA 131.0 200.0  50.0 200.0 156.0
## 3      CA9984      CA624  11897594  2.8   NA  12.0    NA 100.0  41.0 145.6
## 4     CA17003      CA628  15995959  6.5  9.2   2.1   5.3   5.0  36.0  13.0
## 5     CA10670      CA634  10839657  8.3   NA 100.0    NA 161.1 170.2 100.0
## 6     CA42683      CA634  15995959 82.0 75.0 400.0 400.0  91.0 400.0 400.0
##   P1 P2 P3 P4 P5 P6 P7 P8 P9 P10 P11 P12 P13 P14 P15 P16 P17 P18 P19 P20
## 1  -  -  -  -  -  -  -  -  -   I   -   -   -   -   -   -   -   -   -   -
## 2  -  -  -  -  -  -  -  -  -   F   L   -   -   -   -   A   -   -   I   -
## 3  -  -  -  -  -  -  -  -  -   -   -   -   V   -   -   -   -   -   -   R
## 4  -  -  -  -  -  -  -  -  -   I   -   -   -   -   -   -   -   -   -   -
## 5  -  -  -  -  -  -  -  -  -   I   -   -   -   -   -   -   -   -   -   R
## 6  -  -  -  -  -  -  -  -  -   I   -   -   -   -   -   -   -   -   -   R
##   P21 P22 P23 P24 P25 P26 P27 P28 P29 P30 P31 P32 P33 P34 P35 P36 P37 P38
## 1   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -  DN   -
## 2   -   -   -   -   -   -   -   -   -   -   -   -   F   -   -   -   -   -
## 3   -   -   -   -   -   -   -   -   -   N   -   -  IL   -   D   I   -   -
## 4   -   -   I   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -
## 5   -   -   F   -   -   -   -   -   -   -   -   -   -   -   D   I   D   -
## 6   -   -   -   -   -   -   -   -   -   -   -   -   F   T   D   I   D   -
##   P39 P40 P41 P42 P43 P44 P45 P46 P47 P48 P49 P50 P51 P52 P53 P54 P55 P56
## 1   -   -   K   -   -   -  RK   I   -   -   -   -   -   -   -   -   -   -
## 2   -   -   K   -   -   -   -   I   -   -   -   -   -   -   -   V   -   -
## 3   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   V   -   -
## 4   -   -   -   -   -   -   -   L   -   -   -   -   -   -   -   -   -   -
## 5   -   -   -   -   -   -   -   I   -   V   -   -   -   -   -   T   -   -
## 6   -   -   -   -   T   -   -   I   -   V   -   V   -   -   -   S   -   -
##   P57 P58 P59 P60 P61 P62 P63 P64 P65 P66 P67 P68 P69 P70 P71 P72 P73 P74
## 1   -   -   -   -   -   -   P   -   -   -   -   -   -   -  LV   -   S   -
## 2   -   -   -   -   -   -   P   -   -   -   -   -   -   -   V   -   S   -
## 3   -   -   -   -   -   -   P   -   -   -   -   -   -   -  TA   -   -   -
## 4   -   E   -   E  EK   -   P   -   -   -   -   -   -   -   T   T   -   -
## 5   -   -   -   -   -   V   P   -   -   -   -   -   -   -   V   X   -   -
## 6   -   -   -   -   -   V   P   -   -   -   -   -   -   -   V   V   -   -
##   P75 P76 P77 P78 P79 P80 P81 P82 P83 P84 P85 P86 P87 P88 P89 P90 P91 P92
## 1   -   -   I   -   -   -   -   T   -   V   V   -   -   -   -   M   -   -
## 2   -   -   -   -   -   -   -   T   -   V   -   -   -   -   V   M   S   -
## 3   -   -   -   -   -   -   -   -   -   -   -   -   -   D   -   M   -   -
## 4   -   -   -   -   -   -   -   A   -   V   -   -   -   -   -   M   -   -
## 5   -   -   I   -   -   -   -   A   -   -   -   -   -   -   -   -   -   -
## 6   -   -   I   -   -   -   -   A   -   -   V   -   -   -   -   -   -   -
##   P93 P94 P95 P96 P97 P98 P99
## 1   L   -   -   -   -   -   -
## 2   L   -   -   -   -   -   -
## 3   -   -   -   -   -   -   .
## 4   -   -   -   -   -   -   -
## 5   L   -   -   -   -   -   -
## 6   L   -   -   -   -   -   -
head(tsm_df, n=6)
##   Position Mutations
## 1       10       F R
## 2       11         I
## 3       20     I T V
## 4       23         I
## 5       24         I
## 6       30         N

In tsm_df, the variable Position denotes the position of the mutations that are associated with drug-resistance, while Mutations indicating the mutation type.

The gene data table has some rows with error flags or nonstandard mutation codes. For simplicity, we remove all such rows.

# Returns rows for which every column matches the given regular expression.
grepl_rows <- function(pattern, df) {
  cell_matches = apply(df, c(1,2), function(x) grepl(pattern, x))
  apply(cell_matches, 1, all)
}

pos_start = which(names(gene_df) == 'P1')
pos_cols = seq.int(pos_start, ncol(gene_df))
valid_rows = grepl_rows('^(\\.|-|[A-Zid]+)$', gene_df[,pos_cols])
gene_df = gene_df[valid_rows,]

Preparing the regression matrix

We now construct the design matrix \(X\) and matrix of response vectors \(Y\). The features (columns of \(X\)) are given by mutation/position pairs. Define

$$ X_{i,j} = 1 i j \

Y_{i,k} = i k. $$

For example, in the sample for PI type drugs, three different mutations (A, C, and D) are observed at position 63 in the protease, and so three columns of \(X\) (named P63.A, P63.C, and P63.D) indicate the presence or absence of each mutation at this position.

# Flatten a matrix to a vector with names from concatenating row/column names.
flatten_matrix <- function(M, sep='.') {
  x <- c(M)
  names(x) <- c(outer(rownames(M), colnames(M),
                      function(...) paste(..., sep=sep)))
  x
}

# Construct preliminary design matrix.
muts = c(LETTERS, 'i', 'd')
X = outer(muts, as.matrix(gene_df[,pos_cols]), Vectorize(grepl))
X = aperm(X, c(2,3,1))
dimnames(X)[[3]] <- muts
X = t(apply(X, 1, flatten_matrix))
mode(X) <- 'numeric'
# Remove any mutation/position pairs that never appear in the data.
X = X[,colSums(X) != 0]

# Extract response matrix.
Y = gene_df[,4:(pos_start-1)]

An excerpt of the design matrix is shown below. By construction, every column contains at least one 1, but the matrix is still quite sparse with the relative frequency of 1’s being about 0.025.

datatable(data.frame(X)[1:10, ], options = list(scrollX=T, pageLength = 10))

The response matrix looks like:

head(Y, n=6)
##    APV  ATV   IDV   LPV   NFV   RTV   SQV
## 1  2.3   NA  32.7    NA  23.4  51.6  37.8
## 2 76.0   NA 131.0 200.0  50.0 200.0 156.0
## 3  2.8   NA  12.0    NA 100.0  41.0 145.6
## 4  6.5  9.2   2.1   5.3   5.0  36.0  13.0
## 5  8.3   NA 100.0    NA 161.1 170.2 100.0
## 6 82.0 75.0 400.0 400.0  91.0 400.0 400.0

There are 7 PI-type drugs: APV, ATV, IDV, LPV, NFV, RTV, and SQV.

Selecting drug-resistance-associated mutations

In this step, you need to build an appropriate linear regression model, and use the method we discussed in lecture to select mutations that may associated with drug-resistance. For 7 PI-type drugs, you need to run a seperate analysis for each drug.

Notice that there are some missing values.

Before building the model, we need to perform some final pre-processing steps. We remove rows with missing values (which vary from drug to drug) and we then further reduce the design matrix by removing predictor columns for mutations that do not appear at least three times in the sample. Finally, for identifiability, we remove any columns that are duplicates (i.e. two mutations that appear only in tandem, and therefore we cannot distinguish between their effects on the response).

selection <- function (X, y, alpha) {
  # Remove patients with missing measurements.
  missing = is.na(y)
  y = y[!missing]
  X = X[!missing,]
    
  # Remove predictors that appear less than 3 times.
  X = X[,colSums(X) >= 3]
  
  # Remove duplicate predictors.
  X = X[,colSums(abs(cor(X)-1) < 1e-4) == 1]
  
  # Buid an appropriate linear regression model
  # Select the mutations that may associated with drug-resistance
  model <- lm(y~., data = as_tibble(X))
  summary <- summary(model)$coefficients
  sel <- summary[,"Pr(>|t|)"] < alpha / 2  ##  selection according to alpha
  return1 <- names(sel[sel])
  
  #############best subsets selection
  # ------> data too big to run exhaustive search
  #X1 <-  as_tibble(cbind(y, X))
  #regfit.full <- regsubsets (y~., data = X1, really.big=T)
  #regfit.full.summary <- summary(regfit.full)
  #coe <- summary(regfit.full)[[1]][which.min(regfit.full.summary$cp),]
  #return5 <- names(coe)  
  
  #############forward selection
  min.model <- lm(y~1, data = as_tibble(X))
  full.model <- formula(lm(y~., data = as_tibble(X)))
  results <- step(min.model, scope = full.model, direction = c("forward")) # refer to p.171
  coe <-results$coefficients
  #regfit.fwd <- regsubsets (y∼.,data=as_tibble(X) , nvmax=19, method ="forward ")
  #coe <-regfit.fwd$coefficients
  return2 <- names(coe)

 #############Box-Cox
  bac.box <-  boxcox(y~.,data=as_tibble(X))   # refer to p.131
  bac.lambda <-  bac.box$x[which(bac.box$y == max(bac.box$y))]
  #bac.lambda
  ylambda <- y^bac.lambda
  model <- lm(ylambda~., data = as_tibble(X))
  summary<-summary(model)$coefficients
  sel<-summary[,"Pr(>|t|)"] < alpha / 2
  return3<-names(sel[sel])
  
 #############Box-Cox + forward selection
  min.model <- lm(ylambda~1, data = as_tibble(X))
  full.model <- formula(lm(ylambda~., data = as_tibble(X)))
  results <- step(min.model, scope = full.model,direction=c("forward"))
  coe <- results$coefficients
  return4 <- names(coe)
  
  return(list(p_value = return1, Foward = return2, BoxCox = return3, BoxCox_Forward = return4))
}
alpha = 0.05 # the nominal FWER
results = lapply(Y, function(y) selection(X, y, alpha))

  #draw correlation picture of data
  #X1 <- as_tibble(X)
  #ggpairs(X1)+theme_light()

Results of Linear Regression Models

Scale of data is too big to run with exhaustive search (best subset selection).

results
## $APV
## $APV$p_value
##  [1] "P84.A" "P84.C" "P60.E" "P61.E" "P33.F" "P11.I" "P32.I" "P66.I"
##  [9] "P11.L" "P54.L" "P71.L" "P19.P" "P74.P" "P34.Q" "P20.T" "P43.T"
## [17] "P63.T" "P10.V" "P47.V" "P50.V" "P66.V" "P76.V" "P84.V" "P89.V"
## 
## $APV$Foward
##  [1] "(Intercept)" "P11.L"       "P33.F"       "P32.I"       "P11.I"      
##  [6] "P84.A"       "P47.V"       "P89.V"       "P76.V"       "P20.T"      
## [11] "P43.T"       "P84.V"       "P84.C"       "P60.E"       "P50.V"      
## [16] "P70.E"       "P61.E"       "P34.Q"       "P37.H"       "P19.P"      
## [21] "P54.L"       "P74.P"       "P15.I"       "P58.E"       "P48.M"      
## [26] "P71.L"       "P10.V"       "P54.M"       "P63.T"       "P19.I"      
## [31] "P71.V"       "P82.F"       "P61.H"       "P64.V"       "P93.L"      
## [36] "P73.T"       "P36.I"       "P66.I"       "P66.V"       "P77.I"      
## [41] "P66.L"       "P85.V"       "P36.V"       "P71.A"       "P69.K"      
## [46] "P67.Y"       "P91.S"       "P66.F"       "P82.T"      
## 
## $APV$BoxCox
##  [1] "(Intercept)" "P12.A"       "P84.A"       "P84.C"       "P58.E"      
##  [6] "P10.F"       "P33.F"       "P82.F"       "P10.I"       "P24.I"      
## [11] "P32.I"       "P46.I"       "P64.I"       "P82.I"       "P10.L"      
## [16] "P46.L"       "P50.L"       "P54.L"       "P48.M"       "P54.M"      
## [21] "P90.M"       "P37.N"       "P63.P"       "P69.Q"       "P88.S"      
## [26] "P91.S"       "P20.T"       "P43.T"       "P54.T"       "P10.V"      
## [31] "P15.V"       "P47.V"       "P50.V"       "P54.V"       "P76.V"      
## [36] "P84.V"       "P37.Y"       "P14.Z"      
## 
## $APV$BoxCox_Forward
##  [1] "(Intercept)" "P90.M"       "P46.I"       "P88.S"       "P33.F"      
##  [6] "P82.A"       "P84.V"       "P50.V"       "P84.A"       "P84.C"      
## [11] "P82.F"       "P47.V"       "P10.F"       "P10.I"       "P50.L"      
## [16] "P54.M"       "P54.L"       "P76.V"       "P48.M"       "P24.I"      
## [21] "P54.T"       "P20.T"       "P22.V"       "P10.V"       "P46.L"      
## [26] "P43.T"       "P58.E"       "P14.Z"       "P10.L"       "P66.I"      
## [31] "P82.I"       "P12.A"       "P64.V"       "P32.I"       "P54.V"      
## [36] "P71.L"       "P19.Z"       "P85.I"       "P37.Y"       "P67.Y"      
## [41] "P72.Z"       "P55.K"       "P48.V"       "P37.S"       "P36.L"      
## [46] "P89.I"       "P19.Q"       "P92.Q"       "P63.P"       "P64.I"      
## [51] "P69.Q"       "P37.N"       "P39.P"       "P16.A"       "P93.L"      
## [56] "P66.F"       "P91.S"       "P79.S"       "P72.V"       "P43.R"      
## [61] "P35.D"       "P15.V"       "P14.X"       "P77.I"       "P69.H"      
## [66] "P82.T"       "P10.R"       "P93.I"       "P74.S"       "P63.A"      
## [71] "P70.R"       "P14.R"       "P73.T"       "P37.C"       "P12.N"      
## [76] "P36.M"       "P46.V"      
## 
## 
## $ATV
## $ATV$p_value
##  [1] "P79.A" "P82.A" "P58.E" "P70.E" "P11.I" "P32.I" "P46.I" "P71.I"
##  [9] "P74.P" "P34.Q" "P20.T" "P43.T" "P48.V" "P76.V"
## 
## $ATV$Foward
##  [1] "(Intercept)" "P33.F"       "P11.I"       "P20.T"       "P34.Q"      
##  [6] "P20.R"       "P46.I"       "P76.V"       "P69.R"       "P19.I"      
## [11] "P32.I"       "P74.P"       "P71.I"       "P79.A"       "P48.V"      
## [16] "P82.A"       "P58.E"       "P70.E"       "P50.L"       "P22.V"      
## [21] "P43.T"       "P54.V"       "P67.E"       "P19.T"       "P33.I"      
## [26] "P20.V"       "P63.P"       "P79.D"       "P10.F"       "P19.P"      
## [31] "P70.R"       "P47.V"       "P48.M"       "P73.T"       "P37.D"      
## [36] "P63.A"       "P89.M"       "P30.N"       "P36.I"       "P93.I"      
## [41] "P46.L"       "P92.K"       "P73.C"       "P57.R"       "P57.K"      
## [46] "P14.X"       "P64.V"       "P82.F"      
## 
## $ATV$BoxCox
##  [1] "(Intercept)" "P12.A"       "P16.A"       "P79.D"       "P88.D"      
##  [6] "P58.E"       "P67.E"       "P10.F"       "P20.I"       "P24.I"      
## [11] "P32.I"       "P46.I"       "P77.I"       "P50.L"       "P48.M"      
## [16] "P90.M"       "P37.N"       "P19.Q"       "P73.S"       "P88.S"      
## [21] "P91.S"       "P20.T"       "P73.T"       "P82.T"       "P48.V"      
## [26] "P54.V"       "P71.V"       "P76.V"       "P84.V"       "P89.V"      
## 
## $ATV$BoxCox_Forward
##  [1] "(Intercept)" "P90.M"       "P84.V"       "P71.V"       "P43.T"      
##  [6] "P48.V"       "P10.F"       "P24.I"       "P50.L"       "P88.S"      
## [11] "P82.T"       "P47.V"       "P82.A"       "P30.N"       "P20.T"      
## [16] "P76.V"       "P20.R"       "P33.I"       "P48.M"       "P54.V"      
## [21] "P20.I"       "P73.T"       "P54.M"       "P79.D"       "P10.V"      
## [26] "P13.V"       "P64.V"       "P19.Q"       "P37.N"       "P58.E"      
## [31] "P46.I"       "P16.A"       "P88.D"       "P36.L"       "P54.L"      
## [36] "P91.S"       "P66.F"       "P73.S"       "P14.R"       "P36.V"      
## [41] "P37.E"       "P10.I"       "P32.I"       "P12.A"       "P35.D"      
## [46] "P14.K"       "P19.Z"       "P72.I"       "P71.A"       "P89.M"      
## [51] "P82.F"       "P18.H"       "P62.I"       "P63.Q"       "P57.R"      
## [56] "P67.E"       "P89.V"       "P70.E"       "P95.F"       "P14.Z"      
## [61] "P72.Z"       "P71.T"       "P77.I"       "P70.R"       "P39.S"      
## [66] "P82.I"       "P93.I"       "P85.V"      
## 
## 
## $IDV
## $IDV$p_value
##  [1] "P22.A" "P54.A" "P84.A" "P67.E" "P69.H" "P20.I" "P72.I" "P85.I"
##  [9] "P92.K" "P54.L" "P71.L" "P48.M" "P19.Q" "P57.R" "P69.R" "P54.S"
## [17] "P73.S" "P12.T" "P43.T" "P54.T" "P20.V" "P46.V" "P48.V" "P50.V"
## [25] "P72.V" "P76.V" "P89.V" "P67.Y"
## 
## $IDV$Foward
##  [1] "(Intercept)" "P69.R"       "P54.T"       "P54.A"       "P54.S"      
##  [6] "P54.V"       "P84.A"       "P57.R"       "P90.M"       "P67.E"      
## [11] "P85.I"       "P73.S"       "P43.T"       "P19.Q"       "P67.Y"      
## [16] "P46.I"       "P69.H"       "P12.T"       "P76.V"       "P83.D"      
## [21] "P48.V"       "P20.I"       "P50.V"       "P19.P"       "P91.S"      
## [26] "P22.A"       "P72.V"       "P10.L"       "P82.T"       "P20.M"      
## [31] "P24.I"       "P32.I"       "P48.M"       "P22.V"       "P92.K"      
## [36] "P82.A"       "P85.V"       "P92.Q"       "P71.L"       "P54.L"      
## [41] "P89.V"       "P20.V"       "P72.I"       "P46.V"       "P72.X"      
## [46] "P12.P"       "P61.E"       "P33.I"       "P16.E"       "P66.L"      
## [51] "P67.L"       "P72.R"       "P20.T"       "P89.I"       "P71.I"      
## [56] "P82.F"       "P63.T"       "P19.T"       "P77.V"       "P15.I"      
## [61] "P72.E"       "P63.Q"       "P65.D"       "P89.L"       "P69.K"      
## 
## $IDV$BoxCox
##  [1] "(Intercept)" "P54.A"       "P82.A"       "P84.A"       "P73.C"      
##  [6] "P84.C"       "P72.E"       "P10.F"       "P82.F"       "P18.H"      
## [11] "P61.H"       "P10.I"       "P20.I"       "P24.I"       "P32.I"      
## [16] "P33.I"       "P46.I"       "P69.K"       "P10.L"       "P36.L"      
## [21] "P46.L"       "P50.L"       "P48.M"       "P54.M"       "P90.M"      
## [26] "P37.N"       "P63.P"       "P37.S"       "P54.S"       "P73.S"      
## [31] "P74.S"       "P82.S"       "P88.S"       "P20.T"       "P43.T"      
## [36] "P54.T"       "P71.T"       "P73.T"       "P82.T"       "P48.V"      
## [41] "P54.V"       "P71.V"       "P76.V"       "P84.V"       "P10.X"      
## [46] "P67.Y"       "P14.Z"       "P19.Z"       "P72.Z"      
## 
## $IDV$BoxCox_Forward
##  [1] "(Intercept)" "P90.M"       "P54.V"       "P46.I"       "P71.V"      
##  [6] "P10.I"       "P50.L"       "P54.T"       "P54.S"       "P10.F"      
## [11] "P84.A"       "P54.A"       "P46.L"       "P63.P"       "P88.S"      
## [16] "P20.I"       "P43.T"       "P24.I"       "P48.V"       "P82.F"      
## [21] "P48.M"       "P84.V"       "P32.I"       "P76.V"       "P82.T"      
## [26] "P73.S"       "P73.T"       "P82.A"       "P84.C"       "P20.T"      
## [31] "P33.I"       "P14.Z"       "P36.L"       "P61.H"       "P15.V"      
## [36] "P71.T"       "P72.Z"       "P67.Y"       "P37.S"       "P10.L"      
## [41] "P82.S"       "P73.C"       "P69.K"       "P20.R"       "P55.K"      
## [46] "P54.M"       "P85.I"       "P19.Z"       "P82.I"       "P12.P"      
## [51] "P61.D"       "P66.L"       "P33.F"       "P61.K"       "P66.F"      
## [56] "P88.D"       "P71.A"       "P10.X"       "P74.S"       "P18.L"      
## [61] "P23.I"       "P16.E"       "P63.Q"       "P72.R"       "P67.E"      
## [66] "P69.R"       "P72.K"       "P37.N"       "P64.I"       "P39.P"      
## [71] "P19.L"       "P70.R"       "P72.E"       "P74.P"       "P71.L"      
## [76] "P37.E"       "P57.G"       "P10.R"       "P46.V"       "P36.V"      
## [81] "P37.A"       "P57.K"       "P18.H"       "P22.V"       "P22.A"      
## [86] "P79.P"       "P18.Q"       "P70.E"      
## 
## 
## $LPV
## $LPV$p_value
##  [1] "P54.A" "P79.A" "P58.E" "P61.E" "P33.F" "P92.K" "P11.L" "P55.R"
##  [9] "P69.R" "P54.S" "P91.S" "P43.T" "P47.V" "P48.V" "P50.V" "P54.V"
## [17] "P76.V" "P67.Y"
## 
## $LPV$Foward
##  [1] "(Intercept)" "P33.F"       "P54.V"       "P47.V"       "P69.R"      
##  [6] "P54.A"       "P11.L"       "P54.S"       "P50.V"       "P91.S"      
## [11] "P55.R"       "P67.Y"       "P76.V"       "P43.T"       "P48.V"      
## [16] "P10.F"       "P53.L"       "P57.K"       "P12.T"       "P61.E"      
## [21] "P66.F"       "P92.K"       "P43.K"       "P58.E"       "P67.E"      
## [26] "P88.G"       "P61.H"       "P18.H"       "P73.S"       "P72.L"      
## [31] "P66.L"       "P23.I"       "P73.C"       "P34.Q"       "P16.A"      
## [36] "P58.Q"       "P54.T"       "P10.I"       "P46.I"       "P50.L"      
## [41] "P19.Q"       "P85.I"       "P10.L"       "P22.A"       "P79.A"      
## [46] "P89.V"       "P64.L"       "P69.K"       "P37.D"       "P84.C"      
## [51] "P36.I"       "P67.F"       "P48.M"       "P72.V"      
## 
## $LPV$BoxCox
##  [1] "(Intercept)" "P16.A"       "P54.A"       "P82.A"       "P84.A"      
##  [6] "P88.D"       "P10.F"       "P33.F"       "P66.F"       "P82.F"      
## [11] "P10.I"       "P24.I"       "P33.I"       "P46.I"       "P10.L"      
## [16] "P46.L"       "P50.L"       "P48.M"       "P54.M"       "P90.M"      
## [21] "P30.N"       "P73.S"       "P82.S"       "P20.T"       "P54.T"      
## [26] "P82.T"       "P47.V"       "P48.V"       "P50.V"       "P54.V"      
## [31] "P76.V"       "P84.V"       "P67.Y"      
## 
## $LPV$BoxCox_Forward
##  [1] "(Intercept)" "P54.V"       "P46.I"       "P33.F"       "P71.V"      
##  [6] "P50.L"       "P20.R"       "P48.V"       "P10.F"       "P10.I"      
## [11] "P47.V"       "P82.F"       "P93.L"       "P48.M"       "P46.L"      
## [16] "P90.M"       "P76.V"       "P54.A"       "P24.I"       "P50.V"      
## [21] "P84.V"       "P82.A"       "P67.Y"       "P54.T"       "P84.A"      
## [26] "P82.T"       "P82.S"       "P11.I"       "P20.T"       "P54.S"      
## [31] "P10.L"       "P20.I"       "P53.L"       "P69.K"       "P36.L"      
## [36] "P33.I"       "P16.A"       "P61.H"       "P54.M"       "P54.L"      
## [41] "P45.K"       "P10.V"       "P66.L"       "P33.L"       "P63.H"      
## [46] "P61.E"       "P61.D"       "P55.K"       "P35.G"       "P19.Q"      
## [51] "P39.S"       "P84.C"       "P62.V"       "P19.L"       "P88.D"      
## [56] "P30.N"       "P73.T"       "P73.S"       "P89.I"       "P36.I"      
## [61] "P69.Q"       "P41.R"       "P72.V"       "P10.R"       "P91.S"      
## [66] "P60.D"       "P79.D"       "P20.M"       "P14.R"       "P88.S"      
## [71] "P23.I"       "P61.K"       "P60.E"       "P71.T"       "P85.V"      
## [76] "P72.K"       "P37.N"       "P66.F"       "P37.S"       "P18.H"      
## [81] "P85.I"       "P33.M"       "P34.Q"       "P72.R"       "P63.L"      
## 
## 
## $NFV
## $NFV$p_value
##  [1] "P82.A" "P84.A" "P84.C" "P88.D" "P58.E" "P10.F" "P69.H" "P46.I"
##  [9] "P33.M" "P48.M" "P90.M" "P30.N" "P12.Q" "P20.R" "P69.R" "P54.S"
## [17] "P20.T" "P54.T" "P82.T" "P36.V" "P46.V" "P54.V" "P76.V" "P67.Y"
## 
## $NFV$Foward
##  [1] "(Intercept)" "P84.A"       "P88.D"       "P90.M"       "P84.C"      
##  [6] "P69.R"       "P20.R"       "P20.T"       "P54.T"       "P54.V"      
## [11] "P30.N"       "P54.S"       "P12.Q"       "P33.M"       "P10.F"      
## [16] "P67.Y"       "P24.I"       "P82.S"       "P69.H"       "P76.V"      
## [21] "P58.E"       "P36.V"       "P47.V"       "P54.A"       "P73.S"      
## [26] "P20.V"       "P88.S"       "P82.A"       "P43.T"       "P63.H"      
## [31] "P63.E"       "P71.A"       "P46.V"       "P20.M"       "P36.I"      
## [36] "P66.I"       "P63.T"       "P36.L"       "P19.I"       "P32.I"      
## [41] "P82.I"       "P82.T"       "P46.I"       "P48.M"       "P70.E"      
## [46] "P69.K"       "P61.D"       "P72.V"       "P93.L"       "P19.Q"      
## [51] "P67.F"       "P36.M"       "P57.K"       "P48.V"       "P50.V"      
## [56] "P46.L"       "P57.G"       "P61.E"       "P37.H"       "P53.L"      
## [61] "P37.N"       "P71.V"       "P19.V"       "P74.S"       "P61.N"      
## 
## $NFV$BoxCox
##  [1] "(Intercept)" "P54.A"       "P82.A"       "P84.A"       "P84.C"      
##  [6] "P79.D"       "P88.D"       "P72.E"       "P10.F"       "P82.F"      
## [11] "P10.I"       "P20.I"       "P24.I"       "P33.I"       "P36.I"      
## [16] "P46.I"       "P89.I"       "P10.L"       "P36.L"       "P46.L"      
## [21] "P50.L"       "P48.M"       "P54.M"       "P90.M"       "P30.N"      
## [26] "P63.P"       "P74.P"       "P19.Q"       "P10.R"       "P54.S"      
## [31] "P73.S"       "P74.S"       "P88.S"       "P20.T"       "P54.T"      
## [36] "P71.T"       "P72.T"       "P73.T"       "P36.V"       "P48.V"      
## [41] "P54.V"       "P71.V"       "P76.V"       "P84.V"       "P85.V"      
## [46] "P37.Y"       "P67.Y"      
## 
## $NFV$BoxCox_Forward
##  [1] "(Intercept)" "P90.M"       "P30.N"       "P54.V"       "P46.I"      
##  [6] "P84.C"       "P84.A"       "P54.T"       "P88.S"       "P54.S"      
## [11] "P84.V"       "P24.I"       "P36.I"       "P32.I"       "P20.I"      
## [16] "P50.L"       "P10.I"       "P10.F"       "P63.P"       "P74.S"      
## [21] "P54.A"       "P73.S"       "P82.F"       "P46.L"       "P48.V"      
## [26] "P48.M"       "P36.L"       "P73.T"       "P71.V"       "P88.D"      
## [31] "P20.T"       "P74.P"       "P54.M"       "P71.T"       "P72.R"      
## [36] "P36.V"       "P67.Y"       "P88.G"       "P10.L"       "P16.E"      
## [41] "P58.E"       "P14.Z"       "P55.R"       "P67.E"       "P63.E"      
## [46] "P82.A"       "P19.Z"       "P61.K"       "P66.F"       "P67.L"      
## [51] "P33.I"       "P10.R"       "P77.I"       "P12.T"       "P79.D"      
## [56] "P22.V"       "P10.X"       "P10.V"       "P76.V"       "P37.Y"      
## [61] "P72.E"       "P64.I"       "P89.V"       "P89.I"       "P13.V"      
## [66] "P19.Q"       "P63.X"       "P37.C"       "P20.R"       "P71.A"      
## [71] "P60.D"       "P72.T"       "P69.K"       "P12.K"       "P72.K"      
## [76] "P43.T"       "P37.S"       "P85.V"       "P71.I"       "P64.L"      
## [81] "P77.V"       "P62.V"       "P23.I"       "P14.R"       "P57.G"      
## [86] "P83.D"       "P39.S"       "P37.N"       "P63.L"       "P72.Z"      
## [91] "P73.C"       "P36.M"       "P35.G"       "P58.Q"       "P19.L"      
## [96] "P82.S"       "P61.E"      
## 
## 
## $RTV
## $RTV$p_value
##  [1] "P54.A" "P84.A" "P84.C" "P61.D" "P58.E" "P67.E" "P68.E" "P33.F"
##  [9] "P67.F" "P95.F" "P19.I" "P23.I" "P24.I" "P32.I" "P33.I" "P66.I"
## [17] "P72.K" "P36.L" "P53.L" "P90.M" "P19.P" "P34.Q" "P20.R" "P54.S"
## [25] "P74.S" "P91.S" "P20.T" "P43.T" "P54.T" "P36.V" "P47.V" "P50.V"
## [33] "P54.V" "P84.V"
## 
## $RTV$Foward
##  [1] "(Intercept)" "P54.V"       "P33.F"       "P32.I"       "P54.A"      
##  [6] "P20.R"       "P69.R"       "P24.I"       "P11.I"       "P47.V"      
## [11] "P34.Q"       "P50.V"       "P90.M"       "P36.L"       "P48.V"      
## [16] "P72.V"       "P73.S"       "P43.T"       "P72.R"       "P82.T"      
## [21] "P73.C"       "P68.E"       "P84.A"       "P67.F"       "P67.E"      
## [26] "P58.E"       "P53.L"       "P91.S"       "P20.T"       "P54.S"      
## [31] "P54.T"       "P74.S"       "P23.I"       "P95.F"       "P82.F"      
## [36] "P33.I"       "P30.N"       "P76.V"       "P84.C"       "P33.L"      
## [41] "P84.V"       "P20.K"       "P36.V"       "P10.L"       "P72.K"      
## [46] "P57.K"       "P54.L"       "P19.P"       "P73.T"       "P66.I"      
## [51] "P66.V"       "P74.P"       "P69.K"       "P79.P"       "P79.S"      
## [56] "P60.E"       "P61.D"       "P69.H"       "P69.Y"       "P70.E"      
## [61] "P16.E"       "P37.H"       "P19.Z"       "P70.K"       "P71.T"      
## [66] "P37.D"       "P93.L"       "P36.M"       "P61.H"       "P19.L"      
## [71] "P19.I"       "P12.Q"       "P89.M"       "P12.S"       "P37.A"      
## [76] "P15.I"       "P58.Q"      
## 
## $RTV$BoxCox
##  [1] "(Intercept)" "P54.A"       "P82.A"       "P84.A"       "P84.C"      
##  [6] "P88.D"       "P58.E"       "P33.F"       "P82.F"       "P10.I"      
## [11] "P24.I"       "P36.I"       "P46.I"       "P64.I"       "P82.I"      
## [16] "P10.L"       "P36.L"       "P50.L"       "P54.L"       "P93.L"      
## [21] "P54.M"       "P90.M"       "P30.N"       "P37.N"       "P63.P"      
## [26] "P10.R"       "P20.R"       "P54.S"       "P73.S"       "P82.S"      
## [31] "P20.T"       "P54.T"       "P82.T"       "P36.V"       "P47.V"      
## [36] "P48.V"       "P50.V"       "P54.V"       "P71.V"       "P84.V"      
## [41] "P67.Y"      
## 
## $RTV$BoxCox_Forward
##  [1] "(Intercept)" "P54.V"       "P90.M"       "P82.A"       "P84.V"      
##  [6] "P46.I"       "P82.T"       "P84.A"       "P24.I"       "P82.F"      
## [11] "P33.F"       "P50.L"       "P36.I"       "P50.V"       "P54.T"      
## [16] "P84.C"       "P32.I"       "P54.S"       "P82.S"       "P54.A"      
## [21] "P67.Y"       "P58.E"       "P73.S"       "P36.L"       "P53.L"      
## [26] "P64.I"       "P54.M"       "P54.L"       "P71.T"       "P71.L"      
## [31] "P14.Z"       "P20.T"       "P82.I"       "P63.P"       "P10.I"      
## [36] "P93.L"       "P47.V"       "P36.V"       "P10.L"       "P10.R"      
## [41] "P71.V"       "P48.V"       "P20.R"       "P19.Z"       "P63.H"      
## [46] "P79.P"       "P15.V"       "P46.L"       "P74.S"       "P37.N"      
## [51] "P72.Z"       "P12.N"       "P72.K"       "P18.L"       "P67.E"      
## [56] "P89.V"       "P33.I"       "P89.I"       "P19.Q"       "P73.T"      
## [61] "P73.C"       "P46.V"       "P67.L"       "P16.E"       "P36.M"      
## [66] "P57.G"       "P12.T"       "P67.F"       "P88.D"       "P30.N"      
## [71] "P11.I"       "P74.A"       "P72.R"       "P83.D"       "P72.V"      
## [76] "P35.E"       "P74.P"       "P19.I"       "P20.I"       "P35.D"      
## [81] "P33.V"       "P61.K"       "P93.M"      
## 
## 
## $SQV
## $SQV$p_value
##  [1] "P22.A" "P54.A" "P73.A" "P82.A" "P84.A" "P84.C" "P83.D" "P88.D"
##  [9] "P10.F" "P67.F" "P69.H" "P71.I" "P19.L" "P53.L" "P48.M" "P19.P"
## [17] "P54.S" "P73.S" "P91.S" "P43.T" "P54.T" "P82.T" "P22.V" "P48.V"
## [25] "P54.V" "P76.V" "P84.V" "P72.X"
## 
## $SQV$Foward
##  [1] "(Intercept)" "P48.V"       "P84.V"       "P48.M"       "P84.A"      
##  [6] "P20.R"       "P90.M"       "P67.F"       "P22.V"       "P53.L"      
## [11] "P84.C"       "P54.T"       "P43.T"       "P22.A"       "P54.S"      
## [16] "P54.V"       "P54.A"       "P72.X"       "P73.A"       "P20.T"      
## [21] "P67.E"       "P88.D"       "P73.S"       "P71.I"       "P82.A"      
## [26] "P82.T"       "P91.S"       "P66.L"       "P67.Y"       "P76.V"      
## [31] "P63.P"       "P83.D"       "P71.L"       "P57.K"       "P54.L"      
## [36] "P36.L"       "P19.T"       "P72.T"       "P63.H"       "P70.E"      
## [41] "P61.H"       "P82.F"       "P67.S"       "P19.L"       "P19.P"      
## [46] "P73.T"       "P32.I"       "P82.S"       "P24.I"       "P20.V"      
## [51] "P69.H"       "P36.I"       "P54.M"       "P72.E"       "P69.K"      
## [56] "P64.V"       "P57.G"      
## 
## $SQV$BoxCox
##  [1] "(Intercept)" "P54.A"       "P73.A"       "P82.A"       "P84.A"      
##  [6] "P84.C"       "P88.D"       "P58.E"       "P67.E"       "P10.F"      
## [11] "P82.F"       "P57.G"       "P61.H"       "P10.I"       "P20.I"      
## [16] "P24.I"       "P71.I"       "P10.L"       "P36.L"       "P50.L"      
## [21] "P53.L"       "P93.L"       "P48.M"       "P90.M"       "P37.N"      
## [26] "P63.P"       "P74.P"       "P55.R"       "P54.S"       "P73.S"      
## [31] "P74.S"       "P88.S"       "P91.S"       "P54.T"       "P73.T"      
## [36] "P10.V"       "P15.V"       "P48.V"       "P54.V"       "P71.V"      
## [41] "P76.V"       "P84.V"       "P67.Y"      
## 
## $SQV$BoxCox_Forward
##  [1] "(Intercept)" "P90.M"       "P84.V"       "P48.V"       "P84.A"      
##  [6] "P24.I"       "P54.V"       "P84.C"       "P88.D"       "P48.M"      
## [11] "P53.L"       "P71.V"       "P36.I"       "P88.S"       "P82.F"      
## [16] "P74.S"       "P73.S"       "P50.L"       "P10.F"       "P10.I"      
## [21] "P67.E"       "P10.L"       "P54.A"       "P54.S"       "P54.T"      
## [26] "P20.I"       "P76.V"       "P10.V"       "P82.A"       "P58.E"      
## [31] "P73.T"       "P67.Y"       "P74.P"       "P55.R"       "P61.H"      
## [36] "P73.A"       "P63.P"       "P15.V"       "P36.L"       "P35.G"      
## [41] "P57.G"       "P91.S"       "P12.P"       "P71.I"       "P93.L"      
## [46] "P12.N"       "P67.F"       "P70.R"       "P69.K"       "P20.R"      
## [51] "P85.I"       "P64.L"       "P72.R"       "P37.N"       "P22.A"      
## [56] "P54.L"       "P63.H"       "P43.R"       "P14.R"       "P71.A"      
## [61] "P71.T"       "P54.M"       "P60.D"       "P50.V"       "P19.Z"      
## [66] "P39.S"       "P18.L"       "P83.D"       "P82.T"       "P43.K"      
## [71] "P12.A"       "P12.Z"       "P72.E"       "P72.K"       "P20.V"      
## [76] "P19.L"       "P37.S"       "P19.Q"       "P63.L"       "P37.E"      
## [81] "P74.A"       "P64.I"       "P11.I"       "P70.E"       "P60.E"

Evaluating the results

In this case, we are fortunate enough to have a “ground truth” obtained by another experiment (data saved as tsm_df). Using this, we can evaluate the selected results. Note that we only need to compare the position of the mutations, not the mutation type. This is because it is known that multiple mutations at the same protease or RT position can often be associated with related drug-resistance outcomes.

knitr::opts_chunk$set(echo=FALSE, warning=FALSE, message=FALSE)
library("stringr"); library("purrr")
# Evaluate the result by comparing it to the ground true
length(results)
length(results[[1]]) 
rm_intercept <- function(results){   #remove intercept to compare the position
  for(i in c(1:7)){
    for(j in c(1:4)){
      if(results[[i]][j][1]=="(Intercept)"){
        results[[i]][j] <-  results[[i]][j][-1]
      } 
    }
  }
  return(results) 
}
results<-rm_intercept(results)
results

compare <- function(results){
  ans <- lapply(results,function(x) as.numeric(sort(unique(str_sub(x,2,3)))))  # remove possible duplicates
  tandf <- map(ans,~.x %in% tsm_df$Position)
  td <- map_dbl(tandf, function(x) sum(x) / length(x))
  dr <- map_dbl(tandf, function(x) sum(x) / nrow(tsm_df))
  fdr <- map_dbl(tandf, function(x) sum(!x) / nrow(tsm_df))
  stat <- data.frame(stat=c("true_discoveries","discover_rate","false_discoveries_rate")) 
  evaluation <- as_tibble(cbind(stat, rbind(td, dr, fdr)))
  return(evaluation)
} 

evaluation <- map(results, compare) 
## Warning in FUN(X[[i]], ...): NAs introduced by coercion

## Warning in FUN(X[[i]], ...): NAs introduced by coercion

## Warning in FUN(X[[i]], ...): NAs introduced by coercion

## Warning in FUN(X[[i]], ...): NAs introduced by coercion

## Warning in FUN(X[[i]], ...): NAs introduced by coercion

## Warning in FUN(X[[i]], ...): NAs introduced by coercion

## Warning in FUN(X[[i]], ...): NAs introduced by coercion

## Warning in FUN(X[[i]], ...): NAs introduced by coercion

## Warning in FUN(X[[i]], ...): NAs introduced by coercion

## Warning in FUN(X[[i]], ...): NAs introduced by coercion

## Warning in FUN(X[[i]], ...): NAs introduced by coercion

## Warning in FUN(X[[i]], ...): NAs introduced by coercion

## Warning in FUN(X[[i]], ...): NAs introduced by coercion

## Warning in FUN(X[[i]], ...): NAs introduced by coercion

## Warning in FUN(X[[i]], ...): NAs introduced by coercion

## Warning in FUN(X[[i]], ...): NAs introduced by coercion

## Warning in FUN(X[[i]], ...): NAs introduced by coercion

## Warning in FUN(X[[i]], ...): NAs introduced by coercion

## Warning in FUN(X[[i]], ...): NAs introduced by coercion

## Warning in FUN(X[[i]], ...): NAs introduced by coercion

## Warning in FUN(X[[i]], ...): NAs introduced by coercion
## $APV
## # A tibble: 3 x 5
##   stat                   p_value Foward BoxCox BoxCox_Forward
##   <fct>                    <dbl>  <dbl>  <dbl>          <dbl>
## 1 true_discoveries         0.8    0.611  0.654          0.609
## 2 discover_rate            0.471  0.647  0.5            0.824
## 3 false_discoveries_rate   0.118  0.412  0.265          0.529
## 
## $ATV
## # A tibble: 3 x 5
##   stat                   p_value Foward BoxCox BoxCox_Forward
##   <fct>                    <dbl>  <dbl>  <dbl>          <dbl>
## 1 true_discoveries        0.929   0.657  0.731          0.587
## 2 discover_rate           0.382   0.676  0.559          0.794
## 3 false_discoveries_rate  0.0294  0.353  0.206          0.559
## 
## $IDV
## # A tibble: 3 x 5
##   stat                   p_value Foward BoxCox BoxCox_Forward
##   <fct>                    <dbl>  <dbl>  <dbl>          <dbl>
## 1 true_discoveries         0.7    0.583  0.655          0.571
## 2 discover_rate            0.412  0.618  0.559          0.706
## 3 false_discoveries_rate   0.176  0.441  0.294          0.529
## 
## $LPV
## # A tibble: 3 x 5
##   stat                   p_value Foward BoxCox BoxCox_Forward
##   <fct>                    <dbl>  <dbl>  <dbl>          <dbl>
## 1 true_discoveries        0.812   0.632 0.9             0.609
## 2 discover_rate           0.382   0.706 0.529           0.824
## 3 false_discoveries_rate  0.0882  0.412 0.0588          0.529
## 
## $NFV
## # A tibble: 3 x 5
##   stat                   p_value Foward BoxCox BoxCox_Forward
##   <fct>                    <dbl>  <dbl>  <dbl>          <dbl>
## 1 true_discoveries        0.824   0.667  0.778          0.583
## 2 discover_rate           0.412   0.706  0.618          0.824
## 3 false_discoveries_rate  0.0882  0.353  0.176          0.588
## 
## $RTV
## # A tibble: 3 x 5
##   stat                   p_value Foward BoxCox BoxCox_Forward
##   <fct>                    <dbl>  <dbl>  <dbl>          <dbl>
## 1 true_discoveries         0.75   0.636  0.75           0.610
## 2 discover_rate            0.529  0.824  0.529          0.735
## 3 false_discoveries_rate   0.176  0.471  0.176          0.471
## 
## $SQV
## # A tibble: 3 x 5
##   stat                   p_value Foward BoxCox BoxCox_Forward
##   <fct>                    <dbl>  <dbl>  <dbl>          <dbl>
## 1 true_discoveries         0.667  0.552  0.667          0.512
## 2 discover_rate            0.353  0.471  0.529          0.647
## 3 false_discoveries_rate   0.176  0.382  0.265          0.618

References

Rhee, Soo-Yon, W Jeffrey Fessel, Andrew R Zolopa, Leo Hurley, Tommy Liu, Jonathan Taylor, Dong Phuong Nguyen, et al. 2005. “HIV-1 Protease and Reverse-Transcriptase Mutations: correlations with Antiretroviral Therapy in Subtype B Isolates and Implications for Drug-Resistance Surveillance.” Journal of Infectious Diseases 192 (3). Oxford University Press: 456–65.

Rhee, Soo-Yon, Jonathan Taylor, Gauhar Wadhera, Asa Ben-Hur, Douglas L Brutlag, and Robert W Shafer. 2006. “Genotypic Predictors of Human Immunodeficiency Virus Type 1 Drug Resistance.” Proceedings of the National Academy of Sciences 103 (46). National Academy of Sciences: 17355–60.