The scientific goal is to determine which mutations of the Human Immunodeficiency Virus Type 1 (HIV-1) are associated with drug resistance. The data set, publicly available from the Stanford HIV Drug Resistance Database http://hivdb.stanford.edu/pages/published_analysis/genophenoPNAS2006/, was originally analyzed in (Rhee et al. 2006).
The data set consists of measurements for three classes of drugs: protease inhibitors (PIs), nucleoside reverse transcriptase (RT) inhibitors (NRTIs), and nonnucleoside RT inhibitors (NNRTIs). Protease and reverse transcriptase are two enzymes in HIV-1 that are crucial to the function of the virus. This data set seeks associations between mutations in the HIV-1 protease and drug resistance to different PI type drugs, and between mutations in the HIV-1 reverse transcriptase and drug resistance to different NRTI and NNRTI type drugs (The raw data are saved as gene_df
).
In order to evaluate our results, we compare with the treatment-selected mutation panels created by (Rhee et al. 2005), which can be viewed as the ground true. These panels give lists of HIV-1 mutations appearing more frequently in patients who have previously been treated with PI, NRTI, or NNRTI type drugs, than in patients with no previous exposure to that drug type. Increased frequency of a mutation among patients treated with a certain drug type implies that the mutation confers resistance to that drug type (The raw data are saved as tsm_df
).
To simplify the analysis, in this project we will confine our attention to the PI drugs.
drug_class = 'PI' # Possible drug types are 'PI', 'NRTI', and 'NNRTI'.
First, we download the data and read it into data frames.
base_url = 'http://hivdb.stanford.edu/pages/published_analysis/genophenoPNAS2006'
gene_url = paste(base_url, 'DATA', paste0(drug_class, '_DATA.txt'), sep='/')
tsm_url = paste(base_url, 'MUTATIONLISTS', 'NP_TSM', drug_class, sep='/')
gene_df = read.delim(gene_url, na.string = c('NA', ''), stringsAsFactors = FALSE)
tsm_df = read.delim(tsm_url, header = FALSE, stringsAsFactors = FALSE)
names(tsm_df) = c('Position', 'Mutations')
A small sample of the data is shown below.
head(gene_df, n=6)
## IsolateName PseudoName MedlineID APV ATV IDV LPV NFV RTV SQV
## 1 CA10676 CA622 10839657 2.3 NA 32.7 NA 23.4 51.6 37.8
## 2 CA37880 CA622 15995959 76.0 NA 131.0 200.0 50.0 200.0 156.0
## 3 CA9984 CA624 11897594 2.8 NA 12.0 NA 100.0 41.0 145.6
## 4 CA17003 CA628 15995959 6.5 9.2 2.1 5.3 5.0 36.0 13.0
## 5 CA10670 CA634 10839657 8.3 NA 100.0 NA 161.1 170.2 100.0
## 6 CA42683 CA634 15995959 82.0 75.0 400.0 400.0 91.0 400.0 400.0
## P1 P2 P3 P4 P5 P6 P7 P8 P9 P10 P11 P12 P13 P14 P15 P16 P17 P18 P19 P20
## 1 - - - - - - - - - I - - - - - - - - - -
## 2 - - - - - - - - - F L - - - - A - - I -
## 3 - - - - - - - - - - - - V - - - - - - R
## 4 - - - - - - - - - I - - - - - - - - - -
## 5 - - - - - - - - - I - - - - - - - - - R
## 6 - - - - - - - - - I - - - - - - - - - R
## P21 P22 P23 P24 P25 P26 P27 P28 P29 P30 P31 P32 P33 P34 P35 P36 P37 P38
## 1 - - - - - - - - - - - - - - - - DN -
## 2 - - - - - - - - - - - - F - - - - -
## 3 - - - - - - - - - N - - IL - D I - -
## 4 - - I - - - - - - - - - - - - - - -
## 5 - - F - - - - - - - - - - - D I D -
## 6 - - - - - - - - - - - - F T D I D -
## P39 P40 P41 P42 P43 P44 P45 P46 P47 P48 P49 P50 P51 P52 P53 P54 P55 P56
## 1 - - K - - - RK I - - - - - - - - - -
## 2 - - K - - - - I - - - - - - - V - -
## 3 - - - - - - - - - - - - - - - V - -
## 4 - - - - - - - L - - - - - - - - - -
## 5 - - - - - - - I - V - - - - - T - -
## 6 - - - - T - - I - V - V - - - S - -
## P57 P58 P59 P60 P61 P62 P63 P64 P65 P66 P67 P68 P69 P70 P71 P72 P73 P74
## 1 - - - - - - P - - - - - - - LV - S -
## 2 - - - - - - P - - - - - - - V - S -
## 3 - - - - - - P - - - - - - - TA - - -
## 4 - E - E EK - P - - - - - - - T T - -
## 5 - - - - - V P - - - - - - - V X - -
## 6 - - - - - V P - - - - - - - V V - -
## P75 P76 P77 P78 P79 P80 P81 P82 P83 P84 P85 P86 P87 P88 P89 P90 P91 P92
## 1 - - I - - - - T - V V - - - - M - -
## 2 - - - - - - - T - V - - - - V M S -
## 3 - - - - - - - - - - - - - D - M - -
## 4 - - - - - - - A - V - - - - - M - -
## 5 - - I - - - - A - - - - - - - - - -
## 6 - - I - - - - A - - V - - - - - - -
## P93 P94 P95 P96 P97 P98 P99
## 1 L - - - - - -
## 2 L - - - - - -
## 3 - - - - - - .
## 4 - - - - - - -
## 5 L - - - - - -
## 6 L - - - - - -
head(tsm_df, n=6)
## Position Mutations
## 1 10 F R
## 2 11 I
## 3 20 I T V
## 4 23 I
## 5 24 I
## 6 30 N
In tsm_df
, the variable Position
denotes the position of the mutations that are associated with drug-resistance, while Mutations
indicating the mutation type.
The gene data table has some rows with error flags or nonstandard mutation codes. For simplicity, we remove all such rows.
# Returns rows for which every column matches the given regular expression.
grepl_rows <- function(pattern, df) {
cell_matches = apply(df, c(1,2), function(x) grepl(pattern, x))
apply(cell_matches, 1, all)
}
pos_start = which(names(gene_df) == 'P1')
pos_cols = seq.int(pos_start, ncol(gene_df))
valid_rows = grepl_rows('^(\\.|-|[A-Zid]+)$', gene_df[,pos_cols])
gene_df = gene_df[valid_rows,]
We now construct the design matrix \(X\) and matrix of response vectors \(Y\). The features (columns of \(X\)) are given by mutation/position pairs. Define
$$ X_{i,j} = 1 i j \
Y_{i,k} = i k. $$
For example, in the sample for PI type drugs, three different mutations (A, C, and D) are observed at position 63 in the protease, and so three columns of \(X\) (named P63.A, P63.C, and P63.D) indicate the presence or absence of each mutation at this position.
# Flatten a matrix to a vector with names from concatenating row/column names.
flatten_matrix <- function(M, sep='.') {
x <- c(M)
names(x) <- c(outer(rownames(M), colnames(M),
function(...) paste(..., sep=sep)))
x
}
# Construct preliminary design matrix.
muts = c(LETTERS, 'i', 'd')
X = outer(muts, as.matrix(gene_df[,pos_cols]), Vectorize(grepl))
X = aperm(X, c(2,3,1))
dimnames(X)[[3]] <- muts
X = t(apply(X, 1, flatten_matrix))
mode(X) <- 'numeric'
# Remove any mutation/position pairs that never appear in the data.
X = X[,colSums(X) != 0]
# Extract response matrix.
Y = gene_df[,4:(pos_start-1)]
An excerpt of the design matrix is shown below. By construction, every column contains at least one 1, but the matrix is still quite sparse with the relative frequency of 1’s being about 0.025.
datatable(data.frame(X)[1:10, ], options = list(scrollX=T, pageLength = 10))
The response matrix looks like:
head(Y, n=6)
## APV ATV IDV LPV NFV RTV SQV
## 1 2.3 NA 32.7 NA 23.4 51.6 37.8
## 2 76.0 NA 131.0 200.0 50.0 200.0 156.0
## 3 2.8 NA 12.0 NA 100.0 41.0 145.6
## 4 6.5 9.2 2.1 5.3 5.0 36.0 13.0
## 5 8.3 NA 100.0 NA 161.1 170.2 100.0
## 6 82.0 75.0 400.0 400.0 91.0 400.0 400.0
There are 7 PI-type drugs: APV, ATV, IDV, LPV, NFV, RTV, and SQV.
In this step, you need to build an appropriate linear regression model, and use the method we discussed in lecture to select mutations that may associated with drug-resistance. For 7 PI-type drugs, you need to run a seperate analysis for each drug.
Notice that there are some missing values.
Before building the model, we need to perform some final pre-processing steps. We remove rows with missing values (which vary from drug to drug) and we then further reduce the design matrix by removing predictor columns for mutations that do not appear at least three times in the sample. Finally, for identifiability, we remove any columns that are duplicates (i.e. two mutations that appear only in tandem, and therefore we cannot distinguish between their effects on the response).
selection <- function (X, y, alpha) {
# Remove patients with missing measurements.
missing = is.na(y)
y = y[!missing]
X = X[!missing,]
# Remove predictors that appear less than 3 times.
X = X[,colSums(X) >= 3]
# Remove duplicate predictors.
X = X[,colSums(abs(cor(X)-1) < 1e-4) == 1]
# Buid an appropriate linear regression model
# Select the mutations that may associated with drug-resistance
model <- lm(y~., data = as_tibble(X))
summary <- summary(model)$coefficients
sel <- summary[,"Pr(>|t|)"] < alpha / 2 ## selection according to alpha
return1 <- names(sel[sel])
#############best subsets selection
# ------> data too big to run exhaustive search
#X1 <- as_tibble(cbind(y, X))
#regfit.full <- regsubsets (y~., data = X1, really.big=T)
#regfit.full.summary <- summary(regfit.full)
#coe <- summary(regfit.full)[[1]][which.min(regfit.full.summary$cp),]
#return5 <- names(coe)
#############forward selection
min.model <- lm(y~1, data = as_tibble(X))
full.model <- formula(lm(y~., data = as_tibble(X)))
results <- step(min.model, scope = full.model, direction = c("forward")) # refer to p.171
coe <-results$coefficients
#regfit.fwd <- regsubsets (y∼.,data=as_tibble(X) , nvmax=19, method ="forward ")
#coe <-regfit.fwd$coefficients
return2 <- names(coe)
#############Box-Cox
bac.box <- boxcox(y~.,data=as_tibble(X)) # refer to p.131
bac.lambda <- bac.box$x[which(bac.box$y == max(bac.box$y))]
#bac.lambda
ylambda <- y^bac.lambda
model <- lm(ylambda~., data = as_tibble(X))
summary<-summary(model)$coefficients
sel<-summary[,"Pr(>|t|)"] < alpha / 2
return3<-names(sel[sel])
#############Box-Cox + forward selection
min.model <- lm(ylambda~1, data = as_tibble(X))
full.model <- formula(lm(ylambda~., data = as_tibble(X)))
results <- step(min.model, scope = full.model,direction=c("forward"))
coe <- results$coefficients
return4 <- names(coe)
return(list(p_value = return1, Foward = return2, BoxCox = return3, BoxCox_Forward = return4))
}
alpha = 0.05 # the nominal FWER
results = lapply(Y, function(y) selection(X, y, alpha))
#draw correlation picture of data
#X1 <- as_tibble(X)
#ggpairs(X1)+theme_light()
Scale of data is too big to run with exhaustive search (best subset selection).
results
## $APV
## $APV$p_value
## [1] "P84.A" "P84.C" "P60.E" "P61.E" "P33.F" "P11.I" "P32.I" "P66.I"
## [9] "P11.L" "P54.L" "P71.L" "P19.P" "P74.P" "P34.Q" "P20.T" "P43.T"
## [17] "P63.T" "P10.V" "P47.V" "P50.V" "P66.V" "P76.V" "P84.V" "P89.V"
##
## $APV$Foward
## [1] "(Intercept)" "P11.L" "P33.F" "P32.I" "P11.I"
## [6] "P84.A" "P47.V" "P89.V" "P76.V" "P20.T"
## [11] "P43.T" "P84.V" "P84.C" "P60.E" "P50.V"
## [16] "P70.E" "P61.E" "P34.Q" "P37.H" "P19.P"
## [21] "P54.L" "P74.P" "P15.I" "P58.E" "P48.M"
## [26] "P71.L" "P10.V" "P54.M" "P63.T" "P19.I"
## [31] "P71.V" "P82.F" "P61.H" "P64.V" "P93.L"
## [36] "P73.T" "P36.I" "P66.I" "P66.V" "P77.I"
## [41] "P66.L" "P85.V" "P36.V" "P71.A" "P69.K"
## [46] "P67.Y" "P91.S" "P66.F" "P82.T"
##
## $APV$BoxCox
## [1] "(Intercept)" "P12.A" "P84.A" "P84.C" "P58.E"
## [6] "P10.F" "P33.F" "P82.F" "P10.I" "P24.I"
## [11] "P32.I" "P46.I" "P64.I" "P82.I" "P10.L"
## [16] "P46.L" "P50.L" "P54.L" "P48.M" "P54.M"
## [21] "P90.M" "P37.N" "P63.P" "P69.Q" "P88.S"
## [26] "P91.S" "P20.T" "P43.T" "P54.T" "P10.V"
## [31] "P15.V" "P47.V" "P50.V" "P54.V" "P76.V"
## [36] "P84.V" "P37.Y" "P14.Z"
##
## $APV$BoxCox_Forward
## [1] "(Intercept)" "P90.M" "P46.I" "P88.S" "P33.F"
## [6] "P82.A" "P84.V" "P50.V" "P84.A" "P84.C"
## [11] "P82.F" "P47.V" "P10.F" "P10.I" "P50.L"
## [16] "P54.M" "P54.L" "P76.V" "P48.M" "P24.I"
## [21] "P54.T" "P20.T" "P22.V" "P10.V" "P46.L"
## [26] "P43.T" "P58.E" "P14.Z" "P10.L" "P66.I"
## [31] "P82.I" "P12.A" "P64.V" "P32.I" "P54.V"
## [36] "P71.L" "P19.Z" "P85.I" "P37.Y" "P67.Y"
## [41] "P72.Z" "P55.K" "P48.V" "P37.S" "P36.L"
## [46] "P89.I" "P19.Q" "P92.Q" "P63.P" "P64.I"
## [51] "P69.Q" "P37.N" "P39.P" "P16.A" "P93.L"
## [56] "P66.F" "P91.S" "P79.S" "P72.V" "P43.R"
## [61] "P35.D" "P15.V" "P14.X" "P77.I" "P69.H"
## [66] "P82.T" "P10.R" "P93.I" "P74.S" "P63.A"
## [71] "P70.R" "P14.R" "P73.T" "P37.C" "P12.N"
## [76] "P36.M" "P46.V"
##
##
## $ATV
## $ATV$p_value
## [1] "P79.A" "P82.A" "P58.E" "P70.E" "P11.I" "P32.I" "P46.I" "P71.I"
## [9] "P74.P" "P34.Q" "P20.T" "P43.T" "P48.V" "P76.V"
##
## $ATV$Foward
## [1] "(Intercept)" "P33.F" "P11.I" "P20.T" "P34.Q"
## [6] "P20.R" "P46.I" "P76.V" "P69.R" "P19.I"
## [11] "P32.I" "P74.P" "P71.I" "P79.A" "P48.V"
## [16] "P82.A" "P58.E" "P70.E" "P50.L" "P22.V"
## [21] "P43.T" "P54.V" "P67.E" "P19.T" "P33.I"
## [26] "P20.V" "P63.P" "P79.D" "P10.F" "P19.P"
## [31] "P70.R" "P47.V" "P48.M" "P73.T" "P37.D"
## [36] "P63.A" "P89.M" "P30.N" "P36.I" "P93.I"
## [41] "P46.L" "P92.K" "P73.C" "P57.R" "P57.K"
## [46] "P14.X" "P64.V" "P82.F"
##
## $ATV$BoxCox
## [1] "(Intercept)" "P12.A" "P16.A" "P79.D" "P88.D"
## [6] "P58.E" "P67.E" "P10.F" "P20.I" "P24.I"
## [11] "P32.I" "P46.I" "P77.I" "P50.L" "P48.M"
## [16] "P90.M" "P37.N" "P19.Q" "P73.S" "P88.S"
## [21] "P91.S" "P20.T" "P73.T" "P82.T" "P48.V"
## [26] "P54.V" "P71.V" "P76.V" "P84.V" "P89.V"
##
## $ATV$BoxCox_Forward
## [1] "(Intercept)" "P90.M" "P84.V" "P71.V" "P43.T"
## [6] "P48.V" "P10.F" "P24.I" "P50.L" "P88.S"
## [11] "P82.T" "P47.V" "P82.A" "P30.N" "P20.T"
## [16] "P76.V" "P20.R" "P33.I" "P48.M" "P54.V"
## [21] "P20.I" "P73.T" "P54.M" "P79.D" "P10.V"
## [26] "P13.V" "P64.V" "P19.Q" "P37.N" "P58.E"
## [31] "P46.I" "P16.A" "P88.D" "P36.L" "P54.L"
## [36] "P91.S" "P66.F" "P73.S" "P14.R" "P36.V"
## [41] "P37.E" "P10.I" "P32.I" "P12.A" "P35.D"
## [46] "P14.K" "P19.Z" "P72.I" "P71.A" "P89.M"
## [51] "P82.F" "P18.H" "P62.I" "P63.Q" "P57.R"
## [56] "P67.E" "P89.V" "P70.E" "P95.F" "P14.Z"
## [61] "P72.Z" "P71.T" "P77.I" "P70.R" "P39.S"
## [66] "P82.I" "P93.I" "P85.V"
##
##
## $IDV
## $IDV$p_value
## [1] "P22.A" "P54.A" "P84.A" "P67.E" "P69.H" "P20.I" "P72.I" "P85.I"
## [9] "P92.K" "P54.L" "P71.L" "P48.M" "P19.Q" "P57.R" "P69.R" "P54.S"
## [17] "P73.S" "P12.T" "P43.T" "P54.T" "P20.V" "P46.V" "P48.V" "P50.V"
## [25] "P72.V" "P76.V" "P89.V" "P67.Y"
##
## $IDV$Foward
## [1] "(Intercept)" "P69.R" "P54.T" "P54.A" "P54.S"
## [6] "P54.V" "P84.A" "P57.R" "P90.M" "P67.E"
## [11] "P85.I" "P73.S" "P43.T" "P19.Q" "P67.Y"
## [16] "P46.I" "P69.H" "P12.T" "P76.V" "P83.D"
## [21] "P48.V" "P20.I" "P50.V" "P19.P" "P91.S"
## [26] "P22.A" "P72.V" "P10.L" "P82.T" "P20.M"
## [31] "P24.I" "P32.I" "P48.M" "P22.V" "P92.K"
## [36] "P82.A" "P85.V" "P92.Q" "P71.L" "P54.L"
## [41] "P89.V" "P20.V" "P72.I" "P46.V" "P72.X"
## [46] "P12.P" "P61.E" "P33.I" "P16.E" "P66.L"
## [51] "P67.L" "P72.R" "P20.T" "P89.I" "P71.I"
## [56] "P82.F" "P63.T" "P19.T" "P77.V" "P15.I"
## [61] "P72.E" "P63.Q" "P65.D" "P89.L" "P69.K"
##
## $IDV$BoxCox
## [1] "(Intercept)" "P54.A" "P82.A" "P84.A" "P73.C"
## [6] "P84.C" "P72.E" "P10.F" "P82.F" "P18.H"
## [11] "P61.H" "P10.I" "P20.I" "P24.I" "P32.I"
## [16] "P33.I" "P46.I" "P69.K" "P10.L" "P36.L"
## [21] "P46.L" "P50.L" "P48.M" "P54.M" "P90.M"
## [26] "P37.N" "P63.P" "P37.S" "P54.S" "P73.S"
## [31] "P74.S" "P82.S" "P88.S" "P20.T" "P43.T"
## [36] "P54.T" "P71.T" "P73.T" "P82.T" "P48.V"
## [41] "P54.V" "P71.V" "P76.V" "P84.V" "P10.X"
## [46] "P67.Y" "P14.Z" "P19.Z" "P72.Z"
##
## $IDV$BoxCox_Forward
## [1] "(Intercept)" "P90.M" "P54.V" "P46.I" "P71.V"
## [6] "P10.I" "P50.L" "P54.T" "P54.S" "P10.F"
## [11] "P84.A" "P54.A" "P46.L" "P63.P" "P88.S"
## [16] "P20.I" "P43.T" "P24.I" "P48.V" "P82.F"
## [21] "P48.M" "P84.V" "P32.I" "P76.V" "P82.T"
## [26] "P73.S" "P73.T" "P82.A" "P84.C" "P20.T"
## [31] "P33.I" "P14.Z" "P36.L" "P61.H" "P15.V"
## [36] "P71.T" "P72.Z" "P67.Y" "P37.S" "P10.L"
## [41] "P82.S" "P73.C" "P69.K" "P20.R" "P55.K"
## [46] "P54.M" "P85.I" "P19.Z" "P82.I" "P12.P"
## [51] "P61.D" "P66.L" "P33.F" "P61.K" "P66.F"
## [56] "P88.D" "P71.A" "P10.X" "P74.S" "P18.L"
## [61] "P23.I" "P16.E" "P63.Q" "P72.R" "P67.E"
## [66] "P69.R" "P72.K" "P37.N" "P64.I" "P39.P"
## [71] "P19.L" "P70.R" "P72.E" "P74.P" "P71.L"
## [76] "P37.E" "P57.G" "P10.R" "P46.V" "P36.V"
## [81] "P37.A" "P57.K" "P18.H" "P22.V" "P22.A"
## [86] "P79.P" "P18.Q" "P70.E"
##
##
## $LPV
## $LPV$p_value
## [1] "P54.A" "P79.A" "P58.E" "P61.E" "P33.F" "P92.K" "P11.L" "P55.R"
## [9] "P69.R" "P54.S" "P91.S" "P43.T" "P47.V" "P48.V" "P50.V" "P54.V"
## [17] "P76.V" "P67.Y"
##
## $LPV$Foward
## [1] "(Intercept)" "P33.F" "P54.V" "P47.V" "P69.R"
## [6] "P54.A" "P11.L" "P54.S" "P50.V" "P91.S"
## [11] "P55.R" "P67.Y" "P76.V" "P43.T" "P48.V"
## [16] "P10.F" "P53.L" "P57.K" "P12.T" "P61.E"
## [21] "P66.F" "P92.K" "P43.K" "P58.E" "P67.E"
## [26] "P88.G" "P61.H" "P18.H" "P73.S" "P72.L"
## [31] "P66.L" "P23.I" "P73.C" "P34.Q" "P16.A"
## [36] "P58.Q" "P54.T" "P10.I" "P46.I" "P50.L"
## [41] "P19.Q" "P85.I" "P10.L" "P22.A" "P79.A"
## [46] "P89.V" "P64.L" "P69.K" "P37.D" "P84.C"
## [51] "P36.I" "P67.F" "P48.M" "P72.V"
##
## $LPV$BoxCox
## [1] "(Intercept)" "P16.A" "P54.A" "P82.A" "P84.A"
## [6] "P88.D" "P10.F" "P33.F" "P66.F" "P82.F"
## [11] "P10.I" "P24.I" "P33.I" "P46.I" "P10.L"
## [16] "P46.L" "P50.L" "P48.M" "P54.M" "P90.M"
## [21] "P30.N" "P73.S" "P82.S" "P20.T" "P54.T"
## [26] "P82.T" "P47.V" "P48.V" "P50.V" "P54.V"
## [31] "P76.V" "P84.V" "P67.Y"
##
## $LPV$BoxCox_Forward
## [1] "(Intercept)" "P54.V" "P46.I" "P33.F" "P71.V"
## [6] "P50.L" "P20.R" "P48.V" "P10.F" "P10.I"
## [11] "P47.V" "P82.F" "P93.L" "P48.M" "P46.L"
## [16] "P90.M" "P76.V" "P54.A" "P24.I" "P50.V"
## [21] "P84.V" "P82.A" "P67.Y" "P54.T" "P84.A"
## [26] "P82.T" "P82.S" "P11.I" "P20.T" "P54.S"
## [31] "P10.L" "P20.I" "P53.L" "P69.K" "P36.L"
## [36] "P33.I" "P16.A" "P61.H" "P54.M" "P54.L"
## [41] "P45.K" "P10.V" "P66.L" "P33.L" "P63.H"
## [46] "P61.E" "P61.D" "P55.K" "P35.G" "P19.Q"
## [51] "P39.S" "P84.C" "P62.V" "P19.L" "P88.D"
## [56] "P30.N" "P73.T" "P73.S" "P89.I" "P36.I"
## [61] "P69.Q" "P41.R" "P72.V" "P10.R" "P91.S"
## [66] "P60.D" "P79.D" "P20.M" "P14.R" "P88.S"
## [71] "P23.I" "P61.K" "P60.E" "P71.T" "P85.V"
## [76] "P72.K" "P37.N" "P66.F" "P37.S" "P18.H"
## [81] "P85.I" "P33.M" "P34.Q" "P72.R" "P63.L"
##
##
## $NFV
## $NFV$p_value
## [1] "P82.A" "P84.A" "P84.C" "P88.D" "P58.E" "P10.F" "P69.H" "P46.I"
## [9] "P33.M" "P48.M" "P90.M" "P30.N" "P12.Q" "P20.R" "P69.R" "P54.S"
## [17] "P20.T" "P54.T" "P82.T" "P36.V" "P46.V" "P54.V" "P76.V" "P67.Y"
##
## $NFV$Foward
## [1] "(Intercept)" "P84.A" "P88.D" "P90.M" "P84.C"
## [6] "P69.R" "P20.R" "P20.T" "P54.T" "P54.V"
## [11] "P30.N" "P54.S" "P12.Q" "P33.M" "P10.F"
## [16] "P67.Y" "P24.I" "P82.S" "P69.H" "P76.V"
## [21] "P58.E" "P36.V" "P47.V" "P54.A" "P73.S"
## [26] "P20.V" "P88.S" "P82.A" "P43.T" "P63.H"
## [31] "P63.E" "P71.A" "P46.V" "P20.M" "P36.I"
## [36] "P66.I" "P63.T" "P36.L" "P19.I" "P32.I"
## [41] "P82.I" "P82.T" "P46.I" "P48.M" "P70.E"
## [46] "P69.K" "P61.D" "P72.V" "P93.L" "P19.Q"
## [51] "P67.F" "P36.M" "P57.K" "P48.V" "P50.V"
## [56] "P46.L" "P57.G" "P61.E" "P37.H" "P53.L"
## [61] "P37.N" "P71.V" "P19.V" "P74.S" "P61.N"
##
## $NFV$BoxCox
## [1] "(Intercept)" "P54.A" "P82.A" "P84.A" "P84.C"
## [6] "P79.D" "P88.D" "P72.E" "P10.F" "P82.F"
## [11] "P10.I" "P20.I" "P24.I" "P33.I" "P36.I"
## [16] "P46.I" "P89.I" "P10.L" "P36.L" "P46.L"
## [21] "P50.L" "P48.M" "P54.M" "P90.M" "P30.N"
## [26] "P63.P" "P74.P" "P19.Q" "P10.R" "P54.S"
## [31] "P73.S" "P74.S" "P88.S" "P20.T" "P54.T"
## [36] "P71.T" "P72.T" "P73.T" "P36.V" "P48.V"
## [41] "P54.V" "P71.V" "P76.V" "P84.V" "P85.V"
## [46] "P37.Y" "P67.Y"
##
## $NFV$BoxCox_Forward
## [1] "(Intercept)" "P90.M" "P30.N" "P54.V" "P46.I"
## [6] "P84.C" "P84.A" "P54.T" "P88.S" "P54.S"
## [11] "P84.V" "P24.I" "P36.I" "P32.I" "P20.I"
## [16] "P50.L" "P10.I" "P10.F" "P63.P" "P74.S"
## [21] "P54.A" "P73.S" "P82.F" "P46.L" "P48.V"
## [26] "P48.M" "P36.L" "P73.T" "P71.V" "P88.D"
## [31] "P20.T" "P74.P" "P54.M" "P71.T" "P72.R"
## [36] "P36.V" "P67.Y" "P88.G" "P10.L" "P16.E"
## [41] "P58.E" "P14.Z" "P55.R" "P67.E" "P63.E"
## [46] "P82.A" "P19.Z" "P61.K" "P66.F" "P67.L"
## [51] "P33.I" "P10.R" "P77.I" "P12.T" "P79.D"
## [56] "P22.V" "P10.X" "P10.V" "P76.V" "P37.Y"
## [61] "P72.E" "P64.I" "P89.V" "P89.I" "P13.V"
## [66] "P19.Q" "P63.X" "P37.C" "P20.R" "P71.A"
## [71] "P60.D" "P72.T" "P69.K" "P12.K" "P72.K"
## [76] "P43.T" "P37.S" "P85.V" "P71.I" "P64.L"
## [81] "P77.V" "P62.V" "P23.I" "P14.R" "P57.G"
## [86] "P83.D" "P39.S" "P37.N" "P63.L" "P72.Z"
## [91] "P73.C" "P36.M" "P35.G" "P58.Q" "P19.L"
## [96] "P82.S" "P61.E"
##
##
## $RTV
## $RTV$p_value
## [1] "P54.A" "P84.A" "P84.C" "P61.D" "P58.E" "P67.E" "P68.E" "P33.F"
## [9] "P67.F" "P95.F" "P19.I" "P23.I" "P24.I" "P32.I" "P33.I" "P66.I"
## [17] "P72.K" "P36.L" "P53.L" "P90.M" "P19.P" "P34.Q" "P20.R" "P54.S"
## [25] "P74.S" "P91.S" "P20.T" "P43.T" "P54.T" "P36.V" "P47.V" "P50.V"
## [33] "P54.V" "P84.V"
##
## $RTV$Foward
## [1] "(Intercept)" "P54.V" "P33.F" "P32.I" "P54.A"
## [6] "P20.R" "P69.R" "P24.I" "P11.I" "P47.V"
## [11] "P34.Q" "P50.V" "P90.M" "P36.L" "P48.V"
## [16] "P72.V" "P73.S" "P43.T" "P72.R" "P82.T"
## [21] "P73.C" "P68.E" "P84.A" "P67.F" "P67.E"
## [26] "P58.E" "P53.L" "P91.S" "P20.T" "P54.S"
## [31] "P54.T" "P74.S" "P23.I" "P95.F" "P82.F"
## [36] "P33.I" "P30.N" "P76.V" "P84.C" "P33.L"
## [41] "P84.V" "P20.K" "P36.V" "P10.L" "P72.K"
## [46] "P57.K" "P54.L" "P19.P" "P73.T" "P66.I"
## [51] "P66.V" "P74.P" "P69.K" "P79.P" "P79.S"
## [56] "P60.E" "P61.D" "P69.H" "P69.Y" "P70.E"
## [61] "P16.E" "P37.H" "P19.Z" "P70.K" "P71.T"
## [66] "P37.D" "P93.L" "P36.M" "P61.H" "P19.L"
## [71] "P19.I" "P12.Q" "P89.M" "P12.S" "P37.A"
## [76] "P15.I" "P58.Q"
##
## $RTV$BoxCox
## [1] "(Intercept)" "P54.A" "P82.A" "P84.A" "P84.C"
## [6] "P88.D" "P58.E" "P33.F" "P82.F" "P10.I"
## [11] "P24.I" "P36.I" "P46.I" "P64.I" "P82.I"
## [16] "P10.L" "P36.L" "P50.L" "P54.L" "P93.L"
## [21] "P54.M" "P90.M" "P30.N" "P37.N" "P63.P"
## [26] "P10.R" "P20.R" "P54.S" "P73.S" "P82.S"
## [31] "P20.T" "P54.T" "P82.T" "P36.V" "P47.V"
## [36] "P48.V" "P50.V" "P54.V" "P71.V" "P84.V"
## [41] "P67.Y"
##
## $RTV$BoxCox_Forward
## [1] "(Intercept)" "P54.V" "P90.M" "P82.A" "P84.V"
## [6] "P46.I" "P82.T" "P84.A" "P24.I" "P82.F"
## [11] "P33.F" "P50.L" "P36.I" "P50.V" "P54.T"
## [16] "P84.C" "P32.I" "P54.S" "P82.S" "P54.A"
## [21] "P67.Y" "P58.E" "P73.S" "P36.L" "P53.L"
## [26] "P64.I" "P54.M" "P54.L" "P71.T" "P71.L"
## [31] "P14.Z" "P20.T" "P82.I" "P63.P" "P10.I"
## [36] "P93.L" "P47.V" "P36.V" "P10.L" "P10.R"
## [41] "P71.V" "P48.V" "P20.R" "P19.Z" "P63.H"
## [46] "P79.P" "P15.V" "P46.L" "P74.S" "P37.N"
## [51] "P72.Z" "P12.N" "P72.K" "P18.L" "P67.E"
## [56] "P89.V" "P33.I" "P89.I" "P19.Q" "P73.T"
## [61] "P73.C" "P46.V" "P67.L" "P16.E" "P36.M"
## [66] "P57.G" "P12.T" "P67.F" "P88.D" "P30.N"
## [71] "P11.I" "P74.A" "P72.R" "P83.D" "P72.V"
## [76] "P35.E" "P74.P" "P19.I" "P20.I" "P35.D"
## [81] "P33.V" "P61.K" "P93.M"
##
##
## $SQV
## $SQV$p_value
## [1] "P22.A" "P54.A" "P73.A" "P82.A" "P84.A" "P84.C" "P83.D" "P88.D"
## [9] "P10.F" "P67.F" "P69.H" "P71.I" "P19.L" "P53.L" "P48.M" "P19.P"
## [17] "P54.S" "P73.S" "P91.S" "P43.T" "P54.T" "P82.T" "P22.V" "P48.V"
## [25] "P54.V" "P76.V" "P84.V" "P72.X"
##
## $SQV$Foward
## [1] "(Intercept)" "P48.V" "P84.V" "P48.M" "P84.A"
## [6] "P20.R" "P90.M" "P67.F" "P22.V" "P53.L"
## [11] "P84.C" "P54.T" "P43.T" "P22.A" "P54.S"
## [16] "P54.V" "P54.A" "P72.X" "P73.A" "P20.T"
## [21] "P67.E" "P88.D" "P73.S" "P71.I" "P82.A"
## [26] "P82.T" "P91.S" "P66.L" "P67.Y" "P76.V"
## [31] "P63.P" "P83.D" "P71.L" "P57.K" "P54.L"
## [36] "P36.L" "P19.T" "P72.T" "P63.H" "P70.E"
## [41] "P61.H" "P82.F" "P67.S" "P19.L" "P19.P"
## [46] "P73.T" "P32.I" "P82.S" "P24.I" "P20.V"
## [51] "P69.H" "P36.I" "P54.M" "P72.E" "P69.K"
## [56] "P64.V" "P57.G"
##
## $SQV$BoxCox
## [1] "(Intercept)" "P54.A" "P73.A" "P82.A" "P84.A"
## [6] "P84.C" "P88.D" "P58.E" "P67.E" "P10.F"
## [11] "P82.F" "P57.G" "P61.H" "P10.I" "P20.I"
## [16] "P24.I" "P71.I" "P10.L" "P36.L" "P50.L"
## [21] "P53.L" "P93.L" "P48.M" "P90.M" "P37.N"
## [26] "P63.P" "P74.P" "P55.R" "P54.S" "P73.S"
## [31] "P74.S" "P88.S" "P91.S" "P54.T" "P73.T"
## [36] "P10.V" "P15.V" "P48.V" "P54.V" "P71.V"
## [41] "P76.V" "P84.V" "P67.Y"
##
## $SQV$BoxCox_Forward
## [1] "(Intercept)" "P90.M" "P84.V" "P48.V" "P84.A"
## [6] "P24.I" "P54.V" "P84.C" "P88.D" "P48.M"
## [11] "P53.L" "P71.V" "P36.I" "P88.S" "P82.F"
## [16] "P74.S" "P73.S" "P50.L" "P10.F" "P10.I"
## [21] "P67.E" "P10.L" "P54.A" "P54.S" "P54.T"
## [26] "P20.I" "P76.V" "P10.V" "P82.A" "P58.E"
## [31] "P73.T" "P67.Y" "P74.P" "P55.R" "P61.H"
## [36] "P73.A" "P63.P" "P15.V" "P36.L" "P35.G"
## [41] "P57.G" "P91.S" "P12.P" "P71.I" "P93.L"
## [46] "P12.N" "P67.F" "P70.R" "P69.K" "P20.R"
## [51] "P85.I" "P64.L" "P72.R" "P37.N" "P22.A"
## [56] "P54.L" "P63.H" "P43.R" "P14.R" "P71.A"
## [61] "P71.T" "P54.M" "P60.D" "P50.V" "P19.Z"
## [66] "P39.S" "P18.L" "P83.D" "P82.T" "P43.K"
## [71] "P12.A" "P12.Z" "P72.E" "P72.K" "P20.V"
## [76] "P19.L" "P37.S" "P19.Q" "P63.L" "P37.E"
## [81] "P74.A" "P64.I" "P11.I" "P70.E" "P60.E"
In this case, we are fortunate enough to have a “ground truth” obtained by another experiment (data saved as tsm_df
). Using this, we can evaluate the selected results. Note that we only need to compare the position of the mutations, not the mutation type. This is because it is known that multiple mutations at the same protease or RT position can often be associated with related drug-resistance outcomes.
knitr::opts_chunk$set(echo=FALSE, warning=FALSE, message=FALSE)
library("stringr"); library("purrr")
# Evaluate the result by comparing it to the ground true
length(results)
length(results[[1]])
rm_intercept <- function(results){ #remove intercept to compare the position
for(i in c(1:7)){
for(j in c(1:4)){
if(results[[i]][j][1]=="(Intercept)"){
results[[i]][j] <- results[[i]][j][-1]
}
}
}
return(results)
}
results<-rm_intercept(results)
results
compare <- function(results){
ans <- lapply(results,function(x) as.numeric(sort(unique(str_sub(x,2,3))))) # remove possible duplicates
tandf <- map(ans,~.x %in% tsm_df$Position)
td <- map_dbl(tandf, function(x) sum(x) / length(x))
dr <- map_dbl(tandf, function(x) sum(x) / nrow(tsm_df))
fdr <- map_dbl(tandf, function(x) sum(!x) / nrow(tsm_df))
stat <- data.frame(stat=c("true_discoveries","discover_rate","false_discoveries_rate"))
evaluation <- as_tibble(cbind(stat, rbind(td, dr, fdr)))
return(evaluation)
}
evaluation <- map(results, compare)
## Warning in FUN(X[[i]], ...): NAs introduced by coercion
## Warning in FUN(X[[i]], ...): NAs introduced by coercion
## Warning in FUN(X[[i]], ...): NAs introduced by coercion
## Warning in FUN(X[[i]], ...): NAs introduced by coercion
## Warning in FUN(X[[i]], ...): NAs introduced by coercion
## Warning in FUN(X[[i]], ...): NAs introduced by coercion
## Warning in FUN(X[[i]], ...): NAs introduced by coercion
## Warning in FUN(X[[i]], ...): NAs introduced by coercion
## Warning in FUN(X[[i]], ...): NAs introduced by coercion
## Warning in FUN(X[[i]], ...): NAs introduced by coercion
## Warning in FUN(X[[i]], ...): NAs introduced by coercion
## Warning in FUN(X[[i]], ...): NAs introduced by coercion
## Warning in FUN(X[[i]], ...): NAs introduced by coercion
## Warning in FUN(X[[i]], ...): NAs introduced by coercion
## Warning in FUN(X[[i]], ...): NAs introduced by coercion
## Warning in FUN(X[[i]], ...): NAs introduced by coercion
## Warning in FUN(X[[i]], ...): NAs introduced by coercion
## Warning in FUN(X[[i]], ...): NAs introduced by coercion
## Warning in FUN(X[[i]], ...): NAs introduced by coercion
## Warning in FUN(X[[i]], ...): NAs introduced by coercion
## Warning in FUN(X[[i]], ...): NAs introduced by coercion
## $APV
## # A tibble: 3 x 5
## stat p_value Foward BoxCox BoxCox_Forward
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 true_discoveries 0.8 0.611 0.654 0.609
## 2 discover_rate 0.471 0.647 0.5 0.824
## 3 false_discoveries_rate 0.118 0.412 0.265 0.529
##
## $ATV
## # A tibble: 3 x 5
## stat p_value Foward BoxCox BoxCox_Forward
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 true_discoveries 0.929 0.657 0.731 0.587
## 2 discover_rate 0.382 0.676 0.559 0.794
## 3 false_discoveries_rate 0.0294 0.353 0.206 0.559
##
## $IDV
## # A tibble: 3 x 5
## stat p_value Foward BoxCox BoxCox_Forward
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 true_discoveries 0.7 0.583 0.655 0.571
## 2 discover_rate 0.412 0.618 0.559 0.706
## 3 false_discoveries_rate 0.176 0.441 0.294 0.529
##
## $LPV
## # A tibble: 3 x 5
## stat p_value Foward BoxCox BoxCox_Forward
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 true_discoveries 0.812 0.632 0.9 0.609
## 2 discover_rate 0.382 0.706 0.529 0.824
## 3 false_discoveries_rate 0.0882 0.412 0.0588 0.529
##
## $NFV
## # A tibble: 3 x 5
## stat p_value Foward BoxCox BoxCox_Forward
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 true_discoveries 0.824 0.667 0.778 0.583
## 2 discover_rate 0.412 0.706 0.618 0.824
## 3 false_discoveries_rate 0.0882 0.353 0.176 0.588
##
## $RTV
## # A tibble: 3 x 5
## stat p_value Foward BoxCox BoxCox_Forward
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 true_discoveries 0.75 0.636 0.75 0.610
## 2 discover_rate 0.529 0.824 0.529 0.735
## 3 false_discoveries_rate 0.176 0.471 0.176 0.471
##
## $SQV
## # A tibble: 3 x 5
## stat p_value Foward BoxCox BoxCox_Forward
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 true_discoveries 0.667 0.552 0.667 0.512
## 2 discover_rate 0.353 0.471 0.529 0.647
## 3 false_discoveries_rate 0.176 0.382 0.265 0.618
Rhee, Soo-Yon, W Jeffrey Fessel, Andrew R Zolopa, Leo Hurley, Tommy Liu, Jonathan Taylor, Dong Phuong Nguyen, et al. 2005. “HIV-1 Protease and Reverse-Transcriptase Mutations: correlations with Antiretroviral Therapy in Subtype B Isolates and Implications for Drug-Resistance Surveillance.” Journal of Infectious Diseases 192 (3). Oxford University Press: 456–65.
Rhee, Soo-Yon, Jonathan Taylor, Gauhar Wadhera, Asa Ben-Hur, Douglas L Brutlag, and Robert W Shafer. 2006. “Genotypic Predictors of Human Immunodeficiency Virus Type 1 Drug Resistance.” Proceedings of the National Academy of Sciences 103 (46). National Academy of Sciences: 17355–60.