SYS 6018 | Spring 2021 | University of Virginia


1 Introduction

In 2010, Haiti suffered a massive earthquake that crippled the entire country. As part of the rescue effort, aerial images were captured and their pixels gathered into the data set for this project. These pixels need to be analyzed for blue tarps, a signal that people at that location are in need of resources. Our goal is to build a model with the best possible predictions for identifying those blue tarps, which can then be used to allocate resources.

Note: this data set and problem are authentic, used by the team that actually built a classifier to direct international aid resources. In my master's program, we were given this data set as it was given to the original team and instructed to implement a series of models and calculate their performance. We were given no guidance on how to implement the models or how to judge performance. Some details have been removed at the request of the professor because this project will be used in future semesters. Reach out to jordan.hiatt11@gmail.com for the full report.

2 Training Data / EDA

2.1 Set-up

First, I loaded the data and did some digging on the different classes. A huge majority of the observations are not classified as blue tarps. We are not interested in further distinguishing the non-tarp classes at this point, so I collapsed them into a single class and saved the result in a new column, preserving the original labels in case they are useful down the road. After collapsing classes, blue tarps account for only about 3.3% as many observations as the "Other" class (roughly 3.2% of the data).

# load in csv
data <- read.csv('HaitiPixels.csv',colClasses=c("factor",rep("numeric",3))) 
# for help with colClasses and rep() 
# https://stackoverflow.com/questions/2805357/specifying-colclasses-in-the-read-csv

# gather general csv data
data %>% dim()
#> [1] 63241     4
data %>% filter(is.na(data))
#> [1] Class Red   Green Blue 
#> <0 rows> (or 0-length row.names)
# classes
data$Class %>% contrasts()
#>                  Rooftop Soil Various Non-Tarp Vegetation
#> Blue Tarp              0    0                0          0
#> Rooftop                1    0                0          0
#> Soil                   0    1                0          0
#> Various Non-Tarp       0    0                1          0
#> Vegetation             0    0                0          1
data %>% filter(Class=='BlueTarp') %>% dim() # 0 rows: the factor level is 'Blue Tarp' (with a space)
#> [1] 0 4
data %>% filter(Class=='Rooftop') %>% dim()
#> [1] 9903    4
data %>% filter(Class=='Soil') %>% dim()
#> [1] 20566     4
data %>% filter(Class=='Various Non-Tarp') %>% dim()
#> [1] 4744    4
data %>% filter(Class=='Vegetation') %>% dim()
#> [1] 26006     4
# Combine classes
data <- data %>% mutate(Class.cl = ifelse(Class=='Blue Tarp','BlueTarp','Other'))
data$Class.cl <- data$Class.cl %>% as_factor()
data$Class.cl %>% contrasts()
#>          BlueTarp
#> Other           0
#> BlueTarp        1
# what percentage of our data is 'BlueTarp'
bt <- data %>% filter(Class.cl=='BlueTarp') %>% dim()
ot <- data %>% filter(Class.cl!='BlueTarp') %>% dim()
bt[1]/ot[1] # 3.3%
#> [1] 0.03302896

I decided to create my own 10-fold split so I would have easy access to any data I wanted. This also gives me the advantage of using the same folds for every methodology, so the models can be compared against each other more reliably. Every model uses this same 10-fold split, including KNN and Elastic Net, which are fit through the caret package. The loop below isn't very efficient, so there is probably a better way (see the sketch after the fold counts below), but the upside is that it only has to run once and I can reuse it for most of my models. Because the three predictors represent different aspects of the color spectrum and are measured in the same units, the data is already fairly uniform, so I decided against any scaling, normalization, or parameter selection.

## set up 10-fold cross-validation
set.seed(42)
# First I get the number of items in each fold, then I get an index number
# cutoff for each fold if I were to number observations 1 to n
foldcount <- floor(dim(data)[1]/10) # this should give us the correct number of
#observations in each fold
# I need to make sure there is no top end or the last observation would be 
# missed due to rounding, so I stop at 9
cuttoff_list <- list()
for(i in 1:9){
  cuttoff_list <- cuttoff_list %>% append(foldcount*i)
}

# create function for assigning folds
# note: this loop is slow, and the first comparison uses a strict '<' while the
# rest use '<=', which is why fold 1 ends up one observation short (and fold 2
# one long) in the counts below
assign_fold <- function(id) {
  if (id<cuttoff_list[1]) {
    return(1)
  } else if (id<=cuttoff_list[2]) {
    return(2)
  } else if (id<=cuttoff_list[3]) {
    return(3)
  } else if (id<=cuttoff_list[4]) {
    return(4)
  } else if (id<=cuttoff_list[5]) {
    return(5)
  } else if (id<=cuttoff_list[6]) {
    return(6)
  } else if (id<=cuttoff_list[7]) {
    return(7)
  } else if (id<=cuttoff_list[8]) {
    return(8)
  } else if (id<=cuttoff_list[9]) {
    return(9)
  } else {
    return(10)
  }
}

set.seed(42)
# generate a random id, no repeated values, from 1 to the size of the data
data.kfold <- data %>% mutate(random_id = sample(1:dim(data)[1], 
              size = dim(data)[1], replace = FALSE), fold = 10) 
# sample without replacement:
# https://stackoverflow.com/questions/14864275/randomize-w-no-repeats-using-r

# run assign_fold to get fold assignment from random id, then save as column 
for(i in 1:dim(data.kfold)[1]){
  data.kfold[i,]$fold = assign_fold(data.kfold[i,]$random_id)
}

# need a row id to filter by in KNN section
data.kfold <- data.kfold %>% rowid_to_column()
data.kfold %>% filter(fold==1) %>% dim()
#> [1] 6323    8
data.kfold %>% filter(fold==2) %>% dim()
#> [1] 6325    8
data.kfold %>% filter(fold==3) %>% dim()
#> [1] 6324    8
data.kfold %>% filter(fold==4) %>% dim()
#> [1] 6324    8
data.kfold %>% filter(fold==5) %>% dim()
#> [1] 6324    8
data.kfold %>% filter(fold==6) %>% dim()
#> [1] 6324    8
data.kfold %>% filter(fold==7) %>% dim()
#> [1] 6324    8
data.kfold %>% filter(fold==8) %>% dim()
#> [1] 6324    8
data.kfold %>% filter(fold==9) %>% dim()
#> [1] 6324    8
data.kfold %>% filter(fold==10) %>% dim()
#> [1] 6325    8

These folds are essentially evenly split. Fold 1 has 6,323 observations and fold 2 has 6,325 instead of 6,324 each; this comes from the strict '<' used for the first cutoff in assign_fold(), which pushes the observation sitting exactly on that cutoff into fold 2 (fold 10 also has 6,325 because it absorbs the remainder after flooring). Since this is a shift of one observation out of thousands, I am not concerned with it.
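For reference, here is a more compact way to build an equivalent fold column. This is only a sketch of the idea mentioned above (it assumes dplyr is loaded and uses a different random draw, so it would not reproduce the exact folds used for the results in this report):

# sketch: shuffle fold labels 1..10 across the rows in one step
set.seed(42)
data.kfold.alt <- data %>%
  mutate(fold = sample(rep(1:10, length.out = n())))
data.kfold.alt %>% count(fold)   # ten folds of ~6,324 rows each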

3 Model Training

3.1 Logistic Regression

#> # A tibble: 5 x 3
#>    .threshold specificity sensitivity
#>         <dbl>       <dbl>       <dbl>
#> 1 -Inf              0               1
#> 2    2.22e-16       0               1
#> 3    9.36e-14       0.216           1
#> 4    9.37e-14       0.216           1
#> 5    9.37e-14       0.216           1
#> # A tibble: 1 x 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 roc_auc binary         0.998

#> # A tibble: 5 x 3
#>   .threshold recall precision
#>        <dbl>  <dbl>     <dbl>
#> 1     Inf    0              1
#> 2       1.00 0.0658         1
#> 3       1.00 0.0663         1
#> 4       1.00 0.0668         1
#> 5       1.00 0.0673         1
#> # A tibble: 1 x 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 pr_auc  binary         0.974

I calculated logistic regression, LDA, and QDA using the same basic code with minor variations. In each case, I created a function that loops through the ten folds, treating the current fold as the test set and the remaining nine as training. The function calculates predictions for the current fold and returns the predictions and actual values, which are accumulated in a data frame. The ROC curve is then computed from that data frame, so it is based on every observation's prediction made while that observation was held out as test data. These predictions are probabilities (not the logit of probabilities) because I used type = "response".
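As a rough sketch of that loop for the logistic model (names are illustrative; it assumes data.kfold from above and the yardstick package for the curves):

# sketch: out-of-fold predictions for logistic regression
library(yardstick)
cv_logistic <- function(df) {
  results <- list()
  for (k in 1:10) {
    train <- df %>% filter(fold != k)
    test  <- df %>% filter(fold == k)
    fit   <- glm(Class.cl ~ Red + Green + Blue, data = train, family = binomial)
    results[[k]] <- tibble(fold = k,
                           actual = test$Class.cl,
                           prediction = predict(fit, test, type = "response"))
  }
  bind_rows(results)
}
# logistic.cv <- cv_logistic(data.kfold)
# roc_curve(logistic.cv, actual, prediction, event_level = "second")
# roc_auc(logistic.cv, actual, prediction, event_level = "second")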

For logistic regression, the area under the curve is very close to 1 and the curve itself is very close to reaching the corner. It is clear a wide range of threshold values would be acceptable according to this ROC curve.

Because such a small fraction of the data is blue tarp, I am concerned that the high volume of easy correct predictions, where the model predicts that a pixel is not a blue tarp, might skew the picture. If there is a 97% chance that an observation is not a blue tarp, how can I be sure I am predicting the true values (blue tarps) accurately? The ROC curve plots the true positive rate against the false positive rate; with so many negatives, the false positive rate can look deceptively good. For this reason I decided to also calculate a precision-recall curve.

A precision-recall curve shows the true positives as a fraction of all actual positives (recall) versus the true positives as a fraction of all predicted positives (precision). The precision-recall AUC is also very strong, though slightly below the ROC AUC. The curve as plotted never drops below a precision of 0.95, which the data confirm. The plot suggests the model is very accurate: under most thresholds it captures all true positives, and the false positives are not a large proportion.

On both of the plotted curves, I have calculated and marked the ideal threshold value. However, the optimal threshold on the precision-recall curve (based on my criteria) fell too far in favor of recall, sacrificing precision. In the spirit of my criteria I relaxed the rule and hand picked a value that still prioritizes recall but has a reasonable precision value. See the threshold section below for details on the calculation.

3.2 LDA

#> # A tibble: 5 x 3
#>    .threshold specificity sensitivity
#>         <dbl>       <dbl>       <dbl>
#> 1 -Inf          0                   1
#> 2    1.92e-16   0                   1
#> 3    5.02e-16   0.0000163           1
#> 4    4.42e-15   0.0000327           1
#> 5    1.24e-14   0.0000490           1
#> # A tibble: 1 x 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 roc_auc binary         0.989

#> # A tibble: 5 x 3
#>   .threshold   recall precision
#>        <dbl>    <dbl>     <dbl>
#> 1     Inf    0                1
#> 2       1.00 0.000495         1
#> 3       1.00 0.000989         1
#> 4       1.00 0.00148          1
#> 5       1.00 0.00198          1
#> # A tibble: 1 x 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 pr_auc  binary         0.859

LDA follows the logistic regression calculation very closely. The only major difference is that the lda() function is used instead of glm(), which produces a different prediction object. With a minor adjustment the outcome is the same.
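Inside the fold loop the change amounts to something like the following (a sketch, assuming the same train/test split variables as above):

# sketch: LDA version of the per-fold fit and prediction
library(MASS)
fit  <- lda(Class.cl ~ Red + Green + Blue, data = train)
pred <- predict(fit, test)$posterior[, "BlueTarp"]  # posterior probability of the BlueTarp class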

The ROC curve is not as strong on LDA as it is on Logistic regression and the precision recall curve shows a sharp drop off in performance. Because I am concerned with the small number of true values in the data, I will weight the precision recall curve more heavily.

3.3 QDA

#> # A tibble: 5 x 3
#>    .threshold specificity sensitivity
#>         <dbl>       <dbl>       <dbl>
#> 1 -Inf          0                   1
#> 2    8.05e-33   0                   1
#> 3    3.63e-31   0.0000163           1
#> 4    5.81e-30   0.0000327           1
#> 5    3.02e-29   0.0000490           1
#> # A tibble: 1 x 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 roc_auc binary         0.998

#> # A tibble: 5 x 3
#>   .threshold   recall precision
#>        <dbl>    <dbl>     <dbl>
#> 1     Inf    0                1
#> 2       1.00 0.000495         1
#> 3       1.00 0.000989         1
#> 4       1.00 0.00148          1
#> 5       1.00 0.00198          1
#> # A tibble: 1 x 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 pr_auc  binary         0.965

The calculation of QDA is almost identical to LDA. The ROC results are closer to logistic regression, with only a slightly lower AUC. The precision-recall AUC outperforms LDA but trails logistic regression. Because the precision-recall and ROC results differ noticeably, I will give preference to precision-recall as stated above.

One major difference with the precision-recall curve is that my process for selecting the ideal threshold gives a value with very low precision. As with logistic regression, I hand picked another point (shown in red) that appeared better positioned based on visual inspection.

3.4 KNN

#> k-Nearest Neighbors 
#> 
#> 63241 samples
#>     3 predictor
#>     2 classes: 'Other', 'BlueTarp' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (10 fold) 
#> Summary of sample sizes: 6323, 6325, 6324, 6324, 6324, 6324, ... 
#> Resampling results across tuning parameters:
#> 
#>   k  Accuracy   Kappa    
#>   5  0.9961611  0.9377746
#>   7  0.9959502  0.9342853
#>   9  0.9958431  0.9324288
#> 
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was k = 5.
#>   k
#> 1 5
#> 5-nearest neighbor model
#> Training set outcome distribution:
#> 
#>    Other BlueTarp 
#>    61219     2022
#> # A tibble: 5 x 3
#>   .threshold specificity sensitivity
#>        <dbl>       <dbl>       <dbl>
#> 1  -Inf            0            1   
#> 2     0            0            1   
#> 3     0.0833       0.871        1.00
#> 4     0.0909       0.872        1.00
#> 5     0.111        0.872        1.00
#> # A tibble: 1 x 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 roc_auc binary         0.998

#> # A tibble: 5 x 3
#>   .threshold recall precision
#>        <dbl>  <dbl>     <dbl>
#> 1    Inf      0          1   
#> 2      1      0.992      1.00
#> 3      0.937  0.992      1.00
#> 4      0.937  0.993      1.00
#> 5      0.936  0.993      1.00
#> # A tibble: 1 x 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 pr_auc  binary          1.00

For KNN I used the caret package, which gives a lot of flexibility and lets me pass in my existing folds so this model can be compared with the others. The ROC and precision-recall curves appear to outperform even logistic regression.
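Roughly, the caret setup looks like the following. This is only a sketch of one way to reuse the hand-made folds (the index list, tuning grid, and object names are illustrative, not the exact call used for the results above):

# sketch: reusing the pre-made folds inside caret
library(caret)
train.index <- lapply(1:10, function(k) data.kfold$rowid[data.kfold$fold != k])
ctrl <- trainControl(method = "cv", number = 10, index = train.index,
                     classProbs = TRUE)
knn.fit <- train(Class.cl ~ Red + Green + Blue, data = data.kfold,
                 method = "knn", trControl = ctrl,
                 tuneGrid = data.frame(k = c(5, 7, 9)))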

Again I selected a handpicked threshold point, this time for the ROC curve. The results for KNN seem more polarized than for logistic regression, which may indicate overtraining.

3.4.1 Tuning Parameter k

#> 5-nearest neighbor model
#> Training set outcome distribution:
#> 
#>    Other BlueTarp 
#>    61219     2022
#>   k
#> 1 5
#>   k  Accuracy     Kappa   AccuracySD     KappaSD
#> 1 5 0.9961611 0.9377746 0.0002210399 0.004020758
#> 2 7 0.9959502 0.9342853 0.0002664929 0.004729823
#> 3 9 0.9958431 0.9324288 0.0003148741 0.005395091

The caret package runs the cross-validation over multiple values of k and selects the optimal model. The best model is the final model shown above, with a best tune of k = 5; this is the model I used to calculate the ROC and precision-recall curves. You can see from the table that this model had the best accuracy. Note that caret computes this accuracy using its default class assignment (a majority vote of the k neighbours, equivalent to a 0.5 probability cutoff for two classes), not the threshold I select later.

3.5 Penalized Logistic Regression (ElasticNet)

#> # A tibble: 48,393 x 3
#>     .threshold specificity sensitivity
#>          <dbl>       <dbl>       <dbl>
#>  1 -Inf           0                  1
#>  2    7.51e-13    0                  1
#>  3    8.48e-13    0.000495           1
#>  4    1.15e-12    0.000989           1
#>  5    1.20e-12    0.00148            1
#>  6    1.27e-12    0.00198            1
#>  7    1.45e-12    0.00247            1
#>  8    1.82e-12    0.00297            1
#>  9    1.93e-12    0.00346            1
#> 10    2.11e-12    0.00396            1
#> # … with 48,383 more rows
#> # A tibble: 1 x 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 roc_auc binary         0.998

#> # A tibble: 48,392 x 3
#>    .threshold   recall precision
#>         <dbl>    <dbl>     <dbl>
#>  1     Inf    0                1
#>  2       1    0.000359         1
#>  3       1.00 0.000833         1
#>  4       1.00 0.00108          1
#>  5       1.00 0.00137          1
#>  6       1.00 0.00170          1
#>  7       1.00 0.00191          1
#>  8       1.00 0.00211          1
#>  9       1.00 0.00232          1
#> 10       1.00 0.00247          1
#> # … with 48,382 more rows
#> # A tibble: 1 x 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 pr_auc  binary          1.00

Elastic Net is again fit through the caret package, and the results appear very strong.

3.5.1 Tuning Parameters

#> # A tibble: 10 x 5
#>    term         step estimate lambda dev.ratio
#>    <chr>       <dbl>    <dbl>  <dbl>     <dbl>
#>  1 (Intercept)     1    -3.41 0.0416 -1.69e-13
#>  2 (Intercept)     2    -3.66 0.0379  3.34e- 2
#>  3 (Intercept)     3    -3.90 0.0345  6.11e- 2
#>  4 (Intercept)     4    -4.13 0.0315  8.46e- 2
#>  5 (Intercept)     5    -4.35 0.0287  1.05e- 1
#>  6 (Intercept)     6    -4.57 0.0261  1.22e- 1
#>  7 (Intercept)     7    -4.79 0.0238  1.37e- 1
#>  8 (Intercept)     8    -5.00 0.0217  1.50e- 1
#>  9 (Intercept)     9    -5.21 0.0198  1.61e- 1
#> 10 (Intercept)    10    -5.41 0.0180  1.71e- 1
#>   alpha       lambda
#> 7     1 8.322155e-05
#>   alpha       lambda  Accuracy      Kappa   AccuracySD     KappaSD
#> 1  0.10 8.322155e-05 0.9935871 0.88582329 0.0004607613 0.009039671
#> 2  0.10 8.322155e-04 0.9873816 0.74791964 0.0008907367 0.020106389
#> 3  0.10 8.322155e-03 0.9686473 0.03658366 0.0008569820 0.035498941
#> 4  0.55 8.322155e-05 0.9941933 0.89820710 0.0004230630 0.008167783
#> 5  0.55 8.322155e-04 0.9893933 0.79578944 0.0008342802 0.017931401
#> 6  0.55 8.322155e-03 0.9706678 0.14606151 0.0017632151 0.078172744
#> 7  1.00 8.322155e-05 0.9950823 0.91620831 0.0002582712 0.005095955
#> 8  1.00 8.322155e-04 0.9931953 0.87765099 0.0005212525 0.010316629
#> 9  1.00 8.322155e-03 0.9803029 0.54542429 0.0017897003 0.054553473
#> [1] "Red"   "Green" "Blue"

Elastic net has two tuning parameters, alpha and lambda. Alpha controls the mix of penalty types (0 is ridge, 1 is lasso), while lambda controls the strength of the penalty. After spending some time trying to build my own tuning grid, I found a way to have the caret package search for the optimal alpha and lambda on its own. My assumption is that the ideal lambda is chosen using prediction accuracy; however, this was done without the threshold I select later, so there is probably room to optimize these tuning parameters further.
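The search itself can be as simple as the following sketch, where caret's default grid with tuneLength = 3 appears to be what produced the nine alpha/lambda combinations shown above (the trControl object is assumed to carry the same fold index sketched in the KNN section; names are illustrative):

# sketch: letting caret pick alpha and lambda from its default grid
elnet.fit <- train(Class.cl ~ Red + Green + Blue, data = data.kfold,
                   method = "glmnet", trControl = ctrl, tuneLength = 3)
elnet.fit$bestTune   # best alpha/lambda combination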

The end result was that an alpha of 1 was selected with a lambda of 8.322155e-05 (row 7 under results). Compared to the other lambdas and alphas considered, this combination had the highest accuracy. An alpha of 1 means the model behaves like lasso-penalized regression, suggesting that a predictor could be eliminated, but looking up the coefficient names, it appears that all three colors were kept.

3.6 Random Forest

The first step for my random forest model is finding a good range of tree counts for the tuning step. Rather than tuning the number of trees and the number of predictors at once on all of the training data, I first tried to narrow down a sensible range. I did this by pulling out a small subsection of the data and running it against different tree counts to get an idea of how many trees would be enough for tuning the model. I used my existing folds to create a temporary data set with only a few folds in training and a few in test. I did not cross-validate this step, as it is only for a rough estimate; to tune the final model I use the full training set.
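The screening step amounts to something like the following (a sketch; the fold choices and tree counts are illustrative and do not reproduce the exact values I screened):

# sketch: rough screen of tree counts on a small slice of the data
library(randomForest)
mini.train <- data.kfold %>% filter(fold %in% 1:2)
mini.test  <- data.kfold %>% filter(fold == 3)
for (nt in c(5, 10, 15, 25, 50)) {
  fit <- randomForest(Class.cl ~ Red + Green + Blue, data = mini.train, ntree = nt)
  acc <- mean(predict(fit, mini.test) == mini.test$Class.cl)
  print(paste0("trees: ", nt, "  accuracy: ", round(acc, 4)))
}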

It looks like after 15-25 trees I reach diminishing returns.

#>   mtry  Accuracy     Kappa   AccuracySD     KappaSD num_trees
#> 1    1 0.9955022 0.9253887 0.0004218980 0.007453250        15
#> 2    2 0.9956147 0.9276915 0.0004714528 0.008720681        15
#> 3    3 0.9955602 0.9271198 0.0003572042 0.006942250        15
#> 4    4 0.9955180 0.9265358 0.0003376965 0.006218307        15
#> 5    5 0.9953915 0.9243633 0.0003918444 0.007335122        15
#> 6    6 0.9955865 0.9278435 0.0003192610 0.005644371        15
#>   mtry  Accuracy     Kappa   AccuracySD     KappaSD num_trees
#> 1    2 0.9957658 0.9301793 0.0002903687 0.005664213        25
#> 2    6 0.9957306 0.9303791 0.0002142535 0.003863963        25
#> 3    5 0.9957130 0.9298727 0.0003951489 0.006898454        24
#> 4   10 0.9956990 0.9298576 0.0003386598 0.006035337        25
#> 5    2 0.9956849 0.9287639 0.0004502504 0.008445817        17
#> 6    4 0.9956779 0.9290254 0.0003741924 0.007221128        19
#> [1] 2

The best tune came in at the top of the tree range (25 trees), but the top six results span several different mtry and tree values with nearly identical accuracy, so I think the tree range is safe.

#> # A tibble: 28 x 3
#>    .threshold specificity sensitivity
#>         <dbl>       <dbl>       <dbl>
#>  1  -Inf            0           1    
#>  2     0            0           1    
#>  3     0.04         0.713       1.00 
#>  4     0.0800       0.785       1.00 
#>  5     0.12         0.836       1.00 
#>  6     0.16         0.862       1.00 
#>  7     0.200        0.878       1.00 
#>  8     0.24         0.892       0.999
#>  9     0.28         0.906       0.999
#> 10     0.320        0.915       0.999
#> # … with 18 more rows
#> # A tibble: 1 x 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 roc_auc binary         0.995

#> # A tibble: 27 x 3
#>    .threshold recall precision
#>         <dbl>  <dbl>     <dbl>
#>  1     Inf     0         1    
#>  2       1     0.992     1.00 
#>  3       0.96  0.995     1.00 
#>  4       0.92  0.996     1.00 
#>  5       0.88  0.996     0.999
#>  6       0.84  0.997     0.999
#>  7       0.8   0.997     0.999
#>  8       0.76  0.997     0.999
#>  9       0.72  0.998     0.999
#> 10       0.68  0.998     0.999
#> # … with 17 more rows
#> # A tibble: 1 x 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 pr_auc  binary          1.00

3.7 SVM

3.7.1 SVM Linear

svm.tune.linear$performances
#>   cost       error   dispersion
#> 1    5 0.003652691 0.0008078339
#> 2    7 0.003526192 0.0007890470
#> 3    9 0.003462941 0.0008643225
#> 4   11 0.003462941 0.0008513679
#> 5   13 0.003431313 0.0008302043

Initial passes showed that a cost of 0.01 had an error of about 0.12, while the better error rates came from costs of around 5 to 10, at roughly 0.003. As cost increased, error decreased, but past a cost of about 5 the improvement became small, so I used a cost range of 5-13 for tuning. A cost above 13 would likely give a slightly lower error, but I am wary of overtraining. Here, 13 gives the best error, while 9 and 11 tie just behind it. Because this looks like a plateau and the errors are so similar, I chose the lower value of 9 as the safer option against overtraining.
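The tuning call itself looks roughly like this (a sketch using e1071's tune(); the data object and cost grid are stated here only for illustration):

# sketch: tuning a linear SVM over the cost grid above (tune() cross-validates internally)
library(e1071)
svm.tune.linear <- tune(svm, Class.cl ~ Red + Green + Blue, data = data.kfold,
                        kernel = "linear",
                        ranges = list(cost = c(5, 7, 9, 11, 13)))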

#> # A tibble: 29,605 x 3
#>    .threshold specificity sensitivity
#>         <dbl>       <dbl>       <dbl>
#>  1 -Inf             0               1
#>  2    1.00e-7       0               1
#>  3    1.05e-7       0.378           1
#>  4    1.07e-7       0.378           1
#>  5    1.18e-7       0.379           1
#>  6    1.24e-7       0.379           1
#>  7    1.31e-7       0.380           1
#>  8    1.35e-7       0.380           1
#>  9    1.36e-7       0.381           1
#> 10    1.38e-7       0.381           1
#> # … with 29,595 more rows
#> # A tibble: 1 x 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 roc_auc binary         0.998

#> # A tibble: 29,604 x 3
#>    .threshold recall precision
#>         <dbl>  <dbl>     <dbl>
#>  1     Inf     0             1
#>  2       1.00  0.438         1
#>  3       1.00  0.438         1
#>  4       1.00  0.438         1
#>  5       1.00  0.438         1
#>  6       1.00  0.438         1
#>  7       1.00  0.438         1
#>  8       1.00  0.438         1
#>  9       1.00  0.438         1
#> 10       1.00  0.438         1
#> # … with 29,594 more rows
#> # A tibble: 1 x 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 pr_auc  binary          1.00

3.7.2 SVM Radial

svm.tune.rad$performances[order(svm.tune.rad$performances$error),] %>% head(5)
#>    cost gamma       error   dispersion
#> 24   15     7 0.002640696 0.0007267535
#> 20   18     6 0.002672319 0.0007471052
#> 25   18     7 0.002672324 0.0007050301
#> 23   13     7 0.002688130 0.0007454215
#> 15   18     5 0.002688132 0.0007710791

Similar to my approach for the linear kernel, I used my first few runs of the tuning function to choose five candidate values each for cost and gamma. Looking at these results, I have a reasonable spread of gamma and cost values near the top (if all the top results had been the largest cost and gamma, I would have expanded the range).
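The radial tuning call is the same idea with a gamma grid added (again a sketch; only the cost and gamma values appearing in the table above are known, the rest of the grid is illustrative):

# sketch: tuning the radial SVM over a cost/gamma grid
svm.tune.rad <- tune(svm, Class.cl ~ Red + Green + Blue, data = data.kfold,
                     kernel = "radial",
                     ranges = list(cost = c(9, 11, 13, 15, 18),
                                   gamma = c(3, 4, 5, 6, 7)))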

#> # A tibble: 42,962 x 3
#>    .threshold specificity sensitivity
#>         <dbl>       <dbl>       <dbl>
#>  1 -Inf            0                1
#>  2    1.00e-7      0                1
#>  3    1.02e-7      0.0104           1
#>  4    1.13e-7      0.0109           1
#>  5    1.22e-7      0.0114           1
#>  6    1.23e-7      0.0119           1
#>  7    1.25e-7      0.0124           1
#>  8    1.26e-7      0.0129           1
#>  9    1.31e-7      0.0134           1
#> 10    1.36e-7      0.0138           1
#> # … with 42,952 more rows
#> # A tibble: 1 x 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 roc_auc binary         0.998

#> # A tibble: 42,961 x 3
#>    .threshold recall precision
#>         <dbl>  <dbl>     <dbl>
#>  1     Inf     0             1
#>  2       1.00  0.113         1
#>  3       1.00  0.113         1
#>  4       1.00  0.113         1
#>  5       1.00  0.113         1
#>  6       1.00  0.113         1
#>  7       1.00  0.113         1
#>  8       1.00  0.113         1
#>  9       1.00  0.113         1
#> 10       1.00  0.113         1
#> # … with 42,951 more rows
#> # A tibble: 1 x 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 pr_auc  binary          1.00

3.8 Threshold Selection

Up until this point I have evaluated models broadly but have not selected any thresholds. Threshold selection requires making some assumptions. Because I have limited information about the resources available for supply deliveries and rescues, or about the priorities of the vulnerable people and the helpers on the ground, I have to make these assumptions on my own and recognize that, with information from people closer to the situation, they would change.

My hope is that this disaster received enough worldwide support that resources are abundant and that the key factor limiting the delivery of help is a lack of information. With this assumption in place, I will target a low threshold that leans toward classifying pixels as blue tarps when in doubt; in other words, a no-person-left-behind approach. This means I will shoot for the highest possible true positive rate (recall / sensitivity) as the first priority, while maximizing precision and specificity under that constraint.

To achieve this, I performed the same procedure for each model. On the ROC data, I filtered for a sensitivity of 1 (the maximum value) and then selected the threshold with the highest specificity from that list. On the precision-recall data, I filtered for a recall of 1 and then selected the threshold with the highest precision from that list.
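In code, the selection rule is just a filter and a max (a sketch; roc.data and pr.data stand in for the roc_curve() and pr_curve() outputs generated for each model):

# sketch: thresholds that keep sensitivity/recall at 1 while giving up the least
roc.thresh <- roc.data %>% filter(sensitivity == 1) %>% slice_max(specificity, n = 1)
pr.thresh  <- pr.data  %>% filter(recall == 1)      %>% slice_max(precision,  n = 1)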

I created two cross-validation tables (below): the first with calculations based on an ROC-derived threshold and the second based on a precision-recall-derived threshold. To calculate TPR, FPR, accuracy, and precision, I applied these thresholds to the predicted probabilities that a pixel is a blue tarp. For logistic regression I made sure the threshold was applied to the probability, not the logit.

For several of the models, the thresholds given by my computational method on the ROC or PR curve were pushed very far to one side, to the point where the chosen threshold appeared less useful. In addition to those values, I hand picked thresholds based on visual examination of the plots, looking for a better balance of sensitivity and specificity, or of precision and recall. I have included these as additional rows in the tables.

By including multiple thresholds for each model, I am hoping to cast a wider net and find an ideal threshold. My expectation is that the PR-based thresholds (particularly the hand picked values, where available) will perform best, because they weight the much rarer Blue Tarp data more heavily. I am also interested to see whether there is a broad trend in whether PR-based or ROC-based thresholds are more reliable.

4 Results (Cross-Validation)

CV Performance Table based on the threshold obtained using the ROC Curve

| Model | Tuning | AUROC | Threshold | Accuracy | TPR | FPR | Precision |
|---|---|---|---|---|---|---|---|
| Logistic | | 0.9984832 | 0.0006946 | 0.8260306 | 0.9995054 | 0.1796991 | 0.1551989 |
| LDA | | 0.9888577 | 0.0000381 | 0.8500340 | 0.9995054 | 0.1549029 | 0.1756780 |
| QDA | | 0.9981940 | 0.0000199 | 0.4279977 | 0.9995054 | 0.5908786 | 0.0529141 |
| KNN | k: 5 | 0.9981137 | -Inf | 0.0319729 | 1.0000000 | 1.0000000 | 0.0319729 |
| KNN - Hand Picked | k: 5 | 0.9981137 | 0.9217687 | 0.0040164 | 0.0148368 | 0.9963410 | 0.0004916 |
| Elastic Net | alpha: 0.1, lambda: 8.32215451789909e-05 | 0.9984862 | 0.0192558 | 0.0098986 | 0.3090999 | 0.9999837 | 0.0101062 |
| Random Forest | Num Params: 2, Num Trees: 25 | 0.9947785 | -Inf | 0.0319729 | 1.0000000 | 1.0000000 | 0.0319729 |
| Random Forest - Hand Picked | Num Params: 2, Num Trees: 25 | 0.9947785 | 0.2000000 | 0.0039373 | 0.1078140 | 0.9994936 | 0.0035501 |
| SVM Linear | Cost: 9 | 0.9979764 | 0.0016853 | 0.0100726 | 0.3145401 | 0.9999837 | 0.0102823 |
| SVM Radial | Cost: 15, Gamma: 7 | 0.9984229 | 0.0006169 | 0.0228175 | 0.7131553 | 0.9999837 | 0.0230131 |

Results using the thresholds calculated from the Precision Recall curves.

| Model | Tuning | AUROC | Threshold | Accuracy | TPR | FPR | Precision |
|---|---|---|---|---|---|---|---|
| Logistic | | 0.0006946 | 0.0006946 | 0.8260306 | 0.9995054 | 0.1796991 | 0.1551989 |
| Logistic - Hand Picked | | 0.0925338 | 0.0925338 | 0.9904967 | 0.9678536 | 0.0087555 | 0.7849980 |
| LDA | | 0.0000381 | 0.0000381 | 0.8500340 | 0.9995054 | 0.1549029 | 0.1756780 |
| QDA | | 0.0000199 | 0.0000199 | 0.4279977 | 0.9995054 | 0.5908786 | 0.0529141 |
| QDA - Hand Picked | | 0.0277838 | 0.0277838 | 0.9825746 | 0.9767557 | 0.0172332 | 0.6518152 |
| KNN | k: 5 | 0.0000000 | 0.0000000 | 0.0044908 | 0.1285856 | 0.9996080 | 0.0042307 |
| Elastic Net | alpha: 0.1, lambda: 8.32215451789909e-05 | 0.0192558 | 0.0192558 | 0.0098986 | 0.3090999 | 0.9999837 | 0.0101062 |
| Random Forest | Num Params: 2, Num Trees: 25 | 0.0000000 | 0.0000000 | 0.0092661 | 0.2873393 | 0.9999183 | 0.0094021 |
| SVM Linear | Cost: 9 | 0.0016853 | 0.0016853 | 0.0100726 | 0.3145401 | 0.9999837 | 0.0102823 |
| SVM Radial | Cost: 15, Gamma: 7 | 0.0006169 | 0.0006169 | 0.0228175 | 0.7131553 | 0.9999837 | 0.0230131 |

Both of these tables are filled in with a mix of metrics calculated in the model-creation sections and, where those numbers weren't readily available, code included in this section. To generate the metrics, I created an "evaluate_model" function that takes the actual values, the predicted probabilities, and a threshold, and computes TPR, FPR, accuracy, and precision. I then bring in the AUC, the threshold, and the tuning parameters. (In the PR-threshold tables the AUROC column appears to simply repeat the threshold rather than report an AUC, so it should be read with that in mind.) You can see that these thresholds are very small values, meaning that anything that could potentially be a blue tarp is classified as a blue tarp. This aligns with the stated assumptions and goals.
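The helper itself is not shown in this excerpt, but it amounts to something like the following sketch (column names mirror the tables; the exact implementation may differ):

# sketch of the evaluate_model() helper described above
evaluate_model <- function(model, threshold, prediction, actual) {
  pred.class <- ifelse(prediction >= threshold, "BlueTarp", "Other")
  tp <- sum(pred.class == "BlueTarp" & actual == "BlueTarp")
  fp <- sum(pred.class == "BlueTarp" & actual == "Other")
  tn <- sum(pred.class == "Other"    & actual == "Other")
  fn <- sum(pred.class == "Other"    & actual == "BlueTarp")
  tibble(Model = model,
         Threshold = threshold,
         Accuracy  = (tp + tn) / length(actual),
         TPR       = tp / (tp + fn),
         FPR       = fp / (fp + tn),
         Precision = tp / (tp + fp))
}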

5 Hold-out Data / EDA

The following code is used to bring in each of the space-delimited text files containing the hold-out data. Each file is saved as a separate data set. Image files are ignored.

#> # A tibble: 5 x 6
#>      ID    B1    B2    B3 Class.cl Source
#>   <dbl> <dbl> <dbl> <dbl> <chr>    <chr> 
#> 1     1   104    89    63 Other    1     
#> 2     2   101    80    60 Other    1     
#> 3     3   103    87    69 Other    1     
#> 4     4   107    93    72 Other    1     
#> 5     5   109    99    68 Other    1
#> # A tibble: 5 x 6
#>      B1    B2    B3 Class.cl Source    ID
#>   <dbl> <dbl> <dbl> <chr>    <chr>  <int>
#> 1    77    94    99 BlueTarp 2          1
#> 2    79    98   103 BlueTarp 2          2
#> 3    77    99   102 BlueTarp 2          3
#> 4    76    94    99 BlueTarp 2          4
#> 5    80    95   103 BlueTarp 2          5
# processing file 3
# load
data.holdout.3 <- read_delim('HoldOutData/orthovnir067_ROI_Blue_Tarps.txt',delim = " ",
                             col_names = TRUE, skip_empty_rows=TRUE, skip = 7, trim_ws = TRUE)
# fix column header alignment
colnms <- (colnames(data.holdout.3))[-c(1,5,7)]
colnames(data.holdout.3) <- unlist(colnms)
# remove unnecessary columns
data.holdout.3 <- data.holdout.3[,c(1,8:10)]
# add class column, and source column
data.holdout.3 <- data.holdout.3 %>% mutate(Class.cl = "BlueTarp", Source="3")
data.holdout.3 %>% head(5)
#> # A tibble: 5 x 6
#>      ID    B1    B2    B3 Class.cl Source
#>   <dbl> <dbl> <dbl> <dbl> <chr>    <chr> 
#> 1     1    77    94    99 BlueTarp 3     
#> 2     2    79    98   103 BlueTarp 3     
#> 3     3    77    99   102 BlueTarp 3     
#> 4     4    76    94    99 BlueTarp 3     
#> 5     5    80    95   103 BlueTarp 3
# processing file 4
# load
data.holdout.4 <- read_delim('HoldOutData/orthovnir067_ROI_NOT_Blue_Tarps.txt',
    delim = " ", col_names = TRUE, skip_empty_rows=TRUE, skip = 7, trim_ws = TRUE)
# fix column header alignment
colnms <- (colnames(data.holdout.4))[-c(1,5,7)]
colnames(data.holdout.4) <- unlist(colnms)
# remove unnecessary columns
data.holdout.4 <- data.holdout.4[,c(1,8:10)]
# add class column, and source column
data.holdout.4 <- data.holdout.4 %>% mutate(Class.cl = "Other", Source="4")
data.holdout.4 %>% head(5)
#> # A tibble: 5 x 6
#>      ID    B1    B2    B3 Class.cl Source
#>   <dbl> <dbl> <dbl> <dbl> <chr>    <chr> 
#> 1     1    68    62    58 Other    4     
#> 2     2    65    62    57 Other    4     
#> 3     3    65    62    57 Other    4     
#> 4     4    64    63    57 Other    4     
#> 5     5    72    64    59 Other    4
#> # A tibble: 5 x 6
#>      ID    B1    B2    B3 Class.cl Source
#>   <dbl> <dbl> <dbl> <dbl> <chr>    <chr> 
#> 1     1    88   102   123 BlueTarp 5     
#> 2     2    88   103   126 BlueTarp 5     
#> 3     3    88   103   126 BlueTarp 5     
#> 4     4    89   103   125 BlueTarp 5     
#> 5     5    90   104   126 BlueTarp 5
#> # A tibble: 5 x 6
#>      ID    B1    B2    B3 Class.cl Source
#>   <dbl> <dbl> <dbl> <dbl> <chr>    <chr> 
#> 1     1   116   138   120 Other    6     
#> 2     2   104   121   109 Other    6     
#> 3     3   106   125   109 Other    6     
#> 4     4   114   135   109 Other    6     
#> 5     5   111   133   107 Other    6
#> # A tibble: 5 x 6
#>      ID    B1    B2    B3 Class.cl Source
#>   <dbl> <dbl> <dbl> <dbl> <chr>    <chr> 
#> 1     1    91   100   132 BlueTarp 7     
#> 2     2    88   108   140 BlueTarp 7     
#> 3     3    85   100   130 BlueTarp 7     
#> 4     4    88   101   135 BlueTarp 7     
#> 5     5    91   104   141 BlueTarp 7
#> # A tibble: 5 x 6
#>      ID    B1    B2    B3 Class.cl Source
#>   <dbl> <dbl> <dbl> <dbl> <chr>    <chr> 
#> 1     1    83    75    65 Other    8     
#> 2     2    81    75    65 Other    8     
#> 3     3    84    78    66 Other    8     
#> 4     4    81    78    65 Other    8     
#> 5     5    80    79    65 Other    8

Next I will analyze the data sets to spot trends and differences.

bt <- 0
ot <- 0
for(i in 1:8){
  # print out values Class.cl values and number of observations
  d <- eval(parse(text = paste0("data.holdout.",i))) #https://stackoverflow.com/questions/9057006/getting-strings-recognized-as-variable-names-in-r
  print(paste0("Holdout: ",i))
  print(d$Class.cl %>% table)
  
  # sum and print totals of each class
  if(d$Class.cl[1] == "BlueTarp"){
    bt = bt + dim(d)[1]
  }else{
    ot = ot + dim(d)[1]
  }
}
#> [1] "Holdout: 1"
#> .
#>  Other 
#> 979278 
#> [1] "Holdout: 2"
#> .
#> BlueTarp 
#>     4446 
#> [1] "Holdout: 3"
#> .
#> BlueTarp 
#>     4446 
#> [1] "Holdout: 4"
#> .
#>  Other 
#> 305211 
#> [1] "Holdout: 5"
#> .
#> BlueTarp 
#>     6828 
#> [1] "Holdout: 6"
#> .
#>  Other 
#> 295510 
#> [1] "Holdout: 7"
#> .
#> BlueTarp 
#>     3206 
#> [1] "Holdout: 8"
#> .
#>  Other 
#> 409698
print(paste0("Blue tarp ", bt))
#> [1] "Blue tarp 18926"
print(paste0("Other ", ot))
#> [1] "Other 1989697"
print(paste0("Ratio of Blue Tarps ", bt/ot))
#> [1] "Ratio of Blue Tarps 0.00951200107353029"

I noticed that Holdout 2 and Holdout 3 have the same number of observations, and both reference region 067 blue tarps, which makes me think they are duplicate data in different formats. To be sure, I will compare the B1, B2, and B3 values.

# https://stackoverflow.com/questions/19119320/how-to-check-if-two-data-frames-are-equal
all.equal(data.holdout.2[,c(1,2,3)], data.holdout.3[,c(2,3,4)])
#> [1] "Attributes: < Names: 1 string mismatch >"                                    
#> [2] "Attributes: < Length mismatch: comparison on first 2 components >"           
#> [3] "Attributes: < Component 2: Modes: numeric, list >"                           
#> [4] "Attributes: < Component 2: Lengths: 4446, 5 >"                               
#> [5] "Attributes: < Component 2: names for current but not for target >"           
#> [6] "Attributes: < Component 2: Attributes: < target is NULL, current is list > >"
#> [7] "Attributes: < Component 2: target is numeric, current is tbl_df >"
df.temp <- data.holdout.2[,c(1,2,3)]-data.holdout.3[,c(2,3,4)]
df.temp %>% table()
#> , , B3 = 0
#> 
#>    B2
#> B1     0
#>   0 4446

"all.equal()" reports seven differences, but they are all attribute mismatches (column names and classes) rather than value differences. Subtracting the tables gives what appears to be a table full of 0's covering every row, so I can fairly confidently conclude this is duplicate data. I will ignore data.holdout.3 and combine the rest.

data.holdout <- rbind(data.holdout.1, data.holdout.2, data.holdout.4, 
                data.holdout.5, data.holdout.6, data.holdout.7, data.holdout.8)
data.holdout$Class.cl %>% table()
#> .
#> BlueTarp    Other 
#>    14480  1989697
b<-data.holdout %>% filter(Class.cl=="BlueTarp") %>% dim()
o<-data.holdout %>% filter(Class.cl=="Other") %>% dim()
b[1]/(b[1]+o[1])
#> [1] 0.007224911
o[1]/(b[1]+o[1])
#> [1] 0.9927751

It looks like there are even fewer blue tarps in the hold-out data (as a percentage) than there were in the training data.

It is reasonably clear that B1, B2, and B3 correspond to Red, Green, and Blue respectively. I have already spot checked some of these color codes to confirm that, read this way, the blue tarps do appear blue. As a further check, I will look at the aggregate values for the blue tarps to see if they are blue in color.

blue <- data.holdout %>% filter(Class.cl=="BlueTarp")
other <- data.holdout %>% filter(Class.cl=="Other")

blueAvg <- c(mean(blue$B1),mean(blue$B2),mean(blue$B3)) 
blueAvg
#> [1] 111.6773 133.3961 165.2613
otherAvg <- c(mean(other$B1),mean(other$B2),mean(other$B3)) 
otherAvg
#> [1] 118.3081 105.1734  81.7588
blueMed <- c(median(blue$B1),median(blue$B2),median(blue$B3)) 
blueMed
#> [1] 108 127 158
otherMed <- c(median(other$B1),median(other$B2),median(other$B3)) 
otherMed
#> [1] 107  91  66
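A quick way to eyeball these aggregates is to convert them to hex colors (a small aside, not part of the original analysis; the hex values in the comments are what the rounded means work out to):

# sketch: view the average colors as hex codes
rgb(floor(blueAvg[1]),  floor(blueAvg[2]),  floor(blueAvg[3]),  maxColorValue = 255)  # "#6F85A5", a muted blue
rgb(floor(otherAvg[1]), floor(otherAvg[2]), floor(otherAvg[3]), maxColorValue = 255)  # "#766951", a dark gold/brown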

Rounding down the averages, the blue tarps gave a mean and a median color that are considerably more blue than the Other group. Both the mean and median for the blue tarps correspond to a bluish gray, while the mean for Other is a dark gold and the median a dark brown. This lines up with my expectations, so I am confident assuming that B1, B2, and B3 are equivalent to Red, Green, and Blue. I will update the data frame accordingly and convert Class.cl to a factor.

data.holdout <- data.holdout %>% mutate(Red = B1, Green = B2, Blue = B3)
data.holdout$Class.cl <- data.holdout$Class.cl %>% as_factor()
data.holdout$Class.cl %>% contrasts()
#>          BlueTarp
#> Other           0
#> BlueTarp        1

6 Results (Hold-Out)

Next I will calculate predictions from each final model on the hold-out data set.

# create function for obtaining predictions
generate_predictions <- function(data) {
  # logistic
  logistic.predictions <- logistic.fit.final %>% predict(data, type='response')
  
  # lda
  lda.predictions <- lda.fit.final %>% predict(data, type='response')
  
  # qda
  qda.predictions <- qda.fit.final %>% predict(data, type='response')
  
  # knn
  knn.predictions <- knn.fit.final %>% predict(data, 'prob')

  # elnet
  elnet.predictions <- elnet.fit.final %>% predict(data, 'prob')
  
  # random forest
  rf.predictions <- rf.fit.final %>% predict(data, 'prob')
  
  # svm linear
  svm.linear.predictions <- svm.linear.fit.final %>% predict(data, probability=TRUE)
  
  # svm radial
  svm.radial.predictions <- svm.radial.fit.final %>% predict(data, probability=TRUE)
 
  return(tibble(actual=data$Class.cl,
                    logistic_prediction=(logistic.predictions),
                    lda_prediction=lda.predictions$posterior[,2],
                    qda_prediction = qda.predictions$posterior[,2],
                    knn_prediction = 1-knn.predictions$BlueTarp,
                    elnet_prediction = 1-elnet.predictions$BlueTarp,
                    rf_prediction = 1-rf.predictions$BlueTarp,
                    svm_linear_prediction=1-attr(svm.linear.predictions,"probabilities")[,2],
                    svm_radial_prediction=1-attr(svm.radial.predictions,"probabilities")[,2]))
}
data.holdout.results <- data.holdout %>% generate_predictions()

I will use these predictions, and the actual values to calculate my results.

# adding calculated parameters
df1 <- evaluate_model("Logisitic",logistic.thresh$.threshold,
                      data.holdout.results$logistic_prediction,
                      data.holdout.results$actual)
df2 <- evaluate_model("LDA",lda.thresh$.threshold,
                      data.holdout.results$lda_prediction,
                      data.holdout.results$actual)
df3 <- evaluate_model("QDA",qda.thresh$.threshold,
                      data.holdout.results$qda_prediction,
                      data.holdout.results$actual)
df4 <- evaluate_model("KNN",knn.thresh$.threshold,
                      data.holdout.results$knn_prediction,
                      data.holdout.results$actual)
df4.hp <- evaluate_model("KNN - Hand Picked",knn.thresh.hp$.threshold,
                         data.holdout.results$knn_prediction,
                         data.holdout.results$actual)
df5 <- evaluate_model("Elastic Net",elnet.thresh$.threshold,
                      data.holdout.results$elnet_prediction,
                      data.holdout.results$actual)
df6 <- evaluate_model("Random Forest",rf.thresh$.threshold,
                      data.holdout.results$rf_prediction,
                      data.holdout.results$actual)
df6.hp <- evaluate_model("Random Forest - Hand Picked",rf.thresh.hp$.threshold,
                         data.holdout.results$rf_prediction,
                         data.holdout.results$actual)
df7 <- evaluate_model("SVM Linear",svm.linear.thresh$.threshold,
                      data.holdout.results$svm_linear_prediction,
                      data.holdout.results$actual)
df8 <- evaluate_model("SVM Radial",svm.radial.thresh$.threshold,
                      data.holdout.results$svm_radial_prediction,
                      data.holdout.results$actual)

holdout_results <- rbind(df1, df2, df3, df4, df4.hp, df5, df6, df6.hp, df7, df8)
holdout_results
holdout_results <- holdout_results %>% mutate(Tuning = "", 
                                AUROC = 0) %>% column_to_rownames(var = "Model")


# adding AUC
model_types <- list("logistic","lda","qda","knn","knn",
                    "elnet","rf","rf","svm.linear","svm.radial")
models <- row.names(holdout_results)
for(i in 1:length(models)){
  st <- paste0(model_types[i],".auc$.estimate", sep='')
  auc <- eval(parse(text = st))
  holdout_results[models[i],]$AUROC <- auc
}

# adding tuning params
holdout_results["KNN",]$Tuning <- paste("k: ",knn.fit$results[1,1])
holdout_results["KNN - Hand Picked",]$Tuning <- paste("k: ",knn.fit$results[1,1])
holdout_results["Elastic Net",]$Tuning <- paste("alpha: ",
                    elnet.fit$results[1,1],", lambda: ",elnet.fit$results[1,2])
holdout_results["Random Forest",]$Tuning <- paste("Num Params: ",rf.tune.top$mtry, 
                                          ", Num Trees: ", rf.tune.top$num_trees)
holdout_results["Random Forest - Hand Picked",]$Tuning <- paste("Num Params: ",rf.tune.top$mtry, 
                                          ", Num Trees: ", rf.tune.top$num_trees)
holdout_results["SVM Linear",]$Tuning <- "Cost: 9"
holdout_results["SVM Radial",]$Tuning <- "Cost: 15, Gamma: 7"

holdout_results <- holdout_results %>% dplyr::select(Tuning, AUROC, Threshold, 
                                                  Accuracy, TPR, FPR, Precision)
holdout_results
# adding calculated parameters
pr.df1 <- evaluate_model("Logisitic",logistic.thresh.pr$.threshold,
                data.holdout.results$logistic_prediction,
                data.holdout.results$actual)
pr.df1.hp <- evaluate_model("Logisitic - Hand Picked",
                logistic.thresh.pr.hp$.threshold,
                data.holdout.results$logistic_prediction,
                data.holdout.results$actual)
pr.df2 <- evaluate_model("LDA",
                lda.thresh.pr$.threshold,
                data.holdout.results$lda_prediction,
                data.holdout.results$actual)
pr.df3 <- evaluate_model("QDA",
                qda.thresh.pr$.threshold,
                data.holdout.results$qda_prediction,
                data.holdout.results$actual)
pr.df3.hp <- evaluate_model("QDA - Hand Picked",
                qda.thresh.pr.hp$.threshold,
                data.holdout.results$qda_prediction,
                data.holdout.results$actual)
pr.df4 <- evaluate_model("KNN",
                knn.thresh.pr$.threshold,
                data.holdout.results$knn_prediction,
                data.holdout.results$actual)
pr.df5 <- evaluate_model("Elastic Net",
                elnet.thresh.pr$.threshold,
                data.holdout.results$elnet_prediction,
                data.holdout.results$actual)
pr.df6 <- evaluate_model("Random Forest",
                rf.thresh.pr$.threshold,
                data.holdout.results$rf_prediction,
                data.holdout.results$actual)
pr.df7 <- evaluate_model("SVM Linear",
                svm.linear.thresh.pr$.threshold,
                data.holdout.results$svm_linear_prediction,
                data.holdout.results$actual)
pr.df8 <- evaluate_model("SVM Radial",
                svm.radial.thresh.pr$.threshold,
                data.holdout.results$svm_radial_prediction,
                data.holdout.results$actual)

holdout_results_pr <- rbind(pr.df1, pr.df1.hp, pr.df2, pr.df3, pr.df3.hp, pr.df4,
                            pr.df5, pr.df6, pr.df7, pr.df8)
holdout_results_pr
holdout_results_pr <- holdout_results_pr %>% mutate(Tuning = "", AUROC = 0) %>% 
      column_to_rownames(var = "Model")


# adding AUC
model_types <- list("logistic","logistic","lda","qda","qda","knn",
                    "elnet","rf","svm.linear","svm.radial")
models <- row.names(holdout_results_pr)
for(i in 1:length(models)){
  st <- paste0(model_types[i],".pr$.estimate", sep='')
  auc <- eval(parse(text = st))
  holdout_results_pr[models[i],]$AUROC <- auc
}
#> Warning: Unknown or uninitialised column: `.estimate`.
# adding tuning params
holdout_results_pr["KNN",]$Tuning <- paste("k: ",knn.fit$results[1,1])
holdout_results_pr["Elastic Net",]$Tuning <- paste("alpha: ",
                    elnet.fit$results[1,1],", lambda: ",elnet.fit$results[1,2])
holdout_results_pr["Random Forest",]$Tuning <- paste("Num Params: ",rf.tune.top$mtry, 
                                          ", Num Trees: ", rf.tune.top$num_trees)
holdout_results_pr["SVM Linear",]$Tuning <- "Cost: 9"
holdout_results_pr["SVM Radial",]$Tuning <- "Cost: 15, Gamma: 7"

holdout_results_pr <- holdout_results_pr %>% dplyr::select(Tuning, AUROC, 
                                        Threshold, Accuracy, TPR, FPR, Precision)
holdout_results_pr

Results using the thresholds selected from ROC curves.

| Model | Tuning | AUROC | Threshold | Accuracy | TPR | FPR | Precision |
|---|---|---|---|---|---|---|---|
| Logistic | | 0.9984832 | 0.0006946 | 0.7656943 | 1.0000000 | 0.2360108 | 0.0299130 |
| LDA | | 0.9888577 | 0.0000381 | 0.9191728 | 0.9998619 | 0.0814144 | 0.0820432 |
| QDA | | 0.9981940 | 0.0000199 | 0.1858997 | 0.9998619 | 0.8200239 | 0.0087955 |
| KNN | k: 5 | 0.9981137 | -Inf | 0.0072249 | 1.0000000 | 1.0000000 | 0.0072249 |
| KNN - Hand Picked | k: 5 | 0.9981137 | 0.9217687 | 0.0132149 | 0.1003453 | 0.9874192 | 0.0007390 |
| Elastic Net | alpha: 0.1, lambda: 8.32215451789909e-05 | 0.9984862 | 0.0192558 | 0.0011371 | 0.0718923 | 0.9993778 | 0.0005232 |
| Random Forest | Num Params: 2, Num Trees: 25 | 0.9947785 | -Inf | 0.0072249 | 1.0000000 | 1.0000000 | 0.0072249 |
| Random Forest - Hand Picked | Num Params: 2, Num Trees: 25 | 0.9947785 | 0.2000000 | 0.0047695 | 0.4504144 | 0.9984736 | 0.0032722 |
| SVM Linear | Cost: 9 | 0.9979764 | 0.0016853 | 0.0011162 | 0.0656768 | 0.9993537 | 0.0004780 |
| SVM Radial | Cost: 15, Gamma: 7 | 0.9984229 | 0.0006169 | 0.0078526 | 0.9236188 | 0.9988119 | 0.0066846 |

Results using the thresholds calculated from the Precision Recall curves.

| Model | Tuning | AUROC | Threshold | Accuracy | TPR | FPR | Precision |
|---|---|---|---|---|---|---|---|
| Logistic | | 0.0006946 | 0.0006946 | 0.7656943 | 1.0000000 | 0.2360108 | 0.0299130 |
| Logistic - Hand Picked | | 0.0925338 | 0.0925338 | 0.9251129 | 0.9998619 | 0.0754311 | 0.0879785 |
| LDA | | 0.0000381 | 0.0000381 | 0.9191728 | 0.9998619 | 0.0814144 | 0.0820432 |
| QDA | | 0.0000199 | 0.0000199 | 0.1858997 | 0.9998619 | 0.8200239 | 0.0087955 |
| QDA - Hand Picked | | 0.0277838 | 0.0277838 | 0.9713648 | 0.9056630 | 0.0281571 | 0.1896786 |
| KNN | k: 5 | 0.0000000 | 0.0000000 | 0.0049711 | 0.3184392 | 0.9973101 | 0.0023183 |
| Elastic Net | alpha: 0.1, lambda: 8.32215451789909e-05 | 0.0192558 | 0.0192558 | 0.0011371 | 0.0718923 | 0.9993778 | 0.0005232 |
| Random Forest | Num Params: 2, Num Trees: 25 | 0.0000000 | 0.0000000 | 0.0047755 | 0.6327348 | 0.9997944 | 0.0045846 |
| SVM Linear | Cost: 9 | 0.0016853 | 0.0016853 | 0.0011162 | 0.0656768 | 0.9993537 | 0.0004780 |
| SVM Radial | Cost: 15, Gamma: 7 | 0.0006169 | 0.0006169 | 0.0078526 | 0.9236188 | 0.9988119 | 0.0066846 |

7 Final Conclusions

7.0.1 Conclusion #1 - Model Selection

When selecting a model, my most important criteria are true positive rate, accuracy, and precision, generally in that order. Where applicable, I will pick the more favorable of the hand-picked and automated thresholds, and I will consider the ROC and PR results together when making a determination. While in the end I have to select one threshold, strong performance in both tables with both thresholds gives me an impression of robustness that is less likely to be explained by overtraining.

From the training results, there are a lot of promising models. Logistic regression and LDA perform very well with both the ROC and PR thresholds. Precision is low across all models, which is expected since it was not prioritized; logistic regression and LDA still post some of the highest precision values in both tables. QDA also performs well with a hand-picked threshold in the PR table, though less well in the ROC table, and notably has one of the top precision values in the PR table. Overall, logistic regression with a hand-picked threshold has the top numbers, and this is the model I would select based on the training data, specifically the hand-picked threshold from the PR analysis.

From the hold-out data, accuracy takes a considerable hit for most models, though TPR remains very strong almost across the board. Precision levels collapse compared to the training data. In both tables, logistic regression, LDA, and QDA are the clear front runners, with QDA lagging considerably in the ROC table but ahead of all models in the PR table. In this case, I would select QDA with the hand-picked threshold from the PR table, primarily because of its much higher precision.

7.0.2 Conclusion #2 - Justifying findings and the selected models

Finding that QDA is the highest-performing model, when it was not the model selected on the training data, is somewhat unexpected but not entirely surprising. QDA, while not linear, is much closer in spirit to the top two training models than to the other methods, and it was already a standout (if not top) performer on the training data.

Probably the best explanation, however, is the particular threshold picked for QDA. The thresholds chosen computationally from the training data were very aggressive, classifying as many observations as possible as blue tarps and therefore producing very low precision. For QDA, the hand-picked threshold sat toward the middle of a very healthy PR curve. Selecting a point in the middle of the curve may have given the model an edge on new data with a different makeup, though more research would be needed to establish a concrete cause. QDA's performance on the training data was good enough that it may simply have lucked out on the hold-out data.

7.0.3 Conclusion #3 - Relying on this Model and Threshold

For this model I selected a threshold assuming unlimited resources. In reality there will be constraints on rescue resources, so it is important, when setting objectives, to determine what reasonable true positive, false positive, and precision rates would look like. The selected model had a hold-out precision of roughly 19%. Is it reasonable to send a helicopter to check five potential sites and find only one with an actual blue tarp? On the other hand, do we have the ability to search all day just to find the one blue tarp we would have missed without the algorithm?

I do not recommend allocating all resources based on a data model. Ideally a fraction (even a substantial fraction) of search resources would go to areas flagged by this model, with the rest directed by more traditional information gathering such as word of mouth and following leads. That might allow a higher precision target and more efficient use of resources. However, precision was vastly overestimated on the training data, so targeting a specific precision rate would be very risky.

7.0.4 Conclusion #4 - Precision Recall Data

The precision-recall plots yield very interesting data. These curves are often weaker than the ROC curves, which makes it easier to see the trade-off between recall and precision. Interestingly, selecting the highest TPR (recall/sensitivity) and then the highest precision or specificity led to the same threshold in most cases. The main advantage of the precision-recall curve over the ROC curve was therefore the ability to visually identify a better middle point and hand pick an optimal threshold; in both cross-validation and the hold-out data, the hand-picked thresholds performed best. Scoring well on both curves also appeared to indicate that a model would hold up on the hold-out data.

7.0.5 Conclusion #5 - Small Thresholds

I found it astonishing how low the selected thresholds were. In a training data set where only about 3% of observations belong to the target class, the hand-picked thresholds classified an observation as a blue tarp when there was only roughly a 9% or 3% chance of it being one. These thresholds also ended up more accurate than the computed thresholds, which were often below 0.1%. I find it fascinating that a lopsided data set can give such lopsided thresholds, and this may help explain the sheer drop-off in precision when those thresholds were applied to the hold-out data.

7.0.6 Conclusion #6 - General Shape of Data

Given the success of logistic regression, LDA, and QDA, and the weaker performance of random forest, SVM, and the others, the underlying relationships are probably linear, or at least mildly quadratic. That said, KNN, Elastic Net, random forest, and SVM all have tuning parameters, and random forest and SVM in particular have a lot of room for tuning that was explored only to a limited degree; there may well be a high-performing random forest model under the right settings.

On further examination, random forest found its best tune using only two predictors at each split (mtry = 2), which also points toward a simpler (or linear) model being preferred. On the other hand, SVM Radial was vastly superior to SVM Linear, which undercuts the linear-data argument.