Meet tidymodels for the first time

Ji Huang 2021-05-30 5 min read

I’ve known about the tidymodels framework for a while but never tried it. Tidymodels is the new modeling framework from Max Kuhn, the author of the caret package in R. In the past week, I read the Tidy Modeling with R book and tried the tidymodels system for the first time.

In this post, I fit a random forest model for a classification problem using tidymodels. This is just a quick first try of the new system.

1. Load dataset

The dataset is from the sda package: singh2002 contains gene expression data (6,033 genes for 102 samples) from the microarray study of Singh et al. (2002).

library(tidymodels)
library(sda)

tidymodels_prefer()

data(singh2002)

For quick processing, I keep only the first 100 genes in the model. The dataset consists of 102 samples: 52 cancer and 50 healthy.

# convert the expression matrix to a tibble; as.data.frame() names
# the unnamed matrix columns V1 ... V6033
gexpr <- as_tibble(as.data.frame(singh2002$x))
gexpr$label <- singh2002$y
gexpr <- select(gexpr, label, V1:V100)
table(gexpr$label)
## 
##  cancer healthy 
##      52      50

2. Build Random Forest model

2.1 Split dataset

set.seed(123)
gexpr_split <- initial_split(gexpr, strata = label)
gexpr_train <- training(gexpr_split)
gexpr_test <- testing(gexpr_split)
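
initial_split() holds out a quarter of the data by default. As a quick sanity check (not part of the original output), the class counts in each partition should confirm that stratifying on label keeps the balance:

gexpr_train %>% count(label)   # class counts in the training set
gexpr_test %>% count(label)    # class counts in the test set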

2.2 Create Cross-Validation set

set.seed(234)
gexpr_folds <- vfold_cv(gexpr_train, strata = label)
gexpr_folds
## #  10-fold cross-validation using stratification 
## # A tibble: 10 x 2
##    splits         id    
##    <list>         <chr> 
##  1 <split [68/8]> Fold01
##  2 <split [68/8]> Fold02
##  3 <split [68/8]> Fold03
##  4 <split [68/8]> Fold04
##  5 <split [68/8]> Fold05
##  6 <split [68/8]> Fold06
##  7 <split [68/8]> Fold07
##  8 <split [69/7]> Fold08
##  9 <split [69/7]> Fold09
## 10 <split [70/6]> Fold10
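
Each row of gexpr_folds holds an rsplit object; the fitting and held-out portions can be pulled out with rsample’s helpers (an aside, not in the original post):

# peek inside the first fold
fold1 <- gexpr_folds$splits[[1]]
analysis(fold1)    # the 68 samples used for fitting in Fold01
assessment(fold1)  # the 8 samples held out for assessment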

2.3 Recipes

The usemodels package helps quickly create code snippets for fitting models with the tidymodels framework.

usemodels::use_ranger(label ~ ., data = gexpr_train)
## ranger_recipe <- 
##   recipe(formula = label ~ ., data = gexpr_train) 
## 
## ranger_spec <- 
##   rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>% 
##   set_mode("classification") %>% 
##   set_engine("ranger") 
## 
## ranger_workflow <- 
##   workflow() %>% 
##   add_recipe(ranger_recipe) %>% 
##   add_model(ranger_spec) 
## 
## set.seed(29103)
## ranger_tune <-
##   tune_grid(ranger_workflow, resamples = stop("add your rsample object"), grid = stop("add number of candidate points"))
ranger_recipe <-
    recipe(formula = label ~ ., data = gexpr_train)
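
This recipe has no preprocessing steps, which is fine for a random forest. For models that need preprocessing, steps can be chained onto the recipe; for example (a hypothetical variant, not used below):

# hypothetical variant: center and scale all predictors
ranger_recipe_norm <-
    recipe(formula = label ~ ., data = gexpr_train) %>%
    step_normalize(all_numeric_predictors())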

2.4 Model specification

Compared with the generated template, I drop the tune() placeholders and only fix trees = 1000, leaving mtry and min_n at their ranger defaults.

ranger_spec <-
    rand_forest(trees = 1000) %>%
    set_mode("classification") %>%
    set_engine("ranger")

2.5 Workflow

ranger_workflow <-
    workflow() %>%
    add_recipe(ranger_recipe) %>%
    add_model(ranger_spec)

2.6 Fit CV training set

# register a parallel backend so the folds fit in parallel
doParallel::registerDoParallel()

set.seed(12345)
# fit the workflow on each CV fold; save_pred = TRUE keeps the
# out-of-fold predictions for the ROC curves below
ranger_rs <-
    fit_resamples(ranger_workflow,
                  resamples = gexpr_folds,
                  control = control_resamples(save_pred = TRUE)
    )

2.7 Check Training Performance

10-fold CV summary

collect_metrics(ranger_rs, summarize = TRUE)
## # A tibble: 2 x 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.932    10  0.0297 Preprocessor1_Model1
## 2 roc_auc  binary     0.988    10  0.0125 Preprocessor1_Model1

Performance of each fold

collect_metrics(ranger_rs, summarize = FALSE)
## # A tibble: 20 x 5
##    id     .metric  .estimator .estimate .config             
##    <chr>  <chr>    <chr>          <dbl> <chr>               
##  1 Fold01 accuracy binary         1     Preprocessor1_Model1
##  2 Fold01 roc_auc  binary         1     Preprocessor1_Model1
##  3 Fold02 accuracy binary         1     Preprocessor1_Model1
##  4 Fold02 roc_auc  binary         1     Preprocessor1_Model1
##  5 Fold03 accuracy binary         0.75  Preprocessor1_Model1
##  6 Fold03 roc_auc  binary         0.875 Preprocessor1_Model1
##  7 Fold04 accuracy binary         0.875 Preprocessor1_Model1
##  8 Fold04 roc_auc  binary         1     Preprocessor1_Model1
##  9 Fold05 accuracy binary         1     Preprocessor1_Model1
## 10 Fold05 roc_auc  binary         1     Preprocessor1_Model1
## 11 Fold06 accuracy binary         1     Preprocessor1_Model1
## 12 Fold06 roc_auc  binary         1     Preprocessor1_Model1
## 13 Fold07 accuracy binary         1     Preprocessor1_Model1
## 14 Fold07 roc_auc  binary         1     Preprocessor1_Model1
## 15 Fold08 accuracy binary         0.857 Preprocessor1_Model1
## 16 Fold08 roc_auc  binary         1     Preprocessor1_Model1
## 17 Fold09 accuracy binary         1     Preprocessor1_Model1
## 18 Fold09 roc_auc  binary         1     Preprocessor1_Model1
## 19 Fold10 accuracy binary         0.833 Preprocessor1_Model1
## 20 Fold10 roc_auc  binary         1     Preprocessor1_Model1

Plot the ROC curve for each of the 10 folds; most of them have AUC = 1.

collect_predictions(ranger_rs) %>%
    group_by(id) %>%
    roc_curve(label, .pred_cancer) %>%
    autoplot()

Averaged confusion matrix.

conf_mat_resampled(ranger_rs, tidy = FALSE) %>%
    autoplot()

2.8 Fit Testing set

last_fit() fits the final workflow on the entire training set and evaluates it once on the held-out test set.

final_fitted <- last_fit(ranger_workflow, gexpr_split)

2.9 Check Testing set Performance

collect_metrics(final_fitted)
## # A tibble: 2 x 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <chr>               
## 1 accuracy binary             1 Preprocessor1_Model1
## 2 roc_auc  binary             1 Preprocessor1_Model1

All 26 test samples were classified correctly.

collect_predictions(final_fitted) %>%
    conf_mat(label, .pred_class)
##           Truth
## Prediction cancer healthy
##    cancer      13       0
##    healthy      0      13
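
Though I don’t use it here, the workflow fitted on the full training set can be pulled out of the last_fit() result for predicting on new data. A sketch, assuming a recent enough version of tune that exports extract_workflow():

final_wf <- extract_workflow(final_fitted)
# hypothetical use: predict on a few samples
predict(final_wf, new_data = gexpr_test[1:3, ])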

The default confusion matrix plot (a mosaic) is a little confusing: it looks as if there were misclassifications, even though the model predicts all 26 samples correctly.

collect_predictions(final_fitted) %>%
    conf_mat(label, .pred_class) %>%
    autoplot()
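
A possibly clearer alternative is the heatmap style; autoplot() for conf_mat objects also accepts type = "heatmap":

collect_predictions(final_fitted) %>%
    conf_mat(label, .pred_class) %>%
    autoplot(type = "heatmap")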

3. Extra

Some things I didn’t cover in this post:

  1. Tuning models with the tune package (a rough sketch below).
  2. Comparing models.
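
For the first point, here is roughly what tuning would look like, filling in the placeholders from the usemodels template above (a sketch I haven’t run; ranger_tune_spec and ranger_tune_wf are names made up for this post):

# tune mtry and min_n over the same CV folds
ranger_tune_spec <-
    rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>%
    set_mode("classification") %>%
    set_engine("ranger")

ranger_tune_wf <-
    workflow() %>%
    add_recipe(ranger_recipe) %>%
    add_model(ranger_tune_spec)

set.seed(29103)
ranger_tune <-
    tune_grid(ranger_tune_wf,
              resamples = gexpr_folds,
              grid = 11)           # 11 candidate parameter combinations

# pick the numerically best parameters by ROC AUC
select_best(ranger_tune, metric = "roc_auc")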

Useful links:

  1. Julia Silge’s blog.
  2. Jan Kirenz’s posts.
  3. The Tidy Modeling with R book.

Some thoughts on tidymodels:

The tidymodels ecosystem looks like a solid platform for model building and testing in R. I like that it has so many modules, covering almost all common statistical models. For time-series data, there is the tidyverts collection.

However, because of its modularity, tidymodels spans many packages and many functions. It does not offer an all-in-one interface like caret (and it probably never will).

I also have some concerns about tidymodels’ speed and memory usage. Still, tidymodels is definitely an interesting modeling system, and I will try it more on my real research problems.