Project 3 diabetes_data: High_School_Graduate Analysis
Jasmine Gallaway and Keren Vivas 2023-11-16
1. Introduction section
The data analyzed in this report is derived from the Diabetes Health Indicators Dataset, which was gathered through the Behavioral Risk Factor Surveillance System (BRFSS) in the year 2015. The BRFSS is an annual health-related telephone survey conducted by the Centers for Disease Control and Prevention (CDC). This dataset encompasses a substantial collection of over 200,000 survey responses, featuring 21 distinct variables or factors, along with a pivotal response variable termed “Diabetes_binary.” This response variable comprises three discrete levels:
- 0 = No diabetes (or diabetes only during pregnancy)
- 1 = Prediabetes
- 2 = Established diabetes
The dataset encompasses a diverse array of variables that cover a wide spectrum of information, including the current diabetes status of the survey respondents, various demographic details, physical and mental health metrics, dietary habits, and lifestyle factors. The specific variables selected for analysis in this report are as follows:
- Demographics: This category includes two categorical variables: “Age” and “Sex” of the respondents.
- Physical Conditions: These variables encompass “HighBP,” “HighChol,” and “HeartDiseaseorAttack.” These are categorical variables, taking values of 0 or 1, indicating the presence or absence of high blood pressure, high cholesterol, and a history of heart disease or heart attacks. Additionally, the “Body Mass Index (BMI)” is included as a continuous variable, with values ranging from 12 to 98, providing insights into the respondents’ body composition.
- Mental Conditions: The variable “MentHlth” is a continuous variable that quantifies the number of days in a month during which the respondent’s mental health was not good. It ranges from 0 to 30.
- Diet Conditions: Two categorical variables, “Fruits” and “Veggies,” denote whether respondents consume fruits and vegetables one or more times per day. These variables take values of 1 (indicating consumption) or 0 (indicating non-consumption).
- Habits: This category includes two categorical variables: “Smoker” and “PhysActivity.” “Smoker” indicates whether respondents have smoked at least 100 cigarettes in their lifetime (with values 1 for yes and 0 for no), while “PhysActivity” assesses whether respondents have engaged in physical activity within the last month (with values 1 for yes and 0 for no).
The Exploratory Data Analysis (EDA) of the selected variables is designed to offer a holistic understanding of the health and lifestyle characteristics of the survey respondents. This preliminary phase of analysis helps us uncover patterns within the data.
Subsequently, when we move to modeling the data, our objective is to delve deeper and gain insights into the intricate ways these factors impact an individual’s diabetes status. Our analysis will not only reveal the influence of these factors but also quantify the individual contributions of each factor to the risk of diabetes.
In the end, the entire analysis will be segmented based on the level of education. This segmentation enables us to investigate how this specific demographic factor influences the overall findings. By doing so, we can discern whether educational levels play a significant role in shaping the relationship between health, lifestyle, and diabetes risk.
2. Required packages section
#Reading in libraries
library(readr) #For reading
library(dplyr) #For data manipulation
library(ggplot2) #For plotting
library(caret) #For modeling
library(rmarkdown) #For creating dynamic documents
library(purrr) #For data manipulation
library(glmnet) #For Lasso model
library(randomForest) #For random forest model
library(gbm) #For boosting model
library(Metrics) #For logLoss()
library(cvms) #For cross-validation
library(rpart) #For tree model
library(pls) #For Partial Least Squares Model
3. Data section
In this section we change most of the variables into factors and label them with appropriate names.
#Reading in data
diabetes_data <- read_csv("diabetes_binary_health_indicators_BRFSS2015.csv")
#making a tibble
diabetes_data <- as_tibble(diabetes_data)
#changing diabetes_binary to factor
diabetes_data$Diabetes_binary <- diabetes_data$Diabetes_binary %>%
factor(levels = c(0,1),
labels = c("No","Pre"))
#combine education levels one and two
diabetes_data$Education <- diabetes_data$Education %>%
factor(levels = c(1,2,3,4,5,6),
labels = c("Elementary", "Elementary", "Some_High_School", "High_School_Graduate", "Some_College", "College_Graduate"))
#changing sex to factor
diabetes_data$Sex <- diabetes_data$Sex %>%
factor(levels = c(0,1),
labels = c("female", "male"))
#changing age to factor
diabetes_data$Age <- diabetes_data$Age %>%
factor(levels = c(1,2,3,4,5,6,7,8,9,10,11,12,13),
labels = c("18-24", "25-30", "31-35", "36-40", "41-45", "46-50", "51-55", "56-60", "61-65", "66-70", "71-75", "76-80", "80+"))
#Changing fruit to factor
diabetes_data$Fruits <- diabetes_data$Fruits %>%
factor(levels = c(0, 1),
labels = c("None", "Yes"))
#Changing physical activity to factor
diabetes_data$PhysActivity <- diabetes_data$PhysActivity %>%
factor(levels = c(0, 1),
labels = c("None", "Yes"))
#Changing smoker to factor
diabetes_data$Smoker <- diabetes_data$Smoker %>%
factor(levels = c(0, 1),
labels = c("No", "Yes"))
#Changing stroke to factor
diabetes_data$Stroke <- diabetes_data$Stroke %>%
factor(levels = c(0, 1),
labels = c("No", "Yes"))
#Changing heart disease or attack to factor
diabetes_data$HeartDiseaseorAttack <- diabetes_data$HeartDiseaseorAttack %>%
factor(levels = c(0, 1),
labels = c("No", "Yes"))
#Changing veggies to factor
diabetes_data$Veggies <- diabetes_data$Veggies %>%
factor(levels = c(0, 1),
labels = c("None", "Yes"))
#Changing Heavy alcohol consumption to factor
diabetes_data$HvyAlcoholConsump <- diabetes_data$HvyAlcoholConsump %>%
factor(levels = c(0, 1),
labels = c("No", "Yes"))
#Changing any healthcare to factor
diabetes_data$AnyHealthcare <- diabetes_data$AnyHealthcare %>%
factor(levels = c(0, 1),
labels = c("No", "Yes"))
#Changing nodocbcCost to factor
diabetes_data$NoDocbcCost <- diabetes_data$NoDocbcCost %>%
factor(levels = c(0, 1),
labels = c("No", "Yes"))
#Changing GenHlth to factor
diabetes_data$GenHlth <- diabetes_data$GenHlth %>%
factor(levels = c(1, 2, 3, 4, 5),
labels = c("Excellent", "Very Good", "Good", "Fair", "Poor"))
#Changing DiffWalk to factor
diabetes_data$DiffWalk <- diabetes_data$DiffWalk %>%
factor(levels = c(0, 1),
labels = c("No", "Yes"))
#changing HighBP to factor
diabetes_data$HighBP <- diabetes_data$HighBP %>%
factor(levels = c(0, 1),
labels = c("No", "Yes"))
#changing HighChol to factor
diabetes_data$HighChol <- diabetes_data$HighChol %>%
factor(levels = c(0,1),
labels = c("No", "Yes"))
#changing CholCheck to factor
diabetes_data$CholCheck <- diabetes_data$CholCheck %>%
factor(levels = c(0,1),
labels = c("No", "Yes"))
#changing Income to factor
diabetes_data$Income <- diabetes_data$Income %>%
factor(levels = c(1,2,3,4,5,6,7,8),
labels = c("Less than $10k", "$10k to less than $15k", "$15kto less than $20k", "$20k to less than $25k", "$25k to less than $35k", "$35k to less than $50k", "$50k to less than $75k", "$75k or more"))
Below, the data set is subset to the education level specified for this page.
# Subset data into a single Education level
subset_data <- diabetes_data %>%
filter(Education == params$Education)
4. Summarizations section
Below is some initial exploratory data analysis on the dataset. We began by looking at group summaries of BMI and mental health. Both are grouped by the Diabetes_binary variable, which sorts the participants into no diabetes, prediabetes, or diabetes.
#gathering summary statistics of BMI
sum_data1 <- subset_data %>%
group_by(Diabetes_binary) %>%
summarize(mean = mean(BMI),
sd = sd(BMI),
min = min(BMI),
max = max(BMI),
IQR = IQR(BMI))
#printing out
print(sum_data1)
## # A tibble: 2 × 6
## Diabetes_binary mean sd min max IQR
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 No 28.4 6.43 12 98 7
## 2 Pre 32.1 7.42 13 92 9
#Summary statistics of mental health
sum_data2 <- subset_data %>%
group_by(Diabetes_binary) %>%
summarize(mean = mean(MentHlth),
sd = sd(MentHlth),
min = min(MentHlth),
max = max(MentHlth))
print(sum_data2)
## # A tibble: 2 × 5
## Diabetes_binary mean sd min max
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 No 3.45 7.83 0 30
## 2 Pre 4.72 9.15 0 30
Contingency Tables
After the initial group summaries, we made several contingency tables in order to observe some of the possible relationships between variables. These tables let us see numerically where some trends might occur.
#1-way contingency table
one_way_table <- table(subset_data$Fruits)
one_way_table
##
## None Yes
## 27011 35739
#2-way contingency table
two_way_table <- table(subset_data$Smoker, subset_data$HighChol)
two_way_table
##
## No Yes
## No 16828 12599
## Yes 16961 16362
#3-way contingency table
three_way_table <- table(subset_data$HighBP, subset_data$HighChol, subset_data$HeartDiseaseorAttack)
three_way_table
## , , = No
##
##
## No Yes
## No 20678 8648
## Yes 10876 15081
##
## , , = Yes
##
##
## No Yes
## No 834 870
## Yes 1401 4362
#4 Contingency table for diabetes_binary vs no doctors visits because of cost
tbl_four <- table(subset_data$NoDocbcCost, subset_data$Diabetes_binary)
tbl_four
##
## No Pre
## No 46595 9837
## Yes 5089 1229
#5 Contingency table for diabetes_binary vs general health ratings
tbl_five <- table(subset_data$GenHlth, subset_data$Diabetes_binary)
tbl_five
##
## No Pre
## Excellent 6959 265
## Very Good 16840 1630
## Good 17715 4133
## Fair 7514 3477
## Poor 2656 1561
#6 Contingency table for diabetes_binary vs NoDocbcCost vs GenHlth
tbl_six <- table(subset_data$GenHlth, subset_data$Diabetes_binary, subset_data$NoDocbcCost)
tbl_six
## , , = No
##
##
## No Pre
## Excellent 6593 244
## Very Good 15768 1544
## Good 15891 3770
## Fair 6284 3023
## Poor 2059 1256
##
## , , = Yes
##
##
## No Pre
## Excellent 366 21
## Very Good 1072 86
## Good 1824 363
## Fair 1230 454
## Poor 597 305
Plots
We then graphed some additional combinations of variables to observe other relationships within the data. If any trends are found, they could give insight into which variables are important to use in our predictive modeling.
#Plot 1 is a violin plot that illustrates the distribution of the Mental Health variable across the two sex categories, female and male.
plot1 <- ggplot(subset_data, aes(x = Sex, y = MentHlth)) +
geom_violin(fill = "lightblue", color = "blue", alpha = 0.7) +
labs(title = "Mental health condition across gender", x = "Gender", y = "Mental health condition") +
theme_minimal()
print(plot1)
#Plot 2 is a stacked bar plot that depicts the distribution of the Physical Activity variable across all age groups, with activity status shown by color.
plot2 <- ggplot(data = subset_data, aes(x = Age, fill = PhysActivity)) +
geom_bar(position = "stack") +
labs(title = "Stacked Bar Plot of Physical Activity by Age Group", x = "Age Group", y = "Participant Count") +
theme_minimal()
print(plot2)
#Plot 3 is a scatter plot designed to visualize the relationship between the Body Mass Index (BMI) and Mental Health variables, with points colored by sex (female and male)
plot3 <- ggplot(subset_data, aes(x = BMI, y = MentHlth)) +
geom_point(aes(color = Sex), size = 3) +
labs(title = "Scatterplot of BMI vs. Mental health", x = "BMI", y = "Mental Health") +
theme_minimal()
print(plot3)
#Plot 4 is a count plot of income versus mental health and color coded by the response variable of Diabetes_binary
plot4 <- ggplot(subset_data, aes(x = Income, y = MentHlth)) +
geom_count(aes(color = Diabetes_binary)) +
labs(title = "Count Plot of Income vs. Mental Health", x = "Income", y = "Mental Health") +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))
print(plot4)
#Plot 5 is a count plot of income versus physical health and color coded by the response variable of Diabetes_binary
plot5 <- ggplot(subset_data, aes(x = Income, y = PhysHlth)) +
geom_count(aes(color = Diabetes_binary)) +
labs(title = "Count Plot of Income vs. Physical Health", x = "Income", y = "Physical Health") +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))
print(plot5)
#Plot 6 is a bar plot of diabetes_binary versus mental health used to uncrowd the graphs
plot6 <- ggplot(subset_data, aes(x = MentHlth)) +
geom_bar(aes(color = Diabetes_binary, fill = Diabetes_binary)) +
facet_wrap(~Diabetes_binary) +
labs(title = "Bar Plot of Mental Health", x = "How Many Bad Mental Days", y = "Participant Count") +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))
print(plot6)
#Plot 7 is a bar plot of diabetes_binary versus physical health used to uncrowd the graphs
plot7 <- ggplot(subset_data, aes(x = PhysHlth)) +
geom_bar(aes(color = Diabetes_binary, fill = Diabetes_binary)) +
facet_wrap(~Diabetes_binary) +
labs(title = "Bar Plot of Physical Health", x = "Days Active", y = "Participant Count") +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))
print(plot7)
#Plot 8 is a bar plot of diabetes_binary versus income used to uncrowd the graphs
plot8 <- ggplot(subset_data, aes(x = Income)) +
geom_bar(aes(color = Diabetes_binary, fill = Diabetes_binary)) +
facet_wrap(~Diabetes_binary) +
labs(title = "Bar Plot of Income", x = "Income", y = "Frequency") +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))
print(plot8)
5. Modeling
Below begins the work to shift into modeling our data. We had some initial issues with modeling due to our labeling of the variables. We also had some modeling issues because there were no recorded responses of 2 / “Diabetes” within the Diabetes_binary variable. Due to this lack of responses, the “Diabetes” level of Diabetes_binary had to be removed in order for the models to work. The code chunk directly below removes that factor level. Some of the variables we arbitrarily chose to focus on in our modeling are Sex, Income, Age, BMI, MentHlth, PhysHlth, HighBP, Fruits, HvyAlcoholConsump, and NoDocbcCost. Variable selection can also be done with penalization modeling such as LASSO.
#Dropping level
subset_data$Diabetes_binary <- droplevels(subset_data$Diabetes_binary)
#Setting seed for reproducibility
set.seed(3033)
#Splitting the data
intrain <- createDataPartition(y = subset_data$Diabetes_binary, p= 0.7, list = FALSE)
#Splitting the data
training <- subset_data[intrain,]
testing <- subset_data[-intrain,]
#Smaller training set for computations
training <- training[1:250,]
What log loss is and why we may prefer it to things like accuracy?
Logarithmic loss is a metric used to evaluate how well a model fits. Also known as cross-entropy loss, log loss is used on binary classification models where the responses are either 0 or 1. Log loss measures the difference between the predicted probabilities and the actual values; the lower the log loss value, the better the model fits the data. Log loss is preferred with this data set because of our focus on the binary response of diabetes. Our Diabetes_binary response variable nominally has three responses of no diabetes, prediabetes, and diabetes; however, there is no recorded response of the factor level of 2 (diabetes) within the data, so the recorded responses are either 0 or 1. Log loss is better than accuracy in this case because it rewards models that predict the correct class with high probability and penalizes confident predictions that are far from the true class.
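As a quick illustration, the log loss formula is -(1/N) * sum(y*log(p) + (1-y)*log(1-p)). The sketch below uses made-up actuals and probabilities (not from our data) to show that two prediction sets can have identical accuracy but very different log loss.
#Toy illustration of log loss; y and the probabilities are hypothetical
y <- c(1, 0, 1, 1, 0)
confident <- c(0.90, 0.10, 0.80, 0.95, 0.20) #close to the true classes
hesitant <- c(0.60, 0.40, 0.55, 0.60, 0.45) #same accuracy at a 0.5 cutoff
toy_logloss <- function(actual, p) -mean(actual * log(p) + (1 - actual) * log(1 - p))
toy_logloss(y, confident) #smaller value: confident, well-calibrated predictions
toy_logloss(y, hesitant) #larger value, even though accuracy is identical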
What a LASSO logistic regression model is and why we might try to use this over basic logistic regression?
LASSO (Least Absolute Shrinkage and Selection Operator) is a type of regularized logistic regression. It adds a penalty term to the logistic regression’s likelihood function to encourage feature selection by driving some coefficients to exactly zero. This can improve model interpretability and reduce overfitting, which is why it is sometimes used over basic logistic regression.
LASSO logistic regression is a valuable tool when you have high-dimensional datasets with potentially redundant or irrelevant features. It can improve model performance, reduce overfitting, and enhance model interpretability by automatically selecting the most relevant predictors while eliminating the least important ones. However, the choice between basic logistic regression and LASSO logistic regression depends on the specific characteristics of your data and the goals of your analysis.
Fit a LASSO logistic regression model and choose the best model
#Setting control for the train control
control <- trainControl(method = "cv", number = 5, classProbs = TRUE, summaryFunction = mnLogLoss)
#Fitting a LASSO logistic regression model
model_lasso <- train(Diabetes_binary ~ Sex + Income + Age + BMI + MentHlth + PhysHlth + HighBP + Fruits + HvyAlcoholConsump + NoDocbcCost,
data = training,
method = "glmnet",
trControl =control,
tuneGrid = expand.grid(alpha = 1, lambda = seq(0.001, 1, length = 100))
)
#Predict class probabilities and keep the "Pre" (success) column for the log loss calculation
pred_model_lasso <- predict(model_lasso, newdata = testing, type = "prob")[, "Pre"]
#Turn values into 1's and 0's for the log loss calculation
testing$Diabetes_binary <- ifelse(testing$Diabetes_binary == "No", 0, 1)
# Log loss function computed from the actual 0/1 values and predicted probabilities
LogLoss <- function(actual, predicted) {
result <- -1 / length(actual) * sum((actual * log(predicted) + (1 - actual) * log(1 - predicted)))
return(result)
}
# Calculate log loss on the testing set
test_logloss_lasso <- LogLoss(testing$Diabetes_binary, pred_model_lasso)
test_logloss_lasso
## [1] 2.16225
# Display the log loss on the testing set
cat("Log Loss on Testing Set:", test_logloss_lasso, "\n")
## Log Loss on Testing Set: 2.16225
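As a sanity check that is not part of the original analysis, we can also inspect which coefficients the selected lambda shrank to exactly zero. This is a sketch using the standard caret/glmnet accessors:
#Tuning values chosen by cross-validation
model_lasso$bestTune
#Coefficients at the selected lambda; rows shrunk to zero were dropped by the penalty
coef(model_lasso$finalModel, s = model_lasso$bestTune$lambda)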
What a classification tree model is and why we might try to use it?
A classification tree, specifically known as a decision tree for classification, is a model designed for classifying data. It recursively divides the input space into regions, assigning a class label to each region. The decision-making process is visualized as a tree structure, where internal nodes represent decisions based on specific features, branches depict the outcomes of those decisions, and leaf nodes contain the final predicted classes. The construction of a classification tree involves selecting optimal features to split the data at each internal node, with the process continuing until a stopping criterion is met, such as a maximum tree depth or a minimum number of samples in a leaf node. The tree’s depth and complexity are controlled by different values of the complexity parameter, typically determined through cross-validation to select the best parameter.
We might choose to use a classification tree for several reasons. It offers benefits such as excellent interpretability, no assumption of linearity, handling of both numeric and non-numeric data, automatic variable selection, insensitivity to feature scale, robustness to outliers, suitability for imbalanced data, and its role as a building block for ensemble methods, enhancing predictive performance.
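For context, rpart also records cross-validated error for a sequence of cp values on its own. Below is a minimal sketch (not part of the original analysis, using an arbitrary predictor set) of that built-in table:
#Grow a deep tree, then view rpart's built-in cross-validation table
demo_tree <- rpart(Diabetes_binary ~ BMI + HighBP + Age, data = training,
method = "class", control = rpart.control(cp = 0.001))
printcp(demo_tree) #the cp with the lowest cross-validated error (xerror) is a natural pruning choice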
Fit a classification tree with varying values for the complexity parameter and choose the best model (best complexity parameter)
#Set seed for reproducibility
set.seed(123)
# Create 5-fold cross-validation indices
folds <- createFolds(training$Diabetes_binary, k = 5, list = TRUE, returnTrain = FALSE)
# Create a grid of complexity parameter values to try
cp_values <- seq(0.01, 0.1, by = 0.01)
# Create a data frame to store log loss for each complexity parameter
logloss_df <- data.frame(cp = numeric(length(cp_values)), logloss = numeric(length(cp_values)))
# Loop through different complexity parameters
for (i in seq_along(cp_values)) {
avg_logloss <- 0
# Perform 5-fold cross-validation
for (fold in seq_along(folds)) {
# Extract training and validation sets
training_set <- training[-folds[[fold]], ]
validation_set <- training[folds[[fold]], ]
# Build the classification tree on the training set
tree_model <- rpart(Diabetes_binary ~ Sex + Income + Age + BMI + MentHlth + PhysHlth + HighBP + Fruits + HvyAlcoholConsump + NoDocbcCost, data = training_set, method = "class", control = rpart.control(cp = cp_values[i]))
# Make predictions on the validation set
predictions <- predict(tree_model, newdata = validation_set, type = "prob")[, 2]
# Calculate log loss on the validation set
fold_logloss <- logLoss(as.numeric(validation_set$Diabetes_binary) - 1, predictions)
avg_logloss <- avg_logloss + fold_logloss
}
# Calculate average log loss over all folds
avg_logloss <- avg_logloss / length(folds)
# Store average log loss in the data frame
logloss_df[i, ] <- c(cp = cp_values[i], logloss = avg_logloss)
}
# Find the best complexity parameter with the minimum average log loss
best_cp <- logloss_df$cp[which.min(logloss_df$logloss)]
# Build the final classification tree with the best complexity parameter
final_tree_model <- rpart(Diabetes_binary ~ Sex + Income + Age + BMI + MentHlth + PhysHlth + HighBP + Fruits + HvyAlcoholConsump + NoDocbcCost, data = training, method = "class", control = rpart.control(cp = best_cp))
# Display the best complexity parameter
cat("Best Complexity Parameter:", best_cp, "\n")
## Best Complexity Parameter: 0.08
# Display the final tree
print(final_tree_model)
## n= 250
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 250 58 No (0.7680000 0.2320000) *
# Evaluate the final model on the testing set
test_predictions <- predict(final_tree_model, newdata = testing, type = "prob")[, 2]
# Calculate log loss on the testing set
test_logloss_tree <- abs(logLoss(as.numeric(testing$Diabetes_binary) - 1, test_predictions))
# Display the log loss on the testing set
cat("Log Loss on Testing Set:", test_logloss_tree, "\n")
## Log Loss on Testing Set: 0.7220256
What a bagged CART model is?
A bagged CART (Bootstrap Aggregating for Classification and Regression Trees) model is an ensemble machine learning technique aimed at enhancing predictive model stability and accuracy while mitigating overfitting. The process involves creating multiple bootstrap samples from the original dataset by randomly selecting data points with replacement. Each bootstrap sample is used to train an independent CART model (decision tree). The predictions from all individual trees are then combined to form the final ensemble prediction. For classification tasks, the mode of predictions is often used, while the average is employed for regression problems. Bagging creates diversity in the models by training them on different subsets of the data, making the ensemble more robust and less prone to overfitting.
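To make the bootstrap-and-average idea concrete, here is a minimal hand-rolled sketch; the caret "treebag" fit below automates all of this, and the predictor set and B = 25 are arbitrary choices for illustration.
#Hand-rolled bagging: B bootstrap samples, one tree each, averaged probabilities
set.seed(42)
B <- 25
boot_preds <- sapply(1:B, function(b) {
idx <- sample(nrow(training), replace = TRUE) #bootstrap sample of row indices
fit <- rpart(Diabetes_binary ~ BMI + HighBP + Age, data = training[idx, ],
method = "class")
predict(fit, newdata = testing, type = "prob")[, "Pre"]
})
bagged_prob <- rowMeans(boot_preds) #ensemble probability of "Pre" for each test row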
Fit a bagged CART and choose the best model
#Changing factor levels for eased calculations
testing$Diabetes_binary <- factor(testing$Diabetes_binary, levels = c(0, 1), labels = c("No", "Pre"))
#Setting control for the train control
control <- trainControl(method = "cv", number = 5, classProbs = TRUE, summaryFunction = mnLogLoss)
# Fit a bagged model (using bagged trees) on the training data
model_bagged <- train(Diabetes_binary ~ Sex + Income + Age + BMI + MentHlth + PhysHlth + HighBP + Fruits + HvyAlcoholConsump + NoDocbcCost, data = training, method = "treebag", trControl = control)
# Predict probabilities on the testing set
pred_probabilities <- predict(model_bagged, newdata = testing, type = "prob")
# Turn values into 1's and 0's for the log loss calculation
testing$Diabetes_binary <- as.numeric(testing$Diabetes_binary) - 1
pred_prob_pre <- pred_probabilities[,"Pre"]
# LogLoss function with epsilon to avoid log(0)
LogLoss <- function(actual, predicted) {
epsilon <- 1e-15 # Small epsilon value to avoid log(0)
# Add epsilon to avoid log(0) or log(1)
predicted <- pmax(pmin(predicted, 1 - epsilon), epsilon)
result <- -1 / length(actual) * sum((actual * log(predicted) + (1 - actual) * log(1 - predicted)))
return(result)
}
# Calculate log loss on the testing set
test_logloss_cart <- LogLoss(testing$Diabetes_binary, pred_prob_pre)
test_logloss_cart
## [1] 0.7746904
# Display the log loss on the testing set
cat("Log Loss on Testing Set:", test_logloss_cart, "\n")
## Log Loss on Testing Set: 0.7746904
What a logistic regression is and why we apply it to this kind of data?
Logistic regression is a form of predictive analysis in which the regression models the relationship(s) of a binary response variable to one or more other variables. Typically we think of linear regression models, but generalized linear models, such as logistic regression, are used in cases in which the general assumptions of linear models are broken. The general assumptions of a linear model are that the model is for a continuous response, errors are normally distributed, and that constant variance can be assumed. In this case our Diabetes_binary variable could do well being modeled by logistic regression. In our data Diabetes_binary has two recorded responses of either 0 “No diabetes” or 1 “Prediabetes”; this is not a continuous response and therefore an ordinary linear model cannot be used. A typical logistic function can be backsolved to show what is called a link function. The link function shows how a linear model can be fit to our response variable. The link function of logistic regression is called the logit function, and it models the log odds of success, i.e. the log of the odds that the success (or 1 response) will occur. In this scenario it feels wrong to call prediabetes a success; however, we will statistically still consider 1 “Prediabetes” to be the success.
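As a small numeric illustration of the link function, base R's qlogis() and plogis() implement the logit and inverse logit (the example probability is arbitrary):
#The logit maps a probability to log odds; the inverse logit maps back
p <- 0.25 #an example probability of the "success" (prediabetes) response
qlogis(p) #logit: log(p / (1 - p)) = log(0.25 / 0.75), about -1.10
plogis(qlogis(p)) #inverse logit recovers 0.25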
Fit 3 candidate logistic regression models and choose the best
#Setting trainControl to include cv
my_ctrl <- trainControl(method = "cv", number = 5, classProbs = TRUE, summaryFunction = mnLogLoss)
# Model 1: Logistic Regression with first set of predictors
model1 <- train(Diabetes_binary ~ Age + BMI + HighBP, data = training,
method = "glm",
trControl = my_ctrl,
metric = "logLoss",
family = binomial(link = "logit"))
#Saving logLoss value
m1 <- model1$results$logLoss
#Observing Summaries
model1
## Generalized Linear Model
##
## 250 samples
## 3 predictor
## 2 classes: 'No', 'Pre'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 200, 200, 200, 199, 201
## Resampling results:
##
## logLoss
## 0.4922877
# Model 2: Logistic Regression with different predictor variables
model2 <- train(Diabetes_binary ~ Age*PhysActivity, data = training,
method = "glm",
trControl = my_ctrl,
metric = "logLoss",
family = binomial(link = "logit"))
#Saving logLoss value
m2 <- model2$results$logLoss
#Observing Summaries
model2
## Generalized Linear Model
##
## 250 samples
## 2 predictor
## 2 classes: 'No', 'Pre'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 199, 200, 201, 199, 201
## Resampling results:
##
## logLoss
## 0.6617013
# Model 3: Logistic Regression with more parameter variations
model3 <- train(Diabetes_binary ~ Age + PhysActivity + HvyAlcoholConsump + Fruits, data = training,
method = "glm",
trControl = my_ctrl,
metric = "logLoss",
family = binomial(link = "logit"))
#Saving logLoss value
m3 <- model3$results$logLoss
#Observing Summaries
model3
## Generalized Linear Model
##
## 250 samples
## 4 predictor
## 2 classes: 'No', 'Pre'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 201, 201, 200, 199, 199
## Resampling results:
##
## logLoss
## 0.5808374
#Making a table to compare the logLoss metric values and pick the smallest value
comparing <- c(Model1 = m1, Model2 = m2,Model3 = m3)
best_logistic_model <- names(which.min(comparing))
best_logistic_model
## [1] "Model1"
#building a contender item to automate returning the name of the model that worked the best
log_contender <- if (best_logistic_model == "Model1"){
log_contender <- model1
} else if (best_logistic_model == "Model2"){
log_contender <- model2
} else if (best_logistic_model == "Model3"){
log_contender <- model3
}
#Prediction for contending logistic model with testing data
#Because these model objects are caret train objects, predict dispatches to predict.train, which requires type = "prob" for classification problems
log_result <- predict(log_contender, newdata = testing, type = "prob")
#Saving the testing values as numeric so that they can be used for calculations
testing$Diabetes_binary <- as.numeric(testing$Diabetes_binary)
#Picking the pre (prediabetes) column to use in my logloss function
success_col <- log_result[,"Pre"]
#Creating a log loss function to compare the final results
LogLoss_values <- function(binary_num, pred)
{
abs(-mean(binary_num * log(pred) + (1 - binary_num) * log(1 - pred)))
}
#Gathering the logLoss value of the predicted logistic candidate to compare to other models
logistic_logLoss <- LogLoss_values(testing$Diabetes_binary, success_col)
print(logistic_logLoss)
## [1] 0.7210212
# Display the log loss on the testing set
cat("Log Loss on Testing Set:", logistic_logLoss, "\n")
## Log Loss on Testing Set: 0.7210212
What a random forest is and why we might use it instead of a basic classification tree?
A random forest is another ensemble method of modeling. Random forests are similar to bagging in that both take bootstrap samples from the data to build multiple trees. After repeated sampling and tree building, these ensemble methods take the average or majority classification as their prediction. With our data being a classification problem, our random forest takes the class predicted by the majority of the trees it builds. Building many trees and then taking the average or majority vote decreases the variance that a basic classification tree would have, which makes ensemble methods the preferred choice when prediction is the focus. Random forests differ from bagging in that they consider only a random subset of the variables at each split rather than all of them. This subsetting increases bias, but it decorrelates the trees, so averaging them can decrease variance further.
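For reference, the same kind of model can also be fit directly with the randomForest package; a minimal sketch (not part of the analysis below, with mtry = 2 chosen arbitrarily):
#mtry = number of predictors sampled at each split; ntree = number of bootstrapped trees
rf_sketch <- randomForest(Diabetes_binary ~ Age + HvyAlcoholConsump + HighBP,
data = training, mtry = 2, ntree = 500)
rf_sketch #prints the out-of-bag error estimate and confusion matrix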
Fit a random forest model and choose the best model
Here it looks like we fit one overall model, but our mtry value, which is the number of variables to sample from at each split in the tree, ranged from 1 to 3. This range is how we fit multiple candidate models within one overall train() call. So here, in order to pick the best model, we look for the mtry value with the lowest logLoss.
#setting control for the train control
control <- trainControl(method = "cv", number = 5, classProbs = TRUE, summaryFunction = mnLogLoss)
#Fitting a random forest model
rf_fit <- train(Diabetes_binary ~ Age + HvyAlcoholConsump + HighBP, data = training,
method = "rf",
trControl = control,
metric = "logLoss",
tuneGrid = expand.grid(mtry = 1:3))
#Saving the logloss value of mtry 1 as object
var1 <- rf_fit$results$logLoss[[1]]
#Saving the logloss value of mtry 2 as object
var2 <- rf_fit$results$logLoss[[2]]
#Saving the logloss value of mtry 3 as object
var3 <- rf_fit$results$logLoss[[3]]
#Making a table to compare the logLoss metric values and pick the smallest value
battle_of_logloss <- c(Mtry1 = var1, Mtry2 = var2, Mtry3 = var3)
best_rf_model <- names(which.min(battle_of_logloss))
best_rf_model
## [1] "Mtry3"
#Building a contender object to automate returning the mtry value that worked best in the random forest model
rf_contender <- if (best_rf_model == "Mtry1"){
rf_contender <- 1
} else if (best_rf_model == "Mtry2"){
rf_contender <- 2
} else if (best_rf_model == "Mtry3"){
rf_contender <- 3
}
#Now to specify what training model (which mtry) should be used on the testing dataset
#We took the same format as above, but we automated the mtry value by inputting our winning rf_contender value
rf_best <- train(Diabetes_binary ~ Age + HvyAlcoholConsump + HighBP, data = training,
method = "rf",
trControl = control,
metric = "logLoss",
tuneGrid = expand.grid(mtry = rf_contender))
#Now we can predict on the testing dataset with our rf_best model
rf_result <- predict(rf_best, newdata = testing)
#Diabetes_binary in the testing set is already numeric 0/1 from the conversion above, so no further conversion is needed
#Turning the hard class predictions into 0 and 1 for the log loss calculation
pre_col <- ifelse(rf_result == "No", 0, 1)
#Reiterating the log loss function to compare the final results
#The 1e-15 is added here because the rf predictions are hard 0/1 values, so binary_num*log(pred) would otherwise produce NaN
LogLoss_values <- function(binary_num, pred)
{ abs(-mean(binary_num * log(pred + 1e-15) + (1 - binary_num) * log(1 - pred + 1e-15)))
}
#Gathering the logLoss value of the predicted random forest candidate to compare to other models
rf_logLoss <- LogLoss_values(testing$Diabetes_binary, pre_col)
print(rf_logLoss)
## [1] 32.00121
# Display the log loss on the testing set
cat("Log Loss on Testing Set:", rf_logLoss, "\n")
## Log Loss on Testing Set: 32.00121
What the Partial Least Squares model is?
Partial least squares (PLS) is a dimension reduction technique. PLS is similar to principal component analysis, except that the reduction strategy used by PLS is driven by the response variable. PLS picks linear combinations of correlated variables that cover the most variability in the dataset and uses them as predictors. PLS is more difficult to interpret than logistic regression or a classification tree.
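One note on tuning: ncomp, the number of retained components, is the lone tuning parameter for method = "kernelpls". We rely on caret's default of three candidate values below, which is why three logLoss results appear; an explicit grid could be supplied instead, for example:
#Sketch of an explicit tuning grid for the number of PLS components
pls_grid <- expand.grid(ncomp = 1:3)
#which could then be passed to train() via tuneGrid = pls_grid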
Fit a partial least squares model and choose the best model
#creating the control for the train model
control <- trainControl(method = "cv", number = 5, classProbs = TRUE, summaryFunction = mnLogLoss)
#Fitting a partial least squares model
pls_fit <- train(Diabetes_binary ~ Sex + BMI + HighChol + MentHlth + Fruits + NoDocbcCost, data = training,
method = "kernelpls",
trControl = control,
metric = "logLoss")
#Saving the logloss value of ncomp 1 as object
comp1 <- pls_fit$results$logLoss[[1]]
#Saving the logloss value of ncomp 2 as object
comp2 <- pls_fit$results$logLoss[[2]]
#Saving the logloss value of ncomp 3 as object
comp3 <- pls_fit$results$logLoss[[3]]
#Making a table to compare the logLoss metric values and pick the smallest value
battles_of_logloss <- c(Ncomp1 = comp1, Ncomp2 = comp2, Ncomp3 = comp3)
best_pls_model <- names(which.min(battles_of_logloss))
best_pls_model
## [1] "Ncomp3"
#Building a contender object to automate returning the ncomp value that worked best in the partial least squares model
pls_contender <- if (best_pls_model == "Ncomp1"){
pls_contender <- 1
} else if (best_pls_model == "Ncomp2"){
pls_contender <- 2
} else if (best_pls_model == "Ncomp3"){
pls_contender <- 3
}
#Now to specify which ncomp should be used on the testing dataset
#We took the same format as above, but we automated the ncomp value by inputting our winning pls_contender value
pls_best <- train(Diabetes_binary ~ Sex + BMI + HighChol + MentHlth + Fruits + NoDocbcCost, data = training,
method = "kernelpls",
trControl = control,
metric = "logLoss",
tuneGrid = expand.grid(ncomp = pls_contender))
#Now we can predict on the testing dataset with our pls_best model
pls_result <- predict(pls_best, newdata = testing)
#Diabetes_binary in the testing set is already numeric 0/1 from the conversion above, so no further conversion is needed
#Turning the hard class predictions into 0 and 1 for the log loss calculation
pre_col <- ifelse(pls_result == "No", 0, 1)
#Reiterating the log loss function to compare the final results
#The 1e-15 is added here because the pls predictions are hard 0/1 values, so binary_num*log(pred) would otherwise produce NaN
LogLoss_values <- function(binary_num, pred)
{ abs(-mean(binary_num * log(pred + 1e-15) + (1 - binary_num) * log(1 - pred + 1e-15)))
}
#Gathering the logLoss value of the predicted partial least squares candidate to compare to other models
pls_logLoss <- LogLoss_values(testing$Diabetes_binary, pre_col)
print(pls_logLoss)
## [1] 33.79934
# Display the log loss on the testing set
cat("Log Loss on Testing Set:", pls_logLoss, "\n")
## Log Loss on Testing Set: 33.79934
6. Final Model Selection (Compare the six best above and declare an overall winner)
We expect the tree-based or ensemble methods to be in the winning circle with the lowest log loss value. Ensemble methods run through multiple tree builds, which should lower the variance of the model; with lower variance there would hopefully be more predictive power. However, there are many tuning parameters that could be optimized further, and this report may not have the best fits overall.
#Lasso best model
Lasso <- test_logloss_lasso
#Bringing the tree best model
Tree <- test_logloss_tree
#Bringing the cart best model
Cart <- test_logloss_cart
#Bringing in the logistic best model
Logistic <- logistic_logLoss
#Bringing in the random forest model
Random <- rf_logLoss
#Bringing in the partial least squares model
Partial <- pls_logLoss
#Comparing the models
best_of_best <- c(Lasso_Model = Lasso, Tree_Model = Tree, Cart_Model = Cart, Logistic_Model = Logistic, Random_Forest_Model = Random, Partial_Least_Squares = Partial)
best <- names(which.min(best_of_best))
# Get the name or identifier of the best model
best_model <- paste(best)
# Print the result
cat("The best model is:", best_model, "with log loss:", best_of_best[best], "\n")
## The best model is: Logistic_Model with log loss: 0.7210212