This function preprocesses a dataset by removing features with near-zero variance, highly correlated features, and features that are constant within any class of the target variable. The result is a reduced feature set intended to be better suited to machine learning models.
Arguments
- data
A data frame containing predictor features and the target variable.
- target_col
A character string specifying the name of the target column in data. Default is "target".
- cor_thresh
A numeric value between 0 and 1 specifying the correlation threshold for removing highly correlated features. Default is 0.9.
Details
The preprocessing steps include:
- Removing near-zero variance features (using caret::nearZeroVar).
- Removing highly correlated features above the specified threshold (using caret::findCorrelation).
- Removing features that are constant within any class of the target variable (i.e., provide no discriminatory power across classes).
After preprocessing, the target column is re-attached to the dataset.
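The steps above can be sketched roughly as follows. This is a minimal illustration and not the package's actual implementation: the name preprocess_sketch is hypothetical, the correlation step assumes numeric features with no missing values, and the guards around empty index vectors are an added precaution. Only caret::nearZeroVar and caret::findCorrelation come from the description above.

```r
# Hypothetical sketch of the three steps; not the package's real source.
preprocess_sketch <- function(data, target_col = "target", cor_thresh = 0.9) {
  stopifnot(is.data.frame(data), target_col %in% names(data))

  # Separate the target from the predictors.
  target <- data[[target_col]]
  features <- data[, setdiff(names(data), target_col), drop = FALSE]

  # Step 1: drop near-zero variance columns.
  # nearZeroVar() returns column indices; an empty result must be guarded,
  # because features[, -integer(0)] would drop every column.
  nzv <- caret::nearZeroVar(features)
  if (length(nzv) > 0) {
    features <- features[, -nzv, drop = FALSE]
  }

  # Step 2: drop one column from each highly correlated pair.
  # Assumption: only numeric columns take part in the correlation matrix.
  num <- vapply(features, is.numeric, logical(1))
  if (sum(num) > 1) {
    cor_mat <- stats::cor(features[, num, drop = FALSE])
    drop_idx <- caret::findCorrelation(cor_mat, cutoff = cor_thresh)
    drop_names <- names(features)[num][drop_idx]
    features <- features[, setdiff(names(features), drop_names), drop = FALSE]
  }

  # Step 3: drop features that take a single value within at least one class.
  const_in_class <- vapply(features, function(col) {
    any(tapply(col, target, function(v) length(unique(v)) == 1), na.rm = TRUE)
  }, logical(1))
  features <- features[, !const_in_class, drop = FALSE]

  # Re-attach the target column.
  features[[target_col]] <- target
  features
}
```

Note that under this reading of the third step, a class with only one observation makes every feature look constant within that class, so very small classes can cause all features to be dropped.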
Examples
if (FALSE) { # \dontrun{
library(caret)
library(dplyr)
set.seed(123)
df <- data.frame(
  feature1 = c(1, 1, 1, 1, 1), # constant
  feature2 = c(1, 2, 3, 4, 5), # numeric
  feature3 = c(1, 2, 3, 4, 5) * 2, # highly correlated with feature2
  target = c("A", "A", "B", "B", "B")
)
clean_df <- preprocess_features(df, target_col = "target", cor_thresh = 0.9)
} # }