Employee Attrition Analysis Pt.2 w/ Different ML Models
Canterra has a high attrition rate. Machine Learning (ML) applied to employee data can identify the factors that contribute to attrition and inform strategies for reducing it. But which models provide the most accurate insights?
Executive Summary
Employee attrition is a major concern among employers: it decreases productivity and significantly increases costs, both of which hurt profitability. Canterra's attrition rate is high. Using the company's employee data, we applied Machine Learning (ML) to identify the factors that contribute to attrition and to suggest strategies for reducing it. Across the two models we ran, Total Working Years, Years at the Company, and Years with the Current Manager were the most influential factors. After tuning, the Support Vector Machine model was the more accurate of the two. We recommend three actions. First, hire a slightly larger number of employees with more experience, while keeping in mind that new hires remain important for company growth and overall survival. Second, create a milestone and gift-giving program to increase the number of years an employee stays with the company. Third, build a developmental pathway, with managerial and director levels based on experience and tenure, so that both managers and individual contributors can develop, advance in title, and take on more responsibility without necessarily leaving the company or moving to a different department.
Background
Our client requested an analysis of attrition because it has been experiencing high employee turnover. To find out which factors contribute to attrition, we used two Machine Learning algorithms, k-Nearest Neighbor (kNN) and Support Vector Machine (SVM), to determine which variables have the most significant impact. The models were built on a dataset provided by the client that covers age, employee ID, income, job level, marital status, and other variables. We compare the results of the two models and report the three most influential factors contributing to attrition.
Methods & Diagnostics
Preprocessing
The k-Nearest Neighbor (kNN) and Support Vector Machine (SVM) models were run in RStudio using the caret, class, dplyr, pROC, and e1071 packages. First, the employee dataset was loaded into R and variables that were redundant or carried no useful information were removed. These were Standard Hours and Employee ID, neither of which had any impact on our dependent variable, Attrition. Next, the Attrition variable was converted to a factor with two levels: “yes” (the employee left) and “no” (the employee remained). A seed was set for reproducibility, and the data was then split into a training set (70%) and a testing set (30%). A proportion table of the Attrition variable showed that roughly 84% of observations were “no” and roughly 16% were “yes”. In other words, the vast majority of employees remained at the company. Finally, missing values were handled by replacing the NAs in each affected column with that column's mean.
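The preprocessing steps above can be illustrated with a minimal sketch. The analysis itself was done in R; this is plain Python on made-up toy values, and the helper names (`impute_mean`, `split_train_test`, `proportions`) and the seed value are assumptions for illustration, not the code used in the report.

```python
import random
from collections import Counter

def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def split_train_test(rows, train_frac=0.7, seed=42):
    """Shuffle with a fixed seed for reproducibility, then cut at train_frac."""
    rng = random.Random(seed)
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    cut = int(round(train_frac * len(rows)))
    return [rows[i] for i in idx[:cut]], [rows[i] for i in idx[cut:]]

def proportions(labels):
    """Share of each class, similar in spirit to a proportion table."""
    counts = Counter(labels)
    return {label: count / len(labels) for label, count in counts.items()}

# Hypothetical toy column with one missing value.
income = [40.0, None, 60.0, 80.0]
imputed = impute_mean(income)

# Hypothetical labels mimicking the roughly 84% / 16% split in the report.
labels = ["no"] * 84 + ["yes"] * 16
train, test = split_train_test(labels)
print(imputed, len(train), len(test), proportions(labels))
```

With a class split this uneven, overall accuracy needs to be read against the share of the majority class.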
k-Nearest Neighbor (kNN)
To address our employee attrition problem, we wanted to know whether k-Nearest Neighbor (kNN) would be the best model for predicting employee attrition. kNN is a supervised learning algorithm, used here for classification. It is a simple algorithm in which each observation is predicted based on its similarity to other observations. The idea can be summed up as "if it walks like a duck and quacks like a duck, then it is probably a duck." A new observation is classified using the labels of its nearest "neighbors" in the training data. The number of neighbors consulted is set by the value of "k", which can be chosen in several ways.
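As a rough illustration of the neighbor-voting idea, the sketch below classifies a new point by majority vote among its k closest training points. It is plain Python rather than the R code used in the analysis, and the function name `knn_predict`, the two features, and the toy values are all hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k):
    """Return the majority label among the k training points closest to query."""
    # Pair each training point's Euclidean distance to the query with its label.
    dists = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    top_k = [label for _, label in dists[:k]]
    return Counter(top_k).most_common(1)[0][0]

# Hypothetical toy data: (total working years, years at company) -> attrition.
train_X = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.5)]
train_y = ["yes", "yes", "yes", "no", "no"]

pred_short = knn_predict(train_X, train_y, (1.2, 1.5), k=3)
pred_long = knn_predict(train_X, train_y, (8.5, 8.0), k=3)
print(pred_short, pred_long)
```

An odd k avoids tied votes when there are only two classes.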
The advantages of kNN modeling are that it is non-parametric, so it makes no prior assumptions about our employee data, and that it is versatile, relatively simple to build and implement, and still highly useful.
The disadvantages are that it slows down as the number of predictors (independent variables) increases, and that the result depends heavily on the choice of distance function and the number of neighbors (the value of k). A final major downside is that it does not help us fully understand how features are related to classes.
During the preprocessing of our employee data, we had to scale our predictors. Scaling normalizes the features of the dataset so that they are all on a similar scale, either by giving them similar ranges or by giving them similar means and standard deviations. This is important because kNN uses the distance between observations in the feature space to determine the closest neighbors. Without scaling, features with larger values would have a greater influence on the distance calculation, which could bias the model towards those features.
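The effect of scaling on distance can be seen in a small sketch. It uses z-score standardization, which is one common choice; the report does not say which scaling method was applied, so that choice, the function names, and the toy income and tenure values are assumptions. The code is plain Python rather than the R used in the analysis.

```python
import math
from statistics import mean, pstdev

def standardize(rows):
    """Z-score each column: subtract the column mean, divide by its std dev."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) for c in cols]
    return [tuple((v - m) / s for v, m, s in zip(row, mus, sds)) for row in rows]

def nearest_index(rows, query_idx):
    """Index of the row closest (Euclidean) to rows[query_idx], excluding itself."""
    q = rows[query_idx]
    others = [i for i in range(len(rows)) if i != query_idx]
    return min(others, key=lambda i: math.dist(rows[i], q))

# Hypothetical columns: (income, years at company).
rows = [
    (50000.0, 1.0),   # the employee we are looking up
    (50500.0, 20.0),  # similar income, very different tenure
    (53000.0, 2.0),   # somewhat different income, similar tenure
    (90000.0, 10.0),  # different on both
]

raw_nn = nearest_index(rows, 0)                  # income dominates the distance
scaled_nn = nearest_index(standardize(rows), 0)  # both features contribute
print(raw_nn, scaled_nn)
```

On the raw values the income column decides the neighbor almost alone; after standardization the employee with similar tenure becomes the closest match.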
Once we started modeling, we used a rule of thumb to set an initial value for K, which gave K=56 as a starting point. We then tested values of K from 0 to 75 and found that K=1 over-fit the data and that the optimal K was 15 (Appendix 1: K-Values vs. Accuracy). The caret package confirmed the result of our initial model, which was built with the base package: it also selected K=15 as the best value (Appendix 3: CARET K-Values). The models from the two packages had similar Area Under the Curve (AUC) values, 75% for the base package and 79% for caret (Appendix 4: ROC Curve for Models 1 & 2).
Both models showed that the three most important variables for prediction were Total Working Years, Years at the Company, and Years with the Current Manager (Appendix 2: Important Variables). Model accuracy was 86%, which is only modestly above the roughly 84% that always predicting “no” would achieve, so the ROC curve is the better indication that the classifier did well in predicting attrition.
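The search over K described above can be sketched as a loop that scores each candidate on held-out data. This is a plain-Python illustration on a tiny made-up dataset with two deliberately noisy labels; it is not the R code used in the report, and the function names and candidate values are assumptions.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k):
    """Majority label among the k nearest training points."""
    dists = sorted((math.dist(x, query), y) for x, y in zip(train_X, train_y))
    return Counter(y for _, y in dists[:k]).most_common(1)[0][0]

def accuracy(train_X, train_y, eval_X, eval_y, k):
    """Share of evaluation points whose predicted label matches the true one."""
    hits = sum(knn_predict(train_X, train_y, x, k) == y
               for x, y in zip(eval_X, eval_y))
    return hits / len(eval_y)

# Hypothetical one-feature data (years at company); two labels are noisy.
train_X = [(1.0,), (2.0,), (3.0,), (4.0,), (6.0,), (7.0,), (8.0,), (9.0,)]
train_y = ["yes", "yes", "no", "yes", "no", "yes", "no", "no"]
val_X = [(2.9,), (3.2,), (6.8,), (7.3,), (1.5,), (8.5,)]
val_y = ["yes", "yes", "no", "no", "yes", "no"]

candidates = [1, 3, 5]
train_acc = {k: accuracy(train_X, train_y, train_X, train_y, k) for k in candidates}
val_acc = {k: accuracy(train_X, train_y, val_X, val_y, k) for k in candidates}
best_k = max(candidates, key=val_acc.get)  # first candidate with top accuracy
print(train_acc, val_acc, best_k)
```

With K=1 every training point is its own nearest neighbor, so training accuracy is perfect while held-out accuracy suffers from the noisy labels, which is the over-fitting pattern the text describes.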
Support Vector Machine (SVM)
While the data for the kNN model had to be scaled beforehand, one benefit of the svm function in the e1071 package is that it scales the data automatically. First, an SVM model was fit to the training data with the svm function. To make the model a classifier, the type argument was set to C-classification. A linear kernel was chosen as a starting point, although it may not deliver optimal results. Next, a smaller cost parameter was chosen, which reduced the number of support vectors from 1183 in the first model to 1057 in the second. The tune function was then used to perform ten-fold cross-validation. We then obtained the predicted classes and predicted probabilities from this model and summarized them in a confusion matrix. This initial confusion matrix reported 83% accuracy with very low sensitivity and very high specificity. Because roughly 84% of employees are in the “no” class, that accuracy is about what a model predicting “no” for everyone would achieve, so the model clearly needed further changes.
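To see why high accuracy paired with very low sensitivity is a warning sign on imbalanced data, consider a minimal sketch of a model that always predicts "no". It is plain Python, not the R confusion matrix used in the report; the function name is hypothetical, and treating "yes" (the employee left) as the positive class is an assumption.

```python
def confusion_metrics(actual, predicted, positive="yes"):
    """Accuracy, sensitivity, and specificity from paired label lists."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Sensitivity: share of actual leavers the model catches.
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        # Specificity: share of actual stayers the model labels correctly.
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }

# Hypothetical test set with the report's approximate 84% / 16% class split.
actual = ["no"] * 84 + ["yes"] * 16
always_no = ["no"] * 100  # a "model" that never predicts attrition

baseline = confusion_metrics(actual, always_no)
print(baseline)
```

The baseline reaches 84% accuracy without identifying a single departing employee, so sensitivity and specificity are needed alongside accuracy.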
Next, to improve the model, we switched to a Gaussian kernel (“radial”) with a gamma of 0.05; every kernel except the linear one needs a gamma value. The lower the gamma, the less curvature in the decision boundary. We then used the tune function to find the best combination of cost and gamma, and the best-performing combination had an error of 0.0214. To evaluate the radial model's predictions, we built another confusion matrix, which showed 97.9% accuracy with a sensitivity of 0.87 and a specificity of 1.
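The role of gamma can be illustrated with the radial basis kernel itself, which has the form exp(-gamma * ||u - v||^2). The sketch below is plain Python, not the e1071 call used in the report; the function name and the two toy points are hypothetical.

```python
import math

def rbf_kernel(u, v, gamma):
    """Radial (Gaussian) kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)

u, v = (0.0, 0.0), (2.0, 0.0)  # hypothetical points, squared distance 4

k_low = rbf_kernel(u, v, gamma=0.05)  # low gamma: distant points still similar
k_high = rbf_kernel(u, v, gamma=5.0)  # high gamma: similarity decays quickly
print(k_low, k_high)
```

With a low gamma, each training point influences a wide region, which gives a smoother boundary; with a high gamma, influence is very local and the boundary can bend tightly around individual points.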
The trainControl function of the caret package was used to set up parameter tuning with ten-fold cross-validation and to calculate class probabilities. We ran the model, inserted the cross-validated parameters into a new model, and plotted the results. As can be seen in Appendix 5: Accuracy v. Cost, accuracy increases as cost increases in our cross-validated model. Our best model using this method had an accuracy of 95.98%, a cost of 128, and a Kappa of 0.847. Appendix 6: Variables of Importance lists the variables that the SVM model identified as influential for attrition. According to our model, the three most important are the total number of years an employee has worked, the number of years the employee has worked at the company, and the number of years the employee has spent with their current manager.
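Kappa is reported next to accuracy because it corrects for the agreement expected by chance, which matters when one class dominates. A minimal sketch of Cohen's kappa follows, in plain Python rather than R; the function name and the confusion-matrix counts are made up for illustration and are not the report's results.

```python
def cohen_kappa(tp, fn, fp, tn):
    """Cohen's kappa from the four cells of a 2x2 confusion matrix."""
    n = tp + fn + fp + tn
    observed = (tp + tn) / n  # plain accuracy
    # Agreement expected if predictions were independent of the true labels.
    expected = ((tp + fn) / n) * ((tp + fp) / n) + ((fp + tn) / n) * ((fn + tn) / n)
    return (observed - expected) / (1 - expected)

# Hypothetical counts: 70% accuracy on a balanced problem.
kappa_balanced = cohen_kappa(tp=20, fn=5, fp=10, tn=15)

# Always predicting "no" on an 84% / 16% split: 84% accuracy, zero kappa.
kappa_always_no = cohen_kappa(tp=0, fn=16, fp=0, tn=84)
print(kappa_balanced, kappa_always_no)
```

A kappa near 0 means the model does no better than chance agreement, while values approaching 1 indicate agreement well beyond what the class balance alone would produce.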
The last step was to plot the ROC curve, shown in Appendix 7: ROC Curve (SVM), and calculate the Area Under the Curve (AUC). The AUC is 0.517 for the first model, 0.996 for the second, and 0.953 for the third. Since an AUC of 0.5 corresponds to random guessing, the first model barely separates the two classes.
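AUC can be read as the probability that a randomly chosen employee who left receives a higher predicted score than a randomly chosen employee who stayed. The sketch below computes it directly from that definition in plain Python; it is not the pROC call used in the report, and the function name and scores are hypothetical.

```python
def auc(scores, labels, positive="yes"):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, l in zip(scores, labels) if l == positive]
    neg = [s for s, l in zip(scores, labels) if l != positive]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))

# Hypothetical predicted probabilities of attrition.
labels = ["yes", "no", "yes", "no"]
auc_mixed = auc([0.9, 0.8, 0.3, 0.2], labels)    # one pair ranked wrongly
auc_perfect = auc([0.9, 0.1, 0.8, 0.2], labels)  # every leaver outranks every stayer
auc_constant = auc([0.5, 0.5, 0.5, 0.5], labels) # scores carry no information
print(auc_mixed, auc_perfect, auc_constant)
```

Because it depends only on how observations are ranked, AUC is not inflated by class imbalance in the way raw accuracy can be.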
Recommendations
Because both models identified Total Working Years, Years at the Company, and Years with Current Manager as the most important variables for predicting attrition, we recommend the following:
Strategically hire more experienced employees, without letting an aging workforce threaten the company's long-term survival. Create pathways where younger employees can learn from and work alongside tenured employees.
Create an employee recognition, milestone, and gift-giving program to increase the number of years an employee stays with the company.
Create a developmental pathway where managers, employees, and executives alike can develop, advance through the hierarchy, and take on greater responsibility and a higher job title without necessarily leaving the company or switching departments.
Conclusion
Predicting employee attrition can be a challenging task for organizations, yet attrition has a significant impact on a company's productivity and profitability. For that reason, it is important to understand the model used to identify the contributing factors. The SVM model is the best model to proceed with, and we believe the company should implement the recommendations above, because both models show that the identified variables contribute strongly to attrition and need to be addressed.