We achieved an accuracy of 66% and an AUC-ROC score of 0.69. I do not allow anyone to claim ownership of my analysis, and I expect that they give due credit in their own use cases.

Prior to modeling, it is essential to encode all categorical features (both the target feature and the descriptive features) into a set of numerical features. Don't label encode null values, since I want to keep missing data marked as null for imputing later. Note: 8 features have missing values.

The plot shows the distribution of quantitative data across several levels of one (or more) categorical variables such that those distributions can be compared. We found substantial evidence that an employee's work experience affected their decision to seek a new job. Answer: looking at the categorical variables, experience and being a full-time student are good indicators.

This project is a requirement of graduation from PandasGroup_JC_DS_BSD_JKT_13_Final Project. Information regarding how the data was collected is currently unavailable.

To summarize our data, we created a correlation matrix to see whether and how strongly pairs of variables were related. As we can see from it (and from many more plots that we observed), some of our data is imbalanced. Second, some of the features are similarly imbalanced, such as gender.

HR Analytics: Job Change of Data Scientist, by Lim Jie-Ying. Exploring the numerical features in the data: what is the correlation between the city development index and training hours?

CatBoost can do this automatically through a setting. With the number of iterations fixed at 372, I ran k-fold cross-validation.
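The encoding rule described here (encode categories as integers, but leave missing entries untouched for later imputation) can be sketched in plain Python. This is a minimal illustration, not the author's code: the helper names `label_encode_keep_missing` and `decode_rounded` and the sample values are hypothetical, and missing entries are assumed to be represented as `None`. The second helper shows how fractional codes produced by an imputer could be rounded back to valid categories.

```python
def label_encode_keep_missing(values):
    """Map each non-missing category to an integer code; leave None as None."""
    # Sorting the unique categories makes the code assignment deterministic.
    categories = sorted({v for v in values if v is not None})
    mapping = {cat: code for code, cat in enumerate(categories)}
    codes = [mapping[v] if v is not None else None for v in values]
    return codes, mapping


def decode_rounded(imputed_codes, mapping):
    """Round imputed (possibly fractional) codes and map them back to categories."""
    inverse = {code: cat for cat, code in mapping.items()}
    max_code = len(inverse) - 1
    decoded = []
    for code in imputed_codes:
        # Clip after rounding so that every value decodes to a valid category.
        nearest = min(max(int(round(code)), 0), max_code)
        decoded.append(inverse[nearest])
    return decoded


# Hypothetical sample column with one missing entry.
levels = ["Graduate", None, "Masters", "Graduate", "Phd"]
codes, mapping = label_encode_keep_missing(levels)
print(codes)  # [0, None, 1, 0, 2]

# Pretend an imputer filled the missing slot with 1.4.
print(decode_rounded([0, 1.4, 1, 0, 2], mapping))
```

Keeping the missing entries out of the mapping means an imputer never sees "missing" as if it were a real category.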
To improve candidate selection in their recruitment processes, a company collects data and builds a model to predict whether a candidate will continue to work at the company or not. The hiring process could be time- and resource-consuming if the company targets all candidates based only on their training participation. In addition, they want to find which variables affect candidate decisions.

A company which is active in Big Data and Data Science wants to hire data scientists among people who successfully pass some courses conducted by the company. Many people sign up for their training. However, according to a survey, it seems some candidates leave the company once trained.

LightGBM is reported to be almost 7 times faster than XGBoost and is a much better approach when dealing with large datasets.

For this, I looked into the odds and the Weight of Evidence (WoE) that the variables provide. We calculated the distribution of experience among the employees in our dataset for a better understanding of experience as a factor that impacts the employee's decision. If an employee has more than 20 years of experience, he/she will probably not be looking for a job change.

Juan Antonio Suwardi - antonio.juan.suwardi@gmail.com

The training data has 14 features on 19158 observations, and the testing dataset has 13 features on 2129 observations. In order to control for the size of the target groups, I made a function to plot a stackplot to visualize correlations between variables.
The project proceeds in stages: a brief analysis of the data; further information analysis; data preparation and processing before the data is used for modeling; building and optimizing the machine learning model; an explanation of the model's decision making; and applying machine learning to solve the business problem and meet the business objective. It is still not efficient, because fewer people want to change jobs than not.

All code can be found on GitHub. The training dataset with 20133 observations is used for model building, and the built model is validated on the validation dataset of 8629 observations.

Some notes about the data: the data is imbalanced, most features are categorical, some with high cardinality, and missing-value imputation can be part of the pipeline (https://www.kaggle.com/arashnic/hr-analytics-job-change-of-data-scientists?select=sample_submission.csv). Using model(s) built on the current credentials, demographics, and experience data, you need to predict the probability that a candidate is looking for a new job or will work for the company, and interpret the factors that affect the employee's decision.

Generally, the higher the AUC-ROC, the better the model is at predicting the classes. For our second model, we used a Random Forest Classifier.

In this article, I will showcase visualizing a dataset containing categorical and numerical data, and also build a pipeline that deals with missing data and imbalanced data and predicts a binary outcome. This project includes data analysis, machine learning modeling, and visualization with SHAP, using 13 features and 19158 observations.
March 2, 2021. Here is the link: https://www.kaggle.com/datasets/arashnic/hr-analytics-job-change-of-data-scientists. RPubs link: https://rpubs.com/ShivaRag/796919. The goal is to classify the employees into a staying or leaving category using predictive classification models.

In this project I want to explore the people who join the company's data science training and whether they are interested in changing jobs or in becoming data scientists at the company. To the RF model, experience is the most important predictor.

https://www.kaggle.com/arashnic/hr-analytics-job-change-of-data-scientists/tasks?taskId=3015. There are 3 things that I looked at. Description of dataset: the dataset I am planning to use is from Kaggle. I do not own the dataset, which is available publicly on Kaggle.

Group 19 - HR Analytics: Job Change of Data Scientists, by Tan Wee Kiat. Variable 3: Discipline Major.

We hope to use more models in the future for even better efficiency! Furthermore, after splitting our dataset into a training dataset (75%) and a testing dataset (25%) using train_test_split from sklearn, we noticed an imbalance in our label which could have led to bias in the model. Consequently, we used the SMOTE method to over-sample the minority class. Next, we need to convert categorical data to numeric format because sklearn cannot handle it directly.

The task is to predict the probability that a candidate will work for the company. This demand and the plentiful opportunities give greater flexibility to those who are lucky enough to work in the field.

2023 Data Computing Journal.
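The idea behind SMOTE, as used on the training portion after the split, can be illustrated with a small standard-library sketch. It is a simplified stand-in under stated assumptions, not the implementation the authors ran (libraries such as imbalanced-learn provide a full one): the function name `smote_like_oversample`, the parameter defaults, and the toy points are all hypothetical. Each synthetic sample is placed at a random position on the segment between a minority point and one of its nearest minority neighbours.

```python
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours.

    `minority` is a list of numeric feature vectors (at least two of them).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to `base`.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        # Pick a random point on the segment between `base` and its neighbour.
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy training data: 10 majority rows versus 4 minority rows (hypothetical values).
minority_train = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
n_majority = 10
new_points = smote_like_oversample(minority_train, n_new=n_majority - len(minority_train))
print(len(minority_train) + len(new_points))  # 10, the classes are now balanced
```

Only the training rows are oversampled here; the test rows are left as they are so that evaluation still reflects the real class balance.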
Therefore we can conclude that the type of company definitely matters in terms of job satisfaction, even though there is no apparent correlation between satisfaction and company size. The relatively small gap in accuracy and AUC scores suggests that the model did not significantly overfit.

In other words, if target=0 and target=1 were to have the same size, people enrolled in a full-time course would be more likely to be looking for a job change than not. To achieve this purpose, we created a model that can be used to predict the probability of a candidate considering working for another company, based on the company's and the candidate's key characteristics.

The pipeline I built for the analysis consists of 5 parts. After hyperparameter tuning, I ran the final trained model using the optimal hyperparameters on both the train and the test set, to compute the confusion matrix, accuracy, and ROC curves for both. Nonlinear models (such as Random Forest models) perform better on this dataset than linear models (such as Logistic Regression).

One value appeared as Oct-49, and in pandas it was printed as 10/49, so we need to convert it into np.nan (NaN), i.e., NumPy's null or missing entry.
The dataset is imbalanced and most features are categorical (nominal, ordinal, binary), some with high cardinality. Using the Random Forest model we were able to increase our accuracy to 78% and our AUC-ROC to 0.785.

HR Analytics: Job Change of Data Scientists, by Azizattia (Medium).

Take a shot at building a baseline model that shows a basic metric. This dataset is designed to help understand the factors that lead a person to leave their current job, for HR research as well. It involves using model(s) to predict the probability that a candidate will look for a new job or will work for the company, as well as interpreting the factors that affect the employee's decision.

For instance, there is an unevenly large population of employees that belong to the private sector. The whole dataset is divided into train and test sets. I also wanted to see how the categorical features related to the target variable. Therefore, if an organization wants to try to keep an employee, it might be a good idea to have a balance of candidates from other disciplines along with STEM.

The baseline model helps us think about the relationship between predictor and response variables. These are the 4 most important features of our model. The baseline model scores 0.74 ROC AUC without any feature engineering steps. As a very basic approach to modelling, I used the most common model, logistic regression.

Exploring the categorical features in the data using odds and WoE.
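The odds and Weight of Evidence (WoE) for one categorical feature can be computed as in the following sketch. The convention is an assumption, since the text does not give a formula: odds are taken as events over non-events within a category, and WoE as the natural log of the category's share of all events divided by its share of all non-events (some references use the inverse ratio). The function name `odds_and_woe` and the sample data are hypothetical.

```python
import math
from collections import Counter


def odds_and_woe(categories, targets):
    """Return {category: (odds, woe)} for a binary target coded as 0/1."""
    events = Counter(c for c, t in zip(categories, targets) if t == 1)
    non_events = Counter(c for c, t in zip(categories, targets) if t == 0)
    total_events = sum(events.values())
    total_non_events = sum(non_events.values())
    result = {}
    for cat in sorted(set(categories)):
        e, n = events[cat], non_events[cat]
        if e == 0 or n == 0:
            # Odds and WoE are undefined when a category has no events or no
            # non-events; this sketch simply skips such categories.
            continue
        odds = e / n
        woe = math.log((e / total_events) / (n / total_non_events))
        result[cat] = (odds, woe)
    return result


# Hypothetical enrollment values with target=1 meaning "looking for a job change".
enrolled = ["full_time"] * 4 + ["none"] * 6
target = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
for cat, (odds, woe) in odds_and_woe(enrolled, target).items():
    print(cat, round(odds, 3), round(woe, 3))
```

Under this convention a positive WoE marks a category that is over-represented among job changers, and a negative WoE marks one that is under-represented.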
So I moved on to using other variables to try to predict education_level, but first I had to make some changes to the data: I changed the gender and education level columns.

In the end, the HR Department can have more options to recruit with the same budget compared with the old method, and also have more time to focus on candidate qualifications and bring the best candidates to the company. Information related to demographics, education, and experience is available from candidates' signup and enrollment. This allows the company to reduce cost and time and to improve the quality of training, course planning, and the categorization of candidates.

For this, I used pandas profiling. We have seen rampant demand for data-driven technologies in this era, and one of the key careers fueling it is that of the data scientist, which has gained the title of one of the sexiest jobs out there.

Note that after imputing, I round the imputed label-encoded categories so they can be decoded as valid categories.

Job Change of Data Scientists Using Raw, Encode, and PCA Data, by M Aji Pangestu.

More specifically, the majority of the target=0 group resides in highly developed cities, whereas the target=1 group is split between cities with a high and a low city development index (CDI). This is therefore one important factor for a company to consider when deciding on a location to begin in or relocate to.

Synthetically sampling the data using the Synthetic Minority Oversampling Technique (SMOTE) results in the best-performing Logistic Regression model, as seen from its having the highest F1 and recall scores.
AUC-ROC tells us how capable the model is of distinguishing between classes.

As for the preparation of the data, as with many Kaggle example datasets, it had already been cleaned and structured; the only thing I needed to work on was to identify null values and think of a way to manage them. The heatmap shows the correlation of missingness between every pair of columns. To handle the class imbalance, the Synthetic Minority Oversampling Technique (SMOTE) is used.

Random forest builds multiple decision trees and merges them together to get a more accurate and stable prediction. XGBoost is a scalable and accurate implementation of gradient boosting machines. It has proven to push the limits of computing power for boosted-tree algorithms, as it was built and developed for the sole purpose of model performance and computational speed.

Questionnaire: a list of questions (including answers) to identify candidates who will work for the company or will look for a new job.

When creating our model, the dominant category may override the others because it accounts for 88% of all major discipline values.
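One way to make the AUC-ROC statement concrete: the score equals the fraction of (positive, negative) pairs in which the positive example receives the higher predicted score, with ties counted as half. The sketch below computes it directly from that definition. The function name `roc_auc` and the toy scores are hypothetical, and the quadratic pairwise loop is chosen for clarity rather than speed.

```python
def roc_auc(y_true, y_score):
    """AUC-ROC as the share of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0  # positive ranked above negative
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(pos) * len(neg))


# Toy labels and predicted probabilities of "looking for a job change".
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

A value of 0.5 corresponds to ranking the two classes no better than chance, and 1.0 to ranking every positive above every negative.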
Features:

city_development_index: Development index of the city (scaled)
relevent_experience: Relevant experience of candidate
enrolled_university: Type of university course enrolled in, if any
education_level: Education level of candidate
major_discipline: Education major discipline of candidate
experience: Candidate total experience in years
company_size: Number of employees in current employer's company
last_new_job: Difference in years between previous job and current job

Pre-processing steps:

Resampling to tackle the unbalanced data issue
Numerical feature normalization between 0 and 1
Principal Component Analysis (PCA) to reduce data dimensionality

Determine a suitable metric to rate the performance of the model.

Recommendation: the data suggests that employees who have been at the company for less than a year, or for 1 or 2 years, are more likely to leave than someone who has been at the company for 4+ years.

From the graph, we were able to determine that most people who were satisfied with their job belonged to more developed cities. I made a stackplot for each categorical feature and target, but for the clarity of the post I am only showing the stackplot for enrolled_course and target.
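The step of normalizing numerical features between 0 and 1 is commonly done with min-max scaling, x' = (x - min) / (max - min). The sketch below assumes that is the method meant here, which the text does not confirm. The function name `min_max_scale` and the sample training hours are hypothetical.

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map every entry to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical training_hours values.
print(min_max_scale([10, 20, 40]))
```

In a train/test setting, the minimum and maximum would normally be taken from the training data only and then reused to scale the test data.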