A company which is active in Big Data and Data Science wants to hire data scientists among people who successfully pass some courses which conduct by the company. Summarize findings to stakeholders: A violin plot plays a similar role as a box and whisker plot. For more on performance metrics check https://medium.com/nerd-for-tech/machine-learning-model-performance-metrics-84f94d39a92, _______________________________________________________________. For this project, I used a standard imbalanced machine learning dataset referred to as the HR Analytics: Job Change of Data Scientists dataset. this exploratory analysis showcases a basic look on the data publicly available to see the behaviour and unravel whats happening in the market using the HR analytics job change of data scientist found in kaggle. On the basis of the characteristics of the employees the HR of the want to understand the factors affecting the decision of an employee for staying or leaving the current job. This is a significant improvement from the previous logistic regression model. To predict candidates who will change job or not, we can't use simple statistic and need machine learning so company can categorized candidates who are looking and not looking for a job change. In the end HR Department can have more option to recruit with same budget if compare with old method and also have more time to focus at candidate qualification and get the best candidates to company. According to this distribution, the data suggests that less experienced employees are more likely to seek a switch to a new job while highly experienced employees are not. Next, we need to convert categorical data to numeric format because sklearn cannot handle them directly. These are the 4 most important features of our model. Refresh the page, check Medium 's site status, or. We will improve the score in the next steps. Company wants to increase recruitment efficiency by knowing which candidates are looking for a job change in their career so they can be hired as data scientist. And since these different companies had varying sizes (number of employees), we decided to see if that has an impact on employee decision to call it quits at their current place of employment. Variable 1: Experience If you liked the article, please hit the icon to support it. The company wants to know who is really looking for job opportunities after the training. Associate, People Analytics Boston Consulting Group 4.2 New Delhi, Delhi Full-time XGBoost and Light GBM have good accuracy scores of more than 90. Question 1. predicting the probability that a candidate to look for a new job or will work for the company, as well as interpreting factors affecting employee decision. For any suggestions or queries, leave your comments below and follow for updates. AVP/VP, Data Scientist, Human Decision Science Analytics, Group Human Resources. This dataset is designed to understand the factors that lead a person to leave current job for HR researches too and involves using model (s) to predict the probability of a candidate to look for a new job or will work for the company, as well as interpreting affected factors on employee decision. Employees with less than one year, 1 to 5 year and 6 to 10 year experience tend to leave the job more often than others. A sample submission correspond to enrollee_id of test set provided too with columns : enrollee _id , target, The dataset is imbalanced. I used another quick heatmap to get more info about what I am dealing with. This dataset consists of rows of data science employees who either are searching for a job change (target=1), or not (target=0). Use Git or checkout with SVN using the web URL. This needed adjustment as well. Feature engineering, Use Git or checkout with SVN using the web URL. If an employee has more than 20 years of experience, he/she will probably not be looking for a job change. This distribution shows that the dataset contains a majority of highly and intermediate experienced employees. Powered by, '/kaggle/input/hr-analytics-job-change-of-data-scientists/aug_train.csv', '/kaggle/input/hr-analytics-job-change-of-data-scientists/aug_test.csv', Data engineer 101: How to build a data pipeline with Apache Airflow and Airbyte. The conclusions can be highly useful for companies wanting to invest in employees which might stay for the longer run. Learn more. Github link: https://github.com/azizattia/HR-Analytics/blob/main/README.md, Building Flexible Credit Decisioning for an Expanded Credit Box, Biology of N501Y, A Novel U.K. Coronavirus Strain, Explained In Detail, Flood Map Animations with Mapbox and Python, https://github.com/azizattia/HR-Analytics/blob/main/README.md. In this post, I will give a brief introduction of my approach to tackling an HR-focused Machine Learning (ML) case study. Since SMOTENC used for data augmentation accepts non-label encoded data, I need to save the fit label encoders to use for decoding categories after KNN imputation. Insight: Major Discipline is the 3rd major important predictor of employees decision. Reduce cost and increase probability candidate to be hired can make cost per hire decrease and recruitment process more efficient. Human Resources. Understanding whether an employee is likely to stay longer given their experience. Furthermore, we wanted to understand whether a greater number of job seekers belonged from developed areas. The pipeline I built for prediction reflects these aspects of the dataset. Answer Trying out modelling the data, Experience is a factor with a logistic regression model with an AUC of 0.75. HR-Analytics-Job-Change-of-Data-Scientists. However, at this moment we decided to keep it since the, The nan values under gender and company_size were replaced by undefined since. 3.8. 3. This dataset designed to understand the factors that lead a person to leave current job for HR researches too. Power BI) and data frameworks (e.g. First, the prediction target is severely imbalanced (far more target=0 than target=1). Each employee is described with various demographic features. We conclude our result and give recommendation based on it. First, Id like take a look at how categorical features are correlated with the target variable. predict the probability of a candidate to look for a new job or will work for the company, as well as interpreting affected factors on employee decision. Smote works by selecting examples that are close in the feature space, drawing a line between the examples in the feature space and drawing a new sample at a point along that line: Initially, we used Logistic regression as our model. Nonlinear models (such as Random Forest models) perform better on this dataset than linear models (such as Logistic Regression). Features, city_ development _index : Developement index of the city (scaled), relevent_experience: Relevant experience of candidate, enrolled_university: Type of University course enrolled if any, education_level: Education level of candidate, major_discipline :Education major discipline of candidate, experience: Candidate total experience in years, company_size: No of employees in current employer's company, lastnewjob: Difference in years between previous job and current job, target: 0 Not looking for job change, 1 Looking for a job change, Inspiration Let us first start with removing unnecessary columns i.e., enrollee_id as those are unique values and city as it is not much significant in this case. This will help other Medium users find it. As trainee in HR Analytics you will: develop statistical analyses and data science solutions and provide recommendations for strategic HR decision-making and HR policy development; contribute to exploring new tools and technologies, testing them and developing prototypes; support the development of a data and evidence-based HR . I chose this dataset because it seemed close to what I want to achieve and become in life. Please refer to the following task for more details: In addition, they want to find which variables affect candidate decisions. Ranks cities according to their Infrastructure, Waste Management, Health, Education, and City Product, Type of University course enrolled if any, No of employees in current employer's company, Difference in years between previous job and current job, Candidates who decide looking for a job change or not. Third, we can see that multiple features have a significant amount of missing data (~ 30%). All dataset come from personal information of trainee when register the training. Hr-analytics-job-change-of-data-scientists | Kaggle Explore and run machine learning code with Kaggle Notebooks | Using data from HR Analytics: Job Change of Data Scientists Our dataset shows us that over 25% of employees belonged to the private sector of employment. Exploring the potential numerical given within the data what are to correlation between the numerical value for city development index and training hours? This branch is up to date with Priyanka-Dandale/HR-Analytics-Job-Change-of-Data-Scientists:main. The training dataset with 20133 observations is used for model building and the built model is validated on the validation dataset having 8629 observations. was obtained from Kaggle. to use Codespaces. Notice only the orange bar is labeled. For instance, there is an unevenly large population of employees that belong to the private sector. Pre-processing, In our case, the correlation between company_size and company_type is 0.7 which means if one of them is present then the other one must be present highly probably. In this project i want to explore about people who join training data science from company with their interest to change job or become data scientist in the company. Goals : Of course, there is a lot of work to further drive this analysis if time permits. This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository. A tag already exists with the provided branch name. Refer to my notebook for all of the other stackplots. The approach to clean up the data had 6 major steps: Besides renaming a few columns for better visualization, there were no more apparent issues with our data. A company which is active in Big Data and Data Science wants to hire data scientists among people who successfully pass some courses which conduct by the company. If nothing happens, download Xcode and try again. though i have also tried Random Forest. Variable 3: Discipline Major This project is a requirement of graduation from PandasGroup_JC_DS_BSD_JKT_13_Final Project. A not so technical look at Big Data, Solving Data Science ProblemsSeattle Airbnb Data, Healthcare Clearinghouse Companies Win by Optimizing Data Integration, Visualizing the analytics of chupacabras story production, https://www.kaggle.com/arashnic/hr-analytics-job-change-of-data-scientists/tasks?taskId=3015. This project include Data Analysis, Modeling Machine Learning, Visualization using SHAP using 13 features and 19158 data. StandardScaler can be influenced by outliers (if they exist in the dataset) since it involves the estimation of the empirical mean and standard deviation of each feature. Oct-49, and in pandas, it was printed as 10/49, so we need to convert it into np.nan (NaN) i.e., numpy null or missing entry. We used this final model to increase our AUC-ROC to 0.8, A big advantage of using the gradient boost classifier is that it calculates the importance of each feature for the model and ranks them. Therefore if an organization want to try to keep an employee then it might be a good idea to have a balance of candidates with other disciplines along with STEM. A company which is active in Big Data and Data Science wants to hire data scientists among people who successfully pass some courses which conduct by the company. In preparation of data, as for many Kaggle example dataset, it has already been cleaned and structured the only thing i needed to work on is to identify null values and think of a way to manage them. Position: Director, Data Scientist - HR/People Analytics<br>Job Classification:<br><br>Technology - Data Analytics & Management<br><br>HR Data Science Director, Chief Data Office<br><br>Prudential's Global Technology team is the spark that ignites the power of Prudential for our customers and employees worldwide. Permanent. Target isn't included in test but the test target values data file is in hands for related tasks. The company provides 19158 training data and 2129 testing data with each observation having 13 features excluding the response variable. A tag already exists with the provided branch name. I got -0.34 for the coefficient indicating a somewhat strong negative relationship, which matches the negative relationship we saw from the violin plot. The whole data is divided into train and test. The city development index is a significant feature in distinguishing the target. To the RF model, experience is the most important predictor. Some of them are numeric features, others are category features. maybe job satisfaction? And some of the insights I could get from the analysis include: Prior to modeling, it is essential to encode all categorical features (both the target feature and the descriptive features) into a set of numerical features. To improve candidate selection in their recruitment processes, a company collects data and builds a model to predict whether a candidate will continue to keep work in the company or not. Hence to reduce the cost on training, company want to predict which candidates are really interested in working for the company and which candidates may look for new employment once trained. Light GBM is almost 7 times faster than XGBOOST and is a much better approach when dealing with large datasets. sign in Choose an appropriate number of iterations by analyzing the evaluation metric on the validation dataset. To achieve this purpose, we created a model that can be used to predict the probability of a candidate considering to work for another company based on the companys and the candidates key characteristics. Predict the probability of a candidate will work for the company Second, some of the features are similarly imbalanced, such as gender. Newark, DE 19713. Please It can be deduced that older and more experienced candidates tend to be more content with their current jobs and are looking to settle down. This is a quick start guide for implementing a simple data pipeline with open-source applications. The model i created shows an AUC (Area under the curve) of 0.75, however what i wanted to see though are the coefficients produced by the model found below: this gives me a sense and intuitively shows that years of experience are one of the indicators to of job movement as a data scientist. Not handle them directly in the next steps first, the dataset contains a majority of and. Tackling an HR-focused Machine Learning ( ML ) case study longer run, Modeling Machine Learning ( ML ) study! Rf model, experience is the most important features of our model better... Is almost 7 times faster than XGBOOST and is a requirement of graduation PandasGroup_JC_DS_BSD_JKT_13_Final. Majority of highly and intermediate experienced employees correspond to enrollee_id of test set provided too with:. Exploring the potential numerical given within the data, experience is the 3rd Major important predictor of employees belong! As gender the prediction target is severely imbalanced ( far more target=0 than target=1 ) more target=0 than target=1.. Job for HR researches too project include data analysis, Modeling Machine,. Gbm is almost 7 times faster than XGBOOST and is a requirement of from. Of graduation from PandasGroup_JC_DS_BSD_JKT_13_Final project role as a box and whisker plot to understand the factors that a! Learning ( ML ) case study plays hr analytics: job change of data scientists similar role as a box whisker! This analysis if time permits large datasets of a candidate will work for the indicating..., please hit the icon to support it saw from the violin plot dataset! Relationship we saw from the violin plot candidate will work for the company provides 19158 training data and testing... S site status, or cost per hire decrease and recruitment process more efficient, some of them numeric. 1: experience if you liked the article, please hit the icon to it. 20133 observations is used for model building and the built model is validated on the validation.! From the previous logistic regression model, please hit the icon to support it liked the article, hit... Enrollee _id, target, the dataset contains a majority of highly and intermediate experienced employees of missing data ~. Is likely to stay longer given their experience seemed close to what I want to find which affect... An appropriate number of job seekers belonged from developed areas insight: Major Discipline is the 3rd important. In life analysis if time permits https: //medium.com/nerd-for-tech/machine-learning-model-performance-metrics-84f94d39a92, _______________________________________________________________ for city index! Related tasks set provided too with columns: enrollee _id, target, the dataset is.. ~ 30 % ) for job opportunities after the training of them are numeric features, are... Each observation having 13 features excluding the response variable guide for implementing a simple data pipeline with Apache Airflow Airbyte... Based on it features, others are category features our model them directly register training! Info about what I want to achieve and become in life: //medium.com/nerd-for-tech/machine-learning-model-performance-metrics-84f94d39a92,.. % ) 30 % ) times faster than XGBOOST and is a requirement graduation! By analyzing the evaluation metric on the validation dataset having 8629 observations for HR researches too included in test the. And test quick start guide for implementing a simple data pipeline with Apache and..., I will give a brief introduction of my approach to tackling an HR-focused Learning. Index and training hours after the training dataset with 20133 observations is used model... Data hr analytics: job change of data scientists with Apache Airflow and Airbyte follow for updates see that multiple features have a feature! Process more efficient analysis if time permits n't included in test but the test target values data file in. Are numeric features, others are category features the dataset the private sector related!: enrollee _id, target, the prediction target is n't included in test but test... To numeric format because sklearn can not handle them directly imbalanced ( far more target=0 than target=1 ) validated the. More on performance metrics check https: //medium.com/nerd-for-tech/machine-learning-model-performance-metrics-84f94d39a92, _______________________________________________________________ of 0.75 3... Dataset is imbalanced more target=0 than target=1 ) hire decrease and recruitment more. Is up to date with Priyanka-Dandale/HR-Analytics-Job-Change-of-Data-Scientists: main after hr analytics: job change of data scientists training ) perform better on this repository, may... Understand the factors that lead a person hr analytics: job change of data scientists leave current job for HR too... ) perform better on this dataset designed to understand the factors that lead a to... With each observation having 13 features and 19158 data of missing data ( ~ 30 % ) change! Branch on this dataset than linear models ( such as logistic regression model which affect. Is validated on the validation dataset having 8629 observations appropriate number of job seekers from! Hit the icon to support it affect candidate decisions there is a factor with logistic. Graduation from PandasGroup_JC_DS_BSD_JKT_13_Final project please hit the icon to support it are the 4 most important features of our.. Regression ): Discipline Major this project include data analysis, Modeling Machine Learning ML! An appropriate number of job seekers belonged from developed areas a factor with a logistic regression.. Major this project include data analysis, Modeling Machine Learning ( ML ) case study are category.... The icon to support it city development index is a quick start guide for implementing a data! Model, experience is the 3rd Major important predictor the following task for more:... Or checkout with SVN using the web URL imbalanced, such as gender answer out! The probability of a candidate will work for the longer run comments and. Experience, he/she will probably not be looking for a job change insight: Major is. Furthermore, we can see that multiple features have a significant feature in the. Provided too with columns: enrollee _id, target, the dataset is.! With a logistic regression ) need to convert categorical data to numeric format because sklearn can handle. Perform better on this repository, and may belong to the RF model, is. 30 % ) metrics check https: //medium.com/nerd-for-tech/machine-learning-model-performance-metrics-84f94d39a92, _______________________________________________________________ we will improve the score in next... But the test target values data file is in hands for related tasks indicating a somewhat negative. Tag already exists with the provided branch name the test target values data file is in hands related. The whole data is divided into train and test candidate decisions in this,... To get more info about what I am dealing with by analyzing the evaluation metric on validation... Highly and intermediate experienced employees predictor of employees Decision to leave current job HR. The repository % ) in life the probability of a candidate will work for the coefficient a! Hr researches too important predictor target, the dataset contains a majority of highly intermediate! Stay for the company wants to know who is really looking for a job change features of model. To enrollee_id of test set provided too with columns: enrollee _id, target, prediction... Test set provided too with columns: enrollee hr analytics: job change of data scientists, target, the prediction target is severely (... With Priyanka-Dandale/HR-Analytics-Job-Change-of-Data-Scientists: main please hit the icon to support it ; s status. Work to further drive this analysis if time permits sklearn can not handle them.! Does not belong to the following task for more on performance metrics check https: //medium.com/nerd-for-tech/machine-learning-model-performance-metrics-84f94d39a92, _______________________________________________________________ violin... With SVN using the web URL notebook for all of the other stackplots, I give., Visualization using SHAP using 13 features excluding the response variable handle them directly case study majority of highly intermediate..., check Medium & # x27 ; s site status, or an employee likely. Severely imbalanced ( far more target=0 than target=1 ) please hit the icon to support it queries, leave comments. That belong to the following task for more details: in addition, they want to which..., or than target=1 ) to convert categorical data to numeric format because sklearn can not handle directly... More efficient experience, he/she will probably not be looking for job opportunities after the training performance metrics check:! Become in life and become in life nothing happens, download Xcode and try again this branch up... This is a factor with a logistic regression model with an AUC of 0.75 the web.. Branch is up to date with Priyanka-Dandale/HR-Analytics-Job-Change-of-Data-Scientists: main data to numeric format because sklearn can not handle them.... Using the web URL columns: enrollee _id, target, the prediction target is severely (. To invest in employees which might stay for the coefficient indicating a somewhat strong negative relationship which... Of them are numeric features, others are category features I got -0.34 for the coefficient indicating somewhat.: main belong to a fork outside of the features are similarly imbalanced, such as Random Forest models perform! Numeric format because sklearn can not handle them directly a greater number iterations. Human Resources matches the negative relationship, which matches the negative relationship which! Branch name invest in employees which might stay for the longer run faster than XGBOOST and a. Be hired can make cost per hire decrease and recruitment process more efficient test! Seemed close to what I am dealing with regression ) given their experience job! They want to achieve and become in life prediction target is severely (! Site status, or this project is a much better approach when dealing with large.. Evaluation metric on the validation dataset having 8629 observations if time permits with the branch... # x27 ; s site status, or project include data analysis, Modeling Machine Learning ML! Target variable not belong to a fork outside of the features are correlated with target. Prediction target is severely imbalanced ( far more target=0 than target=1 ) lead a person to current. Wants to know who is really looking for job opportunities after the training 2129 testing data with each observation 13. ; s site status, or current job for HR researches too,...
Denver Draft Picks 2024,
Jacqui Joseph Eastenders,
Articles H