We have been given the data set of exit survey from Department of Education, Training and Employment (DETE) and TAFE. Our task is to clean and analysis data and try to answer following questions and insight to the stake holders
Are employees who only worked for the institutes for a short period of time resigning due to some kind of dissatisfaction? What about employees who have been there longer?
Are younger employees resigning due to some kind of dissatisfaction? What about older employees?
we have been advised by stake holder that they would like to see combined results
# import libraries
import pandas as pd
import numpy as np
# read csv files
dete_survey = pd.read_csv('dete_survey.csv')
tafe_survey = pd.read_csv('tafe_survey.csv')
def survey_summary(survey):
"""
Summary of given dataframe.
shows info() and head()
"""
print(survey.info())
print(survey.head())
# lets check what's in dete_survey
survey_summary(dete_survey)
# lets check what's in tafe_survey
survey_summary(tafe_survey)
We have derived following observations from looking at summary of these data frames
# simple clean of dataframe where we replace `Not Stated` with `NaN`
dete_survey = pd.read_csv('dete_survey.csv', na_values='Not Stated')
# quick glance at data
dete_survey.head()
# lets drop dead weight for dete_survey
dete_survey_updated = dete_survey.drop(dete_survey.columns[28:49], axis=1)
# quick glance again to confirm
# dete_survey_updated.head()
print(dete_survey_updated.columns)
# lets drop dead weight for tafe_survey
tafe_survey_updated = tafe_survey.drop(tafe_survey.columns[17:66], axis=1)
# quick glance again to confirm
# tafe_survey_updated.head()
print(tafe_survey_updated.columns)
We are going to standardise the columns names so we can eventually use them when we combined dataset
# replace space with _ and lowercase all column names
dete_survey_updated.columns = dete_survey_updated.columns.str.replace(' ', '_').str.lower()
# Check that the column names were updated correctly
dete_survey_updated.columns
# Update column names to match the names in dete_survey_updated
mapping = {'Record ID': 'id', 'CESSATION YEAR': 'cease_date', 'Reason for ceasing employment': 'separationtype', 'Gender. What is your Gender?': 'gender', 'CurrentAge. Current Age': 'age',
'Employment Type. Employment Type': 'employment_status',
'Classification. Classification': 'position',
'LengthofServiceOverall. Overall Length of Service at Institute (in years)': 'institute_service',
'LengthofServiceCurrent. Length of Service at current workplace (in years)': 'role_service'}
tafe_survey_updated = tafe_survey_updated.rename(mapping, axis = 1)
# Check that the specified column names were updated correctly
tafe_survey_updated.columns
# Check the unique values for the separationtype column
tafe_survey_updated['separationtype'].value_counts()
# Check the unique values for the separationtype column
dete_survey_updated['separationtype'].value_counts()
# Update all separation types containing the word "resignation" to 'Resignation'
dete_survey_updated['separationtype'] = dete_survey_updated['separationtype'].str.split('-').str[0]
# Check the values in the separationtype column were updated correctly
dete_survey_updated['separationtype'].value_counts()
# Select only the resignation separation types from each dataframe
dete_resignations = dete_survey_updated[dete_survey_updated['separationtype'] == 'Resignation'].copy()
tafe_resignations = tafe_survey_updated[tafe_survey_updated['separationtype'] == 'Resignation'].copy()
We are now going to verify the cease_date
and dete_start_date
to make sure it makes sense and we would not end up with wrong analysis
dete_resignations['cease_date'].value_counts()
# Extract the years and convert them to a float type
dete_resignations['cease_date'] = dete_resignations['cease_date'].str.split('/').str[-1]
dete_resignations['cease_date'] = dete_resignations['cease_date'].astype("float")
# Check the values again and look for outliers
dete_resignations['cease_date'].value_counts()
# Check the unique values and look for outliers
dete_resignations['dete_start_date'].value_counts().sort_values()
# Check the unique values
tafe_resignations['cease_date'].value_counts().sort_values()
Below are our findings:
Since our end goal is to answer the question below, we need a column containing the length of time an employee spent in their workplace, or years of service, in both dataframes.
End goal: Are employees who have only worked for the institutes for a short period of time resigning due to some kind of dissatisfaction? What about employees who have been at the job longer? The tafe_resignations dataframe already contains a "service" column, which we renamed to institute_service.
Below, we calculate the years of service in the dete_survey_updated dataframe by subtracting the dete_start_date from the cease_date and create a new column named institute_service.
# Calculate the length of time an employee spent in their respective workplace and create a new column
dete_resignations['institute_service'] = dete_resignations['cease_date'] - dete_resignations['dete_start_date']
# Quick check of the result
dete_resignations['institute_service'].head()
Next, we'll identify any employees who resigned because they were dissatisfied. Below are the columns we'll use to categorize employees as "dissatisfied" from each dataframe:
If the employee indicated any of the factors above caused them to resign, we'll mark them as dissatisfied in a new column. After our changes, the new dissatisfied column will contain just the following values:
# Check the unique values
tafe_resignations['Contributing Factors. Dissatisfaction'].value_counts()
# Check the unique values
tafe_resignations['Contributing Factors. Job Dissatisfaction'].value_counts()
# Update the values in the contributing factors columns to be either True, False, or NaN
def update_vals(x):
if x == '-':
return False
elif pd.isnull(x):
return np.nan
else:
return True
tafe_resignations['dissatisfied'] = tafe_resignations[['Contributing Factors. Dissatisfaction', 'Contributing Factors. Job Dissatisfaction']].applymap(update_vals).any(1, skipna=False)
tafe_resignations_up = tafe_resignations.copy()
# Check the unique values after the updates
tafe_resignations_up['dissatisfied'].value_counts(dropna=False)
# Update the values in columns related to dissatisfaction to be either True, False, or NaN
dete_resignations['dissatisfied'] = dete_resignations[['job_dissatisfaction',
'dissatisfaction_with_the_department', 'physical_work_environment',
'lack_of_recognition', 'lack_of_job_security', 'work_location',
'employment_conditions', 'work_life_balance',
'workload']].any(1, skipna=False)
dete_resignations_up = dete_resignations.copy()
dete_resignations_up['dissatisfied'].value_counts(dropna=False)
# Add an institute column
dete_resignations_up['institute'] = 'DETE'
tafe_resignations_up['institute'] = 'TAFE'
# Combine the dataframes
combined = pd.concat([dete_resignations_up, tafe_resignations_up], ignore_index=True)
# Verify the number of non null values in each column
combined.notnull().sum().sort_values()
# Drop columns with less than 500 non null values
combined_updated = combined.dropna(thresh = 500, axis =1).copy()
# Check the unique values
combined_updated['institute_service'].value_counts(dropna=False)
# Extract the years of service and convert the type to float
combined_updated['institute_service_up'] = combined_updated['institute_service'].astype('str').str.extract(r'(\d+)')
combined_updated['institute_service_up'] = combined_updated['institute_service_up'].astype('float')
# Check the years extracted are correct
combined_updated['institute_service_up'].value_counts()
# Convert years of service to categories
def transform_service(val):
if val >= 11:
return "Veteran"
elif 7 <= val < 11:
return "Established"
elif 3 <= val < 7:
return "Experienced"
elif pd.isnull(val):
return np.nan
else:
return "New"
combined_updated['service_cat'] = combined_updated['institute_service_up'].apply(transform_service)
# Quick check of the update
combined_updated['service_cat'].value_counts()
# Verify the unique values
combined_updated['dissatisfied'].value_counts(dropna=False)
# Replace missing values with the most frequent value, False
combined_updated['dissatisfied'] = combined_updated['dissatisfied'].fillna(False)
dis_pct = combined_updated.pivot_table(index='service_cat', values='dissatisfied')
# Plot the results
%matplotlib inline
dis_pct.plot(kind='bar', rot=30)