Engaging Senior Volunteers Volunteer profile

Seniors volunteering - a deeper look

Background

RSVP

RSVP Singapore The Organisation of Senior Volunteers is an Institution of Public Character and the National Centre of Excellence for Senior Volunteerism under the patronage of Mdm Halimah Yacob, President of the Republic of Singapore. The organisation started in 1998 and was launched by then-Prime Minister Mr Goh Chok Tong. RSVP Singapore is a registered society under the Societies Act and a member of the National Council of Social Service (NCSS).
Since its inception, RSVP Singapore has been actively engaging seniors in purpose-driven volunteerism. With over 1,000 members and 1,500 ad-hoc volunteers, RSVP Singapore serves more than 200,000 beneficiaries each year including the mentally disadvantaged, at-risk children from low income families, and socially isolated seniors through its community service programmes. (ref: https://rsvp.org.sg/vision-mission/)
The project was scoped to look at the past four years of activities and profile of volunteers - 2015 to 2018 (full years) and till Oct 2019.

Importing relevant libraries

# importing libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import time
import random
from datetime import datetime
from datetime import timedelta
from sklearn import metrics
import math


from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression, LassoCV, Lasso
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, mean_squared_error

import graphviz
from sklearn.externals.six import StringIO
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus


import requests

%matplotlib inline

# setting options to view entire dataset

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

Problem statement

– To look at the volunteer database and activities conducted over 4 years (2015-2018) and study patterns:

Profile of volunteers
What activities are these volunteers most involved in? (Comparison of numbers)
Clustering of volunteers based on activities they are invovled in / hours they volunteer
Engagement of volunteers and suggestions for continuing their involvement

While this notebook will cover the basic profile and interests of the volunteers, the second book will focus on the various types of volunteers and the activities used by RSVP to engage them.

# reading excel file containing volunteer details

df_volunteer= pd.read_excel('.\datasets\Vol Basic run on 29Oct Sample.xlsx')
df_volunteer.head()

	Date_Joined	Gender	Nationality	Last 4 digits IC	Date_of_Birth	Age	Postal_Code	Race	Spoken_languages	Interests	Skills	Hear_about_us	Education_Qualification	Current_Employment_Status	Employer_Name	Designation	Status	Status_Remarks	Registration_Date	Membership_Type	Membership_Start_Date	Membership_Expiry_Date	Membership_Payment
0	11-01-2016	Male	NaN	NaN	05-08-1938	81.0	788214.0	Others	Hokkien,English	Infocomm Related,	NaN	NaN	NaN	Retired	ForumEnergy Technologies Pte Ltd	Regional Manager	Inactive	Inactive since June 2017.	22-05-2016	NaN	NaN	NaN	NaN
1	16-05-2002	Female	NaN	0001H	30-07-1941	78.0	572152.0	Chinese	Cantonese,Hokkien,Teochew,English,Mandarin	NaN	NaN	NaN	Secondary	Homemaker	K K Hospital	Midwife	Inactive	Status data correction from (Ordinary Member) ...	22-05-2016	Ordinary Member	NaN	30-06-2015	NaN
2	02-02-2016	Male	NaN	0024H	17-06-1963	56.0	161057.0	Chinese	Cantonese,Hokkien,Teochew,English,Mandarin	NaN	NaN	NaN	NaN	Others, pls	NTUC Fairprice	Retail Assistant	Inactive	Status data correction from (Ordinary Member) ...	22-05-2016	Ordinary Member	NaN	30-06-2016	NaN
3	12-10-2010	Male	Singapore Permanent Resident	0037I	07-12-1935	84.0	570269.0	Chinese	Cantonese,English,Mandarin	NaN	NaN	NaN	Retired	Retired	Bank Central Asia, Indonesia	Clerk	Active	NaN	22-05-2016	Ordinary Member	25-09-2018	30-06-2019	NaN
4	12-04-2013	Female	Singapore Citizen	0066e	29-07-1952	67.0	100050.0	Chinese	,English	Office Admin.,Snr Guiding,Active Ageing Seniors	NaN	NaN	Retired	Retired	Singapore Post Pte Ltd	Part time CSO	Active	NaN	22-05-2016	Ordinary	05-07-2019	30-06-2019	paid

df_volunteer.shape

(8285, 23)

df_volunteer.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8285 entries, 0 to 8284
Data columns (total 23 columns):
Date_Joined                  3151 non-null object
Gender                       7955 non-null object
Nationality                  6437 non-null object
Last 4 digits IC             8080 non-null object
Date_of_Birth                8211 non-null object
Age                          8211 non-null float64
Postal_Code                  4950 non-null float64
Race                         4999 non-null object
Spoken_languages             4917 non-null object
Interests                    1342 non-null object
Skills                       722 non-null object
Hear_about_us                351 non-null object
Education_Qualification      4396 non-null object
Current_Employment_Status    2726 non-null object
Employer_Name                1108 non-null object
Designation                  1256 non-null object
Status                       8227 non-null object
Status_Remarks               5940 non-null object
Registration_Date            8227 non-null object
Membership_Type              2568 non-null object
Membership_Start_Date        1092 non-null object
Membership_Expiry_Date       2597 non-null object
Membership_Payment           799 non-null object
dtypes: float64(2), object(21)
memory usage: 1.5+ MB

Observations:
The data has 8285 rows of info, However there are many null values and it needs to be cleaned for duplicates and missing values
Relevant columns for analysis, such as date of birth, date of joining, status, details to link to other data sheets, will be retained
Duplicates checked and rows with null values in identifying variables (‘vol_id’ and ‘Date_of_Birth’) dropped as ther is no other way toidentify such persons

# dropping columns not relevant to study - eg. membership expiry and payment details

df_volunteer.drop(columns=['Status_Remarks','Membership_Start_Date',
                            'Membership_Expiry_Date','Membership_Payment'], inplace=True)
df_volunteer.head()

	Date_Joined	Gender	Nationality	Last 4 digits IC	Date_of_Birth	Age	Postal_Code	Race	Spoken_languages	Interests	Skills	Hear_about_us	Education_Qualification	Current_Employment_Status	Employer_Name	Designation	Status	Registration_Date	Membership_Type
0	11-01-2016	Male	NaN	NaN	05-08-1938	81.0	788214.0	Others	Hokkien,English	Infocomm Related,	NaN	NaN	NaN	Retired	ForumEnergy Technologies Pte Ltd	Regional Manager	Inactive	22-05-2016	NaN
1	16-05-2002	Female	NaN	0001H	30-07-1941	78.0	572152.0	Chinese	Cantonese,Hokkien,Teochew,English,Mandarin	NaN	NaN	NaN	Secondary	Homemaker	K K Hospital	Midwife	Inactive	22-05-2016	Ordinary Member
2	02-02-2016	Male	NaN	0024H	17-06-1963	56.0	161057.0	Chinese	Cantonese,Hokkien,Teochew,English,Mandarin	NaN	NaN	NaN	NaN	Others, pls	NTUC Fairprice	Retail Assistant	Inactive	22-05-2016	Ordinary Member
3	12-10-2010	Male	Singapore Permanent Resident	0037I	07-12-1935	84.0	570269.0	Chinese	Cantonese,English,Mandarin	NaN	NaN	NaN	Retired	Retired	Bank Central Asia, Indonesia	Clerk	Active	22-05-2016	Ordinary Member
4	12-04-2013	Female	Singapore Citizen	0066e	29-07-1952	67.0	100050.0	Chinese	,English	Office Admin.,Snr Guiding,Active Ageing Seniors	NaN	NaN	Retired	Retired	Singapore Post Pte Ltd	Part time CSO	Active	22-05-2016	Ordinary

Creating Volunteer Id (vol_id) that can be used to link volunteer database to activity databases

With an effort to protect privacy and confidentiality, only the last 4 numbers of the IC and the alphabet are retained as linking details to connect the volunteer datatbase and other sheets showing the activities they have participated in.
This detail will be called ‘vol_id’.

# 'last 4 digits IC' is one of the identifying factors for volunteers.

df_volunteer.rename(columns={'Last 4 digits IC': 'vol_id'}, inplace=True)
df_volunteer.head()

	Date_Joined	Gender	Nationality	vol_id	Date_of_Birth	Age	Postal_Code	Race	Spoken_languages	Interests	Skills	Hear_about_us	Education_Qualification	Current_Employment_Status	Employer_Name	Designation	Status	Registration_Date	Membership_Type
0	11-01-2016	Male	NaN	NaN	05-08-1938	81.0	788214.0	Others	Hokkien,English	Infocomm Related,	NaN	NaN	NaN	Retired	ForumEnergy Technologies Pte Ltd	Regional Manager	Inactive	22-05-2016	NaN
1	16-05-2002	Female	NaN	0001H	30-07-1941	78.0	572152.0	Chinese	Cantonese,Hokkien,Teochew,English,Mandarin	NaN	NaN	NaN	Secondary	Homemaker	K K Hospital	Midwife	Inactive	22-05-2016	Ordinary Member
2	02-02-2016	Male	NaN	0024H	17-06-1963	56.0	161057.0	Chinese	Cantonese,Hokkien,Teochew,English,Mandarin	NaN	NaN	NaN	NaN	Others, pls	NTUC Fairprice	Retail Assistant	Inactive	22-05-2016	Ordinary Member
3	12-10-2010	Male	Singapore Permanent Resident	0037I	07-12-1935	84.0	570269.0	Chinese	Cantonese,English,Mandarin	NaN	NaN	NaN	Retired	Retired	Bank Central Asia, Indonesia	Clerk	Active	22-05-2016	Ordinary Member
4	12-04-2013	Female	Singapore Citizen	0066e	29-07-1952	67.0	100050.0	Chinese	,English	Office Admin.,Snr Guiding,Active Ageing Seniors	NaN	NaN	Retired	Retired	Singapore Post Pte Ltd	Part time CSO	Active	22-05-2016	Ordinary

df_volunteer['Date_of_Birth'].isnull().value_counts()

False    8211
True       74
Name: Date_of_Birth, dtype: int64

For 74 volunteers Date of Birth is not available.
Also checking the vol_id to see if the individual can be identified
For volunteers where both Date of Birth and vol_id is not available - 59 nos - these are dropped as there is no way to identify them.
In addition, most other fields also do not have info, allowing for dropping of the same.
The resulting dataset has 8226 rows.

# For 74 volunteers Date of Birth is not available.
# Also checking the vol_id to see if the individual can be identified
# For volunteers where both Date of Birth and vol_id is not available -
# 59 nos. - these are dropped as there is no way to identify them.
# In addition, most other fields have not be populated with info.

df_blanks = df_volunteer[(df_volunteer['Date_of_Birth'].isnull()) & (df_volunteer['vol_id'].isnull())]
df_blanks.head()

df_blanks.shape

(59, 19)

# 59 volunteers do not have identifying details - vol_id, Date_of_Birth  
# dropping these 59 rows

index_blanks = df_blanks.index
print(index_blanks)
df_volunteer.drop(index_blanks, inplace=True)

Int64Index([7742, 8227, 8228, 8229, 8230, 8231, 8232, 8233, 8234, 8235, 8236,
            8237, 8238, 8239, 8240, 8241, 8242, 8243, 8244, 8245, 8246, 8247,
            8248, 8249, 8250, 8251, 8252, 8253, 8254, 8255, 8256, 8257, 8258,
            8259, 8260, 8261, 8262, 8263, 8264, 8265, 8266, 8267, 8268, 8269,
            8270, 8271, 8272, 8273, 8274, 8275, 8276, 8277, 8278, 8279, 8280,
            8281, 8282, 8283, 8284],
           dtype='int64')

df_volunteer.shape

(8226, 19)

# Other null values in 'vol_id' column have been given an Id of '1000Q'

df_volunteer['vol_id'].fillna('1000Q', inplace=True)
df_volunteer['vol_id'].astype('str')
df_volunteer['vol_id'] = df_volunteer['vol_id'].str[0:4]+ df_volunteer['vol_id'].str[4].str.upper()
df_volunteer.head()

	Date_Joined	Gender	Nationality	vol_id	Date_of_Birth	Age	Postal_Code	Race	Spoken_languages	Interests	Skills	Hear_about_us	Education_Qualification	Current_Employment_Status	Employer_Name	Designation	Status	Registration_Date	Membership_Type
0	11-01-2016	Male	NaN	1000Q	05-08-1938	81.0	788214.0	Others	Hokkien,English	Infocomm Related,	NaN	NaN	NaN	Retired	ForumEnergy Technologies Pte Ltd	Regional Manager	Inactive	22-05-2016	NaN
1	16-05-2002	Female	NaN	0001H	30-07-1941	78.0	572152.0	Chinese	Cantonese,Hokkien,Teochew,English,Mandarin	NaN	NaN	NaN	Secondary	Homemaker	K K Hospital	Midwife	Inactive	22-05-2016	Ordinary Member
2	02-02-2016	Male	NaN	0024H	17-06-1963	56.0	161057.0	Chinese	Cantonese,Hokkien,Teochew,English,Mandarin	NaN	NaN	NaN	NaN	Others, pls	NTUC Fairprice	Retail Assistant	Inactive	22-05-2016	Ordinary Member
3	12-10-2010	Male	Singapore Permanent Resident	0037I	07-12-1935	84.0	570269.0	Chinese	Cantonese,English,Mandarin	NaN	NaN	NaN	Retired	Retired	Bank Central Asia, Indonesia	Clerk	Active	22-05-2016	Ordinary Member
4	12-04-2013	Female	Singapore Citizen	0066E	29-07-1952	67.0	100050.0	Chinese	,English	Office Admin.,Snr Guiding,Active Ageing Seniors	NaN	NaN	Retired	Retired	Singapore Post Pte Ltd	Part time CSO	Active	22-05-2016	Ordinary

# checking number of unique Id identifying volunteers

df_volunteer['vol_id'].nunique()

# Checking duplicates of vol_id if any volunteer details are duplicated

df_duplicates = df_volunteer.groupby(by=['vol_id','Date_of_Birth']).count()
df_duplicates.head()

		Date_Joined	Gender	Nationality	Age	Postal_Code	Race	Spoken_languages	Interests	Skills	Hear_about_us	Education_Qualification	Current_Employment_Status	Employer_Name	Designation	Status	Registration_Date	Membership_Type
vol_id	Date_of_Birth
0000D	03-07-1962	0	1	1	1	0	0	0	0	0	0	0	0	0	0	1	1	0
0000J	08-07-1948	0	1	1	1	1	1	0	0	0	0	1	0	0	0	1	1	0
0001H	04-02-1945	0	1	1	1	1	1	1	0	0	0	1	0	0	0	1	1	0
0001H	30-07-1941	1	1	0	1	1	1	1	0	0	0	1	1	1	1	1	1	1
0002I	16-02-1967	0	1	1	1	1	1	0	1	1	1	1	1	1	1	1	1	0

Basic profile of volunteer

Based on the data available in the Volunteer database, the features defining a typical volunteer were shortlisted
Total number of volunteers registered till 2018 = 8226
Many have not provided much information about themselves…..
- 7954 have stated Gender
  - female(5097, 62%)
  - male(2858,34.7%)
  - Unstated (272, 3.3%)
- 6436 have stated Nationality
  - Singapore Citizen (6243,76%)
  - Singapore PR(153, 1.9%),
  - Foreigner(41, 0.5%),
  - Unknown(1790, 21.8%)
- 4994 have provided information about Race
  - Chinese (4665, 56.7%),
  - Indian (190, 2.3%),
  - Malay (49,0.6%),
  - Others (39, 0.5%),
  - Eurasian (29, 0.4%),
  - Unstated (3232, 39.3%)
- Recent years have seen rapid increase in signing up of volunteers
  - 2017: 2327 pax
  - 2018: 1694 pax
  - 2019: 1876 pax (till Oct 2019)

# checking null values in 'Age' column - 15 values. using mean value to fill null values

df_volunteer['Age'].fillna(df_volunteer['Age'].mean(), inplace=True)

# checking for null values in 'Nationality' column - replacing with 'NotStated'

df_volunteer['Nationality'].fillna('NotStated', inplace=True)
print(df_volunteer['Nationality'].value_counts())
df_volunteer['Nationality'].value_counts().plot(kind='bar');

Singapore Citizen               6242
NotStated                       1790
Singapore Permanent Resident     153
Foreigner                         41
Name: Nationality, dtype: int64

png

# checking for null values in 'Gender' column - replacing with 'NotStated'

df_volunteer['Gender'].fillna('NotStated', inplace=True)
print(df_volunteer['Gender'].value_counts())
df_volunteer['Gender'].value_counts().plot(kind='bar')

Female       5097
Male         2857
NotStated     272
Name: Gender, dtype: int64

<matplotlib.axes._subplots.AxesSubplot at 0x1f01c28c2e8>

png

# plotting gender and age of volunteers

df_volunteer.groupby('Gender').Age.hist()
plt.legend(['Female', 'Male'], facecolor='w' )
plt.show()

png

# checking info in 'Race' column
df_volunteer['Race'].fillna('Unknown', inplace=True)
df_Race= pd.DataFrame(df_volunteer['Race'].value_counts())
df_Race.head(10)

	Race
Chinese	4665
Unknown	3232
Indian	190
Malay	49
Others	39
Eurasian	29
Others\|\|Filipino	6
Others\|\|Punjabi	3
Indonesian	2
Japanese	2

(df_Race.sort_values(by='Race')).plot(kind='barh',xlim=(0, 250))

<matplotlib.axes._subplots.AxesSubplot at 0x1f01d316e80>

png

df_volunteer['Spoken_languages'].head(10)

                                   Hokkien,English
        Cantonese,Hokkien,Teochew,English,Mandarin
        Cantonese,Hokkien,Teochew,English,Mandarin
                        Cantonese,English,Mandarin
                                          ,English
                                     ,English,Thai
                                           Chinese
  Cantonese,Hakka,Hokkien,Teochew,English,Mandarin
                                           Chinese
                        Cantonese,English,Mandarin
Name: Spoken_languages, dtype: object

df_volunteer['Spoken_languages'].isnull().sum()

# Spoken languages info only available for 4909 volunteers (i.e., 8226-3317)

df_volunteer['Spoken_languages'].astype(str)
languages = df_volunteer['Spoken_languages'].str.get_dummies(',')
languages.sum().sort_values(ascending=False).head(10)

Chinese      3277
English      1508
Mandarin     1141
Hokkien       903
Cantonese     754
Teochew       489
Malay         341
Hainanese      82
Hakka          69
Tamil          57
dtype: int64

df_volunteer.columns

Index(['Date_Joined', 'Gender', 'Nationality', 'vol_id', 'Date_of_Birth',
       'Age', 'Postal_Code', 'Race', 'Spoken_languages', 'Interests', 'Skills',
       'Hear_about_us', 'Education_Qualification', 'Current_Employment_Status',
       'Employer_Name', 'Designation', 'Status', 'Registration_Date',
       'Membership_Type'],
      dtype='object')

df_volunteer['Education_Qualification'].notnull().sum()

df_volunteer['Education_Qualification'].value_counts()

Secondary                           755
O Level                             565
University Degree                   348
GCE O Level                         340
Polytechnic                         324
Retired                             308
Degree                              289
A Levels                            232
Primary                             202
Diploma                             194
GCE A Level                         130
Postgraduate Degree                 109
Masters                              70
Employed (Full-Time)                 53
Semi-retired                         52
Homemaker                            41
Self-employed                        38
Employed (Part-Time)                 35
Others, pls                          34
ITE/NTC                              32
College                              29
Professional Qualifications          22
Post Graduate Diploma                21
No Formal Education                  21
GCE N Level                          18
Doctorate                            17
Freelance Work                       16
Certificate/Advanced Certificate     15
N Level                              15
Higher Nitec                         11
Awaiting Employment                  11
A Level                               7
Nitec (ITE)                           6
Bachelor Degree                       4
secondary                             3
Bachelor"s Degree                     3
diploma                               3
Post Graduate                         2
others                                2
A level                               2
Post secondary                        2
degree                                2
o.level                               2
Advanced Diploma                      1
Senior Cambridge                      1
Others                                1
GCE O level                           1
A-Level                               1
post secondary                        1
GCE O LEVEL                           1
Postgraduate                          1
No Formal education                   1
Post-Graduate                         1
Name: Education_Qualification, dtype: int64

# growth of number of volunteers
df_volunteer['Date_Joined'].fillna(df_volunteer['Registration_Date'], inplace=True)
df_volunteer['join_yr']=pd.to_datetime(df_volunteer['Date_Joined'], format='%d-%m-%Y').dt.year
df_volunteer['join_yr'].fillna(1999)
df_volunteer['join_yr'].replace(to_replace=[1954,1953,1946,1936,1935,1949,1933,1965,1944,1996,1955,1997],
                                value=1999, inplace=True)
dataviz_vol= pd.DataFrame(df_volunteer['join_yr'].sort_index().value_counts())
print(df_volunteer['join_yr'].sort_index().value_counts())
df_volunteer['join_yr'].sort_index().plot(kind='hist')
plt.gcf().set_size_inches(10, 4)

  2327
  1876
  1694
   273
   238
   230
   227
   196
   185
   172
   158
   134
   103
    95
    60
    42
    41
    40
    40
    38
    31
    26
Name: join_yr, dtype: int64

png

dataviz_vol= dataviz_vol.sort_index()
# ser = pd.Series(np.random.normal(size=1000))

dataviz_vol.hist(cumulative=True, density=1, bins=200)

plt.show()

df_volunteer[df_volunteer['join_yr'] > 2016]['Education_Qualification'].value_counts()

O Level                  554
Secondary                479
University Degree        339
Polytechnic              312
A Levels                 227
Primary                  135
Postgraduate Degree      108
Diploma                   43
Degree                    42
ITE/NTC                   32
Retired                   26
No Formal Education       21
GCE O Level               20
N Level                   15
Semi-retired              13
GCE A Level               11
Others, pls                7
Employed (Full-Time)       7
A Level                    6
Masters                    5
Homemaker                  4
Post Graduate Diploma      4
secondary                  3
Bachelor Degree            3
Bachelor"s Degree          3
o.level                    2
Post Graduate              2
Self-employed              2
Freelance Work             2
A level                    2
Post secondary             2
others                     1
Employed (Part-Time)       1
Postgraduate               1
Senior Cambridge           1
diploma                    1
Advanced Diploma           1
Others                     1
GCE O level                1
A-Level                    1
post secondary             1
Higher Nitec               1
GCE O LEVEL                1
degree                     1
No Formal education        1
College                    1
Post-Graduate              1
Name: Education_Qualification, dtype: int64

# To extract year of joining from Registration Date
df_volunteer['Years']=2019 - (pd.to_datetime(df_volunteer['Date_Joined'], format='%d-%m-%Y').dt.year)

Interest of volunteers as stated during registration

info available from only 1342 persons

df_volunteer['Interests'].notnull().sum()

# To look at interests of volunteers

df_volunteer['Interests'].str.lower()
interests =df_volunteer['Interests'].dropna()

# creating WordCloud - interests
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

wordcloud = WordCloud(max_font_size=50, max_words=50, background_color="white").generate(' '.join(interests))

# Display the generated image:
plt.figure(figsize=(10,4))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

png

# saving dataset as csv file

df_volunteer.to_csv('./datasets/df_volunteer.csv')

Creating a basic model to predict if member will be ‘active’

From the available information which was quite limited, a predictive model was prepared to study the profile of members who are active.
Logistic regression, Lasso regression and Decision tree models were used to identify features that were most likely to indicate ‘activeness’ in volunteers
The 3 models highighted Nationality to be highly positive correlated and gender to be highly negative.
However, due to the limitations of the data as well as absence of any health / mobility related data, the model is not very effective in predictions.
Also the status of ‘Active’ and ‘Inactive’ is determined by the renewal of membership by the volunteer and hence is not a reliable measure. A volunteer may have not renewed his memebrship for various reasons, such as ill-health, other commitments on their time, forgetfulness, taken up employment, travelling, caring for family members, etc.

## creating dummy columns for nominal columns - df_merged_dummies

df_baseline = df_volunteer[['vol_id','Gender','Nationality', 'Race', 'Age', 'Years', 'Spoken_languages', 'Status']]
df_baseline.head(10)

	vol_id	Gender	Nationality	Race	Age	Years	Spoken_languages	Status
0	1000Q	Male	NotStated	Others	81.0	3	Hokkien,English	Inactive
1	0001H	Female	NotStated	Chinese	78.0	17	Cantonese,Hokkien,Teochew,English,Mandarin	Inactive
2	0024H	Male	NotStated	Chinese	56.0	3	Cantonese,Hokkien,Teochew,English,Mandarin	Inactive
3	0037I	Male	Singapore Permanent Resident	Chinese	84.0	9	Cantonese,English,Mandarin	Active
4	0066E	Female	Singapore Citizen	Chinese	67.0	6	,English	Active
5	0067I	Female	Singapore Citizen	Chinese	68.0	5	,English,Thai	Active
6	0068H	Female	Singapore Citizen	Chinese	73.0	20	Chinese	Active
7	0072Z	Female	Singapore Citizen	Chinese	82.0	11	Cantonese,Hakka,Hokkien,Teochew,English,Mandarin	Active
8	0074J	Male	Singapore Citizen	Chinese	68.0	13	Chinese	Active
9	0081G	Male	NotStated	Chinese	75.0	17	Cantonese,English,Mandarin	Inactive

df_baseline['Status'].unique()

array(['Inactive', 'Active', 'Special Attention', 'Probation', 'New'],
      dtype=object)

mapping = {'Active':1, 'Special Attention':1, 'Probation':1, 'New': 1, 'Inactive': 0}
df_baseline['Status_curr']= df_baseline['Status'].apply(lambda x : mapping[x])

df_baseline.head(10)

C:\Users\bhavn\Anaconda3\lib\site-packages\ipykernel_launcher.py:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

	vol_id	Gender	Nationality	Race	Age	Years	Spoken_languages	Status	Status_curr
0	1000Q	Male	NotStated	Others	81.0	3	Hokkien,English	Inactive	0
1	0001H	Female	NotStated	Chinese	78.0	17	Cantonese,Hokkien,Teochew,English,Mandarin	Inactive	0
2	0024H	Male	NotStated	Chinese	56.0	3	Cantonese,Hokkien,Teochew,English,Mandarin	Inactive	0
3	0037I	Male	Singapore Permanent Resident	Chinese	84.0	9	Cantonese,English,Mandarin	Active	1
4	0066E	Female	Singapore Citizen	Chinese	67.0	6	,English	Active	1
5	0067I	Female	Singapore Citizen	Chinese	68.0	5	,English,Thai	Active	1
6	0068H	Female	Singapore Citizen	Chinese	73.0	20	Chinese	Active	1
7	0072Z	Female	Singapore Citizen	Chinese	82.0	11	Cantonese,Hakka,Hokkien,Teochew,English,Mandarin	Active	1
8	0074J	Male	Singapore Citizen	Chinese	68.0	13	Chinese	Active	1
9	0081G	Male	NotStated	Chinese	75.0	17	Cantonese,English,Mandarin	Inactive	0

# replacing 'Spoken_languages' with dummy columns for languages

df_baseline = pd.concat([df_baseline, languages], axis=1, join='inner')
df_baseline.drop(['Spoken_languages', 'Status'], axis=1, inplace=True)

# saving dataset as csv file
df_baseline.to_csv('./datasets/df_baseline.csv')
df_baseline.head()

	vol_id	Gender	Nationality	Race	Age	Years	Status_curr	Cantonese	English	Hokkien	Mandarin	Teochew
0	1000Q	Male	NotStated	Others	81.0	3	0	0	1	1	0	0
1	0001H	Female	NotStated	Chinese	78.0	17	0	1	1	1	1	1
2	0024H	Male	NotStated	Chinese	56.0	3	0	1	1	1	1	1
3	0037I	Male	Singapore Permanent Resident	Chinese	84.0	9	1	1	1	0	1	0
4	0066E	Female	Singapore Citizen	Chinese	67.0	6	1	0	1	0	0	0

columns_to_getdummies =['Gender','Nationality', 'Race']
df_baseline_dummies = pd.get_dummies(df_baseline, columns=(columns_to_getdummies), prefix=(columns_to_getdummies), prefix_sep='')

df_baseline_dummies.head()

	vol_id	Age	Years	Status_curr	Cantonese	English	Hokkien	Mandarin	Teochew	GenderFemale	GenderMale	NationalityNotStated	NationalitySingapore Citizen	NationalitySingapore Permanent Resident	RaceChinese	RaceOthers
0	1000Q	81.0	3	0	0	1	1	0	0	0	1	1	0	0	0	1
1	0001H	78.0	17	0	1	1	1	1	1	1	0	1	0	0	1	0
2	0024H	56.0	3	0	1	1	1	1	1	0	1	1	0	0	1	0
3	0037I	84.0	9	1	1	1	0	1	0	0	1	0	0	1	1	0
4	0066E	67.0	6	1	0	1	0	0	0	1	0	0	1	0	1	0

# baseline score of data on current status

df_baseline_dummies['Status_curr'].value_counts(normalize=True)

1    0.771821
0    0.228179
Name: Status_curr, dtype: float64

df_baseline_dummies.to_csv('./datasets/baseline.csv')

# preparing X and y for modelling

X = df_baseline_dummies[['Age', 'Years', 'Cantonese', 'Mandarin', 'Chinese', 'English',
                         'Hainanese', 'Hakka', 'Hindi', 'Hokkien', 'Malay', 'Mandarin', 'Tamil', 'Teochew',
                         'GenderFemale', 'GenderMale', 'NationalityForeigner',  
                         'NationalitySingapore Citizen', 'NationalitySingapore Permanent Resident',
                         'RaceChinese', 'RaceEurasian','RaceIndian', 'RaceMalay']]

y = df_baseline_dummies['Status_curr']
X.shape, y.shape

((8226, 23), (8226,))

# saving dataset as csv file

df_baseline_dummies.to_csv('./datasets/df_baseline_dummies.csv')

Applying Standard Scaler

# setting aside 20% of X and y for test purposes
# applying Standard scaler

X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,random_state=0, stratify=y)

# standard scaling
ss = StandardScaler()
ss.fit(X_train)
X_train = ss.transform(X_train)
X_test = ss.transform(X_test)

X_train.shape, X_test.shape

((6169, 23), (2057, 23))

Model 1 :Logistic regression - training and validation

# creating a logistic model - X_train1, X_val, y_train1, y_val

# split X and y into training and testing sets
X_train1,X_val,y_train1,y_val=train_test_split(X_train,y_train,test_size=0.25,random_state=0)

# standard scaling
ss = StandardScaler()
ss.fit(X_train1)
X_train1 = ss.transform(X_train1)
X_val = ss.transform(X_val)

# instantiate the model (using the default parameters)
lg_baseline = LogisticRegression(solver='lbfgs')

# fit the model with data
lg_baseline.fit(X_train1,y_train1)

# getting predictions
y_pred=lg_baseline.predict(X_val)

# calculating probabilities
prob_lg_baseline = lg_baseline.predict_proba(X_val)

# validation score
training_score = lg_baseline.score(X_train1, y_train1)
validation_score = lg_baseline.score(X_val, y_val)

# Generate a confusion matrix.
confusion_matrix(y_val, y_pred)

tn, fp, fn, tp = confusion_matrix(y_val, y_pred).ravel()

print('True Negatives: %s' % tn)
print('False Positives: %s' % fp)
print('False Negatives: %s' % fn)
print('True Positives: %s' % tp)

print('Training score:', training_score)
print('Validation score:', validation_score)
print('MSE:',mean_squared_error(y_val, y_pred))

# Print out intercept and coefficients.
print('Intercept:', lg_baseline.intercept_)
print('Coefficient:', lg_baseline.coef_)
print(type(lg_baseline.coef_))

True Negatives: 253
False Positives: 70
False Negatives: 7
True Positives: 1213
Training score: 0.9420665801988759
Validation score: 0.950097213220998
MSE: 0.049902786779001944
Intercept: [2.11751837]
Coefficient: [[-0.01523056 -0.4101083  -0.11754014  0.02185603  0.10445194 -0.10417632
   0.09231091  0.08439391 -0.08350564 -0.09991119 -0.06327909  0.02185603
  -0.06003371 -0.01040314 -2.60230975 -2.56801013  0.64389587  2.49393104
   0.69417795 -1.26545949 -0.09677627 -0.43664819 -0.14489351]]
<class 'numpy.ndarray'>

coeff_df = pd.DataFrame(X.columns)
arr = np.array(lg_baseline.coef_)
df_arr = pd.DataFrame(data=arr).T
coeff_df = coeff_df.merge(df_arr, left_index=True, right_index=True, how='inner')
coeff_df.columns= ['feature', 'coeff_logreg']
print(coeff_df.sort_values(by='coeff_logreg',ascending=False))
coeff_df.sort_values(by='coeff_logreg',ascending=False).plot(kind='bar', x='feature')

                                    feature  coeff_logreg
           NationalitySingapore Citizen      2.493931
NationalitySingapore Permanent Resident      0.694178
                   NationalityForeigner      0.643896
                                 Chinese      0.104452
                               Hainanese      0.092311
                                   Hakka      0.084394
                               Mandarin      0.021856
                                Mandarin      0.021856
                                Teochew     -0.010403
                                     Age     -0.015231
                                  Tamil     -0.060034
                                  Malay     -0.063279
                                   Hindi     -0.083506
                           RaceEurasian     -0.096776
                                 Hokkien     -0.099911
                                 English     -0.104176
                               Cantonese     -0.117540
                              RaceMalay     -0.144894
                                   Years     -0.410108
                             RaceIndian     -0.436648
                            RaceChinese     -1.265459
                             GenderMale     -2.568010
                           GenderFemale     -2.602310





<matplotlib.axes._subplots.AxesSubplot at 0x1f020a740f0>

png

y_pred_proba = lg_baseline.predict_proba(X_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_val,  y_pred)
auc = metrics.roc_auc_score(y_val,y_pred)
plt.plot(fpr,tpr,label="logreg model, auc="+str(auc))
plt.legend(loc=4)
plt.show()

png

Model 2 :Lasso model - training and validation

## Applying Lasso

# Lasso regression for given X_train, X_test, y_train, y_test, name of regression model

def lasso_reg(X_train, X_test, y_train, y_test, name):

    # Calculation of optimal alpha
    optimal_lasso = LassoCV(n_alphas=500, cv=10, verbose=1)
    optimal_lasso.fit(X_train, y_train)

    # instantiate lasso model with optimal alpha
    lasso = Lasso(alpha=optimal_lasso.alpha_)

    # Training the model
    lasso.fit(X_train, y_train)
    lasso_scores_train = cross_val_score(lasso, X_train, y_train, cv=10)
    print ("Mean score:", np.mean(lasso_scores_train))

    # getting co-eff values
    coeff_df = pd.DataFrame(lasso.coef_ , X.columns, columns=['Coefficient'])  

    # Fitting the model
    lasso.score(X_test, y_test)
    print('Validation/Test score:', lasso.score(X_test, y_test))
    pred = lasso.predict(X_test)

    # Metrics
    print ("Means Square Error:", metrics.mean_squared_error(y_test, pred ))
    print("Root Mean Square:", np.sqrt(metrics.mean_squared_error(y_test, pred)))


    return coeff_df, pred

coeff_df, pred = lasso_reg(X_train1, X_val, y_train1, y_val, 'lasso')

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:    1.4s finished


Mean score: 0.7242563051897465
Validation/Test score: 0.7348770422392643
Means Square Error: 0.04388113346759728
Root Mean Square: 0.20947824103614504

coeff_df.sort_values(by='Coefficient',ascending=False).head(20).plot(kind='bar')

<matplotlib.axes._subplots.AxesSubplot at 0x1f01da0d780>

png

Model 3: Decision Tree Classifier - training and validation

# fitting Decision Tree Classifier to training, validation

# Instantiate model.
dt = DecisionTreeClassifier(max_depth=8,random_state = 42)

# Fit training data
model_dt = dt.fit(X_train1, y_train1)

# Evaluate model.
print(f'Score on training set: {model_dt.score(X_train1, y_train1)}')
print(f'Score on validation set: {model_dt.score(X_val, y_val)}')

# testing for validation set - X_val, y_val
model_dt_val = model_dt.fit(X_val, y_val)

#predictions
preds_dt=model_dt.predict(X_val)

# Generate a confusion matrix.
confusion_matrix(y_val, preds_dt)

tn, fp, fn, tp = confusion_matrix(y_val, preds_dt).ravel()

print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)

Score on training set: 0.9589277993947255
Score on validation set: 0.948801036941024
True Negatives: 310
False Positives: 13
False Negatives: 24
True Positives: 1196

#Create a new tree with a lower depth (easier to visualize)

clf = DecisionTreeClassifier(max_depth = 8)

clf.fit(X_train1, y_train1)
dot_data = StringIO()
export_graphviz(clf, out_file=dot_data, feature_names=X.columns,
               filled=True, rounded=True,
               special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())

png

Testing of model

Model 1: Logistic Regression - testing

# using logistic regression model - X_train, X_test, y_train, y_test

# standard scaling
ss = StandardScaler()
ss.fit(X_train)
X_train = ss.transform(X_train)
X_test = ss.transform(X_test)

# instantiate the model (using the default parameters)
lg_baseline_test = LogisticRegression(solver='lbfgs')

# fit the model with data
lg_baseline_test.fit(X_train,y_train)

# getting predictions
y_pred_test=lg_baseline_test.predict(X_test)

# calculating probabilities
probability_lg_baseline_test = lg_baseline.predict_proba(X_test)

# validation score
training_score = lg_baseline.score(X_train, y_train)
test_score = lg_baseline.score(X_test, y_test)

# Generate a confusion matrix.
confusion_matrix(y_test, y_pred_test)

tn, fp, fn, tp = confusion_matrix(y_test, y_pred_test).ravel()

print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)

print('Training score:', training_score)
print('Test score:', test_score)
print("MSE:",mean_squared_error(y_test, y_pred_test))

True Negatives: 379
False Positives: 90
False Negatives: 23
True Positives: 1565
Training score: 0.9442373156103097
Test score: 0.9445794846864366
MSE: 0.054934370442391835

# to get co-efficients for model

coeff_df_test = pd.DataFrame(X.columns)
arr = np.array(lg_baseline_test.coef_)
df_arr_test = pd.DataFrame(data=arr).T
coeff_df_test = coeff_df_test.merge(df_arr_test, left_index=True, right_index=True, how='inner')
coeff_df_test.columns= ['feature', 'coef_logreg']
print(coeff_df_test.sort_values(by='coef_logreg',ascending=False))
coeff_df_test.sort_values(by='coef_logreg',ascending=False).plot(kind='bar', x='feature')

                                    feature  coef_logreg
           NationalitySingapore Citizen     2.560097
NationalitySingapore Permanent Resident     0.692651
                   NationalityForeigner     0.661561
                                 Chinese     0.092881
                               Hainanese     0.043636
                                   Hakka     0.042051
                               Mandarin     0.011858
                                Mandarin     0.011858
                                  Tamil    -0.049496
                                     Age    -0.055801
                                 Hokkien    -0.062695
                                Teochew    -0.062806
                                  Malay    -0.068920
                               Cantonese    -0.082700
                           RaceEurasian    -0.087335
                                   Hindi    -0.092611
                                 English    -0.101242
                              RaceMalay    -0.164994
                                   Years    -0.363874
                             RaceIndian    -0.387989
                            RaceChinese    -1.235166
                             GenderMale    -2.822673
                           GenderFemale    -2.885762





<matplotlib.axes._subplots.AxesSubplot at 0x1f01c2b6e10>

png

y_pred_proba = lg_baseline_test.predict_proba(X_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test,  y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()

png

Model 2: Lasso - testing

coeff_df, pred = lasso_reg(X_train, X_test, y_train, y_test, 'lasso')

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:    1.4s finished


Mean score: 0.7276218355950257
Validation/Test score: 0.737418933566127
Means Square Error: 0.046218746760138305
Root Mean Square: 0.21498545708986527

coeff_df.sort_values(by='Coefficient',ascending=False).head(20).plot(kind='bar')

<matplotlib.axes._subplots.AxesSubplot at 0x1f020e7aef0>

png

Model 3: Decision Tree Classifier - testing

# fitting Decision Tree Classifier to test data

# Instantiate model.
dt = DecisionTreeClassifier(max_depth=8,random_state = 42)

# Fit training data
model_dt_test = dt.fit(X_train, y_train)

# Evaluate model.
print(f'Score on training set: {model_dt_test.score(X_train, y_train)}')
print(f'Score on test set: {model_dt_test.score(X_test, y_test)}')

# testing  - X_test, y_test
test_dt = dt.fit(X_test, y_test)

#predictions
preds_dt_test=model_dt_test.predict(X_test)

# Generate a confusion matrix.
confusion_matrix(y_test, preds_dt_test)

tn, fp, fn, tp = confusion_matrix(y_test, preds_dt_test).ravel()

print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)

Score on training set: 0.9586642891878748
Score on test set: 0.9402041808458921
True Negatives: 422
False Positives: 47
False Negatives: 18
True Positives: 1570

#Create a new tree with a lower depth (easier to visualize)

clf = DecisionTreeClassifier(max_depth = 8)

clf.fit(X_train, y_train)
dot_data = StringIO()
export_graphviz(clf, out_file=dot_data, feature_names=X.columns,
               filled=True, rounded=True,
               special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())

Volunteer profile and recommendations

Membership has increased in last 3 years and interests of the volunteers are varied
Among the 5897 volunteers who have registered since 2017, 2290 (39%) have indicated that they have post secondary qualifications.

Bhavana Rao