Engaging Senior Volunteers Volunteer profile

Seniors volunteering - a deeper look

Background

RSVP

  • RSVP Singapore The Organisation of Senior Volunteers is an Institution of Public Character and the National Centre of Excellence for Senior Volunteerism under the patronage of Mdm Halimah Yacob, President of the Republic of Singapore. The organisation started in 1998 and was launched by then-Prime Minister Mr Goh Chok Tong. RSVP Singapore is a registered society under the Societies Act and a member of the National Council of Social Service (NCSS).

  • Since its inception, RSVP Singapore has been actively engaging seniors in purpose-driven volunteerism. With over 1,000 members and 1,500 ad-hoc volunteers, RSVP Singapore serves more than 200,000 beneficiaries each year including the mentally disadvantaged, at-risk children from low income families, and socially isolated seniors through its community service programmes. (ref: https://rsvp.org.sg/vision-mission/)

  • The project was scoped to look at the past four years of activities and profile of volunteers - 2015 to 2018 (full years) and till Oct 2019.

Importing relevant libraries

# importing libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import time
import random
from datetime import datetime
from datetime import timedelta
from sklearn import metrics
import math


from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression, LassoCV, Lasso
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, mean_squared_error

import graphviz
from sklearn.externals.six import StringIO
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus


import requests

%matplotlib inline
# setting options to view entire dataset

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

Problem statement

– To look at the volunteer database and activities conducted over 4 years (2015-2018) and study patterns:

  • Profile of volunteers
  • What activities are these volunteers most involved in? (Comparison of numbers)
  • Clustering of volunteers based on activities they are invovled in / hours they volunteer
  • Engagement of volunteers and suggestions for continuing their involvement

While this notebook will cover the basic profile and interests of the volunteers, the second book will focus on the various types of volunteers and the activities used by RSVP to engage them.

# reading excel file containing volunteer details

df_volunteer= pd.read_excel('.\datasets\Vol Basic run on 29Oct Sample.xlsx')
df_volunteer.head()
Date_Joined Gender Nationality Last 4 digits IC Date_of_Birth Age Postal_Code Race Spoken_languages Interests Skills Hear_about_us Education_Qualification Current_Employment_Status Employer_Name Designation Status Status_Remarks Registration_Date Membership_Type Membership_Start_Date Membership_Expiry_Date Membership_Payment
0 11-01-2016 Male NaN NaN 05-08-1938 81.0 788214.0 Others Hokkien,English Infocomm Related, NaN NaN NaN Retired ForumEnergy Technologies Pte Ltd Regional Manager Inactive Inactive since June 2017. 22-05-2016 NaN NaN NaN NaN
1 16-05-2002 Female NaN 0001H 30-07-1941 78.0 572152.0 Chinese Cantonese,Hokkien,Teochew,English,Mandarin NaN NaN NaN Secondary Homemaker K K Hospital Midwife Inactive Status data correction from (Ordinary Member) ... 22-05-2016 Ordinary Member NaN 30-06-2015 NaN
2 02-02-2016 Male NaN 0024H 17-06-1963 56.0 161057.0 Chinese Cantonese,Hokkien,Teochew,English,Mandarin NaN NaN NaN NaN Others, pls NTUC Fairprice Retail Assistant Inactive Status data correction from (Ordinary Member) ... 22-05-2016 Ordinary Member NaN 30-06-2016 NaN
3 12-10-2010 Male Singapore Permanent Resident 0037I 07-12-1935 84.0 570269.0 Chinese Cantonese,English,Mandarin NaN NaN NaN Retired Retired Bank Central Asia, Indonesia Clerk Active NaN 22-05-2016 Ordinary Member 25-09-2018 30-06-2019 NaN
4 12-04-2013 Female Singapore Citizen 0066e 29-07-1952 67.0 100050.0 Chinese ,English Office Admin.,Snr Guiding,Active Ageing Seniors NaN NaN Retired Retired Singapore Post Pte Ltd Part time CSO Active NaN 22-05-2016 Ordinary 05-07-2019 30-06-2019 paid
df_volunteer.shape
(8285, 23)
df_volunteer.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8285 entries, 0 to 8284
Data columns (total 23 columns):
Date_Joined                  3151 non-null object
Gender                       7955 non-null object
Nationality                  6437 non-null object
Last 4 digits IC             8080 non-null object
Date_of_Birth                8211 non-null object
Age                          8211 non-null float64
Postal_Code                  4950 non-null float64
Race                         4999 non-null object
Spoken_languages             4917 non-null object
Interests                    1342 non-null object
Skills                       722 non-null object
Hear_about_us                351 non-null object
Education_Qualification      4396 non-null object
Current_Employment_Status    2726 non-null object
Employer_Name                1108 non-null object
Designation                  1256 non-null object
Status                       8227 non-null object
Status_Remarks               5940 non-null object
Registration_Date            8227 non-null object
Membership_Type              2568 non-null object
Membership_Start_Date        1092 non-null object
Membership_Expiry_Date       2597 non-null object
Membership_Payment           799 non-null object
dtypes: float64(2), object(21)
memory usage: 1.5+ MB
  • Observations:
  • The data has 8285 rows of info, However there are many null values and it needs to be cleaned for duplicates and missing values
  • Relevant columns for analysis, such as date of birth, date of joining, status, details to link to other data sheets, will be retained
  • Duplicates checked and rows with null values in identifying variables (‘vol_id’ and ‘Date_of_Birth’) dropped as ther is no other way toidentify such persons
# dropping columns not relevant to study - eg. membership expiry and payment details

df_volunteer.drop(columns=['Status_Remarks','Membership_Start_Date',
                            'Membership_Expiry_Date','Membership_Payment'], inplace=True)
df_volunteer.head()
Date_Joined Gender Nationality Last 4 digits IC Date_of_Birth Age Postal_Code Race Spoken_languages Interests Skills Hear_about_us Education_Qualification Current_Employment_Status Employer_Name Designation Status Registration_Date Membership_Type
0 11-01-2016 Male NaN NaN 05-08-1938 81.0 788214.0 Others Hokkien,English Infocomm Related, NaN NaN NaN Retired ForumEnergy Technologies Pte Ltd Regional Manager Inactive 22-05-2016 NaN
1 16-05-2002 Female NaN 0001H 30-07-1941 78.0 572152.0 Chinese Cantonese,Hokkien,Teochew,English,Mandarin NaN NaN NaN Secondary Homemaker K K Hospital Midwife Inactive 22-05-2016 Ordinary Member
2 02-02-2016 Male NaN 0024H 17-06-1963 56.0 161057.0 Chinese Cantonese,Hokkien,Teochew,English,Mandarin NaN NaN NaN NaN Others, pls NTUC Fairprice Retail Assistant Inactive 22-05-2016 Ordinary Member
3 12-10-2010 Male Singapore Permanent Resident 0037I 07-12-1935 84.0 570269.0 Chinese Cantonese,English,Mandarin NaN NaN NaN Retired Retired Bank Central Asia, Indonesia Clerk Active 22-05-2016 Ordinary Member
4 12-04-2013 Female Singapore Citizen 0066e 29-07-1952 67.0 100050.0 Chinese ,English Office Admin.,Snr Guiding,Active Ageing Seniors NaN NaN Retired Retired Singapore Post Pte Ltd Part time CSO Active 22-05-2016 Ordinary
  • With an effort to protect privacy and confidentiality, only the last 4 numbers of the IC and the alphabet are retained as linking details to connect the volunteer datatbase and other sheets showing the activities they have participated in.
  • This detail will be called ‘vol_id’.
# 'last 4 digits IC' is one of the identifying factors for volunteers.

df_volunteer.rename(columns={'Last 4 digits IC': 'vol_id'}, inplace=True)
df_volunteer.head()
Date_Joined Gender Nationality vol_id Date_of_Birth Age Postal_Code Race Spoken_languages Interests Skills Hear_about_us Education_Qualification Current_Employment_Status Employer_Name Designation Status Registration_Date Membership_Type
0 11-01-2016 Male NaN NaN 05-08-1938 81.0 788214.0 Others Hokkien,English Infocomm Related, NaN NaN NaN Retired ForumEnergy Technologies Pte Ltd Regional Manager Inactive 22-05-2016 NaN
1 16-05-2002 Female NaN 0001H 30-07-1941 78.0 572152.0 Chinese Cantonese,Hokkien,Teochew,English,Mandarin NaN NaN NaN Secondary Homemaker K K Hospital Midwife Inactive 22-05-2016 Ordinary Member
2 02-02-2016 Male NaN 0024H 17-06-1963 56.0 161057.0 Chinese Cantonese,Hokkien,Teochew,English,Mandarin NaN NaN NaN NaN Others, pls NTUC Fairprice Retail Assistant Inactive 22-05-2016 Ordinary Member
3 12-10-2010 Male Singapore Permanent Resident 0037I 07-12-1935 84.0 570269.0 Chinese Cantonese,English,Mandarin NaN NaN NaN Retired Retired Bank Central Asia, Indonesia Clerk Active 22-05-2016 Ordinary Member
4 12-04-2013 Female Singapore Citizen 0066e 29-07-1952 67.0 100050.0 Chinese ,English Office Admin.,Snr Guiding,Active Ageing Seniors NaN NaN Retired Retired Singapore Post Pte Ltd Part time CSO Active 22-05-2016 Ordinary
df_volunteer['Date_of_Birth'].isnull().value_counts()
False    8211
True       74
Name: Date_of_Birth, dtype: int64
  • For 74 volunteers Date of Birth is not available.
  • Also checking the vol_id to see if the individual can be identified
  • For volunteers where both Date of Birth and vol_id is not available - 59 nos - these are dropped as there is no way to identify them.
  • In addition, most other fields also do not have info, allowing for dropping of the same.
  • The resulting dataset has 8226 rows.
# For 74 volunteers Date of Birth is not available.
# Also checking the vol_id to see if the individual can be identified
# For volunteers where both Date of Birth and vol_id is not available -
# 59 nos. - these are dropped as there is no way to identify them.
# In addition, most other fields have not be populated with info.

df_blanks = df_volunteer[(df_volunteer['Date_of_Birth'].isnull()) & (df_volunteer['vol_id'].isnull())]
df_blanks.head()
df_blanks.shape
(59, 19)
# 59 volunteers do not have identifying details - vol_id, Date_of_Birth  
# dropping these 59 rows

index_blanks = df_blanks.index
print(index_blanks)
df_volunteer.drop(index_blanks, inplace=True)
Int64Index([7742, 8227, 8228, 8229, 8230, 8231, 8232, 8233, 8234, 8235, 8236,
            8237, 8238, 8239, 8240, 8241, 8242, 8243, 8244, 8245, 8246, 8247,
            8248, 8249, 8250, 8251, 8252, 8253, 8254, 8255, 8256, 8257, 8258,
            8259, 8260, 8261, 8262, 8263, 8264, 8265, 8266, 8267, 8268, 8269,
            8270, 8271, 8272, 8273, 8274, 8275, 8276, 8277, 8278, 8279, 8280,
            8281, 8282, 8283, 8284],
           dtype='int64')
df_volunteer.shape
(8226, 19)
# Other null values in 'vol_id' column have been given an Id of '1000Q'

df_volunteer['vol_id'].fillna('1000Q', inplace=True)
df_volunteer['vol_id'].astype('str')
df_volunteer['vol_id'] = df_volunteer['vol_id'].str[0:4]+ df_volunteer['vol_id'].str[4].str.upper()
df_volunteer.head()
Date_Joined Gender Nationality vol_id Date_of_Birth Age Postal_Code Race Spoken_languages Interests Skills Hear_about_us Education_Qualification Current_Employment_Status Employer_Name Designation Status Registration_Date Membership_Type
0 11-01-2016 Male NaN 1000Q 05-08-1938 81.0 788214.0 Others Hokkien,English Infocomm Related, NaN NaN NaN Retired ForumEnergy Technologies Pte Ltd Regional Manager Inactive 22-05-2016 NaN
1 16-05-2002 Female NaN 0001H 30-07-1941 78.0 572152.0 Chinese Cantonese,Hokkien,Teochew,English,Mandarin NaN NaN NaN Secondary Homemaker K K Hospital Midwife Inactive 22-05-2016 Ordinary Member
2 02-02-2016 Male NaN 0024H 17-06-1963 56.0 161057.0 Chinese Cantonese,Hokkien,Teochew,English,Mandarin NaN NaN NaN NaN Others, pls NTUC Fairprice Retail Assistant Inactive 22-05-2016 Ordinary Member
3 12-10-2010 Male Singapore Permanent Resident 0037I 07-12-1935 84.0 570269.0 Chinese Cantonese,English,Mandarin NaN NaN NaN Retired Retired Bank Central Asia, Indonesia Clerk Active 22-05-2016 Ordinary Member
4 12-04-2013 Female Singapore Citizen 0066E 29-07-1952 67.0 100050.0 Chinese ,English Office Admin.,Snr Guiding,Active Ageing Seniors NaN NaN Retired Retired Singapore Post Pte Ltd Part time CSO Active 22-05-2016 Ordinary
# checking number of unique Id identifying volunteers

df_volunteer['vol_id'].nunique()
7830
# Checking duplicates of vol_id if any volunteer details are duplicated

df_duplicates = df_volunteer.groupby(by=['vol_id','Date_of_Birth']).count()
df_duplicates.head()
Date_Joined Gender Nationality Age Postal_Code Race Spoken_languages Interests Skills Hear_about_us Education_Qualification Current_Employment_Status Employer_Name Designation Status Registration_Date Membership_Type
vol_id Date_of_Birth
0000D 03-07-1962 0 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 0
0000J 08-07-1948 0 1 1 1 1 1 0 0 0 0 1 0 0 0 1 1 0
0001H 04-02-1945 0 1 1 1 1 1 1 0 0 0 1 0 0 0 1 1 0
30-07-1941 1 1 0 1 1 1 1 0 0 0 1 1 1 1 1 1 1
0002I 16-02-1967 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0

Basic profile of volunteer

  • Based on the data available in the Volunteer database, the features defining a typical volunteer were shortlisted

  • Total number of volunteers registered till 2018 = 8226

  • Many have not provided much information about themselves…..

    • 7954 have stated Gender
      • female(5097, 62%)
      • male(2858,34.7%)
      • Unstated (272, 3.3%)
    • 6436 have stated Nationality
      • Singapore Citizen (6243,76%)
      • Singapore PR(153, 1.9%),
      • Foreigner(41, 0.5%),
      • Unknown(1790, 21.8%)
    • 4994 have provided information about Race
      • Chinese (4665, 56.7%),
      • Indian (190, 2.3%),
      • Malay (49,0.6%),
      • Others (39, 0.5%),
      • Eurasian (29, 0.4%),
      • Unstated (3232, 39.3%)
    • Recent years have seen rapid increase in signing up of volunteers
      • 2017: 2327 pax
      • 2018: 1694 pax
      • 2019: 1876 pax (till Oct 2019)
# checking null values in 'Age' column - 15 values. using mean value to fill null values

df_volunteer['Age'].fillna(df_volunteer['Age'].mean(), inplace=True)
# checking for null values in 'Nationality' column - replacing with 'NotStated'

df_volunteer['Nationality'].fillna('NotStated', inplace=True)
print(df_volunteer['Nationality'].value_counts())
df_volunteer['Nationality'].value_counts().plot(kind='bar');
Singapore Citizen               6242
NotStated                       1790
Singapore Permanent Resident     153
Foreigner                         41
Name: Nationality, dtype: int64

png

# checking for null values in 'Gender' column - replacing with 'NotStated'

df_volunteer['Gender'].fillna('NotStated', inplace=True)
print(df_volunteer['Gender'].value_counts())
df_volunteer['Gender'].value_counts().plot(kind='bar')
Female       5097
Male         2857
NotStated     272
Name: Gender, dtype: int64





<matplotlib.axes._subplots.AxesSubplot at 0x1f01c28c2e8>

png

# plotting gender and age of volunteers

df_volunteer.groupby('Gender').Age.hist()
plt.legend(['Female', 'Male'], facecolor='w' )
plt.show()

png

# checking info in 'Race' column
df_volunteer['Race'].fillna('Unknown', inplace=True)
df_Race= pd.DataFrame(df_volunteer['Race'].value_counts())
df_Race.head(10)
Race
Chinese 4665
Unknown 3232
Indian 190
Malay 49
Others 39
Eurasian 29
Others||Filipino 6
Others||Punjabi 3
Indonesian 2
Japanese 2
(df_Race.sort_values(by='Race')).plot(kind='barh',xlim=(0, 250))

<matplotlib.axes._subplots.AxesSubplot at 0x1f01d316e80>

png

df_volunteer['Spoken_languages'].head(10)
0                                     Hokkien,English
1          Cantonese,Hokkien,Teochew,English,Mandarin
2          Cantonese,Hokkien,Teochew,English,Mandarin
3                          Cantonese,English,Mandarin
4                                            ,English
5                                       ,English,Thai
6                                             Chinese
7    Cantonese,Hakka,Hokkien,Teochew,English,Mandarin
8                                             Chinese
9                          Cantonese,English,Mandarin
Name: Spoken_languages, dtype: object
df_volunteer['Spoken_languages'].isnull().sum()
3317
# Spoken languages info only available for 4909 volunteers (i.e., 8226-3317)

df_volunteer['Spoken_languages'].astype(str)
languages = df_volunteer['Spoken_languages'].str.get_dummies(',')
languages.sum().sort_values(ascending=False).head(10)

Chinese      3277
English      1508
Mandarin     1141
Hokkien       903
Cantonese     754
Teochew       489
Malay         341
Hainanese      82
Hakka          69
Tamil          57
dtype: int64
df_volunteer.columns
Index(['Date_Joined', 'Gender', 'Nationality', 'vol_id', 'Date_of_Birth',
       'Age', 'Postal_Code', 'Race', 'Spoken_languages', 'Interests', 'Skills',
       'Hear_about_us', 'Education_Qualification', 'Current_Employment_Status',
       'Employer_Name', 'Designation', 'Status', 'Registration_Date',
       'Membership_Type'],
      dtype='object')
df_volunteer['Education_Qualification'].notnull().sum()
4395
df_volunteer['Education_Qualification'].value_counts()
Secondary                           755
O Level                             565
University Degree                   348
GCE O Level                         340
Polytechnic                         324
Retired                             308
Degree                              289
A Levels                            232
Primary                             202
Diploma                             194
GCE A Level                         130
Postgraduate Degree                 109
Masters                              70
Employed (Full-Time)                 53
Semi-retired                         52
Homemaker                            41
Self-employed                        38
Employed (Part-Time)                 35
Others, pls                          34
ITE/NTC                              32
College                              29
Professional Qualifications          22
Post Graduate Diploma                21
No Formal Education                  21
GCE N Level                          18
Doctorate                            17
Freelance Work                       16
Certificate/Advanced Certificate     15
N Level                              15
Higher Nitec                         11
Awaiting Employment                  11
A Level                               7
Nitec (ITE)                           6
Bachelor Degree                       4
secondary                             3
Bachelor"s Degree                     3
diploma                               3
Post Graduate                         2
others                                2
A level                               2
Post secondary                        2
degree                                2
o.level                               2
Advanced Diploma                      1
Senior Cambridge                      1
Others                                1
GCE O level                           1
A-Level                               1
post secondary                        1
GCE O LEVEL                           1
Postgraduate                          1
No Formal education                   1
Post-Graduate                         1
Name: Education_Qualification, dtype: int64
# growth of number of volunteers
df_volunteer['Date_Joined'].fillna(df_volunteer['Registration_Date'], inplace=True)
df_volunteer['join_yr']=pd.to_datetime(df_volunteer['Date_Joined'], format='%d-%m-%Y').dt.year
df_volunteer['join_yr'].fillna(1999)
df_volunteer['join_yr'].replace(to_replace=[1954,1953,1946,1936,1935,1949,1933,1965,1944,1996,1955,1997],
                                value=1999, inplace=True)
dataviz_vol= pd.DataFrame(df_volunteer['join_yr'].sort_index().value_counts())
print(df_volunteer['join_yr'].sort_index().value_counts())
df_volunteer['join_yr'].sort_index().plot(kind='hist')
plt.gcf().set_size_inches(10, 4)
2017    2327
2019    1876
2018    1694
2009     273
2010     238
2013     230
2012     227
2007     196
2008     185
2011     172
2016     158
2015     134
2014     103
2006      95
2005      60
2000      42
1999      41
2004      40
2001      40
2002      38
1998      31
2003      26
Name: join_yr, dtype: int64

png

dataviz_vol= dataviz_vol.sort_index()
# ser = pd.Series(np.random.normal(size=1000))

dataviz_vol.hist(cumulative=True, density=1, bins=200)

plt.show()

df_volunteer[df_volunteer['join_yr'] > 2016]['Education_Qualification'].value_counts()
O Level                  554
Secondary                479
University Degree        339
Polytechnic              312
A Levels                 227
Primary                  135
Postgraduate Degree      108
Diploma                   43
Degree                    42
ITE/NTC                   32
Retired                   26
No Formal Education       21
GCE O Level               20
N Level                   15
Semi-retired              13
GCE A Level               11
Others, pls                7
Employed (Full-Time)       7
A Level                    6
Masters                    5
Homemaker                  4
Post Graduate Diploma      4
secondary                  3
Bachelor Degree            3
Bachelor"s Degree          3
o.level                    2
Post Graduate              2
Self-employed              2
Freelance Work             2
A level                    2
Post secondary             2
others                     1
Employed (Part-Time)       1
Postgraduate               1
Senior Cambridge           1
diploma                    1
Advanced Diploma           1
Others                     1
GCE O level                1
A-Level                    1
post secondary             1
Higher Nitec               1
GCE O LEVEL                1
degree                     1
No Formal education        1
College                    1
Post-Graduate              1
Name: Education_Qualification, dtype: int64
# To extract year of joining from Registration Date
df_volunteer['Years']=2019 - (pd.to_datetime(df_volunteer['Date_Joined'], format='%d-%m-%Y').dt.year)

Interest of volunteers as stated during registration

  • info available from only 1342 persons
df_volunteer['Interests'].notnull().sum()
1342
# To look at interests of volunteers

df_volunteer['Interests'].str.lower()
interests =df_volunteer['Interests'].dropna()
# creating WordCloud - interests
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

wordcloud = WordCloud(max_font_size=50, max_words=50, background_color="white").generate(' '.join(interests))

# Display the generated image:
plt.figure(figsize=(10,4))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

png

# saving dataset as csv file

df_volunteer.to_csv('./datasets/df_volunteer.csv')

Creating a basic model to predict if member will be ‘active’

  • From the available information which was quite limited, a predictive model was prepared to study the profile of members who are active.
  • Logistic regression, Lasso regression and Decision tree models were used to identify features that were most likely to indicate ‘activeness’ in volunteers
  • The 3 models highighted Nationality to be highly positive correlated and gender to be highly negative.

  • However, due to the limitations of the data as well as absence of any health / mobility related data, the model is not very effective in predictions.
  • Also the status of ‘Active’ and ‘Inactive’ is determined by the renewal of membership by the volunteer and hence is not a reliable measure. A volunteer may have not renewed his memebrship for various reasons, such as ill-health, other commitments on their time, forgetfulness, taken up employment, travelling, caring for family members, etc.
## creating dummy columns for nominal columns - df_merged_dummies

df_baseline = df_volunteer[['vol_id','Gender','Nationality', 'Race', 'Age', 'Years', 'Spoken_languages', 'Status']]
df_baseline.head(10)

vol_id Gender Nationality Race Age Years Spoken_languages Status
0 1000Q Male NotStated Others 81.0 3 Hokkien,English Inactive
1 0001H Female NotStated Chinese 78.0 17 Cantonese,Hokkien,Teochew,English,Mandarin Inactive
2 0024H Male NotStated Chinese 56.0 3 Cantonese,Hokkien,Teochew,English,Mandarin Inactive
3 0037I Male Singapore Permanent Resident Chinese 84.0 9 Cantonese,English,Mandarin Active
4 0066E Female Singapore Citizen Chinese 67.0 6 ,English Active
5 0067I Female Singapore Citizen Chinese 68.0 5 ,English,Thai Active
6 0068H Female Singapore Citizen Chinese 73.0 20 Chinese Active
7 0072Z Female Singapore Citizen Chinese 82.0 11 Cantonese,Hakka,Hokkien,Teochew,English,Mandarin Active
8 0074J Male Singapore Citizen Chinese 68.0 13 Chinese Active
9 0081G Male NotStated Chinese 75.0 17 Cantonese,English,Mandarin Inactive
df_baseline['Status'].unique()
array(['Inactive', 'Active', 'Special Attention', 'Probation', 'New'],
      dtype=object)
mapping = {'Active':1, 'Special Attention':1, 'Probation':1, 'New': 1, 'Inactive': 0}
df_baseline['Status_curr']= df_baseline['Status'].apply(lambda x : mapping[x])

df_baseline.head(10)

C:\Users\bhavn\Anaconda3\lib\site-packages\ipykernel_launcher.py:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
vol_id Gender Nationality Race Age Years Spoken_languages Status Status_curr
0 1000Q Male NotStated Others 81.0 3 Hokkien,English Inactive 0
1 0001H Female NotStated Chinese 78.0 17 Cantonese,Hokkien,Teochew,English,Mandarin Inactive 0
2 0024H Male NotStated Chinese 56.0 3 Cantonese,Hokkien,Teochew,English,Mandarin Inactive 0
3 0037I Male Singapore Permanent Resident Chinese 84.0 9 Cantonese,English,Mandarin Active 1
4 0066E Female Singapore Citizen Chinese 67.0 6 ,English Active 1
5 0067I Female Singapore Citizen Chinese 68.0 5 ,English,Thai Active 1
6 0068H Female Singapore Citizen Chinese 73.0 20 Chinese Active 1
7 0072Z Female Singapore Citizen Chinese 82.0 11 Cantonese,Hakka,Hokkien,Teochew,English,Mandarin Active 1
8 0074J Male Singapore Citizen Chinese 68.0 13 Chinese Active 1
9 0081G Male NotStated Chinese 75.0 17 Cantonese,English,Mandarin Inactive 0
# replacing 'Spoken_languages' with dummy columns for languages

df_baseline = pd.concat([df_baseline, languages], axis=1, join='inner')
df_baseline.drop(['Spoken_languages', 'Status'], axis=1, inplace=True)

# saving dataset as csv file
df_baseline.to_csv('./datasets/df_baseline.csv')
df_baseline.head()
vol_id Gender Nationality Race Age Years Status_curr Cantonese Filipino Mandarin Teochew/Hokkien (weak) Bahasa Indonesia Cantonese Chinese Dutch English Foochow French Fuqing German Hainanese Hakka Hengwah Hindi Hindi sanskrit Hockchew Hokkien Japanese Malay Mandarin Mandarin Mandarin (weak) Norwegian Not fluent in Malay Others Shanghainese Spanish Tamil Teochew Thai Vietnamese
0 1000Q Male NotStated Others 81.0 3 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0001H Female NotStated Chinese 78.0 17 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0
2 0024H Male NotStated Chinese 56.0 3 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0
3 0037I Male Singapore Permanent Resident Chinese 84.0 9 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
4 0066E Female Singapore Citizen Chinese 67.0 6 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

columns_to_getdummies =['Gender','Nationality', 'Race']
df_baseline_dummies = pd.get_dummies(df_baseline, columns=(columns_to_getdummies), prefix=(columns_to_getdummies), prefix_sep='')

df_baseline_dummies.head()
vol_id Age Years Status_curr Cantonese Filipino Mandarin Teochew/Hokkien (weak) Bahasa Indonesia Cantonese Chinese Dutch English Foochow French Fuqing German Hainanese Hakka Hengwah Hindi Hindi sanskrit Hockchew Hokkien Japanese Malay Mandarin Mandarin Mandarin (weak) Norwegian Not fluent in Malay Others Shanghainese Spanish Tamil Teochew Thai Vietnamese GenderFemale GenderMale GenderNotStated NationalityForeigner NationalityNotStated NationalitySingapore Citizen NationalitySingapore Permanent Resident RaceChinese RaceEurasian RaceIndian RaceIndonesian RaceJapanese RaceMalay RaceOthers RaceOthers||Caucasian RaceOthers||Ceylonese RaceOthers||Filipino RaceOthers||Punjabi RaceOthers||Sinhalese RaceOthers||Thai RaceOthers||Vietnamese RacePakistani RaceSinhalese RaceUnknown
0 1000Q 81.0 3 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
1 0001H 78.0 17 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 0024H 56.0 3 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0037I 84.0 9 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 0066E 67.0 6 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# baseline score of data on current status

df_baseline_dummies['Status_curr'].value_counts(normalize=True)
1    0.771821
0    0.228179
Name: Status_curr, dtype: float64
df_baseline_dummies.to_csv('./datasets/baseline.csv')
# preparing X and y for modelling

X = df_baseline_dummies[['Age', 'Years', 'Cantonese', 'Mandarin', 'Chinese', 'English',
                         'Hainanese', 'Hakka', 'Hindi', 'Hokkien', 'Malay', 'Mandarin', 'Tamil', 'Teochew',
                         'GenderFemale', 'GenderMale', 'NationalityForeigner',  
                         'NationalitySingapore Citizen', 'NationalitySingapore Permanent Resident',
                         'RaceChinese', 'RaceEurasian','RaceIndian', 'RaceMalay']]

y = df_baseline_dummies['Status_curr']
X.shape, y.shape

((8226, 23), (8226,))
# saving dataset as csv file

df_baseline_dummies.to_csv('./datasets/df_baseline_dummies.csv')

Applying Standard Scaler

# setting aside 20% of X and y for test purposes
# applying Standard scaler

X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,random_state=0, stratify=y)

# standard scaling
ss = StandardScaler()
ss.fit(X_train)
X_train = ss.transform(X_train)
X_test = ss.transform(X_test)

X_train.shape, X_test.shape
((6169, 23), (2057, 23))

Model 1 :Logistic regression - training and validation

# creating a logistic model - X_train1, X_val, y_train1, y_val

# split X and y into training and testing sets
X_train1,X_val,y_train1,y_val=train_test_split(X_train,y_train,test_size=0.25,random_state=0)

# standard scaling
ss = StandardScaler()
ss.fit(X_train1)
X_train1 = ss.transform(X_train1)
X_val = ss.transform(X_val)

# instantiate the model (using the default parameters)
lg_baseline = LogisticRegression(solver='lbfgs')

# fit the model with data
lg_baseline.fit(X_train1,y_train1)

# getting predictions
y_pred=lg_baseline.predict(X_val)

# calculating probabilities
prob_lg_baseline = lg_baseline.predict_proba(X_val)

# validation score
training_score = lg_baseline.score(X_train1, y_train1)
validation_score = lg_baseline.score(X_val, y_val)

# Generate a confusion matrix.
confusion_matrix(y_val, y_pred)

tn, fp, fn, tp = confusion_matrix(y_val, y_pred).ravel()

print('True Negatives: %s' % tn)
print('False Positives: %s' % fp)
print('False Negatives: %s' % fn)
print('True Positives: %s' % tp)

print('Training score:', training_score)
print('Validation score:', validation_score)
print('MSE:',mean_squared_error(y_val, y_pred))

# Print out intercept and coefficients.
print('Intercept:', lg_baseline.intercept_)
print('Coefficient:', lg_baseline.coef_)
print(type(lg_baseline.coef_))
True Negatives: 253
False Positives: 70
False Negatives: 7
True Positives: 1213
Training score: 0.9420665801988759
Validation score: 0.950097213220998
MSE: 0.049902786779001944
Intercept: [2.11751837]
Coefficient: [[-0.01523056 -0.4101083  -0.11754014  0.02185603  0.10445194 -0.10417632
   0.09231091  0.08439391 -0.08350564 -0.09991119 -0.06327909  0.02185603
  -0.06003371 -0.01040314 -2.60230975 -2.56801013  0.64389587  2.49393104
   0.69417795 -1.26545949 -0.09677627 -0.43664819 -0.14489351]]
<class 'numpy.ndarray'>
coeff_df = pd.DataFrame(X.columns)
arr = np.array(lg_baseline.coef_)
df_arr = pd.DataFrame(data=arr).T
coeff_df = coeff_df.merge(df_arr, left_index=True, right_index=True, how='inner')
coeff_df.columns= ['feature', 'coeff_logreg']
print(coeff_df.sort_values(by='coeff_logreg',ascending=False))
coeff_df.sort_values(by='coeff_logreg',ascending=False).plot(kind='bar', x='feature')
                                    feature  coeff_logreg
17             NationalitySingapore Citizen      2.493931
18  NationalitySingapore Permanent Resident      0.694178
16                     NationalityForeigner      0.643896
4                                   Chinese      0.104452
6                                 Hainanese      0.092311
7                                     Hakka      0.084394
11                                 Mandarin      0.021856
3                                  Mandarin      0.021856
13                                  Teochew     -0.010403
0                                       Age     -0.015231
12                                    Tamil     -0.060034
10                                    Malay     -0.063279
8                                     Hindi     -0.083506
20                             RaceEurasian     -0.096776
9                                   Hokkien     -0.099911
5                                   English     -0.104176
2                                 Cantonese     -0.117540
22                                RaceMalay     -0.144894
1                                     Years     -0.410108
21                               RaceIndian     -0.436648
19                              RaceChinese     -1.265459
15                               GenderMale     -2.568010
14                             GenderFemale     -2.602310





<matplotlib.axes._subplots.AxesSubplot at 0x1f020a740f0>

png

y_pred_proba = lg_baseline.predict_proba(X_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_val,  y_pred)
auc = metrics.roc_auc_score(y_val,y_pred)
plt.plot(fpr,tpr,label="logreg model, auc="+str(auc))
plt.legend(loc=4)
plt.show()

png

Model 2 :Lasso model - training and validation

## Applying Lasso

# Lasso regression for given X_train, X_test, y_train, y_test, name of regression model

def lasso_reg(X_train, X_test, y_train, y_test, name):

    # Calculation of optimal alpha
    optimal_lasso = LassoCV(n_alphas=500, cv=10, verbose=1)
    optimal_lasso.fit(X_train, y_train)

    # instantiate lasso model with optimal alpha
    lasso = Lasso(alpha=optimal_lasso.alpha_)

    # Training the model
    lasso.fit(X_train, y_train)
    lasso_scores_train = cross_val_score(lasso, X_train, y_train, cv=10)
    print ("Mean score:", np.mean(lasso_scores_train))

    # getting co-eff values
    coeff_df = pd.DataFrame(lasso.coef_ , X.columns, columns=['Coefficient'])  

    # Fitting the model
    lasso.score(X_test, y_test)
    print('Validation/Test score:', lasso.score(X_test, y_test))
    pred = lasso.predict(X_test)

    # Metrics
    print ("Means Square Error:", metrics.mean_squared_error(y_test, pred ))
    print("Root Mean Square:", np.sqrt(metrics.mean_squared_error(y_test, pred)))


    return coeff_df, pred

coeff_df, pred = lasso_reg(X_train1, X_val, y_train1, y_val, 'lasso')
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:    1.4s finished


Mean score: 0.7242563051897465
Validation/Test score: 0.7348770422392643
Means Square Error: 0.04388113346759728
Root Mean Square: 0.20947824103614504
coeff_df.sort_values(by='Coefficient',ascending=False).head(20).plot(kind='bar')
<matplotlib.axes._subplots.AxesSubplot at 0x1f01da0d780>

png

Model 3: Decision Tree Classifier - training and validation

# fitting Decision Tree Classifier to training, validation

# Instantiate model.
dt = DecisionTreeClassifier(max_depth=8,random_state = 42)

# Fit training data
model_dt = dt.fit(X_train1, y_train1)

# Evaluate model.
print(f'Score on training set: {model_dt.score(X_train1, y_train1)}')
print(f'Score on validation set: {model_dt.score(X_val, y_val)}')

# testing for validation set - X_val, y_val
model_dt_val = model_dt.fit(X_val, y_val)

#predictions
preds_dt=model_dt.predict(X_val)

# Generate a confusion matrix.
confusion_matrix(y_val, preds_dt)

tn, fp, fn, tp = confusion_matrix(y_val, preds_dt).ravel()

print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)

Score on training set: 0.9589277993947255
Score on validation set: 0.948801036941024
True Negatives: 310
False Positives: 13
False Negatives: 24
True Positives: 1196
#Create a new tree with a lower depth (easier to visualize)

clf = DecisionTreeClassifier(max_depth = 8)

clf.fit(X_train1, y_train1)
dot_data = StringIO()
export_graphviz(clf, out_file=dot_data, feature_names=X.columns,
               filled=True, rounded=True,
               special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())

png

Testing of model

Model 1: Logistic Regression - testing

# using logistic regression model - X_train, X_test, y_train, y_test

# standard scaling
ss = StandardScaler()
ss.fit(X_train)
X_train = ss.transform(X_train)
X_test = ss.transform(X_test)

# instantiate the model (using the default parameters)
lg_baseline_test = LogisticRegression(solver='lbfgs')

# fit the model with data
lg_baseline_test.fit(X_train,y_train)

# getting predictions
y_pred_test=lg_baseline_test.predict(X_test)

# calculating probabilities
probability_lg_baseline_test = lg_baseline.predict_proba(X_test)

# validation score
training_score = lg_baseline.score(X_train, y_train)
test_score = lg_baseline.score(X_test, y_test)

# Generate a confusion matrix.
confusion_matrix(y_test, y_pred_test)

tn, fp, fn, tp = confusion_matrix(y_test, y_pred_test).ravel()

print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)

print('Training score:', training_score)
print('Test score:', test_score)
print("MSE:",mean_squared_error(y_test, y_pred_test))


True Negatives: 379
False Positives: 90
False Negatives: 23
True Positives: 1565
Training score: 0.9442373156103097
Test score: 0.9445794846864366
MSE: 0.054934370442391835
# to get co-efficients for model

coeff_df_test = pd.DataFrame(X.columns)
arr = np.array(lg_baseline_test.coef_)
df_arr_test = pd.DataFrame(data=arr).T
coeff_df_test = coeff_df_test.merge(df_arr_test, left_index=True, right_index=True, how='inner')
coeff_df_test.columns= ['feature', 'coef_logreg']
print(coeff_df_test.sort_values(by='coef_logreg',ascending=False))
coeff_df_test.sort_values(by='coef_logreg',ascending=False).plot(kind='bar', x='feature')
                                    feature  coef_logreg
17             NationalitySingapore Citizen     2.560097
18  NationalitySingapore Permanent Resident     0.692651
16                     NationalityForeigner     0.661561
4                                   Chinese     0.092881
6                                 Hainanese     0.043636
7                                     Hakka     0.042051
11                                 Mandarin     0.011858
3                                  Mandarin     0.011858
12                                    Tamil    -0.049496
0                                       Age    -0.055801
9                                   Hokkien    -0.062695
13                                  Teochew    -0.062806
10                                    Malay    -0.068920
2                                 Cantonese    -0.082700
20                             RaceEurasian    -0.087335
8                                     Hindi    -0.092611
5                                   English    -0.101242
22                                RaceMalay    -0.164994
1                                     Years    -0.363874
21                               RaceIndian    -0.387989
19                              RaceChinese    -1.235166
15                               GenderMale    -2.822673
14                             GenderFemale    -2.885762





<matplotlib.axes._subplots.AxesSubplot at 0x1f01c2b6e10>

png

y_pred_proba = lg_baseline_test.predict_proba(X_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test,  y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()

png

Model 2: Lasso - testing

coeff_df, pred = lasso_reg(X_train, X_test, y_train, y_test, 'lasso')
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:    1.4s finished


Mean score: 0.7276218355950257
Validation/Test score: 0.737418933566127
Means Square Error: 0.046218746760138305
Root Mean Square: 0.21498545708986527
coeff_df.sort_values(by='Coefficient',ascending=False).head(20).plot(kind='bar')
<matplotlib.axes._subplots.AxesSubplot at 0x1f020e7aef0>

png

Model 3: Decision Tree Classifier - testing

# fitting Decision Tree Classifier to test data

# Instantiate model.
dt = DecisionTreeClassifier(max_depth=8,random_state = 42)

# Fit training data
model_dt_test = dt.fit(X_train, y_train)

# Evaluate model.
print(f'Score on training set: {model_dt_test.score(X_train, y_train)}')
print(f'Score on test set: {model_dt_test.score(X_test, y_test)}')

# testing  - X_test, y_test
test_dt = dt.fit(X_test, y_test)

#predictions
preds_dt_test=model_dt_test.predict(X_test)

# Generate a confusion matrix.
confusion_matrix(y_test, preds_dt_test)

tn, fp, fn, tp = confusion_matrix(y_test, preds_dt_test).ravel()

print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)


Score on training set: 0.9586642891878748
Score on test set: 0.9402041808458921
True Negatives: 422
False Positives: 47
False Negatives: 18
True Positives: 1570
#Create a new tree with a lower depth (easier to visualize)

clf = DecisionTreeClassifier(max_depth = 8)

clf.fit(X_train, y_train)
dot_data = StringIO()
export_graphviz(clf, out_file=dot_data, feature_names=X.columns,
               filled=True, rounded=True,
               special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())

Volunteer profile and recommendations

  • Membership has increased in last 3 years and interests of the volunteers are varied
  • Among the 5897 volunteers who have registered since 2017, 2290 (39%) have indicated that they have post secondary qualifications.