Regression Analysis in Python Using UNICEF MICS Data
If you're trying to learn about data analytics, economics or statistics, you've probably come across the term "regression analysis" before. It is one of the most commonly used statistical methods for estimating the relationships between a dependent variable and one or several independent variables. In this entry, we'll look at how we can use Python to run a regression analysis on household survey data from the Multiple Indicator Cluster Survey datasets produced by UNICEF. We'll illustrate this tutorial with a real-world use case, specifically looking at education indicators in the context of Madagascar. As usual, our program will include code for importing data, cleaning it up, running basic statistical analyses, specifying a model and displaying the results. Let's get to it!
The Multiple Indicator Cluster Survey (aka MICS) datasets produced by UNICEF are probably some of the greatest, most extensive datasets available as far as micro-level data is concerned. This data is particularly important in the context of developing countries, where good statistics are often in short supply, and it is especially helpful for getting a good grasp of the human, social and economic challenges a country's population might face. UNICEF does a great job of collecting this data, keeping it organized and making it available free of charge. If you haven't already, I would recommend checking out their website. Before we begin, a short disclaimer is in order.
Disclaimer: The following code was written and run in Visual Studio Code using the latest Anaconda distribution. The data used for this article was kindly provided by the UNICEF MICS Team. The code itself is solely intended for educational purposes and the potential findings may not engage the data provider's responsibility in any capacity.
Author: Johary Razafindratsita, 2022.
The first step is to download the data from the MICS Website. I would recommend registering as a MICS data user first, and applying to get access to the datasets you are interested in. For the purpose of this article, we will be downloading the 2018 MICS datasets for Madagascar. Since the data is made available in SPSS format, the first thing we need to do is install a module that will allow us to read that data in Python. This is where pyreadstat comes in:
pip install pyreadstat
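Note that in addition to pyreadstat, we will also be using statsmodels to specify our model and run our regression analysis, and dataframe_image to save summary tables as image files. If either of these is missing from your environment, you can install it with pip in the same way. Now, let's import the different libraries and modules we will be using: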
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import pyreadstat
import statsmodels.api as sm
import dataframe_image as dfi
We are now ready to import our data into Python. MICS produces several large and extensive datasets. At this stage, it is therefore very important to understand what these datasets are, how they are structured and what each indicator means. I cannot stress this enough. Thankfully, the MICS Team provides a good FAQ as well as supporting documents to help us here. Additionally, reading the associated MICS reports is essential in understanding how these datasets are structured.
Because we're interested in education indicators, we'll be working on the 'fs.sav' file, which contains the survey related to children aged 5 to 17. Let's import it into our program:
df, meta = pyreadstat.read_sav('./MICS/fs.sav')
At this stage, you can convert it into a .csv file should you want to have a look at the data in say, Excel, for instance (you could also convert it to a regular Excel file, though this will result in a much larger file size):
df.to_csv('mics_fs.csv')
Now that we've imported our data, it's time to select the variables we'll be using for our regression analysis. For the purpose of this article, we'll be looking at determinants of school attendance in Madagascar based on information provided by the MICS data.
Our dependent (or endogenous) variable, the one we're looking to explain, will therefore be "School Attendance". This is the CB7 variable in our database, which can take one of 2 values:
The value "1" if they attended school or pre-primary school at any point during the current school year
The value "2" if they did not attend school or pre-primary school at all during the current school year
Now let's select our independent (or explanatory) variables and lay out our hypotheses:
Our first variable is "Age" (CB3 in our database). Children aged 5 to 17 should normally be attending school. However, in Madagascar and many developing countries, school attendance often decreases with age. This can be due to several factors: there are considerably fewer secondary schools - and even fewer higher secondary schools - compared to primary schools; attending secondary school usually requires traveling longer distances in rural areas, and often requires children to move out of their parents' home and rent a place in the town where the school is located, resulting in substantial financial costs for the family; the older children are, the more likely they are to work or contribute in a meaningful manner to help their family.
Our second variable is "Child Labour" (CL3 in our database). As explained above, adolescents often find themselves in a position where they are asked to work or contribute meaningfully to help their family make ends meet. This can happen to the detriment of their education, causing them to stop attending school. This specific variable measures the time children spent working over the previous week (measured in hours).
Our third variable is "Chores" (CL13 in our database). This variable measures the time children spent helping with domestic chores over the previous week (measured in hours). The more time children spend on chores, the less likely they are to attend school. This is particularly true for children living in rural areas where chores can consist of substantial tasks, including fetching water and firewood, cooking, cleaning and helping care for other family members.
Our fourth variable is "Farm Work" (CL1A in our database). This is another indicator for child labour which focuses on children's involvement in farming activities. As the vast majority of Madagascar's population still relies on subsistence farming, children are very likely to be asked to help with these types of activities as they grow up. This specific variable, like our dependent variable, is dichotomous, meaning that it can take one of 2 values (the value "1" if children are doing any farm work and "2" otherwise).
Our fifth and final variable is "Other Income Generating Activities" (CL1X in our database). Like the previous variable, this indicator encompasses all other activities that children could be involved in that generate an income. It is also a dichotomous variable (with a value of "1" if children are involved in any such activities and "2" otherwise).
Now that we've identified all our variables, we can build a DataFrame using only these variables:
# .copy() gives us an independent DataFrame, so the in-place edits below do not trigger pandas' SettingWithCopyWarning
Table_1 = df[['CB3', 'CB7', 'CL3', 'CL13', 'CL1X', 'CL1A']].copy()
Now let's run some basic statistical analyses on our newly built DataFrame using the .describe() function and save the output table as an image file on our local drive:
edu_tab1 = Table_1.describe()
dfi.export(edu_tab1, 'edu_tab1.png')
You will notice that the count results are different between our variables. This is because our database has blank or non-specified data points (aka not a number or NaN). Before we proceed further, we will need to clean up our DataFrame by excluding these data points before we can specify our model and run a regression analysis:
Table_1.dropna(inplace = True)
Let's also rename our variables properly so we can better navigate our database. This will also help us streamline the process of specifying the model later on:
Table_1.rename(columns = {'CB3' : 'Age', 'CB7' : 'School Attendance', 'CL3' : 'Child Labour', 'CL13' : 'Chores', 'CL1A' : 'Farm Work', 'CL1X' : 'Income Generating Activities'}, inplace = True)
Let's re-run some basic statistical analyses now that our DataFrame has been cleaned up:
edu_tab2 = Table_1.describe()
dfi.export(edu_tab2, 'edu_tab2.png')
Now that all non-specified data points have been removed, the count results are consistent across all our variables. The downside is that our sample size is reduced considerably, by almost 36% for some of our variables. However, as previously mentioned, the data clean-up step is necessary in order to specify our model and run a clean regression analysis. Besides, despite the clean-up, our sample size remains large enough.
Additionally, we're noticing that cleaning up the data has resulted in some changes in our distribution. The average age, for instance, increased to 12 years from 10.4 previously, while the average hours worked decreased from 17 hours per week to 15.6 and the average time spent on chores increased to 11 hours per week from 9.3. It should be noted that it is possible these changes could have non-negligible impacts on the results of our analysis and the interpretation of our model. However, as previously mentioned, cleaning up the data is not optional, and our still large sample size should help keep our results relevant.
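To make the effect of the clean-up easier to see, the comparison above can be computed directly rather than read off two tables. The following is only a minimal sketch: the small `toy` DataFrame and its values are made up for illustration and are not MICS data, although the column names mirror the ones used in this article.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for Table_1 (NOT actual MICS values)
toy = pd.DataFrame({
    'CB3': [5, 6, 9, 12, 15, 17],              # age
    'CB7': [1, 1, 1, 2, np.nan, 2],            # school attendance
    'CL3': [np.nan, np.nan, 4, 10, 20, 30],    # hours worked
})

missing_share = toy.isna().mean()            # share of NaN values in each column
complete = toy.dropna()                      # keep only rows with no missing value
rows_lost = 1 - len(complete) / len(toy)     # share of rows removed by dropna()
mean_shift = complete.mean() - toy.mean()    # change in each column's mean

print(missing_share)
print(f'Rows lost: {rows_lost:.0%}')
print(mean_shift)
```

Running the same four lines on the real Table_1 (before calling dropna in place) would show which variables drive the loss of observations and how much each average moves as a result.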
Because we'll be working in terms of odds (more on that shortly), we will be recoding our dichotomous variables to satisfy our model specifications and make it easier to interpret. Specifically, we will be assigning values of "0" and "1" in lieu of the current values of "1" and "2". For our dependent variable, "0" will mean that children are currently not attending school (or pre-primary school) while "1" will mean that they are. For our Farm Work and Income Generating Activities variables, "0" will mean that children are not involved in these activities while "1" means that they are. To recode these variables, we will be using a for loop:
for i in ['School Attendance', 'Farm Work', 'Income Generating Activities']:
    # A value of 2 ("no") becomes 0; a value of 1 ("yes") is left unchanged
    Table_1[i] = Table_1[i].replace(2, 0)
NB: As I dug through the data, I noticed that some of our variables had unexpected values. For example, Farm Work and Income Generating Activities sometimes had 3 possible values ("1", "2" and "9") instead of the expected 2. Since these unexpected values were only present on a couple of lines of data and I was not able to find what they correspond to, I have decided to simply ignore them and delete these lines from the DataFrame:
Table_1 = Table_1[Table_1['Farm Work'] != 9]
Table_1 = Table_1[Table_1['Income Generating Activities'] != 9]
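As an alternative, the recoding and the removal of unexpected codes can be done in a single pass by mapping only the values we recognise and treating everything else as missing. This is a sketch on made-up data, not the approach used for the results in this article; the `toy` DataFrame and the `yes_no` dictionary are hypothetical names introduced for illustration.

```python
import pandas as pd

# Hypothetical toy data (NOT actual MICS values); pyreadstat typically returns numeric codes as floats
toy = pd.DataFrame({
    'School Attendance': [1.0, 2.0, 1.0, 2.0],
    'Farm Work': [1.0, 2.0, 9.0, 2.0],
})

yes_no = {1.0: 1, 2.0: 0}   # 1 = yes stays 1, 2 = no becomes 0

recoded = toy.copy()
for col in ['School Attendance', 'Farm Work']:
    # Any code not listed in yes_no (such as 9) is turned into NaN by .map()
    recoded[col] = recoded[col].map(yes_no)

# Drop the rows that held an unexpected code, then store the dummies as integers
recoded = recoded.dropna().astype(int)
print(recoded)
```

The advantage of this design is that any other unexpected code would be caught as well, instead of only the value "9".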
Now, let's plot some distribution graphs. Here, we'll be charting multiple graphs into a single image. For other charting methods, you can refer to our article about charting World Bank data in Python. As usual, we'll be saving the output image to our local drive:
plt.figure(figsize = [24, 24])
plt.subplot(221)
plt.hist(Table_1['Age'], bins = 13, rwidth = 0.8, color = 'b')
plt.xlabel('Age')
plt.title('Distribution by Age')
plt.subplot(222)
plt.hist(Table_1['Child Labour'], rwidth = 0.8, color = 'r')
plt.xlabel('Child Labour: Hours Worked')
plt.title('Distribution by Hours Worked')
plt.subplot(223)
plt.hist(Table_1['Chores'], rwidth = 0.8, color = 'y')
plt.xlabel('Time spent on Chores')
plt.title('Distribution by Time spent on Chores')
plt.subplot(224)
plt.hist(Table_1['School Attendance'], bins = 3, rwidth = 0.7, color = 'g')
plt.xlabel('School Attendance')
plt.title('Distribution by School Attendance')
x_ticks = [0, 1]
x_labels = ['No', 'Yes']
plt.xticks(ticks = x_ticks, labels = x_labels)
plt.savefig('edu.png', dpi = 300)
These are our distribution graphs:
From these graphs, we can infer that the vast majority of children in our sample are attending school (or pre-primary school), although a substantial proportion are not enrolled in the education system. Additionally, it appears that most children in our sample work fewer than 20 hours per week, and spend fewer than 20 hours per week on domestic chores. Our sample seems to be largely concentrated between the ages of 10 and 17.
It's now time to specify our model. For the purpose of this article, we will be using a logistic regression model (aka logit regression). Logistic regression is a type of regression analysis used when the dependent variable is categorical, most commonly binary. In the context of our data, our dependent variable, School Attendance, fits this description as children can either attend school (with a value of [1]), or not attend school (with a value of [0]). Rather than modelling the outcome directly, logistic regression models the log-odds of the outcome as a linear function of the explanatory variables, and then uses Maximum Likelihood Estimation (MLE) to estimate the parameters of the model. This is in contrast to linear regression analysis, where the Ordinary Least Squares (OLS) method is usually the preferred approach.
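For readers who have not met odds and log-odds before, here is a minimal sketch of the transformation a logit model relies on. The function names are my own and are introduced only for illustration; they are not part of statsmodels.

```python
import math

def log_odds(p):
    """Convert a probability (strictly between 0 and 1) into log-odds."""
    return math.log(p / (1 - p))

def inverse_logit(z):
    """Convert log-odds back into a probability."""
    return 1 / (1 + math.exp(-z))

# A probability of 0.5 means even odds, hence log-odds of 0
print(log_odds(0.5))

# Going to log-odds and back recovers the original probability
print(inverse_logit(log_odds(0.8)))
```

Probabilities are confined between 0 and 1 while log-odds can take any real value, which is what allows them to be modelled as a linear combination of the explanatory variables.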
The first step is to define our dependent and independent variables in the program. Here, our independent variables will be grouped in a matrix called 'x', while 'y' will be our dependent variable:
x = Table_1[['Age', 'Child Labour', 'Chores', 'Farm Work', 'Income Generating Activities']]
y = Table_1[['School Attendance']]
We will also need to create a constant for our model, and add it to our 'x' matrix:
x = sm.add_constant(x)
Finally, we can specify and fit our model which, as previously mentioned, is a logit model:
model = sm.Logit(y, x).fit()
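To give an idea of what maximum likelihood estimation involves, the sketch below fits a logit model by hand with Newton's method, using NumPy only. It is an illustration under stated assumptions and not a description of statsmodels' internals: the data is synthetic, the "true" coefficients are arbitrary, and `fit_logit_newton` is a hypothetical helper written for this example.

```python
import numpy as np

def fit_logit_newton(X, y, n_iter=25, tol=1e-10):
    """Estimate logit coefficients by maximising the log-likelihood with Newton's method."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))     # predicted probabilities
        gradient = X.T @ (y - p)                # first derivative of the log-likelihood
        hessian = -(X.T * (p * (1 - p))) @ X    # second derivative of the log-likelihood
        step = np.linalg.solve(hessian, gradient)
        beta = beta - step                      # Newton update
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Synthetic data (hypothetical, NOT the MICS sample)
rng = np.random.default_rng(0)
n = 5000
age = rng.integers(5, 18, size=n).astype(float)

# The column of ones plays the same role as sm.add_constant(x)
X = np.column_stack([np.ones(n), age])

true_beta = np.array([4.0, -0.3])               # arbitrary values chosen for the simulation
p_true = 1.0 / (1.0 + np.exp(-X @ true_beta))
y = (rng.random(n) < p_true).astype(float)      # simulated 0/1 attendance outcome

beta_hat = fit_logit_newton(X, y)
print(beta_hat)   # estimates should land close to true_beta
```

In practice there is no reason to write this by hand, since statsmodels also reports standard errors, p-values and fit statistics. The point is only that the coefficients are the values that make the observed 0/1 outcomes most likely.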
Let's check our model results, and save the output file to our local drive:
model_results = model.summary()
plt.rc('figure', figsize=(12, 7))
plt.text(0.01, 0.05, str(model_results), {'fontsize': 10}, fontproperties = 'monospace')
plt.axis('off')
plt.tight_layout()
plt.savefig('model_results.png')
This is the output:
We can learn a few things from this summary table. The first thing we notice is the negative sign in front of all the coefficients (aside from the constant). This is consistent with our assumptions, meaning that an increase in children's age, in their time spent working or doing chores and in their involvement in farming and income generating activities is associated with a decrease in their odds of attending school.
Let's take our Child Labour variable. Its coefficient is -0.0188, and since exp(-0.0188) is approximately 0.981, each additional hour of child labour per week is associated with a decrease in the odds of attending school of about 2%, assuming all other variables remain constant. Comparatively, time spent on chores seems to have a relatively small effect, as a 1 hour per week increase is associated with only a 1% decrease in the odds of attending school. By contrast, each additional year of age between 5 and 17 is associated with a 32% decrease in the odds of attending school, all else equal.
Our additional indicators for child labour point to similar outcomes. The odds of attending school are 31% lower for children who participate in farm work, and 38% lower for those who are actively engaged in other income generating activities.
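The percentages quoted above come from exponentiating the logit coefficients. Here is a minimal sketch of that conversion, using the Child Labour coefficient reported in the summary table; `pct_change_in_odds` is a hypothetical helper name, and the 10-hour scenario is added purely as an illustration.

```python
import math

def pct_change_in_odds(coef, delta=1.0):
    """Percentage change in the odds for a change of `delta` units in a variable."""
    return (math.exp(coef * delta) - 1) * 100

# Child Labour coefficient from the summary table
one_hour = pct_change_in_odds(-0.0188)             # roughly -1.9%
ten_hours = pct_change_in_odds(-0.0188, delta=10)  # roughly -17%

print(f'{one_hour:.1f}% per additional hour worked')
print(f'{ten_hours:.1f}% for ten additional hours worked')
```

Note that the effect compounds multiplicatively: ten additional hours reduce the odds by about 17%, not by 10 times 1.9%. With the fitted model from this article, np.exp(model.params) should return the odds ratios for all variables at once.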
The second takeaway we can learn from this summary table is that the P-values for all our explanatory variables are < 0.01, indicating that all our coefficients are statistically significant at the 1% level.
Finally, it should be noted that it's always difficult to assess the fit of a logistic regression model. Unlike Ordinary Least Squares models, where R-squared (and adjusted R-squared) can be used to evaluate the fit of a model, Maximum Likelihood Estimation models do not have an R-squared. Instead, a "Pseudo R-squared" can be calculated, although caution is always in order when it comes to interpreting its value in absolute terms. That being said, Pseudo R-squared and Log-likelihood can be used for comparative purposes between models.
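As far as I know, the Pseudo R-squared reported in the statsmodels summary is McFadden's version, which compares the log-likelihood of the fitted model with that of a model containing only the constant. The sketch below shows the calculation with made-up log-likelihood values; they are not the figures from our model, and `mcfadden_r2` is a hypothetical helper name.

```python
def mcfadden_r2(ll_model, ll_null):
    """McFadden's pseudo R-squared: 1 - (log-likelihood of model / log-likelihood of null model)."""
    return 1 - ll_model / ll_null

# Hypothetical log-likelihoods, for illustration only
r2_small = mcfadden_r2(ll_model=-900.0, ll_null=-1000.0)   # model with fewer predictors
r2_large = mcfadden_r2(ll_model=-800.0, ll_null=-1000.0)   # model with more predictors

print(r2_small, r2_large)
```

With a fitted statsmodels model, the corresponding quantities should be available as model.llf, model.llnull and model.prsquared, which makes it straightforward to compare alternative specifications estimated on the same sample.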
All in all, our regression analysis on education indicators in Madagascar points to child labour and age as being significant factors that affect the odds of children attending school negatively. As we've previously pointed out, this may be the result of multiple considerations, including:
The scarcity of lower and higher secondary schools relative to primary schools.
The substantial costs involved in attending secondary school, which would usually require traveling longer distances in rural areas or renting a place closer to school.
As children grow up, they become more likely to work or contribute meaningfully to help their family make ends meet, often to the detriment of their education.
Reintegrating children who have dropped out of the education system for any reason becomes increasingly difficult as they grow older.