This post is a great way to dive into data analysis and get familiar with Python tools such as pandas, seaborn, matplotlib, and NumPy.

This is also an opportunity to work on multilabel data (several target variables that are not mutually exclusive).

Setting up the workspace

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns
import matplotlib as mpl
import math
from scipy.stats import pearsonr, spearmanr, chi2_contingency, ttest_ind, mannwhitneyu, norm, normaltest, shapiro, anderson
import operator
from IPython.display import HTML, display
sns.set(rc={'figure.figsize':(10,5)})
sns.set_style('whitegrid')
%matplotlib inline

Dataset preparation

The Drug consumption (quantified) Data Set is a multilabel dataset from the UCI Machine Learning Repository.

The dataset contains records for 1885 respondents.

To summarize, it contains:

  • an ID column
  • 5 demographic columns (features)
  • 7 personality traits (features)
  • 18 drugs with their usage frequency (target)
  • a fake drug called Semeron to verify the reliability of answers

Each drug variable can take 7 different values (a small lookup dictionary for these labels is sketched right after the list):

  • CL0 Never Used
  • CL1 Used over a Decade Ago
  • CL2 Used in the Last Decade
  • CL3 Used in the Last Year
  • CL4 Used in the Last Month
  • CL5 Used in the Last Week
  • CL6 Used in the Last Day
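
For later convenience (readable labels on plots and tables, for example), we can keep this mapping in a small dictionary. This lookup table is our own addition, not part of the dataset files:

# Optional helper: map the CL codes to human-readable labels
drug_use_labels = {
    'CL0': 'Never Used',
    'CL1': 'Used over a Decade Ago',
    'CL2': 'Used in the Last Decade',
    'CL3': 'Used in the Last Year',
    'CL4': 'Used in the Last Month',
    'CL5': 'Used in the Last Week',
    'CL6': 'Used in the Last Day',
}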

All qualitative variables were transformed to real values (except for the target variables).

Tip: Read the dataset description for more details!

The following lists will be useful for loading and indexing the data.

demographic_columns = [
    'Age', 
    'Gender', 
    'Education', 
    'Country',
    'Ethnicity',
]

personality_columns = [
    'Neuroticism',
    'Extraversion',
    'Openness to experience',
    'Agreeableness',
    'Conscientiousness',
    'Impulsiveness',
    'Sensation seeking'
]

feature_columns = demographic_columns + personality_columns

drugs_columns = [
    'Alcohol consumption',
    'Amphetamines consumption',
    'Amyl nitrite consumption',
    'Benzodiazepine consumption',
    'Caffeine consumption',
    'Cannabis consumption',
    'Chocolate consumption',
    'Cocaine consumption',
    'Crack consumption',
    'Ecstasy consumption',
    'Heroin consumption',
    'Ketamine consumption',
    'Legal highs consumption',
    'Lysergic acid diethylamide consumption',
    'Methadone consumption',
    'Magic mushrooms consumption',
    'Nicotine consumption',
    'Fictitious drug Semeron consumption',
    'Volatile substance abuse consumption'
]

drugs_legal = ['Alcohol consumption', 'Caffeine consumption', 'Chocolate consumption', 'Nicotine consumption']

drugs_illegal = [drug for drug in drugs_columns if drug not in drugs_legal]

all_columns = feature_columns + drugs_columns

Loading the dataset

dataset = pd.read_csv("drug_consumption.data", names=["ID"] + all_columns)
dataset = dataset.set_index("ID")
dataset.head()
Age Gender Education Country Ethnicity Neuroticism Extraversion Openness to experience Agreeableness Conscientiousness ... Ecstasy consumption Heroin consumption Ketamine consumption Legal highs consumption Lysergic acid diethylamide consumption Methadone consumption Magic mushrooms consumption Nicotine consumption Fictitious drug Semeron consumption Volatile substance abuse consumption
ID
1 0.49788 0.48246 -0.05921 0.96082 0.12600 0.31287 -0.57545 -0.58331 -0.91699 -0.00665 ... CL0 CL0 CL0 CL0 CL0 CL0 CL0 CL2 CL0 CL0
2 -0.07854 -0.48246 1.98437 0.96082 -0.31685 -0.67825 1.93886 1.43533 0.76096 -0.14277 ... CL4 CL0 CL2 CL0 CL2 CL3 CL0 CL4 CL0 CL0
3 0.49788 -0.48246 -0.05921 0.96082 -0.31685 -0.46725 0.80523 -0.84732 -1.62090 -1.01450 ... CL0 CL0 CL0 CL0 CL0 CL0 CL1 CL0 CL0 CL0
4 -0.95197 0.48246 1.16365 0.96082 -0.31685 -0.14882 -0.80615 -0.01928 0.59042 0.58489 ... CL0 CL0 CL2 CL0 CL0 CL0 CL0 CL2 CL0 CL0
5 0.49788 0.48246 1.98437 0.96082 -0.31685 0.73545 -1.63340 -0.45174 -0.30172 1.30612 ... CL1 CL0 CL0 CL1 CL0 CL0 CL2 CL2 CL0 CL0

5 rows × 31 columns

We successfully loaded the dataset and set the ID column as the row index.

dataset.dtypes
Age                                       float64
Gender                                    float64
Education                                 float64
Country                                   float64
Ethnicity                                 float64
Neuroticism                               float64
Extraversion                              float64
Openness to experience                    float64
Agreeableness                             float64
Conscientiousness                         float64
Impulsiveness                             float64
Sensation seeking                         float64
Alcohol consumption                        object
Amphetamines consumption                   object
Amyl nitrite consumption                   object
Benzodiazepine consumption                 object
Caffeine consumption                       object
Cannabis consumption                       object
Chocolate consumption                      object
Cocaine consumption                        object
Crack consumption                          object
Ecstasy consumption                        object
Heroin consumption                         object
Ketamine consumption                       object
Legal highs consumption                    object
Lysergic acid diethylamide consumption     object
Methadone consumption                      object
Magic mushrooms consumption                object
Nicotine consumption                       object
Fictitious drug Semeron consumption        object
Volatile substance abuse consumption       object
dtype: object

All features are floats, while the drug consumption targets are objects (strings).

dataset.shape
(1885, 31)

1885 observations and 31 variables.

dataset.isnull().values.any()
False

There are no missing values.

Converting drug data from nominal to ordinal

for i in drugs_columns:
    dataset[i] = dataset[i].map({'CL0': 0, 'CL1': 1, 'CL2': 2, 'CL3': 3, 'CL4': 4, 'CL5': 5, 'CL6': 6})

We map each CLX code to the integer X. Now all variables are numeric.
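
A quick sanity check confirms the conversion:

# All drug columns should now have an integer dtype
# (an unmapped code would have produced NaN and a float column)
dataset[drugs_columns].dtypes.unique()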

Removing Semeron users

Let's see who claimed to have taken Semeron.

semerons = dataset[dataset['Fictitious drug Semeron consumption'] != 0]
semerons
Age Gender Education Country Ethnicity Neuroticism Extraversion Openness to experience Agreeableness Conscientiousness ... Ecstasy consumption Heroin consumption Ketamine consumption Legal highs consumption Lysergic acid diethylamide consumption Methadone consumption Magic mushrooms consumption Nicotine consumption Fictitious drug Semeron consumption Volatile substance abuse consumption
ID
730 -0.07854 0.48246 -1.73790 -0.09765 -0.31685 -0.58016 0.32197 0.14143 -0.60633 0.12331 ... 2 2 2 0 4 2 6 6 2 2
821 -0.95197 -0.48246 -0.61113 -0.09765 -0.50212 -0.67825 1.74091 0.72330 0.13136 0.41594 ... 3 0 0 0 5 0 5 4 3 0
1520 -0.95197 -0.48246 -0.61113 -0.57009 -0.31685 -0.24649 -0.80615 -1.27553 -1.34289 -1.92173 ... 1 2 1 2 1 2 4 2 3 1
1537 -0.95197 0.48246 -0.61113 -0.57009 0.11440 -0.46725 0.80523 0.29338 2.03972 1.81175 ... 4 0 4 3 2 0 3 4 4 3
1702 0.49788 0.48246 0.45468 -0.57009 -0.31685 1.98437 -0.80615 2.15324 0.76096 -0.00665 ... 2 0 2 2 2 0 2 6 2 0
1773 -0.95197 -0.48246 -1.22751 -0.57009 -0.22166 -0.34799 1.28610 1.06238 -0.01729 -0.52745 ... 3 0 4 3 6 3 3 3 1 3
1810 -0.95197 0.48246 -1.43719 -0.57009 -0.31685 1.23461 1.11406 1.06238 -1.47955 0.12331 ... 4 2 1 4 1 0 1 6 1 2
1827 -0.95197 0.48246 0.45468 -0.57009 -0.31685 0.22393 -0.30033 0.88309 1.28610 -0.00665 ... 0 0 0 2 3 0 3 5 2 0

8 rows × 31 columns

There are 8 people who claimed to have taken Semeron, which is a fake drug, so we should remove them.

dataset = dataset[dataset['Fictitious drug Semeron consumption'] == 0]
# Removing Semeron from the drug lists so it is ignored in later analyses
drugs_columns.remove('Fictitious drug Semeron consumption')
drugs_illegal.remove('Fictitious drug Semeron consumption')
# Note: the column itself stays in the dataframe (it now only contains zeros)
dataset.shape
(1877, 31)

We have 1877 respondents left.

Tip: If you ever acquire data from individuals, try a similar approach to get rid of untrustworthy observations.
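
As a generic pattern, this kind of attention-check filter can be wrapped in a small helper. The function below is a sketch of ours, not part of the original analysis:

def drop_failed_attention_checks(df, check_col, expected_value=0):
    '''
    Keep only respondents whose answer to an attention-check column matches the expected value.
    '''
    return df[df[check_col] == expected_value]

# Equivalent to the Semeron filtering done above:
# dataset = drop_failed_attention_checks(dataset, 'Fictitious drug Semeron consumption')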

Exploratory data analysis

dataset.describe()
Age Gender Education Country Ethnicity Neuroticism Extraversion Openness to experience Agreeableness Conscientiousness ... Ecstasy consumption Heroin consumption Ketamine consumption Legal highs consumption Lysergic acid diethylamide consumption Methadone consumption Magic mushrooms consumption Nicotine consumption Fictitious drug Semeron consumption Volatile substance abuse consumption
count 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000 ... 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000 1877.0 1877.000000
mean 0.037577 -0.000771 -0.000984 0.358984 -0.309728 -0.000551 -0.001951 -0.003224 -0.000657 -0.000394 ... 1.309536 0.372403 0.564198 1.353223 1.053277 0.826319 1.177944 3.194992 0.0 0.429409
std 0.878387 0.482588 0.949831 0.699707 0.166220 0.998442 0.997418 0.995691 0.996689 0.997657 ... 1.647373 1.034761 1.216341 1.790881 1.484582 1.648379 1.459212 2.415814 0.0 0.959160
min -0.951970 -0.482460 -2.435910 -0.570090 -1.107020 -3.464360 -3.273930 -3.273930 -3.464360 -3.464360 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.000000
25% -0.951970 -0.482460 -0.611130 -0.570090 -0.316850 -0.678250 -0.695090 -0.717270 -0.606330 -0.652530 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.0 0.000000
50% -0.078540 -0.482460 -0.059210 0.960820 -0.316850 0.042570 0.003320 -0.019280 -0.017290 -0.006650 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 3.000000 0.0 0.000000
75% 0.497880 0.482460 0.454680 0.960820 -0.316850 0.629670 0.637790 0.723300 0.760960 0.584890 ... 3.000000 0.000000 0.000000 3.000000 2.000000 0.000000 2.000000 6.000000 0.0 0.000000
max 2.591710 0.482460 1.984370 0.960820 1.907250 3.273930 3.273930 2.901610 3.464360 3.464360 ... 6.000000 6.000000 6.000000 6.000000 6.000000 6.000000 6.000000 6.000000 0.0 6.000000

8 rows × 31 columns

The table above is quite dense and hard to read. Let's create a plot to make it easier to interpret.

fig, ax = plt.subplots(figsize=(11, 15))
ax = sns.boxplot(data=dataset[feature_columns], orient="h", palette="Set2", ax=ax)
ax.set_ylabel("Features")
ax.set_title("Box plot of Pre-Processed Data Set")
sns.reset_orig()

It seems the dataset creators centered the features around 0 and scaled them to a standard deviation of roughly 1, so we can probably keep the current encoding to build a model.
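
We can double-check this impression numerically:

# Means should be close to 0 and standard deviations close to 1
dataset[feature_columns].describe().loc[['mean', 'std']].round(3)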

Demographic data analysis

Rearranging data

For the data analysis, we will transform the quantified categorical data back into a clearer, nominal form.

Note: Our mapping is based on the dataset description
demo_data = dataset.copy()
age = ['18-24' if a <= -0.9 else 
       '25-34' if a >= -0.5 and a < 0 else 
       '35-44' if a > 0 and a < 1 else 
       '45-54' if a > 1 and a < 1.5 else 
       '55-64' if a > 1.5 and a < 2 else 
       '65+' 
       for a in demo_data['Age']]

gender = ['Female' if g > 0 else "Male" for g in demo_data['Gender']]

education = ['Left school before 16 years' if e <-2 else 
             'Left school at 16 years' if e > -2 and e < -1.5 else 
             'Left school at 17 years' if e > -1.5 and e < -1.4 else 
             'Left school at 18 years' if e > -1.4 and e < -1 else 
             'Some college or university, no certificate or degree' if e > -1 and e < -0.5 else 
             'Professional certificate/ diploma' if e > -0.5 and e < 0 else 
             'University degree' if e > 0 and e < 0.5 else 
             'Masters degree' if e > 0.5 and e < 1.5 else 
             'Doctorate degree' 
             for e in demo_data['Education']]

country = ['USA' if c < -0.5 else 
           'New Zealand' if c > -0.5 and c < -0.4 else 
           'Other' if c > -0.4 and c < -0.2 else 
           'Australia' if c > -0.2 and c < 0 else 
           'Ireland' if c > 0 and c < 0.23 else 
           'Canada' if c > 0.23 and c < 0.9 else 
           'UK' 
           for c in demo_data['Country']]

ethnicity = ['Black' if e < -1 else 
             'Asian' if e > -1 and e < -0.4 else 
             'White' if e > -0.4 and e < -0.25 else 
             'Mixed-White/Black' if e >= -0.25 and e < 0.11 else 
             'Mixed-White/Asian' if e > 0.12 and e < 1 else 
             'Mixed-Black/Asian' if e > 1.9 else 
             'Other' 
             for e in demo_data['Ethnicity']]


demo_data['Age'] = age
demo_data['Gender'] = gender
demo_data['Education'] = education
demo_data['Country'] = country
demo_data['Ethnicity'] = ethnicity
demo_data[demographic_columns].head()
Age Gender Education Country Ethnicity
ID
1 35-44 Female Professional certificate/ diploma UK Mixed-White/Asian
2 25-34 Male Doctorate degree UK White
3 35-44 Male Professional certificate/ diploma UK White
4 18-24 Female Masters degree UK White
5 35-44 Female Doctorate degree UK White

This looks better :grin:
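
As a side note, the same recoding can be written more compactly with pd.cut, reusing the thresholds from the conditionals above as bin edges. Here is a sketch for the Age column only:

# Bin edges mirror the thresholds used in the list comprehension above
age_bins = [-np.inf, -0.9, 0, 1, 1.5, 2, np.inf]
age_labels = ['18-24', '25-34', '35-44', '45-54', '55-64', '65+']
age_cut = pd.cut(dataset['Age'], bins=age_bins, labels=age_labels).astype(str)
# For the values present in this dataset, age_cut matches the age list built above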

Proportion analysis

Let's start off by taking a look at the demographic data balance.

But first, we will define a few useful functions that will be reused throughout this analysis.

def plot_density(dataset, col, ax, plot_gaussian = True, color="Blue"):
    '''
    Extension of the seaborn histogram that plots, for a given column, an estimated normal distribution (if requested) on top of the fitted data distribution.
    '''
    
    if plot_gaussian:
        std = dataset[col].std()
        mean = dataset[col].mean()
    
    #plot histogram using seaborn
    ax = sns.histplot(dataset[col], ax=ax, color=color, kde=True, stat="probability", kde_kws={"bw_adjust":3})
    
    if plot_gaussian:
        # Limiting our gaussian to the limits from the above plot
        xmin, xmax = ax.get_xlim()
        xnorm = np.arange(xmin, xmax, 0.001)
        ynorm = norm.pdf(xnorm, mean, std)
        ax.plot(xnorm, ynorm, 'r', lw=2)
        ax.legend(["data distribution", "estimated normal distribution"], loc="upper center", bbox_to_anchor=(0.5, -0.05), fancybox=True, shadow=True, ncol=2)
    
    ax.set_title(col)
    ax.set_xlabel("")
def plot_pie(dataset, col, ax):  
    '''
    Pandas' pie plot wrapper
    '''
    ax = dataset[col].value_counts().plot(kind='pie', ax=ax)
    ax.set_title(col)
    ax.set_ylabel("")
def plot_count(dataset, col, ax, order = None, show_percent=True, rotate_label = True, add_args={"palette": "Set2"}):
    '''
    Extends the seaborn countplot to display counts and frequencies in a readable way.
    '''
    
    ncount = len(dataset)
    
    if order is None:
        order = np.sort(dataset[col].unique())
    
    # plot seaborn countplot
    ax = sns.countplot(data=dataset, x=col, ax=ax, order=order, **add_args)
    
    # Get y limit (number of elements)
    _ ,max_nb = ax.get_ylim()
    # Transform this limit into a frequency in [0, 100]
    freq_lim = (max_nb * 100/ ncount)
    
    # Duplicate the ax
    ax2 = ax.twinx()
    
    #Move duplicate y axis ticks to the left
    ax2.yaxis.tick_left()
    
    #Move original y axis ticks to the right
    ax.yaxis.tick_right()
    
    # Swap the label positions to match the ticks
    ax.yaxis.set_label_position('right')
    ax2.yaxis.set_label_position('left')
    ax2.set_ylabel('Frequency [%]')

    # We want to write the frequency on top of each bar
    if show_percent:
        # for every bar
        for p in ax.patches:
            x=p.get_bbox().get_points()[:,0]
            y=p.get_bbox().get_points()[1,1]
            if not math.isnan(x.mean()) and not math.isnan(y):
                # Write frequency at an x and y coordinate
                ax.text(x.mean(), y, '{:.1f}%'.format(100.*y/ncount), 
                    ha='center', va='bottom')

    # Set y axis limit for counts and frequencies
    ax2.set_ylim(0,freq_lim)
    ax.set_ylim(0,max_nb)

    # set ticks for count
    ax.yaxis.set_major_locator(ticker.LinearLocator(11))
    ax.yaxis.set_tick_params(which="major")

    # set ticks for frequencies
    ax2.yaxis.set_major_locator(ticker.MultipleLocator(freq_lim/10))
    ax2.yaxis.set_tick_params(which="major")

    # remove grid for ax 2 (keep only ax)
    ax2.grid(False)
    ax.grid(False)
    ax.set_xlabel("")
    if rotate_label:
        # rotate tick labels on the x axis / / /
        _ = ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")
    ax.set_title(col)
def plot(kind, dataset, columns=None, fig_title="Count/Frequency plots", fontsizes = 20, **kwargs):
    '''
    Wrapper function that takes care of figure sizing and dispatches to the requested plotting procedure.
    '''
    
    # plot choices
    kind_dict = {
        'pie': plot_pie,
        'count': plot_count,
        'density': plot_density}
    
    if kind not in kind_dict:
        raise ValueError(f"{kind} is not a valid kind, has to be one of {kind_dict.keys()}")
    
    if columns is None:
        # use all dataset columns
        columns = list(dataset.columns)
    
    fig = None
    
    # Setting font sizes
    plt.rc('xtick', labelsize=fontsizes*1.5)
    plt.rc('ytick', labelsize=fontsizes*1.5)
    plt.rc('axes', labelsize=fontsizes*2)
    plt.rc('legend', fontsize=fontsizes*1.5, title_fontsize=0)
    plt.rc('axes', titlesize=2*fontsizes)
    plt.rc('font', size=1.7*fontsizes)
    
    # Scale of the figure in ax (to be used later)
    figsize_scale = fontsizes
    
    if not isinstance(columns, list):
        # columns has to be a list
        if isinstance(columns, str):
            columns = [columns]
        else:
            columns = list(columns)
    
    if len(columns) == 1: # Only one variable to plot
        ncols, nrows = 1, 1
        figsize_scale *= 2 # double figsize
    else:
        ncols, nrows = 2, math.ceil(len(columns) / 2)
     
    fig, axes = plt.subplots(figsize=(figsize_scale*ncols, figsize_scale*nrows), nrows=nrows, ncols=ncols)
    
    if ncols == 1 and nrows == 1:
        # We need a list of axes
        axes = np.array([axes])
    
    # Plot
    do_plots(dataset, columns, axes, kind_dict[kind], **kwargs)
    
    fig.suptitle(fig_title + "\n\n", fontsize=fontsizes*2.5)
    plt.tight_layout()
    #Reset plot setting to original
    sns.reset_orig()
    
def do_plots(dataset, columns, axes, plot_func, **kwargs):
    '''
    Calls the plotting function on every axis and removes unused axes.
    '''
    axes = axes.flat
    
    #plot every variable
    for index, col in enumerate(columns):
        plot_func(dataset, col, axes[index], **kwargs)

    # remove empty axes
    for empty in range(len(columns), len(axes)):
        axes[empty].axis("off")

Important: If you are new to Python and the various data analysis libraries used here, try to read through the above functions and understand what they do.
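
For instance, a minimal call of the wrapper on a single numeric column looks like this (purely illustrative):

# Illustrative example: density plot of a single personality trait
plot("density", dataset, ["Neuroticism"], fig_title="Density plot of Neuroticism")
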
Let's create pie plots and count plots.
plot("pie", demo_data, demographic_columns, fig_title="Plot pies of demographic data")
plot("count", demo_data, demographic_columns, fig_title="Count / Frequency plots of demographic data")