This post is a great way to dive into data analysis and get familiar with Python tools such as pandas, seaborn, matplotlib and numpy.

This is also an opportunity to work on multilabel data (several target variables that are not mutually exclusive).

Setting up workspace

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns
import matplotlib as mpl
import math
from scipy.stats import pearsonr, spearmanr, chi2_contingency, ttest_ind, mannwhitneyu, norm, normaltest, shapiro, anderson
import operator
from IPython.display import HTML, display
sns.set(rc={'figure.figsize':(10,5)})
sns.set_style('whitegrid')
%matplotlib inline

Dataset preparation

The Drug consumption (quantified) Data Set is a multilabel dataset from the UCI Machine Learning Repository.

The dataset contains records for 1885 respondents.

To summarize, it contains:

  • an ID column
  • 5 demographic columns (features)
  • 7 personality traits (features)
  • 18 drugs with their usage frequency (target)
  • a fake drug called Semeron, used to verify the reliability of the answers

Each drug variable can take 7 different values:

  • CL0 Never Used
  • CL1 Used over a Decade Ago
  • CL2 Used in the Last Decade
  • CL3 Used in the Last Year
  • CL4 Used in the Last Month
  • CL5 Used in the Last Week
  • CL6 Used in the Last Day

All qualitative variables were transformed to real values (except for the target variables).

Tip: Read the dataset description for more details!

The following lists will be useful for loading and indexing the data.

demographic_columns = [
    'Age', 
    'Gender', 
    'Education', 
    'Country',
    'Ethnicity',
]

personality_columns = [
    'Neuroticism',
    'Extraversion',
    'Openness to experience',
    'Agreeableness',
    'Conscientiousness',
    'Impulsiveness',
    'Sensation seeking'
]

feature_columns = demographic_columns + personality_columns

drugs_columns = [
    'Alcohol consumption',
    'Amphetamines consumption',
    'Amyl nitrite consumption',
    'Benzodiazepine consumption',
    'Caffeine consumption',
    'Cannabis consumption',
    'Chocolate consumption',
    'Cocaine consumption',
    'Crack consumption',
    'Ecstasy consumption',
    'Heroin consumption',
    'Ketamine consumption',
    'Legal highs consumption',
    'Lysergic acid diethylamide consumption',
    'Methadone consumption',
    'Magic mushrooms consumption',
    'Nicotine consumption',
    'Fictitious drug Semeron consumption',
    'Volatile substance abuse consumption'
]

drugs_legal = ['Alcohol consumption', 'Caffeine consumption', 'Chocolate consumption', 'Nicotine consumption']

drugs_illegal = [drug for drug in drugs_columns if drug not in drugs_legal]

all_columns = feature_columns + drugs_columns

Loading the dataset

dataset = pd.read_csv("drug_consumption.data", names=["ID"] + all_columns)
dataset = dataset.set_index("ID")
dataset.head()
Age Gender Education Country Ethnicity Neuroticism Extraversion Openness to experience Agreeableness Conscientiousness ... Ecstasy consumption Heroin consumption Ketamine consumption Legal highs consumption Lysergic acid diethylamide consumption Methadone consumption Magic mushrooms consumption Nicotine consumption Fictitious drug Semeron consumption Volatile substance abuse consumption
ID
1 0.49788 0.48246 -0.05921 0.96082 0.12600 0.31287 -0.57545 -0.58331 -0.91699 -0.00665 ... CL0 CL0 CL0 CL0 CL0 CL0 CL0 CL2 CL0 CL0
2 -0.07854 -0.48246 1.98437 0.96082 -0.31685 -0.67825 1.93886 1.43533 0.76096 -0.14277 ... CL4 CL0 CL2 CL0 CL2 CL3 CL0 CL4 CL0 CL0
3 0.49788 -0.48246 -0.05921 0.96082 -0.31685 -0.46725 0.80523 -0.84732 -1.62090 -1.01450 ... CL0 CL0 CL0 CL0 CL0 CL0 CL1 CL0 CL0 CL0
4 -0.95197 0.48246 1.16365 0.96082 -0.31685 -0.14882 -0.80615 -0.01928 0.59042 0.58489 ... CL0 CL0 CL2 CL0 CL0 CL0 CL0 CL2 CL0 CL0
5 0.49788 0.48246 1.98437 0.96082 -0.31685 0.73545 -1.63340 -0.45174 -0.30172 1.30612 ... CL1 CL0 CL0 CL1 CL0 CL0 CL2 CL2 CL0 CL0

5 rows × 31 columns

We successfully loaded the dataset and set the ID as row index.

dataset.dtypes
Age                                       float64
Gender                                    float64
Education                                 float64
Country                                   float64
Ethnicity                                 float64
Neuroticism                               float64
Extraversion                              float64
Openness to experience                    float64
Agreeableness                             float64
Conscientiousness                         float64
Impulsiveness                             float64
Sensation seeking                         float64
Alcohol consumption                        object
Amphetamines consumption                   object
Amyl nitrite consumption                   object
Benzodiazepine consumption                 object
Caffeine consumption                       object
Cannabis consumption                       object
Chocolate consumption                      object
Cocaine consumption                        object
Crack consumption                          object
Ecstasy consumption                        object
Heroin consumption                         object
Ketamine consumption                       object
Legal highs consumption                    object
Lysergic acid diethylamide consumption     object
Methadone consumption                      object
Magic mushrooms consumption                object
Nicotine consumption                       object
Fictitious drug Semeron consumption        object
Volatile substance abuse consumption       object
dtype: object

Features are all floats and targets are objects.

dataset.shape
(1885, 31)

1885 observations and 31 variables.

dataset.isnull().values.any()
False

There are no missing values.

Nominal drug to ordinal data

for i in drugs_columns:
    dataset[i] = dataset[i].map({'CL0': 0, 'CL1': 1, 'CL2': 2, 'CL3': 3, 'CL4': 4, 'CL5': 5, 'CL6': 6})

Mapping CLX to X. Now all variables are numeric.
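If you want to double-check that the mapping covered every category (an unmapped value would have become NaN after .map()), a quick sanity check like the sketch below can help.

# Optional sanity check: no NaN means every CLX category was mapped
assert not dataset[drugs_columns].isnull().values.any()
dataset[drugs_columns].dtypes.unique()  # expect a single integer dtype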

Removing Semeron users

Let's see who claimed to have taken Semeron.

semerons = dataset[dataset['Fictitious drug Semeron consumption'] != 0]
semerons
Age Gender Education Country Ethnicity Neuroticism Extraversion Openness to experience Agreeableness Conscientiousness ... Ecstasy consumption Heroin consumption Ketamine consumption Legal highs consumption Lysergic acid diethylamide consumption Methadone consumption Magic mushrooms consumption Nicotine consumption Fictitious drug Semeron consumption Volatile substance abuse consumption
ID
730 -0.07854 0.48246 -1.73790 -0.09765 -0.31685 -0.58016 0.32197 0.14143 -0.60633 0.12331 ... 2 2 2 0 4 2 6 6 2 2
821 -0.95197 -0.48246 -0.61113 -0.09765 -0.50212 -0.67825 1.74091 0.72330 0.13136 0.41594 ... 3 0 0 0 5 0 5 4 3 0
1520 -0.95197 -0.48246 -0.61113 -0.57009 -0.31685 -0.24649 -0.80615 -1.27553 -1.34289 -1.92173 ... 1 2 1 2 1 2 4 2 3 1
1537 -0.95197 0.48246 -0.61113 -0.57009 0.11440 -0.46725 0.80523 0.29338 2.03972 1.81175 ... 4 0 4 3 2 0 3 4 4 3
1702 0.49788 0.48246 0.45468 -0.57009 -0.31685 1.98437 -0.80615 2.15324 0.76096 -0.00665 ... 2 0 2 2 2 0 2 6 2 0
1773 -0.95197 -0.48246 -1.22751 -0.57009 -0.22166 -0.34799 1.28610 1.06238 -0.01729 -0.52745 ... 3 0 4 3 6 3 3 3 1 3
1810 -0.95197 0.48246 -1.43719 -0.57009 -0.31685 1.23461 1.11406 1.06238 -1.47955 0.12331 ... 4 2 1 4 1 0 1 6 1 2
1827 -0.95197 0.48246 0.45468 -0.57009 -0.31685 0.22393 -0.30033 0.88309 1.28610 -0.00665 ... 0 0 0 2 3 0 3 5 2 0

8 rows × 31 columns

There are 8 people who claimed to have taken Semeron, which is a fictitious drug, so we should remove them.

dataset = dataset[dataset['Fictitious drug Semeron consumption'] == 0]
# Removing it from the drug lists
drugs_columns.remove('Fictitious drug Semeron consumption')
drugs_illegal.remove('Fictitious drug Semeron consumption')
# The Semeron column itself stays in the dataset (it is all zeros now), but it is no longer
# part of the drug lists used in the rest of the analysis.
dataset.shape
(1877, 31)

We have 1877 respondents left.

Tip: If you ever acquire data from individuals, try a similar approach to get rid of untrustworthy observations.
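As a minimal sketch of that tip, here is a hypothetical helper (the function name and argument are mine, not part of the original code) that keeps only the respondents who report never using any of a list of fictitious drugs:

def drop_untrustworthy(df, fake_drug_columns):
    '''Keep only respondents who report never using any of the fictitious drugs.'''
    honest = (df[fake_drug_columns] == 0).all(axis=1)
    return df[honest]

# e.g. dataset = drop_untrustworthy(dataset, ['Fictitious drug Semeron consumption'])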

Exploratory data analysis

dataset.describe()
Age Gender Education Country Ethnicity Neuroticism Extraversion Openness to experience Agreeableness Conscientiousness ... Ecstasy consumption Heroin consumption Ketamine consumption Legal highs consumption Lysergic acid diethylamide consumption Methadone consumption Magic mushrooms consumption Nicotine consumption Fictitious drug Semeron consumption Volatile substance abuse consumption
count 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000 ... 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000 1877.0 1877.000000
mean 0.037577 -0.000771 -0.000984 0.358984 -0.309728 -0.000551 -0.001951 -0.003224 -0.000657 -0.000394 ... 1.309536 0.372403 0.564198 1.353223 1.053277 0.826319 1.177944 3.194992 0.0 0.429409
std 0.878387 0.482588 0.949831 0.699707 0.166220 0.998442 0.997418 0.995691 0.996689 0.997657 ... 1.647373 1.034761 1.216341 1.790881 1.484582 1.648379 1.459212 2.415814 0.0 0.959160
min -0.951970 -0.482460 -2.435910 -0.570090 -1.107020 -3.464360 -3.273930 -3.273930 -3.464360 -3.464360 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.000000
25% -0.951970 -0.482460 -0.611130 -0.570090 -0.316850 -0.678250 -0.695090 -0.717270 -0.606330 -0.652530 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.0 0.000000
50% -0.078540 -0.482460 -0.059210 0.960820 -0.316850 0.042570 0.003320 -0.019280 -0.017290 -0.006650 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 3.000000 0.0 0.000000
75% 0.497880 0.482460 0.454680 0.960820 -0.316850 0.629670 0.637790 0.723300 0.760960 0.584890 ... 3.000000 0.000000 0.000000 3.000000 2.000000 0.000000 2.000000 6.000000 0.0 0.000000
max 2.591710 0.482460 1.984370 0.960820 1.907250 3.273930 3.273930 2.901610 3.464360 3.464360 ... 6.000000 6.000000 6.000000 6.000000 6.000000 6.000000 6.000000 6.000000 0.0 6.000000

8 rows × 31 columns

The table above is quite hard to read. Let's create a plot to make it easier to digest.

fig, ax = plt.subplots(figsize=(11,15))
plt.ylabel("Features")
plt.title("Box plot of Pre-Processed Data Set")
ax = sns.boxplot(data = dataset[feature_columns], orient="h", palette="Set2")
sns.reset_orig()

It seems the dataset creators centered the features near 0 and scaled them to a standard deviation of about 1, so we can probably keep the current encoding to build a model.
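A quick way to check that claim is to look at the means and standard deviations directly (a small sketch using the columns defined earlier):

# Feature means should be close to 0 and standard deviations close to 1
dataset[feature_columns].agg(['mean', 'std']).round(3)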

Demographic data analysis

Rearranging data

For data analysis, we will transform the quantified categorical data back into a clearer nominal form.

Note: Our mapping is based on the dataset description
demo_data = dataset.copy()
age = ['18-24' if a <= -0.9 else 
       '25-34' if a >= -0.5 and a < 0 else 
       '35-44' if a > 0 and a < 1 else 
       '45-54' if a > 1 and a < 1.5 else 
       '55-64' if a > 1.5 and a < 2 else 
       '65+' 
       for a in demo_data['Age']]

gender = ['Female' if g > 0 else "Male" for g in demo_data['Gender']]

education = ['Left school before 16 years' if e <-2 else 
             'Left school at 16 years' if e > -2 and e < -1.5 else 
             'Left school at 17 years' if e > -1.5 and e < -1.4 else 
             'Left school at 18 years' if e > -1.4 and e < -1 else 
             'Some college or university, no certificate or degree' if e > -1 and e < -0.5 else 
             'Professional certificate/ diploma' if e > -0.5 and e < 0 else 
             'University degree' if e > 0 and e < 0.5 else 
             'Masters degree' if e > 0.5 and e < 1.5 else 
             'Doctorate degree' 
             for e in demo_data['Education']]

country = ['USA' if c < -0.5 else 
           'New Zealand' if c > -0.5 and c < -0.4 else 
           'Other' if c > -0.4 and c < -0.2 else 
           'Australia' if c > -0.2 and c < 0 else 
           'Ireland' if c > 0 and c < 0.23 else 
           'Canada' if c > 0.23 and c < 0.9 else 
           'UK' 
           for c in demo_data['Country']]

ethnicity = ['Black' if e < -1 else 
             'Asian' if e > -1 and e < -0.4 else 
             'White' if e > -0.4 and e < -0.25 else 
             'Mixed-White/Black' if e >= -0.25 and e < 0.11 else 
             'Mixed-White/Asian' if e > 0.12 and e < 1 else 
             'Mixed-Black/Asian' if e > 1.9 else 
             'Other' 
             for e in demo_data['Ethnicity']]


demo_data['Age'] = age
demo_data['Gender'] = gender
demo_data['Education'] = education
demo_data['Country'] = country
demo_data['Ethnicity'] = ethnicity
demo_data[demographic_columns].head()
Age Gender Education Country Ethnicity
ID
1 35-44 Female Professional certificate/ diploma UK Mixed-White/Asian
2 25-34 Male Doctorate degree UK White
3 35-44 Male Professional certificate/ diploma UK White
4 18-24 Female Masters degree UK White
5 35-44 Female Doctorate degree UK White

This looks better :grin:
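As an alternative sketch, the same recoding can be done with pd.cut. The bin edges below are my own approximations, chosen to separate the encoded values given in the dataset description; only Age is shown here.

age_bins = [-np.inf, -0.5, 0, 1, 1.5, 2, np.inf]   # assumed edges between the encoded Age values
age_labels = ['18-24', '25-34', '35-44', '45-54', '55-64', '65+']
demo_data_alt = dataset.copy()
demo_data_alt['Age'] = pd.cut(dataset['Age'], bins=age_bins, labels=age_labels)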

Proportion analysis

Let's start off by taking a look at the demographic data balance.

But first, we will define a few useful functions that will be reused throughout this analysis.

def plot_density(dataset, col, ax, plot_gaussian = True, color="Blue"):
    '''
    Extension of the seaborn histogram that plots, for a given column, an estimated normal distribution (if requested) on top of the fitted data distribution.
    '''
    ncount = len(dataset)
    
    if plot_gaussian:
        std = dataset[col].std()
        mean = dataset[col].mean()
    
    #plot histogram using seaborn
    ax = sns.histplot(dataset[col], ax=ax, color=color, kde=True, stat="probability", kde_kws={"bw_adjust":3})
    
    if plot_gaussian:
        # Limiting our gaussian to the limits from the above plot
        xmin, xmax = ax.get_xlim()
        xnorm = np.arange(xmin, xmax, 0.001)
        ynorm = norm.pdf(xnorm, mean, std)
        ax.plot(xnorm, ynorm, 'r', lw=2)
        ax.legend(["data distribution", "estimated normal distribution"], loc="upper center", bbox_to_anchor=(0.5, -0.05), fancybox=True, shadow=True, ncol=2)
    
    ax.set_title(col)
    ax.set_xlabel("")
def plot_pie(dataset, col, ax):  
    '''
    Pandas' pie plot wrapper
    '''
    ax = dataset[col].value_counts().plot(kind='pie', ax=ax)
    ax.set_title(col)
    ax.set_ylabel("")
def plot_count(dataset, col, ax, order = None, show_percent=True, rotate_label = True, add_args={"palette": "Set2"}):
    '''
    Extending the seaborn countplot to get frequencies and counts in a pretty way. 
    '''
    
    ncount = len(dataset)
    
    if order is None:
        order = np.sort(dataset[col].unique())
    
    # plot seaborn countplot
    ax = sns.countplot(data=dataset, x=col, ax=ax, order=order, **add_args)
    
    # Get y limit (number of elements)
    _ ,max_nb = ax.get_ylim()
    # Transform this limit into a frequency in [0, 100]
    freq_lim = (max_nb * 100/ ncount)
    
    # Duplicate the ax
    ax2 = ax.twinx()
    
    #Move duplicate y axis ticks to the left
    ax2.yaxis.tick_left()
    
    #Move original y axis ticks to the right
    ax.yaxis.tick_right()
    
    # Swap the label positions to match the ticks
    ax.yaxis.set_label_position('right')
    ax2.yaxis.set_label_position('left')
    ax2.set_ylabel('Frequency [%]')

    # We want to write the frequency on top of each bar
    if show_percent:
        # for every bar
        for p in ax.patches:
            x=p.get_bbox().get_points()[:,0]
            y=p.get_bbox().get_points()[1,1]
            if not math.isnan(x.mean()) and not math.isnan(y):
                # Write frequency at an x and y coordinate
                ax.text(x.mean(), y, '{:.1f}%'.format(100.*y/ncount), 
                    ha='center', va='bottom')

    # Set y axis limit for counts and frequencies
    ax2.set_ylim(0,freq_lim)
    ax.set_ylim(0,max_nb)

    # set ticks for count
    ax.yaxis.set_major_locator(ticker.LinearLocator(11))
    ax.yaxis.set_tick_params(which="major")

    # set ticks for frequencies
    ax2.yaxis.set_major_locator(ticker.MultipleLocator(freq_lim/10))
    ax2.yaxis.set_tick_params(which="major")

    # remove grid for ax 2 (keep only ax)
    ax2.grid(False)
    ax.grid(False)
    ax.set_xlabel("")
    if rotate_label:
        # rotate tick labels on the x axis / / /
        _ = ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")
    ax.set_title(col)
def plot(kind, dataset, columns=None, fig_title="Count/Frequency plots", fontsizes = 20, **kwargs):
    '''
    Wrapper function that takes care of plot wise sizes and calling the wanted procedure
    '''
    
    # plot choices
    kind_dict = {
        'pie': plot_pie,
        'count': plot_count,
        'density': plot_density}
    
    if kind not in kind_dict:
        raise ValueError(f"{kind} is not a valid kind, has to be one of {kind_dict.keys()}")
    
    if columns is None:
        # use all dataset columns
        columns = list(dataset.columns)
    
    fig = None
    
    # Setting font sizes
    plt.rc('xtick', labelsize=fontsizes*1.5)
    plt.rc('ytick', labelsize=fontsizes*1.5)
    plt.rc('axes', labelsize=fontsizes*2)
    plt.rc('legend', fontsize=fontsizes*1.5, title_fontsize=0)
    plt.rc('axes', titlesize=2*fontsizes)
    plt.rc('font', size=1.7*fontsizes)
    
    # Scale of the figure in ax (to be used later)
    figsize_scale = fontsizes
    
    if not isinstance(columns, list):
        # columns has to be a list
        if isinstance(columns, str):
            columns = [columns]
        else:
            columns = list(columns)
    
    if len(columns) == 1: # Only variable to plot
        ncols, nrows = 1, 1
        figsize_scale *= 2 # double figsize
    else:
        ncols, nrows = 2, math.ceil(len(columns) / 2)
     
    fig, axes = plt.subplots(figsize=(figsize_scale*ncols, figsize_scale*nrows), nrows=nrows, ncols=ncols)
    
    if ncols == 1 and nrows == 1:
        # We need a list of axes
        axes = np.array([axes])
    
    # Plot
    do_plots(dataset, columns, axes, kind_dict[kind], **kwargs)
    
    fig.suptitle(fig_title + "\n\n", fontsize=fontsizes*2.5)
    plt.tight_layout()
    #Reset plot setting to original
    sns.reset_orig()
    
def do_plots(dataset, columns, axes, plot_func, **kwargs):
    '''
    Calls the plotting function on every axis and removes unused axes.
    '''
    axes = axes.flat
    
    #plot every variable
    for index, col in enumerate(columns):
        plot_func(dataset, col, axes[index], **kwargs)

    # remove empty axes
    for empty in range(len(columns), len(axes)):
        axes[empty].axis("off")

Important: If you are new to Python and the various data analysis libraries, try to read through the above functions and understand what they do.
Let's create pie plots and count plots.
plot("pie", demo_data, demographic_columns, fig_title="Plot pies of demographic data")
plot("count", demo_data, demographic_columns, fig_title="Count / Frequency plots of demographic data")

We can see that some features are pretty balanced, but in other cases some classes are underrepresented. We will go over each of them in more detail.

def value_counts_percentage(dataset, column):
    ''' value.counts() method extended by displaying percentage '''
    
    a = dataset[column].value_counts()
    b = dataset[column].value_counts(normalize=True) * 100
    
    return pd.concat([a,b.round(2)], axis=1, keys=['N', '%'])

Note: The above function creates a dataframe with count and frequencies for a given column.

Age

value_counts_percentage(demo_data, 'Age')
N %
18-24 637 33.94
25-34 480 25.57
35-44 355 18.91
45-54 294 15.66
55-64 93 4.95
65+ 18 0.96

The age groups are intervals of around 10 years. As we can see, the older the age group, the less represented it is.

The 18-24 group holds around 1/3 of all respondents!

Unbalanced and not representative.

Gender

value_counts_percentage(demo_data, 'Gender')
N %
Male 940 50.08
Female 937 49.92

Gender is balanced and representative.

Education

value_counts_percentage(demo_data, 'Education')
N %
Some college or university, no certificate or degree 503 26.80
University degree 478 25.47
Masters degree 283 15.08
Professional certificate/ diploma 270 14.38
Left school at 18 years 99 5.27
Left school at 16 years 98 5.22
Doctorate degree 89 4.74
Left school at 17 years 29 1.55
Left school before 16 years 28 1.49

There is a predominance of educated people (around 85% are at college level or above), causing an imbalance.

I'm not sure these proportions are representative of the real-life distribution, but looking at educational attainment in the United States, they don't seem so far off for developed countries (which are heavily represented in this dataset).

Warning: That doesn’t mean our data is balanced, but by sampling people randomly we would get around the same distribution.

Tip: A representative dataset is useful to build a representative model that describes the data rather than predicting. On the other hand, if your objective is to build a predictive model, particularly one that performs well by measure of AUC or rank-order, and you plan to use a basic ML framework, then feeding it a balanced dataset is recommended. You can also use a representative dataset in the latter case and use custom sampling that will take care of balancing the data for training.
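As a minimal sketch of that last idea (the helper name and the plain-pandas oversampling are my own choices, not something from the original analysis), one way to balance a training set on a categorical column is to upsample every class to the size of the largest one:

def balance_by_oversampling(df, column, random_state=0):
    '''Upsample every class of `column` to the size of the largest class (sampling with replacement).'''
    largest = df[column].value_counts().max()
    return (df.groupby(column, group_keys=False)
              .apply(lambda g: g.sample(n=largest, replace=True, random_state=random_state)))

# e.g. balanced = balance_by_oversampling(demo_data, 'Education')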

Country

value_counts_percentage(demo_data, 'Country')
N %
UK 1044 55.62
USA 551 29.36
Other 118 6.29
Canada 87 4.64
Australia 52 2.77
Ireland 20 1.07
New Zealand 5 0.27

All the known countries (90+% of respondents) are English-speaking, with the US and UK together accounting for over 80% of the data.

This is a clear imbalance and isn't representative.

Ethnicity

value_counts_percentage(demo_data, 'Ethnicity')
N %
White 1715 91.37
Other 62 3.30
Black 33 1.76
Asian 25 1.33
Mixed-White/Asian 20 1.07
Mixed-White/Black 19 1.01
Mixed-Black/Asian 3 0.16

90+% of respondents are white. Again, a clear imbalance, and not representative.

Age-Gender

plot("count", demo_data, 'Age', fig_title="Age-Gender count/frequency", rotate_label=False, add_args={"hue":"Gender", "palette":'ch:.25'}, fontsizes=4.5);
  • There are clearly a lot more men in the 18-24 age group.
  • More women in the 25-34 age group.
  • As the age increases, the gender proportion gradually evens out.

Country-Gender

plot("count", demo_data, 'Country', fig_title="Age-Gender count/frequency", rotate_label=False, add_args={"hue":"Gender", "palette":'ch:.25'}, fontsizes=4.5);
  • More women than men were tested in the USA, New-Zealand (see next table) and "Other".
  • More men than women in the UK.
  • Around the same in the remaining countries.

Note: There are very few respondents in some countries, meaning we can’t tell whether they follow the real-life distribution or not.

Age – Gender – Country cross table

pd.crosstab(demo_data['Age'], [demo_data['Gender'], demo_data['Country']])
Gender Female Male
Country Australia Canada Ireland New Zealand Other UK USA Australia Canada Ireland New Zealand Other UK USA
Age
18-24 1 7 3 0 12 112 99 20 25 5 2 39 91 221
25-34 10 9 2 1 13 200 58 5 8 3 2 26 80 63
35-44 4 8 3 0 7 150 29 3 7 0 0 10 108 26
45-54 5 7 1 0 4 127 14 3 4 3 0 5 99 22
55-64 0 7 0 0 0 29 7 1 1 0 0 2 36 10
65+ 0 3 0 0 0 5 0 0 1 0 0 0 7 2
  • New Zealand doesn't have anyone over 34 years old, and only one woman was surveyed there.

  • Ireland doesn't have anyone over 54 and no men in the 35-44 age group.

  • Australia doesn't have anyone over 65 (and only one respondent over 54) and 20 times more males than females in the 18-24 age group.

  • Canada has over 3 times more men in the 18-24 group and only 2 males above 54.

  • The UK has over twice as many women as men in the 25-34 group.

  • The USA has over twice as many men in the 18-24 group and no women over 64.

  • The "Other" countries have 3 times more men in the 18-24 age group, twice as many in the 25-34 group, no women in the 55-64 group and no one over 64.

def catplot(data, x, hue, col, rotate_label=True, palette="ch:.25"):
    plot = sns.catplot(x=x, hue=hue, col=col, palette=palette, data=data.sort_values(by=[x]), kind="count")

    if rotate_label:
        for axes in plot.axes.flat:
            _ = axes.set_xticklabels(axes.get_xticklabels(), rotation=40, ha="right")
catplot(demo_data, 'Country', 'Age', 'Gender')

In the above plot, some of the aspects described earlier stand out quite nicely.

Education – Gender – Ethnicity cross table

demo_data.pivot_table(index='Education', columns=['Gender', 'Ethnicity'], aggfunc='size', fill_value=0)
Gender Female Male
Ethnicity Asian Black Mixed-Black/Asian Mixed-White/Asian Mixed-White/Black Other White Asian Black Mixed-Black/Asian Mixed-White/Asian Mixed-White/Black Other White
Education
Doctorate degree 1 0 0 1 2 1 52 1 0 0 0 0 2 29
Left school at 16 years 0 0 0 0 0 2 40 0 0 0 1 1 0 54
Left school at 17 years 0 0 0 0 1 0 12 0 0 0 0 0 2 14
Left school at 18 years 0 0 0 0 0 2 35 0 1 0 1 2 2 56
Left school before 16 years 0 0 0 0 0 0 12 0 0 0 0 0 0 16
Masters degree 4 6 0 3 5 5 156 5 1 0 0 1 4 93
Professional certificate/ diploma 0 3 0 1 0 2 130 1 5 0 1 1 6 120
Some college or university, no certificate or degree 3 2 1 2 1 8 153 0 4 1 4 3 12 309
University degree 7 6 1 4 2 8 263 3 5 0 2 0 6 171
  • There are very few entries of people who left school at or before 18.

  • Non-white: no one left school before 16 and, apart from the 'Other' category, only one female (Mixed-White/Black) didn't pursue higher education. Apart from the 'Other' category and one Asian respondent, no non-white male has a Doctorate degree.

  • Asian: no respondents with lower education and no males in 'some college or university' (but there are some with a master's degree, so that doesn't mean none of them went through university).

  • Black: only one person with lower education (one male who left school at 18). Only 3 females hold a professional diploma, whereas on the male side it is (together with the university degree) the most common level.

  • Mixed-Black/Asian: very few respondents; only people in 'some college or university' or with a university degree are represented.

  • Mixed-White/Asian: all the female respondents have higher education and 2 males left school at 18 or before. Males, unlike females, have no respondent with a master's degree.

  • Mixed-White/Black: all females except 2 have a completed university, master's or doctorate degree, while the males have only one master's degree and the others have either an unfinished university degree, a professional diploma, or left school at 18 or before.

  • Other: many males have professional diplomas, while only a couple of females do. There is a relatively high number of doctorate degrees, but also a relatively high number of respondents who left school at 18 or before (both compared with the categories above).

  • White: basically all the respondents are white, as we saw before. Both genders have respondents in every category, but not in the same proportions. To see things more clearly, you can refer to the table below (we will focus on proportions instead of counts). The largest group of males is pursuing a college or university degree, while the largest group of females holds a university degree. Nearly twice as many females have a doctorate degree, and more males left school at 18 or before.

pd.concat([value_counts_percentage(demo_data[(demo_data['Ethnicity'] == 'White') & (demo_data['Gender'] == 'Male')], 'Education'),
           value_counts_percentage(demo_data[(demo_data['Ethnicity'] == 'White') & (demo_data['Gender'] == 'Female')], 'Education')],
           axis = 1, keys=['Male', 'Female'])
Male Female
N % N %
Some college or university, no certificate or degree 309 35.85 153 17.94
University degree 171 19.84 263 30.83
Professional certificate/ diploma 120 13.92 130 15.24
Masters degree 93 10.79 156 18.29
Left school at 18 years 56 6.50 35 4.10
Left school at 16 years 54 6.26 40 4.69
Doctorate degree 29 3.36 52 6.10
Left school before 16 years 16 1.86 12 1.41
Left school at 17 years 14 1.62 12 1.41

Note: Because of the huge imbalance in ethnicity, plots will be hard to interpret. See for yourself below. Changing the axis and hue won’t solve anything.
catplot(demo_data, 'Ethnicity', 'Education', 'Gender')

Country – Gender – Ethnicity cross table

pd.crosstab(demo_data['Country'], [demo_data['Gender'], demo_data['Ethnicity']])
Gender Female Male
Ethnicity Asian Black Mixed-Black/Asian Mixed-White/Asian Mixed-White/Black Other White Asian Black Mixed-Black/Asian Mixed-White/Asian Mixed-White/Black Other White
Country
Australia 0 0 0 0 0 0 20 0 0 0 0 0 0 32
Canada 0 0 0 2 0 1 38 0 0 0 0 2 2 42
Ireland 0 0 0 0 0 0 9 0 0 0 0 0 0 11
New Zealand 0 0 0 0 0 0 1 0 0 0 0 0 0 4
Other 1 1 0 0 2 2 30 1 0 1 0 1 5 74
UK 11 14 0 6 9 12 571 9 11 0 2 1 4 394
USA 3 2 2 3 0 13 184 0 5 0 7 4 23 305
  • Nearly no entries for Mixed-Black/Asian (and, compared with White, very few entries for any other category).

  • The USA has no Mixed-White/Black females and no Asian or Mixed-Black/Asian males.

  • Apart from the USA, no country has any Mixed-Black/Asian respondents (except one male from an 'Other' country), not even the UK with its large number of respondents.

  • The UK has 9 times more Mixed-White/Black females than males and 3 times more 'Other' females than males.

Note: Again, plotting will not help because of the imbalance towards white respondents.

Correlations (using Cramér's V)

"In statistics, Cramér's V (sometimes referred to as Cramér's phi and denoted as φc) is a measure of association between two nominal variables, giving a value between 0 and +1 (inclusive). It is based on Pearson's chi-squared statistic and was published by Harald Cramér in 1946."(Wikipedia)

Below we define a set of useful functions for working with correlations.

def get_pval_code(pval):
    '''
    Returns a significance code string for a given p-value.
    '''
    code = ''
    if pval < 0.001:
        code = '***'
    elif pval < 0.01:
        code = '**'
    elif pval < 0.05:
        code = '*'
    elif pval < 0.1:
        code = '.'
    return code

def display_pval_codes():
    print("Significance codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1")
def cramers_v(crosstab):
    ''' Returns Cramer's V correlation coefficient (and statistic significance) for data 
        delivered as a crosstable. Used for nominal data.
    '''
    
    chi2 = chi2_contingency(crosstab)[0]
    n = crosstab.sum().sum()
    phi2 = chi2/n
    r,k = crosstab.shape
    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))    
    rcorr = r - ((r-1)**2)/(n-1)
    kcorr = k - ((k-1)**2)/(n-1)
    
    return round(np.sqrt(phi2corr / min((kcorr-1), (rcorr-1))), 3), chi2_contingency(crosstab)[1]
def nominal_corrs(dataset, col_1_names, col_2_names, pvalues=True):
    ''' Returns Cramer's V coefficients matrix and p-value (obtained with Chi-square test of independence) matrix for the whole dataset.
    
    col_1_names, col_2_names - lists of names of columns to correlate. Function creates crosstables for every 
        columns' combination and returns a matrix with single Cramer's V coefficients of every combination;
    pvalues - if set to False, only correlation values will be returned in DataFrame 
        (without '**' marks for significant observations)
    '''
    
    corr_table = pd.DataFrame()
    pval_table = pd.DataFrame()
    for i in range(len(col_1_names)):
        for j in range(len(col_2_names)):
            crosstab = pd.crosstab(dataset[col_1_names[i]], [dataset[col_2_names[j]]])
            corr, pval = cramers_v(crosstab)
            
            v = ' '.join([str(i) for i in (corr, get_pval_code(pval))]) if pvalues else corr
            
            corr_table.loc[i, j] = v
            pval_table.loc[i,j] = pval
    corr_table.index = col_1_names
    corr_table.columns = col_2_names
    pval_table.index = col_1_names
    pval_table.columns = col_2_names
    
    return corr_table, pval_table
def heatmap_corr(dataset, method='spearman', ready=False, mask=True, nominal=False):
    ''' Extended sns.heatmap() method. 
    
    dataset - can be 'pure' data (without calculated correlations) or a DataFrame with already calculated 
        correlations (in that case attribute 'ready' should be set to True);
    method - mainly pearson or spearman; nominal correlations should be calculated externally 
        and be delivered with attribute ready=True; 
    mask - if dataset is NOT a cross-valued DataFrame of one type, mask should be set to False;
    nominal - for nominal data, correlation values are in the range (0, 1) instead of (-1, 1). 
        nominal=True should be followed by ready=True 
    '''
    
    if not ready:
        corr = dataset.corr(method=method)
    elif ready:
        corr = dataset
    cmap = sns.diverging_palette(220, 10, as_cmap=True)
    vmax = corr.max().max()
    if nominal:
        center = 0.5
        cmap=None
    elif not nominal:
        center = 0
    if mask:
        mask = np.zeros_like(corr, dtype=bool)
        mask[np.triu_indices_from(mask)] = True
        np.fill_diagonal(corr.values, -2)
        vmax = corr.max().max()
    elif not mask:
        mask=None
    f, ax = plt.subplots(figsize=(20,9))
    return sns.heatmap(corr, cmap=cmap, mask=mask, vmax=vmax, center=center, annot=True, square=True, 
                       linewidths=0.5, cbar_kws={'shrink': 0.5})

Let's display the correlations among the demographic data.

corr_table_demo, pval_table_demo = nominal_corrs(demo_data, demographic_columns, demographic_columns)
display_pval_codes()
corr_table_demo
Significance codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Age Gender Education Country Ethnicity
Age 1.0 *** 0.196 *** 0.229 *** 0.173 *** 0.014
Gender 0.196 *** 0.999 *** 0.235 *** 0.219 *** 0.0
Education 0.229 *** 0.235 *** 1.0 *** 0.2 *** 0.0
Country 0.173 *** 0.219 *** 0.2 *** 1.0 *** 0.053 **
Ethnicity 0.014 0.0 0.0 0.053 ** 1.0 ***

Values with at least one * next to them are statistically significant (p-value < 0.05).
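If you prefer to filter programmatically rather than read the significance codes, here is a quick sketch using the two tables returned above:

# Keep only the demographic correlations with p-value < 0.05, blank out the rest
corr_table_demo.where(pval_table_demo < 0.05, other='')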

heatmap_corr(nominal_corrs(demo_data, demographic_columns, demographic_columns, pvalues=False)[0],
             nominal=True, ready=True);

Only very weak statistically significant correlations at most.

Important: A statistically significant correlation does not necessarily mean that the correlation is strong. A low p-value only tells us that an association of this size is very unlikely to show up by chance if there were actually none; it says nothing about how strong the association is.
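To illustrate the point on synthetic data (not from the survey), a weak association becomes highly "significant" as soon as the sample is large:

rng = np.random.default_rng(0)
x = rng.normal(size=1877)
y = 0.1 * x + rng.normal(size=1877)   # true correlation around 0.1, i.e. weak
spearmanr(x, y)                        # rho stays small, yet the p-value is far below 0.05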

This section tries to interpret these very weak relationships between variables.

Ethnicity's lack of statistically significant correlations can be explained by the great imbalance between the possible classes (basically everyone is white), meaning the lack of correlation could simply be due to chance. The near-absence of correlation with country (even though it is statistically significant) may sound surprising, but since the countries considered are all similar (developed, with a large white majority among respondents), it makes sense. It's safe to assume the lack of correlation with age and gender is real (even if not significant) for obvious reasons. I can't be so sure about education (as you can see below, there is around a 50% chance the correlation was obtained by chance).

pval_table_demo.loc['Education', 'Ethnicity']
0.5424195567242813

There is a statistically significant, very weak correlation between country and gender. We know, at least for the countries represented here (developed countries where the population is basically 50/50), that there isn't a real-life correlation between country and gender. The very weak correlation may be due to some countries having more female respondents and others more male respondents (as we saw previously).

The weak correlation between age and gender can be explained by the fact that, as we saw previously, the gender split wasn't even in some of the younger age groups; as we know, this correlation doesn't generalize to real life.

A correlation between age and country would be relevant if some countries were more developed than others, but here we only have developed countries. This idea would make more sense if we had more categories for elderly people, so that a country's life expectancy could come into play, but a single group for 65+ year olds isn't enough to differentiate developed countries. The very weak statistically significant relationship may be caused by the 'Other' countries.

The very weak correlations we can keep in mind are the following:

  • Age / Education (makes sense since basically no 18-24 would have a doctorate degree for example)
  • Gender / Education (males and females may have slightly different aims and standards regarding education)
  • Education / Country (for social tendencies, country development and other reasons, education and country may be correlated, but since the countries here are very similar, the relationship is very weak)

Personality features analysis

All the personality traits have a score that is represented by a real value in the dataset.

For some of them, the original scores are integer ranges (with a step of 1, so 4-7 means 4, 5, 6 and 7):

  • Neuroticism: 12-60
  • Extraversion: 16-59
  • Openness to experience: 24-60
  • Agreeableness: 12-60
  • Conscientiousness: 17-59
pers_data = dataset[personality_columns]
pers_data.describe()
Neuroticism Extraversion Openness to experience Agreeableness Conscientiousness Impulsiveness Sensation seeking
count 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000
mean -0.000551 -0.001951 -0.003224 -0.000657 -0.000394 0.005293 -0.007408
std 0.998442 0.997418 0.995691 0.996689 0.997657 0.954148 0.962074
min -3.464360 -3.273930 -3.273930 -3.464360 -3.464360 -2.555240 -2.078480
25% -0.678250 -0.695090 -0.717270 -0.606330 -0.652530 -0.711260 -0.525930
50% 0.042570 0.003320 -0.019280 -0.017290 -0.006650 -0.217120 0.079870
75% 0.629670 0.637790 0.723300 0.760960 0.584890 0.529750 0.765400
max 3.273930 3.273930 2.901610 3.464360 3.464360 2.901610 1.921730

Checking the balance / proportions

plot("pie", pers_data, personality_columns, fig_title="Pie plots personality traits")

The value labels overlap, but we don't really care about them (I left them in just in case you need them); we just want to see whether some slices are bigger than others.

plot("count", pers_data, personality_columns, show_percent=False, fig_title="Count / Frequency of personality data")

This is hard to interpret since some personality traits have many levels and they are encoded as real values that are hard to read, even though they are sorted in the same order as the original scores (which aren't very interpretable either).

According to the pie plot, there doesn't seem to be any huge unbalance.

The count / frequency plot is more interesting. The distributions of all the personality traits look like they could be Gaussian, sometimes with a bit of skew. Let's plot those densities a bit more nicely against estimated normal distributions to see whether our intuition was right!

plot("density", pers_data, personality_columns, fig_title="Personality densities")

They are indeed bell-shaped, but far from the shape of a fitted Gaussian distribution.

Let's try out a statistical test.

def normal_test(dataset, columns=None, alpha=0.05):
    ''' Runs the Shapiro-Wilk test on every column and reports whether H0 (normality) can be rejected. '''
    if columns is None:
        columns = dataset.columns
    pvals = [shapiro(dataset[col])[1] for col in columns]
    return pd.DataFrame(data=[("Not rejected" if pval > alpha else "Rejected") + f" with alpha = {alpha}" for pval in pvals], columns=["H0: is from normal distribution"], index=columns)
normal_test(pers_data)
H0: is from normal distribution
Neuroticism Not rejected with alpha = 0.05
Extraversion Rejected with alpha = 0.05
Openness to experience Rejected with alpha = 0.05
Agreeableness Rejected with alpha = 0.05
Conscientiousness Rejected with alpha = 0.05
Impulsiveness Rejected with alpha = 0.05
Sensation seeking Rejected with alpha = 0.05

After running the Shapiro-Wilk test on each personality trait, the only hypothesis that couldn't be rejected was "Neuroticism is from a normal distribution" with an alpha = 0.05.

Warning: That doesn’t mean it is Gaussian. The p-value was actually only around 10%, so we simply can’t reject the hypothesis.
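If you want to look at the actual p-values rather than a reject / don't-reject verdict, here is a quick sketch (shapiro is already imported from scipy.stats above):

pd.Series({col: shapiro(pers_data[col])[1] for col in personality_columns}).round(3)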

Correlations among personality traits (using Spearman)

"In statistics, Spearman's rank correlation coefficient or Spearman's ρ, named after Charles Spearman, is a nonparametric measure of rank correlation (statistical dependence between the rankings of two variables). ... Spearman's correlation assesses monotonic relationships (whether linear or not). If there are no repeated data values, a perfect Spearman correlation of +1 or −1 occurs when each of the variables is a perfect monotone function of the other."(Wikipedia)

Again, we have to define a few useful functions.

def calculate_r(df1, df2, method='spearman', pvalues=True):
    ''' Returns correlation matrix extended by statistical significance index. Used for non-nominal data.
    
    df1, df2 - DataFrames of data to correlate;
    method - mainly pearson and spearman;
    pvalues - if set to False, only correlation values will be returned in DataFrame 
        (without '**' marks for significant observations)
    '''
    float_precision = 3
    
    data_corr_table = pd.DataFrame()
    data_pvalues = pd.DataFrame()
    for x in df1.columns:
        for y in df2.columns:
            if method == 'pearson':
                corr = pearsonr(df1[x], df2[y])
            elif method == 'spearman':
                corr = spearmanr(df1[x], df2[y])
            else:
                raise ValueError('Unknown method')
            if pvalues:
                data_corr_table.loc[x,y] = '{} {}'.format(round(corr[0], float_precision), get_pval_code(corr[1]))
            elif not pvalues:
                data_corr_table.loc[x,y] = round(corr[0], float_precision)
            data_pvalues.loc[x,y] = corr[1]
    
    return data_corr_table, data_pvalues
def get_ranked_corr(correlations, pvalues, pval_threshold=0.05, corr_threshold=0.5, name_transform=lambda x: x, against_itself=True, comp_corr = operator.ge, comp_pval = operator.le):
    ''' Returns the sorted top (wanted) correlations as a Dataframe
    
    correlations - the correlations dataframe.
    pvalues - the p-values of the test with H0: there is no correlation.
    pval_threshold - alpha to compare the p-value to
    corr_threshold - threshold on the correlations to consider (upper or lower depending on comp_corr)
    name_transform - transform the column names before writing index.
    against_itself - True if columns and rows are the same data (if symmetric matrix)
    comp_corr - comparator to compare to the corr_threshold (default operator.ge means we want correlations >=)
    comp_pval - comparator to compare the pvalue (default operator.le meaning we want pvalues <=)
    '''
    
    columns = correlations.columns
    rows = correlations.index
    pairs=[]
    for i, row in enumerate(rows):
        for j, col in enumerate(columns[i+1:], start=i+1) if against_itself else enumerate(columns):
            corr = correlations.iloc[i, j]
            if comp_pval(pvalues.iloc[i,j], pval_threshold) and comp_corr(abs(corr), corr_threshold):
                pairs.append((name_transform(row), name_transform(col), corr))
    list.sort(pairs, key=lambda x: abs(x[2]), reverse=True)
    return pd.Series(data=[x[2] for x in pairs], index=[' / '.join((x[0], x[1])) for x in pairs], name="Correlation", dtype=np.float64).to_frame()

Now let's use them.

pers_data_corr_table, pers_data_pvalues = calculate_r(pers_data, pers_data, method='spearman')
pers_data_corr_table
Neuroticism Extraversion Openness to experience Agreeableness Conscientiousness Impulsiveness Sensation seeking
Neuroticism 1.0 *** -0.417 *** 0.015 -0.206 *** -0.378 *** 0.168 *** 0.09 ***
Extraversion -0.417 *** 1.0 *** 0.226 *** 0.162 *** 0.29 *** 0.119 *** 0.19 ***
Openness to experience 0.015 0.226 *** 1.0 *** 0.037 -0.073 ** 0.27 *** 0.405 ***
Agreeableness -0.206 *** 0.162 *** 0.037 1.0 *** 0.237 *** -0.226 *** -0.208 ***
Conscientiousness -0.378 *** 0.29 *** -0.073 ** 0.237 *** 1.0 *** -0.344 *** -0.254 ***
Impulsiveness 0.168 *** 0.119 *** 0.27 *** -0.226 *** -0.344 *** 1.0 *** 0.628 ***
Sensation seeking 0.09 *** 0.19 *** 0.405 *** -0.208 *** -0.254 *** 0.628 *** 1.0 ***
heatmap_corr(pers_data, method='spearman');

Again, no strong correlations but some moderate and weak-moderate ones. The sign of the correlation is interesting here.

Neuroticism for example has no positive correlation with any other personality trait. It has a weak-moderate downhill correlation with extraversion and conscientiousness.

Tip: If you ever feel moody, anxious, frustrated, lonely, depressed or other such things, try to be more outgoing, talkative, energetic, careful and diligent and see how it goes! :wink:

I also found it interesting that Extraversion has no moderate positive correlation with any other personality trait (its strongest positive correlation, which is weak, is with conscientiousness).

Below is a list of the statistically significant correlations that are at least very weak (|r| >= 0.15).

pers_data_corr_table_nopval, _ = calculate_r(pers_data, pers_data, method='spearman', pvalues=False)
get_ranked_corr(pers_data_corr_table_nopval, pers_data_pvalues, corr_threshold=0.15)
Correlation
Impulsiveness / Sensation seeking 0.628
Neuroticism / Extraversion -0.417
Openness to experience / Sensation seeking 0.405
Neuroticism / Conscientiousness -0.378
Conscientiousness / Impulsiveness -0.344
Extraversion / Conscientiousness 0.290
Openness to experience / Impulsiveness 0.270
Conscientiousness / Sensation seeking -0.254
Agreeableness / Conscientiousness 0.237
Extraversion / Openness to experience 0.226
Agreeableness / Impulsiveness -0.226
Agreeableness / Sensation seeking -0.208
Neuroticism / Agreeableness -0.206
Extraversion / Sensation seeking 0.190
Neuroticism / Impulsiveness 0.168
Extraversion / Agreeableness 0.162

Drug consumption analysis

As a reminder, here are the different levels for each drug:

0 - Never Used

1 - Used over a Decade Ago

2 - Used in the Last Decade

3 - Used in the Last Year

4 - Used in the Last Month

5 - Used in the Last Week

6 - Used in the Last Day

drug_data = dataset[drugs_columns]
drug_data.describe()
Alcohol consumption Amphetamines consumption Amyl nitrite consumption Benzodiazepine consumption Caffeine consumption Cannabis consumption Chocolate consumption Cocaine consumption Crack consumption Ecstasy consumption Heroin consumption Ketamine consumption Legal highs consumption Lysergic acid diethylamide consumption Methadone consumption Magic mushrooms consumption Nicotine consumption Volatile substance abuse consumption
count 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000
mean 4.637720 1.337773 0.606819 1.461907 5.484283 2.980288 5.109750 1.156633 0.295685 1.309536 0.372403 0.564198 1.353223 1.053277 0.826319 1.177944 3.194992 0.429409
std 1.328234 1.782384 1.064005 1.869193 1.115119 2.286778 1.085716 1.510791 0.835925 1.647373 1.034761 1.216341 1.790881 1.484582 1.648379 1.459212 2.415814 0.959160
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 4.000000 0.000000 0.000000 0.000000 5.000000 1.000000 5.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000
50% 5.000000 0.000000 0.000000 0.000000 6.000000 3.000000 5.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 3.000000 0.000000
75% 6.000000 2.000000 1.000000 3.000000 6.000000 5.000000 6.000000 2.000000 0.000000 3.000000 0.000000 0.000000 3.000000 2.000000 0.000000 2.000000 6.000000 0.000000
max 6.000000 6.000000 6.000000 6.000000 6.000000 6.000000 6.000000 6.000000 6.000000 6.000000 6.000000 6.000000 6.000000 6.000000 6.000000 6.000000 6.000000 6.000000

Let's make a table with the number of observations and the frequency for every consumption level of every drug.

counts = []
proportions = []
for drug in drugs_columns:
    counts.append(drug_data[drug].value_counts())
    proportions.append(drug_data[drug].value_counts(normalize=True))

drug_table_count = pd.concat(counts, axis=1, 
          keys=drugs_columns, sort=True)
drug_table_prop = (pd.concat(proportions, axis=1, 
          keys=drugs_columns, sort=True) * 100).round(2)
drug_table = pd.concat([drug_table_count.T, drug_table_prop.T], keys=['N', '%'], axis=1)

drug_table
N %
0 1 2 3 4 5 6 0 1 2 3 4 5 6
Alcohol consumption 33 34 68 197 284 758 503 1.76 1.81 3.62 10.50 15.13 40.38 26.80
Amphetamines consumption 973 230 241 196 75 61 101 51.84 12.25 12.84 10.44 4.00 3.25 5.38
Amyl nitrite consumption 1299 210 236 91 24 14 3 69.21 11.19 12.57 4.85 1.28 0.75 0.16
Benzodiazepine consumption 999 116 230 234 119 84 95 53.22 6.18 12.25 12.47 6.34 4.48 5.06
Caffeine consumption 27 10 24 59 106 271 1380 1.44 0.53 1.28 3.14 5.65 14.44 73.52
Cannabis consumption 413 207 266 210 138 185 458 22.00 11.03 14.17 11.19 7.35 9.86 24.40
Chocolate consumption 32 2 10 53 295 680 805 1.70 0.11 0.53 2.82 15.72 36.23 42.89
Cocaine consumption 1036 160 267 257 98 40 19 55.19 8.52 14.22 13.69 5.22 2.13 1.01
Crack consumption 1622 67 109 59 9 9 2 86.41 3.57 5.81 3.14 0.48 0.48 0.11
Ecstasy consumption 1020 112 232 275 154 63 21 54.34 5.97 12.36 14.65 8.20 3.36 1.12
Heroin consumption 1600 68 91 65 24 16 13 85.24 3.62 4.85 3.46 1.28 0.85 0.69
Ketamine consumption 1488 43 140 129 40 33 4 79.28 2.29 7.46 6.87 2.13 1.76 0.21
Legal highs consumption 1092 29 195 321 109 64 67 58.18 1.55 10.39 17.10 5.81 3.41 3.57
Lysergic acid diethylamide consumption 1069 257 175 213 96 55 12 56.95 13.69 9.32 11.35 5.11 2.93 0.64
Methadone consumption 1424 39 95 148 50 48 73 75.87 2.08 5.06 7.88 2.66 2.56 3.89
Magic mushrooms consumption 982 208 259 272 114 39 3 52.32 11.08 13.80 14.49 6.07 2.08 0.16
Nicotine consumption 428 193 203 184 106 156 607 22.80 10.28 10.82 9.80 5.65 8.31 32.34
Volatile substance abuse consumption 1452 199 133 59 13 14 7 77.36 10.60 7.09 3.14 0.69 0.75 0.37

We will also plot them for each drug.

plot("count", drug_data, fig_title="Count / Frequency for drug consumption\n\n", rotate_label=False)

Alcohol, caffeine and chocolate are legal stimulants that are consumed regularly by the vast majority of respondents.

Cannabis and nicotine are distributed fairly evenly, even if there is a higher proportion at both extremes (categories 0 and 6). These two are considered bad or acceptable depending on the person, which may explain the roughly equal numbers of daily consumers and people who never tried them, with some in between.

The other illegal drugs were never tried by the vast majority of respondents, or were tried a long time ago (here meaning more than a month ago). People probably either never tried them (out of fear or conviction) or tried them once and never again. This also holds for legal highs, which follow the same trend as the illegal drugs.
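A rough way to check that reading of the table is to compute, for each drug, the share of respondents who have not used it in the last month (levels 0 to 3):

# Share of respondents per drug at level 3 or below (not used in the last month)
(drug_data <= 3).mean().sort_values(ascending=False).round(2)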

Correlations among drugs (using Spearman)

drug_data_corr_table, drug_data_pvalues = calculate_r(drug_data, drug_data, method='spearman')
display_pval_codes()
drug_data_corr_table
Significance codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Alcohol consumption Amphetamines consumption Amyl nitrite consumption Benzodiazepine consumption Caffeine consumption Cannabis consumption Chocolate consumption Cocaine consumption Crack consumption Ecstasy consumption Heroin consumption Ketamine consumption Legal highs consumption Lysergic acid diethylamide consumption Methadone consumption Magic mushrooms consumption Nicotine consumption Volatile substance abuse consumption
Alcohol consumption 1.0 *** -0.019 0.076 ** -0.03 0.103 *** 0.003 0.01 0.054 * -0.056 * 0.039 . -0.054 * 0.048 * 0.01 -0.017 -0.08 *** -0.0 0.043 . -0.002
Amphetamines consumption -0.019 1.0 *** 0.365 *** 0.505 *** 0.022 0.504 *** -0.093 *** 0.606 *** 0.338 *** 0.579 *** 0.382 *** 0.402 *** 0.481 *** 0.499 *** 0.422 *** 0.484 *** 0.395 *** 0.327 ***
Amyl nitrite consumption 0.076 ** 0.365 *** 1.0 *** 0.219 *** 0.061 ** 0.217 *** 0.005 0.395 *** 0.178 *** 0.359 *** 0.169 *** 0.344 *** 0.216 *** 0.208 *** 0.078 *** 0.236 *** 0.228 *** 0.188 ***
Benzodiazepine consumption -0.03 0.505 *** 0.219 *** 1.0 *** 0.052 * 0.37 *** -0.053 * 0.467 *** 0.372 *** 0.373 *** 0.437 *** 0.314 *** 0.345 *** 0.37 *** 0.515 *** 0.374 *** 0.318 *** 0.314 ***
Caffeine consumption 0.103 *** 0.022 0.061 ** 0.052 * 1.0 *** -0.01 0.109 *** 0.036 0.011 -0.037 0.006 -0.033 -0.052 * -0.038 . 0.014 -0.004 0.101 *** 0.03
Cannabis consumption 0.003 0.504 *** 0.217 *** 0.37 *** -0.01 1.0 *** -0.108 *** 0.477 *** 0.24 *** 0.566 *** 0.239 *** 0.319 *** 0.578 *** 0.542 *** 0.322 *** 0.593 *** 0.515 *** 0.29 ***
Chocolate consumption 0.01 -0.093 *** 0.005 -0.053 * 0.109 *** -0.108 *** 1.0 *** -0.088 *** -0.126 *** -0.091 *** -0.083 *** -0.053 * -0.098 *** -0.122 *** -0.061 ** -0.108 *** -0.058 * -0.092 ***
Cocaine consumption 0.054 * 0.606 *** 0.395 *** 0.467 *** 0.036 0.477 *** -0.088 *** 1.0 *** 0.406 *** 0.631 *** 0.428 *** 0.465 *** 0.42 *** 0.464 *** 0.378 *** 0.469 *** 0.414 *** 0.322 ***
Crack consumption -0.056 * 0.338 *** 0.178 *** 0.372 *** 0.011 0.24 *** -0.126 *** 0.406 *** 1.0 *** 0.26 *** 0.539 *** 0.269 *** 0.199 *** 0.308 *** 0.391 *** 0.293 *** 0.256 *** 0.305 ***
Ecstasy consumption 0.039 . 0.579 *** 0.359 *** 0.373 *** -0.037 0.566 *** -0.091 *** 0.631 *** 0.26 *** 1.0 *** 0.29 *** 0.523 *** 0.57 *** 0.581 *** 0.313 *** 0.565 *** 0.392 *** 0.282 ***
Heroin consumption -0.054 * 0.382 *** 0.169 *** 0.437 *** 0.006 0.239 *** -0.083 *** 0.428 *** 0.539 *** 0.29 *** 1.0 *** 0.297 *** 0.244 *** 0.345 *** 0.501 *** 0.298 *** 0.246 *** 0.311 ***
Ketamine consumption 0.048 * 0.402 *** 0.344 *** 0.314 *** -0.033 0.319 *** -0.053 * 0.465 *** 0.269 *** 0.523 *** 0.297 *** 1.0 *** 0.405 *** 0.413 *** 0.236 *** 0.409 *** 0.256 *** 0.207 ***
Legal highs consumption 0.01 0.481 *** 0.216 *** 0.345 *** -0.052 * 0.578 *** -0.098 *** 0.42 *** 0.199 *** 0.57 *** 0.244 *** 0.405 *** 1.0 *** 0.478 *** 0.35 *** 0.551 *** 0.348 *** 0.308 ***
Lysergic acid diethylamide consumption -0.017 0.499 *** 0.208 *** 0.37 *** -0.038 . 0.542 *** -0.122 *** 0.464 *** 0.308 *** 0.581 *** 0.345 *** 0.413 *** 0.478 *** 1.0 *** 0.34 *** 0.672 *** 0.309 *** 0.291 ***
Methadone consumption -0.08 *** 0.422 *** 0.078 *** 0.515 *** 0.014 0.322 *** -0.061 ** 0.378 *** 0.391 *** 0.313 *** 0.501 *** 0.236 *** 0.35 *** 0.34 *** 1.0 *** 0.333 *** 0.255 *** 0.308 ***
Magic mushrooms consumption -0.0 0.484 *** 0.236 *** 0.374 *** -0.004 0.593 *** -0.108 *** 0.469 *** 0.293 *** 0.565 *** 0.298 *** 0.409 *** 0.551 *** 0.672 *** 0.333 *** 1.0 *** 0.338 *** 0.263 ***
Nicotine consumption 0.043 . 0.395 *** 0.228 *** 0.318 *** 0.101 *** 0.515 *** -0.058 * 0.414 *** 0.256 *** 0.392 *** 0.246 *** 0.256 *** 0.348 *** 0.309 *** 0.255 *** 0.338 *** 1.0 *** 0.294 ***
Volatile substance abuse consumption -0.002 0.327 *** 0.188 *** 0.314 *** 0.03 0.29 *** -0.092 *** 0.322 *** 0.305 *** 0.282 *** 0.311 *** 0.207 *** 0.308 *** 0.291 *** 0.308 *** 0.263 *** 0.294 *** 1.0 ***
heatmap_corr(drug_data, method='spearman');

Alcohol, caffeine and chocolate don't correlate with any other drug (the relations are too weak or insignificant).

There are some moderate correlations with high enough statistical significance.

There are no strong correlations (I consider an absolute value of 0.70 or above strong).

drug_data_corr_table_nopval, _ =  calculate_r(drug_data, drug_data, method='spearman', pvalues=False)
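Before listing them, a quick way to verify the claim that nothing reaches the strong range is to reuse get_ranked_corr with a 0.7 threshold; the result should come back empty.

get_ranked_corr(drug_data_corr_table_nopval, drug_data_pvalues, corr_threshold=0.7,
                name_transform=lambda x: x.rsplit(' ', 1)[0])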

We will list all the correlations, grouped and ordered by strength.

Tip: Feel free to scroll past these lists; I only included them because they might be of interest to some people.

List of insignificant correlations

get_ranked_corr(drug_data_corr_table_nopval, drug_data_pvalues, corr_threshold=0, name_transform=lambda x:x.rsplit(' ', 1)[0], comp_pval= operator.gt).drop('Correlation', axis=1)
Alcohol / Nicotine
Alcohol / Ecstasy
Caffeine / Lysergic acid diethylamide
Caffeine / Ecstasy
Caffeine / Cocaine
Caffeine / Ketamine
Alcohol / Benzodiazepine
Caffeine / Volatile substance abuse
Amphetamines / Caffeine
Alcohol / Amphetamines
Alcohol / Lysergic acid diethylamide
Caffeine / Methadone
Caffeine / Crack
Alcohol / Chocolate
Alcohol / Legal highs
Caffeine / Cannabis
Caffeine / Heroin
Amyl nitrite / Chocolate
Caffeine / Magic mushrooms
Alcohol / Cannabis
Alcohol / Volatile substance abuse
Alcohol / Magic mushrooms

List of significant weak or non-existent correlations

get_ranked_corr(drug_data_corr_table_nopval, drug_data_pvalues, corr_threshold=0.3, name_transform=lambda x:x.rsplit(' ', 1)[0], comp_corr=operator.lt)
Correlation
Heroin / Magic mushrooms 0.298
Heroin / Ketamine 0.297
Nicotine / Volatile substance abuse 0.294
Crack / Magic mushrooms 0.293
Lysergic acid diethylamide / Volatile substance abuse 0.291
Cannabis / Volatile substance abuse 0.290
Ecstasy / Heroin 0.290
Ecstasy / Volatile substance abuse 0.282
Crack / Ketamine 0.269
Magic mushrooms / Volatile substance abuse 0.263
Crack / Ecstasy 0.260
Crack / Nicotine 0.256
Ketamine / Nicotine 0.256
Methadone / Nicotine 0.255
Heroin / Nicotine 0.246
Heroin / Legal highs 0.244
Cannabis / Crack 0.240
Cannabis / Heroin 0.239
Amyl nitrite / Magic mushrooms 0.236
Ketamine / Methadone 0.236
Amyl nitrite / Nicotine 0.228
Amyl nitrite / Benzodiazepine 0.219
Amyl nitrite / Cannabis 0.217
Amyl nitrite / Legal highs 0.216
Amyl nitrite / Lysergic acid diethylamide 0.208
Ketamine / Volatile substance abuse 0.207
Crack / Legal highs 0.199
Amyl nitrite / Volatile substance abuse 0.188
Amyl nitrite / Crack 0.178
Amyl nitrite / Heroin 0.169
Chocolate / Crack -0.126
Chocolate / Lysergic acid diethylamide -0.122
Caffeine / Chocolate 0.109
Cannabis / Chocolate -0.108
Chocolate / Magic mushrooms -0.108
Alcohol / Caffeine 0.103
Caffeine / Nicotine 0.101
Chocolate / Legal highs -0.098
Amphetamines / Chocolate -0.093
Chocolate / Volatile substance abuse -0.092
Chocolate / Ecstasy -0.091
Chocolate / Cocaine -0.088
Chocolate / Heroin -0.083
Alcohol / Methadone -0.080
Amyl nitrite / Methadone 0.078
Alcohol / Amyl nitrite 0.076
Amyl nitrite / Caffeine 0.061
Chocolate / Methadone -0.061
Chocolate / Nicotine -0.058
Alcohol / Crack -0.056
Alcohol / Cocaine 0.054
Alcohol / Heroin -0.054
Benzodiazepine / Chocolate -0.053
Chocolate / Ketamine -0.053
Benzodiazepine / Caffeine 0.052
Caffeine / Legal highs -0.052
Alcohol / Ketamine 0.048

List of significant weak to weak-moderate correlations

get_ranked_corr(drug_data_corr_table_nopval, drug_data_pvalues, corr_threshold=0.5, name_transform=lambda x:x.rsplit(' ', 1)[0], comp_corr=operator.lt).query('Correlation >= 0.3').reset_index()
index Correlation
0 Amphetamines / Lysergic acid diethylamide 0.499
1 Amphetamines / Magic mushrooms 0.484
2 Amphetamines / Legal highs 0.481
3 Legal highs / Lysergic acid diethylamide 0.478
4 Cannabis / Cocaine 0.477
5 Cocaine / Magic mushrooms 0.469
6 Benzodiazepine / Cocaine 0.467
7 Cocaine / Ketamine 0.465
8 Cocaine / Lysergic acid diethylamide 0.464
9 Benzodiazepine / Heroin 0.437
10 Cocaine / Heroin 0.428
11 Amphetamines / Methadone 0.422
12 Cocaine / Legal highs 0.420
13 Cocaine / Nicotine 0.414
14 Ketamine / Lysergic acid diethylamide 0.413
15 Ketamine / Magic mushrooms 0.409
16 Cocaine / Crack 0.406
17 Ketamine / Legal highs 0.405
18 Amphetamines / Ketamine 0.402
19 Amphetamines / Nicotine 0.395
20 Amyl nitrite / Cocaine 0.395
21 Ecstasy / Nicotine 0.392
22 Crack / Methadone 0.391
23 Amphetamines / Heroin 0.382
24 Cocaine / Methadone 0.378
25 Benzodiazepine / Magic mushrooms 0.374
26 Benzodiazepine / Ecstasy 0.373
27 Benzodiazepine / Crack 0.372
28 Benzodiazepine / Cannabis 0.370
29 Benzodiazepine / Lysergic acid diethylamide 0.370
30 Amphetamines / Amyl nitrite 0.365
31 Amyl nitrite / Ecstasy 0.359
32 Legal highs / Methadone 0.350
33 Legal highs / Nicotine 0.348
34 Benzodiazepine / Legal highs 0.345
35 Heroin / Lysergic acid diethylamide 0.345
36 Amyl nitrite / Ketamine 0.344
37 Lysergic acid diethylamide / Methadone 0.340
38 Amphetamines / Crack 0.338
39 Magic mushrooms / Nicotine 0.338
40 Methadone / Magic mushrooms 0.333
41 Amphetamines / Volatile substance abuse 0.327
42 Cannabis / Methadone 0.322
43 Cocaine / Volatile substance abuse 0.322
44 Cannabis / Ketamine 0.319
45 Benzodiazepine / Nicotine 0.318
46 Benzodiazepine / Ketamine 0.314
47 Benzodiazepine / Volatile substance abuse 0.314
48 Ecstasy / Methadone 0.313
49 Heroin / Volatile substance abuse 0.311
50 Lysergic acid diethylamide / Nicotine 0.309
51 Crack / Lysergic acid diethylamide 0.308
52 Legal highs / Volatile substance abuse 0.308
53 Methadone / Volatile substance abuse 0.308
54 Crack / Volatile substance abuse 0.305

List of significant moderate correlations

get_ranked_corr(drug_data_corr_table_nopval, drug_data_pvalues, corr_threshold=0.4, name_transform=lambda x:x.rsplit(' ', 1)[0])
Correlation
Lysergic acid diethylamide / Magic mushrooms 0.672
Cocaine / Ecstasy 0.631
Amphetamines / Cocaine 0.606
Cannabis / Magic mushrooms 0.593
Ecstasy / Lysergic acid diethylamide 0.581
Amphetamines / Ecstasy 0.579
Cannabis / Legal highs 0.578
Ecstasy / Legal highs 0.570
Cannabis / Ecstasy 0.566
Ecstasy / Magic mushrooms 0.565
Legal highs / Magic mushrooms 0.551
Cannabis / Lysergic acid diethylamide 0.542
Crack / Heroin 0.539
Ecstasy / Ketamine 0.523
Benzodiazepine / Methadone 0.515
Cannabis / Nicotine 0.515
Amphetamines / Benzodiazepine 0.505
Amphetamines / Cannabis 0.504
Heroin / Methadone 0.501
Amphetamines / Lysergic acid diethylamide 0.499
Amphetamines / Magic mushrooms 0.484
Amphetamines / Legal highs 0.481
Legal highs / Lysergic acid diethylamide 0.478
Cannabis / Cocaine 0.477
Cocaine / Magic mushrooms 0.469
Benzodiazepine / Cocaine 0.467
Cocaine / Ketamine 0.465
Cocaine / Lysergic acid diethylamide 0.464
Benzodiazepine / Heroin 0.437
Cocaine / Heroin 0.428
Amphetamines / Methadone 0.422
Cocaine / Legal highs 0.420
Cocaine / Nicotine 0.414
Ketamine / Lysergic acid diethylamide 0.413
Ketamine / Magic mushrooms 0.409
Cocaine / Crack 0.406
Ketamine / Legal highs 0.405
Amphetamines / Ketamine 0.402

We have seen how each category of variables correlates with itself. Let's now dig into inter-category correlations.

Correlations between personality traits and drugs (using Spearman)

drug_pers_corr_table, drug_pers_pvalues = calculate_r(drug_data, pers_data, method='spearman')
drug_pers_corr_table
Neuroticism Extraversion Openness to experience Agreeableness Conscientiousness Impulsiveness Sensation seeking
Alcohol consumption 0.0 0.084 *** 0.032 -0.033 0.006 0.039 . 0.094 ***
Amphetamines consumption 0.135 *** -0.042 . 0.253 *** -0.14 *** -0.253 *** 0.294 *** 0.365 ***
Amyl nitrite consumption 0.04 . 0.04 . 0.065 ** -0.079 *** -0.121 *** 0.13 *** 0.189 ***
Benzodiazepine consumption 0.266 *** -0.096 *** 0.225 *** -0.164 *** -0.217 *** 0.233 *** 0.255 ***
Caffeine consumption 0.022 0.013 -0.02 -0.008 0.006 0.014 -0.005
Cannabis consumption 0.11 *** -0.022 0.417 *** -0.155 *** -0.293 *** 0.313 *** 0.465 ***
Chocolate consumption 0.025 0.028 -0.015 0.038 0.023 -0.015 -0.052 *
Cocaine consumption 0.144 *** 0.006 0.205 *** -0.183 *** -0.221 *** 0.264 *** 0.34 ***
Crack consumption 0.118 *** -0.052 * 0.126 *** -0.091 *** -0.13 *** 0.19 *** 0.197 ***
Ecstasy consumption 0.087 *** 0.056 * 0.311 *** -0.118 *** -0.249 *** 0.272 *** 0.405 ***
Heroin consumption 0.178 *** -0.076 *** 0.163 *** -0.14 *** -0.17 *** 0.198 *** 0.216 ***
Ketamine consumption 0.078 *** 0.019 0.193 *** -0.125 *** -0.167 *** 0.188 *** 0.266 ***
Legal highs consumption 0.122 *** -0.033 0.347 *** -0.136 *** -0.27 *** 0.281 *** 0.441 ***
Lysergic acid diethylamide consumption 0.071 ** -0.021 0.368 *** -0.12 *** -0.198 *** 0.253 *** 0.381 ***
Methadone consumption 0.194 *** -0.108 *** 0.209 *** -0.153 *** -0.209 *** 0.198 *** 0.242 ***
Magic mushrooms consumption 0.066 ** 0.001 0.379 *** -0.127 *** -0.221 *** 0.281 *** 0.391 ***
Nicotine consumption 0.133 *** -0.033 0.193 *** -0.113 *** -0.237 *** 0.253 *** 0.306 ***
Volatile substance abuse consumption 0.132 *** -0.066 ** 0.151 *** -0.125 *** -0.182 *** 0.192 *** 0.255 ***
drug_pers_corr_table_no_pvalues, drug_pers_pvalues = calculate_r(drug_data, pers_data, pvalues=False, method='spearman')
heatmap_corr(drug_pers_corr_table_no_pvalues, ready=True, mask=False);

Alcohol, Chocolate and Caffeine consumption have no (or only negligible) correlation with any personality trait.

For the rest of this paragraph, "drug consumption" is not taking alcohol, chocolate or caffeine into account.

Extraversion is the only personality trait that has no correlation with any drug (not even a very weak one), which I found surprising at first, but it may be due to my misunderstanding of what extraversion is.

Crack and Amyl nitrite have no correlation above 0.2 (in absolute value).

Agreeableness and Conscientiousness both have only weak negative correlations (at most) with all the drugs, although the negative correlation is always a bit stronger for Conscientiousness.

Impulsiveness, Openness to experience and Sensation seeking are the personality factors that correlate the most with drug consumption.

There are no strong correlations between personality factors and drug taking.

It seems that a given personality trait has either only positive or only negative correlations with drug consumption.

Let's list the weak ones and above (|r| >= 0.3).

get_ranked_corr(drug_pers_corr_table_no_pvalues, drug_pers_pvalues, corr_threshold=0.3, against_itself=False)
Correlation
Cannabis consumption / Sensation seeking 0.465
Legal highs consumption / Sensation seeking 0.441
Cannabis consumption / Openness to experience 0.417
Ecstasy consumption / Sensation seeking 0.405
Magic mushrooms consumption / Sensation seeking 0.391
Lysergic acid diethylamide consumption / Sensation seeking 0.381
Magic mushrooms consumption / Openness to experience 0.379
Lysergic acid diethylamide consumption / Openness to experience 0.368
Amphetamines consumption / Sensation seeking 0.365
Legal highs consumption / Openness to experience 0.347
Cocaine consumption / Sensation seeking 0.340
Cannabis consumption / Impulsiveness 0.313
Ecstasy consumption / Openness to experience 0.311
Nicotine consumption / Sensation seeking 0.306

Note: The above list contains only positive correlations. The strongest negative correlation is -0.293, between Cannabis and Conscientiousness.

Nominal correlations between demographic variables and drug consumption (using Cramér's V)
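
Cramér's V measures the association between two nominal variables: it is derived from the chi-squared statistic of their contingency table and ranges from 0 (no association) to 1 (perfect association). A minimal sketch of how it can be computed for one pair of columns is shown below; the actual nominal_corrs helper was defined earlier and presumably works along these lines (whether it applies a bias correction is not something I assume here).

# Sketch of a plain Cramér's V computation for two nominal columns (no bias correction).
def cramers_v(x, y):
    confusion = pd.crosstab(x, y)                # contingency table
    chi2, p, _, _ = chi2_contingency(confusion)
    n = confusion.values.sum()
    k = min(confusion.shape) - 1                 # min(#rows, #columns) - 1
    return np.sqrt(chi2 / (n * k)), p

# For example: cramers_v(dataset['Age'], dataset['Cannabis consumption'])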

demo_drug_corr_table, demo_drug_pval = nominal_corrs(dataset, demographic_columns, drugs_columns)
demo_drug_corr_table
Alcohol consumption Amphetamines consumption Amyl nitrite consumption Benzodiazepine consumption Caffeine consumption Cannabis consumption Chocolate consumption Cocaine consumption Crack consumption Ecstasy consumption Heroin consumption Ketamine consumption Legal highs consumption Lysergic acid diethylamide consumption Methadone consumption Magic mushrooms consumption Nicotine consumption Volatile substance abuse consumption
Age 0.083 *** 0.196 *** 0.151 *** 0.144 *** 0.093 *** 0.24 *** 0.087 *** 0.173 *** 0.079 *** 0.203 *** 0.11 *** 0.131 *** 0.215 *** 0.233 *** 0.117 *** 0.221 *** 0.186 *** 0.156 ***
Gender 0.041 0.249 *** 0.157 *** 0.156 *** 0.0 0.309 *** 0.088 ** 0.178 *** 0.151 *** 0.241 *** 0.131 *** 0.189 *** 0.327 *** 0.286 *** 0.188 *** 0.278 *** 0.193 *** 0.139 ***
Education 0.079 *** 0.122 *** 0.045 * 0.093 *** 0.033 0.166 *** 0.043 * 0.068 *** 0.082 *** 0.125 *** 0.068 *** 0.061 *** 0.144 *** 0.133 *** 0.106 *** 0.14 *** 0.131 *** 0.098 ***
Country 0.07 *** 0.184 *** 0.112 *** 0.171 *** 0.044 * 0.229 *** 0.063 *** 0.123 *** 0.117 *** 0.146 *** 0.136 *** 0.061 *** 0.183 *** 0.213 *** 0.189 *** 0.203 *** 0.136 *** 0.151 ***
Ethnicity 0.148 *** 0.047 ** 0.035 . 0.046 ** 0.072 *** 0.088 *** 0.023 0.047 ** 0.052 ** 0.029 0.0 0.0 0.049 ** 0.062 *** 0.022 0.056 *** 0.063 *** 0.074 ***
demo_drug_corr_table_no_pvalues, _ = nominal_corrs(dataset, demographic_columns, drugs_columns, pvalues=False)
heatmap_corr(demo_drug_corr_table_no_pvalues, ready=True, mask=False, nominal=True);

Ethnicity isn't correlated to drug taking, except very weakly with alcohol consumption.

Education also seems to be at most very weakly correlated to drug consumption.

Caffeine and Chocolate show at most negligible correlations with demographic variables.

There are no strong or even moderate correlations; the strongest of the weak correlations mostly involve Gender.

Below is a list of the significant correlations with |r| >= 0.2.

get_ranked_corr(demo_drug_corr_table_no_pvalues, demo_drug_pval, corr_threshold=0.2)
Correlation
Gender / Legal highs consumption 0.327
Gender / Cannabis consumption 0.309
Gender / Lysergic acid diethylamide consumption 0.286
Gender / Magic mushrooms consumption 0.278
Gender / Ecstasy consumption 0.241
Age / Cannabis consumption 0.240
Age / Lysergic acid diethylamide consumption 0.233
Country / Cannabis consumption 0.229
Age / Magic mushrooms consumption 0.221
Age / Legal highs consumption 0.215
Country / Lysergic acid diethylamide consumption 0.213
Age / Ecstasy consumption 0.203
Country / Magic mushrooms consumption 0.203

Let's look at the median cannabis and legal highs consumption for females and males; according to the table above, these are the two drugs most correlated with Gender.

The significance is based on the Mann-Whitney U test.

"In statistics, the Mann–Whitney U test is a nonparametric test of the null hypothesis that, for randomly selected values X and Y from two populations, the probability of X being greater than Y is equal to the probability of Y being greater than X"(Wikipedia)

The difference test is implemented in the function below.

Here the two populations are male and female.

def diff_test(dataset, sample_attr, diff_attr, sample_attr_split='median', diff_attr_type='ordered', split_point=None, alpha=0.05):
    ''' Difference significance test on dataset.
    Returns (p-value of the no-difference hypothesis,
             the split point used,
             the central value of group 1,
             the central value of group 2,
             a human-readable summary).
    
    sample_attr - column used to split the dataset into two samples (at its median by default);
    diff_attr - attribute whose value is compared between the two sample groups;
    diff_attr_type - type of the compared data: 'ordered' (compare medians) or 'interval' (compare means)
    '''
    # In case of a big imbalance towards the highest rank (it holds more than half the values),
    # the median equals the highest level and splitting at it would leave one group empty,
    # so we split one rank lower instead.
    if split_point is None and sample_attr_split == "median":
        levels = np.sort(dataset[sample_attr].unique())
        if not (levels > dataset[sample_attr].median()).any():  # the highest level equals the median
            split_point = levels[-2]  # split one rank lower
    
    if split_point is None:
        split_point = dataset[sample_attr].median() if sample_attr_split=='median' else dataset[sample_attr].mean()

    group_1 = dataset[dataset[sample_attr] > split_point]
    group_2 = dataset[dataset[sample_attr] <= split_point]
    
    group_1_diff_attr_central = group_1[diff_attr].median() if diff_attr_type=='ordered' else group_1[diff_attr].mean()
    group_2_diff_attr_central = group_2[diff_attr].median() if diff_attr_type=='ordered' else group_2[diff_attr].mean()
    
    if diff_attr_type == 'ordered':
        diff_sign, p = mannwhitneyu(group_1[diff_attr], group_2[diff_attr])
    elif diff_attr_type == 'interval':
        diff_sign, p = ttest_ind(group_1[diff_attr], group_2[diff_attr])
    else:
        raise ValueError('diff_attr_type should be one of ordered and interval')
    are = 'are' if p < alpha else 'are not'
    sample_central = 'split point' if split_point is not None else sample_attr_split
    diff_central = 'median' if diff_attr_type=='ordered' else 'mean'
    
    return (p, split_point, group_1_diff_attr_central, group_2_diff_attr_central,
            f'First group: {sample_attr} above {sample_central} value {round(split_point, 3)}\n Second group: {sample_attr} equal or below {sample_central} value {round(split_point, 3)} \n First group {diff_attr} {diff_central}: {round(group_1_diff_attr_central, 3)} \n Second group {diff_attr} {diff_central}: {round(group_2_diff_attr_central, 3)} \n Difference significance for samples: {round(diff_sign, 3)} with p-value: {round(p, 3)} \n Samples {are} statistically different.')
val_between_male_female = (dataset['Gender'][demo_data['Gender'] == 'Male'].iloc[0] + dataset['Gender'][demo_data['Gender'] == 'Female'].iloc[0])/ 2
print(diff_test(dataset, 'Gender', 'Cannabis consumption', split_point=val_between_male_female)[-1])
First group: Gender above split point value 0.0
 Second group: Gender equal or below split point value 0.0 
 First group Cannabis consumption median: 2.0 
 Second group Cannabis consumption median: 4.0 
 Difference significance for samples: 288686.5 with p-value: 0.0 
 Samples are statistically different.
print(diff_test(dataset, 'Gender', 'Legal highs consumption', split_point=val_between_male_female)[-1])
First group: Gender above split point value 0.0
 Second group: Gender equal or below split point value 0.0 
 First group Legal highs consumption median: 0.0 
 Second group Legal highs consumption median: 2.0 
 Difference significance for samples: 290886.5 with p-value: 0.0 
 Samples are statistically different.

Females (the first group) consume less cannabis and legal highs than males.

As a reminder, here are the different levels for each drug:

0 - Never Used (median female legal highs consumer)

1 - Used over a Decade Ago

2 - Used in Last Decade (median female cannabis consumer, median male legal high consumer)

3 - Used in Last Year

4 - Used in Last Month (median male cannabis consumer)

5 - Used in Last Week

6 - Used in Last Day

And the values for male and female, respectively, are:

(dataset['Gender'][demo_data['Gender'] == 'Male'].iloc[0], dataset['Gender'][demo_data['Gender'] == 'Female'].iloc[0])
(-0.48246, 0.48246)

Nominal correlations between demographic info and personality traits (using Cramér's V)

demo_psycho_corr_table, demo_psycho_pvalues = nominal_corrs(dataset, demographic_columns, personality_columns)
demo_psycho_corr_table
Neuroticism Extraversion Openness to experience Agreeableness Conscientiousness Impulsiveness Sensation seeking
Age 0.033 0.0 0.139 *** 0.0 0.095 *** 0.103 *** 0.159 ***
Gender 0.089 . 0.063 0.139 *** 0.22 *** 0.173 *** 0.19 *** 0.261 ***
Education 0.054 . 0.032 0.094 *** 0.0 0.112 *** 0.072 *** 0.096 ***
Country 0.047 0.086 *** 0.141 *** 0.036 0.105 *** 0.098 *** 0.144 ***
Ethnicity 0.0 0.0 0.04 0.0 0.0 0.045 * 0.046 *
demo_psycho_corr_table_no_pvalues, _ = nominal_corrs(dataset, demographic_columns, personality_columns, pvalues=False)
heatmap_corr(demo_psycho_corr_table_no_pvalues, ready=True, nominal=True, mask=False);

Neuroticism has no statistically significant correlation with any demographic variable.

Extraversion has only one statistically significant correlation, with Country, and even that one is negligible.

Education and Ethnicity have no correlation with any personality trait (too weak or not significant).

Gender seems to have the strongest (least weak) correlations.

There are no moderate or strong correlations, only pretty weak ones at most.

Let's list the weak correlations (|r| > 0.15).

get_ranked_corr(demo_psycho_corr_table_no_pvalues, demo_psycho_pvalues, corr_threshold=0.15)
Correlation
Gender / Sensation seeking 0.261
Gender / Agreeableness 0.220
Gender / Impulsiveness 0.190
Gender / Conscientiousness 0.173
Age / Sensation seeking 0.159

Let's display the median of Agreeableness and Sensation seeking for each gender.

print(diff_test(dataset, 'Gender', 'Agreeableness', split_point=val_between_male_female)[-1])
First group: Gender above split point value 0.0
 Second group: Gender equal or below split point value 0.0 
 First group Agreeableness median: 0.288 
 Second group Agreeableness median: -0.302 
 Difference significance for samples: 327571.0 with p-value: 0.0 
 Samples are statistically different.
print(diff_test(dataset, 'Gender', 'Sensation seeking', split_point=val_between_male_female)[-1])
First group: Gender above split point value 0.0
 Second group: Gender equal or below split point value 0.0 
 First group Sensation seeking median: -0.216 
 Second group Sensation seeking median: 0.401 
 Difference significance for samples: 312536.0 with p-value: 0.0 
 Samples are statistically different.

Women tend to be more agreeable and less sensation seeking than men.

Difference significance tests

In this section, we will split the population in 2 groups based on a feature.

All the features are rank based, so we will use the Mann–Whitney U test against a target to see whether the target differs in a statistically significant way between the two groups.

Important: The test can show a significant difference even when the median values are the same. The significance means "for randomly selected values X and Y from two populations, the probability of X being greater than Y is not equal to the probability of Y being greater than X."
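
To make that concrete, here is a toy illustration with made-up data (not from the survey): two samples with identical medians but different shapes around the median still come out as significantly different.

# Toy illustration (made-up data): equal medians, yet the Mann-Whitney U test
# reports a difference because the values above/below the median differ systematically.
group_a = np.concatenate([np.zeros(100), np.full(101, 5), np.full(100, 6)])
group_b = np.concatenate([np.full(100, 4), np.full(101, 5), np.full(100, 100)])
print(np.median(group_a), np.median(group_b))  # both medians are 5.0
print(mannwhitneyu(group_a, group_b, alternative='two-sided'))  # p-value far below 0.05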

The splitting point will be the median, except if the median is the highest rank the feature can have. In this case we split one rank lower.

  • Group 1: respondents > split.
  • Group 2: respondents <= split.

In the following section, we only show targets with a statistically significant difference and with different median values between the two groups.

Note: The tables below contain some NaN values. This happens when a cell has no significant difference but the column is still needed for other rows.

We won't comment on them but feel free to explore. The tables are quite self-explanatory.

def get_significant_diff_tests(dataset, samples, diff_attrs, alpha=0.05, display_same_center=False):
    '''
    Perform difference tests on several splitting attributes for different targets.
    '''
    samp_dfs = []
    for sample in samples:
        diff_dfs = []
        sig_attrs = []  # targets actually kept, so the concat keys stay aligned with the frames
        for diff_attr in diff_attrs:
            levels = dataset[sample].unique()
            split = None
            if len(levels) == 2:
                # Binary feature: split halfway between its two encoded values
                split = (dataset[sample][dataset[sample] == levels[0]].iloc[0] + dataset[sample][dataset[sample] == levels[1]].iloc[0]) / 2
            
            p, split_point, group_1_diff_attr_central, group_2_diff_attr_central, _ = diff_test(dataset, sample, diff_attr, split_point=split, alpha=alpha)
            
            if p < alpha and (group_1_diff_attr_central != group_2_diff_attr_central or display_same_center):
                split_point = round(split_point, 5)
                diff_dfs.append(pd.DataFrame({'Median': [group_1_diff_attr_central, group_2_diff_attr_central], 'Pval / significant diff ?': [p, "Yes" if p < alpha else "No"]}, index=[f"> {split_point}", f"<= {split_point}"]))
                sig_attrs.append(diff_attr)
        if diff_dfs:
            samp_dfs.append(pd.concat(diff_dfs, axis=1, keys=sig_attrs))
    
    return pd.concat(samp_dfs, axis=0, keys=samples) if samp_dfs else None

Difference on drug consumption splitting on demographic

get_significant_diff_tests(dataset, demographic_columns, drugs_columns)
Alcohol consumption Amphetamines consumption Amyl nitrite consumption Benzodiazepine consumption Caffeine consumption Cannabis consumption Chocolate consumption Cocaine consumption Crack consumption Ecstasy consumption
Median Pval / significant diff ? Median Pval / significant diff ? Median Pval / significant diff ? Median Pval / significant diff ? Median Pval / significant diff ? Median Pval / significant diff ? Median Pval / significant diff ? Median Pval / significant diff ? Median Pval / significant diff ? Median Pval / significant diff ?
Age > -0.07854 0.0 0.000000 1.0 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.000000 2.0 0.000000 NaN NaN NaN NaN NaN NaN NaN NaN
<= -0.07854 1.0 Yes 4.0 Yes 2.0 Yes 2.0 Yes 2.0 Yes 4.0 Yes NaN NaN NaN NaN NaN NaN NaN NaN
Gender > 0.0 0.0 0.000000 0.0 0.000000 2.0 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.000000 2.0 0.000000 NaN NaN
<= 0.0 1.0 Yes 1.0 Yes 4.0 Yes 1.0 Yes 2.0 Yes 2.0 Yes 1.0 Yes 1.0 Yes 4.0 Yes NaN NaN
Education > -0.05921 0.0 0.000000 0.0 0.000000 2.0 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.000000 2.0 0.000000 NaN NaN NaN NaN NaN NaN
<= -0.05921 1.0 Yes 1.0 Yes 4.0 Yes 1.0 Yes 2.0 Yes 1.0 Yes 5.0 Yes NaN NaN NaN NaN NaN NaN
Country > -0.46841 0.0 0.000000 0.0 0.000000 2.0 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.000000 2.0 0.0
<= -0.46841 2.0 Yes 3.0 Yes 5.0 Yes 2.0 Yes 2.0 Yes 3.0 Yes 2.0 Yes 2.0 Yes 2.0 Yes 5.0 Yes
Ethnicity > -0.31685 1.0 0.010986 2.0 0.000049 4.0 0.001836 0.5 0.009257 1.0 0.044104 1.0 0.011844 1.0 0.002797 1.0 0.011986 4.0 0.045376 NaN NaN
<= -0.31685 0.0 Yes 0.0 Yes 3.0 Yes 0.0 Yes 0.0 Yes 0.0 Yes 0.0 Yes 0.0 Yes 3.0 Yes NaN NaN

Difference on drug consumption splitting on personality traits

get_significant_diff_tests(dataset, personality_columns, drugs_columns)
Alcohol consumption Amphetamines consumption Amyl nitrite consumption Benzodiazepine consumption Caffeine consumption Cannabis consumption Chocolate consumption Cocaine consumption Crack consumption
Median Pval / significant diff ? Median Pval / significant diff ? Median Pval / significant diff ? Median Pval / significant diff ? Median Pval / significant diff ? Median Pval / significant diff ? Median Pval / significant diff ? Median Pval / significant diff ? Median Pval / significant diff ?
Neuroticism > 0.04257 1.0 0.000000 2.0 0.0 3.0 0.0 1.0 0.0 1.0 0.000081 1.0 0.00054 4.0 0.000000 NaN NaN NaN NaN
<= 0.04257 0.0 Yes 0.0 Yes 2.0 Yes 0.0 Yes 0.0 Yes 0.0 Yes 2.0 Yes NaN NaN NaN NaN
Extraversion > -0.01928 1.0 0.000000 2.0 0.0 4.0 0.0 1.0 0.0 2.0 0.000000 2.0 0.00000 1.0 0.000000 2.0 0.0 4.0 0.0
<= -0.01928 0.0 Yes 0.0 Yes 2.0 Yes 0.0 Yes 0.0 Yes 0.0 Yes 0.0 Yes 0.0 Yes 2.0 Yes
Openness to experience > -0.01729 0.0 0.000001 0.0 0.0 2.0 0.0 0.0 0.0 0.0 0.000004 0.0 0.00000 3.0 0.000068 NaN NaN NaN NaN
<= -0.01729 1.0 Yes 1.0 Yes 3.0 Yes 1.0 Yes 0.5 Yes 1.0 Yes 4.0 Yes NaN NaN NaN NaN
Agreeableness > -0.00665 0.0 0.000000 0.0 0.0 2.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.00000 0.0 0.000000 0.0 0.0 2.0 0.0
<= -0.00665 1.0 Yes 1.0 Yes 4.0 Yes 1.0 Yes 2.0 Yes 2.0 Yes 1.0 Yes 1.0 Yes 4.0 Yes
Conscientiousness > -0.21712 1.0 0.000000 1.0 0.0 4.0 0.0 1.0 0.0 2.0 0.000000 2.0 0.00000 1.0 0.000000 1.0 0.0 4.0 0.0
<= -0.21712 0.0 Yes 0.0 Yes 2.0 Yes 0.0 Yes 0.0 Yes 0.0 Yes 0.0 Yes 0.0 Yes 2.0 Yes
Impulsiveness > 0.07987 2.0 0.000000 2.0 0.0 5.0 0.0 2.0 0.0 2.0 0.000000 2.0 0.00000 1.0 0.000000 2.0 0.0 5.0 0.0
<= 0.07987 0.0 Yes 0.0 Yes 2.0 Yes 0.0 Yes 0.0 Yes 0.0 Yes 0.0 Yes 0.0 Yes 2.0 Yes

Difference on personality traits splitting on demographic

get_significant_diff_tests(dataset, demographic_columns, personality_columns)
Agreeableness Conscientiousness Extraversion Impulsiveness Neuroticism Openness to experience Sensation seeking
Median Pval / significant diff ? Median Pval / significant diff ? Median Pval / significant diff ? Median Pval / significant diff ? Median Pval / significant diff ? Median Pval / significant diff ? Median Pval / significant diff ?
Age > -0.07854 -0.21712 0.000000 -0.21575 0.0 -0.247775 0.000000 NaN NaN -0.14882 0.000000 0.25953 0.000000 NaN NaN
<= -0.07854 0.19268 Yes 0.40148 Yes 0.141430 Yes NaN NaN 0.13606 Yes -0.14277 Yes NaN NaN
Gender > 0.0 0.28783 0.000000 0.25953 0.0 0.003320 0.002804 -0.21712 0.0 0.04257 0.003225 -0.17779 0.000000 -0.21575 0.0
<= 0.0 -0.30172 Yes -0.27607 Yes -0.154870 Yes 0.19268 Yes -0.05188 Yes 0.14143 Yes 0.40148 Yes
Education > -0.05921 0.25953 0.000000 -0.21712 0.0 0.167670 0.000017 -0.21575 0.0 -0.10035 0.000112 0.13136 0.000237 NaN NaN
<= -0.05921 -0.27607 Yes 0.19268 Yes -0.154870 Yes 0.07987 Yes 0.04257 Yes -0.01729 Yes NaN NaN
Country > -0.46841 0.13136 0.000005 0.12331 0.0 0.003320 0.000025 -0.21712 0.0 -0.14882 0.000000 -0.17779 0.000000 -0.21575 0.0
<= -0.46841 -0.15487 Yes -0.27607 Yes -0.154870 Yes 0.19268 Yes 0.31287 Yes 0.44585 Yes 0.40148 Yes
Ethnicity > -0.31685 NaN NaN NaN NaN 0.529750 0.001346 NaN NaN 0.14143 0.014002 0.40148 0.000401 NaN NaN
<= -0.31685 NaN NaN NaN NaN -0.217120 Yes NaN NaN -0.01928 Yes 0.07987 Yes NaN NaN

Consumer vs Non-consumer

In practice, we are more interested in predicting which drugs an individual consumes than in knowing how long ago they last used them.

We want to regroup the different levels we currently have for each drug into two classes: User and Non-User.

Year and decade based split for binarization of targets

We are going to compare year and decade based splits.

  • Year based: Consumer = Used in last year
  • Decade based: Consumer = Used in last decade
dataset_decade = dataset.copy()
for i in drugs_columns:
    dataset_decade[i] = ["Non Consumer" if c < 2 else "Consumer" for c in dataset[i]]
    
dataset_year = dataset.copy()
for i in drugs_columns:
    dataset_year[i] = ["Non Consumer" if c < 3 else "Consumer" for c in dataset[i]]
    
compare_year_decade = pd.concat([dataset_decade.assign(split="decade"), dataset_year.assign(split="year")])
plot("count", compare_year_decade, drugs_columns, fig_title="Count / Frequency consumer comparison between decade and year split\n\n\n", rotate_label=False, add_args={"hue": "split", "palette": "Set2"})

Note: Since we concatenated two copies of the original dataset, you can get the correct percentage within a split (decade or year) by doubling the displayed value; for example, a bar at 30% corresponds to 60% of that split. Doubling also doubles the gap between the two splits!
Apart from Alcohol, Caffeine, Cannabis and Chocolate consumption (for which it doesn't change much anyway), the decade based split is more balanced.

We could choose the decade based split for model training and inference because of the balance.

Also, someone may not have consumed a particular drug in the last year simply because they didn't get the opportunity, or because they were trying other drugs during that period, and they could still take it again in the future.

Whereas if they haven't consumed it in the last ten years, they probably don't plan to consume it again.

This rule can't really be applied to 18-24 year olds.

We will continue our analysis considering both splits.

Data analysis

grouped_df_decade = pd.concat([dataset_decade.groupby(drug).size() for drug in drugs_columns], axis=1)
grouped_df_decade.columns = drugs_columns

grouped_df_year = pd.concat([dataset_year.groupby(drug).size() for drug in drugs_columns], axis=1)
grouped_df_year.columns = drugs_columns
grouped_df_decade.T.plot.bar(color=['red', 'green'], figsize=(15,7), title="Count plot of consumer / non consumer for every drug");
  • Decade based split

The consumers (red) are in a huge majority for Alcohol, Caffeine and Chocolate.

They are also around double the number of non-consumers for Cannabis and Nicotine. This may be due to the fact that lots of people try these out at least once, and the last time happened to be in the last decade.

For the other drugs, non-consumers are the majority.

For Cocaine, Benzodiazepine, Amphetamines, Ecstasy, Legal highs and Magic mushrooms, the consumers number more than half as many as the non-consumers.

grouped_df_year.T.plot.bar(color=['red', 'green'], figsize=(15,7), title="Count plot of consumer / non consumer for every drug");
  • Year based split

The drugs with a huge majority of consumers are the same.

Cannabis and Nicotine still have a majority of consumers. Our previous hypothesis wasn't as relevant as I had expected.

The drugs cited previously, whose consumers numbered more than half of their non-consumers, are now consumed in a smaller proportion, but the size of the decrease isn't the same for all of them.

For some drugs the proportions didn't change much, meaning people either consume them (on a yearly basis or more frequently) or not at all.

Let's encode drug users as 1 and non-users as 0.

for i in drugs_columns:
    dataset_decade[i] = [0 if c < 2 else 1 for c in dataset[i]]
    
for i in drugs_columns:
    dataset_year[i] = [0 if c < 3 else 1 for c in dataset[i]]
    
compare_year_decade = pd.concat([dataset_decade.assign(split="decade"), dataset_year.assign(split="year")])
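
As a side note, the same 0/1 encoding can be written in a vectorized way (a sketch, assuming the drug columns are integer-coded from 0 to 6 as above):

# Vectorized equivalent of the two loops above (sketch):
dataset_decade[drugs_columns] = (dataset[drugs_columns] >= 2).astype(int)
dataset_year[drugs_columns] = (dataset[drugs_columns] >= 3).astype(int)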

Let's see the count of individuals per number of drugs used.

np_drugs = np.array([drug.split(' ', -1)[0] for drug in drugs_columns])

combinations = compare_year_decade[drugs_columns].apply(
    lambda x: pd.Series({
        'names':'/'.join(np_drugs[x.values.astype(bool)]), 
        'index': np.where(x.values)[0], 'number of drugs': x.values.sum()}), axis=1)

combinations['split'] = compare_year_decade['split']

year_split = combinations['split'] == "year"

combinations.head()
names index number of drugs split
ID
1 Alcohol/Amphetamines/Benzodiazepine/Caffeine/C... [0, 1, 3, 4, 6, 16] 6 decade
2 Alcohol/Amphetamines/Amyl/Caffeine/Cannabis/Ch... [0, 1, 2, 4, 5, 6, 7, 9, 11, 13, 14, 16] 12 decade
3 Alcohol/Caffeine/Cannabis/Chocolate [0, 4, 5, 6] 4 decade
4 Alcohol/Benzodiazepine/Caffeine/Cannabis/Choco... [0, 3, 4, 5, 6, 7, 11, 16] 8 decade
5 Alcohol/Caffeine/Cannabis/Chocolate/Magic/Nico... [0, 4, 5, 6, 15, 16] 6 decade
combinations[~year_split].describe()
number of drugs
count 1877.000000
mean 7.785296
std 4.212409
min 0.000000
25% 4.000000
50% 7.000000
75% 11.000000
max 18.000000

For the decade split, the mean number of drugs consumed is a bit lower than 8. At least 25% of the respondents are users of 11 drugs or more. The median is 7 drugs. The mean is slightly higher than the median, meaning the distribution is slightly positively skewed. The number of drugs used ranges from 0 to 18 (all of them).

Note: The respondent(s) who consumed all of them probably didn't lie; otherwise they would also have claimed to use the fictitious drug Semeron.

Note: The median and 3rd-quartile numbers of drugs seem very high to me. I feel like this isn't representative of the real-life population (in the countries considered here), but maybe I am just no fun at parties :cry:.
combinations[year_split].describe()
number of drugs
count 1877.000000
mean 6.200852
std 3.468506
min 0.000000
25% 3.000000
50% 5.000000
75% 9.000000
max 17.000000

For the year split, the mean number of drugs consumed is a bit above 6. At least 25% of the respondents are users of 9 drugs or more. The median is 5 drugs. The mean is higher than the median, meaning the distribution is positively skewed. The number of drugs used ranges from 0 to 17. No one consumed all the drugs in the last year.

catplot(combinations, 'number of drugs', hue=None, col='split', rotate_label=False, palette="Set2")
  • Decade split

Barely anyone used 0, 1, 2, 17 or all 18 drugs. The most common counts are 3, 4, 5 and 6 drugs.

An interesting fact is that 10 drugs has more respondents than 7, 8 or 9. I guess there is a threshold for some people: once they go past 6 drugs, they tend to try out a bunch more.

  • Year split

Again, barely anyone used 0, 1, or more than 15 drugs.

The number of respondents decreases monotonically as the number of drugs increases (>= 3 drugs). We don't see the same bump around 10, which strengthens my previous hypothesis.

3 drugs is now by far the most common count!

Let's see which are the most common combinations!
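
The value_counts_percentage helper was defined earlier in the post; the idea boils down to showing the raw counts next to their share of the total, roughly like this sketch (hypothetical name):

# Sketch of a value_counts_percentage-style helper: counts and percentages.
def value_counts_pct(df, column):
    counts = df[column].value_counts()
    return pd.DataFrame({'N': counts, '%': (100 * counts / counts.sum()).round(2)})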

value_counts_percentage(combinations[~year_split], 'names').head(10)
N %
Alcohol/Caffeine/Chocolate 290 15.45
Alcohol/Caffeine/Chocolate/Nicotine 122 6.50
Alcohol/Caffeine/Cannabis/Chocolate/Nicotine 100 5.33
Alcohol/Benzodiazepine/Caffeine/Chocolate 42 2.24
Alcohol/Caffeine/Cannabis/Chocolate 41 2.18
Alcohol/Benzodiazepine/Caffeine/Chocolate/Nicotine 23 1.23
Alcohol/Benzodiazepine/Caffeine/Cannabis/Chocolate/Nicotine 20 1.07
Caffeine/Chocolate 18 0.96
Alcohol/Amyl/Caffeine/Cannabis/Chocolate/Nicotine 16 0.85
Alcohol/Caffeine/Cannabis/Chocolate/Legal/Nicotine 15 0.80
value_counts_percentage(combinations[year_split], 'names').head(10)
N %
Alcohol/Caffeine/Chocolate 453 24.13
Alcohol/Caffeine/Chocolate/Nicotine 148 7.88
Alcohol/Caffeine/Cannabis/Chocolate/Nicotine 84 4.48
Caffeine/Chocolate 38 2.02
Alcohol/Benzodiazepine/Caffeine/Chocolate 34 1.81
Alcohol/Caffeine/Cannabis/Chocolate 31 1.65
Alcohol/Chocolate 23 1.23
Alcohol/Caffeine/Cannabis/Chocolate/Legal/Nicotine 22 1.17
Alcohol/Benzodiazepine/Caffeine/Cannabis/Chocolate/Nicotine 21 1.12
Alcohol/Benzodiazepine/Caffeine/Chocolate/Nicotine 19 1.01

No wonder the counts of users of 3, 4 or 5 drugs were so high!

But really, we are not that interested in Chocolate, Caffeine, Nicotine and Alcohol.

Let's redo all this without taking them into account.

np_drugs_illegal = np.array([drug.split(' ', -1)[0] for drug in drugs_illegal])


combinations_illegal = compare_year_decade[drugs_illegal].apply(
    lambda x: pd.Series({
        'names':'/'.join(np_drugs_illegal[x.values.astype(bool)]), 
        'index illegal': np.where(x.values)[0], 'number of illegal drugs': x.values.sum()}), axis=1)

combinations_illegal['split'] = compare_year_decade['split']
catplot(combinations_illegal, "number of illegal drugs", hue=None, col="split", rotate_label=False, palette="Set2")

Removing the legal drugs shifted everything to the left (basically merging 0, 1, 2 and 3). The shape of the distribution stays roughly the same.

value_counts_percentage(combinations_illegal[~year_split], 'names').head(10)
N %
467 24.88
Cannabis 149 7.94
Benzodiazepine 74 3.94
Benzodiazepine/Cannabis 30 1.60
Cannabis/Legal 19 1.01
Amphetamines/Amyl/Benzodiazepine/Cannabis/Cocaine/Ecstasy/Ketamine/Legal/Lysergic/Magic 19 1.01
Amyl/Cannabis 18 0.96
Cannabis/Legal/Magic 15 0.80
Cannabis/Cocaine 13 0.69
Cannabis/Ecstasy/Legal/Lysergic/Magic 11 0.59
value_counts_percentage(combinations_illegal[year_split], 'names').head(10)
N %
706 37.61
Cannabis 134 7.14
Benzodiazepine 61 3.25
Benzodiazepine/Cannabis 32 1.70
Cannabis/Legal 30 1.60
Cannabis/Ecstasy 14 0.75
Cannabis/Magic 13 0.69
Cannabis/Ecstasy/Legal/Lysergic/Magic 12 0.64
Amphetamines/Cannabis 12 0.64
Cannabis/Ecstasy/Lysergic/Magic 12 0.64

A lot more people (37.61%) in the year split didn't consume any drug (not counting alcohol, caffeine, chocolate and nicotine) compared to the decade split (24.88%).

The first 4 rows (combinations of drugs) are the same, but then there is some variation (different order or different combinations). The proportions at each rank are about the same for both splits, apart from the first row.

Cannabis comes second (with between 7% and 8% in both splits), as it is legal in some countries and a very popular drug. It is also often present in the popular combinations.

Let's redo it without cannabis.

drugs_illegal_no_cannabis = [drug for drug in drugs_illegal if drug != "Cannabis consumption"]

np_drugs_illegal_no_cannabis = np.array([drug.split(' ', -1)[0] for drug in drugs_illegal_no_cannabis])

combinations_illegal_no_cannabis = compare_year_decade[drugs_illegal_no_cannabis].apply(
    lambda x: pd.Series({
        'names':'/'.join(np_drugs_illegal_no_cannabis[x.values.astype(bool)]), 
        'index illegal no cannabis': np.where(x.values)[0], 'number of illegal (no cannabis) drugs': x.values.sum()}), axis=1)

combinations_illegal_no_cannabis['split'] = compare_year_decade['split']

catplot(combinations_illegal_no_cannabis, "number of illegal (no cannabis) drugs", hue=None, col="split", rotate_label=False, palette="Set2")

The gap between 0 and 1 drugs became even larger.

value_counts_percentage(combinations_illegal_no_cannabis[~year_split], 'names').head(20)
N %
616 32.82
Benzodiazepine 104 5.54
Amyl 24 1.28
Legal 22 1.17
Amphetamines/Amyl/Benzodiazepine/Cocaine/Ecstasy/Ketamine/Legal/Lysergic/Magic 19 1.01
Cocaine 18 0.96
Legal/Magic 16 0.85
Benzodiazepine/Methadone 13 0.69
Ecstasy/Legal/Lysergic/Magic 11 0.59
Amphetamines/Benzodiazepine/Cocaine/Crack/Ecstasy/Heroin/Ketamine/Legal/Lysergic/Methadone/Magic 11 0.59
Methadone 11 0.59
Magic 10 0.53
Volatile 10 0.53
Amphetamines/Benzodiazepine 9 0.48
Ecstasy 9 0.48
Amphetamines/Benzodiazepine/Ecstasy/Legal/Lysergic/Magic 9 0.48
Amphetamines/Cocaine/Ecstasy 9 0.48
Amphetamines/Benzodiazepine/Cocaine/Ecstasy/Ketamine/Legal/Lysergic/Methadone/Magic 9 0.48
Amphetamines/Amyl/Benzodiazepine/Cocaine/Crack/Ecstasy/Heroin/Ketamine/Legal/Lysergic/Methadone/Magic/Volatile 8 0.43
Amphetamines 8 0.43
value_counts_percentage(combinations_illegal_no_cannabis[year_split], 'names').head(20)
N %
840 44.75
Benzodiazepine 93 4.95
Legal 35 1.86
Benzodiazepine/Methadone 18 0.96
Ecstasy 16 0.85
Methadone 15 0.80
Cocaine 14 0.75
Magic 13 0.69
Amphetamines 13 0.69
Ecstasy/Legal/Lysergic/Magic 12 0.64
Legal/Magic 12 0.64
Ecstasy/Lysergic/Magic 12 0.64
Lysergic/Magic 12 0.64
Amphetamines/Benzodiazepine 11 0.59
Legal/Lysergic/Magic 10 0.53
Cocaine/Ecstasy 10 0.53
Amphetamines/Benzodiazepine/Methadone 9 0.48
Cocaine/Ecstasy/Legal 9 0.48
Amphetamines/Benzodiazepine/Cocaine/Ecstasy/Ketamine/Legal/Lysergic/Magic 9 0.48
Ecstasy/Legal 8 0.43

Here we have the 20 most common non-trivial drug combinations for both splits.

The decade split features more long combinations, meaning some respondents tried out several drugs over the decade but didn't pick them all up for more regular consumption.
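
One quick way to back up that impression (using the columns defined above; output not shown) is to compare the average number of illegal, non-cannabis drugs per respondent in each split:

# Average number of illegal (non-cannabis) drugs per respondent, per split
combinations_illegal_no_cannabis.groupby('split')['number of illegal (no cannabis) drugs'].mean()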

compare_year_decade[drugs_columns].apply(lambda x: x[drugs_legal].sum() == 0 and x[drugs_illegal].sum() > 0, axis=1).any()
False

I was curious whether any respondent could be a non-user of alcohol, caffeine, chocolate and nicotine while being a user of other drugs. Apparently not, for either split!

Conclusion

I hope you learnt lots reading this post. I certainly did! :smiley:

Warning: Don't try out all the popular combinations of drugs we just uncovered!

I plan on training a predictive model in a future post, so stay tuned.

Did I miss out on anything? Do you have any remark? Please let me know by annotating using Hypothesis or by commenting below!