This post is a great way to dive into data analysis and get familiar with Python tools such as pandas, seaborn, matplotlib, and NumPy.

This is also an opportunity to work on multilabel data (several target variables that are not mutually exclusive).

Setting up the workspace

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns
import matplotlib as mpl
import math
from scipy.stats import pearsonr, spearmanr, chi2_contingency, ttest_ind, mannwhitneyu, norm, normaltest, shapiro, anderson
import operator
from IPython.display import HTML, display
sns.set(rc={'figure.figsize':(10,5)})
sns.set_style('whitegrid')
%matplotlib inline

Dataset preparation

The Drug consumption (quantified) Data Set is a multilabel dataset from the UCI Machine Learning Repository.

The dataset contains records for 1885 respondents.

To summarize, it contains:

  • an ID column
  • 5 demographic columns (features)
  • 7 personality traits (features)
  • 18 drugs with their usage frequency (target)
  • a fake drug called Semeron to verify the reliability of answers

Each drug variable can take 7 different values (a small lookup dictionary for these labels is sketched right after the list):

  • CL0 Never Used
  • CL1 Used over a Decade Ago
  • CL2 Used in the Last Decade
  • CL3 Used in the Last Year
  • CL4 Used in the Last Month
  • CL5 Used in the Last Week
  • CL6 Used in the Last Day
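
For later convenience (readable labels on plots and tables, for example), we can keep this mapping in a small dictionary. This lookup table is our own addition, not part of the dataset files:

# Optional helper: map the CL codes to human-readable labels
drug_use_labels = {
    'CL0': 'Never Used',
    'CL1': 'Used over a Decade Ago',
    'CL2': 'Used in the Last Decade',
    'CL3': 'Used in the Last Year',
    'CL4': 'Used in the Last Month',
    'CL5': 'Used in the Last Week',
    'CL6': 'Used in the Last Day',
}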

All qualitative variables were transformed to real values (except for the target variables).

Tip: Read the dataset description for more details!

The following lists will be useful for loading and indexing the data.

demographic_columns = [
    'Age', 
    'Gender', 
    'Education', 
    'Country',
    'Ethnicity',
]

personality_columns = [
    'Neuroticism',
    'Extraversion',
    'Openness to experience',
    'Agreeableness',
    'Conscientiousness',
    'Impulsiveness',
    'Sensation seeking'
]

feature_columns = demographic_columns + personality_columns

drugs_columns = [
    'Alcohol consumption',
    'Amphetamines consumption',
    'Amyl nitrite consumption',
    'Benzodiazepine consumption',
    'Caffeine consumption',
    'Cannabis consumption',
    'Chocolate consumption',
    'Cocaine consumption',
    'Crack consumption',
    'Ecstasy consumption',
    'Heroin consumption',
    'Ketamine consumption',
    'Legal highs consumption',
    'Lysergic acid diethylamide consumption',
    'Methadone consumption',
    'Magic mushrooms consumption',
    'Nicotine consumption',
    'Fictitious drug Semeron consumption',
    'Volatile substance abuse consumption'
]

drugs_legal = ['Alcohol consumption', 'Caffeine consumption', 'Chocolate consumption', 'Nicotine consumption']

drugs_illegal = [drug for drug in drugs_columns if drug not in drugs_legal]

all_columns = feature_columns + drugs_columns

Loading the dataset

dataset = pd.read_csv("drug_consumption.data", names=["ID"] + all_columns)
dataset = dataset.set_index("ID")
dataset.head()
Age Gender Education Country Ethnicity Neuroticism Extraversion Openness to experience Agreeableness Conscientiousness ... Ecstasy consumption Heroin consumption Ketamine consumption Legal highs consumption Lysergic acid diethylamide consumption Methadone consumption Magic mushrooms consumption Nicotine consumption Fictitious drug Semeron consumption Volatile substance abuse consumption
ID
1 0.49788 0.48246 -0.05921 0.96082 0.12600 0.31287 -0.57545 -0.58331 -0.91699 -0.00665 ... CL0 CL0 CL0 CL0 CL0 CL0 CL0 CL2 CL0 CL0
2 -0.07854 -0.48246 1.98437 0.96082 -0.31685 -0.67825 1.93886 1.43533 0.76096 -0.14277 ... CL4 CL0 CL2 CL0 CL2 CL3 CL0 CL4 CL0 CL0
3 0.49788 -0.48246 -0.05921 0.96082 -0.31685 -0.46725 0.80523 -0.84732 -1.62090 -1.01450 ... CL0 CL0 CL0 CL0 CL0 CL0 CL1 CL0 CL0 CL0
4 -0.95197 0.48246 1.16365 0.96082 -0.31685 -0.14882 -0.80615 -0.01928 0.59042 0.58489 ... CL0 CL0 CL2 CL0 CL0 CL0 CL0 CL2 CL0 CL0
5 0.49788 0.48246 1.98437 0.96082 -0.31685 0.73545 -1.63340 -0.45174 -0.30172 1.30612 ... CL1 CL0 CL0 CL1 CL0 CL0 CL2 CL2 CL0 CL0

5 rows × 31 columns

We successfully loaded the dataset and set the ID column as the row index.

dataset.dtypes
Age                                       float64
Gender                                    float64
Education                                 float64
Country                                   float64
Ethnicity                                 float64
Neuroticism                               float64
Extraversion                              float64
Openness to experience                    float64
Agreeableness                             float64
Conscientiousness                         float64
Impulsiveness                             float64
Sensation seeking                         float64
Alcohol consumption                        object
Amphetamines consumption                   object
Amyl nitrite consumption                   object
Benzodiazepine consumption                 object
Caffeine consumption                       object
Cannabis consumption                       object
Chocolate consumption                      object
Cocaine consumption                        object
Crack consumption                          object
Ecstasy consumption                        object
Heroin consumption                         object
Ketamine consumption                       object
Legal highs consumption                    object
Lysergic acid diethylamide consumption     object
Methadone consumption                      object
Magic mushrooms consumption                object
Nicotine consumption                       object
Fictitious drug Semeron consumption        object
Volatile substance abuse consumption       object
dtype: object

All features are floats, while the drug consumption targets are objects (strings).

dataset.shape
(1885, 31)

1885 observations and 31 variables.

dataset.isnull().values.any()
False

There are no missing values.

Converting drug data from nominal to ordinal

for i in drugs_columns:
    dataset[i] = dataset[i].map({'CL0': 0, 'CL1': 1, 'CL2': 2, 'CL3': 3, 'CL4': 4, 'CL5': 5, 'CL6': 6})

We map each CLX code to the integer X. Now all variables are numeric.
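
A quick sanity check confirms the conversion:

# All drug columns should now have an integer dtype
# (an unmapped code would have produced NaN and a float column)
dataset[drugs_columns].dtypes.unique()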

Removing Semeron users

Let's see who claimed to have taken Semeron.

semerons = dataset[dataset['Fictitious drug Semeron consumption'] != 0]
semerons
Age Gender Education Country Ethnicity Neuroticism Extraversion Openness to experience Agreeableness Conscientiousness ... Ecstasy consumption Heroin consumption Ketamine consumption Legal highs consumption Lysergic acid diethylamide consumption Methadone consumption Magic mushrooms consumption Nicotine consumption Fictitious drug Semeron consumption Volatile substance abuse consumption
ID
730 -0.07854 0.48246 -1.73790 -0.09765 -0.31685 -0.58016 0.32197 0.14143 -0.60633 0.12331 ... 2 2 2 0 4 2 6 6 2 2
821 -0.95197 -0.48246 -0.61113 -0.09765 -0.50212 -0.67825 1.74091 0.72330 0.13136 0.41594 ... 3 0 0 0 5 0 5 4 3 0
1520 -0.95197 -0.48246 -0.61113 -0.57009 -0.31685 -0.24649 -0.80615 -1.27553 -1.34289 -1.92173 ... 1 2 1 2 1 2 4 2 3 1
1537 -0.95197 0.48246 -0.61113 -0.57009 0.11440 -0.46725 0.80523 0.29338 2.03972 1.81175 ... 4 0 4 3 2 0 3 4 4 3
1702 0.49788 0.48246 0.45468 -0.57009 -0.31685 1.98437 -0.80615 2.15324 0.76096 -0.00665 ... 2 0 2 2 2 0 2 6 2 0
1773 -0.95197 -0.48246 -1.22751 -0.57009 -0.22166 -0.34799 1.28610 1.06238 -0.01729 -0.52745 ... 3 0 4 3 6 3 3 3 1 3
1810 -0.95197 0.48246 -1.43719 -0.57009 -0.31685 1.23461 1.11406 1.06238 -1.47955 0.12331 ... 4 2 1 4 1 0 1 6 1 2
1827 -0.95197 0.48246 0.45468 -0.57009 -0.31685 0.22393 -0.30033 0.88309 1.28610 -0.00665 ... 0 0 0 2 3 0 3 5 2 0

8 rows × 31 columns

There are 8 people who claimed to have taken Semeron, which is a fake drug, so we should remove them.

dataset = dataset[dataset['Fictitious drug Semeron consumption'] == 0]
# Removing Semeron from the drug lists so it is ignored in later analyses
drugs_columns.remove('Fictitious drug Semeron consumption')
drugs_illegal.remove('Fictitious drug Semeron consumption')
# Note: the column itself stays in the dataframe (it now only contains zeros)
dataset.shape
(1877, 31)

We have 1877 respondents left.

Tip: If you ever acquire data from individuals, try a similar approach to get rid of untrustworthy observations.
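
As a generic pattern, this kind of attention-check filter can be wrapped in a small helper. The function below is a sketch of ours, not part of the original analysis:

def drop_failed_attention_checks(df, check_col, expected_value=0):
    '''
    Keep only respondents whose answer to an attention-check column matches the expected value.
    '''
    return df[df[check_col] == expected_value]

# Equivalent to the Semeron filtering done above:
# dataset = drop_failed_attention_checks(dataset, 'Fictitious drug Semeron consumption')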

Exploratory data analysis

dataset.describe()
Age Gender Education Country Ethnicity Neuroticism Extraversion Openness to experience Agreeableness Conscientiousness ... Ecstasy consumption Heroin consumption Ketamine consumption Legal highs consumption Lysergic acid diethylamide consumption Methadone consumption Magic mushrooms consumption Nicotine consumption Fictitious drug Semeron consumption Volatile substance abuse consumption
count 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000 ... 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000 1877.0 1877.000000
mean 0.037577 -0.000771 -0.000984 0.358984 -0.309728 -0.000551 -0.001951 -0.003224 -0.000657 -0.000394 ... 1.309536 0.372403 0.564198 1.353223 1.053277 0.826319 1.177944 3.194992 0.0 0.429409
std 0.878387 0.482588 0.949831 0.699707 0.166220 0.998442 0.997418 0.995691 0.996689 0.997657 ... 1.647373 1.034761 1.216341 1.790881 1.484582 1.648379 1.459212 2.415814 0.0 0.959160
min -0.951970 -0.482460 -2.435910 -0.570090 -1.107020 -3.464360 -3.273930 -3.273930 -3.464360 -3.464360 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.000000
25% -0.951970 -0.482460 -0.611130 -0.570090 -0.316850 -0.678250 -0.695090 -0.717270 -0.606330 -0.652530 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.0 0.000000
50% -0.078540 -0.482460 -0.059210 0.960820 -0.316850 0.042570 0.003320 -0.019280 -0.017290 -0.006650 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 3.000000 0.0 0.000000
75% 0.497880 0.482460 0.454680 0.960820 -0.316850 0.629670 0.637790 0.723300 0.760960 0.584890 ... 3.000000 0.000000 0.000000 3.000000 2.000000 0.000000 2.000000 6.000000 0.0 0.000000
max 2.591710 0.482460 1.984370 0.960820 1.907250 3.273930 3.273930 2.901610 3.464360 3.464360 ... 6.000000 6.000000 6.000000 6.000000 6.000000 6.000000 6.000000 6.000000 0.0 6.000000

8 rows × 31 columns

The table above is quite dense and hard to read. Let's create a plot to make it easier to interpret.

fig, ax = plt.subplots(figsize=(11, 15))
ax = sns.boxplot(data=dataset[feature_columns], orient="h", palette="Set2", ax=ax)
ax.set_ylabel("Features")
ax.set_title("Box plot of Pre-Processed Data Set")
sns.reset_orig()

It seems the dataset creators centered the features around 0 and scaled them to a standard deviation of roughly 1, so we can probably keep the current encoding to build a model.
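
We can double-check this impression numerically:

# Means should be close to 0 and standard deviations close to 1
dataset[feature_columns].describe().loc[['mean', 'std']].round(3)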

Demographic data analysis

Rearranging data

For the data analysis, we will transform the quantified categorical data back into a clearer, nominal form.

Note: Our mapping is based on the dataset description
demo_data = dataset.copy()
age = ['18-24' if a <= -0.9 else 
       '25-34' if a >= -0.5 and a < 0 else 
       '35-44' if a > 0 and a < 1 else 
       '45-54' if a > 1 and a < 1.5 else 
       '55-64' if a > 1.5 and a < 2 else 
       '65+' 
       for a in demo_data['Age']]

gender = ['Female' if g > 0 else "Male" for g in demo_data['Gender']]

education = ['Left school before 16 years' if e <-2 else 
             'Left school at 16 years' if e > -2 and e < -1.5 else 
             'Left school at 17 years' if e > -1.5 and e < -1.4 else 
             'Left school at 18 years' if e > -1.4 and e < -1 else 
             'Some college or university, no certificate or degree' if e > -1 and e < -0.5 else 
             'Professional certificate/ diploma' if e > -0.5 and e < 0 else 
             'University degree' if e > 0 and e < 0.5 else 
             'Masters degree' if e > 0.5 and e < 1.5 else 
             'Doctorate degree' 
             for e in demo_data['Education']]

country = ['USA' if c < -0.5 else 
           'New Zealand' if c > -0.5 and c < -0.4 else 
           'Other' if c > -0.4 and c < -0.2 else 
           'Australia' if c > -0.2 and c < 0 else 
           'Ireland' if c > 0 and c < 0.23 else 
           'Canada' if c > 0.23 and c < 0.9 else 
           'UK' 
           for c in demo_data['Country']]

ethnicity = ['Black' if e < -1 else 
             'Asian' if e > -1 and e < -0.4 else 
             'White' if e > -0.4 and e < -0.25 else 
             'Mixed-White/Black' if e >= -0.25 and e < 0.11 else 
             'Mixed-White/Asian' if e > 0.12 and e < 1 else 
             'Mixed-Black/Asian' if e > 1.9 else 
             'Other' 
             for e in demo_data['Ethnicity']]


demo_data['Age'] = age
demo_data['Gender'] = gender
demo_data['Education'] = education
demo_data['Country'] = country
demo_data['Ethnicity'] = ethnicity
demo_data[demographic_columns].head()
Age Gender Education Country Ethnicity
ID
1 35-44 Female Professional certificate/ diploma UK Mixed-White/Asian
2 25-34 Male Doctorate degree UK White
3 35-44 Male Professional certificate/ diploma UK White
4 18-24 Female Masters degree UK White
5 35-44 Female Doctorate degree UK White

This looks better :grin:
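
As a side note, the same recoding can be written more compactly with pd.cut, reusing the thresholds from the conditionals above as bin edges. Here is a sketch for the Age column only:

# Bin edges mirror the thresholds used in the list comprehension above
age_bins = [-np.inf, -0.9, 0, 1, 1.5, 2, np.inf]
age_labels = ['18-24', '25-34', '35-44', '45-54', '55-64', '65+']
age_cut = pd.cut(dataset['Age'], bins=age_bins, labels=age_labels).astype(str)
# For the values present in this dataset, age_cut matches the age list built above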

Proportion analysis

Let's start off by taking a look at the demographic data balance.

But first, we will define a few useful functions that will be reused throughout this analysis.

def plot_density(dataset, col, ax, plot_gaussian = True, color="Blue"):
    '''
    Extension of the seaborn histogram that plots, for a given column, an estimated normal distribution (if requested) on top of the fitted data distribution.
    '''
    
    if plot_gaussian:
        std = dataset[col].std()
        mean = dataset[col].mean()
    
    #plot histogram using seaborn
    ax = sns.histplot(dataset[col], ax=ax, color=color, kde=True, stat="probability", kde_kws={"bw_adjust":3})
    
    if plot_gaussian:
        # Limiting our gaussian to the limits from the above plot
        xmin, xmax = ax.get_xlim()
        xnorm = np.arange(xmin, xmax, 0.001)
        ynorm = norm.pdf(xnorm, mean, std)
        ax.plot(xnorm, ynorm, 'r', lw=2)
        ax.legend(["data distribution", "estimated normal distribution"], loc="upper center", bbox_to_anchor=(0.5, -0.05), fancybox=True, shadow=True, ncol=2)
    
    ax.set_title(col)
    ax.set_xlabel("")
def plot_pie(dataset, col, ax):  
    '''
    Pandas' pie plot wrapper
    '''
    ax = dataset[col].value_counts().plot(kind='pie', ax=ax)
    ax.set_title(col)
    ax.set_ylabel("")
def plot_count(dataset, col, ax, order = None, show_percent=True, rotate_label = True, add_args={"palette": "Set2"}):
    '''
    Extends the seaborn countplot to display counts and frequencies in a readable way.
    '''
    
    ncount = len(dataset)
    
    if order is None:
        order = np.sort(dataset[col].unique())
    
    # plot seaborn countplot
    ax = sns.countplot(data=dataset, x=col, ax=ax, order=order, **add_args)
    
    # Get y limit (number of elements)
    _ ,max_nb = ax.get_ylim()
    # Transform this limit into a frequency in [0, 100]
    freq_lim = (max_nb * 100/ ncount)
    
    # Duplicate the ax
    ax2 = ax.twinx()
    
    #Move duplicate y axis ticks to the left
    ax2.yaxis.tick_left()
    
    #Move original y axis ticks to the right
    ax.yaxis.tick_right()
    
    # Swap the label positions to match the ticks
    ax.yaxis.set_label_position('right')
    ax2.yaxis.set_label_position('left')
    ax2.set_ylabel('Frequency [%]')

    # We want to write the frequency on top of each bar
    if show_percent:
        # for every bar
        for p in ax.patches:
            x=p.get_bbox().get_points()[:,0]
            y=p.get_bbox().get_points()[1,1]
            if not math.isnan(x.mean()) and not math.isnan(y):
                # Write frequency at an x and y coordinate
                ax.text(x.mean(), y, '{:.1f}%'.format(100.*y/ncount), 
                    ha='center', va='bottom')

    # Set y axis limit for counts and frequencies
    ax2.set_ylim(0,freq_lim)
    ax.set_ylim(0,max_nb)

    # set ticks for count
    ax.yaxis.set_major_locator(ticker.LinearLocator(11))
    ax.yaxis.set_tick_params(which="major")

    # set ticks for frequencies
    ax2.yaxis.set_major_locator(ticker.MultipleLocator(freq_lim/10))
    ax2.yaxis.set_tick_params(which="major")

    # remove grid for ax 2 (keep only ax)
    ax2.grid(False)
    ax.grid(False)
    ax.set_xlabel("")
    if rotate_label:
        # rotate tick labels on the x axis / / /
        _ = ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")
    ax.set_title(col)
def plot(kind, dataset, columns=None, fig_title="Count/Frequency plots", fontsizes = 20, **kwargs):
    '''
    Wrapper function that takes care of figure sizing and dispatches to the requested plotting procedure.
    '''
    
    # plot choices
    kind_dict = {
        'pie': plot_pie,
        'count': plot_count,
        'density': plot_density}
    
    if kind not in kind_dict:
        raise ValueError(f"{kind} is not a valid kind, has to be one of {kind_dict.keys()}")
    
    if columns is None:
        # use all dataset columns
        columns = list(dataset.columns)
    
    fig = None
    
    # Setting font sizes
    plt.rc('xtick', labelsize=fontsizes*1.5)
    plt.rc('ytick', labelsize=fontsizes*1.5)
    plt.rc('axes', labelsize=fontsizes*2)
    plt.rc('legend', fontsize=fontsizes*1.5, title_fontsize=0)
    plt.rc('axes', titlesize=2*fontsizes)
    plt.rc('font', size=1.7*fontsizes)
    
    # Scale of the figure in ax (to be used later)
    figsize_scale = fontsizes
    
    if not isinstance(columns, list):
        # columns has to be a list
        if isinstance(columns, str):
            columns = [columns]
        else:
            columns = list(columns)
    
    if len(columns) == 1: # Only one variable to plot
        ncols, nrows = 1, 1
        figsize_scale *= 2 # double figsize
    else:
        ncols, nrows = 2, math.ceil(len(columns) / 2)
     
    fig, axes = plt.subplots(figsize=(figsize_scale*ncols, figsize_scale*nrows), nrows=nrows, ncols=ncols)
    
    if ncols == 1 and nrows == 1:
        # We need a list of axes
        axes = np.array([axes])
    
    # Plot
    do_plots(dataset, columns, axes, kind_dict[kind], **kwargs)
    
    fig.suptitle(fig_title + "\n\n", fontsize=fontsizes*2.5)
    plt.tight_layout()
    #Reset plot setting to original
    sns.reset_orig()
    
def do_plots(dataset, columns, axes, plot_func, **kwargs):
    '''
    Calls the plotting function on every axis and removes unused axes.
    '''
    axes = axes.flat
    
    #plot every variable
    for index, col in enumerate(columns):
        plot_func(dataset, col, axes[index], **kwargs)

    # remove empty axes
    for empty in range(len(columns), len(axes)):
        axes[empty].axis("off")

Important: If you are new to Python and the various data analysis libraries used here, try to read through the above functions and understand what they do.
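
For instance, a minimal call of the wrapper on a single numeric column looks like this (purely illustrative):

# Illustrative example: density plot of a single personality trait
plot("density", dataset, ["Neuroticism"], fig_title="Density plot of Neuroticism")
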
Let's create pie plots and count plots.
plot("pie", demo_data, demographic_columns, fig_title="Plot pies of demographic data")
plot("count", demo_data, demographic_columns, fig_title="Count / Frequency plots of demographic data")