This post is a great way to dive into data analysis and get familiar with Python tools such as pandas, seaborn, matplotlib, and numpy.

This is also an opportunity to work on multilabel data (several target variables that are not mutually exclusive).

Setting up the workspace

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns
import matplotlib as mpl
import math
from scipy.stats import pearsonr, spearmanr, chi2_contingency, ttest_ind, mannwhitneyu, norm, normaltest, shapiro, anderson
import operator
from IPython.display import HTML, display
sns.set(rc={'figure.figsize':(10,5)})
sns.set_style('whitegrid')
%matplotlib inline

Dataset preparation

The Drug consumption (quantified) dataset is a multilabel dataset from the UCI Machine Learning Repository.

The dataset contains records for 1885 respondents.

To summarize, it contains:

  • an ID column
  • 5 demographic columns (features)
  • 7 personality traits (features)
  • 18 drugs with their usage frequency (target)
  • a fake drug called Semeron, used to verify the reliability of answers

Each drug variable can take 7 different values:

  • CL0 Never Used
  • CL1 Used over a Decade Ago
  • CL2 Used in the Last Decade
  • CL3 Used in the Last Year
  • CL4 Used in the Last Month
  • CL5 Used in the Last Week
  • CL6 Used in the Last Day

All qualitative variables were transformed to real values (except for the target variables).

Tip: Read the dataset description for more details!

The following lists will be useful for loading and indexing the data.

demographic_columns = [
    'Age', 
    'Gender', 
    'Education', 
    'Country',
    'Ethnicity',
]

personality_columns = [
    'Neuroticism',
    'Extraversion',
    'Openness to experience',
    'Agreeableness',
    'Conscientiousness',
    'Impulsiveness',
    'Sensation seeking'
]

feature_columns = demographic_columns + personality_columns

drugs_columns = [
    'Alcohol consumption',
    'Amphetamines consumption',
    'Amyl nitrite consumption',
    'Benzodiazepine consumption',
    'Caffeine consumption',
    'Cannabis consumption',
    'Chocolate consumption',
    'Cocaine consumption',
    'Crack consumption',
    'Ecstasy consumption',
    'Heroin consumption',
    'Ketamine consumption',
    'Legal highs consumption',
    'Lysergic acid diethylamide consumption',
    'Methadone consumption',
    'Magic mushrooms consumption',
    'Nicotine consumption',
    'Fictitious drug Semeron consumption',
    'Volatile substance abuse consumption'
]

drugs_legal = ['Alcohol consumption', 'Caffeine consumption', 'Chocolate consumption', 'Nicotine consumption']

drugs_illegal = [drug for drug in drugs_columns if drug not in drugs_legal]

all_columns = feature_columns + drugs_columns

Loading the dataset

dataset = pd.read_csv("drug_consumption.data", names=["ID"] + all_columns)
dataset = dataset.set_index("ID")
dataset.head()
Age Gender Education Country Ethnicity Neuroticism Extraversion Openness to experience Agreeableness Conscientiousness ... Ecstasy consumption Heroin consumption Ketamine consumption Legal highs consumption Lysergic acid diethylamide consumption Methadone consumption Magic mushrooms consumption Nicotine consumption Fictitious drug Semeron consumption Volatile substance abuse consumption
ID
1 0.49788 0.48246 -0.05921 0.96082 0.12600 0.31287 -0.57545 -0.58331 -0.91699 -0.00665 ... CL0 CL0 CL0 CL0 CL0 CL0 CL0 CL2 CL0 CL0
2 -0.07854 -0.48246 1.98437 0.96082 -0.31685 -0.67825 1.93886 1.43533 0.76096 -0.14277 ... CL4 CL0 CL2 CL0 CL2 CL3 CL0 CL4 CL0 CL0
3 0.49788 -0.48246 -0.05921 0.96082 -0.31685 -0.46725 0.80523 -0.84732 -1.62090 -1.01450 ... CL0 CL0 CL0 CL0 CL0 CL0 CL1 CL0 CL0 CL0
4 -0.95197 0.48246 1.16365 0.96082 -0.31685 -0.14882 -0.80615 -0.01928 0.59042 0.58489 ... CL0 CL0 CL2 CL0 CL0 CL0 CL0 CL2 CL0 CL0
5 0.49788 0.48246 1.98437 0.96082 -0.31685 0.73545 -1.63340 -0.45174 -0.30172 1.30612 ... CL1 CL0 CL0 CL1 CL0 CL0 CL2 CL2 CL0 CL0

5 rows × 31 columns

We successfully loaded the dataset and set the ID as row index.

dataset.dtypes
Age                                       float64
Gender                                    float64
Education                                 float64
Country                                   float64
Ethnicity                                 float64
Neuroticism                               float64
Extraversion                              float64
Openness to experience                    float64
Agreeableness                             float64
Conscientiousness                         float64
Impulsiveness                             float64
Sensation seeking                         float64
Alcohol consumption                        object
Amphetamines consumption                   object
Amyl nitrite consumption                   object
Benzodiazepine consumption                 object
Caffeine consumption                       object
Cannabis consumption                       object
Chocolate consumption                      object
Cocaine consumption                        object
Crack consumption                          object
Ecstasy consumption                        object
Heroin consumption                         object
Ketamine consumption                       object
Legal highs consumption                    object
Lysergic acid diethylamide consumption     object
Methadone consumption                      object
Magic mushrooms consumption                object
Nicotine consumption                       object
Fictitious drug Semeron consumption        object
Volatile substance abuse consumption       object
dtype: object

Features are all floats and targets are objects.

dataset.shape
(1885, 31)

1885 observations and 31 variables.

dataset.isnull().values.any()
False

There are no missing values.
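`isnull().values.any()` only tells us whether any value is missing at all; `isnull().sum()` gives a per-column count, which is more informative when some values are missing. A quick sketch on a toy frame (the column names here are made up):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", "z"]})

missing_per_col = df.isnull().sum()      # count of missing values per column
any_missing = df.isnull().values.any()   # single boolean, as used above
```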

Nominal drug to ordinal data

for i in drugs_columns:
    dataset[i] = dataset[i].map({'CL0': 0, 'CL1': 1, 'CL2': 2, 'CL3': 3, 'CL4': 4, 'CL5': 5, 'CL6': 6})

We map each CLx label to the integer x. Now all variables are numeric.
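An alternative sketch (not used in this post) is an ordered pandas Categorical: it keeps the readable CLx labels while making pandas aware of their ordering, and `.codes` recovers the same integers as the map above.

```python
import pandas as pd

# Ordered Categorical alternative to the explicit map
levels = [f"CL{i}" for i in range(7)]
s = pd.Series(["CL0", "CL4", "CL2"])
cat = pd.Categorical(s, categories=levels, ordered=True)
codes = pd.Series(cat.codes)  # CL0 -> 0, CL4 -> 4, CL2 -> 2
```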

Removing Semeron users

Let's see who claimed to have taken Semeron.

semerons = dataset[dataset['Fictitious drug Semeron consumption'] != 0]
semerons
Age Gender Education Country Ethnicity Neuroticism Extraversion Openness to experience Agreeableness Conscientiousness ... Ecstasy consumption Heroin consumption Ketamine consumption Legal highs consumption Lysergic acid diethylamide consumption Methadone consumption Magic mushrooms consumption Nicotine consumption Fictitious drug Semeron consumption Volatile substance abuse consumption
ID
730 -0.07854 0.48246 -1.73790 -0.09765 -0.31685 -0.58016 0.32197 0.14143 -0.60633 0.12331 ... 2 2 2 0 4 2 6 6 2 2
821 -0.95197 -0.48246 -0.61113 -0.09765 -0.50212 -0.67825 1.74091 0.72330 0.13136 0.41594 ... 3 0 0 0 5 0 5 4 3 0
1520 -0.95197 -0.48246 -0.61113 -0.57009 -0.31685 -0.24649 -0.80615 -1.27553 -1.34289 -1.92173 ... 1 2 1 2 1 2 4 2 3 1
1537 -0.95197 0.48246 -0.61113 -0.57009 0.11440 -0.46725 0.80523 0.29338 2.03972 1.81175 ... 4 0 4 3 2 0 3 4 4 3
1702 0.49788 0.48246 0.45468 -0.57009 -0.31685 1.98437 -0.80615 2.15324 0.76096 -0.00665 ... 2 0 2 2 2 0 2 6 2 0
1773 -0.95197 -0.48246 -1.22751 -0.57009 -0.22166 -0.34799 1.28610 1.06238 -0.01729 -0.52745 ... 3 0 4 3 6 3 3 3 1 3
1810 -0.95197 0.48246 -1.43719 -0.57009 -0.31685 1.23461 1.11406 1.06238 -1.47955 0.12331 ... 4 2 1 4 1 0 1 6 1 2
1827 -0.95197 0.48246 0.45468 -0.57009 -0.31685 0.22393 -0.30033 0.88309 1.28610 -0.00665 ... 0 0 0 2 3 0 3 5 2 0

8 rows × 31 columns

There are 8 people who claimed to have taken Semeron, which is a fake drug, so we should remove them.

dataset = dataset[dataset['Fictitious drug Semeron consumption'] == 0]
# Removing it from the drug lists
drugs_columns.remove('Fictitious drug Semeron consumption')
drugs_illegal.remove('Fictitious drug Semeron consumption')
# Dropping the column from the dataset (drop returns a copy, so we reassign)
dataset = dataset.drop(columns='Fictitious drug Semeron consumption')
dataset.shape
(1877, 30)

We have 1877 respondents left.

Tip: If you ever acquire data from individuals, try a similar approach to get rid of untrustworthy observations.

Exploratory Data analysis

dataset.describe()
Age Gender Education Country Ethnicity Neuroticism Extraversion Openness to experience Agreeableness Conscientiousness ... Ecstasy consumption Heroin consumption Ketamine consumption Legal highs consumption Lysergic acid diethylamide consumption Methadone consumption Magic mushrooms consumption Nicotine consumption Volatile substance abuse consumption
count 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000 ... 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000
mean 0.037577 -0.000771 -0.000984 0.358984 -0.309728 -0.000551 -0.001951 -0.003224 -0.000657 -0.000394 ... 1.309536 0.372403 0.564198 1.353223 1.053277 0.826319 1.177944 3.194992 0.429409
std 0.878387 0.482588 0.949831 0.699707 0.166220 0.998442 0.997418 0.995691 0.996689 0.997657 ... 1.647373 1.034761 1.216341 1.790881 1.484582 1.648379 1.459212 2.415814 0.959160
min -0.951970 -0.482460 -2.435910 -0.570090 -1.107020 -3.464360 -3.273930 -3.273930 -3.464360 -3.464360 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% -0.951970 -0.482460 -0.611130 -0.570090 -0.316850 -0.678250 -0.695090 -0.717270 -0.606330 -0.652530 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000
50% -0.078540 -0.482460 -0.059210 0.960820 -0.316850 0.042570 0.003320 -0.019280 -0.017290 -0.006650 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 3.000000 0.000000
75% 0.497880 0.482460 0.454680 0.960820 -0.316850 0.629670 0.637790 0.723300 0.760960 0.584890 ... 3.000000 0.000000 0.000000 3.000000 2.000000 0.000000 2.000000 6.000000 0.000000
max 2.591710 0.482460 1.984370 0.960820 1.907250 3.273930 3.273930 2.901610 3.464360 3.464360 ... 6.000000 6.000000 6.000000 6.000000 6.000000 6.000000 6.000000 6.000000 6.000000

8 rows × 30 columns

The raw table above is hard to read. Let's create a plot to make it easier to digest.

fig, ax = plt.subplots(figsize=(11,15))
plt.ylabel("Features")
plt.title("Box plot of Pre-Processed Data Set")
ax = sns.boxplot(data = dataset[feature_columns], orient="h", palette="Set2")
sns.reset_orig()

It seems the dataset creators centered the features near 0 and scaled them to a standard deviation of about 1, so we can probably keep the current encoding to build a model.
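To double-check such a claim, we can aggregate per-feature means and standard deviations. A minimal sketch on synthetic standardized data (on the real data you would pass `dataset[feature_columns]` instead of the toy frame):

```python
import numpy as np
import pandas as pd

# Toy standardized features standing in for the real ones
rng = np.random.default_rng(0)
toy = pd.DataFrame({"f1": rng.normal(0, 1, 1000),
                    "f2": rng.normal(0, 1, 1000)})

# One row per statistic, one column per feature
stats = toy.agg(["mean", "std"])
```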

Demographic data analysis

Rearranging data

For data analysis, we will transform the quantified categorical data back to a clearer, nominal form.

Note: Our mapping is based on the dataset description
demo_data = dataset.copy()
age = ['18-24' if a <= -0.9 else 
       '25-34' if a >= -0.5 and a < 0 else 
       '35-44' if a > 0 and a < 1 else 
       '45-54' if a > 1 and a < 1.5 else 
       '55-64' if a > 1.5 and a < 2 else 
       '65+' 
       for a in demo_data['Age']]

gender = ['Female' if g > 0 else "Male" for g in demo_data['Gender']]

education = ['Left school before 16 years' if e <-2 else 
             'Left school at 16 years' if e > -2 and e < -1.5 else 
             'Left school at 17 years' if e > -1.5 and e < -1.4 else 
             'Left school at 18 years' if e > -1.4 and e < -1 else 
             'Some college or university, no certificate or degree' if e > -1 and e < -0.5 else 
             'Professional certificate/ diploma' if e > -0.5 and e < 0 else 
             'University degree' if e > 0 and e < 0.5 else 
             'Masters degree' if e > 0.5 and e < 1.5 else 
             'Doctorate degree' 
             for e in demo_data['Education']]

country = ['USA' if c < -0.5 else 
           'New Zealand' if c > -0.5 and c < -0.4 else 
           'Other' if c > -0.4 and c < -0.2 else 
           'Australia' if c > -0.2 and c < 0 else 
           'Ireland' if c > 0 and c < 0.23 else 
           'Canada' if c > 0.23 and c < 0.9 else 
           'UK' 
           for c in demo_data['Country']]

ethnicity = ['Black' if e < -1 else 
             'Asian' if e > -1 and e < -0.4 else 
             'White' if e > -0.4 and e < -0.25 else 
             'Mixed-White/Black' if e >= -0.25 and e < 0.11 else 
             'Mixed-White/Asian' if e > 0.12 and e < 1 else 
             'Mixed-Black/Asian' if e > 1.9 else 
             'Other' 
             for e in demo_data['Ethnicity']]


demo_data['Age'] = age
demo_data['Gender'] = gender
demo_data['Education'] = education
demo_data['Country'] = country
demo_data['Ethnicity'] = ethnicity
demo_data[demographic_columns].head()
Age Gender Education Country Ethnicity
ID
1 35-44 Female Professional certificate/ diploma UK Mixed-White/Asian
2 25-34 Male Doctorate degree UK White
3 35-44 Male Professional certificate/ diploma UK White
4 18-24 Female Masters degree UK White
5 35-44 Female Doctorate degree UK White

This looks better :grin:
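The chained conditional expressions above get verbose; `pd.cut` with explicit bin edges does the same job more compactly. A sketch for Age, where the bin edges are taken from the thresholds used above (an assumption about the quantified values, not an official mapping):

```python
import numpy as np
import pandas as pd

# The six quantified Age values from the dataset description
ages = pd.Series([-0.95197, -0.07854, 0.49788, 1.09449, 1.82213, 2.59171])

# Bin edges derived from the thresholds in the list comprehension above
bins = [-np.inf, -0.5, 0, 1, 1.5, 2, np.inf]
labels = ['18-24', '25-34', '35-44', '45-54', '55-64', '65+']

age_groups = pd.cut(ages, bins=bins, labels=labels)
```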

Proportion analysis

Let's start off by taking a look at the demographic data balance.

But first, we will define a few useful functions that will be reused throughout this analysis.

def plot_density(dataset, col, ax, plot_gaussian = True, color="Blue"):
    '''
    Extension of the seaborn histogram that plots, for a given column, an estimated normal distribution (if requested) on top of the fitted data distribution.
    '''
    ncount = len(dataset)
    
    if plot_gaussian:
        std = dataset[col].std()
        mean = dataset[col].mean()
    
    #plot histogram using seaborn
    ax = sns.histplot(dataset[col], ax=ax, color=color, kde=True, stat="probability", kde_kws={"bw_adjust":3})
    
    if plot_gaussian:
        # Limiting our gaussian to the limits from the above plot
        xmin, xmax = ax.get_xlim()
        xnorm = np.arange(xmin, xmax, 0.001)
        ynorm = norm.pdf(xnorm, mean, std)
        ax.plot(xnorm, ynorm, 'r', lw=2)
        ax.legend(["data distribution", "estimated normal distribution"], loc="upper center", bbox_to_anchor=(0.5, -0.05), fancybox=True, shadow=True, ncol=2)
    
    ax.set_title(col)
    ax.set_xlabel("")
def plot_pie(dataset, col, ax):  
    '''
    Pandas' pie plot wrapper
    '''
    ax = dataset[col].value_counts().plot(kind='pie', ax=ax)
    ax.set_title(col)
    ax.set_ylabel("")
def plot_count(dataset, col, ax, order = None, show_percent=True, rotate_label = True, add_args={"palette": "Set2"}):
    '''
    Extension of the seaborn countplot that displays frequencies and counts in a pretty way. 
    '''
    
    ncount = len(dataset)
    
    if order is None:
        order = np.sort(dataset[col].unique())
    
    # plot seaborn countplot
    ax = sns.countplot(data=dataset, x=col, ax=ax, order=order, **add_args)
    
    # Get y limit (number of elements)
    _ ,max_nb = ax.get_ylim()
    # Transform this limit into a frequency in [0, 100]
    freq_lim = (max_nb * 100/ ncount)
    
    # Duplicate the ax
    ax2 = ax.twinx()
    
    #Move duplicate y axis ticks to the left
    ax2.yaxis.tick_left()
    
    #Move original y axis ticks to the right
    ax.yaxis.tick_right()
    
    # Swap the label positions to match the ticks
    ax.yaxis.set_label_position('right')
    ax2.yaxis.set_label_position('left')
    ax2.set_ylabel('Frequency [%]')

    # We want to write the frequency on top of each bar
    if show_percent:
        # for every bar
        for p in ax.patches:
            x=p.get_bbox().get_points()[:,0]
            y=p.get_bbox().get_points()[1,1]
            if not math.isnan(x.mean()) and not math.isnan(y):
                # Write frequency at an x and y coordinate
                ax.text(x.mean(), y, '{:.1f}%'.format(100.*y/ncount), 
                    ha='center', va='bottom')

    # Set y axis limit for counts and frequencies
    ax2.set_ylim(0,freq_lim)
    ax.set_ylim(0,max_nb)

    # set ticks for count
    ax.yaxis.set_major_locator(ticker.LinearLocator(11))
    ax.yaxis.set_tick_params(which="major")

    # set ticks for frequencies
    ax2.yaxis.set_major_locator(ticker.MultipleLocator(freq_lim/10))
    ax2.yaxis.set_tick_params(which="major")

    # remove grid for ax 2 (keep only ax)
    ax2.grid(False)
    ax.grid(False)
    ax.set_xlabel("")
    if rotate_label:
        # rotate tick labels on the x axis / / /
        _ = ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")
    ax.set_title(col)
def plot(kind, dataset, columns=None, fig_title="Count/Frequency plots", fontsizes = 20, **kwargs):
    '''
    Wrapper function that takes care of plot wise sizes and calling the wanted procedure
    '''
    
    # plot choices
    kind_dict = {
        'pie': plot_pie,
        'count': plot_count,
        'density': plot_density}
    
    if kind not in kind_dict:
        raise ValueError(f"{kind} is not a valid kind, has to be one of {kind_dict.keys()}")
    
    if columns is None:
        # use all dataset columns
        columns = list(dataset.columns)
    
    fig = None
    
    # Setting font sizes
    plt.rc('xtick', labelsize=fontsizes*1.5)
    plt.rc('ytick', labelsize=fontsizes*1.5)
    plt.rc('axes', labelsize=fontsizes*2)
    plt.rc('legend', fontsize=fontsizes*1.5, title_fontsize=0)
    plt.rc('axes', titlesize=2*fontsizes)
    plt.rc('font', size=1.7*fontsizes)
    
    # Scale of the figure in ax (to be used later)
    figsize_scale = fontsizes
    
    if not isinstance(columns, list):
        # columns has to be a list
        if isinstance(columns, str):
            columns = [columns]
        else:
            columns = list(columns)
    
    if len(columns) == 1: # Only variable to plot
        ncols, nrows = 1, 1
        figsize_scale *= 2 # double figsize
    else:
        ncols, nrows = 2, math.ceil(len(columns) / 2)
     
    fig, axes = plt.subplots(figsize=(figsize_scale*ncols, figsize_scale*nrows), nrows=nrows, ncols=ncols)
    
    if ncols == 1 and nrows == 1:
        # We need a list of axes
        axes = np.array([axes])
    
    # Plot
    do_plots(dataset, columns, axes, kind_dict[kind], **kwargs)
    
    fig.suptitle(fig_title + "\n\n", fontsize=fontsizes*2.5)
    plt.tight_layout()
    #Reset plot setting to original
    sns.reset_orig()
    
def do_plots(dataset, columns, axes, plot_func, **kwargs):
    '''
    Calls the plotting function on every axis and removes unused axes.
    '''
    axes = axes.flat
    
    #plot every variable
    for index, col in enumerate(columns):
        plot_func(dataset, col, axes[index], **kwargs)

    # remove empty axes
    for empty in range(len(columns), len(axes)):
        axes[empty].axis("off")

Important: If you are new to using python and the various libraries for data analysis, try to read through the above functions and understand what they do.
Let's create pie plots and count plots.
plot("pie", demo_data, demographic_columns, fig_title="Plot pies of demographic data")
plot("count", demo_data, demographic_columns, fig_title="Count / Frequency plots of demographic data")

We can see some features are pretty balanced but in some cases, classes are underrepresented. We will go over each of them in more detail.

def value_counts_percentage(dataset, column):
    ''' value.counts() method extended by displaying percentage '''
    
    a = dataset[column].value_counts()
    b = dataset[column].value_counts(normalize=True) * 100
    
    return pd.concat([a,b.round(2)], axis=1, keys=['N', '%'])

Note: The above function creates a dataframe with count and frequencies for a given column.

Age

value_counts_percentage(demo_data, 'Age')
N %
18-24 637 33.94
25-34 480 25.57
35-44 355 18.91
45-54 294 15.66
55-64 93 4.95
65+ 18 0.96

The age groups are intervals of around 10 years. As we can see, the older the age group, the less represented they are.

The 18-24 group holds around a third of all respondents!

Unbalanced and not representative.

Gender

value_counts_percentage(demo_data, 'Gender')
N %
Male 940 50.08
Female 937 49.92

Gender is balanced and representative.

Education

value_counts_percentage(demo_data, 'Education')
N %
Some college or university, no certificate or degree 503 26.80
University degree 478 25.47
Masters degree 283 15.08
Professional certificate/ diploma 270 14.38
Left school at 18 years 99 5.27
Left school at 16 years 98 5.22
Doctorate degree 89 4.74
Left school at 17 years 29 1.55
Left school before 16 years 28 1.49

There is a predominance of educated people (around 85% have attended college or above), causing an imbalance.

Not sure these proportions match the real-life distribution, but looking at educational attainment in the United States, they don't seem far off for developed countries (which are heavily represented in this dataset).

Warning: That doesn’t mean our data is balanced, but by sampling people randomly we would get around the same distribution.

Tip: A representative dataset is useful to build a descriptive model that characterizes the data rather than predicting. On the other hand, if your objective is to build a predictive model, particularly one that performs well by measure of AUC or rank order, and you plan to use a basic ML framework, then feeding it a balanced dataset is recommended. You can also use a representative dataset in the latter case, with custom sampling that takes care of balancing the data for training.
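One simple way to balance a dataset for training is to downsample every class to the size of the smallest one. A sketch on toy data, where `target` is a made-up binary label (not a column of this dataset):

```python
import pandas as pd

# Toy imbalanced frame: 10 positives, 90 negatives
df = pd.DataFrame({"target": [1] * 10 + [0] * 90, "x": range(100)})

# Downsample every class to the size of the smallest one
n_min = df["target"].value_counts().min()
balanced = df.groupby("target", group_keys=False).sample(n=n_min, random_state=0)
```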

Country

value_counts_percentage(demo_data, 'Country')
N %
UK 1044 55.62
USA 551 29.36
Other 118 6.29
Canada 87 4.64
Australia 52 2.77
Ireland 20 1.07
New Zealand 5 0.27

All the known countries (over 90% of respondents) are English speaking, with the US and UK totalling over 80% of the data.

This is a clear imbalance and isn't representative.

Ethnicity

value_counts_percentage(demo_data, 'Ethnicity')
N %
White 1715 91.37
Other 62 3.30
Black 33 1.76
Asian 25 1.33
Mixed-White/Asian 20 1.07
Mixed-White/Black 19 1.01
Mixed-Black/Asian 3 0.16

Over 90% of respondents are white. Again, a clear imbalance, and not representative.

Age-Gender

plot("count", demo_data, 'Age', fig_title="Age-Gender count/frequency", rotate_label=False, add_args={"hue":"Gender", "palette":'ch:.25'}, fontsizes=4.5);
  • There are clearly a lot more men in the 18-24 age group.
  • More women in the 25-34 age group.
  • As the age increases, the gender proportion gradually evens out.

Country-Gender

plot("count", demo_data, 'Country', fig_title="Country-Gender count/frequency", rotate_label=False, add_args={"hue":"Gender", "palette":'ch:.25'}, fontsizes=4.5);
  • More women than men were tested in the USA, New Zealand (see next table) and "Other".
  • More men than women in the UK.
  • Around the same in the remaining countries.

Note: Very few respondents in some countries, meaning we can’t tell whether the respondents were drawn from the real-life distribution or not.

Age – Gender – Country cross table

pd.crosstab(demo_data['Age'], [demo_data['Gender'], demo_data['Country']])
Gender Female Male
Country Australia Canada Ireland New Zealand Other UK USA Australia Canada Ireland New Zealand Other UK USA
Age
18-24 1 7 3 0 12 112 99 20 25 5 2 39 91 221
25-34 10 9 2 1 13 200 58 5 8 3 2 26 80 63
35-44 4 8 3 0 7 150 29 3 7 0 0 10 108 26
45-54 5 7 1 0 4 127 14 3 4 3 0 5 99 22
55-64 0 7 0 0 0 29 7 1 1 0 0 2 36 10
65+ 0 3 0 0 0 5 0 0 1 0 0 0 7 2
  • New Zealand doesn't have anyone over 34 years old, and only one woman was tested.

  • Ireland doesn't have anyone over 54 and has no men in the 35-44 age group.

  • Australia doesn't have anyone over 65 (and only one person over 54) and has 20 times more males than females in the 18-24 age group.

  • Canada has 3 times more men in the 18-24 group and has only 2 males above 54.

  • UK has over twice as many women as men in the 25-34 group.

  • USA has over twice as many men as women in the 18-24 group and no women over 64.

  • Other countries have 3 times more men in the 18-24 age group, twice as many in the 25-34 group, no women in the 55-64 group, and no one over 64.
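Raw counts are hard to compare across countries of very different sizes; `pd.crosstab` can normalize per column to give shares instead. A quick sketch on toy data (with the real data you would pass `demo_data['Gender']` and `demo_data['Country']`):

```python
import pandas as pd

# Toy data standing in for the demographic columns
df = pd.DataFrame({"Country": ["UK", "UK", "UK", "USA"],
                   "Gender": ["Male", "Female", "Female", "Male"]})

# normalize='columns' turns counts into per-country percentages
shares = pd.crosstab(df["Gender"], df["Country"], normalize="columns") * 100
```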

def catplot(data, x, hue, col, rotate_label=True, palette="ch:.25"):
    plot = sns.catplot(x=x, hue=hue, col=col, palette=palette, data=data.sort_values(by=[x]), kind="count")

    if rotate_label:
        for axes in plot.axes.flat:
            _ = axes.set_xticklabels(axes.get_xticklabels(), rotation=40, ha="right")
catplot(demo_data, 'Country', 'Age', 'Gender')

Using the above plot, some aspects described above pop up quite nicely.

Education – Gender – Ethnicity cross table

demo_data.pivot_table(index='Education', columns=['Gender', 'Ethnicity'], aggfunc='size', fill_value=0)
Gender Female Male
Ethnicity Asian Black Mixed-Black/Asian Mixed-White/Asian Mixed-White/Black Other White Asian Black Mixed-Black/Asian Mixed-White/Asian Mixed-White/Black Other White
Education
Doctorate degree 1 0 0 1 2 1 52 1 0 0 0 0 2 29
Left school at 16 years 0 0 0 0 0 2 40 0 0 0 1 1 0 54
Left school at 17 years 0 0 0 0 1 0 12 0 0 0 0 0 2 14
Left school at 18 years 0 0 0 0 0 2 35 0 1 0 1 2 2 56
Left school before 16 years 0 0 0 0 0 0 12 0 0 0 0 0 0 16
Masters degree 4 6 0 3 5 5 156 5 1 0 0 1 4 93
Professional certificate/ diploma 0 3 0 1 0 2 130 1 5 0 1 1 6 120
Some college or university, no certificate or degree 3 2 1 2 1 8 153 0 4 1 4 3 12 309
University degree 7 6 1 4 2 8 263 3 5 0 2 0 6 171
  • Very few entries of people who left school at or before 18.

  • Non-white: no one left school before 16 and, apart from the 'Other' category, only one female didn't get a higher education. Among males, only one Asian respondent and two in 'Other' have a doctorate degree.

  • Asian: no respondents with lower education and no males in 'some college or university' (but there are some with master's degrees, so that doesn't mean none went through university).

  • Black: only one person with lower education (one male who left school at 18) and no females with a professional diploma, even though that is among the categories with the most respondents on the male side.

  • Mixed-Black/Asian: very few respondents; only people attending some college or university, or holding a university degree, are represented.

  • Mixed-White/Asian: all the female respondents have high education, while 2 males left school at 18 or before. Males, unlike females, have no master's degree respondents.

  • Mixed-White/Black: all females except 2 have finished a university, master's or doctorate degree, while males have only one master's degree; the others have either unfinished university studies, a professional diploma, or left school at 18 or before.

  • Other: males have three times as many professional diplomas as females. There is a high number of doctorate degrees but also a high number of respondents who left school at 18 or before (both relative to the categories seen up to now).

  • White: the vast majority of respondents are white, as we saw before. Both genders have respondents in all categories, but not in the same proportions. To see things more clearly, refer to the table below (focusing on proportions instead of counts). The majority of males are pursuing a college or university degree, while the majority of females have a university degree. Nearly twice as many females have a doctorate degree, and more males left school at 18 or before.

pd.concat([value_counts_percentage(demo_data[(demo_data['Ethnicity'] == 'White') & (demo_data['Gender'] == 'Male')], 'Education'),
           value_counts_percentage(demo_data[(demo_data['Ethnicity'] == 'White') & (demo_data['Gender'] == 'Female')], 'Education')],
           axis = 1, keys=['Male', 'Female'])
Male Female
N % N %
Some college or university, no certificate or degree 309 35.85 153 17.94
University degree 171 19.84 263 30.83
Professional certificate/ diploma 120 13.92 130 15.24
Masters degree 93 10.79 156 18.29
Left school at 18 years 56 6.50 35 4.10
Left school at 16 years 54 6.26 40 4.69
Doctorate degree 29 3.36 52 6.10
Left school before 16 years 16 1.86 12 1.41
Left school at 17 years 14 1.62 12 1.41

Note: Because of the huge imbalance in ethnicity, plots will be hard to interpret. See for yourself below. Changing the axis and hue won’t solve anything.
catplot(demo_data, 'Ethnicity', 'Education', 'Gender')

Country – Gender – Ethnicity cross table

pd.crosstab(demo_data['Country'], [demo_data['Gender'], demo_data['Ethnicity']])
Gender Female Male
Ethnicity Asian Black Mixed-Black/Asian Mixed-White/Asian Mixed-White/Black Other White Asian Black Mixed-Black/Asian Mixed-White/Asian Mixed-White/Black Other White
Country
Australia 0 0 0 0 0 0 20 0 0 0 0 0 0 32
Canada 0 0 0 2 0 1 38 0 0 0 0 2 2 42
Ireland 0 0 0 0 0 0 9 0 0 0 0 0 0 11
New Zealand 0 0 0 0 0 0 1 0 0 0 0 0 0 4
Other 1 1 0 0 2 2 30 1 0 1 0 1 5 74
UK 11 14 0 6 9 12 571 9 11 0 2 1 4 394
USA 3 2 2 3 0 13 184 0 5 0 7 4 23 305
  • Nearly no entries of Mixed-Black/Asian (and, compared to white, few entries of any other category).

  • The USA has no Mixed-White/Black females, and no Asian or Mixed-Black/Asian males.

  • Apart from the USA, no country has Mixed-Black/Asian respondents (except one male in an 'Other' country), not even the UK with its large number of respondents.

  • The UK has 9 times more Mixed-White/Black females than males and 3 times more 'Other' females than males.

Note: Again, plotting will not help because of the unbalance in white respondents.

Correlations (using Cramér's V)

"In statistics, Cramér's V (sometimes referred to as Cramér's phi and denoted as φc) is a measure of association between two nominal variables, giving a value between 0 and +1 (inclusive). It is based on Pearson's chi-squared statistic and was published by Harald Cramér in 1946."(Wikipedia)
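In its uncorrected form, V = sqrt(chi2 / (n * min(k-1, r-1))) for an r × k contingency table. A minimal sketch on a toy table (the `cramers_v` function defined below additionally applies a bias correction):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Toy 2x2 contingency table
table = np.array([[30, 10],
                  [10, 30]])

chi2 = chi2_contingency(table)[0]  # Pearson chi-squared statistic
n = table.sum()
r, k = table.shape

# Uncorrected Cramer's V, always in [0, 1]
v = np.sqrt(chi2 / (n * min(r - 1, k - 1)))
```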

Below we define a set of useful functions for working with correlations.

def get_pval_code(pval):
    '''
    Returns a significance code string for a given p-value.
    '''
    code = ''
    if pval < 0.001:
        code = '***'
    elif pval < 0.01:
        code = '**'
    elif pval < 0.05:
        code = '*'
    elif pval < 0.1:
        code = '.'
    return code

def display_pval_codes():
    print("Significance codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1")
def cramers_v(crosstab):
    ''' Returns Cramer's V correlation coefficient (and statistic significance) for data 
        delivered as a crosstable. Used for nominal data.
    '''
    
    chi2 = chi2_contingency(crosstab)[0]
    n = crosstab.sum().sum()
    phi2 = chi2/n
    r,k = crosstab.shape
    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))    
    rcorr = r - ((r-1)**2)/(n-1)
    kcorr = k - ((k-1)**2)/(n-1)
    
    return round(np.sqrt(phi2corr / min((kcorr-1), (rcorr-1))), 3), chi2_contingency(crosstab)[1]
def nominal_corrs(dataset, col_1_names, col_2_names, pvalues=True):
    ''' Returns Cramer's V coefficients matrix and p-value (obtained with Chi-square test of independence) matrix for the whole dataset.
    
    col_1_names, col_2_names - lists of names of columns to correlate. Function creates crosstables for every 
        columns' combination and returns a matrix with single Cramer's V coefficients of every combination;
    pvalues - if set to False, only correlation values will be returned in DataFrame 
        (without '**' marks for significant observations)
    '''
    
    corr_table = pd.DataFrame()
    pval_table = pd.DataFrame()
    for i in range(len(col_1_names)):
        for j in range(len(col_2_names)):
            crosstab = pd.crosstab(dataset[col_1_names[i]], [dataset[col_2_names[j]]])
            corr, pval = cramers_v(crosstab)
            
            v = ' '.join([str(i) for i in (corr, get_pval_code(pval))]) if pvalues else corr
            
            corr_table.loc[i, j] = v
            pval_table.loc[i,j] = pval
    corr_table.index = col_1_names
    corr_table.columns = col_2_names
    pval_table.index = col_1_names
    pval_table.columns = col_2_names
    
    return corr_table, pval_table
def heatmap_corr(dataset, method='spearman', ready=False, mask=True, nominal=False):
    ''' Extended sns.heatmap() method. 
    
    dataset - can be 'pure' data (without calculated correlations) or a DataFrame with already calculated 
        correlations (in that case attribute 'ready' should be set to True);
    method - mainly pearson or spearman; nominal correlations should be calculated externally 
        and be delivered with attribute ready=True; 
    mask - if dataset is NOT a cross-valued DataFrame of one type, mask should be set to False;
    nominal - for nominal data, correlation values are in range (0, 1) instead of (-1, 1). 
        nominal=True should be followed by ready=True 
    '''
    
    if not ready:
        corr = dataset.corr(method=method)
    else:
        corr = dataset
    cmap = sns.diverging_palette(220, 10, as_cmap=True)
    vmax = corr.max().max()
    if nominal:
        center = 0.5
        cmap = None
    else:
        center = 0
    if mask:
        # hide the redundant upper triangle and the diagonal
        mask = np.zeros_like(corr, dtype=bool)  # np.bool is deprecated in modern NumPy
        mask[np.triu_indices_from(mask)] = True
        np.fill_diagonal(corr.values, -2)
        vmax = corr.max().max()
    else:
        mask = None
    f, ax = plt.subplots(figsize=(20,9))
    return sns.heatmap(corr, cmap=cmap, mask=mask, vmax=vmax, center=center, annot=True, square=True, 
                       linewidths=0.5, cbar_kws={'shrink': 0.5})

Let's display the correlations among the demographic data.

corr_table_demo, pval_table_demo = nominal_corrs(demo_data, demographic_columns, demographic_columns)
display_pval_codes()
corr_table_demo
Significance codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Age Gender Education Country Ethnicity
Age 1.0 *** 0.196 *** 0.229 *** 0.173 *** 0.014
Gender 0.196 *** 0.999 *** 0.235 *** 0.219 *** 0.0
Education 0.229 *** 0.235 *** 1.0 *** 0.2 *** 0.0
Country 0.173 *** 0.219 *** 0.2 *** 1.0 *** 0.053 **
Ethnicity 0.014 0.0 0.0 0.053 ** 1.0 ***

Values with at least one * next to them are statistically significant (p-value < 0.05).

heatmap_corr(nominal_corrs(demo_data, demographic_columns, demographic_columns, pvalues=False)[0],
             nominal=True, ready=True);

Only very weak statistically significant correlations at most.

Important: A statistically significant correlation is not necessarily a strong one. The p-value is the probability of observing a correlation at least this extreme by chance if there were no real association; a low p-value only reassures us that the measured correlation is unlikely to be a fluke, not that it is large.

This section tries to interpret these very weak relationships between the variables.

Ethnicity's lack of statistically significant correlations can be explained by the great imbalance between its possible classes (the respondents are almost all white), meaning the lack of correlation could simply be due to chance. The absence of a significant correlation with country may sound surprising, but since the only countries considered are all alike (developed, with a white majority among respondents), it makes sense. It is safe to assume the lack of correlation with age and gender is genuine (even if not statistically significant) for obvious reasons. I can't be so sure about education (as you can see below, there is around a 50% chance the correlation was obtained by chance).

pval_table_demo.loc['Education', 'Ethnicity']
0.5424195567242813

There is a statistically significant but very weak correlation between country and gender. We know, at least for the countries represented here (developed countries where the population is basically 50/50), that there isn't a real-life correlation between country and gender. The very weak correlation may be due to some countries having more female respondents and some having more male ones (as we saw previously).

The weak correlation between age and gender can be explained by the fact that, as we saw previously, the gender split wasn't even in some young age groups; but as we know, this correlation doesn't generalize to real life.

A correlation between age and country would be relevant if some countries were more developed than others, but here we only have developed countries. The idea would make more sense if we had more categories for elderly people, so that a country's life expectancy could be taken into account, but a single group for 65+ year olds isn't enough to discriminate between developed countries. The very weak statistically significant relationship may be caused by the 'other' countries.

The very weak correlations we can keep in mind are the following:

  • Age / Education (makes sense since basically no 18-24 would have a doctorate degree for example)
  • Gender / Education (males and females may have slightly different aims and standards regarding education)
  • Education / Country (for social tendencies, country development and other reasons, education and country may be correlated, but since the countries here are very similar, the relationship is very weak)

Personality features analysis

All the personality traits have a score that is represented by a real value in the dataset.

Some have their original possible scores as integer ranges (with a step of 1, so 4-7 means 4, 5, 6 and 7):

  • Neuroticism: 12-60
  • Extraversion: 16-59
  • Openness to experience: 24-60
  • Agreeableness: 12-60
  • Conscientiousness: 17-59
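The describe() summary below shows these traits centered near 0 with roughly unit variance, i.e. the dataset ships standardized real values rather than the raw integer scores. Purely as an illustration (this is not the dataset's actual quantification procedure; see the UCI description for that), a plain z-score standardization of hypothetical raw scores would look like:

```python
import pandas as pd

# Hypothetical raw Neuroticism scores on the original 12-60 integer scale
raw = pd.Series([12, 25, 37, 48, 60], name='Neuroticism')

# z-score: subtract the mean, divide by the (population) standard deviation
z = (raw - raw.mean()) / raw.std(ddof=0)
print(z.round(3).tolist())
```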
pers_data = dataset[personality_columns]
pers_data.describe()
Neuroticism Extraversion Openness to experience Agreeableness Conscientiousness Impulsiveness Sensation seeking
count 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000
mean -0.000551 -0.001951 -0.003224 -0.000657 -0.000394 0.005293 -0.007408
std 0.998442 0.997418 0.995691 0.996689 0.997657 0.954148 0.962074
min -3.464360 -3.273930 -3.273930 -3.464360 -3.464360 -2.555240 -2.078480
25% -0.678250 -0.695090 -0.717270 -0.606330 -0.652530 -0.711260 -0.525930
50% 0.042570 0.003320 -0.019280 -0.017290 -0.006650 -0.217120 0.079870
75% 0.629670 0.637790 0.723300 0.760960 0.584890 0.529750 0.765400
max 3.273930 3.273930 2.901610 3.464360 3.464360 2.901610 1.921730

Checking the balance / proportions

plot("pie", pers_data, personality_columns, fig_title="Pie plots personality traits")

The values overlap but we don't really care about them (I left them just in case you need them); we just want to see whether some slices are bigger than others.

plot("count", pers_data, personality_columns, show_percent=False, fig_title="Count / Frequency of personality data")

Hard to interpret: some personality traits have many levels, and they are encoded as hardly interpretable real values (though they are at least sorted in the same order as the original scores, which aren't much more interpretable either).

According to the pie plot, there doesn't seem to be any huge imbalance.

The count / frequency plot is more interesting. The distributions of all the personality traits look roughly Gaussian, with some skewness involved sometimes. Let's plot those densities a bit more nicely against actual fitted Gaussian distributions to see if our intuition was right!

plot("density", pers_data, personality_columns, fig_title="Personality densities")

They are indeed bell-shaped, but far from the shape of a fitted Gaussian distribution.
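The density plot above comes from the post's own plot helper; for reference, a minimal standalone sketch of the same idea (a histogram overlaid with a Gaussian fitted via norm.fit, here on synthetic skewed data rather than the survey data) could be:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # render off-screen
import matplotlib.pyplot as plt
from scipy.stats import norm

rng = np.random.default_rng(0)
sample = rng.gumbel(size=1000)  # bell-shaped but skewed, like some traits here

mu, sigma = norm.fit(sample)  # maximum-likelihood Gaussian fit
xs = np.linspace(sample.min(), sample.max(), 200)

plt.hist(sample, bins=40, density=True, alpha=0.5, label='sample density')
plt.plot(xs, norm.pdf(xs, mu, sigma), label='fitted Gaussian')
plt.legend()
plt.savefig('density_vs_gaussian.png')
```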

Let's try out a statistical test.

def normal_test(dataset, columns=None, alpha=0.05):
    ''' Runs the Shapiro-Wilk test on each column and reports whether H0
        (the sample comes from a normal distribution) is rejected at level alpha.
    '''
    if columns is None:
        columns = dataset.columns
    data = dataset[columns]
    # shapiro returns (statistic, p-value) for each column
    stats, pvals = data.apply(shapiro).values
    return pd.DataFrame(data=[("Not rejected" if pval > alpha else "Rejected") + f" with alpha = {alpha}" for pval in pvals],
                        columns=["H0: is from normal distribution"], index=columns)
normal_test(pers_data)
H0: is from normal distribution
Neuroticism Not rejected with alpha = 0.05
Extraversion Rejected with alpha = 0.05
Openness to experience Rejected with alpha = 0.05
Agreeableness Rejected with alpha = 0.05
Conscientiousness Rejected with alpha = 0.05
Impulsiveness Rejected with alpha = 0.05
Sensation seeking Rejected with alpha = 0.05

After running the Shapiro-Wilk test on each personality trait, the only hypothesis that couldn't be rejected was "Neuroticism is from a normal distribution" with an alpha = 0.05.

Warning: That doesn’t mean it is Gaussian. Actually the p-value was only around 10%; we simply can’t reject the hypothesis at this level.
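Shapiro-Wilk is not the only option: scipy's normaltest (the D'Agostino-Pearson test, already imported at the top of the post) looks at skewness and kurtosis instead and can serve as a second opinion. A quick sanity check on synthetic data (not the survey data):

```python
import numpy as np
from scipy.stats import shapiro, normaltest

rng = np.random.default_rng(42)
gaussian = rng.normal(size=500)
skewed = rng.exponential(size=500)  # clearly non-normal

# Each test returns (statistic, p-value); a small p-value rejects normality
for name, sample in [('gaussian', gaussian), ('skewed', skewed)]:
    p_sw = shapiro(sample)[1]
    p_dp = normaltest(sample)[1]
    print(name, round(p_sw, 4), round(p_dp, 4))
```

Both tests should reject normality decisively for the exponential sample.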

Correlations among personality traits (using Spearman)

"In statistics, Spearman's rank correlation coefficient or Spearman's ρ, named after Charles Spearman, is a nonparametric measure of rank correlation (statistical dependence between the rankings of two variables). ... Spearman's correlation assesses monotonic relationships (whether linear or not). If there are no repeated data values, a perfect Spearman correlation of +1 or −1 occurs when each of the variables is a perfect monotone function of the other."(Wikipedia)

Again, we have to define a few useful functions.

def calculate_r(df1, df2, method='spearman', pvalues=True):
    ''' Returns correlation matrix extended by statistical significance index. Used for non-nominal data.
    
    df1, df2 - DataFrames of data to correlate;
    method - mainly pearson and spearman;
    pvalues - if set to False, only correlation values will be returned in DataFrame 
        (without '**' marks for significant observations)
    '''
    float_precision = 3
    
    data_corr_table = pd.DataFrame()
    data_pvalues = pd.DataFrame()
    for x in df1.columns:
        for y in df2.columns:
            if method == 'pearson':
                corr = pearsonr(df1[x], df2[y])
            elif method == 'spearman':
                corr = spearmanr(df1[x], df2[y])
            else:
                raise ValueError('Unknown method')
            if pvalues:
                data_corr_table.loc[x,y] = '{} {}'.format(round(corr[0], float_precision), get_pval_code(corr[1]))
            elif not pvalues:
                data_corr_table.loc[x,y] = round(corr[0], float_precision)
            data_pvalues.loc[x,y] = corr[1]
    
    return data_corr_table, data_pvalues
def get_ranked_corr(correlations, pvalues, pval_threshold=0.05, corr_threshold=0.5, name_transform=lambda x: x, against_itself=True, comp_corr=operator.ge, comp_pval=operator.le):
    ''' Returns the sorted top (wanted) correlations as a DataFrame
    
    correlations - the correlations DataFrame;
    pvalues - the p-values of H0: the correlation is not significant;
    pval_threshold - alpha to compare the p-values to;
    corr_threshold - threshold of correlations to consider (upper or lower bound depending on comp_corr);
    name_transform - transforms the column names before writing the index;
    against_itself - True if columns and rows hold the same data (i.e. the matrix is symmetric);
    comp_corr - comparator applied to corr_threshold (default operator.ge means we want correlations >=);
    comp_pval - comparator applied to pval_threshold (default operator.le means we want p-values <=)
    '''
    
    columns = correlations.columns
    rows = correlations.index
    pairs=[]
    for i, row in enumerate(rows):
        for j, col in enumerate(columns[i+1:], start=i+1) if against_itself else enumerate(columns):
            corr = correlations.iloc[i, j]
            if comp_pval(pvalues.iloc[i,j], pval_threshold) and comp_corr(abs(corr), corr_threshold):
                pairs.append((name_transform(row), name_transform(col), corr))
    pairs.sort(key=lambda x: abs(x[2]), reverse=True)
    return pd.Series(data=[x[2] for x in pairs], index=[' / '.join((x[0], x[1])) for x in pairs], name="Correlation", dtype=np.float64).to_frame()

Now let's use them.

pers_data_corr_table, pers_data_pvalues = calculate_r(pers_data, pers_data, method='spearman')
pers_data_corr_table
Neuroticism Extraversion Openness to experience Agreeableness Conscientiousness Impulsiveness Sensation seeking
Neuroticism 1.0 *** -0.417 *** 0.015 -0.206 *** -0.378 *** 0.168 *** 0.09 ***
Extraversion -0.417 *** 1.0 *** 0.226 *** 0.162 *** 0.29 *** 0.119 *** 0.19 ***
Openness to experience 0.015 0.226 *** 1.0 *** 0.037 -0.073 ** 0.27 *** 0.405 ***
Agreeableness -0.206 *** 0.162 *** 0.037 1.0 *** 0.237 *** -0.226 *** -0.208 ***
Conscientiousness -0.378 *** 0.29 *** -0.073 ** 0.237 *** 1.0 *** -0.344 *** -0.254 ***
Impulsiveness 0.168 *** 0.119 *** 0.27 *** -0.226 *** -0.344 *** 1.0 *** 0.628 ***
Sensation seeking 0.09 *** 0.19 *** 0.405 *** -0.208 *** -0.254 *** 0.628 *** 1.0 ***
heatmap_corr(pers_data, method='spearman');

Again, no strong correlations but some moderate and weak-moderate ones. The sign of the correlation is interesting here.

Neuroticism, for example, has no positive correlation with any other personality trait. It has a weak-to-moderate negative (downhill) correlation with extraversion and conscientiousness.

Tip: If you ever feel moody, anxious, frustrated, lonely, depressed or other such things, try to be more outgoing, talkative, energetic, careful and diligent and see how it goes! :wink:

I also found it interesting that Extraversion has no moderate correlation with any other personality trait (and that its strongest correlation, which is weak, is with conscientiousness).

Below is a list of the statistically significant correlations of strength very weak or above (|r| >= 0.15).

pers_data_corr_table_nopval, _ = calculate_r(pers_data, pers_data, method='spearman', pvalues=False)
get_ranked_corr(pers_data_corr_table_nopval, pers_data_pvalues, corr_threshold=0.15)
Correlation
Impulsiveness / Sensation seeking 0.628
Neuroticism / Extraversion -0.417
Openness to experience / Sensation seeking 0.405
Neuroticism / Conscientiousness -0.378
Conscientiousness / Impulsiveness -0.344
Extraversion / Conscientiousness 0.290
Openness to experience / Impulsiveness 0.270
Conscientiousness / Sensation seeking -0.254
Agreeableness / Conscientiousness 0.237
Extraversion / Openness to experience 0.226
Agreeableness / Impulsiveness -0.226
Agreeableness / Sensation seeking -0.208
Neuroticism / Agreeableness -0.206
Extraversion / Sensation seeking 0.190
Neuroticism / Impulsiveness 0.168
Extraversion / Agreeableness 0.162

Drug consumption analysis

As a reminder, here are the different levels for each drug:

0 - Never Used

1 - Used over a Decade Ago

2 - Used in the Last Decade

3 - Used in the Last Year

4 - Used in the Last Month

5 - Used in the Last Week

6 - Used in the Last Day
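In the raw CSV from UCI, the drug columns hold these levels as the strings CL0-CL6; one simple way to convert them to the 0-6 integers used below (a sketch of the idea, not necessarily the exact loading code used earlier in the post):

```python
import pandas as pd

# Hypothetical raw column with the original CL codes
raw = pd.Series(['CL0', 'CL3', 'CL6', 'CL5'])

# Drop the 'CL' prefix and keep the ordinal level as an integer
levels = raw.str[2:].astype(int)
print(levels.tolist())  # [0, 3, 6, 5]
```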

drug_data = dataset[drugs_columns]
drug_data.describe()
Alcohol consumption Amphetamines consumption Amyl nitrite consumption Benzodiazepine consumption Caffeine consumption Cannabis consumption Chocolate consumption Cocaine consumption Crack consumption Ecstasy consumption Heroin consumption Ketamine consumption Legal highs consumption Lysergic acid diethylamide consumption Methadone consumption Magic mushrooms consumption Nicotine consumption Volatile substance abuse consumption
count 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000 1877.000000
mean 4.637720 1.337773 0.606819 1.461907 5.484283 2.980288 5.109750 1.156633 0.295685 1.309536 0.372403 0.564198 1.353223 1.053277 0.826319 1.177944 3.194992 0.429409
std 1.328234 1.782384 1.064005 1.869193 1.115119 2.286778 1.085716 1.510791 0.835925 1.647373 1.034761 1.216341 1.790881 1.484582 1.648379 1.459212 2.415814 0.959160
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 4.000000 0.000000 0.000000 0.000000 5.000000 1.000000 5.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000
50% 5.000000 0.000000 0.000000 0.000000 6.000000 3.000000 5.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 3.000000 0.000000
75% 6.000000 2.000000 1.000000 3.000000 6.000000 5.000000 6.000000 2.000000 0.000000 3.000000 0.000000 0.000000 3.000000 2.000000 0.000000 2.000000 6.000000 0.000000
max 6.000000 6.000000 6.000000 6.000000 6.000000 6.000000 6.000000 6.000000 6.000000 6.000000 6.000000 6.000000 6.000000 6.000000 6.000000 6.000000 6.000000 6.000000

Let's make a table with the number of observations and the frequency for every consumption level of every drug.

counts = []
proportions = []
for drug in drugs_columns:
    counts.append(drug_data[drug].value_counts())
    proportions.append(drug_data[drug].value_counts(normalize=True))

drug_table_count = pd.concat(counts, axis=1, 
          keys=drugs_columns, sort=True)
drug_table_prop = (pd.concat(proportions, axis=1, 
          keys=drugs_columns, sort=True) * 100).round(2)
drug_table = pd.concat([drug_table_count.T, drug_table_prop.T], keys=['N', '%'], axis=1)

drug_table
N %
0 1 2 3 4 5 6 0 1 2 3 4 5 6
Alcohol consumption 33 34 68 197 284 758 503 1.76 1.81 3.62 10.50 15.13 40.38 26.80
Amphetamines consumption 973 230 241 196 75 61 101 51.84 12.25 12.84 10.44 4.00 3.25 5.38
Amyl nitrite consumption 1299 210 236 91 24 14 3 69.21 11.19 12.57 4.85 1.28 0.75 0.16
Benzodiazepine consumption 999 116 230 234 119 84 95 53.22 6.18 12.25 12.47 6.34 4.48 5.06
Caffeine consumption 27 10 24 59 106 271 1380 1.44 0.53 1.28 3.14 5.65 14.44 73.52
Cannabis consumption 413 207 266 210 138 185 458 22.00 11.03 14.17 11.19 7.35 9.86 24.40
Chocolate consumption 32 2 10 53 295 680 805 1.70 0.11 0.53 2.82 15.72 36.23 42.89
Cocaine consumption 1036 160 267 257 98 40 19 55.19 8.52 14.22 13.69 5.22 2.13 1.01
Crack consumption 1622 67 109 59 9 9 2 86.41 3.57 5.81 3.14 0.48 0.48 0.11
Ecstasy consumption 1020 112 232 275 154 63 21 54.34 5.97 12.36 14.65 8.20 3.36 1.12
Heroin consumption 1600 68 91 65 24 16 13 85.24 3.62 4.85 3.46 1.28 0.85 0.69
Ketamine consumption 1488 43 140 129 40 33 4 79.28 2.29 7.46 6.87 2.13 1.76 0.21
Legal highs consumption 1092 29 195 321 109 64 67 58.18 1.55 10.39 17.10 5.81 3.41 3.57
Lysergic acid diethylamide consumption 1069 257 175 213 96 55 12 56.95 13.69 9.32 11.35 5.11 2.93 0.64
Methadone consumption 1424 39 95 148 50 48 73 75.87 2.08 5.06 7.88 2.66 2.56 3.89
Magic mushrooms consumption 982 208 259 272 114 39 3 52.32 11.08 13.80 14.49 6.07 2.08 0.16
Nicotine consumption 428 193 203 184 106 156 607 22.80 10.28 10.82 9.80 5.65 8.31 32.34
Volatile substance abuse consumption 1452 199 133 59 13 14 7 77.36 10.60 7.09 3.14 0.69 0.75 0.37

We will also plot them for each drug.

plot("count", drug_data, fig_title="Count / Frequency for drug consumption\n\n", rotate_label=False)