Drug consumption dataset data analysis
Multilabel data analysis using python and associated libraries.
- Setting up workspace
- Dataset preparation
- Exploratory data analysis
- Demographic data analysis
- Personality features analysis
- Drug consumption analysis
- Correlations between personality traits and drugs (using Spearman)
- Nominal correlations between demographic and drug consumption (Cramér's V)
- Nominal correlations between demographic info and personality traits (using Cramér's V)
- Difference significance tests
- Consumer vs Non-consumer
- Conclusion
This post is a great way to dive into data analysis and get familiar with Python tools such as pandas, seaborn, matplotlib, and numpy.
This is also an opportunity to work on multilabel data (several target variables that are not mutually exclusive).
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns
import matplotlib as mpl
import math
from scipy.stats import pearsonr, spearmanr, chi2_contingency, ttest_ind, mannwhitneyu, norm, normaltest, shapiro, anderson
import operator
from IPython.display import HTML, display
sns.set(rc={'figure.figsize':(10,5)})
sns.set_style('whitegrid')
%matplotlib inline
Drug consumption (quantified) Data Set is a multilabel dataset from the UCI machine learning repository.
The dataset contains records for 1885 respondents.
To summarize, it contains:
- an ID column
- 5 demographic columns (features)
- 7 personality traits (features)
- 18 drugs with their usage frequency (target)
- a fake drug called Semeron to verify reliability of answers
Each drug variable can take 7 different values:
- CL0 Never Used
- CL1 Used over a Decade Ago
- CL2 Used in the Last Decade
- CL3 Used in the Last Year
- CL4 Used in the Last Month
- CL5 Used in the Last Week
- CL6 Used in the Last Day
All qualitative variables were transformed to real values (except for the target variables).
The following lists will be useful for loading and indexing the data.
demographic_columns = [
'Age',
'Gender',
'Education',
'Country',
'Ethnicity',
]
personality_columns = [
'Neuroticism',
'Extraversion',
'Openness to experience',
'Agreeableness',
'Conscientiousness',
'Impulsiveness',
'Sensation seeking'
]
feature_columns = demographic_columns + personality_columns
drugs_columns = [
'Alcohol consumption',
'Amphetamines consumption',
'Amyl nitrite consumption',
'Benzodiazepine consumption',
'Caffeine consumption',
'Cannabis consumption',
'Chocolate consumption',
'Cocaine consumption',
'Crack consumption',
'Ecstasy consumption',
'Heroin consumption',
'Ketamine consumption',
'Legal highs consumption',
'Lysergic acid diethylamide consumption',
'Methadone consumption',
'Magic mushrooms consumption',
'Nicotine consumption',
'Fictitious drug Semeron consumption',
'Volatile substance abuse consumption'
]
drugs_legal = ['Alcohol consumption', 'Caffeine consumption', 'Chocolate consumption', 'Nicotine consumption']
drugs_illegal = [drug for drug in drugs_columns if drug not in drugs_legal]
all_columns = feature_columns + drugs_columns
dataset = pd.read_csv("drug_consumption.data", names=["ID"] + all_columns)
dataset = dataset.set_index("ID")
dataset.head()
We successfully loaded the dataset and set the ID as row index.
dataset.dtypes
Features are all floats and targets are objects.
dataset.shape
1885 observations and 31 variables.
dataset.isnull().values.any()
There are no missing values.
for i in drugs_columns:
dataset[i] = dataset[i].map({'CL0': 0, 'CL1': 1, 'CL2': 2, 'CL3': 3, 'CL4': 4, 'CL5': 5, 'CL6': 6})
Mapping CLX to X. Now all variables are numeric.
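A one-line check (just a sketch) that the mapping worked and the drug columns are now integers:
dataset[drugs_columns].dtypes.unique()  # expected: a single integer dtype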
Let's see who claimed to have taken Semeron.
semerons = dataset[dataset['Fictitious drug Semeron consumption'] != 0]
semerons
There are 8 people that lied about taking Semeron, which is a fake drug, so we should remove them.
dataset = dataset[dataset['Fictitious drug Semeron consumption'] == 0]
# Removing it from drug lists
drugs_columns.remove('Fictitious drug Semeron consumption')
drugs_illegal.remove('Fictitious drug Semeron consumption')
# Dropping the column from the dataset
dataset = dataset.drop(columns='Fictitious drug Semeron consumption')
dataset.shape
We have 1877 respondents left.
dataset.describe()
The table above is dense and hard to read. Let's create a plot to make it easier to interpret.
fig, ax = plt.subplots(figsize=(11,15))
plt.ylabel("Features")
plt.title("Box plot of Pre-Processed Data Set")
ax = sns.boxplot(data = dataset[feature_columns], orient="h", palette="Set2")
sns.reset_orig()
Seems like the dataset creators centered the features near 0 and scaled them with a small std (around 1), so we can probably keep the current encoding to build a model.
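As a quick sanity check of that claim (a small sketch using the feature columns defined above), we can look at each feature's mean and standard deviation directly:
# Features should be roughly centered (mean near 0) with a standard deviation around 1
dataset[feature_columns].agg(['mean', 'std']).round(2)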
For data analysis, we will transform the quantified categorical data back to a clearer, nominal, form.
demo_data = dataset.copy()
age = ['18-24' if a <= -0.9 else
'25-34' if a >= -0.5 and a < 0 else
'35-44' if a > 0 and a < 1 else
'45-54' if a > 1 and a < 1.5 else
'55-64' if a > 1.5 and a < 2 else
'65+'
for a in demo_data['Age']]
gender = ['Female' if g > 0 else "Male" for g in demo_data['Gender']]
education = ['Left school before 16 years' if e <-2 else
'Left school at 16 years' if e > -2 and e < -1.5 else
'Left school at 17 years' if e > -1.5 and e < -1.4 else
'Left school at 18 years' if e > -1.4 and e < -1 else
'Some college or university, no certificate or degree' if e > -1 and e < -0.5 else
'Professional certificate/ diploma' if e > -0.5 and e < 0 else
'University degree' if e > 0 and e < 0.5 else
'Masters degree' if e > 0.5 and e < 1.5 else
'Doctorate degree'
for e in demo_data['Education']]
country = ['USA' if c < -0.5 else
'New Zealand' if c > -0.5 and c < -0.4 else
'Other' if c > -0.4 and c < -0.2 else
'Australia' if c > -0.2 and c < 0 else
'Ireland' if c > 0 and c < 0.23 else
'Canada' if c > 0.23 and c < 0.9 else
'UK'
for c in demo_data['Country']]
ethnicity = ['Black' if e < -1 else
'Asian' if e > -1 and e < -0.4 else
'White' if e > -0.4 and e < -0.25 else
'Mixed-White/Black' if e >= -0.25 and e < 0.11 else
'Mixed-White/Asian' if e > 0.12 and e < 1 else
'Mixed-Black/Asian' if e > 1.9 else
'Other'
for e in demo_data['Ethnicity']]
demo_data['Age'] = age
demo_data['Gender'] = gender
demo_data['Education'] = education
demo_data['Country'] = country
demo_data['Ethnicity'] = ethnicity
demo_data[demographic_columns].head()
This looks better.
Let's start off by taking a look at the demographic data balance.
But first, we will define a few useful functions that will be reused throughout this analysis.
def plot_density(dataset, col, ax, plot_gaussian = True, color="Blue"):
'''
Extension of the seaborn histogram that plots, for a given column, an estimated normal distribution (if requested) on top of the fitted data distribution.
'''
ncount = len(dataset)
if plot_gaussian:
std = dataset[col].std()
mean = dataset[col].mean()
#plot histogram using seaborn
ax = sns.histplot(dataset[col], ax=ax, color=color, kde=True, stat="probability", kde_kws={"bw_adjust":3})
if plot_gaussian:
# Limiting our gaussian to the limits from the above plot
xmin, xmax = ax.get_xlim()
xnorm = np.arange(xmin, xmax, 0.001)
ynorm = norm.pdf(xnorm, mean, std)
ax.plot(xnorm, ynorm, 'r', lw=2)
ax.legend(["data distribution", "estimated normal distribution"], loc="upper center", bbox_to_anchor=(0.5, -0.05), fancybox=True, shadow=True, ncol=2)
ax.set_title(col)
ax.set_xlabel("")
def plot_pie(dataset, col, ax):
'''
Pandas' pie plot wrapper
'''
ax = dataset[col].value_counts().plot(kind='pie', ax=ax)
ax.set_title(col)
ax.set_ylabel("")
def plot_count(dataset, col, ax, order = None, show_percent=True, rotate_label = True, add_args={"palette": "Set2"}):
'''
Extending the seaborn countplot to get frequencies and counts in a pretty way.
'''
ncount = len(dataset)
if order is None:
order = np.sort(dataset[col].unique())
# plot seaborn countplot
ax = sns.countplot(data=dataset, x=col, ax=ax, order=order, **add_args)
# Get y limit (number of elements)
_ ,max_nb = ax.get_ylim()
# Transform this limit into a frequency in [0, 100]
freq_lim = (max_nb * 100/ ncount)
# Duplicate the ax
ax2 = ax.twinx()
#Move duplicate y axis ticks to the left
ax2.yaxis.tick_left()
#Move original y axis ticks to the right
ax.yaxis.tick_right()
# Swap the label positions to match the ticks
ax.yaxis.set_label_position('right')
ax2.yaxis.set_label_position('left')
ax2.set_ylabel('Frequency [%]')
# We want to write the frequency on top of each bar
if show_percent:
# for every bar
for p in ax.patches:
x=p.get_bbox().get_points()[:,0]
y=p.get_bbox().get_points()[1,1]
if not math.isnan(x.mean()) and not math.isnan(y):
# Write frequency at an x and y coordinate
ax.text(x.mean(), y, '{:.1f}%'.format(100.*y/ncount),
ha='center', va='bottom')
# Set y axis limit for counts and frequencies
ax2.set_ylim(0,freq_lim)
ax.set_ylim(0,max_nb)
# set ticks for count
ax.yaxis.set_major_locator(ticker.LinearLocator(11))
ax.yaxis.set_tick_params(which="major")
# set ticks for frequencies
ax2.yaxis.set_major_locator(ticker.MultipleLocator(freq_lim/10))
ax2.yaxis.set_tick_params(which="major")
# remove grid for ax 2 (keep only ax)
ax2.grid(False)
ax.grid(False)
ax.set_xlabel("")
if rotate_label:
# rotate tick labels on the x axis / / /
_ = ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")
ax.set_title(col)
def plot(kind, dataset, columns=None, fig_title="Count/Frequency plots", fontsizes = 20, **kwargs):
'''
Wrapper function that takes care of plot wise sizes and calling the wanted procedure
'''
# plot choices
kind_dict = {
'pie': plot_pie,
'count': plot_count,
'density': plot_density}
if kind not in kind_dict:
raise ValueError(f"{kind} is not a valid kind, has to be one of {kind_dict.keys()}")
if columns is None:
# use all dataset columns
columns = list(dataset.columns)
fig = None
# Setting font sizes
plt.rc('xtick', labelsize=fontsizes*1.5)
plt.rc('ytick', labelsize=fontsizes*1.5)
plt.rc('axes', labelsize=fontsizes*2)
plt.rc('legend', fontsize=fontsizes*1.5, title_fontsize=0)
plt.rc('axes', titlesize=2*fontsizes)
plt.rc('font', size=1.7*fontsizes)
# Scale of the figure in ax (to be used later)
figsize_scale = fontsizes
if not isinstance(columns, list):
# columns has to be a list
if isinstance(columns, str):
columns = [columns]
else:
columns = list(columns)
if len(columns) == 1: # Only variable to plot
ncols, nrows = 1, 1
figsize_scale *= 2 # double figsize
else:
ncols, nrows = 2, math.ceil(len(columns) / 2)
fig, axes = plt.subplots(figsize=(figsize_scale*ncols, figsize_scale*nrows), nrows=nrows, ncols=ncols)
if ncols == 1 and nrows == 1:
# We need a list of axes
axes = np.array([axes])
# Plot
do_plots(dataset, columns, axes, kind_dict[kind], **kwargs)
fig.suptitle(fig_title + "\n\n", fontsize=fontsizes*2.5)
plt.tight_layout()
#Reset plot setting to original
sns.reset_orig()
def do_plots(dataset, columns, axes, plot_func, **kwargs):
'''
Calls the plotting function on every axis and removes unused axes.
'''
axes = axes.flat
#plot every variable
for index, col in enumerate(columns):
plot_func(dataset, col, axes[index], **kwargs)
# remove empty axes
for empty in range(len(columns), len(axes)):
axes[empty].axis("off")
plot("pie", demo_data, demographic_columns, fig_title="Plot pies of demographic data")
plot("count", demo_data, demographic_columns, fig_title="Count / Frequency plots of demographic data")
We can see some features are pretty balanced but in some cases, classes are underrepresented. We will go over each of them in more detail.
def value_counts_percentage(dataset, column):
''' value_counts() method extended by displaying percentage '''
a = dataset[column].value_counts()
b = dataset[column].value_counts(normalize=True) * 100
return pd.concat([a,b.round(2)], axis=1, keys=['N', '%'])
value_counts_percentage(demo_data, 'Age')
The age groups are intervals of around 10 years. As we can see, the older the age group, the less represented they are.
The 18-24 group holds around 1/3 of all respondents!
Unbalanced and not representative.
value_counts_percentage(demo_data, 'Gender')
Gender is balanced and representative.
value_counts_percentage(demo_data, 'Education')
There is a predominance of educated people (around 85% have at least some college-level education), causing an imbalance.
Not sure these are representative of the real-life distribution, but looking at educational attainment in the United States, it seems the proportions are not so far off for developed countries (which are heavily represented in this dataset).
value_counts_percentage(demo_data, 'Country')
All the known countries (90+%) are English speaking, with the US and UK together accounting for over 80% of the data.
This is a clear imbalance and isn't representative.
value_counts_percentage(demo_data, 'Ethnicity')
Over 90% of respondents are white. Again, a clear imbalance that is not representative.
plot("count", demo_data, 'Age', fig_title="Age-Gender count/frequency", rotate_label=False, add_args={"hue":"Gender", "palette":'ch:.25'}, fontsizes=4.5);
- There are clearly a lot more men in the 18-24 age group.
- More women in the 25-34 age group.
- As the age increases, the gender proportion gradually evens out.
plot("count", demo_data, 'Country', fig_title="Age-Gender count/frequency", rotate_label=False, add_args={"hue":"Gender", "palette":'ch:.25'}, fontsizes=4.5);
- More women than men were tested in the USA, New-Zealand (see next table) and "Other".
- More men than women in the UK.
- Around the same in the remaining countries.
pd.crosstab(demo_data['Age'], [demo_data['Gender'], demo_data['Country']])
- New Zealand doesn't have anyone over 34 years old and tested only one woman.
- Ireland doesn't have anyone over 54 and has no man in the 35-44 age group.
- Australia doesn't have anyone over 65 (and only one respondent over 54) and has 20 times more males than females in the 18-24 age group.
- Canada has 3 times more men than women in the 18-24 group and only 2 males above 54.
- The UK has over twice as many women as men in the 25-34 group.
- The USA has over twice as many men as women in the 18-24 group and no woman over 64.
- The "Other" countries have 3 times more men than women in the 18-24 age group, twice as many in the 25-34 group, no woman in the 55-64 group, and no one over 64.
def catplot(data, x, hue, col, rotate_label=True, palette="ch:.25"):
plot = sns.catplot(x=x, hue=hue, col=col, palette=palette, data=data.sort_values(by=[x]), kind="count")
if rotate_label:
for axes in plot.axes.flat:
_ = axes.set_xticklabels(axes.get_xticklabels(), rotation=40, ha="right")
catplot(demo_data, 'Country', 'Age', 'Gender')
In the plot above, some of the aspects described earlier stand out quite nicely.
demo_data.pivot_table(index='Education', columns=['Gender', 'Ethnicity'], aggfunc='size', fill_value=0)
- Basically few entries of people leaving school at or before 18.
- Non-white: no one left school before 16 and, apart from the 'Other' category, there is only one female who didn't get a higher education. No male except in 'Other' has a doctorate degree.
- Asian: no respondent with lower education and no male in "some college or university" (but there are some with master's degrees, so this doesn't mean no one went through university).
- Black: only one person with lower education (one male who left school at 18) and no female with a professional diploma, even though that is the category with the most respondents on the male side.
- Mixed-Black/Asian: very few respondents; only people following some college or university, or with a university degree, are represented.
- Mixed-White/Asian: all the female respondents have a higher education and 2 males left school at 18 or before. Males, unlike females, have no respondent with a master's degree.
- Mixed-White/Black: all females except 2 have finished a university, master's or doctorate degree, while the males have only one master's degree and the others have either unfinished university degrees, a professional diploma, or left school at 18 or before.
- Other: lots of males have professional diplomas, while no female has one. There is a high number of doctorate degrees but also a high number of respondents who left school at 18 or before (both relative to the categories discussed so far).
- White: basically all the respondents are white, as we saw before. Both genders have respondents in all categories but not in the same proportions. To see things more clearly, refer to the table below (focusing on proportions instead of counts). The majority of males are pursuing a college or university degree, while the majority of females have a university degree. Nearly twice as many females have a doctorate degree, and more males have left school at 18 or before.
pd.concat([value_counts_percentage(demo_data[(demo_data['Ethnicity'] == 'White') & (demo_data['Gender'] == 'Male')], 'Education'),
value_counts_percentage(demo_data[(demo_data['Ethnicity'] == 'White') & (demo_data['Gender'] == 'Female')], 'Education')],
axis = 1, keys=['Male', 'Female'])
catplot(demo_data, 'Ethnicity', 'Education', 'Gender')
pd.crosstab(demo_data['Country'], [demo_data['Gender'], demo_data['Ethnicity']])
- Nearly no entries of Mixed-Black/Asian (and, compared to White, few entries of any other category).
- The USA has no Mixed-White/Black female, and no Asian or Mixed-Black/Asian male.
- All the other countries have no Mixed-Black/Asian respondent (except one male in an 'Other' country), even the UK with its large number of respondents.
- The UK has 9 times more Mixed-White/Black females than males and 3 times more 'Other' females than males.
"In statistics, Cramér's V (sometimes referred to as Cramér's phi and denoted as φc) is a measure of association between two nominal variables, giving a value between 0 and +1 (inclusive). It is based on Pearson's chi-squared statistic and was published by Harald Cramér in 1946."(Wikipedia)
Below we define a set of useful functions for working with correlations.
def get_pval_code(pval):
'''
Returns a significance code string for a given p-value.
'''
code = ''
if pval < 0.001:
code = '***'
elif pval < 0.01:
code = '**'
elif pval < 0.05:
code = '*'
elif pval < 0.1:
code = '.'
return code
def display_pval_codes():
print("Significance codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1")
def cramers_v(crosstab):
''' Returns Cramer's V correlation coefficient (and statistic significance) for data
delivered as a crosstable. Used for nominal data.
'''
chi2 = chi2_contingency(crosstab)[0]
n = crosstab.sum().sum()
phi2 = chi2/n
r,k = crosstab.shape
phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
rcorr = r - ((r-1)**2)/(n-1)
kcorr = k - ((k-1)**2)/(n-1)
return round(np.sqrt(phi2corr / min((kcorr-1), (rcorr-1))), 3), chi2_contingency(crosstab)[1]
def nominal_corrs(dataset, col_1_names, col_2_names, pvalues=True):
''' Returns Cramer's V coefficients matrix and p-value (obtained with Chi-square test of independence) matrix for the whole dataset.
col_1_names, col_2_names - lists of names of columns to correlate. Function creates crosstables for every
columns' combination and returns a matrix with single Cramer's V coefficients of every combination;
pvalues - if set to False, only correlation values will be returned in DataFrame
(without '**' marks for significant observations)
'''
corr_table = pd.DataFrame()
pval_table = pd.DataFrame()
for i in range(len(col_1_names)):
for j in range(len(col_2_names)):
crosstab = pd.crosstab(dataset[col_1_names[i]], [dataset[col_2_names[j]]])
corr, pval = cramers_v(crosstab)
v = ' '.join([str(i) for i in (corr, get_pval_code(pval))]) if pvalues else corr
corr_table.loc[i, j] = v
pval_table.loc[i,j] = pval
corr_table.index = col_1_names
corr_table.columns = col_2_names
pval_table.index = col_1_names
pval_table.columns = col_2_names
return corr_table, pval_table
def heatmap_corr(dataset, method='spearman', ready=False, mask=True, nominal=False):
''' Extended sns.heatmap() method.
dataset - can be 'pure' data (without calculated correlations) or a DataFrame with already calculated
correlations (in that case attribute 'ready' should be set to True);
method - mainly pearson or spearman; nominal correlations should be calculated externally
and be delivered with attribute ready=True;
mask - if dataset is NOT a cross-valued DataFrame of one type, mask should be set to False;
nominal - for nominal data, correlation values are in range (0, 1) instead of (-1, 1).
nominal=True should be followed by ready=True
'''
if not ready:
corr = dataset.corr(method=method)
elif ready:
corr = dataset
cmap = sns.diverging_palette(220, 10, as_cmap=True)
vmax = corr.max().max()
if nominal:
center = 0.5
cmap=None
elif not nominal:
center = 0
if mask:
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
np.fill_diagonal(corr.values, -2)
vmax = corr.max().max()
elif not mask:
mask=None
f, ax = plt.subplots(figsize=(20,9))
return sns.heatmap(corr, cmap=cmap, mask=mask, vmax=vmax, center=center, annot=True, square=True,
linewidths=0.5, cbar_kws={'shrink': 0.5})
Let's display the correlations among the demographic data.
corr_table_demo, pval_table_demo = nominal_corrs(demo_data, demographic_columns, demographic_columns)
display_pval_codes()
corr_table_demo
Values with at least one * next to them are statistically significant (p-value < 0.05).
heatmap_corr(nominal_corrs(demo_data, demographic_columns, demographic_columns, pvalues=False)[0],
nominal=True, ready=True);
Only very weak statistically significant correlations at most.
This section tries to interpret these very weak relationships between variables.
Ethnicity's lack of statistically significant correlations can be explained by the great imbalance across the possible classes (almost everyone is white), meaning the lack of correlation could simply be due to chance. The absence of correlation with country (with high significance) may sound surprising, but since the countries considered here are essentially the "same" (developed, with a white majority among respondents), it makes sense. It's safe to assume the lack of correlation with age and gender is correct (even if not significant) for obvious reasons. I can't be so sure about education (as you can see below, there is around a 50% chance the correlation was obtained by chance).
pval_table_demo.loc['Education', 'Ethnicity']
There is a statistically significant but very weak correlation between country and gender. We know, at least for the countries represented here (developed countries where the population is basically 50/50), that there isn't a real-life correlation between country and gender. The very weak correlation may be due to some countries having more female respondents and some having more males (as we saw previously).
The weak correlation between age and gender can be explained, as we saw previously, by the fact that gender wasn't evenly distributed in some of the young age groups, but as we know this correlation doesn't generalize to real life.
A correlation between age and country would be relevant if some countries were more developed than others, but here we only have developed countries. This idea would make more sense if we had more categories for older people, so that a country's life expectancy could be taken into account; a single group for people 65+ isn't enough to differentiate developed countries. The very weak but statistically significant relationship may be caused by the 'Other' countries.
The very weak correlations we can keep in mind are the following:
- Age / Education (makes sense since basically no 18-24 would have a doctorate degree for example)
- Gender / Education (males and females may have slightly different aims and standards regarding education)
- Education / Country (for social tendencies, country development and other reasons, the education and country may be correlated but since the country here are very similar, the relationship is very weak)
All the personality traits have a score that is represented by a real value in the dataset.
Some have their original possible scores as integer ranges (with a step of 1, so 4-7 means 4, 5, 6 and 7):
- Neuroticism: 12-60
- Extraversion: 16-59
- Openness to experience: 24-60
- Agreeableness: 12-60
- Conscientiousness: 17-59
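The quantified encoding keeps the ordering of those scores. As a small illustrative check (nothing we rely on later), we can count how many distinct quantified levels each trait takes in the data:
# Number of distinct quantified levels observed for each personality trait
{col: dataset[col].nunique() for col in personality_columns}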
pers_data = dataset[personality_columns]
pers_data.describe()
plot("pie", pers_data, personality_columns, fig_title="Pie plots personality traits")
The value labels overlap, but we don't really care about them (I left them in just in case you need them); we just want to see whether some slices are bigger than others.
plot("count", pers_data, personality_columns, show_percent=False, fig_title="Count / Frequency of personality data")
Hard to interpret, since some personality traits have many levels and they are encoded as real values that are hard to read, even though they are ordered the same way as the original scores (which aren't very interpretable either).
According to the pie plots, there doesn't seem to be any huge imbalance.
The count / frequency plot is more interesting: the distributions of the personality traits all look like they could be Gaussian, with some skewness at times. Let's plot those densities a bit more nicely against estimated Gaussian distributions to see if our intuition was right!
plot("density", pers_data, personality_columns, fig_title="Personality densities")
They are indeed bell-shaped but far from the shape of a fitted Gaussian distribution.
Let's try out a statistical test.
def normal_test(dataset, columns = None, alpha = 0.05):
if columns is None:
columns = dataset.columns
res = pd.DataFrame()
data = dataset[columns]
stats, pvals = data.apply(shapiro).values
return pd.DataFrame(data=[("Not rejected" if pval > alpha else "Rejected") + f" with alpha = {alpha}" for pval in pvals], columns=["H0: is from normal distribution"], index=columns)
normal_test(pers_data)
After running the Shapiro-Wilk test on each personality trait, the only hypothesis that couldn't be rejected was "Neuroticism is from a normal distribution" with an alpha = 0.05.
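normaltest (D'Agostino's K² test) is also among our imports; as a cross-check sketch, the same kind of table can be built with it instead of Shapiro-Wilk (the conclusion may of course differ):
# Same idea as normal_test above, but using D'Agostino's K^2 test
pd.DataFrame(
    data=[("Not rejected" if normaltest(pers_data[c])[1] > 0.05 else "Rejected") + " with alpha = 0.05"
          for c in personality_columns],
    columns=["H0: is from normal distribution"], index=personality_columns)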
"In statistics, Spearman's rank correlation coefficient or Spearman's ρ, named after Charles Spearman, is a nonparametric measure of rank correlation (statistical dependence between the rankings of two variables). ... Spearman's correlation assesses monotonic relationships (whether linear or not). If there are no repeated data values, a perfect Spearman correlation of +1 or −1 occurs when each of the variables is a perfect monotone function of the other."(Wikipedia)
Again, we have to define a few useful functions.
def calculate_r(df1, df2, method='spearman', pvalues=True):
''' Returns correlation matrix extended by statistical significance index. Used for non-nominal data.
df1, df2 - DataFrames of data to correlate;
method - mainly pearson and spearman;
pvalues - if set to False, only correlation values will be returned in DataFrame
(without '**' marks for significant observations)
'''
float_precision = 3
data_corr_table = pd.DataFrame()
data_pvalues = pd.DataFrame()
for x in df1.columns:
for y in df2.columns:
if method == 'pearson':
corr = pearsonr(df1[x], df2[y])
elif method == 'spearman':
corr = spearmanr(df1[x], df2[y])
else:
raise ValueError('Unknown method')
if pvalues:
data_corr_table.loc[x,y] = '{} {}'.format(round(corr[0], float_precision), get_pval_code(corr[1]))
elif not pvalues:
data_corr_table.loc[x,y] = round(corr[0], float_precision)
data_pvalues.loc[x,y] = corr[1]
return data_corr_table, data_pvalues
def get_ranked_corr(correlations, pvalues, pval_threshold=0.05, corr_threshold=0.5, name_transform=lambda x: x, against_itself=True, comp_corr = operator.ge, comp_pval = operator.le):
''' Returns the sorted top (wanted) correlations as a Dataframe
correlations - the correlations dataframe.
pvalues - The pvalues of H0: the correlation is unsignificant.
pval_threshold - alpha to compare to pvalue
corr_threshold - Threshold of correlations to consider (upper or lower depending on comp_corr)
name_transform - transform the column names before writing index.
against_itself - True if columns and rows are the same data (if symmetric matrix)
comp_corr - comparator to compare to the corr_threshold (default operator.ge means we want correlations >=)
comp_pval - comparator to compare the pvalue (default operator.le meaning we want pvalues <=)
'''
columns = correlations.columns
rows = correlations.index
pairs=[]
for i, row in enumerate(rows):
for j, col in enumerate(columns[i+1:], start=i+1) if against_itself else enumerate(columns):
corr = correlations.iloc[i, j]
if comp_pval(pvalues.iloc[i,j], pval_threshold) and comp_corr(abs(corr), corr_threshold):
pairs.append((name_transform(row), name_transform(col), corr))
list.sort(pairs, key=lambda x: abs(x[2]), reverse=True)
return pd.Series(data=[x[2] for x in pairs], index=[' / '.join((x[0], x[1])) for x in pairs], name="Correlation", dtype=np.float64).to_frame()
Now let's use them.
pers_data_corr_table, pers_data_pvalues = calculate_r(pers_data, pers_data, method='spearman')
pers_data_corr_table
heatmap_corr(pers_data, method='spearman');
Again, no strong correlations but some moderate and weak-moderate ones. The sign of the correlation is interesting here.
Neuroticism for example has no positive correlation with any other personality trait. It has a weak-moderate downhill correlation with extraversion and conscientiousness.
I also found it interesting that Extraversion has no moderate correlation with any other personality trait (and that its strongest correlation, which is weak, is with conscientiousness).
Below is a list of the statistically significant correlations of very weak strength or above (|r| >= 0.15).
pers_data_corr_table_nopval, _ = calculate_r(pers_data, pers_data, method='spearman', pvalues=False)
get_ranked_corr(pers_data_corr_table_nopval, pers_data_pvalues, corr_threshold=0.15)
As a reminder, here are the different levels for each drug:
0 - Never Used
1 - Used over a Decade Ago
2 - Used in the Last Decade
3 - Used in the Last Year
4 - Used in the Last Month
5 - Used in the Last Week
6 - Used in the Last Day
drug_data = dataset[drugs_columns]
drug_data.describe()
Let's make a table with the number of observations and the frequency for every consumption level of every drug.
counts = []
proportions = []
for drug in drugs_columns:
counts.append(drug_data[drug].value_counts())
proportions.append(drug_data[drug].value_counts(normalize=True))
drug_table_count = pd.concat(counts, axis=1,
keys=drugs_columns, sort=True)
drug_table_prop = (pd.concat(proportions, axis=1,
keys=drugs_columns, sort=True) * 100).round(2)
drug_table = pd.concat([drug_table_count.T, drug_table_prop.T], keys=['N', '%'], axis=1)
drug_table
We will also plot them for each drug.
plot("count", drug_data, fig_title="Count / Frequency for drug consumption\n\n", rotate_label=False)
Alcohol, caffeine and chocolate are legal stimulants that are consumed regularly by the vast majority of respondents.
Cannabis and nicotine are distributed pretty evenly, even if there is a higher proportion at both extremes (categories 0 and 6). These two are considered bad or OK depending on the person, which may explain the roughly equal numbers of daily consumers and people who never tried them, with some in between.
The other illegal drugs were never tried by the vast majority of respondents, or were tried a long time ago (here meaning more than a month ago). People probably either never tried them (out of fear or conviction) or tried them once but never again. This also holds for legal highs, which follow the same trend as the illegal drugs.
drug_data_corr_table, drug_data_pvalues = calculate_r(drug_data, drug_data, method='spearman')
display_pval_codes()
drug_data_corr_table
heatmap_corr(drug_data, method='spearman');
Alcohol, caffeine and chocolate don't correlate with any other drugs (too weak or insignificant relations).
There are some moderate correlations with high enough statistical significance.
There are no strong correlations (I consider an absolute value of 0.70 or above strong).
drug_data_corr_table_nopval, _ = calculate_r(drug_data, drug_data, method='spearman', pvalues=False)
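As a quick sanity check of the "no strong correlations" claim, we can ask for significant pairs with |r| >= 0.7; if the claim holds, the table below comes back empty.
# Expected to be empty given the claim above (|r| >= 0.7 and p <= 0.05)
get_ranked_corr(drug_data_corr_table_nopval, drug_data_pvalues, corr_threshold=0.7, name_transform=lambda x: x.rsplit(' ', 1)[0])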
We will list all the correlations, grouped and ordered by strength.
List of non-significant correlations
get_ranked_corr(drug_data_corr_table_nopval, drug_data_pvalues, corr_threshold=0, name_transform=lambda x:x.rsplit(' ', 1)[0], comp_pval= operator.gt).drop('Correlation', axis=1)
List of significant correlations that are non-existent or weak
get_ranked_corr(drug_data_corr_table_nopval, drug_data_pvalues, corr_threshold=0.3, name_transform=lambda x:x.rsplit(' ', 1)[0], comp_corr=operator.lt)
List of significant weak to weak-moderate correlations
get_ranked_corr(drug_data_corr_table_nopval, drug_data_pvalues, corr_threshold=0.5, name_transform=lambda x:x.rsplit(' ', 1)[0], comp_corr=operator.lt).query('Correlation >= 0.3').reset_index()
List of significant moderate correlations
get_ranked_corr(drug_data_corr_table_nopval, drug_data_pvalues, corr_threshold=0.4, name_transform=lambda x:x.rsplit(' ', 1)[0])
We have seen correlations of categories of variables with themselves. Let's now dig into inter-category correlations.
drug_pers_corr_table, drug_pers_pvalues = calculate_r(drug_data, pers_data, method='spearman')
drug_pers_corr_table
drug_pers_corr_table_no_pvalues, drug_pers_pvalues = calculate_r(drug_data, pers_data, pvalues=False, method='spearman')
heatmap_corr(drug_pers_corr_table_no_pvalues, ready=True, mask=False);
Alcohol, Chocolate, Caffeine and Nicotine consumption have no (or no significant) correlation with any personality trait.
For the rest of this paragraph, "drug consumption" does not take alcohol, chocolate or caffeine into account.
Extraversion is the only personality trait that has no correlation with any drug (not even a very weak one), which I found surprising at first, but it may be due to my misunderstanding of extraversion.
Crack and amyl nitrite have no correlation above 0.2 (in absolute value).
Agreeableness and Conscientiousness both have at most weak negative correlations with all the drugs, even if the negative correlation is always a bit stronger for Conscientiousness.
Impulsiveness, Openness to experience and Sensation seeking are the personality factors that correlate the most with drug consumption.
There are no strong correlations between personality factors and drug taking.
It seems like a given personality trait has either only positive or only negative correlations with drug consumption.
Let's list the weak ones and above (|r| >= 0.3).
get_ranked_corr(drug_pers_corr_table_no_pvalues, drug_pers_pvalues, corr_threshold=0.3, against_itself=False)
demo_drug_corr_table, demo_drug_pval = nominal_corrs(dataset, demographic_columns, drugs_columns)
demo_drug_corr_table
demo_drug_corr_table_no_pvalues, _ = nominal_corrs(dataset, demographic_columns, drugs_columns, pvalues=False)
heatmap_corr(demo_drug_corr_table_no_pvalues, ready=True, mask=False, nominal=True);
Ethnicity isn't correlated to drug taking, except very weakly with alcohol consumption.
Education also seems to be at most very weakly correlated to drug consumption.
Caffeine and Chocolate have no correlations with demographic variables.
There are no strong or even moderate correlations; the strongest of the weak correlations mostly involve Gender.
Below is a list of the significant correlations with |r| >= 0.2.
get_ranked_corr(demo_drug_corr_table_no_pvalues, demo_drug_pval, corr_threshold=0.2)
Let's look at the median consumption of cannabis and legal highs for females and males, the two drugs most correlated with gender according to the table above.
The significance is based on the Mann-Whitney U test, used in the function defined below.
"In statistics, the Mann–Whitney U test is a nonparametric test of the null hypothesis that, for randomly selected values X and Y from two populations, the probability of X being greater than Y is equal to the probability of Y being greater than X"(Wikipedia)
We implement the difference test in the procedure below.
Here the two populations are male and female.
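Before defining the full procedure, here is a minimal toy sketch (made-up ordinal scores, not our data) of how scipy's mannwhitneyu is called:
group_a = [3, 4, 2, 5, 4, 3]  # hypothetical ordinal scores for one group
group_b = [1, 2, 1, 3, 2, 2]  # hypothetical ordinal scores for the other group
u_stat, p_value = mannwhitneyu(group_a, group_b)
print(u_stat, p_value)  # a small p-value would suggest the two samples differ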
def diff_test(dataset, sample_attr, diff_attr, sample_attr_split='median', diff_attr_type='ordered', split_point=None, alpha=0.05):
''' Difference significance test on dataset.
Returns (p of no difference,
the group split point,
the group 1 central,
the group 2 central,
a summary string).
sample_attr - column which will be divided into two samples with median value;
diff_attr - attribute, which value will be checked in two sample groups;
diff_attr_type - determines type of data which medians will be compared (ordered, interval)
'''
# In case of big data imbalance in highest rank (contains more than half)
# It would make an empty group when splitting futher down
if split_point is None and sample_attr_split == "median":
levels = dataset[sample_attr].unique()
if not (levels > dataset[sample_attr].median()).any(): # The highest level is the same as median
split_point = levels[-1] # Split one rank lower instead
if split_point is None:
split_point = dataset[sample_attr].median() if sample_attr_split=='median' else dataset[sample_attr].mean()
group_1 = dataset[dataset[sample_attr] > split_point]
group_2 = dataset[dataset[sample_attr] <= split_point]
group_1_diff_attr_central = group_1[diff_attr].median() if diff_attr_type=='ordered' else group_1[diff_attr].mean()
group_2_diff_attr_central = group_2[diff_attr].median() if diff_attr_type=='ordered' else group_2[diff_attr].mean()
if diff_attr_type == 'ordered':
diff_sign, p = mannwhitneyu(group_1[diff_attr], group_2[diff_attr])
elif diff_attr_type == 'interval':
diff_sign, p = ttest_ind(group_1[diff_attr], group_2[diff_attr])
else:
raise ValueError('diff_attr_type should be one of ordered and interval')
are = 'are' if p < alpha else 'are not'
sample_central = 'split point' if split_point is not None else sample_attr_split
diff_central = 'median' if diff_attr_type=='ordered' else 'mean'
return (p, split_point, group_1_diff_attr_central, group_2_diff_attr_central,
f'First group: {sample_attr} above {sample_central} value {round(split_point, 3)}\n Second group: {sample_attr} equal or below {sample_central} value {round(split_point, 6)} \n First group {diff_attr} {diff_central}: {round(group_1_diff_attr_central, 3)} \n Second group {diff_attr} {diff_central}: {round(group_2_diff_attr_central, 3)} \n Difference significance for samples: {round(diff_sign, 3)} with p-value: {round(p, 3)} \n Samples {are} statistically different.')
val_between_male_female = (dataset['Gender'][demo_data['Gender'] == 'Male'].iloc[0] + dataset['Gender'][demo_data['Gender'] == 'Female'].iloc[0])/ 2
print(diff_test(dataset, 'Gender', 'Cannabis consumption', split_point=val_between_male_female)[-1])
print(diff_test(dataset, 'Gender', 'Legal highs consumption', split_point=val_between_male_female)[-1])
Females (first group) consume less cannabis and legal highs than men.
As a reminder, here are the different levels for each drug:
0 - Never Used (median female legal highs consumer)
1 - Used over a Decade Ago
2 - Used in Last Decade (median female cannabis consumer, median male legal high consumer)
3 - Used in Last Year
4 - Used in Last Month (median male cannabis consumer)
5 - Used in Last Week
6 - Used in Last Day
And the values for male and female, respectively, are:
(dataset['Gender'][demo_data['Gender'] == 'Male'].iloc[0], dataset['Gender'][demo_data['Gender'] == 'Female'].iloc[0])
demo_psycho_corr_table, demo_psycho_pvalues = nominal_corrs(dataset, demographic_columns, personality_columns)
demo_psycho_corr_table
demo_psycho_corr_table_no_pvalues, _ = nominal_corrs(dataset, demographic_columns, personality_columns, pvalues=False)
heatmap_corr(demo_psycho_corr_table_no_pvalues, ready=True, nominal=True, mask=False);
Neuroticism has no statistically significant correlation with any demographic variable.
Extraversion has only one statistically significant correlation, with Country, and it is essentially non-existent.
Education and Ethnicity have no correlation with any personality trait (too weak or not significant).
Gender seems to have the strongest (least weak) correlations.
There are no moderate or strong correlations, only pretty weak ones at most.
Let's list the weak correlations (|r| >= 0.15).
get_ranked_corr(demo_psycho_corr_table_no_pvalues, demo_psycho_pvalues, corr_threshold=0.15)
Let's display the median of Agreeableness and Sensation seeking for each gender.
print(diff_test(dataset, 'Gender', 'Agreeableness', split_point=val_between_male_female)[-1])
print(diff_test(dataset, 'Gender', 'Sensation seeking', split_point=val_between_male_female)[-1])
Women tend to be more agreeable and less sensation seeking than men.
In this section, we will split the population in 2 groups based on a feature.
All the features are rank based, so we will use the Mann–Whitney U test against a target to see whether the target shows a statistically significant difference in median between the two groups.
The splitting point will be the median, except if the median is the highest rank the feature can have. In this case we split one rank lower.
- Group 1: respondents > split.
- Group 2: respondents <= split.
In the following section, we will only show cases where there is a significant difference in median value for a given target.
The tables may contain NaN values; this happens when a cell has no significant difference but the column is needed for other rows.
We won't comment on them, but feel free to explore. The tables are quite self-explanatory.
def get_significant_diff_tests(dataset, samples, diff_attrs, alpha=0.05, display_same_center=False):
'''
Perform difference tests on several splitting attributes for different targets.
'''
samp_dfs = []
for sample in samples:
diff_dfs = []
for diff_attr in diff_attrs:
levels = dataset[sample].unique()
split = None
if len(levels) == 2:
split = (dataset[sample][dataset[sample] == levels[0]].iloc[0] + dataset[sample][dataset[sample] == levels[1]].iloc[0])/ 2
p, split_point, group_1_diff_attr_central, group_2_diff_attr_central, _ = diff_test(dataset, sample, diff_attr, split_point=split, alpha=alpha)
if p < alpha and (group_1_diff_attr_central != group_2_diff_attr_central or display_same_center):
split_point = round(split_point, 5)
diff_dfs.append(pd.DataFrame({'Median': [group_1_diff_attr_central, group_2_diff_attr_central], 'Pval / significant diff ?': [p, "Yes" if p < alpha else "No"]}, index=[f"> {split_point}", f"<= {split_point}"]))
if diff_dfs:
samp_dfs.append(pd.concat(diff_dfs, axis=1, keys=diff_attrs))
return pd.concat(samp_dfs, axis=0, keys=samples) if samp_dfs else None
get_significant_diff_tests(dataset, demographic_columns, drugs_columns)
get_significant_diff_tests(dataset, personality_columns, drugs_columns)
get_significant_diff_tests(dataset, demographic_columns, personality_columns)
In practice, we are more interested in predicting which drugs an individual consumes than in knowing how long ago they last used them.
We want to regroup the different levels we currently have for each drug into two classes: User and Non-User.
We are going to compare year and decade based splits.
- Year based: Consumer = Used in last year
- Decade based: Consumer = Used in last decade
dataset_decade = dataset.copy()
for i in drugs_columns:
dataset_decade[i] = ["Non Consumer" if c < 2 else "Consumer" for c in dataset[i]]
dataset_year = dataset.copy()
for i in drugs_columns:
dataset_year[i] = ["Non Consumer" if c < 3 else "Consumer" for c in dataset[i]]
compare_year_decade = pd.concat([dataset_decade.assign(split="decade"), dataset_year.assign(split="year")])
plot("count", compare_year_decade, drugs_columns, fig_title="Count / Frequency consumer comparison between decade and year split\n\n\n", rotate_label=False, add_args={"hue": "split", "palette": "Set2"})
We could choose the decade based split for model training and inference because of the balance.
Also, someone may not have consumed a particular drug in the last year simply because they didn't get the opportunity, or because they were trying out other drugs during that period, but they could take it again in the future.
Whereas if they haven't consumed it in the last 10 years, they probably don't plan to consume it again.
This rule can't really be applied to 18-24 year olds.
We will continue our analysis considering both splits.
grouped_df_decade = pd.DataFrame()
for drug in drugs_columns:
grouped_df_decade = pd.concat([grouped_df_decade, dataset_decade.groupby(drug).size()], axis=1)
grouped_df_decade.columns = drugs_columns
grouped_df_year = pd.DataFrame()
for drug in drugs_columns:
grouped_df_year = pd.concat([grouped_df_year, dataset_year.groupby(drug).size()], axis=1)
grouped_df_year.columns = drugs_columns
grouped_df_decade.T.plot.bar(color=['red', 'green'], figsize=(15,7), title="Count plot of consumer / non consumer for every drug");
- Decade based split
The consumers (red) are in a huge majority for Alcohol, Caffeine and Chocolate.
They are also around double the number of non-consumers for Cannabis and Nicotine. This may be due to the fact that lots of people try these out at least once, and the last time happened to be in the last decade.
For the other drugs, non-consumers are the majority.
For cocaine, benzodiazepines, amphetamines, ecstasy, legal highs and magic mushrooms, the number of consumers is more than half the number of non-consumers.
grouped_df_year.T.plot.bar(color=['red', 'green'], figsize=(15,7), title="Count plot of consumer / non consumer for every drug");
- Year based split
The huge majorities for consumers are the same.
Cannabis and Nicotine still have a majority of consumers. Our previous hypothesis wasn't as relevant as I had expected.
The drugs cited previously (for which consumers numbered more than half the non-consumers) are consumed in a smaller proportion, but the size of the decrease isn't the same for all of them.
For some drugs the proportions didn't change much, meaning people either consume them (on a yearly basis or more frequently) or not at all.
Let's encode drug users as 1 and non-users as 0.
for i in drugs_columns:
dataset_decade[i] = [0 if c < 2 else 1 for c in dataset[i]]
for i in drugs_columns:
dataset_year[i] = [0 if c < 3 else 1 for c in dataset[i]]
compare_year_decade = pd.concat([dataset_decade.assign(split="decade"), dataset_year.assign(split="year")])
Let's see the count of individual per number of drug used.
np_drugs = np.array([drug.split(' ', -1)[0] for drug in drugs_columns])
combinations = compare_year_decade[drugs_columns].apply(
lambda x: pd.Series({
'names':'/'.join(np_drugs[x.values.astype(bool)]),
'index': np.where(x.values)[0], 'number of drugs': x.values.sum()}), axis=1)
combinations['split'] = compare_year_decade['split']
year_split = combinations['split'] == "year"
combinations.head()
combinations[~year_split].describe()
For the decade split, the mean number of drugs consumed is a bit below 8. Over 25% of the respondents are users of 11 drugs or more. The median is 7 drugs. The mean is slightly higher than the median, meaning the distribution is slightly positively skewed. The number of drugs used ranges from 0 to 18 (all of them).
combinations[year_split].describe()
For the year split, the mean number of drugs consumed is a bit above 6. Over 25% of the respondents are users of 9 drugs or more. The median is 5 drugs. The mean is higher than the median, meaning the distribution is positively skewed. The number of drugs used ranges from 0 to 17. No one consumed all the drugs in the last year.
catplot(combinations, 'number of drugs', hue=None, col='split', rotate_label=False, palette="Set2")
- Decade split
Barely anyone used 0, 1, 2, 17 or all 18 drugs. The most common counts are 3, 4, 5 and 6.
An interesting fact is that 10 has more respondents than 7, 8 and 9. I guess for some drugs, there is a threshold. Once people go over 6 drugs, they may tend to try out a bunch more.
- Year split
Again, barely anyone used 0, 1, or more than 15 drugs.
The number of respondents decreases monotonically as the number of drugs increases (from 3 drugs onward). We don't see the same bump around 10, strengthening my previous hypothesis.
3 drugs is now the huge majority!
Let's see which are the most common combinations!
value_counts_percentage(combinations[~year_split], 'names').head(10)
value_counts_percentage(combinations[year_split], 'names').head(10)
No wonder the number of users of 3, 4 or 5 drugs was so high!
But really, we are not that interested in Chocolate, Caffeine, Nicotine and Alcohol.
Let's redo all this by not taking them into account.
np_drugs_illegal = np.array([drug.split(' ', -1)[0] for drug in drugs_illegal])
combinations_illegal = compare_year_decade[drugs_illegal].apply(
lambda x: pd.Series({
'names':'/'.join(np_drugs_illegal[x.values.astype(bool)]),
'index illegal': np.where(x.values)[0], 'number of illegal drugs': x.values.sum()}), axis=1)
combinations_illegal['split'] = compare_year_decade['split']
catplot(combinations_illegal, "number of illegal drugs", hue=None, col="split", rotate_label=False, palette="Set2")
Removing legal drugs shifted everything to the left (basically fusing 0, 1, 2 and 3). The shape of the distribution remains around the same.
value_counts_percentage(combinations_illegal[~year_split], 'names').head(10)
value_counts_percentage(combinations_illegal[year_split], 'names').head(10)
A lot more people (37.61%) in the year split didn't consume any drug (not counting alcohol, caffeine, chocolate and nicotine) compared to the decade split (24.88%).
The first 4 rows (combinations of drugs) are the same, but then there is some variation (different order or different combinations). The proportions for each ranking are around the same for both splits, apart from the first row.
Cannabis comes second (with between 7% and 8% in both splits), as it is legal in some countries and a very popular drug. It is also often present in the popular combinations.
Let's redo it without cannabis.
drugs_illegal_no_cannabis = [drug for drug in drugs_illegal if drug != "Cannabis consumption"]
np_drugs_illegal_no_cannabis = np.array([drug.split(' ', -1)[0] for drug in drugs_illegal_no_cannabis])
combinations_illegal_no_cannabis = compare_year_decade[drugs_illegal_no_cannabis].apply(
lambda x: pd.Series({
'names':'/'.join(np_drugs_illegal_no_cannabis[x.values.astype(bool)]),
'index illegal no cannabis': np.where(x.values)[0], 'number of illegal (no cannabis) drugs': x.values.sum()}), axis=1)
combinations_illegal_no_cannabis['split'] = compare_year_decade['split']
catplot(combinations_illegal_no_cannabis, "number of illegal (no cannabis) drugs", hue=None, col="split", rotate_label=False, palette="Set2")
The gap between 0 and 1 became even larger.
value_counts_percentage(combinations_illegal_no_cannabis[~year_split], 'names').head(20)
value_counts_percentage(combinations_illegal_no_cannabis[year_split], 'names').head(20)
Here we have the 20 most common non-trivial drug combinations for both splits.
The decade split has longer combinations, meaning some respondents tried out several drugs in the last decade but didn't pick them all up for more regular consumption.
compare_year_decade[drugs_columns].apply(lambda x: x[drugs_legal].sum() == 0 and x[drugs_illegal].sum() > 0, axis=1).any()
I was curious whether any respondent could be a non-user of alcohol, caffeine, chocolate and nicotine but a user of other drugs. It seems not, in either split!
I hope you learnt lots reading this post. I certainly did!
I plan on training a predictive model in a future post, so stay tuned.
Did I miss out on anything? Do you have any remarks? Please let me know by annotating using Hypothesis or by commenting below!