Hillstrom Email Marketing

An email marketing dataset from Kevin Hillstrom’s MineThatData blog.

See an example using this data at causeinfer/examples/business_hillstrom.

Description found at:

https://blog.minethatdata.com/2008/03/minethatdata-e-mail-analytics-and-data.html

Based on

Kuchumov, A. pyuplift: Lightweight uplift modeling framework for Python. (2019). URL: https://github.com/duketemon/pyuplift. License: https://github.com/duketemon/pyuplift/blob/master/LICENSE.

K. Hillstrom. “The MineThatData E-Mail Analytics And Data Mining Challenge”. 2008. URL: https://blog.minethatdata.com/2008/03/minethatdata-e-mail-analytics-and-data.html.

Contents

download_hillstrom, _format_data, load_hillstrom

causeinfer.data.hillstrom.download_hillstrom(data_path=None, url='http://www.minethatdata.com/Kevin_Hillstrom_MineThatData_E-MailAnalytics_DataMiningChallenge_2008.03.20.csv')

Downloads the dataset from Kevin Hillstrom’s blog.

Parameters:

data_path : str, optional (default=None): A user-specified path for where the data should go.
url : str: The URL from which the data is to be downloaded.

Returns:

The data is saved as ‘hillstrom.csv’ in a ‘datasets’ folder, unless otherwise specified.
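The default-path behavior described above can be sketched with the standard library alone. This is a minimal illustration, not the library's implementation: `resolve_data_path` and `fetch_csv` are hypothetical helpers, and skipping the download when the file already exists is an assumption.

```python
from pathlib import Path
from urllib.request import urlretrieve


def resolve_data_path(data_path=None):
    """Return the target CSV path, defaulting to ./datasets/hillstrom.csv (hypothetical helper)."""
    if data_path is None:
        return Path.cwd() / "datasets" / "hillstrom.csv"
    return Path(data_path)


def fetch_csv(url, data_path=None):
    """Download `url` to the resolved path (hypothetical helper).

    Assumption: an existing file is left untouched rather than re-downloaded.
    """
    target = resolve_data_path(data_path)
    # Create the parent folder (e.g. 'datasets') if it does not exist yet.
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():
        urlretrieve(url, str(target))
    return target
```

Calling `fetch_csv(url)` with the URL from the signature above would place the file in a ‘datasets’ folder under the current working directory; passing `data_path` overrides that location.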

causeinfer.data.hillstrom._format_data(df, format_covariates=True, normalize=True)

Formats the data upon loading for consistent data preparation.

Parameters:

df : pd.DataFrame

The original unformatted version of the data.

format_covariates : bool, optional (default=True), controlled in load_hillstrom

True: creates dummy columns and encodes the data.
False: only steps for data readability will be taken.

normalize : bool, optional (default=True), controlled in load_hillstrom

Normalize dataset columns to prepare them for ML methods.

Returns:

df : pd.DataFrame: A formatted version of the data.
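To illustrate what "creates dummy columns" and "normalize" mean in general, here is a library-free sketch on rows stored as dictionaries. It is not the code `_format_data` runs: the helper names are hypothetical, the column names and values are a toy sample, and min-max scaling is an assumed normalization, since the method used by the library is not stated here.

```python
def one_hot(rows, column):
    """Replace a categorical column with 0/1 dummy columns, one per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every field except the categorical one being encoded.
        new_row = {k: v for k, v in row.items() if k != column}
        for cat in categories:
            new_row[f"{column}_{cat}"] = 1 if row[column] == cat else 0
        encoded.append(new_row)
    return encoded


def min_max_normalize(rows, column):
    """Rescale a numeric column to [0, 1] (assumed normalization method)."""
    values = [row[column] for row in rows]
    lo, hi = min(values), max(values)
    span = hi - lo
    # A constant column has zero span; map it to 0.0 to avoid dividing by zero.
    return [
        {**row, column: (row[column] - lo) / span if span else 0.0}
        for row in rows
    ]


# Toy sample for illustration only, not taken from the dataset.
rows = [
    {"channel": "Web", "history": 100.0},
    {"channel": "Phone", "history": 300.0},
    {"channel": "Web", "history": 200.0},
]
formatted = min_max_normalize(one_hot(rows, "channel"), "history")
```

Dummy encoding is why the formatted feature matrix is wider than the raw one: each categorical column expands into one column per category.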

causeinfer.data.hillstrom.load_hillstrom(file_path=None, format_covariates=True, download_if_missing=True, normalize=True)

Loads the Hillstrom dataset with formatting if desired.

Parameters:

file_path : str, optional (default=None)

Specify another path for the dataset.

By default the dataset should be stored in the ‘datasets’ folder in the cwd.

format_covariates : bool, optional (default=True)

True: creates dummy columns and encodes the data.
False: loads the raw data without covariate manipulation.

download_if_missing : bool, optional (default=True)

If the dataset has not already been downloaded, download it using ‘download_hillstrom’.

normalize : bool, optional (default=True)

Normalize dataset columns to prepare them for ML methods.

Returns:

data : dict object with the following attributes:

data.description : str: A description of the Hillstrom email marketing dataset.
data.dataset_full : numpy.ndarray of shape (64000, 12), or (64000, 22) if formatted: The full dataset with features, treatment, and target variables.
data.dataset_full_names : list of size 12, or 22 if formatted: List of dataset variable names.
data.features : numpy.ndarray of shape (64000, 8), or (64000, 18) if formatted: Each row corresponds to the feature values in order.
data.feature_names : list of size 8, or 18 if formatted: List of feature names.
data.treatment : numpy.ndarray of shape (64000,): Each value corresponds to the treatment.
data.response_spend : numpy.ndarray of shape (64000,): Each value corresponds to how much customers spent during the two-week outcome period.
data.response_visit : numpy.ndarray of shape (64000,): Each value corresponds to whether people visited the site during the two-week outcome period.
data.response_conversion : numpy.ndarray of shape (64000,): Each value corresponds to whether they purchased at the site (i.e. converted) during the two-week outcome period.
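As a sketch of how a treatment array and one of the response arrays might be combined, the helpers below compute the mean response per treatment group and its difference from a control group. Both function names are hypothetical, and coding the control group as 0 is an assumption about the treatment encoding, not something stated above. The functions accept any equal-length sequences, so the toy lists here stand in for `data.treatment` and `data.response_visit`.

```python
def mean_response_by_group(treatment, response):
    """Return {treatment value: mean response} for two equal-length sequences."""
    totals, counts = {}, {}
    for t, y in zip(treatment, response):
        totals[t] = totals.get(t, 0.0) + float(y)
        counts[t] = counts.get(t, 0) + 1
    return {t: totals[t] / counts[t] for t in totals}


def uplift_vs_control(treatment, response, control=0):
    """Difference in mean response between each treated group and the control.

    Assumption: `control` is the value that marks untreated customers.
    """
    means = mean_response_by_group(treatment, response)
    return {t: m - means[control] for t, m in means.items() if t != control}


# Toy values for illustration only, not taken from the dataset.
toy_treatment = [0, 0, 1, 1, 2, 2]
toy_visit = [0, 1, 1, 1, 0, 0]
toy_uplift = uplift_vs_control(toy_treatment, toy_visit)
```

A plain difference in means like this is only a baseline summary; it says nothing about which individual customers respond to treatment.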