Hillstrom Email Marketing

An email marketing dataset from Kevin Hillstrom’s MineThatData blog.

See an example using this data at causeinfer/examples/business_hillstrom.

Description found at:

https://blog.minethatdata.com/2008/03/minethatdata-e-mail-analytics-and-data.html

Based on

Kuchumov, A. pyuplift: Lightweight uplift modeling framework for Python. (2019). URL: https://github.com/duketemon/pyuplift. License: https://github.com/duketemon/pyuplift/blob/master/LICENSE.

K. Hillstrom. “The MineThatData E-Mail Analytics And Data Mining Challenge”. 2008. URL: https://blog.minethatdata.com/2008/03/minethatdata-e-mail-analytics-and-data.html.

Contents

download_hillstrom, _format_data, load_hillstrom

causeinfer.data.hillstrom.download_hillstrom(data_path=None, url='http://www.minethatdata.com/Kevin_Hillstrom_MineThatData_E-MailAnalytics_DataMiningChallenge_2008.03.20.csv')

Downloads the dataset from Kevin Hillstrom’s blog.

Parameters:
data_path : str, optional (default=None)

A user specified path for where the data should go.

url : str

The url from which the data is to be downloaded.

Returns:
The data ‘hillstrom.csv’ in a ‘datasets’ folder, unless otherwise specified.
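
The download-if-missing behaviour described above can be sketched with the standard library alone. This is not causeinfer's implementation: `resolve_data_path` and `fetch_if_missing` are hypothetical helpers, and treating `data_path` as a full file path is an assumption. Only the default URL and the default 'datasets/hillstrom.csv' location come from the documentation.

```python
import os
import urllib.request

# URL copied from the documented default of download_hillstrom.
HILLSTROM_URL = (
    "http://www.minethatdata.com/"
    "Kevin_Hillstrom_MineThatData_E-MailAnalytics_DataMiningChallenge_2008.03.20.csv"
)


def resolve_data_path(data_path=None):
    """Return the target CSV path (hypothetical helper, not part of causeinfer)."""
    if data_path is None:
        # Documented default: 'hillstrom.csv' inside a 'datasets' folder.
        return os.path.join(os.getcwd(), "datasets", "hillstrom.csv")
    # Assumption: a user-supplied data_path is the full path of the file.
    return data_path


def fetch_if_missing(data_path=None, url=HILLSTROM_URL):
    """Download the CSV only when nothing exists at the target path yet."""
    target = resolve_data_path(data_path)
    if not os.path.exists(target):
        # Create the parent folder if needed, then download (network access).
        os.makedirs(os.path.dirname(target) or ".", exist_ok=True)
        urllib.request.urlretrieve(url, target)
    return target
```

Calling `fetch_if_missing()` a second time returns the same path without touching the network, which is the property a loader with a download-if-missing switch relies on.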

causeinfer.data.hillstrom._format_data(df, format_covariates=True, normalize=True)

Formats the data upon loading for consistent data preparation.

Parameters:
df : pd.DataFrame

The original unformatted version of the data.

format_covariates : bool, optional (default=True), controlled in load_hillstrom
  • True: creates dummy columns and encodes the data.

  • False: only steps for data readability will be taken.

normalize : bool, optional (default=True), controlled in load_hillstrom

Normalize dataset columns to prepare them for ML methods.

Returns:
df : pd.DataFrame

A formatted version of the data.
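
To make the two formatting steps concrete, here is a minimal pure-Python sketch of dummy encoding followed by normalization. It is an illustration, not the library's code: `one_hot` and `z_score` are hypothetical helpers, z-score scaling is an assumed choice (the documentation does not say which normalization is applied), and the rows are synthetic.

```python
from statistics import mean, pstdev


def one_hot(rows, column):
    """Replace a categorical column with one 0/1 column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every column except the one being encoded.
        new_row = {k: v for k, v in row.items() if k != column}
        for cat in categories:
            new_row[f"{column}_{cat}"] = 1 if row[column] == cat else 0
        encoded.append(new_row)
    return encoded


def z_score(rows, column):
    """Rescale a numeric column to zero mean and unit variance."""
    values = [row[column] for row in rows]
    mu, sigma = mean(values), pstdev(values)
    # A constant column has sigma == 0; map it to 0.0 to avoid dividing by zero.
    return [
        {**row, column: (row[column] - mu) / sigma if sigma else 0.0}
        for row in rows
    ]


# Synthetic rows for illustration only; not taken from the Hillstrom data.
rows = [
    {"recency": 2, "channel": "Web"},
    {"recency": 4, "channel": "Phone"},
    {"recency": 6, "channel": "Web"},
]

formatted = z_score(one_hot(rows, "channel"), "recency")
```

Dummy encoding is why the formatted dataset has more columns than the raw one: each categorical covariate expands into one indicator column per category.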

causeinfer.data.hillstrom.load_hillstrom(file_path=None, format_covariates=True, download_if_missing=True, normalize=True)

Loads the Hillstrom dataset with formatting if desired.

Parameters:
file_path : str, optional (default=None)

Specify another path for the dataset.

By default, the dataset is expected in the ‘datasets’ folder in the current working directory.

format_covariates : bool, optional (default=True)

Whether to format the covariates. If False, the raw data is loaded without covariate manipulation.

download_if_missing : bool, optional (default=True)

Whether to download the dataset with ‘download_hillstrom’ if it has not already been downloaded.

normalize : bool, optional (default=True)

Normalize dataset columns to prepare them for ML methods.

Returns:
data : dict object with the following attributes:
data.description : str

A description of the Hillstrom email marketing dataset.

data.dataset_full : numpy.ndarray, shape (64000, 12), or (64000, 22) if formatted

The full dataset with features, treatment, and target variables.

data.dataset_full_names : list, size 12, or 22 if formatted

List of dataset variable names.

data.features : numpy.ndarray, shape (64000, 8), or (64000, 18) if formatted

Each row holds the feature values for one customer, in order (8 values, or 18 if formatted).

data.feature_names : list, size 8, or 18 if formatted

List of feature names.

data.treatment : numpy.ndarray, shape (64000,)

Each value corresponds to the treatment assigned to a customer.

data.response_spend : numpy.ndarray, shape (64000,)

Each value corresponds to how much customers spent during the two-week outcome period.

data.response_visit : numpy.ndarray, shape (64000,)

Each value corresponds to whether customers visited the site during the two-week outcome period.

data.response_conversion : numpy.ndarray, shape (64000,)

Each value corresponds to whether customers purchased at the site (i.e. converted) during the two-week outcome period.
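
As a rough illustration of how the treatment and response arrays fit together, the sketch below averages a response within each treatment arm and compares the arms. `mean_response_by_arm` is a hypothetical helper, the input lists are synthetic stand-ins for the returned arrays, and the arm labels 0, 1 and 2 are assumptions rather than the library's documented encoding.

```python
from collections import defaultdict


def mean_response_by_arm(treatment, response):
    """Average a response variable within each treatment arm."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for arm, value in zip(treatment, response):
        totals[arm] += value
        counts[arm] += 1
    return {arm: totals[arm] / counts[arm] for arm in counts}


# Synthetic stand-ins for the treatment and response_visit arrays;
# the arm labels 0/1/2 are hypothetical, not the library's documented encoding.
treatment = [0, 0, 1, 1, 2, 2]
visit = [0, 1, 1, 1, 0, 0]

rates = mean_response_by_arm(treatment, visit)

# Difference in visit rate of each arm relative to arm 0.
uplift_vs_arm0 = {arm: rate - rates[0] for arm, rate in rates.items() if arm != 0}
```

With causeinfer installed, the same helper could presumably be applied to the loaded arrays, for example `mean_response_by_arm(data["treatment"], data["response_visit"])`. Key access is an assumption based on the return value being documented as a dict.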