The Data, The Method, and Why They Must Earn Their Place
Posted: 15 May 2026 | PROM05 Week 3
One of the more clarifying exercises this week has been articulating, in plain terms, exactly what data this project uses, why it is the right data, and what the analysis will actually do with it. These questions might seem like they were settled in the proposal, and in broad terms they were. But the process of engaging with the statistical and methodological literature this week has sharpened the reasoning considerably.
The Data
The project uses the Criteo Uplift dataset (Criteo, 2018), a publicly available large-scale secondary dataset of approximately 13.9 million observations generated by a randomised controlled trial of online advertising exposure. The key variables are a binary treatment assignment (whether a user was shown an advertising campaign), a binary conversion outcome (whether the user converted), and eleven continuous numerical feature variables. In Hand's (2016) taxonomy, both binary variables are Bernoulli-distributed outcomes, while the features are continuous measurements.
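Concretely, the working schema looks like this. A minimal sketch using synthetic stand-in data rather than the real download; the column names `treatment`, `conversion`, and `f0`–`f10` are illustrative assumptions, not the actual names used in the Criteo files:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1_000  # stand-in for the ~13.9 million real rows

# Binary treatment assignment and binary conversion outcome,
# plus eleven continuous numerical feature columns.
df = pd.DataFrame({"treatment": rng.integers(0, 2, n),
                   "conversion": rng.integers(0, 2, n)})
for i in range(11):
    df[f"f{i}"] = rng.normal(size=n)

print(df.shape)  # (1000, 13)
```

The real dataset is loaded the same way, just at far larger scale; everything downstream only depends on this 13-column shape.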
The choice of secondary data is methodologically deliberate. Urban and van Eeden-Moorefield (2017) position secondary data as a legitimate and standard quantitative data collection method, and the Criteo dataset has a specific property that makes it not merely acceptable but optimal for this research: treatment assignment is randomised by design. This means the unconfoundedness assumption, which states that treatment assignment is independent of potential outcomes conditional on observed covariates, is satisfied without requiring it to be invoked as an untestable belief. In most real-world uplift modelling applications, practitioners must assume unconfoundedness because they cannot verify it. Here it holds by construction. The dataset therefore provides a controlled evaluation environment that no observational dataset could replicate.
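One practical consequence of randomised assignment is that covariate balance can be checked empirically rather than assumed. A minimal sketch of such a check using standardised mean differences, on synthetic data in place of the real features (the variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
treatment = rng.integers(0, 2, n)        # randomised assignment
features = rng.normal(size=(n, 11))      # eleven covariates

# Standardised mean difference per covariate: under genuine
# randomisation these are typically well below 0.1 in magnitude.
t, c = features[treatment == 1], features[treatment == 0]
pooled_sd = np.sqrt((t.var(axis=0) + c.var(axis=0)) / 2)
smd = (t.mean(axis=0) - c.mean(axis=0)) / pooled_sd
print(np.round(np.abs(smd).max(), 3))
```

A large standardised mean difference on any covariate would flag a problem with the randomisation claim, which is exactly the kind of verification an observational dataset cannot offer.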
Wilbanks (2012) makes a broader point about secondary data that resonates here: the value locked inside large-scale datasets is frequently inaccessible not because the data does not exist but because the institutional and methodological frameworks for using it have not kept pace with what computational analysis now makes possible. The Criteo dataset is an example of data made available precisely to address that gap, and using it for rigorous evaluation research is exactly what it was released to enable.
The Method
The analytical approach is experimental and comparative, working within a quantitative paradigm throughout. Urban and van Eeden-Moorefield (2017) describe quantitative research as deductive in its logic: expected results are stated before data collection begins, through pre-specified hypotheses. The primary hypothesis H2, that evaluation results will show statistically meaningful inconsistency across metrics and dataset subsamples, is stated before any analysis is run, and the full analysis plan will be pre-registered on OSF before any model is trained or any metric is computed.
The analysis itself operates in two layers. The first layer produces the primary experimental outputs: Qini coefficient scores and Area Under the Uplift Curve scores for three uplift estimators and a random targeting baseline, evaluated on fifty stratified bootstrap subsamples of 100,000 observations each drawn from the test partition. The second layer analyses those outputs: Kendall's tau and Spearman's rho rank correlations are computed for each model pair across subsamples, producing a distribution of rank correlation values from which bootstrap confidence intervals and means are derived.
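The second layer can be sketched as follows. This is a hedged illustration, not the pre-registered analysis: the model names and Qini scores are synthetic placeholders standing in for the real layer-one outputs, and only the structure (pairwise rank correlations across subsamples, with bootstrap confidence intervals) reflects the design described above:

```python
import itertools
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(7)
models = ["model-A", "model-B", "model-C", "random"]
n_subsamples = 50

# Synthetic stand-in for layer-one outputs: one Qini score per
# model per bootstrap subsample (rows = models, cols = subsamples).
qini = rng.normal(loc=[[0.05], [0.04], [0.06], [0.0]],
                  scale=0.01, size=(4, n_subsamples))

# Layer two: Kendall's tau for each model pair across subsamples,
# with a bootstrap confidence interval on tau.
for (i, a), (j, b) in itertools.combinations(enumerate(models), 2):
    tau, _ = kendalltau(qini[i], qini[j])
    boot = []
    for _ in range(1000):
        idx = rng.integers(0, n_subsamples, n_subsamples)
        t, _ = kendalltau(qini[i, idx], qini[j, idx])
        boot.append(t)
    lo, hi = np.percentile(boot, [2.5, 97.5])
    print(f"{a} vs {b}: tau={tau:+.2f}, 95% CI [{lo:+.2f}, {hi:+.2f}]")
```

Spearman's rho would be computed the same way via `scipy.stats.spearmanr`; both are reported because they weight rank disagreements differently.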
Hand (2016) establishes the statistical basis for this approach. Kendall's tau and Spearman's rho are non-parametric rank correlation measures, appropriate when no defensible distributional assumption can be made about the metric scores being compared. The distribution of Qini and AUUC values across models and subsamples is empirically unknown; imposing a parametric assumption would introduce unjustified structure. The non-parametric approach is therefore not a compromise but the epistemically correct choice. Hand (2016) also notes that while parametric tests are more powerful when their assumptions hold, non-parametric tests are more generally applicable, and with fifty subsamples of 100,000 observations each, statistical power is already high.
Why Both Must Earn Their Place
The most useful insight from this week's reading is the closing observation in Hand (2016): research is seldom a question of looking at data, deciding what technique to apply, running a calculation, and writing things up. Complications arise. Data has structure that analysis must respect. Methods have assumptions that data may or may not satisfy. The process of selecting and justifying an analytical approach is not administrative; it is substantive.
For this project, the data earns its place because it provides a controlled, randomised evaluation environment at a scale that makes the primary research question answerable with statistical precision. The method earns its place because it makes no distributional assumptions the data cannot support and produces outputs that directly address the hypothesis. The pre-registration commitment earns its place because it removes the possibility of the analytical choices being shaped, consciously or otherwise, by the results they produce.
None of these choices was arbitrary. Each one followed from the research question.
References
Criteo (2018) Criteo Uplift Prediction Dataset. Available at: https://ailab.criteo.com/criteo-uplift-prediction-dataset/ (Accessed: 5 March 2026).
Hand, D.J. (2016) 'Elementary statistics', in Greenfield, T. and Greener, S. (eds.) Research Methods for Postgraduates. New York: John Wiley and Sons, pp. 362-374. Available at: http://ebookcentral.proquest.com/lib/sunderland/detail.action?docID=4644084 (Accessed: 3 May 2026).
Urban, J.B. and van Eeden-Moorefield, B.M. (2017) Designing and Proposing Your Research Project. Washington DC: American Psychological Association. Available at: http://ebookcentral.proquest.com/lib/sunderland/detail.action?docID=5041740 (Accessed: 3 May 2026).
Wilbanks, J. (2012) Let's pool our medical data [TED Talk]. Available at: https://www.ted.com/talks/john_wilbanks_let_s_pool_our_medical_data/transcript (Accessed: 3 May 2026).