1.2 The Data Landscape of Marketing

Marketing data is never just one table. It is a stack of sources, each capturing a different part of the customer journey. In this section, I will map that stack using three lenses: the funnel, the KPIs at each stage, and the datasets behind them.

The Funnel as a Measurability Map

Customers rarely buy on first exposure. They move — sometimes over months, sometimes minutes — from “not knowing the brand exists” to “buying” to “buying again.” The most common model of this journey is the funnel:

Awareness → Consideration → Conversion → Retention

The funnel is useful not because customers move through it in order, but because it acts as a measurability map. As you move down the funnel, data gets more granular, measurement becomes more direct, signals arrive faster, and experiments are easier to run.

[Image: horizontal marketing funnel diagram with four colored stages from left to right: Awareness (blue, megaphone icon), Consideration (teal, magnifier icon), Conversion (orange, shopping-cart icon), and Retention (purple, returning-customer icon). A grey customer-cluster icon enters on the left; a purple heart icon exits on the right.]
Figure 1: The marketing funnel as a measurability map. As customers move down the stages — Awareness, Consideration, Conversion, Retention — data becomes more granular, measurement more direct, signals faster, and experiments easier to run.

This is why method choice depends on the business model. E-commerce and SaaS teams can often test behavior directly. CPG and brand advertisers have to infer upstream effects from signals that are noisier, slower, and more aggregated.

A common trap is the streetlight effect: focusing only on what is easy to measure. Performance channels can report ROAS every day; brand campaigns cannot, so their contribution is easy to overlook. Before relying on any KPI, dataset, or method, first ask which part of the funnel it actually covers.
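One way to make that question routine is to audit a dashboard for funnel coverage. The following is a minimal sketch, not a standard tool: the stage assignments are copied from the KPI table in the next subsection, while the names `KPI_STAGE` and `stage_coverage` and the example dashboard are hypothetical and invented for illustration.

```python
FUNNEL_STAGES = ["Awareness", "Consideration", "Conversion", "Retention"]

# Stage assignments follow the KPI table in this section; extend as needed.
KPI_STAGE = {
    "Reach": "Awareness",
    "Impressions": "Awareness",
    "Share of Voice": "Awareness",
    "CTR": "Consideration",
    "Site Visits": "Consideration",
    "Branded Search Lift": "Consideration",
    "CVR": "Conversion",
    "CAC": "Conversion",
    "AOV": "Conversion",
    "ROAS": "Conversion",
    "Repeat Rate": "Retention",
    "Churn Rate": "Retention",
    "LTV": "Retention",
    "NPS": "Retention",
}


def stage_coverage(dashboard_metrics):
    """Group a dashboard's metrics by funnel stage.

    Returns (by_stage, unknown). by_stage maps every stage to the metrics
    that cover it (possibly an empty list). unknown holds metrics that are
    not in KPI_STAGE and need a manual decision.
    """
    by_stage = {stage: [] for stage in FUNNEL_STAGES}
    unknown = []
    for metric in dashboard_metrics:
        stage = KPI_STAGE.get(metric)
        if stage is None:
            unknown.append(metric)
        else:
            by_stage[stage].append(metric)
    return by_stage, unknown


# Hypothetical dashboard, invented for illustration.
dashboard = ["ROAS", "CVR", "CAC", "CTR", "Revenue per Visit"]
by_stage, unknown = stage_coverage(dashboard)

# Stages with no metric at all are where the streetlight effect may be hiding.
uncovered = [stage for stage, metrics in by_stage.items() if not metrics]

print("Metrics by stage:", by_stage)
print("Uncovered stages:", uncovered)
print("Unmapped metrics:", unknown)
```

In this invented example the dashboard covers only Consideration and Conversion. An empty stage does not prove the dashboard is wrong, but it is a prompt to ask whether Awareness and Retention are being measured anywhere else.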

KPIs across the Funnel

Each funnel stage has its own set of KPIs. You do not need to memorize them, but you should be able to tell which stage a metric belongs to. That way, you can spot when a dashboard is mixing up different layers.

| Funnel Stage | Typical KPIs | What They Measure |
|---|---|---|
| Awareness | Reach, Impressions, Aided / Unaided Brand Awareness, Share of Voice | Whether people know you exist |
| Consideration | CTR, Site Visits, Time on Site, Search Volume, Branded Search Lift | Whether people are evaluating you |
| Conversion | CVR, CAC, AOV, ROAS | Whether people buy, and how efficiently |
| Retention | Repeat Rate, Churn Rate, Retention Curve, LTV / CLV, NPS | Whether people come back |
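To make the conversion-stage abbreviations concrete, here is a minimal sketch of how the four KPIs are commonly computed from campaign totals. The definitions used are the common textbook ones and are an assumption here: teams differ on the details (CVR per session versus per click or per user, CAC with or without non-media costs), so check the local definition before comparing numbers. The `CampaignTotals` fields and the figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CampaignTotals:
    """Aggregated totals for one campaign over one reporting period."""
    spend: float        # media spend, in currency units
    sessions: int       # site sessions attributed to the campaign
    orders: int         # orders attributed to the campaign
    new_customers: int  # first-time buyers among those orders
    revenue: float      # revenue from those orders, same currency as spend


def safe_div(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def conversion_kpis(t):
    """Compute conversion-stage KPIs under the assumed definitions."""
    return {
        "CVR": safe_div(t.orders, t.sessions),       # conversion rate: orders per session
        "CAC": safe_div(t.spend, t.new_customers),   # customer acquisition cost
        "AOV": safe_div(t.revenue, t.orders),        # average order value
        "ROAS": safe_div(t.revenue, t.spend),        # return on ad spend
    }


# Hypothetical totals, invented for illustration.
totals = CampaignTotals(
    spend=1000.0, sessions=5000, orders=100, new_customers=40, revenue=4000.0
)
kpis = conversion_kpis(totals)
for name, value in kpis.items():
    print(name, value)
```

All four are ratios of quantities that sit close together in the logs, which is part of why conversion-stage measurement is comparatively direct. Returning `None` for a zero denominator keeps an empty campaign from silently reporting a KPI of zero.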

Typical Datasets in Marketing

The Data box in Figure 1 lists the main dataset categories that drive everything downstream. Here is what those categories look like in practice, and the pitfalls to watch for as a data scientist.

| Category | Examples | What a Data Science Newcomer Should Know |
|---|---|---|
| Customer / User | CRM, loyalty data, demographics, account / contract data | Schemas drift; definitions of “active user” change with org restructures. Identity resolution across IDs is harder than it looks. |
| Transaction | Purchase history, subscription / billing, cart, returns | Owned first-hand by e-commerce and SaaS companies; only partially visible to CPG companies (POS data via retailers). Granularity (line item vs. order vs. daily) matters for every downstream model. |
| Product / Pricing | SKU master, price lists, inventory, planograms, promotion calendars | SKU hierarchies are messy. Prices move with promotions, competitor moves, and seasons, so a “price” column is rarely as stable as it looks. |
| Digital behavior | Clicks, page views, sessions, in-app events, email opens / clicks | Volume is huge, but the data often explains only a small fraction of actual purchase variance. Easy to over-trust. |
| Media | Spend, impressions, GRP, creative metadata, campaign tags, email sends, push logs | Every platform reports differently. Attribution windows, “conversion” definitions, and viewability standards diverge. Joining across platforms is a recurring tax. |
| Survey | Brand tracking, NPS / CSAT, ad recall, intent surveys | Pre-aggregated and sample-based. Useful for awareness-stage signals you cannot get from logs. Watch for sampling and response bias. |
| Third-party | Panel data (NielsenIQ, Circana), syndicated category reports, search trends, weather, economic indicators, competitor data | Pre-aggregated and often delivered with a lag. Use for context, not for fine-grained causal claims. |
| Voice of customer | Reviews, support tickets, social mentions | Unstructured. Strong selection bias (only certain customers post). Useful for hypothesis generation, dangerous as a representative signal. |
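The granularity point in the Transaction row is easy to underestimate, so here is a minimal sketch of the same purchases rolled up from line items to orders to days. The rows, column order, and variable names are hypothetical and invented for illustration; real transaction tables would also carry returns, discounts, and taxes, which this sketch ignores.

```python
from collections import defaultdict

# Hypothetical line-item rows: (order_id, order_date, sku, quantity, unit_price)
line_items = [
    ("A1", "2024-03-01", "SKU-1", 2, 10.0),
    ("A1", "2024-03-01", "SKU-2", 1, 30.0),
    ("A2", "2024-03-01", "SKU-1", 1, 10.0),
    ("A3", "2024-03-02", "SKU-3", 4, 5.0),
]

# Line-item grain -> order grain: one total per order.
order_totals = defaultdict(float)
order_dates = {}
for order_id, order_date, _sku, quantity, unit_price in line_items:
    order_totals[order_id] += quantity * unit_price
    order_dates[order_id] = order_date

# Order grain -> daily grain: order count and revenue per day.
daily = defaultdict(lambda: {"orders": 0, "revenue": 0.0})
for order_id, total in order_totals.items():
    day = daily[order_dates[order_id]]
    day["orders"] += 1
    day["revenue"] += total

# The same "average value" question gives different answers at different grains.
avg_line_value = sum(q * p for _, _, _, q, p in line_items) / len(line_items)
aov = sum(order_totals.values()) / len(order_totals)

print("Average line-item value:", avg_line_value)
print("Average order value:", aov)
print("Daily totals:", dict(daily))
```

Averaging over line items and averaging over orders answer different questions, even though both could plausibly be labeled "average purchase value" on a dashboard. Once data is delivered at the daily grain, the order-level and line-level views can no longer be recovered, which is one reason to confirm the grain of a table before modeling on it.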