1.2 The Data Landscape of Marketing

Marketing data is never just one table. It is a stack of sources, each capturing a different part of the customer journey. In this section, I will map that stack using three lenses: the funnel, the KPIs at each stage, and the datasets behind them.

The Funnel as a Measurability Map

Customers rarely buy on first exposure. They move, sometimes over months and sometimes within minutes, from “not knowing the brand exists” to “buying” to “buying again.” The most common model of this journey is the funnel:

Awareness → Consideration → Conversion → Retention

The funnel's value is not that customers move through it in order. Its value is that it acts as a measurability map. As you move down the funnel, data gets more granular, measurement is more direct, signals come in faster, and experiments are easier to run.

Figure 1: The marketing funnel as a measurability map. As customers move down the stages (Awareness, Consideration, Conversion, Retention), data becomes more granular, measurement more direct, signals faster, and experiments easier to run. The diagram shows the four stages left to right, with customers entering at Awareness and exiting after Retention.

This is why method choice depends on the business model. E-commerce and SaaS teams can often test behavior directly. CPG and brand advertisers have to infer upstream effects from signals that are noisier, slower, and more aggregated.

A common trap is the streetlight effect: focusing only on what is easy to measure. Performance channels can report ROAS every day; brand campaigns cannot, so their contribution is easy to undercount. To avoid the trap, first ask which part of the funnel a KPI, dataset, or method actually covers.
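One way to make that question concrete is a small coverage check. The sketch below is only an illustration: the metric names, the `KPI_STAGE` lookup, and both helper functions are hypothetical, and a real dashboard would need its own mapping. It takes the metrics on a dashboard and reports which funnel stages nothing on that dashboard speaks to.

```python
FUNNEL_STAGES = ["awareness", "consideration", "conversion", "retention"]

# Hypothetical lookup from metric name to funnel stage; extend with your own names.
KPI_STAGE = {
    "reach": "awareness",
    "impressions": "awareness",
    "ctr": "consideration",
    "site_visits": "consideration",
    "cvr": "conversion",
    "roas": "conversion",
    "churn_rate": "retention",
    "ltv": "retention",
}


def stage_coverage(dashboard_metrics):
    """Return {stage: [metrics]} for every funnel stage, including empty ones."""
    coverage = {stage: [] for stage in FUNNEL_STAGES}
    for metric in dashboard_metrics:
        stage = KPI_STAGE.get(metric.lower())
        if stage is not None:  # unknown metrics are ignored, not guessed
            coverage[stage].append(metric)
    return coverage


def uncovered_stages(dashboard_metrics):
    """List funnel stages that no metric on the dashboard covers."""
    coverage = stage_coverage(dashboard_metrics)
    return [stage for stage in FUNNEL_STAGES if not coverage[stage]]


dashboard = ["ROAS", "CVR", "CTR"]
print(uncovered_stages(dashboard))  # ['awareness', 'retention']
```

A dashboard built only from performance metrics, like the one above, says nothing about awareness or retention. That gap is the streetlight effect in miniature.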

KPIs across the Funnel

Each funnel stage has its own set of KPIs. You do not need to memorize them, but you should be able to tell which stage a metric belongs to. That way, you can spot when a dashboard is mixing up different layers.

Funnel Stage	Typical KPIs	What They Measure
Awareness	Reach, Impressions, Aided / Unaided Brand Awareness, Share of Voice	Whether people know you exist
Consideration	CTR, Site Visits, Time on Site, Search Volume, Branded Search Lift	Whether people are evaluating you
Conversion	CVR, CAC, AOV, ROAS	Whether people buy, and how efficiently
Retention	Repeat Rate, Churn Rate, Retention Curve, LTV / CLV, NPS	Whether people come back
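To make a few of these acronyms concrete, here is a minimal sketch that computes some of them from aggregate totals. The `ChannelTotals` fields, the helper names, and the sample numbers are all hypothetical. The formulas follow common definitions, but denominators vary between teams (CVR per session versus per user, for example), so treat them as assumptions to check against your own dashboards.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ChannelTotals:
    """Hypothetical aggregate totals for one channel over one period."""
    impressions: int
    clicks: int
    sessions: int
    orders: int
    new_customers: int
    revenue: float
    spend: float


def safe_div(numerator: float, denominator: float) -> Optional[float]:
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def funnel_kpis(t: ChannelTotals) -> Dict[str, Dict[str, Optional[float]]]:
    """Compute a few KPIs, grouped by the funnel stage they describe."""
    return {
        "awareness": {"impressions": float(t.impressions)},
        "consideration": {"ctr": safe_div(t.clicks, t.impressions)},
        "conversion": {
            "cvr": safe_div(t.orders, t.sessions),      # assumption: per session
            "cac": safe_div(t.spend, t.new_customers),  # spend per new customer
            "aov": safe_div(t.revenue, t.orders),       # revenue per order
            "roas": safe_div(t.revenue, t.spend),       # revenue per unit of spend
        },
    }


totals = ChannelTotals(
    impressions=100_000, clicks=2_000, sessions=1_800,
    orders=90, new_customers=60, revenue=5_400.0, spend=1_800.0,
)
for stage, metrics in funnel_kpis(totals).items():
    print(stage, metrics)
```

Grouping the output by stage keeps the layers separate. A ROAS of 3.0 and a CTR of 2% answer different questions and should not be read as one blended score.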

Typical Datasets in Marketing

The Data box in Figure 1 lists the main dataset categories that drive everything downstream. Here is what those categories look like in practice, and the pitfalls to watch for as a data scientist.

Category	Examples	What a DS Newcomer Should Know
Customer / User	CRM, loyalty data, demographics, account / contract data	Schemas drift; definitions of “active user” change with org restructures. Identity resolution across IDs is harder than it looks.
Transaction	Purchase history, subscription / billing, cart, returns	E-commerce and SaaS businesses own this data directly; CPG sees it only partially (POS data via retailers). Granularity (line item vs order vs daily) matters for every downstream model.
Product / Pricing	SKU master, price lists, inventory, planograms, promotion calendars	SKU hierarchies are messy. Prices move with promotions, competitor moves, and seasons — a “price” column is rarely as stable as it looks.
Digital behavior	Clicks, page views, sessions, in-app events, email opens / clicks	Volume is huge, but these signals often explain only a small fraction of actual purchase variance. Easy to over-trust.
Media	Spend, impressions, GRP, creative metadata, campaign tags, email sends, push logs	Every platform reports differently. Attribution windows, “conversion” definitions, and viewability standards diverge. Joining across platforms is a recurring tax.
Survey	Brand tracking, NPS / CSAT, ad recall, intent surveys	Pre-aggregated and sample-based. Useful for awareness-stage signals you cannot get from logs. Watch for sampling and response bias.
Third-party	Panel data (NielsenIQ, Circana), syndicated category reports, search trends, weather, economic indicators, competitor data	Pre-aggregated and often delivered with a lag. Use for context, not for fine-grained causal claims.
Voice of customer	Reviews, support tickets, social mentions	Unstructured. Strong selection bias (only certain customers post). Useful for hypothesis generation, dangerous as a representative signal.
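The granularity point in the Transaction row is easy to see in a toy example. The sketch below uses made-up line-item rows and hypothetical helper names (`to_orders`, `to_daily`). It rolls the same data up from line item to order to day and shows that "average value" changes meaning with the grain.

```python
from collections import defaultdict

# Hypothetical line-item rows: (order_id, order_date, sku, quantity, unit_price)
LINE_ITEMS = [
    ("o1", "2024-03-01", "sku-a", 2, 10.0),
    ("o1", "2024-03-01", "sku-b", 1, 30.0),
    ("o2", "2024-03-01", "sku-a", 1, 10.0),
    ("o3", "2024-03-02", "sku-c", 3, 20.0),
]


def to_orders(line_items):
    """Roll line items up to one row per order: {order_id: (date, revenue)}."""
    orders = {}
    for order_id, order_date, _sku, quantity, unit_price in line_items:
        _, revenue = orders.get(order_id, (order_date, 0.0))
        orders[order_id] = (order_date, revenue + quantity * unit_price)
    return orders


def to_daily(orders):
    """Roll orders up to one row per day: {date: (order_count, revenue)}."""
    daily = defaultdict(lambda: [0, 0.0])
    for order_date, revenue in orders.values():
        daily[order_date][0] += 1
        daily[order_date][1] += revenue
    return {day: (count, revenue) for day, (count, revenue) in daily.items()}


orders = to_orders(LINE_ITEMS)
daily = to_daily(orders)

# AOV needs order-level rows; averaging line items answers a different question.
aov = sum(revenue for _, revenue in orders.values()) / len(orders)
avg_line_value = sum(q * p for _oid, _date, _sku, q, p in LINE_ITEMS) / len(LINE_ITEMS)

print(aov)             # 40.0
print(avg_line_value)  # 30.0
print(daily)
```

Each roll-up also discards information. The daily table no longer knows which SKUs were bought or how many orders contained more than one item, so a model that needs basket composition cannot be built from it after the fact.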