How to Build a CDP, DMP, and Data Lake for AdTech & MarTech

Data platforms have been a key part of the programmatic advertising and digital marketing industries for well over a decade. 

Platforms like customer data platforms (CDPs) and data management platforms (DMPs) are crucial for helping advertisers and publishers run targeted advertising campaigns, generate detailed analytics reports, run attribution, and better understand their audiences.

Another key component of data platforms is a data lake, which is a centralized repository that allows you to store all your structured and unstructured data in one place. The data collected by a data lake can then be passed to a CDP or DMP and used to create audiences, among other things. 

In this blog post, we’ll look at what CDPs, DMPs, and data lakes are, outline situations where building them makes sense, and provide an overview of how to build them based on our experience.

Why Should You Build a CDP or DMP?

Although there are many CDPs and DMPs on the market, many companies require their own solution to provide them with control over the collected data, intellectual property, and feature roadmap.

Here are a few situations where building a CDP or DMP makes sense:

  1. If you’re an AdTech or MarTech company and want to expand or improve your tech offering.
  2. If you’re a publisher and want to build a walled garden to monetize your first-party data and allow advertisers to target your audiences.
  3. If you’re a company that collects large amounts of data from multiple sources and want to have ownership of the tech and control over the product and feature roadmap.

We Can Help You Build a Customer Data Platform (CDP)

Our AdTech development teams can work with you to design, build, and maintain a custom-built customer data platform (CDP) for any programmatic advertising channel.

What Is a Customer Data Platform (CDP)?

A customer data platform (CDP) is a piece of marketing technology that collects and organizes data from a range of online and offline sources.

CDPs are typically used by marketers to collect all the available data about the customer and aggregate it into a single database, which is integrated with and accessible from a number of other marketing systems and platforms used by the company.

With a CDP, marketers can view detailed analytics reports, create user profiles, audiences, segments, and single customer views, as well as improve advertising and marketing campaigns by exporting the data to other systems. 

View our infographic to learn more about the key components of a CDP:

[Infographic: the key components of a CDP]

What Is a Data Management Platform (DMP)?

A data management platform (DMP) is a piece of software that collects, stores, and organizes data from a range of sources, such as websites, mobile apps, and advertising campaigns. Advertisers, agencies, and publishers use DMPs to improve ad targeting, conduct advanced analytics, run look-alike modeling, and extend their audiences.

View our infographic to learn more about the key components of a DMP:

[Infographic: the key components of a DMP]

What Is a Data Lake?

A data lake is a centralized repository that stores structured, semi-structured and unstructured data, usually in large amounts. Data lakes are often used as a single source of truth. This means that the data is prepared and stored in a way that ensures it’s correct and validated. A data lake is also a universal source of normalized, deduplicated, aggregated data that is used across an entire company and often includes user-access controls. 

Structured data: Data that has been formatted using a schema. Structured data is easily searchable in relational databases.

Semi-structured data: Data that doesn’t conform to the tabular structure of relational databases, but contains organizational properties that allow it to be analyzed. 

Unstructured data: Data that hasn’t been formatted and is in its original state.
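
To make the three shapes concrete, here are some hypothetical example records:

```python
# Hypothetical records illustrating the three data shapes.

# Structured: fits a fixed relational schema (user_id, country, age).
structured_row = ("user_123", "US", 34)

# Semi-structured: JSON is self-describing and flexible, but its
# organisational properties (keys, nesting) still allow analysis.
semi_structured = {
    "user": "user_123",
    "events": [{"type": "click", "ts": "2021-09-22T05:19:18Z"}],
}

# Unstructured: free text (or images, audio) in its original state.
unstructured = "Loved the product, will definitely buy again!"
```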

Many companies have data science departments or products (like a CDP) that collect data from different sources, but they require a common source of data. Data collected from these different data sources often requires additional processing before it can be used for programmatic advertising or data analysis.

Generally, unaltered or raw-stage data (also known as bronze data) is kept available as well. With this data-copying approach, we are able to perform additional data verification steps on sampled or full data sets. The raw stage is also helpful if, for some reason, we need to reprocess historical data that wasn’t fully transformed. 

What’s the Difference Between a CDP, DMP, and a Data Lake?

CDPs may seem very similar to DMPs, as both are responsible for collecting and storing data about customers. There are, however, certain differences in the way they work.

CDPs primarily use first-party data and are based on real consumer identities generated by collecting and using personally identifiable information (PII). The information comes from various systems in the organization and can be enriched with third-party data. CDPs are mainly used by marketers to nurture the existing consumer base.

DMPs, on the other hand, are primarily responsible for aggregating third-party data, which typically involves the use of cookies. In this way, a DMP is more of an AdTech platform, while a CDP can be considered a MarTech tool. DMPs are mainly used to enhance advertising campaigns and acquire lookalike audiences.

A data lake is essentially a system that collects different types of data from multiple sources and then feeds that data into a CDP or DMP.

What Types of Data Do CDPs, DMPs, and Data Lakes Collect?

The types of data CDPs, DMPs, and data lakes collect include:

First-Party Data

First-party data is information gathered straight from a user or customer and is considered to be the most valuable form of data, as the advertiser or publisher has a direct relationship with the user (e.g. the user has already engaged and interacted with the advertiser).

First-party data is typically collected from:

  • Web and mobile analytics tools.
  • Customer relationship management (CRM) systems.
  • Transactional systems.

Second- and Third-Party Data

Many publishers and merchants monetize their data by adding third-party trackers to their websites or tracking SDKs to their apps and passing data about their audiences to data brokers and DMPs. 

This data can include a user’s browsing history, content interaction, purchases, profile information entered by the user (e.g. gender or age), GPS geolocation, and much more. 

Based on these data sets, data brokers can create inferred data points about interests, purchase preferences, income groups, demographics and more.

The data can be further enriched from offline data providers, such as credit card companies, credit scoring agencies and telcos.

[Diagram: how DMPs work]

How Do CDPs, DMPs, and Data Lakes Collect This Data?

The most common ways for CDPs, DMPs, and data lakes to collect data are by:

  • Integrating with other AdTech and MarTech platforms via a server-to-server connection or API.
  • Adding a tag (aka a JavaScript snippet or HTML pixel) to an advertiser’s or publisher’s website.
  • Importing data from files, e.g. CSV, TSV, and Parquet.
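
As a rough sketch of the tag and server-to-server approaches, here’s the kind of payload that might be sent to a tracking endpoint (the URL and field names are illustrative, not a real API):

```python
# A hedged sketch: the kind of event payload a website tag or
# server-to-server integration might POST to a tracking endpoint.
# The endpoint URL and field names are illustrative.
import requests

event = {
    "event_type": "page_view",
    "cookie_id": "7M-Q1P8-6AWG-1N3I",
    "url": "https://publisher.example/articles/42",
    "referrer": "https://www.google.com/",
    "ts": "2021-09-22T05:19:18Z",
}

requests.post("https://tracker.example/collect", json=event, timeout=2)
```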

Common Technical Challenges and Requirements When Building a DMP or CDP

Both CDP and DMP infrastructures are intended to process large amounts of data: the more data the CDP or DMP can use to build segments, the more valuable it is to its users (e.g. advertisers, data scientists, publishers, etc.).

However, the larger the scale of data collection, the more complex the infrastructure setup will be. 

For this reason, we first need to properly assess the scale and amount of data that needs to be processed as the infrastructure design will be dependent on many different requirements. 

Below are some key requirements that should be taken into account when building a CDP or DMP.

Data-Source Stream

A data-source stream is responsible for obtaining data from users/visitors. This data has to be collected and sent to a tracking server. 

Data sources include:

  • Website data: JavaScript code on a website listens for browser events. When a visitor performs an action, the JS code creates a payload and sends it to the tracker component.
  • Mobile application data: This often involves using an SDK, which can collect first-party application data. This data may include user identification data, profile attributes, as well as user behaviour data. User behaviour events include specific actions inside mobile apps. Data sent from an SDK is collected by the tracker component.
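
To illustrate the receiving end, below is a minimal sketch of a tracker endpoint. Flask is an assumption (any HTTP framework would do), and the in-memory list stands in for a real message queue:

```python
# A minimal tracker-component sketch. Flask is an assumption; the
# in-memory list stands in for a queue such as Kafka or Kinesis.
from flask import Flask, request, jsonify

app = Flask(__name__)
event_queue = []

@app.route("/collect", methods=["POST"])
def collect():
    event = request.get_json(force=True)
    # Events generally need at least one profile identifier to be usable.
    if not any(k in event for k in ("cookie_id", "email", "uuid")):
        return jsonify(error="missing profile identifier"), 400
    event_queue.append(event)  # hand off to the downstream pipeline
    return "", 204
```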

Data Integration

There are multiple data sources that can be incorporated into a CDP’s or DMP’s infrastructure:

  • First-party data integration: This includes data collected by a tracker and data from other platforms.
  • Second-party data integration: Data collected via integrations with data vendors (e.g. credit reporting companies), which can be used to enrich profile information.
  • Third-party data integration: Typically via third-party trackers, e.g. pixels and scripts on websites and SDKs in mobile apps.

The Number of Profiles

Knowing the number of profiles that will be stored in a CDP or DMP is crucial in determining the database type for profile storage. 

The profile database is a key component of the CDP’s or DMP’s infrastructure, as it’s responsible for identity resolution, which plays a key role in profile merging, and for proper segment assignment.

Data Extraction and Discovery

One common use case of a CDP and DMP is to provide an interface for data scientists so they have a common source of normalized data. 

The cleaned and deduplicated data source is a very valuable input that can be used to additionally prepare data for machine-learning purposes. This kind of data preparation often requires you to create a data lake, where data is transformed and encoded to a form that can be understood by machines. 

There are many types of data transformations, such as: 

  • OneHotEncoder
  • Hashing
  • LeaveOneOut
  • Target
  • Ordinal (Integer)
  • Binary 
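
As a minimal sketch of what one of these transformations looks like in practice, here’s one-hot encoding with scikit-learn; the column names are illustrative, and the other encoder types listed above follow the same fit/transform pattern (e.g. in the category_encoders Python package):

```python
# A minimal sketch: one-hot encoding categorical profile attributes
# with scikit-learn. Column names and values are illustrative.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

profiles = pd.DataFrame({
    "device_type": ["mobile", "desktop", "mobile", "tablet"],
    "country": ["US", "DE", "US", "PL"],
})

# handle_unknown="ignore" keeps the pipeline robust to unseen categories.
encoder = OneHotEncoder(handle_unknown="ignore")
features = encoder.fit_transform(profiles)

print(encoder.get_feature_names_out())
# e.g. ['device_type_desktop' 'device_type_mobile' ... 'country_US']
```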

Selecting a suitable data transformation type and designing a good data pipeline for machine learning involves collaboration between the development team and data scientists who analyze the data and provide valuable input regarding the machine-learning requirements. 

Additionally, machine learning may be used to create event-prediction models, run clustering and classification jobs, and aggregate and transform data. This can lead to discovering patterns that may initially be invisible to the human eye, but become quite obvious after applying a transformation (e.g. a hyperplane transformation).

Segments

The types of segments that need to be supported by a CDP’s and DMP’s infrastructure also influence the infrastructure’s design. 

Supported segment types can include:

  • Attribute-based segments (demographic data, location, device type, etc).
  • Behavioral segments based on events (e.g. clicking on a link in an email), and their frequency of actions (e.g. visiting a web page at least three times a month).
  • Segments based on classification performed by machine learning:
    • Lookalike / affinity: The goal of lookalike/affinity modelling is to support audience extension. Audience extension can be based on a variety of inputs and be driven by similarity functions. In the end, you can imagine a self-improving loop where we pick profiles with a lot of conversions and create affinity audiences. This results in an audience with more conversions, which can be used to create more affinity profiles, etc.
    • Predictive: The goal of predictive targeting is to use available information to predict the probability of an interesting event (purchase, app installation, etc.) and to target only the profiles with a high predicted probability.
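
To make the first two segment types concrete, here’s a hedged sketch of evaluating attribute-based and behavioural rules against a profile (the field names and thresholds are illustrative):

```python
# A hedged sketch of attribute- and behaviour-based segment rules.
# Profile fields and thresholds are illustrative.
from datetime import datetime, timedelta

profile = {
    "country": "US",
    "device_type": "mobile",
    "page_views": [datetime(2021, 9, d) for d in (1, 8, 15, 21)],
}

def in_attribute_segment(p):
    # Attribute-based: demographic/location/device conditions.
    return p["country"] == "US" and p["device_type"] == "mobile"

def in_behavioural_segment(p, now=datetime(2021, 9, 22)):
    # Behavioural: at least three page views in the last 30 days.
    recent = [ts for ts in p["page_views"] if now - ts <= timedelta(days=30)]
    return len(recent) >= 3

print(in_attribute_segment(profile), in_behavioural_segment(profile))  # True True
```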

Common Technical Challenges and Requirements When Building a Data Lake

Below are some common challenges when building a data lake:

  • It’s difficult to combine multiple data sources to generate useful insights and actionable data. Usually, IDs are required to bind the different data sources together, but often these IDs are not present or simply don’t match.
  • It’s often hard to know what data is included in a given data source. Sometimes the data owner doesn’t even know what kind of data is there. 
  • There is also a need to clean up the data and reprocess it in case of an ETL pipeline failure, which will happen from time to time. This needs to be done either manually or automatically. Databricks Delta Lake has an automatic solution since their delta tables comply with ACID properties. AWS is also implementing ACID transactions in one of their solutions (governed tables), but it’s only available in one region at the moment.

In the first step of processing, data is extracted and loaded into the first raw stage. After the first stage, multiple data lake stages are often available, depending on the use case. 

Usually, the second step carries out various data transformations, like deduplication, normalisation, column prioritisation, and merging. The following steps perform additional layers of data transformations, for example, business-level aggregations required for the data science team or for reporting purposes. 
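
As a rough illustration of the first two stages, here’s a minimal PySpark sketch (bucket names, paths, and column names are illustrative assumptions):

```python
# A minimal sketch of a raw ("bronze") -> cleaned ("silver") ETL step.
# Bucket names, paths, and columns are illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("bronze-to-silver").getOrCreate()

# Step 1: load the unaltered raw-stage data.
raw = spark.read.json("s3://example-data-lake/bronze/events/")

# Step 2: deduplicate and normalise.
silver = (
    raw.dropDuplicates(["event_id"])
       .withColumn("country", F.upper(F.col("country")))
)

silver.write.mode("overwrite").partitionBy("event_date").parquet(
    "s3://example-data-lake/silver/events/"
)
```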

By incorporating data lake components from AWS, such as AWS Lake Formation, which builds on the well-known Amazon S3 storage mechanism, with AWS Glue or Amazon EMR for the ETL data pipelines, we are able to create a centralized, curated, and secured data repository. 

On top of AWS Lake Formation sits Amazon Athena, a common query interface that can be shared by multiple infrastructure components and provides a unified way to access the data in the lake. 
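
For instance, a downstream component could query the curated tables through Athena via boto3 (the database, table, and bucket names here are assumptions):

```python
# A sketch of querying curated data-lake tables via Amazon Athena.
# Database, table, and bucket names are assumptions.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

response = athena.start_query_execution(
    QueryString=(
        "SELECT country, COUNT(DISTINCT profile_id) AS profiles "
        "FROM curated.events GROUP BY country"
    ),
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
print(response["QueryExecutionId"])  # poll this ID for the result set
```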

Additionally, AWS IAM can be used to add a layer of proper access-level controls to the data lake. 

If the data lake is properly designed and created, access to the data can be optimized for costs. 

Also, thanks to the final aggregation level, the required operations only need to be performed once during the ETL pipeline.

An Example of How to Build a CDP, DMP, and Data Lake

Download the full version of this article to see an example of a CDP/DMP and data lake development project.

The full version includes:

  • A list of the main features of a CDP/DMP and data lake.
  • An example of the architecture setup on AWS.
  • The request flows.
  • The Amazon Web Services we used.
  • A cost-level analysis of the different components.
  • Important considerations.

How Profile Merging and Audience Building Work In a DMP

Almost every data-management platform (DMP) on the market allows advertisers to create audiences and use them for different use cases, such as improved online ad targeting and advanced analytics.

To create audiences in a DMP, the platform must first create user profiles, which are made up of numerous profile identifiers. 

As part of a recent internal project carried out by one of the AdTech development teams at Clearcode, we researched the topics of audience building and profile merging and included some of our findings below.

To provide some context about the goal and purpose of profile merging, we first need to explain what audience building is and what profiles and profile identifiers are.

How Audience Building Works in a DMP

Audience building is one of the main data processes in a DMP. 

Once advertisers create an audience in a DMP, they can export it to other systems, such as a demand-side platform (DSP), for improved ad targeting.

An audience is a group of user profiles that share a common attribute or identifier. 

For example, an advertiser might create an audience in its DMP called “Visitors from the USA.” The audience would then contain profiles that have an attribute such as “country = USA.”

[Diagram: how the profile-merging process looks in a DMP]

Here’s an overview of what’s happening in the image above:

  • A new event occurs – in this case, a website visit.
  • The event contains numerous profile identifiers: cookie_id, country and click_id.
  • The profile identifiers are identified as belonging to an existing profile. Any new identifiers, in this case the click_id, are added to the profile.
  • The profile is added to any existing audiences whose conditions it meets. In this case, it would be added to the Visitors from the USA audience because of the country = USA attribute.

Note: Most DMPs hash personally identifiable information (PII) such as email addresses. To keep things simple, we’ll use examples of unhashed email addresses in this article.

Audiences are built on numerous processing assumptions, with the process starting from an input event (e.g. web visit), which may contain different user identifiers.

To create profiles, and subsequently audiences, every event generally needs to have at least one profile identifier.

What Are Profiles and Profile Identifiers?

A profile is a set of data collected from events tracked by a DMP. It represents a user and may contain the following pieces of information:

  • profile id
  • cookie id (list)
  • hashed email (list)
  • sid / uuid (list)
  • country (last seen)
  • name (nullable)
  • device_type (last seen)
  • device_vendor (last seen)
  • device_os (last seen)
  • browser_vendor (last seen)
  • gender (nullable)
  • company (nullable)
  • company size (nullable)
  • matching ids (list)

The list presented above can be extended depending on the specific use cases of a DMP. Some of the fields won’t be filled with data at the beginning.
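
As a rough sketch, such a profile record could be modelled like this (a simplified subset of the fields above; names and types are illustrative):

```python
# A minimal sketch of a profile record (a simplified subset of the
# fields listed above; list fields hold multiple known identifiers).
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Profile:
    profile_id: str
    cookie_ids: list[str] = field(default_factory=list)
    hashed_emails: list[str] = field(default_factory=list)
    country: Optional[str] = None      # last seen
    device_type: Optional[str] = None  # last seen
    name: Optional[str] = None         # nullable
```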

Generally speaking, if an input event contains an unknown identifier (i.e. one that is not in the DMP already), a new profile is created. 

On the other hand, if the input event contains an identifier that is already known to the DMP, the profile is updated with incoming data from the event.

After updating the profiles with event data, two profiles may often share a common identifier. 

If this occurs, the DMP will have to perform an operation known as profile merging.

What Is Profile Merging Exactly?

The profile-merging operation ensures there are no duplicate identifiers or attributes within a given profile and that no two profiles have the same unique identifiers (such as email addresses). It achieves this by converting all profiles sharing a common identifier into one profile. 

As events can carry multiple identifiers, events from the same user/profile may arrive with different identifiers. 

For example, consider the following three events:

Event 1: A user visits publisher.com using Firefox: {cookie_id = 7M-Q1P8-6AWG-1N3I}

Event 2: The same user subscribes to a newsletter on publisher.com using Chrome: {email = ben.kenobi@example.com, cookie_id = eyJraWQiOiJzZXN}

Event 3: The user fills in a form on publisher.com using Firefox: {email = ben.kenobi@example.com, cookie_id = 7M-Q1P8-6AWG-1N3I}

All three are from the same user, but before the third event arrives in the system, this isn’t known, and they are treated as two totally separate profiles.

Once it’s known that all three are from the same person, it would be advisable to treat them as the same object (profile), otherwise, we would have multiple profiles assigned to one user and those profiles wouldn’t contain the latest and most up-to-date information. 

At a minimum, profile merging requires joining the IDs and profile attributes together. 

Due to the large number of IDs and attributes that can be collected via events, it’s possible that only a small percentage of the collected data will be merged and used for audience creation. 

Also, if multiple user identifiers are found between profiles, we need to determine which identifier is the proper one – i.e. a single ID that will be used as the master ID after the data has been merged. This master ID will also be used to assign new data from events to a given profile.

To make things easier, it is assumed that a master identifier can be computed. This means that when an event with multiple IDs arrives in the system, it will be assigned a single ID calculated on the basis of the event IDs plus any other known IDs. 

A simple implementation would be to construct a list of all known IDs, sort it, and use the first element as the master ID. This is the simplest approach, but the right choice differs depending on the business use case of the DMP.
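
In code, that simplest rule is only a few lines (a sketch; real DMPs often weight identifier types instead, e.g. preferring a hashed email over a cookie ID):

```python
# The simplest master-ID rule described above: pool all known IDs,
# sort them, and take the first. Deterministic, but a real DMP would
# often weight ID types (e.g. hashed email over cookie ID) instead.
def master_id(event_ids, known_ids=()):
    return sorted(set(event_ids) | set(known_ids))[0]

print(master_id(["7M-Q1P8-6AWG-1N3I", "eyJraWQiOiJzZXN"]))
# -> '7M-Q1P8-6AWG-1N3I'
```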

The merged profile may be assigned to segments or audiences different from those its original profiles were assigned to. 

After the profile-merging operation, DMP taxonomies, segments and audiences also need to be regenerated. 

How to Merge Profiles Together

To effectively carry out the profile-merging operation, a proper way of merging must be determined.

Imagine a merging operation between two profiles where linking fields were found, both of which contain information entered by the user.

[Image: a newsletter subscription form and a contact form, each submitted with a different name]

The profile-merging operation has to decide which name is correct. 

There are a few ways to conduct profile merging. Below, we list four possible options.

Overwriting Existing IDs and Attributes

One of the simplest ways to merge profiles would be to overwrite all existing IDs and attributes with new, incoming ones. 

This can be done either by defining a master ID that will remain consistent (meaning it won’t be updated) or replacing the master ID each time a new ID is collected.

Alphabetical Sorting

Alphabetical sorting is another simple option for merging different profiles together. 

With this method, the data between profiles is sorted alphabetically and the first value is used. 

According to our example, we have two names: Ben and Obi-Wan. With alphabetical sorting, the name Ben is defined as the correct one.

Timestamp Sorting

Another approach would be to use the value that has the first or last recorded timestamp. 

In most cases, timestamp sorting will be the most desired method to use.

Again, according to the example, the event containing the name Ben was received first, so we use it instead of Obi-Wan. 

It’s important to note that timestamp sorting is determined by the event time, rather than the processing time.
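
Here’s a minimal sketch of timestamp-based merging, assuming each attribute is stored together with the event timestamp at which it was observed (the structure is illustrative):

```python
# A sketch of timestamp-sorted merging: each attribute is stored as
# (value, event_time) and the earliest event wins, matching the
# Ben vs. Obi-Wan example. Using the latest instead means flipping
# the comparison.
def merge_profiles(a, b):
    merged = dict(a)
    for attr, (value, ts) in b.items():
        if attr not in merged or ts < merged[attr][1]:
            merged[attr] = (value, ts)
    return merged

p1 = {"name": ("Ben", "2019-09-01T10:00:00Z")}
p2 = {"name": ("Obi-Wan", "2019-09-05T18:30:00Z")}
print(merge_profiles(p1, p2))  # {'name': ('Ben', '2019-09-01T10:00:00Z')}
```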

Wait-and-See Sorting

A more complex solution would be to keep all values for reference until a different sorting method (e.g. timestamp sorting) becomes applicable, at which point it can be determined whether the initial assumption was correct and what the final value should be after the merge.

Which Profile-Merging Option Do You Use?

Most of the time, choosing a profile-merging algorithm is based on the DMP’s use case, but it is also dependent on the type of data being merged and in most cases will need business justification.

Another aspect to consider is the order of profile-merging operations. 

When two profiles are found with linking fields, the profile-merging operation is performed. There will likely be cases where more than two profiles will need to be merged during a single operation. 

For example, if there are three profiles that need to be merged, the first two profiles would be merged and the third would be merged with the result of the first merger. 

To properly carry out this process, a merge order must be determined. 

For example, we can assume that the order is based on the timestamp of each of the events. 

Taking this into account, we may face a situation where a different order of merge operations ends up producing a different final profile combination. 

Depending on the business use case, an additional service that will periodically verify the profile merging may be required in order to guarantee proper merges.

How to Handle Concurrent Merging

Most DMP systems face very high processing requirements in terms of speed and the amount of data. 

Concurrent profile merging is a solution that enables us to perform profile merging in a short amount of time. 

However, in this case, multiple processes are evaluating events and the merges become a lot more complicated. 

The main problem with concurrent merging is deciding how to handle this when multiple events are being processed by the DMP at the same time. 

A simple approach is for the process that first receives an event to create a new profile, which should then be used in the second process. 

However, this causes all sorts of synchronization problems. It often takes a while to create a new profile, so before the first process finishes creating one, the second process may decide that it should also create a profile, resulting in two profiles that should really have been merged. 

While this might seem unlikely to happen, considering the scale of data processed by a DMP, such problems will no doubt appear.

In order to avoid such problems, we decided to route events to different processes, with a single master process (the router) handling all identifiers. 

When a new event arrives in the DMP, the router checks its ID and decides which processor it should go to, which (with a decent routing algorithm) should allow for even load distribution. 

This reduces the problem of concurrent profile-merging to multiple, simple merges such as those described above, at the cost of a single point of sequential processing. 

Even if two events arrive at the same time from the same profile but with different identifiers (and therefore should be merged), they will be processed one after another, and both will go to the same processor.  

To ensure the process runs smoothly, each process should have access to all profiles. If each process has its own profile store (e.g. a database), this will require copying profiles from one process to another. 
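
Below is a minimal sketch of that routing idea; the hash function and in-memory map are assumptions, not a description of a specific production system:

```python
# A minimal sketch of ID-based event routing. A single master router
# keeps an identifier -> worker map so that events sharing any known
# identifier always land on the same worker and merge sequentially.
import zlib

class EventRouter:
    def __init__(self, n_workers):
        self.n_workers = n_workers
        self.id_to_worker = {}

    def route(self, event_ids):
        # Reuse the worker of any identifier seen before.
        for event_id in event_ids:
            if event_id in self.id_to_worker:
                worker = self.id_to_worker[event_id]
                break
        else:
            # Unknown profile: pick a worker deterministically by hash,
            # which (with a decent algorithm) spreads the load evenly.
            worker = zlib.crc32(min(event_ids).encode()) % self.n_workers
        for event_id in event_ids:  # remember every ID for consistency
            self.id_to_worker[event_id] = worker
        return worker

router = EventRouter(n_workers=4)
w1 = router.route(["7M-Q1P8-6AWG-1N3I"])
w2 = router.route(["ben.kenobi@example.com", "7M-Q1P8-6AWG-1N3I"])
assert w1 == w2  # same profile -> same worker
```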

Key Takeaways

Below are a few key takeaways from our profile-merging research:

  • The correct implementation of a profile-merging algorithm is not just a matter of technical implementation, but also the DMP and business use case.
  • As there are multiple ways to carry out profile merging, the user profiles may change over time according to the different merging operations and the time at which the data is collected. It’s important to remember that DMPs collect user data at different times – sometimes it can be in real time (e.g. data collection from a website) and other times it can be via a data-import operation (e.g. first-party data onboarding).
  • To be sure that user data is not contaminated with false information, we need a proper merging algorithm for each of the collected and populated pieces of profile information.
