Deduplication is the process of identifying duplicate records and merging them into a single, well-defined version (sometimes called the “golden record” or the “single version of the truth”).
In Octolis, deduplication is used to unify data coming from different Sources and build Datasets that are free of duplicate records.
Octolis enables you to identify duplicate records based on a set of columns (aka a deduplication key).
Let’s take a few examples:
Each time a new record enters the Dataset, we compare its deduplication key with the keys of all records that already exist in the Dataset to determine whether it is a duplicate.
If several records enter the Dataset all at once, we also make sure to identify duplicates amongst them.
Each time Octolis identifies duplicate records (based on the deduplication key you set), they are merged together, resulting in a single record in the Dataset.
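The matching and merging behaviour described above can be illustrated with a minimal Python sketch. This is not Octolis's implementation: the function names (`dedup_key`, `ingest`), the exact-match comparison, and the merge rule (non-empty incoming values overwrite existing ones) are all assumptions made for illustration.

```python
def dedup_key(record, key_columns):
    """Build the deduplication key of a record from the chosen columns."""
    return tuple(record.get(column) for column in key_columns)


def ingest(dataset, incoming, key_columns):
    """Add incoming records to `dataset`, a dict indexed by deduplication key.

    Records are processed one by one, so duplicates inside the same batch
    are caught as well as duplicates of records already in the dataset.
    """
    for record in incoming:
        key = dedup_key(record, key_columns)
        if key in dataset:
            # Duplicate found: merge into the existing record.
            # Assumed merge rule: non-empty incoming values win.
            for column, value in record.items():
                if value not in (None, ""):
                    dataset[key][column] = value
        else:
            # First record with this key: store a copy.
            dataset[key] = dict(record)
    return dataset


# Hypothetical example: two records sharing the same email arrive together.
dataset = {}
batch = [
    {"email": "ann@example.com", "name": "Ann"},
    {"email": "ann@example.com", "name": "", "phone": "123"},
]
ingest(dataset, batch, key_columns=["email"])
```

With `email` as the deduplication key, the two incoming records collapse into a single record that carries both the name and the phone number.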
We also automatically add several system columns to the deduplicated records of the Dataset:
__master-Id__ to uniquely identify each deduplicated record (stable over time).
__modified-At__ to state when each record was last updated in the Sources.
__<SourceName>_<SourcekeyColumn>_list__ (for each Source key column) to list the Source key values of all the duplicates that the deduplicated record results from.
__created-At__ to state when each record was created in the DB table (stable over time).
__updated-At__ to state when each record was last updated in the DB table.
Thanks to this work, Octolis ensures that only deduplicated records are synced to your systems.
What now? You might want to do some data stewarding to clean duplicate records out of your systems.
For this, we advise you to map, in a Sync towards your system, the __master-Id__ and the list of the Source key values of all the duplicates that the deduplicated record results from.
We will later offer a native capability for tagging duplicates as "Duplicate of Id XXX" in a dedicated field.
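Until then, the two synced fields can drive a clean-up on your side. The sketch below is only an illustration of that idea: the field names `master_id` and `crm_contact_id_list` are hypothetical stand-ins for whatever you mapped in your Sync, and keeping the first Source key of each list as the surviving record is an assumption, not an Octolis rule.

```python
def duplicates_to_clean(synced_rows, master_id_field, key_list_field):
    """Build a clean-up plan from rows synced to the destination system.

    Returns {master_id: (kept_key, [keys_to_merge])} for every deduplicated
    record that results from more than one Source record.
    """
    plan = {}
    for row in synced_rows:
        source_keys = list(row[key_list_field])
        if len(source_keys) > 1:
            # Assumption: the first Source key is the record to keep.
            plan[row[master_id_field]] = (source_keys[0], source_keys[1:])
    return plan


# Hypothetical rows as they might look once synced to a CRM.
rows = [
    {"master_id": "m1", "crm_contact_id_list": ["c1", "c7"]},
    {"master_id": "m2", "crm_contact_id_list": ["c2"]},
]
plan = duplicates_to_clean(rows, "master_id", "crm_contact_id_list")
```

Here only `m1` appears in the plan, because it is the only deduplicated record built from several Source records: `c1` is kept and `c7` is flagged for merging.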