Data Quality user documentation

The Find duplicates step uses standardization and matching algorithms to group together records containing similar contact data (e.g. name, address, email, phone) and stores that information in a duplicate store. Each group of records, known as a cluster, is assigned a unique cluster ID and a match level. The step provides out-of-the-box functionality for the United Kingdom, Australia, and the United States, and is also configurable down to individual name and contact elements.

We strongly recommend that you tag data before using this step. This will allow the relevant columns to be automatically selected.

Find out how to use, configure and troubleshoot this step.

The output will display the records of the clusters that have had at least one record inserted or updated. A record is either:

INSERTED - A new record with a new unique ID was added to the store.
UPDATED - A record with a matching unique ID had values changed.
AFFECTED - The record's cluster ID or match status has potentially changed due to the insertion or update of another record in the cluster.
UNCHANGED - A record with a matching unique ID had no updated values.
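
The INSERTED, UPDATED and UNCHANGED statuses above can be pictured with a minimal sketch. This is an illustration only, not how the product is implemented: the store is assumed to be a plain dictionary keyed by unique ID, and classify_upsert is a hypothetical helper name. AFFECTED is deliberately left out, because it depends on how the matching rules re-cluster records.

```python
from typing import Dict, Mapping


def classify_upsert(store: Dict[str, dict], unique_id: str,
                    values: Mapping[str, str]) -> str:
    """Apply one incoming record to a toy store and return its status.

    Hypothetical model: the store maps unique ID -> column values.
    """
    existing = store.get(unique_id)
    if existing is None:
        # No record with this unique ID yet: add it.
        store[unique_id] = dict(values)
        return "INSERTED"
    if existing != dict(values):
        # Same unique ID, but at least one value differs: overwrite.
        store[unique_id] = dict(values)
        return "UPDATED"
    # Same unique ID and identical values: nothing to do.
    return "UNCHANGED"


store = {"42": {"name": "Ann Lee", "email": "ann@example.com"}}
statuses = [
    classify_upsert(store, "42", {"name": "Ann Lee", "email": "ann@example.com"}),
    classify_upsert(store, "42", {"name": "Ann Lee", "email": "ann.lee@example.com"}),
    classify_upsert(store, "43", {"name": "Bob Ray", "email": "bob@example.com"}),
]
print(statuses)  # ['UNCHANGED', 'UPDATED', 'INSERTED']
```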

The Find duplicates delete step provides the ability to bulk delete records within a duplicate store and update its clusters appropriately.

A Duplicate store must be provided, along with an input column whose values are used as the IDs of the records to be deleted.

By default, if an input contains one column with the Unique ID tag then that column will be automatically selected.

The output will display the records of the clusters that have had at least one record deleted. A record is either:

DELETED - The record is removed from the store.
AFFECTED - The record's cluster ID or match status has potentially changed due to the deletion of another record in the cluster.
DUPLICATE - The ID is ignored because the record has already been deleted.
NOT FOUND - The ID is ignored because no record with that ID could be found in the store.

If none of the IDs are found in the store, a warning is displayed as the output.
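
As a rough sketch of the DELETED, DUPLICATE and NOT FOUND outcomes, the loop below processes a list of IDs against a toy store. It rests on assumptions that are not confirmed by the product: the store is a plain dictionary, DUPLICATE is taken to mean that the same ID was already deleted earlier in the same input, and classify_deletes is a hypothetical name. AFFECTED is not modelled, since it depends on how clusters are rebuilt.

```python
from typing import Dict, Iterable, List, Set, Tuple


def classify_deletes(store: Dict[str, dict],
                     ids: Iterable[str]) -> Tuple[List[Tuple[str, str]], bool]:
    """Delete IDs from a toy store; return (id, status) pairs and a warning flag."""
    already_deleted: Set[str] = set()
    results: List[Tuple[str, str]] = []
    for record_id in ids:
        if record_id in store:
            del store[record_id]
            already_deleted.add(record_id)
            results.append((record_id, "DELETED"))
        elif record_id in already_deleted:
            # Assumption: a repeated ID in the same input counts as DUPLICATE.
            results.append((record_id, "DUPLICATE"))
        else:
            results.append((record_id, "NOT FOUND"))
    # Warn only when there was input and every single ID was NOT FOUND.
    warn = bool(results) and all(status == "NOT FOUND" for _, status in results)
    return results, warn


store = {"1": {"name": "Ann Lee"}, "2": {"name": "Bob Ray"}}
results, warn = classify_deletes(store, ["1", "1", "9"])
print(results)  # [('1', 'DELETED'), ('1', 'DUPLICATE'), ('9', 'NOT FOUND')]
print(warn)     # False
```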

The Find duplicates query step provides the ability to search for a collection of records in a Duplicate store using values from any column.

The values in the input column(s) are used to search the store for matching records.

A Duplicate store must be provided, along with at least one input column. Any number of columns can be selected, as long as they exist in the duplicate store.
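
The column requirement can be expressed as a simple pre-check. The sketch below is illustrative only: check_query_columns and the column names are hypothetical, and the product performs its own validation in the step configuration.

```python
from typing import Iterable, List


def check_query_columns(selected: Iterable[str],
                        store_columns: Iterable[str]) -> List[str]:
    """Return the selected columns if the selection is valid, else raise ValueError."""
    selected_list = list(selected)
    if not selected_list:
        raise ValueError("Select at least one input column")
    available = set(store_columns)
    missing = [column for column in selected_list if column not in available]
    if missing:
        raise ValueError(f"Columns not in the duplicate store: {missing}")
    return selected_list


# Hypothetical column names, for illustration only.
store_columns = ["Name", "Address", "Email", "Phone"]
print(check_query_columns(["Name", "Email"], store_columns))  # ['Name', 'Email']
```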

By default, if the input contains one or more columns that are tagged, these columns will be automatically selected.

Rules can also be changed or edited prior to submitting a search query.

The output will display the clusters that have had at least one record found. One of the output columns is the search match status, which corresponds to the match level of the input record against the found record(s) in the store.

The Find duplicates workbench helps you make matching and performance improvements to your Find duplicates step configuration.

Once you've used the Find duplicates step to generate a duplicate store, find out how to use the workbench to analyze and improve your configuration.

Aperture Data Studio v2

Create a single customer view (SCV)