Data Quality user documentation

This section covers key concepts related to the Find duplicates step.

Cluster ID

A cluster is a collection of records that have been identified as representing the same entity using the Find duplicates rules. Each cluster is identified by a unique cluster ID.
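To illustrate the idea of a cluster, the following minimal Python sketch groups records from a list of matched pairs. It assumes that matches are transitive (if A matches B and B matches C, all three share a cluster), which this page does not state. The function name `assign_cluster_ids` and the integer ID format are hypothetical and are not part of Aperture Data Studio.

```python
def assign_cluster_ids(record_ids, matched_pairs):
    """Group records into clusters from pairwise matches (illustrative only)."""
    # Each record starts as the representative of its own cluster.
    parent = {rid: rid for rid in record_ids}

    def find(rid):
        # Follow parent links to the representative, shortening the path as we go.
        while parent[rid] != rid:
            parent[rid] = parent[parent[rid]]
            rid = parent[rid]
        return rid

    # Merge the clusters of every matched pair.
    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Number clusters in order of first appearance so each gets a unique ID.
    cluster_of_root = {}
    cluster_ids = {}
    for rid in record_ids:
        root = find(rid)
        if root not in cluster_of_root:
            cluster_of_root[root] = len(cluster_of_root) + 1
        cluster_ids[rid] = cluster_of_root[root]
    return cluster_ids


# Hypothetical input: r1 matches r2, and r2 matches r3; r4 matches nothing.
clusters = assign_cluster_ids(
    ["r1", "r2", "r3", "r4"],
    [("r1", "r2"), ("r2", "r3")],
)
```

In this sketch, r1, r2 and r3 end up with the same cluster ID, and r4 gets a cluster ID of its own.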

Match status/level

Each match between two records will have one of the following confidence levels:

Match status	Description
Exact (0)	Each individual field that makes up the record matches exactly.
Close (1)	Records might have some fields that match exactly, and some fields that are very similar.
Probable (2)	Records might have some fields that match exactly, some fields that are very similar, and some fields that differ a little more.
Possible (3)	Most fields in the records share a number of similarities, but do not match exactly.
None (4)	Records do not match.
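
If you process match results outside the product, the levels in the table above can be modelled as an ordered enumeration. The following Python sketch is only an illustration: the numeric values come from the table, while the `MatchLevel` class and the `at_least` helper are hypothetical names.

```python
from enum import IntEnum


class MatchLevel(IntEnum):
    """Match levels from the table above; a lower number is a stronger match."""
    EXACT = 0
    CLOSE = 1
    PROBABLE = 2
    POSSIBLE = 3
    NONE = 4


def at_least(level, threshold):
    """Return True if `level` is as strong as `threshold` or stronger."""
    return level <= threshold
```

For example, `at_least(MatchLevel.CLOSE, MatchLevel.PROBABLE)` is true because Close (1) is a stronger match than Probable (2).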

Column mappings

If your data has columns tagged already, this step will recognize the tagged columns and automatically assign a relevant Find duplicates column mapping to them. Otherwise, you can manually assign the column mappings based on your knowledge of the data.

This step will only recognize the following system-defined tags:

Company
Address
- City
- Country
- County
- Locality
- Postal Code
- Premise And Street
- Province
- State
- Zip Code
Date
Email
Generic String
Phone
Name
- Forenames
- Surname
- Title
Unique Id
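
When you prepare data outside the product, it can help to check planned mappings against the list above before you build a workflow. The following Python sketch shows one way to do this. The tag names are copied from the list; the `unrecognized_mappings` function and the example column names are hypothetical.

```python
# System-defined tags recognized by the step, as listed above.
SYSTEM_TAGS = {
    "Company", "Address", "City", "Country", "County", "Locality",
    "Postal Code", "Premise And Street", "Province", "State", "Zip Code",
    "Date", "Email", "Generic String", "Phone", "Name", "Forenames",
    "Surname", "Title", "Unique Id",
}


def unrecognized_mappings(column_mappings):
    """Return the column-to-tag entries whose tag is not a system-defined tag."""
    return {
        column: tag
        for column, tag in column_mappings.items()
        if tag not in SYSTEM_TAGS
    }


# Hypothetical column names; "Town" is not in the list of recognized tags.
planned = {
    "cust_name": "Name",
    "addr_line_1": "Premise And Street",
    "town": "Town",
}
problems = unrecognized_mappings(planned)
```

Here `problems` contains only the `town` column, which would need a recognized tag such as City or Locality.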

Map your columns as accurately as possible before using the Find duplicates step. Accurate mappings make the matching process more efficient, while inaccurate ones reduce the quality of the results. For example, mapping a column as Address when it contains primarily company or name information will lead to less accurate results.

Additionally, if your data is already divided into address elements, use the more granular mappings such as Premise And Street and Locality rather than the higher-level Address mapping. The step then needs less effort to identify individual address components.

The standardization process that runs as part of the Find duplicates step attempts to recognize the Company in the context of an Address. We therefore recommend that you always provide an address, with its elements in the standard order for the relevant country, to optimize the standardization results.

For more information on how Find duplicates uses these column mappings, see the advanced configuration page.

Group IDs

You can apply different rulesets to columns with the same tag by using group IDs.

For example, you may have delivery and billing addresses that you want to treat differently. You would tag both as Address but assign them separate group IDs, which allows you to apply different rulesets: for instance, accept only an exact match for the billing address, but also accept a close match for the delivery address.
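
The billing and delivery example can be sketched in Python as follows. This is only a conceptual model, not the product's rule syntax: the `MappedColumn` class, the `ACCEPTED_LEVEL` table, the group ID values and the column names are all hypothetical. Match levels use the numbers from the Match status table (0 is Exact, 1 is Close).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MappedColumn:
    column: str    # source column name (hypothetical)
    tag: str       # system-defined tag
    group_id: int  # distinguishes columns that share a tag


# Hypothetical rulesets: the weakest match level accepted per (tag, group ID).
ACCEPTED_LEVEL = {
    ("Address", 1): 0,  # billing address: Exact only
    ("Address", 2): 1,  # delivery address: Exact or Close
}


def is_accepted(mapped, match_level):
    """Return True if the match level is strong enough for this column's group."""
    return match_level <= ACCEPTED_LEVEL[(mapped.tag, mapped.group_id)]


billing = MappedColumn("billing_address", "Address", 1)
delivery = MappedColumn("delivery_address", "Address", 2)
```

With these assumptions, a Close match (1) is rejected for `billing` but accepted for `delivery`, even though both columns carry the same tag.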

Rules and blocking keys

The Find duplicates step first groups similar records into blocks, so that only records in the same block need to be compared. This makes the duplicate detection process more efficient. Blocks of records are generated based on a blocking key, which is made up of a combination of record elements.

Every pair of records in each resulting block is then compared using a set of rules, which are logical expressions that control the match level returned.
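
The effect of blocking can be sketched in a few lines of Python. The blocking key below (first three letters of the surname plus the postal code) is a hypothetical example chosen for illustration; it is not one of the product's default blocking keys, and the field names are assumptions.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    """Hypothetical key: first three letters of the surname plus the postal code."""
    surname_part = record["surname"][:3].upper()
    postal_part = record["postal_code"].replace(" ", "").upper()
    return (surname_part, postal_part)


def candidate_pairs(records):
    """Return the pairs of record IDs that share a block and so need comparing."""
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record["id"])

    pairs = []
    for ids in blocks.values():
        # Only records inside the same block are paired up for rule evaluation.
        pairs.extend(combinations(ids, 2))
    return pairs


# Made-up sample data.
records = [
    {"id": 1, "surname": "Smith", "postal_code": "AB1 2CD"},
    {"id": 2, "surname": "Smithe", "postal_code": "ab12cd"},
    {"id": 3, "surname": "Jones", "postal_code": "AB1 2CD"},
    {"id": 4, "surname": "Smith", "postal_code": "ZZ9 9ZZ"},
]
pairs = candidate_pairs(records)
```

Without blocking, four records would require six pairwise comparisons. With this key, only records 1 and 2 share a block, so only one pair is passed on to the rules.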

Combinations of blocking keys and rules are stored in Step settings. You can select them in the Find duplicates step or, when using a persistent store, in the Duplicate store configuration. To view the rules or blocking keys, or to create a new set, either go to Step settings > Find duplicates settings, or open the Duplicate stores screen and then click Create new Duplicate store or select the Edit details action on an existing Duplicate store.

Default Find duplicates step settings

Aperture Data Studio provides default settings for the Find duplicates step. The following default types of blocking keys and rules are available under Step settings > Find duplicates settings:

Individual - groups records with similar names at similar addresses. For example, GBR_Individual_Default will find individuals in Great Britain. Note that emails, phone numbers, and other identifiers will not be taken into account, but can be added manually.
Household - groups records with the same or similar family names at a similar address. For example, GBR_Household_Default will find households in Great Britain.
Location - groups records with similar addresses or locations. For example, GBR_Location_Default will find locations in Great Britain.

Default blocking keys and rules are provided for Australia (AUS), Great Britain (GBR), and United States (USA) as detailed in the table below:

Name	Summary
AUS_Individual_Default	Default Australia individual level rules and blocking keys based on name and address
AUS_Household_Default	Default Australia household level rules and blocking keys based on surname (last name) only and address
AUS_Location_Default	Default Australia location level rules and blocking keys based on address only
GBR_Individual_Default	Default United Kingdom individual level rules and blocking keys based on name and address
GBR_Household_Default	Default United Kingdom household level rules and blocking keys based on surname (last name) only and address
GBR_Location_Default	Default United Kingdom location level rules and blocking keys based on address only
USA_Individual_Default	Default United States of America individual level rules and blocking keys based on name and address
USA_Household_Default	Default United States of America household level rules and blocking keys based on surname (last name) only and address
USA_Location_Default	Default United States of America location level rules and blocking keys based on address only
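
The names in the table follow a country code, level, `Default` pattern. If you refer to these settings from scripts or documentation of your own, a small Python helper like the one below can split a name into its parts. The helper and its names (`parse_default_name`, `DefaultSetting`) are hypothetical and not part of the product; the codes and levels are taken from the table.

```python
from typing import NamedTuple

# Country codes and levels as they appear in the table above.
COUNTRIES = {"AUS": "Australia", "GBR": "Great Britain", "USA": "United States"}
LEVELS = ("Individual", "Household", "Location")


class DefaultSetting(NamedTuple):
    country_code: str
    level: str


def parse_default_name(name):
    """Split a name such as 'GBR_Household_Default' into country code and level."""
    country_code, level, suffix = name.split("_")
    if country_code not in COUNTRIES or level not in LEVELS or suffix != "Default":
        raise ValueError(f"not a default step setting name: {name!r}")
    return DefaultSetting(country_code, level)


# Rebuild the nine names listed in the table.
ALL_DEFAULT_NAMES = [
    f"{code}_{level}_Default" for code in COUNTRIES for level in LEVELS
]
```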

The summary of each step setting explains the purpose of its blocking keys and rules. To view the details of a step setting, click it in the Step settings list screen.

