Data Quality user documentation

This section covers key concepts related to the Find duplicates Workflow step.

Cluster ID

A cluster is a collection of records that have been identified as representing the same entity using the Find duplicates rules. Each cluster is identified by a unique cluster ID.
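To make the idea of a cluster concrete, the sketch below groups records into clusters from a list of matched record pairs and gives each cluster a numeric ID. It is only an illustration, not the product's implementation: the function name assign_cluster_ids, the record IDs, and the assumption that matches are followed transitively (if A matches B and B matches C, all three share a cluster) are hypothetical.

```python
def assign_cluster_ids(record_ids, matched_pairs):
    """Group records into clusters by following pairwise matches transitively.

    Assumption for illustration: any two records linked by a chain of
    matches end up in the same cluster (a simple union-find).
    """
    parent = {rid: rid for rid in record_ids}

    def find(rid):
        # Walk up to the root representative, shortening the path as we go.
        while parent[rid] != rid:
            parent[rid] = parent[parent[rid]]
            rid = parent[rid]
        return rid

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Number clusters in order of first appearance, starting at 1.
    cluster_of_root = {}
    cluster_ids = {}
    for rid in record_ids:
        root = find(rid)
        if root not in cluster_of_root:
            cluster_of_root[root] = len(cluster_of_root) + 1
        cluster_ids[rid] = cluster_of_root[root]
    return cluster_ids


records = ["r1", "r2", "r3", "r4", "r5"]
pairs = [("r1", "r2"), ("r2", "r3")]
clusters = assign_cluster_ids(records, pairs)
```

In this sketch, r1, r2 and r3 share one cluster ID, while r4 and r5 each form a single-record cluster of their own.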

Match status/level

Each match between two records will have one of the following confidence levels:

Match status	Description
Exact (0)	Each individual field that makes up the record matches exactly.
Close (1)	Records might have some fields that match exactly, and some fields that are very similar.
Probable (2)	Records might have some fields that match exactly, some fields that are very similar, and some fields that differ a little more.
Possible (3)	Most fields in the records share a number of similarities, but do not match exactly.
None (4)	Records do not match.
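If you process match results outside Data Studio, it can help to model the levels as an ordered type. The following is a minimal sketch using the numeric values from the table above; the class name MatchStatus and the helper at_least are hypothetical names chosen for this example. Note that a lower number means a more confident match.

```python
from enum import IntEnum


class MatchStatus(IntEnum):
    """Match levels from the table above; lower values are more confident."""
    EXACT = 0
    CLOSE = 1
    PROBABLE = 2
    POSSIBLE = 3
    NONE = 4


def at_least(status, threshold):
    """Return True when `status` is as confident as `threshold`, or more so."""
    return status <= threshold


# A Close match satisfies a "Probable or better" requirement.
close_is_good_enough = at_least(MatchStatus.CLOSE, MatchStatus.PROBABLE)
```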

Column mappings

If your data has columns tagged already, this step will recognize the tagged columns and automatically assign a relevant Find duplicates column mapping to them. Otherwise, you can manually assign the column mappings based on your knowledge of the data.

This step will only recognize the following system-defined tags:

Address
- City
- Country
- County
- Locality
- Postal Code
- Premise And Street
- Province
- State
- Zip Code
Date
Email
Generic String
Phone
Name
- Forenames
- Surname
- Title
Unique Id

It is important to map your columns as accurately as possible before using the Find duplicates step, because this makes the matching process more efficient and more accurate. For example, mapping a column as Address when it contains primarily company or name information will lead to less accurate results.

Additionally, if your data is already divided in this way, use the more granular address element mappings (such as Premise And Street and Locality) rather than the higher-level Address mapping. Less effort is then required to identify individual address components.

The standardization process that runs as part of the Find duplicates step attempts to recognize the Company in the context of an Address. We therefore recommend that you always provide an address (in the standard address element order of the relevant country) to optimize the standardization results.

For more information on how Find duplicates utilizes these column mappings, you can refer to the advanced configuration page.

Group IDs

You can apply different rulesets to columns with the same tag by using group IDs.

For example, you may have delivery and billing addresses that you want to treat differently. You would tag both as an address, but create separate group IDs, allowing you to apply different rulesets: only accept an exact match for the billing address, but a close one for the delivery address.
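The billing and delivery example can be sketched as a lookup from tag and group ID to the loosest match level that is acceptable. This is a conceptual illustration only: the column names, the integer group ID values, and the required_level table are assumptions made for the example and do not reflect the ruleset syntax used by the product.

```python
# Match levels as numbered in the match status table (lower is stricter).
EXACT, CLOSE = 0, 1

# Two columns share the Address tag but carry different group IDs.
columns = [
    {"name": "billing_address", "tag": "Address", "group_id": 1},
    {"name": "delivery_address", "tag": "Address", "group_id": 2},
]

# Hypothetical thresholds: the loosest level accepted per (tag, group ID).
required_level = {
    ("Address", 1): EXACT,  # billing: exact matches only
    ("Address", 2): CLOSE,  # delivery: close matches are acceptable
}


def column_accepts(column, observed_level):
    """Return True if the observed level is strict enough for this column."""
    threshold = required_level[(column["tag"], column["group_id"])]
    return observed_level <= threshold


billing, delivery = columns
```

With these thresholds, a Close match is rejected for the billing address but accepted for the delivery address, even though both columns carry the same tag.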

To apply a group ID to one or more columns, use the left-hand side menu in Workflow Designer:

Right-click on the column.
Select Configure column and enter the value for Group ID.
Click Apply to save the changes.

Rules and blocking keys

The Find duplicates step creates blocks of similar records to assist with the generation of suitable candidate record pairs for scoring. Blocks are created from records that have the same blocking key values.

Blocking keys are created for each input record from combinations of the record's elements that have been keyed. Keying is the process of encoding individual elements to the same representation so that they can be matched despite minor differences in spelling.
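The sketch below illustrates keying and blocking in general terms. It is not the algorithm used by the Find duplicates step: the toy keying function key_element, the choice of surname plus postcode as the blocking key, and the sample records are all assumptions made for this example.

```python
from collections import defaultdict
from itertools import combinations


def key_element(value):
    """Toy keying: uppercase letters only, drop non-leading vowels (and Y),
    then collapse adjacent repeated letters."""
    letters = [c for c in value.upper() if c.isalpha()]
    keyed = []
    for i, c in enumerate(letters):
        if i > 0 and c in "AEIOUY":
            continue
        if keyed and keyed[-1] == c:
            continue
        keyed.append(c)
    return "".join(keyed)


def blocking_key(record):
    """Combine the keyed surname with a normalized postcode."""
    postcode = record["postcode"].replace(" ", "").upper()
    return (key_element(record["surname"]), postcode)


records = [
    {"id": 1, "surname": "Smith", "postcode": "SW1A 1AA"},
    {"id": 2, "surname": "Smyth", "postcode": "sw1a1aa"},
    {"id": 3, "surname": "Jones", "postcode": "SW1A 1AA"},
    {"id": 4, "surname": "Smithe", "postcode": "M1 1AE"},
]

# Records with the same blocking key value fall into the same block.
blocks = defaultdict(list)
for record in records:
    blocks[blocking_key(record)].append(record["id"])

# Only records within the same block become candidate pairs for scoring.
candidate_pairs = [
    pair for ids in blocks.values() for pair in combinations(ids, 2)
]
```

Here "Smith" and "Smyth" are keyed to the same representation, so records 1 and 2 share a block and form the only candidate pair. Record 4 has a matching keyed surname but a different postcode, so it is never compared with them.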

Click Undefined blocking keys to specify a blocking key set.

To view the default and define your own blocking key sets, go to Glossary > Find Duplicates blocking keys.
Find out how to create your own blocking keys.

A ruleset is a set of logical expressions (rules) that control how records are compared and how match statuses/levels are decided.
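As a rough mental model, a ruleset can be pictured as an ordered list of conditions over per-element match levels, where the first condition that holds decides the overall match level. The sketch below is purely illustrative: the element names, the specific conditions, and the evaluate function are hypothetical and do not represent the product's rule syntax or its default rulesets.

```python
# Match levels as numbered in the match status table (lower is stricter).
EXACT, CLOSE, PROBABLE, POSSIBLE, NONE = range(5)

# Each rule pairs an overall level with a condition on the element levels.
# Rules are checked in order; the first one whose condition holds wins.
rules = [
    (EXACT, lambda lv: lv["name"] == EXACT and lv["address"] == EXACT),
    (CLOSE, lambda lv: lv["name"] <= CLOSE and lv["address"] <= CLOSE),
    (PROBABLE, lambda lv: lv["name"] <= CLOSE and lv["address"] <= PROBABLE),
    (POSSIBLE, lambda lv: lv["name"] <= PROBABLE or lv["address"] <= CLOSE),
]


def evaluate(element_levels):
    """Return the overall match level for one candidate record pair."""
    for level, condition in rules:
        if condition(element_levels):
            return level
    return NONE


overall = evaluate({"name": CLOSE, "address": EXACT})
```

In this example the name is only a Close match, so the first rule fails and the second rule assigns an overall level of Close.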

Click Undefined ruleset to specify a ruleset.

To view the default and define your own rulesets, go to Glossary > Find Duplicates rulesets. Find out how to create your own rules.

Consumer users will not be able to access the Glossary. Find out about user roles.

The following default blocking keys and rulesets are available:

Individual - groups records with similar names at similar addresses. For example, GBR_Individual_Default will find individuals in Great Britain. Note that emails, phone numbers, and other identifiers will not be taken into account, but can be added manually.
Household - groups records with the same or similar family names at a similar address. For example, GBR_Household_Default will find households in Great Britain.
Location - groups records with similar addresses or locations. For example, GBR_Location_Default will find locations in Great Britain.

Duplicate store

Retaining a duplicate store

You can retain your duplicate store to disk, so it can be used for searching and maintenance operations.

Duplicate stores are retained in your machine's Data Studio repository, within the experianmatch sub-directory. However, if you have configured a separate instance of the Find duplicates server, duplicate stores will be retained on the machine that hosts that server.

To retain a duplicate store when using the Find duplicates step:

Tick the Retain the duplicate store checkbox.
Enter a name for your duplicate store.
Click Show data to run the step and retain the duplicate store to disk.

Note that executing the entire workflow (instead of running the Find duplicates step separately within the workflow) will retain the duplicate store in the same way.

Encrypting a duplicate store

Any duplicate store can be encrypted if you specify this before running the Find duplicates step.

Encrypting the store protects the data on disk while the step is running, and is especially important for duplicate stores that have been retained for later use. Non-retained stores are deleted after the step has completed, but can still be protected while the results are being processed.

To encrypt:

Tick Encrypt the duplicate store in the step dialog.
The encryption method will depend on whether you want to retain your store or not:
- if you don't retain the store, a random encryption key will be generated and used;
- if you do retain the store, a new input node will appear. You have to connect your encryption key source to this new node. The encryption key source can be as simple as a single-cell input file (with headings disabled) from the Data Explorer or a custom step that connects to a remote key vault.
  A known encryption key _has to be_ specified for a retained duplicate store so it can be unlocked/locked for searching and maintenance operations later on.
Click Show data to run the step and enable the encryption.
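For a retained store, the single-cell key file mentioned above could be produced with a short script such as the one below. This is a sketch under stated assumptions: the source does not specify the required key format or length, so the 32-byte hex key, the function name write_key_file, and the file layout (one line, no heading) are hypothetical. Store the resulting file securely, because anyone holding the key can unlock the duplicate store.

```python
import secrets
from pathlib import Path


def write_key_file(path, num_bytes=32):
    """Write a random hex key as the only cell of a heading-less file.

    Assumption: a hex string is an acceptable key format; check the
    product documentation for the actual requirements.
    """
    key = secrets.token_hex(num_bytes)  # 2 hex characters per byte
    Path(path).write_text(key + "\n", encoding="utf-8")
    return key
```

A call such as write_key_file("duplicate_store_key.csv") would create the file, which could then be loaded with headings disabled and connected to the encryption key input node.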


Aperture Data Studio v1
