Data de-duplication is about finding records in a table of data that are the same as others and removing them. This is a simple process when looking at records that are identical and gets more complicated when looking for 'similar' duplicated data to remove.
Different scenarios require different tools.
This is the most common use case for identical duplicate values (if you're a MS Excel user, think of the Remove duplicates option).
If there are other columns in your table you don’t want to de-duplicate, but want to keep:
There are a number of aggregates in addition to First value such as Count, Sum, Maximum, Last value etc.
One of these is Grouped values. Instead of returning the first value in the column, for each group it will collate all values into a comma separated list. Find out more about the Group step.
A list of comma separated values might contain duplicate values. There are several useful List Functions such as deduplicating values, determining the most/least common value or add a frequency count for each value.
Values that are not exactly the same; are of different datatype, case, spacing, spelling, etc. will require some pre-processing before they can be de-duplicated. Either transform the column(s) or clone them and transform the cloned values, so these can be matched whilst retaining the original values. Find out more about the Transform step.
Finding similar values can be challenging in larger datasets. The Relationships option can help to highlight data inconsistencies in your data (such as United States/USA/United States of America). Find out more about Identify data conflicts.
The Find duplicates Workflow step uses powerful standardization and matching algorithms to group together records containing similar contact data.
For duplicate unsorted records with complex merge requirements, use the Harmonize duplicates Workflow step. It takes records that have already been identified as duplicates (clustered) and allows you to define the preferred method to merge records together or to identify the best record in the cluster.
Harmonize duplicates can keep all columns, including some that may be added in future, without needing to update step settings or the downstream steps that data is being fed into. This is better than the group approach where columns are assumed to be fixed.
First value also assumes that data can be pre-sorted into the required order. The Harmonize duplicates step doesn't require sorted data and can use a custom Score to determine the priority.