Data Quality user documentation | Advanced configuration

Overview

Data Studio provides default blocking keys and rulesets that you can use to run the Find duplicates step. However, you can create your own custom rules and blocking keys that are tailored to your needs.

Blocking keys and rulesets are designed and specified using elements - a representation of your input data after it has been mapped in the Find duplicates step and gone through the initial standardization process. Find out about elements.

The standardization process also creates additional versions of these elements to assist further - known as modifiers. Modifiers can correct, enhance or derive many known terms that appear in the input. For example, a DERIVED modifier may be created when the element was not contained in the input but the standardization process was able to determine the value (e.g. COUNTY can sometimes be derived from the LOCALITY and POSTCODE input). Find out about modifiers.

Elements can also be put into specific groups to separate them from other elements of the same type. This is especially important when you want to create blocking keys or rules that treat them as separate entities, such as cross-field matching.

We strongly recommend that you understand basic blocking key and rule design before using groups.

These elements, modifiers and groups (along with blocking key algorithms and rule comparators) can then be combined to make your own blocking keys and rules specific to your design and data.

The table below covers the available elements that can be used when designing blocking keys and rules, together with the possible modifiers and rule comparators that can be used with them. Note that blocking key algorithms are not listed here because they apply to all elements.

Element	Description	Example	Comparators	Modifiers
Title	Title	Mrs	ExactString	Default StandardSpelling StandardAbbreviation
Forenames	Given name(s) and any initials	John	ExactString ForenameCompare Levenshtein JaroWinkler DoubleMetaphone Soundex NYSIIS	Default RootName
Surname_Prefix	Surname prefix	De la	ExactString Levenshtein JaroWinkler DoubleMetaphone Soundex NYSIIS	Default StandardSpelling StandardAbbreviation
Surname	Surname with prefix	Smith	ExactString Levenshtein JaroWinkler DoubleMetaphone Soundex NYSIIS	Default RootName
Full_Name	Concatenation of forenames and surname (including prefix if present)	John O'Connor	ExactString TransposedNameCompare Levenshtein JaroWinkler DoubleMetaphone Soundex NYSIIS	Default RootName
Surname_Suffix	Surname suffixes	Junior	ExactString Levenshtein JaroWinkler DoubleMetaphone Soundex NYSIIS	Default StandardSpelling StandardAbbreviation
Gender	Gender	Female	ExactString Levenshtein JaroWinkler DoubleMetaphone Soundex NYSIIS	Default
Honorifics	Honorifics	Ph.D	ExactString Levenshtein JaroWinkler	Default StandardSpelling StandardAbbreviation
Company	Organization name	Experian Ltd	ExactString Levenshtein JaroWinkler DoubleMetaphone Soundex NYSIIS	Default
Building_Description	Building name and type	George West House	ExactString Levenshtein JaroWinkler DoubleMetaphone Soundex NYSIIS	Default
Building_Number	Building number	43	ExactString PremiseCompare	Default
SubBuilding_Number	Sub-building number	2	ExactString PremiseCompare	Default
SubBuilding_Description	Sub-building name	First-floor	ExactString Levenshtein JaroWinkler DoubleMetaphone Soundex NYSIIS	Default
SubBuilding_Type	Sub-building type	Flat	ExactString Levenshtein JaroWinkler DoubleMetaphone Soundex NYSIIS	Default StandardSpelling StandardAbbreviation
MinorStreet_Number	Street number	34th	ExactString PremiseCompare	Default
MinorStreet_Predirectional	Street pre-directional	South	ExactString Levenshtein JaroWinkler DoubleMetaphone Soundex NYSIIS	Default StandardSpelling StandardAbbreviation
MinorStreet_Description	Street name	Carnaby	ExactString Levenshtein JaroWinkler DoubleMetaphone Soundex NYSIIS	Default
MinorStreet_Type	Street descriptor	Street	ExactString Levenshtein JaroWinkler NumericCompare DoubleMetaphone Soundex NYSIIS	Default StandardSpelling StandardAbbreviation
MinorStreet_Postdirectional	Street post-directional	South	ExactString Levenshtein JaroWinkler DoubleMetaphone Soundex NYSIIS	Default StandardSpelling StandardAbbreviation
PoBox_Number	PO box number	79	ExactString Levenshtein JaroWinkler	Default
PoBox_Description	PO box description	PO Box	ExactString Levenshtein JaroWinkler DoubleMetaphone Soundex NYSIIS	Default StandardSpelling StandardAbbreviation
DoubleDependentLocality	A small locality such as a village, used to identify an address where a street appears more than once in a dependent locality	Kingston Gorse	ExactString Levenshtein JaroWinkler DoubleMetaphone Soundex NYSIIS	Default/Derived StandardSpelling
DependentLocality	Smaller locality used to identify an address where a street appears more than once in a locality	East Preston	ExactString Levenshtein JaroWinkler DoubleMetaphone Soundex NYSIIS	Default/Derived StandardSpelling
Locality	A larger locality, such as a town or a city	Cambridge	ExactString Levenshtein JaroWinkler DoubleMetaphone Soundex NYSIIS	Default/Derived StandardSpelling
Province	A larger area of a country, contains multiple localities	Cambridgeshire	ExactString Levenshtein JaroWinkler DoubleMetaphone Soundex NYSIIS	Default/Derived StandardSpelling StandardAbbreviation
Country	Country name	United Kingdom	ExactString Levenshtein JaroWinkler DoubleMetaphone Soundex NYSIIS	Default
Postcode	Postal code or ZIP code	'SW4 0QL' or '20521 9000'	ExactString PostcodeCompare	Default/Derived StandardSpelling
Generic_String	Generic string	ab-1234cdef	ExactString Levenshtein JaroWinkler NumericCompare DoubleMetaphone Soundex NYSIIS	Default
Date	ISO date in the format YYYY-MM-DD	1980-06-21	ExactString DateCompare	Default
Phone	Phone number	(01234) 567890	ExactString Levenshtein JaroWinkler	Default
Email	Email address	john.smith@domain.com	ExactString Levenshtein JaroWinkler DoubleMetaphone Soundex NYSIIS	Default
Email_Local	Local part of email address	john.smith	ExactString Levenshtein JaroWinkler DoubleMetaphone Soundex NYSIIS	Default
Email_Domain	Email domain	domain.com	ExactString Levenshtein JaroWinkler DoubleMetaphone Soundex NYSIIS	Default
Hash	Hash value of all normalised input fields within a group	8bf14557574b8793aae648fc1b0280c3	ExactString	Default

The table below covers the available element modifiers that can be used. See the elements table to find out which modifiers can be used with which elements.

Modifier	Operation	Example
(Default)	The element classified from the input in a cleaned form, normalised to remove diacritics and converted to upper case.	Supplied: 123 High Road MINORSTREET_TYPE -> ROAD
StandardSpelling	The element converted to a standard spelling (contains Derived value when available).	Supplied: 123 High Road MINORSTREET_TYPE.STANDARDSPELLING -> ROAD
StandardAbbreviation	The element converted to the standard abbreviation.	Supplied: 123 High Road MINORSTREET_TYPE.STANDARDABBREVIATION -> RD
Derived	A derived value that was inferred from other information in the input address.	Supplied: 123 High Road, London, E1 2EZ PROVINCE.DERIVED -> GREATER LONDON
RootName	The root name of the input name.	Supplied name: Alex FORENAMES.ROOTNAME -> ALEXANDER
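
As a rough illustration of the (Default) row above, the sketch below removes diacritics and converts a value to upper case. It is only an approximation written for this page: the function name default_form is hypothetical, and the cleaning and classification that Data Studio performs during standardization is not reproduced here.

```python
import unicodedata


def default_form(value: str) -> str:
    """Strip diacritics and upper-case a value (illustrative only)."""
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", value)
    # Drop the combining marks, keeping the base letters.
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_marks.upper()


print(default_form("José Núñez"))
```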

Any element can be used multiple times in an input to represent separate or unique sets of information. For example, the input may include multiple:

street numbers, street names and postcodes (e.g. a delivery and a billing address)
forenames and surnames (e.g. a primary account holder and a spouse)
generic strings (e.g. an account and a customer reference)
dates (e.g. a date of birth and a registration date)
phone numbers (e.g. a cell phone number and a landline number)
email addresses (e.g. a personal and a work email)

A group name has to begin with an alphabetic character and can only consist of alphabetic characters, numbers, underscores and hyphens.
Groups can only be defined at the match level in rules, not at the element or theme level.
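
If you generate group names programmatically, you may want to check them against the naming rule above before using them. The sketch below is one way to do that. It assumes that "alphabetic character" means an ASCII letter, and the helper name is_valid_group_name is hypothetical.

```python
import re

# First character: a letter. Remaining characters: letters, digits,
# underscores or hyphens. ASCII letters only is an assumption.
GROUP_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_-]*")


def is_valid_group_name(name: str) -> bool:
    """Return True if the name follows the group naming rule."""
    return GROUP_NAME.fullmatch(name) is not None


for candidate in ["BillingAddress", "Delivery_2-b", "2ndAddress", "Home Address"]:
    print(candidate, is_valid_group_name(candidate))
```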

Using groups in blocking keys

You may want to create different blocking key specifications for different groups of the same element type, or use the same blocking key specification for multiple groups of the same element type. To do this, you need to add elementGroups to your element specification within your blocking key design. Find out more about elementGroups.

For example, if you have a billing and a delivery address and you want to have two different blocking key specifications for them that have different designs, you would add the following to each address element of the group within your blocking key designs:

"elementGroups":["BillingAddress"]

"elementGroups":["DeliveryAddress"]

Alternatively, if you wanted to use the same blocking key specification for both groups, you would add the following to each address element of the group within that single blocking key design:

"elementGroups":[
    "BillingAddress",
    "DeliveryAddress"
]

The blocking key specification examples below are taken from the default blocking keys available in Data Studio, and have been amended to show how they would be used for the two cases above.

Two differently designed blocking key specifications for two different groups of the same element type:

{
    "description": "SurnameMinorStreetNumberLocality",
    "countryCode": "GBR",
    "elementSpecifications": [
      {
        "elementType": "SURNAME",
        "algorithm": {
          "name": "INITIAL"
        },
        "includeFromNChars": 1
      },
      {
        "elementType": "MINORSTREET_NUMBER",
        "elementGroups": ["BillingAddress"],
        "includeFromNChars": 1,
        "truncateToNChars": 5
      },
      {
        "elementType": "LOCALITY",
        "elementGroups": ["BillingAddress"],
        "elementModifiers": [
          "STANDARDSPELLING",
          "DERIVED"
        ],
        "includeFromNChars": 2,
        "truncateToNChars": 30
      }
   ]
}

{
    "description": "POBoxNumberLocality",
    "countryCode": "GBR",
    "elementSpecifications": [
      {
        "elementType": "POBOX_NUMBER",
        "elementGroups": ["DeliveryAddress"],
        "includeFromNChars": 1
      },
      {
        "elementType": "LOCALITY",
        "elementGroups": ["DeliveryAddress"],
        "elementModifiers": [
          "STANDARDSPELLING",
          "DERIVED"
        ],
        "includeFromNChars": 2,
        "truncateToNChars": 30
      }
   ]
}

The same blocking key specification for two different groups of the same element type:

{
    "description": "FullPostcode",
    "countryCode": "GBR",
    "elementSpecifications": [
      {
        "elementType": "POSTCODE",
        "elementGroups": ["BillingAddress","DeliveryAddress"],
        "elementModifiers": [
          "STANDARDSPELLING"
        ],
        "includeFromNChars": 5,
        "truncateToNChars": 7
      }
   ]
}

Using groups in rules

You may want to handle the items listed above separately during rule evaluation. To do this, you need to prefix the rule with a hash symbol, followed by the group name in square brackets, followed by a period, followed by the rest of the rule.

For example, if you have a billing address and a delivery address and you want to evaluate them separately, you could write rules such as the ones below.

Match.L0 = {Name.L0 & #[Billing].Address.L0 & #[Delivery].Address.L0}
Match.L1 = {Name.L1 & (#[Delivery].Address.L1 | #[Billing].Address.L1)}

By default, this will mean that the two different groups will use the same underlying rules (in the case above, the same address rules). To use different rules for two groups that are of the same element type, you can use theme rules that are then utilized at the top level.

Match.L0={#[Billing].LooseAddressRule.L0 & #[Delivery].StrictAddressRule.L0}

LooseAddressRule.L0={Postcode.L0}
StrictAddressRule.L0={MinorStreet_Number.L0 & Postcode.L0}

MinorStreet_Number.L0={PremiseCompare[ExactMatch]}
Postcode.L0={PostcodeCompare[Part1Match] & PostcodeCompare[Part2Match]}

In the above example, two different theme rules have been used to evaluate an address (one being stricter than the other since it requires both the premises number AND postcode to be compared), but each is only used with one group as defined at the match level.

You can also specify more than one group in any rule, such as the example below. This will perform cross-field matching.

Match.L0 = {Name.L0 & #[Billing,Delivery].Address.L0}

The Find duplicates step uses blocking keys to create blocks of similar records to assist with the generation of suitable candidate record pairs for scoring via a ruleset.

Blocks are created from records that have the same blocking key values. Blocking keys are created for each input record from combinations of the record's elements that have been keyed. Keying is the process of encoding individual elements to the same representation so that they can be matched despite minor differences in spelling. To be effective, blocking keys should represent a range of contact data sub-element combinations.

Data Studio provides default blocking keys tuned for name and address matching for the Find duplicates step. However, you can modify these or create your own to suit your needs.

Structure

All blocking keys have to be built in the structure below.

Note that all values (e.g. element names, modifiers, algorithms) used in a blocking key specification have to be upper case. For example, MinorStreet_Number in a key has to be specified as MINORSTREET_NUMBER.

Element	Description
description (optional)	A descriptive name for the blocking key, for example "MyBlockingKey".
countryCode (optional)	An informational code that indicates the country the blocking key is intended for (the value does not affect processing in any way). If you need blocking keys to be generated conditionally, see the validCountries element.
validCountries (optional)	An array of ISO3 country codes for which the blocking key is valid. If this element is present and the standardized record has a country code, then it must match one of the country codes in the array - otherwise the blocking key won't be generated. For example: "validCountries":["GBR","IRL"]
elementSpecifications	An array of elements that are to be used in the key, created as part of the initial standardization process. Blocking keys are created using the list of elements in order. For example, the key FORENAMES+SURNAME+MINORSTREET_NUMBER would be created from the array below: "elementSpecifications":[ { "elementType":"FORENAMES" }, { "elementType":"SURNAME" }, { "elementType":"MINORSTREET_NUMBER" } ]
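
The validCountries behaviour described above can be sketched as a small check. This is an illustration of the documented rule only, not Data Studio code. The function name key_applies and the record_country argument are hypothetical.

```python
from typing import Optional


def key_applies(spec: dict, record_country: Optional[str]) -> bool:
    """Decide whether a blocking key should be generated for a record."""
    valid = spec.get("validCountries")
    # No validCountries element, or no country code on the standardized
    # record: the key is generated as normal.
    if valid is None or not record_country:
        return True
    # Otherwise the record's country code must be one of the listed codes.
    return record_country in valid


spec = {"description": "MyBlockingKey", "validCountries": ["GBR", "IRL"]}
print(key_applies(spec, "GBR"), key_applies(spec, "USA"), key_applies(spec, None))
```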

Each element within elementSpecifications has to then be built in the structure below:

Element	Description
elementType	The element to use in the blocking key. See the list of available elements and examples for each.
elementGroups (optional)	The group(s) that the element belongs to. If this isn't supplied, the default group will be used. This is required when you have more than one of the same element type (e.g. multiple phone numbers) that you want to treat separately. Find out about groups. The example below specifies multiple groups for a single element type: "elementGroups":[ "HomeAddress", "WorkAddress" ]
elementModifiers (optional)	An array of element modifiers to use. This list is processed in order and the first populated element will be used. If no modifier is specified/found, the default value of the element will be used. See the list of available modifiers, which elements they apply to and examples for each. The benefit of element modifiers is to use a more standardized and uniform version of the element, to better assist with blocking together records that may not look the same in their string representation but could still very much be intended to represent the same thing. The example below will use the standard spelling form of the element if available, otherwise the derived form. If neither are found, the default value for the element will be used. "elementModifiers":[ "STANDARDSPELLING", "DERIVED" ]
algorithm (optional)	Keying algorithm used to key the element. These are useful for blocking as they can cause similar elements to be blocked together even if their original form is not identical. For example, the SOUNDEX algorithm can be used to block similar sounding values (since they will have the same SOUNDEX key) even if they do not look the same. Defined algorithms have to have a name and can also have an optional set of additional properties. For example, the SUBSTRING algorithms require start, end and/or length. "algorithm": { "name": "MIDDLE_SUBSTRING", "properties": { "start": 2, "end": 6 } } See the list of available algorithms (and any associated properties) and examples for each.
includeFromNChars (optional)	Only include the keyed element in the blocking key if it is N or more characters in length. If specified, a blocking key will not be created when the configured element does not meet this criterion. For example, if this is set to 5 on the POSTCODE element, any postcodes found to be less than 5 characters in length will not be included in the blocking key, therefore the entire blocking key will not be created for that record.
truncateToNChars (optional)	Truncate the keyed element to N characters in length. For example, if this is set to 10 on the SURNAME element, any surnames longer than 10 characters will be included into the blocking key but cut at the 10th character.
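
The elementModifiers fallback described in the table above (process the list in order, use the first populated version, otherwise use the default value) can be sketched as follows. The versions dictionary, its "DEFAULT" entry and the function name pick_element_value are assumptions made for this example. They are not part of the blocking key format.

```python
from typing import Dict, List, Optional


def pick_element_value(versions: Dict[str, str], modifiers: List[str]) -> Optional[str]:
    """Return the first populated modifier value, else the default value."""
    for modifier in modifiers:
        value = versions.get(modifier)
        if value:  # skip modifiers that are missing or empty
            return value
    # No listed modifier was populated: fall back to the default form.
    return versions.get("DEFAULT") or None


locality = {"DEFAULT": "LONDON", "DERIVED": "GREATER LONDON"}
print(pick_element_value(locality, ["STANDARDSPELLING", "DERIVED"]))
```

In this example the STANDARDSPELLING version is not populated, so the DERIVED version is used.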

Algorithms

The algorithms below are available for use in blocking key specifications, and apply to all non-numeric elements. If no algorithm is specified, the SIMPLIFIED_STRING algorithm is used.

Name	Description	Keyed Example	Additional properties
NO_CHANGE	No modification performed, and whitespaces retained	ANDREW J -> ANDREW J
SIMPLIFIED_STRING	Whitespace removed	ANDREW J -> ANDREWJ
DOUBLE_METAPHONE	Double metaphone algorithm on all words	ANDREW J -> ANTR
DOUBLE_METAPHONE_FIRST_WORD	Double metaphone algorithm on the first word only	ANDREW J -> ANTR
NYSIIS	NYSIIS algorithm	ANDREW J -> ANDRAJ
SOUNDEX	SOUNDEX algorithm	ANDREW J -> A536
CONSONANT	Only include consonants	ANDREW J -> NDRWJ
INITIAL	Initial character only	ANDREW J -> A
START_SUBSTRING	Substring from the start of the value	ANDREW J ("length":3) -> AND	"length": `<integer>`
MIDDLE_SUBSTRING	Substring from defined start position to defined end position	ANDREW J ("start":2, "end":5) -> NDRE	"start": `<integer>`, "end": `<integer>`
END_SUBSTRING	Substring from the end of the value	ANDREW J ("length":3) -> W J	"length": `<integer>`
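
To make the keyed examples in the table easier to follow, the sketch below re-implements the non-phonetic algorithms in Python. It is an approximation based only on the examples above: it assumes upper-case input, treats Y as a consonant, and reads "start" and "end" as 1-based inclusive positions. The function name key_value is hypothetical. The phonetic algorithms (DOUBLE_METAPHONE, NYSIIS, SOUNDEX) are deliberately left out.

```python
def key_value(value: str, name: str = "SIMPLIFIED_STRING", **properties: int) -> str:
    """Key a value with one of the simple (non-phonetic) algorithms."""
    if name == "NO_CHANGE":
        return value
    if name == "SIMPLIFIED_STRING":
        # Remove all whitespace.
        return "".join(value.split())
    if name == "CONSONANT":
        # Keep letters that are not vowels (Y kept: an assumption).
        return "".join(ch for ch in value if ch.isalpha() and ch not in "AEIOU")
    if name == "INITIAL":
        return value[:1]
    if name == "START_SUBSTRING":
        return value[: properties["length"]]
    if name == "MIDDLE_SUBSTRING":
        # 1-based, inclusive start and end positions (an assumption).
        return value[properties["start"] - 1 : properties["end"]]
    if name == "END_SUBSTRING":
        length = properties["length"]
        return value[-length:] if length > 0 else ""
    raise ValueError(f"Algorithm not covered by this sketch: {name}")


print(key_value("ANDREW J", "MIDDLE_SUBSTRING", start=2, end=5))
```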

CONSONANT and SOUNDEX support the following character sets:

Basic Latin (ASCII)
Latin-1 Supplement
Latin Extended-A
Latin Extended Additional

All other key types support the following Latin character sets:

Basic Latin (ASCII)
Latin-1 Supplement
Latin Extended-A
Latin Extended-B
Latin Extended-C
Latin Extended-D
Latin Extended-E
Latin Extended Additional
IPA Extensions
Phonetic Extensions
Phonetic Extensions Supplement

Blocking key examples

Let's use the default blocking keys in Data Studio for more advanced examples of blocking key design:

{
   "description": "FullZipCode",
   "countryCode": "USA",
   "elementSpecifications": [
     {
       "elementType": "POSTCODE",
       "elementModifiers": [
         "STANDARDSPELLING"
       ],
       "includeFromNChars": 9
     }
   ]
}

This first blocking key specification consists of just the POSTCODE element. It uses the STANDARDSPELLING element modifier and only creates a key when the keyed value is at least 9 characters long.

This will result in the generation of candidate score pairs for every record in the data with the same zip and +4 code.

If the +4 code is missing, a blocking key will not be created.

The sample input USA zip codes below would therefore create the following keys with this blocking key specification:

90210-8560 ⇒ 902108560 - complete example, with the correct number of characters
90210-08560 ⇒ 902100856 - truncated example, with the final 0 cut off
90210 ⇒ <null> - no blocking key generated due to min. length of 9 being unsatisfied

This second blocking key specification below is a little more advanced - it combines a few elements together and uses various modifiers and algorithms, as well as different character number restrictions.

{
    "description": "ForenameSurnameMinorStreetNumber",
    "countryCode": "GBR",
    "elementSpecifications": [
      {
        "elementType": "FORENAMES",
        "elementModifiers": [
          "ROOTNAME"
        ],
        "algorithm": {
          "name": "DOUBLE_METAPHONE_FIRST_WORD"
        },
        "includeFromNChars": 1,
        "truncateToNChars": 10
      },
      {
        "elementType": "SURNAME",
        "algorithm": {
          "name": "DOUBLE_METAPHONE"
        },
        "includeFromNChars": 1,
        "truncateToNChars": 10
      },
      {
        "elementType": "MINORSTREET_NUMBER",
        "includeFromNChars": 1,
        "truncateToNChars": 5
      }
   ]
}

This will result in the generation of candidate score pairs for every record in the data with the same forename root name double metaphone value, surname double metaphone value, and premises number.

The first element defined is FORENAMES and uses:
- the ROOTNAME element modifier
- the DOUBLE_METAPHONE_FIRST_WORD algorithm on that root name
- key restrictions: the keyed forename must be at least 1 character long and is truncated after the 10th character
The second element defined is SURNAME without any element modifier, and uses:
- the DOUBLE_METAPHONE algorithm on the surname
- key restrictions: the keyed surname must be at least 1 character long and is truncated after the 10th character
The third (and final) element defined is MINORSTREET_NUMBER, and uses:
- no element modifiers or algorithms
- key restrictions: the number must be at least 1 digit long and is truncated after the 5th character

The sample names and addresses below show what those blocking keys would look like:

MRS VAL JONES, 45 MAIN ROAD, LONDON, E1 2AS

FORENAMES = VAL
- Modifier: ROOTNAME of VAL = VALERIE
- Algorithm: DOUBLE_METAPHONE_FIRST_WORD of VALERIE = FLR
- Length restriction: Length of FLR = 3, which is at least 1 and no more than 10, so it is included without truncation

⇒ FLR

SURNAME = JONES
- Modifier: <none>
- Algorithm: DOUBLE_METAPHONE of JONES = JNS
- Length restriction: Length of JNS = 3, which is at least 1 and no more than 10, so it is included without truncation

⇒ JNS

MINORSTREET_NUMBER = 45
- Modifier: <none>
- Algorithm: <none>
- Length restriction: Length of 45 = 2, which is at least 1 and no more than 5, so it is included without truncation

⇒ 45

Therefore, the final key generated for this address is: FLRJNS45.

MR JOHNNY ANDERSON-THOMPSON, 123456 HIGH STREET, BRIGHTON, BN1 3SX

FORENAMES = JOHNNY
- Modifier: ROOTNAME of JOHNNY = JOHNATHON
- Algorithm: DOUBLE_METAPHONE_FIRST_WORD of JOHNATHON = JN0N
- Length restriction: Length of JN0N = 4, which is at least 1 and no more than 10, so it is included without truncation

⇒ JN0N

SURNAME = ANDERSON-THOMPSON
- Modifier: <none>
- Algorithm: DOUBLE_METAPHONE of ANDERSON-THOMPSON = ANTR
- Length restriction: Length of ANTR = 4, which is at least 1 and no more than 10, so it is included without truncation

⇒ ANTR

MINORSTREET_NUMBER = 123456
- Modifier: <none>
- Algorithm: <none>
- Length restriction: Length of 123456 = 6, which is at least 1 but more than 5, so it is truncated at the 5th character

⇒ 12345

Therefore, the final key generated for this address is: JN0NANTR12345.

PAUL SMITH, HANCOCK BUILDING, GRACE AVENUE, NOTTINGHAM

FORENAMES = PAUL
- Modifier: ROOTNAME of PAUL = PAUL
- Algorithm: DOUBLE_METAPHONE_FIRST_WORD of PAUL = PL
- Length restriction: Length of PL = 2, which is at least 1 and no more than 10, so it is included without truncation

⇒ PL

SURNAME = SMITH
- Modifier: <none>
- Algorithm: DOUBLE_METAPHONE of SMITH = SM0
- Length restriction: Length of SM0 = 3, which is at least 1 and no more than 10, so it is included without truncation

⇒ SM0

MINORSTREET_NUMBER = null
- Modifier: <none>
- Algorithm: <none>
- Length restriction: Length of null = 0, which is less than 1, so the key restriction is NOT satisfied

⇒ <null>

Because one part of the blocking key specification has not been satisfied (the premises number is missing from the input), no blocking key is generated for this record from this blocking key specification.
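
The three worked examples above can be reproduced with a short sketch of the length restrictions. It starts from values that have already been keyed (FLR, JNS and so on), because the ROOTNAME modifier and the double metaphone algorithm are not re-implemented here. The function name assemble_key and the tuple layout (keyed value, includeFromNChars, truncateToNChars) are assumptions made for this example.

```python
from typing import List, Optional, Tuple

# (keyed value or None, includeFromNChars, truncateToNChars)
KeyedPart = Tuple[Optional[str], Optional[int], Optional[int]]


def assemble_key(parts: List[KeyedPart]) -> Optional[str]:
    """Join keyed elements in order, applying the length restrictions."""
    pieces = []
    for value, include_from, truncate_to in parts:
        value = value or ""
        # One element that is too short means no key for this record.
        if include_from is not None and len(value) < include_from:
            return None
        if truncate_to is not None:
            value = value[:truncate_to]
        pieces.append(value)
    return "".join(pieces)


print(assemble_key([("FLR", 1, 10), ("JNS", 1, 10), ("45", 1, 5)]))
print(assemble_key([("JN0N", 1, 10), ("ANTR", 1, 10), ("123456", 1, 5)]))
print(assemble_key([("PL", 1, 10), ("SM0", 1, 10), (None, 1, 5)]))
```

The third call returns None, mirroring the record with no premises number.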

Whilst blocking keys are used to generate suitable candidate pairs of input records, rules are then used to score those pairs and to cluster any that are considered a good enough match.

You can modify the default rulesets or even create your own. Before you do though, we recommend that you review the concepts below:

Rules take the following form: <rule reference>=<expression>
A rule reference consists of a rule name followed by a "." and followed by a match level (e.g. MyRule.L0)
An expression may take multiple forms:
- Low-level expression - operates on the elements within a record.
- Higher-level expression - composed of references to other rules.
A ruleset consists of a combination of three rule types which increase in specificity from match, to theme, to element, as illustrated in the diagram below.
- Match and theme rules consist of references to rules from the level below.
- Element rules are set on specific data elements, such as a postcode or a building number. These element rules can use special comparators depending on the element type.

Syntax

All rules have a rule reference on the left-hand side.

A rule reference takes the following format: <rule name>.<match level>

The rule name may be any of the following:

"Match" (match rule).
A custom identifier (theme rule). Must begin with an alphabetical character and may contain alphanumeric characters, underscores, and hyphens.
Element (single element rule).

The match level can be: L0, L1, L2, or L3. Note that you can override these values by using aliases.
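
If you build or check rules files with a script, a rule reference can be split into its two parts as sketched below. This is only an illustration of the format described above. The function name parse_rule_reference is hypothetical, the name pattern assumes ASCII letters, and any aliases you have defined must be passed in explicitly.

```python
import re
from typing import Iterable, Optional, Tuple

LEVELS = ("L0", "L1", "L2", "L3")
# Rule names: a letter, then letters, digits, underscores or hyphens.
NAME = re.compile(r"[A-Za-z][A-Za-z0-9_-]*")


def parse_rule_reference(
    reference: str, aliases: Iterable[str] = ()
) -> Optional[Tuple[str, str]]:
    """Split '<rule name>.<match level>', or return None if malformed."""
    name, dot, level = reference.strip().rpartition(".")
    if not dot or NAME.fullmatch(name) is None:
        return None
    if level not in LEVELS and level not in set(aliases):
        return None
    return name, level


print(parse_rule_reference("MyRule.L0"))
print(parse_rule_reference("Match.High", aliases=["High"]))
print(parse_rule_reference("MinorStreet_Number"))
```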

The right-hand side of a rule:

Is always surrounded by curly brackets: { }
May contain logical operators: & and |

Expressions may be nested and logical operators combined (parentheses are required), e.g. MyRule.L2 = {((RuleA.L3 & RuleB.L2) | (RuleA.L2 & RuleB.L3)) | RuleC.L0}

Element rules include the element and the allowed result set (enclosed in [ ] and comma-separated) and may also include an optional element modifier and/or comparator.

Any theme or element rule may also optionally include a group from the input mappings, which is defined by using a hash symbol before the group name. For example:
#MyFieldGroup.PostcodeTheme.L0 = {Postcode[ExactMatch]}.

Find out about groups.

Default country

You can also specify a default country in the rules file. This lets the Find duplicates step know what country to use when processing and standardizing data if:

you're not using a country tag when mapping your data columns or
the country field for a record is blank.

If a default country isn't specified and no country mapping is used, the default country will be automatically set to GBR (United Kingdom).

To specify a default country, add @default.country=<countryiso> to the top of the rules file (where <countryiso> is the ISO 3166-1 alpha-3 country code). For example: to set Australia as the default country, add: @default.country=AUS.

You can use the default rules provided with the Find duplicates step as the basis for customization since they already contain the correct default country value.
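
As an illustration of the fallback described in this section, the sketch below reads the default country from the text of a rules file and falls back to GBR when the setting is absent. The function name default_country is hypothetical. For simplicity the pattern accepts the setting on any line and tolerates spaces around the equals sign, which is an assumption rather than documented behaviour.

```python
import re

# Matches a line such as: @default.country=AUS
DEFAULT_COUNTRY = re.compile(
    r"^\s*@default\.country\s*=\s*([A-Za-z]{3})\s*$", re.MULTILINE
)


def default_country(rules_text: str) -> str:
    """Return the configured default country, or GBR if none is set."""
    match = DEFAULT_COUNTRY.search(rules_text)
    return match.group(1).upper() if match else "GBR"


print(default_country("@default.country=AUS\nMatch.L0={Name.L0 & Address.L0}"))
print(default_country("Match.L0={Name.L0 & Address.L0}"))
```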

Levels and evaluation order

There are 4 match levels that can be used within each rule specification:

Match level	Description
L0	Each individual field that makes up the record matches exactly.
L1	Records might have some fields that match exactly, and some fields that are very similar.
L2	Records might have some fields that match exactly, some fields that are very similar, and some fields that differ a little more.
L3	Records contain the majority of fields that have a number of similarities, but do not match exactly.

Note that you can override these level names with your own custom names. To do so, you have to use the expression define <Alias> as <Match level> at the top of the rules file, e.g.

define High as L0.
Match.High={Name.High & Address.High}
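
The effect of an alias can be pictured as a simple substitution of the match level. The sketch below collects define lines from rules text and rewrites a rule reference to its underlying level. The function names are hypothetical, the alias is assumed to be a single word, and the pattern accepts the define line with or without a trailing period because the example above does not make clear whether the period is part of the syntax.

```python
import re
from typing import Dict

# Matches a line such as: define High as L0
DEFINE = re.compile(r"^\s*define\s+(\w+)\s+as\s+(L[0-3])\s*\.?\s*$", re.MULTILINE)


def read_aliases(rules_text: str) -> Dict[str, str]:
    """Collect alias -> match level pairs from 'define' lines."""
    return {alias: level for alias, level in DEFINE.findall(rules_text)}


def resolve_level(reference: str, aliases: Dict[str, str]) -> str:
    """Rewrite 'Rule.Alias' to 'Rule.L<n>'; other references are unchanged."""
    name, _, level = reference.rpartition(".")
    return f"{name}.{aliases.get(level, level)}"


rules_text = "define High as L0.\nMatch.High={Name.High & Address.High}"
aliases = read_aliases(rules_text)
print(aliases, resolve_level("Match.High", aliases))
```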

When working with match levels, note that:

They will be evaluated in order from L0 through to L3, stopping at the first level that passes.
A rule level will evaluate as true if the same rule has evaluated as true for a higher (stricter) level, that is, one with a lower number. For example, if MyRule.L1 is true, then MyRule.L2 and MyRule.L3 will also be true.
Records will be considered a match and clustered together, provided that any of the defined match rule levels evaluates to true.
Every rule must include a match level as part of the rule reference.

Rules are evaluated as follows:

Match rules first (Match.<Match Level>).
Left to right.
Lazily.

Once a rule matches at a higher (stricter) level, the lower match levels do not need to be evaluated.
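
The evaluation order can be pictured with the sketch below, which tries the match levels from L0 to L3 and stops at the first one that passes. It is a simplified model, not the engine's implementation: each match rule is represented by a hypothetical Python callable, and the function name first_matching_level is an assumption. Within an expression, Python's and/or operators give the same left-to-right, lazy behaviour.

```python
from typing import Callable, Dict, List, Optional

LEVELS = ("L0", "L1", "L2", "L3")


def first_matching_level(match_rules: Dict[str, Callable[[], bool]]) -> Optional[str]:
    """Evaluate match rules from L0 to L3 and stop at the first that passes."""
    for level in LEVELS:
        rule = match_rules.get(level)
        if rule is not None and rule():
            return level  # later levels are never evaluated
    return None


calls: List[str] = []


def rule(name: str, result: bool) -> Callable[[], bool]:
    """Build a stand-in rule that records when it is evaluated."""
    def evaluate() -> bool:
        calls.append(name)
        return result
    return evaluate


rules = {"L0": rule("L0", False), "L1": rule("L1", True), "L2": rule("L2", True)}
level = first_matching_level(rules)
print(level, calls)
```

The recorded calls show that L2 is never evaluated once L1 has passed.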

Types

Three rule types can be defined:

Match rules
Theme rules
Element rules

Match rules

This is the highest rule level, defining an overall match between two records. A match rule is made up of references to other rules.

The name of the match rule must always be Match.<Match Level>

At least one match rule must be defined for a successful matching job.

Example: Match.L0={Name.L0 & Address.L0} (Name.L0 and Address.L0 have been defined separately).

You can combine rule references into compound logical expressions. This way, you have complete control over the logic used to determine matches.

Example: Match.L1={(Name.L0 & Address.L0) | (Name.L0 & Email.L0 & Phone.L0)}

Theme rules

Theme rules represent the next level down, after match rules.

Similar to a match rule, a theme rule is made up of references to other rules. The theme rule name must begin with an alphabetical character and may contain alphanumeric characters, underscores and hyphens. The rule name cannot contain "Match" or the set of reserved elements.

Example: Address.L0={Premise.L0 & Street.L0 & Locality.L0 & Postcode.L0}

The rule references within the expression can either be other theme rules, or low-level element rules.

Element rules

Element rules are the most granular of rules. They can be used to specify how to compare individual elements within a record. Elements are basic units of data that comprise an overall theme. For example, postcode and premise could be elements of an address theme.

Comparators

Rules are designed to evaluate and compare elements using special comparators. The table below covers the available comparators you can use.

If you want to know which comparators are available for which elements, see the elements table.

The results available for the default comparator (ExactString) will also be available for the other comparators.

Comparator	Results
ExactString (default comparator)	ExactMatch: Strings match exactly e.g. "John Smith" & "John Smith" OnePopulated: The field is populated for one of the records e.g. "John Smith" & "" NonePopulated: The field is not populated for either of the records e.g "" & "" NoMatch: The strings are both populated but are not an exact match e.g. "John Smith" & "John Doe"
ForenameCompare	ExactMatch: Strings match exactly, ignoring hyphens e.g. "Sarah-Jane" & "Sarah Jane" InitialVsFullName: An initial or initials match to the full name e.g "S J" & "Sarah Jane" FirstNameMatch: The first name of a multiple-name forename matches e.g "Sarah" & "Sarah Jane" InvertedNameMatch: All the names in a multiple-name forename match but are in a different order e.g "Jane Sarah" & "Sarah Jane" AnyNameMatch: Any name matches in a multiple-name forename e.g "Jane" & "Sarah Jane" Plus all ExactString (default comparator) results.
TransposedNameCompare	ExactMatch: The forename(s) and surname match exactly to the transposed version e.g. "John Smith" & "Smith John" PartialMatch: Only one part of the name matches to the transposed version e.g. "John Smith" & "Smith Paul", or "John Smith" & "Jones John" NoMatch: Neither the forename(s) or surname match exactly to the transposed version e.g. "John Smith" & "Jones Paul" Plus all ExactString (default comparator) results.
PremiseCompare	StartMatch: Premise matches the start of a premise range e.g. "12" & "12-15" StartMatchAndEncapsulated: Premise ranges match at the start and one encapsulates the other e.g. "12-15" & "12-16" EndMatch: Premise matches the end of a premise range e.g. "15" & "12-15" EndMatchAndEncapsulated: Premise ranges match at the end and one encapsulates the other e.g. "13-16" & "12-16" Encapsulated: Premise or premise range is encapsulated by the other e.g. "12" & "11-16" Overlapped: Premise ranges overlap each other e.g. "12-15" & "14-18" NumberMatchWithTrailingAlpha: Premise numbers match and one record has a trailing alpha e.g. "12" & "12a" NumberMatchWithDifferingAlpha: Premise numbers are a perfect match but trailing alpha is different e.g. "12a" & "12b" Plus all ExactString (default comparator) results.
DateCompare	DayMonthReversed: Matches dates where the day and month are reversed e.g. "2017-06-03" & "2017-03-06" MonthYearMatch: Matches dates where only the month and year match e.g. "2017-06-03" & "2017-06-04" DayMonthMatch: Matches dates where only the day and month match e.g. "2017-06-03" & "2016-06-03" DayYearMatch: Matches dates where only the day and year match e.g. "2017-06-03" & "2017-07-03" YearMatch: Matches dates where only the year matches e.g. "2017-06-03" & "2017-07-04" <n>DaysDifference: Matches dates that differ by up to n days <n>WeeksDifference: Matches dates that differ by up to n weeks (n * 7 days) <n>MonthsDifference: Matches dates that differ by up to n calendar months <n>MaxCharsDifference: Matches dates where up to n characters can be different (equivalent to Levenshtein distance for dates) Plus all ExactString (default comparator) results.
PostcodeCompare	Part1Match: Records match to the first part of the postcode e.g. "HA2 9PP" & "HA2 5QR" Part2Match: Records match to the second part of the postcode e.g. "SM1 9PP" & "HA2 9PP" PostcodeCompatible: The first part of the postcode for both records matches, and the second part is populated in one record only e.g. "HA2 9PP" & "HA2" Plus all ExactString (default comparator) results.
Levenshtein	Depending on specified comparison type, either: <Minimum %>: The minimum Levenshtein percentage to provide a match (integer between 0-100) e.g. setting the LevenshteinPercent result to 90 would return a match for "John Smith" & "Joan Smith" Or <Maximum distance>: The maximum Levenshtein distance to provide a match (integer). e.g. setting the LevenshteinDistance result to 1 would return a match for "John Smith" & "Joan Smith" Plus all ExactString (default comparator) results.
JaroWinkler	<Minimum %>: The minimum Jaro-Winkler distance percentage to provide a match (integer between 0-100). e.g. setting the JaroWinkler result to 95 would return a match for "John Smith" & "Joan Smith" Plus all ExactString (default comparator) results.
NumericCompare	ExactMatch: The numeric part of the strings match exactly e.g. "5th" & "5th" OnePopulated: Only one of the records contains a numeric value. The other record is either blank or does not contain a numeric value e.g. "5th" & "", or "5th" & "fifth" NonePopulated: Both records are either blank or do not contain numeric values e.g. "fifth" & "fifteenth" NoMatch: Both values are numeric but do not match e.g. "5th" & "15th" <n>: Both values are numeric and adding n to the lower value produces a value greater than or equal to the higher value <n%>: Both values are numeric and adding n percent of the higher value to the lower value produces a value greater than or equal to the higher value
DoubleMetaphone	The DoubleMetaphone comparator ignores all digits. When applied to an element where alphanumeric data is expected it should be combined with a NumericCompare rule. ExactMatch: The primary Double Metaphone codes for both strings are the same e.g. "Smith" & "Smythe" AlternateCodeMatch: The strings have matching Double Metaphone codes, taking into account alternate codes e.g. "Smith" & "Schmidt" FirstWordMatch: The first words of each string have matching Double Metaphone codes e.g. "Mary Anne" & "Marie" Plus all ExactString (default comparator) results.
Soundex	The Soundex comparator ignores all digits. When applied to an element where alphanumeric data is expected it should be combined with a NumericCompare rule. ExactMatch: The Soundex codes for both strings are the same e.g. "Smith" & "Smythe" Plus all ExactString (default comparator) results.
NYSIIS	The NYSIIS comparator ignores all digits. When applied to an element where alphanumeric data is expected it should be combined with a NumericCompare rule. ExactMatch: The NYSIIS codes for both strings are the same e.g. "Jan" & "John" Plus all ExactString (default comparator) results.
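
The NumericCompare results in the table can be illustrated with a short Python sketch. This is not Data Studio's implementation: the function names are hypothetical, and treating the first run of digits as the "numeric part" of a value is an assumption. The tolerance checks follow the <n> and <n%> descriptions above.

```python
import re


def numeric_part(value):
    """Return the first run of digits as an int, or None if there is none.

    Assumption: this is what 'numeric part' means; the product may differ.
    """
    match = re.search(r"\d+", value)
    return int(match.group()) if match else None


def numeric_compare(a, b):
    """Classify two values using the four basic NumericCompare results."""
    x, y = numeric_part(a), numeric_part(b)
    if x is None and y is None:
        return "NonePopulated"
    if x is None or y is None:
        return "OnePopulated"
    return "ExactMatch" if x == y else "NoMatch"


def within_n(a, b, n):
    """<n>: adding n to the lower value reaches or exceeds the higher value."""
    x, y = numeric_part(a), numeric_part(b)
    if x is None or y is None:
        return False
    low, high = sorted((x, y))
    return low + n >= high


def within_n_percent(a, b, n):
    """<n%>: adding n percent of the higher value to the lower value
    reaches or exceeds the higher value (integer arithmetic, no rounding)."""
    x, y = numeric_part(a), numeric_part(b)
    if x is None or y is None:
        return False
    low, high = sorted((x, y))
    return low * 100 + high * n >= high * 100


print(numeric_compare("5th", "fifth"))     # OnePopulated
print(within_n("5th", "15th", 10))         # True
print(within_n_percent("90", "100", 10))   # True
```

With these definitions, "5th" and "15th" are a NoMatch on their own but fall within a tolerance of 10.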

Filters

Filters may be used within rules to select only part of the specified string element. Insert one (or chain several) filter(s) between the element and the comparator. All existing comparators may be used, including Levenshtein.

Selecting a sub string

The SubString filter selects a portion of a string. It takes two integer parameters: the first determines where the selection starts, and the second determines how many characters to select.

The first parameter is the offset in the string at which to start. When zero or positive, it is an offset from the start of the string. When negative, it is an offset from the end of the string. For example, -1 starts at the last character, -2 at the second-to-last character, and so on.

The second parameter determines the number of characters to select. When zero, all characters to the end of the string will be selected. When positive, the supplied number of characters will be selected. When negative, the supplied number of characters will be removed from the end of the selection.

Boundary Conditions: If the second parameter is positive and greater than the number of characters available in the string (after applying the offset from the first parameter) then only the available characters will be selected. Otherwise, if the supplied parameters cannot select any characters then an empty string will be produced from the filter.

All of the examples below will produce a match.

Record	Generic String
Record 1	123-TR
Record 2	123-RM

GenericStringMatch.L2 = {Generic_String.SubString[0,3].[ExactMatch]}

Record	Generic String
Record 1	TR-123
Record 2	RM-123

GenericStringMatch.L2 = {Generic_String.SubString[3,0].[ExactMatch]}
(notice that 3,0 will remove the first 3 characters, leaving the remainder of the string)

Record	Generic String
Record 1	R-4AB
Record 2	N-4AR

GenericStringMatch.L2 = {Generic_String.SubString[2,-1].[ExactMatch]}
(produces a string starting at offset 2 and removing the last character)

Record	Generic String
Record 1	R-4567-AB
Record 2	N-7643227-AB

GenericStringMatch.L2 = {Generic_String.SubString[-2,0].[ExactMatch]}
(produces a string starting at the offset 2 characters from the end)

Record	Generic String
Record 1	MM-John Smith
Record 2	CC-Joan Smith

GenericStringMatch.L2 = {Generic_String.SubString[3,0].Levenshtein[90%]}
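
The SubString parameter rules and the examples above can be checked with a small Python sketch. The function names are hypothetical and this is not the product's code. Two points are assumptions: an offset that falls outside the string selects nothing, and the Levenshtein percentage is computed as the share of characters left unchanged relative to the longer string, which is consistent with the 90% example but not confirmed by the documentation.

```python
def sub_string(value, offset, count):
    """Sketch of SubString[offset,count] as described above."""
    length = len(value)
    start = offset if offset >= 0 else length + offset
    if start < 0 or start >= length:
        return ""  # assumption: an out-of-range offset selects nothing
    if count > 0:
        end = min(start + count, length)  # only the available characters
    elif count == 0:
        end = length                      # everything to the end of the string
    else:
        end = length + count              # remove characters from the end
    return value[start:end] if end > start else ""


def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(previous[j] + 1,                       # deletion
                               current[j - 1] + 1,                    # insertion
                               previous[j - 1] + (char_a != char_b)))  # substitution
        previous = current
    return previous[-1]


def levenshtein_percent(a, b):
    """Assumed percentage formula: unchanged share of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100 * (longest - levenshtein(a, b)) / longest


print(sub_string("123-TR", 0, 3))        # 123
print(sub_string("R-4AB", 2, -1))        # 4A
print(sub_string("R-4567-AB", -2, 0))    # AB
left = sub_string("MM-John Smith", 3, 0)
right = sub_string("CC-Joan Smith", 3, 0)
print(levenshtein_percent(left, right))  # 90.0
```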

Selecting an item from a delimited string

DelimitedField can be used to select a field from a string delimited by one or more characters. The first argument is the delimiter, defined as a Java regular expression. The second argument is an integer that specifies the field to select, counting from zero. Because the rule syntax allows character escapes, any character reserved for regular expressions that appears in the string literal must be double escaped.

Record	Generic String
Record 1	A-4567-AB
Record 2	A-7643227-AB

GenericStringMatch.L2 = {Generic_String.DelimitedField["\\-",0].[ExactMatch]}
(selects the first item of an array of strings delimited by the '-' character, so both values will be 'A'. Note the double escaping in use. The first is to escape possible special character codes, the second because the string is a regular expression)

Record	Generic String
Record 1	First item-4567-ATTR
Record 2	Second item-4567-ATTR

GenericStringMatch.L2 = {Generic_String.DelimitedField["\\-",1].[ExactMatch]}
(selects the second item of an array of strings delimited by the '-' character, so both values will be '4567'. Note the double escaping in use. The first is to escape possible special character codes, the second because the string is a regular expression)
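
The two DelimitedField examples above can be reproduced with a short Python sketch. Python's re module stands in for Java regular expressions here, which is adequate for a simple delimiter such as \- but may differ for other patterns. The function names are hypothetical, and both the unescaping step and the empty result for an out-of-range index are assumptions.

```python
import re


def unescape_rule_literal(literal):
    """Assumed rule-syntax unescaping: a doubled backslash becomes a single one."""
    return literal.replace("\\\\", "\\")


def delimited_field(value, delimiter_regex, index):
    """Split on a regular expression and return the field at a zero-based index."""
    fields = re.split(delimiter_regex, value)
    # Assumption: an index outside the available fields yields an empty string.
    return fields[index] if 0 <= index < len(fields) else ""


# The rule text "\\-" (two backslashes and a hyphen) becomes the regex \-
regex = unescape_rule_literal(r"\\-")

print(delimited_field("A-4567-AB", regex, 0))             # A
print(delimited_field("First item-4567-ATTR", regex, 1))  # 4567
```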

Using 'Contains'

The 'Contains' filter compares the values from both records. If the shorter value is contained within the longer one, the shorter value is selected for both records; otherwise, the values pass through the filter unchanged.

Record	Generic String
Record 1	Blah
Record 2	I contain Blah here

GenericStringMatch.L2 = {Generic_String.Contains.[ExactMatch]}
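
As a rough Python illustration of the 'Contains' behavior (a sketch with a hypothetical function name, not the product's code), the filter can be thought of as operating on the pair of values. Case-sensitive matching and leaving empty values untouched are assumptions.

```python
def contains_filter(a, b):
    """Return the pair of values that would be passed on to the comparator."""
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    # Assumption: an empty value is not treated as "contained" in the other.
    if shorter and shorter in longer:
        return shorter, shorter
    return a, b


print(contains_filter("Blah", "I contain Blah here"))  # ('Blah', 'Blah')
print(contains_filter("Blah", "Something else"))       # ('Blah', 'Something else')
```

After the filter, an ExactMatch comparison on the first pair succeeds because both values are 'Blah'.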

Chaining filters

Filters may be chained together. There's no limit to the number, but we recommend using no more than three.

Record	Generic String
Record 1	First item-4588-ATTR
Record 2	Second item-4599-ATTR

GenericStringMatch.L2 = {Generic_String.DelimitedField["\\-",1].SubString[0,2].[ExactMatch]}
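
Conceptually, a chain applies each filter to the output of the previous one before the comparator runs. The Python sketch below mirrors the rule above with two simplified stand-in filters; the names are hypothetical and the stand-ins only handle this example's input.

```python
import re
from functools import reduce


def apply_filters(value, filters):
    """Feed the value through each filter in order, left to right."""
    return reduce(lambda current, filter_fn: filter_fn(current), filters, value)


chain = [
    lambda s: re.split(r"\-", s)[1],  # stand-in for DelimitedField["\\-",1]
    lambda s: s[0:2],                 # stand-in for SubString[0,2]
]

print(apply_filters("First item-4588-ATTR", chain))   # 45
print(apply_filters("Second item-4599-ATTR", chain))  # 45
```

Both records reduce to '45', so the ExactMatch comparison in the rule succeeds.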

Rule examples

These are element rule examples that focus on how different elements, modifiers and comparators can be used when designing the rules:

Initial vs full name

Record	Forename	Surname
Record 1	Robert	Brooke
Record 2	R	Brooke

Name.L2 = {Forenames.ForenameCompare[InitialVsFullName] & Surname[ExactMatch]}

Minor street number

Record	MinorStreet_Number	MinorStreet_Description	MinorStreet_Type
Record 1	123	Burnthouse	Lane
Record 2	123a	Burnthouse	Lane

StreetAddress.L1 = {MinorStreet_Number.PremiseCompare[NumberMatchWithTrailingAlpha] & MinorStreet_Description[ExactMatch] & MinorStreet_Type.StandardAbbreviation[ExactMatch]}

Postcode

Record	MinorStreet_Description	MinorStreet_Type	Locality	Postcode
Record 1	Hints	Road	Tamworth	B78 3AB
Record 2	Hints	Road	Tamworth	B78 3AT

Address.L2 = {Building_Number[ExactMatch] & MinorStreet_Description[ExactMatch] & Locality[ExactMatch] & Postcode.PostcodeCompare[Part1Match]}
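
To show why the two postcodes above produce Part1Match, here is a Python sketch of how the PostcodeCompare results could be derived. It is an illustration only: the function names are hypothetical, postcodes are assumed to be split into two parts by whitespace, and the order in which results are checked is an assumption.

```python
def split_postcode(postcode):
    """Split a postcode into its first and second parts (assumes a space separator)."""
    parts = postcode.split()
    first = parts[0] if parts else ""
    second = parts[1] if len(parts) > 1 else ""
    return first, second


def postcode_compare(a, b):
    """Classify two postcodes using the results listed for PostcodeCompare."""
    if not a.strip() and not b.strip():
        return "NonePopulated"
    if not a.strip() or not b.strip():
        return "OnePopulated"
    first_a, second_a = split_postcode(a)
    first_b, second_b = split_postcode(b)
    if (first_a, second_a) == (first_b, second_b):
        return "ExactMatch"
    if first_a == first_b:
        # Same first part: either the second parts differ, or only one is populated.
        return "Part1Match" if second_a and second_b else "PostcodeCompatible"
    if second_a and second_a == second_b:
        return "Part2Match"
    return "NoMatch"


print(postcode_compare("B78 3AB", "B78 3AT"))  # Part1Match
print(postcode_compare("SM1 9PP", "HA2 9PP"))  # Part2Match
print(postcode_compare("HA2 9PP", "HA2"))      # PostcodeCompatible
```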

Cross-field matching

Cross-field matching allows you to match across multiple fields of the same type to find potential duplicates.

For example, if your data consists of three phone number fields (e.g. home, work and mobile number), you can configure the Find duplicates step to find potential phone number matches across all three of them.

For cross-field matching to work successfully, each value (e.g. phone number) that you want to cross-match has to be in its own custom group. We recommend that you understand how groups work before configuring cross-field matching.

To configure the Find duplicates step to perform cross-field matching, you have to set up your blocking keys and rulesets in such a way that your groups of similar data will be blocked and scored together.

Example

You have multiple address fields (home and billing) and want to identify potential duplicates where the billing address of one record matches the home address of another record.

RECORD ID	NAME	HomeAddress ADDRESS	HomeAddress LOCALITY	HomeAddress POSTCODE	BillingAddress ADDRESS	BillingAddress LOCALITY	BillingAddress POSTCODE
1	John Smith	1 High Street	London	SW4 0QL	48 Webber Road	Brighton	BN3 1EJ
2	John Smith	12 Acacia Avenue	London	E1W 2BB	1 High Street	London	SW4 0QL

If you take the default GBR rules and blocking keys and only modify them to treat these two addresses as separate groups (HomeAddress and BillingAddress), the two records will not match. Their home addresses differ, so they will neither block together nor evaluate as a match in the rules. The same applies to the BillingAddress group, because the billing addresses are also different.

The desired behavior is for the following cross-field operations to be performed. In the above example, the third operation (HomeAddress 1 vs. BillingAddress 2) is the one that produces a match:

HomeAddress 1 vs. HomeAddress 2
BillingAddress 1 vs. BillingAddress 2
HomeAddress 1 vs. BillingAddress 2
BillingAddress 1 vs. HomeAddress 2

To achieve this, make two changes. In every blocking key definition, add "elementGroups": ["HomeAddress","BillingAddress"] to each address element specification. In the final match level rules, prefix the address rules with #[HomeAddress,BillingAddress].

Blocking keys

The blocking key definition below is taken from the default GBR individual blocking keys. It combines the surname initial, street number and locality into a blocking key. By default, this creates keys for a single address. Here it has been modified to create keys for multiple addresses by adding the HomeAddress and BillingAddress groups:

{
   "description": "SurnameMinorStreetNumberLocality",
   "countryCode": "GBR",
   "elementSpecifications": [
     {
       "elementType": "SURNAME",
       "algorithm": {
         "name": "INITIAL"
       },
       "includeFromNChars": 1
     },
     {
       "elementType": "MINORSTREET_NUMBER",
       "elementGroups": ["HomeAddress","BillingAddress"],
       "includeFromNChars": 1,
       "truncateToNChars": 5
     },
     {
       "elementType": "LOCALITY",
       "elementGroups": ["HomeAddress","BillingAddress"],
       "elementModifiers": [
         "STANDARDSPELLING",
         "DERIVED"
       ],
       "includeFromNChars": 2,
       "truncateToNChars": 30
     }
   ]
 }

This will now cause two blocking keys to be generated for every record:

SURNAME_Initial + HomeAddress_MINORSTREET_NUMBER + HomeAddress_LOCALITY
SURNAME_Initial + BillingAddress_MINORSTREET_NUMBER + BillingAddress_LOCALITY

Using the sample data, the following blocking keys will be created:

Record 1
- S1LONDON
- S48BRIGHTON
Record 2
- S12LONDON
- S1LONDON

Since the first key of record 1 and the last key of record 2 are identical (S1LONDON), records 1 and 2 will now be blocked together and identified as a candidate pair for scoring in the rules.
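
The key generation for the sample data can be traced with a Python sketch. It is a simplified illustration with hypothetical names rather than the product's algorithm. It assumes the street number and locality have already been extracted by standardization, infers upper-casing and plain concatenation from the sample keys above, treats truncateToNChars as simple truncation, and omits includeFromNChars and the modifiers.

```python
GROUPS = ["HomeAddress", "BillingAddress"]

# Sample data from the example, reduced to the elements used by the blocking key.
records = {
    1: {"surname": "Smith",
        "HomeAddress": {"number": "1", "locality": "London"},
        "BillingAddress": {"number": "48", "locality": "Brighton"}},
    2: {"surname": "Smith",
        "HomeAddress": {"number": "12", "locality": "London"},
        "BillingAddress": {"number": "1", "locality": "London"}},
}


def blocking_keys(record):
    """Build one key per element group: surname initial + street number + locality."""
    initial = record["surname"][:1].upper()
    keys = []
    for group in GROUPS:
        address = record[group]
        number = address["number"][:5]               # truncateToNChars: 5
        locality = address["locality"].upper()[:30]  # truncateToNChars: 30
        keys.append(initial + number + locality)
    return keys


keys_1 = blocking_keys(records[1])
keys_2 = blocking_keys(records[2])
shared = set(keys_1) & set(keys_2)

print(keys_1)  # ['S1LONDON', 'S48BRIGHTON']
print(keys_2)  # ['S12LONDON', 'S1LONDON']
print(shared)  # {'S1LONDON'}
```

Any shared key is enough to make the two records a candidate pair.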

Rules

The rule snippet below is from the default GBR individual ruleset that has been amended to include cross-field support as an example:

Match.Exact={Name.Exact & #[HomeAddress,BillingAddress].Address.Exact}

This will now cause the following rule evaluations to be performed:

Fields A		Fields B		Result
Record1_Name + Record1_HomeAddress	vs.	Record2_Name + Record2_HomeAddress	⇒	FALSE
Record1_Name + Record1_BillingAddress	vs.	Record2_Name + Record2_BillingAddress	⇒	FALSE
Record1_Name + Record1_HomeAddress	vs.	Record2_Name + Record2_BillingAddress	⇒	TRUE
Record1_Name + Record1_BillingAddress	vs.	Record2_Name + Record2_HomeAddress	⇒	FALSE

The third case will be true since the home address of record 1 matches the billing address of record 2 as an exact match.
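
The four evaluations in the table can be reproduced with a Python sketch that compares every group of record 1 against every group of record 2. Plain equality stands in for the Name.Exact and Address.Exact rules, which is a simplification, and all names in the code are hypothetical.

```python
from itertools import product

GROUPS = ["HomeAddress", "BillingAddress"]

record_1 = {
    "name": "John Smith",
    "HomeAddress": ("1 High Street", "London", "SW4 0QL"),
    "BillingAddress": ("48 Webber Road", "Brighton", "BN3 1EJ"),
}
record_2 = {
    "name": "John Smith",
    "HomeAddress": ("12 Acacia Avenue", "London", "E1W 2BB"),
    "BillingAddress": ("1 High Street", "London", "SW4 0QL"),
}


def evaluate_cross_field(first, second, groups):
    """Evaluate name and address equality for every pairing of groups."""
    results = {}
    for group_1, group_2 in product(groups, repeat=2):
        name_match = first["name"] == second["name"]
        address_match = first[group_1] == second[group_2]
        results[(group_1, group_2)] = name_match and address_match
    return results


results = evaluate_cross_field(record_1, record_2, GROUPS)
for (group_1, group_2), matched in results.items():
    print(f"Record1 {group_1} vs. Record2 {group_2}: {matched}")

is_match = any(results.values())
```

Only the HomeAddress vs. BillingAddress pairing evaluates to true, which is sufficient for the records to match.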
