Advanced configuration

Overview

Data Studio provides default blocking keys and rulesets that you can use to run the Find duplicates step. However, you can create your own custom rules and blocking keys that are tailored to your needs.

Blocking keys and rulesets are designed and specified using elements - a representation of your input data after it has been mapped in the Find duplicates step and gone through the initial standardization process. Find out about elements.

The standardization process also creates additional versions of these elements, known as modifiers. Modifiers can correct, enhance or derive many known terms that appear in the input. For example, a DERIVED modifier may be created when the element was not contained in the input but the standardization process was able to determine its value (e.g. COUNTY can sometimes be derived from the LOCALITY and POSTCODE input). Find out about modifiers.

Elements can also be put into specific groups to separate them from other elements of the same type. This is especially important when you want to create blocking keys or rules that treat them as separate entities, such as cross-field matching.

These elements, modifiers and groups (along with blocking key algorithms and rule comparators) can then be combined to make your own blocking keys and rules specific to your design and data.

The table below covers the available elements that can be used when designing blocking keys and rules, together with the possible modifiers and rule comparators that can be used with them. Note that blocking key algorithms are not listed here because they apply to all elements.

Element | Description | Example | Comparators | Modifiers
Title | Title | Mrs | ExactString | Default, StandardSpelling, StandardAbbreviation
Forenames | Given name(s) and any initials | John | ExactString, ForenameCompare, Levenshtein, JaroWinkler, DoubleMetaphone, Soundex, NYSIIS | Default, RootName
Surname_Prefix | Surname prefix | De la | ExactString, Levenshtein, JaroWinkler, DoubleMetaphone, Soundex, NYSIIS | Default, StandardSpelling, StandardAbbreviation
Surname | Surname with prefix | Smith | ExactString, Levenshtein, JaroWinkler, DoubleMetaphone, Soundex, NYSIIS | Default, RootName
Full_Name | Concatenation of forenames and surname (including prefix if present) | John O'Connor | ExactString, TransposedNameCompare, Levenshtein, JaroWinkler, DoubleMetaphone, Soundex, NYSIIS | Default, RootName
Surname_Suffix | Surname suffixes | Junior | ExactString, Levenshtein, JaroWinkler, DoubleMetaphone, Soundex, NYSIIS | Default, StandardSpelling, StandardAbbreviation
Gender | Gender | Female | ExactString, Levenshtein, JaroWinkler, DoubleMetaphone, Soundex, NYSIIS | Default
Honorifics | Honorifics | Ph.D | ExactString, Levenshtein, JaroWinkler | Default, StandardSpelling, StandardAbbreviation
Company | Organization name | Experian Ltd | ExactString, Levenshtein, JaroWinkler, DoubleMetaphone, Soundex, NYSIIS | Default
Building_Description | Building name and type | George West House | ExactString, Levenshtein, JaroWinkler, DoubleMetaphone, Soundex, NYSIIS | Default
Building_Number | Building number | 43 | ExactString, PremiseCompare | Default
SubBuilding_Number | Sub-building number | 2 | ExactString, PremiseCompare | Default
SubBuilding_Description | Sub-building name | First-floor | ExactString, Levenshtein, JaroWinkler, DoubleMetaphone, Soundex, NYSIIS | Default
SubBuilding_Type | Sub-building type | Flat | ExactString, Levenshtein, JaroWinkler, DoubleMetaphone, Soundex, NYSIIS | Default, StandardSpelling, StandardAbbreviation
MinorStreet_Number | Street number | 34th | ExactString, PremiseCompare | Default
MinorStreet_Predirectional | Street pre-directional | South | ExactString, Levenshtein, JaroWinkler, DoubleMetaphone, Soundex, NYSIIS | Default, StandardSpelling, StandardAbbreviation
MinorStreet_Description | Street name | Carnaby | ExactString, Levenshtein, JaroWinkler, DoubleMetaphone, Soundex, NYSIIS | Default
MinorStreet_Type | Street descriptor | Street | ExactString, Levenshtein, JaroWinkler, NumericCompare, DoubleMetaphone, Soundex, NYSIIS | Default, StandardSpelling, StandardAbbreviation
MinorStreet_Postdirectional | Street post-directional | South | ExactString, Levenshtein, JaroWinkler, DoubleMetaphone, Soundex, NYSIIS | Default, StandardSpelling, StandardAbbreviation
PoBox_Number | PO box number | 79 | ExactString, Levenshtein, JaroWinkler | Default
PoBox_Description | PO box description | PO Box | ExactString, Levenshtein, JaroWinkler, DoubleMetaphone, Soundex, NYSIIS | Default, StandardSpelling, StandardAbbreviation
DoubleDependentLocality | A small locality such as a village, used to identify an address where a street appears more than once in a dependent locality | Kingston Gorse | ExactString, Levenshtein, JaroWinkler, DoubleMetaphone, Soundex, NYSIIS | Default/Derived, StandardSpelling
DependentLocality | Smaller locality used to identify an address where a street appears more than once in a locality | East Preston | ExactString, Levenshtein, JaroWinkler, DoubleMetaphone, Soundex, NYSIIS | Default/Derived, StandardSpelling
Locality | A larger locality, such as a town or a city | Cambridge | ExactString, Levenshtein, JaroWinkler, DoubleMetaphone, Soundex, NYSIIS | Default/Derived, StandardSpelling
Province | A larger area of a country, contains multiple localities | Cambridgeshire | ExactString, Levenshtein, JaroWinkler, DoubleMetaphone, Soundex, NYSIIS | Default/Derived, StandardSpelling, StandardAbbreviation
Country | Country name | United Kingdom | ExactString, Levenshtein, JaroWinkler, DoubleMetaphone, Soundex, NYSIIS | Default
Postcode | Postal code or ZIP code | 'SW4 0QL' or '20521 9000' | ExactString, PostcodeCompare | Default/Derived, StandardSpelling
Generic_String | Generic string | ab-1234cdef | ExactString, Levenshtein, JaroWinkler, NumericCompare, DoubleMetaphone, Soundex, NYSIIS | Default
Date | ISO date in the format YYYY-MM-DD | 1980-06-21 | ExactString, DateCompare | Default
Phone | Phone number | (01234) 567890 | ExactString, Levenshtein, JaroWinkler | Default
Email | Email address | john.smith@domain.com | ExactString, Levenshtein, JaroWinkler, DoubleMetaphone, Soundex, NYSIIS | Default
Email_Local | Local part of email address | john.smith | ExactString, Levenshtein, JaroWinkler, DoubleMetaphone, Soundex, NYSIIS | Default
Email_Domain | Email domain | domain.com | ExactString, Levenshtein, JaroWinkler, DoubleMetaphone, Soundex, NYSIIS | Default
Hash | Hash value of all normalised input fields within a group | 8bf14557574b8793aae648fc1b0280c3 | ExactString | Default

The table below covers the available element modifiers that can be used. See the elements table to find out which modifiers can be used with which elements.

Modifier | Operation | Example
(Default) | The element classified from the input in a cleaned form, normalised to remove diacritics and converted to upper case. | Supplied: 123 High Road; MINORSTREET_TYPE -> ROAD
StandardSpelling | The element converted to a standard spelling (contains Derived value when available). | Supplied: 123 High Road; MINORSTREET_TYPE.STANDARDSPELLING -> ROAD
StandardAbbreviation | The element converted to the standard abbreviation. | Supplied: 123 High Road; MINORSTREET_TYPE.STANDARDABBREVIATION -> RD
Derived | A derived value that was inferred from other information in the input address. | Supplied: 123 High Road, London, E1 2EZ; PROVINCE.DERIVED -> GREATER LONDON
RootName | The root name of the input name. | Supplied name: Alex; FORENAMES.ROOTNAME -> ALEXANDER

Any element can be used multiple times in an input to represent separate or unique sets of information. For example, the input may include multiple:

  • street numbers, street names and postcodes (e.g. a delivery and a billing address)
  • forenames and surnames (e.g. a primary account holder and a spouse)
  • generic strings (e.g. an account and a customer reference)
  • dates (e.g. a date of birth and a registration date)
  • phone numbers (e.g. a cell phone number and a landline number)
  • email addresses (e.g. a personal and a work email)

Using groups in blocking keys

You may want to create different blocking key specifications for different groups of the same element type, or use the same blocking key specification for multiple groups of the same element type. To do this, you need to add elementGroups to your element specification within your blocking key design. Find out more about elementGroups.

For example, if you have a billing and a delivery address and you want to have two different blocking key specifications for them that have different designs, you would add the following to each address element of the group within your blocking key designs:

"elementGroups":["BillingAddress"]
"elementGroups":["DeliveryAddress"]

Alternatively, if you wanted to use the same blocking key specification for both groups, you would add the following to each address element of the group within that single blocking key design:

"elementGroups":[
    "BillingAddress",
    "DeliveryAddress"
]

The blocking key specification examples below are taken from the default blocking keys available in Data Studio, and have been amended to show how they would be used for the two cases above.

Two differently designed blocking key specifications for two different groups of the same element type:

{
    "description": "SurnameMinorStreetNumberLocality",
    "countryCode": "GBR",
    "elementSpecifications": [
      {
        "elementType": "SURNAME",
        "algorithm": {
          "name": "INITIAL"
        },
        "includeFromNChars": 1
      },
      {
        "elementType": "MINORSTREET_NUMBER",
        "elementGroups": ["BillingAddress"],
        "includeFromNChars": 1,
        "truncateToNChars": 5
      },
      {
        "elementType": "LOCALITY",
        "elementGroups": ["BillingAddress"],
        "elementModifiers": [
          "STANDARDSPELLING",
          "DERIVED"
        ],
        "includeFromNChars": 2,
        "truncateToNChars": 30
      }
   ]
}
{
    "description": "POBoxNumberLocality",
    "countryCode": "GBR",
    "elementSpecifications": [
      {
        "elementType": "POBOX_NUMBER",
        "elementGroups": ["DeliveryAddress"],
        "includeFromNChars": 1
      },
      {
        "elementType": "LOCALITY",
        "elementGroups": ["DeliveryAddress"],
        "elementModifiers": [
          "STANDARDSPELLING",
          "DERIVED"
        ],
        "includeFromNChars": 2,
        "truncateToNChars": 30
      }
   ]
}

The same blocking key specification for two different groups of the same element type:

{
    "description": "FullPostcode",
    "countryCode": "GBR",
    "elementSpecifications": [
      {
        "elementType": "POSTCODE",
        "elementGroups": ["BillingAddress","DeliveryAddress"],
        "elementModifiers": [
          "STANDARDSPELLING"
        ],
        "includeFromNChars": 5,
        "truncateToNChars": 7
      }
   ]
}

Using groups in rules

You may want to handle separate groups of the same element type (such as those listed above) separately during rule evaluation. To do this, prefix the rule with a hash symbol, followed by the group name in square brackets, followed by a period, followed by the rest of the rule.

For example, if you have a billing address and a delivery address and you want to evaluate them separately, you could write rules such as the ones below.

Match.L0 = {Name.L0 & #[Billing].Address.L0 & #[Delivery].Address.L0}
Match.L1 = {Name.L1 & (#[Delivery].Address.L1 | #[Billing].Address.L1)}

By default, this will mean that the two different groups will use the same underlying rules (in the case above, the same address rules). To use different rules for two groups that are of the same element type, you can use theme rules that are then utilized at the top level.

Match.L0={#[Billing].LooseAddressRule.L0 & #[Delivery].StrictAddressRule.L0}

LooseAddressRule.L0={Postcode.L0}
StrictAddressRule.L0={MinorStreet_Number.L0 & Postcode.L0}

MinorStreet_Number.L0={PremiseCompare[ExactMatch]}
Postcode.L0={PostcodeCompare[Part1Match] & PostcodeCompare[Part2Match]}

In the above example, two different theme rules have been used to evaluate an address (one being stricter than the other since it requires both the premises number AND postcode to be compared), but each is only used with one group as defined at the match level.

You can also specify more than one group in any rule, such as the example below. This will perform cross-field matching.

Match.L0 = {Name.L0 & #[Billing,Delivery].Address.L0}

Blocking keys

The Find duplicates step uses blocking keys to create blocks of similar records to assist with the generation of suitable candidate record pairs for scoring via a ruleset.

Blocks are created from records that have the same blocking key values. Blocking keys are created for each input record from combinations of the record's elements that have been keyed. Keying is the process of encoding individual elements to the same representation so that they can be matched despite minor differences in spelling. To be effective, blocking keys should represent a range of contact data sub-element combinations.

Data Studio provides default blocking keys tuned for name and address matching for the Find duplicates step. However, you can modify these or create your own to suit your needs.

Structure

All blocking keys have to be built in the structure below.

Element Description
description (optional) A descriptive name for the blocking key, for example "MyBlockingKey".
countryCode (optional) An informational code that can be used to indicate the country that the blocking key is intended for (the value does not affect processing in any way). If you need blocking keys to be generated conditionally, see the validCountries element.
validCountries (optional) An array of ISO3 country codes for which the blocking key is valid. If this element is present and the standardized record has a country code, then it must match one of the country codes in the array - otherwise the blocking key won't be generated. For example:
"validCountries":["GBR","IRL"]
elementSpecifications An array of elements that are to be used in the key, created as part of the initial standardization process.

Blocking keys are created using the list of elements in order.
For example, the key FORENAMES+SURNAME+MINORSTREET_NUMBER would be created from the array below:
"elementSpecifications":[
{ "elementType":"FORENAMES" },
{ "elementType":"SURNAME" },
{ "elementType":"MINORSTREET_NUMBER" }
]
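
The validCountries check described above can be sketched as follows. This is a minimal illustration in Python, not Data Studio's implementation: the function name key_applies_to_record is hypothetical, and treating an empty array the same as a missing validCountries element is an assumption.

```python
from typing import Optional


def key_applies_to_record(blocking_key: dict, record_country: Optional[str]) -> bool:
    """Return True if the blocking key should be generated for the record.

    countryCode is informational only, so it is deliberately ignored here.
    """
    valid_countries = blocking_key.get("validCountries")
    # No validCountries element (assumption: or an empty one) means no restriction.
    if not valid_countries:
        return True
    # The check only applies when the standardized record has a country code.
    if not record_country:
        return True
    return record_country in valid_countries


example_key = {
    "description": "MyBlockingKey",
    "countryCode": "GBR",
    "validCountries": ["GBR", "IRL"],
    "elementSpecifications": [
        {"elementType": "FORENAMES"},
        {"elementType": "SURNAME"},
        {"elementType": "MINORSTREET_NUMBER"},
    ],
}

print(key_applies_to_record(example_key, "IRL"))  # True
print(key_applies_to_record(example_key, "USA"))  # False
```

Note that a key with only countryCode set is never filtered by this check, which mirrors the statement that countryCode does not affect processing.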

Each element within elementSpecifications has to then be built in the structure below:

Element Description
elementType The element to use in the blocking key. See the list of available elements and examples for each.
elementGroups (optional) The group(s) that the element belongs to. If this isn't supplied, the default group will be used.

This is required when you have more than one of the same element type (e.g. multiple phone numbers) that you want to treat separately. Find out about groups.

The example below specifies multiple groups for a single element type:
"elementGroups":[
"HomeAddress",
"WorkAddress"
]
elementModifiers (optional) An array of element modifiers to use. This list is processed in order and the first populated element will be used. If no modifier is specified/found, the default value of the element will be used. See the list of available modifiers, which elements they apply to and examples for each.

Element modifiers provide a more standardized and uniform version of the element. This helps to block together records whose string representations differ but which are still likely to represent the same thing.

The example below will use the standard spelling form of the element if available, otherwise the derived form. If neither are found, the default value for the element will be used.
"elementModifiers":[
"STANDARDSPELLING",
"DERIVED"
]
algorithm (optional) Keying algorithm used to key the element. These are useful for blocking as they can cause similar elements to be blocked together even if their original form is not identical. For example, the SOUNDEX algorithm can be used to block similar sounding values (since they will have the same SOUNDEX key) even if they do not look the same.

Defined algorithms have to have a name and can also have an optional set of additional properties. For example, the SUBSTRING algorithms require start, end and/or length.
"algorithm": {
"name": "MIDDLE_SUBSTRING",
"properties": {
"start": 2,
"end": 6
}
}
See the list of available algorithms (and any associated properties) and examples for each.
includeFromNChars (optional) Only include the keyed element in the blocking key if it is N or more characters in length.

If specified, a blocking key will not be created when the configured element does not meet this criterion.

For example, if this is set to 5 on the POSTCODE element, any postcode shorter than 5 characters will not be included in the blocking key, so the entire blocking key will not be created for that record.
truncateToNChars (optional) Truncate the keyed element to N characters in length.

For example, if this is set to 10 on the SURNAME element, any surname longer than 10 characters will be included in the blocking key but truncated after the 10th character.
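
The elementModifiers fallback described above (the list is processed in order, the first populated value is used, and the default value is used otherwise) can be sketched as follows. This is an illustration only: representing a standardized element as a Python dictionary keyed by modifier name, and the function name select_element_value, are assumptions made for the example.

```python
from typing import Dict, List, Optional


def select_element_value(element: Dict[str, Optional[str]],
                         modifiers: List[str]) -> Optional[str]:
    """Pick the value used for keying, trying each modifier in order."""
    for modifier in modifiers:
        value = element.get(modifier)
        if value:  # first populated modifier wins
            return value
    # No modifier specified or none populated: fall back to the default value.
    return element.get("DEFAULT")


# Hypothetical standardized LOCALITY element with no standard spelling available.
locality = {"DEFAULT": None, "STANDARDSPELLING": None, "DERIVED": "LONDON"}
print(select_element_value(locality, ["STANDARDSPELLING", "DERIVED"]))  # LONDON
```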

Algorithms

The algorithms below are available for use in blocking key specifications, and apply to all non-numeric elements. If no algorithm is specified, the SIMPLIFIED_STRING algorithm is used.

Name | Description | Keyed example | Additional properties
NO_CHANGE | No modification performed, and whitespace retained | ANDREW J -> ANDREW J | None
SIMPLIFIED_STRING | Whitespace removed | ANDREW J -> ANDREWJ | None
DOUBLE_METAPHONE | Double metaphone algorithm on all words | ANDREW J -> ANTR | None
DOUBLE_METAPHONE_FIRST_WORD | Double metaphone algorithm on the first word only | ANDREW J -> ANTR | None
NYSIIS | NYSIIS algorithm | ANDREW J -> ANDRAJ | None
SOUNDEX | SOUNDEX algorithm | ANDREW J -> A536 | None
CONSONANT | Only include consonants | ANDREW J -> NDRWJ | None
INITIAL | Initial character only | ANDREW J -> A | None
START_SUBSTRING | Substring from the start of the value | ANDREW J ("length":3) -> AND | "length": <integer>
MIDDLE_SUBSTRING | Substring from defined start position to defined end position | ANDREW J ("start":2, "end":5) -> NDRE | "start": <integer>, "end": <integer>
END_SUBSTRING | Substring from the end of the value | ANDREW J ("length":3) -> W J | "length": <integer>
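
To make the keyed examples in the table concrete, the simpler algorithms can be approximated in a few lines of Python. This is a sketch under stated assumptions, not Data Studio's implementation: the function names are hypothetical, Y is treated as a consonant, substring positions are taken to be 1-based and inclusive (inferred from the MIDDLE_SUBSTRING example), and soundex follows the standard American Soundex rules. DOUBLE_METAPHONE and NYSIIS are omitted.

```python
# Standard Soundex digit groups.
SOUNDEX_CODES = {
    **dict.fromkeys("BFPV", "1"),
    **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"),
    "L": "4",
    **dict.fromkeys("MN", "5"),
    "R": "6",
}


def simplified_string(value: str) -> str:
    """SIMPLIFIED_STRING: remove all whitespace."""
    return "".join(value.split())


def consonant(value: str) -> str:
    """CONSONANT: keep letters that are not vowels (assumption: Y is kept)."""
    return "".join(c for c in value if c.isalpha() and c.upper() not in "AEIOU")


def initial(value: str) -> str:
    """INITIAL: first character only."""
    return value[:1]


def start_substring(value: str, length: int) -> str:
    return value[:length]


def middle_substring(value: str, start: int, end: int) -> str:
    """Assumption: start and end are 1-based and inclusive."""
    return value[start - 1:end]


def end_substring(value: str, length: int) -> str:
    return value[-length:] if length > 0 else ""


def soundex(value: str) -> str:
    """Standard American Soundex over the letters of the value."""
    letters = [c for c in value.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    previous = SOUNDEX_CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "HW":  # H and W do not separate letters with the same code
            continue
        code = SOUNDEX_CODES.get(c, "")
        if code and code != previous:
            result += code
        previous = code
    return (result + "000")[:4]


print(simplified_string("ANDREW J"))       # ANDREWJ
print(middle_substring("ANDREW J", 2, 5))  # NDRE
print(soundex("ANDREW J"))                 # A536
```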

CONSONANT and SOUNDEX support the following character sets:

  • Basic Latin (ASCII)
  • Latin-1 Supplement
  • Latin Extended-A
  • Latin Extended Additional

All other key types support the following Latin character sets:

  • Basic Latin (ASCII)
  • Latin-1 Supplement
  • Latin Extended-A
  • Latin Extended-B
  • Latin Extended-C
  • Latin Extended-D
  • Latin Extended-E
  • Latin Extended Additional
  • IPA Extensions
  • Phonetic Extensions
  • Phonetic Extensions Supplement

Blocking key examples

Let's use the default blocking keys in Data Studio for more advanced examples of blocking key design:

{
   "description": "FullZipCode",
   "countryCode": "USA",
   "elementSpecifications": [
     {
       "elementType": "POSTCODE",
       "elementModifiers": [
         "STANDARDSPELLING"
       ],
       "includeFromNChars": 9
     }
   ]
}

This first blocking key specification consists of just the POSTCODE element, using the STANDARDSPELLING element modifier, and only creates keys that are at least 9 characters long.

This will result in the generation of candidate score pairs for every record in the data with the same zip and +4 code.

The sample input USA zip codes below would therefore create the following keys with this blocking key specification:

  • 90210-8560 ⇒ 902108560 - complete example, with the correct number of characters
  • 90210-08560 ⇒ 902100856 - truncated example, with the final 0 cut off
  • 90210 ⇒ <null> - no blocking key generated due to min. length of 9 being unsatisfied

This second blocking key specification below is a little more advanced - it combines a few elements together and uses various modifiers and algorithms, as well as different character number restrictions.

{
    "description": "ForenameSurnameMinorStreetNumber",
    "countryCode": "GBR",
    "elementSpecifications": [
      {
        "elementType": "FORENAMES",
        "elementModifiers": [
          "ROOTNAME"
        ],
        "algorithm": {
          "name": "DOUBLE_METAPHONE_FIRST_WORD"
        },
        "includeFromNChars": 1,
        "truncateToNChars": 10
      },
      {
        "elementType": "SURNAME",
        "algorithm": {
          "name": "DOUBLE_METAPHONE"
        },
        "includeFromNChars": 1,
        "truncateToNChars": 10
      },
      {
        "elementType": "MINORSTREET_NUMBER",
        "includeFromNChars": 1,
        "truncateToNChars": 5
      }
   ]
}

This will result in the generation of candidate score pairs for every record in the data with the same forename root name double metaphone value, surname double metaphone value, and premises number.

  • The first element defined is FORENAMES, which uses:
    • the ROOTNAME element modifier
    • the DOUBLE_METAPHONE_FIRST_WORD algorithm on that root name
    • key restrictions: the keyed forename must be at least 1 character long, and is truncated after the 10th character
  • The second element defined is SURNAME, which has no element modifier and uses:
    • the DOUBLE_METAPHONE algorithm on the surname
    • key restrictions: the keyed surname must be at least 1 character long, and is truncated after the 10th character
  • The third (and final) element defined is MINORSTREET_NUMBER, which uses:
    • no element modifiers or element algorithms
    • key restrictions: the number must be at least 1 digit long, and is truncated after the 5th character

The sample name and addresses below show what those blocking keys would look like:

MRS VAL JONES, 45 MAIN ROAD, LONDON, E1 2AS
  • FORENAMES = VAL
    • Modifier: ROOTNAME of VAL = VALERIE
    • Algorithm: DOUBLE_METAPHONE_FIRST_WORD of VALERIE = FLR
    • Length restriction: FLR has 3 characters, which is at least 1 and no more than 10, so the restriction is satisfied
⇒ FLR
  • SURNAME = JONES
    • Modifier: <none>
    • Algorithm: DOUBLE_METAPHONE of JONES = JNS
    • Length restriction: JNS has 3 characters, which is at least 1 and no more than 10, so the restriction is satisfied
⇒ JNS
  • MINORSTREET_NUMBER = 45
    • Modifier: <none>
    • Algorithm: <none>
    • Length restriction: 45 has 2 characters, which is at least 1 and no more than 5, so the restriction is satisfied
⇒ 45

Therefore, the final key generated for this record is: FLRJNS45.

MR JOHNNY ANDERSON-THOMPSON, 123456 HIGH STREET, BRIGHTON, BN1 3SX
  • FORENAMES = JOHNNY
    • Modifier: ROOTNAME of JOHNNY = JOHNATHON
    • Algorithm: DOUBLE_METAPHONE_FIRST_WORD of JOHNATHON = JN0N
    • Length restriction: JN0N has 4 characters, which is at least 1 and no more than 10, so the restriction is satisfied
⇒ JN0N
  • SURNAME = ANDERSON-THOMPSON
    • Modifier: <none>
    • Algorithm: DOUBLE_METAPHONE of ANDERSON-THOMPSON = ANTR
    • Length restriction: ANTR has 4 characters, which is at least 1 and no more than 10, so the restriction is satisfied
⇒ ANTR
  • MINORSTREET_NUMBER = 123456
    • Modifier: <none>
    • Algorithm: <none>
    • Length restriction: 123456 has 6 characters, which is at least 1 but more than 5, so it is truncated after the 5th character
⇒ 12345

Therefore, the final key generated for this record is: JN0NANTR12345.

PAUL SMITH, HANCOCK BUILDING, GRACE AVENUE, NOTTINGHAM
  • FORENAMES = PAUL
    • Modifier: ROOTNAME of PAUL = PAUL
    • Algorithm: DOUBLE_METAPHONE_FIRST_WORD of PAUL = PL
    • Length restriction: PL has 2 characters, which is at least 1 and no more than 10, so the restriction is satisfied
⇒ PL
  • SURNAME = SMITH
    • Modifier: <none>
    • Algorithm: DOUBLE_METAPHONE of SMITH = SM0
    • Length restriction: SM0 has 3 characters, which is at least 1 and no more than 10, so the restriction is satisfied
⇒ SM0
  • MINORSTREET_NUMBER = null
    • Modifier: <none>
    • Algorithm: <none>
    • Length restriction: the element is null (length 0), which is less than the minimum of 1, so the key restriction is NOT satisfied
⇒ <null>

Because not every element in the blocking key specification has been satisfied in this case (the premises number is missing from the input), no blocking key is generated for this record from this blocking key specification.
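
The final assembly step in the three worked examples above can be sketched in Python. This is an illustration rather than the product's implementation: it takes the already keyed values (such as FLR or JN0N) as given, since the modifier and double metaphone steps are not reproduced here, and the names KeyedElement and build_blocking_key are hypothetical.

```python
from typing import List, NamedTuple, Optional


class KeyedElement(NamedTuple):
    value: Optional[str]                 # keyed value, or None if the element is missing
    include_from: Optional[int] = None   # includeFromNChars
    truncate_to: Optional[int] = None    # truncateToNChars


def build_blocking_key(elements: List[KeyedElement]) -> Optional[str]:
    """Concatenate keyed elements in order, or return None if any fails its minimum length."""
    parts = []
    for element in elements:
        value = element.value or ""
        # includeFromNChars: the whole key is abandoned if the element is too short.
        if element.include_from is not None and len(value) < element.include_from:
            return None
        # truncateToNChars: cut the keyed value after N characters.
        if element.truncate_to is not None:
            value = value[:element.truncate_to]
        parts.append(value)
    return "".join(parts)


val_jones = [KeyedElement("FLR", 1, 10), KeyedElement("JNS", 1, 10), KeyedElement("45", 1, 5)]
johnny = [KeyedElement("JN0N", 1, 10), KeyedElement("ANTR", 1, 10), KeyedElement("123456", 1, 5)]
paul_smith = [KeyedElement("PL", 1, 10), KeyedElement("SM0", 1, 10), KeyedElement(None, 1, 5)]

print(build_blocking_key(val_jones))   # FLRJNS45
print(build_blocking_key(johnny))      # JN0NANTR12345
print(build_blocking_key(paul_smith))  # None
```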

Rules

Whilst blocking keys are used to generate suitable candidate pairs of input records, rules are then used to score those pairs and cluster together any that are considered a good enough match.

You can modify the default rulesets or even create your own. Before you do though, we recommend that you review the concepts below:

  • Rules take the following form: <rule reference>=<expression>
  • A rule reference consists of a rule name followed by a "." and followed by a match level (e.g. MyRule.L0)
  • An expression may take multiple forms:
    • Low-level expression - operates on the elements within a record.
    • Higher-level expression - composed of references to other rules.
  • A ruleset consists of a combination of three rule types which increase in specificity from match, to theme, to element, as illustrated in the diagram below.
    • Match and theme rules are comprised of references to rules from the level below.
    • Element rules are comprised of rules that are set on specific data elements, such as a postcode or a building number. These element rules can use special comparators depending on the element type.

[Diagram (rulestree.png): the ruleset hierarchy, from match rules, to theme rules, to element rules]

Syntax

All rules have a rule reference on the left-hand side.

A rule reference takes the following format: <rule name>.<match level>

The rule name may be any of the following:

  • "Match" (match rule).
  • A custom identifier (theme rule). Must begin with an alphabetical character and may contain alphanumeric characters, underscores, and hyphens.
  • Element (single element rule).

The match level can be: L0, L1, L2, or L3. Note that you can override these values by using aliases.
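
The rule reference format above can be checked with a small parser. The sketch below is an assumption-laden illustration in Python: the function name parse_rule_reference is hypothetical, the naming rule for custom identifiers is applied to every rule name, and neither aliases nor group prefixes are handled.

```python
import re
from typing import Tuple

# <rule name>.<match level>: the name starts with a letter and may contain
# letters, digits, underscores and hyphens; the level is L0 to L3.
RULE_REFERENCE = re.compile(r"^(?P<name>[A-Za-z][A-Za-z0-9_-]*)\.(?P<level>L[0-3])$")


def parse_rule_reference(reference: str) -> Tuple[str, str]:
    """Split a rule reference into (rule name, match level) or raise ValueError."""
    match = RULE_REFERENCE.match(reference.strip())
    if match is None:
        raise ValueError(f"Not a valid rule reference: {reference!r}")
    return match.group("name"), match.group("level")


print(parse_rule_reference("MyRule.L0"))              # ('MyRule', 'L0')
print(parse_rule_reference("MinorStreet_Number.L2"))  # ('MinorStreet_Number', 'L2')
```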

The right-hand side of a rule:

  • Is always surrounded by curly brackets: { }
  • May contain logical operators: & and |

Expressions may be nested and logical operators combined (parentheses are required), e.g. MyRule.L2 = {((RuleA.L3 & RuleB.L2) | (RuleA.L2 & RuleB.L3)) | RuleC.L0}

Element rules include the element and the allowed result set (enclosed in [ ] and comma-separated) and may also include an optional element modifier and/or comparator.

Any theme or element rule may also optionally include a group from the input mappings, which is defined by using a hash symbol before the group name. For example:
#MyFieldGroup.PostcodeTheme.L0 = {Postcode[ExactMatch]}.

Find out about groups.

Default country

You can also specify a default country in the rules file. This lets the Find duplicates step know what country to use when processing and standardizing data if:

  • you're not using a country tag when mapping your data columns or
  • the country field for a record is blank.

To specify a default country, add @default.country=<countryiso> to the top of the rules file (where <countryiso> is the ISO 3166-1 alpha-3 country code). For example: to set Australia as the default country, add: @default.country=AUS.
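
If you generate or check rules files programmatically, the default country directive can be read with a short helper. This is a sketch only: the function name read_default_country is hypothetical, and it assumes the directive sits on its own line with a three-letter upper-case code.

```python
import re
from typing import Optional

DEFAULT_COUNTRY = re.compile(r"^@default\.country=([A-Z]{3})\s*$", re.MULTILINE)


def read_default_country(rules_text: str) -> Optional[str]:
    """Return the ISO 3166-1 alpha-3 code from @default.country, or None if absent."""
    match = DEFAULT_COUNTRY.search(rules_text)
    return match.group(1) if match else None


rules_text = "@default.country=AUS\nMatch.L0={Name.L0 & Address.L0}\n"
print(read_default_country(rules_text))  # AUS
```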

You can use the default rules provided with the Find duplicates step as the basis for customization since they already contain the correct default country value.

Levels and evaluation order

There are 4 match levels that can be used within each rule specification:

Match level Description
L0 Each individual field that makes up the record matches exactly.
L1 Records might have some fields that match exactly, and some fields that are very similar.
L2 Records might have some fields that match exactly, some fields that are very similar, and some fields that differ a little more.
L3 Records contain the majority of fields that have a number of similarities, but do not match exactly.

Note that you can override these level names with your own custom names. To do so, you have to use the expression define <Alias> as <Match level> at the top of the rules file, e.g.

define High as L0.
Match.High={Name.High & Address.High}
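
The effect of an alias can be pictured as a simple substitution of the alias with its match level. The Python sketch below illustrates that idea only. It is not how Data Studio processes rules files, the function name resolve_aliases is hypothetical, and accepting an optional trailing period on the define line is an assumption based on the example above.

```python
import re
from typing import Dict, Tuple

DEFINE = re.compile(r"^define\s+(\w+)\s+as\s+(L[0-3])\.?\s*$")


def resolve_aliases(rules_text: str) -> Tuple[Dict[str, str], str]:
    """Collect 'define <Alias> as <Match level>' lines and rewrite aliased references."""
    aliases: Dict[str, str] = {}
    rule_lines = []
    for line in rules_text.splitlines():
        match = DEFINE.match(line.strip())
        if match:
            aliases[match.group(1)] = match.group(2)
        else:
            rule_lines.append(line)
    resolved = "\n".join(rule_lines)
    for alias, level in aliases.items():
        # Replace ".<Alias>" only where it ends a rule reference.
        resolved = re.sub(r"\.%s\b" % re.escape(alias), "." + level, resolved)
    return aliases, resolved


aliases, resolved = resolve_aliases("define High as L0.\nMatch.High={Name.High & Address.High}")
print(aliases)   # {'High': 'L0'}
print(resolved)  # Match.L0={Name.L0 & Address.L0}
```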

When working with match levels, note that:

  • They will be evaluated in order from L0 through to L3, stopping at the first level that passes.
  • A rule level will evaluate as true if the same rule has evaluated as true for a stricter (lower-numbered) level. For example, if MyRule.L1 is true, then MyRule.L2 and MyRule.L3 will also be true.
  • Records will be considered a match and clustered together if the match rule evaluates to true at any of its defined match levels.
  • Every rule must include a match level as part of the rule reference.

Rules are evaluated as follows:

  1. Match rules first (Match.<Match Level>).
  2. Left to right.
  3. Lazily.
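
The evaluation order above can be illustrated with a small Python sketch. It assumes that "lazily" means short-circuit evaluation (a sub-rule is only evaluated if its result is still needed), and it models rules as plain Python functions. The names first_matching_level and element_rule, and the fields of the record pair, are hypothetical.

```python
from typing import Callable, Dict, List, Optional

LEVELS = ["L0", "L1", "L2", "L3"]
evaluated: List[str] = []  # records which element rules were actually evaluated


def element_rule(field: str) -> Callable[[dict], bool]:
    """Build a stand-in element rule that reads a precomputed result for a record pair."""
    def evaluate(pair: dict) -> bool:
        evaluated.append(field)
        return pair[field]
    return evaluate


def first_matching_level(match_rules: Dict[str, Callable[[dict], bool]],
                         pair: dict) -> Optional[str]:
    """Evaluate match rules from L0 to L3 and stop at the first level that passes."""
    for level in LEVELS:
        rule = match_rules.get(level)
        if rule is not None and rule(pair):
            return level
    return None


name_l0, address_l0 = element_rule("name_exact"), element_rule("address_exact")
name_l1, address_l1 = element_rule("name_similar"), element_rule("address_similar")

# Python's "and" evaluates left to right and skips the right side when the left is False.
match_rules = {
    "L0": lambda pair: name_l0(pair) and address_l0(pair),
    "L1": lambda pair: name_l1(pair) and address_l1(pair),
}

pair = {"name_exact": False, "address_exact": True,
        "name_similar": True, "address_similar": True}
matched_level = first_matching_level(match_rules, pair)
print(matched_level)  # L1
print(evaluated)      # ['name_exact', 'name_similar', 'address_similar']
```

In this run the address rule at L0 is never evaluated because the name rule has already failed, and no level after L1 is tried because L1 passes.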

Types

Three rule types can be defined:

Match rules

This is the highest rule level, defining an overall match between two records. A match rule is made up of references to other rules.

At least one match rule must be defined for a successful matching job.

  • Example: Match.L0={Name.L0 & Address.L0} (Name.L0 and Address.L0 have been defined separately).

You can combine rule references into compound logical expressions. This way, you have complete control over the logic used to determine matches.

  • Example: Match.L1={(Name.L0 & Address.L0) | (Name.L0 & Email.L0 & Phone.L0)}

Theme rules

Theme rules represent the next level down, after match rules.

Similar to a match rule, a theme rule is made up of references to other rules. The theme rule name must begin with an alphabetical character and may contain alphanumeric characters, underscores and hyphens. The rule name cannot contain "Match" or the set of reserved elements.

  • Example: Address.L0={Premise.L0 & Street.L0 & Locality.L0 & Postcode.L0}

The rule references within the expression can either be other theme rules, or low-level element rules.

Element rules

Element rules are the most granular of rules. They can be used to specify how to compare individual elements within a record. Elements are basic units of data that comprise an overall theme. For example, postcode and premise could be elements of an address theme.

Comparators

Rules are designed to evaluate and compare elements using special comparators. The table below covers the available comparators you can use.

If you want to know which comparators are available for which elements, see the elements table.

Comparator Results
ExactString (default comparator)
  • ExactMatch: Strings match exactly e.g. "John Smith" & "John Smith"
  • OnePopulated: The field is populated for one of the records e.g. "John Smith" & ""
  • NonePopulated: The field is not populated for either of the records e.g. "" & ""
  • NoMatch: The strings are both populated but are not an exact match e.g. "John Smith" & "John Doe"
ForenameCompare
  • ExactMatch: Strings match exactly, ignoring hyphens e.g. "Sarah-Jane" & "Sarah Jane"
  • InitialVsFullName: An initial or initials match to the full name e.g. "S J" & "Sarah Jane"
  • FirstNameMatch: The first name of a multiple-name forename matches e.g. "Sarah" & "Sarah Jane"
  • InvertedNameMatch: All the names in a multiple-name forename match but are in a different order e.g. "Jane Sarah" & "Sarah Jane"
  • AnyNameMatch: Any name matches in a multiple-name forename e.g. "Jane" & "Sarah Jane"
Plus all ExactString (default comparator) results.
TransposedNameCompare
  • ExactMatch: The forename(s) and surname match exactly to the transposed version e.g. "John Smith" & "Smith John"
  • PartialMatch: Only one part of the name matches to the transposed version e.g. "John Smith" & "Smith Paul", or "John Smith" & "Jones John"
  • NoMatch: Neither the forename(s) or surname match exactly to the transposed version e.g. "John Smith" & "Jones Paul"
Plus all ExactString (default comparator) results.
PremiseCompare
  • StartMatch: Premise matches the start of a premise range e.g. "12" & "12-15"
  • StartMatchAndEncapsulated: Premise ranges match at the start and one encapsulates the other e.g. "12-15" & "12-16"
  • EndMatch: Premise matches the end of a premise range e.g. "15" & "12-15"
  • EndMatchAndEncapsulated: Premise ranges match at the end and one encapsulates the other e.g. "13-16" & "12-16"
  • Encapsulated: Premise or premise range is encapsulated by the other e.g. "12" & "11-16"
  • Overlapped: Premise ranges overlap each other e.g. "12-15" & "14-18"
  • NumberMatchWithTrailingAlpha: Premise numbers match and one record has a trailing alpha e.g. "12" & "12a"
  • NumberMatchWithDifferingAlpha: Premise numbers are a perfect match but trailing alpha is different e.g. "12a" & "12b"
Plus all ExactString (default comparator) results.
DateCompare
  • DayMonthReversed: Matches dates where the day and month are reversed e.g. "2017-06-03" & "2017-03-06"
  • MonthYearMatch: Matches dates where only the month and year match e.g. "2017-06-03" & "2017-06-04"
  • DayMonthMatch: Matches dates where only the day and month match e.g. "2017-06-03" & "2016-06-03"
  • DayYearMatch: Matches dates where only the day and year match e.g. "2017-06-03" & "2017-07-03"
  • YearMatch: Matches dates where only the year matches e.g. "2017-06-03" & "2017-07-04"
  • <n>DaysDifference: Matches dates that differ by up to n days
  • <n>WeeksDifference: Matches dates that differ by up to n weeks (n * 7 days)
  • <n>MonthsDifference: Matches dates that differ by up to n calendar months
  • <n>MaxCharsDifference: Matches dates where up to n characters can be different (equivalent to levenshtein distance for dates)
Plus all ExactString (default comparator) results.
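To make the DateCompare results more concrete, the sketch below checks a few of them on Python `date` values. It is an approximation under stated assumptions: returning a set of every result that holds, falling back to "NoMatch", and the function names are all choices made for this illustration rather than documented behavior. Calendar-month and character-difference results are left out.

```python
from datetime import date


def date_results(a, b):
    """Return the set of DateCompare result names that hold for two dates."""
    if a == b:
        return {"ExactMatch"}
    same_day = a.day == b.day
    same_month = a.month == b.month
    same_year = a.year == b.year
    found = set()
    if same_year and a.day == b.month and a.month == b.day:
        found.add("DayMonthReversed")
    if same_month and same_year:
        found.add("MonthYearMatch")
    if same_day and same_month:
        found.add("DayMonthMatch")
    if same_day and same_year:
        found.add("DayYearMatch")
    if same_year and not same_day and not same_month:
        found.add("YearMatch")
    return found or {"NoMatch"}


def within_days(a, b, n):
    """<n>DaysDifference: the dates differ by up to n days."""
    return abs((a - b).days) <= n


def within_weeks(a, b, n):
    """<n>WeeksDifference: the dates differ by up to n * 7 days."""
    return within_days(a, b, n * 7)
```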
PostcodeCompare
  • Part1Match: Records match to the first part of the postcode e.g. "HA2 9PP" & "HA2 5QR"
  • Part2Match: Records match to the second part of the postcode e.g. "SM1 9PP" & "HA2 9PP"
  • PostcodeCompatible: The first part of the postcode for both records matches, and the second part is populated in one record only e.g. "HA2 9PP" & "HA2"
Plus all ExactString (default comparator) results.
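The PostcodeCompare results can be sketched as follows. The sketch assumes UK-style postcodes whose two parts are separated by whitespace, and the order in which the results are checked is an assumption; the function names are hypothetical.

```python
def split_postcode(value):
    """Split a postcode into (part1, part2); a missing part becomes ''."""
    parts = value.split()
    part1 = parts[0] if parts else ""
    part2 = parts[1] if len(parts) > 1 else ""
    return part1, part2


def postcode_compare(a, b):
    """Classify two postcodes using PostcodeCompare and ExactString results."""
    if not a.strip() and not b.strip():
        return "NonePopulated"
    if not a.strip() or not b.strip():
        return "OnePopulated"
    a1, a2 = split_postcode(a)
    b1, b2 = split_postcode(b)
    if (a1, a2) == (b1, b2):
        return "ExactMatch"
    if a1 == b1:
        # Second part present on one side only: compatible rather than different.
        if bool(a2) != bool(b2):
            return "PostcodeCompatible"
        return "Part1Match"
    if a2 and a2 == b2:
        return "Part2Match"
    return "NoMatch"


print(postcode_compare("HA2 9PP", "HA2"))  # PostcodeCompatible
```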
Levenshtein
    Depending on the specified comparison type, either:
  • <Minimum %>: The minimum Levenshtein percentage to provide a match (integer between 0-100) e.g. setting the LevenshteinPercent result to 90 would return a match for "John Smith" & "Joan Smith"
Or
  • <Maximum distance>: The maximum Levenshtein distance to provide a match (integer). e.g. setting the LevenshteinDistance result to 1 would return a match for "John Smith" & "Joan Smith"
Plus all ExactString (default comparator) results.
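The two Levenshtein comparison types can be illustrated with a standard edit-distance calculation. The distance function below is the usual dynamic-programming algorithm; the percentage formula, which normalizes by the length of the longer string, is an assumption and may not be the exact formula the product uses.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def levenshtein_percent(a, b):
    """Similarity as a percentage of the longer string (assumed formula)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (longest - levenshtein(a, b)) / longest


print(levenshtein("John Smith", "Joan Smith"))          # 1
print(levenshtein_percent("John Smith", "Joan Smith"))  # 90.0
```

Under this assumption, a single substitution in a ten-character string satisfies both a maximum distance of 1 and a minimum percentage of 90.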
JaroWinkler
  • <Minimum %>: The minimum Jaro-Winkler distance percentage to provide a match (integer between 0-100). e.g. setting the JaroWinkler result to 95 would return a match for "John Smith" & "Joan Smith"
Plus all ExactString (default comparator) results.
NumericCompare
  • ExactMatch: The numeric part of the strings match exactly e.g. "5th" & "5th"
  • OnePopulated: Only one of the records contains a numeric value. The other record is either blank or does not contain a numeric value e.g. "5th" & "", or "5th" & "fifth"
  • NonePopulated: Both records are either blank or do not contain numeric values e.g. "fifth" & "fifteenth"
  • NoMatch: Both values are numeric but do not match e.g. "5th" & "15th"
  • <n>: Both values are numeric and adding n to the lower value produces a value greater than or equal to the higher value
  • <n%>: Both values are numeric and adding n percent of the higher value to the lower value produces a value greater than or equal to the higher value
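The NumericCompare results, including the <n> and <n%> tolerances, can be sketched as below. The sketch assumes the "numeric part" of a value is its first run of digits, which is a guess made for illustration; the function names are hypothetical.

```python
import re


def numeric_part(value):
    """Return the first run of digits as an int, or None if there is none."""
    match = re.search(r"\d+", value)
    return int(match.group()) if match else None


def numeric_compare(a, b):
    """Classify two values using the named NumericCompare results."""
    na, nb = numeric_part(a), numeric_part(b)
    if na is None and nb is None:
        return "NonePopulated"
    if na is None or nb is None:
        return "OnePopulated"
    return "ExactMatch" if na == nb else "NoMatch"


def within_n(a, b, n):
    """<n>: lower value plus n reaches the higher value."""
    na, nb = numeric_part(a), numeric_part(b)
    if na is None or nb is None:
        return False
    low, high = sorted((na, nb))
    return low + n >= high


def within_percent(a, b, pct):
    """<n%>: lower value plus pct percent of the higher value reaches it."""
    na, nb = numeric_part(a), numeric_part(b)
    if na is None or nb is None:
        return False
    low, high = sorted((na, nb))
    return low + high * pct / 100 >= high
```

For example, `within_n("5th", "15th", 10)` holds because 5 + 10 reaches 15, while a tolerance of 9 does not.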
DoubleMetaphone
    The DoubleMetaphone comparator ignores all digits. When applied to an element where alphanumeric data is expected it should be combined with a NumericCompare rule.

  • ExactMatch: The primary Double Metaphone codes for both strings are the same e.g. "Smith" & "Smythe"
  • AlternateCodeMatch: The strings have matching Double Metaphone codes, taking into account alternate codes e.g. "Smith" & "Schmidt"
  • FirstWordMatch: The first words of each string have matching Double Metaphone codes e.g. "Mary Anne" & "Marie"
Plus all ExactString (default comparator) results.
Soundex
    The Soundex comparator ignores all digits. When applied to an element where alphanumeric data is expected it should be combined with a NumericCompare rule.

  • ExactMatch: The Soundex codes for both strings are the same e.g. "Smith" & "Smythe"
Plus all ExactString (default comparator) results.
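For readers unfamiliar with Soundex, the sketch below implements the common American Soundex coding and shows why "Smith" and "Smythe" produce the same code. It is a generic implementation, not necessarily identical to the one used by the comparator; dropping non-letters before encoding mirrors the note above that digits are ignored.

```python
# Consonant groups and their Soundex digits.
CODES = {}
for letters, digit in (
    ("BFPV", "1"),
    ("CGJKQSXZ", "2"),
    ("DT", "3"),
    ("L", "4"),
    ("MN", "5"),
    ("R", "6"),
):
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return the four-character Soundex code, or '' if there are no letters."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    encoded = [letters[0]]
    prev = CODES.get(letters[0], "")
    for c in letters[1:]:
        code = CODES.get(c, "")
        if code and code != prev:
            encoded.append(code)
        # H and W do not separate letters that share a code; vowels do.
        if c not in "HW":
            prev = code
    return ("".join(encoded) + "000")[:4]


print(soundex("Smith"), soundex("Smythe"))  # S530 S530
```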
NYSIIS
    The NYSIIS comparator ignores all digits. When applied to an element where alphanumeric data is expected it should be combined with a NumericCompare rule.

  • ExactMatch: The NYSIIS codes for both strings are the same e.g. "Jan" & "John"
Plus all ExactString (default comparator) results.

Filters

Filters may be used within rules to select only part of the specified string element. Insert a single filter, or a chain of several filters, between the element and the comparator. All existing comparators may be used, including Levenshtein.

Selecting a substring

The SubString filter allows you to select a portion of a string. It has two integer parameters: the first determines where the selection starts and the second determines the number of characters to select.

The first parameter selects the offset in the string at which to start. When zero or positive, it selects an offset from the start of the string. When negative, it selects an offset from the end of the string. For example, -1 starts at the last character, -2 starts at the second-to-last character, and so on.

The second parameter determines the number of characters to select. When zero, all characters to the end of the string will be selected. When positive, the supplied number of characters will be selected. When negative, the supplied number of characters will be removed from the end of the selection.

Boundary Conditions: If the second parameter is positive and greater than the number of characters available in the string (after applying the offset from the first parameter) then only the available characters will be selected. Otherwise, if the supplied parameters cannot select any characters then an empty string will be produced from the filter.

All of the examples below will produce a match.

Record Generic String
Record 1 123-TR
Record 2 123-RM

GenericStringMatch.L2 = {Generic_String.SubString[0,3].[ExactMatch]}

Record Generic String
Record 1 TR-123
Record 2 RM-123

GenericStringMatch.L2 = {Generic_String.SubString[3,0].[ExactMatch]}
(notice that 3,0 will remove the first 3 characters, leaving the remainder of the string)

Record Generic String
Record 1 R-4AB
Record 2 N-4AR

GenericStringMatch.L2 = {Generic_String.SubString[2,-1].[ExactMatch]}
(produces a string starting at offset 2 and removing the last character)

Record Generic String
Record 1 R-4567-AB
Record 2 N-7643227-AB

GenericStringMatch.L2 = {Generic_String.SubString[-2,0].[ExactMatch]}
(produces a string starting at the offset 2 characters from the end)

Record Generic String
Record 1 MM-John Smith
Record 2 CC-Joan Smith

GenericStringMatch.L2 = {Generic_String.SubString[3,0].Levenshtein[90%]}
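The SubString parameter rules described above can be sketched in a few lines of Python. This is an illustrative approximation: the function name is hypothetical, and returning an empty string when a negative offset reaches past the start of the value is an assumption based on the boundary conditions.

```python
def sub_string(value, start, length):
    """Approximate SubString[start,length] on a single value."""
    if start < 0:
        start = len(value) + start  # negative offsets count from the end
        if start < 0:
            return ""  # assumed: nothing can be selected
    selection = value[start:]
    if length == 0:
        return selection  # zero means "to the end of the string"
    # Positive: keep at most `length` characters.
    # Negative: drop that many characters from the end of the selection.
    return selection[:length]


print(sub_string("R-4AB", 2, -1), sub_string("N-4AR", 2, -1))  # 4A 4A
```

Applied to each pair of example records above, both values reduce to the same string, which is why the ExactMatch comparisons succeed.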

Selecting an item from a delimited string

DelimitedField can be used to select a field from a string delimited by one or more characters. The delimiter is defined as a Java regular expression specified as the first argument, while the second argument is an integer that specifies the zero-based index of the field to select. Because character escapes are allowed in the rule syntax, you must double escape any string literals that use characters reserved for regular expressions.

Record Generic String
Record 1 A-4567-AB
Record 2 A-7643227-AB

GenericStringMatch.L2 = {Generic_String.DelimitedField["\\-",0].[ExactMatch]}
(selects the first item of an array of strings delimited by the '-' character, so both values will be 'A'. Note the double escaping in use. The first is to escape possible special character codes, the second because the string is a regular expression)

Record Generic String
Record 1 First item-4567-ATTR
Record 2 Second item-4567-ATTR

GenericStringMatch.L2 = {Generic_String.DelimitedField["\\-",1].[ExactMatch]}
(selects the second item of an array of strings delimited by the '-' character, so both values will be '4567'. Note the double escaping in use. The first is to escape possible special character codes, the second because the string is a regular expression)
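The behavior of DelimitedField can be approximated with a regular-expression split. The sketch uses Python's `re` module rather than Java's regular expression engine, so only simple patterns such as `\-` should be expected to behave the same way; the function name and the empty-string result for an out-of-range index are assumptions.

```python
import re


def delimited_field(value, delimiter_regex, index):
    """Split value on a regex delimiter and return the zero-based field."""
    fields = re.split(delimiter_regex, value)
    return fields[index] if 0 <= index < len(fields) else ""


# The rule text "\\-" becomes the regex \- once the rule syntax has
# processed its own escape; a raw string expresses that regex directly here.
print(delimited_field("A-4567-AB", r"\-", 0))             # A
print(delimited_field("First item-4567-ATTR", r"\-", 1))  # 4567
```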

Using 'Contains'

The 'Contains' filter checks whether the shorter of the two record values is contained within the longer one. If it is, the shorter value is used for both records; otherwise the values pass through the filter unchanged.

Record Generic String
Record 1 Blah
Record 2 I contain Blah here

GenericStringMatch.L2 = {Generic_String.Contains.[ExactMatch]}
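Unlike the other filters, 'Contains' looks at both record values at once. A minimal sketch of that behavior follows; the function name is hypothetical, and leaving the values unchanged when the shorter one is empty is an assumption.

```python
def contains_filter(a, b):
    """Return the pair of values to hand on to the comparator."""
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    if shorter and shorter in longer:
        return shorter, shorter  # both sides become the shorter value
    return a, b  # no containment: pass through unchanged


left, right = contains_filter("Blah", "I contain Blah here")
print(left == right)  # True, so an ExactMatch comparison succeeds
```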

Chaining filters

Filters may be chained together. There's no limit to the number, but we recommend using no more than three.

Record Generic String
Record 1 First item-4588-ATTR
Record 2 Second item-4599-ATTR

GenericStringMatch.L2 = {Generic_String.DelimitedField["\\-",1].SubString[0,2].[ExactMatch]}
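Chaining can be thought of as function composition: each filter receives the output of the previous one. The sketch below rebuilds simplified versions of DelimitedField and SubString as hypothetical Python closures and applies them in order to the example records; it illustrates the idea rather than the product's implementation.

```python
import re


def delimited_field(delimiter_regex, index):
    def apply(value):
        fields = re.split(delimiter_regex, value)
        return fields[index] if 0 <= index < len(fields) else ""
    return apply


def sub_string(start, length):
    def apply(value):
        begin = start if start >= 0 else len(value) + start
        if begin < 0:
            return ""
        selection = value[begin:]
        return selection if length == 0 else selection[:length]
    return apply


def apply_chain(value, filters):
    """Feed the value through each filter from left to right."""
    for step in filters:
        value = step(value)
    return value


# Generic_String.DelimitedField["\\-",1].SubString[0,2]
chain = [delimited_field(r"\-", 1), sub_string(0, 2)]
left = apply_chain("First item-4588-ATTR", chain)
right = apply_chain("Second item-4599-ATTR", chain)
print(left, right, left == right)  # 45 45 True
```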

Rule examples

These are element rule examples that focus on how different elements, modifiers and comparators can be used when designing the rules:

Initial vs full name

Record Forename Surname
Record 1 Robert Brooke
Record 2 R Brooke

Name.L2 = {Forenames.ForenameCompare[InitialVsFullName] & Surname[ExactMatch]}
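The InitialVsFullName result used in this rule can be approximated as follows. The sketch assumes names are split on whitespace, that one side consists only of single-letter initials, and that both sides have the same number of names; these are illustrative assumptions and the function name is hypothetical.

```python
def initial_vs_full_name(a, b):
    """True when one value is initials that match the other's full names."""
    names_a, names_b = a.split(), b.split()
    if not names_a or len(names_a) != len(names_b):
        return False

    def is_initials(names):
        return all(len(name) == 1 for name in names)

    # Exactly one side must be written as initials.
    if is_initials(names_a) == is_initials(names_b):
        return False
    initials, full = (names_a, names_b) if is_initials(names_a) else (names_b, names_a)
    return all(i.upper() == f[0].upper() for i, f in zip(initials, full))


print(initial_vs_full_name("Robert", "R"))  # True
```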

Minor street number

Record MinorStreet_Number MinorStreet_Description MinorStreet_Type
Record 1 123 Burnthouse Lane
Record 2 123a Burnthouse Lane

StreetAddress.L1 = {MinorStreet_Number.PremiseCompare[NumberMatchWithTrailingAlpha] & MinorStreet_Description[ExactMatch] & MinorStreet_Type.StandardAbbreviation[ExactMatch]}

Postcode

Record MinorStreet_Description MinorStreet_Type Locality Postcode
Record 1 Hints Road Tamworth B78 3AB
Record 2 Hints Road Tamworth B78 3AT

Address.L2 = {Building_Number[ExactMatch] & MinorStreet_Description[ExactMatch] & Locality[ExactMatch] & Postcode.PostcodeCompare[Part1Match]}

Cross-field matching

Cross-field matching allows you to match across multiple fields of the same type to find potential duplicates.

For example, if your data consists of three phone number fields (e.g. home, work and mobile number), you can configure the Find duplicates step to find potential phone number matches across all three of them.

To configure the Find duplicates step to perform cross-field matching, you have to set up your blocking keys and rulesets in such a way that your groups of similar data will be blocked and scored together.

Example

You have multiple address fields (home and billing) and want to identify potential duplicates where the billing address of one record matches the home address of another record.

RECORD ID | NAME | HomeAddress ADDRESS | HomeAddress LOCALITY | HomeAddress POSTCODE | BillingAddress ADDRESS | BillingAddress LOCALITY | BillingAddress POSTCODE
1 | John Smith | 1 High Street | London | SW4 0QL | 48 Webber Road | Brighton | BN3 1EJ
2 | John Smith | 12 Acacia Avenue | London | E1W 2BB | 1 High Street | London | SW4 0QL

If you take the default rules and blocking keys for GBR and modify them to treat these two addresses as separate groups (HomeAddress and BillingAddress), the two records will still not be matched. Their home addresses differ, so they will not block together or evaluate as a match in the rules, and the same is true of their billing addresses.

The desired behavior is for the following cross-field operations to be performed. With the example above, only the third operation produces a match:

  • HomeAddress 1 vs. HomeAddress 2
  • BillingAddress 1 vs. BillingAddress 2
  • HomeAddress 1 vs. BillingAddress 2 (match)
  • BillingAddress 1 vs. HomeAddress 2

To achieve this, add "elementGroups": ["HomeAddress","BillingAddress"] to each address element in every blocking key definition, and prefix the address rule references in the final match-level rules with #[HomeAddress,BillingAddress].

Blocking keys

The blocking key definition below is from the default GBR individual blocking keys, combining the surname initial, street number and locality into a blocking key. By default, this will create keys for a single address. However, it can be modified to create keys for multiple addresses by adding the HomeAddress and BillingAddress element groups, as shown below:

{
   "description": "SurnameMinorStreetNumberLocality",
   "countryCode": "GBR",
   "elementSpecifications": [
     {
       "elementType": "SURNAME",
       "algorithm": {
         "name": "INITIAL"
       },
       "includeFromNChars": 1
     },
     {
       "elementType": "MINORSTREET_NUMBER",
       "elementGroups": ["HomeAddress","BillingAddress"],
       "includeFromNChars": 1,
       "truncateToNChars": 5
     },
     {
       "elementType": "LOCALITY",
       "elementGroups": ["HomeAddress","BillingAddress"],
       "elementModifiers": [
         "STANDARDSPELLING",
         "DERIVED"
       ],
       "includeFromNChars": 2,
       "truncateToNChars": 30
     }
   ]
 }

This will now cause two blocking keys to be generated for every record:

  • SURNAME_Initial + HomeAddress_MINORSTREET_NUMBER + HomeAddress_LOCALITY
  • SURNAME_Initial + BillingAddress_MINORSTREET_NUMBER + BillingAddress_LOCALITY

Using the sample data, the following blocking keys will be created:

  • Record 1
    • S1LONDON
    • S48BRIGHTON
  • Record 2
    • S12LONDON
    • S1LONDON

Since the first and last keys are identical, records 1 and 2 will now be blocked together and identified as a candidate pair for scoring in the rules.
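The way one key is produced per element group can be sketched in Python using the sample data. This is a simplified illustration: the record layout and function name are hypothetical, upper-casing is assumed from the sample keys shown above, and the modifiers and includeFromNChars settings are not modelled.

```python
GROUPS = ["HomeAddress", "BillingAddress"]

records = {
    1: {
        "surname": "Smith",
        "HomeAddress": {"number": "1", "locality": "London"},
        "BillingAddress": {"number": "48", "locality": "Brighton"},
    },
    2: {
        "surname": "Smith",
        "HomeAddress": {"number": "12", "locality": "London"},
        "BillingAddress": {"number": "1", "locality": "London"},
    },
}


def blocking_keys(record):
    """Build one key per element group: initial + street number + locality."""
    initial = record["surname"][:1].upper()  # INITIAL algorithm
    keys = []
    for group in GROUPS:
        address = record[group]
        number = address["number"][:5]                # truncateToNChars: 5
        locality = address["locality"][:30].upper()   # truncateToNChars: 30
        keys.append(initial + number + locality)
    return keys


keys_1 = blocking_keys(records[1])
keys_2 = blocking_keys(records[2])
print(keys_1, keys_2)
print(set(keys_1) & set(keys_2))  # shared key means the records block together
```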

Rules

The rule snippet below is from the default GBR individual ruleset that has been amended to include cross-field support as an example:

Match.Exact={Name.Exact & #[HomeAddress,BillingAddress].Address.Exact}

This will now cause the following rule evaluations to be performed:

Fields A Fields B Result
Record1_Name + Record1_HomeAddress vs. Record2_Name + Record2_HomeAddress FALSE
Record1_Name + Record1_BillingAddress vs. Record2_Name + Record2_BillingAddress FALSE
Record1_Name + Record1_HomeAddress vs. Record2_Name + Record2_BillingAddress TRUE
Record1_Name + Record1_BillingAddress vs. Record2_Name + Record2_HomeAddress FALSE

The third case will be true since the home address of record 1 matches the billing address of record 2 as an exact match.
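The four evaluations can be reproduced with a short Python sketch that pairs every element group of one record with every element group of the other. The record layout and function names are hypothetical, and exact equality of the address tuples stands in for the Address.Exact rule.

```python
from itertools import product

GROUPS = ["HomeAddress", "BillingAddress"]

record1 = {
    "Name": "John Smith",
    "HomeAddress": ("1 High Street", "London", "SW4 0QL"),
    "BillingAddress": ("48 Webber Road", "Brighton", "BN3 1EJ"),
}
record2 = {
    "Name": "John Smith",
    "HomeAddress": ("12 Acacia Avenue", "London", "E1W 2BB"),
    "BillingAddress": ("1 High Street", "London", "SW4 0QL"),
}


def cross_field_evaluations(a, b):
    """Evaluate Name.Exact & Address.Exact for every pair of groups."""
    name_exact = a["Name"] == b["Name"]
    return [
        (group_a, group_b, name_exact and a[group_a] == b[group_b])
        for group_a, group_b in product(GROUPS, repeat=2)
    ]


def match_exact(a, b):
    """The rule is satisfied if any group pairing evaluates to True."""
    return any(result for _, _, result in cross_field_evaluations(a, b))


for group_a, group_b, result in cross_field_evaluations(record1, record2):
    print(f"Record1_{group_a} vs. Record2_{group_b}: {result}")
print(match_exact(record1, record2))  # True
```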