Data Studio provides default blocking keys and rulesets that you can use to run the Find duplicates step. However, you can create your own custom rules and blocking keys that are tailored to your needs.
Blocking keys and rulesets are designed and specified using elements - a representation of your input data after it has been mapped in the Find duplicates step and gone through the initial standardization process. Find out about elements.
The standardization process also creates additional versions of these elements to assist further - known as modifiers. Modifiers can correct, enhance or derive many known terms that appear in the input. For example, a DERIVED modifier may be created when the element was not contained in the input but the standardization process was able to determine the value (e.g. COUNTY can sometimes be derived from the LOCALITY and POSTCODE input). Find out about modifiers.
Elements can also be put into specific groups to separate them from other elements of the same type. This is especially important when you want to create blocking keys or rules that treat them as separate entities, such as cross-field matching.
These elements, modifiers and groups (along with blocking key algorithms and rule comparators) can then be combined to make your own blocking keys and rules specific to your design and data.
The table below covers the available elements that can be used when designing blocking keys and rules, together with the possible modifiers and rule comparators that can be used with them. Note that blocking key algorithms are not listed here because they apply to all elements.
Element | Description | Example | Comparators | Modifiers |
---|---|---|---|---|
Title | Title | Mrs | | |
Forenames | Given name(s) and any initials | John | | |
Surname_Prefix | Surname prefix | De la | | |
Surname | Surname with prefix | Smith | | |
Full_Name | Concatenation of forenames and surname (including prefix if present) | John O'Connor | | |
Surname_Suffix | Surname suffixes | Junior | | |
Gender | Gender | Female | | |
Honorifics | Honorifics | Ph.D | | |
Company | Organization name | Experian Ltd | | |
Building_Description | Building name and type | George West House | | |
Building_Number | Building number | 43 | | |
SubBuilding_Number | Sub-building number | 2 | | |
SubBuilding_Description | Sub-building name | First-floor | | |
SubBuilding_Type | Sub-building type | Flat | | |
MinorStreet_Number | Street number | 34th | | |
MinorStreet_Predirectional | Street pre-directional | South | | |
MinorStreet_Description | Street name | Carnaby | | |
MinorStreet_Type | Street descriptor | Street | | |
MinorStreet_Postdirectional | Street post-directional | South | | |
PoBox_Number | PO box number | 79 | | |
PoBox_Description | PO box description | PO Box | | |
DoubleDependentLocality | A small locality such as a village, used to identify an address where a street appears more than once in a dependent locality | Kingston Gorse | | |
DependentLocality | Smaller locality used to identify an address where a street appears more than once in a locality | East Preston | | |
Locality | A larger locality, such as a town or a city | Cambridge | | |
Province | A larger area of a country, contains multiple localities | Cambridgeshire | | |
Country | Country name | United Kingdom | | |
Postcode | Postal code or ZIP code | 'SW4 0QL' or '20521 9000' | | |
Generic_String | Generic string | ab-1234cdef | | |
Date | ISO date in the format YYYY-MM-DD | 1980-06-21 | | |
Phone | Phone number | (01234) 567890 | | |
Email | Email address | john.smith@domain.com | | |
Email_Local | Local part of email address | john.smith | | |
Email_Domain | Email domain | domain.com | | |
Hash | Hash value of all normalised input fields within a group | 8bf14557574b8793aae648fc1b0280c3 | | |
The table below covers the available element modifiers that can be used. See the elements table to find out which modifiers can be used with which elements.
Modifier | Operation | Example |
---|---|---|
(Default) | The element classified from the input in a cleaned form, normalised to remove diacritics and converted to upper case. | Supplied: 123 High Road MINORSTREET_TYPE -> ROAD |
StandardSpelling | The element converted to a standard spelling (contains Derived value when available). | Supplied: 123 High Road MINORSTREET_TYPE.STANDARDSPELLING -> ROAD |
StandardAbbreviation | The element converted to the standard abbreviation. | Supplied: 123 High Road MINORSTREET_TYPE.STANDARDABBREVIATION -> RD |
Derived | A derived value that was inferred from other information in the input address. | Supplied: 123 High Road, London, E1 2EZ PROVINCE.DERIVED -> GREATER LONDON |
RootName | The root name of the input name. | Supplied name: Alex FORENAMES.ROOTNAME -> ALEXANDER |
Any element can be used multiple times in an input to represent separate or unique sets of information. For example, the input may include multiple addresses (such as a billing address and a delivery address) or multiple phone numbers.
You may want to create different blocking key specifications for different groups of the same element type, or use the same blocking key specification for multiple groups of the same element type. To do this, you need to add elementGroups to your element specification within your blocking key design. Find out more about elementGroups.
For example, if you have a billing and a delivery address and you want to have two different blocking key specifications for them that have different designs, you would add the following to each address element of the group within your blocking key designs:
"elementGroups":["BillingAddress"]
"elementGroups":["DeliveryAddress"]
Alternatively, if you wanted to use the same blocking key specification for both groups, you would add the following to each address element of the group within that single blocking key design:
"elementGroups":[
"BillingAddress",
"DeliveryAddress"
]
The blocking key specification examples below are taken from the default blocking keys available in Data Studio, and have been amended to show how they would be used for the two cases above.
Two differently designed blocking key specifications for two different groups of the same element type:
{
"description": "SurnameMinorStreetNumberLocality",
"countryCode": "GBR",
"elementSpecifications": [
{
"elementType": "SURNAME",
"algorithm": {
"name": "INITIAL"
},
"includeFromNChars": 1
},
{
"elementType": "MINORSTREET_NUMBER",
"elementGroups": ["BillingAddress"],
"includeFromNChars": 1,
"truncateToNChars": 5
},
{
"elementType": "LOCALITY",
"elementGroups": ["BillingAddress"],
"elementModifiers": [
"STANDARDSPELLING",
"DERIVED"
],
"includeFromNChars": 2,
"truncateToNChars": 30
}
]
}
{
"description": "POBoxNumberLocality",
"countryCode": "GBR",
"elementSpecifications": [
{
"elementType": "POBOX_NUMBER",
"elementGroups": ["DeliveryAddress"],
"includeFromNChars": 1
},
{
"elementType": "LOCALITY",
"elementGroups": ["DeliveryAddress"],
"elementModifiers": [
"STANDARDSPELLING",
"DERIVED"
],
"includeFromNChars": 2,
"truncateToNChars": 30
}
]
}
The same blocking key specification for two different groups of the same element type:
{
"description": "FullPostcode",
"countryCode": "GBR",
"elementSpecifications": [
{
"elementType": "POSTCODE",
"elementGroups": ["BillingAddress","DeliveryAddress"],
"elementModifiers": [
"STANDARDSPELLING"
],
"includeFromNChars": 5,
"truncateToNChars": 7
}
]
}
You may want to handle the items listed above separately during rule evaluation. To do this, you need to prefix the rule with a hash symbol, followed by the group name in square brackets, followed by a period, followed by the rest of the rule.
For example, if you have a billing address and a delivery address and you want to evaluate them separately, you could write rules such as the ones below.
Match.L0 = {Name.L0 & #[Billing].Address.L0 & #[Delivery].Address.L0}
Match.L1 = {Name.L1 & (#[Delivery].Address.L1 | #[Billing].Address.L1)}
By default, this will mean that the two different groups will use the same underlying rules (in the case above, the same address rules). To use different rules for two groups that are of the same element type, you can use theme rules that are then utilized at the top level.
Match.L0={#[Billing].LooseAddressRule.L0 & #[Delivery].StrictAddressRule.L0}
LooseAddressRule.L0={Postcode.L0}
StrictAddressRule.L0={MinorStreet_Number.L0 & Postcode.L0}
MinorStreet_Number.L0={PremiseCompare[ExactMatch]}
Postcode.L0={PostcodeCompare[Part1Match] & PostcodeCompare[Part2Match]}
In the above example, two different theme rules have been used to evaluate an address (one being stricter than the other since it requires both the premises number AND postcode to be compared), but each is only used with one group as defined at the match level.
You can also specify more than one group in any rule, such as the example below. This will perform cross-field matching.
Match.L0 = {Name.L0 & #[Billing,Delivery].Address.L0}
The Find duplicates step uses blocking keys to create blocks of similar records to assist with the generation of suitable candidate record pairs for scoring via a ruleset.
Blocks are created from records that have the same blocking key values. Blocking keys are created for each input record from combinations of the record's elements that have been keyed. Keying is the process of encoding individual elements to the same representation so that they can be matched despite minor differences in spelling. To be effective, blocking keys should represent a range of contact data sub-element combinations.
Data Studio provides default blocking keys tuned for name and address matching for the Find duplicates step. However, you can modify these or create your own to suit your needs.
All blocking keys have to be built in the structure below.
Element | Description |
---|---|
description (optional) | A descriptive name for the blocking key, for example "MyBlockingKey". |
countryCode (optional) | An informational code that can be used to imply the country that the blocking key is to be used with (the value does not affect processing in any way). If you require blocking keys conditionally generated, see the validCountries element. |
validCountries (optional) | An array of ISO3 country codes for which the blocking key is valid. If this element is present and the standardized record has a country code, then it must match one of the country codes in the array - otherwise the blocking key won't be generated. For example:"validCountries":["GBR","IRL"] |
elementSpecifications | An array of elements that are to be used in the key, created as part of the initial standardization process. Blocking keys are created using the list of elements in order. For example, the key FORENAMES+SURNAME+MINORSTREET_NUMBER would be created from the array below: "elementSpecifications":[ |
Each element within elementSpecifications has to then be built in the structure below:
Element | Description |
---|---|
elementType | The element to use in the blocking key. See the list of available elements and examples for each. |
elementGroups (optional) | The group(s) that the element belongs to. If this isn't supplied, the default group will be used. This is required when you have more than one of the same element type (e.g. multiple phone numbers) that you want to treat separately. Find out about groups. The example below specifies multiple groups for a single element type: "elementGroups":[ |
elementModifiers (optional) | An array of element modifiers to use. This list is processed in order and the first populated element will be used. If no modifier is specified/found, the default value of the element will be used. See the list of available modifiers, which elements they apply to and examples for each. The benefit of element modifiers is to use a more standardized and uniform version of the element, to better assist with blocking together records that may not look the same in their string representation but could still very much be intended to represent the same thing. The example below will use the standard spelling form of the element if available, otherwise the derived form. If neither are found, the default value for the element will be used. "elementModifiers":[ |
algorithm (optional) | Keying algorithm used to key the element. These are useful for blocking as they can cause similar elements to be blocked together even if their original form is not identical. For example, the SOUNDEX algorithm can be used to block similar sounding values (since they will have the same SOUNDEX key) even if they do not look the same. Defined algorithms have to have a name and can also have an optional set of additional properties. For example, the SUBSTRING algorithms require start, end and/or length. "algorithm": {See the list of available algorithms (and any associated properties) and examples for each. |
includeFromNChars (optional) | Only include the keyed element in the blocking key if it is N or more characters in length. If specified, a blocking key will not be created when the configured element does not meet this criterion. For example, if this is set to 5 on the POSTCODE element, any postcodes found to be less than 5 characters in length will not be included in the blocking key, therefore the entire blocking key will not be created for that record. |
truncateToNChars (optional) | Truncate the keyed element to N characters in length. For example, if this is set to 10 on the SURNAME element, any surnames longer than 10 characters will be included into the blocking key but cut at the 10th character. |
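The element-level options above can be read as a small pipeline: pick the first populated modifier, key the value, check the minimum length, then truncate. The Python sketch below is an illustration only and not Data Studio's implementation. The function names, the dictionary layout of a standardized record, the order of the length check and the truncation, and the plain concatenation of element keys are all assumptions, and only the SIMPLIFIED_STRING and INITIAL algorithms are modelled.

```python
from typing import Optional

# Hypothetical keying functions for two of the documented algorithms.
ALGORITHMS = {
    "SIMPLIFIED_STRING": lambda v: "".join(v.split()),  # whitespace removed
    "INITIAL": lambda v: v[:1],                         # first character only
}


def key_element(record: dict, spec: dict) -> Optional[str]:
    """Return the keyed value for one element specification, or None."""
    variants = record.get(spec["elementType"], {})
    # Modifiers are tried in order; the default value is the fallback.
    value = None
    for modifier in spec.get("elementModifiers", []):
        if variants.get(modifier):
            value = variants[modifier]
            break
    if value is None:
        value = variants.get("DEFAULT")
    if not value:
        return None
    algorithm = spec.get("algorithm", {}).get("name", "SIMPLIFIED_STRING")
    keyed = ALGORITHMS[algorithm](value)
    # includeFromNChars: a too-short value suppresses the whole blocking key.
    if len(keyed) < spec.get("includeFromNChars", 0):
        return None
    # truncateToNChars: cut the keyed value at N characters (assumed to
    # happen after the minimum-length check).
    limit = spec.get("truncateToNChars")
    return keyed[:limit] if limit else keyed


def build_blocking_key(record: dict, blocking_key: dict) -> Optional[str]:
    """Concatenate the element keys in order; None if any element fails."""
    parts = []
    for spec in blocking_key["elementSpecifications"]:
        keyed = key_element(record, spec)
        if keyed is None:
            return None
        parts.append(keyed)
    return "".join(parts)


full_postcode = {
    "elementSpecifications": [{
        "elementType": "POSTCODE",
        "elementModifiers": ["STANDARDSPELLING"],
        "includeFromNChars": 5,
        "truncateToNChars": 7,
    }]
}
record = {"POSTCODE": {"DEFAULT": "SW4 0QL", "STANDARDSPELLING": "SW4 0QL"}}
print(build_blocking_key(record, full_postcode))  # SW40QL
```

A record whose postcode keys to fewer than 5 characters, or that has no postcode at all, yields no key in this sketch, which mirrors the includeFromNChars description above.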
The algorithms below are available for use in blocking key specifications, and apply to all non-numeric elements. If no algorithm is specified, the SIMPLIFIED_STRING algorithm is used.
Name | Description | Keyed Example | Additional properties |
---|---|---|---|
NO_CHANGE | No modification performed, and whitespaces retained | ANDREW J -> ANDREW J | |
SIMPLIFIED_STRING | Whitespace removed | ANDREW J -> ANDREWJ | |
DOUBLE_METAPHONE | Double metaphone algorithm on all words | ANDREW J -> ANTR | |
DOUBLE_METAPHONE_FIRST_WORD | Double metaphone algorithm on the first word only | ANDREW J -> ANTR | |
NYSIIS | NYSIIS algorithm | ANDREW J -> ANDRAJ | |
SOUNDEX | SOUNDEX algorithm | ANDREW J -> A536 | |
CONSONANT | Only include consonants | ANDREW J -> NDRWJ | |
INITIAL | Initial character only | ANDREW J -> A | |
START_SUBSTRING | Substring from the start of the value | ANDREW J ("length":3) -> AND | "length": <integer> |
MIDDLE_SUBSTRING | Substring from defined start position to defined end position | ANDREW J ("start":2, "end":5) -> NDRE | "start": <integer> ,"end": <integer> |
END_SUBSTRING | Substring from the end of the value | ANDREW J ("length":3) -> W J | "length": <integer> |
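As a rough illustration of the non-phonetic algorithms in the table, the Python sketch below reproduces the keyed examples shown for ANDREW J. It is not the product's code: the function name `key_value` is hypothetical, the 1-based inclusive positions for MIDDLE_SUBSTRING are inferred from the table example, and treating only A, E, I, O and U as vowels is an assumption. The phonetic algorithms (DOUBLE_METAPHONE, NYSIIS, SOUNDEX) are deliberately left out.

```python
VOWELS = set("AEIOU")  # assumption: Y is treated as a consonant


def key_value(value: str, name: str = "SIMPLIFIED_STRING", **props: int) -> str:
    """Key a value with one of the simple (non-phonetic) algorithms."""
    if name == "NO_CHANGE":
        return value
    if name == "SIMPLIFIED_STRING":
        return "".join(value.split())
    if name == "CONSONANT":
        # Keep letters that are not vowels; whitespace is dropped as well.
        return "".join(c for c in value if c.isalpha() and c not in VOWELS)
    if name == "INITIAL":
        return value[:1]
    if name == "START_SUBSTRING":
        return value[:props["length"]]
    if name == "MIDDLE_SUBSTRING":
        # Positions are read as 1-based and inclusive, matching the table.
        return value[props["start"] - 1:props["end"]]
    if name == "END_SUBSTRING":
        return value[max(len(value) - props["length"], 0):]
    raise ValueError(f"algorithm not modelled in this sketch: {name}")


print(key_value("ANDREW J", "CONSONANT"))                      # NDRWJ
print(key_value("ANDREW J", "MIDDLE_SUBSTRING", start=2, end=5))  # NDRE
print(key_value("ANDREW J", "END_SUBSTRING", length=3))        # W J
```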
CONSONANT and SOUNDEX support the following character sets:
All other key types support the following Latin character sets:
Let's use the default blocking keys in Data Studio for more advanced examples of blocking key design:
{
"description": "FullZipCode",
"countryCode": "USA",
"elementSpecifications": [
{
"elementType": "POSTCODE",
"elementModifiers": [
"STANDARDSPELLING"
],
"includeFromNChars": 9
}
]
}
This first blocking key specification consists of just the POSTCODE element with the STANDARDSPELLING element modifier, and only creates a key when the value is at least 9 characters long.
This will result in the generation of candidate score pairs for every record in the data with the same zip and +4 code.
The sample input USA zip codes below would therefore create the following keys with this blocking key specification:
<null>
- no blocking key generated because the minimum length of 9 is not satisfied
This second blocking key specification below is a little more advanced: it combines a few elements together and uses various modifiers and algorithms, as well as different character number restrictions.
{
"description": "ForenameSurnameMinorStreetNumber",
"countryCode": "GBR",
"elementSpecifications": [
{
"elementType": "FORENAMES",
"elementModifiers": [
"ROOTNAME"
],
"algorithm": {
"name": "DOUBLE_METAPHONE_FIRST_WORD"
},
"includeFromNChars": 1,
"truncateToNChars": 10
},
{
"elementType": "SURNAME",
"algorithm": {
"name": "DOUBLE_METAPHONE"
},
"includeFromNChars": 1,
"truncateToNChars": 10
},
{
"elementType": "MINORSTREET_NUMBER",
"includeFromNChars": 1,
"truncateToNChars": 5
}
]
}
This will result in the generation of candidate score pairs for every record in the data with the same forename root name double metaphone value, surname double metaphone value, and premises number.
The sample name and addresses below show what those blocking keys would look like:
FORENAMES = VAL
MINORSTREET_NUMBER = 45
Therefore, the final key generated for this address is: FLRJNS45.
FORENAMES = JOHNNY
SURNAME = ANDERSON-THOMPSON
MINORSTREET_NUMBER = 123456
Therefore, the final key generated for this address is: JN0NANTR12345.
FORENAMES = PAUL
SURNAME = SMITH
MINORSTREET_NUMBER = null
Because not all the blocking key specification has been satisfied in this case (due to the missing premises number in the input), no blocking key is generated for this record from this blocking key specification.
Whilst blocking keys are used to generate suitable candidate pairs of input records, rules are then used to score those pairs together and cluster together any that are considered a good enough match.
You can modify the default rulesets or even create your own. Before you do though, we recommend that you review the concepts below:
<rule reference>=<expression>
All rules have a rule reference on the left-hand side.
A rule reference takes the following format: <rule name>.<match level>
The rule name may be any of the following:
The match level can be: L0, L1, L2, or L3. Note that you can override these values by using aliases.
The right-hand side of a rule:
Expressions may be nested and logical operators combined (parentheses are required), e.g. MyRule.L2 = {((RuleA.L3 & RuleB.L2) | (RuleA.L2 & RuleB.L3)) | RuleC.L0}
Element rules include the element and the allowed result set (enclosed in [ ] and comma-separated) and may also include an optional element modifier and/or comparator.
Any theme or element rule may also optionally include a group from the input mappings, which is defined by using a hash symbol before the group name. For example:
#MyFieldGroup.PostcodeTheme.L0 = {Postcode[ExactMatch]}
You can also specify a default country in the rules file. This lets the Find duplicates step know what country to use when processing and standardizing data if:
To specify a default country, add @default.country=<countryiso> to the top of the rules file (where <countryiso> is the ISO 3166-1 alpha-3 country code). For example, to set Australia as the default country, add: @default.country=AUS
You can use the default rules provided with the Find duplicates step as the basis for customization since they already contain the correct default country value.
There are 4 match levels that can be used within each rule specification:
Match level | Description |
---|---|
L0 | Each individual field that makes up the record matches exactly. |
L1 | Records might have some fields that match exactly, and some fields that are very similar. |
L2 | Records might have some fields that match exactly, some fields that are very similar, and some fields that differ a little more. |
L3 | Records contain the majority of fields that have a number of similarities, but do not match exactly. |
Note that you can override these level names with your own custom names. To do so, use the expression define <Alias> as <Match level> at the top of the rules file, e.g.
define High as L0
Match.High={Name.High & Address.High}
When working with match levels, note that:
If MyRule.L1 is true, then MyRule.L2 and MyRule.L3 will also be true.
Rules are evaluated as follows:
Match.<Match Level>).
Three rule types can be defined:
This is the highest rule level, defining an overall match between two records. A match rule is made up of references to other rules.
At least one match rule must be defined for a successful matching job.
Match.L0={Name.L0 & Address.L0}
(Name.L0 and Address.L0 have been defined separately).
You can combine rule references into compound logical expressions. This way, you have complete control over the logic used to determine matches.
Match.L1={(Name.L0 & Address.L0) | (Name.L0 & Email.L0 & Phone.L0)}
Theme rules represent the next level down, after match rules.
Similar to a match rule, a theme rule is made up of references to other rules. The theme rule name must begin with an alphabetical character and may contain alphanumeric characters, underscores and hyphens. The rule name cannot contain "Match" or the set of reserved elements.
Address.L0={Premise.L0 & Street.L0 & Locality.L0 & Postcode.L0}
The rule references within the expression can either be other theme rules, or low-level element rules.
Element rules are the most granular of rules. They can be used to specify how to compare individual elements within a record. Elements are basic units of data that comprise an overall theme. For example, postcode and premise could be elements of an address theme.
Rules are designed to evaluate and compare elements using special comparators. The table below covers the available comparators you can use.
If you want to know which comparators are available for which elements, see the elements table.
Comparator | Results |
---|---|
ExactString (default comparator) | |
ForenameCompare | |
TransposedNameCompare | |
PremiseCompare | |
DateCompare | |
PostcodeCompare | |
Levenshtein | Depending on specified comparison type, either: |
JaroWinkler | |
NumericCompare | |
DoubleMetaphone | |
Soundex | |
NYSIIS | |
Filters may be used within rules to select only part of the specified string element. Insert one (or chain several) filter(s) between the element and the comparator. All existing comparators may be used, including Levenshtein.
The SubString filter allows you to select a portion of a string. It has two integer parameters, the first determines where the selection will start and the second the number of characters to select.
The first parameter selects the offset in the string at which to start. When zero or positive, it selects an offset from the start of the string. When negative, it selects an offset from the end of the string. For example, -1 starts at the last character, -2 at the second-to-last character, and so on.
The second parameter determines the number of characters to select. When zero, all characters to the end of the string will be selected. When positive, the supplied number of characters will be selected. When negative, the supplied number of characters will be removed from the end of the selection.
Boundary Conditions: If the second parameter is positive and greater than the number of characters available in the string (after applying the offset from the first parameter) then only the available characters will be selected. Otherwise, if the supplied parameters cannot select any characters then an empty string will be produced from the filter.
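The parameter rules above can be sketched in Python as follows. This is an illustrative reading of the description, not the product's implementation: the function name `sub_string` is hypothetical, and returning an empty string when the offset falls outside the value is an assumption drawn from the boundary conditions.

```python
def sub_string(value: str, offset: int, count: int) -> str:
    """Select part of a string using the SubString[offset, count] rules."""
    n = len(value)
    # A negative offset counts back from the end of the string.
    start = offset if offset >= 0 else n + offset
    if start < 0 or start >= n:
        return ""  # assumption: an out-of-range offset selects nothing
    if count == 0:
        end = n                      # everything to the end of the string
    elif count > 0:
        end = min(start + count, n)  # at most the available characters
    else:
        end = n + count              # drop characters from the end
    return value[start:end] if end > start else ""


print(sub_string("123-TR", 0, 3))      # 123
print(sub_string("TR-123", 3, 0))      # 123
print(sub_string("R-4AB", 2, -1))      # 4A
print(sub_string("R-4567-AB", -2, 0))  # AB
```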
All of the examples below will produce a match.
Record | Generic String |
---|---|
Record 1 | 123-TR |
Record 2 | 123-RM |
GenericStringMatch.L2 = {Generic_String.SubString[0,3].[ExactMatch]}
Record | Generic String |
---|---|
Record 1 | TR-123 |
Record 2 | RM-123 |
GenericStringMatch.L2 = {Generic_String.SubString[3,0].[ExactMatch]}
(notice that 3,0 will remove the first 3 characters, leaving the remainder of the string)
Record | Generic String |
---|---|
Record 1 | R-4AB |
Record 2 | N-4AR |
GenericStringMatch.L2 = {Generic_String.SubString[2,-1].[ExactMatch]}
(produces a string starting at offset 2 and removing the last character)
Record | Generic String |
---|---|
Record 1 | R-4567-AB |
Record 2 | N-7643227-AB |
GenericStringMatch.L2 = {Generic_String.SubString[-2,0].[ExactMatch]}
(produces a string starting at the offset 2 characters from the end)
Record | Generic String |
---|---|
Record 1 | MM-John Smith |
Record 2 | CC-Joan Smith |
GenericStringMatch.L2 = {Generic_String.SubString[3,0].Levenshtein[90%]}
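To see why the last example can meet a 90% threshold, the Python sketch below applies the substring step and then computes an edit-distance similarity. How Data Studio converts a Levenshtein distance into a percentage is not stated here, so the formula used (distance relative to the longer string, compared with an inclusive threshold) is an assumption, and the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity_percent(a: str, b: str) -> float:
    """Assumed scoring: share of the longer string that did not change."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (longest - levenshtein(a, b)) / longest


# SubString[3,0] drops the first three characters of each value.
left = "MM-John Smith"[3:]
right = "CC-Joan Smith"[3:]
print(left, "|", right)                 # John Smith | Joan Smith
print(levenshtein(left, right))         # 1
print(similarity_percent(left, right))  # 90.0
```

Under this assumed scoring, one substitution in a ten-character string gives exactly 90%, so the pair passes a threshold of 90% only if the comparison is inclusive.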
DelimitedField can be used to select a field from a string delimited by one or more characters. The delimiter is a Java regular expression specified as the first argument, while the second argument is an integer that specifies the zero-based index of the field to select. Since character escapes are allowed in the rule syntax, it is necessary to double escape any string literals that use characters reserved for regular expressions.
Record | Generic String |
---|---|
Record 1 | A-4567-AB |
Record 2 | A-7643227-AB |
GenericStringMatch.L2 = {Generic_String.DelimitedField["\\-",0].[ExactMatch]}
(selects the first item of an array of strings delimited by the '-' character, so both values will be 'A'. Note the double escaping in use. The first is to escape possible special character codes, the second because the string is a regular expression)
Record | Generic String |
---|---|
Record 1 | First item-4567-ATTR |
Record 2 | Second item-4567-ATTR |
GenericStringMatch.L2 = {Generic_String.DelimitedField["\\-",1].[ExactMatch]}
(selects the second item of an array of strings delimited by the '-' character, so both values will be '4567'. Note the double escaping in use. The first is to escape possible special character codes, the second because the string is a regular expression)
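The behaviour of DelimitedField can be approximated in Python as shown below. Python's `re` module stands in for the Java regular expression engine, the function name `delimited_field` is hypothetical, and returning an empty string for an index with no field is an assumption. Note that the rule syntax needs "\\-" because of its own escaping layer, whereas the Python raw string only needs the regular-expression escape.

```python
import re


def delimited_field(value: str, delimiter_regex: str, index: int) -> str:
    """Split on a regular expression and return the zero-based field."""
    fields = re.split(delimiter_regex, value)
    # Assumption: an index outside the split result yields an empty string.
    return fields[index] if 0 <= index < len(fields) else ""


print(delimited_field("A-4567-AB", r"\-", 0))              # A
print(delimited_field("First item-4567-ATTR", r"\-", 1))   # 4567
print(delimited_field("Second item-4567-ATTR", r"\-", 1))  # 4567
```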
The 'Contains' filter selects the shorter value from both records if the value in the shorter is contained within the longer, otherwise the values pass through the filter unchanged.
Record | Generic String |
---|---|
Record 1 | Blah |
Record 2 | I contain Blah here |
GenericStringMatch.L2 = {Generic_String.Contains.[ExactMatch]}
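A minimal Python sketch of the Contains behaviour described above follows. The function name `contains_filter` is hypothetical, and the comparison is assumed to be a plain case-sensitive substring test on the values as they reach the filter.

```python
from typing import Tuple


def contains_filter(a: str, b: str) -> Tuple[str, str]:
    """Return the pair of values that is passed on to the comparator."""
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    if shorter in longer:
        # Both sides become the shorter value, so an exact match succeeds.
        return shorter, shorter
    # Otherwise the values pass through unchanged.
    return a, b


print(contains_filter("Blah", "I contain Blah here"))  # ('Blah', 'Blah')
print(contains_filter("Blah", "Nothing shared"))       # ('Blah', 'Nothing shared')
```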
Filters may be chained together. There's no limit to the number, but we recommend using no more than three.
Record | Generic String |
---|---|
Record 1 | First item-4588-ATTR |
Record 2 | Second item-4599-ATTR |
GenericStringMatch.L2 = {Generic_String.DelimitedField["\\-",1].SubString[0,2].[ExactMatch]}
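Chaining can be pictured as function composition: each filter receives the output of the previous one. The Python sketch below illustrates this for the rule above, with hypothetical helper names and a simplified SubString that only handles a zero offset with a positive length.

```python
import re
from functools import reduce
from typing import Callable, List

Filter = Callable[[str], str]


def delimited_field(pattern: str, index: int) -> Filter:
    """Build a filter that selects one regex-delimited field."""
    def apply(value: str) -> str:
        fields = re.split(pattern, value)
        return fields[index] if 0 <= index < len(fields) else ""
    return apply


def start_sub_string(count: int) -> Filter:
    """Simplified SubString[0, count]: keep the first `count` characters."""
    return lambda value: value[:count]


def apply_chain(value: str, filters: List[Filter]) -> str:
    """Run the filters left to right, as they appear in the rule."""
    return reduce(lambda current, f: f(current), filters, value)


chain = [delimited_field(r"\-", 1), start_sub_string(2)]
first = apply_chain("First item-4588-ATTR", chain)
second = apply_chain("Second item-4599-ATTR", chain)
print(first, second, first == second)  # 45 45 True
```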
These are element rule examples that focus on how different elements, modifiers and comparators can be used when designing the rules:
Initial vs full name
Record | Forename | Surname |
---|---|---|
Record 1 | Robert | Brooke |
Record 2 | R | Brooke |
Name.L2 = {Forenames.ForenameCompare[InitialVsFullName] & Surname[ExactMatch]}
Minor street number
Record | MinorStreet_Number | MinorStreet_Description | MinorStreet_Type |
---|---|---|---|
Record 1 | 123 | Burnthouse | Lane |
Record 2 | 123a | Burnthouse | Lane |
StreetAddress.L1 = {MinorStreet_Number.PremiseCompare[NumberMatchWithTrailingAlpha] & MinorStreet_Description[ExactMatch] & MinorStreet_Type.StandardAbbreviation[ExactMatch]}
Postcode
Record | MinorStreet_Description | MinorStreet_Type | Locality | Postcode |
---|---|---|---|---|
Record 1 | Hints | Road | Tamworth | B78 3AB |
Record 2 | Hints | Road | Tamworth | B78 3AT |
Address.L2 = {Building_Number[ExactMatch] & MinorStreet_Description[ExactMatch] & Locality[ExactMatch] & Postcode.PostcodeCompare[Part1Match]}
Cross-field matching allows you to match across multiple fields of the same type to find potential duplicates.
For example, if your data consists of three phone number fields (e.g. home, work and mobile number), you can configure the Find duplicates step to find potential phone number matches across all three of them.
To configure the Find duplicates step to perform cross-field matching, you have to set up your blocking keys and rulesets in such a way that your groups of similar data will be blocked and scored together.
You have multiple address fields (home and billing) and want to identify potential duplicates where the home address of one record matches the billing address of another record.
RECORD ID | NAME | HomeAddress ADDRESS | HomeAddress LOCALITY | HomeAddress POSTCODE | BillingAddress ADDRESS | BillingAddress LOCALITY | BillingAddress POSTCODE |
---|---|---|---|---|---|---|---|
1 | John Smith | 1 High Street | London | SW4 0QL | 48 Webber Road | Brighton | BN3 1EJ |
2 | John Smith | 12 Acacia Avenue | London | E1W 2BB | 1 High Street | London | SW4 0QL |
Using the default rules and blocking keys for GBR and modifying them to consider these two addresses in separate groups (HomeAddress and BillingAddress), the two records will not be matched together since the home address for the two records will not block together or evaluate as a match in the rules. Similarly, the two records would still not match together based on the billing address group since the billing addresses are also different.
The desired behavior is for the following cross-field operations to be performed (with the operation in bold causing a match using the above example):
To achieve this, add "elementGroups": ["HomeAddress","BillingAddress"] to each address component of each blocking key definition, and add #[HomeAddress,BillingAddress] to the address rules of the final match level rules.
Blocking keys
The blocking key definition below is from the default GBR individual blocking keys, combining the surname initial, street number and locality into a blocking key. By default, this will create keys for a single address. However, it can be modified to create keys for multiple addresses using the HomeAddress and BillingAddress groups, as shown below:
{
"description": "SurnameMinorStreetNumberLocality",
"countryCode": "GBR",
"elementSpecifications": [
{
"elementType": "SURNAME",
"algorithm": {
"name": "INITIAL"
},
"includeFromNChars": 1
},
{
"elementType": "MINORSTREET_NUMBER",
"elementGroups": ["HomeAddress","BillingAddress"],
"includeFromNChars": 1,
"truncateToNChars": 5
},
{
"elementType": "LOCALITY",
"elementGroups": ["HomeAddress","BillingAddress"],
"elementModifiers": [
"STANDARDSPELLING",
"DERIVED"
],
"includeFromNChars": 2,
"truncateToNChars": 30
}
]
}
This will now cause two blocking keys to be generated for every record:
Using the sample data, the following blocking keys will be created:
Since the first and last keys are identical, records 1 and 2 will now be blocked together and identified as a candidate pair for scoring in the rules.
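The effect of listing both groups can be sketched in Python as shown below. This is an approximation under stated assumptions: the record layout and function name are hypothetical, the address elements are supplied already parsed and upper-cased, and element keys are assumed to be concatenated without separators. The key strings it prints are therefore illustrative rather than the exact values Data Studio would produce.

```python
from typing import Dict, List, Set


def keys_for_record(surname: str,
                    addresses: Dict[str, Dict[str, str]],
                    groups: List[str]) -> Set[str]:
    """Build one SurnameMinorStreetNumberLocality key per address group."""
    keys: Set[str] = set()
    for group in groups:
        address = addresses.get(group, {})
        number = address.get("MINORSTREET_NUMBER", "")
        locality = "".join(address.get("LOCALITY", "").split())
        # includeFromNChars: 1 for surname and number, 2 for locality.
        if len(surname) < 1 or len(number) < 1 or len(locality) < 2:
            continue
        # INITIAL on the surname, then the truncateToNChars limits.
        keys.add(surname[:1] + number[:5] + locality[:30])
    return keys


groups = ["HomeAddress", "BillingAddress"]
record1 = keys_for_record("SMITH", {
    "HomeAddress": {"MINORSTREET_NUMBER": "1", "LOCALITY": "LONDON"},
    "BillingAddress": {"MINORSTREET_NUMBER": "48", "LOCALITY": "BRIGHTON"},
}, groups)
record2 = keys_for_record("SMITH", {
    "HomeAddress": {"MINORSTREET_NUMBER": "12", "LOCALITY": "LONDON"},
    "BillingAddress": {"MINORSTREET_NUMBER": "1", "LOCALITY": "LONDON"},
}, groups)

print(sorted(record1))    # ['S1LONDON', 'S48BRIGHTON']
print(sorted(record2))    # ['S12LONDON', 'S1LONDON']
print(record1 & record2)  # {'S1LONDON'}
```

The shared key is what places the two records in the same block, so they become a candidate pair.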
Rules
The rule snippet below is from the default GBR individual ruleset that has been amended to include cross-field support as an example:
Match.Exact={Name.Exact & #[HomeAddress,BillingAddress].Address.Exact}
This will now cause the following rule evaluations to be performed:
Fields A | | Fields B | | Result |
---|---|---|---|---|
Record1_Name + Record1_HomeAddress | vs. | Record2_Name + Record2_HomeAddress | ⇒ | FALSE |
Record1_Name + Record1_BillingAddress | vs. | Record2_Name + Record2_BillingAddress | ⇒ | FALSE |
Record1_Name + Record1_HomeAddress | vs. | Record2_Name + Record2_BillingAddress | ⇒ | TRUE |
Record1_Name + Record1_BillingAddress | vs. | Record2_Name + Record2_HomeAddress | ⇒ | FALSE |
The third case will be true since the home address of record 1 matches the billing address of record 2 as an exact match.
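The four evaluations in the table can be reproduced with a short Python sketch. It is a simplification under stated assumptions: names and addresses are compared with plain equality rather than the real Name.Exact and Address.Exact rules, and the record layout and function name are hypothetical.

```python
from itertools import product
from typing import Dict, List, Tuple


def cross_field_match(rec_a: dict, rec_b: dict,
                      groups: List[str]) -> Dict[Tuple[str, str], bool]:
    """Evaluate name plus address for every pairing of the listed groups."""
    results = {}
    for group_a, group_b in product(groups, repeat=2):
        results[(group_a, group_b)] = (
            rec_a["NAME"] == rec_b["NAME"]
            and rec_a[group_a] == rec_b[group_b]
        )
    return results


record1 = {
    "NAME": "John Smith",
    "HomeAddress": ("1 High Street", "London", "SW4 0QL"),
    "BillingAddress": ("48 Webber Road", "Brighton", "BN3 1EJ"),
}
record2 = {
    "NAME": "John Smith",
    "HomeAddress": ("12 Acacia Avenue", "London", "E1W 2BB"),
    "BillingAddress": ("1 High Street", "London", "SW4 0QL"),
}

results = cross_field_match(record1, record2, ["HomeAddress", "BillingAddress"])
for (group_a, group_b), matched in results.items():
    print(f"Record1 {group_a} vs. Record2 {group_b}: {matched}")
print("Overall match:", any(results.values()))
```

Only the pairing of record 1's home address with record 2's billing address evaluates to true, which is enough for the overall match in this sketch.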