A Workflow is a sequence of connected steps that defines a process of transforming and manipulating your data. Data Studio provides a number of default Workflow steps but you can also create custom ones. The area where Workflows are viewed or edited is called the Workflow canvas.
If your Workflow is complex or tackles many actions, it can become difficult to manage and test. You can resolve this by creating re-usable Workflows.
Duplicate data to multiple outputs.
The rows from the source data will be duplicated (branched) into two outputs. To create more than two duplicates of the data, you can link several Branch steps.
Visualize your data as a bar, pie, or gauge chart. Export as a PDF or an image file (.JPG, .PNG, .SVG). Find out more about the Chart step.
This step allows you to convert column(s) into rows by clicking Selected columns in the column list on the left.
Specify the following for the new columns:
To do the reverse, use the Rows to columns Workflow step.
Compare two inputs using one column as the Record key. The Record key is used to identify which records should be compared across the two inputs. If a Record key value is duplicated in either file, that record will not be compared.
By default, values in all columns that are found in both inputs (the column name must match exactly) are compared. To exclude any columns from the comparison, click Columns.
Data can be compared using a number of Functions on individual columns between the two Datasets:
If all comparisons across a record pass, the record is considered as Unchanged in the result output.
The Show results by key output displays two columns:
The Show summary output displays the detailed metrics of the comparison:
The Show detail output displays the column level results for each compared record alongside the corresponding values from the two inputs.
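As a rough analogy only (this is not the step's actual implementation, and the table names, compared columns and result labels below are hypothetical), the record-level pairing can be pictured in SQL as a full outer join on the Record key:
-- Hedged sketch: pair the two inputs on the Record key and classify each record
SELECT COALESCE(a.record_key, b.record_key) AS record_key,
       CASE
         WHEN b.record_key IS NULL THEN 'Only in input 1'
         WHEN a.record_key IS NULL THEN 'Only in input 2'
         WHEN a.name = b.name AND a.email = b.email THEN 'Unchanged'  -- all column comparisons pass
         ELSE 'Changed'
       END AS comparison_result
FROM input_1 a
FULL OUTER JOIN input_2 b ON a.record_key = b.record_key;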
This step allows you to export your Workflow's output data to a file or an External system that uses a JDBC connection.
Each Workflow can contain multiple Export steps, so you can export data at any point. Each of these steps can be configured individually.
Click Export Settings in the step dialog to specify where and how the data will be exported.
You can specify a delimiter and filename extension for CSV files (created in an Export step or downloaded from a grid of data):
Navigate to the installation folder (by default C:\Program Files\Experian\Aperture Data Studio {version}) and open the server.properties file.
On a new line in the file, add the setting you want to modify:
Setting name | Default value | Description |
---|---|---|
Data.csvDelimiter | , | Any string; special keywords can be used (see below) |
Data.csvFilenameExtension | csv | All non-alphanumeric characters will be removed |
Keyword | Value |
---|---|
BAR | ¦ |
COLON | : |
COMMA | , |
NONE | \0 |
PIPE | \| |
SEMICOLON | ; |
SPACE | ' ' |
TAB | \t |
For example, to change the delimiter to tab, the entry in server.properties should be:
Data.csvDelimiter=TAB
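Similarly, to change the filename extension used for exported CSV files (txt is just an illustrative value; any alphanumeric string is accepted), the entry could be:
Data.csvFilenameExtension=txt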
"INSERT INTO tablename (column1, …) VALUES (value1, …)"
"UPDATE tablename SET (column1=value1, …) WHERE (columnName=value, … )"
"DELETE FROM tablename WHERE columnName=value, … "
Filter your data by selecting a column and entering a comparison value to match against using an operation like Equals, Contains or In Range. To filter on multiple columns and/or multiple comparison values, create an advanced Function that returns a boolean (true or false) value.
In Explore mode, right-click on a value in a column and select one of the options Keep or Remove to show or exclude all rows containing the chosen value.
The Split Workflow step is the same as Filter but has the benefit of two outputs to see the rows that pass the filter and the rows that fail the filter.
The Find duplicates step uses powerful standardization and matching algorithms to group together records containing similar contact data (e.g. name, address, email, phone) and keep that information within a duplicate store. Each group of records, known as a cluster, is assigned a unique cluster ID and a match level. The step provides out-of-the-box functionality for the United Kingdom, Australia, and the United States, but is also completely configurable down to the most granular name and contact elements.
Find out how to use, configure and troubleshoot this step.
The output will display the records of the clusters that have had at least one record inserted or updated. A record is either:
The Find duplicates delete step provides the ability to bulk delete records within a duplicate store and update its clusters appropriately.
A Duplicate store must be provided, as well as an input column whose values will be used as the IDs of the records to be deleted.
By default, if an input contains one column with the Unique ID tag then that column will be automatically selected.
The output will display the records of the clusters that have had at least one record deleted. A record is either:
The Find duplicates query step provides the ability to search for a collection of records in a Duplicate store using records in any column.
The values in the input column(s) are used to search the store for matching records.
A Duplicate store must be provided as well as at least one input column. Any number of columns can be selected, as long as they exist in the duplicate store.
By default, if the input contains one or more columns that are tagged, these columns will be automatically selected.
Rules can also be changed or edited prior to submitting a search query.
The output will display the clusters that have had at least one record found. One of the output columns will be the search match status that corresponds to the match level of the input record against the found record(s) in the store.
Fires a user-defined event from the Workflow.
You can kick off custom events at any point in your Workflow.
Tick the Only fire if input has rows checkbox for the step to be kicked off only when the input has data.
Select the required Custom event then specify values or assign a Workflow parameter for the available event data items (the event data name will be displayed).
For example, you want to kick off an event when the Validate step has failures. Select the 'Validation failed' custom event and specify the reason for the failure in the 'Reason' event data item. This reason will now be available as an event data item when creating/editing notifications.
Another example is using a Filter step to identify records that you have particular interest in. The Fire event step can then follow the Filter step but only fire when the Filter step is producing records.
Grouping will reduce the number of rows in a table.
The default output of a Group step is the Count aggregate, which returns the total count of rows. This count will then be broken down by each unique value in the group column.
Moving columns from ungrouped to grouped will show the distinct/deduplicated values in the column (along with the optional Count of the number of times each combination of values appears in the data).
The Add aggregate button is used to create one or more new columns containing a value or metric for each group.
Give the new aggregate column a name or optional suffix value if applying an aggregate on Multiple columns, then select the type of aggregate to use.
Aggregate | Description |
---|---|
Count | Returns a count of the number of items in each group. |
Count excluding nulls | Returns a count of the number of items in each group ignoring null values. |
First value | Returns the first value in the column for each group. |
First populated value | Returns the first value in the column for each group that is not null. |
Last value | Returns the last value (based on input row order) in the column for each group. |
Minimum value | Returns the minimum value in the column for each group. |
Maximum value | Returns the maximum value in the column for each group. |
Grouped values | Returns a comma separated list of all values in each group. The order is as they appear in the input. |
Sorted grouped values | Returns a comma separated list of all values in each group sorted A to Z. |
Reverse grouped values | Returns a comma separated list of all values in each group sorted Z to A. |
Average | Returns the average (mean) of the values in each group. |
Median | Returns the median (based on a sort of the values) of the values in each group. |
Sum | Returns the sum of the values in each group. |
Standard deviation | Returns the standard deviation of the values in each group. |
Standard deviation (population) | Returns the standard deviation of the values in each group where it is known that the values represent the whole population. |
Variance | Returns the statistical variance of the values in each group. |
Variance (population) | Returns the statistical variance of the values in each group where it is known that the values represent the whole population. |
Group ID | Returns a numeric value that can be used to identify each group. |
Non-numeric values will be excluded (ignored) from numeric aggregation calculations. Zero will be returned by default if the values are all non-numeric.
In Explore mode you can drill into a particular group to see the rows that belong to it by right-clicking and selecting View rows in group.
This ability to drill down into grouped values will be retained if the step is connected to any of the following subsequent steps: Branch, Chart, Export, Filter, Fire event, Output, Sample, Sort, Split, Take Snapshot, Transform and Union.
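For readers who think in SQL, a Group step behaves much like a GROUP BY query. A hedged sketch only (not how Data Studio executes the step), assuming hypothetical year and sales columns, with year grouped and both a Count and a Sum aggregate added:
-- One output row per distinct year, with a row count and a sum of sales
SELECT year,
       COUNT(*) AS "Count",
       SUM(sales) AS "Sum of sales"
FROM input
GROUP BY year;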
Harmonization is the process of deduplication: taking multiple records that are deemed to represent the same underlying real-world subject and producing a single resulting record.
Join two inputs into one based on one or more columns from each input.
To specify the columns you want to join, click Left side columns or Right side columns.
Use the Retention type menus to specify different combinations of join. The top menu is for the first join input and the bottom one for the second join input.
Click on the Venn diagram to specify the type of join you want to apply:
Above the Venn diagram select More options to expose more detailed options for the join.
The Join step provides three matching options which determine how keys will be compared:
The Do not match nulls option configures how null values are handled in the join. When enabled, left or right column values containing nulls will never match with any other row. This means that rows where the join key contains nulls will never appear in inner join results and will always appear in left/right unmatched results if the selected join type will include left/right unmatched rows.
In the join dialog, click Show step results to view the results of the join. Specify which values from the joined columns to use in any following steps: all the matches, only duplicates or only singletons.
Click to see suggested join keys. Joins will be suggested only where column names match exactly on the two sides of the join. Click on a suggested join to apply it.
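For readers who think in SQL, the retention types correspond loosely to standard join types, and Do not match nulls mirrors SQL's own NULL comparison semantics. A hedged sketch, using hypothetical tables and a single join key:
-- Inner join: only rows where the key matches on both sides
SELECT * FROM left_input l INNER JOIN right_input r ON l.id = r.id;
-- Left outer join: all left rows, plus matching right rows where found
SELECT * FROM left_input l LEFT JOIN right_input r ON l.id = r.id;
-- In SQL, NULL = NULL is never true, so rows with null keys never match:
-- this is the behaviour you get when Do not match nulls is enabled.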
Allows you to match values from a column in the first input (source) to a column in the second input (the lookup table / lookup input). Values are returned from a second column from the lookup table after an operation is applied to the set of values for the matched rows.
You can use any source in a Workflow as the input for this step: a Dataset, View, or the output from another step. You can even use the same output node for both the source and the input (if you want to do a self-lookup).
To add multiple lookup tables, click Add additional lookup table.
Click Definitions to open up the configuration panel, and Add definitions to allow you to define a new lookup. Multiple lookup definitions can be created in a single Lookup step. A lookup definition consists of the following options:
The name of the resulting new column from the lookup definition.
A lookup type is the operation applied to the value (or values, if more than one match exists in the lookup table) returned by the Lookup. These types/operations are applied to the set of values for the return column where the lookup column matches the lookup value.
Lookup type | Description |
---|---|
Count | Returns a count of the number of matches that the lookup value has in the lookup column. |
Count excluding nulls | Returns a count of the number of matches that the lookup value has in the lookup column ignoring nulls. |
Exists | Returns true if the lookup value has a match in the lookup column, false if not. |
Not exists | Returns false if the lookup value has a match in the lookup column, true if not. |
First value | Returns the first value (based on input row order) from the return column where the lookup value matches the lookup column. |
First populated value | Returns the first non-null value (based on input row order) from the return column where the lookup value matches the lookup column. |
Last value | Returns the last value (based on input row order) from the return column where the lookup value matches the lookup column. |
Minimum value | Returns the minimum value (based on a sort of the values) from the return column where the lookup value matches the lookup column. |
Maximum value | Returns the maximum value (based on a sort of the values) from the return column where the lookup value matches the lookup column. |
Sum | Returns the sum of the values in the return column where the lookup value matches the lookup column. Non-numeric values will be counted as '0'. |
Average | Returns the average (mean) of the values in the return column where the lookup value matches the lookup column. Non-numeric values will be excluded from the calculation. Returns '0' by default if the set of return column values are all non-numeric. |
Median | Returns the median (based on a sort of the values) of the values in the return column where the lookup value matches the lookup column. |
Variance | Returns the statistical variance of the values in the return column where the lookup value matches the lookup column. Non-numeric values will be excluded from the calculation. Returns '0' by default if the set of return column values are all non-numeric. |
Variance (population) | Returns the statistical variance of the values in the return column where the lookup value matches the lookup column where it is known that the values represent the whole population. Non-numeric values will be excluded from the calculation. Returns '0' by default if the set of return column values are all non-numeric. |
Standard deviation | Returns the standard deviation of the values in the return column where the lookup value matches the lookup column. Non-numeric values will be excluded from the calculation. Returns '0' by default if the set of return column values are all non-numeric. |
Standard deviation (population) | Returns the standard deviation of the values in the return column where the lookup value matches the lookup column where it is known that the values represent the whole population. Non-numeric values will be excluded from the calculation. Returns '0' by default if the set of return column values are all non-numeric. |
All values | Returns a comma separated list of all values in the return column where the lookup value matches the lookup column. The order is as they appear in the input. |
Reverse values | Returns a comma separated list of all values in the return column where the lookup value matches the lookup column. The order is a reverse sort of the values. |
Sorted values | Returns a comma separated list of all values in the return column where the lookup value matches the lookup column. The order is a sort of the values. |
The column from the lookup table that you want the above operations to act on. This option is not present if you have selected the Exists or Not Exists lookup types, as no value from the lookup table is returned.
This is where you define how you want to match values between your lookup value and lookup column. The selected match type will apply to both sides.
Match type | Description |
---|---|
Exact | Will only match if the value and datatype exactly match. |
Ignore datatype | Will match if the values match, ignoring the datatype (e.g. alphanumeric '1' will match numeric value 1). |
Case-insensitive | Will match if the values match, ignoring the casing (e.g. 'experian' will match 'experIAN'). |
Denoise | Will match if the values match after special characters / whitespace etc. are removed (e.g. 'Experian' will match 'Ex'per ian'). |
Format | Will match if the formats of the two values are the same (e.g. '123abc' will match '456def'). |
Soundex | Will match if the values match after going through the Soundex phonetic algorithm. |
Metaphone | Will match if the values match after going through the Metaphone phonetic algorithm. |
Double metaphone | Will match if the values match after going through the Double metaphone phonetic algorithm. |
If selected, each value from the return column will only be used once per lookup value match group. Only present for lookup types which have the potential to operate across multiple values, for example Average and All values.
The value that is returned if no match for the lookup value is found. This field can be left blank in which case a null value is returned. Some lookup types will never use this (e.g. Exists).
The column from the lookup table that contains the values you want to match on.
The column from the source that contains the values you want to match on.
Allows you to define additional pairs of lookup columns (additional match criteria) to further narrow matches.
A source Dataset contains customer information alongside a column with a "discount code":
Customer ID | Code |
---|---|
1 | A |
2 | A |
3 | B |
4 | C |
5 | A |
6 | 9 |
Another Dataset contains the "discount code" value alongside the rate that the code represents. This will be the data that we look up into:
Discount Code | Rate |
---|---|
A | 5 |
B | 10 |
C | 15 |
We want to define a lookup that will return the appropriate rate for each discount code. The rate will be added as a new column in the customer (source) Dataset.
To do this, in a Workflow add a Lookup step with the customer dataset as the source input and the discount code dataset as the lookup input.
The lookup definition is set up as follows:
Resulting table:
Customer ID | Code | Discount Rate |
---|---|---|
1 | A | 5 |
2 | A | 5 |
3 | B | 10 |
4 | C | 15 |
5 | A | 5 |
6 | 9 | NO MATCH |
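For comparison, the same lookup could be sketched in SQL as a scalar subquery per source row (an analogy only, not how the Lookup step is implemented; the table and column names are hypothetical):
SELECT c.customer_id,
       c.code,
       COALESCE(
         (SELECT CAST(d.rate AS VARCHAR(10))   -- return column
          FROM discount_codes d
          WHERE d.discount_code = c.code),     -- lookup column matches the lookup value
         'NO MATCH')                           -- value returned when no match is found
       AS discount_rate
FROM customers c;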
Lookups and Joins can both achieve a lot of the same things when combining two data sources. However, their performance is optimized for different scenarios, so for large data volumes it is important to understand these differences before choosing which one to use in your Workflow.
In general, we recommend lookups for use cases where the lookup table is small (e.g. a domain file of a few thousand, typically unique, values) or when you need to perform an operation on the results (e.g. calculate an average of the matched values). Joins are better suited for large-to-large file operations.
More specifically, for high-cardinality columns (e.g. true unique IDs), lookups can take at least twice as long as an equivalent join. For low-cardinality (i.e. low uniqueness), high row count keys, lookup indexes will build quickly compared to the corresponding join index.
This step is used to map columns in your data to a target schema definition. The target schema must be a Dataset that is already loaded into Data Studio.
Map to target is often used in combination with the Export step to populate a target database schema but can also be used to rename multiple columns easily.
Auto-map functionality will attempt to map columns automatically where possible, based on column names in the source (input columns) and the target (output columns). You can also map columns manually.
If an input column is not selected in the mapping for a given target column, the values in the target column will be null.
If Include in output is unchecked for a target column, that column is not included in the step's output. Use this setting when mapping to a schema that includes auto-generated columns in the target database.
When a re-usable Workflow uses the Output step, the step's data will be available in other Workflows.
Click Show step results to view the results from the previous step connected to the Output step.
Profile your data at any point in a Workflow.
We examine each value and identify various attributes (such as type, min/max values, count) and determine its uniqueness and completeness.
Click Profiled columns to select the column(s) you want to profile. In the profiled columns list on the left, select the required ones and click Apply. To see the results, click Show step results in the step dialog.
You can now export or take a snapshot of your profiled data to track data uniqueness over time.
When profiling data, you can select a column and choose Values or Formats to drill into the value or format information identified from the profiling process, and from there drill down again to the individual rows that contain that selected value or format.
The ability to drill down into the Profile step results will be retained if the step is connected to any of the following subsequent steps:
You can view the drilldown using the right-click options or the toolbar on these steps.
The results of the profiling can be saved to a Dataset by including a Take Snapshot step in your Workflow. Furthermore, if the Interactive option is selected when creating a new Dataset on the Take Snapshot step, you can drill into the data to view the profiled values and formats in the Dataset. The underlying rows of the drilled down values or formats can also subsequently be drilled into.
Once an interactive Dataset has been created, it can be drilled into in various ways, such as:
Interactivity also works for multi-batch Datasets.
To further improve data quality, Data Studio can capture lineage metadata of data sources. Lineage metadata can be included in Workflow outputs and subsequently used for further integration or processing.
In the Profile step, lineage metadata can be included in the output using the dropdown under More Options.
View up to three inputs side by side (horizontally or vertically). Nest these steps for more flexibility over layout.
Click on Vertical split to change it to horizontal.
To make a report available to Data Consumer users, click Expose to Data Consumer users.
To view the report, click Show report.
This step allows you to convert row(s) into columns by clicking Settings.
The step results will consist of the new column(s) and all other columns except for the Column with rows to split in new columns and Column with rows to split in column values.
To do the reverse, use the Columns to rows Workflow step.
You can use this step together with Group and Export to create a pivot table.
For example, you have the following car sales data:
Year | Brand | Sold | Product ID | Model |
---|---|---|---|---|
2019 | Honda | 10 | 1 | HRV |
2019 | Mazda | 15 | 2 | CX5 |
2020 | Honda | 7 | 3 | Civic Si |
2019 | BMW | 3 | 4 | 3 Series |
2019 | Honda | 5 | 5 | CRV |
2020 | Mazda | 50 | 6 | Mazda 3 |
What you're interested in is the amount of Honda and Mazda sales per year:
Year | Honda | Mazda |
---|---|---|
2019 | 15 | 15 |
2020 | 7 | 50 |
To achieve that:
Add a Source step with the relevant Dataset.
Connect the Rows to columns step.
Click Settings to configure it:
- Select Brand as the Column to split.
- Select Sold as the Values column.
- Click Profile for new values. The values Honda, Mazda, BMW should be automatically populated.
- Remove the BMW value and save changes.
Connect the Group step.
Click Grouping to configure it:
- Remove Count from the Grouped and aggregated columns.
- Move Year from the Ungrouped columns to the Grouped and aggregated columns.
- Click Add aggregate and enter the following values:
  - Column selection type: Multiple
  - Selected columns: Honda and Mazda
  - Aggregate: Sum
- Click Apply twice, to save changes and to apply changes to the groupings.
Connect the Export step.
Click Show step results to preview the results and export as required.
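The Rows to columns plus Group combination above is equivalent to a conditional-aggregation pivot. A hedged SQL sketch of the same result, assuming a hypothetical car_sales table holding the source data:
SELECT year,
       SUM(CASE WHEN brand = 'Honda' THEN sold ELSE 0 END) AS honda,
       SUM(CASE WHEN brand = 'Mazda' THEN sold ELSE 0 END) AS mazda
FROM car_sales
GROUP BY year;
-- Returns 2019: 15/15 and 2020: 7/50, matching the table above.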
Limit the number of rows or take a random sample of rows.
Examples for consecutive rows:
Examples for non-consecutive rows:
Sort your data in ascending/descending order.
Move columns from 'unsorted columns' to 'sorted columns', then set the sort order (min to max or max to min) for each column using Flip sort direction or the up/down arrow on the right of each 'sorted column'.
Select a 'sorted column' and use the 'move up'/'move down' buttons to change the priority order if sorting on multiple columns.
Your data source, often the first step in a Workflow. Select the required Dataset, View or Issue list to start manipulating and transforming your data.
The options you see in the Source step will depend on the properties of the source you select.
Enable Can supply source when executed to make the data source replaceable, so that a different source can be provided when the Workflow is executed, or when the Workflow is made re-usable and embedded in a parent Workflow.
Enable the Delete batches on completion setting to automatically remove processed batches from a Dataset when the Workflow successfully completes. This option is only available if the Dataset is multi-batch and has the Allow automatic batch deletion option enabled.
Enable the Allow auto-refresh setting to automatically load a new batch of data into the selected Dataset when the Workflow is executed. If a View is selected, its underlying Dataset will be refreshed, even if that Dataset is in a different Space.
When Allow auto-refresh is enabled you have the option to Stop execution when refresh failed. This setting causes a Workflow execution to fail if the refresh could not be completed, for example if the External system could not be accessed due to network connection or authentication issues. When the setting is unchecked, the Workflow will process the existing data if the Dataset could not be refreshed.
To further improve data quality, Data Studio can capture lineage metadata of data sources. Lineage metadata can be included in Workflow outputs and subsequently used for further integration or processing.
In the Source step, the available lineage metadata depends on the source type and can be included in the output using the dropdown under More Options after a Dataset has been selected.
Each row in a Dataset has a row ID which is guaranteed to be unique across all Datasets. These IDs can optionally be included in the output of the Source step. The IDs are generated in ascending order over time, so it's possible to use them to sort rows of data by the time they were first loaded into Data Studio.
This feature allows the Source step to output only a set of batches for processing in the Workflow instead of all of them. Select a multi-batch Dataset as the source for these options to be available.
This option will allow you to output one of the following:
This option will only output the last batch created within each interval of the time unit provided.
For example, if you have three batches with these timestamps:
Batch 1: 2020-01-01 9:00
Batch 2: 2020-01-01 11:00
Batch 3: 2020-01-01 11:59
Depending on the unit of time selected, the following would be the output:
Day: Batch 3
Hour: Batch 1 and Batch 3
Minute: Batch 1, Batch 2 and Batch 3
### Include resolved issues
For Issue lists, enable the Include resolved setting if issues that have already been marked as resolved should be processed in the Workflow.
Combine two inputs into one by presenting the values next to each other.
Data in the columns is not combined or merged; duplicate column names are renamed with an appended numeric value. The output – combined values from two sources – can then be used as input to any further step(s).
Click Show step results to view the results.
Apply a true/false filter to one input, splitting it into passing and failing rows.
To create a filter, click Filter in the step dialog and Create in the left-hand menu. Specify the necessary filters and click Apply to save the changes.
Click Show passing/Failing rows to view the values that passed/failed the filter. The passing and failing results can be individually linked to any further step(s).
Snapshots are copies of data that allow you to track changes over time or store your results for use elsewhere. A Snapshot is a type of Dataset that's created by and stored in Data Studio.
Connect the Take Snapshot step's input to the data you'd like to capture in a Snapshot. When the Workflow is run, a new batch of data will be created.
This data can then be added to an existing or new Dataset:
Select an existing one from the Dataset list.
Use Create new Dataset to create a new one.
Snapshots can be used like other Datasets:
You may want to update the schema if columns have been added/removed when writing data to a Snapshot.
A summary of the unresolved changes will appear as a warning on the step. To fix this, you can change the input or resolve the conflicts. Click Update schema: {X} schema change(s) to open the Update Schema dialog (where {X} is the number of changes found) to see a consolidated list of existing, new and missing columns.
You can now:
Click Apply to update the schema. The step will become valid and you'll be able to view the step results/run the Workflow.
Manipulate and configure your data.
This step allows you to add/remove various Functions and arrange or exclude columns.
Clicking the Columns row on any Transform Workflow step will open the panel where you can select the columns you want to work with.
Moving columns allows you to change the order in which they appear in the data. This can be done by selecting the columns and pressing the up/down arrow button next to the columns list.
Columns can also be moved by dragging them up/down the list.
Rename a column by selecting the column and clicking Rename on the toolbar. You will then be presented with a modal box prompting you to enter the name and an optional summary.
If multiple columns are renamed at the same time, the modal box will provide you with an opportunity to apply a common prefix or suffix to the originating column name. Alternatively, you will be able to rename all columns individually.
Select the columns you want to exclude and click Exclude in the toolbar. This can also be done by hovering over the column name and clicking the inline icon.
To make columns visible again, select the columns and click the Include button in the toolbar.
A column can be transformed by applying one or many Functions to it. These Functions range from a simple action, such as converting a column to uppercase, to more complex multiple column operations.
Whatever it may be, transforming a column can be done to create a new column or to change the data in an existing one.
Before applying any custom transformations, you can use the Suggest option to analyze the data being explored and return some recommended transformations based on its content.
Any of these suggestions can be applied to the data by checking their selection boxes and clicking Apply.
The transformation suggestions can also be created as new columns rather than replacing the original data, by checking New column? in the table and specifying a column name.
If the column already has a transformation Function applied you will first be prompted to choose whether to 'Append to existing functions' or 'Replace existing functions'.
Duplicating columns can be done by selecting the columns and clicking Duplicate on the toolbar. The duplicated column will be placed after the original column and will be renamed with the prefix 'Copy of '.
To remove the Functions applied to columns, select them and click Delete on the toolbar.
You can only completely delete a created column rather than just remove the transformation, because created columns can't exist without a Function.
Combine two or more inputs into one output by presenting the values on top of each other. The source columns with matching headings will be combined.
The output – combined values from two or more sources – can then be used as input to any further step(s).
Click Show step results to view the results.
Use business rules to measure the quality, completeness and accuracy of your data.
A rule is a Function that returns a true or false value, which translates to a pass or fail. This can be either a default Function (like 'Contains' or 'Is Date') or a custom one.
The easiest way to get started is to use Suggest validation rules, which profiles the data and then suggests formulas to apply to columns as validation rules. Uncheck any suggestions that are not useful, then apply to create rules separated into rule groups, ready for further changes if required.
In Explore mode, Profile your data to see the Suggest rules action button, which will create a new Workflow containing a Validate step.
Selecting Apply ruleset allows you to select any Rulesets that have been created and map columns to the rules where the names differ.
On the Validate step, click Rules to view and edit any existing rules and to Add rule or Add group.
Each rule has a name and summary, a parent rule group and a Function definition. The Function can either be a pre-existing Function in Data Studio, or a custom Function created in the Function builder. When building a custom Function, pick one of the relevant columns (or a parameter) as the input value and ensure the Function returns true/false as the result.
Rule groups represent logical collections of validation rules and can have different thresholds set. For example, a group of compliance rules may have a fail threshold of 99%, but a data quality group of rules may use a lower threshold and be increased over time as data and processes are improved.
The Status result column for each rule is based on these thresholds, so Red below the fail threshold, Green at or above the pass threshold, and Amber in between.
Each group has a name and description, pass and fail threshold, plus an optional column to weight the results. As well as counting the number of Passed rows and Failed rows, the results will also include Passed weight and Failed weight columns, which contain a sum of the chosen column for all the passing/failing rows. For example, weighting each rule's results by total sales allows you to prioritize data quality issues for the highest spending customers with the most overall impact.
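A hedged SQL sketch of how the weighted counts can be read, assuming a hypothetical results table with a boolean rule_passed flag and a total_sales weighting column:
-- Passed/Failed weight are sums of the weighting column over passing/failing rows
SELECT SUM(CASE WHEN rule_passed THEN 1 ELSE 0 END)           AS passed_rows,
       SUM(CASE WHEN rule_passed THEN total_sales ELSE 0 END) AS passed_weight,
       SUM(CASE WHEN rule_passed THEN 0 ELSE total_sales END) AS failed_weight
FROM rule_results;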
Rules can be set to ignore or skip values that do not need to be validated, and so should not impact the rule pass rate. For example, where 10 emails are being checked (5 valid, 2 invalid and 3 blank), the pass rate would be 5/7 (rather than 5/10 if all values were considered).
The Ignore Null values checkbox can be selected when first creating a rule using an existing Function. However, values other than null can also be ignored using the Ignore literal value when designing a custom Function.
You can use a scripting language to create new and manage existing validation rules, especially if you're making several changes at once. Click Edit script to open the script editor and make the required changes, such as renaming, re-ordering or changing rule groups.
To open a read-only version, click View script.
You can create Functions using the same functionality in the Function script editor.
A rule is made up of a rule group, group id, rule name, rule id, rule summary and Function in that order. For example:
## Rule_Group_1 {{c1e5ff0d}}
# Rule_1 {{d1f5ff0c}}
-- Summary of rule 1
IsEven(3)
Using the script editor:
Validation results are available in several formats:
Similar to the Source and Profile step, the Validate step has a Source metadata dropdown that enables you to include the lineage metadata of input data. This information (such as file name or batch ID) is useful to include as additional columns in the aggregated results.
This step validates and enriches addresses in bulk.
Addresses will be cleaned by verifying them against the official postally-correct address files for the relevant country. Cleaned addresses are assigned a match result, based on the accuracy of the original address. You can define layouts specifying the number, content and format of the address output columns. Choose one of the available Additional datasets to enrich your data. The datasets that are available to you depend on your license.
Validate emails based on the format or domain address.
Select the Email column and pick one of the two available Validation type options:
Format Check: Checks whether the value matches a valid email format. Returns either true or false.
Examples of valid and invalid email formats:
Format | Result |
---|---|
info@gmail.com | Valid |
first.second-name@gmail.com | Valid |
first.name+tag@gmail.com | Valid |
name@info@gmail.com | Invalid |
name"not"right@test.com | Invalid |
another.test.com | Invalid |
name@incorrect_domain.com | Invalid |
com.domain@name | Invalid |
first_address@gmail.com, second_address@gmail.com | Invalid |
Only one email can be validated at once; lists of emails as seen in the last example will be rejected.
Domain Level: Checks whether the value has a domain that exists and is an email server. This option returns both an overall validity result (true or false) in the Email domain: Result column, and additional information in the Email domain: Error column describing the reason for failure. The possible outcomes are:
Error | Result | Description |
---|---|---|
Valid | True | Domain exists and is a mail server. |
Bad format | False | Email format is invalid. |
Invalid domain | False | Domain validation check failed. The domain may not exist, or may have been flagged as illegitimate, disposable, harmful, nondeterministic or unverifiable. |
Invalid name | False | Local part validation failed. For example it may have been identified as a spam trap or role account such as "admin@server.com". |
Failed to perform DNS lookup | False | An error occurred when attempting to perform the DNS lookup. |
Domain level validation results are cached with results refreshed every 30 days. The cache validity is configurable in Settings > Workflow steps by changing the Email validation cache validity setting.
Click Show step results to view the results.
To learn more about validating phone numbers, please see the validate phone numbers step section.
Each step in the Workflow has the option to Show step results. This opens the grid so you can view the data up to that stage in the Workflow, Explore it and apply any Actions. If the Action(s) have a corresponding Workflow step, click Add to Workflow to insert a new Workflow step into the current Workflow from whichever step was previously opened.
Workflow steps can be hidden from the step list that is shown when editing a Workflow, so that only the steps relevant to the use cases of that Space are shown. Click the Spaces menu option, then Workflow steps, to turn both default and custom steps on or off in the list.
In addition to the default steps, there is an SDK that allows you to create your own custom Workflow steps to manipulate your data in a specific way, call out to an API to pull in values to enrich your data, or push out data to another application.
Workflow canvas is the main area of the page where Workflows are viewed or edited.
Designer users can build and optimize Workflows by dragging Workflow steps around the Workflow canvas.
Zoom In/Out within a Workflow to show more/fewer steps on a page. The current zoom-level is noted on the mini map in the bottom right corner. Hovering over a step in the mini map shows the step name and clicking on a step moves the canvas to show that step.
Each Workflow step has a menu with the Edit properties option which allows you to change:
Notes are very useful for adding detailed descriptions to a Workflow or around some specific Workflow steps. Notes are also shown in the Workflow mini map in the bottom right corner, which can aid navigation.
To add a note, double-click anywhere on the Workflow canvas. Change the color or resize the note then click away to save the note. Double-click an existing note to edit it.
Using Markdown text in a note allows you to style the contents, e.g. add links or create tables and lists.
For larger Workflows, it can be useful to search for a specific step using the keyboard shortcut Ctrl+F, which will highlight any matching steps and move the canvas to show any steps appearing off screen.
Create an HTML report, by clicking the Report button, containing a comprehensive and detailed summary of the Workflow:
There are different ways to execute (or run) a Workflow:
Executing a Workflow manually opens a dialog allowing you to specify the following execution options:
Executing a Workflow creates a Job in the Jobs list page.
A Workflow parameter is a placeholder value that can be used within different Workflow steps to apply values dynamically when running a Workflow. For example, create a mandatory parameter Initials that prompts a user to enter their initials which are then included as part of the filename on an Export step.