Create Workflows

A Workflow is a sequence of connected steps that defines a process of transforming and manipulating your data. Data Studio provides a number of default Workflow steps but you can also create custom ones. The area where Workflows are viewed or edited is called the Workflow canvas.

If your Workflow is complex or tackles many actions, it can become difficult to manage and test. You can resolve this by creating re-usable Workflows.

Duplicate data to multiple outputs.

The rows from the source data will be duplicated (branched) into two outputs. To create more than two duplicates of the data, you can link several Branch steps.

Visualize your data as a bar, pie, or gauge chart. Export as a PDF or an image file (.JPG, .PNG, .SVG). Find out more about the Chart step.

This step allows you to convert column(s) into rows by clicking Selected columns in the column list on the left.

Specify the following for the new columns:

  • Attributes column heading - this will be populated with the selected column names.
  • Values column heading - this will be populated with the values of the selected columns.

To do the reverse, use the Rows to columns Workflow step.
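
For illustration, here is a minimal Python sketch (not the step itself) of the unpivoted shape described above, using hypothetical column names:

    # Hypothetical columns; this only illustrates the Attributes/Values output shape.
    row = {"Email": "a@example.com", "Phone": "555-0100"}
    unpivoted = [{"Attributes": column, "Values": value} for column, value in row.items()]
    print(unpivoted)
    # [{'Attributes': 'Email', 'Values': 'a@example.com'},
    #  {'Attributes': 'Phone', 'Values': '555-0100'}]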

Compare two inputs using one column as the Record key. The Record key is used to identify which records should be compared across the two inputs. If a Record key value is duplicated in either file, that record will not be compared.

By default, values in all columns that are found in both inputs (the column name must match exactly) are compared. To exclude any columns from the comparison, click Columns.

Data can be compared using a number of Functions on individual columns between the two Datasets:

  • Equals - the values pass if they match. Can be configured to become non-strict equals, e.g. Ignore case.
  • Greater than - the values pass if the value from input 2 is greater than the value from input 1.
  • Greater than or equal - the values pass if the value from input 2 is greater than or equal to the value from input 1.
  • Less than - the values pass if the value from input 2 is less than the value from input 1.
  • Less than or equal - the values pass if the value from input 2 is less than or equal to the value from input 1.

If all comparisons across a record pass, the record is considered as Unchanged in the result output.

The Show results by key output displays two columns:

  • Key - the column for all the values found in both inputs based on what you've selected as the Record key
  • Result - the column showing the result of the record comparison. The result types are:
    • Unchanged: a record with the key is found in both inputs and the values in all compared columns match to the defined strictness, including any expected or accepted changes.
    • Changed: a record with the key is found in both inputs but at least one value in a compared column does not match to the defined strictness.
    • Missing: a record with the key is found in Input 1 but not in Input 2.
    • New: a record with the key is found in Input 2 but not Input 1.
    • Duplicate: multiple records with the key were found in at least one of the inputs, signifying that the key is not unique.
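
To make the classification concrete, the following is a minimal Python sketch (not Data Studio's implementation) that derives the same result types from two inputs, assuming a strict Equals comparison on every compared column:

    from collections import Counter

    def compare_inputs(input1, input2, key, compared_columns):
        """Classify each record key as Unchanged, Changed, Missing, New or Duplicate."""
        counts1 = Counter(row[key] for row in input1)
        counts2 = Counter(row[key] for row in input2)
        rows1 = {row[key]: row for row in input1}
        rows2 = {row[key]: row for row in input2}
        results = {}
        for k in set(counts1) | set(counts2):
            if counts1[k] > 1 or counts2[k] > 1:
                results[k] = "Duplicate"   # key is not unique in at least one input
            elif k not in rows2:
                results[k] = "Missing"     # found in input 1 but not in input 2
            elif k not in rows1:
                results[k] = "New"         # found in input 2 but not in input 1
            elif all(rows1[k][c] == rows2[k][c] for c in compared_columns):
                results[k] = "Unchanged"   # every compared column passes (strict Equals)
            else:
                results[k] = "Changed"     # at least one compared column does not match
        return results

    input1 = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}, {"id": 3, "name": "Cal"}]
    input2 = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Rob"}, {"id": 4, "name": "Dee"}]
    print(compare_inputs(input1, input2, key="id", compared_columns=["name"]))
    # {1: 'Unchanged', 2: 'Changed', 3: 'Missing', 4: 'New'}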

The Show summary output displays the detailed metrics of the comparison:

  • Total records in input 1
  • Total records in input 2
  • Duplicate keys
  • Missing records from input 2
  • New records in input 2
  • Total compared records
  • Unchanged records
  • Changed records
  • Columns compared
  • Columns with changes
  • For each compared column: Column name (Comparison function) changed

The Show detail output displays the column level results for each compared record alongside the corresponding values from the two inputs.

This step allows you to export your Workflow's output data to a file or an External system that uses a JDBC connection.

Each Workflow can contain multiple Export steps, so you can export data at any point. Each of these steps can be configured individually.

Click Export Settings in the step dialog to specify where and how the data will be exported.

Exporting to a file

  • Specify the Exporter (.csv, .txt, .xls, .xlsx or .json).
  • Choose the Target System - the export location. The default Server export directory will export files into the appropriate Space's export directory on the server. You can also export to an external file system (Apache Hadoop HDFS, Amazon S3, Microsoft Azure Blob Storage, or SFTP server) if one is configured.
  • Enter the name of the file or assign a Workflow parameter for Filename.
    • The Filename is made up of several components. By default, it will have two components; one for the name of the file and a second one for a timestamp.
    • The timestamp is the date and time at the point that the Workflow is executed.
    • You can add new components by using @ on your keyboard to select and enter a template item.
    • Components can be deleted by removing the template directly from the Filename.
    • Each component can either be custom text or a timestamp in one of the available datetime formats.
    • An example of the resulting filename is shown as you build up the filename components.
  • To remove the default extension (such as .csv) from exported files, un-select Include default file extension.
    • You can export files with a custom extension by excluding the default and specifying a custom extension at the end of the Filename.
  • Enable Overwrite existing to allow a file with the same name to be exported more than once (the latest copy will replace the previous one). If this option is disabled, you will get an error when trying to export to the same file name more than once.
  • Select Compression or encryption to export the output as one of the following:
    • None for a normal export file
    • Compressed only to export as a compressed file. Currently, only .zip files are supported. By default, the Compressed filename will be the same as the export filename, with the extension defined by the compression type. You can change this if required.
      • To add new components, use @ on your keyboard to select and enter a template item.
      • To delete components, remove the template directly from the Compressed filename.
      • Each component can either be custom text or a timestamp in one of the available datetime formats.
      • An example of the resulting filename is shown as you build up the compressed filename components.
    • Compress and encrypt to export as an encrypted file. By default, the Encrypted filename will be the same as the export filename, with the extension defined by the encryption type. You can change this if required.
      • Select an existing key based on the Encryption key name. Find out how to add a new encryption key.
      • To add new components, use @ on your keyboard to select and enter a template item.
      • To delete components, remove the template directly from the Encrypted filename.
      • Each component can either be custom text or a timestamp in one of the available datetime formats.
      • An example of the resulting filename is shown as you build up the encrypted filename components.
  • If Include column names is checked, the first row in the exported file will contain column names. This doesn't apply to .json files.
  • Column name style determines how the column names will be displayed in the export file:
    • None - the column name as it appears in Data Studio.
    • Database style - the column name will be in upper case and underscores will be used to separate words.
    • Lower case - the column name will be in lower case.
    • Upper case - the column name will be in upper case.
    • Humanized - the first word will be capitalized and underscores turned into spaces.
  • Specify the Character set (most sets are supported).
Custom settings for CSV files

You can specify a delimiter and filename extension for CSV files (created in an Export step or downloaded from a grid of data):

  1. Navigate to the installation folder (by default C:\Program Files\Experian\Aperture Data Studio {version}) and open the server.properties file.

  2. On a new line in the file, add the setting you want to modify:

    Setting name              Default value  Description
    Data.csvDelimiter         ,              Any string; special keywords can be used (see below)
    Data.csvFilenameExtension csv            All non-alphanumeric characters will be removed

    The following keywords can be used as the delimiter value:

    Keyword    Value
    BAR        ¦
    COLON      :
    COMMA      ,
    NONE       \0
    PIPE       |
    SEMICOLON  ;
    SPACE      ' '
    TAB        \t

    For example, to change the delimiter to tab, the entry in server.properties should be:

    Data.csvDelimiter=TAB
    

Exporting to an External System using JDBC

  • Select JDBC as the Exporter.
  • Choose a DBMS system from the list of available ones. This is a list of the External System connections configured in Data Studio that use a JDBC driver.
  • Choose a Credential for authenticating with the DBMS. These credentials will have been created alongside the External System definition.
  • Select the target table from the list in Table name. These are the tables that you can access in the target DBMS using the selected credential.
  • Mode determines how the data will be exported into the target system:
    • Insert: New rows are inserted into the table. The equivalent SQL statement is:
      "INSERT INTO tablename (column1, …) VALUES (value1, …)"
    • Update: Existing rows are updated. The mapping dialog will show which source columns are mapped to which target columns and which ones have been specified as keys. You can change the mapping and keys at this stage. The keys referred to are the columns that will appear in the equivalent SQL WHERE clause:
      "UPDATE tablename SET (column1=value1, …) WHERE (columnName=value, … )"
    • Delete: Rows are deleted. The keys are the column names used in the SQL WHERE clause:
      "DELETE FROM tablename WHERE columnName=value, … "
    • Insert or update (if insert fails): If a row insert fails (there is a key collision), then the statement will be retried as an update.
    • Update or insert (if update fails): If the update fails (no record found with the given key), then the statement will be retried as an insert (see the sketch after this list).
  • For Insert mode, you have the option to Truncate table before carrying out the row inserts.
  • For all modes other than Insert, you will need to select one or more Key columns from those in your Workflow data. These columns need to match the key fields in your selected target table.
  • Select Atomic database update if the export should be performed in a single transaction. In this mode, the database may lock table(s) for the duration of the update, which can be a considerable time for large exports.
  • Select Continue Workflow execution on JDBC exports error if you want the Workflow to continue to execute when there are any INSERT, UPDATE or DELETE errors. This option is unchecked by default. If this option is selected, two new output nodes will appear on the step: Show passing rows and Show failing rows.
    • Show passing rows contains the rows that successfully executed.
    • Show failing rows contains the rows that failed to execute.
  • Commit batch size: Commit a batch after the specified number of records have been sent to the database. This setting is only available when Atomic database update is unset. Note that the final batch usually has fewer records.
  • Pre SQL: Optionally enter a SQL statement that will be executed before the export. The SQL may contain any valid statements and will be executed on the target database.
  • Post SQL: Optionally enter a SQL statement, similar to Pre SQL but which will be executed after the export has completed.
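
The retry behaviour of the "insert or update" / "update or insert" modes can be pictured with a small, hedged Python sketch using a generic DB-API connection; an in-memory SQLite table stands in for the target system, and this is not how Data Studio performs the export:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")

    def update_or_insert(conn, row):
        # Try the UPDATE first; fall back to an INSERT when no record matches the key.
        cursor = conn.execute("UPDATE customers SET name = ? WHERE id = ?",
                              (row["name"], row["id"]))
        if cursor.rowcount == 0:
            conn.execute("INSERT INTO customers (id, name) VALUES (?, ?)",
                         (row["id"], row["name"]))

    for row in [{"id": 1, "name": "Ann"},
                {"id": 1, "name": "Ann Smith"},   # same key: becomes an update
                {"id": 2, "name": "Bob"}]:
        update_or_insert(conn, row)
    conn.commit()
    print(conn.execute("SELECT * FROM customers").fetchall())
    # [(1, 'Ann Smith'), (2, 'Bob')]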

Filter your data by selecting a column and entering a comparison value to match against using an operation like Equals, Contains or In Range. To filter on multiple columns and/or multiple comparison values, create an advanced Function that returns a boolean (true or false) value.

Quick filter

In Explore mode, right-click on a value in a column and select one of the options Keep or Remove to show or exclude all rows containing the chosen value.

Filter or Split?

The Split Workflow step is the same as Filter but has the benefit of two outputs to see the rows that pass the filter and the rows that fail the filter.

The Find duplicates step uses powerful standardization and matching algorithms to group together records containing similar contact data (e.g. name, address, email, phone) and keep that information within a duplicate store. Each group of records, known as a cluster, is assigned a unique cluster ID and a match level. The step provides out-of-the-box functionality for the United Kingdom, Australia, and the United States, but is also completely configurable down to the most granular name and contact elements.

Find out how to use, configure and troubleshoot this step.

The output will display the records of the clusters that have had at least one record inserted or updated. A record is either:

  • INSERTED - A new record with a new unique ID was added to the store.
  • UPDATED - A record with a matching unique ID had values changed.
  • AFFECTED - The record's cluster ID or match status has potentially changed due to the insertion or update of another record in the cluster.
  • UNCHANGED - A record with a matching unique ID had no updated values.

The Find duplicates delete step provides the ability to bulk delete records within a duplicate store and update its clusters appropriately.

A Duplicate store must be provided, as well as an input column whose values will be used as the IDs of the records to be deleted.

By default, if an input contains one column with the Unique ID tag then that column will be automatically selected.

The output will display the records of the clusters that have had at least one record deleted. A record is either:

  • DELETED - The record is removed from the store.
  • AFFECTED - The record's cluster ID or match status has potentially changed due to the deletion of another record in the cluster.
  • DUPLICATE - The ID is ignored because the record has already been deleted.
  • NOT FOUND - The ID is ignored because no records with that ID could be found in the store.

The Find duplicates query step provides the ability to search for a collection of records in a Duplicate store using records in any column.

The values in the input column(s) are used to search the store for matching records.

A Duplicate store must be provided, as well as at least one input column. Any number of columns can be selected, as long as they exist in the Duplicate store.

By default, if the input contains one or more columns that are tagged, these columns will be automatically selected.

Rules can also be changed or edited prior to submitting a search query.

The output will display the clusters that have had at least one record found. One of the output columns will be the search match status that corresponds to the match level of the input record against the found record/s in the store.

Fires a user-defined event from the Workflow.

You can kick off custom events at any point in your Workflow.

Tick the Only fire if input has rows checkbox for the step to be kicked off only when the input has data.

Select the required Custom event then specify values or assign a Workflow parameter for the available event data items (the event data name will be displayed).

For example, you want to kick off an event when the Validate step has failures. Select the 'Validation failed' custom event and specify the reason for the failure in the 'Reason' event data item. This reason will now be available as an event data item when creating/editing notifications.

Another example is using a Filter step to identify records that you have particular interest in. The Fire event step can then follow the Filter step but only fire when the Filter step is producing records.

Grouping will reduce the number of rows in a table.

The default output of a Group step is the Count aggregate, which returns the total count of rows. This count will then be broken down by each unique value in the group column.

Moving columns from ungrouped to grouped will show the distinct/deduplicated values in the column (along with the optional Count of the number of times each combination of values appears in the data).

Aggregates

The Add aggregate button is used to create a new column(s) containing a value or metric for each group.

Give the new aggregate column a name or optional suffix value if applying an aggregate on Multiple columns, then select the type of aggregate to use.

Aggregate Description
Count Returns a count of the number of items in each group.
Count excluding nulls Returns a count of the number of items in each group ignoring null values.
First value Returns the first value in the column for each group.
First populated value Returns the first value in the column for each group that is not null.
Last value Returns the last value (based on input row order) in the column for each group.
Minimum value Returns the minimum value in the column for each group.
Maximum value Returns the maximum value in the column for each group.
Grouped values Returns a comma separated list of all values in each group. The order is as they appear in the input.
Sorted grouped values Returns a comma separated list of all values in each group sorted A to Z.
Reverse grouped values Returns a comma separated list of all values in each group sorted Z to A.
Average Returns the average (mean) of the values in each group.
Median Returns the median (based on a sort of the values) of the values in each group.
Sum Returns the sum of the values in each group.
Standard deviation Returns the standard deviation of the values in each group.
Standard deviation (population) Returns the standard deviation of the values in each group where it is known that the values represent the whole population.
Variance Returns the statistical variance of the values in each group.
Variance (population) Returns the statistical variance of the values in each group where it is known that the values represent the whole population.
Group ID Returns a numeric value that can be used to identify each group.

Non-numeric values will be excluded (ignored) from numeric aggregation calculations. Zero will be returned by default if the values are all non-numeric.
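
As a hedged illustration of how grouped aggregates behave (including the handling of non-numeric values), here is a small Python sketch over an assumed list-of-dicts input; column names are illustrative only:

    from collections import defaultdict

    rows = [
        {"Brand": "Honda", "Sold": "10"},
        {"Brand": "Honda", "Sold": "7"},
        {"Brand": "Mazda", "Sold": "15"},
        {"Brand": "Mazda", "Sold": "n/a"},   # non-numeric values are ignored by Sum
    ]

    groups = defaultdict(list)
    for row in rows:
        groups[row["Brand"]].append(row["Sold"])

    def to_number(value):
        try:
            return float(value)
        except ValueError:
            return None

    for brand, values in groups.items():
        numeric = [n for n in (to_number(v) for v in values) if n is not None]
        print(brand, {"Count": len(values), "Sum": sum(numeric) if numeric else 0})
    # Honda {'Count': 2, 'Sum': 17.0}
    # Mazda {'Count': 2, 'Sum': 15.0}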

Interactivity

In Explore mode you can drill into a particular group to see the rows that belong to it by right-clicking and selecting View rows in group.

This ability to drill down into grouped values will be retained if the step is connected to any of the following subsequent steps: Branch, Chart, Export, Filter, Fire event, Output, Sample, Sort, Split, Take Snapshot, Transform and Union.

Harmonization is the process of deduplication - taking multiple records that are deemed to be for the same underlying real-world subject and producing a single resulting record.

Find out how to use this step.

Join two inputs into one based on one or more columns from each input.

To specify the columns you want to join, click Left side columns or Right side columns.

Use the Retention type menus to specify different combinations of join. The top menu is for the first join input and the bottom one for the second join input.

  • All matches: Performs a standard join.
  • Singletons only: Return the first match of a one-to-many match and any matches that occur only once. This is a normal relational join with duplicates removed.
  • Duplicates only: Return all the matches that are one-to-many only. This is a normal relational join excluding matches that match only once.

Click on the Venn diagram to specify the type of join you want to apply:

  • Inner (center segment of the Venn): Returns all rows where the join condition is met (keys match).
  • Left outer (left and center segments): Returns all rows from the left-hand input and only the rows from the right-hand input where the join condition is met (keys match).
  • Left outer without intersection (left segment): A variant on the basic left outer join, which returns only rows from the left-hand input that don't meet the join condition. No rows from the right-hand input are returned.
  • Right outer (center and right segments): Returns all rows from the right-hand input and only the rows from the left-hand input where the join condition is met (keys match).
  • Right outer without intersection (right segment only): A variant on the basic right outer join, which returns only rows from the right-hand input that don't meet the join condition. No rows from the left-hand input are returned.
  • Full outer (all segments of the Venn): Returns all rows in both the left and right inputs. If a row from one side of the join has no match in the input on the other side, the columns from the other input will contain null values.
  • Full outer without intersection (left and right segments only): Returns all rows in both the left and right inputs, excluding rows where the join condition is met.
  • Cross join or Cartesian product (no segments of the Venn): Returns all rows from the left input. Each row from the left input is combined with all rows from the right input.

Above the Venn diagram select More options to expose more detailed options for the join.

The Join step provides three matching options which determine how keys will be compared:

  • Exact match: rows will be joined if key values on the left-hand side and right-hand side are exactly the same, including data type and casing.
  • Ignore datatype: rows will be joined if keys can be standardized to the same value. A key with the alphanumeric value '1' will be joined with a key that is the integer value 1.
  • Case-insensitive: rows will be joined if keys match with casing ignored. A key with value 'ABC' will be joined to a key with value 'abc'.

The Do not match nulls option configures how null values are handled in the join. When enabled, left or right column values containing nulls will never match any other row. This means that rows where the join key contains nulls will never appear in inner join results and will always appear in left/right unmatched results if the selected join type includes left/right unmatched rows.
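
The matching options can be pictured with the following hedged Python sketch; the normalization shown is illustrative only, not Data Studio's exact comparison logic:

    def normalize(key, mode):
        if mode == "exact":
            return key                  # value and datatype must match exactly
        if mode == "ignore_datatype":
            return str(key)             # alphanumeric '1' matches integer 1
        if mode == "case_insensitive":
            return str(key).lower()     # 'ABC' matches 'abc'
        raise ValueError(mode)

    def keys_match(left, right, mode, match_nulls=False):
        # With "Do not match nulls" enabled (match_nulls=False), a null key never matches.
        if left is None or right is None:
            return match_nulls and left is None and right is None
        return normalize(left, mode) == normalize(right, mode)

    print(keys_match("1", 1, "exact"))                   # False
    print(keys_match("1", 1, "ignore_datatype"))         # True
    print(keys_match("ABC", "abc", "case_insensitive"))  # True
    print(keys_match(None, None, "exact"))               # False while nulls are not matched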

In the join dialog, click Show step results to view the results of the join. Specify which values from the joined columns to use in any following steps: all the matches, only duplicates or only singletons.

Click the lightbulb icon to see suggested join keys. Joins will be suggested only where column names match exactly on the two sides of the join. Click on a suggested join to apply it.

Allows you to match values from a column in the first input (source) to a column in the second input (the lookup table / lookup input). Values are returned from a second column from the lookup table after an operation is applied to the set of values for the matched rows.

Step configuration

You can use any source in a Workflow as the input for this step: a Dataset, View, or the output from another step. You can even use the same output node for both the source and the input (if you want to do a self-lookup).

To add multiple lookup tables, click Add additional lookup table.

Lookup definition

Click Definitions to open up the configuration panel, and Add definitions to allow you to define a new lookup. Multiple lookup definitions can be created in a single Lookup step. A lookup definition consists of the following options:

Name

The name of the resulting new column from the lookup definition.

Lookup type

A lookup type is the operation applied to the value (or values, if more than one match exists in the lookup table) returned by the Lookup. These types/operations are applied to the set of values for the return column where the lookup column matches the lookup value.

Lookup type Description
Count Returns a count of the number of matches that the lookup value has in the lookup column.
Count excluding nulls Returns a count of the number of matches that the lookup value has in the lookup column ignoring nulls.
Exists Returns true if the lookup value has a match in the lookup column, false if not.
Not exists Returns false if the lookup value has a match in the lookup column, true if not.
First value Returns the first value (based on input row order) from the return column where the lookup value matches the lookup column.
First populated value Returns the first non-null value (based on input row order) from the return column where the lookup value matches the lookup column.
Last value Returns the last value (based on input row order) from the return column where the lookup value matches the lookup column.
Minimum value Returns the minimum value (based on a sort of the values) from the return column where the lookup value matches the lookup column.
Maximum value Returns the maximum value (based on a sort of the values) from the return column where the lookup value matches the lookup column.
Sum Returns the sum of the values in the return column where the lookup value matches the lookup column. Non-numeric values will be counted as '0'.
Average Returns the average (mean) of the values in the return column where the lookup value matches the lookup column. Non-numeric values will be excluded from the calculation. Returns '0' by default if the set of return column values are all non-numeric.
Median Returns the median (based on a sort of the values) of the values in the return column where the lookup value matches the lookup column.
Variance Returns the statistical variance of the values in the return column where the lookup value matches the lookup column. Non-numeric values will be excluded from the calculation. Returns '0' by default if the set of return column values are all non-numeric.
Variance (population) Returns the statistical variance of the values in the return column where the lookup value matches the lookup column where it is known that the values represent the whole population. Non-numeric values will be excluded from the calculation. Returns '0' by default if the set of return column values are all non-numeric.
Standard deviation Returns the standard deviation of the values in the return column where the lookup value matches the lookup column. Non-numeric values will be excluded from the calculation. Returns '0' by default if the set of return column values are all non-numeric.
Standard deviation (population) Returns the standard deviation of the values in the return column where the lookup value matches the lookup column where it is known that the values represent the whole population. Non-numeric values will be excluded from the calculation. Returns '0' by default if the set of return column values are all non-numeric.
All values Returns a comma separated list of all values in the return column where the lookup value matches the lookup column. The order is as they appear in the input.
Reverse values Returns a comma separated list of all values in the return column where the lookup value matches the lookup column. The order is a reverse sort of the values.
Sorted values Returns a comma separated list of all values in the return column where the lookup value matches the lookup column. The order is a sort of the values.
Return column

The column from the lookup table that you want the above operations to act on. This option is not present if you have selected the Exists or Not Exists lookup types, as no value from the lookup table is returned.

Match type

This is where you define how you want to match values between your lookup value and lookup column. The selected match type will apply to both sides.

Match type Description
Exact Will only match if the value and datatype exactly match.
Ignore datatype Will match if the values match, ignoring the datatype (e.g. alphanumeric '1' will match numeric value 1).
Case-insensitive Will match if the values match, ignoring the casing (e.g. 'experian' will match 'experIAN').
Denoise Will match if the values match after special characters / whitespace etc. are removed (e.g. 'Experian' will match 'Ex'per ian').
Format Will match if the formats of the two values are the same (e.g. '123abc' will match '456def').
Soundex Will match if the values match after going through the Soundex phonetic algorithm.
Metaphone Will match if the values match after going through the Metaphone phonetic algorithm.
Double metaphone Will match if the values match after going through the Double metaphone phonetic algorithm.
Distinct checkbox

If selected, each value from the return column will only be used once per lookup value match group. Only present for lookup types which have the potential to operate across multiple values, for example Average and All values.

Default value

The value that is returned if no match for the lookup value is found. This field can be left blank in which case a null value is returned. Some lookup types will never use this (e.g. Exists).

Lookup column

The column from the lookup table that contains the values you want to match on.

Lookup value

The column from the source that contains the values you want to match on.

Add more conditions button

Allows you to define additional pairs of lookup columns (additional match criteria) to further narrow matches.

Example

A source Dataset contains customer information alongside a column with a "discount code":

Customer ID Code
1 A
2 A
3 B
4 C
5 A
6 9

Another dataset contains the "discount code" value alongside the rate that that code represents. This will be the data that we lookup in to:

Discount Code Rate
A 5
B 10
C 15

We want to define a lookup that will return the appropriate rate for each discount code. The rate will be added as a new column in the customer (source) Dataset.

To do this, in a Workflow add a Lookup step with the customer dataset as the source input and the discount code dataset as the lookup input.

The lookup definition is set up as follows:

  • Name: Discount Rate
  • Lookup type: First value
  • Lookup table: Rows for discount code
  • Return column: Rate
  • Match type: Exact
  • Default value: "NO MATCH"
  • Lookup column: Code
  • Lookup value: Discount Code

Resulting table:

Customer ID Code Discount Rate
1 A 5
2 A 5
3 B 10
4 C 15
5 A 5
6 9 NO MATCH
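
The same lookup can be sketched in a few lines of Python, assuming exact matching and the default value "NO MATCH"; this mirrors the worked example rather than the step's internals:

    customers = [
        {"Customer ID": 1, "Code": "A"}, {"Customer ID": 2, "Code": "A"},
        {"Customer ID": 3, "Code": "B"}, {"Customer ID": 4, "Code": "C"},
        {"Customer ID": 5, "Code": "A"}, {"Customer ID": 6, "Code": "9"},
    ]
    discounts = [
        {"Discount Code": "A", "Rate": 5},
        {"Discount Code": "B", "Rate": 10},
        {"Discount Code": "C", "Rate": 15},
    ]

    # Index the lookup table on the lookup column ("Discount Code"),
    # keeping only the first Rate per code ("First value").
    first_rate = {}
    for row in discounts:
        first_rate.setdefault(row["Discount Code"], row["Rate"])

    for customer in customers:
        customer["Discount Rate"] = first_rate.get(customer["Code"], "NO MATCH")
        print(customer)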

Lookup or Join?

Lookups and Joins can both achieve many of the same things when combining two data sources. However, their performance is optimized for different scenarios, so for large data volumes it is important to understand these differences before choosing which one to use in your Workflow.

In general, we recommend lookups for use cases where the lookup table is small (e.g. a domain file of a few thousand, typically unique, values) or when you need to perform an operation on the results (e.g. calculate an average for the matched values). Joins are better suited to large-to-large file operations.

More specifically, for high-cardinality columns (e.g. truly unique IDs), lookups can take at least twice as long as an equivalent join. For low-cardinality (i.e. low uniqueness), high row count keys, lookup indexes will build quickly compared to the corresponding join index.

This step is used to map columns in your data to a target schema definition. The target schema must be a Dataset that is already loaded into Data Studio.

Map to target is often used in combination with the Export step to populate a target database schema but can also be used to rename multiple columns easily.

Auto-map functionality will attempt to map columns automatically where possible, based on column names in the source (input columns) and the target (output columns). You can also map columns manually.

If an input column is not selected in the mapping for a given target column, the values in the target column will be null.

If Include in output is unchecked for a target column, that column is not included in the step's output. Use this setting when mapping to a schema that includes auto-generated columns in the target database.

When a re-usable Workflow uses the Output step, the step's data will be available in other Workflows.

Click Show step results to view the result from the previous step that connects to the Output step.

Profile your data at any point in a Workflow.

We examine each value and identify various attributes (such as type, min/max values, count) and determine its uniqueness and completeness.

Click Profiled columns to select the column(s) you want to profile. In the profiled columns list on the left, select the required ones and click Apply. To see the results, click Show step results in the step dialog.

You can now export or take a snapshot of your profiled data to track data uniqueness over time.

When profiling data, you can select a column and choose Values or Formats to drill into the value or format information identified from the profiling process, and from there drill down again to the individual rows that contain that selected value or format.

Viewing profiled values, formats and rows in other steps

The ability to drill down into the Profile step results will be retained if the step is connected to any of the following subsequent steps:

  • Branch
  • Chart
  • Export
  • Filter
  • Fire event
  • Output
  • Sample
  • Sort
  • Split
  • Take Snapshot
  • Transform
  • Union

You can view the drilldown using the right-click options or the toolbar on these steps.

Interactive Datasets

The results of the profiling can be saved to a Dataset by including a Take Snapshot step in your Workflow. Furthermore, if the Interactive option is selected when creating a new Dataset on the Take Snapshot step, you can drill into the data to view the profiled values and formats in the Dataset. The underlying rows of the drilled down values or formats can also subsequently be drilled into.

Once an interactive Dataset has been created, it can be drilled into in various ways, such as:

  • The interactive Dataset itself.
  • A View with the interactive Dataset as the source.
  • A Source Workflow step with the Interactive Dataset as the source.
  • A Source Workflow step, with a View on the Interactive Dataset as the source.

Interactivity also works for multi-batch Datasets.

Lineage metadata

To further improve data quality, Data Studio can capture lineage metadata of data sources. Lineage metadata can be included in Workflow outputs and subsequently used for further integration or processing.

In the Profile step, lineage metadata can be included in the output using the dropdown under More Options.

View up to three inputs side by side (horizontally or vertically). Nest these steps for more flexibility over layout.

Click on Vertical split to change it to horizontal.

To make a report available to Data Consumer users, click Expose to Data Consumer users.

To view the report, click Show report.

This step allows you to convert row(s) into columns by clicking Settings.

  • Column with rows to split in new columns - this is the column containing the row values you want to convert to new column names.
  • Column with rows to split in column values - this is the column containing the row values that will be populated in the new columns.
  • Click Refresh column names to automatically populate the new column names with the unique values found in the Column with rows to split in new columns (you can edit these as required). You can also manually enter a new column name, as long as it's an existing row value from the chosen column.
  • Check Group and aggregate values to aggregate and condense the pivot output, using the Existing columns aggregator and New columns aggregator.
  • Uncheck Group and aggregate values to maintain the existing default behavior.

The step results will consist of the new column(s) and all other columns except for the Column with rows to split in new columns and Column with rows to split in column values.

To do the reverse, use the Columns to rows Workflow step.

Create a pivot table

You can use this step together with Group and Export to create a pivot table.

For example, you have the following car sales data:

Year Brand Sold Product ID Model
2019 Honda 10 1 HRV
2019 Mazda 15 2 CX5
2020 Honda 7 3 Civic Si
2019 BMW 3 4 3 Series
2019 Honda 5 5 CRV
2020 Mazda 50 6 Mazda 3

What you're interested in is the amount of Honda and Mazda sales per year:

Year Honda Mazda
2019 15 15
2020 7 50

To achieve that:

  1. Add a Source step with the relevant Dataset.

  2. Connect the Rows to columns step.

  3. Click Settings to configure it:
    -Select Brand as the Column to split.
    -Select Sold as the Values column.
    -Click Profile for new values. The values Honda, Mazda, BMW should be automatically populated.
    -Remove the BMW value and save changes.

  4. Connect the Group step.

  5. Click Grouping to configure it:
    -Remove Count from the Grouped and aggregated columns.
    -Move Year from the Ungrouped columns to the Grouped and aggregated columns.
    -Click Add aggregate and enter the following values:
    -Column selection type: Multiple
    -Selected columns: Honda and Mazda
    -Aggregate: Sum
    -Click Apply twice, once to save the changes and once to apply them to the groupings.

  6. Connect the Export step.

  7. Click Show step results to preview the results and export as required.
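
For reference, the pivot above can be reproduced with a short Python sketch over the same car sales data; this is illustrative only and simply mirrors the Rows to columns + Group result:

    sales = [
        {"Year": 2019, "Brand": "Honda", "Sold": 10},
        {"Year": 2019, "Brand": "Mazda", "Sold": 15},
        {"Year": 2020, "Brand": "Honda", "Sold": 7},
        {"Year": 2019, "Brand": "BMW",   "Sold": 3},
        {"Year": 2019, "Brand": "Honda", "Sold": 5},
        {"Year": 2020, "Brand": "Mazda", "Sold": 50},
    ]

    pivot = {}  # Year -> {"Honda": total sold, "Mazda": total sold}
    for row in sales:
        if row["Brand"] not in ("Honda", "Mazda"):   # BMW was removed from the new columns
            continue
        totals = pivot.setdefault(row["Year"], {"Honda": 0, "Mazda": 0})
        totals[row["Brand"]] += row["Sold"]          # Sum aggregate per Year

    for year in sorted(pivot):
        print(year, pivot[year])
    # 2019 {'Honda': 15, 'Mazda': 15}
    # 2020 {'Honda': 7, 'Mazda': 50}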

Limit the number of rows or take a random sample of rows.

Examples for consecutive rows:

  • first 50 rows: Start at row 1 and Limit rows 50
  • specific 20 rows: Start at row number and Limit rows 20
  • last rows: sort your data and take the first rows

Examples for non-consecutive rows:

  • every 10th row: How often to sample 10
  • random sample 100 rows: Sample randomly on and Limit rows 100
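
The sampling options above can be pictured with a small, illustrative Python sketch (arbitrary example data; not the step's implementation):

    import random

    rows = list(range(1, 1001))              # 1,000 example rows

    first_50 = rows[:50]                     # Start at row 1, Limit rows 50
    every_10th = rows[::10]                  # How often to sample: 10
    random_100 = random.sample(rows, 100)    # Sample randomly, Limit rows 100

    print(len(first_50), len(every_10th), len(random_100))  # 50 100 100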

Sort your data in ascending/descending order.

Move columns from 'unsorted columns' to 'sorted columns', then set the sort order (min to max or max to min) for each column using Flip sort direction or the up/down arrow on the right of each 'sorted column'.

Select a 'sorted column' and use the 'move up'/'move down' buttons to change the priority order if sorting on multiple columns.

Your data source, often the first step in a Workflow. Select the required Dataset, View or Issue list to start manipulating and transforming your data.

The options you see in the Source step will depend on the properties of the source you select.

Replaceable sources

Enable Can supply source when executed to make the data source replaceable, so that a different source can be provided when the Workflow is executed, or when the Workflow is made re-usable and embedded in a parent Workflow.

Delete batches on completion

Enable the Delete batches on completion setting to automatically remove processed batches from a Dataset when the Workflow successfully completes. This option is only available if the Dataset is multi-batch and has the Allow automatic batch deletion option enabled.

Allow auto-refresh

Enable the Allow auto-refresh setting to automatically load a new batch of data into the selected Dataset when the Workflow is executed. If a View is selected, its underlying Dataset will be refreshed, even if that Dataset is in a different Space.

When Allow auto-refresh is enabled, you have the option to Stop execution when refresh failed. This setting causes a Workflow execution to fail if the refresh cannot be completed, for example if the External System cannot be accessed due to network connectivity or authentication issues. When the setting is unchecked, the Workflow will process the existing data if the Dataset cannot be refreshed.

Lineage metadata

To further improve data quality, Data Studio can capture lineage metadata of data sources. Lineage metadata can be included in Workflow outputs and subsequently used for further integration or processing.

In the Source step, the available lineage metadata depends on the source type and can be included in the output using the dropdown under More Options after a Dataset has been selected.

Unique row IDs

Each row in a Dataset has a row ID which is guaranteed to be unique across all Datasets. These IDs can optionally be included in the output of the Source step. The IDs are generated in ascending order over time, so it's possible to use them to sort rows of data by the time they were first loaded into Data Studio.

Source batch sampling

This feature allows the Source step to output only a set of batches for processing in the Workflow instead of all of them. Select a multi-batch Dataset as the source for these options to be available.

Take last

This option will allow you to output one of the following:

  • All batches that were created within a given timeframe from the current time.
  • The last given number of batches.
  • All batches.

Sample every

This option will only output the last batch created within each interval of the time unit provided.

For example, if you have three batches with these timestamps:

  • Batch 1: 2020-01-01 9:00
  • Batch 2: 2020-01-01 11:00
  • Batch 3: 2020-01-01 11:59

Depending on the unit of time selected, the following would be the output:

Day

  • Batch 3: 2020-01-01 11:59

Hour

  • Batch 1: 2020-01-01 9:00
  • Batch 3: 2020-01-01 11:59

Minute

  • Batch 1: 2020-01-01 9:00
  • Batch 2: 2020-01-01 11:00
  • Batch 3: 2020-01-01 11:59
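
As a hedged illustration, the following Python sketch keeps the last batch created within each interval of the chosen unit, reproducing the Day and Hour outputs above (it is not the step's implementation):

    from datetime import datetime

    batches = [
        ("Batch 1", datetime(2020, 1, 1, 9, 0)),
        ("Batch 2", datetime(2020, 1, 1, 11, 0)),
        ("Batch 3", datetime(2020, 1, 1, 11, 59)),
    ]

    def sample_every(batches, unit):
        # Group timestamps into intervals of the chosen unit; later batches
        # overwrite earlier ones, so only the last batch per interval is kept.
        key = {"day": lambda t: t.date(),
               "hour": lambda t: (t.date(), t.hour)}[unit]
        latest = {}
        for name, created in batches:
            latest[key(created)] = (name, created)
        return [name for name, created in sorted(latest.values(), key=lambda b: b[1])]

    print(sample_every(batches, "day"))   # ['Batch 3']
    print(sample_every(batches, "hour"))  # ['Batch 1', 'Batch 3']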

Include resolved issues

For Issue lists, enable the Include resolved setting if issues that have already been marked as resolved should be processed in the Workflow.

Combine two inputs into one by presenting the values next to each other.

Data in the columns is not combined or merged - duplicated column names are renamed with an appended numeric value. The output – combined values from two sources – can then be used as input to any further step(s).

Click Show step results to view the results.

Apply a true/false filter to one input, splitting it into passing and failing rows.

To create a filter, click Filter in the step dialog and Create in the left-hand menu. Specify the necessary filters and click Apply to save the changes.

Click Show passing/Failing rows to view the values that passed/failed the filter. The passing and failing results can be individually linked to any further step(s).

Snapshots are copies of data that allow you to track changes over time or store your results for use elsewhere. A Snapshot is a type of Dataset that's created by and stored in Data Studio.

Taking Snapshots

Connect the Take Snapshot step's input to the data you'd like to capture in a Snapshot. When the Workflow is run, a new batch of data will be created.

This data can then be added to an existing or new Dataset:

  • Select an existing one from the Dataset list.

  • Use Create new Dataset to create a new one.

    • Name - add a name as it will appear in Data Studio.
    • Summary - an optional short summary of the Dataset.
    • Description - an optional, longer description.
    • Dataset type - choose one of the options: Single batch will write the latest data to the Dataset and keep no history. Multi batch will retain older data in the Dataset allowing for trend analysis.
    • Interactivity - This option will only be available if you're making a copy of Results by rule or Results by group from the Validate step. Selecting Interactive (with drill down) will result in a Dataset that allows users to drill into the underlying data and view the passing/failing rows for the rules/rule groups.
    • Add Batch Timestamp Column - When selected, an additional column will be added to the Dataset, recording the timestamp that the batch was created. This option is useful for trend analysis.
    • Allow automatic batch deletion - This will ensure that data batches which have been used in a Workflow are deleted after they have been processed (i.e. after the Workflow using those batches has been run). This option is intended to be used so that no batch is processed through a Workflow twice. It's used in conjunction with the Delete batches on completion setting in the Source step.
    • Allow auto-refresh - Not applicable to Snapshots.
    • Publish to ODBC - Make the Dataset visible through ODBC connections.
    • Compression level - Choose one of the options: Row-based to store data by row, Column-based to store data by column or None to not use compression for this Snapshot. The value selected by default is determined by what's been specified in Settings > Performance > Compression. Find out more about the compression levels.

Using Snapshots

Snapshots can be used like other Datasets:

  • As a Workflow source - add a Source step to a Workflow and select the required Snapshot Dataset.
  • As a View source - choose the required Snapshot Dataset when creating a View.

Snapshot schema changes

You may want to update the schema if columns have been added/removed when writing data to a Snapshot.

Resolving Snapshot warnings

A summary of the unresolved changes will appear as a warning on the step. To fix this, you can change the input or resolve the conflicts. Click Update schema: {X} schema change(s) to open the Update Schema dialog (where {X} is the number of changes found) to see a consolidated list of existing, new and missing columns.

You can now:

  • Include or Exclude new columns in the Snapshot's schema. You can change this at any point later.
  • Set all missing mandatory columns as Optional. Optional columns will be retained and filled with empty values in the target Dataset.

Click Apply to update the schema. The step will become valid and you'll be able to view the step results and run the Workflow.

Record issues in your data so that they can be tracked, organized and resolved.

Find out how to use this step.

Manipulate and configure your data.

This step allows you to add/remove various Functions and arrange or exclude columns.

Clicking the Columns row on any Transform Workflow step will open the panel where you can select the columns you want to work with.

Moving columns

Moving columns allows you to change the order in which they appear in the data. This can be done by selecting the columns and pressing the up/down arrow button next to the columns list.

Columns can also be moved by dragging them up/down the list.

Renaming columns

Rename a column by selecting the column and clicking Rename on the toolbar. You will then be presented with a modal box prompting you to enter the name and an optional summary.

If multiple columns are renamed at the same time, the modal box will provide you with an opportunity to apply a common prefix or suffix to the originating column name. Alternatively, you will be able to rename all columns individually.

Excluding columns

Select the columns you want to exclude and click Exclude in the toolbar. This can also be done by hovering over the column name and clicking the inline icon.

To make columns visible again, select the columns and click the Include button in the toolbar.

Transforming and creating columns

A column can be transformed by applying one or many Functions to it. These Functions range from a simple action, such as converting a column to uppercase, to more complex multiple column operations.
Whatever it may be, transforming a column can be done to create a new column or to change the data in an existing one.

Suggestions

Before applying any custom transformations, you can use the Suggest option to analyze the data being explored and return some recommended transformations based on its content.

Any of these suggestions can be applied to the data by checking their selection boxes and clicking Apply.

The transformation suggestions can also be created as new columns rather than replacing the original data, by checking New column? in the table and specifying a column name.

Applying a Function

  1. Select the columns
  2. Decide to transform the data in the existing columns, or create new ones for the results
    • To change existing columns click Existing column on the toolbar
    • To create new columns click New column on the toolbar
  3. Choose an existing Function from the list (these are grouped into expandable categories)
  4. After choosing your Function you will be presented with options to configure it
  5. Click Apply transformation to apply your Function to the columns

Creating an advanced Function

  1. Select the columns
  2. Click Existing column or New column (see Applying a Function above for more info)
  3. Click Create your own function
  4. Create a Function with as many Function steps as you like
  5. Click Apply to apply your Function to the columns

If the column already has a transformation Function applied you will first be prompted to choose whether to 'Append to existing functions' or 'Replace existing functions'.

Duplicating columns

Duplicating columns can be done by selecting the columns and clicking Duplicate on the toolbar. The duplicated column will be placed after the original column and will be renamed with the prefix 'Copy of '.

Editing a transformation

  1. Select the column that is already transformed
  2. Click Edit on the toolbar
  3. Either edit the options in the configuration form or start over with a completely new Function

Removing a transformation

To remove the Functions applied to columns, select them and click Delete on the toolbar.

You can only completely delete a created column rather than just remove the transformation, because created columns can't exist without a Function.

Combine two or more inputs into one output by presenting the values on top of each other. The source columns with matching headings will be combined.

The output – combined values from two or more sources – can then be used as input to any further step(s).

Click Show step results to view the results.

Use business rules to measure the quality, completeness and accuracy of your data.

A rule is a Function that returns a true or false value, which translates to a pass or fail. This can be either a default Function (like 'Contains' or 'Is Date') or a custom one.

Suggest validation rules

The easiest way to get started is to use Suggest validation rules, which profiles the data and then suggests formulas to be applied to columns as validation rules. Uncheck any suggestions that are not useful, then apply to create rules separated into rule groups, ready for further changes if required.

In Explore mode, Profile your data to see the Suggest rules action button, which will create a new Workflow containing a Validate step.

Applying a Ruleset

Selecting Apply ruleset allows you to select any Rulesets that have been created and map columns to the rules where the names differ.

Creating rules and groups

On the Validate step, click Rules to view and edit any existing rules and to Add rule or Add group.

Each rule has a name and summary, a parent rule group and a Function definition. The Function can either be a pre-existing Function in Data Studio, or a custom Function created in the Function builder. When building a custom Function, pick one of the relevant columns (or a parameter) as the input value and ensure the Function returns true/false as the result.

Rule groups represent logical collections of validation rules and can have different thresholds set. For example, a group of compliance rules may have a fail threshold of 99%, but a data quality group of rules may use a lower threshold and be increased over time as data and processes are improved.

The Status result column for each rule is based on these thresholds: Red below the fail threshold, Green at or above the pass threshold, and Amber in between.

Each group has a name and description, pass and fail threshold, plus an optional column to weight the results. As well as counting the number of Passed rows and Failed rows, the results will also include Passed weight and Failed weight columns, which contain a sum of the chosen column for all the passing/failing rows. For example, weighting each rule's results by total sales allows you to prioritize data quality issues for the highest spending customers with the most overall impact.
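
As a small illustration of the weighted results, here is a Python sketch assuming a hypothetical "total sales" weight column and a single rule; the arithmetic simply mirrors the Passed/Failed weight description above:

    rows = [
        {"email_valid": True,  "total_sales": 1200},
        {"email_valid": False, "total_sales": 300},
        {"email_valid": False, "total_sales": 5000},
    ]

    passed_rows = sum(1 for row in rows if row["email_valid"])
    failed_rows = len(rows) - passed_rows
    passed_weight = sum(row["total_sales"] for row in rows if row["email_valid"])
    failed_weight = sum(row["total_sales"] for row in rows if not row["email_valid"])
    print(passed_rows, failed_rows, passed_weight, failed_weight)  # 1 2 1200 5300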

Ignoring values in validation rules

Rules can be set to ignore or skip values that do not need to be validated, and so should not impact the rule pass rate. For example, if 10 emails are checked and 5 are valid, 2 are invalid and 3 are blank, the pass rate would be 5/7 (rather than 5/10 if all values were considered).

The Ignore Null values checkbox can be selected when first creating a rule using an existing Function. However, values other than null can also be ignored using the Ignore literal value when designing a custom Function.

Validation rule script editor

You can use a scripting language to create new and manage existing validation rules, especially if you're making several changes at once. Click Edit script to open the script editor and make the required changes, such as renaming, re-ordering or changing rule groups.
To open a read-only version, click View script.

You can create Functions using the same functionality in the Function script editor.

A rule is made up of a rule group, group id, rule name, rule id, rule summary and Function in that order. For example:

## Rule_Group_1 {{c1e5ff0d}}
# Rule_1 {{d1f5ff0c}}
-- Summary of rule 1
IsEven(3)

Using the script editor:

  • ## denotes a rule group.
  • # denotes a rule name.
  • -- denotes rule or rule group summary text.
  • Rule IDs are non-editable and should not be copied when creating a new rule.
  • Use Ctrl+Space to show available Functions or to suggest a Function according to what you have already typed.
  • Hover over a Function name with the cursor to show the arguments required for that Function.
  • Use Ctrl+Enter to validate and reformat the Function script.

Viewing results

Validation results are available in several formats:

  • Show passing rows contains the rows that pass all of the defined rules.
  • Show failing rows contains the rows that fail at least one validation rule.
  • Show all rows is a combination of the two above, comprising the entire input data alongside the rules and a Pass/Fail result for each.
  • Show results by rule summarizes/aggregates the results for each rule with a count of passes and fails.
    • Show results by rule for analysis is the same data, un-pivoted into rows.
  • Show results by group is similar to the above, but useful for a higher level view of aggregated results when there are many rules.
    • Show results by group for analysis is the same data, un-pivoted into rows.

Similar to the Source and Profile steps, the Validate step has a Source metadata dropdown that enables you to include the lineage metadata of input data. This information (such as file name or batch ID) is useful to include as additional columns in the aggregated results.

This step validates and enriches addresses in bulk.

Addresses will be cleaned by verifying them against the official postally-correct address files for the relevant country. Cleaned addresses are assigned a match result, based on the accuracy of the original address. You can define layouts specifying the number, content and format of the address output columns. Choose one of the available Additional datasets to enrich your data. The datasets that are available to you depend on your license.

Find out how to use, configure and troubleshoot this step.

Validate emails based on the format or domain address.

Select the Email column and pick one of the two available Validation type options:

  • Format Check: Checks whether the value matches a valid email format. Returns either true or false (a rough illustrative check is sketched after this list).
    Examples of valid and invalid email formats:

    Format Result
    info@gmail.com Valid
    first.second-name@gmail.com Valid
    first.name+tag@gmail.com Valid
    name@info@gmail.com Invalid
    name"not"right@test.com Invalid
    another.test.com Invalid
    name@incorrect_domain.com Invalid
    com.domain@name Invalid
    first_address@gmail.com, second_address@gmail.com Invalid

    Only one email can be validated at once; lists of emails as seen in the last example will be rejected.

  • Domain Level: Checks whether the value has a domain that exists and is an email server. This option returns both an overall validity result (true or false) in the Email domain: Result column, and additional information in the Email domain: Error column describing the reason for failure. The possible outcomes are:

    Error Result Description
    Valid True Domain exists and is a mail server.
    Bad format False Email format is invalid.
    Invalid domain False Domain validation check failed. The domain may not exist, or may have been flagged as illegitimate, disposable, harmful, nondeterministic or unverifiable.
    Invalid name False Local part validation failed. For example it may have been identified as a spam trap or role account such as "admin@server.com".
    Failed to perform DNS lookup False An error occurred when attempting to perform the DNS lookup.
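
For the Format Check option, the sketch below uses a simple regular expression to classify the example values above; this is a deliberately simplified, hypothetical pattern, and Data Studio's own format validation is more thorough:

    import re

    # Simplified, illustrative pattern only; not the product's rule set.
    BASIC_EMAIL = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+$")

    for value in ["info@gmail.com", "first.name+tag@gmail.com",
                  "name@info@gmail.com", "another.test.com", "com.domain@name"]:
        print(value, bool(BASIC_EMAIL.match(value)))
    # The first two print True, the rest print False.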

Domain level validation results are cached with results refreshed every 30 days. The cache validity is configurable in Settings > Workflow steps by changing the Email validation cache validity setting.

Click Show step results to view the results.

To learn more about validating phone numbers, please see the validate phone numbers step section.

Explore step results

Each step in the Workflow has the option to Show step results. This opens the grid, where you can view the data up to that stage in the Workflow, Explore it and apply any Actions. If the Action(s) have a corresponding Workflow step, click Add to Workflow to insert a new Workflow step into the current Workflow from whichever step was previously opened.

Include/exclude steps

Workflow steps can be hidden from the step list that is shown when editing a Workflow, so that only the steps relevant to the use cases of that Space are shown. Clicking the Spaces menu option and then Workflow steps allows you to turn both default and custom steps on/off from appearing in the list.

Custom steps

In addition to the default steps, there is an SDK that allows you to create your own custom Workflow steps to manipulate your data in a specific way, call out to an API to pull in values to enrich your data, or push out data to another application.

Workflow canvas is the main area of the page where Workflows are viewed or edited.

Designer users can build and optimize Workflows by dragging Workflow steps around the Workflow canvas.

Zoom and mini map

Zoom In/Out within a Workflow to show more/fewer steps on a page. The current zoom-level is noted on the mini map in the bottom right corner. Hovering over a step in the mini map shows the step name and clicking on a step moves the canvas to show that step.

Step manipulation

  • Automatically Arrange steps to present them in a uniform layout without any steps overlapping.
  • Drag select when on prevents the canvas being moved, instead allowing multiple steps to be selected.
  • Snap helps to automatically connect Workflow step nodes together when hovering them on top of each other.
  • Undo (Ctrl+Z) or Redo (Ctrl+Y) your latest actions.
  • Copy (Ctrl+C) the selected Workflow step(s) including any settings that have been set within them.
  • Paste (Ctrl+V) the copied Workflow step(s) into the same Workflow or a different Workflow being edited.
  • Select all (Ctrl+A) steps in a Workflow.
  • Delete selection (Delete key) to delete all selected Workflow steps, step connections and notes.
  • For steps with a corresponding View Action, Convert to View can be used to simplify a Workflow, replacing some or all of the process with a View.

Edit properties pane

Each Workflow step has a menu with the Edit properties option which allows you to change:

  • Name - rename the Workflow step to help distinguish or explain the purpose of it. This name will be used in connecting steps and Workflow reports.
  • External label - this label is used by Workflow APIs to call a specific Source step.
  • Description - add clarification for the step. This will add a note icon to the step with the details showing on hover.

Sticky notes

Notes are very useful for adding detailed descriptions to a Workflow or around some specific Workflow steps. Notes are also shown in the Workflow mini map in the bottom right corner, which can aid navigation.

To add a note, double-click anywhere on the Workflow canvas. Change the color or resize the note then click away to save the note. Double-click an existing note to edit it.

Using Markdown text in a note allows you to style the contents, e.g. add links or create tables and lists.

Workflow search

For larger Workflows, it can be useful to search for a specific step using the keyboard shortcut Ctrl+F, which will highlight any matching steps and move the canvas to show any steps appearing off screen.

Workflow reports

Create an HTML report by clicking the Report button. The report contains a comprehensive and detailed summary of the Workflow:

  1. Contents: The summary, version history, Workflow parameters and a high-level abstracted graphical view of the entire Workflow, highlighting the relationships between steps.
  2. Workflow steps: Detailed information (e.g. position, column summaries, options, expressions) of each Workflow step.
  3. Output column lineage and Function glossary: Details of lineage for all output columns in the Workflow as well as any Functions used.

There are different ways to execute (or run) a Workflow:

  • Manually - click Run in the Workflow list page.
  • Set it to run later - as a one-off event or using a recurring Schedule.
  • Set up an Automation for the Workflow to be run.
  • Using the REST API.

Executing a Workflow manually opens a dialog allowing you to specify the following execution options:

  • Versions - select Draft or Published Workflow and dependencies.
  • Options to ignore Export, Fire event, Take snapshot Workflow steps and skip Refresh Sources.
  • Data sources - (if Can supply source when executed is selected for a Source step) select a Dataset (or local file) with a matching schema.
  • Workflow Parameter values.
  • Date and time - to set the Workflow to run later.

Executing a Workflow creates a Job in the Jobs list page.

Workflow parameters

A Workflow parameter is a placeholder value that can be used within different Workflow steps to apply values dynamically when running a Workflow. For example, create a mandatory parameter Initials that prompts a user to enter their initials which are then included as part of the filename on an Export step.

  1. Click Workflow parameters > Add parameter to create a new Workflow parameter.
  2. Give it a Name and a Datatype (Alphanumeric for any value or Numeric to force a number value).
  3. (Optional) Set the Value and add a Summary description. Select Configurable on runtime to allow a value to be specified by users and processes. Select Required to prevent a blank/null value being submitted against a configurable parameter when the Workflow is executed.