Data Explorer allows you to manage your data sources and to prepare and configure your data.
This menu lists all your data sources: both text files and database connections.
My files contains only the files you have uploaded; Data sources lists all the available connected/configured database connections.
To upload/connect to your data, do one of the following:
The supported file formats are:
Find out about using Metro 2 and JSON files.
The supported database drivers/services are:
You can create JDBC connections for the following databases:
Find out how to add a custom JDBC driver.
Data Studio supports files that contain data not displayed in tabular form (Metro 2 and JSON). You can also use the Data Studio SDK to create your own file parsers.
Upload the file as usual and the built-in file parser will automatically parse (i.e. read) the file.
Where required, you can configure the field used as the identifier for each unique record in the Data Explorer: right-click on the file and select Preview and configure. The Id field name in the Data tab must match one of the column names displayed in the preview. The parser will use this field as the identifier (the default is Data Studio Link, which we recommend leaving unchanged).
Metro 2 files are parsed so that each base record becomes a row in the resulting master table. Any segments associated with a base record are arranged in a sub-table named after that segment. The sub-tables contain the identifier column so that they can be linked back to the master record; many segment records can therefore reference the same base record.
JSON files with records containing arrays of elements will have these arrays broken out into resulting sub-tables which are linked to the associated master record by the identifier field.
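The arrangement described above (for both Metro 2 segments and JSON arrays) can be sketched as follows. This is an illustrative sketch only, not Data Studio's actual parser; the record, field names, and `ID_FIELD` value are hypothetical:

```python
# Illustrative sketch: flattening a JSON record with an array into a
# master row plus linked sub-table rows (hypothetical data and names).

record = {
    "id": "CUST-001",
    "name": "Jane Doe",
    "orders": [                      # array -> broken out into a sub-table
        {"order_no": 101, "total": 25.50},
        {"order_no": 102, "total": 9.99},
    ],
}

ID_FIELD = "id"  # stands in for the configurable identifier field

# Scalar fields stay in the master table.
master_row = {k: v for k, v in record.items() if not isinstance(v, list)}

# Each array becomes a sub-table; each element carries the parent's identifier.
sub_tables = {}
for key, value in record.items():
    if isinstance(value, list):
        sub_tables[key] = [
            {ID_FIELD: record[ID_FIELD], **element} for element in value
        ]

print(master_row)
# {'id': 'CUST-001', 'name': 'Jane Doe'}
print(sub_tables["orders"][0])
# {'id': 'CUST-001', 'order_no': 101, 'total': 25.5}
```

The identifier column duplicated into each sub-table row is what makes the later join back to the master record possible.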
Once uploaded, use Data Explorer to view the data. Right-click on the loaded file and select View data. This will show the parent/base table and all the sub-tables/segments. Similarly, you can profile and view outliers for either the parent table or sub-tables.
If the file can't be parsed for any reason, it will have the Error status. Right-click and select View data to see the details for the error.
Each table will also be available as a separate output node when a Metro 2 or JSON file is used in a workflow. Each output node represents a different table and you can preview the data in each by clicking on Show data.
To combine the sub-tables with the parent/base table we recommend using the Join step.
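Conceptually, joining a sub-table back to its parent table matches rows on the shared identifier column. A minimal pure-Python sketch of that idea (hypothetical table contents and column names; the Join step itself is configured in the workflow UI):

```python
# Sketch of joining a sub-table to its parent on the identifier column.

parent = [
    {"Data Studio Link": 1, "name": "Jane"},
    {"Data Studio Link": 2, "name": "Joe"},
]
orders = [
    {"Data Studio Link": 1, "order_no": 101},
    {"Data Studio Link": 1, "order_no": 102},
    {"Data Studio Link": 2, "order_no": 103},
]

# Inner join: pair up rows whose identifier values match.
joined = [
    {**p, **o}
    for p in parent
    for o in orders
    if p["Data Studio Link"] == o["Data Studio Link"]
]

print(joined[0])
# {'Data Studio Link': 1, 'name': 'Jane', 'order_no': 101}
```

Because several segment rows can reference one base record, the joined result can contain more rows than the parent table (three here, against a two-row parent).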
For each data source, the administrator can specify individual login credentials for each user/user group to ensure that restricted data is not accessible to anyone without the appropriate permissions.
To create and assign access credentials:
To remove access from a user or team, right-click on them and select Remove access.
To ensure your data has been interpreted correctly and is displayed as expected, you can preview it: right-click on your source data and select Preview and configure. A preview of the first 1,000 rows will be shown.
You can move, rename and exclude column headings and change character sets/delimiters/quotation marks.
To configure how the data should be interpreted, use the Data tab.
Depending on the file type, you can specify various settings. For example, for .csv files:
The Headings tab allows you to manipulate the column headings: you can rename, re-order, remove and exclude them. You can also tag data.
Alternatively, modify columns by right-clicking on the column header, selecting the required option and clicking Apply in the top-right corner to save your changes.
You can standardize your data easily before loading it into Data Studio: right-click on the column header and select Edit.
You can now apply one or more standardization options:
If you want to keep null and blank values separate, we recommend turning off Remove leading blanks and Remove trailing blanks. Otherwise, values that differ only in leading or trailing spaces will be loaded as the same value rather than treated as different values.
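The effect of trimming on distinct values can be illustrated with plain string handling (generic Python behaviour, not Data Studio internals):

```python
# With trimming on, values that differ only in surrounding spaces
# collapse to a single distinct value; with trimming off, they stay distinct.

values = ["Smith", " Smith", "Smith  "]

trimmed = {v.strip() for v in values}   # leading/trailing blanks removed
untrimmed = set(values)                 # blanks preserved

print(len(trimmed))    # 1  (treated as the same value)
print(len(untrimmed))  # 3  (treated as different values)
```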
By default, all of the standardization options above are enabled (except for Auto-parse scientific numbers). To change the enabled/disabled options for all files:
To rearrange columns, you can simply drag and drop the column headings. Click Apply in the top-right corner when you're happy with the changes. The file will now appear as Configured.
To revert all the changes made to your data:
Data Studio can use machine learning algorithms to automatically tag columns. Data tags allow you to enrich source data by specifying additional information about what each column contains. These are then used to improve the user experience by providing sensible defaults and suggestions. Tags are used in various workflow steps such as Find duplicates and Validate addresses.
Data tags appear next to the column name, so if you're not very familiar with your data, they provide an overview at a glance.
There are two types of data tags:
Tags in Data Studio are hierarchical: a parent tag (e.g. address) can have multiple child ones (e.g. street, city).
While you can't modify system defined tags, you can create a child tag (e.g. PO box) and assign it to a system-defined parent (e.g. Address). Once created, it will appear in the User defined list under that system parent tag.
To create new and manage existing tags, go to Glossary > Data tags. To add a new one, click Create a new data tag.
The best way to manually tag data is before it has been loaded, in preview:
If you tag columns in a workflow step, the tags will be applied to that column in all the subsequent steps for that workflow only.
Columns can be automatically tagged by allowing the system to recognize the type and meaning of data from its trained knowledge base.
For example, Data Studio can automatically detect columns containing names, addresses, and other customer-related data within a file. You can also easily train the system to recognize types of data that are specific to your organization (or not yet included in Data Studio's knowledge base).
One of the benefits of having your data tagged is to allow the Workflow Designer to apply intelligent defaults in your workflow steps, significantly speeding up workflow creation.
Automated tagging is done by processing the data as it's loaded into the system. The processing is performed one column at a time by comparing the patterns of its values with known patterns (fingerprints). This comparison results in a similarity score on a scale of 0-100, and there is a configurable threshold to determine how similar the data must be to the fingerprint in order for a tag to be applied automatically.
For some tags, you may want the threshold to be high for a positive result, particularly where the values are similar to the fingerprints of other tags. For others, where the data is typically unique to that tag, you may prefer a lower threshold. For example, columns containing country names tend to have a very high similarity score because there's a relatively small number of countries in the world. Conversely, columns containing surnames tend to have a lower similarity score due to the much larger number of surnames in most datasets. It's therefore appropriate to specify a higher similarity threshold for a country column and a lower threshold for a surname column.
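The per-tag threshold logic can be sketched as below. The scores, tag names, thresholds, and default value are all hypothetical; Data Studio's fingerprint comparison itself is internal to the product:

```python
# Sketch of threshold-based auto tagging: a tag is applied only when the
# column's similarity score meets that tag's configured threshold.

# Higher threshold where scores run high (Country), lower where the data
# is more varied (Surname). Values are illustrative only.
thresholds = {"Country": 90, "Surname": 60}

def auto_tag(column_scores):
    """Return the tags whose 0-100 similarity score meets the tag's threshold."""
    return [
        tag
        for tag, score in column_scores.items()
        if score >= thresholds.get(tag, 80)  # assumed default threshold
    ]

print(auto_tag({"Country": 95, "Surname": 40}))  # ['Country']
print(auto_tag({"Country": 70, "Surname": 65}))  # ['Surname']
```

In the second call, a country-like score of 70 is rejected by the high country threshold, while a modest surname score of 65 still passes its lower threshold.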
If you find that Data Studio is making mistakes when automatically tagging data, it's likely to be caused by a threshold that's set at the wrong level for the type of data you're dealing with.
Both system and user defined tags can be found in Glossary > Data tags.
To adjust the threshold, right-click on the tag and select Edit; the threshold can be set to any value between 1 and 100. You can also select Delete user training data to remove all the fingerprint files associated with the tag.
You can also exclude the tag from auto tagging by right-clicking on it and selecting Exclude from auto data tagging.
To enable auto tagging:
If you have manually applied tags to your data, you have the option to train the system so that when it receives similar data in the future, it can automatically apply the same tags. To train Data Studio's data tagger:
You can also specify the number of rows used to generate fingerprint files: go to Configuration > Loading and set Number Of Rows Used As Data Tagging Training Data.
View data and transform it as required.
Once the data is loaded you can:
Dependencies are normally used for table normalization – the process of decomposing a table into smaller tables based on sets of attributes.
Keys are used to identify records in a table and are therefore crucial when joining tables in cross-domain queries. The analysis of keys depends on the results of dependency analysis.
Once the table is loaded, right-click on it and select Dependency analysis.
In the Analyse dependencies dialog, specify the settings for the analysis, such as the maximum number of columns to consider as a multi-column key and the minimum correlation percentage for a relationship to be counted as a dependency.
The analysis result will be shown under the table name, after the Configured and Loaded statuses.
To see the generated results, right-click on the table and select Dependencies or Keys.