Data Explorer allows you to manage your data sources and to prepare and configure your data.
This menu lists all your data sources: both text files and database connections.
My files contains only the files you have uploaded; Data sources lists all the available connected/configured database connections.
To upload/connect to your data, do one of the following:
The supported file formats are:
Find out about using Metro 2 and JSON files.
The supported database drivers/services are:
You can create JDBC connections for the following databases:
Find out how to add a custom JDBC driver.
Data Studio supports files that contain data not displayed in tabular form (Metro 2 and JSON). You can also use the Data Studio SDK to create your own file parsers.
Upload the file as usual and the built-in file parser will automatically parse (i.e. read) the file.
Where required, you can configure the field used as the identifier for each unique record in the Data Explorer: right-click on the file and select Preview and configure. The Id field name in the Data tab must match one of the column names displayed in the preview. The parser will use this field as the identifier (the default is Data Studio Link, which we recommend leaving unchanged).
Metro 2 files are parsed so that each base record becomes a row in the resulting master table. Any segments associated with a base record are arranged in a sub-table named after that segment. The sub-tables contain the identifier column so that they can be linked back to the master record; many segment records can therefore reference the same base record.
JSON files with records containing arrays of elements will have these arrays broken out into resulting sub-tables which are linked to the associated master record by the identifier field.
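The arrangement described above (for both Metro 2 segments and JSON arrays) can be sketched as follows. This is an illustrative sketch only, not Data Studio's actual parser; the record, field names, and `ID_FIELD` value are hypothetical:

```python
# Illustrative sketch: flattening a JSON record with an array into a
# master row plus linked sub-table rows (hypothetical data and names).

record = {
    "id": "CUST-001",
    "name": "Jane Doe",
    "orders": [                      # array -> broken out into a sub-table
        {"order_no": 101, "total": 25.50},
        {"order_no": 102, "total": 9.99},
    ],
}

ID_FIELD = "id"  # stands in for the configurable identifier field

# Scalar fields stay in the master table.
master_row = {k: v for k, v in record.items() if not isinstance(v, list)}

# Each array becomes a sub-table; each element carries the parent's identifier.
sub_tables = {}
for key, value in record.items():
    if isinstance(value, list):
        sub_tables[key] = [
            {ID_FIELD: record[ID_FIELD], **element} for element in value
        ]

print(master_row)
# {'id': 'CUST-001', 'name': 'Jane Doe'}
print(sub_tables["orders"][0])
# {'id': 'CUST-001', 'order_no': 101, 'total': 25.5}
```

The identifier column duplicated into each sub-table row is what makes the later join back to the master record possible.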
Once uploaded, use Data Explorer to view the data. Right-click on the loaded file and select View data. This will show the parent/base table and all the sub-tables/segments. Similarly, you can profile and view outliers for either the parent table or sub-tables.
If the file can't be parsed for any reason, it will have the Error status. Right-click and select View data to see the details for the error.
Each table will also be available as a separate output node when a Metro 2 or JSON file is used in a workflow. Each output node represents a different table and you can preview the data in each by clicking on Show data.
To combine the sub-tables with the parent/base table we recommend using the Join step.
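Conceptually, joining a sub-table back to its parent table matches rows on the shared identifier column. A minimal pure-Python sketch of that idea (hypothetical table contents and column names; the Join step itself is configured in the workflow UI):

```python
# Sketch of joining a sub-table to its parent on the identifier column.

parent = [
    {"Data Studio Link": 1, "name": "Jane"},
    {"Data Studio Link": 2, "name": "Joe"},
]
orders = [
    {"Data Studio Link": 1, "order_no": 101},
    {"Data Studio Link": 1, "order_no": 102},
    {"Data Studio Link": 2, "order_no": 103},
]

# Inner join: pair up rows whose identifier values match.
joined = [
    {**p, **o}
    for p in parent
    for o in orders
    if p["Data Studio Link"] == o["Data Studio Link"]
]

print(joined[0])
# {'Data Studio Link': 1, 'name': 'Jane', 'order_no': 101}
```

Because several segment rows can reference one base record, the joined result can contain more rows than the parent table (three here, against a two-row parent).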
For each data source, the administrator can specify individual login credentials for each user/user group to ensure that restricted data is not accessible to anyone without the appropriate permissions.
To create and assign access credentials:
To remove access from a user or team, right-click on them and select Remove access.
To ensure your data has been interpreted correctly and is displayed as expected, you can preview it: right-click on your source data and select Preview and configure. A preview of the first 1,000 rows will be shown.
You can move, rename and exclude column headings and change character sets/delimiters/quotation marks.
To configure how the data should be interpreted, use the Data tab.
Depending on the file type, you can specify various settings. For example, for .csv files:
The Headings tab allows you to manipulate the column headings: you can rename, re-order, remove and exclude them. You can also tag data.
Alternatively, modify columns by right-clicking on the column header, selecting the required option and clicking Apply in the top-right corner to save your changes.
You can standardize your data easily before loading it into Data Studio: right-click on the column header and select Edit.
You can now apply one or more standardization options:
If you want to keep null and blank values separate, we recommend turning off Remove leading blanks and Remove trailing blanks. Otherwise, values that differ only in leading or trailing spaces will be loaded as the same value rather than treated as different values.
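The effect of trimming on distinct values can be illustrated with plain string handling (generic Python behaviour, not Data Studio internals):

```python
# With trimming on, values that differ only in surrounding spaces
# collapse to a single distinct value; with trimming off, they stay distinct.

values = ["Smith", " Smith", "Smith  "]

trimmed = {v.strip() for v in values}   # leading/trailing blanks removed
untrimmed = set(values)                 # blanks preserved

print(len(trimmed))    # 1  (treated as the same value)
print(len(untrimmed))  # 3  (treated as different values)
```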
By default, all of the standardization options above are enabled (except for Auto-parse scientific numbers). To change the enabled/disabled options for all files:
To rearrange columns, you can simply drag and drop the column headings. Click Apply in the top-right corner when you're happy with the changes. The file will now appear as Configured.
To revert all the changes made to your data:
Data Studio can use machine learning algorithms to automatically tag columns. Data tags allow you to enrich source data by specifying additional information about what each column contains. These are then used to improve the user experience by providing sensible defaults and suggestions. Tags are used in various workflow steps such as Find duplicates and Validate addresses.
Data tags appear next to the column name, so if you're not very familiar with your data, they provide an overview at a glance.
There are two types of data tags:
Tags in Data Studio are hierarchical: a parent tag (e.g. address) can have multiple child ones (e.g. street, city).
While you can't modify system defined tags, you can create a child tag (e.g. PO box) and assign it to a system-defined parent (e.g. Address). Once created, it will appear in the User defined list under that system parent tag.
To create new and manage existing tags, go to Glossary > Data tags. To add a new one, click Create a new data tag.
The best way to manually tag data is before it has been loaded, in preview:
If you tag columns in a workflow step, the tags will be applied to that column in all the subsequent steps for that workflow only.
Columns can be automatically tagged by allowing the system to recognize the type and meaning of data from its trained knowledge base.
For example, Data Studio can automatically detect columns containing names, addresses, and other customer-related data within a file. You can also easily train the system to recognize types of data that are specific to your organization (or not yet included in Data Studio's knowledge base).
One of the benefits of having your data tagged is to allow the Workflow Designer to apply intelligent defaults in your workflow steps, significantly speeding up workflow creation.
Automated tagging is done by processing the data as it's loaded into the system. The processing is performed one column at a time by comparing the patterns of its values with known patterns (fingerprints). This comparison results in a similarity score on a scale of 0-100, and there is a configurable threshold to determine how similar the data must be to the fingerprint in order for a tag to be applied automatically.
For some tags, you may want the threshold to be high for a positive result, particularly where the values are similar to the fingerprints of other tags. For others, where the data is typically unique to that tag, you may prefer a lower threshold. For example, columns containing country names tend to have a very high similarity score because there's a relatively small number of countries in the world. Conversely, columns containing surnames tend to have a lower similarity score due to the much larger number of surnames in most datasets. It's therefore appropriate to specify a higher similarity threshold for a country column and a lower threshold for a surname column.
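The per-tag threshold logic can be sketched as below. The scores, tag names, thresholds, and default value are all hypothetical; Data Studio's fingerprint comparison itself is internal to the product:

```python
# Sketch of threshold-based auto tagging: a tag is applied only when the
# column's similarity score meets that tag's configured threshold.

# Higher threshold where scores run high (Country), lower where the data
# is more varied (Surname). Values are illustrative only.
thresholds = {"Country": 90, "Surname": 60}

def auto_tag(column_scores):
    """Return the tags whose 0-100 similarity score meets the tag's threshold."""
    return [
        tag
        for tag, score in column_scores.items()
        if score >= thresholds.get(tag, 80)  # assumed default threshold
    ]

print(auto_tag({"Country": 95, "Surname": 40}))  # ['Country']
print(auto_tag({"Country": 70, "Surname": 65}))  # ['Surname']
```

In the second call, a country-like score of 70 is rejected by the high country threshold, while a modest surname score of 65 still passes its lower threshold.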
If you find that Data Studio is making mistakes when automatically tagging data, it's likely to be caused by a threshold that's set at the wrong level for the type of data you're dealing with.
Both system and user defined tags can be found in Glossary > Data tags.
To adjust the threshold, right-click on the tag and select Edit; the threshold can be set to any value between 1 and 100. You can also select Delete user training data to remove all the fingerprint files associated with the tag.
You can also exclude the tag from auto tagging by right-clicking on it and selecting Exclude from auto data tagging.
To enable auto tagging:
If you have manually applied tags to your data, you have the option to train the system so that when it receives similar data in the future, it can automatically apply the same tags. To train Data Studio's data tagger:
You can also specify the number of rows used to generate fingerprint files: go to Configuration > Loading and set Number Of Rows Used As Data Tagging Training Data.
View data and transform it as required.
Once the data is loaded you can:
Dependencies are normally used for table normalization – the process of decomposing a table into smaller tables based on sets of attributes.
Keys are used to identify records in a table and are therefore crucial when joining tables in cross-domain queries. The analysis of keys depends on the results of dependency analysis.
Once the table is loaded, right-click on it and select Dependency analysis.
In the Analyse dependencies dialog, specify the settings for the analysis, such as the maximum number of columns to consider as a multi-column key and the minimum correlation percentage for a relationship to be counted as a dependency.
The analysis result will be shown under the table name, after the Configured and Loaded statuses.
To see the generated results, right-click on the table and select Dependencies or Keys.