One of the last items to write about regarding Data Warehouse virtualisation (and any other form of ETL generation) is the handling of the metadata itself. In a previous post I covered what metadata needs to be captured at a minimum for ETL automation, and this post is all about how to incorporate this metadata from various locations. One technique in particular I tend to use is adding user-defined properties (extended properties) to the physical tables, so that model metadata can be read from the database catalog / data dictionary. Using this approach you can add all kinds of labels to database objects and read them back using standard SQL.
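As a minimal illustration of this idea – the property name, table and column below are purely illustrative and not taken from the demonstration models – tagging a column and reading the tag back works along these lines in SQL Server:

```sql
-- Tag a Link hash key with an (illustrative) 'Driving Key' property.
EXEC sp_addextendedproperty
     @name = N'Driving Key', @value = N'Yes',
     @level0type = N'SCHEMA', @level0name = N'dbo',
     @level1type = N'TABLE',  @level1name = N'LINK_CUSTOMER_OFFER',
     @level2type = N'COLUMN', @level2name = N'CUSTOMER_HSH';

-- Read all object- and column-level properties back from the catalog.
SELECT t.name   AS table_name,
       c.name   AS column_name,   -- NULL for table-level properties
       ep.name  AS property_name,
       ep.value AS property_value
FROM   sys.extended_properties ep
JOIN   sys.tables t
       ON ep.major_id = t.object_id
LEFT JOIN sys.columns c
       ON ep.major_id = c.object_id
      AND ep.minor_id = c.column_id
WHERE  ep.class = 1; -- object / column level
```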
As is often the case with these types of things, there are many ways to store and process this information, and this is merely the approach I have adopted while realising there are many other equally viable ways. In any case, the approach outlined below underpins the virtualisation solution.
At the most basic level the model metadata is collected / interpreted by analysing the model deployment. This information is combined with the source-to-target information – the information that you need to feed in from somewhere. The combined metadata is then stored in a set of temporary tables – the metadata model for automation. For the various demonstrations I use the ERwin community edition, for which I created the metadata model and some demonstration models. The metadata model is shown here.
The idea is to use this model to collect the required information and store it for access by the ETL automation logic (in this case the virtualisation tool). Of course, it is theoretically possible to derive / calculate this information on the fly, but I found it easier to store it in this model: it makes debugging easier and generally helps to get a grip on what happens. In this particular case I use this model to combine the model information (driving keys, multi-active attributes etc.) with the source-to-target mapping metadata.
The source-to-target mapping metadata can be delivered in many ways, ranging from specialised platforms such as AnalytixDS Mapping Manager to proprietary solutions built in data modelling tools, Excel sheets or database-backed forms applications. In most scenarios you will encounter Excel being used to record the source and target relationships in a way similar to this:
In this context, and regardless of how the mapping metadata is delivered or managed, I upload this information into the two highlighted tables in the metadata model. These act as a staging area for this kind of information before it is combined with the model metadata and processed further. The tables in question are listed below (a sketch of what they might look like follows the list):
- MD_TABLE_MAPPING (defining the relationships between the source and target tables and business key definitions)
- MD_ATTRIBUTE_MAPPING (defining the relationships between source and target attributes for the scenarios where these have different names or meanings)
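To make this a bit more tangible, below is a minimal sketch of what these two staging tables could look like. The actual definitions ship with the metadata model download and will differ in detail; the columns shown here are assumptions only:

```sql
-- Hypothetical sketch only; the real DDL is part of the metadata model download.
CREATE TABLE MD_TABLE_MAPPING
(
    SOURCE_TABLE            VARCHAR(100)  NOT NULL,
    TARGET_TABLE            VARCHAR(100)  NOT NULL,
    BUSINESS_KEY_DEFINITION VARCHAR(4000) NULL,  -- e.g. composite / concatenated keys
    FILTER_CRITERIA         VARCHAR(4000) NULL
);

CREATE TABLE MD_ATTRIBUTE_MAPPING
(
    SOURCE_TABLE     VARCHAR(100) NOT NULL,
    SOURCE_ATTRIBUTE VARCHAR(100) NOT NULL,
    TARGET_TABLE     VARCHAR(100) NOT NULL,
    TARGET_ATTRIBUTE VARCHAR(100) NOT NULL
);
```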
In most cases I try to make life easier by automating this upload from whatever location the mapping metadata is managed!
Assuming the model is deployed (forward engineered) and the data dictionary can therefore be accessed, everything is now in place to populate the metadata model itself. I am working on porting this logic to C# so it can live in the virtualisation tool, but for the time being (not wanting to delay publication any longer) it is developed in SQL. The script is too long to explain in much detail here, but effectively it runs a series of queries that read the table structures and extended properties from the database and shape the required information in an accessible way. Examples are (a couple of these are sketched in SQL after the list):
- Create a list of all the tables that start with ‘HUB_’ or ‘LINK_’
- Check if any attributes are part of the Primary Key in Satellites other than the Hash key and Load Date / Time Stamp; these will be multi-active (multi-variant) attributes
- Check if any attributes are part of the Alternate / Unique Key in Link tables other than the Hash values; these will be degenerate attributes
- Check if any of the Link Hash keys are flagged with extended properties; these will be Driving Keys
- For Dimensional models only: check what the history type is for the individual attributes
- … and many more
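As a sketch of what a couple of these checks look like against the SQL Server catalog – note that the prefixes, the hash key suffix and the load date / time stamp column name are assumptions that follow the demo naming conventions, not fixed rules:

```sql
-- Create a list of all Hub and Link tables based on the naming convention.
SELECT TABLE_SCHEMA, TABLE_NAME
FROM   INFORMATION_SCHEMA.TABLES
WHERE  TABLE_TYPE = 'BASE TABLE'
  AND (TABLE_NAME LIKE 'HUB[_]%' OR TABLE_NAME LIKE 'LINK[_]%');

-- Find Satellite Primary Key attributes other than the hash key and the
-- load date / time stamp; anything left over is a multi-active attribute.
-- The '%[_]SK' suffix and 'LOAD_DATETIME' column name are assumed conventions.
SELECT kcu.TABLE_NAME, kcu.COLUMN_NAME
FROM   INFORMATION_SCHEMA.TABLE_CONSTRAINTS tc
JOIN   INFORMATION_SCHEMA.KEY_COLUMN_USAGE kcu
       ON  tc.CONSTRAINT_NAME   = kcu.CONSTRAINT_NAME
       AND tc.CONSTRAINT_SCHEMA = kcu.CONSTRAINT_SCHEMA
WHERE  tc.CONSTRAINT_TYPE = 'PRIMARY KEY'
  AND  kcu.TABLE_NAME LIKE 'SAT[_]%'
  AND  kcu.COLUMN_NAME NOT LIKE '%[_]SK'
  AND  kcu.COLUMN_NAME <> 'LOAD_DATETIME';
```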
It is obvious why this type of interpretation is better placed in C# – it allows more control and simplifies what is sometimes fairly complex SQL. It also allows for better validation options – something I am developing now, but will probably add later as part of a complete solution. The SQL is available for download below; please bear in mind that it is geared towards the naming conventions of the demonstration setup and might need some tweaks if you adopt other standards!
The models and SQL files are available for download here. This will allow you to set up your own demo if you’re interested.
The process to get things up and running is to:
- Deploy the metadata model
- Make sure the two metadata staging tables are populated
- Execute the SQL to parse the metadata
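In SQL terms – and assuming the stored procedure from the download takes no parameters, which you should verify against the script itself – the last two steps boil down to something like:

```sql
-- Refresh the mapping staging tables (from Excel, a forms application, etc.)
-- and run the parsing logic. Parameters, if any, depend on the downloaded script.
TRUNCATE TABLE MD_TABLE_MAPPING;
TRUNCATE TABLE MD_ATTRIBUTE_MAPPING;

-- ... insert the source-to-target mapping rows here ...

EXEC dbo.usp_MD_DV_METADATA;
```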
To make life a bit easier I wrapped this up in an Excel file that uses a macro to copy the contents to the metadata staging tables and execute the SQL. Let me know how it goes. As always I’m open to comments and suggestions!
The files can be downloaded here:
- ETL Automation Metadata (an Excel example with some mapping content)
- Data Virtualisation Data Models (a zip file containing sample data models and the virtualisation metadata model)
- usp_MD_DV_METADATA (a zip file containing the SQL scripted as a Stored Procedure)