
NoETL – Data Vault Satellite tables


The recent presentations provided a push to wrap up the development and release of the Data Vault virtualisation initiative, so now that everything is working properly the next few posts should be relatively quick to produce. First off is the Satellite processing, which supports the typical elements we have seen earlier:

  • Regular, composite, concatenated business keys with hashing
  • Zero record provision
  • Reuse of the objects for ETL purposes if required

As this is another process going from ‘staging’ to ‘integration’ the approach is very similar to the NoETL Hub post from a while ago. The pattern used is the virtual Satellite as explained here. As with the virtual Hubs the really cool concept is that you can use the view to act as a normal table. This is because the information visible in the view is 100% the same as if you were to physically create the table, create the ETL and load the table with data using this ETL process. The view is a true mirror of the end result, hence the virtualisation terminology. And even more interesting: you can use the same object (the view) to act as ETL again so you can literally swap between virtualised and physical without any rework or downtime.

For good reference the example model is shown here as well:

Data Warehouse Virtualisation sample model

As you can see this example model contains various test cases I used to validate the automation logic, including the code built into the virtualisation tool. Specifically the following are relevant for this post:

  • SAT_CUSTOMER (has fewer attributes than the underlying Staging Area table)
  • SAT_INCENTIVE_OFFER (normal, easiest one)
  • SAT_MEMBERSHIP_PLAN_DETAIL (one out of two Satellites off the same Hub table)
  • SAT_MEMBERSHIP_PLAN_VALUATION (two out of two Satellites off the same Hub table, with a multi-active attribute)
  • SAT_SEGMENT (based on a user-managed staging table, i.e. without a true source system)

As with the Hub view, the first thing to do is to configure the required metadata. In this case this covers the mapping between the source and target (Satellite) tables, the interpretation of the Business Key and any filtering criteria.

As an example I have set up the following metadata:

Satellite Metadata

This shows that, for instance, the SAT_CUSTOMER table is sourced from the STG_PROFILER_CUSTOMER_PERSONAL table using the CustomerID source attribute as the (column to be mapped to the) Business Key.

The generated logic is visible below, together with confirmation that the five views have been generated as expected.

Generating Virtual Satellites

The views have been directly generated in the database, which now contains 5 view objects for the Satellites. In terms of structure and contents these are identical to the physical tables including end-dating and the optional current record indicator.

This is visible in the screenshot below although there is not enough space to show the various attributes.

Generating Virtual Satellites output

Because there is only one data delta available in the PSA there are two records for every Satellite key: the zero record and the normal record. Subsequent deltas in the PSA will add records to this view accordingly. Bear in mind that at present the ideas around zero records have changed slightly (more options are available). In any case this behaves 100% identically to the generated ETL processes we have used for Satellites, so rather than going into details about the record output it is probably more interesting to look at the view logic.

Virtual Satellite Query Explanation

I should mention the following two things:

  • Kent Graziano commented recently that the Current Row Indicator is better derived using the LEAD function (similar to the LOAD_END_DATETIME) to improve performance, as RANK is more expensive. I haven’t made this change yet but will update this shortly; a sketch of the LEAD-based approach follows this list.
  • The SELECT DISTINCT in the subquery still needs to be updated to a GROUP BY or similar, because DISTINCT doesn’t necessarily condense timelines properly. The idea is that removing attributes from the selection (e.g. when only a few of the available attributes are mapped to the Satellite) requires timelines to be condensed in this pass, and DISTINCT is not sufficient to achieve this. When I have some time I intend to update this as well.
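To illustrate the first point, a LEAD-based derivation of the end date and Current Row Indicator could look like the sketch below. This is a sketch only – the table and column names (SAT_CUSTOMER, CUSTOMER_HSH, LOAD_DATETIME) follow the sample model and the logic is shown directly against the Satellite for brevity; in the virtual view the same window functions would be applied over the PSA-based subquery.

 SELECT
   CUSTOMER_HSH,
   LOAD_DATETIME,
   -- The end date is the effective date of the next record for the key (or the high date)
   LEAD(LOAD_DATETIME, 1, '9999-12-31')
     OVER (PARTITION BY CUSTOMER_HSH ORDER BY LOAD_DATETIME) AS LOAD_END_DATETIME,
   -- The current record is simply the one without a successor; no RANK required
   CASE
     WHEN LEAD(LOAD_DATETIME) OVER (PARTITION BY CUSTOMER_HSH ORDER BY LOAD_DATETIME) IS NULL
     THEN 'Y' ELSE 'N'
   END AS CURRENT_RECORD_INDICATOR
 FROM SAT_CUSTOMER;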

One of the really cool features of this concept is that now that the view is there and ready to be used for upstream interpretation, you can also use the same view to act as an ETL process. For this purpose I created the INSERT INTO feature in the tool. This generates an INSERT INTO statement that selects from the view, while checking that information is not inserted more than once.

Virtual Satellite Insert Into
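As a rough outline of the pattern (hypothetical object and column names – not the tool’s exact output): the generated statement selects from the view and only inserts key / load date/time combinations that are not yet present in the physical Satellite.

 INSERT INTO SAT_CUSTOMER (CUSTOMER_HSH, LOAD_DATETIME, RECORD_SOURCE, GIVEN_NAME)
 SELECT vw.CUSTOMER_HSH, vw.LOAD_DATETIME, vw.RECORD_SOURCE, vw.GIVEN_NAME
 FROM vw_SAT_CUSTOMER vw       -- the virtual Satellite (view)
 WHERE NOT EXISTS
 (
   SELECT 1
   FROM SAT_CUSTOMER sat       -- the physical Satellite
   WHERE sat.CUSTOMER_HSH  = vw.CUSTOMER_HSH
     AND sat.LOAD_DATETIME = vw.LOAD_DATETIME
 );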

There you go: the Virtual Satellite and ETL generated from metadata and supported using a single database object!


Quick and easy referential integrity validation (for dynamic testing)


This post is in a way related to the recent post about generating some test data. In a similar way I was looking for ways to make life a bit easier when it comes to validating the outputs of Data Vault ETL processes. Some background is provided in an earlier post on the topic of Referential Integrity (RI), specifically in the context of Data Vault 2.0. In short, by adopting the hash key concepts it is possible to run ETL processes in a different order – the Satellite before the Hub, for instance. In a way this is similar (process-wise) to loading Fact tables before Dimensions in the world of Dimensional Modelling.

This means that traditional RI is not applicable: the key lookups in ETL and the constraints at database level need to be removed to allow this type of parallelism to be adopted. It also means we still need to validate RI at some point, ideally before the data delta gets committed to the database so things can be rolled back if the RI isn’t correct after processing the delta.

As a default approach I handle this by adding a final process step to validate RI in a given workflow. All data from a specific Staging Area table is loaded to the various Data Vault tables by the dedicated ETL processes – but before completing the workflow the RI check (a SQL query) will validate if everything is in sync for the involved tables and data delta. If not, the workflow will be flagged as ‘failed’. The SQL itself basically uses the model metadata to see which tables (should) have outgoing Foreign Keys (FK) and checks if there are corresponding parent records. For example, a Satellite has a FK to a Hub, a Link has FKs to the related Hubs, and so on.

The logic is very straightforward and I wanted to share this here as an example. I updated my homebrew Data Vault automation tool to generate some end-to-end and partial validation scripts as shown here. The end-to-end validation checks the full tables, while the partial validation includes logic to apply this only to the available data delta. As with the test data generation, it’s nothing fancy but it does the job (although I probably should use NOT EXISTS instead of an OUTER JOIN – a sketch of that variant follows below the screenshot).

RI Validation
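For reference, the NOT EXISTS variant mentioned above boils down to something like this per Satellite (or Link) and its parent Hub – any record returned indicates an RI problem. Table and column names are taken from the sample model and are illustrative only:

 SELECT sat.CUSTOMER_HSH, sat.LOAD_DATETIME
 FROM SAT_CUSTOMER sat
 WHERE NOT EXISTS
 (
   SELECT 1
   FROM HUB_CUSTOMER hub
   WHERE hub.CUSTOMER_HSH = sat.CUSTOMER_HSH
 );

The partial (delta-only) version adds a filter on the Load Date / Time Stamp or the process ID of the data delta being validated.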

It is a bit of professional hobbyism, but it works for me as I can now set up metadata and table structures and immediately generate and process test data and validate the integrity in one go. Doing this ensures that all metadata is configured correctly and that, for instance, no errors in the hash key definitions are present. It is also easy to swap between the virtual and physical structure, in line with the virtualisation efforts.

From a more traditional ETL perspective: as the workflows for ETLs can be generated as well, there is always the option to initially generate the ETL-based and/or database-based RI validation (lookups and database constraints) and adjust this when things get more stable later. Various people I spoke with adopt this approach. Quite simply this means: generate your ETL using the hashing concept but with key lookups and corresponding dependencies in the workflow, then regenerate and apply post-processing RI validation when performance becomes an issue (e.g. when parallelism is required). That makes sense too.

NoETL – Data Vault Link tables


Virtualising Data Vault Link structures follows a similar process to that of the virtual Hubs, with some small additions such as the support for (optional) degenerate attributes. To make things a bit more interesting I created some metadata that requires different Business Key ‘types’ so this can be shown and tested in the virtualisation program.

For the example in this post I created three Link definitions (the metadata), one of which (LNK_CUSTOMER_COSTING) has a three-way relationship with the following properties:

  • 1 normal / straightforward key: HUB_CUSTOMER (using the source value ‘Member’)
  • 1 composite key: HUB_MEMBERSHIP_PLAN which is composed of ‘Plan_Code’ and ‘Plan_Suffix’ (the latter being a random hardcoded value for testing purposes)
  • 1 concatenated key: HUB_SEGMENT for which a single-attribute Business Key is created from concatenating the ‘Segment’ attribute and some arbitrary hardcoded values (this could have been another attribute as well)
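To give an idea of how these three key types end up in the Link hash key, a rough sketch is shown below. The staging table and attribute names are assumptions for illustration, ‘XYZ’ stands in for the arbitrary hardcoded values mentioned above (Plan_Suffix is shown as a column for simplicity even though the example metadata hardcodes it), and the delimiter-sanded MD5 approach follows the conventions used elsewhere in these posts:

 SELECT
   CONVERT(CHAR(32), HASHBYTES('MD5',
     ISNULL(RTRIM(CONVERT(NVARCHAR(100), [Member])), 'NA')      + '|' +   -- normal key
     ISNULL(RTRIM(CONVERT(NVARCHAR(100), [Plan_Code])), 'NA')   + '|' +   -- composite key, part 1
     ISNULL(RTRIM(CONVERT(NVARCHAR(100), [Plan_Suffix])), 'NA') + '|' +   -- composite key, part 2
     ISNULL(RTRIM(CONVERT(NVARCHAR(100), [Segment])), 'NA') + 'XYZ' + '|' -- concatenated key
   ), 2) AS CUSTOMER_COSTING_HSH
 FROM STG_PROFILER_COSTING;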

The used metadata is shown here:

Link Metadata

As with the Hub views, the Link views may also use UNION statements to make sure data from different sources is presented as an integrated result. And similar to Hubs as well, the correct (target) attribute naming is handled in the subquery whereas hashing and condensing is implemented in the outer query.

It’s all very similar – demonstrating the pattern-based approach towards design and implementation!

The generated output is shown in the next two screenshots.

Link Generated Output

 Link_Output

Because looking directly at Link table contents isn’t the most exciting thing to do I added some clarifications to explain how the query logic works:

Link_View_Explanation

The INSERT INTO statement, which allows the view to ‘act’ as ETL again, is in many ways similar to the Hub approach. Obviously, in this case the Link hash key is used to check if a record already exists in the target physical table!

Link Insert Into Statement
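A rough outline of the pattern (hypothetical object names – not the tool’s exact output); note that the existence check is on the Link hash key only:

 INSERT INTO LNK_CUSTOMER_COSTING
   (CUSTOMER_COSTING_HSH, CUSTOMER_HSH, MEMBERSHIP_PLAN_HSH, SEGMENT_HSH, LOAD_DATETIME, RECORD_SOURCE)
 SELECT
   vw.CUSTOMER_COSTING_HSH, vw.CUSTOMER_HSH, vw.MEMBERSHIP_PLAN_HSH, vw.SEGMENT_HSH, vw.LOAD_DATETIME, vw.RECORD_SOURCE
 FROM vw_LNK_CUSTOMER_COSTING vw    -- the virtual Link (view)
 WHERE NOT EXISTS
 (
   SELECT 1
   FROM LNK_CUSTOMER_COSTING lnk    -- the physical Link
   WHERE lnk.CUSTOMER_COSTING_HSH = vw.CUSTOMER_COSTING_HSH
 );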

Over the last few posts we have generated a source-to-staging process, a Persistent Staging Area (PSA) process and the Hubs, Satellites and Links that simulate the Data Warehouse on top of the PSA. Next up are the Link Satellites, which will be split across two posts to cater for the variations in this particular area.

NoETL and ETL automation metadata overview


One of the last items to write about regarding Data Warehouse virtualisation (and any other form of ETL generation) is the handling of the metadata itself. In a previous post I covered what metadata needs to be captured at a minimum for ETL automation, and this post is all about how to incorporate this metadata from various locations. One technique in particular I tend to use is user-defined properties (or extended properties) on the physical tables, which make it possible to read model metadata from the database catalog / data dictionary. Using this approach you can add all kinds of labels to database objects which you can read using standard SQL.
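As an illustration, once the extended properties are in place they can be read back with a simple query against the SQL Server catalog views (the property names themselves – for instance a ‘Driving Key’ flag – are whatever you define in your model):

 SELECT
   t.name   AS table_name,
   c.name   AS column_name,
   ep.name  AS property_name,
   ep.value AS property_value
 FROM sys.extended_properties ep
 JOIN sys.tables t
   ON t.object_id = ep.major_id
 LEFT JOIN sys.columns c              -- minor_id = 0 means the property is on the table itself
   ON c.object_id = ep.major_id
  AND c.column_id = ep.minor_id
 WHERE ep.class = 1;                  -- object / column level properties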

As is often the case with these types of things there are many ways to store and process information, and this is merely the approach I have adopted while realising there are many other equally viable ways. In any case the approach outlined below underpins the virtualisation solution.

Automation and virtualisation process

At the very basic level the model metadata is collected / interpreted by analysing the model deployment. This information is combined with the source-to-target information – the information that you need to feed in from somewhere. The combined metadata is then stored in a set of temporary tables – the metadata model for automation. For the various demonstrations I use the ERwin community edition, for which I created the metadata model and some demonstration models. The metadata model is shown here.

Metadata_Model

The idea is to use this model to collect the required information and store it for access by ETL automation logic (in this case the virtualisation tool). Of course, it is theoretically possible to derive / calculate this information on the fly but I found it easier to store it in this model. This allows easier debugging and generally getting a grip on what happens. In this particular case I use this model to combine the model information (driving keys, multi-active attributes etc.) with the source-to-target mapping metadata.

The source-to-target mapping metadata can be delivered in many ways, ranging from specialised platforms such as AnalytixDS Mapping Manager to proprietary solutions built in data modelling tools, Excel sheets or database-backed forms applications. In most scenarios you would encounter Excel being used to record the source and target relationships, in a way similar to this:

Excel Metadata

In this context, regardless of how the mapping metadata is delivered or managed, I upload this information into the two highlighted tables in the metadata model. These act as a staging area for this kind of information, before it is combined with the model metadata and processed further. The tables in question are:

  • MD_TABLE_MAPPING (defining the relationships between the source and target tables and business key definitions)
  • MD_ATTRIBUTE_MAPPING (defining the relationships between source and target attributes for the scenarios that these have different names or meaning)

In most cases I try to make life easier by automating this upload from whatever location the mapping metadata is managed!

Assuming the model is deployed (forward engineered) and therefore the data dictionary can be accessed, everything is now in place to populate the metadata model itself. I’m working on reworking this logic to C# to live in the virtualisation tool but for the time being (not wanting to delay publication any longer) it’s developed in SQL. The script is too long to explain in a lot of detail here but effectively it runs a series of queries to read the table structures and extended properties from the database to create the required information in an accessible way. Examples are:

  • Create a list of all the tables that start with ‘HUB_’, or ‘LINK_’
  • Check if there are any attributes part of the Primary Key in Satellites other than the Hash and Load Date / Time Stamp; these will be multi-active (multi-variant) attributes
  • Check if there are any attributes part of the Alternate / Unique Key in Link tables other than Hash values; these will be degenerate attributes
  • Check if any of the Link Hash keys are flagged with extended properties; these will be Driving Keys
  • For Dimensional models only: check what the history type is for the individual attributes
  • … and many more
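To give an impression, the first check in this list is little more than a query against the data dictionary (naming conventions as per the demonstration setup – adjust the prefixes to your own standards):

 SELECT TABLE_SCHEMA, TABLE_NAME
 FROM INFORMATION_SCHEMA.TABLES
 WHERE TABLE_TYPE = 'BASE TABLE'
   AND (TABLE_NAME LIKE 'HUB[_]%' OR TABLE_NAME LIKE 'LNK[_]%');  -- [_] escapes the underscore wildcard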

It is obvious why this type of interpretation is better placed in C# – it allows more control and simplification of sometimes fairly complex SQL. It also allows for better validation options – something I’m developing now but will probably add a bit later as a complete solution. The SQL is available for download below; please bear in mind that this is geared towards the naming conventions of the demonstration setup and might need some tweaks if you adopt other standards!

The models and SQL files are available for download here. This will allow you to setup your own demo if you’re interested.

The process to get things up and running is to:

  1. Deploy the metadata model
  2. Make sure the two metadata staging tables are populated
  3. Execute the SQL to parse the metadata

To make life a bit easier I wrapped this up in an Excel file that uses a macro to copy the contents to the metadata staging tables and executes the SQL. Let me know how it goes. As always I’m open to comments and suggestions!

The files can be downloaded here:

NoETL – Data Vault Link Satellite tables (part 1)


The final topic in the series of planned posts (for now at least) about Data Warehouse virtualisation is Link Satellites. As with some of the earlier posts there are various similarities to the earlier approaches – most notably the Satellite virtualisation and processing. Concepts such as zero records and ‘virtual’ or computed end-dating are all there again, as are the constructs of using subqueries to do attribute mapping and outer queries to calculate hash keys. As with the previous solutions the logic will use pre-calculated hash keys if they’re there (e.g. already generated going into Staging) but otherwise calculate the hash keys ‘on the fly’.

For the purpose of explaining the logic I have categorised Link Satellites into normal, or historical, Link Satellites and Driving Key based Link Satellites. The normal ones act exactly the same as the Satellite processes, including the support of multi-active attributes and the way zero records are handled. The only difference, obviously, is that Link Satellites provide context (track history) for relationships (Links) instead of business concepts (Hubs).

The Driving Key based Link Satellites are slightly different, as they focus on capturing history of the relationship itself. There are (typically) no attributes to be tracked here – it’s really about the relationship (as in the combination of Hub keys) itself.

An easy example is the hypothetical relationship between an employee and a company. This information is captured in a Link relationship table; an employee key and company key are created as a Link record. If the employee decides to work for another company this creates a new Link record that captures the existence of a relationship between the employee and the other (new) company. There are now two records in the Link table as over time there have been two relationships.

But here is the thing: if we agree that the move from the employee should end-date the relationship of that employee with the previous company, we should create a record in the Link Satellite to reflect this. We are effectively defining what is the ‘one’ side (as in ‘one-to-many’) in an entity that is set up as a ‘many-to-many’ table. From a modelling point of view we are making the employee the ‘one’ side: an employee can only be associated with a single company at any point in time. In other words: employee is the ‘driving key’.

In the virtualisation / automation metadata I use user-defined properties / labels to identify which of the Hub keys in a given Link acts as the driving key. This is forward-engineered as an extended property that can be queried from the data dictionary.
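For illustration, flagging one of the Hub keys in a Link as the Driving Key could be forward-engineered as an extended property along these lines (the property name ‘Driving Key’ and its value are assumptions – use whatever label your interpretation logic expects):

 EXEC sys.sp_addextendedproperty
   @name       = N'Driving Key',
   @value      = N'True',
   @level0type = N'SCHEMA', @level0name = N'dbo',
   @level1type = N'TABLE',  @level1name = N'LNK_CUSTOMER_OFFER',
   @level2type = N'COLUMN', @level2name = N'CUSTOMER_HSH';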

Driving key based logic is more interesting than the regular Satellite / Link-Satellite logic, mainly because you need to cater for re-opening closed relationships over time. An interesting challenge to virtualise!

I won’t repeat myself talking about the metadata in detail (the examples can be downloaded here anyway). The virtualisation tool performs the usual SQL / view generation and deploys the views using the SQL Server SMO API. It’s probably better to go straight into the query logic itself. The only comment I want to make relates to the example metadata which uses LSAT_CUSTOMER_OFFER as the Driving Key example:

LSAT_generation

For normal / historical Link Satellites (the ones with attributes) the example and SQL are displayed below. The example used is LSAT_CUSTOMER_COSTING; this is a ‘multi-active Link Satellite’ in the spirit of adding some complexity. The COSTING_EFFECTIVE_DATE is sourced from ‘Date_Effective’, just to demonstrate some options. Since this is the last post of the series I have added the majority of the exceptions in this SQL :-).

LSAT_CUSTOMER_COSTING

Regular_LSAT_SQL

This query provides you the representation of the Link Satellite table.

Regular LSAT Output

As before, INSERT INTO statements can be generated from here, but if you have read some of the previous posts you will very likely believe this by now. The next post will go into more detail about the Driving Key logic since this differs slightly from the SQL as displayed above.

NoETL – Data Vault Link Satellite tables (part 2)


This is the second part of the Link Satellite virtualisation overview (the first post on this topic is here), and it dives deeper into the logic behind Driving Key based Link Satellites. Driving Key implementation is arguably one of the more complex things to implement in Data Vault – and you (still) need to ensure you can cover reloads (deterministic outputs!), zero records / time variance and things such as re-opening closed relationships.

As outlined in the previous post, in the example of the employee that moves to another company (thereby ending the relationship with the previous company) you want to correctly record the fact that the employee may move back to the previous company at some point! In the sample metadata this scenario is covered using a ‘customer’ that is associated with an active ‘offer’.

This is modelled as the LSAT_CUSTOMER_OFFER table with the customer as the Driving Key:

Driving Key LSAT model

First off we should analyse the query itself to virtualise / simulate this table:

Link-Satellite query for the driving key

Let’s start seeing how this works by closing and re-opening some relationships between the ‘Customer’ and the ‘Offer’. For the purpose of explaining the behaviour I uncommented the business keys (CUSTOMER_ID and OFFER_ID)  in the selection.

Output for Driving Key Link Satellites (1)

As explained in the previous posts we agree (through the model configuration) that if the ‘customer’ is associated with a new ‘offer’ this should end-date the relationship of that ‘customer’ with the previous ‘offer’. We can trigger this by introducing a new ‘change’ in our source data for customer 235892 to receive a new offer. I won’t bore you by running through the screens with the delta detection again – the process is the same and results in an additional record introduced to the PSA. This was literally a two-second job!

This action triggers the end-dating of the previous relationship for the Driving Key if we virtualise / simulate the result again as shown here:

Output for Driving Key Link Satellites (2)

As you can see the relationship key (hash) has changed for our driving key 235892, but the timeline is maintained. It is a design decision to interpret things like this, but being a virtual environment you can make other choices here as you see fit. Now, to make things interesting we can test re-opening the (end-dated) relationship.

In the source this means updating the customer to match the original offer and detecting the changes as usual. After we process the data delta the result looks like this:

Output for Driving Key Link Satellites (3)

As you can see the relationship is now re-opened, which is an interesting test case to have. Hopefully this explains this particular mechanism and intended behaviour a bit! I’ll publish the tool for download shortly for those interested.

Data Warehouse versioning… for virtualisation


Recent discussions around Data Warehouse virtualisation made me realise I forgot to post one of the important requirements: version control. In the various recent presentations this was discussed at length but somehow it didn’t make it to the transcript.

Data Warehouse virtualisation needs versioning. Think of it this way – if you can drop and refactor your Data Warehouse based on (the changes in your) metadata then your upstream reports and analytics are very likely to not only change structurally but also report different numbers. To remain auditable, you need to make sure your metadata is versioned so you can ‘roll out’ your Data Warehouse in versions that correspond to Business Intelligence or analytics output.

Managing end-to-end version control which ties the model, the ETL and the output together has been in place in some environments but when you are able to deliver a new ‘view’ with the speed of a click this becomes a mandatory requirement.

Concepts can be borrowed from other areas such as message formats / canonicals or SOA processes. These architectures struggle with the same concepts. Most of these environments are able to keep at least the prior version active, allowing all subscribers some time to become compatible with the newer version. In any case, it is important to be able to ‘go back’ to previous versions in your virtualised Data Warehouse if numbers need explaining, and this all comes down to metadata version control.

In any case, I added this to the overall story.

Loading too fast for unique date/time stamps – what to do?


Let’s start by clarifying that this concerns the RDBMS world, not the Hadoop world 😉

It’s a good problem to have – loading data too quickly. So quickly that, even at high precision, multiple changes for the same key end up being inserted with the same Load Date/Time Stamp (LDTS). What happens here?

A quick recap: in Data Vault the Load Date/Time Stamp (LDTS, LOAD_DTS, or INSERT_DATETIME) is defined as the moment data is recorded in the Data Warehouse (more information here). Typically this is defined at database level as a default value, trigger or equivalent to support accurate timestamping of the data. Alternatives implemented in ETL usually end up being more expensive from a performance perspective.
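A minimal sketch of the database-level default (hypothetical PSA table, SQL Server syntax):

 CREATE TABLE PSA_PROFILER_CUSTOMER_PERSONAL
 (
   CUSTOMER_ID    NVARCHAR(100) NOT NULL,
   LOAD_DATETIME  DATETIME2(7)  NOT NULL
     CONSTRAINT DF_PSA_CUSTOMER_LDTS DEFAULT (SYSDATETIME()),  -- stamped by the database on insert
   SOURCE_ROW_ID  INT           NOT NULL,
   GIVEN_NAME     NVARCHAR(100) NULL
 );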

So what happens? Data is ingested so quickly by the database that thousands of records are inserted with identical timestamps (bear in mind that the date/time attribute has high precision, e.g. a datetime2(7) populated by SYSDATETIME()). Across the bigger sets you see a gradual increase in the time across records, so the database does its job.

This itself is not an issue, but problems arise when you handle multiple changes for the same (natural) key in a single data delta. When this occurs, these changes (that end up having the same Load Date / Time Stamp) result in Primary Key violations in the Satellites or the Persistent Staging Area because the LDTS is part of the Primary Key (acting as ‘effective date/time’). You basically try to store more than one version of context for a Business Key at the same point in time. As mentioned, this only occurs when you actually detect multiple changes for a natural key – for example when a customer changes address a couple of times and you pick up and process these changes in one go.

I have posted many times about this requirement for ETL to handle multiple changes in a delta to support flexibility and scalability. A simple Full Outer Join (FOJ) delta mechanism will only give you a single change for each key, but (transactional) CDC mechanisms will buffer changes for ETL to be picked up at any given time or frequency.

To be able to handle processing genuine changes with the exact same LDTS I made the following design decisions:

  • In the Persistent Staging Area, the Source Row ID is added to the Primary Key. The PK now consists of the Natural Key (source key), the LDTS and the Source Row ID. The Source Row ID is traditionally defined as a sequence value generated starting at 1 within each data delta
  • In the Satellite I use the Source Row ID to add this as a fractional second to the date/time

Blasphemy! I’m changing the values going into the Data Vault (which is arguably a bit dodgy).

The reason I think this is acceptable is that the original values are still available in the PSA so we can refactor / rerun from here if needed. Also, the auditability is maintained (as long as the PSA is there). Because the additions are done at the lowest time level the date/time change is very small. If, say, you have three changes in a row with Source Row ID 4,5 and 6 you basically offset the date/time respectively with 4, 5 and 6 nanoseconds.
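A minimal sketch of the Satellite-side offset (assumed names). Note that datetime2(7) resolves to 100-nanosecond ticks, so the sketch applies the offset per tick rather than per single nanosecond – the effect is the same: each change for the key lands on its own, correctly ordered timestamp.

 SELECT
   CUSTOMER_HSH,
   DATEADD(NANOSECOND, SOURCE_ROW_ID * 100, LOAD_DATETIME) AS LOAD_DATETIME,
   GIVEN_NAME
 FROM STG_PROFILER_CUSTOMER_PERSONAL;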

This also explains why you should add and not subtract the time: as the Source Row ID is sequential, subtraction would change the order in time whereas addition doesn’t. While subtraction will certainly cause problems, addition only carries a theoretical risk of overtaking future deltas, and in case that happens I’ll retain my PSA for the time being.

Dan Linstedt has some advanced views on applying these concepts when adopting Hadoop to maintain the raw Data Vault – in these scenarios you can let the platform handle the Load Date / Time Stamp.

As part of ‘options and considerations’ an alternative is to replace the Load Date / Time Stamp in the key with the Row Number (Source Row ID in my examples). This would work as well, but I didn’t opt for this approach as in my designs the Source Row ID identifies rows within each delta – and to be able to be a PK candidate the row number needs to be incremental across deltas (e.g. continuing at every run).


Virtual Enterprise Data Warehouse ideas & updates (towards 1.2)


Lately I have had a bit more head space to work on some ideas I find interesting, and these are now intended to culminate into ‘version 1.2’ of the Virtual EDW tool I have been developing. I’ve been using this tool extensively for various Data Warehouses and am generally very happy with it as a quick prototyping tool. But what really starts to play up is the requirement for a physical Data Vault (Integration Layer), as most of the logic is generated based off the data dictionary / catalog.

As it turns out, I’m often more than happy to use the views ‘for real’, but in that case I don’t really want the table to be present either. Only if I want to instantiate the table do I want to create it (and populate it).

Another (in my mind) sensible change is to remove the need for Excel (and macros) to populate the automation metadata model – this is now incorporated into the tool.

Lastly, as the Data Vault components are now really solid, it helps to be able to quickly generate some (raw) information / data marts as well!

This is quite a big change. The intention is to write some posts about some of these changes and the adjustment work in more detail. So far I’ve been working on doing this and some other bits and pieces as outlined below:

  • Multi-threading; output generation is now processed on separate background threads allowing for a more responsive UI and parallel calculations – much quicker when you generate large Data Warehouses!
  • Bug fixing: removed some remaining hard-coding (thanks Douglas Jones and Bas Stiekema for letting me know).
  • Versioning; now built in as save and retrieve option using a slider – really neat as this allows you to cycle through versions and see the code change over time.
  • Automation Metadata maintenance; the menu option is enabled. This removes the need for Excel to store the mapping metadata and populate the metadata repository. This also allows for more advanced validation and some syntax highlighting.
  • Model maintenance; upload the model (or reverse-engineer) to generate views without having Data Vault tables physically available.
  • Dimensional model update – navigate the Data Vault model and generate Facts and Dimension views!
  • Updates in Satellite view creation – some new logic to cater for those pesky remaining situations (record condensing).
  • Proper window resizing; controls now resize properly, allowing more efficient maximising of the windows (just to make it a bit nicer).

Watch this space!

The DWH Time Machine: synchronising model and automation metadata versions


I’ve completed a fairly large body of work that I’ve been meaning to do for a long time: how to automatically version the Data Warehouse data model in sync with the version of the ETL automation metadata.

Although versioning models and code is relevant (but rarely implemented) in the traditional ETL area, this requirement becomes very real when moving to a virtualised Data Warehouse / integrated model approach (Data Vault 2.0 in my case). This is covered in more detail in earlier posts, but suffice it to say that by removing a physical, or instantiated, core Data Warehouse layer in the design you need a way to preserve the audit trail and support any Data Mart and / or Business Intelligence components that draw information from this Data Warehouse layer.

Simply put: your Data Warehouse can’t just have one implementation the one day and a different one the next. Well, it actually can: if you incorporate synchronised version control.

In my mind this is very similar to how (canonical) message formats or (SOA) services are sometimes developed to be backwards compatible. In this world you need to allow for adapters (’users’) to change their configuration when a change in the message / service is deployed. For this purpose sometimes one or more older versions are supported for some time. This is more or less what I’ve been working on for ETL and Data Warehouse models and their Data / Information Marts.

In a practical sense: what I am looking for is a way to cycle through previous and current versions and see the model and metadata change (over time – no pun intended) while you look at your designs. A Data Warehouse time machine.

This itself also serves another purpose, which is decoupling the existing dependency on the implemented data model (the table in the database) for ETL automation. Many automation efforts read from the data dictionary / catalog to generate output, but to fully support the above concepts what is needed is a way to capture the model at a point in time. For the virtualisation development efforts this is really useful as it allows you to generate ‘ETL’ (Views, Insert Statements, ETL outputs e.g. packages, mappings, anything) without having a physical model in place.

Not only that, but you can basically have a fully functioning Enterprise Data Warehouse up and running without physical tables (more information about where ‘history’ lives is here). Until recently, even though the virtual (view-based) Data Warehouse was working fine, it still required a physical model to be present for automation.

This is now something of the past; the only database objects are views now – everything is virtual – which I think is really cool! This is a real-world application of these concepts used in various projects, sometimes as prototyping but also as a smaller production deployment.

A quick overview of how this works is shown in the next few screenshots. The first thing to do is to import the model metadata (via a modelling tool or by reverse-engineering the table structures):

Reverse Engineer Data Model

I disabled the current grid-like representation of the model as I intend to create a more visual presentation. In any case since I already have a few versions created I am working on ‘2.1’ and uploaded whatever is currently in the database. The next step is to go to the screen that manages the ETL automation (this replaces the Excel sheet that was previously used). As you can see the software has already snapped to v2.1. The slider on this screen will later be used to cycle through the view output.

Select Automation Screen

The screen that is created to maintain the automation metadata is where the typical Data Vault 2.0 metadata can be maintained. I added some logic to add colour coding and syntax highlighting to some extent, but one of the fun features is that the software calculates the expected Data Vault output. This metadata was previously imported from a file (it’s stored as XML) – and is linked to v2.1 of the model.

Manage Automation Metadata

When changes are made there is the option to save these into a newer version:

Update Automation Metadata

The highlighted slider allows for cycling through the versions so you can see current and past definitions presented in the grid at runtime. We can now look at what the difference in the Data Vault 2.0 model is when we cycle through the versions, starting with viewing the results of the latest change.

Hub_v22

And we can query the virtual Data Warehouse the same way as the traditional deployment with the exact same content:

Output_v22

Finally we cycle to version 2.1 to see how it used to be:

Output_v21

As is visible in the output (shown as an insert in the image above), due to the consistent definition of business keys the number of records is down to three.

Not having to rely on a data model that is (physically) available in the database allows for even easier prototyping, and so far it’s working great!

Foreign Keys in the Staging Layer – joining or not?


Warning – this is another post in the ‘options and considerations’ context, meaning that some people will probably disagree with this based on their personal convictions or ideas!

One or two Satellites?

The case in question is how to handle complexities that may arise if you want to simplify loading by joining tables in the Staging Layer. You may want to do this depending on the design choices made for the source system you are receiving your data from (some systems are better designed than others). An example of this could be a ‘Customer’ table which contains an ID (a foreign key called ‘Type ID’ or something like that) to a generic reference table, with the label for the ID being, for example, ‘Good Customer’.

In the corresponding Satellite I would ideally like to include the label (description) instead of the ID, otherwise I’d have to join later to get this. This is because I prefer the Integration Layer model (Data Vault) to be as self-explanatory as possible.

The example used here is a very simplified case, of course. But, real life scenarios occur often when there are central ‘generic’ code tables with IDs, and I have seen things such as ‘code’ with ‘name’ in the one table, and ‘code’ with ‘description’ in the other… Another common scenario is that the (real) Business Key you need is not in the table you intend to process, so you need to get it from somewhere else.

Consider the following situation:

Choices to make

Choices to make…

Modelling considerations

The first decision to make is how to model this to begin with. Does it make sense to model out the codes as its own (unconnected / reference) Hub / Satellite? There are a couple of advantages to this:

  • If the meaning or value for the reference code changes and it is captured independently in a dedicated Satellite, it can be combined with the Satellite at any (later) time to represent the correct point-in-time value relevant to that Satellite. For instance, in the extreme example that the label (description) changes for the ID from ‘Good Customer’ to ‘Do Not Call Anymore’, this will only lead to a change in the reference Satellite as the code itself doesn’t change – only the meaning / description. This reference Hub / Satellite can be joined with the Customer Satellite to display the correct series of events / history, including the change of the label.
  • It is easier to develop 🙂
  • You can (opt to) avoid a potential massive data redundancy when code labels change. Incidentally, this advantage here – when separating the Satellites – is exactly the same as the disadvantage when you choose to combine the Satellites as we will explore later on. As a quick introduction: some updates on codes (when combined into a single Satellite) may have a massive impact on row counts. If my ‘Customer’ table has a billion rows with the same ID, a change in the code label would blow this out to two billion just with this single change. It’s a consideration to make on things such as expected rate of change etc.

There are also some disadvantages:

  • The model becomes harder to understand as you need to provide additional content to make it ‘readable’. It may not be immediately obvious what all the IDs mean or where they should join to.
  • You usually end up with more ETL processes.
  • You need to join the various tables in /after the Data Vault to get the complete picture of what really happened.

So, if you opt to model the Customer Satellite as one (without the generic Hub/Satellite) does that make things easier? It means you will need to find a way to load the combination of tables into the Integration Layer / Data Vault which effectively leaves you two options: joining the tables in the Staging Layer or defining a template that looks for changes for part of the Satellite (similar to the traditional conformed Dimension updates from various sources ).

The second option means you typically calculate a checksum for the change and inherit the rest of the attributes from the Satellite record into a new one if there is a change. For this post today I wanted to explore the first option though: how to join Staging Layer tables together to populate the combined Satellite in one go. A quick note: these kinds of ideas are explored in the context of Data Vault / Integration Layer virtualisation which supports rapid refactoring of the Integration Layer if required (no lock-in).

After the above conceptual introduction it’s time to look deeper into what it means to define joins in the Staging Layer in order to get all the attributes together before presenting them to the Integration Layer / Data Vault.

Using Views in Staging to combine data

Remember, in the architecture I defined the Staging Layer consists of the Staging Area and the Persistent Staging Area (PSA). The reason I point this out is that you can’t join tables in the Staging Area because the information in this area is transient; only deltas are available. Joining two delta sets together gives you incorrect results, as keys on both sides of the join may be missing (because no change was detected at the point of processing). What you can do, however, is leverage the PSA to join against – as the full history of raw changes is maintained and available here. To use this in its simplest form you need to designate a ‘driving’ Staging Area table whose delta is used to load the Data Vault, and outer-join this against the PSA table(s) that have the rest of the attributes you need.

This is especially handy if all you need is the Business Key from another table since it may not be in the table you want to process. Joining in the Staging Layer in this instance eliminates the need for key-Satellite type exception logic.

The approach outlined here is easy to do as it allows you to develop a ‘raw’ Staging view that has the information you need to present to the Integration Layer in the format that makes sense to you (without changing the contents – no business logic should be added ever). This makes uploading information to the Data Vault incredibly straightforward. However, the obvious downside of using these kinds of views for your design is that you introduce dependencies: the PSA table you join to needs to be up-to-date first.

This is a potentially serious issue and a fundamental design decision, as scaling beyond micro-batch will be difficult once loading dependencies are introduced. But – horses for courses as they say in Australia – and the PSA / virtualisation / refactoring safety catch is always present if you change your mind. I will go into more technical details in the next post, but suffice it to say there are two main variants of the Staging Layer view as mentioned above:

1) Use a (one) Staging Area table as driving table and left-outer-join to (the point in time) equivalent in the PSA for as many tables as you need.

This is the simple option, and will be OK in most cases. The thing to remember for this approach is (when taking the example from the introduction of this post) that code label (= PSA table) changes for the customer will only be picked up by the ETL when the customer record (= Staging Area driving table) changes. This may seem trivial but an extreme case example would be that the customer records never change after the initial load, but the code label changes after a while. In this example the Data Vault customer records would still show the old label value (until something else changes for the customer).

Why didn’t the customer record change? Because the Foreign Key ID stayed the same, so for specifically the customer record there was never a change to detect. In short: best used with volatile driving Staging Area tables and relatively static reference PSA tables. Good to avoid the aforementioned Key-Satellite logic though, and it’s easy to add as many PSA tables as makes sense. A sketch of this variant is shown below.
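A sketch of this first variant (hypothetical object names) could look as follows – the Staging Area delta drives the load and the PSA supplies the reference label valid at that point in time:

 CREATE VIEW STG_CUSTOMER_ENRICHED AS
 SELECT
   stg.CUSTOMER_ID,
   stg.LOAD_DATETIME,
   stg.TYPE_ID,
   psa.TYPE_DESCRIPTION                   -- the label, taken from the PSA history
 FROM STG_PROFILER_CUSTOMER_PERSONAL stg  -- driving table: the current data delta
 LEFT OUTER JOIN
 (
   SELECT
     CODE_ID,
     TYPE_DESCRIPTION,
     LOAD_DATETIME,
     LEAD(LOAD_DATETIME, 1, '9999-12-31')
       OVER (PARTITION BY CODE_ID ORDER BY LOAD_DATETIME) AS NEXT_LOAD_DATETIME
   FROM PSA_GENERIC_REFERENCE_CODE        -- full history of the reference codes
 ) psa
   ON  psa.CODE_ID = stg.TYPE_ID
   AND stg.LOAD_DATETIME >= psa.LOAD_DATETIME
   AND stg.LOAD_DATETIME <  psa.NEXT_LOAD_DATETIME;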

2) Join everything bidirectionally. Enter the second option, which is created to counter the above issues related to triggering changes. This option uses all in-scope tables as driving tables, joins them against their respective PSA tables and unions the results to generate change records. Sounds a bit cryptic, but it is fairly easy.

Using the same example again: if you are worried that changes of the code label aren’t coming through quickly enough because the customer records aren’t changing that rapidly, you can use option 2. To do this, use the ‘Customer’ table as driving Staging table and outer-join to the PSA ‘Generic Reference Code’ table. These results will then be unioned with the following: use the ‘Generic Reference Code’ table as driving Staging table and outer-join to the PSA ‘Customer’ table. This will pick up changes both ways, and will update Customer records when the code label has changed. Beware of potentially huge row count increases as mentioned previously: if you have a lot of customers, a simple description change may create new rows for all of them.

Wrap-up thoughts

Data Vault provides solutions for all of this though, and you can easily separate Satellites into their rapid and slow-changing equivalents. In the next posts I’ll copy some of the logic to develop this with some examples to bring it to life a bit more.

My personal preference is not to let IDs into the Data Vault, I just like it better to be able to ‘read’ the model at a glance and having a clearer integrated model to expose upstream. I favour view-based approaches and like the ability to pull attributes together to simplify loading. Especially when all you are missing in your Staging Area table is the Business Key that you need. But, there’s options…

 

Best practices on developing Data Vault in SQL Server (including SSIS)


Sharing is caring, so today’s post covers some technical details for the Microsoft world: implementing Data Vault models on the SQL Server database and corresponding ETL using SSIS and technologies such as BIML. This is based on experiences gained developing many Data Warehouses (both Data Vault based as well as using other methodologies).

Physical modelling (for Data Vault-based Integration Layers):

  • Don’t use clustered indexes on Primary Keys! This is the single biggest tip to be aware of since Hash keys basically act as ‘random’ seeds for an index. Hashes are not random of course, but their non-sequential nature means that they usually need to be inserted (in order) somewhere in between existing Hashes if you would use a clustered index. Because of this loading new keys leads to a continuous altering of the index (keys get moved up and down all the time) which severely degrades I/O performance when loading. It also causes massive fragmentation for the index when reading which further impacts updates (e.g. end-dating). Non-clustered indexes should be considered mandatory for Hash-based keys as this will make sure new keys are added ‘at the end’ of the index. This is fine, as there is no logical order for the hash keys anyway.
  • Setup an index maintenance plan and run it on a schedule as per the MSDN best practices.
  • Use clustered indexes on Unique Keys in Hubs. This (obviously) creates a separate index in the order of the real business keys for Hubs which improves joins / lookup performance. For Links you have the option to use a clustered or non-clustered index on the ‘unique key’ (the combination of Hub hash keys in the Link). You want to have this constraint in place to guarantee uniqueness but how you cluster the index or not depends on balancing the query select performance. If you use a clustered index be mindful of fragmentation caused by the reshuffling of keys. At least you don’t need to update a link so you have options here.
  • Make your date/time attributes datetime2(7), as this is the highest precision date/time SQL Server provides (after 2008 from memory). This makes everything more future proof as datetime2(7) has a range from 0001-01-01 to 9999-12-31 while the datetime only supports 1753 onwards (to 9999). Additionally datetime2(7) can be more specific at time level: datetime2(7) can capture time down to 100ns whereas datetime starts rounding at 3ms.
  • Don’t enable Foreign Keys anywhere. You need to do this to enable the parallel loading that using Hash keys makes possible. This means you can’t generate a diagram in SQL Server, as it reads off these constraints. In the Data Vault world there are other ways to enforce referential integrity.
  • Use Page compression on everything, pretty much everywhere (e.g. tables, indexes). I haven’t encountered a situation where the CPU hit doesn’t pay off against the I/O gain. Especially in historical tables compression gains are huge.
  • Consider Unicode. I know, it takes up more space but you can handle the (conceptual) scenario that a unicode-based OLTP system actually uses different casing for different meaning (e.g. Abc = customer 1, ABC = customer 2).
  • Always ‘sand’ Hash keys with delimiters (I use a pipe delimiter) as per Data Vault best practices. The built-in SQL Server MD5 or SHA1 algorithms deliver the results in upper case. An example is:
    CONVERT(CHAR(32),HASHBYTES('MD5',
      ISNULL(RTRIM(CONVERT(NVARCHAR(100),[AGENT_CODE])),'NA')+'|'+
      ISNULL(RTRIM(CONVERT(NVARCHAR(100),[POLICY_NR])),'NA')+'|'
    ),2) AS POLICY_AGENT_HSH

ETL development:

  • Push-down SQL as much as possible when working with SSIS. As an ETL tool, SSIS isn’t particularly good at lookups and caching in general. Techniques such as (easy) shared caching are still foreign to SSIS. Any joins or checks are better done in the source SQL override, leaving the rest of the Data Flow to add mid-stream (ETL process) attributes or any passive transformations that are needed, such as interpreting join results. The Control Flow is great to use to handle the ETL process itself (ETL Framework / exception handling). The result is that in SSIS most templates are really SQL templates with an SSIS wrapper.
  • Investigate techniques such as BIML to automate development. I tend to use some C# code to iterate through metadata and pass variables into BIML code in order to generate pretty much everything, but at the very least the ETL process control handles. You can use the free BIDS Helper plugin for this, or alternatively Varigence sells an IDE that also covers the wider MS BI stack. There are various sites with code snippets, and it’s easy to setup.
  • Use an ETL Framework! This is a more generic statement, but you really can’t manage and scale a solution without having a consistent framework and approach to handle ETL exceptions.
  • Lastly, I can’t help it: don’t use the SCD operation. It doesn’t scale! But I leave the rest to consider based off the Project REAL insights, otherwise there’s too much in here that’s already covered somewhere else 😉

Hope this helps!

 

Data Vault ETL Implementation using SSIS: Step 7 – Link Satellite ETL – part 3 – End Dating


I’m catching up on old drafts within WordPress, and in the spirit of being complete on the older SSIS series felt I should pick this one up and complete it. While most of my focus is on developing the virtualisation concepts I still work a lot with more traditional ETL tools, one of which is Microsoft SSIS.

Recently I merged the metadata models that underpin the virtualisation and SSIS automation and I am retesting everything end-to-end. So, after a long break, time and incentive to complete this series.

Please bear in mind that the logic can (and probably needs to) be updated to use LEAD and LAG instead of the traditional Common Table Expression approach in SQL Server. Also, this logic relies on zero records. Thankfully things are not hard to change if you don’t want zero records or want to use some of the analytic SQL functions 🙂

For the same reasons as outlined for the SSIS Satellite End Dating mechanism, Link Satellite End Dating is always implemented as a stand-alone generic ETL process. The only difference is some minor tweaking in the logic to handle the difference between historical and ‘driving key’ based Link Satellites in the same template.

Again, thanks to SSIS’ ability to pass parameters into the Execute SQL Task we are able to provide the Process ID information into this SQL before execution, so it will be a relatively efficient task to run (for a redundant step, that is).

As with the Satellite end-dating logic this is a very straightforward process with only one operation in the SSIS Control Flow. Another way of saying this is that SSIS is really only used as a scheduling tool, and to support the ETL control framework (this part is not displayed here), as there is a variety of event handlers one can add to an SSIS object. For example, I use pre- and post-processing event handlers to do ETL framework activities.

A Driving Key End Dating package for SSIS

It doesn’t look very exciting in SSIS, but the SQL query used in the displayed Execute SQL Task (EST – displayed below) identifies any records that need to be end-dated and also performs the corresponding updates, setting the Expiry Date to the Effective Date of the next record. This means that this process can be run at any time and closes off all records which need to be closed.

An example SQL is provided a bit further in this post. It’s pretty neat in the sense that this approach supports both Driving Key and normal history templates by selecting the ‘Driving Key’ in the main Common Table Expression as either the Link Key (=normal history) or the real Driving Key (=end dating relationships across the Driving Key). The Driving Key concept is better explained in the virtualisation section, but this SQL works for both templates as it is now.

To understand the example it’s really important to consider the differences between the regular historical and Driving Key behaviour, so I will mention it again: by using the ‘normal’ relationship key (Link SK) as the ‘Driving Key’ you trigger normal history tracking (SCD2) behaviour, and by using the assigned Driving Key (one part of the relationship) you trigger Driving Key history tracking behaviour.

Please note the ‘?’ symbols are mapped to parameters via the ‘Parameter Mapping’ in the Execute SQL Task properties. It’s mapped to the unique identifier of the ETL process (= package name in this case) that executes the SQL. The high dates are inserted by the regular Link-Satellite ETL and already in place from the perspective of End Dating. The SQL looks a bit complex, but as with all things SSIS it’s all generated from metadata (no need to develop packages manually in Visual Studio these days).

 WITH MyCTE 
 (
   <Link SK>,
   DRIVING_KEY, -- Note this is an alias that will be used for either the Link Key (for historical LSAT behaviour) or one of the Hub keys in the Link (= the Driving Key)
   LOAD_DATETIME, -- The Effective Date
   LOAD_END_DATETIME, -- The End Date
   RowVersion
 )
AS (
 SELECT
   A.<Link SK>,
   B.<Link SK or Driving Key>, --using a one-side of a relationship (=Driving Key) forced this behaviour, otherwise use the Link Key for normal history behaviour
   A.LOAD_DATETIME,
   A.LOAD_END_DATETIME,
   DENSE_RANK() OVER (PARTITION BY 
     B.<Link SK or Driving Key>
   ORDER BY 
     A.LOAD_DATETIME, A.<Link SK> ASC) RowVersion
 FROM <Link Satellite table> A
 JOIN <Link table> B 
   ON A.<Link SK>=B.<Link SK>
 JOIN
 (
 SELECT 
   B.<Link SK or Driving Key>
 FROM <Link Satellite table> A
 JOIN <Link table> B ON A.<Link SK>=B.<Link SK>
 WHERE A.LOAD_END_DATETIME = '99991231'
 GROUP BY 
   B.<Link SK or Driving Key>
 HAVING COUNT(*) > 1
 ) C
 ON B.<Link SK or Driving Key> = C.<Link SK or Driving Key>
 )
UPDATE A
SET A.LOAD_END_DATETIME = B.LOAD_END_DATETIME, A.ETL_INSERT_RUN_ID = ?,
 A.CURRENT_RECORD_INDICATOR = B.CURRENT_RECORD_INDICATOR
FROM <Link Satellite table> A
JOIN(
 SELECT 
   BASE.<Link SK>,
   CASE WHEN LAG.LOAD_DATETIME IS NULL THEN '19000101' ELSE BASE.LOAD_DATETIME END AS LOAD_DATETIME,
   CASE WHEN LEAD.LOAD_DATETIME IS NULL THEN '99991231' ELSE LEAD.LOAD_DATETIME END AS LOAD_END_DATETIME,
   CASE WHEN LEAD.LOAD_DATETIME IS NULL THEN 'Y' ELSE 'N' END AS CURRENT_RECORD_INDICATOR
 FROM MyCTE BASE
 LEFT JOIN MyCTE LEAD 
   ON BASE.DRIVING_KEY = LEAD.DRIVING_KEY
   AND BASE.RowVersion = LEAD.RowVersion-1
 LEFT JOIN MyCTE LAG 
   ON BASE.DRIVING_KEY = LAG.DRIVING_KEY
   AND BASE.RowVersion = LAG.RowVersion+1
 WHERE BASE.LOAD_END_DATETIME = '99991231'
)B 
 ON A.<Link SK>=B.<Link SK>
 AND A.LOAD_DATETIME = B.LOAD_DATETIME

Let’s run an example from the test set used to validate the virtualisation efforts (the results should be 100% the same). The test case from the sample model is LSAT_CUSTOMER_OFFER. This is the ‘Driving Key’ based history example: the idea is that a Customer can only have one active offer at any point in time. In other words, the Customer is the Driving Key. If we load the normal LSAT to populate the records prior to running this end-dating logic, the recordset looks like this:

Driving Key LSAT example

I joined to the Link to make it a bit clearer what is happening by adding the Customer and Incentive Offer Hash Keys. While it still looks cryptic, you can now see that a Customer has different Incentive Offers at various points in time, including a ‘re-opening’ where an earlier relationship is enabled again after being superseded (one of the standard test cases). The four rows basically state that there is a zero record for the first relationship between Customer and Incentive Offer, followed by the Load Date/Time for this relationship. Then the Customer is associated with a different Incentive Offer, and lastly the Customer is again associated with the original Incentive Offer.

Obviously, the end-dating process has not run yet. Running this query corrects the timelines:

Link-Satellite End-Dating results

There you have it – Driving Key based history tracking using the same SQL used for normal history.

For reference, the query that was run was:

WITH MyCTE 
 (
   CUSTOMER_OFFER_HSH,
   DRIVING_KEY, 
   LOAD_DATETIME,
   LOAD_END_DATETIME,
   RowVersion
 )
AS (
 SELECT
   A.CUSTOMER_OFFER_HSH,
   B.CUSTOMER_HSH, -- The Driving Key
   A.LOAD_DATETIME,
   A.LOAD_END_DATETIME,
   DENSE_RANK()OVER(PARTITION BY 
   B.CUSTOMER_HSH -- Again the Driving Key
 ORDER BY 
   A.LOAD_DATETIME,
   B.CUSTOMER_OFFER_HSH ASC) RowVersion
 FROM LSAT_CUSTOMER_OFFER A
 JOIN LNK_CUSTOMER_OFFER B 
   ON A.CUSTOMER_OFFER_HSH=B.CUSTOMER_OFFER_HSH
 JOIN
 (
 SELECT 
   B.CUSTOMER_HSH -- Again the Driving Key
 FROM LSAT_CUSTOMER_OFFER A
 JOIN LNK_CUSTOMER_OFFER B ON A.CUSTOMER_OFFER_HSH=B.CUSTOMER_OFFER_HSH
 WHERE A.LOAD_END_DATETIME = '99991231'
 GROUP BY 
   B.CUSTOMER_HSH
 HAVING COUNT(*) > 1
 ) C
 ON B.CUSTOMER_HSH = C.CUSTOMER_HSH -- Again the Driving Key
 )
UPDATE A
SET A.LOAD_END_DATETIME = B.LOAD_END_DATETIME, A.ETL_INSERT_RUN_ID = 0,
 A.CURRENT_RECORD_INDICATOR = B.CURRENT_RECORD_INDICATOR
FROM LSAT_CUSTOMER_OFFER A
JOIN(
 SELECT 
   BASE.CUSTOMER_OFFER_HSH,
   CASE WHEN LAG.LOAD_DATETIME IS NULL THEN '19000101' ELSE BASE.LOAD_DATETIME END AS LOAD_DATETIME,
   CASE WHEN LEAD.LOAD_DATETIME IS NULL THEN '99991231' ELSE LEAD.LOAD_DATETIME END AS LOAD_END_DATETIME,
   CASE WHEN LEAD.LOAD_DATETIME IS NULL THEN 'Y' ELSE 'N' END AS CURRENT_RECORD_INDICATOR
 FROM MyCTE BASE
 LEFT JOIN MyCTE LEAD 
   ON BASE.DRIVING_KEY = LEAD.DRIVING_KEY
   AND BASE.RowVersion = LEAD.RowVersion-1
 LEFT JOIN MyCTE LAG 
   ON BASE.DRIVING_KEY = LAG.DRIVING_KEY
   AND BASE.RowVersion = LAG.RowVersion+1
 WHERE BASE.LOAD_END_DATETIME = '99991231'
)B 
 ON A.CUSTOMER_OFFER_HSH = B.CUSTOMER_OFFER_HSH
 AND A.LOAD_DATETIME = B.LOAD_DATETIME
 

Unknown keys (zero keys or ghost keys) in Hubs for DV2.0

I am still working towards capturing the generation (using BIML in SSIS) and virtualisation (using views / SQL) of the Presentation Layer (in a Dimensional Model). But before we get there, some topics need to be addressed first. One of these is the requirement to have ‘unknown’ keys available in the Hubs. Thankfully, this is one of the easiest concepts to implement.

The basic idea is that you create a dummy record in the Hub which represents the unknown (NULL or ‘empty’) business key.

Why do we need this at all? It’s primarily to make sure that referential integrity is intact when Link records are inserted where one of the Business Keys is NULL. The fact that some of the Hub keys in a Link are NULL does not mean you should exclude that record entirely; it just means that, at that point in time and for some reason, there is incomplete information. And we need to be able to handle 100% of the data 100% of the time.

There can be various reasons that can cause this to happen. Some examples of this are:

  • The attribute that has been identified as Business Key is declared as an optional Foreign Key in the source system. For instance when ‘XYZ’ is true, then the business key is populated. Otherwise the business key remains NULL
  • The source system declares the attribute as mandatory, but this declaration is broken or not adequately enforced (there is an error in the source application that allows NULLS when it shouldn’t)

As an example you can imagine there is a Link relationship between Customer, Product and Promotion. If this is modelled as a 3-way Link there may be occasions where Customer and Product are known, but Promotion is NULL.

With the unknown key in place in the Promotion Hub you will be able to confirm that the ‘committing’ of the Staging Area delta to the Data Vault is complete – as the final check of a Batch is the validation of the Referential Integrity before the delta is released. Obviously, this also applies when you run periodic Referential Integrity checks across the complete (or parts of) Data Vault.

The underlying idea behind all this is that you can only ‘close’ an ETL run once all received data delta (in the Staging Area) is applied to all relevant Integration Layer tables. As outlined earlier each source data set is a self-contained unit that loads into all tables, to ensure independent and parallel execution. As such, the last step in the workflow (Batch in my terminology) is to make sure that for that source set all created Hub, Link and Satellite information is consistent.

Interestingly, in my view (and some will disagree) adding Hub unknown keys does not mean that you can rely on inner joins to combine Hubs and Links together. I always use outer joins, primarily to handle potential timing issues related to parallel execution. The fact that you can run everything in parallel may mean that at certain points in time there are brief inconsistencies – which are soon rectified by the corresponding batches. But my point is you don’t want to lose records when you just happen to execute your query in that brief potential interval.

The decision to handle NULL values is considered a ‘hard’ business rule, which is implemented on the way into the Data Vault.

So how to do this? It’s very easy however you choose to implement it, but it’s important to note that in my designs the Business Key attribute is always NVARCHAR (Unicode text). This is really to further future-proof the Data Warehouse solution: you don’t know how the next system will handle the business key…

This makes the implementation of the unknown Hub key extremely easy. I use the generated hash key for NULL as the Hub Key, as this is already handled by the hashing built into the ETL generation templates.

The fun part is that this key is the same for all Hubs! So all you need to do is run a script that iterates through your Hub tables and inserts what is effectively the same hard-coded record (a sketch of such a script is shown a bit further down). In my approach I create this record with Hub Key ‘48C436D1FBE939CB4E2732836808CE90’ and Business Key ‘Unknown’, but it’s up to you how you want to handle this as long as it’s consistent.

How did I get to that hash? By putting NULL into the hashing logic, which is what the generation logic does automatically:

SELECT
CONVERT(CHAR(32),HASHBYTES('MD5',
ISNULL(RTRIM(CONVERT(NVARCHAR(100),NULL)),'NA')+'|' ),2)
AS EXAMPLE
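
The one-off script that iterates through the Hub tables could look like the sketch below. Note this is an illustration only: it assumes Hub tables follow a HUB_% naming convention, that the hash key is the first column and the business key the second, and it ignores audit columns such as the Load Date/Time and Record Source (which you would add in the same way).

DECLARE @sql NVARCHAR(MAX) = N'';

-- Build one INSERT statement per Hub table, skipping Hubs that already contain the zero key
SELECT @sql += N'
INSERT INTO ' + QUOTENAME(t.name) + N' (' + QUOTENAME(hk.name) + N', ' + QUOTENAME(bk.name) + N')
SELECT ''48C436D1FBE939CB4E2732836808CE90'', ''Unknown''
WHERE NOT EXISTS (SELECT 1 FROM ' + QUOTENAME(t.name) + N'
                  WHERE ' + QUOTENAME(hk.name) + N' = ''48C436D1FBE939CB4E2732836808CE90'');'
FROM sys.tables t
JOIN sys.columns hk ON hk.object_id = t.object_id AND hk.column_id = 1 -- assumed: hash key is the first column
JOIN sys.columns bk ON bk.object_id = t.object_id AND bk.column_id = 2 -- assumed: business key is the second column
WHERE t.name LIKE 'HUB[_]%';

EXEC sp_executesql @sql;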

As always, there are options and considerations:

  • You can opt to include this logic as part of your Hub ETL. I used to do this (briefly), but don’t anymore as leaving it out improves performance a bit. Inserting the unknown key is a one-off process which is only required when you deploy a new Hub (or truncate the table)
  • In the fully virtualised Data Warehouse this is easily implemented as an additional single-record UNION to the Hub SQL (see the sketch below)
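
A simplified sketch of that single-record UNION is shown here, using illustrative table and column names and leaving out the audit columns a generated view would normally include:

CREATE VIEW HUB_CUSTOMER AS
SELECT DISTINCT
  CONVERT(CHAR(32), HASHBYTES('MD5',
    ISNULL(RTRIM(CONVERT(NVARCHAR(100), CUSTOMER_CODE)), 'NA') + '|'), 2) AS CUSTOMER_HSH,
  CUSTOMER_CODE
FROM STG_CUSTOMER
WHERE CUSTOMER_CODE IS NOT NULL
UNION
SELECT
  '48C436D1FBE939CB4E2732836808CE90', -- the zero key (the hash of NULL)
  'Unknown';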

Lastly, there is potentially more to this than only a NULL or unknown key. There is actually a taxonomy that can be defined to handle various ‘flavours of unknown’, for example:

  1. Missing (-1 key) as a root node / supertype of all ‘missing’ information. This is the typical unknown key, but may encompass:
    1. Missing values (-2 key): parent of all missing values. This can be ‘Unknown’ or ‘Not Applicable’:
      1. Not Applicable (-3 key)
      2. Unknown (-4 key)
    2. Missing attribute (-5 key): parent of all missing values due to columns not being there (yet). This can be defined as for instance:
      1. Missing source attribute (-6 key), which can be implemented when source fails to supply attribute/column
      2. Missing target attribute (-7 key), which may be used for time-variant information presented before the availability of the attribute

Deciding between the various more detailed types of ‘unknown’ is a business question, answered based on how the source system is designed, and falls in the category of ‘hard business rules’: one of the few examples of these in the Data Vault methodology.

 

Why you really want a Persistent Staging Area in your Data Vault architecture

Recently at the Worldwide Data Vault Conference in Vermont USA (WWDVC) I had many conversations about the Persistent Staging Area (PSA) concept, also known as Historical Staging Area.

I have been using this idea for years and really can’t do without it. I would even go as far as saying you really want a PSA in your architecture. However, there is a common opinion that having a PSA isn’t the best idea as it introduces a ‘2nd Data Warehouse’.

I fundamentally disagree with this and thought it would be good to explain this in more detail for today’s post. I wrote a post similar to this one back in 2012, but it needs a bit of an update :-).

What is a Persistent Staging Area?

For those new to the discussion, below is a quick introduction of the PSA.

A PSA is a record, or archive, of all data deltas (records) that were ever presented to the Data Warehouse (Staging Area) by its interfaces: historised raw data. Every record that is loaded into the Staging Area – a transient area (will be truncated regularly) – is sent to the PSA as well as to the upstream (usually Data Vault based) Integration Layer.

Following the Staging Layer architecture patterns: when records are loaded into the Staging Area they will be timestamped with the Load Date/Time Stamp (LOAD_DTS). It is also the role of the Staging Area ETL to capture the Event Date/Time (explained in more detail in the post covering Data Vault ‘time’) as well as logical deletes. Records then go two ways: into the PSA (part of the Staging Layer) and upwards into the Integration Layer.

The following diagram covers the overview outlined in the above paragraph:

Staging_Layer_Architecture

As a quick explanation of the terms in this diagram: the ETL components are the typical loading patterns into the various areas of the Data Warehouse such as ‘Source to Staging’,  ‘Staging to Data Vault’ and ‘Staging to PSA’. This is the foundation of the EDW.

The ‘Virtualisation’ arrow indicates the concept of NoETL for which I have written many articles. This is based on the idea that if you can reload & regenerate your entire Data Warehouse (schema and data) from your metadata, you can also opt to use the same metadata to create a view that simulates your Data Warehouse structure. Effectively you are then using that metadata (and model) as a ‘schema on read’ on raw historised data – your Data Warehouse model defines your schema.

Lastly, the ‘Reinitialisation’ arrow captures the process of physically (re)loading your entire history back into the Staging Area for a full rebuild or gap-fixing of your Integration Layer.
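
To make the ‘Staging to PSA’ component a bit more concrete, a minimal sketch of appending a Staging Area delta to the PSA is shown below. The table and column names (STG_CUSTOMER, PSA_CUSTOMER, CDC_OPERATION and so on) are assumptions for illustration only:

-- Append the current Staging Area delta to the PSA (insert only, never update or delete).
-- The Staging Area is transient and truncated every run; the PSA keeps every delta ever received.
INSERT INTO PSA_CUSTOMER
  (CUSTOMER_CODE, LOAD_DATETIME, EVENT_DATETIME, CDC_OPERATION, CUSTOMER_NAME)
SELECT
  stg.CUSTOMER_CODE,
  stg.LOAD_DATETIME,  -- the Load Date/Time Stamp added by the Staging Area ETL
  stg.EVENT_DATETIME, -- the captured Event Date/Time
  stg.CDC_OPERATION,  -- insert / change / (logical) delete indicator
  stg.CUSTOMER_NAME
FROM STG_CUSTOMER stg;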

Why would you want this?

Simply put: because you will make mistakes in your model and this is the most complete and repeatable way to remediate these. More on this later.

So to start off, introducing a PSA does indeed create a 2nd repository of history in your Enterprise Data Warehouse architecture. But if that is a bad thing or not all depends on how you use and approach this. Let’s explore the benefits here:

  • You can start straight away! You don’t need to wait for the modelling to be completed. This is an extremely useful feature because as soon as you know which interfaces you need (tables, files, messages) you can start ‘recording’ data deltas into your PSA. This gives you time to properly model out the Integration Layer (Data Vault): data is already being collected ready to go for further handling once your model is complete. When you are happy with your model you can ‘replay’ from your PSA and populate your Integration Layer.
  • You can start small. You don’t need to load everything in the Data Vault straight away. If you only model what you need and leave the rest for later you will find there is more room for (modelling) discussions on the things that matter for the scope of your project, as opposed to spending time on elements that may be required later. As an example, if you need some additional attributes later you can reload that Satellite to populate the missing attribute back in time (as opposed to building history from that point onwards, potentially increasing record counts or requiring a bit of rework).
  • You will change your mind (and/or make mistakes). Even with all the experience in the world and the best intentions invariably you look back at models and think of better alternatives. The same ‘replay’ (re-initialisation) mechanism allows you to make changes and truncate and load Data Vault tables with the same outcome (deterministic). Hash keys are invaluable here as you can truncate a Hub or Link and then reload without any negative impacts. You don’t want to try this in integer/sequence based designs.
  • You can support unexpected use-cases. There are edge cases where a historical archive of source data (including the added Data Warehouse information such as logical deletes) can become really useful. I have been involved with fraud detection projects where it does pay off to have an independent ‘memory’ of your source systems because, well, you can’t trust your source systems directly.
  • You can delay some design and architecture decisions. One key point is that you don’t need to commit to a methodology or architecture from the outset of your project. Staging Layer concepts are relevant regardless of the modelling technique you use for your Integration Layer. In some cases you may want to ‘start recording’ in your PSA but start with a small Dimensional Model to prove your cases or because it meets some specific requirements better. If you later decide you want to add an Integration layer (go from 2-tiered solution design to 3-tiered solution design) you can reuse the same data and even have something to compare with.
  • You can’t virtualise your Data Warehouse without an archive containing the historised raw data. Data Warehouse virtualisation brings even more agility and speed to market. I have written about this extensively but the key point is that if you virtualise your Data Warehouse you still need history somewhere.

Some last thoughts

As with any architecture decision it’s all about options and considerations. So there are some things to ponder over or be aware of. These can be considered ‘cons’ but are not really such a big deal (in my personal opinion).

  • In messaging or near Real Time scenarios the Staging Layer is usually skipped, but it is my recommendation to keep a copy of the messages in your PSA regardless (as a parallel, independent stream)
  • You will be impacted by source system changes more regularly, and basically have more objects to manage and maintain. However, this can be mitigated by adding logic to automate schema (and corresponding ETL/ELT) changes. This is perfectly doable as there is no further meaning to be applied here. For instance you don’t need additional smarts to understand where the information fits in the model, what the business key is etc.
  • You don’t necessarily need to make your PSA a database. This is fairly straightforward but good to point out: in smaller deployments you may use a database to record your historised raw data but this is by no means a requirement. It’s perfectly doable to use a file archive with potentially lower costs, less overhead and more scalability.
  • Be very careful providing access to your PSA; this is in line with ‘not making your PSA a 2nd Data Warehouse’. Your PSA is there to support your Data Warehouse design and development but has no place in end-user access.

My biggest recommendation is to make sure your design can survive without the PSA. By this I mean to avoid making the PSA an integral part of your architecture. An example of this is to populate your Integration Layer from the Staging Area, and not from the PSA (populating the PSA is then a side step in parallel with your Integration Layer ETL). This is important, because when your system reaches maturity, you may want to decide not to use the PSA anymore (or change the physical deployment of it).

Another way of saying this is: you can use the PSA to launch your new Data Warehouse – with the option of casting off these boosters once your system reaches escape velocity!

 

Advanced row condensing for Satellites

When it comes to record condensing, DISTINCT just doesn’t cut it. I’ve been meaning to post about this for ages as the earliest templates (as also posted on this site) were not flexible enough to work in all cases (which is what we strive for).

Record condensing

Record condensing is making sure that the data delta (differential) you process is a true delta for the specific scope. It’s about making sure no redundant records are processed into historised tables. If you have a source table that is logically split towards various target Data Warehouse entities (Hubs, Links and Satellites for instance) the change that originated the new (change / delta) transaction in the original table is not necessarily a change for a specific target. This is simply because CDC mechanisms typically consider the entire row in a source. Consider the following example:

Record Condensing

The example shows that across the key (Customer Code) and date you can’t use a SELECT DISTINCT to minimize the recordset. It may look like you can, but what would happen when the name in this example changes back to its original value? These changes would be lost.

  • C17532 for 2016-07-01 was ‘Adam’
  • C17532 for 2016-07-03 was changed to ‘Scott’
  • C17532 for 2016-07-05 was changed back to ‘Adam’

This ‘A-B-A’ mechanism requires additional logic to evaluate changes between rows in the order of the date/time. To add to the complexity: when evaluating rows you also need to account for logical deletes, as this is not a change in one of the attributes themselves but in process attributes (e.g. the CDC attribute, logical delete indicator or equivalent). This is further explained in the SQL example a bit further down in this post.

Why is implementing record condensing worth it?

Row condensing behaviour in ETL processes is applicable in many cases, for example when splitting Satellites into ‘slow changing’ and ‘fast changing’, mapping attributes to different tables etc.

This concept is directly related to one of the first and fundamental Design Principles: ETL processes must be able to process multiple intervals (changes) in one run, must be able to be run at any point in time, and process whatever data there is to process. The last part of this requirement directly relates to the record condensing concept outlined in this post.

For instance: if the address of an employee changes multiple times during the day and ETL is run daily, all changes should still be captured and correctly processed in a single run of the ETL process.

This requirement prevents ETL processes from having to be run many times for catch-up processing, makes it possible to easily change loading frequencies and generally supports a full reload of the Integration Layer from the Persistent Staging Area. Last, but not least, it is a fundamental bit of logic that is required to support Data Warehouse virtualisation.

We often execute a ‘re-initialisation’ run to make sure no deltas have been missed. Re-initialisation is simply copying the entire contents of the Persistent Staging Area and reloading this history against the Data Warehouse / Integration Layer. This approach relies on record condensing being in place, and Data Warehouse virtualisation simply uses this same re-initialisation approach to ‘virtualise’ the Integration Layer without any data being copied.

Record condensing in Data Vault Satellites

To implement record condensing in Satellite logic you can use a full row hash containing the original source attributes (‘COMBINED_VALUE’) in a subquery and use the various components (e.g. Load Date/Time Stamp, Event Date/Time  and CDC indicators) to figure out what happens within the delta set for a specific target:

COMBINED_VALUE, -- Full row hash across in-scope source attributes only
-- Condition 1 -- make sure there are differences between the individual rows
CASE
  WHEN LAG(COMBINED_VALUE,1,'N/A') OVER 
  (
  PARTITION BY 
    <Business Key>
  ORDER BY 
    <Load Date/Time stamp> ASC, 
    <Event Date/Time> ASC, 
    <CDC Operation> DESC
   ) = COMBINED_VALUE
   THEN 'Same'
   ELSE 'Different'
END AS VALUE_CHANGE_INDICATOR,
-- Condition 2 -- make sure the Change Data Operation is different between rows
CASE
  WHEN LAG(<CDC Operation>,1,'') OVER 
  (
    PARTITION BY 
    <Business Key>
    ORDER BY 
      <Load Date/Time stamp> ASC, 
      <Event Date/Time> ASC, 
      <CDC Operation> ASC
  ) = <CDC Operation>
  THEN 'Same'
  ELSE 'Different'
END AS CDC_CHANGE_INDICATOR,
-- Condition 3 -- make sure the Load Date/Time stamp is ordered (remove duplicates)
CASE
  WHEN LEAD(<Load Date/Time stamp>,1,'9999-12-31') OVER 
  (
  PARTITION BY
    <Business Key>
  ORDER BY
      <Load Date/Time stamp> ASC, 
      <Event Date/Time> ASC, 
      <CDC Operation> ASC
  ) = <Load Date/Time stamp>
    THEN 'Same'
    ELSE 'Different'
END AS TIME_CHANGE_INDICATOR

In the overall query the above subsequently gets interpreted as follows to remove the redundant records:

WHERE (VALUE_CHANGE_INDICATOR = 'Different' AND <CDC Operation> IN (<new flag>, <changed flag>)) OR (CDC_CHANGE_INDICATOR = 'Different' AND TIME_CHANGE_INDICATOR = 'Different')
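
Put together, a simplified end-to-end sketch of this condensing logic could look as follows. The table and column names (STG_CUSTOMER, CUSTOMER_CODE, CDC_OPERATION and the ‘Insert’ / ‘Change’ flags) are assumptions for illustration only; the generated code adds key hashing and further audit attributes:

WITH DeltaSet AS
(
  SELECT
    CUSTOMER_CODE,
    LOAD_DATETIME,
    EVENT_DATETIME,
    CDC_OPERATION,
    CUSTOMER_NAME,
    -- Full row hash across the in-scope source attributes only
    CONVERT(CHAR(32), HASHBYTES('MD5', ISNULL(RTRIM(CUSTOMER_NAME), 'NA') + '|'), 2) AS COMBINED_VALUE
  FROM STG_CUSTOMER
),
Evaluated AS
(
  SELECT *,
    CASE WHEN LAG(COMBINED_VALUE, 1, 'N/A') OVER (PARTITION BY CUSTOMER_CODE
         ORDER BY LOAD_DATETIME ASC, EVENT_DATETIME ASC, CDC_OPERATION DESC) = COMBINED_VALUE
         THEN 'Same' ELSE 'Different' END AS VALUE_CHANGE_INDICATOR,
    CASE WHEN LAG(CDC_OPERATION, 1, '') OVER (PARTITION BY CUSTOMER_CODE
         ORDER BY LOAD_DATETIME ASC, EVENT_DATETIME ASC, CDC_OPERATION ASC) = CDC_OPERATION
         THEN 'Same' ELSE 'Different' END AS CDC_CHANGE_INDICATOR,
    CASE WHEN LEAD(LOAD_DATETIME, 1, '9999-12-31') OVER (PARTITION BY CUSTOMER_CODE
         ORDER BY LOAD_DATETIME ASC, EVENT_DATETIME ASC, CDC_OPERATION ASC) = LOAD_DATETIME
         THEN 'Same' ELSE 'Different' END AS TIME_CHANGE_INDICATOR
  FROM DeltaSet
)
SELECT CUSTOMER_CODE, LOAD_DATETIME, EVENT_DATETIME, CDC_OPERATION, CUSTOMER_NAME
FROM Evaluated
WHERE (VALUE_CHANGE_INDICATOR = 'Different' AND CDC_OPERATION IN ('Insert', 'Change'))
   OR (CDC_CHANGE_INDICATOR = 'Different' AND TIME_CHANGE_INDICATOR = 'Different');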

Many thanks to my colleague Ben Chen for helping to update and test this for the automation logic. Full code can be generated using the Virtualisation application; it is a bit long to post here, but you get the idea!

 

Tech tip: making SSIS Project Connections generate correctly using BIML Express

A bit more of a technical view on things today. In order to stay up to date with the latest when it comes to generating ETL for the Microsoft stack (SSIS), I recently upgraded from Visual Studio 2013 with BIDS Helper 1.6.6 to Visual Studio 2015 with BIML Express. And this means a lot of regression testing for years and years of increasingly complex BIML and C# scripts. As it turns out it wasn’t too bad, and the differences between Visual Studio 2013 and Visual Studio 2015 aren’t massive (I could even open the solution using both versions, which is rare with Visual Studio).

Moving from BIDS Helper to BIML Express is a slightly bigger change. BIML Express has been positioned by Varigence as the new development plug-in to use BIML for ETL generation. It basically supersedes BIDS Helper in this respect, although the BIDS Helper plug-in can still be used (even in conjunction with BIML Express) and has a host of useful features supporting broader BI development within Visual Studio (some of them are addressed in VS2015 though). If the only thing you’re interested in is BIML, then BIML Express will suffice. The biggest selling point is the IntelliSense option (and code formatting), which makes development a lot easier. I am happy to have moved to this new solution.

However, there were two bugs that kept everything from working properly – both occurring when generating Project Connections.

  • When generating a Project Connection with a Username / Password connection string, the password isn’t stored. This may not even be a bug, but it used to work like this (and I want it to).
  • When generating a Package using a Project Connection, the connection isn’t recognised in the package. The Project Connections are created, but the sources and destination objects are not mapped. This is a far more serious issue, which has been acknowledged as a bug and is reported to be fixed in a next release of BIML Express.

So, in the meantime here’s the workaround to still use this functionality and generate Project Connections that are correctly linked to Packages when you generate them. The (temporary) solution here is to:

  • Make the BIML store the password as an expression of the Project Connection. This can either be hard-coded, or by using a Project Parameter. Either way, the ‘password’ expression is added to the (central) Project Connection so it works for all packages that use that connection.
  • Reference the Project Connection by its GUID when generating Packages.

To implement this, define connections as usual (top-level, within the <BIML> tags, before the <Packages> tag). I mapped these connections to variables provided by C#.

<Connections>
  <Connection Name="METADATA" CreateInProject="true" ConnectionString="<#=connectionStringMETADATA#>"/>
  <Connection Name="STG" CreateInProject="true" ConnectionString="<#=connectionStringSTG#>"/>
  <Connection Name="INT" CreateInProject="true" ConnectionString="<#=connectionStringINT#>"/>
</Connections>

This is all as it used to be, but to address the above two issues I added a reference within the <Package> tag (again) to identify the connection and set the connection password property. You can add the GUID of the connection as an ID here to make the SSIS package reference the correct Project Connection. When BIML Express is updated you can take this out again. The GUID of the connection can be viewed in the properties of the Project Connection in the Connection Managers, and copied here.

<Connections>
 <Connection ConnectionName="METADATA" Id="{6A486C3B-A2C0-4338-8929-A966D2C201EB}">
 <Expressions>
   <Expression ExternalProperty="Password">"password"</Expression>
 </Expressions>
 </Connection>

 <Connection ConnectionName="STG" Id="{5CBCE390-8497-48D6-B9FB-6209F7041896}" >
 <Expressions>
 <Expression ExternalProperty="Password">"password"</Expression>
 </Expressions>

 <Connection ConnectionName="INT" Id="{19DC3EE4-9EA3-43FC-BDDD-B3EEEF187958}">
 <Expressions>
 <Expression ExternalProperty="Password">"password"</Expression>
 </Expressions>
 </Connection>
</Connections> 

This will generate SSIS packages and their Project Connections in a way that the source and destinations are mapped correctly. The underlying issue is that at present BIML seems to look up an incorrect GUID which doesn’t match the one for the designated connection. This is a way to get around that.

 

Creating Data Vault Point-In-Time and Dimension tables: merging historical data sources

Beyond creating Hubs, Links and Satellites and current-state (Type 1) views off the Data Vault, one of the most common requirements is the ability to represent a complete history of changes for a specific business entity (Hub, Link or groups of those). If a given Hub has on average 3 or 4 Satellites, it is useful at the very least to see the full history of changes for that specific Hub across all Satellites.

Merging various historised (time-variant) data sources such as Satellites together, especially when you have more than two of them, essentially reuses concepts that have been around for many years and in many forms. They can generally be referred to as ‘gaps and islands’ of time (validity) periods.

In this post I intend to explain how this concept can be applied to Data Vault to create both Point-In-Time (PIT) tables as well as Dimensions. The Dimension in this example would contain the full history for every attribute (completely ‘Type 2’ – in Dimensional Modelling) but of course can contain a mix at attribute level as to how history is presented. Ultimately it is a consideration for every individual attribute how this is done, but for the context of this post I will assume the entire Dimension requires the historical / Type 2 / time-variant perspective.

An additional disclaimer is needed here for PIT tables as well: if you have read any of the Data Vault books you may notice that my interpretation of a PIT table is slightly different. The explanations in the books are geared towards storing and maintaining snapshots of ‘state’ for a group of tables at various intervals (e.g. days as per midnight or something like that). This is a very sensible way to define this, but in this post I will focus more on combining the history at the most detailed level. The reasoning for this is that a ‘state per interval’ can always be derived from the full history using the same techniques outlined in this post. In other words, from the full history you can present the point-in-time snapshot as outlined in Dan Linstedt’s book very easily.

The default PIT table is a combination of the various Satellite keys and date attributes against the lowest defined grain (the entity, Hub), but it is perfectly OK to add attributes or even business logic if it makes sense for you. In the end, these are performance measures that need to be fit-for-purpose. The more attributes you add, the more the raw result starts to look like a Type 2 (full history) Dimension.

To keep some differentiation between the two let’s agree that a PIT table combines history from its direct surrounding tables (all Satellites for a Hub for example). A Dimension is broader than this and can contain more historised sets (e.g. multiple Hubs, Links and their Satellites). If in any way possible, I always try to generate the Dimensional Model directly off the Data Vault and only use PIT tables if there is really a performance issue. As a result, I don’t really use PIT structures that much. But if you need that bit of additional performance loading a Dimension they do come in handy.

As per the above introduction, the difference between PIT tables and Dimensions isn’t really big and the logic to create both is therefore very similar. The main difference is that creating a Dimension consisting of multiple sets of Hubs, Links and Satellites requires some creativity around navigating through the model. Which path to take, preventing cycling back to tables that are already incorporated – that sort of thing. But once you have mapped your path through the model the logic to put it all together is the same.

I have applied these concepts in the Virtual EDW software so I can easily ‘play around’ with various structures, and in terms of generation / automation I can attest the logic is indeed pretty much the same: once you have the tables and paths you’re basically done. Feel free to download the latest version and try this out yourself, but let’s start with the basics first.

 Joining time-variant / historised data sets together

The basic approach to join two time-variant tables together is to join them on their shared key (CUSTOMER_SK in the example below) as well as their Effective and Expiry Date/Times. This is by far the easiest if you have an expiry date/time, of course, but if you don’t want to persist this you can always derive it off the effective date/time. In the example below I persisted the expiry date/time in the available underlying data.

The join logic is essentially making sure the overlap in time-periods (I always refer to this as time slices) is evaluated: the greatest of the two effective dates is smaller than the smallest of the two expiry dates. Everything else we do is a variation of this theme. Consider the example below:

Merging two time-variant tables
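
As the query itself is only visible in the screenshot, below is a simplified sketch of the same pattern. It assumes a HUB_CUSTOMER with two Satellites (SAT_CUSTOMER and SAT_CUSTOMER_ADDRESS, names purely illustrative) that each carry a LOAD_DATETIME and LOAD_END_DATETIME:

SELECT
  hub.CUSTOMER_SK,
  -- the greatest of the two Effective Dates starts the combined time slice
  CASE WHEN sat1.LOAD_DATETIME > sat2.LOAD_DATETIME
       THEN sat1.LOAD_DATETIME ELSE sat2.LOAD_DATETIME END AS PIT_EFFECTIVE_DATETIME,
  -- the smallest of the two Expiry Dates ends the combined time slice
  CASE WHEN sat1.LOAD_END_DATETIME < sat2.LOAD_END_DATETIME
       THEN sat1.LOAD_END_DATETIME ELSE sat2.LOAD_END_DATETIME END AS PIT_EXPIRY_DATETIME,
  sat1.LOAD_DATETIME     AS SAT_CUSTOMER_EFFECTIVE_DATETIME,
  sat1.LOAD_END_DATETIME AS SAT_CUSTOMER_EXPIRY_DATETIME,
  sat2.LOAD_DATETIME     AS SAT_CUSTOMER_ADDRESS_EFFECTIVE_DATETIME,
  sat2.LOAD_END_DATETIME AS SAT_CUSTOMER_ADDRESS_EXPIRY_DATETIME
FROM HUB_CUSTOMER hub
JOIN SAT_CUSTOMER sat1
  ON hub.CUSTOMER_SK = sat1.CUSTOMER_SK
JOIN SAT_CUSTOMER_ADDRESS sat2
  ON hub.CUSTOMER_SK = sat2.CUSTOMER_SK
 -- overlap check: the greatest Effective Date is smaller than the smallest Expiry Date
 AND sat1.LOAD_DATETIME < sat2.LOAD_END_DATETIME
 AND sat2.LOAD_DATETIME < sat1.LOAD_END_DATETIME;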

This query provides you with a full history result showing the raw output for a Customer Hub and its two Satellites. The first Effective Date/Time and Expiry Date/Time combination (columns 2 and 3) represents the smallest time slices for this set of tables. The other two combinations (columns 4+5 and 6+7) represent how the two Satellites fit in these timelines, and the red circles indicate that for a specific ‘most-granular’ time slice there were no changes for that specific Satellite (repeating of values).

 Result set

This is the most basic PIT table (albeit at the most granular level – see the introduction), which can be persisted and updated as part of the Data Vault refresh mechanisms. The idea is that this table can be used to INNER JOIN Satellites to retrieve any required attributes for upstream processing. The fact that the complex join logic is already handled, and that an INNER JOIN can be used, can make a performance difference. Of course, SQL has GREATEST and LEAST functions that you can use instead of a CASE statement, but I thought this was the easiest way to explain.

Making joining easier when you have multiple tables in scope

In any case, the above technique is geared towards combining two time-variant sources and it is not very straightforward (transparent) to add more time-variant tables using this approach. Of course you can wrap the above logic in a sub-query and merge this with the next time-variant set, but this creates complex logic – especially when handling exceptions such as multi-active Satellites. A better way in my view is to adopt the gaps-and-islands technique again. This basically works as follows:

  • The first step is to create a combined set of all available times (Effective Date/Times). This is achieved by UNION-ing the Effective Date/Times across all the time-variant tables (data sets) you need and provides you with the smallest (most granular) available time-slices / level of detail. It is recommended to also UNION a zero-record (ghost record) in this step to make sure a complete timeline between ‘earliest history’ and ‘end of calendar’ can be created. This way you don’t even need zero-records in the Satellites!
  • The second step is to derive the Expiry Date/Times from the above set, creating a range that can be easily joined against.
  • The third step is to join each individual (time-variant) table against this range of time-slices and select the attributes you want to present. The join uses the same mechanism as above (greatest of the two effective dates < smallest of the two expiry dates) but the big difference is that each time-variant table is joined against the central ‘range’ set, making this a lot easier to configure and extend. Additional tables can be easily added (but will have an impact on the granularity). I also recommend joining the central Hub to the range, because typically you want to add the business key itself here or also make mention of the time-slice created by the arrival of the business key. A business key may have been created some time before the first context started to appear in Satellites.

An example is provided here, and the result is the same as earlier (although I selected some other attributes). You can see how easy it is to add more tables to join using this structure, and it’s also easy to add or remove attributes.

Time slices
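
Because this example is also only visible as a screenshot, below is a simplified sketch of the range-based approach using the same assumed tables as before. The ghost record in the UNION completes the timeline per key, and the overlap check is written in one of the equivalent simpler forms (the range Effective Date falling inside each Satellite’s validity period):

WITH EffectiveDates AS
(
  -- Step 1: combine all available Effective Date/Times, plus a ghost record per key
  SELECT CUSTOMER_SK, LOAD_DATETIME FROM SAT_CUSTOMER
  UNION
  SELECT CUSTOMER_SK, LOAD_DATETIME FROM SAT_CUSTOMER_ADDRESS
  UNION
  SELECT CUSTOMER_SK, CONVERT(DATETIME, '1900-01-01') FROM HUB_CUSTOMER
),
Ranges AS
(
  -- Step 2: derive the Expiry Date/Time of each most-granular time slice
  SELECT
    CUSTOMER_SK,
    LOAD_DATETIME AS PIT_EFFECTIVE_DATETIME,
    LEAD(LOAD_DATETIME, 1, '9999-12-31') OVER (PARTITION BY CUSTOMER_SK ORDER BY LOAD_DATETIME) AS PIT_EXPIRY_DATETIME
  FROM EffectiveDates
)
-- Step 3: join each time-variant table (and the Hub) against the range of time slices
SELECT
  r.CUSTOMER_SK,
  r.PIT_EFFECTIVE_DATETIME,
  r.PIT_EXPIRY_DATETIME,
  hub.CUSTOMER_CODE,  -- the business key from the Hub (assumed column name)
  sat1.CUSTOMER_NAME, -- attribute from the first Satellite (assumed)
  sat2.POSTCODE       -- attribute from the second Satellite (assumed)
FROM Ranges r
JOIN HUB_CUSTOMER hub
  ON r.CUSTOMER_SK = hub.CUSTOMER_SK
LEFT JOIN SAT_CUSTOMER sat1
  ON  r.CUSTOMER_SK = sat1.CUSTOMER_SK
  AND r.PIT_EFFECTIVE_DATETIME >= sat1.LOAD_DATETIME
  AND r.PIT_EFFECTIVE_DATETIME <  sat1.LOAD_END_DATETIME
LEFT JOIN SAT_CUSTOMER_ADDRESS sat2
  ON  r.CUSTOMER_SK = sat2.CUSTOMER_SK
  AND r.PIT_EFFECTIVE_DATETIME >= sat2.LOAD_DATETIME
  AND r.PIT_EFFECTIVE_DATETIME <  sat2.LOAD_END_DATETIME;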

Are we there yet?

It is important to realise that the ranges / time-slices created provide the most detailed level available: that is, all the history available for all attributes between all involved tables. However, the attributes you select as output may not necessarily be directly related to some of the change records. Or, to look at this the other way: the attribute that may have led to a change being tracked in the Satellite may not be exposed in the SELECT statement.

Depending on which attributes you expose this may lead to duplicates in the end result, but these can easily be handled by adding row condensing similar to the one used in the Satellites. By applying this concept over the top of the above query it is easy to show the historical context for the attributes in scope and very easy to toggle between a Dimension view, a true Point-in-Time (snapshot) or raw changes.

Applying row condensing for PIT tables or Dimensions can be implemented by creating a checksum across all attributes except the lowest-granularity (range) Effective Date/Time, and comparing the rows with each other when sorted by Effective Date/Time. If there are no changes between the rows then the row with the higher Effective Date can be discarded (condensed). This is because apparently the change in the Satellite that created the time-slice was not required in this specific output. The query logic is a bit long to paste here, but can be tested in the Virtual EDW app.

The key steps are to:

  • From the output of the time-variant query shown in the previous section, create a checksum (I use MD5) across all attributes excluding the Effective Date/Time for the range.
  • Make the checksums available for comparison, for example:
    LAG(ATTRIBUTE_CHECKSUM, 1, '-1') OVER(PARTITION BY [HUB_CUSTOMER.CUSTOMER_SK] ORDER BY PIT_EFFECTIVE_DATETIME ASC) AS PREVIOUS_ATTRIBUTE_CHECKSUM
  • Filter out the records where the checksum is the same as the previous checksum (these are the duplicates); a brief sketch follows below
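
A brief sketch of these steps, continuing from the range-based query shown earlier (the <output of the time-variant query above> placeholder stands for that result set and the attribute names are assumed):

WITH Checksummed AS
(
  SELECT
    CUSTOMER_SK,
    PIT_EFFECTIVE_DATETIME,
    CUSTOMER_NAME,
    POSTCODE,
    -- checksum across all selected attributes, excluding the range Effective Date/Time
    CONVERT(CHAR(32), HASHBYTES('MD5', ISNULL(CUSTOMER_NAME, 'NA') + '|' + ISNULL(POSTCODE, 'NA') + '|'), 2) AS ATTRIBUTE_CHECKSUM
  FROM <output of the time-variant query above>
)
SELECT CUSTOMER_SK, PIT_EFFECTIVE_DATETIME, CUSTOMER_NAME, POSTCODE
FROM
(
  SELECT *,
    LAG(ATTRIBUTE_CHECKSUM, 1, '-1') OVER (PARTITION BY CUSTOMER_SK ORDER BY PIT_EFFECTIVE_DATETIME ASC) AS PREVIOUS_ATTRIBUTE_CHECKSUM
  FROM Checksummed
) sub
WHERE ATTRIBUTE_CHECKSUM <> PREVIOUS_ATTRIBUTE_CHECKSUM; -- discard (condense) rows where nothing changed for the selected attributes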

Again, this applies to every scenario where you only select a subset of all available attributes in time-variant tables.

Final thoughts

The above explanation should provide some handles to create your own time-variant output, but there are always other things to be aware of when deploying these solutions. This is especially true in cases where the result is instantiated into a physical table that needs to be periodically refreshed. Without going into too much detail (posts get too long) here are some pointers:

  • Updates in Satellites may occur at different times, which can mean that a later Effective Date/Time can be available while another Satellite still needs to be updated. I work in an environment where everything runs continuously, which means that at the moment of refresh for a PIT or Dimension some information may still be loading into some of the Satellites. This is obviously only an issue if you want to process deltas into your PIT or Dimension, but it is good practice to build in delta handling from the start as a full refresh may quickly become too costly. There are a few ways to handle this (a good topic for a next post). The solution can be to implement a mechanism that rolls back some of the rows and re-applies the updated history, or to create loading dependencies (not my personal favourite).
  • In complex environments the queries may not work in a single pass, but it’s easy to break up the steps into separate ETL processes that each persist their data. This can give the database a bit of breathing space if required.
  • As I mentioned earlier in this post, I prefer to ‘jump’ from a Data Vault / DWH into a Dimension or other form of Presentation Layer output straight away, skipping PIT tables altogether.

Hope this all makes sense!

 

Some insights about … Insights

Can I get some insights, please?

Over the years, I have come to somewhat dislike the term ‘insights’, almost to the same level as, say, a ‘Data Lake’. And that’s saying something. Not because these concepts themselves are that closely related (they are to some extent, of course), but because to me personally they both conjure the same feeling: a mixture of annoyance and desperation. One of the reasons is that, since the word ‘insights’ is part of my job title, there is a general view that insights are something that you can just ‘mine’ from ‘big data’ and that they are there to be ‘delivered’.

My experience however has been that insights are generated from working closely with the business over a longer period of time, and with a dedicated focus of honing in on explanations that are relevant to meet a specific business goal. Insights do not appear from data, and ‘shaking’ data vigorously does not make insights fall out of it. Insights supplement data analysis by proving or disproving detailed hypotheses or assumptions that the initial analysis may have brought to the attention. Furthermore, insights are sharpened by implementing a feedback loop that continuously addresses gaps in knowledge and information over a longer period of time and across various areas of the business. This article aims to explain how we can achieve this cross-functional collaboration, which I believe is a requirement to get ‘insights’.

The way I usually introduce this topic is by emphasising the ‘enabling’ of all of this: the data itself and the hard work required to bring data together, and address any gaps. Also, do not underestimate the effort and diligence required to ensure data is of reasonable quality.

Ownership of this collection process lies in the realm of Data Warehousing, Data Integration and to some extent Data Governance. They can be seen as the custodians of the data, which by the way does not mean that ownership of the data lies here as well. Ultimately the business owns the data (or should take ownership if they haven’t done so already) and should determine its fitness-for-purpose, as they will be the ones that ultimately will reap the rewards of the data.

To support the above, as an Insights Manager, my intent is to mold this data into ‘data (Lego) blocks’: defined and auditable data elements of sufficient quality, with which you can build many things. The sky is the limit, or until you run out of blocks.

The idea is that exposing these ‘data blocks’ provides basic information to Subject Matter Experts (SMEs), which hopefully sparks creativity and subsequent deeper analysis. The ideas that emerge from this creativity are the nutrients to cultivate ‘insights’. Self-service (managed self-service, that is) plays a role here, too.

With the right reusable data blocks, you can build anything. Data availability sparks creativity.

The Book With All The Answers

We have a book at home called ‘The Book With All The Answers’, which is absolutely great if you’re looking for any answer to any question. It is basically a book full of answers such as ‘yes, you should totally go for it’ or ‘that would be the biggest mistake of your life’. Every page has a specific answer printed and all you have to do is say the question out loud, open the book without looking and you have your answer. There are times that this is used rather heavily (you know who you are!).

There is a work equivalent of this as well.

Invariably, I am asked to ‘give some insights’, or ‘can I get some insights on our XYZ?’. We sometimes (jokingly) respond by saying ‘give me a minute and I’ll run the Report With All The Answers’ or open up the ‘Silver Bullet View’. Because it doesn’t work like that. Any kind of analysis always starts with a business problem. Or better yet: any kind of analysis should always start with a clearly defined business problem.

So, how does it work? While working on analytics and insights can start at any time and with incomplete data (as long as there is a clearly defined problem), I believe it should be linked to a holistic approach that focuses on incrementally delivering more information to support the insights process. An approach that focuses on addressing gaps in data to sharpen the analysis over time. From this perspective, the continuous delivery of insights is part of a bigger process that is supported by data availability: the available set of data blocks.

It is also about finding a way to engage with the business on the ‘what’s next?’, ‘so what?’ or some other specific topic. These would typically be use-case driven: aimed at achieving an actionable goal, but can also involve investigations for specific topics to understand why things happened.

Another way of looking at this is that fundamental, more basic information needs ought to be satisfied first before you can talk about generating insights. Insights require cultivation, to grow over time and don’t magically appear out of thin air. You can picture this as a hierarchy of data needs, in a similar way to how human and / or organisational drives are described in Maslow’s pyramid of needs.

Maslows’ pyramid reinvented

You probably are aware of Maslow’s Hierarchy of Needs, a model of human longings which is typically represented as a pyramid. It can be used to understand what drives (motivates) people, and that more basic (lower) needs require to be fulfilled first before the ‘next level up’ becomes relevant. For example, a basic need would be food, water or rest. These fundamental needs are required to be met before relationships and accomplishments become focus of attention. And so on.

This approach can also be applied to data needs, which in my view culminate into ‘insights’ at the top of the pyramid. I’m not sure where it exactly originates from (I got my example from gfxspeak) but there are plenty of other examples all over the internet.

In any case, I like this way of thinking and have been using it to position insights as well as capabilities such as Data Warehousing, Business Intelligence and Advanced Analytics (Data Science) in general. The ‘pyramid of data needs’ provides a good overview of what kind of requirements and capabilities a business should have in place in order to support ongoing insights generation.

I took the liberty of adding two additional layers to the original hierarchy of Data, Information, Knowledge and Wisdom: an explanation with examples in the middle column and ownership topics on the left column. The result is added below.

The pyramid of ‘data needs’.

Adequate recording of events, transactions and descriptive data is the fundamental requirement. Without data, there is very little to report or analyse so we can define this as the most basic data need. Data is captured by operational (Online Transactional Processing) systems, which are typically managed by the IT departments through a combination of infrastructure and application management. Data is subsequently ‘sourced’ from these transactional systems into Data Warehouse (DWH) solutions through some form of Data Integration or emerging alternative techniques such as data virtualisation. Please note that not all data is available in neatly organised environments such as databases. More often than not there is a significant amount of (potentially) relevant data available in semi- or unstructured formats as well, which is something to bear in mind. Regardless of how this is stored though, it needs to be recorded somewhere.

Management of these solutions is usually part of the mandate of the Business Intelligence (BI) team, although in some organisations the responsibility of capturing (DWH) and reporting (BI) are separated (not my personal preference, though).

This answers the question about what data is available in principle, and can also be used to address gaps in data collection. Modelling plays a part here, usually through a conceptual or logical information model to highlight the ideal and as-is states.

Once the fundamental need for data to be recorded is satisfied, and data can be accessed in a fairly consistent way including access to previous states of the data (the history of changes), various questions can be answered such as ‘what is happening now?’, and ‘what happened in the past?’. This is one of the core domains of BI and provides an important enabling service for any organisation. Providing consistent access to information also helps guide the organisation towards conformity of definitions and meaning, another key BI responsibility. Limited extrapolation or analysis is done ‘forward-looking’ in time, but the majority of focus is the ‘rear-view mirror’. The ability to achieve this is another major requirement to proceed up the pyramid: raw data is processed into information.

There is a strong relationship between BI, DWH and Analytics although analytics in some forms can (and do) exist without the former. I believe however that structural application of analytics requires a more holistic approach where both information and (business) knowledge ‘gaps’ (uncovered through the analytics process) are addressed by the BI and DWH areas on an ongoing basis. A BI feedback loop if you like. In the same vein, validated findings and completed analysis can be incorporated as Business-As-Usual by the BI team. Data and business knowledge gaps go hand-in-hand here, as a lack of accessible information is likely to be a major cause for not fully understanding a particular area of the business.

In any case, Analytics (Advanced Analytics) incorporates various approaches for using statistics / modelling to predict, explain or explore relationships and causality in data. Predictive modelling applies a (carefully selected) statistical model or data mining algorithm to data for the purpose of predicting new or future observations. This topic covers, as an example, propensity analysis and generally looking at trends from a more statistical view such as calculating seasonal influences. Practically speaking Analytics is a way to help answer questions such as ‘what is likely to happen next?’. In this context, it is interesting to consider that predicting something successfully using an analytics model does not equate to (fully) understanding why this behaviour manifests itself.

At this level, an organisation would be able to develop knowledge: an understanding of a specific topic including weighting different approaches. Each ‘approach’ is the result of a specific analysis for a specific defined topic or question. But how are these topics or questions selected?

We have now approached the core of this article: how to connect and engage with the business to achieve insights. While obviously relevant for BI as well, it is especially important for Analytics teams to work closely with the business. Another way of saying this is that (business) questions need to be very specific in the way they are formulated. It is for this level of detail that Subject Matter Experts (SMEs) come in, as true SMEs understand these nuances and generally have a clear idea on what the intended outcome needs to be, or what is generally the expected result.

If the outcome of the test is different to what is hypothesised by the SME, then you have your first insight.

SMEs need time and focus to work with the BI and Analytics teams. But, they also need to connect with the strategy and decision making side of the business. This great post by Tom Breur explains this really well.

With this process established, the foundation is ready for the last level of organisational data needs. At this stage, a specific direction of the business can be supported by the required data in a processed and accessible way and various tests have been executed. Obtaining wisdom, by incorporation of information, is where insights live, because this is where various options have been defined, weighted, understood and ultimately used in a decision-making process that aligns with a specific strategy and intent.

These mechanisms need to be considered when working towards, or organising, data-driven initiatives. These initiatives need time, focus, cross-functional collaboration (e.g. SME and information specialists) and perseverance. Ultimately it comes down to understanding intent (strategy) and the ability to define the metrics relevant to the specific area of expertise and at a very detailed level, be it direct marketing, contact centres or risk modelling. The BI, DWH and Analytics teams can then work with the business to make this a reality.

This means that delivering insights is a team effort across the organisation and requires time, focus and collaboration with the various areas to provide the data requirements following the ‘data pyramid of needs’.

 More insights

As stated before: every analysis should always start with a clearly defined business problem. The pyramid of data needs aims to explain what the requirements are to enable this, culminating in collaboration between the business (SME) and information specialists. If you’re not an SME and are tempted into committing to deliver insights in areas that are not clearly defined, then your first problem is to figure out what the problem is to begin with. Hopefully this paper helps to avoid this pitfall.

 This post admittedly has a very bottom-up view about working with data and turning it into insights. But my experience is that without these foundations in place the analytics process can be very slow, because gaps in data availability will be encountered repeatedly. Ironically, this is almost counter-intuitive to the ‘big data’ idea that we’re swamped in data. By incorporating a data management approach or platform into your strategy, where the ‘data Lego blocks’ are being added to on a use-case basis you will be able to support the ongoing analytics process and bridge gaps in data and knowledge over time.

Having said that, this is ultimately dependent on a combination of factors like the topic, the scale and the way data is managed for a particular organisation or subject area so your mileage may vary.

Market research (both quantitative and qualitative), is also a factor which is not covered in this paper. This provides another perspective on generating insights (which still requires a fairly well-defined topic though). In all cases being mindful of the types of skills, the specific focus and data needs will help putting together the right team for the right results.

At the end of the day, generating insights is an iterative cycle, a two-way street, that starts with strategy (focus) and gradually informs and improves understanding of a specific topic if you have the perseverance to stick with it. Another way of looking at this is that just having strategy and information is not enough. What we should work towards is creating the feedback loop between the two areas and continuously improving this over time.

The sheep with five legs

This focus on cross-functional collaboration is opposed to the notion that an analytics specialist needs to be able to be an expert in both the business and statistics: a ‘sheep with five legs’.

I’m not really sure this translates well from Dutch (my native language) into English, but a sheep with five legs in Dutch is basically a super talent: someone who is above-and-beyond versatile and can act as genuine catalyst or intermediary. The employee everyone is looking for. In the context of this article our ‘five-legged-sheep’ would be an advanced analytics specialist who also happens to be an expert in one or more areas of the business. Someone who, for instance, truly understands marketing, aviation, fast-moving-consumer-goods – whatever is relevant for the business – and at a practical level.

I don’t disagree that finding someone that can combine a business specialisation with analytics expertise is a great asset. The point I am trying to make is that, especially in a large organisation, a single person or small team cannot be ‘the SME’ across all areas of the business. Instead, what you may want to hire for are so-called generalised specialists: people who excel in some area, yet are conversant enough in multiple adjacent domains to be versatile.

You therefore have to consider organising the teams to either focus on dedicated areas and hire for this purpose, or focus on broader capability and engage with SMEs in the business. Or maybe launch your own five-legged-sheep breeding program?

 

When is a change a ‘change’?

This is a post that touches on what I think is one of the essential best practices for ETL design: the ability to process multiple changes for the same key in a single pass. This is specifically relevant for typical ETL processes that load data to a time-variant target (PSA, Satellite, Dimension etc.). For non-time-variant targets (Hubs, Links etc.) the process is a bit easier as this is essentially built into the patterns already :-). In a given process, there are usually (at least) two rules I maintain:

  1. Making sure there is a safety catch to prevent loading information multiple times (by accident, out of order etc.); a sketch of this is shown after this list
  2. Making sure the correct delta is selected to be merged with the target
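
A minimal sketch of the first rule, with illustrative table and column names (STG_CUSTOMER, SAT_CUSTOMER):

-- Safety catch: skip any staging rows whose key and Load Date/Time already exist in the target,
-- so running the process twice (or out of order) does not load the same change again.
INSERT INTO SAT_CUSTOMER (CUSTOMER_HSH, LOAD_DATETIME, CUSTOMER_NAME)
SELECT stg.CUSTOMER_HSH, stg.LOAD_DATETIME, stg.CUSTOMER_NAME
FROM STG_CUSTOMER stg
WHERE NOT EXISTS
(
  SELECT 1
  FROM SAT_CUSTOMER sat
  WHERE sat.CUSTOMER_HSH = stg.CUSTOMER_HSH
    AND sat.LOAD_DATETIME = stg.LOAD_DATETIME
);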

The paper I’ve written here (click the link below to open) captures the essence that explains the second topic: how the correct delta is selected.

When is a change a ‘change’

Also, if you happen to be in Melbourne in March and are interested in hearing more on these topics as a classroom training please have a look at the Data Vault implementation course as well.

 