MSBI (SSIS/SSRS/SSAS) Online Training

Thursday, March 7, 2013

SSIS SCD Wizard Disadvantages


Project Description

A custom Data Flow component for SQL Server Integration Services (SSIS) that replaces the standard SCD Wizard with a superior experience, from the configuration UI to runtime performance. Performs 100x faster than the standard component, and edits are non-destructive.


The
Slowly Changing Dimension Wizard in SQL Server Integration Services is a very useful tool for setting up and handling slowly changing dimension processing, which is a very common requirement of Data Warehousing. The SCD Wizard is easy to configure and works as advertised.
However, there are some significant issues with it that have been experienced by many:

  1. It can be very slow.
  2. Re-running the Wizard destroys parts of the data flow.
  3. It can’t be used against all data providers.
  4. It can be hard to troubleshoot why it makes the decisions it does. Click here for more details on these and other disadvantages of the SCD Wizard

 

The SCD Wizard is definitely a step in the right direction. It makes handling complex SCD processing relatively easy and quick – at least when compared with constructing your own equivalent logic using Lookups, Conditional Splits, and other components. However, after using it for one or two projects, you’ll probably run in to one or more of these issues – I’ll only briefly describe them, because if you’ve used it, you probably know them better than I do:

Ø  Re-running the Wizard in order to alter the behaviour of the SCD processing completely destroys that section of the Data Flow.

Ø  It’s slooow on larger dimensions, especially when used in a scale-out/remote scenario (where the IS server and DBMS containing the dimension are on different boxes) due to the Wizard doing an uncached RBAR round-trip lookup for every row of the source.

Ø  Column value comparison is always case-sensitive, trailing-space-sensitive, and culture-sensitive.

Ø  No explanation is given for why certain rows get sent to certain outputs in order to debug data. Items shown as “new” or “changed” may be so due to trailing spaces, case, collation, implicit data type conversion issues on any one of the keys or attributes – but none of that is fed back to you.

Ø  Only OLE DB Connection Managers are accepted to reference the dimension table – and only some OLE DB providers are permitted.

Ø  The “current row” must always be identifiable by keeping the “expiry” date NULL – or by manually altering the components created by the Wizard, which get destroyed if the Wizard is re-run. A better practice is to put far-future dates in the expiry column, leading to easier queries. (See The Microsoft Data Warehouse Toolkit p241 - Mundy, Thornthwaite, Kimball.)

Ø  Not possible to “retire” dimension members (rows), due to the “Change Stream” architecture.

Ø  No support for managing (generating) surrogate keys.

Ø  No support for “special” members – such as the “unknown” member – per Kimball Method best practices.

Ø  No support for auditing changes is included.


Take advantage of the improvements over the SCD Wizard present in the Kimball Method SCD component.



Kimball Method SCD Component Advantages
 
 
 
 


There are quite a few advantages to using the Kimball Method SCD component over the SCD Wizard. Here’s a starter list:

Ø  One component on the design surface that can be edited without adverse effects on the rest of the Data Flow. The SCD Wizard creates multiple components that are destroyed and rebuilt if the Wizard is run again. Any customizations that were made in any of the components – including the “secondary” Derived Column and OLE DB Command transforms – are completely lost. (all versions)

Ø  Insane performance - measured to be 100x superior (see the Performance Improvement by use of multiple threads, sort optimization, and implied outcome determination. Those architectural differences beat the SCD Wizard’s single-threaded uncached row-by-row lookups. (all versions)

Ø  Surrogate Key management inside the component – if desired. The SCD Wizard offers no assistance creating or maintaining surrogate keys. (all versions)

Ø  "Special" (unknown) member support, per Kimball Method best practices. If you have some “constant” members that should always appear in your dimension, you can manage them (and changes to them) using the Kimball SCD. Maintaining those rows is completely external to the SCD Wizard, which could result in unintentional collisions with actual business information. (all versions)

Ø  Includes a "Row Change Reason" output column on all (except Unchanged) outputs, per Kimball Method best practices. Doing so gives you some meaningful way to understand why a row was sent to a particular output. This is in comparison to zero support for debugging why a row was directed to a particular output in the SCD Wizard, which can get quite frustrating – especially when dealing with understanding mapping your RDBMS data types to SSIS data types. (all versions)

Ø  Supports simple and advanced styles of Row Auditing for inserts and updates, per Kimball Method best practices. The SCD Wizard offers no auditing support. (v1.4+)

Ø  Flexible column comparisons: case (in)sensitive and space (in)sensitive as desired, plus culture-sensitivity. The SCD Wizard is case-, space-, and culture-sensitive with no features available to alter that. (v1.4+)

Ø  Flexible SCD 2 "current row" handling - permits specification of the date "endpoints". Comparatively, the only choice with the SCD Wizard is to have the "expiry" date be NULL to signify the current record. (all versions)

Ø  Flexible SCD 2 date handling - permits specification of what date expired and new rows get marked with. The SCD Wizard leaves that up to a Derived Column component that will get destroyed when the Wizard is run again to adjust other properties. (all versions)

Ø  Options to use other data types for the Current Member and Inferred Member columns - not just Boolean types as in the SCD Wizard. (v1.5+)

Ø  Reads the existing dimension from the Data Flow, not a Connection Manager - allowing the package designer to cache the existing dimension table as they see fit, and use any repository to store the dimension table. The SCD Wizard requires an OLE DB Connection Manager for the dimension table, and only supports some OLE DB providers. (all versions)

Ø  Supports "retiring" dimension members that have been deleted from the Source System by marking their "expiry date" with a real date. The SCD Wizard does not support this, because it uses "Change Stream" style SCD processing. With the SCD Wizard, your dimension table will never be queryable for "how many active things are there?" at any specific point in time. (all versions)

 

 

 

 

 

 

 

Part One
Introducing the sample data
Part Two
Introducing the sample SSIS package
Part Three
Loading the dimension table with the SCD Wizard
Part Four
Loading the dimension table with other SSIS components (Lookup/Conditional Split)
Part Five
Loading the dimension table with the T-SQL MERGE command
Part Six
Loading the dimension table with the Dimension Merge SCD component

 

 

Source:


MSBI (SSIS/SSRS/SSAS) Online Training:




No comments: