Project Description
A custom Data Flow component for SQL Server Integration Services (SSIS)
that replaces the standard SCD Wizard with a superior experience, from the
configuration UI to runtime performance. Performs 100x faster than the standard
component, and edits are non-destructive.
The Slowly Changing Dimension Wizard in SQL Server Integration Services is a very useful tool for setting up and handling slowly changing dimension processing, which is a very common requirement of Data Warehousing. The SCD Wizard is easy to configure and works as advertised.
However, there are some significant issues with it that have been experienced by many:
The Slowly Changing Dimension Wizard in SQL Server Integration Services is a very useful tool for setting up and handling slowly changing dimension processing, which is a very common requirement of Data Warehousing. The SCD Wizard is easy to configure and works as advertised.
However, there are some significant issues with it that have been experienced by many:
- It can be very slow.
- Re-running the Wizard destroys parts of the
data flow.
- It can’t be used against all data providers.
- It can be hard to troubleshoot why it makes
the decisions it does. Click
here for more details on these and other disadvantages of the SCD Wizard
The SCD Wizard is definitely a step in the right direction. It makes
handling complex SCD processing relatively easy and quick – at least when
compared with constructing your own equivalent logic using Lookups, Conditional
Splits, and other components. However, after using it for one or two projects,
you’ll probably run in to one or more of these issues – I’ll only briefly
describe them, because if you’ve used it, you probably know them better than I
do:
Ø Re-running the Wizard in order to alter the behaviour of the SCD
processing completely destroys that section of the Data Flow.
Ø It’s slooow on larger dimensions, especially when used in a
scale-out/remote scenario (where the IS server and DBMS containing the
dimension are on different boxes) due to the Wizard doing an uncached RBAR
round-trip lookup for every row of the source.
Ø Column value comparison is always case-sensitive,
trailing-space-sensitive, and culture-sensitive.
Ø No explanation is given for why certain rows get sent to certain outputs
in order to debug data. Items shown as “new” or “changed” may be so due to
trailing spaces, case, collation, implicit data type conversion issues on any
one of the keys or attributes – but none of that is fed back to you.
Ø Only OLE DB Connection Managers are accepted to reference the dimension
table – and only some OLE DB providers are permitted.
Ø The “current row” must always be identifiable by keeping the “expiry”
date NULL – or by manually altering the components created by the Wizard, which
get destroyed if the Wizard is re-run. A better practice is to put far-future
dates in the expiry column, leading to easier queries. (See The Microsoft
Data Warehouse Toolkit p241 - Mundy, Thornthwaite, Kimball.)
Ø Not possible to “retire” dimension members (rows), due to the “Change
Stream” architecture.
Ø No support for managing (generating) surrogate keys.
Ø No support for “special” members – such as the “unknown” member – per
Kimball Method best practices.
Ø No support for auditing changes is included.
Take advantage of the improvements over the SCD Wizard present in the Kimball Method SCD component.
Kimball Method SCD Component Advantages
There are quite a few advantages to using the Kimball Method SCD
component over the SCD Wizard. Here’s a starter list:
Ø One component on the design surface that can be edited without adverse
effects on the rest of the Data Flow. The SCD Wizard creates multiple
components that are destroyed and rebuilt if the Wizard is run again. Any
customizations that were made in any of the components – including the
“secondary” Derived Column and OLE DB Command transforms – are completely lost.
(all versions)
Ø Insane performance - measured to be 100x superior (see the Performance
Improvement by use of multiple threads, sort optimization, and implied
outcome determination. Those architectural differences beat the SCD Wizard’s
single-threaded uncached row-by-row lookups. (all versions)
Ø Surrogate Key management inside the component – if desired. The SCD
Wizard offers no assistance creating or maintaining surrogate keys. (all
versions)
Ø "Special" (unknown) member support, per Kimball Method best
practices. If you have some “constant” members that should always appear in
your dimension, you can manage them (and changes to them) using the Kimball
SCD. Maintaining those rows is completely external to the SCD Wizard, which
could result in unintentional collisions with actual business information. (all
versions)
Ø Includes a "Row Change Reason" output column on all (except
Unchanged) outputs, per Kimball Method best practices. Doing so gives you some
meaningful way to understand why a row was sent to a particular output. This is
in comparison to zero support for debugging why a row was directed to a
particular output in the SCD Wizard, which can get quite frustrating –
especially when dealing with understanding mapping your RDBMS data types to
SSIS data types. (all versions)
Ø Supports simple and advanced styles of Row Auditing for inserts and
updates, per Kimball Method best practices. The SCD Wizard offers no auditing
support. (v1.4+)
Ø Flexible column comparisons: case (in)sensitive and space (in)sensitive
as desired, plus culture-sensitivity. The SCD Wizard is case-, space-, and
culture-sensitive with no features available to alter that. (v1.4+)
Ø Flexible SCD 2 "current row" handling - permits specification
of the date "endpoints". Comparatively, the only choice with the SCD
Wizard is to have the "expiry" date be NULL to signify the current
record. (all versions)
Ø Flexible SCD 2 date handling - permits specification of what date
expired and new rows get marked with. The SCD Wizard leaves that up to a
Derived Column component that will get destroyed when the Wizard is run again
to adjust other properties. (all versions)
Ø Options to use other data types for the Current Member and Inferred
Member columns - not just Boolean types as in the SCD Wizard. (v1.5+)
Ø Reads the existing dimension from the Data Flow, not a Connection
Manager - allowing the package designer to cache the existing dimension table
as they see fit, and use any repository to store the dimension table. The SCD
Wizard requires an OLE DB Connection Manager for the dimension table, and only
supports some OLE DB providers. (all versions)
Ø Supports "retiring" dimension members that have been deleted
from the Source System by marking their "expiry date" with a real
date. The SCD Wizard does not support this, because it uses "Change
Stream" style SCD processing. With the SCD Wizard, your dimension table
will never be queryable for "how many active things are there?" at
any specific point in time. (all versions)
Part
One
|
Introducing
the sample data
|
||
Part
Two
|
Introducing the sample SSIS package
|
||
Part
Three
|
Loading the dimension table with the SCD
Wizard
|
||
Part
Four
|
Loading the dimension table with other SSIS
components (Lookup/Conditional Split)
|
||
Part
Five
|
Loading the dimension table with the T-SQL
MERGE command
|
||
Part
Six
|
Loading the dimension table with the
Dimension Merge SCD component
|
Source:
MSBI (SSIS/SSRS/SSAS) Online Training:
No comments:
Post a Comment