MSBI (SSIS/SSRS/SSAS) Online Training

Wednesday, January 16, 2013

Microsoft PDW

Parallel Data Warehouse – Overview :

SQL Server Parallel Data Warehouse is a highly scalable appliance for enterprise data warehousing and it can be used for massively scale out scenarios up to hundreds of Terabytes. Enterprise Data Warehouse typically used within a hub & spoke architecture, as shown in the following picture.
In this kind of architecture the Enterprise Data Warehouse (hub) is the central store of source system data. The spokes are typical used for reporting purposes, like regional reporting and high performance reporting, and fed by the central hub with data. Currently there are only 2 vendors providing PDW hardware, HP and DELL.

SQL Server Parallel Data Warehouse – Architecture Components :

SQL Server Parallel Data Warehouse (PDW) architecture components of the appliance are listed below. The smallest PDW configuration consists of two racks: the Control Rack and the Data Rack as shown in the following picture.

The Control Rack consists of the PDW management components: the Control Node, the Management Node, the Landing Zone and the Backup node. As you can see on the picture the Control Node and the Management Node are designed as an active / passive cluster for high availability reasons. The Landing Zone and the Backup Node not.

The Data Rack holds all the Compute Nodes together with the Storage Node. Each Control Node has it’s own Storage Node connected with dual fiber channel. The Spare Compute Node is used for redundancy and failover capabilities. 

 

Let’s have a closer look on each of those components.

SQL Server Version

The official name of PDW is SQL Server 2008 R2 Parallel Data Warehouse. In fact on all nodes beside the Landing Zone SQL Server 2008 is installed, even though PDW works as a whole system. There is a database behind, it is pre-configured and it does what is has to do. It is not supported to connect to the nodes and change something in the configuration.

The Landing Zone ships without any pre-installed version of SQL Server. It is also supported to put SQL Server 2008 R2 on the Landing Zone as well as 3rd party tools.

Control Node

All client connections to the database go always through the control node. It does not contain any persistent user data and is the PDW “Secret Sauce”. Why? Because it processes the SQL queries coming from the clients, prepares the execution plan and orchestrates the distribution execution on the Compute Nodes.

The Control Node supports the following drivers for data connection: ODBC, OLEDB and ADO.NET.

The network protocol used to connect to the Control Node is SeQuel Link which is currently not supported by the SQL Server Management studio.

Management Node

The Management Node is responsible for deploying patches to all nodes and holds images in case of a re-imaging of on of the nodes. It additionally runs a PDW internal Domain Controller together with an Active Directory that is currently only used for managing purposes and not for authentication

Landing Zone

The Landing Zone represents the ETL interface of PDW. It provides high capacity storage (dedicated direct attached server storage) for data file from the ETL process. The ETL processes can be executed with Integration Services or the DWLoader. As mentioned earlier the Landing Zone is the only node where additional tools can be installed.

Backup Node

The Backup Nodes is the central place that coordinates backup across all nodes. There are 3 different possible backup options: full, differential or metadata only.

The Backup Node can be orderable in different capacities up to 524TB of disk space. As the Landing Zone the storage of the Backup Node is dedicated direct attached storage. 

Compute Node

The last node to mention is the Compute Node (I see the Storage Node part of the Compute Node). Each of those nodes is a highly tuned SMP node with standard interfaces and dedicated hardware, database and storage.

Each Compute Node has it’s own TempDB that is stored on the server storage.

The Compute Nodes are used for data distribution and to run parts of the overall query in parallel.

General Information
  • The smallest PDW appliance currently ships with one Data Rack containing 10 Compute Nodes. In the future there will be a half rack with 5 Compute Nodes available.
  • For large enterprise environments there are also developer and test appliances available that have less Compute Nodes and no high availability options.

The Control Node is currently able to handle up to 4 Data Racks.

No comments: