MSBI (SSIS/SSRS/SSAS) Online Training

Sunday, July 26, 2009

DataMining Concepts

Module 8: Data Mining




Data Mining........................................................................................ 2
Introduction........................................................................................ 2
Demonstration 1: Browse Data Mining Models..................................... 2
Demonstration 2: Add DM Viewers to a Web Page............................... 5

Data Mining
Introduction
This section provides an overview of the data mining features in SQL Server 2005. The following features are covered in the demonstrations:
· Naïve Bayes Model
· Decision Tree Model
· Association Model
· Data Mining Viewer
Demonstration 1: Browse Data Mining Models
This demonstration introduces you to mining models through the built-in browsers for developers and administrators. You will explore a Naïve Bayes mining model to review attribute characteristics, attribute discrimination, and dependencies. You will also explore a Decision Tree model, an Association model, and compare the predictive capability of multiple models.
#
Task
Notes
1
Browse the Naïve Bayes model of the Targeted Mailing mining structure of the Adventure Works DW database

Before you start, you need to make sure you have the AdventureWorks DW Analysis Services database installed if you did not already deploy this database during an earlier module.:

· In Business Intelligence Development Studio, open the Adventure Works solution at C:\Program Files\Microsoft SQL Server\90\Tools\Samples\1033\awasdb\AdventureWorksAS.slnbi.
· Deploy the project.

In SQL Server Management Studio (SSMS):
Connect to the local Analysis Server.
Open the Mining Structures folder of Adventure Works DW.
Open the Targeted Mailing mining structure.
Each mining structure is a set of mining models that goes against a single data source. You can use the same source as shown in the Targeted Mailing structure, which has four different mining models. By contrast, the Market Basket structure has only one. Think of the mining structure as the problem to solve, with each model as an option to consider for solving the problem.
Right-click TM_Naive_Bayes and click Browse.
Naïve Bayes is the simplest model, so it’s a good one to start with.
2
Review the Attribute Characteristics of the model
The Targeted Mailing “problem” is to predict who will buy a bike. Using a training set of known bike buyers, each record is flagged with a value where 1 is assigned to people who did buy bikes and 0 is assigned to people who did not. Use the Attribute Characteristics to see how attributes compare with the population for each group – bike buyers and non-bike buyers.

§ Select the Attribute Characteristics tab, and choose 1 in the Value list
This shows the most frequent attribute states (in descending order) for bike buyers (Bike Buyer = 1). Note that Number Children at Home = 0 is at the top of the list. It is not necessarily the best predictor, because it is also high for non-bike buyers. This is an example where correlation does not imply causation.
§ Change the Value option to 0 (non-bike buyers).
Notice that Number Children at Home = 0 is still at the top of the list. The population of the sample just doesn’t have many kids.
3
Review the Attribute Discrimination of the model
Attribute Discrimination is a one-dimensional sequence of the importance of different attributes. Naïve Bayes is a fast and simple model to use to compare attributes.

§ Select the Attribute Discrimination tab, with 1 for Value 1 and 0 for Value 2:
This shows which attributes are the best predictors (discriminators) for buying bikes. Not having a car is the highest positive predictor, followed closely age in the mid-30s. Having 2 cars means you’re unlikely to buy a bike. Notice that Number of Children at Home doesn’t show up for a while.
4
Review the Dependency Network
§ Select the Dependency Network tab.

§ In the Dependency Network, click Bike Buyer.

The nodes change colors to show the prediction relationship. This becomes more useful as models become more complex. In this example, the network is very simple. All attributes here are predicting the bike buyer value. Nodes that predict BikeBuyer are indicated by color, which you can identify by the legend.

§ Click Number of Cars Owned.

Now you can see what this attribute predicts. As you can see, because this model is simple, the relationships are easy to see.

§ Gradually drag the Links bar down.
Weaker links drop off the model. The strongest predictors are Age and Number Cars Owned.

5
Browse the Decision Tree model of the Targeted Mailing mining structure of the Adventure Works DW database

Right-click TM_Decision_Tree and click Browse.
Decision Tree examines relationships in a more complex way than Naïve Bayes. It looks more closely at the interplay between attributes.
§ Choose TM_Decision_Tree in the model list.
The decision tree shows all the possible values of the strongest variable—number of cars owned. But for people who own 3 cars, the second most important factor is Income, while for people who own 1 car, the second most important factor is Age. The decision tree allows you to see how different factors interrelate to create discrete groups within the population.
§ In the Background drop-down, choose 1.
This shades the background darker based on the percent of cases that meet the condition—in this case, bike buyers.
§ Hover the mouse over the three Yearly Income boxes for Number Cars Owned = 3.
There are not as many rich people (income > 122,000) in this group, but the ratio of bike buyers is very high.
6
Browse the Association mining structure
Association is very different from the prediction models. This model is checking for which products are purchased together. Sometimes one, two, or three different products are purchased together.
Open the Market Basket structure, right-click Association and click Browse.
§ Start with the Itemsets tab.
This shows a list of “shopping carts” along with how many times each happened. Lots of people just bought a single Sport-100.
§ Change the Minimum Itemset Size to 3.
This model shows only “shopping carts” that had at least three items purchased together.
§ In the Filter Itemset box, type water.
This shows only shopping carts that included something to do with water. In this case, the sales are primarily water bottles.
§ Change to the Rules tab.
Probability just shows how often a combination occurs, but not whether one item is a good predictor of another. For a while, everybody who bought a Betty Crocker book also bought a Harry Potter book, but that doesn’t mean that purchasing a Betty Crocker book was an important predictor for purchasing a Harry Potter book.
Remember that the goal is to find which product purchases predict the others. For example, if someone bought a Road Bottle Cage and a Cycling Cap, then bought a Water Bottle, there is 100% probability that the first two strongly predict the third.
§ Sort in descending order of Importance.
This view shows the combinations that the model judges to be the most use in making predictions. Its purpose is to measure the competition between other products that could predict purchases.
§ Select the Dependency Network, drag the Links bar to the middle, and choose Water Bottle = Existing.
Notice that this dependency network is much more complex than the previous one you viewed. A hydration pack is likely to be accompanied by a water bottle, but just because you bought a water bottle, you won’t necessarily buy a Hydration Pack.
7
Compare the mining models of the Targeted Mailing mining structure
When a mining structure contains more than one mining model, you can compare their predictive ability.
§ Close the mining viewer.
§ Right-click the Targeted Mailing structure, and choose Lift Chart.
§ As the Case table (also known as Input Table), select the Case Table vTargetMail.
This is the same view that was used to generate the models, so the columns should match. A common practice is to divide the data in half and use half to train the model and the other half to test it.
§ In the Predict Value column of the grid, choose 1.
This tests how well the models predict bike buyers, rather than how the models work in general.
§ Select the Lift Chart tab.
The Red line is theoretically perfect model. The Yellow line is pure chance. The lines in between are the four models in the mining structure. The “best” model for this data appears to be Decision Tree, while the “worst” model appears to be Clustering.
Note:
You may receive a permissions error at this point. If so, implement the following workaround:
Right-click on the Roles folder underneath the AdventureWorksAS database and add a new role. Change the membership of the role by adding the ‘Everyone group.
On the Data Source Access page give read/write access to adventure_works_dw_3.0
On the Mining Structure page give read access to Targeted Mailing
Underneath Target Mailing, give read access to each Mining Model in turn.
Please note that this is not a recommended security practice.
Demonstration 2: Add DM Viewers to a Web Page
In this demonstration, you will learn how to add a data mining viewer to a web page.
#
Task
Notes
1
Run the DM Viewer application
In Visual Studio 2005:
Open the DM Viewer solution as a Web Site (File/Open/Web Site) in the folder C:\BID\Finished\08 DM\DM Viewer folder.
Press F5 to run the page.
This is a simple application that you can include in your own application with very little work. Just add the control and set the properties. It’s not beautiful, but it’s functional. You may find that the colored bars do not appear in this version of the control
2
Review the Default.aspx code
Open the Default.aspx page.
View the Design page.
Select the data mining viewer control on the page, and then review properties in the Properties window.
You can see that this application is pretty simple. DM Viewer uses a self-contained control. Just set the database, mining model, server, and add a few attributes.
3
Add a second copy of control to the Default.aspx page
§ Right-click on Toolbox window and select Choose Items.
§ Click the Browse button to add the DLL to your toolbox from C:\BID\Finished\08 DM\DM Viewer.
§ Drag the DMNaiveBayesViewer onto the Default.aspx page.
§ As the database, assign Adventure Works DW.
§ As the model, assign TM_Naive_Bayes.
§ Run the project to see the second copy of the control.

You can learn more about the controls if you first install the Analysis Services samples and then look at the Data Mining samples found in C:\Program Files\Microsoft SQL Server 2005 Samples\Analysis Services\DataMining\Data Mining Web Controls.

No comments: