Real-World Data Mining
Build a real-world data mining app that enables you to examine historical customer data and make predictions about new customers by comparing the two.
by John Charles Olamendy Turruellas
April 23, 2007
Technology Toolbox: SQL Server, SQL Server Business Intelligence Development Studio, Microsoft SQL Server Management Studio 2005
Data mining and visualization is one of the most significant aspects of SQL Server 2005—and among the least well understood. Its goals are easy enough to understand: You want to achieve better business efficiency by seeing your data in new ways. But implementing the technology to take advantage of these features often leaves developers at a loss, as much for conceptual reasons as technical ones. In a recent article, I explained some of the core concepts of data mining, what it's for, how you take advantage of it, and how to implement it in your own applications [Database Design, "Mine Your Data for Business Insights," VSM January 2007]. That article walked you through the general approach of setting up and exploiting data visualization models with Microsoft Analysis Services 2005.
While useful and broadly applicable to many different circumstances, that approach generated several questions from readers who wanted to know how to take a specific, real-world project and apply the principles of data mining and visualization to it. Your wish is my command.
Let's analyze a business scenario. Assume you have a retailer that wants to increase its operations and needs to analyze the historical data of its customers' shopping behavior to enable better service. For example, the retailer might want to determine the preferred shipping methods for a given set of conditions, and see if it can glean insights into serving the bulk of its customers better. Specifically, the retailer might look at its customer buying patterns in the context of which cities its customers live in, as well as the customer's title. By analyzing this data, the retailer hopes to determine better ways of interacting with all its customers, as well as to discover and recommend what are likely the most efficient ways to handle distributing its products and services to its customers.
This business problem centers on clustering data mining tasks (also known as segmentation). The idea is to create natural groupings of cases based on a set of attributes in order to group entities with more or less similar attributes. (see Figure 1). The algorithm that you'll use to solve this task is called Microsoft Clustering.
Before I walk you through the specific steps required to create the server and client applications, let's take a couple minutes to go over the recent history of data mining and where and how it is used. This context is important for getting the most out of the solution I'll describe in this article.
Data mining requires three basic steps (see Figure 2). The first step is to define a mining model, such as input attributes, predictable attributes, and the associated algorithms. The next step is to train (process) the models and store the discovered patterns. Finally, you need to be able to analyze data for future predictions and assigning prediction results to new data.
Real-World Requirements for DM
OLE DB for Data Mining (OLE DB for DM) specifies a common interface for accessing DM engines and is supported by several vendors. Its goals are to provide an industry standard for data mining and comprises of a set of specifications, several predictive functions, and a set of schema rowsets (metadata) for discovering mining models and their underlying structures. This technology relies on the Data Mining eXtension (DMX) query language which you can use for model creation, training, and prediction.
To use OLE DB for DM, you create a connection object to access the mining engines, define and execute a prediction query using a command object, and get the model results represented by a result set. Several data mining vendors have developed their own OLE DB for DM providers; you can access these products through OLE DB, ADO, and ADO.NET.
Back to top
|