· 7 years ago · Jan 24, 2019, 11:24 PM
1M.Sc. I.T. Part I
2
3Semester I
4
5Data Mining
6
7PRACTICAL
82018 – 2019
9
10 Decision Tree
11 Naïve Bayesian Algorithm
12 Clustering Algorithm
13 Time Series Model
14 Association Algorithm
15 Text Mining Technique
16 Social Network Analysis
17 Apriori Algorithm
18
19Practical No:1
20
21Aim:-Prepare the analysis services for Adventure Work cycles or (any other database). Built the data mining model structure and built the decision tree with proper decision nodes and infer atleast five different types of reports.
22Implementation:-
23Creating an Analysis Services Project
24-Each Microsoft SQL Server Analysis Services project defines the schema for the objects in a single Analysis Services database.
25-To create an Analysis Services project
261. Open SQL Server Data Tools (SSDT).
272. On the File menu, point to New, and then select Project.
283. Verify that Business Intelligence Projects is selected in the Project types pane.
294. In the Templates pane, select Analysis Services Multidimensional and DataMining Project.
305. In the Name box, name the new project DECISION TREE.
316.Click .
32
33
34Creating a Data Source
35-A data source is a data connection that is saved and managed in your project and deployed to your Microsoft SQL Server Analysis Services database.
36-To create a data source
371. In Solution Explorer, right-click the Data Sources folder and select New DataSource.
382. On the Welcome to the Data Source Wizard page, click Next.
393. On the Select how to define the connection page, click New to add aconnection to thedatabase.
404. In the Provider list in Connection Manager, select Native OLE DB\SQL ServerNative Client 11.0.
415. In the Server name box, type or select the name of the server on which you installedFor example, type localhost if the database is hosted on the local server.
426. In the Log onto the server group, select Use Windows Authentication.
43
44
45Creating a Data Source View
46-To create a data source view
471. In Solution Explorer, right-click Data Source Views, and select New DataSource View.
482. On the Welcome to the Data Source View Wizard page, click Next.
493. On the Select a Data Source page, under Relational data sources, select theAdventure Works DW 2012 data source that you created in the last task. Click Next.
504. On the Select Tables and Views page, select the following objects, and thenclick the right arrow to include them in the new data source view:
51• ProspectiveBuyer (dbo) - table of prospective bike buyers
52• vTargetMail (dbo) - view of historical data about past bike buyers
535. Click Next.
546. On the Completing the Wizard page, by default the data source view is namedAdventure Works DW 2012. Change the name to Targeted Mailing, and then click Finish.
55The new data source view opens in the Targeted Mailing.dsv [Design] tab.
56
57
58Building a Targeted Mailing Structure
59-To create a mining structure for the targeted mailing scenario
601. In Solution Explorer, right-click Mining Structures and select New MiningStructure to start the Data Mining Wizard.
612. On the Welcome to the Data Mining Wizard page, click Next.
623. On the Select the Definition Method page, verify that From existing relational database or data warehouse is selected, and then click Next.
634. On the Create the Data Mining Structure page, under Which data mining technique do you want to use?, select Microsoft Decision Trees.
645. Click Next.
656. On the Select Data Source View page, in the Available data source views pane,select Targeted Mailing. You can click Browse to view the tables in the data source view and then click Close to return to the wizard.
667. Click Next.
678. On the Specify Table Types page, select the check box in the Case column for vTargetMail to use it as the case table, and then click Next. You will use the ProspectiveBuyer table later for testing; ignore it for now.
689. On the Specify the Training Data page, you will identify at least one predictable column, one key column, and one input column for your model. Select the check box in the Predictable column in the BikeBuyer row.
6910. Verify that the check box in the Key column is selected in the CustomerKey row.
7011. Select the check boxes in the Input column in the following rows. You can check multiple columns by highlighting a range of cells and pressing CTRL whileselecting a check box.
71• Age
72• CommuteDistance
73• EnglishEducation
74• EnglishOccupation
75• Gender
76• GeographyKey
77• HouseOwnerFlag
78• MaritalStatus
79• NumberCarsOwned
80• NumberChildrenAtHome
81• Region
82• TotalChildren
83• YearlyIncome
84
8512. On the far left column of the page, select the check boxes in the following rows.
86• AddressLine1
87• AddressLine2
88• DateFirstPurchase
89• EmailAddress
90• FirstName
91• LastName
92Ensure that these rows have checks only in the left column.
9313. Click Next.
94
95
96Specifying the Data Type and Content Type
971. On the Specify Columns' Content and Data Type page, click Detect to run an algorithm that determines the default data and content types for each column.
982. Review the entries in the Content Type and Data Type columns and change them if necessary
99
100
101Specifying a Testing Data Set for the Structure
102-To specify the testing set
1031. On the Create Testing Set page, for Percentage of data for testing, leave the default value of 30.
1042. For Maximum number of cases in testing data set, type 1000.
1053. Click Next.
106
107-To name the model and structure and specify drillthrough
1081. On the Completing the Wizard page, in Mining structure name, type Targeted Mailing.
1092. In Mining model name, type TM_Decision_Tree.
1103. Select the Allow drill through check box.
111
112
113Deploying and Processing the Models
114-To deploy the project and process all the mining models
1151. In the Mining Model menu, select Process Mining Structure and All Models.
116If you made changes to the structure, you will be prompted to build and deploy the project before processing the models. Click Yes.
1172. Click Run in the Processing Mining Structure - Targeted Mailing dialog box. The Process Progress dialog box opens to display the details of model
118processing. Model processing might take some time, depending on your computer.
1193. Click Close in the Process Progress dialog box after the models have completed processing.
1204. Click Close in the Processing Mining Structure - <structure> dialog box.
121
122Exploring the Decision Tree Model
123-To explore the model in the Decision Tree tab
1241. Select the Mining Model Viewer tab in Data Mining Designer.
125By default, the designer opens to the first model that was added to the structure
126-- in this case, TM_Decision_Tree.
1272. Use the magnifying glass buttons to adjust the size of the tree display. By default, the Microsoft Tree Viewer shows only the first three levels of the tree. If the tree contains fewer than three levels, the viewer shows only the existing levels. You can view more levels by using the Show Level slider or the Default Expansion list.
1283. Slide Show Level to the fourth bar.
129
130-Dependency Network Tab
131
132-To drill through to case data
1331. Right-click a node, and select Drill Through then Model Columns Only. The details for each training case are displayed in spreadsheet format. These details come from the vTargetMail view that you selected as the case table when building the mining structure.
1342. Right-click a node, and select Drill Through then Model and Structure Columns. The same spreadsheet displays with the structure columns appended to the end.
135
136
137Conclusion: Using analysis services, we have thus implemented the decision tree algorithm successfully.
138
139Practical No: 2
140
141Aim: Prepare the analysis services for the Adventure Works cycles or (any other database). Build the data mining model structure and implement the Naïve Bayesian algorithm.
142Steps to create a Naive Bayes mining model:
143
1441. In the Mining Models tab of Data Mining Designer, right-click the Structurecolumn, and select New Mining Model.
1452. In the New Mining Model dialog box, under Model name, type
146TM_NaiveBayes.
1473. In Algorithm name, select Microsoft Naive Bayes, then click OK.
148A message appears stating that the Microsoft Naive Bayes algorithm does notsupport the Age and Yearly Income columns, which are continuous.
1494. Click Yes to acknowledge the message and continue.
150
151A new model appears in the Mining Models tab of Data Mining Designer.
152
153First , we need to select a data source ,thus right click on Data Source and select New Data Source
154It will open a Data Source Wizard ,click on Next
155
156
157In the next window click on Next button and in the next window select first radio button and then type user name and password and click on Next
158
159
160In the last window click on Finish button to end data source selection activity
161
162
163Now in solution explorer right click on the Data Source Views, select New Data Source View
164
165
166In the Data Source View Wizard.Verify that the proper relational data source is selected and proceed ahead by clicking on the Next button.
167
168
169 Select appropriate Tables and Views from the relational database that are to be included in the data source view. Click on Next .
170
171
172 In the Completing the Wizard window, verify the data source name and click on Finish
173
174
175Conclusion: Using analysis services, we have thus implemented the naïve bayesian algorithm successfully.
176
177Practical No: 3
178
179Aim: Prepare the analysis services for the Adventure Works cycles or (any other database). Build the data mining model structure and implement the clustering algorithm.
180To create a data source:
181
182In Solution Explorer, right-click the Data Sources folder and select New Data Source.
183On the Welcome to the Data Source Wizard page, click Next.
184On the Select how to define the connection page, click New to add a
185 connection to the database.
186In the Provider list in Connection Manager, select Native OLE DB\SQL Server Native Client 11.0.
187In the Server name box, type or select the name of the server on which you
188installed.
189In the Log onto the server group, select Use Windows Authentication.
190In the Select or enter a database name list, select and then click OK.
191Click Next.
192On the Impersonation Information page, click Use the service account, and then click Next.
193On the Completing the Wizard page, notice that, by default, the data source is named Adventure Works DW 2012.
194Click Finish.
195
196The new data source, Adventure Works DW 2012, appears in the Data Sources
197folder in Solution Explorer
198
199Creating a Data Source View
200
201In Solution Explorer, right-click Data Source Views, and select New Data
202Source View.
203On the Welcome to the Data Source View Wizard page, click Next.
204 On the Select a Data Source page, under Relational data sources, select the Adventure Works DW 2012 data source that you created in the last task. ClickNext.
205On the Select Tables and Views page, select the following objects, and then
206Click the right arrow to include them in the new data source view:
207• ProspectiveBuyer (dbo) - table of prospective bike buyers
208• vTargetMail (dbo) - view of historical data about past bike buyers
209Click Next.
210On the Completing the Wizard page, by default the data source view is namedAdventure Works DW 2012. Change the name to Targeted Mailing, and then click Finish.
211
212The new data source view opens in the Targeted Mailing.dsv [Design] tab.
213
214To create a clustering mining model:
215
216Switch to the Mining Models tab in Data Mining Designer in SQL Server DataTools (SSDT).
217Notice that the designer displays two columns, one for the mining structure andone for the TM_Decision_Tree mining model, which you created in the previouslesson.
218Right-click the Structure column and select New Mining Model.
219In the New Mining Model dialog box, in Model name, type TM_Clustering.
220In Algorithm name, select Microsoft Clustering.
221Click .
222
223
224Conclusion: Using analysis services, we have thus implemented the clustering algorithm successfully.
225
226 Practical No: 4
227
228Aim: Prepare the analysis services for Adventure Work cycles or (any other database). Built the basic time series model structure and create the predictions.
229Adding a Data Source View for Forecasting (Intermediate Data Mining Tutorial)
230To add a data source view:
231In Solution Explorer, right-click Data Source Views, and then select New Data Source View.
232
233
234On the Welcome to the Data Source View Wizard page, click Next.
235
236
237On the Select a Data Source page, under Relational data sources, select the data source. Click Next.
238
239
240On the Select Tables and Views page, select the table, vTimeSeries (dbo), and then click the right arrow to add it to the data source view.
241
2425. Click Next.
2436. On the Completing the Wizard page, by default the data source view is named Adventure Works DW Multidimensional 2012 . Change the name to
244SalesByRegion, and then click Finish.
245
246
247Data Source View Designer opens and the SalesByRegion data source view
248appears.
249
250
251To set the key in the data source view:
2521. In the pane SalesByRegion.dsv, select the vTimeSeries table.
2532. Right-click the column, Reporting Date, and select Set Logical Primary Key.
254
255
256Creating a Forecasting Mining Model Structure (Data Mining Tutorial)
257
258To create a forecasting mining structure:
259
260In Solution Explorer in SQL Server Data Tools (SSDT), right-click Mining Structures and select New Mining Structure.
261
2622. On the Welcome to the Data Mining Wizard page, click Next.
2633. On the Select the Definition Method page, verify that From existing relational database or data warehouse is selected, and then click Next.
264
265
266On the Create the Data Mining Structure page, under Which data miningtechnique do you want to use?, select Microsoft Time Series, and then click Next.
267
268On the Select Data Source View page, under Available data source views,selectSalesByRegion.
269
270Click Next.
271
2727. On the Specify Table Types page, ensure that the check box in the Case column for the vTimeSeries table is selected, and then click Next.
273
274On the Specify the Training Data page, select the check boxes in the Key column for the ModelRegion and ReportingDate columns.
275
276ReportingDate should be selected by default, because you specified this column as the logical primary key when you created the data source view. By adding ModelRegion as a second key, you are telling the algorithm to create a separate time series for each combination of model and region listed in this field.
277
278Select the check boxes in the Input and Predictable columns for the Quantity, column, and then click Next.
279
280By selecting Predictable, you indicate that you want to create forecasts on the data in this column. However, because you want to base the forecasts on past data, you must also add the column as an input.
281
28210. On the page Specify Columns' Content and Data Type, review the selections.
283The ModelRegion column is designated as a Key column and the ReportingDate column is automatically designated as a Key Time column. You can have only one of each type of key.
284
285Click Next.
286
28712. On the Completing the Wizard page, for Mining structure name, type
288Forecasting.
289
29013. In Mining model name, type Forecasting, and then click Finish.
291Data Mining Designer opens to display the Forecasting mining structure that
292you just created.
293
294
295Modifying the Forecasting Structure (Data Mining Tutorial)
296To add the Amount column to the Forecasting mining structure:
2971. On the Mining Structure tab of Data Mining Designer, in the Data Source View pane, select the Amount column in the vTimeSeries table.
2982. Drag the Amount column from the Data Source View pane into the list of
299columns for the Forecasting structure.
300The Amount column is now included in the Forecasting mining structure.
301
302
303To define how the Amount column will be used:
3041. In the Forecasting column of the grid on the Mining Models tab, click the cell in the Amount row.
3052. Select Predict from the list.
306The Amount column is now both an input column and a predictable column.
307
308
309Customizing and Processing the Forecasting Model (Intermediate Data Mining Tutorial)
310
311To change the algorithm parameters:
312On the Mining Models tab, right-click Forecasting, and select Set Algorithm Parameters.
313
3142. In the PERIODICITY_HINT row of the Algorithm Parameters dialog box, click the Value column, then type {12}, including the braces.
315By default, the algorithm will also add the value {1}.
3163. In the FORECAST_METHOD row, verify that the Value text box is either blank or set to MIXED. If a different value has been entered, type MIXED to change the parameter back to the default value.
3174. In the PREDICTION_SMOOTHING row, verify that the Value text box is either blank or set to 0.5. If a different value has been entered, click Value and type 0.5 to change the parameter back to the default value.
318To specify that gaps be filled by averaging values:
3191. On the Mining Models tab, right-click the Forecasting column, and select Set Algorithm Parameters.
3202. In the Algorithm Parameters dialog box, in the MISSING_VALUE_SUBSTITUTION row, click the Value column, and type Mean.
321
322
323To process the forecasting model:
324On the Mining Model menu of SQL Server Data Tools, select Process Mining Structure and All Models.
325
3262. At the warning asking whether you want to build and deploy the project, click Yes.
3273. In the Process Mining Structure - Forecasting dialog box, click Run.
328
329The Process Progress dialog box opens to display information about model
330processing. Model processing may take some time.
3314. After processing is complete, click Close to exit the Process Progress dialog box.
3325. Click Close again to exit the Process Mining Structure - Forecasting dialog box.
333
334
335 Exploring the Forecasting Model (Data Mining Tutorial)
336
337To select a time series:
3381. Click the Charts tab of the Mining Model Viewer tab, if it is not visible.
3392. Click the drop-down list to the right of the chart view, and select all the checkboxes.
340Click .
341The chart should now contain 24 different series lines.
342
3433. In the check boxes to the right of the chart, clear the boxes to temporarily hide the lines for all series that are based on Amount.
344Now, clear the check boxes related to the R750 and R250 bicycles.
345
346
347To change prediction and display options in the Chart view:
348Try changing the value for Prediction Steps gradually, increasing it from 5 to 10,and then back to 6.
349
350When the historical data has large fluctuations, the fluctuations tend to be
351repeated or even amplified as you increase the number of predictions. You
352probably need to do some research at this point, to understand the cause of the big increase in the historical data and then decide whether to accept these
353results, seek some kind of correction in the source data, or apply some kind of smoothing in the model.
3542. Select the Show Deviations check box.
355This option displays the estimated error for each predicted value.
356
3573. Note the scale of the X-axis. The changes over both historical and predicted data are always expressed as a percentage, but the actual values are adjusted automatically to fit all values onto the graph. Therefore you need to be careful when comparing models to not rely on visuals alone. To get the exact value, or the percentage increase and value for predictions, pause the mouse over the dotted line or solid lines, or click the lines to view the values in the Mining Legend.
358
359To explore the decision tree for a time series model:
3601. In the Tree list on the Model tab of the viewer, select the T1000 Europe:
361Amount series.
362
363Click the node labeled All.
364For an All node, the ToolTip that appears includes information such as, the
365number of cases in the entire series, and time series equations derived from
366analysis of the data.
367
3682. If the Mining Legend is not visible, right-click the node and select Show Legend.
369The Mining Legend provides much the same information that is in the Tooltip. If any of your independent variables are discrete, you will also see a histogram that shows the distribution of variables in the node.
3703. Now select a different time series to view. Using the Tree list on the Model tab of the viewer, select the M200 North America: Amount series.
371The tree graph now contains anAll node and two child nodes. By looking at the labels on the child nodes, you can understand at what point the trend line
372changed.
373
374For each child node, the description in the Mining Legend also includes the
375count of cases in each branch of the tree.
376The following list describes some additional features in the tree viewer:
377• You can change the variable that is represented in the chart by using the
378Background control. By default, nodes that are darker contain more cases, because the value of Background is set to Population. To see just how many cases there are in a node, pause the mouse over a node and view the ToolTip that appears, or click the node and view the numbers in the Node Legend window.
379
380To view details for a particular data series in the generic content viewer:
381In the Mining Model Viewer tab, select Microsoft Generic Content Tree Viewer from the Viewer drop-down list.
382
3832. In the Node Caption pane, click the topmost (All) node.
3843. In the Node Details pane, view the value for ATTRIBUTE_NAME.
385This value shows you which series, or combination of product and region, is
386contained in this node. In the AdventureWorks example, the topmost node is for the M200 Europe series.
387
3884. In the Node Caption pane, locate the first node that has child nodes.
389If a series node has children, the tree view that appears on the Model tab of the Microsoft Time Series Viewer will also have a branching structure.
3905. Expand the node and click one of the child nodes.
391The NODE_DESCRIPTION column of the schema contains the condition that caused the tree to split.
3926. In the Node Caption pane, click the topmost ARIMA node, and expand the node until all child nodes are visible.
3937. In the Node Details pane, view the value for ATTRIBUTE_NAME.
394This value tells you which time series is contained in this node. The topmost node in the ARIMA section should match the topmost node in the (All) section. In the AdventureWorks example, this node contains the ARIMA analysis for the series, M200 Europe.
395
396
397Creating Time Series Predictions (Intermediate Data Mining Tutorial)
398
399To select a model and input table:
4001. On the Mining Model Prediction tab of the Data Mining Designer, in the
401Mining Model box, click Select Model.
402In the Select Mining Model dialog box, expand the Forecasting structure, select the Forecasting model from the list, and then click OK.
403
4043. Ignore the Select Input Table(s) box.
4054. In the Source column, in the grid on the Mining Model Prediction tab, click the cell in the first empty row, and then select Forecasting mining model.
4065. In the Field column, select Model Region.
407This action adds the series identifier to the prediction query to indicate the
408combination of model and region to which the prediction applies.
4096. Click the next empty row in the Source column, and then select Prediction Function.
4107. In the Field column, select PredictTimeSeries.
4118. In the Mining Model pane, select the mining model column, Amount. Drag Amount to the Criteria/Arguments box for the PredictTimeSeries function that you added earlier.
4129. Click the Criteria/Arguments box, and type a comma, followed by 5, after the field name.
413The text in the Criteria/Arguments box should now display the following:
414[Forecasting].[Amount],5
41510. In the Alias column, type PredictAmount.
41611. Click the next empty row in the Source column, and then select Prediction Function again.
41712. In the Field column, select PredictTimeSeries.
41813. In the Mining Model pane, select the column Quantity, and then drag it into the Criteria/Arguments box for the second PredictTimeSeries function.
41914. Click the Criteria/Arguments box, and type a comma, followed by 5, after the field name.
420The text in the Criteria/Arguments box should now display the following:
421[Forecasting].[ Quantity],5
42215. In the Alias column, type PredictQuantity.
42316. Click Switch to query result view.
424
425The results of the query are displayed in tabular format.
426
427To flatten the results in the Results paneâ€
428In the Prediction Query Builder, click Switch to query design view.
429The view changes to allow manual editing of the DMX query text.
430
431 Type the FLATTENED keyword after the SELECT keyword. The complete query text should be as follows:
432SELECT FLATTENED
433[Forecasting].[Model Region],
434(PredictTimeSeries([Forecasting].[Amount],5)) as
435[PredictAmount],
436(PredictTimeSeries([Forecasting].[Quantity],5)) as
437[PredictQuantity]
438FROM
439[Forecasting]
4403. Optionally, you can type a clause to restrict the results, such as the following example:
441SELECT FLATTENED
442[Forecasting].[Model Region],
443(PredictTimeSeries([Forecasting].[Amount],5)) as
444 [PredictAmount],
445(PredictTimeSeries([Forecasting].[Quantity],5)) as
446[PredictQuantity]
447FROM
448[Forecasting]
449WHERE [Forecasting].[Model Region] = 'M200 North America'
450OR [Forecasting].[Model Region] = 'M200 Europe'
451Click Switch to query result view.
452
453To export prediction query results:
454Click Save query results.
455
456 In the Save Data Mining Query Result dialog box, for Data Source, select You can also create a data source if you want to save the data to a different relational database.
457In the Table Name column, type a new temporary table name, such as Test Predictions.
458Click Save.
459
460Conclusion: Using analysis services, we have thus implemented the time series model successfully.
461
462Practical No:5
463
464Aim: Prepare the analysis services for Adventure Work cycles or (any other database). Built the basic data mining model and show the implementation of association algorithm and also apply the DMX queries.
465
466Adding a Data Source View with Nested Tables (Intermediate Data Mining Tutorial)
467
468To add a data source view:
469In Solution Explorer, right-click Data Source Views, and then select New Data Source View.
470
471The Data Source View Wizard opens.
4722. On the Welcome to the Data Source View Wizard page, click Next.
4733. On the Select a Data Source page, under Relational data sources, select the Adventure Works DW Multidimensional 2012 data source that you created in the Basic Data Mining Tutorial. Click Next.
474
4754. On the Select Tables and Views page, select the following tables, and then click the right arrow to include them in the new data source view:
476• vAssocSeqOrders
477• vAssocSeqLineItems
478Click Next.
479
4806. On the Completing the Wizard page, by default the data source view is named Adventure Works DW Multidimensional 2012 . Change the name to Orders, and then click Finish.
481Data Source View Designer opens and the Orders data source view appears.
482
483
484
485
486To create a relationship between tables:
4871. In Data Source View Designer, position the two tables so that the tables are aligned horizontally, with the vAssocSeqLineItems table on the left side and the vAssocSeqOrders table on the right side.
4882. Select the OrderNumber column in the vAssocSeqLineItems table.
4893. Drag the column to the vAssocSeqOrders table, and put it on the OrderNumber column.
490
491
492
493
494
495Creating a Market Basket Structure and Model (Intermediate Data Mining Tutorial)
496
497To create an association mining structure:
4981. In Solution Explorer in SQL Server Data Tools (SSDT), right-click Mining Structures and select New Mining Structure to open the Data Mining Wizard.
4992. On the Welcome to the Data Mining Wizard page, click Next.
5003. On the Select the Definition Method page, verify that From existing relational database or data warehouse is selected, and then click Next.
5014. On the Create the Data Mining Structure page, under Which data mining technique do you want to use?, select Microsoft Association Rules from the list, and then click Next. The Select Data Source View page appears.
502
5035. Select Orders under Available data source views, and then click Next.
5046. On the Specify Table Types page, in the row for the vAssocSeqLineItems table, select the Nested check box, and in the row for the nested table vAssocSeqOrders, select the Case check box. Click Next.
505
5067. On the Specify the Training Data page, clear any boxes that might be checked.Set the key for the case table, vAssocSeqOrders, by selecting the Key check box next to OrderNumber.
507Because the purpose of the market basket analysis is to determine which products are included in a single transaction, you do not have to use the CustomerKey field.
5088. Set the key for the nested table, vAssocSeqLineItems, by selecting the Key checkbox next to Model. The Input check box is also automatically selected when you do this. Select the Predictable check box for Model as well.
509In a market basket model, you do not care about the sequence of products in the shopping basket, and therefore you should not include LineNumber as a key for the nested table. You would use LineNumber as a key only in a model where the sequence is important. You will create a model that uses the Microsoft SequenceClustering algorithm in Lesson 4.
5109. Select the check box to the left of IncomeGroup and Region,but do not make any other selections. Checking the leftmost column adds the columns to the structure for later reference, but the columns will not be used in the model. Your selections should look like the following:
511
512Click Next.
51311. On the Specify Columns' Content and Data Type page, review the selections, which should be as shown in the following table, and then click Next.
51412. On the Create testing set page, the default value for the option Percentage of data for testing is 30 percent. Change this to 0. Click Next.
515
51613. On the Completing the Wizard page, in Mining structure name, type Association.
51714. In Mining model name, type Association.
51815. Select the option Allow drill through, and then click Finish.
519
520 Data Mining Designer opens to display the Association mining structure that you just created.
521
522Modifying the Market Basket Model (Intermediate Data Mining Tutorial)
523To adjust the parameters of the Association model:
5241. Open the Mining Models tab of Data Mining Designer.
5252. Right-click the Association column in the grid in the designer and select Set Algorithm Parameters to open the Algorithm Parameters dialog box.
526
5273. In the Value column of the Algorithm Parameters dialog box, set the following parameters:
528MINIMUM_PROBABILITY = 0.1
529MINIMUM_SUPPORT = 0.01
530
5314.Click .
532
533To process the mining model:
534On the Mining Model menu of SQL Server Data Tools (SSDT), select Process Mining Structure and All Models.
535
5362. At the warning asking whether you want to build and deploy the project, click Yes.
537The Process Mining Structure - Association dialog box opens.
538
539Click Run.
540The Process Progress dialog box opens to display information about model processing. Processing of the new structure and model might take some time.
541
5424. After processing is complete, click Close to exit the Process Progress dialog box.
5435. Click Close again to exit the Process Mining Structure - Association dialog box.
544
545Exploring the Market Basket Models (Intermediate Data Mining Tutorial)
546
547To open the Association mode in the Microsoft Assocaition Rules Viewer:
548In Solution Explorer, double-click the Association structure.
549
550In Data Mining Designer, click the Mining Model Viewer tab.
551
5523. Select Association from the list of mining models in the Mining Model dropdown list.
553
554To navigate the dependency graph and locate specific nodes:
555In the Mining Model Viewer tab, click the Dependency Network tab.
556
5572. Click Zoom In several times, until you can easily view the labels for each node.
558By default, the graph displays with all nodes visible. In a complex model, there may be many nodes, making each node quite small.
559Click the + sign in the lower right-hand corner of the viewer and hold down the mouse button to pan around the graph.
560
5614. On the left side of the viewer, drag the slider down, moving it from All Links (the default) to the bottom of the slider control.
5625. The viewer updates the graph to now show only the strongest association,between the Touring Tire and Touring Tire Tube items.
5636. Click the node labeled Touring Tire Tube = Existing.
564The graph is updated to highlight only items that are strongly related to this item. Note the direction of the arrow between the two items.
565
5667. On the left side of the viewer, drag the slider up again, moving it from the bottom to around the middle.
567Note the changes in the arrow that connects the two items.
5688. Select Show attribute name only from the dropdown list at the top of the Dependency Network pane.
569The text labels in the graph are updated to show only the model name.
570
571
572To view details for an itemset:
573In the list of itemsets, click the Itemset column heading to sort by name.
574
575Locate the item, Touring Tire (with no second item).
576
577Right-click the item, Touring Tire, select Drill Through, and then select Model and Structure Columns.
578
579
580The Drill Through dialog box displays the individual transactions used as support for this itemset.
581Expand the nested table, vAssocSeqLineItems, to view the actual list of purchases in the transaction.
582
583
584To filter itemsets by support or size:
5851. Clear any text that might be in the Filter Itemset box. You cannot use a text filter together with a numeric filter.
5862. In the Minimum support box, type 100, and then click the background of the viewer.
587The list of itemsets is updated to show only itemsets with support of at least 100.
588
589To see only rules that include the Mountain-200 bicycle:
590In the Mining Model Viewer tab, click the Rules tab.
591
5922. In the Filter Rule box, enter Mountain-200.
593Clear the Show long name check box.
5943. From the Show list, select Show attribute name only.
595The viewer will then display only the rules that contain the words "Mountain-200". The probability of the rule tells you how likely it is that when someone buys a Mountain-200 bicycle, that person will also buy the other listed product.
596
597
598To view cases that support a particular rule
5991. In the Rules tab, right-click the rule that you want to view.
6002. Select Drill Through, and then select Model Columns Only, or Model and Structure Columns.
601The Drill Through dialog box provides a summary of the rule at the top of the pane, and a list of all cases that were used as supporting data for the rule.
602
603
604
605
606Filtering a Nested Table in a Mining Model (Intermediate Data Mining Tutorial)
607
608To create and modify a copy of the Association model:
609In the Mining Models tab of SQL Server Data Tools (SSDT), right-click the Association model, and select New Mining Model.
610
611For Model Name, type Association Filtered. For Algorithm Name, select Microsoft Association Rules. Click OK.
612
613In the column for the Association Filtered model, click the IncomeGroup row and change the value from Ignore to Input.
614
615
616To add a filter to a mining model:
6171. In the Mining Models tab, right-click the model, and select
618Set Model Filter.
619
620In Model Filter dialog box, click the top row in the grid, in the Mining Structure Column text box.
621
6223. In the Mining Structure Column text box, select IncomeGroup.
623The icon at the left side of the text box changes to indicate that the selected item is a column.
6244. Click the Operator text box and select the = operator from the list.
6255. Click the Value text box, and type High in the box.
6266. Click the next row in the grid.
6277. Click the AND/OR text box in the next row of the grid and select OR.
6288. In the Mining Structure Column text box, select IncomeGroup. In the Value textbox, type Moderate.
629The filter condition that you created is automatically added to the Expression text box, and should appears as follows:
630 [IncomeGroup] = 'High' OR [IncomeGroup] = 'Moderate'
6319. Click the next row in the grid, leaving the operator as the default, AND.
63210. For Operator, leave the default value, Contains. Click the Value text box.
633
63411. In the Filter dialog box, in the first row under Mining Structure Column, select Model.
63512. For Operator, select IS NOT NULL. Leave the Value text box blank. Click OK.
636
637The filter condition in the Expression text box of the Model Filter dialog box is automatically updated to include the new condition on the nested table. The completed expression is as follows:
638 [IncomeGroup] = 'High' OR [IncomeGroup] = 'Moderate' AND EXISTS
639SELECT * FROM [vAssocSeqLineItems] WHERE [Model] <> NULL).
640Click .
641
642
643To enable drillthrough and to process the filtered model:
644In the Mining Models tab, right-click the Association Filtered model, and select Properties.
645
646Change the AllowDrillThrough property to True.
647
648Right-click the Association Filtered mining model, and select Process Model.
649
6504. Click Yes in the error message to deploy the new model to the Analysis Services database.
6515. In the Process Mining Structure dialog box, click Run.
6526. When processing is complete click Close to exit the Process Progress dialog box, and click Close again to exit the Process Mining Structure dialog box.
653
654
655
656Creating Recommendations and Predicting Associations:
657To create a prediction query on an association model
658Click the Mining Model Prediction tab of Data Mining Designer.
659
6602. In the Mining Model pane, click Select Model.
6613. In the Select Mining Model dialog box, expand the node that represents the mining structure Association, and select the model Association. Click OK.
662For now, you can ignore the input pane.
6634. In the grid, click the empty cell under Source and select Prediction Function. In the cell under Field, select PredictAssociation.
664You can also use the Predict function to predict associations. If you do, be sure to choose the version of the Predict function that takes a table column as argument.
6655. In the Mining Model pane, select the nested table vAssocSeqLineItems, and drag it into the grid, to the Criteria/Argument box for the PredictAssociation
666function.
667Dragging and dropping table and column names lets you build complex statements without syntax errors. However, it replaces the current contents of the cell, which include other optional arguments for the PredictAssociation function.
668To view the other arguments, you can temporarily add a second instance of the function to the grid for reference.
6696. Click the Criteria/Argument box and type the following text after the table name: ,3
670The complete text in the Criteria/Argument box should be as follows:
671[Association].[v Assoc Seq Line Items],3
672
673Click the Results button in the upper corner of the Prediction Query Builder.
674
675
676To create a singleton prediction query with nested table inputs:
677Click the Design button in the corner of the Prediction Query Builder to switch back to the query building grid.
678On the Mining Model menu, select Singleton Query.
679
680
681
682In the Mining Model dialog box, select the Association model.
683
6844. In the grid, click the empty cell under Source and select Prediction Function. In the cell under Field, select PredictAssociation.
6855. In the Mining Model pane, select the nested table vAssocSeqLineItems, and drag it into the grid, to the Criteria/Argument box for the PredictAssociation
686function. Type ,3 after the nested table name just as in the previous procedure.
6876.In the Singleton Query Input dialog box, click the Value box next to vAssoc SeqLine Items, and then click the (...) button.
688
689In the Nested Table Input dialog box, select Touring Tire in the Key column pane, and then click Add.
690
691
692
693Click the Results button.
694
695
696To add inputs and probabilities to the singleton prediction query:
6971. Click the Design button in the corner of the Prediction Query Builder to switch back to the query building grid.
6982. In the Singleton Query Input dialog box, click the Value box next to vAssoc SeqLine Items, and then click the (...) button.
6993. In the Key column pane, select Touring Tire, and then click Add.
7004. In the grid, click the empty cell under Source and select Prediction Function. In the cell under Field, select PredictAssociation.
7015. In the Mining Model pane, select the nested table vAssocSeqLineItems, and drag it into the grid, to the Criteria/Argument box for the PredictAssociation function. Type ,3 after the nested table name just as in the previous procedure.
7026. In the Nested Table Input dialog box, select Touring Tire Tube in the Key column pane, and then click Add.
703
7047. In the grid, in the row for the PredictAssociation function, click the Criteria/Argument box, and change the arguments to add the argument, INCLUDE_STATISTICS.
705The complete text in the Criteria/Argument box should be as follows:
706[Association].[v Assoc Seq Line Items], INCLUDE_STATISTICS, 3
707
708Click the Results button.
709
710To flatten nested rowsets in a prediction query:
711Click the SQL button in the corner of the Prediction Query Builder.
712
713The grid changes to an open pane where you can view and modify the DMX
714statement that was created by the Prediction Query Builder.
7152. After the SELECT keyword, type FLATTENED.
716The complete text of the query should be as follows:
717SELECT FLATTENED
718PredictAssociation([Association].[v Assoc Seq Line
719Items],INCLUDE_STATISTICS,3)
720FROM
721[Association]
722NATURAL PREDICTION JOIN
723(SELECT (SELECT 'Touring Tire' AS [Model]
724UNION SELECT 'Touring Tire Tube' AS [Model]) AS [v Assoc
725Seq Line Items]) AS t
726Click the Results button in the upper corner of the Prediction Query Builder.
727
728
729To change the input method for the prediction query:
7301. In the Mining Model menu, select Singleton Query again, to clear the check mark.
7312. An error message appears warning that your singleton query will be lost. Click Yes.
732The name of the input dialog box changes to Select Input Table(s).
733
734To create a prediction query using nested table inputs:
7351. In the Mining Model pane, select the Association Filtered model.
7362. In the Select Input Table(s) dialog box, click Select Case Table.
737
7383. In the Select Table dialog box, for Data Source, select AdventureWorksDW2008.
739In the Table/View Name list, select vAssocSeqOrders, and then click OK.
740
741The table vAssocSeqOrders is added to the pane.
742
7434. In the Select Input Table(s) dialog box, click Select Nested Table.
7445. In the Select Table dialog box, for Data Source, select AdventureWorksDW2008.
745In the Table/View name list, select vAssocSeqLineItems, and then click OK.
746
747The table vAssocSeqLineItems is added to the pane.
748
7496. In the Specify Nested Join dialog box, drag the OrderNumber field from the case table and drop it onto the OrderNumber field in the nested table.
750You can also click Add Relationship and create the relationship by selecting columns from a list.
7517. In the Specify Relationship dialog box, verify that the OrderNumber fields are mapped correctly, and then click OK.
7528. Click OK to close the Specify Nested Join dialog box.
753The case and nested tables are updated in the design pane to show the joins connecting the external data columns to the columns in the model. If the relationships are wrong, you can right-click the join line and select Modify Connections to edit the column mapping, or you can right-click the join line and select Delete to remove the relationship completely.
7549. Add a new row to the grid. For Source, select vAssocSeqOrders table. For Field, select CustomerKey.
75510. Add a new row to the grid. For Source, select vAssocSeqOrders table. For Field, select Region.
75611. Add a new row to the grid. For Source, select Prediction Function, and for Field,select PredictAssociation.
75712. Drag vAssocSeqLineItems, into the Criteria/Argument box of the PredictAssociation row. Click at the end of the Criteria/Argument box and then type the following text: INCLUDE_STATISTICS,3
758The complete text in the Criteria/Argument box should be: [Association].[vAssoc Seq Line Items], INCLUDE_STATISTICS, 3
759
76013. Click the Result button to view the predictions for each customer.
761
762Conclusion: Thus, we have implemented the association algorithm successfully.
763
764Practical no : 6
765
766Aim: Consider any suitable data for text mining and implement the text mining technique using R Tool.
767# AccessingrdmTweets - double click on rdmTweets.RData and type:
768>rdmTweets
769
770#Finding the number of documents in rdmTweet.RData:
771> (nDocs<- length(rdmTweets))
772[1] 154
773
774
775# Print any five tweets
776>rdmTweets[5:10]
777
778# Load Required Packages
779>library(tm)
780>library(twitteR)
781>library(ROAuth)
782>library(RCurl)
783>library(bitops)
784>library(digest)
785>library(rjson)
786>library(twitteR)
787
788# Transforming Tweets to a dataframe
789df<- do.call("rbind", lapply(rdmTweets, as.data.frame))
790or
791tweets.df <- twListToDF(rdmTweets)
792# Verify the contents of df
793>dim(df)
794
795# build a corpus, and specify the source to be character vectors
796>myCorpus<- Corpus(VectorSource(df$text))
797# Verify the contents of myCorpus
798>myCorpus
799
800# Apply Transformations to convert to lower case
801>myCorpus<- tm_map(myCorpus, tolower)
802
803
804# Apply Transformations to remove punctuation
805>myCorpus<- tm_map(myCorpus, removePunctuation)
806
807# Apply Transformations to remove numbers
808>myCorpus<- tm_map(myCorpus, removeNumbers)
809
810
811# keep a copy of corpus to use later as a dictionary for stem completion
812>mycopyCorpus<- myCorpus
813# Apply transformation to stem words
814>myCorpus<- tm_map(myCorpus, stemDocument)
815# Verify the results of Stemming Transformation on first five documents
816>inspect(myCorpus[1:5])
817
818# Apply Transformation for stem completion
819>myCorpus<- tm_map(myCorpus, stemCompletion, dictionary=mycopyCorpus)
820# Verify the results of Stemming completion on first five documents
821>inspect(myCorpus[1:5])
822
823# Counting the frequency of the term "computing"
824>computingCases<- tm_map(myCorpus, grep, pattern="\\<computing")
825# Printing the frequency of the term "mining"
826>sum(unlist(computingCases))
827
828
829# Counting the frequency of the term "parallel"
830>parallelCases<- tm_map(myCorpus, grep, pattern="\\<parallel")
831# Printing the frequency of the term "parallel"
832>sum(unlist(parallelCases))
833
834
835
836 Building a Term-Document Matrix:
837 A term-document matrix represents the relationship between terms and documents, where each row stands for a term and each column for a document, and an entry is the number of occurrences of the term in the document.
838myTdm<- TermDocumentMatrix(myCorpus, control=list(wordLengths=c(1,Inf)))
839# Verifying the contents of Term-Document Matrix
840>myTdm
841
842# Storing the first six terms starting with \r" and tweets numbered 101 to 110.
843>idx<- which(dimnames(myTdm)$Terms == "r")
844# Inspecting the first six terms starting with \r" and tweets numbered 101 to 110.
845>inspect(myTdm[idx+(0:5),101:110])
846
847
848 Finding Associations in Frequent terms:
849# Finds frequent terms with frequency no less than ten.
850>findFreqTerms(myTdm, lowfreq=10)
851
852>termFrequency<- rowSums(as.matrix(myTdm))
853>termFrequency<- subset(termFrequency, termFrequency>=10)
854>termFrequency
855
856# which words are associated with "r"?
857findAssocs(myTdm, 'r', 0.25)
858
859# which words are associated with "the"?
860>findAssocs(myTdm, 'the', 0.25)
861
862
863
864
865Conclusion: Using R-tool, we have thus implemented the text mining technique successfully.
866
867Practical No: 7
868
869Aim: Using R-Tool, show the analysis for social networking sites.
870termDocMatrix[5:10,1:20]
871
872#change it to a Boolean matrix
873>termDocMatrix[termDocMatrix>=1] <- 1
874 #transform into a term-term adjacency matrix
875>termMatrix<- termDocMatrix %*% t(termDocMatrix)
876 #inspect terms numbered 5 to 10
877>termMatrix[5:10,5:10]
878
879>library(igraph)
880 # build a graph from the above matrix
881> g <- graph.adjacency(termMatrix, weighted=T, mode="undirected")
882> # remove loops
883> g <- simplify(g)
884> # set labels and degrees of vertices
885>V(g)$label <- V(g)$name
886>V(g)$degree <- degree(g)
887> # set seed to make the layout reproducible
888>set.seed(3952)
889> layout1 <- layout.fruchterman.reingold(g)
890>plot(g, layout=layout1)
891
892
893Conclusion: Using R-tool, we have analyzed the social networking sites successfully.
894
895Practical No: 8
896
897Aim: Using R-Tool, built the data mining model and implement the Apriori algorithm.
898>str(Titanic)
899
900>df<- as.data.frame(Titanic)
901>head(df)
902
903>titanic.raw<- NULL
904> for(i in 1:4) { titanic.raw<- cbind(titanic.raw, rep(as.character(df[,i]), df$Freq)) }
905>titanic.raw<- as.data.frame(titanic.raw)
906>names(titanic.raw) <- names(df)[1:4]
907>dim(titanic.raw)
908
909>str(titanic.raw)
910
911>head(titanic.raw)
912
913>summary(titanic.raw)
914
915
916
917
918
919>library(arules)
920
921> # find association rules with default settings
922>rules.all<- apriori(titanic.raw)
923
924>rules.all
925
926>inspect(rules.all)
927
928> # rules with rhs containing "Survived" only
929>rules<- apriori(titanic.raw, control = list(verbose=F), parameter = list(minlen=2, supp=0.005, conf=0.8), appearance = list(rhs=c("Survived=No", "Survived=Yes"), default="lhs"))
930>quality(rules) <- round(quality(rules), digits=3)
931>rules.sorted<- sort(rules, by="lift")
932>inspect(rules.sorted)
933
934> # find redundant rules
935>subset.matrix<- is.subset(rules.sorted, rules.sorted)
936>subset.matrix[lower.tri(subset.matrix, diag=T)] <- NA
937>redundant<- colSums(subset.matrix, na.rm=T) >= 1
938>which(redundant)
939
940> # remove redundant rules
941>rules.pruned<- rules.sorted[!redundant]
942>inspect(rules.pruned)
943
944> rules <- apriori(titanic.raw, parameter = list(minlen=3, supp=0.002, conf=0.2), appearance = list(rhs=c("Survived=Yes"), lhs=c("Class=1st", "Class=2nd", "Class=3rd", "Age=Child", "Age=Adult"), default="none"), control = list(verbose=F))
945>rules.sorted<- sort(rules, by="confidence")
946>inspect(rules.sorted)
947
948
949>library(arulesViz)
950>plot(rules.all)
951
952>plot(rules.all, method="grouped")
953
954>plot(rules.all, method="graph")
955
956Conclusion: Using R-tool, we have thus implemented the apriori algorithm successfully.