· 7 years ago · Dec 06, 2018, 03:18 AM
1{
2 "cells": [
3 {
4 "cell_type": "markdown",
5 "metadata": {},
6 "source": [
7 "# Apriori Algorithm for Generating Frequent Itemsets"
8 ]
9 },
10 {
11 "cell_type": "markdown",
12 "metadata": {},
13 "source": [
14 "##### by Kunal Verma "
15 ]
16 },
17 {
18 "cell_type": "markdown",
19 "metadata": {},
20 "source": [
21 "## Introduction"
22 ]
23 },
24 {
25 "cell_type": "markdown",
26 "metadata": {},
27 "source": [
28 "Apriori invented by Rakesh Agarwal and Ramakrishnan Srikant in 1994 is\n",
29 "a well known algorithm in data mining. Apriori Algorithm is used in finding frequent itemsets. Identifying associations between items in a dataset of transactions can be useful in various data mining tasks. The challenge is that given a dataset D having T transactions each with n number of attributes, how to find itemsets that appear frequently in D?\n",
30 "\n",
31 "\n",
32 "Using this algorithm we are trying to find significant terms obtained from a collection of electronic document which can be further used for text classification, topic finding and many other applications."
33 ]
34 },
35 {
36 "cell_type": "markdown",
37 "metadata": {},
38 "source": [
39 "## Definitions"
40 ]
41 },
42 {
43 "cell_type": "markdown",
44 "metadata": {},
45 "source": [
46 "\n",
47 "1) Let T = {$t_i, t2,..., t_N $} be the set of all transactions and I = {$h, h i_d$} be the set of all items in a transaction database. Each transaction $t_j$ consists of items which are subsets of I. \n",
48 "\n",
49 "\n",
50 "2) Itemset: It is defined as a collection of zero or more items in a transaction. If an itemset has no items in it then it is termed as a null itemset, and if it contains k items then it is referred as a k-itemset. \n",
51 "\n",
52 "3) Support count: Support count is defined as the number of transactions that contain a particular itemset. It is the most important property of an itemset. \n",
53 "\n"
54 ]
55 },
56 {
57 "cell_type": "markdown",
58 "metadata": {},
59 "source": [
60 "Now using the rules of association and concepts of support and confidence, we are going to describe how the Apriori actually works.\n"
61 ]
62 },
63 {
64 "cell_type": "markdown",
65 "metadata": {},
66 "source": [
67 "## Apriori Algorithm"
68 ]
69 },
70 {
71 "cell_type": "markdown",
72 "metadata": {},
73 "source": [
74 "Apriori works on the following principle which is also known as apriori property:\n",
75 "\n",
76 "If an itemset is frequent, then all of its subsets must also be frequent.\n",
77 "\n",
78 "The Apriori algorithm needs a minimum support level as an input and a data set. The algorithm will generate a list of all candidate itemsets with one item. The transaction data set will then be scanned to see which sets meet the minimum support level. The sets which are below the specified minimum support level are removed from further computations.\n",
79 "Now we will demonstrate the step-by-step approach of the algorithm using an example."
80 ]
81 },
82 {
83 "cell_type": "markdown",
84 "metadata": {},
85 "source": [
86 "| TID | Items |\n",
87 "|-----|---------|\n",
88 "| 100 | 1,3,4 |\n",
89 "| 200 | 1,3,4 |\n",
90 "| 300 | 1,2,3,5 |\n",
91 "| 400 | 2,5 |"
92 ]
93 },
94 {
95 "cell_type": "markdown",
96 "metadata": {},
97 "source": [
98 "Our dataset contains four transactions which could be sentences in case of documents with above particulars (sets/words in that sentence)"
99 ]
100 },
101 {
102 "cell_type": "markdown",
103 "metadata": {},
104 "source": [
105 "#### Step 1: Generating 1-itemset table"
106 ]
107 },
108 {
109 "cell_type": "markdown",
110 "metadata": {},
111 "source": [
112 "Generate the 1-itemset table by enumerating all the unique elements of items. Correspondingly calculate their support "
113 ]
114 },
115 {
116 "cell_type": "markdown",
117 "metadata": {},
118 "source": [
119 "| Itemset | Support |\n",
120 "|---------|---------|\n",
121 "| {1} | 2 |\n",
122 "| {2} | 3 |\n",
123 "| {3} | 3 |\n",
124 "| {4} | 1 |\n",
125 "| {5} | 3 |"
126 ]
127 },
128 {
129 "cell_type": "markdown",
130 "metadata": {},
131 "source": [
132 "Support value is calculated as support ($a -> b$) = $\\frac{\\text{Number of transactions a and b appear}} {\\text{total transactions}}$. But for the sake of simplicity we use support value as number of times each transaction appears. We also assume support threshold = 2."
133 ]
134 },
135 {
136 "cell_type": "markdown",
137 "metadata": {},
138 "source": [
139 "Now as we can see the support of Itemset 4 is 1 which is less then our threshold of 2. So we remove that itemset.\n",
140 "Now we create the 2-itemset table"
141 ]
142 },
143 {
144 "cell_type": "markdown",
145 "metadata": {},
146 "source": [
147 "#### Step 2 : Generate 2-itemsets table"
148 ]
149 },
150 {
151 "cell_type": "markdown",
152 "metadata": {},
153 "source": [
154 "From the above 1-itemset table, we look at the different combinations of size 2 itemsets, not considering the eliminated itemsets. Correspondingly calculating the support of each 2-itemset"
155 ]
156 },
157 {
158 "cell_type": "markdown",
159 "metadata": {},
160 "source": [
161 "| Itemset | Support |\n",
162 "|---------|---------|\n",
163 "| {1,2} | 1 |\n",
164 "| {1,3} | 2 |\n",
165 "| {1,5} | 1 |\n",
166 "| {2,3} | 2 |\n",
167 "| {2,5} | 3 |\n",
168 "| {3,5} | 2 |"
169 ]
170 },
171 {
172 "cell_type": "markdown",
173 "metadata": {},
174 "source": [
175 "Now repeat the above process of removing itemsets with support less than 2. and using the new 2-itemset create 3-itemset table. In this example {1,2} and {1,5} are removed."
176 ]
177 },
178 {
179 "cell_type": "markdown",
180 "metadata": {},
181 "source": [
182 "#### Step 3 : Generate 3-itemsets table"
183 ]
184 },
185 {
186 "cell_type": "markdown",
187 "metadata": {},
188 "source": [
189 "The corresponding 3-itemseet table would be formed using all combinations of above accepted size 2 itemsets"
190 ]
191 },
192 {
193 "cell_type": "markdown",
194 "metadata": {},
195 "source": [
196 "| Itemset | Support |\n",
197 "|---------|---------|\n",
198 "| {2,3,5} | 2 |"
199 ]
200 },
201 {
202 "cell_type": "markdown",
203 "metadata": {},
204 "source": [
205 "Now we have to stop because 4-itemsets cannot be generated as there are only three items left.\n",
206 "\n",
207 "Following are the frequent itemsets that we have generated and which are above support threshold: {1}, {2}, {3}, {5}, {1, 3}, {2, 3}, {2, 5}, {3, 5} and {2, 3, 5}"
208 ]
209 },
210 {
211 "cell_type": "markdown",
212 "metadata": {},
213 "source": [
214 "### Pseudo-code for the whole Apriori algorithm"
215 ]
216 },
217 {
218 "cell_type": "markdown",
219 "metadata": {},
220 "source": [
221 ""
222 ]
223 },
224 {
225 "cell_type": "markdown",
226 "metadata": {},
227 "source": [
228 "1) Create a list of candidate itemsets of length k\n",
229 "\n",
230 "2) Scan the dataset to see if each itemset is frequent\n",
231 "\n",
232 "3) Keep frequent itemsets to create itemsets of length k+1"
233 ]
234 },
235 {
236 "cell_type": "markdown",
237 "metadata": {},
238 "source": [
239 "## Association analysis"
240 ]
241 },
242 {
243 "cell_type": "markdown",
244 "metadata": {},
245 "source": [
246 "Looking for hidden relationships in large datasets is known as association analysis or association rule learning. The problem is, finding different combinations of items can be a time-consuming task and prohibitively expensive in terms of computing power.\n",
247 "\n",
248 " Association rules suggest that a strong relationship exists between two items.\n",
249 "The support and confidence are ways we can quantify the success of our association analysis.\n",
250 "\n",
251 "The support of an itemset is defined as the percentage of the dataset that contains this itemset.\n",
252 "\n",
253 "\n",
254 "The confidence for a rule P âžž H is defined as support(P | H)/ support(P). Remember, in Python, the | symbol is the set union; the mathematical symbol is U. P | H means all the items in set P or in set H."
255 ]
256 },
257 {
258 "cell_type": "markdown",
259 "metadata": {},
260 "source": [
261 "1) To find association rules, we first start with a frequent itemset. We know this set of items is unique, but we want to see if there is anything else we can get out of these items. One item or one set of items can imply another item."
262 ]
263 },
264 {
265 "cell_type": "markdown",
266 "metadata": {},
267 "source": [
268 "2) We use this to find association between the above generated frequent itemsets. Also we calculate confidence for each association and prune those with confidence below the minimum threshold"
269 ]
270 },
271 {
272 "cell_type": "markdown",
273 "metadata": {},
274 "source": [
275 "This gives you three rules: {1} ➞ {3},{5} ➞ {2},and {2} ➞ {5}. It’s interesting to see that the rule with 2 and 5 can be flipped around but not the rule with 1 and 3"
276 ]
277 },
278 {
279 "cell_type": "markdown",
280 "metadata": {},
281 "source": [
282 "## Code for the Apriori Algorithm and Association Rules"
283 ]
284 },
285 {
286 "cell_type": "markdown",
287 "metadata": {},
288 "source": [
289 "## Apriori algorithm"
290 ]
291 },
292 {
293 "cell_type": "markdown",
294 "metadata": {},
295 "source": [
296 ""
297 ]
298 },
299 {
300 "cell_type": "markdown",
301 "metadata": {},
302 "source": [
303 "The following code is referenced from the blog on Apriori Algorithm on adataanlyst.com \n",
304 "Scanning the dataset\n",
305 "For each transaction in the dataset:\n",
306 "\n",
307 "For each candidate itemset, can:\n",
308 "\n",
309 " Check to see if can is a subset of tran\n",
310 "\n",
311 " If so increment the count of can\n",
312 "\n",
313 "For each candidate itemset:\n",
314 "\n",
315 "If the support meets the minimum, keep this item\n",
316 "\n",
317 "Return list of frequent itemsets"
318 ]
319 },
320 {
321 "cell_type": "code",
322 "execution_count": 20,
323 "metadata": {},
324 "outputs": [],
325 "source": [
326 "from IPython.display import Latex"
327 ]
328 },
329 {
330 "cell_type": "code",
331 "execution_count": 10,
332 "metadata": {},
333 "outputs": [],
334 "source": [
335 "from numpy import *"
336 ]
337 },
338 {
339 "cell_type": "markdown",
340 "metadata": {},
341 "source": [
342 "Dataset for testing"
343 ]
344 },
345 {
346 "cell_type": "code",
347 "execution_count": 13,
348 "metadata": {},
349 "outputs": [],
350 "source": [
351 "def loadDataSet():\n",
352 " return [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]"
353 ]
354 },
355 {
356 "cell_type": "markdown",
357 "metadata": {},
358 "source": [
359 "This creates 1-itemset table(C1). we’ll scan the dataset to see if these one itemsets meet our minimum support requirements. The itemsets that do meet our minimum requirements become L1. L1 then gets combined to become C2 and C2 will get filtered to become L2.\n",
360 "\n",
361 "Frozensets are sets that are frozen, which means they’re immutable; you can’t change them. You need to use the type frozenset instead of set because you’ll later use these sets as the key in a dictionary."
362 ]
363 },
364 {
365 "cell_type": "code",
366 "execution_count": 8,
367 "metadata": {},
368 "outputs": [],
369 "source": [
370 "def createC1(dataSet):\n",
371 " C1 = []\n",
372 " for transaction in dataSet:\n",
373 " for item in transaction:\n",
374 " if not [item] in C1:\n",
375 " C1.append([item])\n",
376 " \n",
377 " C1.sort()\n",
378 " return list(map(frozenset, C1))"
379 ]
380 },
381 {
382 "cell_type": "markdown",
383 "metadata": {},
384 "source": [
385 "Next function takes three arguments: a dataset, Ck, a list of candidate sets, and minSupport, which is the minimum support you’re interested in. This is the function you’ll use to generate L1 from C1."
386 ]
387 },
388 {
389 "cell_type": "code",
390 "execution_count": 11,
391 "metadata": {},
392 "outputs": [],
393 "source": [
394 "def scanD(D, Ck, minSupport):\n",
395 " ssCnt = {}\n",
396 " for tid in D:\n",
397 " for can in Ck:\n",
398 " if can.issubset(tid):\n",
399 " if not can in ssCnt: ssCnt[can]=1\n",
400 " else: ssCnt[can] += 1\n",
401 " numItems = float(len(D))\n",
402 " retList = []\n",
403 " supportData = {}\n",
404 " for key in ssCnt:\n",
405 " support = ssCnt[key]/numItems\n",
406 " if support >= minSupport:\n",
407 " retList.insert(0,key)\n",
408 " supportData[key] = support\n",
409 " return retList, supportData"
410 ]
411 },
412 {
413 "cell_type": "code",
414 "execution_count": 14,
415 "metadata": {},
416 "outputs": [
417 {
418 "data": {
419 "text/plain": [
420 "[[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]"
421 ]
422 },
423 "execution_count": 14,
424 "metadata": {},
425 "output_type": "execute_result"
426 }
427 ],
428 "source": [
429 "dataSet = loadDataSet()\n",
430 "dataSet"
431 ]
432 },
433 {
434 "cell_type": "code",
435 "execution_count": 15,
436 "metadata": {},
437 "outputs": [
438 {
439 "data": {
440 "text/plain": [
441 "[frozenset({1}),\n",
442 " frozenset({2}),\n",
443 " frozenset({3}),\n",
444 " frozenset({4}),\n",
445 " frozenset({5})]"
446 ]
447 },
448 "execution_count": 15,
449 "metadata": {},
450 "output_type": "execute_result"
451 }
452 ],
453 "source": [
454 "C1 = createC1(dataSet)\n",
455 "C1"
456 ]
457 },
458 {
459 "cell_type": "code",
460 "execution_count": 17,
461 "metadata": {},
462 "outputs": [
463 {
464 "data": {
465 "text/plain": [
466 "[{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]"
467 ]
468 },
469 "execution_count": 17,
470 "metadata": {},
471 "output_type": "execute_result"
472 }
473 ],
474 "source": [
475 "#D is a dataset in the setform.\n",
476 "\n",
477 "D = list(map(set,dataSet))\n",
478 "D"
479 ]
480 },
481 {
482 "cell_type": "markdown",
483 "metadata": {},
484 "source": [
485 "Now that you have everything in set form, you can remove items that don’t meet our minimum support."
486 ]
487 },
488 {
489 "cell_type": "code",
490 "execution_count": 18,
491 "metadata": {},
492 "outputs": [
493 {
494 "data": {
495 "text/plain": [
496 "[frozenset({5}), frozenset({2}), frozenset({3}), frozenset({1})]"
497 ]
498 },
499 "execution_count": 18,
500 "metadata": {},
501 "output_type": "execute_result"
502 }
503 ],
504 "source": [
505 "L1,suppDat0 = scanD(D,C1,0.5)\n",
506 "L1"
507 ]
508 },
509 {
510 "cell_type": "markdown",
511 "metadata": {},
512 "source": [
513 "Creating candidate itemsets: Ck\n",
514 "\n",
515 "The function aprioriGen() will take a list of frequent itemsets, Lk, and the size of the itemsets, k, to produce Ck"
516 ]
517 },
518 {
519 "cell_type": "code",
520 "execution_count": 19,
521 "metadata": {},
522 "outputs": [],
523 "source": [
524 "def aprioriGen(Lk, k): #creates Ck\n",
525 " retList = []\n",
526 " lenLk = len(Lk)\n",
527 " for i in range(lenLk):\n",
528 " for j in range(i+1, lenLk): \n",
529 " L1 = list(Lk[i])[:k-2]; L2 = list(Lk[j])[:k-2]\n",
530 " L1.sort(); L2.sort()\n",
531 " if L1==L2: #if first k-2 elements are equal\n",
532 " retList.append(Lk[i] | Lk[j]) #set union\n",
533 " return retList"
534 ]
535 },
536 {
537 "cell_type": "markdown",
538 "metadata": {},
539 "source": [
540 "it will take the itemsets {0}, {1}, {2} and so on and produce {0,1} {0,2}, and {1,2}"
541 ]
542 },
543 {
544 "cell_type": "code",
545 "execution_count": 26,
546 "metadata": {},
547 "outputs": [],
548 "source": [
549 "def apriori(dataSet, minSupport = 0.5):\n",
550 " C1 = createC1(dataSet)\n",
551 " D = list(map(set, dataSet))\n",
552 " L1, supportData = scanD(D, C1, minSupport)\n",
553 " L = [L1]\n",
554 " k = 2\n",
555 " while (len(L[k-2]) > 0):\n",
556 " Ck = aprioriGen(L[k-2], k)\n",
557 " Lk, supK = scanD(D, Ck, minSupport)#scan DB to get Lk\n",
558 " supportData.update(supK)\n",
559 " L.append(Lk)\n",
560 " k += 1\n",
561 " return L, supportData"
562 ]
563 },
564 {
565 "cell_type": "markdown",
566 "metadata": {},
567 "source": [
568 "Now we run this main function apriori for implementing the apriori algorithm."
569 ]
570 },
571 {
572 "cell_type": "markdown",
573 "metadata": {},
574 "source": [
575 "## Mining association rules"
576 ]
577 },
578 {
579 "cell_type": "markdown",
580 "metadata": {},
581 "source": [
582 "The generateRules() function takes three inputs: a list of frequent itemsets, a dictionary of support data for those itemsets, and a minimum confidence threshold."
583 ]
584 },
585 {
586 "cell_type": "code",
587 "execution_count": 21,
588 "metadata": {},
589 "outputs": [],
590 "source": [
591 "def generateRules(L, supportData, minConf=0.7): #supportData is a dict coming from scanD\n",
592 " bigRuleList = []\n",
593 " for i in range(1, len(L)):#only get the sets with two or more items\n",
594 " for freqSet in L[i]:\n",
595 " H1 = [frozenset([item]) for item in freqSet]\n",
596 " if (i > 1):\n",
597 " rulesFromConseq(freqSet, H1, supportData, bigRuleList, minConf)\n",
598 " else:\n",
599 " calcConf(freqSet, H1, supportData, bigRuleList, minConf)\n",
600 " return bigRuleList"
601 ]
602 },
603 {
604 "cell_type": "markdown",
605 "metadata": {},
606 "source": [
607 "calcConf() calculates the confidence of the rule and then find out the which rules meet the minimum confidence."
608 ]
609 },
610 {
611 "cell_type": "code",
612 "execution_count": 24,
613 "metadata": {},
614 "outputs": [],
615 "source": [
616 "def calcConf(freqSet, H, supportData, brl, minConf=0.7):\n",
617 " prunedH = [] #create new list to return\n",
618 " for conseq in H:\n",
619 " conf = supportData[freqSet]/supportData[freqSet-conseq] #calc confidence\n",
620 " if conf >= minConf: \n",
621 " print (freqSet-conseq,'-->',conseq,'conf:',conf)\n",
622 " brl.append((freqSet-conseq, conseq, conf))\n",
623 " prunedH.append(conseq)\n",
624 " return prunedH"
625 ]
626 },
627 {
628 "cell_type": "markdown",
629 "metadata": {},
630 "source": [
631 "rulesFromConseq() generates more association rules from our initial dataset. This takes a frequent itemset and H, which is a list of items that could be on the right-hand side of a rule."
632 ]
633 },
634 {
635 "cell_type": "code",
636 "execution_count": 27,
637 "metadata": {},
638 "outputs": [
639 {
640 "name": "stdout",
641 "output_type": "stream",
642 "text": [
643 "frozenset({5}) --> frozenset({2}) conf: 1.0\n",
644 "frozenset({2}) --> frozenset({5}) conf: 1.0\n",
645 "frozenset({1}) --> frozenset({3}) conf: 1.0\n"
646 ]
647 }
648 ],
649 "source": [
650 "def rulesFromConseq(freqSet, H, supportData, brl, minConf=0.7):\n",
651 " m = len(H[0])\n",
652 " if (len(freqSet) > (m + 1)): #try further merging\n",
653 " Hmp1 = aprioriGen(H, m+1)#create Hm+1 new candidates\n",
654 " Hmp1 = calcConf(freqSet, Hmp1, supportData, brl, minConf)\n",
655 " if (len(Hmp1) > 1): #need at least two sets to merge\n",
656 " rulesFromConseq(freqSet, Hmp1, supportData, brl, minConf)\n",
657 "L,suppData= apriori(dataSet,minSupport=0.5)\n",
658 "rules= generateRules(L,suppData, minConf=0.7)"
659 ]
660 }
661 ],
662 "metadata": {
663 "kernelspec": {
664 "display_name": "Python 3",
665 "language": "python",
666 "name": "python3"
667 },
668 "language_info": {
669 "codemirror_mode": {
670 "name": "ipython",
671 "version": 3
672 },
673 "file_extension": ".py",
674 "mimetype": "text/x-python",
675 "name": "python",
676 "nbconvert_exporter": "python",
677 "pygments_lexer": "ipython3",
678 "version": "3.7.0"
679 }
680 },
681 "nbformat": 4,
682 "nbformat_minor": 2
683}