| Dataset | DTNB | DT | C4.5 | PT | FR | RDR | CBA | SA | J&B |
|---|---|---|---|---|---|---|---|---|---|
| Breast.Can | 122 | 22 | 10 | 20 | 13 | 13 | 63 | 20 | 47 |
| Balance | 31 | 35 | 35 | 27 | 44 | 22 | 77 | 45 | 79 |
| Car.Evn | 144 | 432 | 123 | 62 | 100 | 119 | 72 | 160 | 41 |
| Vote | 270 | 24 | 11 | 8 | 17 | 7 | 22 | 30 | 13 |
| Tic-Tac-Toe | 258 | 121 | 88 | 37 | 21 | 13 | 23 | 60 | 14 |
| Nursery | 1240 | 804 | 301 | 172 | 288 | 141 | 141 | 175 | 109 |
| Hayes-root | 5 | 8 | 22 | 14 | 11 | 10 | 34 | 45 | 34 |
| Lymp | 129 | 19 | 20 | 10 | 17 | 11 | 23 | 60 | 29 |
| Spect.H | 145 | 2 | 9 | 13 | 17 | 12 | 4 | 50 | 11 |
| Adult | 737 | 1571 | 279 | 571 | 150 | 175 | 126 | 130 | 97 |
| Chess | 507 | 101 | 31 | 29 | 29 | 30 | 12 | 120 | 24 |
| Connect4 | 3826 | 4952 | 3973 | 3973 | 403 | 341 | 349 | 600 | 273 |
| Average | 618 | 674 | 409 | 411 | 93 | 75 | 79 | 125 | 64 |
The statistically significant win/loss counts of J&B against the other rule-based classification models, compared on the number of classification rules, are shown in Table 9.
Table 9. Statistically significant win/loss counts of the J&B method on the number of rules.
|  | DTNB | DT | C4.5 | PT | FR | RDR | CBA | SA |
|---|---|---|---|---|---|---|---|---|
| W (wins) | 10 | 7 | 6 | 6 | 8 | 5 | 7 | 10 |
| L (losses) | 2 | 5 | 4 | 6 | 4 | 5 | 3 | 2 |
| Ties | 0 | 0 | 2 | 0 | 0 | 2 | 2 | 0 |
Table 9 shows that J&B produced a statistically significantly smaller classifier than the DTNB and SA methods on 10 out of 12 datasets, and than the DT, FR and CBA methods on at least 7 of the 12 datasets. Most importantly, J&B generated statistically smaller classifiers than all other models on the larger datasets, which was our main goal in this research. Experimental evaluations on the larger datasets (over 10,000 samples) are shown in Figure 7.
Figure 7. Comparison of rule-based classification methods on the average number of rules.
Figure 7 illustrates the advantage of the J&B method: it produced the smallest classifier among all rule-based classification models on the selected datasets.
Our experimental results on the relevance measures "Precision", "Recall" and "F-measure" (averaged over all datasets) are summarized in Figure 8. The detailed results for each dataset can be found in Appendix A.
Figure 8. Comparison of the J&B classifier on Accuracy, Precision, Recall and F-measure.
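For reference, these measures are defined per class in the standard way, where TP, FP and FN denote the numbers of true positives, false positives and false negatives:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad \mathrm{F\text{-}measure} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$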
Example 1. Let us assume that we have the class association rules shown in Table 3.7, generated from a dataset and satisfying the user-specified minimum support and confidence thresholds. We set the minimum coverage threshold to 80%; that is, once the intended classifier covers at least 80% of the training examples, we stop.
The learning (training) dataset is used to build the model.
In the first step, we sort the class association rules in descending order of confidence and support; the result is shown in Table 3.9.
In the next step, we form our classifier by selecting strong rules. We keep only those strong rules that improve the overall coverage, and we continue until the intended training dataset coverage is reached. Table 3.10 illustrates our final classifier.
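As a rough sketch of this selection step (not the actual implementation), the following Python fragment sorts candidate rules and greedily keeps those that cover new training examples until the requested coverage is reached. The `Rule` fields and the dictionary representation of examples are assumptions made for the illustration.

```python
from dataclasses import dataclass

@dataclass
class Rule:
    antecedent: dict    # e.g. {"a1": 1, "a3": 5}
    consequent: int     # predicted class value
    support: float
    confidence: float

def matches(rule, example):
    """A rule covers an example if all antecedent attribute-value pairs agree."""
    return all(example.get(attr) == val for attr, val in rule.antecedent.items())

def build_classifier(rules, training_set, coverage_threshold=0.8):
    """Greedily select strong rules until the requested share of training examples is covered."""
    # Step 1: sort by confidence, then support, both descending.
    rules = sorted(rules, key=lambda r: (r.confidence, r.support), reverse=True)
    classifier, covered = [], set()
    target = coverage_threshold * len(training_set)
    for rule in rules:
        newly_covered = {i for i, ex in enumerate(training_set)
                         if i not in covered and matches(rule, ex)}
        # Step 2: keep a rule only if it covers at least one not-yet-covered example.
        if newly_covered:
            classifier.append(rule)
            covered |= newly_covered
        # Stopping criterion: intended training dataset coverage reached.
        if len(covered) >= target:
            break
    return classifier
```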
Our classifier includes 6 rules. In this example, the intended coverage is 80%, and the 6 classification rules in the classifier cover 80% of the learning set. Since our training dataset contains some examples with missing values, our classifier in fact covers the whole training dataset (all examples without missing values). Other rules might also cover so-far-unclassified examples, but once the user-defined training dataset coverage threshold is reached we stop: this is our stopping criterion, and no further rules are added to the classifier. We also do not include classification rules that cover only already-covered examples, since they do not contribute to improving the overall coverage. Now, we classify the following unseen example:
{a1=1,a2=5,a3=5,a4=4,a5=5} ?
This example is covered by the third and fourth classification rules. The class values of the rules that cover the new example are 3 and 3, so our classifier predicts class value 3 for the new example (the majority class value among the covering rules).
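Prediction for an unseen example therefore reduces to a majority vote among the covering rules. The short sketch below illustrates this; the two rules are hypothetical stand-ins for the third and fourth rules of the example classifier (both predicting class 3), not the actual rules from Table 3.10.

```python
from collections import Counter

def classify(rules, example, default_class=None):
    """Predict the majority class among the rules whose antecedents the example satisfies."""
    votes = [cls for antecedent, cls in rules
             if all(example.get(attr) == val for attr, val in antecedent.items())]
    if not votes:
        return default_class   # no rule covers the example: fall back to a default class
    return Counter(votes).most_common(1)[0][0]

# Hypothetical stand-ins for the 3rd and 4th rules (both predict class 3).
demo_rules = [
    ({"a2": 5, "a3": 5}, 3),
    ({"a1": 1, "a5": 5}, 3),
]
print(classify(demo_rules, {"a1": 1, "a2": 5, "a3": 5, "a4": 4, "a5": 5}))  # -> 3
```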
Example 2.
| № | Outlook | Temperature | Humidity | Windy | Play |
|---|---|---|---|---|---|
| 1 | sunny | hot | high | FALSE | no |
| 2 | sunny | hot | high | TRUE | no |
| 3 | overcast | hot | high | FALSE | yes |
| 4 | rainy | mild | high | FALSE | yes |
| 5 | rainy | cool | normal | FALSE | yes |
| 6 | rainy | cool | normal | TRUE | no |
| 7 | overcast | cool | normal | TRUE | yes |
| 8 | sunny | mild | high | FALSE | no |
| 9 | sunny | cool | normal | FALSE | yes |
| 10 | rainy | mild | normal | FALSE | yes |
| 11 | sunny | mild | normal | TRUE | yes |
| 12 | overcast | mild | high | TRUE | yes |
| 13 | overcast | hot | normal | FALSE | yes |
| 14 | rainy | mild | high | TRUE | no |
We use the Apriori algorithm to find the class association rules, with the following settings:
Minimum support: 10%
Minimum confidence: 80%
Class association rules (car): true
If we reduce the minimum confidence we obtain more rules, and if we increase it we obtain fewer; it is one of the most important parameters. Lowering it produces more (and often unnecessary) rules, while raising it may lose good rules, so it has to be chosen with care. With a minimum support of 0.1 and a minimum confidence of 0.8 we obtain 21 rules; lowering the support to 0.05 yields 72 rules. Both rule sets, together with the confidence of each rule, are listed below.
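As a rough illustration of how such rule sets can be reproduced (a simplified stand-in for the Apriori run, not the tool we actually used), the sketch below brute-forces all class association rules over the 14 weather examples, keeping the class attribute on the right-hand side and filtering by minimum support and confidence. Exact rule counts may differ slightly from the tool's output depending on its rule-limit and ordering settings.

```python
from itertools import combinations

# The 14 weather examples from the table above (class attribute: "play").
data = [
    {"outlook": "sunny",    "temperature": "hot",  "humidity": "high",   "windy": "FALSE", "play": "no"},
    {"outlook": "sunny",    "temperature": "hot",  "humidity": "high",   "windy": "TRUE",  "play": "no"},
    {"outlook": "overcast", "temperature": "hot",  "humidity": "high",   "windy": "FALSE", "play": "yes"},
    {"outlook": "rainy",    "temperature": "mild", "humidity": "high",   "windy": "FALSE", "play": "yes"},
    {"outlook": "rainy",    "temperature": "cool", "humidity": "normal", "windy": "FALSE", "play": "yes"},
    {"outlook": "rainy",    "temperature": "cool", "humidity": "normal", "windy": "TRUE",  "play": "no"},
    {"outlook": "overcast", "temperature": "cool", "humidity": "normal", "windy": "TRUE",  "play": "yes"},
    {"outlook": "sunny",    "temperature": "mild", "humidity": "high",   "windy": "FALSE", "play": "no"},
    {"outlook": "sunny",    "temperature": "cool", "humidity": "normal", "windy": "FALSE", "play": "yes"},
    {"outlook": "rainy",    "temperature": "mild", "humidity": "normal", "windy": "FALSE", "play": "yes"},
    {"outlook": "sunny",    "temperature": "mild", "humidity": "normal", "windy": "TRUE",  "play": "yes"},
    {"outlook": "overcast", "temperature": "mild", "humidity": "high",   "windy": "TRUE",  "play": "yes"},
    {"outlook": "overcast", "temperature": "hot",  "humidity": "normal", "windy": "FALSE", "play": "yes"},
    {"outlook": "rainy",    "temperature": "mild", "humidity": "high",   "windy": "TRUE",  "play": "no"},
]

def mine_class_association_rules(data, min_support=0.1, min_confidence=0.8, class_attr="play"):
    """Brute-force miner: antecedents over non-class attributes, the class attribute on the right."""
    attrs = [a for a in data[0] if a != class_attr]
    n, rules = len(data), []
    for size in range(1, len(attrs) + 1):
        for subset in combinations(attrs, size):
            # Candidate antecedents: every value combination on this attribute subset that occurs in the data.
            for antecedent in {tuple((a, row[a]) for a in subset) for row in data}:
                matching = [row for row in data if all(row[a] == v for a, v in antecedent)]
                for cls in {row[class_attr] for row in matching}:
                    hits = sum(1 for row in matching if row[class_attr] == cls)
                    support, confidence = hits / n, hits / len(matching)
                    if support >= min_support and confidence >= min_confidence:
                        rules.append((antecedent, cls, round(confidence, 2)))
    return rules

rules = mine_class_association_rules(data, min_support=0.1, min_confidence=0.8)
print(len(rules))   # number of rules passing both thresholds
for antecedent, cls, conf in rules[:5]:
    print(" ".join(f"{a}={v}" for a, v in antecedent), "==> play =", cls, f"(conf: {conf})")
```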
With a minimum support of 0.05 and a minimum confidence of 0.8, 72 rules are generated (confidence in parentheses):
1. outlook=overcast ==> play=yes (conf: 1)
2. humidity=normal windy=FALSE ==> play=yes (conf: 1)
3. outlook=sunny humidity=high ==> play=no (conf: 1)
4. outlook=rainy windy=FALSE ==> play=yes (conf: 1)
5. outlook=sunny humidity=normal ==> play=yes (conf: 1)
6. outlook=sunny temperature=hot ==> play=no (conf: 1)
7. outlook=overcast temperature=hot ==> play=yes (conf: 1)
8. outlook=overcast humidity=high ==> play=yes (conf: 1)
9. outlook=overcast humidity=normal ==> play=yes (conf: 1)
10. outlook=overcast windy=TRUE ==> play=yes (conf: 1)
11. outlook=overcast windy=FALSE ==> play=yes (conf: 1)
12. outlook=rainy windy=TRUE ==> play=no (conf: 1)
13. temperature=mild humidity=normal ==> play=yes (conf: 1)
14. temperature=cool windy=FALSE ==> play=yes (conf: 1)
15. outlook=sunny temperature=hot humidity=high ==> play=no (conf: 1)
16. outlook=sunny humidity=high windy=FALSE ==> play=no (conf: 1)
17. outlook=overcast temperature=hot windy=FALSE ==> play=yes (conf: 1)
18. outlook=rainy temperature=mild windy=FALSE ==> play=yes (conf: 1)
19. outlook=rainy humidity=normal windy=FALSE ==> play=yes (conf: 1)
20. temperature=cool humidity=normal windy=FALSE ==> play=yes (conf: 1)
21. outlook=sunny temperature=cool ==> play=yes (conf: 1)
22. outlook=overcast temperature=mild ==> play=yes (conf: 1)
23. outlook=overcast temperature=cool ==> play=yes (conf: 1)
24. temperature=hot humidity=normal ==> play=yes (conf: 1)
25. temperature=hot windy=TRUE ==> play=no (conf: 1)
26. outlook=sunny temperature=mild humidity=normal ==> play=yes (conf: 1)
27. outlook=sunny temperature=mild windy=TRUE ==> play=yes (conf: 1)
28. outlook=sunny temperature=cool humidity=normal ==> play=yes (conf: 1)
29. outlook=sunny temperature=cool windy=FALSE ==> play=yes (conf: 1)
30. outlook=sunny humidity=normal windy=TRUE ==> play=yes (conf: 1)
31. outlook=sunny humidity=normal windy=FALSE ==> play=yes (conf: 1)
32. outlook=sunny temperature=hot windy=TRUE ==> play=no (conf: 1)
33. outlook=sunny temperature=hot windy=FALSE ==> play=no (conf: 1)
34. outlook=sunny temperature=mild humidity=high ==> play=no (conf: 1)
35. outlook=sunny temperature=mild windy=FALSE ==> play=no (conf: 1)
36. outlook=sunny humidity=high windy=TRUE ==> play=no (conf: 1)
37. outlook=overcast temperature=hot humidity=high ==> play=yes (conf: 1)
38. outlook=overcast temperature=hot humidity=normal ==> play=yes (conf: 1)
39. outlook=overcast temperature=mild humidity=high ==> play=yes (conf: 1)
40. outlook=overcast temperature=mild windy=TRUE ==> play=yes (conf: 1)
41. outlook=overcast temperature=cool humidity=normal ==> play=yes (conf: 1)
42. outlook=overcast temperature=cool windy=TRUE ==> play=yes (conf: 1)
43. outlook=overcast humidity=high windy=TRUE ==> play=yes (conf: 1)
44. outlook=overcast humidity=high windy=FALSE ==> play=yes (conf: 1)
45. outlook=overcast humidity=normal windy=TRUE ==> play=yes (conf: 1)
46. outlook=overcast humidity=normal windy=FALSE ==> play=yes (conf: 1)
47. outlook=rainy temperature=mild humidity=normal ==> play=yes (conf: 1)
48. outlook=rainy temperature=cool windy=FALSE ==> play=yes (conf: 1)
49. outlook=rainy humidity=high windy=FALSE ==> play=yes (conf: 1)
50. outlook=rainy temperature=mild windy=TRUE ==> play=no (conf: 1)
51. outlook=rainy temperature=cool windy=TRUE ==> play=no (conf: 1)
52. outlook=rainy humidity=high windy=TRUE ==> play=no (conf: 1)
53. outlook=rainy humidity=normal windy=TRUE ==> play=no (conf: 1)
54. temperature=hot humidity=normal windy=FALSE ==> play=yes (conf: 1)
55. temperature=hot humidity=high windy=TRUE ==> play=no (conf: 1)
56. temperature=mild humidity=normal windy=TRUE ==> play=yes (conf: 1)
57. temperature=mild humidity=normal windy=FALSE ==> play=yes (conf: 1)
58. outlook=sunny temperature=mild humidity=normal windy=TRUE ==> play=yes (conf: 1)
59. outlook=sunny temperature=cool humidity=normal windy=FALSE ==> play=yes (conf: 1)
60. outlook=sunny temperature=hot humidity=high windy=TRUE ==> play=no (conf: 1)
61. outlook=sunny temperature=hot humidity=high windy=FALSE ==> play=no (conf: 1)
62. outlook=sunny temperature=mild humidity=high windy=FALSE ==> play=no (conf: 1)
63. outlook=overcast temperature=hot humidity=high windy=FALSE ==> play=yes (conf: 1)
64. outlook=overcast temperature=hot humidity=normal windy=FALSE ==> play=yes (conf: 1)
65. outlook=overcast temperature=mild humidity=high windy=TRUE ==> play=yes (conf: 1)
66. outlook=overcast temperature=cool humidity=normal windy=TRUE ==> play=yes (conf: 1)
67. outlook=rainy temperature=mild humidity=high windy=FALSE ==> play=yes (conf: 1)
68. outlook=rainy temperature=mild humidity=normal windy=FALSE ==> play=yes (conf: 1)
69. outlook=rainy temperature=cool humidity=normal windy=FALSE ==> play=yes (conf: 1)
70. outlook=rainy temperature=mild humidity=high windy=TRUE ==> play=no (conf: 1)
71. outlook=rainy temperature=cool humidity=normal windy=TRUE ==> play=no (conf: 1)
72. humidity=normal ==> play=yes (conf: 0.86)
With a minimum support of 0.1 and a minimum confidence of 0.8, 21 rules are generated (confidence in parentheses):
1. outlook=overcast ==> play=yes (conf: 1)
2. humidity=normal windy=FALSE ==> play=yes (conf: 1)
3. outlook=sunny humidity=high ==> play=no (conf: 1)
4. outlook=rainy windy=FALSE ==> play=yes (conf: 1)
5. outlook=sunny humidity=normal ==> play=yes (conf: 1)
6. outlook=sunny temperature=hot ==> play=no (conf: 1)
7. outlook=overcast temperature=hot ==> play=yes (conf: 1)
8. outlook=overcast humidity=high ==> play=yes (conf: 1)
9. outlook=overcast humidity=normal ==> play=yes (conf: 1)
10. outlook=overcast windy=TRUE ==> play=yes (conf: 1)
11. outlook=overcast windy=FALSE ==> play=yes (conf: 1)
12. outlook=rainy windy=TRUE ==> play=no (conf: 1)
13. temperature=mild humidity=normal ==> play=yes (conf: 1)
14. temperature=cool windy=FALSE ==> play=yes (conf: 1)
15. outlook=sunny temperature=hot humidity=high ==> play=no (conf: 1)
16. outlook=sunny humidity=high windy=FALSE ==> play=no (conf: 1)
17. outlook=overcast temperature=hot windy=FALSE ==> play=yes (conf: 1)
18. outlook=rainy temperature=mild windy=FALSE ==> play=yes (conf: 1)
19. outlook=rainy humidity=normal windy=FALSE ==> play=yes (conf: 1)
20. temperature=cool humidity=normal windy=FALSE ==> play=yes (conf: 1)
21. humidity=normal ==> play=yes (conf: 0.86)
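As a sanity check on the last rule: humidity=normal holds in 7 examples (5, 6, 7, 9, 10, 11 and 13), of which 6 have play=yes, so its confidence is 6/7 ≈ 0.86 and its support is 6/14 ≈ 0.43, which is why it passes both threshold settings while all the other listed rules have confidence 1.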
Conclusion
Our experiments on accuracy and number of rules show that our method produces a compact and accurate classifier that is comparable with 8 other well-known classification methods. Although it did not achieve the best average classification accuracy, it produced significantly smaller rule sets on the larger datasets than the other classification algorithms. Our proposed classifier also achieved reasonably high average coverage.
Statistical significance testing shows that our method was statistically better than or equal to the other classification methods on some datasets, while it obtained worse results on others. The most important achievement of this research is that J&B obtained significantly better results in terms of the average number of classification rules than all other classification methods, while remaining comparable to them in accuracy.
This research is the first and main step toward our future goal: we plan to cluster class association rules by their similarity and thus further reduce their number and increase the accuracy and understandability of the classifier.
References
Agrawal, R., Srikant, R.: Fast algorithms for mining association rules. In: VLDB 94 Proceedings of the 20th International Conference on Very Large Data Bases, pp. 487-499. Chile (1994).
Ali, K., Manganaris, S., Srikant, R.: Partial Classification Using Association Rules. In: Proceedings of KDD-97, pp. 115-118, U.S.A (1997).
Baralis, E., Cagliero, L., Garza, P.: A novel pattern-based Bayesian classifier. IEEE Transactions on Knowledge and Data Engineering 25(12), 2780–2795 (2013).
Bayardo, R. J.: Brute-force mining of high-confidence classification rules. In: Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, pp. 123-126, U.S.A (1997).
Breiman L.: Random Forests. Machine Learning 45(1), pp. 5-32 (2001).
Cendrowska J.: PRISM: An algorithm for inducing modular rules. International Journal of Man-Machine Studies 27(4), pp. 349-370 (1987).
Chen, G., Liu, H., Yu, L., Wei, Q., Zhang, X.: A new approach to classification based on association rule mining. Decision Support Systems 42(2), 674–689 (2006).
Clark, P., Niblett, T.: The CN2 induction algorithm. Machine Learning, 3(4), 261–283 (1989).
Cohen, W., W.: Fast Effective Rule Induction. In: ICML'95 Proceedings of the Twelfth International Conference on Machine Learning, pp. 115-123, Tahoe City, California (1995).
Dua, D., Graff, C.: UCI Machine Learning Repository, Irvine, CA: University of California (2019).
Frank, E., Witten, I.: Generating Accurate Rule Sets Without Global Optimization. In: Fifteenth International Conference on Machine Learning, pp. 144-151. USA (1998).
Holte, R.: Very simple classification rules perform well on most commonly used datasets. Machine Learning 11(1), pp. 63-91 (1993).
Kohavi, R.: The Power of Decision Tables. In: 8th European Conference on Machine Learning, pp. 174-189, Heraclion, Crete, Greece (1995).
Lent, B., Swami, A., Widom, J.: Clustering association rules. In: ICDE '97 Proceedings of the Thirteenth International Conference on Data Engineering, pp. 220-231. England (1997).
Li, W., Han, J., Pei, J.: CMAR: accurate and efficient classification based on multiple class-association rules. in Proceedings of the 1st IEEE International Conference on Data Mining (ICDM ’01), pp. 369–376, San Jose, California, USA (2001).
Liu, B., Hsu, W., Ma, Y.: Integrating classification and association rule mining. in Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining (KDD ’98), pp. 80–86, New York, USA (1998).
Quinlan, J.: C4.5: Programs for Machine Learning, Machine Learning 16(3), 235-240 (1993).
Yin, X., Han, J.: CPAR: Classification based on Predictive Association Rules. In: Proceedings of the SIAM International Conference on Data Mining, pp. 331-335, San Francisco, U.S.A (2003).
Zhang, M., Zhou Z.: A k-nearest neighbor based algorithm for multi-label classification. In: Proceedings of the 1st IEEE International Conference on Granular Computing (GrC’05), vol. 2, pp. 718–721, Beijing, China (2005).
Zhou, Z., Liu, X.: Training cost-sensitive neural networks with methods addressing the class imbalance problem. IEEE Transactions on Knowledge and Data Engineering, 18(1), pp. 63–77 (2006).