Reference Document: DataMining Assignment1.pdf
Question 1: Calculate Entropy
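Formula: $H(X) = -\sum_{i} P(x_i)\log_2 P(x_i)$, with the convention $0\log_2 0 = 0$
(a) P(x = True) = 0.5; P(x = False) = 0.5
Entropy = $-0.5\log_2 0.5 - 0.5\log_2 0.5 = 1$
(b) P(x = True) = 1; P(x = False) = 0
Entropy = $-1\log_2 1 - 0\log_2 0 = 0$
(c) P(x = True) = 0.2; P(x = False) = 0.8
Entropy = $-0.2\log_2 0.2 - 0.8\log_2 0.8 \approx 0.72$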
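The three cases above can be checked numerically. The following is a minimal sketch, assuming base-2 logarithms and the usual convention that a zero-probability outcome contributes nothing; the function name `entropy` is an illustrative choice, not part of the assignment.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; outcomes with p == 0 are skipped (0 * log 0 = 0)."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

# The three distributions from Question 1: (P(True), P(False))
cases = {"a": (0.5, 0.5), "b": (1.0, 0.0), "c": (0.2, 0.8)}
for label, probs in cases.items():
    print(f"({label}) entropy = {entropy(probs):.3f}")
```

A uniform distribution gives the maximum of 1 bit, a certain outcome gives 0, and the skewed case falls in between.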
Question 2
Sample Data
(a)
Calculate the Entropy of each feature
Pointed
Probabilities:
No = 6/8
Yes = 2/8
Entropy = $-\frac{6}{8}\log_2\frac{6}{8} - \frac{2}{8}\log_2\frac{2}{8}$
Entropy = 0.81
Threaded
Probabilities:
No = 2/8
Yes = 6/8
Entropy = $-\frac{2}{8}\log_2\frac{2}{8} - \frac{6}{8}\log_2\frac{6}{8}$
Entropy = 0.81
Width
Probabilities:
Slim = 2/8
Medium = 3/8
Fat = 3/8
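Entropy = $-\frac{2}{8}\log_2\frac{2}{8} - \frac{3}{8}\log_2\frac{3}{8} - \frac{3}{8}\log_2\frac{3}{8}$
Entropy = 1.56 (greater than 1 because the feature takes 3 values; the maximum for 3 values is $\log_2 3 \approx 1.58$)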
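As a cross-check on the per-feature entropies, here is a small sketch that works directly from the value counts listed above (6/2, 2/6 and 2/3/3 out of 8). The helper name `entropy_from_counts` and the dictionary layout are assumptions made for illustration.

```python
import math

def entropy_from_counts(counts):
    """Entropy in bits of a distribution given as raw counts; zero counts are skipped."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

# Value counts per feature, taken from the probabilities listed in part (a)
feature_counts = {
    "Pointed": [6, 2],    # No, Yes
    "Threaded": [2, 6],   # No, Yes
    "Width": [2, 3, 3],   # Slim, Medium, Fat
}
for name, counts in feature_counts.items():
    print(f"{name}: entropy = {entropy_from_counts(counts):.2f}")
```

Only Width can exceed 1 bit, because a feature with k values has a maximum entropy of log2(k).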
(b)
Calculate the Gini Index of each feature
Pointed
Probabilities:
No = 6/8
Yes = 2/8
Gini = $1 - \left[\left(\frac{6}{8}\right)^2 + \left(\frac{2}{8}\right)^2\right]$
Gini = 0.375
Threaded
Probabilities:
No = 2/8
Yes = 6/8
Gini = $1 - \left[\left(\frac{2}{8}\right)^2 + \left(\frac{6}{8}\right)^2\right]$
Gini = 0.375
Width
Probabilities:
Slim = 2/8
Medium = 3/8
Fat = 3/8
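Gini = $1 - \left[\left(\frac{2}{8}\right)^2 + \left(\frac{3}{8}\right)^2 + \left(\frac{3}{8}\right)^2\right]$
Gini = 0.656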
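The Gini values can be reproduced the same way. This is a minimal sketch assuming the standard definition, one minus the sum of squared probabilities; `gini_from_counts` is a hypothetical helper name.

```python
def gini_from_counts(counts):
    """Gini index: 1 minus the sum of squared class proportions."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# Value counts per feature, taken from the probabilities listed in part (b)
feature_counts = {
    "Pointed": [6, 2],    # No, Yes
    "Threaded": [2, 6],   # No, Yes
    "Width": [2, 3, 3],   # Slim, Medium, Fat
}
for name, counts in feature_counts.items():
    print(f"{name}: Gini = {gini_from_counts(counts):.3f}")
```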
(c)
Calculate the Information Gain of each feature and then report the best feature to split the examples
using the information gain metric
Parent / Original Entropy
Probabilities:
Nail = 4/8
Bolt = 4/8
Original Entropy = $-\frac{4}{8}\log_2\frac{4}{8} - \frac{4}{8}\log_2\frac{4}{8}$
Original Entropy = P = 1
Child / Split Entropies
Class | Pointed = Yes | Pointed = No |
---|---|---|
Nail | 1 | 3 |
Bolt | 1 | 3 |
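E(Pointed = Yes) = $-\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2}$
E(Pointed = Yes) = 0.5 + 0.5
E(Pointed = Yes) = 1
E(Pointed = No) = $-\frac{3}{6}\log_2\frac{3}{6} - \frac{3}{6}\log_2\frac{3}{6}$
E(Pointed = No) = 0.5 + 0.5
E(Pointed = No) = 1
Weighted Average = $\frac{2}{8}(1) + \frac{6}{8}(1)$
Weighted Average = M = 1
Gain = $P - M = 1 - 1$
Gain = 0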
Class | Threaded = Yes | Threaded = No |
---|---|---|
Nail | 3 | 1 |
Bolt | 3 | 1 |
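E(Threaded = Yes) = $-\frac{3}{6}\log_2\frac{3}{6} - \frac{3}{6}\log_2\frac{3}{6}$
E(Threaded = Yes) = 0.5 + 0.5
E(Threaded = Yes) = 1
E(Threaded = No) = $-\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2}$
E(Threaded = No) = 0.5 + 0.5
E(Threaded = No) = 1
Weighted Average = $\frac{6}{8}(1) + \frac{2}{8}(1)$
Weighted Average = M = 1
Gain = $P - M = 1 - 1$
Gain = 0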
Class | Width = Slim | Width = Medium | Width = Fat |
---|---|---|---|
Nail | 2 | 1 | 1 |
Bolt | 0 | 2 | 2 |
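E(Width = Slim) = $-\frac{2}{2}\log_2\frac{2}{2} - \frac{0}{2}\log_2\frac{0}{2}$
E(Width = Slim) = 0 + 0 (taking $0\log_2 0 = 0$)
E(Width = Slim) = 0
E(Width = Medium) = $-\frac{1}{3}\log_2\frac{1}{3} - \frac{2}{3}\log_2\frac{2}{3}$
E(Width = Medium) = 0.53 + 0.39
E(Width = Medium) = 0.92
E(Width = Fat) = $-\frac{1}{3}\log_2\frac{1}{3} - \frac{2}{3}\log_2\frac{2}{3}$
E(Width = Fat) = 0.53 + 0.39
E(Width = Fat) = 0.92
Weighted Average = $\frac{2}{8}(0) + \frac{3}{8}(0.92) + \frac{3}{8}(0.92)$
Weighted Average = M = 0.69
Gain = $P - M = 1 - 0.69$
Gain = 0.31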
The best feature to split the examples is “Width”, since a multi-way split on it gives the highest information gain (about 0.31, versus 0 for both “Pointed” and “Threaded”).
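The information-gain calculation in part (c) can be sketched in a few lines, using the Nail/Bolt counts from the three contingency tables. The function names and the `splits` dictionary are illustrative assumptions; each child node is given as a `[Nail, Bolt]` count pair.

```python
import math

def entropy_from_counts(counts):
    """Entropy in bits of a distribution given as raw counts; zero counts are skipped."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(children):
    """Parent entropy minus the size-weighted average of the child entropies."""
    n = sum(sum(child) for child in children)
    parent_counts = [sum(col) for col in zip(*children)]  # class totals over all children
    weighted = sum(sum(child) / n * entropy_from_counts(child) for child in children)
    return entropy_from_counts(parent_counts) - weighted

# One [Nail, Bolt] pair per feature value, read off the tables in part (c)
splits = {
    "Pointed": [[1, 1], [3, 3]],          # Yes, No
    "Threaded": [[3, 3], [1, 1]],         # Yes, No
    "Width": [[2, 0], [1, 2], [1, 2]],    # Slim, Medium, Fat
}
gains = {name: information_gain(children) for name, children in splits.items()}
best = max(gains, key=gains.get)
for name, gain in gains.items():
    print(f"{name}: gain = {gain:.2f}")
print("Best feature by information gain:", best)
```

Pointed and Threaded leave both children at a 50/50 class mix, so they remove no uncertainty; only Width produces a pure child (Slim).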
(d)
Calculate the gain ratio for each feature and then report the best feature to split the examples using
the gain ratio metric.
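Gain Ratio = Gain / Split Info, with Split Info = $-\sum_{i=1}^{k}\frac{n_i}{n}\log_2\frac{n_i}{n}$, where $n_i$ is the no. of records in child node $i$, $n$ is the no. of records in the parent node, and $k$ is the no. of child nodes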
Class | Pointed = Yes | Pointed = No |
---|---|---|
Nail | 1 | 3 |
Bolt | 1 | 3 |
Split Info = $-\frac{2}{8}\log_2\frac{2}{8} - \frac{6}{8}\log_2\frac{6}{8}$
Split Info = 0.81
Gain Ratio = 0 / 0.81 = 0
Class | Threaded = Yes | Threaded = No |
---|---|---|
Nail | 3 | 1 |
Bolt | 3 | 1 |
Split Info = $-\frac{6}{8}\log_2\frac{6}{8} - \frac{2}{8}\log_2\frac{2}{8}$
Split Info = 0.81
Gain Ratio = 0 / 0.81 = 0
Class | Width = Slim | Width = Medium | Width = Fat |
---|---|---|---|
Nail | 2 | 1 | 1 |
Bolt | 0 | 2 | 2 |
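Split Info = $-\frac{2}{8}\log_2\frac{2}{8} - \frac{3}{8}\log_2\frac{3}{8} - \frac{3}{8}\log_2\frac{3}{8}$
Split Info = 1.56
Gain Ratio = 0.31 / 1.56 = 0.20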
The best feature to split the examples is still “Width”, since it has the highest gain ratio of the three features (about 0.20, versus 0 for both “Pointed” and “Threaded”).
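For part (d), a short sketch of the gain-ratio computation follows, assuming the usual definition in which the information gain is divided by the entropy of the child-node sizes (the split information). Function and variable names are illustrative, and each child node is again a `[Nail, Bolt]` count pair.

```python
import math

def entropy_from_counts(counts):
    """Entropy in bits of a distribution given as raw counts; zero counts are skipped."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

def gain_ratio(children):
    """Information gain divided by split information for one candidate split."""
    sizes = [sum(child) for child in children]             # n_i for each child node
    n = sum(sizes)
    parent_counts = [sum(col) for col in zip(*children)]   # class totals in the parent
    weighted = sum(s / n * entropy_from_counts(c) for s, c in zip(sizes, children))
    gain = entropy_from_counts(parent_counts) - weighted
    split_info = entropy_from_counts(sizes)
    return gain / split_info

# One [Nail, Bolt] pair per feature value, read off the tables in part (d)
splits = {
    "Pointed": [[1, 1], [3, 3]],          # Yes, No
    "Threaded": [[3, 3], [1, 1]],         # Yes, No
    "Width": [[2, 0], [1, 2], [1, 2]],    # Slim, Medium, Fat
}
ratios = {name: gain_ratio(children) for name, children in splits.items()}
for name, ratio in ratios.items():
    print(f"{name}: gain ratio = {ratio:.2f}")
```

Dividing by the split information penalises features with many values; here Width still wins because the other two features have zero gain. Note that this sketch would divide by zero for a split that sends every record to a single child, a case that does not occur in this data.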
Question 3
Sample Data
(a) P(A = t)
(b) P(B = f)
(c) P(C = t)
(d) P(B = t | C = t)
(e) P(A = f | C = t)
(f) P(A = t | C = f)
(g) P(A = f, C = t)
(h) P(A = t, C = t)
(i) P(A = t, B = f)
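Each of the quantities above is a marginal, conditional, or joint probability obtained by counting rows of the sample data. The sketch below shows one way to do that counting. The `records` list is a HYPOTHETICAL placeholder, not the assignment's sample data, and the helper `prob` is an illustrative name; substitute the real (A, B, C) rows to get the actual answers.

```python
from fractions import Fraction

# HYPOTHETICAL (A, B, C) rows for illustration only; replace with the real sample data.
records = [
    (True, True, True),
    (True, False, True),
    (False, True, False),
    (False, False, True),
]

def prob(rows, event, given=lambda r: True):
    """Estimate P(event | given) as a fraction of the rows satisfying `given`."""
    pool = [r for r in rows if given(r)]
    if not pool:
        raise ValueError("no rows satisfy the conditioning event")
    return Fraction(sum(1 for r in pool if event(r)), len(pool))

# Marginal, e.g. P(A = t)
p_a_true = prob(records, lambda r: r[0])
# Conditional, e.g. P(B = t | C = t)
p_b_given_c = prob(records, lambda r: r[1], given=lambda r: r[2])
# Joint, e.g. P(A = f, C = t)
p_a_false_c_true = prob(records, lambda r: (not r[0]) and r[2])

print(p_a_true, p_b_given_c, p_a_false_c_true)
```

Using `Fraction` keeps the answers as exact ratios of counts, which matches how such probabilities are normally reported by hand.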