Reference Document: DataMining Assignment1.pdf

Question 1: Calculate Entropy

Formula

Entropy = -Σ_i P(x = i) log2 P(x = i), where 0 log2 0 is taken as 0

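(a) P(x = True) = 0.5; P(x = False) = 0.5

Entropy = -0.5 log2(0.5) - 0.5 log2(0.5) = 1

(b) P(x = True) = 1; P(x = False) = 0

Entropy = -1 log2(1) - 0 log2(0) = 0 (taking 0 log2 0 = 0)

(c) P(x = True) = 0.2; P(x = False) = 0.8

Entropy = -0.2 log2(0.2) - 0.8 log2(0.8) = 0.72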
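
The three cases above can be checked numerically. The following is a minimal sketch, assuming base-2 logarithms and the usual convention that a zero-probability outcome contributes nothing; the function name `entropy` is chosen here for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; zero-probability terms are skipped (0 log2 0 = 0)."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

# The three distributions from Question 1.
cases = {"a": (0.5, 0.5), "b": (1.0, 0.0), "c": (0.2, 0.8)}
for label, probs in cases.items():
    print(f"({label}) entropy = {entropy(probs):.4f}")
```

A uniform distribution, as in (a), gives the maximum of 1 bit for two outcomes, while a certain outcome, as in (b), gives 0.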

Question 2

Sample Data

(a)

Calculate the Entropy of each feature

Pointed

Probabilities:

No = 6/8
Yes = 2/8

Entropy = -(6/8) log2(6/8) - (2/8) log2(2/8)

Entropy = 0.81

Threaded

Probabilities:

No = 2/8
Yes = 6/8

Entropy = -(2/8) log2(2/8) - (6/8) log2(6/8)

Entropy = 0.81

Width

Probabilities:

Slim = 2/8
Medium = 3/8
Fat = 3/8

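Entropy = -(2/8) log2(2/8) - (3/8) log2(3/8) - (3/8) log2(3/8)

Entropy = 1.56 (greater than 1 because the feature has 3 values; the maximum for 3 values is log2 3 = 1.58)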
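
The feature entropies can be reproduced from the value counts listed above. This is a rough sketch, not part of the original solution; `entropy_from_counts` and `feature_counts` are names introduced here. It also prints the upper bound log2(k) for a feature with k values, which is why Width can exceed 1.

```python
import math

def entropy_from_counts(counts):
    """Entropy in bits of the distribution implied by a list of counts."""
    counts = list(counts)
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

# Value counts per feature, taken from the probabilities listed in part (a).
feature_counts = {
    "Pointed": {"No": 6, "Yes": 2},
    "Threaded": {"No": 2, "Yes": 6},
    "Width": {"Slim": 2, "Medium": 3, "Fat": 3},
}

for name, counts in feature_counts.items():
    h = entropy_from_counts(counts.values())
    h_max = math.log2(len(counts))  # entropy of a uniform distribution over k values
    print(f"{name}: entropy = {h:.4f} (maximum for {len(counts)} values = {h_max:.4f})")
```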

(b)

Calculate the Gini Index of each feature

Pointed

Probabilities:

No = 6/8
Yes = 2/8

Gini = 1 - (6/8)^2 - (2/8)^2

Gini = 0.375

Threaded

Probabilities:

No = 2/8
Yes = 6/8

Gini = 1 - (2/8)^2 - (6/8)^2

Gini = 0.375

Width

Probabilities:

Slim = 2/8
Medium = 3/8
Fat = 3/8

Gini = 1 - (2/8)^2 - (3/8)^2 - (3/8)^2

Gini = 0.66
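
The Gini computations can be checked the same way. A minimal sketch, assuming the standard definition Gini = 1 - Σ_i p_i^2; the function name `gini` is illustrative.

```python
def gini(counts):
    """Gini index of the distribution implied by a list of counts."""
    counts = list(counts)
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# Value counts per feature, taken from the probabilities listed in part (b).
feature_counts = {
    "Pointed": [6, 2],      # No, Yes
    "Threaded": [2, 6],     # No, Yes
    "Width": [2, 3, 3],     # Slim, Medium, Fat
}

for name, counts in feature_counts.items():
    print(f"{name}: Gini = {gini(counts):.4f}")
```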

(c)

Calculate the Information Gain of each feature and then report the best feature to split the examples
using the information gain metric

Parent / Original Entropy

Probabilities:

Nail = 4/8
Bolt = 4/8

Original Entropy = -(4/8) log2(4/8) - (4/8) log2(4/8)

Original Entropy = P = 1

Child / Split Entropies

Pointed
Class  Yes  No
Nail   1    3
Bolt   1    3

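E(Pointed = Yes) = -P(Nail) log2 P(Nail) - P(Bolt) log2 P(Bolt)

E(Pointed = Yes) = -(1/2) log2(1/2) - (1/2) log2(1/2)

E(Pointed = Yes) = 1

E(Pointed = No) = -P(Nail) log2 P(Nail) - P(Bolt) log2 P(Bolt)

E(Pointed = No) = -(3/6) log2(3/6) - (3/6) log2(3/6)

E(Pointed = No) = 1

Weighted Average = (2/8) E(Pointed = Yes) + (6/8) E(Pointed = No)

Weighted Average = M = (2/8)(1) + (6/8)(1) = 1

Gain = P - M

Gain = 1 - 1 = 0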

Threaded
Class  Yes  No
Nail   3    1
Bolt   3    1

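E(Threaded = Yes) = -P(Nail) log2 P(Nail) - P(Bolt) log2 P(Bolt)

E(Threaded = Yes) = -(3/6) log2(3/6) - (3/6) log2(3/6)

E(Threaded = Yes) = 1

E(Threaded = No) = -P(Nail) log2 P(Nail) - P(Bolt) log2 P(Bolt)

E(Threaded = No) = -(1/2) log2(1/2) - (1/2) log2(1/2)

E(Threaded = No) = 1

Weighted Average = (6/8) E(Threaded = Yes) + (2/8) E(Threaded = No)

Weighted Average = M = (6/8)(1) + (2/8)(1) = 1

Gain = P - M

Gain = 1 - 1 = 0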

Width
Class  Slim  Medium  Fat
Nail   2     1       1
Bolt   0     2       2

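E(Width = Slim) = -P(Nail) log2 P(Nail) - P(Bolt) log2 P(Bolt)

E(Width = Slim) = -(2/2) log2(2/2) - (0/2) log2(0/2)

E(Width = Slim) = 0

E(Width = Medium) = -P(Nail) log2 P(Nail) - P(Bolt) log2 P(Bolt)

E(Width = Medium) = -(1/3) log2(1/3) - (2/3) log2(2/3)

E(Width = Medium) = 0.92

E(Width = Fat) = -P(Nail) log2 P(Nail) - P(Bolt) log2 P(Bolt)

E(Width = Fat) = -(1/3) log2(1/3) - (2/3) log2(2/3)

E(Width = Fat) = 0.92

Weighted Average = (2/8) E(Width = Slim) + (3/8) E(Width = Medium) + (3/8) E(Width = Fat)

Weighted Average = M = (2/8)(0) + (3/8)(0.92) + (3/8)(0.92) = 0.69

Gain = P - M

Gain = 1 - 0.69 = 0.31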

The best feature to split the examples is “Width” since it has the highest gain using a multi-way split.
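
The information gain comparison can be sketched in a few lines. The code below is an illustrative check under the counts in the tables of part (c); `information_gain` and `splits` are names introduced here, and each child node is given as a [Nail, Bolt] count pair.

```python
import math

def entropy(counts):
    """Entropy in bits of the distribution implied by a list of counts."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(children):
    """Parent entropy minus the size-weighted average of the child entropies."""
    n = sum(sum(child) for child in children)
    parent = [sum(col) for col in zip(*children)]  # class totals across all children
    weighted = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Each child is [Nail count, Bolt count], read from the tables in part (c).
splits = {
    "Pointed": [[1, 1], [3, 3]],           # Yes, No
    "Threaded": [[3, 3], [1, 1]],          # Yes, No
    "Width": [[2, 0], [1, 2], [1, 2]],     # Slim, Medium, Fat
}

gains = {name: information_gain(children) for name, children in splits.items()}
best = max(gains, key=gains.get)
for name, g in gains.items():
    print(f"{name}: gain = {g:.4f}")
print("best feature:", best)
```

Pointed and Threaded leave the Nail/Bolt ratio at 50/50 in every child, so their gain is 0; only Width produces a purer child (Slim).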

(d)

Calculate the gain ratio for each feature and then report the best feature to split the examples using
the gain ratio metric.

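Gain Ratio = Gain / Split Info

Split Info = -Σ_i (n_i / n) log2(n_i / n)

where n_i is the no. of records in child node i and n is the no. of records in the parent node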

Pointed
Class  Yes  No
Nail   1    3
Bolt   1    3

Split Info = -(2/8) log2(2/8) - (6/8) log2(6/8)

Split Info = 0.81

Gain Ratio = Gain / Split Info = 0 / 0.81 = 0

Threaded
Class  Yes  No
Nail   3    1
Bolt   3    1

Split Info = -(6/8) log2(6/8) - (2/8) log2(2/8)

Split Info = 0.81

Gain Ratio = Gain / Split Info = 0 / 0.81 = 0

Width
Class  Slim  Medium  Fat
Nail   2     1       1
Bolt   0     2       2

Split Info = -(2/8) log2(2/8) - (3/8) log2(3/8) - (3/8) log2(3/8)

Split Info = 1.56

Gain Ratio = Gain / Split Info = 0.31 / 1.56 = 0.20

The best feature to split the examples is still “Width” since it has the highest gain ratio of the three features.
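
The gain ratio can be checked with a short sketch that divides the information gain by the split information, computed here as the entropy of the child-node sizes. The names `gain_ratio` and `splits` are introduced for illustration, and the code assumes every split has at least two non-empty children so that the split information is non-zero.

```python
import math

def entropy(counts):
    """Entropy in bits of the distribution implied by a list of counts."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

def gain_ratio(children):
    """Information gain divided by split info for a multi-way split."""
    sizes = [sum(child) for child in children]      # n_i: records per child node
    n = sum(sizes)
    parent = [sum(col) for col in zip(*children)]   # class totals in the parent
    weighted = sum(size / n * entropy(child) for size, child in zip(sizes, children))
    gain = entropy(parent) - weighted
    split_info = entropy(sizes)                     # -sum (n_i/n) log2 (n_i/n)
    return gain / split_info

# Each child is [Nail count, Bolt count], read from the tables in part (d).
splits = {
    "Pointed": [[1, 1], [3, 3]],
    "Threaded": [[3, 3], [1, 1]],
    "Width": [[2, 0], [1, 2], [1, 2]],
}

ratios = {name: gain_ratio(children) for name, children in splits.items()}
best = max(ratios, key=ratios.get)
for name, r in ratios.items():
    print(f"{name}: gain ratio = {r:.4f}")
print("best feature:", best)
```

Split info penalises splits with many children, so Width's ratio is lower than its raw gain, but it still ranks first because the other two features have zero gain.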

Question 3

Sample Data

(a) P(A = t)

(b) P(B = f)

(c) P(C = t)

(d) P(B = t | C = t)

(e) P(A = f | C = t)

(f) P(A = t | C = f)

(g) P(A = f, C = t)

(h) P(A = t, C = t)

(i) P(A = t, B = f)
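
Marginal, conditional and joint probabilities of this kind are all obtained by counting matching records. The sketch below shows the counting only: the rows in it are HYPOTHETICAL, made up purely for illustration, and are not the assignment's sample data, so its outputs are not answers to (a) through (i). The helper `prob` and the column map `COLS` are likewise names introduced here.

```python
# HYPOTHETICAL records (A, B, C); NOT the assignment's sample data.
rows = [
    ("t", "t", "t"),
    ("t", "f", "t"),
    ("f", "t", "f"),
    ("f", "f", "t"),
    ("t", "t", "f"),
]
COLS = {"A": 0, "B": 1, "C": 2}

def matches(row, cond):
    """True if the row satisfies every column=value pair in cond."""
    return all(row[COLS[col]] == val for col, val in cond.items())

def prob(event, given=None):
    """P(event | given) by counting; both arguments are dicts such as {"A": "t"}."""
    pool = [r for r in rows if matches(r, given or {})]
    hits = [r for r in pool if matches(r, event)]
    return len(hits) / len(pool)

print("P(A = t)         =", prob({"A": "t"}))
print("P(B = t | C = t) =", prob({"B": "t"}, given={"C": "t"}))
print("P(A = t, C = t)  =", prob({"A": "t", "C": "t"}))
```

A joint probability is passed as one event with several conditions, while a conditional probability restricts the pool of rows first and then counts within it.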