the data vectors stored as rows of a matrix
the class array, where y_i = class for row i of the matrix x
the string array holding the feature names
the boolean array indicating whether the corresponding feature is continuous
the value count array indicating number of distinct values per feature
the number of classes
Class that contains information for a tree node.
Given the next most distinguishing feature/attribute, extend the decision tree.
the optimal feature and its gain
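A minimal sketch of how such a selection step might look (illustrative only; `gainOf` is a hypothetical stand-in for the class's gain computation, not its actual API):

```scala
// Sketch: evaluate the gain of every feature index and return the best
// feature together with its gain. `gainOf` is a hypothetical helper.
def bestFeature(numFeatures: Int, gainOf: Int => Double): (Int, Double) = {
  val f = (0 until numFeatures).maxBy(gainOf)
  (f, gainOf(f))
}
```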
Given a continuous feature, adjust its threshold to improve gain.
the feature index to consider
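As a rough sketch of the idea (not the class's actual method), candidate thresholds can be taken as midpoints between consecutive sorted values, keeping the one whose induced binary split scores best; `gainOfSplit` is a hypothetical helper:

```scala
// Sketch of threshold adjustment for a continuous feature (illustrative):
// try midpoints between consecutive distinct sorted values and keep the one
// whose binary split (<= threshold vs. > threshold) yields the highest gain.
// Assumes at least two distinct values in the column.
def bestThreshold(col: Array[Double], gainOfSplit: Double => Double): Double = {
  val vals       = col.distinct.sorted
  val candidates = vals.sliding(2).map(pair => (pair(0) + pair(1)) / 2.0)
  candidates.maxBy(gainOfSplit)
}
```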
Given a data vector z, classify it returning the class number (0, ..., k-1) by following a decision path from the root to a leaf.
the data vector to classify (may include continuous features)
Given a data vector z, classify it returning the class number (0, ..., k-1) by following a decision path from the root to a leaf.
the data vector to classify (purely discrete features)
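For illustration only, a simplified node representation and path-following classifier might look like this (the class's actual Node type is not shown here, so these case classes are assumptions); it covers both the discrete and the continuous case:

```scala
// Hypothetical node representation (not the class's actual Node): an internal
// node either branches on a discrete value or compares a continuous feature
// against a threshold; a leaf stores the class number.
sealed trait DNode
case class Leaf(cls: Int) extends DNode
case class Discrete(f: Int, branch: Map[Int, DNode]) extends DNode
case class Continuous(f: Int, threshold: Double, le: DNode, gt: DNode) extends DNode

// Follow the decision path from the root down to a leaf for data vector z.
def classify(root: DNode, z: Array[Double]): Int = root match {
  case Leaf(cls)                  => cls
  case Discrete(f, branch)        => classify(branch(z(f).toInt), z)
  case Continuous(f, thr, le, gt) => classify(if (z(f) <= thr) le else gt, z)
}
```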
Given a k-dimensional probability vector, compute its entropy (a measure of disorder).
the probability vector (e.g., (0, 1) -> 0, (.5, .5) -> 1)
http://en.wikipedia.org/wiki/Entropy_%28information_theory%29
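A minimal sketch of the computation (assuming base-2 logarithms, so entropy is measured in bits; `log2` is an assumed helper, not part of the class):

```scala
// Sketch of entropy H(p) = -sum_i p_i * log2(p_i), with 0 * log2(0) taken as 0.
def log2(x: Double): Double = math.log(x) / math.log(2.0)

def entropy(p: Array[Double]): Double =
  -p.filter(_ > 0.0).map(pi => pi * log2(pi)).sum

entropy(Array(0.0, 1.0))   // = 0.0 (perfect order)
entropy(Array(0.5, 0.5))   // = 1.0 (maximal disorder)
```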
Show the flaw by printing the error message.
the method where the error occurred
the error message
Given a feature column (e.g., 2 (Humidity)) and a value (e.g., 1 (High)), use the frequency of occurrence of the value for each classification (e.g., 0 (no), 1 (yes)) to estimate k probabilities. Also, determine the fraction of training cases where the feature has this value (e.g., fraction where Humidity is High = 7/14).
a feature column to consider (e.g., Humidity)
one of the possible values for this feature (e.g., 1 (High))
flag indicating whether a continuous feature is being calculated
the threshold for the continuous feature
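A sketch of the probability estimation for the discrete case (the signature is an assumption, not the actual API):

```scala
// Among training cases where feature column `col` equals `value`, count each
// class to estimate k probabilities; also return the fraction of cases with
// this value (e.g., Humidity = High in 7 of 14 cases -> 0.5).
def frequency(col: Array[Int], y: Array[Int], value: Int, k: Int): (Double, Array[Double]) = {
  val rows  = col.indices.filter(i => col(i) == value)
  val count = Array.ofDim[Double](k)
  for (i <- rows) count(y(i)) += 1.0
  val prob  = if (rows.nonEmpty) count.map(_ / rows.size) else count
  (rows.size.toDouble / col.length, prob)
}
```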
Compute the information gain due to using the values of a feature/attribute to distinguish the training cases (e.g., how well does Humidity with its values Normal and High indicate whether one will play tennis).
the feature to consider (e.g., 2 (Humidity))
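Building on the entropy and frequency sketches above, the gain of a feature can be sketched as the drop in disorder it produces (names are illustrative, not the class's API):

```scala
// Sketch: gain(f) = H(y) - sum over values v of P(f = v) * H(y | f = v)
def gain(x: Array[Array[Int]], y: Array[Int], f: Int, vc: Int, k: Int): Double = {
  val col  = x.map(row => row(f))                                   // feature column f
  val base = entropy(Array.tabulate(k)(c => y.count(_ == c).toDouble / y.length))
  val cond = (0 until vc).map { v =>
    val (frac, prob) = frequency(col, y, v, k)
    frac * entropy(prob)                                            // weighted disorder
  }.sum
  base - cond
}
```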
Return new x matrix and y array for next step of constructing decision tree.
the feature index
one of the feature's values
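A sketch of forming the reduced training set (illustrative; the actual method may also drop column f for discrete features):

```scala
// Keep only the rows where feature f takes value v, for both x and y.
def nextXY(x: Array[Array[Int]], y: Array[Int], f: Int, v: Int): (Array[Array[Int]], Array[Int]) = {
  val idx = x.indices.filter(i => x(i)(f) == v)
  (idx.map(i => x(i)).toArray, idx.map(i => y(i)).toArray)
}
```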
Print out the decision tree using Breadth First Search (BFS).
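A BFS print over the hypothetical DNode sketch above could look like this (a sketch, not the class's actual traversal):

```scala
import scala.collection.mutable.Queue

// Visit nodes level by level using a FIFO queue, tracking each node's depth.
def printTree(root: DNode): Unit = {
  val q = Queue((root, 0))
  while (q.nonEmpty) {
    val (node, depth) = q.dequeue()
    node match {
      case Leaf(cls) => println(s"depth $depth: Leaf(class = $cls)")
      case Discrete(f, branch) =>
        println(s"depth $depth: Node(feature = $f)")
        for (child <- branch.values) q.enqueue((child, depth + 1))
      case Continuous(f, thr, le, gt) =>
        println(s"depth $depth: Node(feature = $f <= $thr)")
        q.enqueue((le, depth + 1)); q.enqueue((gt, depth + 1))
    }
  }
}
```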
Train the classifier, i.e., determine which feature provides the most information gain and select it as the root of the decision tree.
the data vectors stored as rows of a matrix
the class array, where y_i = class for row i of the matrix x
This class implements a Decision Tree classifier using the C4.5 algorithm. The classifier is trained using a data matrix x and a classification vector y. Each data vector in the matrix is classified into one of k classes numbered 0, ..., k-1. Each column in the matrix represents a feature (e.g., Humidity). The vc array gives the number of distinct values per feature (e.g., 2 for Humidity).
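A hypothetical usage sketch with a tiny play-tennis-style dataset; the constructor call is commented out because the actual signature and types (e.g., matrix/vector classes) may differ from these plain arrays:

```scala
// Hypothetical setup, assuming a constructor matching the parameters
// documented above (the actual signature and types may differ):
val x      = Array(Array(0.0, 1.0),        // rows: (Outlook, Humidity)
                   Array(1.0, 1.0),
                   Array(1.0, 0.0))
val y      = Array(0, 1, 1)                // class per row: 0 = no, 1 = yes
val fn     = Array("Outlook", "Humidity")  // feature names
val isCont = Array(false, false)           // both features are discrete here
val vc     = Array(2, 2)                   // distinct values per feature
val k      = 2                             // number of classes
// val tree = new DecisionTreeC45(x, y, fn, isCont, vc, k)
// tree.train(); tree.printTree(); println(tree.classify(Array(0.0, 1.0)))
```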