NBDTs can be summarized in three parts:
An induced hierarchy is a hierarchy built from the weight matrix of a model's final fully connected layer. The idea is that each class has its own weight vector in that matrix. Using agglomerative clustering, we iteratively pair classes together, framing the decisions the model makes as a sequence of binary splits (though classes do not always divide naturally into pairs, which is a limitation of this approach). This improves interpretability, since we gain the ability to see which classes are most likely to be grouped together.
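To make this clustering step concrete, here is a minimal sketch using SciPy's hierarchical clustering on the per-class weight vectors of a final fully connected layer. The layer below is a randomly initialized stand-in; in practice the weights come from a pre-trained model, and the original NBDT implementation may differ in its choice of distance metric and linkage method.

```python
import numpy as np
import torch.nn as nn
from scipy.cluster.hierarchy import linkage

# Stand-in for a pre-trained model's final fully connected layer
# (in practice, load your trained model and take model.fc or equivalent).
num_classes, feature_dim = 10, 512
fc = nn.Linear(feature_dim, num_classes)

# Each row of the weight matrix is one class's weight vector.
W = fc.weight.detach().cpu().numpy()              # shape (num_classes, feature_dim)
W = W / np.linalg.norm(W, axis=1, keepdims=True)  # normalize each class vector

# Agglomerative clustering: repeatedly merge the two closest clusters,
# which yields the binary splits of the induced hierarchy.
Z = linkage(W, method="average", metric="euclidean")

# Each row of Z records one merge: [child_a, child_b, distance, cluster_size].
# Classes merged early (small distance) are the ones most likely to be paired.
print(Z)
```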
The induced hierarchy is produced by first loading the weights of a pre-trained model's final fully connected layer, with weight matrix W ∈ ℝ^(D×K). Each class's weight vector ωₖ in W is normalized and used as that leaf node's weight, and each parent's weight is the average of its pair of leaf node weights. Finally, each ancestor's weight is the average of all leaf node weights in its subtree.
In the four-class example here, the only remaining ancestor is the root, so its weight is the average of all leaf weights ω₁, ω₂, ω₃, ω₄.
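As a worked version of this weight assignment, the sketch below builds node weights for four leaf classes: each leaf keeps its normalized weight vector, each parent averages its two children, and the root averages all four leaves. The matrix values are made up purely for illustration.

```python
import numpy as np

# Toy final-layer weight matrix: 4 classes (rows), 3 features (columns).
# In practice these rows come from the pre-trained model's final FC layer.
W = np.array([[1.0, 0.0, 0.0],
              [0.9, 0.1, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.9, 0.1]])

# Leaf weights ω1..ω4: each class's vector, normalized to unit length.
leaves = W / np.linalg.norm(W, axis=1, keepdims=True)

# Parent weights: average of the leaf weights in each paired subtree.
parent_left  = leaves[[0, 1]].mean(axis=0)   # parent of ω1, ω2
parent_right = leaves[[2, 3]].mean(axis=0)   # parent of ω3, ω4

# Root weight: average of all leaf weights in its subtree (here, all four).
root = leaves.mean(axis=0)

print(parent_left, parent_right, root)
```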
Now, to fine-tune the model, we first define a new loss function that makes use of the decision tree structure from the induced hierarchy. We do this by choosing either the hard or the soft tree supervision loss, both of which are defined below.
To fine-tune the model, we wrap the original loss function, in this case CrossEntropyLoss, with the tree supervision loss, where β_t and ω_t are coefficients weighting the original loss and the soft or hard tree loss, respectively. Δ here denotes the probability distributions of the predictions and the labels.
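Below is a minimal sketch of this wrapped loss, assuming the combined objective β_t · L_original + ω_t · L_tree. CrossEntropyLoss stands in for the original loss, and tree_logits is a placeholder for the output of hard or soft inference over the induced hierarchy; the exact tree loss differs in the actual method.

```python
import torch
import torch.nn as nn

def nbdt_loss(logits, tree_logits, labels, beta_t=1.0, omega_t=1.0):
    """Combined loss: beta_t weights the original CrossEntropyLoss,
    omega_t weights the (hard or soft) tree supervision loss."""
    ce = nn.CrossEntropyLoss()
    original_loss = ce(logits, labels)
    # tree_logits is a stand-in for the distribution produced by traversing
    # the induced hierarchy; cross entropy is reused here purely as an illustration.
    tree_loss = ce(tree_logits, labels)
    return beta_t * original_loss + omega_t * tree_loss

# Example usage with random tensors (batch of 8, 10 classes):
logits = torch.randn(8, 10, requires_grad=True)
tree_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = nbdt_loss(logits, tree_logits, labels, beta_t=1.0, omega_t=1.0)
loss.backward()  # fine-tune the backbone with the combined objective
```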