决策树 | Decision Tree
定义 | Definition
决策树是一种监督学习算法,广泛应用于分类和回归任务。它通过递归地分割数据集,构建树形结构,从根节点到叶节点的路径表示决策规则。
A decision tree is a supervised learning algorithm widely used for classification and regression tasks. It constructs a tree-like structure by recursively splitting the dataset, where the paths from the root to the leaf nodes represent decision rules.
主要概念 | Key Concepts
- 节点 (Node): 树的基本组成部分。 The basic building block of a tree.
- 根节点 (Root Node): 树的顶端节点,包含整个数据集的所有样本。 The topmost node of the tree, containing all samples of the dataset.
- 内部节点 (Internal Node): 非叶节点,表示对某个特征的测试。 A non-leaf node that represents a test on a feature.
- 叶节点 (Leaf Node): 最终节点,表示一个类别标签(分类树)或一个数值(回归树)。 A terminal node that represents a class label (classification tree) or a value (regression tree).
- 分支 (Branch): 从一个节点到其子节点的连接,表示该节点测试的一种可能结果。 The connection from a node to a child, representing one possible outcome of the node's test.
构建过程 | Construction Process
- 选择最优特征 (Select the Best Feature): 使用某种度量(如信息增益、基尼指数)选择最能分割数据集的特征。 Select the feature that best splits the dataset according to some metric (e.g., information gain, Gini index).
- 分割数据集 (Split the Dataset): 根据选定特征的取值将数据集分割成子集。 Split the dataset into subsets based on the values of the selected feature.
- 递归构建 (Recursive Construction): 对每个子集递归地应用上述步骤,构建子树。 Recursively apply the above steps to each subset, constructing subtrees.
- 停止条件 (Stopping Criteria): 当满足某个停止条件(如所有样本属于同一类、没有可再分的特征、达到最大深度)时,停止递归。完整流程见本列表后的示意代码。 Stop the recursion when a stopping criterion is met (e.g., all samples belong to the same class, no features are left to split on, the maximum depth is reached). See the sketch after this list for the full loop.
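下面是上述构建过程的一个最小示意(CART 风格,按基尼指数选择二元分割;函数名与数据表示均为示例假设,并非某个库的 API)。 Below is a minimal sketch of the construction process above (CART-style, choosing binary splits by Gini impurity; the function names and data representation are illustrative assumptions, not any library's API).

```python
from collections import Counter

def gini(labels):
    # 基尼指数 Gini(D) = 1 - sum(p_k^2)
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def build_tree(rows, labels, depth=0, max_depth=3):
    # 停止条件:节点纯净或达到最大深度 | stopping criteria: pure node or max depth
    if len(set(labels)) == 1 or depth >= max_depth:
        return Counter(labels).most_common(1)[0][0]  # 叶节点:多数类 | leaf: majority class
    best = None  # (加权基尼指数, 特征下标, 阈值) | (weighted Gini, feature index, threshold)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [l for r, l in zip(rows, labels) if r[f] <= t]
            right = [l for r, l in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            # 分割后的加权不纯度 | weighted impurity after the split
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, f, t)
    if best is None:  # 没有可再分的特征取值 | nothing left to split on
        return Counter(labels).most_common(1)[0][0]
    _, f, t = best
    L = [(r, l) for r, l in zip(rows, labels) if r[f] <= t]
    R = [(r, l) for r, l in zip(rows, labels) if r[f] > t]
    return {"feature": f, "threshold": t,
            "left":  build_tree([r for r, _ in L], [l for _, l in L], depth + 1, max_depth),
            "right": build_tree([r for r, _ in R], [l for _, l in R], depth + 1, max_depth)}

# 用法示例 | usage: a tiny two-feature dataset
tree = build_tree([[2, 3], [1, 1], [3, 1], [5, 4]], ["A", "A", "B", "B"])
print(tree)  # {'feature': 0, 'threshold': 2, 'left': 'A', 'right': 'B'}
```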
常用度量 | Common Metrics
- 信息增益 (Information Gain): 衡量按某特征分割数据集后熵(不确定性)的减少量。定义为 Measures the reduction in entropy (uncertainty) after splitting the dataset on a feature. Defined as

  $$IG(D, A) = H(D) - \sum_{v \in \mathrm{Values}(A)} \frac{|D_v|}{|D|}\, H(D_v), \qquad H(D) = -\sum_{k} p_k \log_2 p_k$$

- 基尼指数 (Gini Index): 衡量数据集的不纯度。定义为 Measures the impurity of the dataset. Defined as

  $$\mathrm{Gini}(D) = 1 - \sum_{k} p_k^2$$

  其中 $p_k$ 为样本属于第 $k$ 类的比例。这两个度量的实现见本列表后的示意代码。 Here $p_k$ is the proportion of samples belonging to class $k$. See the sketch after this list for implementations of both metrics.
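下面用 Python 直接实现这两个度量并做一次简单校验(纯示意,非任何库的 API)。 Below is a direct Python implementation of these two metrics with a quick sanity check (purely illustrative, not any library's API).

```python
import math
from collections import Counter

def entropy(labels):
    # H(D) = -sum(p_k * log2(p_k))
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    # Gini(D) = 1 - sum(p_k^2)
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, children):
    # IG = H(D) - sum(|D_v|/|D| * H(D_v));children 为按特征取值分组后的标签列表
    # children: the label lists of the subsets produced by the split
    n = len(parent)
    return entropy(parent) - sum(len(g) / n * entropy(g) for g in children)

# 把混合的 ["A","A","B","B"] 分成两个纯子集,信息增益恰为 1 比特
# splitting the mixed ["A","A","B","B"] into two pure halves yields exactly 1 bit
print(information_gain(list("AABB"), [list("AA"), list("BB")]))  # 1.0
print(gini(list("AABB")))  # 0.5
```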
排序算法的决策树 | Decision Tree for Sorting Algorithms
决策树也可以用来表示排序算法中的决策过程,例如比较排序算法。以下是一些常见排序算法的决策树:
Decision trees can also be used to represent the decision-making process in sorting algorithms, such as comparison-based sorting algorithms. Here are some decision trees for common sorting algorithms:
冒泡排序 | Bubble Sort
冒泡排序通过重复交换相邻元素来排序。以下是一个包含三个元素的冒泡排序决策树:
Bubble Sort sorts by repeatedly swapping adjacent elements. Here is a decision tree for Bubble Sort with three elements:
# 假设 a、b、c 为三个待排序的值 | assuming a, b, c hold the three values to sort
if a > b:
    a, b = b, a  # 比较并交换相邻元素 | compare and swap the adjacent pair
if b > c:
    b, c = c, b
if a > b:
    a, b = b, a  # 第二趟回到第一对 | a second pass revisits the first pair
插入排序 | Insertion Sort
插入排序通过逐步构建有序序列来排序。以下是一个包含三个元素的插入排序决策树:
Insertion Sort sorts by progressively building a sorted sequence. Here is a decision tree for Insertion Sort with three elements:
# 假设 a、b、c 为三个待排序的值 | assuming a, b, c hold the three values to sort
if b < a:  # 将 b 插入有序前缀 [a] | insert b into the sorted prefix [a]
    a, b = b, a
if c < b:  # 将 c 插入有序前缀 [a, b] | insert c into the sorted prefix [a, b]
    b, c = c, b
    if b < a:  # 交换后继续向前比较 | after the swap, keep comparing toward the front
        a, b = b, a
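下面把上面两段比较-交换序列包装成函数,并在全部 $3! = 6$ 种输入排列上验证——每种排列恰好对应决策树中一条从根到叶的路径(示意代码,函数名为示例假设)。 Below, the two compare-and-swap sequences above are wrapped into functions and verified on all $3! = 6$ input permutations; each permutation follows exactly one root-to-leaf path in the decision tree (illustrative sketch; the function names are assumptions).

```python
from itertools import permutations

def bubble3(a, b, c):
    # 冒泡排序决策树的三次比较 | the three comparisons of the Bubble Sort tree
    if a > b: a, b = b, a
    if b > c: b, c = c, b
    if a > b: a, b = b, a
    return a, b, c

def insertion3(a, b, c):
    # 插入排序决策树的比较 | the comparisons of the Insertion Sort tree
    if b < a: a, b = b, a
    if c < b:
        b, c = c, b
        if b < a: a, b = b, a
    return a, b, c

# 6 种排列对应决策树的 6 个叶节点 | 6 permutations = 6 leaves of the decision tree
for p in permutations((1, 2, 3)):
    assert bubble3(*p) == (1, 2, 3)
    assert insertion3(*p) == (1, 2, 3)
print("all 3! = 6 permutations sorted correctly")
```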
树高计算 | Tree Height Calculation
最短路径和最长路径 | Shortest Path and Longest Path
在决策树中,最短路径和最长路径分别对应于从根节点到叶节点的最浅和最深路径;树的高度是最长路径的长度(以边数计)。下面给出同时计算两者的示意代码。
In a decision tree, the shortest and longest paths correspond to the shallowest and deepest root-to-leaf paths, respectively; the height of the tree is the length (in edges) of the longest path. A sketch computing both follows.
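这段示意假设树采用前文构建示意中的字典表示(非叶节点为含 "left"/"right" 的 dict,叶节点为任意其他值)。 This sketch assumes the dictionary representation from the construction sketch earlier (internal nodes are dicts with "left"/"right"; leaves are any other value).

```python
def min_max_depth(node):
    # 返回 (最短路径, 最长路径),均以边数计
    # returns (shortest, longest) root-to-leaf path lengths, in edges
    if not isinstance(node, dict):  # 叶节点 | leaf node
        return 0, 0
    lmin, lmax = min_max_depth(node["left"])
    rmin, rmax = min_max_depth(node["right"])
    return 1 + min(lmin, rmin), 1 + max(lmax, rmax)

# 用法示例 | usage
tree = {"feature": 0, "threshold": 2,
        "left": "A",
        "right": {"feature": 1, "threshold": 1, "left": "B", "right": "C"}}
shortest, longest = min_max_depth(tree)
print(shortest, longest)  # 1 2 —— 树高为 2 | the tree height is 2
```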
树高的阶乘近似 | Factorial Approximation of Tree Height
对于完全二叉树,树的高度 $h$ 与节点数 $n$ 的关系可以近似为:
For a complete binary tree, the height $h$ in relation to the number of nodes $n$ can be approximated as:

$$h \approx \log_2 n$$

这一近似假设每个层次都完全填满:高度为 $h$ 的满二叉树共有 $n = 2^{h+1} - 1$ 个节点,层数越多,高度越高。例如 $n = 15$ 时,$h = \log_2(15 + 1) - 1 = 3$。
This approximation assumes each level is fully filled: a full binary tree of height $h$ has $n = 2^{h+1} - 1$ nodes, and the more levels, the greater the height. For example, with $n = 15$, $h = \log_2(15 + 1) - 1 = 3$.
使用斯特林公式的推导 | Derivation Using Stirling’s Approximation
斯特林公式是一种近似计算大阶乘的方法,对于阶乘 $n!$,可以用以下公式近似:
Stirling's approximation is a method for approximating large factorials. The factorial $n!$ can be approximated as:

$$n! \approx \sqrt{2\pi n}\left(\frac{n}{e}\right)^n$$
对于比较排序的决策树(一棵二叉树),假设待排序的元素有 $n$ 个,树的高度 $h$ 与 $n$ 的关系可以通过阶乘近似来推导:
For the decision tree of a comparison sort (a binary tree), assuming there are $n$ elements to sort, the relationship between the height $h$ and $n$ can be derived using the factorial approximation:

- 首先,高度为 $h$ 的二叉树的叶节点数与高度的关系为: First, the number of leaves of a binary tree in relation to its height $h$ is given by:

  $$N_{\text{leaves}} \le 2^h$$

- 排序决策树必须为 $n$ 个元素的每一种排列各保留一个叶节点,因此 $2^h \ge n!$;对该公式取对数,可以得到高度 $h$ 的下界: A sorting decision tree must reserve one leaf for each permutation of the $n$ elements, so $2^h \ge n!$; taking the logarithm of this equation gives a lower bound on the height $h$:

  $$h \ge \log_2(n!)$$

- 由于 $n!$ 是一个非常大的数,因此我们使用斯特林公式来近似计算它的对数: Since $n!$ is a very large number, we use Stirling's approximation to estimate its logarithm:

  $$\log_2(n!) \approx n\log_2 n - n\log_2 e + \frac{1}{2}\log_2(2\pi n)$$

通过这些推导,可以更清晰地理解树高的对数近似公式的合理性:
Through these derivations, we can see more clearly why the logarithmic approximation for tree height is reasonable:

$$h \ge \log_2(n!) = \Theta(n \log n)$$
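下面用 Python 对 $n = 20$ 做一次数值验证,比较精确值、斯特林近似与主导项 $n\log_2 n$(纯示意)。 Below is a numerical check in Python for $n = 20$, comparing the exact value, the Stirling approximation, and the leading term $n\log_2 n$ (purely illustrative).

```python
import math

n = 20
exact = math.log2(math.factorial(n))  # log2(n!) 的精确值 | exact value
stirling = 0.5 * math.log2(2 * math.pi * n) + n * math.log2(n / math.e)
leading = n * math.log2(n)            # 主导项 | leading term

print(f"log2({n}!)  = {exact:.2f}")     # ≈ 61.08
print(f"Stirling   = {stirling:.2f}")   # ≈ 61.07,与精确值非常接近 | very close
print(f"n*log2(n)  = {leading:.2f}")    # ≈ 86.44,量级相同 | same order of growth
```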
示例代码 | Example Code
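这里给出一个使用 scikit-learn 的最小示例(假设已安装 scikit-learn;数据集与参数仅为演示选择),在内置的鸢尾花数据集上训练一棵分类决策树并打印其结构。 Here is a minimal example using scikit-learn (assuming scikit-learn is installed; the dataset and parameters are chosen only for demonstration) that trains a classification tree on the built-in iris dataset and prints its structure.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# 加载数据并划分训练/测试集 | load the data and split into train/test sets
data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.25, random_state=0)

# 以基尼指数为分割度量,限制最大深度以防过拟合
# use Gini impurity as the split metric; cap the depth to limit overfitting
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
clf.fit(X_train, y_train)

print("test accuracy:", clf.score(X_test, y_test))
# 以文本形式打印学到的决策规则 | print the learned decision rules as text
print(export_text(clf, feature_names=list(data.feature_names)))
```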
常见问题 | Common Issues
- 过拟合 (Overfitting): 决策树容易过拟合训练数据;常见对策包括剪枝、设置最大深度、使用集成方法(如随机森林)等,见本列表后的示意代码。 Decision trees are prone to overfitting the training data; common remedies include pruning, setting a maximum depth, and using ensemble methods (e.g., random forests); see the sketch after this list.
- 高维数据 (High-Dimensional Data): 在高维数据上,决策树可能表现不佳,需要结合特征选择或降维方法。 Decision trees may perform poorly on high-dimensional data and may need to be combined with feature selection or dimensionality reduction techniques.
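下面用 scikit-learn 演示限制深度与代价复杂度剪枝对过拟合的影响(假设已安装 scikit-learn;数据集与 ccp_alpha 取值仅为演示假设)。 Below, scikit-learn is used to show how capping the depth and cost-complexity pruning affect overfitting (assuming scikit-learn is installed; the dataset and the ccp_alpha value are demonstration assumptions).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "full (no pruning)": DecisionTreeClassifier(random_state=0),
    "max_depth=4":       DecisionTreeClassifier(max_depth=4, random_state=0),
    "ccp_alpha=0.01":    DecisionTreeClassifier(ccp_alpha=0.01, random_state=0),
}

# 未剪枝的树通常训练集接近满分、测试集更差——过拟合的典型特征
# the unpruned tree typically scores near-perfect on train but worse on test: overfitting
for name, m in models.items():
    m.fit(X_train, y_train)
    print(f"{name:18s} train={m.score(X_train, y_train):.3f} "
          f"test={m.score(X_test, y_test):.3f}")
```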