Class imbalance in Classification 王羚宇 2017/03/28
Class imbalance problem
Usual assumption: positive and negative samples are present in similar numbers.
What if we have 998 negative samples and only 2 positive samples? This is common in the real world.
A classifier that always predicts "negative" achieves 99.8% accuracy on the training set. But is that what we want?
How can we solve this?
Class imbalance-rescaling
Take linear classification as an example: y = w^T x + b.
Compare y with a threshold (0.5):
if y_predict > 0.5: positive
if y_predict < 0.5: negative
Assumption behind the 0.5 threshold: we have similar numbers of positive and negative samples; treating y as a score for the positive class, this means predicting positive when y / (1 - y) > 1.
With m+ positive samples and m- negative samples, the rule becomes: predict positive when y / (1 - y) > m+ / m-.
Class imbalance-rescaling
Rescaling (threshold-moving): we let y' / (1 - y') = (y / (1 - y)) × (m- / m+) and judge the class by y' with the usual 0.5 threshold.
Not always practical in reality: it rests on the assumption that the class distribution of unseen samples can be estimated from the training set.
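Below is a minimal NumPy sketch of this threshold-moving rule; the function name and the toy score values are illustrative assumptions, not part of the slides.

```python
import numpy as np

def rescaled_predict(y_score, n_pos, n_neg):
    """Threshold-moving from the slide: predict positive when y/(1-y) > m+/m-."""
    odds = y_score / np.clip(1.0 - y_score, 1e-12, None)   # y / (1 - y)
    return (odds > n_pos / n_neg).astype(int)

# Toy example: 998 negatives, 2 positives (hypothetical scores, for illustration).
rng = np.random.default_rng(0)
y_score = rng.uniform(0.0, 0.001, size=1000)   # negatives: scores near the class prior
y_score[-2:] = [0.30, 0.45]                    # positives: well above the prior,
                                               # yet still below 0.5
naive = (y_score > 0.5).astype(int)            # fixed 0.5 threshold misses both positives
moved = rescaled_predict(y_score, n_pos=2, n_neg=998)
print(naive[-2:], moved[-2:])                  # [0 0] vs. [1 1]
```

This is equivalent to computing y' as on the slide and keeping the 0.5 threshold: y' > 0.5 exactly when y / (1 - y) > m+ / m-.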
Methods to solve class imbalance
Oversampling: add some minority-class samples first and then learn, e.g. SMOTE.
Undersampling: remove some majority-class samples first and then learn, e.g. ENN (Edited Nearest Neighbor), EasyEnsemble.
Threshold-moving: use rescaling.
Oversampling cannot simply replicate minority samples, otherwise it leads to severe overfitting; SMOTE instead produces extra minority samples by interpolating between existing ones.
Undersampling by randomly discarding majority samples may throw away important information about the majority class.
With plain replication oversampling, the copies of a sample are all "tied together", so the classifier may learn overly specific rules for each replicated sample; accuracy on the training set looks high, but performance on unseen samples is poor.
Oversampling-SMOTE
SMOTE: a state-of-the-art resampling approach. SMOTE stands for Synthetic Minority Oversampling Technique.
For each minority sample:
Find its k nearest minority neighbors.
Randomly select j of these neighbors.
Randomly generate synthetic samples along the lines joining the minority sample and its j selected neighbors (j depends on the amount of oversampling desired).
In detail:
(1) For each minority sample x, compute its k nearest neighbors within the minority class, using Euclidean distance.
(2) Set an oversampling rate N according to the imbalance ratio; for each minority sample x, randomly choose some of its k nearest neighbors, say xn.
(3) For each chosen neighbor xn, build a new sample by interpolating between x and xn, i.e. x_new = x + rand(0, 1) × (xn − x).
Letting the minority samples themselves control where the synthetic samples are generated balances the dataset, and it also mitigates the overfitting caused by overly small decision regions.
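A minimal NumPy sketch of the steps above, assuming Euclidean distance and drawing one neighbor per synthetic sample instead of j at once; all names are illustrative. The imbalanced-learn library offers a production implementation (imblearn.over_sampling.SMOTE).

```python
import numpy as np

def smote(X_min, k=5, n_new=100, rng=None):
    """Plain SMOTE sketch: generate n_new synthetic minority samples.

    X_min : (n, d) array of minority-class samples.
    For each synthetic point: pick a minority sample x, pick one of its k
    nearest minority neighbors xn, and return x + gap * (xn - x), gap ~ U(0, 1).
    """
    rng = np.random.default_rng(rng)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    # Pairwise squared Euclidean distances within the minority class.
    d2 = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)                    # exclude the point itself
    knn = np.argsort(d2, axis=1)[:, :k]             # indices of the k nearest neighbors

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        x_idx = rng.integers(n)                     # a random minority sample
        nn_idx = rng.choice(knn[x_idx])             # one of its k neighbors
        gap = rng.random()                          # random position on the segment
        synthetic[i] = X_min[x_idx] + gap * (X_min[nn_idx] - X_min[x_idx])
    return synthetic

# Usage: oversample a tiny 2-D minority class.
X_min = np.array([[0.0, 0.0], [1.0, 0.2], [0.8, 1.0], [0.2, 0.9]])
print(smote(X_min, k=2, n_new=5, rng=0).round(2))
```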
(Figures: illustration of SMOTE synthetic-sample generation, and an example of overgeneralization into the majority region.)
Oversampling-SMOTE: limitations
Overgeneralization: SMOTE's procedure is inherently dangerous since it blindly generalizes the minority area without regard to the majority class. This strategy is particularly problematic in the case of highly skewed class distributions since, in such cases, the minority class is very sparse with respect to the majority class, thus resulting in a greater chance of class mixture.
Lack of flexibility: the number of synthetic samples generated by SMOTE is fixed in advance, thus not allowing for any flexibility in the re-balancing rate.
Oversampling-Borderline SMOTE1
Let S be the minority-class set and L the majority-class set. For each point p in S:
Compute the m nearest neighbors of p in the full training set T; call this set Mp, and let m' = |Mp ∩ L| (the number of p's nearest neighbors that belong to the majority class L).
If m' = m, p is regarded as noise; do nothing.
If 0 ≤ m' ≤ m/2, p is safe; do nothing.
If m/2 ≤ m' < m, p is in danger and we want to generate new minority points near it, so add it to the DANGER set.
Finally, for each point d in DANGER, generate new samples with the SMOTE algorithm.
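A sketch of the DANGER-selection step under the rule above, assuming S and L are given as NumPy arrays and using brute-force Euclidean neighbors; names are illustrative.

```python
import numpy as np

def danger_points(X_min, X_maj, m=5):
    """Return the minority points whose m-nearest-neighbor set in the whole
    training set is dominated by the majority class (m/2 <= m' < m)."""
    X_min = np.asarray(X_min, dtype=float)
    X_maj = np.asarray(X_maj, dtype=float)
    X_all = np.vstack([X_min, X_maj])               # training set T (minority first)
    is_maj = np.r_[np.zeros(len(X_min), bool), np.ones(len(X_maj), bool)]
    danger = []
    for i, p in enumerate(X_min):
        d2 = ((X_all - p) ** 2).sum(1)
        d2[i] = np.inf                              # p itself is row i of X_all
        nn = np.argsort(d2)[:m]                     # the m nearest neighbors Mp
        m_prime = is_maj[nn].sum()                  # m' = |Mp ∩ L|
        if m_prime == m:
            continue                                # noise: ignore
        if m_prime >= m / 2:
            danger.append(i)                        # borderline point
    return X_min[danger]                            # run SMOTE on these points only
```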
Oversampling-Borderline SMOTE2
Borderline-SMOTE2 is very similar to Borderline-SMOTE1; only the last step differs. A point in the DANGER set finds nearest neighbors not only in the minority set S but also in the majority set L, and generates new minority points from both, which brings the generated minority points closer to their true distribution (see the sketch below).
FOR p in DANGER:
1. Find the k nearest neighbors S_k in S and L_k in L.
2. Select an α fraction of the points in S_k and produce new minority samples by random linear interpolation between them and p.
3. Select a (1 − α) fraction of the points in L_k and produce new minority samples by random linear interpolation between them and p.
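A sketch of the differing final step for one DANGER point p; S_k, L_k, and α follow the slide, while restricting the interpolation gap to (0, 0.5) for majority neighbors (so the synthetic point stays on the minority side) follows the original Borderline-SMOTE paper and is an assumption not stated on the slide.

```python
import numpy as np

def borderline_smote2_step(p, S_k, L_k, alpha=0.5, rng=None):
    """Generate new minority samples for one DANGER point p.

    S_k : array of p's k nearest minority neighbors.
    L_k : array of p's k nearest majority neighbors.
    An alpha fraction of the new points interpolates toward S_k and a
    (1 - alpha) fraction toward L_k.
    """
    rng = np.random.default_rng(rng)
    p, S_k, L_k = (np.asarray(a, dtype=float) for a in (p, S_k, L_k))
    n_s = int(round(alpha * len(S_k)))              # neighbors taken from S_k
    n_l = int(round((1 - alpha) * len(L_k)))        # neighbors taken from L_k
    new = []
    for i in rng.choice(len(S_k), size=n_s, replace=False):
        new.append(p + rng.random() * (S_k[i] - p))            # gap in (0, 1)
    for i in rng.choice(len(L_k), size=n_l, replace=False):
        new.append(p + rng.uniform(0, 0.5) * (L_k[i] - p))     # gap in (0, 0.5)
    return np.array(new)
```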
Undersampling-Edited Nearest Neighbor
Delete those majority samples most of whose k nearest neighbors belong to the minority class.
Repeated Edited Nearest Neighbor: apply this editing step repeatedly until no more samples are removed.
Drawback: because the neighborhood of most majority samples consists mainly of other majority samples, the number of majority samples this method can delete is quite limited.
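A minimal sketch of one ENN pass under the rule above (a majority sample is dropped when more than half of its k neighbors are minority); Repeated ENN would simply loop this until nothing changes. Names and the brute-force neighbor search are illustrative.

```python
import numpy as np

def enn_undersample(X, y, k=3, majority=0):
    """One Edited-Nearest-Neighbor pass: drop each majority sample whose
    k nearest neighbors are mostly from the minority class."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    keep = np.ones(len(X), dtype=bool)
    for i in np.where(y == majority)[0]:
        d2 = ((X - X[i]) ** 2).sum(1)
        d2[i] = np.inf                      # exclude the sample itself
        nn = np.argsort(d2)[:k]             # its k nearest neighbors
        if (y[nn] != majority).sum() > k / 2:
            keep[i] = False                 # neighborhood disagrees: remove it
    return X[keep], y[keep]
```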
Undersampling-EasyEnsemble
Extract subsets of majority-class samples, pair each with the minority-class samples, and train an AdaBoost classifier on each; every classifier thus undersamples the original data, and the final decision is made by summing up the weak classifiers' results.
For i = 1, ..., N:
(a) Randomly draw Li from the majority set L such that |Li| = |S|.
(b) Train an AdaBoost classifier Fi on Li ∪ S.
Combine the classifiers above.
In other words, each round EasyEnsemble samples from the majority class about as many samples as there are minority samples, joins them with the minority samples as a training set, and learns an AdaBoost classifier on it.
At prediction time, the class is not decided by a majority vote over the N AdaBoost models (e.g. 6 out of 10 models predicting the minority class). Instead, the prediction vector of all weak classifiers (every tree inside every AdaBoost model) is combined with the corresponding weight vector via an inner product, the thresholds are subtracted, and the sign of the difference determines the class.
This is the simplest ensemble idea: repeatedly draw majority subsets so that each model sees equal numbers of majority and minority samples, train a base classifier on each subset joined with the minority data, and combine the base classifiers into an ensemble learning system. EasyEnsemble is regarded as an unsupervised sampling scheme in the sense that each majority subset is drawn independently by random sampling with replacement.
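A sketch of EasyEnsemble as described above, using scikit-learn's AdaBoostClassifier; the subset count, the label convention (majority = 0, minority = 1), and sampling without replacement inside each subset are illustrative assumptions. imbalanced-learn also provides a ready-made EasyEnsembleClassifier.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

def easy_ensemble_fit(X_maj, X_min, n_subsets=10, rng=None):
    """Train one AdaBoost model per balanced subset: every subset is all
    minority samples plus an equally sized random draw from the majority
    class (subsets are drawn independently, so majority samples can recur)."""
    rng = np.random.default_rng(rng)
    X_maj = np.asarray(X_maj, dtype=float)
    X_min = np.asarray(X_min, dtype=float)
    y_bal = np.r_[np.zeros(len(X_min)), np.ones(len(X_min))]   # 0 = majority, 1 = minority
    models = []
    for _ in range(n_subsets):
        idx = rng.choice(len(X_maj), size=len(X_min), replace=False)
        X_bal = np.vstack([X_maj[idx], X_min])
        models.append(AdaBoostClassifier(n_estimators=50).fit(X_bal, y_bal))
    return models

def easy_ensemble_predict(models, X):
    """Combine by summing the signed decision values of all AdaBoost models,
    i.e. the weighted votes of all their weak learners, rather than taking a
    majority vote over the N models."""
    score = sum(m.decision_function(X) for m in models)
    return (score > 0).astype(int)
```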