Decision Tree: ID3

Decision tree notes

Posted by jiang on November 8, 2018

ID3

  • ID3 is a basic algorithm: it does not suit datasets with many features, it is prone to overfitting, it splits on information gain, and it reflects Occam's razor (do more with less); the learned tree is stored as a nested dictionary.
  • ID3, short for Iterative Dichotomiser 3, is a decision tree algorithm invented by Ross Quinlan.

Information gain
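
For reference, the standard definition (this is what the code below computes as baseEntropy - newEntropy):

$$\mathrm{Gain}(D, A) = H(D) - \sum_{v \in \mathrm{Values}(A)} \frac{|D_v|}{|D|}\, H(D_v)$$

where $D_v$ is the subset of $D$ in which feature $A$ takes the value $v$.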

Expected information (Shannon entropy)
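
The expected information of a dataset $D$ with $K$ classes, where $p_k$ is the proportion of examples in class $k$ (this is what calcShannonEnt below computes):

$$H(D) = -\sum_{k=1}^{K} p_k \log_2 p_k$$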

C4.5

  • Advantages of C4.5: the classification rules it produces are easy to understand, and its accuracy is relatively high.
  • Disadvantages: building the tree requires many sequential scans and sorts of the dataset, which makes the algorithm inefficient. In addition, C4.5 only works on datasets that fit in memory; when the training set is too large to hold in memory, the program cannot run. (The gain-ratio criterion that C4.5 splits on is sketched below.)
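
Unlike ID3, C4.5 splits on the gain ratio rather than raw information gain, which penalizes features with many distinct values:

$$\mathrm{GainRatio}(D, A) = \frac{\mathrm{Gain}(D, A)}{\mathrm{SplitInfo}_A(D)}, \qquad \mathrm{SplitInfo}_A(D) = -\sum_{v \in \mathrm{Values}(A)} \frac{|D_v|}{|D|} \log_2 \frac{|D_v|}{|D|}$$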

CART

  • Classification And Regression Tree (CART) is one of the most important decision tree algorithms, listed among the Top Ten Machine Learning Algorithms. As the name suggests, CART can build classification trees as well as regression trees (and model trees); the two differ only slightly in how the tree is grown. CART trees are binary (see the sketch below).
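
A minimal sketch of both uses of CART with scikit-learn (whose trees implement an optimized CART); the toy data here is made up purely for illustration:

from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification tree: sklearn's default split criterion is the Gini index.
clf = DecisionTreeClassifier(criterion='gini')
clf.fit([[1, 1], [1, 0], [0, 1], [0, 0]], [1, 0, 0, 0])

# Regression tree: the same binary-tree machinery, but the leaves predict continuous values.
reg = DecisionTreeRegressor()
reg.fit([[1.0], [2.0], [3.0], [4.0]], [1.1, 1.9, 3.2, 3.9])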

Code examples: ID3 from scratch and sklearn

from math import log   # math.log(p, 2) gives the base-2 logarithm; numpy's log takes no base argument
import operator

# Toy dataset: the first two columns are the features 'no surfacing' and 'flippers',
# the last column is the class label.
x = [[1, 1, 'yes'],
     [1, 1, 'yes'],
     [1, 0, 'no'],
     [0, 1, 'no'],
     [0, 1, 'no']
     ]


def chooseBestFeatureToSplit(dataset):
    # Pick the feature whose split yields the largest information gain.
    numFeatures = len(dataset[0]) - 1          # last column is the class label
    baseEntropy = calcShannonEnt(dataset)
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):
        featList = [example[i] for example in dataset]
        uniqueVals = set(featList)
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataset, i, value)
            prob = len(subDataSet) / float(len(dataset))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy    # entropy reduction from splitting on feature i
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature

def majorityCnt(classList):
    # Return the class label that occurs most often in classList.
    classCount = {}
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1),
                              reverse=True
                              )
    return sortedClassCount[0][0]

def splitDataSet(dataset, axis, value):
    # Return the rows whose feature at index `axis` equals `value`,
    # with that feature column removed.
    retDataSet = []
    for featVec in dataset:
        if featVec[axis] == value:
            reduceFeatVec = featVec[:axis]
            reduceFeatVec.extend(featVec[axis + 1:])
            retDataSet.append(reduceFeatVec)
    return retDataSet

def calcShannonEnt(dataset):
    # Shannon entropy of the class labels (last column of each row).
    numEntries = len(dataset)
    labelCounts = {}
    for featVec in dataset:
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannon = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannon -= prob * log(prob, 2)
    return shannon

'''
dataset: e.g. [[1, 1, 'yes'], [1, 0, 'no'], ...]
labels:  feature names, e.g. ['no surfacing', 'flippers']
'''
def createTree(dataset, labels):
    classList = [example[-1] for example in dataset]
    # All remaining examples share one class: return it as a leaf.
    if classList.count(classList[0]) == len(classList):
        return classList[0]

    # No features left to split on: return the majority class.
    if len(dataset[0]) == 1:
        return majorityCnt(classList)

    bestFeat = chooseBestFeatureToSplit(dataset)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    featValues = [example[bestFeat] for example in dataset]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        # Drop the used feature's label so it stays aligned with the split sub-dataset.
        subLabels = labels[:bestFeat] + labels[bestFeat + 1:]
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataset, bestFeat, value), subLabels)

    return myTree

'''
Example result:
{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
'''
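
A quick usage sketch that reproduces the tree above from the toy dataset x defined earlier (the feature names come from the docstring):

labels = ['no surfacing', 'flippers']
myTree = createTree(x, labels[:])   # pass a copy defensively, so the original label list stays intact
print(myTree)
# -> {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}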

from sklearn import tree

# The same kind of toy problem with sklearn (note: this rebinds the x used above).
x = [[1, 1], [1, 0], [0, 1], [0, 0]]
y = [1, 0, 0, 0]
tr = tree.DecisionTreeClassifier()
tr.fit(x, y)
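
A small follow-up sketch to inspect the fitted tree and make a prediction; export_text assumes scikit-learn 0.21 or later, and the feature names are borrowed from the ID3 example above:

from sklearn.tree import export_text

print(export_text(tr, feature_names=['no surfacing', 'flippers']))
print(tr.predict([[1, 1]]))   # expected: [1], since the tree fits this tiny training set exactly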