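Preface

I have recently been studying the paper "Mining Quality Phrases from Massive Text Corpora", which describes how to mine high-quality phrases from massive text corpora. The paper uses the Random Forest algorithm, so I took the time to learn it. An earlier post on this blog covered the Decision Tree in detail, and Random Forest is an optimized method built on top of decision trees. Let's look at what Random Forest is.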

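I. What Is Random Forest?

As a highly flexible machine learning algorithm, Random Forest (RF) has broad applications, from marketing to healthcare and insurance. It can be used to model marketing campaigns and to analyze customer acquisition, retention, and churn, and it can also be used to predict disease risk and patient susceptibility. In competitions in recent years, including the 2013 Baidu campus movie recommendation contest, the 2014 Alibaba Tianchi big data competition, and Kaggle data science competitions, a considerable share of participants used Random Forest. This suggests that Random Forest is quite competitive in accuracy.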

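So what kind of algorithm is Random Forest, exactly?

If you have worked with decision trees, Random Forest is easy to understand. It combines many trees using the idea of ensemble learning: its basic unit is the decision tree, and it belongs to the branch of machine learning called Ensemble Learning. The name has two keywords, "random" and "forest". "Forest" is easy to understand: one is a tree, so hundreds or thousands of them can be called a forest. The metaphor is apt, and it reflects the main idea of Random Forest, the ensemble idea. The meaning of "random" is covered in a later part.

Intuitively, each decision tree is a classifier (assuming a classification problem), so for one input sample, N trees produce N classification results. Random Forest collects all the votes and outputs the class with the most votes. This is the simplest form of the Bagging idea.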
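The voting step described above can be sketched in a few lines. This is only a minimal illustration, not the article's implementation: the function name `majority_vote` and the five example votes are made up for the purpose.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the label predicted by the most trees.

    Ties are broken in favor of the label that appears first in the list.
    """
    counts = Counter(tree_predictions)
    # most_common(1) returns [(label, count)] for the most frequent label
    return counts.most_common(1)[0][0]


# Hypothetical outputs of N = 5 trees for a single input sample
votes = ["spam", "not spam", "spam", "spam", "not spam"]
forest_prediction = majority_vote(votes)
print(forest_prediction)  # spam
```

Three of the five hypothetical trees vote "spam", so that label becomes the forest's output.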

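Random Forest is a powerful and versatile supervised machine learning algorithm that grows and combines multiple decision trees to create a "forest". It can be used for both classification and regression problems, in R and in Python.

Before exploring Random Forest in more detail, let's break it down:

What is supervised learning?
What are classification and regression?
What is a decision tree?

Understanding each of these concepts will help you understand Random Forest and how it works, so let's explain them one by one.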

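1.1 What Is Supervised Machine Learning?

Supervised learning learns a function (the model parameters) from a given training set, so that when new data arrives, the function can predict the result. The training set must contain both inputs and outputs, also called features and targets, and the targets are labeled by people. The most common supervised task is classification (not to be confused with clustering): existing training samples (known data and their corresponding outputs) are used to train an optimal model, where the model belongs to some set of functions and "optimal" means best under some evaluation criterion. The model then maps each input to an output, and a simple decision on that output yields the class, so the model can classify unseen data. The goal of supervised learning is often to have the computer learn a classification system that we have already defined.

Supervised learning is the usual way to train neural networks and decision trees. Both depend heavily on the information given by the predefined classification system (the labels). For a neural network, the labels are used to measure the network's error, and the parameters are adjusted repeatedly to reduce it. For a decision tree, the labels are used to determine which attributes provide the most information. Typical examples of supervised learning algorithms are KNN and SVM.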

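1.2 What Are Regression and Classification?

In machine learning, algorithms are used to assign observations, events, or inputs to groups. For example, a spam filter classifies each email as "spam" or "not spam". The email case is only a simple example; in a business setting, the predictive power of these models can strongly influence how decisions are made and how strategy is formed.

So regression and classification are both supervised machine learning problems, used to predict the value or the category of an outcome. They differ as follows:

A classification problem assigns a label to something, and the result is usually a discrete value, for example deciding whether the animal in a picture is a cat or a dog. Classification is often built on top of regression: in a neural network, the last layer of a classifier typically applies a softmax function to decide the class. Classification has no notion of approximation. There is only one correct result, a wrong result is simply wrong, and there is no such thing as "close". A very common classification method is logistic regression, sometimes called logistic classification.

A regression problem usually predicts a value, such as a house price or future weather. For example, if a product's actual price is 500 yuan and regression analysis predicts 499 yuan, we consider that a fairly good prediction. A common regression algorithm is linear regression (LR). When regression is done with a neural network, the output layer needs no softmax function; it simply outputs a weighted sum of the previous layer. Regression is an approximate prediction of the true value.

A simple way to tell the two apart: classification is about predicting a label (for example "spam" or "not spam"), while regression is about predicting a quantity.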
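One way to make the difference concrete is to look at how each kind of prediction is scored. The sketch below is only illustrative: the labels and the second price pair are made up, and accuracy and mean absolute error are just two common metric choices, not ones the article prescribes.

```python
def accuracy(predicted_labels, true_labels):
    """Fraction of exact matches: a label is either right or wrong."""
    hits = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return hits / len(true_labels)


def mean_absolute_error(predicted_values, true_values):
    """Average distance between predicted and true quantities."""
    errors = [abs(p - t) for p, t in zip(predicted_values, true_values)]
    return sum(errors) / len(errors)


# Classification: made-up labels, two of three predictions match exactly
acc = accuracy(["cat", "dog", "dog"], ["cat", "dog", "cat"])

# Regression: the 500-yuan product predicted as 499, plus one made-up pair
mae = mean_absolute_error([499.0, 120.0], [500.0, 123.0])

print(acc, mae)
```

A prediction of 499 for a true value of 500 contributes an error of only 1 to the regression score, whereas a wrong label counts fully against the classification score.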

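1.3 What Is a Decision Tree?

Before explaining Random Forest, we need to mention the decision tree. A decision tree is a simple algorithm that is highly interpretable and matches human intuition. It is a supervised learning algorithm based on if-then-else rules, and its logic is naturally drawn as a tree of nested conditions.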
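To show what "if-then-else rules" means in practice, here is a tiny hand-written tree. The features (`contains_link`, `sender_known`, `exclamation_marks`) and the thresholds are hypothetical; a real decision tree learns such rules from labeled data instead of having them written by hand.

```python
def classify_email(contains_link, sender_known, exclamation_marks):
    """A hand-written decision tree; each `if` is one internal node."""
    if contains_link:
        if sender_known:
            return "not spam"   # leaf
        return "spam"           # leaf
    if exclamation_marks > 3:
        return "spam"           # leaf
    return "not spam"           # leaf


print(classify_email(True, False, 0))   # spam
print(classify_email(False, True, 1))   # not spam
```

Each path from the first `if` to a `return` corresponds to one root-to-leaf path in the tree.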

The derivation of decision trees is covered in detail in my earlier blog posts:

Machine Learning: Decision Trees (Part 1), AI小書童's blog on CSDN

Machine Learning: Decision Tree Derivation, AI小書童's blog on CSDN

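1.4 What Is Random Forest?

A random forest consists of many decision trees, and the trees are built independently of one another.

In a classification task, when a new input sample arrives, every decision tree in the forest judges and classifies it separately, and each tree produces its own result. Whichever class appears most often among the trees' results is taken as the forest's final result.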

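II. How a Random Forest Is Built

2.1 The Algorithm

1. From a training set of N samples, draw N times with replacement, one sample at a time, to obtain N samples. These N samples are used to train one decision tree; they are the samples at the root node of that tree.
2. If each sample has M attributes, then whenever a node of the tree needs to split, randomly select m of the M attributes, with m << M. Then use some criterion (for example, information gain) to choose one of these m attributes as the splitting attribute of the node.
3. Every node is split according to step 2 until it can no longer be split. (With categorical attributes, if the attribute chosen at a node is the one its parent just split on, all samples at the node already share its value, so the node becomes a leaf.) Note that no pruning is performed while growing the tree.
4. Repeat steps 1 to 3 to build a large number of decision trees; together they form the random forest.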
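The four steps can be sketched compactly by letting scikit-learn grow the individual trees. This is a rough illustration under stated assumptions, not the article's implementation: the helper names `fit_forest` and `predict_forest` are made up, iris is used only as a convenient dataset, m is taken as about sqrt(M) via `max_features="sqrt"`, and `DecisionTreeClassifier` uses Gini impurity by default rather than information gain.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier


def fit_forest(X, y, n_trees=25, seed=0):
    rng = np.random.default_rng(seed)
    n_samples = X.shape[0]
    trees = []
    for _ in range(n_trees):
        # Step 1: draw N row indices with replacement (bootstrap sample)
        rows = rng.integers(0, n_samples, size=n_samples)
        # Steps 2-3: every split considers a random subset of about sqrt(M)
        # features; the tree is grown fully, with no pruning
        tree = DecisionTreeClassifier(
            max_features="sqrt",
            random_state=int(rng.integers(0, 2**31 - 1)),
        )
        tree.fit(X[rows], y[rows])
        trees.append(tree)
    return trees  # Step 4: the collection of trees is the forest


def predict_forest(trees, X):
    # Each row of `votes` holds one tree's predictions for all samples
    votes = np.stack([tree.predict(X) for tree in trees]).astype(int)
    # Majority vote per sample (one column of `votes` per sample after .T)
    return np.array([np.bincount(col).argmax() for col in votes.T])


X, y = load_iris(return_X_y=True)
forest = fit_forest(X, y)
train_accuracy = (predict_forest(forest, X) == y).mean()
print(train_accuracy)
```

The accuracy printed here is measured on the training data itself, so it only confirms that the pieces fit together; it is not an estimate of generalization.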

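2.2 Random Selection of Data

First, sample from the original dataset with replacement to construct sub-datasets, each the same size as the original dataset. Elements may repeat across different sub-datasets and within the same sub-dataset. Second, build one sub-decision-tree from each sub-dataset; when data is fed to the sub-trees, each outputs its own result. Finally, when new data needs a classification from the random forest, the forest's output is obtained by voting over the sub-trees' results. As in Figure 3, suppose the forest has 3 sub-trees: 2 of them classify the sample as class A and 1 as class B, so the forest's result is class A.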
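A quick experiment shows what sampling with replacement does to a dataset. The sketch below uses integers as stand-ins for training rows; the helper name `bootstrap_sample` is made up. For large N, the expected fraction of distinct original rows in a bootstrap sample is 1 - (1 - 1/N)^N, which approaches 1 - 1/e, roughly 63%.

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in range(len(data))]


rng = random.Random(0)          # fixed seed so the run is repeatable
data = list(range(10_000))      # stand-in for 10,000 training rows
sample = bootstrap_sample(data, rng)

# The sample has the same size as the data but contains duplicates,
# so only part of the original rows appear in it
unique_fraction = len(set(sample)) / len(data)
print(len(sample), round(unique_fraction, 3))
```

The rows that a given tree never sees differ from tree to tree, which is one source of the diversity among the trees.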

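2.3 Random Selection of Candidate Features

Similar to the random selection of data, each split in a sub-tree of the random forest does not use all candidate features. Instead, a certain number of features are randomly selected from all candidate features, and the best feature is then chosen among them. This makes the trees in the forest differ from one another, which increases the diversity of the system and thereby improves classification performance.

In the accompanying figure, blue squares represent all features that can be selected, that is, the candidate features, and yellow squares are the split features. On the left is the feature selection process of a single decision tree, which completes the split by choosing the best split feature among the candidate features (recall the ID3, C4.5, and CART algorithms mentioned earlier). On the right is the feature selection process of a sub-tree in a random forest.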

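2.4 Related Concepts

1. Split: During the training of a decision tree, the training data is repeatedly divided into two sub-datasets. This process is called splitting.

2. Feature: In a classification problem, the data fed into the classifier is called the features. Taking stock up/down prediction as an example, the features would be the previous day's trading volume and closing price.

3. Candidate features: While building a decision tree, features are selected from the full feature set in some order. The candidate features are the set of features that have not yet been selected before the current step. For example, if the full feature set is ABCDE, the candidate features at the first step are ABCDE; if C is chosen at the first step, the candidate features at the second step are ABDE.

4. Split feature: Following the definition of candidate features, the feature chosen at each step is the split feature. In the example above, the split feature at the first step is C. These features are called split features because they divide the dataset into disjoint parts.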
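To see how a split feature is chosen among the candidates, the sketch below scores two possible splits by size-weighted Gini impurity, the criterion used by CART. It is a minimal illustration with made-up rows in the spirit of the stock example (volume, closing price, label); the function names are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    if not labels:
        return 0.0
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def split_rows(rows, feature_index, threshold):
    """Split rows into (left, right) by comparing one feature to a threshold."""
    left = [r for r in rows if r[feature_index] < threshold]
    right = [r for r in rows if r[feature_index] >= threshold]
    return left, right


def weighted_gini(left, right):
    """Size-weighted average impurity of the two child nodes."""
    n = len(left) + len(right)
    return sum(len(part) / n * gini([r[-1] for r in part]) for part in (left, right))


# Made-up rows: (volume, closing price, label)
rows = [(1.0, 10.0, "down"), (2.0, 11.0, "down"), (3.0, 9.0, "up"), (4.0, 12.0, "up")]

by_volume = weighted_gini(*split_rows(rows, 0, 2.5))   # separates the classes
by_close = weighted_gini(*split_rows(rows, 1, 10.5))   # mixes the classes
print(by_volume, by_close)  # 0.0 0.5
```

The split on volume yields two pure children (impurity 0.0), so volume would be chosen as the split feature here; in a random forest, the same comparison is made only among the randomly drawn subset of candidate features.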

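III. Advantages and Disadvantages of Random Forest

3.1 Advantages

It can handle very high-dimensional data (many features) without dimensionality reduction or feature selection.
It can estimate the importance of each feature.
It can detect interactions between different features.
It is not prone to overfitting.
Training is relatively fast, and it is easy to parallelize.
It is relatively simple to implement.
For imbalanced datasets, it can balance the error.
If a large portion of the feature values is missing, accuracy can still be maintained.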
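The feature-importance advantage is easy to try out with scikit-learn, whose fitted forests expose a `feature_importances_` attribute. The following is a small sketch on the iris dataset, chosen here only for convenience; `n_estimators=100` and `random_state=0` are arbitrary choices.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(iris.data, iris.target)

# feature_importances_ holds one non-negative score per feature; scores sum to 1
importances = dict(zip(iris.feature_names, clf.feature_importances_))
for name, score in sorted(importances.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

These impurity-based scores are the ones affected by the bias toward attributes with many distinct values, so they are best read as a rough ranking.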

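3.2 Disadvantages

Random Forest has been shown to overfit on some classification or regression problems with heavy noise.
For data whose attributes have different numbers of distinct values, attributes with more values have a larger influence on the forest, so the attribute weights it produces on such data are unreliable.
Because a random forest uses many decision trees, it can require a lot of memory on larger projects, which can make it slower than some more efficient algorithms.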

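IV. Extra-Trees (Extremely Randomized Trees)

The ET or Extra-Trees (Extremely randomized trees) algorithm is very similar to Random Forest; both consist of many decision trees. The main differences between Extra-Trees and Random Forest are:

1. Random Forest uses the Bagging model, while Extra-Trees trains each tree on all the samples and only the features are chosen at random. Because the splits are random, the results can in some cases be better than those of Random Forest.

2. Random Forest finds the best split attribute within a random subset, while ET picks the split value completely at random and splits the tree on it.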
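The first difference is visible in scikit-learn's default settings: its Random Forest bootstraps the training rows by default, while its Extra-Trees does not. The check below simply reads the `bootstrap` parameter of freshly constructed estimators; it reflects scikit-learn's defaults and says nothing about other libraries.

```python
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

# get_params() returns the constructor parameters, including defaults
rf_bootstrap = RandomForestClassifier().get_params()["bootstrap"]
et_bootstrap = ExtraTreesClassifier().get_params()["bootstrap"]

print(rf_bootstrap, et_bootstrap)  # True False
```

Both defaults can be overridden, so the distinction is a convention of the two algorithms rather than a hard constraint.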

V. Implementing Random Forest in Python

5.1 A Python Implementation of Random Forest

# -*- coding: utf-8 -*-
import csv
from random import seed
from random import randrange
from math import sqrt


def loadCSV(filename):
    # Load the data row by row into a list
    dataSet = []
    with open(filename, 'r') as file:
        csvReader = csv.reader(file)
        for line in csvReader:
            dataSet.append(line)
    return dataSet


# Convert every column except the label column to float
def column_to_float(dataSet):
    featLen = len(dataSet[0]) - 1
    for data in dataSet:
        for column in range(featLen):
            data[column] = float(data[column].strip())


# Randomly split the dataset into n_folds blocks for cross-validation;
# one block is the test set and the remaining blocks form the training set
def spiltDataSet(dataSet, n_folds):
    fold_size = int(len(dataSet) / n_folds)
    dataSet_copy = list(dataSet)
    dataSet_spilt = []
    for i in range(n_folds):
        fold = []
        # `while`, not `if`: keep drawing until the fold is full
        while len(fold) < fold_size:
            index = randrange(len(dataSet_copy))
            # pop(index) removes the row and returns it, so no row is reused
            fold.append(dataSet_copy.pop(index))
        dataSet_spilt.append(fold)
    return dataSet_spilt


# Build a data subset by sampling with replacement
def get_subsample(dataSet, ratio):
    subdataSet = []
    lenSubdata = round(len(dataSet) * ratio)
    while len(subdataSet) < lenSubdata:
        # randrange(n) already excludes n, so every row can be drawn
        index = randrange(len(dataSet))
        subdataSet.append(dataSet[index])
    return subdataSet


# Split the dataset on one feature value
def data_spilt(dataSet, index, value):
    left = []
    right = []
    for row in dataSet:
        if row[index] < value:
            left.append(row)
        else:
            right.append(row)
    return left, right


# Compute the cost of a split (sum of p * (1 - p) over classes and branches)
def spilt_loss(left, right, class_values):
    loss = 0.0
    for class_value in class_values:
        left_size = len(left)
        if left_size != 0:  # avoid division by zero
            prop = [row[-1] for row in left].count(class_value) / float(left_size)
            loss += (prop * (1.0 - prop))
        right_size = len(right)
        if right_size != 0:
            prop = [row[-1] for row in right].count(class_value) / float(right_size)
            loss += (prop * (1.0 - prop))
    return loss


# Pick n_features features at random, then choose the best split among them
def get_best_spilt(dataSet, n_features):
    features = []
    class_values = list(set(row[-1] for row in dataSet))
    b_index, b_value, b_loss, b_left, b_right = 999, 999, 999, None, None
    while len(features) < n_features:
        index = randrange(len(dataSet[0]) - 1)
        if index not in features:
            features.append(index)
    for index in features:  # find the column and value with the smallest loss
        for row in dataSet:
            left, right = data_spilt(dataSet, index, row[index])  # candidate branches
            loss = spilt_loss(left, right, class_values)
            if loss < b_loss:  # keep the split with the lowest cost
                b_index, b_value, b_loss, b_left, b_right = index, row[index], loss, left, right
    return {'index': b_index, 'value': b_value, 'left': b_left, 'right': b_right}


# Decide the output label (majority class)
def decide_label(data):
    output = [row[-1] for row in data]
    return max(set(output), key=output.count)


# Recursive splitting: keep building child nodes until a stop condition is met
def sub_spilt(root, n_features, max_depth, min_size, depth):
    left = root['left']
    right = root['right']
    del (root['left'])
    del (root['right'])
    if not left or not right:
        root['left'] = root['right'] = decide_label(left + right)
        return
    if depth > max_depth:
        root['left'] = decide_label(left)
        root['right'] = decide_label(right)
        return
    if len(left) < min_size:
        root['left'] = decide_label(left)
    else:
        root['left'] = get_best_spilt(left, n_features)
        sub_spilt(root['left'], n_features, max_depth, min_size, depth + 1)
    if len(right) < min_size:
        root['right'] = decide_label(right)
    else:
        root['right'] = get_best_spilt(right, n_features)
        sub_spilt(root['right'], n_features, max_depth, min_size, depth + 1)


# Build one decision tree
def build_tree(dataSet, n_features, max_depth, min_size):
    root = get_best_spilt(dataSet, n_features)
    sub_spilt(root, n_features, max_depth, min_size, 1)
    return root


# Predict the label of one row with one tree
def predict(tree, row):
    if row[tree['index']] < tree['value']:
        if isinstance(tree['left'], dict):
            return predict(tree['left'], row)
        else:
            return tree['left']
    else:
        if isinstance(tree['right'], dict):
            return predict(tree['right'], row)
        else:
            return tree['right']


# Majority vote over all trees
def bagging_predict(trees, row):
    predictions = [predict(tree, row) for tree in trees]
    return max(set(predictions), key=predictions.count)


# Build the random forest and predict the test set
def random_forest(train, test, ratio, n_features, max_depth, min_size, n_trees):
    trees = []
    for i in range(n_trees):
        # Draw a fresh subsample from the full training set for every tree
        # (do not overwrite `train`, or later trees would resample a resample)
        sample = get_subsample(train, ratio)
        tree = build_tree(sample, n_features, max_depth, min_size)
        trees.append(tree)
    predict_values = [bagging_predict(trees, row) for row in test]
    return predict_values


# Compute the accuracy
def accuracy(predict_values, actual):
    correct = 0
    for i in range(len(actual)):
        if actual[i] == predict_values[i]:
            correct += 1
    return correct / float(len(actual))


if __name__ == '__main__':
    seed(1)
    dataSet = loadCSV('D:/深度之眼/sonar-all-data.csv')
    column_to_float(dataSet)
    n_folds = 5
    max_depth = 15
    min_size = 1
    ratio = 1.0
    # n_features = int(sqrt(len(dataSet[0]) - 1))
    n_features = 15
    n_trees = 10
    folds = spiltDataSet(dataSet, n_folds)  # split the dataset into folds first
    scores = []
    for fold in folds:
        # Copy with folds[:] rather than aliasing: train_set = folds would make
        # remove() below modify folds as well
        train_set = folds[:]
        train_set.remove(fold)  # the remaining folds form the training set
        train_set = sum(train_set, [])  # flatten the folds into one list of rows
        test_set = []
        for row in fold:
            row_copy = list(row)
            row_copy[-1] = None  # hide the label from the model
            test_set.append(row_copy)
        actual = [row[-1] for row in fold]
        predict_values = random_forest(train_set, test_set, ratio, n_features, max_depth, min_size, n_trees)
        accur = accuracy(predict_values, actual)
        scores.append(accur)
    print('Trees is %d' % n_trees)
    print('scores:%s' % scores)
    print('mean score:%s' % (sum(scores) / float(len(scores))))

Printed result (from the author's original run; the exact scores depend on the random sampling):

Trees is 10
scores:[0.6341463414634146, 0.6829268292682927, 0.6341463414634146, 0.5853658536585366, 0.5853658536585366]
mean score:0.624390243902439

5.2 Comparing Decision Tree, Random Forest, and Extra-Trees

# -*- coding: utf-8 -*-
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier

# Create 10000 samples in 100 classes, with 10 features per sample
X, y = make_blobs(n_samples=10000, n_features=10, centers=100, random_state=0)

# Decision tree
clf1 = DecisionTreeClassifier(max_depth=None, min_samples_split=2, random_state=0)
scores1 = cross_val_score(clf1, X, y)
print(scores1.mean())

# Random forest
clf2 = RandomForestClassifier(n_estimators=10, max_depth=None, min_samples_split=2, random_state=0)
scores2 = cross_val_score(clf2, X, y)
print(scores2.mean())

# Extra-Trees classifier ensemble
clf3 = ExtraTreesClassifier(n_estimators=10, max_depth=None, min_samples_split=2, random_state=0)
scores3 = cross_val_score(clf3, X, y)
print(scores3.mean())

Printed output:

0.9823000000000001
0.9997
1.0

5.3 Random Forest with pandas and scikit-learn

Structure of the iris dataset:

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)	is_train	species
0	5.1	3.5	1.4	0.2	True	setosa
1	4.9	3.0	1.4	0.2	True	setosa
2	4.7	3.2	1.3	0.2	True	setosa
3	4.6	3.1	1.5	0.2	True	setosa
4	5.0	3.6	1.4	0.2	True	setosa

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np

# Load iris into a DataFrame and add a random train/test flag (about 75% train)
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
df.head()

train, test = df[df['is_train']==True], df[df['is_train']==False]

# The first four columns are the features
features = df.columns[:4]
clf = RandomForestClassifier(n_jobs=2)
# Encode the species names as integer codes for training
y, _ = pd.factorize(train['species'])
clf.fit(train[features], y)

# Map the predicted codes back to species names and build a confusion table
preds = iris.target_names[clf.predict(test[features])]
pd.crosstab(test['species'], preds, rownames=['actual'], colnames=['preds'])

Printed classification result:

preds	setosa	versicolor	virginica
actual
setosa	14	0	0
versicolor	0	15	1
virginica	0	0	9

5.4 Random Forest Compared with Other Machine Learning Classifiers

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons, make_circles, make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis as QDA

h = .02  # step size in the mesh

names = ['Nearest Neighbors', 'Linear SVM', 'RBF SVM', 'Decision Tree',
         'Random Forest', 'AdaBoost', 'Naive Bayes', 'LDA', 'QDA']
classifiers = [
    KNeighborsClassifier(3),
    SVC(kernel='linear', C=0.025),
    SVC(gamma=2, C=1),
    DecisionTreeClassifier(max_depth=5),
    RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
    AdaBoostClassifier(),
    GaussianNB(),
    LDA(),
    QDA()]

X, y = make_classification(n_features=2, n_redundant=0, n_informative=2,
                           random_state=1, n_clusters_per_class=1)
rng = np.random.RandomState(2)
X += 2 * rng.uniform(size=X.shape)
linearly_separable = (X, y)

datasets = [make_moons(noise=0.3, random_state=0),
            make_circles(noise=0.2, factor=0.5, random_state=1),
            linearly_separable
            ]

figure = plt.figure(figsize=(27, 9))
i = 1
# iterate over datasets
for ds in datasets:
    # preprocess dataset, split into training and test part
    X, y = ds
    X = StandardScaler().fit_transform(X)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4)

    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))

    # just plot the dataset first
    cm = plt.cm.RdBu
    cm_bright = ListedColormap(['#FF0000', '#0000FF'])
    ax = plt.subplot(len(datasets), len(classifiers) + 1, i)
    # Plot the training points
    ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright)
    # and testing points
    ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright, alpha=0.6)
    ax.set_xlim(xx.min(), xx.max())
    ax.set_ylim(yy.min(), yy.max())
    ax.set_xticks(())
    ax.set_yticks(())
    i += 1

    # iterate over classifiers
    for name, clf in zip(names, classifiers):
        ax = plt.subplot(len(datasets), len(classifiers) + 1, i)
        clf.fit(X_train, y_train)
        score = clf.score(X_test, y_test)

        # Plot the decision boundary. For that, we will assign a color to each
        # point in the mesh [x_min, x_max]x[y_min, y_max].
        if hasattr(clf, 'decision_function'):
            Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
        else:
            Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]

        # Put the result into a color plot
        Z = Z.reshape(xx.shape)
        ax.contourf(xx, yy, Z, cmap=cm, alpha=.8)

        # Plot also the training points
        ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright)
        # and testing points
        ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright,
                   alpha=0.6)

        ax.set_xlim(xx.min(), xx.max())
        ax.set_ylim(yy.min(), yy.max())
        ax.set_xticks(())
        ax.set_yticks(())
        ax.set_title(name)
        ax.text(xx.max() - .3, yy.min() + .3, ('%.2f' % score).lstrip('0'),
                size=15, horizontalalignment='right')
        i += 1

figure.subplots_adjust(left=.02, right=.98)
plt.show()

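Three sample sets are generated at random here, with decision boundaries that are roughly moon-shaped, circular, and linear. Let's focus on how the decision tree and the random forest partition the sample space:

1) In terms of accuracy, the random forest is at least as good as the single decision tree on all three test sets: 90% vs. 88%, 90% vs. 90%, and 88% vs. 88%.

2) Looking at the feature space, the random forest visibly has a stronger partitioning ability (nonlinear fitting ability) than the decision tree.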

VI. Applications of Random Forest

Random Forest can be used in many settings:

Classification of discrete values
Regression on continuous values
Unsupervised clustering
Outlier detection
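As a small illustration of the regression use case listed above, the sketch below fits scikit-learn's `RandomForestRegressor` to a synthetic noisy quadratic. The data, the seed, and `n_estimators=50` are arbitrary choices made for the example. It also checks that the forest's regression output is the mean of its trees' outputs, the regression counterpart of majority voting.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))          # one synthetic input feature
y = X[:, 0] ** 2 + rng.normal(0, 0.1, 300)     # noisy quadratic target

reg = RandomForestRegressor(n_estimators=50, random_state=0)
reg.fit(X, y)

# The forest's prediction is the mean of the individual trees' predictions
x_new = np.array([[2.0]])
forest_pred = reg.predict(x_new)[0]
tree_mean = np.mean([tree.predict(x_new)[0] for tree in reg.estimators_])
print(forest_pred, tree_mean)
```

Since the underlying function is x squared, the prediction at x = 2 should land near 4, although the exact value depends on the sampled data.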
