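Implementing the Random Forest Algorithm in Python: A Step-by-Step Guide

Published: 2023-09-06 01:06 | Editor: 顾先生 | Keywords: none

This article walks through an implementation of the random forest algorithm in Python. We hope it helps with your Python studies.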

Algorithm Description

The Random Forest Algorithm

At every step, a decision tree greedily selects the best split point in the dataset.

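This mechanism makes decision trees prone to high variance when they are not pruned. The high variance can be stabilized and reduced by building multiple trees from different samples of the training data (different views of the same problem) and combining their predictions. This approach is called bootstrap aggregating, or bagging for short. Its limitation is that every tree is grown by the same greedy algorithm, so the trees may choose identical or very similar split points and end up resembling one another (the trees become correlated). Correlated trees in turn produce similar predictions, which undermines the variance reduction that bagging was meant to deliver.

We can force the trees to differ by limiting the features that the greedy algorithm is allowed to evaluate at each split point while building the tree. This is the random forest algorithm.

Like bagging, random forest draws multiple samples from the training set and trains a separate tree on each. The difference is that at each point where a split is made and added to the tree, only a fixed subset of the attributes may be considered.

For classification problems, the kind of problem examined in this tutorial, the number of attributes considered for a split is limited to the square root of the number of input features: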

num_features_for_split = sqrt(total_input_features)

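This small change makes the trees differ from one another (less correlated), which yields more diverse predictions. A combination of diverse predictions often performs better than a single decision tree or bagging alone.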
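As a quick worked instance of the square-root rule, the sketch below computes the number of features considered per split for a dataset with 60 input features. The helper name features_per_split is hypothetical, and truncating the square root to an integer is an assumption, since the rule itself does not say how to round.

```python
from math import sqrt


def features_per_split(total_input_features):
    """Return the number of features to consider at each split.

    Assumption: the square root is truncated to an integer.
    """
    return int(sqrt(total_input_features))


print(features_per_split(60))  # sqrt(60) is about 7.75, truncated to 7
```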

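Sonar Dataset

This tutorial uses the Sonar dataset as input. The dataset describes sonar returns bouncing off different surfaces. Its 60 input variables are the strengths of the returns at different angles. It is a binary classification problem in which a model must distinguish rocks from metal cylinders, and it contains 208 observations.

The dataset is easy to understand: all variables are continuous and lie in the range 0 to 1, which simplifies data handling. The output variable is the string 'M' for mine (the metal cylinder) or 'R' for rock, and the two must be converted to the integers 1 and 0 respectively.

By always predicting the class with the most observations in the dataset (M, mines), the Zero Rule algorithm achieves an accuracy of 53%.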
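The Zero Rule baseline can be sketched in a few lines. The function name zero_rule_accuracy and the toy labels are hypothetical and serve only to illustrate the idea. They are not the Sonar data and do not reproduce the 53% figure.

```python
from collections import Counter


def zero_rule_accuracy(labels):
    """Return the majority label and the accuracy (percent) of always predicting it."""
    majority_label, majority_count = Counter(labels).most_common(1)[0]
    return majority_label, majority_count / len(labels) * 100.0


# Toy labels for illustration only; these are not the Sonar data.
toy_labels = ['M', 'M', 'M', 'R']
label, accuracy = zero_rule_accuracy(toy_labels)
print(label, accuracy)  # M 75.0
```

Any model worth keeping should beat this baseline on the same data.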

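More information about the dataset is available at the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Connectionist+Bench+(Sonar,+Mines+vs.+Rocks)

Download the dataset free of charge, save it as sonar.all-data.csv, and place it in your current working directory.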

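Tutorial

This tutorial is divided into two steps.

1. Calculating splits.

2. Sonar dataset case study.

These steps provide the foundation you need to implement and apply the random forest algorithm to your own predictive modeling problems.

Calculating Splits

In a decision tree, a split point is chosen by finding the attribute, and the value of that attribute, that results in the lowest cost.

For classification problems the cost function is usually the Gini index, which measures the purity of the groups of data created by a split point. For a two-class problem such as this one, a Gini index of 0 means perfect purity: the class values are separated perfectly into two groups.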
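To make the purity idea concrete, the sketch below scores two hand-made splits. It uses the same formulation as the gini_index() helper in this tutorial's full listing (the sum of p * (1 - p) over every class in every group). The toy rows are hypothetical.

```python
def gini_index(groups, class_values):
    """Sum p * (1 - p) over every class in every non-empty group."""
    gini = 0.0
    for class_value in class_values:
        for group in groups:
            size = len(group)
            if size == 0:
                continue
            proportion = [row[-1] for row in group].count(class_value) / float(size)
            gini += proportion * (1.0 - proportion)
    return gini


# Each row is [feature, class]; the last element is the class label.
pure_split = ([[1.0, 0], [1.5, 0]], [[3.0, 1], [3.5, 1]])
mixed_split = ([[1.0, 0], [3.0, 1]], [[1.5, 0], [3.5, 1]])

print(gini_index(pure_split, [0, 1]))   # 0.0: each group holds a single class
print(gini_index(mixed_split, [0, 1]))  # 1.0: each group is a 50/50 mix
```

Under this formulation the worst two-class split scores 1.0, so lower is better when comparing candidate split points.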

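Finding the best split point in a decision tree requires evaluating the cost of every value of every input variable in the training dataset.

In bagging and random forest, this procedure is run on a sample of the training set drawn with replacement. Random forest samples the input data by both rows and columns. Rows are sampled with replacement, which means the same row may be selected and added to the sample more than once.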
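Row sampling with replacement can be sketched as follows. The tutorial names a subsample() helper but its body is not shown here, so the signature and the ratio parameter below are assumptions rather than the author's code.

```python
from random import randrange, seed


def subsample(dataset, ratio):
    """Draw round(len(dataset) * ratio) rows at random, with replacement."""
    sample = []
    n_sample = round(len(dataset) * ratio)
    while len(sample) < n_sample:
        index = randrange(len(dataset))
        sample.append(dataset[index])  # the row stays available for reselection
    return sample


seed(1)
rows = [[i] for i in range(10)]
sample = subsample(rows, 1.0)
print(sample)
```

Because rows are never removed from the source list, a sample of the same size as the dataset will usually contain duplicates and omit some rows.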

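We can adapt this procedure for random forest: instead of enumerating the values of all input attributes in search of the lowest-cost split point, we create a sample of the input attributes to consider.

This sample of input attributes is chosen at random and without replacement, so each input attribute is considered at most once when searching for the lowest-cost split point.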
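One compact way to draw feature indices without replacement is the standard library's random.sample. This is a swapped-in alternative for illustration. The tutorial's own get_split() builds the list with a loop around randrange instead. The helper name choose_features is hypothetical.

```python
from random import sample, seed


def choose_features(n_total_features, n_features):
    """Pick n_features distinct column indices from range(n_total_features)."""
    return sample(range(n_total_features), n_features)


seed(1)
features = choose_features(60, 7)
print(features)
```

Both approaches yield a list of distinct indices. random.sample simply avoids the retry loop when an index is drawn twice.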

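The function get_split() in the code below implements this procedure. It takes as arguments the number of input features to evaluate and a dataset, which may be a sample of the actual training set. The helper function test_split() splits the dataset at a candidate split point, and gini_index() evaluates the cost of a split point from the groups of rows it creates.

In the code, the feature list is built by randomly selecting feature indices. That list is then enumerated, and specific values in the training set are evaluated as candidate split points.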

# Select the best split point for a dataset
def get_split(dataset, n_features):
    class_values = list(set(row[-1] for row in dataset))
    b_index, b_value, b_score, b_groups = 999, 999, 999, None
    # Randomly choose n_features distinct column indices (the last column is the class)
    features = list()
    while len(features) < n_features:
        index = randrange(len(dataset[0])-1)
        if index not in features:
            features.append(index)
    # Evaluate every value of each chosen feature as a candidate split point
    for index in features:
        for row in dataset:
            groups = test_split(index, row[index], dataset)
            gini = gini_index(groups, class_values)
            if gini < b_score:
                b_index, b_value, b_score, b_groups = index, row[index], gini, groups
    return {'index':b_index, 'value':b_value, 'groups':b_groups}

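We now know how to modify a decision tree for use in the random forest algorithm. We can combine this with bagging and apply it to a real dataset.

Sonar Dataset Case Study

In this section we apply the random forest algorithm to the Sonar dataset. The example assumes that a CSV copy of the dataset is in the current working directory under the name sonar.all-data.csv.

The dataset is first loaded, the string values are converted to numbers, and the output column is converted from strings to the integer values 0 and 1. The helper functions load_csv(), str_column_to_float() and str_column_to_int() perform these steps respectively.

We use k-fold cross-validation to estimate how the learned model performs on unseen data. This means building and evaluating k models and estimating performance as the mean error of those k models. Each model is evaluated by classification accuracy. The helper functions cross_validation_split(), accuracy_metric() and evaluate_algorithm() implement this.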
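The fold construction and the accuracy measure can be sketched in condensed form as below. This is an illustration under stated assumptions, not the listing itself: the fold size is computed with integer division, so leftover rows are dropped, and the integers 0 to 207 stand in for the 208 data rows.

```python
from random import randrange, seed


def cross_validation_split(dataset, n_folds):
    """Split rows into n_folds folds of equal size, chosen at random."""
    dataset_copy = list(dataset)
    fold_size = len(dataset) // n_folds  # assumption: leftover rows are dropped
    folds = []
    for _ in range(n_folds):
        fold = []
        while len(fold) < fold_size:
            fold.append(dataset_copy.pop(randrange(len(dataset_copy))))
        folds.append(fold)
    return folds


def accuracy_metric(actual, predicted):
    """Percentage of positions where the prediction matches the label."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / float(len(actual)) * 100.0


seed(1)
folds = cross_validation_split(list(range(208)), 5)
print([len(fold) for fold in folds])                # [41, 41, 41, 41, 41]
print(accuracy_metric([0, 1, 1, 0], [0, 1, 0, 0]))  # 75.0
```

With 208 rows and 5 folds, each fold holds 41 rows and 3 rows are left out of the evaluation.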

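The trees used for bagging are built with the Classification and Regression Trees (CART) algorithm. The helper function test_split() splits the dataset into groups; gini_index() evaluates a split point; the modified get_split() function discussed above finds a split point; to_terminal(), split() and build_tree() create a single decision tree; predict() makes a prediction with a tree; subsample() draws a subsample of the training set; and bagging_predict() makes a prediction with a list of decision trees.

A new function named random_forest() first creates a list of decision trees from subsamples of the training set and then uses them to make predictions.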
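The prediction side can be sketched as follows. The bodies of predict() and bagging_predict() are not shown in this article, so both functions below are assumptions. They rely only on the node layout produced by get_split() and split(): a dict with 'index', 'value', 'left' and 'right', where a child is either another dict or a terminal class value. The three trees are hand-written and hypothetical, not trained on any data.

```python
def predict(node, row):
    """Walk a tree of nested dicts until a terminal (non-dict) value is reached."""
    branch = node['left'] if row[node['index']] < node['value'] else node['right']
    if isinstance(branch, dict):
        return predict(branch, row)
    return branch


def bagging_predict(trees, row):
    """Return the class predicted by the most trees (majority vote)."""
    predictions = [predict(tree, row) for tree in trees]
    return max(set(predictions), key=predictions.count)


# Three hand-written one-level trees (hypothetical values).
trees = [
    {'index': 0, 'value': 0.5, 'left': 0, 'right': 1},
    {'index': 0, 'value': 0.3, 'left': 0, 'right': 1},
    {'index': 1, 'value': 0.9, 'left': 0, 'right': 1},
]

print(bagging_predict(trees, [0.4, 0.2]))  # votes are 0, 1, 0, so the result is 0
```

Majority voting is what lets a forest of individually noisy trees produce a more stable prediction than any single tree.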

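As stated at the outset, the key difference between random forest and bagged decision trees is one small change to how the trees are built, and that change lives in the get_split() function.

The complete code is as follows: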

# Random Forest Algorithm on Sonar Dataset
from random import seed
from random import randrange
from csv import reader
from math import sqrt

# Load a CSV file
def load_csv(filename):
    dataset = list()
    with open(filename, 'r') as file:
        csv_reader = reader(file)
        for row in csv_reader:
            # skip empty lines
            if not row:
                continue
            dataset.append(row)
    return dataset

# Convert string column to float
def str_column_to_float(dataset, column):
    for row in dataset:
        row[column] = float(row[column].strip())

# Convert string column to integer
def str_column_to_int(dataset, column):
    class_values = [row[column] for row in dataset]
    unique = set(class_values)
    lookup = dict()
    for i, value in enumerate(unique):
        lookup[value] = i
    for row in dataset:
        row[column] = lookup[row[column]]
    return lookup

# Split a dataset into k folds
def cross_validation_split(dataset, n_folds):
    dataset_split = list()
    dataset_copy = list(dataset)
    # integer fold size, so the last fold cannot run out of rows under Python 3 division
    fold_size = int(len(dataset) / n_folds)
    for i in range(n_folds):
        fold = list()
        while len(fold) < fold_size:
            index = randrange(len(dataset_copy))
            fold.append(dataset_copy.pop(index))
        dataset_split.append(fold)
    return dataset_split

# Calculate accuracy percentage
def accuracy_metric(actual, predicted):
    correct = 0
    for i in range(len(actual)):
        # compare element by element, not the two lists as a whole
        if actual[i] == predicted[i]:
            correct += 1
    return correct / float(len(actual)) * 100.0

# Evaluate an algorithm using a cross validation split
def evaluate_algorithm(dataset, algorithm, n_folds, *args):
    folds = cross_validation_split(dataset, n_folds)
    scores = list()
    for fold in folds:
        # train on every fold except the current one
        train_set = list(folds)
        train_set.remove(fold)
        train_set = sum(train_set, [])
        # copy the held-out rows and hide their class labels
        test_set = list()
        for row in fold:
            row_copy = list(row)
            test_set.append(row_copy)
            row_copy[-1] = None
        predicted = algorithm(train_set, test_set, *args)
        actual = [row[-1] for row in fold]
        accuracy = accuracy_metric(actual, predicted)
        scores.append(accuracy)
    return scores

# Split a dataset based on an attribute and an attribute value
def test_split(index, value, dataset):
    left, right = list(), list()
    for row in dataset:
        if row[index] < value:
            left.append(row)
        else:
            right.append(row)
    return left, right

# Calculate the Gini index for a split dataset
def gini_index(groups, class_values):
    gini = 0.0
    for class_value in class_values:
        for group in groups:
            size = len(group)
            # an empty group contributes nothing and would divide by zero
            if size == 0:
                continue
            proportion = [row[-1] for row in group].count(class_value) / float(size)
            gini += (proportion * (1.0 - proportion))
    return gini

# Select the best split point for a dataset
def get_split(dataset, n_features):
    class_values = list(set(row[-1] for row in dataset))
    b_index, b_value, b_score, b_groups = 999, 999, 999, None
    # Randomly choose n_features distinct column indices (the last column is the class)
    features = list()
    while len(features) < n_features:
        index = randrange(len(dataset[0])-1)
        if index not in features:
            features.append(index)
    # Evaluate every value of each chosen feature as a candidate split point
    for index in features:
        for row in dataset:
            groups = test_split(index, row[index], dataset)
            gini = gini_index(groups, class_values)
            if gini < b_score:
                b_index, b_value, b_score, b_groups = index, row[index], gini, groups
    return {'index':b_index, 'value':b_value, 'groups':b_groups}

# Create a terminal node value
def to_terminal(group):
    outcomes = [row[-1] for row in group]
    return max(set(outcomes), key=outcomes.count)

# Create child splits for a node or make terminal
def split(node, max_depth, min_size, n_features, depth):
    left, right = node['groups']
    del(node['groups'])
    # check for a no split
    if not left or not right:
        node['left'] = node['right'] = to_terminal(left + right)
        return
    # check for max depth
    if depth >= max_depth:
        node['left'], node['right'] = to_terminal(left), to_terminal(right)
        return
    # process left child
    if len(left) <= min_size:
        node['left'] = to_terminal(left)
    else:
        node['left'] = get_split(left, n_features)
