Tag archive: gradient-boosting

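TPOT: a Python automated machine learning tool that optimizes machine learning pipelines using genetic programming

TPOT stands for Tree-based Pipeline Optimization Tool. Consider TPOT your data science assistant. TPOT is a Python automated machine learning tool that optimizes machine learning pipelines using genetic programming.

TPOT automates the most tedious part of machine learning by intelligently exploring thousands of possible pipelines to find the best one for your data.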

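An example machine learning pipeline

Once TPOT has finished searching (or you get tired of waiting), it gives you the Python code for the best pipeline it found, so you can tinker with the pipeline from there.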
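To make the search idea concrete, here is a minimal sketch of an evolutionary search over scikit-learn pipelines. It is not TPOT's algorithm: the three-gene genome, the tiny search space, and the helper names (`random_genome`, `mutate`, `evolve`) are all hypothetical and chosen only for illustration. TPOT's genetic programming works on tree-structured pipelines and a far larger operator set.

```python
import random

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Hypothetical, deliberately tiny search space: (scaler, model, hyperparameter).
SCALERS = {"standard": StandardScaler, "minmax": MinMaxScaler}
MODELS = {
    "knn": lambda k: KNeighborsClassifier(n_neighbors=k),
    "tree": lambda k: DecisionTreeClassifier(max_depth=k, random_state=0),
}


def random_genome(rng):
    """Draw a random (scaler name, model name, integer hyperparameter) triple."""
    return (rng.choice(sorted(SCALERS)), rng.choice(sorted(MODELS)), rng.randint(1, 10))


def mutate(genome, rng):
    """Re-draw one randomly chosen gene of the genome."""
    scaler, model, k = genome
    gene = rng.randrange(3)
    if gene == 0:
        scaler = rng.choice(sorted(SCALERS))
    elif gene == 1:
        model = rng.choice(sorted(MODELS))
    else:
        k = rng.randint(1, 10)
    return (scaler, model, k)


def build(genome):
    """Turn a genome into a scikit-learn pipeline."""
    scaler, model, k = genome
    return make_pipeline(SCALERS[scaler](), MODELS[model](k))


def fitness(genome, X, y):
    """Mean 3-fold cross-validated accuracy of the pipeline."""
    return cross_val_score(build(genome), X, y, cv=3).mean()


def evolve(X, y, generations=3, population_size=6, seed=0):
    rng = random.Random(seed)
    population = [random_genome(rng) for _ in range(population_size)]
    for _ in range(generations):
        # Keep the better half, then refill with mutated copies of survivors.
        ranked = sorted(population, key=lambda g: fitness(g, X, y), reverse=True)
        survivors = ranked[: population_size // 2]
        children = [mutate(rng.choice(survivors), rng) for _ in survivors]
        population = survivors + children
    best = max(population, key=lambda g: fitness(g, X, y))
    return best, fitness(best, X, y)


X, y = load_iris(return_X_y=True)
best_genome, best_score = evolve(X, y)
print(best_genome, round(best_score, 3))
```

The `generations` and `population_size` arguments play the same role here as the identically named TPOT parameters in the examples further down: more of either means more pipelines evaluated and a longer run.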

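TPOT is built on top of scikit-learn, so all of the code it generates should look familiar... if you're familiar with scikit-learn, anyway.

TPOT is still under active development, and we encourage you to check back on this repository regularly for updates.

For further information about TPOT, please see the project documentation.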

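License

Please see the repository license for the licensing and usage information for TPOT.

Generally, we have licensed TPOT to make it as widely usable as possible.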

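Installation

We maintain the TPOT installation instructions in the documentation. TPOT requires a working installation of Python.

Usage

TPOT can be used on the command line or with Python code.

The documentation has more information on both ways of using TPOT.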

Examples

Classification

Below is a minimal working example with the optical recognition of handwritten digits dataset.

from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
                                                    train_size=0.75, test_size=0.25, random_state=42)

tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2, random_state=42)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_digits_pipeline.py')

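Running this code should discover a pipeline that achieves about 98% testing accuracy, and the corresponding Python code should be exported to the tpot_digits_pipeline.py file and look similar to the following: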

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import PolynomialFeatures
from tpot.builtins import StackingEstimator
from tpot.export_utils import set_param_recursive

# NOTE: Make sure that the outcome column is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'], random_state=42)

# Average CV score on the training set was: 0.9799428471757372
exported_pipeline = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False, interaction_only=False),
    StackingEstimator(estimator=LogisticRegression(C=0.1, dual=False, penalty="l1")),
    RandomForestClassifier(bootstrap=True, criterion="entropy", max_features=0.35000000000000003, min_samples_leaf=20, min_samples_split=19, n_estimators=100)
)
# Fix random state for all the steps in exported pipeline
set_param_recursive(exported_pipeline.steps, 'random_state', 42)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)
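The exported script ends by calling predict without scoring the result. A minimal sketch of measuring test accuracy on in-memory data follows. To keep it dependent on scikit-learn only, it uses a simplified stand-in for the exported pipeline (just the final random forest, with max_features rounded to 0.35), so its accuracy is not expected to match the 98% quoted for the full pipeline.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Same split as in the TPOT example, but taken directly from load_digits
# instead of the placeholder CSV path in the exported script.
digits = load_digits()
training_features, testing_features, training_target, testing_target = train_test_split(
    digits.data, digits.target, train_size=0.75, test_size=0.25, random_state=42
)

# Simplified stand-in (an assumption of this sketch): only the final
# random forest step of the exported pipeline is kept.
pipeline = make_pipeline(
    RandomForestClassifier(
        bootstrap=True, criterion="entropy", max_features=0.35,
        min_samples_leaf=20, min_samples_split=19, n_estimators=100,
        random_state=42,
    )
)
pipeline.fit(training_features, training_target)
results = pipeline.predict(testing_features)

# Fraction of held-out samples predicted correctly.
test_accuracy = accuracy_score(testing_target, results)
print(f"test accuracy: {test_accuracy:.3f}")
```

Because the exported object is an ordinary scikit-learn pipeline, the same two lines (predict, then accuracy_score) should apply unchanged to the full exported_pipeline.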

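Regression

Similarly, TPOT can optimize pipelines for regression problems. Below is a minimal working example with the practice Boston housing prices dataset.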

from tpot import TPOTRegressor
# NOTE: load_boston was removed in scikit-learn 1.2, so this example needs an older release
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split

housing = load_boston()
X_train, X_test, y_train, y_test = train_test_split(housing.data, housing.target,
                                                    train_size=0.75, test_size=0.25, random_state=42)

tpot = TPOTRegressor(generations=5, population_size=50, verbosity=2, random_state=42)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_boston_pipeline.py')

This should result in a pipeline that achieves about 12.77 mean squared error (MSE), and the Python code in tpot_boston_pipeline.py should look similar to:

import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from tpot.export_utils import set_param_recursive

# NOTE: Make sure that the outcome column is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'], random_state=42)

# Average CV score on the training set was: -10.812040755234403
exported_pipeline = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False, interaction_only=False),
    ExtraTreesRegressor(bootstrap=False, max_features=0.5, min_samples_leaf=2, min_samples_split=3, n_estimators=100)
)
# Fix random state for all the steps in exported pipeline
set_param_recursive(exported_pipeline.steps, 'random_state', 42)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)
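The negative "Average CV score" in the listing above is consistent with scikit-learn's convention that scorers are "greater is better", so error metrics such as MSE are reported negated. Assuming TPOT's regression scoring follows that convention (the sign of the exported comment suggests it does), an MSE of 12.77 corresponds to a score of about -12.77. The sketch below shows the relationship using scikit-learn alone; the synthetic dataset and the model settings are placeholders, not the Boston example.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data stands in for a real dataset (an assumption of this sketch).
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = ExtraTreesRegressor(n_estimators=50, random_state=42)

# scikit-learn scorers are "greater is better", so MSE comes back negated.
cv_scores = cross_val_score(
    model, X_train, y_train, cv=5, scoring="neg_mean_squared_error"
)
cv_mse = -cv_scores.mean()  # flip the sign to recover an ordinary MSE

# Plain (positive) MSE on the held-out split.
model.fit(X_train, y_train)
test_mse = mean_squared_error(y_test, model.predict(X_test))
print(f"CV MSE: {cv_mse:.2f}, test MSE: {test_mse:.2f}")
```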

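Check the documentation for more examples and tutorials.

Contributing to TPOT

We welcome you to check the existing issues for bugs or enhancements to work on. If you have an idea for an extension to TPOT, please file a new issue so we can discuss it.

Before submitting any contributions, please review our contribution guidelines.

Having problems or have questions about TPOT?

Please check the existing open and closed issues to see if your issue has already been attended to. If it hasn't, file a new issue on this repository so we can review your issue.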

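Citing TPOT

If you use TPOT in a scientific publication, please consider citing at least one of the following papers:

Trang T. Le, Weixuan Fu and Jason H. Moore (2020). Scaling tree-based automated machine learning to biomedical big data with a feature set selector. Bioinformatics 36(1): 250-256.

BibTeX entry: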

@article{le2020scaling,
  title={Scaling tree-based automated machine learning to biomedical big data with a feature set selector},
  author={Le, Trang T and Fu, Weixuan and Moore, Jason H},
  journal={Bioinformatics},
  volume={36},
  number={1},
  pages={250--256},
  year={2020},
  publisher={Oxford University Press}
}

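Randal S. Olson, Ryan J. Urbanowicz, Peter C. Andrews, Nicole A. Lavender, La Creis Kidd, and Jason H. Moore (2016). Automating biomedical data science through tree-based pipeline optimization. Applications of Evolutionary Computation, pages 123-137.

BibTeX entry: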

@inbook{Olson2016EvoBio,
    author={Olson, Randal S. and Urbanowicz, Ryan J. and Andrews, Peter C. and Lavender, Nicole A. and Kidd, La Creis and Moore, Jason H.},
    editor={Squillero, Giovanni and Burelli, Paolo},
    chapter={Automating Biomedical Data Science Through Tree-Based Pipeline Optimization},
    title={Applications of Evolutionary Computation: 19th European Conference, EvoApplications 2016, Porto, Portugal, March 30 -- April 1, 2016, Proceedings, Part I},
    year={2016},
    publisher={Springer International Publishing},
    pages={123--137},
    isbn={978-3-319-31204-0},
    doi={10.1007/978-3-319-31204-0_9},
    url={http://dx.doi.org/10.1007/978-3-319-31204-0_9}
}

Randal S. Olson, Nathan Bartley, Ryan J. Urbanowicz, and Jason H. Moore (2016). Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science. Proceedings of GECCO 2016, pages 485-492.

BibTeX entry:

@inproceedings{OlsonGECCO2016,
    author = {Olson, Randal S. and Bartley, Nathan and Urbanowicz, Ryan J. and Moore, Jason H.},
    title = {Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science},
    booktitle = {Proceedings of the Genetic and Evolutionary Computation Conference 2016},
    series = {GECCO '16},
    year = {2016},
    isbn = {978-1-4503-4206-3},
    location = {Denver, Colorado, USA},
    pages = {485--492},
    numpages = {8},
    url = {http://doi.acm.org/10.1145/2908812.2908918},
    doi = {10.1145/2908812.2908918},
    acmid = {2908918},
    publisher = {ACM},
    address = {New York, NY, USA},
}

Alternatively, you can cite the repository directly using its DOI.

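Support for TPOT

TPOT was developed in the Computational Genetics Lab at the University of Pennsylvania with funding from the NIH under grant R01 AI117694. We are incredibly grateful for the support of the NIH and the University of Pennsylvania during the development of this project.

The TPOT logo was designed by Todd Newmuis, who generously donated his time to the project.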

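LightGBM: a fast, distributed, high-performance gradient boosting framework based on decision tree algorithms, used for ranking, classification, and many other machine learning tasks

LightGBM is a gradient boosting framework that uses tree-based learning algorithms. It is designed to be distributed and efficient, with the following advantages: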

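  • Faster training speed and higher efficiency
  • Lower memory usage
  • Better accuracy
  • Support for parallel, distributed, and GPU learning
  • Capable of handling large-scale data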
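For readers new to the term, the core idea of gradient boosting with trees can be sketched in a few lines. This is a from-scratch illustration for squared-error loss built on scikit-learn's DecisionTreeRegressor, not LightGBM's implementation. The function names and the toy sine dataset are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_gradient_boosting(X, y, n_rounds=50, learning_rate=0.1, max_depth=2):
    """Fit an additive ensemble of small trees for squared-error loss."""
    base = float(np.mean(y))  # start from a constant prediction
    prediction = np.full(len(y), base)
    trees = []
    for _ in range(n_rounds):
        # For 0.5 * (y - f)^2 the negative gradient is the residual y - f.
        residuals = y - prediction
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residuals)
        # Take a small step in the direction the new tree suggests.
        prediction = prediction + learning_rate * tree.predict(X)
        trees.append(tree)
    return base, learning_rate, trees


def predict_gradient_boosting(model, X):
    """Sum the constant base prediction and all scaled tree outputs."""
    base, learning_rate, trees = model
    return base + learning_rate * sum(tree.predict(X) for tree in trees)


# Toy data: a noisy sine curve.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

model = fit_gradient_boosting(X, y)
mse_baseline = float(np.mean((y - y.mean()) ** 2))
mse_boosted = float(np.mean((y - predict_gradient_boosting(model, X)) ** 2))
print(f"baseline MSE: {mse_baseline:.4f}, boosted MSE: {mse_boosted:.4f}")
```

Each round fits a shallow tree to the current residuals and adds a damped copy of it to the ensemble, which is why training error falls as rounds accumulate.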

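For further details, please refer to Features.

Benefiting from these advantages, LightGBM is widely used in many winning solutions of machine learning competitions.

Comparison experiments on public datasets show that LightGBM can outperform existing boosting frameworks on both efficiency and accuracy, with significantly lower memory consumption. What's more, distributed learning experiments show that LightGBM can achieve a linear speed-up by using multiple machines for training in specific settings.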

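Get Started and Documentation

Our primary documentation is at https://lightgbm.readthedocs.io/ and is generated from this repository. If you are new to LightGBM, follow the installation instructions on that site.

Next you may want to read:

Documentation for contributors:

News

Please refer to the changelogs on the GitHub releases page.

Some older update logs are available on the Key Events page.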

External (Unofficial) Repositories

FLAML (AutoML library for hyperparameter optimization): https://github.com/microsoft/FLAML

Optuna (hyperparameter optimization framework): https://github.com/optuna/optuna

Julia-package: https://github.com/IQVIA-ML/LightGBM.jl

JPMML (Java PMML converter): https://github.com/jpmml/jpmml-lightgbm

Treelite (model compiler for efficient deployment): https://github.com/dmlc/treelite

lleaves (LLVM-based model compiler for efficient inference): https://github.com/siboehm/lleaves

Hummingbird (model compiler into tensor computations): https://github.com/microsoft/hummingbird

cuML Forest Inference Library (GPU-accelerated inference): https://github.com/rapidsai/cuml

daal4py (Intel CPU-accelerated inference): https://github.com/IntelPython/daal4py

m2cgen (model appliers for various languages): https://github.com/BayesWitnesses/m2cgen

leaves (Go model applier): https://github.com/dmitryikh/leaves

ONNXMLTools (ONNX converter): https://github.com/onnx/onnxmltools

SHAP (model output explainer): https://github.com/slundberg/shap

Shapash (model visualization and interpretation): https://github.com/MAIF/shapash

dtreeviz (decision tree visualization and model interpretation): https://github.com/parrt/dtreeviz

MMLSpark (LightGBM on Spark): https://github.com/Azure/mmlspark

Kubeflow Fairing (LightGBM on Kubernetes): https://github.com/kubeflow/fairing

Kubeflow Operator (LightGBM on Kubernetes): https://github.com/kubeflow/xgboost-operator

ML.NET (.NET/C#-package): https://github.com/dotnet/machinelearning

LightGBM.NET (.NET/C#-package): https://github.com/rca22/LightGBM.Net

Ruby gem: https://github.com/ankane/lightgbm

LightGBM4j (Java high-level binding): https://github.com/metarank/lightgbm4j

lightgbm-rs (Rust binding): https://github.com/vaaaaanquish/lightgbm-rs

MLflow (experiment tracking, model monitoring framework): https://github.com/mlflow/mlflow

{treesnip} (R {parsnip}-compliant interface): https://github.com/curso-r/treesnip

{mlr3learners.lightgbm} (R {mlr3}-compliant interface): https://github.com/mlr3learners/mlr3learners.lightgbm

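Support

How to Contribute

Check the CONTRIBUTING page.

Microsoft Open Source Code of Conduct

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.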

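Reference Papers

Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, Tie-Yan Liu. "LightGBM: A Highly Efficient Gradient Boosting Decision Tree". Advances in Neural Information Processing Systems (NIPS 2017), pp. 3149-3157.

Qi Meng, Guolin Ke, Taifeng Wang, Wei Chen, Qiwei Ye, Zhi-Ming Ma, Tie-Yan Liu. "A Communication-Efficient Parallel Algorithm for Decision Tree". Advances in Neural Information Processing Systems 29 (NIPS 2016), pp. 1279-1287.

Huan Zhang, Si Si and Cho-Jui Hsieh. "GPU Acceleration for Large-scale Tree Boosting". SysML Conference, 2018.

Note: If you use LightGBM in your GitHub projects, please add lightgbm to the requirements.txt.

License

This project is licensed under the terms of the MIT license. See LICENSE for additional details.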