
TPOT: a Python automated machine learning tool that optimizes machine learning pipelines using genetic programming

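TPOT stands for Tree-based Pipeline Optimization Tool. Consider TPOT your data science assistant. TPOT is a Python automated machine learning tool that optimizes machine learning pipelines using genetic programming.

TPOT automates the most tedious part of machine learning by intelligently exploring thousands of possible pipelines to find the best one for your data.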

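An example machine learning pipeline

Once TPOT has finished searching (or you get tired of waiting), it provides you with the Python code for the best pipeline it found, so you can tinker with the pipeline from there.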
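
The search described above can be pictured with a toy evolutionary loop. The following is only a minimal sketch, not TPOT's actual algorithm: TPOT's real search works on tree-structured scikit-learn pipelines and is considerably more elaborate. The step names, the `toy_fitness` scores, and the helper functions are all hypothetical stand-ins; in a real run the fitness would be a cross-validation score. The `generations` and `population_size` arguments mirror the parameters used in the examples later on.

```python
import random

# Hypothetical building blocks; a real search space is far larger.
PREPROCESSORS = ["none", "standard_scaler", "polynomial_features"]
MODELS = ["logistic_regression", "random_forest", "extra_trees"]


def random_pipeline(rng):
    """Draw a random (preprocessor, model) pair."""
    return (rng.choice(PREPROCESSORS), rng.choice(MODELS))


def mutate(pipeline, rng):
    """Replace one randomly chosen step with a random alternative."""
    pre, model = pipeline
    if rng.random() < 0.5:
        pre = rng.choice(PREPROCESSORS)
    else:
        model = rng.choice(MODELS)
    return (pre, model)


def toy_fitness(pipeline):
    """Stand-in for a cross-validation score (made-up numbers)."""
    pre_bonus = {"none": 0.00, "standard_scaler": 0.02, "polynomial_features": 0.05}
    model_score = {"logistic_regression": 0.90, "random_forest": 0.93, "extra_trees": 0.92}
    return model_score[pipeline[1]] + pre_bonus[pipeline[0]]


def evolve(generations=5, population_size=50, seed=42):
    """Keep the better half each generation and refill with mutated survivors."""
    rng = random.Random(seed)
    population = [random_pipeline(rng) for _ in range(population_size)]
    for _ in range(generations):
        population.sort(key=toy_fitness, reverse=True)
        survivors = population[: population_size // 2]
        children = [
            mutate(rng.choice(survivors), rng)
            for _ in range(population_size - len(survivors))
        ]
        population = survivors + children
    return max(population, key=toy_fitness)


best = evolve()
print(best, toy_fitness(best))
```

Because the survivors are carried over unchanged, the best score in the population can never get worse from one generation to the next. More generations or a larger population simply give the search more chances to find a better candidate, at the cost of more evaluations.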

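TPOT is built on top of scikit-learn, so all of the code it generates should look familiar, at least if you are familiar with scikit-learn.

TPOT is still under active development, and we encourage you to check this repository regularly for updates.

For more information about TPOT, please see the project documentation.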

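License

Please see the repository license for the licensing and usage information for TPOT.

Generally, we have licensed TPOT to make it as widely usable as possible.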

Installation

We maintain the TPOT installation instructions in the documentation. TPOT requires a working installation of Python.

Usage

TPOT can be used on the command line or with Python code.

The documentation has more information on both ways of using TPOT.

Examples

Classification

Below is a minimal working example with the optical recognition of handwritten digits dataset.

from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
                                                    train_size=0.75, test_size=0.25, random_state=42)

tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2, random_state=42)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_digits_pipeline.py')

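Running this code should discover a pipeline that achieves about 98% testing accuracy, and the corresponding Python code should be exported to the tpot_digits_pipeline.py file and look similar to the following: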

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import PolynomialFeatures
from tpot.builtins import StackingEstimator
from tpot.export_utils import set_param_recursive

# NOTE: Make sure that the outcome column is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'], random_state=42)

# Average CV score on the training set was: 0.9799428471757372
exported_pipeline = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False, interaction_only=False),
    StackingEstimator(estimator=LogisticRegression(C=0.1, dual=False, penalty="l1")),
    RandomForestClassifier(bootstrap=True, criterion="entropy", max_features=0.35000000000000003, min_samples_leaf=20, min_samples_split=19, n_estimators=100)
)
# Fix random state for all the steps in exported pipeline
set_param_recursive(exported_pipeline.steps, 'random_state', 42)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)

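Regression

Similarly, TPOT can optimize pipelines for regression problems. Below is a minimal working example with the practice Boston housing prices dataset.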

from tpot import TPOTRegressor
# Note: load_boston was removed in scikit-learn 1.2, so this example needs an older release.
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split

housing = load_boston()
X_train, X_test, y_train, y_test = train_test_split(housing.data, housing.target,
                                                    train_size=0.75, test_size=0.25, random_state=42)

tpot = TPOTRegressor(generations=5, population_size=50, verbosity=2, random_state=42)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_boston_pipeline.py')

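This should result in a pipeline that achieves a mean squared error (MSE) of about 12.77, and the Python code in tpot_boston_pipeline.py should look similar to the following: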

import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from tpot.export_utils import set_param_recursive

# NOTE: Make sure that the outcome column is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'], random_state=42)

# Average CV score on the training set was: -10.812040755234403
exported_pipeline = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False, interaction_only=False),
    ExtraTreesRegressor(bootstrap=False, max_features=0.5, min_samples_leaf=2, min_samples_split=3, n_estimators=100)
)
# Fix random state for all the steps in exported pipeline
set_param_recursive(exported_pipeline.steps, 'random_state', 42)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)
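
The comment in the exported code reports a negative CV score (about -10.81), while the text above quotes a positive MSE (about 12.77). This is consistent with scikit-learn's convention of negating error metrics so that a higher score is always better, as in its `neg_mean_squared_error` scorer. That TPOT uses this scorer by default for regression is an assumption here, not something stated above. The following minimal sketch shows the arithmetic in plain Python; the helper names and the sample values are illustrative only.

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared differences between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def neg_mean_squared_error(y_true, y_pred):
    """Negated MSE, so that larger values mean better predictions."""
    return -mean_squared_error(y_true, y_pred)


# Made-up targets and predictions, purely for illustration.
y_true = [3.0, 5.0, 2.0]
y_pred = [2.0, 5.0, 4.0]

mse = mean_squared_error(y_true, y_pred)      # (1 + 0 + 4) / 3
neg = neg_mean_squared_error(y_true, y_pred)  # same magnitude, opposite sign
print(mse, neg)
```

Under this convention, a reported score of -10.81 corresponds to an MSE of 10.81, so the cross-validation and test-set figures can be compared directly once the sign is flipped.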

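Check the documentation for more examples and tutorials.

Contributing to TPOT

We welcome you to check the existing issues for bugs or enhancements to work on. If you have an idea for an extension to TPOT, please file a new issue so we can discuss it.

Before submitting any contributions, please review our contribution guidelines.

Having problems or have questions about TPOT?

Please check the existing open and closed issues to see if your issue has already been attended to. If it has not, file a new issue on this repository so we can review it.

Citing TPOT

If you use TPOT in a scientific publication, please consider citing at least one of the following papers: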

Trang T. Le, Weixuan Fu and Jason H. Moore (2020). Scaling tree-based automated machine learning to biomedical big data with a feature set selector. Bioinformatics 36(1): 250-256.

BibTeX entry:

@article{le2020scaling,
  title={Scaling tree-based automated machine learning to biomedical big data with a feature set selector},
  author={Le, Trang T and Fu, Weixuan and Moore, Jason H},
  journal={Bioinformatics},
  volume={36},
  number={1},
  pages={250--256},
  year={2020},
  publisher={Oxford University Press}
}

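Randal S. Olson, Ryan J. Urbanowicz, Peter C. Andrews, Nicole A. Lavender, La Creis Kidd, and Jason H. Moore (2016). Automating biomedical data science through tree-based pipeline optimization. Applications of Evolutionary Computation, pages 123-137.

BibTeX entry: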

@inbook{Olson2016EvoBio,
    author={Olson, Randal S. and Urbanowicz, Ryan J. and Andrews, Peter C. and Lavender, Nicole A. and Kidd, La Creis and Moore, Jason H.},
    editor={Squillero, Giovanni and Burelli, Paolo},
    chapter={Automating Biomedical Data Science Through Tree-Based Pipeline Optimization},
    title={Applications of Evolutionary Computation: 19th European Conference, EvoApplications 2016, Porto, Portugal, March 30 -- April 1, 2016, Proceedings, Part I},
    year={2016},
    publisher={Springer International Publishing},
    pages={123--137},
    isbn={978-3-319-31204-0},
    doi={10.1007/978-3-319-31204-0_9},
    url={http://dx.doi.org/10.1007/978-3-319-31204-0_9}
}

Randal S. Olson, Nathan Bartley, Ryan J. Urbanowicz, and Jason H. Moore (2016). Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science. Proceedings of GECCO 2016, pages 485-492.

BibTeX entry:

@inproceedings{OlsonGECCO2016,
    author = {Olson, Randal S. and Bartley, Nathan and Urbanowicz, Ryan J. and Moore, Jason H.},
    title = {Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science},
    booktitle = {Proceedings of the Genetic and Evolutionary Computation Conference 2016},
    series = {GECCO '16},
    year = {2016},
    isbn = {978-1-4503-4206-3},
    location = {Denver, Colorado, USA},
    pages = {485--492},
    numpages = {8},
    url = {http://doi.acm.org/10.1145/2908812.2908918},
    doi = {10.1145/2908812.2908918},
    acmid = {2908918},
    publisher = {ACM},
    address = {New York, NY, USA},
}

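Alternatively, you can cite the repository directly using its DOI.

Support for TPOT

TPOT was developed in the Computational Genetics Lab at the University of Pennsylvania with funding from the NIH under grant R01 AI117694. We are incredibly grateful for the support of the NIH and the University of Pennsylvania during the development of this project.

The TPOT logo was designed by Todd Newmuis, who generously donated his time to the project.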

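AutoKeras: an AutoML library for deep learning

Official website: autokeras.com

AutoKeras is an AutoML system based on Keras. It is developed by DATA Lab at Texas A&M University. The goal of AutoKeras is to make machine learning accessible to everyone.

Learning resources

  • A short example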
import autokeras as ak

clf = ak.ImageClassifier()
clf.fit(x_train, y_train)
results = clf.predict(x_test)

Installation

To install the package, use pip as follows:

pip3 install autokeras

Please follow the installation guide for more details.

Note: Currently, AutoKeras is only compatible with Python >= 3.5 and TensorFlow >= 2.3.0.
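
A quick way to check an environment against minimum versions like these is to compare version numbers as tuples of integers instead of as strings. The following is a minimal sketch using only the standard library; `parse_version` and `meets_minimum` are hypothetical helpers written for this illustration, not part of AutoKeras, and the parser deliberately ignores suffixes such as `rc1`.

```python
import sys


def parse_version(text):
    """Turn '2.3.0' into (2, 3, 0); stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if ch.isdigit():
                digits += ch
            else:
                break
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare two version strings numerically, padding with zeros."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


# The interpreter version is available directly as a tuple.
python_ok = sys.version_info[:2] >= (3, 5)
print("Python >= 3.5:", python_ok)

# Example version strings; in practice the installed TensorFlow version
# would come from the environment being checked.
print(meets_minimum("2.10.0", "2.3.0"))
print(meets_minimum("2.2.9", "2.3.0"))
```

Comparing tuples avoids the pitfall of string comparison, where "2.10.0" would sort before "2.3.0". On Python 3.8 and later, the installed version string of a package can be read with `importlib.metadata.version`.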

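Community

Stay up to date

Twitter: You can also follow us on Twitter @autokeras for the latest news.

Email: Subscribe to our email list to receive announcements.

Questions and discussions

GitHub Discussions: Ask your questions on our GitHub Discussions, a forum hosted on GitHub. We will monitor and answer questions there.

Instant communications

Slack: Request an invitation, then use the #autokeras channel for communication.

QQ group: Join our QQ group 1150366085. Password: akqqgroup

Online meetings: Join the online meeting Google group. The calendar event will appear on your Google Calendar.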

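Contributing code

We are committed to keeping everything about AutoKeras open to the public. Everyone can easily join as a developer. Here is how we manage our project.

  • Triage the issues: We pick the critical issues to work on from GitHub issues. They are added to this Project. Some of the issues are then added to the milestones, which are used to plan releases.
  • Assign the tasks: We assign tasks to people during the online meetings.
  • Discuss: We can have discussions in multiple places. Code reviews happen on GitHub. Questions can be asked in Slack or during the meetings.

Please join our Slack and send Haifeng Jin a message, or drop by our online meetings and talk to us. We will help you get started!

Refer to our Contributing Guide to learn the best practices.

Thanks to all the contributors!

Donation

We accept financial support on Open Collective. Thanks to every backer for supporting us!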


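Cite this work

Haifeng Jin, Qingquan Song, and Xia Hu. "Auto-Keras: An Efficient Neural Architecture Search System." Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 2019.

Biblatex entry: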

@inproceedings{jin2019auto,
  title={Auto-Keras: An Efficient Neural Architecture Search System},
  author={Jin, Haifeng and Song, Qingquan and Hu, Xia},
  booktitle={Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery \& Data Mining},
  pages={1946--1956},
  year={2019},
  organization={ACM}
}

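Acknowledgements

The authors gratefully acknowledge the D3M program of the Defense Advanced Research Projects Agency (DARPA), administered through AFRL contract FA8750-17-2-0116; the Texas A&M College of Engineering; and Texas A&M University.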