Tag archive: pandas

Pyintervals: Solve Your Threshold-Checking Problems

Pyintervals (the pyinterval package, imported as interval) is a module for numeric interval arithmetic. When you need to check whether a value falls inside one interval, or inside any of a series of intervals, you can use it in place of IF-ELSE statements and simplify your code.

If you want to generate thousands of interval thresholds in one go and test values against them, for example mapping values onto hundreds of categories, this module is an excellent choice.

1. Preparation

Before you start, make sure Python and pip are successfully installed on your computer. If not, see this site's detailed Python installation guide.

(Option 1) If your goal with Python is data analysis, you can install Anaconda directly; it bundles Python and pip.

(Option 2) We also recommend the VSCode editor for writing small Python projects; see our detailed VSCode guide.

On Windows, open Cmd (Start - Run - CMD); on macOS, open Terminal (Command+Space and type Terminal), then install the dependency:

pip install pyinterval

2. Basic Usage

Interval checks with pyinterval are very simple:

from interval import interval
a = interval[1,5]
# interval([1.0, 5.0])
print(3 in a)
# True

You can also build a multi-interval made up of several sub-intervals:

from interval import interval
a = interval([0, 1], [2, 3], [10, 15])
print(2.5 in a)
# True

The interval.hull function merges several intervals, using their overall minimum and maximum as the new bounds:

from interval import interval
a = interval.hull((interval[1, 3], interval[10, 15], interval[16, 2222]))
# interval([1.0, 2222.0])
print(1231 in a)
# True

Computing the union of intervals:

from interval import interval
a = interval.union([interval([1, 3], [4, 6]), interval([2, 5], 9)])
# interval([1.0, 6.0], [9.0])
print(5 in a)
# True
print(8 in a)
# False

3. Generating Many Threshold Intervals

If you are training a deep-learning model on a classification task with a large number of classes, say hundreds, please don't write a pile of IF-ELSE branches. The four lines below generate hundreds of threshold intervals for you.

Suppose your values are distributed between 0 and 1, each threshold bucket is 0.005 wide, and you need both the positive and the negative direction. These four lines of code produce exactly the interval thresholds you want:

from interval import interval
import numpy as np
threshold_list = np.arange(0.0, 1.0, 0.005)
intervals = [interval([threshold_list[i - 1], threshold_list[i]]) for i in range(1, len(threshold_list))]
intervals += [interval([-threshold_list[i], -threshold_list[i - 1]]) for i in range(len(threshold_list) - 1, 0, -1)]
print(len(intervals))
# 398
print(intervals[0], intervals[-1])
# interval([0.0, 0.005]) interval([-0.005, -0.0])

With these threshold intervals, classifying a value becomes very simple. Below is a minimal example; in real work you would adapt it to your own application scenario.

target = 0.023
class_labels = {}
for index, interval_ in enumerate(intervals):
    if target in interval_:
        class_labels[target] = index

Pyinterval is a very handy module for anyone working on large-scale classification tasks; give it a try if you need it. Even if you don't right now, it is worth bookmarking, because a use case may well come up later.


Fetching Key US Economic Indicators with Python

US financial markets (mainly stocks, bonds and FX) are frequently moved by the country's domestic economic data releases. More than that, these releases even affect the Hong Kong and A-share markets on the other side of the Pacific. Since the US is the locomotive of the world economy, it pays to have some understanding of its economic data.

For the stock market, some of the more important indicators are:

1. FOMC statements

The Federal Open Market Committee (FOMC) decides the direction of US interest rates, and interest rates are the single most important factor for future US growth; changes in them affect consumer spending, corporate profits, government budgets, and the value of stocks, bonds and the dollar.

2. Consumer Price Index (CPI)

This is one of the most closely watched indicators in the market. The inflation (or deflation) rate is the Fed's main reference when deciding whether to raise rates, and the CPI is the most important inflation gauge. Inflation and deflation affect household livelihoods, government fiscal policy and all private economic activity. For investment markets they are worrying because they create economic instability and uncertainty, which brings volatility and risk to equities.

3. Producer Price Index (PPI)

Like the CPI, this is a key indicator for forecasting inflation, but it measures the producer side of the economy, before inflation is passed on to consumers. That makes it more forward-looking for inflation and rate policy, even though its impact on consumers is smaller than the CPI's.

4. Purchasing Managers Index (PMI)

The PMI is a comprehensive indicator that summarizes overall US manufacturing conditions, employment and prices, and is one of the most watched economic series in the world. It is the first major release of each month and reflects the economy fairly broadly, so markets pay close attention to it. Generally speaking, a rising PMI tends to push the dollar up, and a falling PMI tends to push it down.

5. Non-farm Payrolls (NFP)

This is the employment figure for the US non-farm population, published monthly by the US Department of Labor, and it reflects the trend of the US economy: a strong number signals improvement, a weak one signals deterioration. It also feeds into the Fed's monetary policy: in a weak economy the Fed leans toward rate cuts and a weaker dollar, in a strong economy toward rate hikes and a stronger dollar.

This article shows you how to use Python to call the FRED (Federal Reserve Economic Data) API and fetch the data above.

1. Preparation

Before you start, make sure Python and pip are successfully installed on your computer. If not, see this site's detailed Python installation guide.

(Option 1) If your goal with Python is data analysis, you can install Anaconda directly; it bundles Python and pip.

(Option 2) We also recommend the VSCode editor for writing small Python projects; see our detailed VSCode guide.

On Windows, open Cmd (Start - Run - CMD); on macOS, open Terminal (Command+Space and type Terminal), then install the dependency:

pip install fredapi

2. Register an Account and Get an API Key

To access FRED data you first need to register an account and then apply for an API key. It is completely free and takes about three minutes.

Go to the FRED home page: https://research.stlouisfed.org

Click "My Account" in the top-right corner, then choose "Create New Account" to register a FRED account:

After registering you will be taken to the account page; click API Keys in the left sidebar:

Apply for an API key:

Once the request is approved you will have an API Key.

3. Fetching FRED Data Through the API

FRED hosts a very large amount of data, organized into releases (top-level categories) and the series under each release. The releases can be fetched with code like this:

import requests
import pandas as pd
import datetime as dt
def fetch_releases(api_key):
    """
    Fetch the list of FRED releases (top-level categories).
    Args:
        api_key (str): your FRED API key
    """
    r = requests.get('https://api.stlouisfed.org/fred/releases?api_key='+api_key+'&file_type=json', verify=True)
    full_releases = r.json()['releases']
    full_releases = pd.DataFrame.from_dict(full_releases)
    full_releases = full_releases.set_index('id')
    # full_releases.to_csv("full_releases.csv")
    return full_releases

Export it to CSV and you can see every release ID and its description.

Each release contains many series. For example, release 355: Minimum Wage Rates includes:
* FEDMINNFRWG: Nonfarm Workers Minimum Hourly Wage
* FEDMINFRMWG: Farm Workers Minimum Hourly Wage

Each series has its own series code, the string shown before the colon above.
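
As a quick aside, once a Fred client has been created (as in the full script further below), a single series can be pulled directly by its series code. A minimal sketch, using the FEDMINNFRWG code listed above:

from fredapi import Fred

# Assumes you already have an API key from the previous step.
fred = Fred(api_key='YOUR_API_KEY')

# Fetch one series by its code; the result is a pandas Series indexed by date.
wage = fred.get_series('FEDMINNFRWG')
print(wage.tail())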

We can search the releases by keyword to pull the five indicators mentioned at the beginning of this article:

from fredapi import Fred
import requests
import numpy as np
import pandas as pd
import datetime as dt


def fetch_releases(api_key):
    """
    Fetch the list of FRED releases (top-level categories).
    Args:
        api_key (str): your FRED API key
    """
    r = requests.get('https://api.stlouisfed.org/fred/releases?api_key='+api_key+'&file_type=json', verify=True)
    full_releases = r.json()['releases']
    full_releases = pd.DataFrame.from_dict(full_releases)
    full_releases = full_releases.set_index('id')
    # full_releases.to_csv("full_releases.csv")
    return full_releases

  
def fetch_release_id_data(release_id):
    """
    Fetch data for a single release ID.

    Args:
        release_id (int): release (top-level category) ID

    Returns:
        dataframe: the series data for that release
    """
    econ_data = pd.DataFrame(index=pd.date_range(start='2000-01-01', end=dt.datetime.today(), freq='MS'))
    series_df = fred.search_by_release(release_id, limit=3, order_by='popularity', sort_order='desc')
    for topic_label in series_df.index:
        econ_data[series_df.loc[topic_label].title] = fred.get_series(topic_label, observation_start='2000-01-01', observation_end=dt.datetime.today())
    return econ_data


api_key = 'YOUR_API_KEY'  # fill in the API key you applied for

fred = Fred(api_key)

full_releases = fetch_releases(api_key)

keywords = ["producer price", "consumer price", "fomc", "manufacturing", "employment"]

for search_keywords in keywords:
    search_result = full_releases.name[full_releases.name.apply(lambda x: search_keywords in x.lower())]
    econ_data = pd.DataFrame(index=pd.date_range(start='2000-01-01', end=dt.datetime.today(), freq='MS'))

    for release_id in search_result.index:
        print("scraping release_id: ", release_id)
        econ_data = pd.concat([econ_data, fetch_release_id_data(release_id)], axis=1)
    econ_data.to_csv(f"{search_keywords}.csv")

That is the complete download script. If you want the ready-made script file, reply FRED in the Python实用宝典 official-account backend to download it.

Fill in the API key you applied for and run the script to fetch the five groups of indicator data we want. A CSV file named after each keyword is written to the current directory. For example, employment.csv looks like this:

It contains many employment-related series, including the non-farm payrolls we care about.

If you want to download data for other keywords, just add or remove entries in the keywords list. Note that the search must use lowercase words.

Getting the data is only the first step; the most important part is analyzing how it correlates with the stock market.

Deutsche Bank, for example, has research showing that the ISM (i.e. PMI) index has historically been highly correlated with the year-over-year change of the S&P 500.

Analysis angles like this one are quite interesting; try doing some research of your own on the data, and you may well find something unexpected.
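
As a hedged sketch of such an analysis (my own illustration: the file and column names are hypothetical, and the S&P 500 series would have to come from a separate data source):

import pandas as pd

# Hypothetical inputs: a monthly PMI series and monthly S&P 500 closes, both indexed by date.
pmi = pd.read_csv('manufacturing.csv', index_col=0, parse_dates=True)['PMI']
spx = pd.read_csv('sp500.csv', index_col=0, parse_dates=True)['close']

# Year-over-year change of the S&P 500, aligned to the PMI dates.
spx_yoy = spx.pct_change(12).reindex(pmi.index)

# A simple Pearson correlation as a first look at the relationship.
print(pmi.corr(spx_yoy))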


What Is the Best Format for Saving Pandas Data?

When working on data analysis projects I usually use Jupyter notebooks and the pandas library to process and move my data around. For moderately sized datasets this is a very straightforward process, and you can even store the data as plain text files without much overhead.

However, when the number of observations in your dataset is large, the process of saving the data and loading it back into memory becomes slower, and every restart of the program forces you to wait for the data to reload. Eventually, CSV files and any other plain-text format lose their appeal.

We can do better. There are plenty of binary formats for storing data on disk, and pandas supports many of them. How can we know which one better suits our purposes?

Well, let's try a few of them and compare! That is what I decided to do in this post: save a pandas.DataFrame to disk in several ways and see which one does better in terms of I/O speed, memory consumption and disk space.

In this post I will show my test results.

1. Formats to Compare

We will consider the following formats for storing our data:

1. CSV - a data scientist's good friend
2. Pickle - the Python way of serializing things
3. MessagePack - like JSON, but fast and small
4. HDF5 - a file format designed to store and organize large amounts of data
5. Feather - a fast, lightweight, easy-to-use binary format for storing data frames
6. Parquet - Apache Hadoop's columnar storage format

All of these formats are widely used, and (with the possible exception of MessagePack) you will run into them very often when doing data analysis work.

In the quest for the best buffer format to store data between program sessions, I chose the following metrics for the comparison:

1. size_mb - the file size (in MB).
2. save_time - the time needed to save a data frame to disk.
3. load_time - the time needed to load a previously dumped data frame back into memory.
4. save_ram_delta_mb - the maximum growth in memory consumption while saving a data frame.
5. load_ram_delta_mb - the maximum growth in memory consumption while loading a data frame.

Note that the last two metrics become very important when we use efficiently compressed binary formats such as Parquet: they help us estimate how much RAM is needed to load the serialized data, in addition to the size of the data itself. We will discuss this in more detail in the next sections.
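
To make these metrics concrete, here is a minimal sketch (my own illustration, not the article's benchmark code) of how the save/load times and a rough memory delta could be measured for one format. It assumes psutil and pyarrow are installed, and it only samples the process RSS before and after each step rather than tracking the true maximum.

import os
import time

import pandas as pd
import psutil


def benchmark_save_load(df: pd.DataFrame, path: str = 'test.parquet'):
    """Measure file size, save/load times and RSS growth for one format."""
    proc = psutil.Process(os.getpid())

    ram_before = proc.memory_info().rss / 1024 ** 2
    t0 = time.time()
    df.to_parquet(path)                    # swap in to_csv / to_feather / ... to compare
    save_time = time.time() - t0
    save_ram_delta_mb = proc.memory_info().rss / 1024 ** 2 - ram_before

    size_mb = os.path.getsize(path) / 1024 ** 2

    ram_before = proc.memory_info().rss / 1024 ** 2
    t0 = time.time()
    _ = pd.read_parquet(path)              # swap in read_csv / read_feather / ...
    load_time = time.time() - t0
    load_ram_delta_mb = proc.memory_info().rss / 1024 ** 2 - ram_before

    return size_mb, save_time, load_time, save_ram_delta_mb, load_ram_delta_mb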

2. Tests and Results

I decided to use a synthetic dataset for the tests, to have better control over the structure and properties of the serialized data.

I also used two different approaches in my benchmark:

(a) keeping the generated categorical variables as strings, and

(b) converting them to the pandas.Categorical data type before performing any I/O (a short sketch of this conversion follows below).
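
Approach (b) boils down to something like the following minimal sketch (my own illustration, not code taken from the benchmark repository):

import pandas as pd

def to_categorical(df: pd.DataFrame) -> pd.DataFrame:
    """Convert every object (string) column to the pandas.Categorical dtype."""
    out = df.copy()
    for col in out.select_dtypes(include='object').columns:
        out[col] = out[col].astype('category')
    return out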

The function generate_dataset shows how I generated the datasets in the benchmark:

def generate_dataset(n_rows, num_count, cat_count, max_nan=0.1, max_cat_size=100):
    """
    Randomly generate a dataset with numeric and categorical features.

    Numeric features are drawn from a normal distribution X ~ N(0, 1).
    Categorical features are generated as random uuid4 strings.

    In addition, a max_nan fraction of both the numeric and categorical
    values is replaced with NaN.
    """
    dataset, types = {}, {}
    
    def generate_categories():
        from uuid import uuid4
        category_size = np.random.randint(2, max_cat_size)
        return [str(uuid4()) for _ in range(category_size)]
    
    for col in range(num_count):
        name = f'n{col}'
        values = np.random.normal(0, 1, n_rows)
        nan_cnt = np.random.randint(1, int(max_nan*n_rows))
        index = np.random.choice(n_rows, nan_cnt, replace=False)
        values[index] = np.nan
        dataset[name] = values
        types[name] = 'float32'
        
    for col in range(cat_count):
        name = f'c{col}'
        cats = generate_categories()
        values = np.array(np.random.choice(cats, n_rows, replace=True), dtype=object)
        nan_cnt = np.random.randint(1, int(max_nan*n_rows))
        index = np.random.choice(n_rows, nan_cnt, replace=False)
        values[index] = np.nan
        dataset[name] = values
        types[name] = 'object'
    
    return pd.DataFrame(dataset), types

We use the saving and loading performance of CSV files as a baseline.

Five randomly generated datasets with one million observations each were dumped to CSV and read back into memory to obtain the average metrics.

Each binary format was then tested against 20 randomly generated datasets with the same number of rows.

The datasets contain 15 numeric and 15 categorical features. You can find the complete source code, including the benchmark functions, in this repository:

https://github.com/devforfu/pandas-formats-benchmark

Or reply 'Pandas IO对比' in the Python实用宝典 official-account backend to download the complete code.

(a) Performance with string features

Comparing the average I/O time of each data format gives an interesting observation: hdf is even slower to load than csv, while the other binary formats perform noticeably better. The most impressive ones are feather and parquet.

What about memory overhead while saving the data and reading it back from disk?

Here hdf again does not perform that well. Unsurprisingly, csv needs little extra memory to save or load plain-text strings, while feather and parquet come quite close to it.

Finally, let's look at the file sizes. This time parquet shows an impressive result, which is not surprising given that the format was developed to store large volumes of data efficiently.

(b) Performance with string features converted to categoricals

In the previous section we did not try to store our categorical features efficiently; we just used plain strings. Let's fix that omission! This time we use the dedicated pandas.Categorical type to turn the string features into integer-coded categorical features.

Look at how it compares with plain-text csv now!

All of the binary formats now show their true power. The csv baseline is far behind, so let's drop it to see the differences between the binary formats more clearly:

Feather and Pickle show the best I/O speed, while hdf still carries a noticeable performance overhead.

Now it is time to compare memory consumption while loading the data. The comparison highlights the important fact about the parquet format that we mentioned earlier:

You can see how large the gap is between parquet's size on disk and its memory footprint during reads and writes; you may simply be unable to load a relatively large parquet file into memory.

The final comparison is of file sizes. All formats do well here, except hdf, which still needs much more space than the others.

3. Conclusion

As our tests show, the feather format looks like an ideal candidate for storing data between Python sessions. It shows fast I/O, does not take too much space on disk, and does not need an outsized amount of memory when loading back into RAM.

Of course, this comparison does not mean you should use feather in every possible case. For example, feather is generally not intended as a long-term file storage format.

Also, in some specific situations feather cannot be used at all; that depends on the overall architecture of your program. For the purpose stated at the beginning of this post, however, and in the absence of any special constraints, it is a very good choice.
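
For reference, a Feather round-trip from pandas looks roughly like this (a minimal sketch; it requires the pyarrow package):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})

# Fast binary round-trip; handy between sessions, but not meant for long-term archival.
df.to_feather('session_data.feather')
restored = pd.read_feather('session_data.feather')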

This article is translated from towardsdatascience.
Author: Ilia Zaitsev
With some modifications.


Pandarallel: A Tool That Lets Your Pandas Computations Run at Full Firepower

[Animation: running without Pandarallel]
[Animation: running with Pandarallel]

As we all know, because of the GIL, all operations in a single Python process run on one CPU core, so to speed things up we usually turn to multiple processes. The usual multi-process options are:

  • 1. multiprocessing
  • 2. concurrent.futures.ProcessPoolExecutor()
  • 3. joblib
  • 4. ppserver
  • 5. celery

None of these is particularly friendly to the average pandas user. What counts as a friendly parallelization scheme? One where the original logic barely changes and you only modify the line that does the computation. pandarallel is exactly that kind of friendly tool.

Without parallelization (vanilla pandas) → with pandarallel
df.apply(func) → df.parallel_apply(func)
df.applymap(func) → df.parallel_applymap(func)
df.groupby(args).apply(func) → df.groupby(args).parallel_apply(func)
df.groupby(args1).col_name.rolling(args2).apply(func) → df.groupby(args1).col_name.rolling(args2).parallel_apply(func)
df.groupby(args1).col_name.expanding(args2).apply(func) → df.groupby(args1).col_name.expanding(args2).parallel_apply(func)
series.map(func) → series.parallel_map(func)
series.apply(func) → series.parallel_apply(func)
series.rolling(args).apply(func) → series.rolling(args).parallel_apply(func)

As you can see, in the pandarallel world you just replace the original pandas call with its parallel_ counterpart to get multi-CPU parallel computation. Very convenient, very nice.

In a benchmark on a 4-core CPU it ran close to 4x faster than the original statements (test setup: OS: Linux Ubuntu 16.04, Hardware: Intel Core i7 @ 3.40 GHz, 4 cores). This is what I mean by making full use of the CPU.

Below is how to use this module. It is really simple: any existing code needs only a few extra lines to get a qualitative leap in speed.

1. Preparation

Before you start, make sure Python and pip are successfully installed on your computer. If not, see this site's detailed Python installation guide.

(Option 1) If your goal with Python is data analysis, you can install Anaconda directly; it bundles Python and pip.

(Option 2) We also recommend the VSCode editor for writing small Python projects; see our detailed VSCode guide.

On Windows, open Cmd (Start - Run - CMD); on macOS, open Terminal (Command+Space and type Terminal), then install the dependency:

pip install pandarallel

For Windows users there is one piece of bad news: pandarallel can only run inside the Windows Subsystem for Linux (WSL). You can find the installation guide on Microsoft's site:

https://docs.microsoft.com/zh-cn/windows/wsl/about

2. Using Pandarallel

Before use, pandarallel needs to be initialized:

from pandarallel import pandarallel
pandarallel.initialize()

Only then can the parallel computation APIs be called. One important parameter of initialize deserves a mention: nb_workers, which sets the number of worker processes used for the parallel computation. If it is not set, all CPU cores are used.
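
For example, a minimal sketch (the progress_bar flag shown here exists in recent pandarallel releases, so treat it as optional):

from pandarallel import pandarallel

# Use 4 worker processes instead of every core, and display a progress bar.
pandarallel.initialize(nb_workers=4, progress_bar=True)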

Pandarallel supports eight pandas operations in total; below is an example using the apply method.

import pandas as pd
import time
import math
import numpy as np
from pandarallel import pandarallel

# initialize pandarallel
pandarallel.initialize()
df_size = int(5e6)
df = pd.DataFrame(dict(a=np.random.randint(1, 8, df_size),
                       b=np.random.rand(df_size)))
def func(x):
    return math.sin(x.a**2) + math.sin(x.b**2)

# normal (serial) processing
res = df.apply(func, axis=1)

# parallel processing
res_parallel = df.parallel_apply(func, axis=1)

# check that both results are identical
res.equals(res_parallel)

The other methods are used in the same way: prefix the original method name with parallel_. For example, DataFrame.groupby.apply:

import pandas as pd
import time
import math
import numpy as np
from pandarallel import pandarallel

# initialize pandarallel
pandarallel.initialize()
df_size = int(3e7)
df = pd.DataFrame(dict(a=np.random.randint(1, 1000, df_size),
                       b=np.random.rand(df_size)))
def func(df):
    dum = 0
    for item in df.b:
        dum += math.log10(math.sqrt(math.exp(item**2)))
        
    return dum / len(df.b)

# normal (serial) processing
res = df.groupby("a").apply(func)
# parallel processing
res_parallel = df.groupby("a").parallel_apply(func)
res.equals(res_parallel)

Or, for example, DataFrame.groupby.rolling.apply:

import pandas as pd
import time
import math
import numpy as np
from pandarallel import pandarallel

# initialize pandarallel
pandarallel.initialize()
df_size = int(1e6)
df = pd.DataFrame(dict(a=np.random.randint(1, 300, df_size),
                       b=np.random.rand(df_size)))
def func(x):
    return x.iloc[0] + x.iloc[1] ** 2 + x.iloc[2] ** 3 + x.iloc[3] ** 4

# normal (serial) processing
res = df.groupby('a').b.rolling(4).apply(func, raw=False)
# parallel processing
res_parallel = df.groupby('a').b.rolling(4).parallel_apply(func, raw=False)
res.equals(res_parallel)

The remaining cases are all similar, so rather than wasting your valuable time on more repetitive examples, here is the full table again:

Without parallelization (vanilla pandas) → with pandarallel
df.apply(func) → df.parallel_apply(func)
df.applymap(func) → df.parallel_applymap(func)
df.groupby(args).apply(func) → df.groupby(args).parallel_apply(func)
df.groupby(args1).col_name.rolling(args2).apply(func) → df.groupby(args1).col_name.rolling(args2).parallel_apply(func)
df.groupby(args1).col_name.expanding(args2).apply(func) → df.groupby(args1).col_name.expanding(args2).parallel_apply(func)
series.map(func) → series.parallel_map(func)
series.apply(func) → series.parallel_apply(func)
series.rolling(args).apply(func) → series.rolling(args).parallel_apply(func)

3. Notes

1. I have 8 CPUs, but parallel_apply only speeds up my computation by about 4x. Why?

Answer: as mentioned earlier, each Python process runs on a single core, so pandarallel can at most speed things up by the number of cores you actually have. A 4-core hyper-threaded CPU presents 8 CPUs to the operating system but really has only 4 physical cores, hence at most roughly a 4x speed-up (see the short snippet after these notes for checking physical vs. logical core counts).

2. Parallelization has a cost (spawning new processes, sending data through shared memory, and so on), so parallelizing only makes sense when there is enough computation to parallelize. For very small amounts of data, using pandarallel is not always worth it.
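
If you want to check how many physical cores your machine really has, as opposed to the logical CPUs reported with hyper-threading, a quick sketch (psutil is an extra dependency):

import os
import psutil

print('logical CPUs:  ', os.cpu_count())
print('physical cores:', psutil.cpu_count(logical=False))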


How to get the last N rows of a pandas DataFrame?

Question: How to get the last N rows of a pandas DataFrame?


I have pandas dataframes df1 and df2 (df1 is a vanilla dataframe, df2 is indexed by ‘STK_ID’ & ‘RPT_Date’):

>>> df1
    STK_ID  RPT_Date  TClose   sales  discount
0   000568  20060331    3.69   5.975       NaN
1   000568  20060630    9.14  10.143       NaN
2   000568  20060930    9.49  13.854       NaN
3   000568  20061231   15.84  19.262       NaN
4   000568  20070331   17.00   6.803       NaN
5   000568  20070630   26.31  12.940       NaN
6   000568  20070930   39.12  19.977       NaN
7   000568  20071231   45.94  29.269       NaN
8   000568  20080331   38.75  12.668       NaN
9   000568  20080630   30.09  21.102       NaN
10  000568  20080930   26.00  30.769       NaN

>>> df2
                 TClose   sales  discount  net_sales    cogs
STK_ID RPT_Date                                             
000568 20060331    3.69   5.975       NaN      5.975   2.591
       20060630    9.14  10.143       NaN     10.143   4.363
       20060930    9.49  13.854       NaN     13.854   5.901
       20061231   15.84  19.262       NaN     19.262   8.407
       20070331   17.00   6.803       NaN      6.803   2.815
       20070630   26.31  12.940       NaN     12.940   5.418
       20070930   39.12  19.977       NaN     19.977   8.452
       20071231   45.94  29.269       NaN     29.269  12.606
       20080331   38.75  12.668       NaN     12.668   3.958
       20080630   30.09  21.102       NaN     21.102   7.431

I can get the last 3 rows of df2 by:

>>> df2.ix[-3:]
                 TClose   sales  discount  net_sales    cogs
STK_ID RPT_Date                                             
000568 20071231   45.94  29.269       NaN     29.269  12.606
       20080331   38.75  12.668       NaN     12.668   3.958
       20080630   30.09  21.102       NaN     21.102   7.431

while df1.ix[-3:] give all the rows:

>>> df1.ix[-3:]
    STK_ID  RPT_Date  TClose   sales  discount
0   000568  20060331    3.69   5.975       NaN
1   000568  20060630    9.14  10.143       NaN
2   000568  20060930    9.49  13.854       NaN
3   000568  20061231   15.84  19.262       NaN
4   000568  20070331   17.00   6.803       NaN
5   000568  20070630   26.31  12.940       NaN
6   000568  20070930   39.12  19.977       NaN
7   000568  20071231   45.94  29.269       NaN
8   000568  20080331   38.75  12.668       NaN
9   000568  20080630   30.09  21.102       NaN
10  000568  20080930   26.00  30.769       NaN

Why ? How to get the last 3 rows of df1 (dataframe without index) ? Pandas 0.10.1


Answer 0


Don’t forget DataFrame.tail! e.g. df1.tail(10)


Answer 1


This is because of using integer indices (ix selects those by label over -3 rather than position, and this is by design: see integer indexing in pandas “gotchas”*).

*In newer versions of pandas prefer loc or iloc to remove the ambiguity of ix as position or label:

df.iloc[-3:]

see the docs.

As Wes points out, in this specific case you should just use tail!


Answer 2


How to get the last N rows of a pandas DataFrame?

If you are slicing by position, __getitem__ (i.e., slicing with []) works well, and is the most succinct solution I’ve found for this problem.

pd.__version__
# '0.24.2'

df = pd.DataFrame({'A': list('aaabbbbc'), 'B': np.arange(1, 9)})
df

   A  B
0  a  1
1  a  2
2  a  3
3  b  4
4  b  5
5  b  6
6  b  7
7  c  8

df[-3:]

   A  B
5  b  6
6  b  7
7  c  8

This is the same as calling df.iloc[-3:], for instance (iloc internally delegates to __getitem__).


As an aside, if you want to find the last N rows for each group, use groupby and GroupBy.tail:

df.groupby('A').tail(2)

   A  B
1  a  2
2  a  3
5  b  6
6  b  7
7  c  8

Convert a Pandas column containing NaNs to dtype `int`

Question: Convert a Pandas column containing NaNs to dtype `int`


I read data from a .csv file to a Pandas dataframe as below. For one of the columns, namely id, I want to specify the column type as int. The problem is the id series has missing/empty values.

When I try to cast the id column to integer while reading the .csv, I get:

df= pd.read_csv("data.csv", dtype={'id': int}) 
error: Integer column has NA values

Alternatively, I tried to convert the column type after reading as below, but this time I get:

df= pd.read_csv("data.csv") 
df[['id']] = df[['id']].astype(int)
error: Cannot convert NA to integer

How can I tackle this?


Answer 0


The lack of NaN rep in integer columns is a pandas “gotcha”.

The usual workaround is to simply use floats.
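
In other words, something like this minimal sketch (my own illustration, not from the answer):

import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.nan])
print(s.dtype)  # float64: the NaN forces the integer values to become floats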


Answer 1


In version 0.24.+ pandas has gained the ability to hold integer dtypes with missing values.

Nullable Integer Data Type.

Pandas can represent integer data with possibly missing values using arrays.IntegerArray. This is an extension type implemented within pandas. It is not the default dtype for integers and will not be inferred; you must explicitly pass the dtype into array() or Series:

arr = pd.array([1, 2, np.nan], dtype=pd.Int64Dtype())
pd.Series(arr)

0      1
1      2
2    NaN
dtype: Int64

For convert column to nullable integers use:

df['myCol'] = df['myCol'].astype('Int64')

Answer 2


My use case is munging data prior to loading into a DB table:

df[col] = df[col].fillna(-1)
df[col] = df[col].astype(int)
df[col] = df[col].astype(str)
df[col] = df[col].replace('-1', np.nan)

Remove NaNs, convert to int, convert to str and then reinsert NANs.

It’s not pretty but it gets the job done!


Answer 3


It is now possible to create a pandas column containing NaNs as dtype int, since it is now officially added on pandas 0.24.0

pandas 0.24.x release notes Quote: “Pandas has gained the ability to hold integer dtypes with missing values


Answer 4


If you absolutely want to combine integers and NaNs in a column, you can use the ‘object’ data type:

df['col'] = (
    df['col'].fillna(0)
    .astype(int)
    .astype(object)
    .where(df['col'].notnull())
)

This will replace NaNs with an integer (doesn’t matter which), convert to int, convert to object and finally reinsert NaNs.


Answer 5


If you can modify your stored data, use a sentinel value for missing id. A common use case, inferred by the column name, being that id is an integer, strictly greater than zero, you could use 0 as a sentinel value so that you can write

if row['id']:
   regular_process(row)
else:
   special_process(row)

Answer 6


You could use .dropna() if it is OK to drop the rows with the NaN values.

df = df.dropna(subset=['id'])

Alternatively, use .fillna() and .astype() to replace the NaN with values and convert them to int.

I ran into this problem when processing a CSV file with large integers, while some of them were missing (NaN). Using float as the type was not an option, because I might lose precision.

My solution was to use str as the intermediate type. Then you can convert the string to int as you please later in the code. I replaced NaN with 0, but you could choose any value.

df = pd.read_csv(filename, dtype={'id':str})
df["id"] = df["id"].fillna("0").astype(int)

For illustration, here is an example of how floats may lose precision:

s = "12345678901234567890"
f = float(s)
i = int(f)
i2 = int(s)
print (f, i, i2)

And the output is:

1.2345678901234567e+19 12345678901234567168 12345678901234567890

Answer 7


Most solutions here tell you how to use a placeholder integer to represent nulls. That approach isn't helpful if you're uncertain whether that integer will show up in your source data, though. My method will format floats without their decimal values and convert nulls to None. The result is an object datatype that will look like an integer field with null values when loaded into a CSV.

keep_df[col] = keep_df[col].apply(lambda x: None if pandas.isnull(x) else '{0:.0f}'.format(pandas.to_numeric(x)))

Answer 8


I ran into this issue working with pyspark. As this is a python frontend for code running on a jvm, it requires type safety and using float instead of int is not an option. I worked around the issue by wrapping the pandas pd.read_csv in a function that will fill user-defined columns with user-defined fill values before casting them to the required type. Here is what I ended up using:

def custom_read_csv(file_path, custom_dtype = None, fill_values = None, **kwargs):
    if custom_dtype is None:
        return pd.read_csv(file_path, **kwargs)
    else:
        assert 'dtype' not in kwargs.keys()
        df = pd.read_csv(file_path, dtype = {}, **kwargs)
        for col, typ in custom_dtype.items():
            if fill_values is None or col not in fill_values.keys():
                fill_val = -1
            else:
                fill_val = fill_values[col]
            df[col] = df[col].fillna(fill_val).astype(typ)
    return df

Answer 9


First remove the rows which contain NaN. Then do Integer conversion on remaining rows. At Last insert the removed rows again. Hope it will work
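
A minimal sketch of that remove/convert/reinsert idea (my own illustration, not code from the answer; note that concatenating the NaN rows back upcasts the column to float again, which is why the nullable Int64 dtype from the earlier answers is usually nicer):

import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1.0, np.nan, 3.0], 'x': ['a', 'b', 'c']})

missing = df[df['id'].isna()]                 # rows whose id is NaN
present = df.dropna(subset=['id']).copy()
present['id'] = present['id'].astype(int)     # safe now: no NaN left

# Reinsert the removed rows and restore the original row order.
result = pd.concat([present, missing]).sort_index()
print(result)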


Answer 10

import pandas as pd

df= pd.read_csv("data.csv")
df['id'] = pd.to_numeric(df['id'])

Answer 11


Assuming your DateColumn formatted 3312018.0 should be converted to 03/31/2018 as a string. And, some records are missing or 0.

df['DateColumn'] = df['DateColumn'].astype(int)
df['DateColumn'] = df['DateColumn'].astype(str)
df['DateColumn'] = df['DateColumn'].apply(lambda x: x.zfill(8))
df.loc[df['DateColumn'] == '00000000','DateColumn'] = '01011980'
df['DateColumn'] = pd.to_datetime(df['DateColumn'], format="%m%d%Y")
df['DateColumn'] = df['DateColumn'].apply(lambda x: x.strftime('%m/%d/%Y'))

Apply vs transform on a group object

Question: Apply vs transform on a group object


Consider the following dataframe:

     A      B         C         D
0  foo    one  0.162003  0.087469
1  bar    one -1.156319 -1.526272
2  foo    two  0.833892 -1.666304
3  bar  three -2.026673 -0.322057
4  foo    two  0.411452 -0.954371
5  bar    two  0.765878 -0.095968
6  foo    one -0.654890  0.678091
7  foo  three -1.789842 -1.130922

The following commands work:

> df.groupby('A').apply(lambda x: (x['C'] - x['D']))
> df.groupby('A').apply(lambda x: (x['C'] - x['D']).mean())

but none of the following work:

> df.groupby('A').transform(lambda x: (x['C'] - x['D']))
ValueError: could not broadcast input array from shape (5) into shape (5,3)

> df.groupby('A').transform(lambda x: (x['C'] - x['D']).mean())
 TypeError: cannot concatenate a non-NDFrame object

Why? The example on the documentation seems to suggest that calling transform on a group allows one to do row-wise operation processing:

# Note that the following suggests row-wise operation (x.mean is the column mean)
zscore = lambda x: (x - x.mean()) / x.std()
transformed = ts.groupby(key).transform(zscore)

In other words, I thought that transform is essentially a specific type of apply (the one that does not aggregate). Where am I wrong?

For reference, below is the construction of the original dataframe above:

df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'C' : randn(8), 'D' : randn(8)})

Answer 0


Two major differences between apply and transform

There are two major differences between the transform and apply groupby methods.

  • Input:
  • apply implicitly passes all the columns for each group as a DataFrame to the custom function.
  • while transform passes each column for each group individually as a Series to the custom function.
  • Output:
  • The custom function passed to apply can return a scalar, or a Series or DataFrame (or numpy array or even list).
  • The custom function passed to transform must return a sequence (a one dimensional Series, array or list) the same length as the group.

So, transform works on just one Series at a time and apply works on the entire DataFrame at once.

Inspecting the custom function

It can help quite a bit to inspect the input to your custom function passed to apply or transform.

Examples

Let’s create some sample data and inspect the groups so that you can see what I am talking about:

import pandas as pd
import numpy as np
df = pd.DataFrame({'State':['Texas', 'Texas', 'Florida', 'Florida'], 
                   'a':[4,5,1,3], 'b':[6,10,3,11]})

     State  a   b
0    Texas  4   6
1    Texas  5  10
2  Florida  1   3
3  Florida  3  11

Let’s create a simple custom function that prints out the type of the implicitly passed object and then raised an error so that execution can be stopped.

def inspect(x):
    print(type(x))
    raise

Now let’s pass this function to both the groupby apply and transform methods to see what object is passed to it:

df.groupby('State').apply(inspect)

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
RuntimeError

As you can see, a DataFrame is passed into the inspect function. You might be wondering why the type, DataFrame, got printed out twice. Pandas runs the first group twice. It does this to determine if there is a fast way to complete the computation or not. This is a minor detail that you shouldn’t worry about.

Now, let’s do the same thing with transform

df.groupby('State').transform(inspect)
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
RuntimeError

It is passed a Series – a totally different Pandas object.

So, transform is only allowed to work with a single Series at a time. It is impossible for it to act on two columns at the same time. So, if we try and subtract column a from b inside of our custom function we would get an error with transform. See below:

def subtract_two(x):
    return x['a'] - x['b']

df.groupby('State').transform(subtract_two)
KeyError: ('a', 'occurred at index a')

We get a KeyError as pandas is attempting to find the Series index a which does not exist. You can complete this operation with apply as it has the entire DataFrame:

df.groupby('State').apply(subtract_two)

State     
Florida  2   -2
         3   -8
Texas    0   -2
         1   -5
dtype: int64

The output is a Series and a little confusing as the original index is kept, but we have access to all columns.


Displaying the passed pandas object

It can help even more to display the entire pandas object within the custom function, so you can see exactly what you are operating with. You can use print statements, but I like to use the display function from the IPython.display module so that the DataFrames get nicely rendered as HTML in a Jupyter notebook:

from IPython.display import display
def subtract_two(x):
    display(x)
    return x['a'] - x['b']

Screenshot:


Transform must return a single dimensional sequence the same size as the group

The other difference is that transform must return a single dimensional sequence the same size as the group. In this particular instance, each group has two rows, so transform must return a sequence of two rows. If it does not then an error is raised:

def return_three(x):
    return np.array([1, 2, 3])

df.groupby('State').transform(return_three)
ValueError: transform must return a scalar value for each group

The error message is not really descriptive of the problem. You must return a sequence the same length as the group. So, a function like this would work:

def rand_group_len(x):
    return np.random.rand(len(x))

df.groupby('State').transform(rand_group_len)

          a         b
0  0.962070  0.151440
1  0.440956  0.782176
2  0.642218  0.483257
3  0.056047  0.238208

Returning a single scalar object also works for transform

If you return just a single scalar from your custom function, then transform will use it for each of the rows in the group:

def group_sum(x):
    return x.sum()

df.groupby('State').transform(group_sum)

   a   b
0  9  16
1  9  16
2  4  14
3  4  14

Answer 1


As I felt similarly confused with .transform operation vs. .apply I found a few answers shedding some light on the issue. This answer for example was very helpful.

My takeout so far is that .transform will work (or deal) with Series (columns) in isolation from each other. What this means is that in your last two calls:

df.groupby('A').transform(lambda x: (x['C'] - x['D']))
df.groupby('A').transform(lambda x: (x['C'] - x['D']).mean())

You asked .transform to take values from two columns and ‘it’ actually does not ‘see’ both of them at the same time (so to speak). transform will look at the dataframe columns one by one and return back a series (or group of series) ‘made’ of scalars which are repeated len(input_column) times.

So this scalar, that should be used by .transform to make the Series is a result of some reduction function applied on an input Series (and only on ONE series/column at a time).

Consider this example (on your dataframe):

zscore = lambda x: (x - x.mean()) / x.std() # Note that it does not reference anything outside of 'x' and for transform 'x' is one column.
df.groupby('A').transform(zscore)

will yield:

       C      D
0  0.989  0.128
1 -0.478  0.489
2  0.889 -0.589
3 -0.671 -1.150
4  0.034 -0.285
5  1.149  0.662
6 -1.404 -0.907
7 -0.509  1.653

Which is exactly the same as if you would use it on only on one column at a time:

df.groupby('A')['C'].transform(zscore)

yielding:

0    0.989
1   -0.478
2    0.889
3   -0.671
4    0.034
5    1.149
6   -1.404
7   -0.509

Note that .apply in the last example (df.groupby('A')['C'].apply(zscore)) would work in exactly the same way, but it would fail if you tried using it on a dataframe:

df.groupby('A').apply(zscore)

gives error:

ValueError: operands could not be broadcast together with shapes (6,) (2,)

So where else is .transform useful? The simplest case is trying to assign results of reduction function back to original dataframe.

df['sum_C'] = df.groupby('A')['C'].transform(sum)
df.sort('A') # to clearly see the scalar ('sum') applies to the whole column of the group

yielding:

     A      B      C      D  sum_C
1  bar    one  1.998  0.593  3.973
3  bar  three  1.287 -0.639  3.973
5  bar    two  0.687 -1.027  3.973
4  foo    two  0.205  1.274  4.373
2  foo    two  0.128  0.924  4.373
6  foo    one  2.113 -0.516  4.373
7  foo  three  0.657 -1.179  4.373
0  foo    one  1.270  0.201  4.373

Trying the same with .apply would give NaNs in sum_C. Because .apply would return a reduced Series, which it does not know how to broadcast back:

df.groupby('A')['C'].apply(sum)

giving:

A
bar    3.973
foo    4.373

There are also cases when .transform is used to filter the data:

df[df.groupby(['B'])['D'].transform(sum) < -1]

     A      B      C      D
3  bar  three  1.287 -0.639
7  foo  three  0.657 -1.179

I hope this adds a bit more clarity.


Answer 2


I am going to use a very simple snippet to illustrate the difference:

test = pd.DataFrame({'id':[1,2,3,1,2,3,1,2,3], 'price':[1,2,3,2,3,1,3,1,2]})
grouping = test.groupby('id')['price']

The DataFrame looks like this:

    id  price   
0   1   1   
1   2   2   
2   3   3   
3   1   2   
4   2   3   
5   3   1   
6   1   3   
7   2   1   
8   3   2   

There are 3 customer IDs in this table, each customer made three transactions and paid 1,2,3 dollars each time.

Now, I want to find the minimum payment made by each customer. There are two ways of doing it:

  1. Using apply:

    grouping.min()

The return looks like this:

id
1    1
2    1
3    1
Name: price, dtype: int64

pandas.core.series.Series # return type
Int64Index([1, 2, 3], dtype='int64', name='id') #The returned Series' index
# length is 3

  2. Using transform:

    grouping.transform(min)

The return looks like this:

0    1
1    1
2    1
3    1
4    1
5    1
6    1
7    1
8    1
Name: price, dtype: int64

pandas.core.series.Series # return type
RangeIndex(start=0, stop=9, step=1) # The returned Series' index
# length is 9    

Both methods return a Series object, but the length of the first one is 3 and the length of the second one is 9.

If you want to answer What is the minimum price paid by each customer, then the apply method is the more suitable one to choose.

If you want to answer What is the difference between the amount paid for each transaction vs the minimum payment, then you want to use transform, because:

test['minimum'] = grouping.transform(min) # creates an extra column filled with the minimum payment
test.price - test.minimum # returns the difference for each row

Apply does not work here simply because it returns a Series of size 3, but the original df’s length is 9. You cannot integrate it back to the original df easily.


Answer 3

tmp = df.groupby(['A'])['c'].transform('mean')

is like

tmp1 = df.groupby(['A']).agg({'c':'mean'})
tmp = df['A'].map(tmp1['c'])

or

tmp1 = df.groupby(['A'])['c'].mean()
tmp = df['A'].map(tmp1)

pandas loc vs. iloc vs. ix vs. at vs. iat?

Question: pandas loc vs. iloc vs. ix vs. at vs. iat?


Recently I began branching out from my safe place (R) into Python, and I am a bit confused by the cell localization/selection in Pandas. I've read the documentation but I'm struggling to understand the practical implications of the various localization/selection options.

  • Is there a reason why I should ever use .loc or .iloc over the most general option .ix?
  • I understand that .loc, iloc, at, and iat may provide some guaranteed correctness that .ix can’t offer, but I’ve also read where .ix tends to be the fastest solution across the board.
  • Please explain the real-world, best-practices reasoning behind utilizing anything other than .ix?

Answer 0


loc: only works on the index
iloc: works on positions
ix: lets you get data from the dataframe without it being in the index
at: gets scalar values; it's a very fast loc
iat: gets scalar values; it's a very fast iloc

http://pyciencia.blogspot.com/2015/05/obtener-y-filtrar-datos-de-un-dataframe.html

Note: As of pandas 0.20.0, the .ix indexer is deprecated in favour of the more strict .iloc and .loc indexers.


Answer 1


Updated for pandas 0.20, given that ix is deprecated. This demonstrates not only how to use loc, iloc, at, iat and set_value, but also how to accomplish mixed positional/label-based indexing.


loc: label based
Allows you to pass 1-D arrays as indexers. Arrays can be either slices (subsets) of the index or column, or they can be boolean arrays which are equal in length to the index or columns.

Special Note: when a scalar indexer is passed, loc can assign a new index or column value that didn’t exist before.

# label based, but we can use position values
# to get the labels from the index object
df.loc[df.index[2], 'ColName'] = 3

df.loc[df.index[1:3], 'ColName'] = 3

iloc: position based
Similar to loc except with positions rather that index values. However, you cannot assign new columns or indices.

# position based, but we can get the position
# from the columns object via the `get_loc` method
df.iloc[2, df.columns.get_loc('ColName')] = 3

df.iloc[2, 4] = 3

df.iloc[:3, 2:4] = 3

at: label based
Works very similar to loc for scalar indexers. Cannot operate on array indexers. Can! assign new indices and columns.

Advantage over loc is that this is faster.
Disadvantage is that you can’t use arrays for indexers.

# label based, but we can use position values
# to get the labels from the index object
df.at[df.index[2], 'ColName'] = 3

df.at['C', 'ColName'] = 3

iat: position based
Works similarly to iloc. Cannot work in array indexers. Cannot! assign new indices and columns.

Advantage over iloc is that this is faster.
Disadvantage is that you can’t use arrays for indexers.

# position based, but we can get the position
# from the columns object via the `get_loc` method
IBM.iat[2, IBM.columns.get_loc('PNL')] = 3

set_value: label based
Works very similar to loc for scalar indexers. Cannot operate on array indexers. Can! assign new indices and columns

Advantage Super fast, because there is very little overhead!
Disadvantage There is very little overhead because pandas is not doing a bunch of safety checks. Use at your own risk. Also, this is not intended for public use.

# label based, but we can use position values
# to get the labels from the index object
df.set_value(df.index[2], 'ColName', 3)

set_value with takable=True: position based
Works similarly to iloc. Cannot work in array indexers. Cannot! assign new indices and columns.

Advantage Super fast, because there is very little overhead!
Disadvantage There is very little overhead because pandas is not doing a bunch of safety checks. Use at your own risk. Also, this is not intended for public use.

# position based, but we can get the position
# from the columns object via the `get_loc` method
df.set_value(2, df.columns.get_loc('ColName'), 3, takable=True)
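
A note for current pandas versions: set_value was deprecated in 0.21 and removed in 1.0, so on modern pandas the fast scalar setters are .at and .iat. A minimal sketch of the equivalent assignments:

import pandas as pd

df = pd.DataFrame({'ColName': [10, 20, 30]}, index=['A', 'B', 'C'])

# Label-based scalar assignment (replaces set_value):
df.at['C', 'ColName'] = 3

# Position-based scalar assignment (replaces set_value with takable=True):
df.iat[2, df.columns.get_loc('ColName')] = 3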

Answer 2

There are two primary ways that pandas makes selections from a DataFrame.

  • By label
  • By integer location

The documentation uses the term position to refer to integer location. I do not like this terminology, as I find it confusing. Integer location is more descriptive and is exactly what .iloc stands for. The key word here is INTEGER: you must use integers when selecting by integer location.

Before showing the summary, let's make sure that …

.ix is deprecated and ambiguous and should never be used

pandas has three primary indexers. We have the indexing operator itself (the brackets []), .loc, and .iloc. Let's summarize them:

  • [] – primarily selects subsets of columns, but can select rows as well; it cannot select rows and columns simultaneously
  • .loc – selects subsets of rows and columns by label only
  • .iloc – selects subsets of rows and columns by integer location only

I almost never use .at or .iat, since they add no extra functionality and give only a small performance gain. I would discourage their use unless you have a very time-sensitive application. In any case, here is their summary:

  • .at selects a single scalar value in the DataFrame by label only
  • .iat selects a single scalar value in the DataFrame by integer location only

In addition to selection by label and integer location, there is also boolean selection, also known as boolean indexing.


Examples explaining .loc, .iloc, boolean selection, .at and .iat are shown below.

We will first focus on the differences between .loc and .iloc. Before we talk about the differences, it is important to understand that DataFrames have labels that help identify every column and every row. Let's look at a sample DataFrame:

df = pd.DataFrame({'age':[30, 2, 12, 4, 32, 33, 69],
                   'color':['blue', 'green', 'red', 'white', 'gray', 'black', 'red'],
                   'food':['Steak', 'Lamb', 'Mango', 'Apple', 'Cheese', 'Melon', 'Beans'],
                   'height':[165, 70, 120, 80, 180, 172, 150],
                   'score':[4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2],
                   'state':['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']
                   },
                  index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean', 'Christina', 'Cornelia'])

All of the words in bold are labels. The labels age, color, food, height, score and state are used for the columns. The other labels, Jane, Nick, Aaron, Penelope, Dean, Christina and Cornelia, are used as labels for the rows. Collectively, these row labels are known as the index.


The primary ways to select particular rows in a DataFrame are the .loc and .iloc indexers. Each of these indexers can also be used to select columns at the same time, but for now it is easier to focus on rows only. Also, each indexer uses a set of brackets immediately following its name to make its selections.

.loc selects data by label only

We will discuss the .loc indexer first, which selects data only by index or column labels. In our sample DataFrame we provided meaningful names as index values. Many DataFrames will not have meaningful names and will instead default to the integers 0 to n-1, where n is the length (number of rows) of the DataFrame.

There are many different inputs you can use with .loc; three of them are:

  • A string
  • A list of strings
  • Slice notation using strings as the start and stop values

Selecting a single row with .loc and a string

To select a single row of data, place the index label inside the brackets following .loc.

df.loc['Penelope']

这将数据行作为系列返回

age           4
color     white
food      Apple
height       80
score       3.3
state        AL
Name: Penelope, dtype: object

使用.loc与字符串列表选择多行

df.loc[['Cornelia', 'Jane', 'Dean']]

这将返回一个DataFrame,其中的数据行按列表中指定的顺序进行:

使用带有切片符号的.loc选择多行

切片符号由开始值,停止值和步长值定义。按标签切片时,大熊猫在返回值中包含停止值。以下是从亚伦到迪恩(含)的片段。它的步长未明确定义,但默认为1。

df.loc['Aaron':'Dean']

可以采用与Python列表相同的方式获取复杂的切片。

.iloc仅按整数位置选择数据

现在转到.iloc。DataFrame中数据的每一行和每一列都有一个定义它的整数位置。这是输出中直观显示的标签的补充。整数位置就是从0开始从顶部/左侧开始的行数/列数。

您可以使用三种输入中的许多不同.iloc,它们是

  • 一个整数
  • 整数列表
  • 使用整数作为起始值和终止值的切片符号

用带整数的.iloc选择单行

df.iloc[4]

这将返回第5行(整数位置4)为系列

age           32
color       gray
food      Cheese
height       180
score        1.8
state         AK
Name: Dean, dtype: object

用.iloc选择带有整数列表的多行

df.iloc[[2, -2]]

这将返回第三行和倒数第二行的DataFrame:

使用带切片符号的.iloc选择多行

df.iloc[:5:3]


使用.loc和.iloc同时选择行和列

两者的一项出色功能.loc/.iloc是它们可以同时选择行和列。在上面的示例中,所有列都是从每个选择中返回的。我们可以选择输入类型与行相同的列。我们只需要用逗号分隔行和列的选择即可。

例如,我们可以选择Jane行和Dean行,它们的高度,得分和状态如下:

df.loc[['Jane', 'Dean'], 'height':]

这对行使用标签列表,对列使用切片符号

我们自然可以.iloc只使用整数来执行类似的操作。

df.iloc[[1,4], 2]
Nick      Lamb
Dean    Cheese
Name: food, dtype: object

带标签和整数位置的同时选择

.ix用来与标签和整数位置同时进行选择,这很有用,但有时会造成混淆和模棱两可,值得庆幸的是,它已弃用。如果您需要混合使用标签和整数位置进行选择,则必须同时选择标签或整数位置。

例如,如果我们要选择行Nick以及第Cornelia2列和第4列,则可以.loc通过以下方式将整数转换为标签来使用:

col_names = df.columns[[2, 4]]
df.loc[['Nick', 'Cornelia'], col_names] 

或者,可以使用get_locindex方法将索引标签转换为整数。

labels = ['Nick', 'Cornelia']
index_ints = [df.index.get_loc(label) for label in labels]
df.iloc[index_ints, [2, 4]]

布尔选择

.loc索引器还可以进行布尔选择。例如,如果我们有兴趣查找年龄在30岁以上的所有行并仅返回foodscore列,则可以执行以下操作:

df.loc[df['age'] > 30, ['food', 'score']] 

您可以使用复制它,.iloc但是不能将其传递为布尔系列。您必须将boolean Series转换为numpy数组,如下所示:

df.iloc[(df['age'] > 30).values, [2, 4]] 

选择所有行

可以.loc/.iloc仅用于列选择。您可以使用冒号选择所有行,如下所示:

df.loc[:, 'color':'score':2]


索引运算符[]可以切片也可以选择行和列,但不能同时选择。

大多数人都熟悉DataFrame索引运算符的主要目的,即选择列。字符串选择单个列作为系列,字符串列表选择多个列作为DataFrame。

df['food']

Jane          Steak
Nick           Lamb
Aaron         Mango
Penelope      Apple
Dean         Cheese
Christina     Melon
Cornelia      Beans
Name: food, dtype: object

使用列表选择多个列

df[['food', 'score']]

人们所不熟悉的是,当使用切片符号时,选择是通过行标签或整数位置进行的。这非常令人困惑,我几乎从未使用过,但是确实可以使用。

df['Penelope':'Christina'] # slice rows by label

df[2:6:2] # slice rows by integer location

.loc/.iloc选择行的明确性是高度首选的。单独的索引运算符无法同时选择行和列。

df[3:5, 'color']
TypeError: unhashable type: 'slice'

.at和选择.iat

选择与.at几乎相同,.loc但仅在DataFrame中选择一个“单元”。我们通常将此单元称为标量值。要使用.at,请将行标签和列标签都传递给它,并用逗号分隔。

df.at['Christina', 'color']
'black'

选择与.iat几乎相同,.iloc但仅选择一个标量值。您必须为行和列位置都传递一个整数

df.iat[2, 5]
'FL'

There are two primary ways that pandas makes selections from a DataFrame.

  • By Label
  • By Integer Location

The documentation uses the term position for referring to integer location. I do not like this terminology as I feel it is confusing. Integer location is more descriptive and is exactly what .iloc stands for. The key word here is INTEGER – you must use integers when selecting by integer location.

Before showing the summary let’s all make sure that …

.ix is deprecated and ambiguous and should never be used

There are three primary indexers for pandas. We have the indexing operator itself (the brackets []), .loc, and .iloc. Let’s summarize them:

  • [] – Primarily selects subsets of columns, but can select rows as well. Cannot simultaneously select rows and columns.
  • .loc – selects subsets of rows and columns by label only
  • .iloc – selects subsets of rows and columns by integer location only

I almost never use .at or .iat, as they add no additional functionality beyond a small performance increase. I would discourage their use unless you have a very time-sensitive application. Regardless, here is their summary:

  • .at selects a single scalar value in the DataFrame by label only
  • .iat selects a single scalar value in the DataFrame by integer location only

In addition to selection by label and integer location, boolean selection also known as boolean indexing exists.


Examples explaining .loc, .iloc, boolean selection and .at and .iat are shown below

We will first focus on the differences between .loc and .iloc. Before we talk about the differences, it is important to understand that DataFrames have labels that help identify each column and each row. Let’s take a look at a sample DataFrame:

df = pd.DataFrame({'age':[30, 2, 12, 4, 32, 33, 69],
                   'color':['blue', 'green', 'red', 'white', 'gray', 'black', 'red'],
                   'food':['Steak', 'Lamb', 'Mango', 'Apple', 'Cheese', 'Melon', 'Beans'],
                   'height':[165, 70, 120, 80, 180, 172, 150],
                   'score':[4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2],
                   'state':['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']
                   },
                  index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean', 'Christina', 'Cornelia'])

All the words in bold are the labels. The labels, age, color, food, height, score and state are used for the columns. The other labels, Jane, Nick, Aaron, Penelope, Dean, Christina, Cornelia are used as labels for the rows. Collectively, these row labels are known as the index.


The primary ways to select particular rows in a DataFrame are with the .loc and .iloc indexers. Each of these indexers can also be used to simultaneously select columns but it is easier to just focus on rows for now. Also, each of the indexers use a set of brackets that immediately follow their name to make their selections.

.loc selects data only by labels

We will first talk about the .loc indexer, which only selects data by the index or column labels. In our sample DataFrame, we have provided meaningful names as values for the index. Many DataFrames will not have any meaningful names and will instead default to just the integers from 0 to n-1, where n is the length (number of rows) of the DataFrame.

There are many different inputs you can use for .loc; three of them are:

  • A string
  • A list of strings
  • Slice notation using strings as the start and stop values

Selecting a single row with .loc with a string

To select a single row of data, place the index label inside of the brackets following .loc.

df.loc['Penelope']

This returns the row of data as a Series

age           4
color     white
food      Apple
height       80
score       3.3
state        AL
Name: Penelope, dtype: object

Selecting multiple rows with .loc with a list of strings

df.loc[['Cornelia', 'Jane', 'Dean']]

This returns a DataFrame with the rows in the order specified in the list:

Selecting multiple rows with .loc with slice notation

Slice notation is defined by start, stop and step values. When slicing by label, pandas includes the stop value in the return. The following slices from Aaron to Dean, inclusive. Its step size is not explicitly defined but defaults to 1.

df.loc['Aaron':'Dean']

Complex slices can be taken in the same manner as Python lists.
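A step value also works in a label slice; a small illustration on the sample df above (the endpoints are still inclusive):

# every other row from Jane through Cornelia
df.loc['Jane':'Cornelia':2]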

.iloc selects data only by integer location

Let’s now turn to .iloc. Every row and column of data in a DataFrame has an integer location that defines it. This is in addition to the label that is visually displayed in the output. The integer location is simply the number of rows/columns from the top/left beginning at 0.

There are many different inputs you can use for .iloc; three of them are:

  • An integer
  • A list of integers
  • Slice notation using integers as the start and stop values

Selecting a single row with .iloc with an integer

df.iloc[4]

This returns the 5th row (integer location 4) as a Series

age           32
color       gray
food      Cheese
height       180
score        1.8
state         AK
Name: Dean, dtype: object

Selecting multiple rows with .iloc with a list of integers

df.iloc[[2, -2]]

This returns a DataFrame of the third and second to last rows:

Selecting multiple rows with .iloc with slice notation

df.iloc[:5:3]


Simultaneous selection of rows and columns with .loc and .iloc

One excellent feature of both .loc and .iloc is the ability to select rows and columns simultaneously. In the examples above, all the columns were returned from each selection. We can choose columns with the same types of inputs as we do for rows. We simply need to separate the row and column selection with a comma.

For example, we can select the rows Jane and Dean with just the columns height, score and state like this:

df.loc[['Jane', 'Dean'], 'height':]

This uses a list of labels for the rows and slice notation for the columns

We can naturally do similar operations with .iloc using only integers.

df.iloc[[1,4], 2]
Nick      Lamb
Dean    Cheese
Name: food, dtype: object

Simultaneous selection with labels and integer location

.ix was used to make selections simultaneously with labels and integer locations, which was useful but at times confusing and ambiguous, and thankfully it has been deprecated. In the event that you need to make a selection with a mix of labels and integer locations, you will have to convert one side so that both selections use labels or both use integer locations.

For instance, if we want to select rows Nick and Cornelia along with columns 2 and 4, we could use .loc by converting the integers to labels with the following:

col_names = df.columns[[2, 4]]
df.loc[['Nick', 'Cornelia'], col_names] 

Or alternatively, convert the index labels to integers with the get_loc index method.

labels = ['Nick', 'Cornelia']
index_ints = [df.index.get_loc(label) for label in labels]
df.iloc[index_ints, [2, 4]]

Boolean Selection

The .loc indexer can also do boolean selection. For instance, if we are interested in finding all the rows where age is above 30 and return just the food and score columns we can do the following:

df.loc[df['age'] > 30, ['food', 'score']] 

You can replicate this with .iloc but you cannot pass it a boolean series. You must convert the boolean Series into a numpy array like this:

df.iloc[(df['age'] > 30).values, [2, 4]] 
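On pandas 0.24 and later, Series.to_numpy() is another way to get the underlying boolean array, so the same selection could also be written as follows (a minimal sketch):

df.iloc[(df['age'] > 30).to_numpy(), [2, 4]]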

Selecting all rows

It is possible to use .loc/.iloc for just column selection. You can select all the rows by using a colon like this:

df.loc[:, 'color':'score':2]


The indexing operator, [], can slice to select rows and can select columns, but not simultaneously.

Most people are familiar with the primary purpose of the DataFrame indexing operator, which is to select columns. A string selects a single column as a Series and a list of strings selects multiple columns as a DataFrame.

df['food']

Jane          Steak
Nick           Lamb
Aaron         Mango
Penelope      Apple
Dean         Cheese
Christina     Melon
Cornelia      Beans
Name: food, dtype: object

Using a list selects multiple columns

df[['food', 'score']]

What people are less familiar with is that, when slice notation is used, selection happens by row label or by integer location. This is very confusing and something I almost never use, but it does work.

df['Penelope':'Christina'] # slice rows by label

df[2:6:2] # slice rows by integer location

The explicitness of .loc/.iloc for selecting rows is highly preferred. The indexing operator alone is unable to select rows and columns simultaneously.

df[3:5, 'color']
TypeError: unhashable type: 'slice'
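If you do need position-based rows together with a label-based column, one workaround is to convert one side first; a sketch on the sample df:

# positions converted to labels for the rows
df.loc[df.index[3:5], 'color']
# the column label converted to a position
df.iloc[3:5, df.columns.get_loc('color')]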

Selection by .at and .iat

Selection with .at is nearly identical to .loc but it only selects a single ‘cell’ in your DataFrame. We usually refer to this cell as a scalar value. To use .at, pass it both a row and column label separated by a comma.

df.at['Christina', 'color']
'black'

Selection with .iat is nearly identical to .iloc, but it only selects a single scalar value. You must pass it an integer for both the row and column locations.

df.iat[2, 5]
'FL'
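The same accessors also handle scalar assignment, which is where their speed matters most; a small illustration on the sample df (the new values are arbitrary):

df.at['Christina', 'color'] = 'purple'   # label based
df.iat[2, 5] = 'CA'                      # position based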

Answer 3

df = pd.DataFrame({'A':['a', 'b', 'c'], 'B':[54, 67, 89]}, index=[100, 200, 300])

df

                        A   B
                100     a   54
                200     b   67
                300     c   89
In [19]:    
df.loc[100]

Out[19]:
A     a
B    54
Name: 100, dtype: object

In [20]:    
df.iloc[0]

Out[20]:
A     a
B    54
Name: 100, dtype: object

In [24]:    
df2 = df.set_index([df.index,'A'])
df2

Out[24]:
        B
    A   
100 a   54
200 b   67
300 c   89

In [25]:    
df2.ix[100, 'a']

Out[25]:    
B    54
Name: (100, a), dtype: int64
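Since .ix has been removed in modern pandas, roughly the same MultiIndex selection can be written with .loc and a tuple; a minimal sketch using the df2 built above:

df2.loc[(100, 'a')]
# B    54
# Name: (100, a), dtype: int64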

Answer 4

Let’s start with this small df:

import pandas as pd
import time as tm
import numpy as np
n=10
a=np.arange(0,n**2)
df=pd.DataFrame(a.reshape(n,n))

We then have

df
Out[25]: 
        0   1   2   3   4   5   6   7   8   9
    0   0   1   2   3   4   5   6   7   8   9
    1  10  11  12  13  14  15  16  17  18  19
    2  20  21  22  23  24  25  26  27  28  29
    3  30  31  32  33  34  35  36  37  38  39
    4  40  41  42  43  44  45  46  47  48  49
    5  50  51  52  53  54  55  56  57  58  59
    6  60  61  62  63  64  65  66  67  68  69
    7  70  71  72  73  74  75  76  77  78  79
    8  80  81  82  83  84  85  86  87  88  89
    9  90  91  92  93  94  95  96  97  98  99

With this we have:

df.iloc[3,3]
Out[33]: 33

df.iat[3,3]
Out[34]: 33

df.iloc[:3,:3]
Out[35]: 
    0   1   2   3
0   0   1   2   3
1  10  11  12  13
2  20  21  22  23
3  30  31  32  33



df.iat[:3,:3]
Traceback (most recent call last):
   ... omissis ...
ValueError: At based indexing on an integer index can only have integer indexers

Thus we cannot use .iat for subsets; there we must use .iloc only.

But let's try both on a larger df and check the speed…

# -*- coding: utf-8 -*-
"""
Created on Wed Feb  7 09:58:39 2018

@author: Fabio Pomi
"""

import pandas as pd
import time as tm
import numpy as np
n=1000
a=np.arange(0,n**2)
df=pd.DataFrame(a.reshape(n,n))
t1=tm.time()
for j in df.index:
    for i in df.columns:
        a=df.iloc[j,i]
t2=tm.time()
for j in df.index:
    for i in df.columns:
        a=df.iat[j,i]
t3=tm.time()
loc=t2-t1
at=t3-t2
prc = loc/at *100
print('\nloc:%f at:%f prc:%f' %(loc,at,prc))

loc:10.485600 at:7.395423 prc:141.784987

So with .iloc we can manage subsets and with .iat only a single scalar, but .iat is faster than .iloc.

🙂
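A lighter-weight way to compare the two accessors is the standard-library timeit module, which avoids hand-rolled timing loops; a sketch under the same setup (absolute numbers will vary by machine):

import timeit

setup = ("import pandas as pd, numpy as np; "
         "df = pd.DataFrame(np.arange(1000 * 1000).reshape(1000, 1000))")
# time a single positional lookup with each accessor
print(timeit.timeit("df.iloc[500, 500]", setup=setup, number=10000))
print(timeit.timeit("df.iat[500, 500]", setup=setup, number=10000))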


How to save a Seaborn plot into a file

Question: How to save a Seaborn plot into a file

I tried the following code (test_seaborn.py):

import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
matplotlib.style.use('ggplot')
import seaborn as sns
sns.set()
df = sns.load_dataset('iris')
sns_plot = sns.pairplot(df, hue='species', size=2.5)
fig = sns_plot.get_figure()
fig.savefig("output.png")
#sns.plt.show()

But I get this error:

  Traceback (most recent call last):
  File "test_searborn.py", line 11, in <module>
    fig = sns_plot.get_figure()
AttributeError: 'PairGrid' object has no attribute 'get_figure'

I expect the final output.png will exist and look like this:

How can I resolve the problem?


Answer 0

Remove the get_figure and just use sns_plot.savefig('output.png')

df = sns.load_dataset('iris')
sns_plot = sns.pairplot(df, hue='species', size=2.5)
sns_plot.savefig("output.png")

Answer 1

The suggested solutions are incompatible with Seaborn 0.8.1

giving the following errors because the Seaborn interface has changed:

AttributeError: 'AxesSubplot' object has no attribute 'fig'
When trying to access the figure

AttributeError: 'AxesSubplot' object has no attribute 'savefig'
when trying to use the savefig directly as a function

The following calls allow you to access the figure (Seaborn 0.8.1 compatible):

swarm_plot = sns.swarmplot(...)
fig = swarm_plot.get_figure()
fig.savefig(...) 

as seen previously in this answer.

UPDATE: I have recently used the PairGrid object from seaborn to generate a plot similar to the one in this example. In this case, since a PairGrid is not an Axes-level plot object like, for example, sns.swarmplot, it has no get_figure() function. It is possible to directly access the matplotlib figure with

fig = myGridPlotObject.fig

Like previously suggested in other posts in this thread.
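Putting the two cases together, a minimal sketch assuming seaborn's built-in tips and iris example datasets (attribute and function names have shifted slightly across seaborn releases, so treat this as indicative rather than exact):

import seaborn as sns

# Axes-level plot: the returned Axes can hand back its Figure
tips = sns.load_dataset('tips')
ax = sns.swarmplot(x='day', y='total_bill', data=tips)
ax.get_figure().savefig('swarm.png')

# Figure-level grid: the PairGrid saves its own Figure
iris = sns.load_dataset('iris')
grid = sns.pairplot(iris, hue='species')
grid.savefig('pairs.png')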


Answer 2

Some of the above solutions did not work for me. The .fig attribute was not found when I tried that and I was unable to use .savefig() directly. However, what did work was:

sns_plot.figure.savefig("output.png")

I am a newer Python user, so I do not know if this is due to an update. I wanted to mention it in case anybody else runs into the same issues as I did.


Answer 3

You should just be able to use the savefig method of sns_plot directly.

sns_plot.savefig("output.png")

For clarity with your code if you did want to access the matplotlib figure that sns_plot resides in then you can get it directly with

fig = sns_plot.fig

In this case there is no get_figure method as your code assumes.


Answer 4

I use distplot and get_figure to save picture successfully.

sns_hist = sns.distplot(df_train['SalePrice'])
fig = sns_hist.get_figure()
fig.savefig('hist.png')

Answer 5

Fewer lines for 2019 searchers:

import matplotlib.pyplot as plt
import seaborn as sns

df = sns.load_dataset('iris')
sns_plot = sns.pairplot(df, hue='species', height=2.5)
plt.savefig('output.png')

UPDATE NOTE: size was changed to height.


Answer 6

This works for me

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

sns.factorplot(x='holiday',data=data,kind='count',size=5,aspect=1)
plt.savefig('holiday-vs-count.png')

Answer 7

It's also possible to just create a matplotlib figure object and then use plt.savefig(...):

from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd

df = sns.load_dataset('iris')
plt.figure() # Push new figure on stack
sns_plot = sns.pairplot(df, hue='species', size=2.5)
plt.savefig('output.png') # Save that figure

Answer 8

You would get an error for using sns.figure.savefig("output.png") in seaborn 0.8.1.

Instead use:

import seaborn as sns

df = sns.load_dataset('iris')
sns_plot = sns.pairplot(df, hue='species', size=2.5)
sns_plot.savefig("output.png")

Answer 9

Just FYI, the below command worked in seaborn 0.8.1 so I guess the initial answer is still valid.

sns_plot = sns.pairplot(data, hue='species', size=3)
sns_plot.savefig("output.png")

pandas read_csv from a URL

Question: pandas read_csv from a URL

I am using Python 3.4 with IPython and have the following code. I’m unable to read a csv-file from the given URL:

import pandas as pd
import requests

url="https://github.com/cs109/2014_data/blob/master/countries.csv"
s=requests.get(url).content
c=pd.read_csv(s)

I have the following error

“Expected file path name or file-like object, got type”

How can I fix this?


Answer 0

Update

From pandas 0.19.2 you can now just pass the url directly.


Just as the error suggests, pandas.read_csv needs a file-like object as the first argument.

If you want to read the csv from a string, you can use io.StringIO (Python 3.x) or StringIO.StringIO (Python 2.x) .

Also, for the URL – https://github.com/cs109/2014_data/blob/master/countries.csv – you are getting back html response , not raw csv, you should use the url given by the Raw link in the github page for getting raw csv response , which is – https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv

Example –

import pandas as pd
import io
import requests
url="https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv"
s=requests.get(url).content
c=pd.read_csv(io.StringIO(s.decode('utf-8')))
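A small variation: read_csv also accepts a binary buffer, so wrapping the raw bytes in io.BytesIO avoids the explicit decode step (a sketch):

import io
import pandas as pd
import requests

url = "https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv"
c = pd.read_csv(io.BytesIO(requests.get(url).content))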

Answer 1

In the latest version of pandas (0.19.2) you can directly pass the url

import pandas as pd

url="https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv"
c=pd.read_csv(url)

Answer 2

As I commented, you need to use a StringIO object and decode, i.e. c = pd.read_csv(io.StringIO(s.decode("utf-8"))), if using requests; you need to decode because .content returns bytes. If you used .text you would just need to pass s as is: s = requests.get(url).text and c = pd.read_csv(StringIO(s)).

A simpler approach is to pass the correct url of the raw data directly to read_csv; you don't have to pass a file-like object, you can pass a url, so you don't need requests at all:

c = pd.read_csv("https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv")

print(c)

Output:

                              Country         Region
0                             Algeria         AFRICA
1                              Angola         AFRICA
2                               Benin         AFRICA
3                            Botswana         AFRICA
4                             Burkina         AFRICA
5                             Burundi         AFRICA
6                            Cameroon         AFRICA
..................................

From the docs:

filepath_or_buffer :

string or file handle / StringIO. The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. For instance, a local file could be file://localhost/path/to/table.csv


Answer 3

The problem you're having is that the output you get into the variable 's' is not a csv, but an html file. In order to get the raw csv, you have to modify the url to:

https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv

Your second problem is that read_csv expects a file name; we can solve this by using StringIO from the io module. The third problem is that requests.get(url).content delivers a byte stream; we can solve this by using requests.get(url).text instead.

End result is this code:

from io import StringIO

import pandas as pd
import requests
url='https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv'
s=requests.get(url).text

c=pd.read_csv(StringIO(s))

output:

>>> c.head()
    Country  Region
0   Algeria  AFRICA
1    Angola  AFRICA
2     Benin  AFRICA
3  Botswana  AFRICA
4   Burkina  AFRICA

Answer 4

url = "https://github.com/cs109/2014_data/blob/master/countries.csv"
c = pd.read_csv(url, sep = "\t")
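Note that this URL points at the GitHub HTML page rather than the raw file, so the parsed result will not be the expected two-column table; in line with the earlier answers, a corrected sketch reads the raw URL instead:

c = pd.read_csv("https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv")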
url = "https://github.com/cs109/2014_data/blob/master/countries.csv"
c = pd.read_csv(url, sep = "\t")

Answer 5

To import data from a URL in pandas, simply apply the code below; it works just as well.

import pandas as pd
train = pd.read_table("https://urlandfile.com/dataset.csv")
train.head()

Prefixing the string with 'r' makes it a raw string literal; this only matters if the URL contains backslash escapes, and it does not change the data the server returns:

import pandas as pd
train = pd.read_table(r"https://urlandfile.com/dataset.csv")
train.head()