# Pyintervals: Solve Your Threshold-Checking Problems

Pyintervals is a module for computing with numeric intervals. For example, when you want to check whether a value falls inside one interval, or any of a series of intervals, you can use the pyinterval module (installed as `pyinterval`, imported as `interval`) in place of chains of if-else statements, simplifying your code.
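To see what this buys you, here is a plain-Python sketch of the kind of if-else chain the library replaces; the interval bounds and function names are made-up example values, not part of the library:

```python
# Membership test written as an explicit if-else chain -- the style
# pyinterval is meant to replace. Bounds are hypothetical examples.
def in_ranges_if_else(x):
    if 0 <= x <= 1:
        return True
    elif 2 <= x <= 3:
        return True
    elif 10 <= x <= 15:
        return True
    return False

# The same test, data-driven: one membership query instead of one branch
# per range (with pyinterval this becomes `x in interval([0, 1], [2, 3], [10, 15])`).
def in_ranges(x, ranges):
    return any(lo <= x <= hi for lo, hi in ranges)

ranges = [(0, 1), (2, 3), (10, 15)]
print(in_ranges_if_else(2.5), in_ranges(2.5, ranges))  # True True
```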

## 1. Preparation

(Optional 1) If you use Python for data analysis, you can install Anaconda directly (see: "Anaconda: A Great Helper for Python Data Analysis and Mining"); it bundles Python and pip.

(Optional 2) The VSCode editor is also recommended for writing small Python projects (see: "VSCode: The Best Companion for Python Programming, a Detailed Guide").

On Windows, open Cmd (Start > Run > CMD); on macOS, open Terminal (Command + Space, then type "Terminal"). Then run the following command to install the dependency:

`pip install pyinterval`

## 2. Basic Usage

```
from interval import interval

a = interval[1, 5]
# interval([1.0, 5.0])
print(3 in a)
# True
```

```
from interval import interval

a = interval([0, 1], [2, 3], [10, 15])
print(2.5 in a)
# True
```

The interval.hull function can also merge multiple intervals, taking their overall minimum and maximum as the bounds:

```
from interval import interval

a = interval.hull((interval[1, 3], interval[10, 15], interval[16, 2222]))
# interval([1.0, 2222.0])
print(1231 in a)
# True
```

interval.union, on the other hand, keeps disjoint components instead of collapsing them into a single range:

```
from interval import interval

a = interval.union([interval([1, 3], [4, 6]), interval([2, 5], 9)])
# interval([1.0, 6.0], [9.0])
print(5 in a)
# True
print(8 in a)
# False
```

## 3. Generating Multiple Threshold Intervals

```
from interval import interval
import numpy as np

threshold_list = np.arange(0.0, 1.0, 0.005)
intervals = [interval([threshold_list[i - 1], threshold_list[i]]) for i in range(1, len(threshold_list))]
intervals += [interval([-threshold_list[i], -threshold_list[i - 1]]) for i in range(len(threshold_list) - 1, 0, -1)]
print(len(intervals))
# 398
print(intervals[0], intervals[-1])
# interval([0.0, 0.005]) interval([-0.005, -0.0])
```

```
target = 0.023
class_labels = {}
for index, interval_ in enumerate(intervals):
    if target in interval_:
        class_labels[target] = index
```
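As an aside, for the positive buckets above, which are sorted, contiguous, and non-overlapping, the standard library's bisect module can locate the bucket index without scanning every interval. A sketch, where the edge list mirrors the 0.005 step and the function name is my own:

```python
from bisect import bisect_right

# Hypothetical bucket edges: [0.0, 0.005, 0.010, ..., 0.995],
# mirroring np.arange(0.0, 1.0, 0.005) above.
edges = [i * 0.005 for i in range(200)]

def bucket_index(x, edges):
    """Return the index i such that edges[i] <= x <= edges[i + 1],
    or None if x falls outside all buckets."""
    i = bisect_right(edges, x)
    if 0 < i < len(edges):
        return i - 1
    return None

print(bucket_index(0.023, edges))  # 4, i.e. the interval [0.020, 0.025]
```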

For anyone working on large-scale classification tasks, Pyintervals is a very handy module, and it is well worth trying if you have the need.

​Python实用宝典 ( pythondict.com )

# Fetching Key U.S. Economic Indicator Data with Python

1. FOMC (Federal Open Market Committee) meeting statements

2. Consumer Price Index (CPI)

3. Producer Price Index (PPI)

4. Purchasing Managers' Index (PMI)

The PMI is a comprehensive economic indicator that summarizes overall U.S. manufacturing conditions, employment, and prices, and it is among the most closely watched economic data worldwide. It is the first major data release each month, and because it reflects the economy broadly, markets pay close attention to its results. Generally speaking, a rising PMI tends to lift the U.S. dollar, while a falling PMI tends to weigh on it.

5. Non-farm Payrolls (NFP)

## 1. Preparation

(Optional 1) If you use Python for data analysis, you can install Anaconda directly (see: "Anaconda: A Great Helper for Python Data Analysis and Mining"); it bundles Python and pip.

(Optional 2) The VSCode editor is also recommended for writing small Python projects (see: "VSCode: The Best Companion for Python Programming, a Detailed Guide").

On Windows, open Cmd (Start > Run > CMD); on macOS, open Terminal (Command + Space, then type "Terminal"). Then run the following command to install the dependency:

`pip install fredapi`

## 3. Fetching FRED Data via the API

FRED hosts a huge amount of data, organized into top-level releases and the sub-series under them. We can fetch the top-level releases with code like this:

```
import requests
import pandas as pd
import datetime as dt

def fetch_releases(api_key):
    """
    Fetch the list of FRED top-level releases.

    Args:
        api_key (str): API key
    """
    r = requests.get('https://api.stlouisfed.org/fred/releases?api_key=' + api_key + '&file_type=json', verify=True)
    full_releases = r.json()['releases']
    full_releases = pd.DataFrame.from_dict(full_releases)
    full_releases = full_releases.set_index('id')
    # full_releases.to_csv("full_releases.csv")
    return full_releases
```
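The JSON-to-DataFrame step can be tried offline with a made-up payload shaped like the API's `releases` field; the entries below are hypothetical stand-ins, not real API output:

```python
import pandas as pd

# Hypothetical payload mimicking the shape of r.json()['releases'].
fake_releases = [
    {"id": 9, "name": "Advance Monthly Sales for Retail and Food Services"},
    {"id": 10, "name": "Consumer Price Index"},
]

# Same transformation as in fetch_releases above.
full_releases = pd.DataFrame.from_dict(fake_releases).set_index("id")
print(full_releases.loc[10, "name"])  # Consumer Price Index
```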

* FEDMINNFRWG：Nonfarm Workers Minimum Hourly Wage
* FEDMINFRMWG：Farm Workers Minimum Hourly Wage

```
from fredapi import Fred
import requests
import numpy as np
import pandas as pd
import datetime as dt

def fetch_releases(api_key):
    """
    Fetch the list of FRED top-level releases.

    Args:
        api_key (str): API key
    """
    r = requests.get('https://api.stlouisfed.org/fred/releases?api_key=' + api_key + '&file_type=json', verify=True)
    full_releases = r.json()['releases']
    full_releases = pd.DataFrame.from_dict(full_releases)
    full_releases = full_releases.set_index('id')
    # full_releases.to_csv("full_releases.csv")
    return full_releases

def fetch_release_id_data(release_id):
    """
    Fetch data by release ID.

    Args:
        release_id (int): top-level release ID

    Returns:
        dataframe: the data
    """
    econ_data = pd.DataFrame(index=pd.date_range(start='2000-01-01', end=dt.datetime.today(), freq='MS'))
    series_df = fred.search_by_release(release_id, limit=3, order_by='popularity', sort_order='desc')
    for topic_label in series_df.index:
        econ_data[series_df.loc[topic_label].title] = fred.get_series(topic_label, observation_start='2000-01-01', observation_end=dt.datetime.today())
    return econ_data

api_key = 'YOUR_API_KEY'  # fill in your API key

fred = Fred(api_key)

full_releases = fetch_releases(api_key)

keywords = ["producer price", "consumer price", "fomc", "manufacturing", "employment"]

for search_keywords in keywords:
    search_result = full_releases.name[full_releases.name.apply(lambda x: search_keywords in x.lower())]
    econ_data = pd.DataFrame(index=pd.date_range(start='2000-01-01', end=dt.datetime.today(), freq='MS'))

    for release_id in search_result.index:
        print("scraping release_id: ", release_id)
        econ_data = pd.concat([econ_data, fetch_release_id_data(release_id)], axis=1)
    econ_data.to_csv(f"{search_keywords}.csv")
```


# What Is the Best Format for Saving Pandas Data?

## 1. The Formats to Compare

1. CSV - a data scientist's good friend
2. Pickle - the Python way to serialize things
3. MessagePack - like JSON, but fast and small
4. HDF5 - a file format designed for storing and organizing large amounts of data
5. Feather - a fast, lightweight, easy-to-use binary file format for storing data frames

1. size_mb - the file size (MB).
2. save_time - the amount of time needed to save the dataframe to disk.
3. save_ram_delta_mb - the maximum growth in memory consumption during the save.
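These metrics can be measured with the standard library alone. A sketch using pickle as a stand-in for any of the formats; the variable names mirror the metrics above, and the data is made up:

```python
import os
import pickle
import time
import tracemalloc

# Stand-in data; the real benchmark serializes a randomly generated DataFrame.
data = [{"n0": float(i), "c0": f"cat-{i % 7}"} for i in range(50_000)]

tracemalloc.start()
start = time.monotonic()
with open("bench.pkl", "wb") as f:
    pickle.dump(data, f)
save_time = time.monotonic() - start
_, peak = tracemalloc.get_traced_memory()  # peak allocation during the save
tracemalloc.stop()

size_mb = os.path.getsize("bench.pkl") / 2**20
save_ram_delta_mb = peak / 2**20
os.remove("bench.pkl")

print(f"size_mb={size_mb:.2f}, save_time={save_time:.4f}s, save_ram_delta_mb={save_ram_delta_mb:.2f}")
```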

## 2. Tests and Results

(a) Keep the generated categorical variables as strings.

(b) Convert them to the pandas.Categorical data type before performing any I/O.

```
import numpy as np
import pandas as pd

def generate_dataset(n_rows, num_count, cat_count, max_nan=0.1, max_cat_size=100):
    """
    Randomly generate a dataset with numeric and categorical features.

    Numeric features are drawn from the normal distribution X ~ N(0, 1).
    Categorical features are generated as random uuid4 strings.

    In addition, a max_nan proportion of both numeric and categorical
    features is replaced with NaN values.
    """
    dataset, types = {}, {}

    def generate_categories():
        from uuid import uuid4
        category_size = np.random.randint(2, max_cat_size)
        return [str(uuid4()) for _ in range(category_size)]

    for col in range(num_count):
        name = f'n{col}'
        values = np.random.normal(0, 1, n_rows)
        nan_cnt = np.random.randint(1, int(max_nan * n_rows))
        index = np.random.choice(n_rows, nan_cnt, replace=False)
        values[index] = np.nan
        dataset[name] = values
        types[name] = 'float32'

    for col in range(cat_count):
        name = f'c{col}'
        cats = generate_categories()
        values = np.array(np.random.choice(cats, n_rows, replace=True), dtype=object)
        nan_cnt = np.random.randint(1, int(max_nan * n_rows))
        index = np.random.choice(n_rows, nan_cnt, replace=False)
        values[index] = np.nan
        dataset[name] = values
        types[name] = 'object'

    return pd.DataFrame(dataset), types
```

https://github.com/devforfu/pandas-formats-benchmark

### (b) Performance with string features converted to pandas.Categorical

Feather and Pickle showed the best I/O speed, while hdf still showed a clear performance overhead.

## 3. Conclusion


# Pandarallel: A Tool to Run Your Pandas Computations at Full Power

Common options for parallelizing Python computations include:

1. multiprocessing
2. concurrent.futures.ProcessPoolExecutor()
3. joblib
4. ppserver
5. celery

## 1. Preparation

(Optional 1) If you use Python for data analysis, you can install Anaconda directly (see: "Anaconda: A Great Helper for Python Data Analysis and Mining"); it bundles Python and pip.

(Optional 2) The VSCode editor is also recommended for writing small Python projects (see: "VSCode: The Best Companion for Python Programming, a Detailed Guide").

On Windows, open Cmd (Start > Run > CMD); on macOS, open Terminal (Command + Space, then type "Terminal"). Then run the following command to install the dependency:

`pip install pandarallel`

## 2. Using Pandarallel

```
from pandarallel import pandarallel

pandarallel.initialize()
```

Pandarallel supports eight pandas operations in total. Below is an example using the apply method.

```
import pandas as pd
import time
import math
import numpy as np
from pandarallel import pandarallel

# Initialize
pandarallel.initialize()

df_size = int(5e6)
df = pd.DataFrame(dict(a=np.random.randint(1, 8, df_size),
                       b=np.random.rand(df_size)))

def func(x):
    return math.sin(x.a**2) + math.sin(x.b**2)

# Standard processing
res = df.apply(func, axis=1)

# Parallel processing
res_parallel = df.parallel_apply(func, axis=1)

# Check that the results are identical
res.equals(res_parallel)
```

```
import pandas as pd
import time
import math
import numpy as np
from pandarallel import pandarallel

# Initialize
pandarallel.initialize()

df_size = int(3e7)
df = pd.DataFrame(dict(a=np.random.randint(1, 1000, df_size),
                       b=np.random.rand(df_size)))

def func(df):
    dum = 0
    for item in df.b:
        dum += math.log10(math.sqrt(math.exp(item**2)))
    return dum / len(df.b)

# Standard processing
res = df.groupby("a").apply(func)

# Parallel processing
res_parallel = df.groupby("a").parallel_apply(func)
res.equals(res_parallel)
```

```
import pandas as pd
import time
import math
import numpy as np
from pandarallel import pandarallel

# Initialize
pandarallel.initialize()

df_size = int(1e6)
df = pd.DataFrame(dict(a=np.random.randint(1, 300, df_size),
                       b=np.random.rand(df_size)))

def func(x):
    return x.iloc[0] + x.iloc[1] ** 2 + x.iloc[2] ** 3 + x.iloc[3] ** 4

# Standard processing
res = df.groupby('a').b.rolling(4).apply(func, raw=False)

# Parallel processing
res_parallel = df.groupby('a').b.rolling(4).parallel_apply(func, raw=False)
res.equals(res_parallel)
```

## 3. Caveats

1. I have 8 CPUs, but `parallel_apply` only speeds computation up by about 4x. Why? With hyperthreading, 8 logical CPUs typically correspond to only 4 physical cores, and it is the physical cores that determine the real parallel speedup.

2. Parallelization has a cost (instantiating new processes, sending data through shared memory, etc.), so parallelizing is only worthwhile when the amount of computation is large enough. For very small amounts of data, using Pandarallel is not always worth it.


# How to Get the Last N Rows of a pandas DataFrame?

## Question: How to get the last N rows of a pandas DataFrame?


I have pandas dataframes `df1` and `df2` (`df1` is a vanilla dataframe, `df2` is indexed by ‘STK_ID’ & ‘RPT_Date’):

``````
>>> df1
STK_ID  RPT_Date  TClose   sales  discount
0   000568  20060331    3.69   5.975       NaN
1   000568  20060630    9.14  10.143       NaN
2   000568  20060930    9.49  13.854       NaN
3   000568  20061231   15.84  19.262       NaN
4   000568  20070331   17.00   6.803       NaN
5   000568  20070630   26.31  12.940       NaN
6   000568  20070930   39.12  19.977       NaN
7   000568  20071231   45.94  29.269       NaN
8   000568  20080331   38.75  12.668       NaN
9   000568  20080630   30.09  21.102       NaN
10  000568  20080930   26.00  30.769       NaN

>>> df2
TClose   sales  discount  net_sales    cogs
STK_ID RPT_Date
000568 20060331    3.69   5.975       NaN      5.975   2.591
20060630    9.14  10.143       NaN     10.143   4.363
20060930    9.49  13.854       NaN     13.854   5.901
20061231   15.84  19.262       NaN     19.262   8.407
20070331   17.00   6.803       NaN      6.803   2.815
20070630   26.31  12.940       NaN     12.940   5.418
20070930   39.12  19.977       NaN     19.977   8.452
20071231   45.94  29.269       NaN     29.269  12.606
20080331   38.75  12.668       NaN     12.668   3.958
20080630   30.09  21.102       NaN     21.102   7.431
``````

I can get the last 3 rows of df2 by:

``````
>>> df2.ix[-3:]
TClose   sales  discount  net_sales    cogs
STK_ID RPT_Date
000568 20071231   45.94  29.269       NaN     29.269  12.606
20080331   38.75  12.668       NaN     12.668   3.958
20080630   30.09  21.102       NaN     21.102   7.431
``````

while `df1.ix[-3:]` gives all the rows:

``````
>>> df1.ix[-3:]
STK_ID  RPT_Date  TClose   sales  discount
0   000568  20060331    3.69   5.975       NaN
1   000568  20060630    9.14  10.143       NaN
2   000568  20060930    9.49  13.854       NaN
3   000568  20061231   15.84  19.262       NaN
4   000568  20070331   17.00   6.803       NaN
5   000568  20070630   26.31  12.940       NaN
6   000568  20070930   39.12  19.977       NaN
7   000568  20071231   45.94  29.269       NaN
8   000568  20080331   38.75  12.668       NaN
9   000568  20080630   30.09  21.102       NaN
10  000568  20080930   26.00  30.769       NaN
``````

Why? How do I get the last 3 rows of `df1` (a dataframe without a custom index)? pandas 0.10.1.

## Answer 0

Don’t forget `DataFrame.tail`! e.g. `df1.tail(10)`

## Answer 1


This is because of using integer indices (`ix` selects those by label over -3 rather than position, and this is by design: see integer indexing in pandas “gotchas”*).

*In newer versions of pandas prefer loc or iloc to remove the ambiguity of ix as position or label:

``````
df.iloc[-3:]
``````

see the docs.

As Wes points out, in this specific case you should just use tail!


## Answer 2

If you are slicing by position, `__getitem__` (i.e., slicing with `[]`) works well, and is the most succinct solution I’ve found for this problem.

``````
pd.__version__
# '0.24.2'

df = pd.DataFrame({'A': list('aaabbbbc'), 'B': np.arange(1, 9)})
df

A  B
0  a  1
1  a  2
2  a  3
3  b  4
4  b  5
5  b  6
6  b  7
7  c  8
``````

``````
df[-3:]

A  B
5  b  6
6  b  7
7  c  8
``````

This is the same as calling `df.iloc[-3:]`, for instance (`iloc` internally delegates to `__getitem__`).

As an aside, if you want to find the last N rows for each group, use `groupby` and `GroupBy.tail`:

``````
df.groupby('A').tail(2)

A  B
1  a  2
2  a  3
5  b  6
6  b  7
7  c  8
``````

# Converting a pandas Column Containing NaNs to dtype `int`

## Question: Converting a pandas column containing NaNs to dtype `int`


I read data from a .csv file to a Pandas dataframe as below. For one of the columns, namely `id`, I want to specify the column type as `int`. The problem is the `id` series has missing/empty values.

When I try to cast the `id` column to integer while reading the .csv, I get:

``````
df = pd.read_csv("data.csv", dtype={'id': int})
error: Integer column has NA values
``````

Alternatively, I tried to convert the column type after reading as below, but this time I get:

``````
df = pd.read_csv("data.csv")
df[['id']] = df[['id']].astype(int)
error: Cannot convert NA to integer
``````

How can I tackle this?

## Answer 0

The lack of NaN rep in integer columns is a pandas “gotcha”.

The usual workaround is to simply use floats.
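A quick sketch of that behavior; the column values here are made up:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3])
print(s.dtype)  # an integer dtype

# A single NaN silently upcasts the whole column to float.
s2 = pd.Series([1, 2, np.nan])
print(s2.dtype)  # float64
```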

## Answer 1


In version 0.24.+ pandas has gained the ability to hold integer dtypes with missing values.

Pandas can represent integer data with possibly missing values using `arrays.IntegerArray`. This is an extension type implemented within pandas. It is not the default dtype for integers and will not be inferred; you must explicitly pass the dtype into `array()` or `Series`:

``````
arr = pd.array([1, 2, np.nan], dtype=pd.Int64Dtype())
pd.Series(arr)

0      1
1      2
2    NaN
dtype: Int64
``````

To convert a column to nullable integers, use:

``````
df['myCol'] = df['myCol'].astype('Int64')
``````

## Answer 2


My use case is munging data prior to loading into a DB table:

``````
df[col] = df[col].fillna(-1)
df[col] = df[col].astype(int)
df[col] = df[col].astype(str)
df[col] = df[col].replace('-1', np.nan)
``````

Remove NaNs, convert to int, convert to str and then reinsert NANs.

It’s not pretty but it gets the job done!

## Answer 3


It is now possible to create a pandas column containing NaNs with dtype `int`, since this was officially added in pandas 0.24.0.

pandas 0.24.x release notes Quote: “Pandas has gained the ability to hold integer dtypes with missing values

## Answer 4


If you absolutely want to combine integers and NaNs in a column, you can use the ‘object’ data type:

``````
df['col'] = (
    df['col'].fillna(0)
             .astype(int)
             .astype(object)
             .where(df['col'].notnull())
)
``````

This will replace NaNs with an integer (doesn’t matter which), convert to int, convert to object and finally reinsert NaNs.

## Answer 5


If you can modify your stored data, use a sentinel value for missing `id`. In the common use case suggested by the column name, where `id` is an integer strictly greater than zero, you could use `0` as the sentinel value, so that you can write:

``````
if row['id']:
    regular_process(row)
else:
    special_process(row)
``````

## Answer 6


You could use `.dropna()` if it is OK to drop the rows with the NaN values.

``````
df = df.dropna(subset=['id'])
``````

Alternatively, use `.fillna()` and `.astype()` to replace the NaN with values and convert them to int.

I ran into this problem when processing a CSV file with large integers, while some of them were missing (NaN). Using float as the type was not an option, because I might lose precision.

My solution was to use str as the intermediate type. Then you can convert the string to int as you please later in the code. I replaced NaN with 0, but you could choose any value.

``````
df = pd.read_csv(filename, dtype={'id':str})
df["id"] = df["id"].fillna("0").astype(int)
``````

For illustration, here is an example of how floats may lose precision:

``````
s = "12345678901234567890"
f = float(s)
i = int(f)
i2 = int(s)
print(f, i, i2)
``````

And the output is:

``````
1.2345678901234567e+19 12345678901234567168 12345678901234567890
``````

## Answer 7


Most solutions here tell you how to use a placeholder integer to represent nulls. That approach isn’t helpful if you’re not certain that the integer won’t show up in your source data, though. My method will format floats without their decimal values and convert nulls to None. The result is an object column that looks like an integer field with null values when loaded into a CSV.

``````
keep_df[col] = keep_df[col].apply(lambda x: None if pandas.isnull(x) else '{0:.0f}'.format(pandas.to_numeric(x)))
``````

## Answer 8


I ran into this issue working with pyspark. As this is a Python frontend for code running on a JVM, it requires type safety, and using float instead of int is not an option. I worked around the issue by wrapping pandas’ `pd.read_csv` in a function that fills user-defined columns with user-defined fill values before casting them to the required type. Here is what I ended up using:

``````
import pandas as pd

def custom_read_csv(file_path, custom_dtype=None, fill_values=None, **kwargs):
    if custom_dtype is None:
        return pd.read_csv(file_path, **kwargs)
    assert 'dtype' not in kwargs.keys()
    df = pd.read_csv(file_path, **kwargs)
    for col, typ in custom_dtype.items():
        if fill_values is None or col not in fill_values.keys():
            fill_val = -1
        else:
            fill_val = fill_values[col]
        df[col] = df[col].fillna(fill_val).astype(typ)
    return df
``````

## Answer 9

First remove the rows which contain NaN, then do the integer conversion on the remaining rows, and finally insert the removed rows back. Hopefully that works.
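A minimal pandas sketch of that recipe; the `id` column follows the question, while the concat-based reinsertion and the object cast are my own rendering, since a plain int column cannot hold the reinserted NaNs:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"id": [1.0, np.nan, 3.0], "x": ["a", "b", "c"]})

# 1. Split off the rows whose 'id' is NaN.
missing = df[df["id"].isna()]
present = df[df["id"].notna()].copy()

# 2. Convert the remaining rows to int (cast to object so the ints survive
#    the reinsertion of NaN rows instead of being upcast back to float).
present["id"] = present["id"].astype(int).astype(object)

# 3. Reinsert the removed rows and restore the original order.
result = pd.concat([present, missing]).sort_index()
print(present["id"].tolist())  # [1, 3]
```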

## Answer 10

``````
import pandas as pd

df['id'] = pd.to_numeric(df['id'])
``````

## Answer 11


Assuming your DateColumn, formatted like 3312018.0, should be converted to 03/31/2018 as a string, and some records are missing or 0:

``````
df['DateColumn'] = df['DateColumn'].astype(int)
df['DateColumn'] = df['DateColumn'].astype(str)
df['DateColumn'] = df['DateColumn'].apply(lambda x: x.zfill(8))
df.loc[df['DateColumn'] == '00000000','DateColumn'] = '01011980'
df['DateColumn'] = pd.to_datetime(df['DateColumn'], format="%m%d%Y")
df['DateColumn'] = df['DateColumn'].apply(lambda x: x.strftime('%m/%d/%Y'))
``````

# Apply vs Transform on a Group Object

## Question: Apply vs transform on a group object


Consider the following dataframe:

``````
     A      B         C         D
0  foo    one  0.162003  0.087469
1  bar    one -1.156319 -1.526272
2  foo    two  0.833892 -1.666304
3  bar  three -2.026673 -0.322057
4  foo    two  0.411452 -0.954371
5  bar    two  0.765878 -0.095968
6  foo    one -0.654890  0.678091
7  foo  three -1.789842 -1.130922
``````

The following commands work:

``````
> df.groupby('A').apply(lambda x: (x['C'] - x['D']))
> df.groupby('A').apply(lambda x: (x['C'] - x['D']).mean())
``````

but none of the following work:

``````
> df.groupby('A').transform(lambda x: (x['C'] - x['D']))
ValueError: could not broadcast input array from shape (5) into shape (5,3)

> df.groupby('A').transform(lambda x: (x['C'] - x['D']).mean())
TypeError: cannot concatenate a non-NDFrame object
``````

Why? The example on the documentation seems to suggest that calling `transform` on a group allows one to do row-wise operation processing:

``````
# Note that the following suggests row-wise operation (x.mean is the column mean)
zscore = lambda x: (x - x.mean()) / x.std()
transformed = ts.groupby(key).transform(zscore)
``````

In other words, I thought that transform is essentially a specific type of apply (the one that does not aggregate). Where am I wrong?

For reference, below is the construction of the original dataframe above:

``````
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                   'C' : randn(8), 'D' : randn(8)})
``````

## Answer 0


### Two major differences between `apply` and `transform`

There are two major differences between the `transform` and `apply` groupby methods.

• Input:
• `apply` implicitly passes all the columns for each group as a DataFrame to the custom function.
• while `transform` passes each column for each group individually as a Series to the custom function.
• Output:
• The custom function passed to `apply` can return a scalar, or a Series or DataFrame (or numpy array or even list).
• The custom function passed to `transform` must return a sequence (a one dimensional Series, array or list) the same length as the group.

So, `transform` works on just one Series at a time and `apply` works on the entire DataFrame at once.

### Inspecting the custom function

It can help quite a bit to inspect the input to your custom function passed to `apply` or `transform`.

### Examples

Let’s create some sample data and inspect the groups so that you can see what I am talking about:

``````
import pandas as pd
import numpy as np
df = pd.DataFrame({'State':['Texas', 'Texas', 'Florida', 'Florida'],
                   'a':[4,5,1,3], 'b':[6,10,3,11]})

     State  a   b
0    Texas  4   6
1    Texas  5  10
2  Florida  1   3
3  Florida  3  11
``````

Let’s create a simple custom function that prints the type of the implicitly passed object and then raises an error, so that execution stops.

``````
def inspect(x):
    print(type(x))
    raise
``````

Now let’s pass this function to both the groupby `apply` and `transform` methods to see what object is passed to it:

``````
df.groupby('State').apply(inspect)

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
RuntimeError
``````

As you can see, a DataFrame is passed into the `inspect` function. You might be wondering why the type, DataFrame, got printed out twice. Pandas runs the first group twice. It does this to determine if there is a fast way to complete the computation or not. This is a minor detail that you shouldn’t worry about.

Now, let’s do the same thing with `transform`

``````
df.groupby('State').transform(inspect)
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
RuntimeError
``````

It is passed a Series – a totally different Pandas object.

So, `transform` is only allowed to work with a single Series at a time. It is impossible for it to act on two columns at the same time. So, if we try to subtract column `a` from `b` inside of our custom function, we would get an error with `transform`. See below:

``````
def subtract_two(x):
    return x['a'] - x['b']

df.groupby('State').transform(subtract_two)
KeyError: ('a', 'occurred at index a')
``````

We get a KeyError as pandas is attempting to find the Series index `a` which does not exist. You can complete this operation with `apply` as it has the entire DataFrame:

``````
df.groupby('State').apply(subtract_two)

State
Florida  2   -2
3   -8
Texas    0   -2
1   -5
dtype: int64
``````

The output is a Series and a little confusing as the original index is kept, but we have access to all columns.

### Displaying the passed pandas object

It can help even more to display the entire pandas object within the custom function, so you can see exactly what you are operating with. You can use `print` statements, but I like to use the `display` function from the `IPython.display` module so that the DataFrames get nicely rendered as HTML in a Jupyter notebook:

``````
from IPython.display import display

def subtract_two(x):
    display(x)
    return x['a'] - x['b']
``````

### Transform must return a single dimensional sequence the same size as the group

The other difference is that `transform` must return a single dimensional sequence the same size as the group. In this particular instance, each group has two rows, so `transform` must return a sequence of two rows. If it does not then an error is raised:

``````
def return_three(x):
    return np.array([1, 2, 3])

df.groupby('State').transform(return_three)
ValueError: transform must return a scalar value for each group
``````

The error message is not really descriptive of the problem. You must return a sequence the same length as the group. So, a function like this would work:

``````
def rand_group_len(x):
    return np.random.rand(len(x))

df.groupby('State').transform(rand_group_len)

a         b
0  0.962070  0.151440
1  0.440956  0.782176
2  0.642218  0.483257
3  0.056047  0.238208
``````

### Returning a single scalar object also works for `transform`

If you return just a single scalar from your custom function, then `transform` will use it for each of the rows in the group:

``````
def group_sum(x):
    return x.sum()

df.groupby('State').transform(group_sum)

a   b
0  9  16
1  9  16
2  4  14
3  4  14
``````

## Answer 1


As I felt similarly confused with the `.transform` operation vs. `.apply`, I found a few answers shedding some light on the issue. This answer, for example, was very helpful.

My takeaway so far is that `.transform` will work (or deal) with `Series` (columns) in isolation from each other. What this means is that in your last two calls:

``````df.groupby('A').transform(lambda x: (x['C'] - x['D']))
df.groupby('A').transform(lambda x: (x['C'] - x['D']).mean())
``````

You asked `.transform` to take values from two columns and ‘it’ actually does not ‘see’ both of them at the same time (so to speak). `transform` will look at the dataframe columns one by one and return back a series (or group of series) ‘made’ of scalars which are repeated `len(input_column)` times.

So this scalar, that should be used by `.transform` to make the `Series` is a result of some reduction function applied on an input `Series` (and only on ONE series/column at a time).

Consider this example (on your dataframe):

``````zscore = lambda x: (x - x.mean()) / x.std() # Note that it does not reference anything outside of 'x' and for transform 'x' is one column.
df.groupby('A').transform(zscore)
``````

will yield:

``````       C      D
0  0.989  0.128
1 -0.478  0.489
2  0.889 -0.589
3 -0.671 -1.150
4  0.034 -0.285
5  1.149  0.662
6 -1.404 -0.907
7 -0.509  1.653
``````

Which is exactly the same as if you used it on only one column at a time:

``````df.groupby('A')['C'].transform(zscore)
``````

yielding:

``````0    0.989
1   -0.478
2    0.889
3   -0.671
4    0.034
5    1.149
6   -1.404
7   -0.509
``````

Note that `.apply` in the last example (`df.groupby('A')['C'].apply(zscore)`) would work in exactly the same way, but it would fail if you tried using it on a dataframe:

``````df.groupby('A').apply(zscore)
``````

gives error:

``````ValueError: operands could not be broadcast together with shapes (6,) (2,)
``````

So where else is `.transform` useful? The simplest case is trying to assign the results of a reduction function back to the original dataframe.

``````df['sum_C'] = df.groupby('A')['C'].transform(sum)
df.sort('A') # to clearly see the scalar ('sum') applies to the whole column of the group
``````

yielding:

``````     A      B      C      D  sum_C
1  bar    one  1.998  0.593  3.973
3  bar  three  1.287 -0.639  3.973
5  bar    two  0.687 -1.027  3.973
4  foo    two  0.205  1.274  4.373
2  foo    two  0.128  0.924  4.373
6  foo    one  2.113 -0.516  4.373
7  foo  three  0.657 -1.179  4.373
0  foo    one  1.270  0.201  4.373
``````

Trying the same with `.apply` would give `NaN`s in `sum_C`, because `.apply` returns a reduced `Series`, which it does not know how to broadcast back:

``````df.groupby('A')['C'].apply(sum)
``````

giving:

``````A
bar    3.973
foo    4.373
``````
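To make that `NaN` behaviour concrete: the reduced `Series` returned by `.apply` is indexed by the group keys, so assigning it to a new column aligns against the original integer index and nothing matches. A minimal sketch (a small hypothetical frame, not the original data):

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                   'C': [1.0, 2.0, 3.0, 4.0]})

# .transform broadcasts the group sum back onto the original index
df['sum_C'] = df.groupby('A')['C'].transform('sum')

# .apply returns a Series indexed by 'bar'/'foo'; assigning it aligns
# against the integer index 0..3, so every value becomes NaN
df['sum_C_apply'] = df.groupby('A')['C'].apply(sum)

print(df['sum_C'].tolist())            # [4.0, 6.0, 4.0, 6.0]
print(df['sum_C_apply'].isna().all())  # True
```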

There are also cases when `.transform` is used to filter the data:

``````df[df.groupby(['B'])['D'].transform(sum) < -1]

A      B      C      D
3  bar  three  1.287 -0.639
7  foo  three  0.657 -1.179
``````

I hope this adds a bit more clarity.

## Answer 2

I am going to use a very simple snippet to illustrate the difference:

``````test = pd.DataFrame({'id':[1,2,3,1,2,3,1,2,3], 'price':[1,2,3,2,3,1,3,1,2]})
grouping = test.groupby('id')['price']
``````

The DataFrame looks like this:

``````    id  price
0   1   1
1   2   2
2   3   3
3   1   2
4   2   3
5   3   1
6   1   3
7   2   1
8   3   2
``````

There are 3 customer IDs in this table, each customer made three transactions and paid 1,2,3 dollars each time.

Now, I want to find the minimum payment made by each customer. There are two ways of doing it:

1. Using `apply`:

``grouping.min()``

The return looks like this:

``````id
1    1
2    1
3    1
Name: price, dtype: int64

pandas.core.series.Series # return type
Int64Index([1, 2, 3], dtype='int64', name='id') # The returned Series' index
# length is 3
``````
1. Using `transform`:

``grouping.transform(min)``

The return looks like this:

``````0    1
1    1
2    1
3    1
4    1
5    1
6    1
7    1
8    1
Name: price, dtype: int64

pandas.core.series.Series # return type
RangeIndex(start=0, stop=9, step=1) # The returned Series' index
# length is 9
``````

Both methods return a `Series` object, but the `length` of the first one is 3 and the `length` of the second one is 9.

If you want to answer `What is the minimum price paid by each customer`, then the `apply` method is the more suitable one to choose.

If you want to answer `What is the difference between the amount paid for each transaction vs the minimum payment`, then you want to use `transform`, because:

``````test['minimum'] = grouping.transform(min) # creates an extra column filled with the minimum payment
test.price - test.minimum # returns the difference for each row
``````

`Apply` does not work here simply because it returns a Series of size 3, but the original df's length is 9. You cannot easily integrate it back into the original df.
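If you are stuck with an `apply`-style reduction, a common workaround (sketched here on the same `test` frame) is to `map` the reduced Series through the key column, which broadcasts it back just like `transform`:

```python
import pandas as pd

test = pd.DataFrame({'id': [1, 2, 3, 1, 2, 3, 1, 2, 3],
                     'price': [1, 2, 3, 2, 3, 1, 3, 1, 2]})
grouping = test.groupby('id')['price']

via_transform = grouping.transform('min')   # already length 9
via_map = test['id'].map(grouping.min())    # reduce first, then broadcast by key

print(via_transform.tolist() == via_map.tolist())  # True
```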

## Answer 3

``````tmp = df.groupby(['A'])['c'].transform('mean')
``````

is like

``````tmp1 = df.groupby(['A']).agg({'c':'mean'})
tmp = df['A'].map(tmp1['c'])
``````

or

``````tmp1 = df.groupby(['A'])['c'].mean()
tmp = df['A'].map(tmp1)
``````

# pandas loc vs. iloc vs. ix vs. at vs. iat?

## Question: pandas loc vs. iloc vs. ix vs. at vs. iat?

Recently began branching out from my safe place (R) into Python and am a bit confused by the cell localization/selection in `Pandas`. I’ve read the documentation but I’m struggling to understand the practical implications of the various localization/selection options.

• Is there a reason why I should ever use `.loc` or `.iloc` over the most general option `.ix`?
• I understand that `.loc`, `iloc`, `at`, and `iat` may provide some guaranteed correctness that `.ix` can’t offer, but I’ve also read where `.ix` tends to be the fastest solution across the board.
• Please explain the real-world, best-practices reasoning behind utilizing anything other than `.ix`?

## Answer 0

loc: only works on the index (labels)
iloc: works on integer position
ix: you can get data from the dataframe without it being in the index
at: get scalar values; it’s a very fast loc
iat: get scalar values; it’s a very fast iloc

http://pyciencia.blogspot.com/2015/05/obtener-y-filtrar-datos-de-un-dataframe.html

Note: As of `pandas 0.20.0`, the `.ix` indexer is deprecated in favour of the more strict `.iloc` and `.loc` indexers.
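For migration purposes, here is a sketch (the frame and names are hypothetical) of how a mixed `.ix` lookup can be rewritten using only labels or only integer positions:

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30], 'b': [1, 2, 3]},
                  index=['x', 'y', 'z'])

# old (deprecated/removed): df.ix[1, 'a']  -- row by position, column by label
by_loc = df.loc[df.index[1], 'a']              # convert position -> label
by_iloc = df.iloc[1, df.columns.get_loc('a')]  # convert label -> position

print(by_loc, by_iloc)  # 20 20
```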

## Answer 1

Updated for `pandas` `0.20` given that `ix` is deprecated. This demonstrates not only how to use `loc`, `iloc`, `at`, `iat` and `set_value`, but also how to accomplish mixed positional/label-based indexing.

`loc`: label based
Allows you to pass 1-D arrays as indexers. Arrays can be either slices (subsets) of the index or column, or they can be boolean arrays which are equal in length to the index or columns.

Special Note: when a scalar indexer is passed, `loc` can assign a new index or column value that didn’t exist before.

``````# label based, but we can use position values
# to get the labels from the index object
df.loc[df.index, 'ColName'] = 3
``````

``````df.loc[df.index[1:3], 'ColName'] = 3
``````

`iloc`: position based
Similar to `loc` except with positions rather than index values. However, you cannot assign new columns or indices.

``````# position based, but we can get the position
# from the columns object via the `get_loc` method
df.iloc[2, df.columns.get_loc('ColName')] = 3
``````

``````df.iloc[2, 4] = 3
``````

``````df.iloc[:3, 2:4] = 3
``````

`at`: label based
Works very similarly to `loc` for scalar indexers. Cannot operate on array indexers. CAN assign new indices and columns.

Advantage over `loc` is that this is faster.
Disadvantage is that you can’t use arrays for indexers.

``````# label based, but we can use position values
# to get the labels from the index object
df.at[df.index, 'ColName'] = 3
``````

``````df.at['C', 'ColName'] = 3
``````

`iat`: position based
Works similarly to `iloc`. Cannot work with array indexers. CANNOT assign new indices and columns.

Advantage over `iloc` is that this is faster.
Disadvantage is that you can’t use arrays for indexers.

``````# position based, but we can get the position
# from the columns object via the `get_loc` method
IBM.iat[2, IBM.columns.get_loc('PNL')] = 3
``````

`set_value`: label based
Works very similarly to `loc` for scalar indexers. Cannot operate on array indexers. CAN assign new indices and columns.

Advantage: there is very little overhead because `pandas` is not doing a bunch of safety checks. Use at your own risk. Also, this is not intended for public use.

``````# label based, but we can use position values
# to get the labels from the index object
df.set_value(df.index, 'ColName', 3)
``````

`set_value` with `takeable=True`: position based
Works similarly to `iloc`. Cannot work with array indexers. CANNOT assign new indices and columns.

Advantage: there is very little overhead because `pandas` is not doing a bunch of safety checks. Use at your own risk. Also, this is not intended for public use.

``````# position based, but we can get the position
# from the columns object via the `get_loc` method
df.set_value(2, df.columns.get_loc('ColName'), 3, takeable=True)
``````

## Answer 2

There are two primary ways that pandas makes selections from a DataFrame.

• By Label
• By Integer Location

The documentation uses the term position for referring to integer location. I do not like this terminology as I feel it is confusing. Integer location is more descriptive and is exactly what `.iloc` stands for. The key word here is INTEGER – you must use integers when selecting by integer location.
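A quick sketch of that rule (the exact error message varies between pandas versions): handing `.iloc` a label raises a `TypeError`:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2]}, index=['x', 'y'])

try:
    df.iloc['x']   # a label, not an integer position
    raised = False
except TypeError:
    raised = True

print(raised)                # True
print(int(df.iloc[0]['a']))  # 1 -- an integer position works
```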

Before showing the summary let’s all make sure that …

# .ix is deprecated and ambiguous and should never be used

There are three primary indexers for pandas. We have the indexing operator itself (the brackets `[]`), `.loc`, and `.iloc`. Let’s summarize them:

• `[]` – Primarily selects subsets of columns, but can select rows as well. Cannot simultaneously select rows and columns.
• `.loc` – selects subsets of rows and columns by label only
• `.iloc` – selects subsets of rows and columns by integer location only

I almost never use `.at` or `.iat` as they add no additional functionality and with just a small performance increase. I would discourage their use unless you have a very time-sensitive application. Regardless, we have their summary:

• `.at` selects a single scalar value in the DataFrame by label only
• `.iat` selects a single scalar value in the DataFrame by integer location only

In addition to selection by label and integer location, boolean selection also known as boolean indexing exists.

### Examples explaining `.loc`, `.iloc`, boolean selection and `.at` and `.iat` are shown below

We will first focus on the differences between `.loc` and `.iloc`. Before we talk about the differences, it is important to understand that DataFrames have labels that help identify each column and each row. Let’s take a look at a sample DataFrame:

``````df = pd.DataFrame({'age':[30, 2, 12, 4, 32, 33, 69],
'color':['blue', 'green', 'red', 'white', 'gray', 'black', 'red'],
'food':['Steak', 'Lamb', 'Mango', 'Apple', 'Cheese', 'Melon', 'Beans'],
'height':[165, 70, 120, 80, 180, 172, 150],
'score':[4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2],
'state':['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']
},
index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean', 'Christina', 'Cornelia'])
``````

All the words in bold are the labels. The labels `age`, `color`, `food`, `height`, `score` and `state` are used for the columns. The other labels, `Jane`, `Nick`, `Aaron`, `Penelope`, `Dean`, `Christina` and `Cornelia`, are used as labels for the rows. Collectively, these row labels are known as the index.

The primary ways to select particular rows in a DataFrame are with the `.loc` and `.iloc` indexers. Each of these indexers can also be used to simultaneously select columns, but it is easier to just focus on rows for now. Also, each of the indexers uses a set of brackets that immediately follow its name to make its selections.

## .loc selects data only by labels

We will first talk about the `.loc` indexer, which only selects data by the index or column labels. In our sample DataFrame, we have provided meaningful names as values for the index. Many DataFrames will not have any meaningful names and will instead default to just the integers from 0 to n-1, where n is the length (number of rows) of the DataFrame.

There are many different inputs you can use for `.loc`; three of them are:

• A string
• A list of strings
• Slice notation using strings as the start and stop values

Selecting a single row with .loc with a string

To select a single row of data, place the index label inside of the brackets following `.loc`.

``````df.loc['Penelope']
``````

This returns the row of data as a Series

``````age           4
color     white
food      Apple
height       80
score       3.3
state        AL
Name: Penelope, dtype: object
``````

Selecting multiple rows with .loc with a list of strings

``````df.loc[['Cornelia', 'Jane', 'Dean']]
``````

This returns a DataFrame with the rows in the order specified in the list.

Selecting multiple rows with .loc with slice notation

Slice notation is defined by start, stop and step values. When slicing by label, pandas includes the stop value in the return. The following slices from Aaron to Dean, inclusive. Its step size is not explicitly defined but defaults to 1.

``````df.loc['Aaron':'Dean']
``````

Complex slices can be taken in the same manner as Python lists.
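One difference worth underlining (a sketch using a trimmed copy of the sample frame): `.loc` label slices include the stop label, while `.iloc` slices are half-open like plain Python slices:

```python
import pandas as pd

sample = pd.DataFrame({'age': [30, 2, 12, 4, 32]},
                      index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean'])

print(len(sample.loc['Aaron':'Dean']))  # 3 -- the stop label 'Dean' is included
print(len(sample.iloc[2:4]))            # 2 -- position 4 is excluded
```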

## .iloc selects data only by integer location

Let’s now turn to `.iloc`. Every row and column of data in a DataFrame has an integer location that defines it. This is in addition to the label that is visually displayed in the output. The integer location is simply the number of rows/columns from the top/left beginning at 0.

There are many different inputs you can use for `.iloc`; three of them are:

• An integer
• A list of integers
• Slice notation using integers as the start and stop values

Selecting a single row with .iloc with an integer

``````df.iloc[4]
``````

This returns the 5th row (integer location 4) as a Series

``````age           32
color       gray
food      Cheese
height       180
score        1.8
state         AK
Name: Dean, dtype: object
``````

Selecting multiple rows with .iloc with a list of integers

``````df.iloc[[2, -2]]
``````

This returns a DataFrame of the third and second-to-last rows.

Selecting multiple rows with .iloc with slice notation

``````df.iloc[:5:3]
``````

## Simultaneous selection of rows and columns with .loc and .iloc

One excellent ability of both `.loc/.iloc` is their ability to select both rows and columns simultaneously. In the examples above, all the columns were returned from each selection. We can choose columns with the same types of inputs as we do for rows. We simply need to separate the row and column selection with a comma.

For example, we can select rows Jane, and Dean with just the columns height, score and state like this:

``````df.loc[['Jane', 'Dean'], 'height':]
``````

This uses a list of labels for the rows and slice notation for the columns.

We can naturally do similar operations with `.iloc` using only integers.

``````df.iloc[[1,4], 2]
Nick      Lamb
Dean    Cheese
Name: food, dtype: object
``````

### Simultaneous selection with labels and integer location

`.ix` was used to make selections simultaneously with labels and integer location which was useful but confusing and ambiguous at times and thankfully it has been deprecated. In the event that you need to make a selection with a mix of labels and integer locations, you will have to make both your selections labels or integer locations.

For instance, if we want to select rows `Nick` and `Cornelia` along with columns 2 and 4, we could use `.loc` by converting the integers to labels with the following:

``````col_names = df.columns[[2, 4]]
df.loc[['Nick', 'Cornelia'], col_names]
``````

Or alternatively, convert the index labels to integers with the `get_loc` index method.

``````labels = ['Nick', 'Cornelia']
index_ints = [df.index.get_loc(label) for label in labels]
df.iloc[index_ints, [2, 4]]
``````

### Boolean Selection

The .loc indexer can also do boolean selection. For instance, if we are interested in finding all the rows where age is above 30 and return just the `food` and `score` columns we can do the following:

``````df.loc[df['age'] > 30, ['food', 'score']]
``````

You can replicate this with `.iloc` but you cannot pass it a boolean series. You must convert the boolean Series into a numpy array like this:

``````df.iloc[(df['age'] > 30).values, [2, 4]]
``````

### Selecting all rows

It is possible to use `.loc/.iloc` for just column selection. You can select all the rows by using a colon like this:

``````df.loc[:, 'color':'score':2]
``````

### The indexing operator, `[]`, can slice and can select rows and columns too, but not simultaneously

Most people are familiar with the primary purpose of the DataFrame indexing operator, which is to select columns. A string selects a single column as a Series and a list of strings selects multiple columns as a DataFrame.

``````df['food']

Jane          Steak
Nick           Lamb
Aaron         Mango
Penelope      Apple
Dean         Cheese
Christina     Melon
Cornelia      Beans
Name: food, dtype: object
``````

Using a list selects multiple columns

``````df[['food', 'score']]
``````

What people are less familiar with is that, when slice notation is used, selection happens by row labels or by integer location. This is very confusing and something that I almost never use, but it does work.

``````df['Penelope':'Christina'] # slice rows by label
``````

``````df[2:6:2] # slice rows by integer location
``````

The explicitness of `.loc/.iloc` for selecting rows is highly preferred. The indexing operator alone is unable to select rows and columns simultaneously.

``````df[3:5, 'color']
TypeError: unhashable type: 'slice'
``````

## Selection by `.at` and `.iat`

Selection with `.at` is nearly identical to `.loc` but it only selects a single ‘cell’ in your DataFrame. We usually refer to this cell as a scalar value. To use `.at`, pass it both a row and column label separated by a comma.

``````df.at['Christina', 'color']
'black'
``````

Selection with `.iat` is nearly identical to `.iloc` but it only selects a single scalar value. You must pass it an integer for both the row and column locations.

``````df.iat[2, 5]
'FL'
``````

## Answer 3

``````df = pd.DataFrame({'A':['a', 'b', 'c'], 'B':[54, 67, 89]}, index=[100, 200, 300])

df

A   B
100     a   54
200     b   67
300     c   89
In :
df.loc[100]

Out:
A     a
B    54
Name: 100, dtype: object

In :
df.iloc[0]

Out:
A     a
B    54
Name: 100, dtype: object

In :
df2 = df.set_index([df.index,'A'])
df2

Out:
B
A
100 a   54
200 b   67
300 c   89

In :
df2.ix[100, 'a']

Out:
B    54
Name: (100, a), dtype: int64
``````
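Since `.ix` is gone in modern pandas, the MultiIndex lookup at the end of the snippet above can be written with `.loc` and a tuple key instead (a sketch):

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'c'], 'B': [54, 67, 89]},
                  index=[100, 200, 300])
df2 = df.set_index([df.index, 'A'])

# old: df2.ix[100, 'a']
print(df2.loc[(100, 'a'), 'B'])  # 54
```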

## Answer 4


``````import pandas as pd
import time as tm
import numpy as np
n=10
a=np.arange(0,n**2)
df=pd.DataFrame(a.reshape(n,n))
``````

We’ll so have

``````df
Out:
0   1   2   3   4   5   6   7   8   9
0   0   1   2   3   4   5   6   7   8   9
1  10  11  12  13  14  15  16  17  18  19
2  20  21  22  23  24  25  26  27  28  29
3  30  31  32  33  34  35  36  37  38  39
4  40  41  42  43  44  45  46  47  48  49
5  50  51  52  53  54  55  56  57  58  59
6  60  61  62  63  64  65  66  67  68  69
7  70  71  72  73  74  75  76  77  78  79
8  80  81  82  83  84  85  86  87  88  89
9  90  91  92  93  94  95  96  97  98  99
``````

With this we have:

``````df.iloc[3,3]
Out: 33

df.iat[3,3]
Out: 33

df.iloc[:3,:3]
Out:
0   1   2   3
0   0   1   2   3
1  10  11  12  13
2  20  21  22  23
3  30  31  32  33

df.iat[:3,:3]
Traceback (most recent call last):
... omissis ...
ValueError: At based indexing on an integer index can only have integer indexers
``````

Thus we cannot use `.iat` for subsets; there we must use `.iloc` only.

But let’s try both to select from a larger df and let’s check the speed …

``````# -*- coding: utf-8 -*-
"""
Created on Wed Feb  7 09:58:39 2018

@author: Fabio Pomi
"""

import pandas as pd
import time as tm
import numpy as np
n=1000
a=np.arange(0,n**2)
df=pd.DataFrame(a.reshape(n,n))
t1=tm.time()
for j in df.index:
    for i in df.columns:
        a=df.iloc[j,i]
t2=tm.time()
for j in df.index:
    for i in df.columns:
        a=df.iat[j,i]
t3=tm.time()
loc=t2-t1
at=t3-t2
prc = loc/at *100
print('\nloc:%f at:%f prc:%f' %(loc,at,prc))

loc:10.485600 at:7.395423 prc:141.784987
``````

So with `.iloc` we can manage subsets and with `.iat` only a single scalar, but `.iat` is faster than `.iloc`.

🙂

# How to save a Seaborn plot into a file

## Question: How to save a Seaborn plot into a file

I tried the following code (`test_seaborn.py`):

``````import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
matplotlib.style.use('ggplot')
import seaborn as sns
sns.set()
sns_plot = sns.pairplot(df, hue='species', size=2.5)
fig = sns_plot.get_figure()
fig.savefig("output.png")
#sns.plt.show()
``````

But I get this error:

``````  Traceback (most recent call last):
File "test_searborn.py", line 11, in <module>
fig = sns_plot.get_figure()
AttributeError: 'PairGrid' object has no attribute 'get_figure'
``````

I expect the final `output.png` will exist and look like this: How can I resolve the problem?

## Answer 0


Remove the `get_figure` and just use `sns_plot.savefig('output.png')`

``````df = sns.load_dataset('iris')
sns_plot = sns.pairplot(df, hue='species', size=2.5)
sns_plot.savefig("output.png")
``````

## Answer 1


The suggested solutions are incompatible with Seaborn 0.8.1, giving the following errors because the Seaborn interface has changed:

``````AttributeError: 'AxesSubplot' object has no attribute 'fig'
When trying to access the figure

AttributeError: 'AxesSubplot' object has no attribute 'savefig'
when trying to use the savefig directly as a function
``````

The following calls allow you to access the figure (Seaborn 0.8.1 compatible):

``````swarm_plot = sns.swarmplot(...)
fig = swarm_plot.get_figure()
fig.savefig(...)
``````

as seen previously in this answer.

UPDATE: I have recently used the PairGrid object from seaborn to generate a plot similar to the one in this example. In this case, since a PairGrid is not a plot object like, for example, `sns.swarmplot`, it has no `get_figure()` function. It is possible to directly access the matplotlib figure by

``````fig = myGridPlotObject.fig
``````

Like previously suggested in other posts in this thread.

## Answer 2


Some of the above solutions did not work for me. The `.fig` attribute was not found when I tried that and I was unable to use `.savefig()` directly. However, what did work was:

``````sns_plot.figure.savefig("output.png")
``````

I am a newer Python user, so I do not know if this is due to an update. I wanted to mention it in case anybody else runs into the same issues as I did.

## Answer 3

You should just be able to use the `savefig` method of `sns_plot` directly.

``````sns_plot.savefig("output.png")
``````

For clarity with your code if you did want to access the matplotlib figure that `sns_plot` resides in then you can get it directly with

``````fig = sns_plot.fig
``````

In this case there is no `get_figure` method as your code assumes.

## Answer 4


I use `distplot` and `get_figure` to save picture successfully.

``````sns_hist = sns.distplot(df_train['SalePrice'])
fig = sns_hist.get_figure()
fig.savefig('hist.png')
``````

## Answer 5

Fewer lines for 2019 searchers:

```
import matplotlib.pyplot as plt
import seaborn as sns

sns_plot = sns.pairplot(df, hue='species', height=2.5)
plt.savefig('output.png')
```

UPDATE NOTE: `size` was changed to `height`.
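This works because pyplot keeps track of a "current figure": `plt.savefig` writes whatever figure is current, so you never need the object the seaborn call returns. A small sketch of that mechanism (the output file name is made up):

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt

plt.figure()                    # this becomes the current figure
plt.plot([1, 2, 3], [2, 4, 6])  # draws onto the current figure

# plt.savefig targets plt.gcf(), the current figure
out = os.path.join(tempfile.gettempdir(), "current_fig.png")
plt.savefig(out)
print(os.path.exists(out))  # True
```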

## Answer 6

This works for me:

```
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

sns.factorplot(x='holiday', data=data, kind='count', size=5, aspect=1)
plt.savefig('holiday-vs-count.png')
```

## Answer 7

It's also possible to just create a matplotlib `figure` object and then use `plt.savefig(...)`:

```
from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd

plt.figure()  # push a new figure on the stack
sns_plot = sns.pairplot(df, hue='species', size=2.5)
plt.savefig('output.png')  # save that figure
```

## Answer 8

You would get an error for using `sns.figure.savefig("output.png")` in seaborn 0.8.1. Instead, use:

```
import seaborn as sns

sns_plot = sns.pairplot(df, hue='species', size=2.5)
sns_plot.savefig("output.png")
```

## Answer 9

Just FYI, the command below worked in seaborn 0.8.1, so I guess the initial answer is still valid.

```
sns_plot = sns.pairplot(data, hue='species', size=3)
sns_plot.savefig("output.png")
```


I am using Python 3.4 with IPython and have the following code. I'm unable to read a csv file from the given URL:

```
import pandas as pd
import requests

url="https://github.com/cs109/2014_data/blob/master/countries.csv"
s=requests.get(url).content
c=pd.read_csv(s)
```

I have the following error

“Expected file path name or file-like object, got type”

How can I fix this?

## Update

From pandas `0.19.2` you can now just pass the url directly.

Just as the error suggests, `pandas.read_csv` needs a file-like object as the first argument.

If you want to read the csv from a string, you can use `io.StringIO` (Python 3.x) or `StringIO.StringIO` (Python 2.x).

Also, for the URL – https://github.com/cs109/2014_data/blob/master/countries.csv – you are getting back an `html` response, not the raw csv; you should use the URL given by the `Raw` link on the GitHub page to get the raw csv response, which is https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv
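Since the blob-to-raw rewrite is purely mechanical, it can be automated with a small helper (`github_raw_url` is a hypothetical convenience function, not part of pandas or requests):

```python
def github_raw_url(blob_url: str) -> str:
    """Rewrite a github.com 'blob' page URL into its raw.githubusercontent.com form."""
    return (blob_url
            .replace("https://github.com/", "https://raw.githubusercontent.com/", 1)
            .replace("/blob/", "/", 1))

url = "https://github.com/cs109/2014_data/blob/master/countries.csv"
print(github_raw_url(url))
# https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv
```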

Example –

```
import pandas as pd
import io
import requests

url="https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv"
s=requests.get(url).content
c=pd.read_csv(io.StringIO(s.decode('utf-8')))
```
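The bytes-versus-text distinction is the crux here: `.content` returns `bytes` (decode first, or wrap in `io.BytesIO`), while `.text` returns `str` (wrap in `io.StringIO`). An offline sketch with a made-up CSV string standing in for the HTTP response, assuming pandas is installed:

```python
import io

import pandas as pd

raw_bytes = b"Country,Region\nAlgeria,AFRICA\nAngola,AFRICA\n"  # stand-in for response.content
raw_text = raw_bytes.decode("utf-8")                            # stand-in for response.text

# Either wrapper turns the data into the file-like object read_csv expects
df_from_bytes = pd.read_csv(io.BytesIO(raw_bytes))
df_from_text = pd.read_csv(io.StringIO(raw_text))

print(df_from_bytes.equals(df_from_text))  # True
print(df_from_bytes["Country"].tolist())   # ['Algeria', 'Angola']
```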

## Answer 1

In the latest version of pandas (`0.19.2`) you can directly pass the url:

```
import pandas as pd

url="https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv"
c=pd.read_csv(url)
```

## Answer 2


As I commented, if using requests you need to use a StringIO object and decode, i.e. `c = pd.read_csv(io.StringIO(s.decode("utf-8")))`; you need to decode because `.content` returns bytes. If you used `.text`, you would just need to pass `s` as is: `s = requests.get(url).text` and then `c = pd.read_csv(StringIO(s))`.

A simpler approach is to pass the correct url of the raw data directly to `read_csv`. You don't have to pass a file-like object; you can pass a url, so you don't need requests at all:

```
c = pd.read_csv("https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv")

print(c)
```

Output:

```
                              Country         Region
0                             Algeria         AFRICA
1                              Angola         AFRICA
2                               Benin         AFRICA
3                            Botswana         AFRICA
4                             Burkina         AFRICA
5                             Burundi         AFRICA
6                            Cameroon         AFRICA
..................................
```

From the docs:

filepath_or_buffer : string or file handle / StringIO. The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. For instance, a local file could be file://localhost/path/to/table.csv

## Answer 3


The problem you're having is that the output you get into the variable 's' is not a csv, but an html file. In order to get the raw csv, you have to modify the url to the `Raw` one: https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv

Your second problem is that `read_csv` expects a file name; we can solve this by using `StringIO` from the `io` module. The third problem is that `requests.get(url).content` delivers a byte stream; we can solve this by using `requests.get(url).text` instead.

End result is this code:

```
from io import StringIO

import pandas as pd
import requests

url='https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv'
s=requests.get(url).text
c=pd.read_csv(StringIO(s))
```

Output:

```
>>> c.head()
    Country  Region
0   Algeria  AFRICA
1    Angola  AFRICA
2     Benin  AFRICA
3  Botswana  AFRICA
4   Burkina  AFRICA
```

## Answer 4

```
url = "https://github.com/cs109/2014_data/blob/master/countries.csv"
c = pd.read_csv(url, sep="\t")
```

To import data through a URL in pandas, just apply the simple code below; it actually works better:

```
import pandas as pd

# If you have an issue with the raw data, just add 'r' before the url
```