Tag Archives: pandas

pandas loc vs. iloc vs. ix vs. at vs. iat?

Question: pandas loc vs. iloc vs. ix vs. at vs. iat?


Recently began branching out from my safe place (R) into Python, and I am a bit confused by the cell localization/selection in Pandas. I’ve read the documentation but I’m struggling to understand the practical implications of the various localization/selection options.

  • Is there a reason why I should ever use .loc or .iloc over the most general option .ix?
  • I understand that .loc, iloc, at, and iat may provide some guaranteed correctness that .ix can’t offer, but I’ve also read where .ix tends to be the fastest solution across the board.
  • Please explain the real-world, best-practices reasoning behind utilizing anything other than .ix?

Answer 0


loc: only works on the index
iloc: works on positions
ix: you can get data from the dataframe without it being in the index
at: get scalar values; it’s a very fast loc
iat: get scalar values; it’s a very fast iloc

http://pyciencia.blogspot.com/2015/05/obtener-y-filtrar-datos-de-un-dataframe.html

Note: As of pandas 0.20.0, the .ix indexer is deprecated in favour of the more strict .iloc and .loc indexers.


Answer 1


Updated for pandas 0.20 given that ix is deprecated. This demonstrates not only how to use loc, iloc, at, iat and set_value, but also how to accomplish mixed positional/label-based indexing.


loc: label based
Allows you to pass 1-D arrays as indexers. Arrays can be either slices (subsets) of the index or column, or they can be boolean arrays which are equal in length to the index or columns.

Special Note: when a scalar indexer is passed, loc can assign a new index or column value that didn’t exist before.

# label based, but we can use position values
# to get the labels from the index object
df.loc[df.index[2], 'ColName'] = 3

df.loc[df.index[1:3], 'ColName'] = 3
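
The boolean-array case mentioned above is not shown in the original snippets; a small hedged sketch of what it might look like (assuming 'ColName' holds numeric data):

# label based, with a boolean mask equal in length to the index
mask = df['ColName'] > 0
df.loc[mask, 'ColName'] = 3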

iloc: position based
Similar to loc except with positions rather than index values. However, you cannot assign new columns or indices.

# position based, but we can get the position
# from the columns object via the `get_loc` method
df.iloc[2, df.columns.get_loc('ColName')] = 3

df.iloc[2, 4] = 3

df.iloc[:3, 2:4] = 3
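
To illustrate the "cannot assign new columns or indices" restriction, a small sketch (not from the original answer); pandas raises an IndexError along the lines of "iloc cannot enlarge its target object":

# position based assignment beyond the existing shape fails
df.iloc[len(df), 0] = 3  # IndexError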

at: label based
Works very similarly to loc for scalar indexers. Cannot operate on array indexers. Can! assign new indices and columns.

Advantage over loc is that this is faster.
Disadvantage is that you can’t use arrays for indexers.

# label based, but we can use position values
# to get the labels from the index object
df.at[df.index[2], 'ColName'] = 3

df.at['C', 'ColName'] = 3

iat: position based
Works similarly to iloc. Cannot work with array indexers. Cannot! assign new indices and columns.

Advantage over iloc is that this is faster.
Disadvantage is that you can’t use arrays for indexers.

# position based, but we can get the position
# from the columns object via the `get_loc` method
IBM.iat[2, IBM.columns.get_loc('PNL')] = 3

set_value: label based
Works very similarly to loc for scalar indexers. Cannot operate on array indexers. Can! assign new indices and columns.

Advantage: super fast, because there is very little overhead!
Disadvantage: there is very little overhead because pandas is not doing a bunch of safety checks. Use at your own risk. Also, this is not intended for public use.

# label based, but we can use position values
# to get the labels from the index object
df.set_value(df.index[2], 'ColName', 3)

set_value with takable=True: position based
Works similarly to iloc. Cannot work with array indexers. Cannot! assign new indices and columns.

Advantage: super fast, because there is very little overhead!
Disadvantage: there is very little overhead because pandas is not doing a bunch of safety checks. Use at your own risk. Also, this is not intended for public use.

# position based, but we can get the position
# from the columns object via the `get_loc` method
df.set_value(2, df.columns.get_loc('ColName'), 3, takable=True)
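
A present-day caveat worth noting: set_value was deprecated in pandas 0.21 and removed in pandas 1.0, so on modern pandas the fast scalar accessors .at and .iat are the replacements:

# modern equivalents of the removed set_value
df.at[df.index[2], 'ColName'] = 3             # label based
df.iat[2, df.columns.get_loc('ColName')] = 3  # position based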

Answer 2


There are two primary ways that pandas makes selections from a DataFrame.

  • By Label
  • By Integer Location

The documentation uses the term position for referring to integer location. I do not like this terminology as I feel it is confusing. Integer location is more descriptive and is exactly what .iloc stands for. The key word here is INTEGER – you must use integers when selecting by integer location.

Before showing the summary let’s all make sure that …

.ix is deprecated and ambiguous and should never be used

There are three primary indexers for pandas. We have the indexing operator itself (the brackets []), .loc, and .iloc. Let’s summarize them:

  • [] – Primarily selects subsets of columns, but can select rows as well. Cannot simultaneously select rows and columns.
  • .loc – selects subsets of rows and columns by label only
  • .iloc – selects subsets of rows and columns by integer location only

I almost never use .at or .iat as they add no additional functionality, only a small performance increase. I would discourage their use unless you have a very time-sensitive application. Regardless, we have their summary:

  • .at selects a single scalar value in the DataFrame by label only
  • .iat selects a single scalar value in the DataFrame by integer location only

In addition to selection by label and integer location, boolean selection also known as boolean indexing exists.


Examples explaining .loc, .iloc, boolean selection and .at and .iat are shown below

We will first focus on the differences between .loc and .iloc. Before we talk about the differences, it is important to understand that DataFrames have labels that help identify each column and each row. Let’s take a look at a sample DataFrame:

df = pd.DataFrame({'age':[30, 2, 12, 4, 32, 33, 69],
                   'color':['blue', 'green', 'red', 'white', 'gray', 'black', 'red'],
                   'food':['Steak', 'Lamb', 'Mango', 'Apple', 'Cheese', 'Melon', 'Beans'],
                   'height':[165, 70, 120, 80, 180, 172, 150],
                   'score':[4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2],
                   'state':['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']
                   },
                  index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean', 'Christina', 'Cornelia'])

All the words in bold are the labels. The labels, age, color, food, height, score and state are used for the columns. The other labels, Jane, Nick, Aaron, Penelope, Dean, Christina, Cornelia are used as labels for the rows. Collectively, these row labels are known as the index.


The primary ways to select particular rows in a DataFrame are with the .loc and .iloc indexers. Each of these indexers can also be used to simultaneously select columns but it is easier to just focus on rows for now. Also, each of the indexers use a set of brackets that immediately follow their name to make their selections.

.loc selects data only by labels

We will first talk about the .loc indexer, which only selects data by the index or column labels. In our sample DataFrame, we have provided meaningful names as values for the index. Many DataFrames will not have any meaningful names and will instead default to the integers from 0 to n-1, where n is the length (number of rows) of the DataFrame.
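
As a small illustration (not part of the original answer): with such a default index the row labels happen to be integers, so .loc takes those integers as labels:

df_default = pd.DataFrame({'a': [10, 20, 30]})  # index defaults to 0, 1, 2
df_default.loc[0]  # label 0, which coincides with position 0 here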

There are many different inputs you can use for .loc; three of them are:

  • A string
  • A list of strings
  • Slice notation using strings as the start and stop values

Selecting a single row with .loc with a string

To select a single row of data, place the index label inside of the brackets following .loc.

df.loc['Penelope']

This returns the row of data as a Series

age           4
color     white
food      Apple
height       80
score       3.3
state        AL
Name: Penelope, dtype: object

Selecting multiple rows with .loc with a list of strings

df.loc[['Cornelia', 'Jane', 'Dean']]

This returns a DataFrame with the rows in the order specified in the list:

Selecting multiple rows with .loc with slice notation

Slice notation is defined by start, stop and step values. When slicing by label, pandas includes the stop value in the return. The following slices from Aaron to Dean, inclusive. Its step size is not explicitly defined, so it defaults to 1.

df.loc['Aaron':'Dean']

Complex slices can be taken in the same manner as Python lists.
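
For example, a sketch using the sample df above: a step of 2 selects every other row between two labels, still including the stop label:

df.loc['Jane':'Cornelia':2]  # Jane, Aaron, Dean, Cornelia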

.iloc selects data only by integer location

Let’s now turn to .iloc. Every row and column of data in a DataFrame has an integer location that defines it. This is in addition to the label that is visually displayed in the output. The integer location is simply the number of rows/columns from the top/left beginning at 0.

There are many different inputs you can use for .iloc; three of them are:

  • An integer
  • A list of integers
  • Slice notation using integers as the start and stop values

Selecting a single row with .iloc with an integer

df.iloc[4]

This returns the 5th row (integer location 4) as a Series

age           32
color       gray
food      Cheese
height       180
score        1.8
state         AK
Name: Dean, dtype: object

Selecting multiple rows with .iloc with a list of integers

df.iloc[[2, -2]]

This returns a DataFrame of the third and second-to-last rows:

Selecting multiple rows with .iloc with slice notation

df.iloc[:5:3]


Simultaneous selection of rows and columns with .loc and .iloc

One excellent feature of both .loc and .iloc is their ability to select rows and columns simultaneously. In the examples above, all the columns were returned from each selection. We can choose columns with the same types of inputs as we do for rows. We simply need to separate the row and column selection with a comma.

For example, we can select rows Jane and Dean with just the columns height, score and state like this:

df.loc[['Jane', 'Dean'], 'height':]

This uses a list of labels for the rows and slice notation for the columns

We can naturally do similar operations with .iloc using only integers.

df.iloc[[1,4], 2]
Nick      Lamb
Dean    Cheese
Name: food, dtype: object

Simultaneous selection with labels and integer location

.ix was used to make selections simultaneously with labels and integer location. This was useful, but at times confusing and ambiguous, and thankfully it has been deprecated. In the event that you need to make a selection with a mix of labels and integer locations, you will have to make both of your selections labels or integer locations.

For instance, if we want to select rows Nick and Cornelia along with columns 2 and 4, we could use .loc by converting the integers to labels with the following:

col_names = df.columns[[2, 4]]
df.loc[['Nick', 'Cornelia'], col_names] 

Or alternatively, convert the index labels to integers with the get_loc index method.

labels = ['Nick', 'Cornelia']
index_ints = [df.index.get_loc(label) for label in labels]
df.iloc[index_ints, [2, 4]]

Boolean Selection

The .loc indexer can also do boolean selection. For instance, if we are interested in finding all the rows where age is above 30 and return just the food and score columns we can do the following:

df.loc[df['age'] > 30, ['food', 'score']] 

You can replicate this with .iloc but you cannot pass it a boolean series. You must convert the boolean Series into a numpy array like this:

df.iloc[(df['age'] > 30).values, [2, 4]] 

Selecting all rows

It is possible to use .loc/.iloc for just column selection. You can select all the rows by using a colon like this:

df.loc[:, 'color':'score':2]


The indexing operator, [], can slice to select rows and columns too, but not simultaneously.

Most people are familiar with the primary purpose of the DataFrame indexing operator, which is to select columns. A string selects a single column as a Series and a list of strings selects multiple columns as a DataFrame.

df['food']

Jane          Steak
Nick           Lamb
Aaron         Mango
Penelope      Apple
Dean         Cheese
Christina     Melon
Cornelia      Beans
Name: food, dtype: object

Using a list selects multiple columns

df[['food', 'score']]

What people are less familiar with is that, when slice notation is used, selection happens by row labels or by integer location. This is very confusing and something that I almost never use, but it does work.

df['Penelope':'Christina'] # slice rows by label

df[2:6:2] # slice rows by integer location

The explicitness of .loc/.iloc for selecting rows is highly preferred. The indexing operator alone is unable to select rows and columns simultaneously.

df[3:5, 'color']
TypeError: unhashable type: 'slice'

Selection by .at and .iat

Selection with .at is nearly identical to .loc but it only selects a single ‘cell’ in your DataFrame. We usually refer to this cell as a scalar value. To use .at, pass it both a row and column label separated by a comma.

df.at['Christina', 'color']
'black'

Selection with .iat is nearly identical to .iloc but it only selects a single scalar value. You must pass it an integer for both the row and column locations.

df.iat[2, 5]
'FL'

Answer 3

df = pd.DataFrame({'A':['a', 'b', 'c'], 'B':[54, 67, 89]}, index=[100, 200, 300])

df

                        A   B
                100     a   54
                200     b   67
                300     c   89
In [19]:    
df.loc[100]

Out[19]:
A     a
B    54
Name: 100, dtype: object

In [20]:    
df.iloc[0]

Out[20]:
A     a
B    54
Name: 100, dtype: object

In [24]:    
df2 = df.set_index([df.index,'A'])
df2

Out[24]:
        B
    A   
100 a   54
200 b   67
300 c   89

In [25]:    
df2.ix[100, 'a']

Out[25]:    
B    54
Name: (100, a), dtype: int64

Answer 4


Let’s start with this small df:

import pandas as pd
import time as tm
import numpy as np
n=10
a=np.arange(0,n**2)
df=pd.DataFrame(a.reshape(n,n))

We’ll then have:

df
Out[25]: 
        0   1   2   3   4   5   6   7   8   9
    0   0   1   2   3   4   5   6   7   8   9
    1  10  11  12  13  14  15  16  17  18  19
    2  20  21  22  23  24  25  26  27  28  29
    3  30  31  32  33  34  35  36  37  38  39
    4  40  41  42  43  44  45  46  47  48  49
    5  50  51  52  53  54  55  56  57  58  59
    6  60  61  62  63  64  65  66  67  68  69
    7  70  71  72  73  74  75  76  77  78  79
    8  80  81  82  83  84  85  86  87  88  89
    9  90  91  92  93  94  95  96  97  98  99

With this we have:

df.iloc[3,3]
Out[33]: 33

df.iat[3,3]
Out[34]: 33

df.iloc[:3,:3]
Out[35]: 
    0   1   2   3
0   0   1   2   3
1  10  11  12  13
2  20  21  22  23
3  30  31  32  33



df.iat[:3,:3]
Traceback (most recent call last):
   ... omissis ...
ValueError: At based indexing on an integer index can only have integer indexers

Thus we cannot use .iat for subsets; there we must use .iloc only.

But let’s try both to select from a larger df, and let’s check the speed…

# -*- coding: utf-8 -*-
"""
Created on Wed Feb  7 09:58:39 2018

@author: Fabio Pomi
"""

import pandas as pd
import time as tm
import numpy as np
n=1000
a=np.arange(0,n**2)
df=pd.DataFrame(a.reshape(n,n))
t1=tm.time()
# time scalar lookups with .iloc
for j in df.index:
    for i in df.columns:
        a=df.iloc[j,i]
t2=tm.time()
# time scalar lookups with .iat
for j in df.index:
    for i in df.columns:
        a=df.iat[j,i]
t3=tm.time()
loc=t2-t1
at=t3-t2
prc = loc/at *100
print('\nloc:%f at:%f prc:%f' %(loc,at,prc))

loc:10.485600 at:7.395423 prc:141.784987

So with .iloc we can manage subsets, while .iat handles only a single scalar, but .iat is faster than .iloc

:-)


How to save a Seaborn plot into a file

Question: How to save a Seaborn plot into a file


I tried the following code (test_seaborn.py):

import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
matplotlib.style.use('ggplot')
import seaborn as sns
sns.set()
df = sns.load_dataset('iris')
sns_plot = sns.pairplot(df, hue='species', size=2.5)
fig = sns_plot.get_figure()
fig.savefig("output.png")
#sns.plt.show()

But I get this error:

  Traceback (most recent call last):
  File "test_searborn.py", line 11, in <module>
    fig = sns_plot.get_figure()
AttributeError: 'PairGrid' object has no attribute 'get_figure'

I expect the final output.png will exist and look like this:

How can I resolve the problem?


Answer 0


Remove the get_figure and just use sns_plot.savefig('output.png')

df = sns.load_dataset('iris')
sns_plot = sns.pairplot(df, hue='species', size=2.5)
sns_plot.savefig("output.png")

Answer 1


The suggested solutions are incompatible with Seaborn 0.8.1

giving the following errors because the Seaborn interface has changed:

AttributeError: 'AxesSubplot' object has no attribute 'fig'
When trying to access the figure

AttributeError: 'AxesSubplot' object has no attribute 'savefig'
when trying to use the savefig directly as a function

The following calls allow you to access the figure (Seaborn 0.8.1 compatible):

swarm_plot = sns.swarmplot(...)
fig = swarm_plot.get_figure()
fig.savefig(...) 

as seen previously in this answer.

UPDATE: I have recently used the PairGrid object from seaborn to generate a plot similar to the one in this example. In this case, since PairGrid is not a plot object like, for example, sns.swarmplot, it has no get_figure() function. It is possible to directly access the matplotlib figure by

fig = myGridPlotObject.fig

as previously suggested in other posts in this thread.
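
Putting that together for the original pairplot case, a minimal sketch (assuming a seaborn version whose PairGrid exposes the .fig attribute):

import seaborn as sns

df = sns.load_dataset('iris')
grid = sns.pairplot(df, hue='species')
grid.fig.savefig('output.png')  # access the underlying matplotlib figure directly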


Answer 2


Some of the above solutions did not work for me. The .fig attribute was not found when I tried that and I was unable to use .savefig() directly. However, what did work was:

sns_plot.figure.savefig("output.png")

I am a newer Python user, so I do not know if this is due to an update. I wanted to mention it in case anybody else runs into the same issues as I did.


Answer 3


You should just be able to use the savefig method of sns_plot directly.

sns_plot.savefig("output.png")

For clarity with your code, if you do want to access the matplotlib figure that sns_plot resides in, then you can get it directly with

fig = sns_plot.fig

In this case there is no get_figure method as your code assumes.


Answer 4


I used distplot and get_figure to save the picture successfully.

sns_hist = sns.distplot(df_train['SalePrice'])
fig = sns_hist.get_figure()
fig.savefig('hist.png')

Answer 5


Fewer lines for 2019 searchers:

import matplotlib.pyplot as plt
import seaborn as sns

df = sns.load_dataset('iris')
sns_plot = sns.pairplot(df, hue='species', height=2.5)
plt.savefig('output.png')

UPDATE NOTE: size was changed to height.


Answer 6


This works for me

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

sns.factorplot(x='holiday',data=data,kind='count',size=5,aspect=1)
plt.savefig('holiday-vs-count.png')

Answer 7


It’s also possible to just create a matplotlib figure object and then use plt.savefig(...):

from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd

df = sns.load_dataset('iris')
plt.figure() # Push new figure on stack
sns_plot = sns.pairplot(df, hue='species', size=2.5)
plt.savefig('output.png') # Save that figure

Answer 8


You would get an error for using sns.figure.savefig("output.png") in seaborn 0.8.1.

Instead use:

import seaborn as sns

df = sns.load_dataset('iris')
sns_plot = sns.pairplot(df, hue='species', size=2.5)
sns_plot.savefig("output.png")

Answer 9


Just FYI, the below command worked in seaborn 0.8.1 so I guess the initial answer is still valid.

sns_plot = sns.pairplot(data, hue='species', size=3)
sns_plot.savefig("output.png")

Pandas read_csv from URL

Question: Pandas read_csv from URL


I am using Python 3.4 with IPython and have the following code. I’m unable to read a csv-file from the given URL:

import pandas as pd
import requests

url="https://github.com/cs109/2014_data/blob/master/countries.csv"
s=requests.get(url).content
c=pd.read_csv(s)

I have the following error

“Expected file path name or file-like object, got type”

How can I fix this?


Answer 0


Update

From pandas 0.19.2 you can now just pass the url directly.


Just as the error suggests, pandas.read_csv needs a file-like object as the first argument.

If you want to read the csv from a string, you can use io.StringIO (Python 3.x) or StringIO.StringIO (Python 2.x).

Also, for the URL – https://github.com/cs109/2014_data/blob/master/countries.csv – you are getting back an HTML response, not the raw csv; you should use the url given by the Raw link on the github page to get the raw csv response, which is – https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv

Example –

import pandas as pd
import io
import requests
url="https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv"
s=requests.get(url).content
c=pd.read_csv(io.StringIO(s.decode('utf-8')))

Answer 1


In the latest version of pandas (0.19.2) you can directly pass the url

import pandas as pd

url="https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv"
c=pd.read_csv(url)

Answer 2


As I commented, if using requests you need to use a StringIO object and decode, i.e. c = pd.read_csv(io.StringIO(s.decode("utf-8"))); you need to decode because .content returns bytes. If you used .text you would just need to pass s as is: s = requests.get(url).text; c = pd.read_csv(StringIO(s)).
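
Spelled out as a runnable sketch of both variants (assuming requests and pandas are installed):

import io
import pandas as pd
import requests

url = "https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv"

# variant 1: .content returns bytes, so decode before wrapping in StringIO
s_bytes = requests.get(url).content
c1 = pd.read_csv(io.StringIO(s_bytes.decode('utf-8')))

# variant 2: .text is already a str, so no decoding is needed
s_text = requests.get(url).text
c2 = pd.read_csv(io.StringIO(s_text))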

A simpler approach is to pass the correct url of the raw data directly to read_csv. You don’t have to pass a file-like object; you can pass a url, so you don’t need requests at all:

c = pd.read_csv("https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv")

print(c)

Output:

                              Country         Region
0                             Algeria         AFRICA
1                              Angola         AFRICA
2                               Benin         AFRICA
3                            Botswana         AFRICA
4                             Burkina         AFRICA
5                             Burundi         AFRICA
6                            Cameroon         AFRICA
..................................

From the docs:

filepath_or_buffer:

string or file handle / StringIO. The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. For instance, a local file could be file://localhost/path/to/table.csv
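
For illustration, reading a local file through a file URL would look like this (the path below is the placeholder from the quoted docs, not a real file):

c = pd.read_csv("file://localhost/path/to/table.csv")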


Answer 3


The problem you’re having is that the output you get into the variable ‘s’ is not a csv, but an HTML file. In order to get the raw csv, you have to modify the url to:

https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv

Your second problem is that read_csv expects a file name; we can solve this by using StringIO from the io module. The third problem is that requests.get(url).content delivers a byte stream; we can solve this by using requests.get(url).text instead.

End result is this code:

from io import StringIO

import pandas as pd
import requests
url='https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv'
s=requests.get(url).text

c=pd.read_csv(StringIO(s))

output:

>>> c.head()
    Country  Region
0   Algeria  AFRICA
1    Angola  AFRICA
2     Benin  AFRICA
3  Botswana  AFRICA
4   Burkina  AFRICA

Answer 4

url = "https://github.com/cs109/2014_data/blob/master/countries.csv"
c = pd.read_csv(url, sep = "\t")
url = "https://github.com/cs109/2014_data/blob/master/countries.csv"
c = pd.read_csv(url, sep = "\t")

Answer 5


To import data through a URL in pandas, just apply the simple code below; it actually works better.

import pandas as pd
train = pd.read_table("https://urlandfile.com/dataset.csv")
train.head()

If you are having issues with raw data then just put ‘r’ before the URL:

import pandas as pd
train = pd.read_table(r"https://urlandfile.com/dataset.csv")
train.head()

Pandas: Setting no. of max rows

Question: Pandas: Setting no. of max rows


I have a problem viewing the following DataFrame:

n = 100
foo = DataFrame(index=range(n))
foo['floats'] = np.random.randn(n)
foo

The problem is that it does not print all rows per default in ipython notebook, but I have to slice to view the resulting rows. Even the following option does not change the output:

pd.set_option('display.max_rows', 500)

Does anyone know how to display the whole array?


Answer 0


Set display.max_rows:

pd.set_option('display.max_rows', 500)

For older versions of pandas (<=0.11.0) you need to change both display.height and display.max_rows.

pd.set_option('display.height', 500)
pd.set_option('display.max_rows', 500)

See also pd.describe_option('display').

You can also set an option only temporarily, for this one time, like this:

from IPython.display import display
with pd.option_context('display.max_rows', 100, 'display.max_columns', 10):
    display(df) #need display to show the dataframe when using with in jupyter
    #some pandas stuff

You can also reset an option back to its default value like this:

pd.reset_option('display.max_rows')

And reset all of them back:

pd.reset_option('all')


Answer 1


Personally, I like setting the options directly with an assignment statement as it is easy to find via tab completion thanks to iPython. I find it hard to remember what the exact option names are, so this method works for me.

For instance, all I have to remember is that it begins with pd.options

pd.options.<TAB>

Most of the options are available under display

pd.options.display.<TAB>

From here, I usually output what the current value is like this:

pd.options.display.max_rows
60

I then set it to what I want it to be:

pd.options.display.max_rows = 100

Also, you should be aware of the context manager for options, which temporarily sets the options inside of a block of code. Pass in the option name as a string followed by the value you want it to be. You may pass in any number of options in the same line:

with pd.option_context('display.max_rows', 100, 'display.max_columns', 10):
    some pandas stuff

You can also reset an option back to its default value like this:

pd.reset_option('display.max_rows')

And reset all of them back:

pd.reset_option('all')

It is still perfectly good to set options via pd.set_option. I just find using the attributes directly is easier and there is less need for get_option and set_option.


Answer 2


It was already pointed out in this comment and in this answer, but I’ll try to give a more direct answer to the question:

from IPython.display import display
import numpy as np
import pandas as pd

n = 100
foo = pd.DataFrame(index=range(n))
foo['floats'] = np.random.randn(n)

with pd.option_context("display.max_rows", foo.shape[0]):
    display(foo)

pandas.option_context is available since pandas 0.13.1 (pandas 0.13.1 release notes). According to this,

[it] allow[s] you to execute a codeblock with a set of options that revert to prior settings when you exit the with block.


Answer 3


As @hanleyhansen noted in a comment, as of version 0.18.1, the display.height option is deprecated, and says “use display.max_rows instead”. So you just have to configure it like this:

pd.set_option('display.max_rows', 500)

See the Release Notes — pandas 0.18.1 documentation:

Deprecated display.height; display.width is now only a formatting option and does not control triggering of the summary, similar to < 0.11.0.


Answer 4

pd.set_option('display.max_rows', 500)
df


Does not work in Jupyter!
Instead use:

pd.set_option('display.max_rows', 500)
df.head(500)

Answer 5


As in this answer to a similar question, there is no need to hack settings. It is much simpler to write:

print(foo.to_string())

Find column whose name contains a specific string

Question: Find column whose name contains a specific string


I have a dataframe with column names, and I want to find the one that contains a certain string, but does not exactly match it. I’m searching for 'spike' in column names like 'spike-2', 'hey spike', 'spiked-in' (the 'spike' part is always continuous).

I want the column name to be returned as a string or a variable, so I access the column later with df['name'] or df[name] as normal. I’ve tried to find ways to do this, to no avail. Any tips?


Answer 0


Just iterate over DataFrame.columns; here is an example in which you will end up with a list of matching column names:

import pandas as pd

data = {'spike-2': [1,2,3], 'hey spke': [4,5,6], 'spiked-in': [7,8,9], 'no': [10,11,12]}
df = pd.DataFrame(data)

spike_cols = [col for col in df.columns if 'spike' in col]
print(list(df.columns))
print(spike_cols)

Output:

['hey spke', 'no', 'spike-2', 'spiked-in']
['spike-2', 'spiked-in']

Explanation:

  1. df.columns returns a list of column names
  2. [col for col in df.columns if 'spike' in col] iterates over the list df.columns with the variable col and adds it to the resulting list if col contains 'spike'. This syntax is list comprehension.

If you only want the resulting data set with the columns that match you can do this:

df2 = df.filter(regex='spike')
print(df2)

Output:

   spike-2  spiked-in
0        1          7
1        2          8
2        3          9

Answer 1


This answer uses the DataFrame.filter method to do this without list comprehension:

import pandas as pd

data = {'spike-2': [1,2,3], 'hey spke': [4,5,6]}
df = pd.DataFrame(data)

print(df.filter(like='spike').columns)

Will output just ‘spike-2’. You can also use regex, as some people suggested in comments above:

print(df.filter(regex='spike|spke').columns)

Will output both columns: [‘spike-2’, ‘hey spke’]


Answer 2


You can also use df.columns[df.columns.str.contains(pat = 'spike')]

data = {'spike-2': [1,2,3], 'hey spke': [4,5,6], 'spiked-in': [7,8,9], 'no': [10,11,12]}
df = pd.DataFrame(data)

colNames = df.columns[df.columns.str.contains(pat = 'spike')] 

print(colNames)

This will output the column names: 'spike-2', 'spiked-in'

More about pandas.Series.str.contains.


Answer 3


# select columns containing 'spike'
df.filter(like='spike', axis=1)

You can also select by name or regular expression. Refer to pandas.DataFrame.filter.


Answer 4

df.loc[:,df.columns.str.contains("spike")]

Answer 5


You also can use this code:

spike_cols =[x for x in df.columns[df.columns.str.contains('spike')]]

Answer 6


Getting name and subsetting based on Start, Contains, and Ends:

# from: https://stackoverflow.com/questions/21285380/find-column-whose-name-contains-a-specific-string
# from: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html
# from: https://cmdlinetips.com/2019/04/how-to-select-columns-using-prefix-suffix-of-column-names-in-pandas/
# from: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.filter.html

import pandas as pd

data = {'spike_starts': [1,2,3], 'ends_spike_starts': [4,5,6], 'ends_spike': [7,8,9], 'not': [10,11,12]}
df = pd.DataFrame(data)

print("\n")
print("----------------------------------------")
colNames_contains = df.columns[df.columns.str.contains(pat = 'spike')].tolist()
print("Contains")
print(colNames_contains)

print("\n")
print("----------------------------------------")
colNames_starts = df.columns[df.columns.str.contains(pat = '^spike')].tolist()
print("Starts")
print(colNames_starts)

print("\n")
print("----------------------------------------")
colNames_ends = df.columns[df.columns.str.contains(pat = 'spike$')].tolist()
print("Ends")
print(colNames_ends)

print("\n")
print("----------------------------------------")
df_subset_start = df.filter(regex='^spike',axis=1)
print("Starts")
print(df_subset_start)

print("\n")
print("----------------------------------------")
df_subset_contains = df.filter(regex='spike',axis=1)
print("Contains")
print(df_subset_contains)

print("\n")
print("----------------------------------------")
df_subset_ends = df.filter(regex='spike$',axis=1)
print("Ends")
print(df_subset_ends)

Pandas dataframe get first row of each group

Question: Pandas dataframe get first row of each group


I have a pandas DataFrame like the following.

df = pd.DataFrame({'id' : [1,1,1,2,2,3,3,3,3,4,4,5,6,6,6,7,7],
                'value'  : ["first","second","second","first",
                            "second","first","third","fourth",
                            "fifth","second","fifth","first",
                            "first","second","third","fourth","fifth"]})

I want to group this by [“id”,”value”] and get the first row of each group.

        id   value
0        1   first
1        1  second
2        1  second
3        2   first
4        2  second
5        3   first
6        3   third
7        3  fourth
8        3   fifth
9        4  second
10       4   fifth
11       5   first
12       6   first
13       6  second
14       6   third
15       7  fourth
16       7   fifth

Expected outcome

    id   value
     1   first
     2   first
     3   first
     4  second
     5  first
     6  first
     7  fourth

I tried the following, which only gives the first row of the DataFrame. Any help regarding this is appreciated.

In [25]: for index, row in df.iterrows():
   ....:     df2 = pd.DataFrame(df.groupby(['id','value']).reset_index().ix[0])

Answer 0

>>> df.groupby('id').first()
     value
id        
1    first
2    first
3    first
4   second
5    first
6    first
7   fourth

If you need id as column:

>>> df.groupby('id').first().reset_index()
   id   value
0   1   first
1   2   first
2   3   first
3   4  second
4   5   first
5   6   first
6   7  fourth

To get n first records, you can use head():

>>> df.groupby('id').head(2).reset_index(drop=True)
    id   value
0    1   first
1    1  second
2    2   first
3    2  second
4    3   first
5    3   third
6    4  second
7    4   fifth
8    5   first
9    6   first
10   6  second
11   7  fourth
12   7   fifth

Answer 1


This will give you the second row of each group (zero indexed, nth(0) is the same as first()):

df.groupby('id').nth(1) 

Documentation: http://pandas.pydata.org/pandas-docs/stable/groupby.html#taking-the-nth-row-of-each-group


Answer 2


I’d suggest using .nth(0) rather than .first() if you need to get the first row.

The difference between them is how they handle NaNs: .nth(0) will return the first row of the group no matter what the values in this row are, while .first() will eventually return the first non-NaN value in each column.

E.g. if your dataset is:

df = pd.DataFrame({'id' : [1,1,1,2,2,3,3,3,3,4,4],
            'value'  : ["first","second","third", np.NaN,
                        "second","first","second","third",
                        "fourth","first","second"]})

>>> df.groupby('id').nth(0)
    value
id        
1    first
2    NaN
3    first
4    first

And

>>> df.groupby('id').first()
    value
id        
1    first
2    second
3    first
4    first

Answer 3


Maybe this is what you want:

import pandas as pd
idx = pd.MultiIndex.from_product([['state1','state2'],   ['county1','county2','county3','county4']])
df = pd.DataFrame({'pop': [12,15,65,42,78,67,55,31]}, index=idx)
                pop
state1 county1   12
       county2   15
       county3   65
       county4   42
state2 county1   78
       county2   67
       county3   55
       county4   31
df.groupby(level=0, group_keys=False).apply(lambda x: x.sort_values('pop', ascending=False)).groupby(level=0).head(3)

> Out[29]: 
                pop
state1 county3   65
       county4   42
       county2   15
state2 county1   78
       county2   67
       county3   55
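
An alternative sketch for the same top-n-per-group idea: when only one column matters, SeriesGroupBy.nlargest avoids the sort-then-head round trip (the result carries an extra index level for the group key):

import pandas as pd

idx = pd.MultiIndex.from_product([['state1', 'state2'],
                                  ['county1', 'county2', 'county3', 'county4']])
df = pd.DataFrame({'pop': [12, 15, 65, 42, 78, 67, 55, 31]}, index=idx)

# Top 3 'pop' values within each level-0 group
print(df.groupby(level=0)['pop'].nlargest(3))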

回答 4

如果只需要每个组的第一行,可以用drop_duplicates来处理;请注意该函数的默认参数是keep='first'。

df.drop_duplicates('id')
Out[1027]: 
    id   value
0    1   first
3    2   first
5    3   first
9    4  second
11   5   first
12   6   first
15   7  fourth

If you only need the first row from each group, we can do it with drop_duplicates; notice the function’s default is keep='first'.

df.drop_duplicates('id')
Out[1027]: 
    id   value
0    1   first
3    2   first
5    3   first
9    4  second
11   5   first
12   6   first
15   7  fourth
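
Since drop_duplicates keeps the first occurrence by default, keep='last' gives the symmetric last-row-per-group result. A small sketch on a hypothetical frame:

import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2, 2],
                   'value': ['first', 'second', 'first', 'second']})

# keep='first' (the default) retains rows 0 and 2;
# keep='last' retains rows 1 and 3 instead
print(df.drop_duplicates('id', keep='last'))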

用sklearn缩放的pandas数据框列

问题:用sklearn缩放的pandas数据框列

我有一个带有混合类型列的pandas数据框,我想将sklearn的min_max_scaler应用于某些列。理想情况下,我想就地进行这些转换,但还没有找到一种方法来进行。我编写了以下有效的代码:

import pandas as pd
import numpy as np
from sklearn import preprocessing

scaler = preprocessing.MinMaxScaler()

dfTest = pd.DataFrame({'A':[14.00,90.20,90.95,96.27,91.21],'B':[103.02,107.26,110.35,114.23,114.68], 'C':['big','small','big','small','small']})
min_max_scaler = preprocessing.MinMaxScaler()

def scaleColumns(df, cols_to_scale):
    for col in cols_to_scale:
        df[col] = pd.DataFrame(min_max_scaler.fit_transform(pd.DataFrame(dfTest[col])),columns=[col])
    return df

dfTest

    A   B   C
0    14.00   103.02  big
1    90.20   107.26  small
2    90.95   110.35  big
3    96.27   114.23  small
4    91.21   114.68  small

scaled_df = scaleColumns(dfTest,['A','B'])
scaled_df

A   B   C
0    0.000000    0.000000    big
1    0.926219    0.363636    small
2    0.935335    0.628645    big
3    1.000000    0.961407    small
4    0.938495    1.000000    small

我很好奇这是否是进行此转换的首选/最有效的方法。有没有使用df.apply的更好方法呢?

我也很惊讶我无法使用以下代码:

bad_output = min_max_scaler.fit_transform(dfTest['A'])

如果我将整个数据帧传递给缩放器,则它可以工作:

dfTest2 = dfTest.drop('C', axis=1)
good_output = min_max_scaler.fit_transform(dfTest2)
good_output

我很困惑为什么将Series传递给缩放器会失败。在上面的完整工作代码中,我本希望只将一个Series传递给缩放器,然后把数据框的列设置为缩放后的序列。我已经看到这个问题在其他几个地方被问过,但没有找到好的答案。任何帮助我理解这里发生了什么的回答都将不胜感激!

I have a pandas dataframe with mixed type columns, and I’d like to apply sklearn’s min_max_scaler to some of the columns. Ideally, I’d like to do these transformations in place, but haven’t figured out a way to do that yet. I’ve written the following code that works:

import pandas as pd
import numpy as np
from sklearn import preprocessing

scaler = preprocessing.MinMaxScaler()

dfTest = pd.DataFrame({'A':[14.00,90.20,90.95,96.27,91.21],'B':[103.02,107.26,110.35,114.23,114.68], 'C':['big','small','big','small','small']})
min_max_scaler = preprocessing.MinMaxScaler()

def scaleColumns(df, cols_to_scale):
    for col in cols_to_scale:
        df[col] = pd.DataFrame(min_max_scaler.fit_transform(pd.DataFrame(dfTest[col])),columns=[col])
    return df

dfTest

    A   B   C
0    14.00   103.02  big
1    90.20   107.26  small
2    90.95   110.35  big
3    96.27   114.23  small
4    91.21   114.68  small

scaled_df = scaleColumns(dfTest,['A','B'])
scaled_df

A   B   C
0    0.000000    0.000000    big
1    0.926219    0.363636    small
2    0.935335    0.628645    big
3    1.000000    0.961407    small
4    0.938495    1.000000    small

I’m curious if this is the preferred/most efficient way to do this transformation. Is there a way I could use df.apply that would be better?

I’m also surprised I can’t get the following code to work:

bad_output = min_max_scaler.fit_transform(dfTest['A'])

If I pass an entire dataframe to the scaler it works:

dfTest2 = dfTest.drop('C', axis=1)
good_output = min_max_scaler.fit_transform(dfTest2)
good_output

I’m confused why passing a series to the scaler fails. In my full working code above I had hoped to just pass a series to the scaler, then set the dataframe column to the scaled series. I’ve seen this question asked a few other places, but haven’t found a good answer. Any help understanding what’s going on here would be greatly appreciated!


回答 0

我不确定以前版本的pandas是否有此限制,但现在以下代码段对我来说效果很好,并且无需使用apply就能产生所需的结果:

>>> import pandas as pd
>>> from sklearn.preprocessing import MinMaxScaler


>>> scaler = MinMaxScaler()

>>> dfTest = pd.DataFrame({'A':[14.00,90.20,90.95,96.27,91.21],
                           'B':[103.02,107.26,110.35,114.23,114.68],
                           'C':['big','small','big','small','small']})

>>> dfTest[['A', 'B']] = scaler.fit_transform(dfTest[['A', 'B']])

>>> dfTest
          A         B      C
0  0.000000  0.000000    big
1  0.926219  0.363636  small
2  0.935335  0.628645    big
3  1.000000  0.961407  small
4  0.938495  1.000000  small

I am not sure if previous versions of pandas prevented this but now the following snippet works perfectly for me and produces exactly what you want without having to use apply

>>> import pandas as pd
>>> from sklearn.preprocessing import MinMaxScaler


>>> scaler = MinMaxScaler()

>>> dfTest = pd.DataFrame({'A':[14.00,90.20,90.95,96.27,91.21],
                           'B':[103.02,107.26,110.35,114.23,114.68],
                           'C':['big','small','big','small','small']})

>>> dfTest[['A', 'B']] = scaler.fit_transform(dfTest[['A', 'B']])

>>> dfTest
          A         B      C
0  0.000000  0.000000    big
1  0.926219  0.363636  small
2  0.935335  0.628645    big
3  1.000000  0.961407  small
4  0.938495  1.000000  small
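
One practical follow-up, sketched here as an aside: the fitted scaler remembers each column's minimum and maximum, so the transformation can be reversed later with inverse_transform.

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
dfTest = pd.DataFrame({'A': [14.00, 90.20, 90.95, 96.27, 91.21],
                       'B': [103.02, 107.26, 110.35, 114.23, 114.68]})

dfTest[['A', 'B']] = scaler.fit_transform(dfTest[['A', 'B']])

# Recover the original values from the scaled columns
restored = scaler.inverse_transform(dfTest[['A', 'B']])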

回答 1

像这样?

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

dfTest = pd.DataFrame({
           'A':[14.00,90.20,90.95,96.27,91.21],
           'B':[103.02,107.26,110.35,114.23,114.68], 
           'C':['big','small','big','small','small']
         })
dfTest[['A','B']] = dfTest[['A','B']].apply(
                           lambda x: MinMaxScaler().fit_transform(x))
dfTest

    A           B           C
0   0.000000    0.000000    big
1   0.926219    0.363636    small
2   0.935335    0.628645    big
3   1.000000    0.961407    small
4   0.938495    1.000000    small

Like this?

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

dfTest = pd.DataFrame({
           'A':[14.00,90.20,90.95,96.27,91.21],
           'B':[103.02,107.26,110.35,114.23,114.68], 
           'C':['big','small','big','small','small']
         })
dfTest[['A','B']] = dfTest[['A','B']].apply(
                           lambda x: MinMaxScaler().fit_transform(x))
dfTest

    A           B           C
0   0.000000    0.000000    big
1   0.926219    0.363636    small
2   0.935335    0.628645    big
3   1.000000    0.961407    small
4   0.938495    1.000000    small

回答 2

正如pir在评论中提到的那样,.apply(lambda el: scale.fit_transform(el))方法将产生以下警告:

DeprecationWarning:在0.17中弃用1d数组作为数据,它将在0.19中引发ValueError。如果数据具有单个功能,则使用X.reshape(-1,1)来重塑数据,如果包含单个样本,则使用X.reshape(1,-1)来重塑数据。

将您的列转换为numpy数组应该可以完成这项工作(我更喜欢StandardScaler):

from sklearn.preprocessing import StandardScaler
scale = StandardScaler()

dfTest[['A','B']] = scale.fit_transform(dfTest[['A','B']].as_matrix())  # column C is non-numeric, so it is left out

编辑 2018年11月(已针对熊猫0.23.4测试)-

正如Rob Murray在评论中提到的,在当前(v0.23.4)版本的pandas中,.as_matrix()会返回FutureWarning。因此,应将其替换为.values:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

scaler.fit_transform(dfTest[['A','B']].values)

编辑 2019年5月(已针对熊猫0.24.2测试)-

正如joelostblom在评论中提到的那样:“自0.24.0起,建议使用.to_numpy()代替.values。”

更新的示例:

import pandas as pd
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
dfTest = pd.DataFrame({
               'A':[14.00,90.20,90.95,96.27,91.21],
               'B':[103.02,107.26,110.35,114.23,114.68],
               'C':['big','small','big','small','small']
             })
dfTest[['A', 'B']] = scaler.fit_transform(dfTest[['A','B']].to_numpy())
dfTest
      A         B      C
0 -1.995290 -1.571117    big
1  0.436356 -0.603995  small
2  0.460289  0.100818    big
3  0.630058  0.985826  small
4  0.468586  1.088469  small

As it is being mentioned in pir’s comment – the .apply(lambda el: scale.fit_transform(el)) method will produce the following warning:

DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.

Converting your columns to numpy arrays should do the job (I prefer StandardScaler):

from sklearn.preprocessing import StandardScaler
scale = StandardScaler()

dfTest[['A','B']] = scale.fit_transform(dfTest[['A','B']].as_matrix())  # column C is non-numeric, so it is left out

Edit Nov 2018 (Tested for pandas 0.23.4)–

As Rob Murray mentions in the comments, in the current (v0.23.4) version of pandas .as_matrix() returns FutureWarning. Therefore, it should be replaced by .values:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

scaler.fit_transform(dfTest[['A','B']].values)

Edit May 2019 (Tested for pandas 0.24.2)–

As joelostblom mentions in the comments, “Since 0.24.0, it is recommended to use .to_numpy() instead of .values.”

Updated example:

import pandas as pd
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
dfTest = pd.DataFrame({
               'A':[14.00,90.20,90.95,96.27,91.21],
               'B':[103.02,107.26,110.35,114.23,114.68],
               'C':['big','small','big','small','small']
             })
dfTest[['A', 'B']] = scaler.fit_transform(dfTest[['A','B']].to_numpy())
dfTest
      A         B      C
0 -1.995290 -1.571117    big
1  0.436356 -0.603995  small
2  0.460289  0.100818    big
3  0.630058  0.985826  small
4  0.468586  1.088469  small
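
A related caution, added here as a hedged aside rather than part of the original answer: when data is split into train and test sets, fit the scaler on the training frame only and reuse it on the test frame so both end up on the same scale.

import pandas as pd
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
train = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'B': [10.0, 20.0, 30.0]})
test = pd.DataFrame({'A': [1.5, 2.5], 'B': [15.0, 25.0]})

# fit_transform learns mean/std from train; transform reuses them on test
train[['A', 'B']] = scaler.fit_transform(train[['A', 'B']])
test[['A', 'B']] = scaler.transform(test[['A', 'B']])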

回答 3

# assumes a fitted scaler instance, e.g. scale = sklearn.preprocessing.StandardScaler()
df = pd.DataFrame(scale.fit_transform(df.values), columns=df.columns, index=df.index)

这应该在没有弃用警告的情况下起作用。

# assumes a fitted scaler instance, e.g. scale = sklearn.preprocessing.StandardScaler()
df = pd.DataFrame(scale.fit_transform(df.values), columns=df.columns, index=df.index)

This should work without deprecation warnings.


回答 4

您可以只使用pandas来完成此操作:

In [235]:
dfTest = pd.DataFrame({'A':[14.00,90.20,90.95,96.27,91.21],'B':[103.02,107.26,110.35,114.23,114.68], 'C':['big','small','big','small','small']})
df = dfTest[['A', 'B']]
df_norm = (df - df.min()) / (df.max() - df.min())
print(df_norm)
print(pd.concat((df_norm, dfTest.C), axis=1))

          A         B
0  0.000000  0.000000
1  0.926219  0.363636
2  0.935335  0.628645
3  1.000000  0.961407
4  0.938495  1.000000
          A         B      C
0  0.000000  0.000000    big
1  0.926219  0.363636  small
2  0.935335  0.628645    big
3  1.000000  0.961407  small
4  0.938495  1.000000  small

You can do it using pandas only:

In [235]:
dfTest = pd.DataFrame({'A':[14.00,90.20,90.95,96.27,91.21],'B':[103.02,107.26,110.35,114.23,114.68], 'C':['big','small','big','small','small']})
df = dfTest[['A', 'B']]
df_norm = (df - df.min()) / (df.max() - df.min())
print(df_norm)
print(pd.concat((df_norm, dfTest.C), axis=1))

          A         B
0  0.000000  0.000000
1  0.926219  0.363636
2  0.935335  0.628645
3  1.000000  0.961407
4  0.938495  1.000000
          A         B      C
0  0.000000  0.000000    big
1  0.926219  0.363636  small
2  0.935335  0.628645    big
3  1.000000  0.961407  small
4  0.938495  1.000000  small
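
The same pandas-only style works for standardisation as well; a sketch, noting that ddof=0 is what matches sklearn's StandardScaler, which uses the population standard deviation:

import pandas as pd

dfTest = pd.DataFrame({'A': [14.00, 90.20, 90.95, 96.27, 91.21],
                       'B': [103.02, 107.26, 110.35, 114.23, 114.68]})
df = dfTest[['A', 'B']]

# z-score standardisation without sklearn
df_std = (df - df.mean()) / df.std(ddof=0)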

回答 5

我知道这是一个很老的评论,但仍然:

不要使用单括号(dfTest['A']),而应使用双括号(dfTest[['A']])

即:min_max_scaler.fit_transform(dfTest[['A']])

我相信这会取得理想的结果。

I know it’s a very old comment, but still:

Instead of using single bracket (dfTest['A']), use double brackets (dfTest[['A']]).

i.e: min_max_scaler.fit_transform(dfTest[['A']]).

I believe this will give the desired result.
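
To make the single-versus-double bracket point concrete, a minimal sketch: a one-column DataFrame is 2-D, which the scaler accepts, while a Series is 1-D.

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

dfTest = pd.DataFrame({'A': [14.00, 90.20, 90.95, 96.27, 91.21]})
scaler = MinMaxScaler()

# dfTest[['A']] -> DataFrame of shape (5, 1): accepted
scaled = scaler.fit_transform(dfTest[['A']])

# dfTest['A'] -> Series of shape (5,): rejected by modern scikit-learn;
# reshaping the underlying array is the equivalent workaround
scaled_too = scaler.fit_transform(dfTest['A'].values.reshape(-1, 1))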


创建两个熊猫数据框列的字典的最有效方法是什么?

问题:创建两个熊猫数据框列的字典的最有效方法是什么?

组织以下熊猫数据框的最有效方法是什么:

数据=

Position    Letter
1           a
2           b
3           c
4           d
5           e

变成像alphabet = {1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e'}这样的字典?

What is the most efficient way to organise the following pandas Dataframe:

data =

Position    Letter
1           a
2           b
3           c
4           d
5           e

into a dictionary like alphabet = {1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e'}?


回答 0

In [9]: pd.Series(df.Letter.values,index=df.Position).to_dict()
Out[9]: {1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e'}

速度比较(使用Wouter方法)

In [6]: df = pd.DataFrame(randint(0,10,10000).reshape(5000,2),columns=list('AB'))

In [7]: %timeit dict(zip(df.A,df.B))
1000 loops, best of 3: 1.27 ms per loop

In [8]: %timeit pd.Series(df.A.values,index=df.B).to_dict()
1000 loops, best of 3: 987 us per loop
In [9]: pd.Series(df.Letter.values,index=df.Position).to_dict()
Out[9]: {1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e'}

Speed comparison (using Wouter’s method)

In [6]: df = pd.DataFrame(randint(0,10,10000).reshape(5000,2),columns=list('AB'))

In [7]: %timeit dict(zip(df.A,df.B))
1000 loops, best of 3: 1.27 ms per loop

In [8]: %timeit pd.Series(df.A.values,index=df.B).to_dict()
1000 loops, best of 3: 987 us per loop

回答 1

我找到了一种更快的解决方法,至少在相当大的数据集上是如此,即使用:df.set_index(KEY).to_dict()[VALUE]

50,000行的证明:

df = pd.DataFrame(np.random.randint(32, 120, 100000).reshape(50000,2),columns=list('AB'))
df['A'] = df['A'].apply(chr)

%timeit dict(zip(df.A,df.B))
%timeit pd.Series(df.A.values,index=df.B).to_dict()
%timeit df.set_index('A').to_dict()['B']

输出:

100 loops, best of 3: 7.04 ms per loop  # WouterOvermeire
100 loops, best of 3: 9.83 ms per loop  # Jeff
100 loops, best of 3: 4.28 ms per loop  # Kikohs (me)

I found a faster way to solve the problem, at least on realistically large datasets using: df.set_index(KEY).to_dict()[VALUE]

Proof on 50,000 rows:

df = pd.DataFrame(np.random.randint(32, 120, 100000).reshape(50000,2),columns=list('AB'))
df['A'] = df['A'].apply(chr)

%timeit dict(zip(df.A,df.B))
%timeit pd.Series(df.A.values,index=df.B).to_dict()
%timeit df.set_index('A').to_dict()['B']

Output:

100 loops, best of 3: 7.04 ms per loop  # WouterOvermeire
100 loops, best of 3: 9.83 ms per loop  # Jeff
100 loops, best of 3: 4.28 ms per loop  # Kikohs (me)
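
A small variant of the same idea, sketched here: selecting the value column before calling to_dict avoids building the intermediate dict-of-dicts that to_dict() produces for every column.

import pandas as pd

df = pd.DataFrame({'Position': [1, 2, 3], 'Letter': ['a', 'b', 'c']})

# Series.to_dict maps index -> values directly
alphabet = df.set_index('Position')['Letter'].to_dict()
# {1: 'a', 2: 'b', 3: 'c'}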

回答 2

在Python 3.6中,最快的方法仍然是WouterOvermeire。Kikohs的提议比其他两个方案要慢。

import timeit

setup = '''
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(32, 120, 100000).reshape(50000,2),columns=list('AB'))
df['A'] = df['A'].apply(chr)
'''

timeit.Timer('dict(zip(df.A,df.B))', setup=setup).repeat(7,500)
timeit.Timer('pd.Series(df.A.values,index=df.B).to_dict()', setup=setup).repeat(7,500)
timeit.Timer('df.set_index("A").to_dict()["B"]', setup=setup).repeat(7,500)

结果:

1.1214002349999777 s  # WouterOvermeire
1.1922008498571748 s  # Jeff
1.7034366211428602 s  # Kikohs

In Python 3.6 the fastest way is still the WouterOvermeire one. Kikohs’ proposal is slower than the other two options.

import timeit

setup = '''
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(32, 120, 100000).reshape(50000,2),columns=list('AB'))
df['A'] = df['A'].apply(chr)
'''

timeit.Timer('dict(zip(df.A,df.B))', setup=setup).repeat(7,500)
timeit.Timer('pd.Series(df.A.values,index=df.B).to_dict()', setup=setup).repeat(7,500)
timeit.Timer('df.set_index("A").to_dict()["B"]', setup=setup).repeat(7,500)

Results:

1.1214002349999777 s  # WouterOvermeire
1.1922008498571748 s  # Jeff
1.7034366211428602 s  # Kikohs

回答 3

TL;DR

>>> import pandas as pd
>>> df = pd.DataFrame({'Position':[1,2,3,4,5], 'Letter':['a', 'b', 'c', 'd', 'e']})
>>> dict(sorted(df.values.tolist())) # Sort of sorted... 
{'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}
>>> from collections import OrderedDict
>>> OrderedDict(df.values.tolist())
OrderedDict([('a', 1), ('b', 2), ('c', 3), ('d', 4), ('e', 5)])

详细说明

解决方案说明: dict(sorted(df.values.tolist()))

鉴于:

df = pd.DataFrame({'Position':[1,2,3,4,5], 'Letter':['a', 'b', 'c', 'd', 'e']})

[出]:

 Letter Position
0   a   1
1   b   2
2   c   3
3   d   4
4   e   5

尝试:

# Get the values out to a 2-D numpy array, 
df.values

[出]:

array([['a', 1],
       ['b', 2],
       ['c', 3],
       ['d', 4],
       ['e', 5]], dtype=object)

然后(可选):

# Dump it into a list so that you can sort it using `sorted()`
sorted(df.values.tolist()) # Sort by key

或者:

# Sort by value:
from operator import itemgetter
sorted(df.values.tolist(), key=itemgetter(1))

[出]:

[['a', 1], ['b', 2], ['c', 3], ['d', 4], ['e', 5]]

最后,将这个由双元素列表组成的列表转换成字典。

dict(sorted(df.values.tolist())) 

[出]:

{'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}

有关

回答@sbradbio评论:

如果一个特定的键有多个值,而您想保留所有这些值,那么下面这种方法虽然不是最高效的,却是最直观的:

from collections import defaultdict
import pandas as pd

multivalue_dict = defaultdict(list)

df = pd.DataFrame({'Position':[1,2,4,4,4], 'Letter':['a', 'b', 'd', 'e', 'f']})

for idx,row in df.iterrows():
    multivalue_dict[row['Position']].append(row['Letter'])

[出]:

>>> print(multivalue_dict)
defaultdict(list, {1: ['a'], 2: ['b'], 4: ['d', 'e', 'f']})

TL;DR

>>> import pandas as pd
>>> df = pd.DataFrame({'Position':[1,2,3,4,5], 'Letter':['a', 'b', 'c', 'd', 'e']})
>>> dict(sorted(df.values.tolist())) # Sort of sorted... 
{'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}
>>> from collections import OrderedDict
>>> OrderedDict(df.values.tolist())
OrderedDict([('a', 1), ('b', 2), ('c', 3), ('d', 4), ('e', 5)])

In Long

Explaining solution: dict(sorted(df.values.tolist()))

Given:

df = pd.DataFrame({'Position':[1,2,3,4,5], 'Letter':['a', 'b', 'c', 'd', 'e']})

[out]:

 Letter Position
0   a   1
1   b   2
2   c   3
3   d   4
4   e   5

Try:

# Get the values out to a 2-D numpy array, 
df.values

[out]:

array([['a', 1],
       ['b', 2],
       ['c', 3],
       ['d', 4],
       ['e', 5]], dtype=object)

Then optionally:

# Dump it into a list so that you can sort it using `sorted()`
sorted(df.values.tolist()) # Sort by key

Or:

# Sort by value:
from operator import itemgetter
sorted(df.values.tolist(), key=itemgetter(1))

[out]:

[['a', 1], ['b', 2], ['c', 3], ['d', 4], ['e', 5]]

Lastly, cast the list of 2-element lists into a dict.

dict(sorted(df.values.tolist())) 

[out]:

{'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}

Related

Answering @sbradbio comment:

If there are multiple values for a specific key and you would like to keep all of them, it’s not the most efficient, but the most intuitive way is:

from collections import defaultdict
import pandas as pd

multivalue_dict = defaultdict(list)

df = pd.DataFrame({'Position':[1,2,4,4,4], 'Letter':['a', 'b', 'd', 'e', 'f']})

for idx,row in df.iterrows():
    multivalue_dict[row['Position']].append(row['Letter'])

[out]:

>>> print(multivalue_dict)
defaultdict(list, {1: ['a'], 2: ['b'], 4: ['d', 'e', 'f']})
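
A groupby-based alternative sketch for the multi-value case, avoiding the explicit iterrows loop:

import pandas as pd

df = pd.DataFrame({'Position': [1, 2, 4, 4, 4],
                   'Letter': ['a', 'b', 'd', 'e', 'f']})

# Collect the letters of each position into a list, then export as a dict
multivalue_dict = df.groupby('Position')['Letter'].apply(list).to_dict()
# {1: ['a'], 2: ['b'], 4: ['d', 'e', 'f']}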

熊猫加入问题:列重叠但未指定后缀

问题:熊猫加入问题:列重叠但未指定后缀

我有以下2个数据帧:

df_a =

     mukey  DI  PI
0   100000  35  14
1  1000005  44  14
2  1000006  44  14
3  1000007  43  13
4  1000008  43  13

df_b = 
    mukey  niccdcd
0  190236        4
1  190237        6
2  190238        7
3  190239        4
4  190240        7

当我尝试加入这两个数据框时:

join_df = df_a.join(df_b,on='mukey',how='left')

我得到错误:

*** ValueError: columns overlap but no suffix specified: Index([u'mukey'], dtype='object')

为什么会这样呢?这两个数据帧确实具有共同的“mukey”值。

I have following 2 data frames:

df_a =

     mukey  DI  PI
0   100000  35  14
1  1000005  44  14
2  1000006  44  14
3  1000007  43  13
4  1000008  43  13

df_b = 
    mukey  niccdcd
0  190236        4
1  190237        6
2  190238        7
3  190239        4
4  190240        7

When I try to join these 2 dataframes:

join_df = df_a.join(df_b,on='mukey',how='left')

I get the error:

*** ValueError: columns overlap but no suffix specified: Index([u'mukey'], dtype='object')

Why is this so? The dataframes do have common ‘mukey’ values.


回答 0

您发布的数据片段让这个错误显得有点费解:由于两个数据框中都有名为mukey的列,join要求您先为左右两侧提供后缀;又因为两边的值实际上并不重叠,联接得到的列都是NaN:

In [173]:

df_a.join(df_b, on='mukey', how='left', lsuffix='_left', rsuffix='_right')
Out[173]:
       mukey_left  DI  PI  mukey_right  niccdcd
index                                          
0          100000  35  14          NaN      NaN
1         1000005  44  14          NaN      NaN
2         1000006  44  14          NaN      NaN
3         1000007  43  13          NaN      NaN
4         1000008  43  13          NaN      NaN

merge 之所以有效,是因为它没有此限制:

In [176]:

df_a.merge(df_b, on='mukey', how='left')
Out[176]:
     mukey  DI  PI  niccdcd
0   100000  35  14      NaN
1  1000005  44  14      NaN
2  1000006  44  14      NaN
3  1000007  43  13      NaN
4  1000008  43  13      NaN

Your error on the snippet of data you posted is a little cryptic: since both frames contain a mukey column, join requires you to supply a suffix for the left and right hand sides before it will proceed; and because no values actually overlap, the joined columns then come back as NaN:

In [173]:

df_a.join(df_b, on='mukey', how='left', lsuffix='_left', rsuffix='_right')
Out[173]:
       mukey_left  DI  PI  mukey_right  niccdcd
index                                          
0          100000  35  14          NaN      NaN
1         1000005  44  14          NaN      NaN
2         1000006  44  14          NaN      NaN
3         1000007  43  13          NaN      NaN
4         1000008  43  13          NaN      NaN

merge works because it doesn’t have this restriction:

In [176]:

df_a.merge(df_b, on='mukey', how='left')
Out[176]:
     mukey  DI  PI  niccdcd
0   100000  35  14      NaN
1  1000005  44  14      NaN
2  1000006  44  14      NaN
3  1000007  43  13      NaN
4  1000008  43  13      NaN

回答 1

.join()函数使用的是作为参数传入的数据集的index,因此您应该先对其set_index,或者改用.merge函数。

请找到适合您的情况的两个示例:

join_df = LS_sgo.join(MSU_pi.set_index('mukey'), on='mukey', how='left')

或者

join_df = df_a.merge(df_b, on='mukey', how='left')

The .join() function uses the index of the dataset passed as an argument, so you should use set_index on it first, or use the .merge function instead.

Please find the two examples that should work in your case:

join_df = LS_sgo.join(MSU_pi.set_index('mukey'), on='mukey', how='left')

or

join_df = df_a.merge(df_b, on='mukey', how='left')

回答 2

此错误表明两个表中有一个或多个同名的列。错误消息的意思是:“我在两个表中都看到了同一列,但您没有告诉我在引入其中一个之前要重命名哪一个”。

您要么在从另一个表引入之前用del df['column name']删除其中一列,要么用lsuffix重命名原有的列,或者用rsuffix重命名被引入的那一列。

df_a.join(df_b, on='mukey', how='left', lsuffix='_left', rsuffix='_right')

This error indicates that the two tables have one or more columns with the same name. The error message translates to: “I can see the same column in both tables, but you haven’t told me to rename either before bringing one of them in.”

You either want to delete one of the columns before bringing it in from the other, using del df['column name'], or use lsuffix to rename the original column, or rsuffix to rename the one being brought in.

df_a.join(df_b, on='mukey', how='left', lsuffix='_left', rsuffix='_right')

回答 3

主要是因为join是专门基于索引进行联接的,而不是基于列名。所以请先在两个数据框中把联接键设置为索引(或改掉重名的列),然后再尝试联接,它们就能联接成功;否则就会引发此错误。

Mainly, join is used to join based on the index, not on the column names. So set the join key as the index (or rename the overlapping columns) in the two dataframes, then try to join; they will be joined, else this error is raised.
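
A minimal sketch of that index-based pattern, using hypothetical frames shaped like the question's:

import pandas as pd

df_a = pd.DataFrame({'mukey': [1, 2], 'DI': [35, 44]})
df_b = pd.DataFrame({'mukey': [1, 2], 'niccdcd': [4, 6]})

# Moving the key into the index on both sides lets join align on it
# without any column-name collision
joined = df_a.set_index('mukey').join(df_b.set_index('mukey'), how='left')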


如何将tsv文件加载到Pandas DataFrame中?

问题:如何将tsv文件加载到Pandas DataFrame中?

我是python和pandas的新手。我正在尝试将tsv文件加载到pandas的DataFrame中。

这是我正在尝试的代码和得到的错误:

>>> df1 = DataFrame(csv.reader(open('c:/~/trainSetRel3.txt'), delimiter='\t'))

Traceback (most recent call last):
  File "<pyshell#28>", line 1, in <module>
    df1 = DataFrame(csv.reader(open('c:/~/trainSetRel3.txt'), delimiter='\t'))
  File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 318, in __init__
    raise PandasError('DataFrame constructor not properly called!')
PandasError: DataFrame constructor not properly called!

I’m new to python and pandas. I’m trying to get a tsv file loaded into a pandas DataFrame.

This is what I’m trying and the error I’m getting:

>>> df1 = DataFrame(csv.reader(open('c:/~/trainSetRel3.txt'), delimiter='\t'))

Traceback (most recent call last):
  File "<pyshell#28>", line 1, in <module>
    df1 = DataFrame(csv.reader(open('c:/~/trainSetRel3.txt'), delimiter='\t'))
  File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 318, in __init__
    raise PandasError('DataFrame constructor not properly called!')
PandasError: DataFrame constructor not properly called!

回答 0

注意:自0.17.0起,不推荐使用from_csv:请改用pd.read_csv

该文档列出了一个.from_csv函数,该函数似乎可以执行您想要的操作:

DataFrame.from_csv('c:/~/trainSetRel3.txt', sep='\t')

如果您的文件有标题行,则可以传递header=0:

DataFrame.from_csv('c:/~/trainSetRel3.txt', sep='\t', header=0)

Note: As of 0.17.0, from_csv is discouraged: use pd.read_csv instead

The documentation lists a .from_csv function that appears to do what you want:

DataFrame.from_csv('c:/~/trainSetRel3.txt', sep='\t')

If you have a header, you can pass header=0.

DataFrame.from_csv('c:/~/trainSetRel3.txt', sep='\t', header=0)

回答 1

自0.17.0起,不建议使用from_csv。

使用pd.read_csv(fpath, sep='\t')pd.read_table(fpath)

As of 0.17.0, from_csv is discouraged.

Use pd.read_csv(fpath, sep='\t') or pd.read_table(fpath).


回答 2

使用read_table(filepath)。默认分隔符是制表符

Use read_table(filepath). The default separator is tab


回答 3

试试这个

df = pd.read_csv("rating-data.tsv",sep='\t')
df.head()

您实际上只需要正确设置sep参数。

Try this

df = pd.read_csv("rating-data.tsv",sep='\t')
df.head()

You actually just need to set the sep parameter correctly.


回答 4

打开文件,另存为.csv,然后应用

df = pd.read_csv('apps.csv', sep='\t')

对于任何其他格式,只需更改sep标记

open file, save as .csv and then apply

df = pd.read_csv('apps.csv', sep='\t')

for any other format also, just change the sep tag


回答 5

df = pd.read_csv('filename.csv', sep='\t', header=0)

您可以通过指定分隔符和标头将tsv文件直接加载到pandas数据框中。

df = pd.read_csv('filename.csv', sep='\t', header=0)

You can load the tsv file directly into a pandas data frame by specifying the delimiter and header.
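
If the file has no header row, a variant of the same call supplies the column names explicitly; 'data.tsv' and the names below are hypothetical:

import pandas as pd

# header=None tells pandas the first line is data, not labels;
# names= provides the column labels the file itself lacks
df = pd.read_csv('data.tsv', sep='\t', header=None, names=['col1', 'col2'])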