Python 实用宝典

Question 1

I have a list ‘abc’ and a dataframe ‘df’:

abc = ['foo', 'bar']
df =
    A  B
0  12  NaN
1  23  NaN

I want to insert the list into cell 1B, so I want this result:

    A  B
0  12  NaN
1  23  ['foo', 'bar']

Ho can I do that?

1) If I use this:

df.ix[1,'B'] = abc

I get the following error message:

ValueError: Must have equal len keys and value when setting with an iterable

because it tries to insert the list (that has two elements) into a row / column but not into a cell.

2) If I use this:

df.ix[1,'B'] = [abc]

then it inserts a list that has only one element that is the ‘abc’ list ( [['foo', 'bar']] ).

3) If I use this:

df.ix[1,'B'] = ', '.join(abc)

then it inserts a string: ( foo, bar ) but not a list.

4) If I use this:

df.ix[1,'B'] = [', '.join(abc)]

then it inserts a list but it has only one element ( ['foo, bar'] ) but not two as I want ( ['foo', 'bar'] ).

Thanks for help!

EDIT

My new dataframe and the old list:

abc = ['foo', 'bar']
df2 =
    A    B         C
0  12  NaN      'bla'
1  23  NaN  'bla bla'

Another dataframe:

df3 =
    A    B         C                    D
0  12  NaN      'bla'  ['item1', 'item2']
1  23  NaN  'bla bla'        [11, 12, 13]

I want insert the ‘abc’ list into df2.loc[1,'B'] and/or df3.loc[1,'B'].

If the dataframe has columns only with integer values and/or NaN values and/or list values then inserting a list into a cell works perfectly. If the dataframe has columns only with string values and/or NaN values and/or list values then inserting a list into a cell works perfectly. But if the dataframe has columns with integer and string values and other columns then the error message appears if I use this: df2.loc[1,'B'] = abc or df3.loc[1,'B'] = abc.

Another dataframe:

df4 =
          A     B
0      'bla'  NaN
1  'bla bla'  NaN

These inserts work perfectly: df.loc[1,'B'] = abc or df4.loc[1,'B'] = abc.

Question 2

Since set_value has been deprecated since version 0.21.0, you should now use at. It can insert a list into a cell without raising a ValueError as loc does. I think this is because at always refers to a single value, while loc can refer to values as well as rows and columns.

df = pd.DataFrame(data={'A': [1, 2, 3], 'B': ['x', 'y', 'z']})

df.at[1, 'B'] = ['m', 'n']

df =
    A   B
0   1   x
1   2   [m, n]
2   3   z

You also need to make sure the column you are inserting into has dtype=object. For example

>>> df = pd.DataFrame(data={'A': [1, 2, 3], 'B': [1,2,3]})
>>> df.dtypes
A    int64
B    int64
dtype: object

>>> df.at[1, 'B'] = [1, 2, 3]
ValueError: setting an array element with a sequence

>>> df['B'] = df['B'].astype('object')
>>> df.at[1, 'B'] = [1, 2, 3]
>>> df
   A          B
0  1          1
1  2  [1, 2, 3]
2  3          3

Question 3

df3.set_value(1, 'B', abc) works for any dataframe. Take care of the data type of column ‘B’. Eg. a list can not be inserted into a float column, at that case df['B'] = df['B'].astype(object) can help.

Question 4

Pandas >= 0.21

set_value has been deprecated. You can now use DataFrame.at to set by label, and DataFrame.iat to set by integer position.

Setting Cell Values with `at`/`iat`

# Setup
df = pd.DataFrame({'A': [12, 23], 'B': [['a', 'b'], ['c', 'd']]})
df

    A       B
0  12  [a, b]
1  23  [c, d]

df.dtypes

A     int64
B    object
dtype: object

If you want to set a value in second row of the “B” to some new list, use DataFrane.at:

df.at[1, 'B'] = ['m', 'n']
df

    A       B
0  12  [a, b]
1  23  [m, n]

You can also set by integer position using DataFrame.iat

df.iat[1, df.columns.get_loc('B')] = ['m', 'n']
df

    A       B
0  12  [a, b]
1  23  [m, n]

What if I get `ValueError: setting an array element with a sequence`?

I’ll try to reproduce this with:

df

    A   B
0  12 NaN
1  23 NaN

df.dtypes

A      int64
B    float64
dtype: object

df.at[1, 'B'] = ['m', 'n']
# ValueError: setting an array element with a sequence.

This is because of a your object is of float64 dtype, whereas lists are objects, so there’s a mismatch there. What you would have to do in this situation is to convert the column to object first.

df['B'] = df['B'].astype(object)
df.dtypes

A     int64
B    object
dtype: object

Then, it works:

df.at[1, 'B'] = ['m', 'n']
df

    A       B
0  12     NaN
1  23  [m, n]

Possible, But Hacky

Even more wacky, I’ve found you can hack through DataFrame.loc to achieve something similar if you pass nested lists.

df.loc[1, 'B'] = [['m'], ['n'], ['o'], ['p']]
df

    A             B
0  12        [a, b]
1  23  [m, n, o, p]

You can read more about why this works here.

Question 5

As mentionned in this post pandas: how to store a list in a dataframe?; the dtypes in the dataframe may influence the results, as well as calling a dataframe or not to be assigned to.

Question 6

Quick work around

Simply enclose the list within a new list, as done for col2 in the data frame below. The reason it works is that python takes the outer list (of lists) and converts it into a column as if it were containing normal scalar items, which is lists in our case and not normal scalars.

mydict={'col1':[1,2,3],'col2':[[1, 4], [2, 5], [3, 6]]}
data=pd.DataFrame(mydict)
data


   col1     col2
0   1       [1, 4]
1   2       [2, 5]
2   3       [3, 6]

Question 7

Also getting

ValueError: Must have equal len keys and value when setting with an iterable,

using .at rather than .loc did not make any difference in my case, but enforcing the datatype of the dataframe column did the trick:

df['B'] = df['B'].astype(object)

Then I could set lists, numpy array and all sorts of things as single cell values in my dataframes.

Question 8

我有一个DataFrame：

import pandas as pd
import numpy as np

df = pd.DataFrame({'foo.aa': [1, 2.1, np.nan, 4.7, 5.6, 6.8],
                   'foo.fighters': [0, 1, np.nan, 0, 0, 0],
                   'foo.bars': [0, 0, 0, 0, 0, 1],
                   'bar.baz': [5, 5, 6, 5, 5.6, 6.8],
                   'foo.fox': [2, 4, 1, 0, 0, 5],
                   'nas.foo': ['NA', 0, 1, 0, 0, 0],
                   'foo.manchu': ['NA', 0, 0, 0, 0, 0],})

我想在以开头的列中选择1的值foo.。除了以下以外，还有更好的方法吗：

df2 = df[(df['foo.aa'] == 1)|
(df['foo.fighters'] == 1)|
(df['foo.bars'] == 1)|
(df['foo.fox'] == 1)|
(df['foo.manchu'] == 1)
]

类似于写类似的东西：

df2= df[df.STARTS_WITH_FOO == 1]

答案应打印出如下所示的DataFrame：

   bar.baz  foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu nas.foo
0      5.0     1.0         0             0        2         NA      NA
1      5.0     2.1         0             1        4          0       0
2      6.0     NaN         0           NaN        1          0       1
5      6.8     6.8         1             0        5          0       0

[4 rows x 7 columns]

Question 9

I have a DataFrame:

import pandas as pd
import numpy as np

df = pd.DataFrame({'foo.aa': [1, 2.1, np.nan, 4.7, 5.6, 6.8],
                   'foo.fighters': [0, 1, np.nan, 0, 0, 0],
                   'foo.bars': [0, 0, 0, 0, 0, 1],
                   'bar.baz': [5, 5, 6, 5, 5.6, 6.8],
                   'foo.fox': [2, 4, 1, 0, 0, 5],
                   'nas.foo': ['NA', 0, 1, 0, 0, 0],
                   'foo.manchu': ['NA', 0, 0, 0, 0, 0],})

I want to select values of 1 in columns starting with foo.. Is there a better way to do it other than:

df2 = df[(df['foo.aa'] == 1)|
(df['foo.fighters'] == 1)|
(df['foo.bars'] == 1)|
(df['foo.fox'] == 1)|
(df['foo.manchu'] == 1)
]

Something similar to writing something like:

df2= df[df.STARTS_WITH_FOO == 1]

The answer should print out a DataFrame like this:

   bar.baz  foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu nas.foo
0      5.0     1.0         0             0        2         NA      NA
1      5.0     2.1         0             1        4          0       0
2      6.0     NaN         0           NaN        1          0       1
5      6.8     6.8         1             0        5          0       0

[4 rows x 7 columns]

Question 10

只需执行列表推导即可创建您的列：

In [28]:

filter_col = [col for col in df if col.startswith('foo')]
filter_col
Out[28]:
['foo.aa', 'foo.bars', 'foo.fighters', 'foo.fox', 'foo.manchu']
In [29]:

df[filter_col]
Out[29]:
   foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu
0     1.0         0             0        2         NA
1     2.1         0             1        4          0
2     NaN         0           NaN        1          0
3     4.7         0             0        0          0
4     5.6         0             0        0          0
5     6.8         1             0        5          0

另一种方法是从列创建序列，并使用向量化str方法startswith：

In [33]:

df[df.columns[pd.Series(df.columns).str.startswith('foo')]]
Out[33]:
   foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu
0     1.0         0             0        2         NA
1     2.1         0             1        4          0
2     NaN         0           NaN        1          0
3     4.7         0             0        0          0
4     5.6         0             0        0          0
5     6.8         1             0        5          0

为了实现您想要的目标，您需要添加以下内容以过滤不符合您的==1条件的值：

In [36]:

df[df[df.columns[pd.Series(df.columns).str.startswith('foo')]]==1]
Out[36]:
   bar.baz  foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu nas.foo
0      NaN       1       NaN           NaN      NaN        NaN     NaN
1      NaN     NaN       NaN             1      NaN        NaN     NaN
2      NaN     NaN       NaN           NaN        1        NaN     NaN
3      NaN     NaN       NaN           NaN      NaN        NaN     NaN
4      NaN     NaN       NaN           NaN      NaN        NaN     NaN
5      NaN     NaN         1           NaN      NaN        NaN     NaN

编辑

看到您想要复杂的答案后，确定为：

In [72]:

df.loc[df[df[df.columns[pd.Series(df.columns).str.startswith('foo')]] == 1].dropna(how='all', axis=0).index]
Out[72]:
   bar.baz  foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu nas.foo
0      5.0     1.0         0             0        2         NA      NA
1      5.0     2.1         0             1        4          0       0
2      6.0     NaN         0           NaN        1          0       1
5      6.8     6.8         1             0        5          0       0

Question 11

Just perform a list comprehension to create your columns:

In [28]:

filter_col = [col for col in df if col.startswith('foo')]
filter_col
Out[28]:
['foo.aa', 'foo.bars', 'foo.fighters', 'foo.fox', 'foo.manchu']
In [29]:

df[filter_col]
Out[29]:
   foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu
0     1.0         0             0        2         NA
1     2.1         0             1        4          0
2     NaN         0           NaN        1          0
3     4.7         0             0        0          0
4     5.6         0             0        0          0
5     6.8         1             0        5          0

Another method is to create a series from the columns and use the vectorised str method startswith:

In [33]:

df[df.columns[pd.Series(df.columns).str.startswith('foo')]]
Out[33]:
   foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu
0     1.0         0             0        2         NA
1     2.1         0             1        4          0
2     NaN         0           NaN        1          0
3     4.7         0             0        0          0
4     5.6         0             0        0          0
5     6.8         1             0        5          0

In order to achieve what you want you need to add the following to filter the values that don’t meet your ==1 criteria:

In [36]:

df[df[df.columns[pd.Series(df.columns).str.startswith('foo')]]==1]
Out[36]:
   bar.baz  foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu nas.foo
0      NaN       1       NaN           NaN      NaN        NaN     NaN
1      NaN     NaN       NaN             1      NaN        NaN     NaN
2      NaN     NaN       NaN           NaN        1        NaN     NaN
3      NaN     NaN       NaN           NaN      NaN        NaN     NaN
4      NaN     NaN       NaN           NaN      NaN        NaN     NaN
5      NaN     NaN         1           NaN      NaN        NaN     NaN

EDIT

OK after seeing what you want the convoluted answer is this:

In [72]:

df.loc[df[df[df.columns[pd.Series(df.columns).str.startswith('foo')]] == 1].dropna(how='all', axis=0).index]
Out[72]:
   bar.baz  foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu nas.foo
0      5.0     1.0         0             0        2         NA      NA
1      5.0     2.1         0             1        4          0       0
2      6.0     NaN         0           NaN        1          0       1
5      6.8     6.8         1             0        5          0       0

Question 12

既然熊猫的索引支持字符串操作，那么可以说选择以’foo’开头的列的最简单最好的方法就是：

df.loc[:, df.columns.str.startswith('foo')]

或者，您可以使用过滤列（或行）标签df.filter()。要指定正则表达式以匹配以开头的名称foo.：

>>> df.filter(regex=r'^foo\.', axis=1)
   foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu
0     1.0         0             0        2         NA
1     2.1         0             1        4          0
2     NaN         0           NaN        1          0
3     4.7         0             0        0          0
4     5.6         0             0        0          0
5     6.8         1             0        5          0

要仅选择所需的行（包含1）和列，可以使用loc，使用filter（或任何其他方法）选择列，使用any：

>>> df.loc[(df == 1).any(axis=1), df.filter(regex=r'^foo\.', axis=1).columns]
   foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu
0     1.0         0             0        2         NA
1     2.1         0             1        4          0
2     NaN         0           NaN        1          0
5     6.8         1             0        5          0

Question 13

Now that pandas’ indexes support string operations, arguably the simplest and best way to select columns beginning with ‘foo’ is just:

df.loc[:, df.columns.str.startswith('foo')]

Alternatively, you can filter column (or row) labels with df.filter(). To specify a regular expression to match the names beginning with foo.:

>>> df.filter(regex=r'^foo\.', axis=1)
   foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu
0     1.0         0             0        2         NA
1     2.1         0             1        4          0
2     NaN         0           NaN        1          0
3     4.7         0             0        0          0
4     5.6         0             0        0          0
5     6.8         1             0        5          0

To select only the required rows (containing a 1) and the columns, you can use loc, selecting the columns using filter (or any other method) and the rows using any:

>>> df.loc[(df == 1).any(axis=1), df.filter(regex=r'^foo\.', axis=1).columns]
   foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu
0     1.0         0             0        2         NA
1     2.1         0             1        4          0
2     NaN         0           NaN        1          0
5     6.8         1             0        5          0

Question 14

最简单的方法是直接在列名上使用str，不需要 pd.Series

df.loc[:,df.columns.str.startswith("foo")]

Question 15

The simplest way is to use str directly on column names, there is no need for pd.Series

df.loc[:,df.columns.str.startswith("foo")]

Question 16

根据@EdChum的答案，您可以尝试以下解决方案：

df[df.columns[pd.Series(df.columns).str.contains("foo")]]

万一并非您要选择的所有列都以开头，这将非常有用foo。此方法选择包含子字符串的所有列，foo并且可以将其放置在列名称的任何位置。

本质上，我替换.startswith()为.contains()。

Question 17

Based on @EdChum’s answer, you can try the following solution:

df[df.columns[pd.Series(df.columns).str.contains("foo")]]

This will be really helpful in case not all the columns you want to select start with foo. This method selects all the columns that contain the substring foo and it could be placed in at any point of a column’s name.

In essence, I replaced .startswith() with .contains().

Question 18

我的解决方案。性能可能会变慢：

a = pd.concat(df[df[c] == 1] for c in df.columns if c.startswith('foo'))
a.sort_index()


   bar.baz  foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu nas.foo
0      5.0     1.0         0             0        2         NA      NA
1      5.0     2.1         0             1        4          0       0
2      6.0     NaN         0           NaN        1          0       1
5      6.8     6.8         1             0        5          0       0

Question 19

My solution. It may be slower on performance:

a = pd.concat(df[df[c] == 1] for c in df.columns if c.startswith('foo'))
a.sort_index()


   bar.baz  foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu nas.foo
0      5.0     1.0         0             0        2         NA      NA
1      5.0     2.1         0             1        4          0       0
2      6.0     NaN         0           NaN        1          0       1
5      6.8     6.8         1             0        5          0       0

Question 20

选择所需条目的另一种方法是使用map：

df.loc[(df == 1).any(axis=1), df.columns.map(lambda x: x.startswith('foo'))]

这将为您提供包含的行的所有列1：

   foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu
0     1.0         0             0        2         NA
1     2.1         0             1        4          0
2     NaN         0           NaN        1          0
5     6.8         1             0        5          0

该行选择是通过做

(df == 1).any(axis=1)

如@ajcr的答案，它为您提供：

0     True
1     True
2     True
3    False
4    False
5     True
dtype: bool

表示该行3和4不包含1和不会被选中。

列的选择是使用布尔索引完成的，如下所示：

df.columns.map(lambda x: x.startswith('foo'))

在上面的示例中，此返回

array([False,  True,  True,  True,  True,  True, False], dtype=bool)

因此，如果某列不是以开头foo，False则返回该列，因此未选择该列。

如果您只想返回包含1-的所有行（如您期望的输出所示），则只需执行

df.loc[(df == 1).any(axis=1)]

哪个返回

   bar.baz  foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu nas.foo
0      5.0     1.0         0             0        2         NA      NA
1      5.0     2.1         0             1        4          0       0
2      6.0     NaN         0           NaN        1          0       1
5      6.8     6.8         1             0        5          0       0

Question 21

Another option for the selection of the desired entries is to use map:

df.loc[(df == 1).any(axis=1), df.columns.map(lambda x: x.startswith('foo'))]

which gives you all the columns for rows that contain a 1:

   foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu
0     1.0         0             0        2         NA
1     2.1         0             1        4          0
2     NaN         0           NaN        1          0
5     6.8         1             0        5          0

The row selection is done by

(df == 1).any(axis=1)

as in @ajcr’s answer which gives you:

0     True
1     True
2     True
3    False
4    False
5     True
dtype: bool

meaning that row 3 and 4 do not contain a 1 and won’t be selected.

The selection of the columns is done using Boolean indexing like this:

df.columns.map(lambda x: x.startswith('foo'))

In the example above this returns

array([False,  True,  True,  True,  True,  True, False], dtype=bool)

So, if a column does not start with foo, False is returned and the column is therefore not selected.

If you just want to return all rows that contain a 1 – as your desired output suggests – you can simply do

df.loc[(df == 1).any(axis=1)]

which returns

   bar.baz  foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu nas.foo
0      5.0     1.0         0             0        2         NA      NA
1      5.0     2.1         0             1        4          0       0
2      6.0     NaN         0           NaN        1          0       1
5      6.8     6.8         1             0        5          0       0

Question 22

您可以在此处尝试使用正则表达式来过滤以“ foo”开头的列

df.filter(regex='^foo*')

如果您需要在列中包含字符串foo，则

df.filter(regex='foo*')

将是适当的。

下一步，您可以使用

df[df.filter(regex='^foo*').values==1]

过滤掉“ foo *”列的值之一为1的行。

Question 23

You can try the regex here to filter out the columns starting with “foo”

df.filter(regex='^foo*')

If you need to have the string foo in your column then

df.filter(regex='foo*')

would be appropriate.

For the next step, you can use

df[df.filter(regex='^foo*').values==1]

to filter out the rows where one of the values of ‘foo*’ column is 1.

Question 24

就我而言，我需要一个前缀列表

colsToScale=["production", "test", "development"]
dc[dc.columns[dc.columns.str.startswith(tuple(colsToScale))]]

Question 25

In my case I needed a list of prefixes

colsToScale=["production", "test", "development"]
dc[dc.columns[dc.columns.str.startswith(tuple(colsToScale))]]

Question 26

Say I have the following DataFrame

Letter    Number
A          1
B          2
C          3
D          4

Which can be obtained through the following code

import pandas as pd

letters=pd.Series(('A', 'B', 'C', 'D'))
numbers=pd.Series((1, 2, 3, 4))
keys=('Letters', 'Numbers')
df=pd.concat((letters, numbers), axis=1, keys=keys)

Now I want to get the value C from the column Letters.

The command line

df[df.Letters=='C'].Letters

will return

2    C
Name: Letters, dtype: object

How can I get only the value C and not the whole two line output?

Question 27

df[df.Letters=='C'].Letters.item()

This returns the first element in the Index/Series returned from that selection. In this case, the value is always the first element.

EDIT:

Or you can run a loc() and access the first element that way. This was shorter and is the way I have implemented it in the past.

Question 28

Use the values attribute to return the values as a np array and then use [0] to get the first value:

In [4]:
df.loc[df.Letters=='C','Letters'].values[0]

Out[4]:
'C'

EDIT

I personally prefer to access the columns using subscript operators:

df.loc[df['Letters'] == 'C', 'Letters'].values[0]

This avoids issues where the column names can have spaces or dashes - which mean that accessing using ..

Question 29

import pandas as pd

dataset = pd.read_csv("data.csv")
values = list(x for x in dataset["column name"])

>>> values[0]
'item_0'

edit:

actually, you can just index the dataset like any old array.

import pandas as pd

dataset = pd.read_csv("data.csv")
first_value = dataset["column name"][0]

>>> print(first_value)
'item_0'

Question 30

I want to read a .xlsx file using the Pandas Library of python and port the data to a postgreSQL table.

All I could do up until now is:

import pandas as pd
data = pd.ExcelFile("*File Name*")

Now I know that the step got executed successfully, but I want to know how i can parse the excel file that has been read so that I can understand how the data in the excel maps to the data in the variable data.
I learnt that data is a Dataframe object if I’m not wrong. So How do i parse this dataframe object to extract each line row by row.

Question 31

I usually create a dictionary containing a DataFrame for every sheet:

xl_file = pd.ExcelFile(file_name)

dfs = {sheet_name: xl_file.parse(sheet_name) 
          for sheet_name in xl_file.sheet_names}

Update: In pandas version 0.21.0+ you will get this behavior more cleanly by passing sheet_name=None to read_excel:

dfs = pd.read_excel(file_name, sheet_name=None)

In 0.20 and prior, this was sheetname rather than sheet_name (this is now deprecated in favor of the above):

dfs = pd.read_excel(file_name, sheetname=None)

Question 32

from pandas import read_excel
# find your sheet name at the bottom left of your excel file and assign 
# it to my_sheet 
my_sheet = 'Sheet1' # change it to your sheet name
file_name = 'products_and_categories.xlsx' # change it to the name of your excel file
df = read_excel(file_name, sheet_name = my_sheet)
print(df.head()) # shows headers with top 5 rows

Question 33

DataFrame’s read_excel method is like read_csv method:

dfs = pd.read_excel(xlsx_file, sheetname="sheet1")


Help on function read_excel in module pandas.io.excel:

read_excel(io, sheetname=0, header=0, skiprows=None, skip_footer=0, index_col=None, names=None, parse_cols=None, parse_dates=False, date_parser=None, na_values=None, thousands=None, convert_float=True, has_index_names=None, converters=None, true_values=None, false_values=None, engine=None, squeeze=False, **kwds)
    Read an Excel table into a pandas DataFrame

    Parameters
    ----------
    io : string, path object (pathlib.Path or py._path.local.LocalPath),
        file-like object, pandas ExcelFile, or xlrd workbook.
        The string could be a URL. Valid URL schemes include http, ftp, s3,
        and file. For file URLs, a host is expected. For instance, a local
        file could be file://localhost/path/to/workbook.xlsx
    sheetname : string, int, mixed list of strings/ints, or None, default 0

        Strings are used for sheet names, Integers are used in zero-indexed
        sheet positions.

        Lists of strings/integers are used to request multiple sheets.

        Specify None to get all sheets.

        str|int -> DataFrame is returned.
        list|None -> Dict of DataFrames is returned, with keys representing
        sheets.

        Available Cases

        * Defaults to 0 -> 1st sheet as a DataFrame
        * 1 -> 2nd sheet as a DataFrame
        * "Sheet1" -> 1st sheet as a DataFrame
        * [0,1,"Sheet5"] -> 1st, 2nd & 5th sheet as a dictionary of DataFrames
        * None -> All sheets as a dictionary of DataFrames

    header : int, list of ints, default 0
        Row (0-indexed) to use for the column labels of the parsed
        DataFrame. If a list of integers is passed those row positions will
        be combined into a ``MultiIndex``
    skiprows : list-like
        Rows to skip at the beginning (0-indexed)
    skip_footer : int, default 0
        Rows at the end to skip (0-indexed)
    index_col : int, list of ints, default None
        Column (0-indexed) to use as the row labels of the DataFrame.
        Pass None if there is no such column.  If a list is passed,
        those columns will be combined into a ``MultiIndex``
    names : array-like, default None
        List of column names to use. If file contains no header row,
        then you should explicitly pass header=None
    converters : dict, default None
        Dict of functions for converting values in certain columns. Keys can
        either be integers or column labels, values are functions that take one
        input argument, the Excel cell content, and return the transformed
        content.
    true_values : list, default None
        Values to consider as True

        .. versionadded:: 0.19.0

    false_values : list, default None
        Values to consider as False

        .. versionadded:: 0.19.0

    parse_cols : int or list, default None
        * If None then parse all columns,
        * If int then indicates last column to be parsed
        * If list of ints then indicates list of column numbers to be parsed
        * If string then indicates comma separated list of column names and
          column ranges (e.g. "A:E" or "A,C,E:F")
    squeeze : boolean, default False
        If the parsed data only contains one column then return a Series
    na_values : scalar, str, list-like, or dict, default None
        Additional strings to recognize as NA/NaN. If dict passed, specific
        per-column NA values. By default the following values are interpreted
        as NaN: '', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan',
    '1.#IND', '1.#QNAN', 'N/A', 'NA', 'NULL', 'NaN', 'nan'.
    thousands : str, default None
        Thousands separator for parsing string columns to numeric.  Note that
        this parameter is only necessary for columns stored as TEXT in Excel,
        any numeric columns will automatically be parsed, regardless of display
        format.
    keep_default_na : bool, default True
        If na_values are specified and keep_default_na is False the default NaN
        values are overridden, otherwise they're appended to.
    verbose : boolean, default False
        Indicate number of NA values placed in non-numeric columns
    engine: string, default None
        If io is not a buffer or path, this must be set to identify io.
        Acceptable values are None or xlrd
    convert_float : boolean, default True
        convert integral floats to int (i.e., 1.0 --> 1). If False, all numeric
        data will be read in as floats: Excel stores all numbers as floats
        internally
    has_index_names : boolean, default None
        DEPRECATED: for version 0.17+ index names will be automatically
        inferred based on index_col.  To read Excel output from 0.16.2 and
        prior that had saved index names, use True.

    Returns
    -------
    parsed : DataFrame or Dict of DataFrames
        DataFrame from the passed in Excel file.  See notes in sheetname
        argument for more information on when a Dict of Dataframes is returned.

Question 34

Instead of using a sheet name, in case you don’t know or can’t open the excel file to check in ubuntu (in my case, Python 3.6.7, ubuntu 18.04), I use the parameter index_col (index_col=0 for the first sheet)

import pandas as pd
file_name = 'some_data_file.xlsx' 
df = pd.read_excel(file_name, index_col=0)
print(df.head()) # print the first 5 rows

Question 35

Assign spreadsheet filename to file

Load spreadsheet

Print the sheet names

Load a sheet into a DataFrame by name: df1

file = 'example.xlsx'
xl = pd.ExcelFile(file)
print(xl.sheet_names)
df1 = xl.parse('Sheet1')

Question 36

If you use read_excel() on a file opened using the function open(), make sure to add rb to the open function to avoid encoding errors

Question 37

创建给定大小的零填充熊猫数据框的最佳方法是什么？

我用过了：

zero_data = np.zeros(shape=(len(data),len(feature_list)))
d = pd.DataFrame(zero_data, columns=feature_list)

有更好的方法吗？

Question 38

What is the best way to create a zero-filled pandas data frame of a given size?

I have used:

zero_data = np.zeros(shape=(len(data),len(feature_list)))
d = pd.DataFrame(zero_data, columns=feature_list)

Is there a better way to do it?

Question 39

您可以尝试以下方法：

d = pd.DataFrame(0, index=np.arange(len(data)), columns=feature_list)

Question 40

You can try this:

d = pd.DataFrame(0, index=np.arange(len(data)), columns=feature_list)

Question 41

我认为最好用numpy做到这一点

import numpy as np
import pandas as pd
d = pd.DataFrame(np.zeros((N_rows, N_cols)))

Question 42

It’s best to do this with numpy in my opinion

import numpy as np
import pandas as pd
d = pd.DataFrame(np.zeros((N_rows, N_cols)))

Question 43

类似于@Shravan，但不使用numpy：

  height = 10
  width = 20
  df_0 = pd.DataFrame(0, index=range(height), columns=range(width))

然后，您可以使用它做任何您想做的事情：

post_instantiation_fcn = lambda x: str(x)
df_ready_for_whatever = df_0.applymap(post_instantiation_fcn)

Question 44

Similar to @Shravan, but without the use of numpy:

  height = 10
  width = 20
  df_0 = pd.DataFrame(0, index=range(height), columns=range(width))

Then you can do whatever you want with it:

post_instantiation_fcn = lambda x: str(x)
df_ready_for_whatever = df_0.applymap(post_instantiation_fcn)

Question 45

如果您希望新数据框具有与现有数据框相同的索引和列，则可以将现有数据框乘以零：

df_zeros = df * 0

Question 46

If you would like the new data frame to have the same index and columns as an existing data frame, you can just multiply the existing data frame by zero:

df_zeros = df * 0

Question 47

如果您已经有一个数据框，这是最快的方法：

In [1]: columns = ["col{}".format(i) for i in range(10)]
In [2]: orig_df = pd.DataFrame(np.ones((10, 10)), columns=columns)
In [3]: %timeit d = pd.DataFrame(np.zeros_like(orig_df), index=orig_df.index, columns=orig_df.columns)
10000 loops, best of 3: 60.2 µs per loop

相比于：

In [4]: %timeit d = pd.DataFrame(0, index = np.arange(10), columns=columns)
10000 loops, best of 3: 110 µs per loop

In [5]: temp = np.zeros((10, 10))
In [6]: %timeit d = pd.DataFrame(temp, columns=columns)
10000 loops, best of 3: 95.7 µs per loop

Question 48

If you already have a dataframe, this is the fastest way:

In [1]: columns = ["col{}".format(i) for i in range(10)]
In [2]: orig_df = pd.DataFrame(np.ones((10, 10)), columns=columns)
In [3]: %timeit d = pd.DataFrame(np.zeros_like(orig_df), index=orig_df.index, columns=orig_df.columns)
10000 loops, best of 3: 60.2 µs per loop

Compare to:

In [4]: %timeit d = pd.DataFrame(0, index = np.arange(10), columns=columns)
10000 loops, best of 3: 110 µs per loop

In [5]: temp = np.zeros((10, 10))
In [6]: %timeit d = pd.DataFrame(temp, columns=columns)
10000 loops, best of 3: 95.7 µs per loop

Question 49

假设有一个模板DataFrame，要在此处填充零值进行复制…

如果您的数据集中没有NaN，那么乘以零可能会更快：

In [19]: columns = ["col{}".format(i) for i in xrange(3000)]                                                                                       

In [20]: indices = xrange(2000)

In [21]: orig_df = pd.DataFrame(42.0, index=indices, columns=columns)

In [22]: %timeit d = pd.DataFrame(np.zeros_like(orig_df), index=orig_df.index, columns=orig_df.columns)
100 loops, best of 3: 12.6 ms per loop

In [23]: %timeit d = orig_df * 0.0
100 loops, best of 3: 7.17 ms per loop

改进取决于DataFrame的大小，但从未发现它会变慢。

只是为了它：

In [24]: %timeit d = orig_df * 0.0 + 1.0
100 loops, best of 3: 13.6 ms per loop

In [25]: %timeit d = pd.eval('orig_df * 0.0 + 1.0')
100 loops, best of 3: 8.36 ms per loop

但：

In [24]: %timeit d = orig_df.copy()
10 loops, best of 3: 24 ms per loop

编辑！！！

假设您有一个使用float64的框架，那么这将是最快的！通过将0.0替换为所需的填充编号，它还可以生成任何值。

In [23]: %timeit d = pd.eval('orig_df > 1.7976931348623157e+308 + 0.0')
100 loops, best of 3: 3.68 ms per loop

根据口味的不同，可以从外部定义nan，并做出通用的解决方案，而与特定的浮点类型无关：

In [39]: nan = np.nan
In [40]: %timeit d = pd.eval('orig_df > nan + 0.0')
100 loops, best of 3: 4.39 ms per loop

Question 50

Assuming having a template DataFrame, which one would like to copy with zero values filled here…

If you have no NaNs in your data set, multiplying by zero can be significantly faster:

In [19]: columns = ["col{}".format(i) for i in xrange(3000)]                                                                                       

In [20]: indices = xrange(2000)

In [21]: orig_df = pd.DataFrame(42.0, index=indices, columns=columns)

In [22]: %timeit d = pd.DataFrame(np.zeros_like(orig_df), index=orig_df.index, columns=orig_df.columns)
100 loops, best of 3: 12.6 ms per loop

In [23]: %timeit d = orig_df * 0.0
100 loops, best of 3: 7.17 ms per loop

Improvement depends on DataFrame size, but never found it slower.

And just for the heck of it:

In [24]: %timeit d = orig_df * 0.0 + 1.0
100 loops, best of 3: 13.6 ms per loop

In [25]: %timeit d = pd.eval('orig_df * 0.0 + 1.0')
100 loops, best of 3: 8.36 ms per loop

But:

In [24]: %timeit d = orig_df.copy()
10 loops, best of 3: 24 ms per loop

EDIT!!!

Assuming you have a frame using float64, this will be the fastest by a huge margin! It is also able to generate any value by replacing 0.0 to the desired fill number.

In [23]: %timeit d = pd.eval('orig_df > 1.7976931348623157e+308 + 0.0')
100 loops, best of 3: 3.68 ms per loop

Depending on taste, one can externally define nan, and do a general solution, irrespective of the particular float type:

In [39]: nan = np.nan
In [40]: %timeit d = pd.eval('orig_df > nan + 0.0')
100 loops, best of 3: 4.39 ms per loop

Question 51

以此为起点：

a = [['10', '1.2', '4.2'], ['15', '70', '0.03'], ['8', '5', '0']]
df = pd.DataFrame(a, columns=['one', 'two', 'three'])

Out[8]: 
  one  two three
0   10  1.2   4.2
1   15  70   0.03
2    8   5     0

我想if在熊猫中使用类似声明的内容。

if df['one'] >= df['two'] and df['one'] <= df['three']:
    df['que'] = df['one']

基本上，通过if语句检查每一行，然后创建新列。

文档说要使用，.all但没有示例…

Question 52

Using this as a starting point:

a = [['10', '1.2', '4.2'], ['15', '70', '0.03'], ['8', '5', '0']]
df = pd.DataFrame(a, columns=['one', 'two', 'three'])

Out[8]: 
  one  two three
0   10  1.2   4.2
1   15  70   0.03
2    8   5     0

I want to use something like an if statement within pandas.

if df['one'] >= df['two'] and df['one'] <= df['three']:
    df['que'] = df['one']

Basically, check each row via the if statement, create new column.

The docs say to use .all but there is no example…

Question 53

您可以使用np.where。如果cond是布尔数组，A并且B是数组，则

C = np.where(cond, A, B)

将C定义为等于A哪里cond为True，B哪里cond为False。

import numpy as np
import pandas as pd

a = [['10', '1.2', '4.2'], ['15', '70', '0.03'], ['8', '5', '0']]
df = pd.DataFrame(a, columns=['one', 'two', 'three'])

df['que'] = np.where((df['one'] >= df['two']) & (df['one'] <= df['three'])
                     , df['one'], np.nan)

Yield

  one  two three  que
0  10  1.2   4.2   10
1  15   70  0.03  NaN
2   8    5     0  NaN

如果您有多个条件，则可以使用np.select代替。例如，如果你想df['que']等于df['two']时df['one'] < df['two']，则

conditions = [
    (df['one'] >= df['two']) & (df['one'] <= df['three']), 
    df['one'] < df['two']]

choices = [df['one'], df['two']]

df['que'] = np.select(conditions, choices, default=np.nan)

Yield

  one  two three  que
0  10  1.2   4.2   10
1  15   70  0.03   70
2   8    5     0  NaN

如果我们可以假设df['one'] >= df['two']whendf['one'] < df['two']为False，那么条件和选择可以简化为

conditions = [
    df['one'] < df['two'],
    df['one'] <= df['three']]

choices = [df['two'], df['one']]

（如果包含df['one']或df['two']包含NaN，则该假设可能不正确。）

注意

a = [['10', '1.2', '4.2'], ['15', '70', '0.03'], ['8', '5', '0']]
df = pd.DataFrame(a, columns=['one', 'two', 'three'])

用字符串值定义一个DataFrame。由于它们看起来是数字，因此最好将这些字符串转换为浮点数：

df2 = df.astype(float)

但是，这会改变结果，因为字符串会逐个字符进行比较，而浮点数会进行数字比较。

In [61]: '10' <= '4.2'
Out[61]: True

In [62]: 10 <= 4.2
Out[62]: False

Question 54

You could use np.where. If cond is a boolean array, and A and B are arrays, then

C = np.where(cond, A, B)

defines C to be equal to A where cond is True, and B where cond is False.

import numpy as np
import pandas as pd

a = [['10', '1.2', '4.2'], ['15', '70', '0.03'], ['8', '5', '0']]
df = pd.DataFrame(a, columns=['one', 'two', 'three'])

df['que'] = np.where((df['one'] >= df['two']) & (df['one'] <= df['three'])
                     , df['one'], np.nan)

yields

  one  two three  que
0  10  1.2   4.2   10
1  15   70  0.03  NaN
2   8    5     0  NaN

If you have more than one condition, then you could use np.select instead. For example, if you wish df['que'] to equal df['two'] when df['one'] < df['two'], then

conditions = [
    (df['one'] >= df['two']) & (df['one'] <= df['three']), 
    df['one'] < df['two']]

choices = [df['one'], df['two']]

df['que'] = np.select(conditions, choices, default=np.nan)

yields

  one  two three  que
0  10  1.2   4.2   10
1  15   70  0.03   70
2   8    5     0  NaN

If we can assume that df['one'] >= df['two'] when df['one'] < df['two'] is False, then the conditions and choices could be simplified to

conditions = [
    df['one'] < df['two'],
    df['one'] <= df['three']]

choices = [df['two'], df['one']]

(The assumption may not be true if df['one'] or df['two'] contain NaNs.)

Note that

a = [['10', '1.2', '4.2'], ['15', '70', '0.03'], ['8', '5', '0']]
df = pd.DataFrame(a, columns=['one', 'two', 'three'])

defines a DataFrame with string values. Since they look numeric, you might be better off converting those strings to floats:

df2 = df.astype(float)

This changes the results, however, since strings compare character-by-character, while floats are compared numerically.

In [61]: '10' <= '4.2'
Out[61]: True

In [62]: 10 <= 4.2
Out[62]: False

Question 55

您可以将其.equals用于列或整个数据框。

df['col1'].equals(df['col2'])

如果它们相等，则该语句将返回Trueelse False。

Question 56

You can use .equals for columns or entire dataframes.

df['col1'].equals(df['col2'])

If they’re equal, that statement will return True, else False.

Question 57

您可以使用apply（）并执行类似的操作

df['que'] = df.apply(lambda x : x['one'] if x['one'] >= x['two'] and x['one'] <= x['three'] else "", axis=1)

或者如果您不想使用lambda

def que(x):
    if x['one'] >= x['two'] and x['one'] <= x['three']:
        return x['one']
    return ''
df['que'] = df.apply(que, axis=1)

Question 58

You could use apply() and do something like this

df['que'] = df.apply(lambda x : x['one'] if x['one'] >= x['two'] and x['one'] <= x['three'] else "", axis=1)

or if you prefer not to use a lambda

def que(x):
    if x['one'] >= x['two'] and x['one'] <= x['three']:
        return x['one']
    return ''
df['que'] = df.apply(que, axis=1)

Question 59

一种方法是使用布尔序列对列进行索引df['one']。这将为您提供一个新列，其中的True条目与相同的行具有相同的值，df['one']并且这些False值为NaN。

布尔级数仅由您的if语句给出（尽管必须使用&代替and）：

>>> df['que'] = df['one'][(df['one'] >= df['two']) & (df['one'] <= df['three'])]
>>> df
    one two three   que
0   10  1.2 4.2      10
1   15  70  0.03    NaN
2   8   5   0       NaN

如果您希望将这些NaN值替换为其他值，则可以使用fillna新列上的方法que。我用的0不是这里的空字符串：

>>> df['que'] = df['que'].fillna(0)
>>> df
    one two three   que
0   10  1.2   4.2    10
1   15   70  0.03     0
2    8    5     0     0

Question 60

One way is to use a Boolean series to index the column df['one']. This gives you a new column where the True entries have the same value as the same row as df['one'] and the False values are NaN.

The Boolean series is just given by your if statement (although it is necessary to use & instead of and):

>>> df['que'] = df['one'][(df['one'] >= df['two']) & (df['one'] <= df['three'])]
>>> df
    one two three   que
0   10  1.2 4.2      10
1   15  70  0.03    NaN
2   8   5   0       NaN

If you want the NaN values to be replaced by other values, you can use the fillna method on the new column que. I’ve used 0 instead of the empty string here:

>>> df['que'] = df['que'].fillna(0)
>>> df
    one two three   que
0   10  1.2   4.2    10
1   15   70  0.03     0
2    8    5     0     0

Question 61

将每个条件括在括号中，然后使用&运算符组合条件：

df.loc[(df['one'] >= df['two']) & (df['one'] <= df['three']), 'que'] = df['one']

您可以仅使用~（“ not”运算符）来反转匹配项来填充不匹配的行：

df.loc[~ ((df['one'] >= df['two']) & (df['one'] <= df['three'])), 'que'] = ''

您需要使用&和~而不是and和，not因为&和~运算符是逐个元素地工作的。

最终结果：

df
Out[8]: 
  one  two three que
0  10  1.2   4.2  10
1  15   70  0.03    
2   8    5     0

Question 62

Wrap each individual condition in parentheses, and then use the & operator to combine the conditions:

df.loc[(df['one'] >= df['two']) & (df['one'] <= df['three']), 'que'] = df['one']

You can fill the non-matching rows by just using ~ (the “not” operator) to invert the match:

df.loc[~ ((df['one'] >= df['two']) & (df['one'] <= df['three'])), 'que'] = ''

You need to use & and ~ rather than and and not because the & and ~ operators work element-by-element.

The final result:

df
Out[8]: 
  one  two three que
0  10  1.2   4.2  10
1  15   70  0.03    
2   8    5     0

Question 63

使用np.select，如果你必须从数据帧和输出特定的选择在不同的列中选中多个条件

conditions=[(condition1),(condition2)]
choices=["choice1","chocie2"]

df["new column"]=np.select=(condtion,choice,default=)

注意：没有条件，没有选择项应该匹配，如果对于两个不同的条件您有相同的选择，请重复选择文本

Question 64

Use np.select if you have multiple conditions to be checked from the dataframe and output a specific choice in a different column

conditions=[(condition1),(condition2)]
choices=["choice1","chocie2"]

df["new column"]=np.select=(condtion,choice,default=)

Note: No of conditions and no of choices should match, repeat text in choice if for two different conditions you have same choices

问题：Python Pandas将列表插入单元格

编辑

EDIT

回答 0

回答 1

回答 2

熊猫> = 0.21

使用at/ 设置单元格值iat

如果得到了ValueError: setting an array element with a sequence怎么办？

可能，但是哈基

Pandas >= 0.21

Setting Cell Values with at/iat

What if I get ValueError: setting an array element with a sequence?

Possible, But Hacky

回答 3

回答 4

回答 5

问题：熊猫：选择所有名称以X开头的列的最佳方法

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

回答 6

回答 7

问题：如何从Pandas DataFrame获取值而不是索引和对象类型

回答 0

回答 1

回答 2

问题：如何使用iPython中的pandas库读取.xlsx文件？

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

问题：创建零填充的熊猫数据框

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

问题：使用熊猫比较两列

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

回答 6

问题：Python Pandas-查找两个数据框之间的差异

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

回答 6

回答 7

回答 8

回答 9

问题：在Jupyter Python Notebook中显示所有数据框列

回答 0

回答 1

回答 2

适用于大型（但不是太大）DataFrame的Python 3.x

Python 3.x for large (but not too large) DataFrames

回答 3

回答 4

回答 5

问题：’DataFrame’对象没有属性’sort’

回答 0

回答 1

熊猫排序101

按单列排序

按多列排序

按DataFrame索引排序

按指数列表排序

使用`at`/ 设置单元格值`iat`

如果得到了`ValueError: setting an array element with a sequence`怎么办？

Setting Cell Values with `at`/`iat`

What if I get `ValueError: setting an array element with a sequence`?