Python 实用宝典

Question 1

我正在使用pandas库读取一些CSV数据。在我的数据中，某些列包含字符串。该字符串"nan"是一个可能的值，一个空字符串也可以。我设法让大熊猫将“ nan”读取为字符串，但是我不知道如何获取不读取空值的NaN。这是示例数据和输出

One,Two,Three
a,1,one
b,2,two
,3,three
d,4,nan
e,5,five
nan,6,
g,7,seven

>>> pandas.read_csv('test.csv', na_values={'One': [], "Three": []})
    One  Two  Three
0    a    1    one
1    b    2    two
2  NaN    3  three
3    d    4    nan
4    e    5   five
5  nan    6    NaN
6    g    7  seven

它正确地写着“男”为字符串“南”，但仍读取空单元格作为NaN的。我想传递str的converters参数read_csv（带converters={'One': str})），但它仍然读取空单元格作为NaN的。

我意识到我可以在读取后使用fillna填充值，但是真的没有办法告诉熊猫特定CSV列中的空单元格应被读取为空字符串而不是NaN吗？

Question 2

I’m using the pandas library to read in some CSV data. In my data, certain columns contain strings. The string "nan" is a possible value, as is an empty string. I managed to get pandas to read “nan” as a string, but I can’t figure out how to get it not to read an empty value as NaN. Here’s sample data and output

One,Two,Three
a,1,one
b,2,two
,3,three
d,4,nan
e,5,five
nan,6,
g,7,seven

>>> pandas.read_csv('test.csv', na_values={'One': [], "Three": []})
    One  Two  Three
0    a    1    one
1    b    2    two
2  NaN    3  three
3    d    4    nan
4    e    5   five
5  nan    6    NaN
6    g    7  seven

It correctly reads “nan” as the string “nan’, but still reads the empty cells as NaN. I tried passing in str in the converters argument to read_csv (with converters={'One': str})), but it still reads the empty cells as NaN.

I realize I can fill the values after reading, with fillna, but is there really no way to tell pandas that an empty cell in a particular CSV column should be read as an empty string instead of NaN?

Question 3

我添加了票证以在此处添加某种选项：

https://github.com/pydata/pandas/issues/1450

同时，result.fillna('')应该做你想做的

编辑：在开发版本中（最终为0.8.0），如果您指定的空列表na_values，则空字符串将在结果中保留空字符串

Question 4

I added a ticket to add an option of some sort here:

https://github.com/pydata/pandas/issues/1450

In the meantime, result.fillna('') should do what you want

EDIT: in the development version (to be 0.8.0 final) if you specify an empty list of na_values, empty strings will stay empty strings in the result

Question 5

阅读其他答案和评论后，我仍然感到困惑。但是，现在的答案似乎更简单，因此您可以开始操作。

从Pandas 0.9版（自2012年起）开始，您只需设置keep_default_na=False以下内容，即可读取解释为空字符串的空单元格的csv ：

pd.read_csv('test.csv', keep_default_na=False)

此问题在以下内容中有更清楚的说明

read_csv中更一致的na_values处理·问题＃1657·pandas-dev / pandas

该问题已于2012年8月19日在Pandas 0.9版中修复

错误：更一致的na_values＃1657·pandas-dev / pandas @ d9abf68

Question 6

I was still confused after reading the other answers and comments. But the answer now seems simpler, so here you go.

Since Pandas version 0.9 (from 2012), you can read your csv with empty cells interpreted as empty strings by simply setting keep_default_na=False:

pd.read_csv('test.csv', keep_default_na=False)

This issue is more clearly explained in

More consistent na_values handling in read_csv · Issue #1657 · pandas-dev/pandas

That was fixed on on Aug 19, 2012 for Pandas version 0.9 in

BUG: more consistent na_values #1657 · pandas-dev/pandas@d9abf68

Question 7

我们在Pandas read_csv中有一个简单的参数：

使用：

df = pd.read_csv('test.csv', na_filter= False)

熊猫的文档清楚地解释了上述论点是如何工作的。

链接

Question 8

We have a simple argument in Pandas read_csv for this:

Use:

df = pd.read_csv('test.csv', na_filter= False)

Pandas documentation clearly explains how the above argument works.

Link

Question 9

采取以下数据框架：

x = np.tile(np.arange(3),3)
y = np.repeat(np.arange(3),3)
df = pd.DataFrame({"x": x, "y": y})

我需要x首先对其进行排序，然后仅需按其进行排序y：

df2 = df.sort(["x", "y"])

如何更改索引，使其再次上升。即我怎么得到这个：

我尝试了以下方法。不幸的是，它根本不会改变索引：

df2.reindex(np.arange(len(df2.index)))

Question 10

Take the following data-frame:

x = np.tile(np.arange(3),3)
y = np.repeat(np.arange(3),3)
df = pd.DataFrame({"x": x, "y": y})

I need to sort it by x first, and only second by y:

df2 = df.sort(["x", "y"])

How can I change the index such that it is ascending again. I.e. how do I get this:

I have tried the following. Unfortunately, it doesn’t change the index at all:

df2.reindex(np.arange(len(df2.index)))

Question 11

您可以使用来重置索引，reset_index以获取默认索引0、1、2，…，n-1（并用于drop=True指示您要删除现有索引，而不是将其作为附加列添加到数据框中）。：

In [19]: df2 = df2.reset_index(drop=True)

In [20]: df2
Out[20]:
   x  y
0  0  0
1  0  1
2  0  2
3  1  0
4  1  1
5  1  2
6  2  0
7  2  1
8  2  2

Question 12

You can reset the index using reset_index to get back a default index of 0, 1, 2, …, n-1 (and use drop=True to indicate you want to drop the existing index instead of adding it as an additional column to your dataframe):

In [19]: df2 = df2.reset_index(drop=True)

In [20]: df2
Out[20]:
   x  y
0  0  0
1  0  1
2  0  2
3  1  0
4  1  1
5  1  2
6  2  0
7  2  1
8  2  2

Question 13

df.sort()已弃用，请使用df.sort_values(...)：https : //pandas.pydata.org/pandas-docs/stable/generation/pandas.DataFrame.sort_values.html

然后按照乔里斯的回答做 df.reset_index(drop=True)

Question 14

df.sort() is deprecated, use df.sort_values(...): https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html

Then follow joris’ answer by doing df.reset_index(drop=True)

Question 15

由于pandas 1.0.0df.sort_values具有一个新参数ignore_index，可以满足您的实际需要：

In [1]: df2 = df.sort_values(by=['x','y'],ignore_index=True)

In [2]: df2
Out[2]:
   x  y
0  0  0
1  0  1
2  0  2
3  1  0
4  1  1
5  1  2
6  2  0
7  2  1
8  2  2

Question 16

Since pandas 1.0.0 df.sort_values has a new parameter ignore_index which does exactly what you need:

In [1]: df2 = df.sort_values(by=['x','y'],ignore_index=True)

In [2]: df2
Out[2]:
   x  y
0  0  0
1  0  1
2  0  2
3  1  0
4  1  1
5  1  2
6  2  0
7  2  1
8  2  2

Question 17

您可以使用来设置新索引set_index：

df2.set_index(np.arange(len(df2.index)))

输出：

Question 18

You can set new indices by using set_index:

df2.set_index(np.arange(len(df2.index)))

Output:

Question 19

我有一个列的数据帧A，B。我需要创建一个列C，以便为每个记录/行：

C = max(A, B)。

我应该怎么做呢？

Question 20

I have a dataframe with columns A,B. I need to create a column C such that for every record / row:

C = max(A, B).

How should I go about doing this?

Question 21

您可以这样获得最大值：

>>> import pandas as pd
>>> df = pd.DataFrame({"A": [1,2,3], "B": [-2, 8, 1]})
>>> df
   A  B
0  1 -2
1  2  8
2  3  1
>>> df[["A", "B"]]
   A  B
0  1 -2
1  2  8
2  3  1
>>> df[["A", "B"]].max(axis=1)
0    1
1    8
2    3

所以：

>>> df["C"] = df[["A", "B"]].max(axis=1)
>>> df
   A  B  C
0  1 -2  1
1  2  8  8
2  3  1  3

如果您知道“ A”和“ B”是唯一的列，那么您甚至可以逃脱

>>> df["C"] = df.max(axis=1)

.apply(max, axis=1)我猜你也可以使用。

Question 22

You can get the maximum like this:

>>> import pandas as pd
>>> df = pd.DataFrame({"A": [1,2,3], "B": [-2, 8, 1]})
>>> df
   A  B
0  1 -2
1  2  8
2  3  1
>>> df[["A", "B"]]
   A  B
0  1 -2
1  2  8
2  3  1
>>> df[["A", "B"]].max(axis=1)
0    1
1    8
2    3

and so:

>>> df["C"] = df[["A", "B"]].max(axis=1)
>>> df
   A  B  C
0  1 -2  1
1  2  8  8
2  3  1  3

If you know that “A” and “B” are the only columns, you could even get away with

>>> df["C"] = df.max(axis=1)

And you could use .apply(max, axis=1) too, I guess.

Question 23

在几乎所有正常情况下，@ DSM的答案都很好。但是，如果您是想比表面层次更深入一点的程序员类型，那么您可能想知道，在基础数组.to_numpy()（.values对于<0.24）上调用numpy函数要快一些，而不是直接调用调用在DataFrame / Series对象上定义的（cythonized）函数。

例如，您可以ndarray.max()沿第一个轴使用。

# Data borrowed from @DSM's post.
df = pd.DataFrame({"A": [1,2,3], "B": [-2, 8, 1]})
df
   A  B
0  1 -2
1  2  8
2  3  1

df['C'] = df[['A', 'B']].values.max(1)
# Or, assuming "A" and "B" are the only columns, 
# df['C'] = df.values.max(1) 
df

   A  B  C
0  1 -2  1
1  2  8  8
2  3  1  3

如果您的数据包含NaN，则将需要numpy.nanmax：

df['C'] = np.nanmax(df.values, axis=1)
df

   A  B  C
0  1 -2  1
1  2  8  8
2  3  1  3

您也可以使用numpy.maximum.reduce。numpy.maximum是ufunc（通用函数），每个ufunc都有reduce：

df['C'] = np.maximum.reduce(df['A', 'B']].values, axis=1)
# df['C'] = np.maximum.reduce(df[['A', 'B']], axis=1)
# df['C'] = np.maximum.reduce(df, axis=1)
df

   A  B  C
0  1 -2  1
1  2  8  8
2  3  1  3

np.maximum.reduce并且np.max看起来或多或少是相同的（对于大多数标准尺寸的DataFrame），并且阴影的速度比快DataFrame.max。我认为这种差异大致保持不变，并且是由于内部开销（索引对齐，处理NaN等）引起的。

该图是使用perfplot生成的。基准测试代码，以供参考：

import pandas as pd
import perfplot

np.random.seed(0)
df_ = pd.DataFrame(np.random.randn(5, 1000))

perfplot.show(
    setup=lambda n: pd.concat([df_] * n, ignore_index=True),
    kernels=[
        lambda df: df.assign(new=df.max(axis=1)),
        lambda df: df.assign(new=df.values.max(1)),
        lambda df: df.assign(new=np.nanmax(df.values, axis=1)),
        lambda df: df.assign(new=np.maximum.reduce(df.values, axis=1)),
    ],
    labels=['df.max', 'np.max', 'np.maximum.reduce', 'np.nanmax'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N (* len(df))',
    logx=True,
    logy=True)

Question 24

@DSM’s answer is perfectly fine in almost any normal scenario. But if you’re the type of programmer who wants to go a little deeper than the surface level, you might be interested to know that it is a little faster to call numpy functions on the underlying .to_numpy() (or .values for <0.24) array instead of directly calling the (cythonized) functions defined on the DataFrame/Series objects.

For example, you can use ndarray.max() along the first axis.

# Data borrowed from @DSM's post.
df = pd.DataFrame({"A": [1,2,3], "B": [-2, 8, 1]})
df
   A  B
0  1 -2
1  2  8
2  3  1

df['C'] = df[['A', 'B']].values.max(1)
# Or, assuming "A" and "B" are the only columns, 
# df['C'] = df.values.max(1) 
df

   A  B  C
0  1 -2  1
1  2  8  8
2  3  1  3

If your data has NaNs, you will need numpy.nanmax:

df['C'] = np.nanmax(df.values, axis=1)
df

   A  B  C
0  1 -2  1
1  2  8  8
2  3  1  3

You can also use numpy.maximum.reduce. numpy.maximum is a ufunc (Universal Function), and every ufunc has a reduce:

df['C'] = np.maximum.reduce(df['A', 'B']].values, axis=1)
# df['C'] = np.maximum.reduce(df[['A', 'B']], axis=1)
# df['C'] = np.maximum.reduce(df, axis=1)
df

   A  B  C
0  1 -2  1
1  2  8  8
2  3  1  3

np.maximum.reduce and np.max appear to be more or less the same (for most normal sized DataFrames)—and happen to be a shade faster than DataFrame.max. I imagine this difference roughly remains constant, and is due to internal overhead (indexing alignment, handling NaNs, etc).

The graph was generated using perfplot. Benchmarking code, for reference:

import pandas as pd
import perfplot

np.random.seed(0)
df_ = pd.DataFrame(np.random.randn(5, 1000))

perfplot.show(
    setup=lambda n: pd.concat([df_] * n, ignore_index=True),
    kernels=[
        lambda df: df.assign(new=df.max(axis=1)),
        lambda df: df.assign(new=df.values.max(1)),
        lambda df: df.assign(new=np.nanmax(df.values, axis=1)),
        lambda df: df.assign(new=np.maximum.reduce(df.values, axis=1)),
    ],
    labels=['df.max', 'np.max', 'np.maximum.reduce', 'np.nanmax'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N (* len(df))',
    logx=True,
    logy=True)

Question 25

I have a Data Frame column with numeric values:

df['percentage'].head()
46.5
44.2
100.0
42.12

I want to see the column as bin counts:

bins = [0, 1, 5, 10, 25, 50, 100]

How can I get the result as bins with their value counts?

[0, 1] bin amount
[1, 5] etc 
[5, 10] etc 
......

Question 26

You can use pandas.cut:

bins = [0, 1, 5, 10, 25, 50, 100]
df['binned'] = pd.cut(df['percentage'], bins)
print (df)
   percentage     binned
0       46.50   (25, 50]
1       44.20   (25, 50]
2      100.00  (50, 100]
3       42.12   (25, 50]

bins = [0, 1, 5, 10, 25, 50, 100]
labels = [1,2,3,4,5,6]
df['binned'] = pd.cut(df['percentage'], bins=bins, labels=labels)
print (df)
   percentage binned
0       46.50      5
1       44.20      5
2      100.00      6
3       42.12      5

Or numpy.searchsorted:

bins = [0, 1, 5, 10, 25, 50, 100]
df['binned'] = np.searchsorted(bins, df['percentage'].values)
print (df)
   percentage  binned
0       46.50       5
1       44.20       5
2      100.00       6
3       42.12       5

…and then value_counts or groupby and aggregate size:

s = pd.cut(df['percentage'], bins=bins).value_counts()
print (s)
(25, 50]     3
(50, 100]    1
(10, 25]     0
(5, 10]      0
(1, 5]       0
(0, 1]       0
Name: percentage, dtype: int64

s = df.groupby(pd.cut(df['percentage'], bins=bins)).size()
print (s)
percentage
(0, 1]       0
(1, 5]       0
(5, 10]      0
(10, 25]     0
(25, 50]     3
(50, 100]    1
dtype: int64

By default cut return categorical.

Series methods like Series.value_counts() will use all categories, even if some categories are not present in the data, operations in categorical.

Question 27

Using `numba` module for speed up.

On big datasets (500k >) pd.cut can be quite slow for binning data.

I wrote my own function in numba with just in time compilation, which is roughly 16x faster:

from numba import njit

@njit
def cut(arr):
    bins = np.empty(arr.shape[0])
    for idx, x in enumerate(arr):
        if (x >= 0) & (x < 1):
            bins[idx] = 1
        elif (x >= 1) & (x < 5):
            bins[idx] = 2
        elif (x >= 5) & (x < 10):
            bins[idx] = 3
        elif (x >= 10) & (x < 25):
            bins[idx] = 4
        elif (x >= 25) & (x < 50):
            bins[idx] = 5
        elif (x >= 50) & (x < 100):
            bins[idx] = 6
        else:
            bins[idx] = 7

    return bins

cut(df['percentage'].to_numpy())

# array([5., 5., 7., 5.])

Optional: you can also map it to bins as strings:

a = cut(df['percentage'].to_numpy())

conversion_dict = {1: 'bin1',
                   2: 'bin2',
                   3: 'bin3',
                   4: 'bin4',
                   5: 'bin5',
                   6: 'bin6',
                   7: 'bin7'}

bins = list(map(conversion_dict.get, a))

# ['bin5', 'bin5', 'bin7', 'bin5']

Speed comparison:

# create dataframe of 8 million rows for testing
dfbig = pd.concat([df]*2000000, ignore_index=True)

dfbig.shape

# (8000000, 1)

%%timeit
cut(dfbig['percentage'].to_numpy())

# 38 ms ± 616 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
bins = [0, 1, 5, 10, 25, 50, 100]
labels = [1,2,3,4,5,6]
pd.cut(dfbig['percentage'], bins=bins, labels=labels)

# 215 ms ± 9.76 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Question 28

To filter a dataframe (df) by a single column, if we consider data with male and females we might:

males = df[df[Gender]=='Male']

Question 1 – But what if the data spanned multiple years and i wanted to only see males for 2014?

In other languages I might do something like:

if A = "Male" and if B = "2014" then

(except I want to do this and get a subset of the original dataframe in a new dataframe object)

Question 2. How do I do this in a loop, and create a dataframe object for each unique sets of year and gender (i.e. a df for: 2013-Male, 2013-Female, 2014-Male, and 2014-Female

for y in year:

for g in gender:

df = .....

Question 29

Using & operator, don’t forget to wrap the sub-statements with ():

males = df[(df[Gender]=='Male') & (df[Year]==2014)]

To store your dataframes in a dict using a for loop:

from collections import defaultdict
dic={}
for g in ['male', 'female']:
  dic[g]=defaultdict(dict)
  for y in [2013, 2014]:
    dic[g][y]=df[(df[Gender]==g) & (df[Year]==y)] #store the DataFrames to a dict of dict

EDIT:

A demo for your getDF:

def getDF(dic, gender, year):
  return dic[gender][year]

print genDF(dic, 'male', 2014)

Question 30

For more general boolean functions that you would like to use as a filter and that depend on more than one column, you can use:

df = df[df[['col_1','col_2']].apply(lambda x: f(*x), axis=1)]

where f is a function that is applied to every pair of elements (x1, x2) from col_1 and col_2 and returns True or False depending on any condition you want on (x1, x2).

Question 31

Start from pandas 0.13, this is the most efficient way.

df.query('Gender=="Male" & Year=="2014" ')

Question 32

In case somebody wonders what is the faster way to filter (the accepted answer or the one from @redreamality):

import pandas as pd
import numpy as np

length = 100_000
df = pd.DataFrame()
df['Year'] = np.random.randint(1950, 2019, size=length)
df['Gender'] = np.random.choice(['Male', 'Female'], length)

%timeit df.query('Gender=="Male" & Year=="2014" ')
%timeit df[(df['Gender']=='Male') & (df['Year']==2014)]

Results for 100,000 rows:

6.67 ms ± 557 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.54 ms ± 536 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Results for 10,000,000 rows:

326 ms ± 6.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
472 ms ± 25.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So results depend on the size and the data. On my laptop, query() gets faster after 500k rows. Further, the string search in Year=="2014" has an unnecessary overhead (Year==2014 is faster).

Question 33

You can create your own filter function using query in pandas. Here you have filtering of df results by all the kwargs parameters. Dont’ forgot to add some validators(kwargs filtering) to get filter function for your own df.

def filter(df, **kwargs):
    query_list = []
    for key in kwargs.keys():
        query_list.append(f'{key}=="{kwargs[key]}"')
    query = ' & '.join(query_list)
    return df.query(query)

Question 34

You can filter by multiple columns (more than two) by using the np.logical_and operator to replace & (or np.logical_or to replace |)

Here’s an example function that does the job, if you provide target values for multiple fields. You can adapt it for different types of filtering and whatnot:

def filter_df(df, filter_values):
    """Filter df by matching targets for multiple columns.

    Args:
        df (pd.DataFrame): dataframe
        filter_values (None or dict): Dictionary of the form:
                `{<field>: <target_values_list>}`
            used to filter columns data.
    """
    import numpy as np
    if filter_values is None or not filter_values:
        return df
    return df[
        np.logical_and.reduce([
            df[column].isin(target_values) 
            for column, target_values in filter_values.items()
        ])
    ]

Usage:

df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [1, 2, 3, 4]})

filter_df(df, {
    'a': [1, 2, 3],
    'b': [1, 2, 4]
})

Question 35

我试图将DataFrame修改df为仅包含其列中的值在closing_price99到101之间的行，并尝试使用下面的代码执行此操作。

但是，我得到了错误

ValueError：系列的真值不明确。使用a.empty，a.bool（），a.item（），a.any（）或a.all（）

我想知道是否有一种方法可以不使用循环。

df = df[(99 <= df['closing_price'] <= 101)]

Question 36

I am trying to modify a DataFrame df to only contain rows for which the values in the column closing_price are between 99 and 101 and trying to do this with the code below.

However, I get the error

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()

and I am wondering if there is a way to do this without using loops.

df = df[(99 <= df['closing_price'] <= 101)]

Question 37

您应该使用()将布尔向量分组的方式来消除歧义。

df = df[(df['closing_price'] >= 99) & (df['closing_price'] <= 101)]

Question 38

You should use () to group your boolean vector to remove ambiguity.

df = df[(df['closing_price'] >= 99) & (df['closing_price'] <= 101)]

Question 39

还考虑以下系列：

df = df[df['closing_price'].between(99, 101)]

Question 40

Consider also series between:

df = df[df['closing_price'].between(99, 101)]

Question 41

还有一个更好的选择-使用query（）方法：

In [58]: df = pd.DataFrame({'closing_price': np.random.randint(95, 105, 10)})

In [59]: df
Out[59]:
   closing_price
0            104
1             99
2             98
3             95
4            103
5            101
6            101
7             99
8             95
9             96

In [60]: df.query('99 <= closing_price <= 101')
Out[60]:
   closing_price
1             99
5            101
6            101
7             99

更新：回答评论：

我喜欢这里的语法，但是在尝试与expresison结合使用时感到失望； df.query('(mean + 2 *sd) <= closing_price <=(mean + 2 *sd)')

In [161]: qry = "(closing_price.mean() - 2*closing_price.std())" +\
     ...:       " <= closing_price <= " + \
     ...:       "(closing_price.mean() + 2*closing_price.std())"
     ...:

In [162]: df.query(qry)
Out[162]:
   closing_price
0             97
1            101
2             97
3             95
4            100
5             99
6            100
7            101
8             99
9             95

Question 42

there is a nicer alternative – use query() method:

In [58]: df = pd.DataFrame({'closing_price': np.random.randint(95, 105, 10)})

In [59]: df
Out[59]:
   closing_price
0            104
1             99
2             98
3             95
4            103
5            101
6            101
7             99
8             95
9             96

In [60]: df.query('99 <= closing_price <= 101')
Out[60]:
   closing_price
1             99
5            101
6            101
7             99

UPDATE: answering the comment:

I like the syntax here but fell down when trying to combine with expresison; df.query('(mean + 2 *sd) <= closing_price <=(mean + 2 *sd)')

In [161]: qry = "(closing_price.mean() - 2*closing_price.std())" +\
     ...:       " <= closing_price <= " + \
     ...:       "(closing_price.mean() + 2*closing_price.std())"
     ...:

In [162]: df.query(qry)
Out[162]:
   closing_price
0             97
1            101
2             97
3             95
4            100
5             99
6            100
7            101
8             99
9             95

Question 43

您也可以使用.between()方法

emp = pd.read_csv("C:\\py\\programs\\pandas_2\\pandas\\employees.csv")

emp[emp["Salary"].between(60000, 61000)]

输出量

Question 44

you can also use .between() method

emp = pd.read_csv("C:\\py\\programs\\pandas_2\\pandas\\employees.csv")

emp[emp["Salary"].between(60000, 61000)]

Output

Question 45

newdf = df.query('closing_price.mean() <= closing_price <= closing_price.std()')

要么

mean = closing_price.mean()
std = closing_price.std()

newdf = df.query('@mean <= closing_price <= @std')

Question 46

newdf = df.query('closing_price.mean() <= closing_price <= closing_price.std()')

or

mean = closing_price.mean()
std = closing_price.std()

newdf = df.query('@mean <= closing_price <= @std')

Question 47

如果您要处理多个值和多个输入，则还可以设置这样的apply函数。在这种情况下，为落在特定范围内的GPS位置过滤数据帧。

def filter_values(lat,lon):
    if abs(lat - 33.77) < .01 and abs(lon - -118.16) < .01:
        return True
    elif abs(lat - 37.79) < .01 and abs(lon - -122.39) < .01:
        return True
    else:
        return False


df = df[df.apply(lambda x: filter_values(x['lat'],x['lon']),axis=1)]

Question 48

If you’re dealing with multiple values and multiple inputs you could also set up an apply function like this. In this case filtering a dataframe for GPS locations that fall withing certain ranges.

def filter_values(lat,lon):
    if abs(lat - 33.77) < .01 and abs(lon - -118.16) < .01:
        return True
    elif abs(lat - 37.79) < .01 and abs(lon - -122.39) < .01:
        return True
    else:
        return False


df = df[df.apply(lambda x: filter_values(x['lat'],x['lon']),axis=1)]

Question 49

代替这个

df = df[(99 <= df['closing_price'] <= 101)]

你应该用这个

df = df[(df['closing_price']>=99 ) & (df['closing_price']<=101)]

我们必须使用NumPy的按位逻辑运算符|，＆，〜，^进行复合查询。同样，括号对于运算符优先级也很重要。

有关更多信息，您可以访问链接：比较，掩码和布尔逻辑

Question 50

Instead of this

df = df[(99 <= df['closing_price'] <= 101)]

You should use this

df = df[(df['closing_price']>=99 ) & (df['closing_price']<=101)]

We have to use NumPy’s bitwise Logic operators |, &, ~, ^ for compounding queries. Also, the parentheses are important for operator precedence.

For more info, you can visit the link :Comparisons, Masks, and Boolean Logic

Question 51

我将数据存储在pandas数据框中，如下所示：

df1 = pd.DataFrame({'A':['yes','yes','yes','yes','no','no','yes','yes','yes','no'],
                   'B':['yes','no','no','no','yes','yes','no','yes','yes','no']})

所以我的数据看起来像这样

----------------------------
index         A        B
0           yes      yes
1           yes       no
2           yes       no
3           yes       no
4            no      yes
5            no      yes
6           yes       no
7           yes      yes
8           yes      yes
9            no       no
-----------------------------

我想将其转换为另一个数据框。预期的输出可以在以下python脚本中显示：

output = pd.DataFrame({'A':['no','no','yes','yes'],'B':['no','yes','no','yes'],'count':[1,2,4,3]})

因此，我的预期输出如下所示

--------------------------------------------
index      A       B       count
--------------------------------------------
0         no       no        1
1         no      yes        2
2        yes       no        4
3        yes      yes        3
--------------------------------------------

实际上，我可以使用以下命令来找到所有组合并对其进行计数： mytable = df1.groupby(['A','B']).size()

但是，事实证明，此类组合在单个列中。我想将组合中的每个值分隔到不同的列中，并且还要为计数结果增加一列。有可能这样做吗？请问您有什么建议吗？先感谢您。

Question 52

I have my data in pandas data frame as follows:

df1 = pd.DataFrame({'A':['yes','yes','yes','yes','no','no','yes','yes','yes','no'],
                   'B':['yes','no','no','no','yes','yes','no','yes','yes','no']})

So, my data looks like this

----------------------------
index         A        B
0           yes      yes
1           yes       no
2           yes       no
3           yes       no
4            no      yes
5            no      yes
6           yes       no
7           yes      yes
8           yes      yes
9            no       no
-----------------------------

I would like to transform it to another data frame. The expected output can be shown in the following python script:

output = pd.DataFrame({'A':['no','no','yes','yes'],'B':['no','yes','no','yes'],'count':[1,2,4,3]})

So, my expected output looks like this

--------------------------------------------
index      A       B       count
--------------------------------------------
0         no       no        1
1         no      yes        2
2        yes       no        4
3        yes      yes        3
--------------------------------------------

Actually, I can achieve to find all combinations and count them by using the following command: mytable = df1.groupby(['A','B']).size()

However, it turns out that such combinations are in a single column. I would like to separate each value in a combination into different column and also add one more column for the result of counting. Is it possible to do that? May I have your suggestions? Thank you in advance.

Question 53

你可以groupby上的cols“A”和“B”和呼叫size，然后reset_index和rename生成列：

In [26]:

df1.groupby(['A','B']).size().reset_index().rename(columns={0:'count'})
Out[26]:
     A    B  count
0   no   no      1
1   no  yes      2
2  yes   no      4
3  yes  yes      3

更新

简要说明一下，通过将2列分组，将A和B值相同的行分组，我们称之为size返回唯一组的数量：

In[202]:
df1.groupby(['A','B']).size()

Out[202]: 
A    B  
no   no     1
     yes    2
yes  no     4
     yes    3
dtype: int64

所以现在要还原分组的列，我们调用reset_index：

In[203]:
df1.groupby(['A','B']).size().reset_index()

Out[203]: 
     A    B  0
0   no   no  1
1   no  yes  2
2  yes   no  4
3  yes  yes  3

这将还原索引，但是大小聚合将变成生成的column 0，因此我们必须重命名此名称：

In[204]:
df1.groupby(['A','B']).size().reset_index().rename(columns={0:'count'})

Out[204]: 
     A    B  count
0   no   no      1
1   no  yes      2
2  yes   no      4
3  yes  yes      3

groupby确实接受了as_index我们可以设置为的arg ，False因此它不会使分组的列成为索引，但是这会生成a，series并且您仍然必须还原索引，依此类推….：

In[205]:
df1.groupby(['A','B'], as_index=False).size()

Out[205]: 
A    B  
no   no     1
     yes    2
yes  no     4
     yes    3
dtype: int64

Question 54

You can groupby on cols ‘A’ and ‘B’ and call size and then reset_index and rename the generated column:

In [26]:

df1.groupby(['A','B']).size().reset_index().rename(columns={0:'count'})
Out[26]:
     A    B  count
0   no   no      1
1   no  yes      2
2  yes   no      4
3  yes  yes      3

update

A little explanation, by grouping on the 2 columns, this groups rows where A and B values are the same, we call size which returns the number of unique groups:

In[202]:
df1.groupby(['A','B']).size()

Out[202]: 
A    B  
no   no     1
     yes    2
yes  no     4
     yes    3
dtype: int64

So now to restore the grouped columns, we call reset_index:

In[203]:
df1.groupby(['A','B']).size().reset_index()

Out[203]: 
     A    B  0
0   no   no  1
1   no  yes  2
2  yes   no  4
3  yes  yes  3

This restores the indices but the size aggregation is turned into a generated column 0, so we have to rename this:

In[204]:
df1.groupby(['A','B']).size().reset_index().rename(columns={0:'count'})

Out[204]: 
     A    B  count
0   no   no      1
1   no  yes      2
2  yes   no      4
3  yes  yes      3

groupby does accept the arg as_index which we could have set to False so it doesn’t make the grouped columns the index, but this generates a series and you’d still have to restore the indices and so on….:

In[205]:
df1.groupby(['A','B'], as_index=False).size()

Out[205]: 
A    B  
no   no     1
     yes    2
yes  no     4
     yes    3
dtype: int64

Question 55

稍微相关，我一直在寻找独特的组合，然后我想到了这种方法：

def unique_columns(df,columns):

    result = pd.Series(index = df.index)

    groups = meta_data_csv.groupby(by = columns)
    for name,group in groups:
       is_unique = len(group) == 1
       result.loc[group.index] = is_unique

    assert not result.isnull().any()

    return result

如果只想断言所有组合都是唯一的：

df1.set_index(['A','B']).index.is_unique

Question 56

Slightly related, I was looking for the unique combinations and I came up with this method:

def unique_columns(df,columns):

    result = pd.Series(index = df.index)

    groups = meta_data_csv.groupby(by = columns)
    for name,group in groups:
       is_unique = len(group) == 1
       result.loc[group.index] = is_unique

    assert not result.isnull().any()

    return result

And if you only want to assert that all combinations are unique:

df1.set_index(['A','B']).index.is_unique

Question 57

将@EdChum的非常好的答案放入函数中count_unique_index。唯一方法仅适用于熊猫系列，不适用于数据框。下面的函数重现了R中唯一函数的行为：

unique返回向量，数据框或数组（如x），但删除了重复的元素/行。

并根据OP的要求添加发生次数。

df1 = pd.DataFrame({'A':['yes','yes','yes','yes','no','no','yes','yes','yes','no'],                                                                                             
                    'B':['yes','no','no','no','yes','yes','no','yes','yes','no']})                                                                                               
def count_unique_index(df, by):                                                                                                                                                 
    return df.groupby(by).size().reset_index().rename(columns={0:'count'})                                                                                                      

count_unique_index(df1, ['A','B'])                                                                                                                                              
     A    B  count                                                                                                                                                                  
0   no   no      1                                                                                                                                                                  
1   no  yes      2                                                                                                                                                                  
2  yes   no      4                                                                                                                                                                  
3  yes  yes      3

Question 58

Placing @EdChum’s very nice answer into a function count_unique_index. The unique method only works on pandas series, not on data frames. The function below reproduces the behavior of the unique function in R:

unique returns a vector, data frame or array like x but with duplicate elements/rows removed.

And adds a count of the occurrences as requested by the OP.

df1 = pd.DataFrame({'A':['yes','yes','yes','yes','no','no','yes','yes','yes','no'],                                                                                             
                    'B':['yes','no','no','no','yes','yes','no','yes','yes','no']})                                                                                               
def count_unique_index(df, by):                                                                                                                                                 
    return df.groupby(by).size().reset_index().rename(columns={0:'count'})                                                                                                      

count_unique_index(df1, ['A','B'])                                                                                                                                              
     A    B  count                                                                                                                                                                  
0   no   no      1                                                                                                                                                                  
1   no  yes      2                                                                                                                                                                  
2  yes   no      4                                                                                                                                                                  
3  yes  yes      3

Question 59

我还没有做时间测试，但是尝试很有趣。基本上将两列转换为一列的元组。现在将其转换为数据框，执行“ value_counts（）”以查找唯一元素并对其进行计数。再次拉动拉链，然后按需要排列各列。您可能可以使步骤更优雅，但是对我来说，使用元组似乎更自然

b = pd.DataFrame({'A':['yes','yes','yes','yes','no','no','yes','yes','yes','no'],'B':['yes','no','no','no','yes','yes','no','yes','yes','no']})

b['count'] = pd.Series(zip(*[b.A,b.B]))
df = pd.DataFrame(b['count'].value_counts().reset_index())
df['A'], df['B'] = zip(*df['index'])
df = df.drop(columns='index')[['A','B','count']]

Question 60

I haven’t done time test with this but it was fun to try. Basically convert two columns to one column of tuples. Now convert that to a dataframe, do ‘value_counts()’ which finds the unique elements and counts them. Fiddle with zip again and put the columns in order you want. You can probably make the steps more elegant but working with tuples seems more natural to me for this problem

b = pd.DataFrame({'A':['yes','yes','yes','yes','no','no','yes','yes','yes','no'],'B':['yes','no','no','no','yes','yes','no','yes','yes','no']})

b['count'] = pd.Series(zip(*[b.A,b.B]))
df = pd.DataFrame(b['count'].value_counts().reset_index())
df['A'], df['B'] = zip(*df['index'])
df = df.drop(columns='index')[['A','B','count']]

Question 61

我被要求生成一些Excel报告。我目前正在大量使用pandas作为数据，所以自然地我想使用pandas.ExcelWriter方法生成这些报告。但是，固定的列宽是一个问题。

到目前为止，我的代码很简单。假设我有一个名为“ df”的数据框：

writer = pd.ExcelWriter(excel_file_path, engine='openpyxl')
df.to_excel(writer, sheet_name="Summary")

我正在查看pandas代码，但实际上没有看到任何设置列宽的选项。宇宙中是否有技巧可以使列自动调整为数据？还是在事实之后我可以对xlsx文件做一些事情来调整列宽？

（我正在使用OpenPyXL库，并生成.xlsx文件-如果有区别的话。）

谢谢。

Question 62

I am being asked to generate some Excel reports. I am currently using pandas quite heavily for my data, so naturally I would like to use the pandas.ExcelWriter method to generate these reports. However the fixed column widths are a problem.

The code I have so far is simple enough. Say I have a dataframe called ‘df’:

writer = pd.ExcelWriter(excel_file_path, engine='openpyxl')
df.to_excel(writer, sheet_name="Summary")

I was looking over the pandas code, and I don’t really see any options to set column widths. Is there a trick out there in the universe to make it such that the columns auto-adjust to the data? Or is there something I can do after the fact to the xlsx file to adjust the column widths?

(I am using the OpenPyXL library, and generating .xlsx files – if that makes any difference.)

Thank you.

Question 63

受user6178746的回答启发，我有以下内容：

# Given a dict of dataframes, for example:
# dfs = {'gadgets': df_gadgets, 'widgets': df_widgets}

writer = pd.ExcelWriter(filename, engine='xlsxwriter')
for sheetname, df in dfs.items():  # loop through `dict` of dataframes
    df.to_excel(writer, sheet_name=sheetname)  # send df to writer
    worksheet = writer.sheets[sheetname]  # pull worksheet object
    for idx, col in enumerate(df):  # loop through all columns
        series = df[col]
        max_len = max((
            series.astype(str).map(len).max(),  # len of largest item
            len(str(series.name))  # len of column name/header
            )) + 1  # adding a little extra space
        worksheet.set_column(idx, idx, max_len)  # set column width
writer.save()

Question 64

Inspired by user6178746’s answer, I have the following:

# Given a dict of dataframes, for example:
# dfs = {'gadgets': df_gadgets, 'widgets': df_widgets}

writer = pd.ExcelWriter(filename, engine='xlsxwriter')
for sheetname, df in dfs.items():  # loop through `dict` of dataframes
    df.to_excel(writer, sheet_name=sheetname)  # send df to writer
    worksheet = writer.sheets[sheetname]  # pull worksheet object
    for idx, col in enumerate(df):  # loop through all columns
        series = df[col]
        max_len = max((
            series.astype(str).map(len).max(),  # len of largest item
            len(str(series.name))  # len of column name/header
            )) + 1  # adding a little extra space
        worksheet.set_column(idx, idx, max_len)  # set column width
writer.save()

问题：获取pandas.read_csv以将空值读取为空字符串而不是nan

回答 0

回答 1

回答 2

问题：排序数据框后更新索引

回答 0

回答 1

回答 2

回答 3

问题：使用熊猫查找最多两列或更多列

回答 0

回答 1

问题：用python熊猫装箱列

回答 0

回答 1

使用numba模块加速。

Using numba module for speed up.

问题：如何按多列过滤熊猫数据框

回答 0

编辑：

EDIT:

回答 1

回答 2

回答 3

回答 4

回答 5

问题：如何在Python Pandas中的两个值之间选择DataFrame中的行？

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

回答 6

问题：熊猫数据框中选定列和计数中值的唯一组合

回答 0

回答 1

回答 2

回答 3

问题：有没有一种方法可以使用pandas.ExcelWriter自动调整Excel列的宽度？

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

回答 6

回答 7

回答 8

回答 9

回答 10

回答 11

问题：将可识别熊猫时区的DateTimeIndex转换为朴素的时间戳，但在特定的时区

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

回答 6

回答 7

问题：通过熊猫DataFrame分组并选择最常用的值

回答 0

回答 1

熊猫> = 0.16

pd.Series.mode 可用！

处理多种模式

（不）考虑的替代方案

Pandas >= 0.16

pd.Series.mode is available!

Dealing with Multiple Modes

Alternatives to (not) consider

回答 2

回答 3

回答 4

回答 5

回答 6

回答 7

回答 8

回答 9

使用`numba`模块加速。

Using `numba` module for speed up.

`pd.Series.mode` 可用！

`pd.Series.mode` is available!