Python 实用宝典

Question 2

With the following code:

import matplotlib
matplotlib.style.use('ggplot')
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({ 'celltype':["foo","bar","qux","woz"], 's1':[5,9,1,7], 's2':[12,90,13,87]})
df = df[["celltype","s1","s2"]]
df.set_index(["celltype"],inplace=True)
df.plot(kind='bar',alpha=0.75)
plt.xlabel("")

I made this plot:

How can I rotate the x-axis tick labels to 0 degrees?

I tried adding this, but it did not work:

plt.set_xticklabels(df.index,rotation=90)

Question 4

Pass param rot=0 to rotate the xticks:

import matplotlib
matplotlib.style.use('ggplot')
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({ 'celltype':["foo","bar","qux","woz"], 's1':[5,9,1,7], 's2':[12,90,13,87]})
df = df[["celltype","s1","s2"]]
df.set_index(["celltype"],inplace=True)
df.plot(kind='bar',alpha=0.75, rot=0)
plt.xlabel("")
plt.show()

yields plot:

Question 6

Try this –

plt.xticks(rotation=90)

Question 8

The question is clear but the title is not as precise as it could be. My answer is for those who came looking to change the axis label, as opposed to the tick labels, which is what the accepted answer is about. (The title has now been corrected).

for ax in plt.gcf().axes:
    plt.sca(ax)
    plt.xlabel(ax.get_xlabel(), rotation=90)

Question 10

You can use set_xticklabels()

ax.set_xticklabels(df['Names'], rotation=90, ha='right')
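
The attempt in the question failed because set_xticklabels is a method of an Axes object, not of the pyplot module. The following is a minimal sketch, assuming the DataFrame from the question and a non-interactive backend (the Agg backend and the loop over existing labels are choices made here for illustration, not part of the original answers). It keeps the Axes returned by df.plot and rotates the labels that pandas already placed, so the labels do not have to be supplied again.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available, so use a non-interactive backend
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame(
    {"s1": [5, 9, 1, 7], "s2": [12, 90, 13, 87]},
    index=pd.Index(["foo", "bar", "qux", "woz"], name="celltype"),
)

# pandas returns the Axes it drew on; keep it so Axes methods can be called.
ax = df.plot(kind="bar", alpha=0.75)
ax.set_xlabel("")

# Rotate the tick labels that are already on the x-axis.
for label in ax.get_xticklabels():
    label.set_rotation(0)

rotations = [label.get_rotation() for label in ax.get_xticklabels()]
print(rotations)
plt.close(ax.figure)
```

Working on the Axes object avoids depending on whichever axes pyplot currently considers active, which matters once a figure has several subplots.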

Question 12

The following might be helpful:

# Valid font sizes are xx-small, x-small, small, medium, large, x-large, xx-large, larger, smaller, None

plt.xticks(
    rotation=45,
    horizontalalignment='right',
    fontweight='light',
    fontsize='medium',
)

Here is the signature and docstring of the xticks function, including its examples:

def xticks(ticks=None, labels=None, **kwargs):
    """
    Get or set the current tick locations and labels of the x-axis.

    Call signatures::

        locs, labels = xticks()            # Get locations and labels
        xticks(ticks, [labels], **kwargs)  # Set locations and labels

    Parameters
    ----------
    ticks : array_like
        A list of positions at which ticks should be placed. You can pass an
        empty list to disable xticks.

    labels : array_like, optional
        A list of explicit labels to place at the given *locs*.

    **kwargs
        :class:`.Text` properties can be used to control the appearance of
        the labels.

    Returns
    -------
    locs
        An array of label locations.
    labels
        A list of `.Text` objects.

    Notes
    -----
    Calling this function with no arguments (e.g. ``xticks()``) is the pyplot
    equivalent of calling `~.Axes.get_xticks` and `~.Axes.get_xticklabels` on
    the current axes.
    Calling this function with arguments is the pyplot equivalent of calling
    `~.Axes.set_xticks` and `~.Axes.set_xticklabels` on the current axes.

    Examples
    --------
    Get the current locations and labels:

        >>> locs, labels = xticks()

    Set label locations:

        >>> xticks(np.arange(0, 1, step=0.2))

    Set text labels:

        >>> xticks(np.arange(5), ('Tom', 'Dick', 'Harry', 'Sally', 'Sue'))

    Set text labels and properties:

        >>> xticks(np.arange(12), calendar.month_name[1:13], rotation=20)

    Disable xticks:

        >>> xticks([])
    """

Question 14

For bar graphs, you can include the angle which you finally want the ticks to have.

Here I am using rot=0 to make them parallel to the x axis.

series.plot.bar(rot=0)
plt.show()
plt.close()

Question 16

When selecting a single column from a pandas DataFrame (say df.iloc[:, 0], df['A'], or df.A), the result is automatically converted to a Series instead of a single-column DataFrame. However, I am writing some functions that take a DataFrame as an input argument. I therefore prefer to deal with a single-column DataFrame instead of a Series, so that the function can assume that, say, df.columns is accessible. Right now I have to explicitly convert the Series into a DataFrame with something like pd.DataFrame(df.iloc[:, 0]). This doesn’t seem like the cleanest method. Is there a more elegant way to index from a DataFrame directly so that the result is a single-column DataFrame instead of a Series?

Question 18

As @Jeff mentions, there are a few ways to do this, but I recommend using loc/iloc to be more explicit (and to raise errors early if you're trying something ambiguous):

In [10]: df = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 'B'])

In [11]: df
Out[11]:
   A  B
0  1  2
1  3  4

In [12]: df[['A']]

In [13]: df[[0]]

In [14]: df.loc[:, ['A']]

In [15]: df.iloc[:, [0]]

Out[12-15]:  # they all return the same thing:
   A
0  1
1  3

The latter two choices remove ambiguity in the case of integer column names (precisely why loc/iloc were created). For example:

In [16]: df = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 0])

In [17]: df
Out[17]:
   A  0
0  1  2
1  3  4

In [18]: df[[0]]  # ambiguous
Out[18]:
   A
0  1
1  3

Question 20

As Andy Hayden recommends, using .iloc/.loc to index out a single-column DataFrame is the way to go. Another point to note is how to express the index positions: pass the labels or positions as a list to get a DataFrame back. Passing a bare label or position instead returns a pandas.core.series.Series.

Input:

    A_1 = train_data.loc[:,'Fraudster']
    print('A_1 is of type', type(A_1))
    A_2 = train_data.loc[:, ['Fraudster']]
    print('A_2 is of type', type(A_2))
    A_3 = train_data.iloc[:,12]
    print('A_3 is of type', type(A_3))
    A_4 = train_data.iloc[:,[12]]
    print('A_4 is of type', type(A_4))

Output:

    A_1 is of type <class 'pandas.core.series.Series'>
    A_2 is of type <class 'pandas.core.frame.DataFrame'>
    A_3 is of type <class 'pandas.core.series.Series'>
    A_4 is of type <class 'pandas.core.frame.DataFrame'>

Question 22

You can use df.iloc[:, 0:1]; in this case the result will be a DataFrame and not a Series.

As you can see:
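
A minimal sketch of that difference, assuming a small two-column DataFrame made up for illustration (the variable names as_series and as_frame are hypothetical):

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])

as_series = df.iloc[:, 0]    # scalar position: returns a Series
as_frame = df.iloc[:, 0:1]   # slice of positions: returns a one-column DataFrame

print(type(as_series).__name__)
print(type(as_frame).__name__)
print(as_frame.columns.tolist())
```

Because the slice keeps the column axis, as_frame.columns is available, which is what the function in the question relies on.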

Question 24

These three approaches have been mentioned:

pd.DataFrame(df.loc[:, 'A'])  # Approach of the original post
df.loc[:, ['A']]              # Approach 2 (note: use iloc for positional indexing)
df[['A']]                     # Approach 3

pd.Series.to_frame() is another approach.

Because it is a method, it can be used in situations where the second and third approaches above do not apply. In particular, it is useful when applying some method to a column in your dataframe and you want to convert the output into a dataframe instead of a series. For instance, in a Jupyter Notebook a series will not have pretty output, but a dataframe will.

# Basic use case: 
df['A'].to_frame()

# Use case 2 (this will give you pretty output in a Jupyter Notebook): 
df['A'].describe().to_frame()

# Use case 3: 
df['A'].str.strip().to_frame()

# Use case 4: 
def some_function(num): 
    ...

df['A'].apply(some_function).to_frame()

Question 25

I want to print the whole dataframe, but I don’t want to print the index

Besides, one column is datetime type, I just want to print time, not date.

The dataframe looks like:

   User ID           Enter Time   Activity Number
0      123  2014-07-08 00:09:00              1411
1      123  2014-07-08 00:18:00               893
2      123  2014-07-08 00:49:00              1041

I want it printed as:

User ID   Enter Time   Activity Number
123         00:09:00              1411
123         00:18:00               893
123         00:49:00              1041

Question 26

print(df.to_string(index=False))
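
The question also asks for the time without the date, which to_string(index=False) alone does not address. The following is a sketch under the assumption that the Enter Time column has a datetime dtype; it formats that column with dt.strftime on a copy before printing (the names out and text are introduced here for illustration).

```python
import pandas as pd

df = pd.DataFrame({
    "User ID": [123, 123, 123],
    "Enter Time": pd.to_datetime([
        "2014-07-08 00:09:00",
        "2014-07-08 00:18:00",
        "2014-07-08 00:49:00",
    ]),
    "Activity Number": [1411, 893, 1041],
})

# Replace the datetime column with time-only strings on a copy, leaving df untouched.
out = df.assign(**{"Enter Time": df["Enter Time"].dt.strftime("%H:%M:%S")})

text = out.to_string(index=False)
print(text)
```

Using assign keeps the original datetime column intact for later calculations; only the printed copy holds strings.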

Question 27

print(df.to_csv(sep='\t', index=False))

Or possibly:

print(df.to_csv(columns=['A', 'B', 'C'], sep='\t', index=False))

Question 28

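The line below hides the index column of a DataFrame when it is rendered through the Styler (for example, in a Jupyter notebook); it does not change the output of print(df):

df.style.hide_index()

In pandas 1.4 and later, hide_index() is deprecated in favor of df.style.hide(axis='index').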

Question 29

If you want to pretty-print data frames, you can use the tabulate package.

import pandas as pd
import numpy as np
from tabulate import tabulate

def pprint_df(dframe):
    print(tabulate(dframe, headers='keys', tablefmt='psql', showindex=False))

df = pd.DataFrame({'col1': np.random.randint(0, 100, 10), 
    'col2': np.random.randint(50, 100, 10), 
    'col3': np.random.randint(10, 10000, 10)})

pprint_df(df)

Specifically, showindex=False, as the name suggests, leaves the index out of the table. The output would look as follows:

+--------+--------+--------+
|   col1 |   col2 |   col3 |
|--------+--------+--------|
|     15 |     76 |   5175 |
|     30 |     97 |   3331 |
|     34 |     56 |   3513 |
|     50 |     65 |    203 |
|     84 |     75 |   7559 |
|     41 |     82 |    939 |
|     78 |     59 |   4971 |
|     98 |     99 |    167 |
|     81 |     99 |   6527 |
|     17 |     94 |   4267 |
+--------+--------+--------+

Question 30

To retain “pretty-print” use

from IPython.display import HTML
HTML(df.to_html(index=False))

Question 31

If you just want a string to print, it can be solved with:

print(df.to_string(index=False))

But if you also want to serialize the data, or even send it to MongoDB, it would be better to do something like:

document = df.to_dict(orient='list')

There are currently six ways to orient the data; check the pandas docs to see which fits you best.

Question 32

To answer the “How to print dataframe without an index” question, you can set the index to be an array of empty strings (one for each row in the dataframe), like this:

blankIndex=[''] * len(df)
df.index=blankIndex

If we use the data from your post:

row1 = (123, '2014-07-08 00:09:00', 1411)
row2 = (123, '2014-07-08 00:49:00', 1041)
row3 = (123, '2014-07-08 00:09:00', 1411)
data = [row1, row2, row3]
#set up dataframe
df = pd.DataFrame(data, columns=('User ID', 'Enter Time', 'Activity Number'))
print(df)

which would normally print out as:

   User ID           Enter Time  Activity Number
0      123  2014-07-08 00:09:00             1411
1      123  2014-07-08 00:49:00             1041
2      123  2014-07-08 00:09:00             1411

By creating an array with as many empty strings as there are rows in the data frame:

blankIndex=[''] * len(df)
df.index=blankIndex
print(df)

It will remove the index from the output:

  User ID           Enter Time  Activity Number
      123  2014-07-08 00:09:00             1411
      123  2014-07-08 00:49:00             1041
      123  2014-07-08 00:09:00             1411

In Jupyter Notebooks, the dataframe would likewise render with no visible index values.

Question 33

Similar to many of the answers above that use df.to_string(index=False), I often find it necessary to extract a single column of values in which case you can specify an individual column with .to_string using the following:

data = pd.DataFrame({'col1': np.random.randint(0, 100, 10), 
    'col2': np.random.randint(50, 100, 10), 
    'col3': np.random.randint(10, 10000, 10)})

print(data.to_string(columns=['col1'], index=False))

print(data.to_string(columns=['col1', 'col2'], index=False))

This provides easy-to-copy, index-free output for pasting elsewhere (for example, into Excel).

Question 34

I would like to cleanly filter a dataframe using regex on one of the columns.

For a contrived example:

In [210]: foo = pd.DataFrame({'a' : [1,2,3,4], 'b' : ['hi', 'foo', 'fat', 'cat']})
In [211]: foo
Out[211]: 
   a    b
0  1   hi
1  2  foo
2  3  fat
3  4  cat

I want to filter the rows to those that start with f using a regex. First go:

In [213]: foo.b.str.match('f.*')
Out[213]: 
0    []
1    ()
2    ()
3    []

That’s not too terribly useful. However this will get me my boolean index:

In [226]: foo.b.str.match('(f.*)').str.len() > 0
Out[226]: 
0    False
1     True
2     True
3    False
Name: b

So I could then do my restriction by:

In [229]: foo[foo.b.str.match('(f.*)').str.len() > 0]
Out[229]: 
   a    b
1  2  foo
2  3  fat

That makes me artificially put a group into the regex though, and seems like maybe not the clean way to go. Is there a better way to do this?

Question 35

Use contains instead:

In [10]: df.b.str.contains('^f')
Out[10]: 
0    False
1     True
2     True
3    False
Name: b, dtype: bool

Question 36

There is already a string handling function Series.str.startswith(). You should try foo[foo.b.str.startswith('f')].

Result:

    a   b
1   2   foo
2   3   fat

I think this is what you expect.

Alternatively you can use contains with regex option. For example:

foo[foo.b.str.contains('oo', regex= True, na=False)]

Result:

    a   b
1   2   foo

na=False prevents errors when the column contains NaN or null values.

Question 37

Multiple column search with dataframe:

frame[frame.filename.str.match('.*' + MetaData + '.*') & frame.file_path.str.match(r'C:\\test\\test\.txt')]

Question 38

It may be a bit late, but this is now easier to do in Pandas by calling Series.str.match. The docs explain the difference between match, fullmatch and contains.

Note that in order to use the results for indexing, you should set the na=False argument (or na=True if you want to include NaNs in the results).
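
A short sketch of the three methods, assuming a variant of the foo DataFrame from the question with one missing value added for illustration (the names starts, anywhere and whole are hypothetical):

```python
import pandas as pd

foo = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": ["hi", "foo", "fat", "cat", None],
})

starts = foo.b.str.match("f", na=False)        # pattern must match at the start of the string
anywhere = foo.b.str.contains("a", na=False)   # pattern may match anywhere in the string
whole = foo.b.str.fullmatch("f.t", na=False)   # pattern must match the entire string

print(foo[starts])
print(foo[anywhere])
print(foo[whole])
```

With na=False the missing entry becomes False in each mask, so all three results can be used directly as boolean indexers.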

Question 39

Thanks for the great answer @user3136169. Here is an example of how that might be done while also filtering out None values.

import re

regex = r'^f'  # illustrative pattern only; replace it with your own

def regex_filter(val):
    # Falsy values such as None or '' never match.
    if val:
        return re.search(regex, val) is not None
    return False

df_filtered = df[df['col'].apply(regex_filter)]

You can also pass the regex as an argument:

def regex_filter(val, myregex):
    ...

df_filtered = df[df['col'].apply(regex_filter, myregex=myregex)]

Question 40

Write a Boolean function that checks the regex and use apply on the column

foo[foo['b'].apply(regex_function)]
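
The answer leaves regex_function undefined. One possible shape for it, as a sketch that assumes the foo DataFrame from the question and the "starts with f" pattern (the compiled pattern name STARTS_WITH_F is introduced here for illustration):

```python
import re

import pandas as pd

foo = pd.DataFrame({"a": [1, 2, 3, 4], "b": ["hi", "foo", "fat", "cat"]})

# Compile once so the pattern is not re-parsed for every row.
STARTS_WITH_F = re.compile(r"f.*")


def regex_function(value):
    """Return True when value is a string matching the pattern from its start."""
    return isinstance(value, str) and STARTS_WITH_F.match(value) is not None


filtered = foo[foo["b"].apply(regex_function)]
print(filtered)
```

The isinstance check makes the function return False for missing values instead of raising a TypeError inside re.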

Question 41

Using str slice

foo[foo.b.str[0]=='f']
Out[18]: 
   a    b
1  2  foo
2  3  fat

Question 42

Why does pandas make a distinction between a Series and a single-column DataFrame?
In other words: what is the reason of existence of the Series class?

I’m mainly using time series with datetime index, maybe that helps to set the context.

Question 43

Quoting the Pandas docs

pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)

Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure.

So, the Series is the data structure for a single column of a DataFrame, not only conceptually, but literally, i.e. the data in a DataFrame is actually stored in memory as a collection of Series.

Analogously: we need both lists and matrices, because matrices are built from lists. Single-row matrices, while equivalent to lists in functionality, still cannot exist without the list(s) they’re composed of.

They both have extremely similar APIs, but you’ll find that DataFrame methods always cater to the possibility that you have more than one column. And, of course, you can always add another Series (or equivalent object) to a DataFrame, while adding a Series to another Series involves creating a DataFrame.
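
A small sketch of the relationship described above, using made-up data (all variable names here are illustrative). It shows that selecting one column yields a Series, and that combining two Series side by side yields a DataFrame.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 3], "B": [2, 4]})

col = df["A"]        # a single label returns that column as a Series
frame = df[["A"]]    # a list of labels returns a DataFrame, even for one column

s1 = pd.Series([1, 2], name="x")
s2 = pd.Series([3, 4], name="y")
combined = pd.concat([s1, s2], axis=1)   # two Series placed side by side

print(type(col).__name__, type(frame).__name__, type(combined).__name__)
print(combined.columns.tolist())
```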

Question 44

From the pandas docs (http://pandas.pydata.org/pandas-docs/stable/dsintro.html): a Series is a one-dimensional labeled array capable of holding any data type. To create a pandas Series:

import pandas as pd
ds = pd.Series(data, index=index)

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types.

import pandas as pd
df = pd.DataFrame(data, index=index)

In both cases above, index is a list of labels.

For example, I have a CSV file with the following data:

,country,population,area,capital
BR,Brazil,10210,12015,Brasile
RU,Russia,1025,457,Moscow
IN,India,10458,457787,New Delhi

To read the above data as a Series and as a DataFrame:

import pandas as pd
file_data = pd.read_csv("file_path", index_col=0)
d = pd.Series(file_data.country, index=file_data.index)  # or index=['BR', 'RU', 'IN']

output:

>>> d
BR           Brazil
RU           Russia
IN            India

df = pd.DataFrame(file_data.area, index=file_data.index)  # or index=['BR', 'RU', 'IN']

output:

>>> df
      area
BR   12015
RU     457
IN  457787

Question 45

A Series is a one-dimensional object that can hold any data type, such as integers, floats, and strings, e.g.

   import pandas as pd
   x = pd.Series(['A', 'B', 'C'])

0 A
1 B
2 C

The first column of the printed Series is the index (0, 1, 2); the second column is the actual data (A, B, C).

A DataFrame is a two-dimensional object that can be built from Series, lists, or dictionaries.

df=pd.DataFrame(rd(5,4),['A','B','C','D','E'],['W','X','Y','Z'])

Question 46

Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic method to create a Series is to call:

s = pd.Series(data, index=index)

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects.

 d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
      'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
 df = pd.DataFrame(d)

Question 47

Import cars data

import pandas as pd

cars = pd.read_csv('cars.csv', index_col = 0)

Here is how the cars.csv file looks.

Print out drives_right column as Series:

print(cars.loc[:,"drives_right"])

    US      True
    AUS    False
    JAP    False
    IN     False
    RU      True
    MOR     True
    EG      True
    Name: drives_right, dtype: bool

The single bracket version gives a Pandas Series, the double bracket version gives a Pandas DataFrame.

Print out drives_right column as DataFrame

print(cars.loc[:,["drives_right"]])

         drives_right
    US           True
    AUS         False
    JAP         False
    IN          False
    RU           True
    MOR          True
    EG           True

Adding a Series to another Series creates a DataFrame.

Question 48

I am trying to join two pandas data frames using two columns:

new_df = pd.merge(A_df, B_df,  how='left', left_on='[A_c1,c2]', right_on = '[B_c1,c2]')

but got the following error:

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4164)()

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4028)()

pandas/src/hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13166)()

pandas/src/hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13120)()

KeyError: '[B_1, c2]'

Any idea what should be the right way to do this? Thanks!

Question 49

Try this

new_df = pd.merge(A_df, B_df,  how='left', left_on=['A_c1','c2'], right_on = ['B_c1','c2'])

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html

left_on : label or list, or array-like Field names to join on in left DataFrame. Can be a vector or list of vectors of the length of the DataFrame to use a particular vector as the join key instead of columns

right_on : label or list, or array-like Field names to join on in right DataFrame or vector/list of vectors per left_on docs

Question 50

The problem here is that by wrapping the whole list in quotes you are passing a single string, when in fact, as @Shijo quoted from the documentation, the function expects a label or a list of labels. A string such as '[A_c1,c2]' is treated as one column label, which does not exist, hence the KeyError. If the list contains the names of the columns for the left and right dataframes, each column name must be quoted individually. With that in mind, we can see why this is incorrect:

new_df = pd.merge(A_df, B_df,  how='left', left_on='[A_c1,c2]', right_on = '[B_c1,c2]')

And this is the correct way of using the function:

new_df = pd.merge(A_df, B_df,  how='left', left_on=['A_c1','c2'], right_on = ['B_c1','c2'])

Question 51

Another way of doing this: new_df = A_df.merge(B_df, left_on=['A_c1','c2'], right_on = ['B_c1','c2'], how='left')
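
A runnable sketch of the corrected call, using two small frames invented for illustration (the val_a and val_b columns and their values are assumptions, not data from the question):

```python
import pandas as pd

A_df = pd.DataFrame({
    "A_c1": [1, 1, 2],
    "c2": ["x", "y", "x"],
    "val_a": [10, 20, 30],
})
B_df = pd.DataFrame({
    "B_c1": [1, 2],
    "c2": ["x", "x"],
    "val_b": [100, 200],
})

# Each key is a separate label in a list; the pairs (A_c1, B_c1) and (c2, c2) are matched.
new_df = pd.merge(
    A_df, B_df, how="left",
    left_on=["A_c1", "c2"], right_on=["B_c1", "c2"],
)
print(new_df)
```

Because this is a left join, the row of A_df with no partner in B_df is kept and its val_b is filled with NaN.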

Question 52

I have a DataFrame with four columns. I want to convert this DataFrame to a python dictionary. I want the elements of first column be keys and the elements of other columns in same row be values.

DataFrame:

    ID   A   B   C
0   p    1   3   2
1   q    4   3   2
2   r    4   0   9

Output should be like this:

Dictionary:

{'p': [1,3,2], 'q': [4,3,2], 'r': [4,0,9]}

Question 53

The to_dict() method sets the column names as dictionary keys so you’ll need to reshape your DataFrame slightly. Setting the ‘ID’ column as the index and then transposing the DataFrame is one way to achieve this.

to_dict() also accepts an ‘orient’ argument which you’ll need in order to output a list of values for each column. Otherwise, a dictionary of the form {index: value} will be returned for each column.

These steps can be done with the following line:

>>> df.set_index('ID').T.to_dict('list')
{'p': [1, 3, 2], 'q': [4, 3, 2], 'r': [4, 0, 9]}

In case a different dictionary format is needed, here are examples of the possible orient arguments. Consider the following simple DataFrame:

>>> df = pd.DataFrame({'a': ['red', 'yellow', 'blue'], 'b': [0.5, 0.25, 0.125]})
>>> df
        a      b
0     red  0.500
1  yellow  0.250
2    blue  0.125

Then the options are as follows.

dict – the default: column names are keys, values are dictionaries of index:data pairs

>>> df.to_dict('dict')
{'a': {0: 'red', 1: 'yellow', 2: 'blue'}, 
 'b': {0: 0.5, 1: 0.25, 2: 0.125}}

list – keys are column names, values are lists of column data

>>> df.to_dict('list')
{'a': ['red', 'yellow', 'blue'], 
 'b': [0.5, 0.25, 0.125]}

series – like ‘list’, but values are Series

>>> df.to_dict('series')
{'a': 0       red
      1    yellow
      2      blue
      Name: a, dtype: object, 

 'b': 0    0.500
      1    0.250
      2    0.125
      Name: b, dtype: float64}

split – splits columns/data/index as keys with values being column names, data values by row and index labels respectively

>>> df.to_dict('split')
{'columns': ['a', 'b'],
 'data': [['red', 0.5], ['yellow', 0.25], ['blue', 0.125]],
 'index': [0, 1, 2]}

records – each row becomes a dictionary where key is column name and value is the data in the cell

>>> df.to_dict('records')
[{'a': 'red', 'b': 0.5}, 
 {'a': 'yellow', 'b': 0.25}, 
 {'a': 'blue', 'b': 0.125}]

index – like ‘records’, but a dictionary of dictionaries with keys as index labels (rather than a list)

>>> df.to_dict('index')
{0: {'a': 'red', 'b': 0.5},
 1: {'a': 'yellow', 'b': 0.25},
 2: {'a': 'blue', 'b': 0.125}}

Question 54

Try using zip:

df = pd.read_csv("file")
d = dict([(i, [a, b, c]) for i, a, b, c in zip(df.ID, df.A, df.B, df.C)])
print(d)

Output:

{'p': [1, 3, 2], 'q': [4, 3, 2], 'r': [4, 0, 9]}

Question 55

Follow these steps:

Suppose your dataframe is as follows:

>>> df
   A  B  C ID
0  1  3  2  p
1  4  3  2  q
2  4  0  9  r

1. Use `set_index` to set `ID` columns as the dataframe index.

    df.set_index("ID", drop=True, inplace=True)

2. Use the `orient=index` parameter to have the index as dictionary keys.

    dictionary = df.to_dict(orient="index")

The results will be as follows:

    >>> dictionary
    {'q': {'A': 4, 'B': 3, 'C': 2}, 'p': {'A': 1, 'B': 3, 'C': 2}, 'r': {'A': 4, 'B': 0, 'C': 9}}

3. If you need each row as a list, choose a column order and run the following code.

column_order= ["A", "B", "C"] #  Determine your preferred order of columns
d = {} #  Initialize the new dictionary as an empty dictionary
for k in dictionary:
    d[k] = [dictionary[k][column_name] for column_name in column_order]

Question 56

If you don’t mind the dictionary values being tuples, you can use itertuples:

>>> {x[0]: x[1:] for x in df.itertuples(index=False)}
{'p': (1, 3, 2), 'q': (4, 3, 2), 'r': (4, 0, 9)}

Question 57

Should a dictionary like:

{'red': 0.5, 'yellow': 0.25, 'blue': 0.125}

be required out of a dataframe like:

        a      b
0     red  0.500
1  yellow  0.250
2    blue  0.125

simplest way would be to do:

dict(df.values.tolist())

working snippet below:

import pandas as pd
df = pd.DataFrame({'a': ['red', 'yellow', 'blue'], 'b': [0.5, 0.25, 0.125]})
dict(df.values.tolist())

Question 58

For my use (node names with xy positions) I found @user4179775’s answer to be the most helpful / intuitive:

import pandas as pd

df = pd.read_csv('glycolysis_nodes_xy.tsv', sep='\t')

df.head()
    nodes    x    y
0  c00033  146  958
1  c00031  601  195
...

xy_dict_list=dict([(i,[a,b]) for i, a,b in zip(df.nodes, df.x,df.y)])

xy_dict_list
{'c00022': [483, 868],
 'c00024': [146, 868],
 ... }

xy_dict_tuples=dict([(i,(a,b)) for i, a,b in zip(df.nodes, df.x,df.y)])

xy_dict_tuples
{'c00022': (483, 868),
 'c00024': (146, 868),
 ... }

Addendum

I later returned to this issue, for other, but related, work. Here is an approach that more closely mirrors the [excellent] accepted answer.

node_df = pd.read_csv('node_prop-glycolysis_tca-from_pg.tsv', sep='\t')

node_df.head()
   node  kegg_id kegg_cid            name  wt  vis
0  22    22       c00022   pyruvate        1   1
1  24    24       c00024   acetyl-CoA      1   1
...

Convert Pandas dataframe to a [list], {dict}, {dict of {dict}}, …

Per accepted answer:

node_df.set_index('kegg_cid').T.to_dict('list')

{'c00022': [22, 22, 'pyruvate', 1, 1],
 'c00024': [24, 24, 'acetyl-CoA', 1, 1],
 ... }

node_df.set_index('kegg_cid').T.to_dict('dict')

{'c00022': {'kegg_id': 22, 'name': 'pyruvate', 'node': 22, 'vis': 1, 'wt': 1},
 'c00024': {'kegg_id': 24, 'name': 'acetyl-CoA', 'node': 24, 'vis': 1, 'wt': 1},
 ... }

In my case, I wanted to do the same thing but with selected columns from the Pandas dataframe, so I needed to slice the columns. There are two approaches.

Directly:

(see: Convert pandas to dictionary defining the columns used for the key values)

node_df.set_index('kegg_cid')[['name', 'wt', 'vis']].T.to_dict('dict')

{'c00022': {'name': 'pyruvate', 'vis': 1, 'wt': 1},
 'c00024': {'name': 'acetyl-CoA', 'vis': 1, 'wt': 1},
 ... }

“Indirectly:” first, slice the desired columns/data from the Pandas dataframe (again, two approaches),

node_df_sliced = node_df[['kegg_cid', 'name', 'wt', 'vis']]

or

node_df_sliced2 = node_df.loc[:, ['kegg_cid', 'name', 'wt', 'vis']]

which can then be used to create a dictionary of dictionaries:

node_df_sliced.set_index('kegg_cid').T.to_dict('dict')

{'c00022': {'name': 'pyruvate', 'vis': 1, 'wt': 1},
 'c00024': {'name': 'acetyl-CoA', 'vis': 1, 'wt': 1},
 ... }

Question 59

DataFrame.to_dict() converts DataFrame to dictionary.

Example

>>> df = pd.DataFrame(
...     {'col1': [1, 2], 'col2': [0.5, 0.75]}, index=['a', 'b'])
>>> df
   col1  col2
a     1  0.50
b     2  0.75
>>> df.to_dict()
{'col1': {'a': 1, 'b': 2}, 'col2': {'a': 0.5, 'b': 0.75}}

See the pandas documentation for DataFrame.to_dict for details.
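The default call above returns a column-keyed dictionary. As a rough illustration of how the orient argument changes the shape of the result, here is a small sketch that reuses the same two-row frame; the variable names are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [0.5, 0.75]}, index=['a', 'b'])

# Column name -> list of values.
as_lists = df.to_dict('list')

# One dictionary per row; the index labels are dropped.
as_records = df.to_dict('records')

# Index label -> {column name: value}.
as_index = df.to_dict('index')

print(as_lists)
print(as_records)
print(as_index)
```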

Question 60

I want to group my dataframe by two columns and then sort the aggregated results within the groups.

In [167]:
df

Out[167]:
count   job source
0   2   sales   A
1   4   sales   B
2   6   sales   C
3   3   sales   D
4   7   sales   E
5   5   market  A
6   3   market  B
7   2   market  C
8   4   market  D
9   1   market  E

In [168]:
df.groupby(['job','source']).agg({'count':sum})

Out[168]:
            count
job     source  
market  A   5
        B   3
        C   2
        D   4
        E   1
sales   A   2
        B   4
        C   6
        D   3
        E   7

I would now like to sort the count column in descending order within each group and then keep only the top three rows, to get something like:

            count
job     source  
market  A   5
        D   4
        B   3
sales   E   7
        C   6
        B   4

Question 61

What you want is in fact a second groupby (on the result of the first groupby): sort and take the first three elements per group.

Starting from the result of the first groupby:

In [60]: df_agg = df.groupby(['job','source']).agg({'count':sum})

We group by the first level of the index:

In [63]: g = df_agg['count'].groupby('job', group_keys=False)

Then we want to sort (‘order’) each group and take the first three elements:

In [64]: res = g.apply(lambda x: x.sort_values(ascending=False).head(3))

However, there is a shortcut function for this, nlargest:

In [65]: g.nlargest(3)
Out[65]:
job     source
market  A         5
        D         4
        B         3
sales   E         7
        C         6
        B         4
dtype: int64

So in one go, this looks like:

df_agg['count'].groupby('job', group_keys=False).nlargest(3)
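For readers who want to run this end to end, here is a self-contained sketch that rebuilds the question's frame and applies the same steps. It passes the string "sum" rather than the builtin and groups with level="job"; both are small assumptions made here for clarity and are not part of the original answer.

```python
import pandas as pd

# The frame from the question.
df = pd.DataFrame({
    "count": [2, 4, 6, 3, 7, 5, 3, 2, 4, 1],
    "job": ["sales"] * 5 + ["market"] * 5,
    "source": list("ABCDE") * 2,
})

# First groupby: aggregate counts per (job, source).
df_agg = df.groupby(["job", "source"]).agg({"count": "sum"})

# Second groupby on the 'job' index level: keep the three largest counts per job.
top3 = df_agg["count"].groupby(level="job", group_keys=False).nlargest(3)

print(top3)
```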

Question 62

You could also just do it in one go, by doing the sort first and using head to take the first 3 of each group.

In [34]: df.sort_values(['job','count'],ascending=False).groupby('job').head(3)

Out[34]:
   count     job source
4      7   sales      E
2      6   sales      C
1      4   sales      B
5      5  market      A
8      4  market      D
6      3  market      B
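Note that ascending=False applies to both sort keys, so the jobs also come out in reverse alphabetical order. If the jobs should stay in ascending order while the counts are descending, sort_values accepts one flag per key. A minimal sketch on the question's data:

```python
import pandas as pd

# The frame from the question.
df = pd.DataFrame({
    "count": [2, 4, 6, 3, 7, 5, 3, 2, 4, 1],
    "job": ["sales"] * 5 + ["market"] * 5,
    "source": list("ABCDE") * 2,
})

# Sort jobs ascending and counts descending, then keep the first 3 rows per job.
top3 = (
    df.sort_values(["job", "count"], ascending=[True, False])
    .groupby("job")
    .head(3)
)

print(top3)
```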

Question 63

Here’s another example of taking the top 3 in sorted order, and of sorting within the groups:

In [43]: import pandas as pd                                                                                                                                                       

In [44]:  df = pd.DataFrame({"name":["Foo", "Foo", "Baar", "Foo", "Baar", "Foo", "Baar", "Baar"], "count_1":[5,10,12,15,20,25,30,35], "count_2" :[100,150,100,25,250,300,400,500]})

In [45]: df                                                                                                                                                                        
Out[45]: 
   count_1  count_2  name
0        5      100   Foo
1       10      150   Foo
2       12      100  Baar
3       15       25   Foo
4       20      250  Baar
5       25      300   Foo
6       30      400  Baar
7       35      500  Baar


### Top 3 on sorted order:
In [46]: df.groupby(["name"])["count_1"].nlargest(3)                                                                                                                               
Out[46]: 
name   
Baar  7    35
      6    30
      4    20
Foo   5    25
      3    15
      1    10
dtype: int64


### Sorting within groups based on column "count_1":
In [48]: df.groupby(["name"]).apply(lambda x: x.sort_values(["count_1"], ascending = False)).reset_index(drop=True)
Out[48]: 
   count_1  count_2  name
0       35      500  Baar
1       30      400  Baar
2       20      250  Baar
3       12      100  Baar
4       25      300   Foo
5       15       25   Foo
6       10      150   Foo
7        5      100   Foo

Question 64

Try this instead: a simple way to do a ‘groupby’ and then sort the result in descending order.

df.groupby(['companyName'])['overallRating'].sum().sort_values(ascending=False).head(20)
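To make the one-liner above concrete, here is a small runnable sketch. The ratings data is invented for illustration; only the column names companyName and overallRating come from the snippet, and head(2) is used instead of head(20) because the sample is tiny.

```python
import pandas as pd

# Hypothetical ratings data; only the column names come from the snippet above.
df = pd.DataFrame({
    "companyName": ["Acme", "Globex", "Acme", "Initech", "Globex", "Acme"],
    "overallRating": [4, 5, 3, 2, 4, 5],
})

# Sum ratings per company, sort the totals in descending order, keep the top 2.
totals = (
    df.groupby("companyName")["overallRating"]
    .sum()
    .sort_values(ascending=False)
    .head(2)
)

print(totals)
```

Note that this sorts the groups against each other by their totals; it does not sort rows within each group.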

Question: How to rotate x-axis tick labels in a Pandas barplot

Question: Python Pandas: keep a selected column as a DataFrame instead of a Series

Question: How to print a Pandas DataFrame without the index

Question: How to filter rows in pandas by regex

Question: What is the difference between a pandas Series and a single-column DataFrame?

Question: pandas: merge (join) two data frames on multiple columns

Question: Convert a Pandas DataFrame to a dictionary

Answer 2

Follow these steps:

1. Use set_index to set the ID column as the dataframe index.

2. Use the orient=index parameter to have the index as dictionary keys.

3. If you need each sample as a list, determine the column order and build the lists from that dictionary.
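The code that accompanied these steps is not present here, so the following is only a sketch of how the three steps might look. The sample frame and the column names A, B, C and ID are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical data: the column names A, B, C and ID are assumptions.
df = pd.DataFrame({"ID": ["p", "q"], "A": [1, 4], "B": [3, 3], "C": [2, 2]})

# Step 1: make the ID column the index.
indexed = df.set_index("ID")

# Step 2: orient="index" uses the index labels as dictionary keys.
row_dicts = indexed.to_dict(orient="index")

# Step 3: fix a column order, then turn each per-row dict into a list.
column_order = ["A", "B", "C"]
row_lists = {
    key: [row[col] for col in column_order]
    for key, row in row_dicts.items()
}

print(row_dicts)
print(row_lists)
```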

Question: pandas groupby sort

Question: Pandas DataFrame groupby two columns and get counts

Answer 3

An idiomatic solution uses only a single groupby.
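The answer's own code is not present here. One common single-groupby way to count rows per pair of columns, which may or may not be what the original answer used, is size() followed by reset_index. The data and the column names city and fruit below are hypothetical.

```python
import pandas as pd

# Hypothetical data; the column names "city" and "fruit" are assumptions.
df = pd.DataFrame({
    "city": ["Oslo", "Oslo", "Oslo", "Rome", "Rome"],
    "fruit": ["pear", "pear", "apple", "pear", "pear"],
})

# size() counts the rows in each (city, fruit) group with one groupby;
# reset_index(name=...) turns the result back into a flat DataFrame.
counts = df.groupby(["city", "fruit"]).size().reset_index(name="count")

print(counts)
```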

Question: How to add a header row to a Pandas DataFrame
