标签归档:pandas

如何在Pandas barplot中旋转x轴刻度标签

问题:如何在Pandas barplot中旋转x轴刻度标签

使用以下代码:

import matplotlib
matplotlib.style.use('ggplot')
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({ 'celltype':["foo","bar","qux","woz"], 's1':[5,9,1,7], 's2':[12,90,13,87]})
df = df[["celltype","s1","s2"]]
df.set_index(["celltype"],inplace=True)
df.plot(kind='bar',alpha=0.75)
plt.xlabel("")

我做了这个情节:

在此处输入图片说明

如何将x轴刻度标签旋转到0度?

我尝试添加它,但是没有用:

plt.set_xticklabels(df.index,rotation=90)

With the following code:

import matplotlib
matplotlib.style.use('ggplot')
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({ 'celltype':["foo","bar","qux","woz"], 's1':[5,9,1,7], 's2':[12,90,13,87]})
df = df[["celltype","s1","s2"]]
df.set_index(["celltype"],inplace=True)
df.plot(kind='bar',alpha=0.75)
plt.xlabel("")

I made this plot:

enter image description here

How can I rotate the x-axis tick labels to 0 degrees?

I tried adding this but did not work:

plt.set_xticklabels(df.index,rotation=90)

回答 0

传递参数rot=0以旋转xticks:

import matplotlib
matplotlib.style.use('ggplot')
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({ 'celltype':["foo","bar","qux","woz"], 's1':[5,9,1,7], 's2':[12,90,13,87]})
df = df[["celltype","s1","s2"]]
df.set_index(["celltype"],inplace=True)
df.plot(kind='bar',alpha=0.75, rot=0)
plt.xlabel("")
plt.show()

Yield图:

在此处输入图片说明

Pass param rot=0 to rotate the xticks:

import matplotlib
matplotlib.style.use('ggplot')
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({ 'celltype':["foo","bar","qux","woz"], 's1':[5,9,1,7], 's2':[12,90,13,87]})
df = df[["celltype","s1","s2"]]
df.set_index(["celltype"],inplace=True)
df.plot(kind='bar',alpha=0.75, rot=0)
plt.xlabel("")
plt.show()

yields plot:

enter image description here


回答 1

尝试这个 –

plt.xticks(rotation=90)

在此处输入图片说明

Try this –

plt.xticks(rotation=90)

enter image description here


回答 2

问题很明确,但标题不够准确。我的答案是给那些想要更改标签(而不是刻度标签)的人的,这是公认的答案。(标题现已更正)。

for ax in plt.gcf().axes:
    plt.sca(ax)
    plt.xlabel(ax.get_xlabel(), rotation=90)

The question is clear but the title is not as precise as it could be. My answer is for those who came looking to change the axis label, as opposed to the tick labels, which is what the accepted answer is about. (The title has now been corrected).

for ax in plt.gcf().axes:
    plt.sca(ax)
    plt.xlabel(ax.get_xlabel(), rotation=90)

回答 3

您可以使用set_xticklabels()

ax.set_xticklabels(df['Names'], rotation=90, ha='right')

You can use set_xticklabels()

ax.set_xticklabels(df['Names'], rotation=90, ha='right')

回答 4

以下内容可能会有所帮助:

# Valid font size are xx-small, x-small, small, medium, large, x-large, xx-large, larger, smaller, None

plt.xticks(
    rotation=45,
    horizontalalignment='right',
    fontweight='light',
    fontsize='medium',
)

这是带有示例和API的函数xticks[reference]

def xticks(ticks=None, labels=None, **kwargs):
    """
    Get or set the current tick locations and labels of the x-axis.

    Call signatures::

        locs, labels = xticks()            # Get locations and labels
        xticks(ticks, [labels], **kwargs)  # Set locations and labels

    Parameters
    ----------
    ticks : array_like
        A list of positions at which ticks should be placed. You can pass an
        empty list to disable xticks.

    labels : array_like, optional
        A list of explicit labels to place at the given *locs*.

    **kwargs
        :class:`.Text` properties can be used to control the appearance of
        the labels.

    Returns
    -------
    locs
        An array of label locations.
    labels
        A list of `.Text` objects.

    Notes
    -----
    Calling this function with no arguments (e.g. ``xticks()``) is the pyplot
    equivalent of calling `~.Axes.get_xticks` and `~.Axes.get_xticklabels` on
    the current axes.
    Calling this function with arguments is the pyplot equivalent of calling
    `~.Axes.set_xticks` and `~.Axes.set_xticklabels` on the current axes.

    Examples
    --------
    Get the current locations and labels:

        >>> locs, labels = xticks()

    Set label locations:

        >>> xticks(np.arange(0, 1, step=0.2))

    Set text labels:

        >>> xticks(np.arange(5), ('Tom', 'Dick', 'Harry', 'Sally', 'Sue'))

    Set text labels and properties:

        >>> xticks(np.arange(12), calendar.month_name[1:13], rotation=20)

    Disable xticks:

        >>> xticks([])
    """

The follows might be helpful:

# Valid font size are xx-small, x-small, small, medium, large, x-large, xx-large, larger, smaller, None

plt.xticks(
    rotation=45,
    horizontalalignment='right',
    fontweight='light',
    fontsize='medium',
)

Here is the function xticks[reference] with example and API

def xticks(ticks=None, labels=None, **kwargs):
    """
    Get or set the current tick locations and labels of the x-axis.

    Call signatures::

        locs, labels = xticks()            # Get locations and labels
        xticks(ticks, [labels], **kwargs)  # Set locations and labels

    Parameters
    ----------
    ticks : array_like
        A list of positions at which ticks should be placed. You can pass an
        empty list to disable xticks.

    labels : array_like, optional
        A list of explicit labels to place at the given *locs*.

    **kwargs
        :class:`.Text` properties can be used to control the appearance of
        the labels.

    Returns
    -------
    locs
        An array of label locations.
    labels
        A list of `.Text` objects.

    Notes
    -----
    Calling this function with no arguments (e.g. ``xticks()``) is the pyplot
    equivalent of calling `~.Axes.get_xticks` and `~.Axes.get_xticklabels` on
    the current axes.
    Calling this function with arguments is the pyplot equivalent of calling
    `~.Axes.set_xticks` and `~.Axes.set_xticklabels` on the current axes.

    Examples
    --------
    Get the current locations and labels:

        >>> locs, labels = xticks()

    Set label locations:

        >>> xticks(np.arange(0, 1, step=0.2))

    Set text labels:

        >>> xticks(np.arange(5), ('Tom', 'Dick', 'Harry', 'Sally', 'Sue'))

    Set text labels and properties:

        >>> xticks(np.arange(12), calendar.month_name[1:13], rotation=20)

    Disable xticks:

        >>> xticks([])
    """

回答 5

对于条形图,可以包括最终希望刻度线具有的角度。

在这里,我正在rot=0使它们平行于x轴。

series.plot.bar(rot=0)
plt.show()
plt.close()

For bar graphs, you can include the angle which you finally want the ticks to have.

Here I am using rot=0 to make them parallel to the x axis.

series.plot.bar(rot=0)
plt.show()
plt.close()

Python Pandas:将选定的列保留为DataFrame而不是Series

问题:Python Pandas:将选定的列保留为DataFrame而不是Series

从pandas DataFrame中选择单个列时(例如df.iloc[:, 0]df['A']df.A等),结果矢量将自动转换为Series而不是单列DataFrame。但是,我正在编写一些将DataFrame作为输入参数的函数。因此,我更喜欢处理单列DataFrame而不是Series,以便函数可以假定df.columns是可访问的。现在,我必须使用来将Series显式转换为DataFrame pd.DataFrame(df.iloc[:, 0])。这似乎不是最干净的方法。是否有更优雅的方法直接从DataFrame进行索引,以便结果是单列DataFrame而不是Series?

When selecting a single column from a pandas DataFrame(say df.iloc[:, 0], df['A'], or df.A, etc), the resulting vector is automatically converted to a Series instead of a single-column DataFrame. However, I am writing some functions that takes a DataFrame as an input argument. Therefore, I prefer to deal with single-column DataFrame instead of Series so that the function can assume say df.columns is accessible. Right now I have to explicitly convert the Series into a DataFrame by using something like pd.DataFrame(df.iloc[:, 0]). This doesn’t seem like the most clean method. Is there a more elegant way to index from a DataFrame directly so that the result is a single-column DataFrame instead of Series?


回答 0

正如@Jeff提到的,有几种方法可以做到这一点,但我建议使用loc / iloc来使其更明确(如果尝试歧义,请提早出错):

In [10]: df = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 'B'])

In [11]: df
Out[11]:
   A  B
0  1  2
1  3  4

In [12]: df[['A']]

In [13]: df[[0]]

In [14]: df.loc[:, ['A']]

In [15]: df.iloc[:, [0]]

Out[12-15]:  # they all return the same thing:
   A
0  1
1  3

在整数列名称的情况下,后两种选择消除了歧义(正是创建loc / iloc的原因)。例如:

In [16]: df = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 0])

In [17]: df
Out[17]:
   A  0
0  1  2
1  3  4

In [18]: df[[0]]  # ambiguous
Out[18]:
   A
0  1
1  3

As @Jeff mentions there are a few ways to do this, but I recommend using loc/iloc to be more explicit (and raise errors early if your trying something ambiguous):

In [10]: df = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 'B'])

In [11]: df
Out[11]:
   A  B
0  1  2
1  3  4

In [12]: df[['A']]

In [13]: df[[0]]

In [14]: df.loc[:, ['A']]

In [15]: df.iloc[:, [0]]

Out[12-15]:  # they all return the same thing:
   A
0  1
1  3

The latter two choices remove ambiguity in the case of integer column names (precisely why loc/iloc were created). For example:

In [16]: df = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 0])

In [17]: df
Out[17]:
   A  0
0  1  2
1  3  4

In [18]: df[[0]]  # ambiguous
Out[18]:
   A
0  1
1  3

回答 1

正如安迪·海登(Andy Hayden)所建议的那样,利用.iloc / .loc索引(单列)数据帧是可行的方法。要注意的另一点是如何表达索引位置。使用列出的索引标签/位置,同时指定要作为数据框索引的参数值;否则将返回“ pandas.core.series.Series”

输入:

    A_1 = train_data.loc[:,'Fraudster']
    print('A_1 is of type', type(A_1))
    A_2 = train_data.loc[:, ['Fraudster']]
    print('A_2 is of type', type(A_2))
    A_3 = train_data.iloc[:,12]
    print('A_3 is of type', type(A_3))
    A_4 = train_data.iloc[:,[12]]
    print('A_4 is of type', type(A_4))

输出:

    A_1 is of type <class 'pandas.core.series.Series'>
    A_2 is of type <class 'pandas.core.frame.DataFrame'>
    A_3 is of type <class 'pandas.core.series.Series'>
    A_4 is of type <class 'pandas.core.frame.DataFrame'>

As Andy Hayden recommends, utilizing .iloc/.loc to index out (single-columned) dataframe is the way to go; another point to note is how to express the index positions. Use a listed Index labels/positions whilst specifying the argument values to index out as Dataframe; failure to do so will return a ‘pandas.core.series.Series’

Input:

    A_1 = train_data.loc[:,'Fraudster']
    print('A_1 is of type', type(A_1))
    A_2 = train_data.loc[:, ['Fraudster']]
    print('A_2 is of type', type(A_2))
    A_3 = train_data.iloc[:,12]
    print('A_3 is of type', type(A_3))
    A_4 = train_data.iloc[:,[12]]
    print('A_4 is of type', type(A_4))

Output:

    A_1 is of type <class 'pandas.core.series.Series'>
    A_2 is of type <class 'pandas.core.frame.DataFrame'>
    A_3 is of type <class 'pandas.core.series.Series'>
    A_4 is of type <class 'pandas.core.frame.DataFrame'>

回答 2

您可以使用df.iloc[:, 0:1],在这种情况下,结果向量将是aDataFrame而不是序列。

如你看到的:

在此处输入图片说明

You can use df.iloc[:, 0:1], in this case the resulting vector will be a DataFrame and not series.

As you can see:

enter image description here


回答 3

提到了这三种方法:

pd.DataFrame(df.loc[:, 'A'])  # Approach of the original post
df.loc[:,[['A']]              # Approach 2 (note: use iloc for positional indexing)
df[['A']]                     # Approach 3

pd.Series.to_frame()是另一种方法。

因为它是一种方法,所以可以在上述第二种方法和第三种方法不适用的情况下使用。特别是,在将某些方法应用于数据框中的列并且要将输出转换为数据框而不是序列时,此方法很有用。例如,在Jupyter Notebook中,一系列不会有漂亮的输出,但是会有一个数据框。

# Basic use case: 
df['A'].to_frame()

# Use case 2 (this will give you pretty output in a Jupyter Notebook): 
df['A'].describe().to_frame()

# Use case 3: 
df['A'].str.strip().to_frame()

# Use case 4: 
def some_function(num): 
    ...

df['A'].apply(some_function).to_frame()

These three approaches have been mentioned:

pd.DataFrame(df.loc[:, 'A'])  # Approach of the original post
df.loc[:,[['A']]              # Approach 2 (note: use iloc for positional indexing)
df[['A']]                     # Approach 3

pd.Series.to_frame() is another approach.

Because it is a method, it can be used in situations where the second and third approaches above do not apply. In particular, it is useful when applying some method to a column in your dataframe and you want to convert the output into a dataframe instead of a series. For instance, in a Jupyter Notebook a series will not have pretty output, but a dataframe will.

# Basic use case: 
df['A'].to_frame()

# Use case 2 (this will give you pretty output in a Jupyter Notebook): 
df['A'].describe().to_frame()

# Use case 3: 
df['A'].str.strip().to_frame()

# Use case 4: 
def some_function(num): 
    ...

df['A'].apply(some_function).to_frame()

如何在没有索引的情况下打印Pandas DataFrame

问题:如何在没有索引的情况下打印Pandas DataFrame

我想打印整个数据框,但是我不想打印索引

此外,一列是日期时间类型,我只想打印时间,而不是日期。

数据框如下所示:

   User ID           Enter Time   Activity Number
0      123  2014-07-08 00:09:00              1411
1      123  2014-07-08 00:18:00               893
2      123  2014-07-08 00:49:00              1041

我希望它打印为

User ID   Enter Time   Activity Number
123         00:09:00              1411
123         00:18:00               893
123         00:49:00              1041

I want to print the whole dataframe, but I don’t want to print the index

Besides, one column is datetime type, I just want to print time, not date.

The dataframe looks like:

   User ID           Enter Time   Activity Number
0      123  2014-07-08 00:09:00              1411
1      123  2014-07-08 00:18:00               893
2      123  2014-07-08 00:49:00              1041

I want it print as

User ID   Enter Time   Activity Number
123         00:09:00              1411
123         00:18:00               893
123         00:49:00              1041

回答 0

print df.to_string(index=False)
print df.to_string(index=False)

回答 1

print(df.to_csv(sep='\t', index=False))

或可能:

print(df.to_csv(columns=['A', 'B', 'C'], sep='\t', index=False))
print(df.to_csv(sep='\t', index=False))

Or possibly:

print(df.to_csv(columns=['A', 'B', 'C'], sep='\t', index=False))

回答 2

下面的行在打印时将隐藏DataFrame的索引列

df.style.hide_index()

The line below would hide the index column of DataFrame when you print

df.style.hide_index()

回答 3

如果要漂亮地打印数据框,则可以使用列表包。

import pandas as pd
import numpy as np
from tabulate import tabulate

def pprint_df(dframe):
    print tabulate(dframe, headers='keys', tablefmt='psql', showindex=False)

df = pd.DataFrame({'col1': np.random.randint(0, 100, 10), 
    'col2': np.random.randint(50, 100, 10), 
    'col3': np.random.randint(10, 10000, 10)})

pprint_df(df)

具体来说,showindex=False顾名思义,,您可以不显示索引。输出如下所示:

+--------+--------+--------+
|   col1 |   col2 |   col3 |
|--------+--------+--------|
|     15 |     76 |   5175 |
|     30 |     97 |   3331 |
|     34 |     56 |   3513 |
|     50 |     65 |    203 |
|     84 |     75 |   7559 |
|     41 |     82 |    939 |
|     78 |     59 |   4971 |
|     98 |     99 |    167 |
|     81 |     99 |   6527 |
|     17 |     94 |   4267 |
+--------+--------+--------+

If you want to pretty print the data frames, then you can use tabulate package.

import pandas as pd
import numpy as np
from tabulate import tabulate

def pprint_df(dframe):
    print tabulate(dframe, headers='keys', tablefmt='psql', showindex=False)

df = pd.DataFrame({'col1': np.random.randint(0, 100, 10), 
    'col2': np.random.randint(50, 100, 10), 
    'col3': np.random.randint(10, 10000, 10)})

pprint_df(df)

Specifically, the showindex=False, as the name says, allows you to not show index. The output would look as follows:

+--------+--------+--------+
|   col1 |   col2 |   col3 |
|--------+--------+--------|
|     15 |     76 |   5175 |
|     30 |     97 |   3331 |
|     34 |     56 |   3513 |
|     50 |     65 |    203 |
|     84 |     75 |   7559 |
|     41 |     82 |    939 |
|     78 |     59 |   4971 |
|     98 |     99 |    167 |
|     81 |     99 |   6527 |
|     17 |     94 |   4267 |
+--------+--------+--------+

回答 4

保留“精美印刷”使用

from IPython.display import HTML
HTML(df.to_html(index=False))

在此处输入图片说明

To retain “pretty-print” use

from IPython.display import HTML
HTML(df.to_html(index=False))

enter image description here


回答 5

如果只想打印一个字符串/ json,可以使用以下方法解决:

print(df.to_string(index=False))

Buf如果您也想序列化数据甚至发送到MongoDB,最好执行以下操作:

document = df.to_dict(orient='list')

到目前为止,有6种方法可以调整数据方向,请在熊猫文档中查看更多适合您的方法。

If you just want a string/json to print it can be solved with:

print(df.to_string(index=False))

Buf if you want to serialize the data too or even send to a MongoDB, would be better to do something like:

document = df.to_dict(orient='list')

There are 6 ways by now to orient the data, check more in the panda docs which better fits you.


回答 6

要回答“如何在没有索引的情况下打印数据框”问题,可以将索引设置为空字符串数组(数据帧中的每一行一个),如下所示:

blankIndex=[''] * len(df)
df.index=blankIndex

如果我们使用您帖子中的数据:

row1 = (123, '2014-07-08 00:09:00', 1411)
row2 = (123, '2014-07-08 00:49:00', 1041)
row3 = (123, '2014-07-08 00:09:00', 1411)
data = [row1, row2, row3]
#set up dataframe
df = pd.DataFrame(data, columns=('User ID', 'Enter Time', 'Activity Number'))
print(df)

通常将其打印为:

   User ID           Enter Time  Activity Number
0      123  2014-07-08 00:09:00             1411
1      123  2014-07-08 00:49:00             1041
2      123  2014-07-08 00:09:00             1411

通过创建一个空字符串与数据框中的行数一样多的数组:

blankIndex=[''] * len(df)
df.index=blankIndex
print(df)

它将从输出中删除索引:

  User ID           Enter Time  Activity Number
      123  2014-07-08 00:09:00             1411
      123  2014-07-08 00:49:00             1041
      123  2014-07-08 00:09:00             1411

并且在Jupyter Notebook中将按照此屏幕截图进行渲染: 没有索引列的Juptyer Notebooks数据框

To answer the “How to print dataframe without an index” question, you can set the index to be an array of empty strings (one for each row in the dataframe), like this:

blankIndex=[''] * len(df)
df.index=blankIndex

If we use the data from your post:

row1 = (123, '2014-07-08 00:09:00', 1411)
row2 = (123, '2014-07-08 00:49:00', 1041)
row3 = (123, '2014-07-08 00:09:00', 1411)
data = [row1, row2, row3]
#set up dataframe
df = pd.DataFrame(data, columns=('User ID', 'Enter Time', 'Activity Number'))
print(df)

which would normally print out as:

   User ID           Enter Time  Activity Number
0      123  2014-07-08 00:09:00             1411
1      123  2014-07-08 00:49:00             1041
2      123  2014-07-08 00:09:00             1411

By creating an array with as many empty strings as there are rows in the data frame:

blankIndex=[''] * len(df)
df.index=blankIndex
print(df)

It will remove the index from the output:

  User ID           Enter Time  Activity Number
      123  2014-07-08 00:09:00             1411
      123  2014-07-08 00:49:00             1041
      123  2014-07-08 00:09:00             1411

And in Jupyter Notebooks would render as per this screenshot: Juptyer Notebooks dataframe with no index column


回答 7

与上面使用df.to_string(index = False)的许多答案类似,我经常发现有必要提取值的单列,在这种情况下,您可以使用.to_string使用以下内容指定单个列:

data = pd.DataFrame({'col1': np.random.randint(0, 100, 10), 
    'col2': np.random.randint(50, 100, 10), 
    'col3': np.random.randint(10, 10000, 10)})

print(data.to_string(columns=['col1'], index=False)

print(data.to_string(columns=['col1', 'col2'], index=False))

它提供了易于复制(且无索引)的输出,可用于粘贴到其他地方(Excel)。样本输出:

col1  col2    
49    62    
97    97    
87    94    
85    61    
18    55

Similar to many of the answers above that use df.to_string(index=False), I often find it necessary to extract a single column of values in which case you can specify an individual column with .to_string using the following:

data = pd.DataFrame({'col1': np.random.randint(0, 100, 10), 
    'col2': np.random.randint(50, 100, 10), 
    'col3': np.random.randint(10, 10000, 10)})

print(data.to_string(columns=['col1'], index=False)

print(data.to_string(columns=['col1', 'col2'], index=False))

Which provides an easy to copy (and index free) output for use pasting elsewhere (Excel). Sample output:

col1  col2    
49    62    
97    97    
87    94    
85    61    
18    55

如何通过正则表达式过滤熊猫中的行

问题:如何通过正则表达式过滤熊猫中的行

我想在其中一列上使用正则表达式干净地过滤数据框。

举一个人为的例子:

In [210]: foo = pd.DataFrame({'a' : [1,2,3,4], 'b' : ['hi', 'foo', 'fat', 'cat']})
In [211]: foo
Out[211]: 
   a    b
0  1   hi
1  2  foo
2  3  fat
3  4  cat

我想将行过滤为以f正则表达式开头的行。首先去:

In [213]: foo.b.str.match('f.*')
Out[213]: 
0    []
1    ()
2    ()
3    []

这不是太有用了。但是,这将使我得到我的布尔值索引:

In [226]: foo.b.str.match('(f.*)').str.len() > 0
Out[226]: 
0    False
1     True
2     True
3    False
Name: b

因此,我可以通过以下方式进行限制:

In [229]: foo[foo.b.str.match('(f.*)').str.len() > 0]
Out[229]: 
   a    b
1  2  foo
2  3  fat

但是,这使我人为地将一组放入正则表达式中,并且似乎不是一种干净的方法。有一个更好的方法吗?

I would like to cleanly filter a dataframe using regex on one of the columns.

For a contrived example:

In [210]: foo = pd.DataFrame({'a' : [1,2,3,4], 'b' : ['hi', 'foo', 'fat', 'cat']})
In [211]: foo
Out[211]: 
   a    b
0  1   hi
1  2  foo
2  3  fat
3  4  cat

I want to filter the rows to those that start with f using a regex. First go:

In [213]: foo.b.str.match('f.*')
Out[213]: 
0    []
1    ()
2    ()
3    []

That’s not too terribly useful. However this will get me my boolean index:

In [226]: foo.b.str.match('(f.*)').str.len() > 0
Out[226]: 
0    False
1     True
2     True
3    False
Name: b

So I could then do my restriction by:

In [229]: foo[foo.b.str.match('(f.*)').str.len() > 0]
Out[229]: 
   a    b
1  2  foo
2  3  fat

That makes me artificially put a group into the regex though, and seems like maybe not the clean way to go. Is there a better way to do this?


回答 0

使用包含代替:

In [10]: df.b.str.contains('^f')
Out[10]: 
0    False
1     True
2     True
3    False
Name: b, dtype: bool

Use contains instead:

In [10]: df.b.str.contains('^f')
Out[10]: 
0    False
1     True
2     True
3    False
Name: b, dtype: bool

回答 1

已经有一个字符串处理功能 Series.str.startswith()。你应该尝试foo[foo.b.str.startswith('f')]

结果:

    a   b
1   2   foo
2   3   fat

我认为您的期望。

另外,您可以使用包含和正则表达式选项。例如:

foo[foo.b.str.contains('oo', regex= True, na=False)]

结果:

    a   b
1   2   foo

na=False 是为了防止出现nan,null等值时出现错误

There is already a string handling function Series.str.startswith(). You should try foo[foo.b.str.startswith('f')].

Result:

    a   b
1   2   foo
2   3   fat

I think what you expect.

Alternatively you can use contains with regex option. For example:

foo[foo.b.str.contains('oo', regex= True, na=False)]

Result:

    a   b
1   2   foo

na=False is to prevent Errors in case there is nan, null etc. values


回答 2

使用数据框进行多列搜索:

frame[frame.filename.str.match('*.'+MetaData+'.*') & frame.file_path.str.match('C:\test\test.txt')]

Multiple column search with dataframe:

frame[frame.filename.str.match('*.'+MetaData+'.*') & frame.file_path.str.match('C:\test\test.txt')]

回答 3

这可能会有点晚,但是现在在Pandas中更容易做到。您可以调用match with as_indexer=True以获得布尔结果。这是记录(与之间的差异沿matchcontains在这里

It may be a bit late, but this is now easier to do in Pandas by calling Series.str.match. The docs explain the difference between match, fullmatch and contains.

Note that in order to use the results for indexing, set the na=False argument (or True if you want to include NANs in the results).


回答 4

感谢您提供@ user3136169的出色答案,这是一个如何删除NoneType值的示例。

def regex_filter(val):
    if val:
        mo = re.search(regex,val)
        if mo:
            return True
        else:
            return False
    else:
        return False

df_filtered = df[df['col'].apply(regex_filter)]

您也可以将regex添加为arg:

def regex_filter(val,myregex):
    ...

df_filtered = df[df['col'].apply(res_regex_filter,regex=myregex)]

Thanks for the great answer @user3136169, here is an example of how that might be done also removing NoneType values.

def regex_filter(val):
    if val:
        mo = re.search(regex,val)
        if mo:
            return True
        else:
            return False
    else:
        return False

df_filtered = df[df['col'].apply(regex_filter)]

Also you can also add regex as an arg:

def regex_filter(val,myregex):
    ...

df_filtered = df[df['col'].apply(res_regex_filter,regex=myregex)]

回答 5

编写一个布尔函数来检查正则表达式并在列上使用apply

foo[foo['b'].apply(regex_function)]

Write a Boolean function that checks the regex and use apply on the column

foo[foo['b'].apply(regex_function)]

回答 6

使用str 切片

foo[foo.b.str[0]=='f']
Out[18]: 
   a    b
1  2  foo
2  3  fat

Using str slice

foo[foo.b.str[0]=='f']
Out[18]: 
   a    b
1  2  foo
2  3  fat

pandas系列和单列DataFrame有什么区别?

问题:pandas系列和单列DataFrame有什么区别?

为何熊猫区分a Series和单栏DataFrame
换句话说:Series该类存在的原因是什么?

我主要使用带有datetime索引的时间序列,也许这有助于设置上下文。

Why does pandas make a distinction between a Series and a single-column DataFrame?
In other words: what is the reason of existence of the Series class?

I’m mainly using time series with datetime index, maybe that helps to set the context.


回答 0

引用熊猫文档

pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)

具有标注轴(行和列)的二维大小可变的,可能是异构的表格数据结构。算术运算在行和列标签上对齐。可以看作是Series对象的类似dict的容器。大熊猫的主要数据结构。

因此,系列是a的单个列的数据结构DataFrame,不仅在概念上,而且从字面上看,即a中的数据DataFrame实际上都作为的集合存储在内存中Series

类似地:我们需要列表和矩阵,因为矩阵是用列表构建的。单行矩阵虽然在功能上等同于列表,但没有它们组成的列表仍然不存在。

它们都具有极其相似的API,但是您会发现DataFrame方法始终可以满足您拥有不止一列的可能性。并且,当然,您总是可以向添加另一个Series(或等效对象)DataFrame,而向添加Series另一个Series涉及创建DataFrame

Quoting the Pandas docs

pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)

Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure.

So, the Series is the data structure for a single column of a DataFrame, not only conceptually, but literally, i.e. the data in a DataFrame is actually stored in memory as a collection of Series.

Analogously: We need both lists and matrices, because matrices are built with lists. Single row matricies, while equivalent to lists in functionality still cannot exists without the list(s) they’re composed of.

They both have extremely similar APIs, but you’ll find that DataFrame methods always cater to the possibility that you have more than one column. And, of course, you can always add another Series (or equivalent object) to a DataFrame, while adding a Series to another Series involves creating a DataFrame.


回答 1

来自pandas doc,网址http://pandas.pydata.org/pandas-docs/stable/dsintro.html。Series是一维标记的数组,能够保存任何数据类型。以熊猫系列的形式读取数据:

import pandas as pd
ds = pd.Series(data, index=index)

DataFrame是二维标记的数据结构,具有可能不同类型的列。

import pandas as pd
df = pd.DataFrame(data, index=index)

在以上两个索引中都是列表

例如:我有一个csv文件,其中包含以下数据:

,country,popuplation,area,capital
BR,Brazil,10210,12015,Brasile
RU,Russia,1025,457,Moscow
IN,India,10458,457787,New Delhi

要读取以上数据作为序列和数据框:

import pandas as pd
file_data = pd.read_csv("file_path", index_col=0)
d = pd.Series(file_data.country, index=['BR','RU','IN'] or index =  file_data.index)

输出:

>>> d
BR           Brazil
RU           Russia
IN            India

df = pd.DataFrame(file_data.area, index=['BR','RU','IN'] or index = file_data.index )

输出:

>>> df
      area
BR   12015
RU     457
IN  457787

from the pandas doc http://pandas.pydata.org/pandas-docs/stable/dsintro.html Series is a one-dimensional labeled array capable of holding any data type. To read data in form of panda Series:

import pandas as pd
ds = pd.Series(data, index=index)

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types.

import pandas as pd
df = pd.DataFrame(data, index=index)

In both of the above index is list

for example: I have a csv file with following data:

,country,popuplation,area,capital
BR,Brazil,10210,12015,Brasile
RU,Russia,1025,457,Moscow
IN,India,10458,457787,New Delhi

To read above data as series and data frame:

import pandas as pd
file_data = pd.read_csv("file_path", index_col=0)
d = pd.Series(file_data.country, index=['BR','RU','IN'] or index =  file_data.index)

output:

>>> d
BR           Brazil
RU           Russia
IN            India

df = pd.DataFrame(file_data.area, index=['BR','RU','IN'] or index = file_data.index )

output:

>>> df
      area
BR   12015
RU     457
IN  457787

回答 2

系列是一维对象,可以保存任何数据类型,例如整数,浮点数和字符串,例如

   import pandas as pd
   x = pd.Series([A,B,C]) 

0 A
1 B
2 C

系列的第一列称为索引,即0,1,2,第二列是您的实际数据,即A,B,C

DataFrames是二维对象,可以容纳序列,列表,字典

df=pd.DataFrame(rd(5,4),['A','B','C','D','E'],['W','X','Y','Z'])

Series is a one-dimensional object that can hold any data type such as integers, floats and strings e.g

   import pandas as pd
   x = pd.Series([A,B,C]) 

0 A
1 B
2 C

The first column of Series is known as index i.e 0,1,2 the second column is your actual data i.e A,B,C

DataFrames is two-dimensional object that can hold series, list, dictionary

df=pd.DataFrame(rd(5,4),['A','B','C','D','E'],['W','X','Y','Z'])

回答 3

系列是一维标记的数组,能够保存任何数据类型(整数,字符串,浮点数,Python对象等)。轴标签统称为索引。创建系列的基本方法是调用:

s = pd.Series(data, index=index)

DataFrame是二维标记的数据结构,具有可能不同类型的列。您可以将其视为电子表格或SQL表,或Series对象的字典。

 d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
 two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
 df = pd.DataFrame(d)

Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic method to create a Series is to call:

s = pd.Series(data, index=index)

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects.

 d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
 two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
 df = pd.DataFrame(d)

回答 4

导入汽车数据

import pandas as pd

cars = pd.read_csv('cars.csv', index_col = 0)

这是cars.csv文件的外观。

打印出drive_right列为Series:

print(cars.loc[:,"drives_right"])

    US      True
    AUS    False
    JAP    False
    IN     False
    RU      True
    MOR     True
    EG      True
    Name: drives_right, dtype: bool

单括号版本提供Pandas系列,双括号版本提供Pandas DataFrame。

打印出drive_right列作为DataFrame

print(cars.loc[:,["drives_right"]])

         drives_right
    US           True
    AUS         False
    JAP         False
    IN          False
    RU           True
    MOR          True
    EG           True

将一个系列添加到另一个系列会创建一个DataFrame。

Import cars data

import pandas as pd

cars = pd.read_csv('cars.csv', index_col = 0)

Here is how the cars.csv file looks.

Print out drives_right column as Series:

print(cars.loc[:,"drives_right"])

    US      True
    AUS    False
    JAP    False
    IN     False
    RU      True
    MOR     True
    EG      True
    Name: drives_right, dtype: bool

The single bracket version gives a Pandas Series, the double bracket version gives a Pandas DataFrame.

Print out drives_right column as DataFrame

print(cars.loc[:,["drives_right"]])

         drives_right
    US           True
    AUS         False
    JAP         False
    IN          False
    RU           True
    MOR          True
    EG           True

Adding a Series to another Series creates a DataFrame.


大熊猫:在多列上合并(合并)两个数据框

问题:大熊猫:在多列上合并(合并)两个数据框

我正在尝试使用两列加入两个熊猫数据框:

new_df = pd.merge(A_df, B_df,  how='left', left_on='[A_c1,c2]', right_on = '[B_c1,c2]')

但出现以下错误:

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4164)()

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4028)()

pandas/src/hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13166)()

pandas/src/hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13120)()

KeyError: '[B_1, c2]'

任何想法应该是正确的方法吗?谢谢!

I am trying to join two pandas data frames using two columns:

new_df = pd.merge(A_df, B_df,  how='left', left_on='[A_c1,c2]', right_on = '[B_c1,c2]')

but got the following error:

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4164)()

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4028)()

pandas/src/hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13166)()

pandas/src/hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13120)()

KeyError: '[B_1, c2]'

Any idea what should be the right way to do this? Thanks!


回答 0

试试这个

new_df = pd.merge(A_df, B_df,  how='left', left_on=['A_c1','c2'], right_on = ['B_c1','c2'])

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html

left_on:要在左侧DataFrame中加入的标签或列表或类似数组的字段名称。可以是DataFrame长度的向量或向量列表,以使用特定向量作为连接键而不是列

right_on:标签或列表,或类似数组的字段名称,以在right DataFrame或每个left_on文档的向量/向量列表中加入

Try this

new_df = pd.merge(A_df, B_df,  how='left', left_on=['A_c1','c2'], right_on = ['B_c1','c2'])

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html

left_on : label or list, or array-like Field names to join on in left DataFrame. Can be a vector or list of vectors of the length of the DataFrame to use a particular vector as the join key instead of columns

right_on : label or list, or array-like Field names to join on in right DataFrame or vector/list of vectors per left_on docs


回答 1

这里的问题是,通过使用撇号,您可以将要传递的值设置为字符串,而实际上,正如文档中的@Shijo所述,该函数需要的是标签或列表,而不是字符串!如果列表包含为左和右数据帧传递的每个列名,则每个列名必须分别在撇号内。通过上述陈述,我们可以理解为什么这是不正确的:

new_df = pd.merge(A_df, B_df,  how='left', left_on='[A_c1,c2]', right_on = '[B_c1,c2]')

这是使用函数的正确方法:

new_df = pd.merge(A_df, B_df,  how='left', left_on=['A_c1','c2'], right_on = ['B_c1','c2'])

the problem here is that by using the apostrophes you are setting the value being passed to be a string, when in fact, as @Shijo stated from the documentation, the function is expecting a label or list, but not a string! If the list contains each of the name of the columns beings passed for both the left and right dataframe, then each column-name must individually be within apostrophes. With what has been stated, we can understand why this is inccorect:

new_df = pd.merge(A_df, B_df,  how='left', left_on='[A_c1,c2]', right_on = '[B_c1,c2]')

And this is the correct way of using the function:

new_df = pd.merge(A_df, B_df,  how='left', left_on=['A_c1','c2'], right_on = ['B_c1','c2'])

回答 2

另一种方法是: new_df = A_df.merge(B_df, left_on=['A_c1','c2'], right_on = ['B_c1','c2'], how='left')

Another way of doing this: new_df = A_df.merge(B_df, left_on=['A_c1','c2'], right_on = ['B_c1','c2'], how='left')


将Pandas DataFrame转换为字典

问题:将Pandas DataFrame转换为字典

我有一个包含四列的DataFrame。我想将此DataFrame转换为python字典。我希望第一列keys的元素为,同一行中其他列的元素为values

数据框:

    ID   A   B   C
0   p    1   3   2
1   q    4   3   2
2   r    4   0   9  

输出应如下所示:

字典:

{'p': [1,3,2], 'q': [4,3,2], 'r': [4,0,9]}

I have a DataFrame with four columns. I want to convert this DataFrame to a python dictionary. I want the elements of first column be keys and the elements of other columns in same row be values.

DataFrame:

    ID   A   B   C
0   p    1   3   2
1   q    4   3   2
2   r    4   0   9  

Output should be like this:

Dictionary:

{'p': [1,3,2], 'q': [4,3,2], 'r': [4,0,9]}

回答 0

to_dict()方法将列名设置为字典键,因此您需要稍微调整DataFrame的形状。将“ ID”列设置为索引,然后转置DataFrame是实现此目的的一种方法。

to_dict()还接受一个“东方”参数,您需要该参数才能为每列输出值列表。否则,{index: value}将为每一列返回形式的字典。

这些步骤可以通过以下行完成:

>>> df.set_index('ID').T.to_dict('list')
{'p': [1, 3, 2], 'q': [4, 3, 2], 'r': [4, 0, 9]}

如果需要不同的字典格式,则以下是可能的Orient参数的示例。考虑以下简单的DataFrame:

>>> df = pd.DataFrame({'a': ['red', 'yellow', 'blue'], 'b': [0.5, 0.25, 0.125]})
>>> df
        a      b
0     red  0.500
1  yellow  0.250
2    blue  0.125

然后,选项如下。

dict-默认值:列名称是键,值是index:data对的字典

>>> df.to_dict('dict')
{'a': {0: 'red', 1: 'yellow', 2: 'blue'}, 
 'b': {0: 0.5, 1: 0.25, 2: 0.125}}

list-键是列名,值是列数据的列表

>>> df.to_dict('list')
{'a': ['red', 'yellow', 'blue'], 
 'b': [0.5, 0.25, 0.125]}

系列 -类似于“列表”,但值是系列

>>> df.to_dict('series')
{'a': 0       red
      1    yellow
      2      blue
      Name: a, dtype: object, 

 'b': 0    0.500
      1    0.250
      2    0.125
      Name: b, dtype: float64}

split-将列/数据/索引拆分为键,其值分别是列名,按行和索引标签的数据值

>>> df.to_dict('split')
{'columns': ['a', 'b'],
 'data': [['red', 0.5], ['yellow', 0.25], ['blue', 0.125]],
 'index': [0, 1, 2]}

记录 -每行成为一个字典,其中键是列名,值是单元格中的数据

>>> df.to_dict('records')
[{'a': 'red', 'b': 0.5}, 
 {'a': 'yellow', 'b': 0.25}, 
 {'a': 'blue', 'b': 0.125}]

索引 -类似于“记录”,但是字典的字典以键作为索引标签(而不是列表)

>>> df.to_dict('index')
{0: {'a': 'red', 'b': 0.5},
 1: {'a': 'yellow', 'b': 0.25},
 2: {'a': 'blue', 'b': 0.125}}

The to_dict() method sets the column names as dictionary keys so you’ll need to reshape your DataFrame slightly. Setting the ‘ID’ column as the index and then transposing the DataFrame is one way to achieve this.

to_dict() also accepts an ‘orient’ argument which you’ll need in order to output a list of values for each column. Otherwise, a dictionary of the form {index: value} will be returned for each column.

These steps can be done with the following line:

>>> df.set_index('ID').T.to_dict('list')
{'p': [1, 3, 2], 'q': [4, 3, 2], 'r': [4, 0, 9]}

In case a different dictionary format is needed, here are examples of the possible orient arguments. Consider the following simple DataFrame:

>>> df = pd.DataFrame({'a': ['red', 'yellow', 'blue'], 'b': [0.5, 0.25, 0.125]})
>>> df
        a      b
0     red  0.500
1  yellow  0.250
2    blue  0.125

Then the options are as follows.

dict – the default: column names are keys, values are dictionaries of index:data pairs

>>> df.to_dict('dict')
{'a': {0: 'red', 1: 'yellow', 2: 'blue'}, 
 'b': {0: 0.5, 1: 0.25, 2: 0.125}}

list – keys are column names, values are lists of column data

>>> df.to_dict('list')
{'a': ['red', 'yellow', 'blue'], 
 'b': [0.5, 0.25, 0.125]}

series – like ‘list’, but values are Series

>>> df.to_dict('series')
{'a': 0       red
      1    yellow
      2      blue
      Name: a, dtype: object, 

 'b': 0    0.500
      1    0.250
      2    0.125
      Name: b, dtype: float64}

split – splits columns/data/index as keys with values being column names, data values by row and index labels respectively

>>> df.to_dict('split')
{'columns': ['a', 'b'],
 'data': [['red', 0.5], ['yellow', 0.25], ['blue', 0.125]],
 'index': [0, 1, 2]}

records – each row becomes a dictionary where key is column name and value is the data in the cell

>>> df.to_dict('records')
[{'a': 'red', 'b': 0.5}, 
 {'a': 'yellow', 'b': 0.25}, 
 {'a': 'blue', 'b': 0.125}]

index – like ‘records’, but a dictionary of dictionaries with keys as index labels (rather than a list)

>>> df.to_dict('index')
{0: {'a': 'red', 'b': 0.5},
 1: {'a': 'yellow', 'b': 0.25},
 2: {'a': 'blue', 'b': 0.125}}

回答 1

尝试使用 Zip

df = pd.read_csv("file")
d= dict([(i,[a,b,c ]) for i, a,b,c in zip(df.ID, df.A,df.B,df.C)])
print d

输出:

{'p': [1, 3, 2], 'q': [4, 3, 2], 'r': [4, 0, 9]}

Try to use Zip

df = pd.read_csv("file")
d= dict([(i,[a,b,c ]) for i, a,b,c in zip(df.ID, df.A,df.B,df.C)])
print d

Output:

{'p': [1, 3, 2], 'q': [4, 3, 2], 'r': [4, 0, 9]}

回答 2

跟着这些步骤:

假设您的数据框如下:

>>> df
   A  B  C ID
0  1  3  2  p
1  4  3  2  q
2  4  0  9  r

1. set_index用于将ID列设置为数据框索引。

    df.set_index("ID", drop=True, inplace=True)

2.使用orient=index参数将索引用作字典键。

    dictionary = df.to_dict(orient="index")

结果如下:

    >>> dictionary
    {'q': {'A': 4, 'B': 3, 'D': 2}, 'p': {'A': 1, 'B': 3, 'D': 2}, 'r': {'A': 4, 'B': 0, 'D': 9}}

3.如果需要将每个样本作为列表,请运行以下代码。确定列顺序

column_order= ["A", "B", "C"] #  Determine your preferred order of columns
d = {} #  Initialize the new dictionary as an empty dictionary
for k in dictionary:
    d[k] = [dictionary[k][column_name] for column_name in column_order]

Follow these steps:

Suppose your dataframe is as follows:

>>> df
   A  B  C ID
0  1  3  2  p
1  4  3  2  q
2  4  0  9  r

1. Use set_index to set ID columns as the dataframe index.

    df.set_index("ID", drop=True, inplace=True)

2. Use the orient=index parameter to have the index as dictionary keys.

    dictionary = df.to_dict(orient="index")

The results will be as follows:

    >>> dictionary
    {'q': {'A': 4, 'B': 3, 'D': 2}, 'p': {'A': 1, 'B': 3, 'D': 2}, 'r': {'A': 4, 'B': 0, 'D': 9}}

3. If you need to have each sample as a list run the following code. Determine the column order

column_order= ["A", "B", "C"] #  Determine your preferred order of columns
d = {} #  Initialize the new dictionary as an empty dictionary
for k in dictionary:
    d[k] = [dictionary[k][column_name] for column_name in column_order]

回答 3

如果您不介意字典值是元组,则可以使用itertuples:

>>> {x[0]: x[1:] for x in df.itertuples(index=False)}
{'p': (1, 3, 2), 'q': (4, 3, 2), 'r': (4, 0, 9)}

If you don’t mind the dictionary values being tuples, you can use itertuples:

>>> {x[0]: x[1:] for x in df.itertuples(index=False)}
{'p': (1, 3, 2), 'q': (4, 3, 2), 'r': (4, 0, 9)}

回答 4

字典应该像:

{'red': '0.500', 'yellow': '0.250, 'blue': '0.125'}

需要像这样的数据框之外:

        a      b
0     red  0.500
1  yellow  0.250
2    blue  0.125

最简单的方法是:

dict(df.values.tolist())

下面的工作片段:

import pandas as pd
df = pd.DataFrame({'a': ['red', 'yellow', 'blue'], 'b': [0.5, 0.25, 0.125]})
dict(df.values.tolist())

在此处输入图片说明

should a dictionary like:

{'red': '0.500', 'yellow': '0.250, 'blue': '0.125'}

be required out of a dataframe like:

        a      b
0     red  0.500
1  yellow  0.250
2    blue  0.125

simplest way would be to do:

dict(df.values.tolist())

working snippet below:

import pandas as pd
df = pd.DataFrame({'a': ['red', 'yellow', 'blue'], 'b': [0.5, 0.25, 0.125]})
dict(df.values.tolist())

enter image description here


回答 5

对于我的使用(具有xy位置的节点名称),我发现@ user4179775对最有用/最直观的答案:

import pandas as pd

df = pd.read_csv('glycolysis_nodes_xy.tsv', sep='\t')

df.head()
    nodes    x    y
0  c00033  146  958
1  c00031  601  195
...

xy_dict_list=dict([(i,[a,b]) for i, a,b in zip(df.nodes, df.x,df.y)])

xy_dict_list
{'c00022': [483, 868],
 'c00024': [146, 868],
 ... }

xy_dict_tuples=dict([(i,(a,b)) for i, a,b in zip(df.nodes, df.x,df.y)])

xy_dict_tuples
{'c00022': (483, 868),
 'c00024': (146, 868),
 ... }

附录

后来,我又回到了这个问题,进行其他但相关的工作。这是一种更接近于[优秀]公认答案的方法。

node_df = pd.read_csv('node_prop-glycolysis_tca-from_pg.tsv', sep='\t')

node_df.head()
   node  kegg_id kegg_cid            name  wt  vis
0  22    22       c00022   pyruvate        1   1
1  24    24       c00024   acetyl-CoA      1   1
...

将Pandas数据帧转换为[列表],{dict},{dict的{dict}},…

每个接受的答案:

node_df.set_index('kegg_cid').T.to_dict('list')

{'c00022': [22, 22, 'pyruvate', 1, 1],
 'c00024': [24, 24, 'acetyl-CoA', 1, 1],
 ... }

node_df.set_index('kegg_cid').T.to_dict('dict')

{'c00022': {'kegg_id': 22, 'name': 'pyruvate', 'node': 22, 'vis': 1, 'wt': 1},
 'c00024': {'kegg_id': 24, 'name': 'acetyl-CoA', 'node': 24, 'vis': 1, 'wt': 1},
 ... }

就我而言,我想做同样的事情,但要选择Pandas数据框中的列,因此我需要对列进行切片。有两种方法。

  1. 直:

(请参阅:将大熊猫转换为字典,以定义用于键值的列

node_df.set_index('kegg_cid')[['name', 'wt', 'vis']].T.to_dict('dict')

{'c00022': {'name': 'pyruvate', 'vis': 1, 'wt': 1},
 'c00024': {'name': 'acetyl-CoA', 'vis': 1, 'wt': 1},
 ... }
  1. “间接:”首先,从Pandas数据框中切片所需的列/数据(同样,两种方法),
node_df_sliced = node_df[['kegg_cid', 'name', 'wt', 'vis']]

要么

node_df_sliced2 = node_df.loc[:, ['kegg_cid', 'name', 'wt', 'vis']]

然后可以用来创建字典的字典

node_df_sliced.set_index('kegg_cid').T.to_dict('dict')

{'c00022': {'name': 'pyruvate', 'vis': 1, 'wt': 1},
 'c00024': {'name': 'acetyl-CoA', 'vis': 1, 'wt': 1},
 ... }

For my use (node names with xy positions) I found @user4179775’s answer to the most helpful / intuitive:

import pandas as pd

df = pd.read_csv('glycolysis_nodes_xy.tsv', sep='\t')

df.head()
    nodes    x    y
0  c00033  146  958
1  c00031  601  195
...

xy_dict_list=dict([(i,[a,b]) for i, a,b in zip(df.nodes, df.x,df.y)])

xy_dict_list
{'c00022': [483, 868],
 'c00024': [146, 868],
 ... }

xy_dict_tuples=dict([(i,(a,b)) for i, a,b in zip(df.nodes, df.x,df.y)])

xy_dict_tuples
{'c00022': (483, 868),
 'c00024': (146, 868),
 ... }

Addendum

I later returned to this issue, for other, but related, work. Here is an approach that more closely mirrors the [excellent] accepted answer.

node_df = pd.read_csv('node_prop-glycolysis_tca-from_pg.tsv', sep='\t')

node_df.head()
   node  kegg_id kegg_cid            name  wt  vis
0  22    22       c00022   pyruvate        1   1
1  24    24       c00024   acetyl-CoA      1   1
...

Convert Pandas dataframe to a [list], {dict}, {dict of {dict}}, …

Per accepted answer:

node_df.set_index('kegg_cid').T.to_dict('list')

{'c00022': [22, 22, 'pyruvate', 1, 1],
 'c00024': [24, 24, 'acetyl-CoA', 1, 1],
 ... }

node_df.set_index('kegg_cid').T.to_dict('dict')

{'c00022': {'kegg_id': 22, 'name': 'pyruvate', 'node': 22, 'vis': 1, 'wt': 1},
 'c00024': {'kegg_id': 24, 'name': 'acetyl-CoA', 'node': 24, 'vis': 1, 'wt': 1},
 ... }

In my case, I wanted to do the same thing but with selected columns from the Pandas dataframe, so I needed to slice the columns. There are two approaches.

  1. Directly:

(see: Convert pandas to dictionary defining the columns used fo the key values)

node_df.set_index('kegg_cid')[['name', 'wt', 'vis']].T.to_dict('dict')

{'c00022': {'name': 'pyruvate', 'vis': 1, 'wt': 1},
 'c00024': {'name': 'acetyl-CoA', 'vis': 1, 'wt': 1},
 ... }
  1. “Indirectly:” first, slice the desired columns/data from the Pandas dataframe (again, two approaches),
node_df_sliced = node_df[['kegg_cid', 'name', 'wt', 'vis']]

or

node_df_sliced2 = node_df.loc[:, ['kegg_cid', 'name', 'wt', 'vis']]

that can then can be used to create a dictionary of dictionaries

node_df_sliced.set_index('kegg_cid').T.to_dict('dict')

{'c00022': {'name': 'pyruvate', 'vis': 1, 'wt': 1},
 'c00024': {'name': 'acetyl-CoA', 'vis': 1, 'wt': 1},
 ... }

回答 6

DataFrame.to_dict() 将DataFrame转换为字典。

>>> df = pd.DataFrame(
    {'col1': [1, 2], 'col2': [0.5, 0.75]}, index=['a', 'b'])
>>> df
   col1  col2
a     1   0.1
b     2   0.2
>>> df.to_dict()
{'col1': {'a': 1, 'b': 2}, 'col2': {'a': 0.5, 'b': 0.75}}

有关详细信息,请参见此文档

DataFrame.to_dict() converts DataFrame to dictionary.

Example

>>> df = pd.DataFrame(
    {'col1': [1, 2], 'col2': [0.5, 0.75]}, index=['a', 'b'])
>>> df
   col1  col2
a     1   0.1
b     2   0.2
>>> df.to_dict()
{'col1': {'a': 1, 'b': 2}, 'col2': {'a': 0.5, 'b': 0.75}}

See this Documentation for details


熊猫groupby排序

问题:熊猫groupby排序

我想按两列对数据框进行分组,然后对各组中的汇总结果进行排序。

In [167]:
df

Out[167]:
count   job source
0   2   sales   A
1   4   sales   B
2   6   sales   C
3   3   sales   D
4   7   sales   E
5   5   market  A
6   3   market  B
7   2   market  C
8   4   market  D
9   1   market  E

In [168]:
df.groupby(['job','source']).agg({'count':sum})

Out[168]:
            count
job     source  
market  A   5
        B   3
        C   2
        D   4
        E   1
sales   A   2
        B   4
        C   6
        D   3
        E   7

现在,我想在每个组中按降序对计数列进行排序。然后只取前三行。得到类似的东西:

            count
job     source  
market  A   5
        D   4
        B   3
sales   E   7
        C   6
        B   4

I want to group my dataframe by two columns and then sort the aggregated results within the groups.

In [167]:
df

Out[167]:
count   job source
0   2   sales   A
1   4   sales   B
2   6   sales   C
3   3   sales   D
4   7   sales   E
5   5   market  A
6   3   market  B
7   2   market  C
8   4   market  D
9   1   market  E

In [168]:
df.groupby(['job','source']).agg({'count':sum})

Out[168]:
            count
job     source  
market  A   5
        B   3
        C   2
        D   4
        E   1
sales   A   2
        B   4
        C   6
        D   3
        E   7

I would now like to sort the count column in descending order within each of the groups. And then take only the top three rows. To get something like:

            count
job     source  
market  A   5
        D   4
        B   3
sales   E   7
        C   6
        B   4

回答 0

您实际上想要做的是再次使用groupby(在第一个groupby的结果上):对每个组的前三个元素进行排序并取其值。

从第一个分组依据的结果开始:

In [60]: df_agg = df.groupby(['job','source']).agg({'count':sum})

我们按索引的第一级分组:

In [63]: g = df_agg['count'].groupby(level=0, group_keys=False)

然后,我们要对每个组进行排序(“排序”)并采用前三个元素:

In [64]: res = g.apply(lambda x: x.order(ascending=False).head(3))

但是,为此,有一个快捷功能可以实现nlargest

In [65]: g.nlargest(3)
Out[65]:
job     source
market  A         5
        D         4
        B         3
sales   E         7
        C         6
        B         4
dtype: int64

What you want to do is actually again a groupby (on the result of the first groupby): sort and take the first three elements per group.

Starting from the result of the first groupby:

In [60]: df_agg = df.groupby(['job','source']).agg({'count':sum})

We group by the first level of the index:

In [63]: g = df_agg['count'].groupby('job', group_keys=False)

Then we want to sort (‘order’) each group and take the first three elements:

In [64]: res = g.apply(lambda x: x.sort_values(ascending=False).head(3))

However, for this, there is a shortcut function to do this, nlargest:

In [65]: g.nlargest(3)
Out[65]:
job     source
market  A         5
        D         4
        B         3
sales   E         7
        C         6
        B         4
dtype: int64

So in one go, this looks like:

df_agg['count'].groupby('job', group_keys=False).nlargest(3)

回答 1

您也可以一次性完成,方法是先进行排序,然后使用head进行每个组的前3个。

In[34]: df.sort_values(['job','count'],ascending=False).groupby('job').head(3)

Out[35]: 
   count     job source
4      7   sales      E
2      6   sales      C
1      4   sales      B
5      5  market      A
8      4  market      D
6      3  market      B

You could also just do it in one go, by doing the sort first and using head to take the first 3 of each group.

In[34]: df.sort_values(['job','count'],ascending=False).groupby('job').head(3)

Out[35]: 
   count     job source
4      7   sales      E
2      6   sales      C
1      4   sales      B
5      5  market      A
8      4  market      D
6      3  market      B

回答 2

这是另一个在排序顺序上排在前3位并在组内排序的示例:

In [43]: import pandas as pd                                                                                                                                                       

In [44]:  df = pd.DataFrame({"name":["Foo", "Foo", "Baar", "Foo", "Baar", "Foo", "Baar", "Baar"], "count_1":[5,10,12,15,20,25,30,35], "count_2" :[100,150,100,25,250,300,400,500]})

In [45]: df                                                                                                                                                                        
Out[45]: 
   count_1  count_2  name
0        5      100   Foo
1       10      150   Foo
2       12      100  Baar
3       15       25   Foo
4       20      250  Baar
5       25      300   Foo
6       30      400  Baar
7       35      500  Baar


### Top 3 on sorted order:
In [46]: df.groupby(["name"])["count_1"].nlargest(3)                                                                                                                               
Out[46]: 
name   
Baar  7    35
      6    30
      4    20
Foo   5    25
      3    15
      1    10
dtype: int64


### Sorting within groups based on column "count_1":
In [48]: df.groupby(["name"]).apply(lambda x: x.sort_values(["count_1"], ascending = False)).reset_index(drop=True)
Out[48]: 
   count_1  count_2  name
0       35      500  Baar
1       30      400  Baar
2       20      250  Baar
3       12      100  Baar
4       25      300   Foo
5       15       25   Foo
6       10      150   Foo
7        5      100   Foo

Here’s other example of taking top 3 on sorted order, and sorting within the groups:

In [43]: import pandas as pd                                                                                                                                                       

In [44]:  df = pd.DataFrame({"name":["Foo", "Foo", "Baar", "Foo", "Baar", "Foo", "Baar", "Baar"], "count_1":[5,10,12,15,20,25,30,35], "count_2" :[100,150,100,25,250,300,400,500]})

In [45]: df                                                                                                                                                                        
Out[45]: 
   count_1  count_2  name
0        5      100   Foo
1       10      150   Foo
2       12      100  Baar
3       15       25   Foo
4       20      250  Baar
5       25      300   Foo
6       30      400  Baar
7       35      500  Baar


### Top 3 on sorted order:
In [46]: df.groupby(["name"])["count_1"].nlargest(3)                                                                                                                               
Out[46]: 
name   
Baar  7    35
      6    30
      4    20
Foo   5    25
      3    15
      1    10
dtype: int64


### Sorting within groups based on column "count_1":
In [48]: df.groupby(["name"]).apply(lambda x: x.sort_values(["count_1"], ascending = False)).reset_index(drop=True)
Out[48]: 
   count_1  count_2  name
0       35      500  Baar
1       30      400  Baar
2       20      250  Baar
3       12      100  Baar
4       25      300   Foo
5       15       25   Foo
6       10      150   Foo
7        5      100   Foo

回答 3

试试这个代替

执行“ groupby”并按降序排序的简单方法

df.groupby(['companyName'])['overallRating'].sum().sort_values(ascending=False).head(20)

Try this Instead

simple way to do ‘groupby’ and sorting in descending order

df.groupby(['companyName'])['overallRating'].sum().sort_values(ascending=False).head(20)

回答 4

如果您不需要汇总一列,请使用@tvashtar的答案。如果您确实需要求和,则可以使用@joris的答案或与此非常相似的答案。

df.groupby(['job']).apply(lambda x: (x.groupby('source')
                                      .sum()
                                      .sort_values('count', ascending=False))
                                     .head(3))

If you don’t need to sum a column, then use @tvashtar’s answer. If you do need to sum, then you can use @joris’ answer or this one which is very similar to it.

df.groupby(['job']).apply(lambda x: (x.groupby('source')
                                      .sum()
                                      .sort_values('count', ascending=False))
                                     .head(3))

熊猫DataFrame Groupby两列并获取计数

问题:熊猫DataFrame Groupby两列并获取计数

我有以下格式的熊猫数据框:

df = pd.DataFrame([[1.1, 1.1, 1.1, 2.6, 2.5, 3.4,2.6,2.6,3.4,3.4,2.6,1.1,1.1,3.3], list('AAABBBBABCBDDD'), [1.1, 1.7, 2.5, 2.6, 3.3, 3.8,4.0,4.2,4.3,4.5,4.6,4.7,4.7,4.8], ['x/y/z','x/y','x/y/z/n','x/u','x','x/u/v','x/y/z','x','x/u/v/b','-','x/y','x/y/z','x','x/u/v/w'],['1','3','3','2','4','2','5','3','6','3','5','1','1','1']]).T
df.columns = ['col1','col2','col3','col4','col5']

df:

   col1 col2 col3     col4 col5
0   1.1    A  1.1    x/y/z    1
1   1.1    A  1.7      x/y    3
2   1.1    A  2.5  x/y/z/n    3
3   2.6    B  2.6      x/u    2
4   2.5    B  3.3        x    4
5   3.4    B  3.8    x/u/v    2
6   2.6    B    4    x/y/z    5
7   2.6    A  4.2        x    3
8   3.4    B  4.3  x/u/v/b    6
9   3.4    C  4.5        -    3
10  2.6    B  4.6      x/y    5
11  1.1    D  4.7    x/y/z    1
12  1.1    D  4.7        x    1
13  3.3    D  4.8  x/u/v/w    1

现在,我想按两列将其分组,如下所示:

df.groupby(['col5','col2']).reset_index()

输出:

             index col1 col2 col3     col4 col5
col5 col2                                      
1    A    0      0  1.1    A  1.1    x/y/z    1
     D    0     11  1.1    D  4.7    x/y/z    1
          1     12  1.1    D  4.7        x    1
          2     13  3.3    D  4.8  x/u/v/w    1
2    B    0      3  2.6    B  2.6      x/u    2
          1      5  3.4    B  3.8    x/u/v    2
3    A    0      1  1.1    A  1.7      x/y    3
          1      2  1.1    A  2.5  x/y/z/n    3
          2      7  2.6    A  4.2        x    3
     C    0      9  3.4    C  4.5        -    3
4    B    0      4  2.5    B  3.3        x    4
5    B    0      6  2.6    B    4    x/y/z    5
          1     10  2.6    B  4.6      x/y    5
6    B    0      8  3.4    B  4.3  x/u/v/b    6

我想按如下方式获取每一行的计数。预期Yield:

col5 col2 count
1    A      1
     D      3
2    B      2
etc...

如何获得我的预期输出?我想为每个“ col2”值找到最大的计数吗?

I have a pandas dataframe in the following format:

df = pd.DataFrame([[1.1, 1.1, 1.1, 2.6, 2.5, 3.4,2.6,2.6,3.4,3.4,2.6,1.1,1.1,3.3], list('AAABBBBABCBDDD'), [1.1, 1.7, 2.5, 2.6, 3.3, 3.8,4.0,4.2,4.3,4.5,4.6,4.7,4.7,4.8], ['x/y/z','x/y','x/y/z/n','x/u','x','x/u/v','x/y/z','x','x/u/v/b','-','x/y','x/y/z','x','x/u/v/w'],['1','3','3','2','4','2','5','3','6','3','5','1','1','1']]).T
df.columns = ['col1','col2','col3','col4','col5']

df:

   col1 col2 col3     col4 col5
0   1.1    A  1.1    x/y/z    1
1   1.1    A  1.7      x/y    3
2   1.1    A  2.5  x/y/z/n    3
3   2.6    B  2.6      x/u    2
4   2.5    B  3.3        x    4
5   3.4    B  3.8    x/u/v    2
6   2.6    B    4    x/y/z    5
7   2.6    A  4.2        x    3
8   3.4    B  4.3  x/u/v/b    6
9   3.4    C  4.5        -    3
10  2.6    B  4.6      x/y    5
11  1.1    D  4.7    x/y/z    1
12  1.1    D  4.7        x    1
13  3.3    D  4.8  x/u/v/w    1

Now I want to group this by two columns like following:

df.groupby(['col5','col2']).reset_index()

OutPut:

             index col1 col2 col3     col4 col5
col5 col2                                      
1    A    0      0  1.1    A  1.1    x/y/z    1
     D    0     11  1.1    D  4.7    x/y/z    1
          1     12  1.1    D  4.7        x    1
          2     13  3.3    D  4.8  x/u/v/w    1
2    B    0      3  2.6    B  2.6      x/u    2
          1      5  3.4    B  3.8    x/u/v    2
3    A    0      1  1.1    A  1.7      x/y    3
          1      2  1.1    A  2.5  x/y/z/n    3
          2      7  2.6    A  4.2        x    3
     C    0      9  3.4    C  4.5        -    3
4    B    0      4  2.5    B  3.3        x    4
5    B    0      6  2.6    B    4    x/y/z    5
          1     10  2.6    B  4.6      x/y    5
6    B    0      8  3.4    B  4.3  x/u/v/b    6

I want to get the count by each row like following. Expected Output:

col5 col2 count
1    A      1
     D      3
2    B      2
etc...

How to get my expected output? And I want to find largest count for each ‘col2’ value?


回答 0

紧跟@Andy的答案,您可以执行以下操作来解决第二个问题:

In [56]: df.groupby(['col5','col2']).size().reset_index().groupby('col2')[[0]].max()
Out[56]: 
      0
col2   
A     3
B     2
C     1
D     3

Followed by @Andy’s answer, you can do following to solve your second question:

In [56]: df.groupby(['col5','col2']).size().reset_index().groupby('col2')[[0]].max()
Out[56]: 
      0
col2   
A     3
B     2
C     1
D     3

回答 1

您正在寻找size

In [11]: df.groupby(['col5', 'col2']).size()
Out[11]:
col5  col2
1     A       1
      D       3
2     B       2
3     A       3
      C       1
4     B       1
5     B       2
6     B       1
dtype: int64

要获得与waitingkuo相同的答案(“第二个问题”),但要简洁一些,可以按级别分组:

In [12]: df.groupby(['col5', 'col2']).size().groupby(level=1).max()
Out[12]:
col2
A       3
B       2
C       1
D       3
dtype: int64

You are looking for size:

In [11]: df.groupby(['col5', 'col2']).size()
Out[11]:
col5  col2
1     A       1
      D       3
2     B       2
3     A       3
      C       1
4     B       1
5     B       2
6     B       1
dtype: int64

To get the same answer as waitingkuo (the “second question”), but slightly cleaner, is to groupby the level:

In [12]: df.groupby(['col5', 'col2']).size().groupby(level=1).max()
Out[12]:
col2
A       3
B       2
C       1
D       3
dtype: int64

回答 2

数据插入pandas数据框并提供列名

import pandas as pd
df = pd.DataFrame([['A','C','A','B','C','A','B','B','A','A'], ['ONE','TWO','ONE','ONE','ONE','TWO','ONE','TWO','ONE','THREE']]).T
df.columns = [['Alphabet','Words']]
print(df)   #printing dataframe.

这是我们的打印数据:

在此处输入图片说明

为了在pandas和counter中创建一组数据框
您需要再提供一个对分组进行计数的列,我们将该列称为dataframe中的“ COUNTER”

像这样:

df['COUNTER'] =1       #initially, set that counter to 1.
group_data = df.groupby(['Alphabet','Words'])['COUNTER'].sum() #sum function
print(group_data)

输出:

在此处输入图片说明

Inserting data into a pandas dataframe and providing column name.

import pandas as pd
df = pd.DataFrame([['A','C','A','B','C','A','B','B','A','A'], ['ONE','TWO','ONE','ONE','ONE','TWO','ONE','TWO','ONE','THREE']]).T
df.columns = [['Alphabet','Words']]
print(df)   #printing dataframe.

This is our printed data:

enter image description here

For making a group of dataframe in pandas and counter,
You need to provide one more column which counts the grouping, let’s call that column as, “COUNTER” in dataframe.

Like this:

df['COUNTER'] =1       #initially, set that counter to 1.
group_data = df.groupby(['Alphabet','Words'])['COUNTER'].sum() #sum function
print(group_data)

OUTPUT:

enter image description here


回答 3

仅使用单个groupby的惯用解决方案

(df.groupby(['col5', 'col2']).size() 
   .sort_values(ascending=False) 
   .reset_index(name='count') 
   .drop_duplicates(subset='col2'))

  col5 col2  count
0    3    A      3
1    1    D      3
2    5    B      2
6    3    C      1

说明

groupby size方法的结果是带有col5col2在索引中的Series 。从这里,您可以使用另一种groupby方法来查找其中的每个值的最大值, col2但是没有必要这样做。您可以简单地对所有值进行降序排序,然后只保留第一次col2使用drop_duplicates方法出现的行。

Idiomatic solution that uses only a single groupby

(df.groupby(['col5', 'col2']).size() 
   .sort_values(ascending=False) 
   .reset_index(name='count') 
   .drop_duplicates(subset='col2'))

  col5 col2  count
0    3    A      3
1    1    D      3
2    5    B      2
6    3    C      1

Explanation

The result of the groupby size method is a Series with col5 and col2 in the index. From here, you can use another groupby method to find the maximum value of each value in col2 but it is not necessary to do. You can simply sort all the values descendingly and then keep only the rows with the first occurrence of col2 with the drop_duplicates method.


回答 4

您是否要在数据帧中添加一个包含组计数的新列(例如’count_column’):

df.count_column=df.groupby(['col5','col2']).col5.transform('count')

(我选择了“ col5”,因为它不包含nan)

Should you want to add a new column (say ‘count_column’) containing the groups’ counts into the dataframe:

df.count_column=df.groupby(['col5','col2']).col5.transform('count')

(I picked ‘col5’ as it contains no nan)


回答 5

您可以只使用内置函数计数,然后使用groupby函数

df.groupby(['col5','col2']).count()

You can just use the built-in function count follow by the groupby function

df.groupby(['col5','col2']).count()

如何将标题行添加到Pandas DataFrame

问题:如何将标题行添加到Pandas DataFrame

我正在将csv文件读入pandas。此csv文件由四列和一些行组成,但没有要添加的标题行。我一直在尝试以下方法:

Cov = pd.read_csv("path/to/file.txt", sep='\t')
Frame=pd.DataFrame([Cov], columns = ["Sequence", "Start", "End", "Coverage"])
Frame.to_csv("path/to/file.txt", sep='\t')

但是,当我应用代码时,出现以下错误:

ValueError: Shape of passed values is (1, 1), indices imply (4, 1)

错误的确切含义是什么?在python中将标题行添加到csv文件/ pandas df的一种干净方法是什么?

I am reading a csv file into pandas. This csv file constists of four columns and some rows, but does not have a header row, which I want to add. I have been trying the following:

Cov = pd.read_csv("path/to/file.txt", sep='\t')
Frame=pd.DataFrame([Cov], columns = ["Sequence", "Start", "End", "Coverage"])
Frame.to_csv("path/to/file.txt", sep='\t')

But when I apply the code, I get the following Error:

ValueError: Shape of passed values is (1, 1), indices imply (4, 1)

What exactly does the error mean? And what would be a clean way in python to add a header row to my csv file/pandas df?


回答 0

您可以names直接在read_csv

names:类似数组,默认为None要使用的列名列表。如果文件不包含标题行,则应显式传递header = None

Cov = pd.read_csv("path/to/file.txt", 
                  sep='\t', 
                  names=["Sequence", "Start", "End", "Coverage"])

You can use names directly in the read_csv

names : array-like, default None List of column names to use. If file contains no header row, then you should explicitly pass header=None

Cov = pd.read_csv("path/to/file.txt", 
                  sep='\t', 
                  names=["Sequence", "Start", "End", "Coverage"])

回答 1

另外,您可以阅读csv,header=None然后添加df.columns

Cov = pd.read_csv("path/to/file.txt", sep='\t', header=None)
Cov.columns = ["Sequence", "Start", "End", "Coverage"]

Alternatively you could read you csv with header=None and then add it with df.columns:

Cov = pd.read_csv("path/to/file.txt", sep='\t', header=None)
Cov.columns = ["Sequence", "Start", "End", "Coverage"]

回答 2

col_Names=["Sequence", "Start", "End", "Coverage"]
my_CSV_File= pd.read_csv("yourCSVFile.csv",names=col_Names)

完成此操作后,只需进行检查[显然,我知道,你知道。但是…

my_CSV_File.head()

希望对您有帮助…干杯

col_Names=["Sequence", "Start", "End", "Coverage"]
my_CSV_File= pd.read_csv("yourCSVFile.csv",names=col_Names)

having done this, just check it with[well obviously I know, u know that. But still…

my_CSV_File.head()

Hope it helps … Cheers


回答 3

要修复代码,您只需将更[Cov]改为Cov.values,第一个参数pd.DataFrame将变为多维numpy数组:

Cov = pd.read_csv("path/to/file.txt", sep='\t')
Frame=pd.DataFrame(Cov.values, columns = ["Sequence", "Start", "End", "Coverage"])
Frame.to_csv("path/to/file.txt", sep='\t')

但最聪明的解决方案仍然是pd.read_excelheader=None和一起使用names=columns_list

To fix your code you can simply change [Cov] to Cov.values, the first parameter of pd.DataFrame will become a multi-dimensional numpy array:

Cov = pd.read_csv("path/to/file.txt", sep='\t')
Frame=pd.DataFrame(Cov.values, columns = ["Sequence", "Start", "End", "Coverage"])
Frame.to_csv("path/to/file.txt", sep='\t')

But the smartest solution still is use pd.read_excel with header=None and names=columns_list.