Tag Archives: pandas

How do I combine two dataframes?

Question: How do I combine two dataframes?

I'm using Pandas data frames. Say I have an initial data frame D. I extract two data frames from it like this:

A = D[D.label == k]
B = D[D.label != k]

Then I change the labels in A and B:

A.label = 1
B.label = -1

I want to combine A and B so I can have them as one DataFrame, something like a union operation. The order of the data is not important. However, when we sample A and B from D, they retain their indexes from D.


Answer 0

I believe you can use the append method

bigdata = data1.append(data2, ignore_index=True)

to keep their indexes, just don't use the ignore_index keyword …
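
Note: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so this answer no longer runs on current versions. A minimal equivalent sketch with pd.concat (covered in the next answer) would be:

bigdata = pd.concat([data1, data2], ignore_index=True)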


Answer 1

You can also use pd.concat, which is particularly helpful when you are joining more than two dataframes:

bigdata = pd.concat([data1, data2], ignore_index=True, sort=False)
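
For example, with a hypothetical third frame data3, the same call scales to any number of dataframes:

bigdata = pd.concat([data1, data2, data3], ignore_index=True, sort=False)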

Answer 2

Thought to add this here in case someone finds it useful. @ostrokach already mentioned how you can merge the data frames across rows, which is:

df_row_merged = pd.concat([df_a, df_b], ignore_index=True)

To merge across columns, you can use the following syntax:

df_col_merged = pd.concat([df_a, df_b], axis=1)

Answer 3

There’s another solution for the case that you are working with big data and need to concatenate multiple datasets. concat can get performance-intensive, so if you don’t want to create a new df each time, you can instead use a list comprehension:

frames = [ process_file(f) for f in dataset_files ]
result = pd.concat(frames)

(as pointed out here in the docs at the bottom of the section):

Note: It is worth noting however, that concat (and therefore append) makes a full copy of the data, and that constantly reusing this function can create a significant performance hit. If you need to use the operation over several datasets, use a list comprehension.
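
A small self-contained sketch of the pattern the docs recommend, where process_file and dataset_files are hypothetical placeholders standing in for the answer's own:

import pandas as pd

def process_file(path):
    # hypothetical: parse one file into a DataFrame
    return pd.read_csv(path)

dataset_files = ["part1.csv", "part2.csv"]  # hypothetical file list
frames = [process_file(f) for f in dataset_files]
result = pd.concat(frames, ignore_index=True)  # a single concat, a single copy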


Answer 4

If you want to update/replace the values of the first dataframe df1 with the values of the second dataframe df2, you can do it with the following steps:

Step 1: Set index of the first dataframe (df1)

df1 = df1.set_index('id')  # set_index returns a new frame, so reassign it

Step 2: Set index of the second dataframe (df2)

df2 = df2.set_index('id')  # likewise reassign

and finally update the dataframe using the following snippet —

df1.update(df2)
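
A minimal self-contained sketch of this flow, assuming an 'id' key column as in the answer:

import pandas as pd

df1 = pd.DataFrame({'id': [1, 2], 'val': ['a', 'b']}).set_index('id')
df2 = pd.DataFrame({'id': [2], 'val': ['B']}).set_index('id')

df1.update(df2)  # overwrites df1's matching rows with df2's values, in place
print(df1)
#     val
# id
# 1     a
# 2     B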

Answer 5

1st dataFrame

train.shape

result:-

(31962, 3)

2nd dataFrame

test.shape

result:-

(17197, 2)

Combine

new_data=train.append(test,ignore_index=True)

Check

new_data.shape

result:-

(49159, 3)
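
Note: DataFrame.append has been removed in pandas 2.0+, so a hedged modern equivalent of the combine step would be:

new_data = pd.concat([train, test], ignore_index=True)
new_data.shape  # (49159, 3), matching the append result above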

Selecting pandas rows based on list index

Question: Selecting pandas rows based on list index

I have a dataframe df :

   20060930  10.103       NaN     10.103   7.981
   20061231  15.915       NaN     15.915  12.686
   20070331   3.196       NaN      3.196   2.710
   20070630   7.907       NaN      7.907   6.459

Then I want to select rows with certain sequence numbers which indicated in a list, suppose here is [1,3], then left:

   20061231  15.915       NaN     15.915  12.686
   20070630   7.907       NaN      7.907   6.459

How, or with what function, can I do that?


Answer 0

ind_list = [1, 3]
df.ix[ind_list]

should do the trick! When I index with data frames I always use the .ix() method. It's so much easier and more flexible…

UPDATE This is no longer the accepted method for indexing. The ix method is deprecated. Use .iloc for integer-based indexing and .loc for label-based indexing.
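
A minimal sketch of the modern equivalents, assuming the same ind_list:

ind_list = [1, 3]
df.iloc[ind_list]  # select rows by integer position
df.loc[ind_list]   # select rows by index label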


Answer 1

You can also use iloc:

df.iloc[[1,3],:]

This will not work if the indexes in your dataframe do not correspond to the order of the rows due to prior computations. In that case use:

df[df.index.isin([1, 3])]

… as suggested in other responses.


Answer 2

Another way (although it is longer code), which is faster than the ones above. Check it using the %timeit function:

df[df.index.isin([1,3])]

PS: You figure out the reason


Answer 3

For large datasets, it is memory efficient to read only selected rows via the skiprows parameter.

Example

pred = lambda x: x not in [1, 3]
pd.read_csv("data.csv", skiprows=pred, index_col=0, names=...)

This will now return a DataFrame from a file that skips all rows except 1 and 3.


Details

From the docs:

skiprows : list-like or integer or callable, default None

If callable, the callable function will be evaluated against the row indices, returning True if the row should be skipped and False otherwise. An example of a valid callable argument would be lambda x: x in [0, 2]

This feature works in version pandas 0.20.0+. See also the corresponding issue and a related post.


Python Pandas: insert a list into a cell

Question: Python Pandas: insert a list into a cell

I have a list ‘abc’ and a dataframe ‘df’:

abc = ['foo', 'bar']
df =
    A  B
0  12  NaN
1  23  NaN

I want to insert the list into cell 1B, so I want this result:

    A  B
0  12  NaN
1  23  ['foo', 'bar']

How can I do that?

1) If I use this:

df.ix[1,'B'] = abc

I get the following error message:

ValueError: Must have equal len keys and value when setting with an iterable

because it tries to insert the list (that has two elements) into a row / column but not into a cell.

2) If I use this:

df.ix[1,'B'] = [abc]

then it inserts a list that has only one element, which is the 'abc' list itself ( [['foo', 'bar']] ).

3) If I use this:

df.ix[1,'B'] = ', '.join(abc)

then it inserts a string: ( foo, bar ) but not a list.

4) If I use this:

df.ix[1,'B'] = [', '.join(abc)]

then it inserts a list, but with only one element ( ['foo, bar'] ), not the two I want ( ['foo', 'bar'] ).

Thanks for help!


EDIT

My new dataframe and the old list:

abc = ['foo', 'bar']
df2 =
    A    B         C
0  12  NaN      'bla'
1  23  NaN  'bla bla'

Another dataframe:

df3 =
    A    B         C                    D
0  12  NaN      'bla'  ['item1', 'item2']
1  23  NaN  'bla bla'        [11, 12, 13]

I want to insert the 'abc' list into df2.loc[1,'B'] and/or df3.loc[1,'B'].

If the dataframe has columns only with integer values and/or NaN values and/or list values, then inserting a list into a cell works perfectly. If the dataframe has columns only with string values and/or NaN values and/or list values, then inserting a list into a cell works perfectly. But if the dataframe has columns with both integer and string values as well as other columns, then the error message appears if I use df2.loc[1,'B'] = abc or df3.loc[1,'B'] = abc.

Another dataframe:

df4 =
          A     B
0      'bla'  NaN
1  'bla bla'  NaN

These inserts work perfectly: df.loc[1,'B'] = abc or df4.loc[1,'B'] = abc.


Answer 0

set_value has been deprecated since version 0.21.0, so you should now use at. It can insert a list into a cell without raising a ValueError as loc does. I think this is because at always refers to a single value, while loc can refer to values as well as rows and columns.

df = pd.DataFrame(data={'A': [1, 2, 3], 'B': ['x', 'y', 'z']})

df.at[1, 'B'] = ['m', 'n']

df =
    A   B
0   1   x
1   2   [m, n]
2   3   z

You also need to make sure the column you are inserting into has dtype=object. For example

>>> df = pd.DataFrame(data={'A': [1, 2, 3], 'B': [1,2,3]})
>>> df.dtypes
A    int64
B    int64
dtype: object

>>> df.at[1, 'B'] = [1, 2, 3]
ValueError: setting an array element with a sequence

>>> df['B'] = df['B'].astype('object')
>>> df.at[1, 'B'] = [1, 2, 3]
>>> df
   A          B
0  1          1
1  2  [1, 2, 3]
2  3          3

Answer 1

df3.set_value(1, 'B', abc) works for any dataframe. Take care of the data type of column 'B': e.g. a list cannot be inserted into a float column; in that case df['B'] = df['B'].astype(object) can help.
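
Note: set_value was removed entirely in pandas 1.0, so on current versions a hedged equivalent of this answer uses at:

df3.at[1, 'B'] = abc  # assumes df3['B'] already has object dtype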


Answer 2

Pandas >= 0.21

set_value has been deprecated. You can now use DataFrame.at to set by label, and DataFrame.iat to set by integer position.

Setting Cell Values with at/iat

# Setup
df = pd.DataFrame({'A': [12, 23], 'B': [['a', 'b'], ['c', 'd']]})
df

    A       B
0  12  [a, b]
1  23  [c, d]

df.dtypes

A     int64
B    object
dtype: object

If you want to set a value in the second row of column "B" to some new list, use DataFrame.at:

df.at[1, 'B'] = ['m', 'n']
df

    A       B
0  12  [a, b]
1  23  [m, n]

You can also set by integer position using DataFrame.iat

df.iat[1, df.columns.get_loc('B')] = ['m', 'n']
df

    A       B
0  12  [a, b]
1  23  [m, n]

What if I get ValueError: setting an array element with a sequence?

I’ll try to reproduce this with:

df

    A   B
0  12 NaN
1  23 NaN

df.dtypes

A      int64
B    float64
dtype: object

df.at[1, 'B'] = ['m', 'n']
# ValueError: setting an array element with a sequence.

This is because your column is of float64 dtype, whereas lists are objects, so there's a mismatch there. What you have to do in this situation is convert the column to object first.

df['B'] = df['B'].astype(object)
df.dtypes

A     int64
B    object
dtype: object

Then, it works:

df.at[1, 'B'] = ['m', 'n']
df

    A       B
0  12     NaN
1  23  [m, n]

Possible, But Hacky

Even more wacky, I’ve found you can hack through DataFrame.loc to achieve something similar if you pass nested lists.

df.loc[1, 'B'] = [['m'], ['n'], ['o'], ['p']]
df

    A             B
0  12        [a, b]
1  23  [m, n, o, p]

You can read more about why this works here.


Answer 3

As mentioned in the post pandas: how to store a list in a dataframe?, the dtypes in the dataframe may influence the results, as may whether or not the call is made on the dataframe being assigned to.


Answer 4

Quick work around

Simply enclose the list within a new list, as done for col2 in the data frame below. The reason it works is that Python takes the outer list (of lists) and converts it into a column as if it contained normal scalar items, which in our case are lists rather than normal scalars.

mydict={'col1':[1,2,3],'col2':[[1, 4], [2, 5], [3, 6]]}
data=pd.DataFrame(mydict)
data


   col1     col2
0   1       [1, 4]
1   2       [2, 5]
2   3       [3, 6]

Answer 5

Also getting

ValueError: Must have equal len keys and value when setting with an iterable,

using .at rather than .loc did not make any difference in my case, but enforcing the datatype of the dataframe column did the trick:

df['B'] = df['B'].astype(object)

Then I could set lists, numpy arrays, and all sorts of things as single cell values in my dataframes.


How do I convert dates in a pandas dataframe to a 'date' data type?

Question: How do I convert dates in a pandas dataframe to a 'date' data type?

I have a Pandas data frame where one of the columns contains date strings in the format YYYY-MM-DD.

For example, '2013-10-28'.

At the moment the dtype of the column is object.

How do I convert the column values to Pandas date format?


Answer 0

Use astype

In [31]: df
Out[31]: 
   a        time
0  1  2013-01-01
1  2  2013-01-02
2  3  2013-01-03

In [32]: df['time'] = df['time'].astype('datetime64[ns]')

In [33]: df
Out[33]: 
   a                time
0  1 2013-01-01 00:00:00
1  2 2013-01-02 00:00:00
2  3 2013-01-03 00:00:00

Answer 1

Essentially equivalent to @waitingkuo, but I would use to_datetime here (it seems a little cleaner, and offers some additional functionality e.g. dayfirst):

In [11]: df
Out[11]:
   a        time
0  1  2013-01-01
1  2  2013-01-02
2  3  2013-01-03

In [12]: pd.to_datetime(df['time'])
Out[12]:
0   2013-01-01 00:00:00
1   2013-01-02 00:00:00
2   2013-01-03 00:00:00
Name: time, dtype: datetime64[ns]

In [13]: df['time'] = pd.to_datetime(df['time'])

In [14]: df
Out[14]:
   a                time
0  1 2013-01-01 00:00:00
1  2 2013-01-02 00:00:00
2  3 2013-01-03 00:00:00

Handling ValueErrors
If you run into a situation where doing

df['time'] = pd.to_datetime(df['time'])

Throws a

ValueError: Unknown string format

That means you have invalid (non-coercible) values. If you are okay with having them converted to pd.NaT, you can add an errors='coerce' argument to to_datetime:

df['time'] = pd.to_datetime(df['time'], errors='coerce')

Answer 2

I imagine a lot of data comes into Pandas from CSV files, in which case you can simply convert the date during the initial CSV read:

dfcsv = pd.read_csv('xyz.csv', parse_dates=[0]) where the 0 refers to the column the date is in.
You could also add , index_col=0 in there if you want the date to be your index.

See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html


Answer 3

Now you can do df['column'].dt.date

Note that for datetime objects, if you don’t see the hour when they’re all 00:00:00, that’s not pandas. That’s iPython notebook trying to make things look pretty.
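
For example, a minimal sketch assuming a string column named 'time' as in the other answers:

df['time'] = pd.to_datetime(df['time']).dt.date  # datetime.date objects, no 00:00:00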


Answer 4

Another way to do this, which works well if you have multiple columns to convert to datetime:

cols = ['date1','date2']
df[cols] = df[cols].apply(pd.to_datetime)

Answer 5

If you want to get the DATE and not DATETIME format:

df["id_date"] = pd.to_datetime(df["id_date"]).dt.date

Answer 6

It may be the case that dates need to be converted to a different frequency. In this case, I would suggest setting an index by dates.

#set an index by dates
df.set_index(['time'], drop=True, inplace=True)

After this, you can more easily convert to the type of date format you will need most. Below, I sequentially convert to a number of date formats, ultimately ending up with a set of daily dates at the beginning of the month.

#Convert to daily dates
df.index = pd.DatetimeIndex(data=df.index)

#Convert to monthly dates
df.index = df.index.to_period(freq='M')

#Convert to strings
df.index = df.index.strftime('%Y-%m')

#Convert to daily dates
df.index = pd.DatetimeIndex(data=df.index)

For brevity, I don’t show that I run the following code after each line above:

print(df.index)
print(df.index.dtype)
print(type(df.index))

This gives me the following output:

Index(['2013-01-01', '2013-01-02', '2013-01-03'], dtype='object', name='time')
object
<class 'pandas.core.indexes.base.Index'>

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03'], dtype='datetime64[ns]', name='time', freq=None)
datetime64[ns]
<class 'pandas.core.indexes.datetimes.DatetimeIndex'>

PeriodIndex(['2013-01', '2013-01', '2013-01'], dtype='period[M]', name='time', freq='M')
period[M]
<class 'pandas.core.indexes.period.PeriodIndex'>

Index(['2013-01', '2013-01', '2013-01'], dtype='object')
object
<class 'pandas.core.indexes.base.Index'>

DatetimeIndex(['2013-01-01', '2013-01-01', '2013-01-01'], dtype='datetime64[ns]', freq=None)
datetime64[ns]
<class 'pandas.core.indexes.datetimes.DatetimeIndex'>

Answer 7

Try converting one of the rows into a timestamp using the pd.to_datetime function, and then use .map to map the formula to the entire column.
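
A hedged sketch of what this answer appears to suggest, reusing the 'time' column from the earlier answers:

df['time'] = df['time'].map(pd.to_datetime)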


Answer 8

Before the conversion, df.info() shows both columns as plain object dtype:

 #   Column          Non-Null Count   Dtype         
---  ------          --------------   -----         
 0   startDay        110526 non-null  object
 1   endDay          110526 non-null  object

import pandas as pd

df['startDay'] = pd.to_datetime(df.startDay)

df['endDay'] = pd.to_datetime(df.endDay)

After the conversion, both columns are datetime64[ns]:

 #   Column          Non-Null Count   Dtype         
---  ------          --------------   -----         
 0   startDay        110526 non-null  datetime64[ns]
 1   endDay          110526 non-null  datetime64[ns]

Answer 9

For the sake of completeness, another option, which might not be the most straightforward one (a bit similar to the one proposed by @SSS, but using the datetime library instead), is:

import datetime
df["Date"] = df["Date"].apply(lambda x: datetime.datetime.strptime(x, '%Y-%d-%m').date())

Pandas: best way to select all columns whose names start with X

Question: Pandas: best way to select all columns whose names start with X

I have a DataFrame:

import pandas as pd
import numpy as np

df = pd.DataFrame({'foo.aa': [1, 2.1, np.nan, 4.7, 5.6, 6.8],
                   'foo.fighters': [0, 1, np.nan, 0, 0, 0],
                   'foo.bars': [0, 0, 0, 0, 0, 1],
                   'bar.baz': [5, 5, 6, 5, 5.6, 6.8],
                   'foo.fox': [2, 4, 1, 0, 0, 5],
                   'nas.foo': ['NA', 0, 1, 0, 0, 0],
                   'foo.manchu': ['NA', 0, 0, 0, 0, 0],})

I want to select values of 1 in columns starting with foo.. Is there a better way to do it other than:

df2 = df[(df['foo.aa'] == 1)|
(df['foo.fighters'] == 1)|
(df['foo.bars'] == 1)|
(df['foo.fox'] == 1)|
(df['foo.manchu'] == 1)
]

Something similar to writing something like:

df2= df[df.STARTS_WITH_FOO == 1]

The answer should print out a DataFrame like this:

   bar.baz  foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu nas.foo
0      5.0     1.0         0             0        2         NA      NA
1      5.0     2.1         0             1        4          0       0
2      6.0     NaN         0           NaN        1          0       1
5      6.8     6.8         1             0        5          0       0

[4 rows x 7 columns]

Answer 0

Just perform a list comprehension to create your columns:

In [28]:

filter_col = [col for col in df if col.startswith('foo')]
filter_col
Out[28]:
['foo.aa', 'foo.bars', 'foo.fighters', 'foo.fox', 'foo.manchu']
In [29]:

df[filter_col]
Out[29]:
   foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu
0     1.0         0             0        2         NA
1     2.1         0             1        4          0
2     NaN         0           NaN        1          0
3     4.7         0             0        0          0
4     5.6         0             0        0          0
5     6.8         1             0        5          0

Another method is to create a series from the columns and use the vectorised str method startswith:

In [33]:

df[df.columns[pd.Series(df.columns).str.startswith('foo')]]
Out[33]:
   foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu
0     1.0         0             0        2         NA
1     2.1         0             1        4          0
2     NaN         0           NaN        1          0
3     4.7         0             0        0          0
4     5.6         0             0        0          0
5     6.8         1             0        5          0

In order to achieve what you want you need to add the following to filter the values that don’t meet your ==1 criteria:

In [36]:

df[df[df.columns[pd.Series(df.columns).str.startswith('foo')]]==1]
Out[36]:
   bar.baz  foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu nas.foo
0      NaN       1       NaN           NaN      NaN        NaN     NaN
1      NaN     NaN       NaN             1      NaN        NaN     NaN
2      NaN     NaN       NaN           NaN        1        NaN     NaN
3      NaN     NaN       NaN           NaN      NaN        NaN     NaN
4      NaN     NaN       NaN           NaN      NaN        NaN     NaN
5      NaN     NaN         1           NaN      NaN        NaN     NaN

EDIT

OK after seeing what you want the convoluted answer is this:

In [72]:

df.loc[df[df[df.columns[pd.Series(df.columns).str.startswith('foo')]] == 1].dropna(how='all', axis=0).index]
Out[72]:
   bar.baz  foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu nas.foo
0      5.0     1.0         0             0        2         NA      NA
1      5.0     2.1         0             1        4          0       0
2      6.0     NaN         0           NaN        1          0       1
5      6.8     6.8         1             0        5          0       0

Answer 1

Now that pandas’ indexes support string operations, arguably the simplest and best way to select columns beginning with ‘foo’ is just:

df.loc[:, df.columns.str.startswith('foo')]

Alternatively, you can filter column (or row) labels with df.filter(). To specify a regular expression to match the names beginning with foo.:

>>> df.filter(regex=r'^foo\.', axis=1)
   foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu
0     1.0         0             0        2         NA
1     2.1         0             1        4          0
2     NaN         0           NaN        1          0
3     4.7         0             0        0          0
4     5.6         0             0        0          0
5     6.8         1             0        5          0

To select only the required rows (containing a 1) and the columns, you can use loc, selecting the columns using filter (or any other method) and the rows using any:

>>> df.loc[(df == 1).any(axis=1), df.filter(regex=r'^foo\.', axis=1).columns]
   foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu
0     1.0         0             0        2         NA
1     2.1         0             1        4          0
2     NaN         0           NaN        1          0
5     6.8         1             0        5          0

Answer 2

The simplest way is to use str directly on the column names; there is no need for pd.Series:

df.loc[:,df.columns.str.startswith("foo")]



Answer 3

Based on @EdChum’s answer, you can try the following solution:

df[df.columns[pd.Series(df.columns).str.contains("foo")]]

This will be really helpful in case not all the columns you want to select start with foo. This method selects all the columns that contain the substring foo, which may appear at any point in a column's name.

In essence, I replaced .startswith() with .contains().


Answer 4

My solution. It may be slower in terms of performance:

a = pd.concat(df[df[c] == 1] for c in df.columns if c.startswith('foo'))
a.sort_index()


   bar.baz  foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu nas.foo
0      5.0     1.0         0             0        2         NA      NA
1      5.0     2.1         0             1        4          0       0
2      6.0     NaN         0           NaN        1          0       1
5      6.8     6.8         1             0        5          0       0

Answer 5

Another option for the selection of the desired entries is to use map:

df.loc[(df == 1).any(axis=1), df.columns.map(lambda x: x.startswith('foo'))]

which gives you all the columns for rows that contain a 1:

   foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu
0     1.0         0             0        2         NA
1     2.1         0             1        4          0
2     NaN         0           NaN        1          0
5     6.8         1             0        5          0

The row selection is done by

(df == 1).any(axis=1)

as in @ajcr’s answer which gives you:

0     True
1     True
2     True
3    False
4    False
5     True
dtype: bool

meaning that rows 3 and 4 do not contain a 1 and won't be selected.

The selection of the columns is done using Boolean indexing like this:

df.columns.map(lambda x: x.startswith('foo'))

In the example above this returns

array([False,  True,  True,  True,  True,  True, False], dtype=bool)

So, if a column does not start with foo, False is returned and the column is therefore not selected.

If you just want to return all rows that contain a 1 – as your desired output suggests – you can simply do

df.loc[(df == 1).any(axis=1)]

which returns

   bar.baz  foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu nas.foo
0      5.0     1.0         0             0        2         NA      NA
1      5.0     2.1         0             1        4          0       0
2      6.0     NaN         0           NaN        1          0       1
5      6.8     6.8         1             0        5          0       0

Answer 6

You can try a regex here to filter the columns starting with "foo":

df.filter(regex='^foo*')

If you need to have the string foo in your column then

df.filter(regex='foo*')

would be appropriate.

For the next step, you can use

df[df.filter(regex='^foo*').values==1]

to filter out the rows where one of the values of ‘foo*’ column is 1.


Answer 7

In my case I needed a list of prefixes

colsToScale=["production", "test", "development"]
dc[dc.columns[dc.columns.str.startswith(tuple(colsToScale))]]

How to get a value from a Pandas DataFrame, not the index and object type

Question: How to get a value from a Pandas DataFrame, not the index and object type

Say I have the following DataFrame

Letter    Number
A          1
B          2
C          3
D          4

Which can be obtained through the following code

import pandas as pd

letters=pd.Series(('A', 'B', 'C', 'D'))
numbers=pd.Series((1, 2, 3, 4))
keys=('Letters', 'Numbers')
df=pd.concat((letters, numbers), axis=1, keys=keys)

Now I want to get the value C from the column Letters.

The command line

df[df.Letters=='C'].Letters

will return

2    C
Name: Letters, dtype: object

How can I get only the value C and not the whole two line output?


Answer 0

df[df.Letters=='C'].Letters.item()

This returns the first element in the Index/Series returned from that selection. In this case, the value is always the first element.

EDIT:

Or you can run a loc() and access the first element that way. This was shorter and is the way I have implemented it in the past.
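
A hedged sketch of the loc variant this answer alludes to, selecting by label and then taking the first element:

df.loc[df.Letters == 'C', 'Letters'].iloc[0]  # returns 'C'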


Answer 1

Use the values attribute to return the values as a np array and then use [0] to get the first value:

In [4]:
df.loc[df.Letters=='C','Letters'].values[0]

Out[4]:
'C'

EDIT

I personally prefer to access the columns using subscript operators:

df.loc[df['Letters'] == 'C', 'Letters'].values[0]

This avoids issues where the column names can have spaces or dashes, which would break attribute-style access using the . operator.


Answer 2

import pandas as pd

dataset = pd.read_csv("data.csv")
values = list(x for x in dataset["column name"])

>>> values[0]
'item_0'

Edit:

Actually, you can just index the dataset like any old array.

import pandas as pd

dataset = pd.read_csv("data.csv")
first_value = dataset["column name"][0]

>>> print(first_value)
'item_0'

Pandas is not displaying the graph I am trying to plot in IPython Notebook / Jupyter

Question: Pandas is not displaying the graph I am trying to plot in IPython Notebook / Jupyter

I am trying to plot some data using pandas in Ipython Notebook, and while it gives me the object, it doesn’t actually plot the graph itself. So it looks like this:

In [7]:

pledge.Amount.plot()

Out[7]:

<matplotlib.axes.AxesSubplot at 0x9397c6c>

The graph should follow after that, but it simply doesn’t appear. I have imported matplotlib, so that’s not the problem. Is there any other module I need to import?


Answer 0

Note that --pylab is deprecated and has been removed from newer builds of IPython. The recommended way to enable inline plotting in the IPython Notebook is now to run:

%matplotlib inline
import matplotlib.pyplot as plt

See this post from the ipython-dev mailing list for more details.


Answer 1

Edit: Pylab has been deprecated; please see the currently accepted answer.

OK, it seems the answer is to start the IPython notebook with --pylab=inline, i.e. ipython notebook --pylab=inline. This has it do what I saw earlier and what I wanted it to do. Sorry about the vague original question.


Answer 2

With your import matplotlib.pyplot as plt just add

plt.show()

and it will show all stored plots.


Answer 3

It's simple: after importing matplotlib, you have to execute one magic if you have started IPython like this:

ipython notebook 

%matplotlib inline 

Run this command and everything will be shown perfectly.


Answer 4

Start IPython with ipython notebook --pylab inline; the graph will then show inline.


Answer 5

import matplotlib.pyplot as plt
%matplotlib inline

Answer 6

All you need to do is to import matplotlib.

import matplotlib.pyplot as plt 

Appending a column to a pandas dataframe

Question: Appending a column to a pandas dataframe

This is probably easy, but I have the following data:

In data frame 1:

index dat1
0     9
1     5

In data frame 2:

index dat2
0     7
1     6

I want a data frame with the following form:

index dat1  dat2
0     9     7
1     5     6

I’ve tried using the append method, but I get a cross join (i.e. cartesian product).

What’s the right way to do this?


Answer 0

It seems in general you’re just looking for a join:

> dat1 = pd.DataFrame({'dat1': [9,5]})
> dat2 = pd.DataFrame({'dat2': [7,6]})
> dat1.join(dat2)
   dat1  dat2
0     9     7
1     5     6

Answer 1

You can also use:

dat1 = pd.concat([dat1, dat2], axis=1)

Answer 2

Both the join() and concat() approaches can solve the problem. However, there is one warning I have to mention: reset the index before you join() or concat() if you are trying to deal with a data frame built by selecting rows from another DataFrame.

The example below shows some interesting behavior of join and concat:

dat1 = pd.DataFrame({'dat1': range(4)})
dat2 = pd.DataFrame({'dat2': range(4,8)})
dat1.index = [1,3,5,7]
dat2.index = [2,4,6,8]

# way1 join 2 DataFrames
print(dat1.join(dat2))
# output
   dat1  dat2
1     0   NaN
3     1   NaN
5     2   NaN
7     3   NaN

# way2 concat 2 DataFrames
print(pd.concat([dat1,dat2],axis=1))
#output
   dat1  dat2
1   0.0   NaN
2   NaN   4.0
3   1.0   NaN
4   NaN   5.0
5   2.0   NaN
6   NaN   6.0
7   3.0   NaN
8   NaN   7.0

#reset index 
dat1 = dat1.reset_index(drop=True)
dat2 = dat2.reset_index(drop=True)
#both 2 ways to get the same result

print(dat1.join(dat2))
   dat1  dat2
0     0     4
1     1     5
2     2     6
3     3     7


print(pd.concat([dat1,dat2],axis=1))
   dat1  dat2
0     0     4
1     1     5
2     2     6
3     3     7

Answer 3

Just as a matter of fact:

data_joined = dat1.join(dat2)
print(data_joined)

Answer 4

Just a matter of the right google search:

data = dat_1.append(dat_2)
data = data.groupby(data.index).sum()
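
Note: DataFrame.append is gone in pandas 2.0+; a hedged equivalent of this answer with concat would be:

data = pd.concat([dat_1, dat_2])
data = data.groupby(data.index).sum()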

How to delete the last row of data in a pandas dataframe

Question: How to delete the last row of data in a pandas dataframe

I think this should be simple, but I tried a few ideas and none of them worked:

last_row = len(DF)
DF = DF.drop(DF.index[last_row])  #<-- fail!

I tried using negative indices but that also led to errors. I must still be misunderstanding something basic.


Answer 0

To drop last n rows:

df.drop(df.tail(n).index,inplace=True) # drop last n rows

In the same vein, you can drop the first n rows:

df.drop(df.head(n).index,inplace=True) # drop first n rows
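
For example, a quick sanity check on a toy frame (a minimal sketch):

import pandas as pd

df = pd.DataFrame({'a': range(5)})
df.drop(df.tail(2).index, inplace=True)
print(df)  # rows 0, 1, 2 remain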

Answer 1

DF[:-n]

where n is the last number of rows to drop.

To drop the last row:

DF = DF[:-1]

Answer 2

Since index positioning in Python is 0-based, there won’t actually be an element in index at the location corresponding to len(DF). You need that to be last_row = len(DF) - 1:

In [49]: dfrm
Out[49]: 
          A         B         C
0  0.120064  0.785538  0.465853
1  0.431655  0.436866  0.640136
2  0.445904  0.311565  0.934073
3  0.981609  0.695210  0.911697
4  0.008632  0.629269  0.226454
5  0.577577  0.467475  0.510031
6  0.580909  0.232846  0.271254
7  0.696596  0.362825  0.556433
8  0.738912  0.932779  0.029723
9  0.834706  0.002989  0.333436

[10 rows x 3 columns]

In [50]: dfrm.drop(dfrm.index[len(dfrm)-1])
Out[50]: 
          A         B         C
0  0.120064  0.785538  0.465853
1  0.431655  0.436866  0.640136
2  0.445904  0.311565  0.934073
3  0.981609  0.695210  0.911697
4  0.008632  0.629269  0.226454
5  0.577577  0.467475  0.510031
6  0.580909  0.232846  0.271254
7  0.696596  0.362825  0.556433
8  0.738912  0.932779  0.029723

[9 rows x 3 columns]

However, it’s much simpler to just write DF[:-1].


Answer 3

Surprised nobody brought this one up:

# To remove last n rows
df.head(-n)

# To remove first n rows
df.tail(-n)

Running a speed test on a DataFrame of 1000 rows shows that slicing and head/tail are ~6 times faster than using drop:

>>> %timeit df[:-1]
125 µs ± 132 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

>>> %timeit df.head(-1)
129 µs ± 1.18 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

>>> %timeit df.drop(df.tail(1).index)
751 µs ± 20.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Answer 4

stats = pd.read_csv("C:\\py\\programs\\second pandas\\ex.csv")

The Output of stats:

       A            B          C
0   0.120064    0.785538    0.465853
1   0.431655    0.436866    0.640136
2   0.445904    0.311565    0.934073
3   0.981609    0.695210    0.911697
4   0.008632    0.629269    0.226454
5   0.577577    0.467475    0.510031
6   0.580909    0.232846    0.271254
7   0.696596    0.362825    0.556433
8   0.738912    0.932779    0.029723
9   0.834706    0.002989    0.333436

Just use skipfooter=1:

skipfooter : int, default 0

Number of lines at bottom of file to skip

stats_2 = pd.read_csv("C:\\py\\programs\\second pandas\\ex.csv", skipfooter=1, engine='python')

Output of stats_2

       A          B            C
0   0.120064    0.785538    0.465853
1   0.431655    0.436866    0.640136
2   0.445904    0.311565    0.934073
3   0.981609    0.695210    0.911697
4   0.008632    0.629269    0.226454
5   0.577577    0.467475    0.510031
6   0.580909    0.232846    0.271254
7   0.696596    0.362825    0.556433
8   0.738912    0.932779    0.029723

Answer 5

drop returns a new DataFrame, which is why it choked in the original post. I had a similar requirement to rename some column headers and delete some rows from an ill-formed csv file converted to a DataFrame, so after reading this post I used:

newList = pd.DataFrame(newList)
newList.columns = ['Area', 'Price']
print(newList)
# newList = newList.drop(0)
# newList = newList.drop(len(newList))
newList = newList[1:-1]
print(newList)

and it worked great. As you can see from the two commented-out lines above, I tried the drop() method and it works, but it is not as cool and readable as using [n:-n]. Hope that helps someone, thanks.


Answer 6

For more complex DataFrames that have a Multi-Index (say "Stock" and "Date"), where one wants to remove the last row for each Stock and not just the last row of the last Stock, the solution reads:

# To remove last n rows
df = df.groupby(level='Stock').apply(lambda x: x.head(-1)).reset_index(0, drop=True)

# To remove first n rows
df = df.groupby(level='Stock').apply(lambda x: x.tail(-1)).reset_index(0, drop=True)

As the groupby() is adding an additional level to the Multi-Index we just drop it at the end using reset_index(). The resulting df keeps the same type of Multi-Index as before the operation.


Setting column order in a pandas dataframe

Question: Setting column order in a pandas dataframe

Is there a way to reorder columns in pandas dataframe based on my personal preference (i.e. not alphabetically or numerically sorted, but more like following certain conventions)?

Simple example:

frame = pd.DataFrame({
        'one thing':[1,2,3,4],
        'second thing':[0.1,0.2,1,2],
        'other thing':['a','e','i','o']})

produces this:

   one thing other thing  second thing
0          1           a           0.1
1          2           e           0.2
2          3           i           1.0
3          4           o           2.0

But instead, I would like this:

   one thing second thing  other thing
0          1           0.1           a
1          2           0.2           e
2          3           1.0           i
3          4           2.0           o

(Please, provide a generic solution rather than specific to this case. Many thanks.)


Answer 0

Just select the order yourself by typing in the column names. Note the double brackets:

frame = frame[['column I want first', 'column I want second'...etc.]]
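
Applied to the frame from the question, a minimal sketch:

import pandas as pd

frame = pd.DataFrame({
        'one thing': [1, 2, 3, 4],
        'second thing': [0.1, 0.2, 1, 2],
        'other thing': ['a', 'e', 'i', 'o']})

# selecting with a list of names returns a new DataFrame
# with the columns in exactly that order
frame = frame[['one thing', 'second thing', 'other thing']]
print(frame)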

回答 1

您可以使用此:

columnsTitles = ['one thing', 'second thing', 'other thing']

frame = frame.reindex(columns=columnsTitles)

You can use this:

columnsTitles = ['one thing', 'second thing', 'other thing']

frame = frame.reindex(columns=columnsTitles)
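
One thing to watch with reindex: labels are matched exactly, and a label that does not exist in the frame silently produces a column full of NaN instead of raising an error. A minimal sketch:

import pandas as pd

frame = pd.DataFrame({
        'one thing': [1, 2, 3, 4],
        'second thing': [0.1, 0.2, 1, 2],
        'other thing': ['a', 'e', 'i', 'o']})

# 'secondthing' (no space) does not match 'second thing',
# so the result contains a NaN column rather than an error
print(frame.reindex(columns=['one thing', 'secondthing']))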

回答 2

这是我经常使用的解决方案。当您拥有包含大量列的大型数据集时,您绝对不希望手动重新排列所有列。

您可以而且很可能想做的,只是把经常使用的前几列排到最前面,其余列保持原样。这在R中是常见做法:df %>% select(one, two, three, everything())。

因此,您可以先在列表cols_to_order中手动写出想要排在所有其他列之前的那几列。

然后,通过组合其余各列来构造新列的列表:

new_columns = cols_to_order + (frame.columns.drop(cols_to_order).tolist())

之后,您就可以像其他答案建议的那样使用new_columns了。

import pandas as pd
frame = pd.DataFrame({
    'one thing': [1, 2, 3, 4],
    'other thing': ['a', 'e', 'i', 'o'],
    'more things': ['a', 'e', 'i', 'o'],
    'second thing': [0.1, 0.2, 1, 2],
})

cols_to_order = ['one thing', 'second thing']
new_columns = cols_to_order + (frame.columns.drop(cols_to_order).tolist())
frame = frame[new_columns]

   one thing  second thing other thing more things
0          1           0.1           a           a
1          2           0.2           e           e
2          3           1.0           i           i
3          4           2.0           o           o

Here is a solution I use very often. When you have a large data set with tons of columns, you definitely do not want to manually rearrange all the columns.

What you can and, most likely, want to do is just order the first few columns that you frequently use, and let all the other columns stay as they are. This is a common approach in R: df %>% select(one, two, three, everything()).

So you can first manually list, in cols_to_order, the columns that you want positioned before all the other columns.

Then you construct a list for new columns by combining the rest of the columns:

new_columns = cols_to_order + (frame.columns.drop(cols_to_order).tolist())

After this, you can use new_columns as the other solutions suggested.

import pandas as pd
frame = pd.DataFrame({
    'one thing': [1, 2, 3, 4],
    'other thing': ['a', 'e', 'i', 'o'],
    'more things': ['a', 'e', 'i', 'o'],
    'second thing': [0.1, 0.2, 1, 2],
})

cols_to_order = ['one thing', 'second thing']
new_columns = cols_to_order + (frame.columns.drop(cols_to_order).tolist())
frame = frame[new_columns]

   one thing  second thing other thing more things
0          1           0.1           a           a
1          2           0.2           e           e
2          3           1.0           i           i
3          4           2.0           o           o

回答 3

您也可以做类似的事情 df = df[['x', 'y', 'a', 'b']]

import pandas as pd
frame = pd.DataFrame({'one thing':[1,2,3,4],'second thing':[0.1,0.2,1,2],'other thing':['a','e','i','o']})
frame = frame[['second thing', 'other thing', 'one thing']]
print(frame)
   second thing other thing  one thing
0           0.1           a          1
1           0.2           e          2
2           1.0           i          3
3           2.0           o          4

另外,您可以通过以下方式获取列列表:

cols = list(df.columns.values)

输出将产生如下内容:

['x', 'y', 'a', 'b']

这样就很容易手动重新排列。

You could also do something like df = df[['x', 'y', 'a', 'b']]

import pandas as pd
frame = pd.DataFrame({'one thing':[1,2,3,4],'second thing':[0.1,0.2,1,2],'other thing':['a','e','i','o']})
frame = frame[['second thing', 'other thing', 'one thing']]
print(frame)
   second thing other thing  one thing
0           0.1           a          1
1           0.2           e          2
2           1.0           i          3
3           2.0           o          4

Also, you can get the list of columns with:

cols = list(df.columns.values)

The output will produce something like this:

['x', 'y', 'a', 'b']

Which is then easy to rearrange manually.
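
Building on that, a minimal sketch of moving a single column to the front by editing the list and selecting again (using the column names from the question):

import pandas as pd

frame = pd.DataFrame({
        'one thing': [1, 2, 3, 4],
        'second thing': [0.1, 0.2, 1, 2],
        'other thing': ['a', 'e', 'i', 'o']})

cols = list(frame.columns.values)
# pop 'other thing' out of the list and re-insert it at the front
cols.insert(0, cols.pop(cols.index('other thing')))
frame = frame[cols]
print(frame)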


回答 4

用列表而不是字典构造它

frame = pd.DataFrame([
        [1, .1, 'a'],
        [2, .2, 'e'],
        [3,  1, 'i'],
        [4,  4, 'o']
    ], columns=['one thing', 'second thing', 'other thing'])

frame

   one thing  second thing other thing
0          1           0.1           a
1          2           0.2           e
2          3           1.0           i
3          4           4.0           o

Construct it with a list instead of a dictionary

frame = pd.DataFrame([
        [1, .1, 'a'],
        [2, .2, 'e'],
        [3,  1, 'i'],
        [4,  4, 'o']
    ], columns=['one thing', 'second thing', 'other thing'])

frame

   one thing  second thing other thing
0          1           0.1           a
1          2           0.2           e
2          3           1.0           i
3          4           4.0           o

回答 5

您还可以使用OrderedDict:

In [183]: from collections import OrderedDict

In [184]: data = OrderedDict()

In [185]: data['one thing'] = [1,2,3,4]

In [186]: data['second thing'] = [0.1,0.2,1,2]

In [187]: data['other thing'] = ['a','e','i','o']

In [188]: frame = pd.DataFrame(data)

In [189]: frame
Out[189]:
   one thing  second thing other thing
0          1           0.1           a
1          2           0.2           e
2          3           1.0           i
3          4           2.0           o

You can also use OrderedDict:

In [183]: from collections import OrderedDict

In [184]: data = OrderedDict()

In [185]: data['one thing'] = [1,2,3,4]

In [186]: data['second thing'] = [0.1,0.2,1,2]

In [187]: data['other thing'] = ['a','e','i','o']

In [188]: frame = pd.DataFrame(data)

In [189]: frame
Out[189]:
   one thing  second thing other thing
0          1           0.1           a
1          2           0.2           e
2          3           1.0           i
3          4           2.0           o
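
For what it's worth, on Python 3.7+ (and pandas from roughly 0.23 onward) a plain dict already preserves insertion order, so OrderedDict is no longer needed for this; a minimal sketch:

import pandas as pd

# dict insertion order is preserved on Python 3.7+,
# so the columns come out in the order the keys were written
frame = pd.DataFrame({'one thing': [1, 2, 3, 4],
                      'second thing': [0.1, 0.2, 1, 2],
                      'other thing': ['a', 'e', 'i', 'o']})
print(frame.columns.tolist())  # ['one thing', 'second thing', 'other thing']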

回答 6

添加“columns”参数:

frame = pd.DataFrame({
        'one thing':[1,2,3,4],
        'second thing':[0.1,0.2,1,2],
        'other thing':['a','e','i','o']},
        columns=['one thing', 'second thing', 'other thing']
)

Add the ‘columns’ parameter:

frame = pd.DataFrame({
        'one thing':[1,2,3,4],
        'second thing':[0.1,0.2,1,2],
        'other thing':['a','e','i','o']},
        columns=['one thing', 'second thing', 'other thing']
)

回答 7

尝试按位置索引来选择列(这样得到的就是通用的解决方案,而不仅限于本例;索引顺序可以随意指定):

l=[0,2,1] # index order
frame=frame[[frame.columns[i] for i in l]]

现在:

print(frame)

是:

   one thing second thing  other thing
0          1           0.1           a
1          2           0.2           e
2          3           1.0           i
3          4           2.0           o

Try indexing (this way you get a generic solution, not one specific to this case, and the index order can be whatever you want):

l=[0,2,1] # index order
frame=frame[[frame.columns[i] for i in l]]

Now:

print(frame)

Is:

   one thing second thing  other thing
0          1           0.1           a
1          2           0.2           e
2          3           1.0           i
3          4           2.0           o
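
The same positional reordering can also be written with iloc, avoiding the list comprehension over column names; a minimal sketch:

import pandas as pd

frame = pd.DataFrame({'one thing': [1, 2, 3, 4],
                      'other thing': ['a', 'e', 'i', 'o'],
                      'second thing': [0.1, 0.2, 1, 2]})

l = [0, 2, 1]  # desired positional order
frame = frame.iloc[:, l]  # select columns by position
print(frame)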

回答 8

我发现这是最直接可行的方法:

df = pd.DataFrame({
        'one thing':[1,2,3,4],
        'second thing':[0.1,0.2,1,2],
        'other thing':['a','e','i','o']})

df = df[['one thing','second thing', 'other thing']]

I find this to be the most straightforward approach, and it works:

df = pd.DataFrame({
        'one thing':[1,2,3,4],
        'second thing':[0.1,0.2,1,2],
        'other thing':['a','e','i','o']})

df = df[['one thing','second thing', 'other thing']]