How to make good reproducible pandas examples

Question: How to make good reproducible pandas examples

Having spent a decent amount of time watching both the r and pandas tags on SO, I get the impression that pandas questions are less likely to contain reproducible data. This is something the R community has been pretty good about encouraging, and thanks to guides like this, newcomers are able to get some help putting these examples together. People who read these guides and come back with reproducible data often have much better luck getting answers to their questions.

How can we create good reproducible examples for pandas questions? Simple dataframes can be put together, e.g.:

import pandas as pd
df = pd.DataFrame({'user': ['Bob', 'Jane', 'Alice'], 
                   'income': [40000, 50000, 42000]})

But many example datasets need more complicated structure, e.g.:

  • datetime indices or data
  • Multiple categorical variables (is there an equivalent to R’s expand.grid() function, which produces all possible combinations of some given variables?)
  • MultiIndex or Panel data

For datasets that are hard to mock up using a few lines of code, is there an equivalent to R’s dput() that allows you to generate copy-pasteable code to regenerate your data structure?


Answer 0

Note: The ideas here are pretty generic for Stack Overflow, and indeed for questions in general.

Disclaimer: Writing a good question is HARD.

The Good:

  • do include small* example DataFrame, either as runnable code:

    In [1]: df = pd.DataFrame([[1, 2], [1, 3], [4, 6]], columns=['A', 'B'])
    

    or make it “copy and pasteable” for pd.read_clipboard(sep=r'\s\s+'). You can format the text for Stack Overflow by highlighting it and pressing Ctrl+K (or by prepending four spaces to each line), or by placing three tildes above and below your code with the code unindented:

    In [2]: df
    Out[2]: 
       A  B
    0  1  2
    1  1  3
    2  4  6
    

    Test pd.read_clipboard(sep=r'\s\s+') yourself.

    * I really do mean small: the vast majority of example DataFrames could be fewer than 6 rows [citation needed], and I bet I can do it in 5 rows. Can you reproduce the error with df = df.head()? If not, fiddle around to see if you can make up a small DataFrame which exhibits the issue you are facing.

    * Every rule has an exception. The obvious one is performance issues (in which case definitely use %timeit and possibly %prun), where you should generate the data (consider using np.random.seed so we have the exact same frame): df = pd.DataFrame(np.random.randn(100000000, 10)). That said, “make this code fast for me” is not strictly on topic for the site…

  • write out the outcome you desire (similarly to above)

    In [3]: iwantthis
    Out[3]: 
       A  B
    0  1  5
    1  4  6
    

    Explain where the numbers come from: the 5 is the sum of the B column for the rows where A is 1.

  • do show the code you’ve tried:

    In [4]: df.groupby('A').sum()
    Out[4]: 
       B
    A   
    1  5
    4  6
    

    But say what’s incorrect: the A column is in the index rather than a column.

  • do show you’ve done some research (search the docs, search Stack Overflow) and give a summary:

    The docstring for sum simply states “Compute sum of group values”

    The groupby docs don’t give any examples for this.

    Aside: the answer here is to use df.groupby('A', as_index=False).sum().

  • if it’s relevant that you have Timestamp columns, e.g. you’re resampling or something, then be explicit and apply pd.to_datetime to them for good measure**.

    df['date'] = pd.to_datetime(df['date']) # this column ought to be date..
    

    ** Sometimes this is the issue itself: they were strings.
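
The points above can be sketched in one self-contained snippet. It assumes the question text contains the printed frame from In [2]; io.StringIO stands in for the clipboard so the example runs anywhere, and the variable names pasted, attempt and result are only illustrative.

```python
import io

import pandas as pd

# The frame exactly as it would appear in the question text.
pasted = """\
   A  B
0  1  2
1  1  3
2  4  6
"""

# Fields are separated by two or more whitespace characters. A regex
# separator requires the Python parsing engine.
df = pd.read_csv(io.StringIO(pasted), sep=r"\s\s+", engine="python")

# The desired outcome, written out as runnable code.
iwantthis = pd.DataFrame({"A": [1, 4], "B": [5, 6]})

# The attempt from the question: A ends up in the index.
attempt = df.groupby("A").sum()

# The fix mentioned in the aside: keep A as an ordinary column.
result = df.groupby("A", as_index=False).sum()

print(attempt.equals(iwantthis))
print(result.equals(iwantthis))
```

Comparing against iwantthis with equals gives answerers a quick way to confirm that a proposed solution really matches the requested output.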

The Bad:

  • don’t include a MultiIndex that we can’t copy and paste (see above). This is something of a grievance with the pandas default display, but annoying nonetheless:

    In [11]: df
    Out[11]:
         C
    A B   
    1 2  3
      2  6
    

    The correct way is to include an ordinary DataFrame with a set_index call:

    In [12]: df = pd.DataFrame([[1, 2, 3], [1, 2, 6]], columns=['A', 'B', 'C']).set_index(['A', 'B'])
    
    In [13]: df
    Out[13]: 
         C
    A B   
    1 2  3
      2  6
    
  • do provide insight into what it is when giving the outcome you want:

       B
    A   
    1  1
    5  0
    

    Be specific about how you got the numbers (what are they)… double check they’re correct.

  • If your code throws an error, do include the entire stack trace (this can be edited out later if it’s too noisy). Show the line number (and the corresponding line of your code which it’s raising against).

The Ugly:

  • don’t link to a csv we don’t have access to (ideally don’t link to an external source at all…)

    df = pd.read_csv('my_secret_file.csv')  # ideally with lots of parsing options
    

    Most data is proprietary, we get that: make up similar data and see if you can reproduce the problem (something small).

  • don’t explain the situation vaguely in words, like you have a DataFrame which is “large”, mention some of the column names in passing (be sure not to mention their dtypes). Try and go into lots of detail about something which is completely meaningless without seeing the actual context. Presumably no one is even going to read to the end of this paragraph.

    Essays are bad, it’s easier with small examples.

  • don’t include 10+ (100+??) lines of data munging before getting to your actual question.

    Please, we see enough of this in our day jobs. We want to help, but not like this….
    Cut the intro, and just show the relevant DataFrames (or small versions of them) in the step which is causing you trouble.

Anyways, have fun learning Python, NumPy and Pandas!


Answer 1

How to create sample datasets

This is mainly to expand on @AndyHayden’s answer by providing examples of how you can create sample dataframes. Pandas and (especially) numpy give you a variety of tools for this, such that you can generally create a reasonable facsimile of any real dataset with just a few lines of code.

After importing numpy and pandas, be sure to provide a random seed if you want folks to be able to exactly reproduce your data and results.

import numpy as np
import pandas as pd

np.random.seed(123)
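
As a quick check of what the seed buys you, here is a minimal sketch. The helper name make_frame is hypothetical; the point is only that the same seed yields the same frame on every call. The last lines show NumPy's newer Generator interface as an alternative, which is not what this answer uses.

```python
import numpy as np
import pandas as pd


def make_frame(seed=123):
    """Hypothetical helper: build the same random frame for a given seed."""
    np.random.seed(seed)
    return pd.DataFrame({
        "a": np.random.randn(6),
        "c": np.random.choice(["panda", "python", "shark"], 6),
    })


first = make_frame()
second = make_frame()
print(first.equals(second))  # identical because the seed was reset

# Alternative: a local Generator avoids touching NumPy's global state.
rng = np.random.default_rng(123)
df_rng = pd.DataFrame({"a": rng.standard_normal(6)})
```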

A kitchen sink example

Here’s an example showing a variety of things you can do. All kinds of useful sample dataframes could be created from a subset of this:

df = pd.DataFrame({ 

    # some ways to create random data
    'a':np.random.randn(6),
    'b':np.random.choice( [5,7,np.nan], 6),
    'c':np.random.choice( ['panda','python','shark'], 6),

    # some ways to create systematic groups for indexing or groupby
    # this is similar to r's expand.grid(), see note 2 below
    'd':np.repeat( range(3), 2 ),
    'e':np.tile(   range(2), 3 ),

    # a date range and set of random dates
    'f':pd.date_range('1/1/2011', periods=6, freq='D'),
    'g':np.random.choice( pd.date_range('1/1/2011', periods=365, 
                          freq='D'), 6, replace=False) 
    })

This produces:

          a   b       c  d  e          f          g
0 -1.085631 NaN   panda  0  0 2011-01-01 2011-08-12
1  0.997345   7   shark  0  1 2011-01-02 2011-11-10
2  0.282978   5   panda  1  0 2011-01-03 2011-10-30
3 -1.506295   7  python  1  1 2011-01-04 2011-09-07
4 -0.578600 NaN   shark  2  0 2011-01-05 2011-02-27
5  1.651437   7  python  2  1 2011-01-06 2011-02-03

Some notes:

  1. np.repeat and np.tile (columns d and e) are very useful for creating groups and indices in a very regular way. For 2 columns, this can be used to easily duplicate R’s expand.grid(), but it is also more flexible in its ability to provide a subset of all permutations. However, for 3 or more columns the syntax quickly becomes unwieldy.
  2. For a more direct replacement for R’s expand.grid(), see the itertools solution in the pandas cookbook or the np.meshgrid solution shown here. Those will allow any number of dimensions.
  3. You can do quite a bit with np.random.choice. For example, in column g, we have a random selection of 6 dates from 2011. Additionally, by setting replace=False we can assure these dates are unique — very handy if we want to use this as an index with unique values.
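
For note 2, here is a minimal sketch of an expand.grid()-style helper built on itertools.product. The name expand_grid and the example variables are hypothetical and not part of pandas. Note that the last variable varies fastest here, whereas R's expand.grid() varies the first one fastest.

```python
import itertools

import pandas as pd


def expand_grid(**levels):
    """Hypothetical helper: one row per combination of the given values."""
    rows = itertools.product(*levels.values())
    return pd.DataFrame(list(rows), columns=list(levels.keys()))


grid = expand_grid(
    height=[60, 70],
    weight=[100, 140, 180],
    sex=["Male", "Female"],
)
print(grid)
```

Because itertools.product accepts any number of iterables, the same helper works for any number of variables.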

Fake stock market data

In addition to taking subsets of the above code, you can further combine the techniques to do just about anything. For example, here’s a short example that combines np.repeat, np.tile and date_range to create sample ticker data for 4 stocks covering the same dates:

stocks = pd.DataFrame({ 
    'ticker':np.repeat( ['aapl','goog','yhoo','msft'], 25 ),
    'date':np.tile( pd.date_range('1/1/2011', periods=25, freq='D'), 4 ),
    'price':(np.random.randn(100).cumsum() + 10) })

Now we have a sample dataset with 100 lines (25 dates per ticker), but we have only used 4 lines to do it, making it easy for everyone else to reproduce without copying and pasting 100 lines of code. You can then display subsets of the data if it helps to explain your question:

>>> stocks.head(5)

        date      price ticker
0 2011-01-01   9.497412   aapl
1 2011-01-02  10.261908   aapl
2 2011-01-03   9.438538   aapl
3 2011-01-04   9.515958   aapl
4 2011-01-05   7.554070   aapl

>>> stocks.groupby('ticker').head(2)

         date      price ticker
0  2011-01-01   9.497412   aapl
1  2011-01-02  10.261908   aapl
25 2011-01-01   8.277772   goog
26 2011-01-02   7.714916   goog
50 2011-01-01   5.613023   yhoo
51 2011-01-02   6.397686   yhoo
75 2011-01-01  11.736584   msft
76 2011-01-02  11.944519   msft

Answer 2

Diary of an Answerer

My best advice for asking questions would be to play on the psychology of the people who answer questions. Being one of those people, I can give insight into why I answer certain questions and why I don’t answer others.

Motivations

I’m motivated to answer questions for several reasons

  1. Stackoverflow.com has been a tremendously valuable resource to me. I wanted to give back.
  2. In my efforts to give back, I’ve found this site to be an even more powerful resource than before. Answering questions is a learning experience for me and I like to learn. Read this answer and comment from another vet. This kind of interaction makes me happy.
  3. I like points!
  4. See #3.
  5. I like interesting problems.

All my purest intentions are great and all, but I get that satisfaction if I answer 1 question or 30. What drives my choices for which questions to answer has a huge component of point maximization.

I’ll also spend time on interesting problems, but those are few and far between and that doesn’t help an asker who needs a solution to a non-interesting question. Your best bet to get me to answer a question is to serve that question up on a platter, ripe for me to answer it with as little effort as possible. If I’m looking at two questions and one has code I can copy paste to create all the variables I need… I’m taking that one! I’ll come back to the other one if I have time, maybe.

Main Advice

Make it easy for the people answering questions.

  • Provide code that creates variables that are needed.
  • Minimize that code. If my eyes glaze over as I look at the post, I’m on to the next question or getting back to whatever else I’m doing.
  • Think about what you’re asking and be specific. We want to see what you’ve done because natural languages (English) are inexact and confusing. Code samples of what you’ve tried help resolve inconsistencies in a natural language description.
  • PLEASE show what you expect!!! I have to sit down and try things. I almost never know the answer to a question without trying some things out. If I don’t see an example of what you’re looking for, I might pass on the question because I don’t feel like guessing.

Your reputation is more than just your reputation.

I like points (I mentioned that above). But those points aren’t really my reputation. My real reputation is an amalgamation of what others on the site think of me. I strive to be fair and honest and I hope others can see that. What that means for an asker is that we remember the behaviors of askers. If you don’t select answers and upvote good answers, I remember. If you behave in ways I don’t like or in ways I do like, I remember. This also plays into which questions I’ll answer.


Anyway, I can probably go on, but I’ll spare all of you who actually read this.


Answer 3

The Challenge

One of the most challenging aspects of responding to SO questions is the time it takes to recreate the problem (including the data). Questions which don’t have a clear way to reproduce the data are less likely to be answered. Given that you are taking the time to write a question and you have an issue that you’d like help with, you can easily help yourself by providing data that others can then use to help solve your problem.

The instructions provided by @Andy for writing good Pandas questions are an excellent place to start. For more information, refer to how to ask and how to create Minimal, Complete, and Verifiable examples.

Please clearly state your question upfront. After taking the time to write your question and any sample code, try to read it and provide an ‘Executive Summary’ for your reader which summarizes the problem and clearly states the question.

Original question:

I have this data…

I want to do this…

I want my result to look like this…

However, when I try to do [this], I get the following problem…

I’ve tried to find solutions by doing [this] and [that].

How do I fix it?

Depending on the amount of data, sample code and error stacks provided, the reader needs to go a long way before understanding what the problem is. Try restating your question so that the question itself is on top, and then provide the necessary details.

Revised Question:

Question: How can I do [this]?

I’ve tried to find solutions by doing [this] and [that].

When I’ve tried to do [this], I get the following problem…

I’d like my final results to look like this…

Here is some minimal code that can reproduce my problem…

And here is how to recreate my sample data: df = pd.DataFrame({'A': [...], 'B': [...], ...})

PROVIDE SAMPLE DATA IF NEEDED!!!

Sometimes just the head or tail of the DataFrame is all that is needed. You can also use the methods proposed by @JohnE to create larger datasets that can be reproduced by others. Using his example to generate a 100 row DataFrame of stock prices:

stocks = pd.DataFrame({ 
    'ticker':np.repeat( ['aapl','goog','yhoo','msft'], 25 ),
    'date':np.tile( pd.date_range('1/1/2011', periods=25, freq='D'), 4 ),
    'price':(np.random.randn(100).cumsum() + 10) })

If this was your actual data, you may just want to include the head and/or tail of the dataframe as follows (be sure to anonymize any sensitive data):

>>> stocks.head(5).to_dict()
{'date': {0: Timestamp('2011-01-01 00:00:00'),
  1: Timestamp('2011-01-01 00:00:00'),
  2: Timestamp('2011-01-01 00:00:00'),
  3: Timestamp('2011-01-01 00:00:00'),
  4: Timestamp('2011-01-02 00:00:00')},
 'price': {0: 10.284260107718254,
  1: 11.930300761831457,
  2: 10.93741046217319,
  3: 10.884574289565609,
  4: 11.78005850418319},
 'ticker': {0: 'aapl', 1: 'aapl', 2: 'aapl', 3: 'aapl', 4: 'aapl'}}

>>> pd.concat([stocks.head(), stocks.tail()], ignore_index=True).to_dict()
{'date': {0: Timestamp('2011-01-01 00:00:00'),
  1: Timestamp('2011-01-01 00:00:00'),
  2: Timestamp('2011-01-01 00:00:00'),
  3: Timestamp('2011-01-01 00:00:00'),
  4: Timestamp('2011-01-02 00:00:00'),
  5: Timestamp('2011-01-24 00:00:00'),
  6: Timestamp('2011-01-25 00:00:00'),
  7: Timestamp('2011-01-25 00:00:00'),
  8: Timestamp('2011-01-25 00:00:00'),
  9: Timestamp('2011-01-25 00:00:00')},
 'price': {0: 10.284260107718254,
  1: 11.930300761831457,
  2: 10.93741046217319,
  3: 10.884574289565609,
  4: 11.78005850418319,
  5: 10.017209045035006,
  6: 10.57090128181566,
  7: 11.442792747870204,
  8: 11.592953372130493,
  9: 12.864146419530938},
 'ticker': {0: 'aapl',
  1: 'aapl',
  2: 'aapl',
  3: 'aapl',
  4: 'aapl',
  5: 'msft',
  6: 'msft',
  7: 'msft',
  8: 'msft',
  9: 'msft'}}
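
For the reader of such a question, the printed dictionary can be pasted straight back into pd.DataFrame, provided the name Timestamp is importable. A minimal sketch follows; the two rows below are illustrative values and not a copy of the output above.

```python
import pandas as pd
from pandas import Timestamp  # the pasted repr refers to Timestamp by name

# Dictionary as copied from a question (illustrative values).
d = {
    "date": {0: Timestamp("2011-01-01 00:00:00"),
             1: Timestamp("2011-01-02 00:00:00")},
    "price": {0: 10.284260107718254, 1: 11.930300761831457},
    "ticker": {0: "aapl", 1: "aapl"},
}

# The outer keys become columns and the inner keys become the index.
df = pd.DataFrame(d)
print(df)
print(df.dtypes)
```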

You may also want to provide a description of the DataFrame (using only the relevant columns). This makes it easier for others to check the data types of each column and identify other common errors (e.g. dates as string vs. datetime64 vs. object):

stocks.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100 entries, 0 to 99
Data columns (total 3 columns):
date      100 non-null datetime64[ns]
price     100 non-null float64
ticker    100 non-null object
dtypes: datetime64[ns](1), float64(1), object(1)
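
The dates-as-strings problem mentioned above can be shown with a minimal sketch. The frame and its values are made up for illustration, and pd.api.types.is_datetime64_any_dtype is used only as a convenient way to test the column type.

```python
import pandas as pd

# A made-up frame where the dates were read in as plain strings.
df = pd.DataFrame({
    "date": ["2011-01-01", "2011-01-02", "2011-01-03"],
    "price": [10.0, 11.5, 9.8],
})

is_datetime_before = pd.api.types.is_datetime64_any_dtype(df["date"])

# Convert explicitly so date-based operations behave as expected.
df["date"] = pd.to_datetime(df["date"])
is_datetime_after = pd.api.types.is_datetime64_any_dtype(df["date"])

# The .dt accessor is available only on datetime-like columns.
df["weekday"] = df["date"].dt.day_name()

print(is_datetime_before, is_datetime_after)
print(df.dtypes)
```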

NOTE: If your DataFrame has a MultiIndex:

If your DataFrame has a MultiIndex, you must first reset the index before calling to_dict. You then need to recreate the index using set_index:

# MultiIndex example.  First create a MultiIndex DataFrame.
df = stocks.set_index(['date', 'ticker'])
>>> df
                       price
date       ticker           
2011-01-01 aapl    10.284260
           aapl    11.930301
           aapl    10.937410
           aapl    10.884574
2011-01-02 aapl    11.780059
...

# After resetting the index and passing the DataFrame to `to_dict`, make sure to use 
# `set_index` to restore the original MultiIndex.  This DataFrame can then be restored.

d = df.reset_index().to_dict()
df_new = pd.DataFrame(d).set_index(['date', 'ticker'])
>>> df_new.head()
                       price
date       ticker           
2011-01-01 aapl    10.284260
           aapl    11.930301
           aapl    10.937410
           aapl    10.884574
2011-01-02 aapl    11.780059

Answer 4

Here is my version of dput – the standard R tool to produce reproducible reports – for Pandas DataFrames. It will probably fail for more complex frames, but it seems to do the job in simple cases:

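import pandas as pd

def dput(x):
    """Return a string of Python code that rebuilds a Series or DataFrame."""
    if isinstance(x, pd.Series):
        # For numeric data, tolist() yields plain Python scalars, so the
        # resulting string can be passed to eval() without importing numpy.
        return "pd.Series(%s,dtype='%s',index=pd.%s)" % (x.tolist(), x.dtype, x.index)
    if isinstance(x, pd.DataFrame):
        return "pd.DataFrame({" + ", ".join([
            "'%s': %s" % (c, dput(x[c])) for c in x.columns]) + (
                "}, index=pd.%s)" % (x.index))
    raise NotImplementedError("dput", type(x), x)

Now,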

df = pd.DataFrame({'a':[1,2,3,4,2,1,3,1]})
assert df.equals(eval(dput(df)))
du = pd.get_dummies(df.a,"foo")
assert du.equals(eval(dput(du)))
di = df
di.index = list('abcdefgh')
assert di.equals(eval(dput(di)))

Note that this produces a much more verbose output than DataFrame.to_dict, e.g.,

pd.DataFrame({
  'foo_1':pd.Series([1, 0, 0, 0, 0, 1, 0, 1],dtype='uint8',index=pd.RangeIndex(start=0, stop=8, step=1)),
  'foo_2':pd.Series([0, 1, 0, 0, 1, 0, 0, 0],dtype='uint8',index=pd.RangeIndex(start=0, stop=8, step=1)),
  'foo_3':pd.Series([0, 0, 1, 0, 0, 0, 1, 0],dtype='uint8',index=pd.RangeIndex(start=0, stop=8, step=1)),
  'foo_4':pd.Series([0, 0, 0, 1, 0, 0, 0, 0],dtype='uint8',index=pd.RangeIndex(start=0, stop=8, step=1))},
  index=pd.RangeIndex(start=0, stop=8, step=1))

vs

{'foo_1': {0: 1, 1: 0, 2: 0, 3: 0, 4: 0, 5: 1, 6: 0, 7: 1}, 
 'foo_2': {0: 0, 1: 1, 2: 0, 3: 0, 4: 1, 5: 0, 6: 0, 7: 0}, 
 'foo_3': {0: 0, 1: 0, 2: 1, 3: 0, 4: 0, 5: 0, 6: 1, 7: 0}, 
 'foo_4': {0: 0, 1: 0, 2: 0, 3: 1, 4: 0, 5: 0, 6: 0, 7: 0}}

for du above, but it preserves column types. E.g., in the above test case,

du.equals(pd.DataFrame(du.to_dict()))
==> False

because du.dtypes is uint8 and pd.DataFrame(du.to_dict()).dtypes is int64.
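
If the verbosity of dput is a concern, one alternative is to share the to_dict output together with the column dtypes and reapply them with astype. This is a sketch and not part of the answer above; it handles only a default integer index, and the exact dtype that get_dummies returns depends on the pandas version.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4, 2, 1, 3, 1]})
du = pd.get_dummies(df["a"], prefix="foo")

# Two small, pasteable pieces: the values and the dtype names.
data = du.to_dict(orient="list")
dtypes = du.dtypes.astype(str).to_dict()

# Rebuild the frame, then restore the original column types.
rebuilt = pd.DataFrame(data).astype(dtypes)

print(rebuilt.equals(du))
```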