Question: What is the most efficient way of counting occurrences in pandas?


I have a large (about 12M rows) dataframe df with say:

df.columns = ['word','documents','frequency']

So the following ran in a timely fashion:

word_grouping = df[['word','frequency']].groupby('word')
MaxFrequency_perWord = word_grouping[['frequency']].max().reset_index()
MaxFrequency_perWord.columns = ['word','MaxFrequency']

However, this is taking an unexpectedly long time to run:

Occurrences_of_Words = word_grouping[['word']].count().reset_index()

What am I doing wrong here? Is there a better way to count occurrences in a large dataframe?

df.word.describe()

ran pretty well, so I really did not expect this Occurrences_of_Words dataframe to take very long to build.

PS: If the answer is obvious and you feel the need to penalize me for asking this question, please include the answer as well. Thank you.


Answer 0


I think df['word'].value_counts() should serve. By skipping the groupby machinery, you’ll save some time. I’m not sure why count should be much slower than max. Both take some time to avoid missing values. (Compare with size.)

In any case, value_counts has been specifically optimized to handle object type, like your words, so I doubt you’ll do much better than that.
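As a small sketch of the comparison above (using a tiny made-up frame in place of the asker's 12M-row one), the `value_counts` shortcut yields the same per-word counts as the groupby route from the question:

```python
import pandas as pd

# Hypothetical stand-in for the asker's 12M-row frame.
df = pd.DataFrame({
    'word': ['cat', 'dog', 'cat', 'bird', 'dog', 'cat'],
    'documents': [1, 1, 2, 2, 3, 3],
    'frequency': [5, 2, 3, 1, 4, 2],
})

# The groupby route from the question (counting a non-key column).
word_grouping = df[['word', 'frequency']].groupby('word')
via_groupby = word_grouping['frequency'].count()

# The suggested shortcut: one pass over the column, no groupby machinery.
via_value_counts = df['word'].value_counts()

print(via_value_counts.sort_index())
```

Ordering aside (`value_counts` sorts by count descending, the groupby result by key), the two results agree.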


Answer 1


When you want to count the frequency of categorical data in a column of a pandas DataFrame, use: df['Column_Name'].value_counts()

Source.
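For instance (with a made-up column), `value_counts` also accepts `normalize=True` to report relative frequencies instead of raw counts:

```python
import pandas as pd

# Hypothetical categorical column.
s = pd.Series(['red', 'blue', 'red', 'green', 'red', 'blue'])

print(s.value_counts())                # raw counts, most frequent first
print(s.value_counts(normalize=True)) # relative frequencies summing to 1
```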


Answer 2


Just an addition to the previous answers. Let's not forget that when dealing with real data there may be null values, so it's useful to include those in the count as well by passing dropna=False (the default is True).

An example:

>>> df['Embarked'].value_counts(dropna=False)
S      644
C      168
Q       77
NaN      2
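To get the result back into the same shape as the question's Occurrences_of_Words frame, `reset_index()` works; the default column names it produces vary across pandas versions, so the sketch below (with illustrative data and column names) sets them explicitly:

```python
import pandas as pd

# Hypothetical miniature of the asker's 'word' column.
df = pd.DataFrame({'word': ['cat', 'dog', 'cat', 'bird', 'cat']})

counts = df['word'].value_counts(dropna=False)

# reset_index() column names differ between pandas versions,
# so assign them explicitly.
Occurrences_of_Words = counts.reset_index()
Occurrences_of_Words.columns = ['word', 'occurrences']
print(Occurrences_of_Words)
```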
