How do I get the row count of a pandas DataFrame?

Question: How do I get the row count of a pandas DataFrame?

I’m trying to get the number of rows of dataframe df with Pandas, and here is my code.

Method 1:

total_rows = df.count
print total_rows +1

Method 2:

total_rows = df['First_column_label'].count
print total_rows +1

Both the code snippets give me this error:

TypeError: unsupported operand type(s) for +: 'instancemethod' and 'int'

What am I doing wrong?


Answer 0

You can use the .shape property or just len(DataFrame.index). However, there are notable performance differences (len(DataFrame.index) is fastest):

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: df = pd.DataFrame(np.arange(12).reshape(4,3))

In [4]: df
Out[4]: 
   0   1   2
0  0   1   2
1  3   4   5
2  6   7   8
3  9  10  11

In [5]: df.shape
Out[5]: (4, 3)

In [6]: timeit df.shape
2.77 µs ± 644 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [7]: timeit df[0].count()
348 µs ± 1.31 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [8]: len(df.index)
Out[8]: 4

In [9]: timeit len(df.index)
990 ns ± 4.97 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

EDIT: As @Dan Allen noted in the comments, len(df.index) and df[0].count() are not interchangeable, as count excludes NaNs.
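
To make that caveat concrete, here is a small sketch showing how the two results diverge once a column contains a missing value. The frame df_nan is a made-up example, not the one timed above:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in column 0
df_nan = pd.DataFrame({0: [1.0, np.nan, 3.0], 1: [4.0, 5.0, 6.0]})

n_rows = len(df_nan.index)      # counts every row, NaN or not
n_non_null = df_nan[0].count()  # counts only the non-NaN entries of column 0

print(n_rows)      # 3
print(n_non_null)  # 2
```

So count() is only a row count when the chosen column happens to have no missing values.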


Answer 1

Suppose df is your dataframe; then:

count_row = df.shape[0]  # number of rows
count_col = df.shape[1]  # number of columns

Or, more succinctly,

r, c = df.shape
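
As a quick check of the unpacking form, here is a minimal sketch that assumes the same 4x3 frame used in the first answer:

```python
import numpy as np
import pandas as pd

# Assumed example frame: 4 rows, 3 columns
df = pd.DataFrame(np.arange(12).reshape(4, 3))

r, c = df.shape  # df.shape is a (rows, columns) tuple, so it unpacks directly
print(r, c)  # 4 3
```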

Answer 2

Use len(df). This works as of pandas 0.11 or maybe even earlier.

__len__() is currently (0.12) documented with "Returns length of index". Timing info, set up the same way as in root's answer:

In [7]: timeit len(df.index)
1000000 loops, best of 3: 248 ns per loop

In [8]: timeit len(df)
1000000 loops, best of 3: 573 ns per loop

Due to one additional function call it is a bit slower than calling len(df.index) directly, but this should not play any role in most use cases.
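
If you want to repeat the comparison outside IPython, something like the following sketch should work with the standard-library timeit module. The frame and the number of repetitions are arbitrary choices, the lambdas add a small constant overhead to every candidate, and the absolute numbers will vary with your machine and pandas version:

```python
import timeit

import numpy as np
import pandas as pd

# Assumed example frame, same shape as in root's answer
df = pd.DataFrame(np.arange(12).reshape(4, 3))

candidates = {
    "len(df.index)": lambda: len(df.index),
    "len(df)": lambda: len(df),
    "df.shape[0]": lambda: df.shape[0],
}

timings = {}
for label, func in candidates.items():
    # Total seconds for 100,000 calls of each candidate
    timings[label] = timeit.timeit(func, number=100_000)
    print(f"{label:>14}: {timings[label]:.4f} s per 100,000 calls")
```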


Answer 3

How do I get the row count of a pandas DataFrame?

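This table summarises the different situations in which you’d want to count something in a DataFrame (or Series, for completeness), along with the recommended method(s).

Situation                     | DataFrame                           | Series
Row count                     | len(df), df.shape[0], len(df.index) | len(s), s.size, len(s.index)
Column count                  | df.shape[1], len(df.columns)        | n/a
Non-null row count            | DataFrame.count [1]                 | Series.count
Row count per group           | DataFrameGroupBy.size [2]           | SeriesGroupBy.size
Non-null row count per group  | DataFrameGroupBy.count [3]          | SeriesGroupBy.count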

Footnotes

  1. DataFrame.count returns counts for each column as a Series since the non-null count varies by column.
  2. DataFrameGroupBy.size returns a Series, since all columns in the same group share the same row-count.
  3. DataFrameGroupBy.count returns a DataFrame, since the non-null count could differ across columns in the same group. To get the group-wise non-null count for a specific column, use df.groupby(...)['x'].count() where “x” is the column to count.

Minimal Code Examples

Below, I show examples of each of the methods described in the table above. First, the setup:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': list('aabbc'), 'B': ['x', 'x', np.nan, 'x', np.nan]})
s = df['B'].copy()

df

   A    B
0  a    x
1  a    x
2  b  NaN
3  b    x
4  c  NaN

s

0      x
1      x
2    NaN
3      x
4    NaN
Name: B, dtype: object

Row Count of a DataFrame: len(df), df.shape[0], or len(df.index)

len(df)
# 5

df.shape[0]
# 5

len(df.index)
# 5

It seems silly to compare the performance of constant time operations, especially when the difference is on the level of “seriously, don’t worry about it”. But this seems to be a trend with other answers, so I’m doing the same for completeness.

Of the 3 methods above, len(df.index) (as mentioned in other answers) is the fastest.

Note

  • All the methods above are constant time operations as they are simple attribute lookups.
  • df.shape (similar to ndarray.shape) is an attribute that returns a tuple of (# Rows, # Cols). For example, df.shape returns (5, 2) for the example here.

Column Count of a DataFrame: df.shape[1], len(df.columns)

df.shape[1]
# 2

len(df.columns)
# 2

Analogous to len(df.index), len(df.columns) is the faster of the two methods (but takes more characters to type).

Row Count of a Series: len(s), s.size, len(s.index)

len(s)
# 5

s.size
# 5

len(s.index)
# 5

s.size and len(s.index) are about the same in terms of speed. But I recommend len(s).

Note
size is an attribute, and it returns the number of elements (=count of rows for any Series). DataFrames also define a size attribute which returns the same result as df.shape[0] * df.shape[1].
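
A short sketch of the size attribute on both object types, rebuilding the same example data used in this answer:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': list('aabbc'), 'B': ['x', 'x', np.nan, 'x', np.nan]})
s = df['B'].copy()

print(s.size)   # 5: number of elements in the Series, NaN included
print(df.size)  # 10: 5 rows * 2 columns
```

Because DataFrame.size counts cells rather than rows, it is not a substitute for len(df).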

Non-Null Row Count: DataFrame.count and Series.count

The methods described here only count non-null values (meaning NaNs are ignored).

Calling DataFrame.count will return non-NaN counts for each column:

df.count()

A    5
B    3
dtype: int64

For Series, use Series.count to similar effect:

s.count()
# 3

Group-wise Row Count: GroupBy.size

For DataFrames, use DataFrameGroupBy.size to count the number of rows per group.

df.groupby('A').size()

A
a    2
b    2
c    1
dtype: int64

Similarly, for Series, you’ll use SeriesGroupBy.size.

s.groupby(df.A).size()

A
a    2
b    2
c    1
Name: B, dtype: int64

In both cases, a Series is returned. This makes sense for DataFrames as well, since all columns within a group share the same row count.

Group-wise Non-Null Row Count: GroupBy.count

Similar to above, but use GroupBy.count, not GroupBy.size. Note that size always returns a Series, while count returns a Series if called on a specific column, or else a DataFrame.

The following calls return the same counts (only the first result carries the name B):

df.groupby('A')['B'].size()
df.groupby('A').size()

A
a    2
b    2
c    1
Name: B, dtype: int64

Meanwhile, for count, we have

df.groupby('A').count()

   B
A   
a  2
b  1
c  0

…called on the entire GroupBy object, versus

df.groupby('A')['B'].count()

A
a    2
b    1
c    0
Name: B, dtype: int64

…called on a specific column.
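
To see the difference between size and count at a glance, the two group-wise results can be placed side by side. This is only an illustrative sketch; the column names rows and non_null are made up for the example:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': list('aabbc'), 'B': ['x', 'x', np.nan, 'x', np.nan]})

grouped = df.groupby('A')['B']
summary = pd.DataFrame({
    'rows': grouped.size(),       # every row in the group
    'non_null': grouped.count(),  # only rows where B is not NaN
})
print(summary)
```

Group c has one row but a non-null count of zero, which is exactly the case where the two methods disagree.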


Answer 4

TL;DR

Use len(df)

len() is your friend: len(df) gives the row count.

Alternatively, df.index holds all the row labels and df.columns holds all the column labels. Just as len(anyList) gives the length of a list, len(df.index) gives the number of rows and len(df.columns) gives the number of columns.

Or you can use df.shape, which returns the number of rows and columns together. If you only want the number of rows, use df.shape[0]; for the number of columns only, use df.shape[1].


Answer 5

Apart from the above answers, you can use df.axes to get a list holding the row index and the column index, and then use the len() function:

total_rows = len(df.axes[0])
total_cols = len(df.axes[1])
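
A runnable version of the same idea, assuming the 4x3 example frame from the first answer:

```python
import numpy as np
import pandas as pd

# Assumed example frame: 4 rows, 3 columns
df = pd.DataFrame(np.arange(12).reshape(4, 3))

row_index, col_index = df.axes  # [row labels, column labels]
total_rows = len(row_index)
total_cols = len(col_index)
print(total_rows, total_cols)  # 4 3
```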

Answer 6

…building on Jan-Philip Gehrcke’s answer.

The reason len(df) or len(df.index) is faster than df.shape[0] is visible in the code: df.shape is a @property that calls len twice, once on the index and once on the columns.

df.shape??
Type:        property
String form: <property object at 0x1127b33c0>
Source:     
# df.shape.fget
@property
def shape(self):
    """
    Return a tuple representing the dimensionality of the DataFrame.
    """
    return len(self.index), len(self.columns)

And beneath the hood of len(df):

df.__len__??
Signature: df.__len__()
Source:   
    def __len__(self):
        """Returns length of info axis, but here we use the index """
        return len(self.index)
File:      ~/miniconda2/lib/python2.7/site-packages/pandas/core/frame.py
Type:      instancemethod

len(df.index) will be slightly faster than len(df) since it has one less function call, and both are faster than df.shape[0].


Answer 7

I came to pandas from an R background, and I find pandas more complicated when it comes to selecting rows or columns. I had to wrestle with it for a while before I found some ways to deal with it:

Getting the number of columns:

len(df.columns)
# Here:
# df is your DataFrame
# df.columns returns an Index holding the column labels of df
# len() then gives the number of labels, i.e. the number of columns

Getting the number of rows:

len(df.index)  # Same idea, applied to the row labels

Answer 8

In case you want to get the row count in the middle of a chained operation, you can use:

df.pipe(len)

Example:

row_count = (
      pd.DataFrame(np.random.rand(3,4))
      .reset_index()
      .pipe(len)
)

This can be useful if you don’t want to put a long statement inside a len() function.

You could use __len__() instead but __len__() looks a bit weird.
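
Here is a slightly longer chain as a sketch of where this pays off. The data and the filter condition are invented for illustration:

```python
import pandas as pd

# Hypothetical data for the example
df = pd.DataFrame({'A': list('aabbc'), 'B': [1, 2, 3, 4, 5]})

n_matching = (
    df
    .loc[lambda d: d['A'] == 'b']  # keep only rows where A is 'b'
    .drop_duplicates(subset='B')   # drop repeated B values, if any
    .pipe(len)                     # same as calling len() on the chain's result
)
print(n_matching)  # 2
```

Without pipe, the whole chain would have to be wrapped in len(...), which reads inside-out.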


Answer 9

You can also do this:

Let's say df is your dataframe. Then df.shape gives you the shape of your dataframe, i.e. (row, col).

Thus, use the assignment below to get the values you need:

row, col = df.shape[0], df.shape[1]

Answer 10

For a dataframe df, a helper that prints the row count with comma separators, handy while exploring data:

def nrow(df):
    print("{:,}".format(df.shape[0]))

Example:

nrow(my_df)
12,456,789
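
If you would rather get the formatted string back than print it, a variant along these lines might be useful. The names format_nrow and my_df are hypothetical and the frame is built only for the demonstration:

```python
import pandas as pd


def format_nrow(df):
    """Return the row count of df as a comma-separated string."""
    return f"{len(df):,}"


# Hypothetical frame with 1,234,567 rows
my_df = pd.DataFrame({'x': range(1234567)})
print(format_nrow(my_df))  # 1,234,567
```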

Answer 11

An alternative way to find the number of rows in a dataframe, and the one I think is the most readable, is pandas.Index.size, i.e. df.index.size.

Do note what I commented on the accepted answer:

I suspected pandas.Index.size would actually be faster than len(df.index), but timeit on my computer tells me otherwise (~150 ns slower per loop).
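
A minimal usage sketch, assuming the 4x3 example frame from the accepted answer:

```python
import numpy as np
import pandas as pd

# Assumed example frame: 4 rows, 3 columns
df = pd.DataFrame(np.arange(12).reshape(4, 3))

n_rows = df.index.size  # Index.size is the number of labels in the row index
print(n_rows)  # 4
```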


Answer 12

I'm not sure if this would work (data COULD be omitted), but this may work:

df.tail(1)  # df is your DataFrame

You could then find the number of rows by running the snippet and looking at the row label that is shown. With the default integer index, which starts at 0, the row count is that label plus one.
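
The following sketch, built on made-up data, shows both the case where reading the last row label works and a case where it gives the wrong answer because the index is no longer 0, 1, 2, ...:

```python
import pandas as pd

# Hypothetical frame with the default integer index (0, 1, 2)
df = pd.DataFrame({'x': [10, 20, 30]})
last_label = df.tail(1).index[0]
print(last_label + 1)  # 3, which matches len(df)

# After filtering, the surviving rows keep their old labels (1 and 2)
filtered = df[df['x'] > 10]
print(filtered.tail(1).index[0] + 1)  # 3, although len(filtered) is 2
```

For anything but a quick visual check on an untouched frame, len(df) is the safer choice.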


Answer 13

Either of these works (df is the name of the DataFrame):

Method 1: Using the len function:

len(df) will give the number of rows in a DataFrame named df.

Method 2: Using the count method:

df[col].count() will count the non-null values in a given column col, which equals the number of rows only if the column has no NaNs.

df.count() will give the non-null count for every column.