Question: Get a list from pandas DataFrame column headers

I want to get a list of the column headers from a pandas DataFrame. The DataFrame will come from user input so I won’t know how many columns there will be or what they will be called.

For example, if I’m given a DataFrame like this:

>>> my_dataframe
    y  gdp  cap
0   1    2    5
1   2    3    9
2   8    7    2
3   3    4    7
4   6    7    7
5   4    8    3
6   8    2    8
7   9    9   10
8   6    6    4
9  10   10    7

I would want to get a list like this:

>>> header_list
['y', 'gdp', 'cap']

Answer 0

You can get the values as a list by doing:

list(my_dataframe.columns.values)

You can also simply use (as shown in Ed Chum’s answer):

list(my_dataframe)
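
As a minimal sketch of both calls side by side (the small DataFrame here is made up to mirror the question's example and is not part of the original answer):

```python
import pandas as pd

# Hypothetical data mirroring the first rows of the question's example.
my_dataframe = pd.DataFrame(
    [[1, 2, 5], [2, 3, 9], [8, 7, 2]],
    columns=["y", "gdp", "cap"],
)

# Convert the underlying array of column labels to a list.
header_list = list(my_dataframe.columns.values)
print(header_list)

# Iterating over a DataFrame yields its column labels, so this is equivalent.
short_form = list(my_dataframe)
print(short_form)
```

Both calls should give the same labels in the original column order.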

Answer 1

There is a built-in method, which is the most performant:

my_dataframe.columns.values.tolist()

.columns returns an Index, .columns.values returns a NumPy array, and the array has a helper method .tolist() that returns a list.
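
To make that chain of types concrete, here is a small sketch that inspects each step (the DataFrame is a made-up example, not the one from the timings):

```python
import numpy as np
import pandas as pd

# Hypothetical one-row frame with the question's column names.
df = pd.DataFrame([[1, 2, 5]], columns=["y", "gdp", "cap"])

columns = df.columns          # a pandas Index
values = df.columns.values    # the NumPy array backing that Index
names = values.tolist()       # a plain Python list

print(type(columns).__name__)
print(type(values).__name__)
print(type(names).__name__)
```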

If performance is not as important to you, Index objects define a .tolist() method that you can call directly:

my_dataframe.columns.tolist()

The difference in performance is obvious:

%timeit df.columns.tolist()
16.7 µs ± 317 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit df.columns.values.tolist()
1.24 µs ± 12.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

For those who hate typing, you can just call list on df, like so:

list(df)

Answer 2

Did some quick tests, and perhaps unsurprisingly the built-in version using dataframe.columns.values.tolist() is the fastest:

In [1]: %timeit [column for column in df]
1000 loops, best of 3: 81.6 µs per loop

In [2]: %timeit df.columns.values.tolist()
10000 loops, best of 3: 16.1 µs per loop

In [3]: %timeit list(df)
10000 loops, best of 3: 44.9 µs per loop

In [4]: %timeit list(df.columns.values)
10000 loops, best of 3: 38.4 µs per loop

(I still really like the list(dataframe) though, so thanks EdChum!)
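
If you want to repeat this kind of comparison outside IPython, a rough sketch with the standard-library timeit module could look like the following. The DataFrame and the number of repetitions are arbitrary assumptions, and the absolute numbers will vary with the machine and the pandas version:

```python
import timeit

import pandas as pd

# Hypothetical small frame; the timings above were taken on the author's own data.
df = pd.DataFrame([[1, 2, 5]], columns=["y", "gdp", "cap"])

candidates = {
    "df.columns.values.tolist()": lambda: df.columns.values.tolist(),
    "df.columns.tolist()": lambda: df.columns.tolist(),
    "list(df)": lambda: list(df),
    "list(df.columns.values)": lambda: list(df.columns.values),
    "[c for c in df]": lambda: [c for c in df],
}

repeats = 10_000
results = {}
for label, func in candidates.items():
    seconds = timeit.timeit(func, number=repeats)
    results[label] = func()  # keep one result per variant to compare them
    print(f"{label:<30} {seconds / repeats * 1e6:8.2f} us per call")
```

Whatever the timings turn out to be, every variant should return the same list of labels.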


Answer 3

It gets even simpler (as of pandas 0.16.0):

df.columns.tolist()

will give you the column names in a nice list.


Answer 4

>>> list(my_dataframe)
['y', 'gdp', 'cap']

To list the columns of a dataframe while in debugger mode, use a list comprehension:

>>> [c for c in my_dataframe]
['y', 'gdp', 'cap']

By the way, you can get a sorted list simply by using sorted:

>>> sorted(my_dataframe)
['cap', 'gdp', 'y']

Answer 5

Surprised I haven’t seen this posted so far, so I’ll just leave this here.

Extended Iterable Unpacking (Python 3.5+): [*df] and Friends

Unpacking generalizations (PEP 448) were introduced with Python 3.5. So, the following operations are all possible.

df = pd.DataFrame('x', columns=['A', 'B', 'C'], index=range(5))
df

   A  B  C
0  x  x  x
1  x  x  x
2  x  x  x
3  x  x  x
4  x  x  x 

If you want a list….

[*df]
# ['A', 'B', 'C']

Or, if you want a set,

{*df}
# {'A', 'B', 'C'}

Or, if you want a tuple,

*df,  # Please note the trailing comma
# ('A', 'B', 'C')

Or, if you want to store the result somewhere,

*cols, = df  # A wild comma appears, again
cols
# ['A', 'B', 'C']

… if you’re the kind of person who converts coffee to typing sounds, well, this is going to consume your coffee more efficiently ;)

P.S.: if performance is important, you will want to ditch the solutions above in favour of

df.columns.to_numpy().tolist()
# ['A', 'B', 'C']

This is similar to Ed Chum’s answer, but updated for v0.24 where .to_numpy() is preferred to the use of .values. See this answer (by me) for more information.

Visual Check
Since I’ve seen this discussed in other answers, you can utilise iterable unpacking (no need for explicit loops).

print(*df)
A B C

print(*df, sep='\n')
A
B
C

Critique of Other Methods

Don’t use an explicit for loop for an operation that can be done in a single line (list comprehensions are okay).

Next, using sorted(df) does not preserve the original order of the columns. For that, you should use list(df) instead.

Next, list(df.columns) and list(df.columns.values) are poor suggestions (as of the current version, v0.24). Both Index (returned by df.columns) and NumPy arrays (returned by df.columns.values) define a .tolist() method, which is faster and more idiomatic.

Lastly, listification, i.e. list(df), should only be used as a concise alternative to the aforementioned methods on Python <= 3.4, where extended unpacking is not available.
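
One further difference between list(...) on the NumPy array and .tolist(), beyond speed, shows up when the column labels are numbers rather than strings. The sketch below assumes a made-up DataFrame with integer labels; it is an illustration added here, not part of the original critique:

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose column labels are integers.
df = pd.DataFrame([[1, 2]], columns=[10, 20])

# list() over the NumPy array keeps NumPy scalar types.
via_list = list(df.columns.values)

# .tolist() converts each label to a built-in Python type.
via_tolist = df.columns.to_numpy().tolist()

print([type(x).__name__ for x in via_list])
print([type(x).__name__ for x in via_tolist])
```

The two lists compare equal, but only the .tolist() result holds plain Python ints, which can matter when the labels are serialized (for example to JSON).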


Answer 6

That’s available as my_dataframe.columns.


Answer 7

Interestingly, df.columns.values.tolist() is almost 3 times faster than df.columns.tolist(), although I thought they would be the same:

In [97]: %timeit df.columns.values.tolist()
100000 loops, best of 3: 2.97 µs per loop

In [98]: %timeit df.columns.tolist()
10000 loops, best of 3: 9.67 µs per loop

Answer 8

A DataFrame follows the dict-like convention of iterating over the “keys” of the object.

my_dataframe.keys()

Create a list of keys/columns, either with the Index method to_list() or the Pythonic way:

my_dataframe.keys().to_list()
list(my_dataframe.keys())

Basic iteration on a DataFrame returns the column labels:

[column for column in my_dataframe]

Do not convert a DataFrame into a list, just to get the column labels. Do not stop thinking while looking for convenient code samples.

xlarge = pd.DataFrame(np.arange(100000000).reshape(10000,10000))
list(xlarge) #compute time and memory consumption depend on dataframe size - O(N)
list(xlarge.keys()) #constant time operation - O(1)
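
A small sketch of the dict-like behaviour described above, using a made-up DataFrame (the variable names are only for illustration):

```python
import pandas as pd

# Hypothetical frame with the question's column names.
my_dataframe = pd.DataFrame([[1, 2, 5]], columns=["y", "gdp", "cap"])

# keys() gives back the column Index, just like dict.keys() gives the keys.
keys = my_dataframe.keys()
print(keys.equals(my_dataframe.columns))

# Two ways to turn that Index into a plain list.
from_method = keys.to_list()
from_builtin = list(my_dataframe.keys())
print(from_method)
print(from_builtin)
```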

Answer 9

In the Notebook

For data exploration in the IPython notebook, my preferred way is this:

sorted(df)

This produces an easy-to-read, alphabetically ordered list.

In a code repository

In code, I find it more explicit to use

df.columns

because it tells others reading your code what you are doing.


Answer 10

%%timeit
final_df.columns.values.tolist()
948 ns ± 19.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%%timeit
list(final_df.columns)
14.2 µs ± 79.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
list(final_df.columns.values)
1.88 µs ± 11.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%%timeit
final_df.columns.tolist()
12.3 µs ± 27.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
list(final_df.head(1).columns)
163 µs ± 20.6 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Answer 11

As answered by Simeon Visser, you could do

list(my_dataframe.columns.values) 

or

list(my_dataframe) # for less typing.

But I think the sweet spot is:

list(my_dataframe.columns)

It is explicit and, at the same time, not unnecessarily long.


Answer 12

For a quick, neat, visual check, try this:

for col in df.columns:
    print(col)

Answer 13

This gives us the names of columns in a list:

list(my_dataframe.columns)

Another method, tolist(), can be used too:

my_dataframe.columns.tolist()

Answer 14

I feel the question deserves additional explanation.

As @fixxxer noted, the answer depends on the pandas version you are using in your project, which you can get from pd.__version__.

If, like me, you are for some reason using a pandas version older than 0.16.0 (on Debian jessie I use 0.14.1), then you need to use:

df.keys().tolist() because there is no df.columns method implemented yet.

The advantage of this keys method is that it works even in newer versions of pandas, so it is more universal.
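
A short sketch of checking the installed version and then using the keys-based call (the DataFrame is a made-up example; the printed version depends on your installation):

```python
import pandas as pd

# The installed pandas version is exposed as a string attribute.
print(pd.__version__)

# Hypothetical frame with the question's column names.
df = pd.DataFrame([[1, 2, 5]], columns=["y", "gdp", "cap"])

# keys() returns the column Index, and tolist() turns it into a list.
names = df.keys().tolist()
print(names)
```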


Answer 15

n = []
for i in my_dataframe.columns:
    n.append(i)
print(n)

Answer 16

Even though the solutions provided above are nice, I would also expect something like frame.column_names() to exist as a function in pandas. Since it does not, the following syntax may be a nice alternative. Calling the tolist() method somehow preserves the feeling that you are using pandas in a proper way:

frame.columns.tolist() 

Answer 17

If the DataFrame happens to have an Index or MultiIndex and you want those included as column names too:

names = list(filter(None, df.index.names + df.columns.values.tolist()))

It avoids calling reset_index(), which incurs an unnecessary performance hit for such a simple operation.

I’ve needed this fairly often because I’m shuttling data from databases where the DataFrame index maps to a primary/unique key, but is really just another “column” to me. It would probably make sense for pandas to have a built-in method for something like this (it is entirely possible I’ve missed it).
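
Here is a sketch of the same idea on a made-up DataFrame with a named MultiIndex. Note one deliberate change from the one-liner above: it filters with an explicit `is not None` test instead of filter(None, ...), on the assumption that falsy labels such as 0 or an empty string should be kept. The level names "country" and "year" are invented for the example:

```python
import pandas as pd

# Hypothetical frame with a two-level named index.
df = pd.DataFrame(
    [[1, 2, 5], [2, 3, 9]],
    columns=["y", "gdp", "cap"],
    index=pd.MultiIndex.from_tuples(
        [("a", 2000), ("b", 2001)], names=["country", "year"]
    ),
)

# Index level names first, then column labels; unnamed levels (None) are dropped.
names = [n for n in list(df.index.names) + df.columns.tolist() if n is not None]
print(names)

# With a default, unnamed index only the column labels remain.
unnamed = pd.DataFrame([[1, 2, 5]], columns=["y", "gdp", "cap"])
unnamed_names = [
    n for n in list(unnamed.index.names) + unnamed.columns.tolist() if n is not None
]
print(unnamed_names)
```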


Answer 18

This solution lists all the columns of your object my_dataframe:

print(list(my_dataframe))
