How do I convert a pandas Series or Index to a NumPy array?

Question: How do I convert a pandas Series or Index to a NumPy array?

Do you know how to get the index or a column of a DataFrame as a NumPy array or Python list?


Answer 0

To get a NumPy array, you should use the values attribute:

In [1]: df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['a', 'b', 'c']); df
   A  B
a  1  4
b  2  5
c  3  6

In [2]: df.index.values
Out[2]: array(['a', 'b', 'c'], dtype=object)

This accesses the data as it is already stored, so no conversion is needed.
Note: This attribute is also available on many other pandas objects.

In [3]: df['A'].values
Out[3]: array([1, 2, 3])

To get the index as a list, call tolist:

In [4]: df.index.tolist()
Out[4]: ['a', 'b', 'c']

And similarly, for columns.
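
As a minimal sketch of the "similarly, for columns" remark, reusing the same example frame as above (the variable names col_array, col_list and a_list are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['a', 'b', 'c'])

# Column labels as an array and as a plain Python list
col_array = df.columns.values
col_list = df.columns.tolist()

# The values of a single column as a plain Python list
a_list = df['A'].tolist()

print(col_list)  # ['A', 'B']
print(a_list)    # [1, 2, 3]
```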


Answer 1

You can use df.index to access the index object and then get the values in a list using df.index.tolist(). Similarly, you can use df['col'].tolist() for Series.
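
One detail that may matter in practice: tolist() gives back native Python scalars, whereas wrapping the NumPy array in list() keeps NumPy scalar types. A small sketch, assuming a plain integer Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3])

as_list = s.tolist()               # elements are Python ints
as_array_list = list(s.to_numpy()) # elements are NumPy integer scalars

print(type(as_list[0]))  # <class 'int'>
print(isinstance(as_array_list[0], np.integer))  # True
```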


Answer 2

pandas >= 0.24

Deprecate your usage of .values in favour of these methods!

From v0.24.0 onwards, we have two brand new, preferred methods for obtaining arrays from Index, Series, and DataFrame objects: to_numpy() and .array. Regarding usage, the docs mention:

We haven’t removed or deprecated Series.values or DataFrame.values, but we highly recommend using .array or .to_numpy() instead.

See this section of the v0.24.0 release notes for more information.


to_numpy() Method

df.index.to_numpy()
# array(['a', 'b'], dtype=object)

df['A'].to_numpy()
#  array([1, 4])

By default, a view is returned where possible, so any modifications made will affect the original (df here is the two-row frame from the Setup block further down).

v = df.index.to_numpy()
v[0] = -1

df
    A  B
-1  1  2
b   4  5

If you need a copy instead, use to_numpy(copy=True):

v = df.index.to_numpy(copy=True)
v[-1] = -123

df
   A  B
a  1  2
b  4  5
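
A quick way to convince yourself that copy=True really detaches the result is np.shares_memory. This is only a sketch on a small integer Series; whether the default (non-copy) result is writable may depend on your pandas version, so the example only writes to the copy:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3])

default = s.to_numpy()          # may be a view of the Series' data
copied = s.to_numpy(copy=True)  # always an independent array

copied[0] = 99                  # does not touch the Series

print(s.tolist())                         # [1, 2, 3]
print(np.shares_memory(copied, default))  # False
```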

Note that this function also works for DataFrames (while .array does not).
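
For a DataFrame, to_numpy() returns a single 2-D array, so columns of different dtypes are upcast to a common one. A small sketch with one integer and one float column (the frame is made up for illustration):

```python
import pandas as pd

df_mixed = pd.DataFrame({'A': [1, 2], 'B': [0.5, 1.5]})

arr = df_mixed.to_numpy()

print(arr.shape)  # (2, 2)
print(arr.dtype)  # float64
```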


array Attribute
This attribute returns an ExtensionArray object that backs the Index/Series.

pd.__version__
# '0.24.0rc1'

# Setup.
df = pd.DataFrame([[1, 2], [4, 5]], columns=['A', 'B'], index=['a', 'b'])
df

   A  B
a  1  2
b  4  5

df.index.array
# <PandasArray>
# ['a', 'b']
# Length: 2, dtype: object

df['A'].array
# <PandasArray>
# [1, 4]
# Length: 2, dtype: int64

From here, it is possible to get a list using list:

list(df.index.array)
# ['a', 'b']

list(df['A'].array)
# [1, 4]

or, just directly call .tolist():

df.index.tolist()
# ['a', 'b']

df['A'].tolist()
# [1, 4]

Regarding what is returned, the docs say:

For Series and Indexes backed by normal NumPy arrays, Series.array will return a new arrays.PandasArray, which is a thin (no-copy) wrapper around a numpy.ndarray. arrays.PandasArray isn’t especially useful on its own, but it does provide the same interface as any extension array defined in pandas or by a third-party library.

So, to summarise, .array will return either

  1. The existing ExtensionArray backing the Index/Series, or
  2. If a NumPy array backs the Index/Series, a new ExtensionArray object created as a thin wrapper over that underlying array.
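
The two cases can be sketched as follows. The nullable 'Int64' Series is my own choice of an extension-backed example; the wrapper class for NumPy-backed data is shown as PandasArray above, but its name may differ in other pandas versions, so the sketch only checks that it is an ExtensionArray:

```python
import pandas as pd
from pandas.api.extensions import ExtensionArray

numpy_backed = pd.Series([1, 4])                 # backed by a NumPy array
nullable = pd.Series([1, None], dtype='Int64')   # backed by an extension array

# Case 2: a thin ExtensionArray wrapper around the NumPy data
print(isinstance(numpy_backed.array, ExtensionArray))  # True

# Case 1: the existing extension array is returned as-is
print(type(nullable.array).__name__)  # IntegerArray
```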

Rationale for adding TWO new methods
These functions were added as a result of discussions under two GitHub issues GH19954 and GH23623.

Specifically, the docs mention the rationale:

[…] with .values it was unclear whether the returned value would be the actual array, some transformation of it, or one of pandas custom arrays (like Categorical). For example, with PeriodIndex, .values generates a new ndarray of period objects each time. […]
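
To make that ambiguity concrete, here is a small sketch using a categorical Series (my own example, not the PeriodIndex case from the quote): .values hands back different kinds of objects depending on the dtype, while to_numpy() always gives an ndarray and .array always gives the backing ExtensionArray.

```python
import numpy as np
import pandas as pd

ints = pd.Series([1, 2, 3])
cats = pd.Series(['x', 'y', 'x'], dtype='category')

print(type(ints.values).__name__)      # ndarray
print(type(cats.values).__name__)      # Categorical
print(type(cats.to_numpy()).__name__)  # ndarray
print(type(cats.array).__name__)       # Categorical
```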

These two functions aim to improve the consistency of the API, which is a major step in the right direction.

Lastly, .values will not be deprecated in the current version, but I expect this may happen at some point in the future, so I would urge users to migrate towards the newer API as soon as they can.


Answer 3

If you are dealing with a multi-index DataFrame, you may only be interested in the values of one named level of the MultiIndex. You can get them with

df.index.get_level_values('name_sub_index')

where name_sub_index must, of course, be an element of the FrozenList df.index.names.
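
A minimal sketch of this on a made-up two-level index (the level names 'letter' and 'number' and the data are hypothetical); the result is an Index, so to_numpy() or tolist() can be chained onto it:

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('x', 1), ('x', 2), ('y', 1)], names=['letter', 'number'])
df = pd.DataFrame({'value': [10, 20, 30]}, index=index)

# Pull out one level by name and convert it to a NumPy array
numbers = df.index.get_level_values('number').to_numpy()

print(list(df.index.names))  # ['letter', 'number']
print(numbers.tolist())      # [1, 2, 1]
```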


Answer 4

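Since pandas v0.13 you can also use get_values (note that it was deprecated in v0.25 and removed in v1.0, so on newer versions use to_numpy() instead):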

df.index.get_values()

Answer 5

I converted a column of the pandas DataFrame to a list and then used the basic list.index(). Something like this:

dd = list(zone[0])  # where zone[0] is some specific column of the table
idx = dd.index(filename[i])

You now have the position of the value as idx.
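
Here is a self-contained sketch of that lookup. The frame zone and the value target are stand-ins I made up for the answer's zone and filename[i]; the NumPy variant next to it is just one alternative, not something the answer proposes:

```python
import numpy as np
import pandas as pd

zone = pd.DataFrame({0: ['a.txt', 'b.txt', 'c.txt']})  # hypothetical data
target = 'b.txt'                                       # hypothetical value

# The list-based approach from the answer
idx_list = list(zone[0]).index(target)

# An array-based alternative: position of the first match
idx_numpy = int(np.flatnonzero(zone[0].to_numpy() == target)[0])

print(idx_list, idx_numpy)  # 1 1
```

Both variants raise if the value is missing (ValueError for list.index, IndexError for the [0] on an empty match array), so guard accordingly.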


Answer 6

A more recent way to do this is to use the .to_numpy() method.

If I have a dataframe with a column ‘price’, I can convert it as follows:

priceArray = df['price'].to_numpy()

You can also pass the data type, such as float or object, via the dtype argument of the method.
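
For instance, a small sketch with a made-up integer 'price' column, converted once to float and once to object:

```python
import pandas as pd

df = pd.DataFrame({'price': [10, 20, 30]})  # hypothetical data

as_float = df['price'].to_numpy(dtype=float)
as_object = df['price'].to_numpy(dtype=object)

print(as_float.dtype)   # float64
print(as_object.dtype)  # object
```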


Answer 7

Below is a simple way to convert a DataFrame column into a NumPy array.

df = pd.DataFrame(somedict)
ytrain = df['label']  # ytrain is a Series, so iterate over it directly
ytrain_numpy = np.array([x for x in ytrain])

ytrain_numpy is a NumPy array.

I tried to_numpy(), but it gave me the error below while doing Binary Relevance classification with Linear SVC: TypeError: no supported conversion for types: (dtype('O'),). to_numpy() did convert the column into a NumPy array, but each inner element was still a list, which is what caused the error.
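
The difference can be reproduced on a small made-up 'label' column whose cells are lists: to_numpy() yields a 1-D object array of lists, while building the array from the list of lists yields a regular 2-D numeric array. This is a sketch that assumes all inner lists have the same length:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'label': [[1, 0], [0, 1], [1, 1]]})  # hypothetical data

as_object = df['label'].to_numpy()       # 1-D array whose elements are lists
as_2d = np.array(df['label'].tolist())   # 2-D numeric array

print(as_object.dtype, as_object.shape)  # object (3,)
print(as_2d.shape)                       # (3, 2)
```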