


Is there a preferred way to keep the data type of a numpy array fixed as int (or int64 or whatever), while still having an element inside listed as numpy.NaN?

In particular, I am converting an in-house data structure to a Pandas DataFrame. In our structure, we have integer-type columns that still contain NaNs (but the dtype of the column is int). Building a DataFrame from this seems to recast everything as float, but we'd really like the columns to stay int.


Things tried:

I tried using the from_records() function under pandas.DataFrame, with coerce_float=False and this did not help. I also tried using NumPy masked arrays, with NaN fill_value, which also did not work. All of these caused the column data type to become a float.

Answer 0


This capability has been added to pandas (beginning with version 0.24): https://pandas.pydata.org/pandas-docs/version/0.24/whatsnew/v0.24.0.html#optional-integer-na-support

At this point, it requires the use of extension dtype Int64 (capitalized), rather than the default dtype int64 (lowercase).
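A minimal sketch of that conversion, assuming pandas 0.24 or later; the column name `ids` and the sample values are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical data: an integer-valued column with one missing entry.
df = pd.DataFrame({"ids": [1.0, 2.0, np.nan]})
before = df["ids"].dtype  # float64: the default upcast caused by NaN

# Request the nullable extension dtype by its capitalized name.
df["ids"] = df["ids"].astype("Int64")
after = df["ids"].dtype  # Int64

print(before, after)
print(df["ids"].isna().tolist())
```

The lowercase `"int64"` would raise here, because a plain NumPy integer array has no way to hold the missing entry.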

Answer 1




NaN can’t be stored in an integer array. This is a known limitation of pandas at the moment; I have been waiting for progress to be made with NA values in NumPy (similar to NAs in R), but it will be at least 6 months to a year before NumPy gets these features, it seems:


(This feature has been added beginning with version 0.24 of pandas, but note it requires the use of extension dtype Int64 (capitalized), rather than the default dtype int64 (lower case): https://pandas.pydata.org/pandas-docs/version/0.24/whatsnew/v0.24.0.html#optional-integer-na-support )

Answer 2





If performance is not the main issue, you can store strings instead.

# rows dropped by dropna() stay NaN after index alignment on assignment
df['col'] = df['col'].dropna().apply(lambda x: str(int(x)))

Then you can mix them with NaN as much as you want. If you really want to have integers, depending on your application, you can use -1, or 0, or 1234567890, or some other dedicated value to represent NaN.

You can also temporarily duplicate the columns: one as you have it, with floats; the other one experimental, with ints or strings. Then insert asserts in every reasonable place, checking that the two are in sync. After enough testing you can let go of the floats.
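The duplicated-column idea could look roughly like the sketch below. The sentinel value `-1`, the column names, and the helper `assert_in_sync` are all hypothetical choices for illustration, not part of any library:

```python
import numpy as np
import pandas as pd

SENTINEL = -1  # hypothetical dedicated value standing in for NaN

df = pd.DataFrame({"col": [1.0, 2.0, np.nan]})
# Experimental twin column: plain int64, with the sentinel where NaN was.
df["col_int"] = df["col"].fillna(SENTINEL).astype("int64")

def assert_in_sync(frame):
    """Check that the float column and its integer twin still agree."""
    missing = frame["col"].isna()
    # Every NaN in the float column must be the sentinel in the int column.
    assert (frame.loc[missing, "col_int"] == SENTINEL).all()
    # Every other value must match exactly.
    assert (frame.loc[~missing, "col_int"] == frame.loc[~missing, "col"]).all()

assert_in_sync(df)
```

This only works if the sentinel can never occur as a real value in the data.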

Answer 3




This is not a solution for all cases, but in mine (genomic coordinates) I’ve resorted to using 0 in place of NaN:

a3['MapInfo'] = a3['MapInfo'].fillna(0).astype(int)

This at least allows the proper ‘native’ column type to be used, so operations like subtraction and comparison work as expected.

Answer 4


Pandas v0.24+

Support for NaN in integer series is available from v0.24 onward. There’s information on this in the v0.24 “What’s New” section, and more details under Nullable Integer Data Type.

Pandas v0.23 and earlier

In general, it’s best to work with float series where possible, even when the series is upcast from int to float due to inclusion of NaN values. This enables vectorised NumPy-based calculations where, otherwise, Python-level loops would be processed.

The docs do suggest : “One possibility is to use dtype=object arrays instead.” For example:

s = pd.Series([1, 2, 3, np.nan], dtype=object)
print(s)

0      1
1      2
2      3
3    NaN
dtype: object

For cosmetic reasons, e.g. output to a file, this may be preferable.

Pandas v0.23 and earlier: background

NaN is considered a float. The docs currently (as of v0.23) specify the reason why integer series are upcast to float:

In the absence of high performance NA support being built into NumPy from the ground up, the primary casualty is the ability to represent NAs in integer arrays.

This trade-off is made largely for memory and performance reasons, and also so that the resulting Series continues to be “numeric”.

The docs also provide rules for upcasting due to NaN inclusion:

Typeclass   Promotion dtype for storing NAs
floating    no change
object      no change
integer     cast to float64
boolean     cast to object
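The integer and boolean rows of that table can be observed directly. The sketch below introduces a missing value by reindexing onto a label that does not exist; the series contents are arbitrary, and the behavior shown is that of the default NumPy-backed dtypes:

```python
import pandas as pd

ints = pd.Series([1, 2, 3])            # int64
bools = pd.Series([True, False, True])  # bool

# Label 3 is absent from both series, so reindexing inserts a NaN there.
ints_with_na = ints.reindex([0, 1, 2, 3])
bools_with_na = bools.reindex([0, 1, 2, 3])

print(ints.dtype, "->", ints_with_na.dtype)
print(bools.dtype, "->", bools_with_na.dtype)
```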

Answer 5


This is now possible as of pandas v0.24.0.

The pandas 0.24.x release notes state: “Pandas has gained the ability to hold integer dtypes with missing values.”

Answer 6


Just wanted to add: if you try to convert a float vector that contains NA and non-integral values (e.g. 1.143) directly to the new ‘Int64’ dtype, you will get an error. To avoid it, round the numbers first and then call .astype('Int64').

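s1 = pd.Series([1.434, 2.343, np.nan])
# without round() the next line raises an error:
# cannot safely cast non-equivalent float64 to int64
# s1.astype('Int64')
# with round() it works
s1.round().astype('Int64')
# output (pandas 1.0+ prints <NA> in place of NaN):
0      1
1      2
2    NaN
dtype: Int64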

My use case is that I have a float series that I want to round to int, but when you do .round() a ‘*.0’ at the end of the number remains, so you can drop that 0 from the end by converting to int.

Answer 7


If there are blanks in the text data, columns that would normally be integers are loaded as float64, because the int64 dtype cannot hold nulls. This can cause an inconsistent schema if you load multiple files, some with blanks (which end up as float64) and others without (which end up as int64).

This code attempts to convert every numeric column to Int64 (as opposed to int64), since Int64 can hold nulls:

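import pandas as pd
import numpy as np

# mydf is assumed to be an existing DataFrame

# show datatypes before transformation
print(mydf.dtypes)

for c in mydf.select_dtypes(np.number).columns:
    try:
        mydf[c] = mydf[c].astype('Int64')
        print('cast {} as Int64'.format(c))
    except (TypeError, ValueError):
        # e.g. float columns holding non-integral values
        print('could not cast {} to Int64'.format(c))

# show datatypes after transformation
print(mydf.dtypes)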
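To make the schema mismatch concrete, here is a small sketch that stands in for two files using in-memory CSV text. The column names and values are invented for illustration:

```python
import io
import pandas as pd

# Two hypothetical "files": one has a blank in the count column, one does not.
with_blank = pd.read_csv(io.StringIO("id,count\n1,10\n2,\n"))
without_blank = pd.read_csv(io.StringIO("id,count\n3,30\n4,40\n"))

# The same logical column loads with two different dtypes.
print(with_blank["count"].dtype, without_blank["count"].dtype)

# Casting both to the nullable Int64 dtype gives a consistent schema.
with_blank["count"] = with_blank["count"].astype("Int64")
without_blank["count"] = without_blank["count"].astype("Int64")
combined = pd.concat([with_blank, without_blank], ignore_index=True)
print(combined["count"].dtype)
```

The cast from float64 succeeds here only because every non-missing value is integral; a column holding something like 10.5 would need rounding first.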
