Question: How to check the dtype of a column in python pandas

I need to use different functions to treat numeric columns and string columns. What I am doing now is really dumb:

allc = list((agg.loc[:, (agg.dtypes==np.float64)|(agg.dtypes==np.int)]).columns)
for y in allc:
    treat_numeric(agg[y])    

allc = list((agg.loc[:, (agg.dtypes!=np.float64)&(agg.dtypes!=np.int)]).columns)
for y in allc:
    treat_str(agg[y])    

Is there a more elegant way to do this? E.g.

for y in agg.columns:
    if(dtype(agg[y]) == 'string'):
          treat_str(agg[y])
    elif(dtype(agg[y]) != 'string'):
          treat_numeric(agg[y])

Answer 0

You can access the data-type of a column with dtype:

for y in agg.columns:
    if(agg[y].dtype == np.float64 or agg[y].dtype == np.int64):
          treat_numeric(agg[y])
    else:
          treat_str(agg[y])
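
To try the loop above end to end, here is a minimal self-contained sketch. The sample frame and the bodies of treat_numeric and treat_str are placeholders invented for illustration; only the dtype comparison comes from the answer.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; dtypes are set explicitly so the
# comparison below does not depend on platform defaults.
agg = pd.DataFrame({
    "count": np.array([1, 2, 3], dtype=np.int64),
    "price": np.array([1.5, 2.5, 3.5], dtype=np.float64),
    "name": ["a", "b", "c"],
})

def treat_numeric(col):
    # Placeholder numeric handler: sum the column.
    return col.sum()

def treat_str(col):
    # Placeholder string handler: upper-case every value.
    return col.str.upper().tolist()

results = {}
for y in agg.columns:
    # Series.dtype compares equal to the matching numpy scalar type.
    if agg[y].dtype == np.float64 or agg[y].dtype == np.int64:
        results[y] = treat_numeric(agg[y])
    else:
        results[y] = treat_str(agg[y])
```

Note that any column that is not exactly float64 or int64 (for example int32 or a datetime column) falls into the else branch and is treated as a string column.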

Answer 1

In pandas 0.20.2 you can do:

from pandas.api.types import is_string_dtype
from pandas.api.types import is_numeric_dtype

is_string_dtype(df['A'])
>>>> True

is_numeric_dtype(df['B'])
>>>> True

So your code becomes:

for y in agg.columns:
    if (is_string_dtype(agg[y])):
        treat_str(agg[y])
    elif (is_numeric_dtype(agg[y])):
        treat_numeric(agg[y])
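
As a rough, self-contained illustration of the same idea, the sketch below sorts columns into buckets. The frame and the bucket names are made up for the example; the if/elif in the answer has no else, so a column that is neither string nor numeric (such as the datetime column here) would be skipped silently.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype, is_string_dtype

# Hypothetical frame with one string, one integer and one datetime column.
df = pd.DataFrame({
    "A": ["x", "y"],
    "B": [1, 2],
    "C": pd.to_datetime(["2013-01-02", "2016-10-20"]),
})

string_cols, numeric_cols, other_cols = [], [], []
for name in df.columns:
    if is_string_dtype(df[name]):
        string_cols.append(name)
    elif is_numeric_dtype(df[name]):
        numeric_cols.append(name)
    else:
        # Columns that match neither check end up here.
        other_cols.append(name)
```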

Answer 2

I know this is a bit of an old thread, but with pandas 0.19.2 you can do:

df.select_dtypes(include=['float64']).apply(your_function)
df.select_dtypes(exclude=['string','object']).apply(your_other_function)

http://pandas.pydata.org/pandas-docs/version/0.19.2/generated/pandas.DataFrame.select_dtypes.html
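
A small sketch of how this can look on concrete data, assuming a made-up three-column frame. It uses the generic 'number' selector for the numeric side and its complement for everything else, which is a slightly different split from the one in the answer.

```python
import pandas as pd

# Hypothetical frame: one float, one integer and one string column.
df = pd.DataFrame({"x": [1.0, 2.0], "n": [1, 2], "s": ["a", "b"]})

# Apply a function to every float64 column (here: the column maximum).
float_max = df.select_dtypes(include=["float64"]).apply(lambda col: col.max())

# 'number' selects all numeric columns; excluding it gives the rest.
numeric_cols = list(df.select_dtypes(include=["number"]).columns)
non_numeric_cols = list(df.select_dtypes(exclude=["number"]).columns)
```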


Answer 3

The question title is general, but the author's use case stated in the body of the question is specific, so any of the other answers may be used.

But to fully answer the title question, it should be clarified that all of the approaches seem to fail in some cases and require some rework. I reviewed all of them (and some additional ones), ordered from least to most reliable (in my opinion):

1. Comparing types directly via == (accepted answer).

Although this is the accepted answer and has the most upvotes, I think this method should not be used at all, because this approach is in fact discouraged in Python, as mentioned several times here.
But if one still wants to use it, one should be aware of some pandas-specific dtypes like pd.CategoricalDtype, pd.PeriodDtype, or pd.IntervalDtype. Here one has to use an extra type( ) call in order to recognize the dtype correctly:

s = pd.Series([pd.Period('2002-03','D'), pd.Period('2012-02-01', 'D')])
s
s.dtype == pd.PeriodDtype   # Not working
type(s.dtype) == pd.PeriodDtype # working 

>>> 0    2002-03-01
>>> 1    2012-02-01
>>> dtype: period[D]
>>> False
>>> True

Another caveat here is that the type must be specified exactly:

s = pd.Series([1,2])
s
s.dtype == np.int64 # Working
s.dtype == np.int32 # Not working

>>> 0    1
>>> 1    2
>>> dtype: int64
>>> True
>>> False
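
To make the width caveat concrete, here is a short sketch contrasting the exact comparison with a family-level check. Using pandas.api.types.is_integer_dtype here is my own choice for illustration, and the variable names are made up.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_integer_dtype

s64 = pd.Series([1, 2], dtype="int64")
s32 = pd.Series([1, 2], dtype="int32")

# Exact comparison: an int32 column is not equal to np.int64.
exact_match_32 = s32.dtype == np.int64

# Family-level check: accepts integer dtypes of any width.
family_match_32 = is_integer_dtype(s32)
family_match_64 = is_integer_dtype(s64)
```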

2. isinstance() approach.

This method has not been mentioned in answers so far.

So if comparing types directly is not a good idea, let's try the built-in Python function for this purpose, namely isinstance().
It fails right at the start, because it assumes that we have some objects, but a pd.Series or pd.DataFrame may be used as just an empty container with a predefined dtype and no objects in it:

s = pd.Series([], dtype=bool)
s

>>> Series([], dtype: bool)

But suppose one somehow overcomes this issue and wants to access each object, for example in the first row, and check its dtype like this:

df = pd.DataFrame({'int': [12, 2], 'dt': [pd.Timestamp('2013-01-02'), pd.Timestamp('2016-10-20')]},
                  index = ['A', 'B'])
for col in df.columns:
    df[col].dtype, 'is_int64 = %s' % isinstance(df.loc['A', col], np.int64)

>>> (dtype('int64'), 'is_int64 = True')
>>> (dtype('<M8[ns]'), 'is_int64 = False')

This will be misleading when a single column holds data of mixed types:

df2 = pd.DataFrame({'data': [12, pd.Timestamp('2013-01-02')]},
                  index = ['A', 'B'])
for col in df2.columns:
    df2[col].dtype, 'is_int64 = %s' % isinstance(df2.loc['A', col], np.int64)

>>> (dtype('O'), 'is_int64 = False')

And last but not least, this method cannot directly recognize the Category dtype. As stated in the docs:

Returning a single item from categorical data will also return the value, not a categorical of length “1”.

df['int'] = df['int'].astype('category')
for col in df.columns:
    df[col].dtype, 'is_int64 = %s' % isinstance(df.loc['A', col], np.int64)

>>> (CategoricalDtype(categories=[2, 12], ordered=False), 'is_int64 = True')
>>> (dtype('<M8[ns]'), 'is_int64 = False')

So this method is also almost inapplicable.

3. df.dtype.kind approach.

This method may still work with an empty pd.Series or pd.DataFrame, but it has other problems.

First, it is unable to distinguish some dtypes:

df = pd.DataFrame({'prd'  :[pd.Period('2002-03','D'), pd.Period('2012-02-01', 'D')],
                   'str'  :['s1', 's2'],
                   'cat'  :[1, -1]})
df['cat'] = df['cat'].astype('category')
for col in df:
    # kind will define all columns as 'Object'
    print (df[col].dtype, df[col].dtype.kind)

>>> period[D] O
>>> object O
>>> category O

Second, and this is still unclear to me, it even returns None for some dtypes.

4. df.select_dtypes approach.

This is almost what we want. This method is designed inside pandas, so it handles most of the corner cases mentioned earlier: it works on empty DataFrames and distinguishes numpy and pandas-specific dtypes well. It works well with a single dtype, as in .select_dtypes('bool'). It may even be used to select groups of columns based on dtype:

test = pd.DataFrame({'bool' :[False, True], 'int64':[-1,2], 'int32':[-1,2],'float': [-2.5, 3.4],
                     'compl':np.array([1-1j, 5]),
                     'dt'   :[pd.Timestamp('2013-01-02'), pd.Timestamp('2016-10-20')],
                     'td'   :[pd.Timestamp('2012-03-02')- pd.Timestamp('2016-10-20'),
                              pd.Timestamp('2010-07-12')- pd.Timestamp('2000-11-10')],
                     'prd'  :[pd.Period('2002-03','D'), pd.Period('2012-02-01', 'D')],
                     'intrv':pd.arrays.IntervalArray([pd.Interval(0, 0.1), pd.Interval(1, 5)]),
                     'str'  :['s1', 's2'],
                     'cat'  :[1, -1],
                     'obj'  :[[1,2,3], [5435,35,-52,14]]
                    })
test['int32'] = test['int32'].astype(np.int32)
test['cat'] = test['cat'].astype('category')

Like so, as stated in the docs:

test.select_dtypes('number')

>>>     int64   int32   float   compl   td
>>> 0      -1      -1   -2.5    (1-1j)  -1693 days
>>> 1       2       2    3.4    (5+0j)   3531 days

One may think that here we see the first unexpected result (it once was for me, and I asked a question about it): TimeDelta is included in the output DataFrame. But as was answered there, this is how it should be, though one has to be aware of it. Note that the bool dtype is skipped, which may also be undesired for some, but this is because bool and number are in different "subtrees" of the numpy dtype hierarchy. For bool, we may use test.select_dtypes(['bool']) here.

The next restriction of this method is that, for the current version of pandas (0.24.2), the code test.select_dtypes('period') will raise NotImplementedError.

Another thing is that it is unable to distinguish strings from other objects:

test.select_dtypes('object')

>>>     str     obj
>>> 0    s1     [1, 2, 3]
>>> 1    s2     [5435, 35, -52, 14]

But first, this is already mentioned in the docs. And second, it is not a problem of this method, but rather of the way strings are stored in a DataFrame. Either way, this case needs some post-processing.
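
One possible form of that post-processing, sketched under the assumption that a "string column" means an object column whose values are all Python str. The helper is_all_str is hypothetical and not part of pandas, and the small frame only mimics the 'str' and 'obj' columns used above.

```python
import pandas as pd

# Small stand-in frame; the string column is stored with object dtype,
# as in the example above.
test = pd.DataFrame({
    "str": pd.Series(["s1", "s2"], dtype=object),
    "obj": [[1, 2, 3], [5435, 35, -52, 14]],
    "num": [1, 2],
})

def is_all_str(col):
    # Hypothetical helper: True only for a non-empty column of str values.
    return len(col) > 0 and all(isinstance(v, str) for v in col)

# select_dtypes narrows to object columns; the helper then inspects values.
object_cols = test.select_dtypes(include=["object"])
string_cols = [name for name in object_cols.columns
               if is_all_str(object_cols[name])]
```

This inspects every value, so it can be slow on large frames, and it treats an empty column as not being a string column; both are choices made for the sketch.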

5. pd.api.types.is_XXX_dtype approach.

I suppose this one is intended to be the most robust and native way to achieve dtype recognition (the path of the module where the functions reside speaks for itself). It works almost perfectly, but it still has at least one caveat, and string columns still have to be distinguished somehow.

Besides, and this may be subjective, this approach also handles the group of number dtypes in a more 'human-understandable' way than .select_dtypes('number'):

for col in test.columns:
    if pd.api.types.is_numeric_dtype(test[col]):
        print (test[col].dtype)

>>> bool
>>> int64
>>> int32
>>> float64
>>> complex128

timedelta is not included and bool is. Perfect.
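
If bool should not count as numeric in a given pipeline, the checks can be combined. The following is only a sketch on a made-up frame; the names numeric_cols and strict_numeric_cols are illustrative.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype

# Hypothetical frame covering bool, int, float, timedelta and string columns.
df = pd.DataFrame({
    "flag": [False, True],
    "n": [-1, 2],
    "x": [-2.5, 3.4],
    "td": pd.to_timedelta(["1 days", "2 days"]),
    "s": ["s1", "s2"],
})

# is_numeric_dtype accepts bool but rejects timedelta and strings.
numeric_cols = [c for c in df.columns if is_numeric_dtype(df[c])]

# Optionally drop bool columns from the numeric group.
strict_numeric_cols = [c for c in numeric_cols if not is_bool_dtype(df[c])]
```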

My pipeline currently relies on exactly this functionality, plus a bit of manual post-processing.

Conclusion.

I hope I was able to argue the main point: all of the discussed approaches may be used, but only pd.DataFrame.select_dtypes() and pd.api.types.is_XXX_dtype should really be considered applicable.


Answer 4

If you want to get the type of a dataframe column as a one-character string code, you can use:

df['A'].dtype.kind

An example:

In [8]: df = pd.DataFrame([[1,'a',1.2],[2,'b',2.3]])
In [9]: df[0].dtype.kind, df[1].dtype.kind, df[2].dtype.kind
Out[9]: ('i', 'O', 'f')

The answer for your code:

for y in agg.columns:
    if(agg[y].dtype.kind == 'f' or agg[y].dtype.kind == 'i'):
          treat_numeric(agg[y])
    else:
          treat_str(agg[y])
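
The check above only covers signed integers ('i') and floats ('f'). A slightly broader sketch, assuming numpy's kind codes 'u' (unsigned integer) and 'c' (complex) should also count as numeric; the frame and the NUMERIC_KINDS name are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with explicitly typed columns.
df = pd.DataFrame({
    "i": np.array([1, 2], dtype=np.int64),
    "u": np.array([1, 2], dtype=np.uint8),
    "f": np.array([1.2, 2.3], dtype=np.float64),
    "b": np.array([True, False]),
    "o": pd.Series(["a", "b"], dtype=object),
})

# Map each column to its one-character kind code.
kinds = {name: df[name].dtype.kind for name in df.columns}

# Signed int, unsigned int, float and complex kinds.
NUMERIC_KINDS = "iufc"
numeric_cols = [name for name, kind in kinds.items() if kind in NUMERIC_KINDS]
```

Bool columns have kind 'b' and are left out of the numeric group here; add 'b' to NUMERIC_KINDS if they should be included.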

Note:


Answer 5

To pretty print the column data types

To check the data types after, for example, an import from a file

def printColumnInfo(df):
    template="%-8s %-30s %s"
    print(template % ("Type", "Column Name", "Example Value"))
    print("-"*53)
    for c in df.columns:
        print(template % (df[c].dtype, c, df[c].iloc[1]) )

Illustrative output:

Type     Column Name                    Example Value
-----------------------------------------------------
int64    Age                            49
object   Attrition                      No
object   BusinessTravel                 Travel_Frequently
float64  DailyRate                      279.0
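
For a quick check without a helper function, the DataFrame.dtypes attribute gives the same type information as a Series indexed by column name. A minimal sketch, with a made-up two-row frame that reuses the column names from the output above:

```python
import numpy as np
import pandas as pd

# Hypothetical data; only the column names mirror the output above.
df = pd.DataFrame({
    "Age": np.array([41, 49], dtype=np.int64),
    "Attrition": ["Yes", "No"],
    "DailyRate": [1102.0, 279.0],
})

# dtypes maps each column name to its dtype.
dtype_by_column = df.dtypes
type_names = {name: str(dtype) for name, dtype in dtype_by_column.items()}

print(dtype_by_column)
```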
