问题:熊猫中的dtype(’O’)是什么?

我在pandas中有一个数据框,我试图找出其值的类型。我不确定column的类型'Test'。但是,当我跑步时myFrame['Test'].dtype,我得到了;

dtype('O')

这是什么意思?

I have a dataframe in pandas and I’m trying to figure out what the types of its values are. I am unsure what the type is of column 'Test'. However, when I run myFrame['Test'].dtype, I get;

dtype('O')

What does this mean?


回答 0

它的意思是:

'O'     (Python) objects

来源

第一个字符指定数据的类型,其余字符指定每个项目的字节数,Unicode除外,Unicode将其解释为字符数。项目大小必须与现有类型相对应,否则将引发错误。支持的类型为现有类型,否则将引发错误。支持的种类有:

'b'       boolean
'i'       (signed) integer
'u'       unsigned integer
'f'       floating-point
'c'       complex-floating point
'O'       (Python) objects
'S', 'a'  (byte-)string
'U'       Unicode
'V'       raw data (void)

如果需要检查,另一个答案会有所帮助type

It means:

'O'     (Python) objects

Source.

The first character specifies the kind of data and the remaining characters specify the number of bytes per item, except for Unicode, where it is interpreted as the number of characters. The item size must correspond to an existing type, or an error will be raised. The supported kinds are to an existing type, or an error will be raised. The supported kinds are:

'b'       boolean
'i'       (signed) integer
'u'       unsigned integer
'f'       floating-point
'c'       complex-floating point
'O'       (Python) objects
'S', 'a'  (byte-)string
'U'       Unicode
'V'       raw data (void)

Another answer helps if need check types.


回答 1

当您dtype('O')在数据框内看到这意味着熊猫字符串。

什么dtype

属于pandasnumpy或两者兼而有之的东西?如果我们检查熊猫代码:

df = pd.DataFrame({'float': [1.0],
                    'int': [1],
                    'datetime': [pd.Timestamp('20180310')],
                    'string': ['foo']})
print(df)
print(df['float'].dtype,df['int'].dtype,df['datetime'].dtype,df['string'].dtype)
df['string'].dtype

它将输出如下:

   float  int   datetime string    
0    1.0    1 2018-03-10    foo
---
float64 int64 datetime64[ns] object
---
dtype('O')

您可以将最后一个解释为Pandas dtype('O')或Pandas对象,它是Python类型的字符串,它对应于Numpy string_unicode_type。

Pandas dtype    Python type     NumPy type          Usage
object          str             string_, unicode_   Text

就像唐吉x德(Don Quixote)在屁股上一样,熊猫(Pandas)在Numpy上一样,Numpy理解系统的基础架构,并numpy.dtype为此使用类。

数据类型对象是numpy.dtype类的实例,可以更精确地理解数据类型,包括:

  • 数据类型(整数,浮点数,Python对象等)
  • 数据大小(例如整数中有多少个字节)
  • 数据的字节顺序(小端或大端)
  • 如果数据类型是结构化的,则为其他数据类型的集合(例如,描述由整数和浮点数组成的数组项)
  • 该结构的“字段”的名称是什么
  • 每个字段的数据类型是什么
  • 每个字段占用存储块的哪一部分
  • 如果数据类型是子数组,则其形状和数据类型是什么

在这个问题的上下文中dtype,它既属于pand又属于numpy,尤其dtype('O')意味着我们期望该字符串。


这是一些测试用的代码,并带有解释:如果我们将数据集作为字典

import pandas as pd
import numpy as np
from pandas import Timestamp

data={'id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5}, 'date': {0: Timestamp('2018-12-12 00:00:00'), 1: Timestamp('2018-12-12 00:00:00'), 2: Timestamp('2018-12-12 00:00:00'), 3: Timestamp('2018-12-12 00:00:00'), 4: Timestamp('2018-12-12 00:00:00')}, 'role': {0: 'Support', 1: 'Marketing', 2: 'Business Development', 3: 'Sales', 4: 'Engineering'}, 'num': {0: 123, 1: 234, 2: 345, 3: 456, 4: 567}, 'fnum': {0: 3.14, 1: 2.14, 2: -0.14, 3: 41.3, 4: 3.14}}
df = pd.DataFrame.from_dict(data) #now we have a dataframe

print(df)
print(df.dtypes)

最后几行将检查数据框并记录输出:

   id       date                  role  num   fnum
0   1 2018-12-12               Support  123   3.14
1   2 2018-12-12             Marketing  234   2.14
2   3 2018-12-12  Business Development  345  -0.14
3   4 2018-12-12                 Sales  456  41.30
4   5 2018-12-12           Engineering  567   3.14
id               int64
date    datetime64[ns]
role            object
num              int64
fnum           float64
dtype: object

各种不同 dtypes

df.iloc[1,:] = np.nan
df.iloc[2,:] = None

但是,如果我们尝试设置np.nanNone这将不会影响原始列的dtype。输出将如下所示:

print(df)
print(df.dtypes)

    id       date         role    num   fnum
0  1.0 2018-12-12      Support  123.0   3.14
1  NaN        NaT          NaN    NaN    NaN
2  NaN        NaT         None    NaN    NaN
3  4.0 2018-12-12        Sales  456.0  41.30
4  5.0 2018-12-12  Engineering  567.0   3.14
id             float64
date    datetime64[ns]
role            object
num            float64
fnum           float64
dtype: object

因此,np.nan否则None将不会更改列dtype,除非我们将所有列行都设置为np.nanNone。在这种情况下,列将分别变为float64object

您也可以尝试设置单行:

df.iloc[3,:] = 0 # will convert datetime to object only
df.iloc[4,:] = '' # will convert all columns to object

这里需要注意的是,如果我们在非字符串列中设置字符串,它将变成string或object dtype

When you see dtype('O') inside dataframe this means Pandas string.

What is dtype?

Something that belongs to pandas or numpy, or both, or something else? If we examine pandas code:

df = pd.DataFrame({'float': [1.0],
                    'int': [1],
                    'datetime': [pd.Timestamp('20180310')],
                    'string': ['foo']})
print(df)
print(df['float'].dtype,df['int'].dtype,df['datetime'].dtype,df['string'].dtype)
df['string'].dtype

It will output like this:

   float  int   datetime string    
0    1.0    1 2018-03-10    foo
---
float64 int64 datetime64[ns] object
---
dtype('O')

You can interpret the last as Pandas dtype('O') or Pandas object which is Python type string, and this corresponds to Numpy string_, or unicode_ types.

Pandas dtype    Python type     NumPy type          Usage
object          str             string_, unicode_   Text

Like Don Quixote is on ass, Pandas is on Numpy and Numpy understand the underlying architecture of your system and uses the class numpy.dtype for that.

Data type object is an instance of numpy.dtype class that understand the data type more precise including:

  • Type of the data (integer, float, Python object, etc.)
  • Size of the data (how many bytes is in e.g. the integer)
  • Byte order of the data (little-endian or big-endian)
  • If the data type is structured, an aggregate of other data types, (e.g., describing an array item consisting of an integer and a float)
  • What are the names of the “fields” of the structure
  • What is the data-type of each field
  • Which part of the memory block each field takes
  • If the data type is a sub-array, what is its shape and data type

In the context of this question dtype belongs to both pands and numpy and in particular dtype('O') means we expect the string.


Here is some code for testing with explanation: If we have the dataset as dictionary

import pandas as pd
import numpy as np
from pandas import Timestamp

data={'id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5}, 'date': {0: Timestamp('2018-12-12 00:00:00'), 1: Timestamp('2018-12-12 00:00:00'), 2: Timestamp('2018-12-12 00:00:00'), 3: Timestamp('2018-12-12 00:00:00'), 4: Timestamp('2018-12-12 00:00:00')}, 'role': {0: 'Support', 1: 'Marketing', 2: 'Business Development', 3: 'Sales', 4: 'Engineering'}, 'num': {0: 123, 1: 234, 2: 345, 3: 456, 4: 567}, 'fnum': {0: 3.14, 1: 2.14, 2: -0.14, 3: 41.3, 4: 3.14}}
df = pd.DataFrame.from_dict(data) #now we have a dataframe

print(df)
print(df.dtypes)

The last lines will examine the dataframe and note the output:

   id       date                  role  num   fnum
0   1 2018-12-12               Support  123   3.14
1   2 2018-12-12             Marketing  234   2.14
2   3 2018-12-12  Business Development  345  -0.14
3   4 2018-12-12                 Sales  456  41.30
4   5 2018-12-12           Engineering  567   3.14
id               int64
date    datetime64[ns]
role            object
num              int64
fnum           float64
dtype: object

All kind of different dtypes

df.iloc[1,:] = np.nan
df.iloc[2,:] = None

But if we try to set np.nan or None this will not affect the original column dtype. The output will be like this:

print(df)
print(df.dtypes)

    id       date         role    num   fnum
0  1.0 2018-12-12      Support  123.0   3.14
1  NaN        NaT          NaN    NaN    NaN
2  NaN        NaT         None    NaN    NaN
3  4.0 2018-12-12        Sales  456.0  41.30
4  5.0 2018-12-12  Engineering  567.0   3.14
id             float64
date    datetime64[ns]
role            object
num            float64
fnum           float64
dtype: object

So np.nan or None will not change the columns dtype, unless we set the all column rows to np.nan or None. In that case column will become float64 or object respectively.

You may try also setting single rows:

df.iloc[3,:] = 0 # will convert datetime to object only
df.iloc[4,:] = '' # will convert all columns to object

And to note here, if we set string inside a non string column it will become string or object dtype.


回答 2

它的意思是“一个python对象”,即不是numpy支持的内置标量类型之一。

np.array([object()]).dtype
=> dtype('O')

It means “a python object”, i.e. not one of the builtin scalar types supported by numpy.

np.array([object()]).dtype
=> dtype('O')

回答 3

“ O”代表对象

#Loading a csv file as a dataframe
import pandas as pd 
train_df = pd.read_csv('train.csv')
col_name = 'Name of Employee'

#Checking the datatype of column name
train_df[col_name].dtype

#Instead try printing the same thing
print train_df[col_name].dtype

第一行返回: dtype('O')

带有print语句的行返回以下内容: object

‘O’ stands for object.

#Loading a csv file as a dataframe
import pandas as pd 
train_df = pd.read_csv('train.csv')
col_name = 'Name of Employee'

#Checking the datatype of column name
train_df[col_name].dtype

#Instead try printing the same thing
print train_df[col_name].dtype

The first line returns: dtype('O')

The line with the print statement returns the following: object


声明:本站所有文章,如无特殊说明或标注,均为本站原创发布。任何个人或组织,在未征得本站同意时,禁止复制、盗用、采集、发布本站内容到任何网站、书籍等各类媒体平台。如若本站内容侵犯了原著者的合法权益,可联系我们进行处理。