Python 实用宝典

Question 1

Let’s say df is a pandas DataFrame. I would like to find all columns of numeric type. Something like:

isNumeric = is_numeric(df)

Question 2

You could use select_dtypes method of DataFrame. It includes two parameters include and exclude. So isNumeric would look like:

numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']

newdf = df.select_dtypes(include=numerics)

Question 3

You can use the undocumented function _get_numeric_data() to filter only numeric columns:

df._get_numeric_data()

Example:

In [32]: data
Out[32]:
   A  B
0  1  s
1  2  s
2  3  s
3  4  s

In [33]: data._get_numeric_data()
Out[33]:
   A
0  1
1  2
2  3
3  4

Note that this is a “private method” (i.e., an implementation detail) and is subject to change or total removal in the future. Use with caution.

Question 4

Simple one-line answer to create a new dataframe with only numeric columns:

df.select_dtypes(include=np.number)

If you want the names of numeric columns:

df.select_dtypes(include=np.number).columns.tolist()

Complete code:

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': range(7, 10),
                   'B': np.random.rand(3),
                   'C': ['foo','bar','baz'],
                   'D': ['who','what','when']})
df
#    A         B    C     D
# 0  7  0.704021  foo   who
# 1  8  0.264025  bar  what
# 2  9  0.230671  baz  when

df_numerics_only = df.select_dtypes(include=np.number)
df_numerics_only
#    A         B
# 0  7  0.704021
# 1  8  0.264025
# 2  9  0.230671

colnames_numerics_only = df.select_dtypes(include=np.number).columns.tolist()
colnames_numerics_only
# ['A', 'B']

Question 5

df.select_dtypes(exclude = ['object'])

Update

df.select_dtypes(inlcude = np.number)
#or with new version of panda
df.select_dtypes('number')

Question 6

Simple one-liner:

df.select_dtypes('number').columns

Question 7

Following codes will return list of names of the numeric columns of a data set.

cnames=list(marketing_train.select_dtypes(exclude=['object']).columns)

here marketing_train is my data set and select_dtypes() is function to select data types using exclude and include arguments and columns is used to fetch the column name of data set output of above code will be following:

['custAge',
     'campaign',
     'pdays',
     'previous',
     'emp.var.rate',
     'cons.price.idx',
     'cons.conf.idx',
     'euribor3m',
     'nr.employed',
     'pmonths',
     'pastEmail']

Thanks

Question 8

This is another simple code for finding numeric column in pandas data frame,

numeric_clmns = df.dtypes[df.dtypes != "object"].index

Question 9

def is_type(df, baseType):
    import numpy as np
    import pandas as pd
    test = [issubclass(np.dtype(d).type, baseType) for d in df.dtypes]
    return pd.DataFrame(data = test, index = df.columns, columns = ["test"])
def is_float(df):
    import numpy as np
    return is_type(df, np.float)
def is_number(df):
    import numpy as np
    return is_type(df, np.number)
def is_integer(df):
    import numpy as np
    return is_type(df, np.integer)

Question 10

Adapting this answer, you could do

df.ix[:,df.applymap(np.isreal).all(axis=0)]

Here, np.applymap(np.isreal) shows whether every cell in the data frame is numeric, and .axis(all=0) checks if all values in a column are True and returns a series of Booleans that can be used to index the desired columns.

Question 11

Please see the below code:

if(dataset.select_dtypes(include=[np.number]).shape[1] > 0):
display(dataset.select_dtypes(include=[np.number]).describe())
if(dataset.select_dtypes(include=[np.object]).shape[1] > 0):
display(dataset.select_dtypes(include=[np.object]).describe())

This way you can check whether the value are numeric such as float and int or the srting values. the second if statement is used for checking the string values which is referred by the object.

Question 12

We can include and exclude data types as per the requirement as below:

train.select_dtypes(include=None, exclude=None)
train.select_dtypes(include='number') #will include all the numeric types

Referred from Jupyter Notebook.

To select all numeric types, use np.number or 'number'

To select strings you must use the object dtype but note that this will return all object dtype columns
See the NumPy dtype hierarchy <http://docs.scipy.org/doc/numpy/reference/arrays.scalars.html>__
To select datetimes, use np.datetime64, 'datetime' or 'datetime64'
To select timedeltas, use np.timedelta64, 'timedelta' or 'timedelta64'
To select Pandas categorical dtypes, use 'category'
To select Pandas datetimetz dtypes, use 'datetimetz' (new in 0.20.0) or “’datetime64[ns, tz]’

Question 13

Given an arbitrary python object, what’s the best way to determine whether it is a number? Here is is defined as acts like a number in certain circumstances.

For example, say you are writing a vector class. If given another vector, you want to find the dot product. If given a scalar, you want to scale the whole vector.

Checking if something is int, float, long, bool is annoying and doesn’t cover user-defined objects that might act like numbers. But, checking for __mul__, for example, isn’t good enough because the vector class I just described would define __mul__, but it wouldn’t be the kind of number I want.

Question 14

Use Number from the numbers module to test isinstance(n, Number) (available since 2.6).

>>> from numbers import Number
... from decimal import Decimal
... from fractions import Fraction
... for n in [2, 2.0, Decimal('2.0'), complex(2, 0), Fraction(2, 1), '2']:
...     print(f'{n!r:>14} {isinstance(n, Number)}')
              2 True
            2.0 True
 Decimal('2.0') True
         (2+0j) True
 Fraction(2, 1) True
            '2' False

This is, of course, contrary to duck typing. If you are more concerned about how an object acts rather than what it is, perform your operations as if you have a number and use exceptions to tell you otherwise.

Question 15

You want to check if some object

acts like a number in certain circumstances

If you’re using Python 2.5 or older, the only real way is to check some of those “certain circumstances” and see.

In 2.6 or better, you can use isinstance with numbers.Number — an abstract base class (ABC) that exists exactly for this purpose (lots more ABCs exist in the collections module for various forms of collections/containers, again starting with 2.6; and, also only in those releases, you can easily add your own abstract base classes if you need to).

Bach to 2.5 and earlier, “can be added to 0 and is not iterable” could be a good definition in some cases. But, you really need to ask yourself, what it is that you’re asking that what you want to consider “a number” must definitely be able to do, and what it must absolutely be unable to do — and check.

This may also be needed in 2.6 or later, perhaps for the purpose of making your own registrations to add types you care about that haven’t already be registered onto numbers.Numbers — if you want to exclude some types that claim they’re numbers but you just can’t handle, that takes even more care, as ABCs have no unregister method [[for example you could make your own ABC WeirdNum and register there all such weird-for-you types, then first check for isinstance thereof to bail out before you proceed to checking for isinstance of the normal numbers.Number to continue successfully.

BTW, if and when you need to check if x can or cannot do something, you generally have to try something like:

try: 0 + x
except TypeError: canadd=False
else: canadd=True

The presence of __add__ per se tells you nothing useful, since e.g all sequences have it for the purpose of concatenation with other sequences. This check is equivalent to the definition “a number is something such that a sequence of such things is a valid single argument to the builtin function sum“, for example. Totally weird types (e.g. ones that raise the “wrong” exception when summed to 0, such as, say, a ZeroDivisionError or ValueError &c) will propagate exception, but that’s OK, let the user know ASAP that such crazy types are just not acceptable in good company;-); but, a “vector” that’s summable to a scalar (Python’s standard library doesn’t have one, but of course they’re popular as third party extensions) would also give the wrong result here, so (e.g.) this check should come after the “not allowed to be iterable” one (e.g., check that iter(x) raises TypeError, or for the presence of special method __iter__ — if you’re in 2.5 or earlier and thus need your own checks).

A brief glimpse at such complications may be sufficient to motivate you to rely instead on abstract base classes whenever feasible…;-).

Question 16

This is a good example where exceptions really shine. Just do what you would do with the numeric types and catch the TypeError from everything else.

But obviously, this only checks if a operation works, not whether it makes sense! The only real solution for that is to never mix types and always know exactly what typeclass your values belong to.

Question 17

Multiply the object by zero. Any number times zero is zero. Any other result means that the object is not a number (including exceptions)

def isNumber(x):
    try:
        return bool(0 == x*0)
    except:
        return False

Using isNumber thusly will give the following output:

class A: pass 

def foo(): return 1

for x in [1,1.4, A(), range(10), foo, foo()]:
    answer = isNumber(x)
    print('{answer} == isNumber({x})'.format(**locals()))

Output:

True == isNumber(1)
True == isNumber(1.4)
False == isNumber(<__main__.A instance at 0x7ff52c15d878>)
False == isNumber([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
False == isNumber(<function foo at 0x7ff52c121488>)
True == isNumber(1)

There probably are some non-number objects in the world that define __mul__ to return zero when multiplied by zero but that is an extreme exception. This solution should cover all normal and sane code that you generate/encouter.

numpy.array example:

import numpy as np

def isNumber(x):
    try:
        return bool(x*0 == 0)
    except:
        return False

x = np.array([0,1])

answer = isNumber(x)
print('{answer} == isNumber({x})'.format(**locals()))

output:

False == isNumber([0 1])

Question 18

To rephrase your question, you are trying to determine whether something is a collection or a single value. Trying to compare whether something is a vector or a number is comparing apples to oranges – I can have a vector of strings or numbers, and I can have a single string or single number. You are interested in how many you have (1 or more), not what type you actually have.

my solution for this problem is to check whether the input is a single value or a collection by checking the presence of __len__. For example:

def do_mult(foo, a_vector):
    if hasattr(foo, '__len__'):
        return sum([a*b for a,b in zip(foo, a_vector)])
    else:
        return [foo*b for b in a_vector]

Or, for the duck-typing approach, you can try iterating on foo first:

def do_mult(foo, a_vector):
    try:
        return sum([a*b for a,b in zip(foo, a_vector)])
    except TypeError:
        return [foo*b for b in a_vector]

Ultimately, it is easier to test whether something is vector-like than to test whether something is scalar-like. If you have values of different type (i.e. string, numeric, etc.) coming through, then the logic of your program may need some work – how did you end up trying to multiply a string by a numeric vector in the first place?

Question 19

To summarize / evaluate existing methods:

Candidate    | type                      | delnan | mat | shrewmouse | ant6n
-------------------------------------------------------------------------
0            | <type 'int'>              |      1 |   1 |          1 |     1
0.0          | <type 'float'>            |      1 |   1 |          1 |     1
0j           | <type 'complex'>          |      1 |   1 |          1 |     0
Decimal('0') | <class 'decimal.Decimal'> |      1 |   0 |          1 |     1
True         | <type 'bool'>             |      1 |   1 |          1 |     1
False        | <type 'bool'>             |      1 |   1 |          1 |     1
''           | <type 'str'>              |      0 |   0 |          0 |     0
None         | <type 'NoneType'>         |      0 |   0 |          0 |     0
'0'          | <type 'str'>              |      0 |   0 |          0 |     1
'1'          | <type 'str'>              |      0 |   0 |          0 |     1
[]           | <type 'list'>             |      0 |   0 |          0 |     0
[1]          | <type 'list'>             |      0 |   0 |          0 |     0
[1, 2]       | <type 'list'>             |      0 |   0 |          0 |     0
(1,)         | <type 'tuple'>            |      0 |   0 |          0 |     0
(1, 2)       | <type 'tuple'>            |      0 |   0 |          0 |     0

(I came here by this question)

Code

#!/usr/bin/env python

"""Check if a variable is a number."""

import decimal


def delnan_is_number(candidate):
    import numbers
    return isinstance(candidate, numbers.Number)


def mat_is_number(candidate):
    return isinstance(candidate, (int, long, float, complex))


def shrewmouse_is_number(candidate):
    try:
        return 0 == candidate * 0
    except:
        return False


def ant6n_is_number(candidate):
    try:
        float(candidate)
        return True
    except:
        return False

# Test
candidates = (0, 0.0, 0j, decimal.Decimal(0),
              True, False, '', None, '0', '1', [], [1], [1, 2], (1, ), (1, 2))

methods = [delnan_is_number, mat_is_number, shrewmouse_is_number, ant6n_is_number]

print("Candidate    | type                      | delnan | mat | shrewmouse | ant6n")
print("-------------------------------------------------------------------------")
for candidate in candidates:
    results = [m(candidate) for m in methods]
    print("{:<12} | {:<25} | {:>6} | {:>3} | {:>10} | {:>5}"
          .format(repr(candidate), type(candidate), *results))

Question 20

Probably it’s better to just do it the other way around: You check if it’s a vector. If it is, you do a dot product and in all other cases you attempt scalar multiplication.

Checking for the vector is easy, since it should of your vector class type (or inherited from it). You could also just try first to do a dot-product, and if that fails (= it wasn’t really a vector), then fall back to scalar multiplication.

Question 21

Just to add upon. Perhaps we can use a combination of isinstance and isdigit as follows to find whether a value is a number (int, float, etc)

if isinstance(num1, int) or isinstance(num1 , float) or num1.isdigit():

Question 22

For the hypothetical vector class:

Suppose v is a vector, and we are multiplying it by x. If it makes sense to multiply each component of v by x, we probably meant that, so try that first. If not, maybe we can dot? Otherwise it’s a type error.

EDIT — the below code doesn’t work, because 2*[0]==[0,0] instead of raising a TypeError. I leave it because it was commented-upon.

def __mul__( self, x ):
    try:
        return [ comp * x for comp in self ]
    except TypeError:
        return [ x * y for x, y in itertools.zip_longest( self, x, fillvalue = 0 )

Question 23

I had a similar issue, when implementing a sort of vector class. One way to check for a number is to just convert to one, i.e. by using

float(x)

This should reject cases where x cannot be converted to a number; but may also reject other kinds of number-like structures that could be valid, for example complex numbers.

Question 24

If you want to call different methods depending on the argument type(s), look into multipledispatch.

For example, say you are writing a vector class. If given another vector, you want to find the dot product. If given a scalar, you want to scale the whole vector.

from multipledispatch import dispatch

class Vector(list):

    @dispatch(object)
    def __mul__(self, scalar):
        return Vector( x*scalar for x in self)

    @dispatch(list)
    def __mul__(self, other):
        return sum(x*y for x,y in zip(self, other))


>>> Vector([1,2,3]) * Vector([2,4,5])   # Vector time Vector is dot product
25
>>> Vector([1,2,3]) * 2                 # Vector times scalar is scaling
[2, 4, 6]

Unfortunately, (to my knowledge) we can’t write @dispatch(Vector) since we are still defining the type Vector, so that type name is not yet defined. Instead, I’m using the base type list, which allows you to even find the dot product of a Vector and a list.

Question 25

Short and simple way :

obj = 12345
print(isinstance(obj,int))

Output :

True

If the object is a string, ‘False’ will be returned :

obj = 'some string'
print(isinstance(obj,int))

Output :

False

Question 26

You have a data item, say rec_day that when written to a file will be a float. But during program processing it can be either float, int or str type (the str is used when initializing a new record and contains a dummy flag value).

You can then check to see if you have a number with this

                type(rec_day) != str

I’ve structured a python program this way and just put in ‘maintenance patch’ using this as a numeric check. Is it the Pythonic way? Most likely not since I used to program in COBOL.

Question 27

You could use the isdigit() function.

>>> x = "01234"
>>> a.isdigit()
True
>>> y = "1234abcd"
>>> y.isdigit()
False

Question 28

Why does Pandas tell me that I have objects, although every item in the selected column is a string — even after explicit conversion.

This is my DataFrame:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 56992 entries, 0 to 56991
Data columns (total 7 columns):
id            56992  non-null values
attr1         56992  non-null values
attr2         56992  non-null values
attr3         56992  non-null values
attr4         56992  non-null values
attr5         56992  non-null values
attr6         56992  non-null values
dtypes: int64(2), object(5)

Five of them are dtype object. I explicitly convert those objects to strings:

for c in df.columns:
    if df[c].dtype == object:
        print "convert ", df[c].name, " to string"
        df[c] = df[c].astype(str)

Then, df["attr2"] still has dtype object, although type(df["attr2"].ix[0] reveals str, which is correct.

Pandas distinguishes between int64 and float64 and object. What is the logic behind it when there is no dtype str? Why is a str covered by object?

Question 29

The dtype object comes from NumPy, it describes the type of element in a ndarray. Every element in a ndarray must has the same size in byte. For int64 and float64, they are 8 bytes. But for strings, the length of the string is not fixed. So instead of save the bytes of strings in the ndarray directly, Pandas use object ndarray, which save pointers to objects, because of this the dtype of this kind ndarray is object.

Here is an example:

the int64 array contains 4 int64 value.
the object array contains 4 pointers to 3 string objects.

Question 30

The accepted answer is good. Just wanted to provide an answer which referenced the documentation. The documentation says:

Pandas uses the object dtype for storing strings.

As the leading comment says “Don’t worry about it; it’s supposed to be like this.” (Although the accepted answer did a great job explaining the “why”; strings are variable-length)

But for strings, the length of the string is not fixed.

Question 31

@HYRY’s answer is great. I just want to provide a little more context..

Arrays store data as contiguous, fixed-size memory blocks. The combination of these properties together is what makes arrays lightning fast for data access. For example, consider how your computer might store an array of 32-bit integers, [3,0,1].

If you ask your computer to fetch the 3rd element in the array, it’ll start at the beginning and then jump across 64 bits to get to the 3rd element. Knowing exactly how many bits to jump across is what makes arrays fast.

Now consider the sequence of strings ['hello', 'i', 'am', 'a', 'banana']. Strings are objects that vary in size, so if you tried to store them in contiguous memory blocks, it’d end up looking like this.

Now your computer doesn’t have a fast way to access a randomly requested element. The key to overcoming this is to use pointers. Basically, store each string in some random memory location, and fill the array with the memory address of each string. (Memory addresses are just integers.) So now, things look like this

Now, if you ask your computer to fetch the 3rd element, just as before, it can jump across 64 bits (assuming the memory addresses are 32-bit integers) and then make one extra step to go fetch the string.

The challenge for NumPy is that there’s no guarantee the pointers are actually pointing to strings. That’s why it reports the dtype as ‘object’.

Shamelessly gonna plug my own blog article where I originally discussed this.

Question 32

As of version 1.0.0 (January 2020), pandas has introduced as an experimental feature providing first-class support for string types through pandas.StringDtype.

While you’ll still be seeing object by default, the new type can be used by specifying a dtype of pd.StringDtype or simply 'string':

>>> pd.Series(['abc', None, 'def'])
0     abc
1    None
2     def
dtype: object
>>> pd.Series(['abc', None, 'def'], dtype=pd.StringDtype())
0     abc
1    <NA>
2     def
dtype: string
>>> pd.Series(['abc', None, 'def']).astype('string')
0     abc
1    <NA>
2     def
dtype: string

Question 33

I have a dataframe in pandas and I’m trying to figure out what the types of its values are. I am unsure what the type is of column 'Test'. However, when I run myFrame['Test'].dtype, I get;

dtype('O')

What does this mean?

Question 34

It means:

'O'     (Python) objects

Source.

The first character specifies the kind of data and the remaining characters specify the number of bytes per item, except for Unicode, where it is interpreted as the number of characters. The item size must correspond to an existing type, or an error will be raised. The supported kinds are to an existing type, or an error will be raised. The supported kinds are:

'b'       boolean
'i'       (signed) integer
'u'       unsigned integer
'f'       floating-point
'c'       complex-floating point
'O'       (Python) objects
'S', 'a'  (byte-)string
'U'       Unicode
'V'       raw data (void)

Another answer helps if need check types.

Question 35

When you see `dtype('O')` inside dataframe this means Pandas string.

What is dtype?

Something that belongs to pandas or numpy, or both, or something else? If we examine pandas code:

df = pd.DataFrame({'float': [1.0],
                    'int': [1],
                    'datetime': [pd.Timestamp('20180310')],
                    'string': ['foo']})
print(df)
print(df['float'].dtype,df['int'].dtype,df['datetime'].dtype,df['string'].dtype)
df['string'].dtype

It will output like this:

   float  int   datetime string    
0    1.0    1 2018-03-10    foo
---
float64 int64 datetime64[ns] object
---
dtype('O')

You can interpret the last as Pandas dtype('O') or Pandas object which is Python type string, and this corresponds to Numpy string_, or unicode_ types.

Pandas dtype    Python type     NumPy type          Usage
object          str             string_, unicode_   Text

Like Don Quixote is on ass, Pandas is on Numpy and Numpy understand the underlying architecture of your system and uses the class numpy.dtype for that.

Data type object is an instance of numpy.dtype class that understand the data type more precise including:

Type of the data (integer, float, Python object, etc.)
Size of the data (how many bytes is in e.g. the integer)
Byte order of the data (little-endian or big-endian)
If the data type is structured, an aggregate of other data types, (e.g., describing an array item consisting of an integer and a float)
What are the names of the “fields” of the structure
What is the data-type of each field
Which part of the memory block each field takes
If the data type is a sub-array, what is its shape and data type

In the context of this question dtype belongs to both pands and numpy and in particular dtype('O') means we expect the string.

Here is some code for testing with explanation: If we have the dataset as dictionary

import pandas as pd
import numpy as np
from pandas import Timestamp

data={'id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5}, 'date': {0: Timestamp('2018-12-12 00:00:00'), 1: Timestamp('2018-12-12 00:00:00'), 2: Timestamp('2018-12-12 00:00:00'), 3: Timestamp('2018-12-12 00:00:00'), 4: Timestamp('2018-12-12 00:00:00')}, 'role': {0: 'Support', 1: 'Marketing', 2: 'Business Development', 3: 'Sales', 4: 'Engineering'}, 'num': {0: 123, 1: 234, 2: 345, 3: 456, 4: 567}, 'fnum': {0: 3.14, 1: 2.14, 2: -0.14, 3: 41.3, 4: 3.14}}
df = pd.DataFrame.from_dict(data) #now we have a dataframe

print(df)
print(df.dtypes)

The last lines will examine the dataframe and note the output:

   id       date                  role  num   fnum
0   1 2018-12-12               Support  123   3.14
1   2 2018-12-12             Marketing  234   2.14
2   3 2018-12-12  Business Development  345  -0.14
3   4 2018-12-12                 Sales  456  41.30
4   5 2018-12-12           Engineering  567   3.14
id               int64
date    datetime64[ns]
role            object
num              int64
fnum           float64
dtype: object

All kind of different dtypes

df.iloc[1,:] = np.nan
df.iloc[2,:] = None

But if we try to set np.nan or None this will not affect the original column dtype. The output will be like this:

print(df)
print(df.dtypes)

    id       date         role    num   fnum
0  1.0 2018-12-12      Support  123.0   3.14
1  NaN        NaT          NaN    NaN    NaN
2  NaN        NaT         None    NaN    NaN
3  4.0 2018-12-12        Sales  456.0  41.30
4  5.0 2018-12-12  Engineering  567.0   3.14
id             float64
date    datetime64[ns]
role            object
num            float64
fnum           float64
dtype: object

So np.nan or None will not change the columns dtype, unless we set the all column rows to np.nan or None. In that case column will become float64 or object respectively.

You may try also setting single rows:

df.iloc[3,:] = 0 # will convert datetime to object only
df.iloc[4,:] = '' # will convert all columns to object

And to note here, if we set string inside a non string column it will become string or object dtype.

Question 36

It means “a python object”, i.e. not one of the builtin scalar types supported by numpy.

np.array([object()]).dtype
=> dtype('O')

Question 37

‘O’ stands for object.

#Loading a csv file as a dataframe
import pandas as pd 
train_df = pd.read_csv('train.csv')
col_name = 'Name of Employee'

#Checking the datatype of column name
train_df[col_name].dtype

#Instead try printing the same thing
print train_df[col_name].dtype

The first line returns: dtype('O')

The line with the print statement returns the following: object

Question 38

Basically I want to do this:

obj = 'str'
type ( obj ) == string

I tried:

type ( obj ) == type ( string )

and it didn’t work.

Also, what about the other types? For example, I couldn’t replicate NoneType.

Question 39

isinstance()

In your case, isinstance("this is a string", str) will return True.

You may also want to read this: http://www.canonical.org/~kragen/isinstance/

Question 40

isinstance works:

if isinstance(obj, MyClass): do_foo(obj)

but, keep in mind: if it looks like a duck, and if it sounds like a duck, it is a duck.

EDIT: For the None type, you can simply do:

if obj is None: obj = MyClass()

Question 41

First, avoid all type comparisons. They’re very, very rarely necessary. Sometimes, they help to check parameter types in a function — even that’s rare. Wrong type data will raise an exception, and that’s all you’ll ever need.

All of the basic conversion functions will map as equal to the type function.

type(9) is int
type(2.5) is float
type('x') is str
type(u'x') is unicode
type(2+3j) is complex

There are a few other cases.

isinstance( 'x', basestring )
isinstance( u'u', basestring )
isinstance( 9, int )
isinstance( 2.5, float )
isinstance( (2+3j), complex )

None, BTW, never needs any of this kind of type checking. None is the only instance of NoneType. The None object is a Singleton. Just check for None

variable is None

BTW, do not use the above in general. Use ordinary exceptions and Python’s own natural polymorphism.

Question 42

For other types, check out the types module:

>>> import types
>>> x = "mystring"
>>> isinstance(x, types.StringType)
True
>>> x = 5
>>> isinstance(x, types.IntType)
True
>>> x = None
>>> isinstance(x, types.NoneType)
True

P.S. Typechecking is a bad idea.

Question 43

You can always use the type(x) == type(y) trick, where y is something with known type.

# check if x is a regular string
type(x) == type('')
# check if x is an integer
type(x) == type(1)
# check if x is a NoneType
type(x) == type(None)

Often there are better ways of doing that, particularly with any recent python. But if you only want to remember one thing, you can remember that.

In this case, the better ways would be:

# check if x is a regular string
type(x) == str
# check if x is either a regular string or a unicode string
type(x) in [str, unicode]
# alternatively:
isinstance(x, basestring)
# check if x is an integer
type(x) == int
# check if x is a NoneType
x is None

Note the last case: there is only one instance of NoneType in python, and that is None. You’ll see NoneType a lot in exceptions (TypeError: 'NoneType' object is unsubscriptable — happens to me all the time..) but you’ll hardly ever need to refer to it in code.

Finally, as fengshaun points out, type checking in python is not always a good idea. It’s more pythonic to just use the value as though it is the type you expect, and catch (or allow to propagate) exceptions that result from it.

Question 44

You’re very close! string is a module, not a type. You probably want to compare the type of obj against the type object for strings, namely str:

type(obj) == str  # this works because str is already a type

Alternatively:

type(obj) == type('')

Note, in Python 2, if obj is a unicode type, then neither of the above will work. Nor will isinstance(). See John’s comments to this post for how to get around this… I’ve been trying to remember it for about 10 minutes now, but was having a memory block!

Question 45

It is because you have to write

s="hello"
type(s) == type("")

type accepts an instance and returns its type. In this case you have to compare two instances’ types.

If you need to do preemptive checking, it is better if you check for a supported interface than the type.

The type does not really tell you much, apart of the fact that your code want an instance of a specific type, regardless of the fact that you could have another instance of a completely different type which would be perfectly fine because it implements the same interface.

For example, suppose you have this code

def firstElement(parameter):
    return parameter[0]

Now, suppose you say: I want this code to accept only a tuple.

import types

def firstElement(parameter):
    if type(parameter) != types.TupleType:
         raise TypeError("function accepts only a tuple")
    return parameter[0]

This is reducing the reusability of this routine. It won’t work if you pass a list, or a string, or a numpy.array. Something better would be

def firstElement(parameter):
    if not (hasattr(parameter, "__getitem__") and callable(getattr(parameter,"__getitem__"))):
        raise TypeError("interface violation")
    return parameter[0]

but there’s no point in doing it: parameter[0] will raise an exception if the protocol is not satisfied anyway… this of course unless you want to prevent side effects or having to recover from calls that you could invoke before failing. (Stupid) example, just to make the point:

def firstElement(parameter):
    if not (hasattr(parameter, "__getitem__") and callable(getattr(parameter,"__getitem__"))):
        raise TypeError("interface violation")
    os.system("rm file")
    return parameter[0]

in this case, your code will raise an exception before running the system() call. Without interface checks, you would have removed the file, and then raised the exception.

Question 46

Use str instead of string

type ( obj ) == str

Explanation

>>> a = "Hello"
>>> type(a)==str
True
>>> type(a)
<type 'str'>
>>>

Question 47

I use type(x) == type(y)

For instance, if I want to check something is an array:

type( x ) == type( [] )

string check:

type( x ) == type( '' ) or type( x ) == type( u'' )

If you want to check against None, use is

x is None

Question 48

i think this should do it

if isinstance(obj, str)

Question 49

Type doesn’t work on certain classes. If you’re not sure of the object’s type use the __class__ method, as so:

>>>obj = 'a string'
>>>obj.__class__ == str
True

Also see this article – http://www.siafoo.net/article/56

Question 50

To get the type, use the __class__ member, as in unknown_thing.__class__

Talk of duck-typing is useless here because it doesn’t answer a perfectly good question. In my application code I never need to know the type of something, but it’s still useful to have a way to learn an object’s type. Sometimes I need to get the actual class to validate a unit test. Duck typing gets in the way there because all possible objects have the same API, but only one is correct. Also, sometimes I’m maintaining somebody else’s code, and I have no idea what kind of object I’ve been passed. This is my biggest problem with dynamically typed languages like Python. Version 1 is very easy and quick to develop. Version 2 is a pain in the buns, especially if you didn’t write version 1. So sometimes, when I’m working with a function I didn’t write, I need to know the type of a parameter, just so I know what methods I can call on it.

That’s where the __class__ parameter comes in handy. That (as far as I can tell) is the best way (maybe the only way) to get an object’s type.

Question 51

Use isinstance(object, type). As above this is easy to use if you know the correct type, e.g.,

isinstance('dog', str) ## gives bool True

But for more esoteric objects, this can be difficult to use. For example:

import numpy as np 
a = np.array([1,2,3]) 
isinstance(a,np.array) ## breaks

but you can do this trick:

y = type(np.array([1]))
isinstance(a,y) ## gives bool True

So I recommend instantiating a variable (y in this case) with a type of the object you want to check (e.g., type(np.array())), then using isinstance.

Question 52

You can compare classes for check level.

#!/usr/bin/env python
#coding:utf8

class A(object):
    def t(self):
        print 'A'
    def r(self):
        print 'rA',
        self.t()

class B(A):
    def t(self):
        print 'B'

class C(A):
    def t(self):
        print 'C'

class D(B, C):
    def t(self):
        print 'D',
        super(D, self).t()

class E(C, B):
    pass

d = D()
d.t()
d.r()

e = E()
e.t()
e.r()

print isinstance(e, D) # False
print isinstance(e, E) # True
print isinstance(e, C) # True
print isinstance(e, B) # True
print isinstance(e, (A,)) # True
print e.__class__ >= A, #False
print e.__class__ <= C, #False
print e.__class__ <  E, #False
print e.__class__ <= E  #True

Question 53

Today I was positively surprised by the fact that while reading data from a data file (for example) pandas is able to recognize types of values:

df = pandas.read_csv('test.dat', delimiter=r"\s+", names=['col1','col2','col3'])

For example it can be checked in this way:

for i, r in df.iterrows():
    print type(r['col1']), type(r['col2']), type(r['col3'])

In particular integer, floats and strings were recognized correctly. However, I have a column that has dates in the following format: 2013-6-4. These dates were recognized as strings (not as python date-objects). Is there a way to “learn” pandas to recognized dates?

Question 54

You should add parse_dates=True, or parse_dates=['column name'] when reading, thats usually enough to magically parse it. But there are always weird formats which need to be defined manually. In such a case you can also add a date parser function, which is the most flexible way possible.

Suppose you have a column ‘datetime’ with your string, then:

from datetime import datetime
dateparse = lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S')

df = pd.read_csv(infile, parse_dates=['datetime'], date_parser=dateparse)

This way you can even combine multiple columns into a single datetime column, this merges a ‘date’ and a ‘time’ column into a single ‘datetime’ column:

dateparse = lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S')

df = pd.read_csv(infile, parse_dates={'datetime': ['date', 'time']}, date_parser=dateparse)

You can find directives (i.e. the letters to be used for different formats) for strptime and strftime in this page.

Question 55

Perhaps the pandas interface has changed since @Rutger answered, but in the version I’m using (0.15.2), the date_parser function receives a list of dates instead of a single value. In this case, his code should be updated like so:

dateparse = lambda dates: [pd.datetime.strptime(d, '%Y-%m-%d %H:%M:%S') for d in dates]

df = pd.read_csv(infile, parse_dates=['datetime'], date_parser=dateparse)

Question 56

pandas read_csv method is great for parsing dates. Complete documentation at http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html

you can even have the different date parts in different columns and pass the parameter:

parse_dates : boolean, list of ints or names, list of lists, or dict
If True -> try parsing the index. If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a
separate date column. If [[1, 3]] -> combine columns 1 and 3 and parse as a single date
column. {‘foo’ : [1, 3]} -> parse columns 1, 3 as date and call result ‘foo’

The default sensing of dates works great, but it seems to be biased towards north american Date formats. If you live elsewhere you might occasionally be caught by the results. As far as I can remember 1/6/2000 means 6 January in the USA as opposed to 1 Jun where I live. It is smart enough to swing them around if dates like 23/6/2000 are used. Probably safer to stay with YYYYMMDD variations of date though. Apologies to pandas developers,here but i have not tested it with local dates recently.

you can use the date_parser parameter to pass a function to convert your format.

date_parser : function
Function to use for converting a sequence of string columns to an array of datetime
instances. The default uses dateutil.parser.parser to do the conversion.

Question 57

You could use pandas.to_datetime() as recommended in the documentation for pandas.read_csv():

If a column or index contains an unparseable date, the entire column or index will be returned unaltered as an object data type. For non-standard datetime parsing, use pd.to_datetime after pd.read_csv.

Demo:

>>> D = {'date': '2013-6-4'}
>>> df = pd.DataFrame(D, index=[0])
>>> df
       date
0  2013-6-4
>>> df.dtypes
date    object
dtype: object
>>> df['date'] = pd.to_datetime(df.date, format='%Y-%m-%d')
>>> df
        date
0 2013-06-04
>>> df.dtypes
date    datetime64[ns]
dtype: object

Question 58

When merging two columns into a single datetime column, the accepted answer generates an error (pandas version 0.20.3), since the columns are sent to the date_parser function separately.

The following works:

def dateparse(d,t):
    dt = d + " " + t
    return pd.datetime.strptime(dt, '%d/%m/%Y %H:%M:%S')

df = pd.read_csv(infile, parse_dates={'datetime': ['date', 'time']}, date_parser=dateparse)

Question 59

Yes – according to the pandas.read_csv documentation:

Note: A fast-path exists for iso8601-formatted dates.

So if your csv has a column named datetime and the dates looks like 2013-01-01T01:01 for example, running this will make pandas (I’m on v0.19.2) pick up the date and time automatically:

df = pd.read_csv('test.csv', parse_dates=['datetime'])

Note that you need to explicitly pass parse_dates, it doesn’t work without.

Verify with:

df.dtypes

You should see the datatype of the column is datetime64[ns]

Question 60

If performance matters to you make sure you time:

import sys
import timeit
import pandas as pd

print('Python %s on %s' % (sys.version, sys.platform))
print('Pandas version %s' % pd.__version__)

repeat = 3
numbers = 100

def time(statement, _setup=None):
    print (min(
        timeit.Timer(statement, setup=_setup or setup).repeat(
            repeat, numbers)))

print("Format %m/%d/%y")
setup = """import pandas as pd
import io

data = io.StringIO('''\
ProductCode,Date
''' + '''\
x1,07/29/15
x2,07/29/15
x3,07/29/15
x4,07/30/15
x5,07/29/15
x6,07/29/15
x7,07/29/15
y7,08/05/15
x8,08/05/15
z3,08/05/15
''' * 100)"""

time('pd.read_csv(data); data.seek(0)')
time('pd.read_csv(data, parse_dates=["Date"]); data.seek(0)')
time('pd.read_csv(data, parse_dates=["Date"],'
     'infer_datetime_format=True); data.seek(0)')
time('pd.read_csv(data, parse_dates=["Date"],'
     'date_parser=lambda x: pd.datetime.strptime(x, "%m/%d/%y")); data.seek(0)')

print("Format %Y-%m-%d %H:%M:%S")
setup = """import pandas as pd
import io

data = io.StringIO('''\
ProductCode,Date
''' + '''\
x1,2016-10-15 00:00:43
x2,2016-10-15 00:00:56
x3,2016-10-15 00:00:56
x4,2016-10-15 00:00:12
x5,2016-10-15 00:00:34
x6,2016-10-15 00:00:55
x7,2016-10-15 00:00:06
y7,2016-10-15 00:00:01
x8,2016-10-15 00:00:00
z3,2016-10-15 00:00:02
''' * 1000)"""

time('pd.read_csv(data); data.seek(0)')
time('pd.read_csv(data, parse_dates=["Date"]); data.seek(0)')
time('pd.read_csv(data, parse_dates=["Date"],'
     'infer_datetime_format=True); data.seek(0)')
time('pd.read_csv(data, parse_dates=["Date"],'
     'date_parser=lambda x: pd.datetime.strptime(x, "%Y-%m-%d %H:%M:%S")); data.seek(0)')

prints:

Python 3.7.1 (v3.7.1:260ec2c36a, Oct 20 2018, 03:13:28) 
[Clang 6.0 (clang-600.0.57)] on darwin
Pandas version 0.23.4
Format %m/%d/%y
0.19123052499999993
8.20691274
8.143124389
1.2384357139999977
Format %Y-%m-%d %H:%M:%S
0.5238807110000039
0.9202787830000005
0.9832778819999959
12.002349824999996

So with iso8601-formatted date (%Y-%m-%d %H:%M:%S is apparently an iso8601-formatted date, I guess the T can be dropped and replaced by a space) you should not specify infer_datetime_format (which does not make a difference with more common ones either apparently) and passing your own parser in just cripples performance. On the other hand, date_parser does make a difference with not so standard day formats. Be sure to time before you optimize, as usual.

Question 61

While loading csv file contain date column.We have two approach to to make pandas to recognize date column i.e

Pandas explicit recognize the format by arg date_parser=mydateparser
Pandas implicit recognize the format by agr infer_datetime_format=True

Some of the date column data

01/01/18

01/02/18

Here we don’t know the first two things It may be month or day. So in this case we have to use Method 1:- Explicit pass the format

    mydateparser = lambda x: pd.datetime.strptime(x, "%m/%d/%y")
    df = pd.read_csv(file_name, parse_dates=['date_col_name'],
date_parser=mydateparser)

Method 2:- Implicit or Automatically recognize the format

df = pd.read_csv(file_name, parse_dates=[date_col_name],infer_datetime_format=True)

Question 62

The following snippet is annotated with the output (as seen on ideone.com):

print "100" < "2"      # True
print "5" > "9"        # False

print "100" < 2        # False
print 100 < "2"        # True

print 5 > "9"          # False
print "5" > 9          # True

print [] > float('inf') # True
print () > []          # True

Can someone explain why the output is as such?

Implementation details

Is this behavior mandated by the language spec, or is it up to implementors?
Are there differences between any of the major Python implementations?
Are there differences between versions of the Python language?

Question 63

From the python 2 manual:

CPython implementation detail: Objects of different types except numbers are ordered by their type names; objects of the same types that don’t support proper comparison are ordered by their address.

When you order two strings or two numeric types the ordering is done in the expected way (lexicographic ordering for string, numeric ordering for integers).

When you order a numeric and a non-numeric type, the numeric type comes first.

>>> 5 < 'foo'
True
>>> 5 < (1, 2)
True
>>> 5 < {}
True
>>> 5 < [1, 2]
True

When you order two incompatible types where neither is numeric, they are ordered by the alphabetical order of their typenames:

>>> [1, 2] > 'foo'   # 'list' < 'str' 
False
>>> (1, 2) > 'foo'   # 'tuple' > 'str'
True

>>> class Foo(object): pass
>>> class Bar(object): pass
>>> Bar() < Foo()
True

One exception is old-style classes that always come before new-style classes.

>>> class Foo: pass           # old-style
>>> class Bar(object): pass   # new-style
>>> Bar() < Foo()
False

Is this behavior mandated by the language spec, or is it up to implementors?

There is no language specification. The language reference says:

Otherwise, objects of different types always compare unequal, and are ordered consistently but arbitrarily.

So it is an implementation detail.

Are there differences between any of the major Python implementations?

I can’t answer this one because I have only used the official CPython implementation, but there are other implementations of Python such as PyPy.

Are there differences between versions of the Python language?

In Python 3.x the behaviour has been changed so that attempting to order an integer and a string will raise an error:

>>> '10' > 5
Traceback (most recent call last):
  File "<pyshell#0>", line 1, in <module>
    '10' > 5
TypeError: unorderable types: str() > int()

Question 64

Strings are compared lexicographically, and dissimilar types are compared by the name of their type ("int" < "string"). 3.x fixes the second point by making them non-comparable.

问题：如何在Pandas中找到数字列？

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

回答 6

回答 7

回答 8

回答 9

回答 10

问题：检查对象是否为数字的最有效方法是什么？

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

码

Code

回答 6

回答 7

回答 8

回答 9

回答 10

回答 11

回答 12

回答 13

问题：DataFrame中的字符串，但dtype是object

回答 0

回答 1

回答 2

回答 3

问题：熊猫中的dtype（’O’）是什么？

回答 0

回答 1

当您dtype('O')在数据框内看到这意味着熊猫字符串。

When you see dtype('O') inside dataframe this means Pandas string.

回答 2

回答 3

问题：如何在Python中比较对象的类型？

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

回答 6

回答 7

回答 8

回答 9

回答 10

回答 11

回答 12

回答 13

问题：熊猫可以自动识别日期吗？

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

回答 6

回答 7

问题：Python 2如何比较string和int？为什么列表比较的结果大于数字，而元组的结果大于列表？

实施细节

Implementation details

回答 0

回答 1

问题：测试变量是列表还是元组

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

回答 6

回答 7

回答 8

当您`dtype('O')`在数据框内看到这意味着熊猫字符串。

When you see `dtype('O')` inside dataframe this means Pandas string.