What is the best way to convert the columns to the appropriate types, in this case columns 2 and 3 into floats? Is there a way to specify the types while converting to DataFrame? Or is it better to create the DataFrame first and then loop through the columns to change the type for each column? Ideally I would like to do this in a dynamic way because there can be hundreds of columns and I don’t want to specify exactly which columns are of which type. All I can guarantee is that each columns contains values of the same type.
>>> s = pd.Series(["8",6,"7.5",3,"0.9"])# mixed string and numeric values>>> s
081627.53340.9
dtype: object
>>> pd.to_numeric(s)# convert everything to float values08.016.027.533.040.9
dtype: float64
如您所见,将返回一个新的Series。请记住,将此输出分配给变量或列名以继续使用它:
# convert Series
my_series = pd.to_numeric(my_series)# convert column "a" of a DataFrame
df["a"]= pd.to_numeric(df["a"])
您还可以通过以下apply()方法使用它来转换DataFrame的多个列:
# convert all columns of DataFrame
df = df.apply(pd.to_numeric)# convert all columns of DataFrame# convert just columns "a" and "b"
df[["a","b"]]= df[["a","b"]].apply(pd.to_numeric)
# convert all DataFrame columns to the int64 dtype
df = df.astype(int)# convert column "a" to int64 dtype and "b" to complex type
df = df.astype({"a": int,"b": complex})# convert Series to float16 type
s = s.astype(np.float16)# convert Series to Python strings
s = s.astype(str)# convert Series to categorical type - see docs for more details
s = s.astype('category')
You have three main options for converting types in pandas:
to_numeric() – provides functionality to safely convert non-numeric types (e.g. strings) to a suitable numeric type. (See also to_datetime() and to_timedelta().)
astype() – convert (almost) any type to (almost) any other type (even if it’s not necessarily sensible to do so). Also allows you to convert to categorial types (very useful).
infer_objects() – a utility method to convert object columns holding Python objects to a pandas type if possible.
Read on for more detailed explanations and usage of each of these methods.
1. to_numeric()
The best way to convert one or more columns of a DataFrame to numeric values is to use pandas.to_numeric().
This function will try to change non-numeric objects (such as strings) into integers or floating point numbers as appropriate.
Basic usage
The input to to_numeric() is a Series or a single column of a DataFrame.
As you can see, a new Series is returned. Remember to assign this output to a variable or column name to continue using it:
# convert Series
my_series = pd.to_numeric(my_series)
# convert column "a" of a DataFrame
df["a"] = pd.to_numeric(df["a"])
You can also use it to convert multiple columns of a DataFrame via the apply() method:
# convert all columns of DataFrame
df = df.apply(pd.to_numeric) # convert all columns of DataFrame
# convert just columns "a" and "b"
df[["a", "b"]] = df[["a", "b"]].apply(pd.to_numeric)
As long as your values can all be converted, that’s probably all you need.
Error handling
But what if some values can’t be converted to a numeric type?
to_numeric() also takes an errors keyword argument that allows you to force non-numeric values to be NaN, or simply ignore columns containing these values.
Here’s an example using a Series of strings s which has the object dtype:
The default behaviour is to raise if it can’t convert a value. In this case, it can’t cope with the string ‘pandas’:
>>> pd.to_numeric(s) # or pd.to_numeric(s, errors='raise')
ValueError: Unable to parse string
Rather than fail, we might want ‘pandas’ to be considered a missing/bad numeric value. We can coerce invalid values to NaN as follows using the errors keyword argument:
The third option for errors is just to ignore the operation if an invalid value is encountered:
>>> pd.to_numeric(s, errors='ignore')
# the original Series is returned untouched
This last option is particularly useful when you want to convert your entire DataFrame, but don’t not know which of our columns can be converted reliably to a numeric type. In that case just write:
df.apply(pd.to_numeric, errors='ignore')
The function will be applied to each column of the DataFrame. Columns that can be converted to a numeric type will be converted, while columns that cannot (e.g. they contain non-digit strings or dates) will be left alone.
Downcasting
By default, conversion with to_numeric() will give you either a int64 or float64 dtype (or whatever integer width is native to your platform).
That’s usually what you want, but what if you wanted to save some memory and use a more compact dtype, like float32, or int8?
to_numeric() gives you the option to downcast to either ‘integer’, ‘signed’, ‘unsigned’, ‘float’. Here’s an example for a simple series s of integer type:
>>> s = pd.Series([1, 2, -7])
>>> s
0 1
1 2
2 -7
dtype: int64
Downcasting to ‘integer’ uses the smallest possible integer that can hold the values:
The astype() method enables you to be explicit about the dtype you want your DataFrame or Series to have. It’s very versatile in that you can try and go from one type to the any other.
Basic usage
Just pick a type: you can use a NumPy dtype (e.g. np.int16), some Python types (e.g. bool), or pandas-specific types (like the categorical dtype).
Call the method on the object you want to convert and astype() will try and convert it for you:
# convert all DataFrame columns to the int64 dtype
df = df.astype(int)
# convert column "a" to int64 dtype and "b" to complex type
df = df.astype({"a": int, "b": complex})
# convert Series to float16 type
s = s.astype(np.float16)
# convert Series to Python strings
s = s.astype(str)
# convert Series to categorical type - see docs for more details
s = s.astype('category')
Notice I said “try” – if astype() does not know how to convert a value in the Series or DataFrame, it will raise an error. For example if you have a NaN or inf value you’ll get an error trying to convert it to an integer.
As of pandas 0.20.0, this error can be suppressed by passing errors='ignore'. Your original object will be return untouched.
Be careful
astype() is powerful, but it will sometimes convert values “incorrectly”. For example:
>>> s = pd.Series([1, 2, -7])
>>> s
0 1
1 2
2 -7
dtype: int64
These are small integers, so how about converting to an unsigned 8-bit type to save memory?
>>> s.astype(np.uint8)
0 1
1 2
2 249
dtype: uint8
The conversion worked, but the -7 was wrapped round to become 249 (i.e. 28 – 7)!
Trying to downcast using pd.to_numeric(s, downcast='unsigned') instead could help prevent this error.
3. infer_objects()
Version 0.21.0 of pandas introduced the method infer_objects() for converting columns of a DataFrame that have an object datatype to a more specific type (soft conversions).
For example, here’s a DataFrame with two columns of object type. One holds actual integers and the other holds strings representing integers:
>>> df = pd.DataFrame({'a': [7, 1, 5], 'b': ['3','2','1']}, dtype='object')
>>> df.dtypes
a object
b object
dtype: object
Using infer_objects(), you can change the type of column ‘a’ to int64:
>>> df = df.infer_objects()
>>> df.dtypes
a int64
b object
dtype: object
Column ‘b’ has been left alone since its values were strings, not integers. If you wanted to try and force the conversion of both columns to an integer type, you could use df.astype(int) instead.
回答 1
这个怎么样?
a =[['a','1.2','4.2'],['b','70','0.03'],['x','5','0']]
df = pd.DataFrame(a, columns=['one','two','three'])
df
Out[16]:
one two three
0 a 1.24.21 b 700.032 x 50
df.dtypes
Out[17]:
one object
two object
three object
df[['two','three']]= df[['two','three']].astype(float)
df.dtypes
Out[19]:
one object
two float64
three float64
a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
df = pd.DataFrame(a, columns=['one', 'two', 'three'])
df
Out[16]:
one two three
0 a 1.2 4.2
1 b 70 0.03
2 x 5 0
df.dtypes
Out[17]:
one object
two object
three object
df[['two', 'three']] = df[['two', 'three']].astype(float)
df.dtypes
Out[19]:
one object
two float64
three float64
# df is the DataFrame, and column_list is a list of columns as strings (e.g ["col1","col2","col3"])# dependencies: pandasdef coerce_df_columns_to_numeric(df, column_list):
df[column_list]= df[column_list].apply(pd.to_numeric, errors='coerce')
因此,以您的示例为例:
import pandas as pd
def coerce_df_columns_to_numeric(df, column_list):
df[column_list]= df[column_list].apply(pd.to_numeric, errors='coerce')
a =[['a','1.2','4.2'],['b','70','0.03'],['x','5','0']]
df = pd.DataFrame(a, columns=['col1','col2','col3'])
coerce_df_columns_to_numeric(df,['col2','col3'])
Here is a function that takes as its arguments a DataFrame and a list of columns and coerces all data in the columns to numbers.
# df is the DataFrame, and column_list is a list of columns as strings (e.g ["col1","col2","col3"])
# dependencies: pandas
def coerce_df_columns_to_numeric(df, column_list):
df[column_list] = df[column_list].apply(pd.to_numeric, errors='coerce')
After the dataframe is created, you can populate it with floating point variables in the 1st column, and strings (or any data type you desire) in the 2nd column.
df = pd.DataFrame({'a':['1','2','3'],'b':[4,5,6]}, dtype=object)
df.dtypes
a object
b object
dtype: object
# Actually converts string to numeric - hard conversion
df.apply(pd.to_numeric).dtypes
a int64
b int64
dtype: object
# Infers better data types for object data - soft conversion
df.infer_objects().dtypes
a object # no change
b int64
dtype: object
# Same as infer_objects, but converts to equivalent ExtensionType
df.convert_dtypes().dtypes
Here’s a chart that summarises some of the most important conversions in pandas.
Conversions to string are trivial .astype(str) and are not shown in the figure.
“Hard” versus “Soft” conversions
Note that “conversions” in this context could either refer to converting text data into their actual data type (hard conversion), or inferring more appropriate data types for data in object columns (soft conversion). To illustrate the difference, take a look at
df = pd.DataFrame({'a': ['1', '2', '3'], 'b': [4, 5, 6]}, dtype=object)
df.dtypes
a object
b object
dtype: object
# Actually converts string to numeric - hard conversion
df.apply(pd.to_numeric).dtypes
a int64
b int64
dtype: object
# Infers better data types for object data - soft conversion
df.infer_objects().dtypes
a object # no change
b int64
dtype: object
# Same as infer_objects, but converts to equivalent ExtensionType
df.convert_dtypes().dtypes
I thought I had the same problem but actually I have a slight difference that makes the problem easier to solve. For others looking at this question it’s worth checking the format of your input list. In my case the numbers are initially floats not strings as in the question:
In[40]: df = pd.DataFrame(...:{...:"a": pd.Series([1,2,3], dtype=np.dtype("int32")),...:"b": pd.Series(["x","y","z"], dtype=np.dtype("O")),...:"c": pd.Series([True,False, np.nan], dtype=np.dtype("O")),...:"d": pd.Series(["h","i", np.nan], dtype=np.dtype("O")),...:"e": pd.Series([10, np.nan,20], dtype=np.dtype("float")),...:"f": pd.Series([np.nan,100.5,200], dtype=np.dtype("float")),...:}...:)In[41]: dff = df.copy()In[42]: df
Out[42]:
a b c d e f
01 x True h 10.0NaN12 y False i NaN100.523 z NaNNaN20.0200.0In[43]: df.dtypes
Out[43]:
a int32
b object
c object
d object
e float64
f float64
dtype: object
In[44]: df = df.convert_dtypes()In[45]: df.dtypes
Out[45]:
a Int32
b string
c boolean
d string
e Int64
f float64
dtype: object
In[46]: dff = dff.convert_dtypes(convert_boolean =False)In[47]: dff.dtypes
Out[47]:
a Int32
b string
c object
d string
e Int64
f float64
dtype: object
Starting pandas 1.0.0, we have pandas.DataFrame.convert_dtypes. You can even control what types to convert!
In [40]: df = pd.DataFrame(
...: {
...: "a": pd.Series([1, 2, 3], dtype=np.dtype("int32")),
...: "b": pd.Series(["x", "y", "z"], dtype=np.dtype("O")),
...: "c": pd.Series([True, False, np.nan], dtype=np.dtype("O")),
...: "d": pd.Series(["h", "i", np.nan], dtype=np.dtype("O")),
...: "e": pd.Series([10, np.nan, 20], dtype=np.dtype("float")),
...: "f": pd.Series([np.nan, 100.5, 200], dtype=np.dtype("float")),
...: }
...: )
In [41]: dff = df.copy()
In [42]: df
Out[42]:
a b c d e f
0 1 x True h 10.0 NaN
1 2 y False i NaN 100.5
2 3 z NaN NaN 20.0 200.0
In [43]: df.dtypes
Out[43]:
a int32
b object
c object
d object
e float64
f float64
dtype: object
In [44]: df = df.convert_dtypes()
In [45]: df.dtypes
Out[45]:
a Int32
b string
c boolean
d string
e Int64
f float64
dtype: object
In [46]: dff = dff.convert_dtypes(convert_boolean = False)
In [47]: dff.dtypes
Out[47]:
a Int32
b string
c object
d string
e Int64
f float64
dtype: object
A regex or other string parsing method would be uglier and slower.
I’m not sure that anything much could be faster than the above. It calls the function and returns. Try/Catch doesn’t introduce much overhead because the most common exception is caught without an extensive search of stack frames.
The issue is that any numeric conversion function has two kinds of results
A number, if the number is valid
A status code (e.g., via errno) or exception to show that no valid number could be parsed.
C (as an example) hacks around this a number of ways. Python lays it out clearly and explicitly.
def is_number_tryexcept(s):""" Returns True is string is a number. """try:
float(s)returnTrueexceptValueError:returnFalseimport re
def is_number_regex(s):""" Returns True is string is a number. """if re.match("^\d+?\.\d+?$", s)isNone:return s.isdigit()returnTruedef is_number_repl_isdigit(s):""" Returns True is string is a number. """return s.replace('.','',1).isdigit()
funcs =[
is_number_tryexcept,
is_number_regex,
is_number_repl_isdigit
]
a_float ='.1234'print('Float notation ".1234" is not supported by:')for f in funcs:ifnot f(a_float):print('\t -', f.__name__)
浮点符号“ .1234”不受以下支持:
-is_number_regex
scientific1 ='1.000000e+50'
scientific2 ='1e50'print('Scientific notation "1.000000e+50" is not supported by:')for f in funcs:ifnot f(scientific1):print('\t -', f.__name__)print('Scientific notation "1e50" is not supported by:')for f in funcs:ifnot f(scientific2):print('\t -', f.__name__)
import timeit
test_cases =['1.12345','1.12.345','abc12345','12345']
times_n ={f.__name__:[]for f in funcs}for t in test_cases:for f in funcs:
f = f.__name__
times_n[f].append(min(timeit.Timer('%s(t)'%f,'from __main__ import %s, t'%f).repeat(repeat=3, number=1000000)))
测试以下功能的地方
from re import match as re_match
from re import compile as re_compile
def is_number_tryexcept(s):""" Returns True is string is a number. """try:
float(s)returnTrueexceptValueError:returnFalsedef is_number_regex(s):""" Returns True is string is a number. """if re_match("^\d+?\.\d+?$", s)isNone:return s.isdigit()returnTrue
comp = re_compile("^\d+?\.\d+?$")def compiled_regex(s):""" Returns True is string is a number. """if comp.match(s)isNone:return s.isdigit()returnTruedef is_number_repl_isdigit(s):""" Returns True is string is a number. """return s.replace('.','',1).isdigit()
TL;DR The best solution is s.replace('.','',1).isdigit()
I did some benchmarks comparing the different approaches
def is_number_tryexcept(s):
""" Returns True is string is a number. """
try:
float(s)
return True
except ValueError:
return False
import re
def is_number_regex(s):
""" Returns True is string is a number. """
if re.match("^\d+?\.\d+?$", s) is None:
return s.isdigit()
return True
def is_number_repl_isdigit(s):
""" Returns True is string is a number. """
return s.replace('.','',1).isdigit()
If the string is not a number, the except-block is quite slow. But more importantly, the try-except method is the only approach that handles scientific notations correctly.
funcs = [
is_number_tryexcept,
is_number_regex,
is_number_repl_isdigit
]
a_float = '.1234'
print('Float notation ".1234" is not supported by:')
for f in funcs:
if not f(a_float):
print('\t -', f.__name__)
Float notation “.1234” is not supported by:
– is_number_regex
scientific1 = '1.000000e+50'
scientific2 = '1e50'
print('Scientific notation "1.000000e+50" is not supported by:')
for f in funcs:
if not f(scientific1):
print('\t -', f.__name__)
print('Scientific notation "1e50" is not supported by:')
for f in funcs:
if not f(scientific2):
print('\t -', f.__name__)
Scientific notation “1.000000e+50” is not supported by:
– is_number_regex
– is_number_repl_isdigit
Scientific notation “1e50” is not supported by:
– is_number_regex
– is_number_repl_isdigit
EDIT: The benchmark results
import timeit
test_cases = ['1.12345', '1.12.345', 'abc12345', '12345']
times_n = {f.__name__:[] for f in funcs}
for t in test_cases:
for f in funcs:
f = f.__name__
times_n[f].append(min(timeit.Timer('%s(t)' %f,
'from __main__ import %s, t' %f)
.repeat(repeat=3, number=1000000)))
where the following functions were tested
from re import match as re_match
from re import compile as re_compile
def is_number_tryexcept(s):
""" Returns True is string is a number. """
try:
float(s)
return True
except ValueError:
return False
def is_number_regex(s):
""" Returns True is string is a number. """
if re_match("^\d+?\.\d+?$", s) is None:
return s.isdigit()
return True
comp = re_compile("^\d+?\.\d+?$")
def compiled_regex(s):
""" Returns True is string is a number. """
if comp.match(s) is None:
return s.isdigit()
return True
def is_number_repl_isdigit(s):
""" Returns True is string is a number. """
return s.replace('.','',1).isdigit()
There is one exception that you may want to take into account: the string ‘NaN’
If you want is_number to return FALSE for ‘NaN’ this code will not work as Python converts it to its representation of a number that is not a number (talk about identity issues):
>>> float('NaN')
nan
Otherwise, I should actually thank you for the piece of code I now use extensively. :)
which will return true only if there is one or no ‘.’ in the string of digits.
'3.14.5'.replace('.','',1).isdigit()
will return false
edit: just saw another comment …
adding a .replace(badstuff,'',maxnum_badstuff) for other cases can be done. if you are passing salt and not arbitrary condiments (ref:xkcd#974) this will do fine :P
It may take some getting used to, but this is the pythonic way of doing it. As has been already pointed out, the alternatives are worse. But there is one other advantage of doing things this way: polymorphism.
The central idea behind duck typing is that “if it walks and talks like a duck, then it’s a duck.” What if you decide that you need to subclass string so that you can change how you determine if something can be converted into a float? Or what if you decide to test some other object entirely? You can do these things without having to change the above code.
Other languages solve these problems by using interfaces. I’ll save the analysis of which solution is better for another thread. The point, though, is that python is decidedly on the duck typing side of the equation, and you’re probably going to have to get used to syntax like this if you plan on doing much programming in Python (but that doesn’t mean you have to like it of course).
One other thing you might want to take into consideration: Python is pretty fast in throwing and catching exceptions compared to a lot of other languages (30x faster than .Net for instance). Heck, the language itself even throws exceptions to communicate non-exceptional, normal program conditions (every time you use a for loop). Thus, I wouldn’t worry too much about the performance aspects of this code until you notice a significant problem.
回答 6
在Alfe指出您不需要单独检查float之后进行了更新,因为这两种情况都比较复杂:
def is_number(s):try:
complex(s)# for int, long, float and complexexceptValueError:returnFalsereturnTrue
先前曾说过:在极少数情况下,您可能还需要检查复数(例如1 + 2i),而复数不能用浮点数表示:
def is_number(s):try:
float(s)# for int, long and floatexceptValueError:try:
complex(s)# for complexexceptValueError:returnFalsereturnTrue
float.parse('giggity')// throws TypeException
float.parse('54.3')// returns the scalar value 54.3
float.tryParse('twank')// returns None
float.tryParse('32.2')// returns the scalar value 32.2
Note: You don’t want to return the boolean ‘False’ because that’s still a value type. None is better because it indicates failure. Of course, if you want something different you can change the fail parameter to whatever you want.
To extend float to include the ‘parse()’ and ‘try_parse()’ you’ll need to monkeypatch the ‘float’ class to add these methods.
If you want respect pre-existing functions the code should be something like:
For strings of non-numbers, try: except: is actually slower than regular expressions. For strings of valid numbers, regex is slower. So, the appropriate method depends on your input.
If you find that you are in a performance bind, you can use a new third-party module called fastnumbers that provides a function called isfloat. Full disclosure, I am the author. I have included its results in the timings below.
from __future__ import print_function
import timeit
prep_base = '''\
x = 'invalid'
y = '5402'
z = '4.754e3'
'''
prep_try_method = '''\
def is_number_try(val):
try:
float(val)
return True
except ValueError:
return False
'''
prep_re_method = '''\
import re
float_match = re.compile(r'[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?$').match
def is_number_re(val):
return bool(float_match(val))
'''
fn_method = '''\
from fastnumbers import isfloat
'''
print('Try with non-number strings', timeit.timeit('is_number_try(x)',
prep_base + prep_try_method), 'seconds')
print('Try with integer strings', timeit.timeit('is_number_try(y)',
prep_base + prep_try_method), 'seconds')
print('Try with float strings', timeit.timeit('is_number_try(z)',
prep_base + prep_try_method), 'seconds')
print()
print('Regex with non-number strings', timeit.timeit('is_number_re(x)',
prep_base + prep_re_method), 'seconds')
print('Regex with integer strings', timeit.timeit('is_number_re(y)',
prep_base + prep_re_method), 'seconds')
print('Regex with float strings', timeit.timeit('is_number_re(z)',
prep_base + prep_re_method), 'seconds')
print()
print('fastnumbers with non-number strings', timeit.timeit('isfloat(x)',
prep_base + 'from fastnumbers import isfloat'), 'seconds')
print('fastnumbers with integer strings', timeit.timeit('isfloat(y)',
prep_base + 'from fastnumbers import isfloat'), 'seconds')
print('fastnumbers with float strings', timeit.timeit('isfloat(z)',
prep_base + 'from fastnumbers import isfloat'), 'seconds')
print()
Try with non-number strings 2.39108395576 seconds
Try with integer strings 0.375686168671 seconds
Try with float strings 0.369210958481 seconds
Regex with non-number strings 0.748660802841 seconds
Regex with integer strings 1.02021503448 seconds
Regex with float strings 1.08564686775 seconds
fastnumbers with non-number strings 0.174362897873 seconds
fastnumbers with integer strings 0.179651021957 seconds
fastnumbers with float strings 0.20222902298 seconds
As you can see
try: except: was fast for numeric input but very slow for an invalid input
I know this is particularly old but I would add an answer I believe covers the information missing from the highest voted answer that could be very valuable to any who find this:
For each of the following methods connect them with a count if you need any input to be accepted. (Assuming we are using vocal definitions of integers rather than 0-255, etc.)
x.isdigit()
works well for checking if x is an integer.
x.replace('-','').isdigit()
works well for checking if x is a negative.(Check – in first position)
x.replace('.','').isdigit()
works well for checking if x is a decimal.
x.replace(':','').isdigit()
works well for checking if x is a ratio.
x.replace('/','',1).isdigit()
works well for checking if x is a fraction.
def is_number(n):try:
float(n)# Type-casting the string to `float`.# If string is not a valid `float`, # it'll raise `ValueError` exceptionexceptValueError:returnFalsereturnTrue
# `nan_num` variable is taken from above example>>> nan_num == nan_num
False
因此,上述功能is_number可以更新,返回False的"NaN"是:
def is_number(n):
is_number =Truetry:
num = float(n)# check for "nan" floats
is_number = num == num # or use `math.isnan(num)`exceptValueError:
is_number =Falsereturn is_number
样品运行:
>>> is_number('Nan')# not a number "Nan" stringFalse>>> is_number('nan')# not a number string "nan" with all lower casedFalse>>> is_number('123')# positive integerTrue>>> is_number('-123')# negative integerTrue>>> is_number('-1.12')# negative `float`True>>> is_number('abc')# "some random" stringFalse
This answer provides step by step guide having function with examples to find the string is:
Positive integer
Positive/negative – integer/float
How to discard “NaN” (not a number) strings while checking for number?
Check if string is positive integer
You may use str.isdigit() to check whether given string is positive integer.
Sample Results:
# For digit
>>> '1'.isdigit()
True
>>> '1'.isalpha()
False
Check for string as positive/negative – integer/float
str.isdigit() returns False if the string is a negative number or a float number. For example:
# returns `False` for float
>>> '123.3'.isdigit()
False
# returns `False` for negative number
>>> '-123'.isdigit()
False
If you want to also check for the negative integers and float, then you may write a custom function to check for it as:
def is_number(n):
try:
float(n) # Type-casting the string to `float`.
# If string is not a valid `float`,
# it'll raise `ValueError` exception
except ValueError:
return False
return True
Sample Run:
>>> is_number('123') # positive integer number
True
>>> is_number('123.4') # positive float number
True
>>> is_number('-123') # negative integer number
True
>>> is_number('-123.4') # negative `float` number
True
>>> is_number('abc') # `False` for "some random" string
False
Discard “NaN” (not a number) strings while checking for number
The above functions will return True for the “NAN” (Not a number) string because for Python it is valid float representing it is not a number. For example:
>>> is_number('NaN')
True
In order to check whether the number is “NaN”, you may use math.isnan() as:
>>> import math
>>> nan_num = float('nan')
>>> math.isnan(nan_num)
True
Or if you don’t want to import additional library to check this, then you may simply check it via comparing it with itself using ==. Python returns False when nan float is compared with itself. For example:
# `nan_num` variable is taken from above example
>>> nan_num == nan_num
False
Hence, above function is_number can be updated to return False for "NaN" as:
def is_number(n):
is_number = True
try:
num = float(n)
# check for "nan" floats
is_number = num == num # or use `math.isnan(num)`
except ValueError:
is_number = False
return is_number
Sample Run:
>>> is_number('Nan') # not a number "Nan" string
False
>>> is_number('nan') # not a number string "nan" with all lower cased
False
>>> is_number('123') # positive integer
True
>>> is_number('-123') # negative integer
True
>>> is_number('-1.12') # negative `float`
True
>>> is_number('abc') # "some random" string
False
PS: Each operation for each check depending on the type of number comes with additional overhead. Choose the version of is_number function which fits your requirement.
Casting to float and catching ValueError is probably the fastest way, since float() is specifically meant for just that. Anything else that requires string parsing (regex, etc) will likely be slower due to the fact that it’s not tuned for this operation. My $0.02.
回答 13
您可以使用Unicode字符串,它们有一种方法可以执行您想要的操作:
>>> s = u"345">>> s.isnumeric()True
要么:
>>> s ="345">>> u = unicode(s)>>> u.isnumeric()True
import time, re, random, string
ITERATIONS =10000000classTimer:def __enter__(self):
self.start = time.clock()return self
def __exit__(self,*args):
self.end = time.clock()
self.interval = self.end - self.start
def check_regexp(x):return re.compile("^\d*\.?\d*$").match(x)isnotNonedef check_replace(x):return x.replace('.','',1).isdigit()def check_exception(s):try:
float(s)returnTrueexceptValueError:returnFalse
to_check =[check_regexp, check_replace, check_exception]print('preparing data...')
good_numbers =[
str(random.random()/ random.random())for x in range(ITERATIONS)]
bad_numbers =['.'+ x for x in good_numbers]
strings =[''.join(random.choice(string.ascii_uppercase + string.digits)for _ in range(random.randint(1,10)))for x in range(ITERATIONS)]print('running test...')for func in to_check:withTimer()as t:for x in good_numbers:
res = func(x)print('%s with good floats: %s'%(func.__name__, t.interval))withTimer()as t:for x in bad_numbers:
res = func(x)print('%s with bad floats: %s'%(func.__name__, t.interval))withTimer()as t:for x in strings:
res = func(x)print('%s with strings: %s'%(func.__name__, t.interval))
以下是2017年MacBook Pro 13上Python 2.7.10的结果:
check_regexp with good floats:12.688639
check_regexp with bad floats:11.624862
check_regexp with strings:11.349414
check_replace with good floats:4.419841
check_replace with bad floats:4.294909
check_replace with strings:4.086358
check_exception with good floats:3.276668
check_exception with bad floats:13.843092
check_exception with strings:15.786169
以下是2017年MacBook Pro 13上Python 3.6.5的结果:
check_regexp with good floats:13.472906000000009
check_regexp with bad floats:12.977665000000016
check_regexp with strings:12.417542999999995
check_replace with good floats:6.011045999999993
check_replace with bad floats:4.849356
check_replace with strings:4.282754000000011
check_exception with good floats:6.039081999999979
check_exception with bad floats:9.322753000000006
check_exception with strings:9.952595000000002
以下是2017年MacBook Pro 13上PyPy 2.7.13的结果:
check_regexp with good floats:2.693217
check_regexp with bad floats:2.744819
check_regexp with strings:2.532414
check_replace with good floats:0.604367
check_replace with bad floats:0.538169
check_replace with strings:0.598664
check_exception with good floats:1.944103
check_exception with bad floats:2.449182
check_exception with strings:2.200056
I wanted to see which method is fastest. Overall the best and most consistent results were given by the check_replace function. The fastest results were given by the check_exception function, but only if there was no exception fired – meaning its code is the most efficient, but the overhead of throwing an exception is quite large.
Please note that checking for a successful cast is the only method which is accurate, for example, this works with check_exception but the other two test functions will return False for a valid float:
huge_number = float('1e+100')
Here is the benchmark code:
import time, re, random, string
ITERATIONS = 10000000
class Timer:
def __enter__(self):
self.start = time.clock()
return self
def __exit__(self, *args):
self.end = time.clock()
self.interval = self.end - self.start
def check_regexp(x):
return re.compile("^\d*\.?\d*$").match(x) is not None
def check_replace(x):
return x.replace('.','',1).isdigit()
def check_exception(s):
try:
float(s)
return True
except ValueError:
return False
to_check = [check_regexp, check_replace, check_exception]
print('preparing data...')
good_numbers = [
str(random.random() / random.random())
for x in range(ITERATIONS)]
bad_numbers = ['.' + x for x in good_numbers]
strings = [
''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(random.randint(1,10)))
for x in range(ITERATIONS)]
print('running test...')
for func in to_check:
with Timer() as t:
for x in good_numbers:
res = func(x)
print('%s with good floats: %s' % (func.__name__, t.interval))
with Timer() as t:
for x in bad_numbers:
res = func(x)
print('%s with bad floats: %s' % (func.__name__, t.interval))
with Timer() as t:
for x in strings:
res = func(x)
print('%s with strings: %s' % (func.__name__, t.interval))
Here are the results with Python 2.7.10 on a 2017 MacBook Pro 13:
check_regexp with good floats: 12.688639
check_regexp with bad floats: 11.624862
check_regexp with strings: 11.349414
check_replace with good floats: 4.419841
check_replace with bad floats: 4.294909
check_replace with strings: 4.086358
check_exception with good floats: 3.276668
check_exception with bad floats: 13.843092
check_exception with strings: 15.786169
Here are the results with Python 3.6.5 on a 2017 MacBook Pro 13:
check_regexp with good floats: 13.472906000000009
check_regexp with bad floats: 12.977665000000016
check_regexp with strings: 12.417542999999995
check_replace with good floats: 6.011045999999993
check_replace with bad floats: 4.849356
check_replace with strings: 4.282754000000011
check_exception with good floats: 6.039081999999979
check_exception with bad floats: 9.322753000000006
check_exception with strings: 9.952595000000002
Here are the results with PyPy 2.7.13 on a 2017 MacBook Pro 13:
check_regexp with good floats: 2.693217
check_regexp with bad floats: 2.744819
check_regexp with strings: 2.532414
check_replace with good floats: 0.604367
check_replace with bad floats: 0.538169
check_replace with strings: 0.598664
check_exception with good floats: 1.944103
check_exception with bad floats: 2.449182
check_exception with strings: 2.200056
def is_number(s):try:
n=str(float(s))if n =="nan"or n=="inf"or n=="-inf":returnFalseexceptValueError:try:
complex(s)# for complexexceptValueError:returnFalsereturnTrue
>>> a=454>>> a.isdigit()Traceback(most recent call last):File"<stdin>", line 1,in<module>AttributeError:'int' object has no attribute 'isdigit'>>> a="454">>> a.isdigit()True
Finds whether the given variable is numeric. Numeric strings consist of optional sign, any number of digits, optional decimal part and optional exponential part. Thus +0123.45e6 is a valid numeric value. Hexadecimal (e.g. 0xf4c3b00c) and binary (e.g. 0b10100111001) notation is not allowed.
is_numeric function
import ast
import numbers
def is_numeric(obj):
if isinstance(obj, numbers.Number):
return True
elif isinstance(obj, str):
nodes = list(ast.walk(ast.parse(obj)))[1:]
if not isinstance(nodes[0], ast.Expr):
return False
if not isinstance(nodes[-1], ast.Num):
return False
nodes = nodes[1:-1]
for i in range(len(nodes)):
#if used + or - in digit :
if i % 2 == 0:
if not isinstance(nodes[i], ast.UnaryOp):
return False
else:
if not isinstance(nodes[i], (ast.USub, ast.UAdd)):
return False
return True
else:
return False
Finds whether the given variable is float. float strings consist of optional sign, any number of digits, …
import ast
def is_float(obj):
if isinstance(obj, float):
return True
if isinstance(obj, int):
return False
elif isinstance(obj, str):
nodes = list(ast.walk(ast.parse(obj)))[1:]
if not isinstance(nodes[0], ast.Expr):
return False
if not isinstance(nodes[-1], ast.Num):
return False
if not isinstance(nodes[-1].n, float):
return False
nodes = nodes[1:-1]
for i in range(len(nodes)):
if i % 2 == 0:
if not isinstance(nodes[i], ast.UnaryOp):
return False
else:
if not isinstance(nodes[i], (ast.USub, ast.UAdd)):
return False
return True
else:
return False
I did some speed test. Lets say that if the string is likely to be a number the try/except strategy is the fastest possible.If the string is not likely to be a number and you are interested in Integer check, it worths to do some test (isdigit plus heading ‘-‘).
If you are interested to check float number, you have to use the try/except code whitout escape.
def str_to_type (s):""" Get possible cast type for a string
Parameters
----------
s : string
Returns
-------
float,int,str,bool : type
Depending on what it can be cast to
"""try:
f = float(s)if"."notin s:return int
return float
exceptValueError:
value = s.upper()if value =="TRUE"or value =="FALSE":return bool
return type(s)
例
str_to_type("true")# bool
str_to_type("6.0")# float
str_to_type("6")# int
str_to_type("6abc")# str
str_to_type(u"6abc")# unicode
您可以捕获类型并使用它
s ="6.0"
type_ = str_to_type(s)# float
f = type_(s)
I needed to determine if a string cast into basic types (float,int,str,bool). After not finding anything on the internet I created this:
def str_to_type (s):
""" Get possible cast type for a string
Parameters
----------
s : string
Returns
-------
float,int,str,bool : type
Depending on what it can be cast to
"""
try:
f = float(s)
if "." not in s:
return int
return float
except ValueError:
value = s.upper()
if value == "TRUE" or value == "FALSE":
return bool
return type(s)
If you want to return False for a NaN and Inf, change line to x = float(s); return (x == x) and (x – 1 != x). This should return True for all floats except Inf and NaN
But this doesn’t quite work, because for sufficiently large floats, x-1 == x returns true. For example, 2.0**54 - 1 == 2.0**54
check_regexp with good floats:18.001921
check_regexp with bad floats:17.861423
check_regexp with strings:17.558862
check_correct_regexp with good floats:11.04428
check_correct_regexp with bad floats:8.71211
check_correct_regexp with strings:8.144161
check_replace with good floats:6.020597
check_replace with bad floats:5.343049
check_replace with strings:5.091642
check_exception with good floats:5.201605
check_exception with bad floats:23.921864
check_exception with strings:23.755481
I think your solution is fine, but there is a correct regexp implementation.
There does seem to be a lot of regexp hate towards these answers which I think is unjustified, regexps can be reasonably clean and correct and fast. It really depends on what you’re trying to do. The original question was how can you “check if a string can be represented as a number (float)” (as per your title). Presumably you would want to use the numeric/float value once you’ve checked that it’s valid, in which case your try/except makes a lot of sense. But if, for some reason, you just want to validate that a string is a number then a regex also works fine, but it’s hard to get correct. I think most of the regex answers so far, for example, do not properly parse strings without an integer part (such as “.7”) which is a float as far as python is concerned. And that’s slightly tricky to check for in a single regex where the fractional portion is not required. I’ve included two regex to show this.
It does raise the interesting question as to what a “number” is. Do you include “inf” which is valid as a float in python? Or do you include numbers that are “numbers” but maybe can’t be represented in python (such as numbers that are larger than the float max).
There’s also ambiguities in how you parse numbers. For example, what about “–20”? Is this a “number”? Is this a legal way to represent “20”? Python will let you do “var = –20” and set it to 20 (though really this is because it treats it as an expression), but float(“–20”) does not work.
Anyways, without more info, here’s a regex that I believe covers all the ints and floats as python parses them.
# Doesn't properly handle floats missing the integer part, such as ".7"
SIMPLE_FLOAT_REGEXP = re.compile(r'^[-+]?[0-9]+\.?[0-9]+([eE][-+]?[0-9]+)?$')
# Example "-12.34E+56" # sign (-)
# integer (12)
# mantissa (34)
# exponent (E+56)
# Should handle all floats
FLOAT_REGEXP = re.compile(r'^[-+]?([0-9]+|[0-9]*\.[0-9]+)([eE][-+]?[0-9]+)?$')
# Example "-12.34E+56" # sign (-)
# integer (12)
# OR
# int/mantissa (12.34)
# exponent (E+56)
def is_float(str):
return True if FLOAT_REGEXP.match(str) else False
Running the benchmarking code in @ron-reiter’s answer shows that this regex is actually faster than the normal regex and is much faster at handling bad values than the exception, which makes some sense. Results:
check_regexp with good floats: 18.001921
check_regexp with bad floats: 17.861423
check_regexp with strings: 17.558862
check_correct_regexp with good floats: 11.04428
check_correct_regexp with bad floats: 8.71211
check_correct_regexp with strings: 8.144161
check_replace with good floats: 6.020597
check_replace with bad floats: 5.343049
check_replace with strings: 5.091642
check_exception with good floats: 5.201605
check_exception with bad floats: 23.921864
check_exception with strings: 23.755481
回答 21
import re
def is_number(num):
pattern = re.compile(r'^[-+]?[-0-9]\d*\.\d*|[-+]?\.?[0-9]\d*$')
result = pattern.match(num)if result:returnTrueelse:returnFalse>>>: is_number('1')True>>>: is_number('111')True>>>: is_number('11.1')True>>>: is_number('-11.1')True>>>: is_number('inf')False>>>: is_number('-inf')False
Replace the myvar.apppend with whatever operation you want to do with the string if it turns out to be a number. The idea is to try to use a float() operation and use the returned error to determine whether or not the string is a number.
I also used the function you mentioned, but soon I notice that strings as “Nan”, “Inf” and it’s variation are considered as number. So I propose you improved version of your function, that will return false on those type of input and will not fail “1e3” variants:
def is_float(text):
try:
float(text)
# check for nan/infinity etc.
if text.isalpha():
return False
return True
except ValueError:
return False
import sys
def fix_quotes(s):try:
float(s)return s
exceptValueError:return'"{0}"'.format(s)for line in sys.stdin:
input = line.split()print input[0],'<- c(',','.join(fix_quotes(c)for c in input[1:]),')'
You can generalize the exception technique in a useful way by returning more useful values than True and False. For example this function puts quotes round strings but leaves numbers alone. Which is just what I needed for a quick and dirty filter to make some variable definitions for R.
import sys
def fix_quotes(s):
try:
float(s)
return s
except ValueError:
return '"{0}"'.format(s)
for line in sys.stdin:
input = line.split()
print input[0], '<- c(', ','.join(fix_quotes(c) for c in input[1:]), ')'
I was working on a problem that led me to this thread, namely how to convert a collection of data to strings and numbers in the most intuitive way. I realized after reading the original code that what I needed was different in two ways:
1 – I wanted an integer result if the string represented an integer
2 – I wanted a number or a string result to stick into a data structure
so I adapted the original code to produce this derivative:
def string_or_number(s):
try:
z = int(s)
return z
except ValueError:
try:
z = float(s)
return z
except ValueError:
return s
回答 28
尝试这个。
def is_number(var):try:if var == int(var):returnTrueexceptException:returnFalse
def is_number(var):
try:
if var == int(var):
return True
except Exception:
return False
回答 29
def is_float(s):if s isNone:returnFalseif len(s)==0:returnFalse
digits_count =0
dots_count =0
signs_count =0for c in s:if'0'<= c <='9':
digits_count +=1elif c =='.':
dots_count +=1elif c =='-'or c =='+':
signs_count +=1else:returnFalseif digits_count ==0:returnFalseif dots_count >1:returnFalseif signs_count >1:returnFalsereturnTrue
def is_float(s):
if s is None:
return False
if len(s) == 0:
return False
digits_count = 0
dots_count = 0
signs_count = 0
for c in s:
if '0' <= c <= '9':
digits_count += 1
elif c == '.':
dots_count += 1
elif c == '-' or c == '+':
signs_count += 1
else:
return False
if digits_count == 0:
return False
if dots_count > 1:
return False
if signs_count > 1:
return False
return True