I want to set the dtypes of multiple columns in a pd.DataFrame (I have a file that I've had to manually parse into a list of lists, as the file was not amenable to pd.read_csv):
import pandas as pd
print pd.DataFrame([['a','1'],['b','2']],
dtype={'x':'object','y':'int'},
columns=['x','y'])
I get
ValueError: entry not a 2- or 3- tuple
The only way I can set them is by looping through each column variable and recasting with astype.
dtypes = {'x':'object','y':'int'}
mydata = pd.DataFrame([['a','1'],['b','2']],
columns=['x','y'])
for c in mydata.columns:
    mydata[c] = mydata[c].astype(dtypes[c])
print mydata['y'].dtype #=> int64
Is there a better way?
Since 0.17, you have to use the explicit conversions:
pd.to_datetime, pd.to_timedelta and pd.to_numeric
(As mentioned below, no more “magic”, convert_objects has been deprecated in 0.17)
df = pd.DataFrame({'x': {0: 'a', 1: 'b'}, 'y': {0: '1', 1: '2'}, 'z': {0: '2018-05-01', 1: '2018-05-02'}})
df.dtypes
x object
y object
z object
dtype: object
df
x y z
0 a 1 2018-05-01
1 b 2 2018-05-02
You can apply these to each column you want to convert:
df["y"] = pd.to_numeric(df["y"])
df["z"] = pd.to_datetime(df["z"])
df
x y z
0 a 1 2018-05-01
1 b 2 2018-05-02
df.dtypes
x object
y int64
z datetime64[ns]
dtype: object
This confirms the dtypes have been updated.
OLD/DEPRECATED ANSWER for pandas 0.12 – 0.16: You can use convert_objects to infer better dtypes:
In [21]: df
Out[21]:
x y
0 a 1
1 b 2
In [22]: df.dtypes
Out[22]:
x object
y object
dtype: object
In [23]: df.convert_objects(convert_numeric=True)
Out[23]:
x y
0 a 1
1 b 2
In [24]: df.convert_objects(convert_numeric=True).dtypes
Out[24]:
x object
y int64
dtype: object
You can set the types explicitly with pandas DataFrame.astype(dtype, copy=True, raise_on_error=True, **kwargs) and pass in a dictionary mapping each column name to the dtype you want.
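For instance, on a reasonably recent pandas the dict form looks like this (a minimal sketch built on the question's data):
import pandas as pd

mydata = pd.DataFrame([['a', '1'], ['b', '2']], columns=['x', 'y'])
mydata = mydata.astype({'x': 'object', 'y': 'int'})  # dict maps column name -> dtype
print(mydata['y'].dtype)  # int64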
Another way to set the column types is to first construct a numpy record array with your desired types, fill it out and then pass it to a DataFrame constructor.
import pandas as pd
import numpy as np
x = np.empty((10,), dtype=[('x', np.uint8), ('y', np.float64)])
df = pd.DataFrame(x)
df.dtypes ->
x uint8
y float64
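To illustrate filling the record array before handing it to the constructor (the values here are made up):
x['x'] = np.arange(10, dtype=np.uint8)   # fill the uint8 column
x['y'] = np.linspace(0.0, 1.0, 10)       # fill the float64 column
df = pd.DataFrame(x)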
I was facing a similar problem. In my case I have thousands of files of Cisco logs that I need to parse manually.
In order to be flexible with fields and types, I have successfully tested using StringIO + read_csv, which indeed does accept a dict for the dtype specification.
I usually get each of the files (5k-20k lines) into a buffer and create the dtype dictionaries dynamically.
Eventually I concatenate (with categorical... thanks to 0.19) these dataframes into a large dataframe that I dump into HDF5.
Something along these lines:
import pandas as pd
import io
output = io.StringIO()
output.write('A,1,20,31\n')
output.write('B,2,21,32\n')
output.write('C,3,22,33\n')
output.write('D,4,23,34\n')
output.seek(0)
df = pd.read_csv(output, header=None,
                 names=["A", "B", "C", "D"],
                 dtype={"A": "category", "B": "float32", "C": "int32", "D": "float64"},
                 sep=",")
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
A 5 non-null category
B 5 non-null float32
C 5 non-null int32
D 5 non-null float64
dtypes: category(1), float32(1), float64(1), int32(1)
memory usage: 205.0 bytes
None
Not very pythonic... but it does the job.
Hope it helps.
JC
You're better off using typed np.arrays, and then passing the data and column names as a dictionary.
import numpy as np
import pandas as pd
# Feature: np arrays are 1: efficient, 2: can be pre-sized
x = np.array(['a', 'b'], dtype=object)
y = np.array([ 1 , 2 ], dtype=np.int32)
df = pd.DataFrame({
    'x': x,  # Feature: column name is near data array
    'y': y,
})
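To confirm that the dtypes carry over from the arrays (a quick check, not part of the original answer):
print(df.dtypes)
# x     object
# y      int32
# dtype: object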
I have seen many answers posted to questions on Stack Overflow involving the use of the Pandas method apply. I have also seen users commenting under them saying that “apply is slow, and should be avoided”.
I have read many articles on the topic of performance that explain apply is slow. I have also seen a disclaimer in the docs about how apply is simply a convenience function for passing UDFs (can’t seem to find that now). So, the general consensus is that apply should be avoided if possible. However, this raises the following questions:
If apply is so bad, then why is it in the API?
How and when should I make my code apply-free?
Are there ever any situations where apply is good (better than other possible solutions)?
We start by addressing the questions in the OP, one by one.
“If apply is so bad, then why is it in the API?”
DataFrame.apply and Series.apply are convenience functions defined on DataFrame and Series objects respectively. apply accepts any user-defined function that applies a transformation/aggregation on a DataFrame. apply is effectively a silver bullet that does whatever any existing pandas function cannot do.
Some of the things apply can do:
Run any user-defined function on a DataFrame or Series
Apply a function either row-wise (axis=1) or column-wise (axis=0) on a DataFrame
Perform index alignment while applying the function
Perform aggregation with user-defined functions (however, we usually prefer agg or transform in these cases)
Perform element-wise transformations
Broadcast aggregated results to original rows (see the result_type argument).
Accept positional/keyword arguments to pass to the user-defined functions.
So, with all these features, why is apply bad? It is because apply is slow. Pandas makes no assumptions about the nature of your function, and so iteratively applies your function to each row/column as necessary. Additionally, handling all of the situations above means apply incurs some major overhead at each iteration. Further, apply consumes a lot more memory, which is a challenge for memory-bound applications.
There are very few situations where apply is appropriate to use (more on that below). If you’re not sure whether you should be using apply, you probably shouldn’t.
Let’s address the next question.
“How and when should I make my code apply-free?”
To rephrase, here are some common situations where you will want to get rid of any calls to apply.
Numeric Data
If you’re working with numeric data, there is likely already a vectorized cython function that does exactly what you’re trying to do (if not, please either ask a question on Stack Overflow or open a feature request on GitHub).
Contrast the performance of apply for a simple addition operation.
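The df used in these snippets is not defined in the text; the outputs shown are consistent with a small frame like this (a reconstruction, not necessarily the author's exact data):
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [9, 4, 2, 1], 'B': [12, 7, 5, 4]})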
df.apply(np.sum)
A 16
B 28
dtype: int64
df.sum()
A 16
B 28
dtype: int64
Performance wise, there’s no comparison, the cythonized equivalent is much faster. There’s no need for a graph, because the difference is obvious even for toy data.
%timeit df.apply(np.sum)
%timeit df.sum()
2.22 ms ± 41.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
471 µs ± 8.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Even if you enable passing raw arrays with the raw argument, it’s still twice as slow.
%timeit df.apply(np.sum, raw=True)
840 µs ± 691 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Another example:
df.apply(lambda x: x.max() - x.min())
A 8
B 8
dtype: int64
df.max() - df.min()
A 8
B 8
dtype: int64
%timeit df.apply(lambda x: x.max() - x.min())
%timeit df.max() - df.min()
2.43 ms ± 450 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.23 ms ± 14.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In general, seek out vectorized alternatives if possible.
String/Regex
Pandas provides “vectorized” string functions in most situations, but there are rare cases where those functions do not… “apply”, so to speak.
A common problem is to check whether a value in a column is present in another column of the same row.
df = pd.DataFrame({
'Name': ['mickey', 'donald', 'minnie'],
'Title': ['wonderland', "welcome to donald's castle", 'Minnie mouse clubhouse'],
'Value': [20, 10, 86]})
df
Name Value Title
0 mickey 20 wonderland
1 donald 10 welcome to donald's castle
2 minnie 86 Minnie mouse clubhouse
This should return the second and third rows, since "donald" and "minnie" are present in their respective "Title" columns.
Using apply, this would be done using
df.apply(lambda x: x['Name'].lower() in x['Title'].lower(), axis=1)
0 False
1 True
2 True
dtype: bool
df[df.apply(lambda x: x['Name'].lower() in x['Title'].lower(), axis=1)]
Name Title Value
1 donald welcome to donald's castle 10
2 minnie Minnie mouse clubhouse 86
However, a better solution exists using list comprehensions.
df[[y.lower() in x.lower() for x, y in zip(df['Title'], df['Name'])]]
Name Title Value
1 donald welcome to donald's castle 10
2 minnie Minnie mouse clubhouse 86
%timeit df[df.apply(lambda x: x['Name'].lower() in x['Title'].lower(), axis=1)]
%timeit df[[y.lower() in x.lower() for x, y in zip(df['Title'], df['Name'])]]
2.85 ms ± 38.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
788 µs ± 16.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The thing to note here is that iterative routines happen to be faster than apply, because of the lower overhead. If you need to handle NaNs and invalid dtypes, you can build on this using a custom function you can then call with arguments inside the list comprehension.
Note
Date and datetime operations also have vectorized versions. So, for example, you should prefer pd.to_datetime(df['date']), over,
say, df['date'].apply(pd.to_datetime).
Another common case is a Series whose values are lists:
s = pd.Series([[1, 2]] * 3)
s
0 [1, 2]
1 [1, 2]
2 [1, 2]
dtype: object
People are tempted to use apply(pd.Series). This is horrible in terms of performance.
s.apply(pd.Series)
0 1
0 1 2
1 1 2
2 1 2
A better option is to listify the column and pass it to pd.DataFrame.
pd.DataFrame(s.tolist())
0 1
0 1 2
1 1 2
2 1 2
%timeit s.apply(pd.Series)
%timeit pd.DataFrame(s.tolist())
2.65 ms ± 294 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
816 µs ± 40.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Lastly,
“Are there any situations where apply is good?”
Apply is a convenience function, so there are situations where the overhead is negligible enough to forgive. It really depends on how many times the function is called.
Functions that are Vectorized for Series, but not DataFrames
What if you want to apply a string operation on multiple columns? What if you want to convert multiple columns to datetime? These functions are vectorized for Series only, so they must be applied over each column that you want to convert/operate on.
Note that it would also make sense to stack, or just use an explicit loop. All these options are slightly faster than using apply, but the difference is small enough to forgive.
%timeit df.apply(pd.to_datetime, errors='coerce')
%timeit pd.to_datetime(df.stack(), errors='coerce').unstack()
%timeit pd.concat([pd.to_datetime(df[c], errors='coerce') for c in df], axis=1)
%timeit for c in df.columns: df[c] = pd.to_datetime(df[c], errors='coerce')
5.49 ms ± 247 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.94 ms ± 48.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.16 ms ± 216 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.41 ms ± 1.71 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You can make a similar case for other operations such as string operations, or conversion to category.
u = df.apply(lambda x: x.str.contains(...))
v = df.apply(lambda x: x.astype('category'))
versus
u = pd.concat([df[c].str.contains(...) for c in df], axis=1)
v = df.copy()
for c in df:
    v[c] = df[c].astype('category')
And so on…
Converting Series to str: astype versus apply
This seems like an idiosyncrasy of the API. Using apply to convert integers in a Series to string is comparable to (and sometimes faster than) using astype.
import perfplot
perfplot.show(
setup=lambda n: pd.Series(np.random.randint(0, n, n)),
kernels=[
lambda s: s.astype(str),
lambda s: s.apply(str)
],
labels=['astype', 'apply'],
n_range=[2**k for k in range(1, 20)],
xlabel='N',
logx=True,
logy=True,
equality_check=lambda x, y: (x == y).all())
With floats, I see that astype is consistently as fast as, or slightly faster than, apply. So this has to do with the fact that the data in the test is of integer type.
GroupBy operations with chained transformations
GroupBy.apply has not been discussed until now, but GroupBy.apply is also an iterative convenience function to handle anything that the existing GroupBy functions do not.
One common requirement is to perform a GroupBy and then two prime operations such as a “lagged cumsum”:
df = pd.DataFrame({"A": list('aabcccddee'), "B": [12, 7, 5, 4, 5, 4, 3, 2, 1, 10]})
df
A B
0 a 12
1 a 7
2 b 5
3 c 4
4 c 5
5 c 4
6 d 3
7 d 2
8 e 1
9 e 10
You’d need two successive groupby calls here:
df.groupby('A').B.cumsum().groupby(df.A).shift()
0 NaN
1 12.0
2 NaN
3 NaN
4 4.0
5 9.0
6 NaN
7 3.0
8 NaN
9 1.0
Name: B, dtype: float64
Using apply, you can shorten this to a single call.
df.groupby('A').B.apply(lambda x: x.cumsum().shift())
0 NaN
1 12.0
2 NaN
3 NaN
4 4.0
5 9.0
6 NaN
7 3.0
8 NaN
9 1.0
Name: B, dtype: float64
It is very hard to quantify the performance because it depends on the data. But in general, apply is an acceptable solution if the goal is to reduce a groupby call (because groupby is also quite expensive).
Other Caveats
Aside from the caveats mentioned above, it is also worth mentioning that apply operates on the first row (or column) twice. This is done to determine whether the function has any side effects. If not, apply may be able to use a fast-path for evaluating the result, else it falls back to a slow implementation.
df = pd.DataFrame({
'A': [1, 2],
'B': ['x', 'y']
})
def func(x):
    print(x['A'])
    return x
df.apply(func, axis=1)
# 1
# 1
# 2
A B
0 1 x
1 2 y
This behaviour is also seen in GroupBy.apply on pandas versions <0.25 (it was fixed for 0.25, see here for more information.)
The chart below suggests when to consider apply [1]. Green means possibly efficient; red means avoid.
Some of this is intuitive: pd.Series.apply is a Python-level row-wise loop, ditto pd.DataFrame.apply row-wise (axis=1). The misuses of these are many and wide-ranging. The other post deals with them in more depth. Popular solutions are to use vectorised methods, list comprehensions (assumes clean data), or efficient tools such as the pd.DataFrame constructor (e.g. to avoid apply(pd.Series)).
If you are using pd.DataFrame.apply row-wise, specifying raw=True (where possible) is often beneficial. At this stage, numba is usually a better choice.
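As a rough sketch of the numba route for a row-wise reduction (hypothetical code, not from the original post; it assumes numba is installed and that df holds only numeric columns):
import numba
import numpy as np

@numba.njit
def row_range(arr):
    # Row-wise max minus min over a 2D numeric array, compiled by numba.
    n, m = arr.shape
    out = np.empty(n)
    for i in range(n):
        lo = arr[i, 0]
        hi = arr[i, 0]
        for j in range(1, m):
            v = arr[i, j]
            if v < lo:
                lo = v
            if v > hi:
                hi = v
        out[i] = hi - lo
    return out

result = row_range(df.to_numpy())  # comparable to df.apply(lambda x: x.max() - x.min(), axis=1)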
GroupBy.apply: generally favoured
Repeating groupby operations to avoid apply will hurt performance. GroupBy.apply is usually fine here, provided the methods you use in your custom function are themselves vectorised. Sometimes there is no native Pandas method for a groupwise aggregation you wish to apply. In this case, for a small number of groups apply with a custom function may still offer reasonable performance.
pd.DataFrame.apply column-wise: a mixed bag
pd.DataFrame.apply column-wise (axis=0) is an interesting case. For a small number of rows versus a large number of columns, it’s almost always expensive. For a large number of rows relative to columns, the more common case, you may sometimes see significant performance improvements using apply:
# Python 3.7, Pandas 0.23.4
np.random.seed(0)
df = pd.DataFrame(np.random.random((10**7, 3))) # Scenario_1, many rows
df = pd.DataFrame(np.random.random((10**4, 10**3))) # Scenario_2, many columns
# Scenario_1 | Scenario_2
%timeit df.sum() # 800 ms | 109 ms
%timeit df.apply(pd.Series.sum) # 568 ms | 325 ms
%timeit df.max() - df.min() # 1.63 s | 314 ms
%timeit df.apply(lambda x: x.max() - x.min()) # 838 ms | 473 ms
%timeit df.mean() # 108 ms | 94.4 ms
%timeit df.apply(pd.Series.mean) # 276 ms | 233 ms
[1] There are exceptions, but these are usually marginal or uncommon. A couple of examples:
df['col'].apply(str) may slightly outperform df['col'].astype(str).
df.apply(pd.to_datetime) working on strings doesn’t scale well with rows versus a regular for loop.
For axis=1 (i.e. row-wise functions), you can just use the following function in lieu of apply. I wonder why this isn't the pandas behavior. (Untested with compound indexes, but it does appear to be much faster than apply.)
def faster_df_apply(df, func):
    cols = list(df.columns)
    data, index = [], []
    for row in df.itertuples(index=True):
        row_dict = {f: v for f, v in zip(cols, row[1:])}
        data.append(func(row_dict))
        index.append(row[0])
    return pd.Series(data, index=index)
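A hypothetical usage, mirroring the earlier Name/Title example (func receives each row as a plain dict keyed by column name):
mask = faster_df_apply(df, lambda row: row['Name'].lower() in row['Title'].lower())
df[mask]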
Are there ever any situations where apply is good?
Yes, sometimes.
Task: decode Unicode strings.
import numpy as np
import pandas as pd
import unidecode
s = pd.Series(['mañana','Ceñía'])
s.head()
0 mañana
1 Ceñía
s.apply(unidecode.unidecode)
0 manana
1 Cenia
Update
I was by no means advocating for the use of apply; I was just thinking that since NumPy cannot deal with the above situation, it could have been a good candidate for pandas apply. But I was forgetting the plain ol' list comprehension, thanks to the reminder by @jpp.
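For completeness, the list-comprehension alternative alluded to would be something along these lines (my sketch, not part of the original answer):
s_decoded = pd.Series([unidecode.unidecode(x) for x in s], index=s.index)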
<?xml version="1.0" encoding="utf-8"?><stackoverflow><topusers><user>Gordon Linoff</user><link>http://www.stackoverflow.com//users/1144035/gordon-linoff</link><location>New York, United States</location><year_rep>5,985</year_rep><total_rep>499,408</total_rep><tag1>sql</tag1><tag2>sql-server</tag2><tag3>mysql</tag3></topusers><topusers><user>Günter Zöchbauer</user><link>http://www.stackoverflow.com//users/217408/g%c3%bcnter-z%c3%b6chbauer</link><location>Linz, Austria</location><year_rep>5,835</year_rep><total_rep>154,439</total_rep><tag1>angular2</tag1><tag2>typescript</tag2><tag3>javascript</tag3></topusers><topusers><user>jezrael</user><link>http://www.stackoverflow.com//users/2901002/jezrael</link><location>Bratislava, Slovakia</location><year_rep>5,740</year_rep><total_rep>83,237</total_rep><tag1>pandas</tag1><tag2>python</tag2><tag3>dataframe</tag3></topusers><topusers><user>VonC</user><link>http://www.stackoverflow.com//users/6309/vonc</link><location>France</location><year_rep>5,577</year_rep><total_rep>651,397</total_rep><tag1>git</tag1><tag2>github</tag2><tag3>docker</tag3></topusers><topusers><user>Martijn Pieters</user><link>http://www.stackoverflow.com//users/100297/martijn-pieters</link><location>Cambridge, United Kingdom</location><year_rep>5,337</year_rep><total_rep>525,176</total_rep><tag1>python</tag1><tag2>python-3.x</tag2><tag3>python-2.7</tag3></topusers><topusers><user>T.J. Crowder</user><link>http://www.stackoverflow.com//users/157247/t-j-crowder</link><location>United Kingdom</location><year_rep>5,258</year_rep><total_rep>508,310</total_rep><tag1>javascript</tag1><tag2>jquery</tag2><tag3>java</tag3></topusers><topusers><user>akrun</user><link>http://www.stackoverflow.com//users/3732271/akrun</link><location></location><year_rep>5,188</year_rep><total_rep>229,553</total_rep><tag1>r</tag1><tag2>dplyr</tag2><tag3>dataframe</tag3></topusers><topusers><user>Wiktor Stribi?ew</user><link>http://www.stackoverflow.com//users/3832970/wiktor-stribi%c5%bcew</link><location>Warsaw, Poland</location><year_rep>4,948</year_rep><total_rep>158,134</total_rep><tag1>regex</tag1><tag2>javascript</tag2><tag3>c#</tag3></topusers><topusers><user>Darin Dimitrov</user><link>http://www.stackoverflow.com//users/29407/darin-dimitrov</link><location>Sofia, Bulgaria</location><year_rep>4,936</year_rep><total_rep>709,683</total_rep><tag1>c#</tag1><tag2>asp.net-mvc</tag2><tag3>asp.net-mvc-3</tag3></topusers><topusers><user>Eric Duminil</user><link>http://www.stackoverflow.com//users/6419007/eric-duminil</link><location></location><year_rep>4,854</year_rep><total_rep>12,557</total_rep><tag1>ruby</tag1><tag2>ruby-on-rails</tag2><tag3>arrays</tag3></topusers><topusers><user>alecxe</user><link>http://www.stackoverflow.com//users/771848/alecxe</link><location>New York, United States</location><year_rep>4,723</year_rep><total_rep>233,368</total_rep><tag1>python</tag1><tag2>selenium</tag2><tag3>protractor</tag3></topusers><topusers><user>Jean-François Fabre</user><link>http://www.stackoverflow.com//users/6451573/jean-fran%c3%a7ois-fabre</link><location>Toulouse, France</location><year_rep>4,526</year_rep><total_rep>30,027</total_rep><tag1>python</tag1><tag2>python-3.x</tag2><tag3>python-2.7</tag3></topusers><topusers><user>piRSquared</user><link>http://www.stackoverflow.com//users/2336654/pirsquared</link><location>Bellevue, WA, United 
States</location><year_rep>4,482</year_rep><total_rep>41,183</total_rep><tag1>pandas</tag1><tag2>python</tag2><tag3>dataframe</tag3></topusers><topusers><user>CommonsWare</user><link>http://www.stackoverflow.com//users/115145/commonsware</link><location>Who Wants to Know?</location><year_rep>4,475</year_rep><total_rep>616,135</total_rep><tag1>android</tag1><tag2>java</tag2><tag3>android-intent</tag3></topusers><topusers><user>Quentin</user><link>http://www.stackoverflow.com//users/19068/quentin</link><location>United Kingdom</location><year_rep>4,464</year_rep><total_rep>509,365</total_rep><tag1>javascript</tag1><tag2>html</tag2><tag3>css</tag3></topusers><topusers><user>Jon Skeet</user><link>http://www.stackoverflow.com//users/22656/jon-skeet</link><location>Reading, United Kingdom</location><year_rep>4,348</year_rep><total_rep>921,690</total_rep><tag1>c#</tag1><tag2>java</tag2><tag3>.net</tag3></topusers><topusers><user>Felix Kling</user><link>http://www.stackoverflow.com//users/218196/felix-kling</link><location>Sunnyvale, CA</location><year_rep>4,324</year_rep><total_rep>411,535</total_rep><tag1>javascript</tag1><tag2>jquery</tag2><tag3>asynchronous</tag3></topusers><topusers><user>matt</user><link>http://www.stackoverflow.com//users/341994/matt</link><location></location><year_rep>4,313</year_rep><total_rep>220,515</total_rep><tag1>swift</tag1><tag2>ios</tag2><tag3>xcode</tag3></topusers><topusers><user>Psidom</user><link>http://www.stackoverflow.com//users/4983450/psidom</link><location>Atlanta, GA, United States</location><year_rep>4,236</year_rep><total_rep>36,950</total_rep><tag1>python</tag1><tag2>pandas</tag2><tag3>r</tag3></topusers><topusers><user>Martin R</user><link>http://www.stackoverflow.com//users/1187415/martin-r</link><location>Germany</location><year_rep>4,195</year_rep><total_rep>269,380</total_rep><tag1>swift</tag1><tag2>ios</tag2><tag3>swift3</tag3></topusers><topusers><user>Barmar</user><link>http://www.stackoverflow.com//users/1491895/barmar</link><location>Arlington, MA</location><year_rep>4,179</year_rep><total_rep>289,989</total_rep><tag1>javascript</tag1><tag2>php</tag2><tag3>jquery</tag3></topusers><topusers><user>Alexey Mezenin</user><link>http://www.stackoverflow.com//users/1227923/alexey-mezenin</link><location>??????</location><year_rep>4,142</year_rep><total_rep>31,602</total_rep><tag1>laravel</tag1><tag2>php</tag2><tag3>laravel-5.3</tag3></topusers><topusers><user>BalusC</user><link>http://www.stackoverflow.com//users/157882/balusc</link><location>Amsterdam, Netherlands</location><year_rep>4,046</year_rep><total_rep>703,046</total_rep><tag1>java</tag1><tag2>jsf</tag2><tag3>servlets</tag3></topusers><topusers><user>GurV</user><link>http://www.stackoverflow.com//users/6348498/gurv</link><location></location><year_rep>4,016</year_rep><total_rep>7,932</total_rep><tag1>sql</tag1><tag2>mysql</tag2><tag3>sql-server</tag3></topusers><topusers><user>Nina Scholz</user><link>http://www.stackoverflow.com//users/1447675/nina-scholz</link><location>Berlin, Deutschland</location><year_rep>3,950</year_rep><total_rep>61,135</total_rep><tag1>javascript</tag1><tag2>arrays</tag2><tag3>object</tag3></topusers><topusers><user>JB Nizet</user><link>http://www.stackoverflow.com//users/571407/jb-nizet</link><location>Saint-Etienne, France</location><year_rep>3,923</year_rep><total_rep>418,780</total_rep><tag1>java</tag1><tag2>hibernate</tag2><tag3>java-8</tag3></topusers><topusers><user>Frank van 
Puffelen</user><link>http://www.stackoverflow.com//users/209103/frank-van-puffelen</link><location>San Francisco, CA</location><year_rep>3,920</year_rep><total_rep>86,520</total_rep><tag1>firebase</tag1><tag2>firebase-database</tag2><tag3>android</tag3></topusers><topusers><user>dasblinkenlight</user><link>http://www.stackoverflow.com//users/335858/dasblinkenlight</link><location>United States</location><year_rep>3,886</year_rep><total_rep>475,813</total_rep><tag1>c#</tag1><tag2>java</tag2><tag3>c++</tag3></topusers><topusers><user>Tim Biegeleisen</user><link>http://www.stackoverflow.com//users/1863229/tim-biegeleisen</link><location>Singapore</location><year_rep>3,814</year_rep><total_rep>77,211</total_rep><tag1>sql</tag1><tag2>mysql</tag2><tag3>java</tag3></topusers><topusers><user>Greg Hewgill</user><link>http://www.stackoverflow.com//users/893/greg-hewgill</link><location>Christchurch, New Zealand</location><year_rep>3,796</year_rep><total_rep>529,137</total_rep><tag1>git</tag1><tag2>python</tag2><tag3>git-pull</tag3></topusers><topusers><user>unutbu</user><link>http://www.stackoverflow.com//users/190597/unutbu</link><location></location><year_rep>3,735</year_rep><total_rep>401,595</total_rep><tag1>python</tag1><tag2>pandas</tag2><tag3>numpy</tag3></topusers><topusers><user>Hans Passant</user><link>http://www.stackoverflow.com//users/17034/hans-passant</link><location>Madison, WI</location><year_rep>3,688</year_rep><total_rep>672,118</total_rep><tag1>c#</tag1><tag2>.net</tag2><tag3>winforms</tag3></topusers><topusers><user>Jonathan Leffler</user><link>http://www.stackoverflow.com//users/15168/jonathan-leffler</link><location>California, USA</location><year_rep>3,649</year_rep><total_rep>455,157</total_rep><tag1>c</tag1><tag2>bash</tag2><tag3>unix</tag3></topusers><topusers><user>paxdiablo</user><link>http://www.stackoverflow.com//users/14860/paxdiablo</link><location></location><year_rep>3,636</year_rep><total_rep>507,043</total_rep><tag1>c</tag1><tag2>c++</tag2><tag3>bash</tag3></topusers><topusers><user>Pranav C Balan</user><link>http://www.stackoverflow.com//users/3037257/pranav-c-balan</link><location>Ramanthali, Kannur, Kerala, India</location><year_rep>3,604</year_rep><total_rep>64,476</total_rep><tag1>javascript</tag1><tag2>jquery</tag2><tag3>html</tag3></topusers><topusers><user>Suragch</user><link>http://www.stackoverflow.com//users/3681880/suragch</link><location>Hohhot, China</location><year_rep>3,580</year_rep><total_rep>71,032</total_rep><tag1>swift</tag1><tag2>ios</tag2><tag3>android</tag3></topusers></stackoverflow>
Currently, pandas I/O tools do not provide a read_xml() method or a to_xml() counterpart. However, read_json proves that tree-like structures can be implemented for dataframe import, and read_html does the same for markup formats.
If the pandas team does consider such a read_xml method for a future pandas version, what implementation would they pursue: parsing with the built-in xml.etree.ElementTree, with its iterfind() or iterparse() functions, or with the third-party module lxml, with its XPath 1.0 and XSLT 1.0 methods?
Below are my test runs for four method types on a simple, flat, element-centric XML input. All are set up for generalized parsing of any second-level children of the root, and each method should yield exactly the same pandas dataframe. All but the last call pd.DataFrame() on a list of dictionaries. The XSLT method transforms XML to CSV, which is cast to StringIO() and read with pd.read_csv().
Question (multi-part)
PERFORMANCE: How do you explain the slower iterparse, which is often recommended for larger files since the file is iteratively parsed? Is it partly due to the if logic checks?
MEMORY: Does memory usage correlate with timings in I/O calls? XSLT and XPath 1.0 tend not to scale well with larger XML documents, as the entire file must be read in memory to be parsed.
STRATEGY: Is a list of dictionaries an optimal strategy for the DataFrame() call? See these interesting answers: a generator version and an iterwalk user-defined version. Both upcast lists to a dataframe.
Input Data (Stack Overflow's current top users by year, of which our pandas friends are included) is the XML document shown above.
import xml.etree.ElementTree as et
import pandas as pd
from io import StringIO
from lxml import etree as lxet
def read_xml_iterfind():
    tree = et.parse('Input.xml')
    data = []
    inner = {}
    for el in tree.iterfind('./*'):
        for i in el.iterfind('*'):
            inner[i.tag] = i.text
        data.append(inner)
        inner = {}
    df = pd.DataFrame(data)
def read_xml_iterparse():
    data = []
    inner = {}
    i = 1
    for (ev, el) in et.iterparse(path):
        if i <= 2:
            first_tag = el.tag
        if el.tag == first_tag and len(inner) != 0:
            data.append(inner)
            inner = {}
        if el.text is not None and len(el.text.strip()) > 0:
            inner[el.tag] = el.text
        i += 1
    df = pd.DataFrame(data)
def read_xml_lxml_xpath():
    tree = lxet.parse('Input.xml')
    data = []
    inner = {}
    for el in tree.xpath('/*/*'):
        for i in el:
            inner[i.tag] = i.text
        data.append(inner)
        inner = {}
    df = pd.DataFrame(data)
def read_xml_lxml_xsl():
    xml = lxet.parse('Input.xml')
    xslstr = '''
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output version="1.0" encoding="UTF-8" indent="yes" method="text"/>
<xsl:strip-space elements="*"/>
<!-- HEADERS -->
<xsl:template match = "/*">
<xsl:for-each select="*[1]/*">
<xsl:value-of select="local-name()" />
<xsl:choose>
<xsl:when test="position() != last()">
<xsl:text>,</xsl:text>
</xsl:when>
<xsl:otherwise>
<xsl:text>
</xsl:text>
</xsl:otherwise>
</xsl:choose>
</xsl:for-each>
<xsl:apply-templates/>
</xsl:template>
<!-- DATA ROWS (COMMA-SEPARATED) -->
<xsl:template match="/*/*" priority="2">
<xsl:for-each select="*">
<xsl:if test="position() = 1">
<xsl:text>"</xsl:text>
</xsl:if>
<xsl:value-of select="." />
<xsl:choose>
<xsl:when test="position() != last()">
<xsl:text>","</xsl:text>
</xsl:when>
<xsl:otherwise>
<xsl:text>"
</xsl:text>
</xsl:otherwise>
</xsl:choose>
</xsl:for-each>
</xsl:template>
</xsl:transform>
'''
    xsl = lxet.fromstring(xslstr)
    transform = lxet.XSLT(xsl)
    newdom = transform(xml)
    df = pd.read_csv(StringIO(str(newdom)))
Timings (with the current XML, and with an XML having 25 times the children, i.e., 900 StackOverflow user records)
# SHORTER FILE
python -mtimeit -s'import readxml_test_runs as test' 'test.read_xml_iterfind()'
100 loops, best of 3: 3.87 msec per loop
python -mtimeit -s'import readxml_test_runs as test' 'test.read_xml_iterparse()'
100 loops, best of 3: 5.5 msec per loop
python -mtimeit -s'import readxml_test_runs as test' 'test.read_xml_lxml_xpath()'
100 loops, best of 3: 3.86 msec per loop
python -mtimeit -s'import readxml_test_runs as test' 'test.read_xml_lxml_xsl()'
100 loops, best of 3: 5.68 msec per loop
# LARGER FILE
python -mtimeit -n'100' -s'import readxml_test_runs as test' 'test.read_xml_iterfind()'
100 loops, best of 3: 36 msec per loop
python -mtimeit -n'100' -s'import readxml_test_runs as test' 'test.read_xml_iterparse()'
100 loops, best of 3: 78.9 msec per loop
python -mtimeit -n'100' -s'import readxml_test_runs as test' 'test.read_xml_lxml_xpath()'
100 loops, best of 3: 32.7 msec per loop
python -mtimeit -n'100' -s'import readxml_test_runs as test' 'test.read_xml_lxml_xsl()'
100 loops, best of 3: 51.4 msec per loop
PERFORMANCE: How do you explain the slower iterparse, which is often recommended for larger files since the file is iteratively parsed? Is it partly due to the if logic checks?
I would assume that more Python code would make it slower, as the Python code is evaluated every time. Have you tried a JIT compiler like PyPy?
If I remove i and use first_tag only, it seems to be quite a bit faster, so yes, it is partly due to the if logic checks:
def read_xml_iterparse2(path):
    data = []
    inner = {}
    first_tag = None
    for (ev, el) in et.iterparse(path):
        if not first_tag:
            first_tag = el.tag
        if el.tag == first_tag and len(inner) != 0:
            data.append(inner)
            inner = {}
        if el.text is not None and len(el.text.strip()) > 0:
            inner[el.tag] = el.text
    df = pd.DataFrame(data)
%timeit read_xml_iterparse(path)
# 10 loops, best of 5: 33 ms per loop
%timeit read_xml_iterparse2(path)
# 10 loops, best of 5: 23 ms per loop
I wasn’t sure I understood the purpose of the last if check, but I’m also not sure why you would want to lose whitespace-only elements. Removing the last if consistently shaves off a little bit of time:
def read_xml_iterparse3(path):
    data = []
    inner = {}
    first_tag = None
    for (ev, el) in et.iterparse(path):
        if not first_tag:
            first_tag = el.tag
        if el.tag == first_tag and len(inner) != 0:
            data.append(inner)
            inner = {}
        inner[el.tag] = el.text
    df = pd.DataFrame(data)
%timeit read_xml_iterparse(path)
# 10 loops, best of 5: 34.4 ms per loop
%timeit read_xml_iterparse2(path)
# 10 loops, best of 5: 24.5 ms per loop
%timeit read_xml_iterparse3(path)
# 10 loops, best of 5: 20.9 ms per loop
Now, with or without those performance improvements, your iterparse version seems to produce an extra-large dataframe. Here is what seems to be a working, fast version:
def read_xml_iterparse5(path):
    data = []
    inner = {}
    for (ev, el) in et.iterparse(path):
        # Closing parent tags trigger a new row; in this file their .text is '\n'
        # followed by spaces. It would be more reliable to pass 'topusers' to
        # read_xml_iterparse5 as the tag to check.
        if el.text and el.text[0] == '\n':
            # ignore the closing /stackoverflow element
            if inner:
                data.append(inner)
            inner = {}
        else:
            inner[el.tag] = el.text
    return pd.DataFrame(data)
print(read_xml_iterfind(path).shape)
# (900, 8)
print(read_xml_iterparse(path).shape)
# (7050, 8)
print(read_xml_lxml_xpath(path).shape)
# (900, 8)
print(read_xml_lxml_xsl(path).shape)
# (900, 8)
print(read_xml_iterparse5(path).shape)
# (900, 8)
%timeit read_xml_iterparse5(path)
# 10 loops, best of 5: 20.6 ms per loop
MEMORY: Does memory usage correlate with timings in I/O calls? XSLT and XPath 1.0 tend not to scale well with larger XML documents, as the entire file must be read in memory to be parsed.
I’m not totally sure what you mean by “I/O calls” but if your document is small enough to fit in cache, then everything will be much faster as it won’t evict many other items from the cache.
STRATEGY: Is a list of dictionaries an optimal strategy for the DataFrame() call? See these interesting answers: a generator version and an iterwalk user-defined version. Both upcast lists to a dataframe.
The lists use less memory, so depending on how many columns you have, it could make a noticeable difference. Of course, this then requires your XML tags to be in a consistent order, which they do appear to be. The DataFrame() call would also need to do less work, as it doesn't have to look up keys in the dict on every row to figure out which column a value belongs to.
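As a sketch of what the list-of-lists variant might look like (my own illustration, reusing the imports from the question and assuming the child tags always appear in a consistent order):
def read_xml_iterfind_lists(path, tags):
    # One list per second-level element, with values in a fixed tag order,
    # so pd.DataFrame receives a list of lists plus explicit column names.
    tree = et.parse(path)
    data = [[el.findtext(tag) for tag in tags] for el in tree.iterfind('./*')]
    return pd.DataFrame(data, columns=tags)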
In this example, you are merging dataframe1 and dataframe2. You have chosen to do a left outer join on 'key'. However, for dataframe2 you have specified .iloc, which allows you to specify the rows and columns you want in a numerical format. Using :, you're selecting all rows, but [0:5] selects the first 5 columns. You could use .loc to specify by name, but if you're dealing with long column names, then .iloc may be better.
This is to merge selected columns from two tables.
If table_1 contains t1_a,t1_b,t1_c..,id,..t1_z columns,
and table_2 contains t2_a, t2_b, t2_c..., id,..t2_z columns,
and only t1_a, id, t2_a are required in the final table, then
mergedCSV = table_1[['t1_a', 'id']].merge(table_2[['t2_a', 'id']], on='id', how='left')
# save the resulting output file
mergedCSV.to_csv('output.csv', index=False)
My dataframe has a DOB column (example format 1/1/2016) which by default gets converted to pandas dtype ‘object’: DOB object
Converting this to date format with df['DOB'] = pd.to_datetime(df['DOB']), the date gets converted to: 2016-01-26 and its dtype is: DOB datetime64[ns].
Now I want to convert this date format to 01/26/2016 or in any other general date formats. How do I do it?
Whatever the method I try, it always shows the date in 2016-01-26 format.
Compared to the first answer, I would recommend using dt.strftime() first, then pd.to_datetime(). This way, it will still result in the datetime data type.
There is a difference between
the content of a dataframe cell (a binary value), and
its presentation (displaying it) for us, humans.
So the question is: how to reach the appropriate presentation of my data without changing the data / data types themselves?
Here is the answer:
If you use the Jupyter notebook for displaying your dataframe, or
if you want to reach a presentation in the form of an HTML file (even with many prepared superfluous id and class attributes for further CSS styling — you may or you may not use them),
use styling. Styling doesn't change the data / data types of the columns of your dataframe.
Now I show you how to reach it in the Jupyter notebook; for a presentation in the form of an HTML file, see the note near the end of this answer.
I will suppose that your column DOB already has the type datetime64 (you showed that you know how to reach it). I prepared a simple dataframe (with only one column) to show you some basic styling:
Not styled:
df
DOB
0 2019-07-03
1 2019-08-03
2 2019-09-03
3 2019-10-03
Styled (using, for example, df.style.format({"DOB": lambda t: t.strftime("%d-%m-%Y")})):
DOB
0 03-07-2019
1 03-08-2019
2 03-09-2019
3 03-10-2019
Be careful!
The returned object is NOT a dataframe; it is an object of the class Styler, so don't assign it back to df:
Don´t do this:
df = df.style.format({"DOB": lambda t: t.strftime("%m/%d/%Y")}) # Don´t do this!
(Every dataframe has its Styler object accessible by its .style property, and we changed this df.style object, not the dataframe itself.)
Questions and Answers:
Q: Why does your Styler object (or an expression returning it), used as the last command in a Jupyter notebook cell, display your (styled) table and not the Styler object itself?
A: Because every Styler object has a callback method ._repr_html_() which returns an HTML code for rendering your dataframe (as a nice HTML table).
Jupyter Notebook IDE calls this method automatically to render objects which have it.
Note:
You don't need the Jupyter notebook for styling (i.e. for nicely outputting a dataframe without changing its data / data types).
A Styler object has a render() method, too, if you want to obtain a string with the HTML code (e.g. for publishing your formatted dataframe to the web, or simply to present your table in HTML format).
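A minimal sketch of that (Styler.render() exists on older pandas versions; newer versions expose Styler.to_html() instead; the file name here is just an example):
html = df.style.format({"DOB": lambda t: t.strftime("%d-%m-%Y")}).render()
with open('styled_dob.html', 'w') as f:
    f.write(html)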
In the pandas library many times there is an option to change the object inplace such as with the following statement…
df.dropna(axis='index', how='all', inplace=True)
I am curious what is being returned as well as how the object is handled when inplace=True is passed vs. when inplace=False.
Are all operations modifying self when inplace=True? And when inplace=False is a new object created immediately such as new_df = self and then new_df is returned?
In short: inplace, contrary to what the name implies, often does not prevent copies from being created, and (almost) never offers any performance benefits.
inplace does not work with method chaining
inplace is a common pitfall for beginners, so removing this option will simplify the API
I don’t advise setting this parameter as it serves little purpose. See this GitHub issue which proposes the inplace argument be deprecated api-wide.
It is a common misconception that using inplace=True will lead to more efficient or optimized code. In reality, there are absolutely no performance benefits to using inplace=True. Both the in-place and out-of-place versions create a copy of the data anyway, with the in-place version automatically assigning the copy back.
inplace=True is a common pitfall for beginners. For example, it can trigger the SettingWithCopyWarning:
df = pd.DataFrame({'a': [3, 2, 1], 'b': ['x', 'y', 'z']})
df2 = df[df['a'] > 1]
df2['b'].replace({'x': 'abc'}, inplace=True)
# SettingWithCopyWarning:
# A value is trying to be set on a copy of a slice from a DataFrame
Calling a function on a DataFrame column with inplace=True may or may not work. This is especially true when chained indexing is involved.
As if the problems described above aren’t enough, inplace=True also hinders method chaining. Contrast the working of
result = df.some_function1().reset_index().some_function2()
As opposed to
temp = df.some_function1()
temp.reset_index(inplace=True)
result = temp.some_function2()
The former lends itself to better code organization and readability.
Another supporting claim is that the API for set_axis was recently changed such that the inplace default value was switched from True to False. See GH27600. Great job, devs!
The way I use it is:
# Have to assign back to dataframe (because it is a new copy)
df = df.some_operation(inplace=False)
or:
# No need to assign back to dataframe (because it is on the same copy)
df.some_operation(inplace=True)
Conclusion:
if inplace is False:
    assign the result back to a variable
else:
    no need to assign
As you can read further below in my answer, we can still have good reason to use this parameter, i.e. for in-place operations, but we should avoid it if we can, as it generates more issues:
1. Your code will be harder to debug (actually, SettingWithCopyWarning exists to warn you about this possible problem).
2. It conflicts with method chaining.
So is there still a case when we should use it?
Definitely yes. If we use pandas or any tool for handling huge datasets, we can easily face a situation where some big data consumes our entire memory.
To avoid this unwanted effect we can use some techniques like method chaining, which makes our code more compact (though harder to interpret and debug, too) and consumes less memory, as the chained methods work with the other methods' returned values, resulting in only one copy of the input data. We can see clearly that we will have 2x the original data's memory consumption during this operation.
Or we can use the inplace parameter (though it is harder to interpret and debug, too): our peak memory consumption will still be 2x the original data, but after the operation it remains at 1x the original data, which, as anyone who has worked with huge datasets knows, can be a big benefit.
Final conclusion:
Avoid using the inplace parameter unless you work with huge data, and be aware of its possible issues if you do still use it.
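A schematic contrast of the two patterns discussed above (illustrative method names only, not from the original answer):
# Method chaining: each intermediate result is handed straight to the next
# call, and the final result is assigned back to df in a single statement.
df = df.dropna(how='all').rename(columns=str.lower).reset_index(drop=True)

# inplace: each call mutates df; peak memory is similar, but no second
# full-size object stays bound to a name afterwards.
df.dropna(axis='index', how='all', inplace=True)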
When trying to make changes to a Pandas dataframe using a function, we use ‘inplace=True’ if we want to commit the changes to the dataframe.
Therefore, the first line in the following code changes the name of the first column in 'df' to 'Grades'. We need to call the dataframe if we want to see the resulting dataframe.
df.rename(columns={0: 'Grades'}, inplace=True)
df
We use 'inplace=False' (this is also the default value) when we don't want to commit the changes but just print the resulting dataframe. So, in effect, a copy of the original dataframe with the committed changes is printed, without altering the original dataframe.
Just to be clearer, the following two calls do the same thing:
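Presumably the intended comparison is along these lines (reconstructed from the rename example above, not present in the original text):
# Commits the rename directly to df and returns None:
df.rename(columns={0: 'Grades'}, inplace=True)

# Returns a renamed copy, which we assign back to get the same end result:
df = df.rename(columns={0: 'Grades'}, inplace=False)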
inplace=True makes the function impure. It changes the original dataframe and returns None. In that case, you break the DSL chain.
Because most dataframe functions return a new dataframe, you can use the DSL conveniently, like
df.sort_values().rename().to_csv()
Function call with inplace=True returns None and DSL chain is broken. For example
df.sort_values(inplace=True).rename().to_csv()
will throw NoneType object has no attribute 'rename'
Something similar with python’s build-in sort and sorted. lst.sort() returns None and sorted(lst) returns a new list.
Generally, do not use inplace=True unless you have a specific reason for doing so. When you have to write reassignment code like df = df.sort_values(), try attaching the function call to the rest of the DSL chain instead, for example as sketched below.
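For instance, a generic sketch of folding the reassignment into a single chain (the column name is just a placeholder):
df = df.sort_values('col').reset_index(drop=True)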
Yes, in pandas many functions have the parameter inplace, but by default it is set to False.
So, when you do df.dropna(axis='index', how='all', inplace=False), pandas assumes that you do not want to change the original DataFrame, and therefore creates a new copy for you with the required changes.
But when you change the inplace parameter to True, it is equivalent to explicitly saying: I do not want a new copy of the DataFrame; instead, make the changes on the given DataFrame. This forces pandas not to create a new DataFrame.
But you can also avoid using the inplace parameter altogether by reassigning the result to the original DataFrame.
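For the dropna example above, that reassignment looks like:
df = df.dropna(axis='index', how='all')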
Description of the compression argument from the docs:
compression : {‘infer’, ‘gzip’, ‘bz2’, ‘zip’, ‘xz’, None}, default ‘infer’
For on-the-fly decompression of on-disk data. If ‘infer’ and filepath_or_buffer is path-like, then detect compression from the following extensions: ‘.gz’, ‘.bz2’, ‘.zip’, or ‘.xz’ (otherwise no decompression). If using ‘zip’, the ZIP file must contain only one data file to be read in. Set to None for no decompression.
New in version 0.18.1: support for ‘zip’ and ‘xz’ compression.
For zip files, you can use the zipfile module, and your code will work simply with these lines:
import zipfile
import pandas as pd
with zipfile.ZipFile("Crime_Incidents_in_2013.zip") as z:
    with z.open("Crime_Incidents_in_2013.csv") as f:
        train = pd.read_csv(f, header=0, delimiter="\t")
        print(train.head())  # print the first 5 rows
"\Python36\lib\site-packages\pandas\core\ops.py:792: FutureWarning: elementwise
comparison failed; returning scalar, but in the future will perform
elementwise comparison
result = getattr(x, name)(y)"
I am using Pandas 0.19.1 on Python 3. I get the following warning on the lines of code where I try to get a list of all the row numbers where the string 'Peter' is present in column Unnamed: 5:
"\Python36\lib\site-packages\pandas\core\ops.py:792: FutureWarning: elementwise
comparison failed; returning scalar, but in the future will perform
elementwise comparison
result = getattr(x, name)(y)"
What is this FutureWarning, and should I ignore it, since the code seems to work?
This FutureWarning isn’t from Pandas, it is from numpy and the bug also affects matplotlib and others, here’s how to reproduce the warning nearer to the source of the trouble:
import numpy as np
print(np.__version__) # Numpy version '1.12.0'
'x' in np.arange(5) #Future warning thrown here
FutureWarning: elementwise comparison failed; returning scalar instead, but in the
future will perform elementwise comparison
False
Another way to reproduce this bug using the double equals operator:
import numpy as np
np.arange(5) == np.arange(5).astype(str) #FutureWarning thrown here
There is a disagreement between NumPy and native Python on what should happen when you compare a string to NumPy's numeric types. Notice the left operand is Python's turf, a primitive string, and the middle operation (in) is Python's turf, but the right operand is NumPy's turf. Should you return a Python-style scalar or a NumPy-style ndarray of booleans? NumPy says ndarray of bool, Pythonic developers disagree. Classic standoff.
Should it be elementwise comparison or Scalar if item exists in the array?
If your code or library is using the in or == operators to compare a Python string to NumPy ndarrays, they aren't compatible, so if you try it, it returns a scalar, but only for now. The warning indicates that in the future this behavior might change, so your code pukes all over the carpet if Python/NumPy decide to adopt NumPy style.
Submitted Bug reports:
Numpy and Python are in a standoff, for now the operation returns a scalar, but in the future it may change.
Either lock down your versions of Python and NumPy, ignore the warnings and expect the behavior not to change, or convert both the left and right operands of == and in to be from a NumPy type or a primitive Python numeric type.
Suppress the warning globally:
import warnings
import numpy as np
warnings.simplefilter(action='ignore', category=FutureWarning)
print('x' in np.arange(5)) #returns False, without Warning
Suppress the warning on a line-by-line basis:
import warnings
import numpy as np
with warnings.catch_warnings():
    warnings.simplefilter(action='ignore', category=FutureWarning)
    print('x' in np.arange(2))  # returns False, warning is suppressed
print('x' in np.arange(10))     # returns False, throws FutureWarning
Just suppress the warning by name, then put a loud comment next to it mentioning the current version of python and numpy, saying this code is brittle and requires these versions and put a link to here. Kick the can down the road.
I get the same error when I try to set the index_col while reading a file into a pandas dataframe:
df = pd.read_csv('my_file.tsv', sep='\t', header=0, index_col=['0']) ## or same with the following
df = pd.read_csv('my_file.tsv', sep='\t', header=0, index_col=[0])
I have never encountered such an error previously. I am still trying to figure out the reason behind this (using @Eric Leschinski's explanation and others).
Anyhow, the following approach solves the problem for now until I figure the reason out:
df = pd.read_csv('my_file.tsv', sep='\t', header=0) ## not setting the index_col
df.set_index(['0'], inplace=True)
I will update this as soon as I figure out the reason for such behavior.
My experience with the same warning message was caused by a TypeError.
TypeError: invalid type comparison
So you may want to check the data type of Unnamed: 5:
for x in df['Unnamed: 5']:
    print(type(x))  # are they 'str'?
Here is how I can replicate the warning message:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(3, 2), columns=['num1', 'num2'])
df['num3'] = 3
df.loc[df['num3'] == '3', 'num3'] = 4  # TypeError and the Warning
df.loc[df['num3'] == 3, 'num3'] = 4    # No Error
Can’t beat Eric Leschinski’s awesomely detailed answer, but here’s a quick workaround to the original question that I don’t think has been mentioned yet – put the string in a list and use .isin instead of ==
For example:
import pandas as pd
import numpy as np
df = pd.DataFrame({"Name": ["Peter", "Joe"], "Number": [1, 2]})
# Raises warning using == to compare different types:
df.loc[df["Number"] == "2", "Number"]
# No warning using .isin:
df.loc[df["Number"].isin(["2"]), "Number"]
Eric’s answer helpfully explains that the trouble comes from comparing a Pandas Series (containing a NumPy array) to a Python string. Unfortunately, his two workarounds both just suppress the warning.
To write code that doesn’t cause the warning in the first place, explicitly compare your string to each element of the Series and get a separate bool for each. For example, you could use map and an anonymous function.
myRows = df[df['Unnamed: 5'].map(lambda x: x == 'Peter')].index.tolist()
But this is ~1.5 times slower if df['Unnamed: 5'] is a string, 25-30 times slower if df['Unnamed: 5'] is a small numpy array (length 10), and 150-160 times slower if it's a numpy array of length 100 (times averaged over 500 trials). Here is the timing code:
import time
from numpy import linspace, zeros, mean

a = linspace(0, 5, 10)
b = linspace(0, 50, 100)
n = 500
string1 = 'Peter'
string2 = 'blargh'
times_a = zeros(n)
times_str_a = zeros(n)
times_s = zeros(n)
times_str_s = zeros(n)
times_b = zeros(n)
times_str_b = zeros(n)
for i in range(n):
    t0 = time.time()
    tmp1 = a == string1
    t1 = time.time()
    tmp2 = str(a) == string1
    t2 = time.time()
    tmp3 = string2 == string1
    t3 = time.time()
    tmp4 = str(string2) == string1
    t4 = time.time()
    tmp5 = b == string1
    t5 = time.time()
    tmp6 = str(b) == string1
    t6 = time.time()
    times_a[i] = t1 - t0
    times_str_a[i] = t2 - t1
    times_s[i] = t3 - t2
    times_str_s[i] = t4 - t3
    times_b[i] = t5 - t4
    times_str_b[i] = t6 - t5
print('Small array:')
print('Time to compare without str conversion: {} s. With str conversion: {} s'.format(mean(times_a), mean(times_str_a)))
print('Ratio of time with/without string conversion: {}'.format(mean(times_str_a)/mean(times_a)))
print('\nBig array')
print('Time to compare without str conversion: {} s. With str conversion: {} s'.format(mean(times_b), mean(times_str_b)))
print(mean(times_str_b)/mean(times_b))
print('\nString')
print('Time to compare without str conversion: {} s. With str conversion: {} s'.format(mean(times_s), mean(times_str_s)))
print('Ratio of time with/without string conversion: {}'.format(mean(times_str_s)/mean(times_s)))
Result:
Small array:
Time to compare without str conversion: 6.58464431763e-06 s. With str conversion: 0.000173756599426 s
Ratio of time with/without string conversion: 26.3881526541
Big array
Time to compare without str conversion: 5.44309616089e-06 s. With str conversion: 0.000870866775513 s
159.99474375821288
String
Time to compare without str conversion: 5.89370727539e-07 s. With str conversion: 8.30173492432e-07 s
Ratio of time with/without string conversion: 1.40857605178
In my case, the warning occurred simply because of ordinary boolean indexing, since the series contained only np.nan. Demonstration (pandas 1.0.3):
>>> import pandas as pd
>>> import numpy as np
>>> pd.Series([np.nan, 'Hi']) == 'Hi'
0 False
1 True
>>> pd.Series([np.nan, np.nan]) == 'Hi'
~/anaconda3/envs/ms3/lib/python3.7/site-packages/pandas/core/ops/array_ops.py:255: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
res_values = method(rvalues)
0 False
1 False
I think with pandas 1.0 they really want you to use the new 'string' dtype, which allows for pd.NA values:
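A minimal sketch of that comparison, assuming pandas >= 1.0; the key difference from the object-dtype case above is that missing values propagate as <NA> instead of silently becoming False:
import pandas as pd
import numpy as np

s = pd.Series([np.nan, 'Hi'], dtype='string')
print(s == 'Hi')
# 0    <NA>
# 1    True
# dtype: boolean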
I've compared a few of the possible methods for doing this, including pandas, several numpy methods, and a list comprehension method.
First, let’s start with a baseline:
>>> import numpy as np
>>> import operator
>>> import pandas as pd
>>> x = [1, 2, 1, 2]
>>> %time count = np.sum(np.equal(1, x))
>>> print("Count {} using numpy equal with ints".format(count))
CPU times: user 52 µs, sys: 0 ns, total: 52 µs
Wall time: 56 µs
Count 2 using numpy equal with ints
So our baseline is that the count should be the correct value, 2, and the comparison should take about 50 µs.
Now, we try the naive method:
>>> x = ['s', 'b', 's', 'b']
>>> %time count = np.sum(np.equal('s', x))
>>> print("Count {} using numpy equal".format(count))
CPU times: user 145 µs, sys: 24 µs, total: 169 µs
Wall time: 158 µs
Count NotImplemented using numpy equal
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/ipykernel_launcher.py:1: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
"""Entry point for launching an IPython kernel.
And here, we get the wrong answer (NotImplemented != 2), it takes us a long time, and it throws the warning.
So we’ll try another naive method:
>>> %time count = np.sum(x == 's')
>>> print("Count {} using ==".format(count))
CPU times: user 46 µs, sys: 1 µs, total: 47 µs
Wall time: 50.1 µs
Count 0 using ==
Again, the wrong answer (0 != 2): since x is a plain Python list here, x == 's' compares the whole list to the string and yields a single False, and np.sum(False) is 0. This is even more insidious because there are no subsequent warnings (a 0 can be passed around just like a 2).
Now, let’s try a list comprehension:
>>> %time count = np.sum([operator.eq(_x, 's') for _x in x])
>>> print("Count {} using list comprehension".format(count))
CPU times: user 55 µs, sys: 1 µs, total: 56 µs
Wall time: 60.3 µs
Count 2 using list comprehension
We get the right answer here, and it’s pretty fast!
Another possibility, pandas:
>>> y = pd.Series(x)
>>> %time count = np.sum(y == 's')
>>> print("Count {} using pandas ==".format(count))
CPU times: user 453 µs, sys: 31 µs, total: 484 µs
Wall time: 463 µs
Count 2 using pandas ==
Slow, but correct!
And finally, the option I’m going to use: casting the numpy array to the object type:
>>> x = np.array(['s', 'b', 's', 'b']).astype(object)
>>> %time count = np.sum(np.equal('s', x))
>>> print("Count {} using numpy equal".format(count))
CPU times: user 50 µs, sys: 1 µs, total: 51 µs
Wall time: 55.1 µs
Count 2 using numpy equal
Fast and correct!
Answer 10
I had this code that was causing the error:
for t in dfObj['time']:
    if type(t) == str:
        the_date = dateutil.parser.parse(t)
        loc_dt_int = int(the_date.timestamp())
        dfObj.loc[t == dfObj.time, 'time'] = loc_dt_int
I changed it to this:
for t in dfObj['time']:
    try:
        the_date = dateutil.parser.parse(t)
        loc_dt_int = int(the_date.timestamp())
        dfObj.loc[t == dfObj.time, 'time'] = loc_dt_int
    except Exception as e:
        print(e)
        continue
to avoid the comparison, which was throwing the warning, as stated above. I only had to catch the exception because of the dfObj.loc assignment inside the for loop; maybe there is a way to tell it not to check the rows it has already changed.
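One possible way to sidestep both the warning and the exception handling (not from the original answer, just a sketch of one option) is to build a boolean mask of the string entries first and convert only those rows:
import dateutil.parser

# Mask of rows whose 'time' value is still a string.
is_str = dfObj['time'].map(lambda v: isinstance(v, str))

# Parse and convert only those rows; already-converted rows are left untouched.
dfObj.loc[is_str, 'time'] = dfObj.loc[is_str, 'time'].map(
    lambda s: int(dateutil.parser.parse(s).timestamp())
)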
This topic hasn't been addressed in a while, here or elsewhere. Is there a solution for converting a SQLAlchemy <Query object> to a pandas DataFrame?
Pandas has the capability to use pandas.read_sql, but this requires use of raw SQL. I have two reasons for wanting to avoid it: 1) I already have everything using the ORM (a good reason in and of itself), and 2) I'm using Python lists as part of the query (e.g. db.session.query(Item).filter(Item.symbol.in_(add_symbols)), where Item is my model class and add_symbols is a list). This is the equivalent of SQL SELECT ... FROM ... WHERE ... IN.
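For reference, a minimal sketch of the pattern being described, using the hypothetical names from the question (Item, db, add_symbols); one commonly suggested route, not stated in the question itself, is to hand the query's compiled statement to pandas.read_sql, which accepts a SQLAlchemy selectable:
import pandas as pd

add_symbols = ['AAPL', 'MSFT', 'GOOG']   # illustrative values

# The ORM query described above: SELECT ... FROM item WHERE symbol IN (...)
query = db.session.query(Item).filter(Item.symbol.in_(add_symbols))

# Let pandas execute the compiled statement; depending on your SQLAlchemy
# version you may need session.get_bind() or the engine object instead.
df = pd.read_sql(query.statement, db.session.bind)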