numpy.random.randint accepts a third argument (size), in which you can specify the shape of the output array; the newer Generator.integers API works the same way. You can use this to create your DataFrame:
import numpy as np
import pandas as pd
rng = np.random.default_rng()
df = pd.DataFrame(rng.integers(0,100, size=(100,4)), columns=list('ABCD'))
df
     A   B   C   D
0   58  96  82  24
1    2  13  35  36
2   67  79  22  78
3   81  65  77  94
4   73  67   0  96
..  ..  ..  ..  ..
95  76  32  28  51
96  33  68  54  77
97  76  43  57  43
98  34  64  12  57
99  81  77  32  50

100 rows × 4 columns
And if you want to produce a column containing the name of the column with the maximum value but considering only a subset of columns then you use a variation of @ajcr’s answer:
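For that subset variation, a minimal sketch (the column names and the new column name Max are taken from the example below, just for illustration):

df['Max'] = df[['Communications', 'Business']].idxmax(axis=1)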
You could use apply on the DataFrame and get the argmax() of each row via axis=1:
In [144]: df.apply(lambda x: x.argmax(), axis=1)
Out[144]:
0 Communications
1 Business
2 Communications
3 Communications
4 Business
dtype: object
Here’s a benchmark comparing how slow the apply method is relative to idxmax() for len(df) ~ 20K:
In [146]: %timeit df.apply(lambda x: x.argmax(), axis=1)
1 loops, best of 3: 479 ms per loop
In [147]: %timeit df.idxmax(axis=1)
10 loops, best of 3: 47.3 ms per loop
import pandas as pd
df ={'col_1':[0,1,2,3],'col_2':[4,5,6,7]}
df = pd.DataFrame(df)
df[['column_new_1','column_new_2','column_new_3']]=[np.nan,'dogs',3]#thought this would work here...
I’m new to pandas and trying to figure out how to add multiple columns to pandas simultaneously. Any help here is appreciated. Ideally I would like to do this in one step rather than multiple repeated steps…
import numpy as np
import pandas as pd

df = {'col_1': [0, 1, 2, 3],
      'col_2': [4, 5, 6, 7]}
df = pd.DataFrame(df)

df[['column_new_1', 'column_new_2', 'column_new_3']] = [np.nan, 'dogs', 3]  # thought this would work here...
I would have expected your syntax to work too. The problem arises because when you create new columns with the column-list syntax (df[[new1, new2]] = ...), pandas requires that the right hand side be a DataFrame (note that it doesn’t actually matter if the columns of the DataFrame have the same names as the columns you are creating).
Your syntax works fine for assigning scalar values to existing columns, and pandas is also happy to assign scalar values to a new column using the single-column syntax (df[new1] = ...). So the solution is either to convert this into several single-column assignments, or create a suitable DataFrame for the right-hand side.
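A short sketch of both routes, reusing the column names from the question:

import numpy as np
import pandas as pd

df = pd.DataFrame({'col_1': [0, 1, 2, 3], 'col_2': [4, 5, 6, 7]})

# Route 1: several single-column assignments
df['column_new_1'] = np.nan
df['column_new_2'] = 'dogs'
df['column_new_3'] = 3

# Route 2: build a suitable DataFrame for the right-hand side
rhs = pd.DataFrame([[np.nan, 'dogs', 3]] * len(df),
                   columns=['column_new_1', 'column_new_2', 'column_new_3'],
                   index=df.index)
df[['column_new_1', 'column_new_2', 'column_new_3']] = rhs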
5) Using a dict is a more “natural” way to create the new data frame than the previous two, but the new columns will be sorted alphabetically (at least before Python 3.6 or 3.7):
I like this variant on @zero’s answer a lot, but like the previous one, the new columns will always be sorted alphabetically, at least with early versions of Python:
In [128]: df
Out[128]:
col_1 col_2
0 0 4
1 1 5
2 2 6
3 3 7
In [129]: pd.concat([df, pd.DataFrame(columns=['column_new_1', 'column_new_2', 'column_new_3'])])
Out[129]:
col_1 col_2 column_new_1 column_new_2 column_new_3
0 0.0 4.0 NaN NaN NaN
1 1.0 5.0 NaN NaN NaN
2 2.0 6.0 NaN NaN NaN
3 3.0 7.0 NaN NaN NaN
I’m not quite sure what you wanted to do with [np.nan, 'dogs', 3]. Maybe you now want to set them as default values?
In [142]: df1 = pd.concat([df, pd.DataFrame(columns=['column_new_1', 'column_new_2', 'column_new_3'])])
In [143]: df1[['column_new_1', 'column_new_2', 'column_new_3']] = [np.nan, 'dogs', 3]
In [144]: df1
Out[144]:
col_1 col_2 column_new_1 column_new_2 column_new_3
0 0.0 4.0 NaN dogs 3
1 1.0 5.0 NaN dogs 3
2 2.0 6.0 NaN dogs 3
3 3.0 7.0 NaN dogs 3
Answer 3
Using a list comprehension with pd.DataFrame and pd.concat:

pd.concat(
    [df,
     pd.DataFrame([[np.nan, 'dogs', 3] for _ in range(df.shape[0])],
                  df.index, ['column_new_1', 'column_new_2', 'column_new_3'])],
    axis=1)
You can pass a list of columns to [] to select columns in that order.
If a column is not contained in the DataFrame, an exception will be raised.
Multiple columns can also be set in this manner.
You may find this useful for applying a transform (in-place) to a subset of the columns.
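A small sketch of these points, assuming the col_1/col_2 frame from the question:

df[['col_2', 'col_1']]                               # select columns, in that order
df[['col_1', 'col_2']] = df[['col_1', 'col_2']] * 2  # transform a subset in place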
If you just want to add empty new columns, reindex will do the job
df
col_1 col_2
0 0 4
1 1 5
2 2 6
3 3 7
df.reindex(list(df) + ['column_new_1', 'column_new_2', 'column_new_3'], axis=1)
col_1 col_2 column_new_1 column_new_2 column_new_3
0 0 4 NaN NaN NaN
1 1 5 NaN NaN NaN
2 2 6 NaN NaN NaN
3 3 7 NaN NaN NaN
df['d'] = df.apply(rowFunc, axis=1)
>>> df
a b c d
0 1 2 3 7
1 4 5 6 34
Awesome! Now what if I want to incorporate the index into my function?
The index of any given row in this DataFrame before adding d would be Index([u'a', u'b', u'c', u'd'], dtype='object'), but I want the 0 and 1. So I can’t just access row.index.
I know I could create a temporary column in the table where I store the index, but I’m wondering if it is stored in the row object somewhere.
To access the index in this case you access the name attribute:
In [182]:
df = pd.DataFrame([[1,2,3],[4,5,6]], columns=['a','b','c'])
def rowFunc(row):
return row['a'] + row['b'] * row['c']
def rowIndex(row):
return row.name
df['d'] = df.apply(rowFunc, axis=1)
df['rowIndex'] = df.apply(rowIndex, axis=1)
df
Out[182]:
a b c d rowIndex
0 1 2 3 7 0
1 4 5 6 34 1
Note that if this is really what you are trying to do, the following works and is much faster:
In [198]:
df['d'] = df['a'] + df['b'] * df['c']
df
Out[198]:
a b c d
0 1 2 3 7
1 4 5 6 34
In [199]:
%timeit df['a'] + df['b'] * df['c']
%timeit df.apply(rowIndex, axis=1)
10000 loops, best of 3: 163 µs per loop
1000 loops, best of 3: 286 µs per loop
EDIT
Looking at this question 3+ years later, you could just do:
In[15]:
df['d'],df['rowIndex'] = df['a'] + df['b'] * df['c'], df.index
df
Out[15]:
a b c d rowIndex
0 1 2 3 7 0
1 4 5 6 34 1
but assuming it isn’t as trivial as this, whatever your rowFunc is really doing, you should look to use the vectorised functions, and then use them against the df index:
In[16]:
df['newCol'] = df['a'] + df['b'] + df['c'] + df.index
df
Out[16]:
a b c d rowIndex newCol
0 1 2 3 7 0 6
1 4 5 6 34 1 16
Answer 1
Either:

1. with row.name inside an apply(..., axis=1) call:

df = pandas.DataFrame([[1,2,3],[4,5,6]], columns=['a','b','c'], index=['x','y'])

   a  b  c
x  1  2  3
y  4  5  6

df.apply(lambda row: row.name, axis=1)

x    x
y    y
To answer the original question: yes, you can access the index value of a row in apply(). It is available under the key name and requires that you specify axis=1 (because the lambda processes the columns of a row and not the rows of a column).
Working example (pandas 0.23.4):
>>> import pandas as pd
>>> df = pd.DataFrame([[1,2,3],[4,5,6]], columns=['a','b','c'])
>>> df.set_index('a', inplace=True)
>>> df
b c
a
1 2 3
4 5 6
>>> df['index_x10'] = df.apply(lambda row: 10*row.name, axis=1)
>>> df
b c index_x10
a
1 2 3 10
4 5 6 40
Is there any function that would be the equivalent of a combination of df.isin() and df[col].str.contains()?
For example, say I have the series
s = pd.Series(['cat','hat','dog','fog','pet']), and I want to find all places where s contains any of ['og', 'at']. I would want to get everything but 'pet'.
I have a solution, but it’s rather inelegant:
searchfor = ['og', 'at']
found = [s.str.contains(x) for x in searchfor]
result = pd.DataFrame(found)
result.any()
One option is just to use the regex | character to try to match each of the substrings in the words in your Series s (still using str.contains).
You can construct the regex by joining the words in searchfor with |:
>>> searchfor = ['og', 'at']
>>> s[s.str.contains('|'.join(searchfor))]
0 cat
1 hat
2 dog
3 fog
dtype: object
As @AndyHayden noted in the comments below, take care if your substrings have special characters such as $ and ^ which you want to match literally. These characters have specific meanings in the context of regular expressions and will affect the matching.
You can make your list of substrings safer by escaping non-alphanumeric characters with re.escape:
>>> import re
>>> matches = ['$money', 'x^y']
>>> safe_matches = [re.escape(m) for m in matches]
>>> safe_matches
['\\$money', 'x\\^y']
The strings in this new list will match each character literally when used with str.contains.
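For instance, a sketch of feeding the escaped list back into the joining approach above (s2 is an illustrative series):

>>> s2 = pd.Series(['$money gone', 'x^y = z', 'plain'])
>>> s2[s2.str.contains('|'.join(safe_matches))]
0    $money gone
1        x^y = z
dtype: object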
You can use str.contains alone with a regex pattern using OR (|):
s[s.str.contains('og|at')]
Or you could add the series to a dataframe then use str.contains:
df = pd.DataFrame(s)
df[s.str.contains('og|at')]
Output:
0 cat
1 hat
2 dog
3 fog
Answer 2
Here is a one-line lambda that also works:

df["TrueFalse"] = df['col1'].apply(lambda x: 1 if any(i in x for i in searchfor) else 0)

Input:

searchfor = ['og', 'at']
df = pd.DataFrame([('cat', 1000.0), ('hat', 2000000.0), ('dog', 1000.0), ('fog', 330000.0), ('pet', 330000.0)],
                  columns=['col1', 'col2'])

  col1       col2
0  cat     1000.0
1  hat  2000000.0
2  dog     1000.0
3  fog   330000.0
4  pet   330000.0

Applying the lambda:

df["TrueFalse"] = df['col1'].apply(lambda x: 1 if any(i in x for i in searchfor) else 0)

Output:

  col1       col2  TrueFalse
0  cat     1000.0          1
1  hat  2000000.0          1
2  dog     1000.0          1
3  fog   330000.0          1
4  pet   330000.0          0
I’m confused about the rules Pandas uses when deciding that a selection from a dataframe is a copy of the original dataframe, or a view on the original.
I understand that a query returns a copy so that something like
foo = df.query('2 < index <= 5')
foo.loc[:,'E'] = 40
will have no effect on the original dataframe, df. I also understand that scalar or named slices return a view, so that assignments to these, such as
df.iloc[3] = 70
or
df.ix[1,'B':'E'] = 222
will change df. But I’m lost when it comes to more complicated cases. For example,
df[df.C <= df.B] = 7654321
changes df, but
df[df.C <= df.B].ix[:,'B':'E']
does not.
Is there a simple rule that Pandas is using that I’m just missing? What’s going on in these specific cases; and in particular, how do I change all values (or a subset of values) in a dataframe that satisfy a particular query (as I’m attempting to do in the last example above)?
Note: This is not the same as this question; and I have read the documentation, but am not enlightened by it. I’ve also read through the “Related” questions on this topic, but I’m still missing the simple rule Pandas is using, and how I’d apply it to — for example — modify the values (or a subset of values) in a dataframe that satisfy a particular query.
If inplace=True is provided, it will modify in-place; only some operations support this
An indexer that sets, e.g. .loc/.iloc/.iat/.at will set inplace.
An indexer that gets on a single-dtyped object is almost always a view (depending on the memory layout it may not be, which is why this is not reliable). This is mainly for efficiency. (The example above uses .query; that will always return a copy as it is evaluated by numexpr.)
An indexer that gets on a multiple-dtyped object is always a copy.
Your example of chained indexing
df[df.C <= df.B].loc[:,'B':'E']
is not guaranteed to work (and thus you should never do this).
Instead do:
df.loc[df.C <= df.B, 'B':'E']
as this is faster and will always work
The chained indexing is 2 separate python operations and thus cannot be reliably intercepted by pandas (you will oftentimes get a SettingWithCopyWarning, but that is not 100% detectable either). The dev docs, which you pointed to, offer a much fuller explanation.
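For the assignment case asked about in the question, the same single .loc expression can be used on the left-hand side (a sketch):

df.loc[df.C <= df.B, 'B':'E'] = 7654321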
I would like to append a string to the start of each value in a said column of a pandas dataframe (elegantly).
I already figured out how to kind-of do this and I am currently using:
This seems one hell of an inelegant thing to do – do you know any other way (which maybe also adds the character to rows where that column is 0 or NaN)?
In case this is yet unclear, I would like to turn:
col
1 a
2 0
into:
col
1 stra
2 str0
Answer 0

df['col'] = 'str' + df['col'].astype(str)

Example:

>>> df = pd.DataFrame({'col': ['a', 0]})
>>> df
  col
0   a
1   0
>>> df['col'] = 'str' + df['col'].astype(str)
>>> df
    col
0  stra
1  str0

Timing comparison:

df = pd.DataFrame({'col': ['a', 0] * 200000})

%timeit df['col'].apply(lambda x: f"str{x}")
117 ms ± 451 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit 'str' + df['col'].astype(str)
112 ms ± 1.04 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Using format, however, is indeed much slower:

%timeit df['col'].apply(lambda x: "{}{}".format('str', x))
185 ms ± 1.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
As an alternative, you can also use an apply combined with format (or better with f-strings) which I find slightly more readable if one e.g. also wants to add a suffix or manipulate the element itself:
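The snippet for that alternative isn’t shown above; a minimal sketch of what it could look like (the suffix is illustrative):

df['col'] = df['col'].apply(lambda x: f"str{x}")
# or with a suffix as well:
df['col'] = df['col'].apply(lambda x: f"str{x}_suffix")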
Assuming df has a unique index, this gives the row with the maximum value:
In [34]: df.loc[df['Value'].idxmax()]
Out[34]:
Country US
Place Kansas
Value 894
Name: 7
Note that idxmax returns index labels. So if the DataFrame has duplicates in the index, the label may not uniquely identify the row, so df.loc may return more than one row.
Therefore, if df does not have a unique index, you must make the index unique before proceeding as above. Depending on the DataFrame, sometimes you can use stack or set_index to make the index unique. Or, you can simply reset the index (so the rows become renumbered, starting at 0):
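For example, a sketch of the reset-index route, using the Country/Place/Value frame from above:

df = df.reset_index()          # rows renumbered 0..n-1; the old index becomes a column
df.loc[df['Value'].idxmax()]   # idxmax now identifies a unique row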
I think the easiest way to return a row with the maximum value is by getting its index. argmax() can be used to return the index of the row with the largest value.
index = df.Value.argmax()
Now the index could be used to get the features for that particular row:
df.iloc[df.Value.argmax(), 0:2]
Use the index attribute of DataFrame. Note that I don’t type all the rows in the example.
In [14]: df = data.groupby(['Country','Place'])['Value'].max()
In [15]: df.index
Out[15]:
MultiIndex
[Spain Manchester, UK London, US Mchigan, NewYork]
In [16]: df.index[0]
Out[16]: ('Spain', 'Manchester')
In [17]: df.index[1]
Out[17]: ('UK', 'London')
You can also get the value by that index:
In [21]: for index in df.index:
    ...:     print(index, df[index])
    ...:
('Spain', 'Manchester') 512
('UK', 'London') 778
('US', 'Mchigan') 854
('US', 'NewYork') 562
Edit
Sorry for misunderstanding what you want; try the following:
In [52]: s = data.max()
In [53]: print('%s, %s, %s' % (s['Country'], s['Place'], s['Value']))
US, NewYork, 854
I encountered a similar error while trying to import data using pandas. The first column in my dataset had spaces before the start of the words. I removed the spaces and it worked like a charm!
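If you would rather strip such spaces programmatically after reading, a sketch (first_col stands in for whatever your leading column is called):

import pandas as pd

df = pd.read_csv('data.csv')
df.columns = df.columns.str.strip()            # stray spaces in the headers
df['first_col'] = df['first_col'].str.strip()  # stray spaces in the values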
Good question and answer, but they only handle one column with a list (in my answer the self-defined function will work for multiple columns; also, the accepted answer uses apply, which is the most time-consuming approach and is not recommended; for more, check When should I ever want to use pandas apply() in my code?)
I know object dtype columns make the data hard to convert with pandas functions. When I received data like this, the first thing that came to mind was to ‘flatten’ or unnest the columns.
I am using pandas and python functions for this type of question. If you are worried about the speed of the above solutions, check user3483203’s answer, since it uses numpy and most of the time numpy is faster. I recommend Cython or numba if speed matters.
Method 0 [pandas >= 0.25]
Starting from pandas 0.25, if you only need to explode one column, you can use the pandas.DataFrame.explode function:
df.explode('B')
A B
0 1 1
1 1 2
0 2 1
1 2 2
Given a dataframe with an empty list or a NaN in the column: an empty list will not cause an issue, but a NaN will need to be filled with a list.
df = pd.DataFrame({'A': [1, 2, 3, 4],'B': [[1, 2], [1, 2], [], np.nan]})
df.B = df.B.fillna({i: [] for i in df.index}) # replace NaN with []
df.explode('B')
A B
0 1 1
0 1 2
1 2 1
1 2 2
2 3 NaN
3 4 NaN
Method 1: apply + pd.Series (easy to understand, but not recommended in terms of performance)
df.set_index('A').B.apply(pd.Series).stack().reset_index(level=0).rename(columns={0:'B'})
Out[463]:
A B
0 1 1
1 1 2
0 2 1
1 2 2
Method 2
Using repeat with the DataFrame constructor to re-create your dataframe (good performance, but not good with multiple columns):
df = pd.DataFrame({'A': df.A.repeat(df.B.str.len()), 'B': np.concatenate(df.B.values)})
df
Out[465]:
A B
0 1 1
0 1 2
1 2 1
1 2 2
Method 2.1
Suppose that besides A we have A.1, ..., A.n. If we still use Method 2 above, it is hard to re-create the columns one by one.
Solution: join or merge with the index after ‘unnest’-ing the single column.
s = pd.DataFrame({'B': np.concatenate(df.B.values)}, index=df.index.repeat(df.B.str.len()))
s.join(df.drop('B', axis=1), how='left')
Out[477]:
B A
0 1 1
0 2 1
1 1 2
1 2 2
If you need the column order exactly the same as before, add reindex at the end.
Method 3
Using a list comprehension with the DataFrame constructor:

pd.DataFrame([[x] + [z] for x, y in df.values for z in y], columns=df.columns)
Out[488]:
A B
0 1 1
1 1 2
2 2 1
3 2 2
If more than two columns, use
s = pd.DataFrame([[x] + [z] for x, y in zip(df.index, df.B) for z in y])
s.merge(df, left_on=0, right_index=True)
Out[491]:
0 1 A B
0 0 1 1 [1, 2]
1 0 2 1 [1, 2]
2 1 1 2 [1, 2]
3 1 2 2 [1, 2]
Method 4
using reindex or loc
df.reindex(df.index.repeat(df.B.str.len())).assign(B=np.concatenate(df.B.values))
Out[554]:
A B
0 1 1
0 1 2
1 2 1
1 2 2
#df.loc[df.index.repeat(df.B.str.len())].assign(B=np.concatenate(df.B.values))
Method 5
when the list only contains unique values:
df=pd.DataFrame({'A':[1,2],'B':[[1,2],[3,4]]})
from collections import ChainMap
d = dict(ChainMap(*map(dict.fromkeys, df['B'], df['A'])))
pd.DataFrame(list(d.items()),columns=df.columns[::-1])
Out[574]:
B A
0 1 1
1 2 1
2 3 2
3 4 2
Method 6
using numpy for high performance:
newvalues=np.dstack((np.repeat(df.A.values,list(map(len,df.B.values))),np.concatenate(df.B.values)))
pd.DataFrame(data=newvalues[0],columns=df.columns)
A B
0 1 1
1 1 2
2 2 1
3 2 2
Method 7
using the base functions itertools.cycle and chain: a pure Python solution, just for fun
from itertools import cycle,chain
l=df.values.tolist()
l1=[list(zip([x[0]], cycle(x[1])) if len([x[0]]) > len(x[1]) else list(zip(cycle([x[0]]), x[1]))) for x in l]
pd.DataFrame(list(chain.from_iterable(l1)),columns=df.columns)
A B
0 1 1
1 1 2
2 2 1
3 2 2
Generalizing to multiple columns
df=pd.DataFrame({'A':[1,2],'B':[[1,2],[3,4]],'C':[[1,2],[3,4]]})
df
Out[592]:
A B C
0 1 [1, 2] [1, 2]
1 2 [3, 4] [3, 4]
Self-defined function:
def unnesting(df, explode):
    idx = df.index.repeat(df[explode[0]].str.len())
    df1 = pd.concat([
        pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx
    return df1.join(df.drop(explode, axis=1), how='left')
unnesting(df,['B','C'])
Out[609]:
B C A
0 1 1 1
0 2 2 1
1 3 3 2
1 4 4 2
Column-wise Unnesting
All the methods above deal with vertical unnesting and exploding. If you do need to expand the list horizontally, check the pd.DataFrame constructor:
df.join(pd.DataFrame(df.B.tolist(),index=df.index).add_prefix('B_'))
Out[33]:
A B C B_0 B_1
0 1 [1, 2] [1, 2] 1 2
1 2 [3, 4] [3, 4] 3 4
Updated function
def unnesting(df, explode, axis):
    if axis == 1:
        idx = df.index.repeat(df[explode[0]].str.len())
        df1 = pd.concat([
            pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
        df1.index = idx
        return df1.join(df.drop(explode, axis=1), how='left')
    else:
        df1 = pd.concat([
            pd.DataFrame(df[x].tolist(), index=df.index).add_prefix(x) for x in explode], axis=1)
        return df1.join(df.drop(explode, axis=1), how='left')
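A quick usage sketch of the updated function (note its convention: axis=1 performs the vertical unnest shown earlier, anything else the horizontal expansion):

unnesting(df, ['B', 'C'], axis=1)   # long format: one row per list element
unnesting(df, ['B', 'C'], axis=0)   # wide format: columns B0, B1, C0, C1 joined to A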
If all of the sublists in the other column are the same length, numpy can be an efficient option here:
vals = np.array(df.B.values.tolist())
a = np.repeat(df.A, vals.shape[1])
pd.DataFrame(np.column_stack((a, vals.ravel())), columns=df.columns)
A B
0 1 1
1 1 2
2 2 1
3 2 2
Option 2
If the sublists have different length, you need an additional step:
vals = df.B.values.tolist()
rs = [len(r) for r in vals]
a = np.repeat(df.A, rs)
pd.DataFrame(np.column_stack((a, np.concatenate(vals))), columns=df.columns)
A B
0 1 1
1 1 2
2 2 1
3 2 2
Option 3
I took a shot at generalizing this to flatten N columns and tile M columns; I’ll work later on making it more efficient. Given:

   A          B          C  D
0  1     [1, 2]  [1, 2, 3]  A
1  2  [1, 2, 3]     [1, 2]  B
2  3        [1]     [1, 2]  C

def unnest(df, tile, explode):
    vals = df[explode].sum(1)
    rs = [len(r) for r in vals]
    a = np.repeat(df[tile].values, rs, axis=0)
    b = np.concatenate(vals.values)
    d = np.column_stack((a, b))
    return pd.DataFrame(d, columns=tile + ['_'.join(explode)])

unnest(df, ['A', 'D'], ['B', 'C'])

    A  D B_C
0   1  A   1
1   1  A   2
2   1  A   1
3   1  A   2
4   1  A   3
5   2  B   1
6   2  B   2
7   2  B   3
8   2  B   1
9   2  B   2
10  3  C   1
11  3  C   1
12  3  C   2

Functions

def wen1(df):
    return df.set_index('A').B.apply(pd.Series).stack().reset_index(level=0).rename(columns={0: 'B'})

def wen2(df):
    return pd.DataFrame({'A': df.A.repeat(df.B.str.len()), 'B': np.concatenate(df.B.values)})

def wen3(df):
    s = pd.DataFrame({'B': np.concatenate(df.B.values)}, index=df.index.repeat(df.B.str.len()))
    return s.join(df.drop('B', axis=1), how='left')

def wen4(df):
    return pd.DataFrame([[x] + [z] for x, y in df.values for z in y], columns=df.columns)

def chris1(df):
    vals = np.array(df.B.values.tolist())
    a = np.repeat(df.A, vals.shape[1])
    return pd.DataFrame(np.column_stack((a, vals.ravel())), columns=df.columns)

def chris2(df):
    vals = df.B.values.tolist()
    rs = [len(r) for r in vals]
    a = np.repeat(df.A.values, rs)
    return pd.DataFrame(np.column_stack((a, np.concatenate(vals))), columns=df.columns)

Timings

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from timeit import timeit

res = pd.DataFrame(
    index=['wen1', 'wen2', 'wen3', 'wen4', 'chris1', 'chris2'],
    columns=[10, 50, 100, 500, 1000, 5000, 10000],
    dtype=float)

for f in res.index:
    for c in res.columns:
        df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [1, 2]]})
        df = pd.concat([df] * c)
        stmt = '{}(df)'.format(f)
        setp = 'from __main__ import df, {}'.format(f)
        res.at[f, c] = timeit(stmt, setp, number=50)

ax = res.div(res.min()).T.plot(loglog=True)
ax.set_xlabel("N")
ax.set_ylabel("time (relative)")
df[['B1', 'B2']] = pd.DataFrame([*df['B']])   # if values.tolist() is too boring

(pd.wide_to_long(df.drop('B', axis=1), 'B', 'A', '')
   .reset_index(level=1, drop=True)
   .reset_index())
Because sublist lengths usually differ, and join/merge is far more computationally expensive, I retested the methods for different-length sublists and more normal columns.
MultiIndex should also be an easier way to write, with nearly the same performance as the numpy way.
Surprisingly, in my implementation the comprehension way has the best performance.
def stack(df):
return df.set_index(['A', 'C']).B.apply(pd.Series).stack()
def comprehension(df):
return pd.DataFrame([x + [z] for x, y in zip(df[['A', 'C']].values.tolist(), df.B) for z in y])
def multiindex(df):
return pd.DataFrame(np.concatenate(df.B.values), index=df.set_index(['A', 'C']).index.repeat(df.B.str.len()))
def array(df):
return pd.DataFrame(
np.column_stack((
np.repeat(df[['A', 'C']].values, df.B.str.len(), axis=0),
np.concatenate(df.B.values)
))
)
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from timeit import timeit
res = pd.DataFrame(
index=[
'stack',
'comprehension',
'multiindex',
'array',
],
columns=[1000, 2000, 5000, 10000, 20000, 50000],
dtype=float
)
for f in res.index:
for c in res.columns:
df = pd.DataFrame({'A': list('abc'), 'C': list('def'), 'B': [['g', 'h', 'i'], ['j', 'k'], ['l']]})
df = pd.concat([df] * c)
stmt = '{}(df)'.format(f)
setp = 'from __main__ import df, {}'.format(f)
res.at[f, c] = timeit(stmt, setp, number=20)
ax = res.div(res.min()).T.plot(loglog=True)
ax.set_xlabel("N")
ax.set_ylabel("time (relative)")
The actual explosion is performed in 3 lines. The rest is cosmetics (multi column explosion, handling of strings instead of lists in the explosion column, …).
import pandas as pd
import numpy as np
df=pd.DataFrame( {'A': ['A1','A2','A3'],
'B': ['B1','B2','B3'],
'C': [ ['C1.1','C1.2'],['C2.1','C2.2'],'C3'],
'columnD': [ 'D1',['D2.1','D2.2', 'D2.3'],['D3.1','D3.2']],
})
print('df',df, sep='\n')
def dfListExplode(df, explodeKeys):
    if not isinstance(explodeKeys, list):
        explodeKeys = [explodeKeys]
    # recursive handling of explodeKeys
    if len(explodeKeys) == 0:
        return df
    elif len(explodeKeys) == 1:
        explodeKey = explodeKeys[0]
    else:
        return dfListExplode(dfListExplode(df, explodeKeys[:1]), explodeKeys[1:])
    # perform explosion/unnesting for key: explodeKey
    dfPrep = df[explodeKey].apply(lambda x: x if isinstance(x, list) else [x])  # casts all elements to a list
    dfIndExpl = pd.DataFrame([[x] + [z] for x, y in zip(dfPrep.index, dfPrep.values) for z in y],
                             columns=['explodedIndex', explodeKey])
    dfMerged = dfIndExpl.merge(df.drop(explodeKey, axis=1), left_on='explodedIndex', right_index=True)
    dfReind = dfMerged.reindex(columns=list(df))
    return dfReind
dfExpl=dfListExplode(df,['C','columnD'])
print('dfExpl',dfExpl, sep='\n')
When the lengths are the same, it is easy for us to assume that the varying elements coincide and should be “zipped” together.
A B C
0 1 [1, 2] [1, 2] # Typical to assume these should be zipped [(1, 1), (2, 2)]
1 2 [3, 4] [3, 4, 5]
However, the assumption gets challenged when we see objects of different lengths. Should we “zip”? If so, how do we handle the excess in one of the objects? Or maybe we want the product of all of the objects; this will get big fast, but might be what is wanted.
A B C
0 1 [1, 2] [1, 2]
1 2 [3, 4] [3, 4, 5] # is this [(3, 3), (4, 4), (None, 5)]?
OR
A B C
0 1 [1, 2] [1, 2]
1 2 [3, 4] [3, 4, 5] # is this [(3, 3), (3, 4), (3, 5), (4, 3), (4, 4), (4, 5)]
The Function
This function gracefully handles zip or product based on a parameter and, when zipping, pads to the length of the longest object via zip_longest.
from itertools import zip_longest, product
def xplode(df, explode, zipped=True):
method = zip_longest if zipped else product
rest = {*df} - {*explode}
zipped = zip(zip(*map(df.get, rest)), zip(*map(df.get, explode)))
tups = [tup + exploded
for tup, pre in zipped
for exploded in method(*pre)]
return pd.DataFrame(tups, columns=[*rest, *explode])[[*df]]
Zipped
xplode(df, ['B', 'C'])
A B C
0 1 1.0 1
1 1 2.0 2
2 2 3.0 3
3 2 4.0 4
4 2 NaN 5
Product

xplode(df, ['B', 'C'], zipped=False)

   A  B  C
0  1  1  1
1  1  1  2
2  1  2  1
3  1  2  2
4  2  3  3
5  2  3  4
6  2  3  5
7  2  4  3
8  2  4  4
9  2  4  5

New setup
Modifying the example:

df = pd.DataFrame({
'A': [1, 2],
'B': [[1, 2], [3, 4]],
'C': 'C',
'D': [[1, 2], [3, 4, 5]],
'E': [('X', 'Y', 'Z'), ('W',)]
})
df
A B C D E
0 1 [1, 2] C [1, 2] (X, Y, Z)
1 2 [3, 4] C [3, 4, 5] (W,)
Zipped
xplode(df, ['B', 'D', 'E'])
A B C D E
0 1 1.0 C 1.0 X
1 1 2.0 C 2.0 Y
2 1 NaN C NaN Z
3 2 3.0 C 3.0 W
4 2 4.0 C 4.0 None
5 2 NaN C 5.0 None
Product
xplode(df, ['B', 'D', 'E'], zipped=False)
A B C D E
0 1 1 C 1 X
1 1 1 C 1 Y
2 1 1 C 1 Z
3 1 1 C 2 X
4 1 1 C 2 Y
5 1 1 C 2 Z
6 1 2 C 1 X
7 1 2 C 1 Y
8 1 2 C 1 Z
9 1 2 C 2 X
10 1 2 C 2 Y
11 1 2 C 2 Z
12 2 3 C 3 W
13 2 3 C 4 W
14 2 3 C 5 W
15 2 4 C 3 W
16 2 4 C 4 W
17 2 4 C 5 W
Any opinions on this method I thought of? Or is doing both concat and melt considered too “expensive”?
df=pd.DataFrame({'A':[1,2],'B':[[1,2],[1,2]]})
out = pd.concat([df.loc[:,'A'],(df.B.apply(pd.Series))], axis=1, sort=False)
out = out.set_index('A').stack().droplevel(level=1).reset_index().rename(columns={0:"B"})
A B
0 1 1
1 1 2
2 2 1
3 2 2
You can implement this as a one-liner if you don’t wish to create an intermediate object:
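A sketch of the chained version (the same operations as above, without the intermediate name):

out = (pd.concat([df.loc[:, 'A'], df.B.apply(pd.Series)], axis=1, sort=False)
         .set_index('A')
         .stack()
         .droplevel(level=1)
         .reset_index()
         .rename(columns={0: 'B'}))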
# Here's the answer to the related question in:
# https://stackoverflow.com/q/56708671/11426125
# initial dataframe
df12=pd.DataFrame({'Date':['2007-12-03','2008-09-07'],'names':
[['Peter','Alex'],['Donald','Stan']]})
# convert dataframe to array for indexing list values (names)
a = np.array(df12.values)
# create a new, dataframe with dimensions for unnested
b = np.ndarray(shape = (4,2))
df2 = pd.DataFrame(b, columns = ["Date", "names"], dtype = str)
# implement loops to assign date/name values as required
i = range(len(a[0]))
j = range(len(a[0]))
for x in i:
for y in j:
df2.iat[2*x+y, 0] = a[x][0]
df2.iat[2*x+y, 1] = a[x][1][y]
# set Date column as Index
df2.Date=pd.to_datetime(df2.Date)
df2.index=df2.Date
df2.drop('Date',axis=1,inplace =True)
I have another good way to solve this when you have more than one column to explode.
df=pd.DataFrame({'A':[1,2],'B':[[1,2],[1,2]], 'C':[[1,2,3],[1,2,3]]})
print(df)
A B C
0 1 [1, 2] [1, 2, 3]
1 2 [1, 2] [1, 2, 3]
I want to explode the columns B and C: first I explode B, then C. Then I drop B and C from the original df. After that I do an index join on the 3 dfs.
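A minimal sketch of that procedure (assuming pandas >= 0.25 for Series.explode); note that the index join yields the per-row combinations of B and C:

b = df['B'].explode()
c = df['C'].explode()
out = df.drop(columns=['B', 'C']).join(b).join(c)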
If you are in Jupyter notebook, you could run the following code to interactively display the dataframe in a well formatted table.
This answer builds on the to_html(‘temp.html’) answer above, but instead of creating a file displays the well formatted table directly in the notebook:
from IPython.display import display, HTML
display(HTML(df.to_html()))
You can use prettytable to render the table as text. The trick is to convert the data_frame to an in-memory csv file and have prettytable read it. Here’s the code:
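The snippet itself is missing above; a minimal sketch of the described trick, writing the frame into an in-memory CSV buffer and letting prettytable read it back:

from io import StringIO

import pandas as pd
from prettytable import from_csv

df = pd.DataFrame({'A': [1, 2], 'B': ['a', 'b']})

buf = StringIO()
df.to_csv(buf)   # serialize the frame as CSV into the buffer
buf.seek(0)      # rewind so prettytable reads from the start
print(from_csv(buf))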
I used Ofer’s answer for a while and found it great in most cases. Unfortunately, due to inconsistencies between pandas’s to_csv and prettytable‘s from_csv, I had to use prettytable in a different way.
One failure case is a dataframe containing commas:
pd.DataFrame({'A': [1, 2], 'B': ['a,', 'b']})
Prettytable raises an error of the form:
Error: Could not determine delimiter
The following function handles this case:
from prettytable import PrettyTable

def format_for_print(df):
    table = PrettyTable([''] + list(df.columns))
    for row in df.itertuples():
        table.add_row(row)
    return str(table)
If you don’t care about the index, use:
def format_for_print2(df):
    table = PrettyTable(list(df.columns))
    for row in df.itertuples():
        table.add_row(row[1:])
    return str(table)
Following up on Mark’s answer, if you’re not using Jupyter for some reason, e.g. you want to do some quick testing on the console, you can use the DataFrame.to_string method, which has worked since at least pandas 0.12.
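For example:

print(df.to_string())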
I wanted a paper printout of a dataframe, but I also wanted to add some results and comments on the same page.
I worked through the above and could not get what I wanted. I ended up using
file.write(df1.to_csv()) and file.write(",,,blah,,,,,,blah") statements to get my extras on the page.
When I opened the csv file, it went straight to a spreadsheet, which printed everything in the right place and format.