I have a DataFrame, and I want to replace the values in a particular column that exceed a given threshold with zero. I thought this was a way of achieving it:
df[df.my_channel > 20000].my_channel = 0
If I copy the channel into a new data frame it’s simple:
df2 = df.my_channel
df2[df2 > 20000] = 0
This does exactly what I want, but seems not to work with the channel as part of the original DataFrame.
The .ix indexer works okay for pandas versions prior to 0.20.0, but since pandas 0.20.0 the .ix indexer is deprecated, so you should avoid using it. Instead, use the .loc or .iloc indexers. You can solve this problem by:
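A sketch of that solution, reconstructed from the description below (my_channel and the threshold come from the question):

mask = df.my_channel > 20000
column_name = 'my_channel'
df.loc[mask, column_name] = 0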
mask helps you select the rows in which df.my_channel > 20000 is True, while df.loc[mask, column_name] = 0 sets the value 0 in those selected rows, in the column whose name is column_name.
Update:
In this case, you should use loc because if you use iloc, you will get a NotImplementedError telling you that iLocation based boolean indexing on an integer type is not available.
The reason your original dataframe does not update is because chained indexing may cause you to modify a copy rather than a view of your dataframe. The docs give this advice:
When setting values in a pandas object, care must be taken to avoid
what is called chained indexing.
You can also use NumPy, assigning from your original series wherever the condition is not satisfied; however, the first two solutions are cleaner since they explicitly change only the specified values.
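A sketch of the NumPy route (np.where keeps the original values wherever the condition is not satisfied):

import numpy as np
df['my_channel'] = np.where(df.my_channel > 20000, 0, df.my_channel)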
Given the following DataFrame:

A B C
0 1 0.749065 This
1 2 0.301084 is
2 3 0.463468 a
3 4 0.643961 random
4 1 0.866521 string
5 2 0.120737 !
Calling
In [10]: print df.groupby("A")["B"].sum()
will return
A
1 1.615586
2 0.421821
3 0.463468
4 0.643961
Now I would like to do “the same” for column “C”. Because that column contains strings, sum() doesn’t work (although you might think it would concatenate the strings). What I would really like to see is a list or set of the strings for each group, e.g. {1: {This, string}, 2: {is, !}, 3: {a}, 4: {random}}, so I was hoping some Series method would work. Any ideas?
In [4]: df = read_csv(StringIO(data),sep='\s+')
In [5]: df
Out[5]:
A B C
0 1 0.749065 This
1 2 0.301084 is
2 3 0.463468 a
3 4 0.643961 random
4 1 0.866521 string
5 2 0.120737 !
In [6]: df.dtypes
Out[6]:
A int64
B float64
C object
dtype: object
When you apply your own function, non-numeric columns are not automatically excluded. This is slower, though, than applying .sum() to the groupby:
In [8]: df.groupby('A').apply(lambda x: x.sum())
Out[8]:
A B C
A
1 2 1.615586 Thisstring
2 4 0.421821 is!
3 3 0.463468 a
4 4 0.643961 random
sum by default concatenates
In [9]: df.groupby('A')['C'].apply(lambda x: x.sum())
Out[9]:
A
1 Thisstring
2 is!
3 a
4 random
dtype: object

You can pretty much do what you want:

In [11]: df.groupby('A')['C'].apply(lambda x: "{%s}" % ', '.join(x))
Out[11]:
A
1    {This, string}
2           {is, !}
3               {a}
4          {random}
dtype: object

To do this on the whole frame, one group at a time, the key is to return a Series:

def f(x):
    return Series(dict(A = x['A'].sum(),
                       B = x['B'].sum(),
                       C = "{%s}" % ', '.join(x['C'])))

In [14]: df.groupby('A').apply(f)
Out[14]:
   A         B               C
A
1  2  1.615586  {This, string}
2  4  0.421821         {is, !}
3  3  0.463468             {a}
4  4  0.643961        {random}
You can use the apply method to apply an arbitrary function to the grouped data. So if you want a set, apply set. If you want a list, apply list.
>>> d
A B
0 1 This
1 2 is
2 3 a
3 4 random
4 1 string
5 2 !
>>> d.groupby('A')['B'].apply(list)
A
1 [This, string]
2 [is, !]
3 [a]
4 [random]
dtype: object
If you want something else, just write a function that does what you want and then apply that.
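For instance, a sketch of a custom aggregator that joins the sorted unique strings in each group (using the frame d from above):

>>> d.groupby('A')['B'].apply(lambda g: ', '.join(sorted(set(g))))
A
1    This, string
2           !, is
3               a
4          random
dtype: object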
Since pandas version 0.25.0 we have named aggregations, where we can groupby, aggregate and at the same time assign new names to our columns. This way we won’t get the MultiIndex columns, and the column names make more sense given the data they contain:

grp = df.groupby('A').agg(B_sum=('B', 'sum'),
                          C=('C', list)).reset_index()
print(grp)

   A     B_sum               C
0  1  1.615586  [This, string]
1  2  0.421821         [is, !]
2  3  0.463468             [a]
3  4  0.643961        [random]

Aggregate and join the strings:

grp = df.groupby('A').agg(B_sum=('B', 'sum'),
                          C=('C', ', '.join)).reset_index()
print(grp)

   A     B_sum             C
0  1  1.615586  This, string
1  2  0.421821         is, !
2  3  0.463468             a
3  4  0.643961        random
Following @Erfan’s good answer: most of the time, in an analysis of aggregated values, you want the unique possible combinations of the existing character values:
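For instance, a sketch in the same named-aggregation style (assumption: the unique values are collected per group with set):

grp = df.groupby('A').agg(B_sum=('B', 'sum'),
                          C=('C', lambda g: set(g))).reset_index()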
And if you want to produce a column containing the name of the column with the maximum value, but considering only a subset of columns, then you can use a variation of @ajcr’s answer:
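A sketch of that variation (cols is a hypothetical list naming the subset of columns to consider):

cols = ['Communications', 'Business']  # hypothetical subset
df['Max'] = df[cols].idxmax(axis=1)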
You could use apply on the DataFrame and get the argmax() of each row via axis=1:
In [144]: df.apply(lambda x: x.argmax(), axis=1)
Out[144]:
0 Communications
1 Business
2 Communications
3 Communications
4 Business
dtype: object
Here’s a benchmark comparing how slow the apply method is relative to idxmax(), for len(df) ~ 20K:
In [146]: %timeit df.apply(lambda x: x.argmax(), axis=1)
1 loops, best of 3: 479 ms per loop
In [147]: %timeit df.idxmax(axis=1)
10 loops, best of 3: 47.3 ms per loop
I’m new to pandas and trying to figure out how to add multiple columns to a DataFrame simultaneously. Any help here is appreciated. Ideally I would like to do this in one step rather than multiple repeated steps…
import numpy as np
import pandas as pd

df = {'col_1': [0, 1, 2, 3],
      'col_2': [4, 5, 6, 7]}
df = pd.DataFrame(df)

df[['column_new_1', 'column_new_2', 'column_new_3']] = [np.nan, 'dogs', 3]  # thought this would work here...
I would have expected your syntax to work too. The problem arises because when you create new columns with the column-list syntax (df[[new1, new2]] = ...), pandas requires that the right hand side be a DataFrame (note that it doesn’t actually matter if the columns of the DataFrame have the same names as the columns you are creating).
Your syntax works fine for assigning scalar values to existing columns, and pandas is also happy to assign scalar values to a new column using the single-column syntax (df[new1] = ...). So the solution is either to convert this into several single-column assignments, or create a suitable DataFrame for the right-hand side.
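For example, a sketch of both fixes, reusing the names from the question:

# several single-column assignments
df['column_new_1'] = np.nan
df['column_new_2'] = 'dogs'
df['column_new_3'] = 3

# or: build a suitable DataFrame for the right-hand side
df[['column_new_1', 'column_new_2', 'column_new_3']] = pd.DataFrame(
    [[np.nan, 'dogs', 3]] * len(df), index=df.index,
    columns=['column_new_1', 'column_new_2', 'column_new_3'])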
Using a dict is a more “natural” way to create the new data frame than the previous approaches, but the new columns will be sorted alphabetically (at least before Python 3.6 or 3.7).
I like this variant on @zero’s answer a lot, but like the previous one, the new columns will always be sorted alphabetically, at least with early versions of Python:
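A sketch of the dict-based construction these two variants describe (joining the result back is one hypothetical way to attach it; the original snippet was not preserved):

df = df.join(pd.DataFrame(
    {'column_new_1': np.nan,
     'column_new_2': 'dogs',
     'column_new_3': 3},
    index=df.index))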
In [128]: df
Out[128]:
col_1 col_2
0 0 4
1 1 5
2 2 6
3 3 7
In [129]: pd.concat([df, pd.DataFrame(columns = [ 'column_new_1', 'column_new_2','column_new_3'])])
Out[129]:
col_1 col_2 column_new_1 column_new_2 column_new_3
0 0.0 4.0 NaN NaN NaN
1 1.0 5.0 NaN NaN NaN
2 2.0 6.0 NaN NaN NaN
3 3.0 7.0 NaN NaN NaN
I’m not very sure what you wanted to do with [np.nan, 'dogs', 3]. Maybe set them as default values now?
In [142]: df1 = pd.concat([df, pd.DataFrame(columns = [ 'column_new_1', 'column_new_2','column_new_3'])])
In [143]: df1[[ 'column_new_1', 'column_new_2','column_new_3']] = [np.nan, 'dogs', 3]
In [144]: df1
Out[144]:
col_1 col_2 column_new_1 column_new_2 column_new_3
0 0.0 4.0 NaN dogs 3
1 1.0 5.0 NaN dogs 3
2 2.0 6.0 NaN dogs 3
3 3.0 7.0 NaN dogs 3
Using a list comprehension, pd.DataFrame and pd.concat:

pd.concat(
    [
        df,
        pd.DataFrame(
            [[np.nan, 'dogs', 3] for _ in range(df.shape[0])],
            df.index, ['column_new_1', 'column_new_2', 'column_new_3'])
    ], axis=1)
You can pass a list of columns to [] to select columns in that order.
If a column is not contained in the DataFrame, an exception will be raised.
Multiple columns can also be set in this manner.
You may find this useful for applying a transform (in-place) to a subset of the columns.
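For example, a sketch of such an in-place transform (note the .values on the right-hand side, so that pandas assigns positionally instead of re-aligning by column name):

df[['col_1', 'col_2']] = df[['col_2', 'col_1']].values  # swap the two columns' values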
If you just want to add empty new columns, reindex will do the job
df
col_1 col_2
0 0 4
1 1 5
2 2 6
3 3 7
df.reindex(list(df)+['column_new_1', 'column_new_2','column_new_3'], axis=1)
col_1 col_2 column_new_1 column_new_2 column_new_3
0 0 4 NaN NaN NaN
1 1 5 NaN NaN NaN
2 2 6 NaN NaN NaN
3 3 7 NaN NaN NaN
df['d'] = df.apply(rowFunc, axis=1)
>>> df
a b c d
0 1 2 3 7
1 4 5 6 34
Awesome! Now what if I want to incorporate the index into my function?
The index of any given row in this DataFrame before adding d would be Index([u'a', u'b', u'c'], dtype='object'), but I want the 0 and 1. So I can’t just access row.index.
I know I could create a temporary column in the table where I store the index, but I’m wondering if it is stored in the row object somewhere.
To access the index in this case you access the name attribute:
In [182]:
df = pd.DataFrame([[1,2,3],[4,5,6]], columns=['a','b','c'])
def rowFunc(row):
return row['a'] + row['b'] * row['c']
def rowIndex(row):
return row.name
df['d'] = df.apply(rowFunc, axis=1)
df['rowIndex'] = df.apply(rowIndex, axis=1)
df
Out[182]:
a b c d rowIndex
0 1 2 3 7 0
1 4 5 6 34 1
Note that if this is really what you are trying to do, the following works and is much faster:
In [198]:
df['d'] = df['a'] + df['b'] * df['c']
df
Out[198]:
a b c d
0 1 2 3 7
1 4 5 6 34
In [199]:
%timeit df['a'] + df['b'] * df['c']
%timeit df.apply(rowIndex, axis=1)
10000 loops, best of 3: 163 µs per loop
1000 loops, best of 3: 286 µs per loop
EDIT
Looking at this question 3+ years later, you could just do:
In[15]:
df['d'],df['rowIndex'] = df['a'] + df['b'] * df['c'], df.index
df
Out[15]:
a b c d rowIndex
0 1 2 3 7 0
1 4 5 6 34 1
but assuming it isn’t as trivial as this, whatever your rowFunc is really doing, you should look to use the vectorised functions, and then use them against the df index:
In[16]:
df['newCol'] = df['a'] + df['b'] + df['c'] + df.index
df
Out[16]:
a b c d rowIndex newCol
0 1 2 3 7 0 6
1 4 5 6 34 1 16
You can also use row.name inside an apply(..., axis=1) call:

df = pandas.DataFrame([[1,2,3],[4,5,6]], columns=['a','b','c'], index=['x','y'])

   a  b  c
x  1  2  3
y  4  5  6

df.apply(lambda row: row.name, axis=1)

x    x
y    y
To answer the original question: yes, you can access the index value of a row in apply(). It is available under the name attribute and requires that you specify axis=1 (so that the function receives each row as a Series whose name is that row’s index value).
Working example (pandas 0.23.4):
>>> import pandas as pd
>>> df = pd.DataFrame([[1,2,3],[4,5,6]], columns=['a','b','c'])
>>> df.set_index('a', inplace=True)
>>> df
b c
a
1 2 3
4 5 6
>>> df['index_x10'] = df.apply(lambda row: 10*row.name, axis=1)
>>> df
b c index_x10
a
1 2 3 10
4 5 6 40
Here marketing_train is my data set; select_dtypes() selects columns by data type using its exclude and include arguments, and .columns fetches the column names of the data set.
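A sketch of the kind of call being described (the exact include/exclude values were not preserved; excluding object columns is one hypothetical choice):

marketing_train.select_dtypes(exclude=['object']).columns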
The output of the above code is the list of matching column names.
def is_type(df, baseType):
    import numpy as np
    import pandas as pd
    test = [issubclass(np.dtype(d).type, baseType) for d in df.dtypes]
    return pd.DataFrame(data=test, index=df.columns, columns=["test"])

def is_float(df):
    import numpy as np
    return is_type(df, np.float)

def is_number(df):
    import numpy as np
    return is_type(df, np.number)

def is_integer(df):
    import numpy as np
    return is_type(df, np.integer)
Here, df.applymap(np.isreal) shows whether every cell in the data frame is numeric, and .all(axis=0) checks whether all values in a column are True, returning a Series of Booleans that can be used to index the desired columns.
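Reconstructed as code, a sketch of what that describes:

import numpy as np
df.loc[:, df.applymap(np.isreal).all(axis=0)]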
This way you can check whether the values are numeric, such as float and int, or strings. The second if statement checks for string values, which pandas reports as the object dtype.
We can include and exclude data types as per the requirement:

train.select_dtypes(include=None, exclude=None)
train.select_dtypes(include='number')  # will include all the numeric types
The pandas docs say it uses openpyxl for xlsx files. A quick look through the code in ExcelWriter gives a clue that something like this might work out:
import pandas
from openpyxl import load_workbook
book = load_workbook('Masterfile.xlsx')
writer = pandas.ExcelWriter('Masterfile.xlsx', engine='openpyxl')
writer.book = book
## ExcelWriter for some reason uses writer.sheets to access the sheet.
## If you leave it empty it will not know that sheet Main is already there
## and will create a new sheet.
writer.sheets = dict((ws.title, ws) for ws in book.worksheets)
data_filtered.to_excel(writer, "Main", cols=['Diff1', 'Diff2'])
writer.save()
Here is a helper function:
def append_df_to_excel(filename, df, sheet_name='Sheet1', startrow=None,
truncate_sheet=False,
**to_excel_kwargs):
"""
Append a DataFrame [df] to existing Excel file [filename]
into [sheet_name] Sheet.
If [filename] doesn't exist, then this function will create it.
Parameters:
filename : File path or existing ExcelWriter
(Example: '/path/to/file.xlsx')
df : dataframe to save to workbook
sheet_name : Name of sheet which will contain DataFrame.
(default: 'Sheet1')
startrow : upper left cell row to dump data frame.
Per default (startrow=None) calculate the last row
in the existing DF and write to the next row...
truncate_sheet : truncate (remove and recreate) [sheet_name]
before writing DataFrame to Excel file
to_excel_kwargs : arguments which will be passed to `DataFrame.to_excel()`
[can be dictionary]
Returns: None
(c) [MaxU](https://stackoverflow.com/users/5741205/maxu?tab=profile)
"""
from openpyxl import load_workbook
# ignore [engine] parameter if it was passed
if 'engine' in to_excel_kwargs:
to_excel_kwargs.pop('engine')
writer = pd.ExcelWriter(filename, engine='openpyxl')
# Python 2.x: define [FileNotFoundError] exception if it doesn't exist
try:
FileNotFoundError
except NameError:
FileNotFoundError = IOError
try:
# try to open an existing workbook
writer.book = load_workbook(filename)
# get the last row in the existing Excel sheet
# if it was not specified explicitly
if startrow is None and sheet_name in writer.book.sheetnames:
startrow = writer.book[sheet_name].max_row
# truncate sheet
if truncate_sheet and sheet_name in writer.book.sheetnames:
# index of [sheet_name] sheet
idx = writer.book.sheetnames.index(sheet_name)
# remove [sheet_name]
writer.book.remove(writer.book.worksheets[idx])
# create an empty sheet [sheet_name] using old index
writer.book.create_sheet(sheet_name, idx)
# copy existing sheets
writer.sheets = {ws.title:ws for ws in writer.book.worksheets}
except FileNotFoundError:
# file does not exist yet, we will create it
pass
if startrow is None:
startrow = 0
# write out the new sheet
df.to_excel(writer, sheet_name, startrow=startrow, **to_excel_kwargs)
# save the workbook
writer.save()
NOTE: for Pandas < 0.21.0, replace sheet_name with sheetname!
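A hypothetical call, using the file and frame from the 2013 question:

append_df_to_excel('Masterfile.xlsx', data_filtered, sheet_name='Main')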
Old question, but I am guessing some people still search for this – so…
I find this method nice because all worksheets are loaded into a dictionary of sheet name and dataframe pairs, created by pandas with the sheetname=None option. It is simple to add, delete or modify worksheets between reading the spreadsheet into the dict format and writing it back from the dict. For me the xlsxwriter works better than openpyxl for this particular task in terms of speed and format.
Note: future versions of pandas (0.21.0+) will change the “sheetname” parameter to “sheet_name”.
# read a single or multi-sheet excel file
# (returns dict of sheetname(s), dataframe(s))
ws_dict = pd.read_excel(excel_file_path,
sheetname=None)
# all worksheets are accessible as dataframes.
# easy to change a worksheet as a dataframe:
mod_df = ws_dict['existing_worksheet']
# do work on mod_df...then reassign
ws_dict['existing_worksheet'] = mod_df
# add a dataframe to the workbook as a new worksheet with
# ws name, df as dict key, value:
ws_dict['new_worksheet'] = some_other_dataframe
# when done, write dictionary back to excel...
# xlsxwriter honors datetime and date formats
# (only included as example)...
with pd.ExcelWriter(excel_file_path,
engine='xlsxwriter',
datetime_format='yyyy-mm-dd',
date_format='yyyy-mm-dd') as writer:
for ws_name, df_sheet in ws_dict.items():
df_sheet.to_excel(writer, sheet_name=ws_name)
For the example in the 2013 question:
ws_dict = pd.read_excel('Masterfile.xlsx',
sheetname=None)
ws_dict['Main'] = data_filtered[['Diff1', 'Diff2']]
with pd.ExcelWriter('Masterfile.xlsx',
engine='xlsxwriter') as writer:
for ws_name, df_sheet in ws_dict.items():
df_sheet.to_excel(writer, sheet_name=ws_name)
I know this is an older thread, but this is the first item you find when searching, and the above solutions don’t work if you need to retain charts in a workbook that you already have created. In that case, xlwings is a better option – it allows you to write to the excel book and keeps the charts/chart data.
simple example:
import xlwings as xw
import pandas as pd
#create DF
months = ['2017-01','2017-02','2017-03','2017-04','2017-05','2017-06','2017-07','2017-08','2017-09','2017-10','2017-11','2017-12']
value1 = [x * 5 + 5 for x in range(len(months))]
df = pd.DataFrame(value1, index=months, columns=['value1'])
df['value2'] = df['value1'] + 5
df['value3'] = df['value2'] + 5
#load workbook that has a chart in it
wb = xw.Book('C:\\data\\bookwithChart.xlsx')
ws = wb.sheets['chartData']
ws.range('A1').options(index=False).value = df
wb = xw.Book('C:\\data\\bookwithChart_updated.xlsx')
xw.apps[0].quit()
Since pandas 0.24 there is a better solution:

with pd.ExcelWriter(path, mode='a') as writer:
    s.to_excel(writer, sheet_name='another sheet', index=False)
Is there any function that would be the equivalent of a combination of df.isin() and df[col].str.contains()?
For example, say I have the series
s = pd.Series(['cat','hat','dog','fog','pet']), and I want to find all places where s contains any of ['og', 'at']. I would want to get everything but ‘pet’.
I have a solution, but it’s rather inelegant:
searchfor = ['og', 'at']
found = [s.str.contains(x) for x in searchfor]
result = pd.DataFrame(found)
result.any()
One option is just to use the regex | character to try to match each of the substrings in the words in your Series s (still using str.contains).
You can construct the regex by joining the words in searchfor with |:
>>> searchfor = ['og', 'at']
>>> s[s.str.contains('|'.join(searchfor))]
0 cat
1 hat
2 dog
3 fog
dtype: object
As @AndyHayden noted in the comments below, take care if your substrings have special characters such as $ and ^ which you want to match literally. These characters have specific meanings in the context of regular expressions and will affect the matching.
You can make your list of substrings safer by escaping non-alphanumeric characters with re.escape:
>>> import re
>>> matches = ['$money', 'x^y']
>>> safe_matches = [re.escape(m) for m in matches]
>>> safe_matches
['\\$money', 'x\\^y']
The strings in this new list will match each character literally when used with str.contains.
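Putting the two steps together (a sketch):

>>> s[s.str.contains('|'.join(safe_matches))]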
You can use str.contains alone with a regex pattern using OR (|):
s[s.str.contains('og|at')]
Or you could add the series to a dataframe then use str.contains:
df = pd.DataFrame(s)
df[s.str.contains('og|at')]
Output:
0 cat
1 hat
2 dog
3 fog
Here is a one-line lambda that also works:

df["TrueFalse"] = df['col1'].apply(lambda x: 1 if any(i in x for i in searchfor) else 0)

Input:

searchfor = ['og', 'at']
df = pd.DataFrame([('cat', 1000.0), ('hat', 2000000.0), ('dog', 1000.0),
                   ('fog', 330000.0), ('pet', 330000.0)],
                  columns=['col1', 'col2'])

  col1       col2
0  cat     1000.0
1  hat  2000000.0
2  dog     1000.0
3  fog   330000.0
4  pet   330000.0

Applying the lambda:

df["TrueFalse"] = df['col1'].apply(lambda x: 1 if any(i in x for i in searchfor) else 0)

Output:

  col1       col2  TrueFalse
0  cat     1000.0          1
1  hat  2000000.0          1
2  dog     1000.0          1
3  fog   330000.0          1
4  pet   330000.0          0
When I run the program, pandas gives a FutureWarning like the one below every time:
D:\Python\lib\site-packages\pandas\core\frame.py:3581: FutureWarning: rename with inplace=True will return None from pandas 0.11 onward
" from pandas 0.11 onward", FutureWarning)
I got the message, but I just want to stop pandas from showing it again and again. Is there any built-in parameter I can set so that pandas does not pop up the FutureWarning?
@bdiamante’s answer may only partially help you. If you still get a message after you’ve suppressed warnings, it’s because the pandas library itself is printing the message. There’s not much you can do about it unless you edit the Pandas source code yourself. Maybe there’s an option internally to suppress them, or a way to override things, but I couldn’t find one.
For those who need to know why…
Suppose that you want to ensure a clean working environment. At the top of your script, you put pd.reset_option('all'). With Pandas 0.23.4, you get the following:
>>> import pandas as pd
>>> pd.reset_option('all')
html.border has been deprecated, use display.html.border instead
(currently both are identical)
C:\projects\stackoverflow\venv\lib\site-packages\pandas\core\config.py:619: FutureWarning: html.bord
er has been deprecated, use display.html.border instead
(currently both are identical)
warnings.warn(d.msg, FutureWarning)
: boolean
use_inf_as_null had been deprecated and will be removed in a future
version. Use `use_inf_as_na` instead.
C:\projects\stackoverflow\venv\lib\site-packages\pandas\core\config.py:619: FutureWarning:
: boolean
use_inf_as_null had been deprecated and will be removed in a future
version. Use `use_inf_as_na` instead.
warnings.warn(d.msg, FutureWarning)
>>>
Following @bdiamante’s advice, you use the warnings library. Now, true to its word, the warnings have been removed. However, several pesky messages remain:
>>> import warnings
>>> warnings.simplefilter(action='ignore', category=FutureWarning)
>>> import pandas as pd
>>> pd.reset_option('all')
html.border has been deprecated, use display.html.border instead
(currently both are identical)
: boolean
use_inf_as_null had been deprecated and will be removed in a future
version. Use `use_inf_as_na` instead.
>>>
In fact, disabling all warnings produces the same output:
>>> import warnings
>>> warnings.simplefilter(action='ignore', category=Warning)
>>> import pandas as pd
>>> pd.reset_option('all')
html.border has been deprecated, use display.html.border instead
(currently both are identical)
: boolean
use_inf_as_null had been deprecated and will be removed in a future
version. Use `use_inf_as_na` instead.
>>>
In the standard library sense, these aren’t true warnings. Pandas implements its own warnings system. Running grep -rn on the warning messages shows that the pandas warning system is implemented in core/config_init.py:
$ grep -rn "html.border has been deprecated"
core/config_init.py:207:html.border has been deprecated, use display.html.border instead
Further chasing shows that I don’t have time for this. And you probably don’t either. Hopefully this saves you from falling down the rabbit hole or perhaps inspires someone to figure out how to truly suppress these messages!
But if you want to handle them one by one, and you are managing a bigger codebase, it will be difficult to find the line of code causing the warning, since warnings, unlike errors, don’t come with a traceback. In order to trace warnings like errors, you can write this at the top of the code:
import warnings
warnings.filterwarnings("error")
But if the codebase is bigger and it imports a bunch of other libraries/packages, then all sorts of warnings will start to be raised as errors. In order to raise only a certain type of warning (in your case FutureWarning) as an error, you can write:
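A sketch of that narrower filter (the same filterwarnings call as above, restricted to one category):

import warnings
warnings.filterwarnings("error", category=FutureWarning)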
I’m confused about the rules Pandas uses when deciding that a selection from a dataframe is a copy of the original dataframe, or a view on the original.
I understand that a query returns a copy so that something like
foo = df.query('2 < index <= 5')
foo.loc[:,'E'] = 40
will have no effect on the original dataframe, df. I also understand that scalar or named slices return a view, so that assignments to these, such as
df.iloc[3] = 70
or
df.ix[1,'B':'E'] = 222
will change df. But I’m lost when it comes to more complicated cases. For example,
df[df.C <= df.B] = 7654321
changes df, but
df[df.C <= df.B].ix[:,'B':'E']
does not.
Is there a simple rule that Pandas is using that I’m just missing? What’s going on in these specific cases; and in particular, how do I change all values (or a subset of values) in a dataframe that satisfy a particular query (as I’m attempting to do in the last example above)?
Note: This is not the same as this question; and I have read the documentation, but am not enlightened by it. I’ve also read through the “Related” questions on this topic, but I’m still missing the simple rule Pandas is using, and how I’d apply it to — for example — modify the values (or a subset of values) in a dataframe that satisfy a particular query.
If inplace=True is provided, it will modify in-place; only some operations support this
An indexer that sets, e.g. .loc/.iloc/.iat/.at will set inplace.
An indexer that gets on a single-dtyped object is almost always a view (depending on the memory layout it may not be, which is why this is not reliable). This is mainly for efficiency. (The example from above is for .query; this will always return a copy as it’s evaluated by numexpr.)
An indexer that gets on a multiple-dtyped object is always a copy.
Your example of chained indexing
df[df.C <= df.B].loc[:,'B':'E']
is not guaranteed to work (and thus you should never do this).
Instead do:
df.loc[df.C <= df.B, 'B':'E']
as this is faster and will always work.

The chained indexing is 2 separate Python operations and thus cannot be reliably intercepted by pandas (you will oftentimes get a SettingWithCopyWarning, but that is not 100% detectable either). The dev docs, which you pointed to, offer a much fuller explanation.
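To make the contrast concrete, here is a small sketch (a hypothetical frame, not the one from the question):

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [5, 1, 6], 'C': [2, 4, 3],
                   'D': [7, 8, 9], 'E': [0, 1, 2]})

# chained indexing: the first [] may return a copy, so the assignment can be
# silently lost (and typically raises a SettingWithCopyWarning)
df[df.C <= df.B].loc[:, 'B':'E'] = 7654321  # unreliable

# a single .loc call: one operation, always sets on the original
df.loc[df.C <= df.B, 'B':'E'] = 7654321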