How do I easily set x and y-labels while preserving my ability to use specific colormaps? I noticed that the plot() wrapper for pandas DataFrames doesn’t take any parameters specific for that.
pandas uses matplotlib for basic DataFrame plots, so if you are using pandas for a basic plot you can use matplotlib to customize it. However, I propose an alternative method using seaborn, which allows more customization of the plot without dropping down to the low-level matplotlib API.
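The seaborn code is not reproduced here, but as a minimal sketch of the matplotlib route mentioned above (hypothetical column names and labels): df.plot() returns a matplotlib Axes, so you can set the labels on it directly while still passing a specific colormap.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"a": range(10), "b": range(10, 20)})
ax = df.plot(colormap="viridis")   # keep your specific colormap
ax.set_xlabel("my x label")        # labels set on the returned Axes
ax.set_ylabel("my y label")
plt.show()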
The error shows that the machine does not have enough memory to read the entire
CSV into a DataFrame at one time. Assuming you do not need the entire dataset in
memory all at one time, one way to avoid the problem would be to process the CSV in
chunks (by specifying the chunksize parameter):
chunksize = 10 ** 6
for chunk in pd.read_csv(filename, chunksize=chunksize):
    process(chunk)
The chunksize parameter specifies the number of rows per chunk.
(The last chunk may contain fewer than chunksize rows, of course.)
Does your workflow require slicing, manipulating, exporting?
If so, you can use dask.dataframe to slice, perform your calculations and export iteratively. Chunking is performed silently by dask, which also supports a subset of the pandas API.
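A minimal sketch of that Dask workflow (hypothetical file and column names):
import dask.dataframe as dd

ddf = dd.read_csv("large_file.csv")   # read lazily, in partitions
subset = ddf[ddf["amount"] > 100]     # slice/calculate with pandas-like syntax
subset.to_csv("filtered-*.csv")       # export, one file per partition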
Another great alternative is modin, because all the functionality is identical to pandas, yet it leverages distributed dataframe libraries such as Dask.
The answer above already covers the topic. Still, if you need all the data in memory, have a look at bcolz. It compresses the data in memory. I have had really good experience with it, but it is missing a lot of pandas features.
Edit: I got compression rates of around 1/10 of the original size, I think, depending of course on the kind of data. Important missing features were aggregates.
Answer 5
You can read in the data as chunks and save each chunk as pickle.
import pandas as pd
import pickle

in_path = ""         # path where the large file is
out_path = ""        # path to save the pickle files to
chunk_size = 400000  # size of chunks relies on your available memory
separator = "~"

reader = pd.read_csv(in_path, sep=separator, chunksize=chunk_size,
                     low_memory=False)

for i, chunk in enumerate(reader):
    out_file = out_path + "/data_{}.pkl".format(i+1)
    with open(out_file, "wb") as f:
        pickle.dump(chunk, f, pickle.HIGHEST_PROTOCOL)
In the next step you read in the pickles and append each pickle to your desired dataframe.
import glob

pickle_path = ""  # same path as out_path, i.e. where the pickle files are

data_p_files = []
for name in glob.glob(pickle_path + "/data_*.pkl"):
    data_p_files.append(name)

# DataFrame.append was removed in pandas 2.0, so collect the pieces and concat once
df = pd.concat([pd.read_pickle(p) for p in data_p_files], ignore_index=True)
TextFileReader = pd.read_csv(path, chunksize=1000)  # the number of rows per chunk

dfList = []
for df in TextFileReader:
    dfList.append(df)

df = pd.concat(dfList, sort=False)
Answer 8
Here is an example:
chunkTemp = []
queryTemp = []
query = pd.DataFrame()

for chunk in pd.read_csv(file, header=0, chunksize=<your_chunksize>, iterator=True, low_memory=False):

    # replace blank spaces in the column names for SQL optimization
    chunk = chunk.rename(columns={c: c.replace(' ', '') for c in chunk.columns})

    # YOU CAN EITHER:
    # 1) buffer the chunks in order to load your whole dataset
    chunkTemp.append(chunk)

    # 2) do your processing over a chunk and store only the result of it
    query = chunk[chunk[<column_name>].str.startswith(<some_pattern>)]

    # buffer the processed data
    queryTemp.append(query)

# NOTE: never call pd.concat or pd.DataFrame() inside the loop
print("Database: CONCATENATING CHUNKS INTO A SINGLE DATAFRAME")
chunk = pd.concat(chunkTemp)
print("Database: LOADED")

# concatenate the processed data
query = pd.concat(queryTemp)
print(query)
I want to give a more comprehensive answer, based on most of the potential solutions already provided. I also want to point out one more potential aid that may help the reading process.
Option 1: dtypes
“dtypes” is a pretty powerful parameter that you can use to reduce the memory pressure of read methods. See this and this answer. By default, pandas tries to infer the dtypes of the data.
Every value stored in a data structure requires a memory allocation. At a basic level, refer to the values below (the table illustrates the ranges for C types):
The maximum value of UNSIGNED CHAR = 255
The minimum value of SHORT INT = -32768
The maximum value of SHORT INT = 32767
The minimum value of INT = -2147483648
The maximum value of INT = 2147483647
The minimum value of CHAR = -128
The maximum value of CHAR = 127
The minimum value of LONG = -9223372036854775808
The maximum value of LONG = 9223372036854775807
Refer to this page to see the matching between NumPy and C types.
Let’s say you have an array of integers made up of single digits. You could, both theoretically and practically, store it as, say, a 16-bit integer array, but you would then allocate more memory than you actually need. To prevent this, you can set the dtype option on read_csv: you do not want to store the items as long integers when they actually fit in an 8-bit integer (np.int8 or np.uint8).
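A minimal sketch of passing dtypes up front (the column names here are hypothetical), so read_csv does not default everything to 64-bit types:
import numpy as np
import pandas as pd

dtypes = {
    "user_id": np.uint32,    # ids fit comfortably in 32 bits
    "rating": np.int8,       # single-digit values fit in 8 bits
    "category": "category",  # repeated strings are stored once
}
df = pd.read_csv("large_file.csv", dtype=dtypes)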
Option 2: Chunking
Reading the data in chunks lets you access one part of the data in memory at a time, so you can apply preprocessing to each chunk and keep the processed data rather than the raw data. It is much better if you combine this option with the first one, dtypes.
I want to point out the pandas cookbook sections on that process, which you can find here; note the two sections there.
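A minimal sketch of that pattern (hypothetical file and column names): process each chunk and keep only the reduced result rather than the raw rows.
import pandas as pd

parts = []
for chunk in pd.read_csv("large_file.csv", chunksize=1_000_000):
    # keep only the per-chunk aggregate, not the raw chunk itself
    parts.append(chunk.groupby("key")["value"].sum())

# combine the partial sums into the final result
result = pd.concat(parts).groupby(level=0).sum()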
Option 3: Dask
Dask is a framework that is defined on Dask’s website as:
Dask provides advanced parallelism for analytics, enabling performance at scale for the tools you love
It was born to cover the parts that pandas cannot reach. Dask is a powerful framework that gives you access to much more data by processing it in a distributed way.
You can use Dask to preprocess your data as a whole; Dask takes care of the chunking, so unlike with pandas you can just define your processing steps and let Dask do the work. Dask does not apply the computations until they are explicitly triggered by compute and/or persist (see the answer here for the difference).
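A minimal sketch of that lazy execution model (hypothetical file and column names):
import dask.dataframe as dd

ddf = dd.read_csv("large_file.csv")
mean_by_key = ddf.groupby("key")["value"].mean()  # only builds the task graph
result = mean_by_key.compute()                    # triggers the work, returns a pandas object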
Other Aids (Ideas)
Design an ETL flow for the data, keeping only what is needed from the raw data.
First, apply the ETL to the whole data with frameworks like Dask or PySpark, and export the processed data.
Then see whether the processed data can fit in memory as a whole.
Consider increasing your RAM.
Consider working with that data on a cloud platform.
In addition to the answers above, for those who want to process the CSV and then export to csv, parquet or SQL, d6tstack is another good option. You can load multiple files, and it deals with data schema changes (added/removed columns). Chunked out-of-core support is already built in.
def apply(dfg):
# do stuff
return dfg
c = d6tstack.combine_csv.CombinerCSV(['bigfile.csv'], apply_after_read=apply, sep=',', chunksize=1e6)
# or
c = d6tstack.combine_csv.CombinerCSV(glob.glob('*.csv'), apply_after_read=apply, chunksize=1e6)
# output to various formats, automatically chunked to reduce memory consumption
c.to_csv_combine(filename='out.csv')
c.to_parquet_combine(filename='out.pq')
c.to_psql_combine('postgresql+psycopg2://usr:pwd@localhost/db', 'tablename') # fast for postgres
c.to_mysql_combine('mysql+mysqlconnector://usr:pwd@localhost/db', 'tablename') # fast for mysql
c.to_sql_combine('postgresql+psycopg2://usr:pwd@localhost/db', 'tablename') # slow but flexible
In case someone is still looking for something like this, I found that this new library called modin can help. It uses distributed computing that can help with the read. Here’s a nice article comparing its functionality with pandas. It essentially uses the same functions as pandas.
import modin.pandas as pd
pd.read_csv(CSV_FILE_NAME)
Before using the chunksize option, if you want to be sure about the process function that you plan to run inside the chunking for-loop (as mentioned by @unutbu), you can simply use the nrows option.
small_df = pd.read_csv(filename, nrows=100)
Once you are sure that the process block is ready, you can put that in the chunking for loop for the entire dataframe.
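For example (hypothetical column name), prototype the logic on the small sample, then reuse it unchanged inside the chunked loop:
import pandas as pd

small_df = pd.read_csv(filename, nrows=100)           # quick sample to develop against
processed = small_df[small_df["value"] > 0]           # prototype your processing here

for chunk in pd.read_csv(filename, chunksize=10**6):  # then apply it chunk by chunk
    processed = chunk[chunk["value"] > 0]
    # ... write out or aggregate `processed` here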
I have 3 CSV files. Each has the first column as the (string) names of people, while all the other columns in each dataframe are attributes of that person.
How can I “join” together all three CSV documents to create a single CSV with each row having all the attributes for each unique value of the person’s string name?
The join() function in pandas specifies that I need a multiindex, but I’m confused about what a hierarchical indexing scheme has to do with making a join based on a single index.
Answer 0
Assuming the imports:
import pandas as pd
John Galt’s answer is basically a reduce operation. If I have more than a handful of dataframes, I’d put them in a list like this (generated via list comprehensions or loops or whatnot):
dfs = [df0, df1, df2, dfN]
Assuming they have some common column, like name in your example, I’d do the following:
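A minimal sketch of that reduce-style merge, assuming the shared column is literally called name:
from functools import reduce

# merge every frame in the list pairwise on the common 'name' column
merged = reduce(lambda left, right: pd.merge(left, right, on='name'), dfs)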
The join method is built exactly for these types of situations. You can join any number of DataFrames together with it. The calling DataFrame joins with the index of the collection of passed DataFrames. To work with multiple DataFrames, you must put the joining columns in the index.
The code would look something like this:
filenames = ['fn1', 'fn2', 'fn3', 'fn4',....]
dfs = [pd.read_csv(filename, index_col=index_col) for filename in filenames]
dfs[0].join(dfs[1:])
One does not need a multiindex to perform join operations.
You just need to correctly set the index column on which to perform the join operations (with the command df.set_index('Name'), for example).
The join operation is performed on the index by default.
In your case, you just have to specify that the Name column corresponds to your index.
Below is an example
# Simple example where dataframes index are the name on which to perform
# the join operations
import pandas as pd
import numpy as np
name = ['Sophia' ,'Emma' ,'Isabella' ,'Olivia' ,'Ava' ,'Emily' ,'Abigail' ,'Mia']
df1 = pd.DataFrame(np.random.randn(8, 3), columns=['A','B','C'], index=name)
df2 = pd.DataFrame(np.random.randn(8, 1), columns=['D'], index=name)
df3 = pd.DataFrame(np.random.randn(8, 2), columns=['E','F'], index=name)
df = df1.join(df2)
df = df.join(df3)
# If you have a 'Name' column that is not the index of your dataframe,
# one can set this column to be the index
# 1) Create a column 'Name' based on the previous index
df1['Name'] = df1.index
# 2) Select the index from the column 'Name'
df1 = df1.set_index('Name')
# If indexes are different, one may have to play with parameter how
gf1 = pd.DataFrame(np.random.randn(8, 3), columns=['A','B','C'], index=range(8))
gf2 = pd.DataFrame(np.random.randn(8, 1), columns=['D'], index=range(2,10))
gf3 = pd.DataFrame(np.random.randn(8, 2), columns=['E','F'], index=range(4,12))
gf = gf1.join(gf2, how='outer')
gf = gf.join(gf3, how='outer')
Here is a method to merge a dictionary of data frames while keeping the column names in sync with the dictionary. Also it fills in missing values if needed:
This is the function to merge a dict of data frames
def MergeDfDict(dfDict, onCols, how='outer', naFill=None):
    # list() is needed so the keys can be indexed in Python 3
    keys = list(dfDict.keys())
    for i in range(len(keys)):
        key = keys[i]
        df0 = dfDict[key]
        cols = list(df0.columns)
        valueCols = list(filter(lambda x: x not in (onCols), cols))
        df0 = df0[onCols + valueCols]
        # suffix the value columns with the dict key to keep the names in sync
        df0.columns = onCols + [(s + '_' + key) for s in valueCols]

        if (i == 0):
            outDf = df0
        else:
            outDf = pd.merge(outDf, df0, how=how, on=onCols)

    if (naFill is not None):
        outDf = outDf.fillna(naFill)

    return(outDf)
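A hypothetical usage example, with two small frames keyed on 'name':
dfs = {
    'age':    pd.DataFrame({'name': ['a', 'b'], 'value': [1, 2]}),
    'height': pd.DataFrame({'name': ['a', 'b'], 'value': [3, 4]}),
}
merged = MergeDfDict(dfs, onCols=['name'], how='outer', naFill=0)
# merged columns: name, value_age, value_height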
>>> df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
>>> df
   A  B
0  1  2
1  3  4
>>> df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
>>> df2
   A  B
0  5  6
1  7  8
>>> df.append(df2, ignore_index=True)
   A  B
0  1  2
1  3  4
2  5  6
3  7  8
This will drop all rows where there are at least two non-NaN values.
You could then drop the rows where name is NaN:
In [87]:
nms
Out[87]:
movie name rating
0 thg John 3
1 thg NaN 4
3 mol Graham NaN
4 lob NaN NaN
5 lob NaN NaN
[5 rows x 3 columns]
In [89]:
nms = nms.dropna(thresh=2)
In [90]:
nms[nms.name.notnull()]
Out[90]:
movie name rating
0 thg John 3
3 mol Graham NaN
[2 rows x 3 columns]
EDIT
Actually looking at what you originally want you can do just this without the dropna call:
nms[nms.name.notnull()]
UPDATE
Looking at this question 3 years later, there is a mistake. Firstly, the thresh arg looks for at least n non-NaN values, so in fact the output should be:
In [4]:
nms.dropna(thresh=2)
Out[4]:
movie name rating
0 thg John 3.0
1 thg NaN 4.0
3 mol Graham NaN
It’s possible that I was either mistaken 3 years ago or that the version of pandas I was running had a bug; both scenarios are entirely possible.
I’m sure this is simple, but as a complete newbie to python, I’m having trouble figuring out how to iterate over variables in a pandas dataframe and run a regression with each.
Here’s what I’m doing:
all_data = {}
for ticker in ['FIUIX', 'FSAIX', 'FSAVX', 'FSTMX']:
    all_data[ticker] = web.get_data_yahoo(ticker, '1/1/2010', '1/1/2015')

prices = DataFrame({tic: data['Adj Close'] for tic, data in all_data.iteritems()})
returns = prices.pct_change()
I know I can run a regression like this:
regs = sm.OLS(returns.FIUIX,returns.FSTMX).fit()
but suppose I want to do this for each column in the dataframe. In particular, I want to regress FIUIX on FSTMX, and then FSAIX on FSTMX, and then FSAVX on FSTMX. After each regression I want to store the residuals.
I’ve tried various versions of the following, but I must be getting the syntax wrong:
resids = {}
for k in returns.keys():
    reg = sm.OLS(returns[k], returns.FSTMX).fit()
    resids[k] = reg.resid
I think the problem is I don’t know how to refer to the returns column by key, so returns[k] is probably wrong.
Any guidance on the best way to do this would be much appreciated. Perhaps there’s a common pandas approach I’m missing.
This answer is to iterate over selected columns as well as all columns in a DF.
df.columns gives a list containing all the column names in the DF. That isn’t very helpful if you want to iterate over all the columns, but it comes in handy when you want to iterate over columns of your choosing only.
We can easily use Python’s list slicing to slice df.columns according to our needs. For example, to iterate over all columns but the first one, we can do:
for column in df.columns[1:]:
    print(df[column])
Similarly to iterate over all the columns in reversed order, we can do:
for column in df.columns[::-1]:
    print(df[column])
We can iterate over all the columns in a lot of cool ways using this technique. Also remember that you can get the indices of all columns easily using:
for ind, column in enumerate(df.columns):
    print(ind, column)
Based on the accepted answer, if an index corresponding to each column is also desired:
for i, column in enumerate(df):
    print(i, df[column])
The above df[column] type is Series, which can simply be converted into NumPy ndarrays:
for i, column in enumerate(df):
    print(i, np.asarray(df[column]))
Answer 7
I’m a bit late but here’s how I did this. The steps:
Create a list of all columns
Use itertools to take x combinations
Append each result R squared value to a result dataframe along with excluded column list
Sort the result DF in descending order of R squared to see which is the best fit.
This is the code I used on a DataFrame called aft_tmt. Feel free to extrapolate to your use case.
import pandas as pd
# setting options to print without truncating output
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
import statsmodels.formula.api as smf
import itertools
# This section gets the column names of the DF and removes some columns which I don't want to use as predictors.
itercols = aft_tmt.columns.tolist()
itercols.remove("sc97")
itercols.remove("sc")
itercols.remove("grc")
itercols.remove("grc97")
print(itercols)
print(len(itercols))
# results DF
regression_res = pd.DataFrame(columns = ["Rsq", "predictors", "excluded"])
# excluded cols
exc = []
# change 9 to the number of columns you want to combine from N columns.
#Possibly run an outer loop from 0 to N/2?
for x in itertools.combinations(itercols, 9):
    lmstr = "+".join(x)
    m = smf.ols(formula="sc ~ " + lmstr, data=aft_tmt)
    f = m.fit()
    exc = [item for item in x if item not in itercols]
    # DataFrame.append was removed in pandas 2.0, so build the row and concat it
    row = pd.DataFrame([[f.rsquared, lmstr,
                         "+".join([y for y in itercols if y not in list(x)])]],
                       columns=["Rsq", "predictors", "excluded"])
    regression_res = pd.concat([regression_res, row], ignore_index=True)
regression_res.sort_values(by="Rsq", ascending = False)
df = pd.DataFrame({'B': [1, 2, 3], 'C': [4, 5, 6]})
df
Out:
   B  C
0  1  4
1  2  5
2  3  6

idx = 0
new_col = [7, 8, 9]  # can be a list, a Series, an array or a scalar
df.insert(loc=idx, column='A', value=new_col)
df
Out:
   A  B  C
0  7  1  4
1  8  2  5
2  9  3  6
You could try to extract columns as list, massage this as you want, and reindex your dataframe:
>>> cols = df.columns.tolist()
>>> cols = [cols[-1]]+cols[:-1] # or whatever change you need
>>> df.reindex(columns=cols)
n l v
0 0 a 1
1 0 b 2
2 0 c 1
3 0 d 2
EDIT: this can be done in one line; however, it looks a bit ugly. Maybe a cleaner proposal will come along…
>>> df.reindex(columns=['n']+df.columns[:-1].tolist())
n l v
0 0 a 1
1 0 b 2
2 0 c 1
3 0 d 2
Answer 2
If you want a single value for all of the rows:
df.insert(0,'name_of_column','')
df['name_of_column'] = value
Edit:
You can also:
df.insert(0,'name_of_column',value)
Answer 3
Here is a very simple answer to this (only one line).
You can do that after you have added the ‘n’ column to your df, as follows.
import pandas as pd
df = pd.DataFrame({'l':['a','b','c','d'], 'v':[1,2,1,2]})
df['n'] = 0
df
l v n
0 a 1 0
1 b 2 0
2 c 1 0
3 d 2 0
# here you can add the below code and it should work.
df = df[list('nlv')]
df
n l v
0 0 a 1
1 0 b 2
2 0 c 1
3 0 d 2
However, if you have words in your column names instead of single letters, list('...') on a string will no longer work; pass the column names explicitly instead (note the two brackets around them):
import pandas as pd
df = pd.DataFrame({'Upper':['a','b','c','d'], 'Lower':[1,2,1,2]})
df['Net'] = 0
df['Mid'] = 2
df['Zsore'] = 2
df
Upper Lower Net Mid Zsore
0 a 1 0 2 2
1 b 2 0 2 2
2 c 1 0 2 2
3 d 2 0 2 2
# here you can add below line and it should work
df = df[list(('Mid','Upper', 'Lower', 'Net','Zsore'))]
df
Mid Upper Lower Net Zsore
0 2 a 1 0 2
1 2 b 2 0 2
2 2 c 1 0 2
3 2 d 2 0 2
I know how to append an existing series / dataframe column. But this is a different situation, because all I need is to add the ‘Name’ column and set every row to the same value, in this case ‘abc’.
where the argument loc ( 0 <= loc <= len(columns) ) allows you to insert the column where you want.
‘loc’ gives you the index that your column will be at after the insertion. For example, the code above inserts the column Name as the 0-th column, i.e. it will be inserted before the first column, becoming the new first column. (Indexing starts from 0).
All these methods allow you to add a new column from a Series as well (just substitute the ‘abc’ default argument above with the series).
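A minimal sketch of that call (the frame here is hypothetical):
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3]})
df.insert(loc=0, column='Name', value='abc')  # the scalar 'abc' is broadcast to every row
# value can also be a Series, e.g. pd.Series(list('xyz'), index=df.index)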
local_df.info() –> the info method returns detailed information about the data frame and its columns, such as the column count, the data type of each column, the non-null value count, and the memory usage of the DataFrame.
len(local_df.columns) –> the columns attribute returns an Index object of the data frame’s columns, and the len function returns the total number of columns.
local_df.head(0) –> the head method with parameter 0 returns the first row of the df, which is actually nothing but the header.
Assuming the number of columns is not more than 10. For-loop fun:
li_count = 0
for x in local_df:
    li_count = li_count + 1
print(li_count)
So I completely understand how to use resample, but the documentation does not do a good job explaining the options.
So most options in the resample function are pretty straight forward except for these two:
rule : the offset string or object representing target conversion
how : string, method for down- or re-sampling, default to ‘mean’
So from looking at as many examples as I found online I can see for rule you can do 'D' for day, 'xMin' for minutes, 'xL' for milliseconds, but that is all I could find.
for how I have seen the following: 'first', np.max, 'last', 'mean', and 'n1n2n3n4...nx' where nx is the first letter of each column index.
So is there somewhere in the documentation that I am missing that displays every option for pandas.resample‘s rule and how inputs? If yes, where because I could not find it. If no, what are all the options for them?
Answer 0
B business day frequency
C custom business day frequency (experimental)
D calendar day frequency
W weekly frequency
M month end frequency
SM semi-month end frequency (15th and end of month)
BM business month end frequency
CBM custom business month end frequency
MS month start frequency
SMS semi-month start frequency (1st and 15th)
BMS business month start frequency
CBMS custom business month start frequency
Q quarter end frequency
BQ business quarter end frequency
QS quarter start frequency
BQS business quarter start frequency
A year end frequency
BA, BY business year end frequency
AS, YS year start frequency
BAS, BYS business year start frequency
BH business hour frequency
H hourly frequency
T, min minutely frequency
S secondly frequency
L, ms milliseconds
U, us microseconds
N nanoseconds
Note that there isn’t a list of all the different how options, because how can be any NumPy array function, and any function that is available via groupby dispatching can be passed to how by name.
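Note also that in recent pandas versions the how argument has been replaced by chaining an aggregation after resample(); a minimal sketch with a hypothetical minute-frequency series:
import numpy as np
import pandas as pd

idx = pd.date_range("2021-01-01", periods=120, freq="T")  # one value per minute
s = pd.Series(np.arange(120), index=idx)

s.resample("5T").mean()                         # equivalent of the old how='mean'
s.resample("5T").agg(["first", "max", "last"])  # several aggregations at once
s.resample("D").apply(np.sum)                   # any array-reducing function also works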
Answer 1
There’s more to it than this, but you’re probably looking for this list:
B business day frequency
C custom business day frequency (experimental)
D calendar day frequency
W weekly frequency
M month end frequency
BM business month end frequency
MS month start frequency
BMS business month start frequency
Q quarter end frequency
BQ business quarter end frequency
QS quarter start frequency
BQS business quarter start frequency
A year end frequency
BA business year end frequency
AS year start frequency
BAS business year start frequency
H hourly frequency
T minutely frequency
S secondly frequency
L milliseconds
U microseconds
1. NAME object
2. On_Time object
3. On_Budget object
4. %actual_hr float64
5. Baseline Start Date datetime64[ns]
6. Forecast Start Date datetime64[ns]
I would like to be able to say: here is a dataframe, give me a list of the columns which are of type Object or of type DateTime?
I have a function which converts numbers (Float64) to two decimal places, and I would like to use this list of dataframe columns, of a particular type, and run it through this function to convert them all to 2dp.
Maybe:
my_list = []
for c in col_list:
    if c.dtype == "Something":
        my_list.append(c)?
Answer 0
If you want a list of columns of a certain type, you can use groupby:
>>> df = pd.DataFrame([[1, 2.3456, 'c', 'd', 78]], columns=list("ABCDE"))
>>> df
A B C D E
0 1 2.3456 c d 78
[1 rows x 5 columns]
>>> df.dtypes
A int64
B float64
C object
D object
E int64
dtype: object
>>> g = df.columns.to_series().groupby(df.dtypes).groups
>>> g
{dtype('int64'): ['A', 'E'], dtype('float64'): ['B'], dtype('O'): ['C', 'D']}
>>> {k.name: v for k, v in g.items()}
{'object': ['C', 'D'], 'int64': ['A', 'E'], 'float64': ['B']}
Docstring: Attempt to infer better dtypes for object columns.
Attempts soft conversion of object-dtyped columns, leaving non-object
and unconvertible columns unchanged. The inference rules are the same
as during normal Series/DataFrame construction.
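As an alternative to the groupby approach above, select_dtypes answers the original question directly, and its result can feed the two-decimal conversion; a minimal sketch, assuming a DataFrame df like the one in the question:
# list the columns that are object or datetime typed
obj_or_dt_cols = df.select_dtypes(include=['object', 'datetime']).columns.tolist()

# round every float column to two decimal places
float_cols = df.select_dtypes(include=['float64']).columns
df[float_cols] = df[float_cols].round(2)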