Question: How to iterate over the columns of a pandas DataFrame to run regressions
I’m sure this is simple, but as a complete newbie to Python, I’m having trouble figuring out how to iterate over the variables in a pandas DataFrame and run a regression with each.
Here’s what I’m doing:
    # Assumed imports (not shown in the original post)
    import pandas_datareader.data as web
    import statsmodels.api as sm
    from pandas import DataFrame
    all_data = {}
    for ticker in ['FIUIX', 'FSAIX', 'FSAVX', 'FSTMX']:
        all_data[ticker] = web.get_data_yahoo(ticker, '1/1/2010', '1/1/2015')
    prices = DataFrame({tic: data['Adj Close'] for tic, data in all_data.items()})
    returns = prices.pct_change()
I know I can run a regression like this:
    regs = sm.OLS(returns.FIUIX, returns.FSTMX).fit()
but suppose I want to do this for each column in the DataFrame. In particular, I want to regress FIUIX on FSTMX, then FSAIX on FSTMX, and then FSAVX on FSTMX. After each regression I want to store the residuals.
I’ve tried various versions of the following, but I must be getting the syntax wrong:
    resids = {}
    for k in returns.keys():
        reg = sm.OLS(returns[k], returns.FSTMX).fit()
        resids[k] = reg.resid
I think the problem is that I don’t know how to refer to a column of returns by key, so returns[k] is probably wrong.
Any guidance on the best way to do this would be much appreciated. Perhaps there’s a common pandas approach I’m missing.
Answer 0
    for column in df:
        print(df[column])
Answer 1
You can use items() (called iteritems() in older pandas versions; it was removed in pandas 2.0):
    for name, values in df.items():
        print('{name}: {value}'.format(name=name, value=values.iloc[0]))
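Because each iteration yields the column name together with its Series, the pairs can feed a dict comprehension directly. A small sketch, using a made-up three-column frame (the data and the names `first_values` and `col_means` are illustrative assumptions):

```python
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [2.0, 4.0, 6.5], "c": [3.0, 1.0, 2.0]})

# items() yields (column name, Series) pairs, so the Series can be used directly.
first_values = {name: values.iloc[0] for name, values in df.items()}
col_means = {name: values.mean() for name, values in df.items()}
```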
Answer 2
This answer covers iterating over selected columns as well as all columns in a DataFrame.
df.columns gives an Index containing the names of all the columns in the DataFrame. That isn’t very helpful if you want to iterate over all the columns, since iterating over df directly already does that, but it comes in handy when you want to iterate over columns of your choosing only.
We can slice df.columns just like a Python list. For example, to iterate over all columns but the first one, we can do:
    for column in df.columns[1:]:
        print(df[column])
Similarly, to iterate over all the columns in reverse order, we can do:
    for column in df.columns[::-1]:
        print(df[column])
We can iterate over the columns in many other ways using this technique. Also remember that you can get the position of each column easily using:
    for ind, column in enumerate(df.columns):
        print(ind, column)
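Selection by label works too, which fits the question's case of looping over every column except the regressor. A brief sketch, assuming a small made-up frame that reuses the question's column names:

```python
import pandas as pd

# Hypothetical data; only the column names come from the question.
df = pd.DataFrame({"FIUIX": [0.1, 0.2], "FSAIX": [0.3, 0.4], "FSTMX": [0.5, 0.6]})

# Index.drop returns a new Index without the given label; df itself is unchanged.
others = df.columns.drop("FSTMX")
selected = [column for column in others]

# enumerate pairs each column name with its position.
positions = {column: ind for ind, column in enumerate(df.columns)}
```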
Answer 3
You can index DataFrame columns by position using iloc (the ix indexer this answer originally used was removed in pandas 1.0).
    df1.iloc[:, 1]
This returns the column at position 1, that is, the second column (positions start at 0).
    df1.iloc[0, :]
This returns the first row.
    df1.iloc[0, 1]
This is the value at the intersection of row 0 and column 1, and so on. So you can enumerate() returns.keys() and use the number to index the DataFrame.
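The enumerate-and-index idea might look like the sketch below. The frame `df1` and its contents are invented for illustration, and `iloc` is used as the positional indexer.

```python
import pandas as pd

# Hypothetical two-column frame.
df1 = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})

by_position = {}
for i, name in enumerate(df1.keys()):
    # df1.iloc[:, i] selects the column at position i, the same data as df1[name].
    by_position[name] = df1.iloc[:, i]

second_column = df1.iloc[:, 1]  # column at position 1
first_row = df1.iloc[0, :]      # row at position 0
cell = df1.iloc[0, 1]           # row 0, column 1
```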
Answer 4
A workaround is to transpose the DataFrame and iterate over the rows.
    for column_name, column in df.transpose().iterrows():
        print(column_name)
Answer 5
Using a list comprehension, you can get all the column names (the header):
    [column for column in df]
Answer 6
Based on the accepted answer, if an index corresponding to each column is also desired:
    for i, column in enumerate(df):
        print(i, df[column])
The type of df[column] above is Series, which can simply be converted into a NumPy ndarray:
    import numpy as np
    for i, column in enumerate(df):
        print(i, np.asarray(df[column]))
Answer 7
I’m a bit late, but here’s how I did this. The steps:
- Create a list of all columns
- Use itertools to take combinations of x columns
- Append each resulting R squared value to a list of results, along with the list of excluded columns
- Sort the results DataFrame in descending order of R squared to see which is the best fit.
This is the code I used on a DataFrame called aft_tmt. Feel free to adapt it to your use case.
    import itertools
    import pandas as pd
    import statsmodels.formula.api as smf
    # Set options to print without truncating output
    pd.set_option('display.max_columns', None)
    pd.set_option('display.max_colwidth', None)
    # Get the column names of the DF and remove the columns I don't want to use as predictors.
    itercols = aft_tmt.columns.tolist()
    for col in ("sc97", "sc", "grc", "grc97"):
        itercols.remove(col)
    print(itercols)
    print(len(itercols))
    # Collect one result row per combination (DataFrame.append was removed in pandas 2.0)
    rows = []
    # Change 9 to the number of columns you want to combine from N columns.
    # Possibly run an outer loop from 0 to N/2?
    for x in itertools.combinations(itercols, 9):
        lmstr = "+".join(x)
        f = smf.ols(formula="sc ~ " + lmstr, data=aft_tmt).fit()
        # Columns left out of this combination
        excluded = "+".join(y for y in itercols if y not in x)
        rows.append([f.rsquared, lmstr, excluded])
    # Results DF, sorted by R squared in descending order
    regression_res = pd.DataFrame(rows, columns=["Rsq", "predictors", "excluded"])
    regression_res = regression_res.sort_values(by="Rsq", ascending=False)
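Since aft_tmt is not available here, the same pattern can be tried on toy data. This is only a sketch under stated assumptions: the frame `toy`, its columns x1, x2, x3 and y, and the coefficients are all invented, and the outer loop over subset sizes is one possible reading of the comment in the code above.

```python
import itertools

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy data (assumption): y depends on x1 and x2 only; x3 is pure noise.
rng = np.random.default_rng(1)
toy = pd.DataFrame(rng.normal(size=(200, 3)), columns=["x1", "x2", "x3"])
toy["y"] = 2.0 * toy["x1"] - 1.0 * toy["x2"] + rng.normal(scale=0.1, size=200)

predictors = [c for c in toy.columns if c != "y"]
rows = []
for k in range(1, len(predictors) + 1):  # outer loop over subset sizes
    for combo in itertools.combinations(predictors, k):
        formula = "y ~ " + " + ".join(combo)
        fit = smf.ols(formula=formula, data=toy).fit()
        rows.append({"Rsq": fit.rsquared, "predictors": " + ".join(combo)})

# One row per predictor subset, best fit first.
results = (
    pd.DataFrame(rows)
    .sort_values(by="Rsq", ascending=False)
    .reset_index(drop=True)
)
```

Plain R squared never decreases when a predictor is added, so the largest subset tends to come out on top; comparing fit.rsquared_adj instead may be a fairer way to rank subsets of different sizes.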