如何遍历熊猫数据框的列以运行回归

Question 1

I’m sure this is simple, but as a complete newbie to python, I’m having trouble figuring out how to iterate over variables in a pandas dataframe and run a regression with each.

Here’s what I’m doing:

all_data = {}
for ticker in ['FIUIX', 'FSAIX', 'FSAVX', 'FSTMX']:
    all_data[ticker] = web.get_data_yahoo(ticker, '1/1/2010', '1/1/2015')

prices = DataFrame({tic: data['Adj Close'] for tic, data in all_data.iteritems()})  
returns = prices.pct_change()

I know I can run a regression like this:

regs = sm.OLS(returns.FIUIX,returns.FSTMX).fit()

but suppose I want to do this for each column in the dataframe. In particular, I want to regress FIUIX on FSTMX, and then FSAIX on FSTMX, and then FSAVX on FSTMX. After each regression I want to store the residuals.

I’ve tried various versions of the following, but I must be getting the syntax wrong:

resids = {}
for k in returns.keys():
    reg = sm.OLS(returns[k],returns.FSTMX).fit()
    resids[k] = reg.resid

I think the problem is I don’t know how to refer to the returns column by key, so returns[k] is probably wrong.

Any guidance on the best way to do this would be much appreciated. Perhaps there’s a common pandas approach I’m missing.

Question 2

for column in df:
    print(df[column])

Question 3

You can use iteritems():

for name, values in df.iteritems():
    print('{name}: {value}'.format(name=name, value=values[0]))

Question 4

This answer is to iterate over selected columns as well as all columns in a DF.

df.columns gives a list containing all the columns’ names in the DF. Now that isn’t very helpful if you want to iterate over all the columns. But it comes in handy when you want to iterate over columns of your choosing only.

We can use Python’s list slicing easily to slice df.columns according to our needs. For eg, to iterate over all columns but the first one, we can do:

for column in df.columns[1:]:
    print(df[column])

Similarly to iterate over all the columns in reversed order, we can do:

for column in df.columns[::-1]:
    print(df[column])

We can iterate over all the columns in a lot of cool ways using this technique. Also remember that you can get the indices of all columns easily using:

for ind, column in enumerate(df.columns):
    print(ind, column)

Question 5

You can index dataframe columns by the position using ix.

df1.ix[:,1]

This returns the first column for example. (0 would be the index)

df1.ix[0,]

This returns the first row.

df1.ix[:,1]

This would be the value at the intersection of row 0 and column 1:

df1.ix[0,1]

and so on. So you can enumerate() returns.keys(): and use the number to index the dataframe.

Question 6

A workaround is to transpose the DataFrame and iterate over the rows.

for column_name, column in df.transpose().iterrows():
    print column_name

Question 7

Using list comprehension, you can get all the columns names (header):

[column for column in df]

Question 8

Based on the accepted answer, if an index corresponding to each column is also desired:

for i, column in enumerate(df):
    print i, df[column]

The above df[column] type is Series, which can simply be converted into numpy ndarrays:

for i, column in enumerate(df):
    print i, np.asarray(df[column])

Question 9

I’m a bit late but here’s how I did this. The steps:

Create a list of all columns
Use itertools to take x combinations
Append each result R squared value to a result dataframe along with excluded column list
Sort the result DF in descending order of R squared to see which is the best fit.

This is the code I used on DataFrame called aft_tmt. Feel free to extrapolate to your use case..

import pandas as pd
# setting options to print without truncating output
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

import statsmodels.formula.api as smf
import itertools

# This section gets the column names of the DF and removes some columns which I don't want to use as predictors.
itercols = aft_tmt.columns.tolist()
itercols.remove("sc97")
itercols.remove("sc")
itercols.remove("grc")
itercols.remove("grc97")
print itercols
len(itercols)

# results DF
regression_res = pd.DataFrame(columns = ["Rsq", "predictors", "excluded"])

# excluded cols
exc = []

# change 9 to the number of columns you want to combine from N columns.
#Possibly run an outer loop from 0 to N/2?
for x in itertools.combinations(itercols, 9):
    lmstr = "+".join(x)
    m = smf.ols(formula = "sc ~ " + lmstr, data = aft_tmt)
    f = m.fit()
    exc = [item for item in x if item not in itercols]
    regression_res = regression_res.append(pd.DataFrame([[f.rsquared, lmstr, "+".join([y for y in itercols if y not in list(x)])]], columns = ["Rsq", "predictors", "excluded"]))

regression_res.sort_values(by="Rsq", ascending = False)

如何遍历熊猫数据框的列以运行回归

问题：如何遍历熊猫数据框的列以运行回归

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

回答 6

回答 7

排行榜展示

Python 情人节超强技能导出微信聊天记录生成词云

你不得不知道的python超级文献批量搜索下载工具

7行代码 Python热力图可视化分析缺失数据处理

Python 流程图 — 一键转化代码为流程图

Python 优化—算出每条语句执行时间

你的10W块放哪里能赚最多钱？

文章展示

从字符串中删除数字

将日期时间格式化为字符串（以毫秒为单位）

我已经安装了哪个版本的Python？

使用boto3检查s3中存储桶中是否存在密钥

如何在Windows中同时安装Python 2.x和Python 3.x

从sklearn导入时出现ImportError：无法导入名称check_build

如何遍历熊猫数据框的列以运行回归

问题：如何遍历熊猫数据框的列以运行回归

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

回答 6

回答 7

相关文章

排行榜展示

文章展示