I create a column that estimates the number of citable documents per person:
Top15['PopEst'] = Top15['Energy Supply'] / Top15['Energy Supply per Capita']
Top15['Citable docs per Capita'] = Top15['Citable documents'] / Top15['PopEst']
I want to know the correlation between the number of citable documents per capita and the energy supply per capita. So I use the .corr() method (Pearson’s correlation):
data = Top15[['Citable docs per Capita','Energy Supply per Capita']]
correlation = data.corr(method='pearson')
I want to return a single number, but the result is a full correlation matrix (shown as an image in the original question) rather than a single value.
Answer 0
Without the actual data it is hard to answer the question, but I guess you are looking for something like this:
Top15['Citable docs per Capita'].corr(Top15['Energy Supply per Capita'])
That calculates the correlation between your two columns 'Citable docs per Capita' and 'Energy Supply per Capita'.
To give an example:
import pandas as pd
df = pd.DataFrame({'A': range(4),'B':[2*i for i in range(4)]})
   A  B
0  0  0
1  1  2
2  2  4
3  3  6
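For this toy frame, the single coefficient can then be read off directly; since B is exactly 2*A, the call below returns 1.0:
df['A'].corr(df['B'])  # Pearson correlation of perfectly linearly related columns -> 1.0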
In the graphic you show, only the upper left corner of the correlation matrix is represented (I assume).
There can be cases where you get NaNs in your solution – check this post for an example.
If you want to filter entries above/below a certain threshold, you can check this question.
If you want to plot a heatmap of the correlation coefficients, you can check this answer and if you then run into the issue with overlapping axis-labels check the following post.
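For a quick impression without following those links, a rough matplotlib-only sketch of such a heatmap (reusing the data frame defined in the question) might look like this:
import matplotlib.pyplot as plt

corr = data.corr(method='pearson')
plt.imshow(corr, cmap='coolwarm', vmin=-1, vmax=1)  # colour-code the coefficients
plt.colorbar()
plt.xticks(range(len(corr.columns)), corr.columns, rotation=45, ha='right')
plt.yticks(range(len(corr.columns)), corr.columns)
plt.tight_layout()
plt.show()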
Answer 1
I ran into the same issue.
It appeared Citable Documents per Person was a float, and Python skips it somehow by default. All the other columns of my dataframe were in numpy formats, so I solved it by converting the column to np.float64:
Top15['Citable Documents per Person']=np.float64(Top15['Citable Documents per Person'])
Remember, it is exactly the column you calculated yourself.
Answer 2
My solution would be, after converting the data to a numerical type:
Top15[['Citable docs per Capita','Energy Supply per Capita']].corr()
Answer 3
If you want the correlations between all pairs of columns, you could do something like this:
import pandas as pd
import numpy as np
def get_corrs(df):
col_correlations = df.corr()
col_correlations.loc[:, :] = np.tril(col_correlations, k=-1)
cor_pairs = col_correlations.stack()
return cor_pairs.to_dict()
my_corrs = get_corrs(df)
# and the following line to retrieve the single correlation
print(my_corrs[('Citable docs per Capita','Energy Supply per Capita')])
Answer 4
When you call:
data = Top15[['Citable docs per Capita','Energy Supply per Capita']]
correlation = data.corr(method='pearson')
Since the DataFrame.corr() function performs pair-wise correlations, you get four pairs from two variables. Basically, the two diagonal values are the auto-correlations (each variable's correlation with itself, which is always 1), and the other two values are the cross-correlations of one variable versus the other and vice versa.
Either perform correlation between two series to get a single value:
from scipy.stats import pearsonr
docs_col = Top15['Citable docs per Capita'].values
energy_col = Top15['Energy Supply per Capita'].values
corr, _ = pearsonr(docs_col, energy_col)
or,
if you want a single value from the same function (DataFrame’s corr):
single_value = correlation.iloc[0, 1]
Hope this helps.
Answer 5
It works like this:
Top15['Citable docs per Capita']=np.float64(Top15['Citable docs per Capita'])
Top15['Energy Supply per Capita']=np.float64(Top15['Energy Supply per Capita'])
Top15['Energy Supply per Capita'].corr(Top15['Citable docs per Capita'])
Answer 6
I solved this problem by changing the data type. If you look, 'Energy Supply per Capita' is a numerical type while 'Citable docs per Capita' is an object type. I converted the column to float using astype. I had the same problem with some np functions: count_nonzero and sum worked while mean and std didn't.
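The conversion described above would look roughly like this (a sketch using the column name from the question, not the answerer's exact code):
# cast the object column to float so that .corr() no longer skips it
Top15['Citable docs per Capita'] = Top15['Citable docs per Capita'].astype(float)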
Answer 7
Changing 'Citable docs per Capita' to numeric before computing the correlation will solve the problem.
Top15['Citable docs per Capita'] = pd.to_numeric(Top15['Citable docs per Capita'])
data = Top15[['Citable docs per Capita','Energy Supply per Capita']]
correlation = data.corr(method='pearson')
A few-line solution without redundant pairs of variables:
corr_matrix = df.corr().abs()
# the matrix is symmetric, so we only need the upper triangle without the diagonal (k=1)
sol = (corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
                  .stack()
                  .sort_values(ascending=False))
# the first element of the sol series is the pair with the biggest correlation
Then you can iterate through the names of the variable pairs (which are pandas.Series multi-indexes) and their values, as sketched below.
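A rough loop of that kind over the sol Series built above (variable names chosen only for illustration) might look like this:
# each index entry is a (feature_1, feature_2) tuple, each value the correlation
for (feature_a, feature_b), corr_value in sol.items():
    print(feature_a, feature_b, corr_value)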
Use itertools.combinations to get all unique correlations from pandas' own correlation matrix .corr(), generate a list of lists, and feed it back into a DataFrame in order to use '.sort_values'. Set ascending=True to display the lowest correlations on top.
corrank takes a DataFrame as an argument because it requires .corr().
def corrank(X: pd.DataFrame):
    import itertools
    df = pd.DataFrame([[(i, j), X.corr().loc[i, j]] for i, j in itertools.combinations(X.corr(), 2)],
                      columns=['pairs', 'corr'])
    print(df.sort_values(by='corr', ascending=False))

corrank(X)  # prints a descending list of correlation pairs (max on top)
Answer 8
I didn’t want to unstack or over-complicate this issue, since I just wanted to drop some highly correlated features as part of a feature selection phase.
So I ended up with the following simplified solution:
# map features to their absolute correlation values
corr = features.corr().abs()
# set equality (self correlation) as zero
corr[corr == 1] = 0
# of each feature, find the max correlation
# and sort the resulting array in ascending order
corr_cols = corr.max().sort_values(ascending=False)
# display the highly correlated features
display(corr_cols[corr_cols > 0.8])
In this case, if you want to drop correlated features, you may map through the filtered corr_cols array and remove the odd-indexed (or even-indexed) ones.
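As a sketch of that last step (assuming, as the sorted list suggests, that the two members of each correlated pair sit next to each other):
# keep every other entry of the filtered list and drop the rest
to_drop = corr_cols[corr_cols > 0.8].index[1::2]
features_reduced = features.drop(columns=to_drop)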
The following function should do the trick. This implementation
Removes self correlations
Removes duplicates
Enables the selection of top N highest correlated features
and it is also configurable, so that you can keep both the self correlations as well as the duplicates. You can also report as many feature pairs as you wish.
def get_feature_correlation(df, top_n=None, corr_method='spearman',
remove_duplicates=True, remove_self_correlations=True):
"""
Compute the feature correlation and sort feature pairs based on their correlation
:param df: The dataframe with the predictor variables
:type df: pandas.core.frame.DataFrame
:param top_n: Top N feature pairs to be reported (if None, all of the pairs will be returned)
:param corr_method: Correlation computation method
:type corr_method: str
:param remove_duplicates: Indicates whether duplicate features must be removed
:type remove_duplicates: bool
:param remove_self_correlations: Indicates whether self correlations will be removed
:type remove_self_correlations: bool
:return: pandas.core.frame.DataFrame
"""
corr_matrix_abs = df.corr(method=corr_method).abs()
corr_matrix_abs_us = corr_matrix_abs.unstack()
sorted_correlated_features = corr_matrix_abs_us \
.sort_values(kind="quicksort", ascending=False) \
.reset_index()
# Remove comparisons of the same feature
if remove_self_correlations:
sorted_correlated_features = sorted_correlated_features[
(sorted_correlated_features.level_0 != sorted_correlated_features.level_1)
]
# Remove duplicates
if remove_duplicates:
sorted_correlated_features = sorted_correlated_features.iloc[:-2:2]
# Create meaningful names for the columns
sorted_correlated_features.columns = ['Feature 1', 'Feature 2', 'Correlation (abs)']
if top_n:
return sorted_correlated_features[:top_n]
return sorted_correlated_features
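For example, a call along these lines (assuming df holds the predictor columns) returns the ten most strongly correlated feature pairs:
top_pairs = get_feature_correlation(df, top_n=10, corr_method='spearman')
print(top_pairs)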
I liked Addison Klinke's post the most, as it is the simplest, but I used Wojciech Moszczyńsk's suggestion for filtering and charting, extending the filter to avoid absolute values. So, given a large correlation matrix, filter it, chart it, and then flatten it.
In the end, I created a small function to create the correlation matrix, filter it, and then flatten it. As an idea, it could easily be extended to, e.g., asymmetric upper and lower bounds, etc.
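As a minimal sketch of that idea (the function and parameter names here are made up for illustration):
import numpy as np
import pandas as pd

def corr_filter_flatten(df, lower=-0.7, upper=0.7, corr_method='pearson'):
    # full correlation matrix
    corr = df.corr(method=corr_method)
    # keep only the upper triangle (without the diagonal) to avoid duplicate pairs
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    # filter on the raw coefficients rather than absolute values, with separate
    # (and potentially asymmetric) lower and upper bounds
    filtered = corr.where(mask & ((corr < lower) | (corr > upper)))
    # flatten to a Series indexed by (feature_1, feature_2) pairs
    return filtered.stack().sort_values(ascending=False)
A call such as corr_filter_flatten(df, lower=-0.5, upper=0.6) then returns only the pairs outside that band, which can be fed straight into a chart.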