Tag Archives: statistics

Calculating Pearson correlation and significance in Python

Question: Calculating Pearson correlation and significance in Python

I am looking for a function that takes as input two lists, and returns the Pearson correlation, and the significance of the correlation.


Answer 0

You can have a look at scipy.stats:

from pydoc import help
from scipy.stats.stats import pearsonr
help(pearsonr)

>>>
Help on function pearsonr in module scipy.stats.stats:

pearsonr(x, y)
 Calculates a Pearson correlation coefficient and the p-value for testing
 non-correlation.

 The Pearson correlation coefficient measures the linear relationship
 between two datasets. Strictly speaking, Pearson's correlation requires
 that each dataset be normally distributed. Like other correlation
 coefficients, this one varies between -1 and +1 with 0 implying no
 correlation. Correlations of -1 or +1 imply an exact linear
 relationship. Positive correlations imply that as x increases, so does
 y. Negative correlations imply that as x increases, y decreases.

 The p-value roughly indicates the probability of an uncorrelated system
 producing datasets that have a Pearson correlation at least as extreme
 as the one computed from these datasets. The p-values are not entirely
 reliable but are probably reasonable for datasets larger than 500 or so.

 Parameters
 ----------
 x : 1D array
 y : 1D array the same length as x

 Returns
 -------
 (Pearson's correlation coefficient,
  2-tailed p-value)

 References
 ----------
 http://www.statsoft.com/textbook/glosp.html#Pearson%20Correlation
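
On modern SciPy the function is imported directly from scipy.stats; a minimal usage sketch (the two lists here are illustrative):

from scipy.stats import pearsonr

# pearsonr returns (Pearson correlation coefficient, two-tailed p-value)
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 7]
r, p = pearsonr(x, y)
print(r, p)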

Answer 1

The Pearson correlation can be calculated with numpy’s corrcoef.

import numpy
numpy.corrcoef(list1, list2)[0, 1]
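
Note that numpy.corrcoef returns the full 2x2 correlation matrix (hence the [0, 1] indexing) and, unlike scipy.stats.pearsonr, gives no p-value. A small sketch with made-up data:

import numpy

list1 = [1, 2, 3, 4]
list2 = [1, 5, 7, 9]
print(numpy.corrcoef(list1, list2))        # symmetric 2x2 matrix with 1.0 on the diagonal
print(numpy.corrcoef(list1, list2)[0, 1])  # the off-diagonal entry is r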

Answer 2

An alternative can be the native scipy.stats function linregress, which calculates:

slope : slope of the regression line

intercept : intercept of the regression line

r-value : correlation coefficient

p-value : two-sided p-value for a hypothesis test whose null hypothesis is that the slope is zero

stderr : Standard error of the estimate

And here is an example:

a = [15, 12, 8, 8, 7, 7, 7, 6, 5, 3]
b = [10, 25, 17, 11, 13, 17, 20, 13, 9, 15]
from scipy.stats import linregress
linregress(a, b)

will return you:

LinregressResult(slope=0.20833333333333337, intercept=13.375, rvalue=0.14499815458068521, pvalue=0.68940144811669501, stderr=0.50261704627083648)
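
Since linregress returns a named tuple, the correlation and its p-value can be read off by attribute; a short sketch:

from scipy.stats import linregress

a = [15, 12, 8, 8, 7, 7, 7, 6, 5, 3]
b = [10, 25, 17, 11, 13, 17, 20, 13, 9, 15]
result = linregress(a, b)
print(result.rvalue, result.pvalue)  # Pearson r and the two-sided p-value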

Answer 3

If you don’t feel like installing scipy, I’ve used this quick hack, slightly modified from Programming Collective Intelligence:

(Edited for correctness.)

def pearsonr(x, y):
    # Assumes len(x) == len(y); Python 3 version (itertools.imap no longer exists)
    n = len(x)
    sum_x = float(sum(x))
    sum_y = float(sum(y))
    sum_x_sq = sum(xi ** 2 for xi in x)
    sum_y_sq = sum(yi ** 2 for yi in y)
    psum = sum(xi * yi for xi, yi in zip(x, y))
    num = psum - (sum_x * sum_y / n)
    den = pow((sum_x_sq - pow(sum_x, 2) / n) * (sum_y_sq - pow(sum_y, 2) / n), 0.5)
    if den == 0:
        return 0
    return num / den

Answer 4

The following code is a straight-up interpretation of the definition:

import math

def average(x):
    assert len(x) > 0
    return float(sum(x)) / len(x)

def pearson_def(x, y):
    assert len(x) == len(y)
    n = len(x)
    assert n > 0
    avg_x = average(x)
    avg_y = average(y)
    diffprod = 0
    xdiff2 = 0
    ydiff2 = 0
    for idx in range(n):
        xdiff = x[idx] - avg_x
        ydiff = y[idx] - avg_y
        diffprod += xdiff * ydiff
        xdiff2 += xdiff * xdiff
        ydiff2 += ydiff * ydiff

    return diffprod / math.sqrt(xdiff2 * ydiff2)

Test:

print(pearson_def([1, 2, 3], [1, 5, 7]))

returns

0.981980506062

This agrees with Excel, this calculator, SciPy, and NumPy, which return 0.981980506, 0.9819805060619657, and 0.98198050606196574, respectively.

R:

> cor( c(1,2,3), c(1,5,7))
[1] 0.9819805

EDIT: Fixed a bug pointed out by a commenter.


Answer 5

You can do this with pandas.DataFrame.corr, too:

import pandas as pd
a = [[1, 2, 3],
     [5, 6, 9],
     [5, 6, 11],
     [5, 6, 13],
     [5, 3, 13]]
df = pd.DataFrame(data=a)
df.corr()

This gives

          0         1         2
0  1.000000  0.745601  0.916579
1  0.745601  1.000000  0.544248
2  0.916579  0.544248  1.000000
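
Since the question is about two lists, the pairwise form may be closer to what is needed; pandas.Series.corr computes the Pearson correlation by default. A small sketch with made-up data:

import pandas as pd

list1 = [1, 2, 3, 4]
list2 = [1, 5, 7, 9]
print(pd.Series(list1).corr(pd.Series(list2)))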

Answer 6

Rather than relying on numpy/scipy, I think this answer should be the easiest to code and to follow through the steps of calculating the Pearson Correlation Coefficient (PCC).

import math

# calculates the mean
def mean(x):
    sum = 0.0
    for i in x:
         sum += i
    return sum / len(x) 

# calculates the sample standard deviation
def sampleStandardDeviation(x):
    sumv = 0.0
    for i in x:
         sumv += (i - mean(x))**2
    return math.sqrt(sumv/(len(x)-1))

# calculates the PCC using both the 2 functions above
def pearson(x, y):
    # compute the means and sample standard deviations once, rather than per element
    mean_x, mean_y = mean(x), mean(y)
    std_x, std_y = sampleStandardDeviation(x), sampleStandardDeviation(y)

    # standardise both lists into z-scores
    scorex = [(i - mean_x) / std_x for i in x]
    scorey = [(j - mean_y) / std_y for j in y]

    # multiplies both lists together elementwise (hence zip) and sums the whole list
    return sum(i * j for i, j in zip(scorex, scorey)) / (len(x) - 1)

The significance of the PCC is basically to show you how strongly correlated the two variables/lists are. It is important to note that the PCC ranges from -1 to 1: a value between 0 and 1 denotes a positive correlation, a value of 0 denotes no correlation whatsoever, and a value between -1 and 0 denotes a negative correlation.
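
A quick sanity check of the helpers above, using the same illustrative data as the other answers:

x = [1, 2, 3]
y = [1, 5, 7]
print(pearson(x, y))  # ~0.98198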


Answer 7

Pearson coefficient calculation using pandas in Python: I would suggest trying this approach since your data contains lists. It will be easy to interact with your data and manipulate it from the console, since you can visualise your data structure and update it as you wish. You can also export the data set, save it, and add new data out of the Python console for later analysis. This code is simpler and contains fewer lines of code. I am assuming you need a few quick lines of code to screen your data for further analysis.

Example:

data = {'list 1':[2,4,6,8],'list 2':[4,16,36,64]}

import pandas as pd  # convert your lists into pandas DataFrames

df = pd.DataFrame(data, columns = ['list 1','list 2'])

from scipy import stats  # for the built-in method to get the PCC

pearson_coef, p_value = stats.pearsonr(df["list 1"], df["list 2"])  # define the columns to perform calculations on
print("Pearson Correlation Coefficient: ", pearson_coef, "and a P-value of:", p_value)  # results

However, you did not post your data for me to see the size of the data set or the transformations that might be needed before the analysis.


Answer 8

Hmm, many of these responses have long and hard-to-read code…

I’d suggest using numpy with its nifty features when working with arrays:

import numpy as np
def pcc(X, Y):
   ''' Compute Pearson Correlation Coefficient. '''
   # Note: this normalises X and Y in place, so pass float arrays
   # (or copies) that you don't mind modifying
   X -= X.mean(0)
   Y -= Y.mean(0)
   # Standardise X and Y
   X /= X.std(0)
   Y /= Y.std(0)
   # Compute mean product
   return np.mean(X*Y)

# Using it on a random example
from random import random
X = np.array([random() for x in range(100)])
Y = np.array([random() for x in range(100)])
pcc(X, Y)

Answer 9

This is an implementation of the Pearson correlation function using numpy:


def corr(data1, data2):
    "data1 & data2 should be numpy arrays."
    mean1 = data1.mean() 
    mean2 = data2.mean()
    std1 = data1.std()
    std2 = data2.std()

#     corr = ((data1-mean1)*(data2-mean2)).mean()/(std1*std2)
    corr = ((data1*data2).mean()-mean1*mean2)/(std1*std2)
    return corr
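
A brief usage sketch with the same illustrative data as the other answers (the population standard deviations cancel in the ratio, so the result matches the sample correlation):

import numpy as np

print(corr(np.array([1., 2., 3.]), np.array([1., 5., 7.])))  # ~0.98198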


Answer 10

Here's a variant on mkh's answer that runs much faster than it (and than scipy.stats.pearsonr), using numba.

import numba

@numba.jit
def corr(data1, data2):
    M = data1.size

    sum1 = 0.
    sum2 = 0.
    for i in range(M):
        sum1 += data1[i]
        sum2 += data2[i]
    mean1 = sum1 / M
    mean2 = sum2 / M

    var_sum1 = 0.
    var_sum2 = 0.
    cross_sum = 0.
    for i in range(M):
        var_sum1 += (data1[i] - mean1) ** 2
        var_sum2 += (data2[i] - mean2) ** 2
        cross_sum += (data1[i] * data2[i])

    std1 = (var_sum1 / M) ** .5
    std2 = (var_sum2 / M) ** .5
    cross_mean = cross_sum / M

    return (cross_mean - mean1 * mean2) / (std1 * std2)
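
A usage sketch (numba compiles on the first call, so any timing comparison should be made on a second run; the data here are illustrative):

import numpy as np

data1 = np.random.rand(10_000)
data2 = np.random.rand(10_000)
print(corr(data1, data2))  # close to 0 for independent uniform samples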

Answer 11

Here is an implementation of Pearson correlation based on sparse vectors. The vectors here are expressed as lists of (index, value) tuples. The two sparse vectors can be of different lengths, but the overall vector size has to be the same. This is useful for text mining applications where the vector size is extremely large (most features being bags of words), so calculations are usually performed using sparse vectors.

from math import sqrt

def get_pearson_corelation(self, first_feature_vector=[], second_feature_vector=[], length_of_featureset=0):
    indexed_feature_dict = {}
    if first_feature_vector == [] or second_feature_vector == [] or length_of_featureset == 0:
        raise ValueError("Empty feature vectors or zero length of featureset in get_pearson_corelation")

    sum_a = sum(value for index, value in first_feature_vector)
    sum_b = sum(value for index, value in second_feature_vector)

    avg_a = float(sum_a) / length_of_featureset
    avg_b = float(sum_b) / length_of_featureset

    mean_sq_error_a = sqrt((sum((value - avg_a) ** 2 for index, value in first_feature_vector)) + ((
        length_of_featureset - len(first_feature_vector)) * ((0 - avg_a) ** 2)))
    mean_sq_error_b = sqrt((sum((value - avg_b) ** 2 for index, value in second_feature_vector)) + ((
        length_of_featureset - len(second_feature_vector)) * ((0 - avg_b) ** 2)))

    covariance_a_b = 0

    #calculate covariance for the sparse vectors
    for tuple in first_feature_vector:
        if len(tuple) != 2:
            raise ValueError("Invalid feature frequency tuple in featureVector: %s" % (tuple,))
        indexed_feature_dict[tuple[0]] = tuple[1]
    count_of_features = 0
    for tuple in second_feature_vector:
        count_of_features += 1
        if len(tuple) != 2:
            raise ValueError("Invalid feature frequency tuple in featureVector: %s" % (tuple,))
        if tuple[0] in indexed_feature_dict:
            covariance_a_b += ((indexed_feature_dict[tuple[0]] - avg_a) * (tuple[1] - avg_b))
            del (indexed_feature_dict[tuple[0]])
        else:
            covariance_a_b += (0 - avg_a) * (tuple[1] - avg_b)

    for index in indexed_feature_dict:
        count_of_features += 1
        covariance_a_b += (indexed_feature_dict[index] - avg_a) * (0 - avg_b)

    #adjust covariance with rest of vector with 0 value
    covariance_a_b += (length_of_featureset - count_of_features) * -avg_a * -avg_b

    if mean_sq_error_a == 0 or mean_sq_error_b == 0:
        return -1
    else:
        return float(covariance_a_b) / (mean_sq_error_a * mean_sq_error_b)

Unit tests:

def test_get_get_pearson_corelation(self):
    vector_a = [(1, 1), (2, 2), (3, 3)]
    vector_b = [(1, 1), (2, 5), (3, 7)]
    self.assertAlmostEquals(self.sim_calculator.get_pearson_corelation(vector_a, vector_b, 3), 0.981980506062, 3, None, None)

    vector_a = [(1, 1), (2, 2), (3, 3)]
    vector_b = [(1, 1), (2, 5), (3, 7), (4, 14)]
    self.assertAlmostEquals(self.sim_calculator.get_pearson_corelation(vector_a, vector_b, 5), -0.0137089240555, 3, None, None)

Answer 12

I have a very simple and easy to understand solution for this. For two arrays of equal length, Pearson coefficient can be easily computed as follows:

import numpy as np

def manual_pearson(a, b):
    """
    Accepts two arrays of equal length, and computes the correlation coefficient.
    The numerator is the sum of the products of (a - a_avg) and (b - b_avg),
    while the denominator is the product of a_stdev and b_stdev multiplied by
    the length of the arrays.
    """
    a_avg, b_avg = np.average(a), np.average(b)
    a_stdev, b_stdev = np.std(a), np.std(b)
    n = len(a)
    denominator = a_stdev * b_stdev * n
    numerator = np.sum(np.multiply(a - a_avg, b - b_avg))
    p_coef = numerator / denominator
    return p_coef

Answer 13

You may wonder how to interpret your probability in the context of looking for a correlation in a particular direction (negative or positive correlation.) Here is a function I wrote to help with that. It might even be right!

It’s based on info I gleaned from http://www.vassarstats.net/rsig.html and http://en.wikipedia.org/wiki/Student%27s_t_distribution, thanks to other answers posted here.

from scipy import stats

# Given (possibly random) variables, X and Y, and a correlation direction,
# returns:
#  (r, p),
# where r is the Pearson correlation coefficient, and p is the probability
# that there is no correlation in the given direction.
#
# direction:
#  if positive, p is the probability that there is no positive correlation in
#    the population sampled by X and Y
#  if negative, p is the probability that there is no negative correlation
#  if 0, p is the probability that there is no correlation in either direction
def probabilityNotCorrelated(X, Y, direction=0):
    x = len(X)
    if x != len(Y):
        raise ValueError("variables not same len: " + str(x) + ", and " + \
                         str(len(Y)))
    if x < 6:
        raise ValueError("must have at least 6 samples, but have " + str(x))
    (corr, prb_2_tail) = stats.pearsonr(X, Y)

    if not direction:
        return (corr, prb_2_tail)

    prb_1_tail = prb_2_tail / 2
    if corr * direction > 0:
        return (corr, prb_1_tail)

    return (corr, 1 - prb_1_tail)

Answer 14

You can take a look at this article. It is a well-documented example of calculating correlations based on historical forex currency-pair data from multiple files using the pandas library (for Python), and then generating a heatmap plot using the seaborn library.

http://www.tradinggeeks.net/2015/08/calculating-correlation-in-python/


Answer 15

def pearson(x,y):
  n=len(x)
  vals=range(n)

  sumx=sum([float(x[i]) for i in vals])
  sumy=sum([float(y[i]) for i in vals])

  sumxSq=sum([x[i]**2.0 for i in vals])
  sumySq=sum([y[i]**2.0 for i in vals])

  pSum=sum([x[i]*y[i] for i in vals])
  # Calculating Pearson correlation
  num=pSum-(sumx*sumy/n)
  den=((sumxSq-pow(sumx,2)/n)*(sumySq-pow(sumy,2)/n))**.5
  if den==0: return 0
  r=num/den
  return r

np.mean() vs np.average() in Python NumPy?

Question: np.mean() vs np.average() in Python NumPy?

I notice that

In [30]: np.mean([1, 2, 3])
Out[30]: 2.0

In [31]: np.average([1, 2, 3])
Out[31]: 2.0

However, there should be some differences, since after all they are two different functions.

What are the differences between them?


Answer 0

np.average takes an optional weights parameter. If it is not supplied, the two are equivalent. Take a look at the source code: Mean, Average

np.mean:

try:
    mean = a.mean
except AttributeError:
    return _wrapit(a, 'mean', axis, dtype, out)
return mean(axis, dtype, out)

np.average:

...
if weights is None :
    avg = a.mean(axis)
    scl = avg.dtype.type(a.size/avg.size)
else:
    #code that does weighted mean here

if returned: #returned is another optional argument
    scl = np.multiply(avg, 0) + scl
    return avg, scl
else:
    return avg
...

Answer 1

np.mean always computes an arithmetic mean, and has some additional options for input and output (e.g. what datatypes to use, where to place the result).

np.average can compute a weighted average if the weights parameter is supplied.
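
A short sketch of the difference (the weights are arbitrary, for illustration):

import numpy as np

a = [1, 2, 3]
print(np.mean(a))                        # 2.0
print(np.average(a))                     # 2.0, identical when no weights are given
print(np.average(a, weights=[3, 1, 1]))  # 1.6, i.e. (3*1 + 1*2 + 1*3) / 5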


Answer 2

In some versions of numpy there is another important difference that you must be aware of:

average does not take masks into account, so it computes the average over the whole data set.

mean takes masks into account, so it computes the mean only over unmasked values.

g = [1,2,3,55,66,77]
f = np.ma.masked_greater(g,5)

np.average(f)
Out: 34.0

np.mean(f)
Out: 2.0

Answer 3

In your invocation, the two functions are the same.

average can compute a weighted average though.

Doc links: mean and average


Answer 4

In addition to the differences already noted, there’s another extremely important difference that I just now discovered the hard way: unlike np.mean, np.average doesn’t allow the dtype keyword, which is essential for getting correct results in some cases. I have a very large single-precision array that is accessed from an h5 file. If I take the mean along axes 0 and 1, I get wildly incorrect results unless I specify dtype='float64':

>T.shape
(4096, 4096, 720)
>T.dtype
dtype('<f4')

m1 = np.average(T, axis=(0,1))                #  garbage
m2 = np.mean(T, axis=(0,1))                   #  the same garbage
m3 = np.mean(T, axis=(0,1), dtype='float64')  # correct results

Unfortunately, unless you know what to look for, you can’t necessarily tell your results are wrong. I will never use np.average again for this reason but will always use np.mean(.., dtype='float64') on any large array. If I want a weighted average, I’ll compute it explicitly using the product of the weight vector and the target array and then either np.sum or np.mean, as appropriate (with appropriate precision as well).


Calculating arithmetic mean (one type of average) in Python

Question: Calculating arithmetic mean (one type of average) in Python

Is there a built-in or standard library method in Python to calculate the arithmetic mean (one type of average) of a list of numbers?


Answer 0

I am not aware of anything in the standard library. However, you could use something like:

def mean(numbers):
    return float(sum(numbers)) / max(len(numbers), 1)

>>> mean([1,2,3,4])
2.5
>>> mean([])
0.0

In numpy, there’s numpy.mean().


Answer 1

NumPy has a numpy.mean which is an arithmetic mean. Usage is as simple as this:

>>> import numpy
>>> a = [1, 2, 4]
>>> numpy.mean(a)
2.3333333333333335

Answer 2

Use statistics.mean:

import statistics
print(statistics.mean([1,2,4])) # 2.3333333333333335

It’s available since Python 3.4. For 3.1-3.3 users, an old version of the module is available on PyPI under the name stats. Just change statistics to stats.


Answer 3

You don’t even need numpy or scipy…

>>> a = [1, 2, 3, 4, 5, 6]
>>> print(sum(a) / len(a))
3.5

Answer 4

Use scipy:

import scipy;
a=[1,2,4];
print(scipy.mean(a));
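
Note that scipy.mean was simply an alias for numpy.mean and is deprecated in newer SciPy releases, so numpy.mean or statistics.mean is preferable.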

Answer 5

Instead of casting to float, you can do the following

def mean(nums):
    return sum(nums, 0.0) / len(nums)

or using lambda

mean = lambda nums: sum(nums, 0.0) / len(nums)

UPDATES: 2019-12-15

Python 3.8 added the function fmean to the statistics module, which is faster and always returns a float.

Convert data to floats and compute the arithmetic mean.

This runs faster than the mean() function and it always returns a float. The data may be a sequence or iterable. If the input dataset is empty, raises a StatisticsError.

fmean([3.5, 4.0, 5.25])

4.25

New in version 3.8.


Answer 6

from statistics import mean
average = mean(your_list)

for example

from statistics import mean

my_list = [5, 2, 3, 2]
average = mean(my_list)
print(average)

and the result is

3.0

Answer 7

def avg(l):
    """uses floating-point division."""
    return sum(l) / float(len(l))

Examples:

l1 = [3,5,14,2,5,36,4,3]
l2 = [0,0,0]

print(avg(l1)) # 9.0
print(avg(l2)) # 0.0

Answer 8

def list_mean(nums):
    sumof = 0
    num_of = len(nums)
    mean = 0
    for i in nums:
        sumof += i
    mean = sumof / num_of
    return float(mean)

Answer 9

I always supposed avg is omitted from the builtins/stdlib because it is as simple as

sum(L)/len(L) # L is some list

and any caveats would be addressed in caller code for local usage already.

Notable caveats:

  1. non-float result: in Python 2, 9/4 is 2. To resolve, use float(sum(L))/len(L) or from __future__ import division

  2. division by zero: the list may be empty. To resolve:

    if not L:
        raise WhateverYouWantError("foo")
    avg = float(sum(L))/len(L)
    

Answer 10

The proper answer to your question is to use statistics.mean. But for fun, here is a version of mean that does not use the len() function, so it (like statistics.mean) can be used on generators, which do not support len():

from functools import reduce
from operator import truediv
def ave(seq):
    return truediv(*reduce(lambda a, b: (a[0] + b[1], b[0]), 
                           enumerate(seq, start=1), 
                           (0, 0)))
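
A quick usage sketch on a generator (which has no len()):

print(ave(x * x for x in range(4)))  # (0 + 1 + 4 + 9) / 4 = 3.5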

Answer 11

Others have already posted very good answers, but some people might still be looking for a classic way to find the mean (avg), so here I post this (code tested in Python 3.6):

def meanmanual(listt):
    mean = 0
    lsum = 0
    lenoflist = len(listt)

    for i in listt:
        lsum += i

    mean = lsum / lenoflist
    return float(mean)

a = [1, 2, 3, 4, 5, 6]
meanmanual(a)

Answer: 3.5

Pandas-profiling - Create HTML profiling reports from pandas DataFrame objects

Documentation|Slack|Stack Overflow

Generates profile reports from a pandas DataFrame.

The pandas df.describe() function is great but a little basic for serious exploratory data analysis. pandas_profiling extends the pandas DataFrame with df.profile_report() for quick data analysis.

For each column the following statistics, if relevant for the column type, are presented in an interactive HTML report:

  • Type inference: detect the types of columns in a dataframe
  • Essentials: type, unique values, missing values
  • Quantile statistics like minimum value, Q1, median, Q3, maximum, range, interquartile range
  • Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
  • Most frequent values
  • Histogram
  • Correlations highlighting highly correlated variables, with Spearman, Pearson and Kendall matrices
  • Missing values: matrix, count, heatmap and dendrogram of missing values
  • Text analysis: learn about categories (Uppercase, Space), scripts (Latin, Cyrillic) and blocks (ASCII) of text data
  • File and Image analysis: extract file sizes, creation dates and dimensions, and scan for truncated images or those containing EXIF information

Announcements

Release of version v3.0.0, in which the report configuration was overhauled, providing a more intuitive API and fixing issues inherent to the previous global configuration.

This is the first release to adhere to the Semver and Conventional Commits specifications.

Spark backend in progress: we can happily announce that the Spark backend for generating profile reports is getting close to v1. Testers wanted! The Spark backend will be released as a pre-release of this package.

Support pandas-profiling

The development of pandas-profiling relies completely on contributions. If you find value in the package, we welcome you to support the project directly through GitHub Sponsors! Please help me to continue to support this package. It's extra exciting that GitHub matches your contribution for the first year.

Find more information here:

May 9, 2021 💘


Contents: Examples|Installation|Documentation|Large datasets|Command line usage|Advanced usage|Integrations|Support|Types|How to contribute|Editor Integration|Dependencies


Examples

The following examples can give you an impression of what the package can do:

Specific features:

Tutorials:

Installation

Using pip



You can install using the pip package manager by running

pip install pandas-profiling[notebook]

Alternatively, you could install the latest version directly from Github:

pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip

Using conda


You can install using the conda package manager by running

conda install -c conda-forge pandas-profiling

From source

Download the source code by cloning the repository or by pressing 'Download ZIP' on this page.

Install by navigating to the proper directory and running:

python setup.py install

Documentation

The documentation for pandas_profiling can be found here. The previous documentation is still available here.

Quick start

Start by loading in your pandas DataFrame, e.g. by using:

import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport

df = pd.DataFrame(np.random.rand(100, 5), columns=["a", "b", "c", "d", "e"])

To generate the report, run:

profile = ProfileReport(df, title="Pandas Profiling Report")

Explore deeper

You can configure the profile report in any way you like. The example code below loads the explorative configuration file, which includes many features for text (length distribution, Unicode information), files (file size, creation time) and images (dimensions, EXIF information). If you are interested in the exact settings used, you can compare it with the default configuration file.

profile = ProfileReport(df, title="Pandas Profiling Report", explorative=True)

Learn more about configuring pandas-profiling on the Advanced usage page.

Jupyter notebook

We recommend generating reports interactively by using the Jupyter notebook. There are two interfaces (see animations below): through widgets and through an HTML report.

This is achieved by simply displaying the report. In the Jupyter notebook, run:

profile.to_widgets()

The HTML report can be included in a Jupyter notebook:

Run the following code:

profile.to_notebook_iframe()

Saving the report

If you want to generate an HTML report file, save the ProfileReport to an object and use the to_file() function:

profile.to_file("your_report.html")

Alternatively, you can obtain the data as JSON:

# As a string
json_data = profile.to_json()

# As a file
profile.to_file("your_report.json")

Large datasets

Version 2.4 introduced minimal mode.

This is a default configuration that disables expensive computations (such as correlations and duplicate row detection).

Use the following syntax:

profile = ProfileReport(large_dataset, minimal=True)
profile.to_file("output.html")

Benchmarks are available here.

Command line usage

For standard formatted CSV files that can be read immediately by pandas, you can use the pandas_profiling executable.

Run the following for information about options and arguments.

pandas_profiling -h
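
As a hedged sketch of a typical invocation (the CSV path and report name below are made-up placeholders; the executable takes an input file and an output file):

pandas_profiling my_data.csv my_report.html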

Advanced usage

A set of options is available in order to adapt the report generated.

  • title (str): Title for the report ('Pandas Profiling Report' by default).
  • pool_size (int): Number of workers in the thread pool. When set to zero, it is set to the number of CPUs available (0 by default).
  • progress_bar (bool): If True, pandas-profiling will display a progress bar.
  • infer_dtypes (bool): When True (default), the dtype of variables is inferred using visions' typeset logic (for instance, a column that has integers stored as strings will be analyzed as numeric).

For more settings, see the default configuration file and minimal configuration file.

You can find the configuration docs on the Advanced usage page here.

Example

profile = df.profile_report(
    title="Pandas Profiling Report", plot={"histogram": {"bins": 8}}
)
profile.to_file("output.html")

Integrations

Great Expectations

Profiling your data is closely related to data validation: often validation rules are defined in terms of well-known statistics. For that purpose, pandas-profiling integrates with Great Expectations. This is a world-class open-source library that helps you to maintain data quality and improve communication about data between teams. Great Expectations allows you to create Expectations (which are basically unit tests for your data) and Data Docs (conveniently shareable HTML data reports). pandas-profiling features a method to create a suite of Expectations based on the results of your ProfileReport, which you can store and use to validate another (or future) dataset.

You can find more details about the Great Expectations integration here.

Supporting open source

Maintaining and developing the open-source code for pandas-profiling, with millions of downloads and thousands of users, would not be possible without the support of our gracious sponsors.

Lambda workstations, servers, laptops, and cloud services power engineers and researchers at Fortune 500 companies and 94% of the top 50 universities. Lambda Cloud offers 4- and 8-GPU instances starting at $1.50/hr. Pre-installed with TensorFlow, PyTorch, Ubuntu, CUDA, and cuDNN.

We would like to thank our generous Github Sponsors and supporters for making pandas-profiling possible:

Martin Sotir, Brian Lee, Stephanie Rivera, abdulAziz, gramster

More info if you would like to appear here: Github Sponsor page

Types

Types are a powerful abstraction for effective data analysis that goes beyond the logical data types (integer, float, etc.). pandas-profiling currently recognizes the following types: Boolean, Numerical, Date, Categorical, URL, Path, File and Image.

We have developed a type system for Python tailored for data analysis: visions. Choosing an appropriate typeset can both improve the overall expressiveness and reduce the complexity of your analysis/code. To learn more about pandas-profiling's type system, check out the default implementation here. In the meantime, user-customized summarizations and type definitions are now fully supported - if you have a specific use case, please raise an idea or a PR!

Contributing

Read on getting involved in the Contribution Guide.

A low-threshold place to ask questions or start contributing is by reaching out on the pandas-profiling Slack. Join the Slack community.

Editor integration

PyCharm integration

  1. Install pandas-profiling via the instructions above
  2. Locate your pandas-profiling executable
    • On macOS / Linux / BSD:
      $ which pandas_profiling
      (example) /usr/local/bin/pandas_profiling
    • On Windows:
      $ where pandas_profiling
      (example) C:\ProgramData\Anaconda3\Scripts\pandas_profiling.exe
  3. In PyCharm, go to Settings (or Preferences on macOS) > Tools > External tools
  4. Click the + icon to add a new external tool
  5. Insert the following values
    • Name: Pandas Profiling
    • Program: the location obtained in step 2
    • Arguments: "$FilePath$" "$FileDir$/$FileNameWithoutAllExtensions$_report.html"
    • Working Directory: $ProjectFileDir$

To use the PyCharm integration, right-click on any dataset file:

External Tools > Pandas Profiling

Other integrations

Other editor integrations may be contributed via pull requests.

Dependencies

The profile report is written in HTML and CSS, which means pandas-profiling requires a modern browser.

You need Python 3 to run this package. Other dependencies can be found in the requirements files:

Filename              Requirements
requirements.txt      Package requirements
requirements-dev.txt  Requirements for development
requirements-test.txt Requirements for testing
setup.py              Requirements for widgets etc.

Virgilio - Your new Mentor for Data Science E-Learning

What is Virgilio?

Learning and reading through the Internet means navigating an infinite jungle of chaotic information, all the more so in fast-changing, innovative fields.

Have you ever felt overwhelmed when trying to approach Data Science without a real "path" to follow?

Are you tired of clicking "Run", "Run", "Run" in a Jupyter notebook, with the false confidence that the comfort zone of someone else's work gives you?

Have you ever been confused by the several conflicting names for the same algorithm or method, coming from different websites and fragmented tutorials?

Virgilio addresses these crucial problems, for free, for everyone.

Enter in the new web version of Virgilio!

About

Virgilio is developed and maintained by these awesome people. You can email us at virgilio.datascience (at) gmail.com or join the Discord chat.

Contributing

Awesome! Check the contribution guidelines and get involved in our project!

License

The contents are released under the Creative Commons BY-NC-SA 4.0 license. The code is released under the MIT license. The Virgilio image comes from here.

Scikit-learn - scikit-learn: machine learning in Python



scikit-learn is a Python module for machine learning built on top of SciPy.

The project was started in 2007 by David Cournapeau as a Google Summer of Code project, and since then many volunteers have contributed. See the About us page for a list of core contributors.

It is currently maintained by a team of volunteers.

Website: https://scikit-learn.org

Installation

Dependencies

scikit-learn requires:

  • Python (>= 3.7)
  • NumPy (>= 1.14.6)
  • SciPy (>= 1.1.0)
  • joblib (>= 0.11)
  • threadpoolctl (>= 2.0.0)

scikit-learn 0.20 was the last version to support Python 2.7 and Python 3.4. scikit-learn 0.23 and later require Python 3.6 or newer. scikit-learn 1.0 and later require Python 3.7 or newer.

scikit-learn plotting capabilities (i.e., functions starting with plot_ and classes ending with "Display") require Matplotlib (>= 2.2.2). For running the examples, Matplotlib >= 2.2.2 is required. A few examples require scikit-image >= 0.14.5, a few examples require pandas >= 0.25.0, and some examples require seaborn >= 0.9.0.

User installation

If you already have a working installation of numpy and scipy, the easiest way to install scikit-learn is using pip

pip install -U scikit-learn

or conda:

conda install -c conda-forge scikit-learn

The documentation includes more detailed installation instructions.

Changelog

See the changelog for a history of notable changes to scikit-learn.

Development

We welcome new contributors of all experience levels. The scikit-learn community's goal is to be helpful, welcoming, and effective. The Development Guide has detailed information about contributing code, documentation, tests, and more. We've included some basic information in this README.

Important links

Source code

You can check the latest sources with the command:

git clone https://github.com/scikit-learn/scikit-learn.git

Contributing

To learn more about making a contribution to scikit-learn, please see our Contributing guide.

Testing

After installation, you can launch the test suite from outside the source directory (you will need to have pytest >= 5.0.1 installed):

pytest sklearn

See the web page https://scikit-learn.org/dev/developers/advanced_installation.html#testing for more information.

Random number generation can be controlled during testing by setting the SKLEARN_SEED environment variable.
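
For example (assuming a POSIX shell, with an arbitrary seed value):

SKLEARN_SEED=42 pytest sklearn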

Submitting a Pull Request

Before opening a Pull Request, have a look at the full Contributing page to make sure your code complies with our guidelines: https://scikit-learn.org/stable/developers/index.html

Project History

The project was started in 2007 by David Cournapeau as a Google Summer of Code project, and since then many volunteers have contributed. See the About us page for a list of core contributors.

The project is currently maintained by a team of volunteers.

Note: scikit-learn was previously referred to as scikits.learn.

Help and Support

Documentation

Communication

Citation

If you use scikit-learn in a scientific publication, we would appreciate citations: https://scikit-learn.org/stable/about.html#citing-scikit-learn