Is there a library function for root mean square error (RMSE) in Python?

Question: Is there a library function for root mean square error (RMSE) in Python?

I know I could implement a root mean squared error function like this:

def rmse(predictions, targets):
    return np.sqrt(((predictions - targets) ** 2).mean())

What I'm looking for is whether this rmse function is already implemented in a library somewhere, perhaps in scipy or scikit-learn.


Answer 0

sklearn.metrics has a mean_squared_error function. The RMSE is just the square root of whatever it returns.

from sklearn.metrics import mean_squared_error
from math import sqrt

rms = sqrt(mean_squared_error(y_actual, y_predicted))

Answer 1

What is RMSE? It is also known as RMD or RMS, and is closely related to MSE. What problem does it solve?

If you understand RMSE (root mean squared error), MSE (mean squared error), RMD (root mean squared deviation), and RMS (root mean squared), then asking for a library to calculate this for you is unnecessary over-engineering. Each of these metrics is a single line of Python code at most 2 inches long. The four metrics rmse, mse, rmd, and rms are at their core conceptually identical; MSE simply omits the final square root.

RMSE answers the question: “How similar, on average, are the numbers in list1 to list2?”. The two lists must be the same size. I want to “wash out the noise between any two given elements, wash out the size of the data collected, and get a single number feel for change over time”.

Intuition and ELI5 for RMSE:

Imagine you are learning to throw darts at a dart board. Every day you practice for one hour. You want to figure out if you are getting better or getting worse. So every day you make 10 throws and measure the distance between the bullseye and where your dart hit.

You make a list of those numbers, list1. Compute the root mean squared error between the distances on day 1 and a list2 containing all zeros. Do the same on the 2nd and nth days. What you will get is a single number that hopefully decreases over time. When your RMSE is zero, you hit the bullseye every time. If the RMSE goes up, you are getting worse.
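
The darts exercise can be sketched roughly as follows. The distances in the practice log are made-up numbers for illustration (five throws per day instead of ten, for brevity), and rmse_from_bullseye is a hypothetical helper name. Only the standard library is used so that it runs without NumPy.

```python
import math

def rmse_from_bullseye(distances):
    """RMSE between the measured distances and an all-zero target list."""
    # With a target of all zeros, each error is just the distance itself.
    return math.sqrt(sum(d ** 2 for d in distances) / len(distances))

# Hypothetical practice log: distance from the bullseye for each throw.
practice_log = {
    "day 1": [9.0, 7.5, 8.0, 6.0, 10.0],
    "day 2": [6.0, 5.5, 7.0, 4.0, 6.5],
    "day 3": [3.0, 2.5, 4.0, 1.0, 2.0],
}

daily_rmse = {day: rmse_from_bullseye(dists) for day, dists in practice_log.items()}
for day, value in daily_rmse.items():
    print(day, round(value, 3))
```

With these invented numbers the value shrinks from day to day, which is the "getting better" signal described above.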

Example of calculating root mean squared error in Python:

import numpy as np
d = [0.000, 0.166, 0.333]   #ideal target distances, these can be all zeros.
p = [0.000, 0.254, 0.998]   #your performance goes here

print("d is: " + str(["%.8f" % elem for elem in d]))
print("p is: " + str(["%.8f" % elem for elem in p]))

def rmse(predictions, targets):
    return np.sqrt(((predictions - targets) ** 2).mean())

rmse_val = rmse(np.array(d), np.array(p))
print("rms error between lists d and p is: " + str(rmse_val))

Which prints:

d is: ['0.00000000', '0.16600000', '0.33300000']
p is: ['0.00000000', '0.25400000', '0.99800000']
rms error between lists d and p is: 0.387284994115

The mathematical notation:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(d_i - p_i\right)^2}$$

Glyph legend: n is a positive integer representing the number of throws. i is a positive integer counter that indexes the sum. d stands for the ideal distances, the list2 containing all zeros in the above example. p stands for performance, the list1 in the above example. Superscript 2 means squared. d_i is the i-th element of d. p_i is the i-th element of p.

The RMSE computed in small steps so it can be understood:

def rmse(predictions, targets):

    differences = predictions - targets                       #the DIFFERENCEs.

    differences_squared = differences ** 2                    #the SQUAREs of ^

    mean_of_differences_squared = differences_squared.mean()  #the MEAN of ^

    rmse_val = np.sqrt(mean_of_differences_squared)           #ROOT of ^

    return rmse_val                                           #get the ^

How every step of RMSE works:

Subtracting one number from another gives you the distance between them.

8 - 5 = 3         #absolute distance between 8 and 5 is +3
-20 - 10 = -30    #absolute distance between -20 and 10 is +30

If you multiply any number by itself, the result is never negative, because negative times negative is positive:

3*3     = 9   = positive
-30*-30 = 900 = positive

Add them all up, but wait, then an array with many elements would have a larger error than a small array, so average them by the number of elements.

But wait, we squared them all earlier to force them positive. Undo the damage with a square root!

That leaves you with a single number that represents, on average, the distance between every value of list1 and its corresponding element in list2.
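
As a worked check of the four steps, here is a minimal sketch that reuses the d and p lists from the earlier example and keeps every intermediate value visible. It uses plain Python lists rather than NumPy arrays, which is a choice made here for illustration.

```python
import math

d = [0.000, 0.166, 0.333]   # ideal target distances
p = [0.000, 0.254, 0.998]   # measured performance

differences = [di - pi for di, pi in zip(d, p)]   # signed differences
squares = [diff ** 2 for diff in differences]     # squaring removes the sign
mean_square = sum(squares) / len(squares)         # averaging removes the list-length effect
rmse_val = math.sqrt(mean_square)                 # the root returns to the original units

print(differences)
print(squares)
print(mean_square)
print(rmse_val)
```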

If the RMSE value goes down over time, we are happy because the error is decreasing.

RMSE isn't the most accurate line-fitting strategy; total least squares is:

Root mean squared error measures the vertical distance between each point and the line. If your data is shaped like a banana, flat near the bottom and steep near the top, RMSE will report large distances to points in the steep upper part and small distances to points in the flat lower part, even when the points are actually equally far from the line. This skews the fit so that the line prefers to be closer to the high points than to the low ones.

If this is a problem, the total least squares method fixes it: https://mubaris.com/posts/linear-regression
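
To make the vertical-versus-perpendicular point concrete, here is a small sketch. It assumes the curve can be treated locally as a straight line with a given slope, and vertical_residual is a hypothetical helper name. For a line with slope m, a point at perpendicular distance r from the line has a vertical residual of r * sqrt(1 + m^2).

```python
import math

def vertical_residual(slope, perpendicular_distance):
    """Vertical distance to a line of the given slope for a point that lies
    perpendicular_distance away from it."""
    return perpendicular_distance * math.sqrt(1.0 + slope ** 2)

# Two points that are equally far (1.0) from the curve, measured perpendicularly.
flat_part = vertical_residual(slope=0.1, perpendicular_distance=1.0)
steep_part = vertical_residual(slope=5.0, perpendicular_distance=1.0)

print(flat_part)    # close to 1.0
print(steep_part)   # roughly five times larger
```

The point on the steep part contributes a much larger squared error to RMSE even though both points are equally close to the curve.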

Gotchas that can break this RMSE function:

If there are nulls or infinities in either input list, the output RMSE value will not make sense. There are three strategies for dealing with nulls, missing values, or infinities in either list: ignore that component, zero it out, or add a best guess or uniform random noise to all timesteps. Each remedy has its pros and cons depending on what your data means. In general, ignoring any component with a missing value is preferred, but this biases the RMSE toward zero, making you think performance has improved when it really hasn't. Adding random noise to a best guess could be preferred if there are lots of missing values.

In order to guarantee relative correctness of the RMSE output, you must eliminate all nulls and infinities from the input.
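
A minimal sketch of the "ignore that component" strategy is shown here. The function name rmse_ignoring_invalid and the choice to also return the number of dropped pairs are assumptions made for illustration; the count is there because dropping pairs can bias the result toward zero, as noted above.

```python
import math

def rmse_ignoring_invalid(predictions, targets):
    """Return (rmse, dropped), using only pairs where both values are finite numbers."""
    def usable(x):
        return x is not None and math.isfinite(x)

    pairs = [(p, t) for p, t in zip(predictions, targets) if usable(p) and usable(t)]
    dropped = len(predictions) - len(pairs)
    if not pairs:
        raise ValueError("no valid pairs left to compare")
    mse = sum((p - t) ** 2 for p, t in pairs) / len(pairs)
    return math.sqrt(mse), dropped

value, dropped = rmse_ignoring_invalid(
    [1.0, None, 3.0, float("inf"), 5.0],
    [1.0, 2.0, float("nan"), 4.0, 7.0],
)
print(value, dropped)
```

Only the first and last pairs survive the filter here, so it is worth reporting how many pairs were dropped alongside the RMSE.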

RMSE has zero tolerance for outlier data points which don’t belong

Root mean squared error relies on all data being right, and every point counts equally. That means one stray point that's way out in left field is going to totally ruin the whole calculation. To handle outlier data points and dismiss their tremendous influence beyond a certain threshold, see robust estimators, which build in a threshold for dismissing outliers.
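
The effect of a single stray point can be seen in a small sketch. The data is invented, and the median absolute error used for comparison is a different metric from RMSE, shown here only as one simple example of a measure that is less sensitive to outliers; it is not the robust estimator the answer refers to.

```python
import math
import statistics

def rmse(predictions, targets):
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets))

def median_absolute_error(predictions, targets):
    # The median ignores how extreme the largest errors are.
    return statistics.median(abs(p - t) for p, t in zip(predictions, targets))

targets = [10.0, 10.0, 10.0, 10.0, 10.0]
clean = [10.5, 9.5, 10.5, 9.5, 10.5]
with_outlier = [10.5, 9.5, 10.5, 9.5, 110.0]   # one wildly wrong reading

print(rmse(clean, targets), rmse(with_outlier, targets))
print(median_absolute_error(clean, targets), median_absolute_error(with_outlier, targets))
```

One bad reading out of five moves the RMSE from 0.5 to more than 40, while the median absolute error does not change.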


Answer 2

This might be faster:

n = len(predictions)
rmse = np.linalg.norm(predictions - targets) / np.sqrt(n)
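
The reason the norm gives the same number, at least for one-dimensional arrays where np.linalg.norm defaults to the Euclidean norm, can be written out as follows. Here p and t stand for predictions and targets; whether it is actually faster is not shown by this identity.

```latex
\lVert p - t \rVert_2 = \sqrt{\sum_{i=1}^{n} (p_i - t_i)^2}
\quad\Longrightarrow\quad
\frac{\lVert p - t \rVert_2}{\sqrt{n}}
  = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (p_i - t_i)^2}
  = \mathrm{RMSE}
```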

Answer 3

As of scikit-learn 0.22.0, you can pass mean_squared_error() the argument squared=False to return the RMSE.

from sklearn.metrics import mean_squared_error

mean_squared_error(y_actual, y_predicted, squared=False)
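
Because the squared argument depends on the installed version (to my knowledge it was deprecated in scikit-learn 1.4, which also added a dedicated root_mean_squared_error function), a version-tolerant wrapper might look like the sketch below. The name rmse_compat is hypothetical, and the pure-Python fallback for the case where scikit-learn is missing is an assumption added so the snippet runs anywhere.

```python
import math

def rmse_compat(y_true, y_pred):
    """Compute RMSE with whichever scikit-learn API is available (hypothetical helper)."""
    try:
        # scikit-learn 1.4 and later provide a dedicated function.
        from sklearn.metrics import root_mean_squared_error
        return float(root_mean_squared_error(y_true, y_pred))
    except ImportError:
        pass
    try:
        # Older releases: take the square root of the MSE.
        from sklearn.metrics import mean_squared_error
        return math.sqrt(mean_squared_error(y_true, y_pred))
    except ImportError:
        # Assumption: fall back to plain Python if scikit-learn is not installed.
        return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

print(rmse_compat([0, 1, 2], [1, 10, 5]))
```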


Answer 4

Just in case someone finds this thread in 2019, there is a library called ml_metrics which is available without pre-installation in Kaggle's kernels, is pretty lightweight, and is accessible through PyPI (it can be installed easily and quickly with pip install ml_metrics):

from ml_metrics import rmse
rmse(actual=[0, 1, 2], predicted=[1, 10, 5])
# 5.507570547286102

It has a few other interesting metrics which are not available in sklearn, like mapk.


Answer 5

Actually, I did write a bunch of those as utility functions for statsmodels:

http://statsmodels.sourceforge.net/devel/tools.html#measure-for-fit-performance-eval-measures

and http://statsmodels.sourceforge.net/devel/generated/statsmodels.tools.eval_measures.rmse.html#statsmodels.tools.eval_measures.rmse

Mostly one or two liners and not much input checking, and mainly intended for easily getting some statistics when comparing arrays. But they have unit tests for the axis arguments, because that’s where I sometimes make sloppy mistakes.


Answer 6

Or simply use only NumPy functions:

import numpy as np

def rmse(y, y_pred):
    return np.sqrt(np.mean(np.square(y - y_pred)))

Where:

  • y is my target
  • y_pred is my prediction

Note that rmse(y, y_pred) == rmse(y_pred, y) because the differences are squared.


Answer 7

You can't find an RMSE function directly in scikit-learn. But instead of manually taking the sqrt, there is another standard way using sklearn. Sklearn's mean_squared_error itself has a parameter called "squared" with a default value of True. If we set it to False, the same function returns the RMSE instead of the MSE.

# code changes implemented by Esha Prakash
from sklearn.metrics import mean_squared_error
rmse = mean_squared_error(y_true, y_pred, squared=False)

Answer 8

Here's example code that calculates the RMSE between two PLY (Polygon File Format) files. It uses both the ml_metrics library and np.linalg.norm:

import sys
from pyntcloud import PyntCloud as pc
import numpy as np
from ml_metrics import rmse

# Require two input files; print usage on -h/--help or missing arguments.
if len(sys.argv) < 3 or sys.argv[1] == "-h" or sys.argv[1] == "--help":
    print("Usage: compute-rmse.py <input1.ply> <input2.ply>")
    sys.exit(1)

def verify_rmse(a, b):
    # RMSE via the Euclidean norm of the difference, as a cross-check.
    n = len(a)
    return np.linalg.norm(np.array(b) - np.array(a)) / np.sqrt(n)

def compare(a, b):
    # Load the point tables from both PLY files.
    m = pc.from_file(a).points
    n = pc.from_file(b).points
    # Note: only the x coordinates are compared; y and z are not used.
    m = tuple(m.x)
    n = tuple(n.x)
    v1, v2 = verify_rmse(m, n), rmse(m, n)
    print(v1, v2)

compare(sys.argv[1], sys.argv[2])

Answer 9

  1. No, but there is a machine learning library, scikit-learn, that is easy to use from Python. It has a function for mean squared error; the link is below:

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html

  2. The function is named mean_squared_error, as shown below, where y_true holds the real values for the data tuples and y_pred holds the values predicted by the machine learning algorithm you are using:

mean_squared_error(y_true, y_pred)

  3. You have to modify it to get the RMSE (by applying Python's sqrt function). This process is described at this link: https://www.codeastar.com/regression-model-rmsd/

So the final code would be something like:

from sklearn.metrics import mean_squared_error
from math import sqrt

RMSD = sqrt(mean_squared_error(testing_y, prediction))
print(RMSD)