# Tag Archive: numpy

## 1. Preparation

(Option 1) If your goal with Python is data analysis, you can install Anaconda directly (see: Anaconda, a Great Helper for Python Data Analysis and Mining); it bundles Python and pip.

(Option 2) We also recommend the VSCode editor for writing small Python projects: A Detailed Guide to VSCode, the Best Companion for Python Programming.

On Windows, open Cmd (Start - Run - CMD); on macOS, open Terminal (Command + Space, then type Terminal), and run the following command to install the dependency:

`pip install autograd`

## 2. Computing the Derivative of a Function

With autograd installed, you can obtain the derivative of an ordinary Python function by wrapping it with `grad`. Start with a one-line function:

```
# 公众号 Python实用宝典
from autograd import grad

def oneline(x):
    y = x / 2
    return y

grad_oneline = grad(oneline)
print(grad_oneline(3.0))
```

Running it prints the derivative of x/2, which is 0.5 for any input:

```
(base) G:\push\20220724>python 1.py
0.5
```

The same works for a more complex function such as `tanh`:

```
import autograd.numpy as np
from autograd import grad

def tanh(x):
    y = np.exp(-2.0 * x)
    return (1.0 - y) / (1.0 + y)

grad_tanh = grad(tanh)
print(grad_tanh(1.0))
```

```
(base) G:\push\20220724>python 1.py
0.419974341614026
```

Because the gradient is itself a function, you can evaluate it across a whole range of points and plot it; `elementwise_grad` handles array inputs:

```
import autograd.numpy as np
import matplotlib.pyplot as plt
from autograd import elementwise_grad

def tanh(x):
    y = np.exp(-2.0 * x)
    return (1.0 - y) / (1.0 + y)

x = np.linspace(-7, 7, 200)
plt.plot(x, tanh(x), label="tanh(x)")
plt.plot(x, elementwise_grad(tanh)(x), label="tanh'(x)")
plt.legend()
plt.show()
```

## 3. Implementing a Logistic Regression Model

```
import autograd.numpy as np

# Build a toy dataset.
inputs = np.array([[0.52, 1.12,  0.77],
                   [0.88, -1.08, 0.15],
                   [0.52, 0.06, -1.30],
                   [0.74, -2.49, 1.39]])
targets = np.array([True, True, False, True])

def sigmoid(x):
    return 0.5 * (np.tanh(x / 2.) + 1)

def logistic_predictions(weights, inputs):
    # Outputs probability of a label being true according to logistic model.
    return sigmoid(np.dot(inputs, weights))
```

```
def training_loss(weights):
    # Training loss is the negative log-likelihood of the training labels.
    preds = logistic_predictions(weights, inputs)
    label_probabilities = preds * targets + (1 - preds) * (1 - targets)
    return -np.sum(np.log(label_probabilities))
```

```
from autograd import grad

# Define a function that returns gradients of training loss using Autograd.
training_gradient_fun = grad(training_loss)

# Optimize weights using gradient descent.
weights = np.array([0.0, 0.0, 0.0])
print("Initial loss:", training_loss(weights))
for i in range(100):
    weights -= training_gradient_fun(weights) * 0.01

print("Trained loss:", training_loss(weights))
```

```
(base) G:\push\20220724>python regress.py
Initial loss: 2.772588722239781
Trained loss: 1.067270675787016
```


Python实用宝典 ( pythondict.com )

# Why You Should Use NumPy Arrays for Big Data Processing in Python

NumPy is a core module for scientific computing in Python. It provides a highly efficient array object, along with tools for working with these arrays. A NumPy array consists of many values, all of the same type.

Python's core library provides the List type. Lists are among the most common Python data types; they are resizable and can hold elements of different types, which is very convenient.

NumPy's data structures perform better in the following respects:

1. Memory size: NumPy data structures take up less memory.
2. Performance: NumPy is implemented in C under the hood and is faster than lists.
3. Operations: optimized algebraic operations and other methods are built in.

## 1. NumPy Arrays Use Less Memory

A Python list of integers costs roughly

64 + 8 * len(lst) + len(lst) * 28 bytes

(the list object itself, one 8-byte pointer per element, and one 28-byte int object per element), while a NumPy array of int64 values costs roughly

96 + len(a) * 8 bytes

(a fixed header plus a flat 8-byte slot per element).
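These numbers are easy to check on your own interpreter with `sys.getsizeof` (exact byte counts vary across Python and NumPy versions; the comparison is what matters):

```python
import sys
import numpy as np

n = 1000
lst = list(range(256, 256 + n))    # values above 256 avoid CPython's small-int cache
arr = np.arange(n, dtype=np.int64)

# A list stores the list object plus one separate int object per element.
list_bytes = sys.getsizeof(lst) + sum(sys.getsizeof(v) for v in lst)
# An array stores a small fixed header plus one flat 8-byte slot per element.
array_bytes = sys.getsizeof(arr)

print(list_bytes, array_bytes)   # the list side is several times larger
```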

## 2. NumPy Arrays Are Faster and Have Built-in Computation Methods

```
import time
import numpy as np

size_of_vec = 1000

def pure_python_version():
    t1 = time.time()
    X = range(size_of_vec)
    Y = range(size_of_vec)
    Z = [X[i] + Y[i] for i in range(len(X))]
    return time.time() - t1

def numpy_version():
    t1 = time.time()
    X = np.arange(size_of_vec)
    Y = np.arange(size_of_vec)
    Z = X + Y
    return time.time() - t1

t1 = pure_python_version()
t2 = numpy_version()
print(t1, t2)
print("Numpy is in this example " + str(t1/t2) + " faster!")
```

```
0.00048732757568359375 0.0002491474151611328
Numpy is in this example 1.955980861244019 faster!
```

```
import numpy as np
from timeit import Timer

size_of_vec = 1000
X_list = range(size_of_vec)
Y_list = range(size_of_vec)
X = np.arange(size_of_vec)
Y = np.arange(size_of_vec)

def pure_python_version():
    Z = [X_list[i] + Y_list[i] for i in range(len(X_list))]

def numpy_version():
    Z = X + Y

timer_obj1 = Timer("pure_python_version()",
                   "from __main__ import pure_python_version")
timer_obj2 = Timer("numpy_version()",
                   "from __main__ import numpy_version")

print(timer_obj1.timeit(10))
print(timer_obj2.timeit(10))  # Runs Faster!

print(timer_obj1.repeat(repeat=3, number=10))
print(timer_obj2.repeat(repeat=3, number=10)) # repeat to prove it!
```

```
0.0029753120616078377
0.00014940369874238968
[0.002683573868125677, 0.002754641231149435, 0.002803879790008068]
[6.536301225423813e-05, 2.9387418180704117e-05, 2.9171351343393326e-05]
```



# Slicing a NumPy 2D Array, or How to Extract an m×m Submatrix from an n×n Array (n > m)?

## Question: Slicing a NumPy 2D array, or how to extract an m×m submatrix from an n×n array (n > m)?


I want to slice a NumPy nxn array. I want to extract an arbitrary selection of m rows and columns of that array (i.e. without any pattern in the numbers of rows/columns), making it a new, mxm array. For this example let us say the array is 4×4 and I want to extract a 2×2 array from it.

Here is our array:

```
from numpy import *
x = range(16)
x = reshape(x,(4,4))

print(x)
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]]
```

The line and columns to remove are the same. The easiest case is when I want to extract a 2×2 submatrix that is at the beginning or at the end, i.e. :

```
In : x[0:2,0:2]
Out:
array([[0, 1],
       [4, 5]])

In : x[2:,2:]
Out:
array([[10, 11],
       [14, 15]])
```

But what if I need to remove another mixture of rows/columns? What if I need to remove the first and third lines/rows, thus extracting the submatrix `[[5,7],[13,15]]`? There can be any composition of rows/lines. I read somewhere that I just need to index my array using arrays/lists of indices for both rows and columns, but that doesn’t seem to work:

```
In : x[[1,3],[1,3]]
Out: array([ 5, 15])
```

I found one way, which is:

```
In : x[[1,3]][:,[1,3]]
Out:
array([[ 5,  7],
       [13, 15]])
```

First issue with this is that it is hardly readable, although I can live with that. If someone has a better solution, I’d certainly like to hear it.

The other thing is that I read on a forum that indexing arrays with arrays forces NumPy to make a copy of the desired array, so when dealing with large arrays this could become a problem. Why is that so / how does this mechanism work?

## Answer 0

As Sven mentioned, `x[[[0],[2]],[1,3]]` will give back the 0 and 2 rows matched with the 1 and 3 columns, while `x[[0,2],[1,3]]` will return the values x[0,1] and x[2,3] in an array.

There is a helpful function for doing the first example I gave, `numpy.ix_`. You can do the same thing as my first example with `x[numpy.ix_([0,2],[1,3])]`. This can save you from having to enter in all of those extra brackets.
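A quick sketch of `numpy.ix_` on the question's 4×4 array:

```python
import numpy as np

x = np.arange(16).reshape(4, 4)

# ix_ builds an open mesh, so rows [0, 2] are crossed with
# columns [1, 3] instead of being paired up elementwise.
sub = x[np.ix_([0, 2], [1, 3])]
print(sub)
# [[ 1  3]
#  [ 9 11]]
```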

## Answer 1


To answer this question, we have to look at how indexing a multidimensional array works in NumPy. Let's first say you have the array `x` from your question. The buffer assigned to `x` will contain 16 ascending integers from 0 to 15. If you access one element, say `x[i,j]`, NumPy has to figure out the memory location of this element relative to the beginning of the buffer. This is done by calculating in effect `i*x.shape[1]+j` (and multiplying with the size of an int to get an actual memory offset).

If you extract a subarray by basic slicing like `y = x[0:2,0:2]`, the resulting object will share the underlying buffer with `x`. But what happens if you access `y[i,j]`? NumPy can't use `i*y.shape[1]+j` to calculate the offset into the array, because the data belonging to `y` is not consecutive in memory.

NumPy solves this problem by introducing strides. When calculating the memory offset for accessing `x[i,j]`, what is actually calculated is `i*x.strides[0]+j*x.strides[1]` (and this already includes the factor for the size of an int):

```
x.strides
(16, 4)
```

When `y` is extracted like above, NumPy does not create a new buffer, but it does create a new array object referencing the same buffer (otherwise `y` would just be equal to `x`). The new array object will have a different shape than `x` and maybe a different starting offset into the buffer, but will share the strides with `x` (in this case at least):

```
y.shape
(2, 2)
y.strides
(16, 4)
```

This way, computing the memory offset for `y[i,j]` will yield the correct result.

But what should NumPy do for something like `z=x[[1,3]]`? The strides mechanism won’t allow correct indexing if the original buffer is used for `z`. NumPy theoretically could add some more sophisticated mechanism than the strides, but this would make element access relatively expensive, somehow defying the whole idea of an array. In addition, a view wouldn’t be a really lightweight object anymore.
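Both behaviours are easy to check with `numpy.shares_memory` (a sketch):

```python
import numpy as np

x = np.arange(16).reshape(4, 4)

y = x[0:2, 0:2]      # basic slicing: a view described by shape/strides
z = x[[1, 3]]        # fancy indexing: NumPy has to copy into a new buffer

print(y.strides == x.strides)      # True: y reuses x's strides
print(np.shares_memory(x, y))      # True
print(np.shares_memory(x, z))      # False
```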

This is covered in depth in the NumPy documentation on indexing.

Oh, and nearly forgot about your actual question: Here is how to make the indexing with multiple lists work as expected:

```
x[[[1],[3]],[1,3]]
```

This is because the index arrays are broadcasted to a common shape. Of course, for this particular example, you can also make do with basic slicing:

```
x[1::2, 1::2]
```

## Answer 2


I don’t think that `x[[1,3]][:,[1,3]]` is hardly readable. If you want to be more clear on your intent, you can do:

```
a[[1,3],:][:,[1,3]]
```

I am not an expert in slicing, but typically, if you try to slice into an array and the values are contiguous, you get back a view where the stride value is changed.

e.g. In your inputs 33 and 34, although you get a 2×2 array, the stride is 4. Thus, when you index the next row, the pointer moves to the correct position in memory.

Clearly, this mechanism doesn't carry well into the case of an array of indices. Hence, NumPy will have to make the copy. After all, many other matrix math functions rely on size, stride and contiguous memory allocation.

## Answer 3


If you want to skip every other row and every other column, then you can do it with basic slicing:

```
In : x=np.arange(16).reshape((4,4))
In : x[1:4:2,1:4:2]
Out:
array([[ 5,  7],
       [13, 15]])
```

This returns a view, not a copy of your array.

```
In : y=x[1:4:2,1:4:2]

In : y[0,0]=100

In : x   # <---- Notice x[1,1] has changed
Out:
array([[  0,   1,   2,   3],
       [  4, 100,   6,   7],
       [  8,   9,  10,  11],
       [ 12,  13,  14,  15]])
```

while `z=x[(1,3),:][:,(1,3)]` uses advanced indexing and thus returns a copy:

```
In : x=np.arange(16).reshape((4,4))
In : z=x[(1,3),:][:,(1,3)]

In : z
Out:
array([[ 5,  7],
       [13, 15]])

In : z[0,0]=0
```

Note that `x` is unchanged:

```
In : x
Out:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])
```

If you wish to select arbitrary rows and columns, then you can’t use basic slicing. You’ll have to use advanced indexing, using something like `x[rows,:][:,columns]`, where `rows` and `columns` are sequences. This of course is going to give you a copy, not a view, of your original array. This is as one should expect, since a numpy array uses contiguous memory (with constant strides), and there would be no way to generate a view with arbitrary rows and columns (since that would require non-constant strides).

## Answer 4


With numpy, you can pass a slice for each component of the index – so, your `x[0:2,0:2]` example above works.

If you just want to evenly skip columns or rows, you can pass slices with three components (i.e. start, stop, step).

```
>>> x[1:4:2, 1:4:2]
array([[ 5,  7],
       [13, 15]])
```

Which is basically: slice in the first dimension, with start at index 1, stop when the index is equal to or greater than 4, and add 2 to the index in each pass. The same for the second dimension. Again: this only works for constant steps.

The syntax you found does something quite different internally: what `x[[1,3]][:,[1,3]]` actually does is create a new array including only rows 1 and 3 from the original array (done with the `x[[1,3]]` part), and then re-slice that, creating a third array, including only columns 1 and 3 of the previous array.

## Answer 5


I asked a similar question here: Writing in a sub-ndarray of a ndarray in the most pythonic way (Python 2).

Following the solution from that post, for your case it looks like:

```
columns_to_keep = [1,3]
rows_to_keep = [1,3]
```

And using `ix_`:

```
x[np.ix_(rows_to_keep, columns_to_keep)]
```

Which is:

```
array([[ 5,  7],
       [13, 15]])
```

## Answer 6


I'm not sure how efficient this is, but you can use `range()` to slice along both axes:

```
x = np.arange(16).reshape((4,4))
x[range(1,3), :][:,range(1,3)]
```

# How to Catch NumPy Warnings Like Exceptions (Not Just for Testing)?

## Question: How to catch NumPy warnings like exceptions (not just for testing)?


I have to make a Lagrange polynomial in Python for a project I’m doing. I’m doing a barycentric style one to avoid using an explicit for-loop as opposed to a Newton’s divided difference style one. The problem I have is that I need to catch a division by zero, but Python (or maybe numpy) just makes it a warning instead of a normal exception.

So, what I need to know how to do is to catch this warning as if it were an exception. The related questions to this I found on this site were answered not in the way I needed. Here’s my code:

```
import numpy as np
import matplotlib.pyplot as plt
import warnings

class Lagrange:
    def __init__(self, xPts, yPts):
        self.xPts = np.array(xPts)
        self.yPts = np.array(yPts)
        self.degree = len(xPts)-1
        self.weights = np.array([np.product([x_j - x_i for x_j in xPts if x_j != x_i]) for x_i in xPts])

    def __call__(self, x):
        warnings.filterwarnings("error")
        try:
            bigNumerator = np.product(x - self.xPts)
            numerators = np.array([bigNumerator/(x - x_j) for x_j in self.xPts])
            return sum(numerators/self.weights*self.yPts)
        except Exception as e: # Catch division by 0. Only possible in 'numerators' array
            return self.yPts[np.where(self.xPts == x)]

L = Lagrange([-1,0,1],[1,0,1]) # Creates quadratic poly L(x) = x^2

L(1) # This should catch an error, then return 1.
```

When this code is executed, the output I get is:

```
Warning: divide by zero encountered in int_scalars
```

That’s the warning I want to catch. It should occur inside the list comprehension.

## Answer 0


It seems that your configuration is using the `print` option for `numpy.seterr`:

```
>>> import numpy as np
>>> np.array([1])/0   #'warn' mode
__main__:1: RuntimeWarning: divide by zero encountered in divide
array([0])
>>> np.seterr(all='print')
{'over': 'warn', 'divide': 'warn', 'invalid': 'warn', 'under': 'ignore'}
>>> np.array([1])/0   #'print' mode
Warning: divide by zero encountered in divide
array([0])
```

This means that the warning you see is not a real warning, but it’s just some characters printed to `stdout`(see the documentation for `seterr`). If you want to catch it you can:

1. Use `numpy.seterr(all='raise')` which will directly raise the exception. This however changes the behaviour of all the operations, so it’s a pretty big change in behaviour.
2. Use `numpy.seterr(all='warn')`, which will transform the printed warning in a real warning and you’ll be able to use the above solution to localize this change in behaviour.
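A minimal sketch of the first option, restoring the previous error state afterwards so the change stays local:

```python
import numpy as np

old_settings = np.seterr(all='raise')   # every FP error now raises FloatingPointError
try:
    np.array([1.0]) / 0.0
except FloatingPointError as e:
    print('caught:', e)
finally:
    np.seterr(**old_settings)           # restore the previous behaviour
```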

Once you actually have a warning, you can use the `warnings` module to control how the warnings should be treated:

```
>>> import warnings
>>>
>>> warnings.filterwarnings('error')
>>>
>>> try:
...     warnings.warn(Warning())
... except Warning:
...     print('Warning was raised as an exception!')
...
Warning was raised as an exception!
```

Read the documentation for `filterwarnings` carefully, since it allows you to filter only the warning you want and has other options. I'd also consider looking at `catch_warnings`, which is a context manager that automatically restores the original warning filters:

```
>>> import warnings
>>> with warnings.catch_warnings():
...     warnings.filterwarnings('error')
...     try:
...         warnings.warn(Warning())
...     except Warning: print('Raised!')
...
Raised!
>>> try:
...     warnings.warn(Warning())
... except Warning: print('Not raised!')
...
__main__:2: Warning:
```

## Answer 1


If you already know where the warning is likely to occur then it’s often cleaner to use the `numpy.errstate` context manager, rather than `numpy.seterr` which treats all subsequent warnings of the same type the same regardless of where they occur within your code:

```
import numpy as np

a = np.r_[1.]
with np.errstate(divide='raise'):
    try:
        a / 0   # this gets caught and handled as an exception
    except FloatingPointError:
        print('oh no!')
a / 0           # this prints a RuntimeWarning as usual
```

### Edit:

In my original example I had `a = np.r_[0.]`, but apparently there was a change in numpy's behaviour such that division-by-zero is handled differently in cases where the numerator is all-zeros. For example, in numpy 1.16.4:

```
all_zeros = np.array([0., 0.])
not_all_zeros = np.array([1., 0.])

with np.errstate(divide='raise'):
    not_all_zeros / 0.  # Raises FloatingPointError

with np.errstate(divide='raise'):
    all_zeros / 0.  # No exception raised

with np.errstate(invalid='raise'):
    all_zeros / 0.  # Raises FloatingPointError
```

The corresponding warning messages are also different: `1. / 0.` is logged as `RuntimeWarning: divide by zero encountered in true_divide`, whereas `0. / 0.` is logged as `RuntimeWarning: invalid value encountered in true_divide`. I'm not sure why exactly this change was made, but I suspect it has to do with the fact that the result of `0. / 0.` is not representable as a number (numpy returns a NaN in this case) whereas `1. / 0.` and `-1. / 0.` return +Inf and -Inf respectively, per the IEEE 754 standard.

If you want to catch both types of error you can always pass `np.errstate(divide='raise', invalid='raise')`, or `all='raise'` if you want to raise an exception on any kind of floating point error.

## Answer 2


To elaborate on @Bakuriu’s answer above, I’ve found that this enables me to catch a runtime warning in a similar fashion to how I would catch an error warning, printing out the warning nicely:

```
import warnings
import numpy as np

with warnings.catch_warnings():
    warnings.filterwarnings('error')
    try:
        # the operation that may emit a RuntimeWarning, e.g.:
        answer = np.float64(1.0) / 0.0
    except Warning as e:
        print('error found:', e)
```

You can play around with the placement of `warnings.catch_warnings()` depending on how big an umbrella you want to cast with catching errors this way.
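If you only want NumPy's runtime warnings escalated rather than every warning in the block, a narrower filter is a reasonable sketch:

```python
import warnings
import numpy as np

with warnings.catch_warnings():
    # Escalate only RuntimeWarning, the category NumPy uses for
    # divide-by-zero, overflow and invalid-value warnings.
    warnings.simplefilter('error', category=RuntimeWarning)
    try:
        np.array([1.0]) / 0.0
    except RuntimeWarning as e:
        print('caught:', e)
```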

## Answer 3

To make every NumPy floating-point error raise an exception directly:

```
numpy.seterr(all='raise')
```

# Plotting a 2D Heat Map with Matplotlib

## Question: Plotting a 2D heat map with Matplotlib

Using Matplotlib, I want to plot a 2D heat map. My data is an n-by-n Numpy array, each with a value between 0 and 1. So for the (i, j) element of this array, I want to plot a square at the (i, j) coordinate in my heat map, whose color is proportional to the element’s value in the array.

How can I do this?

## Answer 0

The `imshow()` function with parameters `interpolation='nearest'` and `cmap='hot'` should do what you want.

```
import matplotlib.pyplot as plt
import numpy as np

a = np.random.random((16, 16))
plt.imshow(a, cmap='hot', interpolation='nearest')
plt.show()
```

## Answer 1

Seaborn takes care of a lot of the manual work and automatically plots a gradient at the side of the chart etc.

```
import numpy as np
import seaborn as sns
import matplotlib.pylab as plt

uniform_data = np.random.rand(10, 12)
ax = sns.heatmap(uniform_data, linewidth=0.5)
plt.show()
```

Or, you can even plot upper / lower left / right triangles of square matrices, for example a correlation matrix which is square and symmetric, so plotting all values would be redundant anyway.

```
corr = np.corrcoef(np.random.randn(10, 200))
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True  # mask the redundant upper triangle
with sns.axes_style("white"):
    ax = sns.heatmap(corr, mask=mask, vmax=.3, square=True, cmap="YlGnBu")
    plt.show()
```

## Answer 2

For a 2D `numpy` array, simply using `imshow()` may help you:

```
import matplotlib.pyplot as plt
import numpy as np

def heatmap2d(arr: np.ndarray):
    plt.imshow(arr, cmap='viridis')
    plt.colorbar()
    plt.show()

test_array = np.arange(100 * 100).reshape(100, 100)
heatmap2d(test_array)
```

This code produces a continuous heatmap.

You can choose another built-in `colormap` from here.

## Answer 3

I would use matplotlib's pcolor/pcolormesh function since it allows nonuniform spacing of the data.

Example taken from matplotlib:

```
import matplotlib.pyplot as plt
import numpy as np

# generate 2 2d grids for the x & y bounds
y, x = np.meshgrid(np.linspace(-3, 3, 100), np.linspace(-3, 3, 100))

z = (1 - x / 2. + x ** 5 + y ** 3) * np.exp(-x ** 2 - y ** 2)
# x and y are bounds, so z should be the value *inside* those bounds.
# Therefore, remove the last value from the z array.
z = z[:-1, :-1]
z_min, z_max = -np.abs(z).max(), np.abs(z).max()

fig, ax = plt.subplots()

c = ax.pcolormesh(x, y, z, cmap='RdBu', vmin=z_min, vmax=z_max)
ax.set_title('pcolormesh')
# set the limits of the plot to the limits of the data
ax.axis([x.min(), x.max(), y.min(), y.max()])
fig.colorbar(c, ax=ax)

plt.show()
```

## Answer 4


Here’s how to do it from a csv:

```
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import griddata

# Load the whitespace-separated x, y, z columns (see the dat.xyz format below)
dat = np.genfromtxt('dat.xyz')
X_dat = dat[:,0]
Y_dat = dat[:,1]
Z_dat = dat[:,2]

# Convert from pandas dataframes to numpy arrays
X, Y, Z, = np.array([]), np.array([]), np.array([])
for i in range(len(X_dat)):
    X = np.append(X, X_dat[i])
    Y = np.append(Y, Y_dat[i])
    Z = np.append(Z, Z_dat[i])

# create x-y points to be used in heatmap
xi = np.linspace(X.min(), X.max(), 1000)
yi = np.linspace(Y.min(), Y.max(), 1000)

# Z is a matrix of x-y values
zi = griddata((X, Y), Z, (xi[None,:], yi[:,None]), method='cubic')

# I control the range of my colorbar by removing data
# outside of my range of interest
zmin = 3
zmax = 12
zi[(zi<zmin) | (zi>zmax)] = None

# Create the contour plot
CS = plt.contourf(xi, yi, zi, 15, cmap=plt.cm.rainbow,
                  vmax=zmax, vmin=zmin)
plt.colorbar()
plt.show()
```

where `dat.xyz` is in the form

```
x1 y1 z1
x2 y2 z2
...
```

# Should I Use scipy.pi, numpy.pi, or math.pi?

## Question: Should I use scipy.pi, numpy.pi, or math.pi?

In a project using SciPy and NumPy, should I use `scipy.pi`, `numpy.pi`, or `math.pi`?

## Answer 0


```
>>> import math
>>> import numpy as np
>>> import scipy
>>> math.pi == np.pi == scipy.pi
True
```

So it doesn’t matter, they are all the same value.

The only reason all three modules provide a `pi` value is so if you are using just one of the three modules, you can conveniently have access to pi without having to import another module. They’re not providing different values for pi.

## Answer 1


One thing to note is that not all libraries will use the same meaning for pi, of course, so it never hurts to know what you’re using. For example, the symbolic math library Sympy’s representation of pi is not the same as math and numpy:

```
import math
import numpy
import scipy
import sympy

print(math.pi == numpy.pi)
> True
print(math.pi == scipy.pi)
> True
print(math.pi == sympy.pi)
> False
```

# How to Normalize a NumPy Array to Within a Certain Range?

## Question: How to normalize a NumPy array to within a certain range?


After doing some processing on an audio or image array, it needs to be normalized within a range before it can be written back to a file. This can be done like so:

```
# Normalize audio channels to between -1.0 and +1.0
audio[:,0] = audio[:,0]/abs(audio[:,0]).max()
audio[:,1] = audio[:,1]/abs(audio[:,1]).max()

# Normalize image to between 0 and 255
image = image/(image.max()/255.0)
```

Is there a less verbose, convenience function way to do this? `matplotlib.colors.Normalize()` doesn’t seem to be related.

## Answer 0

```
audio /= np.max(np.abs(audio), axis=0)
image *= (255.0/image.max())
```

Using `/=` and `*=` allows you to eliminate an intermediate temporary array, thus saving some memory. Multiplication is less expensive than division, so

```
image *= 255.0/image.max()    # Uses 1 division and image.size multiplications
```

is marginally faster than

```
image /= image.max()/255.0    # Uses 1+image.size divisions
```

Since we are using basic numpy methods here, I think this is about as efficient a solution in numpy as can be.

In-place operations do not change the dtype of the container array. Since the desired normalized values are floats, the `audio` and `image` arrays need to have a floating-point dtype before the in-place operations are performed. If they are not already of floating-point dtype, you'll need to convert them using `astype`. For example,

```
image = image.astype('float64')
```
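A sketch of what goes wrong without the cast, assuming an integer-typed `image` array:

```python
import numpy as np

image = np.array([[0, 64], [128, 255]])   # integer dtype
try:
    image /= image.max() / 255.0          # in-place true division on ints
except TypeError as e:
    print('need a float dtype first:', e)

image = image.astype('float64')           # cast once, then scale in place
image *= 255.0 / image.max()
```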

## Answer 1


If the array contains both positive and negative data, I’d go with:

```
import numpy as np

a = np.random.rand(3,2)

# Normalised [0,1]
b = (a - np.min(a))/np.ptp(a)

# Normalised [0,255] as integer: don't forget the parenthesis before astype(int)
c = (255*(a - np.min(a))/np.ptp(a)).astype(int)

# Normalised [-1,1]
d = 2.*(a - np.min(a))/np.ptp(a)-1
```

If the array contains `nan`, one solution could be to just remove them as:

```
def nan_ptp(a):
    return np.ptp(a[np.isfinite(a)])

b = (a - np.nanmin(a))/nan_ptp(a)
```

However, depending on the context you might want to treat `nan` differently: e.g. interpolate the value, replace it with e.g. 0, or raise an error.
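For instance, a sketch of the replace-with-zero option before min-max scaling (the `nan=` keyword of `np.nan_to_num` needs NumPy 1.17+):

```python
import numpy as np

a = np.array([0.5, np.nan, 2.0, 1.0])

cleaned = np.nan_to_num(a, nan=0.0)              # NaN -> 0.0
b = (cleaned - cleaned.min()) / np.ptp(cleaned)  # values now in [0, 1]
print(b)
```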

Finally, worth mentioning even if it’s not OP’s question, standardization:

```
e = (a - np.mean(a)) / np.std(a)
```

## Answer 2


You can also rescale using `sklearn`. The advantages are that you can normalize the standard deviation in addition to mean-centering the data, and that you can do this along either axis: by features or by records.

```
from sklearn.preprocessing import scale
X = scale(X, axis=0, with_mean=True, with_std=True, copy=True)
```

The keyword arguments `axis`, `with_mean`, and `with_std` are self-explanatory, and are shown in their default state. The argument `copy` performs the operation in-place if it is set to `False`. Documentation here.
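A small sketch of the `axis` choice, with made-up data:

```python
import numpy as np
from sklearn.preprocessing import scale

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

by_feature = scale(X, axis=0)  # default: standardize each column (feature)
by_record = scale(X, axis=1)   # standardize each row (record) independently
```

After `axis=0`, every column has zero mean and unit variance; after `axis=1`, every row does.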

## 回答 3


You can use the "i" (as in `idiv`, `imul`, ...) version, and it doesn't look half bad:

```
image /= (image.max()/255.0)
```

For the other case you can write a function to normalize an n-dimensional array by columns:

```
def normalize_columns(arr):
    rows, cols = arr.shape
    for col in xrange(cols):
        arr[:,col] /= abs(arr[:,col]).max()
```
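The column loop above can also be written without an explicit loop via broadcasting; a sketch with made-up data:

```python
import numpy as np

def normalize_columns(arr):
    # Broadcasting divides each column by its own maximum absolute value
    arr /= np.abs(arr).max(axis=0)
    return arr

a = np.array([[1.0, -4.0],
              [2.0,  2.0]])
result = normalize_columns(a)
# result: [[ 0.5, -1. ], [ 1. ,  0.5]]
```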

## 回答 4


You are trying to min-max scale the values of `audio` between -1 and +1 and `image` between 0 and 255.

Using `sklearn.preprocessing.minmax_scale` should easily solve your problem.

e.g.:

```
audio_scaled = minmax_scale(audio, feature_range=(-1,1))
```

and

```
shape = image.shape
image_scaled = minmax_scale(image.ravel(), feature_range=(0,255)).reshape(shape)
```

Note: not to be confused with the operation that scales the *norm* (length) of a vector to a certain value (usually 1), which is also commonly referred to as normalization.
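For contrast, a minimal sketch of that other kind of normalization, which rescales a vector to unit length rather than rescaling its value range:

```python
import numpy as np

v = np.array([3.0, 4.0])
unit = v / np.linalg.norm(v)   # scales the vector's *length* to 1
# unit is [0.6, 0.8]; its L2 norm is 1.0
```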

## 回答 5


A simple solution is using the scalers offered by the sklearn.preprocessing library.

```
scaler = sk.MinMaxScaler(feature_range=(0, 250))
scaler = scaler.fit(X)
X_scaled = scaler.transform(X)
# Checking reconstruction
X_rec = scaler.inverse_transform(X_scaled)
```

The error `X_rec - X` will be zero. You can adjust the `feature_range` for your needs, or even use a standard scaler, `sk.StandardScaler()`.
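A runnable sketch of the round trip, assuming `sk` is an alias for `sklearn.preprocessing` (the answer doesn't show the import) and using made-up data:

```python
import numpy as np
from sklearn import preprocessing as sk   # assumed meaning of the answer's `sk`

X = np.array([[1.0], [2.0], [4.0]])

scaler = sk.MinMaxScaler(feature_range=(0, 250)).fit(X)
X_scaled = scaler.transform(X)
X_rec = scaler.inverse_transform(X_scaled)   # recovers X up to float rounding
```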

## 回答 6


I tried following this, and got the error

```
TypeError: ufunc 'true_divide' output (typecode 'd') could not be coerced to provided output parameter (typecode 'l') according to the casting rule ''same_kind''
```

The `numpy` array I was trying to normalize was an `integer` array. It seems they deprecated this kind of implicit type casting in versions > `1.10`, and you have to use `numpy.true_divide()` to resolve it.

```
arr = np.array(img)
arr = np.true_divide(arr, [255.0], out=None)
```

`img` was a `PIL.Image` object.
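A self-contained sketch of the same failure and fix, with made-up integer data standing in for `np.array(img)`:

```python
import numpy as np

arr = np.array([[0, 128], [64, 255]])   # integer dtype, like an 8-bit image

# arr /= 255.0   # would raise the TypeError above: int can't hold the float result

arr = np.true_divide(arr, 255.0)        # returns a new float64 array instead
# arr now spans [0.0, 1.0]
```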

# 遍历一个numpy数组

## 问题：遍历一个numpy数组


Is there a less verbose alternative to this:

```
for x in xrange(array.shape[0]):
    for y in xrange(array.shape[1]):
        do_stuff(x, y)
```

I came up with this:

```
for x, y in itertools.product(*map(xrange, array.shape)):
    do_stuff(x, y)
```

Which saves one indentation, but is still pretty ugly.

I’m hoping for something that looks like this pseudocode:

```
for x, y in array.indices:
    do_stuff(x, y)
```

Does anything like that exist?

## 回答 0


I think you’re looking for the ndenumerate.

```
>>> a = numpy.array([[1,2],[3,4],[5,6]])
>>> for (x,y), value in numpy.ndenumerate(a):
...     print x, y
...
0 0
0 1
1 0
1 1
2 0
2 1
```

Regarding performance: it is a bit slower than a list comprehension.

```
X = np.zeros((100, 100, 100))

%timeit list([((i,j,k), X[i,j,k]) for i in range(X.shape[0]) for j in range(X.shape[1]) for k in range(X.shape[2])])
1 loop, best of 3: 376 ms per loop

%timeit list(np.ndenumerate(X))
1 loop, best of 3: 570 ms per loop
```

If you are worried about the performance you could optimise a bit further by looking at the implementation of `ndenumerate`, which does two things: converting to an array and looping. If you know you have an array, you can use the `coords` attribute of the flat iterator.

```
a = X.flat
%timeit list([(a.coords, x) for x in a])
1 loop, best of 3: 305 ms per loop
```

## 回答 1


If you only need the indices, you could try `numpy.ndindex`:

```
>>> a = numpy.arange(9).reshape(3, 3)
>>> [(x, y) for x, y in numpy.ndindex(a.shape)]
[(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2), (2, 0), (2, 1), (2, 2)]
```

## 回答 2


See `nditer`:

```
import numpy as np
Y = np.array([3,4,5,6])
for y in np.nditer(Y, op_flags=['readwrite']):
    y += 3

Y == np.array([6, 7, 8, 9])
```

`y = 3` would not work, use `y *= 0` and `y += 3` instead.
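For reference, a sketch of the same in-place iteration in current NumPy, which recommends closing the iterator via a context manager and writing back through the `[...]` view:

```python
import numpy as np

Y = np.array([3, 4, 5, 6])

with np.nditer(Y, op_flags=['readwrite']) as it:
    for y in it:
        y[...] += 3   # write back through the 0-d view; plain `y = 3` would only rebind

# Y is now [6, 7, 8, 9]
```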

# Cython：“严重错误：numpy / arrayobject.h：没有此类文件或目录”

## 问题：Cython：“严重错误：numpy / arrayobject.h：没有此类文件或目录”


I’m trying to speed up the answer here using Cython. I try to compile the code (after doing the `cygwinccompiler.py` hack explained here), but get a `fatal error: numpy/arrayobject.h: No such file or directory...compilation terminated` error. Can anyone tell me if it’s a problem with my code, or some esoteric subtlety with Cython?

Below is my code.

```
import numpy as np
import scipy as sp
cimport numpy as np
cimport cython

cdef inline np.ndarray[np.int, ndim=1] fbincount(np.ndarray[np.int_t, ndim=1] x):
    cdef int m = np.amax(x)+1
    cdef int n = x.size
    cdef unsigned int i
    cdef np.ndarray[np.int_t, ndim=1] c = np.zeros(m, dtype=np.int)

    for i in xrange(n):
        c[<unsigned int>x[i]] += 1

    return c

cdef packed struct Point:
    np.float64_t f0, f1

@cython.boundscheck(False)
def sparsemaker(np.ndarray[np.float_t, ndim=2] X not None,
                np.ndarray[np.float_t, ndim=2] Y not None,
                np.ndarray[np.float_t, ndim=2] Z not None):

    cdef np.ndarray[np.float64_t, ndim=1] counts, factor
    cdef np.ndarray[np.int_t, ndim=1] row, col, repeats
    cdef np.ndarray[Point] indices

    cdef int x_, y_

    _, row = np.unique(X, return_inverse=True); x_ = _.size
    _, col = np.unique(Y, return_inverse=True); y_ = _.size
    indices = np.rec.fromarrays([row,col])
    _, repeats = np.unique(indices, return_inverse=True)
    counts = 1. / fbincount(repeats)
    Z.flat *= counts.take(repeats)

    return sp.sparse.csr_matrix((Z.flat,(row,col)), shape=(x_, y_)).toarray()
```

## 回答 0


In your `setup.py`, the `Extension` should have the argument `include_dirs=[numpy.get_include()]`.

Also, you are missing `np.import_array()` in your code.

Example setup.py:

```
from distutils.core import setup, Extension
from Cython.Build import cythonize
import numpy

setup(
    ext_modules=[
        Extension("my_module", ["my_module.c"],
                  include_dirs=[numpy.get_include()]),
    ],
)

# Or, if you use cythonize() to make the ext_modules list,
# include_dirs can be passed to setup()

setup(
    ext_modules=cythonize("my_module.pyx"),
    include_dirs=[numpy.get_include()]
)
```

## 回答 1


For a one-file project like yours, another alternative is to use `pyximport`. You don’t need to create a `setup.py` … you don’t need to even open a command line if you use IPython … it’s all very convenient. In your case, try running these commands in IPython or in a normal Python script:

```
import numpy
import pyximport
pyximport.install(setup_args={"script_args": ["--compiler=mingw32"],
                              "include_dirs": numpy.get_include()},
                  reload_support=True)

import my_pyx_module

print my_pyx_module.some_function(...)
...
```

You may need to edit the compiler of course. This makes import and reload work the same for `.pyx` files as they work for `.py` files.

## 回答 2

The error means that a numpy header file isn’t being found during compilation.

Try doing `export CFLAGS=-I/usr/lib/python2.7/site-packages/numpy/core/include/`, and then compiling. This is a problem with a few different packages. There’s a bug filed in ArchLinux for the same issue: https://bugs.archlinux.org/task/22326

## 回答 3

### Simple answer


A far simpler way is to add the path to your `distutils.cfg` file. On Windows 7, its path is by default `C:\Python27\Lib\distutils\`. You just add the following contents and it should work:

```
[build_ext]
include_dirs= C:\Python27\Lib\site-packages\numpy\core\include
```

### Entire config file

To give you an example of how the config file could look, my entire file reads:

```
[build]
compiler = mingw32

[build_ext]
include_dirs= C:\Python27\Lib\site-packages\numpy\core\include
compiler = mingw32
```

## 回答 4

You should be able to do this within the `cythonize()` function as mentioned here, but it doesn't work because of a known issue.

## 回答 5


If you are too lazy to write setup files and figure out the path for include directories, try cyper. It can compile your Cython code and set `include_dirs` for Numpy automatically.

Load your code into a string, then simply run `cymodule = cyper.inline(code_string)`, then your function is available as `cymodule.sparsemaker` instantaneously. Something like this

```
code = open(your_pyx_file).read()
cymodule = cyper.inline(code)

cymodule.sparsemaker(...)
# do what you want with your function
```

You can install cyper via `pip install cyper`.

# 如何在Pandas DataFrame中将True / False映射到1/0？

## 问题：如何在Pandas DataFrame中将True / False映射到1/0？

I have a column in python pandas DataFrame that has boolean True/False values, but for further calculations I need 1/0 representation. Is there a quick pandas/numpy way to do that?

## 回答 0


A succinct way to convert a single column of boolean values to a column of integers 1 or 0:

```
df["somecolumn"] = df["somecolumn"].astype(int)
```
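A quick sketch of the conversion with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"somecolumn": [True, False, True]})
df["somecolumn"] = df["somecolumn"].astype(int)
# the column now holds 1, 0, 1 as integers
```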

## 回答 1


Just multiply your DataFrame by 1 (int):

```
In [1]: data = pd.DataFrame([[True, False, True], [False, False, True]])

In [2]: print data
       0      1     2
0   True  False  True
1  False  False  True

In [3]: print data*1
   0  1  2
0  1  0  1
1  0  0  1
```

## 回答 2


`True` is `1` in Python, and likewise `False` is `0`*:

```
>>> True == 1
True
>>> False == 0
True
```

You should be able to perform any operations you want on them by just treating them as though they were numbers, as they are numbers:

```
>>> issubclass(bool, int)
True
>>> True * 5
5
```

So to answer your question, no work necessary – you already have what you are looking for.

* Note that I use *is* as an English word here, not the Python keyword `is`: `True` will not be the same object as any random `1`.
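Because `bool` is a subclass of `int`, boolean values already behave as 1 and 0 in arithmetic; a minimal sketch:

```python
flags = [True, True, False, True]

count = sum(flags)      # True counts as 1, False as 0, so this counts the Trues
double = True + True    # booleans add like integers
```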

## 回答 3


You can also do this directly on DataFrames:

```
In [1]: df = DataFrame(dict(A = True, B = False), index=range(3))

In [2]: df
Out[2]:
      A      B
0  True  False
1  True  False
2  True  False

In [3]: df.dtypes
Out[3]:
A    bool
B    bool
dtype: object

In [4]: df.astype(int)
Out[4]:
   A  B
0  1  0
1  1  0
2  1  0

In [5]: df.astype(int).dtypes
Out[5]:
A    int64
B    int64
dtype: object
```

## 回答 4


You can use a transformation for your data frame:

```
df = pd.DataFrame(my_data condition)
```

Then, transforming True/False into 1/0:

```
df = df*1
```

## 回答 5


Use `Series.view` to convert booleans to integers:

```
df["somecolumn"] = df["somecolumn"].view('i1')
```
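Note that `Series.view` has since been deprecated in recent pandas releases; `astype('i1')` produces the same `int8` result and is the safer spelling today. A sketch with made-up data:

```python
import pandas as pd

s = pd.Series([True, False, True])
out = s.astype('i1')   # int8, the same result the deprecated .view('i1') gave
# out holds 1, 0, 1 with dtype int8
```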