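Tag archive: numpy

Numerical-linear-algebra: a free online textbook of Jupyter notebooks for the fast.ai Computational Linear Algebra course

Computational Linear Algebra for Coders

This course is focused on the question: how do we do matrix computations with acceptable speed and acceptable accuracy?

This course was taught in the University of San Francisco's Masters of Science in Analytics program in summer 2017 (for graduate students studying to become data scientists). The course is taught in Python with Jupyter Notebooks, using libraries such as Scikit-Learn and Numpy for most lessons, as well as Numba (a library that compiles Python to machine code for faster performance) and PyTorch (an alternative to Numpy for the GPU) in a few lessons.

Accompanying the notebooks is a playlist of lecture videos, available on YouTube. If you are ever confused by a lecture or it goes too quickly, check out the beginning of the next video, where I review concepts from the previous lecture, often explaining things from a new perspective or with different illustrations, and answer questions.

Getting Help

You can ask questions or share your thoughts and resources using the Computational Linear Algebra category on our fast.ai discussion forums.

Table of Contents

The following listing links to the notebooks in this repository, rendered through the nbviewer service. Topics covered: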

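0. Course Logistics (Video 1)

1. Why are we here? (Video 1)

We start with a high-level overview of some foundational concepts in numerical linear algebra.

2. Topic Modeling with NMF and SVD (Video 2 and Video 3)

We will use the newsgroups dataset to try to identify the topics of different posts. We use a term-document matrix that represents the frequency of the vocabulary in the documents. We factor it using NMF, and then with SVD.

3. Background Removal with Robust PCA (Video 3, Video 4, and Video 5)

Another application of SVD is to identify the people and remove the background of a surveillance video. We will cover robust PCA, which uses randomized SVD. Randomized SVD, in turn, uses the LU factorization.

4. Compressed Sensing with Robust Regression (Video 6 and Video 7)

Compressed sensing is critical to allowing CT scans with lower radiation: the image can be reconstructed with less data. Here we will learn the technique and apply it to CT images.

5. Predicting Health Outcomes with Linear Regressions (Video 8)

6. How to Implement Linear Regression (Video 8)

7. PageRank with Eigen Decompositions (Video 9 and Video 10)

We have applied SVD to topic modeling, background removal, and linear regression. SVD is intimately connected to the eigen decomposition, so we will now learn how to calculate eigenvalues for a large matrix. We will use DBpedia data, a large dataset of Wikipedia links, because here the principal eigenvector gives the relative importance of different Wikipedia pages (this is the basic idea of Google's PageRank algorithm). We will look at 3 different methods for calculating eigenvectors, of increasing complexity (and increasing usefulness!).

8. Implementing QR Factorization (Video 10)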
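
The PageRank topic in the list above rests on the idea that the principal eigenvector of a link matrix ranks pages by importance. A minimal sketch of that idea using plain power iteration follows. The three-page link matrix and the function name power_iteration are made up for illustration, and the damping factor used by real PageRank is left out on purpose.

```python
import numpy as np

def power_iteration(link_matrix, num_iters=100):
    """Approximate the principal eigenvector by repeated multiplication."""
    n = link_matrix.shape[0]
    v = np.ones(n) / n                    # start from a uniform ranking
    for _ in range(num_iters):
        v = link_matrix @ v               # push importance along the links
        v = v / np.linalg.norm(v, 1)      # keep the entries summing to 1
    return v

# Hypothetical web of 3 pages. Column j holds the out-links of page j:
# page 0 links to pages 1 and 2, page 1 links to page 2, page 2 links to page 0.
links = np.array([[0.0, 0.0, 1.0],
                  [0.5, 0.0, 0.0],
                  [0.5, 1.0, 0.0]])

ranking = power_iteration(links)
print(ranking)
```

Pages 0 and 2 end up with the same score and page 1 with half of it, which matches solving links @ v = v by hand for this small example.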


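Why is this course taught in such a weird order?

This course is structured with a top-down teaching method, which is different from how most math courses operate. Typically, in a bottom-up approach, you first learn all the separate components you will be using, and then you gradually build them up into more complex structures. The problem with this is that students often lose motivation, don't have a sense of the "big picture", and don't know what they'll need.

Harvard professor David Perkins has a book, Making Learning Whole, in which he uses baseball as an analogy. We don't require kids to memorize all the rules of baseball and understand all the technical details before we let them play the game. Rather, they start playing with just a general sense of it, and then gradually learn more rules and details as time goes on.

If you took the fast.ai deep learning course, that is what we used. You can hear more about my teaching philosophy in this blog post or this talk I gave at the San Francisco Machine Learning meetup.

All that to say, don't worry if you don't understand everything at first! You're not supposed to. We will start using some "black boxes" or matrix decompositions that haven't yet been explained, and then we'll dig into the lower-level details later.

To start, focus on what things DO, not what they ARE.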

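Numba: a NumPy-aware dynamic Python compiler using LLVM

Numba

A Just-In-Time Compiler for Numerical Functions in Python

Numba is an open source, NumPy-aware optimizing compiler for Python sponsored by Anaconda, Inc. It uses the LLVM compiler project to generate machine code from Python syntax.

Numba can compile a large subset of numerically-focused Python, including many NumPy functions. Additionally, Numba has support for automatic parallelization of loops, generation of GPU-accelerated code, and creation of ufuncs and C callbacks.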
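
As a rough illustration of what "compiling numerically-focused Python" looks like in practice, here is a small sketch that applies Numba's njit decorator to an explicit loop over a NumPy array. The function sum_of_squares is invented for this example, and the fallback decorator is only there so the snippet still runs where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Assumption for this sketch: without Numba, run the plain Python version.
    def njit(func):
        return func

@njit
def sum_of_squares(values):
    """Explicit loop that Numba compiles to machine code on first call."""
    total = 0.0
    for i in range(values.shape[0]):
        total += values[i] * values[i]
    return total

data = np.arange(4.0)          # array([0., 1., 2., 3.])
result = sum_of_squares(data)  # 0 + 1 + 4 + 9
print(result)
```

The first call includes compilation time; later calls with the same argument types reuse the compiled code.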

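For more information about Numba, see the Numba homepage: https://numba.pydata.org

Supported Platforms

  • Operating systems and CPUs:
    • Linux: x86 (32-bit), x86_64, ppc64le (POWER8 and 9), ARMv7 (32-bit), ARMv8 (64-bit)
    • Windows: x86, x86_64
    • macOS: x86_64 (M1/Arm64, unofficial support only)
    • *BSD: (unofficial support only)
  • (Optional) Accelerators and GPUs:
    • NVIDIA GPUs (Kepler architecture or later) via CUDA driver on Linux, Windows, macOS (< 10.14)

Dependencies

  • Python versions: 3.7-3.9
  • llvmlite 0.37.*
  • NumPy >=1.17 (can build with 1.11 for ABI compatibility)

Optionally:

  • SciPy >=1.0.0 (for numpy.linalg support)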

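Installing

The easiest way to install Numba and get updates is by using the Anaconda Distribution: https://www.anaconda.com/download

$ conda install numba

For more options, see the Installation Guide: https://numba.readthedocs.io/en/stable/user/installing.html

Documentation

https://numba.readthedocs.io/en/stable/index.html

Mailing Lists

Numba has a Discourse forum for discussions.

Some old mailing list archives are also available.

Continuous Integration

Azure Pipelines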

How to convert a PIL Image into a numpy array?

Question: How to convert a PIL Image into a numpy array?

Alright, I’m toying around with converting a PIL image object back and forth to a numpy array so I can do some faster pixel by pixel transformations than PIL’s PixelAccess object would allow. I’ve figured out how to place the pixel information in a useful 3D numpy array by way of:

pic = Image.open("foo.jpg")
pix = numpy.array(pic.getdata()).reshape(pic.size[0], pic.size[1], 3)

But I can’t seem to figure out how to load it back into the PIL object after I’ve done all my awesome transforms. I’m aware of the putdata() method, but can’t quite seem to get it to behave.


Answer 0

You’re not saying how exactly putdata() is not behaving. I’m assuming you’re doing

>>> pic.putdata(a)
Traceback (most recent call last):
  File "...blablabla.../PIL/Image.py", line 1185, in putdata
    self.im.putdata(data, scale, offset)
SystemError: new style getargs format but argument is not a tuple

This is because putdata expects a sequence of tuples and you’re giving it a numpy array. This

>>> data = list(tuple(pixel) for pixel in pix)
>>> pic.putdata(data)

will work but it is very slow.

As of PIL 1.1.6, the “proper” way to convert between images and numpy arrays is simply

>>> pix = numpy.array(pic)

although the resulting array is in a different format than yours (3-d array or rows/columns/rgb in this case).

Then, after you make your changes to the array, you should be able to do either pic.putdata(pix) or create a new image with Image.fromarray(pix).
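
The round trip described above can be sketched as follows. To keep the snippet self-contained it builds a tiny image in memory with Image.new instead of opening foo.jpg, and the inversion is just a placeholder for whatever per-pixel transform you actually need.

```python
import numpy as np
from PIL import Image

# Stand-in for Image.open("foo.jpg"): a 4x2 RGB image filled with one colour.
pic = Image.new("RGB", (4, 2), color=(10, 20, 30))   # size is (width, height)

pix = np.array(pic)        # shape (height, width, 3), dtype uint8
inverted = 255 - pix       # vectorised per-pixel transform, still uint8

pic_back = Image.fromarray(inverted)   # new PIL image built from the array
print(pix.shape, pic_back.size, pic_back.getpixel((0, 0)))
```

Note that the array shape is (rows, columns, channels), while PIL reports size as (width, height).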


Answer 1

Open I as an array:

>>> I = numpy.asarray(PIL.Image.open('test.jpg'))

Do some stuff to I, then, convert it back to an image:

>>> im = PIL.Image.fromarray(numpy.uint8(I))

Filter numpy images with FFT, Python

If you want to do it explicitly for some reason, there are pil2array() and array2pil() functions using getdata() on this page in correlation.zip.


Answer 2

I am using Pillow 4.1.1 (the successor of PIL) in Python 3.5. The conversion between Pillow and numpy is straightforward.

from PIL import Image
import numpy as np
im = Image.open('1.jpg')
im2arr = np.array(im) # im2arr.shape: height x width x channel
arr2im = Image.fromarray(im2arr)

One thing that needs noticing is that Pillow-style im is column-major while numpy-style im2arr is row-major. However, the function Image.fromarray already takes this into consideration. That is, arr2im.size == im.size and arr2im.mode == im.mode in the above example.

We should take care of the HxWxC data format when processing the transformed numpy arrays, e.g. do the transform im2arr = np.rollaxis(im2arr, 2, 0) or im2arr = np.transpose(im2arr, (2, 0, 1)) into CxHxW format.
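
A small sketch of the HxWxC to CxHxW conversion mentioned above, using a synthetic array in place of a real image so that it needs only NumPy. The variable names are illustrative.

```python
import numpy as np

# Synthetic "image": height 2, width 3, 3 channels (HxWxC layout).
hwc = np.arange(2 * 3 * 3, dtype=np.uint8).reshape(2, 3, 3)

chw = np.transpose(hwc, (2, 0, 1))     # channels first: CxHxW
chw_alt = np.moveaxis(hwc, -1, 0)      # same result, stated as "move last axis to front"
back = np.transpose(chw, (1, 2, 0))    # and back to HxWxC

print(hwc.shape, chw.shape, back.shape)
```

Both calls return views with reordered axes rather than copies, so the pixel at hwc[row, col, channel] is the same element as chw[channel, row, col].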


Answer 3

You need to convert your image to a numpy array this way:

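import numpy
from PIL import Image

# open the image and convert it to 8-bit grayscale ("L" mode)
img = Image.open("foo.jpg").convert("L")
imgarr = numpy.array(img)  # 2-D array of shape (height, width)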

Answer 4

The example I used today:

import numpy
from PIL import Image

def resize_image(numpy_array_image, new_height):
    # convert the numpy array image to a PIL.Image
    image = Image.fromarray(numpy.uint8(numpy_array_image))
    old_width, old_height = image.size
    # scale the width by the same ratio to preserve the aspect ratio
    ratio = new_height / float(old_height)
    new_width = int(old_width * ratio)
    # LANCZOS is the current name of the filter formerly exposed as ANTIALIAS
    image = image.resize((new_width, new_height), Image.LANCZOS)
    # convert the PIL.Image back into a numpy array
    return numpy.array(image)

Answer 5

If your image is stored in a Blob format (i.e. in a database) you can use the same technique explained by Billal Begueradj to convert your image from Blobs to a byte array.

In my case, my images were stored in a blob column in a db table:

def select_all_X_values(conn):
    cur = conn.cursor()
    cur.execute("SELECT ImageData from PiecesTable")    
    rows = cur.fetchall()    
    return rows

I then created a helper function to change my dataset into np.array:

from io import BytesIO

import numpy as np
from PIL import Image

def convertToByteIO(imagesArray):
    """
    Decodes a sequence of encoded image blobs into a list of RGB numpy arrays
    """
    imagesList = []

    for blob in imagesArray:
        img = Image.open(BytesIO(blob)).convert("RGB")
        imagesList.append(np.array(img))

    return imagesList

# each row returned by fetchall() is a tuple; the blob is its first column
X_dataset = select_all_X_values(conn)
imagesList = convertToByteIO([row[0] for row in X_dataset])

After this, I was able to use the byteArrays in my Neural Network.

plt.imshow(imagesList[0])
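
The decoding step at the heart of this answer can be tried without a database. The sketch below fakes a blob by encoding a small in-memory image as PNG bytes, then decodes it the same way. The helper name blob_to_array is an assumption for illustration and not part of the answer.

```python
from io import BytesIO

import numpy as np
from PIL import Image

def blob_to_array(blob):
    """Decode encoded image bytes (for example a BLOB column value) into an RGB array."""
    with Image.open(BytesIO(blob)) as img:
        return np.array(img.convert("RGB"))

# Simulate what a BLOB column would hold: the bytes of an encoded image file.
buffer = BytesIO()
Image.new("RGB", (3, 2), color=(255, 0, 0)).save(buffer, format="PNG")
blob = buffer.getvalue()

arr = blob_to_array(blob)
print(arr.shape, arr[0, 0])
```

PNG is used here because it is lossless, so the decoded pixel values match the original colour exactly.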

Answer 6

Convert Numpy to PIL image and PIL to Numpy

import numpy as np
from PIL import Image

def pilToNumpy(img):
    return np.array(img)

def NumpyToPil(img):
    return Image.fromarray(img)

Answer 7

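import matplotlib.pyplot as plt
import numpy as np

def imshow(img):
    img = img / 2 + 0.5     # unnormalize
    npimg = img.numpy()     # tensor -> numpy array
    plt.imshow(np.transpose(npimg, (1, 2, 0)))  # CxHxW -> HxWxC
    plt.show()

You can convert the image to a numpy array by calling its numpy() method after undoing the normalization. (Here img is presumably a PyTorch tensor in CxHxW layout, since it exposes a numpy() method.)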


Relationship between SciPy and NumPy

Question: Relationship between SciPy and NumPy

SciPy appears to provide most (but not all [1]) of NumPy’s functions in its own namespace. In other words, if there’s a function named numpy.foo, there’s almost certainly a scipy.foo. Most of the time, the two appear to be exactly the same, oftentimes even pointing to the same function object.

Sometimes, they’re different. To give an example that came up recently:

  • numpy.log10 is a ufunc that returns NaNs for negative arguments;
  • scipy.log10 returns complex values for negative arguments and doesn’t appear to be a ufunc.

The same can be said about log, log2 and logn, but not about log1p [2].

On the other hand, numpy.exp and scipy.exp appear to be different names for the same ufunc. This is also true of scipy.log1p and numpy.log1p.

Another example is numpy.linalg.solve vs scipy.linalg.solve. They’re similar, but the latter offers some additional features over the former.

Why the apparent duplication? If this is meant to be a wholesale import of numpy into the scipy namespace, why the subtle differences in behaviour and the missing functions? Is there some overarching logic that would help clear up the confusion?

[1] numpy.min, numpy.max, numpy.abs and a few others have no counterparts in the scipy namespace.

[2] Tested using NumPy 1.5.1 and SciPy 0.9.0rc2.


Answer 0

Last time I checked it, the scipy __init__ method executes a

from numpy import *

so that the whole numpy namespace is included into scipy when the scipy module is imported.

The log10 behavior you are describing is interesting, because both versions are coming from numpy. One is a ufunc, the other is a numpy.lib function. Why scipy is preferring the library function over the ufunc, I don’t know off the top of my head.


EDIT: In fact, I can answer the log10 question. Looking in the scipy __init__ method I see this:

# Import numpy symbols to scipy name space
import numpy as _num
from numpy import oldnumeric
from numpy import *
from numpy.random import rand, randn
from numpy.fft import fft, ifft
from numpy.lib.scimath import *

The log10 function you get in scipy comes from numpy.lib.scimath. Looking at that code, it says:

"""
Wrapper functions to more user-friendly calling of certain math functions
whose output data-type is different than the input data-type in certain
domains of the input.

For example, for functions like log() with branch cuts, the versions in this
module provide the mathematically valid answers in the complex plane:

>>> import math
>>> from numpy.lib import scimath
>>> scimath.log(-math.exp(1)) == (1+1j*math.pi)
True

Similarly, sqrt(), other base logarithms, power() and trig functions are
correctly handled.  See their respective docstrings for specific examples.
"""

It seems that module overlays the base numpy ufuncs for sqrt, log, log2, logn, log10, power, arccos, arcsin, and arctanh. That explains the behavior you are seeing. The underlying design reason why it is done like that is probably buried in a mailing list post somewhere.
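
The difference between the two log10 versions can be reproduced with NumPy alone, because the wrapper module is also exposed as np.emath (an alias for numpy.lib.scimath). The sketch below is a quick check of that behaviour and makes no claim about what current SciPy versions export at their top level.

```python
import numpy as np

x = np.array([-100.0, 100.0])

# Plain ufunc: stays in the real domain, so the negative input becomes NaN.
with np.errstate(invalid="ignore"):
    plain = np.log10(x)

# scimath wrapper: switches to complex output when an input is negative.
wrapped = np.emath.log10(x)

print(plain)     # NaN for -100, 2.0 for 100
print(wrapped)   # complex values; real part 2.0 for both entries
print(isinstance(np.log10, np.ufunc), isinstance(np.emath.log10, np.ufunc))
```

The imaginary part for the negative input is pi / ln(10), which is what the complex logarithm gives on the negative real axis.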


Answer 1

From the SciPy Reference Guide:

… all of the Numpy functions have been subsumed into the scipy namespace so that all of those functions are available without additionally importing Numpy.

The intention is for users not to have to know the distinction between the scipy and numpy namespaces, though apparently you’ve found an exception.


Answer 2

It seems from the SciPy FAQ that some functions are in NumPy for historical reasons, while they should really only be in SciPy:

What is the difference between NumPy and SciPy?

In an ideal world, NumPy would contain nothing but the array data type and the most basic operations: indexing, sorting, reshaping, basic elementwise functions, et cetera. All numerical code would reside in SciPy. However, one of NumPy’s important goals is compatibility, so NumPy tries to retain all features supported by either of its predecessors. Thus NumPy contains some linear algebra functions, even though these more properly belong in SciPy. In any case, SciPy contains more fully-featured versions of the linear algebra modules, as well as many other numerical algorithms. If you are doing scientific computing with python, you should probably install both NumPy and SciPy. Most new features belong in SciPy rather than NumPy.

That explains why scipy.linalg.solve offers some additional features over numpy.linalg.solve.
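
As a concrete illustration of the "additional features", here is a sketch that solves the same small system with both functions. It assumes a reasonably recent SciPy in which solve accepts the assume_a keyword, and the 2x2 system is the one from the numpy.linalg.solve docstring.

```python
import numpy as np
from scipy import linalg as sla

# 3*x0 + x1 = 9 and x0 + 2*x1 = 8; the matrix is symmetric positive definite.
a = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])

x_numpy = np.linalg.solve(a, b)               # general solver, no structure hints
x_scipy = sla.solve(a, b)                     # same call shape in SciPy
x_scipy_pos = sla.solve(a, b, assume_a="pos") # hint: use a solver for positive definite matrices

print(x_numpy, x_scipy, x_scipy_pos)
```

All three give the same answer here; the structure hint only changes which LAPACK routine SciPy selects, which can matter for speed on larger matrices.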

I did not see the answer of SethMMorton to the related question


Answer 3

There is a short comment at the end of the introduction to SciPy documentation:

Another useful command is source. When given a function written in Python as an argument, it prints out a listing of the source code for that function. This can be helpful in learning about an algorithm or understanding exactly what a function is doing with its arguments. Also don’t forget about the Python command dir which can be used to look at the namespace of a module or package.

I think this will allow someone with enough knowledge of all the packages involved to pick apart exactly what the differences are between some scipy and numpy functions (it didn’t help me with the log10 question at all). I definitely don’t have that knowledge, but source does indicate that scipy.linalg.solve and numpy.linalg.solve interact with lapack in different ways:

Python 2.4.3 (#1, May  5 2011, 18:44:23) 
[GCC 4.1.2 20080704 (Red Hat 4.1.2-50)] on linux2
>>> import scipy
>>> import scipy.linalg
>>> import numpy
>>> scipy.source(scipy.linalg.solve)
In file: /usr/lib64/python2.4/site-packages/scipy/linalg/basic.py

def solve(a, b, sym_pos=0, lower=0, overwrite_a=0, overwrite_b=0,
          debug = 0):
    """ solve(a, b, sym_pos=0, lower=0, overwrite_a=0, overwrite_b=0) -> x

    Solve a linear system of equations a * x = b for x.

    Inputs:

      a -- An N x N matrix.
      b -- An N x nrhs matrix or N vector.
      sym_pos -- Assume a is symmetric and positive definite.
      lower -- Assume a is lower triangular, otherwise upper one.
               Only used if sym_pos is true.
      overwrite_y - Discard data in y, where y is a or b.

    Outputs:

      x -- The solution to the system a * x = b
    """
    a1, b1 = map(asarray_chkfinite,(a,b))
    if len(a1.shape) != 2 or a1.shape[0] != a1.shape[1]:
        raise ValueError, 'expected square matrix'
    if a1.shape[0] != b1.shape[0]:
        raise ValueError, 'incompatible dimensions'
    overwrite_a = overwrite_a or (a1 is not a and not hasattr(a,'__array__'))
    overwrite_b = overwrite_b or (b1 is not b and not hasattr(b,'__array__'))
    if debug:
        print 'solve:overwrite_a=',overwrite_a
        print 'solve:overwrite_b=',overwrite_b
    if sym_pos:
        posv, = get_lapack_funcs(('posv',),(a1,b1))
        c,x,info = posv(a1,b1,
                        lower = lower,
                        overwrite_a=overwrite_a,
                        overwrite_b=overwrite_b)
    else:
        gesv, = get_lapack_funcs(('gesv',),(a1,b1))
        lu,piv,x,info = gesv(a1,b1,
                             overwrite_a=overwrite_a,
                             overwrite_b=overwrite_b)

    if info==0:
        return x
    if info>0:
        raise LinAlgError, "singular matrix"
    raise ValueError,\
          'illegal value in %-th argument of internal gesv|posv'%(-info)

>>> scipy.source(numpy.linalg.solve)
In file: /usr/lib64/python2.4/site-packages/numpy/linalg/linalg.py

def solve(a, b):
    """
    Solve the equation ``a x = b`` for ``x``.

    Parameters
    ----------
    a : array_like, shape (M, M)
        Input equation coefficients.
    b : array_like, shape (M,)
        Equation target values.

    Returns
    -------
    x : array, shape (M,)

    Raises
    ------
    LinAlgError
        If `a` is singular or not square.

    Examples
    --------
    Solve the system of equations ``3 * x0 + x1 = 9`` and ``x0 + 2 * x1 = 8``:

    >>> a = np.array([[3,1], [1,2]])
    >>> b = np.array([9,8])
    >>> x = np.linalg.solve(a, b)
    >>> x
    array([ 2.,  3.])

    Check that the solution is correct:

    >>> (np.dot(a, x) == b).all()
    True

    """
    a, _ = _makearray(a)
    b, wrap = _makearray(b)
    one_eq = len(b.shape) == 1
    if one_eq:
        b = b[:, newaxis]
    _assertRank2(a, b)
    _assertSquareness(a)
    n_eq = a.shape[0]
    n_rhs = b.shape[1]
    if n_eq != b.shape[0]:
        raise LinAlgError, 'Incompatible dimensions'
    t, result_t = _commonType(a, b)
#    lapack_routine = _findLapackRoutine('gesv', t)
    if isComplexType(t):
        lapack_routine = lapack_lite.zgesv
    else:
        lapack_routine = lapack_lite.dgesv
    a, b = _fastCopyAndTranspose(t, a, b)
    pivots = zeros(n_eq, fortran_int)
    results = lapack_routine(n_eq, n_rhs, a, n_eq, pivots, b, n_eq, 0)
    if results['info'] > 0:
        raise LinAlgError, 'Singular matrix'
    if one_eq:
        return wrap(b.ravel().astype(result_t))
    else:
        return wrap(b.transpose().astype(result_t))

This is also my first post so if I should change something here please let me know.


Answer 4

From Wikipedia ( http://en.wikipedia.org/wiki/NumPy#History ):

The Numeric code was adapted to make it more maintainable and flexible enough to implement the novel features of Numarray. This new project was part of SciPy. To avoid installing a whole package just to get an array object, this new package was separated and called NumPy.

scipy depends on numpy and imports many numpy functions into its namespace for convenience.


Answer 5

Regarding the linalg package – the scipy functions will call lapack and blas, which are available in highly optimised versions on many platforms and offer very good performance, particularly for operations on reasonably large dense matrices. On the other hand, they are not easy libraries to compile, requiring a fortran compiler and many platform specific tweaks to get full performance. Therefore, numpy provides simple implementations of many common linear algebra functions which are often good enough for many purposes.


Answer 6

From the ‘Quantitative Economics’ lectures:

SciPy is a package that contains various tools that are built on top of NumPy, using its array data type and related functionality

In fact, when we import SciPy we also get NumPy, as can be seen from the SciPy initialization file

# Import numpy symbols to scipy name space
import numpy as _num
linalg = None
from numpy import *
from numpy.random import rand, randn
from numpy.fft import fft, ifft
from numpy.lib.scimath import *

__all__  = []
__all__ += _num.__all__
__all__ += ['randn', 'rand', 'fft', 'ifft']

del _num
# Remove the linalg imported from numpy so that the scipy.linalg package can be
# imported.
del linalg
__all__.remove('linalg')

However, it’s more common and better practice to use NumPy functionality explicitly

import numpy as np

a = np.identity(3)

What is useful in SciPy is the functionality in its subpackages

  • scipy.optimize, scipy.integrate, scipy.stats, etc.

Answer 7

In addition to the SciPy FAQ, which describes the duplication as mainly for backwards compatibility, the NumPy documentation clarifies further:

Optionally SciPy-accelerated routines (numpy.dual)

Aliases for functions which may be accelerated by Scipy.

SciPy can be built to use accelerated or otherwise improved libraries for FFTs, linear algebra, and special functions. This module allows developers to transparently support these accelerated functions when SciPy is available but still support users who have only installed NumPy.

For brevity, these are:

  • Linear algebra
  • FFT
  • The Modified Bessel function of the first kind, order 0

Also, from the SciPy Tutorial:

The top level of SciPy also contains functions from NumPy and numpy.lib.scimath. However, it is better to use them directly from the NumPy module instead.

So, for new applications, you should prefer the NumPy version of the array operations that are duplicated in the top level of SciPy. For the domains listed above, you should prefer those in SciPy and check backward compatibility if necessary in NumPy.

In my personal experience, most of the array functions I use exist in the top level of NumPy (except for random). However, all the domain specific routines exist in subpackages of SciPy, so I rarely use anything from the top level of SciPy.


Comparing two NumPy arrays for equality, element-wise

Question: Comparing two NumPy arrays for equality, element-wise

What is the simplest way to compare two NumPy arrays for equality (where equality is defined as: A = B iff for all indices i: A[i] == B[i])?

Simply using == gives me a boolean array:

 >>> numpy.array([1,1,1]) == numpy.array([1,1,1])

array([ True,  True,  True], dtype=bool)

Do I have to and the elements of this array to determine if the arrays are equal, or is there a simpler way to compare?


Answer 0

(A==B).all()

test if all values of array (A==B) are True.

Note: maybe you also want to test A and B shape, such as A.shape == B.shape

Special cases and alternatives (from dbaupp’s answer and yoavram’s comment)

It should be noted that:

  • This solution can behave strangely in one particular case: if either A or B is empty and the other one contains a single element, then it returns True. The comparison A==B broadcasts to an empty array, and all() of an empty array is True.
  • Another risk is if A and B don’t have the same shape and aren’t broadcastable, then this approach will raise an error.

In conclusion, if you have a doubt about A and B shape or simply want to be safe: use one of the specialized functions:

np.array_equal(A,B)  # test if same shape, same elements values
np.array_equiv(A,B)  # test if broadcastable shape, same elements values
np.allclose(A,B,...) # test if elements have close enough values (shapes are broadcast against each other)
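
To make the empty-array special case described above concrete, here is a minimal sketch. The arrays A and B are made up for illustration; the point is only that (A==B).all() and np.array_equal can disagree when one side is empty.

```python
import numpy as np

A = np.array([])     # empty array, shape (0,)
B = np.array([1])    # single element, shape (1,)

# Broadcasting shape (0,) against shape (1,) gives an empty boolean array,
# and all() over an empty array is vacuously True.
print((A == B).all())        # True

# array_equal compares the shapes first, so it reports a mismatch.
print(np.array_equal(A, B))  # False
```

If the shapes are allowed to differ in your data, np.array_equal is probably the safer default.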

Answer 1

The (A==B).all() solution is very neat, but there are some built-in functions for this task. Namely array_equal, allclose and array_equiv.

(Although, some quick testing with timeit seems to indicate that the (A==B).all() method is the fastest, which is a little peculiar, given it has to allocate a whole new array.)


Answer 2

Let’s measure the performance by using the following piece of code.

import numpy as np
import time

exec_time0 = []
exec_time1 = []
exec_time2 = []

sizeOfArray = 5000
numOfIterations = 200

for i in range(numOfIterations):  # range replaces Python 2's xrange

    A = np.random.randint(0,255,(sizeOfArray,sizeOfArray))
    B = np.random.randint(0,255,(sizeOfArray,sizeOfArray))

    # time.perf_counter() replaces time.clock(), which was removed in Python 3.8
    a = time.perf_counter()
    res = (A==B).all()
    b = time.perf_counter()
    exec_time0.append( b - a )

    a = time.perf_counter()
    res = np.array_equal(A,B)
    b = time.perf_counter()
    exec_time1.append( b - a )

    a = time.perf_counter()
    res = np.array_equiv(A,B)
    b = time.perf_counter()
    exec_time2.append( b - a )

print('Method: (A==B).all(),       ', np.mean(exec_time0))
print('Method: np.array_equal(A,B),', np.mean(exec_time1))
print('Method: np.array_equiv(A,B),', np.mean(exec_time2))

Output

Method: (A==B).all(),        0.03031857
Method: np.array_equal(A,B), 0.030025185
Method: np.array_equiv(A,B), 0.030141515

According to the results above, the numpy methods seem to be marginally faster than the combination of the == operator and the all() method (the differences are around 1%), and comparing the numpy methods, the fastest one seems to be numpy.array_equal.


Answer 3

If you want to check if two arrays have the same shape AND elements you should use np.array_equal as it is the method recommended in the documentation.

Performance-wise, don’t expect any equality check to beat another, as there is not much room to optimize comparing two elements. Just for the sake of it, I still did some tests.

import numpy as np
import timeit

A = np.zeros((300, 300, 3))
B = np.zeros((300, 300, 3))
C = np.ones((300, 300, 3))

timeit.timeit(stmt='(A==B).all()', setup='from __main__ import A, B', number=10**5)
timeit.timeit(stmt='np.array_equal(A, B)', setup='from __main__ import A, B, np', number=10**5)
timeit.timeit(stmt='np.array_equiv(A, B)', setup='from __main__ import A, B, np', number=10**5)
> 51.5094
> 52.555
> 52.761

So pretty much equal, no need to talk about the speed.

The (A==B).all() behaves pretty much as the following code snippet:

x = [1,2,3]
y = [1,2,3]
print(all([x[i]==y[i] for i in range(len(x))]))
> True

Answer 4

Usually two arrays will differ by some small numerical errors, so you can use numpy.allclose(A,B) instead of (A==B).all(). This returns a bool, True or False.
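
A small sketch of why this matters, using made-up values: 0.1 + 0.2 is not exactly 0.3 in floating point, so exact comparison fails while np.allclose (with its default tolerances) succeeds.

```python
import numpy as np

A = np.array([0.1 + 0.2, 1.0])   # 0.1 + 0.2 evaluates to 0.30000000000000004
B = np.array([0.3, 1.0])

print((A == B).all())      # False: exact comparison sees the rounding error
print(np.allclose(A, B))   # True: the difference is within the default tolerance
```

The tolerances can be tuned through the rtol and atol arguments of np.allclose if the defaults are too loose or too strict for your data.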


Answer 5

Now use np.array_equal. From documentation:

np.array_equal([1, 2], [1, 2])
True
np.array_equal(np.array([1, 2]), np.array([1, 2]))
True
np.array_equal([1, 2], [1, 2, 3])
False
np.array_equal([1, 2], [1, 4])
False

NumPy array is not JSON serializable

Question: NumPy array is not JSON serializable

After creating a NumPy array, and saving it as a Django context variable, I receive the following error when loading the webpage:

array([   0,  239,  479,  717,  952, 1192, 1432, 1667], dtype=int64) is not JSON serializable

What does this mean?


Answer 0

I regularly “jsonify” np.arrays. Try using the “.tolist()” method on the arrays first, like this:

import json
import numpy as np

a = np.arange(10).reshape(2,5) # a 2 by 5 array
b = a.tolist() # nested lists with same data, indices
file_path = "/path.json" ## your path variable
# this saves the array in .json format; the with block closes the file when done
with open(file_path, 'w', encoding='utf-8') as f:
    json.dump(b, f, separators=(',', ':'), sort_keys=True, indent=4)

In order to “unjsonify” the array use:

with open(file_path, 'r', encoding='utf-8') as f:
    b_new = json.load(f)
a_new = np.array(b_new)

Answer 1

Store as JSON a numpy.ndarray or any nested-list composition.

import json
import numpy as np

class NumpyEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, np.ndarray):
            return obj.tolist()
        return json.JSONEncoder.default(self, obj)

a = np.array([[1, 2, 3], [4, 5, 6]])
print(a.shape)
json_dump = json.dumps({'a': a, 'aa': [2, (2, 3, 4), a], 'bb': [2]}, cls=NumpyEncoder)
print(json_dump)

Will output:

(2, 3)
{"a": [[1, 2, 3], [4, 5, 6]], "aa": [2, [2, 3, 4], [[1, 2, 3], [4, 5, 6]]], "bb": [2]}

To restore from JSON:

json_load = json.loads(json_dump)
a_restored = np.asarray(json_load["a"])
print(a_restored)
print(a_restored.shape)

Will output:

[[1 2 3]
 [4 5 6]]
(2, 3)

Answer 2

You can use Pandas:

import pandas as pd
pd.Series(your_array).to_json(orient='values')

Answer 3

I found the best solution if you have nested numpy arrays in a dictionary:

import json
import numpy as np

class NumpyEncoder(json.JSONEncoder):
    """ Special json encoder for numpy types """
    def default(self, obj):
        if isinstance(obj, np.integer):
            return int(obj)
        elif isinstance(obj, np.floating):
            return float(obj)
        elif isinstance(obj, np.ndarray):
            return obj.tolist()
        return json.JSONEncoder.default(self, obj)

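dumped = json.dumps(data, cls=NumpyEncoder)

# Write the already-serialized string as-is; calling json.dump(dumped, f)
# here would encode it a second time, as one big JSON string.
with open(path, 'w') as f:
    f.write(dumped)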

Thanks to this guy.


Answer 4

Use the json.dumps default kwarg:

default should be a function that gets called for objects that can’t otherwise be serialized. … or raise a TypeError

In the default function, check whether the object comes from the numpy module; if so, use ndarray.tolist for an ndarray, or .item for any other NumPy-specific type.

import json
import numpy as np

def default(obj):
    if type(obj).__module__ == np.__name__:
        if isinstance(obj, np.ndarray):
            return obj.tolist()
        else:
            return obj.item()
    raise TypeError('Unknown type:', type(obj))

dumped = json.dumps(data, default=default)
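
As a usage sketch of the default hook, here is a variant that can be run end to end. The function name numpy_default and the sample data dictionary are assumptions made for illustration, and the type check uses np.generic (the base class of NumPy scalar types) instead of the module-name test above.

```python
import json
import numpy as np

def numpy_default(obj):
    """Fallback for json.dumps: convert NumPy objects to plain Python ones."""
    if isinstance(obj, np.ndarray):
        return obj.tolist()          # arrays become (nested) lists
    if isinstance(obj, np.generic):
        return obj.item()            # NumPy scalars become int/float/bool
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")

# Hypothetical sample data mixing NumPy scalars and an array
data = {"count": np.int64(3), "mean": np.float32(0.5), "values": np.arange(3)}

dumped = json.dumps(data, default=numpy_default)
print(dumped)  # {"count": 3, "mean": 0.5, "values": [0, 1, 2]}
```

Raising TypeError for everything else keeps the standard json behaviour for objects that really cannot be serialized.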

Answer 5

This is not supported by default, but you can make it work quite easily! There are several things you’ll want to encode if you want the exact same data back:

  • The data itself, which you can get with obj.tolist() as @travelingbones mentioned. Sometimes this may be good enough.
  • The data type. I feel this is important in quite a few cases.
  • The dimension (not necessarily 2D), which could be derived from the above if you assume the input is indeed always a ‘rectangular’ grid.
  • The memory order (row- or column-major). This doesn’t often matter, but sometimes it does (e.g. performance), so why not save everything?

Furthermore, your numpy array could be part of your data structure, e.g. you have a list with some matrices inside. For that you could use a custom encoder which basically does the above.
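
The four items listed above (data, data type, dimensions, memory order) could be packed into a plain dictionary roughly as follows. This is only a sketch under stated assumptions: the helper names encode_array and decode_array and the key names are hypothetical, and this is not the format json-tricks uses.

```python
import json
import numpy as np

def encode_array(arr):
    """Pack an ndarray into a JSON-friendly dict (hypothetical layout)."""
    fortran = arr.flags.f_contiguous and not arr.flags.c_contiguous
    return {
        "data": arr.ravel(order="C").tolist(),  # values in row-major order
        "dtype": str(arr.dtype),                # e.g. "float32"
        "shape": list(arr.shape),               # dimensions, any rank
        "order": "F" if fortran else "C",       # memory layout to restore
    }

def decode_array(obj):
    """Rebuild the ndarray from the dict produced by encode_array."""
    flat = np.array(obj["data"], dtype=obj["dtype"])
    return np.asarray(flat.reshape(obj["shape"]), order=obj["order"])

original = np.arange(6, dtype=np.float32).reshape(2, 3)
payload = json.dumps(encode_array(original))
restored = decode_array(json.loads(payload))

print(restored.dtype, restored.shape)        # float32 (2, 3)
print(np.array_equal(original, restored))    # True
```

Storing the shape explicitly also covers empty arrays, where the nesting of the list alone would not be enough to recover the dimensions.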

This should be enough to implement a solution. Or you could use json-tricks which does just this (and supports various other types) (disclaimer: I made it).

pip install json-tricks

Then

data = [
    arange(0, 10, 1, dtype=int).reshape((2, 5)),
    datetime(year=2017, month=1, day=19, hour=23, minute=00, second=00),
    1 + 2j,
    Decimal(42),
    Fraction(1, 3),
    MyTestCls(s='ub', dct={'7': 7}),  # see later
    set(range(7)),
]
# Encode with metadata to preserve types when decoding
print(dumps(data))

Answer 6

I had a similar problem with a nested dictionary with some numpy.ndarrays in it.

def jsonify(data):
    json_data = dict()
    for key, value in data.items(): # items() replaces Python 2's iteritems()
        if isinstance(value, list): # for lists
            value = [ jsonify(item) if isinstance(item, dict) else item for item in value ]
        if isinstance(value, dict): # for nested dicts
            value = jsonify(value)
        if isinstance(key, int): # if key is integer: > to string
            key = str(key)
        if type(value).__module__=='numpy': # if value is numpy.*: > to python list
            value = value.tolist()
        json_data[key] = value
    return json_data

Answer 7

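You could also use the default argument, for example:

import json
import numpy as np

def myconverter(o):
    if isinstance(o, np.float32):
        return float(o)
    # without this, unsupported objects would silently be written as null
    raise TypeError('Unknown type: %s' % type(o))

json.dumps(data, default=myconverter)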

Answer 8

Also, there is some very interesting further information on lists vs. arrays in Python: Python List vs. Array – when to use?

Note that once I convert my arrays to lists before saving them in a JSON file, then (in my deployment right now, anyway) when I read that JSON file for later use I can continue to use the data in list form, as opposed to converting it back to an array.

It also looks nicer on screen (in my opinion) as a list (comma-separated) than as an array (not comma-separated).

Using @travelingbones’s .tolist() method above, I’ve been using the following (which also catches a few errors I’ve run into):

SAVE DICTIONARY

def writeDict(values, name):
    writeName = DIR+name+'.json'
    with open(writeName, "w") as outfile:
        json.dump(values, outfile)

READ DICTIONARY

def readDict(name):
    readName = DIR+name+'.json'
    try:
        with open(readName, "r") as infile:
            dictValues = json.load(infile)
            return(dictValues)
    except IOError as e:
        print(e)
        return('None')
    except ValueError as e:
        print(e)
        return('None')

Hope this helps!


Answer 9

Here is an implementation that works for me and removes all NaNs (assuming these are simple objects (list or dict)):

from numpy import isnan

def remove_nans(my_obj, val=None):
    if isinstance(my_obj, list):
        for i, item in enumerate(my_obj):
            if isinstance(item, list) or isinstance(item, dict):
                my_obj[i] = remove_nans(my_obj[i], val=val)

            else:
                try:
                    if isnan(item):
                        my_obj[i] = val
                except Exception:
                    pass

    elif isinstance(my_obj, dict):
        for key, item in my_obj.items():
            if isinstance(item, list) or isinstance(item, dict):
                my_obj[key] = remove_nans(my_obj[key], val=val)

            else:
                try:
                    if isnan(item):
                        my_obj[key] = val
                except Exception:
                    pass

    return my_obj

Answer 10

This is a different answer, but it might help people who are trying to save data and then read it back.
There is hickle, which is faster and easier than pickle.
I tried saving and reading the data with a pickle dump, but ran into a lot of problems when reading it back; I wasted an hour and still didn’t find a solution, even though I was working on my own data to create a chat bot.

vec_x and vec_y are numpy arrays:

import hickle as hkl

data=[vec_x,vec_y]
hkl.dump( data, 'new_data_file.hkl' )

Then you just read it and perform the operations:

data2 = hkl.load( 'new_data_file.hkl' )

Answer 11

You can do a simple for loop with type checks:

with open("jsondontdoit.json", 'w') as fp:
    for key in bests.keys():
        if type(bests[key]) == np.ndarray:
            bests[key] = bests[key].tolist()
            continue
        for idx in bests[key]:
            if type(bests[key][idx]) == np.ndarray:
                bests[key][idx] = bests[key][idx].tolist()
    json.dump(bests, fp)  # the with block closes fp, so no explicit fp.close() is needed

Answer 12

Use NumpyEncoder; it processes the JSON dump successfully, without throwing “NumPy array is not JSON serializable”:

import numpy as np
import json
from numpyencoder import NumpyEncoder
arr = np.array([   0,  239,  479,  717,  952, 1192, 1432, 1667], dtype=np.int64)
json.dumps(arr,cls=NumpyEncoder)

Answer 13

TypeError: array([[0.46872085, 0.67374235, 1.0218339 , 0.13210179, 0.5440686 , 0.9140083 , 0.58720225, 0.2199381 ]], dtype=float32) is not JSON serializable

The above error was thrown when I tried to pass a list of data to model.predict() while expecting the response in JSON format.

> 1        json_file = open('model.json','r')
> 2        loaded_model_json = json_file.read()
> 3        json_file.close()
> 4        loaded_model = model_from_json(loaded_model_json)
> 5        #load weights into new model
> 6        loaded_model.load_weights("model.h5")
> 7        loaded_model.compile(optimizer='adam', loss='mean_squared_error')
> 8        X =  [[874,12450,678,0.922500,0.113569]]
> 9        d = pd.DataFrame(X)
> 10       prediction = loaded_model.predict(d)
> 11       return jsonify(prediction)

Luckily, I found a hint for resolving the error: serialization only works for the following conversions, so the mapping should be object – dict, array – list, string – string, integer – integer.

If you scroll up to line 10, prediction = loaded_model.predict(d), that line produces output of the array data type, and an array cannot be converted to JSON format directly.

Finally, I found the solution: convert the obtained output to a list with the following lines of code

prediction = loaded_model.predict(d)
listtype = prediction.tolist()
return jsonify(listtype)

Boom! Finally got the expected output.


How to implement the Softmax function in Python

Question: How to implement the Softmax function in Python

From Udacity’s deep learning class, the softmax of y_i is simply the exponential divided by the sum of the exponentials of the whole Y vector:

S(y_i) = e^(y_i) / sum_j e^(y_j)

where S(y_i) is the softmax function of y_i, e is the exponential, and j runs over the columns of the input vector Y.

I’ve tried the following:

import numpy as np

def softmax(x):
    """Compute softmax values for each sets of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

scores = [3.0, 1.0, 0.2]
print(softmax(scores))

which returns:

[ 0.8360188   0.11314284  0.05083836]

But the suggested solution was:

def softmax(x):
    """Compute softmax values for each sets of scores in x."""
    return np.exp(x) / np.sum(np.exp(x), axis=0)

which produces the same output as the first implementation, even though the first implementation explicitly takes the difference of each column and the max and then divides by the sum.

Can someone show mathematically why? Is one correct and the other one wrong?

Are the implementations similar in terms of code and time complexity? Which is more efficient?


Answer 0

They’re both correct, but yours is preferred from the point of view of numerical stability.

You start with

e ^ (x - max(x)) / sum(e ^ (x - max(x)))

By using the fact that a^(b - c) = (a^b)/(a^c) we have

= e ^ x / (e ^ max(x) * sum(e ^ x / e ^ max(x)))

= e ^ x / sum(e ^ x)

which is what the other answer says. You could replace max(x) with any constant and it would cancel out.
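
To illustrate the numerical-stability point, here is a minimal sketch with two 1-D versions of the function. The names softmax_naive and softmax_stable and the test vectors are chosen for illustration only; the first function follows the suggested solution and the second follows the question's own version.

```python
import numpy as np

def softmax_naive(x):
    """Softmax without the max shift (1-D input)."""
    e_x = np.exp(x)
    return e_x / e_x.sum()

def softmax_stable(x):
    """Softmax with the max subtracted first (1-D input)."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

small = np.array([3.0, 1.0, 0.2])
large = np.array([1000.0, 1001.0, 1002.0])

# For moderate inputs both versions agree, as the algebra above predicts.
print(np.allclose(softmax_naive(small), softmax_stable(small)))  # True

# For large inputs exp() overflows to inf, and inf / inf gives nan.
with np.errstate(over="ignore", invalid="ignore"):
    print(softmax_naive(large))   # [nan nan nan]

# Shifting by the max keeps every exponent <= 0, so nothing overflows.
print(softmax_stable(large))
```

The shift does not change the mathematical result; it only keeps the intermediate values inside the range that floating point can represent.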


Answer 1

(Well… much confusion here, both in the question and in the answers…)

To start with, the two solutions (i.e. yours and the suggested one) are not equivalent; they happen to be equivalent only for the special case of 1-D score arrays. You would have discovered it if you had tried also the 2-D score array in the Udacity quiz provided example.

Results-wise, the only actual difference between the two solutions is the axis=0 argument. To see that this is the case, let’s try your solution (your_softmax) and one where the only difference is the axis argument:

import numpy as np

# your solution:
def your_softmax(x):
    """Compute softmax values for each sets of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

# correct solution:
def softmax(x):
    """Compute softmax values for each sets of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0) # only difference

As I said, for a 1-D score array, the results are indeed identical:

scores = [3.0, 1.0, 0.2]
print(your_softmax(scores))
# [ 0.8360188   0.11314284  0.05083836]
print(softmax(scores))
# [ 0.8360188   0.11314284  0.05083836]
your_softmax(scores) == softmax(scores)
# array([ True,  True,  True], dtype=bool)

Nevertheless, here are the results for the 2-D score array given in the Udacity quiz as a test example:

scores2D = np.array([[1, 2, 3, 6],
                     [2, 4, 5, 6],
                     [3, 8, 7, 6]])

print(your_softmax(scores2D))
# [[  4.89907947e-04   1.33170787e-03   3.61995731e-03   7.27087861e-02]
#  [  1.33170787e-03   9.84006416e-03   2.67480676e-02   7.27087861e-02]
#  [  3.61995731e-03   5.37249300e-01   1.97642972e-01   7.27087861e-02]]

print(softmax(scores2D))
# [[ 0.09003057  0.00242826  0.01587624  0.33333333]
#  [ 0.24472847  0.01794253  0.11731043  0.33333333]
#  [ 0.66524096  0.97962921  0.86681333  0.33333333]]

The results are different – the second one is indeed identical with the one expected in the Udacity quiz, where all columns indeed sum to 1, which is not the case with the first (wrong) result.

So, all the fuss was actually for an implementation detail – the axis argument. According to the numpy.sum documentation:

The default, axis=None, will sum all of the elements of the input array

while here we want to sum over the rows, i.e. down each column, hence axis=0. For a 1-D array, the sum along the only axis and the sum of all the elements happen to be identical, hence your identical results in that case…

The axis issue aside, your implementation (i.e. your choice to subtract the max first) is actually better than the suggested solution! In fact, it is the recommended way of implementing the softmax function – see here for the justification (numeric stability, also pointed out by some other answers here).
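
As a quick check of what the axis argument does, the sketch below applies np.sum to the same 2-D score array used in the quiz example, with the default axis and with each explicit axis.

```python
import numpy as np

scores2D = np.array([[1, 2, 3, 6],
                     [2, 4, 5, 6],
                     [3, 8, 7, 6]])

total = np.sum(scores2D)             # axis=None: one number for the whole array
per_column = np.sum(scores2D, axis=0)  # collapses the rows: one sum per column
per_row = np.sum(scores2D, axis=1)     # collapses the columns: one sum per row

print(total)       # 53
print(per_column)  # column sums: 6, 14, 15, 18
print(per_row)     # row sums: 12, 17, 24
```

Dividing by the axis=0 result therefore normalizes each column separately, which is why the columns of the second output above sum to 1.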


Answer 2

So, this is really a comment on desertnaut’s answer, but I can’t comment on it yet due to my reputation. As he pointed out, your version is only correct if your input consists of a single sample. If your input consists of several samples, it is wrong. However, desertnaut’s solution is also wrong. The problem is that he first uses a 1-dimensional input and then a 2-dimensional input. Let me show you.

import numpy as np

# your solution:
def your_softmax(x):
    """Compute softmax values for each sets of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

# desertnaut solution (copied from his answer): 
def desertnaut_softmax(x):
    """Compute softmax values for each sets of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0) # only difference

# my (correct) solution:
def softmax(z):
    assert len(z.shape) == 2
    s = np.max(z, axis=1)
    s = s[:, np.newaxis] # necessary step to do broadcasting
    e_x = np.exp(z - s)
    div = np.sum(e_x, axis=1)
    div = div[:, np.newaxis] # ditto
    return e_x / div

Let’s take desertnaut’s example:

x1 = np.array([[1, 2, 3, 6]]) # notice that we put the data into 2 dimensions(!)

This is the output:

your_softmax(x1)
array([[ 0.00626879,  0.01704033,  0.04632042,  0.93037047]])

desertnaut_softmax(x1)
array([[ 1.,  1.,  1.,  1.]])

softmax(x1)
array([[ 0.00626879,  0.01704033,  0.04632042,  0.93037047]])

You can see that desertnaut’s version fails in this situation. (It would not fail if the input were one-dimensional, like np.array([1, 2, 3, 6]).)

Let’s now use 3 samples, since that is the reason we use a 2-dimensional input. The following x2 is not the same as the one from desertnaut’s example.

x2 = np.array([[1, 2, 3, 6],  # sample 1
               [2, 4, 5, 6],  # sample 2
               [1, 2, 3, 6]]) # sample 1 again(!)

This input consists of a batch with 3 samples, but samples one and three are identical. We now expect 3 rows of softmax activations, where the first should be the same as the third and also the same as our activation of x1!

your_softmax(x2)
array([[ 0.00183535,  0.00498899,  0.01356148,  0.27238963],
       [ 0.00498899,  0.03686393,  0.10020655,  0.27238963],
       [ 0.00183535,  0.00498899,  0.01356148,  0.27238963]])


desertnaut_softmax(x2)
array([[ 0.21194156,  0.10650698,  0.10650698,  0.33333333],
       [ 0.57611688,  0.78698604,  0.78698604,  0.33333333],
       [ 0.21194156,  0.10650698,  0.10650698,  0.33333333]])

softmax(x2)
array([[ 0.00626879,  0.01704033,  0.04632042,  0.93037047],
       [ 0.01203764,  0.08894682,  0.24178252,  0.65723302],
       [ 0.00626879,  0.01704033,  0.04632042,  0.93037047]])

I hope you can see that only my solution satisfies this:

softmax(x1) == softmax(x2)[0]
array([[ True,  True,  True,  True]], dtype=bool)

softmax(x1) == softmax(x2)[2]
array([[ True,  True,  True,  True]], dtype=bool)

Additionally, here are the results of TensorFlow’s softmax implementation:

import tensorflow as tf
import numpy as np
batch = np.asarray([[1,2,3,6],[2,4,5,6],[1,2,3,6]])
x = tf.placeholder(tf.float32, shape=[None, 4])
y = tf.nn.softmax(x)
init = tf.initialize_all_variables()
sess = tf.Session()
sess.run(y, feed_dict={x: batch})

And the result:

array([[ 0.00626879,  0.01704033,  0.04632042,  0.93037045],
       [ 0.01203764,  0.08894681,  0.24178252,  0.657233  ],
       [ 0.00626879,  0.01704033,  0.04632042,  0.93037045]], dtype=float32)

Answer 3

I would say that while both are mathematically correct, the first one is better implementation-wise. When computing softmax, the intermediate values may become very large. Dividing two large numbers can be numerically unstable. These notes (from Stanford) mention a normalization trick, which is essentially what you are doing.


Answer 4

sklearn also offers an implementation of softmax:

from sklearn.utils.extmath import softmax
import numpy as np

x = np.array([[ 0.50839931,  0.49767588,  0.51260159]])
softmax(x)

# output
array([[ 0.3340521 ,  0.33048906,  0.33545884]]) 

Answer 5

From a mathematical point of view, both sides are equal.

You can easily prove this. Let m = max(x). Your function softmax then returns a vector whose i-th coordinate is equal to

$$\frac{e^{x_i - m}}{\sum_j e^{x_j - m}} = \frac{e^{x_i} / e^{m}}{\sum_j e^{x_j} / e^{m}} = \frac{e^{x_i}}{\sum_j e^{x_j}}$$

Notice that this works for any m, because e^m != 0 for all (even complex) numbers m.

  • From a computational complexity point of view, they are also equivalent: both run in O(n) time, where n is the size of the vector.

  • From a numerical stability point of view, the first solution is preferred, because e^x grows very fast and overflows even for fairly small values of x. Subtracting the maximum value gets rid of this overflow. To experience this in practice, try feeding x = np.array([1000, 5]) into both of your functions. One will return the correct probabilities; the second will overflow and return nan.

  • Your solution works only for vectors (the Udacity quiz wants you to calculate it for matrices as well). To fix it, you need to use sum(axis=0).
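The overflow experiment suggested above can be sketched as follows. This is only an illustration: the names softmax_naive and softmax_shifted are made up for this example and do not come from the question, and the np.errstate context is there solely to silence the overflow warnings.

```python
import numpy as np

def softmax_naive(x):
    # Exponentiates the raw scores directly; np.exp(1000.0) overflows to inf.
    e_x = np.exp(x)
    return e_x / e_x.sum()

def softmax_shifted(x):
    # Subtracting the max makes the largest exponent exactly 0, so exp() <= 1.
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

x = np.array([1000.0, 5.0])

# Silence the overflow / invalid-value warnings raised by the naive version.
with np.errstate(over="ignore", invalid="ignore"):
    naive = softmax_naive(x)

shifted = softmax_shifted(x)

print(naive)    # first entry is inf / inf, i.e. nan
print(shifted)  # a valid probability vector
```

The shifted version stays finite because every exponent is at most 0; very negative exponents underflow quietly to 0 rather than producing inf.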


Answer 6

EDIT. As of version 1.2.0, scipy includes softmax as a special function:

https://scipy.github.io/devdocs/generated/scipy.special.softmax.html

I wrote a function applying the softmax over any axis:

def softmax(X, theta = 1.0, axis = None):
    """
    Compute the softmax of each element along an axis of X.

    Parameters
    ----------
    X: ND-Array. Probably should be floats. 
    theta (optional): float parameter, used as a multiplier
        prior to exponentiation. Default = 1.0
    axis (optional): axis to compute values along. Default is the 
        first non-singleton axis.

    Returns an array the same size as X. The result will sum to 1
    along the specified axis.
    """

    # make X at least 2d
    y = np.atleast_2d(X)

    # find axis
    if axis is None:
        axis = next(j[0] for j in enumerate(y.shape) if j[1] > 1)

    # multiply y against the theta parameter, 
    y = y * float(theta)

    # subtract the max for numerical stability
    y = y - np.expand_dims(np.max(y, axis = axis), axis)

    # exponentiate y
    y = np.exp(y)

    # take the sum along the specified axis
    ax_sum = np.expand_dims(np.sum(y, axis = axis), axis)

    # finally: divide elementwise
    p = y / ax_sum

    # flatten if X was 1D
    if len(X.shape) == 1: p = p.flatten()

    return p

Subtracting the max, as other users described, is good practice. I wrote a detailed post about it here.


Answer 7

Here you can find out why they used - max.

From there:

“When you’re writing code for computing the Softmax function in practice, the intermediate terms may be very large due to the exponentials. Dividing large numbers can be numerically unstable, so it is important to use a normalization trick.”


Answer 8

A more concise version is:

def softmax(x):
    return np.exp(x) / np.exp(x).sum(axis=0)

Answer 9

To offer an alternative solution, consider the cases where your arguments are extremely large in magnitude such that exp(x) would underflow (in the negative case) or overflow (in the positive case). Here you want to remain in log space as long as possible, exponentiating only at the end where you can trust the result will be well-behaved.

import scipy.special as sc
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    return np.exp(x - sc.logsumexp(x))
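To make the log-space idea concrete without depending on scipy, here is a minimal sketch that normalizes along one axis. The helper names logsumexp and log_softmax below are hypothetical NumPy-only stand-ins written for this example; they are not scipy’s implementation, and the axis=-1 default is an assumption.

```python
import numpy as np

def logsumexp(x, axis=-1):
    # Hypothetical NumPy-only helper: log(sum(exp(x))) computed stably
    # by factoring out the maximum along the chosen axis.
    m = np.max(x, axis=axis, keepdims=True)
    return m + np.log(np.sum(np.exp(x - m), axis=axis, keepdims=True))

def log_softmax(x, axis=-1):
    # Stay in log space; exponentiate only when probabilities are needed.
    return x - logsumexp(x, axis=axis)

x = np.array([[1000.0, 1001.0, 1002.0],
              [-1000.0, -1001.0, -1002.0]])

log_p = log_softmax(x)   # finite log-probabilities for both rows
p = np.exp(log_p)        # each row sums to 1
print(p)
```

Keeping the log-probabilities around is often useful on its own, for example when they feed directly into a cross-entropy loss.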

Answer 10

I needed something compatible with the output of a dense layer from Tensorflow.

The solution from @desertnaut does not work in this case because I have batches of data. Therefore, I came up with another solution that should work in both cases:

def softmax(x, axis=-1):
    e_x = np.exp(x - np.max(x)) # same code
    return e_x / e_x.sum(axis=axis, keepdims=True)

Results:

logits = np.asarray([
    [-0.0052024,  -0.00770216,  0.01360943, -0.008921], # 1
    [-0.0052024,  -0.00770216,  0.01360943, -0.008921]  # 2
])

print(softmax(logits))

#[[0.2492037  0.24858153 0.25393605 0.24827873]
# [0.2492037  0.24858153 0.25393605 0.24827873]]

Ref: Tensorflow softmax


Answer 11

I would suggest this:

def softmax(z):
    z_norm=np.exp(z-np.max(z,axis=0,keepdims=True))
    return(np.divide(z_norm,np.sum(z_norm,axis=0,keepdims=True)))

It works for the stochastic (single-sample) case as well as for batches.
For more detail, see: https://medium.com/@ravish1729/analysis-of-softmax-function-ad058d6a564d


Answer 12

To maintain numerical stability, max(x) should be subtracted. The following is the code for the softmax function:

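def softmax(x):
    # note: the in-place subtraction below modifies the caller's array
    if len(x.shape) > 1:
        # matrix input: normalize each row separately
        tmp = np.max(x, axis = 1)
        x -= tmp.reshape((x.shape[0], 1))
        x = np.exp(x)
        tmp = np.sum(x, axis = 1)
        x /= tmp.reshape((x.shape[0], 1))
    else:
        # vector input
        tmp = np.max(x)
        x -= tmp
        x = np.exp(x)
        tmp = np.sum(x)
        x /= tmp

    return x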

Answer 13

This has already been answered in detail above: the max is subtracted to avoid overflow. I am adding one more implementation here, in Python 3.

import numpy as np
def softmax(x):
    mx = np.amax(x,axis=1,keepdims = True)
    x_exp = np.exp(x - mx)
    x_sum = np.sum(x_exp, axis = 1, keepdims = True)
    res = x_exp / x_sum
    return res

x = np.array([[3,2,4],[4,5,6]])
print(softmax(x))

Answer 14

Everybody seems to post their solution, so I’ll post mine:

def softmax(x):
    e_x = np.exp(x.T - np.max(x, axis = -1))
    return (e_x / e_x.sum(axis=0)).T

I get exactly the same results as the function imported from sklearn:

from sklearn.utils.extmath import softmax

Answer 15

import tensorflow as tf
import numpy as np

def softmax(x):
    return (np.exp(x).T / np.exp(x).sum(axis=-1)).T

logits = np.array([[1, 2, 3], [3, 10, 1], [1, 2, 5], [4, 6.5, 1.2], [3, 6, 1]])

sess = tf.Session()
print(softmax(logits))
print(sess.run(tf.nn.softmax(logits)))
sess.close()

Answer 16

Based on all the responses and CS231n notes, allow me to summarise:

def softmax(x, axis):
    x -= np.max(x, axis=axis, keepdims=True)
    return np.exp(x) / np.exp(x).sum(axis=axis, keepdims=True)

Usage:

x = np.array([[1, 0, 2,-1],
              [2, 4, 6, 8], 
              [3, 2, 1, 0]])
softmax(x, axis=1).round(2)

Output:

array([[0.24, 0.09, 0.64, 0.03],
       [0.  , 0.02, 0.12, 0.86],
       [0.64, 0.24, 0.09, 0.03]])
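One thing worth noting about the summary above is that x -= ... subtracts in place, so the caller’s array is changed as a side effect. A minimal non-mutating sketch, assuming that is not what you want, could look like this (the axis=-1 default and the float conversion are choices made for this example, not part of the original answer):

```python
import numpy as np

def softmax(x, axis=-1):
    # Convert to float; the subtraction below creates a new array,
    # so the input is never modified.
    x = np.asarray(x, dtype=float)
    shifted = x - np.max(x, axis=axis, keepdims=True)
    e_x = np.exp(shifted)   # computed once and reused
    return e_x / e_x.sum(axis=axis, keepdims=True)

x = np.array([[1, 0, 2, -1],
              [2, 4, 6, 8],
              [3, 2, 1, 0]])
x_before = x.copy()

probs = softmax(x, axis=1)
print(probs.round(2))
print(np.array_equal(x, x_before))  # the input array is left unchanged
```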

Answer 17

I would like to add a bit more understanding of the problem. Subtracting the max of the array is correct here. But if you run the code from the other post, you will find that it does not give the right answer when the array is 2D or higher-dimensional.

Here are some suggestions:

  1. To get the max, take it along the x-axis; you will get a 1D array.
  2. Reshape your max array so that it broadcasts against the original shape.
  3. Apply np.exp to get the exponential values.
  4. Apply np.sum along the axis.
  5. Divide to get the final result.

Follow these steps and you will get the correct answer using vectorization. Since this is related to college homework, I cannot post the exact code here, but I am happy to give more suggestions if you don’t understand.
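The five steps above can be sketched roughly as follows. This is one possible reading, not the answerer’s own code: it assumes that "along the x-axis" means axis=1 of a (samples, classes) array, and the name softmax_rows is made up for this example.

```python
import numpy as np

def softmax_rows(x):
    x = np.asarray(x, dtype=float)
    row_max = np.max(x, axis=1)                   # step 1: 1D array, one max per row
    row_max = row_max.reshape(-1, 1)              # step 2: column shape for broadcasting
    e_x = np.exp(x - row_max)                     # step 3: exponentials of shifted scores
    row_sum = np.sum(e_x, axis=1).reshape(-1, 1)  # step 4: sum along the same axis
    return e_x / row_sum                          # step 5: normalize each row

x2 = np.array([[1, 2, 3, 6],
               [2, 4, 5, 6],
               [1, 2, 3, 6]])
probs = softmax_rows(x2)
print(probs)
```

Rows one and three of x2 are identical, so their softmax rows should be identical too, which is a quick sanity check for any batched implementation.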


Answer 18

The purpose of the softmax function is to preserve the ratio of the vectors as opposed to squashing the end-points with a sigmoid as the values saturate (i.e. tend to +/- 1 (tanh) or from 0 to 1 (logistical)). This is because it preserves more information about the rate of change at the end-points and thus is more applicable to neural nets with 1-of-N Output Encoding (i.e. if we squashed the end-points it would be harder to differentiate the 1-of-N output class because we can’t tell which one is the “biggest” or “smallest” because they got squished.); also it makes the total output sum to 1, and the clear winner will be closer to 1 while other numbers that are close to each other will sum to 1/p, where p is the number of output neurons with similar values.

The purpose of subtracting the max value from the vector is that when you do e^y exponents you may get very high value that clips the float at the max value leading to a tie, which is not the case in this example. This becomes a BIG problem if you subtract the max value to make a negative number, then you have a negative exponent that rapidly shrinks the values altering the ratio, which is what occurred in poster’s question and yielded the incorrect answer.

The answer supplied by Udacity is HORRIBLY inefficient. The first thing we need to do is calculate e^y_j for all vector components, KEEP THOSE VALUES, then sum them up, and divide. Where Udacity messed up is they calculate e^y_j TWICE!!! Here is the correct answer:

def softmax(y):
    e_to_the_y_j = np.exp(y)
    return e_to_the_y_j / np.sum(e_to_the_y_j, axis=0)

Answer 19

The goal was to achieve similar results using Numpy and Tensorflow. The only change from the original answer is the axis parameter passed to np.sum.

Initial approach: axis=0. This, however, does not give the intended results for N-dimensional input.

Modified approach: axis=len(e_x.shape)-1. This always sums over the last dimension, which gives results similar to tensorflow’s softmax function.

def softmax_fn(input_array):
    """
    | **@author**: Prathyush SP
    |
    | Calculate Softmax for a given array
    :param input_array: Input Array
    :return: Softmax Score
    """
    e_x = np.exp(input_array - np.max(input_array))
    # keepdims=True keeps the summed axis so the division broadcasts per row
    return e_x / e_x.sum(axis=len(e_x.shape)-1, keepdims=True)

Answer 20

Here is a generalized solution using numpy, compared for correctness against tensorflow and scipy:

Data preparation:

import numpy as np

np.random.seed(2019)

batch_size = 1
n_items = 3
n_classes = 2
logits_np = np.random.rand(batch_size,n_items,n_classes).astype(np.float32)
print('logits_np.shape', logits_np.shape)
print('logits_np:')
print(logits_np)

Output:

logits_np.shape (1, 3, 2)
logits_np:
[[[0.9034822  0.3930805 ]
  [0.62397    0.6378774 ]
  [0.88049906 0.299172  ]]]

Softmax using tensorflow:

import tensorflow as tf

logits_tf = tf.convert_to_tensor(logits_np, np.float32)
scores_tf = tf.nn.softmax(logits_np, axis=-1)

print('logits_tf.shape', logits_tf.shape)
print('scores_tf.shape', scores_tf.shape)

with tf.Session() as sess:
    scores_np = sess.run(scores_tf)

print('scores_np.shape', scores_np.shape)
print('scores_np:')
print(scores_np)

print('np.sum(scores_np, axis=-1).shape', np.sum(scores_np,axis=-1).shape)
print('np.sum(scores_np, axis=-1):')
print(np.sum(scores_np, axis=-1))

Output:

logits_tf.shape (1, 3, 2)
scores_tf.shape (1, 3, 2)
scores_np.shape (1, 3, 2)
scores_np:
[[[0.62490064 0.37509936]
  [0.4965232  0.5034768 ]
  [0.64137274 0.3586273 ]]]
np.sum(scores_np, axis=-1).shape (1, 3)
np.sum(scores_np, axis=-1):
[[1. 1. 1.]]

Softmax using scipy:

from scipy.special import softmax

scores_np = softmax(logits_np, axis=-1)

print('scores_np.shape', scores_np.shape)
print('scores_np:')
print(scores_np)

print('np.sum(scores_np, axis=-1).shape', np.sum(scores_np, axis=-1).shape)
print('np.sum(scores_np, axis=-1):')
print(np.sum(scores_np, axis=-1))

Output:

scores_np.shape (1, 3, 2)
scores_np:
[[[0.62490064 0.37509936]
  [0.4965232  0.5034768 ]
  [0.6413727  0.35862732]]]
np.sum(scores_np, axis=-1).shape (1, 3)
np.sum(scores_np, axis=-1):
[[1. 1. 1.]]

Softmax using numpy (https://nolanbconaway.github.io/blog/2017/softmax-numpy) :

def softmax(X, theta = 1.0, axis = None):
    """
    Compute the softmax of each element along an axis of X.

    Parameters
    ----------
    X: ND-Array. Probably should be floats.
    theta (optional): float parameter, used as a multiplier
        prior to exponentiation. Default = 1.0
    axis (optional): axis to compute values along. Default is the
        first non-singleton axis.

    Returns an array the same size as X. The result will sum to 1
    along the specified axis.
    """

    # make X at least 2d
    y = np.atleast_2d(X)

    # find axis
    if axis is None:
        axis = next(j[0] for j in enumerate(y.shape) if j[1] > 1)

    # multiply y against the theta parameter,
    y = y * float(theta)

    # subtract the max for numerical stability
    y = y - np.expand_dims(np.max(y, axis = axis), axis)

    # exponentiate y
    y = np.exp(y)

    # take the sum along the specified axis
    ax_sum = np.expand_dims(np.sum(y, axis = axis), axis)

    # finally: divide elementwise
    p = y / ax_sum

    # flatten if X was 1D
    if len(X.shape) == 1: p = p.flatten()

    return p


scores_np = softmax(logits_np, axis=-1)

print('scores_np.shape', scores_np.shape)
print('scores_np:')
print(scores_np)

print('np.sum(scores_np, axis=-1).shape', np.sum(scores_np, axis=-1).shape)
print('np.sum(scores_np, axis=-1):')
print(np.sum(scores_np, axis=-1))

Output:

scores_np.shape (1, 3, 2)
scores_np:
[[[0.62490064 0.37509936]
  [0.49652317 0.5034768 ]
  [0.64137274 0.3586273 ]]]
np.sum(scores_np, axis=-1).shape (1, 3)
np.sum(scores_np, axis=-1):
[[1. 1. 1.]]

Answer 21

The softmax function is an activation function that turns numbers into probabilities that sum to one. It outputs a vector that represents a probability distribution over a list of outcomes. It is also a core element of deep learning classification tasks.

The softmax function is used when we have multiple classes.

It is useful for finding the class with the maximum probability.

The softmax function is ideally used in the output layer, where we are trying to obtain the probability of each class for each input.

Each output ranges from 0 to 1.

The softmax function turns the logits [2.0, 1.0, 0.1] into the probabilities [0.7, 0.2, 0.1] (rounded), and the probabilities sum to 1. Logits are the raw scores output by the last layer of a neural network, before activation takes place. To understand the softmax function, we must look at the output of the (n-1)th layer.

The softmax function is, in effect, a smooth approximation of the arg max function: it does not return the largest value from the input, but puts most of the probability mass on the position of the largest value.

For example:

Before softmax

X = [13, 31, 5]

After softmax

array([1.52299795e-08, 9.99999985e-01, 5.10908895e-12])

Code:

import numpy as np

# your solution:
def your_softmax(x):
    """Compute softmax values for each sets of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

# correct solution:
def softmax(x):
    """Compute softmax values for each sets of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0) # only difference
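As a quick check of the two numeric examples in this answer, the following sketch recomputes them for 1-D inputs. The variable names and the float conversion are choices made for this illustration only.

```python
import numpy as np

def softmax(x):
    # 1-D softmax with the max subtracted for numerical stability.
    x = np.asarray(x, dtype=float)
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

logits = [2.0, 1.0, 0.1]
probs = softmax(logits)
print(probs.round(1))   # rounds to 0.7, 0.2, 0.1

scores = [13, 31, 5]
p = softmax(scores)
print(p)
# The largest probability sits at the same position as the largest score.
print(np.argmax(p) == np.argmax(scores))
```

Note that softmax returns a full probability vector; taking np.argmax of that vector is a separate step that recovers the position of the largest input.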

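numpy: most efficient frequency counts for unique values in an array

Question: numpy: most efficient frequency counts for unique values in an array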

In numpy / scipy, is there an efficient way to get frequency counts for unique values in an array?

Something along these lines:

x = array( [1,1,1,2,2,2,5,25,1,1] )
y = freq_count( x )
print y

>> [[1, 5], [2,3], [5,1], [25,1]]

(For you R users out there, I’m basically looking for the table() function.)


Answer 0

Take a look at np.bincount:

http://docs.scipy.org/doc/numpy/reference/generated/numpy.bincount.html

import numpy as np
x = np.array([1,1,1,2,2,2,5,25,1,1])
y = np.bincount(x)
ii = np.nonzero(y)[0]

And then:

list(zip(ii.tolist(), y[ii].tolist()))  # in Python 3, zip returns an iterator, so wrap it in list()
# [(1, 5), (2, 3), (5, 1), (25, 1)]

or:

np.vstack((ii,y[ii])).T
# array([[ 1,  5],
#        [ 2,  3],
#        [ 5,  1],
#        [25,  1]])

or however you want to combine the counts and the unique values.


Answer 1

As of Numpy 1.9, the easiest and fastest method is to simply use numpy.unique, which now has a return_counts keyword argument:

import numpy as np

x = np.array([1,1,1,2,2,2,5,25,1,1])
unique, counts = np.unique(x, return_counts=True)

print(np.asarray((unique, counts)).T)

Which gives:

 [[ 1  5]
  [ 2  3]
  [ 5  1]
  [25  1]]

A quick comparison with scipy.stats.itemfreq:

In [4]: x = np.random.random_integers(0,100,1e6)

In [5]: %timeit unique, counts = np.unique(x, return_counts=True)
10 loops, best of 3: 31.5 ms per loop

In [6]: %timeit scipy.stats.itemfreq(x)
10 loops, best of 3: 170 ms per loop
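Putting this together, the freq_count function asked for in the question could be sketched as below. The function name comes from the question’s pseudocode; the choice of np.column_stack and the two-column return format are assumptions made for this example.

```python
import numpy as np

def freq_count(x):
    # np.unique sorts the distinct values and, with return_counts=True,
    # also returns how many times each one occurs.
    values, counts = np.unique(np.asarray(x), return_counts=True)
    # One row per distinct value: [value, count].
    return np.column_stack((values, counts))

x = np.array([1, 1, 1, 2, 2, 2, 5, 25, 1, 1])
table = freq_count(x)
print(table)
```

Unlike np.bincount, which only accepts non-negative integers, this approach also works for negative values, floats and strings, although a two-column array of mixed types would then be upcast to a common dtype.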

Answer 2

Update: The method mentioned in the original answer is deprecated; use the new approach instead:

>>> import numpy as np
>>> x = [1,1,1,2,2,2,5,25,1,1]
>>> np.array(np.unique(x, return_counts=True)).T
    array([[ 1,  5],
           [ 2,  3],
           [ 5,  1],
           [25,  1]])

Original answer:

You can use scipy.stats.itemfreq:

>>> from scipy.stats import itemfreq
>>> x = [1,1,1,2,2,2,5,25,1,1]
>>> itemfreq(x)
/usr/local/bin/python:1: DeprecationWarning: `itemfreq` is deprecated! `itemfreq` is deprecated and will be removed in a future version. Use instead `np.unique(..., return_counts=True)`
array([[  1.,   5.],
       [  2.,   3.],
       [  5.,   1.],
       [ 25.,   1.]])

Answer 3

I was also interested in this, so I did a little performance comparison (using perfplot, a pet project of mine). Result:

y = np.bincount(a)
ii = np.nonzero(y)[0]
out = np.vstack((ii, y[ii])).T

is by far the fastest. (Note the log-scaling.)

[Image: log-log plot of runtime against len(a) for the methods benchmarked below]


Code to generate the plot:

import numpy as np
import pandas as pd
import perfplot
from scipy.stats import itemfreq


def bincount(a):
    y = np.bincount(a)
    ii = np.nonzero(y)[0]
    return np.vstack((ii, y[ii])).T


def unique(a):
    unique, counts = np.unique(a, return_counts=True)
    return np.asarray((unique, counts)).T


def unique_count(a):
    unique, inverse = np.unique(a, return_inverse=True)
    count = np.zeros(len(unique), dtype=int)  # np.int was removed in NumPy 1.24
    np.add.at(count, inverse, 1)
    return np.vstack((unique, count)).T


def pandas_value_counts(a):
    out = pd.value_counts(pd.Series(a))
    out.sort_index(inplace=True)
    out = np.stack([out.keys().values, out.values]).T
    return out


perfplot.show(
    setup=lambda n: np.random.randint(0, 1000, n),
    kernels=[bincount, unique, itemfreq, unique_count, pandas_value_counts],
    n_range=[2 ** k for k in range(26)],
    logx=True,
    logy=True,
    xlabel="len(a)",
)

Answer 4

Using the pandas module:

>>> import pandas as pd
>>> import numpy as np
>>> x = np.array([1,1,1,2,2,2,5,25,1,1])
>>> pd.value_counts(x)
1     5
2     3
25    1
5     1
dtype: int64

Answer 5

This is by far the most general and performant solution; surprised it hasn’t been posted yet.

import numpy as np

def unique_count(a):
    unique, inverse = np.unique(a, return_inverse=True)
    count = np.zeros(len(unique), dtype=int)  # np.int was removed in NumPy 1.24
    np.add.at(count, inverse, 1)
    return np.vstack(( unique, count)).T

print(unique_count(np.random.randint(-10,10,100)))

Unlike the currently accepted answer, it works on any datatype that is sortable (not just positive ints), and it has optimal performance; the only significant expense is in the sorting done by np.unique.
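
To illustrate the point about sortable, non-integer data, here is a small variant of the function above. As an assumption of this sketch, it returns the values and counts as two separate arrays rather than stacking them, because stacking would coerce the counts to the dtype of the values (for example, to strings).

```python
import numpy as np

def unique_count(a):
    # inverse[i] is the index into `values` of element a[i]
    values, inverse = np.unique(a, return_inverse=True)
    counts = np.zeros(len(values), dtype=int)
    # unbuffered add: every repeated index in `inverse` is counted
    np.add.at(counts, inverse.ravel(), 1)
    return values, counts

neg_values, neg_counts = unique_count(np.array([-3, -3, 0, 7, -3, 7]))
str_values, str_counts = unique_count(np.array(["b", "a", "b", "c", "b"]))
print(neg_values, neg_counts)
print(str_values, str_counts)
```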


Answer 6

numpy.bincount is probably the best choice. If your array contains anything besides small dense integers, it might be useful to wrap it like this:

def count_unique(keys):
    uniq_keys = np.unique(keys)
    bins = uniq_keys.searchsorted(keys)
    return uniq_keys, np.bincount(bins)

For example:

>>> x = np.array([1,1,1,2,2,2,5,25,1,1])
>>> count_unique(x)
(array([ 1,  2,  5, 25]), array([5, 3, 1, 1]))
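
The wrapper is most useful when the keys are not small dense integers. The sketch below applies the same function to floating-point keys; the sample values are made up for illustration.

```python
import numpy as np

def count_unique(keys):
    uniq_keys = np.unique(keys)              # sorted distinct keys
    bins = uniq_keys.searchsorted(keys)      # position of each key within uniq_keys
    return uniq_keys, np.bincount(bins)      # bins are small dense non-negative ints

keys = np.array([0.5, -1.25, 0.5, 3.0, -1.25, 0.5])
uniq, counts = count_unique(keys)
print(uniq, counts)
```

Because searchsorted maps every key to its index in the sorted unique array, np.bincount only ever sees indices from 0 to len(uniq_keys) - 1, whatever the original values were.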

Answer 7

Even though this has already been answered, I suggest a different approach that uses numpy.histogram. Given a sequence, this function returns the frequency of its elements grouped into bins.

Beware though: it works in this example because the numbers are integers. If they were real numbers, this solution would not apply as nicely.

>>> from numpy import histogram
>>> y = histogram (x, bins=x.max()-1)
>>> y
(array([5, 3, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1]),
 array([  1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.,  11.,
        12.,  13.,  14.,  15.,  16.,  17.,  18.,  19.,  20.,  21.,  22.,
        23.,  24.,  25.]))
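
For integer data, one way to make the bins line up exactly with the values is to pass explicit bin edges instead of a bin count. This is a sketch under the assumption that x holds integers; the names edges, present and result are illustrative.

```python
import numpy as np

x = np.array([1, 1, 1, 2, 2, 2, 5, 25, 1, 1])

# one bin per integer: edges at min, min+1, ..., max+1
edges = np.arange(x.min(), x.max() + 2)
counts, _ = np.histogram(x, bins=edges)

# bin i covers [edges[i], edges[i+1]), so it counts the integer edges[i]
# (the last bin is closed on the right, but max+1 never occurs in x)
present = counts > 0
result = list(zip(edges[:-1][present].tolist(), counts[present].tolist()))
print(result)
```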

Answer 8

import pandas as pd
import numpy as np
x = np.array( [1,1,1,2,2,2,5,25,1,1] )
print(dict(pd.Series(x).value_counts()))

This gives you: {1: 5, 2: 3, 5: 1, 25: 1}


Answer 9

To count unique non-integers, I used weave.inline to combine numpy.unique with a bit of C code. This is similar to Eelco Hoogendoorn’s answer but considerably faster (a factor of 5 on my machine):

import numpy as np
from scipy import weave

def count_unique(datain):
  """
  Similar to numpy.unique function for returning unique members of
  data, but also returns their counts
  """
  data = np.sort(datain)
  uniq = np.unique(data)
  nums = np.zeros(uniq.shape, dtype='int')

  code="""
  int i,count,j;
  j=0;
  count=0;
  for(i=1; i<Ndata[0]; i++){
      count++;
      if(data(i) > data(i-1)){
          nums(j) = count;
          count = 0;
          j++;
      }
  }
  // Handle last value
  nums(j) = count+1;
  """
  weave.inline(code,
      ['data', 'nums'],
      extra_compile_args=['-O2'],
      type_converters=weave.converters.blitz)
  return uniq, nums

Profile info

> %timeit count_unique(data)
> 10000 loops, best of 3: 55.1 µs per loop

Eelco’s pure numpy version:

> %timeit unique_count(data)
> 1000 loops, best of 3: 284 µs per loop

Note

There’s redundancy here (unique performs a sort also), meaning that the code could probably be further optimized by putting the unique functionality inside the c-code loop.


Answer 10

Old question, but I’d like to provide my own solution, which turned out to be the fastest in my benchmarks: use a plain list instead of an np.array as input (or convert to a list first).

Check it out if you run into this as well.

def count(a):
    results = {}
    for x in a:
        if x not in results:
            results[x] = 1
        else:
            results[x] += 1
    return results

For example,

>>>timeit count([1,1,1,2,2,2,5,25,1,1]) would return:

100000 loops, best of 3: 2.26 µs per loop

>>>timeit count(np.array([1,1,1,2,2,2,5,25,1,1]))

100000 loops, best of 3: 8.8 µs per loop

>>>timeit count(np.array([1,1,1,2,2,2,5,25,1,1]).tolist())

100000 loops, best of 3: 5.85 µs per loop

The accepted answer is slower, and the scipy.stats.itemfreq solution is even worse.


More in-depth testing did not confirm this expectation.

from zmq import Stopwatch
aZmqSTOPWATCH = Stopwatch()

aDataSETasARRAY = ( 100 * abs( np.random.randn( 150000 ) ) ).astype( int )  # np.int was removed in NumPy 1.24
aDataSETasLIST  = aDataSETasARRAY.tolist()

import numba
@numba.jit
def numba_bincount( anObject ):
    np.bincount(    anObject )
    return

aZmqSTOPWATCH.start();np.bincount(    aDataSETasARRAY );aZmqSTOPWATCH.stop()
14328L

aZmqSTOPWATCH.start();numba_bincount( aDataSETasARRAY );aZmqSTOPWATCH.stop()
592L

aZmqSTOPWATCH.start();count(          aDataSETasLIST  );aZmqSTOPWATCH.stop()
148609L

See the comments below on caching and other in-RAM side effects that influence the results of massively repeated tests on a small dataset.


Answer 11

Something like this should do it:

import numpy

#create 100 random numbers
arr = numpy.random.randint(0, 51, 100)

#create a dictionary of the unique values
d = dict([(i,0) for i in numpy.unique(arr)])
for number in arr:
    d[number]+=1   #increment when that value is found

Also, this previous post on Efficiently counting unique elements seems pretty similar to your question, unless I’m missing something.


Answer 12

Multi-dimensional frequency count, i.e. counting arrays.

>>> print(color_array    )
  array([[255, 128, 128],
   [255, 128, 128],
   [255, 128, 128],
   ...,
   [255, 128, 128],
   [255, 128, 128],
   [255, 128, 128]], dtype=uint8)


>>> np.unique(color_array,return_counts=True,axis=0)
  (array([[ 60, 151, 161],
    [ 60, 155, 162],
    [ 60, 159, 163],
    [ 61, 143, 162],
    [ 61, 147, 162],
    [ 61, 162, 163],
    [ 62, 166, 164],
    [ 63, 137, 162],
    [ 63, 169, 164],
   array([     1,      2,      2,      1,      4,      1,      1,      2,
         3,      1,      1,      1,      2,      5,      2,      2,
       898,      1,      1,  
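
Since the output above is truncated, here is a small self-contained sketch of the same call. The five-row color_array below is a hypothetical stand-in for the image data, not the original array.

```python
import numpy as np

# hypothetical stand-in for color_array: one RGB triple per row
color_array = np.array([[255, 128, 128],
                        [ 60, 151, 161],
                        [255, 128, 128],
                        [ 60, 151, 161],
                        [255, 128, 128]], dtype=np.uint8)

# axis=0 treats each row as one item, so whole colors are counted
colors, counts = np.unique(color_array, axis=0, return_counts=True)
for color, n in zip(colors.tolist(), counts.tolist()):
    print(color, n)
```

Without axis=0, np.unique would flatten the array and count individual channel values instead of whole rows.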

Answer 13

import pandas as pd
import numpy as np

print(pd.Series(name_of_array).value_counts())

Answer 14

from collections import Counter
import numpy as np

x = np.array( [1,1,1,2,2,2,5,25,1,1] )
counter = Counter(x.tolist())   # maps each value to its number of occurrences
mode = counter.most_common(1)[0][0]

What is the difference between ndarray and array in NumPy?

Question: What is the difference between ndarray and array in NumPy?

What is the difference between ndarray and array in Numpy? And where can I find the implementations in the numpy source code?


Answer 0

numpy.array is just a convenience function to create an ndarray; it is not a class itself.

You can also create an array using numpy.ndarray, but it is not the recommended way. From the docstring of numpy.ndarray:

Arrays should be constructed using array, zeros or empty … The parameters given here refer to a low-level method (ndarray(...)) for instantiating an array.

Most of the meat of the implementation is in C code, here in multiarray, but you can start looking at the ndarray interfaces here:

https://github.com/numpy/numpy/blob/master/numpy/core/numeric.py


Answer 1

numpy.array is a function that returns a numpy.ndarray. There is no object type numpy.array.
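
This can be checked interactively. The short sketch below uses only built-in introspection (isinstance, type, callable); the variable names are illustrative.

```python
import numpy as np

a = np.array([1, 2, 3])

is_class_ndarray = isinstance(np.ndarray, type)   # ndarray is a class
is_class_array = isinstance(np.array, type)       # array is a plain callable, not a class
print(type(a).__name__, is_class_ndarray, is_class_array, callable(np.array))
```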


Answer 2

Just a few lines of example code to show the difference between numpy.array and numpy.ndarray

Warm up step: Construct a list

a = [1,2,3]

Check the type

print(type(a))

You will get

<class 'list'>

Construct an array (from a list) using np.array

a = np.array(a)

Or, you can skip the warm up step, directly have

a = np.array([1,2,3])

Check the type

print(type(a))

You will get

<class 'numpy.ndarray'>

which tells you the type of the numpy array is numpy.ndarray

You can also check the type by

isinstance(a, (np.ndarray))

and you will get

True

Either of the following two lines will give you an error message

np.ndarray(a)                # should be np.array(a)
isinstance(a, (np.array))    # should be isinstance(a, (np.ndarray))

Answer 3

numpy.ndarray() is a class, while numpy.array() is a method / function to create ndarray.

According to the NumPy docs, there are two ways to create an array from the ndarray class, as quoted:

1- using array(), zeros() or empty() methods: Arrays should be constructed using array, zeros or empty (refer to the See Also section below). The parameters given here refer to a low-level method (ndarray(…)) for instantiating an array.

2- from ndarray class directly: There are two modes of creating an array using __new__: If buffer is None, then only shape, dtype, and order are used. If buffer is an object exposing the buffer interface, then all keywords are interpreted.

The example below gives a random array because we didn’t assign buffer value:

np.ndarray(shape=(2,2), dtype=float, order='F', buffer=None)

array([[ -1.13698227e+002,   4.25087011e-303],
       [  2.88528414e-306,   3.27025015e-309]])         #random

Another example assigns an array object to buffer:

>>> np.ndarray((2,), buffer=np.array([1,2,3]),
...            offset=np.int_().itemsize,
...            dtype=int) # offset = 1*itemsize, i.e. skip first element
array([2, 3])

From the example above, we notice that we can’t assign a list to “buffer”; we had to use numpy.array() to create an ndarray object for the buffer.

Conclusion: use numpy.array() if you want to make a numpy.ndarray object.
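
One consequence of the buffer argument worth seeing in code is that no data is copied: the new ndarray is a window onto the memory of the object passed as buffer. A minimal sketch, with base and view as illustrative names:

```python
import numpy as np

base = np.array([1, 2, 3])

# View of `base` that skips the first element: the offset is given in bytes
view = np.ndarray((2,), dtype=base.dtype, buffer=base, offset=base.itemsize)

view[0] = 20          # writes through to base[1], because no data was copied
print(base, view)
```

This low-level behavior is also a reason the docstring steers everyday code toward array, zeros or empty.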


Answer 4

I think that with np.array() you can only create C-ordered arrays: even if you specify the order, np.isfortran() returns False when you check. With np.ndarray(), however, the array is created with the order you specify.
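
This claim can be checked directly. In the sketch below, which uses only np.array, np.isfortran and the flags attribute, np.array does appear to honor order='F' for a two-dimensional input, so the behavior described above may depend on the input (for example, a 1-D array is both C- and Fortran-contiguous, and np.isfortran returns False for it).

```python
import numpy as np

data = [[1, 2, 3], [4, 5, 6]]

c_arr = np.array(data)               # default: C (row-major) layout
f_arr = np.array(data, order='F')    # request Fortran (column-major) layout

print(np.isfortran(c_arr), np.isfortran(f_arr))
print(f_arr.flags['F_CONTIGUOUS'], f_arr.flags['C_CONTIGUOUS'])
```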


Better way to shuffle two numpy arrays in unison

Question: Better way to shuffle two numpy arrays in unison

I have two numpy arrays of different shapes, but with the same length (leading dimension). I want to shuffle each of them, such that corresponding elements continue to correspond — i.e. shuffle them in unison with respect to their leading indices.

This code works, and illustrates my goals:

def shuffle_in_unison(a, b):
    assert len(a) == len(b)
    shuffled_a = numpy.empty(a.shape, dtype=a.dtype)
    shuffled_b = numpy.empty(b.shape, dtype=b.dtype)
    permutation = numpy.random.permutation(len(a))
    for old_index, new_index in enumerate(permutation):
        shuffled_a[new_index] = a[old_index]
        shuffled_b[new_index] = b[old_index]
    return shuffled_a, shuffled_b

For example:

>>> a = numpy.asarray([[1, 1], [2, 2], [3, 3]])
>>> b = numpy.asarray([1, 2, 3])
>>> shuffle_in_unison(a, b)
(array([[2, 2],
       [1, 1],
       [3, 3]]), array([2, 1, 3]))

However, this feels clunky, inefficient, and slow, and it requires making a copy of the arrays — I’d rather shuffle them in-place, since they’ll be quite large.

Is there a better way to go about this? Faster execution and lower memory usage are my primary goals, but elegant code would be nice, too.

One other thought I had was this:

def shuffle_in_unison_scary(a, b):
    rng_state = numpy.random.get_state()
    numpy.random.shuffle(a)
    numpy.random.set_state(rng_state)
    numpy.random.shuffle(b)

This works…but it’s a little scary, as I see little guarantee it’ll continue to work; it doesn’t look like the sort of thing that’s guaranteed to survive across numpy versions, for example.


Answer 0

Your “scary” solution does not appear scary to me. Calling shuffle() for two sequences of the same length results in the same number of calls to the random number generator, and these are the only “random” elements in the shuffle algorithm. By resetting the state, you ensure that the calls to the random number generator will give the same results in the second call to shuffle(), so the whole algorithm will generate the same permutation.
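
The argument can be checked with a small experiment. This sketch uses two 1-D arrays of equal length built so that a[i] == 10 * b[i]; if both shuffles apply the same permutation, that relation must still hold afterwards. The array contents are made up for the check.

```python
import numpy as np

a = np.arange(10) * 10       # 0, 10, 20, ...
b = np.arange(10)            # 0, 1, 2, ...   so a[i] == 10 * b[i]

state = np.random.get_state()    # snapshot of the global generator
np.random.shuffle(a)             # in-place shuffle, advances the generator
np.random.set_state(state)       # rewind to the snapshot
np.random.shuffle(b)             # same length, same draws, same permutation

print(a, b)
```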

If you don’t like this, a different solution would be to store your data in one array instead of two right from the beginning, and create two views into this single array simulating the two arrays you have now. You can use the single array for shuffling and the views for all other purposes.

Example: Let’s assume the arrays a and b look like this:

a = numpy.array([[[  0.,   1.,   2.],
                  [  3.,   4.,   5.]],

                 [[  6.,   7.,   8.],
                  [  9.,  10.,  11.]],

                 [[ 12.,  13.,  14.],
                  [ 15.,  16.,  17.]]])

b = numpy.array([[ 0.,  1.],
                 [ 2.,  3.],
                 [ 4.,  5.]])

We can now construct a single array containing all the data:

c = numpy.c_[a.reshape(len(a), -1), b.reshape(len(b), -1)]
# array([[  0.,   1.,   2.,   3.,   4.,   5.,   0.,   1.],
#        [  6.,   7.,   8.,   9.,  10.,  11.,   2.,   3.],
#        [ 12.,  13.,  14.,  15.,  16.,  17.,   4.,   5.]])

Now we create views simulating the original a and b:

a2 = c[:, :a.size//len(a)].reshape(a.shape)
b2 = c[:, a.size//len(a):].reshape(b.shape)

The data of a2 and b2 is shared with c. To shuffle both arrays simultaneously, use numpy.random.shuffle(c).

In production code, you would of course try to avoid creating the original a and b at all and right away create c, a2 and b2.

This solution could be adapted to the case that a and b have different dtypes.
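
Putting the pieces above together gives the following runnable sketch. The arrays are generated with np.arange so that they match the values shown above; the variable split and the final check are additions for illustration.

```python
import numpy as np

a = np.arange(18, dtype=float).reshape(3, 2, 3)
b = np.arange(6, dtype=float).reshape(3, 2)

# One row per sample: flattened a-part followed by flattened b-part
c = np.c_[a.reshape(len(a), -1), b.reshape(len(b), -1)]

split = a.size // len(a)                 # number of columns that belong to a
a2 = c[:, :split].reshape(a.shape)       # view into c
b2 = c[:, split:].reshape(b.shape)       # view into c

np.random.shuffle(c)                     # shuffles rows of c, and thus a2 and b2

# a[i, 0, 0] == 6 * i and b[i, 0] == 2 * i originally, so rows still pair up if
# the two printed arrays are identical
print(a2[:, 0, 0] / 6, b2[:, 0] / 2)
```

Note that building c copies the data once; after that, every shuffle of c is in-place and is visible through a2 and b2 without further copies.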


Answer 1

You can use NumPy’s array indexing:

def unison_shuffled_copies(a, b):
    assert len(a) == len(b)
    p = numpy.random.permutation(len(a))
    return a[p], b[p]

This will result in creation of separate unison-shuffled arrays.
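
A usage sketch with the arrays from the question follows. As an assumption not in the original answer, the function takes an optional rng argument (a np.random.Generator, available in NumPy 1.17 and later) so that the result can be reproduced.

```python
import numpy as np

def unison_shuffled_copies(a, b, rng=None):
    # rng is a hypothetical optional argument, added here for reproducibility
    assert len(a) == len(b)
    rng = np.random.default_rng() if rng is None else rng
    p = rng.permutation(len(a))      # one random ordering of the row indices
    return a[p], b[p]                # fancy indexing returns shuffled copies

a = np.array([[1, 1], [2, 2], [3, 3]])
b = np.array([1, 2, 3])
sa, sb = unison_shuffled_copies(a, b, rng=np.random.default_rng(0))
print(sa, sb)
```

The inputs are left untouched, which costs one extra copy of each array; that is the trade-off against the in-place approaches discussed elsewhere on this page.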


Answer 2

import numpy as np
from sklearn.utils import shuffle

X = np.array([[1., 0.], [2., 1.], [0., 0.]])
y = np.array([0, 1, 2])
X, y = shuffle(X, y, random_state=0)

To learn more, see http://scikit-learn.org/stable/modules/generated/sklearn.utils.shuffle.html


Answer 3

Very simple solution:

randomize = np.arange(len(x))
np.random.shuffle(randomize)
x = x[randomize]
y = y[randomize]

The two arrays x and y are now both randomly shuffled in the same way.


Answer 4

In 2015 James wrote a helpful sklearn solution, but he added a random state variable, which is not needed. In the code below, the random state from numpy is used automatically.

X = np.array([[1., 0.], [2., 1.], [0., 0.]])
y = np.array([0, 1, 2])
from sklearn.utils import shuffle
X, y = shuffle(X, y)

Answer 5

from numpy.random import permutation
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data #numpy array
y = iris.target #numpy array

# Data is currently unshuffled; we should shuffle 
# each X[i] with its corresponding y[i]
perm = permutation(len(X))
X = X[perm]
y = y[perm]

Answer 6

Shuffle any number of arrays together, in-place, using only NumPy.

import numpy as np


def shuffle_arrays(arrays, set_seed=-1):
    """Shuffles arrays in-place, in the same order, along axis=0

    Parameters:
    -----------
    arrays : List of NumPy arrays.
    set_seed : Seed value if int >= 0, else seed is random.
    """
    assert all(len(arr) == len(arrays[0]) for arr in arrays)
    seed = np.random.randint(0, 2**(32 - 1) - 1) if set_seed < 0 else set_seed

    for arr in arrays:
        rstate = np.random.RandomState(seed)
        rstate.shuffle(arr)

And can be used like this

a = np.array([1, 2, 3, 4, 5])
b = np.array([10,20,30,40,50])
c = np.array([[1,10,11], [2,20,22], [3,30,33], [4,40,44], [5,50,55]])

shuffle_arrays([a, b, c])

A few things to note:

  • The assert ensures that all input arrays have the same length along their first dimension.
  • Arrays shuffled in-place by their first dimension – nothing returned.
  • Random seed within positive int32 range.
  • If a repeatable shuffle is needed, seed value can be set.

After the shuffle, the data can be split using np.split or referenced using slices – depending on the application.


Answer 7

You can make an index array like:

s = np.arange(0, len(a), 1)

Then shuffle it:

np.random.shuffle(s)

Now use s to index your arrays. The same shuffled indices give the same shuffled order for both.

x_data = x_data[s]
x_label = x_label[s]

Answer 8

One way in which in-place shuffling can be done for connected lists is using a seed (it could be random) and using numpy.random.shuffle to do the shuffling.

# Set seed to a random number if you want the shuffling to be non-deterministic.
def shuffle(a, b, seed):
   np.random.seed(seed)
   np.random.shuffle(a)
   np.random.seed(seed)
   np.random.shuffle(b)

That’s it. This will shuffle both a and b in the exact same way. This is also done in-place which is always a plus.

EDIT: don’t use np.random.seed(); use np.random.RandomState instead.

def shuffle(a, b, seed):
   rand_state = np.random.RandomState(seed)
   rand_state.shuffle(a)
   rand_state.seed(seed)
   rand_state.shuffle(b)

When calling it just pass in any seed to feed the random state:

a = [1,2,3,4]
b = [11, 22, 33, 44]
shuffle(a, b, 12345)

Output:

>>> a
[1, 4, 2, 3]
>>> b
[11, 44, 22, 33]

Edit: Fixed code to re-seed the random state


Answer 9

There is a well-known function that can handle this:

from sklearn.model_selection import train_test_split
X, _, Y, _ = train_test_split(X,Y, test_size=0.0)

Just setting test_size to 0 will avoid splitting and give you shuffled data. Though it is usually used to split train and test data, it does shuffle them too.
From the documentation:

Split arrays or matrices into random train and test subsets

Quick utility that wraps input validation and next(ShuffleSplit().split(X, y)) and application to input data into a single call for splitting (and optionally subsampling) data in a oneliner.


Answer 10

Say we have two arrays: a and b.

a = np.array([[1,2,3],[4,5,6],[7,8,9]])
b = np.array([[9,1,1],[6,6,6],[4,2,0]]) 

We can first obtain row indices by permuting the first dimension:

indices = np.random.permutation(a.shape[0])
[1 2 0]

Then use advanced indexing. Here we are using the same indices to shuffle both arrays in unison.

a_shuffled = a[indices[:,np.newaxis], np.arange(a.shape[1])]
b_shuffled = b[indices[:,np.newaxis], np.arange(b.shape[1])]

This is equivalent to

np.take(a, indices, axis=0)
[[4 5 6]
 [7 8 9]
 [1 2 3]]

np.take(b, indices, axis=0)
[[6 6 6]
 [4 2 0]
 [9 1 1]]
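
A quick check of the equivalence claimed above, with the permutation fixed to [1, 2, 0] so the result is reproducible. The third form, a[indices], is my own addition: indexing only the first axis with an integer array also reorders whole rows, and is probably the shortest way to write this.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
b = np.array([[9, 1, 1], [6, 6, 6], [4, 2, 0]])

indices = np.array([1, 2, 0])  # fixed instead of random, for reproducibility

via_fancy = a[indices[:, np.newaxis], np.arange(a.shape[1])]
via_take = np.take(a, indices, axis=0)
via_rows = a[indices]          # index the first axis only

# The same index array keeps a and b aligned row by row.
a_shuffled, b_shuffled = a[indices], b[indices]
```

All three forms return copies, so a and b are not modified.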

Answer 11

If you want to avoid copying arrays, then I would suggest that, instead of generating a permutation list, you go through every element in the array and randomly swap it with another position in the array:

import numpy

for old_index in range(len(a)):
    # pick a position in a[:old_index+1] to swap with
    new_index = numpy.random.randint(old_index+1)
    a[old_index], a[new_index] = a[new_index], a[old_index]
    b[old_index], b[new_index] = b[new_index], b[old_index]

This implements the Knuth-Fisher-Yates shuffle algorithm.
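
One caveat with the tuple swap: it is fine for Python lists and 1-D arrays, but for a multi-dimensional NumPy array a[i] is a view, so a[i], a[j] = a[j], a[i] ends up copying one row over the other instead of swapping them. Below is a rough sketch of a variant that swaps through integer-array indexing, which copies the right-hand side first. The helper name shuffle_rows_in_place and the use of np.random.default_rng are my own choices, not part of the answer above.

```python
import numpy as np

def shuffle_rows_in_place(a, b, rng):
    """Fisher-Yates over the first axis of two equally long arrays (hypothetical helper)."""
    assert len(a) == len(b)
    for i in range(len(a) - 1, 0, -1):
        j = int(rng.integers(0, i + 1))  # uniform in [0, i]
        # a[[j, i]] is a copy, so this is a true swap even when rows are 2-D views.
        a[[i, j]] = a[[j, i]]
        b[[i, j]] = b[[j, i]]

rng = np.random.default_rng(0)
a = np.arange(12).reshape(6, 2)   # row k is [2k, 2k+1]
b = np.arange(6)                  # label k belongs to row k
shuffle_rows_in_place(a, b, rng)
```

Because row k starts as [2k, 2k+1], comparing a[:, 0] with 2 * b afterwards shows whether the rows and labels stayed paired.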


Answer 12

This seems like a very simple solution:

import numpy as np
def shuffle_in_unison(a,b):

    assert len(a)==len(b)
    c = np.arange(len(a))
    np.random.shuffle(c)

    return a[c],b[c]

a =  np.asarray([[1, 1], [2, 2], [3, 3]])
b =  np.asarray([11, 22, 33])

shuffle_in_unison(a,b)
Out[94]: 
(array([[3, 3],
        [2, 2],
        [1, 1]]),
 array([33, 22, 11]))

Answer 13

Here is what I'm doing, as an example:

import random
import numpy as np

# Pair each image with its label so the two move together.
combo = list(zip(images, labels))

random.shuffle(combo)

# Unzip the shuffled pairs back into two arrays.
im, lab = zip(*combo)
images = np.asarray(im)
labels = np.asarray(lab)

Answer 14

I extended Python's random.shuffle() to take a second argument:

import random

def shuffle_together(x, y):
    assert len(x) == len(y)

    # range() here is Python 3; the original used Python 2's xrange()
    for i in reversed(range(1, len(x))):
        # pick an element in x[:i+1] with which to exchange x[i]
        j = int(random.random() * (i+1))
        x[i], x[j] = x[j], x[i]
        y[i], y[j] = y[j], y[i]

That way I can be sure that the shuffling happens in-place, and the function is not all too long or complicated.


Answer 15

Just use NumPy.

First merge the two input arrays (the 1D array holds the labels, y, and the 2D array holds the data, x) and shuffle the merged array with NumPy's shuffle method. Finally, split them apart again and return them.

import numpy as np

def shuffle_2d(a, b):
    rows= a.shape[0]
    if b.shape != (rows,1):
        b = b.reshape((rows,1))
    S = np.hstack((b,a))
    np.random.shuffle(S)
    b, a  = S[:,0], S[:,1:]
    return a,b

features, samples = 2, 5
x, y = np.random.random((samples, features)), np.arange(samples)
x, y = shuffle_2d(x, y)   # pass the arrays defined above
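
One thing to watch with the stacking approach: np.hstack promotes both inputs to a common dtype, so integer labels come back as floats whenever the data is floating point. A small sketch of that effect, with a cast back to the original label dtype, follows. The compact shuffle_2d here is a restatement of the function above so the snippet runs on its own, and the sample data is made up.

```python
import numpy as np

def shuffle_2d(a, b):
    # Same stacking approach as above, repeated so the snippet is self-contained.
    rows = a.shape[0]
    S = np.hstack((b.reshape((rows, 1)), a))
    np.random.shuffle(S)               # shuffles the rows of S in place
    return S[:, 1:], S[:, 0]

x = np.arange(10, dtype=float).reshape(5, 2) / 10   # row k is [0.2k, 0.2k + 0.1]
y = np.arange(5)                                     # integer labels 0..4

x_s, y_s = shuffle_2d(x, y)
label_dtype = y_s.dtype                # float64 after hstack's type promotion
y_s = y_s.astype(y.dtype)              # cast back to the original integer dtype
```

The round trip through float64 is exact for small integer labels, but it would not be safe for very large integers or for non-numeric labels, where an index-based shuffle is the better fit.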