Tag archive: numpy

Confusion between numpy, scipy, matplotlib and pylab

Question: Confusion between numpy, scipy, matplotlib and pylab

Numpy, scipy, matplotlib, and pylab are common terms among those who use Python for scientific computation.

I just learned a bit about pylab, and I got confused. Whenever I want to import numpy, I can always do:

import numpy as np

I assumed that once I do

from pylab import *

numpy will be imported as well (with the np alias). So basically the second one does more things than the first one.

There are a few things I want to ask:

  1. Is it right that pylab is just a wrapper for numpy, scipy and matplotlib?
  2. As np is the numpy alias in pylab, what are the scipy and matplotlib aliases in pylab? (As far as I know, plt is the alias of matplotlib.pyplot, but I don’t know the alias for matplotlib itself.)

Answer 0

  1. No, pylab is part of matplotlib (in matplotlib.pylab) and tries to give you a MATLAB-like environment. matplotlib has a number of dependencies, among them numpy, which it imports under the common alias np. scipy is not a dependency of matplotlib.

  2. If you run ipython --pylab, an automatic import will put all symbols from matplotlib.pylab into the global scope. As you wrote, numpy gets imported under the np alias. Symbols from matplotlib are available under the mpl alias.


Answer 1

Scipy and numpy are scientific projects whose aim is to bring efficient and fast numeric computing to python.

Matplotlib is the name of the python plotting library.

Pyplot is matplotlib’s state-based interface, convenient for interactive work, for example in notebooks like Jupyter. You generally use it like this: import matplotlib.pyplot as plt.

Pylab is the same thing as pyplot, but with extra features (its use is currently discouraged).

  • pylab = pyplot + numpy

See more information here: Matplotlib, Pylab, Pyplot, etc: What’s the difference between these and when to use each?


Answer 2

Since some people (like me) may still be confused about usage of pylab since examples using pylab are out there on the internet, here is a quote from the official matplotlib FAQ:

pylab is a convenience module that bulk imports matplotlib.pyplot (for plotting) and numpy (for mathematics and working with arrays) in a single name space. Although many examples use pylab, it is no longer recommended.

So the TL;DR is: do not use pylab, period. Use pyplot and import numpy separately as needed.

Here is the link for further reading and other useful examples.


NumPy where with multiple conditions

Question: NumPy where with multiple conditions

I have an array of distances called dists. I want to select dists which are between two values. I wrote the following line of code to do that:

 dists[(np.where(dists >= r)) and (np.where(dists <= r + dr))]

However this selects only for the condition

 (np.where(dists <= r + dr))

If I do the commands sequentially by using a temporary variable it works fine. Why does the above code not work, and how do I get it to work?

Cheers


Answer 0

The best way in your particular case would just be to change your two criteria to one criterion:

dists[abs(dists - r - dr/2.) <= dr/2.]

It only creates one boolean array, and in my opinion is easier to read because it asks: is dist within dr/2 of the center of the interval? (Though I’d redefine r to be the center of your region of interest instead of the beginning, so r = r + dr/2.) But that doesn’t answer your question.


The answer to your question:
You don’t actually need where if you’re just trying to filter out the elements of dists that don’t fit your criteria:

dists[(dists >= r) & (dists <= r+dr)]

Because the & will give you an elementwise and (the parentheses are necessary).

Or, if you do want to use where for some reason, you can do:

 dists[(np.where((dists >= r) & (dists <= r + dr)))]

Why:
The reason it doesn’t work is that np.where returns a tuple of index arrays, not a boolean array. You’re trying to apply and to two sets of indices, which of course don’t have the True/False values that you expect. If a and b are both truthy, then a and b returns b. So saying something like [0,1,2] and [2,3,4] will just give you [2,3,4]. Here it is in action:

In [230]: dists = np.arange(0,10,.5)
In [231]: r = 5
In [232]: dr = 1

In [233]: np.where(dists >= r)
Out[233]: (array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19]),)

In [234]: np.where(dists <= r+dr)
Out[234]: (array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12]),)

In [235]: np.where(dists >= r) and np.where(dists <= r+dr)
Out[235]: (array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12]),)

What you were expecting to compare was simply the boolean array, for example

In [236]: dists >= r
Out[236]: 
array([False, False, False, False, False, False, False, False, False,
       False,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True], dtype=bool)

In [237]: dists <= r + dr
Out[237]: 
array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True, False, False, False, False, False,
       False, False], dtype=bool)

In [238]: (dists >= r) & (dists <= r + dr)
Out[238]: 
array([False, False, False, False, False, False, False, False, False,
       False,  True,  True,  True, False, False, False, False, False,
       False, False], dtype=bool)

Now you can call np.where on the combined boolean array:

In [239]: np.where((dists >= r) & (dists <= r + dr))
Out[239]: (array([10, 11, 12]),)

In [240]: dists[np.where((dists >= r) & (dists <= r + dr))]
Out[240]: array([ 5. ,  5.5,  6. ])

Or simply index the original array with the boolean array directly (boolean mask indexing):

In [241]: dists[(dists >= r) & (dists <= r + dr)]
Out[241]: array([ 5. ,  5.5,  6. ])
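
The steps of the session above can be collected into one self-contained script. This is a minimal sketch that reuses the sample values from the session; the names mask, selected and indices are illustrative only.

```python
import numpy as np

dists = np.arange(0, 10, 0.5)
r, dr = 5, 1

# Element-wise AND of two boolean arrays. The parentheses are required
# because & binds more tightly than >= and <=.
mask = (dists >= r) & (dists <= r + dr)

# Boolean mask indexing keeps only the elements where mask is True.
selected = dists[mask]

# np.where on the combined mask returns a tuple; [0] is the index array.
indices = np.where(mask)[0]

print(selected.tolist())  # [5.0, 5.5, 6.0]
print(indices.tolist())   # [10, 11, 12]
```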

Answer 1

The accepted answer explained the problem well enough. However, the more Numpythonic approach for applying multiple conditions is to use numpy logical functions. In this case you can use np.logical_and:

np.where(np.logical_and(np.greater_equal(dists,r),np.less_equal(dists,r + dr)))
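
For a quick check of this approach on concrete data, here is a minimal sketch. The sample array and bounds are borrowed from the accepted answer, and the upper bound is tested with np.less_equal; the names in_range and idx are illustrative.

```python
import numpy as np

dists = np.arange(0, 10, 0.5)
r, dr = 5, 1

# Function-style equivalents of (dists >= r) & (dists <= r + dr).
in_range = np.logical_and(np.greater_equal(dists, r),
                          np.less_equal(dists, r + dr))
idx = np.where(in_range)[0]

print(idx.tolist())         # [10, 11, 12]
print(dists[idx].tolist())  # [5.0, 5.5, 6.0]
```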

Answer 2

One interesting thing to point out here: the usual OR and AND also work in this case, with a small change. Instead of “and” and “or”, use the ampersand (&) and the pipe operator (|), and it will work.

When we use ‘and’:

ar = np.array([3,4,5,14,2,4,3,7])
np.where((ar>3) and (ar<6), 'yo', ar)

Output:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

When we use Ampersand(&):

ar = np.array([3,4,5,14,2,4,3,7])
np.where((ar>3) & (ar<6), 'yo', ar)

Output:
array(['3', 'yo', 'yo', '14', '2', 'yo', '3', '7'], dtype='<U11')

The same applies when we try to apply multiple filters to a pandas DataFrame. The reasoning behind this has to do with the difference between logical operators and bitwise operators; for more on that, I’d suggest going through this answer or a similar Q/A on Stack Overflow.

UPDATE

A user asked why (ar>3) and (ar<6) need to be wrapped in parentheses. Before I start talking about what’s happening here, one needs to know about operator precedence in Python.

Similar to what BODMAS is about, Python also defines which operations are performed first. Items inside parentheses are evaluated first, and then the bitwise operator is applied. I’ll show below what happens in both cases, with and without “(” and “)”.

Case1:

np.where( ar>3 & ar<6, 'yo', ar)
np.where( np.array([3,4,5,14,2,4,3,7])>3 & np.array([3,4,5,14,2,4,3,7])<6, 'yo', ar)

Since there are no parentheses here, the bitwise operator (&) is applied before the comparisons, because in the operator precedence table & has higher precedence than < or >. So & is evaluated on 3 and ar, not on two boolean arrays.

The < and > comparisons have not been performed yet when & runs. What remains is the chained comparison ar > (3 & ar) < 6, which implicitly uses “and” and therefore needs the truth value of a whole array. That’s why it gives that error.

One can check out the following link to learn more about: operator precedence
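
To make Case 1 concrete, the sketch below spells out how Python groups the unparenthesized expression. The variable names raised and inner are illustrative; the array is the one used in this answer.

```python
import numpy as np

ar = np.array([3, 4, 5, 14, 2, 4, 3, 7])

# & binds tighter than the comparisons, so `ar > 3 & ar < 6` is parsed as
# the chained comparison `ar > (3 & ar) < 6`, which Python evaluates as
# `(ar > (3 & ar)) and ((3 & ar) < 6)`. The implicit `and` asks for the
# truth value of an array, which raises ValueError.
try:
    ar > 3 & ar < 6
    raised = False
except ValueError:
    raised = True

# What & actually computes first: a bitwise AND of 3 with every element.
inner = 3 & ar

print(raised)          # True
print(inner.tolist())  # [3, 0, 1, 2, 2, 0, 3, 3]
```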

Now to Case 2:

If you do use the bracket, you clearly see what happens.

np.where( (ar>3) & (ar<6), 'yo', ar)
np.where( (array([False,  True,  True,  True, False,  True, False,  True])) & (array([ True,  True,  True, False,  True,  True,  True, False])), 'yo', ar)

Two arrays of True and False, on which you can easily perform an element-wise AND. That gives you:

np.where( array([False,  True,  True, False, False,  True, False, False]),  'yo', ar)

And as for the rest: np.where assigns the first value (here ‘yo’) wherever the condition is True, and the other value (here, the original element) wherever it is False.

That’s all. I hope I explained the query well.


Answer 3

I like to use np.vectorize for such tasks. Consider the following:

>>> # function which returns True when constraints are satisfied.
>>> func = lambda d: d >= r and d<= (r+dr) 
>>>
>>> # Apply constraints element-wise to the dists array.
>>> result = np.vectorize(func)(dists) 
>>>
>>> result = np.where(result) # Get output.

You can also use np.argwhere instead of np.where for clear output. But that is your call :)

Hope it helps.
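
As a sanity check, the np.vectorize route can be compared with the direct boolean mask. This is a minimal sketch using the dists, r and dr values from the accepted answer; the names in_range, mask_vec and mask_direct are illustrative. Note that np.vectorize is a convenience wrapper that still calls the Python function once per element.

```python
import numpy as np

dists = np.arange(0, 10, 0.5)
r, dr = 5, 1

# Scalar predicate; plain `and`-style chaining is fine here because the
# function only ever sees one element at a time.
in_range = np.vectorize(lambda d: r <= d <= r + dr)
mask_vec = in_range(dists)

# The same test written directly on the whole array.
mask_direct = (dists >= r) & (dists <= r + dr)

print(np.array_equal(mask_vec, mask_direct))   # True
print(np.argwhere(mask_vec).ravel().tolist())  # [10, 11, 12]
```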


Answer 4

Try:

np.intersect1d(np.where(dists >= r)[0],np.where(dists <= r + dr)[0])
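
A short sketch of what this computes, assuming the same dists, r and dr as in the accepted answer (the names lower, upper and common are illustrative): each np.where call yields the indices satisfying one condition, and np.intersect1d keeps the indices present in both.

```python
import numpy as np

dists = np.arange(0, 10, 0.5)
r, dr = 5, 1

lower = np.where(dists >= r)[0]        # indices 10..19
upper = np.where(dists <= r + dr)[0]   # indices 0..12
common = np.intersect1d(lower, upper)  # sorted indices found in both

print(common.tolist())         # [10, 11, 12]
print(dists[common].tolist())  # [5.0, 5.5, 6.0]
```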

Answer 5

This should work:

dists[((dists >= r) & (dists <= r+dr))]

The most elegant way~~


Answer 6

Try:

import numpy as np
dist = np.array([1,2,3,4,5])
r = 2
dr = 3
np.where(np.logical_and(dist> r, dist<=r+dr))

Output: (array([2, 3, 4]),)

You can see Logic functions for more details.


Answer 7

I have worked out this simple example

import numpy as np

ar = np.array([3,4,5,14,2,4,3,7])

print([X for X in ar.tolist() if (X >= 3 and X <= 6)])

>>> 
[3, 4, 5, 4, 3]

numpy max vs amax vs maximum

Question: numpy max vs amax vs maximum

numpy has three different functions which seem like they can be used for the same things — except that numpy.maximum can only be used element-wise, while numpy.max and numpy.amax can be used on particular axes, or all elements. Why is there more than just numpy.max? Is there some subtlety to this in performance?

(Similarly for min vs. amin vs. minimum)


Answer 0

np.max is just an alias for np.amax. This function only works on a single input array and finds the value of maximum element in that entire array (returning a scalar). Alternatively, it takes an axis argument and will find the maximum value along an axis of the input array (returning a new array).

>>> a = np.array([[0, 1, 6],
                  [2, 4, 1]])
>>> np.max(a)
6
>>> np.max(a, axis=0) # max of each column
array([2, 4, 6])

The default behaviour of np.maximum is to take two compatible arrays and compute their element-wise maximum. Here, ‘compatible’ means that one array can be broadcast to the other. For example:

>>> b = np.array([3, 6, 1])
>>> c = np.array([4, 2, 9])
>>> np.maximum(b, c)
array([4, 6, 9])

But np.maximum is also a universal function which means that it has other features and methods which come in useful when working with multidimensional arrays. For example you can compute the cumulative maximum over an array (or a particular axis of the array):

>>> d = np.array([2, 0, 3, -4, -2, 7, 9])
>>> np.maximum.accumulate(d)
array([2, 2, 3, 3, 3, 7, 9])

This is not possible with np.max.

You can make np.maximum imitate np.max to a certain extent when using np.maximum.reduce:

>>> np.maximum.reduce(d)
9
>>> np.max(d)
9

Basic testing suggests the two approaches are comparable in performance; and they should be, as np.max() actually calls np.maximum.reduce to do the computation.
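
The ufunc methods mentioned above also accept an axis argument. The following minimal sketch reuses the 2D array a from this answer; the result names are illustrative.

```python
import numpy as np

a = np.array([[0, 1, 6],
              [2, 4, 1]])

# Reduction over rows: one maximum per column.
col_max = np.max(a, axis=0)

# The same reduction through the ufunc method (axis=0 is its default).
col_max_reduce = np.maximum.reduce(a, axis=0)

# Running maximum along each row; this has no np.max equivalent.
row_running_max = np.maximum.accumulate(a, axis=1)

print(col_max.tolist())          # [2, 4, 6]
print(col_max_reduce.tolist())   # [2, 4, 6]
print(row_running_max.tolist())  # [[0, 1, 6], [2, 4, 4]]
```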


Answer 1

You’ve already stated why np.maximum is different – it returns an array that is the element-wise maximum between two arrays.

As for np.amax and np.max: they both call the same function – np.max is just an alias for np.amax, and they compute the maximum of all elements in an array, or along an axis of an array.

In [1]: import numpy as np

In [2]: np.amax
Out[2]: <function numpy.core.fromnumeric.amax>

In [3]: np.max
Out[3]: <function numpy.core.fromnumeric.amax>

Answer 2

For completeness, in Numpy there are four maximum related functions. They fall into two different categories:

  • np.amax/np.max, np.nanmax: for single array order statistics
  • and np.maximum, np.fmax: for element-wise comparison of two arrays

I. For single array order statistics

np.amax/np.max propagates NaNs; its NaN-ignoring counterpart is np.nanmax.

  • np.max is just an alias of np.amax, so they are considered as one function.

    >>> np.max.__name__
    'amax'
    >>> np.max is np.amax
    True
    
  • np.max propagates NaNs while np.nanmax ignores NaNs.

    >>> np.max([np.nan, 3.14, -1])
    nan
    >>> np.nanmax([np.nan, 3.14, -1])
    3.14
    

II. For element-wise comparison of two arrays

np.maximum propagates NaNs; its NaN-ignoring counterpart is np.fmax.

  • Both functions require two arrays as the first two positional args to compare with.

    # x1 and x2 must be the same shape or can be broadcast
    np.maximum(x1, x2, /, ...)
    np.fmax(x1, x2, /, ...)
    
  • np.maximum propagates NaNs while np.fmax ignores NaNs.

    >>> np.maximum([np.nan, 3.14, 0], [-np.inf, np.nan, 2.72])
    array([ nan,  nan, 2.72])
    >>> np.fmax([np.nan, 3.14, 0], [-np.inf, np.nan, 2.72])
    array([-inf, 3.14, 2.72])
    
  • The element-wise functions are instances of np.ufunc (universal function), which means they have some special properties that normal Numpy functions don’t have.

    >>> type(np.maximum)
    <class 'numpy.ufunc'>
    >>> type(np.fmax)
    <class 'numpy.ufunc'>
    >>> #---------------#
    >>> type(np.max)
    <class 'function'>
    >>> type(np.nanmax)
    <class 'function'>
    

And finally, the same rules apply to the four minimum related functions:

  • np.amin/np.min, np.nanmin;
  • and np.minimum, np.fmin.
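
The minimum-related counterparts can be checked the same way. This is a minimal sketch with made-up inputs x and y; it writes negative infinity as -np.inf, which works across NumPy versions.

```python
import numpy as np

x = [np.nan, 3.14, -1.0]
y = [-np.inf, np.nan, 2.72]

single_min = np.min(x)         # propagates NaN
single_nanmin = np.nanmin(x)   # ignores NaN
pair_min = np.minimum(x, y)    # element-wise, propagates NaN
pair_fmin = np.fmin(x, y)      # element-wise, ignores NaN where possible

print(single_min)     # nan
print(single_nanmin)  # -1.0
print(pair_min)
print(pair_fmin)
```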

Answer 3

np.maximum not only compares two arrays element-wise but can also compare an array element-wise with a single value:

>>> np.maximum([23, 14, 16, 20, 25], 18)
array([23, 18, 18, 20, 25])
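
Comparing against a single value works because the scalar is broadcast to every element, which in effect applies a lower bound. The sketch below uses the same numbers; the comparison with np.clip is added here for illustration and is not part of the original answer.

```python
import numpy as np

a = np.array([23, 14, 16, 20, 25])

floored = np.maximum(a, 18)     # the scalar 18 is broadcast against a
clipped = np.clip(a, 18, None)  # same lower bound, no upper bound

print(floored.tolist())                  # [23, 18, 18, 20, 25]
print(np.array_equal(floored, clipped))  # True
```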

How to convert a NumPy array to a PIL image applying a matplotlib colormap

Question: How to convert a NumPy array to a PIL image applying a matplotlib colormap

I have a simple problem, but I cannot find a good solution to it.

I want to take a NumPy 2D array which represents a grayscale image, and convert it to an RGB PIL image while applying some of the matplotlib colormaps.

I can get a reasonable PNG output by using the pyplot.figure.figimage command:

dpi = 100.0
w, h = myarray.shape[1]/dpi, myarray.shape[0]/dpi
fig = plt.figure(figsize=(w,h), dpi=dpi)
fig.figimage(sub, cmap=cm.gist_earth)
plt.savefig('out.png')

Although I could adapt this to get what I want (probably using StringIO to get the PIL image), I wonder if there is a simpler way to do that, since it seems to be a very natural problem of image visualization. Let’s say, something like this:

colored_PIL_image = magic_function(array, cmap)

Answer 0

Quite a busy one-liner, but here it is:

  1. First ensure your NumPy array, myarray, is normalised with the max value at 1.0.
  2. Apply the colormap directly to myarray.
  3. Rescale to the 0-255 range.
  4. Convert to integers, using np.uint8().
  5. Use Image.fromarray().

And you’re done:

import numpy as np
from PIL import Image
from matplotlib import cm
im = Image.fromarray(np.uint8(cm.gist_earth(myarray)*255))

[Image: result saved with plt.savefig()]

[Image: result saved with im.save()]
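
The one-liner assumes step 1 has already been done. A minimal NumPy-only sketch of that normalisation, and of the 0-255 rescaling from steps 3 and 4, might look as follows. The sample values are hypothetical, min-max scaling is one possible choice of normalisation, and the colormap call itself is left out.

```python
import numpy as np

# Hypothetical grayscale data with an arbitrary value range.
myarray = np.array([[10.0, 20.0],
                    [30.0, 50.0]])

# Step 1: min-max normalise to [0, 1].
# (Assumes the array is not constant, otherwise hi - lo would be zero.)
lo, hi = myarray.min(), myarray.max()
normalised = (myarray - lo) / (hi - lo)

# Steps 3-4, applied here to the normalised values themselves:
# rescale to 0-255 and cast to unsigned 8-bit integers (truncating).
as_uint8 = np.uint8(normalised * 255)

print(normalised.tolist())  # [[0.0, 0.25], [0.5, 1.0]]
print(as_uint8.tolist())    # [[0, 63], [127, 255]]
```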


Answer 1

  • input = numpy_image
  • np.uint8 -> converts to integers
  • convert(‘RGB’) -> converts to RGB
  • Image.fromarray -> returns an image object

    from PIL import Image
    import numpy as np
    
    PIL_image = Image.fromarray(np.uint8(numpy_image)).convert('RGB')
    
    PIL_image = Image.fromarray(numpy_image.astype('uint8'), 'RGB')
    

Answer 2

The method described in the accepted answer didn’t work for me even after applying changes mentioned in its comments. But the below simple code worked:

import matplotlib.pyplot as plt
plt.imsave(filename, np_array, cmap='Greys')

np_array can be a 2D array with float values in 0..1 or uint8 values in 0..255; in that case it needs cmap. For 3D arrays, cmap is ignored.


Normalize data in pandas

Question: Normalize data in pandas

Suppose I have a pandas data frame df:

I want to calculate the column wise mean of a data frame.

This is easy:

df.apply(average) 

then the column wise range max(col) – min(col). This is easy again:

df.apply(max) - df.apply(min)

Now for each element I want to subtract its column’s mean and divide by its column’s range. I am not sure how to do that

Any help/pointers are much appreciated.


Answer 0

In [92]: df
Out[92]:
           a         b          c         d
A  -0.488816  0.863769   4.325608 -4.721202
B -11.937097  2.993993 -12.916784 -1.086236
C  -5.569493  4.672679  -2.168464 -9.315900
D   8.892368  0.932785   4.535396  0.598124

In [93]: df_norm = (df - df.mean()) / (df.max() - df.min())

In [94]: df_norm
Out[94]:
          a         b         c         d
A  0.085789 -0.394348  0.337016 -0.109935
B -0.463830  0.164926 -0.650963  0.256714
C -0.158129  0.605652 -0.035090 -0.573389
D  0.536170 -0.376229  0.349037  0.426611

In [95]: df_norm.mean()
Out[95]:
a   -2.081668e-17
b    4.857226e-17
c    1.734723e-17
d   -1.040834e-17

In [96]: df_norm.max() - df_norm.min()
Out[96]:
a    1
b    1
c    1
d    1
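
The same per-column arithmetic can be sketched in plain NumPy, assuming the data is a 2D array with one column per variable. The sample values are made up for illustration, and the names col_mean, col_range and normed are not from the original answer.

```python
import numpy as np

data = np.array([[1.0, 10.0],
                 [2.0, 30.0],
                 [3.0, 50.0]])

# axis=0 reduces over rows, giving one statistic per column;
# broadcasting then applies each column's statistic to every row.
col_mean = data.mean(axis=0)
col_range = data.max(axis=0) - data.min(axis=0)
normed = (data - col_mean) / col_range

print(normed.tolist())  # [[-0.5, -0.5], [0.0, 0.0], [0.5, 0.5]]
```

After this transformation each column has mean 0 and range 1, matching the checks in the session above.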

Answer 1

If you don’t mind importing the sklearn library, I would recommend the method discussed on this blog.

import pandas as pd
from sklearn import preprocessing

data = {'score': [234,24,14,27,-74,46,73,-18,59,160]}
df = pd.DataFrame(data)
cols = df.columns  # take the column names from the DataFrame; a plain dict has no .columns
df

min_max_scaler = preprocessing.MinMaxScaler()
np_scaled = min_max_scaler.fit_transform(df)
df_normalized = pd.DataFrame(np_scaled, columns = cols)
df_normalized

Answer 2

You can use apply for this, and it’s a bit neater:

import numpy as np
import pandas as pd

np.random.seed(1)

df = pd.DataFrame(np.random.randn(4,4)* 4 + 3)

          0         1         2         3
0  9.497381  0.552974  0.887313 -1.291874
1  6.461631 -6.206155  9.979247 -0.044828
2  4.276156  2.002518  8.848432 -5.240563
3  1.710331  1.463783  7.535078 -1.399565

df.apply(lambda x: (x - np.mean(x)) / (np.max(x) - np.min(x)))

          0         1         2         3
0  0.515087  0.133967 -0.651699  0.135175
1  0.125241 -0.689446  0.348301  0.375188
2 -0.155414  0.310554  0.223925 -0.624812
3 -0.484913  0.244924  0.079473  0.114448

Also, it works nicely with groupby, if you select the relevant columns:

df['grp'] = ['A', 'A', 'B', 'B']

          0         1         2         3 grp
0  9.497381  0.552974  0.887313 -1.291874   A
1  6.461631 -6.206155  9.979247 -0.044828   A
2  4.276156  2.002518  8.848432 -5.240563   B
3  1.710331  1.463783  7.535078 -1.399565   B


df.groupby(['grp'])[[0,1,2,3]].apply(lambda x: (x - np.mean(x)) / (np.max(x) - np.min(x)))

     0    1    2    3
0  0.5  0.5 -0.5 -0.5
1 -0.5 -0.5  0.5  0.5
2  0.5  0.5  0.5 -0.5
3 -0.5 -0.5 -0.5  0.5

Answer 3

Slightly modified from "Python Pandas Dataframe: Normalize data between 0.01 and 0.99?", but judging from some of the comments I thought it was relevant here (sorry if this counts as a repost).

I wanted customized normalization because the usual percentile-of-datum or z-score approaches were not adequate. Sometimes I knew what the feasible max and min of the population were, and therefore wanted to define them independently of my sample, or to use a different midpoint. This is often useful for rescaling and normalizing data for neural nets, where you may want all inputs between 0 and 1 but some of your data needs to be scaled in a more customized way, because percentiles and standard deviations assume your sample covers the population, and sometimes we know this isn't true. It was also very useful for me when visualizing data in heatmaps. So I built a custom function (with extra steps in the code here to make it as readable as possible):

from numpy import mean, median  # needed for center='avg' and center='median'

def NormData(s,low='min',center='mid',hi='max',insideout=False,shrinkfactor=0.):
    if low=='min':
        low=min(s)
    elif low=='abs':
        low=max(abs(min(s)),abs(max(s)))*-1.#sign(min(s))
    if hi=='max':
        hi=max(s)
    elif hi=='abs':
        hi=max(abs(min(s)),abs(max(s)))*1.#sign(max(s))

    if center=='mid':
        center=(max(s)+min(s))/2
    elif center=='avg':
        center=mean(s)
    elif center=='median':
        center=median(s)

    s2=[x-center for x in s]
    hi=hi-center
    low=low-center
    center=0.

    r=[]

    for x in s2:
        if x<low:
            r.append(0.)
        elif x>hi:
            r.append(1.)
        else:
            if x>=center:
                r.append((x-center)/(hi-center)*0.5+0.5)
            else:
                r.append((x-low)/(center-low)*0.5+0.)

    if insideout==True:
        ir=[(1.-abs(z-0.5)*2.) for z in r]
        r=ir

    rr =[x-(x-0.5)*shrinkfactor for x in r]    
    return rr

This takes a pandas Series, or even just a list, and normalizes it to the low, center, and high points you specify. There is also a shrink factor, which lets you scale the data away from the endpoints 0 and 1 (I had to do this when combining colormaps in matplotlib: "Single pcolormesh with more than one colormap using Matplotlib"). You can likely see how the code works, but say you have the values [-5,2,10] in a sample and want to normalize them against a range of -7 to 7 (so anything above 7, such as our "10", is effectively treated as 7) with a midpoint of 1, shrunk to fit a 256-level RGB colormap:

#In[1]
NormData([-5,2,10],low=-7,center=1,hi=7,shrinkfactor=2./256)
#Out[1]
[0.1279296875, 0.5826822916666667, 0.99609375]

It can also turn your data inside out… this may seem odd, but I found it useful for heatmapping. Say you want a darker color for values closer to 0 rather than hi/low. You could heatmap based on normalized data where insideout=True:

#In[2]
NormData([-5,2,10],low=-7,center=1,hi=7,insideout=True,shrinkfactor=2./256)
#Out[2]
[0.251953125, 0.8307291666666666, 0.00390625]

So now "2", which is closest to the center (defined as "1"), has the highest value.

Anyway, I thought my application was relevant if you're looking to rescale data in other ways that might be useful to you.
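
The simplest case of the idea above, rescaling against known population bounds instead of the sample's own min and max, can be sketched in a few lines of NumPy. This is only an illustration: `rescale_to_bounds` is a hypothetical helper name, and it leaves out the custom center, inside-out, and shrink-factor options of NormData.

```python
import numpy as np

def rescale_to_bounds(values, low, high):
    """Map values linearly so low -> 0 and high -> 1, clipping anything outside."""
    arr = np.asarray(values, dtype=float)
    scaled = (arr - low) / (high - low)
    return np.clip(scaled, 0.0, 1.0)

sample = [-5, 2, 10]
# The bounds come from what we know about the population, not from the sample.
scaled = rescale_to_bounds(sample, low=-7, high=7)
print(scaled)
```

The value 10 lies above the stated upper bound, so it is clipped to 1 rather than stretching the scale.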


回答 4

这是按列进行的方式:

[df[col].update((df[col] - df[col].min()) / (df[col].max() - df[col].min())) for col in df.columns]

This is how you do it column-wise:

[df[col].update((df[col] - df[col].min()) / (df[col].max() - df[col].min())) for col in df.columns]
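
The list comprehension above is used only for its side effects. A vectorized alternative, sketched here with a made-up two-column frame, relies on the fact that `DataFrame.min()` and `DataFrame.max()` return one value per column by default and that the arithmetic aligns on column labels:

```python
import pandas as pd

# Illustrative data only; any numeric DataFrame works the same way.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 30.0, 20.0]})

# min() and max() reduce over rows, giving one value per column;
# the subtraction and division then apply column by column.
normalized = (df - df.min()) / (df.max() - df.min())
print(normalized)
```

A column whose values are all identical has a range of zero, so it would come out as NaN and may need separate handling.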

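How does multiplication differ for NumPy matrix vs. array classes?

Question: How does multiplication differ for NumPy matrix vs. array classes?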

numpy文档建议使用数组而不是矩阵来处理矩阵。但是,与八度(我直到最近才使用)不同,*不执行矩阵乘法,您需要使用函数matrixmultipy()。我觉得这使代码非常不可读。

是否有人分享我的观点并找到了解决方案?

The NumPy docs recommend using array instead of matrix for working with matrices. However, unlike Octave (which I was using until recently), * doesn't perform matrix multiplication; you need to use the function matrixmultiply(). I feel this makes the code very unreadable.

Does anybody share my view, and has anyone found a solution?


回答 0

避免使用的主要原因 matrix该类的是:a)本质上是二维的,并且b)与“常规” numpy数组相比,存在额外的开销。如果您要做的只是线性代数,那么请务必使用矩阵类…就我个人而言,我发现它比它值得的麻烦更多。

对于数组(Python 3.5之前的版本),请使用dot代替matrixmultiply

例如

import numpy as np
x = np.arange(9).reshape((3,3))
y = np.arange(3)

print np.dot(x,y)

或在新版本的numpy中,只需使用 x.dot(y)

就个人而言,我发现它比*表示矩阵乘法的运算符更具可读性…

对于Python 3.5中的数组,请使用x @ y

The main reasons to avoid the matrix class are that a) it's inherently 2-dimensional, and b) there's additional overhead compared to a "normal" numpy array. If all you're doing is linear algebra, then by all means, feel free to use the matrix class… Personally I find it more trouble than it's worth, though.

For arrays (prior to Python 3.5), use dot instead of matrixmultiply.

E.g.

import numpy as np
x = np.arange(9).reshape((3,3))
y = np.arange(3)

print(np.dot(x, y))

Or in newer versions of numpy, simply use x.dot(y)

Personally, I find it much more readable than the * operator implying matrix multiplication…

For arrays in Python 3.5 and later, use x @ y.
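
To make the difference concrete, here is a small sketch that reuses the x and y from the example above and compares the three spellings on plain arrays:

```python
import numpy as np

x = np.arange(9).reshape((3, 3))
y = np.arange(3)

elementwise = x * y       # broadcasts y across each row of x
via_dot = np.dot(x, y)    # matrix-vector product
via_operator = x @ y      # same product; needs Python 3.5+ and NumPy 1.10+

print(elementwise)
print(via_dot)
print(via_operator)
```

Only the last two are matrix products. On arrays, `*` always multiplies element by element.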


回答 1

与在NumPy 矩阵上进行操作相比,在NumPy 数组上进行操作要了解的关键事项是:

  • NumPy矩阵是NumPy数组的子类

  • NumPy 数组操作是基于元素的(一旦考虑了广播)

  • NumPy 矩阵运算遵循线性代数的一般规则

一些代码片段来说明:

>>> from numpy import linalg as LA
>>> import numpy as NP

>>> a1 = NP.matrix("4 3 5; 6 7 8; 1 3 13; 7 21 9")
>>> a1
matrix([[ 4,  3,  5],
        [ 6,  7,  8],
        [ 1,  3, 13],
        [ 7, 21,  9]])

>>> a2 = NP.matrix("7 8 15; 5 3 11; 7 4 9; 6 15 4")
>>> a2
matrix([[ 7,  8, 15],
        [ 5,  3, 11],
        [ 7,  4,  9],
        [ 6, 15,  4]])

>>> a1.shape
(4, 3)

>>> a2.shape
(4, 3)

>>> a2t = a2.T
>>> a2t.shape
(3, 4)

>>> a1 * a2t         # same as NP.dot(a1, a2t) 
matrix([[127,  84,  85,  89],
        [218, 139, 142, 173],
        [226, 157, 136, 103],
        [352, 197, 214, 393]])

但是如果将以下两个NumPy矩阵转换为数组,则此操作将失败:

>>> a1 = NP.array(a1)
>>> a2t = NP.array(a2t)

>>> a1 * a2t
Traceback (most recent call last):
   File "<pyshell#277>", line 1, in <module>
   a1 * a2t
   ValueError: operands could not be broadcast together with shapes (4,3) (3,4) 

尽管使用NP.dot语法可以处理数组 ; 该操作类似于矩阵乘法:

>> NP.dot(a1, a2t)
array([[127,  84,  85,  89],
       [218, 139, 142, 173],
       [226, 157, 136, 103],
       [352, 197, 214, 393]])

那么您是否需要NumPy矩阵?即,NumPy数组是否足以进行线性代数计算(前提是您知道正确的语法,即NP.dot)?

规则似乎是,如果参数(数组)的形状(mxn)与给定的线性代数运算兼容,那么您就可以了,否则,NumPy抛出。

我遇到的唯一exceptions(可能还有其他exceptions)是计算矩阵逆

下面是我称为纯线性代数运算(实际上是从Numpy的线性代数模块)并传递给NumPy数组的代码片段

数组的行列式

>>> m = NP.random.randint(0, 10, 16).reshape(4, 4)
>>> m
array([[6, 2, 5, 2],
       [8, 5, 1, 6],
       [5, 9, 7, 5],
       [0, 5, 6, 7]])

>>> type(m)
<type 'numpy.ndarray'>

>>> md = LA.det(m)
>>> md
1772.9999999999995

特征向量/特征值对:

>>> LA.eig(m)
(array([ 19.703+0.j   ,   0.097+4.198j,   0.097-4.198j,   5.103+0.j   ]), 
array([[-0.374+0.j   , -0.091+0.278j, -0.091-0.278j, -0.574+0.j   ],
       [-0.446+0.j   ,  0.671+0.j   ,  0.671+0.j   , -0.084+0.j   ],
       [-0.654+0.j   , -0.239-0.476j, -0.239+0.476j, -0.181+0.j   ],
       [-0.484+0.j   , -0.387+0.178j, -0.387-0.178j,  0.794+0.j   ]]))

矩阵范数

>>>> LA.norm(m)
22.0227

qr因式分解

>>> LA.qr(a1)
(array([[ 0.5,  0.5,  0.5],
        [ 0.5,  0.5, -0.5],
        [ 0.5, -0.5,  0.5],
        [ 0.5, -0.5, -0.5]]), 
 array([[ 6.,  6.,  6.],
        [ 0.,  0.,  0.],
        [ 0.,  0.,  0.]]))

矩阵等级

>>> m = NP.random.rand(40).reshape(8, 5)
>>> m
array([[ 0.545,  0.459,  0.601,  0.34 ,  0.778],
       [ 0.799,  0.047,  0.699,  0.907,  0.381],
       [ 0.004,  0.136,  0.819,  0.647,  0.892],
       [ 0.062,  0.389,  0.183,  0.289,  0.809],
       [ 0.539,  0.213,  0.805,  0.61 ,  0.677],
       [ 0.269,  0.071,  0.377,  0.25 ,  0.692],
       [ 0.274,  0.206,  0.655,  0.062,  0.229],
       [ 0.397,  0.115,  0.083,  0.19 ,  0.701]])
>>> LA.matrix_rank(m)
5

矩阵条件

>>> a1 = NP.random.randint(1, 10, 12).reshape(4, 3)
>>> LA.cond(a1)
5.7093446189400954

反演需要一个NumPy矩阵

>>> a1 = NP.matrix(a1)
>>> type(a1)
<class 'numpy.matrixlib.defmatrix.matrix'>

>>> a1.I
matrix([[ 0.028,  0.028,  0.028,  0.028],
        [ 0.028,  0.028,  0.028,  0.028],
        [ 0.028,  0.028,  0.028,  0.028]])
>>> a1 = NP.array(a1)
>>> a1.I

Traceback (most recent call last):
   File "<pyshell#230>", line 1, in <module>
   a1.I
   AttributeError: 'numpy.ndarray' object has no attribute 'I'

但是Moore-Penrose伪逆似乎工作得很好

>>> LA.pinv(m)
matrix([[ 0.314,  0.407, -1.008, -0.553,  0.131,  0.373,  0.217,  0.785],
        [ 1.393,  0.084, -0.605,  1.777, -0.054, -1.658,  0.069, -1.203],
        [-0.042, -0.355,  0.494, -0.729,  0.292,  0.252,  1.079, -0.432],
        [-0.18 ,  1.068,  0.396,  0.895, -0.003, -0.896, -1.115, -0.666],
        [-0.224, -0.479,  0.303, -0.079, -0.066,  0.872, -0.175,  0.901]])

>>> m = NP.array(m)

>>> LA.pinv(m)
array([[ 0.314,  0.407, -1.008, -0.553,  0.131,  0.373,  0.217,  0.785],
       [ 1.393,  0.084, -0.605,  1.777, -0.054, -1.658,  0.069, -1.203],
       [-0.042, -0.355,  0.494, -0.729,  0.292,  0.252,  1.079, -0.432],
       [-0.18 ,  1.068,  0.396,  0.895, -0.003, -0.896, -1.115, -0.666],
       [-0.224, -0.479,  0.303, -0.079, -0.066,  0.872, -0.175,  0.901]])

The key things to know for operations on NumPy arrays versus operations on NumPy matrices are:

  • NumPy matrix is a subclass of NumPy array

  • NumPy array operations are element-wise (once broadcasting is accounted for)

  • NumPy matrix operations follow the ordinary rules of linear algebra

Some code snippets to illustrate:

>>> from numpy import linalg as LA
>>> import numpy as NP

>>> a1 = NP.matrix("4 3 5; 6 7 8; 1 3 13; 7 21 9")
>>> a1
matrix([[ 4,  3,  5],
        [ 6,  7,  8],
        [ 1,  3, 13],
        [ 7, 21,  9]])

>>> a2 = NP.matrix("7 8 15; 5 3 11; 7 4 9; 6 15 4")
>>> a2
matrix([[ 7,  8, 15],
        [ 5,  3, 11],
        [ 7,  4,  9],
        [ 6, 15,  4]])

>>> a1.shape
(4, 3)

>>> a2.shape
(4, 3)

>>> a2t = a2.T
>>> a2t.shape
(3, 4)

>>> a1 * a2t         # same as NP.dot(a1, a2t) 
matrix([[127,  84,  85,  89],
        [218, 139, 142, 173],
        [226, 157, 136, 103],
        [352, 197, 214, 393]])

But this operation fails if these two NumPy matrices are converted to arrays:

>>> a1 = NP.array(a1)
>>> a2t = NP.array(a2t)

>>> a1 * a2t
Traceback (most recent call last):
   File "<pyshell#277>", line 1, in <module>
   a1 * a2t
   ValueError: operands could not be broadcast together with shapes (4,3) (3,4) 

Using the NP.dot syntax works with arrays, though; this operation behaves like matrix multiplication:

>>> NP.dot(a1, a2t)
array([[127,  84,  85,  89],
       [218, 139, 142, 173],
       [226, 157, 136, 103],
       [352, 197, 214, 393]])

So do you ever need a NumPy matrix? That is, will a NumPy array suffice for linear algebra computation (provided you know the correct syntax, i.e., NP.dot)?

The rule seems to be that if the arguments (arrays) have shapes (m x n) compatible with the given linear algebra operation, then you are OK; otherwise, NumPy throws.

The only exception I have come across (there are likely others) is the .I inverse attribute, which exists only on matrix objects; for arrays, the inverse is available through LA.inv (square input) or LA.pinv.

Below are snippets in which I have called a pure linear algebra operation (in fact, from NumPy's linear algebra module) and passed in a NumPy array.

determinant of an array:

>>> m = NP.random.randint(0, 10, 16).reshape(4, 4)
>>> m
array([[6, 2, 5, 2],
       [8, 5, 1, 6],
       [5, 9, 7, 5],
       [0, 5, 6, 7]])

>>> type(m)
<type 'numpy.ndarray'>

>>> md = LA.det(m)
>>> md
1772.9999999999995

eigenvectors/eigenvalue pairs:

>>> LA.eig(m)
(array([ 19.703+0.j   ,   0.097+4.198j,   0.097-4.198j,   5.103+0.j   ]), 
array([[-0.374+0.j   , -0.091+0.278j, -0.091-0.278j, -0.574+0.j   ],
       [-0.446+0.j   ,  0.671+0.j   ,  0.671+0.j   , -0.084+0.j   ],
       [-0.654+0.j   , -0.239-0.476j, -0.239+0.476j, -0.181+0.j   ],
       [-0.484+0.j   , -0.387+0.178j, -0.387-0.178j,  0.794+0.j   ]]))

matrix norm:

>>> LA.norm(m)
22.0227

qr factorization:

>>> LA.qr(a1)
(array([[ 0.5,  0.5,  0.5],
        [ 0.5,  0.5, -0.5],
        [ 0.5, -0.5,  0.5],
        [ 0.5, -0.5, -0.5]]), 
 array([[ 6.,  6.,  6.],
        [ 0.,  0.,  0.],
        [ 0.,  0.,  0.]]))

matrix rank:

>>> m = NP.random.rand(40).reshape(8, 5)
>>> m
array([[ 0.545,  0.459,  0.601,  0.34 ,  0.778],
       [ 0.799,  0.047,  0.699,  0.907,  0.381],
       [ 0.004,  0.136,  0.819,  0.647,  0.892],
       [ 0.062,  0.389,  0.183,  0.289,  0.809],
       [ 0.539,  0.213,  0.805,  0.61 ,  0.677],
       [ 0.269,  0.071,  0.377,  0.25 ,  0.692],
       [ 0.274,  0.206,  0.655,  0.062,  0.229],
       [ 0.397,  0.115,  0.083,  0.19 ,  0.701]])
>>> LA.matrix_rank(m)
5

matrix condition:

>>> a1 = NP.random.randint(1, 10, 12).reshape(4, 3)
>>> LA.cond(a1)
5.7093446189400954

The .I inverse attribute requires a NumPy matrix, though (with arrays, call LA.inv or LA.pinv instead):

>>> a1 = NP.matrix(a1)
>>> type(a1)
<class 'numpy.matrixlib.defmatrix.matrix'>

>>> a1.I
matrix([[ 0.028,  0.028,  0.028,  0.028],
        [ 0.028,  0.028,  0.028,  0.028],
        [ 0.028,  0.028,  0.028,  0.028]])
>>> a1 = NP.array(a1)
>>> a1.I

Traceback (most recent call last):
   File "<pyshell#230>", line 1, in <module>
   a1.I
   AttributeError: 'numpy.ndarray' object has no attribute 'I'

But the Moore-Penrose pseudoinverse seems to work just fine:

>>> LA.pinv(m)
matrix([[ 0.314,  0.407, -1.008, -0.553,  0.131,  0.373,  0.217,  0.785],
        [ 1.393,  0.084, -0.605,  1.777, -0.054, -1.658,  0.069, -1.203],
        [-0.042, -0.355,  0.494, -0.729,  0.292,  0.252,  1.079, -0.432],
        [-0.18 ,  1.068,  0.396,  0.895, -0.003, -0.896, -1.115, -0.666],
        [-0.224, -0.479,  0.303, -0.079, -0.066,  0.872, -0.175,  0.901]])

>>> m = NP.array(m)

>>> LA.pinv(m)
array([[ 0.314,  0.407, -1.008, -0.553,  0.131,  0.373,  0.217,  0.785],
       [ 1.393,  0.084, -0.605,  1.777, -0.054, -1.658,  0.069, -1.203],
       [-0.042, -0.355,  0.494, -0.729,  0.292,  0.252,  1.079, -0.432],
       [-0.18 ,  1.068,  0.396,  0.895, -0.003, -0.896, -1.115, -0.666],
       [-0.224, -0.479,  0.303, -0.079, -0.066,  0.872, -0.175,  0.901]])
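
Note that what a plain array lacks is the `.I` attribute. As far as I know, `numpy.linalg.inv` itself accepts an ordinary square ndarray. A minimal sketch, with made-up input values:

```python
import numpy as np
from numpy import linalg as LA

square = np.array([[4.0, 7.0], [2.0, 6.0]])   # plain ndarray, not a matrix
inverse = LA.inv(square)

# Multiplying back should give the identity, up to rounding error.
print(np.allclose(square @ inverse, np.eye(2)))

tall = np.arange(12, dtype=float).reshape(4, 3)
pseudo = LA.pinv(tall)    # the pseudoinverse is defined for non-square input
print(pseudo.shape)
```

`LA.inv` raises `LinAlgError` for singular or non-square input, and in those cases `LA.pinv` is the usual fallback.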

回答 2

在3.5中,Python终于有了一个矩阵乘法运算符。语法为a @ b

In 3.5, Python finally got a matrix multiplication operator. The syntax is a @ b.


回答 3

在处理数组和处理矩阵时,点运算符会给出不同的答案。例如,假设以下内容:

>>> a=numpy.array([1, 2, 3])
>>> b=numpy.array([1, 2, 3])

让我们将它们转换成矩阵:

>>> am=numpy.mat(a)
>>> bm=numpy.mat(b)

现在,我们可以看到两种情况的不同输出:

>>> print numpy.dot(a.T, b)
14
>>> print am.T*bm
[[1.  2.  3.]
 [2.  4.  6.]
 [3.  6.  9.]]

There is a situation where the dot operation gives different answers for arrays than for matrices. For example, suppose the following:

>>> a=numpy.array([1, 2, 3])
>>> b=numpy.array([1, 2, 3])

Let's convert them into matrices:

>>> am=numpy.mat(a)
>>> bm=numpy.mat(b)

Now, we can see a different output for the two cases:

>>> print(numpy.dot(a.T, b))
14
>>> print(am.T*bm)
[[1.  2.  3.]
 [2.  4.  6.]
 [3.  6.  9.]]
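
The reason for the difference is that transposing a 1-D array does nothing, while a matrix built from it is always 2-D (here 1 x 3). A short sketch of how to get both results with plain arrays:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([1, 2, 3])

print(a.T.shape)            # still (3,): a 1-D array has no row/column orientation

inner = np.dot(a, b)        # scalar inner product
outer = np.outer(a, b)      # 3 x 3 outer product, like am.T * bm
column_times_row = a.reshape(3, 1) @ b.reshape(1, 3)   # explicit 2-D shapes

print(inner)
print(outer)
```

Reshaping to explicit (3, 1) and (1, 3) shapes is the array equivalent of the column-times-row product that the matrix class performs implicitly.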

回答 4

来自http://docs.scipy.org/doc/scipy/reference/tutorial/linalg.html的参考

…,使用的numpy.matrix气馁,因为它增加了什么,无法与2D来完成numpy.ndarray对象,并可能导致混乱,其中正在使用的类。例如,

>>> import numpy as np
>>> from scipy import linalg
>>> A = np.array([[1,2],[3,4]])
>>> A
    array([[1, 2],
           [3, 4]])
>>> linalg.inv(A)
array([[-2. ,  1. ],
      [ 1.5, -0.5]])
>>> b = np.array([[5,6]]) #2D array
>>> b
array([[5, 6]])
>>> b.T
array([[5],
      [6]])
>>> A*b #not matrix multiplication!
array([[ 5, 12],
      [15, 24]])
>>> A.dot(b.T) #matrix multiplication
array([[17],
      [39]])
>>> b = np.array([5,6]) #1D array
>>> b
array([5, 6])
>>> b.T  #not matrix transpose!
array([5, 6])
>>> A.dot(b)  #does not matter for multiplication
array([17, 39])

scipy.linalg操作可以同等地应用于numpy.matrix或2D numpy.ndarray对象。

Reference from http://docs.scipy.org/doc/scipy/reference/tutorial/linalg.html

…, the use of the numpy.matrix class is discouraged, since it adds nothing that cannot be accomplished with 2D numpy.ndarray objects, and may lead to a confusion of which class is being used. For example,

>>> import numpy as np
>>> from scipy import linalg
>>> A = np.array([[1,2],[3,4]])
>>> A
    array([[1, 2],
           [3, 4]])
>>> linalg.inv(A)
array([[-2. ,  1. ],
      [ 1.5, -0.5]])
>>> b = np.array([[5,6]]) #2D array
>>> b
array([[5, 6]])
>>> b.T
array([[5],
      [6]])
>>> A*b #not matrix multiplication!
array([[ 5, 12],
      [15, 24]])
>>> A.dot(b.T) #matrix multiplication
array([[17],
      [39]])
>>> b = np.array([5,6]) #1D array
>>> b
array([5, 6])
>>> b.T  #not matrix transpose!
array([5, 6])
>>> A.dot(b)  #does not matter for multiplication
array([17, 39])

scipy.linalg operations can be applied equally to numpy.matrix or to 2D numpy.ndarray objects.


回答 5

这个技巧可能就是您想要的。这是一种简单的运算符重载。

然后,您可以使用类似建议的Infix类的东西:

a = np.random.rand(3,4)
b = np.random.rand(4,3)
x = Infix(lambda x,y: np.dot(x,y))
c = a |x| b

This trick could be what you are looking for. It is a kind of simple operator overload.

You can then use something like the suggested Infix class like this:

a = np.random.rand(3,4)
b = np.random.rand(4,3)
x = Infix(lambda x,y: np.dot(x,y))
c = a |x| b
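
The snippet above uses an Infix class without defining it. A minimal sketch of such a class follows. It is my reconstruction of the general recipe, not necessarily the exact code the answer links to. The `__array_ufunc__ = None` line is an addition I am assuming is needed so that NumPy arrays defer to `__ror__` instead of attempting an elementwise bitwise-or.

```python
import numpy as np

class Infix:
    """Wrap a two-argument function so it can be written as  a |op| b."""

    # Ask NumPy arrays to return NotImplemented for  array | Infix,
    # so that Python falls back to Infix.__ror__ (NumPy 1.13+).
    __array_ufunc__ = None

    def __init__(self, function):
        self.function = function

    def __ror__(self, left):
        # Handles  left | self : remember the left operand.
        return Infix(lambda right: self.function(left, right))

    def __or__(self, right):
        # Handles  self | right : apply the stored function.
        return self.function(right)

x = Infix(lambda u, v: np.dot(u, v))

a = np.arange(12.0).reshape(3, 4)
b = np.arange(12.0).reshape(4, 3)
c = a |x| b
print(c.shape)
```

On Python 3.5 and later the built-in `@` operator makes this workaround unnecessary for matrix products, though the same pattern can wrap other binary functions.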

回答 6

来自PEP 465的相关报价 @ petr-viktorin提到的用于矩阵乘法的专用中缀运算符,阐明了OP遇到的问题:

numpy提供了两种使用不同__mul__方法的不同类型。对于numpy.ndarray对象,*执行元素乘法,矩阵乘法必须使用函数调用(numpy.dot)。对于numpy.matrix对象,*执行矩阵乘法,而元素乘法则需要函数语法。使用编写代码numpy.ndarray效果很好。使用编写代码numpy.matrix也可以。但是,一旦我们尝试将这两段代码集成在一起,麻烦就会开始。预期为ndarray并得到matrix或相反的代码可能会崩溃或返回错误的结果

@infix运算符的引入应有助于统一和简化python矩阵代码。

A pertinent quote from PEP 465 – A dedicated infix operator for matrix multiplication , as mentioned by @petr-viktorin, clarifies the problem the OP was getting at:

[…] numpy provides two different types with different __mul__ methods. For numpy.ndarray objects, * performs elementwise multiplication, and matrix multiplication must use a function call (numpy.dot). For numpy.matrix objects, * performs matrix multiplication, and elementwise multiplication requires function syntax. Writing code using numpy.ndarray works fine. Writing code using numpy.matrix also works fine. But trouble begins as soon as we try to integrate these two pieces of code together. Code that expects an ndarray and gets a matrix, or vice-versa, may crash or return incorrect results

The introduction of the @ infix operator should help to unify and simplify python matrix code.


回答 7

函数matmul(自numpy 1.10.1起)对两种类型均适用,并以numpy矩阵类返回结果:

import numpy as np

A = np.mat('1 2 3; 4 5 6; 7 8 9; 10 11 12')
B = np.array(np.mat('1 1 1 1; 1 1 1 1; 1 1 1 1'))
print (A, type(A))
print (B, type(B))

C = np.matmul(A, B)
print (C, type(C))

输出:

(matrix([[ 1,  2,  3],
        [ 4,  5,  6],
        [ 7,  8,  9],
        [10, 11, 12]]), <class 'numpy.matrixlib.defmatrix.matrix'>)
(array([[1, 1, 1, 1],
       [1, 1, 1, 1],
       [1, 1, 1, 1]]), <type 'numpy.ndarray'>)
(matrix([[ 6,  6,  6,  6],
        [15, 15, 15, 15],
        [24, 24, 24, 24],
        [33, 33, 33, 33]]), <class 'numpy.matrixlib.defmatrix.matrix'>)

由于python 3.5 如前所述,您还可以使用新的矩阵乘法运算符,@例如

C = A @ B

并获得与上述相同的结果。

The function matmul (since NumPy 1.10.1) works fine for both types and returns the result as a NumPy matrix:

import numpy as np

A = np.mat('1 2 3; 4 5 6; 7 8 9; 10 11 12')
B = np.array(np.mat('1 1 1 1; 1 1 1 1; 1 1 1 1'))
print (A, type(A))
print (B, type(B))

C = np.matmul(A, B)
print (C, type(C))

Output:

(matrix([[ 1,  2,  3],
        [ 4,  5,  6],
        [ 7,  8,  9],
        [10, 11, 12]]), <class 'numpy.matrixlib.defmatrix.matrix'>)
(array([[1, 1, 1, 1],
       [1, 1, 1, 1],
       [1, 1, 1, 1]]), <type 'numpy.ndarray'>)
(matrix([[ 6,  6,  6,  6],
        [15, 15, 15, 15],
        [24, 24, 24, 24],
        [33, 33, 33, 33]]), <class 'numpy.matrixlib.defmatrix.matrix'>)

Since Python 3.5, as mentioned earlier, you can also use the new matrix multiplication operator @, like

C = A @ B

and get the same result as above.


Multiple linear regression in Python

Question: Multiple linear regression in Python

我似乎找不到任何进行多元回归的python库。我发现的唯一的东西只是做简单的回归。我需要针对几个自变量(x1,x2,x3等)对我的因变量(y)进行回归。

例如,使用以下数据:

print 'y        x1      x2       x3       x4      x5     x6       x7'
for t in texts:
    print "{:>7.1f}{:>10.2f}{:>9.2f}{:>9.2f}{:>10.2f}{:>7.2f}{:>7.2f}{:>9.2f}" /
   .format(t.y,t.x1,t.x2,t.x3,t.x4,t.x5,t.x6,t.x7)

(以上输出:)

      y        x1       x2       x3        x4     x5     x6       x7
   -6.0     -4.95    -5.87    -0.76     14.73   4.02   0.20     0.45
   -5.0     -4.55    -4.52    -0.71     13.74   4.47   0.16     0.50
  -10.0    -10.96   -11.64    -0.98     15.49   4.18   0.19     0.53
   -5.0     -1.08    -3.36     0.75     24.72   4.96   0.16     0.60
   -8.0     -6.52    -7.45    -0.86     16.59   4.29   0.10     0.48
   -3.0     -0.81    -2.36    -0.50     22.44   4.81   0.15     0.53
   -6.0     -7.01    -7.33    -0.33     13.93   4.32   0.21     0.50
   -8.0     -4.46    -7.65    -0.94     11.40   4.43   0.16     0.49
   -8.0    -11.54   -10.03    -1.03     18.18   4.28   0.21     0.55

我将如何在python中进行回归,以获得线性回归公式:

Y = a1x1 + a2x2 + a3x3 + a4x4 + a5x5 + a6x6 + + a7x7 + c

I can’t seem to find any python libraries that do multiple regression. The only things I find only do simple regression. I need to regress my dependent variable (y) against several independent variables (x1, x2, x3, etc.).

For example, with this data:

print('y        x1      x2       x3       x4      x5     x6       x7')
for t in texts:
    print("{:>7.1f}{:>10.2f}{:>9.2f}{:>9.2f}{:>10.2f}{:>7.2f}{:>7.2f}{:>9.2f}"
          .format(t.y, t.x1, t.x2, t.x3, t.x4, t.x5, t.x6, t.x7))

(output for above:)

      y        x1       x2       x3        x4     x5     x6       x7
   -6.0     -4.95    -5.87    -0.76     14.73   4.02   0.20     0.45
   -5.0     -4.55    -4.52    -0.71     13.74   4.47   0.16     0.50
  -10.0    -10.96   -11.64    -0.98     15.49   4.18   0.19     0.53
   -5.0     -1.08    -3.36     0.75     24.72   4.96   0.16     0.60
   -8.0     -6.52    -7.45    -0.86     16.59   4.29   0.10     0.48
   -3.0     -0.81    -2.36    -0.50     22.44   4.81   0.15     0.53
   -6.0     -7.01    -7.33    -0.33     13.93   4.32   0.21     0.50
   -8.0     -4.46    -7.65    -0.94     11.40   4.43   0.16     0.49
   -8.0    -11.54   -10.03    -1.03     18.18   4.28   0.21     0.55

How would I regress these in Python to get the linear regression formula:

Y = a1x1 + a2x2 + a3x3 + a4x4 + a5x5 + a6x6 + a7x7 + c


回答 0

sklearn.linear_model.LinearRegression 会做的:

from sklearn import linear_model
clf = linear_model.LinearRegression()
clf.fit([[getattr(t, 'x%d' % i) for i in range(1, 8)] for t in texts],
        [t.y for t in texts])

然后clf.coef_将具有回归系数。

sklearn.linear_model 也具有类似的接口,可以对回归进行各种正则化。

sklearn.linear_model.LinearRegression will do it:

from sklearn import linear_model
clf = linear_model.LinearRegression()
clf.fit([[getattr(t, 'x%d' % i) for i in range(1, 8)] for t in texts],
        [t.y for t in texts])

Then clf.coef_ will have the regression coefficients (a1 through a7) and clf.intercept_ the constant term c.

sklearn.linear_model also has similar interfaces to do various kinds of regularizations on the regression.


回答 1

这是我创建的一些解决方法。我用R检查了它,它可以正常工作。

import numpy as np
import statsmodels.api as sm

y = [1,2,3,4,3,4,5,4,5,5,4,5,4,5,4,5,6,5,4,5,4,3,4]

x = [
     [4,2,3,4,5,4,5,6,7,4,8,9,8,8,6,6,5,5,5,5,5,5,5],
     [4,1,2,3,4,5,6,7,5,8,7,8,7,8,7,8,7,7,7,7,7,6,5],
     [4,1,2,5,6,7,8,9,7,8,7,8,7,7,7,7,7,7,6,6,4,4,4]
     ]

def reg_m(y, x):
    ones = np.ones(len(x[0]))
    X = sm.add_constant(np.column_stack((x[0], ones)))
    for ele in x[1:]:
        X = sm.add_constant(np.column_stack((ele, X)))
    results = sm.OLS(y, X).fit()
    return results

结果:

print reg_m(y, x).summary()

输出:

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.535
Model:                            OLS   Adj. R-squared:                  0.461
Method:                 Least Squares   F-statistic:                     7.281
Date:                Tue, 19 Feb 2013   Prob (F-statistic):            0.00191
Time:                        21:51:28   Log-Likelihood:                -26.025
No. Observations:                  23   AIC:                             60.05
Df Residuals:                      19   BIC:                             64.59
Df Model:                           3                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1             0.2424      0.139      1.739      0.098        -0.049     0.534
x2             0.2360      0.149      1.587      0.129        -0.075     0.547
x3            -0.0618      0.145     -0.427      0.674        -0.365     0.241
const          1.5704      0.633      2.481      0.023         0.245     2.895

==============================================================================
Omnibus:                        6.904   Durbin-Watson:                   1.905
Prob(Omnibus):                  0.032   Jarque-Bera (JB):                4.708
Skew:                          -0.849   Prob(JB):                       0.0950
Kurtosis:                       4.426   Cond. No.                         38.6

pandas 提供了运行此答案中给出的OLS的便捷方法:

使用Pandas Data Frame运行OLS回归

Here is a little workaround that I created. I checked it against R and it works correctly.

import numpy as np
import statsmodels.api as sm

y = [1,2,3,4,3,4,5,4,5,5,4,5,4,5,4,5,6,5,4,5,4,3,4]

x = [
     [4,2,3,4,5,4,5,6,7,4,8,9,8,8,6,6,5,5,5,5,5,5,5],
     [4,1,2,3,4,5,6,7,5,8,7,8,7,8,7,8,7,7,7,7,7,6,5],
     [4,1,2,5,6,7,8,9,7,8,7,8,7,7,7,7,7,7,6,6,4,4,4]
     ]

def reg_m(y, x):
    ones = np.ones(len(x[0]))
    X = sm.add_constant(np.column_stack((x[0], ones)))
    for ele in x[1:]:
        X = sm.add_constant(np.column_stack((ele, X)))
    results = sm.OLS(y, X).fit()
    return results

Result:

print(reg_m(y, x).summary())

Output:

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.535
Model:                            OLS   Adj. R-squared:                  0.461
Method:                 Least Squares   F-statistic:                     7.281
Date:                Tue, 19 Feb 2013   Prob (F-statistic):            0.00191
Time:                        21:51:28   Log-Likelihood:                -26.025
No. Observations:                  23   AIC:                             60.05
Df Residuals:                      19   BIC:                             64.59
Df Model:                           3                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1             0.2424      0.139      1.739      0.098        -0.049     0.534
x2             0.2360      0.149      1.587      0.129        -0.075     0.547
x3            -0.0618      0.145     -0.427      0.674        -0.365     0.241
const          1.5704      0.633      2.481      0.023         0.245     2.895

==============================================================================
Omnibus:                        6.904   Durbin-Watson:                   1.905
Prob(Omnibus):                  0.032   Jarque-Bera (JB):                4.708
Skew:                          -0.849   Prob(JB):                       0.0950
Kurtosis:                       4.426   Cond. No.                         38.6

pandas provides a convenient way to run OLS as given in this answer:

Run an OLS regression with Pandas Data Frame


回答 2

为了澄清起见,您给出的示例是多元线性回归,而不是多元线性回归。区别

单个标量预测变量x和单个标量响应变量y的最简单情况就是简单线性回归。对多个和/或向量值的预测变量(用大写的X表示)的扩展被称为多元线性回归,也称为多元线性回归。几乎所有现实世界中的回归模型都涉及多个预测变量,而线性回归的基本描述通常用多元回归模型来表述。但是请注意,在这些情况下,响应变量y仍然是标量。另一个变量多元线性回归是指y是向量的情况,即与一般线性回归相同。

简而言之:

  • 多元线性回归:响应y是一个标量。
  • 多元线性回归:响应y是向量。

(另一个来源。)

Just to clarify, the example you gave is multiple linear regression, not multivariate linear regression. The difference:

The very simplest case of a single scalar predictor variable x and a single scalar response variable y is known as simple linear regression. The extension to multiple and/or vector-valued predictor variables (denoted with a capital X) is known as multiple linear regression, also known as multivariable linear regression. Nearly all real-world regression models involve multiple predictors, and basic descriptions of linear regression are often phrased in terms of the multiple regression model. Note, however, that in these cases the response variable y is still a scalar. Another term multivariate linear regression refers to cases where y is a vector, i.e., the same as general linear regression. The difference between multivariate linear regression and multivariable linear regression should be emphasized as it causes much confusion and misunderstanding in the literature.

In short:

  • multiple linear regression: the response y is a scalar.
  • multivariate linear regression: the response y is a vector.

(Another source.)
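
In code, the distinction shows up only in the shape of the response. The sketch below uses synthetic, noise-free data (all values are made up for illustration) and relies on `numpy.linalg.lstsq` accepting either a 1-D or a 2-D right-hand side:

```python
import numpy as np

rng = np.random.default_rng(0)
# Three predictors plus a column of ones for the intercept.
X = np.c_[rng.normal(size=(20, 3)), np.ones(20)]

# Multiple regression: one scalar response per observation, shape (20,).
beta_true = np.array([1.0, -2.0, 0.5, 4.0])
y_scalar = X @ beta_true
coef_multiple = np.linalg.lstsq(X, y_scalar, rcond=None)[0]
print(coef_multiple.shape)

# Multivariate regression: two responses per observation, shape (20, 2).
B_true = np.array([[1.0, 0.0], [-2.0, 3.0], [0.5, 1.0], [4.0, -1.0]])
Y_vector = X @ B_true
coef_multivariate = np.linalg.lstsq(X, Y_vector, rcond=None)[0]
print(coef_multivariate.shape)
```

The multivariate fit returns one coefficient column per response, which for ordinary least squares is the same as fitting each response separately.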


回答 3

您可以使用numpy.linalg.lstsq

import numpy as np
y = np.array([-6,-5,-10,-5,-8,-3,-6,-8,-8])
X = np.array([[-4.95,-4.55,-10.96,-1.08,-6.52,-0.81,-7.01,-4.46,-11.54],[-5.87,-4.52,-11.64,-3.36,-7.45,-2.36,-7.33,-7.65,-10.03],[-0.76,-0.71,-0.98,0.75,-0.86,-0.50,-0.33,-0.94,-1.03],[14.73,13.74,15.49,24.72,16.59,22.44,13.93,11.40,18.18],[4.02,4.47,4.18,4.96,4.29,4.81,4.32,4.43,4.28],[0.20,0.16,0.19,0.16,0.10,0.15,0.21,0.16,0.21],[0.45,0.50,0.53,0.60,0.48,0.53,0.50,0.49,0.55]])
X = X.T # transpose so input vectors are along the rows
X = np.c_[X, np.ones(X.shape[0])] # add bias term
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta_hat)

结果:

[ -0.49104607   0.83271938   0.0860167    0.1326091    6.85681762  22.98163883 -41.08437805 -19.08085066]

您可以通过以下方式查看估计的输出:

print(np.dot(X,beta_hat))

结果:

[ -5.97751163,  -5.06465759, -10.16873217,  -4.96959788,  -7.96356915,  -3.06176313,  -6.01818435,  -7.90878145,  -7.86720264]

You can use numpy.linalg.lstsq:

import numpy as np

y = np.array([-6, -5, -10, -5, -8, -3, -6, -8, -8])
X = np.array(
    [
        [-4.95, -4.55, -10.96, -1.08, -6.52, -0.81, -7.01, -4.46, -11.54],
        [-5.87, -4.52, -11.64, -3.36, -7.45, -2.36, -7.33, -7.65, -10.03],
        [-0.76, -0.71, -0.98, 0.75, -0.86, -0.50, -0.33, -0.94, -1.03],
        [14.73, 13.74, 15.49, 24.72, 16.59, 22.44, 13.93, 11.40, 18.18],
        [4.02, 4.47, 4.18, 4.96, 4.29, 4.81, 4.32, 4.43, 4.28],
        [0.20, 0.16, 0.19, 0.16, 0.10, 0.15, 0.21, 0.16, 0.21],
        [0.45, 0.50, 0.53, 0.60, 0.48, 0.53, 0.50, 0.49, 0.55],
    ]
)
X = X.T  # transpose so input vectors are along the rows
X = np.c_[X, np.ones(X.shape[0])]  # add bias term
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta_hat)

Result:

[ -0.49104607   0.83271938   0.0860167    0.1326091    6.85681762  22.98163883 -41.08437805 -19.08085066]

You can see the estimated output with:

print(np.dot(X,beta_hat))

Result:

[ -5.97751163,  -5.06465759, -10.16873217,  -4.96959788,  -7.96356915,  -3.06176313,  -6.01818435,  -7.90878145,  -7.86720264]
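
Note that the data above has 9 observations and 8 fitted parameters, so a near-perfect match between y and the estimated output is expected and says little about predictive quality. One way to check a fit is to look at the residuals and R². The following is a sketch on synthetic data with more observations than parameters (the coefficients and noise level are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(42)
n_obs = 50
features = rng.normal(size=(n_obs, 2))
design = np.c_[features, np.ones(n_obs)]    # last column is the bias term
noise = rng.normal(scale=0.1, size=n_obs)
target = design @ np.array([2.0, -1.0, 0.5]) + noise

beta_hat, _, rank, _ = np.linalg.lstsq(design, target, rcond=None)

residuals = target - design @ beta_hat
total_ss = np.sum((target - target.mean()) ** 2)
r_squared = 1.0 - (residuals @ residuals) / total_ss

print(beta_hat)
print(r_squared)
```

Because the bias column is appended last, the final entry of `beta_hat` is the intercept, matching the column order used in the answer above.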

回答 4

使用scipy.optimize.curve_fit。而且不仅适用于线性拟合。

from scipy.optimize import curve_fit
import scipy

def fn(x, a, b, c):
    return a + b*x[0] + c*x[1]

# y(x0,x1) data:
#    x0=0 1 2
# ___________
# x1=0 |0 1 2
# x1=1 |1 2 3
# x1=2 |2 3 4

x = scipy.array([[0,1,2,0,1,2,0,1,2,],[0,0,0,1,1,1,2,2,2]])
y = scipy.array([0,1,2,1,2,3,2,3,4])
popt, pcov = curve_fit(fn, x, y)
print popt

Use scipy.optimize.curve_fit. It is not limited to linear fits.

import numpy as np
from scipy.optimize import curve_fit

def fn(x, a, b, c):
    # x is a 2 x N array: row 0 holds x0, row 1 holds x1
    return a + b*x[0] + c*x[1]

# y(x0,x1) data:
#    x0=0 1 2
# ___________
# x1=0 |0 1 2
# x1=1 |1 2 3
# x1=2 |2 3 4

x = np.array([[0,1,2,0,1,2,0,1,2],[0,0,0,1,1,1,2,2,2]])
y = np.array([0,1,2,1,2,3,2,3,4])
popt, pcov = curve_fit(fn, x, y)
print(popt)

Answer 5

Once you convert your data to a pandas dataframe (df),

import statsmodels.formula.api as smf
lm = smf.ols(formula='y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7', data=df).fit()
print(lm.params)

The intercept term is included by default.

See this notebook for more examples.


Answer 6

I think this may be the easiest way to do this:

from random import random
from pandas import DataFrame
from statsmodels.api import OLS
lr = lambda : [random() for i in range(100)]
x = DataFrame({'x1': lr(), 'x2':lr(), 'x3':lr()})
x['b'] = 1
y = x.x1 + x.x2 * 2 + x.x3 * 3 + 4

print(x.head())

         x1        x2        x3  b
0  0.433681  0.946723  0.103422  1
1  0.400423  0.527179  0.131674  1
2  0.992441  0.900678  0.360140  1
3  0.413757  0.099319  0.825181  1
4  0.796491  0.862593  0.193554  1

print(y.head())

0    6.637392
1    5.849802
2    7.874218
3    7.087938
4    7.102337
dtype: float64

model = OLS(y, x)
result = model.fit()
print(result.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 5.859e+30
Date:                Wed, 09 Dec 2015   Prob (F-statistic):               0.00
Time:                        15:17:32   Log-Likelihood:                 3224.9
No. Observations:                 100   AIC:                            -6442.
Df Residuals:                      96   BIC:                            -6431.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1             1.0000   8.98e-16   1.11e+15      0.000         1.000     1.000
x2             2.0000   8.28e-16   2.41e+15      0.000         2.000     2.000
x3             3.0000   8.34e-16    3.6e+15      0.000         3.000     3.000
b              4.0000   8.51e-16    4.7e+15      0.000         4.000     4.000
==============================================================================
Omnibus:                        7.675   Durbin-Watson:                   1.614
Prob(Omnibus):                  0.022   Jarque-Bera (JB):                3.118
Skew:                           0.045   Prob(JB):                        0.210
Kurtosis:                       2.140   Cond. No.                         6.89
==============================================================================

Answer 7

Multiple Linear Regression can be handled using the sklearn library as referenced above. I’m using the Anaconda install of Python 3.6.

Create your model as follows:

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X, y)

# display coefficients
print(regressor.coef_)

Answer 8

You can use numpy.linalg.lstsq.
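
A minimal sketch of that suggestion, using made-up data: the arrays and the "true" coefficients below are assumptions for illustration, not values from the question. A column of ones is appended so that the last fitted coefficient acts as the intercept.

```python
import numpy as np

# Synthetic data (assumed for illustration): y = 1*x1 + 2*x2 + 3*x3 + 4
rng = np.random.default_rng(0)
X = rng.random((100, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + 4.0

# Append a column of ones so the model can fit an intercept
A = np.c_[X, np.ones(X.shape[0])]

# lstsq returns (solution, residuals, rank, singular values)
beta_hat, residuals, rank, sv = np.linalg.lstsq(A, y, rcond=None)
print(beta_hat)      # slopes followed by the intercept
print(A @ beta_hat)  # fitted values
```

Because the synthetic data contains no noise, the fitted coefficients should match the ones used to generate y up to floating-point error.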


Answer 9

You can use the function below and pass it a DataFrame:

import pandas as pd
import statsmodels.api as sm

def linear(x, y=None, show=True):
    """
    @param x: pd.DataFrame
    @param y: pd.DataFrame or pd.Series or None
              if None, then use last column of x as y
    @param show: if show regression summary
    """
    # add_constant prepends a 'const' column, so y stays the last column
    xy = sm.add_constant(x if y is None else pd.concat([x, y], axis=1))
    # .ix was removed from pandas; .iloc selects by position
    res = sm.OLS(xy.iloc[:, -1], xy.iloc[:, :-1], missing='drop').fit()

    if show:
        print(res.summary())
    return res

Answer 10

Scikit-learn is a machine learning library for Python which can do this job for you. Just import the sklearn.linear_model module into your script.

Here is a code template for multiple linear regression using sklearn in Python:

import numpy as np
import matplotlib.pyplot as plt #to plot visualizations
import pandas as pd

# Importing the dataset
df = pd.read_csv('<Your-dataset-path>')
# Assigning feature and target variables
X = df.iloc[:,:-1]
y = df.iloc[:,-1]

# One-hot encode any categorical column. The categorical_features argument
# was removed from OneHotEncoder in newer scikit-learn releases, so the
# column is selected with a ColumnTransformer instead.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
encoder = ColumnTransformer(
    [('onehot', OneHotEncoder(drop='first'), ['<column-name>'])],
    remainder='passthrough',  # keep the numeric columns unchanged
    sparse_threshold=0,       # always return a dense array
)
# drop='first' avoids the dummy variable trap
X = encoder.fit_transform(X)

# Splitting the data into test and train set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state = 0, test_size = 0.2)

# Fitting the model
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Predicting the test set results
y_pred = regressor.predict(X_test)

That’s it. You can use this code as a template for implementing multiple linear regression on any dataset. For a better understanding with an example, see: Linear Regression with an example


Answer 11

Here is an alternative and basic method:

from patsy import dmatrices
import statsmodels.api as sm

y,x = dmatrices("y_data ~ x_1 + x_2 ", data = my_data)
### y_data is the name of the dependent variable in your data ### 
model_fit = sm.OLS(y,x)
results = model_fit.fit()
print(results.summary())

Instead of sm.OLS you can also use sm.Logit, sm.Probit, and so on.


Initializing a numpy array

Question: Initializing a numpy array

Is there a way to initialize a numpy array of a given shape and add to it? I will explain what I need with a list example. If I want to create a list of objects generated in a loop, I can do:

a = []
for i in range(5):
    a.append(i)

I want to do something similar with a numpy array. I know about vstack, concatenate etc. However, it seems these require two numpy arrays as inputs. What I need is:

big_array # Initially empty. This is where I don't know what to specify
for i in range(5):
    array i of shape = (2,4) created.
    add to big_array

The big_array should have a shape (10,4). How to do this?


EDIT:

I want to add the following clarification. I am aware that I can define big_array = numpy.zeros((10,4)) and then fill it up. However, this requires specifying the size of big_array in advance. I know the size in this case, but what if I do not? When we use the .append function for extending the list in python, we don’t need to know its final size in advance. I am wondering if something similar exists for creating a bigger array from smaller arrays, starting with an empty array.


Answer 0

numpy.zeros

Return a new array of given shape and type, filled with zeros.

or

numpy.ones

Return a new array of given shape and type, filled with ones.

or

numpy.empty

Return a new array of given shape and type, without initializing entries.


However, the mentality in which we construct an array by appending elements to a list is not much used in numpy, because it’s less efficient (numpy datatypes are much closer to the underlying C arrays). Instead, you should preallocate the array to the size that you need it to be, and then fill in the rows. You can use numpy.append if you must, though.
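
As a rough sketch of both options mentioned above, here is preallocation followed by row-wise filling, next to the numpy.append variant. The block contents (i * ones) are a stand-in assumed for illustration; the question does not say what each (2, 4) array holds.

```python
import numpy as np

n_blocks = 5
big_array = np.empty((2 * n_blocks, 4))  # preallocate the final (10, 4) shape

for i in range(n_blocks):
    block = i * np.ones((2, 4))           # stand-in for the (2, 4) array made in the loop
    big_array[2 * i:2 * (i + 1)] = block  # fill two rows at a time

print(big_array.shape)

# numpy.append also works, but it copies the whole array on every call
grown = np.empty((0, 4))
for i in range(n_blocks):
    grown = np.append(grown, i * np.ones((2, 4)), axis=0)
```

Both loops end with the same (10, 4) array; the difference is that the second one reallocates on each iteration.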


Answer 1

The way I usually do that is by creating a regular list, appending my stuff to it, and finally transforming the list into a numpy array as follows:

import numpy as np
big_array = [] #  empty regular list
for i in range(5):
    arr = i*np.ones((2,4)) # for instance
    big_array.append(arr)
big_np_array = np.array(big_array)  # transformed to a numpy array

Of course, your final object takes twice the space in memory at the creation step, but appending to a Python list is very fast, and so is creating the array with np.array().
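
One detail worth checking with this approach: calling np.array on a list of (2, 4) arrays adds a new leading axis, so the result has shape (5, 2, 4) rather than the (10, 4) asked for in the question. A small sketch of the difference, with np.concatenate as one way to join along the existing row axis (the name chunks is just an illustrative choice):

```python
import numpy as np

chunks = []
for i in range(5):
    chunks.append(i * np.ones((2, 4)))

stacked = np.array(chunks)               # adds a new leading axis
joined = np.concatenate(chunks, axis=0)  # joins along the existing row axis

print(stacked.shape)
print(joined.shape)
```

Reshaping the stacked array to (10, 4) gives the same result as concatenating.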


Answer 2

Introduced in numpy 1.8:

numpy.full

Return a new array of given shape and type, filled with fill_value.

Examples:

>>> import numpy as np
>>> np.full((2, 2), np.inf)
array([[ inf,  inf],
       [ inf,  inf]])
>>> np.full((2, 2), 10)
array([[10, 10],
       [10, 10]])

Answer 3

The array analogue of Python’s

a = []
for i in range(5):
    a.append(i)

is:

import numpy as np

a = np.empty((0))
for i in range(5):
    a = np.append(a, i)

Answer 4

numpy.fromiter() is what you are looking for:

import numpy

big_array = numpy.fromiter(range(5), dtype="int")

It also works with generator expressions, e.g.:

big_array = numpy.fromiter((i*(i+1)//2 for i in range(5)), dtype="int")

If you know the length of the array in advance, you can specify it with an optional ‘count’ argument.
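
A short sketch of the count argument, assuming the number of items n is known up front. Note that numpy.fromiter only builds one-dimensional arrays, so it fits the scalar-append example better than the (2, 4) blocks in the question.

```python
import numpy as np

n = 5
# count lets numpy allocate the output once instead of growing it
triangular = np.fromiter((i * (i + 1) // 2 for i in range(n)), dtype=int, count=n)
print(triangular)
```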


Answer 5

You do want to avoid explicit loops as much as possible when doing array computing, as that reduces the speed gain from that form of computing. There are multiple ways to initialize a numpy array. If you want it filled with zeros, do as katrielalex said:

big_array = numpy.zeros((10,4))

EDIT: What sort of sequence are you making? You should check out the different numpy functions that create arrays, like numpy.linspace(start, stop, num) (equally spaced numbers) or numpy.arange(start, stop, step). Where possible, these functions will make arrays substantially faster than doing the same work in explicit loops.
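
For instance, a minimal sketch of the two functions (the endpoints and step are arbitrary values picked for illustration):

```python
import numpy as np

evenly_spaced = np.linspace(0.0, 1.0, 5)  # 5 points, endpoint included
stepped = np.arange(0, 10, 2)             # step of 2, stop excluded

# the loop-based equivalent of arange, for comparison
looped = np.array([i for i in range(0, 10, 2)])

print(evenly_spaced)
print(stepped)
```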


Answer 6

For your first array example, use

a = numpy.arange(5)

To initialize big_array, use

big_array = numpy.zeros((10,4))

This assumes you want to initialize with zeros, which is pretty typical, but there are many other ways to initialize an array in numpy.

Edit: If you don’t know the size of big_array in advance, it’s generally best to first build a Python list using append, and when you have everything collected in the list, convert this list to a numpy array using numpy.array(mylist). The reason for this is that lists are meant to grow very efficiently and quickly, whereas numpy.concatenate would be very inefficient since numpy arrays don’t change size easily. But once everything is collected in a list, and you know the final array size, a numpy array can be efficiently constructed.


Answer 7

To initialize a numpy array with a specific matrix:

import numpy as np

mat = np.array([[1, 1, 0, 0, 0],
                [0, 1, 0, 0, 1],
                [1, 0, 0, 1, 1],
                [0, 0, 0, 0, 0],
                [1, 0, 1, 0, 1]])

print(mat.shape)
print(mat)

output:

(5, 5)
[[1 1 0 0 0]
 [0 1 0 0 1]
 [1 0 0 1 1]
 [0 0 0 0 0]
 [1 0 1 0 1]]

Answer 8

Whenever you are in the following situation:

a = []
for i in range(5):
    a.append(i)

and you want something similar in numpy, several previous answers have pointed out ways to do it, but as @katrielalex pointed out these methods are not efficient. The efficient way to do this is to build a long list and then reshape it the way you want after you have a long list. For example, let’s say I am reading some lines from a file and each row has a list of numbers and I want to build a numpy array of shape (number of lines read, length of vector in each row). Here is how I would do it more efficiently:

import numpy as np

long_list = []
counter = 0
with open('filename', 'r') as f:
    for row in f:
        row_list = row.split()
        long_list.extend(row_list)
        counter += 1
#  now we have a long list and we are ready to reshape
#  (dtype=float converts the string tokens to numbers)
result = np.array(long_list, dtype=float).reshape(counter, len(row_list)) #  desired numpy array

Answer 9

I realize that this is a bit late, but I did not notice any of the other answers mentioning indexing into the empty array:

import numpy

big_array = numpy.empty((10, 4))  # the shape must be passed as a tuple
for i in range(5):
    array_i = numpy.random.random((2, 4))
    big_array[2 * i:2 * (i + 1), :] = array_i

This way, you preallocate the entire result array with numpy.empty and fill in the rows as you go using indexed assignment.

It is perfectly safe to preallocate with empty instead of zeros in the example you gave since you are guaranteeing that the entire array will be filled with the chunks you generate.


Answer 10

I’d suggest defining shape first. Then iterate over it to insert values.

import numpy as np

big_array = np.zeros(shape=(6, 2))
for it in range(6):
    big_array[it] = (it,it) # For example

>>>big_array

array([[ 0.,  0.],
       [ 1.,  1.],
       [ 2.,  2.],
       [ 3.,  3.],
       [ 4.,  4.],
       [ 5.,  5.]])

Answer 11

Maybe something like this will fit your needs:

import numpy as np

N = 5
res = []

for i in range(N):
    res.append(np.cumsum(np.ones(shape=(2,4))))

res = np.array(res).reshape((10, 4))
print(res)

Which produces the following output

[[ 1.  2.  3.  4.]
 [ 5.  6.  7.  8.]
 [ 1.  2.  3.  4.]
 [ 5.  6.  7.  8.]
 [ 1.  2.  3.  4.]
 [ 5.  6.  7.  8.]
 [ 1.  2.  3.  4.]
 [ 5.  6.  7.  8.]
 [ 1.  2.  3.  4.]
 [ 5.  6.  7.  8.]]

How to save a list as a numpy array in Python?

Question: How to save a list as a numpy array in Python?

Is it possible to construct a NumPy array from a Python list?


Answer 0

If you look here, it will probably tell you what you need to know:

http://www.scipy.org/Tentative_NumPy_Tutorial#head-d3f8e5fe9b903f3c3b2a5c0dfceb60d71602cf93

I’d also recommend going through NumPy’s Quickstart tutorial, which will probably help with these basic questions.

You can directly create an array from a list as:

import numpy as np
a = np.array( [2,3,4] )

Or from a nested list in the same way:

import numpy as np
a = np.array( [[2,3,4], [3,4,5]] )

Answer 1

You mean something like this?

from numpy  import array
a = array( your_list )

Answer 2

Yes it is:

import numpy
a = numpy.array([1,2,3])

Answer 3

You want to save it as a file?

import numpy as np

myList = [1, 2, 3]

np.save('array.npy', np.array(myList))  # ndarray.dump would pickle the array, which np.load rejects by default

… and then read:

myArray = np.load('array.npy')
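
A self-contained sketch of such a round trip, using np.save and np.load with a temporary directory so that nothing is left on disk. The file name and the variable names are arbitrary choices for illustration.

```python
import os
import tempfile

import numpy as np

my_list = [1, 2, 3]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "array.npy")
    np.save(path, np.array(my_list))  # writes the array in .npy format
    loaded = np.load(path)            # reads it back as an ndarray

print(loaded.tolist())
```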

Answer 4

You can use numpy.asarray, for example to convert a list into an array:

>>> import numpy as np
>>> a = [1, 2]
>>> np.asarray(a)
array([1, 2])
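
For a plain list, numpy.asarray and numpy.array behave the same. The difference only shows up when the input is already an array, as this small sketch illustrates (variable names are illustrative):

```python
import numpy as np

values = [1, 2]
from_list = np.asarray(values)   # a list is always converted into a new array

existing = np.array([1, 2])
same = np.asarray(existing)      # already an ndarray: returned as-is
copied = np.array(existing)      # np.array copies by default

print(same is existing)
print(copied is existing)
```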

Answer 5

I suppose you mean converting a list into a numpy array? Then,

import numpy as np

# b is some list, then ...
a = np.array(b).reshape(lengthDim0, lengthDim1)

gives you a as an array built from the list b, in the shape passed to reshape.


Answer 6

Here is a more complete example:

import csv
import numpy as np

with open('filename', newline='') as csvfile:  # the csv module expects text mode in Python 3
    cdl = list(csv.reader(csvfile, delimiter='\t'))
    print("Number of records = " + str(len(cdl)))

#then later

npcdl = np.array(cdl)

Hope this helps!!


Answer 7

import numpy as np 

... ## other code

Some list comprehension:

t=[nodel[ nodenext[i][j] ] for j in idx]
            #for each link, find the node labels 
            #t is the list of node labels 

Convert the list to a numpy array using the array function from the numpy library.

t=np.array(t)

This may be helpful: https://numpy.org/devdocs/user/basics.creation.html


Answer 8

Maybe:

import numpy as np
a=[[1,1],[2,2]]
b=np.asarray(a)
print(type(b))

output:

<class 'numpy.ndarray'>

In-place type conversion of a NumPy array

Question: In-place type conversion of a NumPy array

Given a NumPy array of int32, how do I convert it to float32 in place? So basically, I would like to do

a = a.astype(numpy.float32)

without copying the array. It is big.

The reason for doing this is that I have two algorithms for the computation of a. One of them returns an array of int32, the other returns an array of float32 (and this is inherent to the two different algorithms). All further computations assume that a is an array of float32.

Currently I do the conversion in a C function called via ctypes. Is there a way to do this in Python?


Answer 0

You can make a view with a different dtype, and then copy in-place into the view:

import numpy as np
x = np.arange(10, dtype='int32')
y = x.view('float32')
y[:] = x

print(y)

yields

array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9.], dtype=float32)

To show the conversion was in-place, note that copying from x to y altered x:

print(x)

prints

array([         0, 1065353216, 1073741824, 1077936128, 1082130432,
       1084227584, 1086324736, 1088421888, 1090519040, 1091567616])
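
Another way to check the same thing, sketched here with np.shares_memory rather than by inspecting the reinterpreted integers. The item-size check is an added assumption about when the trick applies: both dtypes need items of the same size.

```python
import numpy as np

x = np.arange(10, dtype=np.int32)
y = x.view(np.float32)   # same buffer, reinterpreted as float32

assert x.itemsize == y.itemsize  # the trick needs equally sized items

y[:] = x                 # convert the values, writing into the shared buffer

print(np.shares_memory(x, y))
print(y)
```

After the assignment, y holds the converted float values while x (still int32) shows the raw bit patterns of those floats.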

Answer 1

Update: This function only avoids copy if it can, hence this is not the correct answer for this question. unutbu’s answer is the right one.


a = a.astype(numpy.float32, copy=False)

numpy astype has a copy flag. Why shouldn’t we use it ?


Answer 2

You can change the array type without converting like this:

a.dtype = numpy.float32

but first you have to change all the integers to something that will be interpreted as the corresponding float. A very slow way to do this would be to use python’s struct module like this:

import struct

def toi(i):
    return struct.unpack('i',struct.pack('f',float(i)))[0]

…applied to each member of your array.

But perhaps a faster way would be to utilize numpy’s ctypeslib tools (which I am unfamiliar with)

– edit –

Since ctypeslib doesn’t seem to work, I would do the conversion with the usual numpy.astype method, but proceed in blocks that fit within your memory limits:

a[0:10000] = a[0:10000].astype('float32').view('int32')

…then change the dtype when done.

Here is a function that accomplishes the task for any compatible dtypes (only works for dtypes with same-sized items) and handles arbitrarily-shaped arrays with user-control over block size:

import numpy

def astype_inplace(a, dtype, blocksize=10000):
    oldtype = a.dtype
    newtype = numpy.dtype(dtype)
    assert oldtype.itemsize == newtype.itemsize  # compare values, not identity
    for idx in range(0, a.size, blocksize):
        a.flat[idx:idx + blocksize] = \
            a.flat[idx:idx + blocksize].astype(newtype).view(oldtype)
    a.dtype = newtype

# int32 is requested explicitly so the item size matches float32 on every platform
a = numpy.random.randint(100, size=100, dtype='int32').reshape((10,10))
print(a)
astype_inplace(a, 'float32')
print(a)

Answer 3

import numpy as np
arr_float = np.arange(10, dtype=np.float32)
arr_int = arr_float.view(np.int32)

Use view() with a dtype argument to reinterpret the array’s memory in place. Note that this reinterprets the raw bytes; it does not convert the values.


Answer 4

Use this:

In [105]: a
Out[105]: 
array([[15, 30, 88, 31, 33],
       [53, 38, 54, 47, 56],
       [67,  2, 74, 10, 16],
       [86, 33, 15, 51, 32],
       [32, 47, 76, 15, 81]], dtype=int32)

In [106]: float32(a)
Out[106]: 
array([[ 15.,  30.,  88.,  31.,  33.],
       [ 53.,  38.,  54.,  47.,  56.],
       [ 67.,   2.,  74.,  10.,  16.],
       [ 86.,  33.,  15.,  51.,  32.],
       [ 32.,  47.,  76.,  15.,  81.]], dtype=float32)

Answer 5

a = np.subtract(a, 0., dtype=np.float32)