Extracting specific columns in numpy array

Question: Extracting specific columns in numpy array

This is an easy question, but say I have an MxN matrix. All I want to do is extract specific columns and store them in another numpy array, but I get invalid syntax errors. Here is the code:

extractedData = data[[:,1],[:,9]]

It seems like the above line should suffice, but I guess not. I looked around but couldn't find anything syntax-wise regarding this specific scenario.


Answer 0

I assume you wanted columns 1 and 9?

To select multiple columns at once, use

X = data[:, [1, 9]]

To select one at a time, use

x, y = data[:, 1], data[:, 9]

With names (for a structured array with named fields, where no row slice is used):

data[['Column Name1','Column Name2']]

You can get the names from data.dtype.names.
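
As a rough illustration of the two selection styles above, here is a minimal sketch. The array contents and the field names a, b and c are made up for the example and are not from the original post.

```python
import numpy as np

# Plain 2-D array: columns are selected by integer position.
data = np.arange(30).reshape(3, 10)
picked = data[:, [1, 9]]           # shape (3, 2)

# Structured array (hypothetical fields a, b, c): fields are selected by name.
records = np.array(
    [(1, 2.0, 3), (4, 5.0, 6)],
    dtype=[("a", "i4"), ("b", "f8"), ("c", "i4")],
)
names = records.dtype.names        # tuple of field names
subset = records[["a", "c"]]       # structured array restricted to fields a and c
```

Note that name-based selection only applies to structured arrays; an ordinary float or int array has no field names, so data.dtype.names is None for it.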


Answer 1

Assuming you want to get columns 1 and 9 with that code snippet, it should be:

extractedData = data[:,[1,9]]

Answer 2

If you want to extract only some columns:

idx_IN_columns = [1, 9]
extractedData = data[:,idx_IN_columns]

If you want to exclude specific columns:

idx_OUT_columns = [1, 9]
idx_IN_columns = [i for i in range(np.shape(data)[1]) if i not in idx_OUT_columns]
extractedData = data[:,idx_IN_columns]
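
Excluding columns can also be done without building the index list by hand. The sketch below shows two alternatives that are not part of the original answer, np.delete and a boolean mask, on a made-up 2x10 array.

```python
import numpy as np

data = np.arange(20).reshape(2, 10)
idx_out = [1, 9]

# Option 1: np.delete returns a copy without the listed columns.
without_delete = np.delete(data, idx_out, axis=1)

# Option 2: boolean mask over the column axis.
mask = np.ones(data.shape[1], dtype=bool)
mask[idx_out] = False
without_mask = data[:, mask]
```

Both produce the same 2x8 result; the mask version is handy when the same exclusion is applied to several arrays.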

Answer 3

One thing I would like to point out: if you extract a single column with an integer index (for example data[:, 1]), the result is not an Mx1 matrix as you might expect, but a 1-D array of length M containing the elements of that column.

To convert it to an Mx1 matrix, call reshape(M, 1) on the resulting array.
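
A small sketch of the shapes involved, using a made-up 4x3 array. Besides reshape, indexing with a one-element list or a slice keeps the column axis; those two variants are additions to what the answer describes.

```python
import numpy as np

data = np.arange(12).reshape(4, 3)   # M = 4 rows, N = 3 columns

flat = data[:, 1]                    # integer index drops the axis -> shape (4,)
as_column = flat.reshape(-1, 1)      # explicit reshape -> shape (4, 1)
kept_2d = data[:, [1]]               # list index keeps the axis -> shape (4, 1)
sliced = data[:, 1:2]                # slice also keeps the axis -> shape (4, 1)
```

Using -1 in reshape lets NumPy infer M, so the row count does not have to be spelled out.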


Answer 4

Just:

>>> m = np.matrix(np.random.random((5, 5)))
>>> m
matrix([[0.91074101, 0.65999332, 0.69774588, 0.007355  , 0.33025395],
        [0.11078742, 0.67463754, 0.43158254, 0.95367876, 0.85926405],
        [0.98665185, 0.86431513, 0.12153138, 0.73006437, 0.13404811],
        [0.24602225, 0.66139215, 0.08400288, 0.56769924, 0.47974697],
        [0.25345299, 0.76385882, 0.11002419, 0.2509888 , 0.06312359]])
>>> m[:,[1, 2]]
matrix([[0.65999332, 0.69774588],
        [0.67463754, 0.43158254],
        [0.86431513, 0.12153138],
        [0.66139215, 0.08400288],
        [0.76385882, 0.11002419]])

The columns need not be in order:

>>> m[:,[2, 1, 3]]
matrix([[0.69774588, 0.65999332, 0.007355  ],
        [0.43158254, 0.67463754, 0.95367876],
        [0.12153138, 0.86431513, 0.73006437],
        [0.08400288, 0.66139215, 0.56769924],
        [0.11002419, 0.76385882, 0.2509888 ]])

Answer 5

One more thing you should pay attention to when selecting columns from N-D array using a list like this:

data[:,:,[1,9]]

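If you are removing a dimension (by selecting only one row, for example), the axes of the resulting array come out permuted. This happens because the integer index and the list index are separated by a slice, in which case NumPy puts the dimensions produced by those indices first. So:

print(data.shape)           # gives (10, 20, 30)
selection = data[1,:,[1,9]]
print(selection.shape)      # gives (2, 20) instead of (20, 2)!!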
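
One way to sidestep the permuted axes is to index in two steps, so that the list index is no longer combined with the integer index. A minimal sketch on a made-up array of the same shape:

```python
import numpy as np

data = np.arange(10 * 20 * 30).reshape(10, 20, 30)

mixed = data[1, :, [1, 9]]        # integer and list index split by a slice -> (2, 20)
two_step = data[1][:, [1, 9]]     # pick the row first, then the columns -> (20, 2)
transposed = mixed.T              # transposing the mixed result gives the same values
```

Both two_step and transposed hold the same values; the two-step form just makes the intended axis order explicit.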

Answer 6

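You can use the following, assuming data is a pandas DataFrame (the .ix indexer originally used here has been removed from pandas; .loc does the same label-based selection):

extracted_data = data.loc[:,['Column1','Column2']]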

Answer 7

I think the pandas-based solution above no longer works after version updates. Assuming data is a pandas DataFrame, one way to do it with a newer method is:

extracted_data = data[['Column Name1','Column Name2']].to_numpy()

This gives you the desired outcome.

You can find the documentation here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_numpy.html#pandas.DataFrame.to_numpy
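
For completeness, a minimal sketch of the DataFrame route. The frame contents and the extra column Other are invented for illustration; only the selection pattern comes from the answer.

```python
import pandas as pd

# Hypothetical frame; the column names mirror the placeholders used above.
df = pd.DataFrame(
    {"Column Name1": [1, 2], "Column Name2": [3, 4], "Other": [5, 6]}
)

# Select the columns by label, then convert to a plain NumPy array.
extracted_data = df[["Column Name1", "Column Name2"]].to_numpy()
```

The result is an ordinary 2-D ndarray with one column per selected label, in the order the labels were listed.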


Answer 8

You can also select the two columns separately and stack them back together: extractedData = np.column_stack((data[:, 1], data[:, 9]))


numpy - add row to array

Question: numpy - add row to array

How does one add rows to a numpy array?

I have an array A:

A = array([[0, 1, 2], [0, 2, 0]])

I wish to add rows to this array from another array X if the first element of each row in X meets a specific condition.

Numpy arrays do not have a method ‘append’ like that of lists, or so it seems.

If A and X were lists I would merely do:

for i in X:
    if i[0] < 3:
        A.append(i)

Is there a numpythonic way to do the equivalent?

Thanks, S ;-)


Answer 0

What is X? If it is a 2D-array, how can you then compare its row to a number: i < 3?

EDIT after OP’s comment:

import numpy as np
A = np.array([[0, 1, 2], [0, 2, 0]])
X = np.array([[0, 1, 2], [1, 2, 0], [2, 1, 2], [3, 2, 0]])

Add to A all rows from X where the first element < 3:

A = np.vstack((A, X[X[:,0] < 3]))

# A is now:
array([[0, 1, 2],
       [0, 2, 0],
       [0, 1, 2],
       [1, 2, 0],
       [2, 1, 2]])

Answer 1

Well, you can do this:

  newrow = [1,2,3]
  A = numpy.vstack([A, newrow])

Answer 2

This question is seven years old; with the versions I am using (numpy 1.13 and Python 3), adding a row to a matrix works the same way. Remember to put double brackets around the second argument, otherwise it will raise a dimension error.

Here I am extending matrix A

1 2 3
4 5 6

with the row

7 8 9

np.r_ is used the same way:

A = [[1, 2, 3], [4, 5, 6]]
np.append(A, [[7, 8, 9]], axis=0)

    >> array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
# or
np.r_[A, [[7, 8, 9]]]

In case anyone is interested, if you would like to add a column:

array = np.c_[A, np.zeros(len(A))]  # len(A) is the number of rows in A

Following what we did before on matrix A, adding a column to it:

np.c_[A, [2, 8]]

>> array([[1, 2, 3, 2],
          [4, 5, 6, 8]])

Answer 3

You can also do this (the new row has to be wrapped in a list so that it is 2-D like A):

newrow = [1,2,3]
A = numpy.concatenate((A, [newrow]))
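
The dimension requirement can be checked directly. This is a small sketch using the array A from the question; the names B, C and mismatch_raised are only for illustration.

```python
import numpy as np

A = np.array([[0, 1, 2], [0, 2, 0]])
newrow = [1, 2, 3]

# Concatenating a 2-D array with a 1-D row fails with a ValueError.
try:
    np.concatenate((A, newrow))
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

# Wrapping the row in a list makes it shape (1, 3), which concatenates fine.
B = np.concatenate((A, [newrow]), axis=0)

# np.atleast_2d does the same promotion for an existing 1-D array or list.
C = np.concatenate((A, np.atleast_2d(newrow)), axis=0)
```

np.vstack performs this promotion automatically, which is why the vstack-based answers accept a plain 1-D row.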

Answer 4

If no calculations are necessary after every row, it’s much quicker to add rows in python, then convert to numpy. Here are timing tests using python 3.6 vs. numpy 1.14, adding 100 rows, one at a time:

import numpy as np 
from time import perf_counter, sleep

def time_it():
    # Compare performance of two methods for adding rows to numpy array
    py_array = [[0, 1, 2], [0, 2, 0]]
    py_row = [4, 5, 6]
    numpy_array = np.array(py_array)
    numpy_row = np.array([4,5,6])
    n_loops = 100

    start_clock = perf_counter()
    for count in range(0, n_loops):
        numpy_array = np.vstack([numpy_array, numpy_row]) # 5.8 micros
    duration = perf_counter() - start_clock
    print('numpy 1.14 takes {:.3f} micros per row'.format(duration * 1e6 / n_loops))

    start_clock = perf_counter()
    for count in range(0, n_loops):
        py_array.append(py_row) # .15 micros
    numpy_array = np.array(py_array) # 43.9 micros       
    duration = perf_counter() - start_clock
    print('python 3.6 takes {:.3f} micros per row'.format(duration * 1e6 / n_loops))
    sleep(15)

#time_it() prints:

numpy 1.14 takes 5.971 micros per row
python 3.6 takes 0.694 micros per row

So, the simple solution to the original question, from seven years ago, is to use vstack() to add a new row after converting the row to a numpy array. But a more realistic solution should consider vstack’s poor performance under those circumstances. If you don’t need to run data analysis on the array after every addition, it is better to buffer the new rows to a python list of rows (a list of lists, really), and add them as a group to the numpy array using vstack() before doing any data analysis.


Answer 5

import numpy as np
array_ = np.array([[1,2,3]])
add_row = np.array([[4,5,6]])

array_ = np.concatenate((array_, add_row), axis=0)

Answer 6

If you can do the construction in a single operation, then something like the vstack-with-fancy-indexing answer is a fine approach. But if your condition is more complicated or your rows come in on the fly, you may want to grow the array. In fact the numpythonic way to do something like this – dynamically grow an array – is to dynamically grow a list:

A = np.array([[1,2,3],[4,5,6]])
Alist = [r for r in A]
for i in range(100):
    newrow = np.arange(3)+i
    if i%5:
        Alist.append(newrow)
A = np.array(Alist)
del Alist

Lists are highly optimized for this kind of access pattern; you don’t have convenient numpy multidimensional indexing while in list form, but for as long as you’re appending it’s hard to do better than a list of row arrays.


Answer 7

I use np.vstack, which is faster, e.g.:

import numpy as np

input_array=np.array([1,2,3])
new_row= np.array([4,5,6])

new_array=np.vstack([input_array, new_row])

Answer 8

You can use numpy.append() to append a row to a numpy array and reshape it to a matrix later on.

import numpy as np
a = np.array([1,2])
a = np.append(a, [3,4])
print(a)
# [1 2 3 4]
# in your example (X is the array of candidate rows from the question)
A = [1,2]
for row in X:
    A = np.append(A, row)

NumPy or Pandas: Keeping array type as integer while having a NaN value

Question: NumPy or Pandas: Keeping array type as integer while having a NaN value

Is there a preferred way to keep the data type of a numpy array fixed as int (or int64 or whatever), while still having an element inside listed as numpy.NaN?

In particular, I am converting an in-house data structure to a Pandas DataFrame. In our structure, we have integer-type columns that still have NaN’s (but the dtype of the column is int). It seems to recast everything as a float if we make this a DataFrame, but we’d really like to be int.

Thoughts?

Things tried:

I tried using the from_records() function under pandas.DataFrame, with coerce_float=False and this did not help. I also tried using NumPy masked arrays, with NaN fill_value, which also did not work. All of these caused the column data type to become a float.


Answer 0

This capability has been added to pandas (beginning with version 0.24): https://pandas.pydata.org/pandas-docs/version/0.24/whatsnew/v0.24.0.html#optional-integer-na-support

At this point, it requires the use of extension dtype Int64 (capitalized), rather than the default dtype int64 (lowercase).
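
A minimal sketch of the difference between the default dtype and the extension dtype, assuming pandas 0.24 or later. The series values are made up.

```python
import numpy as np
import pandas as pd

# Default behaviour: an integer series containing NaN is upcast to float64.
plain = pd.Series([1, 2, np.nan])

# Extension dtype "Int64" (capital I): stays integer and keeps the missing value.
nullable = pd.Series([1, 2, None], dtype="Int64")

# An existing float series with whole-number values can be converted as well.
converted = plain.astype("Int64")
```

The dtype is requested by its string name here; lowercase "int64" would still refer to the ordinary NumPy dtype, which cannot hold missing values.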


Answer 1

NaN can’t be stored in an integer array. This is a known limitation of pandas at the moment; I have been waiting for progress to be made with NA values in NumPy (similar to NAs in R), but it will be at least 6 months to a year before NumPy gets these features, it seems:

http://pandas.pydata.org/pandas-docs/stable/gotchas.html#support-for-integer-na

(This feature has been added beginning with version 0.24 of pandas, but note it requires the use of extension dtype Int64 (capitalized), rather than the default dtype int64 (lower case): https://pandas.pydata.org/pandas-docs/version/0.24/whatsnew/v0.24.0.html#optional-integer-na-support )


Answer 2

If performance is not the main issue, you can store strings instead.

df.col = df.col.dropna().apply(lambda x: str(int(x)) )

Then you can mix them with NaN as much as you want. If you really want to have integers, depending on your application, you can use -1, or 0, or 1234567890, or some other dedicated value to represent NaN.

You can also temporarily duplicate the columns: one as you have, with floats; the other one experimental, with ints or strings. Then insert asserts in every reasonable place, checking that the two are in sync. After enough testing you can let go of the floats.


Answer 3

This is not a solution for all cases, but in mine (genomic coordinates) I've resorted to using 0 as NaN:

a3['MapInfo'] = a3['MapInfo'].fillna(0).astype(int)

This at least allows the proper 'native' column type to be used, so operations like subtraction, comparison etc. work as expected.


Answer 4

Pandas v0.24+

Functionality to support NaN in integer series will be available in v0.24 upwards. There’s information on this in the v0.24 “What’s New” section, and more details under Nullable Integer Data Type.

Pandas v0.23 and earlier

In general, it’s best to work with float series where possible, even when the series is upcast from int to float due to inclusion of NaN values. This enables vectorised NumPy-based calculations where, otherwise, Python-level loops would be processed.

The docs do suggest : “One possibility is to use dtype=object arrays instead.” For example:

s = pd.Series([1, 2, 3, np.nan])

print(s.astype(object))

0      1
1      2
2      3
3    NaN
dtype: object

For cosmetic reasons, e.g. output to a file, this may be preferable.

Pandas v0.23 and earlier: background

NaN is considered a float. The docs currently (as of v0.23) specify the reason why integer series are upcast to float:

In the absence of high performance NA support being built into NumPy from the ground up, the primary casualty is the ability to represent NAs in integer arrays.

This trade-off is made largely for memory and performance reasons, and also so that the resulting Series continues to be “numeric”.

The docs also provide rules for upcasting due to NaN inclusion:

Typeclass   Promotion dtype for storing NAs
floating    no change
object      no change
integer     cast to float64
boolean     cast to object
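
The promotion rules in the table can be observed by introducing a missing row into a series. This is a rough sketch with made-up data; reindex is used here only as a convenient way to create a missing entry.

```python
import numpy as np
import pandas as pd

ints = pd.Series([1, 2, 3])
bools = pd.Series([True, False, True])

# Reindexing to a label that does not exist (3) inserts a missing value.
ints_with_nan = ints.reindex(range(4))     # integer -> float64
bools_with_nan = bools.reindex(range(4))   # boolean -> object
```

Checking ints_with_nan.dtype and bools_with_nan.dtype after this should show float64 and object respectively, matching the last two rows of the table.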

Answer 5

This is now possible, since pandas v0.24.0.

Quote from the pandas 0.24.x release notes: “Pandas has gained the ability to hold integer dtypes with missing values.”


Answer 6

Just wanted to add that if you are trying to convert a float vector (e.g. 1.143) containing NA values to integer (1), converting directly to the new 'Int64' dtype will give you an error. To solve this, you have to round the numbers first and then call .astype('Int64'):

s1 = pd.Series([1.434, 2.343, np.nan])
#without round() the next line returns an error 
s1.astype('Int64')
#cannot safely cast non-equivalent float64 to int64
##with round() it works
s1.round().astype('Int64')
0      1
1      2
2    NaN
dtype: Int64

My use case is that I have a float series that I want to round to int, but after .round() a '.0' still remains at the end of each number; converting to Int64 drops it.


Answer 7

If there are blanks in the text data, columns that would normally be integers will be cast to float64, because the int64 dtype cannot handle nulls. This can cause an inconsistent schema if you are loading multiple files, some with blanks (which will end up as float64) and others without (which will end up as int64).

This code will attempt to convert any numeric columns to Int64 (as opposed to int64), since Int64 can handle nulls:

import pandas as pd
import numpy as np

#show datatypes before transformation
mydf.dtypes

for c in mydf.select_dtypes(np.number).columns:
    try:
        mydf[c] = mydf[c].astype('Int64')
        print('casted {} as Int64'.format(c))
    except:
        print('could not cast {} to Int64'.format(c))

#show datatypes after transformation
mydf.dtypes

Suppress scientific notation in Numpy when creating array from nested list

Question: Suppress scientific notation in Numpy when creating array from nested list

I have a nested Python list that looks like the following:

my_list = [[3.74, 5162, 13683628846.64, 12783387559.86, 1.81],
 [9.55, 116, 189688622.37, 260332262.0, 1.97],
 [2.2, 768, 6004865.13, 5759960.98, 1.21],
 [3.74, 4062, 3263822121.39, 3066869087.9, 1.93],
 [1.91, 474, 44555062.72, 44555062.72, 0.41],
 [5.8, 5006, 8254968918.1, 7446788272.74, 3.25],
 [4.5, 7887, 30078971595.46, 27814989471.31, 2.18],
 [7.03, 116, 66252511.46, 81109291.0, 1.56],
 [6.52, 116, 47674230.76, 57686991.0, 1.43],
 [1.85, 623, 3002631.96, 2899484.08, 0.64],
 [13.76, 1227, 1737874137.5, 1446511574.32, 4.32],
 [13.76, 1227, 1737874137.5, 1446511574.32, 4.32]]

I then import Numpy, and set print options to (suppress=True). When I create an array:

my_array = numpy.array(my_list)

I can’t for the life of me suppress scientific notation:

[[  3.74000000e+00   5.16200000e+03   1.36836288e+10   1.27833876e+10
    1.81000000e+00]
 [  9.55000000e+00   1.16000000e+02   1.89688622e+08   2.60332262e+08
    1.97000000e+00]
 [  2.20000000e+00   7.68000000e+02   6.00486513e+06   5.75996098e+06
    1.21000000e+00]
 [  3.74000000e+00   4.06200000e+03   3.26382212e+09   3.06686909e+09
    1.93000000e+00]
 [  1.91000000e+00   4.74000000e+02   4.45550627e+07   4.45550627e+07
    4.10000000e-01]
 [  5.80000000e+00   5.00600000e+03   8.25496892e+09   7.44678827e+09
    3.25000000e+00]
 [  4.50000000e+00   7.88700000e+03   3.00789716e+10   2.78149895e+10
    2.18000000e+00]
 [  7.03000000e+00   1.16000000e+02   6.62525115e+07   8.11092910e+07
    1.56000000e+00]
 [  6.52000000e+00   1.16000000e+02   4.76742308e+07   5.76869910e+07
    1.43000000e+00]
 [  1.85000000e+00   6.23000000e+02   3.00263196e+06   2.89948408e+06
    6.40000000e-01]
 [  1.37600000e+01   1.22700000e+03   1.73787414e+09   1.44651157e+09
    4.32000000e+00]
 [  1.37600000e+01   1.22700000e+03   1.73787414e+09   1.44651157e+09
    4.32000000e+00]]

If I create a simple numpy array directly:

new_array = numpy.array([1.5, 4.65, 7.845])

I have no problem and it prints as follows:

[ 1.5    4.65   7.845]

Does anyone know what my problem is?


Answer 0

I guess what you need is np.set_printoptions(suppress=True), for details see here: http://pythonquirks.blogspot.fr/2009/10/controlling-printing-in-numpy.html

For SciPy.org numpy documentation, which includes all function parameters (suppress isn’t detailed in the above link), see here: https://docs.scipy.org/doc/numpy/reference/generated/numpy.set_printoptions.html
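
Since the question reports that suppress=True alone did not help with very large values, a formatter may be the more reliable route. Below is a minimal sketch, assuming NumPy 1.15 or later for the np.printoptions context manager; the sample values are taken from the first row of the question's list.

```python
import numpy as np

values = np.array([3.74, 5162, 13683628846.64, 1.81])

# np.printoptions is a context manager, so the formatter only applies inside it.
with np.printoptions(formatter={"float_kind": "{:.2f}".format}):
    local_text = str(values)

# np.array2string formats a single array without touching global state.
fixed_text = np.array2string(
    values, formatter={"float_kind": lambda x: f"{x:.2f}"}
)
```

Either form leaves the global print options untouched, which avoids surprising other code that prints arrays later.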


Answer 1

Python: force-suppress all exponential notation when printing numpy ndarrays, and wrangle text justification, rounding and print options:

What follows is an explanation of what is going on; scroll to the bottom for the code demos.

Passing the parameter suppress=True to the function set_printoptions only works for numbers that fit in the default 8-character space allotted to them, like this:

import numpy as np
np.set_printoptions(suppress=True) #prevent numpy exponential 
                                   #notation on print, default False

#            tiny     med  large
a = np.array([1.01e-5, 22, 1.2345678e7])  #notice how index 2 is 8 
                                          #digits wide

print(a)    #prints [ 0.0000101   22.     12345678. ]

However, if you pass in a number wider than 8 characters, exponential notation is imposed again, like this:

np.set_printoptions(suppress=True)

a = np.array([1.01e-5, 22, 1.2345678e10])    #notice how index 2 is 10
                                             #digits wide, too wide!

#exponential notation where we've told it not to!
print(a)    #prints [1.01000000e-005   2.20000000e+001   1.23456780e+10]

numpy has a choice between chopping your number in half and thus misrepresenting it, or forcing exponential notation; it chooses the latter.

This is where set_printoptions(formatter=...) comes to the rescue, to specify options for printing and rounding. Tell set_printoptions to just print a bare float:

np.set_printoptions(suppress=True,
   formatter={'float_kind':'{:f}'.format})

a = np.array([1.01e-5, 22, 1.2345678e30])  #notice how index 2 is 30
                                           #digits wide.  

#Ok good, no exponential notation in the large numbers:
print(a)  #prints [0.000010 22.000000 1234567799999999979944197226496.000000] 

We have force-suppressed exponential notation, but the output is not rounded, so specify extra formatting options:

np.set_printoptions(suppress=True,
   formatter={'float_kind':'{:0.2f}'.format})  #float, 2 units 
                                               #precision right, 0 on left

a = np.array([1.01e-5, 22, 1.2345678e30])   #notice how index 2 is 30
                                            #digits wide

print(a)  #prints [0.00 22.00 1234567799999999979944197226496.00]

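The drawback of force-suppressing all exponential notation in ndarrays is that if your ndarray gets a huge float value near infinity in it and you print it, you will get blasted with a page full of digits.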

Full example demo 1:

from pprint import pprint
import numpy as np
#chaotic python list of lists with very different numeric magnitudes
my_list = [[3.74, 5162, 13683628846.64, 12783387559.86, 1.81],
           [9.55, 116, 189688622.37, 260332262.0, 1.97],
           [2.2, 768, 6004865.13, 5759960.98, 1.21],
           [3.74, 4062, 3263822121.39, 3066869087.9, 1.93],
           [1.91, 474, 44555062.72, 44555062.72, 0.41],
           [5.8, 5006, 8254968918.1, 7446788272.74, 3.25],
           [4.5, 7887, 30078971595.46, 27814989471.31, 2.18],
           [7.03, 116, 66252511.46, 81109291.0, 1.56],
           [6.52, 116, 47674230.76, 57686991.0, 1.43],
           [1.85, 623, 3002631.96, 2899484.08, 0.64],
           [13.76, 1227, 1737874137.5, 1446511574.32, 4.32],
           [13.76, 1227, 1737874137.5, 1446511574.32, 4.32]]

#convert python list of lists to numpy ndarray called my_array
my_array = np.array(my_list)

#This is a little recursive helper function converts all nested 
#ndarrays to python list of lists so that pretty printer knows what to do.
def arrayToList(arr):
    if type(arr) == type(np.array):
        #If the passed type is an ndarray then convert it to a list and
        #recursively convert all nested types
        return arrayToList(arr.tolist())
    else:
        #if item isn't an ndarray leave it as is.
        return arr

#suppress exponential notation, define an appropriate float formatter
#specify stdout line width and let pretty print do the work
np.set_printoptions(suppress=True,
   formatter={'float_kind':'{:16.3f}'.format}, linewidth=130)
pprint(arrayToList(my_array))

Which prints:

array([[           3.740,         5162.000,  13683628846.640,  12783387559.860,            1.810],
       [           9.550,          116.000,    189688622.370,    260332262.000,            1.970],
       [           2.200,          768.000,      6004865.130,      5759960.980,            1.210],
       [           3.740,         4062.000,   3263822121.390,   3066869087.900,            1.930],
       [           1.910,          474.000,     44555062.720,     44555062.720,            0.410],
       [           5.800,         5006.000,   8254968918.100,   7446788272.740,            3.250],
       [           4.500,         7887.000,  30078971595.460,  27814989471.310,            2.180],
       [           7.030,          116.000,     66252511.460,     81109291.000,            1.560],
       [           6.520,          116.000,     47674230.760,     57686991.000,            1.430],
       [           1.850,          623.000,      3002631.960,      2899484.080,            0.640],
       [          13.760,         1227.000,   1737874137.500,   1446511574.320,            4.320],
       [          13.760,         1227.000,   1737874137.500,   1446511574.320,            4.320]])

Full example demo 2:

import numpy as np  
#chaotic python list of lists with very different numeric magnitudes 

#            very tiny      medium size            large sized
#            numbers        numbers                numbers

my_list = [[0.000000000074, 5162, 13683628846.64, 1.01e10, 1.81], 
           [1.000000000055,  116, 189688622.37, 260332262.0, 1.97], 
           [0.010000000022,  768, 6004865.13,   -99e13, 1.21], 
           [1.000000000074, 4062, 3263822121.39, 3066869087.9, 1.93], 
           [2.91,            474, 44555062.72, 44555062.72, 0.41], 
           [5,              5006, 8254968918.1, 7446788272.74, 3.25], 
           [0.01,           7887, 30078971595.46, 27814989471.31, 2.18], 
           [7.03,            116, 66252511.46, 81109291.0, 1.56], 
           [6.52,            116, 47674230.76, 57686991.0, 1.43], 
           [1.85,            623, 3002631.96, 2899484.08, 0.64], 
           [13.76,          1227, 1737874137.5, 1446511574.32, 4.32], 
           [13.76,          1337, 1737874137.5, 1446511574.32, 4.32]] 
import sys 
#convert python list of lists to numpy ndarray called my_array 
my_array = np.array(my_list) 
#following two lines do the same thing, showing that np.savetxt can 
#correctly handle python lists of lists and numpy 2D ndarrays. 
np.savetxt(sys.stdout, my_list, '%19.2f') 
np.savetxt(sys.stdout, my_array, '%19.2f') 

Prints:

 0.00             5162.00      13683628846.64      10100000000.00              1.81
 1.00              116.00        189688622.37        260332262.00              1.97
 0.01              768.00          6004865.13 -990000000000000.00              1.21
 1.00             4062.00       3263822121.39       3066869087.90              1.93
 2.91              474.00         44555062.72         44555062.72              0.41
 5.00             5006.00       8254968918.10       7446788272.74              3.25
 0.01             7887.00      30078971595.46      27814989471.31              2.18
 7.03              116.00         66252511.46         81109291.00              1.56
 6.52              116.00         47674230.76         57686991.00              1.43
 1.85              623.00          3002631.96          2899484.08              0.64
13.76             1227.00       1737874137.50       1446511574.32              4.32
13.76             1337.00       1737874137.50       1446511574.32              4.32
 0.00             5162.00      13683628846.64      10100000000.00              1.81
 1.00              116.00        189688622.37        260332262.00              1.97
 0.01              768.00          6004865.13 -990000000000000.00              1.21
 1.00             4062.00       3263822121.39       3066869087.90              1.93
 2.91              474.00         44555062.72         44555062.72              0.41
 5.00             5006.00       8254968918.10       7446788272.74              3.25
 0.01             7887.00      30078971595.46      27814989471.31              2.18
 7.03              116.00         66252511.46         81109291.00              1.56
 6.52              116.00         47674230.76         57686991.00              1.43
 1.85              623.00          3002631.96          2899484.08              0.64
13.76             1227.00       1737874137.50       1446511574.32              4.32
13.76             1337.00       1737874137.50       1446511574.32              4.32

Notice that rounding is consistent at 2 decimal places, and exponential notation is suppressed for both very large (e+x) and very small (e-x) values.

Python Force-suppress all exponential notation when printing numpy ndarrays, wrangle text justification, rounding and print options:

What follows is an explanation of what is going on; scroll to the bottom for code demos.

Passing parameter suppress=True to function set_printoptions works only for numbers that fit in the default 8 character space allotted to it, like this:

import numpy as np
np.set_printoptions(suppress=True) #prevent numpy exponential 
                                   #notation on print, default False

#            tiny     med  large
a = np.array([1.01e-5, 22, 1.2345678e7])  #notice how index 2 is 8 
                                          #digits wide

print(a)    #prints [ 0.0000101   22.     12345678. ]

However, if you pass in a number more than 8 characters wide, exponential notation is imposed again, like this:

np.set_printoptions(suppress=True)

a = np.array([1.01e-5, 22, 1.2345678e10])    #notice how index 2 is 10
                                             #digits wide, too wide!

#exponential notation where we've told it not to!
print(a)    #prints [1.01000000e-005   2.20000000e+001   1.23456780e+10]

numpy has a choice between chopping your number in half, thus misrepresenting it, or forcing exponential notation; it chooses the latter.

Here set_printoptions(formatter=...) comes to the rescue, letting you specify options for printing and rounding. Tell set_printoptions to just print a bare float:

np.set_printoptions(suppress=True,
   formatter={'float_kind':'{:f}'.format})

a = np.array([1.01e-5, 22, 1.2345678e30])  #notice how index 2 is 30
                                           #digits wide.  

#Ok good, no exponential notation in the large numbers:
print(a)  #prints [0.000010 22.000000 1234567799999999979944197226496.000000] 

We’ve force-suppressed the exponential notation, but it is not rounded or justified, so specify extra formatting options:

np.set_printoptions(suppress=True,
   formatter={'float_kind':'{:0.2f}'.format})  #float, 2 units 
                                               #precision right, 0 on left

a = np.array([1.01e-5, 22, 1.2345678e30])   #notice how index 2 is 30
                                            #digits wide

print(a)  #prints [0.00 22.00 1234567799999999979944197226496.00]

The drawback of force-suppressing all exponential notation in ndarrays is that if your ndarray gets a huge float value near infinity in it and you print it, you're going to get blasted in the face with a page full of digits.
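One way to keep the forced fixed-point formatting from leaking into the rest of a program is to apply it only for the duration of a single call. The sketch below assumes a NumPy version that provides the np.printoptions context manager; format_fixed is a hypothetical helper name, not something from the original answer.

```python
import numpy as np

def format_fixed(arr, decimals=2):
    """Return arr as a string in fixed-point notation (hypothetical helper)."""
    fmt = ('{:0.' + str(decimals) + 'f}').format
    # np.printoptions restores the previous print settings when the block exits
    with np.printoptions(suppress=True, formatter={'float_kind': fmt}):
        return np.array2string(arr)

a = np.array([1.01e-5, 22, 1.2345678e10])
text = format_fixed(a)
print(text)
```

Because the options are scoped to the with block, arrays printed elsewhere keep NumPy's default formatting.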

Full example Demo 1:

from pprint import pprint
import numpy as np
#chaotic python list of lists with very different numeric magnitudes
my_list = [[3.74, 5162, 13683628846.64, 12783387559.86, 1.81],
           [9.55, 116, 189688622.37, 260332262.0, 1.97],
           [2.2, 768, 6004865.13, 5759960.98, 1.21],
           [3.74, 4062, 3263822121.39, 3066869087.9, 1.93],
           [1.91, 474, 44555062.72, 44555062.72, 0.41],
           [5.8, 5006, 8254968918.1, 7446788272.74, 3.25],
           [4.5, 7887, 30078971595.46, 27814989471.31, 2.18],
           [7.03, 116, 66252511.46, 81109291.0, 1.56],
           [6.52, 116, 47674230.76, 57686991.0, 1.43],
           [1.85, 623, 3002631.96, 2899484.08, 0.64],
           [13.76, 1227, 1737874137.5, 1446511574.32, 4.32],
           [13.76, 1227, 1737874137.5, 1446511574.32, 4.32]]

#convert python list of lists to numpy ndarray called my_array
my_array = np.array(my_list)

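#This little recursive helper function is meant to convert all nested
#ndarrays to python lists of lists so that pretty printer knows what to do.
#NOTE: type(np.array) is the type of the np.array factory function, not
#np.ndarray, so the test below is never true and ndarrays are returned
#unchanged (which is why the output below still starts with "array(").
#Use isinstance(arr, np.ndarray) if you want the conversion to happen.
def arrayToList(arr):
    if type(arr) == type(np.array):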
        #If the passed type is an ndarray then convert it to a list and
        #recursively convert all nested types
        return arrayToList(arr.tolist())
    else:
        #if item isn't an ndarray leave it as is.
        return arr

#suppress exponential notation, define an appropriate float formatter
#specify stdout line width and let pretty print do the work
np.set_printoptions(suppress=True,
   formatter={'float_kind':'{:16.3f}'.format}, linewidth=130)
pprint(arrayToList(my_array))

Prints:

array([[           3.740,         5162.000,  13683628846.640,  12783387559.860,            1.810],
       [           9.550,          116.000,    189688622.370,    260332262.000,            1.970],
       [           2.200,          768.000,      6004865.130,      5759960.980,            1.210],
       [           3.740,         4062.000,   3263822121.390,   3066869087.900,            1.930],
       [           1.910,          474.000,     44555062.720,     44555062.720,            0.410],
       [           5.800,         5006.000,   8254968918.100,   7446788272.740,            3.250],
       [           4.500,         7887.000,  30078971595.460,  27814989471.310,            2.180],
       [           7.030,          116.000,     66252511.460,     81109291.000,            1.560],
       [           6.520,          116.000,     47674230.760,     57686991.000,            1.430],
       [           1.850,          623.000,      3002631.960,      2899484.080,            0.640],
       [          13.760,         1227.000,   1737874137.500,   1446511574.320,            4.320],
       [          13.760,         1227.000,   1737874137.500,   1446511574.320,            4.320]])

Full example Demo 2:

import numpy as np  
#chaotic python list of lists with very different numeric magnitudes 

#            very tiny      medium size            large sized
#            numbers        numbers                numbers

my_list = [[0.000000000074, 5162, 13683628846.64, 1.01e10, 1.81], 
           [1.000000000055,  116, 189688622.37, 260332262.0, 1.97], 
           [0.010000000022,  768, 6004865.13,   -99e13, 1.21], 
           [1.000000000074, 4062, 3263822121.39, 3066869087.9, 1.93], 
           [2.91,            474, 44555062.72, 44555062.72, 0.41], 
           [5,              5006, 8254968918.1, 7446788272.74, 3.25], 
           [0.01,           7887, 30078971595.46, 27814989471.31, 2.18], 
           [7.03,            116, 66252511.46, 81109291.0, 1.56], 
           [6.52,            116, 47674230.76, 57686991.0, 1.43], 
           [1.85,            623, 3002631.96, 2899484.08, 0.64], 
           [13.76,          1227, 1737874137.5, 1446511574.32, 4.32], 
           [13.76,          1337, 1737874137.5, 1446511574.32, 4.32]] 
import sys 
#convert python list of lists to numpy ndarray called my_array 
my_array = np.array(my_list) 
#following two lines do the same thing, showing that np.savetxt can 
#correctly handle python lists of lists and numpy 2D ndarrays. 
np.savetxt(sys.stdout, my_list, '%19.2f') 
np.savetxt(sys.stdout, my_array, '%19.2f') 

Prints:

 0.00             5162.00      13683628846.64      10100000000.00              1.81
 1.00              116.00        189688622.37        260332262.00              1.97
 0.01              768.00          6004865.13 -990000000000000.00              1.21
 1.00             4062.00       3263822121.39       3066869087.90              1.93
 2.91              474.00         44555062.72         44555062.72              0.41
 5.00             5006.00       8254968918.10       7446788272.74              3.25
 0.01             7887.00      30078971595.46      27814989471.31              2.18
 7.03              116.00         66252511.46         81109291.00              1.56
 6.52              116.00         47674230.76         57686991.00              1.43
 1.85              623.00          3002631.96          2899484.08              0.64
13.76             1227.00       1737874137.50       1446511574.32              4.32
13.76             1337.00       1737874137.50       1446511574.32              4.32
 0.00             5162.00      13683628846.64      10100000000.00              1.81
 1.00              116.00        189688622.37        260332262.00              1.97
 0.01              768.00          6004865.13 -990000000000000.00              1.21
 1.00             4062.00       3263822121.39       3066869087.90              1.93
 2.91              474.00         44555062.72         44555062.72              0.41
 5.00             5006.00       8254968918.10       7446788272.74              3.25
 0.01             7887.00      30078971595.46      27814989471.31              2.18
 7.03              116.00         66252511.46         81109291.00              1.56
 6.52              116.00         47674230.76         57686991.00              1.43
 1.85              623.00          3002631.96          2899484.08              0.64
13.76             1227.00       1737874137.50       1446511574.32              4.32
13.76             1337.00       1737874137.50       1446511574.32              4.32

Notice that rounding is consistent at 2 decimal places, and exponential notation is suppressed for both very large (e+x) and very small (e-x) values.


Answer 2

For 1D and 2D arrays, you can use np.savetxt to print using a specific format string:

>>> import sys
>>> import numpy
>>> x = numpy.arange(20).reshape((4,5))
>>> numpy.savetxt(sys.stdout, x, '%5.2f')
 0.00  1.00  2.00  3.00  4.00
 5.00  6.00  7.00  8.00  9.00
10.00 11.00 12.00 13.00 14.00
15.00 16.00 17.00 18.00 19.00

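Your options with numpy.set_printoptions or numpy.array2string in v1.3 are pretty clunky and limited (for example, there is no way to suppress scientific notation for large numbers). It looks like this will change in future versions, with numpy.set_printoptions(formatter=..) and numpy.array2string(style=..).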

For 1D and 2D arrays, you can use np.savetxt to print using a specific format string:

>>> import sys
>>> import numpy
>>> x = numpy.arange(20).reshape((4,5))
>>> numpy.savetxt(sys.stdout, x, '%5.2f')
 0.00  1.00  2.00  3.00  4.00
 5.00  6.00  7.00  8.00  9.00
10.00 11.00 12.00 13.00 14.00
15.00 16.00 17.00 18.00 19.00

Your options with numpy.set_printoptions or numpy.array2string in v1.3 are pretty clunky and limited (for example no way to suppress scientific notation for large numbers). It looks like this will change with future versions, with numpy.set_printoptions(formatter=..) and numpy.array2string(style=..).
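If the formatted table is needed as a string rather than printed straight to stdout, the same np.savetxt call can write into an in-memory buffer. This is a minimal sketch; array_to_text is a hypothetical helper name used only for illustration.

```python
import io
import numpy as np

def array_to_text(arr, fmt='%5.2f'):
    """Format a 1D or 2D array with np.savetxt and return the text."""
    buf = io.StringIO()
    np.savetxt(buf, arr, fmt=fmt)   # same call as above, different target
    return buf.getvalue()

x = np.arange(6).reshape((2, 3))
table = array_to_text(x)
print(table, end='')
```

The returned text can then be logged or embedded in a report without touching the global print options.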


Answer 3

You could write a function that converts scientific notation to regular notation, something like

def sc2std(x):
    s = str(x)
    if 'e' in s:
        num,ex = s.split('e')
        if '-' in num:
            negprefix = '-'
        else:
            negprefix = ''
        num = num.replace('-','')
        if '.' in num:
            dotlocation = num.index('.')
        else:
            dotlocation = len(num)
        newdotlocation = dotlocation + int(ex)
        num = num.replace('.','')
        if (newdotlocation < 1):
            return negprefix+'0.'+'0'*(-newdotlocation)+num
        if (newdotlocation > len(num)):
            return negprefix+ num + '0'*(newdotlocation - len(num))+'.0'
        return negprefix + num[:newdotlocation] + '.' + num[newdotlocation:]
    else:
        return s

You could write a function that converts scientific notation to regular notation, something like

def sc2std(x):
    s = str(x)
    if 'e' in s:
        num,ex = s.split('e')
        if '-' in num:
            negprefix = '-'
        else:
            negprefix = ''
        num = num.replace('-','')
        if '.' in num:
            dotlocation = num.index('.')
        else:
            dotlocation = len(num)
        newdotlocation = dotlocation + int(ex)
        num = num.replace('.','')
        if (newdotlocation < 1):
            return negprefix+'0.'+'0'*(-newdotlocation)+num
        if (newdotlocation > len(num)):
            return negprefix+ num + '0'*(newdotlocation - len(num))+'.0'
        return negprefix + num[:newdotlocation] + '.' + num[newdotlocation:]
    else:
        return s
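As an alternative to parsing the string by hand, recent NumPy versions ship np.format_float_positional, which renders a single float without an exponent. The sketch below assumes that function is available in your NumPy; to_plain is a hypothetical wrapper name.

```python
import numpy as np

def to_plain(x):
    """Render a float in positional (non-scientific) notation."""
    # trim='-' drops trailing zeros and a trailing decimal point
    return np.format_float_positional(x, trim='-')

print(to_plain(1e-05))
print(to_plain(1.5e+20))
```

This swaps the hand-written string manipulation for a library call; it is not the approach the answer above takes.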

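numpy: get a random set of rows from a 2D array

Question: numpy: get a random set of rows from a 2D array

I have a very large 2D array which looks something like this: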

a=
[[a1, b1, c1],
 [a2, b2, c2],
 ...,
 [an, bn, cn]]

Using numpy, is there an easy way to get a new 2D array with, e.g., 2 random rows from the initial array a (without replacement)?

e.g.

b=
[[a4,  b4,  c4],
 [a99, b99, c99]]

I have a very large 2D array which looks something like this:

a=
[[a1, b1, c1],
 [a2, b2, c2],
 ...,
 [an, bn, cn]]

Using numpy, is there an easy way to get a new 2D array with, e.g., 2 random rows from the initial array a (without replacement)?

e.g.

b=
[[a4,  b4,  c4],
 [a99, b99, c99]]

Answer 0

>>> A = np.random.randint(5, size=(10,3))
>>> A
array([[1, 3, 0],
       [3, 2, 0],
       [0, 2, 1],
       [1, 1, 4],
       [3, 2, 2],
       [0, 1, 0],
       [1, 3, 1],
       [0, 4, 1],
       [2, 4, 2],
       [3, 3, 1]])
>>> idx = np.random.randint(10, size=2)
>>> idx
array([7, 6])
>>> A[idx,:]
array([[0, 4, 1],
       [1, 3, 1]])

Putting it together for the general case:

A[np.random.randint(A.shape[0], size=2), :]

Without replacement (numpy 1.7.0+):

A[np.random.choice(A.shape[0], 2, replace=False), :]

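I do not believe there is a good way to generate a random list without replacement before 1.7. Perhaps you can set up a small function that ensures the two values are not the same.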

>>> A = np.random.randint(5, size=(10,3))
>>> A
array([[1, 3, 0],
       [3, 2, 0],
       [0, 2, 1],
       [1, 1, 4],
       [3, 2, 2],
       [0, 1, 0],
       [1, 3, 1],
       [0, 4, 1],
       [2, 4, 2],
       [3, 3, 1]])
>>> idx = np.random.randint(10, size=2)
>>> idx
array([7, 6])
>>> A[idx,:]
array([[0, 4, 1],
       [1, 3, 1]])

Putting it together for the general case (note that randint samples with replacement, so the same row can be picked twice):

A[np.random.randint(A.shape[0], size=2), :]

Without replacement (numpy 1.7.0+):

A[np.random.choice(A.shape[0], 2, replace=False), :]

I do not believe there is a good way to generate a random list without replacement before 1.7. Perhaps you can set up a small function that ensures the two values are not the same.
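For newer NumPy releases, the same without-replacement draw can be written with the Generator API. This is a sketch that assumes np.random.default_rng is available; sample_rows and its seed parameter are hypothetical names added for illustration.

```python
import numpy as np

def sample_rows(a, k, seed=None):
    """Return k distinct rows of the 2D array a, chosen at random."""
    rng = np.random.default_rng(seed)
    # replace=False guarantees that no row index is drawn twice
    idx = rng.choice(a.shape[0], size=k, replace=False)
    return a[idx, :]

A = np.arange(30).reshape(10, 3)
B = sample_rows(A, 2, seed=0)
print(B)
```

Passing a seed makes the draw reproducible, which helps when debugging code that depends on the sample.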


Answer 1

This is an old post, but this is what works best for me:

A[np.random.choice(A.shape[0], num_rows_2_sample, replace=False)]

Change replace=False to replace=True to get the same thing, but with replacement.

This is an old post, but this is what works best for me:

A[np.random.choice(A.shape[0], num_rows_2_sample, replace=False)]

Change replace=False to replace=True to get the same thing, but with replacement.


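Answer 2

Another option, if you just want to down-sample your data by a certain factor, is to create a random mask. Say I want to down-sample to 25% of my original data set, which is currently held in the array data_arr: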

# generate random boolean mask the length of data
# use p 0.75 for False and 0.25 for True
mask = numpy.random.choice([False, True], len(data_arr), p=[0.75, 0.25])

Now you can call data_arr[mask] and get back roughly 25% of the rows, randomly sampled.

Another option is to create a random mask if you just want to down-sample your data by a certain factor. Say I want to down-sample to 25% of my original data set, which is currently held in the array data_arr:

# generate random boolean mask the length of data
# use p 0.75 for False and 0.25 for True
mask = numpy.random.choice([False, True], len(data_arr), p=[0.75, 0.25])

Now you can call data_arr[mask] and get back roughly 25% of the rows, randomly sampled. The exact number of rows varies from run to run because each row is kept independently.
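The mask idea can be wrapped in a small function. The sketch below uses the newer Generator API instead of numpy.random.choice, which is a substitution on my part; downsample and its seed parameter are hypothetical names.

```python
import numpy as np

def downsample(data_arr, fraction, seed=None):
    """Keep each row independently with probability `fraction`."""
    rng = np.random.default_rng(seed)
    # uniform draws in [0, 1) fall below `fraction` about that share of the time
    mask = rng.random(len(data_arr)) < fraction
    return data_arr[mask]

data_arr = np.arange(2000).reshape(1000, 2)
subset = downsample(data_arr, 0.25, seed=1)
print(len(subset))
```

Unlike sampling indices with replace=False, a boolean mask keeps the surviving rows in their original order.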


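Answer 3

This is a similar answer to the one Hezi Rasheff provided, but simplified so newer python users understand what's going on (I noticed many new data science students fetch random samples in the weirdest ways because they don't know what they are doing in python).

You can get a number of random indices from your array by using: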

indices = np.random.choice(A.shape[0], amount_of_samples, replace=False)

You can then index your numpy array with that index array to get the samples at those indices:

A[indices]

This will get you the specified number of random samples from your data.

This is a similar answer to the one Hezi Rasheff provided, but simplified so newer python users understand what’s going on (I noticed many new data science students fetch random samples in the weirdest ways because they don’t know what they are doing in python).

You can get a number of random indices from your array by using:

indices = np.random.choice(A.shape[0], amount_of_samples, replace=False)

You can then index your numpy array with that index array (integer array indexing, not slicing) to get the samples at those indices:

A[indices]

This will get you the specified number of random samples from your data.


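Answer 4

If you need the same rows but just a random sample then,

import random
import numpy as np
# random.sample needs a sequence and returns a list, hence the wrappers
new_array = np.array(random.sample(list(old_array), x))

Here x has to be an int defining the number of rows you want to randomly pick.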

If you need the same rows but just a random sample then,

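import random
import numpy as np
# random.sample needs a sequence and returns a list, hence the wrappers
new_array = np.array(random.sample(list(old_array), x))

Here x has to be an int defining the number of rows you want to randomly pick.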


Answer 5

I see permutation has been suggested. In fact it can be made into one line:

>>> A = np.random.randint(5, size=(10,3))
>>> np.random.permutation(A)[:2]

array([[0, 3, 0],
       [3, 1, 2]])

I see permutation has been suggested. In fact it can be made into one line:

>>> A = np.random.randint(5, size=(10,3))
>>> np.random.permutation(A)[:2]

array([[0, 3, 0],
       [3, 1, 2]])

Answer 6

If you want to generate multiple random subsets of rows, for example if you're doing RANSAC:

num_pop = 10
num_samples = 2
pop_in_sample = 3
rows_to_sample = np.random.random([num_pop, 5])
random_numbers = np.random.random([num_samples, num_pop])
samples = np.argsort(random_numbers, axis=1)[:, :pop_in_sample]
# will be shape [num_samples, pop_in_sample, 5]
row_subsets = rows_to_sample[samples, :]

If you want to generate multiple random subsets of rows, for example if you're doing RANSAC:

num_pop = 10
num_samples = 2
pop_in_sample = 3
rows_to_sample = np.random.random([num_pop, 5])
random_numbers = np.random.random([num_samples, num_pop])
samples = np.argsort(random_numbers, axis=1)[:, :pop_in_sample]
# will be shape [num_samples, pop_in_sample, 5]
row_subsets = rows_to_sample[samples, :]

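Installing SciPy and NumPy using pip

Question: Installing SciPy and NumPy using pip

I'm trying to create required libraries in a package I'm distributing. It requires both the SciPy and NumPy libraries. While developing, I installed both using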

apt-get install scipy

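which installed SciPy 0.9.0 and NumPy 1.5.1, and it worked fine.

I would like to do the same using pip install, in order to be able to specify dependencies in the setup.py of my own package.

The problem is, when I try:

pip install 'numpy==1.5.1'

it works fine.

But then

pip install 'scipy==0.9.0'

fails miserably, with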

raise self.notfounderror(self.notfounderror.__doc__)

numpy.distutils.system_info.BlasNotFoundError:

Blas (http://www.netlib.org/blas/) libraries not found.

Directories to search for the libraries can be specified in the

numpy/distutils/site.cfg file (section [blas]) or by setting

the BLAS environment variable.

How do I get it to work?

I’m trying to create required libraries in a package I’m distributing. It requires both the SciPy and NumPy libraries. While developing, I installed both using

apt-get install scipy

which installed SciPy 0.9.0 and NumPy 1.5.1, and it worked fine.

I would like to do the same using pip install, in order to be able to specify dependencies in the setup.py of my own package.

The problem is, when I try:

pip install 'numpy==1.5.1'

it works fine.

But then

pip install 'scipy==0.9.0'

fails miserably, with

raise self.notfounderror(self.notfounderror.__doc__)

numpy.distutils.system_info.BlasNotFoundError:

Blas (http://www.netlib.org/blas/) libraries not found.

Directories to search for the libraries can be specified in the

numpy/distutils/site.cfg file (section [blas]) or by setting

the BLAS environment variable.

How do I get it to work?


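Answer 0

I am assuming Linux experience in my answer; I found that there are three prerequisites to getting pip install scipy to proceed nicely.

Go here: Installing SciPY

Follow the instructions to download, build and export the env variable for BLAS and then LAPACK. Be careful not to just blindly cut'n'paste the shell commands: there will be a few lines you need to select depending on your architecture, etc., and you'll need to fix/add the correct directories that it incorrectly assumes as well.

The third thing you may need is to yum install numpy-f2py or the equivalent.

Oh, yes, and lastly, you may need to yum install gcc-gfortran, as the libraries above are Fortran source.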

I am assuming Linux experience in my answer; I found that there are three prerequisites to getting pip install scipy to proceed nicely.

Go here: Installing SciPY

Follow the instructions to download, build and export the env variable for BLAS and then LAPACK. Be careful to not just blindly cut’n’paste the shell commands – there will be a few lines you need to select depending on your architecture, etc., and you’ll need to fix/add the correct directories that it incorrectly assumes as well.

The third thing you may need is to yum install numpy-f2py or the equivalent.

Oh, yes and lastly, you may need to yum install gcc-gfortran as the libraries above are Fortran source.


Answer 1

This worked for me on Ubuntu 14.04:

sudo apt-get install libblas-dev liblapack-dev libatlas-base-dev gfortran
pip install scipy

This worked for me on Ubuntu 14.04:

sudo apt-get install libblas-dev liblapack-dev libatlas-base-dev gfortran
pip install scipy

Answer 2

You need the libblas and liblapack dev packages if you are using Ubuntu.

aptitude install libblas-dev liblapack-dev
pip install scipy

You need the libblas and liblapack dev packages if you are using Ubuntu.

aptitude install libblas-dev liblapack-dev
pip install scipy

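Answer 3

Since the previous instructions for installing with yum are broken, here are the updated instructions for installing on something like Fedora. I've tested this on “Amazon Linux AMI 2016.03”.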

sudo yum install atlas-devel lapack-devel blas-devel libgfortran
pip install scipy

Since the previous instructions for installing with yum are broken, here are the updated instructions for installing on something like Fedora. I’ve tested this on “Amazon Linux AMI 2016.03”.

sudo yum install atlas-devel lapack-devel blas-devel libgfortran
pip install scipy

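Answer 4

I was working on a project that depended on numpy and scipy. In a clean installation of Fedora 23, using a python virtual environment for Python 3.4 (also worked for Python 2.7), and with the following in my setup.py (in the setup() method)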

setup_requires=[
    'numpy',
],
install_requires=[
    'numpy',
    'scipy',
],

I found I had to run the following to get pip install -e . to work:

pip install --upgrade pip

sudo dnf install atlas-devel gcc-{c++,gfortran} subversion redhat-rpm-config

The redhat-rpm-config package is needed because scipy uses redhat-hardened-cc1 as opposed to the regular cc1.

I was working on a project that depended on numpy and scipy. In a clean installation of Fedora 23, using a python virtual environment for Python 3.4 (also worked for Python 2.7), and with the following in my setup.py (in the setup() method)

setup_requires=[
    'numpy',
],
install_requires=[
    'numpy',
    'scipy',
],

I found I had to run the following to get pip install -e . to work:

pip install --upgrade pip

and

sudo dnf install atlas-devel gcc-{c++,gfortran} subversion redhat-rpm-config

The redhat-rpm-config package is needed because scipy uses redhat-hardened-cc1 as opposed to the regular cc1.


Answer 5

On Windows with Python 3.5, I managed to install scipy by using conda rather than pip:

conda install scipy

On Windows with Python 3.5, I managed to install scipy by using conda rather than pip:

conda install scipy

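Answer 6

What operating system is this? The answer might depend on the OS involved. However, it looks like you need to find this BLAS library and install it. It doesn't seem to be available through pip (so you'll have to do it by hand), but if you install it, it ought to let you progress with your SciPy install.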

What operating system is this? The answer might depend on the OS involved. However, it looks like you need to find this BLAS library and install it. It doesn’t seem to be available through pip (so you’ll have to do it by hand), but if you install it, it ought to let you progress with your SciPy install.


Answer 7

In my case, upgrading pip did the trick. Also, I installed scipy with the -U parameter (upgrade all packages to the latest available version).

In my case, upgrading pip did the trick. Also, I installed scipy with the -U parameter (upgrade all packages to the latest available version).
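After trying any of the installation routes above, it can help to confirm from Python which versions, if any, actually ended up in the active environment. This is a small sketch using only the standard library (importlib.metadata, Python 3.8+); installed_versions is a hypothetical helper name.

```python
from importlib import metadata

def installed_versions(names):
    """Map each distribution name to its installed version, or None."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None   # not installed in this environment
    return versions

report = installed_versions(['numpy', 'scipy'])
print(report)
```

A None entry means pip did not install that package into the interpreter you are running, which is a common cause of confusion with virtual environments.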


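How do you use the ellipsis slicing syntax in Python?

Question: How do you use the ellipsis slicing syntax in Python?

This came up in Hidden features of Python, but I can't see good documentation or examples that explain how the feature works.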

This came up in Hidden features of Python, but I can’t see good documentation or examples that explain how the feature works.


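Answer 0

Ellipsis, or ..., is not a hidden feature; it's just a constant. It's quite different from, say, JavaScript ES6, where it's part of the language syntax. No builtin class or Python language construct makes use of it.

So the syntax for it depends entirely on you, or someone else, having written code to understand it.

Numpy uses it, as stated in the documentation. Some examples here.

In your own class, you'd use it like this: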

>>> class TestEllipsis(object):
...     def __getitem__(self, item):
...         if item is Ellipsis:
...             return "Returning all items"
...         else:
...             return "return %r items" % item
... 
>>> x = TestEllipsis()
>>> print(x[2])
return 2 items
>>> print(x[...])
Returning all items

Of course, there is the python documentation, and the language reference. But those aren't very helpful.

Ellipsis, or ..., is not a hidden feature; it’s just a constant. It’s quite different from, say, JavaScript ES6, where it’s part of the language syntax. No builtin class or Python language construct makes use of it.

So the syntax for it depends entirely on you, or someone else, having written code to understand it.

Numpy uses it, as stated in the documentation. Some examples here.

In your own class, you’d use it like this:

>>> class TestEllipsis(object):
...     def __getitem__(self, item):
...         if item is Ellipsis:
...             return "Returning all items"
...         else:
...             return "return %r items" % item
... 
>>> x = TestEllipsis()
>>> print(x[2])
return 2 items
>>> print(x[...])
Returning all items

Of course, there is the python documentation, and the language reference. But those aren’t very helpful.
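Since the meaning of ... is entirely up to the class that receives it, here is a slightly larger Python 3 sketch. Grid is a hypothetical toy class, invented for illustration, that gives both grid[...] and grid[..., j] a meaning of its own.

```python
class Grid:
    """Toy container (hypothetical) whose __getitem__ understands Ellipsis."""

    def __init__(self, rows):
        self.rows = rows

    def __getitem__(self, key):
        if key is Ellipsis:
            # grid[...] -> a copy of every row
            return [list(r) for r in self.rows]
        if isinstance(key, tuple) and key[0] is Ellipsis:
            # grid[..., j] -> column j of every row
            return [r[key[1]] for r in self.rows]
        # plain integer index -> a single row
        return self.rows[key]

g = Grid([[1, 2], [3, 4]])
print(g[...])
print(g[..., 0])
print(g[1])
```

Python itself only passes the Ellipsis object (or a tuple containing it) to __getitem__; everything else here is a design choice of the class.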


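Answer 1

The ellipsis is used in numpy to slice higher-dimensional data structures.

It's designed to mean: at this point, insert as many full slices (:) as needed to extend the multi-dimensional slice to all dimensions.

Example: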

>>> from numpy import arange
>>> a = arange(16).reshape(2,2,2,2)

Now, you have a 4-dimensional array of shape 2x2x2x2. To select all first elements in the 4th dimension, you can use the ellipsis notation

>>> a[..., 0].flatten()
array([ 0,  2,  4,  6,  8, 10, 12, 14])

which is equivalent to

>>> a[:,:,:,0].flatten()
array([ 0,  2,  4,  6,  8, 10, 12, 14])

In your own implementations, you're free to ignore the contract mentioned above and use it for whatever you see fit.

The ellipsis is used in numpy to slice higher-dimensional data structures.

It’s designed to mean: at this point, insert as many full slices (:) as needed to extend the multi-dimensional slice to all dimensions.

Example:

>>> from numpy import arange
>>> a = arange(16).reshape(2,2,2,2)

Now, you have a 4-dimensional array of shape 2x2x2x2. To select all first elements in the 4th dimension, you can use the ellipsis notation

>>> a[..., 0].flatten()
array([ 0,  2,  4,  6,  8, 10, 12, 14])

which is equivalent to

>>> a[:,:,:,0].flatten()
array([ 0,  2,  4,  6,  8, 10, 12, 14])

In your own implementations, you’re free to ignore the contract mentioned above and use it for whatever you see fit.
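The equivalence between ... and a run of full slices can be checked directly. This short sketch compares a few ellipsis expressions against their spelled-out forms on the same 2x2x2x2 array; the variable names are only illustrative.

```python
import numpy as np

a = np.arange(16).reshape(2, 2, 2, 2)

# the ellipsis stands in for however many ':' are needed at that position
same_last = np.array_equal(a[..., 0], a[:, :, :, 0])
same_first = np.array_equal(a[0, ...], a[0, :, :, :])
same_middle = np.array_equal(a[0, ..., 1], a[0, :, :, 1])

print(same_last, same_first, same_middle)
print(a[..., 0].shape)
```

The practical benefit is that a[..., 0] keeps working if the array later gains or loses leading dimensions, whereas a[:, :, :, 0] is tied to exactly four.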


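Answer 2

This is another use for Ellipsis, which has nothing to do with slices: I often use it in intra-thread communication with queues, as a mark that signals "Done"; it's there, it's an object, it's a singleton, and its name means "lack of", and it's not the overused None (which could be put in a queue as part of normal data flow). YMMV.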

This is another use for Ellipsis, which has nothing to do with slices: I often use it in intra-thread communication with queues, as a mark that signals “Done”; it’s there, it’s an object, it’s a singleton, and its name means “lack of”, and it’s not the overused None (which could be put in a queue as part of normal data flow). YMMV.
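The sentinel idea can be sketched with the standard library's queue and threading modules. The consumer function and the sample items below are assumptions made up for illustration; the point is only that None can travel through the queue as ordinary data while Ellipsis marks the end.

```python
import queue
import threading

def consumer(q, out):
    """Collect items from q until the Ellipsis sentinel arrives."""
    while True:
        item = q.get()
        if item is Ellipsis:   # "Done" marker, distinct from None
            break
        out.append(item)

q = queue.Queue()
received = []
worker = threading.Thread(target=consumer, args=(q, received))
worker.start()

for item in [1, None, 'two']:  # None is legitimate data here
    q.put(item)
q.put(Ellipsis)                # signal that no more items will follow
worker.join()

print(received)
```

A dedicated object() sentinel would work just as well; Ellipsis is simply a ready-made singleton with a suggestive name.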


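Answer 3

As stated in other answers, it can be used for creating slices. It is useful when you do not want to write many full slice notations (:), or when you are just not sure of the dimensionality of the array being manipulated.

What I thought important to highlight, and what was missing from the other answers, is that it can be used even when there are no more dimensions to be filled.

Example: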

>>> from numpy import arange
>>> a = arange(4).reshape(2,2)

This will result in an error:

>>> a[:,0,:]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: too many indices for array

This will work:

>>> a[...,0,:]
array([0, 1])

As stated in other answers, it can be used for creating slices. It is useful when you do not want to write many full slice notations (:), or when you are just not sure of the dimensionality of the array being manipulated.

What I thought important to highlight, and what was missing from the other answers, is that it can be used even when there are no more dimensions to be filled.

Example:

>>> from numpy import arange
>>> a = arange(4).reshape(2,2)

This will result in an error:

>>> a[:,0,:]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: too many indices for array

This will work:

>>> a[...,0,:]
array([0, 1])

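How to do exponential and logarithmic curve fitting in Python? I found only polynomial fitting

Question: How to do exponential and logarithmic curve fitting in Python? I found only polynomial fitting

I have a set of data and I want to compare which line describes it best (polynomials of different orders, exponential or logarithmic).

I use Python and Numpy, and for polynomial fitting there is a function polyfit(). But I found no such functions for exponential and logarithmic fitting.

Are there any? Or how to solve it otherwise?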

I have a set of data and I want to compare which line describes it best (polynomials of different orders, exponential or logarithmic).

I use Python and Numpy and for polynomial fitting there is a function polyfit(). But I found no such functions for exponential and logarithmic fitting.

Are there any? Or how to solve it otherwise?


Answer 0

For fitting y = A + B log x, just fit y against (log x).

>>> x = numpy.array([1, 7, 20, 50, 79])
>>> y = numpy.array([10, 19, 30, 35, 51])
>>> numpy.polyfit(numpy.log(x), y, 1)
array([ 8.46295607,  6.61867463])
# y ≈ 8.46 log(x) + 6.62

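For fitting $y = Ae^{Bx}$, taking the logarithm of both sides gives $\log y = \log A + Bx$. So fit $\log y$ against $x$.

Note that fitting $\log y$ as if it were linear emphasizes small values of $y$, causing large deviations for large $y$. This is because polyfit (linear regression) works by minimizing $\sum_i (\Delta Y_i)^2 = \sum_i (Y_i - \hat{Y}_i)^2$. When $Y_i = \log y_i$, the residuals are $\Delta Y_i = \Delta(\log y_i) \approx \Delta y_i / |y_i|$. So even if polyfit makes a very bad decision for large $y$, the "divide-by-$|y|$" factor compensates for it, causing polyfit to favor small values.

This can be alleviated by giving each entry a "weight" proportional to $y$. polyfit supports weighted least squares via the w keyword argument; because w multiplies the residuals before they are squared, the example passes w=numpy.sqrt(y).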

>>> x = numpy.array([10, 19, 30, 35, 51])
>>> y = numpy.array([1, 7, 20, 50, 79])
>>> numpy.polyfit(x, numpy.log(y), 1)
array([ 0.10502711, -0.40116352])
#    y ≈ exp(-0.401) * exp(0.105 * x) = 0.670 * exp(0.105 * x)
# (^ biased towards small values)
>>> numpy.polyfit(x, numpy.log(y), 1, w=numpy.sqrt(y))
array([ 0.06009446,  1.41648096])
#    y ≈ exp(1.42) * exp(0.0601 * x) = 4.12 * exp(0.0601 * x)
# (^ not so biased)

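Note that Excel, LibreOffice and most scientific calculators typically use the unweighted (biased) formula for exponential regression / trend lines. If you want your results to be compatible with these platforms, do not include the weights even if they provide better results.


Now, if you can use scipy, you could use scipy.optimize.curve_fit to fit any model without transformations.

For y = A + B log x the result is the same as the transformation method: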

>>> x = numpy.array([1, 7, 20, 50, 79])
>>> y = numpy.array([10, 19, 30, 35, 51])
>>> scipy.optimize.curve_fit(lambda t,a,b: a+b*numpy.log(t),  x,  y)
(array([ 6.61867467,  8.46295606]), 
 array([[ 28.15948002,  -7.89609542],
        [ -7.89609542,   2.9857172 ]]))
# y ≈ 6.62 + 8.46 log(x)

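For $y = Ae^{Bx}$, however, we can get a better fit, since curve_fit minimizes the residuals in $y$ directly rather than in $\log y$. But we need to provide an initial guess so curve_fit can reach the desired local minimum.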

>>> x = numpy.array([10, 19, 30, 35, 51])
>>> y = numpy.array([1, 7, 20, 50, 79])
>>> scipy.optimize.curve_fit(lambda t,a,b: a*numpy.exp(b*t),  x,  y)
(array([  5.60728326e-21,   9.99993501e-01]),
 array([[  4.14809412e-27,  -1.45078961e-08],
        [ -1.45078961e-08,   5.07411462e+10]]))
# oops, definitely wrong.
>>> scipy.optimize.curve_fit(lambda t,a,b: a*numpy.exp(b*t),  x,  y,  p0=(4, 0.1))
(array([ 4.88003249,  0.05531256]),
 array([[  1.01261314e+01,  -4.31940132e-02],
        [ -4.31940132e-02,   1.91188656e-04]]))
# y ≈ 4.88 exp(0.0553 x). much better.

[Figure: comparison of exponential regression fits]

For fitting y = A + B log x, just fit y against (log x).

>>> x = numpy.array([1, 7, 20, 50, 79])
>>> y = numpy.array([10, 19, 30, 35, 51])
>>> numpy.polyfit(numpy.log(x), y, 1)
array([ 8.46295607,  6.61867463])
# y ≈ 8.46 log(x) + 6.62

For fitting $y = Ae^{Bx}$, taking the logarithm of both sides gives $\log y = \log A + Bx$. So fit $\log y$ against $x$.

Note that fitting $\log y$ as if it were linear emphasizes small values of $y$, causing large deviations for large $y$. This is because polyfit (linear regression) works by minimizing $\sum_i (\Delta Y_i)^2 = \sum_i (Y_i - \hat{Y}_i)^2$. When $Y_i = \log y_i$, the residuals are $\Delta Y_i = \Delta(\log y_i) \approx \Delta y_i / |y_i|$. So even if polyfit makes a very bad decision for large $y$, the “divide-by-$|y|$” factor compensates for it, causing polyfit to favor small values.

This can be alleviated by giving each entry a “weight” proportional to $y$. polyfit supports weighted least squares via the w keyword argument; because w multiplies the residuals before they are squared, the example passes w=numpy.sqrt(y).

>>> x = numpy.array([10, 19, 30, 35, 51])
>>> y = numpy.array([1, 7, 20, 50, 79])
>>> numpy.polyfit(x, numpy.log(y), 1)
array([ 0.10502711, -0.40116352])
#    y ≈ exp(-0.401) * exp(0.105 * x) = 0.670 * exp(0.105 * x)
# (^ biased towards small values)
>>> numpy.polyfit(x, numpy.log(y), 1, w=numpy.sqrt(y))
array([ 0.06009446,  1.41648096])
#    y ≈ exp(1.42) * exp(0.0601 * x) = 4.12 * exp(0.0601 * x)
# (^ not so biased)

Note that Excel, LibreOffice and most scientific calculators typically use the unweighted (biased) formula for exponential regression / trend lines. If you want your results to be compatible with these platforms, do not include the weights even if they provide better results.
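To make the weighted and unweighted variants easier to compare, here is a minimal sketch that wraps the polyfit call and converts the coefficients back to A and B. The helper name fit_exponential is made up for illustration and is not part of NumPy; it assumes all y values are positive.

```python
import numpy as np

def fit_exponential(x, y, weighted=True):
    """Fit y ≈ A * exp(B * x) by linear regression on log(y).

    Hypothetical helper (not a NumPy function). Requires y > 0.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # polyfit multiplies each residual by w before squaring,
    # so w = sqrt(y) weights each squared residual by y.
    w = np.sqrt(y) if weighted else None
    slope, intercept = np.polyfit(x, np.log(y), 1, w=w)
    return np.exp(intercept), slope  # (A, B)

x = np.array([10, 19, 30, 35, 51])
y = np.array([1, 7, 20, 50, 79])
A_w, B_w = fit_exponential(x, y, weighted=True)   # weighted variant
A_u, B_u = fit_exponential(x, y, weighted=False)  # spreadsheet-style variant
print(A_w, B_w)
print(A_u, B_u)
```

On noise-free exponential data both variants should recover the same A and B; they only diverge once the data scatter around the curve.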


Now, if you can use scipy, you could use scipy.optimize.curve_fit to fit any model without transformations.

For y = A + B log x the result is the same as the transformation method:

>>> x = numpy.array([1, 7, 20, 50, 79])
>>> y = numpy.array([10, 19, 30, 35, 51])
>>> scipy.optimize.curve_fit(lambda t,a,b: a+b*numpy.log(t),  x,  y)
(array([ 6.61867467,  8.46295606]), 
 array([[ 28.15948002,  -7.89609542],
        [ -7.89609542,   2.9857172 ]]))
# y ≈ 6.62 + 8.46 log(x)

For y = Ae^(Bx), however, we can get a better fit since curve_fit minimizes Δy directly rather than Δ(log y). But we need to provide an initial guess so curve_fit can reach the desired local minimum.

>>> x = numpy.array([10, 19, 30, 35, 51])
>>> y = numpy.array([1, 7, 20, 50, 79])
>>> scipy.optimize.curve_fit(lambda t,a,b: a*numpy.exp(b*t),  x,  y)
(array([  5.60728326e-21,   9.99993501e-01]),
 array([[  4.14809412e-27,  -1.45078961e-08],
        [ -1.45078961e-08,   5.07411462e+10]]))
# oops, definitely wrong.
>>> scipy.optimize.curve_fit(lambda t,a,b: a*numpy.exp(b*t),  x,  y,  p0=(4, 0.1))
(array([ 4.88003249,  0.05531256]),
 array([[  1.01261314e+01,  -4.31940132e-02],
        [ -4.31940132e-02,   1.91188656e-04]]))
# y ≈ 4.88 exp(0.0553 x). much better.
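One way to avoid picking p0 by hand is to take it from the log-linear fit. The sketch below does that; the names exp_model and fit_exp_seeded are assumptions made for this example, and it requires y > 0 so that the logarithm is defined.

```python
import numpy as np
from scipy.optimize import curve_fit

def exp_model(t, a, b):
    """Model y = a * exp(b * t)."""
    return a * np.exp(b * t)

def fit_exp_seeded(x, y):
    """Hypothetical helper: seed curve_fit with a weighted log-linear fit."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Step 1: cheap linearized estimate of (A, B)
    b0, log_a0 = np.polyfit(x, np.log(y), 1, w=np.sqrt(y))
    # Step 2: refine in the original (non-log) space
    popt, pcov = curve_fit(exp_model, x, y, p0=(np.exp(log_a0), b0))
    return popt, pcov

x = np.array([10, 19, 30, 35, 51])
y = np.array([1, 7, 20, 50, 79])
popt, pcov = fit_exp_seeded(x, y)
print(popt)
```

The linearized estimate is usually close enough to the final answer that the optimizer does not wander off to a degenerate solution like the one shown above.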

comparison of exponential regression


Answer 1

You can also fit a set of data to whatever function you like using curve_fit from scipy.optimize. For example, if you want to fit an exponential function (from the documentation):

import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

def func(x, a, b, c):
    return a * np.exp(-b * x) + c

x = np.linspace(0,4,50)
y = func(x, 2.5, 1.3, 0.5)
yn = y + 0.2*np.random.normal(size=len(x))

popt, pcov = curve_fit(func, x, yn)

And then if you want to plot, you could do:

plt.figure()
plt.plot(x, yn, 'ko', label="Original Noised Data")
plt.plot(x, func(x, *popt), 'r-', label="Fitted Curve")
plt.legend()
plt.show()

(Note: the * in front of popt when you plot will expand out the terms into the a, b, and c that func is expecting.)
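The pcov value returned by curve_fit is not used above. As a rough sketch, the square roots of its diagonal can be read as one-standard-deviation uncertainties on the fitted parameters. The noise level (0.01) and the seeded generator are assumptions chosen so the example is reproducible; they are not taken from the answer.

```python
import numpy as np
from scipy.optimize import curve_fit

def func(x, a, b, c):
    return a * np.exp(-b * x) + c

rng = np.random.default_rng(0)           # seeded for reproducibility
x = np.linspace(0, 4, 50)
yn = func(x, 2.5, 1.3, 0.5) + 0.01 * rng.normal(size=x.size)

popt, pcov = curve_fit(func, x, yn)
perr = np.sqrt(np.diag(pcov))            # per-parameter standard errors

for name, value, err in zip("abc", popt, perr):
    print(f"{name} = {value:.3f} +/- {err:.3f}")
```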


Answer 2

I was having some trouble with this so let me be very explicit so noobs like me can understand.

Let's say that we have a data file or something similar:

# -*- coding: utf-8 -*-

import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
import numpy as np
import sympy as sym

"""
Generate some data, let's imagine that you already have this. 
"""
x = np.linspace(0, 3, 50)
y = np.exp(x)

"""
Plot your data
"""
plt.plot(x, y, 'ro',label="Original Data")

"""
brute force to avoid errors
"""    
x = np.array(x, dtype=float) #transform your data in a numpy array of floats 
y = np.array(y, dtype=float) #so the curve_fit can work

"""
create a function to fit with your data. a, b, c and d are the coefficients
that curve_fit will calculate for you. 
In this part you need to guess and/or use mathematical knowledge to find
a function that resembles your data
"""
def func(x, a, b, c, d):
    return a*x**3 + b*x**2 +c*x + d

"""
make the curve_fit
"""
popt, pcov = curve_fit(func, x, y)

"""
The result is:
popt[0] = a , popt[1] = b, popt[2] = c and popt[3] = d of the function,
so f(x) = popt[0]*x**3 + popt[1]*x**2 + popt[2]*x + popt[3].
"""
print("a = %s , b = %s, c = %s, d = %s" % (popt[0], popt[1], popt[2], popt[3]))

"""
Use sympy to generate the LaTeX syntax of the function
"""
xs = sym.Symbol(r'\lambda')
tex = sym.latex(func(xs,*popt)).replace('$', '')
plt.title(r'$f(\lambda)= %s$' %(tex),fontsize=16)

"""
Plot the fitted function.
"""

plt.plot(x, func(x, *popt), label="Fitted Curve")  # same as the commented-out line below
#plt.plot(x, popt[0]*x**3 + popt[1]*x**2 + popt[2]*x + popt[3], label="Fitted Curve") 

plt.legend(loc='upper left')
plt.show()

the result is: a = 0.849195983017 , b = -1.18101681765, c = 2.24061176543, d = 0.816643894816

Raw data and fitted function
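Because the cubic model above is linear in its coefficients, an ordinary least-squares polynomial fit should land on the same minimum without any iterative search or initial guess. A minimal sketch using np.polyfit, with the quadratic fit included only for comparison:

```python
import numpy as np

x = np.linspace(0, 3, 50)
y = np.exp(x)

# Coefficients come back highest power first: a, b, c, d
coeffs = np.polyfit(x, y, 3)
fitted = np.polyval(coeffs, x)

# Residual sum of squares for the cubic and, for comparison, a quadratic
rss_cubic = np.sum((y - fitted) ** 2)
rss_quadratic = np.sum((y - np.polyval(np.polyfit(x, y, 2), x)) ** 2)

print(coeffs)
print(rss_cubic, rss_quadratic)
```

curve_fit remains the more general tool; polyfit is just a shortcut when the model happens to be a polynomial.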


Answer 3

Well I guess you can always use:

np.log   -->  natural log
np.log10 -->  base 10
np.log2  -->  base 2

Slightly modifying IanVS’s answer:

import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

def func(x, a, b, c):
  #return a * np.exp(-b * x) + c
  return a * np.log(b * x) + c

x = np.linspace(1,5,50)   # changed boundary conditions to avoid division by 0
y = func(x, 2.5, 1.3, 0.5)
yn = y + 0.2*np.random.normal(size=len(x))

popt, pcov = curve_fit(func, x, yn)

plt.figure()
plt.plot(x, yn, 'ko', label="Original Noised Data")
plt.plot(x, func(x, *popt), 'r-', label="Fitted Curve")
plt.legend()
plt.show()

This produces a plot of the noisy data with the fitted logarithmic curve drawn through it.


Answer 4

Here’s a linearization option on simple data that uses tools from scikit-learn.

Given

import numpy as np

import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import FunctionTransformer


np.random.seed(123)

# General Functions
def func_exp(x, a, b, c):
    """Return values from a general exponential function."""
    return a * np.exp(b * x) + c


def func_log(x, a, b, c):
    """Return values from a general log function."""
    return a * np.log(b * x) + c


# Helper
def generate_data(func, *args, jitter=0):
    """Return a tuple of arrays with random data along a general function."""
    xs = np.linspace(1, 5, 50)
    ys = func(xs, *args)
    noise = jitter * np.random.normal(size=len(xs)) + jitter
    xs = xs.reshape(-1, 1)                                  # xs[:, np.newaxis]
    ys = (ys + noise).reshape(-1, 1)
    return xs, ys


transformer = FunctionTransformer(np.log, validate=True)

Code

Fit exponential data

# Data
x_samp, y_samp = generate_data(func_exp, 2.5, 1.2, 0.7, jitter=3)
y_trans = transformer.fit_transform(y_samp)             # 1

# Regression
regressor = LinearRegression()
results = regressor.fit(x_samp, y_trans)                # 2
model = results.predict
y_fit = model(x_samp)

# Visualization
plt.scatter(x_samp, y_samp)
plt.plot(x_samp, np.exp(y_fit), "k--", label="Fit")     # 3
plt.title("Exponential Fit")

Fit log data

# Data
x_samp, y_samp = generate_data(func_log, 2.5, 1.2, 0.7, jitter=0.15)
x_trans = transformer.fit_transform(x_samp)             # 1

# Regression
regressor = LinearRegression()
results = regressor.fit(x_trans, y_samp)                # 2
model = results.predict
y_fit = model(x_trans)

# Visualization
plt.scatter(x_samp, y_samp)
plt.plot(x_samp, y_fit, "k--", label="Fit")             # 3
plt.title("Logarithmic Fit")


Details

General Steps

  1. Apply a log operation to data values (x, y or both)
  2. Regress the data to a linearized model
  3. Plot by “reversing” any log operations (with np.exp()) and fit to original data

Assuming our data follows an exponential trend, a general equation+ may be:

y = A * exp(B * x) + C

With C = 0, we can linearize this equation (e.g. y = intercept + slope * x) by taking the log:

ln(y) = ln(A) + B * x

Given a linearized equation++ and the regression parameters, we could calculate:

  • A via intercept (ln(A))
  • B via slope (B)

Summary of Linearization Techniques

Relationship |  Example   |     General Eqn.     |  Altered Var.  |        Linearized Eqn.  
-------------|------------|----------------------|----------------|------------------------------------------
Linear       | x          | y =     B * x    + C | -              |        y =   C    + B * x
Logarithmic  | log(x)     | y = A * log(B*x) + C | log(x)         |        y =   C    + A * (log(B) + log(x))
Exponential  | 2**x, e**x | y = A * exp(B*x) + C | log(y)         | log(y-C) = log(A) + B * x
Power        | x**2       | y =     B * x**N + C | log(x), log(y) | log(y-C) = log(B) + N * log(x)

+Note: linearizing exponential functions works best when the noise is small and C=0. Use with caution.

++Note: while altering y data helps linearize exponential data, altering x data helps linearize log data.
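The power-law row of the table is the only one not demonstrated in code. A minimal sketch with plain NumPy, assuming C = 0 and strictly positive x and y; the helper name fit_power_law is made up for this example:

```python
import numpy as np

def fit_power_law(x, y):
    """Fit y ≈ B * x**N by regressing log(y) on log(x).

    Hypothetical helper; assumes C = 0 and x, y > 0.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # log(y) = log(B) + N * log(x): slope is N, intercept is log(B)
    n, log_b = np.polyfit(np.log(x), np.log(y), 1)
    return np.exp(log_b), n

x = np.linspace(1, 5, 50)
y = 3.0 * x ** 2          # noise-free sample data
B, N = fit_power_law(x, y)
print(B, N)
```

As with the exponential case, the log transform distorts the error structure, so this works best when the noise is small.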


Answer 5

We demonstrate features of lmfit while solving both problems.

Given

import lmfit

import numpy as np

import matplotlib.pyplot as plt


%matplotlib inline
np.random.seed(123)

# General Functions
def func_log(x, a, b, c):
    """Return values from a general log function."""
    return a * np.log(b * x) + c


# Data
x_samp = np.linspace(1, 5, 50)
_noise = np.random.normal(size=len(x_samp), scale=0.06)
y_samp = 2.5 * np.exp(1.2 * x_samp) + 0.7 + _noise
y_samp2 = 2.5 * np.log(1.2 * x_samp) + 0.7 + _noise

Code

Approach 1 – lmfit Model

Fit exponential data

regressor = lmfit.models.ExponentialModel()                # 1    
initial_guess = dict(amplitude=1, decay=-1)                # 2
results = regressor.fit(y_samp, x=x_samp, **initial_guess)
y_fit = results.best_fit    

plt.plot(x_samp, y_samp, "o", label="Data")
plt.plot(x_samp, y_fit, "k--", label="Fit")
plt.legend()

Approach 2 – Custom Model

Fit log data

regressor = lmfit.Model(func_log)                          # 1
initial_guess = dict(a=1, b=.1, c=.1)                      # 2
results = regressor.fit(y_samp2, x=x_samp, **initial_guess)
y_fit = results.best_fit

plt.plot(x_samp, y_samp2, "o", label="Data")
plt.plot(x_samp, y_fit, "k--", label="Fit")
plt.legend()


Details

  1. Choose a regression class
  2. Supply named, initial guesses that respect the function’s domain

You can determine the inferred parameters from the regressor object. Example:

regressor.param_names
# ['decay', 'amplitude']

Note: the ExponentialModel() follows a decay function, which accepts two parameters, one of which is negative.

f(x; A, τ) = A * exp(-x / τ), where A is the amplitude parameter and τ is the decay parameter

See also ExponentialGaussianModel(), which accepts more parameters.

Install the library via > pip install lmfit.


Answer 6

Wolfram MathWorld has a closed-form solution for fitting an exponential. They also have similar solutions for fitting logarithmic and power-law models.

I found this to work better than scipy’s curve_fit, especially when you don’t have data “near zero”. Here is an example:

import numpy as np
import matplotlib.pyplot as plt

# Fit the function y = A * exp(B * x) to the data
# returns (A, B)
# From: https://mathworld.wolfram.com/LeastSquaresFittingExponential.html
def fit_exp(xs, ys):
    S_x2_y = 0.0
    S_y_lny = 0.0
    S_x_y = 0.0
    S_x_y_lny = 0.0
    S_y = 0.0
    for (x,y) in zip(xs, ys):
        S_x2_y += x * x * y
        S_y_lny += y * np.log(y)
        S_x_y += x * y
        S_x_y_lny += x * y * np.log(y)
        S_y += y
    #end
    a = (S_x2_y * S_y_lny - S_x_y * S_x_y_lny) / (S_y * S_x2_y - S_x_y * S_x_y)
    b = (S_y * S_x_y_lny - S_x_y * S_y_lny) / (S_y * S_x2_y - S_x_y * S_x_y)
    return (np.exp(a), b)


xs = [33, 34, 35, 36, 37, 38, 39, 40, 41, 42]
ys = [3187, 3545, 4045, 4447, 4872, 5660, 5983, 6254, 6681, 7206]

(A, B) = fit_exp(xs, ys)

plt.figure()
plt.plot(xs, ys, 'o-', label='Raw Data')
plt.plot(xs, [A * np.exp(B *x) for x in xs], 'o-', label='Fit')

plt.title('Exponential Fit Test')
plt.xlabel('X')
plt.ylabel('Y')
plt.legend(loc='best')
plt.tight_layout()
plt.show()
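The same closed-form sums can be written without the Python loop. The sketch below is a vectorized restatement of fit_exp; the name fit_exp_vectorized is made up here. Since the formula minimizes the y-weighted squared residuals of log(y), it should agree with np.polyfit using w=sqrt(y).

```python
import numpy as np

def fit_exp_vectorized(xs, ys):
    """Closed-form fit of y = A * exp(B * x); returns (A, B). Requires y > 0."""
    x = np.asarray(xs, dtype=float)
    y = np.asarray(ys, dtype=float)
    ln_y = np.log(y)
    s_y = y.sum()
    s_xy = (x * y).sum()
    s_x2y = (x * x * y).sum()
    s_y_lny = (y * ln_y).sum()
    s_xy_lny = (x * y * ln_y).sum()
    denom = s_y * s_x2y - s_xy ** 2
    a = (s_x2y * s_y_lny - s_xy * s_xy_lny) / denom
    b = (s_y * s_xy_lny - s_xy * s_y_lny) / denom
    return np.exp(a), b

xs = [33, 34, 35, 36, 37, 38, 39, 40, 41, 42]
ys = [3187, 3545, 4045, 4447, 4872, 5660, 5983, 6254, 6681, 7206]
A, B = fit_exp_vectorized(xs, ys)

# Cross-check against the weighted log-linear fit
slope, intercept = np.polyfit(xs, np.log(ys), 1, w=np.sqrt(ys))
print(A, B)
print(np.exp(intercept), slope)
```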


How do you get the magnitude of a vector in Numpy?

Question: How do you get the magnitude of a vector in Numpy?

In keeping with the “There’s only one obvious way to do it” principle, how do you get the magnitude of a vector (1D array) in Numpy?

import math

def mag(x):
    return math.sqrt(sum(i**2 for i in x))

The above works, but I cannot believe that I must specify such a trivial and core function myself.


Answer 0

The function you’re after is numpy.linalg.norm. (I reckon it should be in base numpy as a property of an array — say x.norm() — but oh well).

import numpy as np
x = np.array([1,2,3,4,5])
np.linalg.norm(x)

You can also feed in an optional ord for the nth order norm you want. Say you wanted the 1-norm:

np.linalg.norm(x,ord=1)

And so on.
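A short sketch of a few ord values, plus the axis argument for taking the magnitude of several vectors at once. The sample values are chosen only so the results are easy to check by hand.

```python
import numpy as np

x = np.array([3.0, -4.0])
l2 = np.linalg.norm(x)                 # Euclidean length: sqrt(9 + 16)
l1 = np.linalg.norm(x, ord=1)          # sum of absolute values
linf = np.linalg.norm(x, ord=np.inf)   # largest absolute value

# Vectors stacked as rows: axis=1 returns one magnitude per row
stack = np.array([[3.0, 4.0], [6.0, 8.0]])
row_norms = np.linalg.norm(stack, axis=1)

print(l2, l1, linf, row_norms)
```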


Answer 1

If you are worried at all about speed, you should instead use:

mag = np.sqrt(x.dot(x))

Here are some benchmarks:

>>> import timeit
>>> timeit.timeit('np.linalg.norm(x)', setup='import numpy as np; x = np.arange(100)', number=1000)
0.0450878
>>> timeit.timeit('np.sqrt(x.dot(x))', setup='import numpy as np; x = np.arange(100)', number=1000)
0.0181372

EDIT: The real speed improvement comes when you have to take the norm of many vectors. Using pure numpy functions doesn’t require any for loops. For example:

In [1]: import numpy as np

In [2]: a = np.arange(1200.0).reshape((-1,3))

In [3]: %timeit [np.linalg.norm(x) for x in a]
100 loops, best of 3: 4.23 ms per loop

In [4]: %timeit np.sqrt((a*a).sum(axis=1))
100000 loops, best of 3: 18.9 us per loop

In [5]: np.allclose([np.linalg.norm(x) for x in a],np.sqrt((a*a).sum(axis=1)))
Out[5]: True

Answer 2

Yet another alternative is to use the einsum function in numpy, either for arrays of vectors:

In [1]: import numpy as np

In [2]: a = np.arange(1200.0).reshape((-1,3))

In [3]: %timeit [np.linalg.norm(x) for x in a]
100 loops, best of 3: 3.86 ms per loop

In [4]: %timeit np.sqrt((a*a).sum(axis=1))
100000 loops, best of 3: 15.6 µs per loop

In [5]: %timeit np.sqrt(np.einsum('ij,ij->i',a,a))
100000 loops, best of 3: 8.71 µs per loop

or for a single vector:

In [5]: a = np.arange(100000)

In [6]: %timeit np.sqrt(a.dot(a))
10000 loops, best of 3: 80.8 µs per loop

In [7]: %timeit np.sqrt(np.einsum('i,i', a, a))
10000 loops, best of 3: 60.6 µs per loop

There does, however, seem to be some overhead associated with calling it that may make it slower with small inputs:

In [2]: a = np.arange(100)

In [3]: %timeit np.sqrt(a.dot(a))
100000 loops, best of 3: 3.73 µs per loop

In [4]: %timeit np.sqrt(np.einsum('i,i', a, a))
100000 loops, best of 3: 4.68 µs per loop
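To confirm that the variants being timed compute the same thing, here is a small sketch that checks them against each other. The helper name row_magnitudes is made up for this example; the cast to float is a precaution against integer overflow when squaring large integer inputs.

```python
import numpy as np

def row_magnitudes(a):
    """Return the Euclidean length of each row of an (n, d) array."""
    a = np.asarray(a, dtype=float)
    # 'ij,ij->i' multiplies elementwise and sums over j for every row i
    return np.sqrt(np.einsum('ij,ij->i', a, a))

a = np.arange(1200.0).reshape((-1, 3))
via_einsum = row_magnitudes(a)
via_norm = np.linalg.norm(a, axis=1)
via_sum = np.sqrt((a * a).sum(axis=1))

print(np.allclose(via_einsum, via_norm), np.allclose(via_einsum, via_sum))
```

Which variant is fastest depends on array size and NumPy version, so it is worth re-running the timings on your own data.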

Answer 3

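The fastest way I found is via inner1d. Note that it lives in numpy.core.umath_tests, an internal testing module, so the import may warn or fail on newer NumPy releases. Here’s how it compares to other numpy methods: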

import numpy as np
from numpy.core.umath_tests import inner1d

V = np.random.random_sample((10**6,3,)) # 1 million vectors
A = np.sqrt(np.einsum('...i,...i', V, V))
B = np.linalg.norm(V,axis=1)   
C = np.sqrt((V ** 2).sum(-1))
D = np.sqrt((V*V).sum(axis=1))
E = np.sqrt(inner1d(V,V))

print([np.allclose(E, x) for x in [A, B, C, D]])  # [True, True, True, True]

import cProfile
cProfile.run("np.sqrt(np.einsum('...i,...i', V, V))") # 3 function calls in 0.013 seconds
cProfile.run('np.linalg.norm(V,axis=1)')              # 9 function calls in 0.029 seconds
cProfile.run('np.sqrt((V ** 2).sum(-1))')             # 5 function calls in 0.028 seconds
cProfile.run('np.sqrt((V*V).sum(axis=1))')            # 5 function calls in 0.027 seconds
cProfile.run('np.sqrt(inner1d(V,V))')                 # 2 function calls in 0.009 seconds

inner1d is ~3x faster than linalg.norm and a hair faster than einsum


Answer 4

Use the function norm in scipy.linalg (or numpy.linalg):

>>> import numpy as NP
>>> from scipy import linalg as LA
>>> a = 10*NP.random.randn(6)
>>> a
  array([  9.62141594,   1.29279592,   4.80091404,  -2.93714318,
          17.06608678, -11.34617065])
>>> LA.norm(a)
    23.36461979210312

>>> # compare with OP's function:
>>> import math
>>> mag = lambda x : math.sqrt(sum(i**2 for i in x))
>>> mag(a)
     23.36461979210312

Answer 5

You can do this concisely using the toolbelt vg. It’s a light layer on top of numpy and it supports single values and stacked vectors.

import numpy as np
import vg

x = np.array([1, 2, 3, 4, 5])
mag1 = np.linalg.norm(x)
mag2 = vg.magnitude(x)
print(mag1 == mag2)
# True

I created the library at my last startup, where it was motivated by uses like this: simple ideas which are far too verbose in NumPy.


How to add a new row to an empty numpy array

Question: How to add a new row to an empty numpy array

Using standard Python lists, I can do the following:

arr = []
arr.append([1,2,3])
arr.append([4,5,6])
# arr is now [[1,2,3],[4,5,6]]

However, I cannot do the same thing in numpy. For example:

arr = np.array([])
arr = np.append(arr, np.array([1,2,3]))
arr = np.append(arr, np.array([4,5,6]))
# arr is now [1,2,3,4,5,6]

I also looked into vstack, but when I use vstack on an empty array, I get:

ValueError: all the input array dimensions except for the concatenation axis must match exactly

So how do I append a new row to an empty array in numpy?


Answer 0

The way to “start” the array that you want is:

arr = np.empty((0,3), int)

Which is an empty array but it has the proper dimensionality.

>>> arr
array([], shape=(0, 3), dtype=int64)

Then be sure to append along axis 0:

arr = np.append(arr, np.array([[1,2,3]]), axis=0)
arr = np.append(arr, np.array([[4,5,6]]), axis=0)

But, @jonrsharpe is right. In fact, if you’re going to be appending in a loop, it would be much faster to append to a list as in your first example, then convert to a numpy array at the end, since you’re really not using numpy as intended during the loop:

In [210]: %%timeit
   .....: l = []
   .....: for i in xrange(1000):
   .....:     l.append([3*i+1,3*i+2,3*i+3])
   .....: l = np.asarray(l)
   .....: 
1000 loops, best of 3: 1.18 ms per loop

In [211]: %%timeit
   .....: a = np.empty((0,3), int)
   .....: for i in xrange(1000):
   .....:     a = np.append(a, 3*i+np.array([[1,2,3]]), 0)
   .....: 
100 loops, best of 3: 18.5 ms per loop

In [214]: np.allclose(a, l)
Out[214]: True

The numpythonic way to do it depends on your application, but it would be more like:

In [220]: timeit n = np.arange(1,3001).reshape(1000,3)
100000 loops, best of 3: 5.93 µs per loop

In [221]: np.allclose(a, n)
Out[221]: True
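For reference, here is a Python 3 sketch of the two loop-friendly patterns: collect rows in a list and convert once, or preallocate when the number of rows is known up front. The row count of 1000 simply mirrors the timings above.

```python
import numpy as np

n_rows = 1000

# Pattern 1: append to a list, convert to an array once at the end
rows = []
for i in range(n_rows):
    rows.append([3 * i + 1, 3 * i + 2, 3 * i + 3])
from_list = np.array(rows)

# Pattern 2: preallocate the full array and fill it row by row
preallocated = np.empty((n_rows, 3), dtype=int)
for i in range(n_rows):
    preallocated[i] = [3 * i + 1, 3 * i + 2, 3 * i + 3]

expected = np.arange(1, 3 * n_rows + 1).reshape(n_rows, 3)
print(np.array_equal(from_list, expected), np.array_equal(preallocated, expected))
```

Both avoid the repeated copying that np.append performs on every call.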

Answer 1

Here is my solution:

arr = []
arr.append([1,2,3])
arr.append([4,5,6])
np_arr = np.array(arr)

Answer 2

In this case you might want to use the functions np.hstack and np.vstack

arr = np.array([])
arr = np.hstack((arr, np.array([1,2,3])))
# arr is now [1,2,3]

arr = np.vstack((arr, np.array([4,5,6])))
# arr is now [[1,2,3],[4,5,6]]

You also can use the np.concatenate function.

Cheers


Answer 3

Using a custom dtype definition, what worked for me was:

import numpy

# define custom dtype with two scalar float fields
type1 = numpy.dtype([('freq', numpy.float64), ('amplitude', numpy.float64)])
# declare empty array, zero rows but one column
arr = numpy.empty([0, 1], dtype=type1)
# store row data, maybe inside a loop
row = numpy.array([(0.0001, 0.002)], dtype=type1)
# append row to the main array (vstack treats the 1-D row as shape (1, 1))
arr = numpy.vstack((arr, row))
# print values stored in the row 0
print(float(arr[0, 0]['freq']))
print(float(arr[0, 0]['amplitude']))

Answer 4

When adding new rows to an array in a loop, assign the array directly on the first iteration instead of initialising an empty array.

for i in range(0, 100):
    SOMECALCULATEDARRAY = .......
    if i == 0:
        finalArrayCollection = SOMECALCULATEDARRAY
    else:
        finalArrayCollection = np.vstack((finalArrayCollection, SOMECALCULATEDARRAY))

This is mainly useful when the shape of the array is unknown in advance.
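A variation on the same idea, sketched under the assumption that every computed block has the same number of columns: gather the blocks in a list and call np.vstack once, which removes the first-iteration special case. The function compute_block is a hypothetical stand-in for whatever produces each array.

```python
import numpy as np

def compute_block(i):
    """Hypothetical stand-in for the per-iteration calculation."""
    return np.full((2, 3), i)

# Collect blocks of unknown shape, then stack them in a single call
blocks = [compute_block(i) for i in range(100)]
final_array = np.vstack(blocks)

print(final_array.shape)
```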


Answer 5

I wanted to do this in a for loop, but askewchan’s method did not work well for me, so I modified it:

x = np.empty((0, 3))
y = np.array([1, 2, 3])
for i in ...:
    x = np.vstack((x, y))