Tag Archives: numpy

‘DataFrame’ object has no attribute ‘sort’

Question: ‘DataFrame’ object has no attribute ‘sort’

I’m facing a problem here: in my Python environment I have installed numpy, but I still get this error: ‘DataFrame’ object has no attribute ‘sort’.

Can anyone give me some ideas?

This is my code:

final.loc[-1] =['', 'P','Actual']
final.index = final.index + 1  # shifting index
final = final.sort()
final.columns=[final.columns,final.iloc[0]]
final = final.iloc[1:].reset_index(drop=True)
final.columns.names = (None, None)

Answer 0

sort() was deprecated for DataFrames in favor of either sort_values() (to sort by column values) or sort_index() (to sort by the index).

sort() was deprecated (but still available) in Pandas with release 0.17 (2015-10-09) with the introduction of sort_values() and sort_index(). It was removed from Pandas with release 0.20 (2017-05-05).
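
For the code in the question, a minimal sketch of the fix (assuming the bare final.sort() call was meant to sort by the row index, which is what the argument-less call did by default):

final = final.sort_index()   # replaces the removed final.sort()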


Answer 1

Pandas Sorting 101

sort has been replaced in v0.20 by DataFrame.sort_values and DataFrame.sort_index. Aside from this, we also have argsort.

Here are some common use cases in sorting, and how to solve them using the sorting functions in the current API. First, the setup.

# Setup
np.random.seed(0)
df = pd.DataFrame({'A': list('accab'), 'B': np.random.choice(10, 5)})    
df                                                                                                                                        
   A  B
0  a  7
1  c  9
2  c  3
3  a  5
4  b  2

Sort by Single Column

For example, to sort df by column “A”, use sort_values with a single column name:

df.sort_values(by='A')

   A  B
0  a  7
3  a  5
4  b  2
1  c  9
2  c  3

If you need a fresh RangeIndex, use DataFrame.reset_index.
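
For example, with the df above:

df.sort_values(by='A').reset_index(drop=True)

   A  B
0  a  7
1  a  5
2  b  2
3  c  9
4  c  3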

Sort by Multiple Columns

For example, to sort by both columns “A” and “B” in df, you can pass a list to sort_values:

df.sort_values(by=['A', 'B'])

   A  B
3  a  5
0  a  7
4  b  2
2  c  3
1  c  9

Sort by DataFrame Index

df2 = df.sample(frac=1)
df2

   A  B
1  c  9
0  a  7
2  c  3
3  a  5
4  b  2

You can do this using sort_index:

df2.sort_index()

   A  B
0  a  7
1  c  9
2  c  3
3  a  5
4  b  2

df.equals(df2)                                                                                                                            
# False
df.equals(df2.sort_index())                                                                                                               
# True

Here are some comparable methods with their performance:

%timeit df2.sort_index()                                                                                                                  
%timeit df2.iloc[df2.index.argsort()]                                                                                                     
%timeit df2.reindex(np.sort(df2.index))                                                                                                   

605 µs ± 13.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
610 µs ± 24.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
581 µs ± 7.63 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Sort by List of Indices

For example,

idx = df2.index.argsort()
idx
# array([1, 0, 2, 3, 4])

This “sorting” problem is really a simple indexing problem: just passing the integer positions to iloc will do.

df.iloc[idx]

   A  B
1  c  9
0  a  7
2  c  3
3  a  5
4  b  2

Rank items in an array using Python/NumPy, without sorting the array twice

Question: Rank items in an array using Python/NumPy, without sorting the array twice

I have an array of numbers and I’d like to create another array that represents the rank of each item in the first array. I’m using Python and NumPy.

For example:

array = [4,2,7,1]
ranks = [2,1,3,0]

Here’s the best method I’ve come up with:

array = numpy.array([4,2,7,1])
temp = array.argsort()
ranks = numpy.arange(len(array))[temp.argsort()]

Are there any better/faster methods that avoid sorting the array twice?


Answer 0

Use slicing on the left-hand side in the last step:

array = numpy.array([4,2,7,1])
temp = array.argsort()
ranks = numpy.empty_like(temp)
ranks[temp] = numpy.arange(len(array))

This avoids sorting twice by inverting the permutation in the last step.
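
A quick check with the example array:

print(ranks)  # [2 1 3 0] -- matches the desired output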


Answer 1

Use argsort twice, first to obtain the order of the array, then to obtain ranking:

array = numpy.array([4,2,7,1])
order = array.argsort()
ranks = order.argsort()

When dealing with 2D (or higher dimensional) arrays, be sure to pass an axis argument to argsort to order over the correct axis.
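
For instance, a minimal sketch of ranking within each row of a 2D array:

array2d = numpy.array([[4, 2, 7],
                       [1, 5, 3]])
row_ranks = array2d.argsort(axis=1).argsort(axis=1)
# array([[1, 0, 2],
#        [0, 2, 1]])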


Answer 2

This question is a few years old, and the accepted answer is great, but I think the following is still worth mentioning. If you don’t mind the dependency on scipy, you can use scipy.stats.rankdata:

In [22]: from scipy.stats import rankdata

In [23]: a = [4, 2, 7, 1]

In [24]: rankdata(a)
Out[24]: array([ 3.,  2.,  4.,  1.])

In [25]: (rankdata(a) - 1).astype(int)
Out[25]: array([2, 1, 3, 0])

A nice feature of rankdata is that the method argument provides several options for handling ties. For example, there are three occurrences of 20 and two occurrences of 40 in b:

In [26]: b = [40, 20, 70, 10, 20, 50, 30, 40, 20]

The default assigns the average rank to the tied values:

In [27]: rankdata(b)
Out[27]: array([ 6.5,  3. ,  9. ,  1. ,  3. ,  8. ,  5. ,  6.5,  3. ])

method='ordinal' assigns consecutive ranks:

In [28]: rankdata(b, method='ordinal')
Out[28]: array([6, 2, 9, 1, 3, 8, 5, 7, 4])

method='min' assigns the minimum rank of the tied values to all the tied values:

In [29]: rankdata(b, method='min')
Out[29]: array([6, 2, 9, 1, 2, 8, 5, 6, 2])

See the docstring for more options.


Answer 3

I tried to extend both solutions to arrays A with more than one dimension, supposing you process your array row-by-row (axis=1).

I extended the first snippet with a loop over rows; it can probably be improved:

import numpy as np

A = np.random.rand(1000, 100)  # illustrative data, matching the shapes benchmarked below

temp = A.argsort(axis=1)
rank = np.empty_like(temp)
rangeA = np.arange(temp.shape[1])
for iRow in range(temp.shape[0]):
    rank[iRow, temp[iRow, :]] = rangeA

And the second one, following k.rooijers' suggestion, becomes:

temp = A.argsort(axis=1)
rank = temp.argsort(axis=1)

I randomly generated 400 arrays with shape (1000, 100); the first version took about 7.5, the second one about 3.8.


Answer 4

For a vectorized version of an averaged rank, see below. I love np.unique; it really widens the scope of what code can be efficiently vectorized. Aside from avoiding Python for-loops, this approach also avoids the implicit double loop over ‘a’.

import numpy as np

# a = np.array([4, 1, 6, 8, 4, 1, 6])  # an alternative test array
a = np.array([4, 2, 7, 2, 1])
rank = a.argsort().argsort()

# np.unique maps each element to its position among the unique values
unique, inverse = np.unique(a, return_inverse=True)

# sum the raw ranks of each unique value and count its occurrences
unique_rank_sum = np.zeros_like(unique)
np.add.at(unique_rank_sum, inverse, rank)
unique_count = np.zeros_like(unique)
np.add.at(unique_count, inverse, 1)

# average rank per unique value, broadcast back to the original positions
unique_rank_mean = unique_rank_sum.astype(float) / unique_count
rank_mean = unique_rank_mean[inverse]

print(rank_mean)

Answer 5

Apart from the elegance and shortness of solutions, there is also the question of performance. Here is a little benchmark:

import numpy as np
from scipy.stats import rankdata
l = list(reversed(range(1000)))

%%timeit -n10000 -r5
x = (rankdata(l) - 1).astype(int)
>>> 128 µs ± 2.72 µs per loop (mean ± std. dev. of 5 runs, 10000 loops each)

%%timeit -n10000 -r5
a = np.array(l)
r = a.argsort().argsort()
>>> 69.1 µs ± 464 ns per loop (mean ± std. dev. of 5 runs, 10000 loops each)

%%timeit -n10000 -r5
a = np.array(l)
temp = a.argsort()
r = np.empty_like(temp)
r[temp] = np.arange(len(a))
>>> 63.7 µs ± 1.27 µs per loop (mean ± std. dev. of 5 runs, 10000 loops each)

Answer 6

Using argsort() twice will do it:

>>> import numpy
>>> array = [4,2,7,1]
>>> ranks = numpy.array(array).argsort().argsort()
>>> ranks
array([2, 1, 3, 0])

Answer 7

I tried the methods above, but they failed because I had many zeros. Yes, even with floats, duplicate items may be important.

So I wrote a modified 1D solution by adding a tie-checking step:

import numpy as np

def ranks(v):
    t = np.argsort(v)
    r = np.empty(len(v), int)
    r[t] = np.arange(len(v))
    # tie-checking: give equal values the same rank
    for i in range(1, len(r)):
        if v[t[i]] <= v[t[i - 1]]:
            r[t[i]] = r[t[i - 1]]
    return r

# test it (v here is an illustrative sample array)
v = np.array([4., 2., 7., 2., 1.])
print(sorted(zip(ranks(v), v)))

I believe it’s as efficient as it can be.


Answer 8

I liked the method by k.rooijers, but as rcoup wrote, repeated numbers are ranked according to array position. This was no good for me, so I modified the version to postprocess the ranks and merge any repeated numbers into a combined average rank:

import numpy as np

a = np.array([4, 2, 7, 2, 1])
r = np.array(a.argsort().argsort(), dtype=float)
f = a == a  # all-True flag array: marks values not yet processed
for i in range(len(a)):
    if not f[i]:
        continue
    s = a == a[i]              # positions holding the same value
    ls = np.sum(s)
    if ls > 1:
        tr = np.sum(r[s])      # total of the tied ranks
        r[s] = float(tr) / ls  # replace with their average
    f[s] = False

print(r)  # array([ 3. ,  1.5,  4. ,  1.5,  0. ])

I hope this might help others too; I tried to find another solution to this, but couldn’t find any…


Answer 9

argsort and fancy indexing are symmetric operations.

Try indexing twice instead of calling argsort twice, since indexing is faster than argsort:

array = numpy.array([4,2,7,1])
order = array.argsort()
ranks = numpy.arange(array.shape[0])[order][order]

(A caveat: numpy.arange(n)[order] is just order itself, so this computes order applied to itself rather than the inverse permutation. It happens to give the expected [2, 1, 3, 0] for this input, but it is not correct for arbitrary inputs; the assignment-based inversion in Answer 0 is the reliable one-sort approach.)

Answer 10

A more general version of one of the answers:

In [140]: x = np.random.randn(10, 3)

In [141]: i = np.argsort(x, axis=0)

In [142]: ranks = np.empty_like(i)

In [143]: np.put_along_axis(ranks, i, np.repeat(np.arange(x.shape[0])[:,None], x.shape[1], axis=1), axis=0)

See How to use numpy.argsort() as indices in more than 2 dimensions? to generalize to more dimensions.


Is there a numpy builtin to reject outliers from a list

Question: Is there a numpy builtin to reject outliers from a list

Is there a numpy builtin to do something like the following? That is, take a list d and return a list filtered_d with any outlying elements removed based on some assumed distribution of the points in d.

import numpy as np

def reject_outliers(data):
    m = 2
    u = np.mean(data)
    s = np.std(data)
    filtered = [e for e in data if (u - m * s < e < u + m * s)]
    return filtered

>>> d = [2,4,5,1,6,5,40]
>>> filtered_d = reject_outliers(d)
>>> print(filtered_d)
[2,4,5,1,6,5]

I say ‘something like’ because the function might allow for varying distributions (Poisson, Gaussian, etc.) and varying outlier thresholds within those distributions (like the m I’ve used here).


Answer 0

This method is almost identical to yours, just more numpy-ish (it also works only on numpy arrays):

def reject_outliers(data, m=2):
    return data[abs(data - np.mean(data)) < m * np.std(data)]
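
For the data from the question, a quick check (the input must be a numpy array, since the function indexes with a boolean mask):

d = np.array([2, 4, 5, 1, 6, 5, 40])
print(reject_outliers(d))  # [ 2  4  5  1  6  5] -- the 40 is rejected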

Answer 1

Something important when dealing with outliers is that one should try to use estimators that are as robust as possible. The mean of a distribution will be biased by outliers, but the median, for example, will be much less so.

Building on eumiro’s answer:

def reject_outliers(data, m = 2.):
    d = np.abs(data - np.median(data))
    mdev = np.median(d)
    s = d/mdev if mdev else 0.
    return data[s<m]

Here I have replaced the mean with the more robust median, and the standard deviation with the median absolute distance to the median. I then scaled the distances by their (again) median value so that m is on a reasonable relative scale.

Note that for the data[s<m] syntax to work, data must be a numpy array.


Answer 2

Benjamin Bannier’s answer yields a pass-through when the median of distances from the median is 0, so I found this modified version a bit more helpful for cases as given in the example below.

def reject_outliers_2(data, m=2.):
    d = np.abs(data - np.median(data))
    mdev = np.median(d)
    s = d / (mdev if mdev else 1.)
    return data[s < m]

Example:

data_points = np.array([10, 10, 10, 17, 10, 10])
print(reject_outliers(data_points))
print(reject_outliers_2(data_points))

Gives:

[[10, 10, 10, 17, 10, 10]]  # 17 is not filtered
[10, 10, 10, 10, 10]  # 17 is filtered (its distance, 7, is greater than m)

Answer 3

Building on Benjamin’s, using pandas.Series, and replacing MAD with IQR:

def reject_outliers(sr, iq_range=0.5):
    pcnt = (1 - iq_range) / 2
    qlow, median, qhigh = sr.dropna().quantile([pcnt, 0.50, 1-pcnt])
    iqr = qhigh - qlow
    return sr[ (sr - median).abs() <= iqr]

For instance, if you set iq_range=0.6, the quantiles of the interquartile range become 0.20 <--> 0.80, so more borderline values will be kept.


Answer 4

An alternative is to make a robust estimation of the standard deviation (assuming Gaussian statistics). Looking up online calculators, I see that the 90th percentile corresponds to 1.2815σ and the 95th to 1.645σ (http://vassarstats.net/tabs.html?#z).

As a simple example:

import numpy as np

# Create some random numbers
x = np.random.normal(5, 2, 1000)

# Calculate the statistics
print("Mean= ", np.mean(x))
print("Median= ", np.median(x))
print("Max/Min=", x.max(), " ", x.min())
print("StdDev=", np.std(x))
print("90th Percentile", np.percentile(x, 90))

# Add a few large points
x[10] += 1000
x[20] += 2000
x[30] += 1500

# Recalculate the statistics
print()
print("Mean= ", np.mean(x))
print("Median= ", np.median(x))
print("Max/Min=", x.max(), " ", x.min())
print("StdDev=", np.std(x))
print("90th Percentile", np.percentile(x, 90))

# Measure the percentile intervals and then estimate Standard Deviation of the distribution, both from median to the 90th percentile and from the 10th to 90th percentile
p90 = np.percentile(x, 90)
p10 = np.percentile(x, 10)
p50 = np.median(x)
# p50 to p90 is 1.2815 sigma
rSig = (p90-p50)/1.2815
print("Robust Sigma=", rSig)

rSig = (p90-p10)/(2*1.2815)
print("Robust Sigma=", rSig)

The output I get is:

Mean=  4.99760520022
Median=  4.95395274981
Max/Min= 11.1226494654   -2.15388472011
StdDev= 1.976629928
90th Percentile 7.52065379649

Mean=  9.64760520022
Median=  4.95667658782
Max/Min= 2205.43861943   -2.15388472011
StdDev= 88.6263902244
90th Percentile 7.60646688694

Robust Sigma= 2.06772555531
Robust Sigma= 1.99878292462

Which is close to the expected value of 2.

If we want to remove points above/below 5 standard deviations (with 1000 points we would expect 1 value > 3 standard deviations):

y = x[abs(x - p50) < rSig*5]

# Print the statistics again
print("Mean= ", np.mean(y))
print("Median= ", np.median(y))
print("Max/Min=", y.max(), " ", y.min())
print("StdDev=", np.std(y))

Which gives:

Mean=  4.99755359935
Median=  4.95213030447
Max/Min= 11.1226494654   -2.15388472011
StdDev= 1.97692712883

I have no idea which approach is the more efficient/robust.


Answer 5

I would like to provide two methods in this answer: a solution based on the z-score and a solution based on the IQR.

The code provided in this answer works on both single-dimensional and multi-dimensional numpy arrays.

Let’s import some modules first.

import collections.abc
import numpy as np
import scipy.stats as stat
from scipy.stats import iqr

Z-score based method

This method tests whether a value falls outside three standard deviations. Under this rule, if the value is an outlier the method returns True, and otherwise False.

def sd_outlier(x, axis = None, bar = 3, side = 'both'):
    assert side in ['gt', 'lt', 'both'], 'Side should be `gt`, `lt` or `both`.'

    d_z = stat.zscore(x, axis = axis)

    if side == 'gt':
        return d_z > bar
    elif side == 'lt':
        return d_z < -bar
    elif side == 'both':
        return np.abs(d_z) > bar

IQR-based method

This method will test if the value is less than q1 - 1.5 * iqr or greater than q3 + 1.5 * iqr, which is similar to SPSS’s plot method.

def q1(x, axis = None):
    return np.percentile(x, 25, axis = axis)

def q3(x, axis = None):
    return np.percentile(x, 75, axis = axis)

def iqr_outlier(x, axis = None, bar = 1.5, side = 'both'):
    assert side in ['gt', 'lt', 'both'], 'Side should be `gt`, `lt` or `both`.'

    d_iqr = iqr(x, axis = axis)
    d_q1 = q1(x, axis = axis)
    d_q3 = q3(x, axis = axis)
    iqr_distance = np.multiply(d_iqr, bar)

    stat_shape = list(x.shape)

    if isinstance(axis, collections.abc.Iterable):
        for single_axis in axis:
            stat_shape[single_axis] = 1
    else:
        stat_shape[axis] = 1

    if side in ['gt', 'both']:
        upper_range = d_q3 + iqr_distance
        upper_outlier = np.greater(x - upper_range.reshape(stat_shape), 0)
    if side in ['lt', 'both']:
        lower_range = d_q1 - iqr_distance
        lower_outlier = np.less(x - lower_range.reshape(stat_shape), 0)

    if side == 'gt':
        return upper_outlier
    if side == 'lt':
        return lower_outlier
    if side == 'both':
        return np.logical_or(upper_outlier, lower_outlier)

Finally, if you want to filter out the outliers, use a numpy selector.
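
For instance, a minimal sketch with a 2D array (the sample values are illustrative), treating each row separately:

x = np.array([[1., 2., 3., 100.],
              [1., 2., 3., 4.]])
mask = iqr_outlier(x, axis=1)
print(x[~mask])  # [1. 2. 3. 1. 2. 3. 4.] -- the 100 is dropped (result is flattened)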

Have a nice day.


Answer 6

Consider that all the above methods fail when your standard deviation gets very large due to huge outliers.

(Similarly, the calculation of the mean fails and one should rather calculate the median; the standard deviation is even more prone to such errors than the mean.)

You could try to apply your algorithm iteratively, or you can filter using the interquartile range (here ‘factor’ relates to an n*sigma range, but only when your data follows a Gaussian distribution):

import numpy as np

def sortoutOutliers(dataIn, factor):
    quant3, quant1 = np.percentile(dataIn, [75, 25])
    iqr = quant3 - quant1
    iqrSigma = iqr / 1.34896   # convert IQR to a sigma estimate (Gaussian data)
    medData = np.median(dataIn)
    dataOut = [x for x in dataIn if (medData - factor * iqrSigma < x < medData + factor * iqrSigma)]
    return dataOut

Answer 7

I wanted to do something similar, except setting the number to NaN rather than removing it from the data, since if you remove it you change the length, which can mess up plotting (i.e. if you’re only removing outliers from one column in a table, you need it to remain the same length as the other columns so you can plot them against each other).

To do so I used numpy’s masking functions:

import numpy as np

def reject_outliers(data, m=2):
    stdev = np.std(data)
    mean = np.mean(data)
    maskMin = mean - stdev * m
    maskMax = mean + stdev * m
    mask = np.ma.masked_outside(data, maskMin, maskMax)
    print('Masking values outside of {} and {}'.format(maskMin, maskMax))
    return mask
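
For example (a quick sketch), downstream numpy operations then skip the masked entries:

data = np.array([1., 1., 1., 1., 1., 1., 1., 1., 1., 10.])
masked = reject_outliers(data)
print(masked.mean())  # 1.0 -- computed over the unmasked values only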

Answer 8

If you want to get the index positions of the outliers, idx_list will return them.

def reject_outliers(data, m=2.):
    d = np.abs(data - np.median(data))
    mdev = np.median(d)
    s = d / mdev if mdev else 0.
    data_range = np.arange(len(data))
    idx_list = data_range[s >= m]
    return data[s < m], idx_list

data_points = np.array([8, 10, 35, 17, 73, 77])  
print(reject_outliers(data_points))

after rejection: [ 8 10 35 17], index positions of outliers: [4 5]

Answer 9

For a set of images (each image having 3 dimensions), where I wanted to reject outliers for each pixel, I used:

mean = np.mean(imgs, axis=0)
std = np.std(imgs, axis=0)
mask = np.greater(0.5 * std + 1, np.abs(imgs - mean))
masked = np.multiply(imgs, mask)

Then it is possible to compute the mean:

masked_mean = np.divide(np.sum(masked, axis=0), np.sum(mask, axis=0))

(I use it for background subtraction.)


Input and output numpy arrays to h5py

Question: Input and output numpy arrays to h5py

I have a Python code whose output is a sizeable matrix whose entries are all of the type float. If I save it with the extension .dat, the file size is of the order of 500 MB. I read that using h5py reduces the file size considerably. So, let’s say I have a 2D numpy array named A. How do I save it to an h5py file? Also, how do I read the same file and put it as a numpy array in a different code, as I need to do manipulations with the array?


Answer 0

h5py provides a model of datasets and groups. The former are basically arrays and the latter you can think of as directories. Each is named. You should look at the documentation for the API and examples:

http://docs.h5py.org/en/latest/quick.html

A simple example where you are creating all of the data upfront and just want to save it to an hdf5 file would look something like:

In [1]: import numpy as np
In [2]: import h5py
In [3]: a = np.random.random(size=(100,20))
In [4]: h5f = h5py.File('data.h5', 'w')
In [5]: h5f.create_dataset('dataset_1', data=a)
Out[5]: <HDF5 dataset "dataset_1": shape (100, 20), type "<f8">

In [6]: h5f.close()

You can then load that data back in using:

In [10]: h5f = h5py.File('data.h5','r')
In [11]: b = h5f['dataset_1'][:]
In [12]: h5f.close()

In [13]: np.allclose(a,b)
Out[13]: True

Definitely check out the docs:

http://docs.h5py.org

Writing to an hdf5 file depends on either h5py or pytables (each has a different Python API that sits on top of the hdf5 file specification). You should also take a look at other simple binary formats provided natively by numpy, such as np.save, np.savez, etc.:

http://docs.scipy.org/doc/numpy/reference/routines.io.html
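
A minimal sketch of the numpy-native route for a single array:

import numpy as np

a = np.random.random(size=(100, 20))
np.save('data.npy', a)   # writes numpy's binary .npy format
b = np.load('data.npy')
np.allclose(a, b)        # True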


Answer 1

A cleaner way to handle file open/close and avoid memory leaks:

Prep:

import numpy as np
import h5py

data_to_write = np.random.random(size=(100,20)) # or some such

Write:

with h5py.File('name-of-file.h5', 'w') as hf:
    hf.create_dataset("name-of-dataset",  data=data_to_write)

Read:

with h5py.File('name-of-file.h5', 'r') as hf:
    data = hf['name-of-dataset'][:]

How to return 0 with divide by zero

Question: How to return 0 with divide by zero

I’m trying to perform an element-wise divide in Python, but if a zero is encountered, I need the quotient to just be zero.

For example:

import numpy as np

array1 = np.array([0, 1, 2])
array2 = np.array([0, 1, 1])

array1 / array2 # should be np.array([0, 1, 2])

I could always just use a for-loop through my data, but to really utilize numpy’s optimizations, I need the divide function to return 0 upon divide by zero errors instead of ignoring the error.

Unless I’m missing something, it doesn’t seem numpy.seterr() can return values upon errors. Does anyone have any other suggestions on how I could get the best out of numpy while setting my own divide by zero error handling?


Answer 0

In numpy v1.7+, you can take advantage of the “where” option for ufuncs. You can do things in one line and you don’t have to deal with the errstate context manager.

>>> a = np.array([-1, 0, 1, 2, 3], dtype=float)
>>> b = np.array([ 0, 0, 0, 2, 2], dtype=float)

# If you don't pass `out` the indices where (b == 0) will be uninitialized!
>>> c = np.divide(a, b, out=np.zeros_like(a), where=b!=0)
>>> print(c)
[ 0.   0.   0.   1.   1.5]

In this case, it does the divide calculation anywhere ‘where’ b does not equal zero. When b does equal zero, then it remains unchanged from whatever value you originally gave it in the ‘out’ argument.


Answer 1

Building on @Franck Dernoncourt’s answer, fixing -1 / 0:

import numpy as np

def div0(a, b):
    """ ignore / 0, div0([-1, 0, 1], 0) -> [0, 0, 0] """
    with np.errstate(divide='ignore', invalid='ignore'):
        c = np.true_divide(a, b)
        c[~np.isfinite(c)] = 0  # -inf inf NaN
    return c

div0([-1, 0, 1], 0)
array([0., 0., 0.])

Answer 2

Building on the other answers, and improving on them:

Code:

import numpy as np

a = np.array([0,0,1,1,2], dtype='float')
b = np.array([0,1,0,1,3], dtype='float')

with np.errstate(divide='ignore', invalid='ignore'):
    c = np.true_divide(a,b)
    c[c == np.inf] = 0
    c = np.nan_to_num(c)

print('c: {0}'.format(c))

Output:

c: [ 0.          0.          0.          1.          0.66666667]

Answer 3

One-liner (throws a warning):

np.nan_to_num(array1 / array2)

Answer 4

Try doing it in two steps. Division first, then replace.

with numpy.errstate(divide='ignore'):
    result = numerator / denominator
    result[denominator == 0] = 0

The numpy.errstate line is optional, and just prevents numpy from telling you about the “error” of dividing by zero, since you’re already intending to do so, and handling that case.


Answer 5

You can also replace based on inf, only if the array dtypes are floats, as per this answer:

>>> a = np.array([1,2,3], dtype='float')
>>> b = np.array([0,1,3], dtype='float')
>>> c = a / b
>>> c
array([ inf,   2.,   1.])
>>> c[c == np.inf] = 0
>>> c
array([ 0.,  2.,  1.])

Answer 6

One answer I found searching a related question was to manipulate the output based upon whether the denominator was zero or not.

Suppose arrayA and arrayB have been initialized, but arrayB has some zeros. We could do the following if we want to compute arrayC = arrayA / arrayB safely.

In this case, whenever I have a divide-by-zero in one of the cells, I set the cell equal to myOwnValue, which in this case would be zero:

myOwnValue = 0
arrayC = np.zeros(arrayA.shape)
indNonZeros = np.where(arrayB != 0)
indZeros = np.where(arrayB == 0)

# division in two steps: first with nonzero cells, and then zero cells
arrayC[indNonZeros] = arrayA[indNonZeros] / arrayB[indNonZeros]
arrayC[indZeros] = myOwnValue # Look at footnote

Footnote: In retrospect, this line is unnecessary anyway, since arrayC[i] is instantiated to zero. But if it were the case that myOwnValue != 0, this operation would do something.


Answer 7

Another solution worth mentioning:

>>> a = np.array([1,2,3], dtype='float')
>>> b = np.array([0,1,3], dtype='float')
>>> b_inv = np.array([1/i if i!=0 else 0 for i in b])
>>> a*b_inv
array([0., 2., 1.])

How do I identify numpy types in python?

Question: How do I identify numpy types in python?

How can one reliably determine if an object has a numpy type?

I realize that this question goes against the philosophy of duck typing, but the idea is to make sure a function (which uses scipy and numpy) never returns a numpy type unless it is called with a numpy type. This comes up in the solution to another question, but I think the general problem of determining whether an object has a numpy type is far enough away from that original question that they should be separated.


Answer 0

Use the builtin type function to get the type, then you can use the __module__ property to find out where it was defined:

>>> import numpy as np
>>> a = np.array([1, 2, 3])
>>> type(a)
<type 'numpy.ndarray'>
>>> type(a).__module__
'numpy'
>>> type(a).__module__ == np.__name__
True

Answer 1

The solution I’ve come up with is:

isinstance(y, (np.ndarray, np.generic) )

However, it’s not 100% clear that all numpy types are guaranteed to be either np.ndarray or np.generic, and this probably isn’t version-robust.
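
For instance, a quick check on a numpy scalar versus a plain Python float:

isinstance(np.float64(1.0), (np.ndarray, np.generic))  # True
isinstance(1.0, (np.ndarray, np.generic))              # False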


Answer 2

Old question, but I came up with a definitive answer with an example. It can’t hurt to keep questions fresh, as I had this same problem and didn’t find a clear answer. The key is to make sure you have numpy imported, and then run the isinstance check. While this may seem simple, if you are doing some computations across different data types, this small check can serve as a quick test before you start some numpy vectorized operation.

##################
# important part!
##################

import numpy as np

####################
# toy array for demo
####################

arr = np.asarray(range(1,100,2))

########################
# The instance check
######################## 

isinstance(arr,np.ndarray)

Answer 3

That actually depends on what you’re looking for.

  • If you want to test whether a sequence is actually an ndarray, isinstance(..., np.ndarray) is probably the easiest. Make sure you don’t reload numpy in the background, as the module may be different, but otherwise you should be OK. MaskedArrays, matrix, recarray are all subclasses of ndarray, so you should be set.
  • If you want to test whether a scalar is a numpy scalar, things get a bit more complicated. You could check whether it has a shape and a dtype attribute (see the sketch after this list). You can compare its dtype to the basic dtypes, whose list you can find in np.core.numerictypes.genericTypeRank. Note that the elements of this list are strings, so you’d have to do a tested.dtype is np.dtype(an_element_of_the_list).
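
A minimal sketch of the attribute check described in the second point:

import numpy as np

x = np.float64(1.5)
hasattr(x, 'shape') and hasattr(x, 'dtype')      # True for numpy scalars
hasattr(1.5, 'shape') and hasattr(1.5, 'dtype')  # False for a plain float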

Answer 4

To get the type, use the builtin type function. With the in operator, you can test whether the type is a numpy type by checking if it contains the string numpy:

In [1]: import numpy as np

In [2]: a = np.array([1, 2, 3])

In [3]: type(a)
Out[3]: <type 'numpy.ndarray'>

In [4]: 'numpy' in str(type(a))
Out[4]: True

(This example was run in IPython, by the way. Very handy for interactive use and quick tests.)


Answer 5

Note that type(numpy.ndarray) is a type itself; also watch out for boolean and scalar types. Don’t be too discouraged if it’s not intuitive or easy; it’s a pain at first.

See also:
– https://docs.scipy.org/doc/numpy-1.15.1/reference/arrays.dtypes.html
– https://github.com/machinalis/mypy-data/tree/master/numpy-mypy

>>> import numpy as np
>>> np.ndarray
<class 'numpy.ndarray'>
>>> type(np.ndarray)
<class 'type'>
>>> a = np.linspace(1,25)
>>> type(a)
<class 'numpy.ndarray'>
>>> type(a) == type(np.ndarray)
False
>>> type(a) == np.ndarray
True
>>> isinstance(a, np.ndarray)
True

Fun with booleans:

>>> b = a.astype('int32') == 11
>>> b[0]
False
>>> isinstance(b[0], bool)
False
>>> isinstance(b[0], np.bool)
False
>>> isinstance(b[0], np.bool_)
True
>>> isinstance(b[0], np.bool8)
True
>>> b[0].dtype == np.bool
True
>>> b[0].dtype == bool  # python equivalent
True

More fun with scalar types, see:
– https://docs.scipy.org/doc/numpy-1.15.1/reference/arrays.scalars.html#arrays-scalars-built-in

>>> x = np.array([1,], dtype=np.uint64)
>>> x[0].dtype
dtype('uint64')
>>> isinstance(x[0], np.uint64)
True
>>> isinstance(x[0], np.integer)
True  # generic integer
>>> isinstance(x[0], int)
False  # but not a python int in this case

# Try matching the `kind` strings, e.g.
>>> np.dtype('bool').kind                                                                                           
'b'
>>> np.dtype('int64').kind                                                                                          
'i'
>>> np.dtype('float').kind                                                                                          
'f'
>>> np.dtype('half').kind                                                                                           
'f'

# But be wary of matching dtypes
>>> np.integer
<class 'numpy.integer'>
>>> np.dtype(np.integer)
dtype('int64')
>>> x[0].dtype == np.dtype(np.integer)
False

# Down these paths there be dragons:

# the .dtype attribute returns a kind of dtype, not a specific dtype
>>> isinstance(x[0].dtype, np.dtype)
True
>>> isinstance(x[0].dtype, np.uint64)
False  
>>> isinstance(x[0].dtype, np.dtype(np.uint64))
Traceback (most recent call last):
  File "<console>", line 1, in <module>
TypeError: isinstance() arg 2 must be a type or tuple of types
# yea, don't go there
>>> isinstance(x[0].dtype, np.int_)
False  # again, confusing the .dtype with a specific dtype


# Inequalities can be tricky, although they might
# work sometimes, try to avoid these idioms:

>>> x[0].dtype <= np.dtype(np.uint64)
True
>>> x[0].dtype <= np.dtype(np.float)
True
>>> x[0].dtype <= np.dtype(np.half)
False  # just when things were going well
>>> x[0].dtype <= np.dtype(np.float16)
False  # oh boy
>>> x[0].dtype == np.int
False  # ya, no luck here either
>>> x[0].dtype == np.int_
False  # or here
>>> x[0].dtype == np.uint64
True  # have to end on a good note!

What is the difference between contiguous and non-contiguous arrays?

Question: What is the difference between contiguous and non-contiguous arrays?

In the numpy manual about the reshape() function, it says

>>> a = np.zeros((10, 2))
# A transpose make the array non-contiguous
>>> b = a.T
# Taking a view makes it possible to modify the shape without modifying the
# initial object.
>>> c = b.view()
>>> c.shape = (20)
AttributeError: incompatible shape for a non-contiguous array

My questions are:

  1. What are contiguous and non-contiguous arrays? Is it similar to the contiguous memory block in C described in What is a contiguous memory block?
  2. Is there any performance difference between these two? When should we use one or the other?
  3. Why does transpose make the array non-contiguous?
  4. Why does c.shape = (20) throw the error incompatible shape for a non-contiguous array?

Thanks for your answer!


Answer 0

A contiguous array is just an array stored in an unbroken block of memory: to access the next value in the array, we just move to the next memory address.

Consider the 2D array arr = np.arange(12).reshape(3,4). It looks like this:

[image: arr shown as a 3x4 grid, rows 0-3, 4-7, 8-11]

In the computer’s memory, the values of arr are stored like this:

[image: the values 0 1 2 3 4 5 6 7 8 9 10 11 stored in one contiguous block]

This means arr is a C contiguous array because the rows are stored as contiguous blocks of memory: the next memory address holds the next value in that row. If we want to move down a column, we just need to jump over three blocks (e.g. to jump from 0 to 4 means we skip over 1, 2 and 3).

Transposing the array with arr.T means that C contiguity is lost because adjacent row entries are no longer in adjacent memory addresses. However, arr.T is Fortran contiguous since the columns are in contiguous blocks of memory:

[image: the same memory block; for arr.T the columns (0 1 2 3), (4 5 6 7), (8 9 10 11) are contiguous]


Performance-wise, accessing memory addresses which are next to each other is very often faster than accessing addresses which are more “spread out” (fetching a value from RAM could entail a number of neighbouring addresses being fetched and cached for the CPU.) This means that operations over contiguous arrays will often be quicker.

As a consequence of C contiguous memory layout, row-wise operations are usually faster than column-wise operations. For example, you’ll typically find that

np.sum(arr, axis=1) # sum the rows

is slightly faster than:

np.sum(arr, axis=0) # sum the columns

Similarly, operations on columns will be slightly faster for Fortran contiguous arrays.


Finally, why can’t we flatten the Fortran contiguous array by assigning a new shape?

>>> arr2 = arr.T
>>> arr2.shape = 12
AttributeError: incompatible shape for a non-contiguous array

In order for this to be possible NumPy would have to put the rows of arr.T together like this:

[image: the rows of arr.T laid end to end: 0 4 8 1 5 9 2 6 10 3 7 11]

(Setting the shape attribute directly assumes C order – i.e. NumPy tries to perform the operation row-wise.)

This is impossible to do. For any axis, NumPy needs to have a constant stride length (the number of bytes to move) to get to the next element of the array. Flattening arr.T in this way would require skipping forwards and backwards in memory to retrieve consecutive values of the array.

If we wrote arr2.reshape(12) instead, NumPy would copy the values of arr2 into a new block of memory (since it can’t return a view on to the original data for this shape).
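
A quick way to see the copy happen (a minimal sketch):

import numpy as np

arr = np.arange(12).reshape(3, 4)
flat = arr.T.reshape(12)  # NumPy copies into a new buffer here
flat[0] = 99
print(arr.T[0, 0])        # still 0: flat does not share the original memory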


Answer 1


Maybe this example with 12 different array values will help:

In [207]: x=np.arange(12).reshape(3,4).copy()

In [208]: x.flags
Out[208]: 
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  ...
In [209]: x.T.flags
Out[209]: 
  C_CONTIGUOUS : False
  F_CONTIGUOUS : True
  OWNDATA : False
  ...

The C order values are in the order in which they were generated. The transposed ones are not:

In [212]: x.reshape(12,)   # same as x.ravel()
Out[212]: array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])

In [213]: x.T.reshape(12,)
Out[213]: array([ 0,  4,  8,  1,  5,  9,  2,  6, 10,  3,  7, 11])

You can get 1d views of both

In [214]: x1=x.T

In [217]: x.shape=(12,)

The shape of x can be changed in place.

In [220]: x1.shape=(12,)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-220-cf2b1a308253> in <module>()
----> 1 x1.shape=(12,)

AttributeError: incompatible shape for a non-contiguous array

But the shape of the transpose cannot be changed. The data is still in the 0,1,2,3,4... order, which can’t be accessed as 0,4,8... in a 1d array.

But a copy of x1 can be changed:

In [227]: x2=x1.copy()

In [228]: x2.flags
Out[228]: 
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  ...
In [229]: x2.shape=(12,)

Looking at strides might also help. A stride is how far (in bytes) NumPy has to step to get to the next value. For a 2d array, there will be 2 stride values:

In [233]: x=np.arange(12).reshape(3,4).copy()

In [234]: x.strides
Out[234]: (16, 4)

To get to the next row, step 16 bytes; to get to the next column, only 4.

In [235]: x1.strides
Out[235]: (4, 16)

Transpose just switches the order of the strides. Now the next row is only 4 bytes away, i.e. the next number.

In [236]: x.shape=(12,)

In [237]: x.strides
Out[237]: (4,)

Changing the shape also changes the strides – just step through the buffer 4 bytes at a time.

In [238]: x2=x1.copy()

In [239]: x2.strides
Out[239]: (12, 4)

Even though x2 looks just like x1, it has its own data buffer, with the values in a different order. The next column is now 4 bytes over, while the next row is 12 (3*4).

In [240]: x2.shape=(12,)

In [241]: x2.strides
Out[241]: (4,)

And as with x, changing the shape to 1d reduces the strides to (4,).

For x1, with data in the 0,1,2,... order, there isn’t a 1d stride that would give 0,4,8....

__array_interface__ is another useful way of displaying array information:

In [242]: x1.__array_interface__
Out[242]: 
{'strides': (4, 16),
 'typestr': '<i4',
 'shape': (4, 3),
 'version': 3,
 'data': (163336056, False),
 'descr': [('', '<i4')]}

The x1 data buffer address will be the same as for x, with which it shares the data. x2 has a different buffer address.

You could also experiment with adding an order='F' parameter to the copy and reshape commands.
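
For instance, a quick sketch of that experiment:

import numpy as np

x = np.arange(12).reshape(3, 4)
x1 = x.T

# a Fortran-ordered copy of the transpose stores 0,1,2,... contiguously again
x3 = x1.copy(order='F')
print(x3.flags['F_CONTIGUOUS'])     # True

# ravel/reshape with order='F' walk x1 column by column, recovering 0,1,...,11
print(x1.ravel(order='F'))
print(x1.reshape(12, order='F'))    # a view, since x1 is F-contiguous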


How do I convert a list of numpy arrays into a single numpy array?

Question: How do I convert a list of numpy arrays into a single numpy array?


Suppose I have:

LIST = [array([1, 2, 3, 4, 5]), array([1, 2, 3, 4, 5]), array([1, 2, 3, 4, 5])] # inner elements are numpy arrays

and I want to convert it to:

array([[1, 2, 3, 4, 5],
       [1, 2, 3, 4, 5],
       [1, 2, 3, 4, 5]])

I am solving it right now by iterating with vstack, but that is really slow, especially for a large LIST.

What do you suggest as the most efficient way?


Answer 0


In general you can concatenate a whole sequence of arrays along any axis:

numpy.concatenate( LIST, axis=0 )

but you do have to worry about the shape and dimensionality of each array in the list (for a 2-dimensional 3×5 output, you need to ensure that they are all 2-dimensional n-by-5 arrays already). If you want to concatenate 1-dimensional arrays as the rows of a 2-dimensional output, you need to expand their dimensionality.

As Jorge’s answer points out, there is also the function stack, introduced in numpy 1.10:

numpy.stack( LIST, axis=0 )

This takes the complementary approach: it creates a new view of each input array and adds an extra dimension (in this case, on the left, so each n-element 1D array becomes a 1-by-n 2D array) before concatenating. It will only work if all the input arrays have the same shape—even along the axis of concatenation.

vstack (or equivalently row_stack) is often an easier-to-use solution because it will take a sequence of 1- and/or 2-dimensional arrays and expand the dimensionality automatically where necessary and only where necessary, before concatenating the whole list together. Where a new dimension is required, it is added on the left. Again, you can concatenate a whole list at once without needing to iterate:

numpy.vstack( LIST )

This flexible behavior is also exhibited by the syntactic shortcut numpy.r_[ array1, ...., arrayN ] (note the square brackets). This is good for concatenating a few explicitly-named arrays but is no good for your situation because this syntax will not accept a sequence of arrays, like your LIST.

There is also an analogous function column_stack and shortcut c_[...], for horizontal (column-wise) stacking, as well as an almost-analogous function hstack—although for some reason the latter is less flexible (it is stricter about input arrays’ dimensionality, and tries to concatenate 1-D arrays end-to-end instead of treating them as columns).

Finally, in the specific case of vertical stacking of 1-D arrays, the following also works:

numpy.array( LIST )

…because arrays can be constructed out of a sequence of other arrays, adding a new dimension to the beginning.
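
Putting those options side by side on the kind of input in the question (a small sketch; all four calls produce the same 3-by-5 result here):

import numpy as np

LIST = [np.array([1, 2, 3, 4, 5]) for _ in range(3)]

a = np.vstack(LIST)                                     # shape (3, 5)
b = np.stack(LIST, axis=0)                              # shape (3, 5)
c = np.concatenate([x[None, :] for x in LIST], axis=0)  # expand each to 2-D first
d = np.array(LIST)                                      # shape (3, 5)

print(a.shape, np.array_equal(a, b), np.array_equal(a, c), np.array_equal(a, d))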


Answer 1


Starting in NumPy version 1.10, we have the method stack. It can stack arrays of any dimension (all equal):

# List of arrays.
L = [np.random.randn(5,4,2,5,1,2) for i in range(10)]

# Stack them using axis=0.
M = np.stack(L)
M.shape # == (10,5,4,2,5,1,2)
np.all(M == L) # == True

M = np.stack(L, axis=1)
M.shape # == (5,10,4,2,5,1,2)
np.all(M == L) # == False (Don't Panic)

# These are all True
np.all(M[:,0,:] == L[0]) # == True
all(np.all(M[:,i,:] == L[i]) for i in range(10)) # == True

Enjoy,


Answer 2


I checked some of the methods for speed and found that there is essentially no difference! The only difference is that with some methods you must carefully check the dimensions.

Timing:

|------------|----------------|-------------------|
|            | shape (10000)  |  shape (1,10000)  |
|------------|----------------|-------------------|
| np.concat  |    0.18280     |      0.17960      |
|------------|----------------|-------------------|
|  np.stack  |    0.21501     |      0.16465      |
|------------|----------------|-------------------|
| np.vstack  |    0.21501     |      0.17181      |
|------------|----------------|-------------------|
|  np.array  |    0.21656     |      0.16833      |
|------------|----------------|-------------------|

As you can see, I tried 2 experiments, using np.random.rand(10000) and np.random.rand(1, 10000). If we use 2d arrays, then np.stack and np.array create an additional dimension (result.shape is (1,10000,10000) and (10000,1,10000)), so they need additional steps to avoid this.

Code:

from time import perf_counter
from tqdm import tqdm_notebook
import numpy as np
l = []
for i in tqdm_notebook(range(10000)):
    new_np = np.random.rand(10000)
    l.append(new_np)



start = perf_counter()
stack = np.stack(l, axis=0 )
print(f'np.stack: {perf_counter() - start:.5f}')

start = perf_counter()
vstack = np.vstack(l)
print(f'np.vstack: {perf_counter() - start:.5f}')

start = perf_counter()
wrap = np.array(l)
print(f'np.array: {perf_counter() - start:.5f}')

start = perf_counter()
l = [el.reshape(1,-1) for el in l]
conc = np.concatenate(l, axis=0 )
print(f'np.concatenate: {perf_counter() - start:.5f}')

Numpy isnan() fails on an array of floats (coming from a pandas dataframe apply)

Question: Numpy isnan() fails on an array of floats (coming from a pandas dataframe apply)


I have an array of floats (some normal numbers, some nans) that is coming out of an apply on a pandas dataframe.

For some reason, numpy.isnan is failing on this array. However, as shown below, each element is a float, numpy.isnan runs correctly on each element, and the type of the variable is definitely a numpy array.

What’s going on?!

set([type(x) for x in tester])
Out[59]: {float}

tester
Out[60]: 
array([-0.7000000000000001, nan, nan, nan, nan, nan, nan, nan, nan, nan,
   nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
   nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
   nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
   nan, nan], dtype=object)

set([type(x) for x in tester])
Out[61]: {float}

np.isnan(tester)
Traceback (most recent call last):

File "<ipython-input-62-e3638605b43c>", line 1, in <module>
np.isnan(tester)

TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

set([np.isnan(x) for x in tester])
Out[65]: {False, True}

type(tester)
Out[66]: numpy.ndarray

Answer 0


np.isnan can be applied to NumPy arrays of native dtype (such as np.float64):

In [99]: np.isnan(np.array([np.nan, 0], dtype=np.float64))
Out[99]: array([ True, False], dtype=bool)

but raises TypeError when applied to object arrays:

In [96]: np.isnan(np.array([np.nan, 0], dtype=object))
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

Since you have Pandas, you could use pd.isnull instead — it can accept NumPy arrays of object or native dtypes:

In [97]: pd.isnull(np.array([np.nan, 0], dtype=float))
Out[97]: array([ True, False], dtype=bool)

In [98]: pd.isnull(np.array([np.nan, 0], dtype=object))
Out[98]: array([ True, False], dtype=bool)

Note that None is also considered a null value in object arrays.
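
Another workaround (a sketch, assuming every element really is a float, as the set([type(x) for x in tester]) check in the question suggests) is to cast the object array to a native dtype first:

import numpy as np

tester = np.array([-0.7, np.nan, np.nan], dtype=object)

# cast to a native float dtype, then np.isnan works as usual
print(np.isnan(tester.astype(np.float64)))   # [False  True  True]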


Answer 1


A great substitute for np.isnan() and pd.isnull() is

for i in range(a.shape[0]):
    if a[i] != a[i]:
        # a[i] is nan: handle it here
        pass

since only nan is not equal to itself.
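
The same self-inequality trick also works vectorized, even on object arrays (a small sketch):

import numpy as np

a = np.array([-0.7, np.nan, 0.0], dtype=object)
mask = (a != a)   # elementwise: True exactly where the value is nan
print(mask)       # [False True False]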


Answer 2


On top of @unutbu’s answer, you could coerce the pandas object array to a native (float64) type, something along the lines of

import pandas as pd
pd.to_numeric(df['tester'], errors='coerce')

Specify errors='coerce' to force strings that can’t be parsed to a numeric value to become NaN. The column type would then be dtype: float64, and the isnan check should work.
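
For example (a sketch with a made-up column; the 'oops' string is hypothetical):

import numpy as np
import pandas as pd

s = pd.Series([-0.7, 'oops', None], dtype=object)
coerced = pd.to_numeric(s, errors='coerce')   # unparseable values become NaN
print(coerced.dtype)                          # float64
print(np.isnan(coerced.to_numpy()))           # [False  True  True]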


Answer 3


Make sure you import the csv file using Pandas:

import pandas as pd

condition = pd.isnull(data[i][j])

How do I split/partition a dataset into training and test datasets, e.g. for cross validation?

Question: How do I split/partition a dataset into training and test datasets, e.g. for cross validation?


What is a good way to split a NumPy array randomly into training and testing/validation dataset? Something similar to the cvpartition or crossvalind functions in Matlab.


Answer 0


If you want to split the data set once into two parts, you can use numpy.random.shuffle, or numpy.random.permutation if you need to keep track of the indices:

import numpy
# x is your dataset
x = numpy.random.rand(100, 5)
numpy.random.shuffle(x)
training, test = x[:80,:], x[80:,:]

or

import numpy
# x is your dataset
x = numpy.random.rand(100, 5)
indices = numpy.random.permutation(x.shape[0])
training_idx, test_idx = indices[:80], indices[80:]
training, test = x[training_idx,:], x[test_idx,:]

There are many ways to repeatedly partition the same data set for cross validation. One strategy is to resample from the dataset, with repetition:

import numpy
# x is your dataset
x = numpy.random.rand(100, 5)
training_idx = numpy.random.randint(x.shape[0], size=80)
test_idx = numpy.random.randint(x.shape[0], size=20)
training, test = x[training_idx,:], x[test_idx,:]

Finally, sklearn contains several cross validation methods (k-fold, leave-n-out, …). It also includes more advanced “stratified sampling” methods that create a partition of the data that is balanced with respect to some features, for example to make sure that there is the same proportion of positive and negative examples in the training and test set.
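
Building on the resampling idea, here is a minimal NumPy-only k-fold sketch (each fold serves as the test set exactly once; sklearn’s KFold does the same thing more robustly):

import numpy as np

x = np.random.rand(100, 5)
indices = np.random.permutation(x.shape[0])

k = 5
folds = np.array_split(indices, k)
for i in range(k):
    test_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    training, test = x[train_idx], x[test_idx]
    # fit and evaluate the model on (training, test) here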


Answer 1


There is another option that just entails using scikit-learn. As scikit’s wiki describes, you can just use the following instructions:

from sklearn.model_selection import train_test_split

data, labels = np.arange(10).reshape((5, 2)), range(5)

data_train, data_test, labels_train, labels_test = train_test_split(data, labels, test_size=0.20, random_state=42)

This way you can keep in sync the labels for the data you’re trying to split into training and test.


Answer 2


Just a note. In case you want train, test, AND validation sets, you can do this:

from sklearn.cross_validation import train_test_split
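# note: sklearn.cross_validation has since been removed; in current scikit-learn,
# import train_test_split from sklearn.model_selection instead (see Answer 3)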

X = get_my_X()
y = get_my_y()
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
x_test, x_val, y_test, y_val = train_test_split(x_test, y_test, test_size=0.5)

These parameters will give 70% to the training set and 15% each to the test and validation sets. Hope this helps.


Answer 3


As the sklearn.cross_validation module was deprecated, you can use:

import numpy as np
from sklearn.model_selection import train_test_split
X, y = np.arange(10).reshape((5, 2)), range(5)

X_trn, X_tst, y_trn, y_tst = train_test_split(X, y, test_size=0.2, random_state=42)

Answer 4


You may also consider a stratified division into training and testing sets. A stratified division also generates training and testing sets randomly, but in such a way that the original class proportions are preserved. This makes the training and testing sets better reflect the properties of the original dataset.

import numpy as np  

def get_train_test_inds(y,train_proportion=0.7):
    '''Generates indices, making random stratified split into training set and testing sets
    with proportions train_proportion and (1-train_proportion) of initial sample.
    y is any iterable indicating classes of each observation in the sample.
    Initial proportions of classes inside training and 
    testing sets are preserved (stratified sampling).
    '''

    y=np.array(y)
    train_inds = np.zeros(len(y),dtype=bool)
    test_inds = np.zeros(len(y),dtype=bool)
    values = np.unique(y)
    for value in values:
        value_inds = np.nonzero(y==value)[0]
        np.random.shuffle(value_inds)
        n = int(train_proportion*len(value_inds))

        train_inds[value_inds[:n]]=True
        test_inds[value_inds[n:]]=True

    return train_inds,test_inds

y = np.array([1,1,2,2,3,3])
train_inds,test_inds = get_train_test_inds(y,train_proportion=0.5)
print(y[train_inds])
print(y[test_inds])

This code outputs:

[1 2 3]
[1 2 3]

Answer 5


I wrote a function for my own project to do this (it doesn’t use numpy, though):

def partition(seq, chunks):
    """Splits the sequence into equal sized chunks and them as a list"""
    result = []
    for i in range(chunks):
        chunk = []
        for element in seq[i:len(seq):chunks]:
            chunk.append(element)
        result.append(chunk)
    return result

If you want the chunks to be randomized, just shuffle the list before passing it in.
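
For example, note that the chunks come out interleaved rather than as consecutive runs:

print(partition(list(range(10)), 3))
# [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]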


Answer 6


Here is code to split the data into n=5 folds in a stratified manner:

# X = data array
# y = class labels
from sklearn.cross_validation import StratifiedKFold
skf = StratifiedKFold(y, n_folds=5)
for train_index, test_index in skf:
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
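
For newer scikit-learn versions (where sklearn.cross_validation no longer exists), the equivalent is roughly:

from sklearn.model_selection import StratifiedKFold

# X = data array, y = class labels, as above
skf = StratifiedKFold(n_splits=5)
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]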

Answer 7


Thanks pberkes for your answer. I just modified it to avoid (1) replacement while sampling and (2) duplicated instances occurring in both training and testing:

# either line gives a training index set sampled without replacement
training_idx = np.random.choice(X.shape[0], int(np.round(X.shape[0] * 0.8)), replace=False)
training_idx = np.random.permutation(np.arange(X.shape[0]))[:int(np.round(X.shape[0] * 0.8))]
test_idx = np.setdiff1d(np.arange(0, X.shape[0]), training_idx)

Answer 8


After doing some reading and taking into account the (many..) different ways of splitting the data into train and test sets, I decided to time them!

I used 4 different methods (none of them using the sklearn library, which I’m sure would give the best results, given that it is well-designed and tested code):

  1. shuffle the whole matrix arr and then split the data to train and test
  2. shuffle the indices and then assign it x and y to split the data
  3. same as method 2, but in a more efficient way to do it
  4. using pandas dataframe to split

Method 4 won by far with the shortest time, followed by method 1; methods 2 and 3 turned out to be really inefficient.

The code for the 4 different methods I timed:

import numpy as np
arr = np.random.rand(100, 3)
X = arr[:,:2]
Y = arr[:,2]
spl = 0.7
N = len(arr)
sample = int(spl*N)

#%% Method 1:  shuffle the whole matrix arr and then split
np.random.shuffle(arr)
x_train, x_test, y_train, y_test = X[:sample,:], X[sample:, :], Y[:sample, ], Y[sample:,]

#%% Method 2: shuffle the indices and then apply them to X and Y
train_idx = np.random.choice(N, sample)
Xtrain = X[train_idx]
Ytrain = Y[train_idx]

test_idx = [idx for idx in range(N) if idx not in train_idx]
Xtest = X[test_idx]
Ytest = Y[test_idx]

#%% Method 3: shuffle indices without a for loop
idx = np.random.permutation(arr.shape[0])  # can also use random.shuffle
train_idx, test_idx = idx[:sample], idx[sample:]
x_train, x_test, y_train, y_test = X[train_idx,:], X[test_idx,:], Y[train_idx,], Y[test_idx,]

#%% Method 4: using pandas dataframe to split
import pandas as pd
df = pd.read_csv(file_path, header=None) # Some csv file (I used some file with 3 columns)

train = df.sample(frac=0.7, random_state=200)
test = df.drop(train.index)

And for the times, the minimum time to execute out of 3 repetitions of 1000 loops each was:

  • Method 1: 0.35883826200006297 seconds
  • Method 2: 1.7157016959999964 seconds
  • Method 3: 1.7876616719995582 seconds
  • Method 4: 0.07562861499991413 seconds

I hope that’s helpful!


Answer 9


Likely you will not only need to split into train and test, but will also need cross validation to make sure your model generalizes. Here I am assuming 70% training data, 20% validation and 10% holdout/test data.

Check out the np.split:

If indices_or_sections is a 1-D array of sorted integers, the entries indicate where along axis the array is split. For example, [2, 3] would, for axis=0, result in

ary[:2], ary[2:3], ary[3:]

t, v, h = np.split(df.sample(frac=1, random_state=1), [int(0.7*len(df)), int(0.9*len(df))]) 
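
The same idea on a plain NumPy array (a sketch; the cut points at 70% and 90% give the 70/20/10 split):

import numpy as np

data = np.random.rand(100, 5)
shuffled = data[np.random.permutation(len(data))]
train, validate, test = np.split(shuffled, [int(0.7 * len(data)), int(0.9 * len(data))])
print(train.shape, validate.shape, test.shape)   # (70, 5) (20, 5) (10, 5)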

Answer 10


Split into train, test and validation sets (the three index ranges must partition the permutation without overlapping):

x = np.expand_dims(np.arange(100), -1)
print(x)

indices = np.random.permutation(x.shape[0])

n = x.shape[0]
training_idx = indices[:int(n * .9)]
test_idx = indices[int(n * .9):int(n * .95)]
val_idx = indices[int(n * .95):]

training, test, val = x[training_idx, :], x[test_idx, :], x[val_idx, :]

print(training, test, val)