问题:更好地协调两个numpy数组的更好方法

我有两个不同形状的numpy数组,但是长度(引导尺寸)相同。我想对它们中的每一个进行混洗,以使相应的元素继续对应-即相对于它们的前导索引一致地对它们进行混洗。

该代码有效,并说明了我的目标:

def shuffle_in_unison(a, b):
    assert len(a) == len(b)
    shuffled_a = numpy.empty(a.shape, dtype=a.dtype)
    shuffled_b = numpy.empty(b.shape, dtype=b.dtype)
    permutation = numpy.random.permutation(len(a))
    for old_index, new_index in enumerate(permutation):
        shuffled_a[new_index] = a[old_index]
        shuffled_b[new_index] = b[old_index]
    return shuffled_a, shuffled_b

例如:

>>> a = numpy.asarray([[1, 1], [2, 2], [3, 3]])
>>> b = numpy.asarray([1, 2, 3])
>>> shuffle_in_unison(a, b)
(array([[2, 2],
       [1, 1],
       [3, 3]]), array([2, 1, 3]))

但是,这感觉笨拙,效率低下且速度慢,并且需要复制数组-我宁愿就地对其进行随机播放,因为它们会很大。

有更好的方法来解决这个问题吗?更快的执行速度和更低的内存使用是我的主要目标,但是优美的代码也将是不错的。

我的另一个想法是:

def shuffle_in_unison_scary(a, b):
    rng_state = numpy.random.get_state()
    numpy.random.shuffle(a)
    numpy.random.set_state(rng_state)
    numpy.random.shuffle(b)

这行得通…但是有点吓人,因为我看不到它会继续工作-例如,它看起来像不能在numpy版本中生存的那种东西。

I have two numpy arrays of different shapes, but with the same length (leading dimension). I want to shuffle each of them, such that corresponding elements continue to correspond — i.e. shuffle them in unison with respect to their leading indices.

This code works, and illustrates my goals:

def shuffle_in_unison(a, b):
    assert len(a) == len(b)
    shuffled_a = numpy.empty(a.shape, dtype=a.dtype)
    shuffled_b = numpy.empty(b.shape, dtype=b.dtype)
    permutation = numpy.random.permutation(len(a))
    for old_index, new_index in enumerate(permutation):
        shuffled_a[new_index] = a[old_index]
        shuffled_b[new_index] = b[old_index]
    return shuffled_a, shuffled_b

For example:

>>> a = numpy.asarray([[1, 1], [2, 2], [3, 3]])
>>> b = numpy.asarray([1, 2, 3])
>>> shuffle_in_unison(a, b)
(array([[2, 2],
       [1, 1],
       [3, 3]]), array([2, 1, 3]))

However, this feels clunky, inefficient, and slow, and it requires making a copy of the arrays — I’d rather shuffle them in-place, since they’ll be quite large.

Is there a better way to go about this? Faster execution and lower memory usage are my primary goals, but elegant code would be nice, too.

One other thought I had was this:

def shuffle_in_unison_scary(a, b):
    rng_state = numpy.random.get_state()
    numpy.random.shuffle(a)
    numpy.random.set_state(rng_state)
    numpy.random.shuffle(b)

This works…but it’s a little scary, as I see little guarantee it’ll continue to work — it doesn’t look like the sort of thing that’s guaranteed to survive across numpy version, for example.


回答 0

您的“吓人”解决方案对我来说似乎并不可怕。调用shuffle()两个相同长度的序列会导致对随机数生成器的调用次数相同,这是随机播放算法中唯一的“随机”元素。通过重置状态,可以确保对随机数生成器的调用将在对的第二次调用中给出相同的结果shuffle(),因此整个算法将生成相同的排列。

如果您不喜欢这种方法,那么另一种解决方案是将数据存储在一个数组中,而不是从一开始就存储在两个数组中,然后在此单个数组中创建两个视图以模拟您现在拥有的两个数组。您可以将单个数组用于改组,并将视图用于所有其他目的。

例如:假设数组ab这个样子的:

a = numpy.array([[[  0.,   1.,   2.],
                  [  3.,   4.,   5.]],

                 [[  6.,   7.,   8.],
                  [  9.,  10.,  11.]],

                 [[ 12.,  13.,  14.],
                  [ 15.,  16.,  17.]]])

b = numpy.array([[ 0.,  1.],
                 [ 2.,  3.],
                 [ 4.,  5.]])

现在我们可以构造一个包含所有数据的数组:

c = numpy.c_[a.reshape(len(a), -1), b.reshape(len(b), -1)]
# array([[  0.,   1.,   2.,   3.,   4.,   5.,   0.,   1.],
#        [  6.,   7.,   8.,   9.,  10.,  11.,   2.,   3.],
#        [ 12.,  13.,  14.,  15.,  16.,  17.,   4.,   5.]])

现在我们创建模拟原始视图的视图 a和的b

a2 = c[:, :a.size//len(a)].reshape(a.shape)
b2 = c[:, a.size//len(a):].reshape(b.shape)

的数据 a2b2共享c。要同时混洗两个数组,请使用numpy.random.shuffle(c)

在生产代码,你当然会尽量避免创建原始ab根本,并马上创建ca2b2

该解决方案能够适应的情况下a,并b有不同的dtypes。

Your “scary” solution does not appear scary to me. Calling shuffle() for two sequences of the same length results in the same number of calls to the random number generator, and these are the only “random” elements in the shuffle algorithm. By resetting the state, you ensure that the calls to the random number generator will give the same results in the second call to shuffle(), so the whole algorithm will generate the same permutation.

If you don’t like this, a different solution would be to store your data in one array instead of two right from the beginning, and create two views into this single array simulating the two arrays you have now. You can use the single array for shuffling and the views for all other purposes.

Example: Let’s assume the arrays a and b look like this:

a = numpy.array([[[  0.,   1.,   2.],
                  [  3.,   4.,   5.]],

                 [[  6.,   7.,   8.],
                  [  9.,  10.,  11.]],

                 [[ 12.,  13.,  14.],
                  [ 15.,  16.,  17.]]])

b = numpy.array([[ 0.,  1.],
                 [ 2.,  3.],
                 [ 4.,  5.]])

We can now construct a single array containing all the data:

c = numpy.c_[a.reshape(len(a), -1), b.reshape(len(b), -1)]
# array([[  0.,   1.,   2.,   3.,   4.,   5.,   0.,   1.],
#        [  6.,   7.,   8.,   9.,  10.,  11.,   2.,   3.],
#        [ 12.,  13.,  14.,  15.,  16.,  17.,   4.,   5.]])

Now we create views simulating the original a and b:

a2 = c[:, :a.size//len(a)].reshape(a.shape)
b2 = c[:, a.size//len(a):].reshape(b.shape)

The data of a2 and b2 is shared with c. To shuffle both arrays simultaneously, use numpy.random.shuffle(c).

In production code, you would of course try to avoid creating the original a and b at all and right away create c, a2 and b2.

This solution could be adapted to the case that a and b have different dtypes.


回答 1

您可以使用NumPy的数组索引

def unison_shuffled_copies(a, b):
    assert len(a) == len(b)
    p = numpy.random.permutation(len(a))
    return a[p], b[p]

这将导致创建单独的统一重组的数组。

Your can use NumPy’s array indexing:

def unison_shuffled_copies(a, b):
    assert len(a) == len(b)
    p = numpy.random.permutation(len(a))
    return a[p], b[p]

This will result in creation of separate unison-shuffled arrays.


回答 2

X = np.array([[1., 0.], [2., 1.], [0., 0.]])
y = np.array([0, 1, 2])
from sklearn.utils import shuffle
X, y = shuffle(X, y, random_state=0)

要了解更多信息,请参见http://scikit-learn.org/stable/modules/generated/sklearn.utils.shuffle.html

X = np.array([[1., 0.], [2., 1.], [0., 0.]])
y = np.array([0, 1, 2])
from sklearn.utils import shuffle
X, y = shuffle(X, y, random_state=0)

To learn more, see http://scikit-learn.org/stable/modules/generated/sklearn.utils.shuffle.html


回答 3

很简单的解决方案:

randomize = np.arange(len(x))
np.random.shuffle(randomize)
x = x[randomize]
y = y[randomize]

现在,两个数组x,y都以相同的方式随机洗牌

Very simple solution:

randomize = np.arange(len(x))
np.random.shuffle(randomize)
x = x[randomize]
y = y[randomize]

the two arrays x,y are now both randomly shuffled in the same way


回答 4

James在2015年编写了一个sklearn 解决方案,这很有帮助。但是他添加了一个不需要的随机状态变量。在下面的代码中,自动假定numpy为随机状态。

X = np.array([[1., 0.], [2., 1.], [0., 0.]])
y = np.array([0, 1, 2])
from sklearn.utils import shuffle
X, y = shuffle(X, y)

James wrote in 2015 an sklearn solution which is helpful. But he added a random state variable, which is not needed. In the below code, the random state from numpy is automatically assumed.

X = np.array([[1., 0.], [2., 1.], [0., 0.]])
y = np.array([0, 1, 2])
from sklearn.utils import shuffle
X, y = shuffle(X, y)

回答 5

from np.random import permutation
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data #numpy array
y = iris.target #numpy array

# Data is currently unshuffled; we should shuffle 
# each X[i] with its corresponding y[i]
perm = permutation(len(X))
X = X[perm]
y = y[perm]
from np.random import permutation
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data #numpy array
y = iris.target #numpy array

# Data is currently unshuffled; we should shuffle 
# each X[i] with its corresponding y[i]
perm = permutation(len(X))
X = X[perm]
y = y[perm]

回答 6

仅使用NumPy将任意数量的数组混合在一起就位。

import numpy as np


def shuffle_arrays(arrays, set_seed=-1):
    """Shuffles arrays in-place, in the same order, along axis=0

    Parameters:
    -----------
    arrays : List of NumPy arrays.
    set_seed : Seed value if int >= 0, else seed is random.
    """
    assert all(len(arr) == len(arrays[0]) for arr in arrays)
    seed = np.random.randint(0, 2**(32 - 1) - 1) if set_seed < 0 else set_seed

    for arr in arrays:
        rstate = np.random.RandomState(seed)
        rstate.shuffle(arr)

可以这样使用

a = np.array([1, 2, 3, 4, 5])
b = np.array([10,20,30,40,50])
c = np.array([[1,10,11], [2,20,22], [3,30,33], [4,40,44], [5,50,55]])

shuffle_arrays([a, b, c])

注意事项:

  • 该断言确保所有输入数组沿其第一维具有相同的长度。
  • 数组按其第一个维度在原地随机排列-没有返回任何内容。
  • int32正范围内的随机种子。
  • 如果需要重复播放,可以设置种子值。

随机播放后,可以np.split使用切片对数据进行拆分或使用切片进行引用-取决于应用程序。

Shuffle any number of arrays together, in-place, using only NumPy.

import numpy as np


def shuffle_arrays(arrays, set_seed=-1):
    """Shuffles arrays in-place, in the same order, along axis=0

    Parameters:
    -----------
    arrays : List of NumPy arrays.
    set_seed : Seed value if int >= 0, else seed is random.
    """
    assert all(len(arr) == len(arrays[0]) for arr in arrays)
    seed = np.random.randint(0, 2**(32 - 1) - 1) if set_seed < 0 else set_seed

    for arr in arrays:
        rstate = np.random.RandomState(seed)
        rstate.shuffle(arr)

And can be used like this

a = np.array([1, 2, 3, 4, 5])
b = np.array([10,20,30,40,50])
c = np.array([[1,10,11], [2,20,22], [3,30,33], [4,40,44], [5,50,55]])

shuffle_arrays([a, b, c])

A few things to note:

  • The assert ensures that all input arrays have the same length along their first dimension.
  • Arrays shuffled in-place by their first dimension – nothing returned.
  • Random seed within positive int32 range.
  • If a repeatable shuffle is needed, seed value can be set.

After the shuffle, the data can be split using np.split or referenced using slices – depending on the application.


回答 7

您可以制作一个像这样的数组:

s = np.arange(0, len(a), 1)

然后随机播放:

np.random.shuffle(s)

现在使用this作为数组的参数。相同的改组参数返回相同的改组向量。

x_data = x_data[s]
x_label = x_label[s]

you can make an array like:

s = np.arange(0, len(a), 1)

then shuffle it:

np.random.shuffle(s)

now use this s as argument of your arrays. same shuffled arguments return same shuffled vectors.

x_data = x_data[s]
x_label = x_label[s]

回答 8

可以对连接的列表执行就地改组的一种方法是使用种子(可以是随机的)并使用numpy.random.shuffle进行改组。

# Set seed to a random number if you want the shuffling to be non-deterministic.
def shuffle(a, b, seed):
   np.random.seed(seed)
   np.random.shuffle(a)
   np.random.seed(seed)
   np.random.shuffle(b)

而已。这将以完全相同的方式混洗a和b。这也就地完成,这总是一个优点。

编辑,不要使用np.random.seed()而是使用np.random.RandomState

def shuffle(a, b, seed):
   rand_state = np.random.RandomState(seed)
   rand_state.shuffle(a)
   rand_state.seed(seed)
   rand_state.shuffle(b)

调用它时,只需传入任何种子即可提供随机状态:

a = [1,2,3,4]
b = [11, 22, 33, 44]
shuffle(a, b, 12345)

输出:

>>> a
[1, 4, 2, 3]
>>> b
[11, 44, 22, 33]

编辑:修复了重新设置随机状态的代码

One way in which in-place shuffling can be done for connected lists is using a seed (it could be random) and using numpy.random.shuffle to do the shuffling.

# Set seed to a random number if you want the shuffling to be non-deterministic.
def shuffle(a, b, seed):
   np.random.seed(seed)
   np.random.shuffle(a)
   np.random.seed(seed)
   np.random.shuffle(b)

That’s it. This will shuffle both a and b in the exact same way. This is also done in-place which is always a plus.

EDIT, don’t use np.random.seed() use np.random.RandomState instead

def shuffle(a, b, seed):
   rand_state = np.random.RandomState(seed)
   rand_state.shuffle(a)
   rand_state.seed(seed)
   rand_state.shuffle(b)

When calling it just pass in any seed to feed the random state:

a = [1,2,3,4]
b = [11, 22, 33, 44]
shuffle(a, b, 12345)

Output:

>>> a
[1, 4, 2, 3]
>>> b
[11, 44, 22, 33]

Edit: Fixed code to re-seed the random state


回答 9

有一个众所周知的函数可以处理此问题:

from sklearn.model_selection import train_test_split
X, _, Y, _ = train_test_split(X,Y, test_size=0.0)

只需将test_size设置为0即可避免拆分,并为您提供随机数据。尽管它通常用于拆分训练数据和测试数据,但它的确也可以洗牌。
文档

将数组或矩阵拆分为随机训练和测试子集

快速实用程序,用于包装输入验证以及next(ShuffleSplit()。split(X,y))和应用程序,以将数据输入到单个调用中,以便在oneliner中拆分(以及可选地对子采样)数据。

There is a well-known function that can handle this:

from sklearn.model_selection import train_test_split
X, _, Y, _ = train_test_split(X,Y, test_size=0.0)

Just setting test_size to 0 will avoid splitting and give you shuffled data. Though it is usually used to split train and test data, it does shuffle them too.
From documentation

Split arrays or matrices into random train and test subsets

Quick utility that wraps input validation and next(ShuffleSplit().split(X, y)) and application to input data into a single call for splitting (and optionally subsampling) data in a oneliner.


回答 10

假设我们有两个数组:a和b。

a = np.array([[1,2,3],[4,5,6],[7,8,9]])
b = np.array([[9,1,1],[6,6,6],[4,2,0]]) 

我们首先可以通过排列第一维来获得行索引

indices = np.random.permutation(a.shape[0])
[1 2 0]

然后使用高级索引。在这里,我们使用相同的索引来同时对两个数组进行混洗。

a_shuffled = a[indices[:,np.newaxis], np.arange(a.shape[1])]
b_shuffled = b[indices[:,np.newaxis], np.arange(b.shape[1])]

这相当于

np.take(a, indices, axis=0)
[[4 5 6]
 [7 8 9]
 [1 2 3]]

np.take(b, indices, axis=0)
[[6 6 6]
 [4 2 0]
 [9 1 1]]

Say we have two arrays: a and b.

a = np.array([[1,2,3],[4,5,6],[7,8,9]])
b = np.array([[9,1,1],[6,6,6],[4,2,0]]) 

We can first obtain row indices by permutating first dimension

indices = np.random.permutation(a.shape[0])
[1 2 0]

Then use advanced indexing. Here we are using the same indices to shuffle both arrays in unison.

a_shuffled = a[indices[:,np.newaxis], np.arange(a.shape[1])]
b_shuffled = b[indices[:,np.newaxis], np.arange(b.shape[1])]

This is equivalent to

np.take(a, indices, axis=0)
[[4 5 6]
 [7 8 9]
 [1 2 3]]

np.take(b, indices, axis=0)
[[6 6 6]
 [4 2 0]
 [9 1 1]]

回答 11

如果要避免复制数组,则建议不要遍历数组,而是遍历数组中的每个元素,然后将其随机交换到数组中的另一个位置

for old_index in len(a):
    new_index = numpy.random.randint(old_index+1)
    a[old_index], a[new_index] = a[new_index], a[old_index]
    b[old_index], b[new_index] = b[new_index], b[old_index]

这实现了Knuth-Fisher-Yates随机播放算法。

If you want to avoid copying arrays, then I would suggest that instead of generating a permutation list, you go through every element in the array, and randomly swap it to another position in the array

for old_index in len(a):
    new_index = numpy.random.randint(old_index+1)
    a[old_index], a[new_index] = a[new_index], a[old_index]
    b[old_index], b[new_index] = b[new_index], b[old_index]

This implements the Knuth-Fisher-Yates shuffle algorithm.


回答 12

这似乎是一个非常简单的解决方案:

import numpy as np
def shuffle_in_unison(a,b):

    assert len(a)==len(b)
    c = np.arange(len(a))
    np.random.shuffle(c)

    return a[c],b[c]

a =  np.asarray([[1, 1], [2, 2], [3, 3]])
b =  np.asarray([11, 22, 33])

shuffle_in_unison(a,b)
Out[94]: 
(array([[3, 3],
        [2, 2],
        [1, 1]]),
 array([33, 22, 11]))

This seems like a very simple solution:

import numpy as np
def shuffle_in_unison(a,b):

    assert len(a)==len(b)
    c = np.arange(len(a))
    np.random.shuffle(c)

    return a[c],b[c]

a =  np.asarray([[1, 1], [2, 2], [3, 3]])
b =  np.asarray([11, 22, 33])

shuffle_in_unison(a,b)
Out[94]: 
(array([[3, 3],
        [2, 2],
        [1, 1]]),
 array([33, 22, 11]))

回答 13

举个例子,这就是我在做什么:

combo = []
for i in range(60000):
    combo.append((images[i], labels[i]))

shuffle(combo)

im = []
lab = []
for c in combo:
    im.append(c[0])
    lab.append(c[1])
images = np.asarray(im)
labels = np.asarray(lab)

With an example, this is what I’m doing:

combo = []
for i in range(60000):
    combo.append((images[i], labels[i]))

shuffle(combo)

im = []
lab = []
for c in combo:
    im.append(c[0])
    lab.append(c[1])
images = np.asarray(im)
labels = np.asarray(lab)

回答 14

我扩展了python的random.shuffle()以获取第二个参数:

def shuffle_together(x, y):
    assert len(x) == len(y)

    for i in reversed(xrange(1, len(x))):
        # pick an element in x[:i+1] with which to exchange x[i]
        j = int(random.random() * (i+1))
        x[i], x[j] = x[j], x[i]
        y[i], y[j] = y[j], y[i]

这样,我可以确定改组发生在原位,并且函数不会太长或太复杂。

I extended python’s random.shuffle() to take a second arg:

def shuffle_together(x, y):
    assert len(x) == len(y)

    for i in reversed(xrange(1, len(x))):
        # pick an element in x[:i+1] with which to exchange x[i]
        j = int(random.random() * (i+1))
        x[i], x[j] = x[j], x[i]
        y[i], y[j] = y[j], y[i]

That way I can be sure that the shuffling happens in-place, and the function is not all too long or complicated.


回答 15

只需使用 numpy

首先合并两个输入数组,一维数组是labels(y),二维数组是data(x),然后用NumPy shuffle方法将它们洗牌。最后将它们拆分并返回。

import numpy as np

def shuffle_2d(a, b):
    rows= a.shape[0]
    if b.shape != (rows,1):
        b = b.reshape((rows,1))
    S = np.hstack((b,a))
    np.random.shuffle(S)
    b, a  = S[:,0], S[:,1:]
    return a,b

features, samples = 2, 5
x, y = np.random.random((samples, features)), np.arange(samples)
x, y = shuffle_2d(train, test)

Just use numpy

First merge the two input arrays 1D array is labels(y) and 2D array is data(x) and shuffle them with NumPy shuffle method. Finally split them and return.

import numpy as np

def shuffle_2d(a, b):
    rows= a.shape[0]
    if b.shape != (rows,1):
        b = b.reshape((rows,1))
    S = np.hstack((b,a))
    np.random.shuffle(S)
    b, a  = S[:,0], S[:,1:]
    return a,b

features, samples = 2, 5
x, y = np.random.random((samples, features)), np.arange(samples)
x, y = shuffle_2d(train, test)

声明:本站所有文章,如无特殊说明或标注,均为本站原创发布。任何个人或组织,在未征得本站同意时,禁止复制、盗用、采集、发布本站内容到任何网站、书籍等各类媒体平台。如若本站内容侵犯了原著者的合法权益,可联系我们进行处理。