Tag archive: numpy

Converting numpy dtypes to native python types

Question: Converting numpy dtypes to native python types

If I have a numpy dtype, how do I automatically convert it to its closest python data type? For example,

numpy.float32 -> "python float"
numpy.float64 -> "python float"
numpy.uint32  -> "python int"
numpy.int16   -> "python int"

I could try to come up with a mapping of all of these cases, but does numpy provide some automatic way of converting its dtypes into the closest possible native python types? This mapping need not be exhaustive, but it should convert the common dtypes that have a close python analog. I think this already happens somewhere in numpy.


Answer 0

Use val.item() to convert most NumPy values to a native Python type:

import numpy as np

# for example, numpy.float32 -> python float
val = np.float32(0)
pyval = val.item()
print(type(pyval))         # <class 'float'>

# and similar...
type(np.float64(0).item()) # <class 'float'>
type(np.uint32(0).item())  # <class 'int'> (Python 2 reported 'long')
type(np.int16(0).item())   # <class 'int'>
type(np.complex128(0).item())  # <class 'complex'>
type(np.datetime64(0, 'D').item())  # <class 'datetime.date'>
type(np.datetime64('2001-01-01 00:00:00').item())  # <class 'datetime.datetime'>
type(np.timedelta64(0, 'D').item()) # <class 'datetime.timedelta'>
...

(Another method is np.asscalar(val), but it has been deprecated since NumPy 1.16.)


For the curious, to build a table of conversions of NumPy array scalars for your system:

for name in dir(np):
    obj = getattr(np, name)
    if hasattr(obj, 'dtype'):
        try:
            if 'time' in name:
                npn = obj(0, 'D')
            else:
                npn = obj(0)
            nat = npn.item()
            print('{0} ({1!r}) -> {2}'.format(name, npn.dtype.char, type(nat)))
        except:
            pass

There are a few NumPy types that have no native Python equivalent on some systems, including: clongdouble, clongfloat, complex192, complex256, float128, longcomplex, longdouble and longfloat. These need to be converted to their nearest NumPy equivalent before using .item().
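
As a minimal sketch of that last point, an extended-precision scalar can be downcast to float64 before calling .item(). The helper name longdouble_to_float is hypothetical, and the downcast may lose precision on platforms where longdouble is wider than 64 bits:

```python
import numpy as np

def longdouble_to_float(value):
    """Downcast an extended-precision scalar to float64, then to a Python float."""
    # np.float64(...) picks the nearest representable double; .item() then
    # returns a plain Python float.
    return np.float64(value).item()

x = np.longdouble(1.5)
native = longdouble_to_float(x)
print(type(native), native)
```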


Answer 1

I found myself with a mixed set of numpy types and standard Python types. As all numpy types derive from numpy.generic, here’s how you can convert everything to standard Python types:

if isinstance(obj, numpy.generic):
    return obj.item()  # originally numpy.asscalar(obj), deprecated since NumPy 1.16
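
The check above is a fragment (it uses return outside a function). A minimal sketch of how it might be wrapped, assuming .item() is used in place of the deprecated asscalar; the function name to_native is hypothetical:

```python
import numpy as np

def to_native(obj):
    """Return a Python scalar for NumPy scalars; pass other objects through."""
    if isinstance(obj, np.generic):
        return obj.item()
    return obj

mixed = [np.int16(3), np.float32(0.5), "text", 7]
converted = [to_native(v) for v in mixed]
print([type(v).__name__ for v in converted])
```

Values that are already native Python objects are returned unchanged, so the function can be mapped over a mixed collection.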

Answer 2

If you want to convert a value (a numpy.ndarray, a numpy scalar, or a native type) to a native type, you can simply do:

converted_value = getattr(value, "tolist", lambda: value)()

tolist will convert your scalar or array to python native type. The default lambda function takes care of the case where value is already native.
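
A short sketch of how this one-liner behaves on the three kinds of input it mentions. The wrapper name to_builtin is hypothetical:

```python
import numpy as np

def to_builtin(value):
    # NumPy scalars and arrays expose .tolist(); anything else falls back
    # to the lambda, which returns the value unchanged.
    return getattr(value, "tolist", lambda: value)()

scalar_result = to_builtin(np.float32(2.0))      # NumPy scalar -> Python float
array_result = to_builtin(np.array([[1, 2], [3, 4]]))  # ndarray -> nested lists
native_result = to_builtin(5)                    # already native, unchanged
print(type(scalar_result), array_result, native_result)
```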


Answer 3

How about:

In [51]: dict([(d, type(np.zeros(1,d).tolist()[0])) for d in (np.float32,np.float64,np.uint32, np.int16)])
Out[51]: 
{<type 'numpy.int16'>: <type 'int'>,
 <type 'numpy.uint32'>: <type 'long'>,
 <type 'numpy.float32'>: <type 'float'>,
 <type 'numpy.float64'>: <type 'float'>}

Answer 4

tolist() is a more general approach to accomplish this. It works in any primitive dtype and also in arrays or matrices.

It doesn’t actually yield a list if called on a primitive (scalar) type:

numpy == 1.15.2

>>> import numpy as np

>>> np_float = np.float64(1.23)
>>> print(type(np_float), np_float)
<class 'numpy.float64'> 1.23

>>> listed_np_float = np_float.tolist()
>>> print(type(listed_np_float), listed_np_float)
<class 'float'> 1.23

>>> np_array = np.array([[1,2,3.], [4,5,6.]])
>>> print(type(np_array), np_array)
<class 'numpy.ndarray'> [[1. 2. 3.]
 [4. 5. 6.]]

>>> listed_np_array = np_array.tolist()
>>> print(type(listed_np_array), listed_np_array)
<class 'list'> [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

Answer 5

You can also call the item() method of the object you want to convert:

>>> from numpy import float32, uint32
>>> type(float32(0).item())
<type 'float'>
>>> type(uint32(0).item())
<type 'long'>

Answer 6

I think you can just write a general type conversion function like so:

import numpy as np

def get_type_convert(np_type):
   convert_type = type(np.zeros(1,np_type).tolist()[0])
   return (np_type, convert_type)

print(get_type_convert(np.float32))
>> (<class 'numpy.float32'>, <class 'float'>)

print(get_type_convert(np.float64))
>> (<class 'numpy.float64'>, <class 'float'>)

This means there is no fixed list and your code will scale to more types.


Answer 7

numpy holds that information in a mapping exposed as typeDict, so you could do something like the following:

>>> import __builtin__
>>> import numpy as np
>>> {v: k for k, v in np.typeDict.items() if k in dir(__builtin__)}
{numpy.object_: 'object',
 numpy.bool_: 'bool',
 numpy.string_: 'str',
 numpy.unicode_: 'unicode',
 numpy.int64: 'int',
 numpy.float64: 'float',
 numpy.complex128: 'complex'}

If you want the actual python types rather than their names, you can do:

>>> {v: getattr(__builtin__, k) for k, v in np.typeDict.items() if k in vars(__builtin__)}
{numpy.object_: object,
 numpy.bool_: bool,
 numpy.string_: str,
 numpy.unicode_: unicode,
 numpy.int64: int,
 numpy.float64: float,
 numpy.complex128: complex}

Answer 8

Sorry to come late to the party, but I was looking at the problem of converting numpy.float64 to a regular Python float only. I saw 3 ways of doing that:

  1. npValue.item()
  2. npValue.astype(float)
  3. float(npValue)

Here are the relevant timings from IPython:

In [1]: import numpy as np

In [2]: aa = np.random.uniform(0, 1, 1000000)

In [3]: %timeit map(float, aa)
10 loops, best of 3: 117 ms per loop

In [4]: %timeit map(lambda x: x.astype(float), aa)
1 loop, best of 3: 780 ms per loop

In [5]: %timeit map(lambda x: x.item(), aa)
1 loop, best of 3: 475 ms per loop

It looks like float(npValue) is much faster.
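
The timings above come from Python 2, where map builds a list eagerly; on Python 3 map is lazy, so the same %timeit lines would measure almost nothing. A minimal sketch for re-running the comparison with the standard timeit module follows. The array size, repeat count, and the extra tolist variant are assumptions, and absolute numbers will differ by machine:

```python
import timeit
import numpy as np

aa = np.random.uniform(0, 1, 10_000)

def via_float():
    # Built-in float() applied to each NumPy scalar
    return [float(x) for x in aa]

def via_item():
    # Scalar .item() applied to each NumPy scalar
    return [x.item() for x in aa]

def via_tolist():
    # Whole-array conversion in a single call
    return aa.tolist()

for fn in (via_float, via_item, via_tolist):
    seconds = timeit.timeit(fn, number=5)
    print(f"{fn.__name__}: {seconds:.4f} s for 5 runs")
```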


Answer 9

My approach is a bit forceful, but seems to play nice for all cases:

def type_np2py(dtype=None, arr=None):
    '''Return the closest python type for a given numpy dtype'''

    if ((dtype is None and arr is None) or
        (dtype is not None and arr is not None)):
        raise ValueError(
            "Provide either keyword argument `dtype` or `arr`: a numpy dtype or a numpy array.")

    if dtype is None:
        dtype = arr.dtype

    #1) Make a single-entry numpy array of the same dtype
    #2) force the array into a python 'object' dtype
    #3) the array entry should now be the closest python type
    single_entry = np.empty([1], dtype=dtype).astype(object)

    return type(single_entry[0])

Usage:

>>> type_np2py(int)
<class 'int'>

>>> type_np2py(np.int)
<class 'int'>

>>> type_np2py(str)
<class 'str'>

>>> type_np2py(arr=np.array(['hello']))
<class 'str'>

>>> type_np2py(arr=np.array([1,2,3]))
<class 'int'>

>>> type_np2py(arr=np.array([1.,2.,3.]))
<class 'float'>

Answer 10

A side note about array scalars for those who don’t need automatic conversion and know the numpy dtype of the value:

Array scalars differ from Python scalars, but for the most part they can be used interchangeably (the primary exception is for versions of Python older than v2.x, where integer array scalars cannot act as indices for lists and tuples). There are some exceptions, such as when code requires very specific attributes of a scalar or when it checks specifically whether a value is a Python scalar. Generally, problems are easily fixed by explicitly converting array scalars to Python scalars, using the corresponding Python type function (e.g., int, float, complex, str, unicode).

Source

Thus, for most cases conversion might not be needed at all, and the array scalar could be used directly. The effect should be identical to using Python scalar:

>>> np.issubdtype(np.int64, int)
True
>>> np.int64(0) == 0
True
>>> np.issubdtype(np.float64, float)
True
>>> np.float64(1.1) == 1.1
True

But if, for some reason, the explicit conversion is needed, using the corresponding Python built-in function is the way to go. As shown in the other answer it’s also faster than array scalar item() method.


Answer 11

Translate the whole ndarray (here, the values of a pandas DataFrame) instead of a single value:

def trans(data):
    """
    Translate numpy int/float values of a DataFrame into native Python types.
    """
    result = []
    # iterate by position so that non-default indexes also work with iloc
    for i in range(len(data.index)):
        d0 = data.iloc[i].values
        d = []
        for j in d0:
            if 'int' in str(type(j)):
                res = j.item() if 'item' in dir(j) else j
            elif 'float' in str(type(j)):
                res = j.item() if 'item' in dir(j) else j
            else:
                res = j
            d.append(res)
        d = tuple(d)
        result.append(d)
    result = tuple(result)
    return result

However, it takes several minutes when handling large dataframes. I am also looking for a more efficient solution and hope for a better answer.
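
One possibly faster route, sketched here for a plain 2-D numeric ndarray rather than the DataFrame used above (an assumption), is to let tolist() convert every element in one call and then build the tuples. The function name rows_as_native_tuples is hypothetical:

```python
import numpy as np

def rows_as_native_tuples(arr):
    """Convert a 2-D numeric ndarray to a tuple of tuples of Python scalars."""
    # tolist() converts all elements to Python scalars in a single call,
    # avoiding a Python-level .item() call per element.
    return tuple(tuple(row) for row in np.asarray(arr).tolist())

table = np.array([[1, 2], [3, 4]], dtype=np.int64)
result = rows_as_native_tuples(table)
print(result)
```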


NumPy array initialization (fill with identical values)

Question: NumPy array initialization (fill with identical values)

I need to create a NumPy array of length n, each element of which is v.

Is there anything better than:

a = empty(n)
for i in range(n):
    a[i] = v

I know zeros and ones would work for v = 0, 1. I could use v * ones(n), but it won’t work when v is None, and also would be much slower.


Answer 0

NumPy 1.8 introduced np.full(), which is a more direct method than empty() followed by fill() for creating an array filled with a certain value:

>>> np.full((3, 5), 7)
array([[ 7.,  7.,  7.,  7.,  7.],
       [ 7.,  7.,  7.,  7.,  7.],
       [ 7.,  7.,  7.,  7.,  7.]])

>>> np.full((3, 5), 7, dtype=int)
array([[7, 7, 7, 7, 7],
       [7, 7, 7, 7, 7],
       [7, 7, 7, 7, 7]])

This is arguably the way of creating an array filled with certain values, because it explicitly describes what is being achieved (and it can in principle be very efficient since it performs a very specific task).
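
The question also asks about v being None. A minimal sketch of that case with np.full, passing dtype=object explicitly so that None is stored as a Python object (the variable names are illustrative):

```python
import numpy as np

n = 4

# dtype=object keeps None as-is instead of attempting a numeric conversion
a = np.full(n, None, dtype=object)

# For an ordinary float fill value the dtype follows from the value
b = np.full(n, 2.5)

print(a, a.dtype)
print(b, b.dtype)
```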


Answer 1

Updated for Numpy 1.7.0 (hat-tip to @Rolf Bartstra).

a=np.empty(n); a.fill(5) is fastest.

In descending speed order:

%timeit a=np.empty(1e4); a.fill(5)
100000 loops, best of 3: 5.85 us per loop

%timeit a=np.empty(1e4); a[:]=5 
100000 loops, best of 3: 7.15 us per loop

%timeit a=np.ones(1e4)*5
10000 loops, best of 3: 22.9 us per loop

%timeit a=np.repeat(5,(1e4))
10000 loops, best of 3: 81.7 us per loop

%timeit a=np.tile(5,[1e4])
10000 loops, best of 3: 82.9 us per loop

Answer 2

I believe fill is the fastest way to do this.

a = np.empty(10)
a.fill(7)

You should also always avoid iterating like you are doing in your example. A simple a[:] = v will accomplish what your iteration does using numpy broadcasting.
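
A small sketch comparing the two forms mentioned here with the explicit loop from the question. All three produce the same array; n and v are illustrative values:

```python
import numpy as np

n, v = 5, 7.0

# Explicit Python loop, as in the question
looped = np.empty(n)
for i in range(n):
    looped[i] = v

# In-place fill
filled = np.empty(n)
filled.fill(v)

# Slice assignment: the scalar v is broadcast to every element
sliced = np.empty(n)
sliced[:] = v

print(looped, filled, sliced)
```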


Answer 3

Apparently, not only the absolute speeds but also the speed order (as reported by user1579844) are machine dependent; here’s what I found:

a=np.empty(1e4); a.fill(5) is fastest;

In descending speed order:

timeit a=np.empty(1e4); a.fill(5) 
# 100000 loops, best of 3: 10.2 us per loop
timeit a=np.empty(1e4); a[:]=5
# 100000 loops, best of 3: 16.9 us per loop
timeit a=np.ones(1e4)*5
# 100000 loops, best of 3: 32.2 us per loop
timeit a=np.tile(5,[1e4])
# 10000 loops, best of 3: 90.9 us per loop
timeit a=np.repeat(5,(1e4))
# 10000 loops, best of 3: 98.3 us per loop
timeit a=np.array([5]*int(1e4))
# 1000 loops, best of 3: 1.69 ms per loop (slowest BY FAR!)

So, try and find out, and use what’s fastest on your platform.
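
To try this on your own platform outside IPython, a minimal sketch using the standard timeit module is below. The set of candidates, the array length, and the repeat counts are assumptions; note that the size is an int, since recent NumPy versions reject float sizes such as 1e4:

```python
import timeit
import numpy as np

N = 10_000  # array length (int, not 1e4)

def empty_fill():
    a = np.empty(N)
    a.fill(5)
    return a

def empty_slice():
    a = np.empty(N)
    a[:] = 5
    return a

def full():
    return np.full(N, 5.0)

def ones_times():
    return np.ones(N) * 5

def from_list():
    return np.array([5.0] * N)

candidates = [empty_fill, empty_slice, full, ones_times, from_list]
timings = {}
for fn in candidates:
    # best of 3 repeats of 200 calls each, reported per call
    timings[fn.__name__] = min(timeit.repeat(fn, number=200, repeat=3)) / 200

for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name:12s} {seconds * 1e6:8.2f} us per call")
```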


Answer 4

I had

numpy.array(n * [value])

in mind, but apparently that is slower than all other suggestions for large enough n.

Here is a full comparison with perfplot (a pet project of mine).

The two empty alternatives are still the fastest (with NumPy 1.12.1). full catches up for large arrays.


Code to generate the plot:

import numpy as np
import perfplot


def empty_fill(n):
    a = np.empty(n)
    a.fill(3.14)
    return a


def empty_colon(n):
    a = np.empty(n)
    a[:] = 3.14
    return a


def ones_times(n):
    return 3.14 * np.ones(n)


def repeat(n):
    return np.repeat(3.14, (n))


def tile(n):
    return np.tile(3.14, [n])


def full(n):
    return np.full((n), 3.14)


def list_to_array(n):
    return np.array(n * [3.14])


perfplot.show(
    setup=lambda n: n,
    kernels=[empty_fill, empty_colon, ones_times, repeat, tile, full, list_to_array],
    n_range=[2 ** k for k in range(27)],
    xlabel="len(a)",
    logx=True,
    logy=True,
)

Answer 5

You can use numpy.tile, e.g. :

v = 7
rows = 3
cols = 5
a = numpy.tile(v, (rows,cols))
a
Out[1]: 
array([[7, 7, 7, 7, 7],
       [7, 7, 7, 7, 7],
       [7, 7, 7, 7, 7]])

Although tile is meant to ’tile’ an array (instead of a scalar, as in this case), it will do the job, creating pre-filled arrays of any size and dimension.


Answer 6

Without numpy:

>>> [2]*3
[2, 2, 2]

How do I convert a numpy array to (and display) an image?

Question: How do I convert a numpy array to (and display) an image?

I have created an array thusly:

import numpy as np
data = np.zeros( (512,512,3), dtype=np.uint8)
data[256,256] = [255,0,0]

What I want this to do is display a single red dot in the center of a 512×512 image. (At least to begin with… I think I can figure out the rest from there)


Answer 0

You could use PIL to create (and display) an image:

from PIL import Image
import numpy as np

w, h = 512, 512
data = np.zeros((h, w, 3), dtype=np.uint8)
data[0:256, 0:256] = [255, 0, 0] # red patch in upper left
img = Image.fromarray(data, 'RGB')
img.save('my.png')
img.show()
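
The snippet above assumes Pillow is installed. As a dependency-free alternative sketch (not part of the original answer), the same array can be written as a binary PPM file, which many image viewers can open. The helper name save_ppm and the output location are hypothetical:

```python
import os
import tempfile
import numpy as np

def save_ppm(path, data):
    """Write an (H, W, 3) uint8 array as a binary PPM (P6) image."""
    data = np.ascontiguousarray(data, dtype=np.uint8)
    height, width, channels = data.shape
    if channels != 3:
        raise ValueError("expected an RGB array of shape (H, W, 3)")
    with open(path, "wb") as fh:
        # PPM header: magic number, width and height, maximum channel value
        fh.write(f"P6\n{width} {height}\n255\n".encode("ascii"))
        # Pixel data in row-major order, 3 bytes (R, G, B) per pixel
        fh.write(data.tobytes())

data = np.zeros((512, 512, 3), dtype=np.uint8)
data[256, 256] = [255, 0, 0]  # single red pixel in the center

out_path = os.path.join(tempfile.gettempdir(), "red_dot.ppm")
save_ppm(out_path, data)
```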

Answer 1

The following should work:

from matplotlib import pyplot as plt
plt.imshow(data, interpolation='nearest')
plt.show()

If you are using Jupyter notebook/lab, use this inline command before importing matplotlib:

%matplotlib inline 

Answer 2

Shortest path is to use scipy, like this:

from scipy.misc import toimage
toimage(data).show()

This requires PIL or Pillow to be installed as well.

A similar approach also requiring PIL or Pillow but which may invoke a different viewer is:

from scipy.misc import imshow
imshow(data)

Answer 3

Using pygame, you can open a window, get the surface as an array of pixels, and manipulate as you want from there. You’ll need to copy your numpy array into the surface array, however, which will be much slower than doing actual graphics operations on the pygame surfaces themselves.


Answer 4

How to show images stored in numpy array with example (works in Jupyter notebook)

I know there are simpler answers but this one will give you understanding of how images are actually drawn from a numpy array.

Load example

from sklearn.datasets import load_digits
digits = load_digits()
digits.images.shape   #this will give you (1797, 8, 8). 1797 images, each 8 x 8 in size

Display array of one image

digits.images[0]
array([[ 0.,  0.,  5., 13.,  9.,  1.,  0.,  0.],
       [ 0.,  0., 13., 15., 10., 15.,  5.,  0.],
       [ 0.,  3., 15.,  2.,  0., 11.,  8.,  0.],
       [ 0.,  4., 12.,  0.,  0.,  8.,  8.,  0.],
       [ 0.,  5.,  8.,  0.,  0.,  9.,  8.,  0.],
       [ 0.,  4., 11.,  0.,  1., 12.,  7.,  0.],
       [ 0.,  2., 14.,  5., 10., 12.,  0.,  0.],
       [ 0.,  0.,  6., 13., 10.,  0.,  0.,  0.]])

Create empty 10 x 10 subplots for visualizing 100 images

import matplotlib.pyplot as plt
fig, axes = plt.subplots(10,10, figsize=(8,8))

Plotting 100 images

for i,ax in enumerate(axes.flat):
    ax.imshow(digits.images[i])

Result:

What does axes.flat do? It returns a flat iterator over the array of axes, so you can iterate over every axis object in order to draw on it. Example:

import numpy as np
x = np.arange(6).reshape(2,3)
x.flat
for item in (x.flat):
    print (item, end=' ')

Answer 5

Using pillow’s fromarray, for example:

from PIL import Image
from numpy import *

im = array(Image.open('image.jpg'))
Image.fromarray(im).show()

Answer 6

The Python Imaging Library can display images using Numpy arrays. Take a look at this page for sample code:

EDIT: As the note on the bottom of that page says, you should check the latest release notes which make this much simpler:

http://effbot.org/zone/pil-changes-116.htm


Answer 7

A supplement for doing this with matplotlib; I find it handy for computer vision tasks. Let’s say you have data with dtype = int32:

from matplotlib import pyplot as plot
import numpy as np

fig = plot.figure()
ax = fig.add_subplot(1, 1, 1)
# make sure your data is in H W C, otherwise you can change it by
# data = data.transpose((_, _, _))
data = np.zeros((512,512,3), dtype=np.int32)
data[256,256] = [255,0,0]
ax.imshow(data.astype(np.uint8))
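
One caveat with a bare astype(np.uint8): values outside 0..255 wrap around instead of saturating. A minimal sketch of clipping first, assuming the int32 data is meant to be on a 0..255 scale (the sample values are illustrative):

```python
import numpy as np

raw = np.array([-5, 128, 300], dtype=np.int32)

# Direct cast wraps modulo 256: -5 becomes 251 and 300 becomes 44
wrapped = raw.astype(np.uint8)

# Clipping first saturates out-of-range values at 0 and 255
clipped = np.clip(raw, 0, 255).astype(np.uint8)

print(wrapped, clipped)
```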

Convert array of indices to 1-hot encoded numpy array

Question: Convert array of indices to 1-hot encoded numpy array

Let’s say I have a 1d numpy array

a = array([1,0,3])

I would like to encode this as a 2D one-hot array

b = array([[0,1,0,0], [1,0,0,0], [0,0,0,1]])

Is there a quick way to do this? Quicker than just looping over a to set elements of b, that is.


Answer 0

Your array a defines the columns of the nonzero elements in the output array. You need to also define the rows and then use fancy indexing:

>>> a = np.array([1, 0, 3])
>>> b = np.zeros((a.size, a.max()+1))
>>> b[np.arange(a.size),a] = 1
>>> b
array([[ 0.,  1.,  0.,  0.],
       [ 1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.]])

Answer 1

>>> values = [1, 0, 3]
>>> n_values = np.max(values) + 1
>>> np.eye(n_values)[values]
array([[ 0.,  1.,  0.,  0.],
       [ 1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.]])

Answer 2

In case you are using keras, there is a built in utility for that:

from keras.utils.np_utils import to_categorical   

categorical_labels = to_categorical(int_labels, num_classes=3)

And it does pretty much the same as @YXD’s answer (see source-code).


Answer 3

Here is what I find useful:

def one_hot(a, num_classes):
  return np.squeeze(np.eye(num_classes)[a.reshape(-1)])

Here num_classes stands for number of classes you have. So if you have a vector with shape of (10000,) this function transforms it to (10000,C). Note that a is zero-indexed, i.e. one_hot(np.array([0, 1]), 2) will give [[1, 0], [0, 1]].

Exactly what you wanted to have I believe.

PS: the source is Sequence models – deeplearning.ai


Answer 4

You can use sklearn.preprocessing.LabelBinarizer:

Example:

import sklearn.preprocessing
a = [1,0,3]
label_binarizer = sklearn.preprocessing.LabelBinarizer()
label_binarizer.fit(range(max(a)+1))
b = label_binarizer.transform(a)
print('{0}'.format(b))

output:

[[0 1 0 0]
 [1 0 0 0]
 [0 0 0 1]]

Amongst other things, you may initialize sklearn.preprocessing.LabelBinarizer() so that the output of transform is sparse.


Answer 5

You can also use the eye function of numpy:

numpy.eye(num_classes)[labels]  # num_classes: number of classes; labels: vector containing the labels
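
A runnable sketch of this idea: row j of the identity matrix is the one-hot vector for class j, so indexing the identity matrix with the label vector selects one such row per label. The function name one_hot, its num_classes default, and the range check are assumptions added for illustration:

```python
import numpy as np

def one_hot(labels, num_classes=None):
    """One-hot encode integer labels by selecting rows of an identity matrix."""
    labels = np.asarray(labels)
    if num_classes is None:
        num_classes = int(labels.max()) + 1
    if labels.min() < 0 or labels.max() >= num_classes:
        raise ValueError("labels must lie in [0, num_classes)")
    return np.eye(num_classes, dtype=int)[labels]

encoded = one_hot([1, 0, 3])
print(encoded)
```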


Answer 6

Here is a function that converts a 1-D vector to a 2-D one-hot array.

#!/usr/bin/env python
import numpy as np

def convertToOneHot(vector, num_classes=None):
    """
    Converts an input 1-D vector of integers into an output
    2-D array of one-hot vectors, where an i'th input value
    of j will set a '1' in the i'th row, j'th column of the
    output array.

    Example:
        v = np.array((1, 0, 4))
        one_hot_v = convertToOneHot(v)
        print(one_hot_v)

        [[0 1 0 0 0]
         [1 0 0 0 0]
         [0 0 0 0 1]]
    """

    assert isinstance(vector, np.ndarray)
    assert len(vector) > 0

    if num_classes is None:
        num_classes = np.max(vector)+1
    else:
        assert num_classes > 0
        assert num_classes > np.max(vector)  # labels run from 0 to num_classes - 1

    result = np.zeros(shape=(len(vector), num_classes))
    result[np.arange(len(vector)), vector] = 1
    return result.astype(int)

Below is some example usage:

>>> a = np.array([1, 0, 3])

>>> convertToOneHot(a)
array([[0, 1, 0, 0],
       [1, 0, 0, 0],
       [0, 0, 0, 1]])

>>> convertToOneHot(a, num_classes=10)
array([[0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]])

Answer 7

For one-hot encoding, you can use pandas.get_dummies:

   one_hot_encode = pandas.get_dummies(array)


Answer 8

I think the short answer is no. For a more generic case in n dimensions, I came up with this:

import numpy as np

# For 2-dimensional data, 4 values
a = np.array([[0, 1, 2], [3, 2, 1]])
z = np.zeros(list(a.shape) + [4])
# Index with a tuple of arrays; recent NumPy versions no longer accept a list here
z[tuple(np.indices(z.shape[:-1])) + (a,)] = 1

I am wondering if there is a better solution; I don’t like that I have to build those index sequences in the last two lines. Anyway, I did some measurements with timeit and it seems that the numpy-based (indices/arange) and the iterative versions perform about the same.
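
One possible answer to the "better solution" question, sketched here under the assumption that the labels are nonnegative integers below num_values (a name introduced for this example): indexing an identity matrix gives the same result for any number of dimensions, without building the index arrays by hand.

```python
import numpy as np

a = np.array([[0, 1, 2], [3, 2, 1]])   # 2-D integer labels, 4 possible values
num_values = 4

# Variant 1: explicit index arrays, passed as a tuple (not a list).
z = np.zeros(a.shape + (num_values,))
z[tuple(np.indices(a.shape)) + (a,)] = 1

# Variant 2: index the identity matrix; works for any number of dimensions.
z_eye = np.eye(num_values)[a]

print(z.shape)
print(np.array_equal(z, z_eye))
```

Both variants produce an array of shape a.shape + (num_values,), and np.argmax along the last axis recovers a.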


Answer 9

Just to elaborate on the excellent answer from K3---rnc, here is a more generic version:

import numpy as np

def onehottify(x, n=None, dtype=float):
    """One-hot encode x into n classes (n defaults to max(x) + 1)."""
    x = np.asarray(x)
    n = np.max(x) + 1 if n is None else n
    return np.eye(n, dtype=dtype)[x]

Also, here is a quick-and-dirty benchmark of this method and a method from the currently accepted answer by YXD (slightly changed, so that they offer the same API except that the latter works only with 1D ndarrays):

def onehottify_only_1d(x, n=None, dtype=float):
    x = np.asarray(x)
    n = np.max(x) + 1 if n is None else n
    b = np.zeros((len(x), n), dtype=dtype)
    b[np.arange(len(x)), x] = 1
    return b

The latter method is ~35% faster (MacBook Pro 13 2015), but the former is more general:

>>> import numpy as np
>>> np.random.seed(42)
>>> a = np.random.randint(0, 9, size=(10_000,))
>>> a
array([6, 3, 7, ..., 5, 8, 6])
>>> %timeit onehottify(a, 10)
188 µs ± 5.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit onehottify_only_1d(a, 10)
139 µs ± 2.78 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Answer 10

You can use the following code to convert labels into one-hot vectors:

Let x be an ordinary class vector (a single column of integer labels) with classes from 0 up to some number:

import numpy as np
np.eye(x.max()+1)[x]

If 0 is not a class (the labels start at 1), drop the +1 and shift the labels as well: np.eye(x.max())[x-1].
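
A small sketch of the case where 0 is not a class (the sample data is made up for illustration). Note that dropping the +1 alone is not enough: the labels also have to be shifted down by one, otherwise the largest label indexes past the last row.

```python
import numpy as np

x = np.array([1, 3, 2, 3])           # classes 1..3, no class 0 (illustrative data)

with_zero = np.eye(x.max() + 1)[x]   # keeps an unused column for class 0
shifted = np.eye(x.max())[x - 1]     # one column per class 1..3

print(with_zero.shape)
print(shifted.shape)
```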


Answer 11

I recently ran into a problem of the same kind and found that the solutions above are only satisfying if your labels form a contiguous range. For example, if you want to one-hot encode the following list:

all_good_list = [0,1,2,3,4]

go ahead, the solutions posted above work fine. But consider this data:

problematic_list = [0,23,12,89,10]

With the methods mentioned above, you will end up with 90 one-hot columns, because all of those answers include something like n = np.max(a)+1. I found a more generic solution that worked for me and wanted to share it:

import numpy as np
import sklearn.preprocessing
sklb = sklearn.preprocessing.LabelBinarizer()
a = np.asarray([1,2,44,3,2])
n = np.unique(a)
sklb.fit(n)
b = sklb.transform(a)

I hope this comes in handy for anyone who runs into the same restriction with the solutions above.
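
The same idea can be sketched with NumPy alone. This is not part of the original answer, just an assumption-labelled alternative: np.unique with return_inverse=True maps each label to its position among the sorted distinct labels, and those positions can then index an identity matrix.

```python
import numpy as np

a = np.array([0, 23, 12, 89, 10])   # sparse label values from the example above

# classes: sorted distinct labels; inverse: position of each label in classes
classes, inverse = np.unique(a, return_inverse=True)
one_hot = np.eye(len(classes), dtype=int)[inverse]

print(classes)
print(one_hot.shape)
```

This gives one column per distinct label (5 here) instead of 90, and classes[one_hot.argmax(axis=1)] recovers the original labels.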


Answer 12

This kind of encoding can be built directly with NumPy. If you are using a numpy array like this:

a = np.array([1,0,3])

then there is a very simple way to convert it to a one-hot encoding:

out = (np.arange(4) == a[:,None]).astype(np.float32)

That’s it.
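
To see why the one-liner works, here is the same computation split into steps (num_classes, classes, column and mask are names introduced for this sketch). The comparison relies on broadcasting: a row of class indices is compared against a column of labels.

```python
import numpy as np

a = np.array([1, 0, 3])
num_classes = 4                    # assumed number of classes

classes = np.arange(num_classes)   # shape (4,): the class indices 0..3
column = a[:, None]                # shape (3, 1): one label per row

# Broadcasting compares every label with every class -> boolean (3, 4) mask.
mask = classes == column
out = mask.astype(np.float32)
print(out)
```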


Answer 13

  • p will be a 2-D ndarray.
  • We want to find the highest value in each row, put a 1 there and 0 everywhere else.

A clean and easy solution:

max_elements_i = np.expand_dims(np.argmax(p, axis=1), axis=1)
one_hot = np.zeros(p.shape)
np.put_along_axis(one_hot, max_elements_i, 1, axis=1)
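
A runnable sketch of the three lines above on a small made-up score matrix (the values in p are hypothetical). If a row contains ties, np.argmax picks the first maximum.

```python
import numpy as np

# Hypothetical scores: 3 samples, 4 classes.
p = np.array([[0.1, 0.6, 0.2, 0.1],
              [0.7, 0.1, 0.1, 0.1],
              [0.2, 0.2, 0.1, 0.5]])

max_elements_i = np.expand_dims(np.argmax(p, axis=1), axis=1)  # shape (3, 1)
one_hot = np.zeros(p.shape)
np.put_along_axis(one_hot, max_elements_i, 1, axis=1)
print(one_hot)
```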

Answer 14

Using a Neuraxle pipeline step:

  1. Set up your example
import numpy as np
a = np.array([1,0,3])
b = np.array([[0,1,0,0], [1,0,0,0], [0,0,0,1]])
  2. Do the actual conversion
from neuraxle.steps.numpy import OneHotEncoder
encoder = OneHotEncoder(nb_columns=4)
b_pred = encoder.transform(a)
  3. Assert it works
assert (b_pred == b).all()

Link to documentation: neuraxle.steps.numpy.OneHotEncoder


Answer 15

Here is an example function that I wrote to do this based upon the answers above and my own use case:

import numpy as np

def label_vector_to_one_hot_vector(vector, one_hot_size=10):
    """
    Use to convert a column vector to a 'one-hot' matrix

    Example:
        vector: [[2], [0], [1]]
        one_hot_size: 3
        returns:
            [[ 0.,  0.,  1.],
             [ 1.,  0.,  0.],
             [ 0.,  1.,  0.]]

    Parameters:
        vector (np.array): of size (n, 1) to be converted
        one_hot_size (int) optional: size of 'one-hot' row vector

    Returns:
        np.array size (vector.size, one_hot_size): converted to a 'one-hot' matrix
    """
    squeezed_vector = np.squeeze(vector, axis=-1)

    one_hot = np.zeros((squeezed_vector.size, one_hot_size))

    one_hot[np.arange(squeezed_vector.size), squeezed_vector] = 1

    return one_hot

label_vector_to_one_hot_vector(vector=[[2], [0], [1]], one_hot_size=3)

Answer 16

For completeness, here is a simple function that uses only numpy operations:

import numpy as np

def probs_to_onehot(output_probabilities):
    argmax_indices_array = np.argmax(output_probabilities, axis=1)
    # One column per class; counting the distinct argmax values instead would
    # break whenever some class is never the most probable one
    n_classes = output_probabilities.shape[1]
    onehot_output_array = np.eye(n_classes)[argmax_indices_array]
    return onehot_output_array

It takes a probability matrix as input, e.g.:

[[0.03038822 0.65810204 0.16549407 0.3797123 ] … [0.02771272 0.2760752 0.3280924 0.33458805]]

And it will return

[[0 1 0 0] … [0 0 0 1]]


Answer 17

Here’s a dimensionality-independent standalone solution.

This will convert any N-dimensional array arr of nonnegative integers to a one-hot N+1-dimensional array one_hot, where one_hot[i_1,...,i_N,c] = 1 means arr[i_1,...,i_N] = c. You can recover the input via np.argmax(one_hot, -1).

import numpy as np

def expand_integer_grid(arr, n_classes):
    """

    :param arr: N dim array of size i_1, ..., i_N
    :param n_classes: C
    :returns: one-hot N+1 dim array of size i_1, ..., i_N, C
    :rtype: ndarray

    """
    one_hot = np.zeros(arr.shape + (n_classes,))
    axes_ranges = [range(arr.shape[i]) for i in range(arr.ndim)]
    flat_grids = [_.ravel() for _ in np.meshgrid(*axes_ranges, indexing='ij')]
    one_hot[tuple(flat_grids) + (arr.ravel(),)] = 1  # tuple index; recent NumPy no longer accepts a list
    assert((one_hot.sum(-1) == 1).all())
    assert(np.allclose(np.argmax(one_hot, -1), arr))
    return one_hot

Answer 18

Use the following code. It works best.

import numpy as np

def one_hot_encode(x):
    """
    argument
        - x: a list of labels
    return
        - one hot encoding matrix (number of labels, number of class)
    """
    # The number of classes (10) is hard-coded here
    encoded = np.zeros((len(x), 10))

    for idx, val in enumerate(x):
        encoded[idx][val] = 1

    return encoded

Found it here. P.S. You don’t need to follow the link.


How can I check which version of NumPy I’m using?

Question: How can I check which version of NumPy I’m using?

How can I check which version of NumPy I’m using?

(FYI this question has been edited because both the question and answer are not platform specific.)


Answer 0

import numpy
numpy.version.version

Answer 1

>>> import numpy
>>> print(numpy.__version__)

Answer 2

From the command line, you can simply issue:

python -c "import numpy; print(numpy.version.version)"

Or:

python -c "import numpy; print(numpy.__version__)"

Answer 3

Run:

pip list

This should print a list of installed packages. Scroll through to numpy.

...
nbpresent (3.0.2)
networkx (1.11)
nltk (3.2.2)
nose (1.3.7)
notebook (5.0.0)
numba (0.32.0+0.g139e4c6.dirty)
numexpr (2.6.2)
numpy (1.11.3) <--
numpydoc (0.6.0)
odo (0.5.0)
openpyxl (2.4.1)
pandas (0.20.1)
pandocfilters (1.4.1)
....

Answer 4

You can also check if your version is using MKL with:

import numpy
numpy.show_config()

Answer 5

We can use pip freeze to get any Python package version without opening the Python shell.

pip freeze | grep 'numpy'

Answer 6

If you’re using NumPy from the Anaconda distribution, then you can just do:

$ conda list | grep numpy
numpy     1.11.3     py35_0

This gives the Python version as well.


If you want something fancy, then use numexpr

It gives a lot of information, as you can see below:

In [692]: import numexpr

In [693]: numexpr.print_versions()
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Numexpr version:   2.6.2
NumPy version:     1.13.3
Python version:    3.6.3 |Anaconda custom (64-bit)|
                   (default, Oct 13 2017, 12:02:49)
[GCC 7.2.0]
Platform:          linux-x86_64
AMD/Intel CPU?     True
VML available?     False
Number of threads used by default: 8 (out of 48 detected cores)
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Answer 7

You can try this:

pip show numpy


Answer 8

You can get the numpy version using a terminal or Python code.

In a Terminal (bash) using Ubuntu:

pip list | grep numpy

In python 3.6.7, this code shows the numpy version:

import numpy
print (numpy.version.version)

If you put this code in a file named shownumpy.py, you can run it:

python shownumpy.py

or

python3 shownumpy.py

I’ve got this output:

1.16.1

Answer 9

import numpy
print(numpy.__version__)

Answer 10

For Python 3.X print syntax:

python -c "import numpy; print (numpy.version.version)"

Or

python -c "import numpy; print(numpy.__version__)"

Answer 11

Just a slight variation on the solutions above for checking the numpy version from Python:

import numpy as np 
print("Numpy Version:",np.__version__)

Or,

import numpy as np
print("Numpy Version:",np.version.version)

My projects in PyCharm are currently running version

1.17.4

Answer 12

In a Python shell:

>>> help()
help> numpy

Answer 13

Pure Python line that can be executed from the terminal (both 2.X and 3.X versions):

python -c "import numpy; print(numpy.version.version)"

If you are already inside Python, then:

import numpy
print(numpy.version.version)
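
If the version has to be checked from inside a script, a hedged sketch for Python 3.8 or later is shown below. It uses importlib.metadata from the standard library in addition to the attribute shown above; the 1.17 threshold is only an example value, not something from the original answers.

```python
from importlib.metadata import version   # standard library, Python 3.8+

import numpy as np

installed = version("numpy")   # version string recorded by the installer
print(installed)
print(np.__version__)          # version reported by the imported module

# Crude numeric comparison; the 1.17 threshold is only an example.
major, minor = (int(part) for part in np.__version__.split(".")[:2])
print((major, minor) >= (1, 17))
```

Comparing (major, minor) tuples avoids the pitfall of comparing version strings directly, where "1.9" would sort after "1.17".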

Answer 14

It is good to know which version of numpy you run, but strictly speaking, if you just need a specific version on your system, you can write:

pip install numpy==1.14.3

This will install the version you need and uninstall other versions of numpy.


What does numpy.random.seed(0) do?

Question: What does numpy.random.seed(0) do?

What does np.random.seed do in the below code from a Scikit-Learn tutorial? I’m not very familiar with NumPy’s random state generator stuff, so I’d really appreciate a layman’s terms explanation of this.

np.random.seed(0)
indices = np.random.permutation(len(iris_X))

Answer 0

np.random.seed(0) makes the random numbers predictable

>>> numpy.random.seed(0) ; numpy.random.rand(4)
array([ 0.55,  0.72,  0.6 ,  0.54])
>>> numpy.random.seed(0) ; numpy.random.rand(4)
array([ 0.55,  0.72,  0.6 ,  0.54])

With the seed reset (every time), the same set of numbers will appear every time.

If the random seed is not reset, different numbers appear with every invocation:

>>> numpy.random.rand(4)
array([ 0.42,  0.65,  0.44,  0.89])
>>> numpy.random.rand(4)
array([ 0.96,  0.38,  0.79,  0.53])

Conceptually, a simple (pseudo-)random number generator works by starting with a number (the seed), multiplying it by a large number, adding an offset, then taking the modulo of that sum. The resulting number is then used as the seed to generate the next “random” number. When you set the seed (every time), it does the same thing every time, giving you the same numbers. (NumPy’s legacy generator is based on the more elaborate Mersenne Twister algorithm, but the principle is the same.)
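
The multiply, add and modulo recipe described above is a linear congruential generator. Below is a toy sketch of one; it is not the algorithm NumPy uses, and the function name lcg and the constants a, c and m are assumptions chosen for illustration.

```python
def lcg(seed, count, a=1664525, c=1013904223, m=2**32):
    """Yield `count` pseudo-random floats in [0, 1) from a toy LCG."""
    state = seed
    for _ in range(count):
        state = (a * state + c) % m   # multiply, add an offset, take the modulo
        yield state / m               # scale the state into [0, 1)

first = list(lcg(seed=0, count=4))
second = list(lcg(seed=0, count=4))   # same seed -> same sequence
other = list(lcg(seed=1, count=4))    # different seed -> different sequence

print(first == second)
print(first == other)
```

Each output becomes the state for the next step, which is why restarting from the same seed replays exactly the same numbers.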

If you want seemingly random numbers, do not set the seed. If you have code that uses random numbers that you want to debug, however, it can be very helpful to set the seed before each run so that the code does the same thing every time you run it.

To get the most random numbers for each run, call numpy.random.seed(). This will cause numpy to set the seed to a random number obtained from /dev/urandom or its Windows analog or, if neither of those is available, it will use the clock.

For more information on using seeds to generate pseudo-random numbers, see wikipedia.


Answer 1

If you call np.random.seed(a_fixed_number) every time before you call one of numpy’s other random functions, the result will be the same:

>>> import numpy as np
>>> np.random.seed(0)
>>> perm = np.random.permutation(10)
>>> print(perm)
[2 8 4 9 1 6 7 3 0 5]
>>> np.random.seed(0)
>>> print(np.random.permutation(10))
[2 8 4 9 1 6 7 3 0 5]
>>> np.random.seed(0)
>>> print(np.random.permutation(10))
[2 8 4 9 1 6 7 3 0 5]
>>> np.random.seed(0)
>>> print(np.random.permutation(10))
[2 8 4 9 1 6 7 3 0 5]
>>> np.random.seed(0)
>>> print(np.random.rand(4))
[0.5488135  0.71518937 0.60276338 0.54488318]
>>> np.random.seed(0)
>>> print(np.random.rand(4))
[0.5488135  0.71518937 0.60276338 0.54488318]

However, if you just call it once and use various random functions, the results will still be different:

>>> import numpy as np
>>> np.random.seed(0)
>>> perm = np.random.permutation(10)
>>> print(perm)
[2 8 4 9 1 6 7 3 0 5]
>>> np.random.seed(0)
>>> print(np.random.permutation(10))
[2 8 4 9 1 6 7 3 0 5]
>>> print(np.random.permutation(10))
[3 5 1 2 9 8 0 6 7 4]
>>> print(np.random.permutation(10))
[2 3 8 4 5 1 0 6 9 7]
>>> print(np.random.rand(4))
[0.64817187 0.36824154 0.95715516 0.14035078]
>>> print(np.random.rand(4))
[0.87008726 0.47360805 0.80091075 0.52047748]

Answer 2

As noted, numpy.random.seed(0) sets the random seed to 0, so the pseudo-random numbers you get from numpy.random will start from the same point. This can be good for debugging in some cases. However, after some reading, this seems to be the wrong way to go about it if you have threads, because it is not thread-safe.

from differences-between-numpy-random-and-random-random-in-python:

For numpy.random.seed(), the main difficulty is that it is not thread-safe – that is, it’s not safe to use if you have many different threads of execution, because it’s not guaranteed to work if two different threads are executing the function at the same time. If you’re not using threads, and if you can reasonably expect that you won’t need to rewrite your program this way in the future, numpy.random.seed() should be fine for testing purposes. If there’s any reason to suspect that you may need threads in the future, it’s much safer in the long run to do as suggested, and to make a local instance of the numpy.random.Random class. As far as I can tell, random.random.seed() is thread-safe (or at least, I haven’t found any evidence to the contrary).

Example of how to go about this:

from numpy.random import RandomState
prng = RandomState()
print(prng.permutation(10))
prng = RandomState()
print(prng.permutation(10))
prng = RandomState(42)
print(prng.permutation(10))
prng = RandomState(42)
print(prng.permutation(10))

This may give:

[3 0 4 6 8 2 1 9 7 5]

[1 6 9 0 2 7 8 3 5 4]

[8 1 5 0 7 2 9 4 3 6]

[8 1 5 0 7 2 9 4 3 6]
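
The point of the example above can be checked in a few lines. This is a sketch, not part of the original answer: two local generators created with the same seed yield the same stream and leave the global state set by np.random.seed() alone. The second half assumes NumPy 1.17 or later, where the newer Generator API (np.random.default_rng) is available.

```python
import numpy as np

# Two independent generators with the same seed produce the same stream,
# and neither touches the global state used by np.random.seed().
rng_a = np.random.RandomState(42)
rng_b = np.random.RandomState(42)
same_legacy = np.array_equal(rng_a.permutation(10), rng_b.permutation(10))
print(same_legacy)

# Newer NumPy releases (1.17+) also offer the Generator API.
gen_a = np.random.default_rng(42)
gen_b = np.random.default_rng(42)
same_new = np.array_equal(gen_a.permutation(10), gen_b.permutation(10))
print(same_new)
```

Passing such a generator object around explicitly, instead of seeding the global one, is what makes the approach friendlier to threaded code.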

Lastly, note that there might be cases where initializing to 0 (as opposed to a seed whose bits are not all 0) may result in non-uniform distributions for the first few iterations because of the way xor works, but this depends on the algorithm, and is beyond my current worries and the scope of this question.


Answer 3

I have used this very often in neural networks. It is well known that when we start training a neural network, we randomly initialise the weights. The model is then trained from these weights on a particular dataset. After a number of epochs you get a trained set of weights.

Now suppose you want to train again from scratch, or you want to pass the model to others to reproduce your results. The weights will again be initialised to random numbers, which will mostly differ from the earlier ones. The trained weights obtained after the same number of epochs (keeping the same data and other parameters) will differ from the earlier ones. The problem is that your model is no longer reproducible: every time you train it from scratch, it gives you a different set of weights. This is because the model is initialised with different random numbers every time.

What if, every time you start training from scratch, the model is initialised to the same set of random weights? In this case your model could become reproducible. This is achieved by numpy.random.seed(0). By setting seed() to a particular number, you always hang on to the same set of random numbers.
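
A minimal sketch of that idea, with a hypothetical helper init_weights standing in for whatever a real framework does when it initialises a layer (the shape and the 0.01 scale are made up for illustration):

```python
import numpy as np

def init_weights(n_inputs, n_outputs, seed=None):
    """Return a randomly initialised weight matrix (hypothetical helper)."""
    if seed is not None:
        np.random.seed(seed)   # fix the global seed, as in the answer
    return np.random.randn(n_inputs, n_outputs) * 0.01

w_first = init_weights(4, 3, seed=0)
w_second = init_weights(4, 3, seed=0)   # a second "training run" from scratch

print(np.array_equal(w_first, w_second))
```

Fixing the seed only makes the starting weights identical; other sources of randomness in a training run (data shuffling, for instance) would need the same treatment.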


Answer 4

Imagine you are showing someone how to code something with a bunch of “random” numbers. By using numpy seed they can use the same seed number and get the same set of “random” numbers.

So it’s not exactly random because an algorithm spits out the numbers but it looks like a randomly generated bunch.


Answer 5

A random seed specifies the start point when a computer generates a random number sequence.

For example, let’s say you wanted to generate a random number in Excel (Note: Excel sets a limit of 9999 for the seed). If you enter a number into the Random Seed box during the process, you’ll be able to use the same set of random numbers again. If you typed “77” into the box, and typed “77” the next time you run the random number generator, Excel will display that same set of random numbers. If you type “99”, you’ll get an entirely different set of numbers. But if you revert back to a seed of 77, then you’ll get the same set of random numbers you started with.

A random number generator follows a fixed rule. For example: “take a number x, add 900 to it, then subtract 52.” In order for the process to start, you have to specify a starting number, x (the seed). Let’s take the starting number 77:

Add 900 + 77 = 977
Subtract 52 = 925

Following the same algorithm, the second “random” number would be:

900 + 925 = 1825
Subtract 52 = 1773

This simple example follows a pattern, but the algorithms behind computer number generation are much more complicated.


Answer 6

All the random numbers generated after setting a particular seed value are the same across all platforms/systems.


Answer 7

There is a nice explanation in the NumPy docs, which refer to the Mersenne Twister pseudo-random number generator: https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.random.RandomState.html More details on the algorithm: https://en.wikipedia.org/wiki/Mersenne_Twister


Answer 8

numpy.random.seed(0)
numpy.random.randint(10, size=5)

This produces the following output: array([5, 0, 3, 3, 7]). If we run the same code again, we get the same result.

Now if we change the seed value 0 to 1 or others:

numpy.random.seed(1)
numpy.random.randint(10, size=5)

This produces the following output: array([5, 8, 9, 5, 0]), which is no longer the same as the output above.


Answer 9

All the answers above show how np.random.seed() is used in code. I’ll try my best to explain briefly why it behaves this way. Computers are machines designed around predefined algorithms. Any output from a computer is the result of an algorithm applied to some input. So when we ask a computer to generate random numbers, they may look random, but the computer did not just come up with them randomly!

So when we write np.random.seed(any_number_here), the algorithm outputs a particular sequence of numbers that is determined by the argument any_number_here. It’s almost as if a particular set of random numbers can be obtained by passing the right argument. But choosing that argument on purpose would require us to know how the algorithm works, which is quite tedious.

So, for example, if I write np.random.seed(10), the particular set of numbers that I obtain will remain the same even if I execute the same line 10 years from now, unless the algorithm changes.


Creating a Pandas DataFrame from a Numpy array: How do I specify the index column and column headers?

Question: Creating a Pandas DataFrame from a Numpy array: How do I specify the index column and column headers?

I have a Numpy array consisting of a list of lists, representing a two-dimensional array with row labels and column names as shown below:

data = array([['','Col1','Col2'],['Row1',1,2],['Row2',3,4]])

I’d like the resulting DataFrame to have Row1 and Row2 as index values, and Col1, Col2 as header values

I can specify the index as follows:

df = pd.DataFrame(data,index=data[:,0]),

however I am unsure how to best assign column headers.


回答 0

您需要指定dataindexcolumnsDataFrame构造函数,如:

>>> pd.DataFrame(data=data[1:,1:],    # values
...              index=data[1:,0],    # 1st column as index
...              columns=data[0,1:])  # 1st row as the column names

编辑:如@joris注释中一样,您可能需要更改上述内容np.int_(data[1:,1:])才能具有正确的数据类型。

You need to pass data, index and columns to the DataFrame constructor, as in:

>>> pd.DataFrame(data=data[1:,1:],    # values
...              index=data[1:,0],    # 1st column as index
...              columns=data[0,1:])  # 1st row as the column names

Edit: as @joris notes in a comment, you may need to change the above to np.int_(data[1:,1:]) to get the correct data type.
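Putting the constructor call and the dtype remark together, a self-contained sketch could look like the following. It uses astype(int) for the conversion, which is one way to do it and is my assumption rather than the only option.

```python
import numpy as np
import pandas as pd

# Mixed strings and numbers make NumPy store everything as strings.
data = np.array([['', 'Col1', 'Col2'],
                 ['Row1', 1, 2],
                 ['Row2', 3, 4]])

df = pd.DataFrame(data=data[1:, 1:].astype(int),  # values, converted back to integers
                  index=data[1:, 0],              # first column as the index
                  columns=data[0, 1:])            # first row as the column names
print(df)
print(df.dtypes)
```

Without the conversion the values would stay strings, so arithmetic such as df.sum() would concatenate text instead of adding numbers.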


回答 1

这是一个易于理解的解决方案

import numpy as np
import pandas as pd

# Creating a 2 dimensional numpy array
>>> data = np.array([[5.8, 2.8], [6.0, 2.2]])
>>> print(data)
>>> data
array([[5.8, 2.8],
       [6. , 2.2]])

# Creating pandas dataframe from numpy array
>>> dataset = pd.DataFrame({'Column1': data[:, 0], 'Column2': data[:, 1]})
>>> print(dataset)
   Column1  Column2
0      5.8      2.8
1      6.0      2.2

Here is an easy to understand solution

import numpy as np
import pandas as pd

# Creating a 2 dimensional numpy array
>>> data = np.array([[5.8, 2.8], [6.0, 2.2]])
>>> data
array([[5.8, 2.8],
       [6. , 2.2]])

# Creating pandas dataframe from numpy array
>>> dataset = pd.DataFrame({'Column1': data[:, 0], 'Column2': data[:, 1]})
>>> print(dataset)
   Column1  Column2
0      5.8      2.8
1      6.0      2.2

回答 2

我同意Joris;似乎您应该以不同的方式执行此操作,例如使用numpy record arrays。从这个好答案中修改“选项2” ,您可以像这样进行操作:

import pandas
import numpy

dtype = [('Col1','int32'), ('Col2','float32'), ('Col3','float32')]
values = numpy.zeros(20, dtype=dtype)
index = ['Row'+str(i) for i in range(1, len(values)+1)]

df = pandas.DataFrame(values, index=index)

I agree with Joris; it seems like you should be doing this differently, like with numpy record arrays. Modifying “option 2” from this great answer, you could do it like this:

import pandas
import numpy

dtype = [('Col1','int32'), ('Col2','float32'), ('Col3','float32')]
values = numpy.zeros(20, dtype=dtype)
index = ['Row'+str(i) for i in range(1, len(values)+1)]

df = pandas.DataFrame(values, index=index)

回答 3

只需使用pandas DataFrame的from_records即可完成此操作

import numpy as np
import pandas as pd
# Creating a numpy array
x = np.arange(1,10,1).reshape(-1,1)
dataframe = pd.DataFrame.from_records(x)

This can be done simply by using from_records of pandas DataFrame

import numpy as np
import pandas as pd
# Creating a numpy array
x = np.arange(1,10,1).reshape(-1,1)
dataframe = pd.DataFrame.from_records(x)

回答 4

    >>> import pandas as pd
    >>> import numpy as np
    >>> data.shape
    (480, 193)
    >>> type(data)
    numpy.ndarray
    >>> df = pd.DataFrame(data=data[0:,0:],
    ...        index=[i for i in range(data.shape[0])],
    ...        columns=['f'+str(i) for i in range(data.shape[1])])
    >>> df.head()


回答 5

添加到@ behzad.nouri的答案-我们可以创建一个帮助程序来处理这种常见情况:

def csvDf(dat,**kwargs): 
  from numpy import array
  data = array(dat)
  if data is None or len(data)==0 or len(data[0])==0:
    return None
  else:
    return pd.DataFrame(data[1:,1:],index=data[1:,0],columns=data[0,1:],**kwargs)

让我们尝试一下:

data = [['','a','b','c'],['row1','row1cola','row1colb','row1colc'],
     ['row2','row2cola','row2colb','row2colc'],['row3','row3cola','row3colb','row3colc']]
csvDf(data)

In [61]: csvDf(data)
Out[61]:
             a         b         c
row1  row1cola  row1colb  row1colc
row2  row2cola  row2colb  row2colc
row3  row3cola  row3colb  row3colc

Adding to @behzad.nouri's answer, we can create a helper routine to handle this common scenario:

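def csvDf(dat, **kwargs):
  import numpy as np
  import pandas as pd
  # Nothing to convert if the input is missing or has no header row/column.
  if dat is None or len(dat) == 0 or len(dat[0]) == 0:
    return None
  data = np.array(dat)
  # First row holds the column names, first column holds the row labels.
  return pd.DataFrame(data[1:,1:], index=data[1:,0], columns=data[0,1:], **kwargs)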

Let’s try it out:

data = [['','a','b','c'],['row1','row1cola','row1colb','row1colc'],
     ['row2','row2cola','row2colb','row2colc'],['row3','row3cola','row3colb','row3colc']]
csvDf(data)

In [61]: csvDf(data)
Out[61]:
             a         b         c
row1  row1cola  row1colb  row1colc
row2  row2cola  row2colb  row2colc
row3  row3cola  row3colb  row3colc

反转numpy数组的最有效方法

问题:反转numpy数组的最有效方法

信不信由你,在分析当前代码后,执行numpy数组还原的重复操作将占用大量运行时间。我现在拥有的是基于视图的常见方法:

reversed_arr = arr[::-1]

还有其他方法可以更有效地执行此操作,还是我对不切实际的numpy性能的痴迷所致的幻觉?

Believe it or not, after profiling my current code, the repetitive operation of numpy array reversion ate a giant chunk of the running time. What I have right now is the common view-based method:

reversed_arr = arr[::-1]

Is there any other way to do it more efficiently, or is it just an illusion from my obsession with unrealistic numpy performance?


回答 0

创建时,reversed_arr您正在创建原始数组的视图。然后,您可以更改原始数组,并且视图将更新以反映所做的更改。

您是否经常需要重新创建视图?您应该能够执行以下操作:

arr = np.array(some_sequence)
reversed_arr = arr[::-1]

do_something(arr)
look_at(reversed_arr)
do_something_else(arr)
look_at(reversed_arr)

我不是numpy专家,但这似乎是用numpy做事情的最快方法。如果这是您已经在做的,我认为您无法对此进行改进。

PS这里很好的numpy视图讨论:

查看到一个numpy的数组?

When you create reversed_arr you are creating a view into the original array. You can then change the original array, and the view will update to reflect the changes.

Are you re-creating the view more often than you need to? You should be able to do something like this:

arr = np.array(some_sequence)
reversed_arr = arr[::-1]

do_something(arr)
look_at(reversed_arr)
do_something_else(arr)
look_at(reversed_arr)

I’m not a numpy expert, but this seems like it would be the fastest way to do things in numpy. If this is what you are already doing, I don’t think you can improve on it.

P.S. Great discussion of numpy views here:

View onto a numpy array?
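The view behaviour described in this answer can be checked directly. This is a small sketch using np.shares_memory and np.ascontiguousarray; the variable names are only illustrative.

```python
import numpy as np

arr = np.arange(5)
reversed_arr = arr[::-1]           # a view: no data is copied

# The view and the original share the same buffer.
shares = np.shares_memory(arr, reversed_arr)
print(shares)

# Changing the original is visible through the view.
arr[0] = 99
print(reversed_arr[-1])

# Forcing a contiguous layout makes an independent copy.
copy = np.ascontiguousarray(reversed_arr)
print(np.shares_memory(arr, copy))
```

So creating the view once and reusing it costs nothing per access; only the contiguous copy scales with the array size.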


回答 1

如上所述,a[::-1]实际上仅创建一个视图,因此它是一个恒定时间的操作(因此,随着数组的增长,它不需要花费更长的时间)。如果您需要数组是连续的(例如,因为您要对其执行许多矢量运算),ascontiguousarray则其速度大约与flipup/ 一样快fliplr


生成绘图的代码:

import numpy
import perfplot


perfplot.show(
    setup=lambda n: numpy.random.randint(0, 1000, n),
    kernels=[
        lambda a: a[::-1],
        lambda a: numpy.ascontiguousarray(a[::-1]),
        lambda a: numpy.fliplr([a])[0],
    ],
    labels=["a[::-1]", "ascontiguousarray(a[::-1])", "fliplr"],
    n_range=[2 ** k for k in range(25)],
    xlabel="len(a)",
    logx=True,
    logy=True,
)

As mentioned above, a[::-1] really only creates a view, so it’s a constant-time operation (and as such doesn’t take longer as the array grows). If you need the array to be contiguous (for example because you’re performing many vector operations with it), ascontiguousarray is about as fast as flipud/fliplr:


Code to generate the plot:

import numpy
import perfplot


perfplot.show(
    setup=lambda n: numpy.random.randint(0, 1000, n),
    kernels=[
        lambda a: a[::-1],
        lambda a: numpy.ascontiguousarray(a[::-1]),
        lambda a: numpy.fliplr([a])[0],
    ],
    labels=["a[::-1]", "ascontiguousarray(a[::-1])", "fliplr"],
    n_range=[2 ** k for k in range(25)],
    xlabel="len(a)",
    logx=True,
    logy=True,
)

回答 2

因为这似乎还没有被标记为答案……托马斯·阿里德森的答案应该是正确的:只使用

np.flipud(your_array) 

如果是一维数组(列数组)。

用matrizs做

fliplr(matrix)

如果您想反转行和flipud(matrix)想要反转列。无需将一维列数组设置为二维行数组(具有一个“无”层的矩阵),然后对其进行翻转。

Since this does not seem to be marked as answered yet: Thomas Arildsen's answer should be the accepted one. Just use

np.flipud(your_array) 

if it is a 1d array (column array).

With matrices, use

fliplr(matrix)

if you want to reverse each row (that is, flip the column order) and flipud(matrix) if you want to reverse each column (that is, flip the row order). There is no need to turn your 1d column array into a 2-dimensional row array (a matrix with one None layer) and then flip it.
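A short sketch of what the two functions do on small inputs may help here. The arrays v and m are made up for illustration.

```python
import numpy as np

v = np.array([1, 2, 3])
m = np.array([[1, 2, 3],
              [4, 5, 6]])

up_v = np.flipud(v)   # works on 1d input: reverses the only axis
up_m = np.flipud(m)   # reverses the order of the rows
lr_m = np.fliplr(m)   # reverses the order of the columns

# fliplr needs at least two dimensions, so it rejects 1d input.
try:
    np.fliplr(v)
    fliplr_ok = True
except ValueError:
    fliplr_ok = False

print(up_v, up_m, lr_m, fliplr_ok, sep="\n")
```

This is why flipud can be applied to a 1d array directly, while fliplr needs the wrapping trick shown in other answers.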


回答 3

np.fliplr() 左右翻转数组。

请注意,对于一维数组,您需要进行一些技巧:

arr1d = np.array(some_sequence)
reversed_arr = np.fliplr([arr1d])[0]

np.fliplr() flips the array left to right.

Note that for 1d arrays, you need to trick it a bit:

arr1d = np.array(some_sequence)
reversed_arr = np.fliplr([arr1d])[0]

回答 4

我将在前面有关的答案上进行扩展np.fliplr()。下面的代码演示了如何构建一个1d数组,将其转换为2d数组,翻转它,然后再转换回1d数组。time.clock()将用于保留时间,以秒为单位。

import time
import numpy as np

start = time.clock()
x = np.array(range(3))
#transform to 2d
x = np.atleast_2d(x)
#flip array
x = np.fliplr(x)
#take first (and only) element
x = x[0]
#print x
end = time.clock()
print end-start

带有未打印注释的注释:

[2 1 0]
0.00203907123594

随着打印语句注释掉:

5.59799927506e-05

因此,就效率而言,我认为这很不错。对于那些热爱一线做的人,这里就是这种形式。

np.fliplr(np.atleast_2d(np.array(range(3))))[0]

I will expand on the earlier answer about np.fliplr(). Here is some code that demonstrates constructing a 1d array, transforming it into a 2d array, flipping it, then converting it back into a 1d array. time.perf_counter() is used to keep time, in seconds (time.clock() is no longer available on Python 3.8 and later).

import time
import numpy as np

start = time.perf_counter()
x = np.array(range(3))
# transform to 2d
x = np.atleast_2d(x)
# flip array
x = np.fliplr(x)
# take first (and only) element
x = x[0]
# print(x)
end = time.perf_counter()
print(end - start)

With print statement uncommented:

[2 1 0]
0.00203907123594

With print statement commented out:

5.59799927506e-05

So, in terms of efficiency, I think that’s decent. For those of you that love to do it in one line, here is that form.

np.fliplr(np.atleast_2d(np.array(range(3))))[0]

回答 5

扩展别人的说法,我将举一个简短的例子。

如果您有一维数组…

>>> import numpy as np
>>> x = np.arange(4) # array([0, 1, 2, 3])
>>> x[::-1] # returns a view
Out[1]: 
array([3, 2, 1, 0])

但是如果您正在使用2D阵列…

>>> x = np.arange(10).reshape(2, 5)
>>> x
Out[2]:
array([[0, 1, 2, 3, 4],
       [5, 6, 7, 8, 9]])

>>> x[::-1] # returns a view:
Out[3]: array([[5, 6, 7, 8, 9],
               [0, 1, 2, 3, 4]])

这实际上并不会反转矩阵。

应该使用np.flip实际反转元素

>>> np.flip(x)
Out[4]: array([[9, 8, 7, 6, 5],
               [4, 3, 2, 1, 0]])

如果要一张一张地打印矩阵的元素,请同时使用平面和翻转

>>> for el in np.flip(x).flat:
>>>     print(el, end = ' ')
9 8 7 6 5 4 3 2 1 0

Expanding on what others have said I will give a short example.

If you have a 1D array …

>>> import numpy as np
>>> x = np.arange(4) # array([0, 1, 2, 3])
>>> x[::-1] # returns a view
Out[1]: 
array([3, 2, 1, 0])

But if you are working with a 2D array …

>>> x = np.arange(10).reshape(2, 5)
>>> x
Out[2]:
array([[0, 1, 2, 3, 4],
       [5, 6, 7, 8, 9]])

>>> x[::-1] # returns a view:
Out[3]: array([[5, 6, 7, 8, 9],
               [0, 1, 2, 3, 4]])

This reverses only the order of the rows, not the elements within each row.

Use np.flip to reverse the elements along every axis:

>>> np.flip(x)
Out[4]: array([[9, 8, 7, 6, 5],
               [4, 3, 2, 1, 0]])

If you want to print the elements of a matrix one by one, use flat along with flip:

>>> for el in np.flip(x).flat:
...     print(el, end=' ')
9 8 7 6 5 4 3 2 1 0
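As a complement to the example above, np.flip also accepts an axis argument, and each variant corresponds to a slicing expression. A minimal sketch, assuming NumPy 1.15 or later (needed for calling np.flip without an axis):

```python
import numpy as np

x = np.arange(10).reshape(2, 5)

rows_reversed = np.flip(x, axis=0)   # same as x[::-1]
cols_reversed = np.flip(x, axis=1)   # same as x[:, ::-1]
all_reversed = np.flip(x)            # same as x[::-1, ::-1]

print(rows_reversed)
print(cols_reversed)
print(all_reversed)
```

All three results are views on x, so they are as cheap as the plain slicing form.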

回答 6

为了使它可以使用负数和长列表,您可以执行以下操作:

b = numpy.flipud(numpy.array(a.split(),float))

凡翻页是一维arra

To make it work with negative numbers and a long list, you can do the following (here a is a whitespace-separated string of numbers):

b = numpy.flipud(numpy.array(a.split(),float))

Here flipud is applied to a 1d array.


熊猫轴是什么意思?

问题:熊猫轴是什么意思?

这是我的生成数据框的代码:

import pandas as pd
import numpy as np

dff = pd.DataFrame(np.random.randn(1,2),columns=list('AB'))

然后我得到了数据框:

+------------+---------+--------+
|            |  A      |  B     |
+------------+---------+---------
|      0     | 0.626386| 1.52325|
+------------+---------+--------+

当我输入命令时:

dff.mean(axis=1)

我有 :

0    1.074821
dtype: float64

根据熊猫的参考,axis = 1代表列,我希望命令的结果是

A    0.626386
B    1.523255
dtype: float64

所以这是我的问题:大熊猫轴是什么意思?

Here is my code to generate a dataframe:

import pandas as pd
import numpy as np

dff = pd.DataFrame(np.random.randn(1,2),columns=list('AB'))

then I got the dataframe:

+------------+---------+--------+
|            |  A      |  B     |
+------------+---------+---------
|      0     | 0.626386| 1.52325|
+------------+---------+--------+

When I type the command:

dff.mean(axis=1)

I got:

0    1.074821
dtype: float64

According to the pandas reference, axis=1 stands for columns, so I expected the result of the command to be

A    0.626386
B    1.523255
dtype: float64

So here is my question: what does axis in pandas mean?


回答 0

它指定轴沿其的装置被计算的。默认情况下axis=0。这与显式指定numpy.mean时的用法一致(默认情况下为,轴== None,该值将计算扁平化数组的平均值),沿(即以熊猫为索引)和沿。为了更加清楚起见,可以选择指定(代替)或(代替)。axisnumpy.meanaxis=0axis=1axis='index'axis=0axis='columns'axis=1

+------------+---------+--------+
|            |  A      |  B     |
+------------+---------+---------
|      0     | 0.626386| 1.52325|----axis=1----->
+------------+---------+--------+
             |         |
             | axis=0  |
             ↓         ↓

It specifies the axis along which the means are computed. By default, axis=0. This is consistent with numpy.mean when axis is specified explicitly (in numpy.mean, axis=None by default, which computes the mean over the flattened array): axis=0 runs along the rows (namely, the index in pandas), and axis=1 runs along the columns. For added clarity, one may specify axis='index' (instead of axis=0) or axis='columns' (instead of axis=1).

+------------+---------+--------+
|            |  A      |  B     |
+------------+---------+---------
|      0     | 0.626386| 1.52325|----axis=1----->
+------------+---------+--------+
             |         |
             | axis=0  |
             ↓         ↓
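To see both directions at once, here is a small sketch with a three-row frame; the values are made up so the means are easy to check by hand.

```python
import pandas as pd

df = pd.DataFrame([[1.0, 2.0],
                   [3.0, 4.0],
                   [5.0, 6.0]], columns=list('AB'))

col_means = df.mean(axis=0)   # runs down the rows: one mean per column
row_means = df.mean(axis=1)   # runs across the columns: one mean per row

print(col_means)   # A -> 3.0, B -> 4.0
print(row_means)   # 0 -> 1.5, 1 -> 3.5, 2 -> 5.5

# The string aliases give the same results.
same_cols = df.mean(axis='index').equals(col_means)
same_rows = df.mean(axis='columns').equals(row_means)
```

With a single-row frame, as in the question, axis=1 therefore returns one value (the mean of A and B for row 0).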

回答 1

这些答案确实有助于解释这一点,但是对于非程序员(例如像我这样在数据科学类中首次学习Python的人)来说,它仍然不是很直观。我仍然发现在行和列中使用术语“沿”或“对于每个”令人困惑。

对我来说更有意义的是这样说:

  • 轴0将作用于每个COLUMN中的所有ROWS
  • 轴1将作用于每个行中的所有列

因此,轴0的均值将是每一列中所有行的均值,轴1的均值将是每一行中所有列的均值。

最终,这是与@zhangxaochen和@Michael所说的相同的事情,但是以一种更易于我内部化的方式。

These answers do help explain this, but it still isn’t perfectly intuitive for a non-programmer (i.e. someone like me who is learning Python for the first time in the context of data science coursework). I still find using the terms “along” or “for each” with respect to rows and columns confusing.

What makes more sense to me is to say it this way:

  • Axis 0 will act on all the ROWS in each COLUMN
  • Axis 1 will act on all the COLUMNS in each ROW

So a mean on axis 0 will be the mean of all the rows in each column, and a mean on axis 1 will be a mean of all the columns in each row.

Ultimately this is saying the same thing as @zhangxaochen and @Michael, but in a way that is easier for me to internalize.


回答 2

让我们想象一下(您会永远记住),

在熊猫:

  1. axis = 0表示沿“索引”。这是逐行操作

假设要对dataframe1和dataframe2执行concat()操作,我们将dataframe1并从dataframe1中取出第一行并放入新的DF,然后从dataframe1中取出另一行并放入新的DF中,重复此过程直到我们到达dataframe1的底部。然后,我们对dataframe2执行相同的过程。

基本上,将dataframe2堆叠在dataframe1之上,反之亦然。

例如在桌子或地板上堆书

  1. axis = 1表示沿“列”。这是列操作。

假设要对dataframe1和dataframe2执行concat()操作,我们将取出dataframe1的第一个完整列(又名1st系列)并放入新的DF中,然后取出dataframe1的第二列并与之相邻(横向) ),我们必须重复此操作,直到完成所有列。然后,我们在dataframe2上重复相同的过程。基本上, 横向堆叠dataframe2。

例如在书架上整理书籍。

更重要的是,与矩阵相比,数组是更好的表示嵌套n维结构的表示形式!因此,下面的内容可以帮助您更加直观地了解将轴推广到多个维度时轴如何发挥重要作用。另外,您实际上可以打印/写入/绘制/可视化任何n维数组,但是在3维以上的纸张上以矩阵表示形式(3维)进行写入或可视化是不可能的。

Let’s visualize it (you will always remember it).

In Pandas:

  1. axis=0 means along the “index”. It’s a row-wise operation.

Suppose we perform a concat() operation on dataframe1 and dataframe2. We take the first row from dataframe1 and place it into the new DataFrame, then take the next row from dataframe1 and place it into the new DataFrame, and repeat this until we reach the bottom of dataframe1. Then we do the same for dataframe2.

Basically, this stacks dataframe2 on top of dataframe1, or vice versa.

E.g. making a pile of books on a table or floor.

  2. axis=1 means along the “columns”. It’s a column-wise operation.

Suppose we perform a concat() operation on dataframe1 and dataframe2. We take the first complete column (a.k.a. the first Series) of dataframe1 and place it into the new DataFrame, then take the second column of dataframe1 and place it next to the first (sideways), and repeat this until all columns are done. Then we repeat the same process for dataframe2. Basically, this stacks dataframe2 sideways.

E.g. arranging books on a bookshelf.

Moreover, arrays represent a nested n-dimensional structure better than matrices do, so thinking in terms of arrays helps you visualize how axis matters when you generalize to more than one dimension. You can print, write, draw or visualize an n-dimensional array of any size, whereas writing the same thing on paper in matrix form becomes impossible beyond 3 dimensions.
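The two concat() cases described above can be sketched as follows. The frames df1 and df2 are tiny made-up examples; ignore_index=True is my addition to get a clean 0..3 index in the stacked result.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

# axis=0: rows of df2 are appended below the rows of df1 (a pile of books).
stacked = pd.concat([df1, df2], axis=0, ignore_index=True)

# axis=1: columns of df2 are placed next to the columns of df1 (a bookshelf).
side_by_side = pd.concat([df1, df2], axis=1)

print(stacked.shape)        # 4 rows, 2 columns
print(side_by_side.shape)   # 2 rows, 4 columns
```

Note that the sideways result keeps duplicate column names (A, B, A, B); renaming them is left out to keep the sketch short.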


回答 3

axis指向数组的维,在pd.DataFrames 的情况下axis=0是向下的维,axis=1而向右的维。

示例:考虑一个ndarraywith shape (3,5,7)

a = np.ones((3,5,7))

a是3维的ndarray,即具有3个轴(“轴”是“轴”的复数)。的配置a看起来像3片面包,每片面包的尺寸为5 x 7。a[0,:,:]将引用第0个切片,a[1,:,:]将引用第1 个切片,依此类推。

a.sum(axis=0)sum()沿的第0轴应用a。您将添加所有切片,最后得到一个形状的切片(5,7)

a.sum(axis=0) 相当于

b = np.zeros((5,7))
for i in range(5):
    for j in range(7):
        b[i,j] += a[:,i,j].sum()

b并且a.sum(axis=0)将两者看起来像这样

array([[ 3.,  3.,  3.,  3.,  3.,  3.,  3.],
       [ 3.,  3.,  3.,  3.,  3.,  3.,  3.],
       [ 3.,  3.,  3.,  3.,  3.,  3.,  3.],
       [ 3.,  3.,  3.,  3.,  3.,  3.,  3.],
       [ 3.,  3.,  3.,  3.,  3.,  3.,  3.]])

在中pd.DataFrame,轴的工作方式与在numpy.arrays中相同:axis=0将对sum()每列应用或任何其他归约函数。

注意:在@zhangxaochen的答案中,我发现“沿行”和“沿列”这两个短语有些混乱。axis=0应该指“每列”和axis=1“每行”。

axis refers to a dimension of the array; in the case of pd.DataFrames, axis=0 is the dimension that points downwards and axis=1 the one that points to the right.

Example: Think of an ndarray with shape (3,5,7).

a = np.ones((3,5,7))

a is a 3-dimensional ndarray, i.e. it has 3 axes (“axes” is the plural of “axis”). The configuration of a looks like 3 slices of bread, where each slice has dimension 5-by-7. a[0,:,:] refers to the 0th slice, a[1,:,:] refers to the 1st slice, etc.

a.sum(axis=0) will apply sum() along the 0-th axis of a. You will add all the slices and end up with one slice of shape (5,7).

a.sum(axis=0) is equivalent to

b = np.zeros((5,7))
for i in range(5):
    for j in range(7):
        b[i,j] += a[:,i,j].sum()

b and a.sum(axis=0) will both look like this

array([[ 3.,  3.,  3.,  3.,  3.,  3.,  3.],
       [ 3.,  3.,  3.,  3.,  3.,  3.,  3.],
       [ 3.,  3.,  3.,  3.,  3.,  3.,  3.],
       [ 3.,  3.,  3.,  3.,  3.,  3.,  3.],
       [ 3.,  3.,  3.,  3.,  3.,  3.,  3.]])

In a pd.DataFrame, axes work the same way as in numpy.arrays: axis=0 will apply sum() or any other reduction function for each column.

N.B. In @zhangxaochen’s answer, I find the phrases “along the rows” and “along the columns” slightly confusing. axis=0 should refer to “along each column”, and axis=1 “along each row”.


回答 4

对我而言,最容易理解的方法是谈论您是针对每一列(axis = 0)还是每一行(axis = 1)计算统计信息。如果您计算统计量,请说一个平均值,axis = 0您将获得每一列的统计量。因此,如果每个观察值都是一行,并且每个变量都在列中,则将获得每个变量的均值。如果设置,axis = 1则将为每一行计算统计信息。在我们的示例中,您将获得所有变量中每个观察值的平均值(也许您需要相关度量的平均值)。

axis = 0:按列=按列=沿行

axis = 1:按行=按行=沿列

The easiest way for me to understand is to talk about whether you are calculating a statistic for each column (axis = 0) or each row (axis = 1). If you calculate a statistic, say a mean, with axis = 0 you will get that statistic for each column. So if each observation is a row and each variable is in a column, you would get the mean of each variable. If you set axis = 1 then you will calculate your statistic for each row. In our example, you would get the mean for each observation across all of your variables (perhaps you want the average of related measures).

axis = 0: by column = column-wise = along the rows

axis = 1: by row = row-wise = along the columns


回答 5

让我们看一下Wiki中的表格。这是国际货币基金组织对2010年至2019年前十个国家的GDP估算。

1.第1轴将对所有列的每一行起作用
如果您要计算十年(2010-2019年)中每个国家的平均(平均)GDP,则需要做df.mean(axis=1)。例如,如果您要计算2010年至2019年美国的平均GDP,df.loc['United States','2010':'2019'].mean(axis=1)

2.轴0将对所有行的每一列起作用
如果我想计算所有国家每个年份的平均(平均)GDP,则需要做df.mean(axis=0)。例如,如果您要计算美国,中国,日本,德国和印度的2015年平均GDP,df.loc['United States':'India','2015'].mean(axis=0)

请注意:上面的代码仅在将“国家(或从属地区)”列设置为“索引”后才能使用set_index方法。

Let’s look at the table from Wiki. This is an IMF estimate of GDP from 2010 to 2019 for top ten countries.

1. Axis 1 acts on each row, across all the columns
If you want to calculate the average (mean) GDP for EACH country over the decade (2010-2019), use df.mean(axis=1). For example, to calculate the mean GDP of the United States from 2010 to 2019, use df.loc['United States','2010':'2019'].mean() (selecting a single row returns a Series, which has only one axis, so no axis argument is needed).

2. Axis 0 acts on each column, across all the rows
If you want to calculate the average (mean) GDP for EACH year over all countries, use df.mean(axis=0). For example, to calculate the mean GDP in 2015 for the United States, China, Japan, Germany and India, use df.loc['United States':'India','2015'].mean(axis=0)

Note: The above code will work only after setting the “Country (or dependent territory)” column as the index, using the set_index method.
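Since the original table is not reproduced here, the following sketch uses a tiny stand-in frame. The country labels and all numbers are placeholders invented for illustration, not real GDP figures.

```python
import pandas as pd

# Placeholder data: rows are countries (the index), columns are years.
df = pd.DataFrame(
    {'2014': [1.0, 2.0, 3.0],
     '2015': [2.0, 3.0, 4.0],
     '2016': [3.0, 4.0, 5.0]},
    index=['Country A', 'Country B', 'Country C'],
)

per_country = df.mean(axis=1)   # one mean per row (country), over all years
per_year = df.mean(axis=0)      # one mean per column (year), over all countries

# Selecting one row gives a Series, so mean() needs no axis argument.
one_country = df.loc['Country A', '2014':'2016'].mean()

# Selecting one column for a range of rows also gives a Series.
subset_2015 = df.loc['Country A':'Country B', '2015'].mean()

print(per_country, per_year, one_country, subset_2015, sep="\n")
```

Label-based slices with .loc include both endpoints, which is why 'Country A':'Country B' covers two rows.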


回答 6

从编程角度来看,轴是形状元组中的位置。这是一个例子:

import numpy as np

a=np.arange(120).reshape(2,3,4,5)

a.shape
Out[3]: (2, 3, 4, 5)

np.sum(a,axis=0).shape
Out[4]: (3, 4, 5)

np.sum(a,axis=1).shape
Out[5]: (2, 4, 5)

np.sum(a,axis=2).shape
Out[6]: (2, 3, 5)

np.sum(a,axis=3).shape
Out[7]: (2, 3, 4)

轴上的均值将导致该尺寸被删除。

参考原始问题,dff形状为(1,2)。使用axis = 1会将形状更改为(1,)。

From a programming point of view, the axis is a position in the shape tuple. Here is an example:

import numpy as np

a=np.arange(120).reshape(2,3,4,5)

a.shape
Out[3]: (2, 3, 4, 5)

np.sum(a,axis=0).shape
Out[4]: (3, 4, 5)

np.sum(a,axis=1).shape
Out[5]: (2, 4, 5)

np.sum(a,axis=2).shape
Out[6]: (2, 3, 5)

np.sum(a,axis=3).shape
Out[7]: (2, 3, 4)

Taking the mean along an axis removes that dimension from the shape.

Referring to the original question, the shape of dff is (1,2). Using axis=1 changes the shape to (1,).
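The shape bookkeeping can be verified with a few lines. This sketch rebuilds a frame like the one in the question; keepdims=True is a NumPy option I added to show the alternative where the reduced axis is kept with length 1.

```python
import numpy as np
import pandas as pd

dff = pd.DataFrame(np.random.randn(1, 2), columns=list('AB'))

print(dff.shape)                 # one row, two columns
print(dff.mean(axis=1).shape)    # axis 1 (length 2) is removed
print(dff.mean(axis=0).shape)    # axis 0 (length 1) is removed

a = np.arange(120).reshape(2, 3, 4, 5)
reduced = a.mean(axis=2)                  # position 2 disappears from the shape
kept = a.mean(axis=2, keepdims=True)      # position 2 stays, with length 1
print(reduced.shape, kept.shape)
```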


回答 7

熊猫的设计者韦斯·麦金尼(Wes McKinney)过去经常从事金融数据工作。将列视为股票名称,将索引视为每日价格。然后,您可以猜测axis=0此财务数据的默认行为(即)。axis=1可以简单地认为是“另一个方向”。

例如,统计功能,如mean()sum()describe()count()都默认为列明智的,因为它更有意义,做他们每个股票。sort_index(by=)也默认为列。fillna(method='ffill')将沿着列填充,因为它是相同的库存。dropna()默认为行,因为您可能只想放弃当天的价格,而不是丢弃该股票的所有价格。

类似地,方括号索引是指各列,因为选择股票而不是选择一天更为普遍。

The designer of pandas, Wes McKinney, used to work intensively on finance data. Think of columns as stock names and index as daily prices. You can then guess what the default behavior is (i.e., axis=0) with respect to this finance data. axis=1 can be simply thought as ‘the other direction’.

For example, the statistics functions, such as mean(), sum(), describe(), count() all default to column-wise because it makes more sense to do them for each stock. sort_index(by=) also defaults to column. fillna(method='ffill') will fill along column because it is the same stock. dropna() defaults to row because you probably just want to discard the price on that day instead of throw away all prices of that stock.

Similarly, the square brackets indexing refers to the columns since it’s more common to pick a stock instead of picking a day.


回答 8

记住轴1(列)和轴0(行)的简单方法之一就是您期望的输出。

  • 如果您期望每行的输出都使用axis =’columns’,
  • 另一方面,如果要为每列输出,请使用axis =’rows’。

One easy way to remember axis 1 (columns) vs. axis 0 (rows) is to think of the output you expect.

  • If you expect an output for each row, use axis='columns'.
  • On the other hand, if you want an output for each column, use axis='rows'.

回答 9

axis=正确使用的问题是在两种主要情况下的使用:

  1. 用于计算累积值重新排列(例如排序)数据。
  2. 用于操纵(“播放”)实体(例如dataframe)。

该答案背后的主要思想是,为了避免混淆,我们选择数字名称来指定特定的轴,以更清晰,直观和描述性的方式为准。

熊猫基于NumPy,后者基于数学,尤其是基于n维矩阵。这是3维空间中数学中轴名称的常用图像:

此图片仅用于存储轴的序号

  • 0 对于x轴,
  • 1 y轴
  • 2 对于z轴。

z轴是只对面板 ; 对于数据帧,我们将把兴趣限制在带有x轴(,垂直)y轴(,水平)的绿色二维基本平面01

所有这些都是数字作为axis=参数的潜在值。

轴的名称'index'(您可以使用别名'rows')和'columns',对于此说明,这些名称与(轴的)序数之间的关系并不重要,因为每个人都知道“行”“列”是什么意思(我想,这里的每个人都知道熊猫中“索引”一词的含义)。

现在,我的建议:

  1. 如果要计算累加值,则可以从沿轴0(或沿轴1)定位的值(使用axis=0axis=1)计算得出。

    同样,如果要重新排列值,请使用轴的轴编号沿着该轴的编号将放置数据以进行重新排列(例如,用于排序)。

  2. 如果要操作(例如连接实体(例如,数据框),请使用axis='index'(同义词:)axis='rows'axis='columns'指定结果更改 – 分别为索引)或
    (对于串联,您将分别获得更长的索引(=更多的行)更多的列。)

The difficulty with using axis= properly comes from its use in two quite different cases:

  1. For computing an accumulated value, or rearranging (e.g. sorting) data.
  2. For manipulating (“playing” with) entities (e.g. dataframes).

The main idea behind this answer is that for avoiding the confusion, we select either a number, or a name for specifying the particular axis, whichever is more clear, intuitive, and descriptive.

Pandas is based on NumPy, which is based on mathematics, particularly on n-dimensional matrices. Here is an image for common use of axes’ names in math in the 3-dimensional space:

This picture is for memorizing the axes’ ordinal numbers only:

  • 0 for x-axis,
  • 1 for y-axis, and
  • 2 for z-axis.

The z-axis applies only to panels (a 3-dimensional structure that has since been removed from pandas); for dataframes we will restrict our interest to the green-colored, 2-dimensional basic plane with the x-axis (0, vertical) and the y-axis (1, horizontal).

That covers numbers as possible values of the axis= parameter.

The names of the axes are 'index' (you may use the alias 'rows') and 'columns'. For this explanation the relation between these names and the ordinal numbers of the axes is NOT important, as everybody knows what the words “rows” and “columns” mean (and everybody here, I suppose, knows what the word “index” means in pandas).

And now, my recommendation:

  1. If you want to compute an accumulated value, you may compute it from values located along axis 0 (or along axis 1): use axis=0 (or axis=1).

    Similarly, if you want to rearrange values, use the number of the axis along which the data to be rearranged (e.g. sorted) are located.

  2. If you want to manipulate (e.g. concatenate) entities (e.g. dataframes), use axis='index' (synonym: axis='rows') or axis='columns' to specify the resulting change: index (rows) or columns, respectively.
    (For concatenating, you will obtain either a longer index (= more rows), or more columns, respectively.)


回答 10

这是基于@Safak的答案。理解pandas / numpy中轴的最好方法是创建3d数组,并检查3个不同轴上sum函数的结果。

 a = np.ones((3,5,7))

一个将是:

    array([[[1., 1., 1., 1., 1., 1., 1.],
    [1., 1., 1., 1., 1., 1., 1.],
    [1., 1., 1., 1., 1., 1., 1.],
    [1., 1., 1., 1., 1., 1., 1.],
    [1., 1., 1., 1., 1., 1., 1.]],

   [[1., 1., 1., 1., 1., 1., 1.],
    [1., 1., 1., 1., 1., 1., 1.],
    [1., 1., 1., 1., 1., 1., 1.],
    [1., 1., 1., 1., 1., 1., 1.],
    [1., 1., 1., 1., 1., 1., 1.]],

   [[1., 1., 1., 1., 1., 1., 1.],
    [1., 1., 1., 1., 1., 1., 1.],
    [1., 1., 1., 1., 1., 1., 1.],
    [1., 1., 1., 1., 1., 1., 1.],
    [1., 1., 1., 1., 1., 1., 1.]]])

现在检查沿每个轴的数组元素的总和:

 x0 = np.sum(a,axis=0)
 x1 = np.sum(a,axis=1)
 x2 = np.sum(a,axis=2)

将为您提供以下结果:

   x0 :
   array([[3., 3., 3., 3., 3., 3., 3.],
        [3., 3., 3., 3., 3., 3., 3.],
        [3., 3., 3., 3., 3., 3., 3.],
        [3., 3., 3., 3., 3., 3., 3.],
        [3., 3., 3., 3., 3., 3., 3.]])

   x1 : 
   array([[5., 5., 5., 5., 5., 5., 5.],
   [5., 5., 5., 5., 5., 5., 5.],
   [5., 5., 5., 5., 5., 5., 5.]])

  x2 :
   array([[7., 7., 7., 7., 7.],
        [7., 7., 7., 7., 7.],
        [7., 7., 7., 7., 7.]])

This is based on @Safak’s answer. The best way to understand the axes in pandas/numpy is to create a 3d array and check the result of the sum function along the 3 different axes.

 a = np.ones((3,5,7))

a will be:

    array([[[1., 1., 1., 1., 1., 1., 1.],
    [1., 1., 1., 1., 1., 1., 1.],
    [1., 1., 1., 1., 1., 1., 1.],
    [1., 1., 1., 1., 1., 1., 1.],
    [1., 1., 1., 1., 1., 1., 1.]],

   [[1., 1., 1., 1., 1., 1., 1.],
    [1., 1., 1., 1., 1., 1., 1.],
    [1., 1., 1., 1., 1., 1., 1.],
    [1., 1., 1., 1., 1., 1., 1.],
    [1., 1., 1., 1., 1., 1., 1.]],

   [[1., 1., 1., 1., 1., 1., 1.],
    [1., 1., 1., 1., 1., 1., 1.],
    [1., 1., 1., 1., 1., 1., 1.],
    [1., 1., 1., 1., 1., 1., 1.],
    [1., 1., 1., 1., 1., 1., 1.]]])

Now check out the sum of elements of the array along each of the axes:

 x0 = np.sum(a,axis=0)
 x1 = np.sum(a,axis=1)
 x2 = np.sum(a,axis=2)

will give you the following results:

   x0 :
   array([[3., 3., 3., 3., 3., 3., 3.],
        [3., 3., 3., 3., 3., 3., 3.],
        [3., 3., 3., 3., 3., 3., 3.],
        [3., 3., 3., 3., 3., 3., 3.],
        [3., 3., 3., 3., 3., 3., 3.]])

   x1 : 
   array([[5., 5., 5., 5., 5., 5., 5.],
   [5., 5., 5., 5., 5., 5., 5.],
   [5., 5., 5., 5., 5., 5., 5.]])

  x2 :
   array([[7., 7., 7., 7., 7.],
        [7., 7., 7., 7., 7.],
        [7., 7., 7., 7., 7.]])

回答 11

我这样理解:

假设您的操作需要在数据框中从左向右/从右向左遍历,则显然是在合并列,即。您正在各种列上进行操作。这是轴= 1

df = pd.DataFrame(np.arange(12).reshape(3,4),columns=['A', 'B', 'C', 'D'])
print(df)
   A  B   C   D
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11 

df.mean(axis=1)

0    1.5
1    5.5
2    9.5
dtype: float64

df.drop(['A','B'],axis=1,inplace=True)

    C   D
0   2   3
1   6   7
2  10  11

需要注意的是,我们正在对列进行操作

同样,如果您的操作需要在数据框中从上到下/下到上遍历,则您正在合并行。这是axis = 0

I understand it this way:

If your operation requires traversing a dataframe from left to right (or right to left), you are combining columns, i.e. you are operating across the various columns. This is axis=1.

Example:

df = pd.DataFrame(np.arange(12).reshape(3,4),columns=['A', 'B', 'C', 'D'])
print(df)
   A  B   C   D
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11 

df.mean(axis=1)

0    1.5
1    5.5
2    9.5
dtype: float64

df.drop(['A','B'],axis=1,inplace=True)

    C   D
0   2   3
1   6   7
2  10  11

The point to note here is that we are operating on columns.

Similarly, if your operation requires traversing from top to bottom/bottom to top in a dataframe, you are merging rows. This is axis=0.


回答 12

轴= 0表示上向下轴= 1表示左到右

sums[key] = lang_sets[key].iloc[:,1:].sum(axis=0)

给定的示例是对==键中的所有数据求和。

axis = 0 means top to bottom; axis = 1 means left to right.

sums[key] = lang_sets[key].iloc[:,1:].sum(axis=0)

The given example takes the sum of all the data in column == key.


回答 13

我的想法是:Axis = n,其中n = 0、1等,意味着矩阵沿该轴折叠(折叠)。因此,在2D矩阵中,当您沿0(行)折叠时,您实际上一次只对一列进行操作。对于高阶矩阵也是如此。

这与对矩阵中维的常规引用不同,其中0->行和1->列。对于N维数组中的其他维类似。

My thinking: axis = n, where n = 0, 1, etc., means that the matrix is collapsed (folded) along that axis. So in a 2D matrix, when you collapse along 0 (rows), you are really operating on one column at a time. The same holds for higher-order matrices.

This is not the same as the normal reference to a dimension in a matrix, where 0 -> row and 1 -> column. Similarly for other dimensions in an N dimension array.


回答 14

我是熊猫的新手。但这是我理解熊猫轴的方式:


恒定 变化 方向


0列向下|


1行列向右->


因此,要计算列的均值,该特定列应为常数,但其下的行可以更改(变化),因此轴= 0。

类似地,要计算一行的平均值,该特定行是恒定的,但它可以遍历不同的列(变化),轴= 1。

I’m a newbie to pandas. But this is how I understand axis in pandas:


Axis   Constant   Varying   Direction
0      Column     Row       Downwards |
1      Row        Column    Towards right -->


So to compute mean of a column, that particular column should be constant but the rows under that can change (varying) so it is axis=0.

Similarly, to compute mean of a row, that particular row is constant but it can traverse through different columns (varying), axis=1.


回答 15

我认为还有另一种理解方式。

对于np.array,如果要消除列,则使用axis = 1; 如果要消除行,则使用axis = 0。

np.mean(np.array(np.ones(shape=(3,5,10))),axis = 0).shape # (5,10)
np.mean(np.array(np.ones(shape=(3,5,10))),axis = 1).shape # (3,10)
np.mean(np.array(np.ones(shape=(3,5,10))),axis = (0,1)).shape # (10,)

对于熊猫对象,axis = 0代表按行操作,axis = 1代表按列操作。这与numpy定义不同,我们可以检查numpy.docpandas.doc中的定义

I think there is another way to understand it.

For a np.array, if we want to eliminate columns we use axis = 1; if we want to eliminate rows, we use axis = 0.

np.mean(np.ones(shape=(3, 5, 10)), axis=0).shape       # (5, 10)
np.mean(np.ones(shape=(3, 5, 10)), axis=1).shape       # (3, 10)
np.mean(np.ones(shape=(3, 5, 10)), axis=(0, 1)).shape  # (10,)

For a pandas object, axis = 0 stands for a row-wise operation and axis = 1 stands for a column-wise operation. This wording differs from the numpy definition; we can check the definitions in numpy.doc and pandas.doc.


Answer 16

I will explicitly avoid using ‘row-wise’ or ‘along the columns’, since people may interpret them in exactly the wrong way.

Analogy first. Intuitively, you would expect that pandas.DataFrame.drop(axis='column') drops a column from N columns and gives you (N – 1) columns. So you can pay NO attention to rows for now (and remove the word ‘row’ from your English dictionary). Conversely, drop(axis='row') works on rows.

In the same way, sum(axis='column') works on multiple columns and gives you 1 column. Similarly, sum(axis='row') results in 1 row. This is consistent with its simplest form of definition, reducing a list of numbers to a single number.

In general, with axis=column, you see columns, work on columns, and get columns. Forget rows.

With axis=row, change perspective and work on rows.

0 and 1 are just aliases for ‘row’ and ‘column’. It’s the convention of matrix indexing.
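
A short sketch of the analogy, assuming a made-up DataFrame `df`. Note that in real pandas calls the axis strings are 'index' and 'columns', so that spelling is used here instead of the shorthand ‘row’ and ‘column’ from the text.

```python
import pandas as pd

# Hypothetical 2x3 frame, used only for illustration.
df = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6]}, index=["X", "Y"])

# drop with axis="columns": N columns in, N - 1 columns out; rows untouched.
fewer_cols = df.drop("B", axis="columns")
print(list(fewer_cols.columns))  # ['A', 'C']

# drop with axis="index": one row removed; columns untouched.
fewer_rows = df.drop("X", axis="index")
print(list(fewer_rows.index))    # ['Y']

# sum with axis="columns": the columns are reduced to one value per row,
# so the result looks like a single column.
row_totals = df.sum(axis="columns")
print(row_totals.tolist())       # [9, 12]

# sum with axis="index": the rows are reduced to one value per column,
# so the result looks like a single row.
col_totals = df.sum(axis="index")
print(col_totals.tolist())       # [3, 7, 11]
```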


Answer 17

I have been trying to figure out the axis for the last hour as well. The language in all the above answers, and also in the documentation, is not at all helpful.

To answer the question as I understand it now: in pandas, axis = 1 or 0 specifies which axis headers you want to keep constant when applying the function.

Note: when I say headers, I mean index names.

Expanding your example:

+------------+---------+--------+
|            |  A      |  B     |
+------------+---------+--------+
|      X     | 0.626386| 1.52325|
+------------+---------+--------+
|      Y     | 0.626386| 1.52325|
+------------+---------+--------+

For axis=1 (columns): we keep the column headers constant and apply the mean function by changing the data. To demonstrate, we keep the column headers constant as:

+------------+---------+--------+
|            |  A      |  B     |

Now we populate one set of A and B values and then find the mean

|            | 0.626386| 1.52325|  

Then we populate next set of A and B values and find the mean

|            | 0.626386| 1.52325|

Similarly, for axis=rows, we keep the row headers constant and keep changing the data. To demonstrate, first fix the row headers:

+------------+
|      X     |
+------------+
|      Y     |
+------------+

Now populate first set of X and Y values and then find the mean

+------------+---------+
|      X     | 0.626386|
+------------+---------+
|      Y     | 0.626386|
+------------+---------+

Then populate the next set of X and Y values and then find the mean:

+------------+---------+
|      X     | 1.52325 |
+------------+---------+
|      Y     | 1.52325 |
+------------+---------+

In summary,

When axis=columns, you fix the column headers and change data, which will come from the different rows.

When axis=rows, you fix the row headers and change data, which will come from the different columns.


Answer 18

With axis=1, it gives the row-wise sum, and keepdims=True keeps the result two-dimensional. Hope it helps.
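
A minimal sketch of the difference, using a made-up array `x`. Without keepdims the summed axis disappears; with keepdims=True it stays as an axis of length 1, which is convenient when the result has to broadcast against the original array.

```python
import numpy as np

# Hypothetical 2x3 array, used only for illustration.
x = np.array([[1, 2, 3],
              [4, 5, 6]])

# Sum across the columns of each row; the summed axis disappears.
flat = np.sum(x, axis=1)
print(flat.shape, flat.tolist())  # (2,) [6, 15]

# Same sum, but keepdims=True keeps the summed axis with length 1.
kept = np.sum(x, axis=1, keepdims=True)
print(kept.shape, kept.tolist())  # (2, 1) [[6], [15]]

# The kept axis lets the result broadcast against the original array,
# here to turn every row into fractions of its own total.
fractions = x / kept
print(fractions.shape)            # (2, 3)
```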


Answer 19

Many answers here helped me a lot!

In case you get confused by the different behaviours of axis in Python and MARGIN in R (like in the apply function), you may find a blog post that I wrote of interest: https://accio.github.io/programming/2020/05/19/numpy-pandas-axis.html.

In essence:

  • Their behaviours are, intriguingly, easier to understand with three-dimensional arrays than with two-dimensional arrays.
  • In the Python packages numpy and pandas, the axis parameter (in sum, for example) tells numpy to compute the result (a sum or a mean, say) over all values that can be fetched in the form array[0, 0, …, i, …, 0], where i iterates through all possible values. The process is repeated with the position of i fixed while the indices of the other dimensions vary one after the other (starting from the right-most one). The result is an (n-1)-dimensional array.
  • In R, the MARGIN parameter lets the apply function compute the result (a mean, say) over all values that can be fetched in the form array[, … , i, … ,], where i iterates through all possible values. The process is not repeated once all i values have been iterated. Therefore, the result is a simple vector.
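
The numpy side of this description can be sketched as follows; the array `arr` is made up for illustration, and the R behaviour is not shown. The explicit loops fix the indices of the other dimensions and let i run over the reduced axis, which is what the axis argument does in one call.

```python
import numpy as np

# Hypothetical 2x3x4 array, used only for illustration.
arr = np.arange(24).reshape(2, 3, 4)

# Reduce over axis 1: the index i in arr[j, i, k] runs over all its values
# while j and k stay fixed.
reduced = arr.sum(axis=1)
print(reduced.shape)  # (2, 4)

# Spell out the same reduction with explicit loops.
manual = np.zeros((2, 4), dtype=arr.dtype)
for j in range(arr.shape[0]):
    for k in range(arr.shape[2]):
        manual[j, k] = sum(arr[j, i, k] for i in range(arr.shape[1]))

print(np.array_equal(reduced, manual))  # True
```

The input has three dimensions and the result has two, matching the "n-1-dimensional" statement above.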

Answer 20

Arrays are designed so that axis=0 runs vertically, down the rows, while axis=1 runs horizontally, across the columns. An axis refers to a dimension of the array.


Concatenating two one-dimensional NumPy arrays

Question: Concatenating two one-dimensional NumPy arrays

I have two simple one-dimensional arrays in NumPy. I should be able to concatenate them using numpy.concatenate. But I get this error for the code below:

TypeError: only length-1 arrays can be converted to Python scalars

Code

import numpy
a = numpy.array([1, 2, 3])
b = numpy.array([5, 6])
numpy.concatenate(a, b)

Why?


Answer 0

The line should be:

numpy.concatenate([a,b])

The arrays you want to concatenate need to be passed in as a sequence, not as separate arguments.

From the NumPy documentation:

numpy.concatenate((a1, a2, ...), axis=0)

Join a sequence of arrays together.

It was trying to interpret your b as the axis parameter, which is why it complained it couldn’t convert it into a scalar.
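
Both calls can be compared side by side in a small sketch that reuses the arrays from the question. The exact wording of the error message depends on the NumPy version, so only the exception type is checked here.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([5, 6])

# Correct: the arrays are passed together as one sequence (list or tuple).
joined = np.concatenate([a, b])
print(joined.tolist())  # [1, 2, 3, 5, 6]

# Incorrect: b lands in the position of the axis parameter.
try:
    np.concatenate(a, b)
    error_name = None
except TypeError as exc:
    error_name = type(exc).__name__
    print(error_name, "-", exc)
```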


Answer 1

There are several possibilities for concatenating 1D arrays, e.g.,

numpy.r_[a, a],
numpy.stack([a, a]).reshape(-1),
numpy.hstack([a, a]),
numpy.concatenate([a, a])

All those options are equally fast for large arrays; for small ones, concatenate has a slight edge:

The plot was created with perfplot:

import numpy
import perfplot

perfplot.show(
    setup=lambda n: numpy.random.rand(n),
    kernels=[
        lambda a: numpy.r_[a, a],
        lambda a: numpy.stack([a, a]).reshape(-1),
        lambda a: numpy.hstack([a, a]),
        lambda a: numpy.concatenate([a, a]),
    ],
    labels=["r_", "stack+reshape", "hstack", "concatenate"],
    n_range=[2 ** k for k in range(19)],
    xlabel="len(a)",
)
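
Apart from speed, the four options listed above should give the same result for 1D input. A quick sketch to confirm that, with a made-up array `a`:

```python
import numpy as np

# Hypothetical 1D input, used only for illustration.
a = np.array([1.0, 2.0, 3.0])

results = {
    "r_": np.r_[a, a],
    "stack+reshape": np.stack([a, a]).reshape(-1),
    "hstack": np.hstack([a, a]),
    "concatenate": np.concatenate([a, a]),
}

expected = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
for name, value in results.items():
    # Each line prints the option name followed by True.
    print(name, value.tolist() == expected)
```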

Answer 2

The first parameter to concatenate should itself be a sequence of arrays to concatenate:

numpy.concatenate((a,b)) # Note the extra parentheses.

Answer 3

An alternative is to use the short forms of “concatenate”, either “r_[…]” or “c_[…]”, as shown in the example code below (see http://wiki.scipy.org/NumPy_for_Matlab_Users for additional information):

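from numpy import array, c_, ones, r_  # replaces the IPython %pylab magic

vector_a = r_[0.:10.]  # short form of "arange"
vector_b = array([1, 1, 1, 1])
vector_c = r_[vector_a, vector_b]  # join two 1D arrays end to end
print(vector_a)
print(vector_b)
print(vector_c, '\n\n')

a = ones((3, 4)) * 4
print(a, '\n')
c = array([1, 1, 1])
b = c_[a, c]  # append c as a new column
print(b, '\n\n')

a = ones((4, 3)) * 4
print(a, '\n')
c = array([[1, 1, 1]])
b = r_[a, c]  # append c as a new row
print(b)

print(type(vector_b))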

Which results in:

[ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9.]
[1 1 1 1]
[ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9.  1.  1.  1.  1.] 


[[ 4.  4.  4.  4.]
 [ 4.  4.  4.  4.]
 [ 4.  4.  4.  4.]] 

[[ 4.  4.  4.  4.  1.]
 [ 4.  4.  4.  4.  1.]
 [ 4.  4.  4.  4.  1.]] 


[[ 4.  4.  4.]
 [ 4.  4.  4.]
 [ 4.  4.  4.]
 [ 4.  4.  4.]] 

[[ 4.  4.  4.]
 [ 4.  4.  4.]
 [ 4.  4.  4.]
 [ 4.  4.  4.]
 [ 1.  1.  1.]]

Answer 4

Here are more approaches for doing this by using numpy.ravel(), numpy.array(), utilizing the fact that 1D arrays can be unpacked into plain elements:

# we'll utilize the concept of unpacking
In [15]: (*a, *b)
Out[15]: (1, 2, 3, 5, 6)

# using `numpy.ravel()`
In [14]: np.ravel((*a, *b))
Out[14]: array([1, 2, 3, 5, 6])

# wrap the unpacked elements in `numpy.array()`
In [16]: np.array((*a, *b))
Out[16]: array([1, 2, 3, 5, 6])

Answer 5

Some more facts from the numpy docs:

The syntax is numpy.concatenate((a1, a2, ...), axis=0, out=None)

axis = 0 is for row-wise concatenation
axis = 1 is for column-wise concatenation

>>> a = np.array([[1, 2], [3, 4]])
>>> b = np.array([[5, 6]])

# Appending below last row
>>> np.concatenate((a, b), axis=0)
array([[1, 2],
       [3, 4],
       [5, 6]])

# Appending after last column
>>> np.concatenate((a, b.T), axis=1)    # Notice the transpose
array([[1, 2, 5],
       [3, 4, 6]])

# Flattening the final array
>>> np.concatenate((a, b), axis=None)
array([1, 2, 3, 4, 5, 6])
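
The transpose in the second call is needed because every dimension except the one being concatenated has to match. A small sketch of that rule, reusing `a` and `b` from above (the name `mismatch_rejected` is only for illustration):

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])   # shape (2, 2)
b = np.array([[5, 6]])           # shape (1, 2)

# axis=0: only the number of columns has to match (2 == 2).
print(np.concatenate((a, b), axis=0).shape)     # (3, 2)

# axis=1: the number of rows has to match, so b (1 row) is rejected ...
try:
    np.concatenate((a, b), axis=1)
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
print(mismatch_rejected)                        # True

# ... while its transpose, with shape (2, 1), is accepted.
print(np.concatenate((a, b.T), axis=1).shape)   # (2, 3)

# axis=None flattens every input first, so the shapes do not matter.
print(np.concatenate((a, b), axis=None).shape)  # (6,)
```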

I hope it helps !