I could try to come up with a mapping of all of these cases, but does numpy provide some automatic way of converting its dtypes into the closest possible native python types? This mapping need not be exhaustive, but it should convert the common dtypes that have a close python analog. I think this already happens somewhere in numpy.
(Another method is np.asscalar(val), however it is deprecated since NumPy 1.16).
For the curious, to build a table of conversions of NumPy array scalars for your system:
import numpy as np

for name in dir(np):
    obj = getattr(np, name)
    if hasattr(obj, 'dtype'):
        try:
            if 'time' in name:
                npn = obj(0, 'D')
            else:
                npn = obj(0)
            nat = npn.item()
            print('{0} ({1!r}) -> {2}'.format(name, npn.dtype.char, type(nat)))
        except Exception:
            # skip names that cannot be instantiated this way
            pass
There are a few NumPy types that have no native Python equivalent on some systems, including: clongdouble, clongfloat, complex192, complex256, float128, longcomplex, longdouble and longfloat. These need to be converted to their nearest NumPy equivalent before using .item().
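As a rough illustration of "converting to the nearest NumPy equivalent first", the sketch below casts an extended-precision value down to float64 or complex128 before calling .item(). The variable names are made up for the example, and whether np.longdouble is actually wider than a Python float depends on your platform.

```python
import numpy as np

# Assumption: np.longdouble may be wider than a Python float on this platform.
x = np.longdouble(1.5)

# Cast down to float64 (the nearest type with a Python analog), then unwrap.
as_float = np.float64(x).item()

# The same idea for extended-precision complex values.
z = np.clongdouble(1 + 2j)
as_complex = np.complex128(z).item()

print(type(as_float), as_float)
print(type(as_complex), as_complex)
```

Any precision beyond float64 is lost in the cast, so this only makes sense when a plain Python number is what you need.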
I found myself with a mixed set of numpy types and standard Python types. Since all numpy types derive from numpy.generic, here is how you can convert everything to standard Python types:
if isinstance(obj, numpy.generic):
    return obj.item()  # numpy.asscalar(obj) is deprecated since NumPy 1.16
This means there is no fixed list of types to maintain, and your code keeps working as more types are added.
Answer 7
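numpy holds that information in a mapping exposed as typeDict, so you could do something like the following (Python 2 is shown; on Python 3 the module is called builtins, and np.sctypeDict is the non-deprecated name of the mapping):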
>>> import __builtin__
>>> import numpy as np
>>> {v: k for k, v in np.typeDict.items() if k in dir(__builtin__)}
{numpy.object_: 'object',
numpy.bool_: 'bool',
numpy.string_: 'str',
numpy.unicode_: 'unicode',
numpy.int64: 'int',
numpy.float64: 'float',
numpy.complex128: 'complex'}
If you want the actual python types rather than their names, you can do:
>>> {v: getattr(__builtin__, k) for k, v in np.typeDict.items() if k in vars(__builtin__)}
{numpy.object_: object,
numpy.bool_: bool,
numpy.string_: str,
numpy.unicode_: unicode,
numpy.int64: int,
numpy.float64: float,
numpy.complex128: complex}
Sorry to come late to the party, but I was looking at the problem of converting numpy.float64 to a regular Python float only. I saw 3 ways of doing that:
npValue.item()
npValue.astype(float)
float(npValue)
Here are the relevant timings from IPython:
In [1]: import numpy as np
In [2]: aa = np.random.uniform(0, 1, 1000000)
In [3]: %timeit map(float, aa)
10 loops, best of 3: 117 ms per loop
In [4]: %timeit map(lambda x: x.astype(float), aa)
1 loop, best of 3: 780 ms per loop
In [5]: %timeit map(lambda x: x.item(), aa)
1 loop, best of 3: 475 ms per loop
It looks like float(npValue) is much faster.
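For reference, here is a small sketch of what each of the three calls returns for a single value (np_value is just an example name). One detail worth knowing: astype(float) on an array scalar gives back a NumPy scalar again, not a plain Python float.

```python
import numpy as np

np_value = np.float64(0.25)

# Three ways to get a float out of a NumPy scalar.
via_item = np_value.item()
via_astype = np_value.astype(float)   # still a NumPy scalar (np.float64)
via_float = float(np_value)

print(type(via_item), type(via_astype), type(via_float))
```

So if the goal is a built-in float, only .item() and float() actually deliver one.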
Answer 9
My approach is a bit forceful, but seems to play nice for all cases:
def type_np2py(dtype=None, arr=None):
    '''Return the closest python type for a given numpy dtype'''
    if ((dtype is None and arr is None) or
            (dtype is not None and arr is not None)):
        raise ValueError(
            "Provide either keyword argument `dtype` or `arr`: a numpy dtype or a numpy array.")
    if dtype is None:
        dtype = arr.dtype
    # 1) Make a single-entry numpy array of the same dtype
    # 2) force the array into a python 'object' dtype
    # 3) the array entry should now be the closest python type
    single_entry = np.empty([1], dtype=dtype).astype(object)
    return type(single_entry[0])
A side note about array scalars for those who don’t need automatic conversion and know the numpy dtype of the value:
Array scalars differ from Python scalars, but for the most part they can be used interchangeably (the primary exception is for versions of Python older than v2.x, where integer array scalars cannot act as indices for lists and tuples). There are some exceptions, such as when code requires very specific attributes of a scalar or when it checks specifically whether a value is a Python scalar. Generally, problems are easily fixed by explicitly converting array scalars to Python scalars, using the corresponding Python type function (e.g., int, float, complex, str, unicode).
Thus, for most cases conversion might not be needed at all, and the array scalar could be used directly. The effect should be identical to using Python scalar:
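A minimal sketch of that interchangeability, with made-up values (this is an illustration, not the snippet from the original answer):

```python
import numpy as np

i = np.int64(2)
x = np.float64(1.5)

items = ["a", "b", "c"]
picked = items[i]                 # an integer array scalar works as a list index
total = x + 1                     # arithmetic mixes freely with Python numbers
as_text = "{:.2f}".format(x)      # string formatting treats it like a float

print(picked, total, as_text)
```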
But if, for some reason, the explicit conversion is needed, using the corresponding Python built-in function is the way to go. As shown in another answer, it is also faster than the array scalar's item() method.
Answer 11
Translate a whole table (the code expects a pandas DataFrame) instead of a single value:
def trans(data):
    """
    translate numpy.int/float into python native data type
    """
    result = []
    for i in range(len(data)):  # positional index, as .iloc expects
        d0 = data.iloc[i].values
        d = []
        for j in d0:
            if 'int' in str(type(j)):
                res = j.item() if 'item' in dir(j) else j
            elif 'float' in str(type(j)):
                res = j.item() if 'item' in dir(j) else j
            else:
                res = j
            d.append(res)
        d = tuple(d)
        result.append(d)
    result = tuple(result)
    return result
However, it takes some minutes when handling large dataframes. I am also looking for a more efficient solution.
Hoping for a better answer.
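One possibly faster route, assuming the data is (or can be turned into) a plain numeric ndarray, is ndarray.tolist(), which converts every element to a built-in Python scalar in one call. This is a different technique from the loop above, shown here on a small made-up array:

```python
import numpy as np

arr = np.array([[1, 2], [3, 4]], dtype=np.int64)

# tolist() returns nested lists whose elements are built-in Python scalars.
rows = tuple(tuple(row) for row in arr.tolist())

print(rows, type(rows[0][0]))
```

Whether this is fast enough for a large DataFrame with mixed column types is something you would need to measure on your own data.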
This is arguably the way of creating an array filled with certain values, because it explicitly describes what is being achieved (and it can in principle be very efficient since it performs a very specific task).
Answer 1
Updated for Numpy 1.7.0: (Hat-tip to @Rolf Bartstra.)
a=np.empty(n); a.fill(5) is fastest.
In descending speed order:
%timeit a=np.empty(1e4); a.fill(5)
100000 loops, best of 3: 5.85 us per loop
%timeit a=np.empty(1e4); a[:]=5
100000 loops, best of 3: 7.15 us per loop
%timeit a=np.ones(1e4)*5
10000 loops, best of 3: 22.9 us per loop
%timeit a=np.repeat(5,(1e4))
10000 loops, best of 3: 81.7 us per loop
%timeit a=np.tile(5,[1e4])
10000 loops, best of 3: 82.9 us per loop
You should also always avoid iterating like you are doing in your example. A simple a[:] = v will accomplish what your iteration does using numpy broadcasting.
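A short sketch of the loop-free options, using example shapes and a fill value of 5. Note that the shape is written with integers here; recent NumPy versions reject a float such as 1e4 as a shape.

```python
import numpy as np

v = 5

a = np.empty((3, 4))
a[:] = v                 # slice assignment broadcasts the scalar to every element

b = np.empty(4)
b.fill(v)                # in-place fill of an existing array

c = np.full((3, 4), v)   # np.full allocates and fills in one call

print(a, b, c, sep="\n")
```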
Answer 3
Apparently, not only the absolute speeds but also the speed order (as reported by user1579844) are machine dependent; here’s what I found:
a=np.empty(1e4); a.fill(5) is fastest;
In descending speed order:
timeit a=np.empty(1e4); a.fill(5)
# 100000 loops, best of 3: 10.2 us per loop
timeit a=np.empty(1e4); a[:]=5
# 100000 loops, best of 3: 16.9 us per loop
timeit a=np.ones(1e4)*5
# 100000 loops, best of 3: 32.2 us per loop
timeit a=np.tile(5,[1e4])
# 10000 loops, best of 3: 90.9 us per loop
timeit a=np.repeat(5,(1e4))
# 10000 loops, best of 3: 98.3 us per loop
timeit a=np.array([5]*int(1e4))
# 1000 loops, best of 3: 1.69 ms per loop (slowest BY FAR!)
So, try and find out, and use what’s fastest on your platform.
import numpy
v = 7
rows = 3
cols = 5
a = numpy.tile(v, (rows, cols))
a
Out[1]:
array([[7, 7, 7, 7, 7],
       [7, 7, 7, 7, 7],
       [7, 7, 7, 7, 7]])
Although tile is meant to ’tile’ an array (instead of a scalar, as in this case), it will do the job, creating pre-filled arrays of any size and dimension.
import numpy as np
data = np.zeros( (512,512,3), dtype=np.uint8)
data[256,256] = [255,0,0]
What I want this to do is display a single red dot in the center of a 512×512 image. (At least to begin with… I think I can figure out the rest from there)
Answer 0
You can use PIL to create (and display) an image:
from PIL import Image
import numpy as np
w, h = 512, 512
data = np.zeros((h, w, 3), dtype=np.uint8)
data[0:256, 0:256] = [255, 0, 0]  # red patch in upper left
img = Image.fromarray(data, 'RGB')
img.save('my.png')
img.show()
Using pygame, you can open a window, get the surface as an array of pixels, and manipulate as you want from there. You’ll need to copy your numpy array into the surface array, however, which will be much slower than doing actual graphics operations on the pygame surfaces themselves.
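Answer 4
How to show images stored in a numpy array, with an example (works in a Jupyter notebook)
I know there are simpler answers, but this one should help you understand how images are actually drawn from a numpy array.
Load example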
from sklearn.datasets import load_digits
digits = load_digits()
digits.images.shape #this will give you (1797, 8, 8). 1797 images, each 8 x 8 in size
A supplement for doing this with matplotlib, which I found handy for computer vision tasks. Let's say you have data with dtype = int32:
from matplotlib import pyplot as plot
import numpy as np
fig = plot.figure()
ax = fig.add_subplot(1, 1, 1)
# make sure your data is in H W C, otherwise you can change it by
# data = data.transpose((_, _, _))
data = np.zeros((512,512,3), dtype=np.int32)
data[256,256] = [255,0,0]
ax.imshow(data.astype(np.uint8))
Here num_classes stands for number of classes you have. So if you have a vector with shape of (10000,) this function transforms it to (10000,C). Note that a is zero-indexed, i.e. one_hot(np.array([0, 1]), 2) will give [[1, 0], [0, 1]].
numpy.eye(number of classes)[vector containing the labels]
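Written out as runnable code, the one-liner above might look like this; labels and num_classes are made-up example values:

```python
import numpy as np

labels = np.array([0, 2, 1, 2])   # hypothetical label vector
num_classes = 3                   # hypothetical class count

# Row k of the identity matrix is the one-hot vector for class k,
# so indexing with the label vector selects one row per label.
one_hot = np.eye(num_classes, dtype=int)[labels]

print(one_hot)
```

The result has shape (len(labels), num_classes), and every label must be smaller than num_classes.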
Answer 6
Here is a function that converts a 1-D vector into a 2-D one-hot array.
#!/usr/bin/env python
import numpy as np

def convertToOneHot(vector, num_classes=None):
    """
    Converts an input 1-D vector of integers into an output
    2-D array of one-hot vectors, where an i'th input value
    of j will set a '1' in the i'th row, j'th column of the
    output array.
    Example:
        v = np.array((1, 0, 4))
        one_hot_v = convertToOneHot(v)
        print(one_hot_v)
        [[0 1 0 0 0]
         [1 0 0 0 0]
         [0 0 0 0 1]]
    """
    assert isinstance(vector, np.ndarray)
    assert len(vector) > 0
    if num_classes is None:
        num_classes = np.max(vector) + 1
    else:
        assert num_classes > 0
        # the largest label must be a valid column index
        assert num_classes > np.max(vector)
    result = np.zeros(shape=(len(vector), num_classes))
    result[np.arange(len(vector)), vector] = 1
    return result.astype(int)
Here are some usage examples:
>>> a = np.array([1, 0, 3])
>>> convertToOneHot(a)
array([[0, 1, 0, 0],
       [1, 0, 0, 0],
       [0, 0, 0, 1]])
>>> convertToOneHot(a, num_classes=10)
array([[0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]])
I think the short answer is no. For a more generic case in n dimensions, I came up with this:
# For 2-dimensional data, 4 values
a = np.array([[0, 1, 2], [3, 2, 1]])
z = np.zeros(list(a.shape) + [4])
# index with a tuple; recent NumPy versions reject a list of index arrays here
z[tuple(np.indices(z.shape[:-1])) + (a,)] = 1
I am wondering if there is a better solution; I don't like that I have to build those index sequences in the last two lines. Anyway, I did some measurements with timeit and it seems that the numpy-based (indices/arange) and the iterative versions perform about the same.
def onehottify(x, n=None, dtype=float):
    """1-hot encode x with the max value n (computed from data if n is None)."""
    x = np.asarray(x)
    n = np.max(x) + 1 if n is None else n
    return np.eye(n, dtype=dtype)[x]
Also, here is a quick-and-dirty benchmark of this method and a method from the currently accepted answer by YXD (slightly changed, so that they offer the same API except that the latter works only with 1D ndarrays):
def onehottify_only_1d(x, n=None, dtype=float):
    x = np.asarray(x)
    n = np.max(x) + 1 if n is None else n
    b = np.zeros((len(x), n), dtype=dtype)
    b[np.arange(len(x)), x] = 1
    return b
The latter method is ~35% faster (MacBook Pro 13 2015), but the former is more general:
>>> import numpy as np
>>> np.random.seed(42)
>>> a = np.random.randint(0, 9, size=(10_000,))
>>> a
array([6, 3, 7, ..., 5, 8, 6])
>>> %timeit onehottify(a, 10)
188 µs ± 5.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit onehottify_only_1d(a, 10)
139 µs ± 2.78 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
I recently ran into the same kind of problem and found that the solutions above are only satisfactory if your labels form a compact range of integers. For example, if you want to one-hot encode the following list:
all_good_list = [0,1,2,3,4]
go ahead; the solutions posted above will do. But consider this data:
problematic_list = [0,23,12,89,10]
If you use the methods mentioned above, you will likely end up with 90 one-hot columns. This is because all of those answers include something like n = np.max(a)+1. I found a more generic solution that worked for me and wanted to share it:
import numpy as np
from sklearn.preprocessing import LabelBinarizer
sklb = LabelBinarizer()
a = np.asarray([1,2,44,3,2])
n = np.unique(a)
sklb.fit(n)
b = sklb.transform(a)
I hope this comes in handy for anyone who hit the same restriction with the solutions above.
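If you would rather not depend on scikit-learn, a NumPy-only sketch of the same idea is possible with np.unique and its return_inverse option. This is an alternative technique, not what LabelBinarizer does internally, and the variable names are only for illustration:

```python
import numpy as np

a = np.asarray([1, 2, 44, 3, 2])

# classes holds the sorted distinct labels; codes maps each entry of a
# to the position of its label in classes (0 .. len(classes) - 1).
classes, codes = np.unique(a, return_inverse=True)

# One column per distinct label instead of one per integer up to max(a).
one_hot = np.eye(len(classes), dtype=int)[codes.ravel()]

print(classes)
print(one_hot)
```

Keeping classes around lets you map a column index back to the original label later.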
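Answer 12
This kind of encoding is commonly applied to numpy arrays. If you have a numpy array like this:
a = np.array([1,0,3])
then there is a very simple way to convert it to a 1-hot encoding:
out = (np.arange(4) == a[:,None]).astype(np.float32)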
Here’s a dimensionality-independent standalone solution.
This will convert any N-dimensional array arr of nonnegative integers to a one-hot N+1-dimensional array one_hot, where one_hot[i_1,...,i_N,c] = 1 means arr[i_1,...,i_N] = c. You can recover the input via np.argmax(one_hot, -1)
def expand_integer_grid(arr, n_classes):
    """
    :param arr: N dim array of size i_1, ..., i_N
    :param n_classes: C
    :returns: one-hot N+1 dim array of size i_1, ..., i_N, C
    :rtype: ndarray
    """
    one_hot = np.zeros(arr.shape + (n_classes,))
    axes_ranges = [range(arr.shape[i]) for i in range(arr.ndim)]
    flat_grids = [_.ravel() for _ in np.meshgrid(*axes_ranges, indexing='ij')]
    # index with a tuple; recent NumPy versions reject a list of index arrays here
    one_hot[tuple(flat_grids) + (arr.ravel(),)] = 1
    assert((one_hot.sum(-1) == 1).all())
    assert(np.allclose(np.argmax(one_hot, -1), arr))
    return one_hot
Answer 18
Use the following code. It works best.
def one_hot_encode(x):
    """
    argument
        - x: a list of labels
    return
        - one hot encoding matrix (number of labels, number of class)
    """
    encoded = np.zeros((len(x), 10))
    for idx, val in enumerate(x):
        encoded[idx][val] = 1
    return encoded
Found it here. P.S. You don't need to follow the link.
It is good to know which version of numpy you are running, but strictly speaking, if you just need a specific version on your system, you can write:
pip install numpy==1.14.3
This installs the version you need and uninstalls any other version of numpy.
What does np.random.seed do in the below code from a Scikit-Learn tutorial? I’m not very familiar with NumPy’s random state generator stuff, so I’d really appreciate a layman’s terms explanation of this.
np.random.seed(0)
indices = np.random.permutation(len(iris_X))
(pseudo-)random numbers work by starting with a number (the seed), multiplying it by a large number, adding an offset, then taking modulo of that sum. The resulting number is then used as the seed to generate the next “random” number. When you set the seed (every time), it does the same thing every time, giving you the same numbers.
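The multiply, add, modulo recipe described above is a linear congruential generator. Here is a toy sketch of it; the constants a, c and m are illustrative choices, and NumPy's own generator uses a more elaborate algorithm, so this only demonstrates the role of the seed:

```python
def lcg(seed, count, a=1103515245, c=12345, m=2**31):
    """Toy linear congruential generator; a, c and m are illustrative constants."""
    values = []
    state = seed
    for _ in range(count):
        # multiply, add an offset, take the modulo; the result seeds the next step
        state = (a * state + c) % m
        values.append(state)
    return values

first_run = lcg(seed=0, count=3)
second_run = lcg(seed=0, count=3)   # same seed: identical sequence
other_seed = lcg(seed=1, count=3)   # different seed: different sequence

print(first_run, second_run, other_seed)
```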
If you want seemingly random numbers, do not set the seed. If you have code that uses random numbers that you want to debug, however, it can be very helpful to set the seed before each run so that the code does the same thing every time you run it.
To get the most random numbers for each run, call numpy.random.seed(). This will cause numpy to set the seed to a random number obtained from /dev/urandom or its Windows analog or, if neither of those is available, it will use the clock.
For more information on using seeds to generate pseudo-random numbers, see wikipedia.
As noted, numpy.random.seed(0) sets the random seed to 0, so the pseudo-random numbers you get from random will start from the same point. This can be good for debugging in some cases. HOWEVER, after some reading, this seems to be the wrong way to go about it if you have threads, because it is not thread-safe.
For numpy.random.seed(), the main difficulty is that it is not
thread-safe – that is, it’s not safe to use if you have many different
threads of execution, because it’s not guaranteed to work if two
different threads are executing the function at the same time. If
you’re not using threads, and if you can reasonably expect that you
won’t need to rewrite your program this way in the future,
numpy.random.seed() should be fine for testing purposes. If there’s
any reason to suspect that you may need threads in the future, it’s
much safer in the long run to do as suggested, and to make a local
instance of the numpy.random.Random class. As far as I can tell,
random.random.seed() is thread-safe (or at least, I haven’t found any
evidence to the contrary).
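A minimal sketch of the "local instance" idea. I am assuming the class the quote calls numpy.random.Random is numpy.random.RandomState; that is my reading, not something the quote states.

```python
import numpy as np

# A private generator: seeding it does not touch the global numpy.random state.
local_rng = np.random.RandomState(0)
first = local_rng.permutation(5)

# Re-creating the generator with the same seed reproduces the same draw.
second = np.random.RandomState(0).permutation(5)

print(first, second)
```

Each thread (or each component of a program) can hold its own instance this way instead of sharing the global seed.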
Lastly, note that there might be cases where initializing to 0 (as opposed to a seed whose bits are not all 0) results in non-uniform distributions for the first few iterations because of the way xor works, but this depends on the algorithm and is beyond my current worries and the scope of this question.
I have used this very often in neural networks. It is well known that when we start training a neural network we randomly initialise the weights. The model is trained from these weights on a particular dataset. After a number of epochs you get a trained set of weights.
Now suppose you want to train again from scratch, or you want to pass the model to others to reproduce your results. The weights will again be initialised to random numbers, which will mostly differ from the earlier ones. The trained weights obtained after the same number of epochs (keeping the same data and other parameters) will therefore differ from the earlier ones. The problem is that your model is no longer reproducible: every time you train it from scratch, it gives you a different set of weights, because it is initialised with different random numbers every time.
What if, every time you start training from scratch, the model is initialised to the same set of random weights? In that case your model becomes reproducible. This is achieved by numpy.random.seed(0). By setting seed() to a particular number, you always get the same set of random numbers.
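A small sketch of that effect; init_weights is a hypothetical helper written for this example, not part of any library:

```python
import numpy as np

def init_weights(seed, shape=(3, 2)):
    """Hypothetical helper: draw initial weights after fixing the global seed."""
    np.random.seed(seed)
    return np.random.randn(*shape)

w_first = init_weights(0)
w_again = init_weights(0)    # same seed, same initial weights
w_other = init_weights(1)    # different seed, different weights

print(np.array_equal(w_first, w_again), np.array_equal(w_first, w_other))
```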
Imagine you are showing someone how to code something with a bunch of “random” numbers. By using numpy seed they can use the same seed number and get the same set of “random” numbers.
So it’s not exactly random because an algorithm spits out the numbers but it looks like a randomly generated bunch.
A random seed specifies the start point when a computer generates a random number sequence.
For example, let’s say you wanted to generate a random number in Excel (Note: Excel sets a limit of 9999 for the seed). If you enter a number into the Random Seed box during the process, you’ll be able to use the same set of random numbers again. If you typed “77” into the box, and typed “77” the next time you run the random number generator, Excel will display that same set of random numbers. If you type “99”, you’ll get an entirely different set of numbers. But if you revert back to a seed of 77, then you’ll get the same set of random numbers you started with.
For example, “take a number x, add 900 to it, then subtract 52.” In order for the process to start, you have to specify a starting number, x (the seed). Let’s take the starting number 77:
Add 900 + 77 = 977
Subtract 52 = 925
Following the same algorithm, the second “random” number would be:
900 + 925 = 1825
Subtract 52 = 1773
This simple example follows a pattern, but the algorithms behind computer number generation are much more complicated
All the answers above show the implementation of np.random.seed() in code. I’ll try my best to explain briefly why it actually happens. Computers are machines that are designed based on predefined algorithms. Any output from a computer is the result of the algorithm implemented on the input. So when we request a computer to generate random numbers, sure they are random but the computer did not just come up with them randomly!
So when we write np.random.seed(any_number_here) the algorithm will output a particular set of numbers that is unique to the argument any_number_here. It’s almost like a particular set of random numbers can be obtained if we pass the correct argument. But this will require us to know about how the algorithm works which is quite tedious.
So, for example if I write np.random.seed(10) the particular set of numbers that I obtain will remain the same even if I execute the same line after 10 years unless the algorithm changes.
I agree with Joris; it seems like you should be doing this differently, like with numpy record arrays. Modifying “option 2” from this great answer, you could do it like this:
import pandas
import numpy
dtype = [('Col1','int32'), ('Col2','float32'), ('Col3','float32')]
values = numpy.zeros(20, dtype=dtype)
index = ['Row'+str(i) for i in range(1, len(values)+1)]
df = pandas.DataFrame(values, index=index)
Answer 3
This can be done simply by using from_records of pandas DataFrame
import numpy as np
import pandas as pd
# Creating a numpy array
x = np.arange(1,10,1).reshape(-1,1)
dataframe = pd.DataFrame.from_records(x)
Answer 4
>>import pandas as pd
>>import numpy as np
>>data.shape
(480,193)
>>type(data)
numpy.ndarray
>>df=pd.DataFrame(data=data[0:,0:],
... index=[i for i in range(data.shape[0])],
... columns=['f'+str(i) for i in range(data.shape[1])])
>>df.head()
Answer 5
Adding to @behzad.nouri's answer, we can create a helper routine to handle this common scenario:
import pandas as pd

def csvDf(dat, **kwargs):
    from numpy import array
    data = array(dat)
    if data is None or len(data)==0 or len(data[0])==0:
        return None
    else:
        # first row holds the column names, first column holds the row labels
        return pd.DataFrame(data[1:,1:],index=data[1:,0],columns=data[0,1:],**kwargs)
Let’s try it out:
data = [['','a','b','c'],['row1','row1cola','row1colb','row1colc'],
['row2','row2cola','row2colb','row2colc'],['row3','row3cola','row3colb','row3colc']]
In [61]: csvDf(data)
Out[61]:
             a         b         c
row1  row1cola  row1colb  row1colc
row2  row2cola  row2colb  row2colc
row3  row3cola  row3colb  row3colc
Believe it or not, after profiling my current code, the repetitive operation of numpy array reversion ate a giant chunk of the running time. What I have right now is the common view-based method:
reversed_arr = arr[::-1]
Is there any other way to do it more efficiently, or is it just an illusion from my obsession with unrealistic numpy performance?
When you create reversed_arr you are creating a view into the original array. You can then change the original array, and the view will update to reflect the changes.
Are you re-creating the view more often than you need to? You should be able to do something like this:
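A small sketch of creating the view once and reusing it, with a made-up array (this is an illustration, not the snippet from the original answer):

```python
import numpy as np

arr = np.arange(5)
reversed_arr = arr[::-1]        # create the view once

arr[0] = 99                     # modify the original in place...
last_seen = reversed_arr[-1]    # ...and the existing view already reflects it

shares = np.shares_memory(arr, reversed_arr)

print(reversed_arr, last_seen, shares)
```

As long as arr is modified in place rather than rebound to a new array, reversed_arr never needs to be rebuilt.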
I’m not a numpy expert, but this seems like it would be the fastest way to do things in numpy. If this is what you are already doing, I don’t think you can improve on it.
As mentioned above, a[::-1] really only creates a view, so it’s a constant-time operation (and as such doesn’t take longer as the array grows). If you need the array to be contiguous (for example because you’re performing many vector operations with it), ascontiguousarray is about as fast as flipud/fliplr:
Code to generate the plot:
import numpy
import perfplot

perfplot.show(
    setup=lambda n: numpy.random.randint(0, 1000, n),
    kernels=[
        lambda a: a[::-1],
        lambda a: numpy.ascontiguousarray(a[::-1]),
        lambda a: numpy.fliplr([a])[0],
    ],
    labels=["a[::-1]", "ascontiguousarray(a[::-1])", "fliplr"],
    n_range=[2 ** k for k in range(25)],
    xlabel="len(a)",
    logx=True,
    logy=True,
)
Because this does not seem to be marked as answered yet: the answer by Thomas Arildsen should be the proper one. Just use
np.flipud(your_array)
if it is a 1d array (column array).
With matrices, use
fliplr(matrix)
if you want to reverse each row (left to right), and flipud(matrix) if you want to reverse each column (top to bottom). There is no need to turn your 1d array into a 2-dimensional row array (a matrix with one extra None axis) and then flip it.
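A quick sketch of what the two functions do on small example arrays:

```python
import numpy as np

v = np.array([1, 2, 3])
m = np.array([[1, 2, 3],
              [4, 5, 6]])

v_rev = np.flipud(v)     # works directly on a 1d array
m_lr = np.fliplr(m)      # reverses the order within each row
m_ud = np.flipud(m)      # reverses the order of the rows

print(v_rev, m_lr, m_ud, sep="\n")
```

Note that np.fliplr needs at least two dimensions, which is why the 1d case uses np.flipud.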
I will expand on the earlier answer about np.fliplr(). Here is some code that demonstrates constructing a 1d array, transforming it into a 2d array, flipping it, then converting back into a 1d array. time.perf_counter() is used to keep time, which is reported in seconds (the time.clock() call used originally was removed in Python 3.8).
import time
import numpy as np
start = time.perf_counter()
x = np.array(range(3))
# transform to 2d
x = np.atleast_2d(x)
# flip array
x = np.fliplr(x)
# take first (and only) element
x = x[0]
# print(x)
end = time.perf_counter()
print(end - start)
With print statement uncommented:
[2 1 0]
0.00203907123594
With print statement commented out:
5.59799927506e-05
So, in terms of efficiency, I think that’s decent. For those of you that love to do it in one line, here is that form.
np.fliplr(np.atleast_2d(np.array(range(3))))[0]
Answer 5
Expanding on what others have said, I will give a short example.
If you have a 1D array ...
>>> import numpy as np
>>> x = np.arange(4)  # array([0, 1, 2, 3])
>>> x[::-1]  # returns a view
Out[1]:
array([3, 2, 1, 0])
But if you are working with a 2D array ...
>>> x = np.arange(10).reshape(2, 5)
>>> x
Out[2]:
array([[0, 1, 2, 3, 4],
       [5, 6, 7, 8, 9]])
>>> x[::-1]  # returns a view:
Out[3]: array([[5, 6, 7, 8, 9],
       [0, 1, 2, 3, 4]])
+------------+---------+--------+
|            |  A      |  B     |
+------------+---------+---------
|      0     | 0.626386| 1.52325|----axis=1----->
+------------+---------+--------+
             |         |
             | axis=0  |
             ↓         ↓
It specifies the axis along which the means are computed. By default axis=0. This is consistent with the numpy.mean usage when axis is specified explicitly (in numpy.mean, axis==None by default, which computes the mean value over the flattened array) , in which axis=0 along the rows (namely, index in pandas), and axis=1 along the columns. For added clarity, one may choose to specify axis='index' (instead of axis=0) or axis='columns' (instead of axis=1).
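The same convention can be checked on a small NumPy array (the values are made up for the example):

```python
import numpy as np

data = np.array([[1.0, 2.0, 3.0],
                 [5.0, 6.0, 7.0]])   # 2 rows, 3 columns

col_means = data.mean(axis=0)   # collapses the rows: one mean per column
row_means = data.mean(axis=1)   # collapses the columns: one mean per row
overall = data.mean()           # axis=None: mean of the flattened array

print(col_means, row_means, overall)
```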
These answers do help explain this, but it still isn’t perfectly intuitive for a non-programmer (i.e. someone like me who is learning Python for the first time in context of data science coursework). I still find using the terms “along” or “for each” wrt to rows and columns to be confusing.
What makes more sense to me is to say it this way:
Axis 0 will act on all the ROWS in each COLUMN
Axis 1 will act on all the COLUMNS in each ROW
So a mean on axis 0 will be the mean of all the rows in each column, and a mean on axis 1 will be a mean of all the columns in each row.
Ultimately this is saying the same thing as @zhangxaochen and @Michael, but in a way that is easier for me to internalize.
axis=0 means along “indexes”. It’s a row-wise operation.
Suppose we perform a concat() operation on dataframe1 and dataframe2.
We take the 1st row out of dataframe1 and place it into the new DF, then take the next row from dataframe1 and put it into the new DF, repeating until we reach the bottom of dataframe1. Then we do the same for dataframe2.
Basically, this stacks dataframe2 on top of dataframe1 or vice versa.
E.g. making a pile of books on a table or floor.
axis=1 means along “columns”. It’s a column-wise operation.
Suppose we perform a concat() operation on dataframe1 and dataframe2.
We take the 1st complete column (a.k.a. the 1st Series) of dataframe1 and place it into the new DF, then take the second column of dataframe1 and place it adjacent to the first (sideways), repeating until all columns are finished. Then we repeat the same process for dataframe2.
Basically, this stacks dataframe2 sideways.
E.g. arranging books on a bookshelf.
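The pile-of-books versus bookshelf picture can be checked with pd.concat. This is only a sketch: the frames df1 and df2 and their contents are made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df2 = pd.DataFrame({"A": [5, 6], "B": [7, 8]})

# axis=0: rows of df2 go below the rows of df1 (pile of books).
stacked = pd.concat([df1, df2], axis=0)

# axis=1: columns of df2 go to the right of the columns of df1 (bookshelf).
side_by_side = pd.concat([df1, df2], axis=1)

print(stacked.shape)       # (4, 2)
print(side_by_side.shape)  # (2, 4)
```

Note that the original row labels are kept when stacking, so the stacked frame has the index 0, 1, 0, 1 unless ignore_index=True is passed.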
Moreover, arrays represent a nested n-dimensional structure better than matrices do, so the example below may help you visualize the role axis plays when you generalize beyond two dimensions. You can always print or write out an n-dimensional array, whereas drawing the same data in a matrix representation on paper becomes impossible beyond three dimensions.
axis refers to the dimension of the array, in the case of pd.DataFrames axis=0 is the dimension that points downwards and axis=1 the one that points to the right.
Example: Think of an ndarray with shape (3,5,7).
a = np.ones((3,5,7))
a is a 3 dimensional ndarray, i.e. it has 3 axes (“axes” is plural of “axis”). The configuration of a will look like 3 slices of bread where each slice is of dimension 5-by-7. a[0,:,:] will refer to the 0-th slice, a[1,:,:] will refer to the 1-st slice etc.
a.sum(axis=0) will apply sum() along the 0-th axis of a. You will add all the slices and end up with one slice of shape (5,7).
a.sum(axis=0) is equivalent to
b = np.zeros((5,7))
for i in range(5):
    for j in range(7):
        b[i,j] += a[:,i,j].sum()
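To check that equivalence rather than take it on faith, here is a small sketch. It assumes an array filled with np.arange instead of np.ones, so that the slices differ and the comparison is not trivial.

```python
import numpy as np

# Same shape as above, but with distinct values in every cell.
a = np.arange(3 * 5 * 7).reshape(3, 5, 7)

# Explicit version: for every (i, j) position, add the values across the 3 slices.
b = np.zeros((5, 7))
for i in range(5):
    for j in range(7):
        b[i, j] = a[:, i, j].sum()

reduced = a.sum(axis=0)

print(reduced.shape)               # (5, 7)
print(np.array_equal(reduced, b))  # True
```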
In a pd.DataFrame, axes work the same way as in numpy.arrays: axis=0 will apply sum() or any other reduction function for each column.
N.B. In @zhangxaochen’s answer, I find the phrases “along the rows” and “along the columns” slightly confusing. axis=0 should refer to “along each column”, and axis=1 “along each row”.
The easiest way for me to understand is to talk about whether you are calculating a statistic for each column (axis = 0) or each row (axis = 1). If you calculate a statistic, say a mean, with axis = 0 you will get that statistic for each column. So if each observation is a row and each variable is in a column, you would get the mean of each variable. If you set axis = 1 then you will calculate your statistic for each row. In our example, you would get the mean for each observation across all of your variables (perhaps you want the average of related measures).
axis = 0: by column = column-wise = along the rows
Let’s look at the table from Wiki. This is an IMF estimate of GDP from 2010 to 2019 for top ten countries.
1. Axis 1 will act for each row on all the columns. If you want to calculate the average (mean) GDP for EACH country over the decade (2010-2019), use df.mean(axis=1). For example, to calculate the mean GDP of the United States from 2010 to 2019, use df.loc['United States','2010':'2019'].mean() (selecting a single row returns a Series, which has only one axis, so passing axis=1 there would raise an error).
2. Axis 0 will act for each column on all the rows. If you want to calculate the average (mean) GDP for EACH year across all countries, use df.mean(axis=0). For example, to calculate the mean GDP for the year 2015 across United States, China, Japan, Germany and India, use df.loc['United States':'India','2015'].mean(axis=0)
Note: The above code will work only after setting “Country(or dependent territory)” column as the Index, using set_index method.
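The same pattern on a tiny stand-in table, as a sketch only: the country names are hypothetical and the numbers are placeholders, NOT real GDP figures.

```python
import pandas as pd

# Placeholder values, not real GDP data; "Country X/Y/Z" are hypothetical names.
raw = pd.DataFrame(
    {
        "Country": ["Country X", "Country Y", "Country Z"],
        "2010": [1.0, 2.0, 3.0],
        "2011": [3.0, 4.0, 5.0],
    }
)
df = raw.set_index("Country")  # label-based .loc lookups need the country as index

per_country = df.mean(axis=1)  # one mean per row (country), across the year columns
per_year = df.mean(axis=0)     # one mean per column (year), across the countries

# A single row or a single column is a Series, which has only one axis.
one_country = df.loc["Country X", "2010":"2011"].mean()
one_year = df.loc["Country X":"Country Y", "2010"].mean()

print(per_country.tolist())  # [2.0, 3.0, 4.0]
print(per_year.tolist())     # [2.0, 4.0]
```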
The designer of pandas, Wes McKinney, used to work intensively on finance data. Think of columns as stock names and index as daily prices. You can then guess what the default behavior is (i.e., axis=0) with respect to this finance data. axis=1 can be simply thought as ‘the other direction’.
For example, the statistics functions, such as mean(), sum(), describe(), count(), all default to column-wise because it makes more sense to compute them for each stock. sort_index(by=) also defaults to column. fillna(method='ffill') will fill along each column because it is the same stock. dropna() defaults to row because you probably just want to discard the prices on that day instead of throwing away all prices of that stock.
Similarly, the square brackets indexing refers to the columns since it’s more common to pick a stock instead of picking a day.
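A sketch of those defaults on made-up prices, where the tickers "AAA" and "BBB" are hypothetical. It assumes a pandas version that provides DataFrame.ffill(), used here in place of fillna(method='ffill').

```python
import numpy as np
import pandas as pd

# Columns play the role of stocks, rows the role of days (made-up prices).
prices = pd.DataFrame({"AAA": [1.0, np.nan, 3.0], "BBB": [4.0, 5.0, 6.0]})

mean_per_stock = prices.mean()  # default axis=0: one mean per column (stock)
filled = prices.ffill()         # fills down each column: the same stock's last price
complete_days = prices.dropna() # default axis=0: drops the row (day) with a gap
one_stock = prices["AAA"]       # square brackets pick a column (a stock)

print(mean_per_stock.to_dict())       # {'AAA': 2.0, 'BBB': 5.0}
print(filled["AAA"].tolist())         # [1.0, 1.0, 3.0]
print(complete_days.index.tolist())   # [0, 2]
```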
The difficulty in using axis= properly comes from its use in two main, different cases:
For computing an accumulated value, or rearranging (e. g. sorting) data.
For manipulating (“playing” with) entities (e. g. dataframes).
The main idea behind this answer is that for avoiding the confusion, we select either a number, or a name for specifying the particular axis, whichever is more clear, intuitive, and descriptive.
Pandas is based on NumPy, which is based on mathematics, particularly on n-dimensional matrices. Here is an image for common use of axes’ names in math in the 3-dimensional space:
This picture is for memorizing the axes’ ordinal numbers only:
0 for x-axis,
1 for y-axis, and
2 for z-axis.
The z-axis is only for panels; for dataframes we will restrict our interest to the green-colored, 2-dimensional basic plane with x-axis (0, vertical), and y-axis (1, horizontal).
It’s all for numbers as potential values of axis= parameter.
The names of the axes are 'index' (you may use the alias 'rows') and 'columns'. For this explanation, the relation between these names and the ordinal numbers of the axes is NOT important, as everybody knows what the words “rows” and “columns” mean (and everybody here, I suppose, knows what the word “index” means in pandas).
And now, my recommendation:
If you want to compute an accumulated value, you may compute it from values located along axis 0 (or along axis 1) — use axis=0 (or axis=1).
Similarly, if you want to rearrange values, use the axis number of the axis, along which are located data for rearranging (e.g. for sorting).
If you want to manipulate (e.g. concatenate) entities (e.g. dataframes) — use axis='index' (synonym: axis='rows') or axis='columns' to specify the resulting change — index (rows) or columns, respectively.
(For concatenating, you will obtain either a longer index (= more rows), or more columns, respectively.)
This is based on @Safak’s answer.
The best way to understand the axes in pandas/numpy is to create a 3d array and check the result of the sum function along the 3 different axes.
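A short sketch of that experiment, assuming an arbitrary 3d array of shape (2, 3, 4). The axis being summed over is the one that disappears from the shape.

```python
import numpy as np

a = np.arange(24).reshape(2, 3, 4)

# Sum along each of the three axes in turn and record the resulting shape.
shapes = [a.sum(axis=k).shape for k in range(a.ndim)]

for k, shape in enumerate(shapes):
    print(f"axis={k}: {a.shape} -> {shape}")

# axis=0: (2, 3, 4) -> (3, 4)
# axis=1: (2, 3, 4) -> (2, 4)
# axis=2: (2, 3, 4) -> (2, 3)
```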
If your operation requires traversing a dataframe from left to right or right to left, you are merging columns, i.e. you are operating across the various columns.
This is axis=1.
Example
df = pd.DataFrame(np.arange(12).reshape(3,4),columns=['A', 'B', 'C', 'D'])
print(df)
A B C D
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
df.mean(axis=1)
0 1.5
1 5.5
2 9.5
dtype: float64
df.drop(['A','B'],axis=1,inplace=True)
C D
0 2 3
1 6 7
2 10 11
Point to note here is we are operating on columns
Similarly, if your operation requires traversing from top to bottom/bottom to top in a dataframe, you are merging rows. This is axis=0.
My thinking : Axis = n, where n = 0, 1, etc. means that the matrix is collapsed (folded) along that axis. So in a 2D matrix, when you collapse along 0 (rows), you are really operating on one column at a time. Similarly for higher order matrices.
This is not the same as the normal reference to a dimension in a matrix, where 0 -> row and 1 -> column. Similarly for other dimensions in an N dimension array.
For pandas object, axis = 0 stands for row-wise operation and axis = 1 stands for column-wise operation. This is different from numpy by definition, we can check definitions from numpy.doc and pandas.doc
I will explicitly avoid using ‘row-wise’ or ‘along the columns’, since people may interpret them in exactly the wrong way.
Analogy first. Intuitively, you would expect that pandas.DataFrame.drop(axis='columns') drops a column from N columns and gives you (N – 1) columns. So you can pay NO attention to rows for now (and remove the word ‘row’ from your English dictionary). Vice versa, drop(axis='rows') works on rows.
In the same way, sum(axis='columns') works on multiple columns and gives you 1 column. Similarly, sum(axis='rows') results in 1 row. This is consistent with its simplest form of definition, reducing a list of numbers to a single number.
In general, with axis='columns', you see columns, work on columns, and get columns. Forget rows.
With axis='rows', change perspective and work on rows.
0 and 1 are just aliases for 'rows' and 'columns'. It’s the convention of matrix indexing.
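A sketch of this reading, using the axis names 'index' and 'columns' that pandas accepts for DataFrames. The frame and its values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6]})

# drop: the named axis is the one that loses a label.
fewer_cols = df.drop("A", axis="columns")  # 3 columns -> 2 columns
fewer_rows = df.drop(0, axis="index")      # 2 rows -> 1 row

# sum: the named axis is the one that gets reduced to a single entry.
row_totals = df.sum(axis="columns")  # many columns -> 1 column of totals
col_totals = df.sum(axis="index")    # many rows -> 1 row of totals

print(fewer_cols.shape, fewer_rows.shape)  # (2, 2) (1, 3)
print(row_totals.tolist())                 # [9, 12]
print(col_totals.tolist())                 # [3, 7, 11]
```

The numeric forms give the same results: axis=0 is equivalent to axis='index' and axis=1 to axis='columns'.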
I have been trying to figure out axis for the last hour as well. The language in all the above answers, and also in the documentation, is not at all helpful.
To answer the question as I understand it now: in pandas, axis = 1 or 0 indicates which axis headers you want to keep constant when applying the function.
Note: When I say headers, I mean index names.
Expanding your example:
+------------+---------+--------+
| | A | B |
+------------+---------+--------+
| X | 0.626386| 1.52325|
+------------+---------+--------+
| Y | 0.626386| 1.52325|
+------------+---------+--------+
For axis=1=columns : We keep columns headers constant and apply the mean function by changing data.
To demonstrate, we keep the columns headers constant as:
+------------+---------+--------+
| | A | B |
Now we populate one set of A and B values and then find the mean
| | 0.626386| 1.52325|
Then we populate next set of A and B values and find the mean
| | 0.626386| 1.52325|
Similarly, for axis=rows, we keep row headers constant, and keep changing the data:
To demonstrate, first fix the row headers:
+------------+
| X |
+------------+
| Y |
+------------+
Now populate first set of X and Y values and then find the mean
+------------+---------+
| X | 0.626386|
+------------+---------+
| Y | 0.626386|
+------------+---------+
Then populate the next set of X and Y values and then find the mean:
+------------+---------+
| X | 1.52325 |
+------------+---------+
| Y | 1.52325 |
+------------+---------+
In summary,
When axis=columns, you fix the column headers and change data, which will come from the different rows.
When axis=rows, you fix the row headers and change data, which will come from the different columns.
Their behaviours are, intriguingly, easier to understand with three-dimensional arrays than with two-dimensional arrays.
In the Python packages numpy and pandas, the axis parameter of a reduction such as mean tells numpy to calculate the mean of all values that can be fetched in the form array[0, 0, …, i, …, 0], where i iterates through all possible values along that axis. The process is repeated with the position of i fixed while the indices of the other dimensions vary one after another (starting from the rightmost one). The result is an (n-1)-dimensional array.
In R, the MARGIN parameter lets the apply function calculate the mean of all values that can be fetched in the form array[, … , i, … ,], where i iterates through all possible values. The process is not repeated once all i values have been iterated. Therefore, the result is a simple vector.
Arrays are designed with so-called axis=0 and rows positioned vertically versus axis=1 and columns positioned horizontally. Axis refers to the dimension of the array.
I have two simple one-dimensional arrays in NumPy. I should be able to concatenate them using numpy.concatenate. But I get this error for the code below:
TypeError: only length-1 arrays can be converted to Python scalars
Code
import numpy
a = numpy.array([1, 2, 3])
b = numpy.array([5, 6])
numpy.concatenate(a, b)
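The error arises because numpy.concatenate takes the arrays as a single sequence (its first argument), while the second positional argument is the axis. In the call above, b is therefore interpreted as axis. A minimal sketch of the corrected call, wrapping the arrays in a tuple:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([5, 6])

# Pass the arrays as ONE sequence; the second positional argument is the axis.
c = np.concatenate((a, b))

print(c)  # [1 2 3 5 6]
```

A list works as well as a tuple, e.g. np.concatenate([a, b]).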
An alternative is to use the short form of “concatenate”, which is either “r_[…]” or “c_[…]”, as shown in the example code below (see http://wiki.scipy.org/NumPy_for_Matlab_Users for additional information):
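from numpy import array, c_, ones, r_  # explicit imports instead of the %pylab magic

vector_a = r_[0.:10.]  # short form of "arange"
vector_b = array([1, 1, 1, 1])
vector_c = r_[vector_a, vector_b]  # join two 1-D arrays end to end
print(vector_a)
print(vector_b)
print(vector_c, '\n\n')

a = ones((3, 4)) * 4
print(a, '\n')
c = array([1, 1, 1])
b = c_[a, c]  # append c as an extra column
print(b, '\n\n')

a = ones((4, 3)) * 4
print(a, '\n')
c = array([[1, 1, 1]])
b = r_[a, c]  # append c as an extra row
print(b)

print(type(vector_b))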
# we'll utilize the concept of unpacking
In [15]: (*a, *b)
Out[15]: (1, 2, 3, 5, 6)

# using `numpy.ravel()`
In [14]: np.ravel((*a, *b))
Out[14]: array([1, 2, 3, 5, 6])

# wrap the unpacked elements in `numpy.array()`
In [16]: np.array((*a, *b))
Out[16]: array([1, 2, 3, 5, 6])
>>> a = np.array([[1, 2], [3, 4]])
>>> b = np.array([[5, 6]])
# Appending below last row
>>> np.concatenate((a, b), axis=0)
array([[1, 2],
       [3, 4],
       [5, 6]])
# Appending after last column
>>> np.concatenate((a, b.T), axis=1)  # Notice the transpose
array([[1, 2, 5],
       [3, 4, 6]])
# Flattening the final array
>>> np.concatenate((a, b), axis=None)
array([1, 2, 3, 4, 5, 6])