Question: How do I save and load numpy.array() data properly?

I'd like to know how to save and load numpy.array data properly. Currently I'm using the numpy.savetxt() method. For example, say I have an array markers, which looks like this:

[image: contents of the markers array]

I save it with:

numpy.savetxt('markers.txt', markers)

In another script I try to open the previously saved file:

markers = np.fromfile("markers.txt")

And this is what I get:

[image: contents of the loaded array]

The data as first saved looks like this:

0.000000000000000000e+00
0.000000000000000000e+00
0.000000000000000000e+00
0.000000000000000000e+00
0.000000000000000000e+00
0.000000000000000000e+00
0.000000000000000000e+00
0.000000000000000000e+00
0.000000000000000000e+00
0.000000000000000000e+00

But when I save the just-loaded data with the same method, i.e. numpy.savetxt(), it looks like this:

1.398043286095131769e-76
1.398043286095288860e-76
1.396426376485745879e-76
1.398043286055061908e-76
1.398043286095288860e-76
1.182950697433698368e-76
1.398043275797188953e-76
1.398043286095288860e-76
1.210894289234927752e-99
1.398040649781712473e-76

What am I doing wrong? PS: I perform no other “backstage” operations. I just save and load, and that's what I get. Thank you in advance.


Answer 0

The most reliable way I have found to do this is to use np.savetxt with np.loadtxt, not np.fromfile, which is better suited to binary files written with tofile. The np.fromfile and np.tofile methods read and write binary files, whereas np.savetxt writes a text file. So, for example:

In [1]: a = np.array([1, 2, 3, 4])
In [2]: np.savetxt('test1.txt', a, fmt='%d')
In [3]: b = np.loadtxt('test1.txt', dtype=int)
In [4]: a == b
Out[4]: array([ True,  True,  True,  True], dtype=bool)

Or:

In [5]: a.tofile('test2.dat')
In [6]: c = np.fromfile('test2.dat', dtype=int)
In [7]: c == a
Out[7]: array([ True,  True,  True,  True], dtype=bool)

I use the former method even though it is slower and (sometimes) creates bigger files: the binary format written by tofile can be platform dependent (for example, it depends on the endianness of your system), and it stores no dtype or shape information.
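To make the caveat about raw binary files concrete, here is a minimal sketch of what tofile does and does not store. The file name raw.dat and the small int32 array are assumptions made for illustration, not part of the original answer.

```python
import os
import tempfile

import numpy as np

# Illustrative 2x3 array with an explicit dtype (an assumption for this sketch).
a = np.arange(6, dtype=np.int32).reshape(2, 3)

with tempfile.TemporaryDirectory() as tmp:
    raw_path = os.path.join(tmp, "raw.dat")  # hypothetical file name
    a.tofile(raw_path)  # writes only the raw bytes: no shape, no dtype, no byte order

    # Reading back requires supplying the dtype; the result is always 1-D,
    # so the shape has to be restored by hand.
    flat = np.fromfile(raw_path, dtype=np.int32)
    restored = flat.reshape(a.shape)

    # Supplying a different dtype silently reinterprets the same bytes.
    wrong = np.fromfile(raw_path, dtype=np.float32)

print(flat.shape)
print(np.array_equal(restored, a))
print(np.array_equal(wrong, a))
```

Nothing in the file records how the bytes should be interpreted, so the reader has to know the dtype and shape from somewhere else.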

There is a platform independent format for NumPy arrays, which can be saved and read with np.save and np.load:

In  [8]: np.save('test3.npy', a)    # .npy extension is added if not given
In  [9]: d = np.load('test3.npy')
In [10]: a == d
Out[10]: array([ True,  True,  True,  True], dtype=bool)
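As a sketch of why the .npy route is convenient, the example below round-trips a two-dimensional array and checks that shape and dtype survive. The array is a stand-in for the question's markers array; its shape and the file name are assumptions.

```python
import os
import tempfile

import numpy as np

# Stand-in for the question's array (shape chosen only for illustration).
markers = np.zeros((5, 2), dtype=np.float64)

with tempfile.TemporaryDirectory() as tmp:
    npy_path = os.path.join(tmp, "markers.npy")  # hypothetical file name
    np.save(npy_path, markers)   # the .npy header records dtype, shape and byte order
    loaded = np.load(npy_path)   # no dtype or shape argument needed

print(loaded.shape, loaded.dtype)
print(np.array_equal(markers, loaded))
```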

Answer 1

np.save('data.npy', num_arr) # save
new_num_arr = np.load('data.npy') # load

Answer 2

np.fromfile() has a sep= keyword argument:

Separator between items if file is a text file. Empty ("") separator means the file should be treated as binary. Spaces (" ") in the separator match zero or more whitespace characters. A separator consisting only of spaces must match at least one whitespace.

The default value of sep="" means that np.fromfile() tries to read it as a binary file rather than a space-separated text file, so you get nonsense values back. If you use np.fromfile('markers.txt', sep=" ") you will get the result you are looking for.

However, as others have pointed out, np.loadtxt() is the preferred way to convert text files to numpy arrays, and unless the file needs to be human-readable it is usually better to use binary formats instead (e.g. np.load()/np.save()).
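The behaviour described in this answer can be reproduced with a short sketch. The ten-element zero array mirrors the data shown in the question; the temporary file location is an assumption.

```python
import os
import tempfile

import numpy as np

markers = np.zeros(10)  # same values as the data shown in the question

with tempfile.TemporaryDirectory() as tmp:
    txt_path = os.path.join(tmp, "markers.txt")
    np.savetxt(txt_path, markers)  # writes one number per line, as text

    # Default sep="" treats the file as binary: the ASCII characters of the
    # text are reinterpreted as float64 values, which gives nonsense numbers.
    as_binary = np.fromfile(txt_path)

    # sep=" " parses the file as whitespace-separated text instead.
    as_text = np.fromfile(txt_path, sep=" ")

    # np.loadtxt is the usual counterpart of np.savetxt.
    via_loadtxt = np.loadtxt(txt_path)

print(np.array_equal(as_binary, markers))
print(np.array_equal(as_text, markers))
print(np.array_equal(via_loadtxt, markers))
```

Only the two text-aware reads recover the original zeros; the binary read returns a different number of elements with meaningless values.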


Answer 3

For a short answer: use np.save and np.load. Their advantage is that they are written by the developers of the numpy library and they already work (and are likely already well optimized), e.g.

import numpy as np
from pathlib import Path

path = Path('~/data/tmp/').expanduser()
path.mkdir(parents=True, exist_ok=True)

lb,ub = -1,1
num_samples = 5
x = np.random.uniform(low=lb,high=ub,size=(1,num_samples))
y = x**2 + x + 2

np.save(path/'x', x)
np.save(path/'y', y)

x_loaded = np.load(path/'x.npy')
y_load = np.load(path/'y.npy')

print(x is x_loaded) # False
print(x == x_loaded) # [[ True  True  True  True  True]]

Expanded answer:

In the end it really depends on your needs, because you can also save in a human-readable format (see “Dump a NumPy array into a csv file”) or even use other libraries if your files are extremely large (see “best way to preserve numpy arrays on disk” for an expanded discussion).
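For the human-readable route, a minimal sketch of a CSV round trip with np.savetxt and np.loadtxt might look like the following. The small table, the column names in the header, and the file name are all assumptions made for illustration.

```python
import os
import tempfile

import numpy as np

# Small illustrative table (values chosen arbitrarily).
table = np.array([[1.5, 2.0], [3.25, 4.0]])

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "table.csv")  # hypothetical file name

    # The header is written as a comment line (prefixed with "# " by default),
    # which np.loadtxt skips when reading.
    np.savetxt(csv_path, table, delimiter=",", header="a,b")
    loaded = np.loadtxt(csv_path, delimiter=",")

print(loaded.shape)
print(np.array_equal(table, loaded))
```

The file can be opened in any text editor or spreadsheet, at the cost of a larger file and slower parsing than the binary formats.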

However (expanding, since you use the word “properly” in your question), I still think that using the numpy functions out of the box (as most code does!) will most likely satisfy most users' needs. The most important reason is that they already work. Trying to use something else for any other reason might take you down an unexpectedly LONG rabbit hole of figuring out why it doesn't work and forcing it to work.

Take for example trying to save it with pickle. I tried that just for fun and it took me at least 30 minutes to realize that pickle wouldn't save my data unless I opened the file in binary mode ('wb' to write, 'rb' to read). It took time to google, try things, understand the error messages, etc. A small detail, but the fact that it already required me to open a file complicated things in unexpected ways. On top of that, it required me to re-read this (which, by the way, is somewhat confusing): “Difference between modes a, a+, w, w+, and r+ in built-in open function?”.

So if there is an interface that meets your needs, use it, unless you have a (very) good reason not to (e.g. compatibility with MATLAB, or you really want to read the file by eye and printing in Python really doesn't meet your needs, which might be questionable). Furthermore, if you do need to optimize, you'll most likely find out later down the line (rather than spending ages debugging pointless things like opening a simple numpy file).

So use the interface numpy provides. It might not be perfect, but it's most likely fine, especially for a library that's been around as long as numpy.

I've already spent the time saving and loading data with numpy in a bunch of ways, so have fun with it. Hope it helps!

import numpy as np
import pickle
from pathlib import Path

path = Path('~/data/tmp/').expanduser()
path.mkdir(parents=True, exist_ok=True)

lb,ub = -1,1
num_samples = 5
x = np.random.uniform(low=lb,high=ub,size=(1,num_samples))
y = x**2 + x + 2

# using save (to npy), savez (to npz)
np.save(path/'x', x)
np.save(path/'y', y)
np.savez(path/'db', x=x, y=y)
with open(path/'db.pkl', 'wb') as db_file:
    pickle.dump(obj={'x':x, 'y':y}, file=db_file)

## using loading npy, npz files
x_loaded = np.load(path/'x.npy')
y_load = np.load(path/'y.npy')
db = np.load(path/'db.npz')
with open(path/'db.pkl', 'rb') as db_file:
    db_pkl = pickle.load(db_file)

print(x is x_loaded)
print(x == x_loaded)
print(x == db['x'])
print(x == db_pkl['x'])
print('done')

Some comments on what I learned:

  • np.save: as expected, this already stores the array compactly (see https://stackoverflow.com/a/55750128/1601580) and works out of the box without opening any file yourself. Strictly speaking, .npy is a plain binary format rather than a compressed one. Clean. Easy. Efficient. Use it.
  • np.savez uses an uncompressed format; the docs say: “Save several arrays into a single file in uncompressed .npz format.” If you decide to use it (you were warned about moving away from the standard solution, so expect bugs!), you will find that you need to pass the arrays as keyword arguments to choose their names, unless you are happy with the default names. So don't use this if np.save already works (if anything works, use that!).
  • Pickle also allows arbitrary code execution. Some people might not want to use it for security reasons.
  • Human-readable files are expensive to make, etc. Probably not worth it.
  • There is something called hdf5 for large files. Cool! https://stackoverflow.com/a/9619713/1601580
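The remark about argument names in np.savez can be illustrated with a short sketch. The arrays and file names below are assumptions; np.savez_compressed is NumPy's compressed counterpart of np.savez and is shown only for comparison.

```python
import os
import tempfile

import numpy as np

x = np.linspace(-1.0, 1.0, 5)
y = x**2 + x + 2

with tempfile.TemporaryDirectory() as tmp:
    named_path = os.path.join(tmp, "named.npz")            # hypothetical file names
    positional_path = os.path.join(tmp, "positional.npz")
    compressed_path = os.path.join(tmp, "compressed.npz")

    np.savez(named_path, x=x, y=y)                  # keys are the keyword names
    np.savez(positional_path, x, y)                 # keys default to arr_0, arr_1
    np.savez_compressed(compressed_path, x=x, y=y)  # same layout, but compressed

    # np.load returns a lazy, dict-like NpzFile; closing it releases the file.
    with np.load(named_path) as db:
        named_keys = sorted(db.files)
        x_back = db["x"]
    with np.load(positional_path) as db:
        positional_keys = sorted(db.files)
    with np.load(compressed_path) as db:
        y_back = db["y"]

print(named_keys)
print(positional_keys)
print(np.array_equal(x, x_back), np.array_equal(y, y_back))
```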

Note this is not an exhaustive answer. But for other resources check this:

