问题:将HDF5用于大型阵列存储(而不是平面二进制文件)是否具有分析速度或内存使用优势?

我正在处理大型3D阵列,通常需要以各种方式对其进行切片以进行各种数据分析。一个典型的“立方体”可以达到〜100GB(将来可能会更大)

似乎对于python中的大型数据集,通常推荐的文件格式是使用HDF5(h5py或pytables)。我的问题是:使用HDF5来存储和分析这些多维数据集,而不是将它们存储在简单的平面二进制文件中,对速度或内存使用有好处吗?HDF5是否更适合表格数据,而不是像我正在使用的大型数组?我看到HDF5可以提供很好的压缩,但是我对处理速度和处理内存溢出更感兴趣。

我经常只想分析多维数据集的一个大子集。pytables和h5py的一个缺点似乎是,当我对数组进行切片时,我总是得到一个numpy数组,占用了内存。但是,如果我对平面二进制文件的numpy内存映射进行切片,则可以获得一个视图,该视图将数据保留在磁盘上。因此,看来我可以更轻松地分析数据的特定扇区,而不会耗尽内存。

我已经浏览了pytables和h5py,到目前为止,还没有看到两者对我的好处。

I am processing large 3D arrays, which I often need to slice in various ways to do a variety of data analysis. A typical “cube” can be ~100GB (and will likely get larger in the future)

It seems that the typical recommended file format for large datasets in python is to use HDF5 (either h5py or pytables). My question is: is there any speed or memory usage benefit to using HDF5 to store and analyze these cubes over storing them in simple flat binary files? Is HDF5 more appropriate for tabular data, as opposed to large arrays like what I am working with? I see that HDF5 can provide nice compression, but I am more interested in processing speed and dealing with memory overflow.

I frequently want to analyze only one large subset of the cube. One drawback of both pytables and h5py is it seems is that when I take a slice of the array, I always get a numpy array back, using up memory. However, if I slice a numpy memmap of a flat binary file, I can get a view, which keeps the data on disk. So, it seems that I can more easily analyze specific sectors of my data without overrunning my memory.

I have explored both pytables and h5py, and haven’t seen the benefit of either so far for my purpose.


回答 0

HDF5的优势:组织,灵活性,互操作性

HDF5的一些主要优点是其层次结构(类似于文件夹/文件),与每个项目一起存储的可选任意元数据以及其灵活性(例如压缩)。这种组织结构和元数据存储听起来很琐碎,但在实践中非常有用。

HDF的另一个优点是,数据集可以是固定大小的灵活的尺寸。因此,将数据附加到大型数据集很容易,而无需创建整个新副本。

此外,HDF5是一种标准格式,几乎所有语言都可以使用库,因此使用HDF可以很容易地在Matlab,Fortran,R,C和Python之间共享磁盘数据。(公平地说,只要知道C与F的顺序并知道存储数组的形状,dtype等,使用大型二进制数组也不太困难。)

HDF对于大型阵列的优势:更快的任意切片I / O

就像TL / DR一样:对于约8GB的3D阵列,使用分块的HDF5数据集沿任何轴读取“完整”切片大约需要20秒,而对于3个小时(最坏情况),则需要0.3秒(最佳情况)相同数据的映射数组。

除了上面列出的内容外,“块状” *磁盘数据格式(例如HDF5)还有另一个大优势:读取任意片(强调任意)通常会更快,因为磁盘上的数据更连续。平均。

*(HDF5不必是分块的数据格式。它支持分块,但不是必需的。实际上h5py,如果我没记错的话,在其中创建数据集的默认设置是不分块。)

基本上,对于一个给定的数据集片,最佳情况下的磁盘读取速度和最坏情况下的磁盘读取速度将与分块的HDF数据集相当接近(假设您选择了一个合理的块大小,或者让一个库为您选择了一个)。用一个简单的二进制数组,最好的情况也比较快,但最坏的情况是糟糕。

请注意,如果您有SSD,则可能不会注意到读/写速度的巨大差异。但是,使用常规硬盘驱动器,顺序读取要比随机读取快得多。(即,普通硬盘驱动器需要很长的seek时间。)HDF在SSD上仍然具有优势,但更多的原因是其其他功能(例如,元数据,组织等)而不是原始速度。


首先,为了消除混乱,访问h5py数据集将返回一个对象,该对象的行为与numpy数组非常相似,但是直到将数据切片后才将其加载到内存中。(类似于memmap,但不完全相同。)有关更多信息,请参见h5py介绍

对数据集进行切片会将数据的一个子集加载到内存中,但是大概您想对它做点什么,这时无论如何您都将需要它在内存中。

如果您确实想进行核外计算,则可以使用pandas或轻松地获得表格数据pytables。可能h5py(大型ND阵列更合适),但是您需要降低到较低的水平并自己处理迭代。

但是,类似numpy的内核外计算的未来是Blaze。如果您真的想走那条路,请看一下


“无条件”案例

首先,考虑一个写入磁盘的3D C顺序数组(我将通过调用arr.ravel()和打印结果来模拟它,以使内容更加可见):

In [1]: import numpy as np

In [2]: arr = np.arange(4*6*6).reshape(4,6,6)

In [3]: arr
Out[3]:
array([[[  0,   1,   2,   3,   4,   5],
        [  6,   7,   8,   9,  10,  11],
        [ 12,  13,  14,  15,  16,  17],
        [ 18,  19,  20,  21,  22,  23],
        [ 24,  25,  26,  27,  28,  29],
        [ 30,  31,  32,  33,  34,  35]],

       [[ 36,  37,  38,  39,  40,  41],
        [ 42,  43,  44,  45,  46,  47],
        [ 48,  49,  50,  51,  52,  53],
        [ 54,  55,  56,  57,  58,  59],
        [ 60,  61,  62,  63,  64,  65],
        [ 66,  67,  68,  69,  70,  71]],

       [[ 72,  73,  74,  75,  76,  77],
        [ 78,  79,  80,  81,  82,  83],
        [ 84,  85,  86,  87,  88,  89],
        [ 90,  91,  92,  93,  94,  95],
        [ 96,  97,  98,  99, 100, 101],
        [102, 103, 104, 105, 106, 107]],

       [[108, 109, 110, 111, 112, 113],
        [114, 115, 116, 117, 118, 119],
        [120, 121, 122, 123, 124, 125],
        [126, 127, 128, 129, 130, 131],
        [132, 133, 134, 135, 136, 137],
        [138, 139, 140, 141, 142, 143]]])

这些值将按顺序存储在磁盘上,如下面的第4行所示。(暂时忽略文件系统详细信息和碎片。)

In [4]: arr.ravel(order='C')
Out[4]:
array([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,
        13,  14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,
        26,  27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,
        39,  40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,
        52,  53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,
        65,  66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,
        78,  79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,
        91,  92,  93,  94,  95,  96,  97,  98,  99, 100, 101, 102, 103,
       104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116,
       117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129,
       130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143])

在最佳情况下,让我们沿第一个轴剖一下。请注意,这些只是数组的前36个值。这将是一个非常快的阅读!(一寻一读)

In [5]: arr[0,:,:]
Out[5]:
array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17],
       [18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, 29],
       [30, 31, 32, 33, 34, 35]])

同样,沿第一个轴的下一个切片将仅是下一个36个值。要沿该轴读取完整的切片,我们只需执行一项seek操作。如果我们要读取的只是沿着该轴的各个切片,那么这就是理想的文件结构。

但是,让我们考虑最坏的情况:沿着最后一个轴的切片。

In [6]: arr[:,:,0]
Out[6]:
array([[  0,   6,  12,  18,  24,  30],
       [ 36,  42,  48,  54,  60,  66],
       [ 72,  78,  84,  90,  96, 102],
       [108, 114, 120, 126, 132, 138]])

要读入此片,我们需要36次搜索和36次读取,因为所有值都在磁盘上分开。他们都不是相邻的!

这看起来似乎很小,但是随着我们使用越来越大的数组,操作的数量和大小seek迅速增长。对于以这种方式存储并通过via读入的大型(〜10Gb)3D阵列memmap,即使使用现代硬件,沿“最差”轴读取整个切片也很容易花费数十分钟。同时,沿最佳轴的切片可能会花费不到一秒钟的时间。为了简单起见,我只显示单个轴上的“完整”切片,但是对于数据的任何子集的任意切片,都会发生完全相同的事情。

顺便说一句,有几种文件格式可以利用此功能,并且基本上在磁盘上存储三份巨大的 3D阵列副本:一份以C顺序,一份以F顺序,一份在两者之间的中间位置。(一个示例是Geoprobe的D3D格式,尽管我不确定它是否在任何地方都有记录。)谁在乎最终文件大小是否为4TB,但存储空间却很便宜!疯狂的是,由于主要用例是在每个方向上提取单个子切片,因此您要进行的读取非常非常快。效果很好!


简单的“块状”案例

假设我们将3D数组的2x2x2“块”存储为磁盘上的连续块。换句话说,类似:

nx, ny, nz = arr.shape
slices = []
for i in range(0, nx, 2):
    for j in range(0, ny, 2):
        for k in range(0, nz, 2):
            slices.append((slice(i, i+2), slice(j, j+2), slice(k, k+2)))

chunked = np.hstack([arr[chunk].ravel() for chunk in slices])

因此磁盘上的数据如下所示chunked

array([  0,   1,   6,   7,  36,  37,  42,  43,   2,   3,   8,   9,  38,
        39,  44,  45,   4,   5,  10,  11,  40,  41,  46,  47,  12,  13,
        18,  19,  48,  49,  54,  55,  14,  15,  20,  21,  50,  51,  56,
        57,  16,  17,  22,  23,  52,  53,  58,  59,  24,  25,  30,  31,
        60,  61,  66,  67,  26,  27,  32,  33,  62,  63,  68,  69,  28,
        29,  34,  35,  64,  65,  70,  71,  72,  73,  78,  79, 108, 109,
       114, 115,  74,  75,  80,  81, 110, 111, 116, 117,  76,  77,  82,
        83, 112, 113, 118, 119,  84,  85,  90,  91, 120, 121, 126, 127,
        86,  87,  92,  93, 122, 123, 128, 129,  88,  89,  94,  95, 124,
       125, 130, 131,  96,  97, 102, 103, 132, 133, 138, 139,  98,  99,
       104, 105, 134, 135, 140, 141, 100, 101, 106, 107, 136, 137, 142, 143])

只是为了表明它们是的2x2x2块arr,请注意这些是的前8个值chunked

In [9]: arr[:2, :2, :2]
Out[9]:
array([[[ 0,  1],
        [ 6,  7]],

       [[36, 37],
        [42, 43]]])

要读取沿轴的任何切片,我们将读取6或9个连续的块(所需数据的两倍),然后仅保留所需的部分。在最坏的情况下,最大搜索数为9,而非分块版本的搜索数最大为36。(但是最好的情况仍然是6个搜索,而对于内存阵列则为1个。)由于顺序读取与搜索相比非常快,因此大大减少了将任意子集读入内存所需的时间。再次,这种效果随着更大的阵列而变得更大。

HDF5更进一步。块不必连续存储,它们由B树索引。此外,它们不必在磁盘上具有相同的大小,因此可以将压缩应用于每个块。


分块数组 h5py

默认情况下,h5py不会在磁盘上创建分块的HDF文件(pytables相比之下,我认为是)。chunks=True但是,如果在创建数据集时指定,则会在磁盘上获得分块数组。

作为一个简短的示例:

import numpy as np
import h5py

data = np.random.random((100, 100, 100))

with h5py.File('test.hdf', 'w') as outfile:
    dset = outfile.create_dataset('a_descriptive_name', data=data, chunks=True)
    dset.attrs['some key'] = 'Did you want some metadata?'

请注意,这会chunks=True告诉h5py我们自动选择块大小。如果您最了解最常用的用例,则可以通过指定形状元组来优化块的大小/形状(例如(2,2,2),在上面的简单示例中)。这使您可以更有效地沿特定轴进行读取,或针对特定大小的读取/写入进行优化。


I / O性能比较

为了强调这一点,让我们比较从分块的HDF5数据集和包含相同确切数据的大型(〜8GB)Fortran排序3D数组中读取的片段。

我已经清除了每次运行之间的所有操作系统缓存,因此我们看到的是“冷”性能。

对于每种文件类型,我们将测试沿第一个轴的“完整” x切片和沿最后一个轴的“完整” z切片读取。对于按Fortran排序的映射数组,“ x”切片是最坏的情况,而“ z”切片是最好的情况。

所使用的代码基于要点(包括创建hdf文件)。我无法轻松共享此处使用的数据,但是您可以通过形状相同的零数组(621, 4991, 2600)和type)来模拟它np.uint8

chunked_hdf.py如下所示:

import sys
import h5py

def main():
    data = read()

    if sys.argv[1] == 'x':
        x_slice(data)
    elif sys.argv[1] == 'z':
        z_slice(data)

def read():
    f = h5py.File('/tmp/test.hdf5', 'r')
    return f['seismic_volume']

def z_slice(data):
    return data[:,:,0]

def x_slice(data):
    return data[0,:,:]

main()

memmapped_array.py相似,但要确保将切片实际加载到内存中要复杂得多(默认情况下,memmapped将返回另一个数组,这不是苹果与苹果的比较)。

import numpy as np
import sys

def main():
    data = read()

    if sys.argv[1] == 'x':
        x_slice(data)
    elif sys.argv[1] == 'z':
        z_slice(data)

def read():
    big_binary_filename = '/data/nankai/data/Volumes/kumdep01_flipY.3dv.vol'
    shape = 621, 4991, 2600
    header_len = 3072

    data = np.memmap(filename=big_binary_filename, mode='r', offset=header_len,
                     order='F', shape=shape, dtype=np.uint8)
    return data

def z_slice(data):
    dat = np.empty(data.shape[:2], dtype=data.dtype)
    dat[:] = data[:,:,0]
    return dat

def x_slice(data):
    dat = np.empty(data.shape[1:], dtype=data.dtype)
    dat[:] = data[0,:,:]
    return dat

main()

首先让我们看一下HDF的性能:

jofer at cornbread in ~ 
$ sudo ./clear_cache.sh

jofer at cornbread in ~ 
$ time python chunked_hdf.py z
python chunked_hdf.py z  0.64s user 0.28s system 3% cpu 23.800 total

jofer at cornbread in ~ 
$ sudo ./clear_cache.sh

jofer at cornbread in ~ 
$ time python chunked_hdf.py x
python chunked_hdf.py x  0.12s user 0.30s system 1% cpu 21.856 total

“完整的” x切片和“完整的” z切片大约需要相同的时间(约20秒)。考虑到这是一个8GB的阵列,还算不错。大多数时候

并且,如果将其与映射的数组时间进行比较(按Fortran顺序排序:最佳情况是“ z切片”,最坏情况是“ x切片”):

jofer at cornbread in ~ 
$ sudo ./clear_cache.sh

jofer at cornbread in ~ 
$ time python memmapped_array.py z
python memmapped_array.py z  0.07s user 0.04s system 28% cpu 0.385 total

jofer at cornbread in ~ 
$ sudo ./clear_cache.sh

jofer at cornbread in ~ 
$ time python memmapped_array.py x
python memmapped_array.py x  2.46s user 37.24s system 0% cpu 3:35:26.85 total

是的,你看的没错。一个切片方向为0.3秒,另一个切片方向为〜3.5 小时

在“ x”方向上进行切片的时间远远大于将整个8GB阵列加载到内存中并选择所需切片所需的时间!(同样,这是一个按Fortran顺序排列的数组。对于C顺序的数组来说,相反的x / z slice时序将是这种情况。)

但是,如果我们一直想沿最佳情况方向进行切片,那么磁盘上的大二进制数组就非常好。(〜0.3秒!)

对于映射阵列,您将陷入这种I / O差异(或者各向异性是一个更好的说法)。但是,对于分块的HDF数据集,您可以选择分块大小,以使访问权限相等或针对特定用例进行了优化。它为您提供了更多的灵活性。

综上所述

希望这可以帮助您至少部分解决问题。HDF5与“原始”内存映射相比还具有许多其他优点,但是我在这里没有足够的空间来扩展它们。压缩可以加快某些速度(我处理的数据不能从压缩中获得很多好处,因此我很少使用它),并且与“原始”内存映射相比,HDF5文件对OS级缓存的播放效果更好。除此之外,HDF5是一种非常出色的容器格式。它为您提供了很大的灵活性来管理数据,并且可以或多或少地使用任何编程语言来使用它。

总体而言,尝试一下,看看它是否适合您的用例。我想您可能会感到惊讶。

HDF5 Advantages: Organization, flexibility, interoperability

Some of the main advantages of HDF5 are its hierarchical structure (similar to folders/files), optional arbitrary metadata stored with each item, and its flexibility (e.g. compression). This organizational structure and metadata storage may sound trivial, but it’s very useful in practice.

Another advantage of HDF is that the datasets can be either fixed-size or flexibly sized. Therefore, it’s easy to append data to a large dataset without having to create an entire new copy.

Additionally, HDF5 is a standardized format with libraries available for almost any language, so sharing your on-disk data between, say Matlab, Fortran, R, C, and Python is very easy with HDF. (To be fair, it’s not too hard with a big binary array, too, as long as you’re aware of the C vs. F ordering and know the shape, dtype, etc of the stored array.)

HDF advantages for a large array: Faster I/O of an arbitrary slice

Just as the TL/DR: For an ~8GB 3D array, reading a “full” slice along any axis took ~20 seconds with a chunked HDF5 dataset, and 0.3 seconds (best-case) to over three hours (worst case) for a memmapped array of the same data.

Beyond the things listed above, there’s another big advantage to a “chunked”* on-disk data format such as HDF5: Reading an arbitrary slice (emphasis on arbitrary) will typically be much faster, as the on-disk data is more contiguous on average.

*(HDF5 doesn’t have to be a chunked data format. It supports chunking, but doesn’t require it. In fact, the default for creating a dataset in h5py is not to chunk, if I recall correctly.)

Basically, your best case disk-read speed and your worst case disk read speed for a given slice of your dataset will be fairly close with a chunked HDF dataset (assuming you chose a reasonable chunk size or let a library choose one for you). With a simple binary array, the best-case is faster, but the worst-case is much worse.

One caveat, if you have an SSD, you likely won’t notice a huge difference in read/write speed. With a regular hard drive, though, sequential reads are much, much faster than random reads. (i.e. A regular hard drive has long seek time.) HDF still has an advantage on an SSD, but it’s more due its other features (e.g. metadata, organization, etc) than due to raw speed.


First off, to clear up confusion, accessing an h5py dataset returns an object that behaves fairly similarly to a numpy array, but does not load the data into memory until it’s sliced. (Similar to memmap, but not identical.) Have a look at the h5py introduction for more information.

Slicing the dataset will load a subset of the data into memory, but presumably you want to do something with it, at which point you’ll need it in memory anyway.

If you do want to do out-of-core computations, you can fairly easily for tabular data with pandas or pytables. It is possible with h5py (nicer for big N-D arrays), but you need to drop down to a touch lower level and handle the iteration yourself.

However, the future of numpy-like out-of-core computations is Blaze. Have a look at it if you really want to take that route.


The “unchunked” case

First off, consider a 3D C-ordered array written to disk (I’ll simulate it by calling arr.ravel() and printing the result, to make things more visible):

In [1]: import numpy as np

In [2]: arr = np.arange(4*6*6).reshape(4,6,6)

In [3]: arr
Out[3]:
array([[[  0,   1,   2,   3,   4,   5],
        [  6,   7,   8,   9,  10,  11],
        [ 12,  13,  14,  15,  16,  17],
        [ 18,  19,  20,  21,  22,  23],
        [ 24,  25,  26,  27,  28,  29],
        [ 30,  31,  32,  33,  34,  35]],

       [[ 36,  37,  38,  39,  40,  41],
        [ 42,  43,  44,  45,  46,  47],
        [ 48,  49,  50,  51,  52,  53],
        [ 54,  55,  56,  57,  58,  59],
        [ 60,  61,  62,  63,  64,  65],
        [ 66,  67,  68,  69,  70,  71]],

       [[ 72,  73,  74,  75,  76,  77],
        [ 78,  79,  80,  81,  82,  83],
        [ 84,  85,  86,  87,  88,  89],
        [ 90,  91,  92,  93,  94,  95],
        [ 96,  97,  98,  99, 100, 101],
        [102, 103, 104, 105, 106, 107]],

       [[108, 109, 110, 111, 112, 113],
        [114, 115, 116, 117, 118, 119],
        [120, 121, 122, 123, 124, 125],
        [126, 127, 128, 129, 130, 131],
        [132, 133, 134, 135, 136, 137],
        [138, 139, 140, 141, 142, 143]]])

The values would be stored on-disk sequentially as shown on line 4 below. (Let’s ignore filesystem details and fragmentation for the moment.)

In [4]: arr.ravel(order='C')
Out[4]:
array([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,
        13,  14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,
        26,  27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,
        39,  40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,
        52,  53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,
        65,  66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,
        78,  79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,
        91,  92,  93,  94,  95,  96,  97,  98,  99, 100, 101, 102, 103,
       104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116,
       117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129,
       130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143])

In the best case scenario, let’s take a slice along the first axis. Notice that these are just the first 36 values of the array. This will be a very fast read! (one seek, one read)

In [5]: arr[0,:,:]
Out[5]:
array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17],
       [18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, 29],
       [30, 31, 32, 33, 34, 35]])

Similarly, the next slice along the first axis will just be the next 36 values. To read a complete slice along this axis, we only need one seek operation. If all we’re going to be reading is various slices along this axis, then this is the perfect file structure.

However, let’s consider the worst-case scenario: A slice along the last axis.

In [6]: arr[:,:,0]
Out[6]:
array([[  0,   6,  12,  18,  24,  30],
       [ 36,  42,  48,  54,  60,  66],
       [ 72,  78,  84,  90,  96, 102],
       [108, 114, 120, 126, 132, 138]])

To read this slice in, we need 36 seeks and 36 reads, as all of the values are separated on disk. None of them are adjacent!

This may seem pretty minor, but as we get to larger and larger arrays, the number and size of the seek operations grows rapidly. For a large-ish (~10Gb) 3D array stored in this way and read in via memmap, reading a full slice along the “worst” axis can easily take tens of minutes, even with modern hardware. At the same time, a slice along the best axis can take less than a second. For simplicity, I’m only showing “full” slices along a single axis, but the exact same thing happens with arbitrary slices of any subset of the data.

Incidentally there are several file formats that take advantage of this and basically store three copies of huge 3D arrays on disk: one in C-order, one in F-order, and one in the intermediate between the two. (An example of this is Geoprobe’s D3D format, though I’m not sure it’s documented anywhere.) Who cares if the final file size is 4TB, storage is cheap! The crazy thing about that is that because the main use case is extracting a single sub-slice in each direction, the reads you want to make are very, very fast. It works very well!


The simple “chunked” case

Let’s say we store 2x2x2 “chunks” of the 3D array as contiguous blocks on disk. In other words, something like:

nx, ny, nz = arr.shape
slices = []
for i in range(0, nx, 2):
    for j in range(0, ny, 2):
        for k in range(0, nz, 2):
            slices.append((slice(i, i+2), slice(j, j+2), slice(k, k+2)))

chunked = np.hstack([arr[chunk].ravel() for chunk in slices])

So the data on disk would look like chunked:

array([  0,   1,   6,   7,  36,  37,  42,  43,   2,   3,   8,   9,  38,
        39,  44,  45,   4,   5,  10,  11,  40,  41,  46,  47,  12,  13,
        18,  19,  48,  49,  54,  55,  14,  15,  20,  21,  50,  51,  56,
        57,  16,  17,  22,  23,  52,  53,  58,  59,  24,  25,  30,  31,
        60,  61,  66,  67,  26,  27,  32,  33,  62,  63,  68,  69,  28,
        29,  34,  35,  64,  65,  70,  71,  72,  73,  78,  79, 108, 109,
       114, 115,  74,  75,  80,  81, 110, 111, 116, 117,  76,  77,  82,
        83, 112, 113, 118, 119,  84,  85,  90,  91, 120, 121, 126, 127,
        86,  87,  92,  93, 122, 123, 128, 129,  88,  89,  94,  95, 124,
       125, 130, 131,  96,  97, 102, 103, 132, 133, 138, 139,  98,  99,
       104, 105, 134, 135, 140, 141, 100, 101, 106, 107, 136, 137, 142, 143])

And just to show that they’re 2x2x2 blocks of arr, notice that these are the first 8 values of chunked:

In [9]: arr[:2, :2, :2]
Out[9]:
array([[[ 0,  1],
        [ 6,  7]],

       [[36, 37],
        [42, 43]]])

To read in any slice along an axis, we’d read in either 6 or 9 contiguous chunks (twice as much data as we need) and then only keep the portion we wanted. That’s a worst-case maximum of 9 seeks vs a maximum of 36 seeks for the non-chunked version. (But the best case is still 6 seeks vs 1 for the memmapped array.) Because sequential reads are very fast compared to seeks, this significantly reduces the amount of time it takes to read an arbitrary subset into memory. Once again, this effect becomes larger with larger arrays.

HDF5 takes this a few steps farther. The chunks don’t have to be stored contiguously, and they’re indexed by a B-Tree. Furthermore, they don’t have to be the same size on disk, so compression can be applied to each chunk.


Chunked arrays with h5py

By default, h5py doesn’t created chunked HDF files on disk (I think pytables does, by contrast). If you specify chunks=True when creating the dataset, however, you’ll get a chunked array on disk.

As a quick, minimal example:

import numpy as np
import h5py

data = np.random.random((100, 100, 100))

with h5py.File('test.hdf', 'w') as outfile:
    dset = outfile.create_dataset('a_descriptive_name', data=data, chunks=True)
    dset.attrs['some key'] = 'Did you want some metadata?'

Note that chunks=True tells h5py to automatically pick a chunk size for us. If you know more about your most common use-case, you can optimize the chunk size/shape by specifying a shape tuple (e.g. (2,2,2) in the simple example above). This allows you to make reads along a particular axis more efficient or optimize for reads/writes of a certain size.


I/O Performance comparison

Just to emphasize the point, let’s compare reading in slices from a chunked HDF5 dataset and a large (~8GB), Fortran-ordered 3D array containing the same exact data.

I’ve cleared all OS caches between each run, so we’re seeing the “cold” performance.

For each file type, we’ll test reading in a “full” x-slice along the first axis and a “full” z-slize along the last axis. For the Fortran-ordered memmapped array, the “x” slice is the worst case, and the “z” slice is the best case.

The code used is in a gist (including creating the hdf file). I can’t easily share the data used here, but you could simulate it by an array of zeros of the same shape (621, 4991, 2600) and type np.uint8.

The chunked_hdf.py looks like this:

import sys
import h5py

def main():
    data = read()

    if sys.argv[1] == 'x':
        x_slice(data)
    elif sys.argv[1] == 'z':
        z_slice(data)

def read():
    f = h5py.File('/tmp/test.hdf5', 'r')
    return f['seismic_volume']

def z_slice(data):
    return data[:,:,0]

def x_slice(data):
    return data[0,:,:]

main()

memmapped_array.py is similar, but has a touch more complexity to ensure the slices are actually loaded into memory (by default, another memmapped array would be returned, which wouldn’t be an apples-to-apples comparison).

import numpy as np
import sys

def main():
    data = read()

    if sys.argv[1] == 'x':
        x_slice(data)
    elif sys.argv[1] == 'z':
        z_slice(data)

def read():
    big_binary_filename = '/data/nankai/data/Volumes/kumdep01_flipY.3dv.vol'
    shape = 621, 4991, 2600
    header_len = 3072

    data = np.memmap(filename=big_binary_filename, mode='r', offset=header_len,
                     order='F', shape=shape, dtype=np.uint8)
    return data

def z_slice(data):
    dat = np.empty(data.shape[:2], dtype=data.dtype)
    dat[:] = data[:,:,0]
    return dat

def x_slice(data):
    dat = np.empty(data.shape[1:], dtype=data.dtype)
    dat[:] = data[0,:,:]
    return dat

main()

Let’s have a look at the HDF performance first:

jofer at cornbread in ~ 
$ sudo ./clear_cache.sh

jofer at cornbread in ~ 
$ time python chunked_hdf.py z
python chunked_hdf.py z  0.64s user 0.28s system 3% cpu 23.800 total

jofer at cornbread in ~ 
$ sudo ./clear_cache.sh

jofer at cornbread in ~ 
$ time python chunked_hdf.py x
python chunked_hdf.py x  0.12s user 0.30s system 1% cpu 21.856 total

A “full” x-slice and a “full” z-slice take about the same amount of time (~20sec). Considering this is an 8GB array, that’s not too bad. Most of the time

And if we compare this to the memmapped array times (it’s Fortran-ordered: A “z-slice” is the best case and an “x-slice” is the worst case.):

jofer at cornbread in ~ 
$ sudo ./clear_cache.sh

jofer at cornbread in ~ 
$ time python memmapped_array.py z
python memmapped_array.py z  0.07s user 0.04s system 28% cpu 0.385 total

jofer at cornbread in ~ 
$ sudo ./clear_cache.sh

jofer at cornbread in ~ 
$ time python memmapped_array.py x
python memmapped_array.py x  2.46s user 37.24s system 0% cpu 3:35:26.85 total

Yes, you read that right. 0.3 seconds for one slice direction and ~3.5 hours for the other.

The time to slice in the “x” direction is far longer than the amount of time it would take to load the entire 8GB array into memory and select the slice we wanted! (Again, this is a Fortran-ordered array. The opposite x/z slice timing would be the case for a C-ordered array.)

However, if we’re always wanting to take a slice along the best-case direction, the big binary array on disk is very good. (~0.3 sec!)

With a memmapped array, you’re stuck with this I/O discrepancy (or perhaps anisotropy is a better term). However, with a chunked HDF dataset, you can choose the chunksize such that access is either equal or is optimized for a particular use-case. It gives you a lot more flexibility.

In summary

Hopefully that helps clear up one part of your question, at any rate. HDF5 has many other advantages over “raw” memmaps, but I don’t have room to expand on all of them here. Compression can speed some things up (the data I work with doesn’t benefit much from compression, so I rarely use it), and OS-level caching often plays more nicely with HDF5 files than with “raw” memmaps. Beyond that, HDF5 is a really fantastic container format. It gives you a lot of flexibility in managing your data, and can be used from more or less any programming language.

Overall, try it and see if it works well for your use case. I think you might be surprised.


声明:本站所有文章,如无特殊说明或标注,均为本站原创发布。任何个人或组织,在未征得本站同意时,禁止复制、盗用、采集、发布本站内容到任何网站、书籍等各类媒体平台。如若本站内容侵犯了原著者的合法权益,可联系我们进行处理。