""" Define a simple format for saving numpy arrays to disk with the full information about them. The ``.npy`` format is the standard binary file format in NumPy for persisting a *single* arbitrary NumPy array on disk. The format stores all of the shape and dtype information necessary to reconstruct the array correctly even on another machine with a different architecture. The format is designed to be as simple as possible while achieving its limited goals. The ``.npz`` format is the standard format for persisting *multiple* NumPy arrays on disk. A ``.npz`` file is a zip file containing multiple ``.npy`` files, one for each array. Capabilities ------------ - Can represent all NumPy arrays including nested record arrays and object arrays. - Represents the data in its native binary form. - Supports Fortran-contiguous arrays directly. - Stores all of the necessary information to reconstruct the array including shape and dtype on a machine of a different architecture. Both little-endian and big-endian arrays are supported, and a file with little-endian numbers will yield a little-endian array on any machine reading the file. The types are described in terms of their actual sizes. For example, if a machine with a 64-bit C "long int" writes out an array with "long ints", a reading machine with 32-bit C "long ints" will yield an array with 64-bit integers. - Is straightforward to reverse engineer. Datasets often live longer than the programs that created them. A competent developer should be able create a solution in his preferred programming language to read most ``.npy`` files that he has been given without much documentation. - Allows memory-mapping of the data. See `open_memmep`. - Can be read from a filelike stream object instead of an actual file. - Stores object arrays, i.e. arrays containing elements that are arbitrary Python objects. Files with object arrays are not to be mmapable, but can be read and written to disk. 

Limitations
-----------

- Arbitrary subclasses of numpy.ndarray are not completely preserved.
  Subclasses will be accepted for writing, but only the array data will
  be written out. A regular numpy.ndarray object will be created
  upon reading the file.

.. warning::

  Due to limitations in the interpretation of structured dtypes, dtypes
  with fields with empty names will have the names replaced by 'f0', 'f1',
  etc. Such arrays will not round-trip through the format entirely
  accurately. The data is intact; only the field names will differ. We are
  working on a fix for this. This fix will not require a change in the
  file format. The arrays with such structures can still be saved and
  restored, and the correct dtype may be restored by using the
  ``loadedarray.view(correct_dtype)`` method.

File extensions
---------------

We recommend using the ``.npy`` and ``.npz`` extensions for files saved
in this format. This is by no means a requirement; applications may wish
to use these file formats but use an extension specific to the
application. In the absence of an obvious alternative, however,
we suggest using ``.npy`` and ``.npz``.

Version numbering
-----------------

The version numbering of these formats is independent of NumPy version
numbering. If the format is upgraded, the code in `numpy.io` will still
be able to read and write Version 1.0 files.

Format Version 1.0
------------------

The first 6 bytes are a magic string: exactly ``\\x93NUMPY``.

The next 1 byte is an unsigned byte: the major version number of the file
format, e.g. ``\\x01``.

The next 1 byte is an unsigned byte: the minor version number of the file
format, e.g. ``\\x00``. Note: the version of the file format is not tied
to the version of the numpy package.

The next 2 bytes form a little-endian unsigned short int: the length of
the header data HEADER_LEN.

The next HEADER_LEN bytes form the header data describing the array's
format. It is an ASCII string which contains a Python literal expression
of a dictionary.
It is terminated by a newline (``\\n``) and padded with spaces
(``\\x20``) to make the total length of
``len(magic string) + 4 + HEADER_LEN`` (6 magic bytes, 2 version bytes,
2 length bytes, then the header) evenly divisible by 16 for alignment
purposes.

The dictionary contains three keys:

    "descr" : dtype.descr
        An object that can be passed as an argument to the `numpy.dtype`
        constructor to create the array's dtype.
    "fortran_order" : bool
        Whether the array data is Fortran-contiguous or not. Since
        Fortran-contiguous arrays are a common form of non-C-contiguity,
        we allow them to be written directly to disk for efficiency.
    "shape" : tuple of int
        The shape of the array.

For repeatability and readability, the dictionary keys are sorted in
alphabetic order. This is for convenience only. A writer SHOULD implement
this if possible. A reader MUST NOT depend on this.

Following the header comes the array data. If the dtype contains Python
objects (i.e. ``dtype.hasobject is True``), then the data is a Python
pickle of the array. Otherwise the data is the contiguous (either C-
or Fortran-, depending on ``fortran_order``) bytes of the array.
Consumers can figure out the number of bytes by multiplying the number
of elements given by the shape (noting that ``shape=()`` means there is
1 element) by ``dtype.itemsize``.

Notes
-----
The ``.npy`` format, including reasons for creating it and a comparison of
alternatives, is described fully in the "npy-format" NEP.

"""

import cPickle

import numpy
from numpy.lib.utils import safe_eval


MAGIC_PREFIX = '\x93NUMPY'
MAGIC_LEN = len(MAGIC_PREFIX) + 2

def magic(major, minor):
    """ Return the magic string for the given file format version.

    Parameters
    ----------
    major : int in [0, 255]
    minor : int in [0, 255]

    Returns
    -------
    magic : str

    Raises
    ------
    ValueError if the version cannot be formatted.
""" if major < 0 or major > 255: raise ValueError("major version must be 0 <= major < 256") if minor < 0 or minor > 255: raise ValueError("minor version must be 0 <= minor < 256") return '%s%s%s' % (MAGIC_PREFIX, chr(major), chr(minor)) def read_magic(fp): """ Read the magic string to get the version of the file format. Parameters ---------- fp : filelike object Returns ------- major : int minor : int """ magic_str = fp.read(MAGIC_LEN) if len(magic_str) != MAGIC_LEN: msg = "could not read %d characters for the magic string; got %r" raise ValueError(msg % (MAGIC_LEN, magic_str)) if magic_str[:-2] != MAGIC_PREFIX: msg = "the magic string is not correct; expected %r, got %r" raise ValueError(msg % (MAGIC_PREFIX, magic_str[:-2])) major, minor = map(ord, magic_str[-2:]) return major, minor def dtype_to_descr(dtype): """ Get a serializable descriptor from the dtype. The .descr attribute of a dtype object cannot be round-tripped through the dtype() constructor. Simple types, like dtype('float32'), have a descr which looks like a record array with one field with '' as a name. The dtype() constructor interprets this as a request to give a default name. Instead, we construct descriptor that can be passed to dtype(). Parameters ---------- dtype : dtype The dtype of the array that will be written to disk. Returns ------- descr : object An object that can be passed to `numpy.dtype()` in order to replicate the input dtype. """ if dtype.names is not None: # This is a record array. The .descr is fine. # XXX: parts of the record array with an empty name, like padding bytes, # still get fiddled with. This needs to be fixed in the C implementation # of dtype(). return dtype.descr else: return dtype.str def header_data_from_array_1_0(array): """ Get the dictionary of header metadata from a numpy.ndarray. Parameters ---------- array : numpy.ndarray Returns ------- d : dict This has the appropriate entries for writing its string representation to the header of the file. 
""" d = {} d['shape'] = array.shape if array.flags.c_contiguous: d['fortran_order'] = False elif array.flags.f_contiguous: d['fortran_order'] = True else: # Totally non-contiguous data. We will have to make it C-contiguous # before writing. Note that we need to test for C_CONTIGUOUS first # because a 1-D array is both C_CONTIGUOUS and F_CONTIGUOUS. d['fortran_order'] = False d['descr'] = dtype_to_descr(array.dtype) return d def write_array_header_1_0(fp, d): """ Write the header for an array using the 1.0 format. Parameters ---------- fp : filelike object d : dict This has the appropriate entries for writing its string representation to the header of the file. """ import struct header = ["{"] for key, value in sorted(d.items()): # Need to use repr here, since we eval these when reading header.append("'%s': %s, " % (key, repr(value))) header.append("}") header = "".join(header) # Pad the header with spaces and a final newline such that the magic # string, the header-length short and the header are aligned on a 16-byte # boundary. Hopefully, some system, possibly memory-mapping, can take # advantage of our premature optimization. current_header_len = MAGIC_LEN + 2 + len(header) + 1 # 1 for the newline topad = 16 - (current_header_len % 16) header = '%s%s\n' % (header, ' '*topad) if len(header) >= (256*256): raise ValueError("header does not fit inside %s bytes" % (256*256)) header_len_str = struct.pack('