polychrom.hdf5_format module

New-style HDF5 trajectories

The purpose of the HDF5 reporter

There are several reasons for migrating to the new HDF5 storage format:

  • Saving each conformation as an individual file produces too many files

  • Pickle-based approaches (joblib) make the format Python-specific and not backwards compatible; plain text is clumsy

  • It would be useful to save metadata, such as the starting conformation, forces, or initial parameters

  • Compression can be beneficial for rounded conformations: it can reduce file size by up to 40%

One file vs. many files vs. several files

Saving each conformation as an individual file is undesirable because it produces too many files: a filesystem check or backup of 30,000,000 files takes hours or days.

Saving the whole trajectory as a single file is undesirable because: 1. backup software will back up a new copy of the file every day as it grows; and 2. if the last write fails, the file ends up in a corrupted state and would need to be recovered.

The solution is to save groups of conformations as individual files: e.g. save conformations 1-50 as one file, conformations 51-100 as a second file, etc.
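
The grouping scheme above can be sketched as follows. This is a hypothetical helper (not part of polychrom) illustrating how block numbers map to the "blocks_X-Y.h5" filenames, assuming 1-based block numbering and groups of 50 as in the example:

```python
def group_filenames(n_blocks, max_data_length=50):
    """Return the per-file names covering blocks 1 .. n_blocks,
    grouped max_data_length blocks per file."""
    names = []
    for start in range(1, n_blocks + 1, max_data_length):
        end = min(start + max_data_length - 1, n_blocks)
        names.append(f"blocks_{start}-{end}.h5")
    return names

print(group_filenames(120))
# ['blocks_1-50.h5', 'blocks_51-100.h5', 'blocks_101-120.h5']
```

The last file simply holds fewer blocks when the trajectory length is not a multiple of the group size.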

This way, we do not risk losing the whole trajectory if the power goes out at the end. This way, we are not interfering with backup solutions. This way, we have partial trajectories that can be analyzed. Although partial trajectories are not realtime, @golobor was proposing a solution to that for debug/development.

Polychrom storage format

We chose HDF5-based storage that roughly mimics the MDTraj HDF5 format. It does not include the MDTraj topology because that seemed a little too complicated; however, full MDTraj compatibility may be added in the future.

Separation of simulation and reporter

Polychrom separates two entities: a simulation object and a reporter. When a simulation object is initialized, a reporter (actually, a list of reporters, in case you want to use several) is passed to the simulation object. The simulation object attempts to save several things: its __init__ arguments, the starting conformation, energy minimization results, serialized forces, and blocks of conformations together with time, Ek, and Ep.

Each time a simulation object wants to save something, it calls reporter.report(…) for each of the reporters. It passes a string indicating what is being reported, and a dictionary to save. The reporter interprets this and saves the data; it also keeps the appropriate counts. Users can pass a dict with extra variables to polychrom.simulation.Simulation.do_block() as the save_extras parameter. This dict will be saved by the reporter.
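
A minimal, hypothetical reporter can illustrate this protocol: the simulation calls reporter.report(name, values) with a string tag and a dict of HDF5-compatible data. The class below (not part of polychrom) just records what it receives and counts data blocks; the real HDF5Reporter additionally writes HDF5 files:

```python
class ToyReporter:
    """Sketch of the reporter protocol: report(name, values)."""

    def __init__(self):
        self.counter = 0    # number of "data" blocks received so far
        self.received = []  # log of (name, values) calls

    def report(self, name, values):
        # Reject nested dicts: only HDF5-compatible values are allowed
        # (arrays of numbers/strings, or single numbers/strings).
        for v in values.values():
            if isinstance(v, dict):
                raise TypeError("generic Python objects are not supported")
        if name == "data":
            self.counter += 1
        self.received.append((name, values))

rep = ToyReporter()
rep.report("init_args", {"N": 1000, "integrator": "langevin"})
rep.report("data", {"pos": [[0.0, 0.0, 0.0]], "time": 1.0})
print(rep.counter)  # 1
```

A simulation object holding a list of such reporters would simply loop over them and call report on each.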

Note

Generic Python objects are not supported by the HDF5 reporter. Data has to be HDF5-compatible, meaning an array of numbers/strings, or a single number/string.

The HDF5 reporter used here saves everything into HDF5 files. For anything except the conformations, it immediately saves the data into a single HDF5 file: numpy-array-compatible structures are saved as datasets, and regular types (strings, numbers) are saved as attributes. For conformations, it waits until a certain number of conformations is received. It then saves them all at once into one HDF5 file, under groups /1, /2, /3 … /50 for blocks 1, 2, 3 … 50 respectively, writing them to a blocks_1-50.h5 file.
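
The buffering behavior can be sketched with plain Python (all names here are hypothetical; a dict stands in for the HDF5 file on disk). Conformations accumulate in memory and are flushed together once max_data_length of them have arrived:

```python
class BlockBuffer:
    """Sketch of the reporter's conformation buffering and flushing."""

    def __init__(self, max_data_length=50):
        self.max_data_length = max_data_length
        self.buffer = {}          # {block_number: data_dict}, 1-based
        self.next_block = 1
        self.files_written = []   # stands in for HDF5 files on disk

    def add(self, data_dict):
        self.buffer[self.next_block] = data_dict
        self.next_block += 1
        if len(self.buffer) == self.max_data_length:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        first, last = min(self.buffer), max(self.buffer)
        # The real reporter writes groups /first .. /last into this file.
        self.files_written.append((f"blocks_{first}-{last}.h5", dict(self.buffer)))
        self.buffer = {}

buf = BlockBuffer(max_data_length=50)
for _ in range(60):
    buf.add({"pos": [[0.0, 0.0, 0.0]]})
# blocks 1-50 were flushed automatically; 51-60 wait for the next flush
```

A final flush (the real reporter exposes dump_data() for this) writes out whatever partial group remains at the end of the run.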

Multi-stage simulations or loop extrusion

We frequently run simulations in which the simulation object changes. One example is changing forces or parameters mid-simulation. Another example is loop extrusion simulations.

In this design, a reporter object can be reused and passed to a new simulation. It keeps the conformation counter, and saves the applied forces etc. again: the reporter creates a file “applied_forces_0.h5” the first time it receives forces, and “applied_forces_1.h5” the second time it receives forces from a simulation. Setting reporter.blocks_only=True stops the reporter from saving anything but blocks, which may be helpful when generating loop extrusion conformations. This is currently implemented in the examples.

URIs to identify individual conformations

Because we’re saving several conformations into one file, we designed a URI format to quickly fetch a conformation by a unique identifier.

URIs look like this: /path/to/the/trajectory/blocks_1-50.h5::42

This URI fetches block #42 from the file blocks_1-50.h5, which contains blocks 1 through 50, inclusive. The polychrom.polymerutils.load() function is compatible with URIs. Also, to make it easy to load both old-style filenames and new-style URIs, there is a function polychrom.polymerutils.fetch_block(). fetch_block auto-detects the type of a trajectory folder, so it fetches both /path/to/the/trajectory/block42.dat and /path/to/the/trajectory/blocks_x-y.h5::42 automatically.
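
Parsing this URI format is straightforward; the following is a hypothetical stand-alone parser (the real loaders are polychrom.polymerutils.load and load_URI), splitting on the "::" separator described above:

```python
def parse_uri(uri):
    """Split a 'path::block' URI into (filename, block_number)."""
    filename, sep, block = uri.rpartition("::")
    if not sep:
        raise ValueError(f"not a block URI: {uri}")
    return filename, int(block)

fname, block = parse_uri("/path/to/the/trajectory/blocks_1-50.h5::42")
print(fname, block)
# /path/to/the/trajectory/blocks_1-50.h5 42
```

Using rpartition (rather than split) keeps the parser safe even if "::" ever appeared earlier in the path.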

class polychrom.hdf5_format.HDF5Reporter(folder, max_data_length=50, h5py_dset_opts=None, overwrite=False, blocks_only=False, check_exists=True)[source]

Bases: object

__init__(folder, max_data_length=50, h5py_dset_opts=None, overwrite=False, blocks_only=False, check_exists=True)[source]

Creates a reporter object that saves a trajectory to a folder

Parameters
  • folder (str) – Folder to save data to.

  • max_data_length (int, optional (default=50)) – Will save data in groups of max_data_length blocks

  • overwrite (bool, optional (default=False)) – Overwrite an existing trajectory in a folder.

  • check_exists (bool (optional, default=True)) – Raise an error if a previous trajectory exists in the folder

  • blocks_only (bool, optional (default=False)) – Only save blocks, do not save any other information

continue_trajectory(continue_from=None, continue_max_delete=5)[source]

Continues a simulation in the current folder: by default, from the last block; otherwise, from the continue_from block you specify.

You should initialize the class with “check_exists=False” to continue a simulation.

NOTE: This function does not continue the simulation itself (parameters, bonds, etc.) - it only manages counting the blocks and the saved files.

Returns (block_number, data_dict); you should start a new simulation with data_dict[“pos”].

Parameters
  • continue_from (int or None, optional (default=None)) – Block number to continue a simulation from. Default: last block found

  • continue_max_delete (int (default = 5)) – Maximum number of blocks to delete if continuing a simulation. It is here to avoid accidentally deleting a lot of blocks.

Returns

  • (block_number, data_dict)

  • block_number is the number of the current block

  • data_dict is what load_URI would return on the last block of a trajectory.
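
The bookkeeping continue_trajectory performs can be sketched with a hypothetical helper (not the real implementation, which also reads and rewrites the HDF5 files): given the block files on disk, find the last available block, or a requested continue_from block, and count how many newer blocks would have to be deleted:

```python
import re

def plan_continuation(filenames, continue_from=None, continue_max_delete=5):
    """Return (block_to_continue_from, number_of_newer_blocks_to_delete)."""
    blocks = {}  # {block_number: filename containing it}
    for fname in filenames:
        m = re.match(r"blocks_(\d+)-(\d+)\.h5$", fname)
        first, last = int(m.group(1)), int(m.group(2))
        for b in range(first, last + 1):
            blocks[b] = fname
    start = max(blocks) if continue_from is None else continue_from
    n_newer = sum(1 for b in blocks if b > start)
    # Guard against accidentally discarding a large part of the trajectory
    if n_newer > continue_max_delete:
        raise RuntimeError("too many blocks would be deleted")
    return start, n_newer

print(plan_continuation(["blocks_1-50.h5", "blocks_51-100.h5"], continue_from=98))
# (98, 2)
```

The continue_max_delete guard mirrors the parameter documented above: continuing from block 98 discards only blocks 99 and 100, which is within the default limit of 5.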

dump_data()[source]

report(name, values)[source]

Semi-internal method to be called when you need to report something

Parameters
  • name (str) – Name of what is being reported (“data”, “init_args”, anything else)

  • values (dict) – Dict of what to report. Accepted types are np-array-compatible, numbers, strings. No dicts, objects, or what cannot be converted to a np-array of numbers or strings/bytes.

polychrom.hdf5_format.list_URIs(folder, empty_error=True, read_error=True, return_dict=False)[source]

Makes a list of URIs (path-like records for each block) for a trajectory folder. We now store multiple blocks per file, and a URI is a Universal Resource Identifier for a block.

It is compatible with polymerutils.load and with contactmap finders, and is generally treated like a filename.

This function checks that the HDF5 file is openable (if read_error==True), but does not check if individual datasets (blocks) exist in a file. If read_error==False, a non-openable file is fully ignored. NOTE: This covers the most typical case of corruption due to a terminated write, because an HDF5 file becomes invalid in that case.

It does not check continuity of blocks (blocks_1-10.h5; blocks_20-30.h5 is valid), but it does raise an error if one block is listed twice (e.g. blocks_1-10.h5; blocks_5-15.h5 is invalid).

TODO: think about the above checks, and check for readable datasets as well

Parameters
  • folder (str) – folder to find conformations in

  • empty_error (bool, optional) – Raise error if the folder does not exist or has no files, default True

  • read_error (bool, optional) – Raise error if one of the HDF5 files cannot be read, default True

  • return_dict (bool, optional) – If True, return a dict of {block_number: URI}; if False (the default), return a list of URIs.
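
The duplicate-block check described above can be sketched as follows (a hypothetical helper, stdlib-only): block ranges parsed from the filenames must not overlap, while gaps between them are allowed:

```python
import re

def check_ranges(filenames):
    """Map every block number to its file, rejecting duplicates."""
    seen = {}  # {block_number: filename}
    for fname in filenames:
        first, last = map(int, re.match(r"blocks_(\d+)-(\d+)\.h5$", fname).groups())
        for b in range(first, last + 1):
            if b in seen:
                raise ValueError(f"block {b} is in both {seen[b]} and {fname}")
            seen[b] = fname
    return sorted(seen)

check_ranges(["blocks_1-10.h5", "blocks_20-30.h5"])   # gaps are fine
# check_ranges(["blocks_1-10.h5", "blocks_5-15.h5"])  # raises ValueError
```

The real list_URIs additionally opens each HDF5 file (when read_error=True) to catch corruption from terminated writes.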

polychrom.hdf5_format.load_URI(dset_path)[source]

Loads a single block of the simulation using the address provided by list_URIs. dset_path should be

/path/to/trajectory/folder/blocks_X-Y.h5::Z

where Z is the block number

polychrom.hdf5_format.load_hdf5_file(fname)[source]

Loads a saved HDF5 file, reading all datasets and attributes. We save arrays as datasets, and regular types as attributes in HDF5.

polychrom.hdf5_format.save_hdf5_file(filename, data_dict, dset_opts=None, mode='w')[source]

Saves data_dict to filename. Arrays are saved as HDF5 datasets, and regular types (strings, numbers) as attributes.