polychrom.hdf5_format module¶
New-style HDF5 trajectories¶
The purpose of the HDF5 reporter¶
There are several reasons for migrating to the new HDF5 storage format:
Saving each conformation as an individual file produces too many files
Using pickle-based approaches (joblib) makes the format Python-specific and not backwards compatible; text formats are clumsy
It would be nice to save metadata, such as the starting conformation, forces, or initial parameters.
Compression can be beneficial for rounded conformations: it can reduce file size by up to 40%
one file vs many files vs several files¶
Saving each conformation as an individual file is undesirable because it produces too many files: a filesystem check or backup of 30,000,000 files takes hours or days.
Saving the whole trajectory as a single file is undesirable because: 1. backup software will back up a new copy of the file every day as it grows; and 2. if the last write fails, the file ends up in a corrupted state and needs to be recovered.
The solution is to save groups of conformations as individual files: e.g. save conformations 1-50 as one file, conformations 51-100 as a second file, etc.
This way, we do not risk losing anything if the power goes out near the end; we do not interfere with backup solutions; and we have partial trajectories that can be analyzed. Although partial trajectories are not available in real time, @golobor has proposed a solution to that for debug/development use.
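The grouping scheme above can be sketched in a few lines of plain Python. The `block_to_file` helper below is hypothetical (the reporter handles file naming internally) and follows the 1-based grouping used in this document's examples:

```python
def block_to_file(block, max_data_length=50):
    # Hypothetical helper: which file would hold a given block,
    # assuming blocks are grouped 1-50, 51-100, ... as described above.
    start = ((block - 1) // max_data_length) * max_data_length + 1
    end = start + max_data_length - 1
    return f"blocks_{start}-{end}.h5"

print(block_to_file(42))  # blocks_1-50.h5
print(block_to_file(51))  # blocks_51-100.h5
```

With this layout, a power failure can only corrupt the file currently being written; all previously completed group files remain valid.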
Polychrom storage format¶
We chose an HDF5-based storage format that roughly mimics the MDTraj HDF5 format. It does not include the MDTraj topology because that seemed a little too complicated; however, full MDTraj compatibility may be added in the future.
Separation of simulation and reporter¶
Polychrom separates two entities: a simulation object and a reporter. When a simulation object is initialized, a reporter (actually a list of reporters, in case you want to use several) is passed to it. The simulation object then attempts to save several things: __init__ arguments, the starting conformation, energy minimization results, serialized forces, and blocks of conformations together with time, Ek, and Ep.
Each time a simulation object wants to save something, it calls reporter.report(…) for each of the reporters, passing a string indicating what is being reported and a dictionary to save. The reporter has to interpret this and save the data; it also keeps the appropriate counts. Users can pass a dict with extra variables to polychrom.simulation.Simulation.do_block() as the save_extras parameter; this dict will be saved by the reporter.
Note
Generic Python objects are not supported by the HDF5 reporter. Data has to be HDF5-compatible, meaning an array of numbers/strings, or a single number/string.
The HDF5 reporter used here saves everything into HDF5 files. Anything except conformations is immediately saved to its own HDF5 file: numpy-array-compatible structures are saved as datasets, and regular types (strings, numbers) are saved as attributes. For conformations, the reporter waits until a certain number of them is received. It then saves them all at once into an HDF5 file under groups /1, /2, /3 … /50 for blocks 1, 2, 3 … 50 respectively, writing them to a blocks_1-50.h5 file.
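The resulting file layout can be sketched directly with h5py (assumed available; polychrom itself depends on it). The dataset and attribute names below are illustrative, not the reporter's exact schema:

```python
import os
import tempfile

import h5py
import numpy as np

# Sketch of the on-disk layout described above: one file holds blocks
# 1..3, each block stored as its own HDF5 group.
path = os.path.join(tempfile.mkdtemp(), "blocks_1-3.h5")
with h5py.File(path, "w") as f:
    for block in range(1, 4):
        grp = f.create_group(str(block))
        # numpy-array-compatible values become datasets...
        grp.create_dataset("pos", data=np.zeros((1000, 3)))
        # ...while plain numbers become attributes (illustrative name)
        grp.attrs["time"] = 100.0 * block

# Reading one block later only touches a single group:
with h5py.File(path, "r") as f:
    pos = f["2"]["pos"][:]
print(pos.shape)  # (1000, 3)
```

Storing each block under its own group is what makes the `::block_number` URI scheme described below cheap: fetching one block never reads the others.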
Multi-stage simulations or loop extrusion¶
We frequently have simulations in which a simulation object changes. One example would be changing forces or parameters throughout the simulation. Another example would be loop extrusion simulations.
In this design, a reporter object can be reused and passed to a new simulation. It keeps the counter of conformations, and again saves the applied forces etc.: the reporter creates a file “applied_forces_0.h5” the first time it receives forces, and “applied_forces_1.h5” the second time it receives forces from a simulation. Setting reporter.blocks_only=True stops the reporter from saving anything but blocks, which may be helpful for generating loop extrusion conformations. This is currently implemented in the examples.
URIs to identify individual conformations¶
Because we’re saving several conformations into one file, we designed a URI format to quickly fetch a conformation by a unique identifier.
URIs look like this: /path/to/the/trajectory/blocks_1-50.h5::42
This URI fetches block #42 from the file blocks_1-50.h5, which contains blocks 1 through 50 inclusive.
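Parsing such a URI is a simple string split. The `parse_uri` helper below is hypothetical, shown only to illustrate the scheme; the real parsing happens inside polychrom.polymerutils.load / load_URI:

```python
def parse_uri(uri):
    # Hypothetical helper: split "<file>::<block>" into its two parts.
    # rpartition splits on the last "::", so paths are handled safely.
    path, _, block = uri.rpartition("::")
    return path, int(block)

print(parse_uri("/path/to/the/trajectory/blocks_1-50.h5::42"))
# ('/path/to/the/trajectory/blocks_1-50.h5', 42)
```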
The polychrom.polymerutils.load() function is compatible with URIs. Also, to make it easy to load both old-style filenames and new-style URIs, there is a function polychrom.polymerutils.fetch_block(). fetch_block will autodetect the type of a trajectory folder, so it fetches both /path/to/the/trajectory/block42.dat and /path/to/the/trajectory/blocks_x-y.h5::42 automatically.
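The autodetection idea can be sketched with a file-pattern check. The `detect_trajectory_style` function below is a hypothetical simplification (the real fetch_block also loads the block's data):

```python
import os
import tempfile
from glob import glob

def detect_trajectory_style(folder):
    # Hypothetical sketch of the autodetection idea: new-style folders
    # hold blocks_X-Y.h5 files, old-style folders hold block<N>.dat files.
    if glob(os.path.join(folder, "blocks_*.h5")):
        return "new"
    if glob(os.path.join(folder, "block*.dat")):
        return "old"
    raise ValueError(f"{folder} does not look like a trajectory folder")

# demo on a throwaway folder containing one new-style file
demo = tempfile.mkdtemp()
open(os.path.join(demo, "blocks_1-50.h5"), "w").close()
style = detect_trajectory_style(demo)
print(style)  # new
```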
- class polychrom.hdf5_format.HDF5Reporter(folder, max_data_length=50, h5py_dset_opts=None, overwrite=False, blocks_only=False, check_exists=True)[source]¶
Bases:
object
- __init__(folder, max_data_length=50, h5py_dset_opts=None, overwrite=False, blocks_only=False, check_exists=True)[source]¶
Creates a reporter object that saves a trajectory to a folder
- Parameters
folder (str) – Folder to save data to.
max_data_length (int, optional (default=50)) – Will save data in groups of max_data_length blocks
overwrite (bool, optional (default=False)) – Overwrite an existing trajectory in a folder.
check_exists (bool (optional, default=True)) – Raise an error if a previous trajectory exists in the folder
blocks_only (bool, optional (default=False)) – Only save blocks, do not save any other information
- continue_trajectory(continue_from=None, continue_max_delete=5)[source]¶
Continues a simulation in the current folder, i.e. continues from the last block, or from the block you specify. By default, it takes the last block; otherwise, it takes the continue_from block.
You should initialize the class with “check_exists=False” to continue a simulation
NOTE: This function does not continue the simulation itself (parameters, bonds, etc.) - it only manages counting the blocks and the saved files.
Returns (block_number, data_dict) - you should start a new simulation with data_dict[“pos”]
- Parameters
continue_from (int or None, optional (default=None)) – Block number to continue a simulation from. Default: last block found
continue_max_delete (int (default = 5)) – Maximum number of blocks to delete if continuing a simulation. It is here to avoid accidentally deleting a lot of blocks.
- Returns
(block_number, data_dict)
block_number is the number of the current block
data_dict is what load_URI would return for the last block of the trajectory.
- report(name, values)[source]¶
Semi-internal method to be called when you need to report something
- Parameters
name (str) – Name of what is being reported (“data”, “init_args”, anything else)
values (dict) – Dict of what to report. Accepted types are np-array-compatible, numbers, strings. No dicts, objects, or what cannot be converted to a np-array of numbers or strings/bytes.
- polychrom.hdf5_format.list_URIs(folder, empty_error=True, read_error=True, return_dict=False)[source]¶
Makes a list of URIs (path-like records for each block) for a trajectory folder. Since we now store multiple blocks per file, a URI is a Universal Resource Identifier for a block.
It is compatible with polymerutils.load and with contactmap finders, and is generally treated like a filename.
This function checks that each HDF5 file can be opened (if read_error==True), but does not check whether individual datasets (blocks) exist in a file. If read_error==False, a non-openable file is ignored entirely. NOTE: This covers the most typical case of corruption due to a terminated write, because an HDF5 file becomes invalid in that case.
It does not check the continuity of blocks (blocks_1-10.h5; blocks_20-30.h5 is valid), but it does raise an error if one block is listed twice (e.g. blocks_1-10.h5; blocks_5-15.h5 is invalid).
TODO: think about the above checks, and check for readable datasets as well
- Parameters
folder (str) – folder to find conformations in
empty_error (bool, optional) – Raise error if the folder does not exist or has no files, default True
read_error (bool, optional) – Raise error if one of the HDF5 files cannot be read, default True
return_dict (bool, optional) – If True, return a dict of {block_number: URI}; if False (the default), return a list of URIs.
- polychrom.hdf5_format.load_URI(dset_path)[source]¶
Loads a single block of the simulation using the address provided by list_URIs. dset_path should be
/path/to/trajectory/folder/blocks_X-Y.h5::Z
where Z is the block number.