icat.dumpfile — Backend for icatdump and icatingest

This module provides the base classes icat.dumpfile.DumpFileReader and icat.dumpfile.DumpFileWriter, which define the API and the logic for reading and writing ICAT data files. The actual work is done in file format specific modules, which provide subclasses that implement the abstract methods.

class icat.dumpfile.DumpFileReader(client, infile)

Bases: object

Base class for backends that read a data file.

mode = 'r'

File mode suitable for the backend.

Subclasses should override this with either “rt” or “rb”, according to the mode required for the backend.

getdata()

Iterate over the chunks in the data file.

Yield some data object in each iteration. This data object is specific to the implementing backend and should be passed as the data argument to getobjs_from_data().

getobjs_from_data(data, objindex)

Iterate over the objects in a data chunk.

Yield a new entity object in each iteration. The object is initialized from the data, but not yet created at the client.

getobjs(objindex=None)

Iterate over the objects in the data file.

Yield a new entity object in each iteration. The object is initialized from the data, but not yet created at the client.

Parameters: objindex (dict) – a mapping from keys to entity objects, see icat.client.Client.searchUniqueKey() for details. This serves as a cache of previously retrieved objects, used to resolve object relations. If this is None, an internal cache will be used that is purged at the start of every new data chunk.
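
The way getdata(), getobjs_from_data(), and getobjs() fit together can be pictured with a small stand-alone sketch. It does not import icat: ToyReader, its in-memory chunk format, and the plain dicts standing in for entity objects are all hypothetical, and the real base class may differ in detail.

```python
class ToyReader:
    """Hypothetical stand-in for a reader backend; not part of icat."""

    mode = "rt"  # this toy format would be text based

    def __init__(self, chunks):
        # chunks: one inner list of (key, name) pairs per data chunk
        self.chunks = chunks

    def getdata(self):
        # Yield one backend-specific data object per chunk.
        for chunk in self.chunks:
            yield chunk

    def getobjs_from_data(self, data, objindex):
        # Build one object per record and remember it under its key.
        for key, name in data:
            obj = {"name": name}
            objindex[key] = obj
            yield obj

    def getobjs(self, objindex=None):
        # Without a caller-supplied index, use a fresh cache per chunk.
        use_internal = objindex is None
        for data in self.getdata():
            if use_internal:
                objindex = {}
            yield from self.getobjs_from_data(data, objindex)


reader = ToyReader([[("k1", "a"), ("k2", "b")], [("k3", "c")]])
names = [obj["name"] for obj in reader.getobjs()]

# A caller-supplied index is kept across chunk boundaries.
shared = {}
for obj in reader.getobjs(objindex=shared):
    pass
```

Passing an explicit objindex trades memory for fewer lookups, since nothing is purged between chunks.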
class icat.dumpfile.DumpFileWriter(client, outfile)

Bases: object

Base class for backends that write a data file.

mode = 'w'

File mode suitable for the backend.

Subclasses should override this with either “wt” or “wb”, according to the mode required for the backend.

head()

Write a header with some meta information to the data file.

startdata()

Start a new data chunk.

If the current chunk contains any data, write it to the data file.

writeobj(key, obj, keyindex)

Add an entity object to the current data chunk.

finalize()

Finalize the data file.
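
As an illustration of how head(), startdata(), writeobj(), and finalize() might interact, here is a minimal stand-alone sketch. ToyWriter, its one-line-per-chunk output format, and the key strings are hypothetical and only mimic the behaviour described above; the sketch does not use icat.

```python
import io


class ToyWriter:
    """Hypothetical stand-in for a writer backend; not part of icat."""

    mode = "wt"

    def __init__(self, outfile):
        self.outfile = outfile
        self.chunk = []  # keys of the objects buffered for the current chunk

    def head(self):
        self.outfile.write("# toy dump file\n")

    def _flush(self):
        # Write the buffered chunk, if it contains any data.
        if self.chunk:
            self.outfile.write("chunk: " + ", ".join(self.chunk) + "\n")
            self.chunk = []

    def startdata(self):
        # Write out the pending chunk, then start a new one.
        self._flush()

    def writeobj(self, key, obj, keyindex):
        # Record the key so that later objects could refer to this one.
        keyindex[obj["id"]] = key
        self.chunk.append(key)

    def finalize(self):
        self._flush()


out = io.StringIO()
writer = ToyWriter(out)
keyindex = {}
writer.head()
writer.startdata()
writer.writeobj("User_name-alice", {"id": 1}, keyindex)
writer.writeobj("User_name-bob", {"id": 2}, keyindex)
writer.startdata()
writer.writeobj("Grouping_name-staff", {"id": 3}, keyindex)
writer.finalize()
result = out.getvalue()
```

In this sketch the first startdata() call writes nothing because the chunk is still empty, which matches the "if the current chunk contains any data" condition.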

writeobjs(objs, keyindex, chunksize=100)

Write some entity objects to the current data chunk.

The objects are searched from the ICAT server. The key index is used to serialize object relations in the data file. For object types that do not have an appropriate uniqueness constraint in the ICAT schema, a generic key is generated. These objects may only be referenced from the same chunk in the data file.

Parameters:
  • objs (icat.query.Query or str or list) –

    query to search the objects, either a Query object or a string. It must contain an appropriate include clause to include all related objects from many-to-one relations. These related objects must also include all information needed to generate their unique key, unless they are already registered in the key index.

    Furthermore, related objects from one-to-many relations may be included. These objects will then be embedded with the relating object in the data file. The same requirements for including their respective related objects apply.

    As an alternative to a query, objs may also be a list of entity objects. The same conditions on the inclusion of related objects apply.

  • keyindex (dict) – cache of generated keys. It maps object ids to unique keys. See icat.entity.Entity.getUniqueKey() for details.
  • chunksize (int) – tuning parameter, see icat.client.Client.searchChunked() for details.
writedata(objs, keyindex=None, chunksize=100)

Write a data chunk.

Parameters:
icat.dumpfile.Backends = {}

A register of all known backends.

icat.dumpfile.register_backend(formatname, reader, writer)

Register a backend.

This function should be called by file format specific backends at initialization.

Parameters:
icat.dumpfile.open_dumpfile(client, f, formatname, mode)

Open a data file, either for reading or for writing.

Note that depending on the backend, the file must either be opened in binary or in text mode. If f is a file object, it must have been opened in the appropriate mode according to the backend selected by formatname. The backend classes define a corresponding class attribute mode. If f is a file name, the file will be opened in the appropriate mode.

The subclasses of icat.dumpfile.DumpFileReader and icat.dumpfile.DumpFileWriter may be used as context managers. This function is suitable to be used in the with statement.

>>> with open_dumpfile(client, f, "XML", 'r') as dumpfile:
...     for obj in dumpfile.getobjs():
...         obj.create()
Parameters:
  • client (icat.client.Client) – the ICAT client.
  • f – the object to read the data from or write the data to, according to mode. Which object types are supported depends on the backend. All backends support at least a file object or the name of a file. The special value “-” may be used as an alias for sys.stdin or sys.stdout.
  • formatname (str) – name of the file format that has been registered by the backend.
  • mode (str) – either “r” or “w” to indicate that the file should be opened for reading or writing respectively.
Returns:

an instance of the appropriate class. This is either the reader or the writer class, according to the mode, that has been registered by the backend.

Raises:

ValueError – if the format is not known or if the mode is not “r” or “w”.
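
The registration and lookup described above can be sketched as a plain dictionary plus a selection function. This is a simplified stand-alone model, not the library code: toy_backends, toy_register_backend(), select_backend(), and the two empty classes are hypothetical names, and the sketch leaves out actually opening the file.

```python
toy_backends = {}  # format name -> (reader class, writer class)


class ToyReader:
    mode = "rt"


class ToyWriter:
    mode = "wt"


def toy_register_backend(formatname, reader, writer):
    # A backend module would call this once at import time.
    toy_backends[formatname] = (reader, writer)


def select_backend(formatname, mode):
    # Pick the registered class for the format and the access mode.
    if formatname not in toy_backends:
        raise ValueError("Unknown data file format '%s'" % formatname)
    reader, writer = toy_backends[formatname]
    if mode == "r":
        return reader
    elif mode == "w":
        return writer
    else:
        raise ValueError("Invalid file mode '%s'" % mode)


toy_register_backend("TOY", ToyReader, ToyWriter)
reader_class = select_backend("TOY", "r")

# The class attribute tells the caller how a file name should be opened.
open_mode = reader_class.mode
```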

ICAT data files

Data files are partitioned into chunks. This avoids having the whole file, e.g. the complete inventory of the ICAT, in memory at once. The problem is that objects contain references to other objects (e.g. Datafiles refer to Datasets, the latter refer to Investigations, and so forth). We keep an index of the objects in order to resolve these references. But there is a memory versus time tradeoff: we cannot keep all the objects in the index, as that would again mean holding the complete inventory of the ICAT. And we cannot know beforehand which objects are going to be referenced later on, so we do not know which ones to keep and which ones to discard from the index.

Fortunately, objects that have been discarded from the index can be queried back from the ICAT server with icat.client.Client.searchUniqueKey(). But this is expensive. So the strategy is as follows: keep all objects from the current chunk in the index and discard the complete index each time a chunk has been processed. This works well if objects mostly reference other objects from the same chunk and only a few references cross chunk boundaries.

Therefore, we want these chunks to be small enough to fit into memory, but at the same time large enough to keep as many relations between objects as possible local to a chunk. It is the responsibility of the writer of the data file to create the chunks in this manner.
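
The tradeoff can be made concrete with a small stand-alone model of the index. ChunkIndex and the dictionary that plays the role of the ICAT server are hypothetical; the sketch only counts how often the expensive fallback is needed and does not use icat.

```python
class ChunkIndex:
    """Hypothetical per-chunk object index; not part of icat."""

    def __init__(self, search_server):
        self.search_server = search_server  # expensive fallback lookup
        self.objects = {}
        self.server_queries = 0

    def start_chunk(self):
        # Discard the complete index at every chunk boundary.
        self.objects = {}

    def add(self, key, obj):
        self.objects[key] = obj

    def resolve(self, key):
        # Cheap if the object is in the current chunk, costly otherwise.
        if key not in self.objects:
            self.server_queries += 1
            self.objects[key] = self.search_server(key)
        return self.objects[key]


server = {"Investigation-1": {"name": "inv1"}}
index = ChunkIndex(server.__getitem__)

# Chunk 1: the investigation is part of the chunk that refers to it.
index.start_chunk()
index.add("Investigation-1", server["Investigation-1"])
local = index.resolve("Investigation-1")   # found in the index

# Chunk 2: a reference crosses the chunk boundary.
index.start_chunk()
remote = index.resolve("Investigation-1")  # needs the fallback lookup
```

Only the reference in the second chunk costs a server query here, which is why chunks that keep related objects together are cheaper to read.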

Which objects get written to the data file and how this file is organized is controlled by lists of ICAT search expressions, see icat.dumpfile.DumpFileWriter.writeobjs(). There is some degree of flexibility: an object may include related objects in a one-to-many relation, just by including them in the search expression. In this case, these related objects should not have a search expression of their own. For instance, the search expression for Grouping may include UserGroup. The UserGroups will then be embedded in their respective Grouping in the data file. There should not be a separate search expression for UserGroup then.

Objects related in a many-to-one relation must always be included in the search expression. This is also true if the object is indirectly related to one of the included objects. In this case, only a reference to the related object will be included in the data file. The related object must have its own list entry.
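
The Grouping example might translate into a list like the following. The expressions are written in what is assumed to be the concise ICAT query syntax and are illustrative only; auth_expressions and entity_types are hypothetical names. Each entry is the kind of value that writeobjs() accepts as its objs argument.

```python
# Hypothetical search expressions, assuming the concise ICAT query syntax.
auth_expressions = [
    # User is the target of a many-to-one relation from UserGroup,
    # so it needs a list entry of its own.
    "User",
    # UserGroup (one-to-many from Grouping) is embedded in its Grouping.
    # User is included as well so that the reference can be serialized.
    "Grouping INCLUDE UserGroup, User",
]

# UserGroup is embedded above, so it has no expression of its own.
entity_types = [expr.split()[0] for expr in auth_expressions]
```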