icat.dumpfile - Backend for icatdump and icatingest
This module provides the base classes icat.dumpfile.DumpFileReader
and icat.dumpfile.DumpFileWriter, which define the API and the
logic for reading and writing ICAT data files. The actual work is
done in file format specific modules, which should provide subclasses
that implement the abstract methods.
class icat.dumpfile.DumpFileReader(client, infile)
Bases: object
Base class for backends that read a data file.
mode = 'r'
File mode suitable for the backend.
Subclasses should override this with either “rt” or “rb”, according to the mode required for the backend.
getdata()
Iterate over the chunks in the data file.
Yield some data object in each iteration. This data object is specific to the implementing backend and should be passed as the data argument to getobjs_from_data().
getobjs_from_data(data, objindex)
Iterate over the objects in a data chunk.
Yield a new entity object in each iteration. The object is initialized from the data, but not yet created at the client.
getobjs(objindex=None)
Iterate over the objects in the data file.
Yield a new entity object in each iteration. The object is initialized from the data, but not yet created at the client.
Parameters: objindex (dict) – a mapping from keys to entity objects, see icat.client.Client.searchUniqueKey() for details. This serves as a cache of previously retrieved objects, used to resolve object relations. If this is None, an internal cache will be used that is purged at the start of every new data chunk.
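How the three reader methods fit together can be illustrated with a small stand-in class. This is a minimal sketch, not the library's implementation: ToyReader, its tuple-based chunk layout, and the plain dicts used in place of entity objects are all assumptions made for illustration. An unresolved reference is simply left as None here, whereas a real backend would have to look the object up at the server.

```python
class ToyReader:
    """Hypothetical stand-in for a DumpFileReader subclass."""

    mode = "rt"

    def __init__(self, chunks):
        # Each chunk is a list of (key, value, referenced_key_or_None).
        self.chunks = chunks

    def getdata(self):
        # Yield one backend-specific data object per chunk.
        for chunk in self.chunks:
            yield chunk

    def getobjs_from_data(self, data, objindex):
        # Build one object per entry and resolve its reference
        # through the index of previously seen objects.
        for key, value, ref in data:
            obj = {"value": value, "parent": objindex.get(ref)}
            objindex[key] = obj
            yield obj

    def getobjs(self, objindex=None):
        # Without a caller-supplied index, use an internal one that
        # is purged at the start of every new chunk.
        use_internal = objindex is None
        for data in self.getdata():
            if use_internal:
                objindex = {}
            for obj in self.getobjs_from_data(data, objindex):
                yield obj


chunks = [
    [("inv-1", "Investigation", None), ("ds-1", "Dataset", "inv-1")],
    [("df-1", "Datafile", "ds-1")],
]

# Internal index: the reference from the second chunk is not resolved.
objs = list(ToyReader(chunks).getobjs())

# Shared index: the reference across the chunk boundary is resolved.
shared = {}
objs_shared = list(ToyReader(chunks).getobjs(shared))
```

Passing a dict as objindex keeps every object seen so far, which trades memory for fewer unresolved references.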
class icat.dumpfile.DumpFileWriter(client, outfile)
Bases: object
Base class for backends that write a data file.
mode = 'w'
File mode suitable for the backend.
Subclasses should override this with either “wt” or “wb”, according to the mode required for the backend.
head()
Write a header with some meta information to the data file.
startdata()
Start a new data chunk.
If the current chunk contains any data, write it to the data file.
writeobj(key, obj, keyindex)
Add an entity object to the current data chunk.
finalize()
Finalize the data file.
writeobjs(objs, keyindex, chunksize=100)
Write some entity objects to the current data chunk.
The objects are retrieved from the ICAT server. The key index is used to serialize object relations in the data file. For object types that do not have an appropriate uniqueness constraint in the ICAT schema, a generic key is generated. These objects may only be referenced from the same chunk in the data file.
Parameters:
- objs (icat.query.Query or str or list) – query to search the objects, either a Query object or a string. It must contain an appropriate include clause to include all related objects from many-to-one relations. These related objects must also include all information needed to generate their unique key, unless they are already registered in the key index.
  Furthermore, related objects from one-to-many relations may be included. These objects will then be embedded with the relating object in the data file. The same requirements for including their respective related objects apply.
  As an alternative to a query, objs may also be a list of entity objects. The same conditions on the inclusion of related objects apply.
- keyindex (dict) – cache of generated keys. It maps object ids to unique keys. See icat.entity.Entity.getUniqueKey() for details.
- chunksize (int) – tuning parameter, see icat.client.Client.searchChunked() for details.
writedata(objs, keyindex=None, chunksize=100)
Write a data chunk.
Parameters:
- objs – an iterable that yields either queries to search for the objects or object lists. See icat.dumpfile.DumpFileWriter.writeobjs() for details.
- keyindex (dict) – cache of generated keys, see icat.dumpfile.DumpFileWriter.writeobjs() for details. If this is None, an internal index will be used.
- chunksize (int) – tuning parameter, see icat.client.Client.searchChunked() for details.
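The order in which the writer methods are typically called can be sketched with a toy class. Everything here is an assumption made for illustration: ToyWriter, its line-based output format, the (id, name) tuples standing in for entity objects, and the generated "obj_N" keys. It is not the library's implementation, and it only accepts object lists, not queries.

```python
import io


class ToyWriter:
    """Hypothetical stand-in for a DumpFileWriter subclass."""

    mode = "wt"

    def __init__(self, outfile):
        self.outfile = outfile
        self.chunk = None

    def _flush(self):
        # Write the current chunk to the file if it holds any data.
        if self.chunk:
            self.outfile.write("chunk: " + ", ".join(self.chunk) + "\n")

    def head(self):
        self.outfile.write("# toy dump\n")

    def startdata(self):
        # Flush the previous chunk, then begin a new one.
        self._flush()
        self.chunk = []

    def writeobj(self, key, obj, keyindex):
        self.chunk.append("%s=%s" % (key, obj))

    def finalize(self):
        self._flush()
        self.chunk = None

    def writeobjs(self, objs, keyindex, chunksize=100):
        # objs is a plain list of (id, name) tuples in this sketch.
        for obj_id, name in objs:
            key = keyindex.setdefault(obj_id, "obj_%d" % obj_id)
            self.writeobj(key, name, keyindex)

    def writedata(self, objs, keyindex=None, chunksize=100):
        # Fall back to an internal key index if none is given.
        if keyindex is None:
            keyindex = {}
        self.startdata()
        for o in objs:
            self.writeobjs(o, keyindex, chunksize=chunksize)


out = io.StringIO()
writer = ToyWriter(out)
writer.head()
writer.writedata([[(1, "alice"), (2, "bob")]])
writer.writedata([[(3, "carol")]])
writer.finalize()
text = out.getvalue()
```

Each call to writedata() starts a new chunk, so the grouping of queries or object lists passed to it determines the chunk boundaries in the file.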
icat.dumpfile.Backends = {}
A register of all known backends.
icat.dumpfile.register_backend(formatname, reader, writer)
Register a backend.
This function should be called by file format specific backends at initialization.
Parameters:
- formatname (str) – name of the file format that the backend implements.
- reader – class for reading data files. Should be a subclass of icat.dumpfile.DumpFileReader.
- writer – class for writing data files. Should be a subclass of icat.dumpfile.DumpFileWriter.
icat.dumpfile.open_dumpfile(client, f, formatname, mode)
Open a data file, either for reading or for writing.
Note that depending on the backend, the file must either be opened in binary or in text mode. If f is a file object, it must have been opened in the appropriate mode according to the backend selected by formatname. The backend classes define a corresponding class attribute mode. If f is a file name, the file will be opened in the appropriate mode.
The subclasses of icat.dumpfile.DumpFileReader and icat.dumpfile.DumpFileWriter may be used as context managers. This function is suitable to be used in the with statement.

>>> with open_dumpfile(client, f, "XML", 'r') as dumpfile:
...     for obj in dumpfile.getobjs():
...         obj.create()
Parameters:
- client (icat.client.Client) – the ICAT client.
- f – the object to read the data from or write the data to, according to mode. What object types are supported depends on the backend. All backends support at least a file object or the name of a file. The special value of “-” may be used as an alias for sys.stdin or sys.stdout.
- formatname (str) – name of the file format that has been registered by the backend.
- mode (str) – either “r” or “w” to indicate that the file should be opened for reading or writing respectively.
Returns: an instance of the appropriate class. This is either the reader or the writer class, according to the mode, that has been registered by the backend.
Raises: ValueError – if the format is not known or if the mode is not “r” or “w”.
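The relation between the backend register and the open function can be sketched as a plain registry pattern. This is an illustration under assumptions, not the library's code: the names toy_backends, toy_register_backend, toy_open_dumpfile and the two Toy classes are hypothetical, and storing a (reader, writer) pair per format name is an assumed layout.

```python
toy_backends = {}  # hypothetical stand-in for the Backends register


class ToyFormatReader:
    mode = "rt"

    def __init__(self, client, infile):
        self.client = client
        self.infile = infile


class ToyFormatWriter:
    mode = "wt"

    def __init__(self, client, outfile):
        self.client = client
        self.outfile = outfile


def toy_register_backend(formatname, reader, writer):
    # Assumed layout: one (reader, writer) pair per format name.
    toy_backends[formatname] = (reader, writer)


def toy_open_dumpfile(client, f, formatname, mode):
    # Reject unknown formats and modes with ValueError, as documented.
    if formatname not in toy_backends:
        raise ValueError("Unknown data file format '%s'" % formatname)
    reader, writer = toy_backends[formatname]
    if mode == "r":
        return reader(client, f)
    if mode == "w":
        return writer(client, f)
    raise ValueError("Invalid file mode '%s'" % mode)


# A backend module would register itself once at initialization.
toy_register_backend("TOY", ToyFormatReader, ToyFormatWriter)

r = toy_open_dumpfile(None, "dump.toy", "TOY", "r")
w = toy_open_dumpfile(None, "dump.toy", "TOY", "w")
```

A caller that passes an already opened file object could consult the mode attribute of the selected class, for example r.mode, to decide between text and binary mode.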
ICAT data files
Data files are partitioned into chunks. This is done to avoid having
the whole file, e.g. the complete inventory of the ICAT, in memory at
once. The problem is that objects contain references to other
objects (e.g. Datafiles refer to Datasets, the latter refer to
Investigations, and so forth). We keep an index of the objects in
order to resolve these references. But there is a memory versus time
tradeoff: we cannot keep all the objects in the index, since that would
again mean holding the complete inventory of the ICAT. And we can’t know
beforehand which object is going to be referenced later on, so we
don’t know which ones to keep and which ones to discard from the index.
Fortunately, objects that have been discarded can be queried back from
the ICAT server with icat.client.Client.searchUniqueKey(). But this is
expensive. So the strategy is as follows: keep all objects from the
current chunk in the index and discard the complete index each time a
chunk has been processed. This works well if objects mostly
reference other objects from the same chunk and only a few
references go across chunk boundaries.
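The effect of the chunk layout on the number of expensive lookups can be estimated with a toy model. This is a sketch under assumptions: objects are reduced to (key, referenced key) pairs, the index is a plain set, and count_server_lookups is a hypothetical helper that counts each reference missing from the per-chunk index as one server query.

```python
def count_server_lookups(chunks):
    """Count references that the per-chunk index cannot resolve."""
    lookups = 0
    for chunk in chunks:
        index = set()  # discarded after every chunk
        for key, ref in chunk:
            if ref is not None and ref not in index:
                lookups += 1    # would require a query to the server
                index.add(ref)  # the fetched object is cached in the index
            index.add(key)
    return lookups


# Related objects grouped together: all references stay within a chunk.
well_grouped = [
    [("inv-1", None), ("ds-1", "inv-1"), ("df-1", "ds-1")],
    [("inv-2", None), ("ds-2", "inv-2"), ("df-2", "ds-2")],
]

# Objects grouped by type: every reference crosses a chunk boundary.
badly_grouped = [
    [("inv-1", None), ("inv-2", None)],
    [("ds-1", "inv-1"), ("ds-2", "inv-2")],
    [("df-1", "ds-1"), ("df-2", "ds-2")],
]

local_lookups = count_server_lookups(well_grouped)
crossing_lookups = count_server_lookups(badly_grouped)
```

In this model the first layout needs no lookups at all, while the second needs one for every reference.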
Therefore, we want these chunks to be small enough to fit into memory, but at the same time large enough to keep as many relations between objects as possible local in a chunk. It is the responsibility of the writer of the data file to create the chunks in this manner.
Which objects get written to the data file and how this file is
organized are controlled by lists of ICAT search expressions, see
icat.dumpfile.DumpFileWriter.writeobjs(). There is some degree
of flexibility: an object may include related objects in a
one-to-many relation, just by including them in the search expression.
In this case, these related objects should not have a search
expression of their own. For instance, the search expression
for Grouping may include UserGroup. The UserGroups will then be
embedded in their respective Grouping in the data file. There should
not be a search expression for UserGroup then.
Objects related in a many-to-one relation must always be included in the search expression. This is also true if the object is indirectly related to one of the included objects. In this case, only a reference to the related object will be included in the data file. The related object must have its own list entry.
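The rule that an embedded type should not also have its own search expression can be checked mechanically. The following is a sketch under assumptions: each list entry is reduced to a hypothetical (entity type, embedded one-to-many types) pair instead of a real search expression, and doubly_listed is an illustrative helper, not part of the library.

```python
def doubly_listed(entries):
    """Return the types that are embedded elsewhere and also listed."""
    embedded = {t for _, includes in entries for t in includes}
    return sorted(t for t, _ in entries if t in embedded)


# Grouping embeds its UserGroups, so UserGroup has no entry of its own.
good_entries = [
    ("User", []),
    ("Grouping", ["UserGroup"]),
]

# Adding a separate UserGroup entry would write those objects twice.
bad_entries = good_entries + [("UserGroup", [])]

good_result = doubly_listed(good_entries)
bad_result = doubly_listed(bad_entries)
```

A check like this only covers one-to-many embedding. Objects related in a many-to-one relation still need their own list entry, since only a reference to them is written.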