ICAT data files
ICAT data files provide a way to serialize ICAT content to a flat
file. These files are read by the icatingest and written by
the icatdump command line scripts respectively. The program
logic for reading and writing the files is provided in the
icat.dumpfile
module.
The actual file format depends on the version of the ICAT schema and on the backend: python-icat provides backends using XML and YAML.
Logical structure of ICAT data files
Data files are partitioned in chunks. This is done to avoid having the whole file, e.g. the complete inventory of the ICAT, at once in memory. The problem is that objects contain references to other objects, e.g. Datafiles refer to Datasets, the latter refer to Investigations, and so forth. We keep an index of the objects as a cache in order to resolve these references. But there is a memory versus time tradeoff: in order to avoid the index to grow beyond bounds, objects need to be discarded from the index from time to time. References to objects that can not be resolved from the index need to be searched from the ICAT server, which is of course expensive. So the strategy is as follows: keep all objects from the current chunk in the index and discard the complete index each time a chunk has been processed. [1] This will work fine if objects are mostly referencing other objects from the same chunk and only a few references go across chunk boundaries.
Therefore, we want these chunks to be small enough to fit into memory, but at the same time large enough to keep as many relations between objects as possible local in a chunk. It is in the responsibility of the writer of the data file to create the chunks in this manner.
The data chunks contain ICAT object definitions, e.g. serializations of individual ICAT objects, including all attribute values and many-to-one relations. The many-to-one relations are provided as references to other objects that must exist in the ICAT server at the moment that this object definition is read.
There is some degree of flexibility with respect to related objects in one-to-many relations: object definitions for these related objects may be included in the object definitions of the parent object. When the parent is read, these related objects will be created along with the parent in one single cascading call. Thus, the related objects must not be included again as a separate object in the ICAT data file. For instance, an ICAT data file may include User, Grouping, and UserGroup as separate objects. In this case, the UserGroup entries must properly reference User and Grouping as their related objects. Alternatively the file may only contain User and Grouping objects, with the UserGroups being included into the object definition of the corresponding Grouping objects.
References to ICAT objects and unique keys
References to ICAT objects may be encoded using reference keys. There are two kinds of those keys, local keys and unique keys:
When an ICAT object is defined in the file, it generally defines a local key at the same time. Local keys are stored in the object index and may be used to reference this object from other objects in the same data chunk.
Unique keys can be obtained from an object by calling
icat.entity.Entity.getUniqueKey()
. An object can be searched by
its unique key from the ICAT server by calling
icat.client.Client.searchUniqueKey()
. As a result, it is
possible to reference an object by its unique key even if the
reference is not in the object index. All references that go across
chunk boundaries must use unique keys. [1]
Reference keys should be considered as opaque ids.
ICAT data XML files
The root element of ICAT data XML files is icatdata
. It may
optionally have one head
subelement and one or more data
subelements.
The head
element will be ignored by icatingest. It serves
to provide some information on the context of the creation of the data
file, which may be useful for debugging in case of issues.
The actual payload of an ICAT data XML file is in the data
elements. There can be any number of them and each is one chunk
according to the logical structure explained above. The subelements
of data
may either be ICAT object references or ICAT object
definitions, both explained in detail below. Either of them may have
an id
attribute that defines a local key that allows to reference
the corresponding object later on.
Snippet 1 shows a simple example for an ICAT
data XML file having one single data
element that defines four
Datasets.
<?xml version="1.0" encoding="utf-8"?>
<icatdata>
<head>
<date>2023-10-17T07:33:36Z</date>
<generator>manual edit</generator>
</head>
<data>
<investigationRef id="inv_1" name="10100601-ST" visitId="1.1-N"/>
<dataset id="dataset_1">
<complete>false</complete>
<endDate>2012-07-30T01:10:08+00:00</endDate>
<name>e209001</name>
<startDate>2012-07-26T15:44:24+00:00</startDate>
<investigation ref="inv_1"/>
<sample name="ab3465" investigation.ref="inv_1"/>
<type name="raw"/>
</dataset>
<dataset id="dataset_2">
<complete>false</complete>
<endDate>2012-08-06T01:10:08+00:00</endDate>
<name>e209002</name>
<startDate>2012-08-02T05:30:00+00:00</startDate>
<investigation ref="inv_1"/>
<sample name="ab3465" investigation.ref="inv_1"/>
<type name="raw"/>
</dataset>
<dataset id="dataset_3">
<complete>false</complete>
<endDate>2012-07-16T14:30:17+00:00</endDate>
<name>e209003</name>
<startDate>2012-07-16T11:42:05+00:00</startDate>
<investigation ref="inv_1"/>
<sample name="ab3466" investigation.ref="inv_1"/>
<type name="raw"/>
</dataset>
<dataset id="dataset_4">
<complete>false</complete>
<endDate>2012-07-31T22:52:23+00:00</endDate>
<name>e209004</name>
<startDate>2012-07-31T20:20:37+00:00</startDate>
<investigation ref="inv_1"/>
<type name="raw"/>
</dataset>
</data>
</icatdata>
ICAT object references
ICAT object references do not define an ICAT object to be created when reading the ICAT data file but reference an already existing one. It is either assumed to exist in ICAT before ingesting the file or it must appear earlier in the ICAT data file, so that it will be created before the referencing object is read.
ICAT objects may either be referenced by reference key or by
attributes. A reference key should be included as a ref
attribute.
When referencing the object by attributes, these attributes should be
included using the same name in the reference element. This may also
include attributes of related objects using the same dot notation as
for ICAT JPQL search expressions. Referencing by attributes may be
combined with referencing related objects by reference key, using
ref
in place of the related object’s attribute names. In any
case, referenced objects must be uniquely defined by the attribute
values.
ICAT object references may be used in two locations in ICAT data XML
files: as direct subelements of data
or to reference related
objects in many-to-one relations in ICAT object definitions, see
below. In the former case, the name of the object reference element
is the name of the corresponding ICAT entity type (the first letter in
lowercase) with a Ref
suffix appended. In that case, the element
should have an id
attribute that will define a local key that can
be used to reference that object in subsequent object references.
This is convenient to define a shortcut when the same object needs to
be referenced often, to avoid having to repeat the same set of
attributes each time.
In any case, object reference elements only have attributes, but no content or subelements.
See Snippet 1 for a few examples: the first
subelement of the data
element in this case is
investigationRef
. It references a (supposed to be existing)
Investigation by its attributes name
and visitId
. It defines
a local key for that Investigation object in the id
attribute.
The Dataset object definitions in that example each use that local key
to set their relation with the Investigation respectively. The
Dataset object definitions each also include a relation with their
type
, referencing the related DatasetType by the name
attribute. Some of the Dataset object definitions also include a
relation with a Sample. The respective Sample object is referenced by
name
and the related Investigation. The latter is referenced by
the local key defined earlier in the investigation.ref
attribute.
ICAT object definitions
ICAT object definitions define objects that will be created in ICAT
when ingesting the ICAT data file. As direct subelements of data
,
the name of the element must be the name of the corresponding entity
type in the ICAT schema (the first letter in lowercase).
The subelements of ICAT object definitions are the attributes and object relations as defined in the ICAT schema using the same names. Attributes must include the corresponding value as text content of the element. All many-to-one relations must be provided as ICAT object references, see above.
The ICAT object definitions may include one-to-many relations as subelements. In this case, these subelements must in turn be ICAT object definitions for the related objects. These related objects will be created along with the parent in one single cascading call. The object definition for the related object must not include its relation with the parent object as this is already implied by the parent and child relationship.
When appearing as direct subelements of data
, ICAT object
definitions may have an id
attribute that will define a local key
that can be used to reference the defined object later on.
<?xml version="1.0" encoding="utf-8"?>
<icatdata>
<head>
<date>2024-01-03T13:21:15+00:00</date>
<service>https://icat.example.com:8181/ICATService/ICAT?wsdl</service>
<apiversion>6.0.0</apiversion>
<generator>icatdump (python-icat 1.2.0)</generator>
</head>
<data>
<user id="User_name-db=2Fahau">
<affiliation>Goethe University Frankfurt, Faculty of Philosophy and History</affiliation>
<email>ahau@example.org</email>
<familyName>Hau</familyName>
<fullName>Arnold Hau</fullName>
<givenName>Arnold</givenName>
<name>db/ahau</name>
<orcidId>0000-0002-3263</orcidId>
</user>
<user id="User_name-db=2Fjbotu">
<affiliation>Université Paul-Valéry Montpellier 3</affiliation>
<email>jbotu@example.org</email>
<familyName>Botul</familyName>
<fullName>Jean-Baptiste Botul</fullName>
<givenName>Jean-Baptiste</givenName>
<name>db/jbotu</name>
<orcidId>0000-0002-3264</orcidId>
</user>
<user id="User_name-db=2Fjdoe">
<email>jdoe@example.org</email>
<familyName>Doe</familyName>
<fullName>John Doe</fullName>
<givenName>John</givenName>
<name>db/jdoe</name>
</user>
<user id="User_name-db=2Fnbour">
<affiliation>University of Nancago</affiliation>
<email>nbour@example.org</email>
<familyName>Bourbaki</familyName>
<fullName>Nicolas Bourbaki</fullName>
<givenName>Nicolas</givenName>
<name>db/nbour</name>
<orcidId>0000-0002-3266</orcidId>
</user>
<grouping id="Grouping_name-investigation=5F10100601=2DST=5Fowner">
<name>investigation_10100601-ST_owner</name>
<userGroups>
<user ref="User_name-db=2Fahau"/>
</userGroups>
</grouping>
<grouping id="Grouping_name-investigation=5F10100601=2DST=5Freader">
<name>investigation_10100601-ST_reader</name>
<userGroups>
<user ref="User_name-db=2Fjbotu"/>
</userGroups>
<userGroups>
<user ref="User_name-db=2Fjdoe"/>
</userGroups>
<userGroups>
<user ref="User_name-db=2Fnbour"/>
</userGroups>
</grouping>
<grouping id="Grouping_name-investigation=5F10100601=2DST=5Fwriter">
<name>investigation_10100601-ST_writer</name>
<userGroups>
<user ref="User_name-db=2Fahau"/>
</userGroups>
</grouping>
</data>
<data>
<investigation id="Investigation_facility-(name-ESNF)_name-10100601=2DST_visitId-1=2E1=2DN">
<doi>DOI:00.0815/inv-00601</doi>
<endDate>2010-10-12T15:00:00+00:00</endDate>
<fileCount>4</fileCount>
<fileSize>127125</fileSize>
<name>10100601-ST</name>
<startDate>2010-09-30T10:27:24+00:00</startDate>
<title>Ni-Mn-Ga flat cone</title>
<visitId>1.1-N</visitId>
<facility ref="Facility_name-ESNF"/>
<investigationGroups>
<role>owner</role>
<grouping ref="Grouping_name-investigation=5F10100601=2DST=5Fowner"/>
</investigationGroups>
<investigationGroups>
<role>reader</role>
<grouping ref="Grouping_name-investigation=5F10100601=2DST=5Freader"/>
</investigationGroups>
<investigationGroups>
<role>writer</role>
<grouping ref="Grouping_name-investigation=5F10100601=2DST=5Fwriter"/>
</investigationGroups>
</investigation>
</data>
</icatdata>
Consider the example in Snippet 2. It contains two chunks: the first chunk contains four User objects and three Grouping objects. The Groupings include related UserGroups. Note that these UserGroups include their relation to the User, but not their relation with Grouping. The latter is implied by the parent relation of the object in the file. The second chunk only contains one Investigation, including related InvestigationGroups.
Finally note that the file format also depends on the ICAT schema version: the present example can only be ingested into ICAT server 5.0 or newer, because the attributes fileCount and fileSize have been added to Investigation in this version. With older ICAT versions, it will fail because these attributes are not defined.
You will find more extensive examples in the source distribution of python-icat. The distribution also provides XML Schema Definition files for the ICAT data XML file format corresponding to various ICAT schema versions. Note the these XML Schema Definition files are provided for reference only. The icatingest script does not validate its input.
ICAT data YAML files
In this section we describe the ICAT data file format using the YAML backend. Consider the example in Snippet 3, it corresponds to the same ICAT content as the XML in Snippet 2:
%YAML 1.1
# Date: Wed, 03 Jan 2024 13:24:51 +0000
# Service: https://icat.example.com:8181/ICATService/ICAT?wsdl
# ICAT-API: 6.0.0
# Generator: icatdump (python-icat 1.2.0)
---
grouping:
Grouping_name-investigation=5F10100601=2DST=5Fowner:
name: investigation_10100601-ST_owner
userGroups:
- user: User_name-db=2Fahau
Grouping_name-investigation=5F10100601=2DST=5Freader:
name: investigation_10100601-ST_reader
userGroups:
- user: User_name-db=2Fjbotu
- user: User_name-db=2Fjdoe
- user: User_name-db=2Fnbour
Grouping_name-investigation=5F10100601=2DST=5Fwriter:
name: investigation_10100601-ST_writer
userGroups:
- user: User_name-db=2Fahau
user:
User_name-db=2Fahau:
affiliation: Goethe University Frankfurt, Faculty of Philosophy and History
email: ahau@example.org
familyName: Hau
fullName: Arnold Hau
givenName: Arnold
name: db/ahau
orcidId: 0000-0002-3263
User_name-db=2Fjbotu:
affiliation: "Universit\xE9 Paul-Val\xE9ry Montpellier 3"
email: jbotu@example.org
familyName: Botul
fullName: Jean-Baptiste Botul
givenName: Jean-Baptiste
name: db/jbotu
orcidId: 0000-0002-3264
User_name-db=2Fjdoe:
email: jdoe@example.org
familyName: Doe
fullName: John Doe
givenName: John
name: db/jdoe
User_name-db=2Fnbour:
affiliation: University of Nancago
email: nbour@example.org
familyName: Bourbaki
fullName: Nicolas Bourbaki
givenName: Nicolas
name: db/nbour
orcidId: 0000-0002-3266
---
investigation:
Investigation_facility-(name-ESNF)_name-10100601=2DST_visitId-1=2E1=2DN:
doi: DOI:00.0815/inv-00601
endDate: '2010-10-12T15:00:00+00:00'
facility: Facility_name-ESNF
fileCount: 4
fileSize: 127125
investigationGroups:
- grouping: Grouping_name-investigation=5F10100601=2DST=5Fowner
role: owner
- grouping: Grouping_name-investigation=5F10100601=2DST=5Freader
role: reader
- grouping: Grouping_name-investigation=5F10100601=2DST=5Fwriter
role: writer
name: 10100601-ST
startDate: '2010-09-30T10:27:24+00:00'
title: Ni-Mn-Ga flat cone
visitId: 1.1-N
ICAT data YAML files start with a head consisting of a few comment
lines, followed by one or more YAML documents. YAML documents are
separated by a line containing only ---
. The comments in the head
provide some information on the context of the creation of the data
file, which may be useful for debugging in case of issues.
Each YAML document defines one chunk of data according to the logical structure explained above. It consists of a mapping having the name of entity types in the ICAT schema (the first letter in lowercase) as keys. The values are in turn mappings that map object ids as key to ICAT object definitions as value. These object ids define local keys that may be used to reference the respective object later on. In the present example, the first chunk contains four User objects and three Grouping objects. The Groupings include related UserGroups. The second chunk only contains one Investigation, including related investigationGroups.
Each of the ICAT object definitions corresponds to an object in the ICAT schema. It is again a mapping with the object’s attribute and relation names as keys and corresponding values. All many-to-one relations must be provided and reference existing objects, e.g. they must either already have existed before starting the ingestion or appear in the same or an earlier YAML document in the ICAT data file. The values of many-to-one relations are reference keys, either local keys defined in the same YAML document or unique keys. Unlike the XML backend, the YAML backend does not support referencing objects by attributes.
The object definitions may include one-to-many relations. In this case, the value for the relation name is a list of object definitions for the related objects. These related objects will be created along with the parent in one single cascading call. In the present example, the Grouping objects include their related UserGroup objects. Note that these UserGroups include their relation to the User, but not with Grouping. The latter relationship is implied by the parent relation of the object in the file.
Note that the entries in the mappings in YAML have no inherent order. The icatingest script uses a predefined order to read the ICAT entity types in order to make sure that referenced objects are created before any object that may reference them.