icat.ingest
— Ingest metadata into ICAT
New in version 1.1.0.
Note
The status of this module in the current version is still experimental. There may be incompatible changes in the future even in minor releases of python-icat.
This module provides the class icat.ingest.IngestReader, which reads
metadata ingest files and adds their content to ICAT. It is designed
for the use case of ingesting metadata for datasets created during
experiments.
The IngestReader is based on the general purpose class
XMLDumpFileReader. It differs from that base class in restricting the
vocabulary of the input file: only objects that need to be created
during ingestion from the experiment may appear in the input. This
restriction is enforced by first validating the input against an XML
Schema Definition (XSD). In a second step, the input is transformed
into the generic ICAT data XML file format using an XSL Transformation
(XSLT) and then fed into XMLDumpFileReader. The format of the input
files may be customized to some extent by providing custom versions of
the XSD and XSLT files, see Customizing the input format below.
The Dataset objects in the input will not be created by IngestReader,
because it is assumed that a separate workflow in the caller will copy
the content of the datafiles to the storage managed by IDS and create
the corresponding Dataset and Datafile objects in ICAT at the same
time. The attributes of the datasets, however, will be read from the
input file and set in the Dataset objects by IngestReader.
IngestReader will also create the related DatasetTechnique,
DatasetInstrument, and DatasetParameter objects read from the input
file in ICAT.
- class icat.ingest.IngestReader(client, metadata, investigation)
Bases:
XMLDumpFileReader
Read metadata from XML ingest files into ICAT.
The input file may contain one or more datasets and related objects that must all belong to a single investigation. The file is first validated against an XML Schema Definition (XSD) and then transformed on-the-fly into generic ICAT data file format using an XSL Transformation (XSLT). The result of that transformation is fed into the parent class XMLDumpFileReader.
- Parameters:
client (icat.client.Client) – a client object configured to connect to the ICAT server that the objects should be created in.
metadata (Path or file object) – the input file. Either the path to the file or a file object opened for reading binary data.
investigation (icat.entity.Entity) – the investigation object that the input data should belong to.
- Raises:
icat.exception.InvalidIngestFileError – if the input in metadata is not valid.
Changed in version 1.3.0: drop class attribute XSLT_name in favour of XSLT_Map.
Changed in version 1.3.0: inject an element _environment as first child of the root element into the input data.
- SchemaDir = PosixPath('/usr/share/icat')
Path to a directory to read XSD and XSLT files from.
- XSD_Map = {('icatingest', '1.0'): 'ingest-10.xsd', ('icatingest', '1.1'): 'ingest-11.xsd'}
A mapping to select the XSD file to use. Keys are pairs of root element name and version attribute, the values are the corresponding name of the XSD file.
- XSLT_Map = {'icatingest': 'ingest.xslt'}
A mapping to select the XSLT file to use. Keys are the root element name, the values are the corresponding name of the XSLT file.
New in version 1.3.0.
- get_xsd(ingest_data)
Get the XSD file.
Inspect the root element in the input data and look up the tuple of element name and version attribute in XSD_Map. The value is taken as a file name relative to SchemaDir and this path is returned.
Subclasses may override this method to customize the XSD file to use. These derived versions may inspect the input data to select the appropriate file. Derived versions should raise InvalidIngestFileError if they decide to reject the input data.
- Parameters:
ingest_data (lxml.etree._ElementTree) – input data
- Returns:
path to the XSD file.
- Return type:
- Raises:
icat.exception.InvalidIngestFileError – if the pair of root element name and version attribute could not be found in XSD_Map.
- get_xslt(ingest_data)
Get the XSLT file.
Inspect the root element in the input data and look up the element name in XSLT_Map. The value is taken as a file name relative to SchemaDir and this path is returned.
Subclasses may override this method to customize the XSLT file to use. These derived versions may inspect the input data to select the appropriate file. Derived versions should raise InvalidIngestFileError if they decide to reject the input data.
- Parameters:
ingest_data (lxml.etree._ElementTree) – input data
- Returns:
path to the XSLT file.
- Return type:
- Raises:
icat.exception.InvalidIngestFileError – if the root element name could not be found in XSLT_Map.
Changed in version 1.3.0: look up the root element name in XSLT_Map rather than using a static file name.
- get_environment(client)
Get the environment to be injected as an element into the input.
Subclasses may override this method to control the attributes set in the environment.
- Parameters:
client (icat.client.Client) – the client object being used by this IngestReader.
- Returns:
the environment.
- Return type:
New in version 1.3.0.
- add_environment(client, ingest_data)
Inject environment information into input data.
The attributes set in the environment are determined by calling get_environment(). Subclasses may override this method to fully control the process of adding the environment element.
- Parameters:
client (icat.client.Client) – the client object being used by this IngestReader.
ingest_data (lxml.etree._ElementTree) – input data
New in version 1.3.0.
- getobjs_from_data(data, objindex)
Iterate over the objects in a data chunk.
Yield a new entity object in each iteration. The object is initialized from the data, but not yet created at the client.
- getobjs()
Iterate over the objects in the ingest file.
- ingest(datasets, dry_run=False, update_ds=False)
Ingest metadata from an ingest file.
Read the metadata provided as argument to the constructor. The acceptable set of objects in the input is restricted: only Dataset and related DatasetInstrument, DatasetTechnique, and DatasetParameter objects are allowed. The Dataset objects must be in the list provided as argument.
If dry_run is False, the related objects will be created in ICAT. In this case, the datasets in the argument must already have been created in ICAT beforehand (i.e. the id attribute must be set). If dry_run is True, the objects in the metadata will be checked for conformance, but nothing will be committed to ICAT. In this case, the datasets don’t need to be created beforehand.
If update_ds is True, the objects in the datasets argument will be updated: the attributes and the relations to other objects will be set to the values read from the input. This is particularly useful in conjunction with dry_run in order to update the datasets from the metadata prior to creating them in ICAT.
- Parameters:
datasets (iterable of icat.entity.Entity) – list of allowed datasets in the input.
dry_run (bool) – flag whether to skip creating the related objects in ICAT.
update_ds (bool) – flag whether to update the datasets in the argument.
- Raises:
icat.exception.InvalidIngestFileError – if the input is not valid, for instance if there is any unallowed object or duplicate objects.
icat.exception.SearchResultError – if any object references in the input could not be resolved.
Ingest process
The processing of the metadata during the instantiation of an
IngestReader object may be summarized by the following steps:
1. Read the metadata and parse it into an lxml.etree._ElementTree.
2. Call get_xsd() to get the appropriate XSD file and validate the metadata against that schema.
3. Inject an _environment element as first child of the root element, see below.
4. Call get_xslt() to get the appropriate XSLT file and transform the metadata into generic ICAT data XML file format.
5. Feed the result of the transformation into the parent class XMLDumpFileReader.
Once this initialization is done, ingest() may be called to read the
individual objects defined in the metadata.
The environment element
During the processing of the metadata, an _environment element will
be injected as the first child of the root element. In the current
version of python-icat, this _environment element has the following
attributes:
- icat_version
Version of the ICAT server this client connects to, i.e. the icat.client.Client.apiversion attribute of the client object being used by this IngestReader.
More attributes may be added in future versions. This _environment
element may be used by the XSLT in order to adapt the result of the
transformation to the environment, in particular to adapt the output
to the ICAT schema version it is supposed to conform to.
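The effect of the injection can be pictured with a small standard-library sketch. It is not the python-icat implementation, which operates on lxml trees: the helper inject_environment(), the placeholder child element, and the sample version string are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def inject_environment(root, icat_version):
    """Insert an _environment element as the first child of root.

    Hypothetical helper for illustration only; in python-icat the
    injection is done by IngestReader.add_environment().
    """
    env = ET.Element("_environment", {"icat_version": icat_version})
    root.insert(0, env)
    return env


# <payload/> is a placeholder child, not part of the real ingest format.
doc = ET.fromstring('<icatingest version="1.1"><payload/></icatingest>')

# "5.0.1" is a made-up sample value for the server version.
inject_environment(doc, "5.0.1")

first_child = doc[0]
print(first_child.tag, first_child.attrib)
```

After the call, the _environment element precedes all original children of the root, so an XSLT can read its icat_version attribute before processing the rest of the input.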
Ingest example
It is assumed that the XSD and XSLT files (ingest-*.xsd, ingest.xslt)
provided with the python-icat source distribution are installed in the
directory pointed to by the class attribute SchemaDir of IngestReader.
The core of an ingest script might then look like:
from pathlib import Path
import icat
from icat.ingest import IngestReader

# prerequisite: search the investigation object to ingest into from
# ICAT and collect a list of dataset objects that should be ingested
# from the data collected at the experiment. The datasets should be
# instantiated (client.new('Dataset')) and include their respective
# datafiles, but not yet created at this point:
# investigation = client.assertedSearch(...)[0]
# datasets = [...]
# metadata = Path(...path to ingest file...)

# Make a dry run to check for errors and fail early, before having
# committed anything to ICAT yet. As a side effect, this will
# update the datasets, setting the attribute values that are read
# from the input file:
try:
    reader = IngestReader(client, metadata, investigation)
    reader.ingest(datasets, dry_run=True, update_ds=True)
except (icat.InvalidIngestFileError, icat.SearchResultError) as e:
    raise RuntimeError("invalid ingest file") from e

# Create the datasets. In a real production script, you'd copy the
# content of the datafiles to IDS storage at the same time:
for ds in datasets:
    ds.create()

# Now read the metadata into ICAT for real:
reader.ingest(datasets)
There is a somewhat more complete script in the example directory of the python-icat source distribution.
Customizing the input format
The ingest input file format may be customized by providing custom XSD
and XSLT files. The easiest way to do that is to subclass
IngestReader. In most cases, you’d only need to override some class
attributes as follows:
from pathlib import Path
import icat.ingest

class MyFacilityIngestReader(icat.ingest.IngestReader):

    # Override the directory to search for XSD and XSLT files:
    SchemaDir = Path("/usr/share/icat/my-facility")

    # Override the XSD files to use:
    XSD_Map = {
        ('legacyingest', '0.5'): "legacy-ingest-05.xsd",
        ('myingest', '4.3'): "my-ingest-43.xsd",
    }

    # Override the XSLT file to use:
    XSLT_Map = {
        'legacyingest': "legacy-ingest.xslt",
        'myingest': "my-ingest.xslt",
    }
XSD_Map and XSLT_Map are mappings with properties of the root element
of the input data as keys and file names as values. The methods
get_xsd() and get_xslt() respectively inspect the input file and use
these mappings to select the XSD and XSLT file accordingly. Note that
XSD_Map takes tuples of root element name and version attribute as
keys, while XSLT_Map uses the root element name alone. It is assumed
that it is fairly easy to formulate adaptations to the input version
directly in XSLT, so a single XSLT file is sufficient to cover all
versions.
In the above example, MyFacilityIngestReader would recognize input files like
<?xml version='1.0' encoding='UTF-8'?>
<legacyingest version="0.5">
<!-- ... -->
</legacyingest>
and
<?xml version='1.0' encoding='UTF-8'?>
<myingest version="4.3">
<!-- ... -->
</myingest>
Input files having any other combination of root element name and version number would be rejected.
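The selection rule can be illustrated with a small standard-library sketch. It is not the python-icat implementation: the helper select_files() and the use of ValueError in place of InvalidIngestFileError are assumptions made for illustration, and only the legacyingest entries of the example mappings are reproduced.

```python
from pathlib import Path
import xml.etree.ElementTree as ET

# Values modelled on the legacyingest entries of the example class.
SCHEMA_DIR = Path("/usr/share/icat/my-facility")
XSD_MAP = {("legacyingest", "0.5"): "legacy-ingest-05.xsd"}
XSLT_MAP = {"legacyingest": "legacy-ingest.xslt"}


def select_files(xml_text):
    """Return (xsd_path, xslt_path) for the input or raise ValueError.

    Hypothetical stand-in for what get_xsd() and get_xslt() decide;
    the real methods raise InvalidIngestFileError instead.
    """
    root = ET.fromstring(xml_text)
    # XSD lookup uses (root name, version); XSLT lookup uses the name only.
    xsd_key = (root.tag, root.get("version"))
    try:
        xsd = XSD_MAP[xsd_key]
        xslt = XSLT_MAP[root.tag]
    except KeyError:
        raise ValueError(f"unsupported ingest format: {xsd_key}") from None
    return SCHEMA_DIR / xsd, SCHEMA_DIR / xslt


accepted = select_files('<legacyingest version="0.5"/>')

try:
    select_files('<legacyingest version="0.6"/>')
    rejected = False
except ValueError:
    rejected = True

print(accepted, rejected)
```

In this sketch the second input is rejected although its root element name is known to the XSLT mapping, because the pair of name and version has no entry in the XSD mapping.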
In more involved scenarios of selecting the XSD or XSLT files based on
the input, one may also override the get_xsd() and get_xslt() methods.