icat.ingest — Ingest metadata into ICAT

New in version 1.1.0.

Note

The status of this module in the current version is still experimental. There may be incompatible changes in the future even in minor releases of python-icat.

This module provides the class icat.ingest.IngestReader, which reads metadata ingest files and adds their content to ICAT. It is designed for the use case of ingesting metadata for datasets created during experiments.

The IngestReader is based on the general purpose class XMLDumpFileReader. It differs from that base class in restricting the vocabulary of the input file: only objects that need to be created during ingestion from the experiment may appear in the input. This restriction is enforced by first validating the input against an XML Schema Definition (XSD). In a second step, the input is transformed into generic ICAT data XML file format using an XSL Transformation (XSLT) and then fed into XMLDumpFileReader. The format of the input files may be customized to some extent by providing custom versions of the XSD and XSLT files; see Customizing the input format below.

The Dataset objects in the input will not be created by IngestReader, because it is assumed that a separate workflow in the caller will copy the content of datafiles to the storage managed by IDS and create the corresponding Dataset and Datafile objects in ICAT at the same time. But the attributes of the datasets will be read from the input file and set in the Dataset objects by IngestReader. IngestReader will also create the related DatasetTechnique, DatasetInstrument and DatasetParameter objects read from the input file in ICAT.

class icat.ingest.IngestReader(client, metadata, investigation)

Bases: XMLDumpFileReader

Read metadata from XML ingest files into ICAT.

The input file may contain one or more datasets and related objects that must all belong to a single investigation. The file is first validated against an XML Schema Definition (XSD) and then transformed on-the-fly into generic ICAT data file format using an XSL Transformation (XSLT). The result of that transformation is fed into the parent class XMLDumpFileReader.

Parameters:
  • client (icat.client.Client) – a client object configured to connect to the ICAT server that the objects should be created in.

  • metadata (Path or file object) – the input file. Either the path to the file or a file object opened for reading binary data.

  • investigation (icat.entity.Entity) – the investigation object that the input data should belong to.

Raises:

icat.exception.InvalidIngestFileError – if the input in metadata is not valid.

Changed in version 1.3.0: drop class attribute XSLT_name in favour of XSLT_Map.

Changed in version 1.3.0: inject an element _environment as first child of the root element into the input data.

SchemaDir = PosixPath('/usr/share/icat')

Path to a directory to read XSD and XSLT files from.

XSD_Map = {('icatingest', '1.0'): 'ingest-10.xsd', ('icatingest', '1.1'): 'ingest-11.xsd'}

A mapping to select the XSD file to use. Keys are pairs of root element name and version attribute; values are the names of the corresponding XSD files.

XSLT_Map = {'icatingest': 'ingest.xslt'}

A mapping to select the XSLT file to use. Keys are root element names; values are the names of the corresponding XSLT files.

New in version 1.3.0.

get_xsd(ingest_data)

Get the XSD file.

Inspect the root element in the input data and look up the tuple of element name and version attribute in XSD_Map. The value is taken as a file name relative to SchemaDir and the resulting path is returned.

Subclasses may override this method to customize the XSD file to use. These derived versions may inspect the input data to select the appropriate file. Derived versions should raise InvalidIngestFileError if they decide to reject the input data.

Parameters:

ingest_data (lxml.etree._ElementTree) – input data

Returns:

path to the XSD file.

Return type:

Path

Raises:

icat.exception.InvalidIngestFileError – if the pair of root element name and version attribute could not be found in XSD_Map.

get_xslt(ingest_data)

Get the XSLT file.

Inspect the root element in the input data and look up the element name in XSLT_Map. The value is taken as a file name relative to SchemaDir and the resulting path is returned.

Subclasses may override this method to customize the XSLT file to use. These derived versions may inspect the input data to select the appropriate file. Derived versions should raise InvalidIngestFileError if they decide to reject the input data.

Parameters:

ingest_data (lxml.etree._ElementTree) – input data

Returns:

path to the XSLT file.

Return type:

Path

Raises:

icat.exception.InvalidIngestFileError – if the root element name could not be found in XSLT_Map.

Changed in version 1.3.0: look up the root element name in XSLT_Map rather than using a static file name.

get_environment(client)

Get the environment to be injected as an element into the input.

Subclasses may override this method to control the attributes set in the environment.

Parameters:

client (icat.client.Client) – the client object being used by this IngestReader.

Returns:

the environment.

Return type:

dict

New in version 1.3.0.

add_environment(client, ingest_data)

Inject environment information into input data.

The attributes set in the environment are determined by calling get_environment(). Subclasses may override this method to fully control the process of adding the environment element.

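Parameters:
  • client (icat.client.Client) – the client object being used by this IngestReader.

  • ingest_data (lxml.etree._ElementTree) – input data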

New in version 1.3.0.

getobjs_from_data(data, objindex)

Iterate over the objects in a data chunk.

Yield a new entity object in each iteration. The object is initialized from the data, but not yet created at the client.

getobjs()

Iterate over the objects in the ingest file.

ingest(datasets, dry_run=False, update_ds=False)

Ingest metadata from an ingest file.

Read the metadata provided as argument to the constructor. The acceptable set of objects in the input is restricted: only Dataset and related DatasetInstrument, DatasetTechnique, and DatasetParameter objects are allowed. The Dataset objects must be in the list provided as argument.

If dry_run is False, the related objects will be created in ICAT. In this case, the datasets in the argument must already have been created in ICAT beforehand (i.e. the id attribute must be set). If dry_run is True, the objects in the metadata will be checked for conformance, but nothing will be committed to ICAT. In this case, the datasets don’t need to be created beforehand.

If update_ds is True, the objects in the datasets argument will be updated: the attributes and the relations to other objects will be set to the values read from the input. This is particularly useful in conjunction with dry_run in order to update the datasets from the metadata prior to creating them in ICAT.

Parameters:
  • datasets (iterable of icat.entity.Entity) – list of allowed datasets in the input.

  • dry_run (bool) – flag whether to skip creating the related objects in ICAT.

  • update_ds (bool) – flag whether to update the datasets in the argument.

Raises:

icat.exception.InvalidIngestFileError – if any object that is not allowed is read from the input.

Ingest process

The processing of the metadata during the instantiation of an IngestReader object may be summarized by the following steps:

  1. Read the metadata and parse it into an lxml.etree._ElementTree.

  2. Call get_xsd() to get the appropriate XSD file and validate the metadata against that schema.

  3. Inject an _environment element as first child of the root element, see below.

  4. Call get_xslt() to get the appropriate XSLT file and transform the metadata into generic ICAT data XML file format.

  5. Feed the result of the transformation into the parent class XMLDumpFileReader.

Once this initialization is done, ingest() may be called to read the individual objects defined in the metadata.

The environment element

During the processing of the metadata, an _environment element will be injected as the first child of the root element. In the current version of python-icat, this _environment element has the following attributes:

icat_version

Version of the ICAT server this client connects to, i.e. the icat.client.Client.apiversion attribute of the client object being used by this IngestReader.

More attributes may be added in future versions. This _environment element may be used by the XSLT in order to adapt the result of the transformation to the environment, in particular to adapt the output to the ICAT schema version it is supposed to conform to.
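The injection step itself can be pictured with a minimal sketch. It is not the python-icat implementation: it uses the standard library's xml.etree.ElementTree as a stand-in for lxml.etree, the helper name inject_environment is hypothetical, and the version string is an assumed value that would normally be derived from client.apiversion.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree


def inject_environment(env, ingest_data):
    """Insert an _environment element as first child of the root.

    Hypothetical helper for illustration; env maps attribute names
    to string values.
    """
    env_elem = ET.Element("_environment", attrib=env)
    ingest_data.getroot().insert(0, env_elem)


doc = ET.ElementTree(ET.fromstring(
    '<icatingest version="1.1"><data/></icatingest>'
))
# Assumed value; a real reader would take it from the client object.
inject_environment({"icat_version": "5.0.1"}, doc)

first = doc.getroot()[0]
print(first.tag, first.attrib)
```

An XSLT can then read the attribute from the first child of the root element, for instance with an XPath expression such as /icatingest/_environment/@icat_version.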

Ingest example

It is assumed that the XSD and XSLT files (ingest-*.xsd, ingest.xslt) provided with the python-icat source distribution are installed in the directory pointed to by the class attribute SchemaDir of IngestReader. The core of an ingest script might then look like:

from pathlib import Path
import icat
from icat.ingest import IngestReader

# prerequisite: search the investigation object to ingest into from
# ICAT and collect a list of dataset objects that should be ingested
# from the data collected at the experiment.  The datasets should be
# instantiated (client.new('Dataset')) and include their respective
# datafiles, but not yet created at this point:
# investigation = client.assertedSearch(...)[0]
# datasets = [...]
# metadata = Path(...path to ingest file...)

# Make a dry run to check for errors and fail early, before having
# committed anything to ICAT yet.  As a side effect, this will
# update the datasets, setting the attribute values that are read
# from the input file:
try:
    reader = IngestReader(client, metadata, investigation)
    reader.ingest(datasets, dry_run=True, update_ds=True)
except (icat.InvalidIngestFileError, icat.SearchResultError) as e:
    raise RuntimeError("invalid ingest file") from e

# Create the datasets.  In a real production script, you'd copy the
# content of the datafiles to IDS storage at the same time:
for ds in datasets:
    ds.create()

# Now read the metadata into ICAT for real:
reader.ingest(datasets)

There is a somewhat more complete script in the example directory of the python-icat source distribution.

Customizing the input format

The ingest input file format may be customized by providing custom XSD and XSLT files. The easiest way to do that is to subclass IngestReader. In most cases, you’d only need to override some class attributes as follows:

from pathlib import Path
import icat.ingest

class MyFacilityIngestReader(icat.ingest.IngestReader):

    # Override the directory to search for XSD and XSLT files:
    SchemaDir = Path("/usr/share/icat/my-facility")

    # Override the XSD files to use:
    XSD_Map = {
        ('legacyingest', '0.5'): "legacy-ingest-05.xsd",
        ('myingest', '4.3'): "my-ingest-43.xsd",
    }

    # Override the XSLT file to use:
    XSLT_Map = {
        'legacyingest': "legacy-ingest.xslt",
        'myingest': "my-ingest.xslt",
    }

XSD_Map and XSLT_Map are mappings with properties of the root element of the input data as keys and file names as values. The methods get_xsd() and get_xslt() respectively inspect the input file and use these mappings to select the XSD and XSLT file accordingly. Note that XSD_Map takes tuples of root element name and version attribute as keys, while XSLT_Map uses the root element name alone. It is assumed that it is fairly easy to formulate adaptations to the input version directly in XSLT, so one single XSLT file would be sufficient to cover all versions.

In the above example, MyFacilityIngestReader would recognize input files like

<?xml version='1.0' encoding='UTF-8'?>
<legacyingest version="0.5">
    <!-- ... -->
</legacyingest>

and

<?xml version='1.0' encoding='UTF-8'?>
<myingest version="4.3">
    <!-- ... -->
</myingest>

Input files having any other combination of root element name and version number would be rejected.
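The acceptance rule can be illustrated with a small sketch that only looks at the keys of XSD_Map from the example above. It is an approximation under stated assumptions: the function is_recognized is hypothetical, the standard library parser stands in for lxml, and the real IngestReader raises InvalidIngestFileError instead of returning False.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree

# Keys of XSD_Map as defined in MyFacilityIngestReader above.
XSD_KEYS = {('legacyingest', '0.5'), ('myingest', '4.3')}


def is_recognized(xml_text):
    """Return True if (root element name, version) is a known pair.

    Hypothetical helper; it mirrors only the lookup, not the
    subsequent validation against the XSD file.
    """
    root = ET.fromstring(xml_text)
    return (root.tag, root.get("version")) in XSD_KEYS


print(is_recognized('<myingest version="4.3"/>'))    # True
print(is_recognized('<myingest version="4.4"/>'))    # False
print(is_recognized('<icatingest version="1.1"/>'))  # False
```

Note that the default icatingest root element is no longer accepted here, because the subclass replaces XSD_Map rather than extending it.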

In more involved scenarios of selecting the XSD or XSLT files based on the input, one may also override the get_xsd() and get_xslt() methods.
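As one possible shape of such an override, the sketch below selects the XSD file by major version only, so that any 4.x input maps to the same schema. Everything specific in it is an assumption for illustration: the class MajorVersionXSD, the attribute XSD_Major_Map and the file name my-ingest-4x.xsd are hypothetical, the exception class is a local stand-in for icat.exception.InvalidIngestFileError, and the standard library parser replaces lxml so that the block runs without python-icat installed.

```python
from pathlib import Path
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree


class InvalidIngestFileError(Exception):
    """Local stand-in for icat.exception.InvalidIngestFileError."""


class MajorVersionXSD:
    """Hypothetical class: select the XSD file by major version only."""

    SchemaDir = Path("/usr/share/icat/my-facility")
    # Hypothetical mapping and file name, not part of python-icat.
    XSD_Major_Map = {('myingest', '4'): "my-ingest-4x.xsd"}

    def get_xsd(self, ingest_data):
        root = ingest_data.getroot()
        version = root.get("version", "")
        major = version.split(".")[0]
        try:
            name = self.XSD_Major_Map[(root.tag, major)]
        except KeyError:
            # Reject the input, as the method documentation asks
            # derived versions to do.
            raise InvalidIngestFileError(
                "unsupported input: %s version %s" % (root.tag, version)
            ) from None
        return self.SchemaDir / name


doc = ET.ElementTree(ET.fromstring('<myingest version="4.7"/>'))
print(MajorVersionXSD().get_xsd(doc))
```

In a real script, the get_xsd() method would presumably live in a subclass of IngestReader and raise the exception class from icat.exception; the lookup logic itself would stay the same.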