Overview of the Data Ingest to Dissemination Workflow
Introduction
The Geoscience Data Exchange (GDEX) maintains an established data ingest-to-dissemination workflow, which all GDEX dataset specialists use to bring new dataset collections into the archive. The workflow was first introduced in 2008, and few changes to its overall logic have occurred since then, beyond additions and updates to the tools that support it. To support change management, weekly staff meetings allow dataset specialists to review and agree upon proposed workflow changes. Changes typically involve the introduction of new software components, which are thoroughly vetted and tested on prototype datasets before being introduced to the production workflow. Additionally, all software workflow components are maintained in the Git version control system. Finally, since no sensitive data are housed in the GDEX, security concerns are not addressed in this document. A complete overview of the GDEX data ingest-to-dissemination workflow is provided in Figures 1, 2, and 3.
Figure 1. GDEX Dataset Selection and Dataset Metadata Creation

1.1. In this case, a PI (data submitter) requests to archive data in the GDEX through the GDEX Dataset Submission Portal, according to the GDEX Terms and Conditions. If the request is approved by the Data Engineering & Curation Section (DECS) manager, as described in the decision workflow, a dataset specialist is assigned to work with the submitter, collect the dataset metadata, and provide details on how the data will be handled. If the request is rejected by the DECS manager, the submitter is notified of the reason for the rejection and, where applicable, given suggestions for alternative repositories.
1.2. GDEX staff and NCAR scientists confer to determine which community-produced data assets are available, need to be part of the GDEX’s holdings to be accessible to CISL computing systems, and are practical to acquire (for example, Copernicus climate reanalysis data products). Once the DECS manager agrees that a data asset should be included in the GDEX, a dataset specialist is assigned to iterate with the dataset producer and review institutional documentation to draft the dataset metadata.
1.3. After the initial set of dataset metadata has been drafted, the next task of the dataset specialist is to formally create the new dataset. This process involves selecting an internal GDEX dataset ID, defining the dataset storage location on GDEX Dataset Collections Disk, and populating the dataset metadata with required fields. This information is entered through the GDEX Metadata Manager tool. Once populated, metadata are written to XML files on the GDEX Web Server Disk and to the relevant GDEX Metadata Database tables.
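The record-creation step can be sketched as follows. This is a minimal illustration, not the Metadata Manager implementation: the field names, the example dataset ID, and the use of SQLite as a stand-in for the GDEX Metadata Database are all assumptions.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Illustrative subset of required collection-level fields (not the full GDEX schema).
REQUIRED_FIELDS = ("dataset_id", "title", "storage_path", "temporal_coverage")

def create_dataset_record(meta: dict, db_path: str, xml_path: str) -> None:
    """Validate required fields, then write the record both as XML (for the
    web server disk) and as a database row (for the metadata database)."""
    missing = [f for f in REQUIRED_FIELDS if not meta.get(f)]
    if missing:
        raise ValueError(f"missing required metadata fields: {missing}")

    # XML copy, as written to the GDEX Web Server Disk.
    root = ET.Element("dataset", id=meta["dataset_id"])
    for field in REQUIRED_FIELDS[1:]:
        ET.SubElement(root, field).text = str(meta[field])
    ET.ElementTree(root).write(xml_path)

    # Database copy; SQLite stands in for the real metadata database here.
    with sqlite3.connect(db_path) as con:
        con.execute("CREATE TABLE IF NOT EXISTS datasets "
                    "(dataset_id TEXT PRIMARY KEY, title TEXT, "
                    "storage_path TEXT, temporal_coverage TEXT)")
        con.execute("INSERT OR REPLACE INTO datasets VALUES (?, ?, ?, ?)",
                    tuple(meta[f] for f in REQUIRED_FIELDS))
```

Keeping the XML and database writes in one step mirrors the workflow's requirement that both stores stay consistent once a dataset is formally created.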
Figure 2. GDEX Dataset Ingest and DOI creation

2.1. The dataset specialist triggers a process to transfer data from a remote location to local scratch disk, using the GDEX-supported “dsupdt” tool, which provides configurable options for such transfers.
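A transfer step of the kind dsupdt automates might assemble a wget invocation like the sketch below; the flag choices and the scratch path are illustrative assumptions, not dsupdt's actual configuration.

```python
def wget_command(url: str, dest_dir: str, tries: int = 5, waitretry: int = 30) -> list[str]:
    """Build a wget invocation: -c resumes interrupted transfers, --tries and
    --waitretry bound and pace retries, -P sets the download directory."""
    return ["wget", "-c", f"--tries={tries}", f"--waitretry={waitretry}",
            "-P", dest_dir, url]

# An automated ingest step would then run the command, e.g.:
#   subprocess.run(wget_command(remote_url, "/scratch/incoming"), check=True)
```

Building the command as a list (rather than a shell string) avoids quoting problems with unusual remote file names.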
2.2. The dataset specialist performs value-added data preparation steps where applicable. These operations are typically chosen for datasets whose native grids come in non-standard projections and need to be transformed into a simpler projection to better serve the user community. Additionally, data can be reorganized and/or converted from the native format into a different format to support better compression and/or broader community usage. If a data preparation operation has been performed, the native data are retained and archived with the GDEX to support reproducibility, unless the native data provider can act as a reliable backup mechanism.
2.3. Data file archive steps are triggered programmatically by the dsupdt process, or manually by the dataset specialist. The dsarch tool is used to perform the data archiving steps of:
- Compute the MD5 checksum.
- Create one dataset archive file copy on GDEX Dataset Collection Disk.
- Create one dataset archive backup file copy on the Quasar tape system.
- If needed, create one dataset archive file disaster recovery copy on the Quasar tape system, which is moved to a fireproof safe. See GDEX Data Security for more information.
- Update the GDEX Metadata Database with the locations of all files.
- Trigger the gatherxml tool to:
- Scan the archive data file.
- Verify validity of file contents.
- Extract file content metadata.
- Update file content metadata and dataset summary metadata in GDEX Metadata Database, and update dataset summary metadata on GDEX Web Server Disk.
- The GDEX Metadata Database and GDEX Web Server Disk are backed up daily by an enterprise tool to ensure redundancy and long-term preservation of dataset metadata assets. See GDEX Data Security for more information.
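The checksum-and-copy core of the archiving steps above can be sketched as follows. The directory layout and the shape of the returned record are illustrative assumptions; dsarch's real bookkeeping lives in the GDEX Metadata Database.

```python
import hashlib
import os
import shutil

def md5_checksum(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in 1 MiB chunks so large archive files are never read whole."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def archive_file(src: str, disk_dir: str, tape_staging_dir: str) -> dict:
    """Copy one file to the disk archive and a tape staging area, returning a
    location record of the kind dsarch stores in the metadata database."""
    record = {"file": os.path.basename(src), "md5": md5_checksum(src), "locations": {}}
    for label, dest_dir in (("disk", disk_dir), ("tape", tape_staging_dir)):
        dest = os.path.join(dest_dir, record["file"])
        shutil.copy2(src, dest)  # copy2 preserves file timestamps
        record["locations"][label] = dest
    return record
```

Computing the checksum before copying lets each copy be verified independently against the same recorded digest.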
2.4. After all metadata and data have been successfully ingested into the dataset archive, the dataset specialist validates that the dataset metadata are correct, and mints a digital object identifier (DOI) for the dataset using the GDEX Metadata Manager tool, which acts through the DataCite API to perform this action. Additional background on this process can be found here: https://gdex.ucar.edu/resources/citations/.
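Minting a DOI amounts to posting a metadata payload to the DataCite API. The sketch below only builds such a payload; the native field names, the landing-page URL pattern, and the choice of required elements shown are illustrative assumptions, and the authoritative mapping is the Metadata Manager's DataCite module.

```python
import json

def datacite_payload(meta: dict, prefix: str) -> str:
    """Map native dataset metadata into a DataCite-style JSON payload.
    `prefix` is the repository's registered DOI prefix; the native field
    names on the left of each mapping are illustrative."""
    attributes = {
        "prefix": prefix,
        "event": "publish",  # ask DataCite to register and make the DOI findable
        "titles": [{"title": meta["title"]}],
        "creators": [{"name": name} for name in meta["creators"]],
        "publisher": "Geoscience Data Exchange (GDEX)",
        "publicationYear": meta["year"],
        "types": {"resourceTypeGeneral": "Dataset"},
        "url": f"https://gdex.ucar.edu/datasets/{meta['dataset_id']}/",
    }
    return json.dumps({"data": {"type": "dois", "attributes": attributes}})
```

In production the payload would be POSTed to DataCite with repository credentials; authentication and error handling are omitted here.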
Figure 3. GDEX Dataset Publication and Dissemination

3.1. The Geoscience Data Exchange (GDEX) Web Server (https://gdex.ucar.edu/) provides interfaces for users to search, discover, and access the archived data and metadata through a variety of avenues.
- Summary of metadata access avenues:
- Metadata are queried from the GDEX Metadata Database and GDEX Web Server Disk to support user interaction with search tools, dataset filelists, and data request interfaces.
- 3.2. Users can access standards-based structured metadata through web service endpoints:
- Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) (https://gdex.ucar.edu/oai/?verb=Identify).
- Unidata’s Thematic Real-time Environmental Distributed Data Services (THREDDS) Data Server (https://tds.gdex.ucar.edu/thredds/catalog/catalog.html).
- Catalogue Service for the Web (CSW) (https://gdex.ucar.edu/csw/?request=GetCapabilities&service=CSW)
- Summary of data access avenues:
- Users can download archive files directly or from data request outputs using:
- 3.3. Traditional HTTP methods
- 3.7. Globus GridFTP (https://www.globus.org/)
- Data Mover servers, running the Globus Connect Server software stack (https://www.globus.org/globus-connect-server), are used to support the Globus data transfer option
- Users can request that data be prepared for them for download through:
- 3.5. Data subset and format conversion requests
- 3.4. Through interoperable tools or scripts, users can programmatically request subsets of data to be transferred through the Open Source Project for a Network Data Access Protocol (OPeNDAP) (https://www.opendap.org/) provided by the THREDDS Data Server.
- 3.8. CISL High-performance computing (HPC) users can read data archive files directly from GDEX Dataset Collection Disk (https://www2.cisl.ucar.edu/data-portals/research-data-archive).
- Links to all data access avenues can be found under the “Data Access” tab of a dataset homepage found on the GDEX Web Server. Users need to be authenticated with their GDEX user profile to access these links. An example of a “Data Access” page can be viewed here.
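The OPeNDAP access route above works by appending a DAP2 constraint expression to a THREDDS dataset URL. A minimal sketch, in which the helper name, dataset path, and variable name are hypothetical:

```python
def opendap_subset_url(dataset_url: str, var: str, **dims: tuple[int, int]) -> str:
    """Append a DAP2 constraint expression selecting an inclusive index range
    per dimension; dimension order must match the variable's storage order."""
    slabs = "".join(f"[{lo}:{hi}]" for lo, hi in dims.values())
    return f"{dataset_url}.ascii?{var}{slabs}"

# e.g. opendap_subset_url(
#     "https://tds.gdex.ucar.edu/thredds/dodsC/<dataset>/<file>.nc",
#     "tas", time=(0, 10), lat=(100, 120))
```

The `.ascii` suffix requests a text response; OPeNDAP clients more commonly fetch the binary `.dods` form through a library rather than by hand.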
Inventory and Description of GDEX Software
The following software components are used to support the tools in Figure 1 of the workflow (i.e. “GDEX Dataset Selection and Dataset Metadata Creation”):
- Researchers submit datasets through the DATAHELP desk, a Jira Data Center tool, which they can access on the GDEX Submit Data webpage.
- The Metadata Manager, the GDEX metadata entry and validation tool, is an in-house developed Python-based (https://www.python.org/) web application. The Metadata Manager maps metadata into a GDEX native schema based on International Organization for Standardization (ISO) representations (e.g. ISO 8601), and requires use of Global Change Master Directory (GCMD) controlled vocabulary keywords (https://www.earthdata.nasa.gov/data/tools/gcmd-keyword-viewer) to describe dataset collection parameters.
The following software components are used to support the tools in Figure 2 of the workflow (i.e. “GDEX Dataset Ingest and DOI creation”):
- The in-house developed dsupdt tool (2.1), used to support automated data ingest for dynamic datasets, is written in Python. dsupdt interfaces with the GDEX Metadata Database to save configuration preferences using Python-supported database modules (e.g. MySQL Connector/Python, https://dev.mysql.com/doc/connector-python/en/). dsupdt uses community supported software, including wget (https://www.gnu.org/software/wget/) and ncftp (https://www.ncftp.com/), to transfer data from remote servers to local GDEX/Computational and Information Systems Lab (CISL) servers.
- Data preparation steps (2.2) are typically performed using community supported data manipulation tools. Selected examples of data preparation tools used by the DECS include:
- wgrib2 (http://www.cpc.ncep.noaa.gov/products/wesley/wgrib2/)
- NetCDF operators (http://nco.sourceforge.net/)
- Climate Data Operators (https://code.mpimet.mpg.de/projects/cdo/)
- NCAR Command Language (https://www.ncl.ucar.edu/)
- NCAR GeoCAT (https://geocat.ucar.edu/)
- ECMWF ECCODES (https://confluence.ecmwf.int/display/ECC/ecCodes+Home)
- Open Source Packages supported/shared through the Pangeo community project (https://pangeo.io/packages.html)
- The in-house developed dsarch tool, used to archive data to GDEX dataset collection disk and the Quasar Tape system (2.3), is written in Python. It uses Python supported database modules to record file location and description information in the GDEX Metadata Database.
- The in-house developed gatherxml tool, used to extract format-specific file-level metadata and write that metadata into the GDEX Metadata Database and to GDEX Web Server Disk (2.3), is written in C++ and uses the community-supported PostgreSQL C API (https://www.postgresql.org/docs/current/libpq.html) to interface with the GDEX Metadata Database.
- The Metadata Manager metadata entry and validation tool (2.4) is an in-house developed Python-based web application. The Metadata Manager includes a Python module that maps native GDEX metadata into required DataCite (https://datacite.org/) metadata elements and calls the DataCite API (https://support.datacite.org/docs/mds-api-guide) to mint GDEX dataset DOIs.
The following software components are used to support the tools in Figure 3 of the workflow (i.e. “GDEX Dataset Publication and Dissemination”):
- The in-house developed scm tool (3.1), used to generate content metadata summaries for inclusion in dataset collection level metadata on GDEX Web Server Disk, is written in C++ and uses the community-supported PostgreSQL C API to interface with the GDEX Metadata Database.
- Several web applications provide interfaces for users to search, discover, and access archived data and metadata (3.2) including the following:
- Globus.org-developed faceted and free-text search applications, written in Python. These applications use the Globus.org Search API to interface with the GDEX Metadata search index.
- In-house developed subset request applications, written in Python, C++, PHP, and JavaScript. These use the community-supported PostgreSQL C API and the PHP PDO driver (http://php.net/manual/en/ref.pdo-mysql.php) to interface with the GDEX Metadata Database.
- In-house developed OAI-PMH and CSW servers, used to distribute standard structured dataset metadata (3.3), are written in Python and use the community supported PostgreSQL C API to interface with the GDEX Metadata Database. Both servers use community metadata specifications to map GDEX native metadata into multiple schemas (see R14 for additional details on provided metadata schemas).
- The community supported Unidata Thematic Real-time Environmental Distributed Data Services (THREDDS - https://www.unidata.ucar.edu/software/thredds/current/tds/) is used to support Open-source Project for a Network Data Access Protocol (OPeNDAP) data access (3.5).
- The externally supported Globus Connect Server (https://www.globus.org/globus-connect-server) (3.7) is used to support Globus maintained GridFTP data transfers (https://www.globus.org/#transfer) (3.8).
- The in-house developed dsrqst tool (https://rda.ucar.edu/rdadocs/dsrqst/), which automatically manages user data request processing, is written in Python, and coordinates user request processing workflows that run on CISL High Performance Computing (HPC) systems (https://arc.ucar.edu/resources) (3.6). dsrqst uses Python supported database modules to interface with the GDEX Metadata Database.
General software/server/infrastructure components used across all components of the GDEX dataset ingest to dissemination workflow include the following:
- The GDEX Web and Metadata Databases run in Docker containers on the Kubernetes-based Cloud Infrastructure for Remote Research, Universities, and Scientists (CIRRUS) platform and use an open-source PostgreSQL 17.4 database server.
- The Python/Django and C++/cgi-bin web applications currently run on the Apache 2.4.x HTTP server (https://httpd.apache.org/).
- The GDEX Dataset Collection Disk runs on an IBM Spectrum Scale General Parallel File System (https://www.ibm.com/docs/en/gpfs).
- The CISL Quasar Tape system, which hosts backups of GDEX data, is an IBM TS4500 robotic library with 2,198 slots and dual accessors. Full specification for Quasar can be found at: https://arc.ucar.edu/knowledge_base/70549580
- UCAR’s Network Engineering and Telecommunications Section (NETS, http://nets.ucar.edu/nets/intro/introduction.shtml) maintains high-volume, high-availability network connectivity to support programmatic/automated GDEX data ingest workflows. Additionally, auto-retry capability is integrated into the GDEX dsupdt tool (https://rda.ucar.edu/rdadocs/dsupdt/) to support data ingest recovery after system or network outages. The GDEX does not maintain “real-time” datasets for which immediate access is essential to support user needs. All GDEX assets are considered to be for research use only, so although NETS typically provisions around-the-clock connectivity to public and private networks at a bandwidth sufficient to meet global and/or regional responsibilities, 24x7 connectivity is not essential to the use cases of the GDEX user community.
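The auto-retry behavior described for dsupdt can be illustrated with a simple exponential-backoff wrapper. This is a generic sketch of the technique, not dsupdt's actual retry logic, and the attempt counts and delays are arbitrary assumptions.

```python
import time

def with_retries(func, attempts: int = 4, base_delay: float = 1.0):
    """Wrap a transfer step so it is re-run after transient failures,
    doubling the wait between attempts (1s, 2s, 4s, ...)."""
    def wrapper(*args, **kwargs):
        for attempt in range(attempts):
            try:
                return func(*args, **kwargs)
            except OSError:  # network and filesystem errors are retryable
                if attempt == attempts - 1:
                    raise  # exhausted: surface the failure to the operator
                time.sleep(base_delay * 2 ** attempt)
    return wrapper
```

Backoff avoids hammering a remote server that is still recovering from the outage that caused the failure in the first place.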
Additional details on Metadata, Software, and Infrastructure not captured above:
Metadata:
As highlighted above, dataset collection-level metadata are maintained in a native GDEX schema based on ISO representations (e.g. ISO 8601) and leverage Global Change Master Directory (GCMD) controlled vocabulary keywords. Tools are provided to map the native GDEX metadata into community standards-based schemas according to the relevant standard specifications, including DataCite, GCMD Directory Interchange Format (DIF), Dublin Core, Federal Geographic Data Committee (FGDC), International Organization for Standardization (ISO) 19139, ISO 19115-3, and JSON-LD Structured Data. An example of the available standard metadata schemas provided by the GDEX can be found in the “Metadata Record” menu at the bottom of an example dataset homepage: https://gdex.ucar.edu/datasets/d083002/
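A schema crosswalk of this kind reduces to mapping native field names onto a target vocabulary. Below is a minimal Dublin Core sketch; the native field names, the crosswalk table, and the wrapper element are illustrative assumptions, and the real GDEX tools implement the full element sets of each listed standard.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
# Illustrative native-to-Dublin-Core crosswalk; a real one covers far more fields.
DC_MAP = {"title": "title", "creators": "creator",
          "abstract": "description", "doi": "identifier"}

def to_dublin_core(meta: dict) -> str:
    """Serialize a native metadata dict as Dublin Core elements."""
    ET.register_namespace("dc", DC_NS)
    root = ET.Element(f"{{{DC_NS}}}metadata")  # wrapper element name is illustrative
    for native, dc in DC_MAP.items():
        values = meta.get(native, [])
        if isinstance(values, str):
            values = [values]  # single-valued fields become one-element lists
        for value in values:
            ET.SubElement(root, f"{{{DC_NS}}}{dc}").text = value
    return ET.tostring(root, encoding="unicode")
```

Repeatable elements (multiple creators, say) fall out naturally from treating every native field as a list.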
Additionally, all of the listed metadata schemas, plus the THREDDS schema, can be accessed through the GDEX Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) web service: https://gdex.ucar.edu/oai/?verb=ListMetadataFormats
A Catalog Service for the Web (CSW) server can also be used to access GDEX metadata at: https://gdex.ucar.edu/csw/?request=GetCapabilities&service=CSW
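Both endpoints follow fixed query conventions, so harvest requests can be built mechanically. A sketch (the helper names are illustrative; the verbs and parameters come from the OAI-PMH and CSW specifications):

```python
from urllib.parse import urlencode

OAI_BASE = "https://gdex.ucar.edu/oai/"
CSW_BASE = "https://gdex.ucar.edu/csw/"

def oai_request(verb: str, **params: str) -> str:
    """Build an OAI-PMH request URL, e.g. ListRecords with a metadataPrefix."""
    return OAI_BASE + "?" + urlencode({"verb": verb, **params})

def csw_request(operation: str) -> str:
    """Build a CSW GET request URL for the given operation."""
    return CSW_BASE + "?" + urlencode({"request": operation, "service": "CSW"})

# oai_request("ListMetadataFormats")              -> the ListMetadataFormats URL above
# oai_request("ListRecords", metadataPrefix="oai_dc")  -> a Dublin Core harvest request
```

A harvester would fetch these URLs and follow OAI-PMH resumption tokens to page through large record sets.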
Software:
As detailed above, the GDEX employs a combination of in-house developed software and community supported software components to support data curation, data discovery, and data access workflows. An inventory of in-house developed software components is maintained in the 42 repositories organized under the NCAR institutional GitHub space. Due to security concerns, there is currently a mix of publicly available and restricted repositories maintained in the GDEX team space, so not all repositories are visible to external parties. Documentation is included as READMEs in each GDEX team repository.
Infrastructure:
Quarterly meetings are held between relevant DECS staff to develop estimates of future storage requirements, based on regular automated ingest stream volumes and on estimated volumes of future products coming into the archive. Based on this information, CISL annually allocates Quasar tape and GDEX dataset disk resources as needed to support future GDEX growth.
Load usage and performance are actively monitored on all GDEX supported servers and services to ensure performance continues to meet expectations. New servers are procured every 4-5 years based on forecast usage metrics and recorded load usage.
Information Science and Services Division (ISS) and DECS management participate in CISL strategic planning exercises on a bi-yearly basis to ensure that GDEX service offerings evolve to meet current and future user expectations. Additionally, DECS management meets with CISL management on a monthly basis to review current service offerings and determine whether they need to be adjusted to meet existing user expectations.
Disaster and Business Continuity
The GDEX’s Information Technology (IT) infrastructure, supported by the National Center for Atmospheric Research (NCAR) Computational and Information Systems Lab (CISL), provides highly available storage, containerized web and database services, and backup and disaster recovery for archive data as described in https://gdex.ucar.edu/documents/7/RDA_data_security.pdf.
The GDEX Web and Metadata Databases operate in Docker containers on the CIRRUS platform (Cloud Infrastructure for Remote Research, Universities, and Scientists), a Kubernetes-based cloud service at the NCAR-Wyoming Supercomputing Center (NWSC) in Cheyenne, WY. In the event of hardware or software failures, the Kubernetes infrastructure and containerized services are restored by CISL High Performance Computing Systems Group (HSG) staff. The HSG service level agreement (SLA) supports maintenance on GDEX containerized services between 7 AM and 7 PM, Monday - Friday, excluding holidays.
The GDEX Web and Metadata Databases, Dataset Collection Disk, and Quasar Tape system are maintained on the NWSC uninterruptible power supply (UPS) backup. In the event of a utility power issue, all GDEX infrastructure, including containerized services, remains available on the UPS system. CISL HSG is responsible for bringing the Dataset Collection Disk, the Quasar Tape system, and CISL HPC resources back online in the event of an unplanned outage. This infrastructure is monitored by the CISL Cheyenne Facilities Operations Section (COS) on a 24x7 basis (https://arc.ucar.edu/system_status), in coordination with HSG staff. Designated DECS staff are contacted by COS and/or HSG staff in the event of an unplanned outage, and coordinate with those entities to bring services back online according to agreed-upon SLAs, which may be vendor dependent.
NCAR/UCAR Risk analysis planning:
Publicly available information can be found in the Risk Management and Resiliency section in: https://gdex.ucar.edu/resources/docs/rda-data-securityresilience-overview/. An overview of the UCAR Enterprise Risk Management office can be found at: https://rda.ucar.edu/rdadocs/UCAR-Enterprise-Risk-Management.pdf
NCAR/UCAR Disaster and business continuity plan:
UCAR's business continuity plan is based on the following standards:
- U.S. Department of Homeland Security, Federal Emergency Management Agency (FEMA)
- NFPA 1600:2007 Standard on Disaster/Emergency Management and Business Continuity Programs
- ISO 22301
- NIST SP 800-34 Contingency Planning Guide for Information Technology Systems
- DRII/DRJ GAP Generally Accepted Practices for Business Continuity Practitioners
Information on disaster and related business continuity planning can be found on the corporate intranet site which is not publicly available. A copy of the information provided on the internal website has been made publicly available at: https://gdex.ucar.edu/documents/11/UCAR-Business-Continuity.pdf