README FOR: Investigation of metadata standard use by geoscience data repositories, Version 2

Principal Investigator: Yauheniya Liapich
Morgridge College of Education
University of Denver
1999 E Evans Ave, Denver, CO 80210
euginia.liapich@du.edu

Co-Investigator: Matthew S. Mayernik
National Center for Atmospheric Research
University Corporation for Atmospheric Research
1850 Table Mesa Dr, Boulder, CO 80305
mayernik@ucar.edu

Date of Data Collection Range: 2021-03-19 to 2021-11-18

The data was collected remotely by examining the websites of the organizations and noting the patterns in the metadata for their datasets. One sheet in the data collection spreadsheet includes only the CDF and ESIP data centers. The next sheet in the data collection spreadsheet concerns the USGS data facilities. We have separated the CDF and ESIP sheet from the USGS sheet to demonstrate broad community trends versus variation and consistency within a single organization (USGS).

## Version 2 Note

This dataset has been updated in response to peer reviewers of the associated paper. Changes for Version 2 are as follows: 1) a small number of corrections were made to incorrect values in the original data, 2) the data in columns H-L have been added. Note that these new columns were only added to the "ESIP_CDF" file. They are not included in the USGS file, as they are not relevant to that file. No changes have been made to the USGS file for this Version 2.

## Column Description

Column A - Data Center: In this column, we noted the individual data center from which we collected our data. We investigated these data centers to examine the manner in which the metadata for the organizational datasets followed the FAIR Principles.

Column B - Data Center URL: In this column, we noted the URL of the particular data center that we visited.
The URL was important to our research because it allowed us to check whether the web sites we visited were congruent with the names of the data centers and whether we were researching the appropriate web site for each center.

Column C - Example Record Examined: In this column, we noted the URL for the specific dataset that we examined on the website. This provides traceability to the exact web pages we examined, and also indicates whether the URL in column C included at least a section of the URL found in column B, to ensure that the dataset we were using was representative of the website.

Column D - Metadata schema(s) used: In this column, we noted the metadata schema(s) that were found for each dataset examined. If we could not determine which metadata schemas the dataset used (if any), we coded it as "Unknown".

Column E - Keyword or Subject vocabs: In this column, we noted which keywords or controlled vocabularies the datasets used. The usage of such keywords was important to our research because we wanted to note how the FAIR principle of Interoperability was reflected in the individual datasets. In this context, the principle of Interoperability was salient because it reflected increased compatibility of the datasets among institutions with various specialties. If the dataset used keywords that did not come from a particular controlled vocabulary, we designated the vocabulary as "Custom". When we did not know which controlled vocabulary the dataset used, we listed the keywords as "Unknown".

Column F - Present DOIs for data? (Y/N): In this column, we noted whether DOIs were present for the dataset. It was important for our research to note DOIs because we aimed to determine whether the datasets were compatible with the FAIR principle of Findability, with the presence of DOIs facilitating the findability of the dataset.

Column G - Data Facility Type: This column indicates which type of data facility we examined for the research.
We examined data facilities that were members of two different groups: the Council of Data Facilities (CDF) and the Earth Science Information Partners (ESIP). The type of data facility is important to our research because these two organizations encompass the majority of the Earth system science data facilities within the US, and thus encompass a large proportion of the high-quality, high-volume data and metadata collections. These types of data facility are open and transparent in allowing access to their datasets, have a high level of accessibility in their metadata, and commonly use DOIs for datasets.

Column H - DataCite subjects: This column notes whether the number of subjects present in the DataCite record for a DOI provided by the data facility is the same as, larger than, or smaller than the number of subjects present on the original landing page for the dataset. The comparison of the number of subjects present in DataCite and on the original page was important because the consistency and presence of subjects reflected the findability, accessibility, and reusability of the dataset for the individual facility.

Column I - Subject Schema: This column notes whether it was possible to deduce the presence of a subject schema from the DataCite XML. The presence of a subject schema for the controlled vocabulary allowed us to note whether the dataset enables interoperability with records from other data facilities, which increases the accessibility of the individual records that the data facility provides through the DOI registered in DataCite.

Column J - DataCite URL: This column gives the URL of the associated DataCite metadata record. The URL is significant because it allows us and others to trace and validate the DataCite records that we examined.

Column K - Number of datasets: This column notes the number of datasets found on the data facility website.
The number was either noted on the data facility website or calculated using the resources provided on the individual data facility web pages. If it was not possible to determine a number, we coded this column as "NP".

Column L - Notes on the number: This column notes how the number of datasets was found or calculated. If the number of datasets was calculated, it only reflects the number of datasets that were accessible to individuals visiting the public web site. The notation provided here about the number of accessible datasets is intended to ensure the transparency and accuracy of our research, because it makes the scope for additional analysis known.

### Council of Data Facilities Members

We initially selected the repositories by examining the members of the Council of Data Facilities (CDF). As we selected the data facilities for analysis, we followed the CDF's membership types. Category A consists of NSF-funded not-for-profit or academic data facilities. The data facilities in category B are Federally Funded Research and Development Centers (FFRDCs) and other federal, state, and local data facilities. The data facilities in category C are international, private, and other not-for-profit or academic data facilities. We excluded category D data facilities because they do not conduct data collection or discovery and are therefore unlikely to have datasets, let alone metadata. Rather, these facilities support the facilities that do preserve data.

### Earth Science Information Partners (ESIP) Members

In order to increase diversity in our metadata investigation, we also selected ESIP members. Similar to the CDF, ESIP has multiple categories of members.
The Type I category consists of distributors of satellite and ground-based data sets, as well as standardized products derived from those data. Type II includes providers of data and information products, technologies, or services aimed primarily at the Earth science and research communities. We excluded the Type III category because it contains commercial and non-commercial organizations engaged in developing tools for Earth science. Because these organizations develop tools, they contain no datasets, and thus metadata is absent. We also excluded Type V organizations, which are non-voting financial supporters of ESIP, because these organizations are unlikely to have data that interests us as researchers of metadata schemas.

### Data Collection Process

For full details of the data collection methodology, please see the published paper. Here we present a few high-level details. Our data collection process began with noting an individual CDF or ESIP member. The website of each organization was checked for the presence of datasets. Once a representative dataset was found, the link for the dataset was noted in the data collection worksheet and checked for additional links and files, and especially for the presence of metadata. The available metadata was then checked for aspects of ISO 19115 metadata, other metadata standards, the presence of keywords and controlled vocabularies, and the presence of DOIs. We listed repositories in column B as "Skip" for a number of reasons: 1) we were unable to find a working website for the facility, 2) the facility did not host data or metadata on its website or did not provide a way to discover its data, and 3) it was unclear how the facility hosted data, as many of the facilities provided links to external repositories or did not enable searching on their website directly.
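For readers who want to work with the data programmatically, the coded values described in this README ("Skip" in column B, "Unknown" in column D, "NP" in column K) can be handled as in the minimal sketch below. It is an illustration only, under several assumptions that are not confirmed by the dataset itself: that the "ESIP_CDF" sheet has been exported to CSV, that the header row uses the column names given in the Column Description section, and that multiple schemas in column D are separated by semicolons. The sample rows are hypothetical placeholders, not real values from the dataset.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows standing in for a CSV export of the "ESIP_CDF" sheet.
# Header names and the ";" separator are assumptions made for this sketch.
SAMPLE_CSV = """Data Center,Data Center URL,Metadata schema(s) used,Present DOIs for data? (Y/N),Number of datasets
Facility A,https://a.example.org,ISO 19115,Y,1200
Facility B,Skip,,,
Facility C,https://c.example.org,Unknown,N,NP
Facility D,https://d.example.org,ISO 19115; FGDC,Y,85
"""


def load_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def drop_skipped(rows):
    """Remove facilities that were coded as "Skip" in column B."""
    return [r for r in rows if r["Data Center URL"].strip().lower() != "skip"]


def parse_dataset_count(value):
    """Return the dataset count as an int, or None when coded "NP" or blank."""
    value = value.strip()
    if not value or value.upper() == "NP":
        return None
    return int(value.replace(",", ""))


def schema_counts(rows):
    """Count how many facilities list each metadata schema (column D)."""
    counts = Counter()
    for row in rows:
        for schema in row["Metadata schema(s) used"].split(";"):
            schema = schema.strip()
            if schema:
                counts[schema] += 1
    return counts


if __name__ == "__main__":
    kept = drop_skipped(load_rows(SAMPLE_CSV))
    print(schema_counts(kept))
    print([parse_dataset_count(r["Number of datasets"]) for r in kept])
```

Treating "NP" as a missing value rather than zero keeps facilities with an undetermined dataset count from distorting any totals or averages. "Unknown" is kept as its own category here so that it remains visible in the tallies.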
### Glossary of Acronyms

ANZRC - Australia/New Zealand Reference Center
ANZRC FOR - Australia/New Zealand Reference Center Fields of Research
AU/NZS - Australian/New Zealand Standards
ARDF - Alaska Resource Data File
BCO-DMO - Biological and Chemical Oceanography Data Management Office
CDF - Council of Data Facilities
CIESIN - Center for International Earth Science Information Network
CMECS - Coastal and Marine Ecological Classification Standard
CSA Biocomplexity Thesaurus - Cambridge Scientific Abstracts Biocomplexity Thesaurus
ECHO10 - NASA's Earth Observing System (EOS) Clearinghouse (ECHO) metadata model
FGDC - Federal Geographic Data Committee
ESIP - Earth Science Information Partners
GCMD - Global Change Master Directory
GEO - Oregon Geospatial Enterprise Office
GIS - Geographic Information System
GMD - General Material Designation Table
GNIS - Geographic Names Information System
GNS - GeoNames net server
IEDA - Interdisciplinary Earth Science Data Analysis
ISO - International Organization for Standardization
IGS - Indiana Geological Survey
LTER - Long Term Ecological Research
MGDL - Marine Geodigital Library
MODS - Metadata Object Description Schema
MRIB - Marine Realms Information Bank
Native CMR - NASA Common Metadata Repository
NSIDC DAAC - National Snow and Ice Data Center Distributed Active Archive Center
SEDAC - Socioeconomic Data and Applications Center