Data & Metadata

What is metadata?

Documentation on your data. Good data documentation includes detailed information on:

What: title, name

Data type description

Sensor – instrument, model

Source – platform, computer

Level e.g. http://www.godae.org/Data-definition.html

How: method, QA/QC and lineage: pre- and post processing, provenance

When

Where

Who

Why

Why metadata?

Make data clear to understand and easy to use for:

Browse, Search and Retrieval

Ingest, Quality Assurance, Reprocessing

Application to Application Transfer

Storage and Archive

Re-use

Metadata standards

Download examples:

ISO 19115 http://c3grid1.dkrz.de:8080/xsql/cera_map_iso.xsql?id=2234809

DIF http://c3grid1.dkrz.de:8080/oai/provider?verb=GetRecord&metadataPrefix=dif&identifier=dif2000013

Granularity of Data Sets

Aggregated metadata is required by a number of services to the community as well as services to provide links to cited literature. These services, in turn, allow other initiatives (e.g. GEO, Future Earth, and similar) to draw on qualified metadata collections for their own applications and services.

One of the significant practical problems associated with meta-data aggregations relates to granularity of the data being described. The size and nature of data sets described in a meta-data record varies greatly between institutions, scale of observation, data families, and scientific disciplines. Furthermore, some data sets (for example sensor data) are updated continuously, and as such do not have a ‘beginning’ and an ‘end’.

Greatly differing granularity is not ideal in at least three cases: (1) search results derived from an aggregated meta-data collection are difficult to interpret, (2) automated processing of data sets is more difficult, and (3) not all data sets are easily citable.

In the following some askpects for choosing the appropriate granularity:

Technical

Unique Instrument, Survey, Experiment, or Platform

A unique instrument/platform/project combination

Unique Parameter

A unique parameter, parameter combination, or set of independent variables contained in the data set

Distinct Coverage

Distinct temporal, spatial, or topic coverages

Versions/ Processing Level

Data sets at varying stages of processing – specifically applicable to sensor data, but in a general sense distinguishing data at stages of raw observation, quality assured, and processed/ transformed data. Hence we are amending this to reflect the life cycle of the data set, and each significant life cycle stage or version should be uniquely citable.

Practical/ Functional

Size

The sheer size of some data sets makes access to them impractical for most users. Hence one has to consider the case where a single data set is referenced by multiple meta-data records, ideally incorporating some for of sub-set resolving query to the data. The sub-setting criteria are similar to the ones discussed here.

Business Value

Assigning unique identifiers to ever smaller data sub-sets loses business value at some point, hence there is a trade-off between the other considerations listed here and the cost of assigning a meta-data record and unique identifier.

Citability and Authorship

Body of Work

The data supporting a logical body of work (journal article, research project) should have a separate meta-data record.

Unique Authorship

A meta-data record should be split if the data being described has more than one unique set of authors.

Reproducibility

This rule should allow scientists that seek to reproduce results to uniquely and easily find the data set(s) used for the original research.

full article: ICSU WDS (2014) Guidelines in Respect of Metadata Granularity, http://goo.gl/nhCDvK

Metadata converting

For dissemination and cataloging.

Metadata interfaces

Typically a data centre’s data deposit form or metdata editor are completed by the data creator

Ingest examples:

WDCC/CERA http://cera-www.dkrz.de/pls/apex/f?p=110:10:2317779178940753

CMIP5 Questionnaire http://q.cmip5.ceda.ac.uk/

DataCite API https://mds.datacite.org/static/apidoc

To view the list of KomFor data portals & service providers follow this link.

Vocabularies & Ontologies

CMIP5 Data Reference Syntax (DRS) and Controlled Vocabularies

Entry name/title of data are specified according to the Data Reference Syntax (http://cmip-pcmdi.llnl.gov/cmip5/docs/cmip5_data_reference_syntax.pdf) as activity / product / institute / model / experiment / frequency / modeling realm / MIP table / ensemble member / version number / variable name / CMOR filename.nc.

What are data?

Data are digital information e.g. numbers, words, images, etc.

Why use standard formats?

Standard formats ensures interchangeability and long-term usability.

Data standards

Examples of data standards:

cf-netcdf http://cf-pcmdi.llnl.gov/

WMO GRIB http://www.wmo.int/pages/prog/www/WMOCodes/Guides/GRIB/Introduction_GRIB1-GRIB2.pdf

Quality Control and Quality Assurance

Quality control and assurance takes place at various stages of research: data production / processing, ingest and long-term archiving, data publishing and dissemination

What are dynamic datasets?

Dynamic datasets are transient. The rules of the change depends on different use cases.

Data Streams: data from monitoring services with continuous output

For datasets that are continuously and rapidly updated, there are special challenges both in citation and preservation. For citation, three approaches are possible:

Cite a time slice (the set of updates to the dataset made during a particular period of time);

Cite a snap shot (a copy of the entire dataset made at a specific time);

Cite the continuously updated dataset, but add an Access Date and Time to the citation.

Examples: DOI assignments on datasets: doi:10.1594/WDCC/MODIS__Arctic__MPF , doi:10.5676/DWD_GPCC/CLIM_M_V2011_100

Data ingest and download examples

Interfaces to catalogs and query languages

Data Ingest examples

Data download examples

CERA WWW-Gateway - Jblob Version 2 http://cera-www.dkrz.de/CERA/jblob/

CERA GUI http://cera-www.dkrz.de/WDCC/ui/

ESGF

http://bmbf-ipcc-ar5.dkrz.de/thredds/esgcet/1/cmip5.output1.MPI-M.MPI-ESM-LR.rcp26.day.atmos.day.r1i1p1.v20111014.html?dataset=cmip5.output1.MPI-M.MPI-ESM-LR.rcp26.day.atmos.day.r1i1p1.v20111014.vas_day_MPI-ESM-LR_rcp26_r1i1p1_20060101-20091231.nc

Access: http server, GridFTP and OPENDAP

How to do:

Interface more resources, submissions systems

Data and Metadata

Metadata

Documentation on your data. Good data documentation includes detailed information on:

Data

Data Ingest examples

Data download examples