header-image-start-2

Best Practices - Quality Assurance

Quality Assurance (QA)

The aim of Quality Assurance is to document and comment on evaluation methods together with suspicious results. The QA does not perform any corrections.

Quality Assurance of data plays an important role in the data publication process as well as for data re-use. However, quality control procedures and quality documentation vary greatly among scientific data.  In general every project defines their quality procedures.

In order to make the different Quality Assessments of the projects comparable, a generic Quality Assessment System is under development. Based on the self-assessment approach of a maturity matrix, an objective and uniform quality level system for data is derived. It consists of 5 maturity quality levels, starting with the initial level=1.

What is the goal of the Quality Assessment System ?

  •     Encourage data creators to improve their quality assessment procedures to reach the next quality level.
  •     Enable data consumers to decide, whether a dataset has a quality that is sufficient for usage in the aimed application.

Quality Maturity Matrix

 

Different criteria are defined, which are subdivided into aspects. For every aspect the 5 maturity quality levels are defined.

The Criteria are (Version 06.11.2014):

Consistency

•    Data Organisation and Data Object
•    Versioning and Controlled Vocabularies (CVs)
•    Data-Metadata Consistency

Completeness

•    Existence of Data
•    Existence of Core Metadata and Provenance

Accessibility

•    Technical Data Access by Identifier/Lineage
•    Core Metadata and Provenance Access by Identifier

Accuracy

•    Plausibility
•    Statistical Anomalies

 

State of the Quality Assessment System

The Quality Assessment System is currently developed. The information on this page will be updated as progress is made. Comments and suggestions are welcome.

Maturity Level Description (Version 06.11.2014):

Level 1 Conceptual Grade

Characteristics: under conceptual development or not evaluated

Score 1 is always reached

Level 2 Research Grade

Characteristics: aspects characterized for data production and processing with individual checks

Score 2 rating scales correspond to the following aspects:

Consistency:

Data Organisation and Data Object:

  • informal data organization is consistent
  • file names are consistent to internal rules
  • file extensions are consistent

Versioning and Controlled Vocabularies (CVs):

  • informal versioning is consistent
  • CVs are consistent

DataMetadata Consistency: creators are correct

Completeness:

Existence of Data: data is in production and may be deleted or overwritten
Existence of Core Metadata and Provenance:

  • creators exist
  • data provenance is unsystematically documented

Accessibility:

Technical Data Access by Identifier/Lineage: data is accessible by file names

Core Metadata and Provenance Access by Identifier:

  • creators are accessible
  • data provenance unsystematically documented is accessible

Accuracy:

Plausibility: documented procedure about technical sources of errors and deviation/inaccuracy exists

Statistical Anomalies: missing values are indicated e.g. with fill values

Level 3 Exchange Grade

Characteristics: aspects characterized for project collaboration with systematic main data components checks conform to project requirements

Score 3 rating scales correspond to the following aspects:

Consistency:

Data Organisation and Data Object:

  • data organisation is documented
  • internal identifiers (with mapping to data objects) e.g. file names and formats correspond to project requirements
  • file extensions, size and checksum of main components are consistent

Versioning and Controlled Vocabularies (CVs):

  • systematic versioning correspond to project requirements
  • formal CVs of main components are consistent

DataMetadata Consistency: creators/contact are correct

Completeness:

Existence of Data: datasets exist, not complete and may be deleted but not overwritten unless explicitly specified


Existence of Core Metadata and Provenance:

  • creators/contact exist
  • naming conventions for discovery exist
  • datasets provenance is basically documented e.g. in data header

Accessibility:

Technical Data Access by Identifier/Lineage:

  • datasets are accessible by internal identifier and mapping (bijectiv) to objects are documented e.g. in data header
  • checksums are accessible

Core Metadata and Provenance Access by Identifier:

  • creators/contact with naming conventions are accessible
  • dataset provenance is accessible

Accuracy:

Plausibility:

  • documented procedure about technical sources of errors and deviation/inaccuracy exists
  • documented procedure about methodological sources of errors and deviation/inaccuracy exists

Statistical Anomalies:

  • missing values are indicated e.g. with fill values
  • documented procedure about rough anomalies are available e.g. outliers concerning limits
Level 4 Re-use Grade

Characteristics: aspects characterized for data and metadata preservation with systematic checks conform to community standards and long-term archive requirements

Score 4 rating scales correspond to the following aspects:

Consistency:

Data Organisation and Data Object:

  • data organization is structured/conform according to welldefined rules
  • entry names and data formats are conform to community standards
  • datasets are reusable with selfdescribing data objects which meet the community standards
  • file extension, size and checksum are consistent

Versioning and Controlled Vocabularies (CVs):

  • systematic versioning collection including documentation of enhancement is conform to community standards
  • old versions stored if feasible
  • formal CVs of data are conform to community standards

DataMetadata Consistency:

  • data source e.g. sensor are consistent
  • creators/contact/publisher are consistent
  • metadata for search and discovery e.g. keywords  are consistent
  • quality assurance procedure for data and metadata consistency (approval and review) is documented
  • data citation metadata to a documented procedure are consistent

Completeness:

Existence of Data:

  • data entities (conform to community standards) are complete (dynamic datasets - data stream are not affected)
  • number of data sets (aggregation) is consistent
  • data are persistent, as long as expiration date requires


Existence of Core Metadata (main components) and Provenance:

  • data source e.g. sensor
  • creators/contact/publisher
  • metadata for search and discovery e.g. keywords
  • quality assurance procedure (approval and review)
  • data citation
  • detailed description of data production steps and method
  • data expiration date
  • access constraint

Accessibility:

Technical Data Access by Identifier/Lineage:

  • complete datasets (conform to community standards) are accessible by permanent (minimum 10 years see rules of good scientific practice) identifier with resolving to data access as long as expiration date requires
  • checksums are accessible

Core Metadata and Provenance Access by Identifier:

  • main metadata components:
    -data source e.g. sensor
    -creators/contact/publisher
    -metadata for search and discovery e.g. keywords
    -quality assurance procedure for data and metadata consistency (approval and review) is documented
    -data citation metadata to a documented procedure are consistent
    with data expiration date are accessible by identifier
  • detailed description of data production steps and methods are accessible by identifier

Accuracy:

Plausibility:

  • documented procedure about technical sources of errors and deviation/inaccuracy exists
  • documented procedure about methodological sources of errors and deviation/inaccuracy exists

 

Statistical Anomalies:

  • missing values are indicated e.g. with fill values
  • documented procedure about rough anomalies are available e.g. outliers concerning limits.
  • documented procedure about systematic deviations in time and space (e.g. changes in mean, variance and trends) and random errors exist
  • scientific consistency among multiple data sets and their relationships is documented if feasible

 

Level 5 Impact Grade

Characteristics: aspects characterized for a distributed virtual research environment with checks conform to general/international standards

Score 5 rating scales correspond to the following aspects:

Consistency:

Data Organisation and Data Object:

  • data organization is structured/conform according to standardized rules
  • data formats are conform to general/international standards
  • data objects are consistent to external scientific objects and up-to-date
  • file extension, size and checksum are consistent
  • data objects with general/international standards are self-describing
  • data objects are fully machine-readable with references to sources

Versioning and Controlled Vocabularies (CVs):

  • systematic versioning collection including documentation of enhancement is conform to community standards
  • old versions stored if feasible
  • documentation of not included newer versions is consistent
  • CVs are general/international standardized

DataMetadata Consistency:

  • data source e.g. sensor
  • creators/contact/publisher
  • metadata for search and discovery e.g. keywords
  • quality assurance procedure for data and metadata consistency (approval and review) is documented
  • data citation metadata to a documented procedure are consistent
  • external metadata and data are consistent

Completeness:

Existence of Data:

  • data entities (conform to general/international standards) are complete (dynamic datasets - data stream are not affected)
  • number of data sets (aggregation) is consistent
  • data are persistent, as long as expiration date requires

Existence of Core Metadata and Provenance:

  • metadata is conform to general/international standards
  • data provenance chain exists including internal and external objects e.g. software, articles, method and workflow description

Accessibility:

Technical Data Access by Identifier/Lineage:

  • complete data (conform to general/international standards) is accessible by global resolvable identifier (PID) registered with resolving to data access including backup as long as expiration date requires
  • data is accessible within other data infrastructures including cross references
  • external PID references supported
  • provenance chain is accessible

Core Metadata and Provenance Access by Identifier:

  • metadata with data expiration date including backup general/international standardizedare accessible by global resolvable identifier
  • data provenance chain including internal and external objects e.g. software, articles, methods and workflow description are accessible by global resolvable identifier

Accuracy:

Plausibility:

  • documented procedure about technical sources of errors and deviation/inaccuracy exists
  • documented procedure about methodological sources of errors and deviation/inaccuracy exists
  • documented procedure with validation against independent data exists
  • references to evaluation results (data) and methods exist

Statistical Anomalies:

  • missing values are indicated e.g. with fill values
  • documented procedure about rough anomalies are available e.g. outliers concerning limits
  • documented procedure about systematic deviations in time and space (e.g. changes in mean, variance and trends) and random errors exist
  • scientific consistency among multiple data sets and their relationships is documented if feasible

Quality Maturity Matrix as excel pdf (Version 06.11.2014)

Maturity_WDCC_v6nov14c.pdf (69.8 KiB)

Checklists (in work):

Examples

Level 1 Checklist

Characteristics: under conceptual development or not evaluated

Score 1 is always reached.

To achieve a specific desired score level it would be helpful to have a data management plan. This should include the conceptual development.

Level 2 Checklist

Characteristics: aspects characterized for data production and processing with individual checks

 

Consistency:

Data Organization and Data Object
Checks: informal in accordance to Data Management Plan
  • Is the directory structure correct?
  • Are the file sizes feasible for access and transmission?
  • Are the file names correct in accordance to CVs if applicable?
  • Are the file extensions correct in accordance to format if feasible?
Versioning and Controlled Vocabularies (CV)
Checks: informal in accordance to Data Management Plan
  • Is a versioning available if feasible?
  • How is the versioning implemented if feasible?
  • Are CVs applied if feasible?

Data-Metadata Consistency
Checks: informal
  • Are creators correct?

 

Completeness:

Existence of Core Metadata and Provenance
Checks: informal
  • Are creators documented?
  • Is data provenance unsystematically documented?

 

Accessibility:

Technical Data Access by Identifier and Lineage
Checks: informal
  • Is data accessible by file names?

Core Metadata and Provenance Access by Identifier:
Checks: informal
  • How are creators and data provenance documentation available/accessible by identifier?


Accuracy:

Plausibility
Checks: informal
  • Is an estimation of technical errors available?
    • Detections of anomalies in model datasets with sources of errors and deviations e.g. technical (Appendix A 1.1)?
    • Detections of anomalies in observational datasets with sources of errors and deviations:
      • Is a classification into a level system available e.g. http://www.godae.org/Data-definition.html ?
Statistical Anomalies
Checks: spot checks required
  • Are missing values indicated e.g. with fill values or in metadata?

 

The checks are documented in project and institute reports (paper, internet and articles). Checks are done and documented under control of project data management.

Level 3 Checklist

Characteristics: aspects characterized for project collaboration with systematic main data components checks conform to project requirements

 

Consistency:

Data Organization and Data Object
Checks: required for main data components
  • Is a documentation with minimal specification of data organization available?
  • Is the directory structure correct?
  • Are the file sizes feasible for access and transmission?
  • Are the file names or internal identifiers with mapping to objects correct in accordance to CVs?
  • Are the formats and file extensions correct (checked with format checker)?
  • Are the data sizes checked and correct, the size of data set is not equal 0?
  • Are checksums correct?
Versioning and Controlled Vocabularies (CV)
Checks: required for main data components
  • Is a versioning available if feasible?
  • How is the versioning implemented if feasible? If no explicit versioning exists an implicit versioning could be established. But a good way of versioning implementation is the identification with the date of storage. As soon as more than one version exists versioning should be implemented.
  • Are CVs applied?
  • Are formal CVs of standard components correct?
    • Naming conventions for discovery e.g. CVs
Data-Metadata Consistency
Checks: required for main data components
  • Are creators/contact e.g. in data headers correct?

 

Completeness:

Existence of Data
Checks: required for main data components

  • How is data overwriting prevented?  Removing is allowed and overwriting is allowed if explicitly specified
Existence of Core Metadata and Provenance
Checks: required for main data components
  • Do metadata exist:
    -creators/contact
    -Keywords for discovery e.g. CVs
  • Is dataset provenance basically documented?

 

Accessibility:

Technical Data Access by Identifier and Lineage
Checks: required for main data components
  • Are data accessible by internal identifier?
  • Are internal identifiers with mapping to objects available?
  • Are checksums accessible?
Core Metadata and Provenance Access by Identifier:
Checks: required for main data components
  • Are creators/contact accessible by internal identifiers?
  • Is dataset provenance accessible?
  • How is the technical access implemented?


Accuracy:

Plausibility
Checks: required for main data components
Does an estimation of errors exist?
  • Detections of anomalies in model dataset with sources of errors and deviations e.g. technical (Appendix A 1.1)
  • Detections of anomalies in model dataset with methodological sources of errors and deviations.
  • Detections of anomalies in observational dataset with sources of errors and deviations:Does a classification into a level system exist e.g. http://www.godae.org/Data-definition.html ?
  • Detections of anomalies in observational dataset with methodological sources of errors and deviations.
Statistical Anomalies
Checks: required for main data components
  • Are missing values indicated e.g. with fill values or in metadata?
  • Are rough errors indicated (example tests see Appendix A 1.2)

 

The check procedure and check status are available at the repository. Checks are done and documented under control of repository.

Level 4 Checklist

Characteristics: aspects characterized for data and metadata preservation with systematic checks conform to community standards and long-term archive requirements

Consistency:

Data Organization and Data Object
Checks: required, systematic and conform to community standards and long-term archive requirements
  • Is a documentation with specification of well-defined rules of data organization available?
  • Is the storage structure correct e.g. directory, access database tables?
  • Are the file sizes feasible for access and transmission?
  • Are the entry names correct in accordance to community standards?
  • Are the data formats self-describing, conform to community standards and file extensions correct?
  • Are the data size checked and correct, the size of data set is not equal 0?
  • Are checksums correct?
Versioning and Controlled Vocabularies (CV)
Checks: required, systematic and conform to community standards and long-term archive requirements
  • How is the systematic versioning on files or data collection implemented and applied?
    If no explicit versioning exists an implicit versioning could be established. But a good way of versioning implementation is the identification with the date of storage. As soon as more than one version exists versioning should be implemented.
    Are old data versions stored if feasible?
  • Are old data versions stored if feasible?
  • Are formal CVs conform to community standards?
Data-Metadata Consistency
Checks: required, systematic and conform to community standards and long-term archive requirements
  • Are metadata components to a documented procedure correct (components see Appendix A1.3)?

 

Completeness:

Existence of Data
Checks: required, systematic and conform to community standards and long-term archive requirements

  • Are data entities complete (dynamic datasets – data stream are not effected)?
  • Is the number of data sets correct in accordance to data management plan?
  • Is data overwriting and removing of old versions prevented in accordance to expiration date?
Existence of Core Metadata and Provenance
Checks: required, systematic and conform to community standards and long-term archive requirements
  • Are core metadata components (see Appendix A 1.3) to a documented procedure complete?

 

Accessibility:

Technical Data Access by Identifier and Lineage
Checks: required, systematic and conform to community standards and long-term archive requirements
  • Is an expiration date of data available?
  • Are permanent (as long as expiration date requires) identifier to data assigned?
  • Are checksums accessible?
Core Metadata and Provenance Access by Identifier:
Checks: required, systematic and conform to community standards and long-term archive requirements
  • Are core metadata components (see Appendix A 1.3) accessible by identifier?
  • Are detailed description of data production steps and methods accessible by identifier?


Accuracy:

Plausibility
Checks: required, systematic and conform to community standards and long-term archive requirements
  • Does an estimation of errors exist?
    • Detections of anomalies in model dataset with sources of errors and deviations e.g. technical (Appendix A 1.1)
    • Detections of anomalies in model dataset with methodological sources of errors and deviations.
    • Detections of anomalies in observational dataset with sources of errors and deviations:
    • Does a classification into a level system exist e.g. http://www.godae.org/Data-definition.html ?
    • Detections of anomalies in observational dataset with methodological sources of errors and deviations.
Statistical Anomalies
Checks: required, systematic and conform to community standards and long-term archive requirements
  • Are missing values indicated e.g. with fill values or in metadata?
  • Are rough errors indicated (example tests see Appendix A 1.2)
  • Is a documented procedure about shift in mean, variance and trends available?
  • Is a documented procedure about random errors (inherent uncertainties of measurements see Appendix A 1.4) available?

 

Check procedure and check status are available and meet documentation standards at long-term archive with status of QA criteria. Checks are documented under control of long-term-archive.

Level 5 Checklist

Characteristics: aspects characterized for a distributed virtual research environment with checks conform to general/international standards

 

Consistency:

Data Organization and Data Object
Checks: required, systematic and conform to community standards and long-term archive requirements
  • Is a documentation with specification of standardized rules of data organization available?
  • Is the storage structure correct e.g. directory, access database tables?
  • Are the file sizes feasible for access and transmission?
  • Are the entry names correct in accordance to standardization?
  • Are the data formats self-describing, conform to general standards and file extensions correct?
  • Are the data objects fully machine-readable with reference to sources?
  • Are the data size checked and correct, the size of data set is not equal 0?
  • Are checksums correct?
  • Which external objects exist and are they consistent to data object?
Versioning and Controlled Vocabularies (CV)
Checks: required, systematic and conform to community standards and long-term archive requirements
  • How is the systematic versioning on files or data collection implemented and applied?
    If no explicit versioning exists an implicit versioning could be established. But a good way of versioning implementation is the identification with the date of storage. As soon as more than one version exists versioning should be implemented.
  • Are old data versions stored if feasible?
  • Are formal CVs conform to international standards?
  • Is the documentation of not included newer versions correct?
Data-Metadata Consistency
Checks: required, systematic and conform to community standards and long-term archive requirements
  • Are metadata components to a documented procedure correct (components see Appendix A1.3)?
  • Are external metadata and data consistent?

 

Completeness:

Existence of Data
Checks: required, systematic and conform to community standards and long-term archive requirements

  • Are data entities complete (dynamic datasets – data stream are not effected)?
  • Is the number of data sets correct in accordance to data management plan?
  • Is data overwriting and removing of old versions prevented in accordance to expiration date?
Existence of Core Metadata and Provenance
Checks: required, systematic and conform to community standards and long-term archive requirements
  • Are core metadata components (Appendix A 1.3) are complete and available with international standards like ISO 19135?
  • Does a documentation of the data provenance chain exist?

 

Accessibility:

Technical Data Access by Identifier and Lineage
Checks: required, systematic and conform to community standards and long-term archive requirements
  • Is an expiration date of data available?
  • Are global resolvable, permanent (as long as expiration date requires) identifier (PIDs) to data assigned?
  • Are checksums accessible?
  • Is data accessible within other data infrastructures?
  • Are cross-references  supported?
  • Is a data backup available?
  • Is the documentation of the provenance chain accessible by identifier?
Core Metadata and Provenance Access by Identifier:
Checks: required, systematic and conform to community standards and long-term archive requirements
  • Are general/standardized metadata accessible by global resolvable identifier?
  • Are detailed description of data production steps and methods accessible by identifier?
  • Is a backup of metadata accessible by global resolvable identifier or does a recovery system exist?
  • Is a documentation of provenance chain including internal and external objects accessible by global resolvable identifier?


Accuracy:

Plausibility
Checks: required, systematic and conform to community standards and long-term archive requirements
  • Does an estimation of errors exist?
    -Detections of anomalies in model dataset with sources of errors and deviations e.g. technical (Appendix A 1.1)
    -Detections of anomalies in model dataset with methodological sources of errors and deviations.
    -Detections of anomalies in observational dataset with sources of errors and deviations:
    -Does a classification into a level system exist e.g. http://www.godae.org/Data-definition.html ?
    -Detections of anomalies in observational dataset with methodological sources of errors and deviations.
  • Is a documented procedure with validation against independent data available? E.g. Model Data in connection with model development see Appendix A 1.5
  • Are references to the evaluation results with methods available?

Statistical Anomalies
Checks: required, systematic and conform to community standards and long-term archive requirements
  • Are missing values indicated e.g. with fill values or in metadata?
  • Are rough errors indicated (example tests see Appendix A 1.2)
  • Is a documented procedure about shift in mean, variance and trends available?
  • Is a documented procedure about random errors (inherent uncertainties of measurements see Appendix A 1.4) available?

 

Check procedure and check status are available and meet documentation standards at long-term archive with status of QA criteria. Checks are documented under control of long-term-archive.

Appendix A

1.1 Technical Sources of Errors and Deviations (not complete)

  • source code

-Robustness in extrem values
-Exception for overflow and zero
-Used units e.g. mm/d or m/sec
-Errors like latitude, longitude, rotated pools grid conversion confusion
-Statistic problems – samples over time versus instantaneous values
-Units conversion and confusion

  • Compiler
  • Software library
  • Number of processors
  • Computer accuracy – numerical stability for very large and small values

1.2 Rough Errors Tests (Examples)

  • LIM-test by Meek and Hatfield [94] are applied. (The test checks every data point on whether it exceeds a predefined range of values.)
  • NOC-test by Meek and Hatfield [94] are applied. (The test checks on whether data does not change for more than a predefined number of values. It can be used to detect errors of instrument.)
  • ROC-test by Meek and Hatfield [94] are applied. (The test checks the rate of change. The difference between two consecutive elements is checked concerning limits.)


1.3 Core Metadata Components

  • source e.g. sensor (e.g. models: CLM 2.4.11 on NEC-SX6(hurrikan) or observational data: In Situ Land-based Platforms, GROUND STATIONS with In Situ/Laboratory Instruments)
  • creators/contact/publisher
  • Metadata for search and discovery if feasible:

-The time description (metadata) and data (in data header and file names) are consistent  - calendric aspects, gaps, doubles, step widths
-The spatial description (metadata) and data (in data header and file names) are consistent (latitude, longitude, grid)
-The variable description with units (metadata) and data (in data header and file names) are consistent
-The description how the data have been determined or derived by the specified method e.g. statistics applying to data representation: samples over time - temporal means (e.g. monthly means), instantaneous – point is correct
-keywords

  • Quality assurance procedure of approval and review
  • Citation metadata (creators , publication date , title , publisher and persistent identifier)
  • Description of data production steps and method description (project description and summary)
  • Expiration date of data
  • Access constraint
  • Contributor(s) if feasible


1.4 Random Errors Observational Data

  • Calibration of instruments
  • Precise time and location description
  • Situation in the field of measurement
  • Physical description of the measured data

These descriptions are very detailed and should be published in a journal.


1.5 Model Data in connection with model development

  • For the model data, these include the description of the models, components and their equations. This is also very detailed and should be published in a journal. E.g. http://www.geoscientific-model-development.net/  project or institute reports.
  • Documentation with checklists of prognostic and diagnostic variables
  • Description of simulation with:

-Parameterisation
-Resolution in time and space, dependencies of time and space resolutions (what happens if wind speed is stronger than the grid cell represents). Are there descriptions in table form?
-Boundary conditions
-Input data
-Constants – for initialization and run e.g. orography, solar constant, drag coefficient, area leaf index


References:
Meek, D. Hatfield, J. (1994) Data quality checking for single station meteorological databases.Agricultural and Forest Meteorology - AGR FOREST METEOROL , vol. 69, no. 1-2, pp. 85-109, DOI: 10.1016/0168-1923(94)90083-3

QA Documentation

ISO 19157 (in work)

DataCite (in work)

Quality Maturity Level (Version 06.11.2014)

Maturity_Levels_v6nov14.pdf (136.9 KiB)

Quality Maturity Level (Version 03.04.2014)

MaturityMatrix(version3apr14).pdf (111.2 KiB)