Guidance for Persisting Soil Sample Observation Data from Spectrometry and Wet Chemistry

1. Purpose

This document outlines best practices for storing, documenting, and referencing soil sample observation data acquired through spectrometry, together with the wet chemistry reference data used to calibrate it. It addresses two common workflows:

Calibration campaigns, where a subset of samples is analyzed using both wet chemistry and spectrometry to build or expand a spectral library.
Prediction campaigns, where only spectrometry is performed and an existing spectral library is used for inference.

In both cases, spectral libraries must be traceable, documented, and referenced correctly by each spectral observation.

2. Core Data Entities

2.1 Soil Sample

Each physical soil sample should be represented as a unique entity with:

Sample ID (persistent, unique identifier)
Collection metadata:
- Geographic location (coordinates, site name)
- Depth, horizon, or core information
- Date and method of collection
- Collector or organization
Storage and handling information (if relevant)

2.2 Spectral Observation

Each spectrometry measurement should be stored as its own record:

Link to Sample ID
Spectrometer/device ID and model
Wavelength range and resolution
Acquisition settings (e.g. gain, integration time, replicates)
Date of measurement and operator
Preprocessing steps (e.g. smoothing, normalization)
Reference to the spectral library used for calibration or prediction

2.3 Wet Chemistry Observation

When performed:

Link to Sample ID
Analytes measured (e.g. SOC, pH, nutrients, texture)
Laboratory method and protocols (e.g. ISO, USDA, local standards)
Laboratory or institution
Units and uncertainty
Date of analysis, analyst or laboratory code
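
The three entities above can be sketched as plain record types. This is a minimal illustration rather than a prescribed schema: the class names, field names, units, and example values are all assumptions chosen for readability, and only a subset of the listed attributes is shown.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class SoilSample:
    sample_id: str            # persistent, unique identifier
    site_name: str
    latitude: float
    longitude: float
    depth_top_cm: float
    depth_bottom_cm: float
    collected_on: date
    collection_method: str
    collector: str


@dataclass(frozen=True)
class SpectralObservation:
    observation_id: str
    sample_id: str            # link to SoilSample.sample_id
    device_id: str
    device_model: str
    wavelength_min_nm: float
    wavelength_max_nm: float
    resolution_nm: float
    measured_on: date
    operator: str
    library_id: str           # library used for calibration or prediction
    preprocessing: tuple = () # e.g. ("smoothing", "normalization")


@dataclass(frozen=True)
class WetChemistryObservation:
    sample_id: str            # link to SoilSample.sample_id
    analyte: str              # e.g. "SOC"
    value: float
    unit: str
    method: str
    laboratory: str
    analysed_on: date
    uncertainty: Optional[float] = None


# Hypothetical example values, for illustration only.
sample = SoilSample("SAMPLE-0001", "Example site", 0.0, 0.0, 0.0, 20.0,
                    date(2024, 5, 14), "auger", "Example organization")
observation = SpectralObservation("OBS-0001", sample.sample_id, "DEV-01",
                                  "example model", 350.0, 2500.0, 1.0,
                                  date(2024, 5, 20), "operator-a", "LIB-EXAMPLE",
                                  ("smoothing", "normalization"))
```

Keeping the Sample ID as the only link between records means that spectral and wet chemistry observations can be added independently and joined later.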

3. Spectral Libraries

3.1 Definition

A spectral library is a curated dataset that includes:

Spectral observations linked to samples with known reference values (typically from wet chemistry)
Calibration models derived from those paired data
Optional metadata on the environmental conditions and soil types the library covers

3.2 Library Metadata Requirements

Each spectral library should have its own identifier and descriptive record with:

Library ID (unique identifier)
Purpose (e.g. SOC prediction, multi-property calibration)
Scope (geographical region, soil types, date range)
Device(s) used
Preprocessing and modeling approach (e.g. PLSR, machine learning)
Versioning details and modification history
Data quality criteria and validation metrics
Contributors/maintainers and contact details
Link to the paired wet chemistry dataset used to build or update it
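
One way to enforce these requirements is a small completeness check run before a library record is accepted. The sketch below assumes the record is held as a dictionary (for example, loaded from JSON or YAML); the key names and the example record are hypothetical and simply mirror the list above.

```python
REQUIRED_LIBRARY_FIELDS = (
    "library_id",
    "purpose",
    "scope",
    "devices",
    "modeling_approach",
    "version",
    "history",
    "validation_metrics",
    "maintainers",
    "wet_chemistry_dataset",
)


def missing_library_fields(record):
    """Return the required fields that are absent or empty, sorted by name."""
    return sorted(f for f in REQUIRED_LIBRARY_FIELDS if not record.get(f))


# Hypothetical, deliberately incomplete record.
library_record = {
    "library_id": "LIB-EXAMPLE",
    "purpose": "SOC prediction",
    "scope": {"region": "example region", "soil_types": ["example type"]},
    "devices": ["DEV-01"],
    "modeling_approach": "PLSR",
    "version": "1.0.0",
}

print(missing_library_fields(library_record))
# ['history', 'maintainers', 'validation_metrics', 'wet_chemistry_dataset']
```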

4. Referencing Spectral Libraries

4.1 For Calibration Samples

When a sample subset undergoes both wet chemistry and spectrometry:

Each spectral observation must be tagged with:
- The Library ID of the library it contributes to
- Whether it belongs to the calibration, validation, or test set
Wet chemistry records should be linked to the corresponding spectral records through the Sample ID.

4.2 For Prediction Samples

When only spectrometry is performed:

Each spectral observation must reference the Library ID (and version) used to generate predictions.
Store predicted values separately but linked to:
- The Sample ID
- The spectral observation record
- The specific model version within the library
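
As an illustration, a predicted value can be stored as its own record that refuses to exist without its links. The class and field names below are assumptions, not a required layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    sample_id: str
    observation_id: str
    library_id: str
    model_version: str
    analyte: str
    predicted_value: float
    unit: str

    def __post_init__(self):
        # Reject records that cannot be traced back to a sample, a spectral
        # observation, and a specific model version within a library.
        for name in ("sample_id", "observation_id", "library_id", "model_version"):
            if not getattr(self, name):
                raise ValueError(f"{name} is required for a traceable prediction")


# Hypothetical identifiers and value, for illustration only.
prediction = Prediction("SAMPLE-0001", "OBS-0001", "LIB-EXAMPLE", "1.0.0",
                        "SOC", 1.0, "g/kg")
```

Validating at construction time keeps untraceable predictions out of storage altogether, which is simpler than cleaning them up afterwards.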

5. Data Persistence and Storage Structure

5.1 Database or File-Based Approach

Use a structured and queryable system (e.g. relational database, standardized file formats with metadata). At minimum, maintain:

Samples Table: Sample ID, collection metadata.
Spectral Observations Table: spectral file reference, Sample ID, device metadata, spectral library reference.
Wet Chemistry Table: Sample ID, analytes, lab methods, values, units.
Spectral Library Table: Library ID, metadata, versioning, reference to calibration data.
Model or Prediction Table (if applicable): prediction target, model version, linked spectral observation.
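
For a relational implementation, the minimum tables might look like the following SQLite sketch, built with Python's standard sqlite3 module. Table names, column names, and the choice of a composite (library_id, version) key are assumptions made for this example; a production schema would carry more of the metadata described in earlier sections.

```python
import sqlite3

SCHEMA = """
CREATE TABLE sample (
    sample_id    TEXT PRIMARY KEY,
    site_name    TEXT,
    latitude     REAL,
    longitude    REAL,
    collected_on TEXT
);
CREATE TABLE spectral_library (
    library_id TEXT NOT NULL,
    version    TEXT NOT NULL,
    purpose    TEXT,
    PRIMARY KEY (library_id, version)
);
CREATE TABLE spectral_observation (
    observation_id  TEXT PRIMARY KEY,
    sample_id       TEXT NOT NULL REFERENCES sample(sample_id),
    spectrum_file   TEXT NOT NULL,
    device_id       TEXT,
    library_id      TEXT NOT NULL,
    library_version TEXT NOT NULL,
    FOREIGN KEY (library_id, library_version)
        REFERENCES spectral_library(library_id, version)
);
CREATE TABLE wet_chemistry (
    sample_id TEXT NOT NULL REFERENCES sample(sample_id),
    analyte   TEXT NOT NULL,
    value     REAL NOT NULL,
    unit      TEXT NOT NULL,
    method    TEXT NOT NULL,
    PRIMARY KEY (sample_id, analyte, method)
);
CREATE TABLE prediction (
    observation_id  TEXT NOT NULL REFERENCES spectral_observation(observation_id),
    analyte         TEXT NOT NULL,
    predicted_value REAL NOT NULL,
    unit            TEXT NOT NULL,
    model_version   TEXT NOT NULL
);
"""


def create_database(path=":memory:"):
    """Create the tables and return a connection with foreign keys enforced."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce them by default
    conn.executescript(SCHEMA)
    return conn
```

With foreign keys enabled, an observation that names an unknown sample or an unregistered library version is rejected at insert time, and measured and predicted values live in separate tables.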

5.2 File Format Recommendations

Spectral data: e.g. CSV, JSON, ENVI files, or binary vendor formats with metadata sidecar files.
Metadata: embed or link in machine-readable form (e.g. YAML, JSON).
Spectral library bundles: ZIP or TAR archives that include documentation and a metadata file.

Ensure every file contains or links to identifiers for:

Sample ID
Spectral observation ID
Library ID
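
For file-based storage, one option is a JSON sidecar written next to each spectrum file. The sketch below is an assumption about how that could work: the ".meta.json" suffix, the key names, and the write_sidecar helper are hypothetical.

```python
import json
from pathlib import Path

REQUIRED_IDS = ("sample_id", "observation_id", "library_id")


def write_sidecar(spectrum_path, metadata):
    """Write '<spectrum file>.meta.json' beside the spectrum and return its path."""
    missing = [key for key in REQUIRED_IDS if not metadata.get(key)]
    if missing:
        raise ValueError("missing identifiers: " + ", ".join(missing))
    sidecar = Path(str(spectrum_path) + ".meta.json")
    sidecar.write_text(json.dumps(metadata, indent=2, sort_keys=True),
                       encoding="utf-8")
    return sidecar
```

A call such as write_sidecar("obs-0001.csv", {"sample_id": "SAMPLE-0001", "observation_id": "OBS-0001", "library_id": "LIB-EXAMPLE"}) would produce obs-0001.csv.meta.json, while a metadata dictionary lacking any of the three identifiers raises an error instead of writing an untraceable file.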

6. Versioning and Traceability

Assign version numbers to spectral libraries and calibration models.
Never overwrite older versions; archive them instead.
Record provenance:
- Who created or updated a library
- Date and reason for changes
- Data or models added, removed, or recalibrated
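
These rules amount to an append-only version history. A minimal in-memory sketch follows, assuming hypothetical LibraryVersion and LibraryHistory types; a real system would persist the entries.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class LibraryVersion:
    library_id: str
    version: str
    author: str        # who created or updated the library
    changed_on: date   # date of the change
    reason: str        # why the change was made
    changes: str       # data or models added, removed, or recalibrated


class LibraryHistory:
    """Append-only record of library versions; entries are never replaced."""

    def __init__(self):
        self._entries = []

    def publish(self, entry):
        # Refuse to overwrite an existing version of the same library.
        for existing in self._entries:
            if (existing.library_id, existing.version) == (entry.library_id, entry.version):
                raise ValueError(
                    f"{entry.library_id} {entry.version} already exists; "
                    "publish a new version instead"
                )
        self._entries.append(entry)

    def versions(self, library_id):
        """Return all recorded versions of a library, oldest first."""
        return [e.version for e in self._entries if e.library_id == library_id]
```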

7. Quality Control and Validation

7.1 During Calibration

Document selection criteria for calibration and validation samples.
Store cross-validation metrics (e.g. RMSE, R²) together with the residuals they are computed from.
Flag outliers or questionable measurements.
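
For reference, RMSE and R² can be computed from paired measured and predicted values as sketched below. The function names are illustrative, and R² is taken here as 1 minus the ratio of residual to total sum of squares, which is one common convention among several.

```python
import math


def rmse(measured, predicted):
    """Root mean squared error between measured and predicted values."""
    residuals = [p - m for m, p in zip(measured, predicted)]
    return math.sqrt(sum(r * r for r in residuals) / len(residuals))


def r_squared(measured, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(measured) / len(measured)
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    ss_tot = sum((m - mean) ** 2 for m in measured)
    return 1.0 - ss_res / ss_tot
```

Under this convention R² is negative when the model predicts worse than the mean of the measured values, so a stored metric should name the convention it uses.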

7.2 During Prediction

Clearly separate predicted values from measured ones.
Track confidence intervals or uncertainty estimates.
Ensure the model’s applicability domain is documented.
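
One very simple way to document an applicability domain is the per-wavelength range spanned by the calibration spectra. The sketch below uses that min/max envelope purely as an illustration; it is an assumption of this example, not a recommended method, and distance-based checks in a reduced spectral space are a common alternative. Function names are hypothetical.

```python
def spectral_envelope(calibration_spectra):
    """Per-wavelength (min, max) over equally sampled calibration spectra."""
    return [(min(column), max(column)) for column in zip(*calibration_spectra)]


def outside_envelope(spectrum, envelope, tolerance=0.0):
    """Indices of wavelengths where a spectrum leaves the calibration envelope."""
    return [
        i
        for i, (value, (low, high)) in enumerate(zip(spectrum, envelope))
        if value < low - tolerance or value > high + tolerance
    ]
```

Storing the envelope with the library and the flagged indices with each prediction gives later users a way to judge whether a predicted value was produced inside the range the model was calibrated on.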

8. Documentation and Accessibility

Create or store documentation for:

Data schemas
Metadata standards used (e.g. INSPIRE, ISO 19115, OGC standards)
Naming conventions and identifiers
Data access protocols (e.g. APIs, shared drives, repositories)

Where appropriate:

Use DOIs or persistent links for spectral libraries.
Provide clear citation formats.

9. Linking Campaigns Through Libraries

Since new campaigns may rely entirely on existing libraries:

Require each new spectral observation to:
- Reference the library used for prediction
- Include version information
- Record whether any local calibration updates were applied
If new wet chemistry data are added later, release the updated library as a new version and document the change.

10. Summary of Key Requirements

Element	Must Include	Purpose
Sample Record	Sample ID, collection data	Traceability
Spectral Observation	Sample ID, device metadata, spectral file, Library ID	Reusability & reference
Wet Chemistry Record	Sample ID, lab methods, analyte values	Calibration & validation
Spectral Library	Library ID, metadata, version, scope, provenance	Prediction & documentation
Model/Prediction Link	Observation ID, Library ID, model version, predicted values	Transparency

11. Compliance and Future-Proofing

Align with FAIR principles (Findable, Accessible, Interoperable, Reusable).
Use persistent identifiers and open metadata standards.
Plan for interoperability across institutions, countries, and legacy datasets.