Table of Contents
Welcome to the SoilWise Technical Documentation!
SoilWise Technical Documentation currently consists of the following sections:
- Technical Components
- APIs
- Infrastructure
- Glossary
- Printable version: all sections composed on one page, which can be easily printed using the web browser's print options
Essential Terminology
A full list of terms used within this Technical Documentation can be found in the Glossary. The most essential ones are defined as follows:
- (Descriptive) metadata: Summary information describing digital objects such as datasets and knowledge resources.
- Metadata record: An entry in e.g. a catalogue or abstracting and indexing service with summary information about a digital object.
- Data: A collection of discrete or continuous values that convey information, describing the quantity, quality, fact, statistics, other basic units of meaning, or simply sequences of symbols that may be further interpreted formally (Wikipedia).
- Dataset: (Also: Data set) A collection of data (Wikipedia).
- Knowledge: Facts, information, and skills acquired through experience or education; the theoretical or practical understanding of a subject. SoilWise mainly considers explicit knowledge, i.e. information that is easily articulated, codified, stored, and accessed, e.g. via books, web sites, or databases. It does not include implicit knowledge (information transferable via skills) nor tacit knowledge (gained via personal experiences and individual contexts). Explicit knowledge can be further divided into semantic and structural knowledge:
- Semantic knowledge: Also known as declarative knowledge, refers to knowledge about facts, meanings, concepts, and relationships. It is the understanding of the world around us, conveyed through language. Semantic knowledge answers the "What?" question about facts and concepts.
- Structural knowledge: Knowledge about the organisation and interrelationships among pieces of information. It is about understanding how different pieces of information are interconnected. Structural knowledge explains the "How?" and "Why?" regarding the organisation and relationships among facts and concepts.
- Knowledge resource: A digital object, such as a document, a web page, or a database, that holds relevant explicit knowledge.
Release notes
Date | Action |
---|---|
30. 9. 2024 | v2.0 Released: for the purposes of D2.1 Developed & Integrated DM components, v1; D3.1 Developed & Integrated KM components, v1; and D4.1 Repository infrastructure, components and APIs, v1 |
30. 9. 2024 | Technical Components functionality updated according to first SoilWise repository prototype |
27. 8. 2024 | APIs section restructured |
20. 8. 2024 | Knowledge Graph component added |
13. 8. 2024 | Metadata Authoring component added |
1. 7. 2024 | Metadata Augmentation component added |
30. 4. 2024 | v1.0 Released: For D1.3 Architecture Repository v1 purposes |
27. 3. 2024 | Technical Components restructured according to the architecture from the Bruges Technical Meeting |
27. 3. 2024 | v0.1 Released: Technical documentation based on the Consolidated architecture |
10. 2. 2024 | Technical Documentation was initialized |
Technical Components
Introduction
The SoilWise Repository (SWR) architecture aims to facilitate soil data management efficiently. It gathers, processes, and disseminates data from diverse sources. The system prioritizes high-quality data dissemination, knowledge extraction and interoperability, while user management and monitoring tools ensure secure access and system health. Note that SWR primarily serves to power Decision Support Systems (DSS) rather than being a DSS itself.
The presented architecture represents an outlook and a framework for ongoing SoilWise development. As such, the implementation has been following intrinsic (within the SoilWise project) and extrinsic (e.g. EUSO development, Mission Soil projects) opportunities and limitations. The presented architecture is the first of two planned releases. Modifications during the implementation will be incorporated into the final version of the SoilWise architecture, due M42.
This section lists the technical components for building the SoilWise Repository as foreseen in the architecture design. For now, the following components are foreseen:
- Harvester
- Repository Storage
- Catalogue
- Metadata Validation
- Metadata Authoring
- Transformation and Harmonisation
- Metadata Augmentation
- Knowledge Graph
- Natural Language Querying
- User Management and Access Control
A full version of the architecture diagram is available at: https://soilwise-he.github.io/soilwise-architecture/.
Harvester
The Harvester component automatically harvests sources to populate SWR with metadata on datasets and knowledge sources.
Metadata harvesting concept
Metadata harvesting is the process of ingesting metadata, i.e. evidence on data and knowledge, from remote sources and storing it locally in the catalogue for fast searching. It is a scheduled process, so that the local copy and the remote metadata are kept aligned. Various existing components are able to harvest metadata from various (standardised) APIs. SoilWise aims to use existing components where available.
The harvesting mechanism relies on the concept of a universally unique identifier (UUID) or unique resource identifier (URI), which is commonly assigned by the metadata creator or publisher. Another important concept behind harvesting is the last change date. Every time a metadata record is changed, the last change date is updated. Storing this parameter and comparing it with a new one allows any system to find out whether the metadata record has been modified since the last update. An exception is metadata that is removed remotely. SoilWise Repository can only derive that fact by harvesting the full remote content. Discussion is needed to understand if SWR should keep a copy of the remote source anyway, for archiving purposes.
A harvesting task typically extracts records with an update date later than the last-identified successful harvester run.
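The change-date comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the SoilWise implementation: the record layout (`identifier`, `modified`) and the function names `select_changed` and `find_removed` are hypothetical. It also shows why remote removals cannot be derived from change dates alone and need the full list of remote identifiers.

```python
from datetime import datetime, timezone


def select_changed(remote_records, last_run):
    """Return the remote records whose change date is newer than the last successful run."""
    return [r for r in remote_records if r["modified"] > last_run]


def find_removed(local_ids, remote_ids):
    """Removals leave no change date behind; they require the full remote identifier list."""
    return sorted(set(local_ids) - set(remote_ids))


# Hypothetical data: the last successful run and two remote records.
last_run = datetime(2024, 9, 1, tzinfo=timezone.utc)
remote = [
    {"identifier": "uuid-a", "modified": datetime(2024, 8, 20, tzinfo=timezone.utc)},
    {"identifier": "uuid-b", "modified": datetime(2024, 9, 5, tzinfo=timezone.utc)},
]

changed = select_changed(remote, last_run)
removed = find_removed(["uuid-a", "uuid-b", "uuid-c"], [r["identifier"] for r in remote])

print([r["identifier"] for r in changed])  # ['uuid-b']
print(removed)                             # ['uuid-c']
```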
Harvested content is (by default) not editable for the following reasons:
- The harvesting is periodic so any local change to harvested metadata will be lost during the next run.
- The change date may be used to keep track of changes so if the metadata gets changed, the harvesting mechanism may be compromised.
If inconsistencies with imported metadata are identified, we can add a statement to the graph of such inconsistencies. We can also notify the author of the inconsistency so they can fix the inconsistency on their side.
A governance aspect still under discussion is whether harvested content is removed as soon as a harvester configuration is removed, or when records are removed from the remote endpoint. The risk of removing content is that relations within the graph are broken. An alternative is to indicate that the record has been archived by the provider.
Typical tasks of a harvester:
- Define a harvester job
  - Schedule (on request, weekly, daily, hourly)
  - Endpoint / Endpoint type (example.com/csw -> OGC:CSW)
  - Apply a filter (only records with keyword='soil-mission')
- Understand success of a harvest job
  - overview of harvested content (120 records)
  - which runs failed, why? (today failed -> log, yesterday successful -> log)
  - Monitor running harvesters (20% done -> cancel)
- Define behaviours on harvested content
  - skip records with low quality (if test xxx fails)
  - mint identifier if missing ( https://example.com/data/{uuid} )
  - a model transformation before ingestion ( example-transform.xsl / do-something.py )
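As an illustration of how such a job definition could be captured, the sketch below models a harvester job as a small Python data class. The class `HarvestJob`, its field names and the allowed schedule values are assumptions made for this example; they do not describe the actual SoilWise configuration format.

```python
import uuid
from dataclasses import dataclass
from typing import Optional

# Schedules taken from the task list above; the spelling of the values is an assumption.
ALLOWED_SCHEDULES = {"on request", "weekly", "daily", "hourly"}


@dataclass
class HarvestJob:
    """Hypothetical harvester job definition."""
    name: str
    endpoint: str
    endpoint_type: str                    # e.g. "OGC:CSW"
    schedule: str = "weekly"
    keyword_filter: Optional[str] = None  # only harvest records carrying this keyword
    mint_pattern: Optional[str] = None    # e.g. "https://example.com/data/{uuid}"
    transform: Optional[str] = None       # e.g. "example-transform.xsl"

    def __post_init__(self):
        if self.schedule not in ALLOWED_SCHEDULES:
            raise ValueError(f"unknown schedule: {self.schedule!r}")

    def accepts(self, record):
        """Apply the keyword filter to a harvested record."""
        if self.keyword_filter is None:
            return True
        return self.keyword_filter in record.get("keywords", [])

    def identifier_for(self, record):
        """Return the record identifier, minting one from the pattern if it is missing."""
        if record.get("identifier"):
            return record["identifier"]
        if self.mint_pattern is None:
            raise ValueError("record has no identifier and no mint pattern is configured")
        return self.mint_pattern.format(uuid=uuid.uuid4())


job = HarvestJob(
    name="example-csw",
    endpoint="https://example.com/csw",
    endpoint_type="OGC:CSW",
    schedule="daily",
    keyword_filter="soil-mission",
    mint_pattern="https://example.com/data/{uuid}",
)
print(job.accepts({"keywords": ["soil-mission"]}))  # True
print(job.identifier_for({"title": "no identifier yet"}))
```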
Resource Types
Metadata for the following resource types is foreseen to be harvested:
- Data & Knowledge Resources
- Organisations, Projects, LTE, Living labs initiatives
- Repositories/Catalogues
These entities relate to each other as:
flowchart LR
people -->|memberOf| o[organisations]
o -->|partnerIn| p[projects]
p -->|produce| d[data & knowledge resources]
o -->|publish| d
d -->|describedIn| c[catalogues]
p -->|part-of| fs[Fundingscheme]
Datasets
Metadata records of datasets are, for the first iteration, primarily imported from ESDAC, INSPIRE GeoPortal, BonaRes and Cordis/OpenAire. In later iterations SoilWise aims to include other projects and portals, such as national or thematic portals. These repositories contain a large number of datasets. Selecting the key datasets within the SoilWise scope requires know-how that is to be developed within SoilWise.
Knowledge sources
With respect to harvesting, it is important to note that knowledge assets are heterogeneous, and that (compared to data), metadata standards and particularly access / harvesting protocols are not generally adopted. Available metadata might be implemented using a proprietary schema, and basic assumptions for harvesting, e.g. providing a "date of last change" might not be offered. This will, in some cases, make it necessary to develop customized harvesting and metadata extraction processes. It also means that informed decisions need to be made on which resources to include, based on priority, required efforts and available capacity.
The SoilWise project team is still exploring which knowledge resources to include. As an example, an important cluster of knowledge sources may be seen in academic articles and report deliverables from Mission Soil Horizon Europe projects. These resources are accessible from ESDAC, Cordis and OpenAire. Extracting content from Cordis and OpenAire can be achieved using a harvesting task (using the Cordis schema, extended with post processing). For the first iteration, SoilWise aims to achieve this goal. In future iterations new knowledge sources may become relevant; we will then investigate the best approach to harvest them.
Functionality
The Harvester component currently comprises the following functions:
- Harvest records from metadata and knowledge resources
- Metadata harmonization
- Metadata RDF turtle serialization
- RDF to Triple Store
- Duplication identification
Harvest records from metadata and knowledge resources
Note that the first SoilWise Repository development iteration resulted in 9,0444 harvested metadata records (as of 12.09.2024).
CORDIS
European Research projects typically advertise their research outputs via Cordis. This makes Cordis a likely candidate to discover research outputs, such as reports, articles and datasets. Cordis does not capture many metadata properties. In those cases where a resource is identified by a DOI, additional metadata can be found in OpenAire via the DOI. The scope of projects, from which to include project deliverables is still under discussion.
Which projects to include is derived from 2 sources:
- ESDAC maintains a list of historic EU funded research projects
- Mission soil platform maintains a list of current Mission soil projects
A script fetches the content from these 2 sources and prepares it so that the CORDIS and OpenAire harvesters can determine which content is relevant. The content in these pages is unstructured HTML, which is scraped using a Python library. This is not optimal, because the scraper expects a dedicated HTML structure, which makes it fragile.
Results of the scrape activity are stored in table harvest.projects. For each project a Record Control Number (RCN) is retrieved from the Cordis knowledge graph. This RCN could be used to filter OpenAire; however, OpenAire can also be filtered using the project grant number. At this moment in time the Cordis knowledge graph does not yet contain the Mission Soil projects.
At this moment in time we do not harvest resources from Cordis which do not have a DOI. This includes mainly progress reports of the projects.
OpenAire
For those resources, discovered via Cordis/ESDAC, and identified by a DOI, a harvester fetches additional metadata from OpenAire. OpenAire is a catalogue initiative which harvests metadata from popular scientific repositories, such as Zenodo, Dataverse, etc.
Not all DOIs registered in Cordis are available in OpenAire. OpenAire only lists resources with an open access license. Other DOIs can be fetched from the DOI registry directly or via Crossref.org. This work is still in preparation.
Records in OpenAire are stored in the Open Aire Research Graph (OAF) format, which is transformed to a metadata set based on Dublin Core.
OGC-CSW
Many (spatial) catalogues, such as INSPIRE GeoPortal, Bonares and ISRIC, advertise their metadata via the Catalogue Service for the Web (CSW) standard. The OWSLib library is used to query records from CSW endpoints. A filter can be configured to retrieve subsets of the catalogue.
Occasionally, records advertised via CSW also include a DOI reference (Bonares/ISRIC). Additional metadata for these DOIs is extracted from OpenAire/Crossref.
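A filtered CSW harvest with OWSLib could look roughly like the sketch below. It is an illustration only: the function names, the free-text filter on `csw:AnyText` and the page size are assumptions, not the configuration used by SoilWise. OWSLib is imported inside the function so that the paging helper can be used and tested without the library installed.

```python
def next_start_position(results):
    """Return the next CSW start position, or None when all matches have been returned.

    `results` mirrors the dict OWSLib exposes as `csw.results`
    (keys 'matches', 'returned', 'nextrecord').
    """
    nxt = results.get("nextrecord", 0)
    if not nxt or nxt > results.get("matches", 0):
        return None
    return nxt


def harvest_csw(url, keyword, page_size=50):
    """Yield (identifier, title) for all records of a CSW endpoint matching `keyword`."""
    # Lazy import: requires the OWSLib package.
    from owslib.csw import CatalogueServiceWeb
    from owslib.fes import PropertyIsLike

    csw = CatalogueServiceWeb(url)
    # Free-text filter; '%' is OWSLib's default wildcard character.
    constraint = PropertyIsLike("csw:AnyText", f"%{keyword}%")
    start = 1
    while start is not None:
        csw.getrecords2(constraints=[constraint], startposition=start,
                        maxrecords=page_size, esn="full")
        for identifier, record in csw.records.items():
            yield identifier, record.title
        start = next_start_position(csw.results)


# Paging logic on a hypothetical response summary:
print(next_start_position({"matches": 120, "returned": 50, "nextrecord": 51}))  # 51
print(next_start_position({"matches": 120, "returned": 20, "nextrecord": 0}))   # None
```

A call such as `list(harvest_csw("https://example.com/csw", "soil-mission"))` would then page through the (hypothetical) endpoint until no next record is reported.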
INSPIRE
Although the INSPIRE Geoportal does offer a CSW endpoint, we have not been able to harvest from it for technical reasons. Instead we have developed a dedicated harvester that uses the Elastic Search API endpoint of the Geoportal. Once the technical issue has been resolved, use of the CSW harvest endpoint is preferable.
ESDAC
The ESDAC catalogue is an instance of Drupal CMS. We have developed a dedicated harvester that scrapes the ESDAC HTML elements to extract Dublin Core metadata. Metadata is extracted for datasets, maps (EUDASM) and documents. Occasionally a DOI is mentioned as part of the HTML; this DOI is then used as the identifier for the resource, otherwise the resource URL is used as the identifier. If the DOI is not yet known to the system, OpenAire is queried to capture additional metadata on the resource.
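The identifier rule described above (use the DOI when the page mentions one, otherwise fall back to the resource URL) can be sketched as below. The regular expression, the `https://doi.org/` prefix and the function name are assumptions made for this example; the sample DOI is fictitious.

```python
import re

# Loose DOI pattern: "10.", a 4-9 digit registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r"\b(10\.\d{4,9}/[^\s\"'<>]+)")


def resource_identifier(html, resource_url):
    """Return a DOI-based identifier if the HTML mentions a DOI, else the resource URL."""
    match = DOI_PATTERN.search(html)
    if match:
        # Strip trailing punctuation that is unlikely to belong to the DOI.
        return "https://doi.org/" + match.group(1).rstrip(".,;)")
    return resource_url


page_with_doi = '<a href="https://doi.org/10.1234/abcd.5678">DOI</a>'
page_without_doi = "<p>European soil map, scanned sheet</p>"

print(resource_identifier(page_with_doi, "https://example.com/content/map-1"))
# https://doi.org/10.1234/abcd.5678
print(resource_identifier(page_without_doi, "https://example.com/content/map-1"))
# https://example.com/content/map-1
```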
Impact4Soil
Impact4soil is built on a Strapi.io headless CMS. The CMS provides an API to retrieve datasets and scientific articles. The API provides minimal metadata, but fortunately in most cases a DOI is included. The DOI is used to capture additional metadata from OpenAire.
Prepsoil portal
Prep4soil is built on a headless CMS. The CMS provides an API to retrieve datasets, knowledge items, living labs, lighthouses and communities of practice. The API provides minimal metadata; occasionally a DOI is included. The DOI is used to capture additional metadata from OpenAire.
Metadata Harmonization
Once stored in the harvest sources database, a second process is triggered which harmonizes the sources to the desired metadata profile. These processes are split by design, so that a failure in metadata processing does not require fetching the remote content again.
The table below lists the supported source models.
source | platform |
---|---|
Dublin Core | Cordis |
Extended Dublin core | ESDAC |
Datacite | OpenAire, Zenodo, DOI |
ISO19115:2005 | Bonares, INSPIRE |
Metadata is harmonised to a DCAT RDF representation.
For metadata harmonization some supporting modules are used. owslib is a module to parse various source metadata models, including iso19139:2007. A transformation script from [semic-eu/iso19139-to-dcat-ap.xslt](https://github.com/semic-eu/iso19139-to-dcat-ap/), in combination with lxml and rdflib, is used to convert iso19139:2007 metadata to RDF, serialised as turtle.
Harmonised metadata is transformed to either iso19139:2007 or Dublin Core and then ingested by the pycsw software, which powers the SoilWise Catalogue, using an automated process running at intervals. At this moment the pycsw catalogue software requires a dedicated database structure, and this step converts the harmonised metadata database to that model. In next iterations we aim to remove this step and enable the catalogue to query the harmonised model directly.
Metadata Augmentation
The metadata augmentation processes are described elsewhere; what is relevant here is that the output of these processes is integrated in the harmonised metadata database.
Metadata RDF turtle serialization
The harmonised metadata model is based on the DCAT ontology. In this step the content of the database is written to RDF.
Harmonized metadata is transformed to RDF in preparation of being loaded into the triple store (see also Knowledge Graph).
RDF to Triple store
This is a component which on request can dump the content of the harmonised database as an RDF quad store. This service is requested at intervals by the triple store component. In a next iteration we aim to push the content to the triple store at intervals.
Duplication identification
A resource can be described in multiple Catalogues, identified by a common identifier. Each of the harvested instances may contain duplicate, alternative or conflicting statements about the resource. SoilWise Repository aims to persist a copy of the harvested content (also to identify if the remote source has changed). For this iteration we store the first copy, and capture on what other platforms the record has been discovered. OpenAire already has a mechanism to indicate in which platforms a record has been discovered, this information is ingested as part of the harvest. An aim of this exercise is also to understand in which repositories a certain resource is advertised.
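The "store the first copy and record where else it was found" approach can be illustrated with the sketch below. The record layout (`identifier`, `source`, `title`) and the function name `merge_duplicates` are hypothetical; the sample records are invented for the example.

```python
def merge_duplicates(harvested):
    """Keep the first harvested copy per identifier and list every platform it was seen on."""
    merged = {}
    for record in harvested:
        key = record["identifier"]
        if key not in merged:
            # First occurrence: this copy is the one that is persisted.
            merged[key] = {"record": record, "sources": [record["source"]]}
        elif record["source"] not in merged[key]["sources"]:
            # Later occurrence: only remember the additional platform.
            merged[key]["sources"].append(record["source"])
    return merged


harvested = [
    {"identifier": "10.1234/x", "source": "ESDAC", "title": "First title"},
    {"identifier": "10.1234/x", "source": "OpenAire", "title": "Alternative title"},
    {"identifier": "10.1234/y", "source": "Bonares", "title": "Other resource"},
]

merged = merge_duplicates(harvested)
print(merged["10.1234/x"]["record"]["title"])  # First title
print(merged["10.1234/x"]["sources"])          # ['ESDAC', 'OpenAire']
```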
In the first development iteration, a visualization of the source repositories is available as a dedicated section in the SoilWise Catalogue.
Technology
Git actions/pipelines to run harvest tasks
Git actions (github) or pipelines (gitlab) are automated processes which run at intervals or events. Git platforms typically offer this functionality including extended logging, queueing, and manual job monitoring and interaction (start/stop).
Each harvester runs in a dedicated container. The result of the harvester is ingested into a (temporary) storage. Follow up processes (harmonization, augmentation, validation) pick up the results from the temporary storage.
flowchart LR
c[CI-CD] -->|task| q[/Queue\]
r[Runner] --> q
r -->|deploys| hc[Harvest container]
hc -->|harvests| db[(temporary storage)]
hc -->|data cleaning| db[(temporary storage)]
Harvester tasks are triggered from Git CI-CD. Git provides options to cancel and trigger tasks and to review CI-CD logs to check for errors.
OGC-CSW
Many (spatial) catalogues, such as INSPIRE GeoPortal, Bonares and ISRIC, advertise their metadata via the Catalogue Service for the Web (CSW) standard.
CORDIS - OpenAire
Cordis does not capture many metadata properties. We harvest the title of a project publication and, if available, the DOI. In those cases where a resource is identified by a DOI, additional metadata can be found in OpenAire via the DOI. For those resources a harvester fetches additional metadata from OpenAire.
A second mechanism is available to link from Cordis to OpenAire, the RCN number. The OpenAire catalogue can be queried using an RCN filter to retrieve only resources relevant to a project. This work is still in preparation.
Not all DOIs registered in Cordis are available in OpenAire. OpenAire only lists resources with an open access license. Other DOIs can be fetched from the DOI registry directly or via Crossref.org. This work is still in preparation. Detailed technical information can be found in the technical description.
OpenAire and other sources
The software used to query OpenAire by DOI or by RCN is not limited to DOIs or RCNs that come from Cordis. Any list of DOIs or RCNs can be handled by the software.
Integration opportunities
The Automatic metadata harvesting component will show its full potential when it is tightly connected within the SWR to (1) the SWR Catalogue, (2) Metadata authoring and (3) ETS/ATS, i.e. test suites.
Repository Storage
Info
Current version: Postgres release 12.2; Virtuoso release 07.20.3239
Access point: Triple Store (SWR SPARQL endpoint) https://sparql.soilwise-he.containers.wur.nl/sparql
The SoilWise repository aims at merging and seamlessly providing different types of content. To host this content and to be able to efficiently drive internal processes and to offer performant end user functionality, different storage options are implemented.
- A relational database management system for the storage of the core metadata of both data and knowledge assets.
- A Triple Store to store the metadata of data and knowledge assets as a graph, linked to soil health and related knowledge as a linked data graph.
- Git for storage of user-enhanced metadata.
Functionality
PostgreSQL RDBMS: storage of raw and augmented metadata
A "conventional" RDBMS is used to store the (augmented) metadata of data and knowledge assets. The harvester process uses it to store the raw results of the metadata harvesting of the different resources that are currently connected. Various metadata augmentation jobs use it as input and write their output to this data store. The catalogue also queries the PostgreSQL database.
There are several reasons for choosing an RDBMS as the main source for metadata storage and metadata querying:
- An RDBMS provides good options to efficiently structure and index its contents, thus allowing performant access for both internal processes and end user interface querying.
- An RDBMS easily allows implementing constraints and checks to keep data and relations consistent and valid.
- Various extensions, e.g. search engines, are available to make querying, aggregations even more performant and fitted for end users.
Virtuoso Triple Store: storage of SWR knowledge graph
A Triple Store is implemented as part of the SWR infrastructure to allow a more flexible linkage between the knowledge captured as metadata and various internal and external knowledge sources, particularly taxonomies, vocabularies and ontologies that are implemented as RDF graphs. Results of the harvesting and metadata augmentation that are stored in the RDBMS are converted to RDF and stored in the Triple Store.
A Triple Store is selected as a parallel storage because it offers several capabilities:
- It allows the linking of different knowledge models, e.g. to connect the SWR metadata model with existing and new knowledge structures on soil health and related domains.
- It allows reasoning over the relations in the stored graph, and thus allows connecting and smartly combining knowledge from those domains.
- Through the SPARQL interface, it allows users and processes to use such reasoning and exploit previously unconnected sets of knowledge.
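As a hedged illustration of the SPARQL access mentioned above, the sketch below prepares a query request for the SWR SPARQL endpoint using only the Python standard library. The query itself (listing resources typed as `dcat:Dataset` with a `dct:title`) is an assumption about how the graph is modelled, and the request is only constructed, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "https://sparql.soilwise-he.containers.wur.nl/sparql"

# Assumed graph shape: resources typed as dcat:Dataset carrying a dct:title.
QUERY = """
PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dct: <http://purl.org/dc/terms/>
SELECT ?dataset ?title WHERE {
  ?dataset a dcat:Dataset ;
           dct:title ?title .
} LIMIT 10
"""


def build_sparql_request(endpoint, query):
    """Build a SPARQL protocol GET request asking for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_sparql_request(ENDPOINT, QUERY)
print(request.get_method())  # GET
# To execute the query: urllib.request.urlopen(request), then parse the JSON body.
```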
Git: User enhanced metadata
The current setup of SWR, using the pycsw infrastructure, allows users to propose metadata enhancements. Such enhancements are managed in Git at: https://github.com/soilwise-he/soilinfohub/discussions.
Ongoing Developments
In the next iteration of the SWR development, the currently deployed storage options will be extended to support new features and functions. Such extensions can improve performance and usability. Moreover, we expect that the integration of AI/ML based functions will require additional types of storage and a better integration to exploit their combined power. Exploratory work that was performed, but is not yet integrated into the deployment of iteration 1, includes:
Establishing a vector database
A vector database is foreseen as a foundation to use Large Language Models (LLM) and implement Natural Language Querying (NLQ), e.g. to allow chatbot functionality for end users. A vector DB allows storage of the text embeddings that are the basis for such NLQ functions.
Selecting a search engine
A search engine, deployed on top of the current RDBMS, will increase the performance of end user queries. It can also offer better usability, e.g. by offering aggregation functions for faceted search and ranking of search results. Additionally, search engines increasingly index unstructured content and are moving towards supporting text embeddings. Thus, they might be a starting point (or alternative?) to offer smart searches on unstructured text, using more conventional and broadly adopted software and offering easier migration pathways towards NLQ-like functions.
Technology & Integration
Components used:
- Virtuoso (version 07.20.3239)
- PostgreSQL (release 12.13)
Catalogue
The metadata catalogue is a central piece of the architecture, giving access to individual metadata records. In the catalogue domain, various effective metadata catalogues are developed around the standards issued by the OGC, the Catalogue Service for the Web (CSW) and the OGC API Records, Open Archives Initiative (OAI-PMH), W3C (DCAT), FAIR science (Datacite) and Search Engine community (schema.org). For our first iteration we've selected the pycsw software, which supports most of these standards.
Functionality
The SoilWise prototype adopts a frontend, focusing on:
- minimalistic User Interface, to prevent a technical feel,
- paginated search results, sorted alphabetically, by date, see more information in Chapter Query Catalogue,
- option to filter by facets, see more information in Chapter Query Catalogue,
- preview of the dataset (if a thumbnail or OGC:Service is available), else display of its spatial extent, see more information in Chapter Display record's detail,
- option to provide feedback to publisher/author, see more information in Chapter User Engagement,
- readable link in the browser bar, to facilitate link sharing.
Query Catalogue
The SoilWise Catalogue currently enables the following search options:
50 results are displayed per page in alphabetical order, in the form of an overview table comprising a preview of title, abstract, contributor, type and date. Search terms set through the user interface are also reflected in the URL to facilitate sharing.
Fulltext search
Fulltext search is currently enabled through the q= parameter. Other queryable parameters are title, keywords, abstract, contributor. The full list of queryables can be found at: https://soilwise-he.containers.wur.nl/cat/collections/metadata:main/queryables.
Fulltext search currently only supports combining words with the AND operator.
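A search request using the q= parameter could be composed as in the sketch below. The items path (`.../collections/metadata:main/items`) is assumed by analogy with the queryables URL above and the OGC API - Records conventions; the `limit` and `f` parameters and the space-separated form of the search terms are likewise assumptions. The URL is only constructed, not requested.

```python
from urllib.parse import urlencode

# Assumed items endpoint, derived from the queryables URL of the catalogue.
ITEMS_URL = "https://soilwise-he.containers.wur.nl/cat/collections/metadata:main/items"


def search_url(terms, limit=50, **queryables):
    """Build a catalogue search URL; all terms are expected to match (AND)."""
    params = {"q": " ".join(terms), "limit": limit, "f": "json"}
    params.update(queryables)  # e.g. title="...", keywords="..."
    return ITEMS_URL + "?" + urlencode(params)


print(search_url(["soil", "carbon"]))
# https://soilwise-he.containers.wur.nl/cat/collections/metadata:main/items?q=soil+carbon&limit=50&f=json
```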
Faceted search
- filter by physical soil parameters (soil texture, WRB, soil structure, bulk density, porosity, water holding capacity, soil moisture),
- filter by chemical soil parameters (pH, organic matter, cation exchange capacity, electrical conductivity, nutrient content, soil carbon, soil nitrogen, soil phosphorus, heavy metals concentration),
- filter by biological soil parameters (microbial biomass, soil enzyme activities, soil fauna, soil respiration),
- filter by soil functions (soil fertility, water regulation, soil erosion control, carbon sequestration, soil health, supporting plant growth, contaminant filtration),
- filter by soil degradation indicators (soil erosion, soil compaction, soil salinization, soil acidification, soil contamination),
- filter by environmental soil functions (habitat for organisms, climate regulation, water filtration),
- filter by long-term field experiments (experimental treatments, temporal data, environmental covariates, soil productivity, soil management),
- filter by record's type (dataset, document, publication, software, services, series).
Future work
- extend fulltext search; allow complex queries using exact match, OR,...
- use Full Text Search ranking to sort by relevance.
- filter by source repository.
Display record's detail
After clicking an item in the results table, the record's detail is displayed at a unique URL to facilitate sharing. The record's detail currently comprises:
- record's type tag,
- full title,
- full abstract,
- keywords' tags,
- preview of record's geographical extent, see Map preview,
- record's preview image, if available,
- all other record's items,
- section enabling User Engagement,
- last update date.
Future work
- links section with links to original repository, TBD...,
- indication of metadata augmentation, such as link liveliness assessment,
- display metadata augmentation results,
- display metadata validation results,
- show relations to other records,
- better distinguish link types; service/api, download, records, documentation, etc.
Resource preview
SoilWise Catalogue currently supports 3 types of preview:
- Display of the resource's geographical extent, which is available in the record's detail as well as in the search results list.
- Display of a graphic preview (thumbnail) in case it is advertised in metadata.
- Map preview of OGC:WMS services advertised in metadata enables standard simple user interaction (zoom, changing layers).
Data download (AS IS)
Download of data "as is" is currently supported through the links section from the harvested repository. Note, "interoperable data download" has been only a proof-of-concept in the first iteration phase, i.e. is not integrated into the SoilWise Catalogue.
Display link to knowledge
Download of knowledge source "as is" is currently supported through the links section from the harvested repository.
Support catalogue APIs of various communities
In order to interact with the many relevant data communities, SoilWise aims to support a range of catalogue standards.
Catalogue Service for the Web
Catalogue service for the web (CSW) is a standardised pattern to interact with (spatial) catalogues, maintained by OGC.
OGC API - Records
OGC is currently in the process of adopting a revised edition of its catalogue standards. The new standard is called OGC API - Records. OGC API - Records is closely related to Spatio Temporal Asset Catalogue (STAC), a community standard in the Earth Observation community.
Protocol for metadata harvesting
The Open Archives Initiative has defined a common protocol for metadata harvesting (OAI-PMH), which is adopted by many catalogue solutions, such as Zenodo, OpenAire, CKAN. The OAI-PMH endpoint of SoilWise can be harvested by these repositories.
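To illustrate how a remote repository would page through an OAI-PMH endpoint, the sketch below builds `ListRecords` requests and parses a response page with the Python standard library. The base URL `https://example.org/oaipmh` and the sample response are placeholders; the verb, `metadataPrefix`, `from` and `resumptionToken` parameters are part of the OAI-PMH protocol.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None, token=None):
    """Build a ListRecords request; a resumption token replaces all other arguments."""
    if token:
        params = {"verb": "ListRecords", "resumptionToken": token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date:
            params["from"] = from_date  # only records changed since this date
    return base_url + "?" + urlencode(params)


def parse_page(xml_text):
    """Return (record identifiers, resumption token or None) for one response page."""
    root = ET.fromstring(xml_text)
    ids = [e.text for e in root.findall(".//oai:header/oai:identifier", OAI_NS)]
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=OAI_NS)
    return ids, (token or None)


# Placeholder response page with two records and a token for the next page.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example:1</identifier></header></record>
    <record><header><identifier>oai:example:2</identifier></header></record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

ids, token = parse_page(SAMPLE)
print(list_records_url("https://example.org/oaipmh", from_date="2024-09-01"))
print(ids, token)
print(list_records_url("https://example.org/oaipmh", token=token))
```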
Schema.org annotations
Annotations using the schema.org/Dataset ontology enable search engines to harvest metadata in a structured way.
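A schema.org/Dataset annotation is typically embedded in a record page as JSON-LD. The sketch below shows how such an annotation could be generated from a metadata record; the record layout and the function name `dataset_jsonld` are assumptions for illustration and do not reflect how pycsw produces its annotations.

```python
import json


def dataset_jsonld(record):
    """Serialise a (hypothetical) metadata record as a schema.org/Dataset JSON-LD string."""
    doc = {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": record["title"],
        "description": record["abstract"],
    }
    # Optional properties are only added when the record provides them.
    if record.get("identifier"):
        doc["identifier"] = record["identifier"]
    if record.get("keywords"):
        doc["keywords"] = list(record["keywords"])
    return json.dumps(doc, indent=2)


record = {
    "title": "Example soil dataset",
    "abstract": "Illustrative record used to show the annotation structure.",
    "identifier": "https://example.com/data/1",
    "keywords": ["soil", "carbon"],
}
annotation = dataset_jsonld(record)
# In an HTML page, this string would be placed inside
# <script type="application/ld+json"> ... </script>.
print(annotation)
```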
User Engagement
Collecting user feedback provides an important channel for assessing the usability of the described resources. Users can even support each other by sharing the feedback as 'questions and answers'. For this purpose every record display ends with a feedback section where users can discuss the resource. Users need to authenticate to provide feedback.
Future work
Notify the resource owners of incoming feedback, so they can answer any questions or even improve their resource.
Technology
pycsw is a catalogue component offering an HTML frontend and query interface using various standardised catalogue APIs to serve multiple communities. pycsw, written in Python, allows for the publishing and discovery of geospatial metadata via numerous APIs (CSW 2/CSW 3, OpenSearch, OAI-PMH, SRU), providing a standards-based metadata and catalogue component of spatial data infrastructures. pycsw is Open Source, released under an MIT license, and runs on all major platforms (Windows, Linux, Mac OS X).
pycsw is deployed as a docker container from the official docker hub repository. Its configuration is updated at deployment. Some layout templates are overwritten at deployment to facilitate a tailored HTML view.
Integration
The SWR catalogue component will show its full potential when integrated with (1) Harvester, (2) Storage of metadata, (3) Metadata Augmentation and Metadata Validation.
Metadata Validation
Metadata should help users assess the usability of a data set for their own purposes and help them understand its quality.
In terms of metadata, the SoilWise Repository aims to harvest and register as much as possible (see more information in the Harvester Component). Catalogues which capture metadata authored by data custodians typically show a wide range of metadata completeness and accuracy. Therefore, the SoilWise Repository employs metadata validation mechanisms to provide additional information about metadata completeness, conformance and integrity. Information resulting from the validation process is stored together with each metadata record in a relational database and updated after a new metadata version is registered. Within the first iteration, it is not displayed in the SoilWise Catalogue, except for the results of the Link liveliness assessment component. For the following iterations, we foresee the validation results being available only to data / knowledge owners / managers and the SWR admins, as SoilWise is not in an arbiter's role.
After Metadata augmentation, the whole validation process can be repeated to understand the variability of the metadata and the value that SWR has added.
Validations:
Metadata profiles
Metadata profiles specify the required metadata elements that must be included to describe resources, ensuring they are discoverable, accessible, and usable. Metadata validation is inherently linked to the specific metadata profile it is intended to follow. This linkage ensures that metadata records are consistent, meet the necessary standards, and are fit for their intended purpose, thereby supporting effective data management, discovery, and use. In the soil domain, several metadata profiles are commonly used to ensure the effective documentation, discovery, and utilization of soil data, for example Datacite, GBIF-EML, Geo-DCAT-AP, INSPIRE Metadata Profile, Dublin Core, ANZLIC Metadata Profile, FAO Global Soil Partnership Metadata Profile, EJP/EUSO Metadata Profile. SoilWise Repository is currently able to perform validations according to the following metadata profiles:
EUSO Metadata profile
This metadata profile was developed through EJP Soil project efforts and modified and approved by the EUSO Working Group.
This metadata profile has been used within the first development iteration phase. Further modifications are under discussion among all the stakeholders.
Label | Cardinality | Codelist | Description |
---|---|---|---|
Identification | 1-n | | Unique identification of the dataset (A UUID, URN, or URI, such as DOI) |
Title | 1-1 | | Short meaningful title |
Abstract | 1-1 | | Short description or abstract (1/2 page), can include (multiple) scientific/technical references |
Extent (geographic) | 0-1 | BBOX or Geonames | Geographical coverage (e.g. EU, EU & Balkan, France, Wallonia, Berlin) |
Reference period - Start | 0-1 | | Reference period for the data - Start |
Reference period - End | 0-1 | | Reference period - End; empty if ongoing |
Access constraints | 1-1 | INSPIRE | Indicates if the data is publicly accessible or the reason to apply access constraints |
Usage constraints | 1-1 | INSPIRE | Indicates if there are legal usage constraints (license) |
Keywords | 0-n | | Keywords |
Contact | 1-n | | name; organisation; email; role, where role is one of distributor, owner, pointOfContact, processor, publisher, metadata-contact |
Source | 0-n | | Source is a reference to another dataset which is used as a source for this dataset. Reference a single dataset per line; Title; Date; or provide a DOI; |
isSourceOf | 0-n | | Other datasets that the current dataset is used as input source |
Lineage | 1-1 | | Statement on the origin and processing of the data |
Processing steps | 0-n | | Methods applied in data acquisition and processing: preferably reference a method from a standard (national, LUCAS, FAO, etc.). One processing step per line; Method; Date; Processor; Method reference; Comment |
Language | 1-n | ISO | Language, of the data and metadata, if metadata is multilingual multiple languages can be provided |
Reference system | 0-1 | CRS | Spatial Projection: drop down list of options, including ‘unknown’ (you can also leave out the field if it is unknown) |
Citation | 0-n | | Citations are references to articles which reference this dataset; one citation on each line; Title; Authors; Date; or provide a DOI |
Spatial resolution | 0-n | | Resolution (grid) or scale (vector) |
Data type | 0-1 | table, vector, grid | The type of data |
Geometry type | 0-1 | point, line, polygon, ... | Geometry type for vector data |
File / service Location | 0-n | | Url or path to the data file or service |
Format | 0-n | IANA | File Format in which the data is maintained or published |
Delivery | 0-n | | The way the dataset is available (ie digital: download, viewer OR physical way: Shipping or in situ access ) |
Maintenance frequency | 0-1 | ISO | Indication of the frequency of data updates |
Modification date | 0-1 | | Date of last modification |
Status | 0-1 | ISO | Status of the dataset |
Subject - Spatial scope | 0-n | INSPIRE | The scope of the dataset, e.g. regional, national, continental |
Subject - Soil properties | 0-n | INSPIRE | Soil properties described in this dataset |
Subject - Soil function | 0-n | INSPIRE | Soil functions described in this dataset |
Subject - Soil threats | 0-n | INSPIRE | Soil threats described in this dataset |
Subject - Soil Indicators | 0-n | INSPIRE | Soil indicators described in this dataset |
Subject - EUSO Data WG subgroup | 0-n | EUSO | The EUSO subgroups which contributed to this record |
Subject - Context | 0-n | EUSO | Context: (e.g. EU-Project SOILCARE, EJP-Soil, Literature, ESDAC, etc.) |
Subject - Possible End-users | 0-n | EUSO | Possible end-users: citizens, scientific community, private sector, EU, member states, academia |
Subject - Category | 0-n | EUSO | One or more thematic categories of the dataset |
Quality statement | 0-1 | | A statement of quality or any other supplemental information |
Datamodel/dimensions | 0-1 | | The datamodel (table) or dimensions (grid) of the dataset |
Units of measure | 0-n | ISU | List of UoM from International System of Units, at attribute/dimension level |
Attribute type | 0-n | string, number, date | The type of attribute |
Categorical Data | 0-n | | Lookup tables for categorical data, at attribute/dimension level |
Uncertainty | 0-n | | Method used to assess uncertainty and its result. For example: One or more measurements to describe the error and uncertainties in the dataset |
Completeness | 0-1 | | The % of completeness |
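The cardinalities in the table above lend themselves to a mechanical check. The following is a minimal sketch, assuming a record is available as a plain dictionary that maps a profile label to a list of values. The `PROFILE` subset and the function names are illustrative assumptions and do not describe the validator deployed in the SWR.

```python
# Illustrative subset of the EUSO profile: label -> cardinality string.
PROFILE = {
    "Identification": "1-n",
    "Title": "1-1",
    "Abstract": "1-1",
    "Extent (geographic)": "0-1",
    "Keywords": "0-n",
    "Contact": "1-n",
}


def parse_cardinality(card):
    """Turn '0-1', '1-1', '0-n' or '1-n' into (minimum, maximum); None means unbounded."""
    low, high = card.split("-")
    return int(low), (None if high == "n" else int(high))


def check_cardinality(record, profile=PROFILE):
    """Return readable violations for a record given as a dict of label -> list of values."""
    problems = []
    for label, card in profile.items():
        minimum, maximum = parse_cardinality(card)
        count = len(record.get(label, []))
        if count < minimum:
            problems.append(f"{label}: expected at least {minimum}, found {count}")
        if maximum is not None and count > maximum:
            problems.append(f"{label}: expected at most {maximum}, found {count}")
    return problems
```

An empty result means that the record satisfies the cardinalities of the listed elements; it says nothing about codelist conformance, which would need a separate check.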
INSPIRE metadata profile
The validation against the INSPIRE metadata profile checks whether the metadata records are in accordance with the technical requirements of INSPIRE, specifically according to the INSPIRE data specification on Soil – Technical Guidelines version 3.0. The Soil-specific metadata elements are:
Type | Package Stereotypes |
---|---|
DerivedProfilePresenceInSoilBody | «associationType» |
DerivedSoilProfile | «featureType» |
FAOHorizonMasterValue | «codelist» |
FAOHorizonNotationType | «dataType» |
FAOHorizonSubordinateValue | «codelist» |
FAOPrimeValue | «codelist» |
LayerGenesisProcessStateValue | «codelist» |
LayerTypeValue | «codelist» |
ObservedSoilProfile | «featureType» |
OtherHorizonNotationType | «dataType» |
OtherHorizonNotationTypeValue | «codelist» |
OtherSoilNameType | «dataType» |
OtherSoilNameTypeValue | «codelist» |
ParticleSizeFractionType | «dataType» |
ProfileElement | «featureType» |
ProfileElementParameterNameValue | «codelist» |
RangeType | «dataType» |
SoilBody | «featureType» |
SoilDerivedObject | «featureType» |
SoilDerivedObjectParameterNameValue | «codelist» |
SoilHorizon | «featureType» |
SoilInvestigationPurposeValue | «codelist» |
SoilLayer | «featureType» |
SoilPlot | «featureType» |
SoilPlotTypeValue | «codelist» |
SoilProfile | «featureType» |
SoilProfileParameterNameValue | «codelist» |
SoilSite | «featureType» |
SoilSiteParameterNameValue | «codelist» |
SoilThemeCoverage | «featureType» |
SoilThemeDescriptiveCoverage | «featureType» |
SoilThemeDescriptiveParameterType | «dataType» |
SoilThemeParameterType | «dataType» |
WRBQualifierGroupType | «dataType» |
WRBQualifierPlaceValue | «codelist» |
WRBQualifierValue | «codelist» |
WRBReferenceSoilGroupValue | «codelist» |
WRBSoilNameType | «dataType» |
WRBSpecifierValue | «codelist» |
Functionality
Metadata profile validation
Info
Current version: 0.1.0
Project: Metadata validator
Access point: https://data.soilwise.wetransform.eu/#/home (authorization needed)
Metadata structure validation
The initial steps of metadata validation comprise:
- Syntax Check: Verifying that the metadata adheres to the specified syntax rules. This includes checking for allowed tags, correct data types, character encoding, and adherence to naming conventions.
- Schema (DTD/xsd/shacl/json-schema) Validation: Ensuring that the metadata conforms to the defined schema or metadata model. This involves verifying that all required elements are present, and relationships between different metadata components are correctly established.
Metadata completeness indication
The indication calculates the completeness of a record as a percentage (0-100 %) of the endorsed properties of the EUSO soil profile that are populated, taking into account that some properties are conditional on the values selected in other properties.
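A completeness indication of this kind could be computed roughly as follows. This is a sketch under stated assumptions: the list of endorsed properties and the single conditional rule (Geometry type only applies to vector data, as suggested by the profile table) are illustrative, and returning 100 when no property applies is an arbitrary design choice.

```python
def completeness(record, endorsed, conditions=None):
    """Return the percentage of applicable endorsed properties that carry a value.

    conditions maps a property name to a predicate(record) that tells
    whether the property applies to this record at all.
    """
    conditions = conditions or {}
    applicable = [p for p in endorsed if conditions.get(p, lambda r: True)(record)]
    if not applicable:
        return 100.0
    filled = sum(1 for p in applicable if record.get(p) not in (None, "", []))
    return round(100.0 * filled / len(applicable), 1)


# Hypothetical configuration for illustration only.
ENDORSED = ["Title", "Abstract", "Lineage", "Data type", "Geometry type"]
CONDITIONS = {"Geometry type": lambda r: r.get("Data type") == "vector"}
```

For a grid dataset, the Geometry type property is left out of the denominator, so a missing value there does not lower the score.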
Metadata ETS/ATS checking
The methodology of ETS/ATS has been suggested to develop validation tests.
Abstract Test Suites (ATS) define a set of abstract test cases or scenarios that describe the expected behaviour of metadata without specifying the implementation details. These test suites focus on the logical aspects of metadata validation and provide a high-level view of metadata validation requirements, enabling stakeholders to understand validation objectives and constraints without getting bogged down in technical details. They serve as a valuable communication and documentation tool, facilitating collaboration between metadata producers, consumers, and validators. ATS are often documented using natural language descriptions, diagrams, or formal specifications. They outline the expected inputs, outputs, and behaviours of the metadata under various conditions.
Executable Test Suites (ETS) are sets of tests designed according to ATS to perform the metadata validation. These tests are typically automated and can be run repeatedly to ensure consistent validation results. Executable test suites consist of scripts, programs, or software tools that perform various validation checks on metadata. These checks can include:
- Data Integrity: Checking for inconsistencies or errors within the metadata. This includes identifying missing values, conflicting information, or data that does not align with predefined constraints.
- Standard Compliance: Assessing whether the metadata complies with relevant industry standards, such as Dublin Core, MARC, or specific domain standards like those for scientific data or library cataloguing.
- Interoperability: Evaluating the metadata's ability to interoperate with other systems or datasets. This involves ensuring that metadata elements are mapped correctly to facilitate data exchange and integration across different platforms.
- Versioning and Evolution: Considering the evolution of metadata over time and ensuring that the validation process accommodates versioning requirements. This may involve tracking changes, backward compatibility, and migration strategies.
- Quality Assurance: Assessing the overall quality of the metadata, including its accuracy, consistency, completeness, and relevance to the underlying data or information resources.
- Documentation: Documenting the validation process itself, including any errors encountered, corrective actions taken, and recommendations for improving metadata quality in the future.
Technology & Integration
hale»connect has been deployed. This platform includes the European Testing Framework ETF and can execute Metadata and Data validation using the ETS approach outlined above. The User Guide is available here. The administration console of the platform can be accessed upon login at: https://data.soilwise.wetransform.eu/#/home.
The metadata validation component will show its full potential when integrated with (1) SWR Catalogue and (2) Storage of metadata. It requires authentication and authorisation.
User Guide
When using the ‘Metadata only’ workflow, the metadata profile can be validated with hale»connect. To do this, after logging into hale»connect, go directly to the setup of a new Theme (transformation project and Schema are therefore not required) and activate ‘Publish metadata only’ and specify where the metadata should come from. To validate the metadata file, upload the metadata and select ‘Metadata only’. Once validation is complete, a report can be called up.
A comprehensive tutorial video on setting up and executing transformation workflows can be found here.
Future work
- full development of the ETS, using populated codelists,
- display validation results in the SoilWise Catalogue,
- on-demand metadata validation, which would generate reports for user-uploaded metadata,
- applicability of ISO19157 Geographic Information – Data quality (i.e. the standard intended for data validations) for metadata-based validation reports,
- SHACL is in general intended for semantic web related validations; however, its exact scope will be determined during the upcoming SoilWise developments.
Link liveliness assessment
Metadata (and data and knowledge sources) tend to contain links to other resources. Not all of these URIs are persistent, so over time they can degrade. In practice, many non-persistent knowledge sources and assets exist that could be relevant for SWR, e.g. on project websites, in online databases, on the computers of researchers, etc. Links pointing to such assets might however be part of harvested metadata records or data and content that is stored in the SWR.
The link liveliness assessment subcomponent runs over the available links stored with the SWR assets and checks their status. The function is foreseen to run frequently over the URIs in the SWR repository, assessing and storing the status of each link. The link liveliness assessment provides the following functions:
- OGC API Catalogue Integration
  - Designed to work specifically with OGC API - Records System
  - Extracts and evaluates URLs from catalogue items
- Link Validation
  - Evaluates the validity of links to external sources and within the repository
  - Checks if metadata accurately represents the source
  - Support for OGC service links
- Health Status Tracking
  - Provides up-to-date status history for every assessed link
  - Maintains a history of link health over time
- Flexible Evaluation
  - Supports single resource evaluation on demand
  - Performs periodic tests to provide availability history
- Broken link management
  - Identifies and categorizes broken links based on their status code (401 Unauthorized, 404 Not Found, 500 Server Error)
  - Flags deprecated links after consecutive failed tests and excludes them from future checks
- Timeout management
  - Identifies resources exceeding specified timeout thresholds
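The broken link and timeout handling listed above can be illustrated with a small sketch. It assumes that each check yields an HTTP status code or a timeout, and that a link is flagged as deprecated after three consecutive failures. The threshold, the category names and the `LinkRecord` class are assumptions made for this example; the source does not specify them.

```python
from dataclasses import dataclass, field

# Hypothetical threshold: the text above only says "consecutive failed tests".
MAX_CONSECUTIVE_FAILURES = 3


def categorise(status_code, timed_out=False):
    """Map an HTTP status code (or a timeout) to a coarse link category."""
    if timed_out:
        return "timeout"
    if status_code is None:
        return "unreachable"
    if 200 <= status_code < 400:
        return "ok"
    if status_code == 401:
        return "unauthorized"
    if status_code == 404:
        return "not found"
    if status_code >= 500:
        return "server error"
    return "other"


@dataclass
class LinkRecord:
    url: str
    history: list = field(default_factory=list)  # categories, oldest first
    deprecated: bool = False

    def record(self, status_code, timed_out=False):
        """Append a check result and flag the link once failures pile up."""
        self.history.append(categorise(status_code, timed_out))
        recent = self.history[-MAX_CONSECUTIVE_FAILURES:]
        if len(recent) == MAX_CONSECUTIVE_FAILURES and all(c != "ok" for c in recent):
            self.deprecated = True
```

Keeping the full history on the record mirrors the health status tracking function, while the `deprecated` flag lets a scheduler skip the link in later runs.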
A JavaScript widget is also used to display the link status directly in the SWR Catalogue record.
Technology
- Python: used for the linkchecker integration, API development, and database interactions
- PostgreSQL: primary database for storing and managing link information
- FastAPI: employed to create and expose REST API endpoints; utilizes FastAPI's efficiency and auto-generated Swagger documentation
- Docker: used for containerizing the application, ensuring consistent deployment across environments
- CI/CD: automated pipeline for continuous integration and deployment, with scheduled weekly runs for link liveliness assessment
Metadata Authoring
Functionality
No implementations are yet an integrated part of the SWR delivery, as they were intentionally out of scope for the first development iteration. Metadata authoring and generation is, however, possible using the hale»connect workflows.
Foreseen functionality
Users are enabled to create and maintain metadata records within the SWR in case these records cannot be imported from a remote source. Note that importing records from a remote source is the preferred approach from the SWR point of view, because the ownership and persistence of the record are facilitated by the remote platform.
- Users log in to the system and are enabled to upload a metadata record.
- A form is available for users to create or manage an existing record. The form has select options for those fields which are linked to a codelist.
- Users can also upload a spreadsheet of records which are converted to the MCF format.
- Users will see metadata validation results.
Technology
The authoring workflow uses a Git backend; additions to the catalogue are entered by members of the Git repository directly or via pull request (review). Records are stored in iso19139:2007 XML or MCF. MCF is a subset of iso19139:2007 in a YAML encoding, defined by the pygeometa community. The pygeometa library is used to convert the MCF to any requested metadata format.
The pygeometa community provides a web-based form for users uncomfortable with editing an MCF file directly. The tool can be hosted within the SWR to facilitate a dedicated color scheme. The form is auto-generated from the MCF JSON schema; the schema can be annotated to provide a dedicated EUSO user experience (for example, preselecting relevant codelists).
Users can also submit metadata using a CSV (Excel) format, which is converted to MCF in a CI/CD workflow.
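The CSV conversion step might look roughly like the sketch below. It assumes a CSV with the hypothetical columns `identifier`, `title`, `abstract` and `keywords` (semicolon-separated) and produces nested dictionaries whose layout loosely follows the MCF sections `mcf`, `metadata` and `identification`. The output is not validated against the MCF schema, and serialising it to YAML is left out because that needs a third-party library.

```python
import csv
import io


def csv_to_mcf_like(csv_text):
    """Convert CSV rows into nested dicts that loosely mirror MCF sections."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        raw_keywords = row.get("keywords") or ""
        keywords = [k.strip() for k in raw_keywords.split(";") if k.strip()]
        records.append({
            "mcf": {"version": 1.0},
            "metadata": {"identifier": row["identifier"].strip()},
            "identification": {
                "title": row["title"].strip(),
                "abstract": (row.get("abstract") or "").strip(),
                "keywords": {"default": {"keywords": keywords}},
            },
        })
    return records
```

In a CI/CD workflow, each resulting dictionary would typically be written to its own file in the Git repository so that changes can be reviewed per record.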
At intervals the SWR ingests metadata which has been uploaded via the authoring workflow.
Transformation and Harmonisation
These components make sure that data is interoperable, i.e. provided in agreed-upon formats, structures and semantics. They are used to ingest data and transform it into common standard data, e.g. in the central SWR format for soil.
The specific requirements these components have to fulfil are:
- The services shall be able to work with data that is described explicitly or implicitly with a schema. The services shall be able to load schemas expressed as XML Schemas, GML Application Schemas, RDF-S and JSON Schema.
- The services shall support GML, GeoPackage, GeoJSON, CSV, RDF and XLS formats for data sources.
- The services shall be able to connect with external download services such as WFS or OGC API - Features.
- The services shall be able to write out data in GML, GeoPackage, GeoJSON, CSV, RDF and XLS formats.
- There shall be an option to read and write data from relational databases.
- The services should be exposed as OGC API Processes
- Transformation processes shall include the following capabilities:
- Rename types & attributes.
- Convert between units of measurement.
- Restructure data, e.g. through joining, merging, splitting.
- Map codelists and other coded values.
- Harmonise observations as if they were measured using a common procedure using Pedotransfer Functions.
- Reproject data.
- Change data from one format to another.
- There should be an interactive editor to create the specific transformation processes required for the SWR.
- It should be possible to share transformation processes.
- Transformation processes should be fully documented or self-documented.
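Three of the transformation capabilities listed above (renaming attributes, converting units of measurement and mapping coded values) can be sketched in a few lines. The attribute names, the conversion factor and the codelist values below are hypothetical examples; the sketch is not how hale studio implements these steps.

```python
# Hypothetical mapping rules for one flat source record.
RENAMES = {"ph_h2o": "pH", "oc_gkg": "organicCarbon"}
UNIT_FACTORS = {"organicCarbon": 0.1}  # g/kg -> percent by mass
CODELIST_MAP = {"landuse": {"A": "arable", "G": "grassland"}}


def transform(record):
    """Apply rename, unit conversion and codelist mapping to a flat record."""
    out = {}
    for name, value in record.items():
        target = RENAMES.get(name, name)
        if target in UNIT_FACTORS and value is not None:
            value = value * UNIT_FACTORS[target]
        if target in CODELIST_MAP:
            # Unknown codes are passed through unchanged rather than dropped.
            value = CODELIST_MAP[target].get(value, value)
        out[target] = value
    return out
```

Keeping the rules in plain data structures, separate from the function that applies them, is one way to make a transformation process shareable and self-documenting.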
Technology & Integration
We have deployed the following components to the SWR infrastructure:
- hale studio, a proven ETL tool optimised for working with complex structured data, such as XML, relational databases, or a wide range of tabular formats. It supports all required procedures for semantic and structural transformation. It can also handle reprojection. While hale studio exists as a multi-platform interactive application, its capabilities can be provided through a web service with an OpenAPI.
- A comprehensive tutorial video on soil data harmonisation with hale studio can be found here.
Another part of the deployed system, GDAL, a very robust conversion library used in most FOSS and commercial GIS software, can be used for a wealth of format conversions and can handle reprojection. In cases where no structural or semantic transformation is needed, a GDAL-based conversion service would make sense.
Setting up a transformation process in hale»connect
Complete the following steps to set up soil data transformation, validation and publication processes:
- Log into hale»connect.
- Create a new transformation project (or upload it).
- Specify source and target schemas.
- Create a theme (this is a process that describes what should happen with the data).
- Add a new transformation configuration. Note: Metadata generation can be configured in this step.
- A validation process can be set up to check against conformance classes.
Executing a transformation process
- Create a new dataset and select the theme of the current source data, and provide the source data file.
- Execute the transformation process. ETF validation processes are also performed. If successful, a target dataset and the validation reports will be created.
- View and download services will be created if required.
To create metadata (data set and service metadata), activate the corresponding button(s) when setting up the theme for the transformation process.
Metadata Augmentation
Functionality
In this component scripting / NLP / LLM are used on a metadata record to augment metadata statements about the resource. Augmentations are stored on a dedicated augmentation table, indicating the process which produced it.
metadata-uri | metadata-element | source | value | process | date |
---|---|---|---|---|---|
https://geo.fi/data/ee44-aa22-33 | spatial-scope | 16.7,62.2,18,81.5 | https://inspire.ec.europa.eu/metadata-codelist/SpatialScope/national | spatial-scope-analyser | 2024-07-04 |
https://geo.fi/data/abc1-ba27-67 | soil-threat | This dataset is used to evaluate Soil Compaction in Nuohous Sundström | http://aims.fao.org/aos/agrovoc/c_7163 | keyword-analyser | 2024-06-28 |
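A dedicated augmentation table with the columns shown above could be set up as in the following sketch. It uses an in-memory SQLite database from the Python standard library as a stand-in for the actual relational database; the table name, column names and helper functions are assumptions for illustration.

```python
import sqlite3


def create_store():
    """Create an in-memory database with a table mirroring the columns above."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE augmentations (
               metadata_uri TEXT NOT NULL,
               metadata_element TEXT NOT NULL,
               source TEXT,
               value TEXT NOT NULL,
               process TEXT NOT NULL,
               date TEXT NOT NULL
           )"""
    )
    return conn


def add_augmentation(conn, uri, element, source, value, process, date):
    """Store one augmentation together with the process that produced it."""
    conn.execute(
        "INSERT INTO augmentations VALUES (?, ?, ?, ?, ?, ?)",
        (uri, element, source, value, process, date),
    )


def augmentations_for(conn, uri):
    """Return (element, value, process) tuples for one metadata record."""
    rows = conn.execute(
        "SELECT metadata_element, value, process FROM augmentations "
        "WHERE metadata_uri = ? ORDER BY date",
        (uri,),
    )
    return rows.fetchall()
```

Because augmentations live in their own table, the harvested record stays untouched and an augmentation can be traced back to, or removed by, the process that created it.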
For the first SoilWise prototype, the functionality of the Metadata Augmentation component comprises:
Automatic metadata generation
To generate metadata (data set and service metadata), activate the corresponding button(s) when setting up the theme for the transformation process. The steps are described here
Translation module
Many records arrive in a local language. The SWR translates the main properties of the record (title and abstract) into English to offer a single-language user experience. The translations are used in the filtering and display of records.
The translation module builds on the EU translation service (API documentation at https://language-tools.ec.europa.eu/). Translations are stored in a database for reuse by the SWR. The EU translation service returns asynchronous responses to translation requests, which means that translations may not yet be available after the initial load of new data. A callback operation populates the database; from that moment, a translation is available to the SWR. The translation service uses 2-letter language codes, so a conversion from the 3-letter ISO code (as used in, for example, iso19139:2007) to the 2-letter code is required. The EU translation service supports only a limited set of language pairs and returns an error for any other combination.
Initial translation is triggered by a running harvester. The translations will then be available once the record is ingested into the triplestore and catalogue database in a follow-up step of the harvester.
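The conversion from 3-letter to 2-letter language codes mentioned above can be sketched as a simple lookup. The mapping below is a small illustrative subset of ISO 639-2 codes (including bibliographic variants such as `fre` and `ger`); the function names are assumptions, and a real deployment would need the full code list.

```python
# Illustrative subset of ISO 639-2 (3-letter) to ISO 639-1 (2-letter) codes.
ISO3_TO_ISO2 = {
    "eng": "en", "fra": "fr", "fre": "fr", "deu": "de", "ger": "de",
    "nld": "nl", "dut": "nl", "spa": "es", "ita": "it", "fin": "fi",
}


def to_two_letter(code):
    """Return the 2-letter code, or None if the code is not in the mapping."""
    code = code.strip().lower()
    if len(code) == 2:
        return code
    return ISO3_TO_ISO2.get(code)


def needs_translation(code, target="en"):
    """A record needs translation if its language is known and is not the target."""
    two = to_two_letter(code)
    return two is not None and two != target
```

Returning `None` for unknown codes lets the caller skip the translation request instead of sending a language code that the service would reject.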
Foreseen functionality
In the next iterations, the Metadata Augmentation component is foreseen to include the following additional functions:
Keyword matcher
Keywords are an important mechanism to filter and cluster records. However, similar keywords can only be matched if they are identical. This module evaluates the keywords of existing records and makes them equal in case of high similarity.
The module analyses the existing keywords on a metadata record. Two cases can be identified:
- If a keyword with a SKOS identifier has a closeMatch or sameAs relation to a preferred keyword, the preferred keyword is used.
- If an existing keyword without a SKOS identifier matches a preferred keyword by (translated) string or synonym, the matched keyword (including its SKOS identifier) is appended. Consider the risk of false positives.
To facilitate this use case, the SWR contains a knowledge graph of preferred keywords in the soil domain with relations to alternative keywords from sources such as AGROVOC, GEMET, DBpedia and ISO. This knowledge graph is maintained at https://github.com/soilwise-he/soil-health-knowledge-graph. AGROVOC is multilingual, facilitating the translation case.
For metadata records which have not been analysed yet (in that iteration), the module extracts the records and analyses each keyword to check whether it matches any of the preferred keywords; if so, the preferred keyword is added to the record.
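The second case, string or synonym matching for keywords without an identifier, could be sketched as follows. The `PREFERRED` dictionary stands in for the knowledge graph of preferred keywords; the synonyms are invented for the example, and only exact matches after normalisation are accepted in order to limit false positives.

```python
# Stand-in for the preferred-keyword knowledge graph; the synonyms are illustrative.
PREFERRED = {
    "soil compaction": {
        "uri": "http://aims.fao.org/aos/agrovoc/c_7163",
        "synonyms": {"compaction of soil", "bodenverdichtung"},
    },
}


def normalise(text):
    """Lower-case the text and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def match_keywords(keywords):
    """Return (preferred label, uri) pairs to append for the given free-text keywords."""
    matches = []
    for keyword in keywords:
        norm = normalise(keyword)
        for label, entry in PREFERRED.items():
            if norm == label or norm in entry["synonyms"]:
                pair = (label, entry["uri"])
                if pair not in matches:
                    matches.append(pair)
    return matches
```

Fuzzy or substring matching would catch more variants but raises the false positive risk noted above, which is why this sketch stays with exact comparison.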
Spatial Locator
This module analyses existing keywords to find a relevant geography for the record. It then uses the GeoNames API to find spatial coordinates for the geography, which are inserted into the metadata record.
Spatial scope analyser
A script that analyses the spatial scope of a resource. The bounding box is matched to country bounding boxes to understand whether the dataset has a global, continental, national or regional scope.
- Retrieves all datasets (as iso19139 XML) from the database (records table joined with augmentations) which:
  - have a bounding box
  - have no spatial scope
  - are in iso19139 format
- For each record, it compares the bounding box to country bounding boxes:
  - if bigger than continents > global
  - if it matches a continent > continental
  - if it matches a country > national
  - if smaller > regional
- The result is written as an augmentation to a dedicated table
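The comparison step can be sketched as below. The reference bounding boxes are rough approximations chosen for illustration, and "matches" is interpreted here as an intersection-over-union of at least 0.8; both the boxes and the threshold are assumptions, not values taken from the SWR implementation.

```python
# Bounding boxes as (min_lon, min_lat, max_lon, max_lat); rough illustrative values.
CONTINENTS = {"Europe": (-25.0, 34.0, 45.0, 72.0)}
COUNTRIES = {
    "Finland": (20.5, 59.7, 31.6, 70.1),
    "Netherlands": (3.3, 50.7, 7.3, 53.6),
}


def area(box):
    """Area in square degrees; degenerate or inverted boxes count as zero."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def similar(a, b, threshold=0.8):
    """Intersection-over-union test between two bounding boxes."""
    inter = area((max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3])))
    union = area(a) + area(b) - inter
    return union > 0 and inter / union >= threshold


def spatial_scope(bbox):
    """Classify a bounding box as global, continental, national or regional."""
    if area(bbox) > max(area(c) for c in CONTINENTS.values()):
        return "global"
    if any(similar(bbox, c) for c in CONTINENTS.values()):
        return "continental"
    if any(similar(bbox, c) for c in COUNTRIES.values()):
        return "national"
    return "regional"
```

In this sketch, "regional" is the fallback for every box that matches neither a continent nor a country, so a multi-country extent would also end up there; a production rule set would probably need an extra class or a containment test.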
EUSO-high-value dataset tagging
The EUSO high-value datasets are those with substantial potential to assess soil health status, as detailed on the EUSO dashboard. This framework includes the concept of soil degradation indicator metadata-based identification and tagging. Each dataset (possibly only those with the supra-national spatial scope - under discussion) will be annotated with a potential soil degradation indicator for which it might be utilised. Users can then filter these datasets according to their specific needs.
The EUSO soil degradation indicators employ specific methodologies and thresholds to determine soil health status, see also the Table below. These methodologies will also be considered, as they may have an impact on the defined thresholds. This issue will be examined in greater detail in the future.
Soil Degradation | Soil Indicator | Type of methodology for threshold |
---|---|---|
Soil erosion | Water erosion | RUSLE2015 |
Soil erosion | Wind erosion | GIS-RWEQ |
Soil erosion | Tillage erosion | SEDEM |
Soil erosion | Harvest erosion | Textural index |
Soil erosion | Post-fire recovery | USLE (Type of RUSLE) |
Soil pollution | Arsenic excess | GAMLSS-RF |
Soil pollution | Copper excess | GLM and GPR |
Soil pollution | Mercury excess | LUCAS topsoil database |
Soil pollution | Zinc Excess | LUCAS topsoil database |
Soil pollution | Cadmium Excess | GEMAS |
Soil nutrients | Nitrogen surplus | NNB |
Soil nutrients | Phosphorus deficiency | LUCAS topsoil database |
Soil nutrients | Phosphorus excess | LUCAS topsoil database |
Loss of soil organic carbon | Distance to maximum SOC level | qGAM |
Loss of soil biodiversity | Potential threat to biological functions | Expert Polling, Questionnaire, Data Collection, Normalization and Analysis |
Soil compaction | Packing density | Calculation of Packing Density (PD) |
Salinization | Secondary salinization | - |
Loss of organic soils | Peatland degradation | - |
Soil consumption | Soil sealing | Raster remote sense data |
Technically, we foresee the metadata tagging process as illustrated below. First, the metadata record's title, abstract and keywords will be checked for the occurrence of specific values from the Soil Indicator and Soil Degradation Codelists, such as Water erosion or Soil erosion (see the Table above). If found, the Soil Degradation Indicator Tag (the corresponding value from the Soil Degradation Codelist) will be displayed to indicate the suitability of the given dataset for soil indicator related analyses. Additionally, a search for the corresponding methodology will be conducted to see if the dataset is compliant with the EUSO Soil Health indicators presented in the EUSO Dashboard. If found, the tag EUSO High-value dataset will be added. In a later phase, we plan to search for references to scientific methodology papers in the metadata record's links. The possibility of a more complex search using soil thesauri will also be explored.
flowchart TD
subgraph ic[Indicators Search]
ti([Title Check]) ~~~ ai([Abstract Check])
ai ~~~ ki([Keywords Check])
end
subgraph Codelists
sd ~~~ si
end
subgraph M[Methodologies Search]
tiM([Title Check]) ~~~ aiM([Abstract Check])
kl([Links check]) ~~~ kM([Keywords Check])
end
m[(Metadata Record)] --> ic
m --> M
ic-- + ---M
sd[(Soil Degradation Codelist)] --> ic
si[(Soil Indicator Codelist)] --> ic
em[(EUSO Soil Methodologies list)] --> M
M --> et{{EUSO High-Value Dataset Tag}}
et --> m
ic --> es{{Soil Degradation Indicator Tag}}
es --> m
th[(Thesauri)]-- synonyms ---Codelists
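The tagging flow described in this section can be approximated with plain substring checks. The following sketch uses a few rows of the indicator table as hard-coded codelists; the dictionary names, the output keys and the rule that a high-value tag needs both an indicator and its methodology in the text are assumptions, since the process is still under discussion.

```python
# A few rows from the indicator table, used as stand-in codelists.
INDICATOR_TO_DEGRADATION = {
    "water erosion": "Soil erosion",
    "wind erosion": "Soil erosion",
    "packing density": "Soil compaction",
    "soil sealing": "Soil consumption",
}
DEGRADATIONS = ["Soil erosion", "Soil pollution", "Soil compaction", "Soil consumption"]
METHODOLOGIES = {"water erosion": "RUSLE2015", "wind erosion": "GIS-RWEQ"}


def tag_record(record):
    """Derive degradation tags and a high-value flag from title, abstract and keywords."""
    text = " ".join(
        [record.get("title", ""), record.get("abstract", "")] + record.get("keywords", [])
    ).lower()
    tags = set()
    high_value = False
    for indicator, degradation in INDICATOR_TO_DEGRADATION.items():
        if indicator in text:
            tags.add(degradation)
            method = METHODOLOGIES.get(indicator)
            if method and method.lower() in text:
                high_value = True
    for degradation in DEGRADATIONS:
        if degradation.lower() in text:
            tags.add(degradation)
    return {"soil_degradation_indicator_tags": sorted(tags), "euso_high_value": high_value}
```

Substring matching is deliberately naive; a synonym lookup via thesauri, as shown in the flowchart, would replace the direct `in text` checks.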
Knowledge Graph
Info
Current version: 0.1.0
Project: Soil Health Knowledge graph
Access point: SWR SPARQL endpoint: https://sparql.soilwise-he.containers.wur.nl/sparql
SoilWise develops and implements a Knowledge Graph linking the knowledge captured in harvested and augmented metadata with various internal and external knowledge sources, particularly taxonomies, vocabularies and ontologies that are also implemented as RDF graphs. Linking such graphs into a harmonized SWR Knowledge Graph allows reasoning over the relations in the stored graph, and thus allows connecting and smartly combining knowledge from those domains.
The first iteration of the SWR Knowledge Graph is a graph representation of the (harmonized) metadata that is currently harvested, validated and augmented as part of the SWR catalogue database. Its RDF representation, stored in a triple store, and the SPARQL endpoint deployed on top of the triple store, allow users alternate access to the metadata, exploiting semantics and relations between different assets.
At the same time, experiments have been performed to prepare for the linkage of this RDF metadata graph with existing and AI/ML generated graphs. In future iterations, the metadata graph will be linked/merged with a dedicated soil health knowledge graph also linking to external resources, establishing a broader interconnected soil health knowledge graph. Consequently, it will evolve into a knowledge network that allows much more powerful and impactful queries and reasoning, e.g. supporting decision support and natural language querying.
Functionality
Knowledge Graph querying (SPARQL endpoint)
The SPARQL endpoint, deployed on top of the SWR triple store, allows end users to query the SWR knowledge graph using the SPARQL query language. It is the primary access point to the knowledge graph, for humans as well as for machines. Many applications and end users will instead interact with specialised assets that use the SPARQL endpoint, such as the Chatbot or the API. However, the SPARQL endpoint is the main source for the development of further knowledge applications and provides bespoke search to humans.
Since we're importing resources from various data and knowledge repositories, we expect many duplicates, blank nodes and conflicting statements. Implementation of rules should be permissive: not preventing inclusion, only flagging potential inconsistencies.
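Machine access to the endpoint follows the standard SPARQL protocol, so a query can be sent as an HTTP GET request with a `query` parameter. The sketch below uses only the Python standard library and the endpoint URL listed above. The query is deliberately generic because the graph's vocabulary is not described here, and the helper names are assumptions.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://sparql.soilwise-he.containers.wur.nl/sparql"

# Generic query: fetch any five triples, without assuming a vocabulary.
QUERY = """
SELECT ?s ?p ?o
WHERE { ?s ?p ?o }
LIMIT 5
"""


def build_request(query, endpoint=ENDPOINT):
    """Build a GET request that asks for SPARQL JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def run_query(query, endpoint=ENDPOINT, timeout=30):
    """Send the query and return the list of result bindings (needs network access)."""
    with urllib.request.urlopen(build_request(query, endpoint), timeout=timeout) as response:
        return json.load(response)["results"]["bindings"]
```

Calling `run_query(QUERY)` would return a list of dictionaries keyed by the variable names `s`, `p` and `o`, provided the endpoint is reachable.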
Ongoing Developments
Knowledge Graph enrichment and linking
Info
Access point: https://voc.soilwise-he.containers.wur.nl/concept/
As a preparation to extend the currently deployed metadata knowledge graph (KG) with broader domain knowledge, experimental work has been performed to enrich the KG to link it with other knowledge graphs.
The following aspects have been worked on and will be further developed and integrated into future iterations of the SoilWise KG:
- Applying various methods using AI/ML to derive a (soil health) knowledge graph from unstructured content. This is piloted by using (parts of) the EEA report "Soil monitoring in Europe - Indicators and thresholds for soil quality assessments". It tests the effectiveness of various methods to generate knowledge in the form of KGs from documents, which could also benefit other AI/ML functions foreseen.
- Establishing links between the SoilWise KG and external taxonomies and ontologies (linked data). Concepts in the SoilWise KG that (closely) match with concepts in the AGROVOC thesaurus are linked. The implemented method is exemplary for the foreseen wider linking required to establish a soil health KG.
- Testing AI/ML based methods to derive additional knowledge (e.g. keywords, geography) for data and knowledge assets. Such methods could for instance be used to further augment metadata or fill existing metadata gaps. Besides testing such methods, this includes establishing a model that allows distinguishing between genuine and generated metadata.
Technology & Integration
Components used:
- Virtuoso (version 07.20.3239)
- Python notebooks
Ontologies/Vocabularies/Schemas:
Natural Language Querying
Making open knowledge findable and accessible for SoilWise users
Functionality
The application of Natural Language Querying (NLQ) for SoilWise and its integration into the SoilWise repository are currently still in the research phase. No implementations are yet an integrated part of the SWR delivery, in line with the plan for the first development iteration.
Ongoing Developments
A strategy for development and implementation of NLQ to support SoilWise users is currently being developed. It considers various ways to make knowledge available through NLQ, possibly including options to migrate to different "levels" of complexity and innovation.
Such a "leveled approach" could start from leveraging existing/proven search technology (e.g. the Apache Solr open source search engine), and gradually combining this with new developments in NLP (such as transformer based language models) to make harvested knowledge metadata and harmonized knowledge graphs accessible to SoilWise users.
Typical general steps towards an AI-powered self-learning search system are listed below, from less to more complex. Note that to fully benefit from the later steps it will be necessary to process the knowledge resources (documents) themselves ("look inside the documents") instead of only working with the metadata about them.
- basic keyword based search (tf-idf [4], bm25 [5])
- use of taxonomies and entity extraction
- understanding query intent (semantic query parsing, semantic knowledge graphs, virtual assistants)
- automated relevance tuning (signals boosting, collaborative filtering, learning to rank)
- Self-learning search system (full feedback loop using all user and content data)
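To make the first step (basic keyword based search) concrete, the following sketch scores documents against a query with Okapi BM25. It is a minimal illustration rather than a proposed implementation: tokenisation is a plain whitespace split, the parameter defaults `k1=1.5` and `b=0.75` are common choices, and the inverse document frequency uses the non-negative variant with `1 +` inside the logarithm. A search engine such as Apache Solr provides this ranking out of the box.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25.

    Assumes `documents` is a non-empty list of non-empty strings.
    """
    tokenised = [doc.lower().split() for doc in documents]
    n_docs = len(tokenised)
    avg_len = sum(len(doc) for doc in tokenised) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in tokenised for term in set(doc))
    scores = []
    for doc in tokenised:
        term_freq = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates via k1 and is normalised by document length via b.
            norm = term_freq[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "soil organic carbon in cropland",
    "urban air quality report",
    "soil erosion and soil health indicators",
]
scores = bm25_scores("soil health", docs)
```

In this example the third document ranks highest because it contains both query terms, while the second document, which contains neither, scores zero.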
Core topics are:
- LLM [1] based (semantic) KG generation from unstructured content (leveraging existing search technology)
- chatbot - Natural Language Interface (using advanced NLP [2] methodologies, such as LLMs)
- LLM operationalisation (RAG [3] ingestion pipeline(s), generation pipeline, embedding store, models)
The final aim is towards extractive question answering (extract answers from sources in real-time), result summarization (summarize search results for easy review), and abstractive question answering (generate answers to questions from search results). Not all these aims might be achievable within the project though. Later steps (marked in yellow in the following image) depend more on the use of complex language models.
One step towards personalisation could be the use of (user) signals boosting and collaborative filtering. But this would require tracking and logging (user) actions.
A separate development could be a chatbot based on selected key soil knowledge documents ingested into a vector database (as a fixed resource), or even a fine-tuned LLM that is more soil science specific than a plain foundation LLM.
Optionally the functionality can be extended from text processing to also include multi-modal data such as photos (e.g. of soil profiles). Effort needed for this has to be carefully considered.
Along the way, natural language processing (NLP) methods and approaches can also be (and already are) applied to various metadata handling and augmentation tasks.
Foreseen technology
- (Semantic) search engine, e.g. Apache Solr or Elasticsearch
- Graph database (if needed)
- (Scalable) vector database (if needed)
- Java and/or Python based NLP libraries, e.g. OpenNLP, spaCy
- Small to large foundation LLMs
- LLM development framework (such as LangChain or LlamaIndex)
- Frontend toolkit
- LLM deployment and/or hosted API access
- Authentication and authorisation layer
- Computation and storage infrastructure
- Hardware acceleration, e.g. GPU (if needed)
[1] Large Language Model. Typically a deep learning model based on the transformer architecture that has been trained on vast amounts of text data, usually from known collections scraped from the Internet.
[2] Natural Language Processing. An interdisciplinary subfield of computer science and artificial intelligence, primarily concerned with providing computers with the ability to process data encoded in natural language. It is closely related to information retrieval, knowledge representation and computational linguistics.
[3] Retrieval Augmented Generation. A framework for retrieving facts from an external knowledge base to ground large language models on the most accurate, up-to-date information, enhancing the (pre)trained parametric (semantic) knowledge with non-parametric knowledge to avoid hallucinations and get better responses.
[4] tf-idf. Term Frequency - Inverse Document Frequency, a statistical method in NLP and information retrieval that measures how important a term is within a document relative to a collection of documents (called a corpus).
[5] bm25. Okapi Best Match 25, a well-known ranking function used by search engines to estimate the relevance of documents to a given search query. It is based on tf-idf, but is considered an improvement and adds some tunable parameters.
User Management and Access Control
User and organisation management, authorisation and authentication are complex, cross-cutting aspects of a system such as the SoilWise repository. Back-end and front-end components need to perform access control for authenticated users. Many organisations already have infrastructures in place, such as an Active Directory or a Single Sign On based on OAuth.
No implementations are yet an integrated part of the SWR delivery, in line with the plan for the first development iteration.
The general model we apply is that:
- a user shall be a member of at least one organisation.
- a user may have one or more roles in every organisation that they are a member of.
- a user always acts in the context of one of their roles in one organisation (similar to GitHub contexts).
- organisations can be hierarchical, and user roles may be inherited from an organisation that is higher up in the hierarchy.
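The model above can be sketched as a small data structure. This is an illustration under assumptions, not the hale connect data model: the class names, the `grant` and `effective_roles` helpers and the example role names are invented for this sketch. Membership is represented implicitly by holding at least one role in an organisation, and inheritance walks up the organisation hierarchy.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Organisation:
    name: str
    parent: Optional["Organisation"] = None  # None for a top-level organisation


@dataclass
class User:
    name: str
    # Maps organisation name -> roles granted directly in that organisation.
    roles: dict = field(default_factory=dict)

    def grant(self, organisation, role):
        """Grant a role to this user in the given organisation."""
        self.roles.setdefault(organisation.name, set()).add(role)

    def effective_roles(self, organisation):
        """Roles in `organisation`, including those inherited from its ancestors."""
        found = set()
        current = organisation
        while current is not None:
            found |= self.roles.get(current.name, set())
            current = current.parent
        return found


consortium = Organisation("Consortium")
partner = Organisation("Partner", parent=consortium)
alice = User("alice")
alice.grant(consortium, "standardUser")
alice.grant(partner, "dataManager")
```

Here `alice.effective_roles(partner)` contains both roles, because the role granted in the parent organisation is inherited, while `alice.effective_roles(consortium)` contains only `standardUser`.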
The basic requirements for the SWR authentication mechanisms are:
- User authentication, and thus, provision of authentication tokens, shall be distributed ("Identity Brokering") and may happen through existing services. Authentication mechanisms that are to be supported include OAuth, SAML 2.0 and Active Directory.
- An authoritative Identity Provider, such as an eIDAS-based one, should be integrated in a later iteration as well.
- There shall be a central service that performs role and organisation mapping for authenticated users. This service also provides the ability to configure roles and set up organisations and users. This central service can also provide simple, direct user authentication (username/password-based) for those users who do not have their own authentication infrastructure.
- There may be different levels of trust establishment based on the specific authentication service used. Higher levels of trust may be required to access critical data or infrastructure.
- SWR services shall use Keycloak or JSON Web Tokens for authorization.
- To access SWR APIs, the same rules apply as to access the SWR through the UI.
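To show what token-based authorisation involves on the service side, the sketch below verifies a JSON Web Token signed with HS256 using only the Python standard library. It is illustrative only: the shared secret, the claim names `sub` and `role`, and the use of a symmetric algorithm are assumptions. A Keycloak deployment commonly issues asymmetrically signed tokens (e.g. RS256) that are verified against the realm's public keys, and a production service would rely on a maintained JWT library rather than hand-written checks.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url_decode(segment):
    """Decode a base64url segment, restoring the padding that JWTs strip."""
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def _b64url_encode(raw):
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def sign_token(claims, secret):
    """Create an HS256-signed token (used here only to produce example input)."""
    header = _b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url_encode(json.dumps(claims).encode())
    signature = hmac.new(secret, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    return f"{header}.{payload}.{_b64url_encode(signature)}"


def verify_token(token, secret, now=None):
    """Return the claims if signature and expiry are valid, else raise ValueError."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        claims = json.loads(_b64url_decode(payload_b64))
        signature = _b64url_decode(signature_b64)
    except Exception as exc:
        raise ValueError("malformed token") from exc
    # Pin the algorithm so a token cannot choose a weaker one.
    if header.get("alg") != "HS256":
        raise ValueError("unsupported algorithm")
    expected = hmac.new(secret, f"{header_b64}.{payload_b64}".encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        raise ValueError("invalid signature")
    if "exp" in claims and claims["exp"] < (time.time() if now is None else now):
        raise ValueError("token expired")
    return claims
```

Because the same check runs for UI and API requests, it fits the requirement that API access follows the same rules as access through the UI.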
In later iterations, the authentication and authorisation mechanisms should also be used to facilitate connector-based access to data space resources.
Sign-up
For every registered user of SWR components, an account is needed. This account can be created in one of three ways:
- Automatically, by providing an authentication token that was created by a trusted authentication service and that contains the necessary information on the organisation of the user and the intended role (this can e.g. be implemented through using a DAPS)
- Manually, through self-registration (may only be available for users from certain domains and/or for certain roles)
- Through superuser registration; in this case the user gets issued an activation link and has to set the password to complete registration
Authentication
Certain functionalities of the SWR will be available to anonymous users, but functions that modify the state of the system (data, configuration, metadata) require an authenticated user. The simplest form of authentication is to use the login provided by the SWR itself, which is username-password based. A second factor, e.g. through an authenticator app, may be added in the upcoming iteration.
Other forms of authentication include using an existing token.
Authorisation
Every component has to check whether an authenticated user may invoke a desired action based on that user's roles in their organisations. To ensure that the User Interface does not offer actions that a given user may not invoke, the user interface shall also perform authorisation.
Roles are generally defined using Privileges: A certain role may, for example, read certain resources, they may edit or even delete them. Here is an example of such a definition:
A standard user may only read and edit their own User profile, and read the information from their organisation. Once a user has been given the role dataManager, they may perform any CRUD operation on any Data that is in the scope of their organisation. They are also granted read access to publication Theme configurations on their own and in any parent organisations.
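The example definition can be expressed as a privilege table plus a check function. This is a hedged sketch: the table layout, the role name `standardUser`, the scope names and the `is_allowed` helper are assumptions made for illustration, and the actual privilege model of the user service may differ.

```python
# Hypothetical privilege table: role -> resource type -> (actions, scope).
ROLE_PRIVILEGES = {
    "standardUser": {
        "User": ({"read", "edit"}, "own"),
        "Organisation": ({"read"}, "organisation"),
    },
    "dataManager": {
        "Data": ({"create", "read", "update", "delete"}, "organisation"),
        "Theme": ({"read"}, "organisation-and-parents"),
    },
}

# Which resource relations each scope covers.
SCOPE_COVERS = {
    "own": {"own"},
    "organisation": {"own", "organisation"},
    "organisation-and-parents": {"own", "organisation", "parent-organisation"},
}


def is_allowed(roles, action, resource_type, relation):
    """Check whether any of the user's roles permits `action` on a resource.

    `relation` states how the resource relates to the acting user:
    "own", "organisation" or "parent-organisation".
    """
    for role in roles:
        privilege = ROLE_PRIVILEGES.get(role, {}).get(resource_type)
        if privilege is None:
            continue
        actions, scope = privilege
        if action in actions and relation in SCOPE_COVERS[scope]:
            return True
    return False
```

For example, `is_allowed({"dataManager"}, "delete", "Data", "organisation")` returns `True`, whereas `is_allowed({"standardUser"}, "edit", "User", "organisation")` returns `False`, because a standard user may only edit their own profile. The same function could be called by back-end components and by the user interface, so that both apply identical rules.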
Further implementation hints and Technologies
The public cloud hale connect user service can be used for central user management.
Completed work - Iteration 1
- User/Role and Organisation management has been deployed and configured as part of weTransform's hale connect installation.
- As of now, there are three Identity providers deployed as part of that infrastructure:
  - the integrated user service in hale connect,
  - a Keycloak/OpenID-connect based one using GoPass via GitHub,
  - a Data Spaces connector.
Planned work - Iteration 2
- Integrate eIDAS or a different authoritative Identity Provider
- Update other components to accept the tokens generated by this infrastructure
APIs
Introduction
Within the first development iteration, the following APIs are employed in the SoilWise repository:
- Discovery APIs
  - SPARQL: https://sparql.soilwise-he.containers.wur.nl/sparql/
  - OGC API - Records: https://soilwise-he.containers.wur.nl/cat/openapi
  - SpatioTemporal Asset Catalog (STAC): https://soilwise-he.containers.wur.nl/cat/stac/openapi
  - Catalogue Service for the Web (CSW): https://soilwise-he.containers.wur.nl/cat/openapi
  - Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH): https://soilwise-he.containers.wur.nl/cat/oaipmh
  - OpenSearch: https://soilwise-he.containers.wur.nl/cat/opensearch
- Processing APIs
  - Translate API: https://api.soilwise-he.containers.wur.nl/tolk/docs
  - Link Liveness Assessment API: https://api.soilwise-he.containers.wur.nl/linky/docs
  - RDF to triplestore API: https://repo.soilwise-he.containers.wur.nl/swagger-ui/index.html
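As a usage illustration for the SPARQL discovery API, the sketch below sends a query following the standard SPARQL Protocol (HTTP GET with a `query` parameter) and flattens the standard SPARQL JSON results format. The endpoint URL is the one listed above; the helper names, the timeout and the assumption that the endpoint returns JSON for the `application/sparql-results+json` media type are choices made for this example.

```python
import json
import urllib.parse
import urllib.request

SPARQL_ENDPOINT = "https://sparql.soilwise-he.containers.wur.nl/sparql/"


def build_request(query, endpoint=SPARQL_ENDPOINT):
    """Build a SPARQL Protocol GET request asking for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(url, headers={"Accept": "application/sparql-results+json"})


def bindings_to_rows(results):
    """Flatten a SPARQL JSON result document into a list of plain dicts."""
    return [
        {name: cell["value"] for name, cell in binding.items()}
        for binding in results["results"]["bindings"]
    ]


def run_query(query, endpoint=SPARQL_ENDPOINT, timeout=30):
    """Execute the query against the endpoint (requires network access)."""
    with urllib.request.urlopen(build_request(query, endpoint), timeout=timeout) as response:
        return bindings_to_rows(json.load(response))
```

Calling `run_query("SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5")` would then return up to five rows, each a dictionary keyed by variable name.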
Future work
SoilWise will in the future use more APIs to interact between components as well as enable remote users to interact with SoilWise components. Standardised APIs will be used if possible, such as:
- OpenAPI
- GraphQL
- OGC webservices (preferably OGC API generation based on OpenAPI)
- SPARQL for potential future knowledge graphs
Infrastructure
Introduction
This section describes the general hardware infrastructure and deployment pipelines used for the SWR. As of the delivery of this initial version of the technical documentation, the pipeline and hardware environment are prototypes, which shall be continuously improved as required to fit the needs of the project.
For development in the first project iteration cycle, we defined the following criteria:
- There is no production environment.
- There is a distributed staging environment, with each partner deploying their solutions to their specific hardware.
- All of the hardware nodes used in the staging environment include an offsite backup capacity, such as a storage box, that is operated in a different physical location.
- There is no central dev/test environment. Each organisation is responsible for its own dev/test environments.
- The deployment and orchestration configuration for this iteration should be stored as YAML in a GitHub repository.
- Deployments to the distributed staging environment are done preferably through GitHub Actions or through alternative pipelines, such as a Jenkins or GitLab instance provided by weTransform or other partners.
- For each component, there shall be separate build processes managed by the responsible partners that result in the built images being made accessible through a hub (e.g. Docker Hub).
Work completed - Iteration 1
The SoilWise infrastructure uses components provided by GitHub. GitHub components are used to:
- Administer SoilWise users and assign them to roles.
- Register, prioritise and assign tasks.
- Store source code of software artifacts.
- Author documentation.
- Run CI/CD pipelines.
- Collect user feedback.
During the iteration, the following components have been deployed:
On infrastructure provided by Wageningen University:
- A PostgreSQL database on the PostgreSQL cluster.
- A number of repositories at the university GitLab instance, including CI/CD pipelines to run metadata harvesters.
- A range of services deployed on the university k8s cluster, with their configuration stored on GitLab. Container images are stored on the university Harbor repository.
- Usage logs monitored through the university instance of Splunk.
- Availability monitoring provided by Uptimerobot.com.
On weTransform cloud infrastructure:
- A k8s deployment of the hale connect stack has been installed and configured. This instance can provide user management and has been integrated with the GitHub repository https://github.com/soilwise-he/Soilwise-credentials. The stack provides Transformation, Metadata Generation and Validation capabilities.
Future work - Iteration 2
The main objective of iteration 2 is to reorganise the orchestration of the different components, so all components can be centrally accessed and monitored.
The integrations will, wherever feasible, build on APIs which are standardised by W3C or OGC, or on de facto standards such as OpenAPI or GraphQL.
The intent of the consortium is to set up a distributed architecture, with the staging and production environment in an overall kubernetes-based orchestration mode if it is deemed necessary and advantageous at that point in time.
Containerization
The SWR is being developed in a containerised docker environment. This means that each software component, whether it's a database, storage system, or some kind of service, is compiled into a container image. These images are made available in a hub or repository, so that they can be deployed automatically whenever needed, including to fresh hardware.
Git versioning system
All aspects of the SoilWise repository can be managed through the SoilWise GitHub repository. This allows all members of the Mission Soil and EUSO community to provide feedback or contribute to any of the aspects.
Documentation
Documentation is maintained in the Markdown format using MkDocs and deployed as HTML or PDF using GitHub Pages.
An interactive preview of architecture diagrams is also maintained and published using GitHub Pages.
Source code
Software libraries tailored or developed in the scope of SoilWise are maintained through the GitHub repository.
Container build scripts/deployments
SoilWise is based on an orchestrated set of container deployments. Both the definitions of the containers as well as the orchestration of those containers are maintained through Git.
Harvester definitions
The configuration of the endpoint to be harvested, the filters to apply and the harvesting interval is stored in a GitHub repository. If the process runs as a CI/CD pipeline, then the logs of each run are also available in Git.
Authored and harvested metadata
Metadata created in SWR, as well as metadata imported from external sources, are stored in GitHub, so a full history of each record is available, and users can suggest changes to existing metadata.
Validation rules
Rules (ATS/ETS) applied to metadata (and data) validation are stored in a Git repository.
ETL configuration
Alignments to be applied to the source data to be standardised and/or harmonised are stored in a Git repository, so users can try the alignment locally or contribute to its development.
Backlog / discussions
Roadmap discussion, backlog and issue management are part of the GitHub repository. Users can flag issues on existing components, documentation or data, which can then be followed up by the participants.
Release management
Releases of the components and infrastructure are managed from a GitHub repository, so users understand the status of a version and can download the packages. The release process is managed in an automated way through CI/CD pipelines.
Glossary
- Abstracting and indexing service
- Abstracting and indexing service is a service, e.g. a search engine, that abstracts and indexes digital objects or metadata records, and provides matching and ranking functionality in support of information retrieval.
- Acceptance Criteria
- Acceptance Criteria can be used to judge if the resulting software satisfies the user's needs. A single user story/requirement can have multiple acceptance criteria.
- API
- Application programming interface (API) is a way for two or more computer programs to communicate with each other (source: wikipedia)
- Application profile
- Application profile is a specification for data exchange for applications that fulfil a certain use case. In addition to shared semantics, it also allows for the imposition of additional restrictions, such as the definition of cardinalities or the use of certain code lists (source: purl.eu).
- Artificial Intelligence
- Artificial Intelligence (AI) is a field of study that develops and studies intelligent machines. It includes the fields of rule-based reasoning, machine learning and natural language processing (NLP). (source: wikipedia)
- Assimilation
- Assimilation is a term indicating the processes involved in combining multiple datasets of different origin into a common dataset. The term is used somewhat similarly in psychology, as "incorporation of new concepts into existing schemes" (source: wikipedia). It is, however, not well aligned with its usage in the data science community: "updating a numerical model with observed data" (source: wikipedia).
- ATOM
- ATOM is a standardised interface to exchange news feeds over the internet. It has been adopted by INSPIRE as a basic alternative to download services via WFS or WCS.
- Catalogue
- Catalogue or metadata registry is a central location in an organization where metadata definitions are stored and maintained (source: wikipedia)
- Code list
- Code list is an enumeration of terms in order to constrain input and avoid errors (source: UN).
- Conceptual model
- Conceptual model or domain model represents concepts (entities) and relationships between them (source: wikipedia)
- Content negotiation
- Content negotiation refers to mechanisms that make it possible to serve different representations of a resource at the same URI (source: wikipedia)
- Controlled vocabulary
- Controlled vocabulary provides a way to organize knowledge for subsequent retrieval. A carefully selected list of words and phrases, which are used to tag units of information so that they are more easily retrieved by a search (source: Semwebtech). Vocabulary, unlike the dictionary and thesaurus, offers an in-depth analysis of a word and its usage in different contexts (source: learn grammar)
- Cordis
- Cordis is the primary source of results from EU-funded projects since 1990
- Corpus
- Corpus (plural: Corpora) is a repository of text documents (knowledge resources); a body of works. Typically the input for information retrieval.
- CSW
- CSW stands for Catalogue Service for the Web
- Data
- Data is a collection of discrete or continuous values that convey information, describing the quantity, quality, fact, statistics, other basic units of meaning, or simply sequences of symbols that may be further interpreted formally (Wikipedia).
- Data source
- Data source/provider is a provider of data resources.
- Data management
- Data management is the practice of collecting, organising, managing, and accessing data (for some purpose, such as decision-making).
- Dataset
- Dataset (Also: Data set) A collection of data (Wikipedia).
- Dataverse
- Dataverse is open source research data repository software
- Datacite
- Datacite is a non-profit organisation that provides persistent identifiers (DOIs) for research data.
- Datacite metadata schema
- Datacite metadata schema is a data model for metadata of scientific resources
- Digital exchange of soil-related data
- Digital exchange of soil-related data (ISO 28258:2013) presents a conceptual model of a common understanding of what soil profile data are
- Digital soil mapping
- Digital soil mapping is the creation and population of geographically referenced soil databases generated at a given resolution by using field and laboratory observation methods coupled with environmental data through quantitative relationships (source: wikipedia)
- Discovery service
- Discovery service is a concept from INSPIRE indicating a service type which enables discovery of resources (search and find). Typically implemented as CSW.
- Download service
- Download service is a concept from INSPIRE indicating a service type which enables download of a (subset of a) dataset. Typically implemented as WFS, WCS, SOS or Atom.
- DOI
- DOI (Digital Object Identifier) is a digital identifier of any object, whether physical, digital, or abstract
- Encoding
- Encoding is the format used to serialise a resource to a file. Common encodings are XML, JSON and Turtle
- ESDAC
- ESDAC (European Soil Data Centre) is the thematic centre for soil-related data in Europe
- EUSO
- EUSO stands for European Soil Observatory
- GDAL OGR
- GDAL and OGR are software packages widely used to interact with a variety of spatial data formats
- GML
- Geography Markup Language (GML) is an xml based standardised encoding for spatial data.
- GeoPackage
- GeoPackage is a set of conventions for storing spatial data in a SQLite database
- Geoserver
- Geoserver is a Java-based software package providing access to remote data through OGC services
- Global Soil Information System
- Global Soil Information System (GLOSIS) is an activity of FAO Global Soil Partnership enabling a federation of soil information systems and interoperable data sets
- GLOSIS domain model
- GLOSIS domain model is an abstract, architectural component that defines how data are organised; it embodies a common understanding of what soil profile data are.
- GLOSIS Web Ontology
- GLOSIS Web Ontology is an implementation of the GLOSIS domain model using semantic technology
- GLOSIS Codelists
- GLOSIS Codelists is a series of codelists supporting the GLOSIS web ontology. It includes the codelists as published in the FAO Guidelines for Soil Description (v2007), soil properties as collected by FAO GfSD and procedures as initially collected by Johan Leenaars.
- Glosolan
- Glosolan is a network to strengthen the capacity of laboratories in soil analysis and to respond to the need for harmonizing soil analytical data
- HALE
- Humboldt Alignment Editor (HALE) is Java-based desktop software for composing and applying transformations to data
- Harmonization
- Harmonization is the process of transforming two datasets to a common model and a common projection, using common domain values and aligning their geometries
- Information retrieval
- Information retrieval (IR) is the task of identifying and retrieving information system resources (e.g. digital objects or metadata records) that are relevant to a search query. It includes searching for the information in a document, searching for the documents themselves, as well as searching for metadata describing the documents.
- Iteration
- An iteration is each development cycle (three foreseen within the SoilWise project) in the project. Each iteration can have phases. There are four phases per iteration focussing on co-design, development, integration and validation, demonstration.
- JRC
- JRC is the Joint Research Centre of the European Commission, one of its Directorates-General. The JRC provides independent, evidence-based science and knowledge, supporting EU policies to positively impact society. Relevant policy areas within JRC are JRC Soil and JRC INSPIRE
- Knowledge
- Knowledge is facts, information, and skills acquired through experience or education; the theoretical or practical understanding of a subject. SoilWise mainly considers explicit knowledge -- Information that is easily articulated, codified, stored, and accessed. E.g. via books, web sites, or databases. It does not include implicit knowledge (information transferable via skills) nor tacit knowledge (gained via personal experiences and individual contexts). Explicit knowledge can be further divided into semantic and structural knowledge.
- Semantic knowledge: Also known as declarative knowledge, refers to knowledge about facts, meanings, concepts, and relationships. It is the understanding of the world around us, conveyed through language. Semantic knowledge answers the "What?" question about facts and concepts.
- Structural knowledge: Knowledge about the organisation and interrelationships among pieces of information. It is about understanding how different pieces of information are interconnected. Structural knowledge explains the "How?" and "Why?" regarding the organisation and relationships among facts and concepts.
- Knowledge graph
- Knowledge graph is a representation of a network of real-world entities -- such as objects, events, situations or concepts -- and the relationships between them. Typically the network is made up of nodes, edges, and labels. Both semantic and structural knowledge can be expressed, stored, searched, visualised, and explored as knowledge graphs.
- Knowledge resource
- Knowledge resource is a digital object, such as a document, a web page, or a database, that holds relevant explicit knowledge.
- Knowledge source
- Knowledge source/provider is a provider of knowledge resources.
- Knowledge management
- Knowledge management is the practice of collecting, organising, managing, and accessing knowledge (for some purpose, such as decision-making).
- LLM
- Large Language Model is typically a deep learning model based on the transformer architecture that has been trained on vast amounts of text data, usually from known collections scraped from the Internet.
- Mapserver
- Mapserver is a C-based software package providing access to remote data through OGC services
- Metadata
- (Descriptive) metadata is summary information describing digital objects such as datasets and knowledge resources.
- Metadata record
- Metadata record is an entry in e.g. a catalogue or abstracting and indexing service with summary information about a digital object.
- Metadata source
- Metadata source/provider is a provider of metadata.
- NLP
- Natural Language Processing is an interdisciplinary subfield of computer science and artificial intelligence, primarily concerned with providing computers with the ability to process data encoded in natural language. It is closely related to information retrieval, knowledge representation and computational linguistics.
- Observations and Measurements
- A conceptual model for Observations and Measurements (O&M), also known as ISO19156
- OGC API
- OGC API provides building blocks that can be used to assemble novel APIs for web access to geospatial content
- Ontology
- Ontology is a formal representation of the entities in a knowledge graph. Ontologies and knowledge graphs can be expressed in a similar manner and they are closely related. Ontologies can be seen as the (semantic) data model defining classes, relationships and attributes, while knowledge graphs contain the real data according to the (semantic) data model.
- Persistent identifier
- Persistent identifier is a long-lasting reference to a digital object.
- Product backlog
- Product backlog is the document where user stories/requirements are gathered with their acceptance criteria, and prioritized.
- QGIS
- QGIS is a desktop software package to create spatial visualisations of various types of data
- RAG
- Retrieval Augmented Generation is a framework for retrieving facts from an external knowledge base to ground large language models on the most accurate, up-to-date information, enhancing the (pre)trained parametric (semantic) knowledge with non-parametric knowledge to avoid hallucinations and get better responses.
- REA
- REA is the European Research Executive Agency. Its mandate is to manage several EU programmes and support services.
- Relational model
- Relational model is an approach to managing data using a structure and language consistent with first-order predicate logic (source: wikipedia)
- RDF
- Resource Description Framework (RDF) is a standard model for data interchange on the Web
- Representational state transfer
- Representational state transfer (REST) is a set of guidelines for creating stateless, reliable web APIs (source: wikipedia)
- Requirements
- Requirements are the capabilities of an envisioned component of the repository which are classified as ‘must have’, or ‘nice to have’.
- Rolling plan
- Rolling plan is a methodology for considering the internal and external developments that may generate changes to the SoilWise Repository design and development. It keeps track of any developments and changes on a technical, stakeholder group level or at EUSO/JRC.
- SensorThings API
- SensorThings API (STA) is a formalised protocol to exchange sensor data and tasks between IoT devices, maintained at Open Geospatial Consortium.
- Sensor Observation Service
- Sensor Observation Service (SOS) is a formalised protocol to exchange sensor data between entities, maintained at Open Geospatial Consortium.
- Sprint
- Sprint is a small timeframe during which tasks have been defined.
- Sprint backlog
- Sprint backlog is composed of the set of product backlog elements chosen for the sprint, and an action plan for achieving them.
- Soil classification
- Soil classification deals with the systematic categorization of soils based on distinguishing characteristics as well as criteria that dictate choices in use (source: wikipedia)
- Soilgrids
- Soilgrids is a system for global digital soil mapping that uses many profile data and machine learning methods to predict the spatial distribution of soil properties across the globe
- SoilWise Use cases
- The SoilWise use cases are described in the Grant Agreement to understand the needs of the stakeholder groups (users). Each use case provides user story epics.
- Task
- Task is the smallest segment of work that must be done to complete a user story/requirement.
- UML
- Unified Modeling Language (UML) is a general-purpose modeling language that is intended to provide a standard way to visualize the design of a system (source: wikipedia)
- Usage scenarios
- Usage scenarios describe how (groups of) users might use the software product. These usage scenarios can originate or be updated from the SoilWise use cases, user story epic or user stories/requirements.
- User story
- A User story is a statement, written from the point of view of the user, that describes the functionality needed by the user from the SoilWise Repository.
- User story epic
- A User story epic is a narrative of stakeholders' needs that can be narrowed down into smaller specific needs (user stories/requirements).
- Validation framework
- Validation framework is a framework allowing good communication between users and developers, validation of developed products by users, and flexibility on the developer’s side to take change requests into account as soon as possible. The validation framework needs a description of the functionalities to be developed (user stories/requirements), the criteria that enable to verify that the developed component corresponds to the user needs (acceptance criteria), the definition of tasks for the developers (backlog) and the workflow.
- View service
- View service is a concept from INSPIRE indicating a service type which presents a (pre)view of a dataset. Typically implemented as WMS or WMTS.
- Web service
- Web service is a service offered by a device to another device, communicating with each other via the Internet (source: wikipedia)
- WOSIS
- WOSIS is a global dataset, maintained at ISRIC, aiming to serve the user with a selection of standardised and ultimately harmonised soil profile data
- WMS
- Web Map service (WMS) is a formalised protocol to exchange geospatial data represented as images
- WFS
- Web Feature Service (WFS) is a formalised protocol to exchange geospatial vector data
- WCS
- Web Coverage Service (WCS) is a formalised protocol to exchange geospatial grid data
- XSD
- XML Schema Definition (XSD) is a recommendation on how to formally describe the elements in an Extensible Markup Language (XML) document (source: wikipedia)