3 The Catalogue

Figure 3.1: A Catalogue Example

3.1 The codebook

Our codebook is built on two international standards: SDMX and RDF.

Our data observatories harmonize our dataset structures so that they can be synchronized or integrated with SDMX datasets provided by reliable data agencies such as Eurostat, the World Bank, and the OECD. Furthermore, with the more general RDF model we can incorporate data from non-statistical sources, or use various big data sources and turn them into statistical products.

We use the SDMX standard’s organizational principles in creating datasets. SDMX is a statistical standard that defines the way statistical data agencies (such as Eurostat, the World Bank, UN organizations, and national statistical offices) collect, process, and distribute data.

We use the W3C dataset standard based on RDF. RDF started as a metadata standard to facilitate machine-to-machine synchronization via the internet. Using W3C RDF standards allows our data observatory to communicate with various commercial applications made by Google or Microsoft, or with non-commercial applications like Wikidata.

SDMX and RDF are both ISO standards, and they are slowly converging. SDMX predates RDF and is huge: it standardizes all aspects of international statistical cooperation. We use only very small parts of SDMX to synchronize our data observatories with reliable data sources such as Eurostat and the OECD, and to make our data products compatible with theirs.

3.1.1 RDF principles

Our datasets are special, strict forms of the RDF Dataset definition, which allows the attributes of a dataset to be connected to other sources (such as library services with bibliographical references, or authoritative identification services).
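The linking of dataset attributes to external sources can be sketched as RDF-style triples. This is a minimal illustration, not our implementation: the dataset URI, the DOI, and the ORCID identifier below are placeholder assumptions, and the predicates are drawn from the well-known Dublin Core vocabulary.

```python
# Each statement is a (subject, predicate, object) triple, the basic
# unit of the RDF model. The dataset URI and the identifiers below are
# illustrative placeholders, not real resources.

DATASET = "https://example.org/dataset/gdp-annual"

triples = [
    (DATASET, "http://purl.org/dc/terms/title", "Annual GDP"),
    # Bibliographical reference resolvable via a library service (a DOI):
    (DATASET, "http://purl.org/dc/terms/source", "https://doi.org/10.5281/zenodo.0000000"),
    # Author linked via an authoritative identification service (an ORCID iD):
    (DATASET, "http://purl.org/dc/terms/creator", "https://orcid.org/0000-0000-0000-0000"),
]

def objects_of(subject, predicate, graph):
    """Return every object attached to (subject, predicate) in the graph."""
    return [o for s, p, o in graph if s == subject and p == predicate]
```

Because every attribute is a triple pointing at a globally unique URI, an external application can follow the same links and resolve the author or the source reference without any custom integration.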

3.1.2 SDMX organizational principles

The variables of a dataset are organized as dimensions, measurements, and attributes, each following standardized coding.

Measurements are the actual variables.

Dimensions: We use two dimensions for each observation, a geographical concept and a time concept; for example, an observation relates to the country of ‘Germany’ and the time period of the entire year 2021. The geographical concept is labelled with SDMX standard codes that can be machine-labelled correctly in English, French, German, or the user’s choice of labelling. The time concept uses ISO time coding for correct reading and translation into the user’s software. In the future, we may add further dimensions to our datasets when necessary.

Observational attributes: Observational attributes describe the measurement. We use the observation status standard attribute, which clearly states whether a measurement is actual or estimated, or whether it is missing from observation, missing from processing, and so on. In statistical applications it is very important to know whether an observation is logically impossible (there is no Czechoslovak GDP for 1993, because Czechoslovakia was dissolved into Czechia and Slovakia) and should be omitted from the European average GDP, or whether it is simply not present in the dataset (the Czechoslovak GDP should be present for 1992, as there is no separate Slovak and Czech GDP yet for that year). A potential imputed value can contain an estimated Slovak and Czech part of the Czechoslovak GDP in 1992 for backward compatibility of the dataset. We use further attributes to indicate the statistical imputation or forecasting used and the seasonal adjustment made in time series.

Dimensional attributes: The geo code “DE” refers to Germany or Allemagne. Because we use only standard dimensional attributes, we placed all of them in the codebooks. They can be easily added to our datasets for better human readability or easier visualization (in charts, use “Germany” instead of “DE”).

Dataset attributes: All our datasets use a single unit of measure; they have one original data source, and they have various authors and contributors.
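The organization above can be sketched as a small set of observations: two dimensions (geo and time), one measured value, and an observation-status attribute. This is an illustrative sketch only; the status codes follow the general pattern of the SDMX observation-status code list (“A” for an actual value, “E” for an estimated one, “M” for a missing one), but the exact codes in any given dataset should be checked against its codebook.

```python
# Illustrative observations: each carries two dimensions (geo, time),
# a measurement (value), and an observation-status attribute.
# Status codes here are assumptions modelled on SDMX practice:
# "A" actual, "E" estimated, "M" missing.
observations = [
    {"geo": "DE", "time": "2021", "value": 3.6, "obs_status": "A"},
    {"geo": "SK", "time": "2021", "value": 4.9, "obs_status": "E"},
    # Logically impossible observation: no Czechoslovak value in 2021,
    # since Czechoslovakia was dissolved into Czechia and Slovakia.
    {"geo": "CS", "time": "2021", "value": None, "obs_status": "M"},
]

def mean_of_present(obs):
    """Average only the observations that actually carry a value,
    omitting logically impossible or missing ones."""
    present = [o["value"] for o in obs if o["value"] is not None]
    return sum(present) / len(present)
```

The status attribute is what lets downstream code distinguish a gap that should be skipped in an aggregate from a value that was estimated or imputed and may still be used.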
We place this information in our Catalogue, in our relational database and API, and it is connected to the dataset on the Zenodo API. For example, if a dataset is accessed in the JSON-LD form, the dataset attributes are automatically imported with the data itself. This allows the incorporation of our interoperable datasets into semantic web or SPARQL applications.
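To illustrate how dataset attributes travel with the data in JSON-LD, the sketch below parses a small, hand-written JSON-LD fragment. The field names and the schema.org context are assumptions for illustration, not the exact schema returned by the Zenodo API.

```python
import json

# Hypothetical JSON-LD fragment; the properties shown (name, creator,
# unitText) are illustrative, borrowed loosely from schema.org.
jsonld_text = """
{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "name": "Annual GDP",
  "creator": [{"@type": "Person", "name": "Jane Doe"}],
  "unitText": "million EUR"
}
"""

dataset = json.loads(jsonld_text)

# Dataset-level attributes arrive together with the data itself:
unit = dataset["unitText"]
authors = [c["name"] for c in dataset["creator"]]
```

Because the attributes are embedded in the same machine-readable document as the data, a consuming application needs no separate metadata lookup before using the dataset.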