Educational Notebook

Citizen Science and STEM Education with R: Forecasting and Reproducible Learning from Open Urban Air Quality Data

Authors

Jesús Cáceres Tello (Universidad Complutense de Madrid)
José Javier Galán Hernández (Universidad Complutense de Madrid)

Published: January 1, 2025

License: CC BY 4.0 | Made with R | Rendered with Quarto | DOI: 10.3390/app152212183 | Applied Sciences 2025

1 🌍 Introduction

This notebook complements the article Citizen Science and STEM Education with R: AI–IoT Forecasting and Reproducible Learning from Open Urban Air Quality Data
(Applied Sciences, 2025) and includes reproducible code examples, data harmonisation workflows, and visualisations.

It reproduces the main analytical workflow implemented in the study and illustrates how R and Quarto can be integrated into STEM education to foster data literacy, environmental awareness, and methodological transparency.


2 🔁 Reproducible Data Workflow

The complete workflow integrates open data, computational reproducibility, and STEM learning.
It can be applied to other urban contexts or courses focusing on environmental informatics, statistical modelling, or sustainability transitions.

Fig. 1. Open Data and Methodological Pipeline.

3 🗂️️ Data Sources and Structure

3.1 Air Quality Data

Air quality datasets were retrieved from the Madrid Open Data Portal (Portal de Datos Abiertos del Ayuntamiento de Madrid).
Measurements include nitrogen dioxide (NO₂), ozone (O₃), particulate matter (PM₁₀, PM₂.₅), sulphur dioxide (SO₂), and carbon monoxide (CO) recorded hourly across 24 urban stations (2020–2024).

Fig. 2a. Air quality monitoring stations across the Madrid urban area (2020–2024).

3.2 Pollutant Coverage

Each station has different pollutant coverage and measurement frequency, which provides an excellent example for students to explore data completeness and measurement uncertainty in open environmental datasets.

Fig. 2b. Pollutant coverage and measurement frequency across monitoring stations.
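
One way to make this coverage concrete is to count the valid hourly records per station and pollutant and compare them with the number of hours in the period. The following is a minimal sketch on simulated data, not the notebook's own code: the station labels "A" and "B", the values, and the `ESTACION` column are assumptions, while `MAGNITUD`, `FECHA_HORA`, and `VALOR` follow the names used in the correlation code of Section 6.2.

```r
library(dplyr)
library(tidyr)

set.seed(1)

# One week of hourly timestamps (168 hours)
hours <- seq(as.POSIXct("2024-01-01 00:00", tz = "UTC"),
             as.POSIXct("2024-01-07 23:00", tz = "UTC"), by = "hour")

# Simulated records: hypothetical stations "A" and "B", magnitudes 8 (NO2) and 14 (O3)
toy <- expand_grid(ESTACION = c("A", "B"), MAGNITUD = c(8L, 14L), FECHA_HORA = hours) %>%
  mutate(VALOR = runif(n(), 5, 80)) %>%
  # Station B does not measure O3 at all
  filter(!(ESTACION == "B" & MAGNITUD == 14L)) %>%
  # Station A loses every fourth NO2 reading
  mutate(VALOR = if_else(ESTACION == "A" & MAGNITUD == 8L & row_number() %% 4 == 0,
                         NA_real_, VALOR))

# Share of hours with a usable value, per station and pollutant
coverage <- toy %>%
  group_by(ESTACION, MAGNITUD) %>%
  summarise(n_valid = sum(!is.na(VALOR)), .groups = "drop") %>%
  complete(ESTACION, MAGNITUD, fill = list(n_valid = 0L)) %>%
  mutate(pct_complete = round(100 * n_valid / length(hours), 1))

print(coverage)
```

The call to `complete()` makes station and pollutant pairs that never appear in the data show up with zero coverage, instead of silently dropping out of the summary.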

3.3 Data Processing Workflow

Data from both sources (air quality and meteorology) were processed in R in three main stages:

  1. Reading and cleaning monthly CSVs (removing redundant columns and correcting data types).
  2. Validating records with confirmed measurements (VAL flag).
  3. Pivoting and compressing results into parquet format for efficiency and consistency.

Fig. 3. Data acquisition, validation, and harmonisation workflow implemented in R.
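
The three stages can be sketched on a toy table as follows. This is an illustration under stated assumptions, not the notebook's pipeline: the input layout (hourly values in `H01`, `H02`, ... with matching flags in `V01`, `V02`, ...), the flag value "V", the redundant columns, and the output file name are all assumed for the example. The sketch reshapes before filtering because, in this assumed layout, each flag belongs to one hour column.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for one monthly CSV (assumed layout; only two hours shown)
raw <- data.frame(
  PROVINCIA = "28", MUNICIPIO = "79",   # constant, redundant columns
  ESTACION  = c("4", "4", "8"),
  MAGNITUD  = c("8", "14", "8"),
  ANO = "2024", MES = "01", DIA = "01",
  H01 = c("41", "12", "55"), V01 = c("V", "V", "N"),
  H02 = c("38", "15", "52"), V02 = c("V", "N", "V"),
  stringsAsFactors = FALSE
)

# Stage 1: drop redundant columns and correct data types
clean <- raw %>%
  select(-PROVINCIA, -MUNICIPIO) %>%
  mutate(MAGNITUD = as.integer(MAGNITUD),
         date = as.Date(paste(ANO, MES, DIA, sep = "-"))) %>%
  select(-ANO, -MES, -DIA)

# Reshape so that every hourly value sits next to its validation flag
long <- clean %>%
  pivot_longer(
    cols = matches("^[HV][0-9]{2}$"),
    names_to = c(".value", "hour"),
    names_pattern = "^([HV])([0-9]{2})$"
  )

# Stage 2: keep only validated records and build the timestamp
validated <- long %>%
  filter(V == "V") %>%
  transmute(
    ESTACION, MAGNITUD,
    FECHA_HORA = as.POSIXct(paste(date, "00:00:00"), tz = "UTC") + as.integer(hour) * 3600,
    VALOR = as.numeric(H)
  )

# Stage 3: store the harmonised table as parquet (skipped if arrow is not installed)
if (requireNamespace("arrow", quietly = TRUE)) {
  out <- file.path(tempdir(), "aire_validados_demo.parquet")
  arrow::write_parquet(validated, out)
}

print(validated)
```

The resulting long table has one row per station, magnitude, and hour, which matches the `FECHA_HORA`, `MAGNITUD`, and `VALOR` columns read back in Section 6.2.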

4 📊 Exploratory Analysis

This section presents an exploratory analysis of key urban air pollutants using open environmental data. The aim is to characterise the temporal variability and distributional properties of NO₂ and O₃ as a basis for subsequent modelling and correlation analyses. Exploratory statistics are used to identify long-term trends, seasonal patterns, and interannual variability relevant for urban air quality dynamics.

4.1 Annual and Seasonal Variability

Figure 4 shows the annual variability of daily mean NO₂ and O₃ concentrations over the 2020–2024 period. NO₂ concentrations display relatively stable distributions with a slight reduction in median values in the most recent years. In contrast, O₃ exhibits higher dispersion and a progressive increase in median and upper-range concentrations, reflecting its pronounced seasonal behaviour.

Overall, the boxplots highlight distinct distributional patterns between the primary pollutant (NO₂) and the secondary pollutant (O₃), providing a descriptive baseline for the subsequent correlation and modelling analyses.

Fig. 4. Annual variability of NO₂ and O₃ concentrations (2020–2024).
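
A figure of this kind could be produced along the following lines. The sketch uses simulated hourly NO₂ values rather than the Madrid data, and the object names (`air`, `daily`, `annual_summary`) are illustrative; only `FECHA_HORA` and `VALOR` mirror the column names used in Section 6.2.

```r
library(dplyr)
library(ggplot2)

set.seed(42)

# Simulated hourly NO2 values for two years (illustrative only)
hours <- seq(as.POSIXct("2023-01-01 00:00", tz = "UTC"),
             as.POSIXct("2024-12-31 23:00", tz = "UTC"), by = "hour")
air <- data.frame(
  FECHA_HORA = hours,
  VALOR = pmax(0, rnorm(length(hours), mean = 30, sd = 10))
)

# Aggregate to daily means and tag each day with its calendar year
daily <- air %>%
  mutate(date = as.Date(FECHA_HORA), year = factor(format(date, "%Y"))) %>%
  group_by(year, date) %>%
  summarise(no2 = mean(VALOR, na.rm = TRUE), .groups = "drop")

# One box per year
p <- ggplot(daily, aes(x = year, y = no2)) +
  geom_boxplot() +
  labs(x = "Year", y = "Daily mean NO2")

# Numerical counterpart of the boxes: median and interquartile range per year
annual_summary <- daily %>%
  group_by(year) %>%
  summarise(median = median(no2), iqr = IQR(no2), .groups = "drop")

print(annual_summary)
```

Printing `p` draws the plot; the summary table gives students the numbers behind each box so that a shift in the median can be read off directly.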

4.2 Distribution Analysis

Boxplots summarise dispersion, central tendency, and outliers for each pollutant in a single view.
Working with them, students learn how descriptive statistics translate into environmental interpretation, reinforcing quantitative reasoning with real data.

Fig. 8. Boxplots of NO₂ and O₃ concentrations across 2020–2024.
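
To connect the picture with the numbers, students can compute the quantities a boxplot encodes by hand. The sketch below applies Tukey's 1.5 × IQR rule, the usual convention for boxplot whiskers, to simulated daily O₃ means with two deliberately extreme days; all values are invented for illustration.

```r
set.seed(7)

# Simulated daily O3 means plus two injected extreme days (illustrative only)
o3 <- c(rnorm(200, mean = 60, sd = 12), 140, 155)

# Quartiles and interquartile range
q   <- quantile(o3, probs = c(0.25, 0.5, 0.75), names = FALSE)
iqr <- q[3] - q[1]

# Tukey fences: points beyond them are flagged as outliers
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[3] + 1.5 * iqr
outliers    <- o3[o3 < lower_fence | o3 > upper_fence]

cat("Median:", round(q[2], 1), " IQR:", round(iqr, 1), "\n")
cat("Fences:", round(lower_fence, 1), "to", round(upper_fence, 1), "\n")
cat("Number of flagged days:", length(outliers), "\n")
```

Plotting functions may compute the box edges with slightly different quantile definitions, so the flagged points can differ marginally from those drawn in a figure.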

5 🔮 Forecasting with Prophet

Time-series forecasting introduces students to predictive modelling using open environmental data.
The Prophet model (Taylor & Letham, 2018) was selected for its interpretability, decomposition structure, and robustness to missing values — key features for teaching reproducible forecasting in R.

5.1 Model for NO₂

Students can visualise how additive components — trend, seasonality, and residuals — reveal the influence of human activity and meteorological cycles on pollutant evolution. This exercise supports reproducible experimentation with forecasting horizons, cross-validation, and performance metrics such as RMSE or MAE.

Fig. 5a. Prophet forecast for NO₂ concentrations (2020–2024).

Fig. 5b. Prophet forecast for O₃ concentrations (2020–2024).
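
A forecasting exercise of this kind might look like the sketch below. It fits Prophet to a simulated daily series, not to the Madrid data, and the seasonal shape, the 30-day horizon, and the cross-validation settings are arbitrary choices made for the example. The column names `ds` and `y` are the ones Prophet expects.

```r
library(prophet)

set.seed(123)

# Simulated daily NO2 series with yearly and weekly cycles (illustrative only)
dates     <- seq(as.Date("2021-01-01"), as.Date("2023-12-31"), by = "day")
day_index <- seq_along(dates)
y <- 35 +
  8 * cos(2 * pi * day_index / 365.25) +                    # winter peak
  ifelse(format(dates, "%u") %in% c("6", "7"), -6, 0) +     # weekend dip
  rnorm(length(dates), sd = 4)
y[sample(length(y) - 1, 50)] <- NA   # gaps: Prophet fits on the remaining days

df <- data.frame(ds = dates, y = y)

# Fit an additive model with yearly and weekly seasonality
m <- prophet(df, yearly.seasonality = TRUE, weekly.seasonality = TRUE,
             daily.seasonality = FALSE)

# Forecast 30 days beyond the end of the series
future   <- make_future_dataframe(m, periods = 30)
forecast <- predict(m, future)

# prophet_plot_components(m, forecast)  # trend, weekly, and yearly panels

# Rolling-origin cross-validation: train on at least 730 days, predict 30 ahead
cv   <- cross_validation(m, initial = 730, period = 90, horizon = 30, units = "days")
perf <- performance_metrics(cv, metrics = c("rmse", "mae"))

print(head(perf))
```

Changing `initial`, `period`, or `horizon` lets students see how the error metrics respond to the amount of training data and to the forecasting distance.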

6 🌦 Meteorological Integration

Meteorological factors shape pollutant behaviour and are fundamental in understanding atmospheric processes.
By integrating temperature, solar radiation, and wind speed data from AEMET, learners can explore multivariate relationships within an urban ecosystem.

6.1 Integration Workflow

Fig. 7. Integration workflow of meteorological and air quality data (2020–2024).
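
Before merging the two sources, it is worth checking which days would be lost in the join. The sketch below does this on two toy daily tables; the values and the missing day are invented, and the column names `NO2`, `T`, and `WS` simply reuse the short labels from Section 6.2.

```r
library(dplyr)

# Toy daily tables (illustrative values); the meteorological table lacks one day
air_daily <- data.frame(
  date = as.Date("2024-01-01") + 0:4,
  NO2  = c(42, 38, 51, 47, 33)
)
met_daily <- data.frame(
  date = as.Date("2024-01-01") + c(0, 1, 3, 4),
  T    = c(6.1, 7.4, 5.0, 8.2),
  WS   = c(1.2, 2.8, 0.9, 3.5)
)

# Days with air quality data but no meteorological counterpart
unmatched <- anti_join(air_daily, met_daily, by = "date")

# An inner join keeps only the days present in both sources
joined <- inner_join(air_daily, met_daily, by = "date")

print(unmatched)
print(joined)
```

Reporting the number of unmatched days alongside the merged table keeps the loss of records visible instead of hiding it inside the join.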

6.2 Correlation Analysis between Pollutants and Meteorological Variables

To complement the integration workflow, this section computes the Spearman correlations between daily concentrations of NO₂ and O₃ and six meteorological parameters (temperature, relative humidity, wind speed, solar radiation, atmospheric pressure, and precipitation) using the validated datasets (2020–2024).

library(arrow)
library(dplyr)
library(tidyr)
library(corrr)
library(knitr)
library(purrr)
library(lubridate)

# --- 1. Load and combine the annual files ---
air_files <- list.files(
  "data/Calidad del Aire_Parquet",
  pattern = "aire_validados_.*\\.parquet$",
  full.names = TRUE
)

met_files <- list.files(
  "data/Meteorologia_Parquet",
  pattern = "meteo_validados_.*\\.parquet$",
  full.names = TRUE
)

air <- map_dfr(air_files, read_parquet)
met <- map_dfr(met_files, read_parquet)

# --- 2. Create the daily date variable ---
air <- air %>%
  mutate(date = as.Date(FECHA_HORA))

met <- met %>%
  mutate(date = as.Date(FECHA_HORA))

# --- 3. Aggregate to daily means and pivot to wide format ---
air_wide <- air %>%
  mutate(MAGNITUD = as.character(MAGNITUD)) %>%
  group_by(date, MAGNITUD) %>%
  summarise(valor = mean(VALOR, na.rm = TRUE), .groups = "drop") %>%
  pivot_wider(names_from = MAGNITUD, values_from = valor)

met_wide <- met %>%
  mutate(MAGNITUD = as.character(MAGNITUD)) %>%
  group_by(date, MAGNITUD) %>%
  summarise(valor = mean(VALOR, na.rm = TRUE), .groups = "drop") %>%
  pivot_wider(names_from = MAGNITUD, values_from = valor)

# --- 4. Join by date ---
data_joined <- inner_join(air_wide, met_wide, by = "date")

# --- 5. Rename magnitude codes (only those present) ---
data_joined <- data_joined %>%
  rename_with(
    ~ recode(
      .x,
      `8`  = "NO2",
      `14` = "O3",
      `81` = "WS",
      `83` = "T",
      `85` = "PRES",
      `86` = "RH",
      `87` = "SR",
      `89` = "PREC"
    ),
    .cols = intersect(
      names(data_joined),
      c("8", "14", "81", "83", "85", "86", "87", "89")
    )
  )

# --- 6. Compute Spearman correlations (robust to missing columns) ---
vars <- c("NO2", "O3", "T", "RH", "WS", "SR", "PRES", "PREC")

cor_matrix <- data_joined %>%
  select(any_of(vars)) %>%
  correlate(method = "spearman") %>%
  focus(any_of(c("NO2", "O3"))) %>%
  arrange(term)

# --- 7. Display the formatted table ---
kable(
  cor_matrix %>%
    mutate(across(where(is.numeric), ~ round(.x, 2))),
  caption = "Table 4. Spearman correlation coefficients (ρ) between daily pollutant concentrations and meteorological variables (2020–2024).",
  col.names = c("Variable", "NO₂", "O₃")
)

Table 4. Spearman correlation coefficients (ρ) between daily pollutant concentrations and meteorological variables (2020–2024).

Variable    NO₂      O₃
PREC       -0.18    -0.11
RH          0.24    -0.70
SR          0.55    -0.46
T          -0.35     0.68
WS         -0.75     0.57
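
The table reports coefficients only. As a possible extension, a single pair can be examined with base R's `cor.test()`, which returns ρ together with a p-value. The sketch below runs on simulated data in which NO₂ falls as wind speed rises; the variable names and the functional form are assumptions made for the illustration.

```r
set.seed(99)

# Simulated daily values (illustrative only): NO2 decreases with wind speed
n   <- 365
ws  <- rgamma(n, shape = 2, rate = 1)
no2 <- 60 * exp(-0.4 * ws) + rnorm(n, sd = 5)

# Spearman rank correlation with an asymptotic p-value
res     <- cor.test(no2, ws, method = "spearman", exact = FALSE)
rho     <- unname(res$estimate)
p_value <- res$p.value

cat("Spearman rho:", round(rho, 2), "\n")
cat("p-value:", format.pval(p_value, digits = 2), "\n")
```

Daily environmental series are usually autocorrelated, so p-values from this test tend to be optimistic and are better read as a rough guide than as a formal result.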

7 🎓 Learning and Reproducibility Framework

Reproducibility is both a scientific and pedagogical value.
This framework unifies open data, transparent computation, and educational innovation, reinforcing the culture of open science.

Fig. 6. Reproducible learning and open-science framework for STEM education.

8 🧑‍🏫 Educational Applications

This Notebook can be directly incorporated into undergraduate or postgraduate STEM courses focused on data analysis, environmental informatics, or sustainability.

Suggested learning activities:

  1. Reproduce pollutant forecasts with modified training periods.
  2. Explore correlations between additional meteorological variables.
  3. Design inquiry-based projects connecting data to local environmental policies.
  4. Document and publish reproducible reports using Quarto and GitHub.

Through these exercises, students not only practise coding but also embrace scientific integrity and civic engagement through data.


9 🌐 Repository and Citation

All code, figures, and harmonised datasets are openly available at:
https://github.com/jcaceres-academic/OpenUrbanAirandMeteorological

When citing this educational resource, please use:

Cáceres-Tello, J., Galán-Hernández, J. J., Morales Cevallo, M. B., & López-Meneses, E. (2025). Citizen Science and STEM Education with R: AI–IoT Forecasting and Reproducible Learning from Open Urban Air Quality Data. Applied Sciences, 15(22), 12183. https://doi.org/10.3390/app152212183

This ensures traceability and recognition for open-source academic contributions.


10 📚 References

All cited works are managed through the shared bibliographic file applsci-3979500.bib, which includes all references used in the manuscript and notebook.
A public mirror of this bibliography is archived in the author’s Zotero collection:
➡️ https://www.zotero.org/jcaceres_academic/collections/X6RW9UGU