---
title: "Handling Semantic Ambiguity with prelabelled Vectors"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Handling Semantic Ambiguity with prelabelled Vectors}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Introduction

The package described here is part of a family of R packages that extend tidy data workflows with richer semantic and provenance-aware capabilities. The work began from practical experience building tidyverse-based data pipelines and repeatedly encountering the same limitation: tidy datasets are efficient and semantically clear within a given workflow, but much of their meaning remains implicit and depends on the contextual knowledge of their creator. Once a dataset is exported, serialised, or transferred across environments, this contextual information is often lost.

```{r setup}
library(dataset)
```

The `dataset` package introduces semantically enriched vectors and data frames that preserve explicit metadata throughout the workflow lifecycle. However, fully formal semantic annotation is verbose and cognitively demanding. Constructing semantically complete RDF-compatible objects is appropriate only for mature stages of a workflow.

In practice, semantic stabilisation is usually incremental. Observational data often arrive with partially inconsistent, incomplete, or ambiguous labels. Before a variable can mature into a formally defined vector created with `labelled::labelled()` or `dataset::defined()`, analysts typically perform several rounds of semantic harmonisation.

The `prelabelled` class supports this intermediate stage.

Unlike formally defined semantic vectors, `prelabelled` vectors tolerate:

- incomplete semantic mappings;
- unresolved observational values;
- mixed coding conventions;
- gradual semantic stabilisation.

This vignette demonstrates how provisional semantic assertions can be incrementally stabilised while preserving the original observational evidence.

## A small ambiguous dataset

We begin with a small dataset containing country observations. The dataset is intentionally inconsistent: some observations use full country names, while others already use ISO 3166 alpha-2 country codes.

Such ambiguity is extremely common in operational analytical workflows, particularly when datasets are merged from multiple sources or manually curated over time.

```{r countrydata1}
country_data_1 <- data.frame(
  country = c("Andorra", "LI", "San Marino", "AD", "Liechtenstein"),
  time = c(2020, 2020, 2020, 2021, 2021),
  value = c(1.2, 2.4, 3.1, 1.3, 2.5)
)
```

### Creating provisional semantic assertions

We now create a lightweight semantic mapping.

The goal is not yet to create a formally closed semantic vocabulary. Instead, we begin stabilising the semantics incrementally by mapping some observational values to candidate semantic assertions.

Values that are not explicitly mapped remain self-describing.

```{r countrymap1}
country_map <- c(
  "Andorra" = "AD",
  "Liechtenstein" = "LI",
  "San Marino" = "SM"
)

country_data_1$country <-
  prelabel(
    country_data_1$country,
    labels = country_map
  )
```

### Inspecting the prelabelled vector

The resulting vector preserves the original observational values while attaching a provisional semantic vocabulary in the `"prelabel"` attribute.

```{r printcountrydata1}
print(country_data_1$country)
```

This separation between observational evidence and semantic interpretation is a central design principle of the `prelabelled` class.

The observational values remain unchanged, while semantic operationalisation may evolve iteratively over time.
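The separation described above can be illustrated in base R alone. The following is only a conceptual sketch: the `vocabulary` attribute and the object names are hypothetical and do not reflect how the `prelabelled` class is implemented internally.

```{r sketch-separation}
# Raw observations, kept exactly as they were recorded.
observed <- c("Andorra", "LI", "San Marino", "AD", "Liechtenstein")

# Hypothetical container: the provisional vocabulary travels as an attribute.
annotated <- structure(observed, vocabulary = c("Andorra" = "AD"))

# Revising the interpretation only touches the attribute ...
attr(annotated, "vocabulary") <- c(
  attr(annotated, "vocabulary"),
  "Liechtenstein" = "LI"
)

# ... so the underlying observations are still identical to the input.
identical(as.vector(annotated), observed)
```

Because the interpretation lives beside the values rather than in them, it can be revised any number of times without rewriting the evidence.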

### Semantic operationalisation

Using `as.character()` operationalises the semantic assertions into a semantically stabilised character vector.

```{r countrydata2}
country_data_2 <- data.frame(
  country = as.character(country_data_1$country),
  time = country_data_1$time,
  value = country_data_1$value
)

country_data_2
```

Mapped observations are converted into their candidate semantic assertions, while unmatched values remain self-describing.

This allows analysts to gradually reduce semantic ambiguity without destroying the original observational evidence.
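The fallback behaviour described above, where unmatched values remain self-describing, can be sketched in base R. The helper `apply_lookup()` and the lookup keyed by observed value are assumptions made for this illustration; they are not part of the `dataset` package and need not match the direction in which `prelabel()` reads its `labels` argument.

```{r sketch-fallback}
# Hypothetical helper, not part of the dataset package.
apply_lookup <- function(x, lookup) {
  mapped <- unname(lookup[x])        # NA where the lookup has no entry
  ifelse(is.na(mapped), x, mapped)   # unmatched values stand for themselves
}

observed <- c("Andorra", "LI", "San Marino", "AD", "Liechtenstein")

# Illustrative lookup: observed value -> candidate code.
lookup <- c("Andorra" = "AD", "Liechtenstein" = "LI", "San Marino" = "SM")

harmonised <- apply_lookup(observed, lookup)

# Keeping both columns side by side preserves the original evidence.
data.frame(observed = observed, harmonised = harmonised)
```

The original column is never overwritten, so a wrong mapping can be corrected later by changing the lookup and recomputing.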

## A more ambiguous dataset

The next dataset contains a more difficult form of semantic ambiguity.

Some observations use ISO 3166 alpha-2 country codes, while others use ISO 3166 alpha-3 codes or full country names. Although the observations are semantically related, they do not yet form a stable closed vocabulary.

```{r countrydata3}
country_data_3 <- data.frame(
  country = c(
    "AD", "AND", "LI", "LIE", "SMR", "San Marino"
  ),
  time = c(2020, 2020, 2020, 2021, 2021, 2021),
  value = c(1, 2, 3, 4, 5, 6)
)
```
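Before writing any mapping, it can help to make the mixed coding conventions explicit. The sketch below is a rough heuristic in base R, with hypothetical object names: it classifies values by their shape only and does not check whether a string is a valid ISO 3166 code.

```{r sketch-conventions}
observed <- c("AD", "AND", "LI", "LIE", "SMR", "San Marino")

# Shape-based guess: two or three capital letters look like ISO codes.
convention <- ifelse(
  grepl("^[A-Z]{2}$", observed), "alpha-2-like",
  ifelse(grepl("^[A-Z]{3}$", observed), "alpha-3-like", "other")
)

# Group the observed values by their apparent coding convention.
split(observed, convention)
```

A summary like this shows at a glance how many conventions a provisional vocabulary has to reconcile.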

## Incremental semantic stabilisation

The `prelabelled` workflow does not require complete semantic resolution from the outset.

Instead, semantic stabilisation can proceed incrementally:

- observational ambiguities become explicit;
- partial semantic mappings accumulate gradually;
- unresolved values remain operationally usable;
- semantic assertions become progressively more stable.

```{r countrymap3}
country_map_3 <- c(
  "Andorra" = "AD",
  "Andorra" = "AND",
  "Liechtenstein" = "LI",
  "San Marino" = "SM",
  "San Marino" = "SMR"
)

prelabelled_country <- prelabel(
  country_data_3$country,
  labels = country_map_3
)
```

This approach is particularly useful in exploratory analytical workflows, archival reconstruction, metadata harmonisation, and cross-dataset integration tasks.

```{r}
prelabelled_country
```
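One way to picture the incremental process is to track which observed values are still unresolved after each round of mapping. The following base R sketch uses a hypothetical lookup keyed by observed value; it illustrates the idea only and does not use or describe the `prelabel()` interface.

```{r sketch-unresolved}
observed <- c("AD", "AND", "LI", "LIE", "SMR", "San Marino")

# Round 1: a partial, illustrative lookup (observed value -> alpha-2 code).
lookup_round_1 <- c(
  "AD" = "AD", "AND" = "AD", "LI" = "LI",
  "SMR" = "SM", "San Marino" = "SM"
)

# Values without an entry are the remaining ambiguity.
setdiff(unique(observed), names(lookup_round_1))

# Round 2: extend the lookup once the unresolved value has been investigated.
lookup_round_2 <- c(lookup_round_1, "LIE" = "LI")
setdiff(unique(observed), names(lookup_round_2))

# With nothing left unresolved, every observation can be resolved.
unname(lookup_round_2[observed])
```

Each round shrinks the unresolved set, and the data remain usable in between because unresolved values are simply carried along.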

### Semantic workspaces

The `as.character()` method provides lightweight semantic coercion, which may be more useful after semantic stabilisation.

```{r}
as.character(prelabelled_country)
```

By contrast, the `as_character()` method creates a provenance-preserving semantic workspace.

```{r}
as_character(prelabelled_country)
```

The resulting vector retains:

- the original observational values;
- the provisional semantic vocabulary;
- additional semantic attributes.

This allows analysts to continue semantic refinement workflows while preserving reversibility and provenance awareness.

### From provisional semantics to formally defined semantics

The goal of `prelabelled` vectors is not to replace formally defined semantic vectors.

Instead, they provide a lightweight preparatory stage for incremental semantic stabilisation.

Once semantic ambiguity has been sufficiently reduced, `prelabelled` vectors may mature into formally defined semantic vectors created with `labelled::labelled()` or `dataset::defined()`. For further information, see the vignette on semantic vectors with `defined()`: `vignette("defined", package = "dataset")`.

In this sense, semantic enrichment becomes an iterative analytical workflow rather than a single terminal annotation step.
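As a minimal sketch of that final step, assume a vector whose values have already been harmonised to ISO 3166-1 alpha-2 codes. The values, the vocabulary, and the variable label below are illustrative and are not produced by the chunks above. The sketch uses `labelled::labelled()` and runs the labelling step only if the `labelled` package is installed.

```{r sketch-labelled}
# Hypothetical, already harmonised values.
stable_country <- c("AD", "AD", "LI", "LI", "SM", "SM")

# Closed vocabulary in the label = value form expected by labelled::labelled().
vocabulary <- c("Andorra" = "AD", "Liechtenstein" = "LI", "San Marino" = "SM")

# Formal labelling is only sensible once every value is covered.
stopifnot(all(stable_country %in% vocabulary))

if (requireNamespace("labelled", quietly = TRUE)) {
  country_labelled <- labelled::labelled(
    stable_country,
    labels = vocabulary,
    label = "Country (ISO 3166-1 alpha-2)"
  )
  labelled::val_labels(country_labelled)
}
```

The coverage check marks the point at which the provisional vocabulary has become a closed one: if any value were still unresolved, the chunk would stop before creating the formally labelled vector.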
