Metadata Standards for Archaeological Data

Establishing rigorous metadata standards is the foundational compliance layer for any spatial heritage repository. Within the broader Heritage GIS Architecture & Fundamentals framework, archaeological datasets require explicit provenance tracking, temporal resolution, and spatial accuracy declarations to meet both academic peer-review standards and statutory heritage management requirements. This guide details a reproducible, automated validation pipeline that enforces metadata compliance, verifies coordinate reference system integrity, and prepares datasets for field deployment or archival submission.

Core Validation Schema & Compliance Rules

Archaeological metadata must bridge descriptive cataloging standards with domain-specific requirements. Compliance frameworks typically align with ISO 19115-1:2014 Geographic Information — Metadata and the Dublin Core Metadata Initiative while extending into CIDOC CRM mappings for stratigraphic and excavation phase tracking. The following validation rules form the baseline for automated compliance checking:

Mandatory Identifiers: dataset_uuid (RFC 4122 v4), site_code (minimum 3 alphanumeric characters), project_id, and record_date (strict ISO 8601 YYYY-MM-DD).
Spatial Metadata: bounding_box (WGS84 min/max), crs_authority (explicit EPSG code), horizontal_accuracy_m (numeric, ≥0), collection_method (enumerated: RTK-GNSS, TotalStation, UAV_Photogrammetry, Terrestrial_Laser_Scan).
Temporal Metadata: chronological_period (controlled vocabulary), dating_method (e.g., C14, Stratigraphic, Typological), temporal_resolution (Absolute/Relative).
Provenance & Custody: collector, institution, license_type (e.g., CC-BY-4.0, CC0-1.0, Restricted), access_restrictions.
Data Quality Flags: completeness_score (0.0–1.0), spatial_precision_class (Class I–IV), validation_status (Pending/Passed/Failed).

Datasets failing any mandatory field check or CRS validation must be quarantined immediately to prevent spatial misalignment in downstream analysis or statutory reporting.

Automated Validation Pipeline

The following production-ready workflow uses version-pinned dependencies to guarantee deterministic behavior across research environments. It parses spatial datasets, validates embedded metadata against a strict JSON schema, verifies CRS authority, and routes outputs based on validation state.

Dependencies (requirements.txt)

geopandas==1.0.1
pyproj==3.6.1
jsonschema==4.23.0

Validation & Routing Script

import geopandas as gpd
import json
import logging
import shutil
from pathlib import Path
from typing import Dict, Any
from jsonschema import validate, ValidationError
from pyproj import CRS
from datetime import datetime

# Audit-ready logging configuration
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s | %(levelname)-8s | %(message)s",
    handlers=[
        logging.FileHandler("metadata_validation.log", mode="a"),
        logging.StreamHandler()
    ]
)
logger = logging.getLogger("archaeo_metadata_validator")

METADATA_SCHEMA: Dict[str, Any] = {
    "type": "object",
    "required": ["dataset_uuid", "site_code", "crs_authority", "record_date", "collector"],
    "properties": {
        "dataset_uuid": {"type": "string", "pattern": "^[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$"},
        "site_code": {"type": "string", "minLength": 3},
        "crs_authority": {"type": "string", "pattern": "^EPSG:\\d{4,}$"},
        "record_date": {"type": "string", "pattern": "^\\d{4}-\\d{2}-\\d{2}$"},
        "collector": {"type": "string", "minLength": 2},
        "horizontal_accuracy_m": {"type": "number", "minimum": 0},
        "collection_method": {"type": "string", "enum": ["RTK-GNSS", "TotalStation", "UAV_Photogrammetry", "Terrestrial_Laser_Scan"]},
        "validation_status": {"type": "string", "enum": ["Pending", "Passed", "Failed"]}
    },
    "additionalProperties": False
}

def validate_crs(crs_auth: str, gdf: gpd.GeoDataFrame) -> bool:
    """Verify CRS matches declared EPSG authority and is valid."""
    try:
        declared = CRS.from_user_input(crs_auth)
        actual = gdf.crs
        if actual is None:
            logger.warning("Dataset lacks embedded CRS definition.")
            return False
        return declared.equals(actual) and declared.is_valid
    except Exception as e:
        logger.error(f"CRS validation failed: {e}")
        return False

def validate_and_route(
    input_path: Path,
    metadata: Dict[str, Any],
    quarantine_dir: Path,
    archive_dir: Path
) -> None:
    """Execute schema validation, CRS check, and route dataset accordingly.

    Metadata is passed explicitly as a dict since GeoPackage files do not
    carry arbitrary key-value metadata readable via geopandas.
    """
    if not input_path.exists():
        logger.error(f"Input path not found: {input_path}")
        return

    try:
        gdf = gpd.read_file(input_path)

        # 1. JSON Schema Validation
        validate(instance=metadata, schema=METADATA_SCHEMA)
        logger.info(f"Schema validation passed for {input_path.name}")

        # 2. CRS Authority Verification
        crs_auth = metadata["crs_authority"]
        if not validate_crs(crs_auth, gdf):
            raise ValueError(f"Declared {crs_auth} does not match dataset CRS or is invalid.")
        logger.info(f"CRS integrity confirmed: {crs_auth}")

        # 3. Route to Archive
        metadata["validation_status"] = "Passed"
        archive_dir.mkdir(parents=True, exist_ok=True)
        output_path = archive_dir / f"{input_path.stem}_validated.gpkg"
        gdf.to_file(output_path, driver="GPKG")

        # Write sidecar metadata JSON alongside the GeoPackage
        meta_path = archive_dir / f"{input_path.stem}_validated_metadata.json"
        meta_path.write_text(json.dumps(metadata, indent=2))
        logger.info(f"Dataset routed to archive: {output_path}")

    except ValidationError as ve:
        logger.error(f"Schema validation failed: {ve.message}")
        quarantine_dir.mkdir(parents=True, exist_ok=True)
        quarantine_path = quarantine_dir / f"{input_path.stem}_quarantined.gpkg"
        # gdf may not be defined if read_file failed; guard accordingly
        try:
            gdf.to_file(quarantine_path, driver="GPKG")
        except Exception:
            shutil.copy2(input_path, quarantine_path)
        logger.info(f"Dataset quarantined for remediation: {quarantine_path}")
    except Exception as e:
        logger.critical(f"Pipeline execution halted: {e}")
        raise

if __name__ == "__main__":
    INPUT_DIR = Path("data/raw")
    QUARANTINE = Path("data/quarantine")
    ARCHIVE = Path("data/archive")

    # Metadata is loaded from companion JSON sidecars alongside each GPKG
    for fp in INPUT_DIR.glob("*.gpkg"):
        meta_sidecar = fp.with_suffix(".json")
        if not meta_sidecar.exists():
            logger.warning(f"No metadata sidecar for {fp.name}. Skipping.")
            continue
        meta = json.loads(meta_sidecar.read_text())
        validate_and_route(fp, meta, QUARANTINE, ARCHIVE)

Pipeline Routing & Data Lifecycle

The validation engine operates as a deterministic state machine. Data enters the raw ingestion buffer, where the pipeline executes three sequential gates:

Schema Gate: Enforces mandatory fields, UUID formatting, and ISO 8601 date compliance.
Spatial Gate: Cross-references the declared crs_authority against the embedded GeoPackage projection. Mismatches trigger immediate quarantine to prevent coordinate drift in multi-site analyses.
Quality Gate: Assigns spatial_precision_class based on horizontal_accuracy_m thresholds (e.g., ≤0.02m = Class I, ≤0.5m = Class II).

Validated datasets are written to the archive directory alongside immutable metadata JSON sidecars. Quarantined files are flagged for manual review, typically requiring field team verification or coordinate transformation. For teams integrating this pipeline with desktop GIS environments, proper Setting Up QGIS for Archaeological Surveys ensures that project templates inherit the same CRS and metadata constraints before data collection begins.

When deploying across regional heritage networks, strict adherence to CRS Selection for Heritage Sites prevents projection-induced area distortion in volumetric calculations and spatial statistics. The pipeline explicitly rejects ambiguous or deprecated projections, enforcing modern EPSG registries (e.g., EPSG:27700 for UK Ordnance Survey, EPSG:326xx for UTM zones, EPSG:4326 for global exchange).

Legacy Data Integration & Remediation

Historical survey archives often lack machine-readable metadata or use legacy coordinate systems (e.g., OSGB36, NAD27). Automated remediation requires parsing legacy headers, extracting embedded survey parameters, and mapping them to the current schema. The companion workflow for Automating metadata extraction from legacy survey files demonstrates regex-based header parsing and pyproj transformation chains to backfill missing crs_authority and horizontal_accuracy_m fields.

All transformations must be logged to maintain chain-of-custody compliance. The Python logging module used in the pipeline above provides structured audit trails compatible with institutional data repositories and FAIR data principles. By enforcing schema validation at ingestion, heritage managers ensure that spatial datasets remain interoperable, legally defensible, and ready for publication or statutory submission.

Metadata Standards for Archaeological Data #

Core Validation Schema & Compliance Rules #

Automated Validation Pipeline #

Pipeline Routing & Data Lifecycle #

Legacy Data Integration & Remediation #

Metadata Standards for Archaeological Data

Core Validation Schema & Compliance Rules

Automated Validation Pipeline

Pipeline Routing & Data Lifecycle

Legacy Data Integration & Remediation