Setting Up Version Control for GIS Heritage Projects: A Definitive Pipeline for Spatial Data Integrity
Heritage GIS projects routinely fail at the version control layer when binary spatial formats collide with standard Git workflows. The most persistent automation bottleneck occurs during branch merges of archaeological survey layers, where the .shp/.shx/.dbf triad desynchronizes and embedded GeoTIFF CRS definitions silently drift. This degradation breaks spatial joins, invalidates topology checks, and corrupts longitudinal site records. The following reference provides exact configurations, diagnostic routines, and spatial validation hooks to stabilize version control for multi-researcher archaeological teams, heritage managers, and Python GIS developers.
1. Repository Initialization & Binary Asset Routing
Git stores binary geospatial files as opaque blobs: it cannot produce meaningful diffs for them or merge them, and every revision inflates the repository. Route these formats through Git LFS so that commits hold small pointer files instead of the data itself. Create a .gitattributes file at the repository root with the following routing rules:
*.shp filter=lfs diff=lfs merge=lfs -text
*.shx filter=lfs diff=lfs merge=lfs -text
*.dbf filter=lfs diff=lfs merge=lfs -text
*.prj filter=lfs diff=lfs merge=lfs -text
*.tif filter=lfs diff=lfs merge=lfs -text
*.tiff filter=lfs diff=lfs merge=lfs -text
*.gpkg filter=lfs diff=lfs merge=lfs -text
*.qgz filter=lfs diff=lfs merge=lfs -text
*.xml filter=lfs diff=lfs merge=lfs -text
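To check that every spatial file in a change set is covered by these rules, something like the following sketch could be used. It is a simplification that only understands basename glob patterns such as *.shp, and the function names are illustrative rather than part of any tool; git check-attr filter -- <path> remains the authoritative way to ask Git which attributes apply to a path.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath


def lfs_patterns(gitattributes_text: str) -> list[str]:
    """Return the path patterns whose attribute list includes filter=lfs."""
    patterns = []
    for line in gitattributes_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        pattern, *attrs = line.split()
        if "filter=lfs" in attrs:
            patterns.append(pattern)
    return patterns


def not_routed_to_lfs(paths: list[str], patterns: list[str]) -> list[str]:
    """Return the paths whose file name matches none of the LFS patterns.

    Assumption: patterns are simple basename globs (e.g. "*.shp"); full
    gitattributes matching rules are not reproduced here.
    """
    return [
        p for p in paths
        if not any(fnmatch(PurePosixPath(p).name, pat) for pat in patterns)
    ]


if __name__ == "__main__":
    rules = (
        "*.shp filter=lfs diff=lfs merge=lfs -text\n"
        "*.tif filter=lfs diff=lfs merge=lfs -text\n"
    )
    staged = ["data/a.shp", "data/a.cpg", "data/dem.tif"]
    # Lists the staged files that would be committed as ordinary Git blobs.
    print(not_routed_to_lfs(staged, lfs_patterns(rules)))
```

Run against the list of staged files, the result shows which sidecars (for example a .cpg encoding file) would bypass LFS and may need their own rule.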
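Initialize Git LFS, then use Data Version Control (DVC) for the heaviest raster archives and vector datasets to keep them out of commit history while preserving their lineage. A given file should be tracked by LFS or by DVC, not both: dvc add refuses files that Git already tracks, and it adds the files it does track to .gitignore, so the LFS rules above never see them. dvc push also assumes that a default remote has already been configured with dvc remote add: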
git lfs install
dvc init
dvc add data/survey_rasters/*.tif
dvc add data/vector_layers/site_boundaries.gpkg
dvc push
Structure your directory tree to align with established spatial data governance models. Aligning repository branches with the Heritage GIS Architecture & Fundamentals framework ensures spatial data, processing scripts, and documentation remain decoupled but fully traceable across excavation phases, post-excavation analysis, and publication.
2. Automated Pre-Commit Spatial Validation
Implement a Python-based pre-commit hook that runs automated spatial validation before each commit is created, so that corrupted layers never reach the remote or propagate through collaborative branches.
Create ./hooks/spatial_validator.py with the following exact logic:
#!/usr/bin/env python3
"""Pre-commit spatial validation for shapefile triads and GeoTIFF CRS."""
import sys
from pathlib import Path

from osgeo import gdal, ogr

# The checks below rely on None return values, so keep GDAL/OGR exceptions off.
gdal.DontUseExceptions()
ogr.DontUseExceptions()


def validate_shapefile_triad(shp_path: Path) -> bool:
    """Verify .shp, .shx, and .dbf exist and are readable."""
    required = [shp_path.with_suffix(ext) for ext in (".shp", ".shx", ".dbf")]
    missing = [p.name for p in required if not p.exists()]
    if missing:
        print(f"ERROR: Shapefile triad desync at {shp_path.with_suffix('')}: "
              f"missing {', '.join(missing)}")
        return False
    ds = ogr.Open(str(shp_path), 0)  # 0 = read-only
    if ds is None:
        print(f"ERROR: OGR unable to open {shp_path}")
        return False
    ds = None  # release the OGR handle
    return True


def validate_crs_drift(raster_path: Path, expected_epsg: int,
                       tolerance_meters: float = 0.001) -> bool:
    """Check GeoTIFF CRS against baseline and warn on off-grid raster origins."""
    ds = gdal.Open(str(raster_path))
    if ds is None:
        print(f"ERROR: GDAL unable to open {raster_path}")
        return False
    srs = ds.GetSpatialRef()
    if srs is None:
        print(f"ERROR: Missing CRS definition in {raster_path}")
        ds = None
        return False
    # GetAuthorityCode(None) reads the code of the root CRS node; it returns
    # None when the CRS carries no authority identifier.
    actual_epsg = srs.GetAuthorityCode(None)
    if actual_epsg != str(expected_epsg):
        print(f"CRITICAL: CRS drift detected. Expected EPSG:{expected_epsg}, "
              f"found EPSG:{actual_epsg}")
        ds = None
        return False
    # Warn (without failing) when the raster origin sits further than the
    # tolerance from a whole map unit. This assumes the project snaps raster
    # origins to whole metres; adjust or remove the check otherwise.
    gt = ds.GetGeoTransform()
    offsets = [min(v % 1.0, 1.0 - v % 1.0) for v in (gt[0], gt[3])]
    if max(offsets) > tolerance_meters:
        print(f"WARNING: Raster origin is offset from the whole-unit grid by "
              f"more than {tolerance_meters}m in {raster_path}")
    ds = None
    return True


if __name__ == "__main__":
    # Run against staged files (passed via pre-commit framework)
    staged_files = sys.argv[1:]
    baseline_epsg = 27700  # OSGB36 / British National Grid (adjust per project)
    passed = True
    for f in staged_files:
        path = Path(f)
        suffix = path.suffix.lower()  # accept .SHP, .TIF, etc.
        if suffix == ".shp":
            passed &= validate_shapefile_triad(path)
        elif suffix in (".tif", ".tiff"):
            passed &= validate_crs_drift(path, baseline_epsg)
    sys.exit(0 if passed else 1)
Register this hook in .pre-commit-config.yaml:
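repos:
  - repo: local
    hooks:
      - id: spatial-validator
        name: Validate Spatial Integrity
        entry: python ./hooks/spatial_validator.py
        # "system" runs the interpreter found on PATH, which must provide the
        # GDAL bindings; "python" would build an isolated environment without them.
        language: system
        types: [file]
        files: '\.(shp|tif|tiff)$'
        pass_filenames: true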
3. CRS Enforcement & Metadata Lineage
Coordinate Reference System drift is the primary cause of failed spatial joins in heritage datasets. When integrating legacy survey data with modern drone-derived orthomosaics, enforce strict EPSG alignment and projection transformation rules. Align your baseline with established CRS Selection for Heritage Sites protocols to avoid on-the-fly reprojection during analysis.
For metadata preservation, store ISO 19115-compliant XML sidecars alongside every spatial asset. Use gdal_edit.py to write CRS and metadata tags directly into GeoTIFF headers. Note that -a_srs only assigns a CRS label to the file; it does not reproject the pixels, so use it only when the data is already in that CRS:
gdal_edit.py -a_srs EPSG:27700 \
  -mo "TIFFTAG_IMAGEDESCRIPTION=Archaeological survey orthomosaic, 2024" \
  -mo "TIFFTAG_SOFTWARE=Agisoft Metashape 1.8" \
  data/survey_rasters/site_orthophoto.tif
Validate metadata completeness using the Metadata Standards for Archaeological Data checklist before merging feature branches. Missing creation_date, survey_method, or datum_transformation fields should trigger automated CI/CD failures.
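A completeness check of this kind could be sketched as below. The flat element names are an assumption taken from the field names above; a real ISO 19115 record nests this information in namespaced elements, so the lookup would need to be mapped to whatever profile the project actually uses. The function name is illustrative.

```python
import xml.etree.ElementTree as ET

# Field names taken from the checklist above; treated here as element names.
REQUIRED_FIELDS = ("creation_date", "survey_method", "datum_transformation")


def missing_metadata_fields(xml_text: str, required=REQUIRED_FIELDS) -> list[str]:
    """Return the required field names that are absent or empty in a sidecar."""
    root = ET.fromstring(xml_text)
    missing = []
    for name in required:
        # Compare on the local tag name so namespaced elements also match.
        values = [
            el.text for el in root.iter()
            if el.tag.rsplit("}", 1)[-1] == name
        ]
        if not any(v and v.strip() for v in values):
            missing.append(name)
    return missing


if __name__ == "__main__":
    sidecar = (
        "<metadata>"
        "<creation_date>2024-05-01</creation_date>"
        "<survey_method>UAV photogrammetry</survey_method>"
        "</metadata>"
    )
    gaps = missing_metadata_fields(sidecar)
    # A CI job would exit non-zero when any required field is missing.
    print("missing fields:", gaps)
    raise SystemExit(1 if gaps else 0)
```

In a CI job the non-zero exit status is what blocks the merge; the printed list tells the contributor which fields to fill in.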
4. Merge Conflict Resolution & Transactional Workflows
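Shapefiles are inherently fragile under concurrent editing. When multiple researchers modify vector layers simultaneously, the .shx shape index (the record-offset table for the .shp file) frequently falls out of sync with the geometry it points to. Transition primary working layers to GeoPackage (.gpkg), which keeps each dataset in a single SQLite file with transactional writes and R-tree spatial indexing. Git still cannot merge two divergent copies of a .gpkg, so concurrent edits to the same file must be serialized.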
After any merge or pull request, verify spatial integrity using exact diagnostic commands:
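# 1. Verify Git LFS objects against their recorded checksums
git lfs fsck
# 2. Confirm the GeoPackage layer still has its spatial index (returns 1 if present);
#    'geometry' is assumed to be the name of the layer's geometry column
ogrinfo data/vector_layers/site_features.gpkg -sql "SELECT HasSpatialIndex('features', 'geometry')"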
# 3. Extract and compare CRS definitions
gdalsrsinfo -o epsg data/survey_rasters/dem_2024.tif
If ogrinfo reports that it cannot open a shapefile's .shx file, if the spatial index check fails, or if the reported EPSG code does not match your baseline, treat the merged dataset as corrupt. This commonly occurs when desktop GIS clients auto-save project states during concurrent edits or when Python geoprocessing scripts write to the same .gpkg without explicit transaction boundaries. Reference your Project Scoping & Data Governance protocols to enforce pre-merge lock files, branch protection rules, and mandatory code reviews for spatial data mutations.
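A related failure is a working tree that still holds LFS pointer stubs where real data is expected, for instance after a clone without git lfs pull. The sketch below detects that case by looking for the version line that begins every Git LFS pointer file. The function names and the suffix list are assumptions for illustration.

```python
from pathlib import Path

# Every Git LFS pointer file begins with this version line.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"

SPATIAL_SUFFIXES = (".shp", ".shx", ".dbf", ".gpkg", ".tif", ".tiff")


def is_lfs_pointer(path: Path) -> bool:
    """True if the file on disk is an LFS pointer stub rather than real data."""
    with open(path, "rb") as fh:
        return fh.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


def unresolved_pointers(root: Path, suffixes=SPATIAL_SUFFIXES) -> list[Path]:
    """List spatial files under root that are still pointer stubs."""
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in suffixes and is_lfs_pointer(p)
    )


if __name__ == "__main__":
    for stub in unresolved_pointers(Path("data")):
        print(f"Pointer stub, content not downloaded: {stub}")
```

Any path it reports would fail to open in GDAL, so running it before the ogrinfo and gdalsrsinfo checks helps separate missing LFS content from genuine corruption.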
For Python-based geoprocessing, enforce explicit transaction boundaries:
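from osgeo import ogr

ogr.UseExceptions()  # make OGR failures raise so the except block can roll back

ds = ogr.Open("data/vector_layers/site_features.gpkg", 1)  # 1 = open for update
layer = ds.GetLayer("features")
# Placeholder feature; set its geometry and attributes as the workflow requires
feature = ogr.Feature(layer.GetLayerDefn())

ds.StartTransaction()
try:
    # Execute feature insertions/updates here
    layer.CreateFeature(feature)
    ds.CommitTransaction()
except Exception:
    ds.RollbackTransaction()
    raise
finally:
    # Drop references so GDAL flushes and closes the file
    feature = None
    layer = None
    ds = None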
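Transactions protect a single write session; the pre-merge lock files mentioned above address the separate problem of two scripts opening the same file at once. One possible shape for such a lock is sketched below. The .lock naming convention and the advisory_lock helper are assumptions, and the lock is purely advisory: it only restrains scripts that follow the same convention, desktop GIS clients will not honor it, and a crashed process can leave a stale lock behind.

```python
import os
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def advisory_lock(target: Path):
    """Hold an exclusive lock file next to `target` while the block runs.

    Raises FileExistsError if another process already holds the lock.
    """
    lock_path = target.with_name(target.name + ".lock")
    # O_CREAT | O_EXCL fails atomically if the lock file already exists.
    fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    try:
        os.write(fd, str(os.getpid()).encode())  # record the owner for debugging
        yield lock_path
    finally:
        os.close(fd)
        os.unlink(lock_path)


if __name__ == "__main__":
    gpkg = Path("data/vector_layers/site_features.gpkg")
    try:
        with advisory_lock(gpkg):
            print(f"Lock held; safe to open {gpkg} for update")
    except FileExistsError:
        print(f"{gpkg} is locked by another process; try again later")
    except FileNotFoundError:
        print(f"Directory {gpkg.parent} does not exist")
```

Wrapping the transactional block in this context manager means a second script fails fast instead of competing for the same .gpkg.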
5. Archival Integrity & Cross-Platform Verification
Longitudinal heritage datasets require deterministic archival workflows. Before finalizing a release branch, execute cross-platform interoperability checks to ensure datasets render identically across QGIS, ArcGIS Pro, and web-based mapping stacks. Run the following validation sequence:
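# List invalid geometries as GeoJSON on stdout (no features in the output means all are valid).
# Requires a GDAL build with SpatiaLite; 'geometry' is assumed to be the geometry column name.
ogr2ogr -f GeoJSON -dialect sqlite -sql "SELECT * FROM features WHERE NOT ST_IsValid(geometry)" \
  /vsistdout/ data/vector_layers/site_features.gpkg
# Generate SHA-256 checksums for immutable archival (shapefile sidecars included)
find data/ -type f \( -name "*.shp" -o -name "*.shx" -o -name "*.dbf" -o -name "*.prj" \
  -o -name "*.gpkg" -o -name "*.tif" \) \
  -exec sha256sum {} + > data/checksums.sha256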
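The manifest can later be re-verified with sha256sum -c data/checksums.sha256, or from Python on platforms without that tool. The following is a minimal sketch, assuming the manifest uses the standard sha256sum line layout (digest, whitespace, path) and that paths are relative to the directory where find was run. The function names are illustrative.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large rasters are never loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest: Path, base_dir: Path) -> list[str]:
    """Return the manifest paths that are missing or whose digest differs."""
    failures = []
    for line in manifest.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # sha256sum prefixes binary-mode entries with '*'
        target = base_dir / name
        if not target.is_file() or sha256_of(target) != expected:
            failures.append(name)
    return failures


if __name__ == "__main__":
    manifest_path = Path("data/checksums.sha256")
    if manifest_path.is_file():
        bad = verify_manifest(manifest_path, Path("."))
        print("all checksums match" if not bad else f"mismatched or missing: {bad}")
    else:
        print(f"{manifest_path} not found; generate it first")
```

An empty result means every archived file still matches the digest recorded at release time.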
When preparing datasets for institutional repositories, align your packaging strategy with Long-Term Digital Preservation for Heritage GIS best practices. Strip proprietary .qgz project files from the final release archive, replace them with standardized layer_styles.qml exports, and verify that all relative paths resolve correctly. Implement Cross-Platform GIS Interoperability Testing by running automated pytest suites that load the archived .gpkg and .tif assets against a headless GDAL environment, confirming coordinate bounds, attribute schemas, and raster band counts match the baseline manifest.
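The comparison step of such a suite could look like the sketch below. It deliberately leaves out the GDAL calls: the observed values are assumed to have been read elsewhere (for example layer extents, field names, and band counts), and the manifest layout with bounds, fields, and band_count keys is a hypothetical structure, not an established format.

```python
def compare_to_baseline(observed: dict, baseline: dict,
                        bounds_tol: float = 1e-6) -> list[str]:
    """Return human-readable mismatches between observed assets and a baseline.

    Both arguments map asset names to dicts that may hold:
      "bounds"     -> (minx, miny, maxx, maxy)
      "fields"     -> list of attribute names (vector layers)
      "band_count" -> int (rasters)
    """
    problems = []
    for asset, expected in baseline.items():
        actual = observed.get(asset)
        if actual is None:
            problems.append(f"{asset}: missing from archive")
            continue
        if "bounds" in expected:
            diffs = [abs(a - e)
                     for a, e in zip(actual.get("bounds", ()), expected["bounds"])]
            # Fewer than four comparisons means the observed bounds are absent.
            if len(diffs) != 4 or max(diffs) > bounds_tol:
                problems.append(f"{asset}: bounds mismatch")
        for key in ("fields", "band_count"):
            if key in expected and actual.get(key) != expected[key]:
                problems.append(f"{asset}: {key} mismatch")
    return problems


def test_archive_matches_baseline():
    # Illustrative values only; a real suite would read them from the archive.
    baseline = {"dem_2024.tif": {"bounds": (0.0, 0.0, 10.0, 10.0), "band_count": 1}}
    observed = {"dem_2024.tif": {"bounds": (0.0, 0.0, 10.0, 10.0), "band_count": 1}}
    assert compare_to_baseline(observed, baseline) == []
```

Keeping the comparison free of GDAL makes it easy to unit test, while the GDAL-dependent readers can be exercised separately in the headless environment.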
By enforcing strict pointer tracking, automated spatial validation, transactional locking, and deterministic archival checksums, heritage GIS teams can eliminate version control degradation and maintain spatially accurate, research-grade datasets across multi-year excavation and analysis lifecycles.