GDLM — Governed Data Lineage Model

📋 The Problem

Syntropa ingests farm data of varying quality from multiple sources (WhatsApp voice logs, photos, manual data entry, financial records) across 27+ farms. Without lineage tracking:

→ We don't know who recorded a data point, when, or on which device
→ Whisper transcription errors propagate silently — no way to flag or correct them
→ Data entry quality varies by person (some workers are meticulous, others inconsistent)
→ Financial figures get entered without an audit trail, and corrections overwrite the originals
→ No way to answer: "Where did this number come from?"

🧭 Core Principles

🔗 Every Record Has Lineage

Every data point traces back to its source: a voice message, a photo, a manual entry, or a computed derivation. Nothing exists without provenance.

📝 Corrections, Not Overwrites

Data is never overwritten. Corrections create new versions with links to the original. The full history is always preserved.

⚖️ Trust Is Weighted

Not all sources are equal. A farm manager's voice log carries more weight than an auto-transcription. Trust scores propagate through derived data.

🗣️ Voice-First Entry

Farm workers record in Tamil via WhatsApp voice messages — the primary data source. The system must respect this oral tradition and make it auditable.

🔄 Data Flow Pipeline

📱 WhatsApp (Voice / Photo) → 📦 Export ZIP (GDrive Ingest) → 🧠 Whisper GPU (Transcription) → 🤖 LLM Cleanup (Ollama/GPT) → 💾 SurrealDB (Versioned Store) → 🌐 Syntropa (Web Frontend)

📍 Lineage Attached At Every Stage

Each stage stamps metadata onto the record — source file, processing timestamps, model version, confidence score, and who/what performed the operation. If a transcription is corrected by a human, the correction links back to the original Whisper output.
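The stamping step described above could look roughly like the following. This is a minimal sketch, not the pipeline's actual code: the helper name `stamp_step` and its parameters are hypothetical, and only the field names (`step`, `model`, `processed_at`, `confidence`, `output`) are taken from the record schema.

```python
from datetime import datetime, timezone
from typing import Any, Optional


def stamp_step(record: dict[str, Any], step: str, model: str, output: str,
               confidence: Optional[float] = None,
               now: Optional[datetime] = None) -> dict[str, Any]:
    """Return a copy of `record` with one more processing entry appended.

    `stamp_step` is a hypothetical helper; `now` must be timezone-aware.
    """
    when = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    entry: dict[str, Any] = {
        "step": step,
        "model": model,
        "processed_at": when.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "output": output,
    }
    if confidence is not None:
        entry["confidence"] = confidence
    # Build a new list so the caller's record is left untouched.
    return {**record, "processing": [*record.get("processing", []), entry]}


record = {"record_id": "REC-2025-OOM-00347", "processing": []}
t0 = datetime(2025, 9, 15, 14, 22, 0, tzinfo=timezone.utc)
t1 = datetime(2025, 9, 15, 14, 22, 5, tzinfo=timezone.utc)
stamped = stamp_step(record, "whisper_transcription", "whisper-large-v3",
                     "raw transcript", confidence=0.72, now=t0)
stamped = stamp_step(stamped, "llm_cleanup", "ollama/llama3-70b",
                     "cleaned transcript", now=t1)
```

Returning a new dictionary rather than mutating the input keeps each stage's output separate from the one before it, which fits the "corrections, not overwrites" principle.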

📐 GDLM Record Schema

// Every record in Syntropa follows this envelope
{
  "record_id": "REC-2025-OOM-00347",
  "farm": "oomathurai",
  "record_type": "voice_log",

  // Source lineage
  "source": {
    "type": "whatsapp_voice",
    "file": "AUDIO-2025-08-13-00-23-08.opus",
    "sender": "Velmurugan",
    "sender_role": "farm_manager",
    "recorded_at": "2025-08-13T00:23:08+05:30",
    "duration_seconds": 14.9
  },

  // Processing lineage
  "processing": [
    {
      "step": "whisper_transcription",
      "model": "whisper-large-v3",
      "gpu": "RTX-3050-6GB",
      "host": "boston.thirucloud.com",
      "processed_at": "2025-09-15T14:22:00Z",
      "confidence": 0.72,
      "output": "நேத்து மழை பெய்ததால காலை..."
    },
    {
      "step": "llm_cleanup",
      "model": "ollama/llama3-70b",
      "processed_at": "2025-09-15T14:22:05Z",
      "output": "நேற்று மழை பெய்ததால் காலை..."
    }
  ],

  // Version history (corrections never overwrite)
  "versions": [
    { "v": 1, "by": "whisper", "at": "2025-09-15", "trust": 0.72 },
    { "v": 2, "by": "ollama", "at": "2025-09-15", "trust": 0.85 },
    { "v": 3, "by": "Renuika", "at": "2025-09-16", "trust": 0.95 }
  ],
  "current_version": 3,
  "trust_score": 0.95
}
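
With an envelope like this, the question "Where did this number come from?" can be answered by walking the three lineage sections in order. The sketch below assumes the envelope has been loaded as a plain Python dictionary (with the comments removed, since standard JSON does not allow them). The function name `trace_lineage` is hypothetical, and the record is a trimmed copy of the example above.

```python
from typing import Any

# Trimmed copy of the example envelope, comments removed.
record: dict[str, Any] = {
    "record_id": "REC-2025-OOM-00347",
    "source": {
        "type": "whatsapp_voice",
        "file": "AUDIO-2025-08-13-00-23-08.opus",
        "sender": "Velmurugan",
        "recorded_at": "2025-08-13T00:23:08+05:30",
    },
    "processing": [
        {"step": "whisper_transcription", "model": "whisper-large-v3",
         "processed_at": "2025-09-15T14:22:00Z"},
        {"step": "llm_cleanup", "model": "ollama/llama3-70b",
         "processed_at": "2025-09-15T14:22:05Z"},
    ],
    "versions": [
        {"v": 1, "by": "whisper", "at": "2025-09-15", "trust": 0.72},
        {"v": 2, "by": "ollama", "at": "2025-09-15", "trust": 0.85},
        {"v": 3, "by": "Renuika", "at": "2025-09-16", "trust": 0.95},
    ],
    "current_version": 3,
    "trust_score": 0.95,
}


def trace_lineage(rec: dict[str, Any]) -> list[str]:
    """Summarise source, processing steps, and versions as readable lines."""
    src = rec["source"]
    lines = [f"source: {src['type']} {src['file']} by {src['sender']} "
             f"at {src['recorded_at']}"]
    for step in rec["processing"]:
        lines.append(f"step: {step['step']} using {step['model']} "
                     f"at {step['processed_at']}")
    for ver in rec["versions"]:
        marker = " (current)" if ver["v"] == rec["current_version"] else ""
        lines.append(f"v{ver['v']}: by {ver['by']} on {ver['at']}, "
                     f"trust {ver['trust']:.2f}{marker}")
    return lines


for line in trace_lineage(record):
    print(line)
```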

⚖️ Trust Scoring Model

Every data source and processor gets a trust weight. These propagate: a record corrected by a high-trust human gets a higher score than raw machine output.

Score Range	Trust Band
0.0–0.3	🔴 Unreliable
0.3–0.5	🟠 Raw/Unverified
0.5–0.7	🟡 Machine-Processed
0.7–0.9	🟢 Reviewed
0.9–1.0	🔵 Verified/Golden
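
A score can be mapped to its band with a short lookup. The bands as listed share their boundary values (0.3, 0.5, 0.7, 0.9), so this sketch assumes each boundary belongs to the higher band, with 1.0 included in the top band. That tie-breaking rule and the function name `trust_band` are assumptions, not something the model specifies.

```python
def trust_band(score: float) -> str:
    """Map a trust score in [0.0, 1.0] to its band label.

    Assumption: lower bounds are inclusive and upper bounds exclusive,
    so a score of exactly 0.7 counts as "Reviewed".
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"trust score out of range: {score}")
    upper_bounds = [
        (0.3, "Unreliable"),
        (0.5, "Raw/Unverified"),
        (0.7, "Machine-Processed"),
        (0.9, "Reviewed"),
    ]
    for upper, label in upper_bounds:
        if score < upper:
            return label
    return "Verified/Golden"


print(trust_band(0.55))  # Whisper raw transcription base trust
print(trust_band(0.95))  # trust of v3 in the example record
```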

Source Trust Weights

Source Type	Base Trust	Notes
Farm Manager voice log	0.80	Primary authority (Velmurugan, Prabu, Sivakumar)
Data Entry — Renuika	0.90	Highest accuracy data entry (RRC Pannai)
Data Entry — Other	0.65	Varies by individual performance
Whisper raw transcription	0.55	Tamil accuracy ~70% with large-v3 on RTX 3050
LLM cleanup (Ollama)	0.75	Grammar/spelling fixes, structured extraction
LLM cleanup (GPT-4)	0.85	Higher Tamil accuracy than local models
Owner verification (Thiru)	0.98	Final authority — golden record
WhatsApp photo EXIF	0.95	Machine-stamped timestamp, GPS
Financial receipt scan	0.90	Original document — OCR may reduce to 0.70
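
The table above could be encoded as a simple lookup. In this sketch the dictionary keys and the `via_ocr` flag are hypothetical identifiers chosen for illustration; only the numeric weights come from the table, including the note that OCR may reduce a receipt scan to 0.70.

```python
# Base trust per source type. Key names are illustrative, values are from the table.
BASE_TRUST: dict[str, float] = {
    "farm_manager_voice_log": 0.80,
    "data_entry_renuika": 0.90,
    "data_entry_other": 0.65,
    "whisper_raw": 0.55,
    "llm_cleanup_ollama": 0.75,
    "llm_cleanup_gpt4": 0.85,
    "owner_verification": 0.98,
    "whatsapp_photo_exif": 0.95,
    "financial_receipt_scan": 0.90,
}

# Trust applied to a receipt whose figures were read by OCR, per the table note.
RECEIPT_OCR_TRUST = 0.70


def base_trust(source_type: str, via_ocr: bool = False) -> float:
    """Look up the base trust for a source; unknown types raise KeyError."""
    trust = BASE_TRUST[source_type]
    if via_ocr and source_type == "financial_receipt_scan":
        return min(trust, RECEIPT_OCR_TRUST)
    return trust


print(base_trust("whisper_raw"))
print(base_trust("financial_receipt_scan", via_ocr=True))
```

Raising on unknown source types, rather than falling back to a default weight, avoids silently assigning trust to a source nobody has assessed.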

📦 Record Types in Syntropa

Type	Source	Example	Volume
voice_log	WhatsApp audio → Whisper	Daily farm activity narration	3,337+ (Oomathurai)
photo	WhatsApp images	Crop photos, cattle, fields	3,952 across 12 farms
expense	Voice log / manual	₹43,701 ploughing — Sevakkon	100s per farm/year
livestock	Manual / voice log	Cow "Megalai" moved to Vellunachiyaar	70+ cattle, 120+ goats
event	Voice log extraction	Wild pig damage, harvest, planting	Daily across farms
crop_cycle	Derived from events	Groundnut: planted Aug → harvested Nov	Per field per season
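
The `crop_cycle` type is derived from `event` records, so its lineage and trust have to come from its inputs. The sketch below shows one way this might work. The event fields (`kind`, `field`, `crop`, `date`), the `derived_from` list, the sample IDs and dates, and the rule that a derived record takes the lowest trust among its inputs are all assumptions made for illustration; the model itself only states that trust propagates through derived data.

```python
from typing import Any


def derive_crop_cycle(events: list[dict[str, Any]], field: str,
                      crop: str) -> dict[str, Any]:
    """Build a crop_cycle record from planting and harvest events.

    Assumed propagation rule: the derived record is only as trustworthy
    as its least trusted input.
    """
    relevant = [e for e in events if e["field"] == field and e["crop"] == crop]
    planted = [e for e in relevant if e["kind"] == "planting"]
    harvested = [e for e in relevant if e["kind"] == "harvest"]
    if not planted or not harvested:
        raise ValueError("need at least one planting and one harvest event")
    # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
    start = min(planted, key=lambda e: e["date"])
    end = max(harvested, key=lambda e: e["date"])
    inputs = [start, end]
    return {
        "record_type": "crop_cycle",
        "field": field,
        "crop": crop,
        "planted_on": start["date"],
        "harvested_on": end["date"],
        "derived_from": [e["record_id"] for e in inputs],
        "trust_score": min(e["trust_score"] for e in inputs),
    }


# Hypothetical events; IDs, field name, and dates are made up for the example.
events = [
    {"record_id": "EVT-A", "kind": "planting", "field": "field-1",
     "crop": "groundnut", "date": "2025-08-20", "trust_score": 0.85},
    {"record_id": "EVT-B", "kind": "harvest", "field": "field-1",
     "crop": "groundnut", "date": "2025-11-18", "trust_score": 0.75},
]
cycle = derive_crop_cycle(events, "field-1", "groundnut")
print(cycle["derived_from"], cycle["trust_score"])
```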

📝 Version Control Rules

🔒 Immutable Append-Only Log

GDLM uses an append-only version chain. Every record version is immutable once written. Corrections create a new version entry with:

Field	Description
`v`	Incrementing version number
`by`	Who/what made this version (person or system ID)
`at`	ISO timestamp of correction
`trust`	Trust score of this version
`delta`	What changed (field-level diff)
`reason`	Optional: why the correction was made

The `current_version` pointer always references the highest-trust version. The UI shows the current version by default and lets users drill into the full history.
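
A minimal sketch of such an append-only chain follows. The class names `Version` and `VersionChain`, the shape of `delta`, and the tie-break (the later version wins when trust is equal) are assumptions; the field names come from the table above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Version:
    """One entry in the chain; frozen=True blocks attribute reassignment."""
    v: int
    by: str
    at: str
    trust: float
    delta: dict  # field-level diff; note the dict itself is still mutable
    reason: Optional[str] = None


class VersionChain:
    """Append-only list of versions with a highest-trust pointer."""

    def __init__(self) -> None:
        self._versions: tuple[Version, ...] = ()

    def append(self, by: str, at: str, trust: float, delta: dict,
               reason: Optional[str] = None) -> Version:
        version = Version(len(self._versions) + 1, by, at, trust, delta, reason)
        # Rebuild the tuple instead of editing in place; old entries never change.
        self._versions = (*self._versions, version)
        return version

    @property
    def versions(self) -> tuple[Version, ...]:
        return self._versions

    @property
    def current_version(self) -> int:
        if not self._versions:
            raise ValueError("chain has no versions")
        # Highest trust wins; on a tie the later version number is preferred.
        return max(self._versions, key=lambda ver: (ver.trust, ver.v)).v


chain = VersionChain()
chain.append("whisper", "2025-09-15", 0.72, {})
chain.append("ollama", "2025-09-15", 0.85, {"text": "spelling normalised"})
chain.append("Renuika", "2025-09-16", 0.95, {"text": "manual correction"},
             reason="verified against audio")
print(chain.current_version)
```

Under this rule a later, lower-trust edit is recorded in the history but does not displace the current version.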

🌾 Farm-Specific Implementation

🏹 Oomathurai Pannai — Pilot Farm

The first GDLM-tracked farm. 3,337 voice logs from Velmurugan (manager) and Devi have been transcribed via Whisper on the Boston GPU server (RTX 3050). The LLM cleanup pipeline is in progress. The voice log player at voice-log-player.html plays each recording alongside its transcription for manual verification.

Current state: v1 (Whisper raw) for all records. v2 (LLM cleanup) partially complete. v3 (human review) pending.

⚔️ Marudhu Brothers (மருது சகோதரர்கள்) — Active Transcription

4,285 audio files are being transcribed on the Boston GPU server, and 1,001 photos are already in the gallery. GDLM tracking begins at ingest: each transcription batch is logged with its model version, GPU, and processing timestamp.

👑 RRC Pannai — Best Data Entry

Renuika (trust: 0.90) handles data entry for Raja Raja Cholan Pannai. Her entries serve as the benchmark for data quality scoring. GDLM will use her verified records to train quality classifiers for other farms.

🛠️ Technology Stack

Component	Technology	Role in GDLM
Transcription	Whisper large-v3 (RTX 3050)	Source records with confidence scores
LLM Cleanup	Ollama (local) / GPT-4 (API)	Version 2 processing with trust upgrade
Storage	SurrealDB / JSON flat files	Versioned record store with lineage
Media	Boston NFS (/mnt/workspace)	Audio/photo with EXIF preservation
Frontend	Syntropa (static HTML/JS)	Version viewer, trust display, audit trail
Infrastructure	ThiruCloud Boston server	Manassas VA, managed by Matt Patton
Backup	Wasabi S3	Offsite backup of all GDLM records

🗺️ Implementation Roadmap

Phase	Scope	Status
Phase 0	Whisper transcription with file-level lineage (filename, timestamp, farm)	Done
Phase 1	LLM cleanup pipeline — v2 records with trust upgrade	In Progress
Phase 2	GDLM JSON envelope on all new records. Version history UI in Syntropa	Planned
Phase 3	Human review workflow — voice playback + correction interface	Planned
Phase 4	Trust scoring engine — auto-score based on source, corrections, age	Planned
Phase 5	Cross-farm GDLM — unified lineage across all 27 farms	Future
Phase 6	SurrealDB migration — full graph-based lineage with query support	Future

🌐 Scope: All ThiruCloud Sites

GDLM is designed to extend beyond Syntropa to all ThiruCloud-hosted sites where data lineage matters:

Site	GDLM Use Case
syntropa.com	Farm data lineage — voice logs, photos, finances, livestock
antikva.in	Product catalog lineage — farm-to-table traceability
createvaluetech.com	Advisory content lineage — source attribution
anjaraipetti.xyz	Tamil cultural content — source and translation tracking