Architecture Specification

GDLM — Governed Data Lineage Model

A specification for tracking the origin, corrections, trust, and lineage of every piece of farm data across the Syntropa ecosystem — from WhatsApp voice log to structured record.
v0.1.0 · Syntropa · ThiruCloud · Eyarkai Puratchi Niruvanam

📋 The Problem

Syntropa ingests farm data of varying quality from multiple sources (WhatsApp voice logs, photos, manual data entry, financial records) across 27+ farms. Without lineage tracking:

→ We don't know who recorded a data point, when, or on which device
→ Whisper transcription errors propagate silently — no way to flag or correct them
→ Data entry quality varies by person (some workers are meticulous, others inconsistent)
→ Financial figures get entered without an audit trail, and corrections overwrite originals
→ No way to answer: "Where did this number come from?"

🧭 Core Principles

🔗 Every Record Has Lineage
Every data point traces back to its source: a voice message, a photo, a manual entry, or a computed derivation. Nothing exists without provenance.

📝 Corrections, Not Overwrites
Data is never overwritten. Corrections create new versions with links to the original. The full history is always preserved.

⚖️ Trust Is Weighted
Not all sources are equal. A farm manager's voice log carries more weight than an auto-transcription. Trust scores propagate through derived data.

🗣️ Voice-First Entry
Farm workers record in Tamil via WhatsApp voice messages — the primary data source. The system must respect this oral tradition and make it auditable.

🔄 Data Flow Pipeline

📱 WhatsApp (Voice / Photo) → 📦 Export ZIP (GDrive Ingest) → 🧠 Whisper GPU (Transcription) → 🤖 LLM Cleanup (Ollama/GPT) → 💾 SurrealDB (Versioned Store) → 🌐 Syntropa (Web Frontend)

📍 Lineage Attached At Every Stage

Each stage stamps metadata onto the record — source file, processing timestamps, model version, confidence score, and who/what performed the operation. If a transcription is corrected by a human, the correction links back to the original Whisper output.
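
The stamping described above can be sketched in Python as follows. This is a minimal illustration, not the pipeline's actual code: the function name stamp_step and its parameters are hypothetical, and only the field names (step, model, processed_at, confidence, output) come from the record schema in this document. The sketch assumes each stage returns a new copy of the record instead of editing it in place.

```python
from copy import deepcopy
from datetime import datetime, timezone


def stamp_step(record, step, model, output, confidence=None, now=None):
    """Return a copy of `record` with one more processing-lineage entry.

    `now` should be a timezone-aware datetime; it defaults to the current UTC time.
    """
    stamped = deepcopy(record)  # never mutate the caller's record
    moment = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    entry = {
        "step": step,
        "model": model,
        "processed_at": moment.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "output": output,
    }
    if confidence is not None:
        entry["confidence"] = confidence
    stamped.setdefault("processing", []).append(entry)
    return stamped


record = {
    "record_id": "REC-2025-OOM-00347",
    "source": {"type": "whatsapp_voice", "file": "AUDIO-2025-08-13-00-23-08.opus"},
}
when = datetime(2025, 9, 15, 14, 22, 0, tzinfo=timezone.utc)
after_whisper = stamp_step(record, "whisper_transcription", "whisper-large-v3",
                           "raw transcript", confidence=0.72, now=when)
after_cleanup = stamp_step(after_whisper, "llm_cleanup", "ollama/llama3-70b",
                           "cleaned transcript", now=when)
print([s["step"] for s in after_cleanup["processing"]])
# ['whisper_transcription', 'llm_cleanup']
```

Because every stage appends and nothing is removed, the Whisper output remains available after the cleanup step has run.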

📐 GDLM Record Schema

// Every record in Syntropa follows this envelope
{
  "record_id": "REC-2025-OOM-00347",
  "farm": "oomathurai",
  "record_type": "voice_log",

  // Source lineage
  "source": {
    "type": "whatsapp_voice",
    "file": "AUDIO-2025-08-13-00-23-08.opus",
    "sender": "Velmurugan",
    "sender_role": "farm_manager",
    "recorded_at": "2025-08-13T00:23:08+05:30",
    "duration_seconds": 14.9
  },

  // Processing lineage
  "processing": [
    {
      "step": "whisper_transcription",
      "model": "whisper-large-v3",
      "gpu": "RTX-3050-6GB",
      "host": "boston.thirucloud.com",
      "processed_at": "2025-09-15T14:22:00Z",
      "confidence": 0.72,
      "output": "நேத்து மழை பெய்ததால காலை..."
    },
    {
      "step": "llm_cleanup",
      "model": "ollama/llama3-70b",
      "processed_at": "2025-09-15T14:22:05Z",
      "output": "நேற்று மழை பெய்ததால் காலை..."
    }
  ],

  // Version history (corrections never overwrite)
  "versions": [
    { "v": 1, "by": "whisper", "at": "2025-09-15", "trust": 0.72 },
    { "v": 2, "by": "ollama", "at": "2025-09-15", "trust": 0.85 },
    { "v": 3, "by": "Renuika", "at": "2025-09-16", "trust": 0.95 }
  ],
  "current_version": 3,
  "trust_score": 0.95
}
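
A consistency check for this envelope might look like the Python sketch below. The function name validate_envelope and the specific rules are assumptions made for illustration: the specification does not define a validator, and the rule that trust_score mirrors the current version's trust is inferred from the example record only.

```python
REQUIRED_FIELDS = (
    "record_id", "farm", "record_type", "source",
    "processing", "versions", "current_version", "trust_score",
)


def validate_envelope(record):
    """Return a list of problems; an empty list means the envelope is consistent."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in record]
    if problems:
        return problems  # the later checks need every field present

    versions = record["versions"]
    if [v["v"] for v in versions] != list(range(1, len(versions) + 1)):
        problems.append("version numbers must run 1..n in order")
    for v in versions:
        if not 0.0 <= v["trust"] <= 1.0:
            problems.append(f"version {v['v']}: trust outside 0.0-1.0")

    current = [v for v in versions if v["v"] == record["current_version"]]
    if not current:
        problems.append("current_version matches no version entry")
    elif current[0]["trust"] != record["trust_score"]:
        # Assumption: trust_score mirrors the current version's trust.
        problems.append("trust_score differs from the current version's trust")
    return problems


example = {
    "record_id": "REC-2025-OOM-00347",
    "farm": "oomathurai",
    "record_type": "voice_log",
    "source": {"type": "whatsapp_voice", "file": "AUDIO-2025-08-13-00-23-08.opus"},
    "processing": [],
    "versions": [
        {"v": 1, "by": "whisper", "at": "2025-09-15", "trust": 0.72},
        {"v": 2, "by": "ollama", "at": "2025-09-15", "trust": 0.85},
        {"v": 3, "by": "Renuika", "at": "2025-09-16", "trust": 0.95},
    ],
    "current_version": 3,
    "trust_score": 0.95,
}
print(validate_envelope(example))                            # []
print(validate_envelope({**example, "current_version": 4}))  # ['current_version matches no version entry']
```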

⚖️ Trust Scoring Model

Every data source and processor gets a trust weight. These propagate: a record corrected by a high-trust human gets a higher score than raw machine output.

0.0–0.3 | 🔴 Unreliable
0.3–0.5 | 🟠 Raw/Unverified
0.5–0.7 | 🟡 Machine-Processed
0.7–0.9 | 🟢 Reviewed
0.9–1.0 | 🔵 Verified/Golden
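
Mapping a score to one of these bands can be sketched as below. The bands share their boundary values (0.3 appears in two ranges), so this sketch assumes each boundary belongs to the higher band; that choice, like the function name trust_band, is an assumption and not part of the specification.

```python
# (exclusive upper bound, label), in ascending order
TRUST_BANDS = [
    (0.3, "Unreliable"),
    (0.5, "Raw/Unverified"),
    (0.7, "Machine-Processed"),
    (0.9, "Reviewed"),
]
TOP_BAND = "Verified/Golden"  # 0.9 up to and including 1.0


def trust_band(score):
    """Return the band label for a trust score between 0.0 and 1.0."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"trust score out of range: {score}")
    for upper, label in TRUST_BANDS:
        if score < upper:  # assumption: a boundary value falls in the higher band
            return label
    return TOP_BAND


print(trust_band(0.55))  # Machine-Processed
print(trust_band(0.72))  # Reviewed
print(trust_band(0.95))  # Verified/Golden
```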

Source Trust Weights

Source Type | Base Trust | Notes
Farm Manager voice log | 0.80 | Primary authority (Velmurugan, Prabu, Sivakumar)
Data Entry — Renuika | 0.90 | Highest accuracy data entry (RRC Pannai)
Data Entry — Other | 0.65 | Varies by individual performance
Whisper raw transcription | 0.55 | Tamil accuracy ~70% with large-v3 on RTX 3050
LLM cleanup (Ollama) | 0.75 | Grammar/spelling fixes, structured extraction
LLM cleanup (GPT-4) | 0.85 | Higher Tamil accuracy than local models
Owner verification (Thiru) | 0.98 | Final authority — golden record
WhatsApp photo EXIF | 0.95 | Machine-stamped timestamp, GPS
Financial receipt scan | 0.90 | Original document — OCR may reduce to 0.70
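
These weights can be held in a simple lookup, sketched here in Python. The dictionary keys and the function name base_trust are hypothetical identifiers chosen for illustration; the numbers are the ones listed in the table. The sketch also assumes that an OCR-extracted receipt is capped at 0.70, which is one reading of the table's "may reduce to 0.70" note.

```python
BASE_TRUST = {
    "farm_manager_voice_log": 0.80,
    "data_entry_renuika": 0.90,
    "data_entry_other": 0.65,
    "whisper_raw": 0.55,
    "llm_cleanup_ollama": 0.75,
    "llm_cleanup_gpt4": 0.85,
    "owner_verification": 0.98,
    "whatsapp_photo_exif": 0.95,
    "financial_receipt_scan": 0.90,
}
OCR_RECEIPT_TRUST = 0.70  # assumption: treated as a cap for OCR-extracted values


def base_trust(source_type, via_ocr=False):
    """Look up the base trust weight for a source type."""
    trust = BASE_TRUST[source_type]  # raises KeyError for an unknown source type
    if source_type == "financial_receipt_scan" and via_ocr:
        trust = min(trust, OCR_RECEIPT_TRUST)
    return trust


print(base_trust("farm_manager_voice_log"))                # 0.8
print(base_trust("financial_receipt_scan"))                # 0.9
print(base_trust("financial_receipt_scan", via_ocr=True))  # 0.7
```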

📦 Record Types in Syntropa

Type | Source | Example | Volume
voice_log | WhatsApp audio → Whisper | Daily farm activity narration | 3,337+ (Oomathurai)
photo | WhatsApp images | Crop photos, cattle, fields | 3,952 across 12 farms
expense | Voice log / manual | ₹43,701 ploughing — Sevakkon | 100s per farm/year
livestock | Manual / voice log | Cow "Megalai" moved to Vellunachiyaar | 70+ cattle, 120+ goats
event | Voice log extraction | Wild pig damage, harvest, planting | Daily across farms
crop_cycle | Derived from events | Groundnut: planted Aug → harvested Nov | Per field per season
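
A derived type such as crop_cycle has no source file of its own, so its lineage has to point at the records it was computed from. The Python sketch below shows one possible shape. Everything beyond the record_type values is an assumption: the event fields (kind, crop, month), the derived_from key, the placeholder record IDs, and the weakest-link rule (a derived record is only as trustworthy as its least-trusted input) are illustrative, since the specification says trust propagates but does not state how.

```python
def derive_crop_cycle(field, events):
    """Build a crop_cycle record from one planting and one harvest event."""
    planting = next(e for e in events if e["kind"] == "planting")
    harvest = next(e for e in events if e["kind"] == "harvest")
    inputs = [planting, harvest]
    return {
        "record_type": "crop_cycle",
        "field": field,
        "crop": planting["crop"],
        "planted": planting["month"],
        "harvested": harvest["month"],
        # Lineage of a derived record: the IDs of the records it came from.
        "source": {"type": "derived", "derived_from": [e["record_id"] for e in inputs]},
        # Assumed propagation rule: weakest link among the inputs.
        "trust_score": min(e["trust_score"] for e in inputs),
    }


# Placeholder events; the IDs and trust scores are made up for the example.
events = [
    {"record_id": "EVT-A", "kind": "planting", "crop": "groundnut",
     "month": "Aug", "trust_score": 0.95},
    {"record_id": "EVT-B", "kind": "harvest", "crop": "groundnut",
     "month": "Nov", "trust_score": 0.72},
]
cycle = derive_crop_cycle("field-1", events)
print(cycle["source"]["derived_from"], cycle["trust_score"])  # ['EVT-A', 'EVT-B'] 0.72
```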

📝 Version Control Rules

🔒 Immutable Append-Only Log

GDLM uses an append-only version chain. Every record version is immutable once written. Corrections create a new version entry with:

Field | Description
v | Incrementing version number
by | Who/what made this version (person or system ID)
at | ISO timestamp of correction
trust | Trust score of this version
delta | What changed (field-level diff)
reason | Optional: why the correction was made

The current_version pointer always references the highest-trust version. The UI shows the current version but allows drilling into history.
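
The append-only rule and the current_version pointer can be sketched together in Python. The function name append_version, the shape of the delta value, and the 2025-09-17 entry are hypothetical. The sketch also assumes that when two versions have equal trust the later one wins, a tie-break the specification does not state.

```python
def append_version(record, by, at, trust, delta, reason=None):
    """Return a new record with one more version entry; earlier entries are untouched."""
    versions = list(record["versions"])  # copy, so the caller's chain is not modified
    entry = {"v": len(versions) + 1, "by": by, "at": at, "trust": trust, "delta": delta}
    if reason is not None:
        entry["reason"] = reason
    versions.append(entry)
    # Highest trust wins; assumption: on a tie, the later version wins.
    best = max(versions, key=lambda v: (v["trust"], v["v"]))
    return {**record, "versions": versions,
            "current_version": best["v"], "trust_score": best["trust"]}


record = {
    "record_id": "REC-2025-OOM-00347",
    "versions": [
        {"v": 1, "by": "whisper", "at": "2025-09-15", "trust": 0.72},
        {"v": 2, "by": "ollama", "at": "2025-09-15", "trust": 0.85},
    ],
    "current_version": 2,
    "trust_score": 0.85,
}
reviewed = append_version(record, "Renuika", "2025-09-16", 0.95,
                          delta={"transcript": "corrected after playback"},
                          reason="human review")
# A later, lower-trust edit is recorded but does not become the current version.
later = append_version(reviewed, "data_entry_other", "2025-09-17", 0.65,
                       delta={"transcript": "minor edit"})
print(reviewed["current_version"], later["current_version"], len(later["versions"]))  # 3 3 4
```

Under this reading, a low-trust edit is never lost, yet it cannot displace a verified version in the UI.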

🌾 Farm-Specific Implementation

🏹 Oomathurai Pannai — Pilot Farm

The first GDLM-tracked farm. 3,337 voice logs from Velmurugan (manager) and Devi, transcribed via Whisper on Boston GPU (RTX 3050). LLM cleanup pipeline in progress. Voice log player at voice-log-player.html allows playback alongside transcription for manual verification.

Current state: v1 (Whisper raw) for all records. v2 (LLM cleanup) partially complete. v3 (human review) pending.

⚔️ மருது சகோதரர்கள் — Active Transcription

4,285 audio files being transcribed on Boston GPU. 1,001 photos already in gallery. GDLM tracking begins at ingest — each transcription batch logged with model version, GPU, and processing timestamp.

👑 RRC Pannai — Best Data Entry

Renuika (trust: 0.90) handles data entry for Raja Raja Cholan Pannai. Her entries serve as the benchmark for data quality scoring. GDLM will use her verified records to train quality classifiers for other farms.

🛠️ Technology Stack

Component | Technology | Role in GDLM
Transcription | Whisper large-v3 (RTX 3050) | Source records with confidence scores
LLM Cleanup | Ollama (local) / GPT-4 (API) | Version 2 processing with trust upgrade
Storage | SurrealDB / JSON flat files | Versioned record store with lineage
Media | Boston NFS (/mnt/workspace) | Audio/photo with EXIF preservation
Frontend | Syntropa (static HTML/JS) | Version viewer, trust display, audit trail
Infrastructure | ThiruCloud Boston server | Manassas VA, managed by Matt Patton
Backup | Wasabi S3 | Offsite backup of all GDLM records

🗺️ Implementation Roadmap

Phase | Scope | Status
Phase 0 | Whisper transcription with file-level lineage (filename, timestamp, farm) | Done
Phase 1 | LLM cleanup pipeline — v2 records with trust upgrade | In Progress
Phase 2 | GDLM JSON envelope on all new records. Version history UI in Syntropa | Planned
Phase 3 | Human review workflow — voice playback + correction interface | Planned
Phase 4 | Trust scoring engine — auto-score based on source, corrections, age | Planned
Phase 5 | Cross-farm GDLM — unified lineage across all 27 farms | Future
Phase 6 | SurrealDB migration — full graph-based lineage with query support | Future
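
Phase 0's file-level lineage (filename, timestamp, farm) can be illustrated with a short Python sketch. It assumes that audio files follow the AUDIO-YYYY-MM-DD-HH-MM-SS.opus pattern seen in the example record and that the embedded time is Indian Standard Time (+05:30), which matches that record's recorded_at value. The function name file_lineage is hypothetical.

```python
import re
from datetime import datetime, timedelta, timezone

IST = timezone(timedelta(hours=5, minutes=30))  # assumption: filenames carry local IST time
AUDIO_NAME = re.compile(r"^AUDIO-(\d{4})-(\d{2})-(\d{2})-(\d{2})-(\d{2})-(\d{2})\.opus$")


def file_lineage(filename, farm):
    """Derive file-level lineage (file, timestamp, farm) from an audio filename."""
    match = AUDIO_NAME.match(filename)
    if match is None:
        raise ValueError(f"unrecognised audio filename: {filename}")
    recorded = datetime(*map(int, match.groups()), tzinfo=IST)
    return {
        "farm": farm,
        "source": {
            "type": "whatsapp_voice",
            "file": filename,
            "recorded_at": recorded.isoformat(),
        },
    }


lineage = file_lineage("AUDIO-2025-08-13-00-23-08.opus", "oomathurai")
print(lineage["source"]["recorded_at"])  # 2025-08-13T00:23:08+05:30
```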

🌐 Scope: All ThiruCloud Sites

GDLM is designed to extend beyond Syntropa to all ThiruCloud-hosted sites where data lineage matters:

Site | GDLM Use Case
syntropa.com | Farm data lineage — voice logs, photos, finances, livestock
antikva.in | Product catalog lineage — farm-to-table traceability
createvaluetech.com | Advisory content lineage — source attribution
anjaraipetti.xyz | Tamil cultural content — source and translation tracking