Syntropa ingests farm data of varying quality from multiple sources (WhatsApp voice logs, photos, manual data entry, financial records) across 27+ farms. Without lineage tracking:
→ We don't know who recorded a data point, when, or on which device
→ Whisper transcription errors propagate silently — no way to flag or correct them
→ Data entry quality varies by person (some workers are meticulous, others inconsistent)
→ Financial figures get entered without an audit trail, and corrections overwrite the originals
→ No way to answer: "Where did this number come from?"
Each pipeline stage stamps metadata onto the record: source file, processing timestamps, model version, confidence score, and who or what performed the operation. If a human corrects a transcription, the correction links back to the original Whisper output.
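
The stamping step can be sketched as follows. This is a minimal illustration, not the production schema: the `StageStamp` class, the `stamp` helper, all field names, the file path, and the reviewer ID are assumptions made for the example. The `derived_from` field shows one way a human correction could link back to the Whisper output it corrects.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class StageStamp:
    """Metadata one pipeline stage attaches to a record (illustrative field names)."""
    stage: str                            # e.g. "whisper_transcribe", "human_correction"
    source_file: str                      # file the stage read from
    performed_by: str                     # person or system ID
    processed_at: str                     # ISO 8601 timestamp (UTC)
    model_version: Optional[str] = None   # None for human steps
    confidence: Optional[float] = None    # None when the stage reports no score
    derived_from: Optional[int] = None    # index of the stamp this one corrects


def stamp(lineage, stage, source_file, performed_by, *,
          model_version=None, confidence=None, derived_from=None, now=None):
    """Return a new lineage list with one more stamp; the input list is not mutated."""
    if derived_from is not None and not 0 <= derived_from < len(lineage):
        raise ValueError("derived_from must point at an existing stamp")
    ts = (now or datetime.now(timezone.utc)).isoformat()
    entry = StageStamp(stage, source_file, performed_by, ts,
                       model_version, confidence, derived_from)
    return [*lineage, entry]


# Hypothetical file path and reviewer ID, for illustration only.
lineage = stamp([], "whisper_transcribe", "voice/0001.opus", "whisper",
                model_version="large-v3", confidence=0.62)
lineage = stamp(lineage, "human_correction", "voice/0001.opus", "reviewer-01",
                derived_from=0)
```

Returning a new list rather than mutating the old one keeps earlier lineage snapshots intact, which suits an audit trail.
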
Every data source and processor gets a trust weight. These propagate: a record corrected by a high-trust human gets a higher score than raw machine output.
| Source Type | Base Trust | Notes |
|---|---|---|
| Farm Manager voice log | 0.80 | Primary authority (Velmurugan, Prabu, Sivakumar) |
| Data Entry — Renuika | 0.90 | Highest accuracy data entry (RRC Pannai) |
| Data Entry — Other | 0.65 | Varies by individual performance |
| Whisper raw transcription | 0.55 | Tamil accuracy ~70% with large-v3 on RTX 3050 |
| LLM cleanup (Ollama) | 0.75 | Grammar/spelling fixes, structured extraction |
| LLM cleanup (GPT-4) | 0.85 | Higher Tamil accuracy than local models |
| Owner verification (Thiru) | 0.98 | Final authority — golden record |
| WhatsApp photo EXIF | 0.95 | Machine-stamped timestamp, GPS |
| Financial receipt scan | 0.90 | Original document — OCR may reduce to 0.70 |
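
The table above can be expressed as a lookup. The sketch below assumes one simple propagation rule that the text does not spell out: each processing step carries the base trust of whoever performed it, and a record's trust is that of its most recent step. The dictionary keys, `step_trust`, and `record_trust` are hypothetical names. The table says OCR "may" lower a receipt scan to 0.70, so the OCR adjustment is an optional flag here.

```python
# Base trust per source type, copied from the table; the keys are illustrative.
BASE_TRUST = {
    "farm_manager_voice_log": 0.80,
    "data_entry_renuika": 0.90,
    "data_entry_other": 0.65,
    "whisper_raw": 0.55,
    "llm_cleanup_ollama": 0.75,
    "llm_cleanup_gpt4": 0.85,
    "owner_verification": 0.98,
    "whatsapp_photo_exif": 0.95,
    "financial_receipt_scan": 0.90,
}
OCR_RECEIPT_TRUST = 0.70  # value a receipt scan may drop to when read via OCR


def step_trust(source, *, via_ocr=False):
    """Trust of a single step; raises KeyError for an unknown source type."""
    trust = BASE_TRUST[source]
    if via_ocr and source == "financial_receipt_scan":
        trust = min(trust, OCR_RECEIPT_TRUST)
    return trust


def record_trust(steps):
    """Assumed rule: a record takes the trust of its most recent step."""
    if not steps:
        raise ValueError("a record needs at least one step")
    return step_trust(steps[-1])


raw = record_trust(["whisper_raw"])
verified = record_trust(["whisper_raw", "owner_verification"])
```

Under this rule a Whisper transcript verified by the owner scores higher than the raw transcript, matching the behaviour described above. Other rules (weighted averages, decay with age) are equally plausible.
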
| Type | Source | Example | Volume |
|---|---|---|---|
| voice_log | WhatsApp audio → Whisper | Daily farm activity narration | 3,337+ (Oomathurai) |
| photo | WhatsApp images | Crop photos, cattle, fields | 3,952 across 12 farms |
| expense | Voice log / manual | ₹43,701 ploughing — Sevakkon | 100s per farm/year |
| livestock | Manual / voice log | Cow "Megalai" moved to Vellunachiyaar | 70+ cattle, 120+ goats |
| event | Voice log extraction | Wild pig damage, harvest, planting | Daily across farms |
| crop_cycle | Derived from events | Groundnut: planted Aug → harvested Nov | Per field per season |
GDLM uses an append-only version chain. Every record version is immutable once written. Corrections create a new version entry with:
| Field | Description |
|---|---|
| v | Incrementing version number |
| by | Who/what made this version (person or system ID) |
| at | ISO timestamp of correction |
| trust | Trust score of this version |
| delta | What changed (field-level diff) |
| reason | Optional: why the correction was made |
The current_version pointer always references the highest-trust version. The UI shows the current version but allows drilling into history.
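
A minimal in-memory sketch of such a chain follows. The `Version` and `VersionChain` classes are hypothetical, as are the timestamps, system IDs, and the shape of `delta`. Breaking trust ties in favour of the later version is an assumption, since the text only says the pointer references the highest-trust version.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Version:
    v: int                        # incrementing version number
    by: str                       # person or system ID
    at: str                       # ISO timestamp
    trust: float                  # trust score of this version
    delta: dict                   # field-level diff (shape is illustrative)
    reason: Optional[str] = None  # optional explanation


class VersionChain:
    """Append-only: versions can be added but never edited or removed."""

    def __init__(self):
        self._versions = []

    def append(self, by, at, trust, delta, reason=None):
        # Copy delta so later changes to the caller's dict do not leak in.
        version = Version(len(self._versions) + 1, by, at, trust, dict(delta), reason)
        self._versions.append(version)
        return version

    @property
    def history(self):
        # A tuple, so callers cannot reorder or drop entries.
        return tuple(self._versions)

    @property
    def current_version(self):
        if not self._versions:
            raise ValueError("chain has no versions yet")
        # Highest trust wins; on a tie the later version wins (assumption).
        return max(self._versions, key=lambda ver: (ver.trust, ver.v))


chain = VersionChain()
chain.append("whisper-large-v3", "2025-01-01T06:00:00+00:00", 0.55, {"text": [None, "raw"]})
chain.append("ollama", "2025-01-01T07:00:00+00:00", 0.75, {"text": ["raw", "cleaned"]})
chain.append("data-entry-other", "2025-01-02T09:00:00+00:00", 0.65,
             {"text": ["cleaned", "edited"]}, reason="manual edit")
```

In this example the third version is kept in the history, but `current_version` still returns version 2 because its trust (0.75) is higher than the later edit's (0.65).
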
Oomathurai is the first GDLM-tracked farm: 3,337 voice logs from Velmurugan (manager) and Devi, transcribed via Whisper on the Boston GPU (RTX 3050). The LLM cleanup pipeline is in progress. The voice log player at voice-log-player.html allows playback alongside the transcription for manual verification.
Current state: v1 (Whisper raw) for all records. v2 (LLM cleanup) partially complete. v3 (human review) pending.
4,285 audio files are being transcribed on the Boston GPU. 1,001 photos are already in the gallery. GDLM tracking begins at ingest: each transcription batch is logged with model version, GPU, and processing timestamp.
Renuika (trust: 0.90) handles data entry for Raja Raja Cholan Pannai. Her entries serve as the benchmark for data quality scoring. GDLM will use her verified records to train quality classifiers for other farms.
| Component | Technology | Role in GDLM |
|---|---|---|
| Transcription | Whisper large-v3 (RTX 3050) | Source records with confidence scores |
| LLM Cleanup | Ollama (local) / GPT-4 (API) | Version 2 processing with trust upgrade |
| Storage | SurrealDB / JSON flat files | Versioned record store with lineage |
| Media | Boston NFS (/mnt/workspace) | Audio/photo with EXIF preservation |
| Frontend | Syntropa (static HTML/JS) | Version viewer, trust display, audit trail |
| Infrastructure | ThiruCloud Boston server | Manassas VA, managed by Matt Patton |
| Backup | Wasabi S3 | Offsite backup of all GDLM records |
| Phase | Scope | Status |
|---|---|---|
| Phase 0 | Whisper transcription with file-level lineage (filename, timestamp, farm) | Done |
| Phase 1 | LLM cleanup pipeline — v2 records with trust upgrade | In Progress |
| Phase 2 | GDLM JSON envelope on all new records. Version history UI in Syntropa | Planned |
| Phase 3 | Human review workflow — voice playback + correction interface | Planned |
| Phase 4 | Trust scoring engine — auto-score based on source, corrections, age | Planned |
| Phase 5 | Cross-farm GDLM — unified lineage across all 27 farms | Future |
| Phase 6 | SurrealDB migration — full graph-based lineage with query support | Future |
GDLM is designed to extend beyond Syntropa to all ThiruCloud-hosted sites where data lineage matters:
| Site | GDLM Use Case |
|---|---|
| syntropa.com | Farm data lineage — voice logs, photos, finances, livestock |
| antikva.in | Product catalog lineage — farm-to-table traceability |
| createvaluetech.com | Advisory content lineage — source attribution |
| anjaraipetti.xyz | Tamil cultural content — source and translation tracking |