VaaniNetra

System Architecture

VaaniNetra (वाणीनेत्र) — "The Eye That Reads Financial Language"

Judge Snapshot

What makes this architecture practical for regulatory workflows

Coverage + Explainability

132 YAML rules across 17 regulatory frameworks with evidence spans, rule citations, precedents, and remediation in each report.

Entity Trust Layer

Extracted entities are cross-validated via Neo4j relationships and Outris ROC/MCA services before final report generation.

Human-in-the-Loop

Low-confidence entity/section/compliance outputs create annotation tasks with bulk review actions and export-ready correction data.

Operational Visibility

Upload and validation flows expose stage-level progress, retries, and live activity logs for long-running annual report processing.

End-to-End Pipeline

Document intake → compliance report in ~60 seconds

Upload

Drag & drop PDF/XBRL document

< 1s

Extract

OCR + entity recognition + section classification

~2s/page

Validate

132 rules across 17 frameworks + KG entity validation + ROC/MCA cross-check

~45s total

Forensic

9 anomaly checks (Benford, Beneish, Altman, etc.)

~5s

Report

Findings with evidence chains + precedent citations

~3s

Chat

Ask NFRA Bot questions about the document

real-time

High-Level Architecture

5-layer system: Document Intelligence → Compliance Engine → Knowledge Graph → External APIs → Storage

PRESENTATION LAYER

Next.js 16 Dashboard

11 pages, shadcn/ui, Recharts, Forensics

FastAPI REST API

documents, compliance, forensic, entity APIs

NFRA Insight Chatbot

RAG + Knowledge Graph

LAYER 4: AGENTIC COMPLIANCE ENGINE (LangGraph 7-Node DAG)

parse_document + map_rules

132 YAML rules, 17 standards

validate_ind_as

13 Ind AS, RAG+Neo4j, 3x vote

validate_sebi

SEBI LODR, MCA Sch III+BRSR

detect_forensic

Benford, Beneish, Altman, 9 checks

generate_report

Findings, Evidence, Precedent

Heuristic (<50ms) → LLM+RAG (3x vote) → Human-in-Loop | Rule citation + evidence + precedents

LAYER 3: REGULATORY KNOWLEDGE GRAPH

Neo4j Aura

312 nodes, 7 types, 19 Cypher

Qdrant Cloud

408 vectors, 2 collections, 3072-dim

NFRA Precedents

58 orders, penalty amounts

YAML Rule Engine

132 rules, 18 files

LAYER 2: DOCUMENT INTELLIGENCE ENGINE

Format Detector

PDF, XBRL, iXBRL, Excel

OCR Engine

pdfplumber + Gemini Vision

Entity Extraction

21 regex + 22 LLM types

XBRL Extractor

BSE/NSE/MCA, Ind AS taxonomy

Section Classifier (13 types) + Table Detector + Scope Detection

LAYER 1: DATA INGESTION & STORAGE

Outris ROC DB

2.45M companies

MCA API

CIN/DIN lookup

Gemini 2.0 Flash

LiteLLM router

OSINT Module

SEBI/NFRA scrapers

File Upload

PDF, XBRL, XLS

PostgreSQL (metadata + findings) • S3-Compatible Storage (docs)

System Architecture Diagram

End-to-end data flow: Document Intelligence → LangGraph Compliance Engine → Knowledge Layer → External APIs → Compliance Report

Document Intelligence

Compliance Engine

Knowledge + Forensics

External APIs / LLM

LangGraph Compliance Pipeline — 7-Node Parallel DAG

parse_document → map_rules → [validate_ind_as ‖ validate_sebi ‖ detect_forensic ‖ validate_entities] → generate_report

Orchestration nodes

Ind AS / SEBI validation

Forensic detection (9 checks)

Report generation
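
The fan-out / fan-in shape of the 7-node DAG can be sketched with the standard library alone. This is a minimal illustration, not the LangGraph implementation: the node bodies are hypothetical stubs, the shared `state` dict and its keys are assumptions, and `ThreadPoolExecutor` stands in for LangGraph's parallel branches.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the seven nodes. Each reads the shared state
# dict and returns only the keys it contributes.
def parse_document(state):
    return {"text": state["raw"].strip()}

def map_rules(state):
    return {"rules": ["IND_AS_24_001"]}

def validate_ind_as(state):
    return {"ind_as": [f"checked {rule}" for rule in state["rules"]]}

def validate_sebi(state):
    return {"sebi": []}

def detect_forensic(state):
    return {"forensic": []}

def validate_entities(state):
    return {"entities": []}

def generate_report(state):
    keys = ("ind_as", "sebi", "forensic", "entities")
    return {"report": {key: state[key] for key in keys}}

PARALLEL_NODES = (validate_ind_as, validate_sebi, detect_forensic, validate_entities)

def run_pipeline(raw):
    state = {"raw": raw}
    # Sequential prefix: parse_document, then map_rules.
    for node in (parse_document, map_rules):
        state.update(node(state))
    # Fan-out: the four validators run concurrently on a read-only snapshot.
    snapshot = dict(state)
    with ThreadPoolExecutor(max_workers=len(PARALLEL_NODES)) as pool:
        for partial in pool.map(lambda node: node(snapshot), PARALLEL_NODES):
            state.update(partial)
    # Fan-in: generate_report sees every validator's output.
    state.update(generate_report(state))
    return state
```

Because each validator writes a distinct key, the merge after the parallel step needs no conflict handling; that property is an assumption of this sketch.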

3-Layer Classification Architecture

Heuristic Engine → LLM Verification (RAG + self-consistency) → Human-in-the-Loop: findings escalate layer by layer, and low-confidence results are queued for human review

Layer 1 — Heuristic (<50ms)

21 regex patterns + 132 YAML keyword rules. Fast first-pass, filters not_applicable cases.

Layer 2 — LLM + RAG

Gemini 2.0 Flash with Qdrant RAG context + Neo4j precedents. Self-consistency voting ×3.

Layer 3 — Human Review

Low-confidence items auto-queued for annotation. Corrections exported as training data (JSONL/CoNLL/HF).
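
One way the three layers could hand off to each other is sketched below. Everything here is an assumption for illustration: the function names, the `threshold` value, and the `classify` callable, which stands in for a Gemini + RAG call and is not a real client.

```python
from collections import Counter

def heuristic_layer(text, keywords):
    """Layer 1: cheap keyword pass. Returns 'not_applicable' when no
    keyword occurs in the text, otherwise None (meaning: escalate)."""
    lowered = text.lower()
    if not any(keyword.lower() in lowered for keyword in keywords):
        return "not_applicable"
    return None

def self_consistency_vote(classify, text, passes=3):
    """Layer 2: call the classifier `passes` times and take the majority
    label. Confidence is the share of passes that agreed with it."""
    votes = Counter(classify(text) for _ in range(passes))
    label, count = votes.most_common(1)[0]
    return label, count / passes

def classify_finding(text, keywords, classify, review_queue, threshold=0.67):
    early = heuristic_layer(text, keywords)
    if early is not None:
        return early, 1.0
    label, confidence = self_consistency_vote(classify, text)
    # Layer 3: anything below the (hypothetical) threshold goes to a human.
    if confidence < threshold:
        review_queue.append({"text": text, "label": label, "confidence": confidence})
    return label, confidence
```

With three passes and this threshold, a unanimous vote (confidence 1.0) is accepted and a 2-to-1 split (confidence about 0.667) is queued for annotation.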

Presentation Layer

Next.js 16 frontend with 11 sidebar pages + report detail route, shadcn/ui components, Recharts visualizations

Dashboard

Stats, severity charts, compliance distribution, recent docs

Upload & Extract

Drag-and-drop PDF upload → OCR → entity extraction → section classification

Compliance Reports

Reports list + report detail route with evidence spans, precedent citations, RAG sources

Forensic Analysis

Benford charting + anomaly list (including Beneish/Altman/ratio-based backend checks)

OSINT Intelligence

SEBI/NFRA enforcement actions, news signals, entity screening

NFRA Chatbot

RAG-powered conversational assistant with document Q&A and KG queries

Regulations

Browse 132 rules across regulatory frameworks with searchable rule cards

Annotations

Human-in-the-loop queue with document filtering and bulk accept/reject actions

Benchmarks

11/11 metrics passing, latency breakdown, explainability chain, confusion matrix

Architecture

This page — system architecture overview and component documentation

Settings

Runtime toggles for MCA and ROC entity validation integrations

Agentic Compliance Engine (Layer 4)

LangGraph 7-node parallel DAG with multi-agent validation, RAG context, and optional self-consistency voting

Orchestrator Agent

LangGraph state machine — routes documents through parallel validation nodes

Ind AS Validator

13 standards (Ind AS 1, 7, 8, 12, 16/38, 24, 33, 36, 37, 109, 115, 116)

SEBI/MCA Validators

SEBI LODR + MCA Schedule III checks via YAML rule packs (RBI rules present in rule definitions)

Forensic Anomaly Detector

9 checks, including Benford, Beneish, Altman, ratio variance, Q4 spike, and audit mismatch

Entity Validator

KG + ROC + MCA-backed CIN/DIN/FRN/auditor validation

Self-Consistency Voting

Configurable multi-pass voting for selected checks (enabled via runtime flags)

Report Generator

Findings aggregation, severity classification, executive summary, auto annotation task generation
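
Of the forensic checks named above, Benford's law is the simplest to illustrate. The sketch below compares the first-digit distribution of reported amounts against the expected frequencies P(d) = log10(1 + 1/d) using a chi-square statistic. The function names and the choice of chi-square as the test are assumptions; the page does not state which statistic the detector uses.

```python
import math
from collections import Counter

# Expected first-digit frequencies under Benford's law.
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# Chi-square critical value for 8 degrees of freedom at the 5% level.
CHI2_CRITICAL_5PCT = 15.507

def first_digit(value):
    """Leading non-zero digit of a finite number, or None for zero."""
    value = abs(value)
    if value == 0:
        return None
    # Scientific notation always starts with the leading digit.
    return int(f"{value:e}"[0])

def benford_chi_square(amounts):
    """Chi-square distance between observed and Benford first-digit counts."""
    digits = [d for d in map(first_digit, amounts) if d is not None]
    n = len(digits)
    if n == 0:
        raise ValueError("no non-zero amounts to test")
    observed = Counter(digits)
    return sum(
        (observed.get(d, 0) - n * p) ** 2 / (n * p)
        for d, p in BENFORD.items()
    )
```

A statistic above `CHI2_CRITICAL_5PCT` suggests the amounts deviate from Benford's distribution and deserve a closer look; it is a screening signal, not proof of manipulation.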

Regulatory Knowledge Graph (Layer 3)

Neo4j (312 nodes) + Qdrant (408 vectors) + YAML rules (132 across 18 files)

Neo4j Graph

312 nodes: 12 Companies, 4 Regulations, 16 Standards, 103 Disclosures, 67 Provisions, 58 Precedents, 52 Auditors

Qdrant Vector Store

408 vectors in 2 collections (regulations, precedents), gemini-embedding-001, 3072-dim

Rule Engine

132 YAML rules across 18 files — keyword matching + LLM refinement per rule

Precedent Database

36 NFRA enforcement orders (27 text + 9 vision extracted), penalty amounts, entity links

Query Engine

Cypher queries for entity validation, shared auditor detection, violation history

MCA API Validator

Live CIN/DIN verification via Ministry of Corporate Affairs database (fetch-company, fetch-director, fetch-by-name)
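
The Rule Engine's keyword-matching pass can be sketched as follows. The rule is shown as the dict a YAML loader would produce; its schema (`keywords`, `required_disclosures`) and the status names are hypothetical, since the page does not show the real rule files. Only the rule ID format `IND_AS_24_001` comes from this page.

```python
# Hypothetical rule, as it might look after loading one YAML entry.
RULE = {
    "id": "IND_AS_24_001",
    "framework": "Ind AS 24",
    "keywords": ["related party", "key management personnel"],
    "required_disclosures": ["transaction amounts", "outstanding balances"],
}

def keyword_pass(rule, section_text):
    """First-pass check: decide applicability from keywords, then list
    required disclosures whose phrase does not appear in the text."""
    text = section_text.lower()
    if not any(keyword in text for keyword in rule["keywords"]):
        return {"rule_id": rule["id"], "status": "not_applicable", "missing": []}
    missing = [d for d in rule["required_disclosures"] if d not in text]
    # A literal phrase search is crude, so gaps are handed on for LLM
    # refinement rather than reported as violations directly.
    status = "needs_llm_review" if missing else "keyword_match"
    return {"rule_id": rule["id"], "status": status, "missing": missing}
```

Usage: `keyword_pass(RULE, "Related party transactions: transaction amounts are listed below.")` flags `outstanding balances` as missing and marks the rule for LLM review.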

Document Intelligence Engine (Layer 2)

PDF extraction pipeline — OCR, entity recognition, section classification, table detection

Format Detector

Auto-detect PDF, XBRL, iXBRL, Excel — routes to appropriate parser

OCR Engine

pdfplumber (digital) + Gemini 2.0 Flash Vision (scanned) — dual-path extraction

Entity Extractor

21 regex patterns (CIN, DIN, PAN, amounts, Ind AS refs) + 22 LLM entity types

Section Classifier

13 regulatory section types — keyword + LLM hybrid with structural rules

Scope Detector

Standalone vs Consolidated detection using 4+4 marker patterns

Table Detector

pdfplumber table extraction with financial number matching
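
A few of the regex entity types named above (CIN, DIN, PAN, Ind AS references) can be sketched as below. The patterns follow the publicly documented identifier formats (21-character CIN, 10-character PAN, 8-digit DIN) but are assumptions, not the project's actual 21 patterns, and the sample identifiers are made up.

```python
import re

# Illustrative patterns only; the real extractor's expressions are not shown.
PATTERNS = {
    # L/U + 5-digit industry code + state + year + company type + 6 digits
    "CIN": re.compile(r"\b[LU]\d{5}[A-Z]{2}\d{4}[A-Z]{3}\d{6}\b"),
    # 5 letters + 4 digits + 1 letter
    "PAN": re.compile(r"\b[A-Z]{5}\d{4}[A-Z]\b"),
    # 8 digits, accepted only when labelled "DIN" to avoid matching amounts
    "DIN": re.compile(r"\bDIN[:\s]*(\d{8})\b"),
    "IND_AS_REF": re.compile(r"\bInd AS\s+(\d{1,3})\b"),
}

def extract_entities(text):
    """Return matches as dicts with type, value, and character offset,
    ordered by position in the text."""
    found = []
    for label, pattern in PATTERNS.items():
        # Use the capture group when the pattern defines one.
        group = 1 if pattern.groups else 0
        for match in pattern.finditer(text):
            found.append({
                "type": label,
                "value": match.group(group),
                "start": match.start(group),
            })
    return sorted(found, key=lambda entity: entity["start"])
```

Keeping the character offset with each match is what would let a later stage turn it into an evidence span.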

Data Ingestion & Storage (Layer 1)

PostgreSQL + S3-compatible object storage + document lifecycle management

File Upload

PDF, XBRL, Excel upload with size validation and asynchronous processing

PostgreSQL

Document metadata, extraction results, compliance findings, audit trail

S3-Compatible Storage

Object storage for uploaded documents and extracted artifacts (AWS S3 / Cloudflare R2 style endpoints)

Alembic Migrations

Schema versioning — initial_tables migration for all models

Explainability — 7-Component Reasoning Chain

100% of non-compliant findings include a full reasoning chain from rule → evidence → explanation → remediation

Rule Citation

Every finding cites the specific Ind AS / SEBI LODR / Companies Act rule (e.g., IND_AS_24_001)

Evidence Spans

Text snippets from the document that triggered the finding, with page numbers

Missing Elements

List of specific disclosures required but not found (e.g., 'transaction amounts')

LLM Explanation

Natural language explanation of why the flag was raised

Remediation

Actionable guidance on how to fix the non-compliance

Precedent Citations

NFRA/SEBI enforcement order citations with penalty amounts

RAG Sources

Vector-retrieved regulatory passages from Qdrant supporting the decision
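
The seven components above map naturally onto a single record type. The dataclass below is a sketch under assumptions: the field names are invented here, and `missing_components` is a hypothetical helper for checking that a non-compliant finding carries the full chain.

```python
from dataclasses import dataclass, field, fields

@dataclass
class Finding:
    rule_citation: str                                        # e.g. "IND_AS_24_001"
    evidence_spans: list = field(default_factory=list)        # (page, snippet) pairs
    missing_elements: list = field(default_factory=list)      # disclosures not found
    llm_explanation: str = ""
    remediation: str = ""
    precedent_citations: list = field(default_factory=list)
    rag_sources: list = field(default_factory=list)

    def missing_components(self):
        """Names of reasoning-chain components that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]
```

A report generator could refuse to emit a non-compliant finding while `missing_components()` is non-empty, which is one way to enforce the 100% coverage stated above.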

Benchmark Performance (11/11 Targets Met)

Annexure III Part A — Technical Performance Metrics

Metric	Achieved	Target	Status
CER (Digital)	0.00%	≤ 5%	PASS
Entity F1	0.9622	≥ 0.85	PASS
Extraction Accuracy	0.9303	≥ 0.85	PASS
Segmentation mIoU	0.9322	≥ 0.85	PASS
Section F1	0.8162	≥ 0.75	PASS
ROUGE-1	0.4042	≥ 0.35	PASS
ROUGE-2	0.2060	≥ 0.15	PASS
ROUGE-L	0.3095	≥ 0.30	PASS
BERTScore	0.8889	≥ 0.85	PASS
Compliance Macro-F1	0.8575	≥ 0.80	PASS
Compliance MCC	0.808	≥ 0.60	PASS
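
For readers unfamiliar with the Compliance Macro-F1 row: macro-F1 is the unweighted mean of per-class F1 scores, so rare classes count as much as common ones. The sketch below uses the standard definition; it is not the project's evaluation script, and the label names in the usage note are placeholders.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label that appears
    in either the ground truth or the predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        denominator = 2 * tp + fp + fn
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)
```

For example, with truth `["compliant", "compliant", "non_compliant", "non_compliant"]` and predictions `["compliant", "non_compliant", "non_compliant", "non_compliant"]`, the per-class F1 scores are 2/3 and 4/5, giving a macro-F1 of 11/15.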

Technology Stack

Category	Technologies
Frontend	Next.js 16, React 19, TypeScript, Tailwind CSS, shadcn/ui, Recharts, Lucide
Backend	FastAPI, Python 3.11+, Pydantic v2, SQLAlchemy 2.0, Alembic
LLM	Gemini 2.0 Flash via LiteLLM, gemini-embedding-001 (3072-dim)
Orchestration	LangGraph — 7-node parallel DAG with state machine
Graph DB	Neo4j Aura — 312 nodes, 7 node types, Cypher queries
Vector DB	Qdrant Cloud — 408 vectors, 2 collections (regulations, precedents)
External APIs	Outris MCA API + ROC database integration for CIN/DIN/name verification
Storage	PostgreSQL (metadata), S3-compatible object storage (documents)
Deployment	Railway (backend + frontend), Railway Postgres, Neo4j Aura, Qdrant Cloud