VaaniNetra

System Architecture

VaaniNetra (वाणीनेत्र) — "The Eye That Reads Financial Language"

Judge Snapshot
What makes this architecture practical for regulatory workflows

Coverage + Explainability

132 YAML rules across 17 regulatory frameworks with evidence spans, rule citations, precedents, and remediation in each report.

Entity Trust Layer

Extracted entities are cross-validated via Neo4j relationships and Outris ROC/MCA services before final report generation.

Human-in-the-Loop

Low-confidence entity/section/compliance outputs create annotation tasks with bulk review actions and export-ready correction data.

Operational Visibility

Upload and validation flows expose stage-level progress, retries, and live activity logs for long-running annual report processing.

End-to-End Pipeline
Document intake → compliance report in ~60 seconds
1. Upload (< 1s): Drag & drop PDF/XBRL document
2. Extract (~2s/page): OCR + entity recognition + section classification
3. Validate (~45s total): 132 rules across 17 frameworks + KG entity validation + ROC/MCA cross-check
4. Forensic (~5s): 9 anomaly checks (Benford, Beneish, Altman, etc.)
5. Report (~3s): Findings with evidence chains + precedent citations
6. Chat (real-time): Ask NFRA Bot questions about the document
High-Level Architecture
5-layer system: Presentation → Agentic Compliance Engine → Regulatory Knowledge Graph → Document Intelligence Engine → Data Ingestion & Storage (including external APIs)
PRESENTATION LAYER
Next.js 16 Dashboard
11 pages, shadcn/ui, Recharts, Forensics
FastAPI REST API
documents, compliance, forensic, entity APIs
NFRA Insight Chatbot
RAG + Knowledge Graph
LAYER 4: AGENTIC COMPLIANCE ENGINE (LangGraph 7-Node DAG)
parse_doc + map_rules
132 YAML rules, 17 standards
validate_ind_as
13 Ind AS, RAG+Neo4j, 3x vote
validate_sebi
SEBI LODR, MCA Sch III+BRSR
detect_forensic
Benford, Beneish, Altman, 9 checks
generate_report
Findings, Evidence, Precedent
Heuristic (<50ms) → LLM+RAG (3x vote) → Human-in-Loop  |  Rule citation + evidence + precedents
LAYER 3: REGULATORY KNOWLEDGE GRAPH
Neo4j Aura
312 nodes, 7 types, 19 Cypher
Qdrant Cloud
408 vectors, 2 collections, 3072-dim
NFRA Precedents
58 orders, penalty amounts
YAML Rule Engine
132 rules, 18 files
LAYER 2: DOCUMENT INTELLIGENCE ENGINE
Format Detector
PDF, XBRL, iXBRL, Excel
OCR Engine
pdfplumber + Gemini Vision
Entity Extraction
21 regex + 22 LLM types
XBRL Extractor
BSE/NSE/MCA, Ind AS taxonomy
Section Classifier (13 types) + Table Detector + Scope Detection
LAYER 1: DATA INGESTION & STORAGE
Outris ROC DB
2.45M companies
MCA API
CIN/DIN lookup
Gemini 2.0 Flash
LiteLLM router
OSINT Module
SEBI/NFRA scrapers
File Upload
PDF, XBRL, XLS
PostgreSQL (metadata + findings) • S3-Compatible Storage (docs)
System Architecture Diagram
End-to-end data flow: Document Intelligence → LangGraph Compliance Engine → Knowledge Layer → External APIs → Compliance Report
Diagram legend: Document Intelligence, Compliance Engine, Knowledge + Forensics, External APIs / LLM
LangGraph Compliance Pipeline — 7-Node Parallel DAG
parse_document → map_rules → [validate_ind_as ‖ validate_sebi ‖ detect_forensic ‖ validate_entities] → generate_report
Diagram legend: Orchestration nodes, Ind AS / SEBI validation, Forensic detection (9 checks), Report generation
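The fan-out/fan-in shape of the pipeline above can be sketched in plain Python. This is only an illustration of the topology, not LangGraph's API and not the project's code: the node bodies, the state keys (`raw`, `text`, `rules`, `findings`, `report`), and the use of a thread pool are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical node bodies: each node reads the shared state dict and
# returns a partial update, mirroring a typical graph-node contract.
def parse_document(state):
    return {"text": state["raw"].strip()}

def map_rules(state):
    return {"rules": ["IND_AS_24_001"]}  # placeholder rule list

def validate_ind_as(state):
    return {"findings": [("ind_as", len(state["rules"]))]}

def validate_sebi(state):
    return {"findings": [("sebi", 0)]}

def detect_forensic(state):
    return {"findings": [("forensic", 0)]}

def validate_entities(state):
    return {"findings": [("entities", 0)]}

def generate_report(state):
    return {"report": sorted(name for name, _ in state["findings"])}

PARALLEL = [validate_ind_as, validate_sebi, detect_forensic, validate_entities]

def run_pipeline(raw):
    state = {"raw": raw, "findings": []}
    # Sequential prefix: parse_document -> map_rules
    for node in (parse_document, map_rules):
        state.update(node(state))
    # Fan-out: the four validators only read state, so they can run together.
    with ThreadPoolExecutor(max_workers=len(PARALLEL)) as pool:
        updates = list(pool.map(lambda node: node(state), PARALLEL))
    # Fan-in: merge the partial findings before the final node.
    for update in updates:
        state["findings"].extend(update["findings"])
    state.update(generate_report(state))
    return state

result = run_pipeline("  annual report text  ")
print(result["report"])
```

Keeping the parallel nodes read-only and merging their outputs in one place avoids write conflicts on the shared state, which is the main design concern in any fan-in step.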
3-Layer Classification Architecture
Heuristic Engine → LLM Verification (RAG + self-consistency) → Human-in-the-Loop: findings escalate through the layers, and only low-confidence items reach human review

Layer 1 — Heuristic (<50ms)

21 regex patterns + 132 YAML keyword rules. Fast first-pass, filters not_applicable cases.

Layer 2 — LLM + RAG

Gemini 2.0 Flash with Qdrant RAG context + Neo4j precedents. Self-consistency voting ×3.

Layer 3 — Human Review

Low-confidence items auto-queued for annotation. Corrections exported as training data (JSONL/CoNLL/HF).
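The three layers above can be read as an escalation path. The following is a minimal sketch under stated assumptions: the function names, the label strings other than `not_applicable`, the `MIN_AGREEMENT` threshold, and the stubbed model call are hypothetical and stand in for the real heuristics and the Gemini + RAG call.

```python
from collections import Counter

MIN_AGREEMENT = 0.5  # hypothetical threshold; below it the item goes to review

def heuristic_layer(text, keywords):
    """Layer 1: cheap keyword screen. Returns a label only when it can decide."""
    lowered = text.lower()
    if not any(keyword in lowered for keyword in keywords):
        return "not_applicable"
    return None  # the rule may apply, so escalate to Layer 2

def vote_layer(text, llm_call, passes=3):
    """Layer 2: self-consistency voting over several model calls."""
    votes = Counter(llm_call(text) for _ in range(passes))
    label, count = votes.most_common(1)[0]
    return label, count / passes

def classify(text, keywords, llm_call, review_queue):
    early = heuristic_layer(text, keywords)
    if early is not None:
        return early, 1.0
    label, agreement = vote_layer(text, llm_call)
    if agreement < MIN_AGREEMENT:
        # Layer 3: queue for human annotation instead of trusting the vote.
        review_queue.append({"text": text, "label": label, "agreement": agreement})
    return label, agreement

# Usage with a stub standing in for the model call.
queue = []
answers = iter(["non_compliant", "compliant", "non_compliant"])
label, agreement = classify(
    "Related party transactions are disclosed in Note 32.",
    ["related party"],  # keywords are expected in lowercase
    lambda _text: next(answers),
    queue,
)
print(label, round(agreement, 2), len(queue))  # non_compliant 0.67 0
```

With three passes the agreement can only be 1.0, 2/3, or 1/3, so a threshold of 0.5 sends only three-way splits to review; a stricter deployment could raise it so that any disagreement is queued.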

Presentation Layer
Next.js 16 frontend with 11 sidebar pages + report detail route, shadcn/ui components, Recharts visualizations

Dashboard

Stats, severity charts, compliance distribution, recent docs

Upload & Extract

Drag-and-drop PDF upload → OCR → entity extraction → section classification

Compliance Reports

Reports list + report detail route with evidence spans, precedent citations, RAG sources

Forensic Analysis

Benford charting + anomaly list (including Beneish/Altman/ratio-based backend checks)

OSINT Intelligence

SEBI/NFRA enforcement actions, news signals, entity screening

NFRA Chatbot

RAG-powered conversational assistant with document Q&A and KG queries

Regulations

Browse 132 rules across regulatory frameworks with searchable rule cards

Annotations

Human-in-the-loop queue with document filtering and bulk accept/reject actions

Benchmarks

11/11 metrics passing, latency breakdown, explainability chain, confusion matrix

Architecture

This page — system architecture overview and component documentation

Settings

Runtime toggles for MCA and ROC entity validation integrations

Agentic Compliance Engine (Layer 4)
LangGraph 7-node parallel DAG with multi-agent validation, RAG context, and optional self-consistency voting

Orchestrator Agent

LangGraph state machine — routes documents through parallel validation nodes

Ind AS Validator

13 standards (Ind AS 1, 7, 8, 12, 16/38, 24, 33, 36, 37, 109, 115, 116)

SEBI/MCA Validators

SEBI LODR + MCA Schedule III checks via YAML rule packs (RBI rules present in rule definitions)

Forensic Anomaly Detector

9 checks — Benford, Beneish, Altman, ratio variance, Q4 spike, audit mismatch

Entity Validator

KG + ROC + MCA-backed CIN/DIN/FRN/auditor validation

Self-Consistency Voting

Configurable multi-pass voting for selected checks (enabled via runtime flags)

Report Generator

Findings aggregation, severity classification, executive summary, auto annotation task generation
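As an illustration of the Benford check named under the Forensic Anomaly Detector, the sketch below compares observed leading-digit frequencies against Benford's law, P(d) = log10(1 + 1/d). The function names and the choice of mean absolute deviation as the score are assumptions; the source does not say which statistic the detector uses.

```python
import math
from collections import Counter

# Benford's law: expected share of each leading digit d in 1..9.
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def leading_digit(value):
    """Return the first significant digit of a non-zero number."""
    digits = f"{abs(value):.15e}"  # scientific notation, e.g. '4.2...e+03'
    return int(digits[0])

def benford_mad(amounts):
    """Mean absolute deviation between observed and expected digit shares."""
    digits = [leading_digit(a) for a in amounts if a]  # zeros are skipped
    if not digits:
        raise ValueError("need at least one non-zero amount")
    counts = Counter(digits)
    total = len(digits)
    return sum(abs(counts[d] / total - BENFORD[d]) for d in BENFORD) / 9

# Powers of two track Benford closely; evenly spread digits do not.
natural = [2 ** n for n in range(1, 101)]
uniform = [d * 100 for d in range(1, 10)] * 10
print(round(benford_mad(natural), 4), round(benford_mad(uniform), 4))
```

A higher score means the ledger's leading digits deviate more from the expected distribution; any cut-off for raising an anomaly would need to be calibrated on real filings.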

Regulatory Knowledge Graph (Layer 3)
Neo4j (312 nodes) + Qdrant (408 vectors) + YAML rules (132 across 18 files)

Neo4j Graph

312 nodes: 12 Companies, 4 Regulations, 16 Standards, 103 Disclosures, 67 Provisions, 58 Precedents, 52 Auditors

Qdrant Vector Store

408 vectors in 2 collections (regulations, precedents), gemini-embedding-001, 3072-dim

Rule Engine

132 YAML rules across 18 files — keyword matching + LLM refinement per rule

Precedent Database

36 NFRA enforcement orders (27 text + 9 vision extracted), penalty amounts, entity links

Query Engine

Cypher queries for entity validation, shared auditor detection, violation history

MCA API Validator

Live CIN/DIN verification via Ministry of Corporate Affairs database (fetch-company, fetch-director, fetch-by-name)
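Shared auditor detection, listed under the Query Engine above, might look like the sketch below. The `Company` and `Auditor` labels follow the node types listed for the graph; the `AUDITED_BY` relationship, the `name` property, and the company and firm names are assumptions, and the Python function is only an in-memory equivalent for illustration.

```python
from collections import defaultdict

# Cypher sketch; relationship type and property names are hypothetical.
SHARED_AUDITOR_QUERY = """
MATCH (c:Company)-[:AUDITED_BY]->(a:Auditor)
WITH a, collect(DISTINCT c.name) AS companies
WHERE size(companies) > 1
RETURN a.name AS auditor, companies
"""

def shared_auditors(edges):
    """Group companies by auditor and keep auditors with two or more clients."""
    by_auditor = defaultdict(set)
    for company, auditor in edges:
        by_auditor[auditor].add(company)
    return {a: sorted(cs) for a, cs in by_auditor.items() if len(cs) > 1}

edges = [("Alpha Ltd", "Firm X"), ("Beta Ltd", "Firm X"), ("Gamma Ltd", "Firm Y")]
print(shared_auditors(edges))  # {'Firm X': ['Alpha Ltd', 'Beta Ltd']}
```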

Document Intelligence Engine (Layer 2)
PDF extraction pipeline — OCR, entity recognition, section classification, table detection

Format Detector

Auto-detect PDF, XBRL, iXBRL, Excel — routes to appropriate parser

OCR Engine

pdfplumber (digital) + Gemini 2.0 Flash Vision (scanned) — dual-path extraction

Entity Extractor

21 regex patterns (CIN, DIN, PAN, amounts, Ind AS refs) + 22 LLM entity types

Section Classifier

13 regulatory section types — keyword + LLM hybrid with structural rules

Scope Detector

Standalone vs Consolidated detection using 4+4 marker patterns

Table Detector

pdfplumber table extraction with financial number matching
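The regex side of the Entity Extractor can be illustrated with a few patterns. These are a sketch, not the project's 21 patterns: the CIN and PAN expressions follow the published identifier formats (21-character CIN, 10-character PAN), while the DIN and Ind AS patterns, the `extract_entities` name, and the sample text are assumptions.

```python
import re

PATTERNS = {
    # CIN: L/U + 5-digit industry code + 2-letter state + year + 3-letter type + 6 digits
    "CIN": re.compile(r"\b[LU]\d{5}[A-Z]{2}\d{4}[A-Z]{3}\d{6}\b"),
    # PAN: 5 letters + 4 digits + 1 letter
    "PAN": re.compile(r"\b[A-Z]{5}\d{4}[A-Z]\b"),
    # DIN: 8 digits, matched only when preceded by the literal "DIN"
    "DIN": re.compile(r"\bDIN[:\s]*(\d{8})\b"),
    "IND_AS_REF": re.compile(r"\bInd AS\s*(\d{1,3})\b"),
}

def extract_entities(text):
    """Return matched entities with type, value, and character span."""
    entities = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            # Use the capture group when the pattern defines one.
            value = match.group(1) if pattern.groups else match.group(0)
            entities.append({"type": label, "value": value, "span": match.span()})
    return sorted(entities, key=lambda e: e["span"])

sample = ("ABC Ltd (CIN: L12345MH2001PLC123456), PAN ABCDE1234F, "
          "director DIN 01234567, see Ind AS 24.")
for entity in extract_entities(sample):
    print(entity["type"], entity["value"])
```

Requiring the "DIN" prefix keeps the pattern from matching every 8-digit amount in a financial statement, which is the kind of precision trade-off a regex first pass has to make before the LLM entity types take over.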

Data Ingestion & Storage (Layer 1)
PostgreSQL + S3-compatible object storage + document lifecycle management

File Upload

PDF, XBRL, Excel upload with size validation and asynchronous processing

PostgreSQL

Document metadata, extraction results, compliance findings, audit trail

S3-Compatible Storage

Object storage for uploaded documents and extracted artifacts (AWS S3 / Cloudflare R2 style endpoints)

Alembic Migrations

Schema versioning — initial_tables migration for all models

Explainability — 7-Component Reasoning Chain
100% of non-compliant findings include a full reasoning chain from rule → evidence → explanation → remediation
1

Rule Citation

Every finding cites the specific Ind AS / SEBI LODR / Companies Act rule (e.g., IND_AS_24_001)

2

Evidence Spans

Text snippets from the document that triggered the finding, with page numbers

3

Missing Elements

List of specific disclosures required but not found (e.g., 'transaction amounts')

4

LLM Explanation

Natural language explanation of why the flag was raised

5

Remediation

Actionable guidance on how to fix the non-compliance

6

Precedent Citations

NFRA/SEBI enforcement order citations with penalty amounts

7

RAG Sources

Vector-retrieved regulatory passages from Qdrant supporting the decision
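One way to picture the seven components above is as a single record with a completeness check. The class name, field names, and sample values below are hypothetical; only the rule ID format (`IND_AS_24_001`) and the 'transaction amounts' example come from the text.

```python
from dataclasses import dataclass, field, fields

@dataclass
class Finding:
    rule_id: str                                           # 1. Rule Citation
    evidence_spans: list = field(default_factory=list)     # 2. Evidence Spans
    missing_elements: list = field(default_factory=list)   # 3. Missing Elements
    explanation: str = ""                                  # 4. LLM Explanation
    remediation: str = ""                                  # 5. Remediation
    precedents: list = field(default_factory=list)         # 6. Precedent Citations
    rag_sources: list = field(default_factory=list)        # 7. RAG Sources

    def missing_components(self):
        """Names of reasoning-chain components that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

finding = Finding(
    rule_id="IND_AS_24_001",
    evidence_spans=[{"page": 112, "text": "Related party disclosures ..."}],
    missing_elements=["transaction amounts"],
    explanation="Related party note omits transaction amounts.",
    remediation="Disclose amounts for each related party transaction.",
    precedents=["(illustrative precedent reference)"],
    rag_sources=["(illustrative retrieved passage)"],
)
print(finding.missing_components())  # []
```

A check like `missing_components()` is one way a report generator could enforce that every non-compliant finding carries the full chain before it is published.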

Benchmark Performance (11/11 Targets Met)
Annexure III Part A — Technical Performance Metrics
Metric | Achieved | Target | Status
CER (Digital) | 0.00% | ≤ 5% | PASS
Entity F1 | 0.9622 | ≥ 0.85 | PASS
Extraction Accuracy | 0.9303 | ≥ 0.85 | PASS
Segmentation mIoU | 0.9322 | ≥ 0.85 | PASS
Section F1 | 0.8162 | ≥ 0.75 | PASS
ROUGE-1 | 0.4042 | ≥ 0.35 | PASS
ROUGE-2 | 0.2060 | ≥ 0.15 | PASS
ROUGE-L | 0.3095 | ≥ 0.30 | PASS
BERTScore | 0.8889 | ≥ 0.85 | PASS
Compliance Macro-F1 | 0.8575 | ≥ 0.80 | PASS
Compliance MCC | 0.808 | ≥ 0.60 | PASS
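The pass/fail column can be recomputed from the achieved and target values in the benchmark table. The sketch below copies those numbers and also reports which metric clears its target by the smallest margin; the variable and function names are illustrative.

```python
import operator

# (metric, achieved, comparator, target) copied from the benchmark table.
RESULTS = [
    ("CER (Digital) %", 0.00, operator.le, 5.0),
    ("Entity F1", 0.9622, operator.ge, 0.85),
    ("Extraction Accuracy", 0.9303, operator.ge, 0.85),
    ("Segmentation mIoU", 0.9322, operator.ge, 0.85),
    ("Section F1", 0.8162, operator.ge, 0.75),
    ("ROUGE-1", 0.4042, operator.ge, 0.35),
    ("ROUGE-2", 0.2060, operator.ge, 0.15),
    ("ROUGE-L", 0.3095, operator.ge, 0.30),
    ("BERTScore", 0.8889, operator.ge, 0.85),
    ("Compliance Macro-F1", 0.8575, operator.ge, 0.80),
    ("Compliance MCC", 0.808, operator.ge, 0.60),
]

def margin(achieved, compare, target):
    """Distance from the target, positive when the target is met."""
    return target - achieved if compare is operator.le else achieved - target

passed = [name for name, achieved, compare, target in RESULTS
          if compare(achieved, target)]
tightest = min(RESULTS, key=lambda r: margin(r[1], r[2], r[3]))

print(f"{len(passed)}/{len(RESULTS)} targets met")  # 11/11 targets met
print("tightest margin:", tightest[0])              # tightest margin: ROUGE-L
```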
Technology Stack
Category | Technologies
Frontend | Next.js 16, React 19, TypeScript, Tailwind CSS, shadcn/ui, Recharts, Lucide
Backend | FastAPI, Python 3.11+, Pydantic v2, SQLAlchemy 2.0, Alembic
LLM | Gemini 2.0 Flash via LiteLLM, gemini-embedding-001 (3072-dim)
Orchestration | LangGraph: 7-node parallel DAG with state machine
Graph DB | Neo4j Aura: 312 nodes, 7 node types, Cypher queries
Vector DB | Qdrant Cloud: 408 vectors, 2 collections (regulations, precedents)
External APIs | Outris MCA API + ROC database integration for CIN/DIN/name verification
Storage | PostgreSQL (metadata), S3-compatible object storage (documents)
Deployment | Railway (backend + frontend), Railway Postgres, Neo4j Aura, Qdrant Cloud