RAG vs NLP vs Raw LLM Calls: Choosing the Right AI Architecture

Most “AI-powered” products start the same way: send text to an LLM, parse the response, ship it. No custom models. No fine-tuning. No MLOps infrastructure.

And that’s usually the right call—at least initially.

The question isn’t whether to use AI. It’s which kind of AI architecture fits your current stage: raw LLM calls for speed, RAG for document Q&A, or classical NLP for cost efficiency at scale. Each has different tradeoffs, and most teams will use all three eventually.

This post breaks down when each approach makes sense, how to blend them together, and why “start simple, optimize later” beats “build it right the first time” for most startups.

The Three Approaches

Let’s define terms:

Raw LLM Calls: Send your data + a prompt to a frontier model, get structured output back.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": f"Extract metrics from: {document_text}"}]
)

RAG (Retrieval-Augmented Generation): Store your data as embeddings, retrieve relevant chunks, send those chunks + user query to an LLM.

relevant_docs = vector_db.similarity_search(query, k=5)
context = "\n".join([doc.content for doc in relevant_docs])
response = llm.generate(f"Based on this context:\n{context}\n\nAnswer: {query}")

Classical NLP: Use pre-trained models (BERT, FinBERT, spaCy) or rule-based systems to extract entities, classify text, or parse structure without calling external APIs.

import spacy

nlp = spacy.load("en_core_web_sm")  # any pretrained pipeline with an NER component
doc = nlp(text)
entities = [(ent.text, ent.label_) for ent in doc.ents]

Each has different cost curves, accuracy profiles, and operational complexity.

The Honest Comparison

Factor                  | Raw LLM     | RAG          | Classical NLP
Time to MVP             | Days        | 1-2 weeks    | 4-8 weeks
Per-request cost        | $0.01-0.10  | $0.005-0.05  | ~$0.001
Accuracy (out of box)   | 70-95%*     | 85-95%       | 70-85%
Accuracy (with tuning)  | 85-98%      | 90-98%       | 90-95%
Latency                 | 1-10s       | 0.5-5s       | 10-100ms
Improves over time?     | No          | Somewhat     | Yes (with feedback)
Requires ML expertise?  | No          | Some         | Yes
Vendor lock-in          | High        | Medium       | None

*LLM accuracy varies significantly by task complexity. Recent benchmarks show 60-80% on contract extraction (Vellum), 96%+ on medical data extraction (Nature Digital Medicine), and 82% on finance tasks (AIMultiple). GPT-5.2 achieves 98% accuracy on 256k context retrieval tasks.

The pattern is clear: LLMs trade money and vendor dependency for speed and simplicity. NLP trades development time for control and cost efficiency.

Quick Decision Guide

START HERE: What are you building?

├─► "Users ask questions about documents"
│   └─► RAG (with LLM fallback for complex queries)

├─► "Extract specific fields from documents"
│   │
│   ├─► Less than 10k docs/month?
│   │   └─► Raw LLM calls (optimize later)
│   │
│   ├─► More than 10k docs/month?
│   │   │
│   │   ├─► Structured format (tables, forms)?
│   │   │   └─► Hybrid: NLP for tables + LLM for edge cases
│   │   │
│   │   └─► Unstructured format (free text, mixed layouts)?
│   │       └─► LLM with confidence scoring + human review
│   │
│   └─► Need real-time (<500ms)?
│       └─► NLP only (no API calls in hot path)

└─► "Classify or categorize documents"

    ├─► Few categories, lots of examples?
    │   └─► Fine-tuned classifier (BERT, DistilBERT)

    └─► Many categories, few examples?
        └─► LLM with few-shot prompting

The 30-second version: Start with LLM calls. Add RAG when users want to search. Add NLP when costs hurt.

What This Looks Like in Practice

These numbers aren’t theoretical. I’ve seen this pattern repeatedly with document extraction systems—sustainability reports, financial filings, legal contracts. The architecture that gets to production fastest is almost always prompt-based.

A typical ESG (Environmental, Social, and Governance) extraction pipeline might look like:

  • Hundreds of indicators extracted per document (emissions, water usage, board composition, etc.)
  • 80-90% accuracy on first-pass extraction
  • $0.02-0.05 per document depending on model and document length
  • 30-60 seconds latency per document

At 1,000 documents/month, that’s $20-50. At 100,000 documents/month, it’s $2,000-5,000—and that’s when the math changes.

A hybrid system that routes content by type can look like:

  • Tables (often 70-80% of structured data): NLP extraction at ~$0.001/doc
  • Free-form text (15-20%): NER models at ~$0.002/doc
  • Charts and visuals (5-10%): LLM extraction at ~$0.03/doc
  • Low-confidence results: LLM verification as a fallback

The potential savings at scale are significant—often 60-75% cost reduction with 3-4x faster processing. And unlike pure LLM approaches, accuracy can improve over time as corrections feed back into the NLP models.
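The arithmetic behind that range is worth spelling out. Here is a back-of-the-envelope sketch using the illustrative per-document costs and content mix above; the verification share and its cost are additional assumptions, not measurements:

# Back-of-the-envelope blended cost for the routing mix described above.
# Every number here is an illustrative assumption, not a benchmark.
def blended_cost_per_doc(
    table_share=0.75, table_cost=0.001,    # NLP table extraction
    text_share=0.20, text_cost=0.002,      # NER on free-form text
    chart_share=0.05, chart_cost=0.030,    # LLM vision on charts
    verify_share=0.15, verify_cost=0.020,  # assumed LLM-verification fallback
):
    routed = table_share * table_cost + text_share * text_cost + chart_share * chart_cost
    return routed + verify_share * verify_cost

llm_only = 0.03  # pure-LLM extraction, $/doc
hybrid = blended_cost_per_doc()
print(f"hybrid ~${hybrid:.4f}/doc vs ${llm_only:.2f}/doc ({1 - hybrid / llm_only:.0%} cheaper)")
# With these assumptions: about $0.006/doc, roughly 80% cheaper on paper.
# Re-runs, human review, and infrastructure overhead pull real-world
# savings toward the 60-75% range quoted above.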

The lesson isn’t “LLMs are bad.” The lesson is that prompt-based extraction is often the right starting point, and there’s a clear path to optimization when scale demands it.

When to Use Raw LLM Calls

Use LLMs when:

  1. You’re validating the product, not the technology. If you don’t know whether users want extracted financial metrics, don’t spend 8 weeks building an NLP pipeline. Spend 3 days building an LLM wrapper and find out.

  2. The task is genuinely hard for rule-based systems. Free-form document understanding, nuanced classification, anything requiring “reasoning”—LLMs excel here.

  3. Volume is low. At 1,000 documents/month and $0.03/doc, you’re paying $30/month. That’s nothing. The engineering time to build an NLP system costs more.

  4. Accuracy matters more than cost. For high-stakes extractions (financial filings, legal documents), LLM accuracy often beats custom NLP without extensive training data.

The typical LLM extraction pattern:

import json

import google.generativeai as genai  # assumes genai.configure(api_key=...) ran at startup

EXTRACTION_PROMPT = """
You are an expert data extractor. Extract the following from the document:
- Total Revenue (in millions USD)
- Net Income (in millions USD)
- Total Assets (in millions USD)

Return as JSON: {{"revenue": ..., "net_income": ..., "total_assets": ...}}
If a value is not found, use null.

Document:
{document_text}
"""

gemini = genai.GenerativeModel("gemini-2.5-flash")

def extract_financials(document_text: str) -> dict:
    # The JSON braces above are doubled so str.format() doesn't treat them as placeholders.
    response = gemini.generate_content(
        EXTRACTION_PROMPT.format(document_text=document_text)
    )
    return json.loads(response.text)

That’s it. That’s the “AI-powered extraction engine.” And it works surprisingly well.

The Hidden Costs of LLM-Only

What the cost table doesn’t show:

  1. Non-determinism. Run the same prompt on the same document twice, get different results. LLMs are probabilistic systems—temperature settings help, but you can’t guarantee identical outputs. For data pipelines that need reproducibility, this is a fundamental problem. You can’t write unit tests that assert exact outputs. You can’t diff yesterday’s extraction against today’s to see what changed.

  2. No learning loop. Every extraction is independent. If you correct an error, the model doesn’t learn from it. You’re paying for the same mistakes forever. NLP systems can ingest corrections as training data; LLMs just forget.

  3. Black box debugging. When extraction fails, you can’t easily debug why. Was it the prompt? The document format? Token limits? A model update that changed behavior? You’re guessing. I’ve spent hours debugging extractions only to discover the model was updated and now interprets a phrase differently.

  4. Latency floor. API calls have inherent latency. Even fast models (Gemini Flash, GPT-4-turbo) take 1-5 seconds. For real-time applications, this is brutal. For batch processing, it means 30-60 seconds per document instead of the 100ms you’d get from local NLP.

  5. Rate limits and outages. Your “AI feature” is actually “OpenAI’s uptime.” When they have issues, you have issues. When they deprecate a model, you scramble. When they change pricing, your unit economics shift overnight.

  6. Silent regressions. Model updates happen without warning. An extraction that worked perfectly for six months suddenly starts returning different field names or misinterpreting units. Without comprehensive logging and monitoring, you won’t notice until a customer complains.

These aren’t reasons to avoid LLMs—they’re reasons to treat LLM-based systems as prototypes that may need hardening later.
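The cheapest hedge against silent regressions is a golden set: a handful of documents with hand-verified outputs that you re-run on a schedule and diff. A minimal sketch (the file name and wiring are placeholders; `extract_financials` is the function from the example above):

import json

def regression_check(extract_fn, golden_cases: list[dict]) -> list[str]:
    """golden_cases: [{"document": "...", "expected": {"revenue": ...}}, ...]"""
    failures = []
    for case in golden_cases:
        result = extract_fn(case["document"])
        for field_name, expected in case["expected"].items():
            if result.get(field_name) != expected:
                failures.append(f"{field_name}: expected {expected!r}, got {result.get(field_name)!r}")
    return failures

# Run nightly; wire the output into your alerting rather than printing it.
with open("golden_set.json") as f:
    failures = regression_check(extract_financials, json.load(f))
print("\n".join(failures) or "no drift detected")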

When to Use RAG

RAG is the middle ground: smarter than raw prompts, simpler than full NLP pipelines.

Use RAG when:

  1. Users ask questions over a corpus. Chatbots, search, “ask your documents”—RAG is purpose-built for this.

  2. Context matters more than extraction. RAG excels at “find relevant information and synthesize” vs. “extract specific fields.”

  3. You have a growing knowledge base. New documents get embedded and become searchable immediately. No retraining.

  4. Accuracy needs to exceed raw LLM. By grounding responses in retrieved context, RAG reduces hallucination significantly.

The typical RAG pattern:

# Indexing (once per document)
def index_document(doc_id: str, text: str, user_id: str = None):
    chunks = split_into_chunks(text, chunk_size=500)
    for i, chunk in enumerate(chunks):
        embedding = embed_model.encode(chunk)
        vector_db.upsert(
            id=f"{doc_id}_{i}",
            vector=embedding,
            # Store user_id so the per-user filter in answer_question() has something to match on
            metadata={"doc_id": doc_id, "text": chunk, "user_id": user_id}
        )

# Querying
def answer_question(query: str, user_id: str = None) -> str:
    query_embedding = embed_model.encode(query)

    # Retrieve relevant chunks
    results = vector_db.search(
        query_embedding,
        top_k=5,
        filter={"user_id": user_id} if user_id else None
    )

    context = "\n---\n".join([r.metadata["text"] for r in results])

    # Generate answer with context
    response = llm.generate(f"""
    Based on the following documents:
    {context}

    Answer this question: {query}

    If the answer isn't in the documents, say so.
    """)

    return response

RAG Architecture Decisions

Embedding model matters. OpenAI’s text-embedding-3-small is cheap and good. For domain-specific content (finance, legal, medical), consider fine-tuned embeddings or models like FinBERT.

Chunk size is a tradeoff. Smaller chunks (200-500 tokens) = more precise retrieval, less context per chunk. Larger chunks (1000-2000 tokens) = more context, potentially irrelevant content retrieved.
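For reference, here is a minimal character-based chunker with overlap, a simplified stand-in for the `split_into_chunks` helper in the indexing sketch above (production pipelines usually split on tokens or sentence boundaries instead):

def split_into_chunks(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Greedy fixed-size chunks with overlap, so a fact that straddles a
    boundary still appears intact in at least one chunk."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # step back by `overlap` characters
    return chunks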

Hybrid search often wins. Combine vector similarity with keyword search (BM25). Some queries are better served by exact matches.

# Hybrid search with reciprocal rank fusion
def hybrid_search(query: str, top_k: int = 5):
    # Vector search
    vector_results = vector_db.search(embed(query), top_k=20)

    # Keyword search
    keyword_results = bm25_index.search(query, top_k=20)

    # Combine with RRF
    scores = {}
    for rank, result in enumerate(vector_results):
        scores[result.id] = scores.get(result.id, 0) + 1 / (60 + rank)
    for rank, result in enumerate(keyword_results):
        scores[result.id] = scores.get(result.id, 0) + 1 / (60 + rank)

    return sorted(scores.items(), key=lambda x: -x[1])[:top_k]

The Hidden Costs of RAG

  1. Embedding drift. When you change embedding models, you need to re-embed everything. This is expensive at scale.

  2. Chunk boundary problems. Important information split across chunks may never be retrieved together.

  3. Still paying for LLM calls. RAG reduces hallucination but doesn’t eliminate LLM costs. You’re paying for embeddings AND generation.

  4. Index maintenance. Documents change. Keeping embeddings in sync with source documents requires infrastructure.

  5. Users don’t always know what to ask. RAG assumes users can formulate good queries. In practice, many users don’t know the right terminology or what’s even in the documents. They ask vague questions like “what are the environmental metrics?” when they need specific Scope 1 emissions data. Without structured extraction running alongside RAG, you’re dependent on user query quality—and that’s often the weakest link.

This is why hybrid approaches work well: RAG for conversational access, structured extraction for ensuring you capture everything regardless of what users ask.

When to Use Classical NLP

Use NLP when:

  1. You’ve validated the product and need to scale. The LLM prototype works, users love it, now you’re processing 100k documents/month and costs are unsustainable.

  2. The extraction is structured and repetitive. Tables. Forms. SEC filings. Anything with predictable format is NLP territory.

  3. You need sub-second latency. NLP models run locally in milliseconds. No API round-trip.

  4. You want to improve over time. With labeled data from corrections, you can fine-tune NLP models. Accuracy compounds.

  5. You need explainability. NLP extractions can show exactly what was matched and why. LLMs are black boxes.

When NOT to use NLP:

  1. You’re still figuring out what to extract. NLP requires knowing your schema upfront. If users keep asking for new fields, you’ll be constantly updating extraction rules. LLMs handle ad-hoc requests without code changes.

  2. Documents are highly variable. NLP excels at structured, predictable formats. Handwritten notes, informal reports, or documents with wildly inconsistent layouts will frustrate rule-based systems. LLMs handle chaos better.

  3. You lack training data. Fine-tuned NER models need labeled examples. If you don’t have hundreds of annotated documents, your NLP accuracy will lag behind a well-prompted LLM.

  4. The ROI doesn’t justify the engineering. Building NLP pipelines takes weeks. If you’re processing 5,000 documents/month and LLM costs are $150, the $100/month savings doesn’t justify a 4-week engineering investment. Do the math first.

  5. You need to handle edge cases gracefully. NLP fails hard on unexpected inputs—it returns nothing or garbage. LLMs fail soft—they might hallucinate, but they usually return something plausible. For user-facing applications where “no result” is worse than “approximate result,” LLMs are more forgiving.

The sweet spot for NLP is high-volume, structured, repetitive extraction where you’ve already validated the product. Everything else should probably stay on LLMs until scale forces the conversation.

The hybrid pattern I recommend:

Document → Layout Analysis → Content Routing

              ┌────────────────┼────────────────┐
              ▼                ▼                ▼
         TABLES            TEXT             CHARTS
     (NLP: Camelot)    (NLP: NER)      (LLM: Vision)
       85-95% acc       80-90% acc        75-95% acc
        $0.001          $0.002           $0.030
              │                │                │
              └────────────────┼────────────────┘

                      Confidence Scoring

              ┌────────────────┼────────────────┐
              ▼                ▼                ▼
          High (>0.9)    Medium (0.7-0.9)   Low (<0.7)
           Accept         LLM Verify        Human Review

Table extraction accuracy depends on document quality and tool choice. Benchmarks show Camelot at 72-73% on complex layouts, while deep learning approaches (Table-Transformer, TableFormer) achieve 93%+ (arXiv PDF Parsing Study). For clean, well-structured tables, rule-based tools perform excellently.

Most financial/ESG data lives in tables—and tables are where NLP shines:

class TableExtractor:
    def __init__(self):
        self.detector = TableTransformer()  # Microsoft's model
        self.parser = CamelotParser()
        self.matcher = IndicatorMatcher(load_indicators())

    def extract(self, pdf_path: str) -> List[Extraction]:
        # 1. Detect tables
        tables = self.detector.detect(pdf_path)

        results = []
        for table in tables:
            # 2. Parse table structure
            df = self.parser.parse(table)

            # 3. Classify table type
            stmt_type = self.classify_statement(df)

            # 4. Extract and match metrics
            for _, row in df.iterrows():  # iterrows() yields (index, row) pairs
                metric = self.extract_metric(row, stmt_type)
                if metric:
                    matched = self.matcher.match(metric)
                    matched.confidence = self.calculate_confidence(matched)
                    results.append(matched)

        return results

    def classify_statement(self, df) -> str:
        """Detect: Balance Sheet, Income Statement, Cash Flow"""
        headers = ' '.join(df.columns).lower()
        if 'assets' in headers or 'liabilities' in headers:
            return 'balance_sheet'
        elif 'revenue' in headers or 'income' in headers:
            return 'income_statement'
        elif 'cash' in headers and 'operating' in headers:
            return 'cash_flow'
        return 'unknown'
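The `calculate_confidence` call above does a lot of work and is left abstract here. In practice it tends to be a weighted blend of cheap signals; a sketch with illustrative attributes and weights (none of these come from a real system):

def calculate_confidence(matched) -> float:
    """Illustrative blend of match quality, unit sanity, and parse quality.
    The attributes and weights are assumptions, not tuned values."""
    score = 0.0
    score += 0.5 if matched.label_match == "exact" else 0.25          # fuzzy label match
    score += 0.3 if matched.unit in matched.expected_units else 0.0   # unit looks plausible
    score += 0.2 * matched.table_parse_quality                        # 0-1 score from the parser
    return min(score, 1.0)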

The Cost Math

Let’s compare at different scales:

1,000 documents/month (Seed stage)

Approach    | Cost  | Notes
LLM only    | $30   | Just pay it
RAG         | $20   | Marginal savings, more complexity
Hybrid NLP  | $150  | Engineering cost exceeds savings

10,000 documents/month (Series A)

Approach    | Cost              | Notes
LLM only    | $300              | Starting to matter
RAG         | $150              | Worth it for chat features
Hybrid NLP  | $80 + engineering | Break-even in ~6 months

100,000 documents/month (Series B+)

Approach    | Cost   | Notes
LLM only    | $3,000 | Painful
RAG         | $1,000 | Better, but still high
Hybrid NLP  | $300   | Clear winner at scale

The crossover point is typically 10-50k documents/month. Below that, LLMs win on simplicity. Above that, NLP wins on cost.
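To find the crossover for your own workload, the math is a one-liner. The inputs below are placeholders, not recommendations:

def breakeven_months(docs_per_month: int,
                     llm_cost_per_doc: float,
                     hybrid_cost_per_doc: float,
                     engineering_cost: float) -> float:
    """Months until the hybrid-NLP build pays for itself."""
    monthly_savings = docs_per_month * (llm_cost_per_doc - hybrid_cost_per_doc)
    return engineering_cost / monthly_savings

# Made-up inputs: 50k docs/month, $0.03 vs $0.003 per doc, $20k of engineering time.
print(breakeven_months(50_000, 0.03, 0.003, 20_000))  # ~14.8 months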

The Evolution Path

Here’s how I recommend startups evolve their “AI” stack:

Stage 1: Raw LLM (Day 1 - Product-Market Fit)

# This is your entire AI pipeline
def extract(document: str) -> dict:
    return json.loads(gpt4(PROMPT + document))

Ship it. Learn what users actually need. Don’t optimize.

Stage 2: RAG (Month 3-6 - Chat/Search Features)

When users ask “can I search my documents?” or “can I ask questions?”, add RAG:

-- Add pgvector to your existing Postgres
CREATE EXTENSION IF NOT EXISTS vector;
ALTER TABLE documents ADD COLUMN embedding vector(1536);

# Index on insert
def save_document(doc):
    doc.embedding = embed(doc.content)
    db.save(doc)

# Search (cast the parameter so Postgres treats it as a vector)
def search(query):
    return db.query("""
        SELECT * FROM documents
        ORDER BY embedding <-> %s::vector
        LIMIT 5
    """, embed(query))

Use your existing database. Don’t add Pinecone yet.

Stage 3: Hybrid NLP (Month 12+ - Scale)

When LLM costs exceed engineering costs, invest in NLP. The key insight: you’re not replacing LLMs, you’re reducing how often you call them.

The 6-stage hybrid pipeline:

1. Document Classification
   └─ What type of document is this? (report, filing, form, etc.)
   └─ Tool: Simple classifier or rule-based detection

2. Layout Analysis
   └─ Where are the tables, text blocks, charts, headers?
   └─ Tool: LayoutLMv3, pdfplumber, or PyMuPDF

3. Content Routing
   └─ Route each content block to the right extractor
   └─ Tables → NLP, Text → NER, Charts → LLM

4. Parallel Extraction
   └─ Each extractor runs independently
   └─ NLP extractions return confidence scores

5. Confidence-Based Verification
   └─ High confidence (>0.9): Accept
   └─ Medium (0.7-0.9): LLM verification
   └─ Low (<0.7): Human review queue

6. Feedback Loop
   └─ Log every extraction with source location
   └─ Corrections become training data for NLP models

Start with tables. In most structured documents (financial reports, sustainability disclosures, regulatory filings), 70-80% of the data you want lives in tables. Tables are deterministic—same input, same output. NLP table extraction is a solved problem.

class HybridExtractor:
    def extract(self, document) -> List[Extraction]:
        results = []

        # NLP for tables (cheap, fast, deterministic)
        for table in detect_tables(document):
            extractions = self.table_extractor.extract(table)
            for e in extractions:
                e.source = "nlp_table"
                e.confidence = self.calculate_confidence(e)
            results.extend(extractions)

        # NLP for text entities (cheap, fast, okay accuracy)
        for text_block in detect_text(document):
            extractions = self.ner_model.extract(text_block)
            for e in extractions:
                e.source = "nlp_ner"
                e.confidence = self.calculate_confidence(e)  # needed by the verification step below
            results.extend(extractions)

        # LLM for everything else (expensive, slow, good accuracy)
        for chart in detect_charts(document):
            extractions = self.llm_extractor.extract(chart)
            for e in extractions:
                e.source = "llm"
            results.extend(extractions)

        # LLM verification for medium-confidence NLP results
        for i, result in enumerate(results):
            if result.source.startswith("nlp") and 0.7 < result.confidence < 0.9:
                verified = self.llm_verify(result)
                verified.source = f"{result.source}_verified"
                results[i] = verified  # rebinding the loop variable alone wouldn't update the list

        return results

The key metrics to track:

  • Extraction accuracy by source (NLP vs LLM)
  • Confidence score calibration (are your 0.9 predictions actually 90% correct? See the sketch below)
  • LLM fallback rate (what percentage still needs LLM?)
  • Time-to-extraction by document type
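
Calibration is easy to check and easy to skip. A minimal sketch, assuming your logs record predicted confidence alongside whether the extraction turned out to be correct:

from collections import defaultdict

def calibration_report(records, bins=(0.0, 0.5, 0.7, 0.9, 1.01)):
    """records: iterable of (confidence, was_correct) pairs pulled from your logs."""
    buckets = defaultdict(lambda: [0, 0])  # (lo, hi) -> [correct, total]
    for confidence, was_correct in records:
        for lo, hi in zip(bins, bins[1:]):
            if lo <= confidence < hi:
                buckets[(lo, hi)][0] += int(was_correct)
                buckets[(lo, hi)][1] += 1
    for (lo, hi), (correct, total) in sorted(buckets.items()):
        print(f"confidence {lo:.1f}-{hi:.1f}: {correct / total:.0%} correct over {total} extractions")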

Build the feedback loop early. Every extraction should be logged with:

  • Source document location (page, bounding box)
  • Extraction method used
  • Confidence score
  • User corrections (if any)

Corrections are training data. The more users correct, the better your NLP models get. This is the learning loop that LLM-only systems lack.
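A minimal shape for that log record, as a sketch (the field names are illustrative, not a prescribed schema):

from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class ExtractionRecord:
    doc_id: str
    field_name: str                           # e.g. "total_revenue"
    value: str
    page: int                                 # source document location
    bbox: tuple[float, float, float, float]   # bounding box on that page
    method: str                               # "nlp_table" | "nlp_ner" | "llm"
    confidence: float
    corrected_value: Optional[str] = None     # filled in when a user fixes it
    extracted_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))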

The Technology Stack

What I’d use today for each approach:

Raw LLM

  • Model: Gemini 2.5 Flash (fast, cheap) or Claude Sonnet 4.5 (best reasoning). For complex extraction, GPT-5 or Claude Opus 4.5.
  • Structured output: Instructor library for Pydantic validation (sketch below)
  • Cost tracking: LangSmith or custom logging
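
The Instructor pattern, for reference, looks roughly like this; a sketch based on current instructor and openai SDKs, so check their docs for the exact API in your versions:

import instructor
from openai import OpenAI
from pydantic import BaseModel

class Financials(BaseModel):
    revenue: float | None
    net_income: float | None
    total_assets: float | None

client = instructor.from_openai(OpenAI())
document_text = "..."  # your document text

financials = client.chat.completions.create(
    model="gpt-4o",
    response_model=Financials,  # Instructor validates against the schema and retries on failure
    messages=[{"role": "user", "content": f"Extract metrics from: {document_text}"}],
)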

RAG

  • Vector DB: pgvector (if you have Postgres) or Qdrant (if you don’t)
  • Embeddings: OpenAI text-embedding-3-small or Cohere embed-v3
  • Chunking: LangChain’s RecursiveCharacterTextSplitter
  • Orchestration: LlamaIndex or raw code (frameworks add complexity)

Classical NLP

  • Table extraction: Camelot, pdfplumber, or Table-Transformer
  • Layout analysis: LayoutLMv3 (Microsoft, open-source)
  • NER: spaCy with domain-specific fine-tuning
  • Financial NER: FinBERT
  • OCR (if needed): DocTR (Apache 2.0)

What About Fine-Tuning and Open-Source LLMs?

Two questions I get asked frequently:

“Should we fine-tune a model?”

Usually, no. Fine-tuning makes sense when:

  • You have thousands of labeled examples
  • Your task is narrow and well-defined
  • Prompt engineering has hit a ceiling
  • You need to reduce per-request costs at very high volume

For most startups, prompt engineering gets you 80-90% of the way there. Fine-tuning is a month of work to squeeze out the last 5-10%—and that month is better spent shipping features.

When fine-tuning does make sense: Classification tasks with clear categories, domain-specific entity extraction where generic models struggle (e.g., legal clauses, medical terms), or cost optimization at 1M+ requests/month where even small per-request savings compound.
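When you do cross that line, the mechanics are well trodden. A rough sketch of fine-tuning DistilBERT for classification with Hugging Face Transformers (file names, label count, and hyperparameters are placeholders):

from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Placeholder dataset: CSVs of your own with "text" and integer "label" columns.
dataset = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

dataset = dataset.map(tokenize, batched=True)
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=5  # your category count
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="clf", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)
trainer.train()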

“What about open-source models (Llama, Mistral, Qwen)?”

Open-source models are viable for:

  • Cost reduction at scale. Self-hosted Llama 3.1 70B can match GPT-4 quality for many tasks at a fraction of the cost—if you have the infrastructure.
  • Data privacy requirements. Some enterprises can’t send data to external APIs. Self-hosted models solve this.
  • Latency-sensitive applications. Local inference can be faster than API round-trips.

But they come with trade-offs:

  • Infrastructure complexity. You need GPU servers, model serving (vLLM, TGI), monitoring, scaling. This is a full-time job.
  • Quality gap on complex tasks. For nuanced reasoning, multi-step extraction, or handling edge cases, frontier models still lead.
  • Rapid obsolescence. The model you self-host today will be outperformed by a new release in 3-6 months. API providers handle upgrades for you.

My recommendation: Start with APIs (OpenAI, Anthropic, Google). When you’re spending $5,000+/month on LLM calls and have engineering capacity for infrastructure, evaluate self-hosting. For most startups, that day is further away than you think.
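For when that day does arrive, self-hosted serving usually means vLLM or TGI. A minimal vLLM sketch (the model name and sampling settings are illustrative; most production setups use its OpenAI-compatible HTTP server rather than offline calls):

from vllm import LLM, SamplingParams

# Offline batch inference; `vllm serve <model>` exposes the same model
# behind an OpenAI-compatible API for production traffic.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.0, max_tokens=512)

outputs = llm.generate(
    ["Extract the total revenue from the following filing: ..."],
    params,
)
print(outputs[0].outputs[0].text)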

“What about vision models for charts and OCR?”

Multimodal models (GPT-4o, Gemini 2.0 Flash, Claude’s vision capabilities) have made OCR and chart extraction surprisingly viable without traditional pipelines.

Where vision models excel:

  • Charts and graphs. Extracting data points from bar charts, line graphs, pie charts—tasks that would require complex computer vision pipelines are now a single API call.
  • Scanned documents. Handwritten notes, faxes, low-quality scans where traditional OCR struggles.
  • Mixed layouts. Documents with embedded images, tables, and text interspersed.

Current benchmarks (late 2025):

  • GPT-4o and Qwen2.5-VL achieve ~75% accuracy on complex OCR benchmarks (OmniAI Benchmark)
  • Qwen2.5-VL hits 96.4% on DocVQA, near human-level at 98.1% (OCRBench v2)
  • For simple, clean documents, multimodal LLMs now match traditional OCR
  • For complex layouts (rotated text, overlapping elements), they still struggle

The practical tradeoff: Vision models are slower and more expensive than text-only calls. A chart extraction that costs $0.01 with vision might cost $0.001 with traditional table parsing. But if you’re processing charts (which NLP can’t handle), vision models are often your only option short of building custom computer vision.

My approach: Use vision models for content that NLP genuinely can’t handle—charts, diagrams, handwritten notes. For clean, machine-generated PDFs with selectable text, traditional text extraction + NLP is still faster and cheaper.
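Mechanically, a chart-extraction call is just a multimodal message. A sketch with the OpenAI SDK (the file path and prompt are placeholders; Gemini and Claude accept image inputs in equivalent ways):

import base64
from openai import OpenAI

client = OpenAI()

with open("emissions_chart.png", "rb") as f:  # placeholder path
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Extract every data point in this chart as JSON: "
                     '[{"year": ..., "value": ..., "unit": ...}]'},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)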

The Meta-Lesson: Start Dumb

The “AI-powered” product that ships beats the ML pipeline that’s always two weeks away.

Every successful AI product I’ve seen followed the same pattern:

  1. Ship an LLM wrapper
  2. Learn what users actually need
  3. Optimize the parts that matter

The founders who fail are the ones building custom transformers before they have users. The ones who succeed are the ones who “fake” AI with prompts until the unit economics force them to get smart.

Your LLM wrapper isn’t technical debt. It’s a working prototype that generates revenue while you figure out what to build next.

Start dumb. Get smart later. Ship first.


Resources

Vector Databases & RAG:

  • pgvector - Vector similarity for Postgres (start here)
  • Qdrant - Purpose-built vector DB if you need more
  • LlamaIndex - RAG framework (useful, but adds complexity)


For API architecture that scales, see building scalable APIs with Go.