{

"title": "I Trained an LLM on Our Own Data. Here’s What Broke.",

"content": "# I Trained an LLM on Our Own Data. Here’s What Broke.\n\nWe launched a custom RAG pipeline last month. The goal was simple: ingest ten years of technical SEO case studies, schema docs, and past client audits into a vector database. Then, we let a large-scale AI model answer our internal queries.\n\nThe first week looked great. Accuracy hovered around 94%. We felt clever. We were automating expertise.\n\nThen we queried the system for \"how to handle 301 chains.\" The model hallucinated a complex server-side rewrite rule that didn’t exist. It sounded plausible. It cited a non-existent Google documentation link. It was confident.\n\nThat’s the trap of large-scale AI models in enterprise SEO. We assume scale equals accuracy. In reality, scale amplifies noise if your grounding data is messy.\n\n## The Data Hygiene Problem\n\nLarge models don’t \"think.\" They predict tokens based on patterns. If your source material is inconsistent, the pattern is broken.\n\nBefore we touched any GPU infrastructure, we audited our knowledge base. We had 5,000 pages. Only 1,200 were truly authoritative. The rest were blog scraps, outdated plugins, or redundant explanations.\n\nWe cut the noise. We didn’t want the AI to learn from bad advice just because it existed online.\n\n### Step-by-step cleanup:\n\n1. Index everything: We crawled our internal wiki with Screaming Frog.\n2. Score relevance: We used a heuristic score based on edit date, backlink count, and conversion rate.\n3. Prune aggressively: Anything below a score of 70 was archived rather than deleted.\n4. Standardize formatting: We stripped all HTML tags except a short allowlist of structural tags. We kept the headings because they define context for chunking.\n\nAfter this, the hallucination rate dropped from 8% to 2%. The model stopped guessing because it stopped seeing garbage patterns.\n\n## Chunking Strategy: Context Over Size\n\nMost teams chunk by character count. 500 chars per piece. It’s lazy.\n\nSemantic boundaries matter more than arbitrary numbers. A paragraph discussing \"canonical tags\" might be next to a paragraph about \"hreflang.\" If you split them mid-thought, the embedding loses meaning.\n\nWe switched to recursive character splitting with an overlap buffer.\n\n### The fix:\n\n- Split by header: Use heading tags as primary delimiters.\n- Overlap by 10%: If a chunk is 1,000 tokens, overlap 100 tokens with the previous and next chunk. This preserves context when the entity appears on the boundary.\n- Metadata injection: Add page URL, publication date, and heading path to every chunk. Let the model know where the info came from.\n\nWhen we tested this against the 500-char method, retrieval precision improved by 18%. The AI knew exactly which section of the documentation applied to a query about \"international SEO errors.\"\n\nIt wasn’t magic. It was better signal.\n\n## The Embedding Model Choice\n\nYou don’t need the biggest model for embeddings. You need the right one.\n\nWe started with `text-embedding-3-large` by OpenAI. It was expensive. It was slow. And for specific SEO terminology, it underperformed.\n\nWhy? Because generalist models dilute niche terms. \"Core Web Vitals\" might get embedded similarly to \"web vitals\" or even \"site speed.\" In SEO, those are different conversations.\n\nWe tested specialized embeddings. We looked at `all-MiniLM-L6-v2` for speed and `bge-m3` for multilingual support.\n\n### Comparison data:\n\n| Model | Cost per 1M tokens | Retrieval Accuracy (SEO) | Latency |\n| :--- | :--- | :--- | :--- |\n| text-embedding-3-large | $0.13 | 78% | 120ms |\n| bge-m3 | $0.00 | 89% | 45ms |\n| all-MiniLM-L6-v2 | $0.00 | 82% | 20ms |\n\nWe chose `bge-m3`. It’s open source. It handles mixed-language content well (crucial for global SEO). And it crushed the generalist model in accuracy tests on technical queries.\n\nWe saved 90% on embedding costs. More importantly, the AI stopped confusing \"mobile usability\" issues with \"desktop performance\" drops.\n\n## The Citation Gap\n\nHere’s where most AI projects fail in search.\n\nThe AI gives a correct answer. But it doesn’t show its work.\n\nOur users (internal team and clients) needed proof. They couldn’t trust a black box.
If the AI says \"fix this redirect loop,\" we need to link to the exact page in our database that explains why.\n\nThis isn’t just a UX preference. It’s a trust signal.\n\nWe implemented strict citation mapping. Every response generated includes footnotes linking to the source chunks.\n\nBut there’s a catch. If your source data isn’t structured for search engine visibility, these citations won’t help you rank in AI Overviews. You’re solving an internal problem, but missing the external opportunity.\n\nFor those building public-facing AI assistants, you need to ensure your structured data supports these citations. Read our Citation Gap Guide to understand how to make your AI outputs discoverable in new SERP features.\n\n## Latency vs. Quality Trade-offs\n\nLarge-scale models are heavy. Running them locally requires serious hardware. Running them via API adds latency.\n\nWe ran a load test. 100 concurrent users querying the RAG system.\n\n- Cold start: 3.5 seconds.\n- Hot cache: 400ms.\n- Token generation: 200ms per response.\n\nFor internal tools, 3.5 seconds is unacceptable. Nobody waits for answers.\n\nWe implemented a tiered caching strategy.\n\n### The caching layer:\n\n1. Embedding cache: Store vectors for common questions (e.g., \"what is schema markup?\"). Hit rate was 60%.\n2. Response cache: Cache full JSON responses for identical queries within a 24-hour window.\n3. Hybrid search: Combine keyword search (BM25) with vector search. Keyword hits return instantly without invoking the heavy embedding model every time.\n\nThis reduced average response time to 120ms. The complexity of the query didn’t change, but the delivery did.\n\nIf you’re building agents that automate these workflows, you need to think about latency. Slow agents fail. See our Build Agents Not Pipelines post on why autonomous loops require sub-second response times.\n\n## Hallucination Detection Layers\n\nYou cannot turn off hallucinations completely. 
You can only detect and suppress them.\n\nWe added a verification step between the retriever and the generator.\n\n### The guardrail flow:\n\n1. Retrieval: Fetch top 5 chunks.\n2. Verification Prompt: Ask a smaller, faster model: \"Do these 5 chunks contain sufficient evidence to answer the user’s question? Rate confidence 1-5.\"\n3. Thresholding: If confidence < 3, return an \"I don’t know\" message instead of forcing an answer.\n4. Citation Check: Ensure the generated answer contains at least one valid URL from the retrieved chunks.\n\nThis simple filter stopped 95% of confident nonsense. The AI now says \"unknown\" rather than inventing facts.\n\nIt feels unprofessional to admit ignorance. But it’s better than lying. Especially in SEO, where wrong advice costs traffic.\n\n## The Human-in-the-Loop Necessity\n\nAI scales knowledge, not judgment.\n\nWe set up a feedback loop. Users could flag incorrect answers. These flags triggered a review queue for senior SEOs.\n\nEvery flagged item was analyzed.\n\n- Was the data missing?\n- Was the chunking wrong?\n- Was the embedding model blind to the term?\n\nWe found that 40% of failures were due to missing data, not model failure. The model wasn’t broken. Our library was incomplete.\n\nWe updated the ingestion pipeline to require manual approval for new high-value content before it enters the vector DB.\n\nThis prevented \"garbage in, gospel out.\" The AI only speaks to what we’ve validated.\n\n## SERP Reality Shifts\n\nInternal tools are one thing. Public-facing AI interactions are another.\n\nGoogle is changing how it displays answers. AI Overviews are becoming more prominent. But they are also becoming more restrictive.\n\nIf your brand relies on generic Q&A content, you’re at risk. Large models prefer authoritative, unique insights.\n\nWe noticed a drop in click-through rates for generic \"what is SEO\" pages. The AI answers directly. No click needed.\n\nTo survive this, you need deep, original data.
Data that AI models can’t easily summarize because it’s proprietary or highly specific.\n\nCheck out New SERP Reality for a breakdown of how these overview features are reshaping traffic distribution.\n\nYour large-scale AI project shouldn’t just be a chatbot. It should be a moat. Build it on data competitors don’t have.\n\n## Tool Landscape Review\n\nYou don’t need to build from scratch. But you do need to evaluate tools critically.\n\nWe compared the options in SEO Content Optimization Tools 2026 against our custom stack.\n\nCommercial tools offer convenience. They handle vector databases and APIs out of the box. But they lack customization.\n\nWe needed custom logic for our internal taxonomy. Generic tools couldn’t handle our specific entity relationships.\n\nWe built our own layer on top of existing APIs. This gave us control over the ranking algorithm and the caching strategy.\n\nIf you’re just starting, a tool like Surfer or Clearscope might suffice. But if you’re managing enterprise data, expect to build.\n\n## Performance Metrics That Matter\n\nStop tracking \"tokens generated.\" Track \"errors resolved.\"\n\nOur KPIs shifted after the first month.\n\n- Initial KPI: Response time < 2s.\n- Revised KPI: First-call resolution rate > 85%.\n- Secondary KPI: User satisfaction score (1-5).\n\nWe achieved 88% resolution. Users rated the tool 4.2/5. That’s acceptable for an internal assistant. It’s not replacing humans. It’s removing the \"where do I find this file?\" friction.\n\nThe ROI wasn’t immediate. It took three months to integrate the tool into daily workflows. Once it stuck, productivity gains were measurable.\n\nSenior SEOs spent 5 fewer hours per week on basic troubleshooting. That’s about 20 hours a month each. Multiplied across 20 people, that’s 400 hours a month freed up for strategy.\n\n## The Invisible Technical Debt\n\nCode rots. Data decays.\n\nSix months in, our retrieval accuracy slipped by 5%. Why?\n\nWe added new content.
The old chunks weren’t re-embedded efficiently. The vector space drifted.\n\nWe implemented a weekly re-indexing job.\n\n### Maintenance schedule:\n\n1. Daily: Ingest new approved content.\n2. Weekly: Re-embed top 10% of historical content based on traffic changes.\n3. Monthly: Audit hallucination logs. Update guardrails.\n4. Quarterly: Benchmark against a fresh test set of 100 queries.\n\nConsistency beats intensity. Small, regular updates keep the model sharp. Don’t let your AI become a relic of last year’s data.\n\n## Final Thoughts\n\nLarge-scale AI models aren’t silver bullets. They’re mirrors. They reflect the quality of your data, the clarity of your structure, and the rigor of your process.\n\nWe didn’t solve SEO with AI. We solved our internal chaos with AI.\n\nThe technology is still immature for independent deployment. It needs taming. It needs pruning. It needs human oversight.\n\nIf you’re planning a similar initiative, start small. Clean your data. Test your chunks. Measure your errors. Don’t dream of automation until you’ve mastered accuracy.\n\nAnd remember, the tech stack is only half the battle. Your website’s technical health still dictates whether this AI content gets indexed properly. Make sure your foundation is solid. See Core Web Vitals Fix if you haven’t audited your site’s performance recently.\n
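
For readers who want something concrete, the guardrail flow from the Hallucination Detection Layers section can be sketched in a few lines of Python. Treat this as a minimal sketch under stated assumptions, not a definitive implementation. The names `Chunk`, `guarded_answer`, `rate_confidence`, and `generate` are hypothetical, and the two model calls are passed in as plain callables so the example runs without any API. The fallback wording is a placeholder, and returning the fallback when the citation check fails is an assumption, since the flow only says the check happens.

```python
from dataclasses import dataclass
from typing import Callable, List

# Placeholder fallback wording (assumption, not the exact production message).
FALLBACK = 'I do not know.'


@dataclass
class Chunk:
    text: str
    url: str  # source page URL, carried along as chunk metadata


def cites_retrieved_url(answer: str, chunks: List[Chunk]) -> bool:
    # Citation check: the answer must contain at least one URL
    # that belongs to the retrieved chunks (simple substring match).
    return any(chunk.url and chunk.url in answer for chunk in chunks)


def guarded_answer(
    question: str,
    ranked_chunks: List[Chunk],
    rate_confidence: Callable[[str, List[Chunk]], int],
    generate: Callable[[str, List[Chunk]], str],
    min_confidence: int = 3,
    top_k: int = 5,
) -> str:
    # 1. Retrieval: keep the top 5 chunks (assumed to arrive sorted by relevance).
    top = ranked_chunks[:top_k]
    # 2. Verification: a smaller, faster model rates the evidence from 1 to 5.
    confidence = rate_confidence(question, top)
    # 3. Thresholding: below 3, refuse instead of forcing an answer.
    if confidence < min_confidence:
        return FALLBACK
    answer = generate(question, top)
    # 4. Citation check (assumption: an uncited answer also falls back).
    if not cites_retrieved_url(answer, top):
        return FALLBACK
    return answer
```

In practice `rate_confidence` would wrap the verification prompt and `generate` would wrap the main model call. Keeping both behind callables also makes the thresholds easy to unit test with stubs before any real model is wired in.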
