It was 2:14 PM on a Tuesday. I had just deployed my latest automation script to handle client reporting. Within four minutes, the server CPU spiked to 98%. The logs showed nothing but `TimeoutError` and `JSONDecodeError`.
The script wasn't broken. It was *dumb*.
It was a linear chain of functions. Step A fetched data from API X. Step B cleaned it. Step C pushed to Google Sheets. If API X slowed down by 200ms, Step B waited. Step C failed. The whole thing collapsed because there was no logic to handle failure. Just rigid execution.
I spent three days refactoring. I stopped writing scripts. I started building an AI agent framework in Python.
This isn't a tutorial on how to import `langchain` and pray. This is what happened when I actually tried to build autonomous agents that could think, pause, and recover without human intervention.
The Problem with "Chains"
Most tutorials tell you to build "chains." You connect LLM calls together. Input goes in. Output comes out.
That works for simple Q&A. It fails for complex workflows.
In my case, the workflow was:
1. Search for trending keywords.
2. Analyze SERP features.
3. Draft meta tags.
4. Push to CMS.
If step 1 returns empty results, the chain crashes. Or worse, it hallucinates data to keep going. That is dangerous for SEO work. You cannot automate a process that doesn't know when to stop.
The solution is not better error handling. It is architectural change. You need an agent loop. Not a straight line. A cycle.
Designing the Memory Layer
An agent needs context. Without it, every tool call is blind. I built a memory layer using `sqlite3` for persistence and `redis` for short-term session state.
Why SQLite? Because for most SEO automation tasks, the volume isn't high enough to justify a full database cluster. But Redis is essential for keeping the "thought process" alive between function calls.
Here is the structure I used:
class AgentMemory:
    def __init__(self):
        self.short_term = {}  # thought -> result for the current session
        self.long_term = []   # results worth keeping across sessions

    def add_thought(self, thought, result, confidence=0.0):
        self.short_term[thought] = result
        # Only push to long term if confidence score > 0.8
        if confidence > 0.8:
            self.long_term.append(result)
This prevents token bloat. Your LLM context window is expensive. You don't need every raw HTML dump stored in memory. You need the *insight*. I implemented a summarizer step that condenses raw data before storing it. This cut my context usage by 60%.
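A minimal sketch of the summarize-then-store idea, assuming a SQLite table for long-term memory. The `summarize` function here is only a placeholder that collapses whitespace and truncates; a real summarizer would likely be an LLM call. The class name, table name, and `max_chars` limit are all hypothetical.

```python
import sqlite3

def summarize(raw_text, max_chars=200):
    """Placeholder summarizer: collapse whitespace and truncate.

    Assumption: a real agent would call an LLM here instead.
    """
    condensed = " ".join(raw_text.split())
    if len(condensed) <= max_chars:
        return condensed
    return condensed[:max_chars].rstrip() + "..."

class PersistentMemory:
    """Long-term store backed by SQLite; only summaries are written."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS insights (thought TEXT, summary TEXT)"
        )

    def store(self, thought, raw_result):
        # Condense first so raw dumps never reach the database.
        summary = summarize(raw_result)
        self.conn.execute(
            "INSERT INTO insights (thought, summary) VALUES (?, ?)",
            (thought, summary),
        )
        self.conn.commit()
        return summary

    def recall(self):
        rows = self.conn.execute("SELECT thought, summary FROM insights")
        return rows.fetchall()

memory = PersistentMemory()
stored = memory.store("check serp", "<html>   lots of   raw markup   </html>" * 50)
```

Only the condensed string is ever persisted, so whatever gets pulled back into the prompt later is bounded in size.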
Tool Definition: The Real Bottleneck
You can have the smartest agent. If your tools are poorly defined, it will fail.
I tried giving my agent a generic "Search Web" tool. It would search, return 1000 URLs, and freeze.
I switched to specific, typed tools.
1. `get_serp_data(query, region)`
2. `fetch_meta_tags(url)`
3. `compare_rankings(current, previous)`
Each tool had strict input/output schemas. I used Pydantic models to enforce this. This reduced parsing errors by 90%. The agent doesn't guess the format. It knows exactly what data type to expect.
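To make "strict input/output schemas" concrete, here is a rough sketch. It uses stdlib dataclasses as a stand-in so it runs without dependencies; Pydantic models would express the same checks declaratively. The field names and the two-letter region rule are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SerpQuery:
    """Input schema for the get_serp_data tool."""
    query: str
    region: str

    def __post_init__(self):
        if not isinstance(self.query, str) or not self.query.strip():
            raise ValueError("query must be a non-empty string")
        # Assumption: regions are two-letter codes such as "us" or "de".
        if not isinstance(self.region, str) or len(self.region) != 2:
            raise ValueError("region must be a two-letter code")

@dataclass(frozen=True)
class SerpResult:
    """Output schema: one ranked result."""
    position: int
    url: str

    def __post_init__(self):
        if not isinstance(self.position, int) or self.position < 1:
            raise ValueError("position must be a positive integer")
        if not isinstance(self.url, str) or not self.url.startswith(("http://", "https://")):
            raise ValueError("url must be an absolute http(s) URL")

def validate_tool_input(raw: dict) -> SerpQuery:
    """Turn the LLM's raw arguments into a checked SerpQuery or raise ValueError."""
    try:
        return SerpQuery(**raw)
    except TypeError as exc:  # missing or unexpected keys
        raise ValueError(f"bad arguments: {exc}") from exc
```

Anything the LLM sends that does not fit the schema fails at the boundary with a readable message, before the tool does any work.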
When defining these tools, remember that the LLM is not a coder. It is a pattern matcher. Give it clear examples of success and failure. Document the edge cases. If `get_serp_data` times out, the tool should return a specific error code, not crash. This allows the agent to trigger retry logic instead of abandoning the task.
See my breakdown on Build Agents Not Pipelines for more on why modular tools beat monolithic scripts.
Implementing the Thought-Action-Observation Loop
The core of any AI agent framework is the loop. It’s not `if-else`. It’s a cycle.
1. Thought: What does the agent need to do next?
2. Action: Call a tool.
3. Observation: Read the output.
4. Repeat until goal is met.
I implemented this using a simple `while` loop with a maximum iteration count. Safety first. You don't want an agent stuck in an infinite loop trying to fix a minor typo in a meta description.
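MAX_ITERATIONS = 10  # example cap; the hard stop that prevents an infinite loop
iterations = 0
goal_achieved = False

while not goal_achieved and iterations < MAX_ITERATIONS:
    prompt = construct_prompt(memory, goal, available_tools)
    response = llm.generate(prompt)
    action = parse_action(response)      # validate the JSON before acting on it
    observation = execute_tool(action)
    memory.add_observation(observation)
    # is_goal_met is a hypothetical check; without one the flag never changes
    goal_achieved = is_goal_met(goal, observation)
    iterations += 1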
The `parse_action` step is critical. LLMs are bad at returning clean JSON. I added a validation layer that retries the generation if the JSON schema doesn't match. This added latency but saved hours of debugging.
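A sketch of what that validation layer might look like, assuming actions arrive as a JSON object with a `tool` string and an `args` object. That schema, the retry count, and the wording of the correction message are assumptions, and `generate` is any callable standing in for the LLM call.

```python
import json

REQUIRED_KEYS = {"tool": str, "args": dict}  # assumed action schema

def parse_action(response_text, allowed_tools):
    """Return a validated action dict, or raise ValueError with a reason."""
    try:
        action = json.loads(response_text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc.msg}") from exc
    if not isinstance(action, dict):
        raise ValueError("action must be a JSON object")
    for key, expected_type in REQUIRED_KEYS.items():
        if not isinstance(action.get(key), expected_type):
            raise ValueError(f"'{key}' missing or not {expected_type.__name__}")
    if action["tool"] not in allowed_tools:
        raise ValueError(f"unknown tool: {action['tool']}")
    return action

def generate_action(generate, prompt, allowed_tools, max_retries=2):
    """Call generate(prompt) until the output validates or retries run out."""
    last_error = None
    for _ in range(max_retries + 1):
        full_prompt = prompt
        if last_error is not None:
            # Tell the model why its previous reply was rejected.
            full_prompt += f"\n\nYour last reply was rejected ({last_error}). Reply with JSON only."
        text = generate(full_prompt)
        try:
            return parse_action(text, allowed_tools)
        except ValueError as exc:
            last_error = exc
    raise ValueError(f"no valid action after {max_retries + 1} attempts: {last_error}")

# Fake generator standing in for the LLM: bad output first, then valid JSON.
replies = iter([
    "Sure! Here is the JSON: {tool: ...}",
    '{"tool": "get_serp_data", "args": {"query": "vegan protein", "region": "us"}}',
])
action = generate_action(lambda prompt: next(replies), "pick a tool", {"get_serp_data"})
```

Each retry is one more LLM call, which is where the extra latency comes from. Capping `max_retries` keeps that cost bounded.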
Also, pay attention to how you structure the prompt for the "Thought" phase. Don't just ask "What should I do?" Ask "Given the observation X, and the goal Y, which tool best advances the objective?" This forces the agent to justify its choice. It reduces random tool calls.
Handling Hallucinations in Data Extraction
LLMs lie. Especially when extracting data from messy HTML.
In my initial tests, the agent extracted price data incorrectly because it guessed based on surrounding text rather than parsing the DOM. I stopped using the LLM for raw extraction.
Instead, I used a hybrid approach.
1. Use BeautifulSoup to clean the HTML structure.
2. Pass the cleaned text to the LLM.
3. Ask the LLM to extract specific fields using the cleaned text.
This reduced extraction errors significantly. The LLM became a reasoning engine, not a scraper.
For SEO content analysis, this distinction is vital. You aren't asking the AI to read a webpage. You are asking it to analyze the *intent* and *structure* of the content you feed it. Feeding it raw HTML is noise. Feeding it structured headings and paragraph summaries is signal.
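A rough sketch of the cleaning step that turns raw HTML into headings and paragraphs. It uses the stdlib `html.parser` as a stand-in so it runs anywhere; BeautifulSoup would do the same job with less code. The choice of which tags to keep is an assumption.

```python
from html.parser import HTMLParser

class OutlineExtractor(HTMLParser):
    """Collect heading and paragraph text; ignore script and style content."""
    KEEP = {"h1", "h2", "h3", "p"}
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.outline = []      # list of (tag, text) pairs
        self._current = None   # tag currently being captured
        self._buffer = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.KEEP and self._skip_depth == 0:
            self._current, self._buffer = tag, []

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == self._current:
            # Join inline fragments and normalize whitespace.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.outline.append((tag, text))
            self._current = None

    def handle_data(self, data):
        if self._current and self._skip_depth == 0:
            self._buffer.append(data)

def extract_outline(html):
    parser = OutlineExtractor()
    parser.feed(html)
    parser.close()
    return parser.outline

sample = """
<html><head><style>p {color: red}</style></head>
<body><h1>Best Vegan Protein</h1>
<script>track()</script>
<p>Pea protein <b>leads</b> the market.</p></body></html>
"""
outline = extract_outline(sample)
```

The resulting list of `(tag, text)` pairs is what goes into the prompt. The LLM never sees the markup, scripts, or styles.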
Check out AI Agent Reality Check to understand why grounding your agents in verified data sources prevents costly mistakes.
Performance and Cost Management
An AI agent framework in Python can get expensive fast. Each loop iteration costs tokens. Each tool call costs time.
I optimized this by implementing "caching" at the agent level. If the agent tries to run `get_serp_data` for the same query twice within an hour, it skips the tool call and pulls from cache.
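The one-hour cache can be sketched as a small TTL wrapper. This assumes tool arguments are hashable and uses an injectable clock so the behavior can be checked without waiting; the class and the demo tool are hypothetical.

```python
import time

class TTLCache:
    """Cache tool results keyed by (tool name, arguments) for a fixed window."""

    def __init__(self, ttl_seconds=3600, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable so tests can fake time
        self._store = {}

    def call(self, tool_name, fn, *args):
        key = (tool_name, args)     # args must be hashable
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]           # fresh enough: skip the tool call
        value = fn(*args)
        self._store[key] = (now, value)
        return value

# Demo with a fake clock and a call counter instead of a real SERP request.
fake_now = [0.0]
calls = []

def get_serp_data(query, region):
    calls.append((query, region))
    return f"results for {query} ({region})"

cache = TTLCache(ttl_seconds=3600, clock=lambda: fake_now[0])
first = cache.call("get_serp_data", get_serp_data, "vegan protein", "us")
fake_now[0] = 1800.0   # 30 minutes later: served from cache
second = cache.call("get_serp_data", get_serp_data, "vegan protein", "us")
fake_now[0] = 4000.0   # past the hour: the tool runs again
third = cache.call("get_serp_data", get_serp_data, "vegan protein", "us")
```

Keying on the tool name plus its arguments means a different region or query always triggers a fresh call.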
I also introduced a "confidence threshold" for actions. Before executing a destructive action (like deleting a draft post or updating a live page), the agent must assign a confidence score. If the score is below 0.9, it halts and asks for human confirmation.
This human-in-the-loop step is non-negotiable for production environments. You can trust an AI to organize data. You cannot trust it to make final editorial decisions without verification.
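A minimal sketch of that gate. The tool names in `DESTRUCTIVE_TOOLS`, the shape of the action dict, and the `confirm` callback are assumptions; in a real setup `confirm` might post to Slack or block on a CLI prompt.

```python
DESTRUCTIVE_TOOLS = {"delete_draft", "update_live_page"}  # hypothetical tool names
CONFIDENCE_THRESHOLD = 0.9

def gate_action(action, confirm):
    """Decide whether an action may run.

    `action` is a dict with "tool" and "confidence" keys (assumed shape).
    `confirm` is a callable that asks a human and returns True or False.
    Returns "execute", "approved", or "halted".
    """
    if action["tool"] not in DESTRUCTIVE_TOOLS:
        return "execute"
    # A missing score is treated as zero confidence.
    if action.get("confidence", 0.0) >= CONFIDENCE_THRESHOLD:
        return "execute"
    return "approved" if confirm(action) else "halted"

asked = []

def deny(action):
    asked.append(action["tool"])  # record that a human was consulted
    return False

safe = gate_action({"tool": "get_serp_data", "confidence": 0.2}, deny)
sure = gate_action({"tool": "update_live_page", "confidence": 0.95}, deny)
unsure = gate_action({"tool": "delete_draft", "confidence": 0.6}, deny)
```

Read-only tools pass straight through, so the human is only interrupted for risky actions the agent itself is unsure about.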
Testing the Framework
Unit testing an LLM is hard. You can't assert exact outputs. But you can assert behavior.
I wrote tests for the loop itself.
I used a mock LLM that returned predefined responses. This allowed me to simulate failures. What happens if the API returns 500? Does the agent retry? Does it log the error?
These tests caught a bug where the agent would enter an infinite loop if two tools depended on each other's output format. The fix was adding a dependency checker before the loop started.
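Here is roughly what a behavior test with a mock LLM can look like. The `run_agent` loop below is a stripped-down stand-in written for the test, not the full framework, and the "reply with a tool name or `done`" protocol is an assumption.

```python
class MockLLM:
    """Returns scripted responses in order, recording each prompt it sees."""
    def __init__(self, responses):
        self._responses = list(responses)
        self.prompts = []

    def generate(self, prompt):
        self.prompts.append(prompt)
        return self._responses.pop(0)

def run_agent(llm, tools, max_iterations=5):
    """Tiny loop for testing: each LLM reply names a tool, or 'done'."""
    log = []
    observation = "start"
    for _ in range(max_iterations):
        choice = llm.generate(f"observation: {observation}")
        if choice == "done":
            return "finished", log
        try:
            observation = tools[choice]()
        except RuntimeError as exc:
            observation = f"error: {exc}"
            log.append(observation)   # failures are logged, not raised
    return "gave_up", log

def test_agent_retries_after_500():
    attempts = []
    def flaky_api():
        attempts.append(1)
        if len(attempts) == 1:
            raise RuntimeError("HTTP 500")
        return "ok"
    llm = MockLLM(["fetch", "fetch", "done"])
    status, log = run_agent(llm, {"fetch": flaky_api})
    assert status == "finished"
    assert len(attempts) == 2                   # the tool was retried
    assert log == ["error: HTTP 500"]           # the failure was logged
    assert "error: HTTP 500" in llm.prompts[1]  # the agent saw the error

def test_loop_stops_at_iteration_cap():
    llm = MockLLM(["fetch"] * 10)
    status, _ = run_agent(llm, {"fetch": lambda: "ok"}, max_iterations=3)
    assert status == "gave_up"
    assert len(llm.prompts) == 3
```

Neither test asserts on what the LLM says. Both assert on what the loop does with it, which is the part you actually control.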
When It Actually Works
This framework isn't for everything. Don't build an agent for simple SEO audits. Just run a script.
It shines when the path is unclear.
Example: "Find competitors ranking for 'best vegan protein' who have weak backlink profiles, then draft a comparison page outline for them."
The agent has to:
1. Search.
2. Analyze backlinks.
3. Filter.
4. Synthesize.
5. Draft.
No linear script can handle the variability of step 3. An agent can.
But be careful. If you rely entirely on this for traffic, you risk falling into the zero-click trap.
Read Zero-Click Survival Guide to ensure your automated content actually drives clicks and isn't just fueling AI models.
Final Thoughts on Maintenance
Your agent will break. The APIs you depend on will change. The LLM provider will update their pricing.
Keep your code modular. Isolate the tool definitions. Isolate the memory layer. Isolate the loop.
If Google changes its SERP layout, you only need to update the `get_serp_data` tool. You don't need to rewrite the entire agent.
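One way to get that isolation is a small tool registry, sketched below. The decorator, the registry dict, and the `v1`/`v2` stubs are hypothetical; the idea is that the loop looks tools up by name and never imports an implementation directly.

```python
TOOLS = {}

def tool(name):
    """Register a function under a stable name the agent loop can look up."""
    def decorator(fn):
        TOOLS[name] = fn
        return fn
    return decorator

@tool("get_serp_data")
def get_serp_data_v1(query, region):
    # Stand-in for a parser written against an older SERP layout.
    return {"query": query, "region": region, "parser": "v1"}

def execute_tool(action):
    """The loop only knows tool names, never the implementation behind them."""
    return TOOLS[action["tool"]](**action["args"])

action = {"tool": "get_serp_data", "args": {"query": "vegan protein", "region": "us"}}
before = execute_tool(action)

# Layout changed: register a new implementation under the same name.
@tool("get_serp_data")
def get_serp_data_v2(query, region):
    return {"query": query, "region": region, "parser": "v2"}

after = execute_tool(action)
```

The `action` dict and `execute_tool` are untouched between the two calls. Only the registered function changed.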
This is the advantage of a framework over a script. Scripts are brittle. Frameworks are adaptable.
I’m still refining mine. The current bottleneck is speed. LLM inference takes time. I'm looking into smaller, fine-tuned models for the "Thought" phase to reduce latency. But for now, the reliability gains are worth the wait.
Build the tools right. Define the memory clearly. And always, always put a stop button on the loop.