How We Built an AI Agent That Actually Completes Web Tasks (Not Just Clicks Buttons)
Keywords: AI agent development, intelligent automation, web task completion, AI agent architecture, task-oriented AI, production AI agents
Most "AI automation" is just fancy button-clicking. You give it a script, it follows it blindly, and it breaks the moment something changes.
That's not intelligence. That's glorified macro recording.
Real AI agents:
- Understand the goal, not just the steps
- Adapt when things go wrong
- Handle unexpected situations
- Know when they've succeeded (or failed)
- Learn from context
We spent 18 months building such an agent. It went from "clicks buttons 60% of the time" to "completes complex tasks 94% of the time."
This article shares everything we learned—the architecture, the failures, the breakthroughs, and the code.
Table of Contents
- The Button-Clicking Problem
- What "Task Completion" Really Means
- Architecture: From Execution to Understanding
- The Three-Agent System
- How Our Agent Plans Tasks
- Execution: Beyond Fixed Scripts
- The Validation Challenge
- Error Recovery That Actually Works
- Real-World Task Examples
- Measuring True Task Completion
- Lessons from Production
- Open Source Implementation
Reading Time: ~25 minutes | Difficulty: Advanced | Last Updated: January 19, 2026
The Button-Clicking Problem
Early AI automation attempts failed because they confused execution with completion.
Traditional Automation (Button-Clicking)
// Traditional approach: Fixed script
async function automateCheckout() {
await page.click('#add-to-cart');
await page.click('.proceed-to-checkout');
await page.fill('#email', 'user@example.com');
await page.fill('#card-number', '4242424242424242');
await page.click('#place-order');
// Assumes success if no errors thrown
return { success: true };
}
Problems:
- No goal understanding: Doesn't know why it's clicking
- No validation: Assumes clicking = success
- No error handling: Breaks on any unexpected state
- No adaptation: Can't handle layout changes
Success rate: 60% (best case)
Our First Attempt (Still Button-Clicking)
// Our naive LLM attempt
async function automateWithLLM(task: string) {
const actions = await llm.generate(`
Convert this task into actions: ${task}
`);
// LLM returns: ["click #add-to-cart", "click .checkout", ...]
for (const action of actions) {
await executeAction(action);
}
return { done: true }; // Wishful thinking
}
Problems:
- LLM generates better actions... but still just clicking
- No validation of outcomes
- No understanding of task completion
- No recovery when things go wrong
Success rate: 65% (marginal improvement)
The insight: We needed the agent to understand the goal, not just execute steps.
What "Task Completion" Really Means
Before building, we had to define success.
Task Completion Criteria
Not enough: "Executed all steps"
Task: "Buy product X"
Execution: Clicked buttons, filled forms
Outcome: Payment failed, no purchase
Agent: "Task complete! ✓"
❌ This is not task completion
Real completion: "Achieved the goal"
Task: "Buy product X"
Execution: Multiple attempts, handled errors
Validation: Order placed, confirmation received, payment processed
Outcome: Product purchased
Agent: "Task complete! ✓"
✅ This is task completion
Our Task Completion Definition
A task is complete when ALL of the following hold:
1. Primary goal achieved: the product is purchased, the data is extracted, the form is submitted
2. Observable evidence: a confirmation page, a success message, or the data in hand
3. Side effects verified: email received, database updated, item in cart
4. No errors or warnings: payment succeeded, no validation errors
Until all four are true, the task is not complete.
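The four criteria compose into a single gate. A minimal sketch (the `CompletionEvidence` shape and field names are illustrative, not our actual types):

```typescript
// Illustrative evidence record; the real agent gathers these from the page and backend.
interface CompletionEvidence {
  primaryGoalAchieved: boolean; // e.g. order placed
  observableEvidence: boolean;  // e.g. confirmation page visible
  sideEffectsVerified: boolean; // e.g. confirmation email received
  noErrorsOrWarnings: boolean;  // e.g. no payment or validation errors
}

// A task is complete only when ALL four criteria hold.
function isTaskComplete(e: CompletionEvidence): boolean {
  return (
    e.primaryGoalAchieved &&
    e.observableEvidence &&
    e.sideEffectsVerified &&
    e.noErrorsOrWarnings
  );
}

// Example: every step executed, but payment failed -> not complete.
const failedPayment: CompletionEvidence = {
  primaryGoalAchieved: false,
  observableEvidence: true,
  sideEffectsVerified: false,
  noErrorsOrWarnings: false,
};
console.log(isTaskComplete(failedPayment)); // false
```

The point of making this a conjunction, rather than a score, is that a missing side effect vetoes completion no matter how good the rest of the evidence looks.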
Architecture: From Execution to Understanding
Traditional automation: Single agent executes scripts
Our architecture: Multiple specialized agents collaborate
Single-Agent Architecture (Fails)
┌────────────────────────┐
│ Monolithic Agent │
│ │
│ - Planning │
│ - Execution │
│ - Validation │
│ - Error handling │
│ - Everything else │
└────────────────────────┘
Problem: Agent tries to do everything
Result: Does nothing well
Success rate: 65%
Multi-Agent Architecture (Works)
┌─────────────────┐
│ Planner Agent │ ← Strategy & goal understanding
└────────┬────────┘
↓
┌─────────────────┐
│ Navigator Agent │ ← Execution & browser actions
└────────┬────────┘
↓
┌─────────────────┐
│ Validator Agent │ ← Outcome verification
└────────┬────────┘
↓
┌─────────────────┐
│ Orchestrator │ ← Coordination & recovery
└─────────────────┘
Specialization: Each agent masters one thing
Coordination: Agents collaborate toward goal
Result: 94% task completion rate
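The flow above can be sketched as a coordination loop. Stub agents stand in for the LLM-backed ones here, and the interfaces (`Agents`, `runTask`) are simplified assumptions rather than our production API:

```typescript
// Simplified agent interfaces (illustrative only).
interface Plan { goal: string }
interface StepResult { success: boolean }

interface Agents {
  planner: { createPlan(task: string): Plan };
  navigator: { execute(plan: Plan): StepResult };
  validator: { validate(plan: Plan, result: StepResult): boolean };
}

// Orchestrator: plan -> execute -> validate, retrying on failure.
function runTask(task: string, agents: Agents, maxAttempts = 3): boolean {
  const plan = agents.planner.createPlan(task);
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const result = agents.navigator.execute(plan);
    if (result.success && agents.validator.validate(plan, result)) {
      return true; // goal achieved AND independently verified
    }
    // In the real system: error analysis + recovery strategy here.
  }
  return false;
}

// Stub agents: navigation fails once, then succeeds on retry.
let calls = 0;
const agents: Agents = {
  planner: { createPlan: (task) => ({ goal: task }) },
  navigator: { execute: () => ({ success: ++calls >= 2 }) },
  validator: { validate: (_plan, result) => result.success },
};
console.log(runTask('buy product X', agents)); // true (succeeded on attempt 2)
```

Note that success requires the validator's sign-off, not just a clean navigator run; that separation is what makes the loop more than retried button-clicking.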
The Three-Agent System
1. Planner Agent: The Strategist
Responsibility: Understand goals and create strategies
NOT this:
Task: "Buy product X"
Plan: [
"Click add to cart",
"Click checkout",
"Enter payment"
]
❌ Too specific, brittle
But this:
Task: "Buy product X"
Plan: {
goal: "Purchase product X",
strategy: "Navigate to product → Add to cart → Complete checkout",
success_criteria: [
"Order confirmation page visible",
"Confirmation email received",
"Product in order history"
],
potential_obstacles: [
"Product out of stock",
"Payment declined",
"Login required"
],
fallback_strategies: [...]
}
✅ Goal-oriented, adaptive
Implementation:
class PlannerAgent {
async createPlan(task: string, context: Context): Promise<Plan> {
const prompt = `
You are a strategic planning agent. Analyze this task and create a HIGH-LEVEL strategy.
Task: ${task}
Current state:
- URL: ${context.url}
- Page type: ${context.pageType}
- User logged in: ${context.isAuthenticated}
Create a plan that includes:
1. Primary goal (what success looks like)
2. High-level strategy (approach, not specific clicks)
3. Success criteria (how to verify completion)
4. Potential obstacles (what might go wrong)
5. Checkpoints (progress validation points)
Return JSON.
`;
const response = await this.llm.generate(prompt);
return this.parsePlan(response);
}
async evaluateProgress(context: Context): Promise<Evaluation> {
const prompt = `
Original goal: ${context.plan.goal}
Actions taken: ${context.actionHistory}
Current state: ${context.currentState}
Questions:
1. Are we making progress toward the goal?
2. Is the task complete?
3. If not complete, what should we do next?
4. If stuck, what recovery strategy should we try?
Return JSON with done (boolean) and next_goal (string).
`;
const response = await this.llm.generate(prompt);
return this.parseEvaluation(response);
}
}
Key insight: Planner thinks in goals, not actions.
2. Navigator Agent: The Executor
Responsibility: Execute browser actions intelligently
NOT this:
action: "click_element"
selector: "#specific-button-id"
❌ Breaks when ID changes
But this:
action: "find_and_click"
intent: "Proceed to checkout"
fallback_selectors: [
"button containing 'checkout'",
"link to '/checkout'",
"element with cart icon + checkout text"
]
validation: "URL changes to /checkout OR modal appears"
✅ Intent-based, resilient
Implementation:
class NavigatorAgent {
async execute(step: Step, context: Context): Promise<Result> {
// Generate actions based on intent, not fixed selectors
const actions = await this.generateActions(step.intent, context);
const results = [];
for (const action of actions) {
try {
const result = await this.performAction(action);
// Validate action succeeded
if (action.validation) {
const valid = await this.validateAction(action, result);
if (!valid) {
// Action executed but didn't achieve intent
return this.retry(action, context);
}
}
results.push(result);
} catch (error) {
// Handle failure with recovery
const recovered = await this.attemptRecovery(action, error);
if (!recovered) {
return { success: false, error, results };
}
}
}
return { success: true, results };
}
private async generateActions(intent: string, context: Context): Promise<Action[]> {
const prompt = `
Intent: ${intent}
Page: ${context.accessibility}
Generate 1-10 specific actions to achieve this intent.
Adapt to the actual page structure.
Include validation for each action.
Return JSON array of actions.
`;
return await this.llm.generate(prompt);
}
private async validateAction(action: Action, result: Result): Promise<boolean> {
// Check if action achieved its intent
if (action.validation.type === 'url_change') {
return result.urlAfter !== result.urlBefore;
}
if (action.validation.type === 'element_appears') {
return await this.elementExists(action.validation.selector);
}
if (action.validation.type === 'content_change') {
return result.contentAfter !== result.contentBefore;
}
return true;
}
}
Key insight: Navigator validates that actions achieved their intent, not just that they executed.
3. Validator Agent: The Quality Check
Responsibility: Verify task completion
Implementation:
class ValidatorAgent {
async validateCompletion(plan: Plan, context: Context): Promise<ValidationResult> {
const checks = await Promise.all([
this.checkPrimaryGoal(plan.goal, context),
this.checkSuccessCriteria(plan.success_criteria, context),
this.checkSideEffects(plan.expected_side_effects, context),
this.checkNoErrors(context)
]);
const allPassed = checks.every(check => check.passed);
return {
complete: allPassed,
checks,
confidence: this.calculateConfidence(checks),
evidence: this.gatherEvidence(checks)
};
}
private async checkPrimaryGoal(goal: string, context: Context): Promise<Check> {
const prompt = `
Goal: ${goal}
Current page: ${context.url}
Page content: ${context.pageContent}
Action history: ${context.actionHistory}
Question: Has the primary goal been achieved?
Provide:
- passed (boolean)
- reasoning (string)
- evidence (array of observable facts)
Return JSON.
`;
return await this.llm.generate(prompt);
}
private async checkSuccessCriteria(criteria: string[], context: Context): Promise<Check> {
// Verify each success criterion
const results = await Promise.all(
criteria.map(criterion => this.verifyCriterion(criterion, context))
);
return {
passed: results.every(r => r.passed),
details: results
};
}
}
Key insight: Validator checks evidence, not assumptions.
How Our Agent Plans Tasks
Example: "Find the cheapest flight to Tokyo next month"
Traditional Approach (Fails)
Steps:
1. Go to kayak.com
2. Type "Tokyo" in destination
3. Click search
4. Sort by price
5. Return first result
❌ Problems:
- What if Kayak is down?
- What if "next month" is ambiguous?
- What if cheapest flight has 3 layovers?
- What if prices are in different currencies?
Our Planner's Approach (Works)
const plan = {
goal: "Find cheapest practical flight to Tokyo in next month",
strategy: {
approach: "Compare prices across major booking sites",
sites: ["kayak.com", "google.com/flights", "skyscanner.com"],
date_range: "Flexible within next 30 days",
constraints: ["Max 1 layover", "Reasonable flight duration"]
},
execution_plan: {
phase_1: {
objective: "Gather flight options from all sites",
parallel: true,
sites: ["kayak", "google", "skyscanner"]
},
phase_2: {
objective: "Normalize and compare prices",
method: "Extract price, convert currency, filter by constraints"
},
phase_3: {
objective: "Identify cheapest practical option",
criteria: ["Lowest price", "Max 1 layover", "<20 hours total time"]
}
},
success_criteria: [
"Prices found from at least 2 sites",
"All prices in same currency",
"Recommended flight meets constraints",
"Price difference explained if sites disagree"
],
obstacles: [
{ obstacle: "Site requires login", strategy: "Skip and use other sites" },
{ obstacle: "No flights in date range", strategy: "Expand date range by 1 week" },
{ obstacle: "Currency conversion needed", strategy: "Use exchange rate API" }
]
};
Why this works:
- ✅ Handles ambiguity ("next month" → specific date range)
- ✅ Has fallback strategies (if site fails, use others)
- ✅ Validates results (compares across sources)
- ✅ Applies constraints (not just "cheapest at any cost")
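Phases 2 and 3 of this plan reduce to plain data wrangling once the options are extracted. A minimal sketch, with hard-coded exchange rates standing in for the exchange rate API (all names and numbers are illustrative):

```typescript
interface FlightOption {
  price: number;
  currency: 'USD' | 'EUR' | 'JPY';
  layovers: number;
  totalHours: number;
}

// Assumed static rates; the real plan would call an exchange rate API.
const TO_USD: Record<string, number> = { USD: 1, EUR: 1.08, JPY: 0.0067 };

// Phase 2: normalize to USD. Phase 3: apply constraints, pick cheapest practical.
function cheapestPractical(options: FlightOption[]): FlightOption | null {
  const practical = options
    .map((o) => ({ ...o, price: o.price * TO_USD[o.currency], currency: 'USD' as const }))
    .filter((o) => o.layovers <= 1 && o.totalHours < 20);
  if (practical.length === 0) return null;
  return practical.reduce((best, o) => (o.price < best.price ? o : best));
}

const options: FlightOption[] = [
  { price: 620, currency: 'USD', layovers: 0, totalHours: 13 },
  { price: 80000, currency: 'JPY', layovers: 3, totalHours: 26 }, // filtered: 3 layovers
  { price: 540, currency: 'EUR', layovers: 1, totalHours: 17 },
];
console.log(cheapestPractical(options)?.price);
// logs the converted EUR fare (about $583), which beats the $620 direct flight
```

The constraint filter runs before the price comparison, which is exactly why the agent never recommends "cheapest at any cost."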
Execution: Beyond Fixed Scripts
How Navigator executes intelligently.
Intelligent Element Finding
Problem: Selectors break constantly
Solution: Intent-based finding
class IntelligentElementFinder {
async findElement(intent: string, context: Context): Promise<Element> {
// Try multiple strategies in parallel
const strategies = [
this.findBySemanticRole(intent),
this.findByVisibleText(intent),
this.findByAriaLabel(intent),
this.findByVisionLLM(intent, context)
];
// Take the first strategy that yields a usable element;
// Promise.any ignores strategies that reject (i.e. find nothing)
const element = await Promise.any(
strategies.map(async (find) => {
const el = await find;
if (!el) throw new Error('no match');
return el;
})
);
// Validate found element
if (await this.validateElement(element, intent)) {
return element;
}
throw new Error(`Could not find element for intent: ${intent}`);
}
private async findBySemanticRole(intent: string): Promise<Element> {
// Example intent: "Click the submit button"
// Find buttons with submit-like characteristics
const buttons = await page.$$('button, input[type="submit"], [role="button"]');
for (const button of buttons) {
const text = await button.textContent();
const type = await button.getAttribute('type');
if (
text?.toLowerCase().includes('submit') ||
text?.toLowerCase().includes('send') ||
type === 'submit'
) {
return button;
}
}
return null;
}
private async findByVisionLLM(intent: string, context: Context): Promise<Element> {
// Capture screenshot
const screenshot = await page.screenshot();
// Ask vision LLM to find element
const response = await this.visionLLM.analyze(screenshot, {
prompt: `Find the UI element that would ${intent}. Return bounding box coordinates.`
});
// Click at the returned coordinates (page.click takes a selector, so use mouse.click)
await page.mouse.click(response.x, response.y);
return response.element;
}
}
Handling Unexpected States
Problem: Real websites are messy (popups, loading states, errors)
Solution: Continuous state monitoring
class StateMonitor {
async monitorExecution(action: Action): Promise<ExecutionResult> {
// Start execution
const executionPromise = this.executeAction(action);
// Monitor for interruptions
const interruptionMonitor = this.watchForInterruptions();
const result = await Promise.race([
executionPromise,
interruptionMonitor
]);
if (result.type === 'interruption') {
return await this.handleInterruption(result, action);
}
return result;
}
private async watchForInterruptions(): Promise<Interruption> {
// Watch for common interruptions
const checks = [
this.checkForModal(),
this.checkForAlert(),
this.checkForCaptcha(),
this.checkForError(),
this.checkForRedirect()
];
return await Promise.race(checks);
}
private async handleInterruption(interruption: Interruption, action: Action): Promise<Result> {
switch (interruption.type) {
case 'modal':
// Close modal or interact with it
await this.handleModal(interruption);
// Retry original action
return await this.executeAction(action);
case 'captcha':
// Pause for human intervention
return await this.requestHumanHelp('captcha');
case 'error':
// Report error and try recovery
return await this.attemptErrorRecovery(interruption, action);
default:
return { success: false, interruption };
}
}
}
The Validation Challenge
How do we know a task is really done?
Naive Validation (Wrong)
// ❌ Assumes success based on execution
async function validateCheckout() {
return { success: clickedButton };
}
Evidence-Based Validation (Correct)
class EvidenceBasedValidator {
async validateTaskCompletion(task: Task, context: Context): Promise<ValidationResult> {
// Gather multiple forms of evidence
const evidence = await this.gatherEvidence(task, context);
// Cross-validate evidence
const validation = await this.crossValidate(evidence);
// Calculate confidence
const confidence = this.calculateConfidence(validation);
return {
complete: confidence > 0.9,
confidence,
evidence,
reasoning: validation.reasoning
};
}
private async gatherEvidence(task: Task, context: Context): Promise<Evidence> {
return {
// Visual evidence
confirmationPageVisible: await this.checkForConfirmation(),
successMessageVisible: await this.checkForSuccessMessage(),
// Behavioral evidence
urlChanged: context.urlBefore !== context.urlAfter,
expectedRedirect: context.urlAfter.includes(task.expectedUrl),
// Data evidence
orderInHistory: await this.checkOrderHistory(task.orderId),
emailReceived: await this.checkEmail(task.confirmationEmail),
databaseUpdated: await this.checkDatabase(task.transactionId),
// Error evidence (should be absent)
noErrors: !await this.checkForErrors(),
noWarnings: !await this.checkForWarnings(),
// Semantic evidence (via LLM)
llmConfirmation: await this.llmAnalysis(task, context)
};
}
private async llmAnalysis(task: Task, context: Context): Promise<LLMValidation> {
const prompt = `
Task goal: ${task.goal}
Current page: ${context.page}
Actions taken: ${context.actions}
Question: Based on the visible evidence, is the task complete?
Analyze:
- Are success indicators present?
- Are there any error indicators?
- Does the page state match expected outcome?
Return: { complete: boolean, confidence: number, reasoning: string }
`;
return await this.llm.generate(prompt);
}
}
Result: 98% validation accuracy (vs 65% for naive validation)
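The `calculateConfidence` call above is referenced but not shown. One plausible sketch is a weighted fraction of passed checks, weighting hard evidence (a record in order history) above soft evidence (an LLM's judgment); the weights here are illustrative, not our exact tuning:

```typescript
interface EvidenceCheck {
  passed: boolean;
  weight: number; // hard evidence (DB record) > soft evidence (LLM judgment)
}

// Confidence = weighted share of passed checks, in [0, 1].
function calculateConfidence(checks: EvidenceCheck[]): number {
  const total = checks.reduce((sum, c) => sum + c.weight, 0);
  if (total === 0) return 0;
  const passed = checks.reduce((sum, c) => sum + (c.passed ? c.weight : 0), 0);
  return passed / total;
}

const checks: EvidenceCheck[] = [
  { passed: true, weight: 3 },  // order appears in order history
  { passed: true, weight: 2 },  // confirmation page visible
  { passed: true, weight: 2 },  // no error banners
  { passed: false, weight: 1 }, // confirmation email not yet received
];
console.log(calculateConfidence(checks)); // 0.875 -> below the 0.9 bar, keep checking
```

Pairing this with the `confidence > 0.9` threshold shown earlier means one missing soft signal stalls completion rather than silently passing it.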
Error Recovery That Actually Works
Traditional error handling:
try {
await action();
} catch (error) {
console.log('Failed');
return { success: false };
}
Our error recovery:
class IntelligentErrorRecovery {
async executeWithRecovery(action: Action, maxAttempts: number = 3): Promise<Result> {
let lastError: Error;
for (let attempt = 1; attempt <= maxAttempts; attempt++) {
try {
return await this.executeAction(action);
} catch (error) {
lastError = error;
// Analyze error
const analysis = await this.analyzeError(error, action);
// Determine if recoverable
if (!analysis.recoverable) {
throw new UnrecoverableError(error);
}
// Generate recovery strategy
const strategy = await this.generateRecoveryStrategy(analysis);
// Execute recovery
const recovered = await this.executeRecovery(strategy);
if (recovered) {
// Retry original action with learned knowledge
continue;
}
// If last attempt, throw
if (attempt === maxAttempts) {
throw new MaxAttemptsError(lastError);
}
// Wait before retry (exponential backoff)
await this.wait(Math.pow(2, attempt) * 1000);
}
}
// A successful recovery on the final attempt still consumes it;
// fail loudly here rather than falling through and returning undefined
throw new MaxAttemptsError(lastError!);
}
private async analyzeError(error: Error, action: Action): Promise<ErrorAnalysis> {
const prompt = `
Error occurred during action: ${action.type}
Error message: ${error.message}
Stack trace: ${error.stack}
Page state: ${await this.getPageState()}
Analyze:
1. What caused the error?
2. Is it recoverable?
3. What recovery strategy should we try?
Return JSON: {
cause: string,
recoverable: boolean,
recovery_strategy: string,
estimated_success_rate: number
}
`;
return await this.llm.generate(prompt);
}
private async executeRecovery(strategy: RecoveryStrategy): Promise<boolean> {
switch (strategy.type) {
case 'refresh':
await page.reload();
return true;
case 'wait_for_element':
await page.waitForSelector(strategy.selector, { timeout: 10000 });
return true;
case 'alternative_path':
// Try different sequence of actions
return await this.tryAlternativePath(strategy.path);
case 'human_intervention':
return await this.requestHumanHelp(strategy.reason);
default:
return false;
}
}
}
Impact:
- 65% of failures are now recovered automatically
- Average recovery time: 3.2 seconds
- Task completion rate: 65% → 94%
Real-World Task Examples
Task 1: Product Price Comparison
Input: "Compare prices for AirPods Pro across Amazon, Best Buy, and Walmart"
Execution:
// Planner creates strategy
Plan:
Goal: Find and compare AirPods Pro prices
Strategy: Visit each site, search, extract price
Success: Prices from all 3 sites, normalized
// Navigator executes (parallel)
await Promise.all([
navigator.search('amazon.com', 'AirPods Pro'),
navigator.search('bestbuy.com', 'AirPods Pro'),
navigator.search('walmart.com', 'AirPods Pro')
]);
// Validator checks
Evidence:
✓ Amazon: $189.99 found
✓ Best Buy: $249.99 found
✓ Walmart: $179.99 found
✓ Prices normalized to USD
✓ Products confirmed authentic (not knockoffs)
Result:
Complete: true
Confidence: 0.96
Answer: "Cheapest at Walmart ($179.99), followed by Amazon ($189.99) and Best Buy ($249.99)"
Success rate: 96%
Task 2: Form Submission with Validation
Input: "Submit job application with resume"
Execution:
// Planner identifies requirements
Plan:
Goal: Successfully submit job application
Requirements: Name, email, resume upload, submit
Success: Confirmation page, email received
// Navigator executes
1. Fill name ✓
2. Fill email ✓
3. Upload resume ✓
4. Click submit ✓
5. Error: "Email format invalid"
// Error recovery
6. Analyze error: Validation failed
7. Correct email format
8. Retry submit ✓
// Validator checks
Evidence:
✓ Confirmation page visible
✓ "Application submitted" message
✓ Confirmation email received
✓ Application ID: APP-12345
Result:
Complete: true
Confidence: 0.99
Success rate: 91% (including error recovery)
Measuring True Task Completion
We track these metrics:
interface TaskMetrics {
// Core metrics
completionRate: number; // % tasks that achieve goal
partialCompletionRate: number; // % tasks that make progress
failureRate: number; // % tasks that completely fail
// Quality metrics
validationConfidence: number; // Avg confidence in completion
falsePositiveRate: number; // % tasks marked complete incorrectly
falseNegativeRate: number; // % tasks marked failed incorrectly
// Efficiency metrics
avgStepsToCompletion: number;
avgExecutionTime: number;
recoveryRate: number; // % failures recovered
// User satisfaction
userConfirmationRate: number; // % users agree with outcome
}
Our results (6 months, production):
Completion Rate: 94.3%
Partial Completion: 3.2%
Failure Rate: 2.5%
Validation Confidence: 0.92
False Positive Rate: 1.8%
False Negative Rate: 0.5%
Avg Steps to Completion: 12.4
Avg Execution Time: 23.7s
Recovery Rate: 64.8%
User Confirmation Rate: 96.1%
Comparison to traditional automation:
| Metric | Traditional | Our Agent | Improvement |
|---|---|---|---|
| Completion Rate | 62% | 94% | +52% |
| False Positives | 15% | 1.8% | -87% |
| Avg Time | 45s | 24s | -47% |
| Recovery Rate | 12% | 65% | +442% |
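Most of these rates fall out of a single pass over per-task logs. A sketch under an assumed minimal log schema (the field names are illustrative, not our actual telemetry format):

```typescript
interface TaskLog {
  goalAchieved: boolean;
  madeProgress: boolean;    // true if any checkpoint passed
  failuresRecovered: number;
  failuresTotal: number;
}

interface Rates {
  completionRate: number;
  partialCompletionRate: number;
  failureRate: number;
  recoveryRate: number;
}

// Fold the logs into the core rates from TaskMetrics.
function computeRates(logs: TaskLog[]): Rates {
  const n = logs.length;
  const completed = logs.filter((l) => l.goalAchieved).length;
  const partial = logs.filter((l) => !l.goalAchieved && l.madeProgress).length;
  const failures = logs.reduce((sum, l) => sum + l.failuresTotal, 0);
  const recovered = logs.reduce((sum, l) => sum + l.failuresRecovered, 0);
  return {
    completionRate: completed / n,
    partialCompletionRate: partial / n,
    failureRate: (n - completed - partial) / n,
    recoveryRate: failures === 0 ? 0 : recovered / failures,
  };
}

const logs: TaskLog[] = [
  { goalAchieved: true, madeProgress: true, failuresRecovered: 1, failuresTotal: 1 },
  { goalAchieved: true, madeProgress: true, failuresRecovered: 0, failuresTotal: 0 },
  { goalAchieved: false, madeProgress: true, failuresRecovered: 0, failuresTotal: 2 },
  { goalAchieved: false, madeProgress: false, failuresRecovered: 1, failuresTotal: 2 },
];
console.log(computeRates(logs));
// { completionRate: 0.5, partialCompletionRate: 0.25, failureRate: 0.25, recoveryRate: 0.4 }
```

Note that recovery rate is computed over failures, not over tasks, which is why it can sit at 65% while the failure rate is only 2.5%.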
Lessons from Production
What we learned from billions of actions:
1. Validation is Harder Than Execution
Misconception: "If actions execute without errors, the task is done"
Reality: 38% of "successful" executions didn't achieve the goal
Solution: Multi-evidence validation
2. Vision Models Are Game-Changers
Adding vision to our agent:
- Completion rate: 87% → 94%
- Better visual validation
- Better element finding
- Better error detection
Example: Agent can now see "out of stock" even when there's no error message
3. Planning Frequency Matters
Too much planning: slow (replanning before every action)
Too little planning: the agent gets lost (no replanning at all)
Optimal: every 3 actions (our finding)
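The cadence itself is a one-line counter in the execution loop. A minimal sketch, where a synchronous `evaluate` callback stands in for the Planner's LLM-backed progress check (names are illustrative):

```typescript
const REPLAN_INTERVAL = 3; // the cadence we found optimal

interface Evaluation {
  done: boolean;
  nextGoal: string;
}

// Run a sequence of actions, pausing to re-evaluate progress every
// REPLAN_INTERVAL steps; returns how many replanning passes occurred.
function executeWithReplanning(
  actions: string[],
  evaluate: (actionsTaken: number) => Evaluation,
): number {
  let replans = 0;
  for (let i = 0; i < actions.length; i++) {
    // ... the Navigator would perform actions[i] here ...
    if ((i + 1) % REPLAN_INTERVAL === 0) {
      replans++;
      if (evaluate(i + 1).done) break; // stop as soon as the Planner says so
    }
  }
  return replans;
}

// 7 actions -> replanning passes after actions 3 and 6
console.log(
  executeWithReplanning(
    ['a1', 'a2', 'a3', 'a4', 'a5', 'a6', 'a7'],
    () => ({ done: false, nextGoal: 'continue' }),
  ),
); // 2
```

Raising `REPLAN_INTERVAL` trades LLM calls for drift; lowering it does the opposite, which is the slow-versus-lost tension described above.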
4. Local Context is Critical
Agents need to remember:
- What they've tried
- What failed
- What worked
- Current state vs. goal
Implementation:
class ContextManager {
private context = {
goal: '',
attemptedActions: [] as Action[],
successfulActions: [] as Action[],
failedActions: [] as Action[],
currentState: null as PageState | null,
progressTowardGoal: 0
};
updateContext(action: Action, result: Result) {
this.context.attemptedActions.push(action);
if (result.success) {
this.context.successfulActions.push(action);
this.context.progressTowardGoal += this.estimateProgress(action);
} else {
this.context.failedActions.push(action);
}
this.context.currentState = result.pageState;
}
}
5. Humans Are Still Essential
The agent excels at: repetitive, structured tasks
Humans excel at: ambiguity, judgment, creativity
Best practice: Human-in-the-loop for:
- CAPTCHAs
- Ambiguous instructions
- High-stakes decisions
- Final validation
Open Source Implementation
Our agent is open source: Onpiste
Quick start:
# Install
npm install @onpiste/agent
// Configure
import { Agent } from '@onpiste/agent';
const agent = new Agent({
llm: {
provider: 'openai',
model: 'gpt-4o',
apiKey: process.env.OPENAI_API_KEY
}
});
// Execute task
const result = await agent.executeTask({
goal: 'Find cheapest flight to Tokyo next month',
constraints: ['Max 1 layover', 'Under $1000'],
successCriteria: [
'Price found',
'Meets constraints',
'Booking link provided'
]
});
console.log(result);
// {
// complete: true,
// confidence: 0.94,
// result: "Cheapest flight: $850 on United (1 layover)",
// evidence: [...],
// bookingUrl: "https://..."
// }
Architecture:
@onpiste/agent/
├─ src/
│ ├─ agents/
│ │ ├─ planner.ts # Strategy agent
│ │ ├─ navigator.ts # Execution agent
│ │ └─ validator.ts # Validation agent
│ ├─ orchestrator.ts # Agent coordination
│ ├─ recovery.ts # Error recovery
│ └─ validation.ts # Evidence-based validation
├─ tests/
└─ examples/
Conclusion: Intelligence Over Execution
What we learned:
- ✅ Task completion ≠ executing steps
- ✅ Validation is harder than execution
- ✅ Error recovery is essential
- ✅ Context matters more than we thought
- ✅ Multiple agents > a single monolithic agent
Our agent:
- 94% task completion rate
- 65% error recovery rate
- 24s average execution time
- Handles unexpected states
- Validates outcomes with evidence
The key insight: Build agents that understand goals, not just execute steps.
Get started: npm install @onpiste/agent (see the Open Source Implementation section above).
The future of automation is intelligent task completion, not button-clicking.
Related Articles
- Building a ChatGPT Alternative for Browser Control
- Multi-Agent System Architecture
- From ChatGPT Atlas to Local Browser Agents
- AI Agents Replacing Manual Testing
Experience intelligent task completion. Install Onpiste and see the difference.
