Smart Column Optimization: Transform Messy Scraped Data into Clean, Usable Datasets

Keywords: data cleaning, column optimization, duplicate detection, semantic headers, AI data preparation

You scraped the data. Now you're staring at 47 columns, half of which are duplicates or empty.

Column names like: product_title_1, product_title_2, [object Object], undefined

Your options:

  1. Manually review every column (30 minutes)
  2. Write Python scripts to clean data (1 hour)
  3. Import into Excel and manually delete columns (20 minutes)

Or use AI-powered column optimization that does it automatically in 5 seconds.

The Data Quality Problem

Common Issues in Raw Scraped Data

Problem 1: Duplicate Columns

| Product Name | Product Title | item_name | name |
|--------------|---------------|-----------|------|
| Keyboard     | Keyboard      | Keyboard  | ...  |

All four columns contain the same data.

Problem 2: Cryptic Headers

| col_1 | div.product-card > span.title | [data-field="price"] |
|-------|-------------------------------|----------------------|
| Mouse | $29.99                        | In Stock             |

Headers are CSS selectors, not human-readable.

Problem 3: Empty or Useless Columns

| Product | Price | placeholder_column | tracking_id | internal_guid |
|---------|-------|--------------------|-------------|---------------|
| Mouse   | $29   |                    | 12847       | a8f3-92d1... |

Some columns are always empty or contain technical IDs users don't need.

Problem 4: Poor Column Order

| SKU  | Internal ID | Category | Subcategory | Tags | Product Name | Price |
|------|-------------|----------|-------------|------|--------------|-------|
| 1029 | prod_88291  | Tech     | Peripherals | USB  | Keyboard     | $59   |

Most important columns (Name, Price) are buried at the end.

The Manual Cleaning Burden

Traditional workflow:

  1. Export to CSV
  2. Open in Excel
  3. Manually review each column
  4. Delete duplicates
  5. Rename cryptic headers
  6. Reorder columns
  7. Delete empty columns
  8. Save cleaned version

  • Time: 20-30 minutes per dataset
  • Error-prone: easy to delete the wrong columns
  • Tedious: repeated for every scrape

Automatic Duplicate Column Detection

Similarity Algorithm

Approach: Compare every column pair, calculate similarity

// Fraction of rows on which two columns hold the same value (trimmed, case-insensitive).
function calculateSimilarity(col1: string[], col2: string[]): number {
  const totalRows = col1.length;
  if (totalRows === 0) return 0; // avoid dividing by zero on empty columns

  let matches = 0;
  for (let i = 0; i < totalRows; i++) {
    const val1 = String(col1[i] || '').trim().toLowerCase();
    const val2 = String(col2[i] || '').trim().toLowerCase();
    if (val1 === val2) {
      matches++;
    }
  }

  return matches / totalRows;
}

Threshold: 85% similarity = duplicate

Example:

Column A: ["Apple", "Banana", "Cherry", "Date", "Elderberry"]
Column B: ["Apple", "Banana", "Cherry", "Date", "Fig"]

Similarity: 4/5 = 80% (keep both)

Column C: ["Apple", "Banana", "Cherry", "Date", "Elderberry"]
Column D: ["Apple", "Banana", "Cherry", "Date", "Elderberry"]

Similarity: 5/5 = 100% (remove D)
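
To see how the pairwise comparison and the 85% threshold fit together, here is a minimal sketch that flags later columns as duplicates of earlier ones (it reuses calculateSimilarity from above and assumes column-major data; the helper name is illustrative):

function findDuplicateColumns(columns: string[][], threshold = 0.85): number[] {
  // columns is column-major: columns[i] holds every cell of column i.
  const duplicates: number[] = [];
  for (let i = 0; i < columns.length; i++) {
    for (let j = i + 1; j < columns.length; j++) {
      if (duplicates.includes(j)) continue; // already marked as a duplicate
      if (calculateSimilarity(columns[i], columns[j]) >= threshold) {
        duplicates.push(j); // keep the earlier column, drop the later one
      }
    }
  }
  return duplicates;
}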

Empty Column Detection

Remove if:

  • All cells are empty strings
  • All cells are null/undefined
  • All cells contain only whitespace

// True when every cell in the column is empty, null/undefined, or whitespace.
function isColumnEmpty(column: string[]): boolean {
  return column.every(cell => {
    const str = String(cell || '').trim();
    return str.length === 0;
  });
}
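
Scraped rows usually arrive row-major, so in practice isColumnEmpty is applied per column. A small usage sketch (the removeEmptyColumns helper is illustrative, not part of the tool's API):

function removeEmptyColumns(fields: string[], rows: string[][]) {
  // Build one boolean per column: true if that column has at least one value.
  const keep = fields.map((_, c) => !isColumnEmpty(rows.map(row => row[c])));
  return {
    fields: fields.filter((_, c) => keep[c]),
    rows: rows.map(row => row.filter((_, c) => keep[c]))
  };
}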

Deduplication Benefits

Before:

47 columns, 500 KB file size

After:

23 columns (-51%), 280 KB file size (-44%)

Results:

  • Smaller files
  • Faster loading
  • Easier analysis
  • Less visual clutter

AI Semantic Header Generation

The Problem with Auto-Generated Headers

Scrapers often generate:

  • CSS selectors: div.product > span.name
  • XPath expressions: /html/body/div[2]/table/tr/td[1]
  • Generic names: column_1, column_2, column_3
  • Technical IDs: data-field-983

Users need:

  • Human-readable names: Product Name, Price, Rating

LLM-Powered Header Generation

Process:

async function generateSemanticHeaders(
  fields: string[],
  sampleData: string[][],
  userIntent?: string
) {
  const prompt = `
    Original headers: ${JSON.stringify(fields)}
    Sample data (first 5 rows): ${JSON.stringify(sampleData)}
    ${userIntent ? `User intent: ${userIntent}` : ''}

    Generate clear, semantic column headers.
    Rules:
    - Use Title Case
    - Be concise but descriptive
    - Remove technical prefixes
    - Make headers business-friendly
    - Return only a JSON array of strings, one header per column
  `;

  const response = await llm.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [
      { role: 'system', content: 'You are a data analyst expert.' },
      { role: 'user', content: prompt }
    ],
    temperature: 0.1
  });

  // The prompt asks for a JSON array, so the reply parses directly into string[].
  return JSON.parse(response.choices[0].message.content ?? '[]');
}

Example transformation:

Before:

["div.title", "span.price", "div[data-rating]", "a.link"]

AI sees sample data:

[
  ["Wireless Mouse", "$29.99", "4.5", "https://..."],
  ["Keyboard", "$59.99", "4.7", "https://..."]
]

After:

["Product Name", "Price", "Customer Rating", "Product URL"]

Context-Aware Naming

User intent: "I want to analyze product pricing"

AI prioritizes:

  • "Price" instead of "Cost" or "MSRP"
  • "Product Name" instead of "Title" or "Item"
  • "Discount" if applicable

User intent: "I need customer contact information"

AI prioritizes:

  • "Email Address" instead of "Contact"
  • "Phone Number" instead of "Tel"
  • "Full Name" instead of "Name"

Intelligent Column Reordering

Importance-Based Ordering

AI analyzes:

  1. User intent: "I want product pricing data"
  2. Column names: Which are most relevant to intent?
  3. Data types: Unique identifiers vs descriptive fields
  4. Common patterns: What users typically care about first

Reordering logic:

async function reorderColumns(
  fields: string[],
  data: string[][],
  intent?: string
) {
  const prompt = `
    Fields: ${JSON.stringify(fields)}
    Sample data: ${JSON.stringify(data.slice(0, 5))}
    User intent: ${intent || 'general data extraction'}

    Reorder columns by importance.
    Put most valuable columns first, technical/metadata columns last.

    Return column order as array of indices: [2, 0, 5, 1, 3, 4]
  `;

  // Same call shape as generateSemanticHeaders above.
  const response = await llm.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [
      { role: 'system', content: 'You are a data analyst expert.' },
      { role: 'user', content: prompt }
    ],
    temperature: 0.1
  });
  return JSON.parse(response.choices[0].message.content ?? '[]');
}

Example:

Original order:

[SKU, Internal_ID, Vendor_Code, Product_Name, Price, Stock_Status, Created_At]

AI reordered (intent: "product analysis"):

[Product_Name, Price, Stock_Status, SKU, Vendor_Code, Internal_ID, Created_At]
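
The call returns an index order rather than the reordered table itself, so a final step applies that order to both headers and rows. A minimal sketch (applyColumnOrder is an illustrative helper):

function applyColumnOrder(fields: string[], rows: string[][], order: number[]) {
  // For the example above, order would be [3, 4, 5, 0, 2, 1, 6].
  return {
    fields: order.map(i => fields[i]),
    rows: rows.map(row => order.map(i => row[i]))
  };
}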

Benefits:

  • Most important data visible first
  • Reduces horizontal scrolling
  • Faster data comprehension
  • Better for reports/dashboards

Identifying and Removing Useless Columns

AI-Powered Useless Column Detection

Criteria for "useless" (the first two are sketched in code after this list):

  1. Mostly empty (>90% empty cells)

| Product | Price | Promo_Code | Discount |
|---------|-------|------------|----------|
| Mouse   | $29   |            |          |
| Keyboard| $59   |            |          |

Promo_Code and Discount are useless

  2. Constant values (no variation)

| Product | Category | Status  | Active |
|---------|----------|---------|--------|
| Mouse   | Tech     | Public  | true   |
| Keyboard| Tech     | Public  | true   |

Status and Active have no variation

  3. Technical metadata users don't need

| Product | created_timestamp | last_modified | db_row_id |
|---------|-------------------|---------------|-----------|
| Mouse   | 1634567890        | 1634567899    | 10283     |

Timestamps and IDs rarely needed for analysis

  4. Duplicates (covered earlier)
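
The first two checks do not need an LLM at all. A heuristic sketch (the 90% threshold is illustrative):

// Criterion 1: more than 90% of the cells are empty.
function isMostlyEmpty(column: string[], emptyRatio = 0.9): boolean {
  const empty = column.filter(cell => String(cell || '').trim().length === 0).length;
  return column.length > 0 && empty / column.length > emptyRatio;
}

// Criterion 2: every non-empty cell holds the same value.
function isConstant(column: string[]): boolean {
  const values = new Set(
    column.map(cell => String(cell || '').trim()).filter(v => v.length > 0)
  );
  return values.size <= 1;
}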

Confidence Scoring

AI provides:

{
  "Promo_Code": {
    "reason": "99% of cells are empty",
    "confidence": 0.95
  },
  "db_row_id": {
    "reason": "Internal database ID not useful for analysis",
    "confidence": 0.85
  },
  "last_modified": {
    "reason": "Technical timestamp, not requested by user",
    "confidence": 0.78
  }
}

User control:

  • Auto-remove: Confidence > 0.85
  • Suggest removal: Confidence 0.70-0.85
  • Keep: Confidence < 0.70
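
A small sketch of how those thresholds could translate into actions, assuming the response shape shown above (function and type names are illustrative):

type Suggestion = { reason: string; confidence: number };

function triageSuggestions(suggestions: Record<string, Suggestion>) {
  const autoRemove: string[] = [];
  const suggestRemoval: string[] = [];
  const keep: string[] = [];

  for (const [column, { confidence }] of Object.entries(suggestions)) {
    if (confidence > 0.85) autoRemove.push(column);          // removed automatically
    else if (confidence >= 0.7) suggestRemoval.push(column); // shown to the user
    else keep.push(column);                                  // left untouched
  }
  return { autoRemove, suggestRemoval, keep };
}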

User Confirmation

Best UX:

AI identified 5 potentially useless columns:
☑ Promo_Code (99% empty)
☑ db_row_id (internal ID)
☐ last_modified (timestamp)  ← User unchecks
☑ tracking_pixel (always empty)
☑ placeholder_column (technical field)

[Remove Selected] [Keep All]

User reviews suggestions before deletion.

Real-World Optimization Examples

Example 1: E-commerce Product Scrape

Raw scraped data (12 columns):

col_1, div.product-title, span.price, div.price, [data-rating], stars,
product_url, link_href, image_src, img_url, stock_status, availability

After AI optimization (6 columns):

Product Name, Price, Customer Rating, Product URL, Image, Availability

Changes:

  • ✅ Removed duplicates: col_1 ≈ div.product-title, span.price ≈ div.price, stars ≈ [data-rating], image_src ≈ img_url, link_href ≈ product_url
  • ✅ Renamed to semantic headers
  • ✅ Reordered by importance
  • ✅ Result: 50% fewer columns, 100% more usable

Example 2: Research Paper Data Table

Raw scraped data (19 columns):

table_row_1, table_cell_1, table_cell_2, ..., table_cell_15,
footnote_ref, source_link, page_number

After AI optimization (8 columns):

Year, Country, Population (millions), GDP (billions), Growth Rate (%),
Unemployment (%), Inflation (%), Source

Changes:

  • ✅ AI inferred column meanings from data patterns
  • ✅ Generated descriptive headers with units
  • ✅ Removed technical table structure columns
  • ✅ Kept source attribution

Example 3: Job Listings Scrape

Raw scraped data (23 columns):

job_title, company, location, job_title_2, company_name, office_location,
salary, compensation, pay_range, posted_date, post_date, date_published,
description, job_desc, full_description, apply_url, application_link,
logo_url, company_logo, job_type, employment_type, tracking_id, internal_id

After AI optimization (9 columns):

Job Title, Company, Location, Salary Range, Posted Date, Job Type,
Description, Apply URL, Company Logo

Changes:

  • ✅ Collapsed duplicates: job_title/job_title_2, salary variants, date variants
  • ✅ Removed tracking IDs
  • ✅ Kept one description field (full_description)
  • ✅ Clear, business-friendly headers

Best Practices

1. Review AI Suggestions Before Applying

Always preview:

  • Columns to be removed
  • New header names
  • Reordered column positions

Why: AI is 95% accurate, not 100%. You might need specific fields the AI thinks are useless.

2. Provide Intent When Possible

Generic scrape:

AI optimization with no intent provided → generic cleanup

With intent:

Intent: "I need pricing data for competitor analysis"
→ AI prioritizes pricing, discounts, product names
→ Removes technical IDs, timestamps
→ Better results

3. Keep One Sample of Original Data

Workflow:

  1. Extract raw data
  2. Export raw CSV (backup)
  3. Apply AI optimization
  4. Export optimized CSV (working version)

Safety net: If optimization removes something you need, refer to raw backup.

4. Use Progressive Optimization

Step-by-step approach:

1. Remove obvious duplicates (automatic)
2. Review preview
3. Generate semantic headers (automatic)
4. Review headers, edit if needed
5. Reorder columns (automatic)
6. Review order, adjust manually if desired
7. Final export

Benefits:

  • Maintain control at each step
  • Catch issues early
  • Learn what AI does well/poorly for your use case
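
A rough sketch of that progressive flow, composing generateSemanticHeaders and reorderColumns from earlier sections with the illustrative removeEmptyColumns and applyColumnOrder helpers; the review callback stands in for whatever confirmation UI the tool provides:

async function optimizeProgressively(
  fields: string[],
  rows: string[][],
  intent: string,
  review: (step: string, fields: string[]) => Promise<boolean>
) {
  // Steps 1-2: drop empty columns (duplicate removal slots in the same way), then pause.
  ({ fields, rows } = removeEmptyColumns(fields, rows));
  if (!(await review('deduplicated', fields))) return { fields, rows };

  // Steps 3-4: generate semantic headers, apply them only if approved.
  const headers = await generateSemanticHeaders(fields, rows.slice(0, 5), intent);
  if (await review('renamed', headers)) fields = headers;

  // Steps 5-6: importance-based reordering, also reviewable afterwards.
  const order = await reorderColumns(fields, rows, intent);
  ({ fields, rows } = applyColumnOrder(fields, rows, order));
  await review('reordered', fields);

  // Step 7: the caller exports the result.
  return { fields, rows };
}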

5. Save Optimization Settings

For repeated scrapes:

1. Scrape site A with AI optimization
2. Save settings: {
    removeEmptyColumns: true,
    similarityThreshold: 0.85,
    autoReorder: true,
    intent: "product pricing analysis"
   }
3. Next time you scrape site A, reuse settings
4. Consistent, predictable results
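
A hedged sketch of what saving those settings per site could look like (the OptimizationSettings type, storage key scheme, and use of localStorage are assumptions, not the extension's actual API):

interface OptimizationSettings {
  removeEmptyColumns: boolean;
  similarityThreshold: number;
  autoReorder: boolean;
  intent?: string;
}

// Persist settings per hostname so the next scrape of the same site reuses them.
function saveSettings(hostname: string, settings: OptimizationSettings) {
  localStorage.setItem(`optimization:${hostname}`, JSON.stringify(settings));
}

function loadSettings(hostname: string): OptimizationSettings | null {
  const raw = localStorage.getItem(`optimization:${hostname}`);
  return raw ? (JSON.parse(raw) as OptimizationSettings) : null;
}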

Frequently Asked Questions

Can I undo AI optimizations?

Best practice: Keep raw data backup before optimization.

In-app: Some tools offer "Revert to original" before final export.

Recommendation: Export both raw and optimized versions.

How long does AI optimization take?

Typical timing:

  • Duplicate detection: < 1 second (algorithmic)
  • Semantic header generation: 2-3 seconds (LLM call)
  • Column reordering: 2-3 seconds (LLM call)
  • Total: ~5-7 seconds per dataset

Scales with:

  • Number of columns (more columns = slightly longer)
  • Sample data size (uses first 10 rows only)
  • LLM provider speed

What if AI removes a column I need?

Prevention:

  1. Review suggestions before applying
  2. Uncheck columns you want to keep

After removal:

  • Re-scrape with AI disabled
  • Or load raw backup data

Future feature: "Undo" button to restore removed columns.

Does it work offline?

Current: Requires an LLM API for semantic headers, column reordering, and useless-column detection

Fallback: If the LLM is unavailable:

  • Duplicate detection still works (algorithmic)
  • Empty column removal still works
  • Semantic headers fall back to cleaned-up original names
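
A minimal sketch of what "cleaned-up original names" could mean in that fallback, purely heuristic and hypothetical rather than the extension's exact behavior:

// Turn "div.product-card > span.title" or "col_1" into something readable without an LLM:
// keep the last selector segment, strip symbols and common tag names, then title-case.
function fallbackCleanHeader(original: string): string {
  const words = original
    .split('>').pop()!
    .replace(/[^a-zA-Z0-9]+/g, ' ')
    .trim()
    .split(/\s+/)
    .filter(w => w.length > 0 && !['div', 'span', 'a', 'td', 'data'].includes(w.toLowerCase()));
  if (words.length === 0) return original;
  return words.map(w => w.charAt(0).toUpperCase() + w.slice(1)).join(' ');
}

// fallbackCleanHeader('div.product-card > span.title') // "Title"
// fallbackCleanHeader('col_1')                         // "Col 1"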

Can I customize the similarity threshold?

Default: 85% similarity = duplicate

Customizable in settings:

  • 90% = stricter (fewer removals)
  • 80% = more aggressive (more removals)
  • 100% = only exact duplicates

Recommendation: Start with 85%, adjust if too aggressive/conservative.

Conclusion

AI-powered column optimization transforms messy scraped data into clean, analysis-ready datasets in seconds. By automatically detecting duplicates, generating semantic headers, reordering columns by importance, and identifying useless fields, it eliminates 20-30 minutes of manual data cleaning per scrape.

Key benefits:

  • ✅ Automatic duplicate detection (85% similarity threshold)
  • ✅ AI-generated semantic headers (LLM-powered)
  • ✅ Intelligent column reordering (importance-based)
  • ✅ Useless column identification (confidence scoring)
  • ✅ 50%+ reduction in column count (typical)
  • ✅ Human review and override (maintain control)

Best for:

  • Large scraped datasets (20+ columns)
  • Repeated scraping workflows
  • Business intelligence / analysis
  • Non-technical users

When to skip:

  • Small datasets (<10 columns)
  • You need all raw fields
  • Time-critical extraction (save 5 seconds)

Ready for cleaner data? Install the OnPiste Chrome extension and let AI optimize your scraped datasets automatically.

