Smart Column Optimization: Transform Messy Scraped Data into Clean, Usable Datasets
Keywords: data cleaning, column optimization, duplicate detection, semantic headers, AI data preparation
You scraped the data. Now you're staring at 47 columns, half of which are duplicates or empty.
Column names like: product_title_1, product_title_2, [object Object], undefined
Your options:
- Manually review every column (30 minutes)
- Write Python scripts to clean data (1 hour)
- Import into Excel and manually delete columns (20 minutes)
Or use AI-powered column optimization that does it automatically in 5 seconds.
Table of Contents
- The Data Quality Problem
- Automatic Duplicate Column Detection
- AI Semantic Header Generation
- Intelligent Column Reordering
- Identifying and Removing Useless Columns
- Real-World Optimization Examples
- Best Practices
The Data Quality Problem
Common Issues in Raw Scraped Data
Problem 1: Duplicate Columns
| Product Name | Product Title | item_name | name |
|--------------|---------------|-----------|------|
| Keyboard | Keyboard | Keyboard | Keyboard |
All four columns contain the same data.
Problem 2: Cryptic Headers
| col_1 | div.product-card > span.title | [data-field="price"] |
|-------|-------------------------------|----------------------|
| In Stock | Mouse | $29.99 |
Headers are CSS selectors, not human-readable.
Problem 3: Empty or Useless Columns
| Product | Price | placeholder_column | tracking_id | internal_guid |
|---------|-------|--------------------|-------------|---------------|
| Mouse | $29 | | 12847 | a8f3-92d1... |
Some columns are always empty or contain technical IDs users don't need.
Problem 4: Poor Column Order
| SKU | Internal ID | Category | Subcategory | Tags | Product Name | Price |
|------|-------------|----------|-------------|------|--------------|-------|
| 1029 | prod_88291 | Tech | Peripherals | USB | Keyboard | $59 |
Most important columns (Name, Price) are buried at the end.
The Manual Cleaning Burden
Traditional workflow:
- Export to CSV
- Open in Excel
- Manually review each column
- Delete duplicates
- Rename cryptic headers
- Reorder columns
- Delete empty columns
- Save cleaned version
- Time: 20-30 minutes per dataset
- Error-prone: easy to delete the wrong columns
- Tedious: repeated for every scrape
Automatic Duplicate Column Detection
Similarity Algorithm
Approach: Compare every column pair, calculate similarity
// Compare two columns cell by cell and return the fraction of matching values.
function calculateSimilarity(col1: string[], col2: string[]): number {
  const totalRows = Math.max(col1.length, col2.length);
  if (totalRows === 0) return 0; // avoid division by zero on empty columns

  let matches = 0;
  for (let i = 0; i < totalRows; i++) {
    // Normalize: treat null/undefined as empty, ignore case and surrounding whitespace
    const val1 = String(col1[i] ?? '').trim().toLowerCase();
    const val2 = String(col2[i] ?? '').trim().toLowerCase();
    if (val1 === val2) {
      matches++;
    }
  }
  return matches / totalRows;
}
Threshold: 85% similarity = duplicate
Example:
Column A: ["Apple", "Banana", "Cherry", "Date", "Elderberry"]
Column B: ["Apple", "Banana", "Cherry", "Date", "Fig"]
Similarity: 4/5 = 80% (keep both)
Column C: ["Apple", "Banana", "Cherry", "Date", "Elderberry"]
Column D: ["Apple", "Banana", "Cherry", "Date", "Elderberry"]
Similarity: 5/5 = 100% (remove D)
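Applied pairwise, the same check drives automatic removal. A minimal sketch (the function name and keep-first strategy are illustrative, not the extension's exact internals):

// Compare every column pair; when similarity meets the threshold,
// keep the first column and mark the later one for removal.
const SIMILARITY_THRESHOLD = 0.85;

function findDuplicateColumns(columns: string[][]): Set<number> {
  const duplicates = new Set<number>();
  for (let i = 0; i < columns.length; i++) {
    if (duplicates.has(i)) continue; // already slated for removal
    for (let j = i + 1; j < columns.length; j++) {
      if (calculateSimilarity(columns[i], columns[j]) >= SIMILARITY_THRESHOLD) {
        duplicates.add(j);
      }
    }
  }
  return duplicates;
}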
Empty Column Detection
Remove if:
- All cells are empty strings
- All cells are null/undefined
- All cells contain only whitespace
// A column is "empty" when every cell is null, undefined, or whitespace-only.
function isColumnEmpty(column: string[]): boolean {
  return column.every(cell => {
    const str = String(cell ?? '').trim();
    return str.length === 0;
  });
}
Deduplication Benefits
Before:
47 columns, 500 KB file size
After:
23 columns (-51%), 280 KB file size (-44%)
Results:
- Smaller files
- Faster loading
- Easier analysis
- Less visual clutter
AI Semantic Header Generation
The Problem with Auto-Generated Headers
Scrapers often generate:
- CSS selectors: div.product > span.name
- XPath expressions: /html/body/div[2]/table/tr/td[1]
- Generic names: column_1, column_2, column_3
- Technical IDs: data-field-983
Users need:
- Human-readable names: Product Name, Price, Rating
LLM-Powered Header Generation
Process:
import OpenAI from 'openai';

const llm = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function generateSemanticHeaders(
  fields: string[],
  sampleData: string[][],
  userIntent?: string
): Promise<string[]> {
  const prompt = `
Original headers: ${JSON.stringify(fields)}
Sample data (first 5 rows): ${JSON.stringify(sampleData)}
${userIntent ? `User intent: ${userIntent}` : ''}
Generate clear, semantic column headers.
Rules:
- Use Title Case
- Be concise but descriptive
- Remove technical prefixes
- Make headers business-friendly
Return ONLY a JSON array of header strings, one per original header.
`;

  const response = await llm.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [
      { role: 'system', content: 'You are a data analyst expert.' },
      { role: 'user', content: prompt }
    ],
    temperature: 0.1 // low temperature keeps naming consistent
  });

  // content is string | null in the SDK typings; fall back to the original headers
  const content = response.choices[0].message.content;
  return content ? JSON.parse(content) : fields;
}
Example transformation:
Before:
["div.title", "span.price", "div[data-rating]", "a.link"]
AI sees sample data:
[
["Wireless Mouse", "$29.99", "4.5", "https://..."],
["Keyboard", "$59.99", "4.7", "https://..."]
]
After:
["Product Name", "Price", "Customer Rating", "Product URL"]
Context-Aware Naming
User intent: "I want to analyze product pricing"
AI prioritizes:
- "Price" instead of "Cost" or "MSRP"
- "Product Name" instead of "Title" or "Item"
- "Discount" if applicable
User intent: "I need customer contact information"
AI prioritizes:
- "Email Address" instead of "Contact"
- "Phone Number" instead of "Tel"
- "Full Name" instead of "Name"
Intelligent Column Reordering
Importance-Based Ordering
AI analyzes:
- User intent: "I want product pricing data"
- Column names: Which are most relevant to intent?
- Data types: Unique identifiers vs descriptive fields
- Common patterns: What users typically care about first
Reordering logic:
async function reorderColumns(
  fields: string[],
  data: string[][],
  intent?: string
): Promise<number[]> {
  const prompt = `
Fields: ${JSON.stringify(fields)}
Sample data: ${JSON.stringify(data.slice(0, 5))}
User intent: ${intent || 'general data extraction'}
Reorder columns by importance.
Put most valuable columns first, technical/metadata columns last.
Return column order as a JSON array of indices, e.g. [2, 0, 5, 1, 3, 4]
`;

  // Same call shape as generateSemanticHeaders above
  const response = await llm.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [{ role: 'user', content: prompt }],
    temperature: 0.1
  });

  // Fall back to the identity order if the model returns nothing
  const content = response.choices[0].message.content;
  return content ? JSON.parse(content) : fields.map((_, i) => i);
}
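The returned indices are then applied to both headers and rows. A minimal usage sketch (variable names assumed):

// Apply the index order to the header row and every data row.
const order = await reorderColumns(fields, rows, 'product analysis');
const reorderedFields = order.map(i => fields[i]);
const reorderedRows = rows.map(row => order.map(i => row[i]));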
Example:
Original order:
[SKU, Internal_ID, Vendor_Code, Product_Name, Price, Stock_Status, Created_At]
AI reordered (intent: "product analysis"):
[Product_Name, Price, Stock_Status, SKU, Vendor_Code, Internal_ID, Created_At]
Benefits:
- Most important data visible first
- Reduces horizontal scrolling
- Faster data comprehension
- Better for reports/dashboards
Identifying and Removing Useless Columns
AI-Powered Useless Column Detection
Criteria for "useless":
- Mostly empty (>90% empty cells)
| Product | Price | Promo_Code | Discount |
|---------|-------|------------|----------|
| Mouse | $29 | | |
| Keyboard| $59 | | |
Promo_Code and Discount are useless
- Constant values (no variation)
| Product | Category | Status | Active |
|---------|----------|---------|--------|
| Mouse | Tech | Public | true |
| Keyboard| Tech | Public | true |
Status and Active have no variation
- Technical metadata users don't need
| Product | created_timestamp | last_modified | db_row_id |
|---------|-------------------|---------------|-----------|
| Mouse | 1634567890 | 1634567899 | 10283 |
Timestamps and IDs rarely needed for analysis
- Duplicates (covered earlier)
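The first two criteria are cheap to check without an LLM. A minimal sketch (helper names are illustrative, not the extension's internals):

// Fraction of cells that are empty after trimming whitespace.
// emptyRatio(column) > 0.9 flags a "mostly empty" column.
function emptyRatio(column: string[]): number {
  if (column.length === 0) return 1;
  const empty = column.filter(cell => String(cell ?? '').trim() === '').length;
  return empty / column.length;
}

// True when all non-empty cells hold the same value ("no variation").
function isConstant(column: string[]): boolean {
  const values = new Set(
    column.map(cell => String(cell ?? '').trim()).filter(v => v !== '')
  );
  return values.size <= 1;
}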
Confidence Scoring
AI provides:
{
"Promo_Code": {
"reason": "99% of cells are empty",
"confidence": 0.95
},
"db_row_id": {
"reason": "Internal database ID not useful for analysis",
"confidence": 0.85
},
"last_modified": {
"reason": "Technical timestamp, not requested by user",
"confidence": 0.78
}
}
User control:
- Auto-remove: Confidence > 0.85
- Suggest removal: Confidence 0.70-0.85
- Keep: Confidence < 0.70
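In code, those bands reduce to a small decision rule. A sketch, assuming the thresholds above:

type Action = 'auto-remove' | 'suggest-removal' | 'keep';

// Map an AI confidence score to the action bands listed above.
function triage(confidence: number): Action {
  if (confidence > 0.85) return 'auto-remove';
  if (confidence >= 0.7) return 'suggest-removal';
  return 'keep';
}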
User Confirmation
Best UX:
AI identified 5 potentially useless columns:
☑ Promo_Code (99% empty)
☑ db_row_id (internal ID)
☐ last_modified (timestamp) ← User unchecks
☑ tracking_pixel (always empty)
☑ placeholder_column (technical field)
[Remove Selected] [Keep All]
User reviews suggestions before deletion.
Real-World Optimization Examples
Example 1: E-commerce Product Scrape
Raw scraped data (12 columns):
col_1, div.product-title, span.price, div.price, [data-rating], stars,
product_url, link_href, image_src, img_url, stock_status, availability
After AI optimization (6 columns):
Product Name, Price, Customer Rating, Product URL, Image, Availability
Changes:
- ✅ Removed duplicates: col_1 ≈ div.product-title, span.price ≈ div.price, stars ≈ [data-rating], image_src ≈ img_url, link_href ≈ product_url
- ✅ Renamed to semantic headers
- ✅ Reordered by importance
- ✅ Result: 50% fewer columns, 100% more usable
Example 2: Research Paper Data Table
Raw scraped data (18 columns):
table_row_1, table_cell_1, table_cell_2, ..., table_cell_15,
footnote_ref, source_link, page_number
After AI optimization (8 columns):
Year, Country, Population (millions), GDP (billions), Growth Rate (%),
Unemployment (%), Inflation (%), Source
Changes:
- ✅ AI inferred column meanings from data patterns
- ✅ Generated descriptive headers with units
- ✅ Removed technical table structure columns
- ✅ Kept source attribution
Example 3: Job Listings Scrape
Raw scraped data (23 columns):
job_title, company, location, job_title_2, company_name, office_location,
salary, compensation, pay_range, posted_date, post_date, date_published,
description, job_desc, full_description, apply_url, application_link,
logo_url, company_logo, job_type, employment_type, tracking_id, internal_id
After AI optimization (9 columns):
Job Title, Company, Location, Salary Range, Posted Date, Job Type,
Description, Apply URL, Company Logo
Changes:
- ✅ Collapsed duplicates: job_title/job_title_2, salary variants, date variants
- ✅ Removed tracking IDs
- ✅ Kept one description field (full_description)
- ✅ Clear, business-friendly headers
Best Practices
1. Review AI Suggestions Before Applying
Always preview:
- Columns to be removed
- New header names
- Reordered column positions
Why: AI is 95% accurate, not 100%. You might need specific fields the AI thinks are useless.
2. Provide Intent When Possible
Generic scrape:
AI optimization with no intent provided → generic cleanup
With intent:
Intent: "I need pricing data for competitor analysis"
→ AI prioritizes pricing, discounts, product names
→ Removes technical IDs, timestamps
→ Better results
3. Keep One Sample of Original Data
Workflow:
- Extract raw data
- Export raw CSV (backup)
- Apply AI optimization
- Export optimized CSV (working version)
Safety net: If optimization removes something you need, refer to the raw backup.
4. Use Progressive Optimization
Step-by-step approach:
1. Remove obvious duplicates (automatic)
2. Review preview
3. Generate semantic headers (automatic)
4. Review headers, edit if needed
5. Reorder columns (automatic)
6. Review order, adjust manually if desired
7. Final export
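Sketched in code, the automatic steps chain together like this (a hypothetical composition of the helpers from earlier sections; the review pauses happen between calls):

// Hypothetical end-to-end pass using the sketches defined above.
async function optimizeDataset(fields: string[], rows: string[][], intent?: string) {
  // Column-major view: one array of cell values per column
  const columns = fields.map((_, c) => rows.map(row => row[c]));

  // Steps 1-2: drop duplicate and empty columns
  const duplicates = findDuplicateColumns(columns);
  const keep = fields
    .map((_, i) => i)
    .filter(i => !duplicates.has(i) && !isColumnEmpty(columns[i]));
  let newFields = keep.map(i => fields[i]);
  const newRows = rows.map(row => keep.map(i => row[i]));

  // Steps 3-4: semantic headers
  newFields = await generateSemanticHeaders(newFields, newRows.slice(0, 5), intent);

  // Steps 5-6: reorder by importance
  const order = await reorderColumns(newFields, newRows, intent);
  return {
    fields: order.map(i => newFields[i]),
    rows: newRows.map(row => order.map(i => row[i]))
  };
}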
Benefits:
- Maintain control at each step
- Catch issues early
- Learn what AI does well/poorly for your use case
5. Save Optimization Settings
For repeated scrapes:
1. Scrape site A with AI optimization
2. Save settings: {
removeEmptyColumns: true,
similarityThreshold: 0.85,
autoReorder: true,
intent: "product pricing analysis"
}
3. Next time you scrape site A, reuse settings
4. Consistent, predictable results
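A hypothetical TypeScript shape for those persisted settings:

// Hypothetical shape for saved optimization settings.
interface OptimizationSettings {
  removeEmptyColumns: boolean;
  similarityThreshold: number; // 0.85 is the default duplicate threshold
  autoReorder: boolean;
  intent?: string; // optional user intent forwarded to the LLM steps
}

const siteASettings: OptimizationSettings = {
  removeEmptyColumns: true,
  similarityThreshold: 0.85,
  autoReorder: true,
  intent: 'product pricing analysis'
};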
Frequently Asked Questions
Can I undo AI optimizations?
Best practice: Keep raw data backup before optimization.
In-app: Some tools offer "Revert to original" before final export.
Recommendation: Export both raw and optimized versions.
How long does AI optimization take?
Typical timing:
- Duplicate detection: < 1 second (algorithmic)
- Semantic header generation: 2-3 seconds (LLM call)
- Column reordering: 2-3 seconds (LLM call)
- Total: ~5-7 seconds per dataset
Scales with:
- Number of columns (more columns = slightly longer)
- Sample data size (only the first 5 rows are sent, per the prompts above)
- LLM provider speed
What if AI removes a column I need?
Prevention:
- Review suggestions before applying
- Uncheck columns you want to keep
After removal:
- Re-scrape with AI disabled
- Or load raw backup data
Future feature: "Undo" button to restore removed columns.
Does it work offline?
Current: requires an LLM API for semantic headers, reordering, and useless-column detection
Fallback if the LLM is unavailable:
- Duplicate detection still works (algorithmic)
- Empty column removal still works
- Semantic headers fall back to cleaned-up original names
Can I customize the similarity threshold?
Default: 85% similarity = duplicate
Customizable in settings:
- 90% = stricter (fewer removals)
- 80% = more aggressive (more removals)
- 100% = only exact duplicates
Recommendation: Start with 85%, adjust if too aggressive/conservative.
Conclusion
AI-powered column optimization transforms messy scraped data into clean, analysis-ready datasets in seconds. By automatically detecting duplicates, generating semantic headers, reordering columns by importance, and identifying useless fields, it eliminates 20-30 minutes of manual data cleaning per scrape.
Key benefits:
- ✅ Automatic duplicate detection (85% similarity threshold)
- ✅ AI-generated semantic headers (LLM-powered)
- ✅ Intelligent column reordering (importance-based)
- ✅ Useless column identification (confidence scoring)
- ✅ 50%+ reduction in column count (typical)
- ✅ Human review and override (maintain control)
Best for:
- Large scraped datasets (20+ columns)
- Repeated scraping workflows
- Business intelligence / analysis
- Non-technical users
When to skip:
- Small datasets (<10 columns)
- You need all raw fields
- Time-critical extraction (skipping saves ~5 seconds)
Ready for cleaner data? Install the OnPiste Chrome extension and let AI optimize your scraped datasets automatically.
