Anyone who's worked with real-world data has stared at a spreadsheet thinking "this would take me four hours to clean up." Inconsistent capitalization. Mixed date formats. Phone numbers with random punctuation. Free-text categories where "United States," "USA," "US," and "America" all mean the same thing. AI is genuinely transformational here, when used carefully.
What AI cleans up well
- Standardizing categorical values (country names, status codes, product categories)
- Reformatting dates and phone numbers to consistent formats
- Splitting full names into first/last
- Combining or splitting columns based on rules
- Detecting obvious typos in names and emails
- Inferring missing data from context ("this row has city Tokyo, country is probably Japan")
- Generating regex you would have spent 30 minutes on (see the sketch after this list)
- Writing the Pandas / SQL code to do the cleanup at scale
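For example, here is the sort of snippet AI will generate for phone-number normalization. A minimal sketch, assuming a 'phone' column and a US-style target format; both are placeholders for illustration:

```python
import re

import pandas as pd

df = pd.DataFrame({"phone": ["(555) 123-4567", "555.123.4567", "15551234567", "bad"]})

def normalize_phone(raw: str) -> str | None:
    """Normalize to XXX-XXX-XXXX; return None rather than guess."""
    digits = re.sub(r"\D", "", str(raw))             # keep digits only
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]                          # drop US country code
    if len(digits) != 10:
        return None                                  # blank for review, not a guess
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"

df["phone_clean"] = df["phone"].map(normalize_phone)
```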
What AI does badly
- Anything requiring domain knowledge it doesn't have (your specific industry's product taxonomy)
- Detecting subtle data corruption (legitimate-looking but wrong)
- Cleanup that requires consulting external sources for ground truth
- High-precision deduplication ("is John Smith at Apple Inc the same person as J. Smith at Apple?")
- Anything where being wrong is unacceptable
Two workflows
For one-off cleanup of < 1000 rows: paste the data directly into Claude or GPT. Describe what you want. Get cleaned data back. Verify by spot-checking.
For repeated cleanup or > 1000 rows: ask AI to write the code (Pandas, dbt, SQL, Excel formula) to do the cleanup. Run the code yourself. AI doesn't reliably handle thousands of rows in chat — context gets truncated, results get inconsistent.
A practical example
You have a sheet of contacts with messy company names. "google," "Google Inc," "google.com," and "GOOGLE" all need to become "Google."
For 200 rows: paste, ask. "Standardize the company column. Use canonical company names. Match common aliases (e.g., 'google.com' → 'Google'). Output as CSV."
For 200,000 rows: ask AI to generate the cleanup code. "Write Python that takes a Pandas DataFrame with a 'company' column and standardizes values using a mapping I'll provide. The mapping is: [...]. For values not in the mapping, use string normalization rules: lowercase, strip punctuation, strip 'inc/ltd/llc' suffixes, then title-case."
The code approach is auditable, fast, and handles scale.
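A sketch of what that generated code might look like. The mapping here is a three-entry placeholder, and the fallback rules mirror the ones spelled out in the prompt:

```python
import re

import pandas as pd

# Placeholder mapping: in practice this comes from you, not the AI.
CANONICAL = {
    "google": "Google",
    "google inc": "Google",
    "google.com": "Google",
}

def standardize_company(raw: str) -> str:
    key = str(raw).strip().lower()
    if key in CANONICAL:
        return CANONICAL[key]
    # Fallback normalization, as described in the prompt:
    # lowercase, strip punctuation, strip corporate suffixes, title-case.
    key = re.sub(r"[^\w\s]", " ", key)
    key = re.sub(r"\b(inc|ltd|llc)\b", "", key)
    key = re.sub(r"\s+", " ", key).strip()
    return key.title()

df = pd.DataFrame({"company": ["google", "Google Inc", "google.com", "GOOGLE"]})
df["company"] = df["company"].map(standardize_company)
```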
Useful prompt patterns
This CSV has [problem]. Specific rules:
- [Rule 1]
- [Rule 2]
Do not: [common errors to avoid]
Output: cleaned CSV with the same columns and row count.
For any row you're unsure about, leave the value blank and add a 'review_flag=true' column.
The review_flag column is critical: it forces the AI to admit uncertainty rather than confidently invent values.
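If the output comes back as a file, routing the flagged rows to a human is a couple of lines. The file and column names below are assumptions matching the template above:

```python
import pandas as pd

cleaned = pd.read_csv("cleaned.csv")                   # hypothetical output file
# Rows the AI wasn't sure about go to a human, not into production.
mask = cleaned["review_flag"].fillna(False).astype(bool)
cleaned[mask].to_csv("needs_review.csv", index=False)
```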
Verification habits
Always verify before trusting cleaned data (a code sketch of these checks follows the list):
- Row count matches input (didn't accidentally drop rows)
- Spot-check 10 random rows for correctness
- Check totals/sums for monetary columns (cleanup shouldn't change them)
- Search for sentinel values ("unknown," "n/a," "") — did they get cleaned correctly?
- Diff against original; do the changes look right?
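A minimal sketch of these checks in Pandas, assuming `before` and `after` DataFrames that share an index, plus a hypothetical 'amount' column standing in for your monetary fields:

```python
import pandas as pd

def verify_cleanup(before: pd.DataFrame, after: pd.DataFrame,
                   money_col: str = "amount") -> pd.DataFrame:
    # Row count must survive the cleanup.
    assert len(before) == len(after), "row count changed"

    # Spot-check 10 random rows side by side.
    idx = after.sample(min(10, len(after)), random_state=0).index
    print(before.loc[idx].join(after.loc[idx], lsuffix="_before", rsuffix="_after"))

    # Monetary totals shouldn't move (allow float noise).
    if money_col in before.columns:
        assert abs(before[money_col].sum() - after[money_col].sum()) < 1e-6, "totals changed"

    # Sentinel values: count what's left so you can eyeball it.
    print("sentinels remaining:", after.isin(["unknown", "n/a", ""]).sum().sum())

    # Diff against the original: only cells that actually changed.
    return before.compare(after)
```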
For important data, run AI cleanup, then ask AI to verify its own work: "Compare this output to the input. List any rows where the cleanup changed the meaning rather than just the format." The same model often catches its own mistakes.
Privacy considerations
Don't paste sensitive data into ChatGPT or any consumer-tier AI:
- Customer PII (names, emails, addresses)
- Health records
- Financial transaction data
- Anything regulated
Instead, use enterprise tiers with data-privacy guarantees, or self-hosted models (Llama, Qwen) for any data that shouldn't leave your environment. For very sensitive data, the workflow shifts to "AI generates the code, you run it locally": no data ever goes to the AI.
When NOT to use AI for cleanup
When the data is already structured and you have a clear mapping, just write the code. AI overhead isn't worth it for a 5-line Pandas operation.
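For scale, that five-line version might be no more than this (file name, column, and mapping are all hypothetical):

```python
import pandas as pd

df = pd.read_csv("contacts.csv")                                  # hypothetical file
fixes = {"usa": "United States", "us": "United States", "america": "United States"}
df["country"] = df["country"].str.strip().str.lower().replace(fixes).str.title()
df.to_csv("contacts_clean.csv", index=False)
```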
When you don't know what "clean" means yet. AI will dutifully apply whatever rules you give. If you haven't decided the rules, AI cleanup encodes wrong decisions.
For data where ground truth requires external lookup. "Is this an active business?" "Is this address correct?" — AI guesses; you need an API.
For production data pipelines. AI helps draft the cleanup logic; production should run deterministic code, not AI calls per row.
Tools that integrate AI for data
- Claude with files — paste CSV, get cleaned output
- OpenAI Code Interpreter — runs Python on your uploaded data
- Numerous, Rows, Coda — spreadsheets with built-in AI cleanup formulas
- Hex / Mode — analytics platforms with AI-assisted notebooks
- dbt + AI — for production transformations, AI helps you write dbt models
The pre/post-AI productivity multiplier
For data cleanup specifically, AI is one of the highest-multiplier productivity tools in 2026:
- Tasks that took 4 hours of manual work: 20 minutes
- Regex you'd avoid because you didn't want to learn the syntax: now feasible
- Cleanup tasks you'd otherwise skip: now worth doing
The trap: confidence in AI output without verification. Bad cleaned data is worse than messy data, because nobody questions it later. Build verification into your workflow.
Decision tree
- One-off cleanup, < 1000 rows: paste into Claude/GPT, verify
- Repeating workflow or > 1000 rows: AI generates code, you run it
- Sensitive data: self-hosted model or AI-writes-code-you-run-locally
- Production pipeline: AI for development, deterministic code in production
- High-stakes data (finance, regulatory): AI as draft, human review mandatory
Next steps
- Build a cleanup rules document for your common data shapes
- Try AI for one cleanup task you've been avoiding; measure time saved
- Read about Pandas / SQL cleanup patterns for the long-term skill
- For sensitive data, look into privacy-preserving setups before paste-into-ChatGPT becomes a habit