Anyone who's worked with real-world data has stared at a spreadsheet thinking "this would take me four hours to clean up." Inconsistent capitalization. Mixed date formats. Phone numbers with random punctuation. Free-text categories where "United States," "USA," "US," and "America" all mean the same thing. AI is genuinely transformational here, when used carefully.
What AI cleans up well
- Standardizing categorical values (country names, status codes, product categories)
- Reformatting dates and phone numbers to consistent formats
- Splitting full names into first/last
- Combining or splitting columns based on rules
- Detecting obvious typos in names and emails
- Inferring missing data from context ("this row has city Tokyo, country is probably Japan")
- Generating regex you would have spent 30 minutes on (see the sketch after this list)
- Writing the Pandas / SQL code to do the cleanup at scale
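For example, here is the sort of snippet AI will generate for phone-number normalization. A minimal sketch, assuming a 'phone' column and a US-style target format; both are placeholders for illustration:

```python
import re

import pandas as pd

df = pd.DataFrame({"phone": ["(555) 123-4567", "555.123.4567", "15551234567", "bad"]})

def normalize_phone(raw: str) -> str | None:
    """Normalize to XXX-XXX-XXXX; return None rather than guess."""
    digits = re.sub(r"\D", "", str(raw))             # keep digits only
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]                          # drop US country code
    if len(digits) != 10:
        return None                                  # blank for review, not a guess
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"

df["phone_clean"] = df["phone"].map(normalize_phone)
```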
What AI does badly
- Anything requiring domain knowledge it doesn't have (your specific industry's product taxonomy)
- Detecting subtle data corruption (legitimate-looking but wrong)
- Cleanup that requires consulting external sources for ground truth
- High-precision deduplication ("is John Smith at Apple Inc the same person as J. Smith at Apple?")
- Anything where being wrong is unacceptable
Two workflows
For one-off cleanup of < 1000 rows: paste the data directly into Claude or GPT. Describe what you want. Get cleaned data back. Verify by spot-checking.
For repeated cleanup or > 1000 rows: ask AI to write the code (Pandas, dbt, SQL, Excel formula) to do the cleanup. Run the code yourself. AI doesn't reliably handle thousands of rows in chat — context gets truncated, results get inconsistent.
A practical example
You have a sheet of contacts with messy company names. "google," "Google Inc," "google.com," and "GOOGLE" all need to become "Google."
For 200 rows: paste, ask. "Standardize the company column. Use canonical company names. Match common aliases (e.g., 'google.com' → 'Google'). Output as CSV."
For 200,000 rows: ask AI to generate the cleanup code. "Write Python that takes a Pandas DataFrame with a 'company' column and standardizes values using a mapping I'll provide. The mapping is: [...]. For values not in the mapping, use string normalization rules: lowercase, strip punctuation, strip 'inc/ltd/llc' suffixes, then title-case."
The code approach is auditable, fast, and handles scale.
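A sketch of what that generated code might look like. The mapping here is a three-entry placeholder, and the fallback rules mirror the ones spelled out in the prompt:

```python
import re

import pandas as pd

# Placeholder mapping: in practice this comes from you, not the AI.
CANONICAL = {
    "google": "Google",
    "google inc": "Google",
    "google.com": "Google",
}

def standardize_company(raw: str) -> str:
    key = str(raw).strip().lower()
    if key in CANONICAL:
        return CANONICAL[key]
    # Fallback normalization, as described in the prompt:
    # lowercase, strip punctuation, strip corporate suffixes, title-case.
    key = re.sub(r"[^\w\s]", " ", key)
    key = re.sub(r"\b(inc|ltd|llc)\b", "", key)
    key = re.sub(r"\s+", " ", key).strip()
    return key.title()

df = pd.DataFrame({"company": ["google", "Google Inc", "google.com", "GOOGLE"]})
df["company"] = df["company"].map(standardize_company)
```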
Useful prompt patterns
This CSV has [problem]. Specific rules:
- [Rule 1]
- [Rule 2]
Do not: [common errors to avoid]
Output: cleaned CSV with the same columns and row count.
For any row you're unsure about, leave the value blank and add a 'review_flag=true' column.
The review_flag column is critical: it forces the AI to admit uncertainty rather than confidently invent values.
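If the output comes back as a file, routing the flagged rows to a human is a couple of lines. The file and column names below are assumptions matching the template above:

```python
import pandas as pd

cleaned = pd.read_csv("cleaned.csv")                   # hypothetical output file
# Rows the AI wasn't sure about go to a human, not into production.
mask = cleaned["review_flag"].fillna(False).astype(bool)
cleaned[mask].to_csv("needs_review.csv", index=False)
```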
Verification habits
Always verify before trusting cleaned data (a code sketch of these checks follows the list):
- Row count matches input (didn't accidentally drop rows)
- Spot-check 10 random rows for correctness
- Check totals/sums for monetary columns (cleanup shouldn't change them)
- Search for sentinel values ("unknown," "n/a," "") — did they get cleaned correctly?
- Diff against original; do the changes look right?
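A minimal sketch of these checks in Pandas, assuming `before` and `after` DataFrames that share an index, plus a hypothetical 'amount' column standing in for your monetary fields:

```python
import pandas as pd

def verify_cleanup(before: pd.DataFrame, after: pd.DataFrame,
                   money_col: str = "amount") -> pd.DataFrame:
    # Row count must survive the cleanup.
    assert len(before) == len(after), "row count changed"

    # Spot-check 10 random rows side by side.
    idx = after.sample(min(10, len(after)), random_state=0).index
    print(before.loc[idx].join(after.loc[idx], lsuffix="_before", rsuffix="_after"))

    # Monetary totals shouldn't move (allow float noise).
    if money_col in before.columns:
        assert abs(before[money_col].sum() - after[money_col].sum()) < 1e-6, "totals changed"

    # Sentinel values: count what's left so you can eyeball it.
    print("sentinels remaining:", after.isin(["unknown", "n/a", ""]).sum().sum())

    # Diff against the original: only cells that actually changed.
    return before.compare(after)
```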
For important data, run AI cleanup, then ask AI to verify its own work: "Compare this output to the input. List any rows where the cleanup changed the meaning rather than just the format." The same model often catches its own mistakes.
Privacy considerations
Don't paste sensitive data into ChatGPT or any consumer-tier AI:
- Customer PII (names, emails, addresses)
- Health records
- Financial transaction data
- Anything regulated
Instead, use enterprise tiers with data-privacy guarantees, or self-hosted models (Llama, Qwen) for any data that shouldn't leave your environment. For very sensitive data, the workflow shifts to "AI generates the code, you run it locally": no data ever goes to the AI.
When NOT to use AI for cleanup
When the data is already structured and you have a clear mapping, just write the code. AI overhead isn't worth it for a 5-line Pandas operation.
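For scale, that five-line version might be no more than this (file name, column, and mapping are all hypothetical):

```python
import pandas as pd

df = pd.read_csv("contacts.csv")                                  # hypothetical file
fixes = {"usa": "United States", "us": "United States", "america": "United States"}
df["country"] = df["country"].str.strip().str.lower().replace(fixes).str.title()
df.to_csv("contacts_clean.csv", index=False)
```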
When you don't know what "clean" means yet. AI will dutifully apply whatever rules you give. If you haven't decided the rules, AI cleanup encodes wrong decisions.
For data where ground truth requires external lookup. "Is this an active business?" "Is this address correct?" — AI guesses; you need an API.
For production data pipelines. AI helps draft the cleanup logic; production should run deterministic code, not AI calls per row.
Tools that integrate AI for data
- Claude with files — paste CSV, get cleaned output
- OpenAI Code Interpreter — runs Python on your uploaded data
- Numerous, Rows, Coda — spreadsheets with built-in AI cleanup formulas
- Hex / Mode — analytics platforms with AI-assisted notebooks
- dbt + AI — for production transformations, AI helps you write dbt models
The pre/post-AI productivity multiplier
For data cleanup specifically, AI is one of the highest-multiplier productivity tools in 2026:
- Tasks that took 4 hours of manual work: 20 minutes
- Regex you'd avoid because you didn't want to learn the syntax: now feasible
- Cleanup tasks you'd otherwise skip: now worth doing
The trap: confidence in AI output without verification. Bad cleaned data is worse than messy data, because nobody questions it later. Build verification into your workflow.
Decision tree
- One-off cleanup, < 1000 rows: paste into Claude/GPT, verify
- Repeating workflow or > 1000 rows: AI generates code, you run it
- Sensitive data: self-hosted model or AI-writes-code-you-run-locally
- Production pipeline: AI for development, deterministic code in production
- High-stakes data (finance, regulatory): AI as draft, human review mandatory
Next steps
- Build a cleanup rules document for your common data shapes
- Try AI for one cleanup task you've been avoiding; measure time saved
- Read about Pandas / SQL cleanup patterns for the long-term skill
- For sensitive data, look into privacy-preserving setups before paste-into-ChatGPT becomes a habit