A practical, step-by-step guide to cleaning survey data: removing speeders, straightliners, duplicates, and bad responses before you analyze.
Every survey dataset arrives a little dirty. Some respondents rush through without reading, some answer the same option down every row, some are duplicates, and some give logically impossible combinations. If you analyze that raw data, you risk drawing confident conclusions from garbage. Data cleaning is the unglamorous but essential step between collection and analysis. This guide walks through a practical cleaning workflow you can apply to almost any survey.
Why cleaning matters
The cost of bad data is invisible until it bites. A handful of careless or fraudulent responses can shift a mean, flip a close comparison, or invent a trend that does not exist. Because survey insights often feed real decisions about product, marketing, or strategy, the integrity of the underlying responses matters as much as the sophistication of the analysis. Cleaning is risk management: it protects you from acting on noise.
The goal of cleaning is not to delete responses you dislike. It is to remove responses that fail objective quality criteria you set in advance. Defining those criteria before you look at the results keeps you honest and removes the temptation to massage data toward a preferred conclusion.
Removing speeders
Speeders are respondents who complete the survey faster than anyone could while actually reading the questions. The standard approach is to measure completion time and flag responses below a sensible threshold. A common rule of thumb is to compute the median completion time, then treat responses completed in less than roughly a third to a half of that median as suspect. Someone who finishes a ten-minute survey in ninety seconds almost certainly clicked without reading.
Capture timing data automatically at the platform level rather than trying to reconstruct it later. Be careful not to over-trim: genuinely fast but attentive respondents exist too, so combine the speeding flag with other quality signals before removing anyone. Use speeding as one vote in a multi-criteria decision, not a single guillotine.
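The median rule of thumb can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the function name, the 0.4 fraction (picked from within the one-third to one-half range), and the sample timings are all assumptions.

```python
from statistics import median

def flag_speeders(durations, fraction=0.4):
    """Return one boolean per response: True if it finished suspiciously fast.

    durations: completion times in seconds.
    fraction: share of the median below which a response is flagged
              (an assumed value; tune it for your own survey).
    """
    cutoff = median(durations) * fraction
    return [d < cutoff for d in durations]

# Hypothetical completion times in seconds for six respondents.
durations = [600, 540, 90, 660, 580, 620]
speeder_flags = flag_speeders(durations)
print(speeder_flags)
```

The result is a flag per response rather than a filtered list, so it can be combined with other quality signals before anyone is removed.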
Catching straightliners
Straightlining is when a respondent selects the same answer for every item in a grid or matrix, for example choosing "strongly agree" all the way down a long battery of statements. It is a telltale sign of disengagement. To detect it, look for zero or near-zero variance across a set of items that should naturally produce some variation. If a respondent gave an identical answer to twenty statements, including reverse-worded ones, they almost certainly were not reading.
Reverse-worded items are a useful design trick here. If you include a statement phrased in the opposite direction and a respondent agrees with both a positive and its negation, that contradiction exposes inattentive answering. Building a few such items into your matrix questions makes straightliners far easier to catch.
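Both checks described above are simple to express in code. The sketch below assumes answers are stored as numeric ratings on a 1 to 5 agreement scale before any reverse-scoring; the function names, the minimum grid length, and the "agree" cutoff of 4 are hypothetical choices.

```python
def is_straightliner(answers, min_items=5):
    """True if every answered item in a grid received the identical rating.

    answers: raw (not reverse-scored) ratings for one grid; None means skipped.
    min_items: assumed minimum grid length, since very short grids can be
               legitimately uniform.
    """
    answered = [a for a in answers if a is not None]
    return len(answered) >= min_items and len(set(answered)) == 1

def contradicts_reversed_pair(positive, negated, agree_from=4):
    """True if a respondent agrees with a statement and with its negation."""
    return positive >= agree_from and negated >= agree_from

# Hypothetical grids on a 1-5 scale.
grid_a = [5, 5, 5, 5, 5, 5, 5, 5]   # same answer all the way down
grid_b = [4, 5, 3, 4, 2, 4, 5, 3]   # natural variation
print(is_straightliner(grid_a), is_straightliner(grid_b))
print(contradicts_reversed_pair(5, 5))
```

Checking for a single distinct value is the zero-variance case; a near-zero-variance rule would compare the spread of the ratings against a small threshold instead.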
Attention checks and trap questions
Attention checks are questions inserted specifically to verify that respondents are reading. The classic form is an instructed-response item such as "To show you are paying attention, please select 'Somewhat disagree' for this question." Respondents who answer anything else have failed the check. Use these sparingly, because too many can annoy honest participants and even introduce their own bias, but one or two in a long survey is a reasonable safeguard.
Pair attention checks with logical consistency checks. If someone says they have never used your product and later rates its newest feature, those answers conflict and the response deserves scrutiny. Designing these checks is easier when you start from a tested instrument; our market research survey template gives you a clean structure to add quality controls to.
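As a rough sketch, an instructed-response check and a logical consistency check can both be written as rules that return named flags. The field names (attention_check, ever_used_product, new_feature_rating) and the answer labels are hypothetical and would need to match your own export.

```python
# The instructed answer from the example attention-check item.
ATTENTION_KEY = "Somewhat disagree"

def quality_flags(response):
    """Return a list of named quality problems found in one response (a dict)."""
    flags = []
    # Instructed-response item: any other answer fails the check.
    if response.get("attention_check") != ATTENTION_KEY:
        flags.append("failed_attention_check")
    # Consistency: rating a feature after claiming never to have used the product.
    never_used = response.get("ever_used_product") == "No"
    rated_feature = response.get("new_feature_rating") is not None
    if never_used and rated_feature:
        flags.append("rated_feature_without_use")
    return flags

attentive = {"attention_check": "Somewhat disagree",
             "ever_used_product": "Yes", "new_feature_rating": 4}
careless = {"attention_check": "Strongly agree",
            "ever_used_product": "No", "new_feature_rating": 5}
print(quality_flags(attentive), quality_flags(careless))
```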
Duplicates and bots
Duplicate responses arise when the same person submits more than once, whether by accident, by refreshing, or to game an incentive. Detect them using identifiers you can ethically collect, such as a respondent ID, an email when appropriate, or platform-level deduplication. Be cautious with technical signals like IP addresses, since shared networks can produce false positives, but a cluster of identical responses from one source warrants a closer look.
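If your platform does not deduplicate for you, a respondent ID makes it straightforward. The sketch below keeps the earliest submission per respondent, which is an assumed policy (keeping the most complete submission is another reasonable choice); the field names and sample records are hypothetical.

```python
def drop_duplicates(responses, key="respondent_id"):
    """Keep the earliest submission per respondent; return (kept, dropped)."""
    seen = set()
    kept, dropped = [], []
    # ISO 8601 timestamps in the same format sort correctly as plain strings.
    for r in sorted(responses, key=lambda r: r["submitted_at"]):
        if r[key] in seen:
            dropped.append(r)
        else:
            seen.add(r[key])
            kept.append(r)
    return kept, dropped

responses = [
    {"respondent_id": "r1", "submitted_at": "2024-05-01T10:00:00", "score": 4},
    {"respondent_id": "r2", "submitted_at": "2024-05-01T10:05:00", "score": 3},
    {"respondent_id": "r1", "submitted_at": "2024-05-01T10:07:00", "score": 5},
]
kept, dropped = drop_duplicates(responses)
print(len(kept), len(dropped))
```

Returning the dropped records, rather than discarding them silently, keeps them available for review.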
Automated bot submissions are a growing concern for open or incentivized surveys. Open-ended text is often the best bot detector: nonsensical, copy-pasted, or off-topic free-text answers reveal non-human or fraudulent responses that closed questions hide. Reading a sample of verbatims is a quick, high-value cleaning step.
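Code cannot replace reading the verbatims, but it can shortlist which ones to read first. The sketch below surfaces two crude signals: identical text submitted more than once, and answers containing no letters at all. The function name and the 15-character minimum for the copy-paste check are assumptions, and anything it returns is a candidate for manual review, not an automatic removal.

```python
from collections import Counter

def suspicious_verbatims(texts, min_repeat_chars=15):
    """Return indices of free-text answers worth a manual read."""
    # Lowercase and collapse whitespace so trivial variations still match.
    cleaned = [" ".join(t.lower().split()) for t in texts]
    counts = Counter(cleaned)
    suspects = []
    for i, t in enumerate(cleaned):
        if not t:
            continue  # a skipped question is a missing-data issue, not a bot signal
        # Short answers like "good" repeat naturally, so only longer repeats count.
        copy_pasted = len(t) >= min_repeat_chars and counts[t] > 1
        no_letters = not any(ch.isalpha() for ch in t)
        if copy_pasted or no_letters:
            suspects.append(i)
    return suspects

texts = [
    "Great onboarding, but pricing is confusing.",
    "good product good product",
    "Good product  good product",
    "12345",
    "",
]
print(suspicious_verbatims(texts))
```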
Handling missing and inconsistent data
Not every imperfect response should be deleted. Some respondents simply skip optional questions, leaving gaps you must decide how to treat. The simplest approach is to exclude incomplete responses from analyses that need those specific fields while keeping them for analyses that do not, which preserves as much usable data as possible. More advanced approaches impute missing values, but imputation introduces assumptions and should be used cautiously and transparently.
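The "exclude only where the field is needed" approach can be sketched as a per-question calculation that skips missing values and reports how many responses it used. The function and field names are hypothetical.

```python
from statistics import mean

def mean_of(responses, field):
    """Average one field over the responses that answered it.

    Returns (mean, n) so the base size is always reported with the figure.
    """
    values = [r[field] for r in responses if r.get(field) is not None]
    return mean(values), len(values)

responses = [
    {"satisfaction": 4, "nps": 9},
    {"satisfaction": 5, "nps": None},   # skipped the optional NPS question
    {"satisfaction": 3},                # NPS field absent entirely
]
print(mean_of(responses, "satisfaction"))
print(mean_of(responses, "nps"))
```

All three responses contribute to the satisfaction figure, while only one contributes to the NPS figure, which is why the base size should travel with each number.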
Inconsistent or out-of-range values, like an age of 200 or a date in the future, should be corrected where the intended value is obvious and flagged or removed where it is not. Standardize formats too, so that "USA," "U.S.," and "United States" are treated as the same category before you tabulate. This kind of normalization prevents a single real group from being split across several spelling variants.
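A minimal sketch of both steps follows: a range check for ages and an alias table for country names. The bounds of 18 and 100 and the alias list are assumptions that depend on your survey's audience and the variants that actually appear in your data.

```python
# Hypothetical alias table; extend it with the variants found in your data.
COUNTRY_ALIASES = {
    "usa": "United States",
    "u.s.": "United States",
    "united states": "United States",
}

def normalize_country(value):
    """Map known spelling variants to one category; leave others unchanged."""
    trimmed = value.strip()
    return COUNTRY_ALIASES.get(trimmed.lower(), trimmed)

def age_in_range(age, low=18, high=100):
    """True if the age is present and plausible (bounds are assumed)."""
    return age is not None and low <= age <= high

print([normalize_country(c) for c in ["USA", "U.S.", " United States ", "Canada"]])
print(age_in_range(34), age_in_range(200))
```

Values that fail the range check are better flagged for review than silently overwritten, since the intended value is not always obvious.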
Documenting your decisions
Cleaning involves judgment, and judgment must be auditable. Keep a record of every rule you applied, how many responses each rule removed, and how many remained. This cleaning log lets others reproduce your dataset, defends your analysis when someone questions a result, and helps you refine your criteria for future studies. Report your final usable sample size alongside the original collected count so readers understand the basis of your numbers. Teams that run frequent studies can codify these rules once and reuse them across projects using templates for research teams, and pair them with a standard market research survey so cleaning is consistent every wave.
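One way to get a cleaning log for free is to apply rules through a small helper that records counts as it goes. This is an illustrative sketch: the helper, the rule names, and the sample fields (is_duplicate, attention_ok) are hypothetical.

```python
def apply_rules(responses, rules):
    """Apply (name, keep_fn) rules in order; return (remaining, log).

    Each log entry records how many responses the rule removed
    and how many remained afterward.
    """
    remaining = list(responses)
    log = [{"rule": "collected", "removed": 0, "remaining": len(remaining)}]
    for name, keep in rules:
        kept = [r for r in remaining if keep(r)]
        log.append({"rule": name,
                    "removed": len(remaining) - len(kept),
                    "remaining": len(kept)})
        remaining = kept
    return remaining, log

responses = [
    {"id": 1, "is_duplicate": False, "attention_ok": True},
    {"id": 2, "is_duplicate": True,  "attention_ok": True},
    {"id": 3, "is_duplicate": False, "attention_ok": False},
    {"id": 4, "is_duplicate": False, "attention_ok": True},
]
rules = [
    ("duplicate submission", lambda r: not r["is_duplicate"]),
    ("failed attention check", lambda r: r["attention_ok"]),
]
final, log = apply_rules(responses, rules)
for entry in log:
    print(entry)
```

The first and last log entries give the collected count and the final usable sample size to report.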
The most defensible approach is to decide your cleaning rules and thresholds before the data arrives, then apply them mechanically. Setting criteria in advance removes the temptation to keep responses that support your hypothesis and drop ones that do not, which is a subtle but real source of bias.
Where possible, prefer flagging over deleting: add a quality column that marks each response as clean or suspect, so you can run your analysis with and without the flagged cases and see whether your conclusions hold either way. If the headline finding survives both versions, you can report it with confidence; if it depends entirely on questionable responses, that is critical to know before you present it.
Treat cleaning as an ongoing capability rather than a one-time chore. After each study, review which rules caught the most problems and whether any honest responses were wrongly removed, then tune your thresholds for next time. A team that invests in a documented, repeatable cleaning process spends less effort per study and produces results that withstand scrutiny, which ultimately is what lets stakeholders trust the data enough to act on it.
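The flag-then-compare idea can be sketched as follows. The quality column is derived here from a count of quality flags, with two or more marking a response as suspect; that threshold, the field names, and the sample ratings are assumptions for illustration.

```python
from statistics import mean

def label_quality(responses, min_flags=2):
    """Add a 'quality' column: 'suspect' if enough flags were raised, else 'clean'."""
    for r in responses:
        r["quality"] = "suspect" if len(r["flags"]) >= min_flags else "clean"
    return responses

def compare_with_and_without(responses, field):
    """Compute the same statistic on all responses and on clean ones only."""
    all_values = [r[field] for r in responses]
    clean_values = [r[field] for r in responses if r["quality"] == "clean"]
    return {"all": mean(all_values), "clean_only": mean(clean_values)}

responses = label_quality([
    {"rating": 4, "flags": []},
    {"rating": 5, "flags": ["speeder"]},                   # one flag: kept as clean
    {"rating": 1, "flags": ["speeder", "straightliner"]},  # two flags: suspect
    {"rating": 4, "flags": []},
])
result = compare_with_and_without(responses, "rating")
print(result)
```

In this made-up sample a single suspect response pulls the average down noticeably, which is exactly the kind of dependence the two-version comparison is meant to reveal.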
Frequently Asked Questions
How much data is it normal to remove during cleaning?
It varies widely by source and survey length. Panel and incentivized samples often need more cleaning than engaged customer lists. There is no fixed percentage; what matters is applying consistent, pre-defined rules and documenting the result.
Should I clean data before or after analysis?
Before. Cleaning is a pre-analysis step. Analyzing first and removing responses afterward invites bias, because you may be tempted to drop responses that contradict the result you want.
What is the difference between a speeder and a straightliner?
A speeder completes the survey suspiciously fast, flagged by completion time. A straightliner selects the same answer repeatedly regardless of content, flagged by lack of variance. A response can be both, and each is detected differently.
Are attention checks always necessary?
Not always. For short surveys to highly engaged audiences they may be overkill. For long surveys or paid panels, one or two attention checks meaningfully improve data quality without overburdening respondents.