As organisations strive to become more data-informed, poor data quality can lead to bad decisions that cost them dearly. Before putting trust in dubious data, double-check its quality with a quick audit.
Hi, I'm Mauricio. Connect with me on LinkedIn to share ideas and keep the conversation going!
Auditing data quality can prevent many issues down the line.
So… we have decisions to make. We know how important it is to be data-driven, so we gather data to support these decisions, by extracting insights or making predictions. Then something goes wrong: a forecast turns out to be way off, calculations don’t add up, or the data contradicts common sense.
Now, some of these can occur naturally and simply are what they are (e.g. a great analysis with good data proves our common sense wrong), but many times they are symptoms of an underlying problem: not enough thought or attention went into checking data quality.
Culture, knowledge, process and skill are all aspects of why this happens. I’ll focus on the last two. Let’s start with understanding the dimensions of data quality, and jump right into a handy checklist for your quick audits.
As with many things ‘data’, there is no universally accepted definition for ‘data quality’. I’ve seen it explained through five to eight dimensions. I prefer keeping things simple, so let's stick with five dimensions using a practical example: imagine you are travelling abroad with your partner. You don’t want to forget anything important, so you prepare a travel checklist.
Accuracy: how closely does it represent the real world?
Does your data accurately reflect reality? Issues might appear when translating visa requirements to your language or looking at unofficial sources (blogs instead of the government website). 😵💫
Completeness: any information missing?
Is your list complete, given its scope? Perhaps your partner asked for something and you forgot to add it, or you failed to spot that there was a second page in the mandatory vaccines list. 😶
Consistency: does the data agree with itself, across sources and over time?
This is about data not contradicting itself. Can you imagine if you didn’t notice that the date format of the airline website was MM-DD-YYYY instead of your commonly used DD-MM-YYYY, so you wrote down (and perhaps bought the ticket for) the wrong date? 😵
Relevance: is it important for what we are trying to solve?
Is every item (row) and attribute (column) relevant to your checklist? You don’t want to pack a bunch of items that are easy to acquire at the destination, or things for the countryside when you are heading to the big capital city. 🙃
Timeliness: is it up-to-date?
The data should be current and available when you need it. Outdated flight information, or failing to spot a notification about a flight disruption, can create many headaches. 😔
Data is prone to human and technical errors, in every step of its life cycle. Don’t just trust it because it is written down! Have some healthy scepticism. Use this checklist to help you out!
Here is my own ‘Data Quality Quick Audits’ step-by-step checklist. It is not a data governance guide… and I won’t go into automating any parts of this process (and its many benefits and challenges). It is about you being able to do spot checks to get a sense of the quality of your data and document these to support your project. 🕵️
- Scope and lineage
Define which datasets will be audited. Note down specifics: which file (e.g. .xlsx or .txt) or which table from your database. Think about lineage: where does it come from (a system or people), and has it been altered?
These questions will implicitly help you think about how much you trust and understand the data.
If the dataset is too large, it’s perfectly valid to sample it. Just document it. ✍️
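If you work in Python, here is a minimal sketch of how a documented, reproducible sample could look (pandas assumed; the dataset, seed and sample size are made up for illustration):

```python
import pandas as pd

# Hypothetical dataset standing in for your real file or database table.
df = pd.DataFrame({"user_id": range(1, 10_001), "country": ["NL", "DE"] * 5_000})

SEED = 42          # document the seed so the sample can be reproduced later...
SAMPLE_SIZE = 500  # ...and the sample size

audit_sample = df.sample(n=SAMPLE_SIZE, random_state=SEED)
print(f"Auditing {len(audit_sample)} of {len(df)} rows (seed={SEED})")
```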
- Data format
Check the format of the dataset, and whether it's suitable for analysis: Excel, CSV, JSON, plain text…
Most of the time, you want structured (tabular) data that is 'tidy', because analytical tools can ingest and process it easily.
For unstructured data like free text, you’ll need other approaches: more specialised tools or even doing the analysis manually. 😵💫
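For structured data, a quick first look in Python can be as simple as this sketch (pandas assumed; the toy CSV stands in for your real file):

```python
import io
import pandas as pd

# Toy CSV standing in for your real export; with a real file you would pass its path instead.
raw = io.StringIO("booking_id,price,booking_date\n1,120.5,2024-05-03\n2,89.0,2024-05-04\n")
df = pd.read_csv(raw)  # pd.read_excel / pd.read_json work similarly for other formats

print(df.shape)   # rows x columns: roughly what you expected?
print(df.head())  # one observation per row, one variable per column ('tidy')?
```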
- Data types and validity
Confirm that each variable is assigned the correct data type (e.g. numbers should not be encoded as text) to enable proper calculations and visualisations. This kind of error has happened to me way more often than I am proud to admit. 😅
Some software does this automatically to a good standard, but be attentive: check each variable definition, the data it contains, and the assigned data type. Fix as needed.
After you get that done, check for validity. It’s what that “data validation” feature is all about.
Besides data type, reflect on which values fall within logical ranges, and investigate (and fix) those that don't.
Pro tip: check the range of numerical values (“What is the minimum and maximum age in this dataset?”) and distinct values for categories (“We should only have seven distinct days of the week: Sunday, Monday, etc.”).
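Here is a minimal pandas sketch of those type, range and distinct-value checks (the column names and expected values are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "age": ["34", "29", "131", "abc"],           # numbers stored as text, one implausible, one invalid
    "weekday": ["Mon", "Tue", "Funday", "Mon"],  # one value outside the expected categories
})

print(df.dtypes)  # spot variables encoded with the wrong data type

# Coerce to numeric: anything that isn't a number becomes NaN for investigation.
df["age"] = pd.to_numeric(df["age"], errors="coerce")
print("age range:", df["age"].min(), "-", df["age"].max())  # is 131 plausible?

# Distinct categories: do we only see the seven expected weekdays?
expected = {"Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"}
print("unexpected weekdays:", set(df["weekday"].unique()) - expected)
```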
- Missing data
Missing data can be related to the “scope” I mentioned earlier: ensuring you are getting all the data that you thought you had. During the Covid pandemic, a technical problem due to digital file conversion (CSV to XLS… yes, the old XLS, not the newer XLSX 🤦) led to Public Health England losing about 16,000 positive test results, making those people and their close contacts vulnerable.
Missing data can occur for a multitude of reasons. Besides whole observations (rows) or variables (columns), individual values can also be missing. As a quick audit, spot these, see if you can investigate and fix their root cause, and find out what is supposed to be there.
My reaction when a lot of data is missing.
Sometimes data is missing for a good reason, and that can be okay too. One example is a person unwilling to disclose their gender in that optional field, whether for privacy, safety or other reasons (more on that later).
Speaking of which, never forget: missing (aka ‘null’) values are not zero! Zero is a known quantity! If you see Price = 0, the item is free. If you see Price = NULL, you don’t know how much it costs. 💡
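A minimal pandas sketch of spotting missing values, and of why treating NULL as zero changes the story (the toy items and prices are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "item":  ["visa", "vaccine card", "travel insurance"],
    "price": [80.0, 0.0, None],  # 0 = known to be free; None = price unknown
})

print(df.isnull().sum())         # missing values per column
print(df[df["price"].isnull()])  # which rows need investigating?

# Treating NULL as zero silently changes the meaning of the data:
print("mean, NULL ignored:", df["price"].mean())              # 40.0
print("mean, NULL filled as 0:", df["price"].fillna(0).mean())  # ~26.67 -- a different story
```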
- Duplicates
To think about duplicates, first, understand your dataset’s level of granularity. If it’s the ‘main user’ table in the database, each entry should be a unique user. However, if it’s a ‘transaction table’, then a user could show up multiple times, as they keep making purchases. Identify and remove duplicates so they don't confuse and break your analysis.
The good news: many tools have features designed specifically for this! 😃
Using Excel’s Conditional Formatting -> Highlight Cells Rules to highlight duplicates
If you want to approach it more manually, it’s similar to the data validation process: look for distinct values for the whole row or the required unique IDs.
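In Python, a minimal sketch of a duplicate check at the right granularity could look like this (pandas assumed; the transaction table is a toy example):

```python
import pandas as pd

transactions = pd.DataFrame({
    "transaction_id": [101, 102, 102, 103],
    "user_id":        [1,   1,   1,   2],
    "amount":         [9.99, 4.50, 4.50, 20.00],
})

# A user_id repeating is expected in a transaction table...
# ...but transaction_id should be unique, so 102 appearing twice is a problem.
dupes = transactions[transactions.duplicated(subset=["transaction_id"], keep=False)]
print(dupes)

# Once investigated, duplicates can be removed:
clean = transactions.drop_duplicates(subset=["transaction_id"], keep="first")
print(clean)
```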
- Consistency
Sometimes, the problem is that somewhere along the way, definitions or ways of collecting data change (which is fine 👍), and nothing is done to address or document this (which is totally NOT fine 👎).
In the previous example, say at some point you start inputting the ‘Duration’ in hours instead: 0.5, 0.75, 1.5, etc. Now it’s all in the same column: some values in minutes, others in hours. It’s easy to see the problems this would cause in analysis, especially in big, long-lived datasets where these issues are harder to spot. 🔎
Inputting duration in hours, instead of minutes, from 03/05/2024. 😵
An even more common issue is inconsistency across datasets. I had this happen with a feedback analysis here at Growth Tribe: in some tables, satisfaction scores ranged from 1 to 5, and in others from 1 to 10. Easy enough to solve (just rescale the values so they are consistent), but I would have missed it if I hadn’t checked the ranges in each table separately first. Use formulas for data validation (e.g. the ‘total cost per unit’ in one table should exactly equal the ‘unit cost’ in one table multiplied by the ‘quantity’ in another) and look at ranges and distributions across tables and time to reveal these issues.
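Here is a minimal sketch of both checks in pandas (the scores, tables and column names are invented for illustration):

```python
import numpy as np
import pandas as pd

survey_a = pd.DataFrame({"score": [4, 5, 3, 5]})   # 1-5 scale
survey_b = pd.DataFrame({"score": [8, 9, 6, 10]})  # 1-10 scale

# Check ranges per table before combining them.
for name, t in [("survey_a", survey_a), ("survey_b", survey_b)]:
    print(name, "score range:", t["score"].min(), "-", t["score"].max())

# Check that a derived value in one table matches its components in another.
orders = pd.DataFrame({"order_id": [1, 2], "total_cost": [30.0, 45.0]})
lines  = pd.DataFrame({"order_id": [1, 2], "unit_cost": [10.0, 9.0], "quantity": [3, 6]})
merged = orders.merge(lines, on="order_id")
merged["matches"] = np.isclose(merged["total_cost"], merged["unit_cost"] * merged["quantity"])
print(merged[~merged["matches"]])  # order 2: 45.0 != 9.0 * 6 = 54.0
```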
- Timeliness and relevance
For timeliness, the most important question is: ‘Is the data sufficiently up-to-date?’. This requires understanding your systems (for internal data) and keeping an eye on official sources (for external data). It’s also worth checking whether the data is still relevant to your business needs… this ties back to your KPIs. Update your dataset as needed. 🔄
The other question is: ‘Is it arriving when I need it?’. Checking (or receiving) a change to your flight schedule after the flight has departed doesn’t help. This ‘refresh rate’ is usually defined by the system or business process itself (and related to KPIs), but it’s worth keeping an eye on it so that everything goes as planned.
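For the timeliness side, a minimal pandas sketch of a freshness check could look like this (the ‘last_updated’ column and the 7-day threshold are assumptions about your data and your process):

```python
import pandas as pd

df = pd.DataFrame({"flight": ["KL1001", "KL1002"],
                   "last_updated": ["2024-05-01 08:00", "2024-04-02 09:30"]})
df["last_updated"] = pd.to_datetime(df["last_updated"])

# How stale is the newest record compared to 'now'?
latest = df["last_updated"].max()
age = pd.Timestamp.now() - latest
print(f"Most recent update: {latest} ({age} ago)")

# Flag records older than whatever refresh rate your process promises (e.g. 7 days).
stale = df[df["last_updated"] < pd.Timestamp.now() - pd.Timedelta(days=7)]
print(stale)
```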
As for relevance, you have to sanity-check and be very critical about which data matters for your problem. Fight the instinct of “I want all the data with all the variables”... it will only make your analysis harder to do, and can even confuse you. Raise the bar for what gets included.
- Accessibility and security
One final point for the checklist: before shouting about the data and your analysis, clarify in your audit who has access to what and what the security protocol is. This decision might not be yours to make, but either way it will save you many headaches. You don’t want sensitive information flowing around the organisation without oversight, or being used for unintended purposes. Proper data access control and security keep your organisation compliant with data regulations and prevent big data breaches that cost a lot (in fines and reputation). 👩‍⚖️
The full checklist, with key questions and an editable canvas.
I prepared a game so you can go through all the steps shown before. Imagine that the context is that “travel planning checklist” for an upcoming international trip with your partner. You will audit the list you put together.
Make a copy of the Data Quality Quick Audit checklist.
Make a copy of the Data Quality Quick Audit - Exercise.
Add each issue to the box that makes sense to you.
Compare your answers with mine by deleting the text box on the last slide. See if you (or perhaps I) missed something.
"The Practitioner's Guide to Data Quality Improvement" by David Loshin. This is about running programs and processes to improve data quality in organisations. The practitioner, here, is from the business side, not developers or data engineers (those should look elsewhere).
"9 Data Quality Checks to Solve (Almost) All Your Issues" by Keboola. An overview guide created by Keboola, with explanations of the problems, how they happen and what to do.
"Tidy Data" by Hadley Wickham: Tidy data is the most suitable format for analysis, and this paper explains exactly what it is, and how you can tidy your data if it's messy.
"Data Cleaning Plan" by Crystal Lewis. For those who want to get their hands dirty cleaning the data, this is a very detailed approach. This is not an “overview” guide like the previous piece: this is for those who will perform the cleaning.
Here are the key points we covered:
If the data is bad, everything done with it suffers: the subsequent analysis, and whatever insights, models or decisions come from it, can't overcome bad data.
The five dimensions of data quality: Accuracy, Completeness, Consistency, Relevance and Timeliness.
A quick audit checklist: You can now use the step-by-step checklist to do data quality quick audits!
I know this will help you a lot before you start an analysis, or really, any project that relies heavily on data.
See you in the next one!