data science

Data quality 101: Best practices for maintaining accurate data

Data quality is often put on the back burner when a company is trying to get a data stack off the ground. Rather than focusing on the nuances of their data, they choose to emphasize just having data to work with. But what good is data that is wrong? Or incomplete? Or stale?

The truth is it isn’t. The quality of your data trickles into every business decision made, determining whether the decisions you are making are good for the company or misplaced. Data quality needs to be emphasized from the very beginning, no matter how much work needs to be done. 

What is data quality? 

Data quality refers to how accurate your data is and whether it's easy to use. If your data is high quality, it should be ready to be transformed into data models and used in reports. Ask yourself: is the analytics engineer able to model this? Can the data analyst trust this data without doing extensive validation?

When I think of data quality, I like to think about it in terms of the following characteristics:

  • Accuracy

  • Freshness

  • Completeness

  • Consistency 

  • Uniqueness 

Accuracy is how close the data is to the truth. In my opinion, this is the most important. However, it is the one that everyone tends to think about when talking about data quality. Because of this, it’s probably more important to emphasize the data quality characteristics that aren’t so obvious. 

Freshness is how often up-to-date data is available for use. There is a different freshness standard for all companies, depending on how they use their data. Before cloud computing, data that was a week old was considered fresh. As technology has advanced, more and more data teams are changing their definition of fresh, looking for updated data every hour. 

Completeness looks at a holistic view of your data. Do you have enough data to draw a clear picture of the business? Or are there gaps in the data preventing you from becoming a data driven business? Completeness really comes down to how data is collected and if the right data is being collected. 

Consistency refers to what your data looks like over time. Is the data being ingested from a source in the same format month after month? Or are characteristics like datatype changing from sync to sync? With high-quality data, you should be getting exactly what you expect every time. 

Lastly, uniqueness looks at duplicates. Do your data sources have true primary keys? Is every row being ingested unique? Or are there multiple of the same rows being ingested, sometimes with the same primary key, sometimes with varying ones? You need to ensure you have truly unique rows, each with an identifying key so that you can properly join tables downstream. 

Why data quality matters

Data quality is important in both decision-making and how the business operates. If the data being used is wrong, or even off by a few key data points, this can lead to poor decisions by business teams. They may think they need to head in one direction because of what the data is telling them, but really they need to be heading in an entirely different direction. High-quality data will allow business teams to drive efforts towards the right places, having maximum impact on key metrics like revue.

Poor decisions due to data quality may also lead stakeholders and analyst to lose trust in the data team. When trust is lost, the company lacks a truth source of truth, meaning everyone will be pulling different numbers from different places, leading to a lack in consistency. Because of this, business decisions will be made inaccurately across different teams, leading to insatiable benchmarks in performance. 

Bot traffic is a common data quality issue ecommerce businesses run into. Let’s say a company launches a huge influencer campaign that they spent hundreds of thousands of dollars on. They are seeing an insane amount of traffic to their site. Using marketing analytics, the team can then conclude that the campaign was successful because of all this new traffic.

However, what if 60% of this traffic was from bots? If you remove the bot traffic from the data, the campaign was actually a huge flop compared to other marketing initiatives with the same amount of spend. If the data team isn’t focusing on data quality and identifying these bots, the marketing team would spend even more money on an effort that wasn’t actually working. However, if the data team identifies this fast, they can accurately measure the results of the campaign. 

While this is just one example, this can happen across any business function. Finance teams could be recording the wrong revenue numbers if duplicate purchase orders are included in the data without anyone realizing it. Growth teams could be looking at data that is a month old, basing customer acquisition strategy for today on trends that happened weeks ago. Data models used by all business teams could be continually breaking because of the lack of standardization across the data sources it uses. 

How to measure data quality

Data quality can be measured with various metrics, each depending on what specific aspect of data quality you plan on focusing on. The first metric to start with is typically freshness, or how up-to-date your data is. This is good to apply to your data sources to ensure data is being ingested on the proper frequency. While frequency measure how often your data is updated (hourly, daily, or monthly), latency measures how long it takes to move your data.  If your data frequency is hourly but latency is 4 hours, you’ll still have a data freshness issue.

Data quality can also be measured with other tests that you can apply directly to a dataset or a column within a dataset. The basic ones include uniqueness and not null tests, which should universally be applied across all sources and models. More complex tests include datatype tests and anomaly testing (looking at mean, median, z-score, etc.). 

Lastly, you can choose to measure data quality by looking at the characteristics of your data pipeline/orchestration solution. With this, you can measure the average run-time of your models as well as the number of failures, errors, or failed tests a month. It’s often a good practice to track these measure over time, asking yourself if failures are increasing over time or if these are one-time events. These metrics focus less on the data itself and more so how the data team is delivering the data to be used by the business. 

Common data quality issues and how to fix them

The most common data quality issues come back to the five characteristics of data quality that I discussed earlier –  accuracy, completeness, consistency, uniqueness, and freshness.

Inaccurate or incomplete data

It is common to find data that is inaccurate or incomplete due to the way the data is being collected. This typically comes down to fixing an issue with how data is being populated into your database by your application or an external tool. 

For example, if you are using a web event tracking tool, your engineering team is most likely the one implementing that code within the application to track the data you need. However, they might not understand everything that needs to be tracked from an analytics perspective. This requires working closely with them to ensure the needed fields are being tracked and then validating that data. 

In order to avoid inaccurate or incomplete data, you also want to test as close to the source as possible. This way, you are able to catch problems as they are ingested into your data warehouse rather than not discovering them after they are being used in a downstream data model. 

Data duplication and inconsistency

Duplication and inconsistency is a common data quality issue that occurs when using external data sources like Hubspot and Google Sheets. This is because you are most likely ingesting data from an API endpoint. These APIs tend to change all of the time, messing up connectors and accidentally causing duplicate records or changes in metadata. Unfortunately, there is not much you can choose to do but standardize this raw data as it comes in.

For example, let’s say there is a created_at field that you need to ingest from Hubspot, but the datatype keeps changing at the API level from date to timestamp to timestampntz. Instead of adjusting your model’s code each time, standardize this column when you ingest it into your data warehouse and then reference that downstream rather than the raw data. 

For example, within dbt, I do this at the staging layer for every data source:


Id AS hubspot_company_id,

Created_at::timestamp AS company_created_at, 

account_id::INT AS account_id

FROM {{ source(‘airbyte’, ‘hubspot_companies’) }} 

Now, no matter the datatype of the raw data column, it will always be cast to the datatype that is needed in the downstream models. 

Data downtime

The data team exists to serve business users with the metrics that they need to answer tough business questions. When a dataset or dashboard is broken it can feel like the end of the world. The business wants answers to their questions and the data team is scrambling to find a solution and make sure that solution is actually one that produces the right data. 

While I wish there was an easy answer to this one, it’s complicated. The best thing I think anyone can do here is communicate and give it time. Sometimes problems just can’t be rushed. It is better for the data team to get to the root cause and fix it the right way than come up with a temporary solution that could end up being entirely wrong. 

It is testing and documentation that will help avoid these situations as much as possible. However, stuff breaks! That’s just part of the job. Make sure you set up a system for communicating data problems to the business in a way that they can understand. This system should also include ways to keep them updated on the efforts that are going to fixing that issue. Communication is key here! 

Data staleness 

Ensuring data freshness is the most basic data quality check I recommend everyone to have. It lets you know if your data is actually being ingested as you expect it to be. It’s a double check on your data connectors and data pipeline to make sure they are working properly. 

Luckily, if you’re a dbt user, it is super easy to implement freshness checks on your data sources. In the YAML file that you define a source, you simply add a freshness block to the table name, define the timestamp column you wish to check, and assign a time interval for how fresh the data must be. 

code block showing an example of a freshness block used for a Yaml file in order to ensure data quality

The loaded_at_field should be a timestamp column within the dataset you are testing that represents when a row in that table was generated. This can be when a record was loaded or created, or even something that represents the true creation time, like order_placed_at for an orders table. 

Here, you can also specify different time intervals for a warning versus an error. If it’s just a warning, then all of your downstream data models will continue to run. If the test errors out, then none of the downstream data models will run.

I always recommend setting the warning time to be how often the connector typically refreshes. If something does go wrong, you don’t want it to fail right away. Give yourself a buffer time to fix the problem! However, if the problem isn’t fixed after a certain amount of time, use the error period as a kick in the pants to actually fix the problem. 

Best practices for improving data quality 

Luckily, there are simple practices that you can integrate into your data culture to ensure you are doing everything possible to avoid data quality issues.

Establishing data governance policies

Data governance policies are helpful for controlling who has access to your data and the kinds of data they can see. I always recommend implementing some sort of access policy as soon as you create a data warehouse. This way, you can limit who can access both development and production data from the very start. 

By controlling who has access to your data, and what they can do with it, you are preventing confidential customer information from being leaked as well as preventing any accidental data deletions. Make sure you are granular with the read and write access you give to various users and tools. Not everything needs the ability to write data! In fact, most people and tools only need to read it, especially within a cloud data warehouse. Always start by granting the least amount of permissions needed. If a user finds they need more, you can always grant that further down the line. 

Document your access policies somewhere where the entire business can see them. This way everyone can hold one another accountable for following the standards set in place. And, no matter how urgent a request is, make sure you thoroughly think through every use case before straying from these policies. 

Implementing data validation and verification procedures

A common reason data quality issues slip into production is due to a lack of data validation. It can be easy to assume that every sync, piece of code, and tool does exactly what it says it will. However, you always need to validate the data being produced. 

Every time I write a new data model, I run a few validation queries to ensure the code behaves as I expect. Here are some things I like to check for:

  • Duplicates

  • Nulls 

  • Empty values

  • Datatypes

  • Freshness

Sound familiar? These are some of the most common data quality characteristics that we discussed earlier! These types of things should always be checked before pushing a new code change. This way, you can track down any issues that manifest downstream knowing your model was high-quality when you first wrote it and pushed it to production. 

In addition to validating my data models, I always validate any new data sources that are put in place. It’s common for the data ingested by tools to be written in a way different from your assumptions. Make sure you check for the same data quality characteristics at the source as you do for any data models.

I also recommend running some test queries on other parallel data sources that could be affected by any new data collection methods. For example, if you implement code in your application that tracks a new web event, make sure you check the tracking of other web events that have already existed on your website. New data collection methods have the potential to impact data sources that have already been validated, making additional validation necessary. 

Using data quality management tools

To track data quality you can integrate a data quality management tool into your data pipeline. Data quality management tools help you with testing both before and after pushing code to production, including monitoring and alerting. Datafold is a great example of a tool that you can integrate with GitHub and dbt that allows you to check for downstream data impacts before merging your code to production. This way you are being proactive rather than reactive. 

Most other tools fall under the category of data observability, like Monte Carlo and Sifflet, but can also exist as data catalog tools and open-source dbt packages. If looking for a data catalog tool, I highly recommend Atlan for a well-rounded product that is useful for both business and technical teams. As for open-source dbt packages, dbt-expectations and re_data are great for testing. 

Insights to action starts with high quality data

Data quality best practices start from the moment you begin collecting any data. While it is often thought about as an afterthought, once companies start running into problems, it really should be considered in every data decision from the very beginning. By waiting to prioritize data quality, you create more tech debt for yourself to clean up at some point. 

Read the 7 steps to data cleaning

By focusing on accuracy, freshness, completeness, consistency, and uniqueness when testing and validating your models, you are already taking a huge step in the right direction. Once you start adding tests in place you can draw attention to the quality issues that exist in your stack. Then, you can take the time to address them at the source so they don’t happen in the first place. 

High-quality data is something that affects your entire business, so share your fantastic data quality metrics with your stakeholders and watch the trust grow.  By following these best practices, you are guaranteed to change the way data is being valued and used across your company. 

See how ThoughtSpot can help you manage data quality, so you get the most value out of your company’s data. Start a free trial today.