When I first started my role as an analytics engineer, I was tasked with rewriting a bunch of data models that were written in the past by contractors. These models were taking over 24 hours to run and often failed to run at all. They were poorly thought out and contained a bunch of “quick fix” code rather than being designed with the entire flow of the model in mind. Because of the random queries strung together in an order that made no sense, they were nearly impossible to understand or debug. Looking at these models showed me exactly what not to do when writing my own data models.
It’s important to think through your dbt data models from start to finish before even writing a line of code. When we try to string things together, hoping for a quick fix, we are only setting ourselves up for failure. Data modeling is truly an art! And all great art requires some planning for the struggles that lie ahead. When you write your dbt data models, you want to write them with three things in mind: readability, modularity, and performance.
First, when you write code, you can’t just write it for yourself. You need to be writing it with others in mind. With any successful team, you will have other people reviewing your code and trying to understand it. If you think it's hard to read, it probably is. This is why it's important to simplify the code that you are writing. I see a lot of people using subqueries and complex joins that are almost impossible to understand if you didn’t write the code yourself. If you’re writing code like this, try to figure out a way to simplify it or break it down. Choose to use a CTE over a subquery. If possible, when your queries are hard to read, break down one complex join into a few simpler ones.
You also always want to make sure you are commenting on your code. Comments help the person reading your code understand why you used a certain function or performed a particular join. They make the code clear to the reader without them needing to guess the purpose behind a piece of code. If it isn’t intuitive, always comment the why behind what you are doing.
Lastly, it is a best practice to always know what the data model is doing, just by reading its name. This means the name of the data model should be descriptive of its role in your analytics workflow. For example, if you are joining two datasets in a data model, you can name it <table_name>_joined_<table_name>. If you are grouping users by the country that they live in, you would name your model users_grouped_by_country. Again, this leaves no guessing for whoever is using or reviewing your data models.
Another important thing to keep in mind when writing your data models is modularity. You want to always be thinking about ways you can reuse code you’ve already written, or write new code in a way that can be used by other data models. This idea follows the standard practice of DRY- “Don’t Repeat Yourself”. By planning ahead, you will avoid rewriting the same code multiple times in different models. Repeating code only slows down your data models, makes it harder to understand, and wastes your time.
Think about the logic that you use in multiple data models. It could be some type of mapping, a metric calculation, or a simple join. Whenever you have code that is used in more than one data model, it should become its own model. This way, you can simply reference that existing logic in the new model you are building. The model only runs once, yet is used multiple times.
dbt is a data transformation tool that does a great job at solving this problem. It makes writing modular data models easy by allowing you to reference other models directly within your SQL files. In dbt’s documentation, these types of models are referred to as “intermediate models,” or models bridging gaps between source data and your final data models.
Lastly, you need to keep performance in mind when writing your data models. You don’t want to use functions or logic that is going to slow down the execution time. The models that I had to rewrite, which were taking over a day to run, often used SQL functions like DISTINCT, MAX, and MIN within a grouping.
For example, if a company is trying to look for the most recent user for a certain store location and customer type, they could write something like this. However, this is often unnecessary and wastes a lot of computing power.
SELECT MAX(user_id) FROM new_users WHERE user_type_id = 1 GROUP BY store_location
This could easily be replaced by a simple window function that is easier to read and speeds up run time. If I were to rewrite this query, I would rewrite it using the FIRST_VALUE() function like follows:
SELECT FIRST_VALUE(user_id) OVER (PARTITION BY store_location ORDER BY date_acquired DESC) FROM new_users
In addition to functions like this, using the incorrect join can also negatively affect a data model’s run time. If you use a full outer join instead of a left join, you are retrieving way more data than is actually needed. You also need to make sure you are joining on the appropriate columns. Not selecting enough columns to satisfy a join condition or joining on columns with incompatible data types can cause a longer run time. This is why it’s important to test your code as you go, validating you are getting the results you expected and that your code is as fast as you hope for it to be.
A lot of times we build data models for what we think the business needs, not what it actually needs. I’m guilty of this myself. Every analytics engineer probably is. When I first rebuilt our old data models with readability, modularity, and performance in mind, I didn’t even consider that they may not be useful for the business. I went through the entire rebuilding process only to discover that these models weren’t used very often. The initial columns included in them were never thought out for how the business runs. This means our data analyst was having to join different tables and calculate metrics herself, directly within the business intelligence tool, defeating the purpose of having a data model
It’s important to observe what dashboards and reports are most useful to the business. Which ones do they check every day? What kinds of data pulls are they asking for from the data analyst? If you have a data analyst that you work closely with, it’s usually best to talk with them about what the business needs. While analytics engineers have close communication with business teams, analysts have it even closer. They are usually the ones directly serving marketing, growth, and accounting teams. Having them show you what they do every day, will help you get a good idea of the data that would actually be used on a daily basis. Instead of jumping right into building your data model, talk to them about the columns they need.
When I rewrote our original data models, it was mainly because it was impossible to decouple all of the code. It was impossible to make sense of and understand the individual pieces that made up the entire data model. Things will always go wrong. You can’t build your models to be bullet-proof.. So, when something does go wrong, you want to make sure you wrote your code can be easily debugged to get to the root of the problem. Modular data models are handy because they are already broken down into smaller data models. Using a tool like dbt, you can run the smaller pieces without running the entire model. It’s much less intimidating to debug the smaller modular pieces of code rather than one huge model. dbt also supports testing at the source as well as directly within your models, improving the debugging process and making the quality of your data better.
Writing your code in smaller pieces, with debugging in mind, will prevent you from having to entirely rewrite the model in the future. You can easily change small parts of code in order to modify a larger data model. Bits and pieces can be broken down, but one large inefficient data model can’t be. This is another place where commented code comes in handy. You can better dissect the purpose of the original code and change it to fit your current needs.
Another mistake I see with data modeling is too restrictive of models, meaning they are aggregated in a way where they can only be used for a specific business case. The purpose of creating a data model in the first place is to produce a dataset that can be regularly reused by a data analyst. Data models produce data sets that can be referenced in multiple reports and dashboards. They essentially save the analyst time from repetitively performing the same joins and complex functions.
When you create data models that are too detailed, they can’t be reused and referenced in multiple reports and dashboards. This usually entails aggregating your models or filtering them for a specific use case. Instead of doing this within a data model, an analyst can do this for their specific use case.
For example, let’s say you are writing a data model to consolidate all marketing costs into one dataset. The first time you wrote it, you aggregated costs on a daily basis. This prevented the marketing team from seeing how much they were spending at certain times of the day. Now, you realize that you don’t want to aggregate costs at all. You want to leave each row of data how it is found in the data source. This way, if you wanted to break down Facebook spend by morning, evening, and night, you would be able to. Again, data models are meant to be as broad as possible while still focusing on a specific use case. In this case, gathering different spending amounts from different marketing platforms and joining them all into one dataset is more than enough for your model.
Functions to avoid in dbt data models:
Aggregates such as SUM, COUNT, AVERAGE, MIN and MAX
GROUP BY functions
WHERE filters at the end of your queries that make your models too granular
Remember, data models are meant to serve a large core of your business, not a very specific use-case. dbt calls your main data models core data models for a reason! If you can’t use the model for multiple use cases, it’s probably too granular.
Putting your new learnings to use
Writing your data models right the first time will save you many future headaches. Thoughtfully planning them out, optimizing the code, and changing anything that doesn’t work will save you tech debt down the line. When you write data models right from the start, the code should be able to grow and scale with you and the business. It shouldn’t have to be completely scrapped to be rebuilt when done right.
While the upfront process of writing modular, readable, and fast code can be slow, it is well worth all of the planning and thoughtful consideration that goes into it, especially when you see how fast your models will run. If you’re reading this and recognize some of these faults in your own data models, I highly recommend making the time now to salvage them. See where you can apply these principles in order to improve your models so that they can scale with you. It’s always best to review and reiterate while the models are newly built as compared to waiting.
Remember, observe how the business and data analyst currently uses your data in order to understand their true needs in a data model. While building them out, make sure you are making them readable and modular, writing them with debugging in mind. Lastly, they should be broad enough to be used in multiple different use cases and serve many unique purposes.