Analytics engineer is the latest role that combines the technical skills of a data engineer with the business knowledge of a data analyst. They are typically coding in SQL, building dbt data models, and automating data pipelines. You could say they own the steps between data ingestion and orchestration.
Whether you are a seasoned analytics engineer or new to the field, it’s important to continually learn new things and improve the work you’ve already done. There is always more advanced SQL aanalytics functions to learn, a new tool to add to your stack, or experience to be gained by improving on your best practices.
Here are the 5 skills analytics engineers need to know to take their career to the next level:
Basic Dimensional Modeling Techniques
Analytics engineers are organizing the data warehouse, deciding on an architecture that will store data models. Being the person that works closest with the warehouse and builds the data models, they need to have a good understanding of dimensional modeling techniques. While I don’t necessarily believe in strictly following every aspect of these theories, I do think it’s important to be aware of them and understand what they are. Then, you can decide what is best for your individual business and what makes sense for your data models. It’s never going to be a one-size fits all kind of thing!
We may also find that analytics engineers with an advanced level of dimensional modeling skills will demand higher salaries in 2023. Again, look at your needs as a company and figure out what is best for you in terms of balancing salary and skills.
Dimensional modeling involves the use of fact and dim tables to store historical data for your business. You may also hear this referred to as a “star schema”. This is because one fact table is often supported by multiple dim tables. Fact tables store facts about a certain business process. For example, you may have all of the transactions on your website located in a fact table. You could even have a fact table of all the products that go in and out of your warehouse.
Dim (or dimensional) tables store attributes that describe the data that’s stored in a fact table. So, if we think about our orders fact table, a dim table may include data denoting order types, customer profiles, or details about payment methods. I like to think of the mapping tables that may exist in a warehouse. These are the tables that map ids used in other tables to their actual meaning. Dim tables are often these mapping tables joined together to give a more holistic view of a fact.
In contrast with dimensional modeling and the star schema, you may often hear the term OBT, or one big table. This is another way you can store data within your warehouse. It is exactly how it sounds! You would store all of your data in one table, rather than separating it into multiple tables. No joins are ever required. While this is quite simple and may be good for smaller teams, I have never found this to be helpful. It is hard to locate the data columns that you need and there are often duplicate records depending on the data that this table stores. And, quite frankly, you wouldn’t need data models or analytics engineers at all if this was an effective way of doing things.
SQL Window Functions
While you can definitely get by building SQL data models without using any window functions, knowing window functions will make your work a lot easier. There’s no need to fear them as they are functions that simply operate over a window, or group of rows, but allows you to include non-aggregated columns in the resulting query. They will save you computing resources and time spent running your models, and optimizing your code. I’d argue that they are single-handedly the most powerful functions within SQL.
Window functions work by grouping and ordering your data using two commands- PARTITION BY and ORDER BY. PARTITION BY specifies how you want to group your values while ORDER BY specifies the column and direction you wish to order the values. These are then used to perform a calculation over a number of related rows.
Some of the most popular window functions include LEAD() and LAG(), ROW_NUMBER(), and FIRST_VALUE(). Each of these works similarly, using PARTITION BY and ORDER BY commands, yet calculates something different.
LEAD() and LAG()
These functions work in the same way but LAG looks at rows before the current value and LEAD looks at rows after the current value. LEAD() and LAG() are great to use for finding sequential values in a dataset. For example, if you have a list of campaigns that went live on Facebook, you could find the campaign that went live before each campaign and the one that went live after each campaign, using these functions.
We could take this a step further by partitioning by social_platform in order to find the previous campaign for each campaign launched on the same social platform. We would use the LAG() function, in this case, to find the campaign before the campaign in the row being analyzed. Here we are telling the query engine to create a group of rows for each social media platform then within each group order the rows by created_date. Lastly, we are selecting the current campaign_name, social_platform, and the previous campaign_name.
SELECT Campaign_name, Social_platform, LAG(campaign_name) OVER(PARTITION BY social_platform ORDER BY created_date ASC) AS previous_campaign FROM campaigns
Here we are also ordering by a timestamp column to ensure we find the campaigns in the right order. Adding ASC ensures the campaigns are ordered starting from when they were first created, ending with the most recent one. If you want them to be ordered the opposite way you would have to specify DESC.
ROW_NUMBER() is a great function to use with any type of ordering problem. It can be used to order values in a group depending on the columns you specify in the PARTITION BY and ORDER BY clauses. Let’s say you are looking for the top three performing campaigns for each social platform from the last year. ROW_NUMBER() could help you find this writing clean and readable code.
SELECT Camapign_name, Social_platform, Campaign_revenue, ROW_NUMBER() OVER(PARTITION BY social_platform ORDER BY campaign_revenue DESC) AS performance_order FROM campaigns
Here, we do not specify a column name in the ROW_NUMBER() part of the function. Because the function outputs a number for each row in the group, it’s not necessary. What really matters are the columns specified in the PARTITION BY and ORDER BY clauses.
Here, we are restarting/continuing the count each time the social_platform changes. If you were to think of three different groups — Facebook, Instagram, and Youtube — you are finding the order of performance for each of these. Except, in this case, the data for each is all within one dataset.
After writing the query above, you can write another query and filter by performance_order in order to only find the top three performers for each social platform. Your query might look like this:
SELECT Campaign_name, Social_platform, Campaign_revenue FROM camapign_performance_orders WHERE performance_order <= 3
Lastly, FIRST_VALUE() is a great window function to use in place of ROW_NUMBER() if you only want to find the first value in each group. This function can save you the extra step of filtering for where row_number = 1. Just make sure you are ordering your values in the right direction, or you could get the exact opposite of what you want.
Instead of returning a number, the FIRST_VALUE() function returns the first value of the specified column based on the ORDER BY clause for each group in the window.
SELECT FIRST_VALUE(campaign_name) OVER(PARTITION BY social_platform ORDER BY campaign_revenue DESC) AS top_campaign_name, Social_platform, Campain_revenue FROM campaigns
Here, only the campaign_name that was a top performer in terms of revenue for its corresponding social platform will be returned by the window function.
How to write a Macro in dbt
dbt is a tool any analytics engineer should know, so knowing how to write macros will only deepen your dbt knowledge. If you aren’t familiar with dbt, it is a data transformation tool that uses SQL and jinja to help you write modular data models that can be reused as opposed to legacy transformation tools that create multiple versions of the truth. dbt has a lot of advanced features, like macros, that allow you to customize your code and make it reusable throughout your entire project.
Macros are essentially “functions” within dbt that are written using Jinja syntax. You can write a macro and then reference it throughout your data models, passing a column in your SQL code into the function. I’ve written macros to generate custom schema names for my models, clean data columns, and convert currency values. If you have a unique use case, chances are it can be solved using a macro!
How to Explain Technical Concepts in a Simple Way
Analytics engineers interact with business teams and their stakeholders every day. Clear communication is imperative to get things done. This often means being able to explain highly technical concepts in a way those with no technical experience can understand. This will help complete projects faster and more effectively, ensuring things are completed the right way the first time.
In order to practice this, I recommend explaining technical terms to your kids or non-technical friends. You might have to bribe them a little bit to get them to listen, but see if they can then explain what you taught them back to you. This will demonstrate if they truly understood the concept you taught them. It’s also helpful to hear how they describe something in their own words.
If looking for more resources on this topic, we highly recommend reading Storytelling with Data by Brent Van Dyke. This book gives practical tips on communicating results through effective data visualizations. You can check out the synopsis here as well as pick up some other top reads for data and analytics pros.
How to thoroughly document data sources and models
Documenting data sources and models in a way that others can understand is one of the most underrated skills of an analytics engineer. Most know they should be documenting, but a lot of them don’t take the time to make sure it’s done at all, let alone the right way. And, I get it. It takes time. It’s easy to want to rush into the final product and get a model to production.
But learning how to properly document is an invaluable skill because of how much value it provides the business. Documentation ensures legacy knowledge lives on, even when the people who built the data architecture move on. It also minimizes the chances of things going wrong. Business users have one, concrete definition of a piece of data rather than the different ideas that live in their heads.
Documenting your data sources and models as you work is the key to doing it well. When you do it as you work, you gain a better understanding of the nuances of the data. If you wait until after the fact, you will probably forget the little details in the data you picked up along the way. Along with documenting data definitions, you should be commenting throughout your code as well so anyone can understand why you wrote what you wrote. This will allow for an easier code review process and a better future understanding of the business logic.
When documenting data sources, you want to keep the following in mind:
Where is this data from?
How often does this table refresh?
What are the original data types of the columns?
What time zones are the dates in?
When documenting data models, you need the same information, but you also need to document additional logic and changes to the data.
You should think about:
What are the upstream dependencies of this model?
What tables were joined in order to produce this?
Were the columns’ data types changed?
Are the date columns in a different time zone than the source?
Is the data aggregated in any way?
What table does this ID column map back to?
When documenting your data models, think about the potential questions that your teammates or even stakeholders could ask you. Anticipate those and answer them directly within your documentation.
Improving your analytics engineer skills going forward
To be a good analytics engineer you need to know SQL like the back of your hand, have knowledge of dbt, be able to understand business terminology and create a data pipeline. However, if you want to improve and grow as an analytics engineer, you need to broaden your skills beyond these.
You need to uplevel both your technical and soft skills, improving in not only what you’re building but how you are communicating. It is the little things that can make the biggest difference in your career, like knowing how to build a custom dbt macro or using a window function that lessens the runtime of your model.
With the right know-how and the right tools, there is always room to grow and become a better analytics engineer. Start a ThoughtSpot free trial to see how our tools can help take your career to the next level.
Madison Schott is an analytics engineer for Winc, a wine subscription company that makes all its own wines, where she rebuilt its entire modern data stack. She blogs about analytics engineering and data best practices on Medium. She also has her own weekly newsletter on Substack.