When collecting data to use in your business intelligence, it is important to consider all of the different steps and data team members involved. This process starts at the data engineering level, continues to the analytics engineer, and ends with the data analyst. All three roles offer significant contributions to the collection and preparation of data—data that will be used to generate Liveboards, dashboards or reports that answer key questions for your business.
Collaborate with one another in order to follow proper best practices—you can’t do it without one another! In this article, we will discuss the seven best practices of data collection for BI and how each point relates to the responsibilities of the members of your data team.
Before collecting any data, it is imperative that you set specific goals for what the data will be used for. What business question are you trying to answer? This question should involve all roles in your data team, as it will drive the entire data collection process. At the end of the day, you are collecting data to serve a specific purpose within the business by providing business users with a clear answer.
By defining the data that you need to answer this question before you try to collect it, you only spend time gathering what is relevant. It’s easy to get off track, collecting various data sources—many of which you don’t need. Beware of this pitfall, because it may delay the BI delivery. By setting clear guidelines, you will help the entire team prioritize what needs to be worked on in order to produce the BI product.
In addition to setting specific goals, you need to identify the key performance indicators (KPIs) that you will be calculating in the dashboard or report that you are building. Thinking in terms of KPIs will help you define the answer to the identified business question. You want to understand the KPI you will be building out with the data you are collecting before you collect the data. Make sure you also think in terms of the dimensions needed for this KPI (month, year, quarter, product type, customer acquisition date, etc.), as these factors will largely drive what data you end up collecting.
Next, you want to think about the types of data collection methods you will be using. You’ll want to loop in the data engineer at the beginning of the process. They will determine how they can access the data and what tools, if any, they need to use. The data engineer can answer questions like, “is the data currently available or do I need to talk with the engineering team to collect this data?” If data isn’t already available, this can delay the delivery of the BI product.
Lastly, think about the data quality metrics behind what you are building. What aspects of data quality are most important? What are the most important tests to ensure that the data pipeline, from ingestion to report, runs as expected and doesn’t break? Setting these expectations upfront with data engineers and analytics engineers will allow you to produce the highest quality data from the beginning.
Now that you’ve defined your goals, you need to identify and capture the data that you need in order to meet your goals. This step should be a collaborative effort between data engineers, analytics engineers, and data analysts.
First, you want to look at what data sources you need. Are you dealing with marketing data? Do you need Facebook, LinkedIn, and Youtube data? Is this already available, or does the data engineer need to collect this? Does the analytics engineer still need to model it? You should also think about how often the data needs to be available, as this can help determine the method in which you ingest data (SFTP, API, ODBC, etc.).
After you determine the data sources, you can look at the available tables within those sources. You want to ensure these tables have the data points you need. If not, you need to talk with a data engineer about how to collect the additional data points. You should also assess how these tables relate to one another and how you will model them together.
It’s important to read through data documentation, such as table and column definitions written out in dbt, to determine which tables are fresh and most accurate. I’ve had numerous instances where I thought I was using the most relevant table only to find out that it had been replaced with another table. If you don’t have clear documentation, communication with the entire data and engineering team is your sole key to success. In this case, the analytics engineer should begin documenting these tables as they investigate.
This brings me to my third point—data validation. The analytics engineer and data engineer should work together to validate that the data being used is collected at the source as expected. Does the volume look right? How fresh is the data? Does this match the data across other tables? These are all aspects to inspect in order to maintain high data quality.
Data accuracy is an important aspect of data quality—this ensures the data correctly reflects what is happening in the business. Data accuracy also refers to the consistency in metrics across different data sources of truth.
For example, if one dashboard says sales for the month are $12,000 and another dashboard says sales for the month are $15,000, you do not have accurate data. Even though one of these numbers may be correct, you have no idea which one. This reflects greater issues in the accuracy of your data across all sources because of the inability to pinpoint where the issue is occurring. Luckily, there are a few things you can do to ensure accuracy, most of which are the analytics engineer’s job.
The first is data cleaning. This should be done in the form of staging models if using dbt. These staging models include basic standardization like datatype casting and column renaming that will ensure your data is ready to be used downstream in BI. If more data cleaning needs to be done, analytics engineers can further clean by building dbt data models that create the columns data analysts need, transforming the data into a more usable state for the objectives listed above.
Analytics engineers also need to ensure data quality is high even after the initial validation stage. It is possible for things to go wrong later on, not just when the data source is initially set up. Data teams can conduct regular data audits which allow you to stay on top of anything unexpected in your data. This includes building thorough tests for your models and then investigating them when they are triggered. Prioritize adding tests that check for the data quality characteristics you defined in Stage 1.
Are any tests continually failing? Are there lags in the freshness of the data? Instead of ignoring these problems, you should be proactive about solving them so they don’t break anything within your BI tool downstream. I recommend sending alerts to a shared data team Slack channel so you always know when something is wrong. This is much easier than manually checking a UI and ensures all data team members are aware of the issues occurring.
Creating proper data governance policies for access control of your data is imperative throughout the data collection process. Whenever you ingest a new data source, it is important that you consider who has access to it and the confidential information it could contain.
If your data source has Personally Identifiable Information (PII) , you need to ensure you are masking the data within your cloud data warehouse and, of course, within your BI solution. Ideally, this should be masked from the data analyst at the source layer. To make this process seamless, the data engineer or analytics engineer can apply a dynamic data masking policy within your dbt staging models.
In addition to hiding confidential data, it is important to set up access control from the very beginning of a data source’s creation. It is the job of the data or analytics engineer to create different roles within your data warehouse that have different levels of access to raw data, your development environment, and your production environment.
Raw data should always be inaccessible to the data analyst, preventing confusion about what source to select from. They should also only be given read access in production unless they are heavily involved in the dbt data modeling. Data and analytics engineers can have read-and-write access to development and production environments since they are typically the main people transforming the data. However, even they should only have read access to the raw data to prevent accidentally overwriting it with transformations.
When working with BI, it is important that the entire data team has read and edit access to the dashboards and reports. This way, if the analyst is out, engineers can debug anything that breaks on the BI level. However, business users should only be given read access to these reports so they don’t accidentally break or change anything.
Once you’ve done this process a few times, you can use what you’ve learned from past implementations to optimize the data collection process.
You can start with automating certain processes, like data ingestion, so you don’t need to continually run data syncs and data models. When your data is automatically ingested into your data warehouse, you can set your Liveboards and dashboards to automatically refresh.
I recommend scheduling data syncs on a specific cadence or a certain time of day, depending on how often the business needs fresh data in the BI tool. Various data collection tools like Airbyte, Fivetran, and Segment can make the process of collecting and automating data syncs a lot easier. This is a great area to work with the data engineer on.
With more complex reporting, it’s important for the analytics engineer to be involved in modeling out datasets for data analysts to then reference in the BI layer. In order to create seamless transformations you can use a tool like dbt and follow best practices like organizing your model into staging, intermediate, and core directories. Dbt also offers a tool called dbt cloud which will refresh these models for you in the form of jobs, on whatever cadence or time of day that you specify. This will ensure even the more complex transformation logic is always producing fresh data.
Maintaining consistency across multiple data sources is extremely important when ingesting data into any BI tool. When data is ingested here, it should already be cleaned by the analytics engineer.
As mentioned earlier with data accuracy, creating staging models with dbt for each of your data sources acts as the standardization layer. This is where you standardize the timestamp type you will be using, integer datatype, currencies, and naming conventions.
I highly recommend creating a style guide that outlines all of these standards before applying them to your models. This way, you have a guide you can reference when building each model, no matter the source. You can also share this guide with others to make sure they are following the same conventions in the modeling and BI layers.
When everyone follows the same “style” of coding, documentation, and design, it increases data quality and makes sources and models easier to use throughout the entire business. You can trust that your data fits certain standards when using it in KPI or metric calculations.
Data collection is an ongoing process within any organization, and it involves all members of your team. In the beginning, you are doing whatever you can to produce what the business needs as fast as you can. Over time, you realize what is and isn’t working. You start to see the bottlenecks where change would really benefit the entire data collection process.
In order to continually improve the process, you need to look back and analyze performance. You should be evaluating the number of failures in your data pipeline to see where you could potentially improve. Are there any failures that don’t have tests in place where you can add one to be proactive? Tests are a great way to monitor progress because you can track specific numbers behind them.
When dealing heavily with business users, like in the case of BI, it’s important to also seek qualitative feedback from users. Ask business users if they’ve run into any troubles with dashboards or reports or have seen anything weird. Also, talk to them about how you could improve their interactions with what you’ve built. Is there something that isn’t working? What could you do to make it better? At the end of the day, we are building reports for the business—we want to make sure they are getting the most out of them!
Lastly, keep on making iterative improvements. You should never stop testing or seeking feedback. Data collection is an ongoing process. What worked one quarter may not be working the next, making it imperative that you are staying on top of this process of improvement.
Data collection for BI is a collaborative process between data analysts, data engineers, and analytics engineers. The entire team needs to work together to ensure these best practices are being followed.
Every step of the process, from data ingestion to data modeling, and finally building the BI interface, every step is vitally important. That’s why it is so important that teams keep these best practices in mind throughout the entire process—ensuring your data team is producing the highest quality BI products for the business.
Madison Schott is a modern data stack engineer for ConvertKit, an email marketing platform. She blogs about analytics engineering, data modeling, and data best practices on Medium. She also has her own weekly newsletter on Substack.