This article will take you through the fundamentals of Google BigQuery. I’ll also share optimization opportunities to improve performance when you’re using both BigQuery and ThoughtSpot.
BigQuery maximizes flexibility by separating the compute engine that analyzes your data from your storage. The data that you can store in BigQuery includes tables, table snapshots, and materialized views.
BigQuery stores table data in columnar format, meaning it stores each column separately. Column-oriented databases are particularly efficient at scanning individual columns over an entire dataset and are better for OLAP (Online Analytical Processing).
BigQuery is particularly effective when a user only needs to query certain columns from a table. Because the query doesn’t need to scan unnecessary columns, it is more performant.
BigQuery runs in multi-tenancy with shared resources, allocated as “slots.” These represent virtual CPUs that execute SQL. BigQuery determines how many slots a query requires. Because the query execution is dynamic, the query plan can be modified while a query is in flight.
There are several factors that will impact Google BigQuery performance:
How many bytes does your query need?
How many bytes does your query pass to each slot?
How much CPU work does your query require?
How many bytes does your query write?
Are your queries following SQL best practices?
Now that we’ve covered some of the fundamentals, let’s explore some best practices when using google BigQuery.
In addition to the performance factors, there are a number of additional optimizations to consider in order to improve the overall performance of BigQuery. But first, I want to start off with a disclaimer:
This is a general guide of considerations to optimally deploy and govern BigQuery for use with ThoughtSpot—particularly geared for a large user installed base. In order to get the most out of BigQuery, it is very important to factor in specific use-cases, environments, and data requirements.
Partitioning: Partitioning tables based on date or another logical grouping can help reduce the amount of data scanned during queries.
Clustering: Clustering tables based on frequently accessed columns can help reduce the amount of data scanned during queries.
Data pruning: If your queries only require a subset of the data in a table, you can use data pruning techniques such as WHERE clauses to limit the amount of data scanned.
Data compression: Enabling data compression as needed but NOT always.
ThoughtSpot queries should return in search time—this means seconds. Depending upon the complexity of the calculation and volume of data, this timeframe may vary from ~3 secs to ~30 secs on average. Any queries that execute for longer than one minute will timeout.
If you’re experiencing longer search times, here are a few suggestions to improve performance:
Keep the search query as clean and as simple as needed. For example, remove search tokens that aren’t needed.
Limit the number of data points in a search or answer. For example, filter your search results for the last year if you don’t need to see year-over-year data.
Understand the Query Visualizer, and identify joins that could cause performance issues. For example, you may want to denormalize tables when necessary to reduce JOIN operations. You could also consider using nested and repeated fields to reduce the number of tables and JOINs required. Another example is the use of poor performing joins, a full outer join will be particularly bad for performance as it combines the rows from two tables, including the matching and non-matching rows, based on a specified join condition. It returns a result set that includes all rows from both tables, regardless of whether there is a match between the tables.
Push as many formulas back to BigQuery as possible. In general, most formulas could be in BigQuery with the exception of a few, such as ratios/percentages, that need to be calculated post aggregation and/or filtering.
Review the use of BigQuery runtime views—tables or materialized views are preferred
Review runtime calculations
Review transaction level of detail
Review data model structure and correctness of joins
Review separation of workloads
Last but not least, ensure you’ve reviewed the strengths and weaknesses of BigQuery. Certain platforms are much better with denormalized models, for example, than others. It’s important to select the right cloud data platform for your business.
Once you’ve reviewed the basic optimizations included above, there are some further suggestions that specifically help improve the BigQuery performance. They include:
Optimizing your data modeling:
Aggregations - Move aggregations up to the top avoiding “group by” where you can
Where clause - the first part of the SQL should contain heaviest data filter functions
Order By - Use this function with LIMIT to optimize the SQL statement
Joins - Place the largest table first in query and then place the remaining tables by decreasing size
Necessary columns only - No Select * (A Select * statement is used to retrieve all columns from a table in a database. The asterisk (*) is a wildcard character that represents all columns. When you execute a SELECT * query, the database engine will return all columns of the specified table in the result set.
Carefully consider denormalizing data into one big table (OBT) or consider materializing large result sets.
BigQuery also has a number of aggregate functions built into the platform. Consider using them to ensure the most efficient way of computing the data is used.
Ensure you’re caching queries in the future, when supported on the roadmap.
You should also consider the org structure, defining policies, procedures and controls, and level of automation. Carefully monitor and evaluate these areas to identify opportunities for improvement.
To conclude, there are a number of ways in which you can improve the Google BigQuery performance.
First, analyze and triage the ThoughtSpot content. This will help you identify the performance bottlenecks. Once this is complete, there are a number of considerations on both the BigQuery side, such as partitioning, and also on the ThoughtSpot side—success will likely be a combination of both.
When making changes, think about the use cases you are trying to solve. Don’t just plan for now—consider how you want your BigQuery x ThoughtSpot stack to work in 6 months, 1 year, and 2 years time. This upfront planning will help you make the most of your modern data stack investment—ultimately keeping end users happy and having a high performance analytics stack.
If you would like additional support, please reach out to your ThoughtSpot contact. ThoughtSpot Professional Services can help to evaluate your specific requirements, plan for an optimal deployment, and ensure a successful implementation of ThoughtSpot with Google BigQuery—or whatever cloud data platform you choose.