Tips and Tutorials

What the Bleep is a Chasm Trap?

Chasm trap. Sounds like one of those metal claws hunters put in the jungle to capture leopards and panthers - not something you want to fall into around the office. Luckily for those of us who deal with data on a daily basis, these traps are a lot less deadly - but not by much. So what is a chasm trap really? And why does it strike fear in the hearts of data analysts everywhere?

Going Back to the Beginning: A Simple Star Schema

A chasm trap is simply one specific way that tables in a database schema can be related to one another. To take a step back, in a denormalized data model, you have fact and dimension tables. Fact tables contain measurable business facts (like sales or order records) and references to dimension tables, which contain details about the entities involved (like customers or products).

The most simple database structure is a star schema. In a star schema, there is only one fact table and several dimension tables. Each dimension table has a primary key, which uniquely identifies each row. The fact table joins to the dimension tables using a foreign key relationship to the primary key in each dimension, and the resulting layout of these relationships resembles a star - hence the name, star schema.

 

 

Enter Reality: The Dreaded Chasm Trap Schema

If you’re asking, “What kind of fantasy land are you living in?”, you’ve probably had enough experience in the real world to know that a simple star schema exists mostly in textbooks. The most common chasm trap scenario occurs when joining fact tables from two or more source systems (like Salesforce and Marketo), which may have no relationship to one another except through a shared dimension (like Contacts). This is why it's such a common pain point for reporting and analytics tools and, ironically, part of the most compelling and valuable use cases. Here is what it looks like:

 

 

Unfortunately, some of the most common use cases for analysis involve chasm trap scenarios, including:

  • Campaign Performance: Attribution analysis comparing marketing campaigns with sales data, where both fact tables contain a customer identifier column with a foreign key to a customer dimension table.
  • Product Defects: Joining sensor data with data about customer service calls to explore whether components from certain manufacturers during specific time periods generate more support issues.
  • Loyalty Effectiveness: Comparing website traffic with in-store sales, where you have a loyalty program which allows you to join on the customer loyalty ID number in the loyalty program dimension table.

What’s so scary about that?

Well, nothing. Until the CEO calls your boss to ask why his report shows 20,000 marketing campaigns this quarter when he knows for a fact that there were fewer than 500. That's because most BI software isn't able to detect and account for chasm trap. When computing counts and aggregates (like total sales or transaction volume), the two fact tables of interest are aggregated and joined in a single query. This results in a Cartesian product that incorrectly pairs every row of one fact table with every row of the other and shows inflated results. So in other words, chasm trap creates an overcounting problem, making it almost impossible to accurately analyze data in this format.

Admitting you have a Chasm Trap Problem

First, look at your schema diagram to identify if you have fact tables that you want to join through a shared dimension. Then, run a report that uses columns from both fact tables. Do totals and counts on the results. If the result of the aggregate is too high, congratulations! You have the classic chasm trap overcounting problem.

Crossing the Chasm (Trap)

Most companies handle this problem through complex ETL processes that pre-join and pre-aggregate tables. If you have a data modeling expert, you may even be able to create some complex views that pre-aggregate the data so that the answer returned is correct. That assumes you know in advance what kinds of questions will be asked.

Or you could look for a business intelligence solution that automatically handles chasm traps so that the person asking the question doesn’t have to understand any of this stuff, but still gets the right answer every time. Just sayin’.

×