We're the top company on Modern Data Stack! Give us an upvote here.
bubbles svg

Cloud Data Viz and Analytics Health Check

Uncover the fitness of your Cloud Data Viz & Analytics

Get my free score

How Astrato prevents users from SQL traps

When querying database tables with the structured query language (SQL), users may be facing problems when tables are joined. Tables usually have different granularities (or grains), which leads to the risk of incorrect results.

In Astrato, we followed the paradigm that the arrows of a many-to-one relationship, e.g. fact→dimension, are displayed from the left to right, so the data model forms a horizontal tree where the root, the most granular table, is on the most left side. Astrato also supports the scenario of multiple fact tables: this is modeled as a Directed Acyclic Graph (DAG) forming a Polytree.

See references:

https://en.wikipedia.org/wiki/Directed_acyclic_graph

​​https://en.wikipedia.org/wiki/Polytree

The following data set contains 4 fact tables (Sales, Shipments, Returns, and Purchases) and 3 dimension tables (Clients, Products, and Suppliers). In terms of Graph Theory, this is a DAG Polytree. Please notice that the tree has 3 roots: Shipments, Returns, and Purchases.

When a user drags and drops some fields, a query is created. Astrato’s Data Analytics Engine (DA Engine) will analyze the graph and the properties of the query, such as dimensions, measures and filters. On this occasion, the DA Engine will detect the presence of SQL traps.

A SQL fan trap happens when:

  • two tables that are in a many-to-one relationship are joined and
  • the query contains a numeric column that comes from the “one” side of the relationship, and is aggregated with SUM or AVG.

For completeness, the scope is a bit wider. In general, the definition is valid with all the NRTD (Non Resilient To Duplication) aggregations: SUM, COUNT non-distinct, AVG, MEDIAN, MODE, LISTAGG, STDDEV, VARIANCE

Let’s illustrate a fan trap example using two tables from the data view above, where the query contains two measures and one dimension:

SALES (one):


The total amount of SOLD products is 10.

SHIPMENTS (many):


The total amount of SHIPPED products is 6.

Please note that the column SOLD comes from Sales, which is from the “one” side of the relationship, and it is aggregated by SUM. So, this is a fan trap.

Also, please note that the 4 items of SA04 are not shipped yet.

Now the user wants to know “how much was ordered and shipped, by product”. This is a very common business question.

Let’s do the query with a SELECT statement and join the two tables to sum up the both quantities by product:

The query result shows that 5 items of PR01 were sold, and only 1 was shipped. This is true. It also shows that the products PR02 and PR03 were partially shipped. But this is NOT TRUE!

We saw earlier that the total of SOLD is 10. But now it appears to be 15. How is it possible?

I think you guessed it: the fan trap caught us!

The reason is very easy to understand by watching the image below: the sales SA02 and SA03 are duplicated.


This result is clearly incorrect.

Now, let’s ask Astrato. This is the result:


The result in Astrato is correct. How is this possible? 

Because Astrato followed and implemented the principles of the Unified Star Schema (USS) invented by Francesco Puppini. (https://technicspub.com/uss/)

The USS methodology requires modeling and loading a table called the “Puppini Bridge”. Astrato went a step further and does this automatically for you, out of the box! So there is no need for any ETL in your data source!

Astrato separates the different grains into sub-queries, which then are combined by the set-operation UNION ALL before the aggregation:

Astrato also adds comments automatically to the SQL statement nicely (here in gray color) so it makes it fully transparent and readable to the user. Plus, the benefit is that the user can copy the SQL statement and can run it for validation directly in the database or in any other analytical tool. This is not possible in most of the BI tools available on the market today, because they perform part of the calculation in their internal in-memory engine. Astrato, instead, is 100% SQL!

If a query contains three or more tables joined together, this may lead to a chasm trap. Chasm traps are solved exactly the same way in Astrato: by separating the grains into sub-queries.