Scalability

Why You Need Just In Time (JIT) Code Compilation

Did you know that only 63% of people who climb Mount Everest actually succeed?  In fact, over 6% of the people who attempt the climb die during the attempt.  And that percentage is increasing as the accessibility and popularity of the world’s tallest mountain grows.

Creating a fast, scalable database that is general and efficient is a lot like climbing a mountain, though with a much lower mortality rate: it requires preparation, training, and a deep understanding of what works.

This post won’t give you everything you need to summit the mountain of scalable databases, but by the end you’ll have a better understanding of one of the core tools—Just-In-Time compilation (JIT).

High-performance databases employ a slew of techniques such as massive parallelism, vectorization, and caching to achieve very low latencies. JIT is a core piece of accomplishing this.

You can view the motivation for JIT as follows: imagine a skilled software engineer sits between a database user and a database. Every time the user issues a query to the database, the engineer writes a hand-optimized program (in raw C++, Java, etc.) to answer that query.

You can imagine that this program would be very specialized for the underlying schema, data-types, and specific structure of the query. JIT allows us to do just that, minus the cost of a full-time engineer!

Without JIT, achieving this is just not feasible, because it would require supporting infinitely many combinations of schemas and data-types ahead of time.

To see why JIT is so critical, let’s look at a simple group-by query. Typically, databases use hash-grouping to execute such queries, where grouping columns are placed in a tuple and used as keys in a hash map. 

So what data structure should we use to store these tuples? A simple array won’t work because the grouping columns can have different data-types. Even templatization doesn’t work because all template instantiations need to be known at compile time, and there are infinitely many of these if we plan to support arbitrary group-by queries. 

Using a wrapper-union struct or polymorphism works, but each has significant, negative performance implications. Using JIT, we can dynamically generate code to support a specific grouping column combination. In fact, this scenario is not specific to grouping queries and arises frequently in query execution.
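To make the wrapper-union alternative concrete, here is a minimal sketch (the `Cell` struct and its names are illustrative, not from the code below): every cell carries a type tag plus storage for each supported type, and every access pays for a runtime tag check, which is the performance cost mentioned above.

```cpp
#include <cassert>
#include <string>

// Illustrative wrapper-union cell: one struct can hold any supported
// data-type, at the price of a tag check on every access.
struct Cell {
  enum Tag { INT, DOUBLE, STRING } tag;
  union {        // numeric types share storage
    int i;
    double d;
  };
  std::string s; // non-trivial types can't sit in a plain union without care

  static Cell MakeInt(int v) { Cell c; c.tag = INT; c.i = v; return c; }
  static Cell MakeDouble(double v) { Cell c; c.tag = DOUBLE; c.d = v; return c; }

  int GetInt() const { assert(tag == INT); return i; }          // runtime check
  double GetDouble() const { assert(tag == DOUBLE); return d; } // runtime check
};
```

The tag checks and wasted storage are exactly the overhead a JIT-generated, schema-specific struct avoids.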

In a database, JIT works as follows: given an input query, a plan is first generated for answering that query. Then, given that plan, the system generates source code, composing various modules together to come up with an end-to-end program that can answer the input query.

At this point the generated source code is just text—no different from what a human engineer would write, except it was automatically generated and customized for the input query! Next, this code is compiled using a compiler-backend library like LLVM. Finally, this compiled code can be called into, and voila: an optimized program for answering the input query.

Of course, using JIT has its own challenges. For example, code generation and compilation take up precious time during query execution. However, we can use caching to avoid generating and compiling code for queries we’ve already seen.
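One way to sketch such a cache (the fingerprinting scheme and names here are hypothetical, not a specific database's implementation) is to key compiled query functions by a canonical plan fingerprint, e.g. the operator tree plus schema with literals stripped, so that repeated query shapes skip codegen and compilation entirely:

```cpp
#include <functional>
#include <string>
#include <unordered_map>

// Hypothetical cache of compiled queries, keyed by a canonical plan
// fingerprint. A hit returns the previously compiled function; a miss
// pays the codegen + compilation cost exactly once.
using CompiledQuery = std::function<void()>;

class QueryCache {
 public:
  CompiledQuery GetOrCompile(const std::string& plan_fingerprint,
                             const std::function<CompiledQuery()>& compile) {
    auto it = cache_.find(plan_fingerprint);
    if (it != cache_.end()) return it->second;  // hit: no compilation
    CompiledQuery q = compile();                // miss: compile once
    cache_.emplace(plan_fingerprint, q);
    return q;
  }

 private:
  std::unordered_map<std::string, CompiledQuery> cache_;
};
```

The key design choice is what goes into the fingerprint: stripping out literal constants lets queries that differ only in their parameters share one compiled program.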

Debugging JIT programs is also notoriously hard, because the source code is ephemeral (it disappears after it has been compiled). To work around this we can have a debug mode for code generation and compilation, where we write JIT code to a file and use an offline compiler along with dlopen(). We can then debug the JIT code as normal.
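A minimal sketch of that debug mode might look like the following (the file names and the generated `RunQuery` function are illustrative, and it assumes a POSIX system with a C++ compiler on the PATH): persist the generated source, compile it offline with debug symbols, and load the result with dlopen(), so gdb can step through the generated code like any other.

```cpp
#include <cstdio>
#include <cstdlib>
#include <dlfcn.h>

// Debug-mode sketch: write JIT-generated source to a file, compile it
// with an offline compiler, and load it via dlopen(). The source file
// and debug symbols survive, so the code can be debugged as normal.
long DebugModeCompileAndRun() {
  // 1. Write the generated source to disk instead of discarding it.
  std::FILE* f = std::fopen("jit_query.cc", "w");
  if (!f) return -1;
  std::fputs("extern \"C\" long RunQuery() { return 42; }\n", f);
  std::fclose(f);

  // 2. Compile offline with -g and no optimization, so gdb can map
  //    instructions back to the generated source lines.
  if (std::system("c++ -O0 -g -fPIC -shared jit_query.cc -o jit_query.so") != 0)
    return -1;

  // 3. Load the shared object and call the query entry point, just as
  //    the engine would call into in-memory JIT code.
  void* handle = dlopen("./jit_query.so", RTLD_NOW);
  if (!handle) return -1;
  auto run = reinterpret_cast<long (*)()>(dlsym(handle, "RunQuery"));
  long result = run ? run() : -1;
  dlclose(handle);
  return result;
}
```

Breakpoints set on `RunQuery` then behave exactly as they would in hand-written code, because from the debugger's point of view it is hand-written code.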

Using JIT is critical for building high-performance query execution engines. By constructing specialized code for each input query, JIT allows us to leverage several optimizations such as loop-unrolling and data-type-specific optimizations.

Here’s an example of how we build the code. Given the schema and query below, the first two blocks show the generic, interpreted-style implementation a database must fall back on without JIT; the final block shows the specialized code JIT can generate for this exact query.

——————————————————
movie_table (STRING movie_name, INT year, STRING genre, DOUBLE revenue)
SELECT year, genre, SUM(revenue) FROM movie_table GROUP BY year, genre;
——————————————————  

struct Value {
  enum Type { STRING, INT, DOUBLE };

  virtual ~Value() {}
  virtual long Hash() const = 0;
  virtual string GetString() const { return "invalidString"; }
  virtual int GetInt() const { return -1; }
  virtual double GetDouble() const { return -1; }
  virtual Type type() = 0;
  virtual void Add(Value* v) = 0;

  static Value* Make(Type type) { … }
  template <class T> static Value* Make(T v) { … }
};

struct IntValue : public Value {
  IntValue(int v_in) : v(v_in) {}
  virtual long Hash() const { return (long)v; }
  virtual int GetInt() const { return v; }
  virtual Type type() { return INT; }
  virtual void Add(Value* v_in) {
    assert(v_in->type() == INT);
    v += v_in->GetInt();
  }

  int v;
};

struct DoubleValue : public Value { … };

struct StringValue : public Value { … };

long Hash(const vector<Value*>& row) {
  long h = 0;
  for (auto& v : row) {
    h = h * 31 + v->Hash();  // combine the per-column hashes
  }
  return h;
}

————————

// Hash and equality functors over rows of Value*; note that equality must
// compare the pointed-to values, not the pointers themselves.
struct RowHash { size_t operator()(const vector<Value*>& row) const { return Hash(row); } };
struct RowEq { bool operator()(const vector<Value*>& a, const vector<Value*>& b) const { … } };

vector<vector<Value*>> HashGrouping(vector<int> grouping_indices, int aggregate_index) {
  unordered_map<vector<Value*>, Value*, RowHash, RowEq> group_map;
  for (auto& row : movie_table.rows()) {  // row is vector<Value*>
    vector<Value*> k;
    for (int index : grouping_indices) {
      k.push_back(row[index]);
    }
    if (group_map.find(k) == group_map.end()) {
      group_map[k] = Value::Make(row[aggregate_index]->type());
    }
    group_map[k]->Add(row[aggregate_index]);
  }

  vector<vector<Value*>> results;
  for (auto& entry : group_map) {
    vector<Value*> r = entry.first;  // the grouping key
    r.push_back(entry.second);       // the aggregated value
    results.push_back(r);
  }
  return results;
}

——————————————————

struct row_type { string movie_name; int year; string genre; double revenue; };

struct key_type {
  int year;
  string genre;
  bool operator<(const key_type& o) const {  // required to use key_type as a map key
    return year != o.year ? year < o.year : genre < o.genre;
  }
};

struct result_type { int year; string genre; double revenue; };

using value_type = double;

map<key_type, value_type> group_map;
for (auto& row : movie_table.rows()) {
  key_type k { row.year, row.genre };
  group_map[k] += row.revenue;
}

vector<result_type> results;
for (auto& entry : group_map) {
  results.push_back({entry.first.year, entry.first.genre, entry.second});
}
return results;

——————————————————

Hopefully this post has given you a better understanding of one of the tools used to build scalable, high-performance databases. Just like a pair of great hiking boots, this is just one tool used to achieve the goal—but it’s essential to success.

Happy climbing!
