Software developer blog

SQL on steroids

A few weeks ago we held an internal Coderetreat at Emarsys for all programmers within the company. When the business intelligence team heard about it, they started joking that they should drop by and implement Conway's Game of Life in SQL. I'm not much of an SQL wizard, but I rose to the challenge. First I tried it in MySQL, since that's the dialect I have the most experience with:

-- one row per living cell: (x, y) alive in generation id
CREATE TABLE generations (`id` INT UNSIGNED, `x` BIGINT, `y` BIGINT);
-- generation zero: a glider
INSERT INTO generations VALUES (0, 0, -1), (0, 1, 0), (0, -1, 1), (0, 0, 1), (0, 1, 1);

-- the 3x3 neighbourhood of a cell, including (0,0) for the cell itself
CREATE TABLE offsets (`x` BIGINT, `y` BIGINT);
INSERT INTO offsets VALUES (-1,-1),(-1, 0),(-1,1),(0,-1),(0,0),(0,1),(1,-1),(1,0),(1,1);

-- derive the next generation from the latest one
INSERT INTO generations
  SELECT MAX(g.id)+1 id, g.x+o.x x, g.y+o.y y -- MAX() keeps ONLY_FULL_GROUP_BY happy; id is constant here
    FROM generations g
      CROSS JOIN offsets o
    WHERE g.id=(SELECT MAX(id) FROM generations)
    GROUP BY g.x+o.x, g.y+o.y
      HAVING (MAX(o.x=0 AND o.y=0) = 1 AND COUNT(1) - 1 BETWEEN 2 AND 3)
        OR (MAX(o.x=0 AND o.y=0) = 0 AND COUNT(1) = 3);

It works, and it's not even that complicated. First we join the last population with an offset table to generate the set of living cells and their neighbors. By grouping on the cell coordinates plus the offsets we get all the necessary information. The expression max(o.x=0 and o.y=0) is exactly one if the cell was alive in the previous generation, otherwise it's zero. Using that we can also count the neighbors: count(1) - max(o.x=0 and o.y=0). The having clause uses this knowledge to filter the grouped rows down to the cells of the next generation.
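
To actually run the simulation, you execute the INSERT once per step and then look at the latest generation. The obvious follow-up query (not part of the original solution, just a convenience):

-- run the INSERT above once per generation, then list the current population
SELECT x, y
  FROM generations
 WHERE id = (SELECT MAX(id) FROM generations)
 ORDER BY y, x;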


How about the second generation?

The only problem is that one has to copy the last query n times to get the nth generation. Normally I would be perfectly happy to stop here at a Coderetreat - after all, it's just an extra while loop - but not with SQL. In SQL we have no loops at all, unless we are willing to use a scripting extension like PL/SQL.
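
Just to show the escape hatch we are refusing to take, here is a minimal sketch of that loop in MySQL's stored procedure language (the procedure name step_generations is mine, not part of the original solution):

DELIMITER //
CREATE PROCEDURE step_generations(IN n INT)
BEGIN
  DECLARE i INT DEFAULT 0;
  WHILE i < n DO
    -- same query as above, executed once per iteration
    INSERT INTO generations
      SELECT MAX(g.id)+1, g.x+o.x, g.y+o.y
        FROM generations g
          CROSS JOIN offsets o
        WHERE g.id=(SELECT MAX(id) FROM generations)
        GROUP BY g.x+o.x, g.y+o.y
          HAVING (MAX(o.x=0 AND o.y=0) = 1 AND COUNT(1) - 1 BETWEEN 2 AND 3)
            OR (MAX(o.x=0 AND o.y=0) = 0 AND COUNT(1) = 3);
    SET i = i + 1;
  END WHILE;
END //
DELIMITER ;

CALL step_generations(6); -- advance six generations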

In MySQL I was pretty stuck, since it has neither recursive nor hierarchical queries. On the other hand, one of the SQL wizards at Emarsys - Levente Otti - was kind enough to help me out by solving the problem in Oracle SQL. A big-big THANK YOU to him! Our first attempt was to put the solution above inside a recursive query, but that didn't work out, since the dialect does not allow having clauses in the recursive part of the query.

It turns out that the problem of generating any number of generations with a single query is perfect if you wish to nerd-snipe a data analyst. After checking in with him a number of times, Levente got back to me almost a day later with the solution below. It takes a similar approach, with one important difference: everything happens completely on the fly, no actual persistence is involved, and even the first generation is part of the query.

-- the initial population of cells
WITH population_zero (x,y) AS 
   (SELECT  0,-1 FROM dual UNION ALL 
    SELECT  1, 0 FROM dual UNION ALL 
    SELECT -1, 1 FROM dual UNION ALL 
    SELECT  0, 1 FROM dual UNION ALL 
    SELECT  1, 1 FROM dual )
--sequence [-1:1] - used in offset
,gen (id) AS (SELECT level - 2 id FROM dual CONNECT BY level < 4)
--3x3 offset
,offset (ox, oy) AS ( SELECT x.id x, y.id y FROM gen x, gen y )
--recursive generator function
,cell_stats(g, x, y, STATUS) AS (
      SELECT 0 g, x, y, 'alive' STATUS FROM population_zero  
    UNION ALL 
      SELECT 
        pz.g + 1 g, pz.x+o.ox nx, pz.y+o.oy ny, 
        CASE 
          WHEN 
          (-- if the cell was alive 
              CASE WHEN o.ox=0 AND o.oy=0 THEN 1 ELSE 0 END = 1
            AND
            -- the neighbour count has to be 2 or 3 for it to stay alive
              -- the ':' separator keeps e.g. (1,23) and (12,3) from colliding
              COUNT (DISTINCT pz.x || ':' || pz.y) OVER (PARTITION BY pz.x+o.ox, pz.y+o.oy)
              - 1 -- exclude the cell itself from the neighbour count
              BETWEEN 2 AND 3
          ) 
          OR 
          (-- if the cell was dead
              CASE WHEN o.ox=0 AND o.oy=0 THEN 1 ELSE 0 END = 0
            AND 
            -- the neighbour count has to be 3 for it to resurrect
              COUNT (DISTINCT pz.x || ':' || pz.y) OVER (PARTITION BY pz.x+o.ox, pz.y+o.oy)
              = 3
          )
          THEN 'alive' ELSE 'dead' END STATUS  
       FROM cell_stats pz, offset o 
      WHERE 
        (pz.STATUS='alive') 
        --termination condition: generate up to generation 6
        AND pz.g < 6
)
SELECT 
DISTINCT g, x, y 
FROM cell_stats 
WHERE STATUS = 'alive' ORDER BY g,x,y;

Easy-peasy, right? I bet you got it immediately! No, you didn't... Okay, let's see if I can explain it.

First we have the input data in population_zero. That's pretty clear, although the Oracle way of saying "Hey, I need these records in that table" - one SELECT ... FROM dual per row - is pretty crafty. Then there is some magic going on to generate the offsets with a hierarchical query.
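
If that CONNECT BY fragment looks opaque, running it in isolation makes it obvious what it does:

-- dual has exactly one row; CONNECT BY level < 4 turns it into three
SELECT level - 2 id FROM dual CONNECT BY level < 4;
-- returns -1, 0, 1; cross joining this with itself yields the 3x3 offsets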

The real magic, though, happens as we generate cell_stats, and this is also where the most significant difference comes in. Unlike in the MySQL solution, the dead neighbors are left in the table, but are marked 'dead'. This way the calculation moves from the having clause into a table column. The significant drawback is that the number of rows in cell_stats grows exponentially as the number of generations increases. But hey... who in their right mind would do something like this in SQL for a production system, right? RIGHT?!
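
You can watch that blow-up happen by swapping the final SELECT of the query above for a per-generation row count (the column aliases here are mine):

SELECT g,
       COUNT(*) total_rows,                          -- rows accumulated for this generation
       COUNT(DISTINCT x || ':' || y) distinct_cells  -- actual distinct cells among them
  FROM cell_stats
 GROUP BY g
 ORDER BY g;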

So what is the lesson here?

It turns out that Conway's Game of Life is Turing complete, and since we just simulated it with a single query, that means we can calculate anything with SQL that can be calculated by a computer. At first you may think this is great, since it shows how powerful SQL is, but it turns out that being powerful is not always a desirable quality.

In 1974 Donald D. Chamberlin and Raymond F. Boyce wrote a paper titled "SEQUEL: A Structured English Query Language". In it they derive the language that was to become SQL from another declarative query language called SQUARE. They make their motivation pretty clear in the paper:

However, there is also a large class of users who, while they are not computer specialists, would be willing to learn to interact with a computer in a reasonably high-level, non-procedural query language.

It turns out that SQL is not an API or a protocol. It is a domain-specific language (DSL) for the analysis of tabular data that has been abused and tormented by the software community to the point where it has lost its identity.

Having a DSL that is Turing complete is usually not a good sign: it can imply that the language is no longer geared toward being easy for domain specialists to use. Queries like the one above are hard to comprehend even for programmers, and they are definitely not something a non-technical person would be able to interact with.

The hallmark of a good DSL is that it's extended very cautiously, and only with features that give non-techies a better user experience. The problem with SQL is that we seem to have forgotten its origins, and today most people think of it as the de facto standard API for relational databases. This also meant that the extensions it received were focused more and more on squeezing performance out of the relational database behind it, while still trying to maintain its declarative style. The result is a language that even developers need time to learn and understand.

On the other hand, SQL is a horrible API for any database. It's fine if you have a low number of simple queries, but as the application grows and the number of requests per second gets out of hand, optimizing database performance turns into a profession of its own. One has to understand the query plan, find the wasteful part in it, work out a better way to collect the data, and finally reverse engineer that solution into an SQL query that the query optimizer will hopefully translate into the desired algorithm.
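
That investigation typically starts with the query plan. In MySQL, for instance, using our generations table from above:

-- ask the optimizer how it intends to execute the query
EXPLAIN SELECT x, y
          FROM generations
         WHERE id = (SELECT MAX(id) FROM generations);
-- a full table scan in the output is usually the hint to add an index, e.g.:
-- ALTER TABLE generations ADD INDEX (id);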

What can we do about this?

Unfortunately I can't give you a better alternative for a relational database API - not because I can't imagine a better one, but because I don't know of one that exists, nor do I have the time to create one. Some of the NoSQL databases - especially aggregate databases like MongoDB - have better APIs. For my pet projects I usually end up using MongoDB, and I plan to give Neo4j a try soon; in our production systems, however, we will still be struggling with SQL for many years to come.

I don't wish to give the impression that I see SQL as something purely evil, although I'm pretty sure that database vendors offering expensive DBA courses for their products are quite happy with the status quo. SQL is perfect for what it was designed for. BI solutions will probably have relational databases behind them for decades to come, especially those that provide an SQL terminal as part of their user interface.

However, I do say that SQL is not the only option when it comes to persistence. Other solutions are a better fit for most of the purposes we use SQL for today. Using SQL as the Swiss army knife of persistence has been the normal mode of operation for decades, but now that NoSQL is around and gaining popularity, the least we should do is learn and understand the alternatives, so that we can start using them in production as soon as possible.
