Trees in SQL with Prime Numbers

Suppose we want to implement a commenting system. Each comment can have one reply (or more), which is also a comment. An reply being a comment, it can have replies as well.
In other words: we have a tree of comments.

A possible approach is to keep a reference of the direct parent. This approach is straightforward. However, traversals are problematic:

  • Retrieving all the parents requires as many self-JOINs as there are parents.
  • We cannot COUNT() the parents of a comment in a comfortable manner.

This article presents an arithmetical approach that is reference-free, thus making traversal operations much smoother.

Here are sample data for our nested comments:

id | parents  |                text
---+----------+----------------------------------------
2  |        1 | 
3  |        2 | First comment
5  |        2 | Second comment
7  |        2 | Third comment
11 |       10 | First comment on the second comment
13 |        6 | First comment on the first comment
17 |        6 | Second comment on the first comment
19 |       14 | Comment on third comment
23 |      266 | Comment on comment on third comment
29 |       10 | Second comment on the second comment
31 |      290 | Comment on the second comment on the second comment
37 |     8990 | Comment on comment on the second comment on the second comment
41 |   332630 | Second comment on comment on comment on the second comment on the second comment
43 | 13637830 | Comment on second comment on comment on comment on the second comment on the second comment
47 |   332630 | First comment on comment on comment on the second comment on the second comment

Each node has a pair of two integers: id and parents. The root has the pair (2, 1) and is empty. Any child with a pair (id, parents) must follow two rules:

  • id is a prime number, and it is unique.
  • parents is the product id × parents of its direct parent.

The critical thing to note is that each parents value is a product of a prime number and previous parents value. The previous parents value is also a product of a prime number and a previous parents value until the root (2, 1) is reached.
It is essential because it means that any parents has a prime decomposition by which we can retrieve all the parents' ids.

Here is the representation of the tree we have modelized in our comment table:

graph

Retrieving the parents of a node

Let's suppose we want comment #47 and its parents.

Comment #47 has a parents = 332630.
332630 = 2 × 5 × 29 × 31 × 37. That is, its parents are the nodes with respectively IDs 37, 31, 29, 5 and 2. We can now retrieve the parents with a simple SQL query:

SELECT * FROM comment WHERE id IN (2, 5, 29, 31, 37);

id | parents |                              text                              
---+---------+----------------------------------------------------------------
2  |       1 | 
5  |       2 | Second comment
29 |      10 | Second comment on the second comment
31 |     290 | Comment on the second comment on the second comment
37 |    8990 | Comment on comment on the second comment on the second comment

(Typically, the factorization method is implemented in the language from which we call the database. But, of course, we could have implemented a factorization routine right into the database and done something like this: SELECT * FROM comment WHERE id IN PrimeFactors(332630);.)

Counting parents is pretty easy too, and we don't even need to ask the database. We simply count how many prime factors we have. Here, we have 4 factors: 5, 29, 31 and 37 (we don't count 2 because it is the root).

graph

Retrieving the children of a node

Children of a given node with the pair (id, parents) are all nodes in the form (_,id × parents).

Again, the SQL query to retrieve them is pretty simple:

SELECT * from comment WHERE parents = 5 * 2;

id | parents |                 text                 
---+---------+--------------------------------------
11 |      10 | First comment on the second comment
29 |      10 | Second comment on the second comment

Counting:

SELECT count(*) FROM comment WHERE parents = 5 * 2;
count 
-------
2

Ensuring Consistency

To insert a new node, we need the parent's ID and its parents value. We multiply them so we can set our :parents.
We also need a new unique and prime ID. Here is a safe manner to insert a new node: We have a table containing as many prime numbers as needed (here, ~10k). Then, we can insert a new node in a single transaction like this:

WITH newid AS (
    SELECT n
    FROM primes 
    WHERE n > (SELECT max(id) FROM comment)
    LIMIT 1;
)
INSERT INTO comment
    (id, parents, text) VALUES 
    ((SELECT n FROM newid), :parents, :text);

Assuming we have a primes table looking like this:

SELECT * FROM primes;
n  
----
2
3
5
7
11
13
17
...
99929
99961
99971
99989
99991
(9592 rows)

Last words

One of the greatest strength of an RDBMS is its consistency model. By implementing the method presented in this article, we lost the ensured consistency of our "references" because we can put any number in the row parents and the DB won't check if it corresponds to something. That check would be granted with a reference-based approach.

It is possible to implement triggers on INSERTs, DELETEs, and UPDATEs to overcome this problem, by programmatically checking consistency properties. However, this would bring a pretty significant amount of complexity.

If you want to comment on this article, please feel free to send me your thoughts by email:
my first name @ this domain name.
You can also reach me at @CalamiaEnzo on Twitter.