Trees in SQL with Prime Numbers

By Enzo Calamia

This article presents an original way to implement trees in SQL. The technique works well, but I would not advise anyone to use it in production. It is intended as a fun way to do things!

Suppose we want to implement a commenting system. Each comment can have one reply (or more), which is also a comment. An reply being a comment, it can have replies as well.
In other words: we have a tree of comments.

A possible approach is to keep a reference of the direct parent. This approach is straightforward. However, traversals are problematic (at least, without recursive CTEs):

  • Retrieving all the parents requires as many self-JOINs as there are parents.
  • We cannot COUNT() the parents of a comment in a comfortable manner.

This article presents an arithmetical approach that is reference-free, thus making traversal operations much smoother.

Here are sample data for our nested comments:

id | parents  |                text
---+----------+----------------------------------------
2  |        1 | 
3  |        2 | First comment
5  |        2 | Second comment
7  |        2 | Third comment
11 |       10 | First comment on the second comment
13 |        6 | First comment on the first comment
17 |        6 | Second comment on the first comment
19 |       14 | Comment on third comment
23 |      266 | Comment on comment on third comment
29 |       10 | Second comment on the second comment
31 |      290 | Comment on the second comment on the second comment
37 |     8990 | Comment on comment on the second comment on the second comment
41 |   332630 | Second comment on comment on comment on the second comment on the second comment
43 | 13637830 | Comment on second comment on comment on comment on the second comment on the second comment
47 |   332630 | First comment on comment on comment on the second comment on the second comment

Each node has a pair of two integers: id and parents. The root has the pair (2, 1) and is empty. Any child with a pair (id, parents) must follow two rules:

  • id is a prime number, and it is unique.
  • parents is the product id × parents of its direct parent.

The critical thing to note is that each parents value is a product of a prime number and previous parents value. The previous parents value is also a product of a prime number and a previous parents value until the root (2, 1) is reached.
It is essential because it means that any parents has a prime decomposition by which we can retrieve all the parents' ids.

Here is the representation of the tree we have modelized in our comment table:

graph

Retrieving the parents of a node

Let's suppose we want comment #47 and its parents.

Comment #47 has a parents = 332630.
332630 = 2 × 5 × 29 × 31 × 37. That is, its parents are the nodes with respectively IDs 37, 31, 29, 5 and 2. We can now retrieve the parents with a simple SQL query:

SELECT * FROM comment WHERE id IN (2, 5, 29, 31, 37);

id | parents |                              text                              
---+---------+----------------------------------------------------------------
2  |       1 | 
5  |       2 | Second comment
29 |      10 | Second comment on the second comment
31 |     290 | Comment on the second comment on the second comment
37 |    8990 | Comment on comment on the second comment on the second comment

(Typically, the factorization method is implemented in the language from which we call the database. But, of course, we could have implemented a factorization routine right into the database and done something like this: SELECT * FROM comment WHERE id IN PrimeFactors(332630);.)

Counting parents is pretty easy too, and we don't even need to ask the database. We simply count how many prime factors we have. Here, we have 4 factors: 5, 29, 31 and 37 (we don't count 2 because it is the root).

graph

Retrieving the children of a node

Children of a given node with the pair (id, parents) are all nodes in the form (_,id × parents).

Again, the SQL query to retrieve them is pretty simple:

SELECT * from comment WHERE parents = 5 * 2;

id | parents |                 text                 
---+---------+--------------------------------------
11 |      10 | First comment on the second comment
29 |      10 | Second comment on the second comment

Counting:

SELECT count(*) FROM comment WHERE parents = 5 * 2;
count 
-------
2

Ensuring Consistency

To insert a new node, we need the parent's ID and its parents value. We multiply them so we can set our :parents.
We also need a new unique and prime ID. Here is a safe manner to insert a new node: We have a table containing as many prime numbers as needed (here, ~10k). Then, we can insert a new node in a single transaction like this:

WITH newid AS (
    SELECT n
    FROM primes 
    WHERE n > (SELECT max(id) FROM comment)
    LIMIT 1;
)
INSERT INTO comment
    (id, parents, text) VALUES 
    ((SELECT n FROM newid), :parents, :text);

Assuming we have a primes table looking like this:

SELECT * FROM primes;
n  
----
2
3
5
7
11
13
17
...
99929
99961
99971
99989
99991
(9592 rows)

Last words

One of the greatest strength of an RDBMS is its consistency model. By implementing the method presented in this article, we lost the ensured consistency of our "references" because we can put any number in the row parents and the DB won't check if it corresponds to something. That check would be granted with a reference-based approach.

It is possible to implement triggers on INSERTs, DELETEs, and UPDATEs to overcome this problem, by programmatically checking consistency properties. However, this would bring a pretty significant amount of complexity.

References

Serhiy Morozov, Hossein Saiedian and Hanzhang Wang (January 2014). Reusable Prime Number Labeling Scheme for Hierarchical Data Representation in Relational Database. Journal of Computing and Information Technology. p.3, sect. 2. (DOI: 10.2498/cit.1002310)