About tree traversal options in Django, and how I evaluate third-party libraries
Background
I'm working on a side project for my wife, and the data structure for this project involves models with a recursive foreign key relationship: I have a Concept
model that can have subconcepts. Currently, the business requirements only need a single level of depth, but I want the models to support arbitrary levels of recursion for two reasons: 1) because it seems like one of those things that would be tedious to change after it's been implemented, and 2) because this side project is mostly an opportunity for me to experiment with different technologies and design patterns. It's a good learning opportunity, even if it's a bit overkill.
Tree traversal
So the big question with this kind of implementation, of course, is how to efficiently make queries. At its core, it's a tree traversal problem. Now, I don't have a CS background, so I have only a passing familiarity with different tree traversal algorithms. I know the concepts of breadth-first search and depth-first search, but I'd have to do some googling in order to implement an algorithm for it.
So I went about investigating options for managing recursive relationships in Django. The first promising lead I found was this Stackabuse article written by Adam McQuistan on modified preorder tree traversal, comparing it to an adjacent list implementation. At first, I was quite intrigued by the implementation. It offered very efficient SELECT
queries on the underlying data, it was easy to comprehend, and it had a few libraries offering an implementation.
But the more I considered it, the less enthused I became. The MPTT implementation offers great SELECT
performance at the expense of bad create/update performance. Updating entries in an MPTT implementation requires updating (potentially) every entry in the table. Now, on its own, this was not an insurmountable problem for my use case. The business requirements on the app I'm building do not involve frequent updates to this table.
But what soured me on the implementation is the possibility of data corruption. Any time an update to a table requires updating most/all of the other rows of that table, the possibility of data seems very high. Any amount of concurrent writes makes me nervous with such an implementation, and this is obviously not a problem that can be solved with a simple select_for_update
call in Django.
Diving more into these possibilities led me to django-tree-queries, as well as this blog post by the library's author. This blog post addresses the concerns I had with an MPTT implementation, but it also calls out a number of things that I value when choosing third-party libraries -- and here's where this post is going to switch gears.
What I look for in a library
The aforementioned blog post on django-tree-queries had an interesting section: "The constraints of django-tree-queries." Not features. Constraints. Some of those constraints include "as little customization as possible" and "depth-first search only." In general, the library has a small API surface, and the author deliberately justifies that choice. Not including the test suite, the library has 280 SLOC. It supports Django 1.8 and Python 2.7 -- and this is a great sign, in my opinion. Not because I'm using those outdated versions, but because it suggests that the library is simple, robust, and unlikely to introduce breaking API changes.
In my experience, this is the kind of library that I can drop into a project in order to do one thing that I need it to do, do it well, and require the least amount of effort from me -- both in the present, implementing the library, and in the future, when I need to maintain the project. So a big thank you to Matthias Kestenholz. Maybe read his blog or buy him a coffee.