Data-Oriented Programming

My recent reading interests

In 2020 and 2021, before becoming a dad and with social interaction reduced by COVID, I spent a lot of time reading technical books and working on my craft. I learned about Domain-Driven Design, Hexagonal Architecture and Data-Oriented Programming. I used the first two in my work on Krispr with Daniel Low at Airbnb. They made it easier to expand the project and to rework its dependencies and delivery mechanism as our requirements changed. Today, however, I am writing about my latest interest: Data-Oriented Programming.

Data-Oriented Programming

Yehonathan Sharvit popularized the term Data-Oriented Programming to encompass a set of principles that make it easier to grow a codebase. While none of those principles are novel, they are not mainstream in the industry, and they indeed make it far easier to work with code.

After reading Yehonathan's blog posts over the years, I bought his book and I loved it! It was a breath of fresh air when it comes to technical writing style. The book tells the story of two developers who learn the Data-Oriented Programming principles together. It uses a conversational style that is engaging and makes you want to learn more about their stories and about the next problem they are solving.

In this article, I am going to explain 3 key ideas I got out of the book:

OOP forces you to make too many decisions at once, which constrains the design and leads to rewrites
Immutable data structures are a must and make it easier to reason about a program
Representing all the data using generic and open data structures makes it easier to explore the state of a system and expand it

The book covers much more than that, and I encourage you to grab a copy at Manning to see for yourself. Try the sample first to get a sense of the conversational style I mentioned earlier.

1. OOP forces you to make too many decisions at once, which constrains the design and leads to rewrites

When working on a side project, my default choice until recently was OOP with Python. I have noticed that as my projects grow, I tend to rewrite a large portion of the code over time to express my ideas better, pick better names and class hierarchies, and represent the data or side effects differently. OOP appeals to my perfectionist tendency and can at times make me want to rewrite parts of the system that do not need rewriting. It stimulates me intellectually to look for the best abstractions over the data and the easiest patterns to use.

After reading Data-Oriented Programming's chapter on OOP, I changed my mind. I felt compelled to try building all my side projects in Clojure and practice separation of code and data from the beginning. That's the style I adopted in my latest project: a tool to track my budget and automatically categorize incoming transactions (like Mint, but working better at categorizing my transactions). I wrote the project only once and haven't felt the urge to rework core parts of it. Recently, I also started refactoring the code to extract all the data out of the Clojure files and into a config file. It was easy and took 10 minutes.

Sneak peek (as it isn't yet open-sourced) of how to configure it:

{
 :start "01/01/2021"
 :end "12/31/2021"
 :sources [{:type "folder-recursive" :value "/Users/laurent/Downloads"}]
 :input-folder "/Users/laurent/Downloads"
 :report-name "2021"
 :output-folder "/Users/laurent/Documents/budget"
 :exclusion-rules '[("description%" #{"AUTOPAY PAYMENT"})]
 :categories-rules
 '[
   ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
   ;; High priority rules                             ;;
   ;; These entries categorize specific transactions  ;;
   ;; with a specific date, they take precedence over ;;
   ;; the other rules                                 ;;
   ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;


   ("venmo"                   "laurent/haircut" {:date "03/25/2021" :amount 40.0 :note "Tried a new hair style, it turned out great, go back there" :rating "A"})
   ...
   ]
 }

In short, I am going to continue separating data and code and adopt a functional style from the get-go for my side projects. It simply made me more productive and removed my propensity to refactor toward better abstractions.
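To make the separation of code and data concrete, here is a minimal Python sketch. It is not the actual budget tool (which is written in Clojure and not yet open-sourced): the rule format, the `categorize` function and the sample transactions are all assumptions made up for illustration.

```python
# Data: categorization rules are plain values, so they could live in a config file.
# Hypothetical format: (substring to look for in the description, category).
RULES = [
    ("venmo", "laurent/haircut"),
    ("grocery", "food/groceries"),
]


def categorize(transaction, rules):
    """Return the category of the first rule matching the description, else None."""
    description = transaction["description"].lower()
    for needle, category in rules:
        if needle in description:
            return category
    return None


# Made-up transactions, represented as plain dictionaries.
transactions = [
    {"description": "VENMO PAYMENT", "amount": 40.0},
    {"description": "Corner Grocery", "amount": 23.5},
    {"description": "Unknown merchant", "amount": 5.0},
]

categories = [categorize(t, RULES) for t in transactions]
print(categories)  # ['laurent/haircut', 'food/groceries', None]
```

Because `categorize` receives the rules as an argument, changing how transactions are categorized means editing data, not code.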

2. Immutable data structures are a must and make it easier to reason about a program

When reading code, you have to keep a lot of context and abstractions in your head. I find in my work with others that, unlike reading prose, reading a complex piece of code can be challenging for even the most seasoned programmer. One thing we must keep in mind when working with languages that don't have immutable data structures by default (most languages) is whether an operation happens in place or returns a new copy.

For example in Python, to sort a list:

list.sort() sorts the list in place, replacing the original order, whereas sorted(list) returns a sorted copy of the list without changing the original.
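The difference is easy to see side by side. The `amounts` list below is made-up sample data:

```python
amounts = [40.0, 12.5, 99.9]

# sorted() returns a new list and leaves the original untouched.
ordered = sorted(amounts)
print(amounts)  # [40.0, 12.5, 99.9]
print(ordered)  # [12.5, 40.0, 99.9]

# list.sort() reorders the list in place and returns None.
result = amounts.sort()
print(amounts)  # [12.5, 40.0, 99.9]
print(result)   # None
```

Note that `list.sort()` returns `None`, so writing `amounts = amounts.sort()` silently loses the list, which is exactly the kind of detail you have to keep in your head.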

Some languages and communities have found ways to distinguish the two. For example, the Ruby community has standardized on the ! suffix to indicate operations that mutate in place. Yet it is up to individual programmers to remember the convention and apply it, which is why, even when I read Ruby code, I sometimes have to check which style of operation I am using.

In Data-Oriented Programming, Yehonathan advocates for immutable and persistent data structures. First, some definitions:

Immutable Data Structure: A data structure that cannot change once it has a value. It is also known as a "frozen" data structure. It is an abstract concept.
Persistent Data Structure: A way of implementing efficient Immutable Data Structures using Structural Sharing.
Structural Sharing between two data structures: When two data structures refer to the same piece of memory even though they are different from one another. An example clarifies it: imagine the list ["a", "b"] and the list ["a", "b", "c"]. The latter can build on top of the former, reusing its memory layout and adding "c" at the end of it.
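Structural sharing can be sketched in a few lines of Python. This is a simplified illustration, not how Clojure implements it: it uses a singly linked list, where the cheap operation is adding at the front rather than at the end as in the example above. The `Node` class and the helper functions are hypothetical names chosen for this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    """One immutable cell of a singly linked list."""
    value: str
    rest: Optional["Node"] = None


def prepend(value, node):
    """Return a new list whose tail is the existing list, without copying it."""
    return Node(value, node)


def to_python_list(node):
    """Walk the cells and collect their values."""
    values = []
    while node is not None:
        values.append(node.value)
        node = node.rest
    return values


shorter = prepend("b", prepend("c", None))  # b -> c
longer = prepend("a", shorter)              # a -> b -> c

print(to_python_list(shorter))  # ['b', 'c']
print(to_python_list(longer))   # ['a', 'b', 'c']
print(longer.rest is shorter)   # True: the tail is shared, not copied
```

Since no cell can ever change, it is safe for `longer` to reuse the memory of `shorter`. Production-grade persistent vectors rely on tree-based layouts so that adding at the end can share structure as well.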

Immutable Data Structures are mainstream, and I have seen them used in the industry, especially in Java with Guava. Persistent Data Structures are not, and it is a shame, as they are an efficient way to implement Immutable Data Structures! Clojure is the only language I know of where all the built-in data structures are persistent by default.

Pro-tip: If you ever mention Persistent Data Structures in an interview as a way to solve a problem, make sure you know how to implement them (read some papers, for example this one).

When you use Immutable Data Structures (persistent or not), a whole class of bugs goes away. You no longer have to think about whether an operation mutates the underlying data: it simply does not.
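Here is a small Python sketch of the kind of bug that disappears. Both functions and the sample amounts are hypothetical. Python's built-in tuple stands in for an immutable data structure, although it is not persistent in the sense defined earlier.

```python
def with_fee_mutating(amounts):
    """Appends to the caller's list: the caller's data changes too."""
    amounts.append(2.5)
    return amounts


def with_fee(amounts):
    """Builds a new tuple: the caller's data cannot be affected."""
    return amounts + (2.5,)


january_list = [40.0, 12.5]
with_fee_mutating(january_list)
print(january_list)  # [40.0, 12.5, 2.5]: changed behind the caller's back

january = (40.0, 12.5)
total = with_fee(january)
print(january)  # (40.0, 12.5)
print(total)    # (40.0, 12.5, 2.5)

try:
    january.append(2.5)  # tuples have no mutating methods
except AttributeError as error:
    print("cannot mutate:", error)
```

With the tuple, the question "does this function change my data?" never comes up, because no function can.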

I liked how Yehonathan explains all these concepts in Data-Oriented Programming and links to libraries in common languages for using such data structures. It clarified the terms for me and made me even more convinced that Immutable Data Structures are almost always the way to go. The only exception is highly performance-sensitive code, but I write little of that in my job.

3. Representing all the data using generic and open data structures makes it easier to explore the state of a system and expand it

Earlier, I mentioned how I configure my budget analyzer. In the configuration, you may have noticed this line:

("venmo" "laurent/haircut" {:date "03/25/2021"
                            :amount 40.0
                            :note "Tried a new hair style, it turned out great, go back there"
                            :rating "A"})

Nothing in the code of the project knows about :note and :rating. I just added them as I wrote the rule to categorize the transaction, to keep a record of the fact that I liked that haircut. That is possible only because I represent all my data using generic and open data structures. In Clojure, I use lists, vectors, maps and primitive types to represent all the data in my program. That may sound scary if you are used to OOP, but once again, Yehonathan does a great job in his book of showing why it isn't so scary at all!
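The same idea can be sketched in Python with tuples and dictionaries. This is a hypothetical transcription of the rule above, not the tool's real matching logic: the `matches` function and the shape of `transaction` are assumptions for illustration.

```python
# A rule as generic data: (source, category, details).
rule = (
    "venmo",
    "laurent/haircut",
    {
        "date": "03/25/2021",
        "amount": 40.0,
        "note": "Tried a new hair style, it turned out great, go back there",
        "rating": "A",
    },
)


def matches(transaction, rule):
    """Only looks at the keys it knows about; extra keys are ignored."""
    source, _category, details = rule
    return (
        transaction["source"] == source
        and transaction["date"] == details["date"]
        and transaction["amount"] == details["amount"]
    )


transaction = {"source": "venmo", "date": "03/25/2021", "amount": 40.0}
print(matches(transaction, rule))  # True

# The extra keys are still there for anything that wants them later.
print(rule[2]["rating"])  # A
```

With a class that declares a fixed set of fields, adding the note and the rating would have meant changing the class first. With an open map, the data can grow without touching the code.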

Thanks to this small change in approach, I can use rich generic operations on the data (as provided, for example, by the Clojure standard library). I can also serialize and deserialize easily, inspect the running state of the system and provide a time-traveling mechanism for the data.
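A rough Python sketch of those three benefits follows. The state layout is invented for the example, and since Python dictionaries are mutable, immutability is only a convention here: every "update" builds a new value instead of changing the old one.

```python
import json

state_v1 = {"transactions": [{"description": "haircut", "amount": 40.0}]}

# "Updating" builds a new value; the previous one is kept as a snapshot.
state_v2 = {
    **state_v1,
    "transactions": state_v1["transactions"]
    + [{"description": "groceries", "amount": 23.5}],
}
history = [state_v1, state_v2]

# Generic operations work on any map or list, no custom class needed.
total = sum(t["amount"] for t in state_v2["transactions"])
print(total)  # 63.5

# Serialization round-trips without writing any conversion code.
restored = json.loads(json.dumps(state_v2))
print(restored == state_v2)  # True

# Time travel: any earlier snapshot is still available, unchanged.
print(len(history[0]["transactions"]))  # 1
```

Going back in time is just a matter of picking an older entry in `history`.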

Conclusion

If you liked some of the ideas I presented above, I encourage you to read Data-Oriented Programming and see for yourself how you can approach programming tasks from a different angle than the one you are used to. It has become my way of working on side projects and whenever I am not constrained to an OOP codebase.