Static-Time Relational Query Plan Generator

19 May 2011

At PLDI'11 there is a paper, "Data Representation Synthesis", by Peter Hawkins, Alex Aiken, Kathleen Fisher, Martin Rinard, and Mooly Sagiv (PLDI'11, June 4-8, 2011, San Jose, California, USA); I discovered it on Lambda the Ultimate. In sum, the idea is due to me and was communicated directly to one of the authors years earlier; however, the authors do not acknowledge that fact.

The core idea is a "Static-Time Relational Query Plan Generator": one represents data in the relational model and then runs a query plan generator at *static* time to generate functions that query and update the model. There is more than one way to generate the data structures, as well as more than one way to implement a given query or update; therefore, allowing for human guidance of this static-time query plan generator is part of the problem.

Again, the idea is not to use this as part of a database, but to render out the data structures and functions in some other language, such as C or C++, and then allow linking against them with other code. Alternatively, one could simply integrate this functionality into a new programming language. Either way, the overhead of a runtime SQL engine is obviated. More importantly, the ability to design and maintain data structures in a declarative rather than imperative language and generate code guaranteed to be correct is a huge boon to software engineers.
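To make this concrete, here is a minimal sketch, in C++, of the kind of code such a generator might emit for a single query known at static time; the schema and names are invented purely for illustration:

```cpp
#include <cassert>
#include <unordered_map>
#include <vector>

// Relational view: Order(order_id, customer_id).
// Given the query "SELECT order_id FROM Order WHERE customer_id = ?"
// at static time, a plan generator could emit the index and the
// lookup function below, instead of the programmer maintaining them
// by hand.

struct Order {
    int order_id;
    int customer_id;
};

// Generated storage: an index chosen to serve the known query.
struct OrderTable {
    std::vector<Order> rows;
    std::unordered_map<int, std::vector<int>> by_customer;  // customer_id -> order_ids

    // Generated update: keeps the index consistent with the relation.
    void insert(Order o) {
        rows.push_back(o);
        by_customer[o.customer_id].push_back(o.order_id);
    }

    // Generated query plan: O(1) expected lookup via the index.
    std::vector<int> orders_of(int customer_id) const {
        auto it = by_customer.find(customer_id);
        return it == by_customer.end() ? std::vector<int>{} : it->second;
    }
};

// Small driver exercising the sketch.
int demo() {
    OrderTable t;
    t.insert({101, 1});
    t.insert({102, 2});
    t.insert({103, 1});
    return static_cast<int>(t.orders_of(1).size());  // customer 1 has 2 orders
}
```

The point is that both the index and the two functions are mechanical consequences of the schema plus the query, so a generator can emit them with correctness guaranteed.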

I thought up this idea while working on the rather horrible internals of Scott McPeak's Elsa C/C++ front-end: I noticed that most of what I was doing amounted to implementing the queries of a relational database by hand, which was very tedious, verbose, and error-prone (this is the point where a software engineer looks for a way to automate an activity). I worked on this idea and its ramifications for quite a while and presented it to several people in 2005, at the time calling it "Orth" (I have since picked a new name for a more expanded version of the idea). One of them was Alex Aiken, also an author on the above paper. No acknowledgment of this fact is given in the paper.

I also told Simon Goldsmith, who was a student of Alex Aiken. I was on staff at Berkeley at the time, working for Alex, and Simon and I collaborated on projects. Simon summarized "Orth" to Alex at the time; on 6/10/05 Simon forwarded me a copy of that email, which I found in a recent search of my email history. I have quoted it below with Simon's permission. Simon and I even discussed his working on automating the choice between different query plans by profiling, and we discussed this research direction with Alex.

I notified Alex of this situation on 10 May 2011. Alex responded: "I certainly didn't remember any such conversation before I received your email (and even now, I don't recall anything specific) and the authors definitely came up with the idea independently. I do have a pretty terrible memory. I will will be happy to add an ack to future papers." Alex is a busy guy doing many things, and perhaps his conscious mind forgot, though I find that incredible; however, I do not believe his subconscious forgot. You might get to claim you invented an idea independently if you never heard of it; however, you do not get to claim you invented an idea independently after someone gave you the exact same idea several years before.

The authors of the above paper cite several "similar" ideas from the literature, as if a static-time query plan generator were just an incremental improvement -- this sort of writing is academic politics. The first time I told this idea to Susan Graham, a very senior researcher in the field, she said "you can't generate the functions if you don't know the queries." I said "you can if you know the queries at compile time." She stopped breathing and her face changed color. That's not the reaction someone has to an incremental improvement. (Fortunately she resumed breathing shortly thereafter.)

I have therefore decided to acknowledge myself here in the absence of any other acknowledgment on the record of this fact.

-- Daniel S. Wilkerson

P.S. I continue to work on my version of this idea, which is much more comprehensive; however this work is delayed in my queue behind another multi-year project of even more import.

dsw: Here is an email sent to me by Simon on 6/10/05 at 9:52 PM, evidently quoting an email that he had already sent to Alex describing Orth.

Below is a draft of my eventual remarks to Alex about Orth. -Simon

Dan and I have spent some time talking about his idea for Orth and he asked me to say a few words about it to you. First of all, I think it's a really neat idea. I enjoy talking about it with Dan and would happily collaborate on research related to it. Orth could really raise the level of abstraction at which we express our programs without sacrificing performance (indeed, perhaps enhancing it -- sort of like what functional programming was supposed to do). Second, I find his writeup to be rather obtuse in some parts and (mirroring his heroic, but ultimately futile rewrite of an early draft of the Partiqle paper) I'd like to offer my understanding of his Orth proposal.

Orth provides a novel view of how data structures are to be specified and implemented. Orth separates the relationships among entities in the program from the implementation of these relationships (as pointers, structs, hash tables, lists, etc). The programmer specifies the entities (e.g., customers, parts, orders, warehouses, inventories), what relationships exist among the entities (a warehouse has a location and also an inventory, an inventory is a multiset of parts, an order involves a customer, shipping address, and a multiset of parts), and the operations on the data. The difference between the Orth approach and the traditional approach is that the operations are specified in terms of entities and relations, not pointers and hash tables. Orth's compiler is free to pick whatever representation it feels like. Massive refactorings of the data structures are possible without proportionally massive changes to the source.

The Orth language has three pieces to it: (1) The data definition / type sublanguage is for defining the data that your program's data structures store and the relationships among them. It is roughly analogous to SQL's "create table", a class declaration in C++, or a type definition in ML. Dan proposes a syntax that looks roughly like C++ class definitions. My picture is something like an entity-relationship diagram.
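dsw: As a purely illustrative sketch of sublanguage (1) -- my invention here, not actual Orth syntax -- a schema in the C++-class-like style might declare only entities and relationships; the std::vector below merely stands in for "a multiset of", not a representation commitment:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical schema declarations (invented names): entities and
// the relationships among them, mirroring the examples in the email.
// No pointers, hash tables, or lists are being chosen here.

struct Part     { std::string name; };
struct Customer { std::string name; };

struct Warehouse {
    std::string location;           // a warehouse has a location
    std::vector<Part> inventory;    // ...and an inventory: a multiset of parts
};

struct OrderSpec {
    Customer customer;              // an order involves a customer,
    std::string shipping_address;   // a shipping address,
    std::vector<Part> items;        // and a multiset of parts
};

// Trivial driver showing the declarations are usable as written.
int demo() {
    Warehouse w{"Oakland", {{"bolt"}, {"bolt"}}};  // a multiset: two bolts
    return static_cast<int>(w.inventory.size());
}
```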

(2) The data structure implementation sublanguage is for implementing the operations on the data structures. This is the sublanguage in which you'd implement operations like "create a new order (with the following customer, items, and so on) and add it to the list of pending orders". These operations are analogous to the SQL queries in a DB app or the implementations of methods in a class. Dan envisions a bunch of SQL queries as method bodies. I'm not convinced that SQL is quite right.
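dsw: A hypothetical illustration (my own names and schema) of what the compiler might emit for such an operation, once it has chosen a representation:

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// The programmer specifies "create a new order and add it to the
// pending orders" in terms of entities and relations; the compiler
// emits an ordinary method body against whatever representation it
// chose. All names below are invented for illustration.

struct Order {
    int customer_id;
    std::vector<std::string> items;
};

struct PendingOrders {
    std::vector<Order> orders;  // one possible compiler-chosen representation

    // Emitted from a spec roughly like:
    //   INSERT INTO Pending(customer, items) VALUES (?, ?)
    int create_order(int customer_id, std::vector<std::string> items) {
        orders.push_back({customer_id, std::move(items)});
        return static_cast<int>(orders.size()) - 1;  // handle for the new order
    }
};

int demo() {
    PendingOrders p;
    p.create_order(7, {"widget", "gadget"});
    p.create_order(9, {"sprocket"});
    return p.orders[0].customer_id;  // first pending order belongs to customer 7
}
```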

(3) The program part of the program (as opposed to the data structures part) is just regular C++ (or Java, or ML, or whatever). The prototypes for the operations defined with (2) look like regular C++ method prototypes.

The Orth Compiler will have two major pieces: (a) Choosing data structures for the stuff specified in (1) such that all the operations in (2) are supported. It's easy to be correct, but hard to get good performance. You might even need feedback-directed optimization or some dynamic optimization to do really well.
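dsw: The representation choice in (a) can be pictured as selecting among interchangeable implementations of one relational interface; a toy sketch, with invented names:

```cpp
#include <algorithm>
#include <cassert>
#include <unordered_set>
#include <vector>

// Two candidate representations of the same relation "set of ints",
// both supporting the operations the program needs. Which is faster
// depends on the workload -- exactly the question that profiling or
// feedback-directed optimization could answer for the compiler.

struct SortedVecSet {  // compact; good for ordered scans
    std::vector<int> v;
    void add(int x) { v.insert(std::lower_bound(v.begin(), v.end(), x), x); }
    bool contains(int x) const { return std::binary_search(v.begin(), v.end(), x); }
};

struct HashSet {  // good for point lookups
    std::unordered_set<int> s;
    void add(int x) { s.insert(x); }
    bool contains(int x) const { return s.count(x) != 0; }
};

// The program is written against the relational interface only;
// the compiler (perhaps guided by profile data) picks the Rep.
template <class Rep>
bool demo_with() {
    Rep r;
    r.add(3); r.add(1); r.add(2);
    return r.contains(2) && !r.contains(5);
}
```

Either representation satisfies the same correctness contract, which is why the choice can be deferred to the compiler without changing the source.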

(b) Compiling the operations specified in (2) into regular C++ code that adheres to the interface. Really, once you've decided on the data structures, this step is totally routine. The hardest part may be making sure data structure invariants are maintained (e.g. parent pointers are always updated).
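dsw: A small sketch (again with invented names) of the invariant-maintenance step: the generated update fixes both directions of the parent/child relation at once, so a parent pointer can never go stale:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Nodes live in a table; Parent(child, parent) is stored redundantly
// in both directions. Hand-written code must remember to update both;
// generated code does so mechanically.

struct Node {
    int id;
    int parent = -1;            // -1 means "no parent"
    std::vector<int> children;  // indices into the node table
};

struct Tree {
    std::vector<Node> nodes;

    int add(int id) { nodes.push_back({id}); return static_cast<int>(nodes.size()) - 1; }

    // Generated from a relational update on Parent(child, parent):
    // the forward and inverse directions are updated together.
    void set_parent(int child, int parent) {
        int old = nodes[child].parent;
        if (old != -1) {
            auto& c = nodes[old].children;
            c.erase(std::find(c.begin(), c.end(), child));
        }
        nodes[child].parent = parent;
        if (parent != -1) nodes[parent].children.push_back(child);
    }
};

int demo() {
    Tree t;
    int a = t.add(1), b = t.add(2), c = t.add(3);
    t.set_parent(c, a);  // c under a
    t.set_parent(c, b);  // re-parent c under b; a's child list is fixed up too
    return static_cast<int>(t.nodes[a].children.size()) * 10
         + static_cast<int>(t.nodes[b].children.size());  // 0*10 + 1 == 1
}
```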

Dan's talked about how other stuff will be really easy in Orth, but I don't have any concrete picture of how he plans to make it so. One extension that does make sense is a sort of advice language to tell the compiler how to represent your data or what operations it's ok to make faster or slower.