The goal of this package is to present an alternative representation of Python that is more amenable to analysis and optimization. The eventual goal is to lift Python’s bytecode into an SSA form and then transforming that SSA CFG into a dataflow graph, and this package is the first step along that path.

It mirrors in some way the work in the C / LLVM space to automatically parallize and optimize that code.

One way we can understand this problem is visually, as the number of differnt forms Python takes:

  1. In the CPython implementation, Python goes from:

    1. Source code

    2. to an AST

    3. to bytecode

  2. We then want to take that lowest level, bytecode, and build back up to a higher level representation that is more amenable to optimization and analysis, describing the semantics of Python:

    1. This package implements a minimal data semantics of the bytecode level, that is more tightly specified than how the bytecode is represented in memory. This gives us a higher level representation

    2. On top of that, we can transform the stack based representation into SSA CFG created from the bytecode. This effectively erases the stack details, leaving you with SSA. This translation is meant to be an abstract analog of CPython’s bytecode interpreter. So that instead of interpreting the bytecode eagerly, it builds up a CFG first that has the same semantics as interpreting it.

    3. Finally, we can lift that A dataflow graph, showing data dependencies, from this SSA. At this level, program optimization becomes much simpler and straightforward.

We test out abstraction level 1-3 by also building ways to go backwards, and taking Python bytecode as an input and making sure all the transformations are isomorphic. This type of testing can help us be confident that the different representations are semantically faithful to the original bytecode source, and so the original Python behavior.

To use this library to somehow optimize or analyze Python code then, you would walk down the first three Python levels, then walk down our three levels we provide here, perform any optimizations or translations, and walk back up however many levels you like to get the form you are looking for.

The underlying theory behind this library is that at the dataflow graph level, we can more closely represent the programs denotional semantics, aka the underlying conceptual meaning of the program. This is the level closer to our own human understanding of what the program “does”. So if we can move the program faithfully up to this level, then we can reason about it in terms which are closer to how we think.

For example, as a human we can reason about adding taking the sum of two arrays, and say that “the result at index i is the addition of the first array at index i at the second array at index i. So if we add two arrays and then immediately index, we know that we don’t have to recompute the full array, but simply the sum of the two corresponding values.” Encoding this sort of logic at the level of Python bytecode simply does not make sense. The bytecode has no idea about some abstract value of an array, or more fundamentally even of a value that is defined mathematically like this. However, at the dataflow level, it becomes feasible to reason about values in this matheamtical sense, instead of in a memory/imperative sense. So at least we have some hope of being express this sort of meaning, and have the computer use it to optimize our program.

After doing this optimization, we can move it back down to whatever level we like to actually execute. We might chose to go back to Python source, or we might instead chose to compile to some alternative implementaiton before executing, such as high level language like SQL or a lower level langauge like LLVM.