The PaSh Compiler

Quick Jump: intro | overview | details | earlier versions

Introduction

PaSh has recently shifted away from ahead-of-time compilation and towards just-in-time compilation intermixed with the execution of a script. This shift brings many benefits, allowing PaSh to correctly handle expansion and other important details – but complicates the clear exposition of the two phases. A high-level diagram of PaSh’s end-to-end operation is shown below:

PaSh pre-processes a sequential script to insert calls to the pash_runtime.py. It then invokes the script, switching between evaluation, execution, and parallelization at runtime: (i) it first parses the script, creating an abstact syntax tree (AST); (ii) it then expands the nodes of the AST, often calling the shell which performs that expansion; (iii) it compiles dataflow regions, parts of the AST that are potentially parallelizable, through an iterative optimization proceedure applied over a dataflow graph (DFG); and (iv) finally emits the parallel script by translating the DFG to AST and unparsing the AST back to a shell script. The compilation takes into account information about individual commands through annotations, and the emitted parallel script uses additional constructs provided by PaSh’s runtime library.

A correspondence between blocks in the diagram and Python modules is shown below:

Compiler Overview

Now let’s get to the compiler. It’s entry point is pash.py that parses a script and replaces potentially parallelizable regions with calls to pash_runtime.sh. It then executes the script. This allows invoking the compiler during the runtime to have information about the values of environment variables.

The pash_runtime.sh script simply invokes the pash.py compiler: if it succeeds it executes the optimized script, otherwise it executes the original script.

The compiler has several stages:

  1. It expands words in the AST and then it turns it into our dataflow model (guided by annotations)
  2. It performs transformations on the dataflow graph to expose parallelism (guided by annotations)
  3. It then translates the dataflow graph back to a shell script to execute it with bash

Zooming into Fragments

A few interesting fragments are outlined below.

The ast_to_ir.py file contains a case statement that essentially pattern-matches on constructs of the shells script AST and then compiles them accordingly.

The following function from ir.py is responsible for parallelizing a single node (i.e., a command) in the dataflow graph. Look at the schematic in the comments starting on line 663 that gives the high-level overview of what this function does (not shown below).

Another interesting fragment is in ir_to_ast.py, which translates the parallel dataflow graph back to an AST.

This AST is then unparsed back into a (parallel) shell script.

Earlier Versions

The compiler is outlined in the EuroSys paper, but has evolved considerably since then:

  • PaSh originally did not have a preprocessing component, and didn’t handle variable expansion. It now does both, significantly improving its practical applicability since it can be used on scripts where the environment variables are modified throughout the script.

  • PaSh originally was using code in parser – a port of libdash, the dash parser extended with OCaml bindings – and specifically the ocaml2json and json2ocaml binaries to interface with PaSh. PaSh now uses a custom parser written in Python, avoiding any dependency to OCaml and simplifying dependency management.