Quick Jump: intro | overview | details | earlier versions
PaSh has recently shifted away from ahead-of-time compilation and towards just-in-time compilation intermixed with the execution of a script. This shift brings many benefits, allowing PaSh to correctly handle expansion and other important details – but complicates the clear exposition of the two phases. A high-level diagram of PaSh’s end-to-end operation is shown below:
PaSh pre-processes a sequential script to
insert calls to the
pash_runtime.py. It then invokes
the script, switching between evaluation,
execution, and parallelization at runtime: (i)
it first parses the script, creating an abstact
syntax tree (AST); (ii) it then expands the
nodes of the AST, often calling the shell which
performs that expansion; (iii) it compiles
dataflow regions, parts of the AST that are
potentially parallelizable, through an iterative
optimization proceedure applied over a dataflow
graph (DFG); and (iv) finally emits the parallel
script by translating the DFG to AST and
unparsing the AST back to a shell script. The
compilation takes into account information about
individual commands through annotations, and the
emitted parallel script uses additional
constructs provided by PaSh’s runtime library.
A correspondence between blocks in the diagram and Python modules is shown below:
Now let’s get to the compiler. It’s entry point is pash.py that parses a script and replaces potentially parallelizable regions with calls to pash_runtime.sh. It then executes the script. This allows invoking the compiler during the runtime to have information about the values of environment variables.
The pash_runtime.sh script simply invokes the pash.py compiler: if it succeeds it executes the optimized script, otherwise it executes the original script.
The compiler has several stages:
dfg2shell translation
happens in ir_to_ast.pyA few interesting fragments are outlined below.
The ast_to_ir.py file contains a case statement that essentially pattern-matches on constructs of the shells script AST and then compiles them accordingly.
compile_cases = {
"Pipe": (lambda fileIdGen, config:
lambda ast_node: compile_node_pipe(ast_node, fileIdGen, config)),
"Command": (lambda fileIdGen, config:
lambda ast_node: compile_node_command(ast_node, fileIdGen, config)),
"And": (lambda fileIdGen, config:
lambda ast_node: compile_node_and_or_semi(ast_node, fileIdGen, config)),
"Or": (lambda fileIdGen, config:
lambda ast_node: compile_node_and_or_semi(ast_node, fileIdGen, config))
# ... more code ...The following function from ir.py is responsible for parallelizing a single node (i.e., a command) in the dataflow graph. Look at the schematic in the comments starting on line 663 that gives the high-level overview of what this function does (not shown below).
# See comment on line 637
def parallelize_node(self, node_id, fileIdGen):
node = self.get_node(node_id)
assert(node.is_parallelizable())
## Initialize the new_node list
new_nodes = []
# ... more code ...Another interesting fragment is in ir_to_ast.py, which translates the parallel dataflow graph back to an AST.
def ir2ast(ir, args):
# ... more code ...
## Make the main body
body = ir.to_ast(drain_streams)
# ... more code ...
## Call the prologue that creates fifos for all ephemeral fids
prologue = make_ir_prologue(ephemeral_fids)
## Call the epilogue that removes all ephemeral fids
epilogue = make_ir_epilogue(ephemeral_fids, clean_up_graph, args.log_file)
final_asts = prologue + body + epilogueThis AST is then unparsed back into a (parallel) shell script.
The compiler is outlined in the EuroSys paper, but has evolved considerably since then:
PaSh originally did not have a preprocessing component, and didn’t handle variable expansion. It now does both, significantly improving its practical applicability since it can be used on scripts where the environment variables are modified throughout the script.
PaSh originally was using code in parser – a port of libdash,
the dash parser extended with OCaml
bindings – and specifically the
ocaml2json and
json2ocaml binaries to interface
with PaSh. PaSh now uses a custom parser written
in Python, avoiding any dependency to OCaml and
simplifying dependency management.