This short tutorial covers pash’s main functionality. Before proceeding, make sure you have installed PaSh.
PaSh is a system for parallelizing POSIX shell scripts. It has been shown to achieve order-of-magnitude performance improvements on shell scripts.
N.b.: PaSh is still under heavy development.
Consider the following spell-checking script, applied to two large markdown files f1.md and f2.md (line 1):
# spell-checking.sh
cat f1.md f2.md |
tr A-Z a-z |
tr -cs A-Za-z '\n' |
sort |
uniq |
comm -13 dict.txt - > out
cat out | wc -l | sed 's/$/ misspelled words!/'
The first cat streams two markdown files into a pipeline that converts characters in the stream into lower case, removes punctuation, sorts the stream in alphabetical order, removes duplicate words, and filters out words from a dictionary file (lines 1–6). A second pipeline (line 7) counts the resulting lines to report the number of misspelled words to the user.
If you’re new to shell scripting, try to run each part of the pipeline separately and observe the output. For example, run
cat f1.md f2.md | tr A-Z a-z
in your terminal to witness the lower-case conversion.
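To try the whole pipeline without large inputs, here is a minimal sketch on tiny stand-in files. The file contents and the temporary directory are assumptions made for illustration; only the pipeline stages come from the script above. Because comm expects sorted input, the sketch sorts the dictionary and pins the locale with LC_ALL=C.

```shell
# Hypothetical miniature inputs; the tutorial's f1.md and f2.md are large files.
workdir=$(mktemp -d)
printf 'The quick brown fox\n' > "$workdir/f1.md"
printf 'jumps over teh lazy dgo\n' > "$workdir/f2.md"

# comm needs sorted input, so sort the dictionary in the same locale as the pipeline.
printf '%s\n' brown fox jumps lazy over quick the | LC_ALL=C sort > "$workdir/dict.txt"

# Same stages as spell-checking.sh, with paths pointing into the temporary directory.
cat "$workdir/f1.md" "$workdir/f2.md" |
tr A-Z a-z |
tr -cs A-Za-z '\n' |
LC_ALL=C sort |
uniq |
LC_ALL=C comm -13 "$workdir/dict.txt" - > "$workdir/out"

# The words absent from the dictionary end up in out, one per line.
cat "$workdir/out"
cat "$workdir/out" | wc -l | sed 's/$/ misspelled words!/'
```

With these inputs, out contains the two deliberately misspelled words, dgo and teh.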
Visually, the script can be thought of as executing sequentially as follows:
The first pipeline (left; parts omitted) processes f1.md and f2.md sequentially through all pipeline stages and writes to out. After it executes to completion, the second pipeline starts its sequential execution.
PaSh transforms and executes each pipeline in a data-parallel fashion. Visually, the parallel script would look like this for 2x-parallelism (i.e., assuming that the computer on which we execute the script has at least two CPUs and that PaSh is invoked with a -w value of 2).
Given a script, PaSh converts it to a dataflow graph, performs a series of semantics-preserving program transformations that expose parallelism, and then converts the dataflow graph back into a POSIX script. The new parallel script has POSIX constructs added to explicitly guide parallelism, coupled with PaSh-provided Unix-aware runtime primitives for addressing performance- and correctness-related issues.
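To make the idea of a data-parallel rewrite concrete, the following is a hand-written sketch of the kind of transformation involved, not PaSh's actual output. It assumes a hypothetical eight-line input, uses temporary files rather than FIFOs, and parallelizes only the sort stage: each half is sorted in the background and the results are combined with sort -m.

```shell
workdir=$(mktemp -d)
# Hypothetical input: eight words, one per line.
printf '%s\n' pear apple fig kiwi lime date plum cherry > "$workdir/words.txt"

# Sequential baseline.
LC_ALL=C sort "$workdir/words.txt" > "$workdir/seq.out"

# Split the input into two halves (lines 1-4 and lines 5-8).
head -n 4 "$workdir/words.txt" > "$workdir/half1"
tail -n +5 "$workdir/words.txt" > "$workdir/half2"

# Sort both halves concurrently, then wait for both background jobs.
LC_ALL=C sort "$workdir/half1" > "$workdir/sorted1" &
LC_ALL=C sort "$workdir/half2" > "$workdir/sorted2" &
wait

# sort -m merges inputs that are already sorted, without re-sorting them.
LC_ALL=C sort -m "$workdir/sorted1" "$workdir/sorted2" > "$workdir/par.out"

# The rewrite preserves semantics only if both versions agree.
cmp -s "$workdir/seq.out" "$workdir/par.out" && echo "outputs match"
```

The choice of combiner depends on the stage: concatenation suffices for a stage like tr, whereas sort needs a merge.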
All scripts in this guide assume that $PASH_TOP is set to the top directory of the PaSh codebase (e.g., /opt/pash on Docker).
To run scripts in this section of the tutorial, make sure you are in the intro directory of the evaluation directory.
In the following examples, you can avoid including $PASH_TOP before pa.sh by adding PASH_TOP to your PATH, which amounts to adding an export PATH=$PATH:$PASH_TOP line to your shell configuration file.
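As a sketch of what such a configuration line could look like, the snippet below appends $PASH_TOP to PATH only when it is not already there. The fallback to /opt/pash is an assumption borrowed from the Docker example above; adjust it to your installation.

```shell
# Assumption: default to /opt/pash (the Docker location) when PASH_TOP is unset.
: "${PASH_TOP:=/opt/pash}"
export PASH_TOP

# Append PASH_TOP to PATH once; repeated sourcing will not duplicate the entry.
case ":$PATH:" in
  *":$PASH_TOP:"*) ;;
  *) export PATH="$PATH:$PASH_TOP" ;;
esac
```

The case guard keeps PATH from growing each time the configuration file is sourced.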
The simplest script to try out pash is hello-world.sh, which applies an expensive regular expression over the system’s dictionary file.
To run hello-world.sh sequentially, you would call it using bash:
To run it in parallel with PaSh:
At this point, you might be interested in running pa.sh --help to get a first sense of the available options. Of particular interest is --width or -w, which specifies the degree of parallelism sought by PaSh (e.g., -w 2).
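The two ways of running a script can be wrapped in a small helper. This is only a sketch: run_both is a hypothetical function, not part of PaSh, and it assumes pa.sh lives directly under $PASH_TOP and accepts -w before the script name, as described in this tutorial. If pa.sh is not found, the parallel run is skipped.

```shell
# run_both SCRIPT [WIDTH]: hypothetical helper, not part of PaSh.
# Runs SCRIPT with bash, then with pa.sh at the given width (default 2).
run_both() {
  script=$1
  width=${2:-2}

  # Sequential run.
  bash "$script" || return 1

  # Parallel run, only if pa.sh is available under PASH_TOP.
  if [ -x "${PASH_TOP:-}/pa.sh" ]; then
    "$PASH_TOP/pa.sh" -w "$width" "$script"
  else
    echo "pa.sh not found under PASH_TOP; skipping the parallel run" >&2
  fi
}

# Example usage from the intro directory:
#   run_both hello-world.sh 2
```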
We will use demo-spell.sh, a pipeline based on the original Unix spell program by Johnson, to confirm that the infrastructure works as expected. We need to set up the appropriate input files for this script to execute:
After inputs are configured, let’s take a quick look at demo-spell.sh:
The script streams the input file into a pipeline that converts characters to lower case, removes punctuation, sorts in alphabetical order, removes duplicate words, and filters out words from a dictionary file.
Next, let’s run it sequentially:
We prefix the script with the time command, which also reports how long the script took to execute. On our evaluation infrastructure, the script takes about 41s.
To execute it using pash with 2x-parallelism:
On our evaluation infrastructure, the 2x-parallel script takes about 28s.
You can check that the results are correct by:
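One way to perform such a check, assuming the standard output of each run was redirected to a file, is a byte-for-byte comparison. The helper name check_same and the file names in the usage line are hypothetical.

```shell
# check_same FILE1 FILE2: hypothetical helper that compares two output files.
check_same() {
  if cmp -s "$1" "$2"; then
    echo "outputs match"
  else
    echo "outputs differ" >&2
    return 1
  fi
}

# Example usage, with hypothetical file names for the two runs:
#   check_same seq.out par.out
```

Using diff instead of cmp -s would additionally show which lines differ.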
Assuming you have more than 8 CPUs, you could also execute it with 8x-parallelism using:
On our evaluation infrastructure, the 8x-parallel script takes about 14s.
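To put the three timings side by side, the speedup is the sequential time divided by the parallel time. The short calculation below uses only the numbers reported above; your own measurements will differ.

```shell
# Timings reported in this tutorial, in seconds.
seq_s=41
par2_s=28
par8_s=14

# Speedup = sequential time / parallel time, rounded to two decimals.
speedup2=$(LC_ALL=C awk -v s="$seq_s" -v p="$par2_s" 'BEGIN { printf "%.2f", s / p }')
speedup8=$(LC_ALL=C awk -v s="$seq_s" -v p="$par8_s" 'BEGIN { printf "%.2f", s / p }')

echo "2x-parallel speedup: ${speedup2}x"   # 1.46x
echo "8x-parallel speedup: ${speedup8}x"   # 2.93x
```

In these measurements the speedup is well below the degree of parallelism requested, so the width is an upper bound on the gain rather than a prediction of it.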
To view the parallel code emitted by the compiler, you can inspect the log:
The contents of the parallel script are shown after the line (4) Executing script in ... and for 2x parallelism (--width 2) they should look like this:
rm -f "#file2"
...
mkfifo "#file2"
...
{ cat scripts/input/100M.txt >"#file2" & }
{ tr -cs A-Za-z "\\n" <"#file4" >"#file6" & }
{ /home/eurosys21/pash/runtime/auto-split.sh "#file2" "#file14" "#file15" & }
{ tr A-Z a-z <"#file32" >"#file17" & }
{ tr A-Z a-z <"#file15" >"#file18" & }
{ cat "#file33" "#file34" >"#file4" & }
{ /home/eurosys21/pash/runtime/auto-split.sh "#file6" "#file19" "#file20" & }
{ sort <"#file35" >"#file22" & }
{ sort <"#file20" >"#file23" & }
{ sort -m "#file36" "#file37" >"#file8" & }
{ /home/eurosys21/pash/runtime/auto-split.sh "#file8" "#file25" "#file26" & }
{ uniq <"#file38" >"#file28" & }
{ uniq <"#file26" >"#file29" & }
{ cat "#file39" "#file40" >"#file30" & }
{ uniq <"#file30" >"#file10" & }
{ /home/eurosys21/pash/runtime/eager.sh "#file14" "#file32" "/tmp/pash_eager_intermediate_#file1" & }
{ /home/eurosys21/pash/runtime/eager.sh "#file17" "#file33" "/tmp/pash_eager_intermediate_#file2" & }
{ /home/eurosys21/pash/runtime/eager.sh "#file18" "#file34" "/tmp/pash_eager_intermediate_#file3" & }
{ /home/eurosys21/pash/runtime/eager.sh "#file19" "#file35" "/tmp/pash_eager_intermediate_#file4" & }
{ /home/eurosys21/pash/runtime/eager.sh "#file22" "#file36" "/tmp/pash_eager_intermediate_#file5" & }
{ /home/eurosys21/pash/runtime/eager.sh "#file23" "#file37" "/tmp/pash_eager_intermediate_#file6" & }
{ /home/eurosys21/pash/runtime/eager.sh "#file25" "#file38" "/tmp/pash_eager_intermediate_#file7" & }
{ /home/eurosys21/pash/runtime/eager.sh "#file28" "#file39" "/tmp/pash_eager_intermediate_#file8" & }
{ /home/eurosys21/pash/runtime/eager.sh "#file29" "#file40" "/tmp/pash_eager_intermediate_#file9" & }
{ /home/eurosys21/pash/runtime/eager.sh "#file10" "#file41" "/tmp/pash_eager_intermediate_#file10" & }
{ comm -13 scripts/input/dict.txt "#file41" & }
source /home/eurosys21/pash/runtime/wait_for_output_and_sigpipe_rest.sh ${!}
rm -f "#file2"
...
Note that most stages in the pipeline appear twice and run in parallel (i.e., using &). This completes the “quick-check”.
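The overall shape of the emitted script (named pipes as edges, background stages, and a combiner) can be imitated by hand on a toy input. The sketch below is an illustration under assumptions: the pipe names and the input words are made up, and it does not attempt to reproduce PaSh's auto-split.sh or eager.sh primitives.

```shell
workdir=$(mktemp -d)
# Named pipes play the role of the "#fileN" edges in the emitted script.
mkfifo "$workdir/left" "$workdir/right"

# Two copies of the same stage run in the background, one per input chunk.
{ printf 'HELLO\nWORLD\n' | tr A-Z a-z > "$workdir/left" & }
{ printf 'FROM\nPASH\n' | tr A-Z a-z > "$workdir/right" & }

# cat acts as the combiner: it drains the first pipe, then the second.
cat "$workdir/left" "$workdir/right" > "$workdir/combined"
wait

# Clean up the pipes, as the emitted script does with rm -f.
rm -f "$workdir/left" "$workdir/right"
cat "$workdir/combined"
```

Here the second producer cannot make progress until cat opens its pipe, so the two copies do not fully overlap in this hand-written version.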
This concludes the first PaSh tutorial. This section includes pointers for further exploration, depending on your needs.
PaSh consists of three main components and a few additional “auxiliary” files and directories. The three main components are:
annotations: DSL characterizing commands, parallelizability study, and associated annotations. More specifically, (i) a lightweight annotation language allows command developers to express key parallelizability properties about their commands; (ii) an accompanying parallelizability study of POSIX and GNU commands guides the annotation language and the optimized aggregator library.
compiler: Shell-dataflow translations and associated parallelization transformations. Given a script, the PaSh compiler converts it to a dataflow graph, performs a series of semantics-preserving program transformations that expose parallelism, and then converts the dataflow graph back into a POSIX script.
runtime: Runtime components such as eager, split, and associated combiners. Apart from POSIX constructs added to guide parallelism explicitly, PaSh provides Unix-aware runtime primitives for addressing performance- and correctness-related issues.
These three components implement the contributions presented in the EuroSys paper. They are expected to be usable with minimal effort, through a few different installation means presented below.
The auxiliary directories are:
Academic papers associated with PaSh offer substantially deeper overviews of the concepts underpinning several PaSh components.