Quick jump: Introduction | Running Scripts | What Next?
This short tutorial covers PaSh’s main
functionality. Before proceeding, make sure you
have installed PaSh.
PaSh is a system for parallelizing POSIX shell scripts. It has been shown to achieve order-of-magnitude performance improvements on shell scripts.
N.b.: PaSh is still under heavy development.
Consider the following spell-checking script,
applied to two large markdown files
f1.md and f2.md (line
1):
# spell-checking.sh
cat f1.md f2.md |
tr A-Z a-z |
tr -cs A-Za-z '\n' |
sort |
uniq |
comm -13 dict.txt - > out
cat out | wc -l | sed 's/$/ misspelled words!/'

The first cat streams two
markdown files into a pipeline that converts
characters in the stream into lower case,
removes punctuation, sorts the stream in
alphabetical order, removes duplicate words, and
filters out words from a dictionary file (lines
1–6). A second pipeline (line 7) counts the
resulting lines to report the number of
misspelled words to the user.
If you’re new to shell scripting, try to run each part of the pipeline separately and observe the output. For example, run

cat f1.md f2.md | tr A-Z a-z

in your terminal to witness the lower-case conversion.
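To make the stage-by-stage exploration concrete, here is a minimal sketch that builds tiny stand-ins for f1.md, f2.md, and dict.txt in a temporary directory and runs the pipeline one stage at a time. The sample text, the dictionary contents, the intermediate file names, and the helper name spell_stages are assumptions made for illustration; LC_ALL=C is set so that sort and comm agree on ordering.

```shell
#!/bin/sh
# Illustrative only: tiny stand-ins for f1.md, f2.md, and dict.txt.
export LC_ALL=C   # keep sort and comm ordering consistent

# spell_stages DIR: create sample inputs in DIR and run the pipeline
# stage by stage, leaving each intermediate result in its own file.
spell_stages() (
  cd "$1" || exit 1
  printf 'Hello wrold.\n' > f1.md
  printf 'The shell is fun!\n' > f2.md
  printf 'fun\nhello\nis\nshell\nthe\nworld\n' > dict.txt   # must be sorted

  cat f1.md f2.md | tr A-Z a-z > 1-lower.txt      # lower-case conversion
  tr -cs A-Za-z '\n' < 1-lower.txt > 2-words.txt  # one word per line
  sort < 2-words.txt > 3-sorted.txt               # alphabetical order
  uniq < 3-sorted.txt > 4-unique.txt              # drop duplicates
  comm -13 dict.txt 4-unique.txt > out            # words not in dict.txt
)

dir=$(mktemp -d)
spell_stages "$dir"
cat "$dir/1-lower.txt"   # the lower-cased text
cat "$dir/out"           # the words missing from the dictionary
rm -rf "$dir"
```

Inspecting each numbered file in turn shows what a single stage contributes before the next one runs.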
Visually, the script can be thought of as executing sequentially as follows:
The first pipeline (left; parts omitted)
processes f1.md and
f2.md sequentially through
all pipeline stages and writes to
out. After it executes to
completion, the second pipeline starts its
sequential execution.
PaSh transforms and executes each pipeline in
a data-parallel fashion. Visually, the parallel
script would look like this for 2x-parallelism
(i.e., assuming that the computer on which we
execute the script has at least two CPUs and
that PaSh is invoked with a -w value
of 2).
Given a script, PaSh converts it to a dataflow graph, performs a series of semantics-preserving program transformations that expose parallelism, and then converts the dataflow graph back into a POSIX script. The new parallel script has POSIX constructs added to explicitly guide parallelism, coupled with PaSh-provided Unix-aware runtime primitives for addressing performance- and correctness-related issues.
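The effect of these transformations can be approximated by hand for a pair of stages. The sketch below is not PaSh's output: it is a hand-written illustration, with hypothetical helpers named seq_lower_sort and par_lower_sort, that splits an input into two halves, runs tr and sort on each half as background jobs, and merges the sorted halves with sort -m.

```shell
#!/bin/sh
# Hand-written illustration of 2x data parallelism; not generated by PaSh.
export LC_ALL=C

# seq_lower_sort FILE: the sequential pipeline.
seq_lower_sort() {
  tr A-Z a-z < "$1" | sort
}

# par_lower_sort FILE: same result, computed on two halves in parallel.
par_lower_sort() (
  tmp=$(mktemp -d) || exit 1
  total=$(wc -l < "$1" | tr -d ' ')
  half=$(( (total + 1) / 2 ))

  # Split the input at a line boundary.
  head -n "$half" "$1" > "$tmp/part1"
  tail -n +"$((half + 1))" "$1" > "$tmp/part2"

  # Each half runs through tr and sort as a background job.
  tr A-Z a-z < "$tmp/part1" | sort > "$tmp/sorted1" &
  tr A-Z a-z < "$tmp/part2" | sort > "$tmp/sorted2" &
  wait

  # Merging two sorted streams restores the sequential output.
  sort -m "$tmp/sorted1" "$tmp/sorted2"
  rm -rf "$tmp"
)

input=$(mktemp)
printf 'Pear\napple\nFig\nbanana\ncherry\n' > "$input"
seq_lower_sort "$input"
par_lower_sort "$input"   # prints the same five lines as the call above
rm -f "$input"
```

The split is safe here because tr treats every line independently and sort has a dedicated merge mode; commands without such properties cannot be split this naively.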
All scripts in this guide assume that
$PASH_TOP is set to the top
directory of the PaSh codebase (e.g.,
/opt/pash on Docker).
To run scripts in this section of the
tutorial, make sure you are in the
intro directory of the
evaluation:
cd $PASH_TOP/evaluation/intro

In the following examples, you can avoid including $PASH_TOP before pa.sh by adding $PASH_TOP to your PATH, which amounts to adding an export PATH=$PATH:$PASH_TOP line to your shell configuration file.
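As a concrete sketch of that configuration, the lines below could go in a file such as ~/.bashrc. The /opt/pash default is the Docker example mentioned above and should be adjusted to your installation; the has_pash helper is a hypothetical sanity check, not something shipped with PaSh.

```shell
# Illustrative shell configuration; adjust PASH_TOP to your installation.
export PASH_TOP="${PASH_TOP:-/opt/pash}"
export PATH="$PATH:$PASH_TOP"

# has_pash DIR: succeed only if DIR contains an executable pa.sh.
# (Hypothetical helper for checking the setup; not shipped with PaSh.)
has_pash() {
  [ -x "$1/pa.sh" ]
}

if has_pash "$PASH_TOP"; then
  echo "pa.sh found in $PASH_TOP"
else
  echo "pa.sh not found in $PASH_TOP; check PASH_TOP" >&2
fi
```

With this in place, pa.sh can be invoked by name from any directory.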
The simplest script to try out
pash is
hello-world.sh, which applies an
expensive regular expression over the system’s
dictionary file.
To run hello-world.sh
sequentially, you would call it using
bash:
time bash ./hello-world.sh

To run it in parallel with PaSh:

time $PASH_TOP/pa.sh $PASH_TOP/evaluation/intro/hello-world.sh

At this point, you might be interested in
running pa.sh --help to get a first
sense of the available options. Of particular
interest is --width or
-w, which specifies the degree of
parallelism sought by PaSh (e.g.,
-w 2).
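One way to choose a value for -w is to cap the requested width at the number of online CPUs. The sketch below assumes this heuristic (PaSh does not require it), uses getconf _NPROCESSORS_ONLN to count CPUs, and only prints the resulting command rather than running it; pick_width is a hypothetical helper.

```shell
#!/bin/sh
# pick_width REQUESTED CPUS: print the smaller of the two values.
# (Hypothetical helper; capping at the CPU count is an assumption.)
pick_width() {
  if [ "$1" -le "$2" ]; then
    echo "$1"
  else
    echo "$2"
  fi
}

# Number of online CPUs; fall back to 1 if getconf cannot report it.
cpus=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
width=$(pick_width 8 "$cpus")

# Print the invocation instead of running it, so the sketch works
# even where PaSh is not installed.
printf '%s\n' "\$PASH_TOP/pa.sh -w $width hello-world.sh"
```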
We will use demo-spell.sh – a
pipeline based on
the original Unix spell program by Johnson –
to confirm that the infrastructure works as
expected. We need to set up the appropriate input
files for this script to execute:

./input/setup.sh

After inputs are configured, let’s take a
quick look at demo-spell.sh:

cat demo-spell.sh

The script streams the input file into a pipeline that converts characters to lower case, removes punctuation, sorts in alphabetical order, removes duplicate words, and filters out words from a dictionary file.
Next, let’s run it sequentially:

time ./demo-spell.sh > spell.out

We prefix the script with the
time command, which reports how long
the script took to execute. On our
evaluation infrastructure, the
script takes about 41s.
To execute it using pash with
2x-parallelism:
time $PASH_TOP/pa.sh -w 2 -d 1 --log_file pash.log demo-spell.sh > pash-spell.out

On our evaluation infrastructure, the 2x-parallel script takes about 28s.
You can check that the results are correct by running:

diff spell.out pash-spell.out

Assuming you have more than 8 CPUs, you could also execute it with 8x-parallelism using:

time $PASH_TOP/pa.sh -w 8 -d 1 --log_file pash.log demo-spell.sh > pash-spell.out

On our evaluation infrastructure, the 8x-parallel script takes about 14s.
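To put the reported timings in perspective, speedup is the sequential time divided by the parallel time. The sketch below applies that formula to the approximate numbers quoted above using awk; the speedup helper is introduced here purely for illustration.

```shell
#!/bin/sh
# speedup SEQ PAR: print SEQ / PAR rounded to two decimals.
speedup() {
  awk -v seq="$1" -v par="$2" 'BEGIN { printf "%.2f\n", seq / par }'
}

speedup 41 28   # 2x-parallel run: prints 1.46
speedup 41 14   # 8x-parallel run: prints 2.93
```

On these numbers, the speedup grows with the width but stays below it, which is worth keeping in mind when choosing a value for -w.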
To view the parallel code emitted by the compiler, you can inspect the log:
cat pash.log

The contents of the parallel script are shown
after the line
(4) Executing script in ... and, for
2x parallelism (--width 2), they
should look like this:
rm -f "#file2"
...
mkfifo "#file2"
...
{ cat scripts/input/100M.txt >"#file2" & }
{ tr -cs A-Za-z "\\n" <"#file4" >"#file6" & }
{ /home/eurosys21/pash/runtime/auto-split.sh "#file2" "#file14" "#file15" & }
{ tr A-Z a-z <"#file32" >"#file17" & }
{ tr A-Z a-z <"#file15" >"#file18" & }
{ cat "#file33" "#file34" >"#file4" & }
{ /home/eurosys21/pash/runtime/auto-split.sh "#file6" "#file19" "#file20" & }
{ sort <"#file35" >"#file22" & }
{ sort <"#file20" >"#file23" & }
{ sort -m "#file36" "#file37" >"#file8" & }
{ /home/eurosys21/pash/runtime/auto-split.sh "#file8" "#file25" "#file26" & }
{ uniq <"#file38" >"#file28" & }
{ uniq <"#file26" >"#file29" & }
{ cat "#file39" "#file40" >"#file30" & }
{ uniq <"#file30" >"#file10" & }
{ /home/eurosys21/pash/runtime/eager.sh "#file14" "#file32" "/tmp/pash_eager_intermediate_#file1" & }
{ /home/eurosys21/pash/runtime/eager.sh "#file17" "#file33" "/tmp/pash_eager_intermediate_#file2" & }
{ /home/eurosys21/pash/runtime/eager.sh "#file18" "#file34" "/tmp/pash_eager_intermediate_#file3" & }
{ /home/eurosys21/pash/runtime/eager.sh "#file19" "#file35" "/tmp/pash_eager_intermediate_#file4" & }
{ /home/eurosys21/pash/runtime/eager.sh "#file22" "#file36" "/tmp/pash_eager_intermediate_#file5" & }
{ /home/eurosys21/pash/runtime/eager.sh "#file23" "#file37" "/tmp/pash_eager_intermediate_#file6" & }
{ /home/eurosys21/pash/runtime/eager.sh "#file25" "#file38" "/tmp/pash_eager_intermediate_#file7" & }
{ /home/eurosys21/pash/runtime/eager.sh "#file28" "#file39" "/tmp/pash_eager_intermediate_#file8" & }
{ /home/eurosys21/pash/runtime/eager.sh "#file29" "#file40" "/tmp/pash_eager_intermediate_#file9" & }
{ /home/eurosys21/pash/runtime/eager.sh "#file10" "#file41" "/tmp/pash_eager_intermediate_#file10" & }
{ comm -13 scripts/input/dict.txt "#file41" & }
source /home/eurosys21/pash/runtime/wait_for_output_and_sigpipe_rest.sh ${!}
rm -f "#file2"
...

Note that most stages in the pipeline are
repeated twice and proceed in parallel (i.e.,
using &). This completes the
“quick-check”.
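The structure of the emitted script (create named pipes, launch every stage in the background, then wait) can be reproduced in miniature. The following is a hand-written sketch rather than PaSh output: fifo_merge_sort is a hypothetical helper that sorts two files concurrently and connects the stages with mkfifo pipes, mirroring the sort and sort -m lines above. It deliberately omits PaSh's eager and auto-split primitives.

```shell
#!/bin/sh
# Hand-written miniature of the pattern in the emitted script; not PaSh output.
export LC_ALL=C

# fifo_merge_sort FILE1 FILE2: sort both files concurrently and merge them.
fifo_merge_sort() (
  dir=$(mktemp -d) || exit 1
  mkfifo "$dir/file1" "$dir/file2"      # named pipes connect the stages

  { sort < "$1" > "$dir/file1" & }      # producer for the first input
  { sort < "$2" > "$dir/file2" & }      # producer for the second input
  sort -m "$dir/file1" "$dir/file2"     # consumer merges both streams

  wait                                  # let background stages finish
  rm -rf "$dir"                         # remove the pipes afterwards
)

a=$(mktemp); b=$(mktemp)
printf 'pear\napple\n' > "$a"
printf 'fig\nbanana\n' > "$b"
fifo_merge_sort "$a" "$b"   # prints apple, banana, fig, pear (one per line)
rm -f "$a" "$b"
```

Because the stages communicate through pipes rather than regular files, the merge can start consuming data as soon as the producers emit it.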
This concludes the first PaSh tutorial. This section includes pointers for further exploration, depending on your needs.
PaSh consists of three main components and a few additional “auxiliary” files and directories. The three main components are:
annotations: DSL characterizing commands, parallelizability study, and associated annotations. More specifically, (i) a lightweight annotation language allows command developers to express key parallelizability properties about their commands; (ii) an accompanying parallelizability study of POSIX and GNU commands guides the annotation language and the optimized aggregator library.
compiler: Shell-dataflow translations and associated parallelization transformations. Given a script, the PaSh compiler converts it to a dataflow graph, performs a series of semantics-preserving program transformations that expose parallelism, and then converts the dataflow graph back into a POSIX script.
runtime:
Runtime components such as eager,
split, and associated combiners.
Apart from POSIX constructs added to guide
parallelism explicitly, PaSh provides Unix-aware
runtime primitives for addressing performance-
and correctness-related issues.
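The role of combiners can be illustrated with ordinary commands. In the emitted script shown earlier, the two uniq halves are recombined with cat followed by another uniq. The sketch below checks that pattern on a small input; the function names are hypothetical, and the script is a hand-written illustration, not PaSh's aggregator library.

```shell
#!/bin/sh
# Hand-written illustration of a combiner for uniq; not PaSh's library.

# seq_uniq FILE: the sequential result.
seq_uniq() {
  uniq < "$1"
}

# par_uniq PART1 PART2: run uniq on each part, concatenate, and run uniq
# once more so that a duplicate straddling the split point is removed.
par_uniq() (
  tmp=$(mktemp -d) || exit 1
  uniq < "$1" > "$tmp/u1" &
  uniq < "$2" > "$tmp/u2" &
  wait
  cat "$tmp/u1" "$tmp/u2" | uniq
  rm -rf "$tmp"
)

whole=$(mktemp); p1=$(mktemp); p2=$(mktemp)
printf 'a\na\nb\n' > "$p1"        # first part ends with "b"
printf 'b\nc\nc\n' > "$p2"        # second part starts with "b"
cat "$p1" "$p2" > "$whole"

seq_uniq "$whole"                 # prints a, b, c (one per line)
par_uniq "$p1" "$p2"              # prints the same three lines
rm -f "$whole" "$p1" "$p2"
```

Without the final uniq, the concatenation would contain b twice, which is why a plain cat is not a sufficient combiner for this command.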
These three components implement the contributions presented in the EuroSys paper. They are expected to be usable with minimal effort, through a few different installation means presented below.
The auxiliary directories are:
Academic papers associated with PaSh offer substantially deeper overviews of the concepts underpinning several PaSh components.
Chat:
Mailing Lists:
Development/contributions: