Parallelizability Classes Study & Annotation
Language
Quick Jump: Parallelizability Classes
| study
| example
1 | example
2 | howto | issues
PaSh includes (i) a parallelizability study
of commands in POSIX and GNU Coreutils, and (ii)
an annotation language for describing the
parallelizability properties of individual
commands. The parallelizability study informed
the design of the annotation language, which was
in turn used to capture the key
parallelizability characteristics in many of
these commands.
N.b.: We welcome contributions to the
study and annotatations for common
commands.
Main Parallelizability Classes
PaSh introduces four major parallelizability
classes:
Stateless Commands: The first
class, stateless, contains commands
that operate on individual line elements of
their input, without maintaining state across
invocations. These are commands that can be
expressed as a purely functional
map or filter –
e.g., grep filters out
individual lines and basename
removes a path prefix from a string. They may
produce multiple elements – e.g.,
tr may insert NL
tokens – but always return empty output for
empty input. Workloads that use only stateless
commands are trivial to parallelize: they do not
require any synchronization to maintain
correctness, nor caution about where to split
inputs.
Parallelizable Pure Commands:
The second class,
parallelizable_pure, contains
commands that respect functional purity –
i.e., same outputs for same inputs –
but maintain internal state across their entire
pass. The details of this state and its
propagation during element processing affect
their parallelizability characteristics. Some
commands are easy to parallelize, because they
maintain trivial state and are commutative –
e.g., wc simply maintains
a counter. Other commands, such as
sort, maintain more complex
invariants that have to be taken into account
when merging partial results.
Non-parallelizable Pure
Commands: The third class,
pure, contains commands that, while
purely functional, cannot be parallelized within
a single data stream. This is because their
internal state depends on prior state in
non-trivial ways over the same pass. % should we
say something about state machines? For example,
hashing commands such as sha1sum
maintain complex state that has to be updated
sequentially. If parallelized on a single input,
each stage would need to wait on the results of
all previous stages, foregoing any parallelism
benefits.
Side-effectful Commands: The
last class, side-effectful,
contains commands that have side-effects across
the system – for example, updating environment
variables, interacting with the filesystem, and
accessing the network. Such commands are not
parallelizable without finer-grained concurrency
control mechanisms that can detect side-effects
across the system.
Parallelizability Classes Study of Commands in GNU
& POSIX
The parallelizability study of commands in
GNU and POSIX is comprised of two parts: a
coarse-grained parallelizability study, and a
set of annotations for commands.
The main results of the parallelizability
study are summarized in the PaSh
EuroSys’21 paper (Sec. 3.1 and Tab. 1). To
see the results of the parallelizability study,
run ./p_stats.
Annotations for about 60 popular commands are
stored in this directory encoded as JSON files
(about 14 lines per annotation on average, for a
total of 846 lines of annotations). Annotations
can be thought of as defining a bidirectional
correspondence between a command and a node in
the dataflow graph—the abstraction used by the
PaSh compiler. Since command behaviors (and
correspondence) can change based on their
arguments, annotations contain a sequence of
predicates. Each predicate is accompanied by
information that instantiates the correspondence
between a command and a dataflow node.
A Simple Example: chmod
As a first example, below we present the
annotations for chmod.
{
"command": "chmod",
"cases": [
{
"predicate": "default",
"class": "side-effectful"
}
]
}
The annotation for chmod is very
simple, since it only needs to establish that
chmod is side-effectful and
therefore cannot be translated to a dataflow
node.
Another Example: cut
As another example, below we present the
annotations for cut.
{
"command": "cut",
"cases": [
{
"predicate": {
"operator": "or",
"operands": [
{
"operator": "val_opt_eq",
"operands": [
"-d",
"\n"
]
},
{
"operator": "exists",
"operands": [
"-z"
]
}
]
},
"class": "pure",
"inputs": [
"args[:]"
],
"outputs": [
"stdout"
]
},
{
"predicate": "default",
"class": "stateless",
"inputs": [
"args[:]"
],
"outputs": [
"stdout"
]
}
],
"options": [
"stdin-hyphen",
"empty-args-stdin"
],
"short-long": [
{
"short": "-d",
"long": "--delimiter"
},
{
"short": "-z",
"long": "--zero-terminated"
}
]
}
The annotation for cut has two
cases, each of which consists of a predicate on
its arguments, and then an assignment of its
parallelizability class, inputs, and outputs.
The first predicate indicates that
cut is “pure” – i.e., not
parallelizable but representable as a dataflow
node – if the value accompanying the
-d option is \n or if
it was used with the -z flag. In
both of these cases, newlines do not represent
data item boundaries, but are rather used
internally by the command, making it unsafe to
parallelize by splitting on line boundaries. In
all other cases (see the “default” case) the
command is stateless. Inputs are always assigned
to the non-option arguments and the output is
always stdout. The option “stdin-hyphen”
indicates that a non-option argument that is
just a dash - represents the stdin,
and the option “empty-args-stdin” indicates that
if non-option arguments are empty, then the
command reads from its stdin. The list
identified by “short-long” contains a
correspondence of short and long argument names
for this command.
How to Annotate a Command
The first step to annotating a command is to
identify its default class:
stateless,
parallelizable_pure,
pure, and
side-effectful. How does the
command behave without any inputs? The next step
is to identify the set of inputs and their
order.
This process then has to be repeated for
every set of arguments, which have to be
expressed as first-order-logic predicates (see
examples above). This can be (and is currently)
achieved in an incremental fashion: a few flags
at a time.
For more details, here is an early version of
the annotation language:
<option> ::= `-' <string>
<category> ::= `stateless' | `pure' | ...
<maybe-int> ::= ε | <int>
<arg> ::= `args[' <int> `]'
<args> ::= <arg>
| `args[' <maybe-int> `:' <maybe-int> `]'
<input> ::= `stdin'
| <args>
<inputs> ::= <input>
| <input> `,' <inputs>
<output> ::= `stdout'
| <arg>
<outputs> ::= <output>
| <output> `,' <outputs>
<option-pred> ::= <option>
| `value' <option> = <string>
| `not' <option-pred>
| <option-pred> `or' <option-pred>
| <option-pred> `and' <option-pred>
<assignment> ::= `(' <category>, `[' <inputs> `]' `,' `[' <output> `]' `)'
<predicate> ::= <option-pred> `=>' <assignment>
<pred-list> ::= `|' <predicate> <pred-list>
| `|' `otherwise' `=>' <assignment>
<command> ::= <name> `\{' <pred-list> `\}'
<command-list> ::= <command>
| <command> <command-list>
Mini-tutorial: Adding Custom
Aggregators
For this tutorial, let’s assume you want to
parallelize a
simple ann-agg.sh script.
Let’s also assume there are no annotations or
aggregators for the commands
test_one and test_two.
Note that normally these two commands would be
annotated as stateless, as their
aggregator is simply the
concatenation function; however, we
will now annotate them as
parallelizable_pure and provide
“custom” aggregation commands that simply
concatenate their input streams.
Step 1: Implement aggregators and their
annotations:
An aggregator is usually either binary or
n-ary: it takes as input two or
n file names (or paths) and outputs
results to the standard out. An aggregator may
also take additional flags—for example, flags
that configure its operation or flags that were
provided to the original command.
We will implement test_one’s
aggregator as a
shell script that internally uses the Unix
cat command to concatenate any
number of input streams.
We will implement test_two’s
aggregator as a
Python script that concatenates any number
of inputs streams.
For PaSh to be able to hook these aggregators
correctly, i.e., so that it can
instantiate them as command invocations, we also
need to add their annotations in annotations/custom_aggregators.
Below are the two annotation files named annotations/custom_aggregators/cat.py.json
and annotations/custom_aggregators/concat.json.
(FIXME: relative path? Until this is
fixed, prefix aggregator names with
pagg- to avoid name
clashes!) The most important
information in these files is (i) the
aggregation command’s name, and
(ii) its treatment of inputs (both taking
["args[:]"]), and outputs (both
outputing to ["stdout"]).
Step 2: Point commands to their custom
aggregators: Add two new annotation files
in $PASH_TOP/annotations with names
test_one.json and
test_two.json, so that they point
to the right aggregator commands. Apart from
providing the correct command name,
the two key properties are the
class (which should be
parallelizable_pure) and the
rel_path (which should point to the
aggregator programs we just implemented—ideally,
relative to $PASH_TOP).
Here is the annotation for test_one.json,
where the aggregator points to
runtime/agg/opt/concat.sh. Note
that path is relative with respect to
$PASH_TOP and therefore refers to
$PASH_TOP/runtime/agg/opt/concat.sh:
Here is the annotation for test_two.json,
pointing to runtime/agg/py/cat.py
(i.e., implying
$PASH_TOP/runtime/agg/py/cat.py).
The annotations also specifies that the
aggregator should be called with the
-a flag, in addition to any other
flags provided to the original command.
More complex aggregators:
Suppose we want to parallelize a new script
called ann-agg-2.sh.
This script contains two new commands
test_uniq_1 and
test_uniq_2. Their annotations are
in files annotations/test_uniq_1
and annotations/test_uniq_2.json.
Issues
- [pr.json line 18]
(https://github.com/binpash/pash/tree/main/annotations/pr.json)
- [cat.json line 3]
(https://github.com/binpash/pash/tree/main/annotations/cat.json)