PaSh: Light-touch Data-Parallel Shell Scripting

The study has evolved over time

Originally, there are five main groups:

stateless: these are the simplest to parallelize. They fall into four classes depending on the input chunk—some are parallelizable at the level of individual characters (e.g., tr), some at the level of lines (e.g., grep), some at the level of paragraphs (e.g., fmt), and some at the level of files
pure: these are somewhat more difficult to parallelize, as they might require to see the end of the input (or include some line metadata); they are still pure (a la reducers).
DFS: these interact with the (distributed version of the) file-system. As the file-system is a central part of the Unix design and philosophy, many of these commands have Unix-specific semantics that are not helpful in pipelines or distributed workloads—e.g., operations related to file ownership. However, they are not difficult to emulate atop a conventional distributed storage.
EVs: these commands affect environment variables; interestingly, the vast majority of these commands only read environment variables—the common case in scripts. So even if we had something like distributed transactions, we would still be able to get away with mostly read-accesses that do not need to block (note that there are multiple transactional protocols, because these are reversible).
Side-effectful: These have non-reversible side-effects that affect the outside system. They are non-distributable, and would most certainly require a pesimistic transaction management control.

There are still a few commands whose semantics are not entirely understood, marked as ? (but for which pash assumes the most conservative class possible).

Currently, the four major classes:

Stateless Commands: This class contains commands that operate on individual line elements of their input, without maintaining state across invocations. These are commands that can be expressed as a purely functional map or filter—e.g., grep filters out individual lines and basename removes a path prefix from a string. They may produce multiple elements—e.g., tr may insert NL tokens—but always return empty output for empty input. Workloads that use only stateless commands are trivial to parallelize: they do not require any synchronization to maintain correctness, nor caution about where to split inputs.
Parallelizable Pure Commands: This class contains commands that respect functional purity—i.e., same outputs for same inputs—but maintain internal state across their entire pass. The details of this state and its propagation during element processing affect their parallelizability characteristics. Some commands are easy to parallelize, because they maintain trivial state and are commutative—e.g., wc simply maintains a counter. Other commands, such as sort, maintain more complex invariants that have to be taken into account when merging partial results.
Non-parallelizable Pure Commands: This class contains commands that, while purely functional, cannot be parallelized. This is because their internal state depends on prior state in the same pass in non-trivial ways. For example, hashing commands such as sha1sum maintain complex state that has to be updated sequentially. If parallelized on a single input, each stage would need to wait on the results of all previous stages, foregoing any parallelism benefits.
Side-effectful Commands: This class contains commands that have side-effects across the system—for example, updating environment variables, interacting with the filesystem, and accessing the network. Such commands are not parallelizable without finer-grained concurrency control mechanisms that can detect side-effects across the system. This is the largest class.

version 0.9 \| revision ea962cc \| Merge pull request #611 from binpash/future \| Updated: 00:36 07/23/2022
Copyright PaSh Project a Series of LF Projects, LLC. For website terms of use, privacy policy, trademark guidelines and other policies please see www.lfprojects.org/policies/.