PaSh: Light-touch Data-Parallel Shell Scripting


The PaSh continuous integration infrastructure tracks, among other things, performance and correctness metrics on the main branch over time. Results for the most recent commits are shown below; longer commit history, other branches, and more configurations are part of the control subsystem.

One can also view these results in textual form. For example, running curl -s ci.binpa.sh | head lists the latest 5 builds, and running curl -s ci.binpa.sh/f821b4b shows details about a specific commit (here, f821b4b).
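
The two commands above can be wrapped in small helper functions for repeated use. This is only a sketch: the function names pash_ci_url and pash_ci_fetch and the variable PASH_CI_HOST are hypothetical and not part of PaSh, and the only endpoint behavior assumed is what the text above describes.

```shell
# Base address of the textual CI endpoint, as given in the text above.
PASH_CI_HOST="ci.binpa.sh"

# Hypothetical helper: print the address of the build list (no argument)
# or of a single commit (short commit ID as the first argument).
pash_ci_url() {
  if [ -n "${1:-}" ]; then
    printf '%s/%s\n' "$PASH_CI_HOST" "$1"
  else
    printf '%s\n' "$PASH_CI_HOST"
  fi
}

# Hypothetical helper: fetch the textual report.
# The -s flag silences curl's progress meter.
pash_ci_fetch() {
  curl -s "$(pash_ci_url "${1:-}")"
}

# Usage (requires network access):
#   pash_ci_fetch | head       # latest builds
#   pash_ci_fetch f821b4b      # details for one commit
```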

Latest Snapshot

The next few plots show results only for the latest commit on the main branch.

Classic Unix One-liners

This benchmark set contains 9 pipelines written by Unix experts: a few pipelines are from Unix legends (e.g., Top-N, Spell), one is from a book on Unix scripting, and a few are from top Stack Overflow answers. Pipelines contain 2-7 stages (avg.: 5.2), ranging from scalable CPU-intensive stages (e.g., the grep stage in Nfa-regex) to non-parallelizable ones (e.g., the diff stage in Diff). Inputs are script-specific and average 10GB per benchmark.
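
To give a sense of what such a one-liner looks like, here is a sketch of a word-frequency pipeline in the style of Top-N. It is a variant of the well-known pipeline of this kind and may differ from the exact script in the benchmark set; the function name top_n and the default of 10 are assumptions made for illustration.

```shell
# top_n N: read text on stdin and print the N most frequent words
# with their counts (N defaults to 10).
top_n() {
  tr -cs 'A-Za-z' '\n' |   # put each word on its own line
    tr 'A-Z' 'a-z' |       # fold everything to lowercase
    sort |                 # bring identical words together
    uniq -c |              # count each run of identical words
    sort -rn |             # order by count, highest first
    head -n "${1:-10}"     # keep only the first N lines
}
```

For example, top_n 5 < input.txt prints the five most frequent words in a hypothetical file input.txt. Every stage reads from its predecessor through a pipe, which is the structure PaSh works with; the two sort stages are the ones that need to see all of their input before producing output.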

Unix50 from Bell Labs

This benchmark includes 36 pipelines solving the Unix 50 game. The pipelines were designed to highlight Unix's modular philosophy, make extensive use of standard commands under a variety of flags, and appear to have been written by non-experts. PaSh executes each pipeline as-is, without any modification.

Mass-transit analytics during COVID-19

This benchmark set contains 4 pipelines that were used to analyze real telemetry data from bus schedules during the COVID-19 response in a large European city. The pipelines compute several average statistics on the transit system per day, such as daily serving hours and daily number of vehicles. Pipelines range between 9 and 10 stages (avg.: 9.2), use the typical Unix staples sed, sort, and uniq, and operate on a fixed 34GB dataset that contains mass-transport data collected over a single year.

Performance

The next few plots show the performance of various benchmarks on the main branch as a time series, with commit IDs standing in for time.

Classic Unix One-liners

Unix50 from Bell Labs

Mass-transit analytics during COVID-19

Dependency Untangling

While the JIT engine operates as if invoked on every region, PaSh is engineered to spawn a long-running stateful compilation server just once, feeding it compilation requests until the execution of the script completes. This design has two benefits: (1) it reduces run-time overhead by avoiding reinitializing the compiler for each compilation request; and (2) it allows maintaining and querying past compilation results when compiling a new fragment. The latter allows PaSh to untangle dependencies across regions, finding and exploiting opportunities for cross-region parallel execution. This set contains several benchmarks, including log processing and parsing, media conversion, genome computation, and compression apps.

Average Temperature

Contains a large script that downloads and processes multi-year temperature data from across the US.

NLP

Contains several scripts from Unix for Poets, a tutorial for developing programs for natural-language processing out of Unix and Linux utilities.

WebIndex

Contains a large multi-stage script for web crawling and indexing, using a variety of third-party and built-in utilities.

Correctness

This section tracks outcome statistics for the test suites that check the correctness of PaSh components.

Benchmark	Passed	Failed	Untested	Unresolved	Unsupported	Other
Compiler Tests	54/54	0/54	0/54	0/54	0/54	0/54
Intro Tests	2/2	0/2	0/2	0/2	0/2	0/2
Interface Tests	39/39	0/39	0/39	0/39	0/39	0/39
Annotation Tests	1/1	1/1	1/1	1/1	1/1	1/1
Aggregator Tests	108/109	1/109	0/109	0/109	0/109	0/109
Smoosh Tests	10/10	0/10	0/10	0/10	0/10	0/10
Posix Tests	375/494	41/494	31/494	6/494	40/494	1/494
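
The passed/total fractions in the table can be turned into pass rates with a short awk filter. The sketch below assumes the table is available as tab-separated text with a header row, as laid out above; the function name pass_rate is hypothetical.

```shell
# pass_rate: read the tab-separated results table on stdin and print
# each suite name followed by its pass rate as a percentage.
pass_rate() {
  awk -F'\t' 'NR > 1 {
    split($2, frac, "/")        # frac[1] = passed, frac[2] = total
    if (frac[2] > 0)            # skip rows without a usable total
      printf "%s\t%.1f%%\n", $1, 100 * frac[1] / frac[2]
  }'
}
```

Fed the Posix Tests row (375/494), the filter reports 75.9%; for Aggregator Tests (108/109) it reports 99.1%.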

version 0.9 | revision ea962cc | Merge pull request #611 from binpash/future | Updated: 00:36 07/23/2022
Copyright PaSh Project a Series of LF Projects, LLC. For website terms of use, privacy policy, trademark guidelines and other policies please see www.lfprojects.org/policies/.
