distributed-dataset
A distributed data processing framework in pure Haskell. Inspired by Apache Spark.
- An example: /examples/gh/Main.hs
- API documentation: https://utdemir.github.io/distributed-dataset/
- Introduction blogpost: https://utdemir.com/posts/ann-distributed-dataset.html
Packages
distributed-dataset
This package provides a Dataset type which lets you express and execute transformations on a distributed multiset. Its API is highly inspired by Apache Spark.
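To make the "transformations on a distributed multiset" idea concrete, here is a small single-process sketch. The names (Dataset, dMap, dFilter, dReduce) are illustrative simplifications, not the library's actual API; the point is that a dataset is a bag of partitions and every transformation is expressed per-partition, so it could run on many machines.

```haskell
-- A local stand-in for the Dataset idea: a multiset modelled as a
-- list of partitions, with per-partition transformations. These
-- names are hypothetical, not distributed-dataset's real API.
import Data.List (foldl')

newtype Dataset a = Dataset { partitions :: [[a]] }

dMap :: (a -> b) -> Dataset a -> Dataset b
dMap f (Dataset ps) = Dataset (map (map f) ps)

dFilter :: (a -> Bool) -> Dataset a -> Dataset a
dFilter p (Dataset ps) = Dataset (map (filter p) ps)

-- Reduce each partition independently, then combine the partial
-- results; a distributed aggregation has the same shape, which is
-- what lets it avoid moving all data to a single node.
dReduce :: (b -> a -> b) -> b -> (b -> b -> b) -> Dataset a -> b
dReduce step z combine (Dataset ps) =
  foldl' combine z (map (foldl' step z) ps)

main :: IO ()
main = do
  let ds = Dataset [[1 .. 10], [11 .. 20]] :: Dataset Int
  -- Double every even number and sum the results.
  print (dReduce (+) 0 (+) (dMap (* 2) (dFilter even ds)))
```

Running this prints 220, the same answer a sequential pipeline over the flattened list would give; only the evaluation strategy differs.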
It uses pluggable Backends for spawning executors and ShuffleStores for exchanging information. See 'distributed-dataset-aws' for an implementation using AWS Lambda and S3.
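A heavily simplified picture of those two plug points, assuming nothing about the library's real types: a Backend knows how to run a payload somewhere, and a ShuffleStore moves intermediate bytes between executors. The record fields and the in-memory store below are hypothetical stand-ins (for Lambda and S3 respectively).

```haskell
-- Hypothetical, simplified shapes of the two plug points; the real
-- types in distributed-dataset differ.
import Data.IORef
import qualified Data.Map.Strict as M

newtype Backend = Backend
  { runClosure :: String -> IO String  -- payload in, result out
  }

data ShuffleStore = ShuffleStore
  { putBlob :: String -> String -> IO ()    -- key, value
  , getBlob :: String -> IO (Maybe String)  -- key
  }

-- An in-process Backend: "executing remotely" is just running locally.
localBackend :: Backend
localBackend = Backend { runClosure = \payload -> pure (reverse payload) }

-- An in-memory ShuffleStore backed by an IORef, standing in for S3.
memoryStore :: IO ShuffleStore
memoryStore = do
  ref <- newIORef M.empty
  pure ShuffleStore
    { putBlob = \k v -> modifyIORef' ref (M.insert k v)
    , getBlob = \k -> M.lookup k <$> readIORef ref
    }

main :: IO ()
main = do
  store <- memoryStore
  out <- runClosure localBackend "hello"
  putBlob store "result-0" out
  getBlob store "result-0" >>= print
```

Because both plug points are plain records of IO functions, swapping the in-memory versions for Lambda- and S3-backed ones changes no caller code.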
It also exposes a more primitive Control.Distributed.Fork module which lets you run IO actions remotely. It is especially useful when your task is embarrassingly parallel.
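The fork-style workflow can be simulated in-process: launch a number of independent IO actions, wait for all of them, and collect the results. In the real library the actions run on remote executors; here plain threads stand in, which is enough to show why embarrassingly parallel work fits the model. forkAll is an illustrative helper, not part of the library.

```haskell
-- Simulating "fork many independent IO actions and gather results"
-- with local threads; remote executors would follow the same shape.
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Monad (forM)

forkAll :: [IO a] -> IO [a]
forkAll actions = do
  vars <- forM actions $ \act -> do
    v <- newEmptyMVar
    _ <- forkIO (act >>= putMVar v)
    pure v
  -- Waiting on the MVars in order preserves the input order of results.
  mapM takeMVar vars

main :: IO ()
main = do
  -- Ten independent "tasks"; they share no state, so each could just
  -- as well run on a different machine.
  results <- forkAll [pure (n * n) | n <- [1 .. 10 :: Int]]
  print (sum results)
```

The result is deterministic (385) even though the tasks complete in an arbitrary order, because each task writes to its own MVar.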
distributed-dataset-aws
This package provides a backend for 'distributed-dataset' using AWS services. Currently it supports running functions on AWS Lambda and using an S3 bucket as a shuffle store.
distributed-dataset-opendatasets
Provides Datasets reading from public open datasets. Currently it can fetch GitHub event data from GH Archive.
Running the example
- Clone the repository.
$ git clone https://github.com/utdemir/distributed-dataset
$ cd distributed-dataset
- Make sure that you have AWS credentials set up. The easiest way is to install AWS command line interface and to run:
$ aws configure
- Create an S3 bucket to put the deployment artifact in. You can use the console or the CLI:
$ aws s3api create-bucket --bucket my-s3-bucket
- Build and run the example:
- If you use Nix on Linux:
- (Recommended) Use my binary cache on Cachix to reduce compilation times:
$ nix-env -i cachix  # or your preferred installation method
$ cachix use utdemir
- Then:
$ nix run -f ./default.nix example-gh -c example-gh my-s3-bucket
- If you use stack (requires Docker, works on Linux and MacOS):
$ stack run --docker-mount $HOME/.aws/ --docker-env HOME=$HOME example-gh my-s3-bucket
Stability
Experimental. Expect lots of missing features, bugs, instability and API changes. You will probably need to modify the source if you want to do anything serious. See issues.
Contributing
I am open to contributions; any issue, PR or opinion is more than welcome.
- In order to develop distributed-dataset, you can use:
  - On Linux: Nix, cabal-install or stack.
  - On MacOS: stack with docker.
- Use ormolu to format source code.
Nix
- You can use my binary cache on cachix so that you don't recompile half of Hackage.
- nix-shell will drop you into a shell with ormolu, cabal-install and steeloverseer alongside all required Haskell and system dependencies. You can use cabal new-* commands there.
- The easiest way to get a development environment is to run sos at the top-level directory inside of a nix-shell.
Stack
- Make sure that you have Docker installed.
- Use stack as usual; it will automatically use a Docker image.
- Run ./make.sh stack-build before you send a PR to test against different resolvers.
Related Work
Papers
- Towards Haskell in the Cloud by Jeff Epstein, Andrew P. Black, Simon Peyton Jones
- Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing by Matei Zaharia, et al.
Projects
- Apache Spark.
- Sparkle: Run Haskell on top of Apache Spark.
- HSpark: Another attempt at porting Apache Spark to Haskell.