hadoop-streaming alternatives and similar packages
Based on the "Cloud" category.
- distributed-process-tests: Cloud Haskell core library
- api-tools: A Haskell embedded DSL for generating an API's JSON wrappers and documentation.
- push-notify: A server-side library in Haskell for sending push notifications to devices running different OS.
- push-notify-ccs: A server-side library in Haskell for sending push notifications to devices running different OS.
- courier: A message-passing library, intended for simplifying network applications.
- push-notify-apn: Send push notifications from Haskell using the new HTTP2 API.
- distributed-process-task: Cloud Haskell Task Execution Framework.
- distributed-process-systest: Testing tools and capabilities for Cloud Haskell.
- distributed-process-zookeeper: A Zookeeper backend for Cloud Haskell.
- cloud-seeder: A Haskell library for interacting with CloudFormation stacks.
- distributed-process-lifted: A generalization of distributed-process functions to a MonadProcess typeclass and standard transformer instances, using monad-control and similar techniques.
- task-distribution: A framework for distributing Haskell tasks running on HDFS data using Cloud Haskell. The goal is speedup through distribution on clusters of regular hardware. The framework provides different, simple workarounds to transport new code to other cluster nodes.
- grpc-etcd-client: Haskell etcd client using the gRPC binding.
- push-notify-general: A general library for sending/receiving push notifications through different services.
README
A simple Hadoop streaming library based on conduit, useful for writing mapper and reducer logic in Haskell and running it on AWS Elastic MapReduce, Azure HDInsight, GCP Dataproc, and so forth.
Hackage: https://hackage.haskell.org/package/hadoop-streaming
Word Count Example
See the Haddock in HadoopStreaming.Text for a simple word-count example.
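For orientation, here is roughly what the mapper half of a word count looks like as a bare Hadoop streaming program: read lines from stdin, emit tab-separated key/value pairs on stdout. This sketch deliberately does not use the library's Mapper/Reducer API; the Haddock example mentioned above shows the idiomatic, conduit-based version.

```haskell
-- Plain word-count mapper for Hadoop streaming (no hadoop-streaming API used).
{-# LANGUAGE OverloadedStrings #-}
module Main (main) where

import qualified Data.Text as T
import qualified Data.Text.IO as TIO

main :: IO ()
main = do
  contents <- TIO.getContents
  -- Emit "word<TAB>1" per word; Hadoop streaming sorts and groups lines by
  -- the key before the first tab when feeding them to the reducer.
  mapM_ (\w -> TIO.putStrLn (w <> "\t1")) (T.words contents)
```

The corresponding reducer would read the sorted key/value lines and sum the counts for each word.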
A Few Things to Note
ByteString vs Text
The HadoopStreaming module provides the general Mapper and Reducer data types, whose input and output types are abstract. They are usually instantiated with either ByteString or Text.
ByteString is more suitable if the input/output needs to be decoded/encoded, for instance using the base64-bytestring library. On the other hand, Text could make more sense if decoding/encoding is not needed, or if the data is not UTF-8 encoded (see below regarding encodings). In general I'd imagine ByteString being used much more often than Text.
The HadoopStreaming.ByteString and HadoopStreaming.Text modules provide some utilities for working with ByteString and Text, respectively.
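As a minimal illustration of the encode/decode case mentioned above, the helpers below use Data.ByteString.Base64 from the base64-bytestring library to decode incoming records and encode outgoing ones. The function names are illustrative and not part of hadoop-streaming.

```haskell
module Base64Records where

import           Data.ByteString (ByteString)
import qualified Data.ByteString.Base64 as B64

-- Decode one base64-encoded input record; a Left value carries a decoding
-- error that a mapper would typically log and skip.
decodeRecord :: ByteString -> Either String ByteString
decodeRecord = B64.decode

-- Encode one output record before it is written out as a line.
encodeRecord :: ByteString -> ByteString
encodeRecord = B64.encode
```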
Encoding
It is highly recommended that your input data be UTF-8 encoded, as this is the default encoding Hadoop uses. If you must use other encodings such as UTF-16, keep in mind the following gotchas:
- It is not enough that your code can work with the encoding you choose to use:
  - By default, if any of your input files does not end with a UTF-8 representation of newline, i.e., a 0x0A byte, Hadoop streaming will add a 0x0A byte.
  - Likewise, if any line in your mapper output does not contain a UTF-8 representation of tab (0x09), Hadoop streaming will add one at the end of the line.

  This will almost certainly break your job. It may be possible to configure Hadoop streaming to use another encoding, so that the above behavior is consistent with the encoding you choose, but I don't know whether that is the case. I tried -D mapreduce.map.java.opts="-Dfile.encoding=UTF-16BE", but that doesn't seem to work.

- If you use ByteString as the input type and use Data.ByteString.hGetLine to read lines from the input, be aware that Data.ByteString.hGetLine treats 0x0A bytes as line breaks, so it doesn't work properly for non-UTF-8 encoded input. For example, in UTF-16BE and UTF-16LE, the newline character is encoded as 0x00 0x0A and 0x0A 0x00, respectively (a short demonstration follows this list).
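To make the last point concrete, here is a minimal sketch (using only the text and bytestring packages) that prints the bytes of a UTF-16-encoded newline. Only one of the two bytes is 0x0A, which is why splitting the stream on bare 0x0A bytes misaligns records.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main (main) where

import qualified Data.ByteString as BS
import qualified Data.Text.Encoding as TE

main :: IO ()
main = do
  -- The newline character U+000A as raw bytes in each encoding.
  print (BS.unpack (TE.encodeUtf16BE "\n"))  -- [0,10]
  print (BS.unpack (TE.encodeUtf16LE "\n"))  -- [10,0]
```

Splitting a UTF-16BE stream on 0x0A therefore leaves a stray 0x00 at the start of the next "line", so Data.ByteString.hGetLine would hand the mapper misaligned records.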