Cloud Computing · Data Structures and Algorithms · Distributed Algorithms and Communication Protocols

Apache Beam + Apache Flink/Spark for Batch & Stream Processing

When it comes to stream processing, the open-source community provides an entire ecosystem for tackling a set of generic problems. Among the emerging Apache projects, Beam provides a clean programming model intended to run on top of a runtime such as Flink, Spark, or Google Cloud Dataflow.


Apache Beam


A really convenient declarative framework that lets you specify a complex processing pipeline in very few lines of code; the typical (and overused) word count example looks like:


public static void main(String[] args) {
  WordCountOptions options = PipelineOptionsFactory.fromArgs(args).withValidation()
      .as(WordCountOptions.class);
  Pipeline p = Pipeline.create(options);

  p.apply("ReadLines", TextIO.read().from(options.getInputFile()))
      .apply(new CountWords())
      .apply(MapElements.via(new FormatAsTextFn()))
      .apply("WriteCounts", TextIO.write().to(options.getOutput()));

  p.run().waitUntilFinish();
}


For a detailed explanation, there’s the quickstart page.
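The CountWords and FormatAsTextFn transforms used in the snippet above come from the official Beam WordCount example; roughly, they look like the following sketch (ExtractWordsFn is assumed to be a DoFn<String, String> that splits each line into words):

```java
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

// Sketch modeled on Beam's WordCount example. CountWords is a composite
// PTransform: it bundles the "split into words" and "count occurrences"
// steps into a single reusable unit.
public class WordCountTransforms {

  public static class CountWords
      extends PTransform<PCollection<String>, PCollection<KV<String, Long>>> {
    @Override
    public PCollection<KV<String, Long>> expand(PCollection<String> lines) {
      return lines
          .apply(ParDo.of(new ExtractWordsFn())) // line -> individual words
          .apply(Count.perElement());            // word -> (word, occurrences)
    }
  }

  // Formats each (word, count) pair as a single output line.
  public static class FormatAsTextFn extends SimpleFunction<KV<String, Long>, String> {
    @Override
    public String apply(KV<String, Long> input) {
      return input.getKey() + ": " + input.getValue();
    }
  }
}
```

Packaging the steps as a composite transform is what keeps the main pipeline down to a handful of `.apply()` calls.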


Beam is a solid option for creating dataflow-oriented processing pipelines that can be executed on distributed schedulers/runtimes.
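Switching runtimes is largely a matter of pipeline options: the same pipeline code can target Flink either via the `--runner=FlinkRunner` command-line flag or programmatically. A minimal sketch, assuming the `beam-runners-flink` artifact is on the classpath:

```java
import org.apache.beam.runners.flink.FlinkPipelineOptions;
import org.apache.beam.runners.flink.FlinkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class FlinkRunnerExample {
  public static void main(String[] args) {
    // Pin the pipeline to the Flink runner; swapping the runner class
    // (e.g. to SparkRunner or DataflowRunner) is the only change needed
    // to move the same pipeline to another runtime.
    FlinkPipelineOptions options = PipelineOptionsFactory.as(FlinkPipelineOptions.class);
    options.setRunner(FlinkRunner.class);
    // Optional: target a remote Flink cluster instead of the embedded one
    // (the address below is a placeholder, not a real endpoint).
    // options.setFlinkMaster("localhost:8081");
    Pipeline p = Pipeline.create(options);
    // ... apply transforms, then p.run()
  }
}
```

This runner-agnostic design is the main selling point: the pipeline definition stays identical while the execution backend is chosen at launch time.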

