Cloud Computing · Data Structures and Algorithms · Distributed Algorithms and Communication Protocols

Apache Beam + Apache Flink/Spark for Batch & Stream Processing

When it comes to stream processing, the Open Source community provides an entire ecosystem to tackle a set of generic problems. Among the emergent Apache projects, Beam provides a clean programming model intended to run on top of a runtime such as Flink, Spark, Google Cloud Dataflow, etc.

 

Apache Beam

 

A really convenient declarative framework that lets you specify complex processing pipelines in very few lines of code; the typical (and somewhat overused) word count example looks like this:

 

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;

// WordCountOptions, CountWords, and FormatAsTextFn are defined in the
// surrounding WordCount example shipped with the Beam SDK.
public static void main(String[] args) {
  WordCountOptions options = PipelineOptionsFactory.fromArgs(args).withValidation()
      .as(WordCountOptions.class);
  Pipeline p = Pipeline.create(options);

  p.apply("ReadLines", TextIO.read().from(options.getInputFile()))
      .apply(new CountWords())
      .apply(MapElements.via(new FormatAsTextFn()))
      .apply("WriteCounts", TextIO.write().to(options.getOutput()));

  p.run().waitUntilFinish();
}

 

For a detailed explanation, see the Beam quickstart page.
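One of Beam's selling points is that the same pipeline can be dispatched to different runtimes just by changing the runner option. As a rough sketch (the exact Maven profile and paths depend on your project setup), running the word count example on the Flink runner rather than the default DirectRunner looks something like:

```shell
# Run the WordCount pipeline on Apache Flink instead of the local DirectRunner.
# The -Pflink-runner profile and input/output paths are assumptions based on
# the Beam quickstart layout; adjust them to your own project.
mvn compile exec:java \
  -Dexec.mainClass=org.apache.beam.examples.WordCount \
  -Dexec.args="--runner=FlinkRunner --inputFile=pom.xml --output=counts" \
  -Pflink-runner
```

Swapping `FlinkRunner` for `SparkRunner` or `DataflowRunner` (with the matching profile and dependencies) targets the other runtimes without touching the pipeline code itself.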

 

A solid option for creating dataflow-oriented processing pipelines that can be executed on distributed schedulers/runtimes.
