Overview
In the study note series, this post covers Google Dataflow. Dataflow (DF) is Google's serverless, managed streaming service, offering low latency and low processing times.
- Dataflow is used to deduplicate and window data streams
- A trigger event occurs when the watermark passes the end of the window
- Trigger events also include the arrival of late data
- The watermark is the age of the oldest unprocessed record
- Two time concepts need to be understood
- Event time (when the event actually happened)
- Processing time (when the pipeline processes the element and computes the aggregate)
- If windows are daily, results driven only by the watermark may be too slow; it is possible to provide speculative (early) results, as sketched below.
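As a rough illustration, the Apache Beam Python snippet below (a sketch, not from the original post) windows a hypothetical keyed PCollection called events into hourly windows and uses an AfterWatermark trigger with early (speculative) and late firings; the window size, early-firing interval and allowed lateness are arbitrary example values.

```python
import apache_beam as beam
from apache_beam.transforms import window
from apache_beam.transforms.trigger import (
    AccumulationMode, AfterCount, AfterProcessingTime, AfterWatermark)

# `events` is a hypothetical keyed PCollection with event timestamps attached.
hourly_counts = (
    events
    | "HourlyWindows" >> beam.WindowInto(
        window.FixedWindows(60 * 60),            # 1-hour event-time windows
        trigger=AfterWatermark(
            early=AfterProcessingTime(60),       # speculative result every 60 s of processing time
            late=AfterCount(1)),                 # re-fire for each late element
        accumulation_mode=AccumulationMode.ACCUMULATING,
        allowed_lateness=60 * 60)                # accept late data up to 1 hour behind the watermark
    | "SumPerKey" >> beam.CombinePerKey(sum))
```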
- Dataflow provides the following window options (illustrated in the sketch after this list)
- Fixed Time Windows
- Sliding Time Windows
- Per-Session Windows
- Single Global Window
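For reference, here is a minimal Beam Python sketch of the four options applied to a hypothetical PCollection called events; the sizes and gap are arbitrary example values.

```python
import apache_beam as beam
from apache_beam.transforms import window

# `events` is a hypothetical PCollection with event timestamps attached.
fixed    = events | "Fixed"   >> beam.WindowInto(window.FixedWindows(60))         # 60-second fixed windows
sliding  = events | "Sliding" >> beam.WindowInto(window.SlidingWindows(300, 60))  # 5-minute windows, starting every 60 s
sessions = events | "Session" >> beam.WindowInto(window.Sessions(600))            # sessions closed by a 10-minute gap
global_w = events | "Global"  >> beam.WindowInto(window.GlobalWindows())          # single global window (the default)
```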
- Dataflow processes both real-time (streaming) and batch data
- DF allows for aggregates and window-based analytics.
- It ingests streaming data and can augment it with static data, as sketched below
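One common way to do that augmentation in Beam Python is a side input; the sketch below joins a hypothetical streaming PCollection of orders with a small static lookup table of product names (all names are illustrative).

```python
import apache_beam as beam

def attach_product_name(order, product_names):
    """Augment one streaming element with a value from the static lookup table."""
    name = product_names.get(order["product_id"], "unknown")
    return {**order, "product_name": name}

# `orders` is a hypothetical streaming PCollection of dicts;
# `products` is a hypothetical static PCollection of (product_id, name) pairs.
enriched = orders | "Enrich" >> beam.Map(
    attach_product_name,
    product_names=beam.pvalue.AsDict(products))  # static data passed as a side input
```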
- DF is programmable via an API
- DF is based on Apache Beam
- Because Dataflow pipelines are written with Beam, they can also be executed on runners such as Apache Flink or Apache Spark; the default is the Cloud Dataflow managed runner
- There is no Cloud Dataflow-specific cross-pipeline communication mechanism for sharing data or processing context between pipelines. You can use durable storage like Cloud Storage or an in-memory cache like App Engine to share data between pipeline instances.
- A transform represents a processing operation that transforms data. A transform takes one or more PCollections as input, performs an operation that you specify on each element in that collection, and produces one or more PCollections as output. A transform can perform nearly any kind of processing operation, including performing mathematical computations on data, converting data from one format to another, grouping data together, reading and writing data, filtering data to output only the elements you want, or combining data elements into single values.
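As an illustration, this Beam Python sketch chains a few such transforms over a hypothetical PCollection of words (filtering, format conversion, grouping and combining).

```python
import apache_beam as beam

# `words` is a hypothetical PCollection of strings.
word_counts = (
    words
    | "KeepLongWords" >> beam.Filter(lambda w: len(w) > 3)       # filter out unwanted elements
    | "PairWithOne"   >> beam.Map(lambda w: (w, 1))              # convert to a (key, value) format
    | "CountPerWord"  >> beam.CombinePerKey(sum)                 # group by key and combine into single values
    | "Format"        >> beam.Map(lambda kv: f"{kv[0]}: {kv[1]}"))
```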
- Dataflow does not ingest event streams directly; it relies on the Pub/Sub service for stream ingestion.
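A minimal sketch of that ingestion path with the Beam Python SDK, assuming a hypothetical Pub/Sub topic name:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)  # Pub/Sub reads require streaming mode
with beam.Pipeline(options=options) as p:
    events = (
        p
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            topic="projects/my-project/topics/my-events")  # hypothetical topic
        | "Decode" >> beam.Map(lambda msg: msg.decode("utf-8")))
```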
- Using the Drain option to stop your job tells the Cloud Dataflow service to finish your job in its current state. Your job will stop ingesting new data from input sources soon after receiving the drain request (typically within a few minutes). However, the Cloud Dataflow service will preserve any existing resources, such as worker instances, to finish processing and writing any buffered data in your pipeline. When all pending processing and write operations are complete, the Cloud Dataflow service will clean up the GCP resources associated with your job.
From https://cloud.google.com/dataflow/docs/guides/stopping-a-pipeline
- ParDo is the core parallel processing operation in the Apache Beam SDKs, invoking a user-specified function on each of the elements of the input PCollection. ParDo collects the zero or more output elements into an output PCollection. The ParDo transform processes elements independently and possibly in parallel.
From https://cloud.google.com/dataflow/docs/concepts/beam-programming-model
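A minimal ParDo sketch in Beam Python: a user-specified DoFn that emits zero or more output elements (words) per input element (line); lines is a hypothetical input PCollection.

```python
import apache_beam as beam

class SplitIntoWordsFn(beam.DoFn):
    """User-specified function applied by ParDo to each input element."""
    def process(self, element):
        # Emit zero or more output elements per input element.
        for word in element.split():
            yield word

words = lines | "SplitIntoWords" >> beam.ParDo(SplitIntoWordsFn())
```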
Updating Pipelines
Batch
- Updating batch pipelines is not supported.
Streaming
- Pass the --update option.
- Set the --jobName option in PipelineOptions to the same name as the job you want to update.
- If any transform names in your pipeline have changed, you must supply a transform mapping and pass it using the --transformNameMapping option, for example: --transformNameMapping={"oldTransform1":"newTransform1","oldTransform2":"newTransform2",...} (see the sketch below).
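A minimal sketch of passing these options at relaunch, assuming the Python SDK (where the corresponding flags are --update, --job_name and --transform_name_mapping); the project, region, job name and transform names are placeholders.

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholders throughout; the job name must match the running job being updated.
options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-project",
    "--region=us-central1",
    "--streaming",
    "--update",
    "--job_name=my-streaming-job",
    '--transform_name_mapping={"oldTransform1": "newTransform1"}',
])
```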
Changing windowing
You can change windowing and trigger strategies for the PCollections in your replacement pipeline, but use caution. Changing the windowing or trigger strategies will not affect data that is already buffered or otherwise in-flight.
We recommend that you attempt only smaller changes to your pipeline’s windowing, such as changing the duration of fixed- or sliding-time windows. Making major changes to windowing or triggers, like changing the windowing algorithm, might have unpredictable results on your pipeline output.
All details are accurate at the time of writing; please refer to Google's documentation for current details.