If your workflows are held together by a handful of cron jobs or ad hoc scripts with loose dependencies that only a handful of people know how to manage then you’ll appreciate the nature of DAGs. Your workflow can be simplified, centralized and collaborative by having your entire task execution in a central tool that everyone can contribute to, and track in the Airflow UI.
Why use Composer for orchestrating my data workflows?
Composer can support a whole range of different use cases but the majority of users are data engineers who are building workflows to orchestrate their data pipelines. They’ve chosen Composer because it helps overcome some of the challenges commonly faced when managing data workflows.
Being able to coordinate and interface with multiple services is paramount to any ETL or ELT workflow and engineers need to be able to do this reliably and securely. Thankfully, there are hundreds of operators and sensors that allow you to communicate with services across multiple cloud environments without having to write and maintain lots of code to call APIs directly.
As workflows scale in complexity, having sophisticated task management becomes more important. You can parallelize and branch tasks based on the state of previous tasks. It has built-in scheduling and features to handle unexpected behavior – for example, sending an email or notification on failure.
Composer takes the Airflow experience for engineers up a level by creating and maintaining the Airflow environment for you, taking care of the infrastructure needed to get your DAGs up and running. Not having to focus on infrastructure management in addition to being able to automate and delegate the repetitive tasks to Composer, means data engineers get time back to focus on actually building data pipelines and workflows.
Can I use Composer to process and transform my data?
Composer is not a data processing tool and shouldn’t be used to directly transform and process big data. It is not designed to pass large amounts of data from one task to the next and doesn’t have any sophisticated data processing parallelism or aggregation that is fundamental to handling big data. Composer is also better suited to orchestrating batch workloads over those that require super low latency as it can take a few seconds to start one task once another has finished.
But what about those data transfer operators?
It’s worth pointing out there are some transfer operators, like the Google Cloud Storage to BigQuery operator, that transfers data from one source to another.
So isn’t Composer being used here to transfer data? Not quite – under the hood Composer is just making a call to the BigQuery API to transfer the data, no data is downloaded, or transferred between the Composer workers – this is all delegated to BigQuery resources.
So how does Composer compare to Dataflow or DataFusion?
Services like Data Fusion, Dataflow and Dataproc are great for ingesting, processing and transforming your data. These services are designed to operate directly on big data and can build both batch and real time pipelines that support the performant aggregation (shuffling, grouping) and scaling of data. This is where you should build your data pipelines and you can use Composer to manage the execution of these services as part of a wider workflow.
Let’s revisit our example with Composer
You would first create DAG with tasks for each stage of your workflow. We’ll assume our data pipeline is in Dataflow: