Workflow automation and orchestration at CASD

Datascience team

Goals

  • What is a workflow automation and orchestration framework?
  • What are the popular workflow automation tools?
  • How to do workflow automation and orchestration at CASD?

What is a workflow automation and orchestration framework?

A workflow automation and orchestration framework coordinates multiple, often interdependent, tasks so that complex processes run reliably, observably, and efficiently without human intervention.

  • Automation: Executes predefined tasks automatically (execution).
  • Orchestration: Ensures that multiple automated components run in the right order, with state tracking, error handling, and data passing between them (coordination).

Key concepts in a workflow automation and orchestration framework:

  • Task : The atomic unit of computation (e.g. a function, script, or command).
  • Flow / DAG : A set of tasks, rules, and dependencies designed to achieve a specific outcome. It can be represented as a directed acyclic graph (DAG).
  • Task Dependency : Ordering (e.g. upstream/downstream) and trigger conditions between tasks.
  • Scheduling and Triggers : Decide when a flow starts (e.g. on a schedule, on an event, or manually).
  • Executor / Worker : Executes tasks on the designated infrastructure (e.g. local machine, Docker, cluster).
  • Task State Management : Tracks the state of each task (e.g. running, success, retry, failed, cancelled).
  • Logger : Records logs, metrics, and events for task monitoring and debugging.
  • Error Handling and Retry : Defines how failed tasks are handled (e.g. retry, ignore, stop).
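
The concepts above can be illustrated with a minimal sketch using only the Python standard library. It is not how Prefect or Airflow are implemented; the names `State`, `run_flow`, `tasks`, and `upstream` are hypothetical and chosen for illustration. Tasks are plain functions, the DAG is a mapping from each task to its upstream tasks, and the runner tracks states, retries failed tasks, and records a log.

```python
from enum import Enum
from graphlib import TopologicalSorter  # Python 3.9+


class State(Enum):
    PENDING = "pending"
    SUCCESS = "success"
    FAILED = "failed"
    SKIPPED = "skipped"


def run_flow(tasks, upstream, max_retries=2):
    """Run tasks in dependency order and return (states, results, log).

    tasks:    name -> callable taking a dict of upstream results
    upstream: name -> set of task names that must succeed first
    """
    states = {name: State.PENDING for name in tasks}
    results, log = {}, []
    graph = {name: upstream.get(name, set()) for name in tasks}
    # static_order() yields each task only after all of its upstream tasks.
    for name in TopologicalSorter(graph).static_order():
        deps = graph[name]
        if any(states[d] is not State.SUCCESS for d in deps):
            states[name] = State.SKIPPED
            log.append(f"{name}: skipped (upstream did not succeed)")
            continue
        for attempt in range(1, max_retries + 2):
            try:
                results[name] = tasks[name]({d: results[d] for d in deps})
                states[name] = State.SUCCESS
                log.append(f"{name}: success on attempt {attempt}")
                break
            except Exception as exc:
                log.append(f"{name}: attempt {attempt} failed ({exc})")
        else:
            # Every attempt raised, so the task is marked as failed.
            states[name] = State.FAILED
    return states, results, log


# Hypothetical three-task flow; extract fails once to exercise the retry.
attempts = {"extract": 0}


def extract(_inputs):
    attempts["extract"] += 1
    if attempts["extract"] == 1:
        raise RuntimeError("source temporarily unavailable")
    return [3, 1, 2]


def transform(inputs):
    return sorted(inputs["extract"])


def load(inputs):
    return f"loaded {len(inputs['transform'])} rows"


tasks = {"extract": extract, "transform": transform, "load": load}
upstream = {"transform": {"extract"}, "load": {"transform"}}
states, results, log = run_flow(tasks, upstream)
```

A real framework adds what this sketch leaves out: scheduling and triggers, remote workers, persistent state, and a UI on top of the logs.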

Automation vs. orchestration

Automation:

  • Scope : Focuses on a single task
  • Goal : Avoid manual execution
  • Example : Run a Spark job every day to analyze logs

Orchestration:

  • Scope : Focuses on multiple tasks, their order, and their dependencies
  • Goal : Ensure end-to-end workflow consistency
  • Example : Run the extract → transform → validate → load tasks in sequence
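
The difference can be sketched in plain Python. All function names and the sample data here are hypothetical. Each step on its own is something automation could run unattended; `run_pipeline` plays the orchestration role by fixing the order, passing data from one step to the next, and stopping before the load when validation fails.

```python
def extract(raw_rows):
    # Stand-in for reading from a source system.
    return list(raw_rows)


def transform(rows):
    # Keep only rows that parse as non-negative integers.
    return [int(r) for r in rows if r.isdigit()]


def validate(values):
    if not values:
        raise ValueError("no valid rows to load")
    return values


def load(values, target):
    # Stand-in for writing to a destination system.
    target.extend(values)


def run_pipeline(raw_rows, target):
    """Run extract -> transform -> validate -> load in sequence."""
    completed = []
    try:
        rows = extract(raw_rows)
        completed.append("extract")
        values = transform(rows)
        completed.append("transform")
        validate(values)
        completed.append("validate")
        load(values, target)
        completed.append("load")
    except ValueError as exc:
        return completed, f"stopped: {exc}"
    return completed, "success"


good_target, bad_target = [], []
good_steps, good_status = run_pipeline(["10", "20", "oops", "30"], good_target)
bad_steps, bad_status = run_pipeline(["oops"], bad_target)
```

In the second run the pipeline stops after the transform step, so nothing reaches `bad_target`. That is the end-to-end consistency orchestration is meant to guarantee.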

Advantages of workflow automation

  • Reliability : Automated execution, recovery, and retries reduce manual work and human error.
  • Reproducibility : A complete workflow definition contains everything needed for a rerun (e.g. scripts, input data, parameters).
  • Auditability : The output of every past workflow run is logged, which lets users audit the workflow.
  • Maintainability : Modular tasks are easier to unit test, and reusable tasks avoid code duplication.
  • Scalability : Multiple workers allow tasks to run in parallel.
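
As a rough illustration of the scalability point, the sketch below runs independent tasks in parallel with a thread pool from the standard library. The threads only stand in for workers; a real orchestrator would typically dispatch tasks to separate processes, containers, or machines. The function `count_errors` and the sample log partitions are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def count_errors(lines):
    # One independent task: count the error lines in a log partition.
    return sum(1 for line in lines if "ERROR" in line)


# Hypothetical log partitions that do not depend on each other.
partitions = {
    "server-a": ["INFO start", "ERROR disk full", "INFO stop"],
    "server-b": ["ERROR timeout", "ERROR timeout", "INFO stop"],
    "server-c": ["INFO start", "INFO stop"],
}

# Each partition is submitted as its own task; the pool runs them concurrently.
with ThreadPoolExecutor(max_workers=3) as pool:
    futures = {name: pool.submit(count_errors, lines)
               for name, lines in partitions.items()}
    counts = {name: future.result() for name, future in futures.items()}
```

Only tasks without dependencies between them can run this way; dependent tasks still have to wait for their upstream tasks.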

Disadvantages of workflow automation

  • Extra operational complexity : New tools for automation and orchestration must be deployed and maintained.
  • Extra learning time : Users need to learn how to develop and run workflows in a specific tool (e.g. Prefect, Airflow).
  • Extra infrastructure overhead : Orchestration tools require several components (e.g. API server, worker, scheduler) to run continuously.

Existing workflow automation and orchestration tools (1)

  • Airflow : Works on all types of infrastructure (e.g. bare metal, VMs, and containers), very mature, large user base, provides many operators (e.g. Bash, Python, SQL). Runs only on POSIX systems such as Linux; on Windows it needs WSL2 or Linux containers.
  • Prefect : Python-centric, modern UI, cross-platform.
  • Dagster : Data-centric orchestrator with strong support for unit testing.

Existing workflow automation and orchestration tools (2)

  • Luigi : Easy to install as a Python package, lightweight scheduling, suitable for small and medium-sized workflows.
  • Argo Workflows : Uses YAML for workflow definitions, Kubernetes-native (runs only on Kubernetes).
  • Mage : Powerful tool with AI assistance, but requires a license.

Workflow automation and orchestration in CASD

CASD proposes two workflow automation tools: Prefect and Airflow.

  • Airflow : For Linux servers and complex workflows that require operators other than Python (e.g. SQL, Bash).
  • Prefect : For Windows servers and Python-centric workflows.

Workflow automation and orchestration in practice

  • Set up the Prefect working environment : Configure a virtual environment, install the Prefect client, connect to a server, and define your first workflow.
  • Work pool and worker management : Learn how to create and configure a work pool and a worker.
  • Workflow management : Learn how to define tasks and workflows.
  • Deployment management : Learn how to define a deployment in Prefect.
  • Run Spark in Prefect : Learn how to run Spark in Prefect, along with best practices.
  • Compare with Airflow : Compare what we have learned about Prefect with Airflow.