ETL: Extract, Transform, Load

ETL is a three-phase process in which data is extracted from a source, transformed (cleaned, sanitized, scrubbed), and loaded into an output data container.
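A minimal sketch of the three phases in plain Python; the CSV source, the cleaning rules, and the SQLite destination are hypothetical stand-ins for whatever systems a real pipeline would touch:

```python
# Minimal ETL sketch. "users.csv", the name/email columns, and the SQLite
# warehouse are hypothetical examples, not a prescribed setup.
import csv
import sqlite3


def extract(path: str) -> list[dict]:
    """Extract: read raw rows from a CSV source."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


def transform(rows: list[dict]) -> list[tuple]:
    """Transform: drop rows without an email, normalize whitespace and case."""
    return [
        (row["name"].strip().title(), row["email"].strip().lower())
        for row in rows
        if row.get("email")
    ]


def load(rows: list[tuple], db_path: str = "warehouse.db") -> None:
    """Load: write the cleaned rows into the output container (SQLite here)."""
    with sqlite3.connect(db_path) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS users (name TEXT, email TEXT)")
        conn.executemany("INSERT INTO users VALUES (?, ?)", rows)


if __name__ == "__main__":
    load(transform(extract("users.csv")))
```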

Tools

  • Airbyte - "Airbyte is an open-source EL(T) platform that helps you replicate your data in your warehouses, lakes and databases."
  • Airflow - "A platform to programmatically author, schedule, and monitor workflows." (A minimal DAG sketch appears after this list.)
  • Azkaban - "A batch workflow job scheduler created at LinkedIn to run Hadoop jobs. Azkaban resolves the ordering through job dependencies and provides an easy to use web user interface to maintain and track your workflows."
  • Dray.it - "Docker workflow engine. Allows users to separate a workflow into discrete steps each to be handled by a single container."
  • Luigi - "a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in." (A minimal task sketch appears after this list.)
  • Mara Pipelines - "A lightweight opinionated ETL framework, halfway between plain scripts and Apache Airflow"
  • Pinball - "a scalable workflow management platform developed at Pinterest. It is built based on a layered approach."
  • prefect - "a new workflow management system, designed for modern infrastructure and powered by the open-source Prefect Core workflow engine. Users organize Tasks into Flows, and Prefect takes care of the rest."
  • TaskFlow - "allows the creation of lightweight task objects and/or functions that are combined together into flows (aka: workflows) in a declarative manner. It includes engines for running these flows in a manner that can be stopped, resumed, and safely reverted."
  • Toil - Similar to Luigi: jobs are classes with a run method. Supports executing jobs on other machines (workers), which can include AWS spot instances.
  • Argo - Container-based workflow management system for Kubernetes. Workflows are specified as a directed acyclic graph (DAG); each step is executed in a container, which runs on a Kubernetes Pod. There is also support for Airflow DAGs.
  • Dagster - "Dagster is a data orchestrator for machine learning, analytics, and ETL. It lets you define pipelines in terms of the data flow between reusable, logical components, then test locally and run anywhere. With a unified view of pipelines and the assets they produce, Dagster can schedule and orchestrate Pandas, Spark, SQL, or anything else that Python can invoke."
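To make the Airflow entry above concrete, here is a small DAG sketch, assuming Airflow 2.x; the dag_id, schedule, and task callables are hypothetical placeholders:

```python
# Sketch of an Airflow 2.x DAG wiring extract -> transform -> load.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("extract")


def transform():
    print("transform")


def load():
    print("load")


with DAG(
    dag_id="example_etl",            # hypothetical name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # Dependencies define the DAG: extract -> transform -> load.
    extract_task >> transform_task >> load_task
```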
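And a comparable sketch for the Luigi entry, where tasks are classes with output, requires, and run methods; the target file names and the toy "transform" logic are hypothetical:

```python
# Sketch of a two-step Luigi pipeline with file targets.
import luigi


class Extract(luigi.Task):
    def output(self):
        return luigi.LocalTarget("raw.txt")  # hypothetical target

    def run(self):
        with self.output().open("w") as f:
            f.write("1\n2\n3\n")


class Transform(luigi.Task):
    # Luigi resolves dependencies from requires(): Extract runs first.
    def requires(self):
        return Extract()

    def output(self):
        return luigi.LocalTarget("clean.txt")  # hypothetical target

    def run(self):
        with self.input().open() as fin, self.output().open("w") as fout:
            for line in fin:
                fout.write(f"{int(line) * 2}\n")


if __name__ == "__main__":
    # local_scheduler avoids needing a running luigid instance.
    luigi.build([Transform()], local_scheduler=True)
```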