Data Pipelines with Apache Airflow, Second Edition (MEAP v9)

Title: Data Pipelines with Apache Airflow, Second Edition (MEAP v9)
Author: Julian de Ruiter, Ismael Cabral, Kris Geusebroek, Daniel van der Ende, Bas Harenslak
Publisher: Manning Publications
Year: 2025
Pages: 401
Language: English
Format: pdf (true), epub
Size: 56.2 MB

Simplify, streamline, and scale your data operations with data pipelines built on Apache Airflow.

Apache Airflow provides a batteries-included platform for designing, implementing, and monitoring data pipelines. Building pipelines on Airflow eliminates the need for patchwork stacks and homegrown processes, adding security and consistency to the process. Now in its second edition, Data Pipelines with Apache Airflow teaches you to harness this powerful platform to simplify and automate your data pipelines, reduce operational overhead, and seamlessly integrate all the technologies in your stack.

This book focuses on Apache Airflow, a batch-oriented framework for building data pipelines. Airflow’s key feature is that it enables you to easily build scheduled data pipelines using Python, while also providing building blocks that allow you to stitch together the many different technologies encountered in modern technology landscapes.

In Airflow, you define your DAGs using Python code in DAG files, which are essentially Python scripts that describe the structure of the corresponding DAG. As such, each DAG file typically describes the set of tasks for a given DAG and the dependencies between the tasks, which are then parsed by Airflow to identify the DAG structure. One advantage of defining Airflow DAGs in Python code is that this programmatic approach provides you with a lot of flexibility for building DAGs. For example, as we will see later in this book, you can use Python code to dynamically generate optional tasks depending on certain conditions or even generate entire DAGs based on external metadata or configuration files.
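To make this concrete, below is a minimal sketch of what such a DAG file might look like. It is an illustrative example assuming Airflow 2.4 or later (for the schedule argument); the dag_id, task ids, and callables are hypothetical and not taken from the book.

    # Minimal sketch of a DAG file; assumes Airflow 2.4+.
    import pendulum
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def _fetch_data():
        print("fetching data...")

    def _clean_data():
        print("cleaning data...")

    with DAG(
        dag_id="example_pipeline",  # hypothetical name
        start_date=pendulum.datetime(2025, 1, 1, tz="UTC"),
        schedule="@daily",          # run once per day
        catchup=False,
    ) as dag:
        fetch = PythonOperator(task_id="fetch_data", python_callable=_fetch_data)
        clean = PythonOperator(task_id="clean_data", python_callable=_clean_data)

        # Airflow parses this file to discover the DAG structure:
        # fetch_data must complete before clean_data runs.
        fetch >> clean

        # Because the file is plain Python, tasks can also be generated
        # dynamically, e.g. one export task per (hypothetical) source:
        for source in ["sales", "inventory"]:
            clean >> PythonOperator(
                task_id=f"export_{source}",
                python_callable=_fetch_data,
            )

When Airflow parses this file, it finds one DAG with four tasks and runs it once per day.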

These first few chapters should be broadly applicable and appeal to a wide audience. For these chapters, we expect you to have intermediate experience with programming in Python (~one year of experience), meaning that you should be familiar with basic concepts such as string formatting, comprehensions, args/kwargs, and so on. You should also be familiar with the basics of the Linux terminal and have a basic working knowledge of databases (including SQL) and different data formats.

After this introduction, we’ll dive deeper into more advanced features of Airflow, such as generating dynamic DAGs, implementing your own operators, running containerized tasks, and so on. These chapters require a deeper understanding of the technologies involved, including writing your own Python classes, basic Docker concepts, file formats, and data partitioning. We expect this second part to be of special interest to the data engineers in the audience.
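As a small preview of what implementing your own operator involves: an Airflow operator is a Python class that subclasses BaseOperator and implements an execute() method, which runs when the task executes. A minimal sketch, assuming Airflow 2.x; the operator name and its behavior are invented for illustration:

    from airflow.models.baseoperator import BaseOperator

    class GreetOperator(BaseOperator):
        """Hypothetical operator that logs a greeting."""

        def __init__(self, name: str, **kwargs):
            super().__init__(**kwargs)  # passes task_id etc. to BaseOperator
            self.name = name

        def execute(self, context):
            # Called by Airflow when the task instance runs.
            self.log.info("Hello, %s!", self.name)
            return self.name  # the return value is pushed to XCom by default

It is used inside a DAG file like any built-in operator, e.g. GreetOperator(task_id="greet", name="Airflow").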

In Data Pipelines with Apache Airflow, Second Edition you'll learn how to:

Master the core concepts of Airflow architecture and workflow design
Schedule data pipelines using the Dataset API and timetables, including complex irregular schedules
Develop custom Airflow components for your specific needs
Implement comprehensive testing strategies for your pipelines
Apply industry best practices for building and maintaining Airflow workflows
Deploy and operate Airflow in production environments
Orchestrate workflows in container-native environments
Build and deploy Machine Learning and Generative AI models using Airflow

Data Pipelines with Apache Airflow has empowered thousands of data engineers to build more successful data platforms. This second edition has been fully revised to cover the latest features of Apache Airflow, including the Taskflow API, deferrable operators, and Large Language Model integration. Filled with real-world scenarios and examples, it carefully guides you from Airflow novice to expert.

about the book
Data Pipelines with Apache Airflow, Second Edition teaches you how to build and maintain effective data pipelines. You'll master every aspect of directed acyclic graphs (DAGs)—the power behind Airflow—and learn to customize them for your pipeline's specific needs. Part reference and part tutorial, each technique is illustrated with engaging hands-on examples, from training Machine Learning models for Generative AI to optimizing delivery routes. You'll explore common Airflow usage patterns, including aggregating multiple data sources and connecting to data lakes, while discovering exciting new features such as dynamic scheduling, the Taskflow API, and Kubernetes deployments.

about the reader
For DevOps engineers, data engineers, Machine Learning engineers, and sysadmins with intermediate Python skills.

Download Data Pipelines with Apache Airflow, Second Edition (MEAP V09)
