Title: Automating Data Quality Monitoring at Scale: Scaling Beyond Rules with Machine Learning (Final)
Authors: Jeremy Stanley, Paige Schwartz
Publisher: O’Reilly Media, Inc.
Year: 2024
Pages: 220
Language: English
Format: True PDF, True/Retail EPUB
Size: 21.4 MB, 10.1 MB
The world's businesses ingest a combined 2.5 quintillion bytes of data every day. But how much of this vast amount of data—used to build products, power AI systems, and drive business decisions—is poor quality or just plain bad? This practical book shows you how to ensure that the data your organization relies on contains only high-quality records.
Most data engineers, data analysts, and data scientists genuinely care about data quality, but they often don't have the time, resources, or understanding to create a data quality monitoring solution that succeeds at scale. In this book, Jeremy Stanley and Paige Schwartz from Anomalo explain how you can use automated data quality monitoring to cover all your tables efficiently, proactively alert on every category of issue, and resolve problems immediately.
Machine learning is a statistical approach that, compared to rule-based testing and metrics monitoring, has many advantages: it’s scalable, can detect unknown-unknown changes, and, at the risk of anthropomorphizing, it’s smart. It can learn from prior inputs, use contextual information to minimize false positives, and understand your data better and better over time. In the previous chapters, we’ve explored when and how automation with ML makes sense for your data quality monitoring strategy. Now it’s time to explore the core mechanism: how you can train, develop, and use a model to detect data quality issues, and even explain aspects like their severity and where they occur in your data. In this chapter, we’ll explain which machine learning approach works best for data quality monitoring and show you the algorithm (series of steps) you can follow to implement this approach. We’ll answer questions like how much data you should sample and how to make the model’s outputs explainable. One important caveat: following the steps here won’t, on its own, result in a model that’s ready to monitor real-world data.
This book will help you:
• Learn why data quality is a business imperative
• Understand and assess unsupervised learning models for detecting data issues
• Implement notifications that reduce alert fatigue and let you triage and resolve issues quickly
• Integrate automated data quality monitoring with data catalogs, orchestration layers, and BI and ML systems
• Understand the limits of automated data quality monitoring and how to overcome them
• Learn how to deploy and manage your monitoring solution at scale
• Maintain automated data quality monitoring for the long term
Who Should Use This Book: We’ve written this book with three main audiences in mind. The first is the chief data and analytics officer (CDAO) or VP of data. As someone responsible for your organization’s data at the highest level, this entire book is for you—but you may be most interested in Chapters 1, 2, and 3, where we clearly explain why you should care about automating data quality monitoring at your organization and walk through how to assess the ROI of an automated data quality monitoring platform. Chapter 8 is also especially relevant, as it discusses how to track and improve data quality over time.
The second audience for this book is the head of data governance. In this or similar roles, you’re likely the person most directly accountable for managing data quality at your organization. While the entire book should be of great value to you, we believe that the chapters on automation, Chapters 1, 2, and 3, as well as Chapters 7 and 8 on integrations and operations, will be especially useful.
Our third audience is the data practitioner. Whether you’re a data scientist, analyst, or data engineer, your job depends on data quality, and the monitoring tools you use will have a significant impact on your day-to-day. Those building or operating a data quality monitoring platform should focus especially on Chapters 4 through 7, where we cover how to develop a model, design notifications, and integrate the platform with your data ecosystem.
Preface
1. The Data Quality Imperative
2. Data Quality Monitoring Strategies and the Role of Automation
3. Assessing the Business Impact of Automated Data Quality Monitoring
4. Automating Data Quality Monitoring with Machine Learning
5. Building a Model That Works on Real-World Data
6. Implementing Notifications While Avoiding Alert Fatigue
7. Integrating Monitoring with Data Tools and Systems
8. Operating Your Solution at Scale
A. Types of Data Quality Issues
Index