Title: Data Virtualization in the Cloud Era: Data Lakes and Data Federation At Scale
Authors: Daniel Abadi, Andrew Mott
Publisher: O’Reilly Media, Inc.
Published: 2024-07-03
Pages: 184
Language: English
Format: pdf, epub, mobi
Size: 10.1 MB
For decades data virtualization has been little more than a dream. How nice it would be if we could ignore all the details regarding where data is located and how it is stored, and simply access all data within an organization from a single unified interface! Unfortunately, this dream was held back by fundamental limitations of hardware and the complexity of the necessary software, so data virtualization remained a niche technology. However, in the last decade, advances in networking hardware and machine learning technology have started to transform data virtualization from dream to reality.
Language is a barrier beyond the fact that one dataset may be in English, another in Chinese, and another in Greek. Even if they are all in English, the computing systems that store the data may require questions to be posed in different query languages in order to extract data from, or answer questions about, these datasets. One system may have an SQL interface, another GraphQL, and a third system may support only text search. The client who wishes to pose a question to these differing systems needs to learn the language that each system supports as its interface.
The goal of data virtualization (DV) is to eliminate or alleviate these barriers. A DV System creates a central interface in which data can be accessed no matter where it is located, no matter how it is stored, and no matter how it is organized. The system does not physically move the data to a central location. Rather, the data exists there virtually. A user of the system is given the impression that all data is in one place, even though in reality it may be spread across the world. Furthermore, the user is presented with information about what datasets exist, how they are organized, and enough of the semantic details of each dataset to be able to formulate queries over them. The user can then issue commands that access any dataset virtualized by the system without needing to know any of the physical details regarding where data is located, which systems are being used to store it, and how the data is compressed or organized in storage.
The most complex part of a DV System is the data virtualization engine (DV Engine), which receives requests from clients (generated using the client interface) and performs whatever processing is required for these requests. This typically involves communication with the specific underlying data sources that contain data relevant to those requests. The DV Engine thus needs to know how to communicate with the variety of systems that may store data being virtualized by the system. Furthermore, it may need to forward parts of client requests to these underlying source systems. Therefore, the engine needs to know how to express these requests so that the underlying data source system can execute them efficiently and return the results in a form that the DV System can consume at scale. The DV Engine may also need to combine results received from multiple underlying data source systems involved in a client request.
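The engine's role described above (forwarding parts of a request to each source, then combining the results) can be illustrated with a minimal sketch. Everything here is hypothetical and chosen only for illustration: the `DVEngine` class, its `register` and `join` methods, and the in-memory sources do not come from the book or from any real product, and a real engine would push filters down to the sources and stream results rather than fetch whole datasets.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


class DVEngine:
    """Toy engine (hypothetical): one fetch function per virtualized dataset."""

    def __init__(self) -> None:
        self._sources: Dict[str, Callable[[], List[Row]]] = {}

    def register(self, dataset: str, fetch: Callable[[], List[Row]]) -> None:
        # The data stays in the source; only the way to reach it is recorded.
        self._sources[dataset] = fetch

    def join(self, left: str, right: str, key: str) -> List[Row]:
        # Forward one sub-request to each underlying source.
        left_rows = self._sources[left]()
        right_rows = self._sources[right]()
        # Combine the two result sets inside the engine with a hash join.
        index: Dict[object, List[Row]] = {}
        for row in right_rows:
            index.setdefault(row[key], []).append(row)
        return [{**l, **r} for l in left_rows for r in index.get(l[key], [])]


# Two stand-in "sources"; in practice these would be separate remote systems.
orders = [
    {"order_id": 1, "customer_id": 10},
    {"order_id": 2, "customer_id": 11},
    {"order_id": 3, "customer_id": 99},
]
customers = [
    {"customer_id": 10, "name": "Ada"},
    {"customer_id": 11, "name": "Lin"},
]

engine = DVEngine()
engine.register("orders", lambda: orders)
engine.register("customers", lambda: customers)
result = engine.join("orders", "customers", "customer_id")
```

The client names only datasets and a join key; which system holds each dataset is known to the engine alone.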
In general, the goal of data virtualization is to allow clients to express requests over datasets without having to worry about the details of how the underlying data source systems store the source data. Yet most underlying data sources have unique interfaces that require expertise in that particular system before data can be accessed. Therefore, the DV Engine typically requires some translation on the fly from a global interface that is used by the client to access any underlying system into the particular interface used by specific underlying data sources.
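As a rough illustration of this on-the-fly translation, the sketch below turns one global request into three source-specific forms: SQL, GraphQL, and a text-search query. The `GlobalRequest` type, the translator functions, the `field:"value"` search syntax, and the GraphQL schema shape (a root field named after the dataset that returns objects with an `id`) are all assumptions made for the example, not interfaces described in the book.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class GlobalRequest:
    """Hypothetical global request: rows of `dataset` where `field` equals `value`."""
    dataset: str
    field: str
    value: str


def to_sql(req: GlobalRequest) -> Tuple[str, Tuple[str, ...]]:
    # The value is passed as a bound parameter instead of being spliced in.
    return f'SELECT * FROM "{req.dataset}" WHERE "{req.field}" = ?', (req.value,)


def to_graphql(req: GlobalRequest) -> Tuple[str, Dict[str, str]]:
    # Assumes the source's schema exposes the dataset as a filterable root field.
    query = f"query($v: String) {{ {req.dataset}({req.field}: $v) {{ id }} }}"
    return query, {"v": req.value}


def to_text_search(req: GlobalRequest) -> str:
    # Assumes a search source that accepts field:"value" style queries.
    return f'{req.field}:"{req.value}"'


req = GlobalRequest(dataset="customers", field="country", value="GR")
sql, sql_params = to_sql(req)
gql, gql_vars = to_graphql(req)
search = to_text_search(req)
```

The client builds the request once in the global form; a per-source translator of this kind is what keeps the client from having to learn each source's interface.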
In this book, we discuss:
What is data virtualization and why is it useful?
What are the technical underpinnings that make virtualization more practical today than it ever has been in the past?
Where does data virtualization fit into the modern data mesh and data fabric paradigms?
1. Introduction to Data Virtualization and Data Lakes
2. Recent Technology Developments Driving the Rebirth of Data Virtualization
3. How Data Virtualization Systems Work
4. Advanced Architectural Components
5. Data Virtualization Systems in Practice
6. Case Studies
7. Data Architectures Supported by Data Virtualization Systems
8. The Future of Data Virtualization