Piotr Ziuziański | Business Intelligence | 28 April 2021
The search for effective ways to store and process data is a sign of the times. The subject of this article is therefore the relational storage of large, structured data sets from various sources: the data warehouse. The article should interest anyone who wonders what a data warehouse and a Data Vault are, and what has changed in the perception of the “traditional” multidimensional data warehouse in recent years.
Almost every large enterprise faces the problem of data overabundance. Data is “produced” by almost every electronic device and by the various types of systems supporting the operational activity of the organization.
These include a wide range of operational systems.
These systems mostly store data in databases in the third normal form. What does this mean? A table can satisfy three successively stricter normal forms:
1. First normal form = 1NF – each attribute in the table holds an atomic (elementary) value, there are no repeating groups of information, and the table has a primary key;
2. Second normal form = 2NF – the table is in 1NF and every non-key attribute depends on the whole primary key, not just on part of it; informally, the table stores data about a single class of objects;
3. Third normal form = 3NF – the table is in 2NF and each non-key attribute depends directly on the primary key alone, with no transitive dependencies.
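To make the normal forms concrete, here is a minimal sketch of a 3NF design for sales data (all table and column names are invented for illustration, not taken from any particular system): each non-key attribute depends on the key, the whole key, and nothing but the key.

```python
import sqlite3

# In-memory database for the illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Each table describes a single class of objects (2NF intuition).
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    city        TEXT NOT NULL          -- depends only on customer_id
);
CREATE TABLE product (
    product_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    unit_price REAL NOT NULL
);
CREATE TABLE sales_order (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    order_date  TEXT NOT NULL
);
CREATE TABLE order_line (
    order_id   INTEGER REFERENCES sales_order(order_id),
    product_id INTEGER REFERENCES product(product_id),
    quantity   INTEGER NOT NULL,       -- depends on the WHOLE composite key
    PRIMARY KEY (order_id, product_id)
);
""")
```

Customer and product attributes live in their own tables, so no fact about a customer is repeated on every order line, which is exactly what normalization is meant to prevent.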
At some point, every enterprise faces a problem that relates not to the lack of access to data or the scarcity thereof, but to data overload, difficult processing, and the inability to turn it into knowledge and efficiently apply it in the decision-making process. The solution to this problem is a data warehouse, which is designed to integrate heterogeneous (from many sources) company data. The term “warehouse” clearly indicates a huge volume of data that can be stored in such a structure.
A data warehouse is a relational database that stores integrated data from various sources, including the company’s transactional systems. Most often, a data warehouse is dedicated to a company-specific business process or area. The purpose of the warehouse is to provide knowledge for decision-making.
It is difficult to indicate a specific moment in history that marks the beginning of the data warehouse concept. The theory pertaining to this concept dates back as far as the 1970s. However, the first commercial Business Intelligence system was created in 1985 for Procter & Gamble. In 1988, in the article “Architecture for Business and Information Systems”, published in the IBM Systems Journal, Barry Devlin and Paul Murphy defined the term “business data warehouse”.
Data warehouses are inextricably associated with the American computer scientist Bill Inmon, born in 1945, who is widely considered the father of the data warehouse. In 2007, Bill Inmon was named by Computerworld as one of the ten people who have had the most significant impact on IT development in the past 40 years. In 1992, Inmon defined the data warehouse as follows:
“A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decision-making process”
Each of the words used in the definition precisely defines the attributes of a data warehouse:
Fig. 1. Data warehouse attributes as defined by Inmon
Next to Inmon, Ralph Kimball, born in 1944, is another key figure in the field of data warehousing. Unlike Inmon’s definition of a data warehouse, where the emphasis is on the characteristics of the warehouse, Kimball focuses on its purpose: “a copy of transaction data specifically structured for query and analysis.”
Fig. 2. The purpose of the data warehouse as defined by Kimball
The author of the third approach to data warehousing, known as the Data Vault, is Dan Linstedt. The Data Vault is the result of ten years of his research into ensuring the consistency, flexibility and scalability of the warehouse. The first results of this research were five articles on the subject published in 2000.
Bill Inmon, Ralph Kimball and Dan Linstedt are collectively the authors of hundreds of publications presenting their approaches to data modeling. A complete picture of their views on data warehouses can be found in their books; in the following paragraphs, however, I will try to briefly compare the approaches of the three visionaries.
Fig. 3. The first editions of books on designing data warehouses in accordance with various approaches
According to Inmon, a data warehouse in the physical, implementation sense is a central, relational database in the third normal form, on the basis of which data marts are built for individual units (departments, divisions) of an organization. Inmon’s approach is based directly on source data, so you can start building a warehouse a priori, without first specifying user requirements. Inmon’s architecture assumes the use of the databases of all operational source systems, so it excludes the option of acquiring data selectively. The data warehouse therefore stores atomic (elementary) data from the source systems’ databases. This approach guarantees flexibility and scalability, which is extremely important when the database structures of the source systems change rapidly. Additionally, all data marts are always based on the same data warehouse, which implies consistency; Inmon is thereby a follower of the “single version of the truth” (SVOT) theory. It is worth emphasizing that the data warehouse and the data marts are physically separated, as can be seen in the figure below.
Fig. 4. Scheme of data warehouse architecture according to Inmon
Data warehouses are populated by means of so-called ETL (Extract, Transform, Load) processes. Their aims are, respectively, to extract data from the source systems, to transform it (cleanse, unify and integrate it), and to load it into the warehouse.
The diagram shows data sources that may be databases of transaction systems or flat files from these systems. Of course, other forms of data source are acceptable, albeit these two are the most commonly used. The second (optional) layer is the staging area where the data goes directly from the source systems, usually without any transformations.
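The extract–transform–load flow can be sketched in a few lines of code. This is a deliberately minimal illustration with invented function and field names; real pipelines use dedicated ETL tools, but the three stages are the same.

```python
def extract(source_rows):
    """Extract: read raw records from a source system (here, a list)."""
    return list(source_rows)

def transform(rows):
    """Transform: cleanse and unify (trim names, drop incomplete rows)."""
    cleaned = []
    for row in rows:
        if row.get("customer") and row.get("amount") is not None:
            cleaned.append({
                "customer": row["customer"].strip().title(),
                "amount": round(float(row["amount"]), 2),
            })
    return cleaned

def load(rows, warehouse):
    """Load: append the conformed records to the warehouse table."""
    warehouse.extend(rows)
    return warehouse

# Example run: one dirty record is cleaned, one incomplete record is dropped.
warehouse = []
source = [
    {"customer": " alice ", "amount": "10.5"},
    {"customer": None, "amount": "3"},
]
load(transform(extract(source)), warehouse)
```

After the run, the warehouse holds a single cleaned record; the incomplete one was filtered out in the transform stage.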
As mentioned before, a data warehouse as understood by Inmon represents atomic data from source systems in third normal form. The data marts are built on the basis of one central data warehouse, sometimes referred to as the Corporate Information Factory (CIF). Data marts, or OLAP cubes built on top of them, are sources for various reporting applications with functionalities such as pivot tables, data visualization in the form of charts, the presentation of KPIs (Key Performance Indicators), etc.
Kimball is often associated with the logical schema of the data warehouse known as the star schema. This association is accurate, as Kimball developed the star schema concept, i.e. a relational database structure whose central part is the so-called fact table surrounded by dimension tables. The fact table contains measures and so-called foreign keys that link it to the dimension tables. A measure characterizes the magnitude of the given phenomenon (e.g. the sales value in the case of a product sale).
Dimensions put facts into context, providing additional descriptive information (e.g. about the product sold and the buyer). Each warehouse must include a time dimension that makes it possible to determine unambiguously the date and time of a given event.
Fig. 5. A star schema worked out by Ralph Kimball
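A star schema of this kind can be sketched as follows (all table and column names are invented for the example): a central fact table holding the measures and foreign keys, surrounded by dimension tables, including the time dimension.

```python
import sqlite3

# In-memory database for the illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: descriptive context for the facts.
CREATE TABLE dim_date (
    date_key  INTEGER PRIMARY KEY,
    full_date TEXT, year INTEGER, month INTEGER
);
CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY,
    name TEXT, category TEXT
);
CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY,
    name TEXT, segment TEXT
);
-- Central fact table: measures plus foreign keys to each dimension.
CREATE TABLE fact_sales (
    date_key     INTEGER REFERENCES dim_date(date_key),
    product_key  INTEGER REFERENCES dim_product(product_key),
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    sales_value  REAL,     -- measure: magnitude of the phenomenon
    quantity     INTEGER   -- measure
);
""")
```

A typical query joins the fact table to one or more dimensions and aggregates a measure, e.g. total sales value per year.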
Apart from the star schema, there are also variations known as the snowflake schema and the fact constellation (galaxy) schema.
Kimball assumes that a data warehouse is actually a collection of consistent data marts based on shared dimensions. Reports are created directly on the basis of data marts or through an additional layer – OLAP cubes.
Fig. 6. Schema of data warehouse architecture according to Kimball
According to Kimball, dimensions should be conformed, that is, they should have the same meaning in relation to many fact tables. The design stage in this approach involves the development of a bus matrix. Thanks to this, specific fact tables are plugged into a “bus” representing the dimensions available in the organization’s data warehouse.
Fig. 7. Bus matrix of business processes and dimensions
Moreover, Kimball’s approach, unlike Inmon’s, assumes a strong commitment of end users from the very beginning of the warehouse creation process.
Contrary to Inmon’s view, Linstedt assumes that all available data from the entire time period should be loaded into the warehouse. This is known as the “single version of the facts” approach. Just as Kimball did with the star schema, Linstedt introduces with the Data Vault some additional objects that organize the data warehouse structure. These objects are referred to as the hub, the satellite and the link.
Fig. 8. Data Vault architecture according to Linstedt
Hubs are objects containing a unique list of business keys (from source systems). In addition, the hub stores metadata on the date and time of the given key’s first occurrence and the source of its origin. The hub does not contain descriptive data or facts. Links, in turn, allow for the definition of relationships between hubs. They are similar to fact tables in multidimensional modeling. The satellites contain descriptive data, thus resembling the dimensions known from multidimensional modeling, and they can only connect to hubs or links. An example of a hub can therefore be a unique customer identifier in the sales system; the link will be a single sales line, and the satellite will be customer data for shipping.
Fig. 9. Schema of data warehouse architecture according to Linstedt
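The hub–link–satellite structure described above can be sketched as follows. This is an illustrative sketch only: all table and column names are invented, and the metadata columns are reduced to a minimum (load date and record source).

```python
import sqlite3

# In-memory database for the illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hub: unique business keys plus load metadata; no descriptive data.
CREATE TABLE hub_customer (
    customer_hk   TEXT PRIMARY KEY,   -- hash of the business key
    customer_bk   TEXT NOT NULL,      -- business key from the source system
    load_date     TEXT NOT NULL,      -- first occurrence of the key
    record_source TEXT NOT NULL       -- where the key came from
);
CREATE TABLE hub_product (
    product_hk TEXT PRIMARY KEY, product_bk TEXT NOT NULL,
    load_date TEXT NOT NULL, record_source TEXT NOT NULL
);
-- Link: relationship between hubs (here, a single sales line).
CREATE TABLE link_sale (
    sale_hk     TEXT PRIMARY KEY,
    customer_hk TEXT REFERENCES hub_customer(customer_hk),
    product_hk  TEXT REFERENCES hub_product(product_hk),
    load_date TEXT NOT NULL, record_source TEXT NOT NULL
);
-- Satellite: descriptive data attached to a hub (customer shipping data).
CREATE TABLE sat_customer_shipping (
    customer_hk TEXT REFERENCES hub_customer(customer_hk),
    load_date   TEXT NOT NULL,
    address     TEXT, city TEXT,
    hash_diff   TEXT,                 -- hash of the descriptive columns
    PRIMARY KEY (customer_hk, load_date)
);
""")
```

Note how this mirrors the article’s example: the hub holds the unique customer identifier, the link represents a single sales line, and the satellite holds the shipping data.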
For optimization purposes, in Data Vault 2.0 the primary keys coming directly from source systems (usually integers) are replaced with hash keys computed from the business keys with a so-called hash function, e.g. MD5 or the more secure SHA-2. Thanks to this approach, it is also possible to implement a Data Vault on Hadoop. Another Data Vault 2.0 innovation is the use of the so-called Hash Diff for the efficient comparison of data already loaded with data waiting to be loaded. The Hash Diff is computed over all descriptive columns in the satellite (excluding metadata) using the hash function. If the hashes differ, a new record is saved (similarly to SCD type 2); if they are equal, no action is taken.
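The hash key and Hash Diff mechanics can be sketched with the standard library. The normalization rules (upper-casing, trimming, the `||` separator) are common conventions shown here for illustration, not a fixed standard; MD5 is used for the key because the article mentions it, with SHA-256 (a SHA-2 variant) for the Hash Diff.

```python
import hashlib

def hash_key(business_key: str) -> str:
    """Hash key replacing the source-system primary key (MD5 example)."""
    return hashlib.md5(business_key.strip().upper().encode("utf-8")).hexdigest()

def hash_diff(*descriptive_columns: str) -> str:
    """Hash Diff over all descriptive (non-metadata) satellite columns."""
    payload = "||".join(c.strip().upper() for c in descriptive_columns)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Compare the loaded satellite row with two incoming rows.
loaded = hash_diff("Main Street 1", "Katowice")
incoming_same = hash_diff("Main Street 1", "Katowice")
incoming_changed = hash_diff("New Street 7", "Katowice")

# Equal hashes -> no action; different hashes -> insert a new
# satellite record (as in SCD type 2).
print(loaded == incoming_same)      # True: unchanged, nothing to load
print(loaded == incoming_changed)   # False: changed, save a new record
```

Comparing one hash per row instead of every descriptive column is what makes the load-time change detection efficient.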
According to Linstedt, a Data Vault warehouse can sometimes be divided into the so-called Raw Vault and Business Vault. The Raw Vault is the heart of the Data Vault: organized into hubs, links and satellites, it tracks historical changes in the data integrated from multiple sources. The Business Vault, built on top of this layer, stores data that is important from a business perspective, with the basic transformations required by the business already applied. The Business Vault thus prepares data for the data marts. It is sometimes referred to as staging out, as opposed to staging in, the layer that stores data collected directly from the sources without any transformations (a transition area). Since Data Vault 2.0, the term “information mart” has been used instead of “data mart” to emphasize its role in providing useful information to decision-makers.
When referring to the Data Vault, it is worth mentioning the names of those who actively contribute to and promote the concept of the Data Vault around the world: Hans Hultgren, Michael Olschimke, and Roelant Vos.
There are many aspects that differ or link the approaches presented in this article. A table presenting various criteria of comparative analysis, available in a PDF file, can be a handy tool.
Taking into consideration the well-established position of the data warehouse according to Kimball’s approach, the following question may arise: is multidimensional modeling then an out-of-date technique for data warehouse modeling? The answer is simple: it is customer need that determines which modeling technique would be suitable. For some, the Data Vault solution is not suitable; for others, using Inmon’s approach is pointless. Often you can find hybrid solutions combining various approaches. Instead of making a dogmatic choice between academic approaches, you should focus on pragmatism and flexibility. Each case is individual and only an accurate analysis based on experience allows one to make the right choice and adapt the solution accordingly to the client’s needs.
In general, if there are no specific analysis requirements or if marts are to provide information to several BI systems, it is worth applying Inmon’s approach, especially if the database structures of the source systems are stable.
The Kimball approach is recommended when the requirements are well known and defined. Multidimensional models are recommended as a mart structure due to their advantages, including high performance and readability for end users. On the other hand, some tools work well with data marts that have a flat structure, which is also perfect for Data Science analyses.
Linstedt’s Data Vault approach is an effective solution for a data warehouse based on multiple data sources whose structure changes frequently. The Data Vault is typically a good choice for large to very large agile projects where the flexibility, productivity and scalability of the warehouse are key.
When discussing ways of storing big data sets, we are obliged to mention the increasingly popular solutions known as Big Data. Due to the enormous amount of data available in enterprises, the concept of so-called Data Lakes, which aim to store and analyze Big Data, is developing. The emergence of new solutions of this type does not replace the existing ones, but fills a gap. The coexistence and integration of data storage systems (both structured and unstructured) in an organization allows you to control the chaos and obtain the necessary business knowledge.