The search for an effective way to store and effectively process data is a sign of the times. Therefore, the subject of this article is the relational storage of big, structured data sets from various sources, called data warehouses. The article should be of interest to all those who are wondering what a data warehouse and a Data Vault are and what has changed in the perception of the “traditional” multidimensional data warehouse in recent years.

The curse of the data explosion

Almost every large enterprise faces the problem of data redundancy. Data is “produced” by almost every electronic device and various types of systems supporting the operational activity of the organization.

These include systems for:

  • Enterprise Resource Planning,
  • Customer Relationship Management,
  • Warehouse management,
  • Financial and accounting systems,
  • Manufacturing systems,
  • and many others, often specific to the given area of company activity.

These systems mostly store data in databases in the third normal form. What does it mean? In databases, you can have 3 types of relationships:

1. First normal form = 1NF – each attribute in the table has an elementary value (no repeating groups of information), and the table has a key;

2. Second normal form = 2NF – the table should store data only for a specific class of objects;

3. Third normal form = 3NF – each non-key attribute is directly dependent only on the primary key.

At some point, every enterprise faces a problem that relates not to the lack of access to data or the scarcity thereof, but to data overload, difficult processing, and the inability to turn it into knowledge and efficiently apply it in the decision-making process. The solution to this problem is a data warehouse, which is designed to integrate heterogeneous (from many sources) company data. The term “warehouse” clearly indicates a huge volume of data that can be stored in such a structure.
A data warehouse is a relational database that stores integrated data from various sources, including the company’s transactional systems. Most often, a data warehouse is dedicated to a company-specific business process or area. The purpose of the warehouse is to provide knowledge for decision-making.

A brief history of the data warehouse

It is difficult to indicate a specific moment in history that marks the beginning of the data warehouse concept. The theory pertaining to this concept dates back as far as the 1970s. However, the first commercial Business Intelligence system was created in 1985 for Procter & Gamble. In 1988, in the article “Architecture for Business and Information Systems”, published in the IBM Systems Journal, Barry Devlin and Paul Murphy defined the term “business data warehouse”.

Data warehouses are inextricably associated with the American computer scientist Bill Inmon, born in 1945, who is widely considered the father of the data warehouse. In 2007, Bill Inmon was named by Computerworld as one of the ten people who have had the most significant impact on IT development in the past 40 years. In 1992, Inmon defined the data warehouse as follows:

“A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decision-making process”

Each of the words used in the definition precisely defines the attributes of a data warehouse:

Fig. 1. Data warehouse attributes as defined by Inmon

Next to Inmon, Ralph Kimball, born in 1944, is another key figure in the field of data warehousing. Unlike Inmon’s definition of a data warehouse, where the emphasis is on the characteristics of the warehouse, Kimball focuses on its purpose: “a copy of transaction data specifically structured for query and analysis.”

Fig. 2. The purpose of the data warehouse as defined by Kimball

The author of the third approach to the subject of the data warehouse, known as the Data Vault, is Dan Linstedt. The Data Vault is the result of 10 years of his research efforts to ensure the consistency, flexibility and scalability of the warehouse. The first results of his research in this field are five articles on this subject, which were published in 2000.

Bill Inmon, Ralph Kimball and Dan Linstedt are collectively the authors of hundreds of publications presenting their approach to data modeling. The complete reflection of their points of view on data warehouses can be found in their books. However, I will try to briefly compare the approaches of the three visionaries in the next paragraphs.

Fig. 3. The first editions of books on designing data warehouses in accordance with various approaches

Single version of the truth by Bill Inmon

According to Inmon, a data warehouse in a physical, implementation approach is a central, relational database in the third normal form, on the basis of which data marts are built for individual units (departments, divisions) of an organization. Inmon’s approach is based directly on source data, so you can start building a warehouse a priori without specifying the user requirements. Inmon’s architecture assumes the use of all operating system databases, so it excludes the option of acquiring data selectively. Therefore, the data warehouse stores atomic (elementary) data from source systems’ databases. This approach guarantees flexibility and scalability, which is extremely important in the case of rapidly changing database structures of source systems. Additionally, all data marts are always based on the same data warehouse, which implies consistency. Inmon is thereby a follower of the “single version of the truth” theory (SVOT). It is worth emphasizing that the data warehouse and data marts are physically separated, which can be observed in the picture below.

Fig. 4. Scheme of data warehouse architecture according to Inmon

Data warehouses are supplied using the so-called ETL (Extract, Transform, Load). The aims of these processes are as follows:

  • extraction, collecting data from source systems;
  • transformation and standardization of data;
  • loading data into the warehouse.

The diagram shows data sources that may be databases of transaction systems or flat files from these systems. Of course, other forms of data source are acceptable, albeit these two are the most commonly used. The second (optional) layer is the staging area where the data goes directly from the source systems, usually without any transformations.
As mentioned before, a data warehouse as understood by Inmon represents atomic data from source systems in third normal form. Based on one central data warehouse, sometimes referred to as the Corporate Information Factory (CIF), the data marts are built. Data marts or OLAP cubes built on the basis of marts are sources for various reporting applications with functionalities such as a pivot table, data visualization in the form of charts, the presentation of key KPIs (Key Performance Indicators), etc.

Kimball’s multidimensional world

Kimball is often associated with the logical schema of the data warehouse, known as the star schema. This association is accurate as Kimball developed the concept of a star schema, i.e. a relational database structure, the central part of which is the so-called fact table surrounded by dimension tables. “The table of facts contains measures and so called foreign keys of dimensions which link to dimensions.” The measure characterizes the magnitude of the given phenomenon (e.g. the sales value in the event of the sale of the product).

Dimensions describe the actual state of play, providing additional information (e.g. about the product sold and the buyer). Each warehouse must include a time dimension that allows for an explicit determination of the date and time of a given event.

Fig. 5. A star schema worked out by Ralph Kimball

Apart from the star schema, there are also variations of schemas known as snowflake and star constellations.

Kimball assumes that a data warehouse is actually a collection of consistent data marts based on shared dimensions. Reports are created directly on the basis of data marts or through an additional layer – OLAP cubes.

 

Fig. 6. Schema of data warehouse architecture according to Kimball

According to Kimball, dimensions should be settled, that is, they should have the same meaning in relation to many facts. The design stage of this approach involves the development of a bus matrix. Thanks to this approach, specific fact tables are plugged into a “bus” representing the available dimensions of the organization in the data warehouse.

Fig. 7. Bus matrix of business processes and dimensions

Moreover, Kimball’s approach, unlike Inmon’s, assumes a strong commitment of end users from the very beginning of the warehouse creation process.

Single version of the facts – Data Vault by Lindstedt

In contrary to Inmon’s view, Linstedt assumes that all available data from the entire time period should be loaded into the warehouse. This is known as the “single version of the facts” approach. As with Kimball’s star schema, with the Data Vault Linstedt introduces some additional objects to organize the data warehouse structure. These objects are referred to as the hub, satellite and link.

Fig. 8. Data Vault architecture according to Linstedt

Hubs are objects containing a unique list of business keys (from source systems). In addition, the hub stores metadata on the date and time of the given key’s first occurrence and the source of its origin. The hub does not contain descriptive data or facts. Links, in turn, allow for the definition of relationships between hubs. They are similar to fact tables in multidimensional modeling. The satellites contain descriptive data, thus resembling the dimensions known from multidimensional modeling, and they can only connect to hubs or links. An example of a hub can therefore be a unique customer identifier in the sales system; the link will be a single sales line, and the satellite will be customer data for shipping.

Fig. 9. Schema of data warehouse architecture according to Linstedt

For optimization purposes, in Data Vault 2.0, instead of primary keys coming directly from source systems (usually integers), they are transformed with the use of the so-called hash function, e.g. MD5 or the more secure SHA-2. Additionally, thanks to this approach, it is possible to implement Data Vault on Hadoop. Another Data Vault 2.0 innovation is the use of the so-called Hash Diff for the efficient comparison of data already loaded with data waiting to be loaded in the next load. The Hash Diff is determined on a basis of all descriptive columns in the satellite (not metadata) using the hash function. In the case of differences between the hashes, a new record is saved (similarly to SCD type 2), and no change means no action is taken.

According to Linstedt, we can sometimes list the so-called Raw Vault and Business Vault in data warehouses. The Raw Vault is the heart of the Data Vault, organized into satellites, hubs, and links to track historical changes in integrated data from multiple sources. Based on this layer, the so-called Business Vault is a layer that stores data which is important from a business perspective. The data in this layer has some basic transformations required by the business. The Business Vault is data preparation for data marts. It is sometimes referred to as staging out, as opposed to staging in, which is a layer that stores data collected directly from sources without any transformations (a transition area). Since Data Vault 2.0, the term “information mart” is used instead of “data mart” to emphasize its role in providing useful information to decision-makers.

When referring to the Data Vault, it is worth mentioning the names of those who actively contribute to and promote the concept of the Data Vault around the world: Hans Hultgren, Michael Olschimke, and Roelant Vos.

Comparison of modeling techniques

There are many aspects that differ or link the approaches presented in this article. A table presenting various criteria of comparative analysis, available in a PDF file, can be a handy tool.

A choice?

Taking into consideration the well-established position of the data warehouse according to Kimball’s approach, the following question may arise: is multidimensional modeling then an out-of-date technique for data warehouse modeling? The answer is simple: it is customer need that determines which modeling technique would be suitable. For some, the Data Vault solution is not suitable; for others, using Inmon’s approach is pointless. Often you can find hybrid solutions combining various approaches. Instead of making a dogmatic choice between academic approaches, you should focus on pragmatism and flexibility. Each case is individual and only an accurate analysis based on experience allows one to make the right choice and adapt the solution accordingly to the client’s needs.

In general, if there are no specific analysis requirements or if marts are to provide information to several BI systems, it is worth applying Inmon’s approach, especially if the database structures of the source systems are stable.

The Kimball approach is recommended when the requirements are well-known and defined. Multidimensional models are recommended as a mart structure due to their advantages, including high performance and readability for end users. On the other hand, some tools work well at data marts with a flat structure, which is also perfect for Data Science analyses.

Lindstedt’s Data Vault approach is an effective solution for a data warehouse based on multiple data sources, the structure of which changes frequently. Data Vault is typically a good choice for large to very large agile projects where the flexibility, productivity and scalability of the warehouse are key.

Big Data

When discussing the methods of storing big data sets, we are obliged to mention the solutions known as Big Data, which are becoming increasingly popular. Due to the enormous amount of data available in enterprises, the concept of the so-called Data Lakes, which aim to store and analyze Big Data, is developing. The emergence of new solutions of this type in warehousing does not replace the existing ones, but merely fills the gap. The coexistence and integration of data storage systems (both structured and unstructured) in the organization allows you to control the chaos and obtain the necessary business knowledge.

Sources:

  1. Abramson I.: Data Warehouse: The Choice of Inmon versus Kimball
  2. Adamson C.: Three Data Warehouse Architectures that Use Star Schema, opublikowano: 26 marca 2007.
  3. Anderson D.: What is “The Data Vault” and why do we need it?, opublikowano 27 marca 2015.
  4. Czarko-Wasiutycz R.: Miejsce architektury Data Vault 2.0 w hurtowniach danych, prezentacja na konferencji SQLDay, Wrocław 2018.
  5. Dalby J.: Dimensional modeling – architecture and terminology.
  6. Drozda P.: Bazy danych. Wprowadzenie.
  7. Comparison Between Inmon and Kimball Methodology for the Purpose of Designing, Constructing and Testing of a Commercial BIDW Project, International Journal of Computer Graphics Vol. 8, No.1 (2017).
  8. George S.: Inmon or Kimball: Which approach is suitable for your data warehouse?, opublikowano 14 kwietnia 2012.
  9. Graczyk B.: Ciebie też to czeka…ewolucja trwa…, opublikowano 14 marca 2017.
  10. Graczyk B.: Największy mit świata BI – poznaj odpowiedź…, opublikowano 23 sierpnia 2017.
  11. Graziano K.: Data Vault 2.0 Modeling Basics, opublikowano 20 października 2015.
  12. Graziano K.: Building an Information Mart With Your Data Vault, opublikowano 17 listopada 2015.
  13. Hall M.: The 10 IT People Who Mattered in the Past 40 Years (but You May Not Know Why), opublikowano 9 lipca 2007.
  14. Hultgren P.: Data Vault layers & the Raw Vault, opublikowano 20 lutego 2011
  15. Inmon W.H.: Corporate Information Factory, John Wiley & Sons, Indianapolis 2000.
  16. Kempe S.: A Short History of Data Warehousing, opublikowano 23 sierpnia 2012.
  17. Kimball R., Ross M.: The Data Warehouse ToolKit, Third Edition: The Definitive Guide to Dimensional Modeling, John Wiley & Sons, Indianapolis 2013.
  18. Libera T., Ziuziański P.: Charakterystyka budowy hurtowni danych i możliwości implementacji wymiarów różnego typu, Zeszyty Naukowe WSZiB, nr 43, Kraków 2017.
  19. Linstedt D., Olschimke M.: Building a Scalable Data Warehouse with Data Vault 2.0, Morgan Kaufmann, Cambridge, MA, USA, 2015.
  20. Mor Y.: Inmon vs. Kimball – The Big Data Warehouse Duel, opublikowano 10 kwietnia 2014.
  21. Naamane Z., Jovanovic V.: Effectiveness of Data Vault compared to Dimensional Data Marts on Overall Performance of a Data Warehouse System, IJCSI International Journal of Computer Science Issues, Volume 13, Issue 4, July 2016.
  22. Orlov V.: Data Warehouse Architecture: Inmon CIF, Kimball Dimensional or Linstedt Data Vault?, opublikowano 9 kwietnia 2014.
  23. Ross M.: Differences of Opinion, opublikowano 3 marca 2004.
  24. Sen A., Sinha A.: A Comparison of Data Warehousing Methodologies, Communications of the ACM, March 2005/Vol. 48, No. 3.
  25. The Data Vault Architecture, opublikowano 14 grudnia 2012.
  26. Vos R.: Comparisons between Data Warehouse modelling techniques, opublikowano 12 lutego 2013.
  27. Vos R.: Data Vault comparisons, opublikowano 16 maja 2012.
  28. Yessad L., Labiod A.: Comparative study of data warehouses modeling approaches: Inmon, Kimball and Data Vault, International Conference on System Reliability and Science (ICSRS), Paris 2016.

 

 

 

 

 

 

 

Author:
Senior Business Intelligence Specialist

Certyfikowany specjalista technologii Business Intelligence firmy Microsoft. Tworzy rozwiązania dla klientów, począwszy od etapu modelowania hurtowni danych, projektowania i wdrażania procesów ETL, po implementację modelu danych i wdrożenie wizualnej warstwy raportowej. Autor bloga poświęconego tematyce rozwiązań klasy Business Intelligence.

Comment

CONTACT US!

Would you like to learn more about the possibilities of cooperation? Do you have a question? Write to us!

    I hereby agree that JCommerce Sp. z o.o. shall process my personal data (hereinafter ‘personal data’), such as my name, surname, e-mail address, telephone number and company name, for commercial purposes.
    I hereby agree that JCommerce Sp. z o.o. shall process my personal data (hereinafter ‘personal data’), such as my name, surname, e-mail address, telephone number and company name, for marketing purposes.
    I hereby agree that JCommerce Sp. z o.o. shall process my personal data (hereinafter ‘personal data’), such as my name, surname, e-mail address, telephone number and company name, for recruitment purposes.
    I hereby agree that JCommerce Sp. z o.o. shall process my personal data (hereinafter ‘personal data’), such as my name, surname, e-mail address, telephone number and company name, for future recruitment purposes.
    I have been informed by JCommerce Sp. z o.o., 3 Ks. Piotra Sciegiennego St. 40-114 Katowice – the personal data controller – that: - The provision of the aforementioned personal data is voluntary but essential for commercial purposes if I have chosen a request for proposal, or recruitment purposes, if I have chosen the remaining options;
    - I have the right to access the content of my data, including to receive copies of it and correct it, delete it and limit the processing of it, as well as the right to transfer it, the right to object to the processing of it, and the right to withdraw my consent at any time. However, the withdrawal of my consent shall not affect the lawfulness of processing carried out on the basis of the consent in question prior to its withdrawal;
    - A statement of withdrawal of my consent to the processing of personal data should be submitted to the headquarters of JCommerce Sp. z o.o. or sent to the following e-mail address: zgody@jcommerce.pl. The withdrawal of consent to the processing of personal data shall result in the inability to fulfil the aforementioned processing purposes;
    - The personal data provided shall be shared by JCommerce Sp. z o.o. with the company’s authorised employees and individuals collaborating with JC under civil-law contracts, who are involved in the implementation of the purpose of the processing;
    - The data provided shall be processed on the basis of the relevant provisions of Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation), ‘GDPR’;
    - Should you have any questions regarding the protection of your personal data, please contact us by e-mail: odo@jcommerce.pl;
    - The personal data provided shall be processed for the purpose for which it was supplied, or until I express my objection in this regard. In the event of filing an objection, JCommerce Sp. z o.o. shall no longer process my personal data for the aforementioned purposes, unless it can demonstrate that there are valid and legally justified grounds overriding my interests, rights and freedoms or my data is necessary to establish, pursue or defend a claim, if any;
    - I have the right to file a complaint to the supervisory authority if I consider that the processing of the aforementioned personal data violates the provisions of the General Data Protection Regulation of 27 April 2016.
    I hereby agree that JCommerce Sp. z o.o. shall process my personal data (hereinafter ‘personal data’), such as my name, surname, e-mail address, telephone number and company name, for commercial purposes.
    I hereby agree that JCommerce Sp. z o.o. shall process my personal data (hereinafter ‘personal data’), such as my name, surname, e-mail address, telephone number and company name, for marketing purposes.
    I hereby agree that JCommerce Sp. z o.o. shall process my personal data (hereinafter ‘personal data’), such as my name, surname, e-mail address, telephone number and company name, for recruitment purposes.
    I hereby agree that JCommerce Sp. z o.o. shall process my personal data (hereinafter ‘personal data’), such as my name, surname, e-mail address, telephone number and company name, for future recruitment purposes.
    I have been informed by JCommerce Sp. z o.o., 3 Ks. Piotra Sciegiennego St. 40-114 Katowice – the personal data controller – that: - The provision of the aforementioned personal data is voluntary but essential for commercial purposes if I have chosen a request for proposal, or recruitment purposes, if I have chosen the remaining options;
    - I have the right to access the content of my data, including to receive copies of it and correct it, delete it and limit the processing of it, as well as the right to transfer it, the right to object to the processing of it, and the right to withdraw my consent at any time. However, the withdrawal of my consent shall not affect the lawfulness of processing carried out on the basis of the consent in question prior to its withdrawal;
    - A statement of withdrawal of my consent to the processing of personal data should be submitted to the headquarters of JCommerce Sp. z o.o. or sent to the following e-mail address: zgody@jcommerce.pl. The withdrawal of consent to the processing of personal data shall result in the inability to fulfil the aforementioned processing purposes;
    - The personal data provided shall be shared by JCommerce Sp. z o.o. with the company’s authorised employees and individuals collaborating with JC under civil-law contracts, who are involved in the implementation of the purpose of the processing;
    - The data provided shall be processed on the basis of the relevant provisions of Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation), ‘GDPR’;
    - Should you have any questions regarding the protection of your personal data, please contact us by e-mail: odo@jcommerce.pl;
    - The personal data provided shall be processed for the purpose for which it was supplied, or until I express my objection in this regard. In the event of filing an objection, JCommerce Sp. z o.o. shall no longer process my personal data for the aforementioned purposes, unless it can demonstrate that there are valid and legally justified grounds overriding my interests, rights and freedoms or my data is necessary to establish, pursue or defend a claim, if any;
    - I have the right to file a complaint to the supervisory authority if I consider that the processing of the aforementioned personal data violates the provisions of the General Data Protection Regulation of 27 April 2016.