#datawarehouse

In order to make the right decision when choosing a data organization system, it is advisable to conduct a comparative analysis.

Key differences between databases and data warehouse:

Data Warehouse

stores summary data;
used for data analysis;
storage of historical and current data;
information from various sources providing;
providing of information on general business operations;

Database

uses detailed data;
fixation transactions;
storage of current data;
collection of data from one source;
fixation the main day-to-day operations;

Key differences between data mart and data warehouse

Data Mart

providing of a thematic data subset that was retrieved from the data warehouse (usually less than 100 GB in size);
is a repository of valuable data for a specific subgroup;
fast data analysis;
getting data from the data warehouse;

Data Warehouse

significantly larger (terabyte or more);
contains all cleaned data for business units;
getting data from databases;

Key differences between data lake and data mart

Data Lake

contains all raw and unfiltered organization data;
expedient to use for wider and deeper analysis of raw data;
a complete solution that acts as a data warehouse, database and data mart;
availability of a central archive where data marts can be stored in different user areas;

Data Mart

contains filtered and structured data for a specific department;
allows to quickly and efficiently analyze relevant information;
is a one-time solution without ETL process;

Key differences between data lake and data warehouse

Data Warehouse

storage of cleaned data to create structured data models and reports;
use of ODS from transactional systems;
intended for users who need to create reports for analytics;

Data Lake

storage of all data for the organization;
use of hardware that makes it possible to economically store large amounts of data (terabytes, petabytes);
extracting data from all data types, including non-traditional data types (web service logs, social media activity, sensor data, etc.);
designed for deep analysis that goes beyond data scope that is stored in the repository;

Key differences between databases and data mart

Database

is a transactional data repository (OLTP);
fixation of all aspects and activities of one subject in particular;
contains raw data;
users do not interact with data in databases;
is the first step in the ETL process;

Data Mart

is a warehouse of analytical data (OLAP);
contains data from several subjects;
contains processed and verified data that simplifies the process of creating reports;
direct user interaction with data from data marts;
is the last step in the ETL process

Key differences between databases and data lake

Database

fixation of transactional data that are related to one topic;
stores traditional data (text, numbers);
does not perform data cleaning, stores raw data;
exports its data to another process (operational data storage);
is the first step in the ETL process;

Data Lake

recording the activity of many databases and other disparate data sources;
it is possible to store data of any type (pdf-files, images, sound files, etc.);
stores raw data, however, a data cleansing procedure is implemented;
performs all data processing (cleansing and aggregation);
handles all aspects of the ETL process.

Modern organizations process data daily. However, the data may differ in type, scope and manner of use. This must be considered when choosing the best data solution. Achieving results depends, among other things, on the selected enterprise data management system, that must fully meet business needs. It can be data mart, data warehouse, database or data lake.

Database

A database is a place of related data storage that is used to capture a particular situation. For example, a point-of-sale (POS) database. In this case, the database collects, and stores data related to retail store transactions. Data entering the database is processed, systematized, managed, updated, and then stored in tables. The database is the target storage for raw transactional data and performs online transaction processing (OLTP).

The main database characteristics:

structuring according to particular company operations and applications;
availability of strict rules for data storage and organization (RDBMS);
flexible data storage (NoSQL);
single-purpose in its nature;
use for online transaction processing (OLTP);
recording data, capturing transactions as they occur, and placing them.

Data warehouse

The data warehouse is the main analytical system of the company. It often works in conjunction with an operational data warehouse (ODS) to store data that has been retrieved from various company databases. For example, a company has databases supporting points of sale, online activity, customer and employee information. The data warehouse will take the data from these sources and make it available in one place. The method of extracting data from the database, converting it to ODS, and loading it into the data warehouse is an example of ETL and ELT processes.

The data warehouse is an excellent tool for data analysis due to the capture of transformed historical data. Business departments are involved in organizing data, using it for reporting and data analysis. The data warehouse uses SQL to query data, and use tables, indexes, keys, views, and data types to organize and ensure data integrity.

The main data warehouse features:

large amount of historical data storage, old data is not deleted when new ones are updated;
capturing data from several disparate databases;
works with ODS to store organize and clean data;
OLAP-application (analytical online processing);
main data source for data analysis;
reports and dashboards use data from data warehouse.

Data mart

Like a data warehouse, a data mart maintains and stores processed and ready for analysis data. However, a visibility scope is limited. The data mart provides the subject data that is required to support each business unit. For example, a data mart to support reports and analysis of a marketing department. By defining data boundaries within a particular department, only relevant data is available.

Using a data mart increases security level. Visibility restriction prevents irresponsible use of data that is not relevant to a particular department. It should also be noted that less data in the data mart increases the speed of processing, and therefore increases the speed of query execution. Data is aggregated and prepared for a specific department, minimizing data misuse and the possibility of conflicting reports.

Key data mart features:

focus on one subject or business unit;
mini-repository of aggregated data;
the amount of data is limited;
reports and dashboards use data from the data mart.

Data lake

A data lake is designed to store structured and unstructured company data. It collects all the valuable data for later use: images, pdfs, videos, etc. Just like a data warehouse, a lake extracts data from several disparate sources and processes it. It can also be used for data analysis and reporting purposes. For processing and analysis, different applications and technologies (for example, Java) are used. Data lakes are often used in conjunction with machine learning. Machine learning test results are also stored in data lake. The level of usage complexity requires serious skills from users, as well as experience with programming languages and data processing methods. Data cleaning occurs without ODS usage.

Key data lake features:

collection of all data from many disparate data sources over a long period of time;
meeting the needs of different users within the company;
data processing and cleaning, saving them to the data lake.

Data analytics is an indispensable modern business tool. Data understanding and its analytics provide comprehensive answers to the business on how to set up processes for maximum benefit, who is the main business customer and what he needs, what «gaps» exist in activities. In addition, companies use artificial intelligence to offer products and services to the «right» people. The ability to increase business processes efficiency is provided by robots and automation.

All these developments are based on an ever-increasing stream of data that is collected, stored and analyzed. Using data, some companies have revolutionized new services to improve and simplify human life: search engines, communications, e-commerce, booking systems and more.

However, most of the companies did not achieve such success. The reason is that companies struggle to manage their data and most of it is not used. Accordingly, monetization does not occur.

The first block in data management is storage. An incorrect strategy to overcome this task, or its complete absence, may lead to other problems in the future. Data volume is constantly growing, so companies need to clearly understand what data is important.

At the moment, cloud services offer to store almost unlimited information amounts. However, there are certain difficulties here. For example, data with a high level of confidentiality or regulatory burden cannot be hosted outside the enterprise; some data requires instant access from anywhere in the world; some data requires routine archiving. Also, there is a need for data auditing to determine data relevance and compliance with international regulations. To perform these functions smoothly, it’s necessary to know where the data is located, how many copies exist, and how to access it.

The process of obtaining reliable data can be achieved with fast and highly available storage systems. Modern business analytics involves moving and sorting large data amounts to provide business users and customers with flexible functionality. Including the system must be supported by encryption and security.

Intelligent data warehouse

To achieve maximum speed, stability and security, modern storage systems, including the IBM FlashSystem, use solid-state non-volatile media. Artificial intelligence technologies usage enables intelligent management of data storage and access, which can increase speed and minimize the likelihood of errors and data loss. So, the data with the most frequent predictable access will be ready for work and will be queued.

One of the key requirements for a storage system is resilience. The goal of many businesses today is to build analytics-driven internal processes and customer relationships. In such model, it is impossible to allow these processes to stop as a result of problems with the data flow or infrastructure. Modern storage systems offer the ability to quickly copy and replicate corporate data. Ensuring data integrity is often critical. In this case, 2 or more identical copies of data can be synchronized in different places, and in a situation of an unforeseen failure, it can be restored with practically no data loss.

Rapidly changing data management

The UK Met Office has tackled the challenge of implementing the infrastructure needed to handle rapidly changing data. The information is used to determine weather changes, climate research, and seasonal trends. To do this, 300 million weather-related data points are collected, analyzed daily, and made available to customers. This happens twice to eliminate the risk of interrupting the data flow.

To support this process, a hybrid cloud strategy was developed based on the IBM FlashSystem. The storage provides a high level of compression, which is a cost-effective solution. It also contributes to the creation of a high-performance data infrastructure, which is necessary to transfer information from internal servers to the public cloud and clients.

Another example is the Archdiocese of Salzburg, which needed a solution to provide services more efficiently (support, outreach to the community and parishioners, access to many historical documents and literature). The Archdiocese was able to increase response times 10 to 20 times by moving away from mechanical disk storage in favor of solid-state, non-volatile systems.

Data is an important part of business assets. All data decisions must be smart and effective. The storage process should be considered as a key element of the data management strategy, along with the collection and analytics processes.

Another one trend is data lake and data warehouse combine that promotes data stack simplification. Until recent times data lake and data warehouse subsist separately. Both objects are intended to data holding. But they are not synonymous and there is a principal difference between them.

The first object is a repository for a big volume of raw data in its original form from different sources. Data can be of different types: structured, semi-structured and unstructured. Data lake is characterized by high data flexibility and availability and a big choice oh machine learning usage.

The second object is also a repository for a big volume of data. But in this case data runs processing and gets into the storage already structured strictly regulated ways. Data warehouse is characterized by less flexibility, fixed configuration and transactional analytics and BI support.

Wishing to get the best of both sides, organizations try to combine 2 variants. As a result, they have both data lake and data warehouse (sometimes several with many parallel pipelines). Today’s data storage solution providers offer more such possibilities. For example, Snowflake – its platform allows to connect data warehouse and data lake; Microsoft Synapse – its cloud warehouse has integrated capabilities of data lake.

Key differences between database, data warehouse, data mart and data lake

Database, Data Warehouse, Data Mart, Data Lake: main characteristics and differences

Storage is the key element of data-driven process

Data Lake and Data Warehouse merging

Previous post #maindatainsfrastucturetrends