#maindatainfrastructuretrends

Current Data Infrastructure Challenges

New solutions and applications make the data stack more accessible and simpler for enterprises, but they also introduce new difficulties. The volume of data passing through an organization is growing rapidly, and so is the number of data sources, driven by the proliferation of SaaS tools.

The modern data stack is focused on transactional data and analytics. But enterprises do not manage just one pipeline; they run several in parallel. In addition, enterprises need streaming technologies, which are still at an early stage of development.

As a result, tools such as Spark, Kafka, and Pulsar will remain relevant. Consequently, demand for data engineers who can work with these technologies will also grow.

Orchestration systems are developing rapidly, as shown by the emergence of frameworks such as Airflow, Luigi, Prefect, and Dagster. These tools are open-source libraries for authoring, scheduling, and monitoring workflows. Their distinguishing feature is that they are written in Python, so task chains can be defined in Python code and inspected visually. A DAG (Directed Acyclic Graph) is used to represent and visualize the workflow.
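As a rough illustration of the DAG idea, here is a minimal sketch in plain Python. It does not use Airflow, Luigi, Prefect, or Dagster; it relies only on the standard-library graphlib module (Python 3.9+), and the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks: each key maps to the set of tasks it depends on.
dag = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}


def run_pipeline(graph):
    """Visit every task only after all of its dependencies and return the order."""
    executed = []
    for task in TopologicalSorter(graph).static_order():
        # A real orchestrator would schedule and monitor the task here.
        executed.append(task)
    return executed


order = run_pipeline(dag)
print(order)
```

Because the graph is acyclic, a valid execution order always exists; if a cycle were introduced, TopologicalSorter would raise graphlib.CycleError instead of looping forever.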

It follows that data management continues to be the main requirement in a business environment (whether through the modern data stack or machine learning pipelines).


Natural Language Processing achievements

Over the last few years the natural language processing (NLP) field has grown rapidly. In 2020 the global NLP market was valued at $13.6 billion. According to forecasts, this market segment will keep growing and reach $42.04 billion by 2026; by then, 85% of enterprise activity is expected to be carried out without human involvement.

Besides major NLP market players such as Google, Microsoft, Amazon, IBM, and Apple, which continue to develop and improve their products, there are many new start-ups (42Chat, Canary Speech, Gamalon, Green Key Technologies).

The most relevant recent developments:

BERT (Bidirectional Encoder Representations from Transformers) is a mainstay of Google. Because it can analyze a query as a whole sentence (including prepositions and conjunctions) and focus on context, it helps select relevant search results. BERT works in 72 languages.
ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately) is a newer Google development. It keeps all of BERT's benefits while training an encoder that learns efficiently and accurately classifies replaced tokens. As a result, it has an edge over its predecessor without increasing compute costs.
GPT-3 (Generative Pre-trained Transformer 3) is the third version of OpenAI's language model based on the Transformer architecture. At the time of writing it is the largest language model (175 billion parameters).


Boom time for DSML platforms

DSML platforms are the cornerstone of integrating ML and AI into organizations. The main goal of the companies that develop such platforms has been to expand their products by offering as many business use cases as possible.

Over the last few years this market has grown noticeably, which has allowed the companies that develop these platforms to scale significantly.

For example:

Dataiku: its DSS product was created for collecting, processing, and analyzing data. The system makes it possible to build business forecasts by transforming raw data, which in turn supports effective decisions. Target users are data analysts, business analysts, and data engineers who develop applications for a company. In 2020 Dataiku raised €85 million, bringing the total it has raised since 2012 to €212 million.
Databricks: the company was founded in 2013 by the creators of Apache Spark. Its product, a data analytics platform, is optimized for the Microsoft Azure cloud and aimed at analysts who run SQL queries against a data lake. It is also designed for creating and sharing dashboards. In 2013 the company raised $13 million; in 2019 it raised a further $400 million.


The role of data analysts is growing

Data analysts play an increasingly important role in data management. Analysts usually work either as a separate team or as individual specialists within a department. They know SQL, which is used to query data from storage, and may also know Python. But analysts are not engineers: they are responsible for the last stage of the data pipeline.

Now, with the help of modern tools, analysts can move further into engineers' territory, for example by performing transformations with their own SQL knowledge.
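A minimal sketch of what such a SQL-only transformation might look like. It uses Python's built-in sqlite3 module as a stand-in for a real warehouse; the table and column names are hypothetical and chosen only for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for a data warehouse (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (customer TEXT, amount REAL, status TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [
        ("acme", 120.0, "paid"),
        ("acme", 80.0, "paid"),
        ("globex", 50.0, "cancelled"),
        ("globex", 30.0, "paid"),
    ],
)

# The transformation itself is plain SQL: build a clean, aggregated table
# from the raw one, keeping only paid orders.
conn.execute(
    """
    CREATE TABLE customer_revenue AS
    SELECT customer, SUM(amount) AS revenue
    FROM raw_orders
    WHERE status = 'paid'
    GROUP BY customer
    """
)

revenue = conn.execute(
    "SELECT customer, revenue FROM customer_revenue ORDER BY customer"
).fetchall()
print(revenue)
```

The only engineering around the query is the connection setup; the business logic lives entirely in SQL, which is the part analysts already know.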

This has given companies some breathing room. Data engineers are scarce and therefore expensive. The pool of analysts is many times larger; they are also easier to train and cost much less.

Additionally, new start-ups target analysts specifically. They build modern tools that help extract and analyze information.

Start-ups such as Sisu, Outlier, and Anodot build KPI tools that analyze the data warehouse, extract specific information about particular metrics, and detect errors.

Tools are also appearing that embed data and analytics directly into applications. A notable example is Census, which builds a path from the data warehouse to the application.

All of this promotes wider adoption of Business Intelligence in enterprises. But BI is currently still underused by enterprises, which deprives analysts of further opportunities.


Data Lake and Data Warehouse merging

Another trend is the merging of the data lake and the data warehouse, which simplifies the data stack. Until recently the two existed separately. Both are intended for storing data, but they are not synonymous, and there is a fundamental difference between them.

A data lake is a repository for large volumes of raw data in its original form from different sources. The data can be structured, semi-structured, or unstructured. A data lake is characterized by high flexibility and availability of data and a wide range of machine learning use cases.

A data warehouse is also a repository for large volumes of data, but here the data is processed first and enters the storage already structured, in a strictly regulated way. A data warehouse is characterized by less flexibility, a fixed configuration, and support for transactional analytics and BI.
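The contrast between the two can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how any particular product works: the "lake" is a list of raw strings, the "warehouse" is a list of rows, and the SCHEMA and field names are hypothetical.

```python
import json

# Data lake: records are kept exactly as they arrived, whatever their shape.
lake = [
    '{"user": "ann", "clicks": 3}',
    '{"user": "bob", "clicks": "7", "device": "mobile"}',
    "free-text log line with no structure",
]

# Data warehouse: a fixed (hypothetical) schema is enforced before storing.
SCHEMA = {"user": str, "clicks": int}


def to_warehouse_row(raw):
    """Return a row matching SCHEMA, or None if the record cannot be structured."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(record, dict):
        return None
    try:
        # Keep only the schema columns and cast each one to its declared type.
        return {name: cast(record[name]) for name, cast in SCHEMA.items()}
    except (KeyError, TypeError, ValueError):
        return None


warehouse = [row for row in map(to_warehouse_row, lake) if row is not None]
print(len(lake), "raw records in the lake")
print(warehouse)
```

The lake keeps all three records, including the unstructured one, while the warehouse accepts only those that fit the schema and drops the extra "device" field.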

To get the best of both worlds, organizations try to combine the two. As a result, they run both a data lake and a data warehouse (sometimes several, with many parallel pipelines). Today's data storage vendors increasingly offer such options. For example, Snowflake's platform can connect a data warehouse and a data lake, and Microsoft's Synapse cloud warehouse has integrated data lake capabilities.


ETL & ELT

ETL is the process of extracting, transforming, and loading data. Data is first extracted from a source, sent to an intermediate (staging) area for transformation, and then loaded into the destination.

A new generation of tools has made it possible to move from ETL to ELT. The key difference is the order of operations: in ELT, data is extracted from different sources, loaded directly into the destination, and only then transformed. ELT is especially beneficial when working with large volumes of data.
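The difference in ordering can be sketched as follows. This is a simplified illustration in plain Python: the source rows, the function names, and the use of in-memory lists and dictionaries as "staging" and "destination" are all assumptions made for the example.

```python
def extract():
    """Hypothetical source rows; a real pipeline would read from an API or database."""
    return [{"name": " Ann ", "amount": "10"}, {"name": "bob", "amount": "5"}]


def transform(rows):
    """Clean up names and cast amounts to integers."""
    return [
        {"name": row["name"].strip().title(), "amount": int(row["amount"])}
        for row in rows
    ]


def run_etl():
    """ETL: transform in an intermediate zone, then load only the cleaned rows."""
    staging = transform(extract())
    destination = list(staging)
    return destination


def run_elt():
    """ELT: load the raw rows first, then transform inside the destination."""
    destination = {"raw": extract()}
    destination["clean"] = transform(destination["raw"])
    return destination


etl_result = run_etl()
elt_result = run_elt()
print(etl_result)
print(elt_result["clean"] == etl_result)
```

Both variants end up with the same cleaned rows, but in ELT the raw data also stays available in the destination, so it can be transformed again later in a different way.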

Nevertheless, the ELT field is still in its infancy and developing rapidly. Questions about handling confidential information (PHI, PII) remain open. That is why the need for some light processing before loading is being discussed, which is leading to a hybrid version (ETLT).
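One way to picture the hybrid ETLT idea is a light masking step applied before loading, with heavier transformations left for later. The sketch below is only an assumption about what such a step could look like: the PII_FIELDS set, the field names, and the truncated SHA-256 digest are illustrative, and an unsalted hash like this is not a substitute for proper anonymization.

```python
import hashlib

# Hypothetical list of fields that count as confidential in this example.
PII_FIELDS = {"email", "ssn"}


def mask_pii(row):
    """Return a copy of the row with confidential fields replaced by a short digest."""
    masked = dict(row)
    for field in PII_FIELDS:
        if field in masked:
            digest = hashlib.sha256(str(masked[field]).encode("utf-8")).hexdigest()
            masked[field] = digest[:12]
    return masked


def load(rows, warehouse):
    """Append rows to the destination (a plain list standing in for a warehouse)."""
    warehouse.extend(rows)


source = [
    {"id": 1, "email": "ann@example.com", "plan": "pro"},
    {"id": 2, "email": "bob@example.com", "plan": "free"},
]

warehouse = []
# Light transform (masking) before loading ...
load([mask_pii(row) for row in source], warehouse)
# ... heavier transformations can still happen after loading.
plans = sorted(row["plan"] for row in warehouse)
print(warehouse)
```

The source rows are left untouched, and only the masked copies reach the destination.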

While data warehouses such as Snowflake, BigQuery, and Redshift have changed how data is stored, managed, and accessed, the data integration industry has been evolving as well. There is a lot of potential to automate engineering tasks in cloud storage systems, where the main goal is to extract and load data (without transforming it). This potential has driven the growth of companies such as Segment, Stitch, and Fivetran.

Take Fivetran as an example: it is an automated ETL platform. It makes it possible to collect and analyze data by connecting databases to a central repository. Fivetran offers a wide variety of connectors through which data is extracted from different sources and loaded into storage. The process is automatic and fully managed, and it requires no maintenance. This has enabled non-engineering teams to configure connectors for data integration and management.

Such tools are now widely used. The company's performance is proof of that: over the past year its Series C round valued the company at $1.2 billion.


Cloud data warehouse evolution

Back in the 70s, computer scientist Joseph Carl Robnett Licklider was the first to start talking about the concept of cloud services. Even then, developers proposed storing and processing information on remote servers. But the idea had to be postponed because the Internet was still in its infancy.

In 2012 Amazon Redshift appeared: a fully managed cloud data warehouse that supports data analysis with SQL. Soon other IT companies (Google, Microsoft, etc.) began to implement this technology, which triggered a boom in cloud storage.

The modern data stack follows the same idea as its predecessors: building a data pipeline by:

extracting data from different sources
storing it in a single warehouse
analyzing and visualizing it

Global use of cloud storage has grown significantly in recent years and has become mainstream. This is explained by savings on acquiring and maintaining in-house IT infrastructure, a high level of information security, scalability, and accessibility. According to forecasts, the popularity of cloud warehouses will continue to grow, and by 2025 this market segment will reach $137B.


Data infrastructure stability and development

2020 was an unforgettable and unstable year. However, the digital ecosystem demonstrated great resilience and growth, undergoing a significant transformation in just a few months.

Data technologies (artificial intelligence, machine learning, data infrastructure) and cloud technologies are at the center of the digital transformation. That is why companies in the digital ecosystem were able to survive and prosper during such a difficult period.

Snowflake (a data warehouse provider) became the most striking example of this. In September 2020 it went public with a $69 billion market cap (at the time, the biggest software IPO).

Palantir was the second example. This US-based company develops data analytics software for organizations. It went public via a direct listing and reached a $22 billion market cap.

Without doubt, many economic factors, such as consumer confidence, inflation, and economic growth, influence business success. But the financial market dictates its own rules in line with the new reality. Every company that strives to be successful has to be a data-driven company.

It is worth noting that data technologies have their own requirements, and some of them involve an entirely different concept and way of reasoning. Machine learning, as an artificial intelligence method, is a highly technical area. Project success rates amount to 90-95%. Such rates have a strong effect on the evolution of artificial intelligence products.

In the last few years companies have begun to demand more from digital solution providers. For example, they want to process larger volumes of data faster and cheaper, or to use machine learning models more widely. This is logical, since company management has begun to understand the benefits of such solutions and to profit from them.