Data warehousing is definitely not a new concept. So why are we here, still talking about it? Well, the answer lies in the question itself: it brings business value, and therefore businesses are not afraid to invest in modernizing their data warehouse. And why modernize? First, let's define data warehousing. Even before that, let's look at how most of us have heard about data warehouses. Very commonly, sales pitches for data warehousing projects start by offering alternatives: would you like to go with Azure DW or AWS Redshift? Or the ever-popular Snowflake? Well, these are just the underlying technology platforms supporting data warehousing. To put it shortly, a data warehouse is NOT a product.
A data warehouse is also NOT just a database.
Yes, you read it correctly!
Then what is a data warehouse?
A data warehouse is a central repository of data ingested from heterogeneous data sources. It collects and stores the current and historical state of data coming from the different business-supporting applications across an organization. Is this all? This is just a part of what data warehousing is. In a nutshell, data warehousing is the process of collecting data from different sources and ingesting and modeling it in a data warehouse environment, and, on top of that, making the data available to the business for its analytics and advanced analytics needs in a secure and organized manner. This might sound overwhelming now, but hold on! We will cover every step individually and revisit this definition.
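To make the "collect, ingest, model" steps a bit more concrete, here is a minimal Python sketch. The source systems, table layout, and SQLite file are purely illustrative assumptions, not a recommendation for a real warehouse stack:

```python
import sqlite3
from datetime import date

# Two hypothetical heterogeneous sources feeding one warehouse table.
crm_rows = [("C-1001", "Acme Oy", 2500.0)]      # e.g. extracted from a CRM
shop_rows = [("C-2001", "Beta Ab", 990.0)]      # e.g. extracted from a webshop

conn = sqlite3.connect("warehouse.db")          # stand-in for a real DW platform
conn.execute("""
    CREATE TABLE IF NOT EXISTS sales (
        customer_id TEXT, customer_name TEXT, amount REAL,
        source_system TEXT, load_date TEXT)
""")

# Ingest both sources into one common model, tagging each row with its
# origin and load date so current and historical states can be told apart.
for source, rows in [("crm", crm_rows), ("webshop", shop_rows)]:
    for customer_id, name, amount in rows:
        conn.execute("INSERT INTO sales VALUES (?, ?, ?, ?, ?)",
                     (customer_id, name, amount, source, date.today().isoformat()))
conn.commit()
```

The point here is not the technology but the pattern: heterogeneous sources, one shared model, and metadata that preserves history.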
And then what is an Enterprise Data Warehouse?
There can be many data warehouses built in the different business units of an organization. A common, company-wide data warehouse is often referred to as an EDW. Companies can plan and build an EDW from scratch, or scale an existing DW into an EDW by broadening the scope of source systems.
Do we need a Data Lake if we are already building a Data Warehouse?
To understand this, let us first look at the fundamentals. Data warehousing rests on three main underlying capabilities: data storage, data modeling, and data processing. If we exclude the logical modeling, the technical components differ sharply in price: data storage is inexpensive compared to data processing units (including in-memory storage). This is the biggest reason why data warehouses have been evolving with the intention of optimizing data processing cost and speed. Smart companies like Snowflake have therefore separated their data storage and processing layers. Microsoft Azure Synapse is moving in the same direction, and this might soon become the standard.
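A rough PySpark illustration of this storage/compute split (the bucket name and paths are invented for the example): the data sits cheaply in object storage, while the processing cluster is sized, billed, and shut down independently of it:

```python
from pyspark.sql import SparkSession

# The compute cluster is provisioned (and paid for) separately from storage:
# it can be resized or terminated without touching the data itself.
spark = SparkSession.builder.appName("storage-compute-split").getOrCreate()

# Data lives in inexpensive object storage (hypothetical bucket/path).
orders = spark.read.parquet("s3a://example-lake/curated/orders/")

# The expensive part, processing, is only consumed while the query runs.
orders.groupBy("order_date").sum("amount").show()
```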
Do we process all data all the time?
No. A large volume of an organization's data stays unaltered during its storage lifecycle. The lifecycle starts when data is generated in the source application (either entered manually or collected from devices in different forms), after which it is stored, processed, and finally purged. In most typical reporting use cases, only the most recent snapshot of data is relevant. Predictive analytics use cases need historical data to understand the trend over the last few years; historical medical data, however, may be needed for research purposes over a much longer term. Therefore, it makes sense to bring only recent snapshots into in-memory storage and keep the rest of the data in cold storage to optimize cost.
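As a sketch of that hot/cold split, assuming event_date is stored as an ISO date string and using an illustrative one-year cutoff with invented paths:

```python
from datetime import date, timedelta
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("hot-cold-split").getOrCreate()

events = spark.read.parquet("s3a://example-lake/raw/events/")   # hypothetical path
cutoff = (date.today() - timedelta(days=365)).isoformat()       # illustrative cutoff

# The recent snapshot goes to the fast (and expensive) warehouse layer...
(events.filter(F.col("event_date") >= cutoff)
       .write.mode("overwrite")
       .parquet("s3a://example-lake/warehouse/events_recent/"))

# ...while older history stays in cheap cold storage for occasional use.
(events.filter(F.col("event_date") < cutoff)
       .write.mode("overwrite")
       .parquet("s3a://example-lake/archive/events_history/"))
```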
This concept also gave rise to the Data Lake, when storing and processing data in large Hadoop clusters started to become a cash-eating mammoth. Frequently and infrequently accessed data had to be separated and stored for future use. File systems were then designed to logically separate raw, curated, and analytics/consumer zones. Similar concepts were later carried over to the major cloud providers using S3 buckets / storage accounts. So, the answer to whether we need a Data Lake if we already have a Data Warehouse: ideally, infrequently accessed, large historical data can be stored in the Data Lake for advanced analytics and archiving purposes, while the latest snapshots of data that need processing are moved to the Data Warehouse. This optimizes cost, improves reporting performance, and keeps your enterprise data available in the cloud for any future needs.
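For reference, a typical zone layout looks something like this (the names vary from one organization to the next; this is just one common convention):

```
s3://example-lake/
  raw/        # data as delivered by source systems, immutable
  curated/    # cleaned, conformed, deduplicated data
  consumer/   # analytics-ready marts and ML feature sets
```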
Moving to the cloud
With the cloud's pay-as-you-go model as the selling point, it became less challenging for architects to predict storage and processing power up front, and the flexibility to scale out processing units was no longer a hurdle. But old-school data modeling techniques were still holding back full-scale utilization of the cloud. Right on time, Data Vault 2.0 modeling kicked in, supporting insert-only, parallel loading. But did it solve all the problems? Well, for data warehousing, yes, but these tools were still not data-science friendly: they supported only SQL, with a fraction of the power of a full-blown programming language. That is when Databricks kicked in and revolutionized the whole data lake and data warehousing space. Suddenly, developers had the power of PySpark and Scala in their hands to support all sorts of data science functions connected to the cloud. Snowflake is moving in the same direction and has recently announced support for Python and some other popular programming languages.
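To show what insert-only, parallel-friendly loading means in Data Vault 2.0 terms, here is a minimal PySpark sketch of a hub load. The table, columns, and paths are invented for illustration, and a production load would also anti-join against keys already present in the hub:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dv2-hub-load").getOrCreate()

staged = spark.read.parquet("s3a://example-lake/curated/customers/")  # hypothetical

# A Data Vault 2.0 hub carries a hashed business key. Loads are insert-only,
# so several sources can load in parallel without update conflicts.
hub_customer = (
    staged.select("customer_id").dropDuplicates(["customer_id"])
          .withColumn("hub_customer_hk",
                      F.sha2(F.col("customer_id").cast("string"), 256))
          .withColumn("load_dts", F.current_timestamp())
          .withColumn("record_source", F.lit("crm"))
)

# Append only: existing rows are never updated or deleted.
hub_customer.write.mode("append").parquet("s3a://example-lake/vault/hub_customer/")
```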
Does moving to the cloud solve all the problems?
Speaking in laywoman/man terms, the main objectives of data warehousing are storing data so that it is easily searchable, available when needed, and able to support complex operations. Data modeling solves the problem of making data searchable. The cloud supports availability, and scalability makes complex operations quicker. However, business users do not necessarily relate to complex data models. Therefore, business modeling / a business glossary becomes essential for the business to be able to talk with IT. Modern data catalog tools such as Collibra, Alation, and the more recent Azure Purview not only help you build business glossaries but also provide added features like data lineage, data quality, and AI capabilities to categorize your data.
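In essence, a business glossary entry ties a business term to its technical home and an accountable owner. The fields below are a simplified Python illustration, not the actual schema of Collibra, Alation, or Purview:

```python
from dataclasses import dataclass

@dataclass
class GlossaryTerm:
    term: str            # the language the business speaks
    definition: str      # agreed business meaning
    physical_asset: str  # where IT actually stores it
    steward: str         # who answers questions about it

churn_rate = GlossaryTerm(
    term="Customer churn rate",
    definition="Share of customers lost during a rolling 12-month window.",
    physical_asset="warehouse.analytics.fct_churn_monthly",
    steward="sales-operations",
)
```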
Future of Data Warehousing
Many companies have now started to invest in a common data model, aligning their enterprise data to existing industry-specific data models. This way they can align with common standards across borders and utilize prebaked algorithms for advanced analytics. Another concept trending right now is Data Mesh. The fundamental idea behind Data Mesh is to build domain-oriented data platforms whose outputs are treated as data products. You can read about how to implement Data Mesh in our upcoming blog (in Finnish). These data products can then be shared across business units or with external partners. Data warehousing solutions are also moving towards making data sharing more convenient, which will reduce the replication of similar data across multiple platforms, thus supporting the data product paradigm.