What is Snowflake and how does it differ from other cloud data warehouses:
Shared-disk architecture – one central disk/storage with many compute resources attached to it.
This architecture has several drawbacks. The main one is the central disk: it becomes a bottleneck under load, and if the disk goes down the whole system fails.
Shared-nothing architecture – each storage disk has its own compute resource attached to it.
This architecture solves the bottleneck problem of the shared-disk architecture, but it has its own drawbacks. Because each disk is attached to its own compute resource, data stored on one disk is not directly available to every other compute resource/warehouse that a user reads from or writes to.
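The shared-nothing drawback can be sketched with a toy model. This is only an illustration under stated assumptions: the `Node` and `Cluster` classes and the modulo partitioning rule are hypothetical and do not come from any real system.

```python
# Toy model of a shared-nothing cluster: each node owns one slice of the data.
# All class names and the partitioning rule are hypothetical.

class Node:
    def __init__(self, name):
        self.name = name
        self.local_rows = {}  # rows stored on this node's own disk

    def read(self, key, cluster):
        """Return (value, origin), where origin says if the read was local."""
        if key in self.local_rows:
            return self.local_rows[key], "local"
        # The row lives on another node's disk, so it must be fetched remotely.
        owner = cluster.owner_of(key)
        return owner.local_rows[key], f"remote from {owner.name}"


class Cluster:
    def __init__(self, node_names):
        self.nodes = [Node(n) for n in node_names]

    def owner_of(self, key):
        # Deterministic partitioning: integer key modulo the number of nodes.
        return self.nodes[key % len(self.nodes)]

    def write(self, key, value):
        self.owner_of(key).local_rows[key] = value


cluster = Cluster(["node-0", "node-1"])
for k in range(4):
    cluster.write(k, f"row-{k}")

node0 = cluster.nodes[0]
print(node0.read(0, cluster))  # key 0 is stored on node-0 itself
print(node0.read(1, cluster))  # key 1 is stored on node-1
```

Every read that lands on another node's slice needs a network hop, which is the cost the paragraph above describes.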
Hadoop is built on the shared-nothing architecture: each node has both its own storage disk and its own compute resource, and such nodes are grouped together to form a Hadoop cluster.
In Hadoop, files are stored in a distributed manner across the nodes. The nodes do not share storage, so data that a node needs from another node has to be requested through the master node.
Compute and storage are tightly coupled in this system, which makes the architecture difficult to scale.
Spark is purely a compute engine: an in-memory MPP (massively parallel processing) engine.
Spark also has nodes, as Hadoop does, but the nodes use a cache in place of a storage disk, which makes Spark much faster at data processing.
Spark is fast, but the data stays in cache, so it is lost as soon as a node goes down. Spark also requires an external file system, such as HDFS or cloud storage, which adds to the cost.
Snowflake uses a multi-cluster, shared-data architecture, a hybrid of the shared-disk and shared-nothing architectures.
Storage and compute resources are decoupled, which allows each one to be scaled separately.
To understand this architecture, consider the analogy of a large restaurant with separate kitchens (the data storage), separate tables that can be grouped/clustered (the compute resources), and waiters (the cloud services layer).
Because compute is decoupled from storage, the data required by each compute resource is fetched through the cloud services layer (the waiters, in this analogy) and made available for the compute resource to process.
Because the compute layer and the storage layer are loosely coupled, each can be scaled only when needed, which makes this architecture more cost-effective than the previous ones.
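A minimal sketch of what decoupling means in practice, assuming a hypothetical `SharedStorage` class for the storage layer and a hypothetical `Warehouse` class for one compute cluster. It is not Snowflake's implementation; it only shows that compute can be added without copying data, and data can be added without touching compute.

```python
class SharedStorage:
    """Stands in for the central storage layer: one copy of the data."""

    def __init__(self):
        self.tables = {}

    def put(self, table, rows):
        self.tables[table] = list(rows)

    def get(self, table):
        return self.tables[table]


class Warehouse:
    """Stands in for one independent compute cluster."""

    def __init__(self, name, storage):
        self.name = name
        self.storage = storage  # a reference to shared storage, not a copy

    def total(self, table, column):
        return sum(row[column] for row in self.storage.get(table))


storage = SharedStorage()
storage.put("orders", [{"amount": 10}, {"amount": 25}, {"amount": 5}])

# Scale compute: add warehouses without copying or repartitioning any data.
warehouses = [Warehouse(f"wh-{i}", storage) for i in range(3)]
totals = [wh.total("orders", "amount") for wh in warehouses]

# Scale storage: add data without changing any warehouse.
storage.put("returns", [{"amount": 3}])
returns_total = warehouses[0].total("returns", "amount")

print(totals, returns_total)
```

In the shared-nothing model, adding a node usually means moving data onto it. Here every warehouse sees the same tables as soon as it is created.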
Snowflake is a modern data cloud – built on the cloud itself
- It supports all kinds of data originating from any source – so it handles both structured and unstructured data
- It is optimized for the cloud – storage and compute resources are separated, so each can be scaled separately
- It helps with data security, as data never has to leave the cloud you choose – it is available on all 3 major clouds (AWS, Azure, GCP)
Snowflake has a unique 3-layered architecture –
Data storage layer –
- This layer is built upon the underlying cloud storage (S3 / Azure Blob Storage / GCP bucket)
- It can scale almost without limit, as it is built on cloud storage
- The data is compressed and stored in columnar format to take less space.
- The data is encrypted and is not visible even to the cloud service provider.
- Pay only for stored data, as compute is separated and is billed separately.
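To see why columnar storage compresses well, here is a small sketch that pivots rows into columns and applies run-length encoding. Run-length encoding is used only as a simple stand-in; the notes above do not say which compression scheme Snowflake uses, and the sample rows and function names are made up for illustration.

```python
from itertools import groupby

# Made-up sample rows in the usual row-oriented layout.
rows = [
    {"country": "IN", "status": "shipped"},
    {"country": "IN", "status": "shipped"},
    {"country": "IN", "status": "pending"},
    {"country": "US", "status": "pending"},
]


def to_columns(records):
    """Pivot row-oriented records into one list of values per column."""
    return {name: [r[name] for r in records] for name in records[0]}


def rle_encode(values):
    """Collapse runs of equal adjacent values into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original list."""
    return [value for value, count in pairs for _ in range(count)]


columns = to_columns(rows)
encoded = {name: rle_encode(values) for name, values in columns.items()}

print(encoded["country"])
print(encoded["status"])
```

Values in one column tend to repeat, so storing a column contiguously produces long runs that encode compactly. The same values interleaved row by row would not.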
Compute & processing layer –
- It is also called the query engine or warehouse
- The underlying infrastructure for this layer is virtual machines (EC2 / Azure VMs / GCP VMs)
- Can be scaled up or down very easily to match work-loads
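Matching a warehouse to a workload can be sketched as a sizing calculation. Everything in this block is an assumption made for illustration: the size labels, the relative capacities, and the idealized linear speed-up are hypothetical, not vendor figures.

```python
# Hypothetical warehouse sizes with relative compute capacity.
# Assumption: each step up doubles capacity and speed-up is perfectly linear.
SIZES = [("XS", 1), ("S", 2), ("M", 4), ("L", 8)]


def estimated_minutes(work_units, capacity):
    """Idealized runtime: total work divided by relative capacity."""
    return work_units / capacity


def smallest_size_for(work_units, target_minutes):
    """Return the smallest size that meets the target, or None if none does."""
    for name, capacity in SIZES:
        if estimated_minutes(work_units, capacity) <= target_minutes:
            return name
    return None


print(smallest_size_for(120, 60))  # a heavier workload needs a larger size
print(smallest_size_for(30, 60))   # a light workload fits the smallest size
```

Real workloads rarely scale linearly, so a calculation like this is only a starting point for choosing a size.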
Cloud services layer –
- Authentication and authorisation management
- User and session management
- Query compilation, optimization and data caching
- Warehouse management
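The caching duty listed above can be illustrated with a toy result cache that sits in front of the compute layer. The `CloudServices` class, the `fake_warehouse` function, and the whitespace/case normalization rule are all hypothetical simplifications, not Snowflake's actual matching rules.

```python
class CloudServices:
    """Toy services layer: serves repeated queries from a result cache."""

    def __init__(self, execute_on_warehouse):
        self._execute = execute_on_warehouse
        self._result_cache = {}
        self.warehouse_calls = 0  # how often compute was actually used

    @staticmethod
    def _normalize(sql):
        # Simplification: lower-casing would also alter string literals
        # in a real query, so this is for illustration only.
        return " ".join(sql.lower().split())

    def run(self, sql):
        key = self._normalize(sql)
        if key in self._result_cache:
            return self._result_cache[key]  # no compute needed
        self.warehouse_calls += 1
        result = self._execute(key)
        self._result_cache[key] = result
        return result

    def invalidate(self):
        """Drop cached results, e.g. after the underlying data changes."""
        self._result_cache.clear()


def fake_warehouse(sql):
    # Stand-in for real query execution on a compute cluster.
    return f"result of: {sql}"


services = CloudServices(fake_warehouse)
first = services.run("SELECT  count(*) FROM orders")
second = services.run("select count(*) from orders")
print(first == second, services.warehouse_calls)
```

The second call is answered without touching the warehouse, which is the saving a cache in this layer is meant to provide.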