1. Explain Snowflake architecture
Snowflake is a cloud data warehouse that runs on top of AWS, Azure, or Google Cloud infrastructure and is a true SaaS offering. There is no hardware or software to select, install, or manage, and no ongoing maintenance or tuning is needed to work with Snowflake.
The Snowflake architecture consists of three main layers: database storage, query processing, and cloud services.
- Data storage – When data is loaded into Snowflake, it is reorganized into Snowflake's internal optimized, compressed, columnar format.
- Query processing – Virtual warehouses process the queries in Snowflake.
- Cloud services – This layer coordinates and handles all activities across Snowflake, including authentication, metadata management, infrastructure management, access control, and query parsing and optimization.
2. Describe Snowflake computing.
Snowflake cloud data warehouse platform provides instant, secure, and governed access to the entire data network and a core architecture to enable various types of data workloads, including a single platform for developing modern data applications. Snowflake brings together the power of data warehouses, the scalability of big data platforms, the elasticity of the cloud, and real-time data sharing at a fraction of the cost of traditional solutions.
3. What is the use of the Cloud Services layer in Snowflake?
The services layer acts as the brain of Snowflake. It authenticates user sessions, applies security functions, offers management services, performs query optimization, and coordinates all transactions.
4. What type of database is Snowflake?
Snowflake is built entirely on SQL. It is a columnar-stored relational database that works well with Excel, Tableau, and many other tools. Snowflake has its own query tool and supports multi-statement transactions, role-based security, and other features expected of a SQL database.
5. How is data stored in Snowflake?
Snowflake stores data in multiple micro-partitions, which are internally optimized and compressed. The data is stored in a columnar format in Snowflake's cloud storage. The stored data objects are not directly visible or accessible to users; they can only be accessed by running SQL queries in Snowflake.
6. How many editions of Snowflake are available?
Snowflake offers four editions depending on your usage requirements.
- Standard edition – The introductory-level offering, providing full access to Snowflake's standard features.
- Enterprise edition – Offers the Standard edition features and services, plus additional features required by large-scale enterprises.
- Business Critical edition – Also called Enterprise for Sensitive Data (ESD); it offers higher levels of data protection for organizations with sensitive data.
- Virtual Private Snowflake (VPS) – Provides the highest level of security, for organizations such as those dealing with financial activities.
7. Explain Virtual warehouse
In Snowflake, a virtual warehouse, often simply called a “warehouse,” is a collection of compute resources. A virtual warehouse provides resources such as CPU, memory, and temporary storage that users need to perform Snowflake operations:
- Execute the SQL SELECT statements that necessitate the use of computing resources (e.g. retrieving rows from tables and views).
- DML operations include:
- Updating table rows (DELETE, INSERT, UPDATE).
- Data Loading into tables (COPY INTO <table>).
- Data unloading from tables (COPY INTO <location>).
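The operations above can be sketched in SQL; this is a minimal illustration, and the warehouse, stage, and table names are invented for the example:

```sql
-- Create a small warehouse that suspends itself when idle
CREATE WAREHOUSE IF NOT EXISTS demo_wh
  WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND = 60
  AUTO_RESUME = TRUE;

USE WAREHOUSE demo_wh;

-- SELECT that retrieves rows uses the warehouse's compute resources
SELECT c_name FROM customers WHERE c_region = 'EMEA';

-- DML: updating table rows
UPDATE customers SET c_region = 'APAC' WHERE c_id = 42;

-- Loading data into a table from a stage
COPY INTO customers FROM @my_stage/customers/;

-- Unloading data from a table to a stage
COPY INTO @my_stage/export/ FROM customers;
```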
8. Is Snowflake OLTP or OLAP?
An OLTP (Online Transactional Processing) database contains detailed and up-to-date data, as well as a large volume of typically small data transactions. In turn, online analytical processing (OLAP) often necessitates complex and aggregated queries with a small number of transactions. Snowflake’s database schema is built around online analytical processing.
9. Explain Columnar database
A columnar database is the opposite of a conventional row-oriented database: it stores data by column instead of by row. This layout simplifies analytical query processing and offers better performance for analytical workloads, which is why columnar databases are central to business intelligence.
10. What is the use of a database storage layer?
Whenever we load data into Snowflake, it organizes the data into a compressed, columnar, optimized format. Snowflake manages all aspects of storing this data, including compression, organization, statistics, file sizes, and other storage-related properties. The data objects stored in Snowflake are not directly accessible or visible; we access them by executing SQL queries through Snowflake.
11. What is the use of the Compute layer in Snowflake?
In Snowflake, virtual warehouses, which are clusters of compute resources, perform all the data-handling tasks. When executing a query, a virtual warehouse retrieves the minimum data needed from the storage layer to satisfy the request.
12. What are the different ways to access the Snowflake Cloud data warehouse?
We can access the Snowflake data warehouse through:
- A web-based user interface from which all aspects of Snowflake management and usage can be accessed.
- Command-line clients (such as SnowSQL) that can access all parts of Snowflake management and use.
- Snowflake has ODBC and JDBC drivers, which allow other applications (like Tableau) to connect to it.
- Native connectors (e.g., Python, Spark) for developing applications that connect to Snowflake.
- Third-party connectors can be used to link applications such as ETL tools (e.g., Informatica) and BI tools (e.g., ThoughtSpot) to Snowflake.
13. Why is Snowflake highly successful?
Snowflake is highly successful because of the following reasons:
- It assists a wide variety of technology areas like data integration, business intelligence, advanced analytics, security, and governance.
- It offers cloud infrastructure and supports advanced design architectures ideal for dynamic and quick usage developments.
- Snowflake supports built-in features like data cloning, data sharing, separation of compute and storage, and directly scalable computing.
- Snowflake eases data processing.
- Snowflake provides extendable computing power.
- Snowflake suits various applications, such as an ODS with staged data, a data lake combined with a data warehouse, raw marts, and data marts with cleansed and modelled data.
14. How do we secure the data in the Snowflake?
Data security plays a prominent role in all enterprises. Snowflake adopts best-in-class security standards for encrypting and securing the customer accounts and data stored in Snowflake. It provides industry-leading key management features at no extra cost.
15. What are Micro Partitions?
Snowflake comes with a robust and unique kind of data partitioning known as micro-partitioning. All data in Snowflake tables is automatically divided into micro-partitions; this partitioning is performed transparently on every Snowflake table.
16. How is Snowflake different from Redshift?
Both Redshift and Snowflake provide on-demand pricing but vary in package features. Snowflake separates compute from storage in its pricing model, whereas Redshift bundles them together.
|Snowflake|Redshift|
|---|---|
|A comprehensive SaaS solution that requires no maintenance.|AWS Redshift clusters require some manual maintenance.|
|Separates compute and storage, allowing flexible pricing and configuration.|Reserved/spot instance pricing allows for cost optimization.|
|Uses real-time auto-scaling.|Scaling involves manually adding or removing nodes.|
|Provides fewer data customization options.|Facilitates data flexibility with features such as partitioning and distribution.|
|Provides always-on encryption with strict security checks.|Offers a flexible, customizable security strategy.|
17. Explain Snowpipe in Snowflake
Snowpipe is Snowflake’s continuous data ingestion service. Snowpipe loads data in minutes once files are uploaded to a stage and submitted for ingestion. Snowflake maintains load capacity with Snowpipe’s serverless compute approach, assuring appropriate compute resources to meet demand. In a nutshell, Snowpipe provides a “pipeline” for loading new data in micro-batches as soon as it becomes available.
The data is loaded using the COPY command defined in a connected pipe. Snowpipe can use a pipe, which is a named, first-class Snowflake object containing a COPY statement. The COPY statement specifies the location of the data files (i.e., a stage) as well as the target table. All data types, including semi-structured data types like JSON and Avro, are supported.
18. How does Snowpipe detect staged files?
There are two mechanisms for detecting staged files:
- Using cloud messaging to automate Snowpipe
- REST endpoints in Snowpipe
19. What are the benefits of Snowpipe?
- Real-time insights, as data is queryable minutes after it arrives
- Cost-effectiveness: you pay only for the compute used to load data
- Ease of use: Snowpipe is serverless, so there are no warehouses to size or manage
20. Explain Snowflake Time Travel
The Snowflake Time Travel feature allows us to access historical data at any point within a defined retention period, including data that has since been changed or deleted. With it, we can carry out the following tasks:
- Restore data-related objects (tables, schemas, databases) that may have been lost unintentionally.
- Examine data usage and the changes made to the data over a specific period.
- Duplicate and back up data from key points in the past.
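The tasks above can be sketched as follows; the table name `orders` and the offsets are illustrative:

```sql
-- Query the table as it looked 30 minutes ago
SELECT * FROM orders AT(OFFSET => -1800);

-- Query the table as of a specific timestamp
SELECT * FROM orders AT(TIMESTAMP => '2023-05-01 08:00:00'::TIMESTAMP_LTZ);

-- Back up the table from a point in the past (one hour ago)
CREATE TABLE orders_backup CLONE orders AT(OFFSET => -3600);

-- Restore a table that was dropped unintentionally
UNDROP TABLE orders;
```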
21. Differentiate Fail-Safe and Time-Travel in Snowflake
|Time Travel|Fail-safe|
|---|---|
|Depending on the Snowflake edition and the account- or object-level settings, users themselves can query and restore historical data within the configured retention period.|Users have no control over recovery; once the Time Travel period ends, data can be recovered only through Snowflake Support, for a further 7 days. So with Time Travel set to 6 days, database objects remain recoverable for 6 days after the transaction plus 7 days of Fail-safe.|
22. What is zero-copy Cloning in Snowflake?
Zero-copy cloning is a Snowflake feature in which the simple keyword CLONE lets you create a clone of your tables, schemas, and databases without copying the actual data. As a result, you can have practically real-time production data cloned into your dev and stage environments to conduct various activities.
- There are no additional storage costs associated with data replication.
- There is no waiting time for copying data from production to non-production contexts.
- There is no need for administrative efforts since cloning is as simple as a click of a button.
- No copy, only clone: Data exists only in one place.
- Promote corrected/fixed data to production instantly.
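Cloning works at every level of the object hierarchy; a minimal sketch, with the database, schema, and table names invented for the example:

```sql
-- Clone a whole database, a schema, or a single table; no data is copied,
-- the clone initially shares the underlying micro-partitions
CREATE DATABASE dev_db CLONE prod_db;
CREATE SCHEMA dev_db.sales_clone CLONE prod_db.sales;
CREATE TABLE orders_dev CLONE prod_db.sales.orders;
```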
23. What is Data Retention Period in Snowflake?
The data retention period is an important aspect of Snowflake Time Travel.
When data in a table is modified, for example when rows are deleted or an object containing data is dropped, Snowflake saves the data's previous state. The data retention period determines the number of days this historical data is kept and, as a result, how long Time Travel operations (SELECT, CREATE … CLONE, UNDROP) can be performed on it.
The standard retention period is one day (24 hours) and is enabled by default for all Snowflake accounts.
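The retention period can be inspected and changed per object; a sketch, assuming a table called `orders` (longer retention, up to 90 days, requires Enterprise edition or higher):

```sql
-- Check the current retention period for a table
SHOW PARAMETERS LIKE 'DATA_RETENTION_TIME_IN_DAYS' IN TABLE orders;

-- Extend it to 30 days (Enterprise edition or above)
ALTER TABLE orders SET DATA_RETENTION_TIME_IN_DAYS = 30;
```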
24. What is the use of Snowflake Connectors?
The Snowflake connector is a piece of software that allows us to connect to the Snowflake data warehouse platform and conduct activities such as Read/Write, Metadata import, and Bulk data loading.
25. What tasks can the Snowflake connector execute?
- Read data from or publish data to tables in the Snowflake data warehouse.
- Load data in bulk into a Snowflake data warehouse table.
- Insert or bulk-load data into multiple tables at the same time by using the multiple input connections functionality.
- Look up records from a table in the Snowflake data warehouse.
26. What are the types of Snowflake Connectors?
- Snowflake Connector for Kafka
- Snowflake Connector for Spark
- Snowflake Connector for Python
27. Describe Snowflake Clustering
In Snowflake, the ordering of data within micro-partitions is managed through clustering, which is defined by specifying cluster keys on the table. The method by which you maintain clustered data in a table over time is called re-clustering.
A clustering key is a subset of columns in a table (or expressions on those columns) that are deliberately chosen to co-locate the table's data in the same micro-partitions. This is beneficial for very large tables where the ordering was not ideal (at the time the data was inserted/loaded) or where extensive DML has weakened the table's natural clustering.
Some general indicators that can help determine whether a clustering key should be defined for a table are as follows:
- Table queries are running slower than expected or have degraded noticeably over time.
- The table’s clustering depth is large.
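Defining and inspecting a clustering key is a one-liner each; the table and column names below are illustrative:

```sql
-- Define a clustering key on a large table
ALTER TABLE events CLUSTER BY (event_date, tenant_id);

-- Inspect how well the table is clustered on those columns
SELECT SYSTEM$CLUSTERING_INFORMATION('events', '(event_date, tenant_id)');
```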
28. Explain Data Shares
Snowflake Data Sharing allows organizations to share data securely and immediately. Secure Data Sharing enables sharing between accounts through Snowflake database tables, secure views, and other supported objects.
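A share is created on the provider side and consumed as a read-only database on the consumer side. A sketch, with all names and the consumer account identifier invented for the example:

```sql
-- Provider side: create a share and grant access to objects
CREATE SHARE sales_share;
GRANT USAGE ON DATABASE sales_db TO SHARE sales_share;
GRANT USAGE ON SCHEMA sales_db.public TO SHARE sales_share;
GRANT SELECT ON TABLE sales_db.public.orders TO SHARE sales_share;
ALTER SHARE sales_share ADD ACCOUNTS = xy12345;  -- consumer account (example)

-- Consumer side: create a read-only database from the share
CREATE DATABASE sales_from_share FROM SHARE provider_acct.sales_share;
```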
29. What is “Stage” in the Snowflake?
In Snowflake, stages are data storage locations. If the data to be imported into Snowflake is stored in another cloud area, such as AWS S3, Azure, or GCP, these are referred to as External stages; if the data is stored within Snowflake, they are referred to as Internal stages.
Internal stages are further divided as follows:
- Table Stage
- User Stage
- Internal Named Stage
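The three internal stage types differ only in how they are referenced; a sketch (the `PUT` commands are run from a client such as SnowSQL, and the file and table names are illustrative):

```sql
-- User stage (one per user), referenced with @~
PUT file:///tmp/data.csv @~/staged/;

-- Table stage (one per table), referenced with @%table_name
PUT file:///tmp/data.csv @%mytable;

-- Named internal stage, created explicitly and referenced by name
CREATE STAGE my_int_stage;
PUT file:///tmp/data.csv @my_int_stage;
```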
30. Does Snowflake maintain stored procedures?
Yes. Snowflake supports stored procedures, which combine SQL statements with procedural logic and can be written in languages such as Snowflake Scripting (SQL) and JavaScript.
31. How do we execute the Snowflake procedure?
Stored procedures allow us to create modular code comprising complicated business logic by adding various SQL statements with procedural logic. For executing the Snowflake procedure, carry out the below steps:
- Run a SQL statement
- Extract the query results
- Extract the result set metadata
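The steps above can be sketched as a small Snowflake Scripting procedure; the procedure name and logic are invented for the example:

```sql
-- Run a SQL statement inside a procedure and extract its result
CREATE OR REPLACE PROCEDURE count_rows(table_name STRING)
RETURNS INTEGER
LANGUAGE SQL
AS
$$
DECLARE
  row_count INTEGER;
BEGIN
  -- Execute a statement against the table named by the argument
  SELECT COUNT(*) INTO :row_count FROM IDENTIFIER(:table_name);
  RETURN row_count;
END;
$$;

-- Execute the procedure
CALL count_rows('mytable');
```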
32. Explain Snowflake Compression
All the data we load into Snowflake is compressed automatically. Snowflake uses modern data compression algorithms to compress and store the data, and customers are charged based on the compressed size of the data, not the raw size.
Following are the advantages of the Snowflake Compression:
- Storage expenses are lower than raw cloud storage costs because of compression.
- No storage expenditure for on-disk caches.
- Approximately zero storage expenses for data sharing or data cloning.
33. How to create a Snowflake task?
To create a Snowflake task, we use the “CREATE TASK” command. The role creating the task needs:
- The CREATE TASK privilege on the schema.
- The USAGE privilege on the warehouse referenced in the task definition.
- The privileges required to run the SQL statement or stored procedure in the task definition.
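A minimal task definition might look like this; the task, warehouse, and table names are illustrative:

```sql
-- A task that runs a statement on a CRON schedule
CREATE TASK nightly_rollup
  WAREHOUSE = demo_wh
  SCHEDULE = 'USING CRON 0 2 * * * UTC'
AS
  INSERT INTO daily_totals
  SELECT CURRENT_DATE, SUM(amount) FROM orders;

-- Tasks are created suspended; resume to start the schedule
ALTER TASK nightly_rollup RESUME;
```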
34. How do we create temporary tables?
To create temporary tables, we have to use the following syntax:
CREATE TEMPORARY TABLE mytable (id NUMBER, creation_date DATE);
35. Can you tell me how to access the Snowflake Cloud data warehouse?
Snowflake’s data warehouse can be accessed using the following ways:
- ODBC Drivers (a driver for connecting to Snowflake).
- JDBC Drivers (a driver enabling a Java application to interact with a database).
- Python Libraries (for creating Python applications that connect to Snowflake and perform standard operations).
- Web User Interface (can be used for almost any task you can accomplish with SQL and the command line, such as: Creating and managing users and other account-level objects).
- SnowSQL Command-line Client (Python-based command-line interface to connect to Snowflake from Windows, Linux, and MacOS).
36. Name the types of caches in Snowflake.
- Query Results Caching
- Metadata Cache
- Virtual Warehouse Local Disk Caching
37. What is Materialized view in Snowflake?
A materialized view in Snowflake is a pre-computed data set derived from a query specification. Because the data is pre-computed, querying a materialized view is much faster than running the underlying query against the view's base table.
In simple words, materialized views are designed to enhance the query performance for common and repetitive query patterns. Materialized Views are primary database objects and speedup projection, expensive aggregation, and selection operations for queries that run on larger data sets.
38. Name a few Snowflake database objects that can be shared using Secure data sharing?
- Secure views
- External tables
- Secure UDFs
- Secure materialized views
39. What are the internal and external stages in Snowflake?
Following are the two stages supported by Snowflake to store data files and to load/unload data:
Internal Stage: Here the files are stored within the Snowflake account
External Stage: Here the files are stored in an external location. For instance AWS S3.
40. What is “Continuous Data Protection” (CDP) in Snowflake?
Continuous Data Protection (CDP) is an essential feature offered by Snowflake to protect data stored in Snowflake from events like malicious attacks, human error, and software or hardware failures. CDP keeps your data accessible and recoverable at all stages of the data life cycle, even if it is lost accidentally.
41. What is Snowflake Caching, and why should you care?
When you submit a new query, Snowflake checks for previously executed queries; if a matching one exists and its results are still cached, it uses the cached result set instead of running the query. Snowflake's result cache is global and can be used by any user.
42. What are the advantages of Snowflake Query Caching?
- The default setting for the Results Cache is Automatic. You do not need to do anything special to use this feature.
- For 24 hours, all results are cached.
- There are no space restrictions; the result cache is not limited by warehouse storage on any cloud (AWS/GCP/Azure).
- The cache is global and accessible to all warehouses and users.
- As a result of caching, your BI dashboards will load faster.
- As a result of caching, the compute cost is reduced.
43. When the underlying data changes, what happens to Cache results?
When the data in the underlying micro-partitions changes, Snowflake's cached results are invalidated. The Snowflake documentation contains more details, but tests have shown that the result cache is reused unless the underlying data (or the SQL text) has changed; inserts, updates, and deletes that do not affect the relevant micro-partitions are ignored, and the result cache is still used.
Finally, results are typically kept for 24 hours, though the clock is reset every time the query is re-executed, up to a maximum of 30 days, after which the query is run against the underlying data again.
44. Can you explain the difference between Snowflake and AWS (Amazon Web Service)?
Snowflake and Amazon Redshift are both cloud-based data warehouse platforms that offer excellent performance, scalability, and business intelligence tooling. Both offer similar core functionality, such as relational data management, security, scalability, and cost-efficiency. The differences lie in pricing, user experience, and deployment options.
- Snowflake is a complete SaaS (Software as a Service) offering, so no maintenance is required. AWS Redshift clusters, on the other hand, necessitate manual care.
- Snowflake’s security model employs always-on encryption to enforce stringent security checks, whereas Redshift uses a more flexible, customizable approach.
- Snowflake’s storage and compute are entirely independent, so storage costs are comparable to storing the data in S3. AWS has tried to address this with Redshift Spectrum, which allows querying data that resides directly in S3, but it is not as seamless as Snowflake’s approach.
Snowflake is one of the few enterprise-ready cloud data warehouses that brings simplicity without sacrificing features. It automatically scales, both up and down, to get the right balance of performance vs. cost. Snowflake’s claim to fame is that it separates compute from storage. This is significant because almost every other database, Redshift included, combines the two, meaning you must size for your largest workload and incur the cost that comes with it.
In this document, we will focus on some of the best practices based on Snowflake’s recommendation and previous experience.
2- Best Practice — Efficient Table Design
i. Choosing the appropriate data types helps in improving query performance.
- DATE and TIMESTAMP values are stored more efficiently than VARCHAR in Snowflake.
- Snowflake recommends choosing a date or timestamp data type for storing date and timestamp fields instead of a character data type.
ii. It is recommended to use Referential Integrity Constraints
- They provide valuable metadata that users can use to understand the schema and the relationships defined between tables.
- In Snowflake, referential integrity constraints are not enforced; they are created in a disabled state. NOT NULL is the exception and is enforced by default.
iii. For big data sets, Clustering is a good practice and helps improve query performance.
- One will have to manually run a DML statement to re-cluster a table.
- The existing data will be re-written in a different order upon re-clustering. Snowflake saves the previous order for 7 days to provide Fail-safe protection, one of Snowflake’s cloud data warehouse’s most lauded features.
- Re-clustering a table on Snowflake costs additional dollars. This is directly proportional to the size of the data set that is re-ordered.
3- Best Practice — Data Storage
Snowflake provides an array of features for stored data. Continuous Data Protection (CDP), which includes Fail-safe and Time Travel, is given to all Snowflake accounts at no additional licensing cost. That does not mean CDP has no impact on your storage costs; on the contrary:
- Your account will be charged for all the data stored in schemas, tables, and databases created in your Snowflake architecture. Based on the data stored and the duration for which it is stored, CDP has an impact on the storage costs.
- Until the data leaves the Fail-safe state, storage costs will be incurred. This means that you pay for the data storage irrespective of whether it is in an Active, Time-travel, or Fail-safe State. Hence, it is important to make the right storage considerations.
4- Best Practice — Data Staging
To assist in loading bulk data into tables, Snowflake has a feature called stages where files that have the data to be loaded are staged.
- Snowflake allows both internal (within Snowflake) and external (S3, Azure) stages.
- No additional cost is charged for Time Travel and Fail-safe features for data stored in internal stages within Snowflake. However, standard data storage costs apply.
- While performing Snowflake ETL, it is a good practice to stage regular data sets by partitioning them into logical paths. This could include details such as source identifiers or geographical location, etc., along with the date when the data was written.
- Staging will allow you to take advantage of Snowflake’s parallel operations by letting you execute concurrent COPY statements that match a subset of files.
5- Best Practice — Data Cloning
Snowflake has a zero-copy cloning feature that gives an easy way to take a “snapshot” of any schema, table, or database.
- Cloning creates a derived copy of the object that initially shares the underlying storage. This can come in handy when creating instant backups.
- Cloning does not incur any additional costs as long as you do not make changes to the cloned object.
6- Best Practice — Data Loading Considerations
i. General File Sizing Recommendations
- While performing Snowflake ETL, it is important to optimize the number of parallel loads into Snowflake. It is recommended to create compressed data files that are roughly 10 MB to 100 MB in size.
- Aggregate the smaller files to reduce processing overhead. Split the large files into several smaller files for faster load. This allows you to distribute the load between servers in the active Snowflake warehouse.
ii. Data Size Limitations for Semi-Structured Data
- The VARIANT data type has a 16 MB (compressed) size limit on the individual rows.
- For efficiency enhancement, while executing the COPY INTO <table> command it is recommended to enable the STRIP_OUTER_ARRAY file format option. This will load the records into separate table rows by removing the outer array structure.
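A sketch of such a load, assuming a staged JSON file whose top level is an array (the table, stage, and file names are illustrative):

```sql
-- Each element of the outer JSON array becomes a separate table row
COPY INTO json_events
FROM @my_stage/events.json
FILE_FORMAT = (TYPE = 'JSON' STRIP_OUTER_ARRAY = TRUE);
```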
iii. Data Size Limitations of Parquet files
It is recommended to split Parquet files greater than 3 GB in size into smaller files of 1 GB or less for smooth loading. This ensures that the load does not time out.
iv. Preparing Delimited Text Files
The following points must be considered while preparing CSV/Delimited text files for loading:
- Encode files in a supported character set; the default is UTF-8. Additional encodings can be specified using the ENCODING file format option.
- Within the files, records and fields should be delimited by different characters. Note that each should be a single character (and not the same one). Pipe (|), caret (^), comma (,), and tilde (~) are common field delimiters. The line feed (\n) is often used as a row delimiter.
- Fields that contain delimiter characters should be enclosed in single or double quotes. If the data being loaded contains quotes, those must be escaped.
- Fields that contain carriage returns (\r\n) should also be enclosed in single or double quotes. On Windows systems, a carriage return is commonly introduced together with a line feed character to mark the end of a line.
- Each row should have the same number of columns.
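The guidelines above can be captured in a reusable file format; this is one possible configuration, and the format, stage, and table names are illustrative:

```sql
-- A named CSV file format reflecting the recommendations above
CREATE FILE FORMAT my_csv_format
  TYPE = 'CSV'
  FIELD_DELIMITER = '|'
  RECORD_DELIMITER = '\n'
  FIELD_OPTIONALLY_ENCLOSED_BY = '"'
  ENCODING = 'UTF8'
  SKIP_HEADER = 1;

COPY INTO mytable
FROM @my_stage/data/
FILE_FORMAT = (FORMAT_NAME = 'my_csv_format');
```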
7- Best Practice — Improving Load Performance
- Use bulk loading to get the data into tables in Snowflake.
- Consider splitting large data files so the load can be efficiently distributed across servers in a cluster.
- Delete from internal stages files that are no longer needed. You may notice improved performance in addition to saving on costs.
- Isolate load and transform jobs from queries to prevent resource contention. Dedicate separate warehouse for loading and querying operations to optimize performance for each.
- Leverage the scalable compute layer to do the bulk of the data processing.
- Consider using Snowpipe in micro-batching scenarios.
8- Best Practice — Data Caching
While Caching is an automatic behaviour, there are two best practices you can implement to maximize cache usage and speed query performance.
- Firstly, when segmenting query workload, you should place users querying the same data on the same virtual warehouse. This maximizes the chances that data retrieved to the cache by one user will also be used by others.
- Finally, be aware that the result cache is completely independent of the virtual warehouse, and any query executed by any user on the account will be served from the result cache, provided the SQL text is the same.
9- Best Practice — Data Querying
i. Dedicated warehouse for Querying:
- Snowflake automatically caches data in the virtual warehouse (local disk cache), so place users querying the same data on the same virtual warehouse. This maximizes the chances that data retrieved to the cache by one user will also be used by others. Suspending the warehouse erases this cache.
- Result Cache is maintained by the Global Services layer, any query executed by any user on the account will be served from the result cache, provided the SQL text is the same. Results are retained for 24 hours.
- Snowflake Query Profile feature helps us to analyze queries being run from BI tools as well. In case, you have multiple BI tools and common users, having a dedicated warehouse for each BI tool will help to identify queries generated from the BI tool.
10- Best Practice — Warehouse Size
Snowflake offers pay-per-second billing. They allow the different sizes of the warehouse (Large, X-Large, 2X-Large, etc.). Create separate warehouses for different environments such as development, testing, and production.
- For queries in development or testing environments, smaller warehouse sizes (such as X-Small, Small, Medium) may be enough.
- For queries in production environments, larger warehouse sizes (such as Large, X-Large, 2X-Large) may be necessary, and with per-second billing they can still be cost-effective.
- From a performance point of view, it is always better to create a separate warehouse for your data loading and query execution. Start with a smaller size and based on the performance, you can manually resize the warehouse.
- Increasing the size of a warehouse does not always improve data loading performance. Data loading performance is influenced more by the number of files being loaded (and the size of each file) than the size of the warehouse.
- Credit charges are calculated based on:
- The number of Clusters (if using a multi-cluster warehouse)
- The number of servers per cluster (a function of warehouse size)
- The length of time each server in each cluster runs
11- Best Practice — Warehouse Optimizations
i. Enable Auto-Suspend
Make sure all virtual warehouses are set to auto-suspend. This way, when they are done processing queries, auto-suspend will turn off your virtual warehouses, and thus stop credit consumption.
ii. Enable Auto-Resume
Make sure all virtual warehouses are set to auto-resume. If you are going to implement auto-suspend and set appropriate timeout limits, enabling auto-resume is a must; otherwise, users will not be able to query the system.
iii. Set Timeouts Appropriately for Workloads
All virtual warehouses should have an appropriate timeout for their particular workload:
- For task, data-loading, and ETL/ELT warehouses, set the timeout so the warehouse suspends immediately upon completing its work.
- For BI and SELECT query warehouses, set the suspension timeout to 10 minutes in most situations to keep data caches warm for frequent access by end-users.
- For DevOps, DataOps, and data science warehouses, set the suspension timeout to 5 minutes because having a warm cache is not as important for ad hoc and highly unique queries.
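These settings are applied with ALTER WAREHOUSE; the warehouse names below are illustrative:

```sql
-- BI/query warehouse: keep caches warm for 10 minutes before suspending
ALTER WAREHOUSE bi_wh SET AUTO_SUSPEND = 600 AUTO_RESUME = TRUE;

-- Loading/ELT warehouse: suspend as soon as work completes
ALTER WAREHOUSE load_wh SET AUTO_SUSPEND = 60 AUTO_RESUME = TRUE;
```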
iv. Set Account Statement Timeouts
Use the STATEMENT_QUEUED_TIMEOUT_IN_SECONDS and STATEMENT_TIMEOUT_IN_SECONDS parameters to automatically stop queries that are taking too long to execute, either due to a user error or a frozen cluster. Customize warehouse, account, session, and user timeout-level statements according to your data strategy for long-running queries.
Here’s an example:
ALTER WAREHOUSE LOAD_WH SET STATEMENT_TIMEOUT_IN_SECONDS = 3600;
SHOW PARAMETERS IN WAREHOUSE LOAD_WH;
v. Drop Unused Tables
You might have unused tables that are candidates to be dropped. Just make sure no one is querying these tables. In fact, you might want to make it mandatory to check all tables before deletion. (If you have Time Travel set up, you can un-drop a table if you make an error.) This is specific to the database context, so be sure to look at tables across your databases. Also, be mindful of tables used only because of DDLs.
vi. Find Warehouses That Don’t Have Resource Monitors
Resource monitors are a great way to proactively control workload budgets and prevent unexpected resource spikes. Resource monitors can help monitor both user usage and service account usage in Snowflake. First, you should have dedicated virtual warehouses for each of your loading, ELT, BI, reporting, and data science workloads as well as for other workloads. Accounts and warehouses can have total, yearly, monthly, weekly, and daily credit quotas.
vii. Apply Resource Monitors
You can use the UI or SQL to apply your resource monitor policy. Based on account preference settings, resource monitors can notify you when consumption reaches a lower threshold, and then suspend the warehouse or account at a higher threshold.
Considerations for Resource Monitoring
- We recommend setting monitors to notify you when a certain threshold of consumption is reached.
- When consumption approaches the maximum budgeted level, set the resource monitor to auto-suspend the warehouse or the entire account, allowing queries to complete but preventing future requests.
- Resource monitors can also be used to terminate all currently running queries and immediately suspend the resource or account. This setting is usually reserved for situations where a hard quota is exceeded.
- For customers who do not want to set hard limits, it’s still a good idea to have notification monitors set on all warehouses in case usage unexpectedly spikes. That way, all admins within the account will get an email or on-screen notification when thresholds are reached.
The following figure shows the resource monitor configuration screen:
12- Best Practice — Scaling Data Warehouse
i. Scale-Up
Snowflake allows for a scale-up in the virtual data warehouse to better handle large workloads. When using scale-up to improve performance, make note of the following:
- Snowflake supports fast and easy adjustments to the warehouse-size to handle the workload.
- Warehouses can also be suspended or resumed automatically, with complete transparency for the user.
Snowflake’s scale-up functionality supports the continually changing requirements for processing.
ii. Scale-Out
Snowflake supports the deployment of same-size clusters to support concurrency. Keep these points in mind for how scale-out can help performance optimization:
- As users execute queries, the virtual data warehouse automatically adds clusters up to a fixed limit.
- It can scale out in a more controlled way, instead of deploying one or more clusters of larger machines as legacy data platforms do.
Snowflake automatically adjusts based on user queries, with automatic clustering during peak and off-hours as needed.
iii. Enable Auto-Scaling
If you are using an enterprise edition of Snowflake, multi-cluster warehouses should be configured to run in an Auto-scale mode, which enables Snowflake to automatically start and stop clusters as needed.
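For example (the warehouse name and cluster counts are illustrative):

```sql
ALTER WAREHOUSE reporting_wh SET
  MIN_CLUSTER_COUNT = 1         -- scale down to one cluster when quiet
  MAX_CLUSTER_COUNT = 4         -- add clusters under concurrency pressure
  SCALING_POLICY = 'STANDARD';  -- favour starting clusters over queuing
```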
13- Best Practice — Leverage Materialized Views
Materialized views can improve the performance of queries that repeatedly use the same subquery results. Use materialized views when:
- The query results contain a small number of rows and/or columns relative to the base table
- The query contains results that require significant processing
- The query is against file-based data
- Results don’t change often
- Results are used often
- The query consumes a lot of resources
You can create a materialized view like this:
CREATE MATERIALIZED VIEW MV1 AS SELECT * FROM TAB1 WHERE COLUMN_1 = x OR COLUMN_1 = y;
And use it like this:
SELECT * FROM MV1 WHERE COLUMN_1 = y;
But be careful! There are some gotchas:
- A background service automatically updates the materialized view after changes are made to the base table
- The maintenance of materialized views consumes credits
- Compare materialized views vs. scaling up compute to solve performance issues.
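Because maintenance consumes credits, it is worth checking the refresh history before deciding. A sketch using the INFORMATION_SCHEMA table function (the view name MV1 matches the example above):

```sql
SELECT start_time, end_time, credits_used
FROM TABLE(INFORMATION_SCHEMA.MATERIALIZED_VIEW_REFRESH_HISTORY(
       MATERIALIZED_VIEW_NAME => 'MV1'))
ORDER BY start_time DESC;
```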
14- Best Practice — Segment Data
Snowflake caches data in the virtual data warehouse, but it’s still essential to segment data. Consider these best practices for data query performance:
- Group users with common queries in the same virtual data warehouse to optimize data retrieval and use.
- The Snowflake Query Profile supports query analysis to help identify and address performance concerns.
Snowflake draws from the same virtual data warehouse to support complex data science operations, business intelligence queries, and ELT data integration.
15- Best Practice — Using Snowflake Query Profile
The first step to Snowflake optimization is to discover the source of performance bottlenecks. Use Snowflake’s Query Profile tool to diagnose issues in slow-running queries. Like similar database management tools, Query Profile lets you explore how Snowflake executed your query and what steps in the process are causing slowdowns.
Once you have identified the nature of the problem, use these seven best practices to help optimize your query speed and overall Snowflake performance.
16- Benefit from a range of Integrations and Partnerships
Snowflake’s ecosystem is designed for flexibility, openness, and extensibility without exposure to security threats. That means you benefit from partnerships and seamless integrations with vendors of your choice, including vendors for BI tools, data science technologies, marketing and sales automation software, CRM solutions, and machine learning.
Another area you might use vendors for is security analytics. Snowflake enables you to connect to existing SIEM software, fraud products, and threat intelligence feeds. Also, Snowflake provides built-in security options from leading BI partners, including Tableau, Looker, Sigma Computing, and Periscope Data, so you can create a wide range of user interfaces, visualizations, and reports that align with your needs, processes, and workflows.
17- Additional Features
i. Secure Data Sharing
Let’s put this to the test. In your organization, if you need to pass data to a team in another company in a secure way, how would you do it? Secure FTP? Ugh! An expensive proprietary tool? Blegh! Or — worst of all — an email (hopefully you would at least password-protect it)? Don’t you dare! If any of these are the case, you really need to consider Snowflake Secure Data Sharing.
Snowflake not only supports view-based data sharing but has enhanced this by revolutionizing how organizations distribute and consume shared data. Unlike FTP and email, Snowflake Data Sharing is far easier to use as it provides instant access to live data and eliminates data copying. Snowflake’s built-for-the-cloud architecture enables data sharing without any complicated ETL or setup and, more importantly, allows you to control security in a single place, at the data level.
With Secure Data Sharing, no actual data is copied or transferred between accounts. All sharing is accomplished through Snowflake’s unique services layer and metadata store. This is an important concept because it means that shared data does not take up any storage in a consumer account and, therefore, does not contribute to the consumer’s monthly data storage charges. The only charges to consumers are for the compute resources (i.e. virtual warehouses) used to query the shared data.
Real-time secure sharing of data is ready as soon as a data provider adds to or updates data for the end consumers. There are out of the box advantages to this approach such as:
- Immediate access: no transformation, data movement, loading, or reconstruction is required, and data is available for immediate use.
- Live data: changes made in real-time by a data provider are immediately available to data consumers without effort. Thus, data is always current.
- Secure, managed access: a data provider can share all their data with any number of data consumers in a secure, governed manner with minimal effort. The flexibility of controlling all aspects of managed access increases the ease of use.
- No ETL management: users do not need to manage any ETL for secure data sharing capabilities and operations.
- Access Logs: Data providers can track and monitor access to Data Shares and quickly respond to the user’s actions.
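On the provider side, a share is just metadata plus grants. A minimal sketch with hypothetical object and account names:

```sql
CREATE SHARE sales_share;

-- Grant the consumer read access to specific objects only.
GRANT USAGE  ON DATABASE sales_db               TO SHARE sales_share;
GRANT USAGE  ON SCHEMA   sales_db.public        TO SHARE sales_share;
GRANT SELECT ON TABLE    sales_db.public.orders TO SHARE sales_share;

-- Make the share visible to a consumer account.
ALTER SHARE sales_share ADD ACCOUNTS = xy12345;
```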
ii. Data Marketplace
A data marketplace is an online transactional location or store that facilitates the buying and selling of data. As many businesses seek to augment or enrich internal data sets with external data, cloud-based data marketplaces are appearing at a growing rate to match data consumers with the right data sellers.
Snowflake Data Marketplace gives data scientists, business intelligence and analytics professionals, and everyone who desires data-driven decision-making, access to more than 375 live and ready-to-query data sets from more than 125 third-party data providers and data service providers (as of January 29, 2021).
- Data Augmentation: Source quality, third-party data to augment your own data sets and make better business decisions.
- Data Analysis: Get personalized data feeds from vendors you already work with or request customized data sets based on your business requirements from the many providers in the Data Marketplace.
- Data Science: Discover new data sets your team can use to train AI/ML models and to improve data science applications.
- Data Enrichment: Take advantage of bidirectional data exchange and enrich your own data by leveraging data enrichment services from different vendors.
Best Practices for Data Engineering on Snowflake
The most important lessons I have learned working with Snowflake customers include the following:
- Use a standard ingestion pattern: This involves the multi-stage process of landing the data files in cloud storage and then loading them to a (transient) landing table before transforming the data. Breaking the overall process into predefined steps makes it easier to orchestrate and test, and ensures separation of duties.
- Retain history of raw data: Unless your data is sourced from a raw data lake, it makes sense to keep the raw data history which should ideally be stored using the VARIANT data type to benefit from automatic schema evolution. This means you have the option of truncating and re-processing data if bugs are found in the transformation pipeline and it provides a raw data source for Data Scientists. While you may not yet have any machine learning requirements, it’s almost certain you will, if not now, then in the coming years. Remember that Snowflake data storage is remarkably cheap, unlike on-premises solutions.
- Use multiple data models: On-premises data storage was so expensive it was not feasible to store multiple copies of data, with each using a different data model as required. However, using Snowflake, it makes sense to store raw data history in either structured or variant format, cleaned and conformed data in 3rd Normal Form or using a Data Vault model and finally, data ready for consumption in a Kimball Dimensional Data model. Each data model has unique benefits, and storing the results of intermediate steps has huge architectural benefits, not least the ability to reload and reprocess the data in case of mistakes.
- Use the right tool: As the quote above implies, if you only know one tool really well, you’ll use it inappropriately. The decision should be based upon a range of factors, including the existing skill set in the team, whether you need rapid near real-time delivery, and whether you’re doing a once-off data load or a regular repeating process. Be aware, Snowflake can natively handle a range of file formats, including Avro, Parquet, ORC, JSON and CSV. Finally, there is extensive guidance on loading data into Snowflake on the online documentation.
- Use COPY or SNOWPIPE to load data: Around 80% of data loaded into a data warehouse is either ingested using a regular batch process or, increasingly, immediately in near real-time after the data files arrive. By far, the fastest, most cost-efficient way to load data is using COPY and SNOWPIPE, so avoid the temptation to use other methods (for example, queries against external tables) for regular data loads. Effectively, this is another example of using the right tool.
- Avoid JDBC or ODBC for regular large data loads: Another right tool recommendation. While a JDBC or ODBC interface may be acceptable to load a few megabytes of data, these interfaces will not scale to the massive throughput of COPY and SNOWPIPE. Use them by all means, but not for large regular data loads.
- Avoid Scanning Files: When using the COPY command to ingest data, use partitioned staged data files described as step 1 in the Top 3 Snowflake Performance Tuning Tactics. This reduces the effort of scanning large numbers of data files in cloud storage.
- Choose a sensible Virtual Warehouse size: Another tip from the article Top Snowflake Query Optimization tactics: don’t assume a 6X-LARGE virtual warehouse will load massive data files any faster than an X-SMALL. Each physical file is loaded sequentially, and it pays to follow the Snowflake file sizing recommendations: either split multi-gigabyte files into chunks of 100–250MB or load multiple data files in parallel.
- Ensure 3rd party tools push down: ETL tools like Ab Initio, Talend and Informatica were originally designed to extract data from source systems into an ETL server, transform the data and write them to the warehouse. As Snowflake can draw upon massive on-demand compute resources and automatically scale out, it makes no sense to use and have data copied to an external server. Instead, use the ELT (Extract, Load and Transform) method, and ensure the tools generate and execute SQL statements on Snowflake to maximise throughput and reduce costs. Two excellent examples of ELT tools include DBT and Matillion.
- Transform data in Steps: A common mistake by inexperienced developers is to write huge SQL statements that join, summarize and process lots of tables in the mistaken belief this is an efficient way of working. In reality, the code becomes over-complex and difficult to maintain and, worst still, performs poorly. Instead, break the transformation pipeline into multiple steps and write results to intermediate tables. This makes it easier to test intermediate results, simplifies the code and often produces simple SQL code that runs faster.
- Use Transient Tables for intermediate results: During a complex ELT pipeline, write intermediate results to a transient table that may be truncated before the next load. This reduces the time-travel storage to just one day and avoids an additional 7 days of fail-safe storage. By all means, use temporary tables if sensible, but it’s often helpful to check the results of intermediate steps in a complex ELT pipeline.
- Avoid row-by-row processing: Modern analytics platforms like Snowflake are designed to ingest, process and analyse billions of rows at amazing speed using simple SQL statements which act upon the data set at a time. However, people tend to think in terms of row-by-row processing, and this sometimes leads to programming loops which fetch and update rows one at a time. Be aware that row-by-row processing is by far the single biggest way of killing query performance. Use SQL statements to process all table entries simultaneously and avoid row-by-row processing at all costs.
- Use Query Tags: When you start any multi-step transformation task, set the session query tag using ALTER SESSION SET QUERY_TAG = 'XXXXXX' and clear it afterwards with ALTER SESSION UNSET QUERY_TAG. This stamps every SQL statement until reset with an identifier and is invaluable to System Administrators. As every SQL statement (and QUERY_TAG) is recorded in the QUERY_HISTORY view, you can then track the job performance over time. This can quickly identify when a task change has resulted in poor performance, identify inefficient transformation jobs or indicate when a job would be better executed on a larger or smaller warehouse.
- Keep it Simple: The best indicator of an experienced developer is the value they place upon simplicity. You can always make a job 10% faster, generic, or more elegant, and it may be beneficial, but it’s always helpful to simplify a solution. Simple solutions are easier to understand, easier to diagnose problems and are therefore easier to maintain. Around 50% of the Snowflake challenges I face are difficult to resolve because the solution is a single, monolithic complex block of code. The first thing I do is break down the solution into steps and only then identify the root cause.
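The query-tag pattern from the list above, sketched end to end (the tag value and job name are hypothetical):

```sql
ALTER SESSION SET QUERY_TAG = 'nightly_sales_etl';
-- ... run the transformation steps here ...
ALTER SESSION UNSET QUERY_TAG;

-- Later, track the job's performance over time:
SELECT DATE_TRUNC('day', start_time)   AS run_day,
       SUM(total_elapsed_time) / 1000  AS total_seconds
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE query_tag = 'nightly_sales_etl'
GROUP BY run_day
ORDER BY run_day;
```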
What are Snowflake Transformation and ETL?
ETL (Extract, Transform and Load) and ELT (Extract, Load and Transform) are often used interchangeably as shorthand for data engineering. For the purposes of this article, data engineering is the process of transforming raw data into useful information to facilitate data-driven business decisions. The main steps include loading the raw data, then cleaning, restructuring and enriching it by combining additional attributes, and finally preparing it for consumption by end users.
ETL and Transformation tend to be used interchangeably, although the transformation task is a subset of the overall ETL pipeline.
Snowflake ETL in Practice
The diagram below illustrates the Snowflake ETL data flow to build complex data engineering pipelines. There are several components, and you may not use all of them on your project, but they are based on my experience with Snowflake customers over the past five years.
The diagram above shows the data sources, which may include:
- Data Lakes: Some Snowflake customers already have an existing cloud-based Data Lake, which acts as an enterprise-wide store of raw historical data that feeds both the data warehouse and machine learning initiatives. Typically, data is stored in S3, Azure or GCP cloud storage in CSV, JSON or Parquet format.
- On-Premises Databases: These include both operational databases, which generate data, and existing on-premise data warehouses, which are in the process of being migrated to Snowflake. These can (for example) include billing systems and ERP systems used to manage business operations.
- Streaming Sources: Unlike on-premises databases, where the data is relatively static, streaming data sources constantly feed in new data. This can include data from Internet of Things (IoT) devices, Web Logs, and Social Media sources.
- SaaS and Data Applications: This includes existing Software as a Service (SaaS) systems, for example, ServiceNow and Salesforce, which have Snowflake connectors and other cloud-based applications.
- Data Files: Include data provided from either cloud or on-premises systems in various file formats, including CSV, JSON, Parquet, Avro and ORC, which Snowflake can store and query natively.
- Data Sharing: Refers to the ability for Snowflake to expose read-only access to data on other Snowflake accounts seamlessly. The Snowflake Data Exchange or Marketplace provides instant access to data across all major cloud platforms (Google, AWS or Microsoft) and global regions. This can enrich existing information with additional externally sourced data without physically copying the data.
In common with all analytics platforms, the data engineering phases include:
- Data Staging: This involves capturing and storing raw data files on cloud storage, including Amazon S3, Azure Blob or GCP storage.
- Data Loading: This involves loading the data into a Snowflake table, which can be cleaned and transformed. It’s good practice to initially load data to a transient table, balancing the need for speed, resilience, simplicity, and reduced storage cost.
- Raw History: Unless the data is sourced from a raw data lake, retaining the raw data history is good practice to support machine learning and data re-processing as needed.
- Data Integration: This is cleaning and enriching data with additional attributes and restructuring and integrating the data. It’s usually good practice to use temporary or transient tables to store intermediate results during the transformation process, with the final results stored in Snowflake permanent tables.
- Data Presentation and Consumption: Whereas the Data Integration area may hold data in the 3rd Normal Form or Data Vault, storing data ready for consumption in a Kimball Dimensional Design or denormalised tables as needed usually is good practice. This area can also include a layer of views acting as a semantic layer to insulate users from the underlying table design.
- Data Governance, Security and Monitoring: Refers to managing access to the data, including Role Based Access Control and handling sensitive data using Dynamic Data Masking and Row Level Security. This also supports monitoring Snowflake usage and cost to ensure the platform operates efficiently.
Finally, the data consumers can include dashboards and ad-hoc analysis, real-time processing and Machine Learning, business intelligence or data sharing.
Snowflake Data Loading Options
The diagram below illustrates the options available to stage and load data into a landing table – the first step in the Snowflake data engineering pipeline.
As the diagram above shows, Snowflake supports a wide range of use cases, including:
- Data File Loading: This is the most common and highly efficient data loading method in Snowflake. It involves using SnowSQL to execute SQL commands that rapidly load data into a landing table. Using this technique, it’s possible to load terabytes of data quickly, on a batch or micro-batch basis. Once the data files are held in a cloud stage (e.g., S3 buckets), the COPY command can load the data into Snowflake. This is the most common method for large-volume batch data loading, and it’s usually good practice to size data files at around 100–250MB of compressed data, splitting up huge data files appropriately.
- Replication from On-Premises Databases: Snowflake supports a range of data replication and ETL tools, including HVR, Stitch, Fivetran and Qlik Replicate, which seamlessly replicate changes from operational or legacy warehouse systems with zero impact upon the source system. Equally, a vast range of data integration tools include Snowflake support and direct replication from other database platforms, which can be used to extract and load data. Some customers also write their own data extraction routines and use the Data File Loading and COPY technique described above.
- Data Streaming: Options to stream data into Snowflake include using the Snowflake Kafka Connector to ingest data automatically and directly from a Kafka topic. Unlike the COPY command, which needs a virtual warehouse, Snowpipe is an entirely serverless process: Snowflake manages the operation completely, scaling out the compute resources as needed. Equally, the option exists to trigger Snowpipe to load data files automatically when they arrive on cloud storage.
- Inserts using JDBC and ODBC: Although not the most efficient way to bulk load data into Snowflake (using COPY or Snowpipe is always faster and more efficient), the Snowflake JDBC and ODBC connectors are available in addition to a range of Connectors and Drivers, including Python, Node.js and Go.
- Ingestion from a Data Lake: While Snowflake can host a Data Lake, customers with an existing investment in a cloud data lake can use Snowflake External Tables to provide a transparent interface to data in the lake. From a Snowflake perspective, the data appears to be held in a read-only table, but the data is transparently read from the underlying files on cloud storage.
- Data Sharing: The Data Exchange provides a seamless way to share data globally for customers with multiple Snowflake deployments. Using the underlying Snowflake Data Sharing technology, customers can query and join data in real time from various sources without copying. Existing in-house data can also be enriched with additional attributes from externally sourced data using the Snowflake Data Marketplace.
You may notice a consistent design pattern in the above scenarios as follows:
1. Stage the data: Loading data files to a Snowflake file stage.
2. Load the data: Into a Snowflake table.
Once the data has landed in Snowflake, it can be transformed using the techniques described below. In some cases, Snowflake has made the entire process appear seamless, for example, using the Kafka Connector or Snowpipe, but the underlying design pattern is the same.
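The two-step pattern above can be sketched as follows, using a VARIANT landing column to retain raw history (stage, table and file names are all hypothetical):

```sql
-- 1. Stage the data: create a stage and upload files to it (PUT runs from SnowSQL).
CREATE STAGE raw_stage FILE_FORMAT = (TYPE = 'JSON');
-- PUT file:///tmp/events_2021_06_01.json @raw_stage/events/2021/06/;

-- 2. Load the data into a transient landing table.
CREATE TRANSIENT TABLE landing_events (
  load_ts TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP(),
  payload VARIANT
);

COPY INTO landing_events (payload)
FROM (SELECT $1 FROM @raw_stage/events/2021/06/)  -- partitioned path avoids scanning all files
FILE_FORMAT = (TYPE = 'JSON');
```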
Snowflake Transformation Options
Having ingested the data into a Snowflake landing table, several tools are available to clean, enrich and transform the data, illustrated in the diagram below.
The options available to transform data include the following:
- Using ETL Tools: This often has the advantage of leveraging the existing skill set within the data engineering team, and Snowflake supports a wide range of data integration tools. It is best practice when using these to ensure they follow the pushdown principle, in which the ETL tool executes SQL, which is pushed down to Snowflake to gain maximum benefit from the scale-up architecture.
- Incremental Views: A pattern commonly found on systems migrated from Teradata and uses a series of views built upon views to build a real-time transformation pipeline. It is good practice to break complex views into smaller steps and write intermediate results to transient tables, as this makes it easier to test and debug and, in many cases, can lead to significant performance improvements.
- Streams & Tasks: Snowflake Streams provide a compelling but straightforward way of implementing simple change data capture (CDC) within Snowflake. It is good practice to combine Streams and Snowflake Tasks on data acquired for near real-time processing. Effectively, the Stream keeps a pointer in the data to record the already processed results, and the Task provides scheduling to periodically transform the newly arrived data. Previously, it was necessary to allocate a suitably sized virtual warehouse to execute the task. Still, the recent release of the Serverless Compute option further simplifies the task and means Snowflake automatically manages the compute resources, scaling up or out as needed.
- Spark and Java on Snowflake: Using the recently released Snowpark API, Data Engineers and Data Scientists who would previously load data into a Databricks Cluster to execute SparkSQL jobs can now develop using Visual Studio, IntelliJ, SBT, Scala and Jupyter notebooks with Spark DataFrames automatically translated and directed as Snowflake SQL. Combined with the ability to execute Java UDFs, this provides powerful options to transform data in Snowflake using your preferred development environment without the additional cost and complexity of supporting external clusters.
The data transformation patterns described above show the most common methods; however, each Snowflake component can be combined seamlessly. For example, although Streams and Tasks are commonly used together, they can each be used independently to build a bespoke transformation pipeline and combined with materialised views to deliver highly scalable and performant solutions.
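A minimal Streams-and-Tasks sketch (object names, schedule and columns are hypothetical):

```sql
-- The stream records which rows in the landing table are not yet processed.
CREATE STREAM orders_stream ON TABLE landing.orders_raw;

-- The task wakes periodically and transforms only newly arrived rows.
CREATE TASK transform_orders
  WAREHOUSE = etl_wh
  SCHEDULE = '5 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('ORDERS_STREAM')
AS
  INSERT INTO staging.orders_clean
  SELECT order_id, customer_id, order_date, amount
  FROM orders_stream
  WHERE METADATA$ACTION = 'INSERT';

-- Tasks are created suspended; resume to start the schedule.
ALTER TASK transform_orders RESUME;
```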
“If the only tool you have is a hammer, you tend to see every problem as a nail”. Abraham Maslow.
Conclusion: Snowflake ETL and Data Engineering Best Practices
The article above summarises several options and highlights some of the best data engineering practices on Snowflake. The critical lessons learned from working with Snowflake customers include the following:
- Follow the standard ingestion pattern: This involves the multi-stage process of staging the data files in cloud storage and loading files to a Snowflake transient table before transforming the data and storing it in a permanent table. Breaking the overall strategy into independent stages makes it easier to orchestrate and test.
- Retain history of raw data: Unless your data is sourced from a raw data lake, it makes sense to keep the raw data history, which should ideally be stored using the VARIANT data type to benefit from automatic schema evolution. This means you can truncate and re-process data if bugs are found in the transformation pipeline, and it provides an excellent raw data source for Data Scientists. While you may not yet have any machine learning requirements, it’s almost certain you will, if not now, then in the coming years. Remember that Snowflake data storage is remarkably cheap, unlike on-premises solutions.
- Use multiple data models: On-premises data storage was so expensive that it was not feasible to store numerous copies of data, each using a different data model to match the need. With Snowflake, however, it makes sense to store raw data history in either structured or VARIANT format, cleaned and conformed data in 3rd Normal Form or a Data Vault model, and finally data ready for consumption in a Kimball Dimensional model. Each data model has unique benefits, and storing the results of intermediate steps has huge architectural benefits, not least the ability to reload and reprocess the data in case of mistakes.
- Use the right tool: As the quote above implies, if you are only aware of one ELT tool or method, you’ll almost certainly misuse it at some point. The decision about the ELT tool should be based upon a range of factors, including the existing skill set in the team, whether you need rapid near real-time delivery, and whether you’re doing a once-off data load or regular repeating loads. Be aware that Snowflake can natively handle a range of file formats, including Avro, Parquet, ORC, JSON and CSV. There is extensive guidance on loading data into Snowflake on the online documentation.
- Use COPY or SNOWPIPE to load data: Around 80% of data loaded into a data warehouse is either ingested using a regular batch process or, increasingly, immediately in near real-time after the data files arrive. By far, the fastest, most cost-efficient way to load data is using COPY and SNOWPIPE, so avoid the temptation to use other methods (for example, queries against external tables) for regular data loads. Effectively, this is another example of using the right tool.
- Avoid JDBC or ODBC for regular large data loads: Another right tool recommendation. While a JDBC or ODBC interface may be acceptable to load a few megabytes of data, these interfaces will not scale to the massive throughput of COPY and SNOWPIPE. Use them by all means, but not for sizeable regular data loads.
- Avoid Scanning Files: Using the COPY command to ingest data, use partitioned staged data files. This reduces the effort of scanning large numbers of data files in cloud storage.
- Choose a suitable Virtual Warehouse size: Don’t assume a 6X-LARGE warehouse will load huge data files faster than an X-SMALL. Each physical file is loaded sequentially on a single CPU, so most loads run perfectly well on an X-SMALL warehouse. Consider splitting massive data files into 100–250MB chunks and loading them in parallel on a larger (perhaps MEDIUM size) warehouse.
- Ensure 3rd party tools push down: ETL tools like Ab Initio, Talend and Informatica were initially designed to extract data from source systems into an ETL server, transform the data and write them to the warehouse. As Snowflake can draw upon massive on-demand compute resources and automatically scale out, it makes no sense to use and have data copied to an external server. Instead, use the ELT (Extract, Load and Transform) method, and ensure the tools generate and execute SQL statements on Snowflake to maximise throughput and reduce costs. Excellent examples include DBT and Matillion.
- Transform data in Steps: A common mistake by inexperienced data engineers is to write massive SQL statements that join, summarise and process lots of tables in the mistaken belief this is an efficient way of working. In reality, the code becomes over-complex, challenging to maintain, and, worst still, often performs poorly. Instead, break the transformation ETL pipeline into multiple steps and write results to intermediate transient tables. This makes it easier to test intermediate results, simplifies the code and often produces simple SQL code that will almost certainly run faster.
- Use Transient tables for intermediate results: During a complex ETL pipeline, write intermediate results to a transient table which may be truncated before the next load. This reduces the time-travel storage to just one day and avoids an additional seven days of fail-safe storage. By all means, use temporary tables if sensible, but the option to check the results of intermediate steps in a complex ELT pipeline is often helpful.
- Avoid row-by-row processing: As described in the article on Snowflake Query Tuning, Snowflake is designed to ingest, process and analyse billions of rows at fantastic speed. This is often referred to as set-at-a-time processing. However, people tend to think about row-by-row processing, which sometimes leads to programming loops that fetch and update rows one at a time. Remember that row-by-row processing is the most significant way to kill query performance. Use SQL statements to process all table entries simultaneously and avoid row-by-row processing at all costs.
- Use Query Tags: When you start any multi-step transformation task, set the session query tag using ALTER SESSION SET QUERY_TAG = 'XXXXXX', and at the end, ALTER SESSION UNSET QUERY_TAG. This stamps every SQL statement with an identifier until the tag is reset, which is invaluable to system administrators. As every SQL statement (and its QUERY_TAG) is recorded in the QUERY_HISTORY view, you can track job performance over time. This makes it easy to spot when a task change has resulted in poor performance, to identify inefficient transformation jobs, or to see when a job would be better executed on a larger or smaller warehouse.
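A minimal sketch of the pattern (the tag value and column selection are illustrative):

```sql
ALTER SESSION SET QUERY_TAG = 'nightly_sales_load';

-- ... run the multi-step transformation here ...

ALTER SESSION UNSET QUERY_TAG;

-- Later, track the job's performance over time from the history:
SELECT query_text, warehouse_size, total_elapsed_time, start_time
FROM   snowflake.account_usage.query_history
WHERE  query_tag = 'nightly_sales_load'
ORDER BY start_time DESC;
```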
- Keep it Simple: Probably the best indicator of an experienced data engineer is the value they place on simplicity. You can always make a job 10% faster, more generic, or more elegant, and it may be beneficial, but it always helps to simplify a solution. Simple solutions are easier to understand, easier to diagnose, and therefore easier to maintain. Around 50% of the performance challenges I face are difficult to resolve because the solution is a single, monolithic, complex block of code. To resolve these, I first break the key components down into smaller steps and only then identify the root cause.
1. Transform your data incrementally:
A common mistake novice data engineers make is writing huge SQL statements that join, aggregate, and process many tables in the mistaken belief that this is an efficient way to work. In practice, the code becomes overly complex, difficult to maintain, and, worse still, often performs poorly. Instead, split the transformation pipeline into multiple steps and write the results to intermediate tables. This makes it easier to test intermediate results, simplifies the code, and often produces simpler SQL that runs faster.
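For instance, rather than one giant statement, the pipeline might be broken into named steps (the table names here are hypothetical):

```sql
-- Step 1: restrict and cleanse the raw data.
CREATE OR REPLACE TRANSIENT TABLE tmp_orders_clean AS
SELECT order_id, customer_id, order_total,
       TRY_TO_DATE(order_date) AS order_date
FROM   raw_orders
WHERE  order_id IS NOT NULL;

-- Step 2: join and aggregate the cleansed result.
-- Each step can be tested independently before the next runs.
CREATE OR REPLACE TRANSIENT TABLE tmp_orders_by_region AS
SELECT c.region, SUM(o.order_total) AS total_sales
FROM   tmp_orders_clean o
JOIN   customers c ON c.customer_id = o.customer_id
GROUP BY c.region;
```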
2. Load data using COPY or SNOWPIPE:
Approximately 80% of data loaded into a data warehouse is ingested via regular batch processes, or increasingly as soon as the data files arrive. Using COPY and SNOWPIPE is the fastest and cheapest way to load data, so resist the temptation to periodically load data using other methods (such as querying external tables). This is another example of using the right tool for the job.
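A minimal COPY example, assuming a stage named @sales_stage and a target table sales_raw already exist (both names are illustrative):

```sql
-- Bulk-load staged CSV files using a customer-managed warehouse.
COPY INTO sales_raw
FROM @sales_stage/2023/06/
FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
ON_ERROR = 'CONTINUE';   -- skip bad records rather than failing the whole file
```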
3. Use multiple data models:
On-premises data storage was once so costly that it was not feasible to store multiple copies of the data, each using a different data model to meet different requirements. With Snowflake, however, it makes sense to store the raw data history in structured or VARIANT format, clean and conform the data using third normal form or a Data Vault model, and store the final consumable data in a Kimball dimensional data model. Each data model has its own advantages, and storing the results of intermediate steps has significant architectural benefits, most importantly the ability to reload and reprocess the data in the event of an error.
4. Choose a required Virtual Warehouse size:
Another tip from the Top 3 Snowflake Performance Tuning Tactics: don't assume a 6X-LARGE virtual warehouse will load massive data files any faster than an X-SMALL. Each physical file is loaded sequentially, so it pays to follow the Snowflake file sizing recommendations and either split multi-gigabyte files into chunks of 100–250 MB or load multiple data files in parallel.
5. Keep raw data history:
Unless the data comes from a raw data lake, it makes sense to keep the raw data history. Ideally this is stored in a VARIANT data type to benefit from automatic schema evolution. It means downstream tables can be truncated and the data reprocessed if errors are found in the transformation pipeline, and it provides data scientists with a great source of raw data. Even if you have no machine learning requirement today, you almost certainly will within the next few years.
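A sketch of a raw-history table using VARIANT (table, stage, and attribute names are invented):

```sql
-- A single VARIANT column keeps the full raw payload, so new attributes
-- in the source simply appear in the data with no DDL change.
CREATE TABLE raw_events (payload VARIANT);

COPY INTO raw_events
FROM @events_stage
FILE_FORMAT = (TYPE = JSON);

-- Downstream models (and data scientists) can query the raw history directly:
SELECT payload:event_type::STRING AS event_type,
       payload:user.id::NUMBER    AS user_id
FROM   raw_events;
```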
6. Do not use JDBC or ODBC for normal large data loads:
Another recommendation about choosing suitable tools. JDBC and ODBC interfaces may be fine for loading a few megabytes of data, but they cannot match the massive throughput of COPY and SNOWPIPE. Use them where appropriate, but not for regular large data loads.
7. Avoid scanning files:
When ingesting data using the COPY command, use partitioned staging data files as described in Step 1 of Snowflake Top 3 Performance Optimization Tactics. This reduces the effort of scanning large numbers of data files in cloud storage.
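For example (stage and paths are hypothetical), pointing COPY at the narrowest prefix avoids listing the whole bucket:

```sql
-- Good: list and scan only one day's partition.
COPY INTO sales
FROM @landing_stage/sales/2023/06/15/
FILE_FORMAT = (TYPE = CSV);

-- A PATTERN can narrow the file set further, but prefer an explicit
-- path first; regex matching still requires listing the files under it.
COPY INTO sales
FROM @landing_stage/sales/2023/06/15/
PATTERN = '.*[.]csv[.]gz'
FILE_FORMAT = (TYPE = CSV);
```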
8. Use the Tool according to requirement:
As the quote above suggests, if you only know one tool, it will be used improperly. The right choice depends on several factors, such as the skills available on your team, whether you need fast, near-real-time delivery, and whether you are performing a one-time data load or a process that repeats on a regular basis. Note that Snowflake can natively handle a variety of file formats, including Avro, Parquet, ORC, JSON, and CSV. See the online documentation for detailed instructions on loading data into Snowflake.
9. Ensure 3rd party tools push down:
ETL tools like Ab Initio, Talend and Informatica were originally designed to extract data from source systems into an ETL server, transform the data and write them to the warehouse. As Snowflake can draw upon massive on-demand compute resources and automatically scale out, it makes no sense to have data copied to an external server. Instead, use the ELT (Extract, Load and Transform) method, and ensure the tools generate and execute SQL statements on Snowflake to maximize throughput and reduce costs.
10. Using Query Tag:
When starting a multi-step transformation task, set the session query tag using ALTER SESSION SET QUERY_TAG = 'XXXXXX' and, at the end, ALTER SESSION UNSET QUERY_TAG. This stamps an identifier on each SQL statement until the tag is unset, which is very helpful for system administrators. Each SQL statement (and its QUERY_TAG) is logged in the QUERY_HISTORY view, allowing you to track job performance over time. This lets you quickly see when a task change has degraded performance, identify inefficient transformation jobs, or spot when a job would run better on a larger or smaller warehouse.
11. Use Transient Tables for Intermediate Results:
During complex ELT pipelines, write intermediate results to transient tables that may be truncated before the next load. This cuts Time Travel storage down to just one day and avoids an additional seven days of Fail-safe storage. By all means use temporary tables where sensible, but transient tables make it possible to validate the results of intermediate steps in a complex ELT pipeline, which is often useful.
12. Avoid row-by-row processing:
Modern analytics platforms such as Snowflake are designed to ingest, process, and analyze billions of rows at incredible speed using simple, set-based SQL statements. However, people tend to think in terms of row-by-row processing, and this can lead to programming loops that fetch and update one row at a time. Note that row-by-row processing is the surest way to kill query performance. Use SQL statements to process all table entries at once and avoid row-by-row processing at all costs.
13. Follow standard ingestion patterns:
This involves a multi-step process of storing data files in cloud storage and loading them into storage tables before transforming the data. Breaking down the entire process into defined steps makes it easier to orchestrate and test.
In this blog series, we will explore the ingestion options and the best practices for each. Let's start with the simplest method of data ingestion, the INSERT command. This is a suitable approach for bringing in small amounts of data, but it has limitations for larger data sets exceeding the single-digit MB range, particularly around ease of use, scalability, and error handling. For larger data sets, data engineers typically have the option to use a variety of ETL/ELT tools to ingest data, or preferably to use object storage as an intermediate step alongside COPY INTO or Snowpipe. Customers who need to stream data and use popular Kafka-based applications can use the Snowflake Kafka Connector to ingest Kafka topics into Snowflake tables via a managed Snowpipe. Snowflake also allows data engineers to query data stored in external stages using the External Table options. An in-depth description of all the available ingestion options, fitting a variety of use cases, is a topic for follow-up blog posts.
This blog post will cover the two most widely used and recommended file based data ingestion approaches: COPY INTO and Snowpipe. We will outline the similarities and differences between both and recommend best practices informed by the experience of over 5,000 customers loading data to the Snowflake Data Cloud.
COPY INTO vs. Snowpipe
The COPY command enables loading batches of data available in external cloud storage or an internal stage within Snowflake. This command uses a predefined, customer-managed virtual warehouse to read the data from the remote storage, optionally transform its structure, and write it to native Snowflake tables.
These on-the-fly transformations may include:
- Column reordering
- Column omission
- Text truncation
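These on-the-fly transformations are expressed as a SELECT over the staged files; a hedged sketch (stage, table, and column positions are invented):

```sql
-- Reorder columns ($3 before $2), omit unused columns, and truncate text.
COPY INTO customers (id, city, name)
FROM (
    SELECT $1,
           $3,
           SUBSTR($2, 1, 100)   -- truncate overly long names
    FROM @customer_stage
)
FILE_FORMAT = (TYPE = CSV);
```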
COPY fits nicely in an existing infrastructure where one or more warehouses are managed for size and suspension/resumption to achieve peak price to performance of various workloads, such as SELECT queries or data transformations. In the absence of such an existing infrastructure managing warehouses, Snowpipe should be considered for simplicity and convenience.
COPY provides file-level transaction granularity as partial data from a file will not be loaded by default ON_ERROR semantics. Snowpipe does not give such an assurance as Snowpipe may commit a file in micro-batch chunks for improved latency and availability of data. When you are loading data continuously, a file is just a chunking factor and is not seen as a transaction boundary determinant.
Snowpipe is designed for continuous ingestion and is built on COPY, though there are some differences in detailed semantics listed in the documentation. In contrast to COPY, Snowpipe runs as a serverless service, meaning there are no virtual warehouses to manage; Snowflake-managed resources automatically scale to changing workloads instead. This frees you from the burden of managing your warehouse, scaling it for a variable load, and optimizing it for the best cost-performance balance. It also shifts the burden of monitoring file loading from the customer to Snowflake.
Considerations for COPY
The COPY command relies on a customer-managed warehouse, so there are some factors to consider when choosing the appropriate warehouse size. The most critical aspect is the degree of parallelism, as each thread can ingest a single file at a time. An XS warehouse provides eight threads, and each increment in warehouse size doubles the number of available threads. The simplified conclusion is that for a significantly large number of files, you would expect optimal parallelism at each warehouse size, meaning the time to ingest a large batch of files halves with every upsize step. However, this speedup can be limited by factors such as networking or I/O delays in real-life scenarios. These factors should be considered for larger ingestion jobs and might require individual benchmarking during the planning phase.
Snowflake's unique ability to isolate warehouses, size them individually, and bill per second makes it fast, efficient, and fault-tolerant to ingest separate tables in parallel. The number, size, and format of files can vary between table sources when loading multiple tables. In that case, we suggest choosing the appropriate warehouse size according to the considerations above for each table's data characteristics.
Snowpipe API vs. auto-ingest Snowpipe
The Snowflake-provided serverless resources can operate for all operation types: batched, micro-batch, and continuous.
For most use cases, especially for incremental updating of data in Snowflake, auto-ingesting Snowpipe is the preferred approach. This approach continuously loads new data to the target table by reacting to newly created files in the source bucket.
Snowpipe relies on the cloud vendor-specific system for event distribution, such as AWS SQS or SNS, Azure Event Grid, or GCP Pub/Sub. This setup requires corresponding privileges to the cloud account to deliver event notifications from the source bucket to Snowpipe.
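A typical auto-ingest pipe definition looks like this (names are illustrative; the cloud-side notification wiring is configured separately):

```sql
CREATE PIPE sales_pipe
  AUTO_INGEST = TRUE
AS
COPY INTO sales_raw
FROM @sales_stage
FILE_FORMAT = (TYPE = CSV);

-- SHOW PIPES exposes the notification_channel used to subscribe the
-- bucket's event notifications (e.g., S3 -> SQS) to this pipe.
SHOW PIPES;
```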
Whenever an event service can not be set up, or an existing data pipeline infrastructure is in place, a REST API-triggered Snowpipe is a suitable alternative. It is also currently the only option if an internal stage is used for storing the raw files. Most commonly, the REST API approach is used by ETL/ELT tools that don’t want to put the burden of creating object storage on the end user and instead use a Snowflake-managed Internal Stage.
Here is an example of how to trigger a Snowpipe API for ingestion: https://medium.com/snowflake/invoking-the-snowpipe-rest-api-from-postman-141070a55337
Checking status of ingestion
COPY INTO is a synchronous process that returns the load status as output, whereas Snowpipe executes file ingestion asynchronously, so the processing status needs to be observed explicitly.
For all Snowpipes as well as COPY INTO, the COPY_HISTORY view or the lower latency COPY_HISTORY function is available. Generally, files that fail to load need to be looked at by someone, so it is often good to check COPY_HISTORY less frequently than the typical file arrival rate. For all Snowpipes, error notifications are also available to publish events to your event handler of choice (AWS SNS, Azure Event Grid, or GCP Pub/Sub).
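For example, recent load results for a table can be checked with the COPY_HISTORY table function (the table name and time window are illustrative):

```sql
SELECT file_name, status, row_count, first_error_message
FROM   TABLE(information_schema.copy_history(
           TABLE_NAME => 'SALES_RAW',
           START_TIME => DATEADD(hour, -24, CURRENT_TIMESTAMP())))
WHERE  status != 'Loaded';   -- surface only files needing attention
```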
Additionally, for API-triggered Snowpipes, you can use the insertReport and loadHistoryScan API endpoints to track the ingestion. While the insertReport endpoint returns events for the last 10 minutes only, a wider time range can be specified with the loadHistoryScan endpoint. In any case, insertReport should be favored over loadHistoryScan, as excessive usage of the latter tends to lead to API throttling.
Initial data loading
The three options introduced in this series can also be used for initial data loading. Like other use cases, each option has its advantages and drawbacks, but in general we suggest starting with the evaluation of auto-ingest first, mainly because of its simplicity and scalability.
Snowpipe auto-ingest relies on new-file notifications. A common situation customers find themselves in is that the required notifications are not available; this is the case for all files already present in the bucket before the notification channel was configured. There are several approaches to solving this. For example, we recently helped a large customer load hundreds of TB of data for an initial load where it made more sense to publish synthetic notifications to the notification channel pointing to the available files.
Where this approach is unfeasible, or especially for smaller buckets or where bucket listing is reasonable, you can trigger ingestion using the REFRESH functionality:
alter pipe <pipe_name> refresh PREFIX = '<path>' MODIFIED_AFTER = <start_time>
This will list the bucket and begin ingesting files that were recently modified up to the last 7 days.
However, it is also possible to run COPY jobs and monitor them manually, as initial loading is typically a time-bound activity. The decision depends on the control vs. automation tradeoff and may be influenced by the cost considerations in the next section. COPY provides greater control, with the attendant responsibility to manage the warehouse and the job duration. Here are two important considerations when using COPY for initial data loading:
- In the case of data skew, COPY jobs that don't have enough files will not utilize the warehouse efficiently. For example, if you are loading fewer than eight files, an XS warehouse will be just as fast as a 2XL. It is therefore important to consider the COPY degree of parallelism mentioned earlier.
- A single COPY job loading millions of files can potentially run for a long time, and the default job timeout is 24 hours. To avoid hitting this limit, you would need to split the overall job into smaller COPY jobs. You can leverage your data path partitioning whenever possible, as it is always more efficient for COPY to list and load data from the explicit path where your data exists rather than traverse the entire bucket.
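As a sketch, splitting by the existing date partitioning keeps each COPY job small and its file listing cheap (the stage and paths are invented):

```sql
-- One long-running job over the whole bucket risks the 24-hour timeout:
-- COPY INTO events FROM @event_stage/;

-- Smaller jobs per partition path list far fewer files and finish quickly:
COPY INTO events FROM @event_stage/2023/06/14/ FILE_FORMAT = (TYPE = CSV);
COPY INTO events FROM @event_stage/2023/06/15/ FILE_FORMAT = (TYPE = CSV);
```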
Recommended file size for Snowpipe and cost considerations
There is a fixed, per-file overhead charge for Snowpipe in addition to the compute processing costs. We recommend files at least above 10 MB on average, with files in the 100 to 250 MB range offering the best cost-to-performance ratio.
At the single-digit-megabyte or lower range of average file size, Snowpipe is typically not the most cost-effective (in credits/TB) option. COPY may provide better cost performance depending on the file arrival rate, the size of warehouse used, and non-COPY use of the Cloud Services layer. So there is no single correct answer below 10 MB, and further analysis is warranted. In general, larger file sizes of at least 100 MB are noticeably more efficient, such that a further increase in file size does not change the credits/TB much. However, we also recommend not exceeding 5 GB in file size, to take advantage of parallelization and error handling capabilities. With a larger file, there is a higher likelihood of an erroneous record near the end of the file causing an ingestion job to fail and, depending on the selected ON_ERROR option, need to be restarted. Snowflake can and does handle much larger files, and customers have successfully loaded files in the TB range.
With these file size recommendations in mind, it should be noted that the per-file charge tends to be a small fraction of the overall cost.
All ingestion methods support the most common file formats out of the box:
- CSV
- JSON
- Avro
- ORC
- Parquet
- XML (currently in public preview)
Additionally, these files can be provided compressed, and Snowflake will decompress them during the ingestion process. Supported compression formats such as GZIP, BZ2, BROTLI, ZSTD, SNAPPY, DEFLATE, and RAW_DEFLATE can be configured explicitly or detected automatically by Snowflake. In general, ingesting compressed files should always be preferred over uncompressed files, as moving data across the network tends to be the limiting factor rather than local decompression.
Due to the large set of combinations, it’s not possible to predict the performance of the ingestion of a specific file format configuration exactly even given a file size. The most significant observed impact on data loading performance is the structure of the file (the number of columns or nested attributes in a single record), not the total file size.
With all that said, our measurements indicated that ingestion of gzipped CSV files (.csv.gz) is not only the most widely used but usually the most performant configuration for ingestion. Beyond that, observed performance follows the order of the file formats listed above, starting with the fastest.
Most of the time the available file format and size is predefined by the source system or already available in object storage data. In this case, it’s not beneficial to split, merge, or re-encode existing files to a different format or size as the gain presumably will not pay back the efforts. The only notable exception for this is the loading of data across cloud regions or cloud vendors, where the transparent compression performed by Snowflake can make a massive difference in egress spending.
How long does it take to ingest and how much does it cost?
Both ingestion time and cost depend on various factors, including:
- Size of the file: the core ingestion time is relative to the content, so costs tend to be roughly proportional to the number of records and the file size, though not an exact correlation.
- Amount of processing required by your COPY statement: some ingestion jobs invoke complex UDFs that take significant time per row and occasionally can even run out of memory if the data size is not correctly anticipated.
- File format, compression, nested structures, and so on all have an impact on how efficiently we can decompress and load the data. An uncompressed file with a large number of columns may take the same amount of time as a compressed file with a small number of columns but highly nested data structures.
Therefore, it is impossible to answer the above question correctly without measuring it for your specific case. Please note that we currently do not have an ingest-service-specific SLA. However, we have seen that on average P95 latency across deployments, customers, and a range of values for the above factors has been about one minute. This is, of course, a statistical measure, and specific performance for a given customer in a given unit of time may vary considerably.
10 best practices
- Consider auto-ingest Snowpipe for continuous loading. See above for cases where it may be better to use COPY or the REST API.
- Consider auto-ingest Snowpipe for initial loading as well. It may be best to use a combination of both COPY and Snowpipe to get your initial data in.
- Use file sizes above 10 MB and preferably in the range of 100 MB to 250 MB; however, Snowflake can support any size file. Keeping files below a few GB is better to simplify error handling and avoid wasted work. This is not a hard limit and you can always use our error handling capabilities such as ON_ERROR = CONTINUE.
- Keep max field size capped at 16 MB. Ingestion is bound by a Snowflake-wide field size limit of 16 MB.
- Keep your data ingestion process simple by utilizing our native features to ingest your data as is, without splitting, merging, or converting files. Snowflake supports ingesting many different data formats and compression methods at any file volume. Features such as schema detection and schema evolution (currently in private preview) can help simplify data loading directly into structured tables.
- Average measurements across customers are not likely to predict latency and cost. Measuring for a sample of your data is a much more reliable approach beyond the indicative numbers.
- Do check file loading success/failure for Snowpipe using COPY_HISTORY or other options, such as subscribing to Snowpipe error notifications. Also occasionally check SYSTEM$PIPE_STATUS for the health of the Snowpipe.
- Do not expect in-order loading with Snowpipe. Files are loaded concurrently and asynchronously in multiple chunks, so reordering is possible. If ordering is essential, use an event timestamp in the data if possible, or use COPY and load sequentially.
- Leverage your existing object path partitioning for COPY when possible. Using the most explicit path allows COPY to list and load your data as quickly as possible. Even though Snowflake can scalably list and load large volumes of data, you can avoid wasted compute and API calls by using path partitioning, especially when you have already loaded previous days', months', and years' data, which would simply be ignored by deduplication.
- Use cloud provider event filtering to reduce notification noise, ingestion latency, and costs from unwanted notifications by filtering on prefix or suffix before events are sent to Snowpipe. Leverage native cloud event filtering before resorting to Snowpipe's more powerful regex pattern filtering.
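The monitoring practices above can be sketched as follows (the pipe and table names are hypothetical):

```sql
-- Spot-check pipe health: returns a JSON document including executionState
-- and pendingFileCount for the named pipe.
SELECT SYSTEM$PIPE_STATUS('sales_pipe');

-- Review recent load outcomes for the pipe's target table:
SELECT file_name, status, first_error_message
FROM   TABLE(information_schema.copy_history(
           TABLE_NAME => 'SALES_RAW',
           START_TIME => DATEADD(hour, -1, CURRENT_TIMESTAMP())));
```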
Thousands of customers and developers are using these best practices to bring in massive amounts of data onto the Snowflake Data Cloud to derive insights and value from that data. No matter which ingestion option you prefer, Snowflake will always be continuously improving its performance and capabilities to support your business requirements for data pipelines. While we focused mainly on file based data ingestion with COPY and Snowpipe here, part 2 of our blog post will go over streaming data ingestion.