

Snowflake Data Warehouse

Reference for Snowflake: separated compute and storage, virtual warehouses, micro-partitions, Snowpark, comparison to BigQuery and Databricks, and cost surprises.


What It Is

Snowflake is a cloud data warehouse, founded in 2012 by Benoit Dageville, Thierry Cruanes, and Marcin Zukowski, and publicly traded on the NYSE since 2020. It runs on AWS, Azure, and GCP as a managed service. The defining architectural choice is the separation of compute from storage: data lives in a columnar, immutable format on cloud object storage; query execution runs on independently sized compute clusters called virtual warehouses.

A virtual warehouse is a sized cluster of cloud VMs (XS, S, M, L, XL, ... up to 6XL). Resizing is online; warehouses can auto-suspend after an idle period and auto-resume on the next query. Multiple warehouses can read the same underlying tables concurrently without contention, which is the standard pattern for separating BI workloads from ETL workloads.
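The standard BI-versus-ETL split described above looks like this in SQL. A minimal sketch; the warehouse names (bi_wh, etl_wh) are illustrative, and the parameters shown are standard CREATE WAREHOUSE options:

```sql
-- Two warehouses reading the same tables: a small one for dashboards,
-- a large one for ETL. Neither contends with the other for compute.
CREATE WAREHOUSE IF NOT EXISTS bi_wh
  WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND   = 60        -- seconds of idle before suspending
  AUTO_RESUME    = TRUE;     -- wake automatically on the next query

CREATE WAREHOUSE IF NOT EXISTS etl_wh
  WAREHOUSE_SIZE = 'LARGE'
  AUTO_SUSPEND   = 60
  AUTO_RESUME    = TRUE;

-- Resizing is an online operation; running queries finish on the old size.
ALTER WAREHOUSE etl_wh SET WAREHOUSE_SIZE = 'XLARGE';
```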

Storage is organized into micro-partitions: contiguous units holding roughly 50-500 MB of uncompressed data, stored in compressed columnar form and immutable once written. Each micro-partition stores per-column min/max metadata used for pruning: a query with WHERE date >= '2025-01-01' skips partitions whose date range falls entirely outside the predicate, which is the dominant performance lever. Clustering keys influence which rows land in which micro-partitions; clustering on a very high-cardinality column wastes the optimization, because maintaining the clustering is expensive and the min/max ranges stop being selective.
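The pruning and clustering behavior above can be sketched as follows. The table and column names (events, event_date) are hypothetical; SYSTEM$CLUSTERING_INFORMATION is the built-in function for inspecting clustering quality:

```sql
-- Pruning: partitions whose event_date min/max range lies entirely
-- before 2025-01-01 are never read.
SELECT COUNT(*)
FROM events
WHERE event_date >= '2025-01-01';

-- Define a clustering key on a commonly filtered, moderate-cardinality column.
ALTER TABLE events CLUSTER BY (event_date);

-- Inspect how well the table is clustered on that key.
SELECT SYSTEM$CLUSTERING_INFORMATION('events', '(event_date)');
```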

Snowflake's SQL dialect is broadly ANSI-compliant with extensions for semi-structured data (VARIANT columns store JSON / Avro / Parquet, queried with path syntax col:field.subfield::INT). Snowpark is the Python / Java / Scala client library that lets users write DataFrame-style code or register Python UDFs and stored procedures that execute inside the warehouse, avoiding data egress.
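A short sketch of the VARIANT path syntax, assuming a hypothetical raw_events table with a JSON payload column; LATERAL FLATTEN is the standard way to explode a nested array:

```sql
-- Pull typed scalars out of a VARIANT column, and unnest an array field.
SELECT
  payload:user.id::INT        AS user_id,      -- path syntax plus cast
  payload:event_type::STRING  AS event_type,
  f.value:sku::STRING         AS sku           -- one row per array element
FROM raw_events,
     LATERAL FLATTEN(input => payload:items) f;
```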

When You'd Use It

Snowflake fits when an organization needs a centralized analytical store, has multiple consuming teams, and wants strict separation between workloads. The compute-storage split makes it easy to give the data-science team a large warehouse for one-hour scoring jobs while BI dashboards run on a small one.

Direct competitors:

  • BigQuery (Google): also storage-compute separated, priced per byte scanned (on-demand) or by slot-based capacity rather than warehouse-time, deeply integrated with the Google ecosystem. Cheaper for very bursty workloads, less predictable for steady ones.
  • Databricks SQL / Lakehouse: Spark-based, stores data in Delta Lake (Parquet plus a transaction log), better for ML and notebook-driven analysis, weaker for high-concurrency BI.
  • Amazon Redshift: AWS-native; its RA3 node type added compute-storage separation. Generally cheaper inside AWS but has a more dated query optimizer.

Anti-patterns: do not use Snowflake as an OLTP database (no row-level locking, every update rewrites micro-partitions). Do not use it as a feature store for low-latency model serving (query latency is seconds, not milliseconds). Do not move data into Snowflake just to run Python on it; if the workload is ML training, keep data on object storage and use a compute platform.

In 2024 Snowflake released Arctic, an open-weight 480B-parameter MoE LLM, and Cortex, a managed inference layer for in-warehouse LLM calls. Cortex is convenient when text data already lives in Snowflake and the alternative would be exporting it; it is not competitive on cost or latency with dedicated inference platforms for greenfield workloads.
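An in-warehouse Cortex call looks like the sketch below. The table and column names are illustrative, and model availability varies by region and over time; SNOWFLAKE.CORTEX.COMPLETE takes a model name and a prompt string:

```sql
-- Summarize text already sitting in the warehouse, without exporting it.
-- 'snowflake-arctic' is one example model; check which models your
-- account and region support.
SELECT
  ticket_id,
  SNOWFLAKE.CORTEX.COMPLETE(
    'snowflake-arctic',
    'Summarize this support ticket in one sentence: ' || ticket_text
  ) AS summary
FROM support_tickets
LIMIT 10;
```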

Notable Gotchas

Watch Out

Auto-suspend defaults are too generous

A warehouse defaults to a 10-minute auto-suspend timeout. A handful of analysts running ad-hoc queries every 5 minutes on a Large warehouse will keep it running 24/7 at full cost. Set auto-suspend to 60 seconds for interactive warehouses; the resume cost is small (a few seconds, no data reload because the result cache and metadata cache survive).
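The fix is one ALTER statement per interactive warehouse; the warehouse name here is illustrative:

```sql
-- Suspend after 60 seconds idle instead of the 10-minute default.
ALTER WAREHOUSE analyst_wh SET AUTO_SUSPEND = 60;

-- Verify the setting (auto_suspend column, in seconds).
SHOW WAREHOUSES LIKE 'ANALYST_WH';
```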

Watch Out

Query result cache is invalidated by harmless changes

Snowflake caches query results for 24 hours, keyed by the exact query text and the state of the underlying tables. Adding a comment, changing whitespace, or running any DDL or DML against an underlying table invalidates the cache. Dashboards that look "free" can become expensive after a schema change. Use RESULT_SCAN(LAST_QUERY_ID()) and warehouse caches deliberately rather than relying on the result cache.
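The RESULT_SCAN pattern mentioned above, as a minimal sketch with a hypothetical sales table:

```sql
-- Run an expensive aggregation once.
SELECT region, SUM(revenue) AS total
FROM sales
GROUP BY region;

-- Post-process the previous result set without re-executing it;
-- RESULT_SCAN must be wrapped in TABLE().
SELECT *
FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()))
WHERE total > 100000;
```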


Last reviewed: April 18, 2026