

Snowflake Data Warehouse

Reference for Snowflake: separated compute and storage, virtual warehouses, micro-partitions, Snowpark, comparison to BigQuery and Databricks, and cost surprises.


What It Is

Snowflake is a cloud data warehouse, founded in 2012 by Benoit Dageville, Thierry Cruanes, and Marcin Zukowski, and publicly traded on the NYSE since 2020. It runs on AWS, Azure, and GCP as a managed service. The defining architectural choice is the separation of compute from storage: data lives in a columnar, immutable format on cloud object storage; query execution runs on independently sized compute clusters called virtual warehouses.

A virtual warehouse is a sized cluster of cloud VMs (XS, S, M, L, XL, ... up to 6XL). Resizing is online; warehouses can auto-suspend after an idle period and auto-resume on the next query. Multiple warehouses can read the same underlying tables concurrently without contention, which is the standard pattern for separating BI workloads from ETL workloads.
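The standard BI-versus-ETL split described above looks like this in SQL. A minimal sketch; the warehouse names (bi_wh, etl_wh) are illustrative, and the parameters shown are standard CREATE WAREHOUSE options:

```sql
-- Two warehouses reading the same tables: a small one for dashboards,
-- a large one for ETL. Neither contends with the other for compute.
CREATE WAREHOUSE IF NOT EXISTS bi_wh
  WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND   = 60        -- seconds of idle before suspending
  AUTO_RESUME    = TRUE;     -- wake automatically on the next query

CREATE WAREHOUSE IF NOT EXISTS etl_wh
  WAREHOUSE_SIZE = 'LARGE'
  AUTO_SUSPEND   = 60
  AUTO_RESUME    = TRUE;

-- Resizing is an online operation; running queries finish on the old size.
ALTER WAREHOUSE etl_wh SET WAREHOUSE_SIZE = 'XLARGE';
```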

Storage is organized into micro-partitions: contiguous units holding roughly 50-500 MB of uncompressed data, stored in compressed columnar form and immutable once written. Each micro-partition stores per-column min/max metadata used for pruning: a query with WHERE date >= '2025-01-01' skips partitions whose date range falls entirely outside the predicate, which is the dominant performance lever. Clustering keys influence which rows land in which micro-partitions; clustering on a very high-cardinality column wastes the optimization, because maintaining the clustering is expensive and the min/max ranges stop being selective.
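The pruning and clustering behavior above can be sketched as follows. The table and column names (events, event_date) are hypothetical; SYSTEM$CLUSTERING_INFORMATION is the built-in function for inspecting clustering quality:

```sql
-- Pruning: partitions whose event_date min/max range lies entirely
-- before 2025-01-01 are never read.
SELECT COUNT(*)
FROM events
WHERE event_date >= '2025-01-01';

-- Define a clustering key on a commonly filtered, moderate-cardinality column.
ALTER TABLE events CLUSTER BY (event_date);

-- Inspect how well the table is clustered on that key.
SELECT SYSTEM$CLUSTERING_INFORMATION('events', '(event_date)');
```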

Snowflake's SQL dialect is broadly ANSI-compliant with extensions for semi-structured data (VARIANT columns store JSON / Avro / Parquet, queried with path syntax col:field.subfield::INT). Snowpark is the Python / Java / Scala client library that lets users write DataFrame-style code or register Python UDFs and stored procedures that execute inside the warehouse, avoiding data egress.
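A short sketch of the VARIANT path syntax, assuming a hypothetical raw_events table with a JSON payload column; LATERAL FLATTEN is the standard way to explode a nested array:

```sql
-- Pull typed scalars out of a VARIANT column, and unnest an array field.
SELECT
  payload:user.id::INT        AS user_id,      -- path syntax plus cast
  payload:event_type::STRING  AS event_type,
  f.value:sku::STRING         AS sku           -- one row per array element
FROM raw_events,
     LATERAL FLATTEN(input => payload:items) f;
```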

When You'd Use It

Snowflake fits when an organization needs a centralized analytical store, has multiple consuming teams, and wants strict separation between workloads. The compute-storage split makes it easy to give the data-science team a large warehouse for one-hour scoring jobs while BI dashboards run on a small one.

Direct competitors:

  • BigQuery (Google): also storage-compute separated, priced per byte scanned (on-demand) or by slot-based capacity rather than warehouse-time, deeply integrated with the Google ecosystem. Cheaper for very bursty workloads, less predictable for steady ones.
  • Databricks SQL / Lakehouse: Spark-based, stores data in Delta Lake (Parquet plus a transaction log), better for ML and notebook-driven analysis, weaker for high-concurrency BI.
  • Amazon Redshift: AWS-native; its RA3 node type added compute-storage separation. Generally cheaper inside AWS but has a more dated query optimizer.

Anti-patterns: do not use Snowflake as an OLTP database (no row-level locking, every update rewrites micro-partitions). Do not use it as a feature store for low-latency model serving (query latency is seconds, not milliseconds). Do not move data into Snowflake just to run Python on it; if the workload is ML training, keep data on object storage and use a compute platform.

In 2024 Snowflake released Arctic, an open-weight 480B-parameter MoE LLM, and Cortex, a managed inference layer for in-warehouse LLM calls. Cortex is convenient when text data already lives in Snowflake and the alternative would be exporting it; it is not competitive on cost or latency with dedicated inference platforms for greenfield workloads.
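An in-warehouse Cortex call looks like the sketch below. The table and column names are illustrative, and model availability varies by region and over time; SNOWFLAKE.CORTEX.COMPLETE takes a model name and a prompt string:

```sql
-- Summarize text already sitting in the warehouse, without exporting it.
-- 'snowflake-arctic' is one example model; check which models your
-- account and region support.
SELECT
  ticket_id,
  SNOWFLAKE.CORTEX.COMPLETE(
    'snowflake-arctic',
    'Summarize this support ticket in one sentence: ' || ticket_text
  ) AS summary
FROM support_tickets
LIMIT 10;
```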

Notable Gotchas

Watch Out

Auto-suspend defaults are too generous

A warehouse defaults to a 10-minute auto-suspend timeout. A handful of analysts running ad-hoc queries every 5 minutes on a Large warehouse will keep it running 24/7 at full cost. Set auto-suspend to 60 seconds for interactive warehouses; the resume cost is small (a few seconds, no data reload because the result cache and metadata cache survive).
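The fix is one ALTER statement per interactive warehouse; the warehouse name here is illustrative:

```sql
-- Suspend after 60 seconds idle instead of the 10-minute default.
ALTER WAREHOUSE analyst_wh SET AUTO_SUSPEND = 60;

-- Verify the setting (auto_suspend column, in seconds).
SHOW WAREHOUSES LIKE 'ANALYST_WH';
```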

Watch Out

Query result cache is invalidated by harmless changes

Snowflake caches query results for 24 hours, keyed by the exact query text and the state of the underlying tables. Adding a comment, changing whitespace, or running any DDL or DML against an underlying table invalidates the cache. Dashboards that look "free" can become expensive after a schema change. Use RESULT_SCAN(LAST_QUERY_ID()) and warehouse caches deliberately rather than relying on the result cache.
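The RESULT_SCAN pattern mentioned above, as a minimal sketch with a hypothetical sales table:

```sql
-- Run an expensive aggregation once.
SELECT region, SUM(revenue) AS total
FROM sales
GROUP BY region;

-- Post-process the previous result set without re-executing it;
-- RESULT_SCAN must be wrapped in TABLE().
SELECT *
FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()))
WHERE total > 100000;
```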


Last reviewed: April 18, 2026