Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Introduction

Welcome to the Dimensional Modelling Docs — a shared, practical source of truth for building dimensional (Kimball-style) data warehouses and marts. It collects the concepts, definitions, and reference models we rely on when designing facts and dimensions, written to be useful both to the analytics engineers on the team and to the LLMs that assist them.

Why this guide

There are many excellent Kimball books and tutorials, but we kept hitting the same gaps. They were:

  • Not written for LLMs — hard to feed to an assistant as ground truth.
  • Light on concrete examples — strong on theory, thin on the exact SQL and edge cases you actually meet in production.

So this guide is deliberately pragmatic: short explanations backed by real examples we have encountered while developing facts and dimensions in production.

Concepts

The Concepts section covers the foundational building blocks of dimensional modelling — the ideas you compose into a model. Start with the Dimension, see how dimensions and facts fit together in a Star Schema, then learn how rows are identified and linked in Kimball Keys Definitions.

Definitions

The Kimball Group website already offers excellent, concise definitions. Here the focus is narrower: collect the terms that come up most often during development, each as a short, self-contained definition with an example where it helps. See Degenerate Dimension or Slowly Changing Dimension for the shape these take.

Reference

The Reference section is a growing catalogue of the facts and dimensions you meet in real-life scenarios. Each page documents one entity — its grain, schema, worked SQL, and common pitfalls — so you have a standard to design and build from (and one an LLM can quote). Standard Cost is the first worked example.

Dimension

Perspective handles these steps automatically. We document them here in detail so that analysts can understand the process and work in sync with Perspective’s lineage generation.

Transformation Diagram

flowchart LR

%% Sources
A[Kimball Standard Library] --> B[Specification]
B --> X[(Warehouse)]
X --> C[Grounding]


%% Subgraph 1 (Source A path)
subgraph G1[Source A Processing]
    E1[Table A - Source A]
    E2[Table B]
    E3[Table C]
    F0["Group <br />(reduce granularity)"]
    F1[Select Columns]
    J1[Join]
    P1[Cast / Prepare]
    R1[Filter]

    E1 --> F0
    E2 --> F0
    E3 --> F0
    F0 --> F1
    F1 --> J1
    J1 --> P1
    P1 --> R1
end

%% Subgraph 2 (Source B path)
subgraph G2[Source B Processing]
    E4[Table D - Source B]
    F01["Group <br />(reduce granularity)"]
    F2[Select Columns]
    J2[Join]
    P2[Cast / Prepare]
    R2[Filter]

    E4 --> F01
    F01 --> F2
    F2 --> J2
    J2 --> P2
    P2 --> R2
end

%% Connect grounding to both paths
C --> E1
C --> E2
C --> E3
C --> E4

%% Merge paths
R1 --> U["Union<br /> (active only when building<br />from multiple sources)"]
R2 --> U

%% Final steps
U --> K[Natural Key Selection]
K --> L[Generate Surrogate Key]
L --> M[Final SQL]

Transformation Steps

Building a dimension typically follows three stages: create source-specific dimensions, unify them into a single consolidated dimension, and add surrogate keys.

1. Create Source Dimensions

When consolidating a dimension from multiple sources, start by creating one dimension table per source. Each source dimension should contain only the attributes coming from that system, cleaned and renamed to use business-friendly column names.

For example, if you are consolidating the sales_territory dimension from SAP and Infor, you should create:

  • sap_sales_territory — contains territory attributes as they come from SAP
  • infor_sales_territory — contains territory attributes as they come from Infor

This keeps each source isolated, making it easier to debug data quality issues back to their origin. It also means changes in one source system don’t ripple into the transformation logic of another.

Each source dimension is built through a series of sub-steps. Think of these as individual CTEs or queries that you chain together — each one does exactly one thing.

1.1 Select

Start by selecting only the columns you need from the raw or staging tables. This is the first filter — strip away everything that doesn’t belong in the dimension.

  • pick the columns that will become attributes in the final dimension
  • rename columns to use business-friendly names at this stage (e.g. VKORGsales_org)
  • drop any operational or system columns that have no analytical value (e.g. _etl_loaded_at, last_modified_by)

Keep it minimal. If you’re unsure whether a column belongs, leave it out — you can always add it later.

1.2 Group

Aggregate rows when the raw data is at a finer grain than the dimension needs. This typically happens when the source table has one row per transaction or event, but the dimension should have one row per entity.

  • use group by on the natural key columns of the dimension
  • apply aggregation functions (max, min, count, sum) only where they make analytical sense
  • be explicit about which value you’re keeping when multiple rows collapse into one — don’t rely on implicit behaviour

Not every dimension needs a grouping step. If the source is already at the right grain (one row per entity), skip this step entirely.

1.3 Prepare

Transform raw values into clean, consistent formats. This step is about data quality, not business logic.

Typical operations:

  • cast — enforce correct data types (e.g. string dates to DATE, numeric codes to INTEGER)
  • trim / upper / lower — normalize text values for consistency
  • coalesce — fill in defaults for nullable columns where a default makes sense
  • case when — map coded values to human-readable labels (e.g. 'A''Active', 'I''Inactive')

Avoid doing too much here. If a transformation requires joining to another table or involves complex business rules, it belongs in a later step.

1.4 Join

Enrich the dimension by joining attributes from related tables. This is where you bring together data that lives across multiple tables within the same source system.

  • use left join by default — you want to keep all dimension records even if the related table has gaps

1.5 Filter

Apply any filters that determine which records should be included in the dimension. This step is intentionally last so that you’re filtering on clean, transformed data.

  • exclude soft-deleted records, test data, or inactive entities that shouldn’t appear in reporting
  • document the filter criteria clearly — future analysts need to understand why certain records are excluded
  • if a filter is business-critical (e.g. “only include customers with at least one order”), call it out with a comment explaining the business rule

The filtering rules should be defined by the user in the business requirements, not inferred by LLM.

2. Unify Source Dimensions

Once the source dimensions are in place, combine them into a single unified dimension. The unified dimension should:

  • union or merge records from all source dimensions into a common schema
  • resolve column naming differences — pick one business name and stick with it
  • handle duplicates across sources by defining a clear deduplication strategy (e.g. prefer SAP over Infor when records overlap)

Following our example, this step produces:

  • unified_sales_territory

Keep the unification logic straightforward. If the mapping between sources is complex, consider adding an intermediate staging step rather than cramming everything into one query.

2.1 Union

Stack all source dimensions into a single consolidated table using UNION ALL. At this stage, the unified dimension does not yet have natural or surrogate keys — it is simply all source records brought together under a common schema.

  • every source dimension must have the same column names and data types before unioning — this is why the renaming and casting in step 1 matters
  • use UNION ALL, not UNION — you want to preserve all rows, including apparent duplicates. Deduplication is a separate, explicit step
  • add a source_system column (e.g. 'SAP', 'Infor') so you can always trace a record back to its origin

Note: if the dimension is built from a single source (no UNION required), do not add the source_system column to the final dimension. It carries no analytical value when there is only one source and would just be noise. The source_system column is only meaningful when multiple sources are unified and traceability between them matters.

2.2 Deduplicate (optional)

If the same entity exists in multiple source systems, you need a strategy to resolve duplicates. This step only applies when sources overlap — if each source contributes entirely distinct records, skip it.

  • define a priority order across sources (e.g. SAP is the master for territory data, Infor fills gaps)
  • use row_number() partitioned by the business key and ordered by source priority, then keep only the first row
  • document the deduplication logic clearly — this is one of the most common sources of confusion in unified dimensions

3. Add Keys to the Unified Dimension

After unification, identify the natural key columns and generate a surrogate key. The final dimension table follows the dim_ naming convention. For our example:

  • dim_sales_territory

3.1 Identify Natural Keys

Determine which columns uniquely identify a record in the unified dimension. Natural keys are the business identifiers that come from the source systems — they are what the business already uses to refer to an entity.

  • a natural key can be a single column (e.g. territory_code) or a composite of multiple columns (e.g. territory_code + source_system)
  • always append _nk to natural key columns (e.g. territory_code_nk, source_system_nk) so they are immediately recognizable in the schema — this mirrors the _sk convention used for surrogate keys
  • (optional) after identifying the natural keys, verify uniqueness: run a group by on the nk_ columns and confirm every group has exactly one row. If duplicates remain, revisit your deduplication logic in step 2.2

Note: the rules for selecting the columns for SCD Type 2 dimensions have yet to be decided.

3.2 Generate Surrogate Key

The surrogate key is always an MD5 hash of the combined natural key — that is, a hash of all natural key columns concatenated together. This makes the key deterministic and reproducible across pipeline runs, which is essential for idempotent loads.

  • concatenate all natural key columns into a single string, using the | separator to avoid collisions between values like ('AB', 'C') and ('A', 'BC')
  • apply the md5 hash function to produce the surrogate key
  • name the surrogate key column <dimension_name>_sk (e.g. sales_territory_sk) so it’s immediately recognizable as a surrogate key in fact table joins

Note on SCD Type 2 dimensions: for slowly changing dimensions of Type 2, hashing only the natural keys will produce duplicate surrogate keys across historical versions of the same entity (e.g. the same territory_code appearing with different territory_name values over time). This is acceptable in Kimball-style modelling — from an analytics perspective, the natural key is all a data analyst needs to reason about the dimension, and duplicate surrogate keys are fine as long as only one row is flagged as active at any point in time. Joins from fact tables should filter on WHERE is_active = TRUE (or equivalent) to pick the current version. There is no need to think in terms of database primary keys here; the analytical model does not require surrogate key uniqueness.

3.3 Select Final Columns

Select the final set of columns for the dimension table. This is where you decide what the consumers of the dimension will see.

  • place the surrogate key as the first column
  • include the natural key columns for traceability back to the source systems
  • include the source_system column from step 2.1
  • include all descriptive attributes from the unified dimension
  • order columns logically — keys first, then identifiers, then descriptive attributes. A well-ordered dimension is easier to scan and query

Star Schema

A star schema is the foundational pattern in dimensional modelling. It organizes data into a central fact table surrounded by dimension tables, forming a star-like shape when visualized. The fact table holds measurable events (e.g. orders, shipments, invoices), while the dimensions provide the context around those events (e.g. customer, product, territory). See Star Schema (Kimball Group) for the original definition.

This structure is optimized for analytical queries. Because every dimension joins directly to the fact table, queries are predictable — analysts always know where to look for metrics (facts) and where to look for filters and groupings (dimensions).

Example Star Schema Diagram

erDiagram
    dim_sales_territory ||--o{ fact_invoice_lines : "sales_territory_sk"
    dim_product_hierarchy ||--o{ fact_invoice_lines : "product_hierarchy_sk"
    dim_industry ||--o{ fact_invoice_lines : "industry_sk"

    fact_invoice_lines {
        string sales_territory_sk FK
        string product_hierarchy_sk FK
        string industry_sk FK
        decimal quantity
        decimal amount
    }
    dim_sales_territory {
        string sales_territory_sk PK
        string territory_code_nk
        string territory_name
        string region
    }
    dim_product_hierarchy {
        string product_hierarchy_sk PK
        string product_code_nk
        string product_name
        string category
    }
    dim_industry {
        string industry_sk PK
        string industry_code_nk
        string industry_name
        string sector
    }

Rules

One fact table per star schema

Each star schema should contain exactly one fact table at one grain. The grain is the business definition of the measurement event that creates a fact record — it should always start at the lowest, most atomic level. If you need to combine metrics from different business processes (e.g. sales and inventory), build separate star schemas rather than merging everything into a single fact table. This keeps each schema focused and avoids grain conflicts. See Four-Step Dimensional Design Process and Keep to the Grain for more on defining grain.

Do not join dimensions to other dimensions

In dimensional modelling, joining a dimension to another dimension is known as an outrigger dimension. While the Kimball methodology technically allows outriggers, they are rarely necessary and they complicate the SQL logic — queries become harder to read, maintain, and optimize. As Kimball notes, outriggers should be used sparingly, and in most cases correlations between dimensions should be demoted to a fact table where both dimensions are represented as separate foreign keys. See also Design Tip #105: Snowflakes, Outriggers, and Bridges.

If you find yourself wanting to join two dimensions together, use a factless fact table instead. A factless fact table captures the relationship between dimensions as a fact at its own grain — it has no numeric measures, only foreign keys to the dimensions involved. This keeps the star schema clean and the joins predictable. See Design Tip #133: Factless Fact Tables for Simplification for practical examples.

Always use left joins, fact on the left

When joining dimensions to the fact table, always use left join with the fact table on the left side. This ensures that every fact record is preserved in the result, even if a matching dimension record is missing.

  • a missing dimension match usually indicates a data quality issue — the left join makes these gaps visible rather than silently dropping rows
  • if you use inner join instead, you risk losing fact records and underreporting metrics without realizing it

Keep aggregate calculations in the reporting layer

The star schema should store atomic, grain-level data. Derived calculations like percentages, ratios, running totals, and year-over-year comparisons belong in the reporting or semantic layer — not in the fact table itself.

  • storing pre-aggregated values in facts makes them inflexible — they can’t be re-sliced by dimensions they weren’t originally grouped by
  • let the reporting tool handle aggregation so that analysts can drill down to the detail when needed
  • the one exception is additive measures (e.g. quantity, amount) — these belong in the fact table because they can be meaningfully summed across any dimension. Semi-additive measures (e.g. balances) can be summed across some dimensions but not all, and non-additive measures (e.g. unit prices, ratios) should never be summed directly

For more on aggregate tables as a performance optimization, see Aggregate Fact Tables (Kimball Group).

References

Kimball Keys Definitions

In the Kimball dimensional modelling approach, keys are the backbone that connect fact tables to their dimensions and that let us track history correctly. This page defines the key types you will encounter and the conventions we use for each.

Quick reference

KeyLives inStable?Meaningful?Purpose
Natural keySource systemYes (in source)YesIdentifies a business entity in the source
Durable / supernatural keyDimensionYes (forever)NoIdentifies an entity across all source changes
Surrogate keyDimensionPer row versionNoPrimary key of a dimension row
Foreign keyFactNoPoints a fact row at a dimension row
Degenerate dimensionFactYesYesOperational identifier with no dimension table

Natural key

The identifier an entity carries in the source system — for example a customer_id from the CRM, an SKU from the product catalogue, or an order number from the OLTP database.

  • Carries business meaning and may be reused or recycled by the source.
  • Can be composite (several columns) and can change format over time.
  • Not used as the primary key of a dimension, because a single natural key can map to many historical versions of a row (see surrogate keys).
customer_id = "CUST-00417"

Durable (supernatural) key

A warehouse-assigned, never-changing identifier for a business entity. While surrogate keys change with every new version of a row, the durable key stays constant for the lifetime of the entity.

  • Use it to group all historical versions of the same entity.
  • Survives source-system migrations and natural-key reformatting.
  • Sometimes called a persistent or supernatural key.
customer_durable_key = 90231   -- one value for CUST-00417 across all versions

Surrogate key

A meaningless, warehouse-generated integer that serves as the primary key of a dimension table. Every row in the dimension — including every historical version of an entity — gets its own surrogate key.

  • Typically a monotonically increasing integer (BIGINT) or an identity column.
  • Carries no business meaning; never expose it to end users as a “real” id.
  • Decouples the warehouse from source-key changes and enables Type 2 history.

Convention: name surrogate keys <entity>_sk, e.g. customer_sk, product_sk, date_sk.

customer_sk = 1048576   -- primary key of one specific version of a customer row

Why a surrogate key instead of the natural key?

  1. Slowly changing dimensions (SCD Type 2). When an attribute changes we add a new row with a new surrogate key, preserving the old version.
  2. Performance. Single-column integer joins are faster and smaller than wide or composite natural keys.
  3. Insulation. Source-system key changes don’t ripple into facts.
  4. Late-arriving / unknown members. Reserved surrogate values can represent “Unknown” or “Not applicable” rows.

Foreign key

The column in a fact table that stores a dimension’s surrogate key, forming the join between the fact and that dimension.

  • One foreign key per dimension the fact relates to.
  • Always points at a surrogate key, never at a natural key.
  • Should be enforced (logically, at minimum) so every fact row resolves to a valid dimension row — including the special “Unknown” member.
fact_sales.customer_sk  -->  dim_customer.customer_sk

Degenerate dimension

A dimension key that lives in the fact table itself because it has no interesting attributes of its own and therefore no separate dimension table. Classic examples: invoice number, order number, transaction id.

  • Useful for grouping the line items of a single operational document.
  • Stored as a column on the fact, often suffixed _id or _number.
fact_invoice_lines.invoice_number = "INV-2026-008812"

Special dimension members

Reserve a handful of surrogate key values for rows that don’t map to real source records, so that fact foreign keys never have to be NULL:

Surrogate keyMember meaning
-1Unknown
-2Not applicable
-3Missing / not yet arrived

Putting it together

erDiagram
    dim_customer ||--o{ fact_sales : "customer_sk"
    dim_customer {
        bigint customer_sk PK "surrogate key"
        bigint customer_durable_key "durable key"
        string customer_id "natural key"
        string customer_name "attribute"
        date valid_from_to "SCD2 validity + is_current"
    }
    fact_sales {
        bigint customer_sk FK "to dim_customer"
        bigint product_sk FK "to dim_product"
        bigint date_sk FK "to dim_date"
        string order_number "degenerate dimension"
        decimal quantity_amount "additive facts"
    }

A fact row joins to a specific version of a customer via customer_sk. To analyse an entity across all its versions, group on customer_durable_key. To trace a record back to the source, use the customer_id natural key.

Standard Cost

Type: Dimension (per-unit cost rate) · Primary home: dim_standard_cost (SCD Type 2) · Also surfaced on: dim_product (current value only, Type 1)

Summary

The standard cost is the predetermined, planned unit cost of a product, it typically includes these costs:

  • materials
  • labour
  • allocated overhead

Standard costs are calculated per material, typically during a periodic cost roll (often annually or quarterly). It is used for inventory valuation, margin reporting, and variance analysis against actual cost.

Standard Cost is not the price the customer pays and not the actual cost incurred.

Natural Key

Standard Cost is per unit of product, the natural key tends to be:

  • Plant ID
  • Material Number
  • Effective Date (period in which the cost is valid)

In the warehouse this natural key maps to a cost_durable_key (stable per plant + material across every cost version) and a per-version standard_cost_sk surrogate key.

Schema

One row per: plant_id + material_number + cost version (effective period).

ColumnTypeRoleNotes
standard_cost_skBIGINTsurrogate PKone row per plant + material + cost version
plant_idVARCHARnatural keyERP plant / costing location
material_numberVARCHARnatural keyERP material (≈ product)
cost_durable_keyBIGINTdurable keystable per (plant, material) across all versions
cost_effective_fromDATEnatural key · validityinclusive start of this cost version
cost_effective_toDATEvalidityexclusive end; 9999-12-31 while current
is_currentBOOLEANvalidityflag for the active version
material_costDECIMAL(18,4)attribute (rate)per-unit component; non-additive
labour_costDECIMAL(18,4)attribute (rate)per-unit component; non-additive
overhead_costDECIMAL(18,4)attribute (rate)per-unit component; non-additive
standard_unit_costDECIMAL(18,4)attribute (rate)= material + labour + overhead; non-additive
currency_codeCHAR(3)attributeISO 4217; cost is per this currency
uom_codeVARCHARattributeunit of measure the cost is expressed in

Source & lineage

ERP.COST_MASTER  ──┐
ERP.BOM_ROLLUP   ──┼──> stg_standard_cost ──> dim_standard_cost
ERP.COST_PERIODS ──┘                              │
                                                  └──> dim_product.standard_cost (current only, Type 1)

The cost roll job lands a new effective period; the staging model derives standard_unit_cost, closes the prior period’s cost_effective_to, and assigns the new standard_cost_sk.

How to use it

Current standard cost of a product

SELECT plant_id, material_number, standard_unit_cost, currency_code
FROM   dim_standard_cost
WHERE  is_current;

Margin — the dimensional way (fact carries the cost surrogate key)

The ETL stamps each fact row with the standard_cost_sk for the version in effect on the transaction date, so this is a plain equi-join that is already point-in-time correct — no date logic needed at query time.

SELECT  s.order_number,
        s.sale_date,
        s.extended_revenue,
        s.quantity * sc.standard_unit_cost            AS standard_cost_of_sale,
        s.extended_revenue
          - s.quantity * sc.standard_unit_cost        AS standard_margin
FROM    fact_sales         s
JOIN    dim_standard_cost  sc
  ON    sc.standard_cost_sk = s.standard_cost_sk;     -- point-in-time resolved at load

If the fact has no cost SK — point-in-time range join

When a fact only carries the natural key, match each row to the cost version that was active when it happened — never to the current cost:

SELECT  s.order_number,
        s.quantity * sc.standard_unit_cost AS standard_cost_of_sale
FROM    fact_sales        s
JOIN    dim_standard_cost sc
  ON    sc.plant_id        = s.plant_id
 AND    sc.material_number = s.material_number
 AND    s.sale_date >= sc.cost_effective_from
 AND    s.sale_date <  sc.cost_effective_to;          -- half-open interval

Purchase price / cost variance (standard vs actual)

SELECT  p.plant_id,
        p.material_number,
        SUM(p.actual_unit_cost   * p.quantity)        AS actual_cost,
        SUM(sc.standard_unit_cost * p.quantity)       AS standard_cost,
        SUM((p.actual_unit_cost - sc.standard_unit_cost) * p.quantity) AS variance
FROM    fact_purchase_receipts p
JOIN    dim_standard_cost      sc
  ON    sc.standard_cost_sk = p.standard_cost_sk
GROUP BY p.plant_id, p.material_number;

Common Pitfalls

  • Standard cost is a dimension, not a fact — because its amounts are non-additive. standard_unit_cost is a per-unit rate: summing it across products, plants, or periods is meaningless (SUM(standard_unit_cost) answers no real question). You look it up and multiply by a fact quantity (quantity × standard_unit_cost) to get an additive measure — the cost of sale — which belongs in the fact/query, not here. Modelling these rates as a fact table is what tempts that erroneous SUM.
  • Always join point-in-time, never to the current cost. Resolve the version at ETL into standard_cost_sk, or range-join on the natural key + date (above). Using dim_product.standard_cost (the current value) to value historical sales silently restates past margins every time a cost roll runs.
  • Use a half-open interval [from, to) (>= from AND < to). Closed intervals (BETWEEN) double-count on the boundary day when one version ends and the next begins.
  • standard_cost on dim_product is Type 1 (overwrite). It exists only for convenience / current-state lookups. It carries no history — don’t report trends from it.
  • Currency and UoM are part of the cost. Don’t sum or compare standard_unit_cost across rows with different currency_code or uom_code. Convert first.
  • Standard ≠ actual ≠ average ≠ list price. Keep cost types in separate, clearly named attributes. Mixing them is the single most common reporting error.
  • Cost-roll timing. A roll dated the 1st but loaded on the 5th leaves a 4-day gap if effective dating isn’t backfilled. Validate that max(cost_effective_to) for the prior version meets min(cost_effective_from) of the next with no gap or overlap.
  • Missing cost for a product. New materials may sell before a standard cost is rolled. Provide a -1 “Unknown cost” member (see Kimball Keys Definitions) and a data-quality check, rather than producing NULL margins.
  • Kimball Keys Definitions — natural, durable, and surrogate keys used above.
  • fact_sales, fact_purchase_receipts — consumers; carry standard_cost_sk.
  • dim_product — carries the current-value convenience copy (Type 1).

Change history / SCD

dim_standard_cost is an SCD Type 2 dimension with effective dating. Each cost roll closes the current row (cost_effective_to, is_current = false) and inserts a new current row with a fresh standard_cost_sk. Facts reference the version in effect at their transaction date via that standard_cost_sk. The dim_product.standard_cost convenience copy is Type 1 (overwrite, no history).

Sales Territory

Type: Dimension · Primary home: dim_sales_territory (SCD Type 2) · Also surfaced on: fact_invoice_lines (FK sales_territory_sk)

Summary

dim_sales_territory maps each salesperson (also called a sales rep or sales district) to the nested chain of sales territories they roll up through, flattened into one row per salesperson. It lets you aggregate sales, quota, and commission at any level of the territory tree — from a single rep up to global sales — with a plain GROUP BY.

It is not a geography dimension: the levels are an internal sales hierarchy, not postal/administrative geography (a rep’s territory need not match where customers live). The salesperson is the leaf of this dimension, not a full employee/HR dimension.

Natural Key

One row represents one salesperson for an assignment period (SCD Type 2); there is exactly one current row per salesperson. The natural key is the source salesperson identifier, made unique per version by the effective date:

Natural Key Fields:

  • Sales Person Id

The territory tree is balanced and fixed at 6 levels, flattened onto each salesperson row — the standard Kimball treatment for a fixed-depth hierarchy (no bridge table needed). Level 1 is the root (Global Sales); level 5 is the most granular territory; the salesperson is the leaf below level 5:

L1  Global Sales
└── L2  EMEA
    └── L3  Italy
        └── L4  Northern Italy
            └── L5  District 512 – Milan
                └── Salesperson  Maria Rossi (REP-00417)

Schema

ColumnTypeRoleNotes
sales_territory_skBIGINTsurrogate PKone row per salesperson version
salesperson_idVARCHARnatural keysource sales rep / district code
salesperson_nameVARCHARattributeleaf of the hierarchy
territory_l5_codeVARCHARattributelevel 5 — most granular territory (e.g. district)
territory_l5_nameVARCHARattribute
territory_l4_codeVARCHARattributelevel 4
territory_l4_nameVARCHARattribute
territory_l3_codeVARCHARattributelevel 3
territory_l3_nameVARCHARattribute
territory_l2_codeVARCHARattributelevel 2
territory_l2_nameVARCHARattribute
territory_l1_codeVARCHARattributelevel 1 — root; constant
territory_l1_nameVARCHARattributealways Global Sales
valid_fromDATEvalidityinclusive start of this assignment
valid_toDATEvalidityexclusive end; 9999-12-31 while current
is_currentBOOLEANvalidityactive-version flag

A worked row for the example above:

ColumnValue
salesperson_nameMaria Rossi
territory_l5_nameDistrict 512 – Milan
territory_l4_nameNorthern Italy
territory_l3_nameItaly
territory_l2_nameEMEA
territory_l1_nameGlobal Sales

Source & lineage

CRM.SALES_REP        ──┐
CRM.TERRITORY_TREE   ──┼──> stg_sales_territory ──> dim_sales_territory
HR.REP_ASSIGNMENTS   ──┘                                │
                                                        └──> fact_invoice_lines.sales_territory_sk (FK, resolved at load)

TERRITORY_TREE is a parent→child recursive table. The staging model walks it, flattens the five territory levels onto each salesperson, and — on a reassignment (reorg, rep moves district) — closes the prior row’s valid_to and inserts a new current row with a fresh sales_territory_sk.

How to use it

Sales rolled up to any territory level

Because the hierarchy is flattened, grouping at any level is a plain GROUP BY — no bridge, no recursion:

SELECT  st.territory_l2_name        AS area,
        SUM(s.net_amount)           AS net_sales
FROM    fact_invoice_lines            s
JOIN    dim_sales_territory   st ON st.sales_territory_sk = s.sales_territory_sk
GROUP BY st.territory_l2_name
ORDER BY net_sales DESC;

The full territory chain for one salesperson

SELECT  salesperson_name,
        territory_l5_name, territory_l4_name, territory_l3_name,
        territory_l2_name, territory_l1_name
FROM    dim_sales_territory
WHERE   is_current
  AND   salesperson_id = 'REP-00417';

Point-in-time: attribute each sale to the territory in effect then

The ETL stamps each fact_invoice_lines row with the sales_territory_sk that was active on the sale date, so this equi-join is automatically point-in-time correct — a sale stays with the rep’s territory at the time, even after a later reorg:

SELECT  st.territory_l3_name        AS country,
        DATE_TRUNC('quarter', s.sale_date) AS qtr,
        SUM(s.net_amount)           AS net_sales
FROM    fact_invoice_lines            s
JOIN    dim_sales_territory   st ON st.sales_territory_sk = s.sales_territory_sk
GROUP BY st.territory_l3_name, DATE_TRUNC('quarter', s.sale_date);

Common Pitfalls

  • Mind the level direction. Level 1 = Global (root), level 5 = most granular.
  • Salesperson ≠ person. The leaf is a sales role/district. One human may cover several districts, and a district may pass between people over time. Keep this dimension at the territory-assignment grain; model the individual separately if HR attributes are needed.
  • Unassigned reps. New reps may book sales before territory setup. Point the fact FK at a -1 “Unknown territory” member (see Kimball Keys Definitions) rather than leaving the FK NULL.
  • Kimball Keys Definitions — surrogate, durable, and natural keys plus the special “Unknown” member used above.
  • Slowly Changing Dimension — the SCD Type 2 pattern this dimension uses for reassignments.
  • fact_invoice_lines — primary consumer; carries sales_territory_sk.

Change history / SCD

dim_sales_territory is an SCD Type 2 dimension. Each territory reassignment closes the current row (valid_to, is_current = false) and inserts a new current row with a fresh sales_territory_sk; the salesperson_id natural key stays constant so all of a rep’s history can be grouped together. Facts reference the version in effect at their transaction date via sales_territory_sk.

Invoice Lines

Type: Fact — transaction · Grain: one row per invoice line · Primary home: fact_invoice_lines

Summary

fact_invoice_lines records the billed detail of customer invoices — one row for each product line on each invoice. It is the backbone for revenue, discount, tax, and margin reporting, sliced by customer, product, date, and sales territory.

It is not an invoice-header fact: whole-invoice charges (freight, invoice-level discounts) are not repeated on every line — keep those in a separate header fact or allocate them down to the line, or you will double-count.

Grain

One row per invoice line — a single product line item on a single invoice.

grain = (invoice_number, invoice_line_number)

invoice_number and invoice_line_number are degenerate dimensions (identifiers with no dimension table of their own); together they uniquely identify a row.

Schema

ColumnTypeRoleNotes
invoice_numberVARCHARdegenerate dimensionthe operational invoice id
invoice_line_numberINTdegenerate dimensionline position within the invoice
invoice_date_skBIGINTFK → dim_datedate the invoice was issued
customer_skBIGINTFK → dim_customerbill-to customer (version at invoice date)
product_skBIGINTFK → dim_productproduct sold
sales_territory_skBIGINTFK → dim_sales_territoryrep / territory in effect at invoice date
standard_cost_skBIGINTFK → dim_standard_coststandard-cost version in effect at invoice date
currency_codeCHAR(3)attributedocument currency (ISO 4217)
quantityDECIMAL(18,4)measureunits invoiced; additive
gross_amountDECIMAL(18,4)measurelist value before discount; additive
discount_amountDECIMAL(18,4)measureadditive
net_amountDECIMAL(18,4)measure= gross − discount; additive
tax_amountDECIMAL(18,4)measureadditive
standard_cost_of_saleDECIMAL(18,4)measure= quantity × standard_unit_cost; additive
unit_priceDECIMAL(18,4)measurenet per unit; non-additive (a rate)
load_tsTIMESTAMPauditwarehouse load timestamp

Measures & additivity

MeasureAdditivityNotes
quantityadditivesums across all dimensions
gross_amount, discount_amount, net_amount, tax_amountadditivethe money measures; sum freely
standard_cost_of_saleadditivepairs with net_amount to give margin
unit_pricenon-additivea per-unit rate — never SUM; for an average use SUM(net_amount) / SUM(quantity)

Margin is derived, not stored: net_amount − standard_cost_of_sale. Because both inputs are additive, margin can be summed at any level.

Source & lineage

ERP.INVOICE_HEADER ──┐
ERP.INVOICE_LINE   ──┼──> stg_invoice_lines ──> fact_invoice_lines
ERP.FX_RATES       ──┘

The staging model joins header to line, then resolves each foreign key. The SCD Type 2 keys (customer_sk, sales_territory_sk, standard_cost_sk) are looked up as of the invoice date, so every line carries the dimension version that was in effect when it was billed — point-in-time correct (see the pitfalls).

How to use it

Net sales by month and sales area

SELECT  d.year_month,
        st.territory_l2_name          AS area,
        SUM(f.net_amount)             AS net_sales
FROM    fact_invoice_lines   f
JOIN    dim_date             d  ON d.date_sk            = f.invoice_date_sk
JOIN    dim_sales_territory  st ON st.sales_territory_sk = f.sales_territory_sk
GROUP BY d.year_month, st.territory_l2_name;

Standard margin by product category

SELECT  p.product_category,
        SUM(f.net_amount)                            AS net_sales,
        SUM(f.standard_cost_of_sale)                 AS standard_cost,
        SUM(f.net_amount - f.standard_cost_of_sale)  AS standard_margin
FROM    fact_invoice_lines f
JOIN    dim_product        p ON p.product_sk = f.product_sk
GROUP BY p.product_category;

Average selling price (the non-additive measure, done right)

-- weighted average, NOT AVG(unit_price)
SELECT  product_sk,
        SUM(net_amount) / NULLIF(SUM(quantity), 0) AS avg_unit_price
FROM    fact_invoice_lines
GROUP BY product_sk;

Common Pitfalls

  • Never SUM(unit_price) (or AVG it). It is a per-unit rate — non-additive. Compute an average as SUM(net_amount) / SUM(quantity).
  • Header vs line grain. Whole-invoice charges (freight, invoice-level discounts) belong to the header, not each line. Repeating them per line double-counts — allocate them to lines or keep a separate header fact.
  • Dimension fan-out. Joining to a dimension at a coarser grain or through a multi-valued bridge multiplies rows and inflates the measures. Join on the line’s own surrogate keys, and pre-aggregate the fact before any 1-to-many join.
  • Point-in-time keys. customer_sk, sales_territory_sk, and standard_cost_sk are the versions in effect at the invoice date. Don’t re-derive them from the current dimension row, or history is restated on every reorg / cost roll.
  • Returns & credit notes. Credits arrive as negative quantity / amounts. Keep the sign convention consistent so SUM nets correctly; don’t silently filter them out of margin.
  • Currency. Amounts are in currency_code (document currency). Convert to a single reporting currency before summing across currencies.
  • Unknown members. A line that can’t resolve a dimension points at the -1 “Unknown” member (see Kimball Keys Definitions), never a NULL FK.
  • Standard Cost — supplies standard_cost_sk / standard_unit_cost behind standard_cost_of_sale.
  • Sales Territory — supplies sales_territory_sk for territory rollups.
  • Grain and Kimball Keys Definitions — the concepts this fact builds on.
  • dim_date, dim_customer, dim_product — the remaining conformed dimensions.

Change history / load pattern

fact_invoice_lines is a transaction-grain fact, loaded insert-only: each invoice line is written once and never updated. Corrections and returns flow in as new (often negative) lines rather than edits, preserving an auditable history. Late-arriving invoices are appended with their historical invoice_date_sk, and their SCD Type 2 keys resolve to the version that was current then (or the -1 Unknown member until the dimension member appears).

Degenerate Dimension

Also known as: DD

A business identifier stored directly on a fact table that has no attributes of its own, and therefore no separate dimension table.

In practice: It usually identifies the operational transaction or document a fact row came from — an invoice number, order number, or ticket id. Keeping it on the fact lets you group the line items of a single document without a join.

Examples:

  • fact_invoice_lines.invoice_number = "INV-2026-008812" — one invoice spans many line-item fact rows but needs no dim_invoice.
  • standard_cost_amount = 3,4 - standard cost can be used as a degenerate dimension.

Common pitfalls:

  • Don’t build a dimension table for it just to “be consistent” — with no descriptive attributes, it stays on the fact.
  • If you later discover real attributes (status, channel), promote it to a proper dimension and replace the degenerate key with a surrogate-key foreign key.

See also: Kimball Keys Definitions

Grain

The grain of a fact table is the precise meaning of a single row — the level of detail it records.

In practice: Declare the grain in business terms before choosing dimensions or facts (“one row per order line”, “one row per daily account balance”). Every dimension and measure on the table must be true at that grain; mixing grains in a single table is the most common dimensional modelling mistake.

Example: fact_sales at the grain one row per product per order line — so quantity and extended_amount are recorded per line, never per whole order.

See also: Standard Cost (declares its grain as an explicit key tuple)

Slowly Changing Dimension

Also known as: SCD

A dimension whose attribute values change occasionally over time, together with the technique chosen for whether to keep or overwrite the prior values.

In practice: Pick a strategy per attribute, not per table:

  • Type 1 — overwrite the value; no history kept.
  • Type 2 — add a new row with a new surrogate key and an effective-date range (valid_from / valid_to / is_current); full history preserved.
  • Type 3 — keep a “previous value” column alongside the current one; limited history.

Example: A customer moves city. Type 1 overwrites the city; Type 2 closes the old row (valid_to) and inserts a new current row, so historical facts still join to the address that was true at the time.

See also: Kimball Keys Definitions · Standard Cost