Introduction
Welcome to the Dimensional Modelling Docs — a shared, practical source of truth for building dimensional (Kimball-style) data warehouses and marts. It collects the concepts, definitions, and reference models we rely on when designing facts and dimensions, written to be useful both to the analytics engineers on the team and to the LLMs that assist them.
Why this guide
There are many excellent Kimball books and tutorials, but we kept hitting the same gaps. They were:
- Not written for LLMs — hard to feed to an assistant as ground truth.
- Light on concrete examples — strong on theory, thin on the exact SQL and edge cases you actually meet in production.
So this guide is deliberately pragmatic: short explanations backed by real examples we have encountered while developing facts and dimensions in production.
Concepts
The Concepts section covers the foundational building blocks of dimensional modelling — the ideas you compose into a model. Start with the Dimension, see how dimensions and facts fit together in a Star Schema, then learn how rows are identified and linked in Kimball Keys Definitions.
Definitions
The Kimball Group website already offers excellent, concise definitions. Here the focus is narrower: collect the terms that come up most often during development, each as a short, self-contained definition with an example where it helps. See Degenerate Dimension or Slowly Changing Dimension for the shape these take.
Reference
The Reference section is a growing catalogue of the facts and dimensions you meet in real-life scenarios. Each page documents one entity — its grain, schema, worked SQL, and common pitfalls — so you have a standard to design and build from (and one an LLM can quote). Standard Cost is the first worked example.
Dimension
Perspective handles these steps automatically. We document them here in detail so that analysts can understand the process and work in sync with Perspective’s lineage generation.
Transformation Diagram
flowchart LR
%% Sources
A[Kimball Standard Library] --> B[Specification]
B --> X[(Warehouse)]
X --> C[Grounding]
%% Subgraph 1 (Source A path)
subgraph G1[Source A Processing]
E1[Table A - Source A]
E2[Table B]
E3[Table C]
F0["Group <br />(reduce granularity)"]
F1[Select Columns]
J1[Join]
P1[Cast / Prepare]
R1[Filter]
E1 --> F0
E2 --> F0
E3 --> F0
F0 --> F1
F1 --> J1
J1 --> P1
P1 --> R1
end
%% Subgraph 2 (Source B path)
subgraph G2[Source B Processing]
E4[Table D - Source B]
F01["Group <br />(reduce granularity)"]
F2[Select Columns]
J2[Join]
P2[Cast / Prepare]
R2[Filter]
E4 --> F01
F01 --> F2
F2 --> J2
J2 --> P2
P2 --> R2
end
%% Connect grounding to both paths
C --> E1
C --> E2
C --> E3
C --> E4
%% Merge paths
R1 --> U["Union<br /> (active only when building<br />from multiple sources)"]
R2 --> U
%% Final steps
U --> K[Natural Key Selection]
K --> L[Generate Surrogate Key]
L --> M[Final SQL]
Transformation Steps
Building a dimension typically follows three stages: create source-specific dimensions, unify them into a single consolidated dimension, and add surrogate keys.
1. Create Source Dimensions
When consolidating a dimension from multiple sources, start by creating one dimension table per source. Each source dimension should contain only the attributes coming from that system, cleaned and renamed to use business-friendly column names.
For example, if you are consolidating the sales_territory dimension from SAP and Infor, you should create:
sap_sales_territory— contains territory attributes as they come from SAPinfor_sales_territory— contains territory attributes as they come from Infor
This keeps each source isolated, making it easier to debug data quality issues back to their origin. It also means changes in one source system don’t ripple into the transformation logic of another.
Each source dimension is built through a series of sub-steps. Think of these as individual CTEs or queries that you chain together — each one does exactly one thing.
1.1 Select
Start by selecting only the columns you need from the raw or staging tables. This is the first filter — strip away everything that doesn’t belong in the dimension.
- pick the columns that will become attributes in the final dimension
- rename columns to use business-friendly names at this stage (e.g.
VKORG→sales_org) - drop any operational or system columns that have no analytical value (e.g.
_etl_loaded_at,last_modified_by)
Keep it minimal. If you’re unsure whether a column belongs, leave it out — you can always add it later.
1.2 Group
Aggregate rows when the raw data is at a finer grain than the dimension needs. This typically happens when the source table has one row per transaction or event, but the dimension should have one row per entity.
- use
group byon the natural key columns of the dimension - apply aggregation functions (
max,min,count,sum) only where they make analytical sense - be explicit about which value you’re keeping when multiple rows collapse into one — don’t rely on implicit behaviour
Not every dimension needs a grouping step. If the source is already at the right grain (one row per entity), skip this step entirely.
1.3 Prepare
Transform raw values into clean, consistent formats. This step is about data quality, not business logic.
Typical operations:
cast— enforce correct data types (e.g. string dates toDATE, numeric codes toINTEGER)trim/upper/lower— normalize text values for consistencycoalesce— fill in defaults for nullable columns where a default makes sensecase when— map coded values to human-readable labels (e.g.'A'→'Active','I'→'Inactive')
Avoid doing too much here. If a transformation requires joining to another table or involves complex business rules, it belongs in a later step.
1.4 Join
Enrich the dimension by joining attributes from related tables. This is where you bring together data that lives across multiple tables within the same source system.
- use
left joinby default — you want to keep all dimension records even if the related table has gaps
1.5 Filter
Apply any filters that determine which records should be included in the dimension. This step is intentionally last so that you’re filtering on clean, transformed data.
- exclude soft-deleted records, test data, or inactive entities that shouldn’t appear in reporting
- document the filter criteria clearly — future analysts need to understand why certain records are excluded
- if a filter is business-critical (e.g. “only include customers with at least one order”), call it out with a comment explaining the business rule
The filtering rules should be defined by the user in the business requirements, not inferred by LLM.
2. Unify Source Dimensions
Once the source dimensions are in place, combine them into a single unified dimension. The unified dimension should:
- union or merge records from all source dimensions into a common schema
- resolve column naming differences — pick one business name and stick with it
- handle duplicates across sources by defining a clear deduplication strategy (e.g. prefer SAP over Infor when records overlap)
Following our example, this step produces:
unified_sales_territory
Keep the unification logic straightforward. If the mapping between sources is complex, consider adding an intermediate staging step rather than cramming everything into one query.
2.1 Union
Stack all source dimensions into a single consolidated table using UNION ALL. At this stage, the unified dimension does not yet have natural or surrogate keys — it is simply all source records brought together under a common schema.
- every source dimension must have the same column names and data types before unioning — this is why the renaming and casting in step 1 matters
- use
UNION ALL, notUNION— you want to preserve all rows, including apparent duplicates. Deduplication is a separate, explicit step - add a
source_systemcolumn (e.g.'SAP','Infor') so you can always trace a record back to its origin
Note: if the dimension is built from a single source (no
UNIONrequired), do not add thesource_systemcolumn to the final dimension. It carries no analytical value when there is only one source and would just be noise. Thesource_systemcolumn is only meaningful when multiple sources are unified and traceability between them matters.
2.2 Deduplicate (optional)
If the same entity exists in multiple source systems, you need a strategy to resolve duplicates. This step only applies when sources overlap — if each source contributes entirely distinct records, skip it.
- define a priority order across sources (e.g. SAP is the master for territory data, Infor fills gaps)
- use
row_number()partitioned by the business key and ordered by source priority, then keep only the first row - document the deduplication logic clearly — this is one of the most common sources of confusion in unified dimensions
3. Add Keys to the Unified Dimension
After unification, identify the natural key columns and generate a surrogate key. The final dimension table follows the dim_ naming convention. For our example:
dim_sales_territory
3.1 Identify Natural Keys
Determine which columns uniquely identify a record in the unified dimension. Natural keys are the business identifiers that come from the source systems — they are what the business already uses to refer to an entity.
- a natural key can be a single column (e.g.
territory_code) or a composite of multiple columns (e.g.territory_code+source_system) - always append
_nkto natural key columns (e.g.territory_code_nk,source_system_nk) so they are immediately recognizable in the schema — this mirrors the_skconvention used for surrogate keys - (optional) after identifying the natural keys, verify uniqueness: run a
group byon thenk_columns and confirm every group has exactly one row. If duplicates remain, revisit your deduplication logic in step 2.2
Note: the rules for selecting the columns for SCD Type 2 dimensions have yet to be decided.
3.2 Generate Surrogate Key
The surrogate key is always an MD5 hash of the combined natural key — that is, a hash of all natural key columns concatenated together. This makes the key deterministic and reproducible across pipeline runs, which is essential for idempotent loads.
- concatenate all natural key columns into a single string, using the
|separator to avoid collisions between values like('AB', 'C')and('A', 'BC') - apply the
md5hash function to produce the surrogate key - name the surrogate key column
<dimension_name>_sk(e.g.sales_territory_sk) so it’s immediately recognizable as a surrogate key in fact table joins
Note on SCD Type 2 dimensions: for slowly changing dimensions of Type 2, hashing only the natural keys will produce duplicate surrogate keys across historical versions of the same entity (e.g. the same
territory_codeappearing with differentterritory_namevalues over time). This is acceptable in Kimball-style modelling — from an analytics perspective, the natural key is all a data analyst needs to reason about the dimension, and duplicate surrogate keys are fine as long as only one row is flagged as active at any point in time. Joins from fact tables should filter onWHERE is_active = TRUE(or equivalent) to pick the current version. There is no need to think in terms of database primary keys here; the analytical model does not require surrogate key uniqueness.
3.3 Select Final Columns
Select the final set of columns for the dimension table. This is where you decide what the consumers of the dimension will see.
- place the surrogate key as the first column
- include the natural key columns for traceability back to the source systems
- include the
source_systemcolumn from step 2.1 - include all descriptive attributes from the unified dimension
- order columns logically — keys first, then identifiers, then descriptive attributes. A well-ordered dimension is easier to scan and query
Star Schema
A star schema is the foundational pattern in dimensional modelling. It organizes data into a central fact table surrounded by dimension tables, forming a star-like shape when visualized. The fact table holds measurable events (e.g. orders, shipments, invoices), while the dimensions provide the context around those events (e.g. customer, product, territory). See Star Schema (Kimball Group) for the original definition.
This structure is optimized for analytical queries. Because every dimension joins directly to the fact table, queries are predictable — analysts always know where to look for metrics (facts) and where to look for filters and groupings (dimensions).
Example Star Schema Diagram
erDiagram
dim_sales_territory ||--o{ fact_invoice_lines : "sales_territory_sk"
dim_product_hierarchy ||--o{ fact_invoice_lines : "product_hierarchy_sk"
dim_industry ||--o{ fact_invoice_lines : "industry_sk"
fact_invoice_lines {
string sales_territory_sk FK
string product_hierarchy_sk FK
string industry_sk FK
decimal quantity
decimal amount
}
dim_sales_territory {
string sales_territory_sk PK
string territory_code_nk
string territory_name
string region
}
dim_product_hierarchy {
string product_hierarchy_sk PK
string product_code_nk
string product_name
string category
}
dim_industry {
string industry_sk PK
string industry_code_nk
string industry_name
string sector
}
Rules
One fact table per star schema
Each star schema should contain exactly one fact table at one grain. The grain is the business definition of the measurement event that creates a fact record — it should always start at the lowest, most atomic level. If you need to combine metrics from different business processes (e.g. sales and inventory), build separate star schemas rather than merging everything into a single fact table. This keeps each schema focused and avoids grain conflicts. See Four-Step Dimensional Design Process and Keep to the Grain for more on defining grain.
Do not join dimensions to other dimensions
In dimensional modelling, joining a dimension to another dimension is known as an outrigger dimension. While the Kimball methodology technically allows outriggers, they are rarely necessary and they complicate the SQL logic — queries become harder to read, maintain, and optimize. As Kimball notes, outriggers should be used sparingly, and in most cases correlations between dimensions should be demoted to a fact table where both dimensions are represented as separate foreign keys. See also Design Tip #105: Snowflakes, Outriggers, and Bridges.
If you find yourself wanting to join two dimensions together, use a factless fact table instead. A factless fact table captures the relationship between dimensions as a fact at its own grain — it has no numeric measures, only foreign keys to the dimensions involved. This keeps the star schema clean and the joins predictable. See Design Tip #133: Factless Fact Tables for Simplification for practical examples.
Always use left joins, fact on the left
When joining dimensions to the fact table, always use left join with the fact table on the left side. This ensures that every fact record is preserved in the result, even if a matching dimension record is missing.
- a missing dimension match usually indicates a data quality issue — the
left joinmakes these gaps visible rather than silently dropping rows - if you use
inner joininstead, you risk losing fact records and underreporting metrics without realizing it
Keep aggregate calculations in the reporting layer
The star schema should store atomic, grain-level data. Derived calculations like percentages, ratios, running totals, and year-over-year comparisons belong in the reporting or semantic layer — not in the fact table itself.
- storing pre-aggregated values in facts makes them inflexible — they can’t be re-sliced by dimensions they weren’t originally grouped by
- let the reporting tool handle aggregation so that analysts can drill down to the detail when needed
- the one exception is additive measures (e.g.
quantity,amount) — these belong in the fact table because they can be meaningfully summed across any dimension. Semi-additive measures (e.g. balances) can be summed across some dimensions but not all, and non-additive measures (e.g. unit prices, ratios) should never be summed directly
For more on aggregate tables as a performance optimization, see Aggregate Fact Tables (Kimball Group).
References
- Dimensional Modeling Techniques — Kimball Group — complete list of techniques
- A Dimensional Modeling Manifesto — the original case for dimensional modelling
- Fact Tables and Dimension Tables — foundational definitions
- Kimball Dimensional Modeling Techniques (PDF) — comprehensive reference document
Kimball Keys Definitions
In the Kimball dimensional modelling approach, keys are the backbone that connect fact tables to their dimensions and that let us track history correctly. This page defines the key types you will encounter and the conventions we use for each.
Quick reference
| Key | Lives in | Stable? | Meaningful? | Purpose |
|---|---|---|---|---|
| Natural key | Source system | Yes (in source) | Yes | Identifies a business entity in the source |
| Durable / supernatural key | Dimension | Yes (forever) | No | Identifies an entity across all source changes |
| Surrogate key | Dimension | Per row version | No | Primary key of a dimension row |
| Foreign key | Fact | — | No | Points a fact row at a dimension row |
| Degenerate dimension | Fact | Yes | Yes | Operational identifier with no dimension table |
Natural key
The identifier an entity carries in the source system — for example a
customer_id from the CRM, an SKU from the product catalogue, or an order number
from the OLTP database.
- Carries business meaning and may be reused or recycled by the source.
- Can be composite (several columns) and can change format over time.
- Not used as the primary key of a dimension, because a single natural key can map to many historical versions of a row (see surrogate keys).
customer_id = "CUST-00417"
Durable (supernatural) key
A warehouse-assigned, never-changing identifier for a business entity. While surrogate keys change with every new version of a row, the durable key stays constant for the lifetime of the entity.
- Use it to group all historical versions of the same entity.
- Survives source-system migrations and natural-key reformatting.
- Sometimes called a persistent or supernatural key.
customer_durable_key = 90231 -- one value for CUST-00417 across all versions
Surrogate key
A meaningless, warehouse-generated integer that serves as the primary key of a dimension table. Every row in the dimension — including every historical version of an entity — gets its own surrogate key.
- Typically a monotonically increasing integer (
BIGINT) or an identity column. - Carries no business meaning; never expose it to end users as a “real” id.
- Decouples the warehouse from source-key changes and enables Type 2 history.
Convention: name surrogate keys
<entity>_sk, e.g.customer_sk,product_sk,date_sk.
customer_sk = 1048576 -- primary key of one specific version of a customer row
Why a surrogate key instead of the natural key?
- Slowly changing dimensions (SCD Type 2). When an attribute changes we add a new row with a new surrogate key, preserving the old version.
- Performance. Single-column integer joins are faster and smaller than wide or composite natural keys.
- Insulation. Source-system key changes don’t ripple into facts.
- Late-arriving / unknown members. Reserved surrogate values can represent “Unknown” or “Not applicable” rows.
Foreign key
The column in a fact table that stores a dimension’s surrogate key, forming the join between the fact and that dimension.
- One foreign key per dimension the fact relates to.
- Always points at a surrogate key, never at a natural key.
- Should be enforced (logically, at minimum) so every fact row resolves to a valid dimension row — including the special “Unknown” member.
fact_sales.customer_sk --> dim_customer.customer_sk
Degenerate dimension
A dimension key that lives in the fact table itself because it has no interesting attributes of its own and therefore no separate dimension table. Classic examples: invoice number, order number, transaction id.
- Useful for grouping the line items of a single operational document.
- Stored as a column on the fact, often suffixed
_idor_number.
fact_invoice_lines.invoice_number = "INV-2026-008812"
Special dimension members
Reserve a handful of surrogate key values for rows that don’t map to real source
records, so that fact foreign keys never have to be NULL:
| Surrogate key | Member meaning |
|---|---|
-1 | Unknown |
-2 | Not applicable |
-3 | Missing / not yet arrived |
Putting it together
erDiagram
dim_customer ||--o{ fact_sales : "customer_sk"
dim_customer {
bigint customer_sk PK "surrogate key"
bigint customer_durable_key "durable key"
string customer_id "natural key"
string customer_name "attribute"
date valid_from_to "SCD2 validity + is_current"
}
fact_sales {
bigint customer_sk FK "to dim_customer"
bigint product_sk FK "to dim_product"
bigint date_sk FK "to dim_date"
string order_number "degenerate dimension"
decimal quantity_amount "additive facts"
}
A fact row joins to a specific version of a customer via customer_sk. To
analyse an entity across all its versions, group on customer_durable_key. To
trace a record back to the source, use the customer_id natural key.
Standard Cost
Type: Dimension (per-unit cost rate) · Primary home:
dim_standard_cost(SCD Type 2) · Also surfaced on:dim_product(current value only, Type 1)
Summary
The standard cost is the predetermined, planned unit cost of a product, it typically includes these costs:
- materials
- labour
- allocated overhead
Standard costs are calculated per material, typically during a periodic cost roll (often annually or quarterly). It is used for inventory valuation, margin reporting, and variance analysis against actual cost.
Standard Cost is not the price the customer pays and not the actual cost incurred.
Natural Key
Standard Cost is per unit of product, the natural key tends to be:
- Plant ID
- Material Number
- Effective Date (period in which the cost is valid)
In the warehouse this natural key maps to a cost_durable_key (stable per
plant + material across every cost version) and a per-version standard_cost_sk
surrogate key.
Schema
One row per: plant_id + material_number + cost version (effective period).
| Column | Type | Role | Notes |
|---|---|---|---|
standard_cost_sk | BIGINT | surrogate PK | one row per plant + material + cost version |
plant_id | VARCHAR | natural key | ERP plant / costing location |
material_number | VARCHAR | natural key | ERP material (≈ product) |
cost_durable_key | BIGINT | durable key | stable per (plant, material) across all versions |
cost_effective_from | DATE | natural key · validity | inclusive start of this cost version |
cost_effective_to | DATE | validity | exclusive end; 9999-12-31 while current |
is_current | BOOLEAN | validity | flag for the active version |
material_cost | DECIMAL(18,4) | attribute (rate) | per-unit component; non-additive |
labour_cost | DECIMAL(18,4) | attribute (rate) | per-unit component; non-additive |
overhead_cost | DECIMAL(18,4) | attribute (rate) | per-unit component; non-additive |
standard_unit_cost | DECIMAL(18,4) | attribute (rate) | = material + labour + overhead; non-additive |
currency_code | CHAR(3) | attribute | ISO 4217; cost is per this currency |
uom_code | VARCHAR | attribute | unit of measure the cost is expressed in |
Source & lineage
ERP.COST_MASTER ──┐
ERP.BOM_ROLLUP ──┼──> stg_standard_cost ──> dim_standard_cost
ERP.COST_PERIODS ──┘ │
└──> dim_product.standard_cost (current only, Type 1)
The cost roll job lands a new effective period; the staging model derives
standard_unit_cost, closes the prior period’s cost_effective_to, and assigns
the new standard_cost_sk.
How to use it
Current standard cost of a product
SELECT plant_id, material_number, standard_unit_cost, currency_code
FROM dim_standard_cost
WHERE is_current;
Margin — the dimensional way (fact carries the cost surrogate key)
The ETL stamps each fact row with the standard_cost_sk for the version in
effect on the transaction date, so this is a plain equi-join that is already
point-in-time correct — no date logic needed at query time.
SELECT s.order_number,
s.sale_date,
s.extended_revenue,
s.quantity * sc.standard_unit_cost AS standard_cost_of_sale,
s.extended_revenue
- s.quantity * sc.standard_unit_cost AS standard_margin
FROM fact_sales s
JOIN dim_standard_cost sc
ON sc.standard_cost_sk = s.standard_cost_sk; -- point-in-time resolved at load
If the fact has no cost SK — point-in-time range join
When a fact only carries the natural key, match each row to the cost version that was active when it happened — never to the current cost:
SELECT s.order_number,
s.quantity * sc.standard_unit_cost AS standard_cost_of_sale
FROM fact_sales s
JOIN dim_standard_cost sc
ON sc.plant_id = s.plant_id
AND sc.material_number = s.material_number
AND s.sale_date >= sc.cost_effective_from
AND s.sale_date < sc.cost_effective_to; -- half-open interval
Purchase price / cost variance (standard vs actual)
SELECT p.plant_id,
p.material_number,
SUM(p.actual_unit_cost * p.quantity) AS actual_cost,
SUM(sc.standard_unit_cost * p.quantity) AS standard_cost,
SUM((p.actual_unit_cost - sc.standard_unit_cost) * p.quantity) AS variance
FROM fact_purchase_receipts p
JOIN dim_standard_cost sc
ON sc.standard_cost_sk = p.standard_cost_sk
GROUP BY p.plant_id, p.material_number;
Common Pitfalls
- Standard cost is a dimension, not a fact — because its amounts are
non-additive.
standard_unit_costis a per-unit rate: summing it across products, plants, or periods is meaningless (SUM(standard_unit_cost)answers no real question). You look it up and multiply by a fact quantity (quantity × standard_unit_cost) to get an additive measure — the cost of sale — which belongs in the fact/query, not here. Modelling these rates as a fact table is what tempts that erroneousSUM. - Always join point-in-time, never to the current cost. Resolve the version
at ETL into
standard_cost_sk, or range-join on the natural key + date (above). Usingdim_product.standard_cost(the current value) to value historical sales silently restates past margins every time a cost roll runs. - Use a half-open interval
[from, to)(>= from AND < to). Closed intervals (BETWEEN) double-count on the boundary day when one version ends and the next begins. standard_costondim_productis Type 1 (overwrite). It exists only for convenience / current-state lookups. It carries no history — don’t report trends from it.- Currency and UoM are part of the cost. Don’t sum or compare
standard_unit_costacross rows with differentcurrency_codeoruom_code. Convert first. - Standard ≠ actual ≠ average ≠ list price. Keep cost types in separate, clearly named attributes. Mixing them is the single most common reporting error.
- Cost-roll timing. A roll dated the 1st but loaded on the 5th leaves a 4-day
gap if effective dating isn’t backfilled. Validate that
max(cost_effective_to)for the prior version meetsmin(cost_effective_from)of the next with no gap or overlap. - Missing cost for a product. New materials may sell before a standard cost is
rolled. Provide a
-1“Unknown cost” member (see Kimball Keys Definitions) and a data-quality check, rather than producingNULLmargins.
Related
- Kimball Keys Definitions — natural, durable, and surrogate keys used above.
fact_sales,fact_purchase_receipts— consumers; carrystandard_cost_sk.dim_product— carries the current-value convenience copy (Type 1).
Change history / SCD
dim_standard_cost is an SCD Type 2 dimension with effective dating. Each
cost roll closes the current row (cost_effective_to, is_current = false) and
inserts a new current row with a fresh standard_cost_sk. Facts reference the
version in effect at their transaction date via that standard_cost_sk. The
dim_product.standard_cost convenience copy is Type 1 (overwrite, no history).
Sales Territory
Type: Dimension · Primary home:
dim_sales_territory(SCD Type 2) · Also surfaced on:fact_invoice_lines(FKsales_territory_sk)
Summary
dim_sales_territory maps each salesperson (also called a sales rep or
sales district) to the nested chain of sales territories they roll up
through, flattened into one row per salesperson. It lets you aggregate sales,
quota, and commission at any level of the territory tree — from a single rep up
to global sales — with a plain GROUP BY.
It is not a geography dimension: the levels are an internal sales hierarchy, not postal/administrative geography (a rep’s territory need not match where customers live). The salesperson is the leaf of this dimension, not a full employee/HR dimension.
Natural Key
One row represents one salesperson for an assignment period (SCD Type 2); there is exactly one current row per salesperson. The natural key is the source salesperson identifier, made unique per version by the effective date:
Natural Key Fields:
- Sales Person Id
The territory tree is balanced and fixed at 6 levels, flattened onto each salesperson row — the standard Kimball treatment for a fixed-depth hierarchy (no bridge table needed). Level 1 is the root (Global Sales); level 5 is the most granular territory; the salesperson is the leaf below level 5:
L1 Global Sales
└── L2 EMEA
└── L3 Italy
└── L4 Northern Italy
└── L5 District 512 – Milan
└── Salesperson Maria Rossi (REP-00417)
Schema
| Column | Type | Role | Notes |
|---|---|---|---|
sales_territory_sk | BIGINT | surrogate PK | one row per salesperson version |
salesperson_id | VARCHAR | natural key | source sales rep / district code |
salesperson_name | VARCHAR | attribute | leaf of the hierarchy |
territory_l5_code | VARCHAR | attribute | level 5 — most granular territory (e.g. district) |
territory_l5_name | VARCHAR | attribute | |
territory_l4_code | VARCHAR | attribute | level 4 |
territory_l4_name | VARCHAR | attribute | |
territory_l3_code | VARCHAR | attribute | level 3 |
territory_l3_name | VARCHAR | attribute | |
territory_l2_code | VARCHAR | attribute | level 2 |
territory_l2_name | VARCHAR | attribute | |
territory_l1_code | VARCHAR | attribute | level 1 — root; constant |
territory_l1_name | VARCHAR | attribute | always Global Sales |
valid_from | DATE | validity | inclusive start of this assignment |
valid_to | DATE | validity | exclusive end; 9999-12-31 while current |
is_current | BOOLEAN | validity | active-version flag |
A worked row for the example above:
| Column | Value |
|---|---|
salesperson_name | Maria Rossi |
territory_l5_name | District 512 – Milan |
territory_l4_name | Northern Italy |
territory_l3_name | Italy |
territory_l2_name | EMEA |
territory_l1_name | Global Sales |
Source & lineage
CRM.SALES_REP ──┐
CRM.TERRITORY_TREE ──┼──> stg_sales_territory ──> dim_sales_territory
HR.REP_ASSIGNMENTS ──┘ │
└──> fact_invoice_lines.sales_territory_sk (FK, resolved at load)
TERRITORY_TREE is a parent→child recursive table. The staging model walks it,
flattens the five territory levels onto each salesperson, and — on a reassignment
(reorg, rep moves district) — closes the prior row’s valid_to and inserts a new
current row with a fresh sales_territory_sk.
How to use it
Sales rolled up to any territory level
Because the hierarchy is flattened, grouping at any level is a plain GROUP BY —
no bridge, no recursion:
SELECT st.territory_l2_name AS area,
SUM(s.net_amount) AS net_sales
FROM fact_invoice_lines s
JOIN dim_sales_territory st ON st.sales_territory_sk = s.sales_territory_sk
GROUP BY st.territory_l2_name
ORDER BY net_sales DESC;
The full territory chain for one salesperson
SELECT salesperson_name,
territory_l5_name, territory_l4_name, territory_l3_name,
territory_l2_name, territory_l1_name
FROM dim_sales_territory
WHERE is_current
AND salesperson_id = 'REP-00417';
Point-in-time: attribute each sale to the territory in effect then
The ETL stamps each fact_invoice_lines row with the sales_territory_sk that was active
on the sale date, so this equi-join is automatically point-in-time correct — a
sale stays with the rep’s territory at the time, even after a later reorg:
SELECT st.territory_l3_name AS country,
DATE_TRUNC('quarter', s.sale_date) AS qtr,
SUM(s.net_amount) AS net_sales
FROM fact_invoice_lines s
JOIN dim_sales_territory st ON st.sales_territory_sk = s.sales_territory_sk
GROUP BY st.territory_l3_name, DATE_TRUNC('quarter', s.sale_date);
Common Pitfalls
- Mind the level direction. Level 1 = Global (root), level 5 = most granular.
- Salesperson ≠ person. The leaf is a sales role/district. One human may cover several districts, and a district may pass between people over time. Keep this dimension at the territory-assignment grain; model the individual separately if HR attributes are needed.
- Unassigned reps. New reps may book sales before territory setup. Point the
fact FK at a
-1“Unknown territory” member (see Kimball Keys Definitions) rather than leaving the FKNULL.
Related
- Kimball Keys Definitions — surrogate, durable, and natural keys plus the special “Unknown” member used above.
- Slowly Changing Dimension — the SCD Type 2 pattern this dimension uses for reassignments.
fact_invoice_lines— primary consumer; carriessales_territory_sk.
Change history / SCD
dim_sales_territory is an SCD Type 2 dimension. Each territory reassignment
closes the current row (valid_to, is_current = false) and inserts a new
current row with a fresh sales_territory_sk; the salesperson_id natural key
stays constant so all of a rep’s history can be grouped together. Facts reference the
version in effect at their transaction date via sales_territory_sk.
Invoice Lines
Type: Fact — transaction · Grain: one row per invoice line · Primary home:
fact_invoice_lines
Summary
fact_invoice_lines records the billed detail of customer invoices — one row for
each product line on each invoice. It is the backbone for revenue, discount, tax,
and margin reporting, sliced by customer, product, date, and sales territory.
It is not an invoice-header fact: whole-invoice charges (freight, invoice-level discounts) are not repeated on every line — keep those in a separate header fact or allocate them down to the line, or you will double-count.
Grain
One row per invoice line — a single product line item on a single invoice.
grain = (invoice_number, invoice_line_number)
invoice_number and invoice_line_number are degenerate dimensions (identifiers
with no dimension table of their own); together they uniquely identify a row.
Schema
| Column | Type | Role | Notes |
|---|---|---|---|
invoice_number | VARCHAR | degenerate dimension | the operational invoice id |
invoice_line_number | INT | degenerate dimension | line position within the invoice |
invoice_date_sk | BIGINT | FK → dim_date | date the invoice was issued |
customer_sk | BIGINT | FK → dim_customer | bill-to customer (version at invoice date) |
product_sk | BIGINT | FK → dim_product | product sold |
sales_territory_sk | BIGINT | FK → dim_sales_territory | rep / territory in effect at invoice date |
standard_cost_sk | BIGINT | FK → dim_standard_cost | standard-cost version in effect at invoice date |
currency_code | CHAR(3) | attribute | document currency (ISO 4217) |
quantity | DECIMAL(18,4) | measure | units invoiced; additive |
gross_amount | DECIMAL(18,4) | measure | list value before discount; additive |
discount_amount | DECIMAL(18,4) | measure | additive |
net_amount | DECIMAL(18,4) | measure | = gross − discount; additive |
tax_amount | DECIMAL(18,4) | measure | additive |
standard_cost_of_sale | DECIMAL(18,4) | measure | = quantity × standard_unit_cost; additive |
unit_price | DECIMAL(18,4) | measure | net per unit; non-additive (a rate) |
load_ts | TIMESTAMP | audit | warehouse load timestamp |
Measures & additivity
| Measure | Additivity | Notes |
|---|---|---|
quantity | additive | sums across all dimensions |
gross_amount, discount_amount, net_amount, tax_amount | additive | the money measures; sum freely |
standard_cost_of_sale | additive | pairs with net_amount to give margin |
unit_price | non-additive | a per-unit rate — never SUM; for an average use SUM(net_amount) / SUM(quantity) |
Margin is derived, not stored: net_amount − standard_cost_of_sale. Because
both inputs are additive, margin can be summed at any level.
Source & lineage
ERP.INVOICE_HEADER ──┐
ERP.INVOICE_LINE ──┼──> stg_invoice_lines ──> fact_invoice_lines
ERP.FX_RATES ──┘
The staging model joins header to line, then resolves each foreign key. The SCD
Type 2 keys (customer_sk, sales_territory_sk, standard_cost_sk) are looked
up as of the invoice date, so every line carries the dimension version that
was in effect when it was billed — point-in-time correct (see the pitfalls).
How to use it
Net sales by month and sales area
SELECT d.year_month,
st.territory_l2_name AS area,
SUM(f.net_amount) AS net_sales
FROM fact_invoice_lines f
JOIN dim_date d ON d.date_sk = f.invoice_date_sk
JOIN dim_sales_territory st ON st.sales_territory_sk = f.sales_territory_sk
GROUP BY d.year_month, st.territory_l2_name;
Standard margin by product category
SELECT p.product_category,
SUM(f.net_amount) AS net_sales,
SUM(f.standard_cost_of_sale) AS standard_cost,
SUM(f.net_amount - f.standard_cost_of_sale) AS standard_margin
FROM fact_invoice_lines f
JOIN dim_product p ON p.product_sk = f.product_sk
GROUP BY p.product_category;
Average selling price (the non-additive measure, done right)
-- weighted average, NOT AVG(unit_price)
SELECT product_sk,
SUM(net_amount) / NULLIF(SUM(quantity), 0) AS avg_unit_price
FROM fact_invoice_lines
GROUP BY product_sk;
Common Pitfalls
- Never
SUM(unit_price)(orAVGit). It is a per-unit rate — non-additive. Compute an average asSUM(net_amount) / SUM(quantity). - Header vs line grain. Whole-invoice charges (freight, invoice-level discounts) belong to the header, not each line. Repeating them per line double-counts — allocate them to lines or keep a separate header fact.
- Dimension fan-out. Joining to a dimension at a coarser grain or through a multi-valued bridge multiplies rows and inflates the measures. Join on the line’s own surrogate keys, and pre-aggregate the fact before any 1-to-many join.
- Point-in-time keys.
customer_sk,sales_territory_sk, andstandard_cost_skare the versions in effect at the invoice date. Don’t re-derive them from the current dimension row, or history is restated on every reorg / cost roll. - Returns & credit notes. Credits arrive as negative
quantity/ amounts. Keep the sign convention consistent soSUMnets correctly; don’t silently filter them out of margin. - Currency. Amounts are in
currency_code(document currency). Convert to a single reporting currency before summing across currencies. - Unknown members. A line that can’t resolve a dimension points at the
-1“Unknown” member (see Kimball Keys Definitions), never aNULLFK.
Related
- Standard Cost — supplies
standard_cost_sk/standard_unit_costbehindstandard_cost_of_sale. - Sales Territory — supplies
sales_territory_skfor territory rollups. - Grain and Kimball Keys Definitions — the concepts this fact builds on.
dim_date,dim_customer,dim_product— the remaining conformed dimensions.
Change history / load pattern
fact_invoice_lines is a transaction-grain fact, loaded insert-only: each
invoice line is written once and never updated. Corrections and returns flow in as
new (often negative) lines rather than edits, preserving an auditable history.
Late-arriving invoices are appended with their historical invoice_date_sk, and
their SCD Type 2 keys resolve to the version that was current then (or the -1
Unknown member until the dimension member appears).
Degenerate Dimension
Also known as: DD
A business identifier stored directly on a fact table that has no attributes of its own, and therefore no separate dimension table.
In practice: It usually identifies the operational transaction or document a fact row came from — an invoice number, order number, or ticket id. Keeping it on the fact lets you group the line items of a single document without a join.
Examples:
fact_invoice_lines.invoice_number = "INV-2026-008812"— one invoice spans many line-item fact rows but needs nodim_invoice.standard_cost_amount = 3,4- standard cost can be used as a degenerate dimension.
Common pitfalls:
- Don’t build a dimension table for it just to “be consistent” — with no descriptive attributes, it stays on the fact.
- If you later discover real attributes (status, channel), promote it to a proper dimension and replace the degenerate key with a surrogate-key foreign key.
See also: Kimball Keys Definitions
Grain
The grain of a fact table is the precise meaning of a single row — the level of detail it records.
In practice: Declare the grain in business terms before choosing dimensions or facts (“one row per order line”, “one row per daily account balance”). Every dimension and measure on the table must be true at that grain; mixing grains in a single table is the most common dimensional modelling mistake.
Example: fact_sales at the grain one row per product per order line — so
quantity and extended_amount are recorded per line, never per whole order.
See also: Standard Cost (declares its grain as an explicit key tuple)
Slowly Changing Dimension
Also known as: SCD
A dimension whose attribute values change occasionally over time, together with the technique chosen for whether to keep or overwrite the prior values.
In practice: Pick a strategy per attribute, not per table:
- Type 1 — overwrite the value; no history kept.
- Type 2 — add a new row with a new surrogate key and an effective-date range
(
valid_from/valid_to/is_current); full history preserved. - Type 3 — keep a “previous value” column alongside the current one; limited history.
Example: A customer moves city. Type 1 overwrites the city; Type 2 closes the
old row (valid_to) and inserts a new current row, so historical facts still join
to the address that was true at the time.
See also: Kimball Keys Definitions · Standard Cost