Over
the past year I have given many examples of fact tables in dimensional data
warehouses. You should recall that fact tables are the large tables "in
the middle" of a dimensional schema. Fact tables always have a multipart
key, in which each component of the key joins to a single dimension table. Fact
tables contain the numeric, additive fields that are best thought of as the
measurements of the business, measured at the intersection of all of the
dimension values.
There has been so much talk about numeric additive values in fact tables that it may come as a surprise that two kinds of very useful fact tables don't have any facts at all! They may consist of nothing but keys. These are called factless fact tables. The first type of factless fact table is a table that records an event. Many event-tracking tables in dimensional data warehouses turn out to be factless. One good example is shown in Figure 1. Here you will track student attendance at a college. Imagine that you have a modern student tracking system that detects each student attendance event each day. With the heightened powers of dimensional thinking that you have developed over the past few months, you can easily list the dimensions surrounding the student attendance event. These dimensions include:
Date: one record in this dimension for each day on the calendar Student: one record in this dimension for each student Course: one record in this dimension for each course taught each semester Teacher: one record in this dimension for each teacher Facility: one record in this dimension for each room, laboratory, or athletic field
The grain of the fact table in Figure 1 is the individual student attendance event. When the student walks through the door into the lecture, a record is generated. It is clear that these dimensions are all well-defined and that the fact table record, consisting of just the five keys, is a good representation of the student attendance event. Each of the dimension tables is deep and rich, with many useful textual attributes on which you can constrain and from which you can form row headers in reports.
The only problem is that there is no obvious fact to record each time a student attends a lecture or suits up for physical education. Tangible facts such as the grade for the course don't belong in this fact table. This fact table represents the student attendance process, not the semester grading process or even the midterm exam process. You are left with the odd feeling that something is missing.
Actually, this fact table consisting only of keys is a perfectly good fact table and probably ought to be left as is. A lot of interesting questions can be asked of this dimensional schema, including:
Which classes were the most heavily attended? Which classes were the most consistently attended? Which teachers taught the most students? Which teachers taught classes in facilities belonging to other departments? Which facilities were the most lightly used? What was the average total walking distance of a student in a given day?
My only real criticism of this schema is the unreadability of the SQL. Most of the above queries end up as counts. For example, the first question starts out as:
SELECT COURSE, COUNT(COURSE_KEY) FROM FACT_TABLE COURSE_DIMENSION, ETC. WHERE ... GROUP BY COURSE
In this case you are counting the course_keys non-distinctly. It is an oddity of SQL that you can count any of the keys and still get the same correct answer. For example:
SELECT COURSE, COUNT(TEACHER_KEY) FROM FACT_TABLE COURSE_DIMENSION, ETC. WHERE ... GROUP BY COURSE
would give the same answer because you are counting the number of keys that fly by the query, not their distinct values. Although this doesn't faze a SQL expert (such as my fellow columnist Joe Celko), it does make the SQL look odd. For this reason, data designers will often add a dummy "attendance" field at the end of the fact table in Figure 1. The attendance field always contains the value 1. This doesn't add any information to the database, but it makes the SQL much more readable. Of course, select count (*) also works, but most query tools don't automatically produce the select count (*) alternative. The attendance field gives users a convenient and understandable place to make the query.
Now your first question reads:
SELECT COURSE, SUM(ATTENDANCE) FROM FACT_TABLE COURSE_DIMENSION, ETC. WHERE ... GROUP BY COURSE
You can think of these kinds of event tables as recording the collision of keys at a point in space and time. Your table simply records the collisions that occur. (Automobile insurance companies often literally record collisions this way.) In this case, the dimensions of the factless fact table could be:
Date of Collision Insured Party Insured Auto Claimant Claimant Auto Bystander Witness Claim Type
Like the college course attendance example, this collision database could answer many interesting questions. The author has designed a number of collision databases, including those for both automobiles and boats. In the case of boats, a variant of the collision database required a "dock" dimension as well as a boat dimension.
A second kind of factless fact table is called a coverage table. A typical coverage table is shown in Figure 2. Coverage tables are frequently needed when a primary fact table in a dimensional data warehouse is sparse. Figure 2 also shows a simple sales fact table that records the sales of products in stores on particular days under each promotion condition. The sales fact table does answer many interesting questions but cannot answer questions about things that didn't happen. For instance, it cannot answer the question, "Which products were on promotion that didn't sell?" because it contains only the records of products that did sell. The coverage table comes to the rescue. A record is placed in the coverage table for each product in each store that is on promotion in each time period. Notice that you need the full generality of a fact table to record which products are on promotion. In general, which products are on promotion varies by all of the dimensions of product, store, promotion, and time. This complex many-to-many relationship must be expressed as a fact table. This is one of Kimball's Laws: Every many-to-many relationship is a fact table, by definition.
Perhaps some of you would suggest just filling out the original fact table with records representing zero sales for all possible products. This is logically valid, but it would expand the fact table enormously. In a typical grocery store, only about 10 percent of the products sell on any given day. Including all of the zero sales could increase the size of the database by a factor of ten. Remember, too, that you would have to carry all of the additive facts as zeros. Because many big grocery store sales fact tables approach a billion records, this would be a killer. Besides, there is something obscene about spending large amounts of money on disk drives to store zeros.
The coverage factless fact table can be made much smaller than the equivalent set of zeros described in the previous paragraph. The coverage table must only contain the items on promotion; the items not on promotion that also did not sell can be left out. Also, it is likely for administrative reasons that the assignment of products to promotions takes place periodically, rather than every day. Often a store manager will set up promotions in a store once each week. Thus we don't need a record for every product every day. One record per product per promotion per store each week will do. Finally, the factless format keeps us from storing explicit zeros for the facts as well.
Answering the question, "Which products were on promotion that did not sell?" requires a two-step application. First, consult the coverage table for the list of products on promotion on that day in that store. Second, consult the sales table for the list of products that did sell. The desired answer is the set difference between these two lists of products.
Coverage tables are also useful for recording the assignment of sales teams to customers in businesses in which the sales teams make occasional very large sales. In such a business, the sales fact table is too sparse to provide a good place to record which sales teams were associated with which customers. The sales team coverage table provides a complete map of the assignment of sales teams to customers, even if some of the combinations never result in a sale.
There has been so much talk about numeric additive values in fact tables that it may come as a surprise that two kinds of very useful fact tables don't have any facts at all! They may consist of nothing but keys. These are called factless fact tables. The first type of factless fact table is a table that records an event. Many event-tracking tables in dimensional data warehouses turn out to be factless. One good example is shown in Figure 1. Here you will track student attendance at a college. Imagine that you have a modern student tracking system that detects each student attendance event each day. With the heightened powers of dimensional thinking that you have developed over the past few months, you can easily list the dimensions surrounding the student attendance event. These dimensions include:
Date: one record in this dimension for each day on the calendar Student: one record in this dimension for each student Course: one record in this dimension for each course taught each semester Teacher: one record in this dimension for each teacher Facility: one record in this dimension for each room, laboratory, or athletic field
The grain of the fact table in Figure 1 is the individual student attendance event. When the student walks through the door into the lecture, a record is generated. It is clear that these dimensions are all well-defined and that the fact table record, consisting of just the five keys, is a good representation of the student attendance event. Each of the dimension tables is deep and rich, with many useful textual attributes on which you can constrain and from which you can form row headers in reports.
The only problem is that there is no obvious fact to record each time a student attends a lecture or suits up for physical education. Tangible facts such as the grade for the course don't belong in this fact table. This fact table represents the student attendance process, not the semester grading process or even the midterm exam process. You are left with the odd feeling that something is missing.
Actually, this fact table consisting only of keys is a perfectly good fact table and probably ought to be left as is. A lot of interesting questions can be asked of this dimensional schema, including:
Which classes were the most heavily attended? Which classes were the most consistently attended? Which teachers taught the most students? Which teachers taught classes in facilities belonging to other departments? Which facilities were the most lightly used? What was the average total walking distance of a student in a given day?
My only real criticism of this schema is the unreadability of the SQL. Most of the above queries end up as counts. For example, the first question starts out as:
SELECT COURSE, COUNT(COURSE_KEY) FROM FACT_TABLE COURSE_DIMENSION, ETC. WHERE ... GROUP BY COURSE
In this case you are counting the course_keys non-distinctly. It is an oddity of SQL that you can count any of the keys and still get the same correct answer. For example:
SELECT COURSE, COUNT(TEACHER_KEY) FROM FACT_TABLE COURSE_DIMENSION, ETC. WHERE ... GROUP BY COURSE
would give the same answer because you are counting the number of keys that fly by the query, not their distinct values. Although this doesn't faze a SQL expert (such as my fellow columnist Joe Celko), it does make the SQL look odd. For this reason, data designers will often add a dummy "attendance" field at the end of the fact table in Figure 1. The attendance field always contains the value 1. This doesn't add any information to the database, but it makes the SQL much more readable. Of course, select count (*) also works, but most query tools don't automatically produce the select count (*) alternative. The attendance field gives users a convenient and understandable place to make the query.
Now your first question reads:
SELECT COURSE, SUM(ATTENDANCE) FROM FACT_TABLE COURSE_DIMENSION, ETC. WHERE ... GROUP BY COURSE
You can think of these kinds of event tables as recording the collision of keys at a point in space and time. Your table simply records the collisions that occur. (Automobile insurance companies often literally record collisions this way.) In this case, the dimensions of the factless fact table could be:
Date of Collision Insured Party Insured Auto Claimant Claimant Auto Bystander Witness Claim Type
Like the college course attendance example, this collision database could answer many interesting questions. The author has designed a number of collision databases, including those for both automobiles and boats. In the case of boats, a variant of the collision database required a "dock" dimension as well as a boat dimension.
A second kind of factless fact table is called a coverage table. A typical coverage table is shown in Figure 2. Coverage tables are frequently needed when a primary fact table in a dimensional data warehouse is sparse. Figure 2 also shows a simple sales fact table that records the sales of products in stores on particular days under each promotion condition. The sales fact table does answer many interesting questions but cannot answer questions about things that didn't happen. For instance, it cannot answer the question, "Which products were on promotion that didn't sell?" because it contains only the records of products that did sell. The coverage table comes to the rescue. A record is placed in the coverage table for each product in each store that is on promotion in each time period. Notice that you need the full generality of a fact table to record which products are on promotion. In general, which products are on promotion varies by all of the dimensions of product, store, promotion, and time. This complex many-to-many relationship must be expressed as a fact table. This is one of Kimball's Laws: Every many-to-many relationship is a fact table, by definition.
Perhaps some of you would suggest just filling out the original fact table with records representing zero sales for all possible products. This is logically valid, but it would expand the fact table enormously. In a typical grocery store, only about 10 percent of the products sell on any given day. Including all of the zero sales could increase the size of the database by a factor of ten. Remember, too, that you would have to carry all of the additive facts as zeros. Because many big grocery store sales fact tables approach a billion records, this would be a killer. Besides, there is something obscene about spending large amounts of money on disk drives to store zeros.
The coverage factless fact table can be made much smaller than the equivalent set of zeros described in the previous paragraph. The coverage table must only contain the items on promotion; the items not on promotion that also did not sell can be left out. Also, it is likely for administrative reasons that the assignment of products to promotions takes place periodically, rather than every day. Often a store manager will set up promotions in a store once each week. Thus we don't need a record for every product every day. One record per product per promotion per store each week will do. Finally, the factless format keeps us from storing explicit zeros for the facts as well.
Answering the question, "Which products were on promotion that did not sell?" requires a two-step application. First, consult the coverage table for the list of products on promotion on that day in that store. Second, consult the sales table for the list of products that did sell. The desired answer is the set difference between these two lists of products.
Coverage tables are also useful for recording the assignment of sales teams to customers in businesses in which the sales teams make occasional very large sales. In such a business, the sales fact table is too sparse to provide a good place to record which sales teams were associated with which customers. The sales team coverage table provides a complete map of the assignment of sales teams to customers, even if some of the combinations never result in a sale.
No comments:
Post a Comment