The
operational data store (ODS) is a part of the data warehouse environment about
which many managers have confused feelings. I am often asked, "Should I
build an ODS?" I have decided that the underlying question is, "What
is an ODS, anyway?"According to Bill Inmon and Claudia Imhoff in their
book Building the Operational Data Store (John Wiley & Sons, 1996),
an ODS is a "subject-oriented, integrated, volatile, current valued data
store, containing only corporate detailed data." This definition for an
ODS reflects a real market need for current, operational data.
If anything, the need for the ODS function has grown in recent years. As our data warehouse systems have gotten bigger, the need to analyze ever more detailed customer behavior and ever more specific operational texture has grown with them. In most cases the analysis must be done on the most granular and detailed data that we can possibly source. The emergence of data mining has also demanded that we crawl through reams of the lowest-level data, looking for correlations and patterns.
Until now, the ODS has been considered a separate system from the main data warehouse because the ODS was based on "operational" data. The downstream data warehouse was almost always summarized. Because the warehouse was a complete historical record, we usually didn't dare store this operational (transactional) data as a complete history.
However, the hardware and software technology supporting data warehousing has kept rolling forward, able to store more data and able to process larger answer sets. We also have discovered how to extract and clean data rapidly, and we have figured out how to model it for user understandability and extreme query performance. It is fashionable these days to talk about multiterabyte data warehouses, and consultants braver than I talk about petabyte (1,000 terabyte) data warehouses being just around the corner.
Now I am getting suspicious of the assumption underlying the separate ODS: that you cannot store the individual transactions of a big business as a historical time series. Let us stop for a moment and estimate the number of low-level sales transactions in a year for a large retailer. This is surprisingly simple. I use the following technique to triangulate the overall size of a data warehouse before I ever interview the end users.
Imagine that our large retailer has six billion dollars in retail sales per year. The only other fact we need is the average size of a line item on a typical sales transaction. Suppose that our retailer is a drug store and that the average dollar value of a line item is two dollars. We can immediately estimate the number of transaction line items per year as six billion dollars divided by two dollars, or three billion. This is a large number, but it is well within the range of many current data warehouses. Even a three-year history would "only" generate nine billion records. If we did a tight dimensional design with four 4-byte dimension keys and four 4-byte facts, then the raw fact table data size per year would be three billion times 32 bytes, or 96GB. Three years of raw data would be 288GB. I know of more than a dozen data warehouses bigger than this today.
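To make this triangulation repeatable, here is a minimal back-of-the-envelope sizing sketch in Python. The retailer figures and the 32-byte row layout are the illustrative assumptions from the text, not measurements of any real system:

```python
# Back-of-envelope fact table sizing for the drug store example.
# Assumptions: $6 billion annual sales, $2 average line item, and a
# fact row of four 4-byte dimension keys plus four 4-byte facts.

annual_sales_dollars = 6_000_000_000
avg_line_item_dollars = 2
years_of_history = 3
bytes_per_row = 4 * 4 + 4 * 4  # 32 bytes per fact row

rows_per_year = annual_sales_dollars // avg_line_item_dollars  # 3 billion
total_rows = rows_per_year * years_of_history                  # 9 billion
gb_per_year = rows_per_year * bytes_per_row / 1e9              # 96 GB
gb_total = total_rows * bytes_per_row / 1e9                    # 288 GB

print(f"{rows_per_year:,} line items/year -> {gb_per_year:.0f} GB/year, "
      f"{gb_total:.0f} GB over {years_of_history} years")
```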
Our regular data warehouses can now embrace the lowest-level transaction data as a multiyear historical time series, and we are using high-performance data extracting and cleaning tools to pull this data out of the legacy systems at almost any desired frequency each day. So why is my ODS a separate system? Why not just make the ODS the leading, breaking wave of the data warehouse itself?
With the growing interest in mining fine-grained customer behavior in the form of individual customer transactions, we increasingly need detailed transaction-time histories available for analysis. The effort expended to build a lightweight, throwaway, traditional ODS (that is, a volatile, current-valued data store restricted to current data) is becoming a dead end and a distraction.
Let us take this opportunity to tighten and restrict the definition of the ODS. We will view the ODS simply as the "front edge" of the existing data warehouse. By bringing the ODS into the data warehouse environment, we make it more useful to clerks, executives, and analysts, and we need only build a single extract system. Figure 1 shows the original two-system architecture; Figure 2 shows this new, simplified view of the ODS.
Let us redefine the ODS as follows. The ODS is a subject-oriented, integrated, frequently augmented store of detailed data in the enterprise data warehouse.
The ODS is subject-oriented. That is, the ODS, like the rest of the data warehouse, is organized around specific business domains such as Customer, Product, Activity, Policy, Claim, or Shipment.
The ODS is integrated. The ODS gracefully bridges between subjects and presents an overarching view of the business rather than an incompatible stovepipe view of the business.
The ODS is frequently augmented. This requirement is a significant departure from the original statement that the ODS was volatile; that is, that it was constantly being overwritten and its data structures were constantly changing. The new requirement of frequent augmentation also invalidates Inmon and Imhoff's statement that the ODS contains only current valued data. We aren't afraid to store the transaction history. In fact, that has now become our mission.
The ODS sits within the full data warehouse framework of historical data and summarized data. In a data warehouse containing a monthly summarized view of data in addition to the transaction detail, the input flow to the ODS also contributes to a special "current rolling month." In many cases, when the last day of the month is reached, the current rolling month becomes the most recent member of the standard months in the time series and a new current rolling month is created.
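To see the mechanics of that rollover concretely, here is a minimal sketch in Python. The in-memory tables, key names, and account-level totals are illustrative assumptions, not a prescribed design:

```python
from datetime import date, timedelta

standard_months = {}        # closed months in the time series, keyed "YYYY-MM"
current_rolling_month = {}  # accumulates the open month's daily loads

def load_daily_extract(load_date: date, daily_totals: dict) -> None:
    """Fold one day's cleaned extract into the current rolling month;
    at month end, promote it into the standard monthly time series."""
    global current_rolling_month
    for key, amount in daily_totals.items():
        current_rolling_month[key] = current_rolling_month.get(key, 0.0) + amount
    if (load_date + timedelta(days=1)).month != load_date.month:
        # The rolling month becomes the newest standard month...
        standard_months[load_date.strftime("%Y-%m")] = current_rolling_month
        # ...and a fresh rolling month starts accumulating.
        current_rolling_month = {}

load_daily_extract(date(1998, 1, 31), {"account-12345": 1250.00})
print(standard_months["1998-01"])  # {'account-12345': 1250.0}
```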
The ODS naturally supports a collective view of data. We now see how the ODS presents a collective view to the executive who must be able to see a customer's overall account balance. The executive can immediately and gracefully link to last month's collective view of the customer (via the time series) and to the surrounding class of customers (via the data warehouse aggregations).
The ODS is organized for rapid updating directly from the legacy system. The data extraction and cleaning industry has come a long way in the last few years. We can pipeline data from the legacy systems through data cleaning and data integrating steps and drop it into the ODS portion of the data warehouse. Inmon and Imhoff's original distinctions of Class I (near realtime upload), Class II (upload every few hours), and Class III (upload perhaps once per day) are still valid, but the architectural differences in the extract pipeline are far less interesting than they used to be. The ability to upload data very frequently will probably be limited more by waiting for remote operational systems to deliver necessary data than by computing or bandwidth restrictions in the data pipeline.
The ODS should be organized around a star join schema design. Inmon and Imhoff recommend the star join data model as "the most fundamental description of the design of the data found in the operational data store." The star join, or dimensional model, is the preferred data model for achieving user understandability and predictable high performance. (For further information on this subject, please see my article, "A Dimensional Modeling Manifesto," in the August 1997 issue of DBMS.)
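For readers who want to see the shape of a star join, here is a toy dimensional model sketched in Python; the dimension and fact names are hypothetical, and a real implementation would of course live in a relational database:

```python
# A miniature star schema: a central fact table whose keys point
# into small, descriptive dimension tables.
date_dim    = {19980115: {"day": "1998-01-15", "month": "1998-01"}}
product_dim = {1: {"product": "aspirin", "category": "pharmacy"}}
store_dim   = {7: {"store": "Main St", "region": "West"}}

sales_fact = [
    {"date_key": 19980115, "product_key": 1, "store_key": 7,
     "dollars": 3.98, "units": 2},
]

# A "star join": constrain the dimensions, then sum the additive facts.
total = sum(
    row["dollars"]
    for row in sales_fact
    if product_dim[row["product_key"]]["category"] == "pharmacy"
    and store_dim[row["store_key"]]["region"] == "West"
)
print(total)  # 3.98
```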
The ODS contains all of the text and numbers required to describe low-level transactions, but may additionally contain back references to the legacy system that would allow realtime links to be opened to the legacy systems through terminal- or transaction-based interfaces. This is an interesting aspect of the original definition of the ODS, and is somewhat straightforward if the low-level transactions are streamed out of the legacy system and into the ODS portion of the data warehouse. What this means is that operational keys like the invoice number and line number are kept in the data flow all the way into the ODS, so that an application can pick up these keys and link successfully back to the legacy system interface.
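A minimal sketch of this key-preserving data flow appears below; the field names and the stand-in for the legacy interface call are illustrative assumptions:

```python
# Each ODS fact row carries its operational keys (invoice number and
# line number) alongside the warehouse dimension keys, so an
# application can drill back to the legacy system on demand.
fact_row = {
    "date_key": 19980115, "product_key": 1, "store_key": 7,
    "invoice_number": "INV-004712",  # operational key kept in the flow
    "line_number": 3,                # operational key kept in the flow
    "dollars": 3.98,
}

def drill_back(row: dict) -> str:
    """Compose the request a terminal- or transaction-based legacy
    interface would need to display the original document."""
    return f"FETCH INVOICE {row['invoice_number']} LINE {row['line_number']}"

print(drill_back(fact_row))  # FETCH INVOICE INV-004712 LINE 3
```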
The ODS is supported by extensive metadata needed to explain and present the meaning of the data to end users through query and reporting tools, as well as metadata describing an extract "audit" of the data warehouse contents.
Bringing the ODS into the existing data warehouse framework solves a number of problems. We can now focus on building a single data extract pipeline. We no longer need a split personality in which we tolerate a volatile, changing data structure with no history and no support for performance-enhancing aggregations. Our techniques have improved in the last few years. We understand how to take a flow of atomic-level transactions, put them into a dimensional framework, and simultaneously build both a detailed transaction history with no compromise of detail and a regular series of periodic snapshots that lets us rapidly track a complex enterprise over time. As I just mentioned, a special snapshot in this series is the current rolling snapshot at the very front of the time series. This is the echo of the former separate ODS. In next month's column, I will describe the dual personality of transaction and snapshot schemas that is at the heart of the "operational data warehouse."
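Here is a minimal sketch of that dual personality: one pass over the incoming transaction flow appends to the detailed history and, at the same time, maintains the current rolling snapshot. The account-level structure is an illustrative assumption:

```python
transaction_history = []  # every atomic transaction, kept as a full time series
current_snapshot = {}     # the current rolling status of each account

def load_transaction(txn: dict) -> None:
    """Feed both schemas from a single extract pipeline."""
    transaction_history.append(txn)  # detailed history, no loss of detail
    acct = current_snapshot.setdefault(
        txn["account"], {"balance": 0.0, "txn_count": 0})
    acct["balance"] += txn["amount"]  # the snapshot tracks current status
    acct["txn_count"] += 1

load_transaction({"account": "12-345", "amount": 250.00, "type": "deposit"})
load_transaction({"account": "12-345", "amount": -40.00, "type": "atm"})
print(current_snapshot["12-345"])  # {'balance': 210.0, 'txn_count': 2}
```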
Finally, if you have been reading this with a skeptical perspective, and you have been saying to yourself, "storing all that transaction detail just isn't needed in my organization: all my management needs is high-level summaries," then broaden your perspective and listen to what is going on in the marketing world. I believe that we are in the midst of a major move to one-on-one marketing in which large organizations are seeking to understand and respond to detailed and individual customer behavior. Banks need to know exactly who is at that ATM between 5:00 p.m. and 6:00 p.m., what transactions they are performing, and how that pattern has evolved this year in response to various bank incentive programs. Catalina Marketing is ready to print coupons at your grocery store register that respond to what you have in your shopping basket and what you have been buying in recent trips to the store. To do this, these organizations need all the gory transaction details, both current and historical.
Our data warehouse hardware and software are ready for this revolution. Our data warehouse design techniques are ready for this revolution. Are you ready? Bring your ODS in out of the rain and into the warehouse.
Figure 1. The original ODS architecture necessitated two pathways and two systems because the main data warehouse wasn't prepared to store low-level transactions.
Figure 2. The new ODS reality. The cleaning and loading pathway needs only to be a single system because we are now prepared to build our data warehouse on the foundation of individual transactions.