
Friday, 18 October 2013

Most ETL systems load dimensions first and then facts. If a fact arrives before its dimension record, how do you load it?

The simplest option is to treat the early-arriving fact as "unknown" and assign it an unknown dimension key (usually -1) in the fact table.

Another option is the idea of an inferred dimension. As you load your fact table data, you do a lookup against your dimension.
If you have a match, cool, take that key and move on. If you don't have a match, instead of setting the key to -1 (unknown), do this:

1) Insert a new dimension record with the business key from your fact table.
2) Grab the newly created dimension key from the record you just inserted.
3) Merge that key back into your fact data pipeline.
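
A minimal sketch of this inferred-dimension flow, assuming a customer dimension keyed by a surrogate customer_key with business key customer_id (the table and column names, and the use of SQLite, are illustrative assumptions, not from the original post):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY,
                           customer_id TEXT, customer_name TEXT, is_inferred INTEGER);
CREATE TABLE fact_sales   (customer_key INTEGER, sales_amount REAL);
""")

def get_or_infer_dimension_key(business_key):
    """Look up the surrogate key; if the dimension row has not arrived yet,
    insert an inferred (placeholder) member and return its newly created key."""
    row = conn.execute("SELECT customer_key FROM dim_customer WHERE customer_id = ?",
                       (business_key,)).fetchone()
    if row:
        return row[0]                          # dimension arrived first: reuse its key
    cur = conn.execute(
        "INSERT INTO dim_customer (customer_id, customer_name, is_inferred) "
        "VALUES (?, 'Unknown', 1)", (business_key,))
    return cur.lastrowid                       # key of the newly inserted inferred member

def load_fact_row(business_key, sales_amount):
    key = get_or_infer_dimension_key(business_key)
    conn.execute("INSERT INTO fact_sales (customer_key, sales_amount) VALUES (?, ?)",
                 (key, sales_amount))

load_fact_row("CUST-042", 19.95)   # the fact arrives before its dimension row
```

When the real dimension row eventually arrives, the dimension load simply updates the inferred member in place (matching on the business key), so the fact rows already pointing at that surrogate key need no rework.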

Thursday, 12 September 2013

Metadata

Introduction

In this electronic age where digital information is being created at a fantastic rate, tools are necessary to locate, access, manage, and understand it all—and that's where metadata comes in. A common definition of the term is "data about data." Metadata can serve many functions in data administration, including detailing the data's history, conditions for use, format, and management requirements. The Minnesota State Archives' interest in metadata stems from its mandate to identify, collect, preserve, and make available the historically valuable records of government regardless of format.

Summary


Metadata is usually defined as "data about data": information (e.g., creator name, creation date) that is used to facilitate intellectual control of, and structured access to, other information. Metadata allows users to locate and evaluate data without each person having to discover it anew with every use. Its basic elements are a structured format and a controlled vocabulary, which together allow for a precise and comprehensible description of content, location, and value.

While the term itself might sound new and trendy, the concept it describes is not. In some fashion, metadata has always been with us, apparent in everything from program listings in TV Guide to the nutritional information on the back of a cereal box. For government records, the familiar forms of metadata are the recordkeeping metadata standard and the records retention schedule.

(The State of Minnesota defines a government record as an item that documents an official government transaction or action: "All cards, correspondence, disks, maps, memoranda, microfilm, papers, photographs, recordings, reports, tapes, writings and other data, information or documentary material, regardless of physical form or characteristics, storage media or condition of use, made or received by an officer or agency of the state and an officer or agency of a county, city, town, school district, municipal subdivision or corporation or other public authority or political entity within the state pursuant to state law or in connection with the transaction of public business by an officer or agency…. The term 'records' excludes data and information that does not become part of an official transaction, library and museum material made or acquired and kept solely for reference or exhibit purposes, extra copies of documents kept only for convenience of reference, and stocks of publications and processed documents, and bonds, coupons, or other obligations or evidence of indebtedness, the destruction or other disposition of which is governed by other laws" (Minnesota Statutes, section 138.17, subd. 1). More generally, a record is "information that is inscribed on a tangible medium or that is stored in an electronic or other medium and is retrievable in perceivable form" (Minnesota Statutes, section 325L.02).)

Anyone who has suffered the exercise in irrelevance offered by an Internet search engine will appreciate the value of precise metadata. Because information ("data, text, images, sounds, codes, computer programs, software, databases, or the like," Minnesota Statutes, section 325L.02) in a digital format is only legible through the use of intermediary hardware and software, the role of metadata in information technology is fundamentally important. In any system, given the volume of information it contains, the uses to which it can be put, and the costs involved, metadata is the basic tool for efficiency and effectiveness.

Whatever you want to do with the information (e.g., protect its confidentiality, present it as evidence, provide citizens access to it, broadcast it, share it, preserve it, destroy it) will be feasible only if you and your partners can understand and rely upon the metadata describing it. Using metadata effectively means understanding and applying the standards appropriate to your needs.

Metadata Functions
Government agencies routinely use metadata to fulfill a variety of functions, but the primary uses are for:

  • Legal and statutory reasons (e.g., to satisfy records management laws and the rules of evidence)
  • Technological reasons (e.g., to design and document systems)
  • Operational or administrative reasons (e.g., to document decisions and establish accountability)
  • Service to citizens, agency staff, and others (e.g., to locate and share information)
In all of these cases, metadata standards will be effective only if they rely on a structured format and controlled vocabulary. "Structured format" means the metadata is defined in terms of specific, standardized elements or fields. For example, a library catalog entry for a book will identify its author, title, subject(s), and location, among other things. Unless all the elements are there, users will not be able to evaluate the metadata; they won't be able to answer the question "Is this the book I want?"
"Controlled vocabulary" means that there is a standard as well for the content of the elements. For example, the nutritional information on the back of a box of cereal is often defined in terms of weight per serving. We know what “sugar: three grams” means. It refers to a standard unit of measurement that allows us to compare the sugar content of one cereal to that of another. But if the box read "just the way you like it" or "pretty sweet," that would mean different things to different people. We couldn't compare a subjective review like that to what's on the back of another box of cereal.
To work effectively, the elements and components of metadata should have an accepted, precise meaning that reflects a common understanding among its creators and its users. That allows for evaluation and comparison, for selecting the information you want from all the information available.
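
As a rough illustration of these two ideas (the element names and the allowed values below are invented for the example, not drawn from any particular metadata standard):

```python
from dataclasses import dataclass

# Controlled vocabulary: the only acceptable values for the "format" element.
ALLOWED_FORMATS = {"paper", "microfilm", "pdf", "tiff", "csv"}

@dataclass
class RecordMetadata:
    """Structured format: every record is described with the same named elements."""
    title: str
    creator: str
    date_created: str   # ISO 8601, e.g. "2013-09-12"
    format: str

    def validate(self):
        if self.format not in ALLOWED_FORMATS:
            raise ValueError(f"'{self.format}' is not in the controlled vocabulary")

meta = RecordMetadata(title="Annual report", creator="State Archives",
                      date_created="2013-09-12", format="pdf")
meta.validate()   # passes; a value like "pretty good scan" would be rejected
```

The structured format is what lets a user ask "is this the record I want?"; the controlled vocabulary is what lets two records be compared on equal terms.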
Metadata and Information Technology
Metadata is useful for the management of information in any storage format, paper or digital. But it is critically important for information in a digital format, because such information is only legible through the use of intermediary hardware and software. We can open up a book or even hold microfilm up to a light to determine what it says. But we can't just look at a CD and say what's on it. We cannot possibly hope to locate, evaluate, or use all the files on a single PC, let alone the Internet, without metadata.

If information technology makes metadata necessary, it's information technology that makes metadata useful. Special software applications, such as TagGen, make the creation of standardized metadata simpler. Databases store and provide access to metadata. Most software applications automatically create metadata and associate it with files. One example is the header and routing information that accompany an e-mail message. Another is the set of properties created with every Microsoft Word document; certain elements such as the title, author, file size, etc., are automatically created, but other elements can be customized and created manually. Normally, some combination of automatically and manually created information is best for precise and practical metadata.
Most important, metadata can inform business rules and software code that transforms it into "executable knowledge." For example, metadata can be used for batch processing of files. A date element is critical to records management, as most record retention schedules are keyed to a record's date of creation. Metadata in more sophisticated data formats, such as eXtensible Markup Language (XML), allows for extraction, use, and calculation based on specific components of a metadata record.
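
A small sketch of metadata becoming "executable knowledge": reading a creation-date element from an XML metadata record and applying a retention rule keyed to it (the element names and the seven-year retention period are assumptions made for the example):

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta

RETENTION_YEARS = 7   # assumed retention period for the example

xml_metadata = """<record>
  <title>Permit application</title>
  <dateCreated>2005-03-14</dateCreated>
</record>"""

root = ET.fromstring(xml_metadata)
created = datetime.strptime(root.findtext("dateCreated"), "%Y-%m-%d")

# The date element drives the batch decision: keep, or flag for disposition review.
if datetime.now() - created > timedelta(days=365 * RETENTION_YEARS):
    print("Retention period expired: flag record for disposition review")
else:
    print("Record still within its retention period")
```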


By Ralph Kimball


Metadata is an amazing topic in the data warehouse world. Considering that we don’t know exactly what it is, or where it is, we spend more time talking about it, worrying about it, and feeling guilty we aren’t doing anything about it than any other topic. Several years ago we decided that metadata is any data about data. This wasn’t very helpful because it didn’t paint a clear picture in our minds. This fuzzy view gradually cleared up, and recently we have been talking more confidently about the "back-room metadata" that guides the extraction, cleaning, and loading processes, as well as the "front-room metadata" that makes our query tools and report writers function smoothly.
The back-room metadata presumably helps the DBA bring the data into the warehouse and is probably also of interest to business users when they ask from where the data came. The front-room metadata is mostly for the benefit of the end user, and its definition has been expanded not only to include the oil that makes our tools function smoothly, but also a kind of dictionary of business content represented by all the data elements.
Even these definitions, as helpful as they are, fail to give the data warehouse manager much of a feeling for what it is he or she is supposed to do. It sounds like whatever this metadata stuff is, it’s important, and we better:
  • Make a nice annotated list of all of it.
  • Decide just how important each part is.
  • Take responsibility for it.
  • Decide what constitutes a consistent and working set of it.
  • Decide whether to make it or buy it.
  • Store it somewhere for backup and recovery.
  • Make it available to the people who need it.
  • Assure its quality and make it complete and up to date.
  • Control it from one place.
  • Document all of these responsibilities well enough to hand this job off (soon).
Now there is a good, solid IT set of responsibilities. So far, so good. The only trouble is, we haven’t really said what it is yet. We do notice that the last item in the above list really isn’t metadata, but rather, data about metadata. With a sinking feeling, we realize we probably need meta meta data data.
To get this under control, let’s try to make a complete list of all possible types of metadata. We surely won’t succeed in this first try, but we will learn a lot. First, let’s go to the source systems, which could be mainframes, separate nonmainframe servers, users’ desktops, third-party data providers, or even online sources. We will assume that all we do here is read the source data and extract it to a data staging area that could be on the mainframe or could be on a downstream machine. Taking a big swig of coffee, we start the list:
  • Repository specifications
  • Source schemas
  • Copy-book specifications
  • Proprietary or third-party source specifications
  • Print spool file source specifications
  • Old format specifications for archived mainframe data
  • Relational, spreadsheet, and Lotus Notes source specifications
  • Presentation graphics source specifications (for example, Powerpoint)
  • URL source specifications
  • Ownership descriptions of each source
  • Business descriptions of each source
  • Update frequencies of original sources
  • Legal limitations on the use of each source
  • Mainframe or source system job schedules
  • Access methods, access rights, privileges, and passwords for source access
  • The Cobol/JCL, C, or Basic to implement extraction
  • The automated extract tool settings, if we use such a tool
  • Results of specific extract jobs including exact times, content, and completeness.
Now let’s list all the metadata needed to get the data into a data staging area and prepare it for loading into one or more data marts. We may do this on the mainframe with hand-coded Cobol, or by using an automated extract tool. Or we may bring the flat file extracts more or less untouched into a separate data staging area on a different machine. In any case, we have to be concerned about metadata describing:
  • Data transmission scheduling and results of specific transmissions
  • File usage in the data staging area including duration, volatility, and ownership
  • Definitions of conformed dimensions and conformed facts
  • Job specifications for joining sources, stripping out fields, and looking up attributes
  • Slowly changing dimension policies for each incoming descriptive attribute (for example, overwrite, create new record, or create new field)
  • Current surrogate key assignments for each production key, including a fast lookup table to perform this mapping in memory (see the sketch after this list)
  • Yesterday’s copy of a production dimension to use as the basis for Diff Compare
  • Data cleaning specifications
  • Data enhancement and mapping transformations (for example, expanding abbreviations and providing more detail)
  • Transformations required for data mining (for example, interpreting nulls and scaling numerics)
  • Target schema designs, source to target data flows, target data ownership, and DBMS load scripts
  • Aggregate definitions
  • Aggregate usage statistics, base table usage statistics, potential aggregates
  • Aggregate modification logs
  • Data lineage and audit records (where exactly did this record come from and when)
  • Data transform run-time logs, success summaries, and time stamps
  • Data transform software version numbers
  • Business descriptions of extract processing
  • Security settings for extract files, software, and metadata
  • Security settings for data transmission (that is, passwords, certificates, and so on)
  • Data staging area archive logs and recovery procedures
  • Data staging-archive security settings.
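To make the surrogate key item above concrete, here is a minimal sketch of an in-memory mapping from production (natural) keys to current surrogate keys during processing, with a simple "create new record" policy when a tracked attribute changes; all names and values are illustrative:

```python
# key_map: production key -> current surrogate key, loaded once before the run.
key_map  = {"CUST-001": 17, "CUST-002": 18}
dim_rows = {17: {"customer_id": "CUST-001", "city": "Minneapolis"},
            18: {"customer_id": "CUST-002", "city": "St. Paul"}}
next_key = 19

def assign_surrogate_key(production_key, incoming_city):
    """Return the surrogate key for a fact row, creating a new dimension
    version (the "create new record" policy) when the tracked attribute changes."""
    global next_key
    current = key_map.get(production_key)
    if current is not None and dim_rows[current]["city"] == incoming_city:
        return current                         # unchanged: reuse the current key
    new_key = next_key
    next_key += 1
    dim_rows[new_key] = {"customer_id": production_key, "city": incoming_city}
    key_map[production_key] = new_key          # the map always points at the current version
    return new_key

print(assign_surrogate_key("CUST-001", "Minneapolis"))   # 17: no change
print(assign_surrogate_key("CUST-002", "Duluth"))        # 19: new version created
```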
Once we have finally transferred the data to the data mart DBMS, then we must have metadata, including:
  • DBMS system tables
  • Partition settings
  • Indexes
  • Disk striping specifications
  • Processing hints
  • DBMS-level security privileges and grants
  • View definitions
  • Stored procedures and SQL administrative scripts
  • DBMS backup status, procedures, and security.
In the front room, we have metadata extending to the horizon, including:
  • Precanned query and report definitions
  • Join specification tool settings
  • Pretty print tool specifications (for relabeling fields in readable ways)
  • End-user documentation and training aids, both vendor supplied and IT supplied
  • Network security user privilege profiles, authentication certificates, and usage statistics, including logon attempts, access attempts, and user ID by location reports
  • Individual user profiles, with links to human resources to track promotions, transfers, and resignations that affect access rights
  • Links to contractor and partner tracking where access rights are affected
  • Usage and access maps for data elements, tables, views, and reports
  • Resource charge back statistics
  • Favorite Web sites (as a paradigm for all data warehouse access).
Now we can see why we didn’t know what this metadata was all about. It is everything! Except for the data itself. Suddenly, the data seems like the simplest part.
With this perspective, do we really need to keep track of all this? We do, in my opinion. This list of metadata is the essential framework of your data warehouse. Just listing it as we have done seems quite helpful. It’s a long list, but we can go down through it, find each kind of metadata, and identify what it is used for and where it is stored.
There are some sobering realizations, however. Much of this metadata needs to reside on the machines close to where the work occurs. Programs, settings, and specifications that drive processes have to be in certain destination locations and in very specific formats. That isn’t likely to change soon.

Once we have taken the first step of getting our metadata corralled and under control, can we hope for tools that will pull all the metadata together in one place and be able to read and write it as well? With such a tool, not only would we have a uniform user interface for all this disparate metadata, but on a consistent basis we would be able to snapshot all the metadata at once, back it up, secure it, and restore it if we ever lost it.

Don’t hold your breath. As you can appreciate, this is a very hard problem, and encompassing all forms of metadata will require a kind of systems integration that we don’t have today. I believe the Metadata Coalition (a group of vendors trying seriously to solve the metadata problem) will make some reasonable progress in defining common syntax and semantics for metadata, but it has been two years and counting since they started this effort. Unfortunately, Oracle, the biggest DBMS player, has chosen to sit out this effort and has promised to release its own proprietary metadata standard. Other vendors are making serious efforts to extend their product suites to encompass many of the activities listed in this article and simultaneously to publish their own framework for metadata. These vendors include Microsoft, who’s working with the Metadata Coalition to extend the Microsoft Repository, as well as a pack of aggressive, smaller players proposing comprehensive metadata frameworks, including Sagent, Informatica, VMark, and D2K. In any case, these vendors will have to offer significant business advantages in order to compel other vendors to write to their specifications.

Data Warehousing Project - Data Modeling

As it is difficult to talk about data modeling without going into some technical terms, I will first present several terms that are used commonly in the data modeling field:
Dimension: A category of information. For example, the time dimension.
Attribute: A unique level within a dimension. For example, Month is an attribute in the Time Dimension.
Hierarchy: The specification of levels that represents the relationship between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year --> Quarter --> Month --> Day.
Fact Table: A fact table is a table that contains the measures of interest. For example, sales amount would be such a measure. This measure is stored in the fact table with the appropriate granularity. For example, it can be sales amount by store by day. In this case, the fact table would contain three columns: A date column, a store column, and a sales amount column.
Lookup Table: The lookup table provides the detailed information about the attributes. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse. Each row (each quarter) may have several fields, one for the unique ID that identifies the quarter, and one or more additional fields that specifies how that particular quarter is represented on a report (for example, first quarter of 2001 may be represented as "Q1 2001" or "2001 Q1").
The first step in data modeling is to illustrate the relationships between the entities for the enterprise. The manifestation of this illustration is called the "Entity-Relationship (ER) Diagram". From the ER diagram we can then design logical, and subsequently physical, data models.
In designing data models for data warehouses / data marts, the most commonly used schema types are Star Schema, Snowflake Schema, and Federated Star Schema.
Star Schema: In the star schema design, a single object (the fact table) sits in the middle and is radially connected to other surrounding objects (dimension lookup tables) like a star. A star schema can be simple or complex. A simple star consists of one fact table; a complex star can have more than one fact table.
Snowflake Schema: The snowflake schema is an extension of the star schema where each point of the star explodes into more points. The main advantage of the snowflake schema is the improvement in query performance due to minimized disk storage requirements and joining smaller lookup tables. The main disadvantage of the snowflake schema is the additional maintenance effort needed due to the increased number of lookup tables.
Federated Star Schema: In the federated star schema, instead of having the fact table in the middle, a chosen dimension sits in the middle. Then all the fact tables related to this particular dimension radiate from it. Finally, all the other dimensions that are related to each of the fact tables complete the loop. This type of schema is best used when one wants to focus the analysis on that one particular dimension. Because all the fact tables are connected to one central dimension, this is an excellent way of performing cross-fact analysis. The construct also allows much better segmentation and profiling of the one dimension of interest.
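A minimal star schema sketch along the lines of the sales example above; the product dimension is added for illustration, and SQLite is used only as a convenient stand-in for a warehouse DBMS (all table and column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, full_date TEXT,
                          month TEXT, quarter TEXT, year INTEGER);
CREATE TABLE dim_store   (store_key INTEGER PRIMARY KEY, store_name TEXT, region TEXT);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT, category TEXT);

-- The fact table sits in the middle and holds only dimension keys and measures.
CREATE TABLE fact_sales (
    date_key     INTEGER REFERENCES dim_date(date_key),
    store_key    INTEGER REFERENCES dim_store(store_key),
    product_key  INTEGER REFERENCES dim_product(product_key),
    sales_amount REAL
);
""")

# A typical star-join query: total sales by quarter and region.
print(conn.execute("""
    SELECT d.quarter, s.region, SUM(f.sales_amount)
    FROM fact_sales f
    JOIN dim_date  d ON f.date_key  = d.date_key
    JOIN dim_store s ON f.store_key = s.store_key
    GROUP BY d.quarter, s.region
""").fetchall())
```

Snowflaking this design would mean normalizing a dimension further, for example splitting category out of dim_product into its own small lookup table.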
Data Modeling
Most people involved in application development follow some kind of methodology. A methodology is a prescribed set of processes through which the developer analyzes the client's requirements and develops an application. Major database vendors and computer gurus all practice and promote their own methodology. Some database vendors even make their analysis, design, and development tools conform to a particular methodology. If you are using the tools of a particular vendor, it may be easier to follow their methodology as well. For example, when CNS develops and supports Oracle database applications it uses the Oracle toolset. Accordingly, CNS follows Oracle's CASE*Method application development methodology (or a reasonable facsimile thereof).
One technique commonly used in analyzing the client's requirements is data modeling. The purpose of data modeling is to develop an accurate model, or graphical representation, of the client's information needs and business processes. The data model acts as a framework for the development of the new or enhanced application. There are almost as many methods of data modeling as there are application development methodologies. CNS uses the Oracle CASE*Method for its data modeling.
As time goes by, applications tend to accrue new layers, just like an onion. We develop more paper pushing and report printing, adding new layers of functions and features. Soon it gets to the point where we can only see with difficulty the core of the application where its essence lies. Around the core of the application we see layer upon layer, protecting, nurturing, but ultimately obscuring the core. Our systems and applications often fall victim to these protective or hiding processes. The essence of an application is lost in the shuffle of paper and the accretion of day-to-day changes. Data modeling encourages both the developer and the client to tear off these excess layers, to explore and revisit the essence or purpose of the application once more. The new analysis determines what needs to feed into and what feeds from the core purpose.
Application Audience and Services
After participants at CNS-sponsored application analysis meetings agree on a scope and objectives statement, we find it helpful to identify the audience of the application. To whom do you offer the services we are modeling? Who is affected by the application? Answers to these and similar questions help the participants stay focused on the desired application results.
After assembling an audience list, we then develop a list of services provided by the application. This list includes the services of the existing application and any desired future services in the new application. From this list, we model the information requirements of each service. To do this, it is useful to first identify the three most important services of the application, and then of those three, the single most important service. Eventually all of the services will be modeled. Focusing our data modeling on one service just gives us a starting point.
Entities
The next step in modeling a service or process is to identify the entities involved in that process. An entity is a thing or object of significance to the business, whether real or imagined, about which the business must collect and maintain data, or about which information needs to be known or held. An entity may be a tangible or real object like a person or a building; it may be an activity like an appointment or an operation; it may be conceptual as in a cost center or an organizational unit.
Whatever is chosen as an entity must be described in real terms. It must be uniquely identifiable. That is, each instance or occurrence of an entity must be separate and distinctly identifiable from all other instances of that type of entity.
For example, if we were designing a computerized application for the care of plants in a greenhouse, one of its processes might be tracking plant waterings. Within that process, there are two entities: the Plant entity and the Watering entity. A Plant has significance as a living flora of beauty. Each Plant is uniquely identified by its biological name, or some other unique reference to it. A Watering has significance as an application of water to a plant. Each Watering is uniquely identified by the date and time of its application.
Attributes
After you identify an entity, then you describe it in real terms, or through its attributes. An attribute is any detail that serves to identify, qualify, classify, quantify, or otherwise express the state of an entity occurrence or a relationship. Attributes are specific pieces of information which need to be known or held.
An attribute is either required or optional. When it's required, we must have a value for it, a value must be known for each entity occurrence. When it's optional, we could have a value for it, a value may be known for each entity occurrence. For example, some attributes for Plant are: description, date of acquisition, flowering or non-flowering, and pot size. The description is required for every Plant. The pot size is optional since some plants do not come in pots. Again, some of Watering's attributes are: date and time of application, amount of water, and water temperature. The date and time are required for every Watering. The water temperature is optional since we do not always check it before watering some plants.
The attributes reflect the need for the information they provide. In the analysis meeting, the participants should list as many attributes as possible. Later they can weed out those that are not applicable to the application, or those the client is not prepared to spend the resources on to collect and maintain. The participants come to an agreement on which attributes belong with an entity, as well as which attributes are required or optional.
The attributes which uniquely define an occurrence of an entity are called primary keys. If such an attribute doesn't exist naturally, a new attribute is defined for that purpose, for example an ID number or code.
Relationships
After two or more entities are identified and defined with attributes, the participants determine if a relationship exists between the entities. A relationship is any association, linkage, or connection between the entities of interest to the business; it is a two-directional, significant association between two entities, or between an entity and itself. Each relationship has a name, an optionality (optional or mandatory), and a degree (how many). A relationship is described in real terms.
Rarely will there be a relationship between every entity and every other entity in an application. If there are only two or three entities, then perhaps there will be relationships between them all. In a larger application, there are not always relationships between one entity and all of the others.
Assigning a name, an optionality, and a degree to a relationship helps confirm the validity of that relationship. If you cannot give a relationship all these things, then perhaps there really is no relationship at all. For example, there is a relationship between Plant and Watering. Each Plant may be given one or more Waterings. Each Watering must be for one and only one specific Plant.
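Translated into tables, the Plant and Watering example might look like the sketch below (column names are invented for illustration). The NOT NULL foreign key enforces "each Watering must be for one and only one Plant," while a Plant may have zero or more Waterings:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE plant (
    plant_id        INTEGER PRIMARY KEY,
    biological_name TEXT NOT NULL UNIQUE,    -- unique identifier of each Plant
    description     TEXT NOT NULL,           -- required attribute
    flowering       INTEGER,                 -- optional attributes
    pot_size        TEXT
);

CREATE TABLE watering (
    plant_id   INTEGER NOT NULL REFERENCES plant(plant_id),  -- mandatory: exactly one Plant
    applied_at TEXT NOT NULL,                -- date and time identify the Watering
    amount_ml  REAL NOT NULL,
    water_temp REAL,                         -- optional attribute
    PRIMARY KEY (plant_id, applied_at)
);
""")
```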
Entity Relationship Diagrams
To visually record the entities and the relationships between them, an entity relationship diagram, or ERD, is drawn. An ERD is a pictorial representation of the entities and the relationships between them. It allows the participants in the meeting to easily see the information structure of the application. Later, the project team uses the ERD to design the database and tables. Knowing how to read an ERD is very important. If there are any mistakes or relationships missing, the application will fail in that respect. Although somewhat cryptic, learning to read an ERD comes quickly.
Each entity is drawn in a box. Each relationship is drawn as a line between entities. The relationship between Plant and Watering is drawn on the ERD as follows:


Since a relationship is between two entities, an ERD shows how one entity relates to the other, and vice versa. Reading an ERD relationship means you have to read it from one entity to the other, and then from the other to the first. Each style and mark on the relationship line has some significance to the relationship and its reading. Half the relationship line belongs to the entity on that side of the line. The other half belongs to the other entity on the other side of the line.
When you read a relationship, start with one entity and note the line style starting at that entity. Ignore the latter half of the line's style, since it's there for you to come back the other way. A solid line at an entity represents a mandatory relationship. In the example above, each Watering must be for one and only one Plant. A dotted line at an entity represents an optional relationship. Each Plant may be given one or more Waterings.
The way in which the relationship line connects to an entity is significant. If it connects with a single line, it represents one and only one occurrence of that entity. In the example, each Watering must be for one and only one Plant. If the relationship line connects with three prongs, i.e., a crowsfoot, it represents one or more of the entities. Each Plant may be given one or more Waterings. As long as both statements are true, then you know you have modeled the relationship properly.
In the relationship between Plant and Watering, there are two relationship statements. One is: each Watering must be for one and only one Plant. These are the parts of the ERD which that statement uses:


The second statement is: each Plant may be given one or more Waterings. The parts of the ERD which that statement uses are:


After some experience, you learn to ask the appropriate questions to determine if two entities are related to each other, and the degree of that relationship. After agreeing on the entities and their relationships, the process of identifying more entities, describing them, and determining their relationships continues until all of the services of the application have been examined. The data model remains software and hardware independent.
Many-to-Many Relationships
There are different types of relationships. The greenhouse plant application example showed a one-to-many and a many-to-one relationship, both between Plant and Watering. Two other relationships commonly found in data models are one-to-one and many-to-many. One-to-one relationships are between two entities where both are related to each other, once and only once for each instance of either. In a many-to-many relationship, multiple occurrences of one entity are related to one occurrence of another, and vice versa.
An example of a many-to-many relationship in the greenhouse plant application is between the Plant and Additive entities. Each plant may be treated with one or more Additives. Each Additive may be given to one or more Plants. 

Many-to-many relationships cannot be directly converted into database tables and relationships. This is a restriction of the database systems, not of the application. The development team has to resolve the many-to-many relationship before it can continue with the database development. If you identify a many-to-many relationship in your analysis meeting, you should try to resolve it in the meeting. The participants can usually find a fitting entity to provide the resolution.
To resolve a many-to-many relationship means to convert it into two one-to-many, many-to-one relationships. A new entity comes between the two original entities, and this new entity is referred to as an intersection entity. It allows for every possible matched occurrence of the two entities. Sometimes the intersection entity represents a point or passage in time.

With these new relationships, Plant is now related to Treatment. Each Plant may be given one or more Treatments. Each Treatment must be given to one and only one Plant. Additive is also related to Treatment. Each Additive may be used in one or more Treatments. Each Treatment must be comprised of one and only one Additive. With these two new relationships, Treatment cannot exist without Plant and Additive. Treatment can occur multiple times, once for each treatment of a plant additive. To keep each Treatment unique, a new attribute is defined. Treatment now has application date and time attributes. They are the unique identifiers or the primary key of Treatment. Other attributes of Treatment are quantity and potency of the additive.
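A sketch of the resolved design, with Treatment as the intersection entity between Plant and Additive (column names are illustrative; the application date and time keep each Treatment row unique, as described above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE plant    (plant_id    INTEGER PRIMARY KEY, biological_name TEXT NOT NULL);
CREATE TABLE additive (additive_id INTEGER PRIMARY KEY, additive_name   TEXT NOT NULL);

-- Intersection entity: one row per application of one additive to one plant.
CREATE TABLE treatment (
    plant_id    INTEGER NOT NULL REFERENCES plant(plant_id),
    additive_id INTEGER NOT NULL REFERENCES additive(additive_id),
    applied_at  TEXT    NOT NULL,       -- application date and time
    quantity    REAL,
    potency     REAL,
    PRIMARY KEY (plant_id, additive_id, applied_at)
);
""")
```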
Will Data Modeling Look Good on You?
There are other processes and marks to enhance a data model besides the ones shown in this article. Many of them are used in the actual development of the database tables. The techniques shown here only provide a basic foundation for undertaking your own data modeling analysis.

Data modeling gives you the opportunity to shed the layers of processes covering up the fundamental essence of your business. Remember to leave your baggage at the door of a data modeling session. Come to the meeting with enthusiasm and a positive outlook for a new and improved application. 

The operational data store (ODS)

The operational data store (ODS) is a part of the data warehouse environment about which many managers have confused feelings. I am often asked, "Should I build an ODS?" I have decided that the underlying question is, "What is an ODS, anyway?" According to Bill Inmon and Claudia Imhoff in their book Building the Operational Data Store (John Wiley & Sons, 1996), an ODS is a "subject-oriented, integrated, volatile, current valued data store, containing only corporate detailed data." This definition for an ODS reflects a real market need for current, operational data.
If anything, the need for the ODS function has grown in recent years and months. At the same time as our data warehouse systems have gotten bigger, the need to analyze ever more detailed customer behavior and ever more specific operational texture has grown. In most cases the analysis must be done on the most granular and detailed data that we can possibly source. The emergence of data mining has also demanded that we crawl through reams of the lowest-level data, looking for correlations and patterns.
Until now, the ODS was considered a different system from the main data warehouse because the ODS was based on "operational" data. The downstream data warehouse was almost always summarized. Because the warehouse was a complete historical record, we usually didn't dare store this operational (transactional) data as a complete history.
However, the hardware and software technology supporting data warehousing has kept rolling forward, able to store more data and able to process larger answer sets. We also have discovered how to extract and clean data rapidly, and we have figured out how to model it for user understandability and extreme query performance. It is fashionable these days to talk about multiterabyte data warehouses, and consultants braver than I talk about petabyte (1,000 terabyte) data warehouses being just around the corner.
Now I am getting suspicious of the ODS assumption we have been making that you cannot store the individual transactions of a big business in a historical time series. Let us stop for a moment and estimate the number of low-level sales transactions in a year for a large retailer. This is surprisingly simple. I use the following technique to triangulate the overall size of a data warehouse before I ever interview the end users.
Imagine that our large retailer has six billion dollars in retail sales per year. The only other fact we need is the average size of a line item on a typical sales transaction. Suppose that our retailer is a drug store and that the average dollar value of a line item is two dollars. We can immediately estimate the number of transaction line items per year as six billion dollars divided by two dollars, or three billion. This is a large number, but it is well within the range of many current data warehouses. Even a three-year history would "only" generate nine billion records. If we did a tight dimensional design with four 4-byte dimension keys, and four 4-byte facts, then the raw-fact table data size per year would be three billion times 32 bytes, or 96GB. Three years of raw data would be 288GB. I know of more than a dozen data warehouses bigger than this today.
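The same triangulation spelled out as arithmetic, using the record layout assumed in the text (four 4-byte keys plus four 4-byte facts, 32 bytes per row):

```python
annual_sales_dollars  = 6_000_000_000      # $6 billion in retail sales per year
avg_line_item_dollars = 2                  # average line item value
bytes_per_fact_row    = 4 * 4 + 4 * 4      # four 4-byte keys + four 4-byte facts = 32

rows_per_year  = annual_sales_dollars // avg_line_item_dollars   # 3 billion line items
gb_per_year    = rows_per_year * bytes_per_fact_row / 1e9        # ~96 GB of raw fact data
gb_three_years = 3 * gb_per_year                                 # ~288 GB for a 3-year history

print(rows_per_year, gb_per_year, gb_three_years)   # 3000000000 96.0 288.0
```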
Our regular data warehouses can now embrace the lowest-level transaction data as a multiyear historical time series, and we are using high-performance data extracting and cleaning tools to pull this data out of the legacy systems at almost any desired frequency each day. So why is my ODS a separate system? Why not just make the ODS the leading, breaking wave of the data warehouse itself?
With the growing interest in data mining fine-grained customer behavior in the form of individual customer transactions, we increasingly need detailed transaction-time histories available for analysis. The effort expended to make a lightweight, throwaway, traditional ODS data source (that is, a volatile, current-valued data source restricted to current data) is becoming a dead end and a distraction.
Let us take this opportunity to tighten and restrict the definition of the ODS. We will view the ODS simply as the "front edge" of the existing data warehouse. By bringing the ODS into the data warehouse environment, we make it more useful to clerks, executives, and analysts, and we need only to build a single extract system. This new, simplified view of the ODS is shown in Figure 1 and Figure 2.
Let us redefine the ODS as follows. The ODS is a subject-oriented, integrated, frequently augmented store of detailed data in the enterprise data warehouse.
The ODS is subject-oriented. That is, the ODS, like the rest of the data warehouse, is organized around specific business domains such as Customer, Product, Activity, Policy, Claim, or Shipment.
The ODS is integrated. The ODS gracefully bridges between subjects and presents an overarching view of the business rather than an incompatible stovepipe view of the business.
The ODS is frequently augmented. This requirement is a significant departure from the original ODS statement that said the ODS was volatile; for example, the ODS was constantly being overwritten and its data structures were constantly changing. This new requirement of frequently augmenting the data also invalidates Inmon and Imhoff's statement that the ODS contains only current, valued data. We aren't afraid to store the transaction history. In fact, that has now become our mission.
The ODS sits within the full data warehouse framework of historical data and summarized data. In a data warehouse containing a monthly summarized view of data in addition to the transaction detail, the input flow to the ODS also contributes to a special "current rolling month." In many cases, when the last day of the month is reached, the current rolling month becomes the most recent member of the standard months in the time series and a new current rolling month is created.
The ODS naturally supports a collective view of data. We now see how the ODS presents a collective view to the executive who must be able to see a customer's overall account balance. The executive can immediately and gracefully link to last month's collective view of the customer (via the time series) and to the surrounding class of customers (via the data warehouse aggregations).
The ODS is organized for rapid updating directly from the legacy system. The data extraction and cleaning industry has come a long way in the last few years. We can pipeline data from the legacy systems through data cleaning and data integrating steps and drop it into the ODS portion of the data warehouse. Inmon and Imhoff's original distinctions of Class I (near realtime upload), Class II (upload every few hours), and Class III (upload perhaps once per day) are still valid, but the architectural differences in the extract pipeline are far less interesting than they used to be. The ability to upload data very frequently will probably be based more on waiting for remote operational systems to deliver necessary data than computing or bandwidth restrictions in the data pipeline.
The ODS should be organized around a star join schema design. Inmon and Imhoff recommend the star join data model as "the most fundamental description of the design of the data found in the operational data store." The star join, or dimensional model, is the preferred data model for achieving user understandability and predictable high performance. (For further information on this subject, please see my article, "A Dimensional Modeling Manifesto," in the August 1997 issue of DBMS.)
The ODS contains all of the text and numbers required to describe low-level transactions, but may additionally contain back references to the legacy system that would allow realtime links to be opened to the legacy systems through terminal- or transaction-based interfaces. This is an interesting aspect of the original definition of the ODS, and is somewhat straightforward if the low-level transactions are streamed out of the legacy system and into the ODS portion of the data warehouse. What this means is that operational keys like the invoice number and line number are kept in the data flow all the way into the ODS, so that an application can pick up these keys and link successfully back to the legacy system interface.
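A small sketch of that idea: the fact row keeps the operational keys (invoice number and line number) as plain attributes all the way through the pipeline, so an application can use them to open a link back into the legacy order entry interface. The names and the lookup helper are illustrative assumptions:

```python
def lookup_product_key(product_code):
    # Stand-in for a real dimension lookup (see the surrogate key sketch earlier).
    return {"SKU-881": 104}.get(product_code, -1)

# One cleaned transaction arriving from the legacy extract.
legacy_txn = {"invoice_number": "INV-0042", "line_number": 3,
              "product_code": "SKU-881", "amount": 19.95}

# ODS/fact row: surrogate keys for the dimensions, plus the operational keys
# retained for drill-back to the legacy system.
fact_row = {
    "date_key": 20131018,
    "product_key": lookup_product_key(legacy_txn["product_code"]),
    "sales_amount": legacy_txn["amount"],
    "invoice_number": legacy_txn["invoice_number"],   # kept for the legacy interface
    "line_number": legacy_txn["line_number"],
}
print(fact_row)
```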
The ODS is supported by extensive metadata needed to explain and present the meaning of the data to end users through query and reporting tools, as well as metadata describing an extract "audit" of the data warehouse contents.
Bringing the ODS into the existing data warehouse framework solves a number of problems. We can now focus on building a single data extract pipeline. We don't need to have a split personality where we are willing to have a volatile, changing data structure with no history and no support for performance-enhancing aggregations. Our techniques have improved in the last few years. We understand how to take a flow of atomic-level transactions, put them into a dimensional framework, and simultaneously build a detailed transaction history with no compromising of detail, and at the same time build a regular series of periodic snapshots that lets us rapidly track a complex enterprise over time. As I just mentioned, a special snapshot in this time series is the current rolling snapshot at the very front of the time series. This is the echo of the former separate ODS. In next month's column, I will describe the dual personality of transaction and snapshot schemas that is at the heart of "operational data warehouse."
Finally, if you have been reading this with a skeptical perspective, and you have been saying to yourself, "storing all that transaction detail just isn't needed in my organization: all my management needs are high-level summaries," then broaden your perspective and listen to what is going on in the marketing world. I believe that we are in the midst of a major move to one-on-one marketing in which large organizations are seeking to understand and respond to detailed and individual customer behavior. Banks need to know exactly who is at that ATM between 5:00 p.m. and 6:00 p.m., what transactions they are performing, and how that pattern has evolved this year in response to various bank incentive programs. Catalina Marketing is ready to print coupons at your grocery store register that respond to what you have in your shopping basket and what you have been buying in recent trips to the store. To do this, these organizations need all the gory transaction details, both current and historical.

Our data warehouse hardware and software are ready for this revolution. Our data warehouse design techniques are ready for this revolution. Are you ready? Bring your ODS in out of the rain and into the warehouse.

Figure 1.


 The original ODS architecture necessitated two pathways and two systems because the main data warehouse wasn't prepared to store low-level transactions.

Figure 2.


The new ODS reality. The cleaning and loading pathway needs only to be a single system because we are now prepared to build our data warehouse on the foundation of individual transactions.

Comparison of OLAP and OLTP

OLAP applications are quite different from On-line Transaction Processing (OLTP) applications which consist of a large number of relatively simple transactions. The transactions usually retrieve and update a small number of records that are contained in several distinct tables. The relationships between the tables are generally simple.
A typical customer order entry OLTP transaction might retrieve all of the data relating to a specific customer and then insert a new order for the customer. Information is selected from the customer, customer order, and detail line tables. Each row in each table contains a customer identification number which is used to relate the rows from the different tables. The relationships between the records are simple and only a few records are actually retrieved or updated by a single transaction.
The difference between OLAP and OLTP has been summarised as follows: OLTP servers handle mission-critical production data accessed through simple queries, while OLAP servers handle management-critical data accessed through iterative analytical investigation. Both OLAP and OLTP have specialized requirements and therefore require specially optimized servers for the two types of processing.
OLAP database servers use multidimensional structures to store data and relationships between data. Multidimensional structures can be best visualized as cubes of data, and cubes within cubes of data. Each side of the cube is considered a dimension.
Each dimension represents a different category such as product type, region, sales channel, and time. Each cell within the multidimensional structure contains aggregated data relating elements along each of the dimensions. For example, a single cell may contain the total sales for a given product in a region for a specific sales channel in a single month. Multidimensional databases are a compact and easy-to-understand vehicle for visualizing and manipulating data elements that have many interrelationships.
OLAP database servers support common analytical operations including: consolidation, drill-down, and "slicing and dicing".
  • Consolidation - involves the aggregation of data such as simple roll-ups or complex expressions involving inter-related data. For example, sales offices can be rolled-up to districts and districts rolled-up to regions.
  • Drill-Down - OLAP data servers can also go in the reverse direction and automatically display detail data which comprises consolidated data. This is called drill-down. Consolidation and drill-down are inherent properties of OLAP servers.
  • "Slicing and Dicing" - Slicing and dicing refers to the ability to look at the database from different viewpoints. One slice of the sales database might show all sales of product type within regions. Another slice might show all sales by sales channel within each product type. Slicing and dicing is often performed along a time axis in order to analyse trends and find patterns.
OLAP servers have the means for storing multidimensional data in a compressed form. This is accomplished by dynamically selecting physical storage arrangements and compression techniques that maximize space utilization. Dense data (i.e., data exists for a high percentage of dimension cells) are stored separately from sparse data (i.e., a significant percentage of cells are empty). For example, a given sales channel may only sell a few products, so the cells that relate sales channels to products will be mostly empty and therefore sparse. By optimizing space utilization, OLAP servers can minimize physical storage requirements, thus making it possible to analyse exceptionally large amounts of data. It also makes it possible to load more data into computer memory which helps to significantly improve performance by minimizing physical disk I/O.

In conclusion, OLAP servers logically organize data in multiple dimensions, which allows users to quickly and easily analyse complex data relationships. The database itself is physically organized in such a way that related data can be rapidly retrieved across multiple dimensions. OLAP servers are very efficient when storing and processing multidimensional data. RDBMSs have been developed and optimized to handle OLTP applications. Relational database designs concentrate on reliability and transaction processing speed, instead of decision support needs. The different types of server can therefore benefit a broad range of data management applications.

Disadvantages of data warehousing

1. Complexity and anticipation in development. The disadvantage mentioned most frequently (48; 11 percent) is complexity in development. IS cannot just buy a data warehouse; IS has to build one because each warehouse has a unique architecture and a set of requirements that spring from the individual needs of the organization [Ladaga, 1995; Myers, 1995b]. IS needs to ask a wide range of questions in building it [Redding, 1995; Goldberg, 1995b]. Builders need to pay as much attention to the structure, definitions, and flow of data as they do to choosing hardware and software [Hildebrand, 1995; Adhikari, 1996; Edwards, 1995; Wallace, 1994a]. Data warehouse construction requires a sense of anticipation about future ways to use the collected records [Goldberg, 1995b]. Developers need to be aware of the constantly changing needs of their company's business and the capabilities of the available and emerging hardware and software [Lardear, 1995a]. How to scale the warehouse to meet increasing user demand for both volume and complexity [Lardear, 1995a] makes its development more complex. Also, there are difficulties in choosing the right products [Harding, 1994; Cafasso, 1995b]. In summary, developing such a large database requires an expert [Harding, 1994].
2. Takes time to build. Second, 32 (7 percent) articles point out that to build a data warehouse takes time (2 to 3 years) [Goldberg, 1995b; Hildebrand, 1995; Ladaga, 1995; Redding, 1995]. In a situation where there is not strong executive sponsorship, IS directors or others wishing to develop a warehouse may spend an inordinate amount of time justifying the need.
3. Expensive to build. Similarly, 17 (4 percent) mentioned that a data warehouse is also expensive to build ($2 to 3 million) [Harding, 1994; Hildebrand, 1995; Ladaga, 1995; Redding, 1995]. One reason data warehouses are so expensive is that data must be moved or copied from existing databases, sometimes manually, and data needs to be translated into a common format [Cole, 1995h].
4. Lack of API. Ten (2 percent) articles suggest that data warehousing software still lacks a set of application programming interfaces (APIs) or other standards that shuttle data smoothly through the entire warehouse process, such as Microsoft's Open Database Connectivity (ODBC) interface. However, the ODBC API, which lets PCs access data from many different databases, is not available everywhere [Nash, 1995b].

5. End-user training. Seven (2 percent) articles suggest it is necessary to create a new "mind-set" with all employees who must be prepared to capitalize upon the innovative data analysis provided by data warehouses; those end users require extensive training. A communication plan is essential to educate all constituents [Goldberg, 1995b; Ladaga, 1995].
6. Complexity involved in SMP and MPP. Six (1 percent) of the articles point out the complexity of data warehousing, which will be increased if the warehouses involve symmetrical multiprocessing (SMP) and massively parallel processing (MPP). Synchronization and shared access are difficult [Goldberg, 1995b; Burleson, 1995].
7. Difficulty in distributed database environment. Because the data warehouse is a method of bringing disparate data together, it is centralized by its very nature [Wallace, 1994b]; this is mentioned in 5 (1 percent) of the articles. While many companies are still in the preliminary stages of putting their data warehouses together, this centralization means only workers located at the same site as the warehouse have access to the data [Wallace, 1994b].
8. Time-lag between data warehouses and operation. Lastly, in 3 (1 percent) of the articles, it is said that the data in data warehouses is extracted from operational databases that are continuously changing. A real-time data warehouse is an oxymoron because it is impossible to have real-time replication while maintaining a full-scale data warehouse [Burleson, 1995]. Data warehouses store only a time slice of corporate data that is steadily drifting backward out of relevance until the warehouses are replenished.

Conclusion
This study makes two contributions: one for practitioners who are planning to implement data warehouses, and one for researchers who are studying the phenomenon. For the practitioners, the lists of advantages and disadvantages show what to expect when implementing a data warehouse and what kinds of problems they may face. Simplicity of the data structure, ease of use, fast access, and better quality data which leads to improved productivity and decision making are the most mentioned advantages. On the other hand, complexity of the development, time, and cost to build data warehouses are the most recognized disadvantages. For researchers, the findings provide a wide range of issues to be researched.
10. Parallel processing. Eleven (2 percent) of these authors indicate that parallel processing helps users perform database tasks more quickly [Brown, 1995; Bull, 1995b; Stedman, 1995a]. Users can ask questions that were too process-intensive to answer before, and a data warehouse can handle more customers, users, transactions, queries, and messages. It supports the higher performance demands of a client/server environment, provides unlimited scalability, and thus better price/performance [Capacity Management Review, 1995].

11. Robust processing engines. Ten (2 percent) of the articles mention that data warehouses allow users to directly obtain and refine data from different software applications without affecting the operational databases, and to integrate different business tasks into a single, streamlined process supported by real-time information. This provides users with robust processing engines [Goldberg, 1995b; Seybold, 1995].
12. Platform independent. Seven (2 percent) of the articles point out that data warehouses can be built on everything from a high-end PC to a mainframe, although many are choosing Unix servers and running their warehouses in a client/server environment. IBM and five other data warehouse software vendors formed alliances to clear the cross-platform hurdles inherent in data warehouse implementation. Similar partnerships have been formed by other vendors. It is crucial to have such independence, which was not easy in the legacy system [Systems Management 3X 400, 1995; Wallace, 1994a].
13. Computing infrastructure. Seven (2 percent) of the articles mention data warehousing helps the organization create a computing infrastructure that can support changes in computer systems and business structures [Wallace, 1994b].
14. Downsizing facilitation. Six (1 percent) articles suggest that data warehouses empower employees to make decentralized decisions since they put information closer to users. They are designed to give end users faster access to the information that is already there without impacting other systems or resources. Therefore, users do not need to ask IS to get needed data and IS managers can concentrate on other tasks. This potentially cuts the information middle-man who passes information from one place to another and suggests downsizing [Bull, 1995b; Seybold, 1995a].
15. Quantitative value. Another advantage, mentioned in six articles (1 percent), is realistic benchmarking. Data warehouses provide the quantitative metrics necessary to establish business process baselines that are derived from historical data and allow business managers to measure progress [Jain, 1995; Modisette, 1996].

16. Security. Three (1 percent) articles talk about the fact that clients of the data warehouses cannot directly query the production databases, thus improving security of the production databases as well as their productivity [Ricciuti, 1994a]. Some warehouses also provide management services for handling security [Smith, 1996].

Advantages of data warehousing

1. Simplicity. The most frequently mentioned advantage of data warehousing (59 articles, 13 percent of 456) is summarized as "simplicity." Data warehousing makes business simpler because a data warehouse provides a single image of business reality by integrating data from various sources. Data warehouses allow existing legacy systems to continue in operation, consolidate inconsistent data from various legacy systems into one coherent set, and reap benefits from vital information about current operations [Hackathorn, 1995; Wallace, 1994a]; a minimal sketch of this kind of consolidation appears after this list. Current operations can be monitored and compared with past operations, predictions of future operations can be made rationally, new business processes can be devised, and new operational systems can be quickly spawned to support those processes [Fairhead, 1995; Hackathorn, 1995; Ricciuti, 1994a; Smith, 1995d; Wallace, 1994a; Weinberg, 1995a]. Data warehouses can also store the large amounts of historical and corporatewide data that companies need to turn into vital business information [Bull, 1995b; Brown, 1995; Cafasso, 1994d; Eckerson, 1993b; Hackathorn, 1995; Lisker, 1994; Nash, 1995c; Smith, 1995f; Wallace, 1994a]. Data warehouses offer the benefit of a single, centralized data location while maintaining local client/server distribution [Ricciuti, 1994a]. Furthermore, data warehouses are companywide systems [Hoffman and Nash, 1995]; therefore, they improve corporatewide communication [Seybold, 1995].
2. Better quality data; improved productivity. The second most frequently mentioned advantage (53 articles; 12 percent) is better quality data [Wallace, 1994b]. Other aspects of data quality mentioned include consistency, accuracy, and documentation [Ladaga, 1995; Ricciuti, 1994a; Wallace, 1994b]. Improved decision making through OLAP and data mining analysis was cited as an improvement in productivity [Barry, 1995; Barquin, 1995; Broda, 1995; Henderson, 1996; Mason, 1995].
3. Fast access. The next most frequently mentioned advantage (48 articles; 11 percent) is fast access. Because data warehouses allow users to retrieve the data they need by themselves, the workload on IS can be cut, and since the necessary data is in one place, system response time should also be reduced [Bull, 1995b; Fairhead, 1995; Goldberg, 1995b; Lisker, 1994; Parsons, 1995; Reardon, 1995].
4. Easy to use. Forty-seven articles (10 percent) mention ease of use. Queries from users do not interfere with normal operations, because a data warehouse enables easy access to business data without slowing down the operational database: some of the operational data is copied into a separate database [Bull, 1995b; Burleson, 1995; Fairhead, 1995; Lisker, 1994; Ricciuti, 1994a; Smith, 1995d; Smith, 1995f; Wallace, 1994a; Wallace, 1994b]. Data warehouses are organized around subjects [Barquin, 1995; Broda, 1995], support timely, ad hoc queries for fast decision making as well as regular reporting [Broda, 1995; Myers, 1995a], and are targeted at end users [Adhikari, 1996; Burleson, 1995; Smith, 1995d; Wallace, 1994a; Wallace, 1994b].
5. Separates decision-support operations from production operations. Another advantage, mentioned in 32 articles (7 percent), is that data warehouses separate the operational, continually updated transaction data from the historical, more static data required for business analysis. By doing so, managers and analysts can use historical data for their decision making without slowing down production operations [Francett, 1995b; Taft, 1995; Wallace, 1994a]; an illustrative incremental-extract sketch appears after this list.
6. Gives competitive advantage. Twenty-six articles (6 percent) mention that data warehouses better manage and utilize corporate knowledge, which in turn helps a business become more competitive, better understand its customers, and meet market demands more rapidly [Wallace, 1994a; Wallace, 1994b]. This benefit can therefore justify the large expense [Barquin, 1995].
7. Ultimate distributed database. Fifteen (3 percent) of the articles discuss how data warehouses pull together information from disparate and potentially incompatible locations throughout the organization and put it to good use. Middleware, data transfer software, and other client/server tools are used to link those disparate data sources; in this sense, a data warehouse is the ultimate distributed database [Burleson, 1995; Reardon, 1995; Wallace, 1994a].
8. Operating cost. Fourteen (3 percent) of the articles say that data warehouses provide fertile ground for architecting new operational systems [Hackathorn, 1995]. A warehouse eliminates paper-based files [Cafasso, 1994d; Hackathorn, 1995; Ladaga, 1995; Parsons, 1995; Santosus, 1995; Wallace, 1994b], and once the initial investment is recovered, the organization's information-technology group generally requires fewer resources [Barquin, 1995].
9. Information flow management. The next most mentioned topic (13 articles; 3 percent) is that data warehouses handle large amounts of data from various operational sources and manage the flow of information rather than merely collecting data. To respond to changing business needs, production systems are constantly changing, along with their data encodings and structures. Data warehouses, and especially their metadata, support the continuous, incremental refinement that must track both the production systems and the changing business environment [Barquin, 1995; Hackathorn, 1995]; a metadata-driven mapping sketch appears after this list.
10. Enables parallel processing. Eleven (2 percent) of the articles indicate that parallel processing helps users perform database tasks more quickly [Brown, 1995; Bull, 1995b; Stedman, 1995a]. Users can ask questions that were previously too process-intensive to answer, and a data warehouse can handle more customers, users, transactions, queries, and messages. Parallelism supports the higher performance demands of a client/server environment and provides unlimited scalability and thus better price/performance [Capacity Management Review, 1995]; a simple partitioned-aggregation sketch appears after this list.
11. Robust processing engines. Ten (2 percent) of the articles mention that data warehouses allow users to directly obtain and refine data from different software applications without affecting the operational databases, and to integrate different business tasks into a single, streamlined process supported by real-time information. This provides users with robust processing engines [Goldberg, 1995b; Seybold, 1995].
12. Platform independence. Seven (2 percent) of the articles point out that data warehouses can be built on everything from a high-end PC to a mainframe, although many organizations choose Unix servers and run their warehouses in a client/server environment. IBM and five other data warehouse software vendors formed alliances to clear the cross-platform hurdles inherent in data warehouse implementation, and similar partnerships have been formed by other vendors. Such independence is crucial and was not easy to achieve with legacy systems [Systems Management 3X 400, 1995; Wallace, 1994a].
13. Computing infrastructure. Seven (2 percent) of the articles mention that data warehousing helps an organization create a computing infrastructure that can support changes in computer systems and business structures [Wallace, 1994b].
14. Downsizing facilitation. Six (1 percent) of the articles suggest that data warehouses empower employees to make decentralized decisions because they put information closer to users. Warehouses are designed to give end users faster access to information that is already there without affecting other systems or resources. Therefore, users do not need to ask IS for the data they need, and IS managers can concentrate on other tasks. This potentially cuts out the information middleman who passes information from one place to another, and thus suggests downsizing [Bull, 1995b; Seybold, 1995a].
15. Quantitative value. Another advantage, mentioned in six articles (1 percent), is realistic benchmarking. Data warehouses provide the quantitative metrics necessary to establish business process baselines that are derived from historical data and allow business managers to measure progress [Jain, 1995; Modisette, 1996].

16. Security. Three (1 percent) of the articles note that clients of a data warehouse cannot directly query the production databases, which improves the security of the production databases as well as their productivity [Ricciuti, 1994a]. Some warehouses also provide management services for handling security [Smith, 1996].
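The sketches below are hedged illustrations of a few of the points above; every table, field, and rule in them is invented for the example and is not taken from the cited articles.

For item 1, this Python sketch merges inconsistent customer records from two hypothetical legacy extracts into one conformed set:

# Hypothetical extracts from two legacy systems that encode the same
# customers inconsistently (different field names, codes, and formats).
billing_system = [
    {"cust_no": "0042", "name": "ACME CORP", "country": "US"},
    {"cust_no": "0043", "name": "Globex", "country": "US"},
]
crm_system = [
    {"customer_id": 42, "customer_name": "Acme Corporation", "country_code": "USA"},
    {"customer_id": 44, "customer_name": "Initech", "country_code": "USA"},
]

COUNTRY_MAP = {"US": "USA"}  # conform country codes to one vocabulary


def conform_billing(row):
    return {
        "customer_key": int(row["cust_no"]),
        "customer_name": row["name"].title(),
        "country": COUNTRY_MAP.get(row["country"], row["country"]),
    }


def conform_crm(row):
    return {
        "customer_key": row["customer_id"],
        "customer_name": row["customer_name"],
        "country": row["country_code"],
    }


# Build a single customer dimension; the CRM record wins on conflicts here,
# purely as an illustrative survivorship rule.
customer_dim = {}
for row in billing_system:
    rec = conform_billing(row)
    customer_dim[rec["customer_key"]] = rec
for row in crm_system:
    rec = conform_crm(row)
    customer_dim[rec["customer_key"]] = rec

for rec in sorted(customer_dim.values(), key=lambda r: r["customer_key"]):
    print(rec)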
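For item 5, this sketch (again with invented names) copies only new transactions from a production store into a warehouse fact table, so decision-support queries never touch the operational system:

# Hypothetical operational transactions and a warehouse fact table.
production_txns = [
    {"txn_id": 1, "amount": 10.0},
    {"txn_id": 2, "amount": 25.0},
    {"txn_id": 3, "amount": 40.0},
]
fact_sales = []          # analytical copy, queried by decision-support tools
last_loaded_txn_id = 0   # high-water mark recorded by the previous load


def incremental_load():
    """Append only the transactions created since the last load."""
    global last_loaded_txn_id
    new_rows = [t for t in production_txns if t["txn_id"] > last_loaded_txn_id]
    fact_sales.extend(new_rows)
    if new_rows:
        last_loaded_txn_id = max(t["txn_id"] for t in new_rows)
    return len(new_rows)


print(incremental_load(), "rows loaded")           # first run copies all three
production_txns.append({"txn_id": 4, "amount": 5.0})
print(incremental_load(), "rows loaded")           # next run copies only txn 4

# Analysts aggregate against fact_sales, not the production store.
print("total sales:", sum(t["amount"] for t in fact_sales))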
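For item 9, this sketch shows one common metadata-driven pattern, using invented codes: a small per-source mapping table conforms local status codes to one warehouse vocabulary, so the load logic itself does not have to change when a source changes its encoding:

# Hypothetical metadata: per-source mappings from local codes to one
# conformed status vocabulary used throughout the warehouse.
status_metadata = {
    "order_entry": {"O": "OPEN", "C": "CLOSED"},
    "web_shop":    {"open": "OPEN", "done": "CLOSED", "cxl": "CANCELLED"},
}

incoming_rows = [
    {"source": "order_entry", "order_id": 101, "status": "C"},
    {"source": "web_shop", "order_id": 102, "status": "cxl"},
]


def apply_metadata(row):
    mapping = status_metadata[row["source"]]
    # Unknown codes are kept but flagged, so the metadata table can be
    # updated later without rejecting the load.
    conformed = mapping.get(row["status"], "UNKNOWN")
    return dict(row, status=conformed)


for row in incoming_rows:
    print(apply_metadata(row))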
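For item 10, this sketch partitions a large aggregation across worker processes with Python's standard multiprocessing module; a real warehouse would push the same split/aggregate/merge work into the database's own parallel engine, so this is only an analogy:

from multiprocessing import Pool


def partial_sum(chunk):
    """Aggregate one partition of the fact data."""
    return sum(chunk)


def parallel_total(values, workers=4):
    # Split the data into one partition per worker, then combine the
    # partial aggregates -- the same split/aggregate/merge idea a parallel
    # database engine applies to a partitioned fact table.
    size = max(1, len(values) // workers)
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    with Pool(processes=workers) as pool:
        return sum(pool.map(partial_sum, chunks))


if __name__ == "__main__":
    sales_amounts = list(range(1, 1_000_001))   # stand-in for a fact column
    print(parallel_total(sales_amounts))        # 500000500000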