
Tuesday 17 September 2013

POSITION and INDEX functions

The POSITION function returns the position of the first occurrence of a character (or substring) within a string. POSITION is an ANSI-standard function.

Teradata has an equivalent function called INDEX.
Both POSITION and INDEX return the position of a character's first occurrence in a string.

Examples for the POSITION function

SELECT POSITION( 'u' IN 'formula');  Displays Result as '5' 

Examples for the INDEX function

SELECT INDEX('formula', 'u');    Displays Result as '5'   
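
If the character does not occur in the string, both functions return 0. A quick check (assuming a Teradata session):

SELECT POSITION('z' IN 'formula');   Displays Result as '0'

SELECT INDEX('formula', 'z');    Displays Result as '0'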

Monday 16 September 2013

What is ACE in AB Initio?


What is residual condition in Teradata ?

Parse trees are always presented upside down, so query execution begins with the lower cluster of operations and terminates with the upper cluster. In the EXPLAIN of a query, the expressions in the upper cluster are referred to as residual conditions.

How do we handle a DML that changes dynamically?

There are several ways to handle DMLs that change dynamically within a single file. Suitable approaches include using a conditional DML or using vector functionality when calling the DMLs.

What is the AB_LOCAL expression, and where do you use it in Ab Initio?

ablocal_expr is a parameter of the Input Table component in Ab Initio. ABLOCAL() is replaced by the contents of ablocal_expr, which we can make use of in parallel unloads. There are two forms of the ABLOCAL() construct: one with no arguments and one with a single argument, the name of the driving table.
The ABLOCAL() construct is used because some complex SQL statements contain grammar that is not recognized by the Ab Initio parser when unloading in parallel. You can use ABLOCAL() in this case to prevent the Input Table component from parsing the SQL (it is passed through to the database). It also specifies which table to use for the parallel clause.
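
As an illustrative sketch only (the table and column names below are hypothetical, and the exact SQL depends on your source), ABLOCAL() is typically embedded in the Input Table component's source SQL, and the Co>Operating System substitutes the parallel-unload condition for it at run time:

SELECT o.order_id, o.order_date, o.amount
FROM   order_detail o
WHERE  ABLOCAL(order_detail)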

How do you improve the performance of a graph?

There are many ways the performance of the graph can be improved. 
1) Use a limited number of components in a particular phase
2) Use optimum value of max core values for sort and join components
3) Minimise the number of sort components
4) Minimise sorted joins and, if possible, replace them with in-memory/hash joins
5) Use only required fields in the sort, reformat, join components
6) Use phasing/flow buffers in case of merge, sorted joins
7) If the two inputs are huge then use sorted join, otherwise use hash join with proper driving port
8) For large datasets, don't use broadcast as the partitioner
9) Minimise the use of regular expression functions like re_index in the transform functions
10) Avoid repartitioning of data unnecessarily

What is the difference between a DB config and a CFG file?

The .dbc file has the information required for Ab Initio to connect to the database to extract or load tables or views, while the .cfg file is the table configuration file created by db_config when using components like Load DB Table.

What is the function you would use to transfer a string into a decimal?

In this case no specific function is required if the size of the string and the decimal are the same; a decimal cast with the right size in the transform function will suffice. For example, say the source field is defined as string(8), the destination as decimal(8), and the field name is field1:

out.field1 :: (decimal(8)) in.field1

If the destination field is smaller than the input, string_substring can be used, like the following.
Say the destination field is decimal(5):

out.field1 :: (decimal(5)) string_lrtrim(string_substring(in.field1, 1, 5)) /* string_lrtrim trims leading and trailing spaces */

What would be the time-out value for an Ab Initio process?

You can increase time-out values with the Ab Initio environment variables AB_STARTUP_TIMEOUT and AB_RTEL_TIMEOUT_SECONDS.


There are two Ab Initio environment variables that control time-out values. Increasing these values may help if a job fails with a connection time-out message, although there may be some underlying problem causing the time-out, such as a machine being down or an incorrect name being used, in which case increasing the time-out interval will not help.
AB_STARTUP_TIMEOUT specifies the number of seconds before timing out when starting processes on a processing node.
To start a job on a processing node, the Co>Operating System uses rsh, rexec, rlogin, telnet, ssh, or dcom. In most cases, these succeed or fail within a few seconds. However, if the processing node is heavily loaded, startup may take significantly longer. Increasing the time-out value gives the processing node more time to respond.

AB_RTEL_TIMEOUT_SECONDS controls how long to wait for the remote rlogin or telnet server to respond. 

What are the project parameters?

Project parameters specify various aspects of project behavior. Each parameter has a name and a string value. When a project is created, it comes with a set of default project parameters that set up a correspondence between file system URLs and locations within a datastore project.

Default Project Parameters

The default project parameters are the set of parameters that come with a project.
The default project location parameter is PROJECT_DIR, which represents the location of a project's top-level directory in a datastore, for example, /Projects/warehouse. You specify the name of the location parameter when you create a project. PROJECT_DIR cannot be edited in the parameters editors. You can edit the parameter name using the air project modify command (see the Guide to Managing Technical Metadata for details).
The other default parameters (see the table below) refer to PROJECT_DIR in their values through $ substitution. For example, DML has a default value of $PROJECT_DIR/dml. These default parameters represent the locations of various directories in the project, and you can edit them in the Parameters Editors.
Table 1. Default project parameters

Parameter name   Represents the location of the directory that:
DML              Stores record format files
XFR              Stores transform files
PWD              The internal system uses to translate relative paths to absolute paths
RUN              The graphs of the project execute in
DB               Stores database interface files
You can reference project parameters from the graphs in the project, and the components in them, using $ substitution.

You can view and edit project parameters or add new parameters directly by using the Project  Parameters Editor, or indirectly by using the Sandbox Parameters Editor and then checking the project in to the datastore. The latter method is strongly recommended. 

What are the graph parameters?

Graph parameters are associated with individual graphs and are private to them. They affect the execution only of the graph for which they are defined. All the specifiable values of a graph, including each component's parameters (as well as other values such as URLs, file protections, and record formats) comprise that graph's parameters.
For example, the Layout parameter of the A-Transactions input file component in the tutorial Join Customers graph (see Lesson 3: Modifying graphs and seeing changes in the browser) is a graph parameter. Its value is the URL of the A-Transactions input file. Graph parameters are part of the graph they are associated with and are checked in or out along with the graph. 

Saturday 14 September 2013

I am loading 1 lakh records into table1 using MLOAD, and from the same table I am deleting 1 lakh records using MLOAD. Which script is faster: the MLOAD import or the MLOAD delete?

Definitely the MLOAD delete, because it uses only four phases (the acquisition phase is not used) and does not use any transient journal.
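
A minimal MultiLoad delete-task sketch (the log table, logon string, table name, and WHERE condition are all hypothetical):

.LOGTABLE utildb.mload_del_log;
.LOGON tdpid/username,password;
.BEGIN DELETE MLOAD TABLES table1;
DELETE FROM table1 WHERE txn_date < DATE '2013-01-01';
.END MLOAD;
.LOGOFF;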

What is deadlock ?

A deadlock occurs when transaction 1 places a lock on resource A, and then needs to lock resource B. But resource B has already been locked by transaction 2, which in turn needs to place a lock on resource A. This state of affairs is called a deadlock or a deadly embrace. To resolve a deadlock, Teradata Database aborts one of the transactions and performs a rollback.
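
A hedged two-session illustration in Teradata SQL (table names are made up; each session runs an explicit BT/ET transaction):

/* Session 1 */
BT;
UPDATE resource_a SET qty = qty - 1 WHERE id = 10;  /* locks rows in resource_a */
UPDATE resource_b SET qty = qty + 1 WHERE id = 20;  /* waits: session 2 holds the lock on resource_b */

/* Session 2 */
BT;
UPDATE resource_b SET qty = qty - 1 WHERE id = 20;  /* locks rows in resource_b */
UPDATE resource_a SET qty = qty + 1 WHERE id = 10;  /* waits on session 1; deadlock: Teradata aborts one transaction and rolls it back */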

Note: For example, a statement in BTEQ ends with a semicolon (;) as the last non-blank character in the line.
Thus, BTEQ sees the following example as two requests:
            sel * from table1;
            sel * from table2;
However, if you write these same statements in the following way, BTEQ sees them as only one request:
 sel * from table1
; sel * from table2;

What are different types of Locks available in Teradata ?

There are four types of locks in the Teradata Database, as described below:
Access Lock:-
The use of an access lock allows for reading data while modifications are in process. Access locks are designed for decision support on tables that are updated only by small, single-row changes. Access locks are not concerned with data consistency. Access locks prevent other users from obtaining an Exclusive lock on the locked data.
Read Lock:-
Read locks are used to ensure consistency during read operations. Several users may hold concurrent read locks on the same data; during this time no data modification is permitted. Read locks prevent other users from obtaining Exclusive locks and Write locks on the locked data.
Write Lock:-
Write locks enable users to modify data while maintaining data consistency. While the data has a write lock on it, other users can only obtain an access lock. During this time, all other locks are held in a queue until the write lock is released.
Exclusive Lock:-
Exclusive locks are applied to databases or tables, not to rows. When an exclusive lock is applied, no other user can access the database or table. Exclusive locks are used when a DDL command is executed. An exclusive lock on a database or table prevents other users from obtaining any lock on the locked object.
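
For example, a report that can tolerate a dirty read can request an access lock explicitly with the LOCKING request modifier (the table name sales_txn is hypothetical):

LOCKING TABLE sales_txn FOR ACCESS
SELECT COUNT(*) FROM sales_txn;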

What is "checksum" in Teradata?

It specifies the percentage of data in a table that is used to compute the table's CHECKSUM.
It is used for detecting data corruption (for example, bad blocks).
A larger sample percentage gives more reliable error detection, but computing the CHECKSUM over more data consumes more CPU cycles and takes more time.


Problems in the disk drives and disk arrays can corrupt data. This kind of corruption cannot be found easily, but queries against corrupted data can return wrong answers. Such errors, called disk I/O errors, reduce the availability of the data warehouse. To guard against them, Teradata provides a disk I/O integrity check by means of a table-level checksum: a protection technique that lets you select among various levels of corruption checking. The feature detects and logs disk I/O errors.

Teradata provides predefined data integrity check levels such as DEFAULT, NONE, LOW, MEDIUM, and HIGH.

The checksum can be enabled at the table level in the CREATE TABLE DDL. At the system level, use the DBSControl utility to set the parameter.

For more hands-on diagnosis, use the SCANDISK and CHECKTABLE utilities; run CHECKTABLE at level 3 so that it diagnoses the entire table, row by row and byte by byte.
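
A hedged DDL sketch of enabling a table-level checksum at create time (the database, table, and column names are invented; the exact option values depend on your Teradata release):

CREATE TABLE mydb.sales_txn ,CHECKSUM = HIGH
( txn_id   INTEGER NOT NULL,
  txn_date DATE,
  amount   DECIMAL(10,2)
)
PRIMARY INDEX (txn_id);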

Thursday 12 September 2013

Metadata

Introduction

In this electronic age where digital information is being created at a fantastic rate, tools are necessary to locate, access, manage, and understand it all—and that's where metadata comes in. A common definition of the term is "data about data." Metadata can serve many functions in data administration, including detailing the data's history, conditions for use, format, and management requirements. The Minnesota State Archives' interest in metadata stems from its mandate to identify, collect, preserve, and make available the historically valuable records of government regardless of format.

Summary


Metadata is usually defined as "data about data": information (e.g., creator name, creation date) that is used to facilitate intellectual control of, and structured access to, other information. Metadata allows users to locate and evaluate data without each person having to discover it anew with every use. Its basic elements are a structured format and a controlled vocabulary, which together allow for a precise and comprehensible description of content, location, and value.

While the term itself might sound new and trendy, the concept it describes is not. In some fashion, metadata has always been with us, apparent in everything from program listings in TV Guide to the nutritional information on the back of a cereal box. For government records, the familiar forms of metadata are the recordkeeping metadata standard and the records retention schedule.

According to the State of Minnesota, a government record is an item that documents an official government transaction or action: "All cards, correspondence, disks, maps, memoranda, microfilm, papers, photographs, recordings, reports, tapes, writings and other data, information or documentary material, regardless of physical form or characteristics, storage media or condition of use, made or received by an officer or agency of the state and an officer or agency of a county, city, town, school district, municipal subdivision or corporation or other public authority or political entity within the state pursuant to state law or in connection with the transaction of public business by an officer or agency…. The term 'records' excludes data and information that does not become part of an official transaction, library and museum material made or acquired and kept solely for reference or exhibit purposes, extra copies of documents kept only for convenience of reference and stock of publications and processed documents, and bonds, coupons, or other obligations or evidence of indebtedness, the destruction or other disposition of which is governed by other laws" (Minnesota Statutes, section 138.17, subd. 1). A record is also "information that is inscribed on a tangible medium or that is stored in an electronic or other medium and is retrievable in perceivable form" (Minnesota Statutes, section 325L.02).

Anyone who has suffered the exercise in irrelevance offered by an Internet search engine will appreciate the value of precise metadata. Because information in a digital format is only legible through the use of intermediary hardware and software, the role of metadata in information technology is fundamentally important. In any system, given the volume of information it contains, the uses to which it can be put, and the costs involved, metadata is the basic tool for efficiency and effectiveness.

Whatever you want to do with the information (e.g., protect its confidentiality, present it as evidence, provide citizens access to it, broadcast it, share it, preserve it, destroy it) will be feasible only if you and your partners can understand and rely upon the metadata describing it. Using metadata effectively means understanding and applying the standards appropriate to your needs.

Metadata Functions
Government agencies routinely use metadata to fulfill a variety of functions, but the primary uses are for:

  • Legal and statutory reasons (e.g., to satisfy records management laws and the rules of evidence)
  • Technological reasons (e.g., to design and document systems)
  • Operational or administrative reasons (e.g., to document decisions and establish accountability)
  • Service to citizens, agency staff, and others (e.g., to locate and share information)
In all of these cases, metadata standards will be effective only if they rely on a structured format and controlled vocabulary. "Structured format" means the metadata is defined in terms of specific, standardized elements or fields. For example, a library catalog entry for a book will identify its author, title, subject(s), and location, among other things. Unless all the elements are there, users will not be able to evaluate the metadata; they won't be able to answer the question "Is this the book I want?"
"Controlled vocabulary" means that there is a standard as well for the content of the elements. For example, the nutritional information on the back of a box of cereal is often defined in terms of weight per serving. We know what “sugar: three grams” means. It refers to a standard unit of measurement that allows us to compare the sugar content of one cereal to that of another. But if the box read "just the way you like it" or "pretty sweet," that would mean different things to different people. We couldn't compare a subjective review like that to what's on the back of another box of cereal.
To work effectively, the elements and components of metadata should have an accepted, precise meaning that reflects a common understanding among its creators and its users. That allows for evaluation and comparison, for selecting the information you want from all the information available.
Metadata and Information Technology
Metadata is useful for the management of information in any storage format, paper or digital. But it is critically important for information in a digital format because that is only legible through the use of intermediary hardware and software. We can open up a book or even hold microfilm up to a light to determine what it says. But we can't just look at a CD and say what's on it. We cannot possibly hope to locate, evaluate, or use all the files on a single PC, let alone the Internet, without metadata.

If information technology makes metadata necessary, it's information technology that makes metadata useful. Special software applications, such as TagGen, make the creation of standardized metadata simpler. Databases store and provide access to metadata. Most software applications automatically create metadata and associate it with files. One example is the header and routing information that accompany an e-mail message. Another is the set of properties created with every Microsoft Word document; certain elements such as the title, author, file size, etc., are automatically created, but other elements can be customized and created manually. Normally, some combination of automatically and manually created information is best for precise and practical metadata.
Most important, metadata can inform business rules and software code, transforming it into "executable knowledge." For example, metadata can be used for batch processing of files. A date element is critical to records management, as most record retention schedules are keyed to a record's date of creation. Metadata in more sophisticated data formats, such as eXtensible Markup Language (XML), allows for extraction, use, and calculation based on specific components of a metadata record.
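
As a hedged illustration of metadata driving such a business rule (the records_metadata table and its columns are hypothetical), a retention job could select the files whose creation date has passed a seven-year retention period:

SELECT file_id, file_name
FROM   records_metadata
WHERE  created_date < CURRENT_DATE - INTERVAL '7' YEAR;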


By Ralph Kimball


Metadata is an amazing topic in the data warehouse world. Considering that we don’t know exactly what it is, or where it is, we spend more time talking about it, worrying about it, and feeling guilty we aren’t doing anything about it than any other topic. Several years ago we decided that metadata is any data about data. This wasn’t very helpful because it didn’t paint a clear picture in our minds. This fuzzy view gradually cleared up, and recently we have been talking more confidently about the "back-room metadata" that guides the extraction, cleaning, and loading processes, as well as the "front-room metadata" that makes our query tools and report writers function smoothly.
The back-room metadata presumably helps the DBA bring the data into the warehouse and is probably also of interest to business users when they ask from where the data came. The front-room metadata is mostly for the benefit of the end user, and its definition has been expanded not only to include the oil that makes our tools function smoothly, but also a kind of dictionary of business content represented by all the data elements.
Even these definitions, as helpful as they are, fail to give the data warehouse manager much of a feeling for what it is he or she is supposed to do. It sounds like whatever this metadata stuff is, it’s important, and we better:
  • Make a nice annotated list of all of it.
  • Decide just how important each part is.
  • Take responsibility for it.
  • Decide what constitutes a consistent and working set of it.
  • Decide whether to make it or buy it.
  • Store it somewhere for backup and recovery.
  • Make it available to the people who need it.
  • Assure its quality and make it complete and up to date.
  • Control it from one place.
  • Document all of these responsibilities well enough to hand this job off (soon).
Now there is a good, solid IT set of responsibilities. So far, so good. The only trouble is, we haven’t really said what it is yet. We do notice that the last item in the above list really isn’t metadata, but rather, data about metadata. With a sinking feeling, we realize we probably need meta meta data data.
To get this under control, let’s try to make a complete list of all possible types of metadata. We surely won’t succeed in this first try, but we will learn a lot. First, let’s go to the source systems, which could be mainframes, separate nonmainframe servers, users’ desktops, third-party data providers, or even online sources. We will assume that all we do here is read the source data and extract it to a data staging area that could be on the mainframe or could be on a downstream machine. Taking a big swig of coffee, we start the list:
  • Repository specifications
  • Source schemas
  • Copy-book specifications
  • Proprietary or third-party source specifications
  • Print spool file source specifications
  • Old format specifications for archived mainframe data
  • Relational, spreadsheet, and Lotus Notes source specifications
  • Presentation graphics source specifications (for example, Powerpoint)
  • URL source specifications
  • Ownership descriptions of each source
  • Business descriptions of each source
  • Update frequencies of original sources
  • Legal limitations on the use of each source
  • Mainframe or source system job schedules
  • Access methods, access rights, privileges, and passwords for source access
  • The Cobol/JCL, C, or Basic to implement extraction
  • The automated extract tool settings, if we use such a tool
  • Results of specific extract jobs including exact times, content, and completeness.
Now let’s list all the metadata needed to get the data into a data staging area and prepare it for loading into one or more data marts. We may do this on the mainframe with hand-coded Cobol, or by using an automated extract tool. Or we may bring the flat file extracts more or less untouched into a separate data staging area on a different machine. In any case, we have to be concerned about metadata describing:
  • Data transmission scheduling and results of specific transmissions
  • File usage in the data staging area including duration, volatility, and ownership
  • Definitions of conformed dimensions and conformed facts
  • Job specifications for joining sources, stripping out fields, and looking up attributes
  • Slowly changing dimension policies for each incoming descriptive attribute (for example, overwrite, create new record, or create new field)
  • Current surrogate key assignments for each production key, including a fast lookup table to perform this mapping in memory
  • Yesterday’s copy of a production dimension to use as the basis for Diff Compare
  • Data cleaning specifications
  • Data enhancement and mapping transformations (for example, expanding abbreviations and providing more detail)
  • Transformations required for data mining (for example, interpreting nulls and scaling numerics)
  • Target schema designs, source to target data flows, target data ownership, and DBMS load scripts
  • Aggregate definitions
  • Aggregate usage statistics, base table usage statistics, potential aggregates
  • Aggregate modification logs
  • Data lineage and audit records (where exactly did this record come from and when)
  • Data transform run-time logs, success summaries, and time stamps
  • Data transform software version numbers
  • Business descriptions of extract processing
  • Security settings for extract files, software, and metadata
  • Security settings for data transmission (that is, passwords, certificates, and so on)
  • Data staging area archive logs and recovery procedures
  • Data staging-archive security settings.
Once we have finally transferred the data to the data mart DBMS, then we must have metadata, including:
  • DBMS system tables
  • Partition settings
  • Indexes
  • Disk striping specifications
  • Processing hints
  • DBMS-level security privileges and grants
  • View definitions
  • Stored procedures and SQL administrative scripts
  • DBMS backup status, procedures, and security.
In the front room, we have metadata extending to the horizon, including:
  • Precanned query and report definitions
  • Join specification tool settings
  • Pretty print tool specifications (for relabeling fields in readable ways)
  • End-user documentation and training aids, both vendor supplied and IT supplied
  • Network security user privilege profiles, authentication certificates, and usage statistics, including logon attempts, access attempts, and user ID by location reports
  • Individual user profiles, with links to human resources to track promotions, transfers, and resignations that affect access rights
  • Links to contractor and partner tracking where access rights are affected
  • Usage and access maps for data elements, tables, views, and reports
  • Resource charge back statistics
  • Favorite Web sites (as a paradigm for all data warehouse access).
Now we can see why we didn’t know what this metadata was all about. It is everything! Except for the data itself. Suddenly, the data seems like the simplest part.
With this perspective, do we really need to keep track of all this? We do, in my opinion. This list of metadata is the essential framework of your data warehouse. Just listing it as we have done seems quite helpful. It’s a long list, but we can go down through it, find each kind of metadata, and identify what it is used for and where it is stored.
There are some sobering realizations, however. Much of this metadata needs to reside on the machines close to where the work occurs. Programs, settings, and specifications that drive processes have to be in certain destination locations and in very specific formats. That isn’t likely to change soon.

Once we have taken the first step of getting our metadata corralled and under control, can we hope for tools that will pull all the metadata together in one place and be able to read and write it as well? With such a tool, not only would we have a uniform user interface for all this disparate metadata, but on a consistent basis we would be able to snapshot all the metadata at once, back it up, secure it, and restore it if we ever lost it.

Don’t hold your breath. As you can appreciate, this is a very hard problem, and encompassing all forms of metadata will require a kind of systems integration that we don’t have today. I believe the Metadata Coalition (a group of vendors trying seriously to solve the metadata problem) will make some reasonable progress in defining common syntax and semantics for metadata, but it has been two years and counting since they started this effort. Unfortunately, Oracle, the biggest DBMS player, has chosen to sit out this effort and has promised to release its own proprietary metadata standard. Other vendors are making serious efforts to extend their product suites to encompass many of the activities listed in this article and simultaneously to publish their own framework for metadata. These vendors include Microsoft, who’s working with the Metadata Coalition to extend the Microsoft Repository, as well as a pack of aggressive, smaller players proposing comprehensive metadata frameworks, including Sagent, Informatica, VMark, and D2K. In any case, these vendors will have to offer significant business advantages in order to compel other vendors to write to their specifications.

Data Warehousing Project - Data Modeling

As it is difficult to talk about data modeling without going into some technical terms, I will first present several terms that are used commonly in the data modeling field:
Dimension: A category of information. For example, the time dimension.
Attribute: A unique level within a dimension. For example, Month is an attribute in the Time Dimension.
Hierarchy: The specification of levels that represents relationship between different attributes within a hierarchy. For example, one possible hierarchy in the Time dimension is Year --> Quarter --> Month --> Day.
Fact Table: A fact table is a table that contains the measures of interest. For example, sales amount would be such a measure. This measure is stored in the fact table with the appropriate granularity. For example, it can be sales amount by store by day. In this case, the fact table would contain three columns: A date column, a store column, and a sales amount column.
Lookup Table: The lookup table provides the detailed information about the attributes. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse. Each row (each quarter) may have several fields, one for the unique ID that identifies the quarter, and one or more additional fields that specifies how that particular quarter is represented on a report (for example, first quarter of 2001 may be represented as "Q1 2001" or "2001 Q1").
The first step in data modeling is to illustrate the relationships between the entities for the enterprise. The manifestation of this illustration is called the "Entity-Relationship (ER) Diagram". From the ER diagram we can then design logical, and subsequently physical, data models.
In designing data models for data warehouses / data marts, the most commonly used schema types are Star Schema, Snowflake Schema, and Federated Star Schema.
Star Schema: In the star schema design, a single object (the fact table) sits in the middle and is radially connected to other surrounding objects (dimension lookup tables) like a star. A star schema can be simple or complex. A simple star consists of one fact table; a complex star can have more than one fact table.
Snowflake Schema: The snowflake schema is an extension of the star schema where each point of the star explodes into more points. The main advantage of the snowflake schema is the improvement in query performance due to minimized disk storage requirements and joining smaller lookup tables. The main disadvantage of the snowflake schema is the additional maintenance effort needed due to the increased number of lookup tables.
Federated Star Schema: In federated star schema, instead of having the fact table in the middle, a chosen dimension sits in the middle. Then all the fact tables related to this particular dimension radiate from it. Finally, all the other dimensions that are related to each of the fact tables complete the loop. This type of schema is best used when one wants to focus the analysis on that one particular schema. Because all the fact tables are connected to one central dimension, this is an excellent way of performing cross-fact analysis. The construct also allows much better segmentation and profiling of the one dimension of interest.
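
A minimal star-schema sketch in SQL (all table and column names are invented for illustration): the sales fact table sits in the middle at day/store grain, with dimension lookup tables radiating from it.

CREATE TABLE dim_date
( date_key      INTEGER PRIMARY KEY,
  calendar_date DATE,
  month_name    VARCHAR(10),
  quarter_name  VARCHAR(7),
  year_num      INTEGER
);

CREATE TABLE dim_store
( store_key  INTEGER PRIMARY KEY,
  store_name VARCHAR(50),
  region     VARCHAR(30)
);

CREATE TABLE fact_sales
( date_key     INTEGER REFERENCES dim_date (date_key),
  store_key    INTEGER REFERENCES dim_store (store_key),
  sales_amount DECIMAL(12,2)
);

/* A typical star-join query: total sales by quarter and region */
SELECT d.quarter_name, s.region, SUM(f.sales_amount) AS total_sales
FROM fact_sales f
JOIN dim_date  d ON f.date_key  = d.date_key
JOIN dim_store s ON f.store_key = s.store_key
GROUP BY d.quarter_name, s.region;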
Data Modeling
Most people involved in application development follow some kind of methodology. A methodology is a prescribed set of processes through which the developer analyzes the client's requirements and develops an application. Major database vendors and computer gurus all practice and promote their own methodology. Some database vendors even make their analysis, design, and development tools conform to a particular methodology. If you are using the tools of a particular vendor, it may be easier to follow their methodology as well. For example, when CNS develops and supports Oracle database applications it uses the Oracle toolset. Accordingly, CNS follows Oracle's CASE*Method application development methodology (or a reasonable facsimile thereof).
One technique commonly used in analyzing the client's requirements is data modeling. The purpose of data modeling is to develop an accurate model, or graphical representation, of the client's information needs and business processes. The data model acts as a framework for the development of the new or enhanced application. There are almost as many methods of data modeling as there are application development methodologies. CNS uses the Oracle CASE*Method for its data modeling.
As time goes by, applications tend to accrue new layers, just like an onion. We develop more paper pushing and report printing, adding new layers of functions and features. Soon it gets to the point where we can only see with difficulty the core of the application where its essence lies. Around the core of the application we see layer upon layer, protecting, nurturing, but ultimately obscuring the core. Our systems and applications often fall victim to these protective or hiding processes. The essence of an application is lost in the shuffle of paper and the accretion of day-to-day changes. Data modeling encourages both the developer and the client to tear off these excess layers, to explore and revisit the essence or purpose of the application once more. The new analysis determines what needs to feed into and what feeds from the core purpose.
Application Audience and Services
After participants at CNS-sponsored application analysis meetings agree on a scope and objectives statement, we find it helpful to identify the audience of the application. To whom do you offer the services we are modeling? Who is affected by the application? Answers to these and similar questions help the participants stay in focus with the desired application results.
After assembling an audience list, we then develop a list of services provided by the application. This list includes the services of the existing application and any desired future services in the new application. From this list, we model the information requirements of each service. To do this, it is useful to first identify the three most important services of the application, and then of those three, the single most important service. Eventually all of the services will be modeled. Focusing our data modeling on one service just gives us a starting point.
Entities
The next step in modeling a service or process, is to identify the entities involved in that process. An entity is a thing or object of significance to the business, whether real or imagined, about which the business must collect and maintain data, or about which information needs to be known or held. An entity may be a tangible or real object like a person or a building; it may be an activity like an appointment or an operation; it may be conceptual as in a cost center or an organizational unit.
Whatever is chosen as an entity must be described in real terms. It must be uniquely identifiable. That is, each instance or occurrence of an entity must be separate and distinctly identifiable from all other instances of that type of entity.
For example, if we were designing a computerized application for the care of plants in a greenhouse, one of its processes might be tracking plant waterings. Within that process, there are two entities: the Plant entity and the Watering entity. A Plant has significance as a living flora of beauty. Each Plant is uniquely identified by its biological name, or some other unique reference to it. A Watering has significance as an application of water to a plant. Each Watering is uniquely identified by the date and time of its application.
Attributes
After you identify an entity, then you describe it in real terms, or through its attributes. An attribute is any detail that serves to identify, qualify, classify, quantify, or otherwise express the state of an entity occurrence or a relationship. Attributes are specific pieces of information which need to be known or held.
An attribute is either required or optional. When it's required, we must have a value for it, a value must be known for each entity occurrence. When it's optional, we could have a value for it, a value may be known for each entity occurrence. For example, some attributes for Plant are: description, date of acquisition, flowering or non-flowering, and pot size. The description is required for every Plant. The pot size is optional since some plants do not come in pots. Again, some of Watering's attributes are: date and time of application, amount of water, and water temperature. The date and time are required for every Watering. The water temperature is optional since we do not always check it before watering some plants.
The attributes reflect the need for the information they provide. In the analysis meeting, the participants should list as many attributes as possible. Later they can weed out those that are not applicable to the application, or those the client is not prepared to spend the resources on to collect and maintain. The participants come to an agreement on which attributes belong with an entity, as well as which attributes are required or optional.
The attributes which uniquely define an occurrence of an entity are called primary keys. If such an attribute doesn't exist naturally, a new attribute is defined for that purpose, for example an ID number or code.
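A hedged SQL rendering of the greenhouse example (names are illustrative): required attributes become NOT NULL columns, optional attributes allow NULL, and a surrogate ID acts as the primary key when no natural identifier is practical.

CREATE TABLE plant
( plant_id         INTEGER      NOT NULL PRIMARY KEY, /* surrogate identifier */
  description      VARCHAR(100) NOT NULL,             /* required attribute */
  acquisition_date DATE,                              /* date of acquisition */
  flowering_ind    CHAR(1),                           /* flowering or non-flowering */
  pot_size         VARCHAR(20)                        /* optional: some plants are not potted */
);

CREATE TABLE watering
( plant_id          INTEGER   NOT NULL REFERENCES plant (plant_id),
  application_ts    TIMESTAMP NOT NULL,               /* required: date and time of application */
  water_amount      DECIMAL(6,2),                     /* amount of water */
  water_temperature DECIMAL(5,1),                     /* optional: not always checked */
  PRIMARY KEY (plant_id, application_ts)
);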
Relationships
After two or more entities are identified and defined with attributes, the participants determine if a relationship exists between the entities. A relationship is any association, linkage, or connection between the entities of interest to the business; it is a two-directional, significant association between two entities, or between an entity and itself. Each relationship has a name, an optionality (optional or mandatory), and a degree (how many). A relationship is described in real terms.
Rarely will there be a relationship between every entity and every other entity in an application. If there are only two or three entities, then perhaps there will be relationships between them all. In a larger application, there are not always relationships between one entity and all of the others.
Assigning a name, an optionality, and a degree to a relationship helps confirm the validity of that relationship. If you cannot give a relationship all these things, then perhaps there really is no relationship at all. For example, there is a relationship between Plant and Watering. Each Plant may be given one or more Waterings. Each Watering must be for one and only one specific Plant.
Entity Relationship Diagrams
To visually record the entities and the relationships between them, an entity relationship diagram, or ERD, is drawn. An ERD is a pictorial representation of the entities and the relationships between them. It allows the participants in the meeting to easily see the information structure of the application. Later, the project team uses the ERD to design the database and tables. Knowing how to read an ERD is very important. If there are any mistakes or relationships missing, the application will fail in that respect. Although somewhat cryptic, learning to read an ERD comes quickly.
Each entity is drawn in a box. Each relationship is drawn as a line between entities. The relationship between Plant and Watering is drawn on the ERD as follows:
[Figure: ERD showing the relationship between the Plant and Watering entities]

Since a relationship is between two entities, an ERD shows how one entity relates to the other, and vice versa. Reading an ERD relationship means you have to read it from one entity to the other, and then from the other to the first. Each style and mark on the relationship line has some significance to the relationship and its reading. Half the relationship line belongs to the entity on that side of the line. The other half belongs to the other entity on the other side of the line.
When you read a relationship, start with one entity and note the line style starting at that entity. Ignore the latter half of the line's style, since it's there for you to come back the other way. A solid line at an entity represents a mandatory relationship. In the example above, each Watering must be for one and only one Plant. A dotted line at an entity represents an optional relationship. Each Plant may be given one or more Watering.
The way in which the relationship line connects to an entity is significant. If it connects with a single line, it represents one and only one occurrence of that entity. In the example, each Watering must be for one and only one Plant. If the relationship line connects with three prongs, i.e., a crowsfoot, it represents one or more of the entities. Each Plant may be given one or more Waterings. As long as both statements are true, then you know you have modeled the relationship properly.
In the relationship between Plant and Watering, there are two relationship statements. One is: each Watering must be for one and only one Plant. These are the parts of the ERD which that statement uses:
[Figure: the part of the ERD read as "each Watering must be for one and only one Plant"]

The second statement is: each Plant may be given one or more Waterings. The parts of the ERD which that statement uses are:
[Figure: the part of the ERD read as "each Plant may be given one or more Waterings"]

After some experience, you learn to ask the appropriate questions to determine if two entities are related to each other, and the degree of that relationship. After agreeing on the entities and their relationships, the process of identifying more entities, describing them, and determining their relationships continues until all of the services of the application have been examined. The data model remains software and hardware independent.
Many-to-Many Relationships
There are different types of relationships. The greenhouse plant application example showed a one-to-many and a many-to-one relationship, both between Plant and Watering. Two other relationships commonly found in data models are one-to-one and many-to-many. One-to-one relationships are between two entities where both are related to each other, once and only once for each instance of either. In a many-to-many relationship, multiple occurrences of one entity are related to one occurrence of another, and vice versa.
An example of a many-to-many relationship in the greenhouse plant application is between the Plant and Additive entities. Each plant may be treated with one or more Additives. Each Additive may be given to one or more Plants. 

Many-to-many relationships cannot be directly converted into database tables and relationships. This is a restriction of the database systems, not of the application. The development team has to resolve the many-to-many relationship before it can continue with the database development. If you identify a many-to-many relationship in your analysis meeting, you should try to resolve it in the meeting. The participants can usually find a fitting entity to provide the resolution.
To resolve a many-to-many relationship means to convert it into two one-to-many, many-to-one relationships. A new entity comes between the two original entities, and this new entity is referred to as an intersection entity. It allows for every possible matched occurrence of the two entities. Sometimes the intersection entity represents a point or passage in time.

With these new relationships, Plant is now related to Treatment. Each Plant may be given one or more Treatments. Each Treatment must be given to one and only one Plant. Additive is also related to Treatment. Each Additive may be used in one or more Treatments. Each Treatment must be comprised of one and only one Additive. With these two new relationships, Treatment cannot exist without Plant and Additive. Treatment can occur multiple times, once for each treatment of a plant additive. To keep each Treatment unique, a new attribute is defined. Treatment now has application date and time attributes. They are the unique identifiers or the primary key of Treatment. Other attributes of Treatment are quantity and potency of the additive.
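
A hedged sketch of the resolved model in SQL (names are invented, and the plant table from the earlier sketch is assumed): the Treatment intersection table carries a foreign key to each original entity plus its own identifying application date and time.

CREATE TABLE additive
( additive_id INTEGER     NOT NULL PRIMARY KEY,
  name        VARCHAR(50) NOT NULL
);

CREATE TABLE treatment
( plant_id       INTEGER   NOT NULL REFERENCES plant (plant_id),
  additive_id    INTEGER   NOT NULL REFERENCES additive (additive_id),
  application_ts TIMESTAMP NOT NULL,  /* application date and time keep each Treatment unique */
  quantity       DECIMAL(8,2),
  potency        VARCHAR(20),
  PRIMARY KEY (plant_id, additive_id, application_ts)
);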
Will Data Modeling Look Good on You?
There are other processes and marks to enhance a data model besides the ones shown in this article. Many of them are used in the actual development of the database tables. The techniques shown here only provide a basic foundation for undertaking your own data modeling analysis.

Data modeling gives you the opportunity to shed the layers of processes covering up the fundamental essence of your business. Remember to leave your baggage at the door of a data modeling session. Come to the meeting with enthusiasm and a positive outlook for a new and improved application. 

The operational data store (ODS)

The operational data store (ODS) is a part of the data warehouse environment about which many managers have confused feelings. I am often asked, "Should I build an ODS?" I have decided that the underlying question is, "What is an ODS, anyway?" According to Bill Inmon and Claudia Imhoff in their book Building the Operational Data Store (John Wiley & Sons, 1996), an ODS is a "subject-oriented, integrated, volatile, current-valued data store, containing only corporate detailed data." This definition for an ODS reflects a real market need for current, operational data.
If anything, the need for the ODS function has grown in recent years and months. At the same time as our data warehouse systems have gotten bigger, the need to analyze ever more detailed customer behavior and ever more specific operational texture has grown. In most cases the analysis must be done on the most granular and detailed data that we can possibly source. The emergence of data mining has also demanded that we crawl though reams of the lowest-level data, looking for correlations and patterns.
Until now, the ODS was considered a different system from the main data warehouse because the ODS was based on "operational" data. The downstream data warehouse was almost always summarized. Because the warehouse was a complete historical record, we usually didn't dare store this operational (transactional) data as a complete history.
However, the hardware and software technology supporting data warehousing has kept rolling forward, able to store more data and able to process larger answer sets. We also have discovered how to extract and clean data rapidly, and we have figured out how to model it for user understandability and extreme query performance. It is fashionable these days to talk about multiterabyte data warehouses, and consultants braver than I talk about petabyte (1,000 terabyte) data warehouses being just around the corner.
Now I am getting suspicious of the ODS assumption we have been making that you cannot store the individual transactions of a big business in a historical time series. Let us stop for a moment and estimate the number of low-level sales transactions in a year for a large retailer. This is surprisingly simple. I use the following technique to triangulate the overall size of a data warehouse before I ever interview the end users.
Imagine that our large retailer has six billion dollars in retail sales per year. The only other fact we need is the average size of a line item on a typical sales transaction. Suppose that our retailer is a drug store and that the average dollar value of a line item is two dollars. We can immediately estimate the number of transaction line items per year as six billion dollars divided by two dollars, or three billion. This is a large number, but it is well within the range of many current data warehouses. Even a three-year history would "only" generate nine billion records. If we did a tight dimensional design with four 4-byte dimension keys and four 4-byte facts, then the raw fact table data size per year would be three billion times 32 bytes, or 96GB. Three years of raw data would be 288GB. I know of more than a dozen data warehouses bigger than this today.
Our regular data warehouses can now embrace the lowest-level transaction data as a multiyear historical time series, and we are using high-performance data extracting and cleaning tools to pull this data out of the legacy systems at almost any desired frequency each day. So why is my ODS a separate system? Why not just make the ODS the leading, breaking wave of the data warehouse itself?
With the growing interest in data mining fine-grained customer behavior in the form of individual customer transactions, we increasingly need detailed transaction-time histories available for analysis. The effort expended to make a lightweight, throwaway, traditional ODS data source (that is, a volatile, current-valued data store restricted to current data) is becoming a dead end and a distraction.
Let us take this opportunity to tighten and restrict the definition of the ODS. We will view the ODS simply as the "front edge" of the existing data warehouse. By bringing the ODS into the data warehouse environment, we make it more useful to clerks, executives, and analysts, and we need only to build a single extract system. This new, simplified view of the ODS is shown in Figure 1 and Figure 2.
Let us redefine the ODS as follows. The ODS is a subject-oriented, integrated, frequently augmented store of detailed data in the enterprise data warehouse.
The ODS is subject-oriented. That is, the ODS, like the rest of the data warehouse, is organized around specific business domains such as Customer, Product, Activity, Policy, Claim, or Shipment.
The ODS is integrated. The ODS gracefully bridges between subjects and presents an overarching view of the business rather than an incompatible stovepipe view of the business.
The ODS is frequently augmented. This requirement is a significant departure from the original ODS statement that said the ODS was volatile; that is, the ODS was constantly being overwritten and its data structures were constantly changing. This new requirement of frequently augmenting the data also invalidates Inmon and Imhoff's statement that the ODS contains only current-valued data. We aren't afraid to store the transaction history. In fact, that has now become our mission.
The ODS sits within the full data warehouse framework of historical data and summarized data. In a data warehouse containing a monthly summarized view of data in addition to the transaction detail, the input flow to the ODS also contributes to a special "current rolling month." In many cases, when the last day of the month is reached, the current rolling month becomes the most recent member of the standard months in the time series and a new current rolling month is created.
The ODS naturally supports a collective view of data. We now see how the ODS presents a collective view to the executive who must be able to see a customer's overall account balance. The executive can immediately and gracefully link to last month's collective view of the customer (via the time series) and to the surrounding class of customers (via the data warehouse aggregations).
The ODS is organized for rapid updating directly from the legacy system. The data extraction and cleaning industry has come a long way in the last few years. We can pipeline data from the legacy systems through data cleaning and data integrating steps and drop it into the ODS portion of the data warehouse. Inmon and Imhoff's original distinctions of Class I (near realtime upload), Class II (upload every few hours), and Class III (upload perhaps once per day) are still valid, but the architectural differences in the extract pipeline are far less interesting than they used to be. The ability to upload data very frequently will probably be based more on waiting for remote operational systems to deliver necessary data than computing or bandwidth restrictions in the data pipeline.
The ODS should be organized around a star join schema design. Inmon and Imhoff recommend the star join data model as "the most fundamental description of the design of the data found in the operational data store." The star join, or dimensional model, is the preferred data model for achieving user understandability and predictable high performance. (For further information on this subject, please see my article, "A Dimensional Modeling Manifesto," in the August 1997 issue of DBMS.)
The ODS contains all of the text and numbers required to describe low-level transactions, but may additionally contain back references to the legacy system that would allow realtime links to be opened to the legacy systems through terminal- or transaction-based interfaces. This is an interesting aspect of the original definition of the ODS, and is somewhat straightforward if the low-level transactions are streamed out of the legacy system and into the ODS portion of the data warehouse. What this means is that operational keys like the invoice number and line number are kept in the data flow all the way into the ODS, so that an application can pick up these keys and link successfully back to the legacy system interface.
The ODS is supported by extensive metadata needed to explain and present the meaning of the data to end users through query and reporting tools, as well as metadata describing an extract "audit" of the data warehouse contents.
Bringing the ODS into the existing data warehouse framework solves a number of problems. We can now focus on building a single data extract pipeline. We don't need to have a split personality where we are willing to have a volatile, changing data structure with no history and no support for performance-enhancing aggregations. Our techniques have improved in the last few years. We understand how to take a flow of atomic-level transactions, put them into a dimensional framework, and simultaneously build a detailed transaction history with no compromising of detail, and at the same time build a regular series of periodic snapshots that lets us rapidly track a complex enterprise over time. As I just mentioned, a special snapshot in this time series is the current rolling snapshot at the very front of the time series. This is the echo of the former separate ODS. In next month's column, I will describe the dual personality of transaction and snapshot schemas that is at the heart of "operational data warehouse."
Finally, if you have been reading this with a skeptical perspective, and you have been saying to yourself, "storing all that transaction detail just isn't needed in my organization: all my management needs are high-level summaries," then broaden your perspective and listen to what is going on in the marketing world. I believe that we are in the midst of a major move to one-on-one marketing in which large organizations are seeking to understand and respond to detailed and individual customer behavior. Banks need to know exactly who is at that ATM between 5:00 p.m. and 6:00 p.m., what transactions are they performing, and how that pattern has evolved this year in response to various bank incentive programs. Catalina Marketing is ready to print coupons at your grocery store register that respond to what you have in your shopping basket and what you have been buying in recent trips to the store. To do this, these organizations need all the gory transaction details, both current and historical.

Our data warehouse hardware and software are ready for this revolution. Our data warehouse design techniques are ready for this revolution. Are you ready? Bring your ODS in out of the rain and into the warehouse.

Figure 1.


 The original ODS architecture necessitated two pathways and two systems because the main data warehouse wasn't prepared to store low-level transactions.

Figure 2.


The new ODS reality. The cleaning and loading pathway needs only to be a single system because we are now prepared to build our data warehouse on the foundation of individual transactions.