Data Warehouse & Business Intelligence: Data Warehousing Fundamentals

1)What is the difference between star schema and snow flake schema ?and when we use those schema's?

A)Star Schema : Star Schema is a relational database schema for representing multidimensional data. It is the simplest form of data warehouse schema that contains one or more dimensions and fact tables. It is called a star schema because the entity-relationship diagram between dimensions and fact tables resembles a star where one fact table is connected to multiple dimensions. The center of the star schema consists of a large fact table and it points towards the dimension tables. The advantage of star schema are slicing down, performance increase and easy understanding of data.

Snowflake Schema : A snowflake schema is a term that describes a star schema structure normalized through the use of outrigger tables. i.e dimension table hierarchies are broken into simpler tables.

In a star schema every dimension will have a primary key.

 In a star schema, a dimension table will not have any parent table.

 Whereas in a snow flake schema, a dimension table will have one or more parent tables.

 Hierarchies for the dimensions are stored in the dimensional table itself in star schema.

 Whereas hierarchies are broken into separate tables in snow flake schema. These hierarchies helps to drill down the data from topmost hierarchies to the lowermost hierarchies.

Star schema : In this star schema fact table in normalized format and dimension table is in de-normalized format. It also known as basic star schema.

Snow flake schema: In this both dimension and fact table is in normalized format only. It is also known as Extended star schema. If u r taking the snow flake it requires more dimensions, more foreign keys, and it will reduce the query performance, but it normalizes the records.,depends on the requirement we can choose the schema

2)what is cube and why we are crating a cube what is diff between etl and olap cubes.

A)Any schema or Table or Report which gives you meaningful information of one attribute with respective to more than one attribute is called a cube.

For Ex: In a product table with Product ID and Sales columns , we can analyze Sales with respective to Product Name , but if you analyze Sales with respective to Product as well as Region( region being attribute in Location Table) the report or Resultant table or schema would be Cube.

ETL Cubes : Built in the staging area to load frequently accessed reports to the target.

Reporting Cubes: Built after the actual load of all the tables to the target depending on the customer requirement for his business analysis.

3)Summarize the difference between OLTP,ODS AND DATA WAREHOUSE ?

OLTP - means online transaction processing ,it is nothing but a database ,we are calling oracle,sql server,db2 are olap tools.

OLTP databases, as the name implies, handle real time transactions which inherently have some special requirements.

ODS- stands for Operational Data Store. Its a final integration point ETL process we load the data in ODS before you load the values in target..

Data Warehouse- Data warehouse is collection of integrated,time variant,non volatile and time variant collection of data which is used to take management decisions

ODS: this is operational data stores, which means the real time transactional databases. In data warehouse, we extract the data from ODS, transform in the staging area and load into the target data warehouse.

4)Do u need separate space for Data warehouse & Data mart?

A)In the data warehouse all the information of the enterprise is there but the data mart is specific for the particular analysis like sales,production ....,,, so data mart is subject oriented and warehouse is nothing but collection of data marts so we assume it also subject oriented because it's collection of data marts ... so for individual analysis we need data marts

5)how data in Data warehouse stored after data has been extracted and transformed from heterogeneous sources and where does the data go from Data warehouse?

A)Data in Data warehouse (Bill Inmon's approach) is stored in Relational form. Data in Data warehouse stored in the form of relational tables, most of the data warehouses approach is snowflake schema

6)what is data cleaning? how is it done?

A)Data cleansing is nothing but standardizing and reformatting ( encoding,decoding,data type conversion) the data before we store the data in the warehouse.

Data Cleansing: the act of detecting and removing and/or correcting a database’s dirty data (i.e., data that is incorrect, out-of-date, redundant, incomplete, or formatted incorrectly)

7)what is the difference between Data warehouse and Data warehousing?

A)Data warehousing encompasses a complete architecture and process, whereas Data warehouse is Data stored in Database in the form of Dimension, Fact Tables, Lookup Tables, Aggregated Fact tables.

Datawarehousing is not just having a single data warehouse . Data warehousing is the transformation of data to information, thereby enabling the business to examine its operations and performance. This task is accomplished by the staging and transformation of data from data sources, enabling the business to access and analyze information. The data stores may be persistent (stored on disk) or transient (using disk or memory). In addition, the workflow usually involves multiple data stores to support the staging and transformation of data into information such as operational data stores, data warehouses, data marts, online analytical processing cubes, files such as a flat file (comma-separated values extract, for example), XML data and even spreadsheets

8)what is the need of surrogate key;why primary key not used as surrogate key?

A)Surrogate Key is an artificial identifier for an entity. In surrogate key values are generated by the system sequentially(Like Identity property in SQL Server and Sequence in Oracle). They do not describe anything.

Primary Key is a natural identifier for an entity. In Primary keys all the values are entered manually by the user which are uniquely identified. There will be no repetition of data.

Need for surrogate key not Primary Key

If a column is made a primary key and later there needs a change in the data type or the length for that column then all the foreign keys that are dependent on that primary key should be changed making the database Unstable.

Surrogate Keys make the database more stable because it insulates the Primary and foreign key relationships from changes in the data types and length. Surrogate key is system generated unique number. Identity in SQL, Sequence in Oracle, Sequence Generator in Informatica.

For Example : You are extracting Customer Information from OLTP Source and after ETL process, loading customer information in a dimension table (DW). If you take SCD Type 1, Yes you can use Primary Key of Source Customer ID as Primary Key in Dimension Table. But if you would like to preserve history of customer in Dimension table i.e. Type 2. Then you need another unique no apart from Customer ID. There you have to use Surrogate Key.

Example Surrogate key:

Generates serial number for key values

1. For implementing SCD type-2. that is when we need to store previous data with current data.....the key already exist in table violates primary key rule.

In that case we create one extra field and we make it as key.

suppose

ename phoneno

xyz 9888333

xyz 9888222

--------suppose in this table ename is primary key. If we need to store 2 records we need to create extra field (surrogate key)and have to make as it as primary key.

2. For faster query---performance purpose

select * from emp where empno='adsa1233'

select * from emp where empid=123

in above 2 queries, first query takes more time. because it has to change character type to ascii format and it should compare.

So we create extra id(surrogate key) and generate numbers. While loading into fact it is very helpful.

Another reason : If you have Alpha Numeric as a Customer ID. Then you have to use surrogate key in Dimension Table. It is advisable to have system generated small integer number as a surrogate key in the dimension table. so that indexing and retrieval is much faster.

One more reason for using surrogate keys in a Data warehouse.

1. As data is extracted from disparate sources , where in each source might have primary keys with data types or formats inherent to the underlying database, if the same primary keys are are used in the DW , there would be inconsistencies in representation of data which would make querying of the database a difficult job , so the surrogate keys are implemented to circumvent these kind of situations .

9)Explain degenerated dimension in detail?

A)Degenerated dimension is a dimension which is not having any source in oltp. It is generated at the time of transaction. Like invoice no this is generated when the invoice is raised. It is not used in linking and it is also not a fkey. but we can refer these degenerated dimensions as a primary key of the fact table

A Degenerate dimension is a Dimension which has only a single attribute. This dimension is typically represented as a single field in a fact table. The data items that are not facts and data items that do not fit into the existing dimensions are termed as Degenerate Dimensions.

Degenerate Dimensions are the fastest way to group similar transactions. Degenerate Dimensions are used when fact tables represent transactional data. They can be used as primary key for the fact table but they cannot act as foreign keys

When the cardinality of column value is high, instead of maintaining a separate dimension and having the overhead of making a join with fact table, degenerated dimensions can be build.

For example, In sales fact table, Invoice number is a degenerated dimension. Since Invoice Number is not tied up to an order header table, hence there is no need for invoice number to join a dimensional table; hence it is referred as degenerate dimension. The degenerate Dimension are keys of a fact table which do not the corresponding dimension table

9)what is the differences between the static and dynamic caches?

A)static cache stores overloaded values in the memory and it wont change through out the running of the session where as dynamic cache stores the values in the memory and changes dynamically during the running of the session used in scd types -- where target table changes and is cache is dynamically changes.

10)Can a dimension table contains numeric values?

A)we can have numeric values in dimensional table but these are not frequently updated as dimension table contains constant data but only on some occasions it can change It may contain the numeric values but not summarized information's.

11)What is hybrid slowly changing dimension?

A)what ever changes done in source for each and every record there is a new entry in target side, whether it may be UPDATE or INSERT and in target maintaining the history. Hybrid SCD’s are combination of both SCD 1 and SCD 2.It may happen that in a table, some columns are important and we need to track changes for them i.e capture the historical data for them whereas in some columns even if the data changes, we don't care. For such tables we implement Hybrid SCDs, where in some columns are Type 1 and some are Type 2
Example I am giving:
Like account information is usually maintained in two categories:
Current Account and other is Time of Event Account i.e We have two set of tables eg CUR_ACCT this is fast moving dimension containing information like Balance etc , while the other is TOE_ACCT table this contains information like Contact Details, Phone No where history is not only important but considered to be changing slowly. With this respect TOE_ACCT table qualifies as slowly changing dimension

12)what is rapidly changing dimension?

A)There is no Dimension called Rapidly changing dimension in DWH. But if you consider overall ODS tables; a rapidly changing dimension is one that holds the transactional data rather than staging data. There is a Dimension called Monster Dimension. I hope if you can treat this as a rapidly changing dimension.

13)what is the data type of the surrogate key?

A)Data type of the surrogate key is either integer or numeric or number

Normally Surrogate keys are sequencers which keep on increasing with new records being injected into the table. The standard data type is integer Since Surrogate Keys are generated sequentially by the system the data type should be of numeric type.

data type is: INT4 LENGTH: 10 DECIMAL PLACE: 0 (zero)

14)Give examples of degenerated dimensions?

A)Degenerated Dimension is a dimension key without corresponding dimension. Example:
In the Point Of Sale Transaction Fact table, we have:
Date Key (FK), Product Key (FK), Store Key (FK), Promotion Key (FP), and POS Transaction Number
Date Dimension corresponds to Date Key, Production Dimension corresponds to Production Key. In a traditional parent-child database, POS Transactional Number would be the key to the transaction header record that contains all the info valid for the transaction as a whole, such as the transaction date and store identifier. But in this dimensional model, we have already extracted this info into other dimension. Therefore, POS Transaction Number looks like a dimension key in the fact table but does not have the corresponding dimension table.Therefore, POS Transaction Number is a degenerated dimension

15)what are the steps to build the Data warehouse?

Gathering business requirements
Identifying Sources
Identifying Facts
Defining Dimensions
Define Attributes
Redefine Dimensions & Attributes
Organize Attribute Hierarchy & Define Relationship
Assign Unique Identifiers
Additional conventions:Cardinality/Adding ratios
1.Understand the bossiness requirements.
2.Once the business requirements are clear then Identify the Grains(Levels).
3.Grains are defined ,design the Dimensional tables with the Lower level Grains.
4.Once the Dimensions are designed,design the Fact table With the Key Performance Indicators(Facts).
5.Once the dimensions and Fact tables are designed define the relation ship between the tables by using primary key and Foreign Key. In logical phase data base design looks like Star Schema design so it is named as Star Schema Design.

16)what are the different architecture of data warehouse?

A)There are three types of architectures.
Date warehouse Basic Architecture :
In this architecture end users access data that is derived from several sources through the data warehouse.
architecture: Source --> Warehouse --> End Users
Data warehouse with staging area Architecture:
Whenever the data that is derived from sources need to be cleaned and processed before putting it into warehouse then staging area is used.
architecture: Source --> Staging Area -->Warehouse --> End Users
Data warehouse with staging area and data marts Architecture:
Customization of warehouse architecture for different groups in the organization then data marts are added and used.
architecture: Source --> Staging Area --> Warehouse --> Data Marts --> End Users

17)what are the advantages data mining over tradition Approach..

A)Data Mining is used for the estimation of future. For example, if we take a company/business organization, by using the concept of Data Mining, we can predict the future of business in terms of Revenue (or) Employees (or) Customers (or) Orders etc.raditional approaches use simple algorithms for estimating the future. But, it does not give accurate results when compared to Data Mining.

18) What is the difference between view and materialized Views?.

view - store the SQL statement in the database and let you use it as a table. Everytime you access the view, the SQL statement executes.
materialized view - stores the results of the SQL in table form in the database. SQL statement only executes once and after that everytime you run the query, the stored result set is used. Pros include quick query results.
View is a logical reference to a database table. But Meterial View is actual table and we can refresh data in time intervels. If you made any change in database table that change will effect into view but not meterialize view..
View is a PSEUDO table that is not stored in the database and it is just a query
MATERIALIZED VIEWS These are similar to a view but these are permantely stored in the database and often refreshed. This is used in optimization for the faster data retrieval and is useful in aggregation and summarization of data.
Normal Views are for Access purpose ie One need to give access the end user to specific data.
19)What is the main difference between Inmon and Kimbal approach ?
A)RalfKimball: he follows bottom-up approach i.e., first create individual Data Marts from the existing sources and then create Data Warehouse.
BillImmon: he follows top-down approach i.e., first create Data Warehouse from the existing sources and then create individual Data Marts.
20)why fact table is in normal form?
A)To eliminate the redundancy
21)What is Difference between E-R Modeling and Dimensional Modeling?
A1) E-R modeling revolves around the Entities and their relationships to capture the overall process of the system. Dimensional model/Muti-Dimensional Modeling revolves around Dimensions(point of analysis) for decision making and not to capture the process.
A2)In ER modeling the data is in normalized form. So more number of Joins, which may adversely affect the system performance. Whereas in Dimensional Modeling the data is de-normalized, so less number of joins, by which system performance will improve.

22)What is conformed fact?
A)A fact,which can be used across multiple data marts is called as conformed fact.
23)What are the methodologies of Data Warehousing?
A)In Data warehousing contains the Two Methods
1>> Top Down Method
2>>Bottom up method
In Top Down method First load the Data marts and then load the data warehouse.
In Bottom Up method first load the Data warehouse and then load the Data marts.
24)What is Dimensional Modeling?

A)Systematic arrangement of Fact/Dimension tables are called as Schema, the designing a schema in data warehouse / data mart is known as Dimension modeling

25)What is degenerate dimension table?
the values of dimension which is stored in fact table is called degenerate dimensions. these dimensions doesn't have its own dimensions
26)What is a CUBE in data warehousing concept?
Cubes are logical representation of multidimensional data. The edge of the cube contains dimension members and the body of the cube contains data values
Cube is a logical schema which contains facts and dimensions
yes,cube is a logical representation of data.12 steps are required for the creation of cubes. Its mainly used for analysis purpose. we can analyze which product can be sold more in which country all these we available in the cube by performing operations like roll-up,drill-down,slice and dice
27)What is ODS?
A)ODS means Operational Data Store. It is used to store current data through transactional web applications,sap,MQ series. Current data means particular data from one date into one date ods contains 30-90 data
28) What are conformed dimensions ?
A)Conformed dimentions are dimensions which are common to the cubes.(cubes are the schemas contains facts and dimension tables)
Consider Cube-1 contains F1,D1,D2,D3 and Cube-2 contains F2,D1,D2,D4 are the Facts and Dimensions
here D1,D2 are the Conformed Dimensions
29) What is SCD ? types of SCD?
A)SCD:-------- The value of dimensions is used change very rarely, That is called Slowly Changing dimensions
Here mainly 3
1)SCD1:Replace the old values overwrite by new values
2)SCD2:Just Creating Additional records
3)SCD3:It's maintain just previous and recent
In the SCD2 again 3
1)Versioning
2)Flagvalue
3)Effective Date range
Versioning:Here the updated dimensions inserted in to the target along with version number
The new dimensions will be inserted into the target along with Primary key
Flag value:The updated dimensions insert into the target along with 0
and new dimensions inset into the target along with 1
30) What is Normalization, First Normal Form, Second Normal Form , Third Normal Form?

A) Normalization can be defined as segregating of table into two different tables, so as to avoid duplication of values. The normalization is a step by step process of removing redundancies and dependencies of attributes in data structure

The condition of data at completion of each step is described as a “normal form”. Needs for normalization : improves data base design.
Ensures minimum redundancy of data. Reduces need to reorganize data when design is modified or enhanced. Removes anomalies for database activities.

First normal form :
· A table is in first normal form when it contains no repeating groups.
· The repeating column or fields in an un normalized table are removed from the table and put in to tables of their own.
· Such a table becomes dependent on the parent table from which it is derived.
· The key to this table is called concatenated key, with the key of the parent table forming a part it.

Second normal form:
· A table is in second normal form if all its non_key fields fully dependent on the whole key.
· This means that each field in a table ,must depend on the entire key.
· Those that do not depend upon the combination key, are moved to another table on whose key they depend on.
· Structures which do not contain combination keys are automatically in second normal form.
Third normal form:
· A table is said to be in third normal form , if all the non key fields of the table are independent of all other non key fields of the same table.
31) What are non-additive facts?
A) Non-additive facts are facts that cannot be summed up for any of
the dimensions present in the fact table. Example: temperature,bill number...etc
fact table typically has two types of columns: those that contain numeric facts (often called measurements), and those that are foreign keys to dimension tables.

A fact table contains either detail-level facts or facts that have been aggregated. Fact tables that contain aggregated facts are often called summary tables. A fact table usually contains facts with the same level of aggregation.

Though most facts are additive, they can also be semi-additive or non-additive. Additive facts can be aggregated by simple arithmetical addition. A common example of this is sales. Non-additive facts cannot be added at all.

An example of this is averages. Semi-additive facts can be aggregated along some of the dimensions and not along others. An example of this is inventory levels, where you cannot tell what a level means simply by looking at it.
32) What are Semi-additive and fact-less facts and in which scenario will you use such kinds of fact tables?
A) Semi-Additive: Semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but not the others. For example:
Current_Balance and Profit_Margin are the facts. Current_Balance is a semi-additive fact, as it makes sense to add them up for all accounts (what's the total current balance for all accounts

in the bank?), but it does not make sense to add them up through time (adding up all current balances for a given account for each day of the month does not give us any useful information
Fact-less Fact table
A fact-less fact table captures the many-to-many relationships between
dimensions, but contains no numeric or textual facts. They are often used to record events or
coverage information. Common examples of fact-less fact tables include:
- Identifying product promotion events (to determine promoted products that didn’t sell)
- Tracking student attendance or registration events
- Tracking insurance-related accident events
- Identifying building, facility, and equipment schedules for a hospital or university
33) What is a level of Granularity of a fact table?
A) Level of granularity means level of detail that you put into the fact table in a data warehouse. For example: Based on design you can decide to put the sales data in each transaction. Now, level of granularity would mean what detail are you willing to put for each transactional fact. Product sales with respect to each minute or you want to aggregate it up to minute and put that data.

34) What is the difference between OLAP, ROLAP, MOLAP and HOLAP?

A) ROLAP, MOLAP and HOLAP are specialized OLAP (Online Analytical Analysis) applications.

ROLAP stands for Relational OLAP. Users see their data organized in cubes with dimensions, but the data is really stored in a Relational Database (RDBMS) like Oracle. The RDBMS will store data at a fine grain level, response times are usually slow.

MOLAP stands for Multidimensional OLAP. Users see their data organized in cubes with dimensions, but the data is store in a Multi-dimensional database (MDBMS) like Oracle Express Server. In a MOLAP system lot of queries have a finite answer and performance is usually critical and fast.

HOLAP stands for Hybrid OLAP, it is a combination of both worlds. Seagate Software's Holos is an example HOLAP environment. In a HOLAP system one will find queries on aggregated data as well as on detailed data.

35) What is the difference between an ODS and a W/H?

A) An ODS (Operational Data Store) is an integrated database of operational data. Its sources include legacy systems and it contains current or near term data. An ODS may contain 30 to 90 days of information.

A warehouse typically contains years of data (Time Referenced). Data warehouses group data by subject rather than by activity (subject-oriented). Other properties are: Non-volatile (read only) and Integrated.

36) When should one use an MD-database (multi-dimensional database) and not a relational one?

A) Data in a multi-dimensional database is stored as business people views it, allowing them to slice and dice the data to answer business questions. When designed correctly, an OLAP database will provide must faster response times for analytical queries.

Normal relational databases store data in two-dimensional tables and analytical queries against them are normally very slow.

37) How can Oracle Materialized Views be used to speed up data warehouse queries?

A) With "Query Rewrite" (QUERY_REWRITE_ENABLED=TRUE in INIT.ORA) Oracle can direct queries to use pre-aggregated tables instead of scanning large tables to answer complex queries.

Materialized views in a W/H environments is typically referred to as summaries, because they store summarized data.

Data Warehouse & Business Intelligence

Lables

Thursday, 12 September 2013

Data Warehousing Fundamentals

No comments:

Post a Comment