Extract Transform & Load: October 2012

10/23/12

Difference Between Normal Lookup and Sparse Lookup

Normal Lookup data needs to be in memory
Normal might provide poor performance if the reference data is huge as it has to put all the data in memory.
Normal Lookup can have more than one reference link.
Normal lookup can be used with any database.

Sparse Lookup directly hits the database.
If the input stream data is less and reference data is more like 1:100 or more in such cases sparse lookup is better.
Sparse Lookup,we can only have one reference link.
Sparse lookup,we can only use for Oracle and DB2.
Sparse lookup sends individual sql statements for every incoming row.(Imagine if the reference data is huge).

This Lookup type option can be found in Oracle or DB2 stages.Default is Normal.

When do you use Snowflake Schema Implementation?

Ralph Kimball, the data warehousing guru, proposes three cases where snowflake implementation is not only acceptable but is also the key to a successful design:

Large customer dimensions where, for example, 80 percent of the fact table measurements involve anonymous visitors about whom you collect little detail, and 20 percent involve reliably registered customers about whom you collect much detailed data by tracking many dimensions.

Financial product dimensions for banks, brokerage houses, and insurance companies, because each of the individual products has a host of special attributes not shared by other products.

Multi enterprise calendar dimensions because each organization has idiosyncratic fiscal periods,seasons, and holidays.

Ralph Kimball recommends that in most of the other cases, star schemas are a better solution. Although redundancy is reduced in a normalized snowflake, more joins are required. Kimball usually advises that it is not a good idea to expose end users to a physical snowflake design, because it almost always compromises understandability and performance.

Difference Between Master Data Management(MDM) and Data warehouse

Master Data Management and Data Warehousing have a lot in common. For example, the effort of data transformation and cleansing is very similar to an ETL process in data warehousing, and in fact they can use the same ETL tools. In the real world, it is not uncommon to see MDM and data warehousing fall into the same project. On the other hand, it is important to call out the main differences between the two:

1) Different Goals

The main purpose of a data warehouse is to analyze data in a multidimensional fashion, while the main purpose of MDM is to create and maintain a single source of truth for a particular dimension within the organization. In addition, MDM requires solving the root cause of the inconsistent metadata, because master data needs to be propagated back to the source system in some way. In data warehousing, solving the root cause is not always needed, as it may be enough just to have a consistent view at the data warehousing level rather than having to ensure consistency at the data source level.

2) Different Types of Data

Master Data Management is only applied to entities and not transactional data, while a data warehouse includes data that are both transactional and non-transactional in nature. The easiest way to think about this is that MDM only affects data that exists in dimensional tables and not in fact tables, while in a data warehousing environment includes both dimensional tables and fact tables.

3) Different Reporting Needs

In data warehousing, it is important to deliver to end users the proper types of reports using the proper type of reporting tool to facilitate analysis. In MDM, the reporting needs are very different -- it is far more important to be able to provide reports on data governance, data quality, and compliance, rather than reports based on analytical needs.

4) Where Data Is Used

In a data warehouse, usually the only usage of this "single source of truth" is for applications that access the data warehouse directly, or applications that access systems that source their data straight from the data warehouse. Most of the time, the original data sources are not affected. In master data management, on the other hand, we often need to have a strategy to get a copy of the master data back to the source system. This poses challenges that do not exist in a data warehousing environment. For example, how do we sync the data back with the original source? Once a day? Once an hour? How do we handle cases where the data was modified as it went through the cleansing process? And how much modification do we need make do to the source system so it can use the master data? These questions represent some of the challenges MDM faces. Unfortunately, there is no easy answer to those questions, as the solution depends on a variety of factors specific to the organization, such as how many source systems there are, how easy / costly it is to modify the source system, and even how internal politics play out.

10/19/12

Oracle SQL Hints

/*+ hint */
/*+ hint(argument) */
/*+ hint(argument-1 argument-2) */

10/16/12

Difference Between UNION and UNION ALL

The Main Difference is

UNION returns the data without duplicates
UNION ALL returns the data with duplicates if there any.

UNION is costly as it has to check for redundancy
UNION ALL is faster.

Difference Between Delete,Truncate and Drop

Delete :

Delete is a DML command.
Delete command uses undo table space for roll back.
deletes can be undo by using Rollback.
Delete command can be used to delete few rows from table or all rows from table.
After the Delete command need to use the Commit or Rollback command to permanent the transaction.

Truncate ;

Truncate is a DDL command.
Truncate removes all rows from a table.
truncate wont use any table space and triggers will be filred.
It cannot be Rolled Back.
It's faster than Delete.

Drop :

Drop removes the data and the table structure from the Database.
Operation cannot be rolled back.
No DML trigger will be fired.

How to run the UNIX Processes in background

using "&" symbol (ampersand) at the end of the command we can run the processes in the background.

Ex : sleep 100 &

it will give the PID and leave the terminal for other work.we can see the background process by using the "ps" command.

Nav

10/23/12

10/19/12

10/16/12