
DataStage FAQs

Difference Between Server Jobs and Parallel Jobs:


  • DataStage parallel jobs can run in parallel on multiple nodes; server jobs do not run on multiple nodes.
  • Parallel jobs support partition parallelism (round robin, hash, modulus, etc.); server jobs do not support this (a small SQL sketch of modulus partitioning follows this list).
  • The transformer in parallel jobs is compiled into C++. In server jobs, the transformer is compiled in the BASIC language.
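
As a minimal illustration of the modulus partitioning mentioned above (the table, the customer_id key column, and the four-node count are assumptions made up for this sketch, not part of any real configuration), the idea can be expressed in SQL as follows:

-- Illustrative only: modulus partitioning assigns each row to a partition
-- by taking the key value modulo the number of nodes (assumed to be 4 here).
SELECT
    customer_id,
    MOD(customer_id, 4) AS partition_no   -- rows with the same remainder land on the same node
FROM customers
ORDER BY partition_no;

Hash partitioning works along the same lines, except a hash function of one or more key columns takes the place of the simple modulus.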

--------------------------------------
A basic difference is that server jobs usually run on the Windows platform, while parallel jobs run on the UNIX platform.

  • A server job runs on one node, whereas a parallel job runs on more than one node.
  • Even server jobs run on UNIX; most of the major installations are on the UNIX platform. Coming to the differences, there is a major difference in job architecture.
  • Server jobs process data in sequence, one stage after another.
  • Parallel jobs process data in parallel. They use the configuration file to know the number of CPUs (nodes) defined for parallel processing.

------------------------------------------------
Compare and Contrast ODBC and Plug-In stages?

ODBC stage:
a) Poor performance.
b) Can be used for a variety of databases.
c) Can handle stored procedures.

Plug-in stage:
a) Good performance.
b) Database specific (only one database).
c) Cannot handle stored procedures.
------------------------------------------
What is Dimensional Modeling?

  • Dimensional modeling is a logical design technique that seeks to present data in a standard framework that is intuitive and allows for high-performance access.

----------------------------
What is Star Schema?

  • A star schema is a de-normalized multi-dimensional model. It contains a centralized fact table surrounded by dimension tables.
  • Dimension table: contains a primary key and descriptive attributes that give context to the facts.
  • Fact table: contains foreign keys to the dimension tables, along with measures and aggregates.
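
As a minimal sketch of such a schema (all table and column names here are made up for illustration), the fact table carries the keys and measures while each dimension carries its own primary key and descriptions:

-- Hypothetical dimension tables: a primary key plus descriptive attributes.
CREATE TABLE dim_customer (
    customer_key   INTEGER PRIMARY KEY,
    customer_name  VARCHAR(100),
    city           VARCHAR(50)
);

CREATE TABLE dim_product (
    product_key    INTEGER PRIMARY KEY,
    product_name   VARCHAR(100),
    category       VARCHAR(50)
);

-- Hypothetical fact table: foreign keys to the dimensions plus numeric measures.
CREATE TABLE fact_sales (
    customer_key   INTEGER REFERENCES dim_customer (customer_key),
    product_key    INTEGER REFERENCES dim_product (product_key),
    quantity_sold  INTEGER,
    sales_amount   DECIMAL(12,2)
);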

-------------------------
What is Snowflake Schema?

A snowflake schema is represented by a centralized fact table connected to multiple dimension tables, in which the dimensions are normalized into multiple related tables. The dimensions of a snowflake schema are elaborated, having multiple levels of relationships, where child tables have multiple parent tables. "Snowflaking" affects only the dimension tables, not the fact table.
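
A minimal sketch of snowflaking the (hypothetical) product dimension from the star-schema example above: the category attribute is normalized out into its own related table, while the fact table stays as it was.

-- Snowflaked variant of the product dimension: category becomes its own table.
CREATE TABLE dim_category (
    category_key   INTEGER PRIMARY KEY,
    category_name  VARCHAR(50)
);

CREATE TABLE dim_product (
    product_key    INTEGER PRIMARY KEY,
    product_name   VARCHAR(100),
    category_key   INTEGER REFERENCES dim_category (category_key)   -- child dimension points to its parent
);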
----------------------------------------------
What is Normalization?

Ans 1: Normalization is the process of efficiently organizing data in a database. There are two goals of the normalization process: eliminating redundant data and ensuring data dependencies make sense (only storing related data in a table).
Ans 2: Normalization is the process of removing repeated information. It first removes repeated records in a table.
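
As a minimal sketch with made-up table names, normalization splits a wide table that repeats customer details on every order into two tables so each piece of information is stored only once:

-- Before normalization (hypothetical): customer details are repeated on every order row.
CREATE TABLE orders_flat (
    order_id       INTEGER PRIMARY KEY,
    customer_id    INTEGER,
    customer_name  VARCHAR(100),
    customer_city  VARCHAR(50),
    order_amount   DECIMAL(12,2)
);

-- After normalization: customer details are stored once and referenced by key.
CREATE TABLE customers (
    customer_id    INTEGER PRIMARY KEY,
    customer_name  VARCHAR(100),
    customer_city  VARCHAR(50)
);

CREATE TABLE orders (
    order_id       INTEGER PRIMARY KEY,
    customer_id    INTEGER REFERENCES customers (customer_id),
    order_amount   DECIMAL(12,2)
);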


What is De-normalization?

De-normalization is the process of attempting to optimize the performance of a database by adding redundant data or by grouping data.
It is usually done to decrease the time required to execute complex queries. In a normalized database, joins are required to gather all the information from multiple entities, because the data is divided and stored in multiple entities rather than in one large table.
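
A minimal sketch, reusing the hypothetical customers and orders tables from the normalization example (exact CREATE TABLE ... AS syntax varies by database): the customer city is copied back onto each order row so a common report no longer needs a join.

-- Denormalization (illustrative): redundantly store customer_city on every order
-- so queries can filter by city without joining back to customers.
CREATE TABLE orders_by_city AS
SELECT
    o.order_id,
    o.customer_id,
    c.customer_city,     -- redundant copy, traded for faster reads
    o.order_amount
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id;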


What is ODS?

  • Operational Data Store is a staging area where data can be rolled back.


Importance of Surrogate Key in Data warehousing?

  • A surrogate key is a system-generated artificial primary key.
  • A surrogate key is the primary key for a dimension table. Its main importance is that it is independent of the underlying database, i.e. the surrogate key is not affected by changes going on in the database.
  • It is useful because the natural primary key (e.g. Customer Number in the Customer table) can change, and this makes updates more difficult.
  • Some tables have columns such as AIRPORT_NAME or CITY_NAME which are stated as the primary keys (according to the business users), but not only can these change, indexing on a numerical value is probably better. You could consider creating a surrogate key called, say, AIRPORT_ID; this would be internal to the system, and as far as the client is concerned you may display only the AIRPORT_NAME (a small sketch follows this list).
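
A minimal sketch of that AIRPORT example, with hypothetical names and a standard identity column standing in for whatever key-generation mechanism the database or the ETL job actually provides:

-- Hypothetical airport dimension: AIRPORT_ID is the system-generated surrogate key;
-- the business attributes (code, name, city) can change without breaking references.
CREATE TABLE dim_airport (
    airport_id    INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    airport_code  CHAR(3),            -- natural/business key from the source system
    airport_name  VARCHAR(100),
    city_name     VARCHAR(50)
);

Fact rows would reference airport_id, so renaming an airport does not ripple through the fact table.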

----------------------------------------
What are other Performance tunings you have done in your last project to increase the performance of slowly running jobs?


  • Staged the data coming from ODBC/OCI/DB2UDB stages or any database on the server using Hash/Sequential files, for optimum performance and also for data recovery in case the job aborts.
  • Tuned the OCI stage for 'Array Size' and 'Rows per Transaction' numerical values for faster inserts, updates and selects.
  • Tuned the 'Project Tunables' in Administrator for better performance.
  • Used sorted data for Aggregator.
  • Sorted the data as much as possible in the database and reduced the use of DS Sort for better job performance.
  • Removed unused data from the source as early as possible in the job.
  • Worked with the DB admin to create appropriate indexes on tables for better performance of DS queries.
  • Converted some of the complex joins/business logic in DS to stored procedures for faster execution of the jobs.
  • If an input file has an excessive number of rows and can be split up, then use standard logic to run jobs in parallel.
  • Before writing a routine or a transform, make sure that the required functionality is not already available in one of the standard routines supplied in the SDK or DS Utilities categories.
  • Constraints are generally CPU intensive and take a significant amount of time to process. This may be the case if the constraint calls routines or external macros, but if it is inline code then the overhead will be minimal.
  • Try to have the constraints in the 'Selection' criteria of the jobs themselves. This will eliminate unnecessary records even before joins are made.
  • Tuning should occur on a job-by-job basis.
  • Use the power of DBMS.
  • Try not to use a Sort stage when you can use an ORDER BY clause in the database.
  • Using a constraint to filter a record set is much slower than performing a SELECT … WHERE … in the database (see the sketch after this list).
  • Make every attempt to use the bulk loader for your particular database. Bulk loaders are generally faster than using ODBC or OLE.
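
As a minimal sketch of pushing the filtering and sorting down to the database (table and column names are hypothetical, and the exact query would live in the source stage's user-defined SQL), the idea looks like this:

-- Illustrative source query: the WHERE clause replaces a downstream constraint,
-- and the ORDER BY replaces a separate Sort stage, so the DBMS does both before
-- any rows reach the job.
SELECT
    order_id,
    customer_id,
    order_amount
FROM orders
WHERE order_date >= DATE '2010-01-01'   -- filter pushed down to the database
ORDER BY customer_id;                   -- pre-sorted input, e.g. for the Aggregator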