10/23/12

Difference Between Normal Lookup and Sparse Lookup


  • In a Normal Lookup, the reference data has to be loaded into memory.
  • A Normal Lookup can perform poorly if the reference data is huge, because all of it has to be held in memory.
  • A Normal Lookup can have more than one reference link.
  • A Normal Lookup can be used with any database.
  • A Sparse Lookup hits the database directly.
  • If the input stream is small and the reference data is much larger (a ratio of 1:100 or more), a Sparse Lookup is the better choice.
  • A Sparse Lookup can have only one reference link.
  • A Sparse Lookup can only be used with Oracle and DB2.
  • A Sparse Lookup sends an individual SQL statement for every incoming row (imagine the overhead if the incoming data is huge).

The Lookup Type option can be found in the Oracle or DB2 database stages. The default is Normal.
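To make the row-by-row behaviour concrete, here is a minimal sketch of the kind of statement a Sparse Lookup effectively issues for each incoming row (the table and column names are made up for illustration):

-- fired once per incoming row; :input_key is bound from the stream record
SELECT ref.customer_name, ref.customer_type
FROM   customer_ref ref
WHERE  ref.customer_id = :input_key;

A Normal Lookup, by contrast, reads the whole reference table into memory once and probes it in memory for every row.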

When do you use Snowflake Schema Implementation?

Ralph Kimball, the data warehousing guru, proposes three cases where snowflake implementation is not only acceptable but is also the key to a successful design:

  • Large customer dimensions where, for example, 80 percent of the fact table measurements involve anonymous visitors about whom you collect little detail, and 20 percent involve reliably registered customers about whom you collect much detailed data by tracking many dimensions.
  • Financial product dimensions for banks, brokerage houses, and insurance companies, because each of the individual products has a host of special attributes not shared by other products.
  • Multi-enterprise calendar dimensions, because each organization has idiosyncratic fiscal periods, seasons, and holidays.

Ralph Kimball recommends that in most of the other cases, star schemas are a better solution. Although redundancy is reduced in a normalized snowflake, more joins are required. Kimball usually advises that it is not a good idea to expose end users to a physical snowflake design, because it almost always compromises understandability and performance.

Difference Between Master Data Management (MDM) and Data Warehousing

Master Data Management and Data Warehousing have a lot in common. For example, the effort of data transformation and cleansing is very similar to an ETL process in data warehousing, and in fact they can use the same ETL tools. In the real world, it is not uncommon to see MDM and data warehousing fall into the same project. On the other hand, it is important to call out the main differences between the two:


1) Different Goals

The main purpose of a data warehouse is to analyze data in a multidimensional fashion, while the main purpose of MDM is to create and maintain a single source of truth for a particular dimension within the organization. In addition, MDM requires solving the root cause of the inconsistent metadata, because master data needs to be propagated back to the source system in some way. In data warehousing, solving the root cause is not always needed, as it may be enough just to have a consistent view at the data warehousing level rather than having to ensure consistency at the data source level.


2) Different Types of Data

Master Data Management is only applied to entities and not transactional data, while a data warehouse includes data that are both transactional and non-transactional in nature. The easiest way to think about this is that MDM only affects data that exists in dimension tables and not in fact tables, while a data warehousing environment includes both dimension tables and fact tables.


3) Different Reporting Needs

In data warehousing, it is important to deliver to end users the proper types of reports using the proper type of reporting tool to facilitate analysis. In MDM, the reporting needs are very different -- it is far more important to be able to provide reports on data governance, data quality, and compliance, rather than reports based on analytical needs.


4) Where Data Is Used

In a data warehouse, usually the only usage of this "single source of truth" is for applications that access the data warehouse directly, or applications that access systems that source their data straight from the data warehouse. Most of the time, the original data sources are not affected. In master data management, on the other hand, we often need to have a strategy to get a copy of the master data back to the source system. This poses challenges that do not exist in a data warehousing environment. For example, how do we sync the data back with the original source? Once a day? Once an hour? How do we handle cases where the data was modified as it went through the cleansing process? And how much modification do we need to make to the source system so it can use the master data? These questions represent some of the challenges MDM faces. Unfortunately, there is no easy answer to those questions, as the solution depends on a variety of factors specific to the organization, such as how many source systems there are, how easy or costly it is to modify the source system, and even how internal politics play out.



10/19/12

Oracle SQL Hints


/*+ hint */
/*+ hint(argument) */
/*+ hint(argument-1 argument-2) */
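For example (a sketch using a hypothetical employees table and emp_name_ix index, so the object names are illustrative only):

SELECT /*+ FULL(e) */ employee_id, last_name
FROM employees e
WHERE salary > 50000;

SELECT /*+ INDEX(e emp_name_ix) */ employee_id, last_name
FROM employees e
WHERE last_name = 'SMITH';

The hint comment must appear immediately after the SELECT (or INSERT/UPDATE/DELETE) keyword; otherwise Oracle silently ignores it.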

10/16/12

Difference Between UNION and UNION ALL

The main differences are:

UNION returns the data without duplicates.
UNION ALL returns the data with duplicates, if there are any.

UNION is more costly because it has to check for and remove duplicate rows.
UNION ALL is faster.
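A quick sketch with two hypothetical tables:

SELECT city FROM customers
UNION
SELECT city FROM suppliers;       -- each distinct city appears once

SELECT city FROM customers
UNION ALL
SELECT city FROM suppliers;       -- duplicates are kept; no duplicate check, so it is faster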

Difference Between Delete, Truncate and Drop



Delete :
  • Delete is a DML command.
  • The Delete command uses undo tablespace for rollback.
  • Deletes can be undone by using Rollback.
  • The Delete command can be used to delete a few rows or all rows from a table.
  • After a Delete you need to issue Commit or Rollback to make the change permanent.
Truncate :
  • Truncate is a DDL command.
  • Truncate removes all rows from a table.
  • Truncate does not use undo tablespace and no DML triggers are fired.
  • It cannot be Rolled Back.
  • It's faster than Delete.
Drop :
  • Drop removes the data and the table structure from the Database.
  • Operation cannot be rolled back.
  • No DML trigger will be fired.
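A small illustration of the three commands on a hypothetical emp table:

DELETE FROM emp WHERE dept_id = 10;   -- DML: removes selected rows, can be rolled back
ROLLBACK;                             -- the deleted rows come back

TRUNCATE TABLE emp;                   -- DDL: removes all rows, cannot be rolled back

DROP TABLE emp;                       -- removes the data and the table structure itself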



How to run the UNIX Processes in background

Using the "&" symbol (ampersand) at the end of a command, we can run the process in the background.

Ex :  sleep 100 &

It will print the PID and leave the terminal free for other work. We can see the background process by using the "ps" command.

9/28/12

Copy Stage - Force Option

Copy Stage :  Stage -> Properties has an option called Force.

Force : True or False.

True specifies that DataStage should not try to optimize the job by removing the Copy operation.

False specifies that DataStage may try to optimize the job (it might or might not remove the Copy operator).

9/14/12

Conductor Node in Datastage



Below is a sample APT CONFIG FILE; the "conductor" pool on node0 marks the conductor node.


{
node "node0"
{
fastname "server1"
pools "conductor"
resource disk "/datastage/Ascential/DataStage/Datasets/node0" {pools "conductor"}
resource scratchdisk "/datastage/Ascential/DataStage/Scratch/node0" {pools ""}
}
node "node1"
{
fastname "server2"
pools ""
resource disk "/datastage/Ascential/DataStage/Datasets/node1" {pools ""}
resource scratchdisk "/datastage/Ascential/DataStage/Scratch/node1" {pools ""}
}
node "node2"
{
fastname "server2"
pools ""
resource disk "/datastage/Ascential/DataStage/Datasets/node2" {pools ""}
resource scratchdisk "/datastage/Ascential/DataStage/Scratch/node2" {pools ""}
}
}

Please find below the different answers :
------
For every job that starts there will be one (1) conductor process (started on the conductor node), there will be one (1) section leader for each node in the configuration file and there will be one (1) player process (may or may not be true) for each stage in your job for each node. So if you have a job that uses a two (2) node configuration file and has 3 stages then your job will have

1 conductor
2 section leaders (2 nodes * 1 section leader per node)
6 player processes (3 stages * 2 nodes)

Your dump score may show that your job will run 9 processes on 2 nodes.

This kind of information is very helpful when determining the impact that a particular job or process will have on the underlying operating system and system resources.

-----
Conductor Node :
It is the main process that:

  1.  Starts up jobs.
  2.  Determines resource assignments.
  3.  Creates the Section Leaders (which in turn create and manage the player processes that perform the actual job execution).
  4.  Acts as the single coordinator for status and error messages.
  5.  Manages orderly shutdown when processing completes or in the event of a fatal error.


-----
Jobs developed with DataStage EE and QualityStage are independent of the actual hardware and degree of parallelism used to run the job. The parallel Configuration File provides a mapping at runtime between the job and the actual runtime infrastructure and resources by defining logical processing nodes.


To facilitate scalability across the boundaries of a single server, and to maintain platform independence, the parallel framework uses a multi-process architecture.

The runtime architecture of the parallel framework uses a process-based architecture that enables scalability beyond server boundaries while avoiding platform-dependent threading calls. The actual runtime deployment for a given job design is composed of a hierarchical relationship of operating system processes, running on one or more physical servers


  • Conductor Node (one per job): the main process used to start up jobs, determine resource assignments, and create Section Leader processes on one or more processing nodes. It acts as a single coordinator for status and error messages, and manages orderly shutdown when processing completes or in the event of a fatal error. The conductor process runs on the primary server.
  • Section Leaders (one per logical processing node): used to create and manage player processes which perform the actual job execution. The Section Leaders also manage communication between the individual player processes and the master Conductor Node.
  • Players: one or more logical groups of processes used to execute the data flow logic. All players are created as groups on the same server as their managing Section Leader process.


-----

When the job is initiated the primary process (called the “conductor”) reads the job design, which is a generated Orchestrate shell (osh) script. The conductor also reads the parallel execution configuration file specified by the current setting of the APT_CONFIG_FILE environment variable.

Once the execution nodes are known (from the configuration file) the conductor causes a coordinating process called a “section leader” to be started on each: by forking a child process if the node is on the same machine as the conductor, or by remote shell execution if the node is on a different machine from the conductor (things are a little more dynamic in a grid configuration, but essentially this is what happens). Each section leader process is passed the score and executes it on its own node, and is visible as a process running osh. Section leaders’ stdout and stderr are redirected to the conductor, which is solely responsible for logging entries from the job.

The score contains a number of Orchestrate operators. Each of these runs in a separate process, called a “player” (the metaphor clearly is one of an orchestra). Player processes’ stdout and stderr are redirected to their parent section leader. Player processes also run the osh executable.

Communication between the conductor, section leaders and player processes in a parallel job is effected via TCP.

Difference between scratch disk and resource disk


The only difference is :
  • Scratch disk is for temporary storage (like the RAM in a PC).
Ex : files created during processing between source and target, e.g. by the Sort, Remove Duplicates and Aggregator stages.
  • Resource disk is for permanent storage (like the hard drive in a PC).
Ex : data sets, file sets, lookup file sets, etc.

9/13/12

Common Errors,warnings in Datastage

  • Warning : A sequential operator cannot preserve the partitioning of input data set on input port 0.
          Sol : Clear the preserve-partitioning flag on the stage feeding the sequential operator.
  • Warning : Agg_stg: When checking operator: When binding input interface field “column1” to field “column1”: Implicit conversion from source type “string[5]” to result type “dfloat”: Converting string to number.
          Sol : Use an explicit data type conversion in a Transformer or Modify stage.
  • Warning : oci_oracle_source: When checking operator: When binding output interface field “column1” to field “column1”: Converting a nullable source to a non-nullable result.
          Sol : Handle nulls explicitly with the null-handling functions (e.g. NullToValue).

9/6/12

Initial Load and Delta Load


Difference Between Initial Load and Delta Load :

Initial Load :

  • Occurs once.
  • Large amount of data.

Delta Load :

If the data service has the capability to return only the data modified after a specified date and time, the ETL process will load only the data modified after the last successful load. This is called a delta load; a SQL sketch of the idea follows the list below.

  • Occurs regularly.
  • Adjustments to the initial load.
  • Small amount of data.
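As a sketch, a delta extract query often looks like this (source_table, last_modified_ts and the etl_load_control table are assumptions for illustration):

SELECT *
FROM   source_table src
WHERE  src.last_modified_ts >
       (SELECT ctl.last_success_ts
        FROM   etl_load_control ctl
        WHERE  ctl.job_name = 'MY_DELTA_JOB');

After a successful run, the control row is updated with the new load timestamp so that the next run picks up only newer changes.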

8/10/12

sed unix


awk unix


NR = Number of Records read so far
NF = Number of Fields in current record
FS = the Field Separator
RS = the Record Separator
BEGIN = a pattern that's only true before processing any input
END = a pattern that's only true after processing all input.

8/9/12

UNIX Commands


COPY THE FIRST 10 LINES FROM ONE FILE TO ANOTHER FILE :
sed -e "10q" GDK2120_D1.TXT > GDK2120_D2.TXT

-----------------------------------
COUNT THE NUMBER OF FILES IN THE FOLDER :

 ls -1 | wc -l    (the option after ls is the digit one; the option after wc is a lowercase L)
-----------------------------------
CHECKING DISK SPACE :
df -h
or
df -k
or
du -sh

-----------------------------------
MIGRATE FOLDERS FROM ONE ENVIRONMENT TO ANOTHER ENVIRONMENT(EX: DEV TO QA)
scp -rp /home/data user@dev.com:/home/data

-----------------------------------
CHANGE THE OWNER OF THE FOLDER :
sudo chown -R user /home/data
-----------------------------------
TO CHECK FOR THE FILE(GREP)
ls /home/data | grep ^CED_DELTA.*TXT$
ls /home/data | grep CED_DELTA.*TXT$
-----------------------------------
to check the process of my own
ps -f -l -uusername
-------------------------------------
to convert multiple lines into one single line
awk '{printf("%s",$0)}' HPS_D1120807_142128.TXT > HPS_D1120807_142128_1.TXT
--------------------------------------------------------------------------------
to convert one single row with delimiters into multiple rows (~ is the delimiter here)
awk -F"~" '{for(i=1 ;i <= NF;i++ ) {print $i;}}' HPS_D1120807_142128_1.TXT >HPS_D1120807_142128_2.TXT
--------------------------------------------------------------------------------
to remove the last line from the file
sed '$d' EDISAMPLE_3.TXT > EDISAMPLE_4.TXT

-------------------------------------------------------------------------
rename file with one pattern to another pattern
rename  ATTXJZZ  ATTXJGD *.ksh

-------------------------------------------------------------------
touch to change the timestamp of the file
touch -a -m -t 201301181205.09 filename.txt

8/7/12

Change Data Capture Stage - Properties&Behaviour

Change mode determines how keys and values are specified :
  • Explicit Keys & Values    : means that key & value columns must be explicitly defined. 
  • All Keys, Explicit Values : means that value columns must be defined, but all other columns are key columns unless they are excluded. 
  • Explicit Keys, All Values : means that key columns must be defined, but all other columns are value columns unless they are excluded.

8/6/12

Join stage input/output link Partitioning types


Lookup stage input/output link Partitioning types


how many input and output links can Merge Stage have


Database Join or Datastage Join, which is better


Find and replace in UNIX


how many input and output links can Join Stage have

Because people keep asking this question, I actually tried connecting around 70 input links to a Join stage, and it took them.

So the conclusion is that it can have any number of input links, as per the IBM developer guide.

Any Number of Input Links
One Output Link
Zero Reject Links.

Difference Between Primary key and Unique key


Unique Key :
A unique key is a single column or set of columns that can uniquely identify a row in a table.
It doesn't allow duplicates.
It allows NULL values.

Primary Key :
A primary key doesn't allow NULL values.
It doesn't allow duplicates.
A table can have only one primary key, but it can have several unique keys.
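A short example on a hypothetical customer table:

CREATE TABLE customer (
  customer_id  NUMBER        PRIMARY KEY,   -- no NULLs, no duplicates; only one per table
  email        VARCHAR2(100) UNIQUE,        -- no duplicates, but NULLs are allowed
  tax_id       VARCHAR2(20)  UNIQUE         -- a table may have several unique keys
);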

Finding duplicates SQL


SELECT item_id,
 COUNT(item_id) AS NumOccurrences
FROM item
GROUP BY item_id
HAVING ( COUNT(item_id) > 1 )

8/3/12

Different padding functions in Datastage




Padding with spaces or zeroes:


  • Str(" ", 50 - Len(inputcolumn)) : inputcolumn



In the above example, 50 is the total length of the target field.
If the input value has 10 characters, the function generates 40 spaces and concatenates the input column after them (the ":" operator concatenates left to right), so the value is left-padded with spaces.
Output looks like : "                                        xxxxxxxxxx"


  • Str("0", 50 - Len(inputcolumn)) : inputcolumn


In the above example, 50 is the total length of the target field.
If the input value has 10 characters, the function generates 40 zeroes and concatenates the input column after them, left-padding the value with zeroes.

Output looks like : "0000000000000000000000000000000000000000xxxxxxxxxx"


  • Right(STR('0',18):inputcolumn,18)
  • Right('0000':inputcolumn,4)



7/27/12

how to see the processes you own in AIX (UNIX)


To Display All the processes :
>> ps -e -f

It will display all the process information.

To display the processes you own :

>> ps -f -l -uusername   (replace "username" with your own user ID)

Output :
----------------------------------------------------------------------------------------------------------
 F  S      UID     PID    PPID   C PRI NI ADDR    SZ    WCHAN    STIME    TTY  TIME CMD
------------------------------------------------------------------------------------------------------------
240001 A   name  376928  504028   0  60 20 40d05400  1308 f10001003177efb8 16:31:09  pts/0  0:00 cmdlog -ksh
   40001 A   name  504028  434330   0  60 20 1f0a5e400   988          16:31:09      -  0:00 sshd: n74950@pts/0
  240001 A   name  536690  376928   0  60 20 c0d0d400   768          16:31:09  pts/1  0:00 -ksh

6/6/12

Hide Password Entry in UNIX script:



We can use 'stty' to hide the password entry at the UNIX command prompt, just like when you enter your password to log in to UNIX.
stty:
'stty' sets options for your terminal
Ex :
  • stty erase \^h --- set the backspace key as the erase key
  • stty sane --- restore the terminal settings to sane values
  • stty --- display the current terminal settings
Example Script:
#!/bin/ksh
echo "Please enter the pwd:"
stty -echo              # turn off terminal echo so the password is not displayed
read pwd
stty echo               # turn echo back on
echo                    # print a newline, since the user's Enter was not echoed
if [ "$pwd" = "secret" ]; then
  echo "u have access"
else
  echo "access denied"
fi

6/4/12

Sort Utility in Datastage

The type of sort the stage will carry out:
  • DataStage: uses the built-in DataStage sorter; no additional software is required.
  • UNIX : the UNIX sort command is used to perform the sort.

Stable Sort :

  • Applies to the DataStage sort utility type; the default is True. Setting it to True guarantees that the sort operation will not rearrange records that are already in a sorted data set.
  • If it is set to False, no prior ordering of records is guaranteed to be preserved.

5/18/12

How do we know whether we are running DataStage jobs on an SMP or MPP system:




Look in the configuration file: if "fastname" has the same value for all the nodes, you are running on SMP.
If there are different "fastname" values, it is MPP.
For Server jobs it doesn't matter whether the system is SMP or MPP, because Server jobs execute only on the machine where DataStage is installed.
In the example below, every node has the same fastname ("DEV1"), so this is an SMP configuration.
Ex :
main_program: APT configuration file: /ibmiscode/Configurations/ETL/apt_config_max

  node "DEV1" { 
    fastname "DEV1" 
    pools "" 
    resource disk "/is_resource1/isent" {pools ""} 
    resource scratchdisk "/scratch1/sent" {pools ""} 
  } 


  node "local2" { 
    fastname "DEV1" 
    pools "" 
    resource disk "/is_resource2/isent" {pools ""} 
    resource scratchdisk "/scratch2/sent" {pools ""} 
  } 


  node "local3" { 
    fastname "DEV1" 
    pools "" 
    resource disk "/is_resource3/isent" {pools ""} 
    resource scratchdisk "/scratch3/sent" {pools ""} 
  } 


  node "local4" { 
    fastname "DEV1" 
    pools "" 
    resource disk "/is_resource4/isent" {pools ""} 
    resource scratchdisk "/scratch4/sent" {pools ""} 
  } 
}

5/9/12

CUT command unix/linux


CUT command :
I'm going to use the sample file below for the examples shown.
-> cat test.txt
201200001|201200001|LIFE|A
201200002|201200002|STD|B
201200003|201200003|LTD|C
201200004|201200004|LTD|X
201200005|201200005|LTD|E

cut command will be used with the following options
  • cut -c
  • cut -d with -f
1) cut -c :

-> cut -c 11 test.txt
 the above command will give the 11th character from each line of the file
2
2
2
2
2

-> cut -c 11,12 test.txt
the above command will give the 11th and 12th characters from each line
20
20
20
20
20

-> cut -c 1-10 test.txt
the above command will give characters from 1 to 10 from each line
201200001|
201200002|
201200003|
201200004|
201200005|

-> cut -c -8 test.txt
the above command returns characters 1 to 8 from each line, similar to the result above
20120000
20120000
20120000
20120000
20120000

-> cut -c 10- test.txt

the above command returns characters from the 10th to the end of each line.
|201200001|LIFE|A
|201200002|STD|B
|201200003|LTD|C
|201200004|LTD|X
|201200005|LTD|E

2) cut -d with -f:
The character following -d is the delimiter; it can be given in quotes (-d "|") or escaped with a backslash right after -d (-d\|).

-f specifies the field number to return

-> cut -d "|" -f3 test.txt
or we can use
-> cut -d\| -f3 test.txt
the above command returns the 3rd field of each line, using the pipe (|) as the delimiter
LIFE
STD
LTD
LTD
LTD
-> cut -d\| -f3,4 test.txt
the above command will return the 3rd and 4th fields from each line

LIFE|A
STD|B
LTD|C
LTD|X
LTD|E
-> cut -d\| -f3,5 test.txt
the above command only gives the 3rd field because there is no 5th field
LIFE
STD
LTD
LTD
LTD

-> cut -d\| -f1- test.txt
the above command will return data from field 1 to the end of each line.

201200001|201200001|LIFE|A
201200002|201200002|STD|B
201200003|201200003|LTD|C
201200004|201200004|LTD|X
201200005|201200005|LTD|E
-> cut -d\| -f-3 test.txt
the above command will return data from field 1 to field 3 from each line.

201200001|201200001|LIFE
201200002|201200002|STD
201200003|201200003|LTD
201200004|201200004|LTD
201200005|201200005|LTD










5/3/12

Find If the file is in Unix format or DOS format :


We can find out by using the 'od' command and grepping for the carriage returns ('\r') that exist in DOS-format files.
Example : od -c filename.txt | grep '\\r'
If it doesn't show anything, then the file is in Unix format.
Alternatively, the following method can also be used :
Example : grep -c '^M$' filename
Note that ^M has to be typed with Ctrl-V followed by Enter (not by typing shift-6 and M).

5/2/12

How to Find the changes made by particular USER in DataStage :



Here is how -

  • Go to DataStage Designer.
  • From the menu bar, choose Tools --> Advanced Find.


Then select the fields based upon your choice:


  • Name To find : * or part of the job name
  • Folder to search : select the particular folder you are looking for
  • Modified by User: Enter the user name of the datastage user.
  • Date  : select the date type


Then click Find.

It will give you the list of jobs modified by that particular USER.


5/1/12

How to Check the Disk Usage in Unix/Linux


Both Linux and Unix offer two commands for checking free disk space:

  • df command : Report file system disk space usage
  • du command : Estimate file space usage

$ df -h  /home/test

OR


$ df -k /home/test


Output :
Filesystem             Size   Used  Avail Use% Mounted on
/dev/sdb1               20G   9.2G   9.6G  49% /
varrun                 393M   144k   393M   1% /var/run
varlock                393M      0   393M   0% /var/lock
procbususb             393M   123k   393M   1% /proc/bus/usb
udev                   393M   123k   393M   1% /dev
devshm                 393M      0   393M   0% /dev/shm
lrm                    393M    35M   359M   9% /lib/modules/2.6.20-15-generic/volatile
/dev/sdb5               29G   5.4G    22G  20% /media/docs
/dev/sdb3               30G   5.9G    23G  21% /media/isomp3s
/dev/sda1              8.5G   4.3G   4.3G  51% /media/xp1
/dev/sda2               12G   6.5G   5.2G  56% /media/xp2
/dev/sdc1               40G   3.1G    35G   9% /media/backup

du Command :

du shows how much space one or more files or directories are using.

  $ du -s /home/test
Output : 75224 /home/test





4/30/12

How do I change my Oracle password using SQL*Plus?


Using SQL*Plus in Oracle 11g

  1. From the Start menu, select All Programs, and then select your Oracle Home group (e.g., Oracle - OraClient11g_home1).
  2. Click Application Development, and then click SQL Plus. When you launch SQL Plus in the Oracle 11g client, you will get the command shell version of SQL*Plus.
  3. When prompted with Enter user-name: , enter your Oracle username, then @, and then the net service name of the Oracle database; for example: username@dev.company.com. Note: In the STCs, the full Oracle net service name is oed1prd.world. Oracle net service names configured on your computer may differ from the STC names, since these can be arbitrarily assigned.
  4. Press Enter. When prompted with Enter password: , enter your Oracle password, and then press Enter.
  5. At the prompt, to change your password, use the following command: password (a sketch of the interaction follows this list).
  6. When prompted, enter your old password, then your new password, and then retype your new password. SQL*Plus will indicate a successful password change.
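For reference, the interaction looks roughly like this (the username is just a placeholder):

SQL> password
Changing password for SCOTT
Old password:
New password:
Retype new password:
Password changed

An equivalent single statement is:

SQL> ALTER USER scott IDENTIFIED BY new_password;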

Schema File in Datastage


Schema files and partial Schemas:

You can also specify the meta data for a stage in a plain text file known as a schema file. This is not stored in the Repository but you could, for example, keep it in a document management or source code control system, or publish it on an intranet site.

Note: If you are using a schema file on an NLS system, the schema file needs to be in UTF-8 format. It is, however, easy to convert text files between two different maps with a WebSphere DataStage job. Such a job would read data from a text file using a Sequential File stage and specifying the appropriate character set on the NLS Map page. It would write the data to another file using a Sequential File stage, specifying the UTF-8 map on the NLS Map page.

Some parallel job stages allow you to use a partial schema. This means that you only need define column definitions for those columns that you are actually going to operate on.

Remember that you should turn runtime column propagation on if you intend to use schema files to define column meta data.

Complex data types

Parallel jobs support three complex data types: subrecords, tagged subrecords, and vectors. When referring to complex data in WebSphere DataStage column definitions, you can specify fully qualified column names, for example:
Parent.Child5.Grandchild2.

Subrecords :

A subrecord is a nested data structure. The column with type subrecord does not itself define any storage, but the columns it contains do. These columns can have any data type, and you can nest subrecords one within another. The LEVEL property is used to specify the structure of subrecords. The following diagram gives an example of a subrecord structure.
Parent (subrecord)
Child1 (string)
Child2 (string)      LEVEL01
Child3 (string)
Child4 (string)
Child5(subrecord)
        Grandchild1 (string)
        Grandchild2 (time)
        Grandchild3 (sfloat)  LEVEL02

Tagged subrecord

This is a special type of subrecord structure, it comprises a number of columns of different types and the actual column is ONE of these, as indicated by the value of a tag at run time. The columns can be of any type except subrecord or tagged. The following diagram illustrates a tagged subrecord.

Parent (tagged)
       Child1 (string)
       Child2 (int8)
       Child3 (raw)
       Tag = Child1, so column has data type of string

Vector

A vector is a one dimensional array of any type except tagged. All the elements of a vector are of the same type, and are numbered from 0. The vector can be of fixed or variable length. For fixed-length vectors the length is explicitly stated; for variable-length ones, a property defines a link field which gives the length at run time.



Netezza Performance Server :


Distribution Specification:
Each table in a Netezza RDBMS database has only one distribution key, which consists of
one to four columns. You can use the following SQL syntax to create distribution keys.

> To create an explicit distribution key, the Netezza SQL syntax is:
create table <tablename> ( <col> [, ... ] )
[ distribute on [hash] ( <col> [, ... ] ) ];
The phrase "distribute on" specifies the distribution key; the word "hash" is optional.

> To create a round-robin distribution key, the Netezza SQL syntax is:
create table <tablename> (col1 int, col2 int, col3 int)
distribute on random;
The phrase "distribute on random" specifies round-robin distribution.

> To create a table without specifying a distribution key, the Netezza SQL syntax is:
create table <tablename> (col1 int, col2 int, col3 int);
The NPS chooses a distribution key. There is no guarantee what that key is, and it can vary depending on the NPS software release.
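Concrete examples (the tables are hypothetical, for illustration only):

create table sales (order_id integer, cust_id integer, amount numeric(10,2))
distribute on (cust_id);          -- explicit hash distribution on cust_id

create table audit_log (log_id integer, msg varchar(200))
distribute on random;             -- round-robin distribution

create table misc (col1 int, col2 int, col3 int);   -- no clause: NPS chooses the distribution key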



Choosing the distribution key of table in Netezza

In Netezza, each table's data is distributed across many servers (SPUs). This distribution of data enables parallel processing, so we should be careful to choose the right distribution key when creating a table.

There are two types of distribution methods:

  1. Hash algorithm
  2. Random algorithm
The random algorithm applies when you specify random in the 'distribute on' clause.
The hash algorithm applies when you specify column names in the 'distribute on' clause. When you don't specify a 'distribute on' clause while creating a table, NPS by default distributes the data on the first column in the create table statement.


Tips to choose a good distribution method:


  1. Choose columns that distribute rows evenly across all the SPUs.
  2. Select columns with unique values and high cardinality.
  3. Do not use random distribution, even though it distributes data evenly across all the SPUs.
  4. Always specify the distribute on clause, even if you are distributing on the first column in the table.
  5. Use the same distribution key and data types for commonly joined tables (see the sketch after this list).
  6. Do not distribute on floating point or boolean data types.
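For tip 5, a sketch of two commonly joined (hypothetical) tables sharing the same distribution key and data type:

create table orders      (order_id integer, order_dt date)     distribute on (order_id);
create table order_items (order_id integer, item_id integer)   distribute on (order_id);
-- joins on order_id can now be resolved locally on each SPU, avoiding data redistribution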

Netezza Date Conversions :


SELECT TO_TIMESTAMP(CURRENT_TIMESTAMP,'YYYY-MM-DD HH24:MI:SS')

TO_TIMESTAMP
-------------
10/24/2011 4:42:35.000156

SELECT TO_DATE(CURRENT_TIMESTAMP,'YYYY-MM-DD HH24:MI:SS')

TO_DATE
----------
10/24/2011


DATE_PART :

EX ;
SELECT DATE_PART('MONTH',CURRENT_TIMESTAMP)

DATE_PART
---------------
         10

DATE_TRUNC : 

BEFORE TRUNC ;
SELECT CURRENT_TIMESTAMP
----------------------------------
10/24/2011 4:54:11.001598


AFTER TRUNC HOUR :

SELECT DATE_TRUNC('HOUR',CURRENT_TIMESTAMP)
-------------------------------------

10/24/2011 4:00:00.001598



Netezza SQL Examples


1) To limit the data returned in Netezza
Ex :     select * from EMP Limit 1000

Difference Between Datastage 7.x and 8.x


The Main Differences are :

1) No Manager Client :
  We don't have the Manager client tool in version 8; its functionality is incorporated into the Designer itself.

2) Parameter Set :
 There is no need to hard-code the parameters for every job any more; we now have an option called a Parameter Set. If we create a parameter set, we can reference it across the whole project or in a job or sequence.

 3) Lookup :
     Range Lookup was introduced in the Lookup stage, and the Lookup stage now has its own UI.


4) Database Repository  :
      > In DataStage 7.x the repository used to be a file system, but in 8.x a database can be used as the repository.

Note the database compatibility for the Metadata Server repository is the latest versions of the three DBMS engines. DB2 is an optional extra in the bundle if you don't want to use an existing database.
  • IBM UDB DB2 ESE 9
    -IBM Information Server does not support the Database Partitioning Feature (DPF) for use in the repository layer
    -DB2 Restricted Enterprise Edition 9 is included with IBM Information Server and is an optional part of the installation; however, its use is restricted to hosting the IBM Information Server repository layer and cannot be used for other applications
  • Oracle 10g
  • SQL Server 2005
5) Languages
     Foreign language support for the graphical tools and product messages:  


Chinese (Simplified and Traditional), Czech, Danish, Finnish, French, German, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish and Swedish.


6)  Removed Functions:
   These are the functions that are not in DataStage 8; please imagine the last waltz playing in your head as you peruse this list:
  • dssearch command line function
  • dsjob "-import"
  • Version Control tool
  • Released jobs
  • Oracle 8i native database stages
  • ClickPack
The loss of the Version Control tool is not a big deal as the import/export functions have been improved. Building a release file as an export in version 8 is easier than building it in the Version Control tool in version 7.

7) New Functions Existing Stages
  • Complex Flat File Stage: Multi Format File (MFF) in addition to existing cobol file support.
  • Surrogate Key Generator: now maintains the key source via integrated state file or DBMS sequence.
  • Lookup Stage: range lookups by defining checking high and low range fields on the input or reference data table. Updatable in memory lookups.
  • Transformer Stage: new surrogate key functions Initialize() and GetNextKey().
  • Enterprise FTP Stage: now choose between ftp and sftp transfer.


8) New Stages
    A new stage from the IBM software family, new stages from new partners, and the convergence of QualityStage functions into DataStage. Apart from the SCD stage, these all come at an additional cost.
  • WebSphere Federation and Classic Federation
  • Netezza Enterprise Stage
  • SFTP Enterprise Stage
  • iWay Enterprise Stage
  • Slowly Changing Dimension: for type 1 and type 2 SCDs.
  • Six QualityStage stages.

Difference Between Datastage Server Jobs and Parallel Jobs


Datastage parallel jobs can run in parallel on multiple nodes. Server jobs do not run on multiple node.
Parallel jobs support partition parallelism(Round robin,Hash,modulus etc.), server jobs don't support this.
The transformer in Parallel jobs compiles in C++. In server jobs, the transformer is compiled in Basic language.
--------------------------------------
The basic difference often cited is that a Server job usually runs on a Windows platform and a parallel job runs on a UNIX platform, and that a Server job runs on one node whereas a parallel job runs on more than one node.
Even though Server jobs can run on UNIX (most major installations are on a UNIX platform), the real difference is in the job architecture:
Server jobs process stages in sequence, one after another.
Parallel jobs process in parallel; they use the configuration file to know the number of nodes/CPUs defined for parallel processing.

------------------------------------------

  1. The major difference between Infosphere Datastage Enterprise and Server editions is that Enterprise Edition (EE) introduces parallel jobs. Parallel jobs support a completely new set of stages, which implement scalable and parallel data processing mechanisms. In most cases parallel jobs and stages look similar to the Datastage Server objects, however their capabilities are quite different.
    In rough outline:
    • Parallel jobs are executable datastage programs, managed and controlled by Datastage Server runtime environment
    • Parallel jobs have a built-in mechanism for Pipelining, Partitioning and Parallelism. In most cases no manual intervention is needed to implement optimally those techniques.
    • Parallel jobs are a lot faster in such ETL tasks like sorting, filtering, aggregating
  2. Datastage EE jobs are compiled into OSH (Orchestrate Shell script language).
    OSH executes operators - instances of executable C++ classes, pre-built components representing stages used in Datastage jobs.
    Server Jobs are compiled into Basic which is an interpreted pseudo-code. This is why parallel jobs run faster, even if processed on one CPU.
  3. Datastage Enterprise Edition adds functionality to the traditional server stages, for instance record and column level format properties.
  4. Datastage EE brings also completely new stages implementing the parallel concept, for example:
    • Enterprise Database Connectors for Oracle, Teradata & DB2
    • Development and Debug stages - Peek, Column Generator, Row Generator, Head, Tail, Sample ...
    • Data set, File set, Complex flat file, Lookup File Set ...
    • Join, Merge, Funnel, Copy, Modify, Remove Duplicates ...
  5. When processing large data volumes Datastage EE jobs would be the right choice, however when dealing with smaller data environment, using Server jobs might be just easier to develop, understand and manage.
    When a company has both Server and Enterprise licenses, both types of jobs can be used.
  6. Sequence jobs are the same in Datastage EE and Server editions.


Parallel Job - create a unique counter within a parallel transformer


Introduction

This entry describes various ways of creating a unique counter in DataStage jobs.
A parallel job has a surrogate key stage that creates unique IDs, however it is limited in that it does not support conditional code and it may be more efficient to add a counter to an existing transformer rather than add a new stage.
In a server job there are a set of key increment routines installed in the routine SDK samples that offer a more complex counter that remembers values between job executions.
The following section outlines a transformer only technique.

Steps

In a DataStage job the easiest way to create a counter is within the Transformer stage with a Stage Variable.
svMyCounter = svMyCounter + 1
This simple counter adds 1 each time a row is processed.
The counter can be given a seed value by passing a value in as a job parameter and setting the initial value of svMyCounter to that job parameter.
In a parallel job this simple counter will create duplicate values on each node as the transformer is split into parallel instances. It can be turned into a unique counter by using special parallel macros.
  1. Create a stage variable for the counter, eg. SVCounter.
  2. At the Stage Properties form set the Initial Value of the Stage Variable to            "@PARTITIONNUM - @NUMPARTITIONS + 1".
  3. Set the derivation of the stage variable to "svCounter + @NUMPARTITIONS". You can embed this in an IF statement if it is a conditional counter.
Each instance will start at a different number, eg. -1, -2, -3, -4. When the counter is incremented each instance is increment by the number of partitions, eg. 4. This gives us a sequence in instance 1 of 1, 5, 9, 13... Instance 2 is 2, 6, 10, 14... etc.
Remember this method only works if your data is evenly balanced i.e. equal number of rows going through each partition. Alternative syntax is:
@INROWNUM * @NUMPARTITIONS + @PARTITIONNUM 


Convert Function Datastage


A="NOW IS THE TIME"

CONVERT('TI','XY',A)

OUTPUT : NOW YS XHE XYME

4/24/12

How to Capture/Handle Rejects in Join Stage

First of all, there is no such thing as a reject link from a Join stage; there is only the result of the join. In the Join stage, use an outer join --- Left, Right or Full Outer, depending on which side you are interested in finding these 'rejects' on --- and then in subsequent stages capture the rows where the left/right data is missing (i.e. contains nulls).
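The same idea expressed in plain SQL (a sketch with hypothetical tables src and ref):

SELECT s.*
FROM   src s
LEFT OUTER JOIN ref r
       ON r.key_col = s.key_col
WHERE  r.key_col IS NULL;   -- input rows with no match on the reference side, i.e. the "rejects"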