9/9/11

Datastage Transformations

Server Job Transformations:  
1)To remove zeroes from the left
function : Trim(columnName,'0','L')
2) To remove zeroes from the right
function : Trim(columnName,'0','R')
3)To pad the input value with zeroes or spaces before or after it
Fmt(KeyMgtGetNextValue (DSJobStartTime),"8'0'R")
Example :
If input : "1234"
desired output : "00001234"
If i/p : "123"
o/p will be : "00000123"

Parallel Job Transformations :
To handle spaces, nulls, etc.:
UpCase ( Trim ( Convert ( Char (0) : Char(10) : Char(13) : Char(124), "    ", DSLink1.InColumn) ) )


Sequence Job Transformations:
Command Stage :  
List the files separated by commas (,):
ls -m #$FilePath##FolderPath# |tr '\n' ','| sed '$s/,$//'


Use Sql in Command Stage : 
Ex : 
command:
nzsql (Netezza)

Parameter:

-host #$jpNZ_SERVER# -db #$jpNZ_DB_DB# -u #$jpNZ_USERID_DB# -pw #$jpNZ_PWD_DB# -t -c "SELECT STRT_TMSTMP FROM #$jpNZ_DB_DB#.#$jpNZ_TABLE_OWNER#.EXTRACT_DRIVER WHERE PROC_NM='XYZ' "


8/31/11

Car Loan Tips

Run Your Credit Report

>Before embarking on your car buying journey, request your credit report from the three credit bureaus. You can request your credit report for free once a year by visiting annualcreditreport.com or by calling 1-877-322-8228. Your credit report will give you a glimpse of your creditworthiness and inform you of any possible shortcomings. Knowing all this before stepping into a dealership will guard you from the most aggressive selling tactics and help you walk away when the financing offered is not in your best interest.
Car Loan Warning

!>Be careful to avoid paid credit reporting services. Only annualcreditreport.com is authorized to request a free credit report for you under the law. Paid credit reporting services often carry hidden fees and undisclosed costs.
----
Visit Your Nearest Bank or Credit Union To Get A Quote

>Once you have your credit report handy and have a good idea of what type of car and price range you are looking for, head over to your nearest bank or credit union to see what kind of interest rates they are offering on their car loans. In some cases, particularly if you already know exactly what vehicle you want to purchase, the bank or credit union may pre-approve you, thus letting you know exactly what interest rate and monthly payments you should expect on your car loan.
Car Loan Warning

Be sure to shop around and to compare rates. Visit more than one financial institution to get a quote and to find out what interest rates they are offering on their loans. This will give you a better idea if you are getting a good deal or not.

-----
Negotiate for a Better Rate

>Despite the loans offered directly by banks and credit unions, eight out of every 10 car buyers finance their vehicle through a car dealer. Whether it is the convenience offered or simply the marketing tactics deployed, if you find yourself behind closed doors in the finance and insurance department of a car dealer, be ready to negotiate for the lowest interest rate possible without feeling intimidated. Knowing your credit history and the loan rates offered directly by banks and credit unions in your area will definitely give you the upper hand in getting the best car loan possible, but remain wary of any interest rate markups added on by the dealer. While a car dealer may initially originate your loan, it often attempts to sell the loan to a third-party lender for a profit. This profit is made by arbitrarily raising the interest rate of your car loan. If the interest rate offered by the dealer is higher than what you anticipated, just ask for the desired interest rate and renegotiate.
Car Loan Warning

>Try to avoid any add-on products offered by the dealer. Products such as vehicle service contracts, guaranteed auto protection insurance, credit life and disability insurance, and many others are often overpriced and unnecessary. Car dealers often sell these products to raise the cost of their loans and increase their profit margins. If you really need any of those add-on products, try to purchase them outside the dealership for much cheaper.
---
Other Things to Consider:

Comparison Shop Online: The internet has made it a lot easier for consumers to compare car prices and loan rates online. Start your research there before you head out to the dealership.

“Yo-Yo” scams: “Yo-yo” scams or “spot deliveries” occur when a car buyer drives away with the vehicle without finalizing the sale. Once home, the dealer will call back the buyer claiming that it was unable to fund the loan at the agreed-upon terms. The buyer must then return the car to the dealer and often renegotiate the loan at a higher interest rate than the one agreed upon before.

“Buy Here and Pay Here” Dealers: “Buy Here Pay Here” dealerships typically finance used auto loans in-house to borrowers with no or poor credit. The average APR is usually much higher than a bank or credit union loan. The car loans made by these dealers are often unsustainable and lead to a high rate of repossessions.

Take Your Time: The average consumer spends 45 minutes with the finance and insurance department at the dealer (only 27 minutes if they take a test drive), so take your time to consider your lending options and don’t feel pressured to sign on the dotted line. You have the right to take all the paperwork home before agreeing to the loan.

Don’t Get Caught In The Monthly Payment Trap: Dealers will often attempt to mask the true cost of their loans by focusing on the monthly payments. Be sure to compare the total cost of all the loans offered and to choose the one that is least costly to you in the long run.

8/16/11

Generating output files for multiple columns

When two or more output columns have XPath expressions, XML Output generates a file for each column. You must add a column index flag to the root filename to prevent overwriting. This creates a naming pattern.

Valid flags include:
%% Column position, starting with zero (0)
%@ Column names
You can add these flags before, within, or after the root filename.
Examples: The first output column is CUSTOMERS. The second output column is DIVISIONS.
1. The naming pattern is acme%%.xml. XML Output generates two files, called acme0.xml and acme1.xml.
2. The naming pattern is acme%@.xml. XML Output generates two files, called acmeCUSTOMERS.xml and acmeDIVISIONS.xml.

8/8/11

APT_CONFIG_FILE : Configuration File

1)APT_CONFIG_FILE is the environment variable that tells DataStage which configuration file to use (one can have many configuration files for a project). In fact, this is what is generally used in production. However, if this environment variable is not defined, then how does DataStage determine which file to use?

1)If the APT_CONFIG_FILE environment variable is not defined, then DataStage looks for the default configuration file (config.apt) in the following locations:
1)Current working directory.
2)INSTALL_DIR/etc, where INSTALL_DIR ($APT_ORCHHOME) is the top-level directory of the DataStage installation.

2)What are the different options a logical node can have in the configuration file?

1.fastname – The fastname is the physical node name that stages use to open connections for high-volume data transfers. The attribute of this option is often the network name. Typically, you can get this name by using the Unix command ‘uname -n’.
2.pools – Names of the pools to which the node is assigned. Based on the characteristics of the processing nodes you can group nodes into sets of pools.
1.A pool can be associated with many nodes and a node can be part of many pools.
2.A node belongs to the default pool unless you explicitly specify a pools list for it and omit the default pool name (“”) from the list.
3.A parallel job or a specific stage in the parallel job can be constrained to run on a pool (set of processing nodes).
1.If both the job and a stage within the job are constrained to run on specific processing nodes, then the stage will run on the nodes that are common to the stage and the job.
3.resource – resource resource_type “location” [{pools “disk_pool_name”}] | resource resource_type “value”. resource_type can be canonicalhostname (which takes the quoted ethernet name of a node in a cluster that is not connected to the conductor node by the high-speed network), disk (to read/write persistent data to this directory), or scratchdisk (the quoted absolute path name of a directory on a file system where intermediate data will be temporarily stored; it is local to the processing node), or RDBMS-specific resources (e.g. DB2, INFORMIX, ORACLE, etc.).
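
Putting these options together, here is a minimal sketch of what a one-node configuration file (config.apt) might look like; the node name, fastname and directory paths are hypothetical and must be adapted to your environment:

{
  node "node1"
  {
    fastname "etlserver1"
    pools ""
    resource disk "/data/ds/pds" {pools ""}
    resource scratchdisk "/data/ds/scratch" {pools ""}
  }
}

Here fastname would normally be the value returned by ‘uname -n’ on that machine, the empty pools string keeps the node in the default pool, and the disk and scratchdisk resources point to local directories for persistent and intermediate data respectively.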

3)How does DataStage decide on which processing node a stage should run?

1.If a job or stage is not constrained to run on specific nodes, then the parallel engine executes a parallel stage on all nodes defined in the default node pool. (Default behavior)
2.If the node is constrained, then the constrained processing nodes are chosen while executing the parallel stage. (Refer to 2.2.3 for more detail.)

4)When configuring an MPP, you specify the physical nodes in your system on which the parallel engine will run your parallel jobs. This is called the conductor node. For other nodes, you do not need to specify the physical node. Also, you need to copy the (.apt) configuration file only to the nodes from which you start parallel engine applications. It is possible that the conductor node is not connected to the high-speed network switches, while the other nodes are connected to each other using very high-speed network switches. How do you configure your system so that you will be able to achieve optimized parallelism?

1.Make sure that none of the stages are specified to be run on the conductor node.
2.Use conductor node just to start the execution of parallel job.
3.Make sure that conductor node is not the part of the default pool.

5)Although, parallelization increases the throughput and speed of the process, why maximum parallelization is not necessarily the optimal parallelization?


Datastage creates one process for every stage on each processing node. Hence, if the hardware resources are not available to support the maximum parallelization, the performance of the overall system goes down. For example, suppose we have an SMP system with three CPUs and a parallel job with 4 stages. We have 3 logical nodes (one corresponding to each physical node, say CPU). Now DataStage will start 3*4 = 12 processes, which have to be managed by a single operating system. Significant time will be spent in context switching and scheduling the processes.

6)Since we can have different logical processing nodes, it is possible that some nodes will be more suitable for some stages while other nodes will be more suitable for other stages. So, how do you decide which node is suitable for which stage?

1.If a stage is performing a memory-intensive task then it should be run on a node which has more disk space available for it. E.g. sorting data is a memory-intensive task and it should be run on such nodes.
2.If a stage depends on a licensed version of software (e.g. SAS Stage, RDBMS-related stages, etc.) then you need to associate that stage with the processing node which is physically mapped to the machine on which the licensed software is installed. (Assumption: The machine on which the licensed software is installed is connected to the other machines through a high-speed network.)
3.If a job contains stages which exchange large amounts of data, then they should be assigned to nodes where the stages can communicate by either shared memory (SMP) or a high-speed link (MPP) in the most optimized manner.

7)Basically, nodes are nothing but a set of machines (especially in MPP systems). You start the execution of parallel jobs from the conductor node. The conductor node creates a shell on the remote machines (depending on the processing nodes) and copies the same environment to them. However, it is possible to create a startup script which will selectively change the environment on a specific node. This script has a default name of startup.apt. However, like the main configuration file, we can also have many startup configuration files. The appropriate configuration file can be picked up using the environment variable APT_STARTUP_SCRIPT. What is the use of the APT_NO_STARTUP_SCRIPT environment variable?

1.Using the APT_NO_STARTUP_SCRIPT environment variable, you can instruct the parallel engine not to run the startup script on the remote shell.

8)What are the generic things one must follow while creating a configuration file so that optimal parallelization can be achieved?

1.Consider avoiding the disk/disks that your input files reside on.
2.Ensure that the different file systems mentioned as the disk and scratchdisk resources hit disjoint sets of spindles even if they’re located on a RAID (Redundant Array of Inexpensive Disks) system.
3.Know what is real and what is NFS:
1.Real disks are directly attached, or are reachable over a SAN (storage-area network -dedicated, just for storage, low-level protocols).
2.Never use NFS file systems for scratchdisk resources, remember scratchdisk are also used for temporary storage of file/data during processing.
3.If you use NFS file system space for disk resources, then you need to know what you are doing. For example, your final result files may need to be written out onto the NFS disk area, but that doesn’t mean the intermediate data sets created and used temporarily in a multi-job sequence should use this NFS disk area. Better to setup a “final” disk pool, and constrain the result sequential file or data set to reside there, but let intermediate storage go to local or SAN resources, not NFS.
4.Know what data points are striped (RAID) and which are not. Where possible, avoid striping across data points that are already striped at the spindle level.

6/24/11

The PATH Environment Variable in UNIX

The PATH environment variable has a special format. Let's see what it looks like:

dbettis@rhino[~]$ echo $PATH
/usr/local/bin:/bin:/usr/bin:/sbin:/usr/sbin:.

It's essentially a :-separated list of directories. When you execute a command, the shell searches through each of these directories, one by one, until it finds a directory where the executable exists. Remember that we found ls in /bin, right? /bin is the second item in the PATH variable. So let's remove /bin from PATH. We can do this by using the export command:

dbettis@rhino[~]$ export PATH=/usr/local/bin:/usr/bin:/sbin:/usr/sbin:.

Make sure that the variable is set correctly:

dbettis@rhino[~]$ echo $PATH
/usr/local/bin:/usr/bin:/sbin:/usr/sbin:.

Now, if we try to run ls, the shell no longer knows to look in /bin!

dbettis@rhino[~]$ ls
-bash: ls: command not found

As expected, ls can no longer be found. Let's add /bin back to PATH, as ls is a very useful thing to have.

dbettis@rhino[~]$ export PATH=/usr/local/bin:/bin:/usr/bin:/sbin:/usr/sbin:.

Adding to PATH
There are many times where you'll want to append an item to PATH. First, let's see what the current PATH is:

dbettis@rhino[~]$ echo $PATH
/usr/local/bin:/bin:/usr/bin:/sbin:/usr/sbin:.

The way to add a directory is as follows:

dbettis@rhino[~]$ export PATH=$PATH:/new/path

This command adds /new/path to PATH. Let's see if it got updated:

dbettis@rhino[~]$ echo $PATH
/usr/local/bin:/bin:/usr/bin:/sbin:/usr/sbin:.:/new/path

Making this happen every time you login

There's a special file in your home directory called .bashrc. In UNIX, a convention is that files beginning with . are configuration files, and thus should be hidden from view. ls will only list files beginning with a . if passed the -a flag, e.g.

dbettis@rhino[~]$ ls -a

At any rate, this file (.bashrc) simply contains a list of commands. Each one of these commands gets executed every time you create a new shell.

dbettis@rhino[~]$ cat .bashrc
export PATH="$PATH:/p/firefox/bin"
..

Every time a shell is started, /p/firefox/bin is added to PATH. If you wish to have certain directories automatically added to PATH, simply place those commands at the end of this file. Log out and log back in to view the changes. Alternatively, you can load the contents of that file in the current session:

dbettis@rhino[~]$ . .bashrc

6/23/11

Understanding UNIX permissions and chmod

Introduction
This is a topic that has been beaten to death both in books and on-line. For some reason, it seems that it is one of the most common misunderstandings that people have to face when learning how to write and/or configure their first cgi programs. This tutorial aims to clarify the concepts involved. Note that we will be referring to UNIX in a generic sense in this article. Most of what we are going to discuss here applies to all UNIX flavours. (such as Linux, SVR4, BSD etc.) It is also a good idea to type man chmod to check for the specific details on your system, too.
Users
A UNIX system serves many users. Users are an abstraction that denotes a logical entity for assignment of ownership and operation privileges over the system. A user may correspond to a real-world person, but also a type of system operation. So, in my system, I have user 'nick' that corresponds to me, but I also have user 'www' which corresponds to the privileges necessary to operate the local webserver. UNIX doesn't care about what the user means for me. It just knows what belongs to any given user and what each user is allowed to do with any given thing (file, program, device, etc) on the system. UNIX identifies each user by a User ID (UID) and the username (or login) such as 'nick' and 'www' is just an alias to the UID that makes humans more comfortable.

Groups

Users can be organized in groups. A user may belong to one or more groups of users. The concept of groups serves the purpose of assigning sets of privileges for a given resource and sharing them among many users that need to have them. (perhaps because they are all members of a project working team and they all need access to some common project files) So, on my system user 'nick' and user 'www' both belong to the group 'perlfect'. This way, they can have some shared privileges over the files for this site. User 'nick' needs them to edit the site, and user 'www' needs them to manage the webserver that will be publishing the site.

Ownership

Every file in UNIX has an owner user and an owner group. So, for any file in the system, user 'nick' may have one of the following ownership relations:

nick owns the file, i.e. the file's owner is 'nick'.
nick is a member of the group that owns the file, i.e. the file's owner group is 'perlfect'.
nick is neither the owner, nor belonging to the group that owns the file

Permissions

Every file on the system has associated with it a set of permissions. Permissions tell UNIX what can be done with that file and by whom. There are three things you can (or can't) do with a given file:

read it,
write (modify) it and
execute it.

Unix permissions specify which of the above operations can be performed for any ownership relation with respect to the file. In simpler terms, what can the owner do, what can the owner group do, and what can everybody else do with the file. For any given ownership relation, we need three bits to specify access permissions: the first to denote read (r) access, the second to denote write (w) access and the third to denote execute (x) access. We have three ownership relations: 'owner', 'group' and 'all', so we need a triplet for each, resulting in nine bits. Each bit can be set or clear (not set). We mark a set bit by its corresponding operation letter (r, w or x) and a clear bit by a dash (-) and put them all in a row. An example might be rwxr-xr-x. What this means is that the owner can do anything with the file, but group owners and the rest of the world can only read or execute it. Usually in UNIX there is also another bit that precedes this 9-bit pattern. You do not need to know about it, at least for the time being.

So if you try ls -l on the command prompt you will get something like the following:

[nick@thekla src]$ ls -l
-rwxr-xr-x 1 nick users  382 Jan 19 11:49 bscoped.pl
drwxr-xr-x 3 nick users 1024 Jan 19 11:19 lib/
-rwxr-xr-x 1 nick users 1874 Jan 19 10:23 socktest.pl

The first column here shows the permission bit pattern for each file. The third column shows the owner, and the fourth column shows the owner group. By now, the information provided by ls -l should be enough for you to figure out what each user of the system can do with any of the files in the directory.

Directories

Another interesting thing to note is that lib/ which is a directory has permissions, too. Permissions take a different meaning for directories. Here's what they mean:

1) read determines if a user can view the directory's contents, i.e. do ls in it.
2)write determines if a user can create new files or delete files in the directory. (Note here that this essentially means that a user with write access to a directory can delete files in the directory even if he/she doesn't have write permissions for the file! So be careful with this.)
3)execute determines if the user can cd into the directory.

chmod


To set/modify a file's permissions you need to use the chmod program. Of course, only the owner of a file may use chmod to alter a file's permissions. chmod has the following syntax: chmod [options] mode file(s)

The 'mode' part specifies the new permissions for the file(s) that follow as arguments. A mode specifies which users' permissions should be changed, and afterwards which access types should be changed. Let's say, for example: chmod a-x socktest.pl. This means that the execute bit should be cleared (-) for all users (owner, group and the rest of the world). The permissions start with a letter specifying what users should be affected by the change; this might be any of the following:

u the owner user
g the owner group
o others (neither u, nor g)
a all users


This is followed by a change instruction which consists of a +(set bit) or -(clear bit) and the letter corresponding to the bit that should be changed.

Let's see some examples:
$ ls -l socktest.pl
-rwxr-xr-x 1 nick users 1874 Jan 19 10:23 socktest.pl*
$ chmod a-x socktest.pl
$ ls -l socktest.pl
-rw-r--r-- 1 nick users 1874 Jan 19 10:23 socktest.pl
$ chmod g+w socktest.pl
$ ls -l socktest.pl
-rw-rw-r-- 1 nick users 1874 Jan 19 10:23 socktest.pl
$ chmod ug+x socktest.pl
$ ls -l socktest.pl
-rwxrwxr-- 1 nick users 1874 Jan 19 10:23 socktest.pl*
$ chmod ug-wx socktest.pl
$ ls -l socktest.pl
-r--r--r-- 1 nick users 1874 Jan 19 10:23 socktest.pl

Strange numbers...

You might have encountered things like chmod 755 somefile and of course you will be wondering what this is. The thing is, that you can change the entire permission pattern of a file in one go using one number like the one in this example. Every mode has a corresponding code number, and as we shall see there is a very simple way to figure out what number corresponds to any mode.

Every one of the three digits in the mode number corresponds to one of the three permission triplets (u, g and o). Every permission bit in a triplet corresponds to a value: 4 for r, 2 for w, 1 for x. If the permission bit is set, you add this value to the number of the permission triplet. If it is cleared, then you add nothing. (Some of you might notice that in fact, the number for a triplet is the octal value corresponding to the three-bit pattern - if you don't know what an octal value is, it doesn't really matter, just follow the instructions.) So if a file has rwxr-xr-x permissions we do the following calculation:

Triplet for u: rwx => 4 + 2 + 1 = 7
Triplet for g: r-x => 4 + 0 + 1 = 5
Triplet for o: r-x => 4 + 0 + 1 = 5
Which makes : 755

So, 755 is a terse way to say 'I don't mind if other people read or run this file, but only I should be able to modify it', and 777 means 'everyone has full access to this file'.

Databases and partitioning types

ORACLE:

Basics of Partitioning
Partitioning allows a table, index or index-organized table to be subdivided into smaller pieces. Each piece of the database object is called a partition. Each partition has its own name, and may optionally have its own storage characteristics. From the perspective of a database administrator, a partitioned object has multiple pieces that can be managed either collectively or individually. This gives the administrator considerable flexibility in managing a partitioned object. However, from the perspective of the application, a partitioned table is identical to a non-partitioned table; no modifications are necessary when accessing a partitioned table using SQL DML commands.

Figure 1: Application and DBA perspective of a partitioned table

Database objects - tables, indexes, and index-organized tables - are partitioned using a 'partitioning key', a set of columns which determine in which partition a given row will reside. For example, the sales table shown in figure 1 is range-partitioned on sales date, using a monthly partitioning strategy; the table appears to any application as a single, 'normal' table. However, the DBA can manage and store each monthly partition individually, potentially using different storage tiers, applying table compression to the older data, or storing complete ranges of older data in read-only tablespaces.


Basic Partitioning Strategies
Oracle Partitioning offers three fundamental data distribution methods that control how the data is actually going to be placed into the various individual partitions, namely:

Range: The data is distributed based on a range of values of the partitioning key (for a date column as the partitioning key, the 'January-2007' partition contains rows with partitioning-key values between '01-JAN-2007' and '31-JAN-2007'). The data distribution is a continuum without any holes, and the lower boundary of a range is automatically defined by the upper boundary of the preceding range.

List: The data distribution is defined by a list of values of the partitioning key (for a region column as the partitioning key, the 'North America' partition may contain the values 'Canada', 'USA', and 'Mexico'). A special 'DEFAULT' partition can be defined to catch all values for a partition key that are not explicitly defined by any of the lists.

Hash: A hash algorithm is applied to the partitioning key to determine the partition for a given row. Unlike the other two data distribution methods, hash does not provide any logical mapping between the data and any partition.

Using the above-mentioned data distribution methods, a table can be partitioned either as a single-level or a composite partitioned table:

Single (one-level) Partitioning: A table is defined by specifying one of the data distribution methodologies, using one or more columns as the partitioning key. For example, consider a table with a number column as the partitioning key and two partitions, 'less_than_five_hundred' and 'less_than_thousand'; the 'less_than_thousand' partition contains rows where the following condition is true: 500 <= partitioning key < 1000. You can specify Range, List, and Hash partitioned tables.

Composite Partitioning: A combination of two data distribution methods is used to define a composite partitioned table. First, the table is partitioned by one data distribution method, and then each partition is further subdivided into subpartitions using a second data distribution method. All subpartitions for a given partition together represent a logical subset of the data. For example, a range-hash composite partitioned table is first range-partitioned, and then each individual range partition is further sub-partitioned using the hash partitioning technique. Available composite partitioning techniques are range-hash, range-list, range-range, list-range, list-list, and list-hash.

Index-organized tables (IOTs) can be partitioned using range, hash, and list partitioning. Composite partitioning is not supported for IOTs.
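
To make the range method concrete, here is a minimal sketch of Oracle-style range partitioning DDL; the table, column and partition names are hypothetical:

CREATE TABLE sales
( sale_id    NUMBER,
  sale_date  DATE,
  amount     NUMBER(10,2)
)
PARTITION BY RANGE (sale_date)
( PARTITION sales_jan2007 VALUES LESS THAN (TO_DATE('01-FEB-2007','DD-MON-YYYY')),
  PARTITION sales_feb2007 VALUES LESS THAN (TO_DATE('01-MAR-2007','DD-MON-YYYY')),
  PARTITION sales_max     VALUES LESS THAN (MAXVALUE)
);

Each monthly partition holds the rows whose sale_date falls below its upper boundary and at or above the previous partition's boundary, which is exactly the continuum described above.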

6/22/11

Difference between SMP and MPP

SMP (symmetric multiprocessing), in which some hardware resources might be shared among processors. The processors communicate via shared memory and have a single operating system.
Or
Symmetric Multi-Processing. In a symmetric multi-processing environment, the CPUs share the same memory, and as a result code running on one CPU can affect the memory used by another.


Cluster or MPP (massively parallel processing), also known as shared-nothing, in which each processor has exclusive access to hardware resources. MPP systems are physically housed in the same box, whereas cluster systems can be physically dispersed. The processors each have their own operating system, and communicate via a high-speed network.
Or
MPP - Massively Parallel Processing. A computer system with many independent arithmetic units or entire microprocessors that run in parallel.

-----------------------------------------------Or
SMP supports limited parallelism, i.e. around 64 processors, whereas MPP can support N number of nodes or processors [high performance].
SMP processing is SEQUENTIAL whereas MPP processing can be PARALLEL.

6/19/11

Difference between Dataset ,Fileset and sequential file

Dataset:
0) A data set is the internal data format of the Orchestrate framework, so any other data being processed as a source in a parallel job is first converted into data set format (handled by the "import" operator), and data being written to a target is converted back from data set format at the end (handled by the "export" operator). Hence, data sets usually give the highest performance.
1) It stores data in binary in the internal format of DataStage, so it takes less time to read/write from a dataset than from any other source/target.
2) It preserves the partitioning scheme so that you don't have to partition the data again.
3) You cannot view the data without DataStage.

Fileset:
0) Both the .ds file and the .fs file are descriptor files for the data set and file set respectively. The .fs file is stored in ASCII format, so you can open it directly to see the path of the data files and their schema. However, a .ds file cannot be opened directly; you can view it through Data Set Management, the utility in the client tools (such as Designer and Manager), or the command-line utility ORCHADMIN.
1) It stores data in a format similar to a sequential file.
2) The only advantage of using a fileset over a sequential file is that it preserves the partitioning scheme.
3) You can view the data, but in the order defined by the partitioning scheme.

parallel lookup types

Parallel DataStage jobs can have many sources of reference data for lookups including database tables, sequential files or native datasets. Which is the most efficient?

This question has popped up several times over on the DSExchange. In DataStage server jobs the answer is quite simple: local hash files are the fastest method of a key-based lookup, as long as the time taken to build the hash file does not wipe out your benefits from using it.
In a parallel job there are a very large number of stages that can be used as a lookup, a much wider variety than in server jobs; this includes most data sources and the parallel staging formats of datasets and lookup filesets. I have discounted database lookups, as the overhead of the database connectivity and any network passage makes them slower than most local storage.

Someone did a test comparing datasets to sequential files to lookup filesets, increasing row volumes to see how they responded. The test had three jobs, each with a sequential file input stage and a reference stage writing to a copy stage.

Small lookups
I set the input and lookup volumes to 1000 rows. All three jobs processed in 17 or 18 seconds. No lookup tables were created apart from the existing lookup fileset one. This indicates the lookup data fit into memory and did not overflow to a resource file.

1 Million Row Test
The lookup dataset took 35 seconds, the lookup fileset took 18 seconds and the lookup sequential file took 35 seconds even though it had to partition the data. I assume this is because the input also had to be partitioned and this was the bottleneck in the job.

2 million rows
Starting to see some big differences now. Lookup fileset down at 45 seconds is only three times the length of the 1000 row test. Dataset is up to 1:17 and sequential file up to 1:32. The cost of partitioning the lookup data is really showing now.

3 million rows
The fileset, still at 45 seconds, swallowed up the extra 1 million rows with ease. The dataset was up to 2:06 and the sequential file up to 2:20.

As a final test I replaced the lookup stage with a join stage and tested the dataset and sequential file reference links. The dataset join finished in 1:02 and the sequential file join finished in 1:15. A large join proved faster than a large lookup but not as fast as a lookup fileset.

Conclusion
If your lookup size is low enough to fit into memory then the source is irrelevant; they all load up very quickly, and even database lookups are fast. If you have very large lookup files spilling into lookup table resources then the lookup fileset outstrips the other options. A join also becomes a viable option. Joins are a bit harder to design, as you can only join one source at a time whereas a lookup can join multiple sources.

I usually go with lookups for code to description or code to key type lookups regardless of the size, I reserve the joins for references that bring back lots of columns. I will certainly be making more use of the lookup fileset to get more performance from jobs.

Optimzie your DataStage Job Performance with relevant Environment Variables

DataStage has many parameters which can be tweaked to optimize the performance of various DataStage jobs. Many are also available to collect more information in the event of a crash, to get more traces.
For any DataStage job, if you run into a problem or want to get more details, you need to check the following variables.

$APT_CONFIG_FILE: This allows you to define Configuration file based on your requirement. You can keep many configuration files with n-node combination and assign it dynamically for Job based in criteria or time.

$APT_DUMP_SCORE: It creates a job run report that shows the partitioning used, degree of parallelism, data buffering and inserted operators. It is useful for finding out what your high-volume job is doing.

$APT_PM_PLAYER_TIMING: This option lets you see what each operator in a job is doing, especially how much data they are handling and how much CPU they are consuming. It helps in identifying various bottlenecks.

One way to speed up very high volume jobs is to pre-sort the data and make sure it is not resorted in the DataStage job. This is done by turning off auto sorting in high volume jobs:
APT_NO_SORT_INSERTION: stops the job from automatically adding a sort operation to the start of a job path that has stages which need sorted data, such as Remove Duplicates. You can also add a sort stage to the job and set it to "Previously Sorted" to avoid this in a specific job path.


Error Handling in Datastage Job Design

There are multiple ways to handle Errors in Data or in DataStage Jobs.

We can use the Reject link option in the Transformer stage and also the Reject link option within Connector stages. If we face an issue in Job Sequences, we can use the "Exception Handler" activity. Here is how we can use this activity from within a DataStage Job Sequence.

You can check the check box named "Automatically handle activities that fail" in the properties of the master sequence. If you want a checkpoint, also check "Restart job from the failure point".

In the DataStage job sequence, use an Exception Handler activity. After the Exception Handler activity you can include an email notification activity (same for SMS). On job failure, control will go to the Exception Handler activity and an email/SMS will be sent notifying the user that a sequence has failed. It also provides information on the failure code for the selected part of the job design.
More on "How to use Notification Activity" in next one.

Disclaimer: "The postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions

Sequence Job Errors:
1) Controller problem: Loop start > loop end - cannot execute it
Sol : This happens because there is no value being passed from the routine activity. If re-running the routine activity every time you restart your sequence will not affect anything, then check the box "Do not check point" so that the routine activity fires every time, giving your start loop a value to start with. If that's not what you want, you will have to re-think your design.

2) Change_Capture_cc: When checking operator: User inserted sort " Change_Capture_cc .CVG_CLMRTG_INDEX_fnl_out_Sort" does not fulfill the sort requirements of the downstream operator " Change_Capture_cc "

Sol :  "Need to hash partition the data using the sort keys as hash keys in order to guarantee that duplicate rows are in the same partition"





Server Job Errors :

Parallel Job Errors :

6/17/11

Sort stage to remove duplicate

1) What is the advantage of using the Sort stage over the Remove Duplicates stage in removing duplicates?
2) Is there any way in which we can specify which record to retain (like retaining the last record or retaining the first) when we remove duplicates using the Transformer stage, and similarly in the Sort stage?


Ans :
1) The advantage of using the Sort stage over the Remove Duplicates stage is that the Sort stage allows us to capture the duplicate records, whereas the Remove Duplicates stage does not.
2) Using a Sort stage we can only retain the first record.
Normally we go for retaining the last record by sorting a particular field in ascending order and taking the last record. The same can be achieved with the Sort stage by sorting in descending order and retaining the first record.
********
Yes indeed...
Use a Sort stage with the option "Create Key Change Column" set to True.
This will create a KeyChange column at the output of the sort, as below.

Input
-----
col_a
------
1
1
1
2
2
3
4
5
6
6

output
-------
col_a Keychange
------------------
1 1
1 0
1 0
2 1
2 0
3 1
4 1
5 1
6 1
6 0

For the first occurrence of a particular value it assigns 1, and for subsequent occurrences of the same value it assigns 0.
Now you can use a Transformer with the constraint KeyChange = 0 to capture the duplicate records.

*********
Can you tell me what the Create Cluster Key Change Column option in the Sort stage is? If my requirement is only to remove duplicates and I don't want to capture the duplicate records, is it fine to remove duplicates using the sort and unique options on the Input tab of any stage?
******

The options specified above for retaining duplicate values using the Sort stage will obviously work, but as far as I remember there is an option in the properties of the Sort stage named ALLOW DUPLICATES. If this property is set to True, I think it will satisfy the requirement (provided that you are using DataStage 8.x).



If the property "Create Cluster Key Change Column" is set to True, it will create a cluster key change column in the output. Consider the following example:

i/p_col  cluster_key_change_col
1 1
1 0
1 0
2 1
2 0
3 1
1 1
2 1
6 1
6 0

The logic is that every i/p value is checked against the value immediately above it. If it finds a match, the o/p is 0; otherwise it is 1. It doesn't matter whether the value appeared previously or not.

6/15/11

Teradata Database Hardware and Software Architecture

Teradata Permanent and Temporary Tables

To manipulate tabular data, you must submit a query in a language that the database understands. In the case of the Teradata Database, the language is SQL. You can store the results of multiple SQL queries in tables. Permanent storage of tables is necessary when different sessions and users must share table contents.

When tables are required for only a single session, you can request that the system create temporary tables. Using this type of table, you can save query results for use in subsequent queries within the same session. Also, you can break down complex queries into smaller queries by storing results in a temporary table for use during the same session. When the session ends, the system automatically drops the temporary table.

Global Temporary Tables:
Global temporary tables are tables that exist only for the duration of the SQL session in which they are used. The contents of these tables are private to the session, and the system automatically drops the table at the end of that session. However, the system saves the global temporary table definition permanently in the Data Dictionary. The saved definition may be shared by multiple users and sessions with each session getting its own instance of the table.

Volatile Tables:
If you need a temporary table for a single use only, you can define a volatile table. The definition of a volatile table resides in memory but does not survive across a system restart. Using volatile tables improves performance even more than using global temporary tables because the system does not store the definitions of volatile tables in the Data Dictionary. Access-rights checking is not necessary because only the creator can access the volatile table.
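
For reference, here is a minimal sketch of the Teradata DDL for both kinds of session-scoped tables; the table and column names are hypothetical:

CREATE GLOBAL TEMPORARY TABLE gt_sales_work
( sale_id  INTEGER,
  amount   DECIMAL(10,2)
)
ON COMMIT PRESERVE ROWS;

CREATE VOLATILE TABLE vt_sales_work
( sale_id  INTEGER,
  amount   DECIMAL(10,2)
)
ON COMMIT PRESERVE ROWS;

The global temporary definition is saved permanently in the Data Dictionary and each session that references it gets its own private instance, while the volatile definition lives only in memory for the current session and disappears when the session ends.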

Derived Tables:

A special type of temporary table is the derived table. You can specify a derived table in an SQL SELECT statement. A derived table is obtained from one or more other tables as the result of a subquery. The scope of a derived table is only visible to the level of the SELECT statement calling the subquery. Using derived tables avoids having to use the CREATE and DROP TABLE statements for storing retrieved information and assists in coding more sophisticated, complex queries.

Database Normalization and Referential Integrity

Normalization is the process of reducing a complex data structure into a simple, stable one. Generally this process involves removing redundant attributes, keys, and relationships from the conceptual data model.

First Normal Form
First normal form (1NF) is definitive of a relational database. If we are to consider a database relational, then all relations in the database must be in 1NF. We say a relation is in 1NF if all fields within that relation are atomic. We sometimes refer to this concept as the elimination of repeating groups from a relation. Furthermore, first normal form allows no hierarchies of data values.

Second Normal Form

Second normal form (2NF) deals with the elimination of circular dependencies from a relation. We say a relation is in 2NF if it is in 1NF and if every non-key attribute is fully dependent on the entire Primary Key. A non-key attribute is any attribute that is not part of the Primary Key for the relation.

Third Normal Form:

Third normal form (3NF) deals with the elimination of non-key attributes that do not describe the Primary Key. For a relation to be in 3NF, the relationship between any two non-Primary Key columns, or groups of columns, in a relation must not be one-to-one in either direction. We say attributes are mutually independent if none of them is functionally dependent on any combination of the others. This mutual independence ensures that we can update individual attributes without any danger of affecting any other attribute in a row.

The following list of benefits summarizes the advantages of implementing a normalized logical model in 3NF.
• Greater number of relations
• More PI choices
• Optimal distribution of data
• Fewer full table scans
• More joins possible
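
To make the 3NF idea concrete, here is a minimal sketch (with hypothetical table and column names) of removing a dependency between two non-key columns by splitting the relation:

-- Not in 3NF: state_name depends on city_name, another non-key column
CREATE TABLE customer_flat
( customer_id INTEGER NOT NULL PRIMARY KEY,
  city_name   VARCHAR(50),
  state_name  VARCHAR(50)
);

-- In 3NF: the city/state relationship moves into its own relation
CREATE TABLE city
( city_id    INTEGER NOT NULL PRIMARY KEY,
  city_name  VARCHAR(50),
  state_name VARCHAR(50)
);

CREATE TABLE customer
( customer_id INTEGER NOT NULL PRIMARY KEY,
  city_id     INTEGER
);

After the split, each fact about a city is stored once, so updating it does not require touching every customer row.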

Referential Integrity:
Traditional referential integrity is the concept of relationships between tables, based on the definition of a primary key and a foreign key. The concept states that a row cannot exist in a table with a non-null value for a referencing column if an equal value does not exist in a referenced column.

Using referential integrity, you can specify columns within a referencing table that are foreign keys for columns in some other referenced table. You must define referenced columns as either primary key columns or unique columns. Referential integrity is a reliable mechanism that prevents accidental database inconsistencies when you perform INSERTS, UPDATES, and DELETES.
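
As a minimal sketch (hypothetical table names), a referencing table declares a foreign key against the primary key of the referenced table:

CREATE TABLE customers
( customer_id INTEGER NOT NULL PRIMARY KEY,
  customer_nm VARCHAR(100)
);

CREATE TABLE orders
( order_id    INTEGER NOT NULL PRIMARY KEY,
  customer_id INTEGER NOT NULL,
  FOREIGN KEY (customer_id) REFERENCES customers (customer_id)
);

With this constraint in place, an INSERT or UPDATE that puts a customer_id into orders is rejected unless that value already exists in customers, and a DELETE of a referenced customers row is likewise prevented.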

Teradata Data Distribution and Data Access Methods

An index is a physical mechanism used to store and access the rows of a table. Indexes on tables in a relational database function much like indexes in books: they speed up information retrieval.
In general, the Teradata Database uses indexes to:
• Distribute data rows.
• Locate data rows.
• Improve performance. Indexed access is usually more efficient than searching all rows of a table.
• Ensure uniqueness of the index values. Only one row of a table can have a particular value in the column or columns defined as a Unique Index.
The Teradata Database supports the following types of indexes:
• Primary
• Secondary
• Join
• Hash
• Special indexes for referential integrity
These indexes are discussed in the following sections.

Primary Index : The Teradata Database requires one Primary Index (PI) for each table in the database, except for some data dictionary tables and global temporary tables.

Distributes rows.
Defines most common access path.
May be unique or non-unique.
May be null.
Physical access mechanism.
Required by Teradata Database for most tables.
64-column limit.
Values can be changed.

Primary Indexes and Data Distribution:
Unique Primary Indexes (UPIs) guarantee uniform distribution of table rows. Non-Unique Primary Indexes (NUPIs) can cause skewed data. While not a guarantor of uniform row distribution, the degree of uniqueness of the index will determine the degree of uniformity of the distribution. Because all rows with the same PI value end up on the same AMP, columns with a small number of distinct values that are repeated frequently do not make good PI candidates.

The most efficient access method to data in a table is through the PI. For this reason, choosing a PI should take the following design goal into consideration: choosing a PI that gives good distribution of data across the AMPs must be balanced against choosing a PI that reflects the most common usage pattern of the table.
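
For illustration, a Teradata table declares its Primary Index as part of the CREATE TABLE statement; the table and column names here are hypothetical:

CREATE TABLE customer
( customer_id   INTEGER NOT NULL,
  customer_name VARCHAR(100),
  city_name     VARCHAR(50)
)
UNIQUE PRIMARY INDEX (customer_id);

Declaring a UNIQUE PRIMARY INDEX on a high-cardinality column such as customer_id hashes rows evenly across the AMPs, whereas a NUPI on a low-cardinality column such as city_name would send all rows with the same value to the same AMP and skew the distribution.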

5/18/11

Search Engine Optimization

Search Engine Optimization is – a method or process of improving and optimizing the ranking of a website in search engine results.

After having published a Blog or website, the first thing to do is to make sure that the information relating to the site is available on the internet. Web Directories, Blog Directories and Search Engines are the internet Yellow Pages. Although the major Search Engines will crawl your website even without any action on your part, it will be much faster if you could proactively provide them your site information. We had written articles on how and where to submit the blog information so as to have your site fully indexed and appear in many places on the internet.

Submit Blogger Sitemap to Yahoo:
Publicize your site by having your Blogger.com or Blogspot.com blog site included in Yahoo! Directory and Yahoo! Search. I went to the official Yahoo! submission of URL page and clicked the link “Submit Your Site for Free”. You will be prompted to log in using your Yahoo! account. For those of you who do not have a Yahoo! account, simply sign up for one. It is Free.

Select “Submit Site Feed” and enter the URL of your Blog's site feed. You can use either:-

http://YOURBLOGNAME.blogspot.com/rss.xml

or

http://YOURBLOGNAME.blogspot.com/atom.xml

Remember to replace YOURBLOGNAME with that of your Blog's. For Blogger blogs, do not add "www." to YOURBLOGNAME. Once the Sitemaps are submitted, click the “Authentication” link shown at the top right hand corner of the page.

1/4/11

Datastage FAQ's

Difference Between Server Jobs and Parallel Jobs:


  • Datastage parallel jobs can run in parallel on multiple nodes. Server jobs do not run on multiple node.
  • Parallel jobs support partition parallelism(Round robin,Hash,modulus etc.), server jobs don't support this.
  • The transformer in Parallel jobs compiles in C++. In server jobs, the transformer is compiled in Basic language.

--------------------------------------
The basic difference is that a server job usually runs on the Windows platform and a parallel job runs on the UNIX platform.

  • A server job runs on one node whereas a parallel job runs on more than one node.
  • Even server jobs run on UNIX; most of the major installations are on the UNIX platform. Coming to the differences, there is a major difference in job architecture.
  • Server jobs process in sequence, one stage after another.
  • Parallel jobs process in parallel. They use the configuration file to know the number of CPUs defined to process in parallel.

------------------------------------------------
Compare and Contrast ODBC and Plug-In stages?

ODBC: a) Poor Performance.
b) Can be used for Variety of Databases.
c) Can handle Stored Procedures.

Plug-In: a) Good Performance.
b) Database specific. (Only one database)
c) Cannot handle Stored Procedures.
------------------------------------------
What is Dimensional Modeling?

  • Dimensional Modeling is a logical design technique that seeks to present the data in a standard framework that is intuitive and allows for high-performance access.

----------------------------
What is Star Schema?

  • Star Schema is a de-normalized multi-dimensional model. It contains centralized fact tables surrounded by dimension tables.
  • Dimension Table: It contains a primary key and descriptive attributes referenced by the fact table.
  • Fact Table: It contains foreign keys to the dimension tables, measures and aggregates.
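
As a minimal sketch of a star schema in SQL (all table and column names are hypothetical), the fact table carries foreign keys to its surrounding dimension tables plus the measures:

CREATE TABLE date_dim
( date_key  INTEGER NOT NULL PRIMARY KEY,
  cal_date  DATE,
  cal_month INTEGER,
  cal_year  INTEGER
);

CREATE TABLE product_dim
( product_key  INTEGER NOT NULL PRIMARY KEY,
  product_name VARCHAR(100)
);

CREATE TABLE sales_fact
( date_key     INTEGER NOT NULL REFERENCES date_dim (date_key),
  product_key  INTEGER NOT NULL REFERENCES product_dim (product_key),
  sales_amount DECIMAL(12,2),
  units_sold   INTEGER
);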

-------------------------
what is snowflake schema?
A snowflake schema is represented by centralized fact tables which are connected to multiple dimension tables. In this schema, the dimensions are normalized into multiple related tables. The dimensions of a snowflake schema are elaborated, having multiple levels of relationships, and child tables can have multiple parent tables. "Snowflaking" only affects the dimension tables, not the fact table.
----------------------------------------------
what is normalization?

ans1: It is the process of efficiently organizing data in a database. There are two goals of the normalization process: eliminating redundant data and ensuring data dependencies make sense (only storing related data in a table).
ans2: It is the process of removing repeated information. It first removes repeated records in a table.


what is de-normalization?


It is the process of attempting to optimize the performance of a database by adding redundant data or by grouping data. It is usually done to decrease the time required to execute complex queries. In a normalized database, joins are required to gather all the information from multiple entities, as data is divided and stored in multiple entities rather than in one large table.


What is ODS?

  • Operational Data Store is a staging area where data can be rolled back.


Importance of Surrogate Key in Data warehousing?

  • A surrogate key is a system-generated artificial primary key.
  • A Surrogate Key is a Primary Key for a Dimension table. Its main importance is that it is independent of the underlying database, i.e. a Surrogate Key is not affected by the changes going on in the database.
  • It is useful because the natural primary key (i.e. Customer Number in the Customer table) can change, and this makes updates more difficult.
  • Some tables have columns such as AIRPORT_NAME or CITY_NAME which are stated as the primary keys (according to the business users), but not only can these change, indexing on a numerical value is probably better, and you could consider creating a surrogate key called, say, AIRPORT_ID. This would be internal to the system and, as far as the client is concerned, you may display only the AIRPORT_NAME.
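
As a hedged sketch of how such a key might be generated (identity-column syntax varies by database, and the names are hypothetical):

CREATE TABLE airport_dim
( airport_id   INTEGER GENERATED ALWAYS AS IDENTITY,  -- surrogate key, internal to the system
  airport_name VARCHAR(100),                          -- natural attribute shown to users
  city_name    VARCHAR(100)
);

Because airport_id carries no business meaning, AIRPORT_NAME or CITY_NAME can change without forcing key updates in any referencing fact table.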

----------------------------------------
What are other Performance tunings you have done in your last project to increase the performance of slowly running jobs?


  • Staged the data coming from ODBC/OCI/DB2UDB stages or any database on the server using Hash/Sequential files for optimum performance and also for data recovery in case the job aborts.
  • Tuned the OCI stage for 'Array Size' and 'Rows per Transaction' numerical values for faster inserts, updates and selects.
  • Tuned the 'Project Tunables' in Administrator for better performance.
  • Used sorted data for Aggregator.
  • Sorted the data as much as possible in Database and reduced the use of DS-Sort for better performance of jobs.
  • Removed the data not used from the source as early as possible in the job.
  • Worked with DB-admin to create appropriate Indexes on tables for better performance of DS queries.
  • Converted some of the complex joins/business in DS to Stored Procedures on DS for faster execution of the jobs.
  • If an input file has an excessive number of rows and can be split-up then use standard logic to run jobs in parallel.
  • Before writing a routine or a transform, make sure that the required functionality is not already available in one of the standard routines supplied in the sdk or ds utilities categories.
  • Constraints are generally CPU intensive and take a significant amount of time to process. This may be the case if the constraint calls routines or external macros but if it is inline code then the overhead will be minimal.
  • Try to have the constraints in the 'Selection' criteria of the jobs itself. This will eliminate the unnecessary records even getting in before joins are made.
  • Tuning should occur on a job-by-job basis.
  • Use the power of DBMS.
  • Try not to use a sort stage when you can use an ORDER BY clause in the database.
  • Using a constraint to filter a record set is much slower than performing a SELECT … WHERE…. 
  • Make every attempt to use the bulk loader for your particular database. Bulk loaders are generally faster than using ODBC or OLE.