4/30/12

How do I change my Oracle password using SQL*Plus?


Using SQL*Plus in Oracle 11g

  1. From the Start menu, select All Programs, and then select your Oracle Home group (e.g., Oracle - OraClient11g_home1).
  2. Click Application Development, and then click SQL Plus. When you launch SQL Plus in the Oracle 11g client, you will get the command shell version of SQL*Plus.
  3. When prompted with Enter user-name:, enter your Oracle username, then @, and then the net service name of the Oracle database; for example: username@dev.company.com
     Note: In the STCs, the full Oracle net service name is oed1prd.world. Oracle net service names configured on your computer may differ from the STC names, since these can be arbitrarily assigned.
  4. Press Enter. When prompted with Enter password:, enter your Oracle password, and then press Enter.
  5. At the prompt, to change your password, enter the following command: password
  6. When prompted, enter your old password, then your new password, and then retype your new password. SQL*Plus will indicate a successful password change.
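
A minimal example session (the username shown is a placeholder; the prompts are typical of SQL*Plus):

SQL> password
Changing password for SCOTT
Old password:
New password:
Retype new password:
Password changed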

Schema File in Datastage


Schema files and partial Schemas:

You can also specify the meta data for a stage in a plain text file known as a schema file. This is not stored in the Repository but you could, for example, keep it in a document management or source code control system, or publish it on an intranet site.

Note: If you are using a schema file on an NLS system, the schema file needs to be in UTF-8 format. It is, however, easy to convert text files between two different maps with a WebSphere DataStage job. Such a job would read data from a text file using a Sequential File stage and specifying the appropriate character set on the NLS Map page. It would write the data to another file using a Sequential File stage, specifying the UTF-8 map on the NLS Map page.

Some parallel job stages allow you to use a partial schema. This means that you need only define column definitions for the columns that you are actually going to operate on.

Remember that you should turn runtime column propagation on if you intend to use schema files to define column meta data.
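
For illustration, a minimal schema file might look like this (a sketch in the orchestrate record schema notation; the column names and properties are assumed examples, not taken from the text above):

record {delim=',', final_delim=end}
(
  CustID: int32;
  CustName: string[30];
  OrderDate: date;
)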

Complex data types

Parallel jobs support three complex data types:
  • Subrecords
  • Tagged subrecords
  • Vectors
When referring to complex data in WebSphere DataStage column definitions, you can specify fully qualified column names, for example:
Parent.Child5.Grandchild2.

Subrecords :

A subrecord is a nested data structure. The column with type subrecord does not itself define any storage, but the columns it contains do. These columns can have any data type, and you can nest subrecords one within another. The LEVEL property is used to specify the structure of subrecords. The following diagram gives an example of a subrecord structure.
Parent (subrecord)
Child1 (string)
Child2 (string)      LEVEL01
Child3 (string)
Child4 (string)
Child5(subrecord)
        Grandchild1 (string)
        Grandchild2 (time)
        Grandchild3 (sfloat)  LEVEL02
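
In a schema file the same structure could be written roughly as follows (a sketch assuming the orchestrate record schema notation; the names match the diagram above):

record
(
  Parent: subrec
  (
    Child1: string;
    Child2: string;
    Child3: string;
    Child4: string;
    Child5: subrec
    (
      Grandchild1: string;
      Grandchild2: time;
      Grandchild3: sfloat;
    );
  );
)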

Tagged subrecord

This is a special type of subrecord structure. It comprises a number of columns of different types, and the actual column is ONE of these, as indicated by the value of a tag at run time. The columns can be of any type except subrecord or tagged. The following diagram illustrates a tagged subrecord.

Parent (tagged)
       Child1 (string)
       Child2 (int8)
       Child3 (raw)
       Tag = Child1, so column has data type of string
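
In schema-file notation this might be written as follows (again a sketch assuming the orchestrate record schema notation):

record
(
  Parent: tagged
  (
    Child1: string;
    Child2: int8;
    Child3: raw;
  );
)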

Vector

A vector is a one-dimensional array of any type except tagged. All the elements of a vector are of the same type, and they are numbered from 0. The vector can be of fixed or variable length. For fixed-length vectors the length is explicitly stated; for variable-length ones, a property defines a link field which gives the length at run time. For example, a fixed-length vector of five int32 elements always holds exactly five values, while a variable-length vector reads its element count from its link field on each record.
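
In a schema file, a fixed-length vector is declared by giving the element count in brackets (a sketch in the orchestrate record schema notation; the variable-length form, which references a link field, is omitted here):

record
(
  Readings[5]: int32;
)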



Netezza Performance Server :


Distribution Specification:
Each table in a Netezza RDBMS database has only one distribution key, which consists of
one to four columns. You can use the following SQL syntax to create distribution keys.

> To create an explicit distribution key, the Netezza SQL syntax is:
create table <tablename> [ ( <col> [, ... ] ) ]
[ distribute on [hash] ( <col> [, ... ] ) ]
The phrase distribute on specifies the distribution key; the word hash is optional.

> To create a round-robin distribution key, the Netezza SQL syntax is:
create table <tablename> (col1 int, col2 int, col3 int)
distribute on random;
The phrase distribute on random specifies round-robin distribution.

> To create a table without specifying a distribution key, the Netezza SQL syntax is:
create table <tablename> (col1 int, col2 int, col3 int);
The NPS chooses a distribution key. There is no guarantee what that key is, and it can vary depending on the NPS software release.
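
For example (table and column names are illustrative, not from the text above):

create table sales (sale_id bigint, customer_id int, amount numeric(12,2))
distribute on (customer_id);

create table audit_log (log_id bigint, msg varchar(200))
distribute on random;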



Choosing the distribution key of a table in Netezza

In Netezza, each table's data is distributed across many servers (SPUs). This distribution of data is what enables parallel processing, so we should be careful to choose the right distribution key when creating a table.

There are two types of distribution methods:

  1. Hash algorithm
  2. Random algorithm
The random algorithm applies when you specify distribute on random.
The hash algorithm applies when you specify column names in the distribute on clause. When you don't specify a distribute on clause while creating a table, NPS by default distributes the data on the first column in the create table statement.


Tips to choose a good distribution method:


  1. Choose columns that distribute rows evenly across all the SPUs.
  2. Select columns with unique values and high cardinality.
  3. Do not use random distribution, even though it distributes data evenly across all the SPUs, because randomly distributed tables must be redistributed or broadcast whenever they are joined.
  4. Always specify the distribute on clause, even if you are distributing on the first column in the table.
  5. Use the same distribution key and data types for commonly joined tables (see the example after this list).
  6. Do not distribute on floating-point and Boolean data types.
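
Two commonly joined tables sharing a distribution key keep the join co-located on each SPU (a sketch with illustrative names):

create table orders (order_id bigint, customer_id int, order_date date)
distribute on (customer_id);

create table customers (customer_id int, name varchar(100))
distribute on (customer_id);

-- a join on customer_id can now be resolved locally on each SPU
select o.order_id, c.name
from orders o join customers c on o.customer_id = c.customer_id;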

Netezza Date Conversions :


SELECT TO_TIMESTAMP(CURRENT_TIMESTAMP,'YYYY-MM-DD HH24:MI:SS')

TO_TIMESTAMP
-------------
10/24/2011 4:42:35.000156

SELECT TO_DATE(CURRENT_TIMESTAMP,'YYYY-MM-DD HH24:MI:SS')

TO_DATE
----------
10/24/2011


DATE_PART :

Ex:
SELECT DATE_PART('MONTH',CURRENT_TIMESTAMP)

DATE_PART
---------------
         10

DATE_TRUNC : 

BEFORE TRUNC:
SELECT CURRENT_TIMESTAMP
----------------------------------
10/24/2011 4:54:11.001598


AFTER TRUNC (HOUR):

SELECT DATE_TRUNC('HOUR',CURRENT_TIMESTAMP)
-------------------------------------

10/24/2011 4:00:00.001598
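
DATE_TRUNC also accepts other field names, e.g. truncating to the start of the day (a hedged example; the display format depends on the client):

SELECT DATE_TRUNC('DAY',CURRENT_TIMESTAMP)
-- truncates the time portion to midnight of the current day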



Netezza SQL Examples


1) To limit the data in Netezza
Ex:     select * from EMP limit 1000
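
Note that without an order by clause the 1000 rows returned are arbitrary; to make the sample repeatable you can sort first (EMP and empno are illustrative names):

select * from EMP order by empno limit 1000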

Difference Between Datastage 7.x and 8.x


The Main Differences are :

1) No Manager Client :
  We don't have the Manager client tool in version 8; it has been incorporated into the Designer itself.

2) Parameter Set :
 There is no need to hard-code the parameters for every job; we now have an option called a parameter set. If we create a parameter set, we can call it for the whole project, or for a job or sequence.

 3) Lookup :
     Range lookup was introduced in the Lookup stage, and Lookup now has its own UI.


4) Database Repository  :
      > In DataStage 7.x the repository used to be a file system, but in 8.x we can use a database as the repository.

Note the database compatibility for the Metadata Server repository is the latest versions of the three DBMS engines. DB2 is an optional extra in the bundle if you don't want to use an existing database.
  • IBM UDB DB2 ESE 9
    -IBM Information Server does not support the Database Partitioning Feature (DPF) for use in the repository layer
    -DB2 Restricted Enterprise Edition 9 is included with IBM Information Server and is an optional part of the installation; however, its use is restricted to hosting the IBM Information Server repository layer, and it cannot be used for other applications
  • Oracle 10g
  • SQL Server 2005
5) Languages
     Foreign language support for the graphical tools and product messages:  


Chinese (Simplified and Traditional), Czech, Danish, Finnish, French, German, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish and Swedish.


6)  Removed Functions:
   These are the functions that are not in DataStage 8; please imagine the last waltz playing in your head as you peruse this list:
  • dssearch command line function
  • dsjob "-import"
  • Version Control tool
  • Released jobs
  • Oracle 8i native database stages
  • ClickPack
The loss of the Version Control tool is not a big deal as the import/export functions have been improved. Building a release file as an export in version 8 is easier than building it in the Version Control tool in version 7.

7) New Functions Existing Stages
  • Complex Flat File Stage: Multi Format File (MFF) support in addition to the existing COBOL file support.
  • Surrogate Key Generator: now maintains the key source via an integrated state file or a DBMS sequence.
  • Lookup Stage: range lookups by defining high and low range fields on the input or reference data table; updatable in-memory lookups.
  • Transformer Stage: new surrogate key functions Initialize() and GetNextKey().
  • Enterprise FTP Stage: now choose between FTP and SFTP transfer.


8) New Stages
    A new stage from the IBM software family, new stages from new partners, and the convergence of QualityStage functions into DataStage. Apart from the SCD stage, these all come at an additional cost.
  • WebSphere Federation and Classic Federation
  • Netezza Enterprise Stage
  • SFTP Enterprise Stage
  • iWay Enterprise Stage
  • Slowly Changing Dimension: for type 1 and type 2 SCDs.
  • Six QualityStage stages.

Difference Between Datastage Server Jobs and Parallel Jobs


Datastage parallel jobs can run in parallel on multiple nodes. Server jobs do not run on multiple nodes.
Parallel jobs support partition parallelism (round robin, hash, modulus, etc.); server jobs don't support this.
The transformer in parallel jobs is compiled into C++. In server jobs, the transformer is compiled into BASIC.
--------------------------------------
The basic difference is that server jobs usually run on a Windows platform and parallel jobs run on a UNIX platform.
A server job runs on one node, whereas a parallel job runs on more than one node.
Even though server jobs can also run on UNIX (most major installations are on UNIX platforms), there is a major difference in job architecture:
server jobs process in sequence, one stage after another,
while parallel jobs process in parallel, using the configuration file to know the number of CPUs defined to process in parallel.

------------------------------------------

  1. The major difference between InfoSphere DataStage Enterprise and Server editions is that Enterprise Edition (EE) introduces parallel jobs. Parallel jobs support a completely new set of stages, which implement scalable and parallel data processing mechanisms. In most cases parallel jobs and stages look similar to the DataStage Server objects, but their capabilities are very different.
    In rough outline:
    • Parallel jobs are executable DataStage programs, managed and controlled by the DataStage Server runtime environment
    • Parallel jobs have a built-in mechanism for pipelining, partitioning and parallelism. In most cases no manual intervention is needed to implement those techniques optimally.
    • Parallel jobs are a lot faster in such ETL tasks like sorting, filtering, aggregating
  2. Datastage EE jobs are compiled into OSH (Orchestrate Shell script language).
    OSH executes operators: instances of executable C++ classes, pre-built components representing the stages used in DataStage jobs.
    Server jobs are compiled into BASIC, which is interpreted pseudo-code. This is why parallel jobs run faster, even if processed on one CPU.
  3. Datastage Enterprise Edition adds functionality to the traditional server stages, for instance record and column level format properties.
  4. Datastage EE brings also completely new stages implementing the parallel concept, for example:
    • Enterprise Database Connectors for Oracle, Teradata & DB2
    • Development and Debug stages - Peek, Column Generator, Row Generator, Head, Tail, Sample ...
    • Data set, File set, Complex flat file, Lookup File Set ...
    • Join, Merge, Funnel, Copy, Modify, Remove Duplicates ...
  5. When processing large data volumes, DataStage EE jobs would be the right choice; however, when dealing with a smaller data environment, Server jobs might be easier to develop, understand and manage.
    When a company has both Server and Enterprise licenses, both types of jobs can be used.
  6. Sequence jobs are the same in Datastage EE and Server editions.


Parallel Job - create a unique counter within a parallel transformer


Introduction

This entry describes various ways of creating a unique counter in DataStage jobs.
A parallel job has a Surrogate Key stage that creates unique IDs; however, it is limited in that it does not support conditional code, and it may be more efficient to add a counter to an existing transformer rather than add a new stage.
In a server job there are a set of key increment routines installed in the routine SDK samples that offer a more complex counter that remembers values between job executions.
The following section outlines a transformer only technique.

Steps

In a DataStage job the easiest way to create a counter is within the Transformer stage with a Stage Variable.
svMyCounter = svMyCounter + 1
This simple counter adds 1 each time a row is processed.
The counter can be given a seed value by passing a value in as a job parameter and setting the initial value of svMyCounter to that job parameter.
In a parallel job this simple counter will create duplicate values on each node as the transformer is split into parallel instances. It can be turned into a unique counter by using special parallel macros.
  1. Create a stage variable for the counter, e.g. svCounter.
  2. On the Stage Properties form, set the Initial Value of the stage variable to "@PARTITIONNUM - @NUMPARTITIONS + 1".
  3. Set the derivation of the stage variable to "svCounter + @NUMPARTITIONS". You can embed this in an IF statement if it is a conditional counter.
Each instance will start at a different number: with four partitions the initial values are -3, -2, -1 and 0. Each time the counter is incremented, each instance increments by the number of partitions, e.g. 4. This gives us a sequence in instance 1 of 1, 5, 9, 13... Instance 2 is 2, 6, 10, 14... etc.
Remember this method only works if your data is evenly balanced, i.e. an equal number of rows going through each partition. An alternative syntax is:
@INROWNUM * @NUMPARTITIONS + @PARTITIONNUM 


Convert Function Datastage


A="NOW IS THE TIME"

CONVERT('TI','XY',A)

OUTPUT : NOW YS XHE XYME
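
A related behaviour (standard BASIC CONVERT semantics, not shown in the original example): when the replacement list is shorter than the search list, the unmatched characters are deleted.

CONVERT('AEIOU','',A)

OUTPUT : NW S TH TM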

4/24/12

How to Capture/Handle Rejects in Join Stage

First of all, there is no such thing as rejects from a join; there is only the result of the join. In the Join stage, use an outer join (left, right or full outer, depending on which side you are interested in finding these 'rejects' on), and then in subsequent stages capture the rows where the left/right data is missing (or contains nulls).