9/28/12

Copy Stage - Force Option

Copy Stage : Stage -> Properties has an option called Force.

Force : True or False.

True specifies that DataStage should not try to optimize the job by removing the Copy operation.

False specifies that DataStage should try to optimize the job (it might or might not remove the copy operator).

9/14/12

Conductor Node in DataStage



Below is a sample APT configuration file. The conductor node is node0, the node whose pools entry is "conductor".


{
    node "node0"
    {
        fastname "server1"
        pools "conductor"
        resource disk "/datastage/Ascential/DataStage/Datasets/node0" {pools "conductor"}
        resource scratchdisk "/datastage/Ascential/DataStage/Scratch/node0" {pools ""}
    }
    node "node1"
    {
        fastname "server2"
        pools ""
        resource disk "/datastage/Ascential/DataStage/Datasets/node1" {pools ""}
        resource scratchdisk "/datastage/Ascential/DataStage/Scratch/node1" {pools ""}
    }
    node "node2"
    {
        fastname "server2"
        pools ""
        resource disk "/datastage/Ascential/DataStage/Datasets/node2" {pools ""}
        resource scratchdisk "/datastage/Ascential/DataStage/Scratch/node2" {pools ""}
    }
}
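
To check which node in a configuration file is marked as the conductor, a small script can help. The following is only a rough sketch in Python, not the parser DataStage itself uses. It assumes one statement per line, as in the sample above, and the names node_pools and conductor_nodes are made up for this illustration.

```python
import re

# Shortened copy of the sample configuration file above (two nodes only).
SAMPLE_CONFIG = '''
{
node "node0"
{
fastname "server1"
pools "conductor"
resource disk "/datastage/Ascential/DataStage/Datasets/node0" {pools "conductor"}
resource scratchdisk "/datastage/Ascential/DataStage/Scratch/node0" {pools ""}
}
node "node1"
{
fastname "server2"
pools ""
resource disk "/datastage/Ascential/DataStage/Datasets/node1" {pools ""}
resource scratchdisk "/datastage/Ascential/DataStage/Scratch/node1" {pools ""}
}
}
'''


def node_pools(config_text):
    """Map each node name to its node-level pool names.

    Assumption: every statement sits on its own line, so a line that
    starts with 'pools' is the node-level pools entry (disk-level pools
    appear on lines that start with 'resource' and are ignored here).
    """
    pools = {}
    current = None
    for raw_line in config_text.splitlines():
        line = raw_line.strip()
        match = re.match(r'node\s+"([^"]+)"', line)
        if match:
            current = match.group(1)
            pools[current] = []
        elif line.startswith("pools") and current is not None:
            # Keep the quoted names, dropping the empty default pool "".
            pools[current] = [p for p in re.findall(r'"([^"]*)"', line) if p]
    return pools


def conductor_nodes(config_text):
    """Return the names of nodes whose pools include "conductor"."""
    return [name for name, p in node_pools(config_text).items() if "conductor" in p]


print(conductor_nodes(SAMPLE_CONFIG))  # prints ['node0']
```

For a real configuration file, the text could be read from the path held in the APT_CONFIG_FILE environment variable and passed to conductor_nodes.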

Below are several explanations of the conductor node, collected from different answers:
------
For every job that starts there is one (1) conductor process (started on the conductor node), one (1) section leader for each node in the configuration file and, as a rule of thumb, one (1) player process for each stage in your job on each node. The player count does not always hold, for example when operators are combined into one process or a stage runs sequentially. So if you have a job that uses a two (2) node configuration file and has 3 stages, then your job will have

1 conductor
2 section leaders (2 nodes * 1 section leader per node)
6 player processes (3 stages * 2 nodes)

Your dump score may show that your job will run 9 processes on 2 nodes.
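
The same arithmetic can be written as a small helper. This is only a sketch of the rule of thumb above, written in Python. The function name estimate_processes is made up for this illustration, and the real count in a dump score can be lower when operators are combined or a stage runs sequentially.

```python
def estimate_processes(nodes, stages):
    """Estimate the process count for a parallel job (rule of thumb only)."""
    conductor = 1                 # one conductor per job
    section_leaders = nodes       # one section leader per node
    players = stages * nodes      # one player per stage per node
    return {
        "conductor": conductor,
        "section_leaders": section_leaders,
        "players": players,
        "total": conductor + section_leaders + players,
    }


# The example from the text: a 2-node configuration file and 3 stages.
print(estimate_processes(nodes=2, stages=3)["total"])  # prints 9
```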

This kind of information is very helpful when determining the impact that a particular job or process will have on the underlying operating system and system resources.

-----
Conductor Node :
It is the main process, which:

  1.  Starts up jobs.
  2.  Determines resource assignments.
  3.  Creates the Section Leaders (which create and manage the player processes that perform the actual job execution).
  4.  Acts as the single coordinator for status and error messages.
  5.  Manages orderly shutdown when processing completes or in the event of a fatal error.


-----
Jobs developed with DataStage EE and QualityStage are independent of the actual hardware and degree of parallelism used to run the job. The parallel Configuration File provides a mapping at runtime between the job and the actual runtime infrastructure and resources by defining logical processing nodes.


To facilitate scalability across the boundaries of a single server, and to maintain platform independence, the parallel framework uses a multi-process architecture.

The runtime architecture of the parallel framework is process-based, which enables scalability beyond server boundaries while avoiding platform-dependent threading calls. The actual runtime deployment for a given job design is a hierarchy of operating system processes running on one or more physical servers:


  • Conductor Node (one per job): the main process used to start up jobs, determine resource assignments, and create Section Leader processes on one or more processing nodes. It acts as a single coordinator for status and error messages, and manages orderly shutdown when processing completes or in the event of a fatal error. The conductor node is run from the primary server.
  • Section Leaders (one per logical processing node): used to create and manage player processes which perform the actual job execution. The Section Leaders also manage communication between the individual player processes and the master Conductor Node.
  • Players: one or more logical groups of processes used to execute the data flow logic. All players are created as groups on the same server as their managing Section Leader process.


-----

When the job is initiated the primary process (called the “conductor”) reads the job design, which is a generated Orchestrate shell (osh) script. The conductor also reads the parallel execution configuration file specified by the current setting of the APT_CONFIG_FILE environment variable.

Once the execution nodes are known (from the configuration file), the conductor causes a coordinating process called a “section leader” to be started on each, either by forking a child process if the node is on the same machine as the conductor or by remote shell execution if the node is on a different machine (things are a little more dynamic in a grid configuration, but essentially this is what happens). Each section leader process is passed the score and executes it on its own node, and is visible as a process running osh. Section leaders’ stdout and stderr are redirected to the conductor, which is solely responsible for logging entries from the job.


The score contains a number of Orchestrate operators. Each of these runs in a separate process, called a “player” (the metaphor clearly is one of an orchestra). Player processes’ stdout and stderr are redirected to their parent section leader. Player processes also run the osh executable.

Communication between the conductor, section leaders and player processes in a parallel job is effected via TCP.

Difference between resource disk and resource scratchdisk


The only difference is:
  • Resource scratchdisk is for temporary storage (like RAM in a PC)
Ex : Files created during processing between the source and target by stages such as Sort, Remove Duplicates, Aggregator, etc.
  • Resource disk is for permanent storage (like a hard drive in a PC)
Ex : Data sets, file sets, lookup file sets, etc.

9/13/12

Common Errors and Warnings in DataStage

  • Warning : A sequential operator cannot preserve the partitioning of input data set on input port 0
          Sol : Clear the partitioning (set Preserve partitioning to Clear on the preceding stage)
  • Warning : Agg_stg: When checking operator: When binding input interface field “column1” to field “column1”: Implicit conversion from source type “string[5]” to result type “dfloat”: Converting string to number.
          Sol : Use an explicit data type conversion
  • Warning : oci_oracle_source: When checking operator: When binding output interface field “column1” to field “column1”: Converting a nullable source to a non-nullable result.
          Sol : Use null-handling functions

9/6/12

Initial Load and Delta Load


Difference Between Initial Load and Delta Load :

Initial Load :

  • Occurs once
  • Large amount of data

Delta Load :

If the data service can return only the data modified after a specified date and time, the ETL process loads only the data modified since the last successful load. This is called a delta load.

  • Occurs regularly
  • Adjustments to the initial load
  • Small amount of data
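
The delta load idea can be shown with a few lines of Python. This is only a sketch with made-up data, not a DataStage job. The field name modified_at, the function delta_rows and the sample rows are all assumptions for this illustration.

```python
from datetime import datetime


def delta_rows(rows, last_successful_load):
    """Return only the rows modified after the last successful load."""
    return [row for row in rows if row["modified_at"] > last_successful_load]


# Hypothetical source rows; an initial load would take all of them.
SOURCE_ROWS = [
    {"id": 1, "modified_at": datetime(2012, 9, 1, 8, 0)},
    {"id": 2, "modified_at": datetime(2012, 9, 5, 12, 30)},
    {"id": 3, "modified_at": datetime(2012, 9, 6, 7, 15)},
]

# Hypothetical timestamp of the last successful load.
last_load = datetime(2012, 9, 5, 0, 0)

delta = delta_rows(SOURCE_ROWS, last_load)
print([row["id"] for row in delta])  # prints [2, 3]
```

After each successful run, the timestamp of that run would be stored and used as last_load for the next delta load.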