Extract Transform & Load: August 2011

8/31/11

Car Loan Tips

Run Your Credit Report

>Before embarking on your car-buying journey, request your credit report from the three credit bureaus. You can request your credit report for free once a year by visiting annualcreditreport.com or by calling 1-877-322-8228. Your credit report will give you a glimpse of your creditworthiness and inform you of any possible shortcomings. Knowing all this before stepping into a dealership will guard you against the most aggressive selling tactics and help you walk away when the financing offered is not in your best interest.
Car Loan Warning

!>Be careful to avoid paid credit reporting services. Only annualcreditreport.com is authorized to request a free credit report for you under the law. Paid credit reporting services often carry hidden fees and undisclosed costs.
----
Visit Your Nearest Bank or Credit Union To Get A Quote

>Once you have your credit report handy and have a good idea of what type of car and price range you are looking for, head over to your nearest bank or credit union to see what kind of interest rates they are offering on their car loans. In some cases, particularly if you already know exactly what vehicle you want to purchase, the bank or credit union may pre-approve you, thus letting you know exactly what interest rate and monthly payments you should expect in your car loan.
Car Loan Warning

Be sure to shop around and compare rates. Visit more than one financial institution to get a quote and to find out what interest rates they are offering on their loans. This will give you a better idea of whether you are getting a good deal.

-----
Negotiate for a Better Rate

>Despite the loans offered directly by banks and credit unions, eight out of every 10 car buyers finance their vehicle through a car dealer. Whether it is the convenience offered or simply the marketing tactics deployed, if you find yourself behind closed doors in the finance and insurance department of a car dealer, be ready to negotiate for the lowest interest rate possible without feeling intimidated. Knowing your credit history and the loan rates offered directly by banks and credit unions in your area will definitely give you the upper hand in getting the best car loan possible, but remain wary of any interest rate markups added on by the dealer. While a car dealer may initially originate your loan, it often attempts to sell the loan to a third-party lender for a profit. This profit is made by arbitrarily raising the interest rate of your car loan. If the interest rate offered by the dealer is higher than what you anticipated, just ask for the desired interest rate and renegotiate.
Car Loan Warning

>Try to avoid any add-on products offered by the dealer. Products such as vehicle service contracts, guaranteed auto protection insurance, credit life and disability insurance, and many others are often overpriced and unnecessary. Car dealers often sell these products to raise the cost of their loans and increase their profit margins. If you really need any of those add-on products, try to purchase them outside the dealership at a lower price.
---
Other Things to Consider:

Comparison Shop Online: The internet has made it a lot easier for consumers to compare car prices and loan rates online. Start your research there before you head out to the dealership.

“Yo-Yo” scams: “Yo-yo” scams or “spot deliveries” occur when a car buyer drives away with the vehicle without finalizing the sale. Once the buyer is home, the dealer calls back, claiming that it was unable to fund the loan at the agreed-upon terms. The buyer must then return the car to the dealer and often renegotiate the loan at a higher interest rate than the one agreed upon before.

“Buy Here and Pay Here” Dealers: “Buy Here Pay Here” dealerships typically finance used auto loans in-house to borrowers with no or poor credit. The average APR is usually much higher than a bank or credit union loan. The car loans made by these dealers are often unsustainable and lead to a high rate of repossessions.

Take Your Time: The average consumer spends 45 minutes with the finance and insurance department at the dealer (only 27 minutes if they take a test drive), so take your time to consider your lending options and don’t feel pressured to sign on the dotted line. You have the right to take all the paperwork home before agreeing to the loan.

Don’t Get Caught In The Monthly Payment Trap: Dealers will often attempt to mask the true cost of their loans by focusing on the monthly payments. Be sure to compare the total cost of all the loans offered and to choose the one that is least costly to you in the long run.
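The monthly payment trap can be made concrete with a small calculation. The sketch below assumes a standard fixed-rate, fully amortizing loan with monthly compounding; the two offers (amounts, rates, and terms) are hypothetical numbers chosen only for illustration, not figures from any lender.

```python
def monthly_payment(principal, annual_rate, months):
    """Fixed monthly payment for a fully amortizing loan (assumed loan type)."""
    monthly_rate = annual_rate / 12
    if monthly_rate == 0:
        return principal / months
    return principal * monthly_rate / (1 - (1 + monthly_rate) ** -months)


def total_cost(principal, annual_rate, months):
    """Sum of all payments over the life of the loan."""
    return monthly_payment(principal, annual_rate, months) * months


# Hypothetical offers: (amount financed, annual interest rate, term in months)
shorter_loan = (20000, 0.05, 48)
longer_loan = (20000, 0.07, 72)

for label, offer in (("48 months at 5%", shorter_loan), ("72 months at 7%", longer_loan)):
    print(label,
          "monthly:", round(monthly_payment(*offer), 2),
          "total:", round(total_cost(*offer), 2))
```

With these assumed numbers, the longer loan has the lower monthly payment but the higher total cost, which is why comparing totals matters more than comparing monthly payments.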

8/16/11

Generating output files for multiple columns

When two or more output columns have XPath expressions, XML Output generates a file for each column. You must add a column index flag to the root filename to prevent overwriting. This creates a naming pattern.

Valid flags include:
%% Column position, starting with zero (0)
%@ Column names
You can add these flags before, within, or after the root filename.
Examples: The first output column is CUSTOMERS. The second output column is DIVISIONS.
1. The naming pattern is acme%%.xml. XML Output generates two files, called acme0.xml and acme1.xml.
2. The naming pattern is acme%@.xml. XML Output generates two files, called acmeCUSTOMERS.xml and acmeDIVISIONS.xml.
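The flag expansion above can be imitated in a few lines of Python. This is only an illustration of the naming rule as described in this post, not the XML Output stage's actual implementation; the function name expand_naming_pattern is made up for the example.

```python
def expand_naming_pattern(pattern, column_names):
    """Return one filename per output column.

    %% is replaced by the column position (starting at 0),
    %@ is replaced by the column name.
    """
    filenames = []
    for position, name in enumerate(column_names):
        filename = pattern.replace("%%", str(position)).replace("%@", name)
        filenames.append(filename)
    return filenames


columns = ["CUSTOMERS", "DIVISIONS"]
print(expand_naming_pattern("acme%%.xml", columns))  # ['acme0.xml', 'acme1.xml']
print(expand_naming_pattern("acme%@.xml", columns))  # ['acmeCUSTOMERS.xml', 'acmeDIVISIONS.xml']
```

Without a flag in the pattern, every column would map to the same filename, which is the overwriting problem the flags are meant to prevent.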

8/8/11

APT_CONFIG_FILE : Configuration File

1)APT_CONFIG_FILE is the environment variable that DataStage uses to determine which configuration file to use (a project can have many configuration files). In fact, this is what is generally used in production. However, if this environment variable is not defined, how does DataStage determine which file to use?

1)If the APT_CONFIG_FILE environment variable is not defined, DataStage looks for the default configuration file (config.apt) in the following locations:
    1)The current working directory.
    2)INSTALL_DIR/etc, where INSTALL_DIR ($APT_ORCHHOME) is the top-level directory of the DataStage installation.
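The lookup order can be sketched in Python as follows. This only mirrors the order described in this post; it is not how the parallel engine is actually implemented, and the function name resolve_config_file and its parameters are hypothetical.

```python
import os


def resolve_config_file(environ=None, cwd=None):
    """Return the configuration file path following the order described above.

    1. The value of APT_CONFIG_FILE, if it is set.
    2. config.apt in the current working directory.
    3. config.apt in $APT_ORCHHOME/etc.
    Returns None if nothing is found.
    """
    environ = os.environ if environ is None else environ
    cwd = os.getcwd() if cwd is None else cwd

    explicit = environ.get("APT_CONFIG_FILE")
    if explicit:
        return explicit

    candidates = [os.path.join(cwd, "config.apt")]
    install_dir = environ.get("APT_ORCHHOME")
    if install_dir:
        candidates.append(os.path.join(install_dir, "etc", "config.apt"))

    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


print(resolve_config_file())
```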

2)What are the different options a logical node can have in the configuration file?

1.fastname – The fastname is the physical node name that stages use to open connections for high volume data transfers. The attribute of this option is often the network name. Typically, you can get this name by using Unix command ‘uname -n’.
2.pools – Names of the pools to which the node is assigned. Based on the characteristics of the processing nodes, you can group nodes into sets of pools.
    1.A pool can be associated with many nodes, and a node can be part of many pools.
    2.A node belongs to the default pool unless you explicitly specify a pools list for it and omit the default pool name (“”) from the list.
    3.A parallel job, or a specific stage in the parallel job, can be constrained to run on a pool (a set of processing nodes).
        1.If both the job and a stage within the job are constrained to run on specific processing nodes, the stage runs on the nodes that are common to both the stage and the job.
3.resource – The syntax is resource resource_type “location” [{pools “disk_pool_name”}] or resource resource_type “value”. The resource_type can be canonicalhostname (the quoted Ethernet name of a node in a cluster that is not connected to the conductor node by the high-speed network), disk (a directory to which persistent data is read and written), scratchdisk (the quoted absolute path name of a directory on a file system where intermediate data is temporarily stored; it is local to the processing node), or an RDBMS-specific resource (e.g. DB2, INFORMIX, ORACLE, etc.).
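To see how these options fit together, the following Python sketch renders a node entry as text. The fastname, pools, and resource lines follow the options described above; the surrounding node "name" { ... } layout is an assumption based on how such configuration files are commonly laid out, and all host names and paths are hypothetical.

```python
def render_node(name, fastname, pools, disks, scratchdisks):
    """Render one logical node entry (layout is an assumption, see lead-in)."""
    pool_list = " ".join(f'"{pool}"' for pool in pools)
    lines = [
        f'  node "{name}"',
        "  {",
        f'    fastname "{fastname}"',
        f"    pools {pool_list}",
    ]
    for path in disks:
        lines.append(f'    resource disk "{path}" {{pools ""}}')
    for path in scratchdisks:
        lines.append(f'    resource scratchdisk "{path}" {{pools ""}}')
    lines.append("  }")
    return "\n".join(lines)


def render_config(nodes):
    """Wrap all node entries in the outer braces of the file."""
    return "{\n" + "\n".join(render_node(**node) for node in nodes) + "\n}\n"


# Hypothetical two-node layout on a single host
example_nodes = [
    {"name": "node1", "fastname": "etlhost1", "pools": ["", "sort"],
     "disks": ["/data/ds/node1"], "scratchdisks": ["/scratch/node1"]},
    {"name": "node2", "fastname": "etlhost1", "pools": [""],
     "disks": ["/data/ds/node2"], "scratchdisks": ["/scratch/node2"]},
]
print(render_config(example_nodes))
```

Here node1 belongs to both the default pool ("") and a hypothetical "sort" pool, while node2 belongs only to the default pool.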

3)How does DataStage decide on which processing node a stage should be run?

1.If a job or stage is not constrained to run on specific nodes, the parallel engine executes a parallel stage on all nodes defined in the default node pool. (Default behavior)
2.If the job or stage is constrained, the constrained processing nodes are chosen when executing the parallel stage. (Refer to 2.2.3 for more detail.)
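The selection rules above (default pool when unconstrained, common nodes when both job and stage are constrained) can be modeled with set operations. This is a simplified sketch of the rules as stated in this post, not the engine's real scheduler; the node names, pool names, and function names are hypothetical.

```python
DEFAULT_POOL = ""


def nodes_in_pool(nodes, pool):
    """Names of all nodes whose pools list contains the given pool."""
    return {name for name, pools in nodes.items() if pool in pools}


def nodes_for_stage(nodes, job_pool=None, stage_pool=None):
    """Nodes on which a stage would run under the rules described above."""
    constraints = [pool for pool in (job_pool, stage_pool) if pool is not None]
    if not constraints:
        # No constraint: every node in the default pool.
        return nodes_in_pool(nodes, DEFAULT_POOL)
    selected = set(nodes)
    for pool in constraints:
        # Keep only the nodes common to all constraints.
        selected &= nodes_in_pool(nodes, pool)
    return selected


# Hypothetical layout; the conductor node is deliberately left out of the default pool.
nodes = {
    "conductor": ["conductor"],
    "node1": ["", "sort"],
    "node2": ["", "db"],
    "node3": ["", "sort", "db"],
}

print(sorted(nodes_for_stage(nodes)))                                   # ['node1', 'node2', 'node3']
print(sorted(nodes_for_stage(nodes, stage_pool="sort")))                # ['node1', 'node3']
print(sorted(nodes_for_stage(nodes, job_pool="db", stage_pool="sort")))  # ['node3']
```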

4)When configuring an MPP, you specify the physical nodes in your system on which the parallel engine will run your parallel jobs. This is called the conductor node. For other nodes, you do not need to specify the physical node. Also, you need to copy the (.apt) configuration file only to the nodes from which you start parallel engine applications. It is possible that the conductor node is not connected to the high-speed network switches, while the other nodes are connected to each other using very high-speed network switches. How do you configure your system so that you can achieve optimized parallelism?

1.Make sure that none of the stages are specified to be run on the conductor node.
2.Use the conductor node just to start the execution of the parallel job.
3.Make sure that the conductor node is not part of the default pool.

5)Although parallelization increases the throughput and speed of the process, why is maximum parallelization not necessarily the optimal parallelization?

DataStage creates one process for every stage on each processing node. Hence, if the hardware resources are not available to support maximum parallelization, the performance of the overall system goes down. For example, suppose we have an SMP system with three CPUs and a parallel job with four stages, and we define three logical nodes (one corresponding to each CPU). DataStage will then start 3 * 4 = 12 processes, all of which have to be managed by a single operating system. Significant time will be spent on context switching and process scheduling.
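The arithmetic in this example can be written out for different node counts. The sketch below uses the simplified rule from this post (one process per stage per logical node) and ignores any extra housekeeping processes the engine may start; the function names are made up for illustration.

```python
def process_count(logical_nodes, stages):
    """Processes started under the simplified rule: one per stage per logical node."""
    return logical_nodes * stages


def processes_per_cpu(logical_nodes, stages, cpus):
    """Average number of processes competing for each CPU."""
    return process_count(logical_nodes, stages) / cpus


stages = 4
cpus = 3
for logical_nodes in (1, 2, 3, 6):
    print(logical_nodes, "logical nodes:",
          process_count(logical_nodes, stages), "processes,",
          processes_per_cpu(logical_nodes, stages, cpus), "per CPU")
```

The more processes each CPU has to juggle, the more time goes into context switching rather than useful work, so adding logical nodes beyond what the hardware supports can slow the job down.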

6)Since we can have different logical processing nodes, it is possible that some nodes will be more suitable for some stages while other nodes will be more suitable for other stages. So, how do you decide which node is suitable for which stage?

1.If a stage is performing a memory-intensive task, it should be run on a node that has more disk space available for it. E.g., sorting data is a memory-intensive task and should be run on such nodes.
2.If a stage depends on a licensed version of software (e.g. SAS Stage, RDBMS-related stages, etc.), you need to associate that stage with the processing node that is physically mapped to the machine on which the licensed software is installed. (Assumption: the machine on which the licensed software is installed is connected to the other machines by a high-speed network.)
3.If a job contains stages that exchange large amounts of data, they should be assigned to nodes where the stages communicate by either shared memory (SMP) or a high-speed link (MPP) in the most optimized manner.

7)Nodes are basically a set of machines (especially in MPP systems). You start the execution of parallel jobs from the conductor node. The conductor node creates a shell on the remote machines (depending on the processing nodes) and copies the same environment to them. However, it is possible to create a startup script that selectively changes the environment on a specific node. This script has the default name startup.apt. However, as with the main configuration file, we can have many startup scripts, and the appropriate one can be picked using the environment variable APT_STARTUP_SCRIPT. What is the use of the APT_NO_STARTUP_SCRIPT environment variable?

1.Using the APT_NO_STARTUP_SCRIPT environment variable, you can instruct the parallel engine not to run the startup script on the remote shell.

8)What are the general guidelines one should follow while creating a configuration file so that optimal parallelization can be achieved?

1.Consider avoiding the disk/disks that your input files reside on.
2.Ensure that the different file systems mentioned as the disk and scratchdisk resources hit disjoint sets of spindles even if they’re located on a RAID (Redundant Array of Inexpensive Disks) system.
3.Know what is real and what is NFS:
    1.Real disks are directly attached, or are reachable over a SAN (storage-area network: dedicated, just for storage, low-level protocols).
    2.Never use NFS file systems for scratchdisk resources. Remember that scratchdisks are also used for temporary storage of files and data during processing.
    3.If you use NFS file system space for disk resources, you need to know what you are doing. For example, your final result files may need to be written out onto the NFS disk area, but that doesn’t mean the intermediate data sets created and used temporarily in a multi-job sequence should use this NFS disk area. It is better to set up a “final” disk pool and constrain the result sequential file or data set to reside there, but let intermediate storage go to local or SAN resources, not NFS.
4.Know what data points are striped (RAID) and which are not. Where possible, avoid striping across data points that are already striped at the spindle level.
