6/24/11

The PATH Environment Variable in UNIX

The PATH environment variable has a special format. Let's see what it looks like:

dbettis@rhino[~]$ echo $PATH
/usr/local/bin:/bin:/usr/bin:/sbin:/usr/sbin:.

It's essentially a :-separated list of directories. When you execute a command, the shell searches through each of these directories, one by one, until it finds a directory where the executable exists. Remember that we found ls in /bin, right? /bin is the second item in the PATH variable. So let's remove /bin from PATH. We can do this by using the export command:

dbettis@rhino[~]$ export PATH=/usr/local/bin:/usr/bin:/sbin:/usr/sbin:.

Make sure that the variable is set correctly:

dbettis@rhino[~]$ echo $PATH
/usr/local/bin:/usr/bin:/sbin:/usr/sbin:.

Now, if we try to run ls, the shell no longer knows to look in /bin!

dbettis@rhino[~]$ ls
-bash: ls: command not found

As expected, ls can no longer be found. Let's add /bin back to PATH, as ls is a very useful thing to have.

dbettis@rhino[~]$ export PATH=/usr/local/bin:/bin:/usr/bin:/sbin:/usr/sbin:.

Adding to PATH
There are many times where you'll want to append an item to PATH. First, let's see what the current PATH is:

dbettis@rhino[~]$ echo $PATH
/usr/local/bin:/bin:/usr/bin:/sbin:/usr/sbin:.

The way to add a directory is as follows:

dbettis@rhino[~]$ export PATH=$PATH:/new/path

This command adds /new/path to PATH. Let's see if it got updated:

dbettis@rhino[~]$ echo $PATH
/usr/local/bin:/bin:/usr/bin:/sbin:/usr/sbin:.:/new/path

Making this happen every time you log in

There's a special file in your home directory called .bashrc. In UNIX, the convention is that files beginning with a . are configuration files, and thus they are hidden from view by default. ls will only list files beginning with a . if passed the -a flag, e.g.

dbettis@rhino[~]$ ls -a

At any rate, this file (.bashrc) simply contains a list of commands. Each one of these commands gets executed every time you start a new shell.

dbettis@rhino[~]$ cat .bashrc
export PATH="$PATH:/p/firefox/bin"
..

Every time a shell is started, /p/firefox/bin is added to PATH. If you wish to have certain directories automatically added to PATH, simply place similar export commands at the end of this file. Log out and log back in to see the changes. Alternatively, you can load the contents of that file in the current session:

dbettis@rhino[~]$ . .bashrc

6/23/11

Understanding UNIX permissions and chmod

Introduction
This is a topic that has been beaten to death both in books and online. For some reason, it is one of the most common misunderstandings people face when learning how to write and/or configure their first CGI programs. This tutorial aims to clarify the concepts involved. Note that we will be referring to UNIX in a generic sense in this article; most of what we discuss here applies to all UNIX flavours (such as Linux, SVR4, BSD, etc.). It is also a good idea to type man chmod to check the specific details on your system.
Users
A UNIX system serves many users. Users are an abstraction that denotes a logical entity for the assignment of ownership and operation privileges over the system. A user may correspond to a real-world person, but it may also represent a type of system operation. So, on my system, I have the user 'nick' that corresponds to me, but I also have the user 'www', which corresponds to the privileges necessary to operate the local webserver. UNIX doesn't care what a user means to me; it just knows what belongs to any given user and what each user is allowed to do with any given thing (file, program, device, etc.) on the system. UNIX identifies each user by a User ID (UID), and the username (or login) such as 'nick' or 'www' is just an alias to the UID that makes things more comfortable for humans.

Groups

Users can be organized in groups. A user may belong to one or more groups of users. The concept of groups serves the purpose of assigning sets of privileges for a given resource and sharing them among the many users that need them (perhaps because they are all members of a project team and they all need access to some common project files). So, on my system, user 'nick' and user 'www' both belong to the group 'perlfect'. This way, they can have some shared privileges over the files for this site: user 'nick' needs them to edit the site, and user 'www' needs them to manage the webserver that will be publishing the site.

Ownership

Every file in UNIX has an owner user and an owner group. So, for any file in the system, user 'nick' may have one of the following ownership relations:

nick owns the file, i.e. the file's owner is 'nick'.
nick is a member of the group that owns the file, i.e. the file's owner group is 'perlfect'.
nick is neither the owner of the file nor a member of the group that owns it.

Permissions

Every file on the system has associated with it a set of permissions. Permissions tell UNIX what can be done with that file and by whom. There are three things you can (or can't) do with a given file:

read it,
write (modify) it and
execute it.

Unix permissions specify which of the above operations can be performed for any ownership relation with respect to the file. In simpler terms: what can the owner do, what can the owner group do, and what can everybody else do with the file? For any given ownership relation, we need three bits to specify access permissions: the first to denote read (r) access, the second to denote write (w) access and the third to denote execute (x) access. We have three ownership relations ('owner', 'group' and 'all'), so we need a triplet for each, resulting in nine bits. Each bit can be set or clear (not set). We mark a set bit by its corresponding operation letter (r, w or x) and a clear bit by a dash (-) and put them all in a row. An example might be rwxr-xr-x. What this means is that the owner can do anything with the file, but members of the owner group and the rest of the world can only read or execute it. Usually in UNIX there is also another bit that precedes this 9-bit pattern. You do not need to know about it, at least for the time being.

So if you try ls -l on the command prompt you will get something like the following:

[nick@thekla src]$ ls -l
-rwxr-xr-x 1 nick users  382 Jan 19 11:49 bscoped.pl
drwxr-xr-x 3 nick users 1024 Jan 19 11:19 lib/
-rwxr-xr-x 1 nick users 1874 Jan 19 10:23 socktest.pl

The first column here shows the permission bit pattern for each file. The third column shows the owner, and the fourth column shows the owner group. By now, the information provided by ls -l should be enough for you to figure out what each user of the system can do with any of the files in the directory.

Directories

Another interesting thing to note is that lib/, which is a directory, has permissions too. Permissions take on a different meaning for directories. Here's what they mean:

1) read determines if a user can view the directory's contents, i.e. do ls in it.
2) write determines if a user can create new files or delete files in the directory. (Note that this essentially means that a user with write access to a directory can delete files in the directory even if he/she doesn't have write permissions for the file! So be careful with this.)
3) execute determines if the user can cd into the directory.

chmod


To set or modify a file's permissions you need to use the chmod program. Of course, only the owner of a file (or root) may use chmod to alter a file's permissions. chmod has the following syntax:

chmod [options] mode file(s)

The 'mode' part specifies the new permissions for the file(s) that follow as arguments. A mode specifies which users' permissions should be changed and, after that, which access types should be changed. For example: chmod a-x socktest.pl. This means that the execute bit should be cleared (-) for all users (owner, group and the rest of the world). The permissions start with a letter specifying which users should be affected by the change; this might be any of the following:

u the owner user
g the owner group
o others (neither u, nor g)
a all users


This is followed by a change instruction, which consists of a + (set bit) or - (clear bit) and the letter corresponding to the bit that should be changed.

Let's see some examples:
$ ls -l socktest.pl
-rwxr-xr-x 1 nick users 1874 Jan 19 10:23 socktest.pl*
$ chmod a-x socktest.pl
$ ls -l socktest.pl
-rw-r--r-- 1 nick users 1874 Jan 19 10:23 socktest.pl
$ chmod g+w socktest.pl
$ ls -l socktest.pl
-rw-rw-r-- 1 nick users 1874 Jan 19 10:23 socktest.pl
$ chmod ug+x socktest.pl
$ ls -l socktest.pl
-rwxrwxr-- 1 nick users 1874 Jan 19 10:23 socktest.pl*
$ chmod ug-wx socktest.pl
$ ls -l socktest.pl
-r--r--r-- 1 nick users 1874 Jan 19 10:23 socktest.pl

Strange numbers...

You might have encountered things like chmod 755 somefile, and of course you will be wondering what this is. The thing is that you can change the entire permission pattern of a file in one go using a single number like the one in this example. Every mode has a corresponding code number, and as we shall see there is a very simple way to figure out what number corresponds to any mode.

Every one of the three digits in the mode number corresponds to one of the three permission triplets (u, g and o). Every permission bit in a triplet corresponds to a value: 4 for r, 2 for w, 1 for x. If the permission bit is set, you add this value to the number for the permission triplet; if it is clear, you add nothing. (Some of you might notice that, in fact, the number for a triplet is the octal value corresponding to the three-bit pattern; if you don't know what an octal value is, it doesn't really matter, just follow the instructions.) So if a file has rwxr-xr-x permissions we do the following calculation:

Triplet for u: rwx => 4 + 2 + 1 = 7
Triplet for g: r-x => 4 + 0 + 1 = 5
Triplet for o: r-x => 4 + 0 + 1 = 5
Which makes : 755

So, 755 is a terse way to say 'I don't mind if other people read or run this file, but only I should be able to modify it', and 777 means 'everyone has full access to this file'.

Databases and partitioning types

ORACLE:

Basics of Partitioning
Partitioning allows a table, index, or index-organized table to be subdivided into smaller pieces. Each piece of the database object is called a partition. Each partition has its own name, and may optionally have its own storage characteristics. From the perspective of a database administrator, a partitioned object has multiple pieces that can be managed either collectively or individually. This gives the administrator considerable flexibility in managing partitioned objects. However, from the perspective of the application, a partitioned table is identical to a non-partitioned table; no modifications are necessary when accessing a partitioned table using SQL DML commands.

Figure 1: Application and DBA perspective of a partitioned table

Database objects - tables, indexes, and index-organized tables - are partitioned using a 'partitioning key', a set of columns which determines in which partition a given row will reside. For example, the sales table shown in Figure 1 is range-partitioned on sales date, using a monthly partitioning strategy; the table appears to any application as a single, 'normal' table. However, the DBA can manage and store each monthly partition individually, potentially using different storage tiers, applying table compression to the older data, or storing complete ranges of older data in read-only tablespaces.
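
To make this concrete, here is a minimal sketch of a monthly range-partitioned sales table in Oracle SQL (table and column names are illustrative, not taken from the paper):

-- Rows are routed to a partition by sale_date; each partition covers one month
CREATE TABLE sales
( sale_id   NUMBER,
  sale_date DATE,
  amount    NUMBER )
PARTITION BY RANGE (sale_date)
( PARTITION sales_jan_2007 VALUES LESS THAN (TO_DATE('01-FEB-2007','DD-MON-YYYY')),
  PARTITION sales_feb_2007 VALUES LESS THAN (TO_DATE('01-MAR-2007','DD-MON-YYYY')),
  PARTITION sales_mar_2007 VALUES LESS THAN (TO_DATE('01-APR-2007','DD-MON-YYYY')) );

Applications simply query the sales table as a whole; the DBA can still move, compress or archive individual monthly partitions.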


Basic Partitioning Strategies
Oracle Partitioning offers three fundamental data distribution methods that control how the data is actually going to be placed into the various individual partitions, namely:

Range: The data is distributed based on a range of values of the partitioning key (for a date column as the partitioning key, the 'January-2007' partition contains rows with partitioning-key values between '01-JAN-2007' and '31-JAN-2007'). The data distribution is a continuum without any holes, and the lower boundary of a range is automatically defined by the upper boundary of the preceding range.

List: The data distribution is defined by a list of values of the partitioning key (for a region column as the partitioning key, the 'North America' partition may contain the values 'Canada', 'USA', and 'Mexico'). A special 'DEFAULT' partition can be defined to catch all values for a partitioning key that are not explicitly covered by any of the lists.

Hash: A hash algorithm is applied to the partitioning key to determine the partition for a given row. Unlike the other two data distribution methods, hash does not provide any logical mapping between the data and any partition. (Sketches of list and hash partitioning follow below.)
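
Hedged sketches of the other two methods, again with illustrative names:

-- List partitioning: rows are mapped by explicit partitioning-key values
CREATE TABLE customers
( cust_id NUMBER,
  region  VARCHAR2(30) )
PARTITION BY LIST (region)
( PARTITION north_america VALUES ('Canada', 'USA', 'Mexico'),
  PARTITION europe        VALUES ('UK', 'Germany', 'France'),
  PARTITION other_regions VALUES (DEFAULT) );

-- Hash partitioning: rows are spread across four partitions by a hash of cust_id
CREATE TABLE orders
( order_id NUMBER,
  cust_id  NUMBER )
PARTITION BY HASH (cust_id)
PARTITIONS 4;
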
Using the above-mentioned data distribution methods, a table can be partitioned either as a single-level or a composite partitioned table:

Single (one-level) Partitioning: A table is defined by specifying one of the data distribution methodologies, using one or more columns as the partitioning key. For example, consider a table with a number column as the partitioning key and two partitions 'less_than_five_hundred' and 'less_than_thousand'; the 'less_than_thousand' partition contains rows where the following condition is true: 500 <= partitioning key < 1000. You can specify Range, List, and Hash partitioned tables.

Composite Partitioning: A combination of two data distribution methods is used to define a composite partitioned table. First, the table is partitioned by data distribution method one, and then each partition is further subdivided into subpartitions using a second data distribution method. All subpartitions for a given partition together represent a logical subset of the data. For example, a range-hash composite partitioned table is first range-partitioned, and then each individual range partition is further subpartitioned using the hash partitioning technique. Available composite partitioning techniques are range-hash, range-list, range-range, list-range, list-list, and list-hash (a range-hash sketch follows below).

Index-organized tables (IOTs) can be partitioned using range, hash, and list partitioning. Composite partitioning is not supported for IOTs.
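
For an ordinary (non-index-organized) table, a range-hash composite table might look like the following minimal sketch, with each monthly range partition subdivided into four hash subpartitions (names are illustrative):

CREATE TABLE sales_composite
( sale_id   NUMBER,
  cust_id   NUMBER,
  sale_date DATE )
PARTITION BY RANGE (sale_date)
SUBPARTITION BY HASH (cust_id) SUBPARTITIONS 4
( PARTITION sales_jan_2007 VALUES LESS THAN (TO_DATE('01-FEB-2007','DD-MON-YYYY')),
  PARTITION sales_feb_2007 VALUES LESS THAN (TO_DATE('01-MAR-2007','DD-MON-YYYY')) );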

6/22/11

Difference between SMP and MPP

SMP (symmetric multiprocessing): some hardware resources may be shared among processors. The processors communicate via shared memory and run a single operating system.
Or
Symmetric Multi-Processing. In a symmetric multiprocessing environment, the CPUs share the same memory, and as a result code running on one CPU can affect the memory used by another.


Cluster or MPP (massively parallel processing), also known as shared-nothing: each processor has exclusive access to its hardware resources. MPP systems are physically housed in the same box, whereas cluster systems can be physically dispersed. The processors each have their own operating system and communicate via a high-speed network.
Or
MPP - Massively Parallel Processing. A computer system with many independent arithmetic units or entire microprocessors that run in parallel.

Or
SMP supports limited parallelism (i.e., 64 processors), whereas MPP can support N number of nodes or processors (high performance).
SMP processing is sequential, whereas MPP processing can be parallel.

6/19/11

Difference between Dataset, Fileset and Sequential File

Dataset:
0) The data set is the internal data format of the Orchestrate framework, so any other data being processed as a source in a parallel job is converted into the data set format first (handled by the "import" operator), and data being written to any other target is converted from the data set format last (handled by the "export" operator). Hence, data sets usually give the highest performance.
1) It stores data in binary in the internal format of DataStage, so it takes less time to read from or write to a data set than any other source/target.
2) It preserves the partitioning scheme, so you don't have to partition the data again.
3) You cannot view the data without DataStage.

Fileset:
0) The .ds file and the .fs file are the descriptor files of a data set and a file set respectively. The .fs file is stored in ASCII format, so you can open it directly to see the paths of the data files and their schema. A .ds file, however, cannot be opened directly; instead you have to use Data Set Management, the utility in the client tools (such as Designer and Manager), or the ORCHADMIN command line tool.
1) It stores data in a format similar to a sequential file.
2) The only advantage of using a file set over a sequential file is that it preserves the partitioning scheme.
3) You can view the data, but only in the order defined by the partitioning scheme.

parallel lookup types

Parallel DataStage jobs can have many sources of reference data for lookups including database tables, sequential files or native datasets. Which is the most efficient?

This question has popped up several times over on the DSExchange. In DataStage server jobs the answer is quite simple: local hash files are the fastest method of a key-based lookup, as long as the time taken to build the hash file does not wipe out the benefits of using it.
In a parallel job there are a very large number of stages that can be used for a lookup, a much wider variety than in server jobs; this includes most data sources and the parallel staging formats of datasets and lookup filesets. I have discounted database lookups, as the overhead of database connectivity and any network passage makes them slower than most local storage.

Someone did a test comparing datasets to sequential files to lookup filesets, increasing row volumes to see how they responded. The test had three jobs, each with a sequential file input stage and a reference stage writing to a copy stage.

Small lookups
I set the input and lookup volumes to 1000 rows. All three jobs processed in 17 or 18 seconds. No lookup tables were created apart from the existing lookup fileset one. This indicates the lookup data fit into memory and did not overflow to a resource file.

1 Million Row Test
The lookup dataset took 35 seconds, the lookup fileset took 18 seconds and the lookup sequential file took 35 seconds even though it had to partition the data. I assume this is because the input also had to be partitioned and this was the bottleneck in the job.

2 million rows
Starting to see some big differences now. Lookup fileset down at 45 seconds is only three times the length of the 1000 row test. Dataset is up to 1:17 and sequential file up to 1:32. The cost of partitioning the lookup data is really showing now.

3 million rows
The fileset, still at 45 seconds, swallowed up the extra 1 million rows with ease. The dataset was up to 2:06 and the sequential file up to 2:20.

As a final test I replaced the lookup stage with a join stage and tested the dataset and sequential file reference links. The dataset join finished in 1:02 and the sequential file join finished in 1:15. A large join proved faster than a large lookup, but not as fast as a lookup fileset.

Conclusion
If your lookup size is low enough to fit into memory then the source is irrelevant; they all load up very quickly, and even database lookups are fast. If you have very large lookup files spilling into lookup table resources then the lookup fileset outstrips the other options, and a join becomes a viable option. Joins are a bit harder to design, as you can only join one source at a time whereas a lookup can join multiple sources.

I usually go with lookups for code to description or code to key type lookups regardless of the size, I reserve the joins for references that bring back lots of columns. I will certainly be making more use of the lookup fileset to get more performance from jobs.

Optimize your DataStage Job Performance with relevant Environment Variables

DataStage has many parameters that can be tweaked to optimize the performance of DataStage jobs. Many more are available to collect additional information and traces in the event of a crash.
If you run into problems with a DataStage job or want more details, check the following variables.

$APT_CONFIG_FILE: This allows you to define the configuration file based on your requirements. You can keep many configuration files with different n-node combinations and assign one dynamically to a job based on criteria or time.

$APT_SCORE_DUMP: It creates a job run report that shows the partitioning used, degree of parallelism, data buffering and inserted operators. It is useful for finding out what your high-volume job is doing.

$APT_PM_PLAYER_TIMING: This option lets you see what each operator in a job is doing, especially how much data they are handling and how much CPU they are consuming. It helps in identifying various bottlenecks.

One way to speed up very high volume jobs is to pre-sort the data and make sure it is not resorted in the DataStage job. This is done by turning off auto sorting in high volume jobs:
APT_NO_SORT_INSERTION: stops the job from automatically adding a sort to the start of a job that has stages that need sorted data, such as Remove Duplicates. You can also add a Sort stage to the job and set it to "Previously Sorted" to avoid this in a specific job path.


Error Handling in Datastage Job Design

There are multiple ways to handle errors in data or in DataStage jobs.

We can use the Reject link option via the Transformer stage and also the Reject link option from within Connector stages. If we face issues in job sequences, we can use the "Exception Handler" activity. Here is how we can use this activity from within a DataStage job sequence.

You can check the checkbox named "Automatically handle activities that fail" in the properties of the master sequence. If you want a checkpoint, also check "Restart job from the failure point".

In a DataStage job sequence, use an Exception Handler activity. After the Exception Handler activity you can include an email Notification activity (the same goes for SMS). On job failure, control passes to the Exception Handler activity and an email/SMS will be sent notifying the user that the sequence has failed. It also provides information on the failure code, depending on what we select as part of the job design.
More on "How to use Notification Activity" in next one.


Sequence Job Errors:
1) Controller problem: Loop start > loop end - cannot execute it
Sol: This is because there is no value being passed from the routine activity. If re-running the routine activity every time you restart your sequence will not affect anything, then check the box "Do not check point" so that the routine activity will fire every time, giving your start loop a value to start with. If that's not what you want, you will have to rethink your design.

2) Change_Capture_cc: When checking operator: User inserted sort " Change_Capture_cc .CVG_CLMRTG_INDEX_fnl_out_Sort" does not fulfill the sort requirements of the downstream operator " Change_Capture_cc "

Sol :  "Need to hash partition the data using the sort keys as hash keys in order to guarantee that duplicate rows are in the same partition"





Server Job Errors :

Parallel Job Errors :

6/17/11

Sort stage to remove duplicate

1) What is the advantage of using the Sort stage over the Remove Duplicates stage when removing duplicates?
2) Is there any way to specify which record to retain (like retaining the last record or the first) when we remove duplicates using the Transformer stage, and similarly in the Sort stage?


Ans :
1) The advantage of using the Sort stage over the Remove Duplicates stage is that the Sort stage allows us to capture the duplicate records, whereas the Remove Duplicates stage does not.
2) Using a Sort stage we can only retain the first record.
Normally we retain the last record by sorting a particular field in ascending order and taking the last record. The same can be achieved with the Sort stage by sorting in descending order and retaining the first record.
********
Yes indeed...
Use a Sort stage with the option "Create Key Change Column" set to True.
This will create a column KeyChange at the output of the sort, as below.

Input
-----
col_a
------
1
1
1
2
2
3
4
5
6
6

output
-------
col_a Keychange
------------------
1 1
1 0
1 0
2 1
2 0
3 1
4 1
5 1
6 1
6 0

For the first occurrence of a particular value it assigns 1, and for subsequent occurrences of the same value it assigns 0.
Now you can use a transformer with constraint Keychange=0 to capture duplicate records.

*********
Can you tell me what the Create Cluster Key Change Column option in the Sort stage is? If my requirement is just to remove duplicates and I don't want to capture the duplicate records, is it fine to remove duplicates using the input tab sort with the Unique option of any stage?
******

The options specified above for retaining duplicate values using the Sort stage will obviously work, but as far as I remember there is an option in the properties of the Sort stage named ALLOW DUPLICATES. If this property is set to true, I think it will satisfy the requirement (provided that you are using DataStage 8.x).



If the property "Create Cluster Key Change Column" is set to true, it will create a cluster key change column in the output. Consider the following example:

i/p_col cluster_key_change_col
1 1
1 0
1 0
2 1
2 0
3 1
1 1
2 1
6 1
6 0

The logic is that every input value is compared with the value immediately above it. If there is a match the output is 0, else 1; it doesn't matter whether the value appeared earlier or not.

6/15/11

Teradata Database Hardware and Software Architecture

Teradata Permanent and Temporary Tables

To manipulate tabular data, you must submit a query in a language that the database
understands. In the case of the Teradata Database, the language is SQL. You can store the results of multiple SQL queries in tables. Permanent storage of tables is necessary when different sessions and users must share table contents.

When tables are required for only a single session, you can request that the system creates temporary tables. Using this type of table, you can save query results for use in subsequent queries within the same session. Also, you can break down complex queries into smaller queries by storing results in a temporary table for use during the same session. When the session ends, the system automatically drops the temporary table.

Global Temporary Tables:
Global temporary tables are tables that exist only for the duration of the SQL session in which they are used. The contents of these tables are private to the session, and the system automatically drops the table at the end of that session. However, the system saves the global temporary table definition permanently in the Data Dictionary. The saved definition may be shared by multiple users and sessions with each session getting its own instance of the table.
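
A minimal sketch of a global temporary table definition in Teradata SQL (names are illustrative); ON COMMIT PRESERVE ROWS keeps a session's rows until the session ends instead of discarding them after each transaction:

CREATE GLOBAL TEMPORARY TABLE gt_daily_sales
( store_id   INTEGER,
  sales_date DATE,
  amount     DECIMAL(12,2) )
ON COMMIT PRESERVE ROWS;

The definition above is stored permanently in the Data Dictionary, but each session that references gt_daily_sales gets its own private, automatically dropped instance.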

Volatile Tables:
If you need a temporary table for a single use only, you can define a volatile table. The definition of a volatile table resides in memory but does not survive across a system restart. Using volatile tables improves performance even more than using global temporary tables because the system does not store the definitions of volatile tables in the Data Dictionary. Access-rights checking is not necessary because only the creator can access the volatile table.
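
A comparable sketch of a volatile table (illustrative names); the definition lives only in memory for the creating session and nothing is recorded in the Data Dictionary:

CREATE VOLATILE TABLE vt_daily_sales
( store_id   INTEGER,
  sales_date DATE,
  amount     DECIMAL(12,2) )
ON COMMIT PRESERVE ROWS;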

Derived Tables:

A special type of temporary table is the derived table. You can specify a derived table in an SQL SELECT statement. A derived table is obtained from one or more other tables as the result of a subquery. The scope of a derived table is only visible to the level of the SELECT statement calling the subquery.
Using derived tables avoids having to use the CREATE and DROP TABLE statements for
storing retrieved information and assists in coding more sophisticated, complex queries.
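
A hedged sketch of a derived table, i.e. a subquery in the FROM clause whose result is visible only to the enclosing SELECT (names are illustrative):

SELECT t.region, t.total_amount
FROM ( SELECT region, SUM(amount) AS total_amount
       FROM sales
       GROUP BY region ) AS t   -- t is the derived table
WHERE t.total_amount > 100000;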

Database Normalization and Referential Integrity

Normalization is the process of reducing a complex data structure into a simple, stable one.
Generally this process involves removing redundant attributes, keys, and relationships from
the conceptual data model.
First Normal Form
First normal form (1NF) is definitive of a relational database. If we are to consider a database
relational, then all relations in the database must be in 1NF.
We say a relation is in 1NF if all fields within that relation are atomic. We sometimes refer to
this concept as the elimination of repeating groups from a relation. Furthermore, first normal
form allows no hierarchies of data values.

Second Normal Form

Second normal form (2NF) deals with the elimination of circular dependencies from a
relation. We say a relation is in 2NF if it is in 1NF and if every non-key attribute is fully
dependent on the entire Primary Key.
A non-key attribute is any attribute that is not part of the Primary Key for the relation.

Third Normal Form:

Third normal form (3NF) deals with the elimination of non-key attributes that do not
describe the Primary Key.
For a relation to be in 3NF, the relationship between any two non-Primary Key columns, or
groups of columns, in a relation must not be one-to-one in either direction.
We say attributes are mutually independent if none of them is functionally dependent on any
combination of the others. This mutual independence ensures that we can update individual
attributes without any danger of affecting any other attribute in a row.
The following list of benefits summarizes the advantages of implementing a normalized
logical model in 3NF.
• Greater number of relations
• More PI choices
• Optimal distribution of data
• Fewer full table scans
• More joins possible
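
As an illustration of the 3NF idea (a sketch with assumed, illustrative tables, not an example from the Teradata manual): customer_city describes the customer, not the order, so keeping it in the order relation creates a dependency between two non-key columns; moving it out restores 3NF.

-- Violates 3NF: customer_city depends on customer_id, not on the order key
CREATE TABLE order_flat
( order_id      INTEGER,
  customer_id   INTEGER,
  customer_city VARCHAR(30),
  order_total   DECIMAL(12,2) )
UNIQUE PRIMARY INDEX (order_id);

-- Normalized: the customer attribute lives with the customer key
CREATE TABLE customer
( customer_id   INTEGER,
  customer_city VARCHAR(30) )
UNIQUE PRIMARY INDEX (customer_id);

CREATE TABLE order_norm
( order_id    INTEGER,
  customer_id INTEGER,
  order_total DECIMAL(12,2) )
UNIQUE PRIMARY INDEX (order_id);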

Referential Integrity:
Traditional referential integrity is the concept of relationships between tables, based on the
definition of a primary key and a foreign key. The concept states that a row cannot exist in a
table with a non-null value for a referencing column if an equal value does not exist in a
referenced column.
Using referential integrity, you can specify columns within a referencing table that are foreign
keys for columns in some other referenced table. You must define referenced columns as either
primary key columns or unique columns.
Referential integrity is a reliable mechanism that prevents accidental database inconsistencies
when you perform INSERTS, UPDATES, and DELETES.
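
A hedged sketch of declaring referential integrity (reusing the illustrative customer table from the sketch above): every non-null customer_id inserted into the order table must already exist in customer.customer_id, which must be a primary key or unique column:

CREATE TABLE order_ri
( order_id    INTEGER,
  customer_id INTEGER,
  order_total DECIMAL(12,2),
  FOREIGN KEY (customer_id) REFERENCES customer (customer_id) )
UNIQUE PRIMARY INDEX (order_id);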

Teradata Data Distribution and Data Access Methods

An index is a physical mechanism used to store and access the rows of a table. Indexes on tables in a relational database function much like indexes in books: they speed up information retrieval.
In general, the Teradata Database uses indexes to:
• Distribute data rows.
• Locate data rows.
• Improve performance.
Indexed access is usually more efficient than searching all rows of a table.
• Ensure uniqueness of the index values.
Only one row of a table can have a particular value in the column or columns defined as a
Unique Index.
The Teradata Database supports the following types of indexes:
• Primary
• Secondary
• Join
• Hash
• Special indexes for referential integrity
These indexes are discussed in the following sections.

Primary Index : The Teradata Database requires one Primary Index (PI) for each table in the database, except for some data dictionary tables and global temporary tables.

Distributes rows.
Defines most common access path.
May be unique or non-unique.
May be null.
Physical access mechanism.
Required by Teradata Database for most tables.
64-column limit.
Values can be changed.

Primary Indexes and Data Distribution:
Unique Primary Indexes (UPIs) guarantee uniform distribution of table rows.
Non-Unique Primary Indexes (NUPIs) can cause skewed data. While not a guarantor of
uniform row distribution, the degree of uniqueness of the index will determine the degree of
uniformity of the distribution. Because all rows with the same PI value end up on the same
AMP, columns with a small number of distinct values that are repeated frequently do not
make good PI candidates.
The most efficient access method to data in a table is through the PI. For this reason, choosing
a PI should take the following design goal into consideration: choosing a PI that gives good
distribution of data across the AMPs must be balanced against choosing a PI that reflects the
most common usage pattern of the table.
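
To make the distribution discussion concrete, a hedged sketch (illustrative names): emp_id is a good UPI candidate because every value is distinct, so rows hash evenly across the AMPs, while a column with only a handful of distinct values repeated frequently would make a skewed NUPI:

-- UPI: unique values give uniform row distribution
CREATE TABLE employee
( emp_id    INTEGER,
  last_name VARCHAR(30),
  email     VARCHAR(50),
  dept_id   INTEGER )
UNIQUE PRIMARY INDEX (emp_id);

-- NUPI: all rows with the same dept_id land on the same AMP,
-- so distribution is only as uniform as dept_id values are distinct
CREATE TABLE employee_by_dept
( emp_id    INTEGER,
  last_name VARCHAR(30),
  dept_id   INTEGER )
PRIMARY INDEX (dept_id);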