Parallel DataStage jobs can have many sources of reference data for lookups, including database tables, sequential files and native datasets. Which is the most efficient?
This question has popped up several times over on the DSExchange. In DataStage server jobs the answer is quite simple: local hash files are the fastest method of key-based lookup, as long as the time taken to build the hash file does not wipe out the benefit of using it.
In a parallel job a much wider variety of stages can be used as a lookup source than in server jobs, taking in most data sources plus the parallel staging formats of datasets and lookup filesets. I have discounted database lookups, as the overhead of the database connectivity and any network traffic makes them slower than most local storage.
I put together a test comparing datasets, sequential files and lookup filesets, increasing the row volumes to see how each responded. The test had three jobs, each with a sequential file input stage and a reference stage writing to a copy stage.
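DataStage jobs are assembled in the Designer rather than written as code, so a minimal sketch of the test shape has to live outside the tool. Here is a rough Python analogue of one test job, using the same volumes as the runs below; the function and data names are mine, not DataStage's:

```python
import random
import time

def run_lookup_test(n_rows):
    """Rough analogue of one test job: stream input rows past a keyed
    reference table and count the matches."""
    reference = {i: "desc_%d" % i for i in range(n_rows)}           # reference link
    input_rows = [random.randrange(n_rows) for _ in range(n_rows)]  # input link

    start = time.perf_counter()
    matched = sum(1 for key in input_rows if key in reference)     # the lookup
    print("%d rows: %d matched in %.2fs" % (n_rows, matched, time.perf_counter() - start))

for volume in (1000, 1000000, 2000000, 3000000):
    run_lookup_test(volume)
```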
Small lookups
I set the input and lookup volumes to 1000 rows. All three jobs processed in 17 or 18 seconds. No lookup tables were created apart from the one that already existed for the lookup fileset. This indicates the lookup data fit into memory and did not overflow to a resource file.
1 million rows
The lookup dataset took 35 seconds, the lookup fileset 18 seconds and the lookup sequential file also 35 seconds, even though the sequential file had the extra work of partitioning the data. I assume this is because the input data also had to be partitioned, and that was the bottleneck in the job.
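The partitioning cost is easier to see if you sketch what the engine has to do: every row on both links gets hashed on the lookup key and routed to a node, so matching input and reference rows land on the same partition. A simplified illustration of the idea (the real partitioner is native engine code; this shows only the concept):

```python
from collections import defaultdict

def hash_partition(rows, key, n_nodes):
    """Route each row to a node by hashing its key value, so rows that
    share a key always land on the same partition (illustrative only)."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[hash(row[key]) % n_nodes].append(row)
    return partitions

input_rows = [{"code": i} for i in range(8)]
for node, rows in sorted(hash_partition(input_rows, "code", 4).items()):
    print("node", node, rows)
```

Both links get the same treatment, which is why pre-partitioned reference data does not buy much when the input still has to be repartitioned.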
2 million rows
Starting to see some big differences now. The lookup fileset, down at 45 seconds, is still only about three times the length of the 1000 row test. The dataset is up to 1:17 and the sequential file up to 1:32. The cost of partitioning the lookup data is really showing now.
3 million rows
The fileset, still at 45 seconds, swallowed up the extra million rows with ease. The dataset is up to 2:06 and the sequential file up to 2:20.
As a final test I replaced the lookup stage with a join stage and tested the dataset and sequential file reference links. The dataset join finished in 1:02 and the sequential file join in 1:15. A large join proved faster than a large lookup, but not as fast as a lookup fileset.
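The difference comes down to where the work happens. A lookup holds the whole reference table in memory, spilling to resource files when it will not fit, and probes it once per input row; a join sorts both inputs and merges them in a single pass, so memory stays flat and the cost moves into the sorts. A simplified sketch of the two strategies, assuming unique keys on the reference side:

```python
def hash_lookup(stream, reference):
    """Lookup stage analogue: load the reference into memory and probe
    it per row. Memory use grows with the size of the reference."""
    ref = dict(reference)
    return [(key, val, ref[key]) for key, val in stream if key in ref]

def merge_join(stream, reference):
    """Join stage analogue: sort both sides, then merge in one pass.
    Memory stays flat; the cost sits in the two sorts."""
    left, right = sorted(stream), sorted(reference)
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][0] == right[j][0]:
            out.append((left[i][0], left[i][1], right[j][1]))
            i += 1                      # same reference row may match again
        elif left[i][0] < right[j][0]:
            i += 1
        else:
            j += 1
    return out

stream = [(2, "a"), (1, "b"), (2, "c")]
reference = [(1, "one"), (2, "two"), (3, "three")]
print(hash_lookup(stream, reference))   # probe order follows the input
print(merge_join(stream, reference))    # output comes back key-sorted
```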
Conclusion
If your lookup is small enough to fit into memory then the source is irrelevant: they all load up very quickly, and even database lookups are fast. If you have very large lookup volumes spilling into lookup table resource files then the lookup fileset outstrips the other options, and a join becomes a viable alternative. Joins are a bit harder to design, as you can only join one source at a time, whereas a single lookup stage can reference multiple sources.
I usually go with lookups for code-to-description or code-to-key references regardless of size, and reserve joins for references that bring back a lot of columns. I will certainly be making more use of the lookup fileset to get more performance out of jobs.