Parallelisation

The parallelisation options can mess up a job when different partitioning mechanisms are used for the sources that feed a join or lookup. For example, if one source stream uses round robin and the other uses hashing over the join or lookup key, or if the data is partitioned over a column other than the join or lookup key, chances are high that rows with the same key value end up in different parallel partitions and therefore never meet in the join or lookup.
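A minimal Python sketch, purely as an illustration of the partitioning arithmetic (this is not DataStage code; the rows and the key column cust_id are made up): hashing on the key sends equal key values to the same node, while round robin distributes by arrival order and can scatter matching rows across nodes.

```python
left_rows = [{"cust_id": 7, "amount": 10}, {"cust_id": 8, "amount": 20}]
right_rows = [{"cust_id": 7, "name": "A"}, {"cust_id": 8, "name": "B"}]
nodes = 4  # degree of parallelism

def hash_partition(row, key, nodes):
    # Rows with equal key values always land on the same node.
    return hash(row[key]) % nodes

def round_robin_partition(index, nodes):
    # The node depends only on arrival order, not on the key value.
    return index % nodes

# Hashing both inputs on the join key: matching rows meet on the same node.
for row in left_rows + right_rows:
    print("hash   -> node", hash_partition(row, "cust_id", nodes), row)

# Round robin on one of the inputs: arrival order decides the node, so the
# cust_id = 7 rows of the two inputs can easily end up on different nodes
# and the join finds no partner on either of them.
for i, row in enumerate(left_rows):
    print("rrobin -> node", round_robin_partition(i, nodes), row)
```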

Copy stage

The copy stage is virtually a virtual stage (could not resist the pun 😉), i.e. at compile time it gets eliminated and its functionality is integrated into the preceding stage. There are actually two purposes to the copy stage. The first is forking the data stream into several streams; this could also be achieved with multiple output links on other stages, but with a transformer, for example, one would possibly need to duplicate code to get the same effect on several outputs. The second is purely visual: it changes the optical character of the job on the canvas.
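As a rough analogy in plain Python (function names are hypothetical, nothing DataStage-specific), forking means doing the real work once and feeding the same stream to several outputs instead of repeating the logic per output link:

```python
def transform(rows):
    # Stand-in for transformer logic that should not be duplicated.
    return [{**r, "amount_eur": r["amount"] * 0.9} for r in rows]

def write_to_warehouse(rows):
    print("warehouse:", rows)

def write_to_audit(rows):
    print("audit:", rows)

rows = [{"id": 1, "amount": 100}, {"id": 2, "amount": 250}]
transformed = transform(rows)              # the real work happens once
for sink in (write_to_warehouse, write_to_audit):
    sink(transformed)                      # the "copy": same stream, several outputs
```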

Parameter

Parameters get defined in the properties of a job or container. You can define parameters of type parameter set as well as primitive data types like string. You can also override environment variables. I am not sure whether prepending $ to the parameter name is enough to make it an environment variable replacement or whether a specific type is also needed; hopefully I can update this post to clarify the matter. Parameter sets can get their values from a parameter file. It is probably not possible to cascade parameter files, but it is possible to define more than one parameter set, which makes it possible to define sets of different scope.
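To make the idea concrete, here is a hedged Python sketch of how a parameter set with a value file and environment-variable overrides could be resolved. The file format, the parameter names and the $ convention are my assumptions for illustration, not DataStage internals.

```python
import os

def parse_value_file(text):
    # One NAME=VALUE entry per line; '#' starts a comment.
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            name, _, value = line.partition("=")
            values[name.strip()] = value.strip()
    return values

def resolve(params, environment=os.environ):
    resolved = {}
    for name, default in params.items():
        if name.startswith("$"):
            # Assumed "$" convention: an environment variable of the same
            # name (without the "$") may replace the default value.
            resolved[name] = environment.get(name[1:], default)
        else:
            resolved[name] = default
    return resolved

value_file = """
SRC_SCHEMA=staging
$APT_CONFIG_FILE=/opt/config/default.apt
"""
print(resolve(parse_value_file(value_file)))
```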

Ghost inputs/outputs, generic implementation

DataStage has an interesting feature called "runtime column propagation". This means that a stage (like a transformer) takes whatever is thrown at it, say the result of an SQL query, reads its metadata and propagates it through the stage. It is still possible to use the actual data in the stage, e.g. in a transformer. You configure it by ticking a checkbox on the output tab, although it can be a bit hidden. It is, however, a double-edged sword. The upside is that one can easily implement generic ETL jobs. The downside is that especially the undiscerning user can get puzzled by ghostly inputs/outputs. Related to that, you need to have memorised the structure that is being propagated.
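A rough Python analogy of the pass-through behaviour (column names are made up; this is only an illustration, not how DataStage implements it): the job only declares the columns it actually touches, and every other column that shows up at run time is propagated unchanged.

```python
def generic_transform(rows):
    out = []
    for row in rows:
        new_row = dict(row)            # propagate whatever columns arrived
        if "amount" in new_row:        # the only column the job explicitly knows
            new_row["amount"] = round(new_row["amount"], 2)
        out.append(new_row)
    return out

# The same job works for two sources with different column sets.
print(generic_transform([{"id": 1, "amount": 12.345}]))
print(generic_transform([{"id": 2, "amount": 9.999, "region": "EMEA"}]))
```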