Skip to end of metadata
Go to start of metadata
Summary

In the row-by-row mode, the Table Comparison Transform executes a select statement for every single input row to lookup the value in the compare table.

Throughput

database performance (3'600 rows/sec with DOP=1)

CPU

none

Memory

none

Disk

none

DoP

Supported (actually advised!)

Transform Usage


Transform Settings


The advantage is, not the entire comparison table has to be read, only the rows actually required. On the other hand, for every single row we have to execute the SQL wait for the database to perform the query and fetch the found data.

The SQL generated for the row-by-row mode is like

select columns_required.... from comparison_table where key = :parameter order by generated_column desc

So the SQL statement is preparsed and the where clause contains all the columns listed in the "Input Primary Key Columns". The additional order by is required in case the comparison table is a slow changing dimension table with one logical key having multiple versions.

The performance of this transform can never be in the same league as an engine-only transform and also depends on the database performance. Btw, it might be a good idea to check if the comparison table does have an index on all columns of the "Input Primary Key" list, otherwise the database will have to perform a full table scan for every single input row.

Nevertheless, this transform was highly tuned, therefore the performance is not that bad with still a throughput of 3600 rows per second. Also, setting a DegreeOfParallelization on the dataflow might be a good test on multi-CPU servers as the wait for the lookup can be reduced, especially in case the jobserver and the database are not on the same server - the network latency has to be added.

Also, it might be a very good idea to use DegreeOfParallelism setting in this mode as there will be a lot of waiting for Data Integrator and the database.

 

One interesting remaining question is, what is the difference if a column is omitted in the input schema vs. not listed in the compare column list?

For new rows, this is simple, the value will be null. For update rows, the Table Comparison will copy all the found column values of the comparison table into the before image buffer and the after image buffer of the row. After that, the input schema values will be copied into the after image buffer. If the input does not contain a certain column, it obviously is not compared - there is no value to compare with. But if any column listed in the compare column has changed, the transform outputs the update row including the before and after image values. So if the input column is omitted, the update will have the old values in the column. If the input schema did contain the column, it will be updated to the new value. In the example, the CURRENT_IND remains what it was for the update row, the insert row did not get any value, so the CURRENT_IND is null.


Attachment
Job_TableComparison_rowbyrow.zip (4.4 KB)



  • No labels

2 Comments

  1.  

    Dear Author,

     

    The article on TC (all input methods) is very useful and informative. It has definitely helped me in understanding the performance issues.

    Couple of things which I could not understand are order by criteria and sorting order.

    My queries are as below:

    1. Why the sorted key will be ordered in descending? Is there any specific reason?
    2. If DB sorts in lexicological mode then what is the sorting method for DI? and why it cannot be changed to in DI?

     

    Thanks,

    Akhilesh

     

     

  2. Former Member

    Hi Akhilesh,

    To answer your question to an extent.

     

    1. Sorted input option is best when the input data is pre sorted based on the primary key columns. DS reads the comparison table in the order of the primary key columns using sequential read only once. NOTE: The order of the input data set must exactly match the order of all primary key columns in the Table_Comparison transform.

    Regards,

    Nitish Chaudhary