Summary

In sorted input mode, we guarantee that the input data arrives sorted ascending by the columns listed as primary key columns. DI then executes one select statement for the entire comparison table, with an order by on the columns of the input primary key list ascending, plus the generated key column descending if one is specified. The advantage is that just one SQL statement is executed and no memory is required in the engine; the sort takes its toll, though.

Throughput: database performance 6'400 rows/sec (DI 12.1: 20'000 rows/sec)
CPU: none
Memory: none
Disk: none
DoP: supported

Transform Usage


Transform Settings


In this mode no individual selects have to be issued; instead there is just one SQL statement reading the entire table, with all the advantages such as read-ahead. And we do not require any memory, we just fetch one row after the other. Obviously, there is an overhead for the sort of the comparison table the database has to perform, but if there is an index or even a primary key, the database will simply read the index entries one after the other to return the data sorted.
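To illustrate the principle, here is a minimal sketch of a merge-style comparison of two streams that are both sorted by the primary key. It is not DI's actual implementation; the row format, the KEY column name and the opcodes are assumptions for illustration only.

def compare_sorted(input_rows, comparison_rows):
    # Yield ('I', row) for rows missing in the comparison table and
    # ('U', row) for rows whose non-key columns changed. Only the current
    # row of each stream is kept in memory, which is why this mode needs none.
    comp_iter = iter(comparison_rows)
    comp = next(comp_iter, None)
    for row in input_rows:                       # input sorted by KEY ascending
        while comp is not None and comp["KEY"] < row["KEY"]:
            comp = next(comp_iter, None)         # skip target rows with smaller keys
        if comp is None or comp["KEY"] > row["KEY"]:
            yield ("I", row)                     # no matching row in the target
        elif comp != row:
            yield ("U", row)                     # key found, but the data differs

target = [{"KEY": 1, "VAL": "a"}, {"KEY": 3, "VAL": "c"}]
source = [{"KEY": 1, "VAL": "a"}, {"KEY": 2, "VAL": "b"}, {"KEY": 3, "VAL": "x"}]
print(list(compare_sorted(source, target)))
# [('I', {'KEY': 2, 'VAL': 'b'}), ('U', {'KEY': 3, 'VAL': 'x'})]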

The performance in this case is obviously better than row-by-row comparison (144 secs), since the entire table has to be read anyway: 81 secs, which is a throughput of 6'400 rows per sec.


Although this method works well in most cases, a couple of issues come up occasionally. The first problem is that the sort order of the input and the sort order of the database have to match the DI engine sort order, which is binary. For integer and other numeric datatype columns that is not a problem. But if the primary key is a varchar column, databases tend to sort in lexicographical order, so the order would be A, Ä (German umlaut, read it as the AE character combination), B. With binary sorting, however, the German umlaut is an 8-bit character with a character code > 127, whereas all the regular characters are ASCII with values between 0 and 127, so it sorts after all of them. And don't ask what would happen with Asian characters being sorted. Since the sort order depends on the character codes, the codepages of the databases and of the engine play a role as well. Overall, this can get really complex.
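A small illustration of the difference, sketched in Python; the locale name is an assumption and merely stands in for whatever lexicographic collation the database uses.

import locale

values = ["A", "Ä", "B"]

# Binary sort: plain code-point comparison, which is what the DI engine expects.
# 'Ä' (0xC4 = 196) lies above every ASCII letter, so it sorts last.
print("binary       :", sorted(values))                         # ['A', 'B', 'Ä']

# Lexicographic, locale-aware sort, which is what a database typically returns.
try:
    locale.setlocale(locale.LC_COLLATE, "de_DE.UTF-8")          # assumed to be installed
    print("lexicographic:", sorted(values, key=locale.strxfrm)) # ['A', 'Ä', 'B']
except locale.Error:
    print("locale de_DE.UTF-8 not available on this system")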

In DI 12.1 this problem is addressed by two independent features. The engine codepage is aligned with the involved database codepages as closely as possible and chosen fully automatically, as long as the engine codepage in DSConfig is "<default>". In the standard scenario where the databases and the engine run in the same codepage, this problem is gone. But even if the databases use the ISO Latin-1 codepage and the engine is in UTF-16 mode for whatever reason, both codepages are still compatible when it comes to ordering data. A lot of thought has been put into this engine behavior.

The other feature built in was to ask the database to sort the data binary. Before, DI simply added the order by clause; now it adds the order by with the additional option to sort the data binary. This sorting is done using the database codepage, not the session codepage, hence an additional property is required in the datastore: the database codepage.
For all the remaining cases, where either the database does not understand a syntax for binary sorts or the codepages are not compatible and the input primary key is a character(!) datatype, we no longer push down the sort but perform it inside the engine.
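Put as a sketch, the resulting decision looks roughly like this; the function and parameter names are made up for illustration and are not DI's actual code.

def sort_location(pk_is_character, db_supports_binary_sort, codepages_compatible):
    # Numeric keys order the same way everywhere, so the database can always sort them.
    if not pk_is_character:
        return "database"
    # Character keys: if the database cannot sort binary or the codepages are not
    # compatible, the sort is performed inside the DI engine instead of being pushed down.
    if (not db_supports_binary_sort) or (not codepages_compatible):
        return "engine"
    return "database"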

Another problem is the performance as such. The sort does not use array fetches, cannot be parallelized, and more. All those limitations are gone with DI 12.1. The transform no longer does the reads all by itself; it uses a standard DI reader object for the comparison table. So you get the full read performance, including all the fancy features implemented to make reading as quick as possible. As a result, even on the single-CPU laptop used here, you see significant performance gains. But more important, the transform is no longer a serialization point: both inputs, the input data stream and the reader for the comparison table, now partition the data via a hash function into buckets that can be processed independently. So this feature now works very well with DoP.
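The idea behind the partitioning can be sketched as follows; the hash function and the bucket count are arbitrary choices for illustration, not DI's actual implementation.

import zlib

def bucket_of(primary_key, num_buckets=4):
    # Rows with the same primary key always land in the same bucket, no matter
    # whether they come from the input stream or from the comparison-table reader,
    # so each bucket pair can be compared independently and in parallel.
    return zlib.crc32(str(primary_key).encode("utf-8")) % num_buckets

for key in (42525246, 42525247, 42525300):
    print(key, "-> bucket", bucket_of(key))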

And we have added another goodie. The Table Comparison transform is often used for delta loads: out of your billion-row table you read, for example, only the million order line items that are new or changed. What do their primary keys look like? Very likely the smallest order number found is not order = 1 but a rather high number, since these are the recent orders. So instead of comparing the input dataset with the entire comparison table, we wait for the first order-item record to be sent into the Table Comparison transform and then add its key value as a where condition when reading the comparison table. You will see a statement like "select * from comparison table where ORDER >= 42525246 order by ORDER binary". In the worst case the where clause has no effect because the database has to read everything anyway; in the best case the number is high and the column is indexed, so only a few rows actually have to be read and sorted.
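A sketch of how such a statement could be assembled once the first input row has arrived; the table and column names are assumptions, only the quoted statement above reflects the actual behavior.

def build_comparison_select(table, key_column, first_input_row):
    # The key value of the very first input row becomes the lower bound
    # for the comparison-table read, so older rows are never fetched.
    first_key = first_input_row[key_column]
    return (
        f"select * from {table} "
        f"where {key_column} >= {first_key} "
        f"order by {key_column} binary"
    )

first_row = {"ORDER": 42525246, "ITEM": 10}
print(build_comparison_select("COMPARISON_TABLE", "ORDER", first_row))
# select * from COMPARISON_TABLE where ORDER >= 42525246 order by ORDER binary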

Attachment
Job_TableComparison_sorted.zip (4.38 KB)