Skip to end of metadata
Go to start of metadata

You have a table that contains multiple time stamped records for a given primary key: 

KeyAttTimestamp
037472012.11.11 04:17:30
01ABC2014.09.30 17:45:54
02UVW2014.04.16 17:45:23
01DEF

2014.08.17 16:16:27

02XYZ2014.08.25 18:15:45
01JKL2012.04.30 04:00:00
03777

2014.07.15 12:45:12

01GHI2013.06.08 23:11:26
037372010.12.06 06:43:52

 

Output required is the most recent record for every key value: 

KeyAttTimestamp
01ABC2014.09.30 17:45:54
02XYZ2014.08.25 18:15:45
03777

2014.07.15 12:45:12

 

Solution #1: Use the gen_row_num_by_group function

Build a dataflow as such:

In the first query transform, sort the input stream according to Key and Timestamp desc(ending). The sort will be pushed to the underlying database, which is often good for performance.

KeyAttTimestamp
01ABC2014.09.30 17:45:54
01DEF2014.08.17 16:16:27
01GHI2013.06.08 23:11:26
01JKL2012.04.30 04:00:00
02XYZ2014.08.25 18:15:45
02UVW2014.04.16 17:45:23
037772014.07.15 12:45:12
037472012.11.11 04:17:30
037372010.12.06 06:43:52

 

In the second query transform, add a column Seqno and map it to gen_row_num_by_group(Key).

KeyAttTimestampSeqno
01ABC2014.09.30 17:45:541
01DEF2014.08.17 16:16:272
01GHI2013.06.08 23:11:263
01JKL2012.04.30 04:00:004
02XYZ2014.08.25 18:15:451
02UVW2014.04.16 17:45:232
037772014.07.15 12:45:121
037472012.11.11 04:17:302
037372010.12.06 06:43:523

 

In the third query transform, add a where-clause Seqno = 1 (and don’t map the Seqno column).

KeyAttTimestamp
01ABC2014.09.30 17:45:54
02XYZ2014.08.25 18:15:45
037772014.07.15 12:45:12

 

Solution #2: use a join

Suppose we’re talking Big Data here, there are millions of records in the source table. On HANA. Obviously. Although the sort is pushed down to the database, the built-in function is not. Therefore every single record has to be pulled into DS memory; and then eventually written back to the database.

Now consider this approach:

 

The first query transform selects two columns from the source table only: Key and Timestamp. Define a group by on Key and set the mapping for Timestamp to max(Timestamp).

KeyTimestamp
012014.09.30 17:45:54
022014.08.25 18:15:45
032014.07.15 12:45:12

 

In the second transform, (inner) join on Key and Timestamp and map all columns from the source table to the output.

KeyAttTimestamp
01ABC2014.09.30 17:45:54
02XYZ2014.08.25 18:15:45
037772014.07.15 12:45:12

 

If you uncheck bulk loading of the target table, you’ll notice that the full sql (read and write) will be pushed to the underlying database. And your job will run so much faster!

 

Note: This second approach produces correct results only if there are no duplicate most recent timestamps within a given primary key.

 

  • No labels

1 Comment

  1. When not on HANA, add a Data_Transfer transform to achieve full sql-pushdown: