
With lookup_ext() you can look up a row in a table or in a flat file. But sometimes this is not sufficient, e.g. when you want to look up the value of an aggregate like "select order_number, sum(revenue) from order_items group by order_number", or look up in the result of "select * from order join order_items on (...)". For the latter you might be able to use two lookups: first look up in the one table and return the join keys required for the second. But you can easily come up with a scenario where this would not help at all.
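Written out as plain SQL, the two cases from above look like this (table and column names are taken from the examples in the text; the join condition is left open, as in the text):

```sql
-- Aggregate case: one revenue total per order
SELECT order_number, SUM(revenue)
FROM order_items
GROUP BY order_number;

-- Join case: lookup source spanning two tables
SELECT *
FROM order
JOIN order_items ON (...);
```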

Join rather than lookup

Yes, we could use a query to build the dataset and join it rather than using a lookup, but as said before, a lookup has advantages over a join.

Create a physical lookup table (stage table) or a virtual one (database view)

Another option would be to create a stage table that holds the dataset. Now it is a regular table we can look up in. But another table just to pull the data together so we can look up in it? That does not sound right. A similar idea is to create a database view rather than a table to look up in, and actually, that might not be such a bad idea at all. It requires that you have the permission to create a view, though, and if the data has to be joined from different sources you are out of luck as well.
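A minimal sketch of such a view, reusing the aggregate example from the introduction (the view name V_ORDER_REVENUE is made up for illustration):

```sql
-- Hypothetical view that wraps the aggregate, so lookup_ext()
-- can treat it like any regular table in the datastore
CREATE VIEW V_ORDER_REVENUE AS
SELECT order_number, SUM(revenue) AS revenue
FROM order_items
GROUP BY order_number;
```

Once the view exists, the lookup condition is on order_number and the return column is revenue, exactly as with a physical stage table, but without a second copy of the data.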

Writing manual SQL code with all its downsides

The most direct method, you could call it the equivalent of the SQL transform or a sql() function, is the "Custom SQL" button at the bottom. Its use case is less to allow completely arbitrary SQL statements than to override the SQL that DI generates. You still have to pick a lookup table, even if it is not used at all; the only reason this is required is so the engine knows which datastore to use and the GUI can populate the drop-down boxes with a list of available columns. The SQL you specify in the Custom SQL dialog only has to contain the columns mentioned in the condition, output, and order-by areas, and those columns do not even need to come from that table.

As an example, you could choose target_ds.TARGET.CUSTOMER_MASTER as the lookup table, use the column CUSTOMER_ID as the condition and CUSTOMER_NAME_1 as the output column, while the custom SQL says "select order_number as CUSTOMER_ID, revenue as CUSTOMER_NAME_1 from ORDER". The datatypes should match, or at least be compatible, and it certainly is not a good technique either.

You have to understand that the idea behind this feature was to modify the select statement DI would normally create only slightly, e.g. "yes, I want to look up in the customer_master table, but in case the customer has no order yet, treat it as non-existent". If you do not use the function wizard to create the lookup_ext() function, you can actually specify any table of the datastore you want to select from and use completely different columns for the actual select, conditions, return columns, etc. The engine can deal with that; only the GUI is limited. But that also means no New-Function-Call; you have to use lookup_ext() in a mapping. Either way, a database view is a much better approach.
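Spelled out, the mismatch example from above looks like this; the column aliases are what make the order data appear under the column names the lookup dialog expects:

```sql
-- Custom SQL overriding the lookup on target_ds.TARGET.CUSTOMER_MASTER:
-- order data is aliased onto the lookup table's column names.
-- (Note: ORDER is a reserved word in most databases and would
-- normally need quoting; the name is kept as in the text above.)
SELECT order_number AS CUSTOMER_ID,
       revenue      AS CUSTOMER_NAME_1
FROM ORDER;
```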


Persistent Cache table

Another solution is a variation of the stage table. But do not build the stage table in the database; use a persistent cache datastore instead. You simply create a new datastore of type "persistent cache" and build a dataflow doing whatever you want, finally loading the data into a table of this datastore. Now it is a regular table like any other, and you can use it as a lookup table. The major advantage is that such a persistent cache table does not need to be read and cached: it is the cache already. All the data is stored in binary in the pageable-cache format; we just need to activate the pageable-cache database using that file and the cache exists. At the beginning the cache miss ratio will be high, as all data is on disk and none in memory, but that changes quickly. Performance is really not bad. And this approach has all the advantages of the methods mentioned above: it gives you full flexibility, since all transforms and all functions can be used to build the persistent cache table; the lookup itself is simple; there is no need to write any SQL at all; and you have full impact/lineage information from source to target.



1 Comment

  1. Former Member

    One important aspect of the Custom SQL in a lookup that this article does not cover is that it can be dynamic at execution time, unlike a persistent cache table. So if you have a dataflow that is used multiple times and uses variables to control its actions, one of those actions can be the complete SQL of the lookup. This gives a flexibility that I have not seen any other way to achieve in Data Services. Say your custom SQL is normally a very simple select, but for your weekly run it needs to encompass 4 other tables, and for your monthly run it accesses a history table that uses completely different join criteria and 3 different tables. With a Custom SQL lookup that is a cakewalk, and no additional cache is used during the week when you have a small data set.

    I agree that there is always more than one way to accomplish a goal, but don't discount something simply because there is another method. They all have their own unique advantages and disadvantages.