Skip to end of metadata
Go to start of metadata

Another approach for joining tables is to use the DI function lookup_ext(). There are a couple of differences to a join. First, a lookup_ext() is a function call of the engine and therefor never ever pushed down together with the reader, even if it would be possible in theory. Second, a lookup always returns exactly one row. The default values if no row can be found, and the first row only, if multiples are found. And third, you have much tighter control over the cache method - you specify it. With a join the options to influence that are less specific.

The preferred method of using the lookup_ext should really be the new-function-call way. You rightclick the output pane of the query and select "New function call" from the popup menu. With this you can get into the function wizard and later you can edit it via the wizard, you can return multiple columns, it is visually nice,.... it is the way to go.

The execution time for this is 46seconds, although exactly (almost) the same is done in the join case with caching turned on. We just pay the price of the greater flexibility of lookup_ext function without actually using it.

If we change the caching option of lookup_ext to NO_CACHE, we again do the same as with the join in the engine, the version with the nested loop: 2'06seconds.

From the SQL generated, the join and the lookups are identical, it is just the greater flexibility of the function that consumes more CPU. And since the CPU is saturated on my computer anyway, the lookup takes significantly longer.

Demand_Cache is the last option we have. With this, the cache is not prefilled like with PRE_LOAD_CACHE, instead the row is looked up like with NO_CACHE and then kept in memory. So for this example, where we are going to read the entire lookup table anyway, it does not make sense.

What is interesting though is that it takes 2'17seconds, so slightly longer than the NO_CACHE case, but the process size did increase to up to 195MB, whereas in the join case it was 80MB only! The same can be found for the PRE_LOAD_CACHE case, more memory than in the join is required, but not that much more, just 120MB.

The DEMAND_CACHE does make sense where the lookup table is large, we will lookup many times as the source table has many rows as well, but always the same ones. Like "I have 1 million order lines per day, but for five out of my 500'000 customers only.". Loading half a million rows into cache although 5 rows are required only, that does not make sense, and executing one million times a lookup that selects the same row over and over from the database does not either. So in case the join condition is not customer_id = customer_id, but instead, we change it to something where demand_cache makes sense, it would be customer_id = industry_id (12 distinct values only), the performance is: 1'15 secs.

  • No labels