People get confused by the different terms: batch, real time, near real time, running a batch job more often, CDC readers, and so on.
When we talk about real time, a message with some content has to trigger the processing, so SAP Data Services acts as a server process waiting for that message. A batch job, on the other hand, is simply executed and that is it: the batch job queries the source itself.
To make the difference more obvious, let us assume there is a database with order information. To get the data, we can start a batch job and ask the database "What are the changed records?". We might get many, a few, or none. We can run the job very often, but that increases the load on both ends, DS and the source, and often the result is simply: no changes found. This is a batch job started once in a while, for example once a night as in a standard Data Warehouse environment.
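The batch-style question "What are the changed records?" can be sketched as a plain query against a change timestamp. This is only an illustration, not Data Services code: the `orders` table, its `changed_at` column and the function name `fetch_changed_orders` are assumptions made up for this example, and an in-memory SQLite database stands in for the source.

```python
import sqlite3


def fetch_changed_orders(conn, last_run):
    """Batch-style pull: ask the source which rows changed since the last run."""
    cur = conn.execute(
        "SELECT order_id, status, changed_at FROM orders "
        "WHERE changed_at > ? ORDER BY changed_at",
        (last_run,),
    )
    return cur.fetchall()


def make_demo_source():
    """Build a tiny in-memory stand-in for the order database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders ("
        "order_id INTEGER PRIMARY KEY, status TEXT, changed_at TEXT)"
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [
            (1233, "shipped", "2024-01-01T08:00:00"),
            (1234, "created", "2024-01-01T09:30:00"),
        ],
    )
    return conn


conn = make_demo_source()
# First run: one order changed after the previous run at 09:00.
first = fetch_changed_orders(conn, "2024-01-01T09:00:00")
# Second run shortly after: the source is queried again, but nothing is new.
second = fetch_changed_orders(conn, "2024-01-01T09:30:00")
```

The second call shows the cost of polling often: the source is still queried, but the answer is an empty list.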
For this use case it would actually be better if the database could inform us about a change, with a message saying "Order 1234 was created". But for this, the database (or the source system in general) has to know a way to send messages to the outside, in a form DS understands. For example, for each new record the database calls a web service provided by Data Services and passes the order number as a parameter.
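The push model can be sketched in the same spirit. The class below is a hypothetical stand-in for the service the source would call once per change; it is not the Data Services web service interface. The names `OrderChangeService` and `on_order_changed` and the dictionaries used as source and target are assumptions for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class OrderChangeService:
    """Stand-in for a service endpoint that the source calls once per change."""

    source: dict  # hypothetical source system: order_id -> row
    target: dict = field(default_factory=dict)  # hypothetical target system

    def on_order_changed(self, order_id: int) -> str:
        """Handle one message; the order number is the only payload."""
        row = self.source.get(order_id)
        if row is None:
            return f"order {order_id} not found"
        # Only the one record named in the message is read and loaded.
        self.target[order_id] = dict(row)
        return f"order {order_id} processed"


source = {1234: {"customer": "ACME", "amount": 99.5}}
service = OrderChangeService(source)
# The source system announces "Order 1234 was created".
reply = service.on_order_changed(1234)
```

Nothing runs until a message arrives, and each message names exactly the record to process, so there is no polling and no empty result.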
Is this real time or near real time? That depends on the point of view. Since we get a message right away when the change happens, we can call it real time. On the other hand, it is asynchronous, so data might be committed in one system but not yet in the other. In the normal case the gap is just a few milliseconds, but that is not guaranteed, and when errors happen it can take quite a while. By the way, all EAI tools work asynchronously.
Then there are other options, mixtures of the two. The source database might send a message, but not one that specific, like "A new order was created or changed and I am not telling you which". With that we have both options: we can start a batch job to find all changed records, or we can use the message as the input message of a real-time dataflow. In both cases the message itself contains no information; its mere existence tells us to start looking for changes. If the message is like the one above, where we at least know it concerns the orders table, a real-time dataflow would be the way to go. If the message were even more general, like "load the entire Data Warehouse NOW!", a batch job would be preferred.
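The decision described above, choosing the processing style by how specific the message is, can be written down as a small routing function. The message layout (a dictionary with optional `table` and `order_id` keys) and the returned labels are assumptions for this sketch, not anything Data Services defines.

```python
def route_notification(message: dict) -> str:
    """Pick a processing style from how specific the notification is."""
    if message.get("order_id") is not None:
        # Fully specific: process exactly this record in real time.
        return f"realtime:process order {message['order_id']}"
    if message.get("table"):
        # Only the table is known: a real-time dataflow looks for its changes.
        return f"realtime:scan {message['table']} for changes"
    # Nothing specific at all: fall back to a batch job.
    return "batch:run full load"


specific = route_notification({"table": "orders", "order_id": 1234})
table_only = route_notification({"table": "orders"})
generic = route_notification({})
```

In the two less specific cases the message carries no data of its own; it only tells the receiver where to start looking.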
CDC readers are something completely different. No matter how the job got started, how do you find the changes in the source? Maybe there is a timestamp you can trust, maybe a table where the application logs all modifications. Maybe there is nothing. In this last case you can at least read the database logs (or similar) and scan them for changes to this table, provided the database allows that (Oracle, DB2, SQL Server) and the ETL tool has a reader for it, as DI does.
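The "table where the application logs all modifications" variant can be sketched as follows. This is not how a log-based CDC reader works internally; it only illustrates reading a change log incrementally. The `orders_log` table, its `seq` counter and the function `read_changes` are hypothetical, and SQLite triggers stand in for the application that writes the log.

```python
import sqlite3


def read_changes(conn, last_seq):
    """Return log entries newer than last_seq, plus the new high-water mark."""
    rows = conn.execute(
        "SELECT seq, order_id, op FROM orders_log WHERE seq > ? ORDER BY seq",
        (last_seq,),
    ).fetchall()
    new_last = rows[-1][0] if rows else last_seq
    return rows, new_last


conn = sqlite3.connect(":memory:")
# Triggers play the role of the application logging every modification.
conn.executescript(
    """
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, status TEXT);
    CREATE TABLE orders_log (
        seq INTEGER PRIMARY KEY AUTOINCREMENT,
        order_id INTEGER,
        op TEXT
    );
    CREATE TRIGGER orders_ins AFTER INSERT ON orders BEGIN
        INSERT INTO orders_log (order_id, op) VALUES (NEW.order_id, 'I');
    END;
    CREATE TRIGGER orders_upd AFTER UPDATE ON orders BEGIN
        INSERT INTO orders_log (order_id, op) VALUES (NEW.order_id, 'U');
    END;
    """
)

conn.execute("INSERT INTO orders VALUES (1234, 'created')")
conn.execute("UPDATE orders SET status = 'shipped' WHERE order_id = 1234")

changes, last_seq = read_changes(conn, 0)        # both modifications
nothing, same_seq = read_changes(conn, last_seq)  # no new entries since then
```

Remembering the last sequence number read means each run picks up only what happened since the previous one, regardless of whether the run was started by a schedule or by a message.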
1 Comment
Unknown User (sgsjnm2)
Hi Robbie
Can we use a BW extractor that is set up as direct delta as the source for real-time extraction?
Poonam