Why do we not simply execute a batch job every second, and instead have built a RealTime Job? Because starting a batch job has a lot of overhead: image activation, connecting to the repo, reading the repo, parsing it, optimizing it, starting the dataflow, connecting to source and target, disconnecting, shutting down all processes, etc.

So overall, even for a simple flow the entire preparation work can take 20 to 30 seconds. A RealTime Job is started once and keeps running, so the activation is done only once. From the processing point of view, it basically looks like a DataFlow that has been started and is now waiting for the first row to be found in the source database. After a while, a first row (message) is received and processed, and then it might take another hour until the next row is sent.
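To put the startup cost into numbers, here is a minimal sketch of the arithmetic. The function name and the cost model are assumptions made for illustration only; the 20 seconds is the lower end of the range quoted above, not a measured value.

```python
def startup_overhead_seconds(runs: int, startup_seconds: float, started_once: bool) -> float:
    """Total time spent on job preparation under a simple, hypothetical cost model."""
    if started_once:
        # RealTime style: activation happens a single time,
        # no matter how many messages are processed afterwards.
        return startup_seconds
    # Batch style: every run repeats the full preparation work.
    return runs * startup_seconds


if __name__ == "__main__":
    messages = 60  # e.g. one message per second for a minute
    print("batch job per message:", startup_overhead_seconds(messages, 20.0, started_once=False))
    print("RealTime Job         :", startup_overhead_seconds(messages, 20.0, started_once=True))
```

With these assumed numbers, one batch job per message would spend 1200 seconds on preparation alone, which is why a one-second schedule is not achievable that way, while the RealTime Job pays the 20 seconds once.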

In addition, while batch jobs do not care much about guaranteed delivery, in RealTime mode we do. Not to the extent of two-phase commits and distributed transactions, but for the most common cases.

Performance-wise, Batch and RealTime Flows work in totally different ways. A batch job is a parallel pipelined engine where each transformation step processes rows as fast as it can until its output buffer is full because the downstream object (a table loader, for example) is slower. So at any given time, tens of thousands of rows are being processed in the engine. Because of the guaranteed delivery approach, in a RealTime Flow only one input message is processed at any time. This message might contain sub-schemas with many rows or trigger the processing of many rows, but if we use the analogy of one message being one source table row, only one source row is processed at a time. For this reason you will find 10'000 rows per second to be standard in batch jobs, compared with 100 messages per second in a RealTime Flow. And that, on the other hand, is why you can easily create multiple instances of one RealTime Flow.
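As a rough sizing aid, the relationship between message rate and number of instances can be sketched as follows. This is only an estimate under stated assumptions: the function name is hypothetical, the default of 100 messages per second is the ballpark figure from the text, and linear scaling across instances is assumed rather than guaranteed.

```python
import math


def instances_needed(target_messages_per_second: float,
                     per_instance_rate: float = 100.0) -> int:
    """Estimate how many RealTime Flow instances a target message rate requires.

    Assumes each instance handles one message at a time at `per_instance_rate`
    and that throughput adds up linearly across instances (an assumption).
    """
    if per_instance_rate <= 0:
        raise ValueError("per_instance_rate must be positive")
    if target_messages_per_second <= 0:
        return 0
    # Round up: a partially used instance still has to be started.
    return math.ceil(target_messages_per_second / per_instance_rate)


if __name__ == "__main__":
    for target in (50, 100, 250):
        print(target, "msg/s ->", instances_needed(target), "instance(s)")
```

In practice the achievable rate per instance depends on the message size and on what each message triggers, so the default should be replaced with a measured value.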
