
We looked at the Validation Transform performance already, but now we should have a closer look at it from a data quality perspective. A dataset is read in, transformed and ready to be loaded into the target. But before loading, we should try to cleanse the data as much as possible. For example, there might be a gender flag in the source with the values f for female, m for male, and x, ?, -, u, ... for unknown. Obviously the best thing would be to get all the data corrected in the source so there are no unknown values anymore, but for that you would need some statistics. If 99% of the cases are of type f or m, you will act differently than if just 10% of the data is correct. And in the meantime we could define the rule that the gender flag should contain m, f and ? only. All of that can be done with the Validation Transform.
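To make the gender-flag example concrete, here is a minimal sketch in plain Python (not Data Services syntax; the function and names are illustrative only) of the kind of statistics you would want before deciding how to act:

```python
# Conceptual sketch: measure how much of a gender column already conforms
# to the rule "value must be m, f or ?" and list the offending values.
from collections import Counter

def gender_stats(values, allowed=("m", "f", "?")):
    """Return (share of allowed values, Counter of the offending ones)."""
    counts = Counter(values)
    valid = sum(counts[v] for v in allowed)
    invalid = Counter({v: c for v, c in counts.items() if v not in allowed})
    return valid / len(values), invalid

share, offenders = gender_stats(["m", "f", "f", "x", "?", "-", "m"])
# share tells you whether you are in the "99% correct" or "10% correct"
# situation described above; offenders shows which unknown codes occur.
```

If the share is high, fixing the remaining rows in the source is realistic; if it is low, a systematic mapping or a relaxed rule is the better first step.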

In the transform you can specify a rule for any column you like, defining when a value is assumed to be correct.

In this example, the rule for the column QUANTITY is that the value has to be greater than zero. If a row passes the rules for all columns, it is sent to the PASS output. If a value violates a rule, you can specify what to do.

One option is to say that this is a fundamental problem and the entire row should be sent to the FAIL output. Or you treat the problem as minor and still let the row pass, possibly resetting the value to some default. Or you do both: load it via the PASS route and forward it to FAIL as well.
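The three actions above can be sketched in plain Python (again, a conceptual model of the routing, not Data Services code; the function and parameter names are made up for illustration):

```python
# Route one row to PASS and/or FAIL depending on the configured action.
def route(row, rule, action="fail", default=None, column="QUANTITY"):
    """Return (pass_rows, fail_rows) for a single input row.

    action: 'fail' -> row goes to FAIL only
            'pass' -> row goes to PASS, optionally with a substituted default
            'both' -> row goes to PASS (corrected) and to FAIL as well
    """
    if rule(row[column]):
        return [row], []            # rule satisfied: PASS untouched
    if action == "fail":
        return [], [row]            # fundamental problem: FAIL only
    fixed = dict(row)
    if default is not None:
        fixed[column] = default     # minor problem: reset to a default
    if action == "pass":
        return [fixed], []
    return [fixed], [row]           # 'both': load it and protocol it

ok, bad = route({"QUANTITY": -5}, lambda q: q > 0, action="both", default=0)
```

With `action="both"` the corrected row is loaded while the original bad row is still written to the FAIL side for later analysis.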

So this way you can build dataflows dealing with erroneous data: correct wrong values where possible, and log fundamentally wrong rows to another table or file.

But there is more to it. Catching or correcting bad data is one thing; the other is knowing about the data quality. So we built a way to look at the statistics of the Validation Transform as well. As this could be quite a volume, you have to enable it per Validation Transform and at job level.

For the transform, there are the Validation Transform Options in a second tab.

Here you can enable collecting the statistics, and optionally collect sample data in addition.
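What "collect statistics" plus "collect sample data" amounts to can be sketched like this in plain Python (a hedged model of the idea, not the tool's implementation; function names and the rules dictionary are illustrative):

```python
# Per rule, count pass/fail and keep a few failing rows as samples.
def validate_with_stats(rows, rules, sample_size=5):
    """rules: dict mapping rule name -> predicate over a row dict."""
    stats = {name: {"pass": 0, "fail": 0, "samples": []} for name in rules}
    for row in rows:
        for name, pred in rules.items():
            if pred(row):
                stats[name]["pass"] += 1
            else:
                stats[name]["fail"] += 1
                # sample data is capped, since keeping every failed
                # row could be quite a volume
                if len(stats[name]["samples"]) < sample_size:
                    stats[name]["samples"].append(row)
    return stats

rows = [{"PRICE": 0.1}, {"PRICE": 10}, {"PRICE": 25}]
stats = validate_with_stats(rows, {"PRICE > 10": lambda r: r["PRICE"] > 10})
```

Note that the row with PRICE = 10 fails the PRICE > 10 rule too, since the comparison is strict; that matches the two failing rows (0.1 and 10) discussed further below.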

When executing the job, there is a flag to disable the collection for that run; by default it is enabled.

Once the job has run, you can view the statistics in the Data Validation Dashboards of WebAdmin.

But before we can actually see something, we need to tell it which jobs we want to see the statistics for. By clicking the Settings link in the top right menu of the web page, you specify a functional area and which jobs belong to it. If no jobs are listed, then there are none with Validation Transforms that have the "Collect Data Validation Statistics" option enabled.


The report will show you the PASS and FAIL statistics. A low priority is a failed row that was sent to PASS, medium is the send-to-both action, and high priority failed rows are those sent to the FAIL path alone.
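The priority scheme just described is simply a mapping from the configured failure action to a severity; as a tiny illustrative sketch (plain Python, names assumed):

```python
# Map the Validation Transform's failure action to the dashboard priority.
def failure_priority(action):
    """action is what happens to a failed row: sent to pass, both, or fail."""
    return {"pass": "low", "both": "medium", "fail": "high"}[action]
```

So a row that failed a rule but was still loaded counts as a low-priority issue, while a row dropped to FAIL alone counts as high priority.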

Be aware that you can actually drill into those reports and gather more and more details. At the next level you get the answer why a row is treated as failed. As there are typically multiple rules defined, it is not obvious whether it failed because of the PRICE or the QUANTITY rule in our example.

Now that we know all failed rows had a problem with the price, we can drill deeper to see the data samples collected during the run.

So these are the two rows that violated the PRICE > 10 rule; the values were 0.1 and 10.


