Summary

The Data Cleanse transform parses text fields to extract information such as names, email addresses and dates. Using the directory data, the transform can distinguish between first and last names, tell company names from personal names, guess the gender and more. For that to work, the Cleansing Package has to be installed and loaded into a repository of its own.

Before the Data Cleanse transform can be used for text parsing, the Cleansing Package has to be downloaded from the Service Marketplace and installed on a Designer computer. The installation adds files to the Data Services subdirectory \DataQuality\datacleanse; the Repository Manager is then used to upload that data into new repository tables.


 

Because this reference repository is likely to be shared with others, each DI local repository has a separate connection dialog that has to be filled out to point to it. So although the reference repository could live in the same schema as the DI repository, usually it does not.


 

With that in place, we can take one of the country-specific Data Cleanse transforms and feed it data to get the individual fields decoded, validated and cleansed, and additional data derived.


 

In this example the only input to the transform is the NAME1 column, which contains either a company name or a person's name. From that alone the transform was able to determine the first name, the last name and the middle name, decide whether the entry is a person or a company, and derive the gender.
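
To make the idea of dictionary-driven parsing more concrete, here is a toy sketch in Python. It is not the transform's actual algorithm: the lookup tables, the function name classify_name and the output keys are all hypothetical stand-ins for the much larger cleansing package data, and the rules are deliberately naive.

```python
# Hypothetical, tiny lookup tables standing in for the cleansing package data.
GIVEN_NAMES = {"mike": "MALE", "trish": "FEMALE", "liz": "FEMALE"}
FIRM_MARKERS = {"inc", "corp", "gmbh", "ltd"}


def classify_name(name1):
    """Naively split a NAME1-style value into person or firm fields."""
    # Split on whitespace and drop surrounding dots/commas ("J." -> "J").
    tokens = [t.strip(".,") for t in name1.split()]
    tokens = [t for t in tokens if t]
    lowered = [t.lower() for t in tokens]

    # Any legal-form marker makes the whole value a firm.
    if any(t in FIRM_MARKERS for t in lowered):
        return {"type": "FIRM", "firm": name1.strip()}

    # A known given name in first position makes it a person.
    if len(tokens) >= 2 and lowered[0] in GIVEN_NAMES:
        return {
            "type": "PERSON",
            "first_name": tokens[0],
            "middle_name": " ".join(tokens[1:-1]),
            "last_name": tokens[-1],
            "gender": GIVEN_NAMES[lowered[0]],
        }

    return {"type": "UNKNOWN"}


if __name__ == "__main__":
    print(classify_name("Mike J. Miller"))
    print(classify_name("Miller Tools GmbH"))
```

The point of the sketch is only that the quality of the result depends on the reference data behind the lookup, which is why the Cleansing Package has to be loaded first.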

As mentioned, this works not only with names as in the example; it is very useful for other tasks as well.

  • a date stored somewhere in different formats - identify it and return it in a common format
  • is there an email address inside the text?
  • phone numbers in various formats - bring them into a standard format
  • worldwide phone numbers - decode them, e.g. 011-49 is the same as 01149, 0049, +49, ... it's all Germany
  • Social Security Numbers - identify and format them
  • is there an address inside the text? - extract the address portion
  • dual names like "Trish and Mike Miller"
  • synonyms for a name (MATCH_STD) - Liz, Elisabeth and Beth could all mean the same person
  • and names offer more than that: title, prename, maturity...
  • ...
  • ...
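
As an illustration of the worldwide phone number item above, the following minimal Python sketch maps the different spellings of an international prefix to a single "+" form. It is an assumption-laden example, not how the transform works internally: the function name normalize_international and the short prefix list ("011" and "00") are hypothetical, and other national dialing prefixes are not handled.

```python
import re
from typing import Optional

# Assumed international dialing prefixes; real-world coverage needs many more.
_INTL_PREFIXES = ("011", "00")


def normalize_international(raw: str) -> Optional[str]:
    """Return the number as '+<digits>', or None if no international prefix is found."""
    text = raw.strip()
    has_plus = text.startswith("+")
    # Keep digits only, dropping spaces, dashes, dots and brackets.
    digits = re.sub(r"\D", "", text)

    if has_plus:
        return "+" + digits if digits else None

    for prefix in _INTL_PREFIXES:
        if digits.startswith(prefix) and len(digits) > len(prefix):
            return "+" + digits[len(prefix):]

    # Looks like a national number; leave the decision to the caller.
    return None


if __name__ == "__main__":
    for sample in ("011-49 30 1234567", "01149 30 1234567",
                   "0049 30 1234567", "+49 30 1234567"):
        print(sample, "->", normalize_international(sample))
```

All four sample spellings end up as the same string starting with +49, which is the behaviour the list item describes for Germany.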

Performance: see Data Cleanse - Name Parsing


