SAP Data Services Text Data Processing enables you to perform natural language processing and extraction processing on unstructured text. This capability was introduced in Data Services 4.0 and has been enhanced in subsequent releases. Text Data Processing now extracts information from binary documents such as Word and PDF files, offers richer entity extraction in 31 languages, and can be pushed down to execute directly in Hadoop. In the latest Data Services 4.2 release, the Entity Extraction transform adds language identification, pre-defined entity type support for Dutch and Portuguese, and sentiment analysis extraction in Simplified Chinese.

Entity Extraction Transform

The Entity Extraction transform, found under the Text Data Processing feature in Data Services Designer, performs linguistic processing on content by using semantic and syntactic knowledge of words. You can configure the transform to identify paragraphs, sentences, and clauses, and to extract entities and facts from text. Typically, you use the Entity Extraction transform when you have text containing specific information that you want to extract and then use in downstream analytics and applications.

Language Modules

Each language module (formerly known as Lingware) consists of a set of files, including system dictionaries of words, that support the language processing operations for a given natural language. The language modules are what enable linguistic analysis and extraction of unstructured text in that language.

  • Language Modules Reference
    • Out-of-the-box entities, relations, and events pre-configured to extract key information – the who, what, where, when, and how in text
    • Available with pre-defined entity types in English, French, German, Simplified Chinese, Spanish, Japanese, Italian, Russian, Korean, Arabic, Farsi, Portuguese, and Dutch
    • Also available in Bokmal, Catalan, Croatian, Czech, Danish, Greek, Hebrew, Hungarian, Nynorsk, Polish, Romanian, Serbian, Slovak, Slovenian, Swedish, Thai, Traditional Chinese, and Turkish
    • Note: The first group of languages supports pre-defined entity types out-of-the-box, while the second group may only support the NOUN_GROUP entity type. All of the languages support dictionary- and rule-based extraction.
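
The language coverage above can be summarized as a small lookup, which may be handy when deciding which extraction approach to plan for a given language. This is only an illustrative Python sketch, not part of Data Services: the names `PREDEFINED_ENTITY_LANGUAGES`, `NOUN_GROUP_ONLY_LANGUAGES`, and `extraction_support` are hypothetical, and the split assumes that the first thirteen languages in the list are the ones with pre-defined entity types.

```python
# Hypothetical summary of the language list on this page; not an SAP API.
# Assumption: these thirteen languages support pre-defined entity types.
PREDEFINED_ENTITY_LANGUAGES = frozenset({
    "English", "French", "German", "Simplified Chinese", "Spanish",
    "Japanese", "Italian", "Russian", "Korean", "Arabic", "Farsi",
    "Portuguese", "Dutch",
})

# The remaining languages may only support the NOUN_GROUP entity type.
NOUN_GROUP_ONLY_LANGUAGES = frozenset({
    "Bokmal", "Catalan", "Croatian", "Czech", "Danish", "Greek", "Hebrew",
    "Hungarian", "Nynorsk", "Polish", "Romanian", "Serbian", "Slovak",
    "Slovenian", "Swedish", "Thai", "Traditional Chinese", "Turkish",
})


def extraction_support(language: str) -> dict:
    """Return the extraction options the page lists for a language."""
    known = PREDEFINED_ENTITY_LANGUAGES | NOUN_GROUP_ONLY_LANGUAGES
    if language not in known:
        raise ValueError(f"No language module listed for {language!r}")
    return {
        "language": language,
        "predefined_entity_types": language in PREDEFINED_ENTITY_LANGUAGES,
        # The page states that every language supports these two approaches.
        "dictionary_extraction": True,
        "rule_based_extraction": True,
    }


if __name__ == "__main__":
    for name in ("Dutch", "Thai"):
        print(extraction_support(name))
```

For a language in the second group, the lookup suggests planning for custom dictionaries or extraction rules rather than relying on pre-defined entity types.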

Certain language modules include specialized content that provides entity types and sets of rules that address specific needs:

  • Voice of the Customer
    • Out-of-the-box entities, relations, and requests pre-configured to extract key information for sentiments: strong positive, weak positive, neutral, weak negative, strong negative, problems
    • Available in English, French, German, Spanish, and Simplified Chinese
  • Public Sector
    • Out-of-the-box entities and relations pre-configured to extract key information such as person-organization, person-alias, travel events and security (weapons, vehicles, facilities, etc.)
    • Available in English
    • Available in Arabic and Simplified Chinese for security only
  • Enterprise
    • Out-of-the-box entities, relations, and events pre-configured to extract key information for mergers and acquisitions, as well as executive job changes
    • Available in English
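
The availability of the specialized content above can be captured in the same way. The following Python sketch is illustrative only: `SPECIALIZED_CONTENT` and `content_for` are hypothetical names, and the "(security only)" entry is simply this sketch's way of recording the restriction noted for Arabic and Simplified Chinese.

```python
# Hypothetical summary of the specialized content listed on this page;
# not an SAP API. Keys are content names, values are supported languages.
SPECIALIZED_CONTENT = {
    "Voice of the Customer": {
        "English", "French", "German", "Spanish", "Simplified Chinese",
    },
    "Public Sector": {"English"},
    # Arabic and Simplified Chinese cover the security content only.
    "Public Sector (security only)": {"Arabic", "Simplified Chinese"},
    "Enterprise": {"English"},
}


def content_for(language: str) -> list:
    """Return the specialized content listed for a language, sorted by name."""
    return sorted(
        name for name, languages in SPECIALIZED_CONTENT.items()
        if language in languages
    )


if __name__ == "__main__":
    for name in ("English", "Simplified Chinese", "Dutch"):
        print(name, "->", content_for(name))
```

An empty result, as for Dutch, means that only the general language module content applies to that language.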

Relevant Data Services Documentation

  • Designer Guide – contains an overview of Text Data Processing and its Entity Extraction transform
  • Reference Guide – provides details on the options and input/output fields supported by the Entity Extraction transform
  • Text Data Processing Extraction Customization Guide – contains information about building dictionaries and extraction rules to create your own extraction patterns to use with the Entity Extraction transform
  • Text Data Processing Language Reference Guide – contains information about the linguistic analysis and extraction features that the Entity Extraction transform provides, as well as a reference section for each language module