Purpose

Custom Grouper User Language (CGUL) is a sentence-based language that enables you to perform pattern matching using character or token-based regular expressions combined with linguistic attributes to define custom entity types. Working with CGUL can be very challenging. The goal of this page is to simplify this process.

Overview

This page offers hints on common CGUL topics, along with examples of custom rules that produce the desired entity output.

Before You Start

Modifying OOB (out-of-the-box) rules is NOT recommended. Doing so risks breaking existing functionality and makes your changes harder to support. Instead, SAP consultants and other groups create new rules that cover the missing functionality. New .rul source files can use the same category names as an OOB rule. Duplicate categories are okay as long as they are not defined more than once within a single .rul file. This approach makes testing much easier.

If you absolutely need to modify existing OOB rules, here are the steps you should take to make sure you do not change existing functionality while adding new patterns:

  1. BACK THEM UP!!
  2. Create a #subgroup for each variation you want to capture.
  3. At the #group level, only use variables that are defined by a #subgroup or a #define.
  4. If the variation is language (word) content, create a #define for it.
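
Step 1 can be scripted so that it is never skipped. The following is a minimal sketch, not an SAP tool: the function name `backup_rule_files` and both directory arguments are hypothetical, and it assumes the rule sources sit as .rul files directly inside one folder.

```python
import shutil
import time
from pathlib import Path


def backup_rule_files(rule_dir, backup_root):
    """Copy every .rul file in rule_dir into a new timestamped folder.

    Both paths are supplied by the caller; nothing here knows where a
    real installation keeps its rule sources.
    """
    rule_dir = Path(rule_dir)
    target = Path(backup_root) / time.strftime("rul-backup-%Y%m%d-%H%M%S")
    # exist_ok=False: refuse to overwrite an earlier backup by accident.
    target.mkdir(parents=True, exist_ok=False)
    copied = []
    for source in sorted(rule_dir.glob("*.rul")):
        # copy2 keeps file timestamps, which helps when comparing versions later.
        shutil.copy2(source, target / source.name)
        copied.append(source.name)
    return target, copied
```

Calling `backup_rule_files(rule_dir, backup_root)` before each edit leaves one untouched copy per session, so a broken rule can be restored by copying the file back.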

Identifying rule changes between TA and TDP.

To determine what changed, review the following files:

For English:

     english-tf.config (Category names may change and some types may no longer be included.)

     english-tf-cg.config (This file shows which CUSTOM rules are loaded. Use this to compare against the same file from previous versions.)

The same files exist for every supported language. If you are working with German, for example, each file name above begins with "german" instead of "english".
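
Comparing two versions of one of these files by eye is error-prone. Below is a minimal sketch using Python's standard `difflib` module. It assumes only that the config files are plain text; the helper names `diff_config` and `changed_lines` are hypothetical, and the sample strings in the usage note are placeholders rather than real config content.

```python
import difflib


def diff_config(old_text, new_text, old_label="old", new_label="new"):
    """Return unified-diff lines describing what changed between two texts."""
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile=old_label,
            tofile=new_label,
            lineterm="",
        )
    )


def changed_lines(diff_lines):
    """Keep only the added (+) and removed (-) lines of a unified diff."""
    # The first two lines are the ---/+++ file headers; the hunks follow.
    return [line for line in diff_lines[2:] if line.startswith(("+", "-"))]
```

For example, `changed_lines(diff_config("alpha\nbeta\ngamma\n", "alpha\ngamma\ndelta\n"))` returns `['-beta', '+delta']`. In practice you would read the old and new copies of the same config file from disk and pass their contents in.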

Rule (.rul) sources can be recompiled under new and older versions.

You can take the .rul files from VOC and compile them with cgc or tf-cgc to run under older or newer versions. Issues could arise from category name changes or, as mentioned previously, from the removal of categories, but in most cases the files should compile without error.

The lingware in each new version of Text Analysis changes continually to improve accuracy and coverage. For VOC, all changes or differences in results would be caused by changes in the rule source files included in the VOC package. Comparing these files across Text Analysis versions will identify the areas that changed and explain why you are seeing differences.

Tagging Seems to Differ from Text Analysis (TA) to Text Data Processing (TDP).

Example:

Using VOC (Voice of the Customer) rules, "loved" and "loves" are not tagged as sentiment in TDP, although they are in TA XI 3.x.


Recommendation:

For the above example, one way to use CGUL to identify forms of "love" is as follows:

CGUL example
<STEM: \p{ci} (lov(e|ed|es|ing))>

This will match regardless of case [love, loves, loved, loving].


Using the suggested sample, the results in TDP are:

Entity Mention Text:      "love"

     Label Path(s):        StrongPositiveSentiment

     Source:               ExtractionRule

     Global Offset+Length: 2 +4

     Global Byte Offset+Length: 2 +4

Entity Mention Text:      "loved"

     Label Path(s):        StrongPositiveSentiment

     Source:               ExtractionRule

     Global Offset+Length: 15 +5

     Global Byte Offset+Length: 15 +5

Entity Mention Text:      "loves"

     Label Path(s):        StrongPositiveSentiment

     Source:               ExtractionRule

     Global Offset+Length: 29 +5

     Global Byte Offset+Length: 29 +5

 

Text source used to produce above results:

I love xyz.

I loved xyz.

I loves xyz.
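
The offsets reported above can be cross-checked outside the product. The sketch below uses Python's `re` module purely as a stand-in for the CGUL pattern; it is not the CGUL engine, and the word-boundary anchors are an approximation of whole-token matching. It also assumes the three sentences are separated by CRLF line endings, which is the spacing that reproduces the offsets 2, 15, and 29.

```python
import re

# Stand-in for the CGUL pattern: case-insensitive, whole-word forms of "love".
LOVE_FORMS = re.compile(r"\blov(?:e|ed|es|ing)\b", re.IGNORECASE)


def find_love_forms(text):
    """Return (matched text, offset, length) for each form of 'love' found."""
    return [
        (m.group(0), m.start(), len(m.group(0)))
        for m in LOVE_FORMS.finditer(text)
    ]


# Assumption: single CRLF line breaks between the sentences.
sample = "I love xyz.\r\nI loved xyz.\r\nI loves xyz."
for mention in find_love_forms(sample):
    print(mention)
# ('love', 2, 4)
# ('loved', 15, 5)
# ('loves', 29, 5)
```

If your own offsets differ from the reported ones by a constant amount per line, the line-ending convention of the input file is the first thing to check.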

A category outscopes another entity, but the other entity is required.

Example:

Using VOC rules, we tried adding certain compound words such as "wanna" so that they are scored as "Request"; however, "wanna" is classified as PERSON_FAM instead.

 

Recommendation:

When a category outscopes another entity and that other entity is required, you need to subtype the larger category. In the example of PERSON_FAM, you could do this:

CGUL Example
#group Request: [TE PERSON_FAM] <>? <\p{ci}(wanna)> [/TE]

 

Results for above rule:

ENTITY NAME: "I wanna"
ENTITY CATEGORY: PERSON
CONFIDENCE: 10
RELEVANCE: 100
METHOD: Unique
NAME CATALOG RECORDS:
OFFSET: 0
LENGTH: 7

SUBENTITY NAME: "I wanna"
ENTITY CATEGORY: PERSON_FAM
CONFIDENCE: 0
RELEVANCE: -1
METHOD: Unique
NAME CATALOG RECORDS:
OFFSET: 0
LENGTH: 7

ENTITY NAME: "I wanna"
ENTITY CATEGORY: Request
CONFIDENCE: 30
RELEVANCE: 44
METHOD: Custom Grouper Entity
NAME CATALOG RECORDS:
OFFSET: 0
LENGTH: 7

 

To narrow the result further so that only "wanna" is tagged as Request, you could do the following:

CGUL Example
#group FAM_BREAK: [TE PERSON_FAM] <>? [OD Request] <\p{ci}(wanna)>[/OD] [/TE]

 

Results for above rule:

ENTITY NAME: "I wanna"
ENTITY CATEGORY: FAM_BREAK
CONFIDENCE: 30
RELEVANCE: 44
METHOD: Custom Grouper Entity
NAME CATALOG RECORDS:
OFFSET: 0
LENGTH: 7

SUBENTITY NAME: "wanna"
ENTITY CATEGORY: Request
CONFIDENCE: 0
RELEVANCE: -1
METHOD: Custom Grouper Entity
NAME CATALOG RECORDS:
OFFSET: 2
LENGTH: 5
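
To see why the subentity lands at offset 2 with length 5, it can help to model the rule with an ordinary regular expression. This is only an analogy, sketched in Python under two assumptions: the enclosing PERSON_FAM entity text is already known, and a named capture group behaves roughly like the [OD Request] output delimiter. The names `FAM_BREAK` and `request_inside` are illustrative and are not part of any SAP API.

```python
import re

# Stand-in for: [TE PERSON_FAM] <>? [OD Request] <\p{ci}(wanna)>[/OD] [/TE]
# One optional leading token, then "wanna"; the named group plays the role of [OD].
FAM_BREAK = re.compile(r"(?:\S+\s+)?(?P<request>wanna)", re.IGNORECASE)


def request_inside(entity_text, entity_offset=0):
    """Return (text, offset, length) of the Request part, or None if no match."""
    # fullmatch mirrors the idea that the pattern must cover the whole entity.
    m = FAM_BREAK.fullmatch(entity_text)
    if m is None:
        return None
    text = m.group("request")
    return (text, entity_offset + m.start("request"), len(text))
```

With this sketch, `request_inside("I wanna")` returns `('wanna', 2, 5)`, matching the subentity above, while `request_inside("I really wanna")` returns `None` because only one optional token is allowed before "wanna".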
