Summary

The Entity Extraction transform includes options that control which language, dictionaries, and rules to use for extraction. The Processing Options group includes specific configuration parameters for processing.

Common

The Common option group includes a setting to run the transform as a separate process.

Option

Description

Run as Separate Process

Yes: Splits the transform into a separate process.
Note:
The Entity Extraction transform is always run as a separate process. You cannot change the value of this option.

Languages

The Languages option group includes settings to process content in different languages, such as English, German, and French. If the input content is in a language other than the specified languages, you might see unexpected results.

Note:

Predefined entities are entities associated with different languages and are part of the language modules. These entities are extracted by default.

Option

Description

Language

Specifies the language for processing your content. The available languages are listed alphabetically in the drop-down menu.
The default language setting is 'Auto', which lets the transform identify the language automatically. Select another language from the list to override the automatic identification. If the transform cannot identify the language, it uses the setting of the Default Language option.
Note: If content arrives in XML, and is organized in sections, the transform identifies the language of each section.
If you select a language other than 'Auto', only entities defined for that language will be extracted for the entire document.
When the language is set to 'Auto', any specified dictionaries or rule files whose names do not identify a language are always applied. However, dictionaries and rule files whose file names do identify a language (for example, English) are applied only to input identified as being in that language.
Note:
You will not be able to run extraction unless you have a language directory that contains the files for at least one language. By default, the language directory is installed on a client as well as a server during installation.
The default location for the language directory is: <LINK_DIR>/TextAnalysis/languages.
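The file-name behavior described for the 'Auto' setting can be sketched as a small predicate. This is only an illustration, not the transform's implementation: the function name, the example file names, and the substring match used to decide whether a file name "identifies a language" are all assumptions.

```python
from typing import Iterable


def file_applies(file_name: str, identified_language: str,
                 installed_languages: Iterable[str]) -> bool:
    """Decide whether a dictionary or rule file applies when Language is 'Auto'.

    Assumption: a file "identifies a language" when an installed language
    name appears anywhere in its file name (case-insensitive).
    """
    name = file_name.lower()
    named = [lang.lower() for lang in installed_languages if lang.lower() in name]
    if not named:
        # No language in the file name: the file is always applied.
        return True
    # Language in the file name: apply only to input identified in that language.
    return identified_language.lower() in named
```

For example, with English and German installed, a hypothetical file named `products_dictionary` applies to all input, while `english_products_dictionary` applies only to input identified as English.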

Default Language

Specifies the default language that the transform should assume if the Language option is set to 'Auto' and the transform cannot identify a language.
If 'None' is chosen as the value for Default Language and the language cannot be identified, a non-fatal error occurs and processing continues. If another available language is chosen as the default language and the transform fails to identify a language on the first attempt, only entities defined for the chosen language are extracted.
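The interaction between the Language and Default Language options can be summarized in a short sketch. The function and parameter names are hypothetical, and returning `None` stands in for the non-fatal error case; the transform itself is configured through its options, not through code like this.

```python
from typing import Optional


def resolve_language(language_option: str, detected: Optional[str],
                     default_language: str) -> Optional[str]:
    """Return the language whose entities are extracted, or None if unresolved."""
    if language_option != "Auto":
        # An explicit selection overrides automatic identification.
        return language_option
    if detected is not None:
        # Automatic identification succeeded.
        return detected
    if default_language != "None":
        # Identification failed: fall back to the Default Language option.
        return default_language
    # Default Language is 'None': non-fatal error, processing continues.
    return None
```

With Language set to 'Auto' and identification failing, `resolve_language("Auto", None, "English")` falls back to English, while a Default Language of 'None' yields no language.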

Filter By Entity Types

Specifies a list of entity types (supported by the selected language) to use for filtering the extraction output.
Note:
When Language is set to 'Auto', this list is the union of all of the available entity types from the installed languages.
By default, a drop-down menu showing '...' displays. Clicking this launches the "Ordered Options Window - [BOBJ:Filter By Entity Types - Option]" dialog.
Select or remove one or more entity types from the list of available entity types for that language.
Note:
Entity type support varies among languages.


Processing Options

The Processing Options group includes configuration settings for the transform. They affect how the transform will process the content before generating the extraction output.

The Dictionary Only option is most useful when you want to extract entities based solely on entities defined in a dictionary. For example, you want to match exactly the product and customer names from your custom dictionary and you are not interested in any other extraction output. In such a case, the predefined entities that the extraction process would otherwise return are not of interest.

The Processing Timeout option is most useful when you want to limit the amount of time spent on processing large content or content that takes a very long time to process.

Option

Description

Dictionary Only

Use this option to limit the extraction process to use entities defined only in the specified dictionaries. You must specify a dictionary file to use this option.
Note:
If you select this option, the extraction output will not include any predefined entities. If you also select the Rule option, the extraction output will include the entities and facts defined in the rules along with the entities from the specified dictionaries.

Advanced Parsing

Specifies whether advanced parsing information should be produced during extraction. Advanced parsing enriches linguistic processing with richer noun phrase structure, noun phrase coordination, and syntactic function attributes, all of which can be leveraged in custom rules.
This option is available only for the English language. By default, YES and NO display. If you select the YES option for non-English languages, an error message displays.

Processing Timeout

Use this option to stop processing the content after a set amount of time. By default, the Processing Timeout option is set to 60 seconds. The Processing Timeout value can be one of the following:

  • -1 indicates no timeout should be enforced.
  • >=1 indicates the amount of time (in seconds) after which processing should abort.
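The two accepted forms of the Processing Timeout value can be checked with a small helper. This is a minimal sketch with a hypothetical function name; rejecting 0 and values below -1 is an assumption, since the description above does not say how such values are handled.

```python
from typing import Optional


def normalize_timeout(seconds: int = 60) -> Optional[int]:
    """Map a Processing Timeout value to a limit in seconds, or None for no limit."""
    if seconds == -1:
        # -1: no timeout is enforced.
        return None
    if seconds >= 1:
        # >=1: processing aborts after this many seconds.
        return seconds
    # Assumption: any other value is treated as invalid in this sketch.
    raise ValueError(f"unsupported Processing Timeout value: {seconds}")
```

Calling `normalize_timeout()` with no argument returns the documented default of 60 seconds.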

Document Properties

Specifies whether document properties of a binary document should be extracted, if they are present in the document. A value of YES causes the extraction, and a value of NO (the default) causes no extraction.

Document properties are name-value pairs. The Entity Extraction transform extracts only the following document properties for binary documents:

  • APP_NAME: The name of the software that was used to create the document
  • APP_VERSION: The version of the software that was used to create the document
  • AUTHOR: The name of the person who created the document
  • COMPANY: The name of the company that owns the document
  • TITLE: The title of the document
  • DATE_CREATED: The date on which the document was created

Document properties, if available, are extracted as entities. The SOURCE for the properties is called DOC_PROPERTY and only the following fields are defined for DOC_PROPERTY entities:

  • ID: The entity ID of the document property
  • SOURCE: DOC_PROPERTY
  • TYPE: The name of the document property
  • STANDARD_FORM or SOURCE_FORM: The value of the document property
  • CONVERTED_TEXT: The textual content of the binary document

Any other output columns are not applicable to DOC_PROPERTY extraction rows, and have their value set to -1.
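To make the shape of a DOC_PROPERTY row concrete, the following sketch builds one as a dictionary. It is illustrative only: the function name and the extra column `OFFSET` in the usage note are hypothetical, and filling both STANDARD_FORM and SOURCE_FORM with the property value is an assumption based on the field list above.

```python
from typing import Dict, Iterable, Union

# Property names taken from the list of extracted document properties.
DOC_PROPERTY_NAMES = ("APP_NAME", "APP_VERSION", "AUTHOR",
                      "COMPANY", "TITLE", "DATE_CREATED")


def doc_property_row(entity_id: int, name: str, value: str, converted_text: str,
                     output_columns: Iterable[str]) -> Dict[str, Union[int, str]]:
    """Build one extraction row for a document property."""
    if name not in DOC_PROPERTY_NAMES:
        raise ValueError(f"not an extracted document property: {name}")
    # Columns that do not apply to DOC_PROPERTY rows are set to -1.
    row: Dict[str, Union[int, str]] = {column: -1 for column in output_columns}
    row.update({
        "ID": entity_id,
        "SOURCE": "DOC_PROPERTY",
        "TYPE": name,
        "STANDARD_FORM": value,   # assumption: both form fields carry the value
        "SOURCE_FORM": value,
        "CONVERTED_TEXT": converted_text,
    })
    return row
```

For instance, passing `name="AUTHOR"` together with an output column list that also contains a hypothetical `OFFSET` column produces a row whose TYPE is AUTHOR and whose OFFSET is -1.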


Dictionaries

The Dictionaries option group includes settings to process content by specifying one or more dictionaries that should be used when performing extraction. It also enables filtering by entity types defined in each dictionary.

The Dictionaries option group consists of individual dictionaries. You can configure the transform to use multiple dictionaries. These options are found under Dictionaries > Dictionary > Dictionary File.

Option

Description

Dictionary

Use this option to add dictionaries that should be used during extraction or delete an existing dictionary. Right-click this option and select the option to duplicate an entry or to delete an entry.
Once the entry is duplicated, change the duplicate dictionary file by selecting the dictionary to use from the directory structure.

Dictionary File

Use the Browse option under the drop-down menu to select a valid, compiled dictionary file to use for extraction.
Note:
To be used during extraction, the dictionaries must be accessible to the job server. If the dictionary files are located on a remote computer, specify a path to those files that the job server can resolve.

Filter By Entity Types

Specifies a list of entity types (defined in the selected dictionary) to use for filtering the extraction output.
By default, a drop-down menu showing '...' displays. Clicking this launches the "Ordered Options Window - [BOBJ:Filter By Entity Types - Option]" dialog.
Select or remove one or more entity types from the available entity list.


Rules

The Rules option group includes settings to process content by specifying one or more extraction rules to use when performing extraction. It also enables filtering by rule names defined in each rule file.

The Rules option group includes individual rules. You can configure the transform to use multiple rules. These options are found under Rules > Rule > Rule File.

Option

Description

Rule

Use this option to add rules that should be used during extraction or to delete an existing rule. Right-click this option and select the option to duplicate an entry or to delete an entry.
Once the entry is duplicated, change the duplicate rule file by selecting the rule you want to use from the directory structure.

Rule File

Use the Browse option under the drop-down menu to select a valid, compiled rule file to use for extraction.
To be used during extraction, the rules must be accessible to the job server. If the rule files are located on a remote computer, specify a path to those files that the job server can resolve.
Note:
A rule file typically contains multiple rules. You can use the rule filtering option to select a specific rule in a rule file.

Filter By Rule Names

Specifies a list of rule names (defined in the selected rule file) to use for filtering the extraction output.
By default, a drop-down menu showing '...' displays. Clicking this launches the "Ordered Options Window - [BOBJ:Filter By Rule Names - Option]" dialog.
Select or remove one or more rules from the filtering list.
