Summary

The Entity Extraction transform requires that you map fields on input and output. These mappings tell the transform how to process your data.

Input Fields

The following are Data Services recognized input fields that you can use in the input mapping for the Entity Extraction transform.

Name

Data Type

Description

TEXT

Long, blob, or varchar

Content to be processed by the transform to extract entities and/or facts. The content must be in a text format such as plain text, HTML, or XML, or in certain binary formats (such as PDF and Word). Refer to the Reference Guide for a list of supported binary formats.

TEXT_ID

Long, int, or varchar

Optional field. Unique identifier to be used for tracing the content in case of an error.
Note: An unsupported data type is ignored at runtime; instead, either the file name (if read from an unstructured text file format) or the string TEXT input field is used as the content identifier.

  • When a varchar or long column is mapped to the TEXT_ID input field and a value used to construct an error message contains more than 1K bytes, the value will be truncated to 1K.

Output Fields

The following are Data Services recognized output fields that you can use in the output mapping for the Entity Extraction transform. The fields are listed in the order they appear on the Output tab. 

Generated Field Name

Data Type

Description

ID

int

The identifier of an entity or fact. PARENT_ID values refer to it to represent a parent-child relationship between entities and/or facts. This value is unique within the scope of the processed input text.
Note:
If you process two different input documents using the same data flow and store the output to a database, you should not use this field as a primary key.

PARENT_ID

int

Represents a parent-child relationship between entities and/or facts. If present, it provides a link to a parent ID value. If not present, this value is set to -1 to indicate there is no relationship.

STANDARD_FORM

varchar (2000)

The standard form of an entity, subentity, fact, or subfact. Generally it is the longest, most precise or official name associated with the value of the corresponding TYPE column. It may be normalized, such as converting a date to its ISO normalized form.
Note:
The standard form and the source form for an entity are often the same.

TYPE

varchar (255)

The type of an entity or fact. It may also represent subentity types, subtypes, and subfact types, if applicable. For example, "Mr. Jones" is identified as a PERSON entity, "Mr." as a PERSON_PRE subentity, and "Jones" as a PERSON_FAM subentity.
Note:
"/" is used as a separator to identify subtypes.

SOURCE_FORM

varchar (2000)

The name of an entity, subentity, fact, or subfact as mentioned in the input text.

SOURCE

varchar (10)

The origin of an entity or fact, that is, how the match was determined. It is one of the following:

  • SYSTEM - indicating that the entity was matched using the system files.
  • DICTIONARY - indicating that the entity was matched using a dictionary.
  • RULE - indicating that the entity or fact was matched using an extraction rule file.

OFFSET

int

The character offset of an entity or a fact in the CONVERTED_TEXT field.

LENGTH

int

The character length of an entity or a fact in the CONVERTED_TEXT field.

PARAGRAPH_ID

int

A unique identifier of the paragraph in the CONVERTED_TEXT field containing the entity or fact.

SENTENCE_ID

int

A unique identifier of the sentence in the CONVERTED_TEXT field containing the entity or fact.

CONVERTED_TEXT

long

The content of the input text, represented in UTF-16 encoding. This field is output only for the first entity row of each input row.

LANGUAGE

varchar (20)

The language of the extraction.
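
The following Python sketch illustrates how the output fields relate to one another once the rows have been loaded downstream. It is not part of Data Services. The EntityRow class, the helper functions, and the sample rows (including the ID values and the "doc-1"/"doc-2" identifiers) are made up for illustration. It assumes that each output row carries the TEXT_ID of its input row, and that OFFSET and LENGTH can be used directly as Python string indexes, which may not hold for text containing characters outside the Basic Multilingual Plane.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EntityRow:
    """Hypothetical in-memory copy of one Entity Extraction output row."""
    text_id: str                    # assumed: TEXT_ID of the input row, passed through
    id: int                         # ID, unique only within one input text
    parent_id: int                  # PARENT_ID, -1 when there is no parent
    type: str                       # TYPE
    offset: int                     # OFFSET into CONVERTED_TEXT
    length: int                     # LENGTH in CONVERTED_TEXT
    converted_text: Optional[str]   # CONVERTED_TEXT, set on the first entity row only


def source_spans(rows):
    """Slice each entity out of CONVERTED_TEXT, carrying the text forward
    from the first entity row of each input row."""
    text_by_input = {}
    spans = []
    for row in rows:
        if row.converted_text is not None:
            text_by_input[row.text_id] = row.converted_text
        text = text_by_input[row.text_id]
        span = text[row.offset:row.offset + row.length]
        # (text_id, id) is used as the key because ID alone repeats across inputs.
        spans.append(((row.text_id, row.id), row.type, span))
    return spans


def children_by_parent(rows):
    """Group child keys under their parent key, skipping rows with PARENT_ID -1."""
    tree = {}
    for row in rows:
        if row.parent_id != -1:
            parent_key = (row.text_id, row.parent_id)
            tree.setdefault(parent_key, []).append((row.text_id, row.id))
    return tree


# Made-up sample rows for two input texts; note that ID 1 appears in both.
rows = [
    EntityRow("doc-1", 1, -1, "PERSON", 0, 9, "Mr. Jones visited Paris."),
    EntityRow("doc-1", 2, 1, "PERSON_PRE", 0, 3, None),
    EntityRow("doc-1", 3, 1, "PERSON_FAM", 4, 5, None),
    EntityRow("doc-2", 1, -1, "PERSON", 0, 5, "Jones left."),
]

spans = source_spans(rows)
tree = children_by_parent(rows)

for key, entity_type, span in spans:
    print(key, entity_type, repr(span))
print(tree)
```

Keying on the pair (TEXT_ID, ID) rather than on ID alone is one way to follow the note that ID should not serve as a primary key when several input documents pass through the same data flow.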
