Summary

This section describes recommended best practices for using the Entity Extraction transform for text data processing in Data Services 4.1 SP1.

Improving transform performance

You can assess the Entity Extraction transform performance using your own data by applying the following formula:

            Throughput (MB/hr) = Total size of all input text in megabytes / Total processing time in hours

If your total processing time is too long, your throughput is not high enough, or the CPU(s) on the installation machine are not being fully utilized, you can usually gain a significant performance improvement by increasing the Degree of Parallelism (DOP) property value for any data flow containing this transform. A good starting value for DOP is the number of CPUs on your Job Server machine, or that number plus one.
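
For example, with purely illustrative numbers: a job that processes 5,120 MB of input text in 2.5 hours has a throughput of 5,120 / 2.5 = 2,048 MB/hr. If the Job Server machine has eight CPUs and they are not fully utilized during the run, setting DOP to 8 or 9 on that data flow is a reasonable starting point.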

Increasing the DOP is especially important if your transform configuration relies on one or more custom extraction rules.

Improving TDP Hadoop pushdown performance

The mapred.tasktracker.map.tasks.maximum property in the configuration file (located at $HADOOP_HOME/conf/core-site.xml or an alternate configuration location, depending on the Hadoop distribution you are using) should be set on each machine in the Hadoop cluster to between 1x and 2x the number of cores on that machine.
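
For example, on a machine with eight cores you might allow between 8 and 16 concurrent map tasks. The value shown below (12) is purely illustrative:

<property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>12</value> <!-- illustrative: between 1x and 2x the number of cores on this machine -->
</property>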

When using text data processing in the Hadoop framework, the amount of data a mapper can handle, and consequently the number of mappers a job uses, is controlled by the Hadoop configuration setting mapred.max.split.size.

You can set the value for mapred.max.split.size in the Hadoop configuration file (located at $HADOOP_HOME/conf/core-site.xml or an alternate configuration location, depending on the flavor of Hadoop you are using).

By default, the value of mapred.max.split.size is 0, which means that there is no limit and text data processing runs with only one mapper. Change this value to the amount of data you want each mapper to handle.

For example, you might have a Hadoop cluster that contains twenty machines and each machine is set up to run a maximum of ten mappers (20 x 10 = 200 mappers available in the cluster). The input data averages 200 GB. If you want the text data processing job to consume 100 percent of the available mappers (200 GB / 200 mappers = 1 GB per mapper), you would set mapred.max.split.size to 1073741824 (1 GB).

<property>
    <name>mapred.max.split.size</name>
    <value>1073741824</value>
</property>

If you want the text data processing job to consume 50 percent of the available mappers (200 GB / 100 mappers = 2 GB per mapper), you would set mapred.max.split.size to 2147483648 (2 GB).
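
The corresponding entry in the configuration file would be:

<property>
    <name>mapred.max.split.size</name>
    <value>2147483648</value>
</property>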

Debugging transform output

After executing a job containing an Entity Extraction transform, you may review the results and find either: 1) missing results or 2) incorrectly typed results. You can usually resolve both situations by changing your transform configuration.

  • Missing results -- occur if entity type or rule name filters have been applied, a result has been incorrectly typed, or an expected result is simply not extracted. First, check that you have not misapplied a filter. Next, if you believe the result may have been incorrectly typed, see the next bullet. If a result is not extracted at all, for example because a person's name is not capitalized, you can override the default extraction behavior by specifying the text in a dictionary (a sketch of a dictionary entry follows this list).
  • Incorrectly typed results -- occur when a result is typed as something you would not expect, such as Mercedes being identified as the first name of a PERSON instead of a type of VEHICLE, or when it is assigned to the PROP_MISC (proper noun, miscellaneous) type. The PROP_MISC entity type is often a catch-all indicating that a result is important but its type could not be determined. In both cases, you can override the default extraction behavior by specifying the text to extract in a dictionary.
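
For example, the Mercedes case above could be addressed with a custom dictionary entry along the following lines. This is only a rough sketch: verify the exact element names and structure against the dictionary schema described in the Text Data Processing extraction customization documentation before compiling the dictionary.

<?xml version="1.0" encoding="UTF-8"?>
<dictionary>
    <!-- illustrative entry: lists Mercedes under the VEHICLE category -->
    <entity_category name="VEHICLE">
        <entity_name standard_form="Mercedes">
            <variant name="Mercedes-Benz"/>
        </entity_name>
    </entity_category>
</dictionary>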

Accessing social media content

Data Services does not provide a specific reader to access Web content or social media content. However, if you can get the data into a database or onto a file system, the provided readers can access it.

There are a few methods you can use to get the data:

  1. Write a custom application using the APIs provided by some services, such as Twitter or Amazon, to directly retrieve the content you want to process and store this to a database or to disk. See the English Twitter blueprint job for an example of this method.
  2. Use tools built around some of these services to search for content and manually save the results as Web pages on disk.
  3. Use a web crawler service, such as Kapow Software or Fetch Technologies, to identify and store specific content from Web pages to a database.

Using transform output

The Entity Extraction transform results are always output in a flat structure. This structure uses the ID and PARENT_ID column values to retain hierarchical information, so the output can be converted for easier storage and consumption when performing querying and analysis.

You can associate output rows when one row refers to the ID value of another as its PARENT_ID. However, this can only be done for entities and facts extracted from the same piece of text; otherwise, ID and PARENT_ID values are not unique and can collide.

A value of -1 for a PARENT_ID indicates there is no relationship to other output rows.
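
As a rough sketch, if the transform output for a single piece of text has been landed in a database table (called TDP_OUTPUT here purely for illustration) with the ID, PARENT_ID, TYPE, and STANDARD_FORM fields, a simple self-join reassembles the parent/child relationships:

SELECT parent.STANDARD_FORM AS fact,
       child.TYPE           AS sub_fact_type,
       child.STANDARD_FORM  AS sub_fact
FROM   TDP_OUTPUT parent
JOIN   TDP_OUTPUT child
  ON   child.PARENT_ID = parent.ID      -- child rows point at their parent row
WHERE  parent.PARENT_ID = -1;           -- top-level facts only
-- TDP_OUTPUT is assumed to hold output from a single input text;
-- otherwise include a document key in the join as well.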

Voice of the Customer output

Some sentences processed during sentiment analysis contain multiple sentiments. When this occurs, you will receive multiple sentiment sub-facts, and potentially multiple topic sub-facts, associated with the same fact.

For example, the sentence “I hate Xxxx because they have very bad customer service and their products are very expensive.” would result in the following output if the ID, PARENT_ID, STANDARD_FORM, and TYPE output fields were selected: 

ID | PARENT_ID | STANDARD_FORM                                                                                   | TYPE
1  | -1        | I hate Xxxx because they have very bad customer service and their products are very expensive. | Sentiment
2  | 1         | hate                                                                                            | StrongNegativeSentiment
3  | 1         | Xxxx                                                                                            | Topic
4  | 1         | very bad                                                                                        | StrongNegativeSentiment
5  | 1         | customer service                                                                                | Topic
Using the ID and PARENT_ID columns alone in this case may not be enough to associate the correct sentiment with the correct topic(s). To work around this limitation, use the OFFSET and LENGTH values of each sub-fact and relate sentiments and topics by their proximity within the fact's STANDARD_FORM, as in the sketch below.
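
A rough sketch of this workaround, assuming the OFFSET and LENGTH output fields were also selected and the output is stored in the same illustrative TDP_OUTPUT table: for each sentiment sub-fact, pick the topic sub-fact under the same parent fact whose OFFSET is closest.

SELECT s.STANDARD_FORM AS sentiment,
       t.STANDARD_FORM AS topic
FROM   TDP_OUTPUT s
JOIN   TDP_OUTPUT t
  ON   t.PARENT_ID = s.PARENT_ID        -- same parent fact
 AND   t.TYPE = 'Topic'
WHERE  s.TYPE LIKE '%Sentiment%'        -- e.g. StrongNegativeSentiment
  AND  ABS(t.OFFSET - s.OFFSET) =
       (SELECT MIN(ABS(t2.OFFSET - s.OFFSET))
        FROM   TDP_OUTPUT t2
        WHERE  t2.PARENT_ID = s.PARENT_ID
          AND  t2.TYPE = 'Topic');      -- nearest topic by character offset
-- Note: OFFSET may need to be quoted or renamed in databases where it is a reserved word.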

File Formats

  • Fixed-Width -- the file format should always be configured with Delimiters > Row = "none" unless you want to process the file as separate rows, which does not work well when a text, HTML, or XML file is represented as a single column.
  • Delimited -- the file format should only be used when processing a column's value as input, where each row represents a different document. If a delimited file is used to store extraction output, use a character such as a single quote (') or double quote (") to surround any text. This insulates the rest of the output: entity names often contain newline characters which, when written to a text file, break row parsing because rows end up with an incorrect number of columns.

Processing input data from multiple columns

You can combine data from multiple columns into a single column to process the contents as a single piece of text. There are two methods to do so: 1) pre-process the data, or 2) concatenate it during job execution.

The first method is to pre-process your data to concatenate the contents of each column into a single column. For example, in a database table you may create another column to hold the contents of the concatenated columns.

Alternatively, you can use the "||" operator within a Query transform to concatenate multiple columns into a new output field that represents a single piece of text to process.
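
For example, you might create a new output column in the Query transform (called TEXT_ALL here; all column names are illustrative) and set its mapping to an expression such as:

    SUBJECT || ' ' || DESCRIPTION || ' ' || RESOLUTION_NOTES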

Once the columns have been concatenated together, map the new column to the TEXT input field exposed by the Entity Extraction transform.

Processing input data from multiple rows

Some data you need to process as a single piece of text may be split across multiple rows. You can use a Reverse_Pivot transform to convert the rows to columns, and then use a Query transform to combine those columns into a single column that can be processed as a single piece of text.

Pre-processing input data

There may be cases where you need to manipulate the contents of input data before passing it to the Entity Extraction transform for text data processing. For example, if every document you process contains a footer mentioning the same company name, that company will be extracted many times; this may be considered noise that should be reduced.

Additionally, there may be punctuation in your input data that would lower the quality of the extraction results if it is not cleaned up. For instance, if the input contains a string such as "/leaking-coil", you can use Data Services to clean it before it is processed so that it reads "leaking coil".

You can manipulate the input data by using a Query transform and invoking one of the many functions provided by Data Services. If you use the new unstructured text file format, you must apply the long_to_varchar function to the Data field before manipulating the data with any other functions.
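
For example (the length argument and replacement values are illustrative; check the function signatures in the Data Services Reference Guide), a Query transform mapping such as the following converts the long Data field to varchar and cleans up the problematic substring from the earlier example:

    replace_substr(long_to_varchar(Data, 100000), '/leaking-coil', 'leaking coil')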
