Entity extraction with Text Analysis XI 3.x can be a resource-intensive process. This page explains some of the gotchas and how to avoid them.
It lists the most common things that affect performance during entity extraction and explains the best ways to deal with a few of them.
1. Watch your file size. The recommendation (per SAP ENG, not in the TA documentation) is to make sure the files you crawl are no larger than 5MB. Break larger files into smaller chunks.
2. Entity density (the number of entities in a given source) directly affects performance. To the best of your ability, know your data and disable the entities you don't need. For instance, when using Processing Manager to create a crawl, only enable the entity checkboxes for the entities you're really interested in. If you aren't using Processing Manager, this can be done on the back end by editing the tf.config file for the language you are crawling. The file should be located at:
C:\Program Files\Business Objects\BusinessObjects Text Analysis\tf_cruncher\etc\lang
For example, to disable the PERSON entities in english-tf.config, note that PERSON uses an item grouper ID called "pp". Every instance of "pp" within english-tf.config must therefore be commented out.
3. Understand that crawling for numeric entities (DATE, for example) can be a real bear, especially if the content you are extracting from is filled with numbers. The crawl will look at EVERY possible number combination to determine the numeric entity. This process can take a REALLY long time.
4. Consider setting up multiple crawls when something is taking a long time. Have one crawl search for numeric entities in a file and run it at a time of day when it won't matter so much if it takes a while. Then have a second crawl search for, say, PERSON or ORGANIZATION. Combine the results of both crawls after the fact.
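The file-size advice in item 1 can be scripted. The following is a minimal sketch, not an SAP-provided tool: it assumes the source is a line-oriented text file and that "5MB" means 5 * 1024 * 1024 bytes. The function name `split_file` and the `_partNNN` naming scheme are made up for illustration.

```python
from pathlib import Path

# Assumption: "5MB" is read as 5 * 1024 * 1024 bytes.
MAX_CHUNK_BYTES = 5 * 1024 * 1024


def split_file(source: Path, out_dir: Path, max_bytes: int = MAX_CHUNK_BYTES) -> list[Path]:
    """Split source into numbered parts of at most max_bytes each.

    Parts break at line ends where possible so text is not cut mid-line.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    parts: list[Path] = []
    buffer: list[bytes] = []
    size = 0

    def flush() -> None:
        # Write the buffered lines to the next numbered part file.
        nonlocal buffer, size
        if not buffer:
            return
        part = out_dir / f"{source.stem}_part{len(parts) + 1:03d}{source.suffix}"
        part.write_bytes(b"".join(buffer))
        parts.append(part)
        buffer, size = [], 0

    with source.open("rb") as handle:
        for line in handle:
            # A single line longer than the limit is cut into fixed-size
            # pieces (this may split a multi-byte character).
            while len(line) > max_bytes:
                flush()
                buffer, size = [line[:max_bytes]], max_bytes
                flush()
                line = line[max_bytes:]
            if not line:
                continue
            if size + len(line) > max_bytes:
                flush()
            buffer.append(line)
            size += len(line)
    flush()
    return parts
```

A call such as `split_file(Path("big_report.txt"), Path("chunks"))` (both names hypothetical) would write `big_report_part001.txt`, `big_report_part002.txt`, and so on into the `chunks` folder, which can then be crawled instead of the original file.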
Voice of the Customer - A Special Case
When using VOC rules, either with the ThingFinder SDK or via Processing Manager in Text Analysis, you MUST use a load balancer. With VOC, or any custom CGUL rule, a limitation in the ThingFinder 4.3 software causes entity extraction to become mostly single threaded, even though the server is designed to be multi-threaded. The results are poor performance, processing that seems never-ending, and likely timeouts.
The workaround for this is to run multiple ThingFinder extraction services on the same machine. To do this you need to implement a software or hardware load balancer.
So, how do you create multiple ThingFinders? Here are the steps:
1. Shut down TSP platform services.
2. Navigate to the ThingFinder_1.config.properties file, located at:
C:\Program Files\BusinessObjects\BusinessObjects Text Analysis\CruncherAgent\etc
3. Create a copy of ThingFinder_1.config.properties for each cruncher you intend to run. Give each copy a unique name, for example ThingFinder_2.config.properties and ThingFinder_3.config.properties.
4. Using a file editor like Notepad++, locate "arg.port=7300" in each new config.properties file and give each one a unique port to use. For instance:
ThingFinder_1.config.properties - 7300 (default)
ThingFinder_2.config.properties - 7301
ThingFinder_3.config.properties - 7302
5. Save the changes to your files.
6. Restart TSP platform services.
7. Navigate to http://localhost:9880/SDXAdmin and click on "Text Services" in the left-hand pane.
8. All of the new crunchers should be displayed, running on the same system, registered under a single Admin, and watched by a single agent.
9. The only thing left is to install a load balancer of your choice and configure it to balance across these instances.
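Before pointing a load balancer at the instances, it can help to confirm that each configured port is accepting connections. The sketch below assumes the crunchers listen for plain TCP connections on their `arg.port` value. It only shows that something is listening on the port, not that the listener is a healthy ThingFinder service. The function names are illustrative.

```python
import socket

# Ports from the example above; adjust to match your own config files.
CRUNCHER_PORTS = [7300, 7301, 7302]


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_crunchers(host: str = "localhost", ports=CRUNCHER_PORTS) -> dict[int, bool]:
    """Map each port to whether it currently accepts TCP connections."""
    return {port: port_is_open(host, port) for port in ports}


if __name__ == "__main__":
    for port, is_up in check_crunchers().items():
        print(f"port {port}: {'listening' if is_up else 'NOT listening'}")
```

Any port reported as not listening is worth checking against SDXAdmin and the matching config.properties file before it is added to the load balancer pool.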
IMPORTANT: SAP cannot recommend which load balancer to use. If you have an issue with your load balancer, you must contact that vendor for support.
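Steps 3 and 4 above (copying the config file and assigning ports) can also be scripted when many crunchers are needed. This is a minimal sketch under stated assumptions: the only per-instance change is the `arg.port=7300` line, ports are assigned sequentially from 7300, and the function name `clone_cruncher_configs` is hypothetical. Whether other settings in the file must also be unique is not covered here, so review the generated files before restarting services.

```python
from pathlib import Path

BASE_NAME = "ThingFinder_1.config.properties"
BASE_PORT = 7300  # default port named in step 4


def clone_cruncher_configs(etc_dir: Path, count: int) -> list[Path]:
    """Create ThingFinder_2 .. ThingFinder_<count> configs with unique ports.

    The file is handled as raw bytes so its encoding and line endings
    are preserved exactly.
    """
    template = (etc_dir / BASE_NAME).read_bytes()
    marker = f"arg.port={BASE_PORT}".encode("ascii")
    if marker not in template:
        raise ValueError(f"'arg.port={BASE_PORT}' not found in {BASE_NAME}")

    created: list[Path] = []
    for index in range(2, count + 1):
        # ThingFinder_2 gets 7301, ThingFinder_3 gets 7302, and so on.
        port = BASE_PORT + index - 1
        target = etc_dir / f"ThingFinder_{index}.config.properties"
        target.write_bytes(template.replace(marker, f"arg.port={port}".encode("ascii")))
        created.append(target)
    return created
```

Run it with TSP platform services shut down, passing the CruncherAgent etc folder from step 2 and the total number of crunchers you want, for example `clone_cruncher_configs(etc_dir, 3)` to produce the three files listed in step 4.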