You may experience the following issues as a result of TREXIndexServer crashes:
- Queries return with communication errors.
- Indexing failures.
- Queries become unresponsive.
Content of SAP Note 1872501
If high memory consumption in the BWA is the reason for the TREXIndexServer crash, the query that caused the high memory consumption needs to be optimized. The following points can be checked to find the reason for the TREXIndexServer crash:
- Verify whether the TREXIndexServer process has crashed and was restarted
It is possible that the end user is not aware that the TREXIndexServer has crashed. Before starting to analyze the BWA problem, it should be verified that the crash has indeed occurred. The TREXDaemon process makes sure that all BWA processes are always restarted after they have crashed or were manually killed. There are a few ways to verify that a BWA service has stopped and was restarted.
Note: A BWA service is a Linux OS process. The words “service” and “process” are used interchangeably in this document.
a. Check the BWA alerts in transaction RSDDBIAMON2
Check the alerts reported by the alert server in transaction RSDDBIAMON2. If there was a crash and core file creation is activated on the system (see SAP Note 1553477), the alert server will issue a warning that core files were generated. Please note that the absence of core files does not mean that the TREXIndexServer did not crash; however, if core files were generated, this is a clear indicator that a crash has occurred. Also note that core files are generated for any process that crashes, so it is necessary to double-check that the generated core files belong to the TREXIndexServer process.
Note: In Linux, the generation of core files can be enabled or disabled (with the Linux command ulimit). Core files are generated when a process terminates abnormally and core file generation is enabled. If you kill a process while core files are enabled, writing the dump might take up to an hour, which means downtime for the system. If the service restarts by itself and core files are activated, the automatic restart will also take time; for this reason some administrators choose to disable core file generation.
The core files are located on the BWA blades, in the folder /usr/sap/<SID>/TRX<id>/<blade_name>, where <blade_name> is the name of the blade on which the crash occurred.
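As a minimal sketch, the core file setting can be checked and changed per shell with ulimit. The commands below only touch the current shell's soft limit; enabling core files system-wide is an administrative decision, as noted above.

```shell
# Show the current core file size limit for this shell:
# "0" means core file generation is disabled; "unlimited" (or a size) means enabled.
ulimit -c

# Some administrators disable core files to avoid long dump/restart times;
# lowering the soft limit is always permitted:
ulimit -S -c 0
ulimit -c    # now prints 0

# To enable core files, an administrator would raise the limit instead,
# e.g.: ulimit -c unlimited   (requires a sufficient hard limit)
```

The change applies only to processes started from this shell; permanent limits are typically set in the OS limits configuration.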
b. Check runtime of BWA processes
The start time and the up time of each BWA process can be seen in the Python TREX Admin Tool, tab Services -> MC (management console). If the process has been recently restarted, this will be visible from the start time column. See the screenshot below.
A BWA process can either restart automatically or be restarted manually. Administrators might restart the TREXIndexServer manually if they experience high memory consumption.
Note: A BWA process can be restarted manually from the Python TREX Admin Tool, tab Services -> Services: select the corresponding host and service that should be restarted and select restart from the context menu. Use the screenshot below as a reference.
c. Check the trace files
If a service is down, the TREXDaemon will try to restart it. Evidence of the restart can be found in the TREXDaemon trace file. Please note that after a TREXIndexServer crash, no entries will be found in the TREXIndexServer trace file itself, as the crashed process could no longer write to it. In this case SAP would need to analyze the core files (if the creation of those is switched on).
You can use the Trace Analyzer tool (Analyze button on the Trace tab) to sort through the traces and perform detailed trace analysis. For a quick search of the TREXIndexServer process start, open the trace file with a double-click:
Perform root cause analysis of the crash
Once it has been verified that the TREXIndexServer process has crashed, the cause of the crash needs to be identified. There are a few things that can be checked in order to find the cause of the crash:
a. Analyze the BWA system load
The BWA system load can be monitored from the Python TREX Admin tool, tab Services -> Load. The screenshot below is a sample screenshot of the Load tab. The green lines denote the memory consumption and the red lines denote the CPU usage. Each line corresponds to one BWA blade. By selecting the different blades, high memory and CPU consumption can be narrowed down to a specific blade.
Possible solution: If core files were generated on a specific BWA blade and high memory consumption can be observed for that blade, the reason for the crash can be identified as high memory usage. In this case, further investigation is required to find out what exactly was executed on the system at the moment of the crash that led to high memory consumption. If the issue is reproducible, it can be helpful to reproduce it and observe/record the system’s behavior.
b. Analyze existing core files
Analyzing the core files can help narrow down the crash to a specific InfoCube or piece of development code. The core files can be found in the /usr/sap/<SID>/TRX<id>/<blade_name> folder, where <blade_name> is the name of the blade on which the crash occurred.
If more than one core file has been generated on the system, the analysis can start from the newest one. To display all files sorted by date, use the following Linux command: >ls -lrt <directory_name>
The core files are named core.<pid> or <pid>.core, where <pid> is the process ID for the process that has crashed. To make sure you are working with the core file of the TREXIndexServer process, execute the following command: >file <corefile_name>
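A minimal sketch of these two checks, run against placeholder core files in a temporary directory; on a real blade, <directory_name> would be the blade folder named above:

```shell
# Create two stand-in core files with different modification times:
dir=$(mktemp -d)
touch "$dir/core.1111"
sleep 1
touch "$dir/core.2222"

# ls -lrt lists oldest first, newest last:
ls -lrt "$dir"

# Pick the most recently modified core file:
newest=$(ls -rt "$dir" | tail -n 1)
echo "$newest"

# 'file' reports the file type; for a genuine TREXIndexServer dump it would
# report something like: ELF 64-bit core file ... from 'TREXIndexServer.x'
file "$dir/$newest"
```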
For reading the core files, use the gdb debugger. Executing the following Linux command launches the gdb debugger:
>gdb /usr/sap/<SID>/TRX<id>/exe/TREXIndexServer.x <corefile_name>
Note: The gdb version might be outdated, which will cause the command to abort. Use the >gdb --version command to see the current gdb version. If the gdb version is old, the hardware partner needs to upgrade it (it is enough to upgrade it on one blade). If the issue cannot wait, the core files can be sent to SAP for analysis. If the core files are too large, they can be split and compressed with >split and >gzip and uploaded to sapmats.
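A sketch of the split-and-compress step, using a stand-in file instead of a real core file; file names and chunk sizes below are arbitrary choices for illustration:

```shell
# Create a 1 MiB stand-in for a large core file:
workdir=$(mktemp -d)
cd "$workdir"
head -c 1048576 /dev/urandom > core.1234

# Compress (keeping the original), then split into 256 KiB chunks:
gzip -c core.1234 > core.1234.gz
split -b 262144 core.1234.gz core.1234.gz.part_

# These parts are what would be uploaded:
ls core.1234.gz.part_*

# The receiving side reassembles and verifies the round trip:
cat core.1234.gz.part_* > rebuilt.gz
gzip -dc rebuilt.gz | cmp - core.1234 && echo "round trip OK"
```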
For gdb to display all relevant information, it is important to have the right debug symbols copied to the file system. Note that gdb can also be used without the debug symbols, but then not all relevant information might be visible. For each BWA revision, there are different debug symbols. If the needed debug symbols are not available on the system, SAP can provide the right ones. The debug symbols need to be copied into the /exe/ directory on the BWA blade.
To retrieve the backtrace, enter the command bt into gdb once the core file is loaded and press <enter>. Another gdb command is bt full, which displays the full information about the process at the time of the crash. For an extended list of gdb commands, please refer to SAP Note 1316629 - BWA 7.00: Analysis of corefiles.
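Assuming gdb and the matching debug symbols are in place, a session might look like the sketch below; the path and <pid> are placeholders:

```
> gdb /usr/sap/<SID>/TRX<id>/exe/TREXIndexServer.x core.<pid>
(gdb) bt          # short backtrace of the crashing thread
(gdb) bt full     # backtrace including local variables
(gdb) quit
```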
Note: The core file analysis can also be done on a backup blade in order to not hinder the BWA performance.
Note: The core file analysis can also be done on a test system. Please note that when analyzing core files in a different environment, the SID of the systems might be different. In this case the SID of the test system needs to be updated before the core files can be analyzed on this system. For further details, refer to SAP Note 1316629 - BWA 7.00: Analysis of corefiles.
Possible solution: Analyzing a core file’s call stack can point to the location in the development code that has led to the crash. This information can be submitted to SAP on incident component BC-TRX-BIA for further investigation and code corrections. In order for the code corrections to be verified, it is necessary to provide a reproducible use case that was causing the crash.
- Crash Analysis using Python trace
a. Execute a Python trace
Another way to approach the crash analysis is through a Python trace. If the crash is reproducible and a Python trace is recorded, the trace can point to the root of the problem. Such possible root causes for a query that crashes the TREXIndexServer are, for example, problems with specific InfoProviders or a fems field.
Recording a Python trace:
Please refer to the following screenshot of transaction RSRT for an explanation on how to record a Python trace for a specific query execution.
If the specific query that is causing the TREXIndexServer to crash is not known, it is possible to switch on the Python trace for a specific period of time (e.g. a day) and see if a crash occurs. The Python trace can be switched on/off from the Python TREX Admin Tool -> Trace tab. The trace is stored in a trace file on the file system. The Python trace is revision independent and can be copied to another system and replayed there for analysis.
Analyzing a Python trace: The recorded Python trace can be run with the following Linux command: >python <tracefile_name.py>
Note that if the Python trace is to be run on a test system, the so-called communication method needs to be changed first. To change the communication method, change the setCommunicationMethod parameter in the .py file from zero to one. Please note that due to the different BWA data on the test system, the problem might not be reproducible there, even after running the Python trace.
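A minimal sketch of that adjustment with sed, run on a stand-in trace file; the exact shape of the setCommunicationMethod line in a real recorded trace may differ, so verify it before editing:

```shell
# Stand-in trace file containing an assumed line shape:
trace=$(mktemp)
echo 'setCommunicationMethod(0)' > "$trace"

# Flip the communication method from 0 to 1 before replaying on a test system:
sed -i 's/setCommunicationMethod(0)/setCommunicationMethod(1)/' "$trace"
cat "$trace"
```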
To write the Python trace execution results to the command line, replace the call so.olapSearch() at the end of the trace as follows: so.olapSearch() -> fuzzypy.writeResult(so.olapSearch())
Possible solution: Analyzing the Python trace and commenting out parts of the Python trace code can lead to identifying a specific part of a query that has led to the TREXIndexServer crash (e.g. fems). Once the reason for the crash has been identified, the query can be adapted according to the findings.
b. Check /var/log/messages
It is possible that the Linux OS itself has killed the TREXIndexServer process because the process exceeded some predefined system resource consumption thresholds. If this is the case, it can be verified from /var/log/messages. If the TREXIndexServer needs to use more OS resources, the parameters can be adjusted with the >ulimit command.
Possible solution: It can be verified from /var/log/messages whether OS parameter settings have led to the TREXIndexServer crash. Once those settings are adjusted, e.g. thresholds have been increased, the TREXIndexServer crash should not occur again.
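A sketch of such a check, run on a stand-in log file; the log lines below only illustrate the typical shape of an OOM-killer entry, and real messages vary by kernel version:

```shell
# Stand-in for /var/log/messages with illustrative OOM-killer entries:
log=$(mktemp)
cat > "$log" <<'EOF'
Jun  3 10:12:01 blade01 kernel: Out of memory: Kill process 4711 (TREXIndexServer.x)
Jun  3 10:12:02 blade01 kernel: Killed process 4711 (TREXIndexServer.x)
EOF

# Look for evidence that the OS killed the index server:
grep -i 'kill.*TREXIndexServer' "$log"

# Review the current per-process resource limits that can trigger such kills:
ulimit -a
```

On the blade itself, point the grep at /var/log/messages directly (root access may be required).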
c. Check the statistics index
If no Python trace can be recorded, detailed information about the events that occurred can also be found in the statistics index. The statistics index records all calls and events that have been executed and can be used to narrow down the problem to those events that happened right before the TREXIndexServer crashed. The statistics index information can be found in the Python TREX Admin Tool, tab Mining.
For a detailed description on how to work with the statistics index, please refer to SAP Note 1756353 - How to use BWA statistics to find a BW query name.
Possible solution: The detailed information from the statistics index can lead to the specific query call that caused the TREXIndexServer crash.