
 

 

Purpose 

This wiki aims to help with the analysis of semaphore lock or wait situations caused by deadlocks or by slow operations inside sections protected by semaphores.

It cannot be, and does not aim to be, complete documentation of all possible semaphore issue scenarios. The main idea is to show the typical troubleshooting process for a semaphore lock and wait situation.

Overview 

The most frequent problems involving SAP semaphores can be classified as:

1) semaphore deadlock or stuck lock situation;
2) slow operations blocked by a semaphore.

In the first situation the system hangs and users cannot even log on. Usually such a situation affects a single application server, but the entire system may also be affected.
The second situation is similar, but the system is not really stuck. Operations can take a long time, but the work process status changes from time to time, and so does the semaphore locker.

The content here is divided into the following sections:

  1. Identifying the semaphore locker
  2. Solving a semaphore lock and wait situation
  3. Root cause analysis
  4. How to analyse the c-stack
  5. How to prevent the situation
  

Identifying the semaphore locker 

The locker work process (semaphore locker) is the one responsible for blocking the system; it may hold the semaphore forever or for a specific time span.

If the locker does not release the semaphore, the system hangs and we have the first situation (usually caused by a deadlock).
However, if the locker holds the semaphore and releases it after a while, so that the locker work process changes, we have the second situation.

The first step is to identify the locker. There are a few ways to identify the semaphore locker:

 
  • SM50 transaction (not always possible to use during the standstill situation);

  • Using the SAPMMC tool;

  • Using the SAPCONTROL client connecting to the sapstartsrv service:

sapcontrol -nr <NR> -function ABAPGetWPTable

  • Using the snapshot feature created by the dispatcher (see further).

 


 

If the Time column of the semaphore locker keeps increasing or has already reached a high value, it is very likely a semaphore
deadlock situation. However, if the Time column is close to zero or shows just a few seconds, it may be just a slow situation.
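
As a rough illustration of this check, the following Python sketch parses text in the comma-separated layout that "sapcontrol -nr <NR> -function ABAPGetWPTable" is assumed to print, and lists the work processes that report a semaphore, longest-running first. The column names (No, Pid, Sem, Time), the sample rows and the helper names are assumptions made for this example; verify them against the output of your own kernel release.

```python
import csv
import io

# Hypothetical sample in the comma-separated layout assumed for
# "sapcontrol -nr <NR> -function ABAPGetWPTable"; real output may differ.
SAMPLE = """\
No, Typ, Pid, Status, Reason, Start, Err, Sem, Cpu, Time, Program, Client, User, Action, Table
0, DIA, 4711, Run, , yes, , 42, , 1870, , 000, SAPSYS, ,
1, DIA, 4712, Run, , yes, , 42, , 3, , 000, SAPSYS, ,
2, BTC, 4713, Wait, , yes, , , , , , , , ,
"""


def parse_wp_table(text):
    """Return one dict per work process, keyed by the header names."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Skip any preamble lines that precede the header row.
    start = next(i for i, ln in enumerate(lines) if ln.startswith("No,"))
    reader = csv.reader(io.StringIO("\n".join(lines[start:])),
                        skipinitialspace=True)
    header = [h.strip() for h in next(reader)]
    return [dict(zip(header, (c.strip() for c in row))) for row in reader]


def semaphore_candidates(rows):
    """Work processes that report a semaphore, longest elapsed time first."""
    def seconds(row):
        value = row.get("Time", "")
        return int(value) if value.isdigit() else 0

    with_sem = [r for r in rows if r.get("Sem")]
    return sorted(with_sem, key=seconds, reverse=True)


if __name__ == "__main__":
    for wp in semaphore_candidates(parse_wp_table(SAMPLE)):
        print(wp["No"], wp["Pid"], wp["Sem"], wp["Time"])
```

A work process that stays at the top of this list across repeated calls, with a growing time value, is the first candidate for the semaphore locker.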

 

There are situations where the semaphore locker cannot be identified. For these cases you can use the semd tool as explained in SAP Note 2027885.

2027885 - External semaphore monitoring

 

Solving a semaphore lock and wait situation

The first thing to do is to determine, based on the observed behavior, whether the situation is case 1 or case 2.

In case of a semaphore deadlock or stuck lock, it is possible to restart the locker work process, and that may solve the issue. If the issue is a stuck lock situation,
the WP restart may resolve the hang. Be aware that a new WP may block the same semaphore again after the restart.
If, for some reason, the semaphore cannot be properly released, the only solution is to restart the application server.

In the second situation, it is necessary to identify why the work processes are holding the semaphore for too long. This may have different causes,
such as problems in the file system, a long-running operation at OS level, a shortage of resources at OS level, etc.
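
The distinction between the two cases can be sketched in Python as a comparison of successive work process snapshots. This is only an illustration under stated assumptions: the (pid, seconds) snapshot format, the 300-second threshold and the function name classify are hypothetical and not part of any SAP tool.

```python
def classify(snapshots, stuck_after=300):
    """Classify a series of (locker_pid, seconds_held) observations.

    Returns "deadlock or stuck lock" when one work process is the locker
    in every snapshot and its time never decreases and ends above the
    threshold; otherwise returns "slow operation".
    """
    if not snapshots:
        raise ValueError("at least one snapshot is required")
    pids = {pid for pid, _ in snapshots}
    times = [secs for _, secs in snapshots]
    same_locker = len(pids) == 1
    growing = all(a <= b for a, b in zip(times, times[1:]))
    if same_locker and growing and times[-1] >= stuck_after:
        return "deadlock or stuck lock"
    return "slow operation"


if __name__ == "__main__":
    # Same locker, time keeps increasing: case 1.
    print(classify([(4711, 600), (4711, 660), (4711, 720)]))
    # Locker changes and times stay short: case 2.
    print(classify([(4711, 4), (4712, 2), (4715, 6)]))
```

The threshold is arbitrary; what matters is whether the locker changes between snapshots.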

Root cause analysis

There are multiple reasons that may lead to a semaphore deadlock, a stuck semaphore, or a slow lock situation.

 

A deadlock or stuck semaphore is usually related to an error: if a WP is accidentally interrupted at a point where the process holds a semaphore, the release may not occur as expected:

  1. In certain error situations, resources are not always released as expected and the unlock of a semaphore is missing. For example, a WP crash triggers a clean-up in which all allocated resources should be released; if the clean-up also fails, some resources may never be released and the system hangs.
  2. Crashes inside a critical section, where the processing runs under semaphore protection, are the most critical situations if the clean-up does not occur as expected.
  3. A classical deadlock situation: the system is performing an operation protected by a semaphore, the operation is interrupted at exactly that moment, and the same operation is triggered again, requesting the same lock before it has been released.
  4. The clean-up of resources occurs during the start-up of the process. If this action also crashes for whatever reason, some resources are not identified by the next clean-up and keep blocking the system.
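
The third scenario can be illustrated with a small Python sketch. It uses a non-reentrant threading.Lock as a stand-in for a semaphore; this is an analogy only, not SAP kernel code, and the function name is hypothetical. The acquisition uses a timeout so that the example terminates instead of hanging.

```python
import threading

sem = threading.Lock()  # non-reentrant, stands in for a semaphore


def protected_operation(reenter=False):
    """Acquire the lock; optionally simulate the interrupted operation
    being triggered again before the lock is released."""
    if not sem.acquire(timeout=0.2):
        return "blocked"  # without the timeout this call would wait forever
    try:
        if reenter:
            # Same operation requested again while the lock is still held.
            return "outer holds lock, inner " + protected_operation()
        return "done"
    finally:
        sem.release()


if __name__ == "__main__":
    print(protected_operation())              # normal case
    print(protected_operation(reenter=True))  # re-entry cannot get the lock
```

In the sketch the inner call gives up after the timeout. A call that waits without a timeout would never return, because the only holder of the lock is the caller itself.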

On the other hand, in a slow operation protected by a semaphore, the error is not in the semaphore handling itself. Instead, the operations between the lock and the unlock of the semaphore are taking a long time.

 

In both cases, the analysis should be done by collecting the C-stack of the work process while the problem is occurring.

To do that, SAP delivers a tool called sapstack, explained in SAP Note 1964673:
1964673 - C-Call stack analysis 

Calling "sapstack <PID>" while the issue is occurring prints the C-stack of the process. Another way is to send the signal USR2 to the process: "kill -USR2 <PID>" writes the C-stack to the work process trace (dev_w*). There is no "kill" command on Windows, so use "sapntkill -USR2 <PID>" there instead. After sending the signal USR2, it is required to send the signal USR1: "kill -USR1 <PID>".
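
The command sequence can be summarised in a small Python helper that only builds the command lines and does not execute anything. The function name stack_commands and its parameters are hypothetical; the commands are the ones named in the text. Sending the follow-up USR1 on Windows with sapntkill as well is an assumption.

```python
def stack_commands(pid, windows=False, use_sapstack=False):
    """Return the command lines for dumping the C-stack of process `pid`."""
    if pid <= 0:
        raise ValueError("pid must be a positive process ID")
    if use_sapstack:
        return [f"sapstack {pid}"]
    # There is no "kill" on Windows, so sapntkill is used there instead.
    # Assumption: the follow-up USR1 is sent with the same tool.
    tool = "sapntkill" if windows else "kill"
    # USR2 writes the C-stack to dev_w*; USR1 has to be sent afterwards.
    return [f"{tool} -USR2 {pid}", f"{tool} -USR1 {pid}"]


if __name__ == "__main__":
    for cmd in stack_commands(4711):
        print(cmd)
```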

 

 

How to analyse the c-stack

 

The call stack should be analysed from bottom to top. To find the root cause, the most relevant calls are the last ones before the signal. See the following examples:


  1. In this stack we should search for errors in calls like BackupFile -> IndDeleteOldest.

    M ------------------ C-STACK ----------------------
    [0] SunDoStack2, at 0x127ec69
    [1] CTrcStack2, at 0x127e8b3
    [2] CTrcStack, at 0x127e85c
    [3] __1cOThStackHandler6F_v_, at 0x10a504d
    [4] __1cKDpTrcOnOff6Fi_v_, at 0xfbd04f
    [5] __sighndlr, at 0xfffffd7ff6d775b6
    [6] call_user_handler, at 0xfffffd7ff6d6be52
    [7] sigacthandler, at 0xfffffd7ff6d6c07e
    [8] ????????, at 0xffffffffffffffff
    [9] __1cSPfHIndDeleteOldest6FpnJPF_HYPIND_CpH_i_, at 0x33f9624
    [10] __1cMPfBackupFile6F_v_, at 0x33eeb70
    [11] PfStatWrite, at 0x33cbd49
    [12] __1cLPfWriteStat6F_i_, at 0x33eddeb
    [13] __1cLThCallHooks6FnOSOS_HOOK_EVENT_pnTSOS_HOOK_EVENT_INFO_Cpi_i_, at 0x1077304
    [14] __1cTThPerformTaskSwitch6FnKDP_WP_STAT_pnLTH_STRATEGY_CC_i_, at 0x106608a
    [15] __1cUThITriggerTaskSwitch6Fpv_i_, at 0x1063588
    [16] __1cITskhLoop6F_v_, at 0x10a8590
    [17] __1cHThStart6F_v_, at 0x10a58df
    [18] DpMain, at 0xf6d7b1
    M -------------------------------------------------

     

     

  2. In this second example, the relevant calls are RqQAddRequest -> std::deque -> std::_Deque_base.

    ------------------ C-STACK ----------------------
    dw.sapPNR_D00[S](LinStackBacktrace+0x8c)[0x662179]
    dw.sapPNR_D00[S](LinStack+0x35)[0x6661c0]
    dw.sapPNR_D00[S](CTrcStack2+0x4e)[0x661f2e]
    dw.sapPNR_D00[S](SigIGenAction+0x288)[0x2300e28]
    <signal handler called> [0x7ffff198f850]
    libc.so.6[S](__GI_raise+0x35)[0x7ffff1636875]
    libc.so.6[S](__GI_abort+0x181)[0x7ffff1637e51]
    libc.so.6[S](__libc_message+0x38f)[0x7ffff16778bf]
    libc.so.6[S](malloc_printerr+0xb8)[0x7ffff167d0c8]
    libc.so.6[S](_int_malloc+0x7af)[0x7ffff168007f]
    libc.so.6[S](__GI___libc_malloc+0x77)[0x7ffff16821f7]
    libstdc++.so.6[S](operator new(unsigned long)+0x1d)[0x7ffff209008d]
    dw.sapPNR_D00[S](std::_Deque_base<RQ_Q_NOTIFY_ELEM, std::allocator<RQ_Q_NOTIFY_ELEM> >::_M_initialize_map(unsigned long)+0x7c)[0x1de879c]
    dw.sapPNR_D00[S](std::deque<RQ_Q_NOTIFY_ELEM, std::allocator<RQ_Q_NOTIFY_ELEM> >::deque()+0x5a)[0x1de88ea]
    dw.sapPNR_D00[S](RqQAddRequest(unsigned int, REQUEST_BUF*, unsigned char (*)(REQUEST_BUF*, void*), void*, unsigned char)+0x359)[0x1de3cf9]
    dw.sapPNR_D00[S](DpRqPutIntoQueue(REQUEST_BUF*, unsigned char (*)(REQUEST_BUF*, void*), void*, unsigned char, unsigned char)+0x16e)[0x1df93de]
    dw.sapPNR_D00[S](DpRqSendRequest(REQUEST_BUF*, unsigned char, unsigned char)+0x3d)[0x1df9c4d]
    dw.sapPNR_D00[S](DpWpHandShake(DP_SESSION_INFO, unsigned int)+0x9f)[0x48b827]
    dw.sapPNR_D00[S](DpTmSend(int, int, int, unsigned char, unsigned int)+0x3c3)[0x44f8d5]
    dw.sapPNR_D00[S](DpHandleGuiResponse(REQUEST_BUF*)+0x45)[0x468641]
    dw.sapPNR_D00[S](DpRqServiceQueue()+0x1450)[0x1df53f0]
    dw.sapPNR_D00[S](DpLoopExec()+0x195)[0x1df5655]
    dw.sapPNR_D00[S](DpMain+0x3bd)[0x1ddc5cd]
    libc.so.6[S](__libc_start_main+0xe6)[0x7ffff1622c36]
    -------------------------------------------------

     

     

  3. In this example, the current call is SemTimedOp. This function means that the respective work process is performing a semaphore operation through the OS call semtimedop. This work process is therefore probably not the locker, but rather a work process waiting for a semaphore.

    M ------------------ C-STACK ----------------------
    dw.sapPE3_D03[S](LinStackBacktrace+0x8c)[0x65a0f9]
    dw.sapPE3_D03[S](LinStack+0x35)[0x65e140]
    dw.sapPE3_D03[S](CTrcStack2+0x4e)[0x659eb1]
    dw.sapPE3_D03[S](ThStackHandler()+0x102)[0x4e9a63]
    dw.sapPE3_D03[S](DpTrcOnOff(int)+0x212)[0x45f7a9]
    <signal handler called> [0x7f6a1a3327e0]
    libc.so.6[S](__GI_semtimedop+0xa)[0x7f6a1a07a16a]
    dw.sapPE3_D03[S](SemTimedOp+0xc1)[0x1b58831]
    dw.sapPE3_D03[S](EvtWtRst+0x72)[0x1d4c882]
    dw.sapPE3_D03[S](RqQWorkerWaitForRequests(unsigned int, unsigned char, RQ_Q_PRIO, unsigned int*, int)+0x2fb)[0x1b4b0cb]
    dw.sapPE3_D03[S](ThRqCheckQueues(int, REQUEST_BUF**)+0x795)[0x1b5dcb5]
    dw.sapPE3_D03[S](ThRqGetNextRequest(int, REQUEST_BUF**)+0x37)[0x1b5fd37]
    dw.sapPE3_D03[S](ThRqWaitFor(int, REQUEST_BUF**)+0x42)[0x1b5fee2]
    dw.sapPE3_D03[S](ThRqAcceptImpl(unsigned char, int, REQUEST_BUF**)+0xe4)[0x1b5e8e4]
    dw.sapPE3_D03[S](ThRqAcceptInlineReply(int, REQUEST_BUF**)+0x32)[0x1d5af52]
    dw.sapPE3_D03[S](ThCPIC(REQUEST_BUF*, unsigned char, unsigned char*, TH_CPIC_EXEC_TYPE, unsigned char, char16_t*, char16_t*, int, unsigned char, int, TH_SECURITY_INFO*, char16_t*, int*, REQUEST_BUF**)+0x11a3)[0x1d68a03]
    dw.sapPE3_D03[S](ThSAPICMRCV(unsigned char*, int, int, TH_CPIC_EXEC_TYPE, unsigned char, unsigned char, unsigned char, unsigned char, unsigned char, unsigned char, int*, REQUEST_BUF**)+0x4ad)[0x1d6d87d]
    dw.sapPE3_D03[S](ThSAPCMRCV+0x88)[0x1d6ddd8]
    dw.sapPE3_D03[S](comread(void*, unsigned char*, unsigned int, unsigned int*, int)+0xfe)[0x203365e]
    dw.sapPE3_D03[S](ab_rfcread+0x2fc)[0x1c48cec]
    dw.sapPE3_D03[S](rfcget_gethead+0x3b)[0x1c4532b]
    dw.sapPE3_D03[S](ab_rfcget_error_exception+0x121)[0x2029c41]
    dw.sapPE3_D03[S](ab_rfcimport+0x34c)[0x202629c]
    dw.sapPE3_D03[S](ab_jcaly()+0x119)[0x1be5f79]
    dw.sapPE3_D03[S](ab_extri()+0x197)[0x1b78427]
    dw.sapPE3_D03[S](ab_xevent(char16_t const*)+0x32)[0x1bb87e2]
    dw.sapPE3_D03[S](ab_dstep+0x36)[0x1b75936]
    dw.sapPE3_D03[S](dynpmcal(DINFDUMY*, STPDUMMY*)+0x2d1)[0x1de0831]
    dw.sapPE3_D03[S](dynppbo0(DINFDUMY*)+0xb6)[0x1de26c6]
    dw.sapPE3_D03[S](dynprctl(DINFDUMY*)+0x189)[0x1de2dc9]
    dw.sapPE3_D03[S](dynpen00+0x407)[0x1dd3d77]
    dw.sapPE3_D03[S](ThrtCallAbapVm+0xc0)[0x1daf1d0]
    dw.sapPE3_D03[S](RfcHandler::handleRequest(REQUEST_BUF*, bool)+0x173)[0x1dbce13]
    dw.sapPE3_D03[S](ThHandleRequest(REQUEST_BUF*, unsigned char, unsigned char)+0x1a3)[0x1d72b23]
    dw.sapPE3_D03[S](TskhLoop()+0xa2)[0x1b62072]
    dw.sapPE3_D03[S](ThStart()+0x26e)[0x4e95c0]
    dw.sapPE3_D03[S](DpMain+0x36c)[0x1d09d2c]
    libc.so.6[S](__libc_start_main+0xfd)[0x7f6a19fadd1d]

     

     

  4. Here we have another example where the stack goes into OS calls for name resolution and WINS resolution: gethostbyname2 -> _nss_wins_gethostbyname2_r.
    Note that in this example the issue probably lies on the OS side and not really in the SAP layer, as WINS resolution is an OS responsibility.

__lll_lock_wait () from /lib64/libpthread.so.0
pthread_mutex_lock () from /lib64/libpthread.so.0
_nss_wins_gethostbyname_r () from /lib64/libnss_wins.so.2
_nss_wins_gethostbyname2_r () from /lib64/libnss_wins.so.2
gaih_inet () from /lib64/libc.so.6
getaddrinfo () from /lib64/libc.so.6
interpret_string_addr_internal () from /lib64/libnss_wins.so.2
interpret_addr () from /lib64/libnss_wins.so.2
interpret_addr2 () from /lib64/libnss_wins.so.2
?? () from /lib64/libnss_wins.so.2
wins_srv_tags () from /lib64/libnss_wins.so.2
resolve_wins () from /lib64/libnss_wins.so.2
_nss_wins_gethostbyname_r () from /lib64/libnss_wins.so.2
_nss_wins_gethostbyname2_r () from /lib64/libnss_wins.so.2
gethostbyname2_r@@GLIBC_2.2 () from /lib64/libc.so.6
gaih_inet () from /lib64/libc.so.6
getaddrinfo () from /lib64/libc.so.6
NiPGetHostByName(char16_t const*, unsigned char, unsigned char, NI_NODEADDR*, unsigned int*, char16_t*, unsigned int, _IO_FILE**) ()
NIHIMPL_LINEAR::getNodeAddr(char16_t const*, NI_NODEADDR*, unsigned int, int, _IO_FILE**) ()
NiIGetNodeAddr(char16_t const*, int, NI_NODEADDR*, unsigned int, _IO_FILE**) ()
NiHostToAddr ()
DpNetCheck ()
DpSapEnvInit ()
DpMain ()
__libc_start_main () from /lib64/libc.so.6
_start ()

 

 

You are probably wondering what to do with these call stacks. The most important thing is to use them to search the SAP knowledge repository for a known issue involving the same or a very similar call stack. Searching for the most relevant calls lets you check whether your system is affected by a known issue.
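
As a starting point for such a search, the calls just before the signal can be pulled out of a Linux-style stack with a few lines of Python. The sketch below is an illustration only: the regular expression is an assumption fitted to the "module[S](function+0xNN)[0xADDR]" lines shown in the examples, and the function names and the semtimedop check are hypothetical helpers.

```python
import re

# Matches e.g. "dw.sapPE3_D03[S](SemTimedOp+0xc1)[0x1b58831]" and captures
# the function name without argument list and offset.
FRAME = re.compile(r"\]\((?P<func>[^(+]+)")


def calls_before_signal(stack_text, limit=5):
    """Return up to `limit` function names that were active when the
    signal arrived, most recent call first."""
    names = []
    seen_signal = False
    for line in stack_text.splitlines():
        if "<signal handler called>" in line:
            seen_signal = True
            continue
        if not seen_signal:
            continue  # frames of the stack-dump handler itself
        match = FRAME.search(line)
        if match:
            names.append(match.group("func").strip())
    return names[:limit]


def looks_like_waiter(names):
    """True if the most recent call is a semaphore wait (see example 3)."""
    return bool(names) and "semtimedop" in names[0].lower()


SAMPLE = """\
dw.sapPE3_D03[S](CTrcStack2+0x4e)[0x659eb1]
dw.sapPE3_D03[S](DpTrcOnOff(int)+0x212)[0x45f7a9]
<signal handler called> [0x7f6a1a3327e0]
libc.so.6[S](__GI_semtimedop+0xa)[0x7f6a1a07a16a]
dw.sapPE3_D03[S](SemTimedOp+0xc1)[0x1b58831]
dw.sapPE3_D03[S](EvtWtRst+0x72)[0x1d4c882]
"""

if __name__ == "__main__":
    top = calls_before_signal(SAMPLE)
    print(top, looks_like_waiter(top))
```

The returned names can be used as search terms. A stack flagged as a waiter most likely belongs to a blocked work process, so the locker has to be looked for elsewhere.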

INCLUDE WHERE TO SEARCH

It is not possible to provide a specific solution for semaphore problems because the root cause may vary; therefore, the analysis process is the most relevant item in such cases.

 

How to prevent the situation 

There are some specific situations, with kernel 72X, where the semaphore can be automatically recovered with the feature described in SAP Note 1890657.

1890657 - Semaphore recovery
 


 


.... UNDER CONSTRUCTION...

 


 


 

Related Content

 

Related Documents

 

Related SAP Notes/KBAs

2007484 - Semaphore 42
1754001 - DP: Work processes cause block on semaphore 5
1795711 - DP: Work processes block on semaphore 42
981088 - System crash due to semaphore 5 deadlock
1875389 - Deadlock on semaphore 5
1704898 - Instance hangs on Semaphore 7
2101988 - MM:AIX detect loop in ES layer
1548895 - Work processes cause block with semaphore 7
2171151 - ST: All work processes wait for semaphore 42

 

 

 

 
