Technical Information: Package CRM_ES_EXTR_MGR
Encapsulates the extraction of a Business Object Template.
Encapsulates the extraction of a Technical Object Template.
Manages the extraction process.
CL_CRM_ES_SO_ROOT, CL_CRM_ES_EXTRACT_SO, CL_CRM_ES_EXTRACT_SO_BROKER
Database Tables / Views
CRMC_EXTMGR_PAR / CRMV_EXTMGR_PAR
Holds the system information specifying on which server group and with how many tasks the parallel jobs are executed. Mandatory customizing! Unfortunately, the table is defined as a temporary table, so a client copy does not copy its content. This causes problems at customer sites, since extraction does not work immediately after a client copy; a different solution should be evaluated. Customizing is done in the reference IMG under CRM->UI Framework->Enterprise Search Integration->Define Settings for Parallel Extraction; more information can be found there. The important settings are server group, maximum number of tasks, and package size (a package size of ~100 is recommended).
Temporary table which holds the keys after they have been retrieved, either by query or by method. In case of method integration, only about 10,000 keys (one package) are stored in the table at a time. In case of query usage, the whole key set is stored in the table.
CRMD_ES_KEYSTORE is a temporary table.
CRMD_ES_LOADMODE is a customizing table (delivery class 'E') holding information on which templates have to be indexed with a setting other than the standard one (typically 'H'). It is needed because one of the templates, CRM_SURVEY, does not support parallelization and has to be indexed sequentially. Consequently, we deliver one entry in this table with exactly this information.
CRMD_ES_MAPSTR is a customizing table (delivery class 'E') which allows specifying key_structure_name or attr_struct_name information in case the GenIL information is wrong or not available. By default it contains no SAP-delivered entries.
Structures and Table Types
CRMS_ES_ADDITIONAL_FIELDS / CRMT_ES_ADDITIONAL_FIELDS_TAB
This is an important structure within the parallel execution of the indexing. The parallel framework cannot transfer artificial tables (tables not defined in the DDIC) to the parallel processes, so we must work with the original tables. These original tables cannot store additional fields such as ES_OWN_KEY, ES_FATHER_KEY, or BOL_ROOT_KEY_ENTITY. Therefore we store these fields in a CRMT_ES_ADDITIONAL_FIELDS_TAB and pump them back into the structures returned to Enterprise Search. The fields are identified by node name, field name, and the row index within the table as returned by the respective process.
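The strip-and-merge mechanism can be sketched in Python (the real implementation is ABAP; the function names and the dict-based rows here are purely illustrative):

```python
def strip_additional_fields(rows, node_name, extra_fields):
    """Remove non-DDIC fields from each row before the parallel transfer,
    remembering them by (node, field, row index) the way the
    additional-fields table does."""
    stored = []
    for idx, row in enumerate(rows):
        for field in extra_fields:
            if field in row:
                stored.append({"node": node_name, "field": field,
                               "index": idx, "value": row.pop(field)})
    return stored

def merge_additional_fields(rows, node_name, stored):
    """Pump the remembered values back into the result rows."""
    for entry in stored:
        if entry["node"] == node_name:
            rows[entry["index"]][entry["field"]] = entry["value"]
    return rows
```

The (node, field, index) triple is the lookup key, so the merge works even when several nodes with identical field names are processed by the same task.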
CRMS_ES_EXTRACT_NODE / CRMT_ES_EXTRACT_NODES
The EXTRACT_NODE structure / table type is actually a stripped-down ESH_S_IF_EXTRACT_NODE / ESH_T_IF_EXTRACT_NODES as provided by the Enterprise Search. It is also only used for parallelization, since some parts of the Enterprise Search version cannot be transferred to the parallel processes. The re-packing is done in the (UN)PACK_DATA_RFC methods of the Extract Manager.
CRMS_ES_EXTRACT_SEL_OPT / CRMT_ES_EXTRACT_SEL_OPT
Similar to above for structure ESH_S_IF_EXTRACT_SEL_OPT.
CRMS_ES_PARALLEL_EXTRACT_NODE / CRMT_ES_PARALLEL_EXTRACT_NODE
Temporarily used structure for node information.
CRMS_ES_PARALLEL_EXTRACT_REC / CRMT_ES_PARALLEL_EXTRACT_REC
Main data structure holding all information required for a parallel process.
CRM(S/T)_ES_RELATION_DATA_TAB/ CRM(S/T)_ES_SO_EXTRACT / CRM_ES_PARALLEL_EXTRACT_RECORD
CRM_ES_STRUCT_DEF / CRMT_ES_STRUCT_DEF
While writing this documentation I realized that this structure has also become obsolete, even though it still appears in quite a few places in the code.
Function Group CRM_ES_EXTRACTION_MANAGER
Function group dealing with triggering and fetching the parallel processes. This is the standard way to execute parallelism in ABAP, with the three subroutines BEFORE_RFC_CALLBACK_FORM, IN_RFC_CALLBACK_FORM, and AFTER_RFC_CALLBACK_FORM, and it is the only reason for a function group in an otherwise object-oriented landscape. Please note that the subroutines communicate with certain limitations: BEFORE_RFC sends the data using SPTA_INDX_PACKAGE_ENCODE directly to IN_RFC, which uses PACK_DATA_RFC (a static method of the extract manager) to encapsulate the data sent to AFTER_RFC. IN_RFC_CALLBACK_FORM basically calls CL_CRM_ES_BO_EXTR->EXTRACT_DATA, which is the same method as in sequential indexing. Nevertheless there are differences between the two executions, as EXTRACT_DATA has to handle both execution flows differently (depending on IV_PARALLEL_SWITCH).
Generated Function Group CRM_ES_EXTMGRPAR
Automatically generated; see view CRMV_EXTMGR_PAR.
The parameter is used to switch between sequential mode ('S'), parallel mode ('P'), and hybrid mode ('H'). Hybrid mode, which is the standard, means that the Initial Load is executed in parallel while the delta load is executed sequentially. Full parallel mode is not tested (it might cause problems for deletions, which do not appear in the Initial Load). Sequential mode can be helpful for debugging purposes but should not be used productively. The setting can be overridden by an entry in the load mode table (see above); this is always done for the CRM_SURVEY template.
Message Class CRM_ES_EXTR_MGR
Message class for information messages sent to the log file. Messages usually contain one or more placeholders like &1, &2, &3 ... which are replaced in method CL_CRM_ES_EXTRACT_MANAGER->Construct_Log_Message by the respective values (see chapter log file assembly).
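The placeholder substitution follows the usual ABAP message variable pattern; a minimal Python illustration (the real replacement happens in CL_CRM_ES_EXTRACT_MANAGER->CONSTRUCT_LOG_MESSAGE, and this helper is only a sketch):

```python
def construct_log_message(template, *values):
    """Replace &1, &2, ... placeholders with positional values,
    the way an ABAP message class does. Illustrative only."""
    for i, value in enumerate(values, start=1):
        template = template.replace(f"&{i}", str(value))
    return template
```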
As the whole functionality is tightly integrated into an embedding process, there is no working test class available. There is apparently an outdated test class called ABAP_UNIT_TESTCLASS (it returns errors).
Logical View on the Processes
Parallel customizing can be done in SPRO as follows:
This specifies the number of tasks used, on which server group, and with which package size.
Please note that this corresponds to the CRM extraction process only. The Enterprise Search requests for the Initial Load from NetWeaver are also parallelized; the maximum number of tasks used is consequently the cross product of our number of tasks and the corresponding customizing in NetWeaver. The NetWeaver customizing is unfortunately only available from a specific release onwards.
the log files
Starting point is usually not the Admin Cockpit but the report ESH_IX_CRT_INDEX_OBJECT_TYPE directly. The initial 7.0 version was simpler and looked as follows:
Since quite some information is required, it is recommended to work with variants. But be aware that the connection GUID might change!
Search Connector ID, Object Type and Logical system should be obtained from the Log file (SLG1) or from the dump directly (ST22). Connection GUID might be fetched from table ESH_ADM_SC_MAIN using ST16. Connection ID is typically not template specific.
Nowadays the functionality has been enhanced a bit and looks as in the following picture:
Template Type should be set to 'COMRUNTIME'. Indexing mode must be set, if Initial Load should be triggered.
To reproduce a delta load it was required to start the report and change the corresponding attribute during processing (and vice versa, as the detection runs automatically).
Please consider beforehand whether you want to debug in sequential or in parallel mode. Debugging is of course easier in sequential mode, but it is not guaranteed that the problem can be reproduced there. If you want to use parallel mode, you might reduce the package size to a value lower than the one customized for the parallel processes, so that only one process is started, making debugging easier.
I usually do not trust breakpoints which are not reliably hit, so you might start the process in debug mode /h. The following functions are hit:
A short report initiating the process; lr_load_admin_tool->index_object_data will call:
A rather big method (about 1,000 LOC). Pretty much in the middle there is a call to gr_load_controller->index_object_data with a lot of parameters (so it is easy to find). Processing up to this point might take some time, since the authorization processing is done in advance.
In case you have to switch from or to delta mode, this is the place where you have to intervene before processing continues to...
There are two to three ways out of this function; you have to look out for the call of the iterator->next() method.
This already belongs to WEBCUIF.
Reads the template data (either from the model or from NW) and retrieves the keys, which are written to our internal table. The main extraction starts with the call to extract_data:
CL_CRM_ES_EXTRACT_MANAGER->extract_data
CL_CRM_ES_EXTRACT_MANAGER->extract_bo
CL_CRM_ES_EXTRACT_MANAGER->extract_data_parallel or extract_data_sequential
Now we are at the heart of dest... ahem, extraction. Continue as the problem class indicates.
deletions in delta load
Deletions can only occur in the delta load, and they always deal with root objects (as all extractions handle the complete object, deletions of sub-items are in fact modifications of the root item). They are handled in the PROCESS_DELETIONS method. The delta load process itself provides the information which elements are deleted; unfortunately, this information is neither reliable nor sufficient. In general, we treat all entities which should be there according to the read request but could not be read as deleted. This also makes sense for type transformations.
The general process is easy: we compare the request with the result, and if the result contains fewer elements than requested, a deletion must be processed. So we scan both tables for the deleted entries and write them to the DELETE_TABLE_REF. Unfortunately, there is a small problem causing some difficulties: the deleted objects do not provide the actual attribute structure (which is deleted anyway); only the key structure information is available. The NW side, on the other hand, can only execute the deletion if its keys are provided in full. This means the keys on root level must be part of the key structure (plus possibly BOL_ROOT_ENTITY_KEY, which is calculated from the key structure). Secondly, we need the information which field of the key structure matches which field of the attribute structure. The method ADD_ENTITY_2_DEL_TABLE applies some heuristics, which seem to be sufficient for our cases.
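The request/result comparison can be sketched as follows (Python sketch; the real code works on the key structures described above, and the function name is illustrative):

```python
def detect_deletions(requested_keys, returned_keys):
    """Entities that were requested but not returned by the read are
    treated as deleted; this also covers type transformations.
    Preserves the request order, as a scan over both tables would."""
    returned = set(returned_keys)
    return [key for key in requested_keys if key not in returned]
```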
The extraction knows two different ways to retrieve the keys which are afterwards used for the read request. The query approach does not need much work from the application side except the specification of a fitting query, but it has to write all returned keys to an internal table; this fails once certain memory levels are reached (about 800,000 entries in our tests). So for templates where the application cannot rule out bigger sizes, the method approach has to be chosen. Technically, we first attempt to retrieve the data by method; if this returns an error, the keys are retrieved using the customized query.
This process calls gr_generic_il->get_object_keys, which hands the call on to the application component in case the method is implemented. get_object_keys uses the following parameters:
- IV_QUERY_NAME name of the query
- IS_QUERY_PARAMETERS Query parameters, MAX_HITS is set to package size (10000), SELECTION_HINTS contains the BOR key.
- IT_SELECTION_PARAMETERS contains parameters as customized in ES Workbench.
- IV_LAST_OBJECT_KEY_PROCESSED contains information on the last processed key (BOL OBJECT_ID) within the preceding package.
- ET_OBJECT_KEYS Keys
- ET_SELECTION_PARAMETERS_DROPED If this is set to a value greater than or equal to 1, we do not treat the result as final; later on, we refire the query with the keys from the method to further restrict the returned data.
- EV_LAST_OBJECT_KEY_PROCESSED Last key processed to be used as starting point for next package.
- EV_DONE is ignored (unfortunately).
The method may return more keys than requested. This is used for contacts implementation.
The method returns the keys as BOL object IDs, which are converted to correctly typed keys afterwards. The last key is processed separately. In the early days there was a bug: the check against lines > 0 was missing, so the process dumped if no data was supplied and hence no last key was available; see note 1477143.
If the method returns fewer than package-size entries, indexing will be halted after processing the current package.
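The resulting paging loop can be sketched as follows (Python sketch; get_object_keys here is a stand-in callable, not the real GenIL signature):

```python
def fetch_all_keys(get_object_keys, package_size=10000):
    """Drive the method-based key retrieval: request one package at a
    time, feeding the last processed key back in, until a package
    shorter than the requested size signals the end."""
    all_keys, last_key = [], None
    while True:
        keys, last_key = get_object_keys(last_key, package_size)
        all_keys.extend(keys)
        if len(keys) < package_size:  # fewer than requested: halt after this package
            break
    return all_keys
```

Note that, as stated above, the method may legitimately return more keys than requested; the loop only stops on a short package.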
If the method is not implemented, CX_CRM_GENIL_NOT_IMPLEMENTED is raised and caught by the indexing, and we proceed with...
We execute execute_filters without handing over keys. The problem with this execution is that the query returns data, but it is not clear whether it contains the root key and, if yes, which one it is.
So we try two approaches.
- Find a relation to the root object and fetch the query from there
- If the above is not possible, we assume that the query result object key is the same (type and content) as the root object key. If this assumption is wrong (and it actually is pretty often), the process dumps.
We had some problems with this process even at a later stage. It turned out that existing relations had been removed during the implementation process simply because a different implementation team implemented a relation with the same name, causing the original relation to disappear, as the BOL allows only a single relation with a specific name in its universe.
refiring of the query to check correctness of keys
If the data was fetched by a method and the method sets its flag as described above, a re-check of the already available keys is triggered. The same process is also triggered during the delta load. First the query is asked whether it supports the hint 'BOL_ROOT_ENTITY_KEY'. If the query does not support it, the query is not refired and the complete refiring is suppressed (so there is no possibility to filter entries out).
If the query supports the hint, it receives the list of keys as selection hints as well as the customized parameters. The process then uses the result of the query as the basis for the extraction.
WARNING: Unfortunately it often happens that a query states that it supports BOL_ROOT_ENTITY_KEY although it does not support it in reality. The effect is that all entries are returned, leading to horrible performance and/or multiplied data in TREX. Often this is not detected during the test phases, since the test data sizes are so small that the problem does not show up during the Initial Load. There is also a difference between parallel execution (typically using an internal package size below 100) and sequential execution (normally using 10,000). The delta load then runs into trouble in all cases.
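The guard logic around the refire can be summarized as follows (Python sketch; the names are illustrative, and the warning above corresponds to run_query ignoring the selection hints):

```python
def refire_query(query_supports_hint, run_query, candidate_keys, params):
    """If the query claims support for the BOL_ROOT_ENTITY_KEY hint,
    refire it with the candidate keys as selection hints and use its
    result; otherwise skip the refire entirely, so no further
    filtering of the candidate keys takes place."""
    if not query_supports_hint:
        return candidate_keys
    return run_query(selection_hints=candidate_keys, params=params)
```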
In case of retrieval by query, all keys are stored in table CRMD_ES_KEYSTORE (see above). In case of retrieval by method, only the current key set (typically 10,000 keys) is stored in the table.
At first, all entries belonging to the current template type are deleted. Then the keys are added. Afterwards the keys are fetched, not using a cursor, but with the following SQL query:
With EhP2 the Enterprise Search also supports the Mass Access (MA) API provided by the BOL. To use it, two things have to be changed:
- the call of the read
- the examination of the data
The mass access only works if all required objects support it. This means that if we enable an object for mass access and the customer copies this template and adds a dependent object which is not mass-enabled, the complete process must be done the standard way. The read always first attempts to read the data using mass access; if the objects do not support this, an exception is raised and we switch to standard mode. The examination of the data is a completely different topic, as the data is not returned as a tree of objects but table-based. The coding is located in the recursive method CONVERT_MA_DATA.
differences between sequential and parallel indexing
Beyond the obvious differences there are also some technical differences between the two approaches:
- parallel execution works with a package size of about 100 (as customized in CRMC_EXTMGR_PAR), compared to the 10,000 of sequential mode.
- there is a mechanism in the parallel execution which is able to identify corrupt entities and read around them, so a single corrupt object does not destroy the upload of possibly millions of objects. This is possible since the parallel execution works in separate tasks: a dump does not harm the main process, and we can react to the failing task. None of this is possible in the sequential process, which in such a case simply halts.
- it is not possible to use artificial data structures for the parallel execution, as they cannot be handed to the parallel processes. Instead we have to use the original data structures and copy them to the data structures provided by the NW Enterprise Search. These may also contain ES_FATHER_KEY, ES_OWN_KEY, or BOL_ROOT_ENTITY_KEY fields which are not part of the original structure; this information is stored in a CRMT_ES_ADDITIONAL_FIELDS_TAB table, together with the respective row number.
- some templates may be excluded from the parallel extraction (such as CRM_SURVEY)
- parallel execution is of course difficult to debug. Sometimes it might be convenient to reduce the package size in program ESH_IX_CRT_INDEX_OBJECT_TYPE to a number lower than or equal to the one customized for the parallel process, so only one task is started and you do not get confused by several tasks.
template header information retrieving
Before EhP2, all required template header information was retrieved from the model manager via a call to go_model_manager->get_model_info. This is still available in EhP2 in the method READ_TEMPL_DATA_FROM_MODEL_MGR, as it still has to be used for old templates. For templates following the new approach, the required information comes directly from the ES; see method READ_TEMPL_DATA_FROM_ES.
BOL read query assembly
Data is read with a single BOL read that covers a package-size number of entities on all levels. Before the read can be executed, the relation table used for the read has to be assembled. This is done in the three methods SPLIT_RELTAB, ADD_KEYINFO_TO_RELTAB, and ADD_RELATIONINFO_TO_RELTAB. The first method also constructs a model tree which is used later during the examination of the data. ADD_KEYINFO_TO_RELTAB adds those rows to the relation table where the key information is required; hence it has to be specified for each and every entry. All other relations (below the highest level) are put into the table only once.
query result examination
This is provided by the recursive method CONVERT_DATA, which on the highest level is named CONVERT_DATA_ROOT. It is basically the same coding, but against slightly different data types.
mass access result examination
You may debug the process in A2F using the template FLIGHTS, which is "mass-enabled". Well, it is more or less a hack, since the enablement is (a) only functional (the component calls the normal read and converts the result to the required structures) and (b) limited to the usage we have for the FLIGHTS template. The method CONVERT_MA_DATA performs the examination of the mass access result. Mass access returns no object hierarchy but a rather complicated table structure. The main result object contains three tables:
- IF_GENIL_MA_RESPONSE~REQUEST_DESCR Describes the data which is returned
- TYPE_STORAGE the data
- PART_COUNT_TAB not needed
Our flight example request asks for Flights, Books, Tickets, and Notes. Since Notes is related 1:1 to Flights, they share the same row in the type storage, but Notes is included as a structure inclusion.
The examination process follows the same recursive approach as the normal examination because, for the father key assembly, we need the key information from the father, and so we have to examine the father table first. The process follows this abstract schedule:
- identify entry in Enterprise Search table
- identify entry in Mass Access result table
- set field symbols accordingly and check whether any other table uses this table as parent, in which case we have to collect the keys for later ES_FATHER_KEY usage
- move data from result table to ES table
- add BOL_ROOT_ENTITY_KEY and ES_OWN_KEY if required (that's easy)
- if ES_FATHER_KEY is required: examine the father table to find the corresponding key and fill it (the second option available in the coding should no longer occur)
- store key information in table which is handed lower
- recursive call to all dependent entities.
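The schedule above can be sketched recursively in Python (illustrative only: the node/table layout and the field names KEY and PARENT are simplified stand-ins for the mass access result tables):

```python
def convert_ma_node(node, tables, father_keys=None, out=None):
    """Sketch of the CONVERT_MA_DATA idea: process the father table
    first so child rows can be given an ES_FATHER_KEY, then recurse
    into the dependent nodes. 'tables' maps node name -> list of rows;
    'node' describes the hierarchy as {"name": ..., "children": [...]}."""
    if out is None:
        out = {}
    name, children = node["name"], node.get("children", [])
    rows = []
    for row in tables.get(name, []):
        target = dict(row)  # move data from the result table to the ES table
        if father_keys is not None:
            target["ES_FATHER_KEY"] = father_keys.get(row.get("PARENT"))
        rows.append(target)
    out[name] = rows
    # remember own keys so the children can resolve their ES_FATHER_KEY
    own_keys = {row["KEY"]: row["KEY"] for row in tables.get(name, [])}
    for child in children:
        convert_ma_node(child, tables, own_keys, out)
    return out
```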
when a field with the name ES_OWN_KEY is part of the requested structure, the extract will fill it with a self-created GUID. The type (binary or text) is determined from the target field type; see ADD_TO_TARGET_FIELDSTRUC.
when a field with the name ES_FATHER_KEY is part of the requested structure, it will be filled with the key of the father. This handling has some limitations:
- the key of the father must be a single field
- with EhP2
This feature is required on root level, as the topmost entry must have this specific field for UI purposes. Technically, the field is filled with the BOL object ID, which is passed through the recursive process to the ADD_TO_TARGET_FIELDSTRUC method.
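The GUID filling for ES_OWN_KEY can be illustrated as follows (Python sketch; in the real code the binary vs. text decision is derived from the DDIC type of the target field in ADD_TO_TARGET_FIELDSTRUC, here it is a plain parameter):

```python
import uuid

def fill_own_key(row, field="ES_OWN_KEY", binary=False):
    """If the requested structure contains an ES_OWN_KEY field, fill
    it with a self-created GUID, either as 16 raw bytes or as a
    32-character text representation."""
    guid = uuid.uuid4()
    row[field] = guid.bytes if binary else guid.hex.upper()
    return row
```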
log file assembly
The log file communication is done in CL_CRM_ES_EXTRACT_MANAGER; in CL_CRM_ES_BO_EXTR no messages are put into the corresponding structure. You may further restrict the log list by setting the sub-object either to 'CRM_ES_DELTA_EXTR' for delta loads or to 'CRM_ES_DATA_EXTR' for Initial Loads. Additionally, structure transfers (CRM_ES_STRUC_TRANS) and customizing transfers (CRM_ES_CUST_TRANS) are filed under object CRM_ES. In the success case, the following information is logged:
- delta or initial extraction
- template to be extracted
- mass access yes / no
- execution in parallel or sequentially; if parallel, the number of tasks is also reported
- extraction date and time
- requested and returned entities. The log file only states the success of the CRM part of the indexing. If the Enterprise Search parts (i.e. the upload to the TREX) fail afterwards, this will not be reported in this log; please take a look at the log in the Admin Cockpit. The log file also reports some of the possible problems; in case something is wrong with the CRM indexing, typically a provider error is thrown.
Logs can be viewed with transaction SLG1. Use object "CRM_ES" for filtering. Data retrieval might be very slow on first use.
BT_ORGSET special case
As for the org set, no data is also data, and an empty data set has to be uploaded. This is special, hard-coded stuff and hence pretty ugly. In CONVERT_DATA it is checked whether there is a relation between BTAdminH and BTOrgSet; if yes, LV_IS_SPECIAL_CASE is set. If this flag is set and no data is available, ADD_TO_TARGET_FIELDSTRUC is executed with the mode IV_FIRE_BLIND, adding an empty row to the data set.
is developed by the respective application team directly.
parallel? Or sequential?
The decision is derived in CL_CRM_ES_EXTRACT_MANAGER, which first uses GET_EXTRACTION_MODE to fetch the mode either from CRMD_ES_LOADMODE or from the user parameter CRM_ES_PLL. If neither applies, the extraction mode is 'H'. The decision itself is derived with the logical statement: is_parallel = ( lv_pll = 'P' OR ( lv_pll = 'H' AND iv_is_deltaload = abap_false ) ).
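The same decision expressed as a small Python function (a sketch of the ABAP statement quoted above, not the actual implementation):

```python
def is_parallel(mode, is_delta_load):
    """Mirror of the decision logic: 'P' = always parallel,
    'S' = always sequential, 'H' (hybrid, the default) = parallel
    for the Initial Load, sequential for the delta load."""
    return mode == "P" or (mode == "H" and not is_delta_load)
```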
Initialization of the parallel framework
The initialization can fail either because the customizing is not present (then EXTRACT_BO raises i010) or due to system issues (then EXTRACT_DATA_PARALLEL raises i018). The second message is only available from EhP1 onwards; in plain CRM 7.0 the dump may be cryptic.
system settings, sizing
Frankly speaking, our knowledge of the correct settings is limited; it is probably more a task for the application / consulting teams. What we can say is that we have made good experiences with a package size below 100. The number of tasks is highly machine dependent; the other parameters are of minor importance.
NetWeaver Enterprise Search parallelization
The Initial Load (and only the Initial Load) as triggered from the Enterprise Search is also parallelized, which means we have a kind of double parallelization: we work with dialog jobs whereas the NetWeaver collea
Note 1422819, CRM 7.0 upto SP05, Initial Load Memory Problems
The package size was set incorrectly in the query approach, so all entities were processed instead of one package. The note must be implemented.
Note 1477143, CRM 7.0 upto SP07, Extract dumps when no data is supplied.
The attempt to calculate LAST_PROCESSED_KEY dumps if no data was provided by the method. In productive systems this should not occur; nevertheless, it is suggested to apply the note.
Note 1530416, CRM 7.0 upto SP08, Adaption to Hub scenario
If used in hub scenario, this note must be applied.