
                                             

Anupam Ghosh

  

Effect of garbage collection on throughput of PI server

Garbage, as the name suggests, is a collection of useless objects that were once useful to the program. The system recycles memory by releasing the heap space occupied by objects that are no longer referenced by the program; the freed space may then be used by new objects. Before freeing an object, the garbage collector must also run any finalizers defined for it.

The SAP NetWeaver application server uses Virtual Machine Container technology, whose aim is to isolate users from one another as securely as possible without imposing unwanted restrictions. Shared memory is used exclusively in the VM container environment in order to implement the SAP roll-in and roll-out concept. Each work process can map areas of this shared resource into its address space (at a stable address); only in exceptional cases is local process memory allocated for a short time. While a request is being processed, a Java VM works exclusively for one user. A pool of VMs is dynamically assigned to the work processes, and these VMs can access shared data in the shared memory.

There are the following memory types for the VM container:

●      Java heap: Java objects belonging to a VM are stored on the Java heap. Each VM has its own Java heap, which other VMs cannot access.

●      VM heap: The VM heap is used to store local objects such as the VM's Java and service stacks. It can be used by only one VM.

●      Shared pool: This can be used by all VMs and contains the shared Java objects (for example, shared closures, shared classes, and the shared code cache for compiled code).

A garbage collection runs periodically on the local VM Java heap and in the shared pool. As the amount of data on these heaps grows, the garbage collection process takes more time to free memory, and system throughput therefore decreases.

Unix shell scripts to split large files before PI starts processing them

Unix shell scripts may be written to split large files before PI picks them up for processing. To work with shell scripts we need to know the basics of the Unix operating system and some of its commands. Detailed information on Unix commands may be obtained from the books and references listed below.

I referred to the following books to learn about the UNIX operating system:

  • Your UNIX: The Ultimate Guide  by  Sumitabha Das
  • Unix Shell programming by Yashavant Kanetkar

When splitting files before the PI server picks them up for processing, the following factors have to be considered:

  1. The original large file cannot be split directly into smaller parts simply on the basis of its size in bytes. Each smaller part has to contain a whole number of complete lines; if the division is made purely by byte count, the smaller files may contain incomplete lines. Each line of a text file may hold several data items separated by commas or other characters, and the number of data items and their format must remain intact in the smaller files obtained after splitting the large file (see the sketch after this list).
  2. If the lines of the file are not related to each other, the splitting procedure is simple: any number of lines per file is fine. But what do I mean by a relation between lines of a file? Sometimes two consecutive lines depend on each other, that is, both must end up in the same file, and we have to make sure such lines do not get separated by the split. The case study discussed later deals with such a scenario.
  3. The case study describes how to split text files; it does not apply to other file categories such as PDF.
  4. After the shell script is designed, the SAP BASIS team can place it in the proper directory of the PI server, since PI consultants may not have access to the operating system.
  5. The file will be split into parts of equal line count as far as possible; the last of the smaller files obtained from a large file may contain very few lines. Roughly, you can expect about 17,000 lines in a file of around 5 MB.
  6. In real business scenarios the files may contain header and trailer information. This information is important because it serves as a security parameter. The trailer, for example, may contain the number of lines in the original file, so if lines have accidentally been added or deleted, the trailer will not match the actual line count, and this mismatch is reported by a validation check within the PI server. If a file has header and trailer information, each of the split files must follow the same format. The case study below shows how to split a file that carries header information.
  7. There are two ways in which the shell script can be invoked to split a large file. In the first method the script runs in an infinite loop on the server. In this case we must make sure that the directory into which we drop the file is not the directory from which PI picks files up for processing; otherwise the PI server may pick up the large file itself. The script first checks whether the file is eligible for splitting; if the file is larger than 5 MB, the script produces the smaller files and places them in the directory from which the PI server picks them up. In the second method the script is called as an operating system command from the communication channel parameters; in this case the file can be placed directly in the folder from which the PI server picks it up. In the case study I have followed the first method.
  8. The PI server processes each line of a file sequentially, one after another. When it picks up the smaller files produced by the split, however, it may not follow any particular order, because all the files become available at almost the same time. In most real business scenarios the processing order of the lines does not affect the results. If the order does matter, you need to pause the script for some time after it produces each file; the Unix command "sleep" helps here. I have not used "sleep" in the case study since it increases the running time of the script.
  9. Shell scripts have many advantages over conventional programming, such as easy program or file selection, quick start, and interactive debugging. A shell script can provide sequencing and decision-making logic around existing programs, and for moderately sized scripts the absence of a compilation step is an advantage. Interpreted execution makes it easy to write debugging code into a script and re-run it to detect and fix bugs. [3]
  10. The filenames of the smaller files obtained after splitting the large file should indicate the line numbers they contain.
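
For the simple case where the lines are independent of each other and the file carries only a one-line header, the standard Unix command "split" already divides a file on whole-line boundaries rather than bytes. The lines below are only a sketch of that idea; the file name bigfile.txt, the chunk size of 17000 lines and the prefix part_ are assumptions for illustration and are not part of the case study script.

header=`head -1 bigfile.txt`                        # keep the one-line header aside
tail -n +2 bigfile.txt | split -l 17000 - part_     # split the body on whole lines, never on raw bytes
for f in part_*
do
      ( echo "$header"; cat $f ) > $f.txt           # prepend the header to every part
      rm -f $f
done

The case study script below cannot rely on "split" alone, because it also has to make sure that consecutive lines ending in "mile" and "km" are never separated, so it computes each split point itself.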

Case Study

Problem: Split a text file with the extension "txt". The file may have any number of lines, and we have to split it into multiple smaller files. Each line of the file ends with the word "mile" or "km". Splitting is possible at any point within the file, except between two consecutive lines where a line ending in "mile" is followed by a line ending in "km", or vice versa. The script therefore always checks the last values of consecutive lines; only if both values are the same may the file be split between those two lines, otherwise the script searches for the next valid split point. (This program should also work with Excel exports and ".csv" files, but I have not tried it myself.)

       Each step of the script is explained below. Explanation statements, or comments, are preceded by the '#' symbol.

 find_split_point()
{
# This is a function to search for the line number at which
# the file will be split. The function receives three inputs:
# $1 is the line number from where the search starts
# $2 is the name of the file being split
# $3 is the total number of lines in the file
 
# convert $1 to numeric form, otherwise it stays in
# textual form; save the result in the variable lines
 
lines=`expr $1 + 0`
 
#converting $3 to numeric form otherwise it will be in
#textual form and saving in variable Total_lines_in_file
 
Total_lines_in_file=`expr 0 + $3`
 
#While loop to continue searching till end of file
 
while [ $lines -lt $Total_lines_in_file ]
do
                       
      # line1 is the first of the pair of lines being tested to see whether
      # the file can be split at this point; convert from text to numeric
      # form
 
                       
      line1=`expr $lines + 0`
 
      # line2 is the second line of the pair being tested; again convert
      # from text to numeric form
                   
      line2=`expr $lines + 1`
                       
      #Extract line1 from the file and put it in file f1
 
      head -$line1 $2 | tail -1 > f1
 
      #Find if the word “mile” is present in line1 or not
 
      c1=`grep -i "\"mile\"" f1 | wc -l`
                       
      #converting the result to numeric form
      
      c1=`expr $c1 + 0`
 
      #Put line2 in another file
 
      head -$line2 $2 | tail -1 > f1
 
      #Check if the word “km” is present in it or not
 
      c2=`grep -i "\"km\"" f1 | wc -l`
 
      #Convert the count into numeric from text
 
      c2=`expr $c2 + 0`
      # if the consecutive lines end with the words "mile" and "km", start
      # searching from the next line by incrementing the value of lines by 1
      # and continue the loop from the beginning
 
      if [ $c1 -eq 1 -a $c2 -eq 1 ]; then
             lines=`expr $lines + 1`
             continue
      fi
 
      # if line1 does not contain "mile" and line2 does not contain "km",
      # check the reverse combination: "km" on line1 followed by "mile" on
      # line2; if found, start searching again from the next line
 
      if [ $c1 -eq 0 -a $c2 -eq 0 ]; then        
          head -$line1 $2 | tail -1 > f1
          c1=`grep -i "\"km\"" f1 | wc -l`
          c1=`expr $c1 + 0`
          head -$line2 $2 | tail -1 > f1
          c2=`grep -i "\"mile\"" f1 | wc -l`
          c2=`expr $c2 + 0`
          if [ $c1 -eq 1 -a $c2 -eq 1 ]; then
                     lines=`expr $lines + 1`
                     continue
          fi
      else
          break;
      fi
 
#End the loop
 
done
 
#Return the line number from where the file may be split
 
echo $lines
}
 
 
 
#Assign value numeric 1 to variable g
 
g=`expr 1 + 0`
 
#Write 3 lines in each file. Stored in variable maxline_count
 
maxline_count=`expr 3 + 0`
 
#Start infinite loop
 
while [ $g -gt 0 ]
do
 
            # find how many files with extension txt are there and suppress error
            # messages if no files are there
 
            ww=`ls -l  *.[Tt][Xx][Tt] 2>/dev/null | wc -l`
 
            #convert ww to numeric form and assign it to ww1
 
            ww1=`expr  $ww + 0`
 
            # continue looping if there are no files with extension “txt”
 
            if [ $ww1 -eq 0 ]; then
                        continue
            fi
 
            # count the number of files with extension “txt” save it in variable count
 
            count=`ls -1 *.[Tt][Xx][Tt] | wc -l`
 
            # generate log record. This step may be omitted.
 
            echo "Number of [Tt][Xx][Tt] files=" $count > logrecord
 
            # convert count value from text to numeric
 
            count=`expr $count + 0`
 
            # initialise num variable to numeric zero
 
            num=`expr 1 - 1`
 
            #  loop until count and num variable values are equal
 
            while [ $count != $num ]
            do
                        # increment value of num  by 1
 
                        num=`expr $num + 1`            
 
                        # the name of the num-th "txt" file is stored in the variable filename
 
                        filename=`ls -1 *.[Tt][Xx][Tt] | head -$num |tail -1`
 
                        # write the filename to logrecord , temporary file. This step may be
                        #omitted.
 
                        echo "file="$filename >>logrecord
 
                        #find number of lines in the file and save it in linecount variable
 
                        linecount=`wc -l $filename |cut -d ' ' -f 1`
                       
                        # writing line count to temporary logrecord file. This step may be
                        #omitted
 
                        echo  $linecount >> logrecord
 
                        # convert linecount from text to number
 
                        linecount=`expr $linecount + 0`
 
                        # extract the header information. If there is no header omit this line
 
                        header=`head -1 $filename`
 
                        #check if the number of lines in the file are greater than maxline_count
 
                        if [ $linecount -gt $maxline_count ]; then
 
                                    # find maximum number of files to be formed after splitting
                                    #and save it in no_of_parts
 
                                    no_of_parts=`expr $linecount / $maxline_count`
 
                                    # Temporary Log record. This step may be omitted
 
                                    echo "Trying to split this  file into " $no_of_parts "parts" >> logrecord
 
                                    # Initializing variable k with numeric value of maxline_count
 
                                    k=`expr 0 + $maxline_count`
 
                                    # Initialize startcount variable to 1
 
                                    startcount=`expr 0 + 1`
 
                                    # loop until startcount exceeds linecount
 
			         while [ $startcount -le $linecount ]
                                    do
                                                # call function find_split_point with 3 parameters and
                                                # store return value in  variable lines
 
                                                lines=$(find_split_point `expr $k + 0` $filename $linecount)
                                               
                                                # The return value is written to temporary log record file.
                                                # This line may be omitted
 
                                                echo " lines=" $lines >> logrecord
 
                                                #store the filename in variable newfilename
 
					  newfilename=`echo $filename | cut -d '.' -f 1`
 
                                                # The filename is written to temporary log record file.
                                                # This line may be omitted
 
                                                echo "file1=" $newfilename >> logrecord
 
                                                # create the new file name by adding the line numbers to
                                                # the base name and appending the extension "txt"
 
                                                newfilename=`echo $newfilename"_line_numbers_"$startcount"_"$lines".txt"`
 
                                                # New File name written to temporary log file.
                                                # This line may be omitted.
 
                                                echo $newfilename >> logrecord
 
                                                # total no. of lines written to target is saved within
                                                #variable no_of_lines_to_target.
 
                                                no_of_lines_to_target=`expr $lines - $startcount + 1`
 
                                                # In log record file we keep no. of lines written to
                                                # smaller file. This line may be omitted.
 
                                                echo "no of lines to file=" $no_of_lines_to_target >> logrecord
 
                                                #copy selected number of lines from old file to new file
                                                # insert the header information for all files except the
                                                #first one, since header is present in first line of original
                                                #file. Temp and temp1 are temporary files created in
                                                #this process.
 
                                                tail -n +$startcount $filename > temp
                                                head -$no_of_lines_to_target temp > temp1
                                                if [ $startcount != 1 ]; then
                                                            echo $header > temp
                                                            cat temp temp1 >$newfilename
                                                else
                                                            cat temp1 >$newfilename
                                                fi         
 
                                                # for log record, line can be omitted
                                                echo " new file written " $newfilename >> logrecord
 
                                                # Delete temporary files
 
                                                rm -f temp temp1
 
                                                #move newly created file to met subdirectory
                                   
                                                mv -f $newfilename ./met
                                               
                                                # increment startcount to point to next line
 
                                                startcount=`expr $lines + 1`
 
                                                # decrement no_of_parts variable
 
                                                no_of_parts=`expr $no_of_parts - 1`
 
                                                # save the value of variable lines plus
                                                # maxline_count into k
 
                                                k=`expr $lines + $maxline_count`
 
                                                # test whether the number of remaining lines is less than
                                                # maxline_count; if so, the rest of the file goes into the
                                                # last part without further splitting
 
                                                p=`expr $linecount - $startcount`
                                                if [ $p -lt $maxline_count ]; then
                                                            k=`expr $linecount + 0`
                                                fi
 
                                    # loop ends
                       
                                    done
                        else

                                    # for log record. Line may be omitted

                                    echo " no splitting required for file $filename" >> logrecord

                                    # the file is already small enough; move it unchanged to the
                                    # met subdirectory so that PI can still process it

                                    mv -f $filename ./met
                        fi
                       
                       
                        #delete original file after splitting
 
                        rm -f $filename
 
                        # end inner loop
           
            done   
 
#end outer loop
           
done
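
A minimal deployment sketch is given below; the script name splitfile.sh and the directory /interfaces/inbox are assumptions used only for illustration and are not taken from the case study.

# create the subdirectory from which PI will pick up the split files
mkdir -p /interfaces/inbox/met
cd /interfaces/inbox
# run the infinite-loop script in the background so it keeps running after logout
nohup sh splitfile.sh > splitfile.log 2>&1 &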

Here is the original file to be split, "sdn.txt":

Header information

1,"mile"

2,"km"

3,"mile"

4,"km"

5,"km"

6,"km"

7,"km"

8,"km"

9,"km"

10,"km"

11,"km"

12,"km"

13,"km"

This file sdn.txt has 13 lines in it excluding the header. In the script shown above I have set the criterion that the file be split after every 3 lines. We have to keep this file sdn.txt in the same folder as the script. The script runs in an infinite loop, so whenever you put a "txt" file into the folder it is split automatically. The new files, if any, are placed in the "met" directory under the current folder. (If you want to call the script through the command-line feature of the communication channel settings, you cannot have a script running in an infinite loop, so please make the necessary changes to the code.) The sender communication channel must be configured to pick up files from the "met" directory.

      The new files formed after the split are shown below:

sdn1_line_numbers_1_5.txt

_____________________________

Header information

1,"mile"

2,"km"

3,"mile"

4,"km"

  

sdn1_line_numbers_6_8.txt

____________________________

Header information

5,"km"

6,"km"

7,"km"

sdn1_line_numbers_9_11.txt

________________________________

Header information

8,"km"

9,"km"

10,"km"

sdn1_line_numbers_12_14.txt

____________________________________

Header information

11,"km"

12,"km"

13,"km"

                  

If you look carefully, you will see that sdn1_line_numbers_1_5.txt has 4 lines instead of 3 (excluding the header). This is because, according to our initial condition, the file cannot be split between its 3rd and 4th lines: consecutive lines ending in "km" and "mile" (or vice versa) cannot form a split point. The script has therefore searched further and found a suitable split point one line later. If the lines are independent of each other you can remove this check. The original script file (without comments) and sample "txt" files are provided in the links below.
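
As a quick check against the line-count concern mentioned in point 6 above, you can verify that the data lines of the split files add up to the 13 lines of the original. The command below is only a sketch; it assumes the split files are in the "met" subdirectory and that the header line is exactly "Header information".

cat ./met/sdn1_line_numbers_*.txt | grep -v "^Header information" | wc -l      # should print 13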

                                         

References

  1. http://www.artima.com/insidejvm/ed2/gcP.html
  2. http://help.sap.com/saphelp_nw70/helpdata/en/a9/26ae3c95164695accbf2483a14281e/frameset.htm
  3. http://en.wikipedia.org/wiki/Shell_script#Advantages_and_disadvantages
  4. http://wiki.sdn.sap.com/wiki/display/XI/Optimum+File+Size+for+various+file+scenarios+in+PI+7.0