Section 19.13. Profiling

19.13. Profiling

Don't optimize applicationsprofile them.

In the previous guideline, the benchmarked comparison between repeatedly computing sqrt $pixel_value and repeatedly looking up $sqrt_of[$pixel_value] indicated that caching provided a 9% improvement:


                     Rate         recompute   look_up_array
    recompute       3951/s            --             -8%
    look_up_array   4291/s            9%             --

That sounds impressive, but it's important to keep those numbers in perspective. Each iteration of the test did 256 square root retrievals. So, overall, the test was achieving 1,011,456 (i.e., 3951 x 256) sqrt calls per second, compared to 1,098,496 @sqrt_of look-ups per second.

Suppose you were processing the 786,432 pixels of a typical 1024 x 768 image. Using the example performance figures, the repeated sqrt calls would require around 0.78 seconds to process that many pixels, whereas the look-up table would take only about 0.72 seconds. Adding a cache to this section of your code would save you a grand total of 0.06 seconds per image.

That's an all-too-common outcome when code is optimized: developers focus their efforts on those components that are easy to optimize, rather than on those components in which improvements will produce the greatest benefit.

How do you find those places where optimization will do the most good? By understanding where your application spends most of its time. And the easiest way to do that is to profile your program using the standard Devel::DProf module, which can determine how long your application spends within each subroutine in your source code. That is, instead of running your program in the usual way:

    
    > perl application.pl datafile

run it under the auspices of the profiler module:

    
    > perl -d:DProf application.pl datafile

The -d: debugging flag is a shorthand for -MDevel::, so specifying -d:DProf is the same as specifying -MDevel::DProf. That tells perl to include that profiling module before the start of the application.pl source.

The module itself simply watches every subroutine call within your code, noting how much time elapses between the invocation and return, and adding that duration to a record of the total time spent in each subroutine. At the end of the program, the module creates a file called tmon.out in the current directory.

It's possible to directly interpret the raw data in the file (see the module docs for details), but much easier to understand it by passing it through the standard dprofpp application.

For example, you could use Devel::DProf to profile a call to your new autoformat application, and then summarize the results with dproffpp like so:

    
    > perl -d:DProf ~/bin/autoformat  < Ch_19.txt  > Ch_19_formatted.txt


    > dprofpp tmon.out

    Total Elapsed Time = 3.212516 Seconds
      User+System Time = 0.722516 Seconds
    Exclusive Times
    %Time ExclSec CumulS #Calls sec/call Csec/c  Name
     16.6   0.120  0.416      3   0.0399 0.1387  main::BEGIN
     11.0   0.080  0.138      9   0.0089 0.0153  Text::Autoformat::BEGIN
     8.30   0.060  0.066      1   0.0600 0.0665  Getopt::Declare::parse
     7.89   0.057  0.057    221   0.0003 0.0003  Text::Balanced::_failmsg
     6.09   0.044  0.075      1   0.0437 0.0749  Text::Autoformat::autoformat
     5.54   0.040  0.089      3   0.0133 0.0298  Getopt::Declare::Arg::BEGIN
     5.26   0.038  0.252      1   0.0381 0.2516  Getopt::Declare::new
     4.01   0.029  0.059     26   0.0011 0.0023  Getopt::Declare::BEGIN
     3.88   0.028  0.111     70   0.0004 0.0016  Text::Balanced::extract_codeblock
     3.60   0.026  0.074    133   0.0002 0.0006  Text::Balanced::_match_codeblock
     2.77   0.020  0.020      1   0.0200 0.0199  Text::Balanced::ErrorMsg::BEGIN
     2.77   0.020  0.030      3   0.0066 0.0099  vars::BEGIN
     2.77   0.020  0.020      5   0.0040 0.0040  Exporter::as_heavy
     2.77   0.020  0.020     17   0.0012 0.0012  Exporter::import
     1.38   0.010  0.010      1   0.0100 0.0100  AutoLoader::AUTOLOAD

For each subroutine, the table produced by dprofpp shows (amongst other details):

The total time spent in the subroutine (%Time), as a percentage of the total time spent running the application
The actual amount of time spent within the subroutine (ExclSec)
The actual amount of time spent within the subroutine and any subroutines it called (CumulS)
The number of calls made to the subroutine (#Calls)

Looking at the output of the previous example, you can see that the program spends about 8% of its time in Text::Balanced::_failmsg( ), making a total of 221 calls to that subroutine. In contrast, it spends only about half that amount of time in 70 calls to Text::Balanced::extract_codeblock( ). So it probably makes sense to focus any optimization efforts on _failmsg( ) first.

If you need finer-grained profiling, the Devel::SmallProf CPAN module allows you to count how many times each line of your program is executed, which makes it easier to determine precisely what statement is causing a particular subroutine to be expensive:

    
    > perl -d:SmallProf application.pl datafile

The result is a file named smallprof.out, which is actually a copy of your source code with each line prefixed by the number of times it was executed, the total amount of time spent executing the line, and the line number. Although the module lacks a utility like dprofpp to summarize its output, it is an excellent investigative tool once you have identified the main suspects using Devel::DProf.