Last week, Mike Croucher posted a very interesting article showing that cumprod can be used to generate a vector of powers much more quickly than the built-in .^ operator. Trying to improve on Mike's results, I used my finding that zeros(n,m)+scalar is often faster than ones(n,m)*scalar (see my article on pre-allocation performance). Applying this to Mike's powers-vector example only gave me a ~25% performance boost (i.e., a run time of 1/1.25 of the original, or about 20% shorter), rather than the x5 speedup that I got in my original article.
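For concreteness, here is a minimal sketch (my own illustrative code, not Mike's exact benchmark) of the powers-vector idea and of the zeros(n,m)+scalar variant discussed above:

% Illustrative sketch: computing the powers vector [x, x^2, ..., x^n] in three ways
x = 1.000001;  n = 1e6;
tic, p1 = x .^ (1:n);               t1 = toc;  % built-in element-wise power
tic, p2 = cumprod(ones(1,n) * x);   t2 = toc;  % cumprod over a constant vector
tic, p3 = cumprod(zeros(1,n) + x);  t3 = toc;  % same, but allocating via zeros()+scalar
fprintf('.^ : %.3g s,  ones*x: %.3g s,  zeros+x: %.3g s\n', t1, t2, t3);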
Naturally, the difference could be due to different conditions: a different platform, OS, Matlab release, or allocation size. But the difference felt intriguing enough to warrant a small investigation. I came up with some interesting new findings that I cannot fully explain:

[Figure: The performance of allocating zeros, ones (x100)]
function t = perfTest
    % Run tests multiple times, for multiple allocation sizes
    n = 100;  tidx = 1;  iters = 100;
    while n < 1e8
        t(tidx,1) = n;
        clear y; tic, for idx=1:iters, clear y; y=ones(n,1);  end, t(tidx,2)=toc;  % clear; ones()
        clear y; tic, for idx=1:iters, clear y; y=zeros(n,1); end, t(tidx,3)=toc;  % clear; zeros()
        clear y; tic, for idx=1:iters, y=ones(n,1);  end, t(tidx,4)=toc;           % only ones()
        clear y; tic, for idx=1:iters, y=zeros(n,1); end, t(tidx,5)=toc;           % only zeros()
        n = n * 2;
        tidx = tidx + 1;
    end

    % Normalize result on a per-element basis
    t2 = bsxfun(@rdivide, t(:,2:end), t(:,1));

    % Display the results
    h = loglog(t(:,1), t(:,2:end));  % overall durations
    %h = loglog(t(:,1), t2);         % normalized durations
    set(h, 'LineSmoothing','on');    % see http://undocumentedmatlab.com/blog/plot-linesmoothing-property/
    set(h(2), 'LineStyle','--', 'Marker','+', 'MarkerSize',5, 'Color',[0,.5,0]);
    set(h(3), 'LineStyle',':',  'Marker','o', 'MarkerSize',5);
    set(h(4), 'LineStyle','-.', 'Marker','*', 'MarkerSize',5);
    legend(h, 'clear; ones', 'clear; zeros', 'ones', 'zeros', 'Location','NorthWest');
    xlabel('# allocated elements');
    ylabel('duration [secs]');
    box off
end
The full results (R2013a, Win 7 64-bit, 8GB RAM) were:

[Figure: The same data normalized per-element (x100)]
       n   clear; ones   clear; zeros   only ones   only zeros
========   ===========   ============   =========   ==========
     100      0.000442       0.000384    0.000129     0.000124
     200      0.000390       0.000378    0.000150     0.000121
     400      0.000404       0.000424    0.000161     0.000151
     800      0.000422       0.000438    0.000165     0.000176
    1600      0.000583       0.000516    0.000211     0.000206
    3200      0.000656       0.000606    0.000325     0.000296
    6400      0.000863       0.000724    0.000587     0.000396
   12800      0.001289       0.000976    0.000975     0.000659
   25600      0.002184       0.001574    0.001874     0.001360
   51200      0.004189       0.002776    0.003649     0.002320
  102400      0.010900       0.005870    0.010778     0.005487
  204800      0.051658       0.000966    0.049570     0.000466
  409600      0.095736       0.000901    0.095183     0.000463
  819200      0.213949       0.000984    0.219887     0.000817
 1638400      0.421103       0.001023    0.429692     0.000610
 3276800      0.886328       0.000936    0.877006     0.000609
 6553600      1.749774       0.000972    1.740359     0.000526
13107200      3.499982       0.001108    3.550072     0.000649
26214400      7.094449       0.001144    7.006229     0.000712
52428800     14.039551       0.001853   14.396687     0.000822
(Note: all numbers should be divided by the number of loop iterations, iters=100, to get per-call durations)
As can be seen from the data and resulting plots (log-log scale), the more elements we allocate, the longer it takes. It is not surprising that the allocation duration is roughly linear in the allocation size: allocating twice as many elements takes roughly twice as long. It is also not surprising that the allocation has some small fixed overhead, which becomes apparent when allocating a small number of elements.
A potentially surprising result, namely that allocating 200-400 elements is in some cases a bit faster than allocating only 100 elements, can actually be attributed to measurement inaccuracies and JIT warm-up time.
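One way to reduce such noise (an illustrative sketch of my own, not part of the benchmark code above) is to discard a warm-up call and keep the best of several repetitions:

% Illustrative sketch: reduce timing noise via a warm-up call and best-of-several reps
n = 100;  iters = 100;  reps = 5;  tBest = inf;
y = zeros(n,1);                      % warm-up call, outside the measurement
for r = 1 : reps
    tic
    for idx = 1 : iters
        y = zeros(n,1);
    end
    tBest = min(tBest, toc/iters);   % keep the best per-call duration
end
fprintf('zeros(%d,1): ~%.3g secs per call\n', n, tBest);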
Another potentially surprising result, namely that zeros is consistently faster than ones, can perhaps be explained by zeros being able to use a more efficient low-level function (bzero) to clear memory, whereas ones needs to memset a specific value.
A somewhat more surprising effect is that of the clear command: as can be seen in the code, calling clear within the timed loops has no functional use, because in all cases the variable y is immediately overwritten with new values. However, the data clearly shows that calling clear adds an overhead of about 3µs per call. Calling clear is important when dealing with very large memory constructs: clearing them from memory enables additional allocations (of the same or different variables) without requiring virtual-memory paging, which would be disastrous for performance. But if a very large loop calls clear numerous times without serving such a purpose, it is better to remove the call: although the overhead is small, it accumulates and can become an important factor in very large loops.
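This overhead can be estimated directly, using the same tic/toc approach as above (a minimal sketch; the ~3µs figure will naturally vary across systems):

% Illustrative sketch: estimating the per-call overhead of a redundant clear
n = 100;  iters = 1e4;
tic, for idx = 1:iters,          y = zeros(n,1); end, tPlain = toc;
tic, for idx = 1:iters, clear y; y = zeros(n,1); end, tClear = toc;
fprintf('extra cost of clear: ~%.2g secs per call\n', (tClear - tPlain)/iters);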
Another surprising aspect is that zeros (with or without clear) becomes much faster when allocating 200K+ elements, compared to 100K elements. This is indicative of an internal switch to a more optimized allocation algorithm, whose cost is apparently constant rather than linear in the allocation size. At the very same point, there is a corresponding performance degradation in the allocation of ones. I suspect that 100K is the point at which Matlab's internal parallelization (multi-threading) kicks in. This occurs at varying points for different functions, but it is normally some multiple of 20K elements (20K, 40K, 100K or 200K - a detailed list was again posted by Mike Croucher). Apparently it kicks in at 100K for zeros, but for some reason not for ones.
The performance degradation at 100K elements has been around in Matlab for ages - I see it as far back as R12 (Matlab 6.0), for both zeros and ones. The reason for it is unknown to me; if anyone can shed light on it, I'd be happy to hear. What is new is the introduction of a faster internal mechanism (presumably multi-threading) for zeros in R2008b (Matlab 7.7), at the very same point (100K elements), although for some unknown reason this was not done for ones as well.
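If the multi-threading hypothesis is correct, limiting Matlab to a single computational thread should make the jump at 200K elements disappear. Here is a sketch of such a test (it uses the documented maxNumCompThreads function, which recent releases warn may eventually be removed; whether it actually affects the allocation path of zeros is exactly what the test would show):

% Illustrative sketch: does single-threading remove the zeros() speedup at ~200K elements?
n = 204800;  iters = 100;
oldN = maxNumCompThreads(1);         % force a single computational thread, remember old setting
tic, for idx = 1:iters, y = zeros(n,1); end, tSingle = toc;
maxNumCompThreads(oldN);             % restore the original thread count
tic, for idx = 1:iters, y = zeros(n,1); end, tMulti = toc;
fprintf('zeros(%d,1): single-thread %.3g s, multi-thread %.3g s (per %d calls)\n', ...
        n, tSingle, tMulti, iters);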
Another strange aspect is the size of the speedup for zeros at 200K elements: extrapolating the linear trend from 100K elements, zeros(204800,1) should take about 0.0117 secs per 100 calls, but it actually takes only ~0.001 secs - a speedup of ~12, much higher than the optimal x4 speedup expected on my quad-core system. The extra speedup may perhaps be explained by hyper-threading or SIMD at the CPU level.
In any case, going back to the original reason for this investigation, the wide disparity in speedups between zeros and ones for 10K elements (as in Mike Croucher's post) and for 3M elements (as in my pre-allocation performance article) now becomes clear: for 10K elements, multi-threading is not active, and zeros is indeed only 20-30% faster than ones; for 3M elements, the superior multi-threading of zeros over ones enables much larger speedups, which increase with the allocated size.
Some take-away lessons:
- Using zeros is always preferable to ones, especially for more than 100K elements on Matlab 7.7 (R2008b) and newer.
- zeros(n,m)+scalar is consistently faster than ones(n,m)*scalar, especially for more than 100K elements, and for the same reason (see the sketch after this list).
- In some cases, it may be worthwhile to use a built-in function with more elements than actually necessary, just to benefit from its internal multi-threading.
- Never take performance assumptions for granted. Always test on your specific system using a representative data-set.
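To illustrate the second lesson above, here is a minimal sketch (my own code, not taken from either of the original articles) comparing the two idioms below and above the ~100K-element threshold:

% Illustrative sketch: zeros(n,m)+scalar vs ones(n,m)*scalar, below and above ~100K elements
for n = [1e4, 3e6]
    tic, for idx = 1:100, y = ones(n,1)  * pi; end, tOnes  = toc;
    tic, for idx = 1:100, y = zeros(n,1) + pi; end, tZeros = toc;
    fprintf('n = %8d:  ones*scalar %.3g s,  zeros+scalar %.3g s  (x%.1f speedup)\n', ...
            n, tOnes, tZeros, tOnes/tZeros);
end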
p.s. - Readers who are interested in the historical evolution of the zeros function are referred to Loren Shure's latest post from only a few days ago (a fortunate coincidence indeed). Unfortunately, Loren's post does not illuminate the mysteries above.