COMPGROUPS.NET

### memcpy from within matlab


```Bottleneck statement:

idx1=...
idx2=...
A(idx1)=B(idx2);

where A and B are predefined 3D arrays, and idx1 and idx2 are vectors of int32 indices of length 10^4 to 10^5.
This statement is my bottleneck, given that I use it very often in a loop. Is there a more efficient method to copy B(idx2) into A(idx1)? Maybe using lower-level instructions like memcpy, or the like, directly from inside Matlab code?
Thank you for the attention,
gianni
```

```"Gianni Schena" <schena@univ.trieste.it> wrote in message <i6kn36\$f9a\$1@fred.mathworks.com>...
> bottleneck statement :
>
> idx1=...
> idx2=....
> A(idx1)=B(idx2);
>
> where A and B are predefined 3D arrays, idx1 and idx2 are vectors of ind32 indeces of length 10^4 to 10^5.
> This statement is my bottle-neck given that I use it very often in a loop, is there a more efficient method to copy B(idx2) into  A(idx1) ? may be using lower level instructions like memcpy or the alike directly from inside Matlab code?

memcpy() copies a contiguous block; how can it be applied in your case? Are idx1 and idx2 consecutive integer arrays?

If that is the case, you can code a MEX file to do that.

Bruno
```

```"Bruno Luong" <b.luong@fogale.findmycountry> wrote in message <i6l490\$ovn\$1@fred.mathworks.com>...
> "Gianni Schena" <schena@univ.trieste.it> wrote in message <i6kn36\$f9a\$1@fred.mathworks.com>...
> > bottleneck statement :
> >
> > idx1=...
> > idx2=....
> > A(idx1)=B(idx2);
> >
> > where A and B are predefined 3D arrays, idx1 and idx2 are vectors of ind32 indeces of length 10^4 to 10^5.
> > This statement is my bottle-neck given that I use it very often in a loop, is there a more efficient method to copy B(idx2) into  A(idx1) ? may be using lower level instructions like memcpy or the alike directly from inside Matlab code?
>
> memcpy() copies contiguous block, how it can applied in your case? Are idx1 and idx2  consecutive integer arrays?
>
> If it's the case, you can code a MEX file to do that.
>

Bruno, thank you for the advice. No, they are NOT consecutive arrays of integers.
Any alternative suggestion? Regards, gianni

> Bruno
```

```"Gianni Schena" <schena@univ.trieste.it> wrote in message <i6l5q7\$762\$1@fred.mathworks.com>...
> "Bruno Luong" <b.luong@fogale.findmycountry> wrote in message <i6l490\$ovn\$1@fred.mathworks.com>...
> > "Gianni Schena" <schena@univ.trieste.it> wrote in message <i6kn36\$f9a\$1@fred.mathworks.com>...
> > > bottleneck statement :
> > >
> > > idx1=...
> > > idx2=....
> > > A(idx1)=B(idx2);
> > >
> > > where A and B are predefined 3D arrays, idx1 and idx2 are vectors of ind32 indeces of length 10^4 to 10^5.
> > > This statement is my bottle-neck given that I use it very often in a loop, is there a more efficient method to copy B(idx2) into  A(idx1) ? may be using lower level instructions like memcpy or the alike directly from inside Matlab code?
> >
> > memcpy() copies contiguous block, how it can applied in your case? Are idx1 and idx2  consecutive integer arrays?
> >
> > If it's the case, you can code a MEX file to do that.
> >
>
> Bruno, Thank you for the advise, No , they are NOT consecutive arrays of integers.
> Any alternative suggestion ? regards, gianni

The only alternative is to do the downstream calculations in a MEX routine, and then you could avoid the memory copy at the MATLAB level. What do you do with A downstream?

James Tursa
```

```"Gianni Schena" <schena@univ.trieste.it> wrote in message <i6l5k7\$o4e\$1@fred.mathworks.com>...

> Bruno, Thank you for the advise, No , they are NOT consecutive arrays of integers.
> Any alternative suggestion ? regards, gianni

You might try to make an in-place change of A using MEX. That could save some run time, but it violates Matlab's memory-sharing requirements. If you know what you are doing, there is no danger.

Bruno
```

```On 13/09/10 3:24 AM, Gianni Schena wrote:
> bottleneck statement :
>
> idx1=...
> idx2=....
> A(idx1)=B(idx2);
> where A and B are predefined 3D arrays, idx1 and idx2 are vectors of
> ind32 indeces of length 10^4 to 10^5.
> This statement is my bottle-neck given that I use it very often in a
> loop, is there a more efficient method to copy B(idx2) into A(idx1) ?
> may be using lower level instructions like memcpy or the alike directly
> from inside Matlab code? thank you for the attention

We do not know the Matlab implementation of this kind of assignment, so
it is difficult to say whether there is potential to do better.

There are some important effects that you want to avoid in copying
memory around:

1) Cache-line thrashing.

This occurs when primary cache is limited {as it always is!} (e.g., 32
KB per core for Intel i5), and two locations needed for the same
operation have the same low-order bits in their address. For example, if
both arrays occupied 64 megabytes and were adjacent in memory, then
indexes 1 and 4097 and 8193 (and so on) would be at the same relative
offsets in the 32 KB cache, so assigning B(8193) to A(1) would require
moving data in and out of primary cache for a single operation.

Mathworks has never defined the memory alignment of arrays (not even
"large" arrays for some definition of "large"), so it should be assumed
that any array might have any arbitrary 8-byte alignment. The addresses
of those data blocks are not exposed at the Matlab level (other than
using "format debug") and processor details about cache sizes and cache
strategies are not exposed, so at the Matlab level it is difficult or
impossible to avoid cache line thrashing. If, however, one drops into C
or C++ then it is not uncommon for there to be _some_ way to get at the
processor details, and the memory addresses of the data blocks become
exposed, which would at least allow one to evaluate how much thrashing
would take place.

Sometimes it just isn't worth the effort to avoid cache line thrashing,
and sometimes it is crucial for performance. Avoiding it can be done by
using an intermediate buffer deliberately chosen to be not aligned with
source or destination; although data may be copied twice, doing so can
be much much faster than consistent cache line thrashing.

The intermediate buffer strategy is easiest to implement if the
destination offsets (idx1 in your example) are sorted.

[sidx1, sidx1idx] = sort(idx1);
A(sidx1) = B(idx2(sidx1idx));

The above code is semantically equivalent to the straightforward
assignment, but a MEX'd version of this strategy has more potential to
optimize cache access.

2) Cache exhaustion.

Your arrays are big enough that each of them will more than fill typical
primary and secondary and tertiary cache on desktop processors.
Traditionally it is said that each cache level is about a factor of 10
slower than the higher level. There are thus performance gains to be
made if, instead of evaluating the assignment as a "stream" of indices
to be done sequentially, that you instead "chunk" the assignments, doing
as many in the same cache block as you can before moving on to the next
cache block. You can achieve this in either reading the source data or
in writing the destination data, by using an index sorting technique
such as I describe above. If you are fortunate you might be able to do
it for source and destination simultaneously, if the source and
destination indexes are "near" each other.

3) Filling write lines.

For efficiencies, modern processors usually assume that if you ask to
access a particular chunk of data, that you will probably want to access
immediately adjacent data. They thus do not usually _only_ read in 8
bytes (one double) at a time: they usually read in a small number of
such blocks at a time. Correspondingly, when memory is flushed out of
primary cache after being written to, it is not just the 8 bytes that
gets written back: instead the entire small block gets written back.

It is thus inefficient to read and write the same small block multiple times with
cache flushing going on in-between. Writing to a block of data is
usually more expensive on a shared memory system (such as exists in
multicore or multi-cpu systems), as the memory address must be "locked"
against simultaneous update, and the new value must be propagated
through to every core or CPU or cache that happens to be accessing the
same block; whereas when a core or CPU is finished with a block it has
merely read, all it has to do is remove the internal notification that
it has an interest in the block, without there needing to be any cache
or CPU or core updating. It is thus more efficient to concentrate on
reducing the writes to a single small block of memory. This is assisted
by techniques such as sorting the destination indices and proceeding in
sorted order.

If the destination indices happened to be out of order but the source
ones were in order (or nearly so) then sorting by destination index can
have the result of scattering the data reads all over the place where
they were pretty much sequential before. One thus needs to know, for
efficiency, whether a scattered read is more efficient than a scattered
write or vice versa. That might depend upon the processor, but
*typically* it is the write that matters more. On the other hand, if
this is an issue, then the technique of using an intermediate buffer
comes in handy again: read the source in to an intermediate buffer in
source-efficient order, and access that buffer in destination order for
the writes. To the extent that the source order was relatively sparse,
using an intermediate buffer will improve the "locality of reference".

By now it has probably become clear that there is no one technique for
optimally doing A(idx1) = B(idx2) as the overhead work required to make
the poorer cases more efficient is wasted on better-behaved cases.
Choosing among the techniques requires knowledge of the characteristics
of idx1 and idx2 relative to each other. There *is* the possibility to
do better than Matlab itself... but
only in some cases.
```

```"Bruno Luong" <b.luong@fogale.findmycountry> wrote in message <i6lajg\$qd9\$1@fred.mathworks.com>...
> "Gianni Schena" <schena@univ.trieste.it> wrote in message
>
> You might try to make an in-place change of A using MEX. That could save some run time but violates Matlab memory sharing memory requirement. If you know what you are doing there is no danger.
>

An in-place MEX could save about 30% in time.

m=1e6;
n=3e5;
A=rand(1,m);
B=rand(1,m);
iA=ceil(m*rand(1,n));
iB=ceil(m*rand(1,n));

tic
A(iA)=B(iB);
toc % 0.018547 seconds.

iA=uint32(iA);
iB=uint32(iB);
tic
A(iA)=B(iB);
toc % 0.015522 seconds.

%% inplace MEX
tic
copyidx(A,iA,B,iB);
toc % 0.010667 seconds.

#include "mex.h"
#include "matrix.h"

#define A prhs[0]
#define iA prhs[1]
#define B prhs[2]
#define iB prhs[3]

/* Mex file copyidx.c ATTENTION NO PARAMETER CHECK */
void mexFunction(int nlhs, mxArray *plhs[],
                 int nrhs, const mxArray *prhs[]) {

    size_t m, i;
    double *ptrA, *ptrB;
    int *ptriA, *ptriB; /* 32-bit integers */

    m = mxGetN(iA);
    ptrA = mxGetPr(A);
    ptrB = mxGetPr(B);
    ptriA = (int*)mxGetData(iA);
    ptriB = (int*)mxGetData(iB);
    ptrA--;
    ptrB--;
    for (i=0; i<m; i++)
        ptrA[ptriA[i]] = ptrB[ptriB[i]];

    return;
}

Bruno
```

```"Bruno Luong" <b.luong@fogale.findmycountry> wrote in message <i6ld0c\$9hs\$1@fred.mathworks.com>...
>
>     ptrA--;
>     ptrB--;

FYI, although this usually works as a method to compensate for the MATLAB 1-based indexing and C 0-based indexing, it is technically non-conforming C code. You are not guaranteed that the address one less than the block starting address is a valid address to use in any expression, even if you are not dereferencing it (as you are not). You are only guaranteed that the address one more than the block end is valid to use in expressions, and then only if you don't dereference it.

James Tursa
```

8 Replies
233 Views
