Hi all, I recently started trying to write MEX code for my bottlenecks in my FEM program. For linear algebra operations like kronecker products, I saw good results. However, when I tried to convert my assembly procedure into MEX code -- http://www.mathworks.com/matlabcentral/newsreader/view_thread/290656 I found it fared worse. I gave that up and after some research online, I found that using the sparse matrix properties to perform the assembly process is much faster. However, when I tried to write MEX code for this, again it proved way slower ( mex code attached ).-- http://blogs.mathworks.com/loren/2007/03/01/creating-sparse-finite-element-matrices-in-matlab/ I venture to guess- 1. If I have 19 inputs and 16 outputs as I do here, and they are large arrays, does MEX code become pointless ? 2. Is it that the interfacing itself has high overhead for such cases ? If so, I would like to know so that I dont waste time on MEX henceforth. %======================================================= #include "mex.h" #include "math.h" void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[]) { double *kmat, *r_vec, *c_vec ; double *I_ff, *I_fp, *I_pf, *I_pp, *J_ff, *J_fp, *J_pf, *J_pp; double *X_ff, *X_fp, *X_pf, *X_pp; int n_ff,n_fp,n_pf,n_pp ; int size_r_vec, size_c_vec ; /* create a pointer to the input vector r */ kmat = mxGetPr(prhs[0]); r_vec = mxGetPr(prhs[1]); c_vec = mxGetPr(prhs[2]); I_ff = mxGetPr(prhs[3]); I_fp = mxGetPr(prhs[4]); I_pf = mxGetPr(prhs[5]); I_pp = mxGetPr(prhs[6]); J_ff = mxGetPr(prhs[7]); J_fp = mxGetPr(prhs[8]); J_pf = mxGetPr(prhs[9]); J_pp = mxGetPr(prhs[10]); X_ff = mxGetPr(prhs[11]); X_fp = mxGetPr(prhs[12]); X_pf = mxGetPr(prhs[13]); X_pp = mxGetPr(prhs[14]); /* Extract n_ff, n_fp etc from input */ n_ff = (int)mxGetScalar(prhs[15]); n_fp = (int)mxGetScalar(prhs[16]); n_pf = (int)mxGetScalar(prhs[17]); n_pp = (int)mxGetScalar(prhs[18]); /* get the dimensions of the matrix input y */ size_r_vec = mxGetM(prhs[1]); size_c_vec = mxGetM(prhs[2]); int i, j ; int r_value, c_value ; for(i = 0; i 0) { n_ff = n_ff + 1; I_ff[n_ff -1] = r_value; J_ff[n_ff -1] = c_value; X_ff[n_ff -1] = kmat[i + size_r_vec*j]; } else if(r_value> 0 && c_value < 0) { n_fp = n_fp + 1; I_fp[n_fp -1] = r_value; J_fp[n_fp -1] = abs(c_value); X_fp[n_fp -1] = kmat[i + size_r_vec*j]; } else if(r_value 0) { n_pf = n_pf + 1; I_pf[n_pf -1] = abs(r_value); J_pf[n_pf -1] = (c_value); X_pf[n_pf -1] = kmat[i + size_r_vec*j]; } else if(r_value< 0 && c_value < 0) {n_pp = n_pp + 1; I_pp[n_pp-1]= abs(r_value); J_pp[n_pp -1] = abs(c_value); X_pp[n_pp -1] = kmat[i + size_r_vec*j]; } } } /* Assign to output */ plhs[0] = mxDuplicateArray(prhs[3]); plhs[1] = mxDuplicateArray(prhs[4]); plhs[2] = mxDuplicateArray(prhs[5]); plhs[3] = mxDuplicateArray(prhs[6]); plhs[4] = mxDuplicateArray(prhs[7]); plhs[5] = mxDuplicateArray(prhs[8]); plhs[6] = mxDuplicateArray(prhs[9]); plhs[7] = mxDuplicateArray(prhs[10]); plhs[8] = mxDuplicateArray(prhs[11]); plhs[9] = mxDuplicateArray(prhs[12]); plhs[10] = mxDuplicateArray(prhs[13]); plhs[11] = mxDuplicateArray(prhs[14]); plhs[12] = mxCreateDoubleScalar(n_ff); plhs[13] = mxCreateDoubleScalar(n_fp); plhs[14] = mxCreateDoubleScalar(n_pf); plhs[15] = mxCreateDoubleScalar(n_pp); return; } %=============================================================

0 |

9/7/2010 3:17:05 AM

On 7 Sep, 05:17, "shriram srinivasan" <shrir...@tamu.edu> wrote: > Hi all, > I recently started trying to write MEX code for my bottlenecks in my FEM = program. > For linear algebra operations like kronecker products, I saw good results= .. > However, when I tried to =A0convert my assembly procedure into MEX code -= -http://www.mathworks.com/matlabcentral/newsreader/view_thread/290656 > > I found it fared worse. I gave that up and after some research online, I = found that using the sparse matrix properties to perform the assembly proce= ss is much faster. > However, when I tried to =A0write MEX =A0code for this, again it proved w= ay slower ( mex code attached ).--http://blogs.mathworks.com/loren/2007/03/= 01/creating-sparse-finite-el... > > I venture to guess- > 1. If I have 19 inputs and 16 outputs as I do here, and they are large ar= rays, does MEX code become pointless ? No. But you need to know exactly what you are doing to squeeze any hint of performance out of the C compiler. > 2. Is it that =A0the interfacing itself =A0has high overhead for such cas= es ? No. But you need to know exactly what you are doing to squeeze any hint of performance out of the C compiler. > If so, I would like to know so that I dont waste time on MEX henceforth. Instead of attempting stuff you don't master, spend some time learning C or C++. The code you posted might be sped up a factor 2, maybe 3 or more, by some simple syntax changes. You would also need to know how to configure the compiler to produce very fast executables. Rune

0 |

9/7/2010 4:23:51 AM

"Cris Luengo" <cris.luengo@google.for.my.name.to.contact.me> wrote in message <i64slj$h08$1@fred.mathworks.com>... > Rune Allnor <allnor@tele.ntnu.no> wrote in message <c3d47692-2d6d-4ad1-89a9-e9b833034924@g17g2000yqe.googlegroups.com>... > > On 7 Sep, 05:17, "shriram srinivasan" <shrir...@tamu.edu> wrote: > > > I venture to guess- > > > 1. If I have 19 inputs and 16 outputs as I do here, and they are large arrays, does MEX code become pointless ? > > > > No. But you need to know exactly what you are doing to squeeze > > any hint of performance out of the C compiler. > > Also, you are writing in the input arrays -- which is a big no-no in MATLAB -- and then make a copy. Try making a copy first, and then get the pointers to the output arrays and write in them. > > Cris. A "big non-no"? If the OP is going to bend the rules and modify their inputs, the subsequent copies are a (literal) waste of time.

0 |

9/7/2010 8:34:05 AM

"shriram srinivasan" <shrirams@tamu.edu> wrote in message <i64arh$c6q$1@fred.mathworks.com>... > > plhs[0] = mxDuplicateArray(prhs[3]); > plhs[1] = mxDuplicateArray(prhs[4]); > plhs[2] = mxDuplicateArray(prhs[5]); > plhs[3] = mxDuplicateArray(prhs[6]); > plhs[4] = mxDuplicateArray(prhs[7]); > plhs[5] = mxDuplicateArray(prhs[8]); > plhs[6] = mxDuplicateArray(prhs[9]); > plhs[7] = mxDuplicateArray(prhs[10]); > plhs[8] = mxDuplicateArray(prhs[11]); > plhs[9] = mxDuplicateArray(prhs[12]); > plhs[10] = mxDuplicateArray(prhs[13]); > plhs[11] = mxDuplicateArray(prhs[14]); This could be a tremendous waste of time if prhs[3] - prhs[14] are large arrays. If you are just going to return them back to MATLAB with no changes you should be using the undocumented mxCreateSharedDataCopy instead, which will avoid the data copies that you are making above. Just add this prototype at the top of your code: mxArray *mxCreateSharedDataCopy(const mxArray *pr); James Tursa

0 |

9/7/2010 1:53:07 PM

"shriram srinivasan" <shrirams@tamu.edu> wrote in message <i64arh$c6q$1@fred.mathworks.com>... > Hi all, > I recently started trying to write MEX code for my bottlenecks in my FEM program. > For linear algebra operations like kronecker products, I saw good results. > However, when I tried to convert my assembly procedure into MEX code -- http://www.mathworks.com/matlabcentral/newsreader/view_thread/290656 > > I found it fared worse. I gave that up and after some research online, I found that using the sparse matrix properties to perform the assembly process is much faster. > However, when I tried to write MEX code for this, again it proved way slower ( mex code attached ).-- > http://blogs.mathworks.com/loren/2007/03/01/creating-sparse-finite-element-matrices-in-matlab/ > > I venture to guess- > 1. If I have 19 inputs and 16 outputs as I do here, and they are large arrays, does MEX code become pointless ? > 2. Is it that the interfacing itself has high overhead for such cases ? > If so, I would like to know so that I dont waste time on MEX henceforth. > %======================================================= > #include "mex.h" > #include "math.h" > > void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[]) > { > double *kmat, *r_vec, *c_vec ; > double *I_ff, *I_fp, *I_pf, *I_pp, *J_ff, *J_fp, *J_pf, *J_pp; > double *X_ff, *X_fp, *X_pf, *X_pp; > int n_ff,n_fp,n_pf,n_pp ; > > int size_r_vec, size_c_vec ; > > /* create a pointer to the input vector r */ > kmat = mxGetPr(prhs[0]); > r_vec = mxGetPr(prhs[1]); > c_vec = mxGetPr(prhs[2]); > I_ff = mxGetPr(prhs[3]); > I_fp = mxGetPr(prhs[4]); > I_pf = mxGetPr(prhs[5]); > I_pp = mxGetPr(prhs[6]); > J_ff = mxGetPr(prhs[7]); > J_fp = mxGetPr(prhs[8]); > J_pf = mxGetPr(prhs[9]); > J_pp = mxGetPr(prhs[10]); > X_ff = mxGetPr(prhs[11]); > X_fp = mxGetPr(prhs[12]); > X_pf = mxGetPr(prhs[13]); > X_pp = mxGetPr(prhs[14]); > > /* Extract n_ff, n_fp etc from input */ > n_ff = (int)mxGetScalar(prhs[15]); > n_fp = (int)mxGetScalar(prhs[16]); > n_pf = (int)mxGetScalar(prhs[17]); > n_pp = (int)mxGetScalar(prhs[18]); > > /* get the dimensions of the matrix input y */ > > size_r_vec = mxGetM(prhs[1]); > size_c_vec = mxGetM(prhs[2]); > > int i, j ; > int r_value, c_value ; > > for(i = 0; i 0) > { n_ff = n_ff + 1; > I_ff[n_ff -1] = r_value; > J_ff[n_ff -1] = c_value; > X_ff[n_ff -1] = kmat[i + size_r_vec*j]; > } > else if(r_value> 0 && c_value < 0) > { n_fp = n_fp + 1; > I_fp[n_fp -1] = r_value; > J_fp[n_fp -1] = abs(c_value); > X_fp[n_fp -1] = kmat[i + size_r_vec*j]; > } > else if(r_value 0) > { n_pf = n_pf + 1; > I_pf[n_pf -1] = abs(r_value); > J_pf[n_pf -1] = (c_value); > X_pf[n_pf -1] = kmat[i + size_r_vec*j]; > } > else if(r_value< 0 && c_value < 0) > {n_pp = n_pp + 1; > I_pp[n_pp-1]= abs(r_value); > J_pp[n_pp -1] = abs(c_value); > X_pp[n_pp -1] = kmat[i + size_r_vec*j]; > } > } > } > > /* Assign to output */ > > plhs[0] = mxDuplicateArray(prhs[3]); > plhs[1] = mxDuplicateArray(prhs[4]); > plhs[2] = mxDuplicateArray(prhs[5]); > plhs[3] = mxDuplicateArray(prhs[6]); > plhs[4] = mxDuplicateArray(prhs[7]); > plhs[5] = mxDuplicateArray(prhs[8]); > plhs[6] = mxDuplicateArray(prhs[9]); > plhs[7] = mxDuplicateArray(prhs[10]); > plhs[8] = mxDuplicateArray(prhs[11]); > plhs[9] = mxDuplicateArray(prhs[12]); > plhs[10] = mxDuplicateArray(prhs[13]); > plhs[11] = mxDuplicateArray(prhs[14]); > > plhs[12] = mxCreateDoubleScalar(n_ff); > plhs[13] = mxCreateDoubleScalar(n_fp); > plhs[14] = mxCreateDoubleScalar(n_pf); > plhs[15] = mxCreateDoubleScalar(n_pp); > > return; > } > %============================================================= Now I see what you're doing... you are making the inner part of some loop into a MEX file. Input arguments 4 through 15 are being updated, as are the scalar integer inputs 16 through 19. The actual operation you do on these arrays is very straight-forward and rather simple. In comparison, the work you do in copying the input to the output is HUGE! No wonder the end result is slower... :) You should consider making the whole loop into a MEX-file. Try to have some data as input, create empty arrays for the output, and fill them with the results of your calculations. Then you will gain speed. Cris.

0 |

9/7/2010 2:21:35 PM

Many thanks to Rune, Cris, Steve and James for your observations. @James: If I use mxCreateSharedDataCopy, does that mean that in the code I have, everything remains the same except that I dont need the mxDuplicateArray at the end ? Secondly, how do I avoid modifying the inputs as Steve mentioned . @Cris: Your understanding is quite correct. There is a loop over all the elements of my FEM mesh, and the contribution of each element is assembled using this function. I cannot extend this to cover the outer loop over the elements because that would involve essentially rewriting each function I use in my FEM program. So I am only looking at ways to avoid copying the input to the output in this code. Any suggestions on that front would be most helpful

0 |

9/7/2010 3:32:07 PM

"shriram srinivasan" <shrirams@tamu.edu> wrote in message <i65ltn$huu$1@fred.mathworks.com>... > > If I use mxCreateSharedDataCopy, does that mean that in the code I have, everything remains the same except that I dont need the mxDuplicateArray at the end ? Just replace all the calls that you showed of mxDuplicateArray with mxCreateSharedDataCopy. (This begs the question as to why you do this in the first place, however. Why not do this at the m-code level?) > Secondly, how do I avoid modifying the inputs as Steve mentioned . I would suggest posting the m-code that you are trying to speed up. That way we can make a more accurate assessment of how to proceed, and what the best use of a mex file vs m-code would be. Generally, MATLAB is not very good at array slicing inside a loop since that incurs a data copy each time. This can often be avoided in a mex routine. James Tursa

0 |

9/7/2010 5:32:10 PM

Forgot to include the mcode for the function: %============================================= [I_ff,I_fp,I_pf,I_pp,J_ff,J_fp,J_pf,J_pp,X_ff,X_fp,X_pf,X_pp,... n_ff,n_fp,n_pf,n_pp] = sparse_data_local_matrix(kmat,r_vec,c_vec,... I_ff,I_fp,I_pf,I_pp,J_ff,J_fp,J_pf,J_pp,... X_ff,X_fp,X_pf,X_pp,n_ff,n_fp,n_pf,n_pp) for i = 1: length(r_vec) for j = 1: length(c_vec) if (r_vec(i) > 0 && c_vec(j) > 0 ) n_ff = n_ff + 1; I_ff(n_ff) = r_vec(i); J_ff(n_ff) = c_vec(j); X_ff(n_ff) = kmat(i,j); elseif (r_vec(i) > 0 && c_vec(j) < 0 ) n_fp = n_fp + 1; I_fp(n_fp) = r_vec(i); J_fp(n_fp) = abs(c_vec(j)); X_fp(n_fp) = kmat(i,j); elseif (r_vec(i) < 0 && c_vec(j) > 0 ) n_pf = n_pf + 1; I_pf(n_pf) = abs(r_vec(i)); J_pf(n_pf) = c_vec(j); X_pf(n_pf) = kmat(i,j); else n_pp = n_pp + 1; I_pp(n_pp) = abs(r_vec(i)); J_pp(n_pp) = abs(c_vec(j)); X_pp(n_pp) = kmat(i,j); end end end %===============================================================

0 |

9/7/2010 6:27:19 PM

*bounce*

0 |

9/8/2010 5:41:05 AM

"shriram srinivasan" <shrirams@tamu.edu> wrote in message <i65ltn$huu$1@fred.mathworks.com>... > Many thanks to Rune, Cris, Steve and James for your observations. > > @James: > If I use mxCreateSharedDataCopy, does that mean that in the code I have, everything remains the same except that I dont need the mxDuplicateArray at the end ? > Secondly, how do I avoid modifying the inputs as Steve mentioned . > > @Cris: > Your understanding is quite correct. There is a loop over all the elements of my FEM mesh, and the contribution of each element is assembled using this function. > I cannot extend this to cover the outer loop over the elements because that would involve essentially rewriting each function I use in my FEM program. > So I am only looking at ways to avoid copying the input to the output in this code. > Any suggestions on that front would be most helpful In a MEX-file, you are not supposed to modify the input arguments. In your case, this means making a copy of the input, then modifying the copies, and pass these copies as output arguments. But this means that your MEX-file might be slower than an M-file that doesn't need to do the copying (MATLAB nowadays has some inplace operations, very nice!). If you want to avoid the copying of matrices in your MEX-file, you could continue modifying the input, as you are doing. Simply remove the copying of the matrices at the end of your MEX-file. You'd call it with [counter1,counter2,...] = mexfunction(array1,array2,...,coutner1,counter2,...); Note that you'd be breaking MATLAB "rules". But it should work as long as you don't keep additional copies of your arrays. For example, if you do b = a; mexfunction(a); % this one modifies its input then "b" will also be modified. "b" contains only a pointer to the data of "a". When you use MATLAB commands to modify "b", a full copy is made of the data. But your MEX-file is not aware of how many variables use the data in the input array. This is why you're not supposed to modify it. But in your case you have everything under control. Just make sure that all the input arrays are independent, not copies of each other! ...and don't tell anybody at The MathWorks what you're doing! :) Another option would be to create the output arrays, and collect them all in a cell array: [res1{ii},res2{ii},...] = mexfunction(data) At the end of the loop you can concatenate the results into one big array: res1 = [res1{:}]; res2 = [res2{:}]; ... This option is a little bit more involved, but a lot more correct. :) Cheers, Cris.

0 |

9/8/2010 2:09:04 PM

"Cris Luengo" <cris.luengo@google.for.my.name.to.contact.me> wrote in message <i685e0$41l$1@fred.mathworks.com>... > "shriram srinivasan" <shrirams@tamu.edu> wrote in message <i65ltn$huu$1@fred.mathworks.com>... > > Many thanks to Rune, Cris, Steve and James for your observations. > > > > @James: > > If I use mxCreateSharedDataCopy, does that mean that in the code I have, everything remains the same except that I dont need the mxDuplicateArray at the end ? > > Secondly, how do I avoid modifying the inputs as Steve mentioned . > > > > @Cris: > > Your understanding is quite correct. There is a loop over all the elements of my FEM mesh, and the contribution of each element is assembled using this function. > > I cannot extend this to cover the outer loop over the elements because that would involve essentially rewriting each function I use in my FEM program. > > So I am only looking at ways to avoid copying the input to the output in this code. > > Any suggestions on that front would be most helpful > In a MEX-file, you are not supposed to modify the input arguments. Hi, This is absolutely NOT TRUE. It is the programmers choice to do it or not. I do it, for example, when I need to multiply a large array by a scalar and don't need the pre-multiplication version of the variable. Following that "advice" one needs at certain time to have TWO copies of the array with all that implies to memory usage, whilst if we do inplace modifications there is no extra memory need. > ... (MATLAB nowadays has some inplace operations, very nice!). Hmm, but TMW can!

0 |

9/8/2010 2:32:13 PM

This MEX coding seems to be a waste of time. A quick look of the matlab file seems that OP should try to first vectorize the double-for loops (which already faster than the MEX), which could improve substantially the speed. Bruno

0 |

9/8/2010 4:44:06 PM

"Cris Luengo" <cris.luengo@google.for.my.name.to.contact.me> wrote in message <i68blc$6cn$1@fred.mathworks.com>... > > Yes, they can, because they can check to see how many references there are to the data, and make the deep copy if necessary. FYI, you can also make this check in a mex file if desired and have the same behavior as MATLAB for shared variables. I think the undocumented function is mxUnshareArray. > MATLAB now does some things in-place, you might see that > data = data*3; > is actually faster than your MEX-file. MATLAB parallelizes this sort of operations Based on my experiences with this, it seems to be highly dependent on MATLAB version and whether or not the underlying BLAS routines used (if any) are multi-threaded. e.g., for (real scalar)*(real array) MATLAB is tough to beat, but for (complex scalar)*(complex array) you can sometimes beat MATLAB with simple loops in a mex routine because you can optimize memory access better than MATLAB does. James Tursa

0 |

9/8/2010 5:57:20 PM

Thanks to everyone for providing your perspective. I have a long way to go before understanding the details of all the technical stuff, but I have given up this for now. In this particular instance, it hasn't been worth the effort to write MEX code. If someone does come up with a suggestion that can be faster than my current mcode, I shall then look into it.

0 |

9/9/2010 5:28:03 AM

"shriram srinivasan" <shrirams@tamu.edu> wrote in message <i69r93$553$1@fred.mathworks.com>... > If someone does come up with a suggestion that can be faster than my current mcode, I shall then look into it. I wrote earlier that you might try to vectorize - oops sorry call it rather "array-program" convert [sic] - your code instead of running with double for-loops. Use techniques such as logical indexing, or find() function or accumarray() or whatever you know to carry out the calculation. Bruno

0 |

9/9/2010 6:53:05 AM