OpenMP "not working" on gfortran

Hello everyone.

I have run into a strange issue. My OpenMP parallelized program runs on 
several threads (numbers varying from 1-8 as given by OMP_NUM_THREADS), 
as reported by omp_get_num_threads(). However, it appears that all 
threads are running on a single CPU core, meaning that instead of a 
near-perfect speedup I get some slowdown when increasing the number of 
threads.

What is even stranger is that my code used to run nicely on multiple 
cores, and indeed does so on intel fortran on our cluster. Has anyone 
experienced similar issues with OpenMP, or can you suggest somewhere to 
start looking for the trouble?

(I have found that most of the program runtime is spent in BLAS 
routines, so the performance difference between compilers is small. I 
have been using Goto BLAS for these tests, and the Goto BLAS routines 
are indeed running nicely in multicore mode.)

Cheers,
Paul.
Reply Paul 2/11/2011 9:33:29 AM

Paul,

What operating system and gfortran version are you using?  I've seen
no such OpenMP problems with gfortran 4.5.2 on Mac OSX 10.6.6 (Snow
Leopard).

Al Greynolds
www.ruda-cardinal.com
Reply Al 2/11/2011 12:05:46 PM

On 11.02.11 at 12:58, Al Greynolds wrote:
> Paul,
>
> What operating system and gfortran version are you using?  I've seen
> no such problems with gfortran 4.5.2 on Mac OSX 10.6.6 (Snow Leopard).
>
> Al Greynolds
> www.ruda-cardinal.com

Sorry, I forgot to mention that. Goto BLAS does not employ several cores 
on my Mac OS X 10.6.6 (Snow Leopard), but my own OpenMP-ified code does 
parallelize nicely. I also have gfortran 4.5.2. I have to agree with 
your conclusions there.

The question was asked with respect to gfortran 4.5.1 on Fedora Linux 
release 14. The processor is an Intel Core i7 2.80 GHz (we have several 
similar machines with varying GHz ratings), 4 cores with hyperthreading. 
Here, Goto BLAS does employ several cores, but my own openmp-ified code 
does not parallelize. There is in fact a small performance penalty to 
increasing the number of OpenMP threads, and top reports CPU usage at 
100% (i.e. one full core) so I believe that all threads are running on 
one core.

Paul.

Reply Paul 2/11/2011 12:50:02 PM

On Feb 11, 3:33 am, Paul Anton Letnes <paul.anton.let...@gmail.com>
wrote:
> However, it appears that all threads are running on a single CPU core
Threads are managed by the OS. Besides, how did you check it?
Reply rusi_pathan 2/11/2011 1:11:50 PM

On 11.02.11 at 14:11, rusi_pathan wrote:
> On Feb 11, 3:33 am, Paul Anton Letnes <paul.anton.let...@gmail.com>
> wrote:
>> However, it appears that all threads are running on a single CPU core
> Threads are managed by the OS. Besides, how did you check it?

I checked it by noting that
1) 'top' in the linux terminal reports 100% CPU usage (it reports up to 
800% for other programs)
2) Program execution does not speed up on the linux server when 
increasing the OMP_NUM_THREADS environment variable, while it does speed 
up on my mac laptop
3) omp_get_num_threads() reports 1,2,4,8 threads (as set by 
OMP_NUM_THREADS env. var.) correctly, but there is no speedup

Perhaps omp_get_num_threads is reporting something else than the number 
of actual threads? I do not know. I am posting here because I have no 
idea where to look for fixes to this problem.
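
One way to narrow this down: omp_get_num_threads() only reports the size of the current team, so it says nothing about which cores the threads end up on. A minimal sketch, assuming gfortran with -fopenmp (the program name team_check is made up for illustration), that has every thread announce itself:

```fortran
program team_check
    use omp_lib
    implicit none
    integer :: tid

    ! Outside a parallel region the team consists of one thread only.
    print *, 'outside parallel region, team size:', omp_get_num_threads()

    !$omp parallel private(tid)
    tid = omp_get_thread_num()
    ! The critical section only keeps the output lines from interleaving.
    !$omp critical
    print *, 'thread', tid, 'of', omp_get_num_threads()
    !$omp end critical
    !$omp end parallel
end program team_check
```

If this prints as many distinct thread numbers as OMP_NUM_THREADS asks for, the threads exist, and the missing speedup has to come from somewhere else (scheduling, affinity, or serialization inside the loop body).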

Cheers,
Paul
Reply Paul 2/11/2011 3:27:56 PM

> Has anyone experienced similar issues with OpenMP, or can you suggest
> somewhere to start looking for the trouble?

Try a simple OpenMP program (like a single for loop), see if you can
reproduce the issue with that, then post here your compiler version
(output of "gfortran -v") and the exact code and command line you are
using.
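
A minimal sketch of such a test program, assuming gfortran with -fopenmp. The program name, the loop body and the problem size n are arbitrary choices for illustration and are not taken from the original code:

```fortran
program omp_speedup_test
    use omp_lib
    implicit none
    integer, parameter :: dp = kind(1.0d0)
    integer, parameter :: n = 200000000   ! arbitrary; large enough to time
    integer :: i
    real(dp) :: total, t0, t1

    total = 0.0_dp
    t0 = omp_get_wtime()
    ! Each iteration is independent; the partial sums are combined
    ! by the reduction clause.
    !$omp parallel do reduction(+:total)
    do i = 1, n
        total = total + sin(real(i, dp))**2
    end do
    !$omp end parallel do
    t1 = omp_get_wtime()

    print *, 'max threads  :', omp_get_max_threads()
    print *, 'sum          :', total
    print *, 'wall time (s):', t1 - t0
end program omp_speedup_test
```

Running it with OMP_NUM_THREADS set to 1, 2, 4 and 8 and comparing the wall times (while watching top) should show whether the compiler and runtime spread a trivial loop over several cores on the machine in question.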

-- 
FX
Reply FX 2/11/2011 8:29:45 PM

On 11.02.11 at 21:29, FX wrote:
>> Has anyone experienced similar issues with OpenMP, or can you suggest
>> somewhere to start looking for the trouble?
>
> Try a simple OpenMP program (like a single for loop), see if you can
> reproduce the issue with that, then post here your compiler version
> (output of "gfortran -v") and the exact code and command line you are
> using.

This is part of the problem: the simple OpenMP loop runs on all cores. 
Also, the OpenMP code I wrote works on Mac OS X gfortran 4.5.2. The 
OpenMP code also works flawlessly on Rocks cluster linux with intel 
fortran v. 11.

I have now tested on gfortran 4.4.5 and 4.5.1 on Fedora Linux release 
14, and both of these compilers result in no speedup. I know I can't 
expect anyone here to find my problem (as I have no idea where to look 
myself and a simple program doesn't reproduce the error), but it would 
be interesting to see if someone here has had the same experience.

Paul

Reply Paul 2/12/2011 9:00:40 AM

On 2/12/2011 2:00 AM, Paul Anton Letnes wrote:
> This is part of the problem: the simple OpenMP loop runs on all cores.
> Also, the OpenMP code I wrote works on Mac OS X gfortran 4.5.2. The
> OpenMP code also works flawlessly on Rocks cluster linux with intel
> fortran v. 11.
>
> I have now tested on gfortran 4.4.5 and 4.5.1 on Fedora Linux release
> 14, and both of these compilers result in no speedup. I know I can't
> expect anyone here to find my problem (as I have no idea where to look
> myself and a simple program doesn't reproduce the error), but it would
> be interesting to see if someone here has had the same experience.

You mentioned that your code "used to run nicely on multiple cores."  Is 
it possible that something in the code or in your environment has 
changed recently?

You might think about methodically removing pieces of your program until 
it either works properly or is equivalent (for some definition thereof) 
to a simple program that does work.  This is guaranteed to be tedious 
and time-consuming, but it has a chance of helping you isolate the problem.

Louis


Reply Louis 2/12/2011 12:20:57 PM

On 11 Feb, 10:33, Paul Anton Letnes <paul.anton.let...@gmail.com>
wrote:

> I have run into a strange issue. My OpenMP parallelized program runs on
> several threads (numbers varying from 1-8 as given by OMP_NUM_THREADS),
> as reported by omp_get_num_threads(). However, it appears that all
> threads are running on a single CPU core, meaning that instead of a
> near-perfect speedup I get some slowdown when increasing the number of
> threads.

OpenMP works fine on GCC, including gfortran.

Threads are managed by the operating system. If you have 8 logical
CPUs (e.g. Intel i7) and 8 threads, all CPUs could theoretically be
saturated. Thus, either the process is started with its affinity
restricted to one CPU (this could e.g. be set by the shell), or there
is something in your program that forces sequential execution.

The latter could be an OpenMP directive specifying that only one thread
may execute a major portion of the code:

   !$OMP SINGLE
   !$OMP MASTER

or contention for a global mutex, e.g. an unnamed critical section:

   !$OMP CRITICAL
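
To illustrate the critical-section case, here is a small sketch (not taken from the original program; the names and the loop body are invented) that computes the same sum twice, once with every update inside an unnamed critical section and once with a reduction:

```fortran
program critical_vs_reduction
    use omp_lib
    implicit none
    integer, parameter :: dp = kind(1.0d0)
    integer, parameter :: n = 5000000   ! arbitrary problem size
    integer :: i
    real(dp) :: total, t0, t1

    ! Version 1: every update passes through one unnamed critical
    ! section, so the threads take turns and the loop is in effect serial.
    total = 0.0_dp
    t0 = omp_get_wtime()
    !$omp parallel do
    do i = 1, n
        !$omp critical
        total = total + sqrt(real(i, dp))
        !$omp end critical
    end do
    !$omp end parallel do
    t1 = omp_get_wtime()
    print *, 'critical :', total, ' time (s):', t1 - t0

    ! Version 2: the same sum with a reduction; each thread accumulates
    ! privately and the partial sums are combined at the end.
    total = 0.0_dp
    t0 = omp_get_wtime()
    !$omp parallel do reduction(+:total)
    do i = 1, n
        total = total + sqrt(real(i, dp))
    end do
    !$omp end parallel do
    t1 = omp_get_wtime()
    print *, 'reduction:', total, ' time (s):', t1 - t0
end program critical_vs_reduction
```

With several threads, the first version typically gets no faster (often slower) as threads are added, which is the kind of symptom described in the original post. The two sums may differ in the last digits because the additions happen in a different order.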

There could also be programming mistakes such as an !$OMP PARALLEL
block without an !$OMP DO or multiple !$OMP SECTION inside. There
could also be a typo like !$OMP PARALLEL instead of !$OMP PARALLEL DO,
which will have the effect you reported.

If the program will parallelize nicely on other systems, the problem
must be related to restricted CPU affinity, which is an OS or command
shell issue. If it does not, the problem is likely related to the
OpenMP code.

If nothing else helps, I'd suggest you try to rebuild GCC and
recompile your program.


Sturla





Reply sturlamolden 2/12/2011 2:34:33 PM

This is interesting reading. I am posting my one single OpenMP loop 
below, so you can have a look. I do believe it is a very simple !$OMP 
PARALLEL DO and, as mentioned before, it usually (not always, oddly 
enough) works nicely on Intel Fortran.

It would appear that this might be an OS issue, then?

Paul

--------------------

!$omp parallel do &
!$omp private(icol, irow, til, ip1, ip2, fra, iq1, iq2, alpha_0) &
!$omp firstprivate(imin, imax) &
!$omp shared(LHS)
do icol = imin, imax
     if (icol == 1) then
         print *, 'omp num threads:', omp_get_num_threads() ! TODO Remove
     end if
     fra = reverse_lookup(icol, 1)
     iq1 = reverse_lookup(icol, 2)
     iq2 = reverse_lookup(icol, 3)
     alpha_0 = alpha_0s(iq1, iq2)
     do irow = 1, size(LHS, 1)
         til = reverse_lookup(irow, 1)
         ip1 = reverse_lookup(irow, 2)
         ip2 = reverse_lookup(irow, 3)
         LHS(irow, icol) = generate_element(pm_lhs, fra, til, iq1, &
             iq2, ip1, ip2, numerics, alphas(ip1, ip2), &
             alpha_0, fft_of_zetas, qs)
     end do
end do
!$omp end parallel do



Reply Paul 2/12/2011 3:06:28 PM

On 12 Feb, 16:06, Paul Anton Letnes <paul.anton.let...@gmail.com>
wrote:

> It would appear that this might be an OS issue, then?

That is hard to tell. The function "reverse_lookup" or
"generate_element" could still be synchronized with a mutex. That it
sometimes fails with Intel's compiler suggests a programming issue.


Sturla
Reply sturlamolden 2/12/2011 7:09:08 PM

Just to clarify, reverse_lookup is in fact an array which is built 
before entering the loop. Perhaps there is some sort of programming 
issue with the generate_element() function, I will look into that 
possibility. Thanks a lot for your suggestions!

Speaking of which, I guess reverse_lookup should be declared 
firstprivate or shared? I didn't even think of that before.
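
On the shared versus firstprivate question: variables that are not listed in any clause are shared by default in a parallel do, and for an array that is only read inside the loop, shared is normally sufficient. firstprivate gives every thread its own copy, which costs memory and copy time for large arrays. A small sketch with invented names (lookup and col_sum are stand-ins, not the original variables) that uses default(none) so the compiler insists on every variable being listed:

```fortran
program sharing_sketch
    implicit none
    integer, parameter :: n = 8
    integer :: lookup(n, 3), col_sum(n)
    integer :: icol, k

    ! Build the read-only table before entering the parallel loop.
    do icol = 1, n
        lookup(icol, :) = [icol, 2*icol, 3*icol]
    end do

    ! default(none) makes the compiler reject any variable whose
    ! data-sharing attribute has not been stated explicitly.
    !$omp parallel do default(none) shared(lookup, col_sum) private(icol, k)
    do icol = 1, n
        col_sum(icol) = 0
        do k = 1, 3
            ! lookup is only read; each thread writes its own col_sum entries.
            col_sum(icol) = col_sum(icol) + lookup(icol, k)
        end do
    end do
    !$omp end parallel do

    print *, col_sum   ! entry icol should equal 6*icol
end program sharing_sketch
```

Adding default(none) to an existing loop is a cheap way to find variables whose sharing was never thought about: the build fails until each one is listed.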

Paul.

Reply Paul 2/12/2011 9:13:58 PM

On Feb 12, 6:00 am, Paul Anton Letnes <paul.anton.let...@gmail.com>
wrote:
> This is part of the problem: the simple OpenMP loop runs on all cores.
> Also, the OpenMP code I wrote works on Mac OS X gfortran 4.5.2. The
> OpenMP code also works flawlessly on Rocks cluster linux with intel
> fortran v. 11.
>
> I have now tested on gfortran 4.4.5 and 4.5.1 on Fedora Linux release
> 14, and both of these compilers result in no speedup. I know I can't
> expect anyone here to find my problem (as I have no idea where to look
> myself and a simple program doesn't reproduce the error), but it would
> be interesting to see if someone here has had the same experience.
>
> Paul

Hi,

Can you post the single loop so I can compile-link-run?

Fernando.
Reply ftinetti (148) 2/14/2011 11:10:37 AM

On Feb 12, 11:34 am, sturlamolden <sturlamol...@yahoo.no> wrote:
> On 11 Feb, 10:33, Paul Anton Letnes <paul.anton.let...@gmail.com>
> wrote:
> There could also be programming mistakes such as an !$OMP PARALLEL
> block without an !$OMP DO or multiple !$OMP SECTION inside. There
> could also be a typo like !$OMP PARALLEL instead of !$OMP PARALLEL DO,
> which will have the effect you reported.

No, in fact the effect should be something like the opposite: every
thread will do all the job (race conditions aside).
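
A small sketch of that difference, with invented names, counting how many loop iterations are executed in total under each directive:

```fortran
program parallel_vs_parallel_do
    use omp_lib
    implicit none
    integer, parameter :: n = 1000
    integer :: i, count_plain, count_do

    ! Plain PARALLEL: the loop is not a worksharing loop, so every
    ! thread in the team runs all n iterations.
    count_plain = 0
    !$omp parallel private(i) reduction(+:count_plain)
    do i = 1, n
        count_plain = count_plain + 1
    end do
    !$omp end parallel

    ! PARALLEL DO: the n iterations are divided among the threads.
    count_do = 0
    !$omp parallel do reduction(+:count_do)
    do i = 1, n
        count_do = count_do + 1
    end do
    !$omp end parallel do

    print *, 'max threads            :', omp_get_max_threads()
    print *, 'iterations, PARALLEL   :', count_plain
    print *, 'iterations, PARALLEL DO:', count_do
end program parallel_vs_parallel_do
```

With T threads, the first count comes out as T times n and the second as n. A forgotten DO therefore keeps all cores busy doing redundant work rather than leaving them idle, which does not match the 100% CPU reading reported by top.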

Reply ftinetti (148) 2/14/2011 11:13:29 AM

On 14 Feb, 12:13, gmail-unlp <ftine...@gmail.com> wrote:

> No, in fact the effect should be something like the opposite: every
> thread will do all the job (race conditions aside).

Yes. You are right. My mistake, sorry.

Sturla

Reply sturlamolden (723) 2/14/2011 11:51:09 AM

On 14.02.11 at 12:10, gmail-unlp wrote:
> Hi,
>
> Can you post the single loop so I can compile-link-run?
>
> Fernando.

Sure! Just see below. The outer loop in a double loop is 
OpenMP-parallelized. The function generate_element contains a function 
generate_j which also contains a loop, so significant work is done even 
in the innermost loop. Unfortunately, this loop is way too short (only 
10 iterations) to usefully parallelize. I guess it would be easier to 
debug that way, though.

Cheers,
Paul.

----------------------
subroutine setup_LHS(...)
! LHS is the left hand side of an equation system

(snip)

!$omp parallel do &
!$omp private(icol, irow, til, ip1, ip2, fra, iq1, iq2, alpha_0) &
!$omp firstprivate(qs, numerics, imin, imax, reverse_lookup, alpha_0s, &
!$omp alphas) &
!$omp shared(LHS, fft_of_zetas)
do icol = imin, imax
     if (icol == 1) then
         print *, 'omp num threads:', omp_get_num_threads() ! TODO Remove
     end if
     fra = reverse_lookup(icol, 1)
     iq1 = reverse_lookup(icol, 2)
     iq2 = reverse_lookup(icol, 3)
     alpha_0 = alpha_0s(iq1, iq2)
     do irow = 1, size(LHS, 1)
         til = reverse_lookup(irow, 1)
         ip1 = reverse_lookup(irow, 2)
         ip2 = reverse_lookup(irow, 3)
         LHS(irow, icol) = generate_element(pm_lhs, fra, til, iq1, &
             iq2, ip1, ip2, numerics, alphas(ip1, ip2), &
             alpha_0, fft_of_zetas, qs)
     end do
end do
!$omp end parallel do

Reply paul.anton.letnes (55) 2/14/2011 12:18:02 PM

Sorry, I don't have enough time to "complete" the code with
declarations/initializations/etc., that's why I need (and asked for)
some code I can directly compile-link-run-play almost directly from
the command line. Maybe you can just assign a constant to LHS just to
check for parallel behavior...
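
A rough sketch of that kind of stripped-down test. Everything here is an assumption made for illustration: the matrix size n, the real type of LHS, and the owner array used to record which thread handled each column are not from the original code:

```fortran
program constant_lhs_test
    use omp_lib
    implicit none
    integer, parameter :: dp = kind(1.0d0)
    integer, parameter :: n = 2000            ! hypothetical matrix size
    real(dp), allocatable :: LHS(:, :)        ! element type is a guess
    integer, allocatable :: owner(:)
    integer :: icol, irow, t

    allocate(LHS(n, n), owner(n))

    ! Same loop structure as the real code, but the expensive function
    ! call is replaced by a constant assignment.
    !$omp parallel do private(icol, irow) shared(LHS, owner)
    do icol = 1, n
        owner(icol) = omp_get_thread_num()    ! remember who did this column
        do irow = 1, n
            LHS(irow, icol) = 1.0_dp
        end do
    end do
    !$omp end parallel do

    ! Report how the columns were distributed over the threads.
    do t = 0, omp_get_max_threads() - 1
        print *, 'thread', t, 'filled', count(owner == t), 'columns'
    end do
end program constant_lhs_test
```

If the columns are spread evenly here but the real loop still shows no speedup, the function called inside the loop becomes the main suspect.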

Looking at the previous posts, I was thinking along the lines of OpenMP
implementation/OS problems. Have you looked at anything related to taskset?

Fernando.
Reply ftinetti (148) 2/14/2011 12:58:31 PM

I see, and I understand. However, the code is a bit complex. This is 
probably part of the reason for the strange behavior! As mentioned 
earlier, a simple do loop example compiles, runs and speeds up as expected.

I will consider your advice with respect to just assigning a constant to 
LHS and see if that works out. I think it could be helpful!

I have never used taskset, so I have no idea how I would go about using 
it in this context.

Paul.
Reply paul.anton.letnes (55) 2/14/2011 1:07:15 PM

I see, no problem, I understood from a previous post (I copy the text
here):
> This is part of the problem: the simple OpenMP loop runs on all cores.
> Also, the OpenMP code I wrote works on Mac OS X gfortran 4.5.2. The
> OpenMP code also works flawlessly on Rocks cluster linux with intel
> fortran v. 11.
> I have now tested on gfortran 4.4.5 and 4.5.1 on Fedora Linux release
> 14, and both of these compilers result in no speedup.
that the simple loop did not work on Fedora, my mistake, sorry.

About code complexity and parallel performance, I think code
complexity only "affects" correctness, not parallel usage to only one
core, but obviously I don't know a looooot of details/facts.

About taskset: since I was thinking in OpenMP implementation/OS
problems, I suggested taking a look at taskset, which is used to
"retrieve or set a process's CPU affinity" -copied from
http://www.unix.com/man-page/Linux/1/taskset/
Since threads (I think always) inherit scheduling properties from the
process in which they are created, maybe you can verify/play with the
process scheduling properties. But it's just a guess...
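
As a purely observational check (it changes no scheduling settings), each OpenMP thread can report the CPU it is currently running on. The sketch below assumes Linux with glibc, where sched_getcpu() is available as a C library function; the interface block and the program name are written for this example and are not part of any Fortran library:

```fortran
program where_do_threads_run
    use, intrinsic :: iso_c_binding, only: c_int
    use omp_lib
    implicit none

    interface
        ! glibc function (Linux-specific): returns the number of the
        ! CPU on which the calling thread is currently running.
        function sched_getcpu() bind(C, name='sched_getcpu') result(cpu)
            import :: c_int
            integer(c_int) :: cpu
        end function sched_getcpu
    end interface

    !$omp parallel
    ! The critical section only keeps the output lines from interleaving.
    !$omp critical
    print *, 'OpenMP thread', omp_get_thread_num(), &
             'is on CPU', sched_getcpu()
    !$omp end critical
    !$omp end parallel
end program where_do_threads_run
```

The result is only a snapshot, since the scheduler may move threads at any time. If every thread reports the same CPU run after run, a restricted affinity mask becomes a plausible suspect; the mask of a running process can be inspected from the shell with taskset -p followed by the process ID.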

Fernando.

Reply ftinetti (148) 2/14/2011 1:42:27 PM

In article <2c0b330a-1cfb-42d1-b597-8aacac4702a0@f36g2000pri.googlegroups.com>,
gmail-unlp  <ftinetti@gmail.com> wrote:
>
>About code complexity and parallel performance, I think code
>complexity only "affects" correctness, not parallel usage to only one
>core, but obviously I don't know a looooot of details/facts.

That's more-or-less correct, in theory.  Whether it is the case
in practice is less clear.

>About taskset: since I was thinking in OpenMP implementation/OS
>problems, I suggested taking a look at taskset, which is used to
>"retrieve or set a process's CPU affinity" -copied from
>http://www.unix.com/man-page/Linux/1/taskset/
>Since threads (I think always) inherit scheduling properties from the
>process in which they are created, maybe you can verify/play with the
>process scheduling properties. But it's just a guess...

This will sound a bit patronising, but you are completely lost.
Nobody except an expert (in using operating systems, primarily)
should think of using such facilities, as they don't work the way
that they appear to, and never have (on ANY operating system!)
And most experts know better than to use such facilities except
in extremis, for that reason and many others.

The most that it is worth doing is calling a little C function
that returns the TID (thread identifier), which will check that
different parts of the code are using different threads.  But
DO be warned that the mapping between OpenMP threads and system
threads is not simple, not at all.

If they all give the same TID, then your problem is that it is not
starting enough threads; if they give different ones, you have a
scheduler issue, and you should stop looking at the OpenMP code
and look elsewhere.  And the latter is HARD.


Regards,
Nick Maclaren.
Reply nmm12 (898) 2/14/2011 1:43:57 PM
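
The "little C function that returns the TID" mentioned above could look roughly like this. It is a sketch under stated assumptions, not Nick's code: it assumes Linux, where the gettid system call returns the kernel thread ID, and the name current_tid is hypothetical. The comment shows one way it might be declared on the Fortran side using the standard ISO_C_BINDING module.

```c
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>

/* Hypothetical helper (name chosen for illustration only).
 * Returns the kernel thread ID of the calling thread. Linux specific.
 *
 * A possible Fortran 2003 interface, assuming this file is compiled
 * and linked into the Fortran program:
 *
 *   interface
 *     function current_tid() bind(C, name="current_tid") result(tid)
 *       use iso_c_binding, only: c_long
 *       integer(c_long) :: tid
 *     end function current_tid
 *   end interface
 */
long current_tid(void)
{
    /* Go through syscall() because older glibc versions
     * provide no gettid() wrapper. */
    return (long)syscall(SYS_gettid);
}
```

Calling it from inside a parallel region and printing the result next to omp_get_thread_num() would show whether the OpenMP threads are backed by distinct system threads, subject to the caveat above that the mapping between the two is not simple.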

In article <9577de97-eabd-45d6-9af2-04a8a112e0a8@l12g2000pra.googlegroups.com>,
gmail-unlp  <ftinetti@gmail.com> wrote:
>
>Yes, I think so. Btw, do you have some suggestion to the OP not so
>HARD or for experts only or for those who know better? I really don't
>have any idea, so if there is not such a suggestion I would know the
>answer is in the pile-of-things-i-don't-know. I'll keep learning,
>however, just trying to reach those places and understand a little bit
>more.

The best that I can offer is to Email the slides on a course that
I have part-written, largely on how to avoid getting into trouble
with OpenMP.  But it doesn't contain ANYTHING about the actual
configuration and system usage - except general guidelines.  Please
say if you would like to see it.


Regards,
Nick Maclaren.
Reply nmm12 (898) 2/14/2011 2:20:59 PM

On Feb 14, 10:43 am, n...@cam.ac.uk wrote:
> In article <2c0b330a-1cfb-42d1-b597-8aacac470...@f36g2000pri.googlegroups.com>,
>
> gmail-unlp <ftine...@gmail.com> wrote:
>
> >About code complexity and parallel performance, I think code
> >complexity only "affects" correctness, not parallel usage to only one
> >core, but obviously I don't know a looooot of details/facts.
>
> That's more-or-less correct, in theory.  Whether it is the case
> in practice is less clear.
>
> >About taskset: since I was thinking in OpenMP implementation/OS
> >problems, I suggested taking a look at taskset, which is used to
> >"retrieve or set a process's CPU affinity" -copied from
> >http://www.unix.com/man-page/Linux/1/taskset/
> >Since threads (I think always) inherit scheduling properties from the
> >process in which they are created, maybe you can verify/play with the
> >process scheduling properties. But it's just a guess...
>
> This will sound a bit patronising, but you are completely lost.

What do you mean by "a bit patronising"? (I understand the part where
you say I'm completely lost... ;P ...).

> Nobody except an expert (in using operating systems, primarily)
> should think of using such facilities, as they don't work the way
> that they appear to, and never have (on ANY operating system!)

I'm not exactly an expert (and I'm not able to define what an expert
is), and I've used such facilities, and they worked the way I
expected... This reminds me of someone once saying to maintain some
software along with its current bugs... Maybe I'm completely lost and
I'm completely lucky! :))

> And most experts know better than to use such facilities except
> in extremis, for that reason and many others.

Again, a realm I don't know. I just suggested a very simple task: to
play a little with an OS command, just on a guess, in case nothing
else turns up.

> The most that it is worth doing is calling a little C function
> that returns the TID (thread identifier), which will check that
> different parts of the code are using different threads.  But
> DO be warned that the mapping between OpenMP threads and system
> threads is not simple, not at all.

Yes, that's why I didn't suggest that; besides, I don't know how to
call little C functions from Fortran (yes, I'm far far away from being
an expert...). About the OpenMP implementation: yes, that's a place I
don't want to be. I usually try to figure out only whether the problem
is just mine or there is some bug in the OpenMP implementation. The
answer is usually known before I start looking, but I like to find out
the reason...

> If they all give the same TID, then your problem is that it is not
> starting enough threads; if they give different ones, you have a
> scheduler issue, and you should stop looking at the OpenMP code
> and look elsewhere.  And the latter is HARD.

Yes, I think so. Btw, do you have some suggestion to the OP not so
HARD or for experts only or for those who know better? I really don't
have any idea, so if there is not such a suggestion I would know the
answer is in the pile-of-things-i-don't-know. I'll keep learning,
however, just trying to reach those places and understand a little bit
more.

Thank you very much,

Fernando.
Reply ftinetti (148) 2/14/2011 2:56:02 PM

On 14.02.11 15.20, nmm1@cam.ac.uk wrote:
> In article<9577de97-eabd-45d6-9af2-04a8a112e0a8@l12g2000pra.googlegroups.com>,
> gmail-unlp<ftinetti@gmail.com>  wrote:
>>
>> Yes, I think so. Btw, do you have some suggestion to the OP not so
>> HARD or for experts only or for those who know better? I really don't
>> have any idea, so if there is not such a suggestion I would know the
>> answer is in the pile-of-things-i-don't-know. I'll keep learning,
>> however, just trying to reach those places and understand a little bit
>> more.
>
> The best that I can offer is to Email the slides on a course that
> I have part-written, largely on how to avoid getting into trouble
> with OpenMP.  But it doesn't contain ANYTHING about the actual
> configuration and system usage - except general guidelines.  Please
> say if you would like to see it.
>
>
> Regards,
> Nick Maclaren.

If you have a course over how not to get into trouble with OpenMP, I for 
one am interested.

Paul.

Reply paul.anton.letnes (55) 2/14/2011 3:14:58 PM

On Feb 14, 12:14 pm, Paul Anton Letnes <paul.anton.let...@gmail.com>
wrote:
> On 14.02.11 15.20, n...@cam.ac.uk wrote:
>
> > In article<9577de97-eabd-45d6-9af2-04a8a112e...@l12g2000pra.googlegroups.com>,
> > gmail-unlp<ftine...@gmail.com>  wrote:
>
> >> Yes, I think so. Btw, do you have some suggestion to the OP not so
> >> HARD or for experts only or for those who know better? I really don't
> >> have any idea, so if there is not such a suggestion I would know the
> >> answer is in the pile-of-things-i-don't-know. I'll keep learning,
> >> however, just trying to reach those places and understand a little bit
> >> more.
>
> > The best that I can offer is to Email the slides on a course that
> > I have part-written, largely on how to avoid getting into trouble
> > with OpenMP.  But it doesn't contain ANYTHING about the actual
> > configuration and system usage - except general guidelines.  Please
> > say if you would like to see it.
>
> > Regards,
> > Nick Maclaren.
>
> If you have a course over how not to get into trouble with OpenMP, I for
> one am interested.
>
> Paul.


Me too, and thank you,

Fernando.
Reply ftinetti (148) 2/14/2011 3:18:09 PM
