Re: MP Synchronization ISSUES #3
Hoff, to answer your questions ...
But first, we did a cluster shutdown and startup - this seems to have resolved
the MP Synch problem (for now)... the cluster had been online 120+ days. This
leads me in the direction of application and memory management.
> Clustered?
The servers are clustered using 100 Mbit (private VLAN) and Gigabit (fiber).
The Gigabit provides access to our users.
> What sort of I/O device(s)?
DS20 = 1x FWD 2x SAN 1x 100 1x Gigabit
DS20 = 1x FWD 2x SAN 1x 100 1x Gigabit
ES40 = 1x FWD 2x SAN 1x 100 1x Gigabit
> What happens when you temporarily drop from 4 CPUs down to 1 or 2?
> (STOP/CPU your way down, for purposes of testing.) Do you see a big
> drop-off in MPSYNC?
Have not done this - I suspect this will terminate jobs currently executing,
correct? If not, then I will try this.
The 2 CPU servers (DS20's) do not have the MP synch issue.
> Are there differences in settings for the working set, for instance, or disk
> contention on the system, or such, between/among these systems?
I need to double check on this - however I have attempted to keep the values
the same across the board.
> Can you see what the jobs are competing for? Memory? Disk? I/O? Network?
It seems that CPU is the most used resource.
Disk... we have 4x HSG80 pairs (RAID 0+1): the disks are striped across 3 mirror
sets (2 disks per mirror set) and then partitioned into (up to) 8 volumes.
Attempts were made to spread the heavy I/O across controllers and partition
sets, filling in the gaps with moderate to low I/O.
[....partition...]
[.....stripe.....]
[mirr][mirr][mirr]
[d][d][d][d][d][d]
We are looking into disk I/O contention and we are finding some areas that need
to be addressed.
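One thing the layout above implies: every partition cut from the same stripeset sits on the same six disks, so spreading heavy jobs across partitions of one stripeset does not separate their I/O. A rough Python sketch of that bookkeeping follows. The workload names, stripeset labels, and I/O rates are all made up for illustration, and it assumes each entry can be tied to a single stripeset.

```python
from collections import defaultdict

# (workload, stripeset, partition, I/Os per second).
# Every value here is hypothetical, not measured on the cluster.
PLACEMENTS = [
    ("focus_data", "hsg1", 1, 400),
    ("focus_work", "hsg1", 2, 350),
    ("user_files", "hsg2", 1, 60),
    ("backup_staging", "hsg3", 1, 300),
]


def load_per_stripeset(placements):
    """Sum I/O per stripeset; partitions on one stripeset share its disks."""
    totals = defaultdict(int)
    for _workload, stripeset, _partition, iops in placements:
        totals[stripeset] += iops
    return dict(totals)


def crowded(placements, limit):
    """Return the stripesets whose combined load exceeds the given limit."""
    totals = load_per_stripeset(placements)
    return sorted(name for name, total in totals.items() if total > limit)


if __name__ == "__main__":
    print(load_per_stripeset(PLACEMENTS))
    print(crowded(PLACEMENTS, limit=500))
```

The limit is arbitrary; the point is only that load should be totalled per stripeset rather than per partition when looking for contention.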
Before the cluster shutdown we added additional disk on a new HSG80 controller
set to help isolate the problems. We saw no noticeable impact at the
application layer (FOCUS); with the BACKUP utility, however, we did.
This brings me back to an application issue or a memory issue, and I lean
towards the application. It could also be the account that the application is
running under.
> Do you have a disk around (and licenses) where you can try OpenVMS V7.3-2
> with the configuration? V7.3 saw changes to off-load I/O Lock 8 activity --
> this was one of the major locks on earlier releases. A number of folks have
> found that getting to V7.3-1 or V7.3-2 -- or more current -- really helps
> system and application performance. V7.2-1H1 added spinlock tracing, so
> you're unfortunately below that release, too. V7.3 adds SDA LCK/RLOCK
> tracing, too.
Yes, I have the necessary disk and licenses (damn the torpedoes though) to do
this - so one day soon we can try this.
Yeah, I tried SPL to no avail - it is not there for V7.2-1 - figures!
> AMDS can sometimes help spot culprits, too.
Will work on that aspect as well.
> What I generally end up doing here is looking at what the particular
> applications are doing. In detail.
I have noted that when the system is running FOCUS (and only FOCUS multiple
batch jobs usually <10) the MP synch shoots up. As soon as the applications
finish the MP synch goes away. Hmmm... FOCUS?
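To put numbers on that observation, the MP synch percentage could be logged alongside the count of running FOCUS batch jobs and the two cases compared. A minimal Python sketch, assuming the samples have already been collected by some other means; the sample values below are invented and the function name is just illustrative.

```python
from statistics import mean

# (number of FOCUS batch jobs running, MP synch percent) per sample.
# These figures are invented for illustration only.
SAMPLES = [(0, 2.0), (4, 30.0), (6, 40.0), (0, 4.0)]


def mpsync_by_focus(samples):
    """Return (mean MP synch with FOCUS running, mean without FOCUS)."""
    with_focus = [pct for jobs, pct in samples if jobs > 0]
    without = [pct for jobs, pct in samples if jobs == 0]
    return (
        mean(with_focus) if with_focus else None,
        mean(without) if without else None,
    )


if __name__ == "__main__":
    print(mpsync_by_focus(SAMPLES))
```

A large gap between the two means would back up the FOCUS suspicion, though it would not say which code path inside FOCUS is responsible.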
John "REBOOT" Brandon
VMS Systems Administrator
firstname.lastname.spam.me.not@dalsemi.com
brandon18 (250) | 9/13/2006 3:28:35 PM

BRANDON, JOHN M wrote:
> Hoff, to answer your questions ...
>
> But first, we did a cluster shutdown and startup - this seems to have resolved
> the MP Synch problem (for now)... the cluster had been online 120+ days. This
> leads me in the direction of application and memory management.
Could be. Some sort of resource starvation or resource leak?
>> Clustered?
>
> The servers are clustered using 100 Mbit (private VLAN) and Gigabit (fiber).
> The Gigabit provides access to our users.
Gigabit is also where the locking is. You might try looking at where the
locks are mastered, and at what nodes are participating -- these are the lock
and dlock activities, for a start.
>> What happens when you temporarily drop from 4 CPUs down to 1 or 2?
>> (STOP/CPU your way down, for purposes of testing.) Do you see a big
>> drop-off in MPSYNC?
>
> Have not done this - I suspect this will terminate jobs currently executing,
> correct? If not, then I will try this.
You should be able to remove the processors "hot"; we toss CPUs around within
the Galaxy all the time, for instance. If this is a production server and
you're as paranoid as I can get in such an environment and you can reboot, you
can also configure the CPUs out at boot.
> It seems that CPU is the most used resource.
You'll unfortunately have to look behind that, at what the processes are
doing, and what code-paths are active.
....
>> What I generally end up doing here is looking at what the particular
>> applications are doing. In detail.
>
> I have noted that when the system is running FOCUS (and only FOCUS multiple
> batch jobs usually <10) the MP synch shoots up. As soon as the applications
> finish the MP synch goes away. Hmmm... FOCUS?
I'm not familiar with what FOCUS is doing internally -- some of the tools
that have been discussed in this thread were implemented to allow us to peer
"inside" an application, and see what it's up to. System service logging, the
spinlock tracing, etc. But your OpenVMS version is unfortunately too far back...
hoff-remove-this (566) | 9/13/2006 6:56:28 PM