MMX FPU Mode and Task Switching



As I understand it, on the first execution of an MMX instruction
the FPU mode switches to MMX mode. In order to switch back to
normal floating-point mode, you must execute an EMMS instruction.
It is emphasized that this shouldn't be done often since the
EMMS instruction is expensive. I'm getting this information from

  http://webster.cs.ucr.edu/AoA/Windows/HTML/TheMMXInstructionSet.html

How does a task maintain the FPU in MMX mode when, at any moment, the
OS can preempt and switch to another task that may require normal
FPU mode?
-- 
%  Randy Yates                  % "Watching all the days go by...    
%% Fuquay-Varina, NC            %  Who are you and who am I?"
%%% 919-577-9882                % 'Mission (A World Record)', 
%%%% <yates@ieee.org>           % *A New World Record*, ELO
http://home.earthlink.net/~yatescr

Reply Randy 2/20/2007 5:09:18 PM

Randy Yates <spamtrap@crayne.org> wrote:
> As I understand it, on the first execution of an MMX instruction
> the FPU mode switches to MMX mode. In order to switch back to
> normal floating-point mode, you must execute a EMMS instruction.
> It is emphasized that this shouldn't be done often since the
> EMMS instruction is expensive. I'm getting this information from
> 
>  http://webster.cs.ucr.edu/AoA/Windows/HTML/TheMMXInstructionSet.html
> 
> How does a task maintain the FPU in MMX mode when, at any moment, the
> OS can preempt and switch to another task that may require normal
> FPU mode?

The MMX mode is part of the 8087 task state and, as such, it is
preserved along with the register contents at every task switch.

If you have also XMM registers, you might think about using them unless
you need the 8087 80-bit long double format.

Reply dave 2/20/2007 6:30:10 PM


dave <spamtrap@crayne.org> wrote in part:
> The MMX mode is part of the 8087 task state and, as such, it is
> preserved along with the register contents at every task switch.

Not always.  Because very few processes use the FPU, Intel included
facilities for a "lazy FPU switch" controlled by TS, bit 3 of CR0.

Essentially the x87 state is only saved-restored when needed by
a foreign-task FP/MMX instruction.  Linux does this and it saves
work on task-switch.

-- Robert

Reply Robert 2/20/2007 7:53:17 PM

Robert Redelmeier <redelm@ev1.net.invalid> writes:

> dave <spamtrap@crayne.org> wrote in part:
>> The MMX mode is part of the 8087 task state and, as such, it is
>> preserved along with the register contents at every task switch.
>
> Not always.  Because very few processes use the FPU, Intel included
> facilities for a "lazy FPU switch" controlled by TS, bit3 of CR0 .
>
> Essentially the x87 state is only saved-restored when needed by
> a foreign-task FP/MMX instruction.  Linux does this and it saves
> work on task-switch.

I'm confused - how does linux "know" that the x87 state needs to
be saved? 

By the way, this will be executing under FC4.
-- 
%  Randy Yates                  % "...the answer lies within your soul
%% Fuquay-Varina, NC            %       'cause no one knows which side
%%% 919-577-9882                %                   the coin will fall."
%%%% <yates@ieee.org>           %  'Big Wheels', *Out of the Blue*, ELO
http://home.earthlink.net/~yatescr

Reply Randy 2/20/2007 8:47:41 PM

On Feb 20, 2:47 pm, Randy Yates  <spamt...@crayne.org> wrote:
> Robert Redelmeier <red...@ev1.net.invalid> writes:
> > dave <spamt...@crayne.org> wrote in part:
> >> The MMX mode is part of the 8087 task state and, as such, it is
> >> preserved along with the register contents at every task switch.
>
> > Not always.  Because very few processes use the FPU, Intel included
> > facilities for a "lazy FPU switch" controlled by TS, bit3 of CR0 .
>
> > Essentially the x87 state is only saved-restored when needed by
> > a foreign-task FP/MMX instruction.  Linux does this and it saves
> > work on task-switch.
>
> I'm confused - how does linux "know" that the x87 state needs to
> be saved?


There are two ways to implement lazy x87 context switches: you need to
set either the EM or TS bit in CR0 whenever you switch tasks.  The TS
bit has the advantage of being set automatically if you use the x86
task-management stuff.  Then when (and if) that task issues an x87
instruction, the OS will get an Int 7, at which point it can save the
x87 state for the last task using it (known to the OS, of course).
The OS then loads the x87 state for the current task, makes note of the
fact that the FPU is now assigned to this task, and resumes.
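
To make the sequence above concrete, here is a minimal user-space C model of a lazy switch. It is only a sketch under stated assumptions: the `Cpu` and `FpuState` types and every function name are invented for illustration, the "register file" is a stand-in rather than the real FSAVE layout, and no actual kernel is claimed to be structured this way.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NTASKS 4

/* Hypothetical stand-in for the x87/MMX register file (not the FSAVE layout). */
typedef struct { double st[8]; } FpuState;

typedef struct {
    FpuState hw;             /* state currently held in the "hardware" FPU  */
    bool     ts;             /* models CR0.TS                               */
    int      current;        /* task that is running now                    */
    int      owner;          /* task whose state is in hw, or -1 for none   */
    FpuState saved[NTASKS];  /* per-task save areas                         */
    int      saves;          /* number of state saves actually performed    */
} Cpu;

void cpu_init(Cpu *c) {
    memset(c, 0, sizeof *c);
    c->owner = -1;
}

/* Task switch: no FPU state is copied, only TS is set. */
void task_switch(Cpu *c, int next) {
    assert(next >= 0 && next < NTASKS);
    c->current = next;
    c->ts = true;
}

/* Models the Int 7 (#NM) handler described above. */
static void nm_handler(Cpu *c) {
    c->ts = false;                       /* what CLTS would do              */
    if (c->owner == c->current)
        return;                          /* FPU still holds our state       */
    if (c->owner >= 0) {
        c->saved[c->owner] = c->hw;      /* save for the last task using it */
        c->saves++;
    }
    c->hw = c->saved[c->current];        /* load state for the current task */
    c->owner = c->current;               /* FPU is now assigned to it       */
}

/* The current task executes an FP/MMX instruction that writes st[0]. */
void fpu_store(Cpu *c, double v) {
    if (c->ts) nm_handler(c);
    c->hw.st[0] = v;
}

/* The current task executes an FP/MMX instruction that reads st[0]. */
double fpu_load(Cpu *c) {
    if (c->ts) nm_handler(c);
    return c->hw.st[0];
}
```

In this model, a task that never calls `fpu_store` or `fpu_load` never causes a save, which is the whole point of the lazy scheme.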

Reply robertwessel2 2/21/2007 2:18:59 AM

Randy Yates <spamtrap@crayne.org> wrote in part:
> I'm confused - how does linux "know" that the x87 state
> needs to be saved?

From the Intel IA-32 System Programming Manual:

         TS Task Switched (bit 3 of CR0). 

Allows the saving of the x87 FPU, MMX, SSE, and SSE2 context on
a task switch to be delayed until an x87 FPU, MMX, SSE, or SSE2
instruction is actually executed by the new task. The processor
sets this flag on every task switch and tests it when executing
x87 FPU, MMX, SSE, and SSE2 instructions.

- If the TS flag is set and the EM flag (bit 2 of CR0) is clear,
  a device-not-available exception (#NM) is raised prior to the
  execution of any x87 FPU, MMX, SSE, and SSE2 instruction, with
  the exception of the PAUSE, PREFETCHh, SFENCE, LFENCE, MFENCE,
  MOVNTI, and CLFLUSH instructions. (See the paragraph below for
  the special case of the WAIT/FWAIT instructions.)

- If the TS flag is set and the MP flag (bit 1 of CR0) and EM flag
  are clear, an #NM exception is not raised prior to the execution
  of an x87 FPU WAIT/FWAIT instruction.

- If the EM flag is set, the setting of the TS flag has no effect
  on the execution of the x87 FPU, MMX, SSE, and SSE2 instructions.

Table 2-1 shows the actions taken when the processor encounters
an x87 FPU instruction based on the settings of the TS, EM, and
MP flags. Tables 10-1 and 11-1 show the actions taken when the
processor encounters an MMX instruction or an SSE or SSE2
instruction, respectively.

The processor does not automatically save the context of the x87
FPU, XMM, and MXCSR registers on a task switch. Instead it sets
the TS flag, which causes the processor to raise an #NM exception
whenever it encounters an x87 FPU, MMX, SSE, or SSE2 instruction in
the instruction stream for the new task (with the exception of the
instructions listed above). The fault handler for the #NM exception
can then be used to clear the TS flag (with the CLTS instruction)
and save the context of the x87 FPU, XMM, and MXCSR registers. If
the task never encounters an x87 FPU, MMX, SSE, or SSE2 instruction,
the x87 FPU, MMX, SSE, and SSE2 context is never saved.
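
Restricted to x87 and WAIT/FWAIT instructions, the rules quoted above reduce to a small predicate. This is a sketch, not a transcription of the manual: the function and enum names are made up, and the EM=1 case for ordinary x87 instructions (which also raises #NM) is taken from the manual's x87 table rather than from the excerpt, so treat it as an assumption. MMX and SSE instructions behave differently when EM is set and are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical instruction classes; only the two covered by the x87 table. */
typedef enum { INSN_X87, INSN_WAIT } InsnKind;

/* Returns true if a device-not-available exception (#NM) would be raised
 * before the instruction executes, given the CR0 flag settings. */
bool raises_nm(bool ts, bool em, bool mp, InsnKind kind) {
    if (kind == INSN_WAIT)
        return ts && mp;   /* WAIT/FWAIT faults only when both TS and MP are set */
    return em || ts;       /* ordinary x87: EM forces #NM, otherwise TS decides  */
}
```

For example, `raises_nm(true, false, false, INSN_WAIT)` is false, matching the statement that WAIT/FWAIT does not fault when TS is set but MP and EM are clear.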

> By the way, this will be executing under FC4.

What is that?

-- Robert

Reply Robert 2/21/2007 1:56:28 PM

Robert Redelmeier wrote:

> Randy Yates wrote:
> 
>> By the way, this will be executing under FC4.
> 
> What is that?

http://en.wikipedia.org/wiki/Fedora_Core

Reply Spoon 2/21/2007 3:07:42 PM

Spoon <spamtrap@crayne.org> wrote in part:
> Robert Redelmeier wrote:
>> Randy Yates wrote:
>>> By the way, this will be executing under FC4.
>> What is that?
> 
> http://en.wikipedia.org/wiki/Fedora_Core

Very nice, but no more explanatory than if I said SW9
(Slackware 9.0).  There is such a thing as over-abbreviating.

The libs don't matter at all.  That it is Linux and which
kernel (2.4 or 2.6) does.

-- Robert


Reply Robert 2/21/2007 8:16:12 PM

Robert Redelmeier <redelm@ev1.net.invalid> writes:

> Spoon <spamtrap@crayne.org> wrote in part:
>> Robert Redelmeier wrote:
>>> Randy Yates wrote:
>>>> By the way, this will be executing under FC4.
>>> What is that?
>> 
>> http://en.wikipedia.org/wiki/Fedora_Core
>
> Very nice, but no more explanatory than if I said SW9
> (Slackware 9.0).  There is such a thing as over-abbreviating.

Robert, 

First of all, let me say thank you for responding and for the
good information you provided. 

I assumed that using a common acronym for a common operating system on
a usenet group that specializes in programming x86 systems is
reasonable. I would have responded sooner with a clarification if I
hadn't been knee-deep lately.

Yes, I meant "Fedora Core 4" - essentially Red Hat's "free" linux OS. 

> The libs don't matter at all.  That it is Linux and which
> kernel (2.4 or 2.6) does.

[yates@localhost hw3]$ uname -a
Linux localhost 2.6.11-1.1369_FC4 #1 Thu Jun 2 22:55:56 EDT 2005 i686 athlon i386 GNU/Linux
-- 
%  Randy Yates                  % "My Shangri-la has gone away, fading like 
%% Fuquay-Varina, NC            %  the Beatles on 'Hey Jude'" 
%%% 919-577-9882                %  
%%%% <yates@ieee.org>           % 'Shangri-La', *A New World Record*, ELO
http://home.earthlink.net/~yatescr

Reply Randy 2/21/2007 10:21:15 PM

Randy Yates <spamtrap@crayne.org> wrote in part:
> I assumed that using a common acronym for a common operating

Not a problem.  "Common" is a relative term.  And I apologize
if my tone appeared harsh.

Getting back to topic, I think a bit of historical perspective
helps.  Intel added TS to CR0 back in 386 days, before x87 was
even common and when CPUs had no cache.  Writing ~96 bytes of
x87 state out to non-write-behind memory was felt to take too
much time on every task switch or interrupt.  Hence the idea of
delaying it until needed.

The consequence is that it is hard (not impossible) to do
FP/MMX in the kernel.  On modern caching CPUs, the relative
cost of saving state has dropped considerably while the cost
of exception processing has stayed the same or grown.  Now,
a system with multiple tasks using MMX/FPU might be [slightly]
better off always saving state.

-- Robert

Reply Robert 2/22/2007 2:22:07 PM

On Feb 22, 8:22 am, Robert Redelmeier <red...@ev1.net.invalid> wrote:
> Getting back to topic, I think a bit of historical perspective
> helps.  Intel added TS to CR0 back in 386 days.  Before x87 was
> even common, and when CPUs had no cache.  Writing ~96 bytes of
> x87 state out to non-write-behind memory was felt to take too
> much time on every task switch or interrupt.  Hence the idea of
> delaying it until needed.
>
> The consequence is that it is hard (not impossible) to do
> FP/MMX in the kernel.  On modern caching CPUs, the relative
> cost of saving state has dropped considerably while the cost
> of exception processing has stayed the same or grown.  Now,
> a system with multiple tasks using MMX/FPU might be [slightly]
> better off always saving state.


Actually TS was added with protected mode in the 286.

In addition to the direct state saving overhead, avoiding state saves
potentially allowed parallel execution on the FPU to continue during
interrupts and context switches.  Again, that's far less of an issue
today, but people forget how slow some numeric operations were in that
era.

It also reduced average interrupt latency for both reasons (no state
save, and often being able to avoid waiting for the long numeric
operation to complete first).

Unfortunately the wait for a prior FPU operation was not interruptible
(although starting with the 486 some of the longer operations were
abortable and restartable), so the maximum latency was not reduced:
if the interrupt request happened while you were waiting for that
1100-cycle FYL2X, you just got to wait.  (If the interrupt happened
after the FYL2X was issued but before you started to wait for the
result, it could proceed immediately.)

Reply robertwessel2 2/23/2007 12:59:23 AM

Robert Redelmeier <redelm@ev1.net.invalid> writes:

> Randy Yates <spamtrap@crayne.org> wrote in part:
>> I assumed that using a common acronym for a common operating
>
> Not a problem.  "Common" is a relative term.  An I apologize
> if my tone appeared harsh.
>
> Getting back to topic, I think a bit of historical perspective
> helps.  Intel added TS to CR0 back in 386 days.  Before x87 was
> even common, and when CPUs had no cache.  Writing ~96 bytes of
> x87 state out to non-write-behind memory was felt to take too
> much time on every task switch or interrupt.  Hence the idea of
> delaying it until needed.
>
> The consequence is that it is hard (not impossible) to do
> FP/MMX in the kernel.  On modern caching CPUs, the relative
> cost of saving state has dropped considerably while the cost
> of exception processing has stayed the same or grown.  Now,
> a system with multiple tasks using MMX/FPU might be [slightly]
> better off always saving state.

Thanks Robert. I believe I follow you. I remember the good ol' days
when writing 96 bytes was not a small chore (I did a crapload of 8085
assembly programming back in the 80's). 

Two more questions on the topic of MMX programming:

  1. I haven't even thought critically about caching, but, umm, what
  about it? How do you do it? How do you guarantee (or is it
  impossible?)  that your cached data stays in cache through task
  switches/interrupts? I'm going to be banging blocks of 4K words of
  samples at a time through a 2261-tap polyphase FIR filter at an
  input rate of 448000 Hz. If you work out the details, I'll need
  about 8K of data (4K current samples, 2260 old samples, and 2261
  coefficients) cached in memory for about 0.33 milliseconds every
  9.14 milliseconds. Is this doable or am I dreaming?

  It would be nice to keep the 2261 words of coefficients in cache
  permanently (I think).

  2. I got the idea reading the Intel documentation that if you have
  an SSE-capable machine with the extended 128-bit registers, the
  PMADDWD will/can do a vector of 8 multiplies at a time rather than
  4. Is this true?
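
The figures in question 1 can be checked with a few lines of C. This is a sketch under assumptions that are not stated in the post: "4K" is taken to mean 4096 samples, and a "word" is taken to be 16 bits (the operand size PMADDWD works on). The function names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

enum {
    BLOCK_SAMPLES = 4096,      /* assumed meaning of "4K words of samples" */
    TAPS          = 2261,      /* filter length from the post              */
    HISTORY       = TAPS - 1   /* old samples kept from the previous block */
};

/* Words that must be resident while one block is being filtered. */
size_t working_set_words(void) {
    return (size_t)BLOCK_SAMPLES + HISTORY + TAPS;
}

/* Same footprint in bytes for a given word size (2 for 16-bit samples). */
size_t working_set_bytes(size_t bytes_per_word) {
    return working_set_words() * bytes_per_word;
}

/* Time between successive blocks, in microseconds, at the given input rate. */
double block_period_us(double rate_hz) {
    return 1e6 * BLOCK_SAMPLES / rate_hz;
}
```

Under those assumptions the working set is 8617 words, or 17234 bytes with 16-bit words, and a new block arrives roughly every 9143 microseconds at 448000 Hz, which agrees with the "about 8K" and 9.14 millisecond figures above.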

--Randy


-- 
%  Randy Yates                  % "With time with what you've learned, 
%% Fuquay-Varina, NC            %  they'll kiss the ground you walk 
%%% 919-577-9882                %  upon."
%%%% <yates@ieee.org>           % '21st Century Man', *Time*, ELO
http://home.earthlink.net/~yatescr

Reply Randy 2/24/2007 4:27:20 AM

On 20 Feb, 19:09, Randy Yates  <spamt...@crayne.org> wrote:
> As I understand it, on the first execution of an MMX instruction
> the FPU mode switches to MMX mode. In order to switch back to
> normal floating-point mode, you must execute a EMMS instruction.
> It is emphasized that this shouldn't be done often since the
> EMMS instruction is expensive. I'm getting this information from
>
>  http://webster.cs.ucr.edu/AoA/Windows/HTML/TheMMXInstructionSet.html
>
> How does a task maintain the FPU in MMX mode when, at any moment, the
> OS can preempt and switch to another task that may require normal
> FPU mode?
> --
> %  Randy Yates                  % "Watching all the days go by...
> %% Fuquay-Varina, NC            %  Who are you and who am I?"
> %%% 919-577-9882                % 'Mission (A World Record)',
> %%%% <y...@ieee.org>           % *A New World Record*, ELO
> http://home.earthlink.net/~yatescr

MenuetOS M32 has FPU support, and M64 has MMX/SSE/SSE2 support.
http://www.menuetos.net

Reply spamtrap 2/24/2007 11:10:29 AM


Randy Yates wrote:

> I got the idea reading the Intel documentation that if you have an
> SSE-capable machine with the extended 128-bit registers, the PMADDWD
> will/can do a vector of 8 multiplies at a time rather than 4.

According to AMD's manual:

The PMADDWD instruction is an SSE2 instruction.

Multiplies each packed 16-bit signed value in the first source operand 
by the corresponding packed 16-bit signed value in the second source 
operand, adds the adjacent intermediate 32-bit results of each 
multiplication (for example, the multiplication results for the adjacent 
bit fields 63-48 and 47-32, and 31-16 and 15-0), and writes the 32-bit 
result of each addition in the corresponding doubleword of the 
destination (first source). The first source/destination operand is an 
XMM register and the second source operand is another XMM register or 
128-bit memory location.

i.e. 8 16x16->32 multiplications, followed by 4 32-bit additions.
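
The behaviour described in the manual excerpt can be written out as a scalar reference model in portable C. This is an illustration of the arithmetic only, not how the instruction would be used in real code (an SSE2 build would use the instruction or a compiler intrinsic instead); the function names `pmaddwd128` and `dot8` are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of the 128-bit PMADDWD: eight signed 16x16->32 multiplies,
 * then adjacent products are added pairwise into four 32-bit results. */
void pmaddwd128(const int16_t a[8], const int16_t b[8], int32_t out[4]) {
    for (int i = 0; i < 4; i++) {
        int32_t lo = (int32_t)a[2 * i]     * b[2 * i];
        int32_t hi = (int32_t)a[2 * i + 1] * b[2 * i + 1];
        /* Add in unsigned arithmetic so the single overflow case (all four
         * inputs equal to -32768) wraps instead of being undefined. */
        out[i] = (int32_t)((uint32_t)lo + (uint32_t)hi);
    }
}

/* One step of a FIR inner loop: dot product of 8 samples and 8 coefficients.
 * The four partial sums are widened to 64 bits before being combined. */
int64_t dot8(const int16_t x[8], const int16_t h[8]) {
    int32_t p[4];
    pmaddwd128(x, h, p);
    return (int64_t)p[0] + p[1] + p[2] + p[3];
}
```

So the 128-bit form does 8 multiplies per instruction, against 4 for the 64-bit MMX form, and a 2261-tap dot product needs roughly half as many PMADDWD steps.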

Reply Spoon 2/26/2007 10:45:34 AM
