[xpost comp.lang.c, comp.unix.programmer, comp.os.ms-
windows.programmer.win32]
Hello NG,
I have a very specific question, but first of all I describe you what
I'm facing as I'm open to alternative general solutions..
I'm refactoring some old C code that deals with very basilar data
structures (like lists, priority queues, hashtables and so on) that
pretends to be very portable. Regarding the memory management, I have
an internal wrapper that, on *nix and windows systems, maps the malloc/
calloc/realloc/free functions one-to-one to the default ones, but it's
possibile to switch to another custom allocator that works pretty much
as a standard allocator (a big chunk of memory splitted/joined, tought
for embedded systems).
I would like to implement my custom allocator for the *nix and windows
systems too (of course loosing some portability for performance gain),
and maybe implement something like the C++ allocators to improve the
memory management of the different algorithms (like rare large
allocations on vectors / frequent small allocations on lists), but
I've some questions:
- In these systems, to have decent performances, I was thinking to
allocate page-sized (and page-aligned) chunks to have small
allocations in the same page, but how is it possible to request some
memory with this characteristics? Will it help the OS memory manager?
- Does all of this make any sense? :-)
Thanks for any suggestion!
Ciao!
P.S. I don't really have the necessity of a "performance boost", but
I'm dealing with a general, low-level and already used library, and as
long as I'm refactoring the code I'd like to do it the better possible
way ;-)
|
|
0
|
|
|
|
Reply
|
fmassei (45)
|
12/31/2011 2:56:53 PM |
|
On 12/31/2011 9:56 AM, FtM wrote:
> [xpost comp.lang.c, comp.unix.programmer, comp.os.ms-
> windows.programmer.win32]
> Hello NG,
> I have a very specific question, but first of all I describe you what
> I'm facing as I'm open to alternative general solutions..
> I'm refactoring some old C code that deals with very basilar data
> structures (like lists, priority queues, hashtables and so on) that
> pretends to be very portable. Regarding the memory management, I have
> an internal wrapper that, on *nix and windows systems, maps the malloc/
> calloc/realloc/free functions one-to-one to the default ones, but it's
> possibile to switch to another custom allocator that works pretty much
> as a standard allocator (a big chunk of memory splitted/joined, tought
> for embedded systems).
This is frequently done, usually either to use an alternate memory
manager with extra debugging features or to substitute a manager whose
performance characteristics better match the needs of the program at
hand. However, the C language does not guarantee that it's possible to
do such substitution; "the library" is regarded as monolithic, and not
as something that can be replaced piecemeal. This allows the library
implementor to use internal and unadvertised features, "back doors" if
you like. (For example, exit() needs a "back door" to find and close
all the still-open FILE* streams.) It is at least possible that some
implementation's internals involve "back doors" to the memory manager,
so if you substituted a memory manager that implemented only the public
operations you'd break the library in strange ways.
That's mostly a "theoretical" consideration, though, perhaps since
replacing the memory manager is such an incredibly useful thing to want
to do. Still, keep in mind that it's not guaranteed to work.
> I would like to implement my custom allocator for the *nix and windows
> systems too (of course loosing some portability for performance gain),
> and maybe implement something like the C++ allocators to improve the
> memory management of the different algorithms (like rare large
> allocations on vectors / frequent small allocations on lists), but
> I've some questions:
> - In these systems, to have decent performances, I was thinking to
> allocate page-sized (and page-aligned) chunks to have small
> allocations in the same page, but how is it possible to request some
> memory with this characteristics? Will it help the OS memory manager?
> - Does all of this make any sense? :-)
Not much, unless you are fond of re-inventing well-rounded wheels
for your own amusement. The things you mention are already common
practice in many Unixoid allocators; I have no specific knowledge of
Windows' allocators but there's no reason to think Redmond is unaware
of the state(s) of the art(s).
> P.S. I don't really have the necessity of a "performance boost", but
> I'm dealing with a general, low-level and already used library, and as
> long as I'm refactoring the code I'd like to do it the better possible
> way ;-)
For a "general low-level" library I think you'll be hard-pressed
just to match the performance of the O/S' own allocators, much less
improve on them. Things would be different for a library tuned to
the specific needs of a particular program and making use of program-
specific knowledge: Which allocations will be short- or long-lived,
which will be used together and would benefit from sharing a single
page, which data items are allocated and freed so frequently that they
merit their own cache, ... Large gains can (sometimes) be had by
exploiting knowledge of this kind, but a "general low-level" library
really can't have that kind of knowledge.
If you stay on the "general low-level" plane, you're just matching
wits with people who are paid to devote a lot of thought to aspects of
this particular problem -- which is why I say you're unlikely even to
do as well as they've already done. More engineer-decades have gone
into the existing allocators than you're likely to be able to devote
to replacing them. There are probably better targets for your effort.
--
Eric Sosman
esosman@ieee-dot-org.invalid
|
|
0
|
|
|
|
Reply
|
esosman2 (2945)
|
12/31/2011 4:08:45 PM
|
|
On Dec 31, 5:08=A0pm, Eric Sosman <esos...@ieee-dot-org.invalid> wrote:
> On 12/31/2011 9:56 AM, FtM wrote:
>
> > [xpost comp.lang.c, comp.unix.programmer, comp.os.ms-
> > windows.programmer.win32]
> > Hello NG,
> > I have a very specific question, but first of all I describe you what
> > I'm facing as I'm open to alternative general solutions..
> > I'm refactoring some old C code that deals with very basilar data
> > structures (like lists, priority queues, hashtables and so on) that
> > pretends to be very portable. Regarding the memory management, I have
> > an internal wrapper that, on *nix and windows systems, maps the malloc/
> > calloc/realloc/free functions one-to-one to the default ones, but it's
> > possibile to switch to another custom allocator that works pretty much
> > as a standard allocator (a big chunk of memory splitted/joined, tought
> > for embedded systems).
>
> =A0 =A0 =A0This is frequently done, usually either to use an alternate me=
mory
> manager with extra debugging features or to substitute a manager whose
> performance characteristics better match the needs of the program at
> hand. =A0However, the C language does not guarantee that it's possible to
> do such substitution; "the library" is regarded as monolithic, and not
> as something that can be replaced piecemeal. =A0This allows the library
> implementor to use internal and unadvertised features, "back doors" if
> you like. =A0(For example, exit() needs a "back door" to find and close
> all the still-open FILE* streams.) =A0It is at least possible that some
> implementation's internals involve "back doors" to the memory manager,
> so if you substituted a memory manager that implemented only the public
> operations you'd break the library in strange ways.
>
> =A0 =A0 =A0That's mostly a "theoretical" consideration, though, perhaps s=
ince
> replacing the memory manager is such an incredibly useful thing to want
> to do. =A0Still, keep in mind that it's not guaranteed to work.
>
You are perfectly right, I optimistically hope that if something like
the memory allocation functions are broken I'll notice it in a short
time and fix it (replacing my allocator with the wrapper) :-) Anyway
the "custom" allocator is written for those simple systems that don't
have a dynamic memory allocator at all (and they are probably
disappearing from the market).
> > I would like to implement my custom allocator for the *nix and windows
> > systems too (of course loosing some portability for performance gain),
> > and maybe implement something like the C++ allocators to improve the
> > memory management of the different algorithms (like rare large
> > allocations on vectors / frequent small allocations on lists), but
> > I've some questions:
> > - In these systems, to have decent performances, I was thinking to
> > allocate page-sized (and page-aligned) chunks to have small
> > allocations in the same page, but how is it possible to request some
> > memory with this characteristics? Will it help the OS memory manager?
> > - Does all of this make any sense? :-)
>
> =A0 =A0 =A0Not much, unless you are fond of re-inventing well-rounded whe=
els
> for your own amusement. =A0The things you mention are already common
> practice in many Unixoid allocators; I have no specific knowledge of
> Windows' allocators but there's no reason to think Redmond is unaware
> of the state(s) of the art(s).
>
> > P.S. I don't really have the necessity of a "performance boost", but
> > I'm dealing with a general, low-level and already used library, and as
> > long as I'm refactoring the code I'd like to do it the better possible
> > way ;-)
>
> =A0 =A0 =A0For a "general low-level" library I think you'll be hard-press=
ed
> just to match the performance of the O/S' own allocators, much less
> improve on them. =A0Things would be different for a library tuned to
> the specific needs of a particular program and making use of program-
> specific knowledge: Which allocations will be short- or long-lived,
> which will be used together and would benefit from sharing a single
> page, which data items are allocated and freed so frequently that they
> merit their own cache, ... =A0Large gains can (sometimes) be had by
> exploiting knowledge of this kind, but a "general low-level" library
> really can't have that kind of knowledge.
>
> =A0 =A0 =A0If you stay on the "general low-level" plane, you're just matc=
hing
> wits with people who are paid to devote a lot of thought to aspects of
> this particular problem -- which is why I say you're unlikely even to
> do as well as they've already done. =A0More engineer-decades have gone
> into the existing allocators than you're likely to be able to devote
> to replacing them. =A0There are probably better targets for your effort.
>
You exactly identified the origin of my doubts: if I go the way I
described, I'd like to give the library user the possibility to switch/
reuse more than one allocator just like the way you can switch
allocators in C++. I know that I can't beat the general standard one,
but using an optimized (maybe third-party?) one only in some occasions
and the system's one otherwise seems reasonable to me.. isn't it?
I'm asking here because I know that it's a common problem that many of
you have faced and solved ;-)
Ciao!
|
|
0
|
|
|
|
Reply
|
fmassei (45)
|
12/31/2011 4:28:37 PM
|
|
FtM wrote:
> [xpost comp.lang.c, comp.unix.programmer, comp.os.ms-
> windows.programmer.win32]
> Hello NG,
> I have a very specific question, but first of all I describe you what
> I'm facing as I'm open to alternative general solutions..
> I'm refactoring some old C code that deals with very basilar data
> structures (like lists, priority queues, hashtables and so on) that
> pretends to be very portable. Regarding the memory management, I have
> an internal wrapper that, on *nix and windows systems, maps the malloc/
> calloc/realloc/free functions one-to-one to the default ones, but it's
> possibile to switch to another custom allocator that works pretty much
> as a standard allocator (a big chunk of memory splitted/joined, tought
> for embedded systems).
> I would like to implement my custom allocator for the *nix and windows
> systems too (of course loosing some portability for performance gain),
> and maybe implement something like the C++ allocators to improve the
> memory management of the different algorithms (like rare large
> allocations on vectors / frequent small allocations on lists), but
> I've some questions:
AFAIK the allocator of GNU/Linux libc already does this.
man mallopt will give you some information about it's behavior.
Or read the source code.
No need to reinvent the wheel.
> - In these systems, to have decent performances, I was thinking to
> allocate page-sized (and page-aligned) chunks to have small
> allocations in the same page, but how is it possible to request some
> memory with this characteristics? Will it help the OS memory manager?
linux libc uses sbrk and mmap... and possibly other syscalls
regards,
JK
|
|
0
|
|
|
|
Reply
|
klammerr (8)
|
12/31/2011 4:39:10 PM
|
|
On Dec 31, 5:39=A0pm, Johann Klammer <klamm...@NOSPAM.aon.at> wrote:
> FtM wrote:
> > I would like to implement my custom allocator for the *nix and windows
> > systems too (of course loosing some portability for performance gain),
> > and maybe implement something like the C++ allocators to improve the
> > memory management of the different algorithms (like rare large
> > allocations on vectors / frequent small allocations on lists), but
> > I've some questions:
>
> AFAIK the allocator of GNU/Linux libc already does this.
> man mallopt will give you some information about it's behavior.
I've already looked into mallopt but it doesn't seem enough fine-
grained for my needs... Do you know some code or a project that uses
it so I can take a look?
> Or read the source code.
> No need to reinvent the wheel.
>
That's the reason of asking before coding ;)
> > - In these systems, to have decent performances, I was thinking to
> > allocate page-sized (and page-aligned) chunks to have small
> > allocations in the same page, but how is it possible to request some
> > memory with this characteristics? Will it help the OS memory manager?
>
> linux libc uses sbrk and mmap... and possibly other syscalls
>
sure, sbrk() is the way to go for a low-level allocator, but the
question of how I can take a page-aligned segment remains: you can't
be sure of the OS's VA masking mechanism (or maybe yes?). Anyway, I
could proceed by reading the code of a specific kernel, but it would
be totally useless if there isn't a general enough-shared "philosophy"
underneath (read as: "a specific syscall").
I already use mmap() for persistent b-trees, but does it make sense to
use it without an underlying physical file?
Ciao!
|
|
0
|
|
|
|
Reply
|
fmassei (45)
|
12/31/2011 4:55:37 PM
|
|
On 2011-12-31, FtM <fmassei@gmail.com> wrote:
> I would like to implement my custom allocator for the *nix and windows
> systems too (of course loosing some portability for performance gain),
> and maybe implement something like the C++ allocators to improve the
> memory management of the different algorithms (like rare large
> allocations on vectors / frequent small allocations on lists), but
What makes you think that your malloc doesn't already handle
such cases?
Say, have you profiled the code to see what fraction of the time is actually
spent executing the allocator code?
Can you answer the question: how much faster will the program be if
a magic allocator is substituted which requires zero cycles to execute?
> I've some questions:
> - In these systems, to have decent performances, I was thinking to
> allocate page-sized (and page-aligned) chunks to have small
> allocations in the same page, but how is it possible to request some
> memory with this characteristics? Will it help the OS memory manager?
Not knowing these basics is an indicator that you need to study more if you
want to write memory allocators that beat malloc.
|
|
0
|
|
|
|
Reply
|
kaz15 (1129)
|
12/31/2011 5:24:32 PM
|
|
On Dec 31, 6:24=A0pm, Kaz Kylheku <k...@kylheku.com> wrote:
> On 2011-12-31, FtM <fmas...@gmail.com> wrote:
>
> > I would like to implement my custom allocator for the *nix and windows
> > systems too (of course loosing some portability for performance gain),
> > and maybe implement something like the C++ allocators to improve the
> > memory management of the different algorithms (like rare large
> > allocations on vectors / frequent small allocations on lists), but
>
> What makes you think that your malloc doesn't already handle
> such cases?
>
Simply because it can't know if I'm going to do 200 little allocations
or 1 big one. Even if it has the most neat heuristic ever, the libc
(or the kernel) exposing a simple specific function/syscall would do
its job way better. Part of the question was about this hypothetical
function.
> Say, have you profiled the code to see what fraction of the time is actua=
lly
> spent executing the allocator code?
>
> Can you answer the question: how much faster will the program be if
> a magic allocator is substituted which requires zero cycles to execute?
>
Let's see... So why do the C++ STLs (or boost, for another example) do
exactly what I'm asking here? :-)
> > I've some questions:
> > - In these systems, to have decent performances, I was thinking to
> > allocate page-sized (and page-aligned) chunks to have small
> > allocations in the same page, but how is it possible to request some
> > memory with this characteristics? Will it help the OS memory manager?
>
> Not knowing these basics is an indicator that you need to study more if y=
ou
> want to write memory allocators that beat malloc.
I already said that I don't want to "beat malloc", just to find a
(probably platform-specific) way to facilitate its job.
Ciao!
P.S. I'm sorry if I'm looking so.. childish(?).. to you but, believe
me, I've written some physical/virtual memory allocators for many
simpler system, I know how systems work behind the scenes, and I
really don't want to start writing something like that: I'm just
asking how to deal with the two specific environments above while
masking the differences between the two and maintaining a standard
interface. I know that it's hard and that someone already did it: I'm
asking *how* he did that (and maybe where!) :-)
|
|
0
|
|
|
|
Reply
|
fmassei (45)
|
12/31/2011 5:52:36 PM
|
|
FtM wrote:
....
> I already use mmap() for persistent b-trees, but does it make sense to
> use it without an underlying physical file?
....
RTFM: MAP_ANONYMOUS
However, the number of mmaps per process is limited on linux.
JK
|
|
0
|
|
|
|
Reply
|
klammerr (8)
|
12/31/2011 6:01:03 PM
|
|
On 2011-12-31, Johann Klammer <klammerr@NOSPAM.aon.at> wrote:
> FtM wrote:
> ...
>> I already use mmap() for persistent b-trees, but does it make sense to
>> use it without an underlying physical file?
> ...
> RTFM: MAP_ANONYMOUS
>
> However, the number of mmaps per process is limited on linux.
That is true, but now for some qualifying cluesticks:
- malloc uses mmap when sbrk runs into a collision, and also for
thread arenas and large allocations. So you don't avoid mmap by using
malloc (though you avoid making lots of little mmap requests).
- brk/sbrk are based on mmmap! so there is no escaping from mmap that way.
So if you make lots of little page-sized brk increments, can you hit
the limit? It would seems to, but ...
- ... luckily, mmap will coalesce adjacent anonymous mappings which have
compatible flags so you will only run into the limit if you make a large
number of maps which do not touch or that do touch but have incompatible
properties.
- The default limit is only 65536 but in the light of the above, it's
maybe not so terrible.
- If you still have a problem, you can easily tweak the maximum count. become
root and then:
sysctl -w vm.max_map_count=<your-desired-number>
better yet, record it in /etc/sysctl.conf and run sysctl -p.
|
|
0
|
|
|
|
Reply
|
kaz15 (1129)
|
12/31/2011 6:31:44 PM
|
|
On 12/31/2011 12:52 PM, FtM wrote:
> On Dec 31, 6:24 pm, Kaz Kylheku<k...@kylheku.com> wrote:
>> On 2011-12-31, FtM<fmas...@gmail.com> wrote:
>>
>>> I would like to implement my custom allocator for the *nix and windows
>>> systems too (of course loosing some portability for performance gain),
>>> and maybe implement something like the C++ allocators to improve the
>>> memory management of the different algorithms (like rare large
>>> allocations on vectors / frequent small allocations on lists), but
>>
>> What makes you think that your malloc doesn't already handle
>> such cases?
>>
>
> Simply because it can't know if I'm going to do 200 little allocations
> or 1 big one. Even if it has the most neat heuristic ever, the libc
> (or the kernel) exposing a simple specific function/syscall would do
> its job way better. Part of the question was about this hypothetical
> function.
Why do you think the kernel predicts your program's future behavior
better than malloc() does?
> I already said that I don't want to "beat malloc", just to find a
> (probably platform-specific) way to facilitate its job.
Both the kernel and malloc() have a hard time predicting your
program's behavior, but I'll risk a prediction for the New Year:
You will not invent a "general low-level" replacement that
outperforms malloc() et al.
> [...] I'm just
> asking how to deal with the two specific environments above while
> masking the differences between the two and maintaining a standard
> interface.
Easy: Use malloc() et al., right out of the box. Platform
specifics are already masked, system-specific dodges and tricks are
already employed, code is already tested and debugged to a degree
that you cannot possibly replicate.
There! Wasn't that easy?
--
Eric Sosman
esosman@ieee-dot-org.invalid
|
|
0
|
|
|
|
Reply
|
esosman2 (2945)
|
12/31/2011 7:30:19 PM
|
|
On 31.12.2011 18:52, FtM wrote:
> On Dec 31, 6:24 pm, Kaz Kylheku <k...@kylheku.com> wrote:
>> On 2011-12-31, FtM <fmas...@gmail.com> wrote:
>>
>>> I would like to implement my custom allocator for the *nix and windows
>>> systems too (of course loosing some portability for performance gain),
>>> and maybe implement something like the C++ allocators to improve the
>>> memory management of the different algorithms (like rare large
>>> allocations on vectors / frequent small allocations on lists), but
>>
>> What makes you think that your malloc doesn't already handle
>> such cases?
>>
>
> Simply because it can't know if I'm going to do 200 little allocations
> or 1 big one. Even if it has the most neat heuristic ever, the libc
> (or the kernel) exposing a simple specific function/syscall would do
> its job way better. Part of the question was about this hypothetical
> function.
>
malloc() is already optimized to exactly those cases, as these are the
by far most common ones: Either you want a storage area, the size of
which is determined at run time, then you will probably have one big
allocation, or you are using fancy data structures other than dynamic
arrays (that includes, but is not limited to, linked lists, trees and
hashes), then chances are you are allocating each node separately so you
get your n+1 little allocations.
Having seen the glibc, eglibc, dietlibc and uclibc allocators, I already
saw that those are the preferred modes of operation. The dietlibc
allocator is especially easy to describe:
For one, it _never_ uses brk(). That syscall is just plain old and does
nothing mmap() couldn't do better. Then it has linked lists prepared for
objects of sizes of powers of two, starting at 16 and stopping at 2048.
It rounds the sum of the object header size and the object size to the
next power of two (using some nifty eye-watering bit-mangling) and puts
it in such a linked list. If the list is too small, it is expanded using
mmap() or mremap().
If the object is bigger than 2048 bytes it gets it's own page(s)!
>> Say, have you profiled the code to see what fraction of the time is actually
>> spent executing the allocator code?
>>
>> Can you answer the question: how much faster will the program be if
>> a magic allocator is substituted which requires zero cycles to execute?
>>
>
> Let's see... So why do the C++ STLs (or boost, for another example) do
> exactly what I'm asking here? :-)
>
Hmmm... I just had a look at the libstdc++ deployed with g++ 4.5.0 and I
saw that at least there, operator new calls malloc(). I also looked at
STLport and saw the same.
Also, you didn't answer the questions. Those are actually pretty good ones!
>>> I've some questions:
>>> - In these systems, to have decent performances, I was thinking to
>>> allocate page-sized (and page-aligned) chunks to have small
>>> allocations in the same page, but how is it possible to request some
>>> memory with this characteristics? Will it help the OS memory manager?
>>
>> Not knowing these basics is an indicator that you need to study more if you
>> want to write memory allocators that beat malloc.
>
> I already said that I don't want to "beat malloc", just to find a
> (probably platform-specific) way to facilitate its job.
> Ciao!
>
You are not making sense. You don't want to beat malloc but you still
want to write a malloc replacement? You'd willingly choose the inferior
implementation?
> P.S. I'm sorry if I'm looking so.. childish(?).. to you but, believe
> me, I've written some physical/virtual memory allocators for many
> simpler system,
That's an entirely different can of worms. Believe me, the memory
management that just uses an OS is far easier than the one the OS has to
maintain.
> I know how systems work behind the scenes, and I
> really don't want to start writing something like that: I'm just
> asking how to deal with the two specific environments above while
> masking the differences between the two and maintaining a standard
> interface. I know that it's hard and that someone already did it: I'm
> asking *how* he did that (and maybe where!) :-)
You could look at the libc source codes. glibc, eglibc, dietlibc and
uclibc are a good place to start for Linux. For Windows I'd recommend
cygwin.
If you are interested in the syscalls that allocate memory: For *nix
that's mmap() with MAP_ANONYMOUS or, failing that, /dev/zero. Failing
_that_, you could try to map a newly created and truncated file. Try to
stay away from brk(), it's marked obsolete.
For Windows, have a look at HeapAlloc(), VirtualAlloc() and
LocalAlloc(). BTW: The MSVC implementation somehow manages to make
access to _one_ byte after the allocated buffer fail with a crash. Good
job, as it makes finding off-by-one errors easier.
But I still don't see your point. malloc() is meant to be portable! The
only thing you might want to tweak against is malloc(0), as that's
unspecified. Several allocators return NULL on that, but several others
do not. They just present you an object you can't even access the first
byte of.
HTH,
Markus
|
|
0
|
|
|
|
Reply
|
nullplan (53)
|
1/1/2012 11:21:51 AM
|
|
"Markus Wichmann" <nullplan@gmx.net> wrote in message
news:f7s5t8-jo5.ln1@voyager.wichi.de.vu...
> On 31.12.2011 18:52, FtM wrote:
>> Simply because it can't know if I'm going to do 200 little allocations
>> or 1 big one.
> For one, it _never_ uses brk(). That syscall is just plain old and does
> nothing mmap() couldn't do better. Then it has linked lists prepared for
> objects of sizes of powers of two, starting at 16 and stopping at 2048.
That exactly what I do in my own allocator! Except I stop at 1024. Although
I probably manage them very differently, for example once a small block size
is allocated, say 64 bytes, it always stays that size.
And one thing about malloc() I don't like, are the overheads in having to
remember the block sizes; if I allocate 16 bytes, malloc() seems to reserve
either 20 or, more often, 24 bytes on the few compilers I've tried. That's a
50% overhead!
Also if you're being sensible and trying to keep to power-of-two
allocations, this 4- or 8-byte overhead is going to screw that up.
My allocators have never stored a block size, and it has never really been a
problem. Either you will know the size (a struct for example), or you need
to keep track of it yourself, in which case it is very handy when you have
an allocation that can grow in never having to keep running back to
realloc().
As an illustration, take this simple example of a string growing from one to
ten million characters:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main(void)
{
char *p;
int i,len;
p=malloc(1);
*p=0;
len=0;
for (i=1; i<=10000000; ++i) {
++len;
p=realloc(p,len+1);
p[len-1]='*';
p[len]=0;
// puts(p);
}
printf("Length= %d\n",strlen(p));
}
On three C compilers, this took about 12 seconds (on DMC, any length of
string resulted in a runtime of 0.4 seconds; a bit odd).
Using my strategy, of avoiding any calls to malloc or realloc unless
absolutely necessary, and using that inside an interpreter, this bit of
script took about 0.1 seconds:
string a
a:=""
to 10 million do
a+:="*"
od
println "Length=",a.len
Of course, real C code wouldn't use such a naive strategy; it would also
seek to minimise malloc()/realloc() calls, but that's also an admission that
these calls have some overheads which it would be desirable to avoid. Hence
I can understand the OP's attempt to experiment with his own versions.
--
Bartc
|
|
0
|
|
|
|
Reply
|
bc (2211)
|
1/1/2012 12:54:55 PM
|
|
On 1/1/2012 7:54 AM, BartC wrote:
> "Markus Wichmann" <nullplan@gmx.net> wrote in message
> news:f7s5t8-jo5.ln1@voyager.wichi.de.vu...
>> On 31.12.2011 18:52, FtM wrote:
>
>>> Simply because it can't know if I'm going to do 200 little allocations
>>> or 1 big one.
>
>> For one, it _never_ uses brk(). That syscall is just plain old and does
>> nothing mmap() couldn't do better. Then it has linked lists prepared for
>> objects of sizes of powers of two, starting at 16 and stopping at 2048.
>
> That exactly what I do in my own allocator! Except I stop at 1024.
> Although I probably manage them very differently, for example once a
> small block size is allocated, say 64 bytes, it always stays that size.
Sounds like a recipe for internal fragmentation, but YMMV.
> And one thing about malloc() I don't like, are the overheads in having
> to remember the block sizes; if I allocate 16 bytes, malloc() seems to
> reserve either 20 or, more often, 24 bytes on the few compilers I've
> tried. That's a 50% overhead!
It's only significant if you have many such small allocations.
Even if so, a malloc() implementation can find other ways to remember
the size than by storing it explicitly. For example, malloc() might
set aside regions of memory dedicated to "small" allocations: If an
pointer aims at an address anyhere in [A1,A1+N1), [A2,A2+N2), ... it
could simply be "known" to point at a 32-byte block. (The test can't
be done efficiently in portable C, but a malloc() implementation need
not be portable, nor written in C.)
> Also if you're being sensible and trying to keep to power-of-two
> allocations, this 4- or 8-byte overhead is going to screw that up.
What's "sensible" about allocating in powers of two? Why should
you allocate 128 bytes to hold a line of input, rather than 100 or
80 or 250? Programmers are numerologically superstitious.
> My allocators have never stored a block size, and it has never really
> been a problem. Either you will know the size (a struct for example), or
> you need to keep track of it yourself, in which case it is very handy
> when you have an allocation that can grow in never having to keep
> running back to realloc().
If your allocators never store block sizes, it follows that free()
cannot recycle a released block for re-use by a subsequent malloc(),
because it does not know whether the freed block is big enough to
satisfy the malloc() request. (Re-computing the size by examining the
addresses of nearby busy and free blocks counts as "storing the size"
in my book, just encoding it deviously.)
An allocator with a different API -- one whose free()-substitute
takes the block size as a parameter -- could recycle blocks. But now
you're no longer talking about the O.P.'s drop-in replacement for the
API as it stands.
> As an illustration, take this simple example of a string growing from
> one to ten million characters:
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
>
> int main(void)
> {
> char *p;
> int i,len;
>
> p=malloc(1);
> *p=0;
> len=0;
>
> for (i=1; i<=10000000; ++i) {
> ++len;
> p=realloc(p,len+1);
Side note: Never call realloc() this way in real code, because if
it returns NULL you'll have lost your only pointer to the original
block; you won't even be able to free() it any more.
> p[len-1]='*';
> p[len]=0;
> // puts(p);
> }
With an allocator that does not store the block size, this will
necessarily take a long time, for several reasons:
- Since realloc() does not know how long the block at `p' is, it
cannot tell whether there's any slop at the end that the block
could grow into without moving. Therefore, every realloc() call
must allocate a new block and copy the old contents into it.
- Since realloc() does not know how long the old block is, it
must copy `len+1' bytes from the old block to the new, even
though that's more than the old block holds. (We'll assume a
non-portable implementation can get away with copying the non-
existent bytes.) Total amount copied: 2+...+10000001 bytes;
that's fifty terabytes.
- Since realloc() does not know how long the old block was, it
cannot re-use that memory to satisfy subsequent requests. Thus,
every call allocates a brand-new block and makes the old block
unavailable. Total memory allocated: 1+2+...+10000001 bytes.
50TB to store a 10MB string is an overhead of fifty million
percent -- and you groused about a mere fifty???!
--
Eric Sosman
esosman@ieee-dot-org.invalid
|
|
0
|
|
|
|
Reply
|
esosman2 (2945)
|
1/1/2012 2:03:21 PM
|
|
On 1/1/2012 7:54 AM, BartC wrote:
> "Markus Wichmann" <nullplan@gmx.net> wrote in message
> news:f7s5t8-jo5.ln1@voyager.wichi.de.vu...
>> On 31.12.2011 18:52, FtM wrote:
>
>>> Simply because it can't know if I'm going to do 200 little allocations
>>> or 1 big one.
>
>> For one, it _never_ uses brk(). That syscall is just plain old and does
>> nothing mmap() couldn't do better. Then it has linked lists prepared for
>> objects of sizes of powers of two, starting at 16 and stopping at 2048.
>
> That exactly what I do in my own allocator! Except I stop at 1024.
> Although I probably manage them very differently, for example once a
> small block size is allocated, say 64 bytes, it always stays that size.
Sounds like a recipe for internal fragmentation, but YMMV.
> And one thing about malloc() I don't like, are the overheads in having
> to remember the block sizes; if I allocate 16 bytes, malloc() seems to
> reserve either 20 or, more often, 24 bytes on the few compilers I've
> tried. That's a 50% overhead!
It's only significant if you have many such small allocations.
Even if so, a malloc() implementation can find other ways to remember
the size than by storing it explicitly. For example, malloc() might
set aside regions of memory dedicated to "small" allocations: If an
pointer aims at an address anyhere in [A1,A1+N1), [A2,A2+N2), ... it
could simply be "known" to point at a 32-byte block. (The test can't
be done efficiently in portable C, but a malloc() implementation need
not be portable, nor written in C.)
> Also if you're being sensible and trying to keep to power-of-two
> allocations, this 4- or 8-byte overhead is going to screw that up.
What's "sensible" about allocating in powers of two? Why should
you allocate 128 bytes to hold a line of input, rather than 100 or
80 or 250? Programmers are numerologically superstitious.
> My allocators have never stored a block size, and it has never really
> been a problem. Either you will know the size (a struct for example), or
> you need to keep track of it yourself, in which case it is very handy
> when you have an allocation that can grow in never having to keep
> running back to realloc().
If your allocators never store block sizes, it follows that free()
cannot recycle a released block for re-use by a subsequent malloc(),
because it does not know whether the freed block is big enough to
satisfy the malloc() request. (Re-computing the size by examining the
addresses of nearby busy and free blocks counts as "storing the size"
in my book, just encoding it deviously.)
An allocator with a different API -- one whose free()-substitute
takes the block size as a parameter -- could recycle blocks. But now
you're no longer talking about the O.P.'s drop-in replacement for the
API as it stands.
> As an illustration, take this simple example of a string growing from
> one to ten million characters:
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
>
> int main(void)
> {
> char *p;
> int i,len;
>
> p=malloc(1);
> *p=0;
> len=0;
>
> for (i=1; i<=10000000; ++i) {
> ++len;
> p=realloc(p,len+1);
Side note: Never call realloc() this way in real code, because if
it returns NULL you'll have lost your only pointer to the original
block; you won't even be able to free() it any more.
> p[len-1]='*';
> p[len]=0;
> // puts(p);
> }
With an allocator that does not store the block size, this will
necessarily take a long time, for several reasons:
- Since realloc() does not know how long the block at `p' is, it
cannot tell whether there's any slop at the end that the block
could grow into without moving. Therefore, every realloc() call
must allocate a new block and copy the old contents into it.
- Since realloc() does not know how long the old block is, it
must copy `len+1' bytes from the old block to the new, even
though that's more than the old block holds. (We'll assume a
non-portable implementation can get away with copying the non-
existent bytes.) Total amount copied: 2+...+10000001 bytes;
that's fifty terabytes.
- Since realloc() does not know how long the old block was, it
cannot re-use that memory to satisfy subsequent requests. Thus,
every call allocates a brand-new block and makes the old block
unavailable. Total memory allocated: 1+2+...+10000001 bytes.
50TB to store a 10MB string is an overhead of fifty million
percent -- and you groused about a mere fifty???!
--
Eric Sosman
esosman@ieee-dot-org.invalid
|
|
0
|
|
|
|
Reply
|
esosman2 (2945)
|
1/1/2012 2:08:29 PM
|
|
"FtM" <fmassei@gmail.com> ha scritto nel messaggio
news:911c685c-3bf9-4079-bf61-66342d8a8c2c@u6g2000vbc.googlegroups.com...
> [xpost comp.lang.c, comp.unix.programmer, comp.os.ms-
> windows.programmer.win32]
> Hello NG,
> I have a very specific question, but first of all I describe you what
> I'm facing as I'm open to alternative general solutions..
> I'm refactoring some old C code that deals with very basilar data
> structures (like lists, priority queues, hashtables and so on) that
> pretends to be very portable. Regarding the memory management, I have
> an internal wrapper that, on *nix and windows systems, maps the malloc/
> calloc/realloc/free functions one-to-one to the default ones, but it's
> possibile to switch to another custom allocator that works pretty much
> as a standard allocator (a big chunk of memory splitted/joined, tought
> for embedded systems).
> I would like to implement my custom allocator for the *nix and windows
> systems too (of course loosing some portability for performance gain),
> and maybe implement something like the C++ allocators to improve the
> memory management of the different algorithms (like rare large
> allocations on vectors / frequent small allocations on lists), but
> I've some questions:
i'm just an hobby programmer... my vew is
the point of start is the the malloc-free functions in
the K&R2 book; and think about them
|
|
0
|
|
|
|
Reply
|
io_x
|
1/1/2012 4:02:53 PM
|
|
"Eric Sosman" <esosman@ieee-dot-org.invalid> wrote in message
news:jdpp3r$8h3$1@dont-email.me...
> On 1/1/2012 7:54 AM, BartC wrote:
>> That exactly what I do in my own allocator! Except I stop at 1024.
>> Although I probably manage them very differently, for example once a
>> small block size is allocated, say 64 bytes, it always stays that size.
>
> Sounds like a recipe for internal fragmentation, but YMMV.
That's what I once thought. I've used a similar scheme for years (blocks
could get smaller but not bigger), and originally prepared a defragmentation
algorithm, but I never needed it. (This was for managing many small
allocations in a script language.) It was self-sustaining.
>> Also if you're being sensible and trying to keep to power-of-two
>> allocations, this 4- or 8-byte overhead is going to screw that up.
>
> What's "sensible" about allocating in powers of two? Why should
> you allocate 128 bytes to hold a line of input, rather than 100 or
> 80 or 250? Programmers are numerologically superstitious.
More like habit. But, if you allocate 4096 bytes thinking it would fit
neatly into one page of virtual memory, you will really need 4100 or 4104;
not quite as tidy. (This is an issue with my allocator, which does request
power-of-two sizes, that is implemented on top of malloc()).
>> My allocators have never stored a block size, and it has never really
>> been a problem.
> An allocator with a different API -- one whose free()-substitute
> takes the block size as a parameter -- could recycle blocks. But now
> you're no longer talking about the O.P.'s drop-in replacement for the
> API as it stands.
Yes, that's what I used; free() takes an extra parameter.
>> p=realloc(p,len+1);
> With an allocator that does not store the block size, this will
> necessarily take a long time, for several reasons:
> - Since realloc() does not know how long the block at `p' is, it
<snip>
You're now talking about a hypothetical realloc() that really has no idea
what the original block size was?
That wouldn't work anyway, as it would have no idea whether the block is
expanding or contracting and there would be other issues such as those you
mention.
Such a function *has* to be able to determine the original size, but storing
it as malloc/realloc presumably does, deducing it, or simply being told.
(Actually my scheme doesn't even have a realloc()-type function; but with
block-sizes known, it would be trivial to write)
--
Bartc
|
|
0
|
|
|
|
Reply
|
bc (2211)
|
1/1/2012 4:23:02 PM
|
|
On Jan 2, 1:54 pm, "BartC" <b...@freeuk.com> wrote:
> "Markus Wichmann" <nullp...@gmx.net> wrote in message
>
> news:f7s5t8-jo5.ln1@voyager.wichi.de.vu...
>
> > On 31.12.2011 18:52, FtM wrote:
> >> Simply because it can't know if I'm going to do 200 little allocations
> >> or 1 big one.
> > For one, it _never_ uses brk(). That syscall is just plain old and does
> > nothing mmap() couldn't do better. Then it has linked lists prepared for
> > objects of sizes of powers of two, starting at 16 and stopping at 2048.
>
> That exactly what I do in my own allocator! Except I stop at 1024. Although
> I probably manage them very differently, for example once a small block size
> is allocated, say 64 bytes, it always stays that size.
>
Ok, thank you all for the suggestions (that basically are "use malloc
and don't bother"). I know that the allocation functions are great as
they are: I'm using them right now and I have no problems at all! I
was only trying to expand the code from the embedded part that
contains some code that I already have, that's it. More on, I already
have some handlers that help me debugging the algorithms..
Anyway, regarding the allocator descriptions that Markus and BartC
gave, that's what I was thinking and partially already doing, let's
see if this makes more sense to you:
- first of all, I'm dealing with the memory allocation only *inside*
my library, so I don't need to build a general purpose allocator.
- inside, I'll have three different "objects": the first one is for
little allocations, the second one for the others, the third one for
the medium/big allocations that possibly will need to realloc(). The
third allocator follows the "normal" strategy (and could simply use
the system calls for that matter), while the first and the second one
could simply have some preallocated blocks and a vector to mark if
they are used or not, skipping the natural overhead needed by the
general purpose allocator.
What do you think about this approach (despite the fact that in the
end it probably gives no noticeable advantages)?
Ciao!
|
|
0
|
|
|
|
Reply
|
fmassei (45)
|
1/1/2012 6:08:56 PM
|
|
In article <911c685c-3bf9-4079-bf61-66342d8a8c2c@u6g2000vbc.googlegroups.com>,
FtM <fmassei@gmail.com> wrote:
>- Does all of this make any sense? :-)
IMHE this come up when
a) you are not pleased with the default functions' time or space
overhead
b) you like to control things
c) you want the 'memory' in memory-mapped files
d) you want to use shared memory
|
|
0
|
|
|
|
Reply
|
jgk1 (176)
|
1/1/2012 9:21:03 PM
|
|
In article <jdpp3r$8h3$1@dont-email.me>,
Eric Sosman <esosman@ieee-dot-org.invalid> wrote:
>What's "sensible" about allocating in powers of two?
A) VM pages are powers of two, so if you have non-two objects you waste
space or take more VM pages for each object than you would like
B) cache lines are powers of two, so if you have non-two objects you
waste space or take more cache lines for each object than you would like
>Programmers are numerologically superstitious.
i am numerologically superstitious, but you can also file this into the
'maybe someone has figured out something that we don't understand'
category
|
|
0
|
|
|
|
Reply
|
jgk1 (176)
|
1/1/2012 9:41:03 PM
|
|
"BartC" <bc@freeuk.com> wrote in message
news:LPYLq.355$EO5.173@newsfe26.ams2...
[growing a string from empty, to 10 million chars]
> char *p;
> int i,len;
>
> p=malloc(1);
> *p=0;
> len=0;
>
> for (i=1; i<=10000000; ++i) {
> ++len;
> p=realloc(p,len+1);
> p[len-1]='*';
> p[len]=0;
> }
> On three C compilers, this took about 12 seconds (on DMC, any length of
> string resulted in a runtime of 0.4 seconds; a bit odd).
Measuring the number of times the value of p changes, with the slow
compilers, this was respectively 2335, 2336, and 2337. With DMC, it was just
46.
Clearly DMC overallocates more than the others (growing allocations by
sqrt(2) apparently).
(My own code, which just uses doubling, will call a memory allocator only
about 20 times, so doesn't have the overheads of calling a function like
realloc() ten million times, hence it was quite quick even though it was
interpreted (but heavily optimised too..))
So with an application with certain memory usage patterns, one
implementation can dramatically outperform another. Another reason, even if
not writing your own allocators, at least to not just use what's provided
without question.
--
Bartc
|
|
0
|
|
|
|
Reply
|
bc (2211)
|
1/1/2012 11:26:35 PM
|
|
"FtM" <fmassei@gmail.com> wrote in message
news:433dd1ba-34f5-47d5-a7db-d4dbcda01939@cs7g2000vbb.googlegroups.com...
> - inside, I'll have three different "objects": the first one is for
> little allocations, the second one for the others, the third one for
> the medium/big allocations that possibly will need to realloc(). The
> third allocator follows the "normal" strategy (and could simply use
> the system calls for that matter), while the first and the second one
> could simply have some preallocated blocks and a vector to mark if
> they are used or not, skipping the natural overhead needed by the
> general purpose allocator.
>
> What do you think about this approach (despite the fact that in the
> end it probably gives no noticeable advantages)?
I don't think there's enough information here; what sort of sizes are the
three classes of allocator handling?
Will any of them always be a fixed size? Will they need deallocating? (As
sometimes this isn't necessary, or the pool of memory from which they're
drawn will be freed as one block.) What strategy will be used to grow the
heap from which the the blocks will come; in fact does this heap need to
grow? Where does the memory come from in the first place? Etc..
--
Bartc
|
|
0
|
|
|
|
Reply
|
bc (2211)
|
1/1/2012 11:34:43 PM
|
|
"BartC" <bc@freeuk.com> writes:
> "BartC" <bc@freeuk.com> wrote in message
> news:LPYLq.355$EO5.173@newsfe26.ams2...
>
> [growing a string from empty, to 10 million chars]
>> char *p;
>> int i,len;
>>
>> p=malloc(1);
>> *p=0;
>> len=0;
>>
>> for (i=1; i<=10000000; ++i) {
>> ++len;
>> p=realloc(p,len+1);
>> p[len-1]='*';
>> p[len]=0;
>> }
>
>> On three C compilers, this took about 12 seconds (on DMC, any length
>> of string resulted in a runtime of 0.4 seconds; a bit odd).
On my system is takes, .15 seconds
> Measuring the number of times the value of p changes, with the slow
> compilers, this was respectively 2335, 2336, and 2337. With DMC, it
> was just 46.
and changes p 156 times. (Off-topic extra: a C++ program using s += '*'
on a std::string takes 0.07s on the same machine.)
> Clearly DMC overallocates more than the others (growing allocations by
> sqrt(2) apparently).
I don't see how you can conclude that. The allocated storage might be
able to grown (i.e. it is not initially over-allocated) without changing
the address.
<snip>
--
Ben.
|
|
0
|
|
|
|
Reply
|
ben.usenet (6515)
|
1/1/2012 11:54:42 PM
|
|
"Ben Bacarisse" <ben.usenet@bsb.me.uk> wrote in message
news:0.0749b13cccb174981f23.20120101235442GMT.87ipkv7zct.fsf@bsb.me.uk...
> "BartC" <bc@freeuk.com> writes:
>>> for (i=1; i<=10000000; ++i) {
>>> p=realloc(p,len+1);
>>> On three C compilers, this took about 12 seconds (on DMC, any length
>>> of string resulted in a runtime of 0.4 seconds; a bit odd).
>
> On my system is takes, .15 seconds
>> Measuring the number of times the value of p changes, with the slow
>> compilers, this was respectively 2335, 2336, and 2337. With DMC, it
>> was just 46.
>
> and changes p 156 times.
Your implementation must be pretty good then, or maybe it's optimised a few
realloc() calls out.
Running the test by calling realloc() with the same argument (so no new
allocations are needed) still takes about 1 second on those three slow
compilers (gcc, lccwin-32 and Pelles C). Seems like two of them have ripped
off the other's inefficient realloc() implementation!
> (Off-topic extra: a C++ program using s += '*'
> on a std::string takes 0.07s on the same machine.)
Sounds about right when calls to realloc() are minimised. The question I
suppose is how much buffering/allocation logic should be done outside of
malloc()/realloc(), or do you just let those functions do most of the work;
you can afford to do that with fast implementations like yours.
>> Clearly DMC overallocates more than the others (growing allocations by
>> sqrt(2) apparently).
>
> I don't see how you can conclude that.
Just a guess. The 46th root of ten million is around 1.42.
> The allocated storage might be
> able to grown (i.e. it is not initially over-allocated) without changing
> the address.
That's true; with nothing else going on memory-wise, even an over-allocated
block could be further expanded at the same location, depending on how
allocator works.
--
Bartc
|
|
0
|
|
|
|
Reply
|
bc (2211)
|
1/2/2012 12:32:57 AM
|
|
On 2012-01-01, BartC <bc@freeuk.com> wrote:
> for (i=1; i<=10000000; ++i) {
> ++len;
> p=realloc(p,len+1);
Growing linearly, and, by one character, is silly.
You have O(N*N) behavior here.
If you grow exponentially (e.g. doubling the buffer), you get amortized O(N)
behavior.
I.e. this is a poor algorithm. It's the O(N*N) time that's killing this,
not malloc.
|
|
0
|
|
|
|
Reply
|
kaz15 (1129)
|
1/2/2012 2:46:58 AM
|
|
FtM=E6=96=BC 2011=E5=B9=B412=E6=9C=8831=E6=97=A5=E6=98=9F=E6=9C=9F=E5=85=AD=
UTC+8=E4=B8=8B=E5=8D=8810=E6=99=8256=E5=88=8653=E7=A7=92=E5=AF=AB=E9=81=93=
=EF=BC=9A
> [xpost comp.lang.c, comp.unix.programmer, comp.os.ms-
> windows.programmer.win32]
> Hello NG,
> I have a very specific question, but first of all I describe you what
> I'm facing as I'm open to alternative general solutions..
> I'm refactoring some old C code that deals with very basilar data
> structures (like lists, priority queues, hashtables and so on) that
> pretends to be very portable. Regarding the memory management, I have
> an internal wrapper that, on *nix and windows systems, maps the malloc/
> calloc/realloc/free functions one-to-one to the default ones, but it's
> possibile to switch to another custom allocator that works pretty much
> as a standard allocator (a big chunk of memory splitted/joined, tought
> for embedded systems).
> I would like to implement my custom allocator for the *nix and windows
> systems too (of course loosing some portability for performance gain),
> and maybe implement something like the C++ allocators to improve the
> memory management of the different algorithms (like rare large
> allocations on vectors / frequent small allocations on lists), but
> I've some questions:
> - In these systems, to have decent performances, I was thinking to
> allocate page-sized (and page-aligned) chunks to have small
> allocations in the same page, but how is it possible to request some
> memory with this characteristics? Will it help the OS memory manager?
> - Does all of this make any sense? :-)
> Thanks for any suggestion!
> Ciao!
>=20
> P.S. I don't really have the necessity of a "performance boost", but
> I'm dealing with a general, low-level and already used library, and as
> long as I'm refactoring the code I'd like to do it the better possible
> way ;-)
Kind of interesting here. I thought you were describing a customized=20
heap manager that can support page swapping to some file in the hard disk.=
=20
A graph of nodes of buffers is required.=20
If you don't have to page out to the secondary storage system, then=20
the malloc, memcpy, memset and free in the GCC source code are available. =
=20
|
|
0
|
|
|
|
Reply
|
dihedral88888 (786)
|
1/2/2012 5:01:10 AM
|
|
On Jan 2, 12:34=A0am, "BartC" <b...@freeuk.com> wrote:
> "FtM" <fmas...@gmail.com> wrote in message
>
> news:433dd1ba-34f5-47d5-a7db-d4dbcda01939@cs7g2000vbb.googlegroups.com...
>
> > - inside, I'll have three different "objects": the first one is for
> > little allocations, the second one for the others, the third one for
> > the medium/big allocations that possibly will need to realloc(). The
> > third allocator follows the "normal" strategy (and could simply use
> > the system calls for that matter), while the first and the second one
> > could simply have some preallocated blocks and a vector to mark if
> > they are used or not, skipping the natural overhead needed by the
> > general purpose allocator.
>
> > What do you think about this approach (despite the fact that in the
> > end it probably gives no noticeable advantages)?
>
> I don't think there's enough information here; what sort of sizes are the
> three classes of allocator handling?
>
16<=3Dx<=3D128 (little structures, one-shot char buffers, etc.)
128<x<=3D2048 (array of pointers, bigger structures, etc.)
x>2048 (realloc()atable memory)
> Will any of them always be a fixed size?
Yes, the first two types. (Of course, if a realloc() is requested for
a first or second type memory it should be switched to the third one,
but this is not likely to happen as the user (me) is aware of that
behaviour!). I didn't write it, but obviously I will have the
possibility to call the objects directly, choosing the preferred
allocation strategy.
> Will they need deallocating?
Sure they will. The first two by marking on their own map-vector, the
third by the usual way of joining adiacent free blocks.
> What strategy will be used to grow the
> heap from which the the blocks will come; in fact does this heap need to
> grow?
Each "object" will have some pages to start with; when one of them
fills its pages completely will ask the system for some more
(increasing its allocation vector on the fly too). I don't know the
"some" quantity, but I think no one do before runtime.. Making it a
compile-(or maybe run-)time parameter will help monitoring and testing
the overall performances, but this is just an idea.
Naturally, speaking of the first two "objects", the space of the
allocation vector should be taken into account, but as it is a bit-
masked vector it will occupy a very little space.
> Where does the memory come from in the first place?
mmap (thus even leaving the possibility to swap to persistent storage
or share the memory between processes).
Ciao!
|
|
0
|
|
|
|
Reply
|
fmassei (45)
|
1/2/2012 10:18:45 AM
|
|
FtM=E6=96=BC 2012=E5=B9=B41=E6=9C=882=E6=97=A5=E6=98=9F=E6=9C=9F=E4=B8=80UT=
C+8=E4=B8=8B=E5=8D=886=E6=99=8218=E5=88=8645=E7=A7=92=E5=AF=AB=E9=81=93=EF=
=BC=9A
> On Jan 2, 12:34=C2=A0am, "BartC" <b....@freeuk.com> wrote:
> > "FtM" <fma...@gmail.com> wrote in message
> >
> > news:433dd1ba-34f5-47d5-a7db-d4dbcda01939@cs7g2000vbb.googlegroups.com.=
...
> >
> > > - inside, I'll have three different "objects": the first one is for
> > > little allocations, the second one for the others, the third one for
> > > the medium/big allocations that possibly will need to realloc(). The
> > > third allocator follows the "normal" strategy (and could simply use
> > > the system calls for that matter), while the first and the second one
> > > could simply have some preallocated blocks and a vector to mark if
> > > they are used or not, skipping the natural overhead needed by the
> > > general purpose allocator.
> >
> > > What do you think about this approach (despite the fact that in the
> > > end it probably gives no noticeable advantages)?
> >
> > I don't think there's enough information here; what sort of sizes are t=
he
> > three classes of allocator handling?
> >
>=20
> 16<=3Dx<=3D128 (little structures, one-shot char buffers, etc.)
> 128<x<=3D2048 (array of pointers, bigger structures, etc.)
> x>2048 (realloc()atable memory)
>=20
>=20
> > Will any of them always be a fixed size?
>=20
> Yes, the first two types. (Of course, if a realloc() is requested for
> a first or second type memory it should be switched to the third one,
> but this is not likely to happen as the user (me) is aware of that
> behaviour!). I didn't write it, but obviously I will have the
> possibility to call the objects directly, choosing the preferred
> allocation strategy.
>=20
> > Will they need deallocating?
>=20
> Sure they will. The first two by marking on their own map-vector, the
> third by the usual way of joining adiacent free blocks.
>=20
> > What strategy will be used to grow the
> > heap from which the the blocks will come; in fact does this heap need t=
o
> > grow?
>=20
> Each "object" will have some pages to start with; when one of them
> fills its pages completely will ask the system for some more
> (increasing its allocation vector on the fly too). I don't know the
> "some" quantity, but I think no one do before runtime.. Making it a
> compile-(or maybe run-)time parameter will help monitoring and testing
> the overall performances, but this is just an idea.
> Naturally, speaking of the first two "objects", the space of the
> allocation vector should be taken into account, but as it is a bit-
> masked vector it will occupy a very little space.
>=20
> > Where does the memory come from in the first place?
>=20
> mmap (thus even leaving the possibility to swap to persistent storage
> or share the memory between processes).
This requires customized data compression modules not just the heap manager=
..=20
If a program only uses 20 to 30 percent of the total heap space in a typica=
l=20
platform, then a lot people won't bother to write a new heap manager at all=
.. =20
>=20
> Ciao!
|
|
0
|
|
|
|
Reply
|
dihedral88888 (786)
|
1/2/2012 10:34:09 AM
|
|
On Jan 2, 11:34=C2=A0am, 88888 Dihedral <dihedral88...@googlemail.com>
wrote:
> FtM=E6=96=BC 2012=E5=B9=B41=E6=9C=882=E6=97=A5=E6=98=9F=E6=9C=9F=E4=B8=80=
UTC+8=E4=B8=8B=E5=8D=886=E6=99=8218=E5=88=8645=E7=A7=92=E5=AF=AB=E9=81=93=
=EF=BC=9A
>
> > > Where does the memory come from in the first place?
>
> > mmap (thus even leaving the possibility to swap to persistent storage
> > or share the memory between processes).
>
> This requires customized data compression modules not just the heap manag=
er.
>
Maybe. Or maybe more disk space? :-) I don't think it's relevant
here..
> If a program only uses 20 to 30 percent of the total heap space in a typi=
cal
> platform, then a lot people won't bother to write a new heap manager at a=
ll.
>
I don't see your point.. In the OSes we were speaking about there's no
fixed "heap-space", the physical memory is "recycled" based on the
current needs.. Or you were thinking something different?
Ciao!
|
|
0
|
|
|
|
Reply
|
fmassei (45)
|
1/2/2012 10:43:33 AM
|
|
FtM=E6=96=BC 2012=E5=B9=B41=E6=9C=882=E6=97=A5=E6=98=9F=E6=9C=9F=E4=B8=80UT=
C+8=E4=B8=8B=E5=8D=886=E6=99=8243=E5=88=8633=E7=A7=92=E5=AF=AB=E9=81=93=EF=
=BC=9A
> On Jan 2, 11:34=C2=A0am, 88888 Dihedral <dihedr...@googlemail.com>
> wrote:
> > FtM=E6=96=BC 2012=E5=B9=B41=E6=9C=882=E6=97=A5=E6=98=9F=E6=9C=9F=E4=B8=
=80UTC+8=E4=B8=8B=E5=8D=886=E6=99=8218=E5=88=8645=E7=A7=92=E5=AF=AB=E9=81=
=93=EF=BC=9A
> >
> > > > Where does the memory come from in the first place?
> >
> > > mmap (thus even leaving the possibility to swap to persistent storage
> > > or share the memory between processes).
> >
> > This requires customized data compression modules not just the heap man=
ager.
> >
>=20
> Maybe. Or maybe more disk space? :-) I don't think it's relevant
> here..
>=20
> > If a program only uses 20 to 30 percent of the total heap space in a ty=
pical
> > platform, then a lot people won't bother to write a new heap manager at=
all.
> >
>=20
> I don't see your point.. In the OSes we were speaking about there's no
> fixed "heap-space", the physical memory is "recycled" based on the
> current needs.. Or you were thinking something different?
>=20
> Ciao!
I think the heap space in a 32 bit platform is somewhat limited.=20
Of course in the 64 bit OS a lot things will be different.=20
But as long as the hard disk and the dram are both of some finite sizes,=
=20
I just model the heap space to be of a finite size in my programs.=20
I am not interested in trivial programs.=20
|
|
0
|
|
|
|
Reply
|
dihedral88888 (786)
|
1/2/2012 11:11:27 AM
|
|
"FtM" <fmassei@gmail.com> wrote in message
news:8c4b9ed3-8f9f-47ae-a70f-cc803aa6cc79@d9g2000yqg.googlegroups.com...
> On Jan 2, 12:34 am, "BartC" <b...@freeuk.com> wrote:
>> What strategy will be used to grow the
>> heap from which the the blocks will come; in fact does this heap need to
>> grow?
>
> Each "object" will have some pages to start with; when one of them
> fills its pages completely will ask the system for some more
> (increasing its allocation vector on the fly too). I don't know the
> "some" quantity, but I think no one do before runtime.. Making it a
> compile-(or maybe run-)time parameter will help monitoring and testing
> the overall performances, but this is just an idea.
> Naturally, speaking of the first two "objects", the space of the
> allocation vector should be taken into account, but as it is a bit-
> masked vector it will occupy a very little space.
>
>> Where does the memory come from in the first place?
>
> mmap (thus even leaving the possibility to swap to persistent storage
> or share the memory between processes).
It sounds like you are more interesting in playing with the virtual memory
side of things and doing your own paging.
Be aware that the OS tends to be take care of this, and does it rather well
in conjunction with the hardware! (If you look in particular at the memory
management routines on the msdn website, it can do it rather comprehensively
too. I usually take one look then go back to the friendly little malloc,
realloc and free functions of C.)
Also, if you just want to experiment, then possibly malloc() can provide the
initial memory within which you can do your own allocations. Just ask for
the largest block of memory you want to work with, and look for the first
address within it where the bottom 12 bits are zero; that will likely be on
a page boundary. (In a real program, malloc()ed memory may not grow in the
way you might like, so your mmap() method might be needed.)
--
Bartc
|
|
0
|
|
|
|
Reply
|
bc (2211)
|
1/2/2012 11:40:22 AM
|
|
"Kaz Kylheku" <kaz@kylheku.com> wrote in message
news:20120101184335.99@kylheku.com...
> On 2012-01-01, BartC <bc@freeuk.com> wrote:
>> for (i=1; i<=10000000; ++i) {
>> ++len;
>> p=realloc(p,len+1);
>
> Growing linearly, and, by one character, is silly.
Nevertheless, that's what the program wants to do. (As it happens I quite
often have to build strings this way, and without knowing the final size.)
> You have O(N*N) behavior here.
>
> If you grow exponentially (e.g. doubling the buffer), you get amortized
> O(N)
> behavior.
But do you worry about this stuff directly in your code, or let the
language's runtime take care of it, or have some intermediate functions deal
with these details?
(And in the case of a simple C string, there is nowhere to put any extra
information about what the current allocated size is; it doesn't even know
it's length!)
> I.e. this is a poor algorithm. It's the O(N*N) time that's killing this,
> not malloc.
Yet some C implementations coped reasonably enough:
"Ben Bacarisse" <ben.usenet@bsb.me.uk> wrote in message
> On my system is takes, .15 seconds
--
Bartc
|
|
0
|
|
|
|
Reply
|
bc (2211)
|
1/2/2012 11:53:50 AM
|
|
On 1/1/2012 4:41 PM, Joe keane wrote:
> In article<jdpp3r$8h3$1@dont-email.me>,
> Eric Sosman<esosman@ieee-dot-org.invalid> wrote:
>> What's "sensible" about allocating in powers of two?
>
> A) VM pages are powers of two, so if you have non-two objects you waste
> space or take more VM pages for each object than you would like
>
> B) cache lines are powers of two, so if you have non-two objects you
> waste space or take more cache lines for each object than you would like
If you have a forty-four-byte struct and allocate sixty-four
bytes to hold it, how have you avoided *either* sort of "waste?"
If you have a thousand such structs, what "waste" do you avoid
by spending 64K instead of 44K to hold them? That is, what gain
repays your expenditure of an extra 20K?
>> Programmers are numerologically superstitious.
>
> i am numerologically superstitious, but you can also file this into the
> 'maybe someone has figured out something that we don't understand'
> category
I think the "someone" was Agent Mulder.
--
Eric Sosman
esosman@ieee-dot-org.invalid
|
|
0
|
|
|
|
Reply
|
esosman2 (2945)
|
1/2/2012 1:10:52 PM
|
|
"Eric Sosman" <esosman@ieee-dot-org.invalid> wrote in message
news:jdsact$nr6$1@dont-email.me...
> On 1/1/2012 4:41 PM, Joe keane wrote:
>> In article<jdpp3r$8h3$1@dont-email.me>,
>> Eric Sosman<esosman@ieee-dot-org.invalid> wrote:
>>> What's "sensible" about allocating in powers of two?
>>
>> A) VM pages are powers of two, so if you have non-two objects you waste
>> space or take more VM pages for each object than you would like
>>
>> B) cache lines are powers of two, so if you have non-two objects you
>> waste space or take more cache lines for each object than you would like
>
> If you have a forty-four-byte struct and allocate sixty-four
> bytes to hold it, how have you avoided *either* sort of "waste?"
>
> If you have a thousand such structs, what "waste" do you avoid
> by spending 64K instead of 44K to hold them? That is, what gain
> repays your expenditure of an extra 20K?
This is where you redesign your struct to use only 32 bytes, or find a good
use for another 20 bytes to make 64.
Such structs are easier to index in arrays. And fit nicely into those 32- or
64-byte blocks that memory seems to be structured as now, if also aligned
properly, and less likely to straddle two VM pages (actually I don't know if
that's a big deal now; it certainly has been).
So powers-of-two are hardware-friendly. But now stick on 8- or 12-bytes of
overhead, and it's a mess. You can't design a struct to be 8- or 12-bytes
smaller than a power-of-two; every system will be different.
--
Bartc
|
|
0
|
|
|
|
Reply
|
bc (2211)
|
1/2/2012 2:07:58 PM
|
|
On 1/2/2012 9:07 AM, BartC wrote:
>
>
> "Eric Sosman" <esosman@ieee-dot-org.invalid> wrote in message
> news:jdsact$nr6$1@dont-email.me...
>> On 1/1/2012 4:41 PM, Joe keane wrote:
>>> In article<jdpp3r$8h3$1@dont-email.me>,
>>> Eric Sosman<esosman@ieee-dot-org.invalid> wrote:
>>>> What's "sensible" about allocating in powers of two?
>>>
>>> A) VM pages are powers of two, so if you have non-two objects you waste
>>> space or take more VM pages for each object than you would like
>>>
>>> B) cache lines are powers of two, so if you have non-two objects you
>>> waste space or take more cache lines for each object than you would like
>>
>> If you have a forty-four-byte struct and allocate sixty-four
>> bytes to hold it, how have you avoided *either* sort of "waste?"
>>
>> If you have a thousand such structs, what "waste" do you avoid
>> by spending 64K instead of 44K to hold them? That is, what gain
>> repays your expenditure of an extra 20K?
>
> This is where you redesign your struct to use only 32 bytes, or find a
> good use for another 20 bytes to make 64.
Ri-i-i-ght. Pretending the extra twenty bytes are in a good cause
doesn't make them so.
> Such structs are easier to index in arrays.
Nonsense. The `[]' operator works with equal ease on arrays
of any element size at all.
> And fit nicely into those
> 32- or 64-byte blocks that memory seems to be structured as now, if also
> aligned properly, and less likely to straddle two VM pages (actually I
> don't know if that's a big deal now; it certainly has been).
The fit is not "nice" if it's attained by wasting space. (Weren't
you the guy who recently complained about malloc() adding overhead to
each block?)
As for straddling VM pages, denser packing wins over bloated "nicer"
packing any day. Consider again the example of a thousand structs: I
want to use 44KB, you're suggesting 64KB would be "nicer." Suppose an
8KB page: You want to use eight pages, I want to use six. Suppose a
64B cache line: You want to use a thousand cache lines, I want to use
six hundred eighty-eight.
> So powers-of-two are hardware-friendly. But now stick on 8- or 12-bytes
> of overhead, and it's a mess.
Wait, wait, wait: You started with 44 bytes, strained your brain
to imagine a "good use" for another twenty because you're superstitious
about powers of two), and then complained when malloc() tacked on eight
bytes of its own? Why didn't you just stick with the original 44?
> You can't design a struct to be 8- or
> 12-bytes smaller than a power-of-two; every system will be different.
You can't design it to be exactly 64 bytes, either. Quick: How
many bytes in a `long double'? What's the alignment requirement of
a function pointer? What is the "unit" size of a bit-field?
--
Eric Sosman
esosman@ieee-dot-org.invalid
|
|
0
|
|
|
|
Reply
|
esosman2 (2945)
|
1/2/2012 2:47:45 PM
|
|
"BartC" <bc@freeuk.com> writes:
> "Kaz Kylheku" <kaz@kylheku.com> wrote in message
<snip>
>> You have O(N*N) behavior here.
>>
>> If you grow exponentially (e.g. doubling the buffer), you get amortized
>> O(N)
>> behavior.
<snip>
>> I.e. this is a poor algorithm. It's the O(N*N) time that's killing this,
>> not malloc.
>
> Yet some C implementations coped reasonably enough:
>
> "Ben Bacarisse" <ben.usenet@bsb.me.uk> wrote in message
>
>> On my system is takes, .15 seconds
Though I made no attempt to determine the growth rate of the time. It
may still have been quadratic, but fast quadratic.
However, it is in fact linear. Judging by the number of times the
pointer is changed by realloc it is doing some kind of internal doubling
of the requested space.
The trouble is that you can't rely on this, so to get amortised linear
performance in portable C you have to do the exponential growth
yourself.
--
Ben.
|
|
0
|
|
|
|
Reply
|
ben.usenet (6515)
|
1/2/2012 3:18:51 PM
|
|
On Jan 2, 12:11=C2=A0pm, 88888 Dihedral <dihedral88...@googlemail.com>
wrote:
> FtM=E6=96=BC 2012=E5=B9=B41=E6=9C=882=E6=97=A5=E6=98=9F=E6=9C=9F=E4=B8=80=
UTC+8=E4=B8=8B=E5=8D=886=E6=99=8243=E5=88=8633=E7=A7=92=E5=AF=AB=E9=81=93=
=EF=BC=9A
>
> > On Jan 2, 11:34=C2=A0am, 88888 Dihedral <dihedr...@googlemail.com>
> > wrote:
> > > FtM=E6=96=BC 2012=E5=B9=B41=E6=9C=882=E6=97=A5=E6=98=9F=E6=9C=9F=E4=
=B8=80UTC+8=E4=B8=8B=E5=8D=886=E6=99=8218=E5=88=8645=E7=A7=92=E5=AF=AB=E9=
=81=93=EF=BC=9A
>
> > > > > Where does the memory come from in the first place?
>
> > > > mmap (thus even leaving the possibility to swap to persistent stora=
ge
> > > > or share the memory between processes).
>
> > > This requires customized data compression modules not just the heap m=
anager.
>
> > Maybe. Or maybe more disk space? :-) I don't think it's relevant
> > here..
>
> > > If a program only uses 20 to 30 percent of the total heap space in a =
typical
> > > platform, then a lot people won't bother to write a new heap manager =
at all.
>
> > I don't see your point.. In the OSes we were speaking about there's no
> > fixed "heap-space", the physical memory is "recycled" based on the
> > current needs.. Or you were thinking something different?
>
> > Ciao!
>
> I think the heap space in a 32 bit platform is somewhat limited.
> Of course in the 64 bit OS a lot things will be different.
>
More bits =3D=3D More space. So?
> But as long as the hard disk =C2=A0and the dram =C2=A0are both =C2=A0of s=
ome finite sizes,
> I just model the heap space to be of a finite size =C2=A0in my programs.
>
I don't think anyone deals with infinite spaces.
Ciao!
|
|
0
|
|
|
|
Reply
|
fmassei (45)
|
1/2/2012 4:17:19 PM
|
|
On Jan 2, 12:40=A0pm, "BartC" <b...@freeuk.com> wrote:
> "FtM" <fmas...@gmail.com> wrote in message
>
> news:8c4b9ed3-8f9f-47ae-a70f-cc803aa6cc79@d9g2000yqg.googlegroups.com...
>
>
> > On Jan 2, 12:34 am, "BartC" <b...@freeuk.com> wrote:
> >> What strategy will be used to grow the
> >> heap from which the the blocks will come; in fact does this heap need =
to
> >> grow?
>
> > Each "object" will have some pages to start with; when one of them
> > fills its pages completely will ask the system for some more
> > (increasing its allocation vector on the fly too). I don't know the
> > "some" quantity, but I think no one do before runtime.. Making it a
> > compile-(or maybe run-)time parameter will help monitoring and testing
> > the overall performances, but this is just an idea.
> > Naturally, speaking of the first two "objects", the space of the
> > allocation vector should be taken into account, but as it is a bit-
> > masked vector it will occupy a very little space.
>
> >> Where does the memory come from in the first place?
>
> > mmap (thus even leaving the possibility to swap to persistent storage
> > or share the memory between processes).
>
> It sounds like you are more interesting in playing with the virtual memor=
y
> side of things and doing your own paging.
>
> Be aware that the OS tends to be take care of this, and does it rather we=
ll
> in conjunction with the hardware! (If you look in particular at the memor=
y
> management routines on the msdn website, it can do it rather comprehensiv=
ely
> too. I usually take one look then go back to the friendly little malloc,
> realloc and free functions of C.)
>
Yeah, I know.. I have a strange intrinsic vision of the memory
management that is more low-level than the c-level, probably due to my
previous works.. I don't know if it's a "fortune" or not, but that's
why I start by directly asking for low-level facilities instead of
abstracting the real problems. Anyway, I think that porting some of
this logic one level up could help somehow..
> Also, if you just want to experiment, then possibly malloc() can provide =
the
> initial memory within which you can do your own allocations. Just ask for
> the largest block of memory you want to work with, and look for the first
> address within it where the bottom 12 bits are zero; that will likely be =
on
> a page boundary. (In a real program, malloc()ed memory may not grow in th=
e
> way you might like, so your mmap() method might be needed.)
>
You made the right example: I didn't know if that was true or not!
Technically, a virtual page base address doesn't have to match the
physical page boundary..
Ciao!
|
|
0
|
|
|
|
Reply
|
fmassei (45)
|
1/2/2012 4:27:01 PM
|
|
On Jan 2, 3:47=A0pm, Eric Sosman <esos...@ieee-dot-org.invalid> wrote:
> On 1/2/2012 9:07 AM, BartC wrote:
>
>
>
>
>
>
>
>
>
>
>
> > "Eric Sosman" <esos...@ieee-dot-org.invalid> wrote in message
> >news:jdsact$nr6$1@dont-email.me...
> >> On 1/1/2012 4:41 PM, Joe keane wrote:
> >>> In article<jdpp3r$8h...@dont-email.me>,
> >>> Eric Sosman<esos...@ieee-dot-org.invalid> wrote:
> >>>> What's "sensible" about allocating in powers of two?
>
> >>> A) VM pages are powers of two, so if you have non-two objects you was=
te
> >>> space or take more VM pages for each object than you would like
>
> >>> B) cache lines are powers of two, so if you have non-two objects you
> >>> waste space or take more cache lines for each object than you would l=
ike
>
> >> If you have a forty-four-byte struct and allocate sixty-four
> >> bytes to hold it, how have you avoided *either* sort of "waste?"
>
> >> If you have a thousand such structs, what "waste" do you avoid
> >> by spending 64K instead of 44K to hold them? That is, what gain
> >> repays your expenditure of an extra 20K?
>
> > This is where you redesign your struct to use only 32 bytes, or find a
> > good use for another 20 bytes to make 64.
>
> =A0 =A0 =A0Ri-i-i-ght. =A0Pretending the extra twenty bytes are in a good=
cause
> doesn't make them so.
>
> > Such structs are easier to index in arrays.
>
> =A0 =A0 =A0Nonsense. =A0The `[]' operator works with equal ease on arrays
> of any element size at all.
>
> > And fit nicely into those
> > 32- or 64-byte blocks that memory seems to be structured as now, if als=
o
> > aligned properly, and less likely to straddle two VM pages (actually I
> > don't know if that's a big deal now; it certainly has been).
>
> =A0 =A0 =A0The fit is not "nice" if it's attained by wasting space. =A0(W=
eren't
> you the guy who recently complained about malloc() adding overhead to
> each block?)
>
> =A0 =A0 =A0As for straddling VM pages, denser packing wins over bloated "=
nicer"
> packing any day. =A0Consider again the example of a thousand structs: I
> want to use 44KB, you're suggesting 64KB would be "nicer." =A0Suppose an
> 8KB page: You want to use eight pages, I want to use six. =A0Suppose a
> 64B cache line: You want to use a thousand cache lines, I want to use
> six hundred eighty-eight.
>
Someone previously asked me what if the allocation took no cycles at
all. That question is obviously plain wrong, because I *do* like the
allocator taking cycles if this means that, when the memory is low, my
complex data structure resides in one single page and the system has
not to hysterically swap from disk to deal with a pathological
fragmentation.
With your comment anyway you are agreeing with me: what if you could
choose? :-)
Ciao!
|
|
0
|
|
|
|
Reply
|
fmassei (45)
|
1/2/2012 4:36:22 PM
|
|
On 1/2/2012 11:36 AM, FtM wrote:
> [...]
> Someone previously asked me what if the allocation took no cycles at
> all. That question is obviously plain wrong, [...]
Part of the problem is that you have not made your objectives
clear. Kaz apparently believed you were seeking an allocator that
would consume less CPU to perform its job -- indeed, that's exactly
how I understand some of your remarks. In that context his question
is not "plain wrong" at all, but entirely to the point: He's asking
about the upper bound on potential improvement. If you could reduce
the allocator's time to zero -- if you could achieve an infinite
speedup -- how much overall improvement would you attain? If you
cannot answer that question, at least with a well-grounded estimate,
you have *no* business trying to speed up the allocator, nor the sort,
nor the parser, nor the hash table, nor the frammis rebiculator.
If you have not measured, you do not know whether you are in
trouble. If you achieve an X% improvement, you *still* won't know
whether you are out of trouble or still in it.
> [...] I *do* like the
> allocator taking cycles if this means that, when the memory is low, my
> complex data structure resides in one single page and the system has
> not to hysterically swap from disk to deal with a pathological
> fragmentation.
A memory manager with C's semantics, that is, one that cannot
relocate an already-allocated block, is *always* vulnerable to
fragmentation (J.M. Robson, JACM 18, 416-423). The practical issue
is to find strategies that "usually" don't suffer from "severe"
fragmentation in "typical" cases.
> With your comment anyway you are agreeing with me: what if you could
> choose? :-)
Sorry; I don't understand this remark.
--
Eric Sosman
esosman@ieee-dot-org.invalid
|
|
0
|
|
|
|
Reply
|
esosman2 (2945)
|
1/2/2012 6:43:20 PM
|
|
On Jan 2, 7:43=A0pm, Eric Sosman <esos...@ieee-dot-org.invalid> wrote:
> On 1/2/2012 11:36 AM, FtM wrote:
>
> > [...]
> > Someone previously asked me what if the allocation took no cycles at
> > all. That question is obviously plain wrong, [...]
>
> =A0 =A0 =A0Part of the problem is that you have not made your objectives
> clear. =A0Kaz apparently believed you were seeking an allocator that
> would consume less CPU to perform its job -- indeed, that's exactly
> how I understand some of your remarks. =A0In that context his question
> is not "plain wrong" at all, but entirely to the point: He's asking
> about the upper bound on potential improvement. =A0If you could reduce
> the allocator's time to zero -- if you could achieve an infinite
> speedup -- how much overall improvement would you attain? =A0If you
> cannot answer that question, at least with a well-grounded estimate,
> you have *no* business trying to speed up the allocator, nor the sort,
> nor the parser, nor the hash table, nor the frammis rebiculator.
>
> =A0 =A0 =A0If you have not measured, you do not know whether you are in
> trouble. =A0If you achieve an X% improvement, you *still* won't know
> whether you are out of trouble or still in it.
>
Yup, reading again what I wrote from this point of view I understand
his question. I was a little bit harsh because on the one hand in my
opinion the "speed" was only a part of the problem, and on the other
hand I know that I can't measure anything before implementing
something that I think its a good idea..
> > [...] I *do* like the
> > allocator taking cycles if this means that, when the memory is low, my
> > complex data structure resides in one single page and the system has
> > not to hysterically swap from disk to deal with a pathological
> > fragmentation.
>
> =A0 =A0 =A0A memory manager with C's semantics, that is, one that cannot
> relocate an already-allocated block, is *always* vulnerable to
> fragmentation (J.M. Robson, JACM 18, 416-423). =A0The practical issue
> is to find strategies that "usually" don't suffer from "severe"
> fragmentation in "typical" cases.
>
That's very interesting, I didn't know that it was mathematically
proved! Anyway.. [read below]
> > With your comment anyway you are agreeing with me: what if you could
> > choose? :-)
>
> =A0 =A0 =A0Sorry; I don't understand this remark.
>
.... if for your own use you give up the standard interface and
manually choose which allocation strategy best fit your needs, you
could have an advantage.
A very practical example. Someone use my library just for the linked-
lists, implemented with this hypothetical allocator. His code uses the
allocator of its choice, probably the system's one, and of course he
will have one or more linked-lists accessed from time to time; these
lists will hardly suffer severe fragmentation (if any), no matter how
or when he inserts or deletes their elements, simply because the
memory was already allocated from the start.
If all of this can be useful or not depends on the specific library
user's needs, but personally I think that giving this possibility
would be a nice-to-have feature.
Ciao!
|
|
0
|
|
|
|
Reply
|
fmassei (45)
|
1/2/2012 7:38:06 PM
|
|
"Eric Sosman" <esosman@ieee-dot-org.invalid> wrote in message
news:jdsg2h$mkb$1@dont-email.me...
> On 1/2/2012 9:07 AM, BartC wrote:
>> Such structs are easier to index in arrays.
>
> Nonsense. The `[]' operator works with equal ease on arrays
> of any element size at all.
Where an address calculation is needed, then it's usually multiply versus a
shift. Although turning 44 bytes/element into 64 just for that reason would
not be worth it. There should be other benefits to using 64.
> As for straddling VM pages, denser packing wins over bloated "nicer"
> packing any day. Consider again the example of a thousand structs: I
> want to use 44KB, you're suggesting 64KB would be "nicer." Suppose an
> 8KB page: You want to use eight pages, I want to use six. Suppose a
> 64B cache line: You want to use a thousand cache lines, I want to use
> six hundred eighty-eight.
You forgot the bit where I suggested trying to get it down to 32 bytes. So
some data structures could get smaller.
You would only expand one up to the next power-of-two if there was a net
benefit. Keeping 20 bytes unused out of 64 wouldn't be a benefit. Storing
something in there that would have needed it's own space anyway might well
be.
(While most structs you'd leave alone because there aren't going to be
enough of them, or accessed often enough, to worry about.)
>> You can't design a struct to be 8- or
>> 12-bytes smaller than a power-of-two; every system will be different.
>
> You can't design it to be exactly 64 bytes, either. Quick: How
> many bytes in a `long double'? What's the alignment requirement of
> a function pointer? What is the "unit" size of a bit-field?
I'm pretty good at that actually..
(I've just been looking at an old application of mine. There is one struct
that has an 8-byte padding field added at the end to bring the size up to
exactly 128 bytes. Would that really have made a difference or not? Let's
see!
Running the application with a struct size of 128 bytes, the runtime was 48
seconds (this is for a 3D hidden line display). I took out the padding
field, and the runtime wentup to 59 seconds, 23% slower. Now that could be
due to a million reasons, but all the same, I think I'll leave that padding
in, thanks!)
--
Bartc
|
|
0
|
|
|
|
Reply
|
bc (2211)
|
1/2/2012 9:36:45 PM
|
|
Eric Sosman <esosman@ieee-dot-org.invalid> writes:
[...]
>> This is where you redesign your struct to use only 32 bytes, or find a
>> good use for another 20 bytes to make 64.
[...]
>> Such structs are easier to index in arrays.
>
> Nonsense. The `[]' operator works with equal ease on arrays
> of any element size at all.
[...]
It's certainly possible that `[]' on an array of 64-bytes elements
will be faster than `[]' on an array with 44-byte elements.
(It's also possible that it could have the same or even slower
performance.)
But padding a structure to make it 64 bytes for the purpose of making
array indexing faster is a classic case of micro-optimization, to
be attempted only if measurements show that it makes a significant
difference.
--
Keith Thompson (The_Other_Keith) kst-u@mib.org <http://www.ghoti.net/~kst>
Will write code for food.
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"
|
|
0
|
|
|
|
Reply
|
kst-u (21484)
|
1/2/2012 9:42:26 PM
|
|
On 2012-01-02, BartC <bc@freeuk.com> wrote:
> "Eric Sosman" <esosman@ieee-dot-org.invalid> wrote in message
> news:jdsg2h$mkb$1@dont-email.me...
>> On 1/2/2012 9:07 AM, BartC wrote:
>
>>> Such structs are easier to index in arrays.
>>
>> Nonsense. The `[]' operator works with equal ease on arrays
>> of any element size at all.
>
> Where an address calculation is needed, then it's usually multiply versus a
> shift. Although turning 44 bytes/element into 64 just for that reason would
> not be worth it. There should be other benefits to using 64.
Note that 44 is 32 + 8 + 4, and so 44x is (32x + 8x + 4x). Your CPU can
multiply by 44 just by doing three shifts and two additions, which can likely
be parallelized.
|
|
0
|
|
|
|
Reply
|
kaz15 (1129)
|
1/2/2012 9:44:54 PM
|
|
In article <jdsact$nr6$1@dont-email.me>,
Eric Sosman <esosman@ieee-dot-org.invalid> wrote:
> If you have a forty-four-byte struct and allocate sixty-four
>bytes to hold it, how have you avoided *either* sort of "waste?"
I think the context is where you're writing some dynamic array
[e.g. string or vector] and you have some control over the sequence of
capacities [at least that's what i thought]. Of course, you do not want
to call realloc on every operation; you want the progression to be
exponential. In such a case, powers of two are very logical.
|
|
0
|
|
|
|
Reply
|
jgk1 (176)
|
1/2/2012 9:45:15 PM
|
|
On 1/2/2012 4:42 PM, Keith Thompson wrote:
> Eric Sosman<esosman@ieee-dot-org.invalid> writes:
> [...]
>>> This is where you redesign your struct to use only 32 bytes, or find a
>>> good use for another 20 bytes to make 64.
> [...]
>>> Such structs are easier to index in arrays.
>>
>> Nonsense. The `[]' operator works with equal ease on arrays
>> of any element size at all.
> [...]
>
> It's certainly possible that `[]' on an array of 64-bytes elements
> will be faster than `[]' on an array with 44-byte elements.
> (It's also possible that it could have the same or even slower
> performance.)
Note that he didn't claim "faster" or "slower," but "easier."
The claim is, I repeat, nonsense.
> But padding a structure to make it 64 bytes for the purpose of making
> array indexing faster is a classic case of micro-optimization, to
> be attempted only if measurements show that it makes a significant
> difference.
It's probably not even micro-optimization; I'll stick with my
hypothesis that it's numerological superstition.
The original remark was
[...] if you're being sensible and trying to keep to
power-of-two allocations [...]
.... and I'm objecting to the implication that power-of-two sizes are
in any way "sensible" for their own sake. That's just bogus.
--
Eric Sosman
esosman@ieee-dot-org.invalid
|
|
0
|
|
|
|
Reply
|
esosman2 (2945)
|
1/2/2012 10:38:50 PM
|
|
On 1/2/2012 4:45 PM, Joe keane wrote:
> In article<jdsact$nr6$1@dont-email.me>,
> Eric Sosman<esosman@ieee-dot-org.invalid> wrote:
>> If you have a forty-four-byte struct and allocate sixty-four
>> bytes to hold it, how have you avoided *either* sort of "waste?"
>
> I think the context is where you're writing some dynamic array
> [e.g. string or vector] and you have some control over the sequence of
> capacities [at least that's what i thought]. Of course, you do not want
> to call realloc on every operation; you want the progression to be
> exponential. In such a case, powers of two are very logical.
(Shrug.) No more logical than powers of three, or of three halves
(`size += size/2').
I don't know when the number Two starts to exercise its mystical
fascination for programmers, but the fascination certainly exists: You
find people reaching for powers of two "just because," with no actual
reason for the choice. Maybe it's when the neophyte first learns that
he can sometimes shift instead of multiplying and dividing, or that he
can use AND to calculate some remainders. This may strike the young
and credulous as evidence that Two is magical, and just as some folks
never get over the Tooth Fairy, others never get over the Two'th Fairy.
Yes, folks, yes, there certainly *are* circumstances where power-
of-two sizes are Just What The Doctor Ordered. But there's no good
reason to make everything in sight a power of two! If you allocate
128 array elements to represent the 88 keys of a piano, or sixteen
for the Twelve Days of Christmas, or thirty-two for your fingers and
toes, you're just being silly. Or superstitious.
--
Eric Sosman
esosman@ieee-dot-org.invalid
|
|
0
|
|
|
|
Reply
|
esosman2 (2945)
|
1/2/2012 11:19:29 PM
|
|
Eric Sosman <esosman@ieee-dot-org.invalid> writes:
[...]
> Yes, folks, yes, there certainly *are* circumstances where power-
> of-two sizes are Just What The Doctor Ordered. But there's no good
> reason to make everything in sight a power of two! If you allocate
> 128 array elements to represent the 88 keys of a piano, or sixteen
> for the Twelve Days of Christmas, or thirty-two for your fingers and
> toes, you're just being silly. Or superstitious.
Or polydactyl.
--
Keith Thompson (The_Other_Keith) kst-u@mib.org <http://www.ghoti.net/~kst>
Will write code for food.
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"
|
|
0
|
|
|
|
Reply
|
kst-u (21484)
|
1/2/2012 11:42:49 PM
|
|
"Eric Sosman" <esosman@ieee-dot-org.invalid> wrote in message
news:jdte21$im7$1@dont-email.me...
> Yes, folks, yes, there certainly *are* circumstances where power-
> of-two sizes are Just What The Doctor Ordered. But there's no good
> reason to make everything in sight a power of two! If you allocate
> 128 array elements to represent the 88 keys of a piano, or sixteen
> for the Twelve Days of Christmas, or thirty-two for your fingers and
> toes, you're just being silly. Or superstitious.
Odd how the C standard seems preoccupied with such numbers then (or with
powers-of-two-minus-one, the same thing):
(from 5.2.4.1:)
"- 127 nesting levels of blocks
- 63 nesting levels of conditional inclusion
- 12 pointer, array, and function declarators (in any combinations)
- 63 nesting levels of parenthesized declarators within a full declarator
- 63 nesting levels of parenthesized expressions within a full expression
- 63 significant initial characters in an internal identifier or a macro
- 31 significant initial characters in an external identifier
- 4095 external identifiers in one translation unit
- 511 identifiers with block scope declared in one block
- 4095 macro identifiers simultaneously defined in one
- 127 parameters in one function definition
- 127 arguments in one function call
- 127 parameters in one macro definition
- 127 arguments in one macro invocation
- 4095 characters in a logical source line
- 4095 characters in a character string literal or wide string literal
- 65535 bytes in an object (in a hosted environment only)
- 15 nesting levels for #included files
- 1023 case labels for a switch statement
- 1023 members in a single structure or union
- 1023 enumeration constants in a single enumeration
- 63 levels of nested structure"
(But I'm not sure how that 12 got in there, on the third line; a typo?)
--
Bartc
|
|
0
|
|
|
|
Reply
|
bc (2211)
|
1/2/2012 11:59:13 PM
|
|
Bart=E6=96=BC 2012=E5=B9=B41=E6=9C=883=E6=97=A5=E6=98=9F=E6=9C=9F=E4=BA=8CU=
TC+8=E4=B8=8A=E5=8D=885=E6=99=8236=E5=88=8645=E7=A7=92=E5=AF=AB=E9=81=93=EF=
=BC=9A
> "Eric Sosman" <esosman@ieee-dot-org.invalid> wrote in message
> news:jdsg2h$mkb$1@dont-email.me...
> > On 1/2/2012 9:07 AM, BartC wrote:
>=20
> >> Such structs are easier to index in arrays.
> >
> > Nonsense. The `[]' operator works with equal ease on arrays
> > of any element size at all.
>=20
> Where an address calculation is needed, then it's usually multiply versus=
a
> shift. Although turning 44 bytes/element into 64 just for that reason wou=
ld
> not be worth it. There should be other benefits to using 64.
>=20
> > As for straddling VM pages, denser packing wins over bloated "nicer"
> > packing any day. Consider again the example of a thousand structs: I
> > want to use 44KB, you're suggesting 64KB would be "nicer." Suppose an
> > 8KB page: You want to use eight pages, I want to use six. Suppose a
> > 64B cache line: You want to use a thousand cache lines, I want to use
> > six hundred eighty-eight.
>=20
> You forgot the bit where I suggested trying to get it down to 32 bytes. S=
o
> some data structures could get smaller.
>=20
> You would only expand one up to the next power-of-two if there was a net
> benefit. Keeping 20 bytes unused out of 64 wouldn't be a benefit. Storing
> something in there that would have needed it's own space anyway might wel=
l
> be.
>=20
This is OK. I remembered that I did my own heap manager for digital
videos or images long time ago. Every instance is of millions of true=20
color pixels and can be magnified to 16 times in one dimension to be viewed=
..=20
It is not difficult at all but just a little bit tedious.=20
=20
=20
> (While most structs you'd leave alone because there aren't going to be
> enough of them, or accessed often enough, to worry about.)
>=20
> >> You can't design a struct to be 8- or
> >> 12-bytes smaller than a power-of-two; every system will be different.
> >
> > You can't design it to be exactly 64 bytes, either. Quick: How
> > many bytes in a `long double'? What's the alignment requirement of
> > a function pointer? What is the "unit" size of a bit-field?
>=20
> I'm pretty good at that actually..
>=20
> (I've just been looking at an old application of mine. There is one struc=
t
> that has an 8-byte padding field added at the end to bring the size up to
> exactly 128 bytes. Would that really have made a difference or not? Let's
> see!
>=20
> Running the application with a struct size of 128 bytes, the runtime was =
48
> seconds (this is for a 3D hidden line display). I took out the padding
> field, and the runtime wentup to 59 seconds, 23% slower. Now that could b=
e=20
> due to a million reasons, but all the same, I think I'll leave that paddi=
ng=20
> in, thanks!)
>=20
> --=20
> Bartc
|
|
0
|
|
|
|
Reply
|
dihedral88888 (786)
|
1/3/2012 7:10:40 AM
|
|
On Mon, 02 Jan 2012 18:19:29 -0500, Eric Sosman saw fit to publish the
following:
<snip>
> Yes, folks, yes, there certainly *are* circumstances where power-
> of-two sizes are Just What The Doctor Ordered. But there's no good
> reason to make everything in sight a power of two! If you allocate 128
> array elements to represent the 88 keys of a piano, or sixteen for the
> Twelve Days of Christmas, or thirty-two for your fingers and toes,
> you're just being silly. Or superstitious.
Bravo! I couldn't have put it better than that.
--
Spock: The odds of surviving another attack are 13562190123 to 1, Captain.
|
|
0
|
|
|
|
Reply
|
kleuske1588 (43)
|
1/3/2012 11:38:19 AM
|
|
In article <LPYLq.355$EO5.173@newsfe26.ams2>, BartC <bc@freeuk.com> wrote:
>And one thing about malloc() I don't like, are the overheads in having to
>remember the block sizes; if I allocate 16 bytes, malloc() seems to reserve
>either 20 or, more often, 24 bytes on the few compilers I've tried. That's a
>50% overhead!
Agreed. When i was doing memory allocation, if you ask for 16 bytes it
really was 16 bytes, plus some per-page control structures stage left.
Trick is all the objects on the same 'page' (not the same as a VM page,
you can make it what size seems to work best) are the same size! So you
just need the page number, look it up in a hash table (don't worry, this
is very fast).
|
|
0
|
|
|
|
Reply
|
jgk1 (176)
|
1/3/2012 12:21:44 PM
|
|
"Kleuske" <kleuske@nowhere.net> wrote in message
news:jdupbb$om0$3@dont-email.me...
> On Mon, 02 Jan 2012 18:19:29 -0500, Eric Sosman saw fit to publish the
> following:
> <snip>
>
>> Yes, folks, yes, there certainly *are* circumstances where power-
>> of-two sizes are Just What The Doctor Ordered. But there's no good
>> reason to make everything in sight a power of two! If you allocate 128
>> array elements to represent the 88 keys of a piano, or sixteen for the
>> Twelve Days of Christmas, or thirty-two for your fingers and toes,
>> you're just being silly. Or superstitious.
>
> Bravo! I couldn't have put it better than that.
For user applications, I also agree with all this; there shouldn't be all
these mysterious powers-of-two floating around for no good reason. (And if
you've ever been involved in writing user manuals, the less to have to
explain, the better.)
For higher level, rapid development languages, they are also largely
irrelevant.
But C is a somewhat lower level language, and these powers of two do get
everywhere. You can completely ignore them if you wish, but if often pays to
be aware of them.
Anything that can only be described, or indexed, in N bits, will have 2^N
possibilities; you can't away from that. (And hardware seems to prefer N
itself as a power-of-two; so 8 bits in a byte is more common than 5, 7 or
11.)
And part of the context of this thread is memory allocation strategies,
where there might well be merit in a scheme were allocated blocks can only
change in size by a power of two (and which, if implemented too thinly on
top of malloc(), is affected by malloc()'s own overheads).
I don't think that is being obsessed with it.
--
Bartc
|
|
0
|
|
|
|
Reply
|
bc (2211)
|
1/3/2012 12:49:34 PM
|
|
In article <f500e5bf-edc1-43f7-9fbe-cf7bac512531@t16g2000vba.googlegroups.com>,
FtM <fmassei@gmail.com> wrote:
>That question is obviously plain wrong, because I *do* like the
>allocator taking cycles if this means that, when the memory is low, my
>complex data structure resides in one single page and the system has
>not to hysterically swap from disk to deal with a pathological
>fragmentation.
IMHE, people who try to make the allocation 'as fast as possible' are
just chasing a squirrel up the wrong tree.
No sir, you want to make your clients' code as fast as possible.
Sometimes people recommend a class-based linked list ('look! it's two
instructions!'). In my onion, this is disasterous. Works great for toy
programs, but when you have lots-of-data and long-running combined, your
memory turns into an omelet.
You may as well send invitations to Mr. Cache Miss and Ms. Tlb Fail...
If you have a 'fast' allocator that doesn't have the slighest idea of
locality, and a 'slow' allocator that takes some care of it, the
'slower' code will kick the 'faster' code arse every time.
|
|
0
|
|
|
|
Reply
|
jgk1 (176)
|
1/3/2012 1:14:46 PM
|
|
Joe keane=E6=96=BC 2012=E5=B9=B41=E6=9C=883=E6=97=A5=E6=98=9F=E6=9C=9F=E4=
=BA=8CUTC+8=E4=B8=8B=E5=8D=889=E6=99=8214=E5=88=8646=E7=A7=92=E5=AF=AB=E9=
=81=93=EF=BC=9A
> In article <f500e5bf-edc1-43f7...@t16g2000vba.googlegroups.com>,
> FtM <fma...@gmail.com> wrote:
> >That question is obviously plain wrong, because I *do* like the
> >allocator taking cycles if this means that, when the memory is low, my
> >complex data structure resides in one single page and the system has
> >not to hysterically swap from disk to deal with a pathological
> >fragmentation.
>=20
> IMHE, people who try to make the allocation 'as fast as possible' are
> just chasing a squirrel up the wrong tree.
>=20
> No sir, you want to make your clients' code as fast as possible.
>=20
> Sometimes people recommend a class-based linked list ('look! it's two
> instructions!'). In my onion, this is disasterous. Works great for toy
> programs, but when you have lots-of-data and long-running combined, your
> memory turns into an omelet.
>=20
> You may as well send invitations to Mr. Cache Miss and Ms. Tlb Fail...
>=20
> If you have a 'fast' allocator that doesn't have the slighest idea of
> locality, and a 'slow' allocator that takes some care of it, the
> 'slower' code will kick the 'faster' code arse every time.
This is a system level problem. The allocator under a huge dram space with=
=20
a lot free space remained should be designed toward fast easily.=20
But if the dram is near fully allocated then a good heap walker that can=20
swap pages to the hard disk is important toward reliability not the speed=
=20
part in the allocator to degrade the system speed.=20
=20
=20
|
|
0
|
|
|
|
Reply
|
dihedral88888 (786)
|
1/3/2012 2:07:20 PM
|
|
In article <jdsact$nr6$1@dont-email.me>,
Eric Sosman <esosman@ieee-dot-org.invalid> wrote:
> If you have a forty-four-byte struct and allocate sixty-four
>bytes to hold it, how have you avoided *either* sort of "waste?"
If the cache line is 16 bytes, we know that each object will take only
three lines; the fourth is never used [it may be worse with the VM].
For this case 48 would work just as well (the compiler will do that for
you if it thinks 8-byte alignment is a good idea). But then if the
cache line is 32 bytes, same argument applies.
|
|
0
|
|
|
|
Reply
|
jgk1 (176)
|
1/4/2012 9:52:56 PM
|
|
In article <jdte21$im7$1@dont-email.me>,
Eric Sosman <esosman@ieee-dot-org.invalid> wrote:
> I don't know when the number Two starts to exercise its mystical
>fascination for programmers, but the fascination certainly exists: You
>find people reaching for powers of two "just because," with no actual
>reason for the choice.
You have 'struct x', you add padding, test it, it is faster. You have
'struct y', you add padding, test it, it is faster. You have 'struct
z', you add padding, test it, it is faster. You have 'struct w', you
add padding, test it, it is slower [oops doesn't always work].
You have 'struct a', and your boss says 'can't test! just guess!'.
What would you do?
|
|
0
|
|
|
|
Reply
|
jgk1 (176)
|
1/4/2012 11:42:15 PM
|
|
On Jan 1, 4:54=A0am, "BartC" <b...@freeuk.com> wrote:
> "Markus Wichmann" <nullp...@gmx.net> wrote in message
>
> news:f7s5t8-jo5.ln1@voyager.wichi.de.vu...
>
> > On 31.12.2011 18:52, FtM wrote:
> >> Simply because it can't know if I'm going to do 200 little allocations
> >> or 1 big one.
> > For one, it _never_ uses brk(). That syscall is just plain old and does
> > nothing mmap() couldn't do better. Then it has linked lists prepared fo=
r
> > objects of sizes of powers of two, starting at 16 and stopping at 2048.
>
> That exactly what I do in my own allocator! Except I stop at 1024. Althou=
gh
> I probably manage them very differently, for example once a small block s=
ize
> is allocated, say 64 bytes, it always stays that size.
>
> And one thing about malloc() I don't like, are the overheads in having to
> remember the block sizes; if I allocate 16 bytes, malloc() seems to reser=
ve
> either 20 or, more often, 24 bytes on the few compilers I've tried. That'=
s a
> 50% overhead!
>
> Also if you're being sensible and trying to keep to power-of-two
> allocations, this 4- or 8-byte overhead is going to screw that up.
>
> My allocators have never stored a block size, and it has never really bee=
n a
> problem. Either you will know the size (a struct for example), or you nee=
d
> to keep track of it yourself, in which case it is very handy when you hav=
e
> an allocation that can grow in never having to keep running back to
> realloc().
>
> As an illustration, take this simple example of a string growing from one=
to
> ten million characters:
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
>
> int main(void)
> {
> char *p;
> int i,len;
>
> p=3Dmalloc(1);
> *p=3D0;
> len=3D0;
>
> for (i=3D1; i<=3D10000000; ++i) {
> =A0++len;
> =A0p=3Drealloc(p,len+1);
> =A0p[len-1]=3D'*';
> =A0p[len]=3D0;
> // puts(p);}
>
> printf("Length=3D %d\n",strlen(p));
>
> }
>
> On three C compilers, this took about 12 seconds (on DMC, any length of
> string resulted in a runtime of 0.4 seconds; a bit odd).
>
> Using my strategy, of avoiding any calls to malloc or realloc unless
> absolutely necessary, and using that inside an interpreter, this bit of
> script took about 0.1 seconds:
>
> string a
>
> a:=3D""
> to 10 million do
> =A0a+:=3D"*"
> od
> println "Length=3D",a.len
>
> Of course, real C code wouldn't use such a naive strategy; it would also
> seek to minimise malloc()/realloc() calls, but that's also an admission t=
hat
> these calls have some overheads which it would be desirable to avoid. Hen=
ce
> I can understand the OP's attempt to experiment with his own versions.
>
> --
> Bartc
You can take a look at my bit-mapped memory allocator: code.google.com/
p/arena-memory-allocation
for an example of out-of-band storage of the block sizes.
Karl m
|
|
0
|
|
|
|
Reply
|
malbrain (59)
|
1/5/2012 12:48:23 AM
|
|
On 1/4/2012 6:42 PM, Joe keane wrote:
> In article<jdte21$im7$1@dont-email.me>,
> Eric Sosman<esosman@ieee-dot-org.invalid> wrote:
>> I don't know when the number Two starts to exercise its mystical
>> fascination for programmers, but the fascination certainly exists: You
>> find people reaching for powers of two "just because," with no actual
>> reason for the choice.
>
> You have 'struct x', you add padding, test it, it is faster. You have
> 'struct y', you add padding, test it, it is faster. You have 'struct
> z', you add padding, test it, it is faster. You have 'struct w', you
> add padding, test it, it is slower [oops doesn't always work].
>
> You have 'struct a', and your boss says 'can't test! just guess!'.
> What would you do?
I'd leave it alone, because there are more important things
to attend to.
--
Eric Sosman
esosman@ieee-dot-org.invalid
|
|
0
|
|
|
|
Reply
|
esosman2 (2945)
|
1/5/2012 1:23:47 AM
|
|
On 1/4/2012 4:52 PM, Joe keane wrote:
> In article<jdsact$nr6$1@dont-email.me>,
> Eric Sosman<esosman@ieee-dot-org.invalid> wrote:
>> If you have a forty-four-byte struct and allocate sixty-four
>> bytes to hold it, how have you avoided *either* sort of "waste?"
>
> If the cache line is 16 bytes, we know that each object will take only
> three lines; the fourth is never used [it may be worse with the VM].
> For this case 48 would work just as well (the compiler will do that for
> you if it thinks 8-byte alignment is a good idea). But then if the
> cache line is 32 bytes, same argument applies.
All I can say is that you and I have divergent notions of
efficiency. I'm thinking "I can run over the entire array with
just N*44/C cache misses; Joe would prefer N*64/C." That is, I
think you'll incur 46% more cache misses than I will. Tell me
again, please, what I have wasted by using 31% fewer cache line
loads than you would?
I'm also thinking "I need N*44/P pages for my array; Joe would
rather use N*64/P." That is, I think your working set size will
be 46% larger than mine, your likelihood of taking a page fault
will be 46% greater than mine. Tell me again, please, what have
I wasted by using 31% fewer memory pages?
It's an odd sort of "waste" that spends fewer resources.
--
Eric Sosman
esosman@ieee-dot-org.invalid
|
|
0
|
|
|
|
Reply
|
esosman2 (2945)
|
1/5/2012 1:31:38 AM
|
|
"Karl Malbrain" <malbrain@yahoo.com> wrote in message
news:0f8561ea-8775-4527-bb2d-da5fb5e0c5da@a17g2000yqj.googlegroups.com...
> You can take a look at my bit-mapped memory allocator: code.google.com/
> p/arena-memory-allocation
> for an example of out-of-band storage of the block sizes.
Thanks, I'll take a closer look later.
But a few things struck me immediately:
(1) It had a comment block at the beginning that actually said what the code
was and gave a brief description, instead of the usual GPL waffle.
(2) It's vfree() function requires the size to be supplied, just like mine
does! So I'm not alone in thinking the way C does it is not the only way.
(3) It has plenty of powers-of-two sprinkled about. With the comments in
this thread I was starting to feel like some sort of dinosaur.
--
Bartc
|
|
0
|
|
|
|
Reply
|
bc (2211)
|
1/5/2012 1:36:42 AM
|
|
"Eric Sosman" <esosman@ieee-dot-org.invalid> wrote in message
news:je2uhu$6s5$1@dont-email.me...
> On 1/4/2012 4:52 PM, Joe keane wrote:
>> In article<jdsact$nr6$1@dont-email.me>,
>> Eric Sosman<esosman@ieee-dot-org.invalid> wrote:
>>> If you have a forty-four-byte struct and allocate sixty-four
>>> bytes to hold it, how have you avoided *either* sort of "waste?"
>>
>> If the cache line is 16 bytes, we know that each object will take only
>> three lines; the fourth is never used [it may be worse with the VM].
>> For this case 48 would work just as well (the compiler will do that for
>> you if it thinks 8-byte alignment is a good idea). But then if the
>> cache line is 32 bytes, same argument applies.
>
> All I can say is that you and I have divergent notions of
> efficiency. I'm thinking "I can run over the entire array with
> just N*44/C cache misses; Joe would prefer N*64/C." That is, I
> think you'll incur 46% more cache misses than I will. Tell me
> again, please, what I have wasted by using 31% fewer cache line
> loads than you would?
I think he means that the last 16-bytes of the 64 might never be accessed,
so will never be fetched from memory in the first place; there is no cache
miss.
(That does depend on how the 64-bytes are used; if the whole struct is
copied, the compiler might access all 64-bytes anyway.)
And if the cache blocks are 32-bytes, then it doesn't matter.
> I'm also thinking "I need N*44/P pages for my array; Joe would
> rather use N*64/P." That is, I think your working set size will
> be 46% larger than mine, your likelihood of taking a page fault
> will be 46% greater than mine. Tell me again, please, what have
> I wasted by using 31% fewer memory pages?
No one was advocating wasting 20 bytes, but finding a use for them, perhaps
moving data there that would have been stored elsewhere.
> It's an odd sort of "waste" that spends fewer resources.
If you had a heavily used struct, and it turned out it had 65 bytes, that
wouldn't bother you any? You wouldn't try and get it down to 64?
--
Bartc
|
|
0
|
|
|
|
Reply
|
bc (2211)
|
1/5/2012 1:45:38 AM
|
|
jgk@panix.com (Joe keane) writes:
> In article <jdte21$im7$1@dont-email.me>,
> Eric Sosman <esosman@ieee-dot-org.invalid> wrote:
> > I don't know when the number Two starts to exercise its mystical
> >fascination for programmers, but the fascination certainly exists: You
> >find people reaching for powers of two "just because," with no actual
> >reason for the choice.
>
> You have 'struct x', you add padding, test it, it is faster. You have
> 'struct y', you add padding, test it, it is faster. You have 'struct
> z', you add padding, test it, it is faster. You have 'struct w', you
> add padding, test it, it is slower [oops doesn't always work].
>
> You have 'struct a', and your boss says 'can't test! just guess!'.
> What would you do?
I'd ask my boss if he's heard of Knuth.
However, it's hypothetical, I'd probably not be working at such a
company in the first place.
Phil
--
Unix is simple. It just takes a genius to understand its simplicity
-- Dennis Ritchie (1941-2011), Unix Co-Creator
|
|
0
|
|
|
|
Reply
|
thefatphil_demunged (1558)
|
1/5/2012 9:31:00 PM
|
|
88888 Dihedral=E6=96=BC 2012=E5=B9=B41=E6=9C=883=E6=97=A5=E6=98=9F=E6=9C=9F=
=E4=BA=8CUTC+8=E4=B8=8B=E5=8D=8810=E6=99=8207=E5=88=8620=E7=A7=92=E5=AF=AB=
=E9=81=93=EF=BC=9A
> Joe keane=E6=96=BC 2012=E5=B9=B41=E6=9C=883=E6=97=A5=E6=98=9F=E6=9C=9F=E4=
=BA=8CUTC+8=E4=B8=8B=E5=8D=889=E6=99=8214=E5=88=8646=E7=A7=92=E5=AF=AB=E9=
=81=93=EF=BC=9A
> > In article <f500e5bf-e...@t16g2000vba.googlegroups.com>,
> > FtM <fma...@gmail.com> wrote:
> > >That question is obviously plain wrong, because I *do* like the
> > >allocator taking cycles if this means that, when the memory is low, my
> > >complex data structure resides in one single page and the system has
> > >not to hysterically swap from disk to deal with a pathological
> > >fragmentation.
> >=20
> > IMHE, people who try to make the allocation 'as fast as possible' are
> > just chasing a squirrel up the wrong tree.
> >=20
> > No sir, you want to make your clients' code as fast as possible.
> >=20
> > Sometimes people recommend a class-based linked list ('look! it's two
> > instructions!'). In my onion, this is disasterous. Works great for to=
y
> > programs, but when you have lots-of-data and long-running combined, you=
r
> > memory turns into an omelet.
> >=20
> > You may as well send invitations to Mr. Cache Miss and Ms. Tlb Fail...
> >=20
> > If you have a 'fast' allocator that doesn't have the slighest idea of
> > locality, and a 'slow' allocator that takes some care of it, the
> > 'slower' code will kick the 'faster' code arse every time.
>=20
> This is a system level problem. The allocator under a huge dram space wi=
th=20
> a lot free space remained should be designed toward fast easily.=20
>=20
> But if the dram is near fully allocated then a good heap walker that can=
=20
> swap pages to the hard disk is important toward reliability not the speed=
=20
> part in the allocator to degrade the system speed.
The heap managing module is very important in the OS.=20
Normally all allocations from malloc are allocated in sizes of 2's powers
that are 2 to S powers for S=3D4, 5, 6, ..12 to 16.=20
A page might be 2K bytes or 64K or even 4M(20 bits) bytes in various system=
s.
There are trivial details.=20
The heap walker part for each node in the heap space is more important.=20
After a heap walk, the buddy algorithm can be applied.=20
Do we need to discuss the details here?
=20
=20
|
|
0
|
|
|
|
Reply
|
dihedral88888 (786)
|
1/6/2012 1:06:10 AM
|
|
"BartC" <bc@freeuk.com> writes:
> "Eric Sosman" <esosman@ieee-dot-org.invalid> wrote in message
> news:je2uhu$6s5$1@dont-email.me...
> > On 1/4/2012 4:52 PM, Joe keane wrote:
> >> In article<jdsact$nr6$1@dont-email.me>,
> >> Eric Sosman<esosman@ieee-dot-org.invalid> wrote:
> >>> If you have a forty-four-byte struct and allocate sixty-four
> >>> bytes to hold it, how have you avoided *either* sort of "waste?"
> >>
> >> If the cache line is 16 bytes, we know that each object will take only
> >> three lines; the fourth is never used [it may be worse with the VM].
> >> For this case 48 would work just as well (the compiler will do that for
> >> you if it thinks 8-byte alignment is a good idea). But then if the
> >> cache line is 32 bytes, same argument applies.
> >
> > All I can say is that you and I have divergent notions of
> > efficiency. I'm thinking "I can run over the entire array with
> > just N*44/C cache misses; Joe would prefer N*64/C." That is, I
> > think you'll incur 46% more cache misses than I will. Tell me
> > again, please, what I have wasted by using 31% fewer cache line
> > loads than you would?
>
> I think he means that the last 16-bytes of the 64 might never be accessed,
> so will never be fetched from memory in the first place; there is no cache
> miss.
Were there to be no hits on google for "64 bytes (one cache line)", then
you might have a point. The fact that most of such hits refer to the single
most popular general purpose computer architecture in the world diminishes
your point significantly. Those bytes are being pulled into the cache whether
you access them or not.
> (That does depend on how the 64-bytes are used; if the whole struct is
> copied, the compiler might access all 64-bytes anyway.)
>
> And if the cache blocks are 32-bytes, then it doesn't matter.
Nonsense. An adjacent bunch of Eric's structures being pulled into the
working set will displace fewer cache lines than the same number of your
structs. It's only if the cache lines are 16 bytes (i.e. not any of the
half dozen architectures I've worked with in recent years) that both
will cause 48 bytes to be churned.
> > I'm also thinking "I need N*44/P pages for my array; Joe would
> > rather use N*64/P." That is, I think your working set size will
> > be 46% larger than mine, your likelihood of taking a page fault
> > will be 46% greater than mine. Tell me again, please, what have
> > I wasted by using 31% fewer memory pages?
>
> No one was advocating wasting 20 bytes, but finding a use for them, perhaps
> moving data there that would have been stored elsewhere.
How about the "use" of "still being available for subsequent allocations"?
> > It's an odd sort of "waste" that spends fewer resources.
>
> If you had a heavily used struct, and it turned out it had 65 bytes, that
> wouldn't bother you any? You wouldn't try and get it down to 64?
Straw man.
Phil
--
Unix is simple. It just takes a genius to understand its simplicity
-- Dennis Ritchie (1941-2011), Unix Co-Creator
|
|
0
|
|
|
|
Reply
|
thefatphil_demunged (1558)
|
1/7/2012 3:55:28 AM
|
|
"Phil Carmody" <thefatphil_demunged@yahoo.co.uk> wrote in message
news:87obugw4i7.fsf@bazspaz.fatphil.org...
> "BartC" <bc@freeuk.com> writes:
>> I think he means that the last 16-bytes of the 64 might never be
>> accessed,
>> so will never be fetched from memory in the first place; there is no
>> cache
>> miss.
>
> Were there to be no hits on google for "64 bytes (one cache line)", then
> you might have a point. The fact that most of such hits refer to the
> single
> most popular general purpose computer architecture in the world diminishes
> your point significantly. Those bytes are being pulled into the cache
> whether
> you access them or not.
I've no idea what cache-line sizes are being used these days. If it's 64
instead of 16 then perhaps we'd be talking about a 176-byte versus a
256-byte struct instead.
>> And if the cache blocks are 32-bytes, then it doesn't matter.
>
> Nonsense. An adjacent bunch of Eric's structures being pulled into the
> working set will displace fewer cache lines than the same number of your
> structs.
That's true, if you touch any element of a 64-byte struct, and the cache
lines are 64-bytes, and the struct is aligned on 64-bytes, then a set of,
say, 1000 serial accesses will require 1000 memory fetches, instead of 688
if they were only 44 bytes. That's why I advocated not wasting that extra 20
bytes per struct. (In fact I suggested making do with 32 in the first
place.)
The problem with 44-bytes in this case, is that if you're touching two parts
of the struct at a time, and you're doing random access (which is what
arrays are for), you'll frequently be fetching two cache lines instead of
one, as the structs will straddle a 64-byte boundary.
And if you are only touching one member of the struct at a time, and are
doing serial accesses, then perhaps you'd be better off using a set of
parallel arrays, for at least some of the struct members. If a field is
4-bytes wide, then that would require only about 1/11 of the memory fetches
as the 44-byte struct.
You're quite likely to find these narrower array elements happen to have
power-of-two widths..
Anyway, look at this struct (compiled with default padding):
struct record{
char c1;
double d1;
char c2;
double d2;
char c3;
double d3;
char c4;
} x;
On my machine, these 28 bytes end up occupying 56, exactly double!
Rearrange these fields so all the small ones are at the end, and still you
get 32, not 28!
Wasting bytes is obviously OK when the compiler does it...
(BTW I put that struct into another language, and it consistently gave a
size of 28 bytes no matter how the fields are ordered.)
--
Bartc
|
|
0
|
|
|
|
Reply
|
bc (2211)
|
1/7/2012 12:02:36 PM
|
|
FtM=E6=96=BC 2011=E5=B9=B412=E6=9C=8831=E6=97=A5=E6=98=9F=E6=9C=9F=E5=85=AD=
UTC+8=E4=B8=8B=E5=8D=8810=E6=99=8256=E5=88=8653=E7=A7=92=E5=AF=AB=E9=81=93=
=EF=BC=9A
> [xpost comp.lang.c, comp.unix.programmer, comp.os.ms-
> windows.programmer.win32]
> Hello NG,
> I have a very specific question, but first of all I describe you what
> I'm facing as I'm open to alternative general solutions..
> I'm refactoring some old C code that deals with very basilar data
> structures (like lists, priority queues, hashtables and so on) that
> pretends to be very portable. Regarding the memory management, I have
> an internal wrapper that, on *nix and windows systems, maps the malloc/
> calloc/realloc/free functions one-to-one to the default ones, but it's
> possibile to switch to another custom allocator that works pretty much
> as a standard allocator (a big chunk of memory splitted/joined, tought
> for embedded systems).
> I would like to implement my custom allocator for the *nix and windows
> systems too (of course loosing some portability for performance gain),
> and maybe implement something like the C++ allocators to improve the
> memory management of the different algorithms (like rare large
> allocations on vectors / frequent small allocations on lists), but
> I've some questions:
> - In these systems, to have decent performances, I was thinking to
> allocate page-sized (and page-aligned) chunks to have small
> allocations in the same page, but how is it possible to request some
> memory with this characteristics? Will it help the OS memory manager?
> - Does all of this make any sense? :-)
> Thanks for any suggestion!
> Ciao!
>=20
> P.S. I don't really have the necessity of a "performance boost", but
> I'm dealing with a general, low-level and already used library, and as
> long as I'm refactoring the code I'd like to do it the better possible
> way ;-)
I mean the allocation requests for the heap space can be modeled=20
as a discrete random process X[n] for n=3D0,1,2,3... in an application=20
and so is the free process Y[m] for m=3D0, 1,2,3 ....=20
Therefore the amount of memory used can be counted as=20
for(i=3D0, s1=3D0;i<n;i++) s1+=3DX[i];
for(i=3D0, s2=3D0;i<m;i++) s2+=3DY[i];=20
The number s1-s2 is the amount of space used is another random process, too=
..=20
The OS part does not know the two random processes better than the programm=
er=20
who is the author of the application. Thus a customized heap manager=20
can beat the OS as proved in the above.=20
=20
|
|
0
|
|
|
|
Reply
|
dihedral88888 (786)
|
1/7/2012 1:20:25 PM
|
|
|
65 Replies
37 Views
(page loaded in 1.233 seconds)
|