I find that I could in theory get a performance boost either by using a
RAID5 via mdadm or by striping via LVM. Let's assume redundancy is not a
concern, merely performance.
What's the difference between these two approaches, and is one better than
the other?
--
Rahul
Posted by Rahul on 1/17/2010 8:46:17 PM

On Sunday 17 January 2010 21:46 in comp.os.linux.misc, somebody
identifying as Rahul wrote...
> I find that I could in theory get a performance boost either by using
> a RAID5 via mdadm or by striping via LVM. Let's assume redundancy is
> not a concern merely performance boosting.
>
> What's the difference in these two approaches and is one better than
> the other?
A true RAID 5 means that you need at least three disks, in which case
the data will, per data segment, be striped over two disks, and the
third disk will hold a parity block. Distribution of the parity blocks
is staircased, meaning that the parity block will be put on a different
disk in the array per data segment, like so...
Data segment    Disk 1      Disk 2      Disk 3
A               A-1         A-2         A-parity
B               B-1         B-parity    B-2
C               C-parity    C-1         C-2
D               D-1         D-2         D-parity
E               E-1         E-parity    E-2
F               F-parity    F-1         F-2
...             ...         ...         ...
Writing to a RAID 5 is slower than writing to a single disk because with
each write, the parity block must be updated, which means calculating
the parity data and writing it to the appropriate disk.
Reading from a (non-degraded) RAID 5 however is fast and comparable to
RAID 0, also known as "striping", because the parity block need not be
read unless the array is running in degraded mode, i.e. with one of
the disks having failed and the missing data being recalculated from
the parity block.
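To make the parity idea concrete, here is a minimal Python sketch. It assumes the parity block is a plain byte-wise XOR of the data blocks and that parity rotates across the disks as in the table above; the function names and sample bytes are made up for illustration and are not taken from any RAID implementation.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def parity_disk(segment, n_disks):
    """Index (0-based) of the disk holding parity for a segment.

    Assumed rotation: segment 0 puts parity on the last disk, segment 1
    on the one before it, and so on, matching the table above.
    """
    return (n_disks - 1) - (segment % n_disks)

# Segment A on a three-disk array: two data blocks plus one parity block.
a1 = b"\x0f\xf0\xaa"
a2 = b"\x33\x55\x0f"
a_parity = xor_blocks([a1, a2])

# Degraded mode: the disk holding A-2 is gone, so rebuild it from
# the surviving data block and the parity block.
rebuilt_a2 = xor_blocks([a1, a_parity])
print(rebuilt_a2 == a2)
print([parity_disk(s, 3) for s in range(6)])
```

The same XOR that produces the parity also recovers a missing block, which is why a non-degraded read never needs to touch the parity at all.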
A plain stripeset on the other hand only requires two disks, and simply
does what the above does, but without parity blocks. So you'd have a
set up like this...
Data segment    Disk 1      Disk 2
A               A-1         A-2
B               B-1         B-2
C               C-1         C-2
...             ...         ...
In this case, you don't have any redundancy. Writing to the stripeset
is faster than writing to a single disk, and the same applies for
reading. It's not a 2:1 performance boost due to the overhead for
splitting the data for writes and re-assembling it upon reads, but
there is a significant performance improvement, and especially so if
you use more than two disks.
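As a rough illustration of how a stripeset spreads data, the following Python sketch maps a logical chunk number to a disk and a position on that disk. It assumes simple round-robin placement of fixed-size chunks; the function name is hypothetical and real implementations add details such as configurable chunk sizes.

```python
def raid0_location(chunk, n_disks):
    """Map a logical chunk number to (disk index, chunk offset on that disk).

    Assumes plain round-robin placement, as in the table above.
    """
    return chunk % n_disks, chunk // n_disks

# Two disks: chunks 0 and 1 form segment A, chunks 2 and 3 form segment B, ...
layout = [raid0_location(c, 2) for c in range(6)]
print(layout)
```

Consecutive chunks land on different disks, so a large sequential read or write keeps all disks busy at once.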
Now, you can use virtually any kind of software RAID set-up
with /mdadm/ - including RAID 0 - and things like LVM can offer you a
similar set-up - you don't even need either of them if you want to
apply striping to the swap partition because this can be achieved by
simply giving two swap partitions on separate disks an equal priority
in "/etc/fstab".
If striping without redundancy is what you want, then you can go either
way, i.e. RAID 0 via /mdadm/ or via the older /dmraid/ or striping
implemented at the logical volume management level. The only
difference is in the layer of the kernel in which this will be handled,
so whether you set it up via /mdadm/ - or even via /dmraid/ - versus
setting it up via the logical volume manager, it is still software
RAID, and I don't think there would be any significant - i.e. humanly
noticeable - difference in performance.
There are however a few considerations you should take into account with
both of these approaches, i.e. that you should not put the filesystem
which holds the kernels and /initrd/ - and preferably not the root
filesystem either[1] - on a stripe, because the bootloader recognizes
neither software RAID nor logical volume management. It's a
chicken-and-egg thing, i.e. the drivers for LVM and software RAID are
in the Linux kernel, so you have to be able to load the Linux kernel first
before you can make use of those drivers. GRUB does not have any
drivers for that, and the way LILO works it would also not be able to
load a kernel off of a striped filesystem.
[1] Having the root filesystem on a software RAID stripeset will work
only if you have an initrd which contains *all* the required driver
modules, since there is no control over the order of the automatic
module loading by the kernel itself. It loads the modules according
to the hardware it finds, and if it needs a module off of the root
filesystem before the RAID or LVM modules have been loaded, then
you're foobarred.
--
*Aragorn*
(registered GNU/Linux user #223157)
Posted by Aragorn on 1/18/2010 11:32:58 AM

Rahul wrote:
> I find that I could in theory get a performance boost either by using a
> RAID5 via mdadm or by striping via LVM. Let's assume redundancy is not a
> concern merely performance boosting.
>
> What's the difference in these two approaches and is one better than the
> other?
>
LVM is for logical volume management, mdadm is for administering
multiple disk setups (i.e., software raid). LVM /can/ do basic
striping, in that if you have two physical volumes allocated to the same
volume group, then a logical volume can be striped across the two
physical volumes. As another poster has said, you won't notice a
performance difference between striping via LVM or mdadm. But you
/will/ notice a difference in the administration and commands used - it
is more convenient to use mdadm for raid than LVM.
My recommendation is that you use mdadm to create a raid from the raw
drives or partitions on the drives, and if you want the volume
management features of LVM (I find it very useful), put LVM on top of
mdadm raid.
As for the type of raid to use, that depends on the number of disks you
have and the redundancy you want. raid5 is well-known to be slower for
writing, especially for smaller writes, and it can be risky for large
disks in critical applications (since rebuilding takes so long, and
wears the other disks). Mirroring is safer, and mdadm can happily do
a raid10 (roughly a stripe of mirrors) on any number of disks for high
speed and mirrored redundancy.
Booting from raids is complicated, but not as difficult as suggested by
another poster. Modern grub can handle a /boot partition on a raid1 or
raid0 mdadm setup, although it's a little inconvenient to install -
you typically have to manually run grub to install the first stage
bootloader on each disk's boot sector individually.
The last server I configured had three disks. I partitioned each into a
small partition (1G) and a big partition (the rest of the disk). The
small partitions I joined in an mdadm raid1 (mirror), and use for /boot.
The big partitions are in a raid10 mdadm block, used as an LVM
physical volume, with logical volumes for various parts of the system
and virtual machines. It will happily run and boot with any one of the
drives removed.
Posted by David on 1/18/2010 9:13:30 PM

Aragorn <aragorn@chatfactory.invalid> wrote in news:hj1gta$2hp$5
@news.eternal-september.org:
Thanks for the great explanation!
> Writing to a RAID 5 is slower than writing to a single disk because with
> each write, the parity block must be updated, which means calculation
> of the parity data and writing that parity data to the pertaining disk.
This is where I get confused. Is writing to a RAID5 slower than a single
disk irrespective of how many disks I throw at the RAID5? I currently have
a 7-disk RAID5. Will writing to this be slower than a single disk? Isn't
the parity calculation a fairly fast process, especially if one has a
hardware-based card? And then if the write gets split into 6 parts, shouldn't
that speed up the process since each disk is writing only 1/6th of the
chunk?
>
> In this case, you don't have any redundancy. Writing to the stripeset
> is faster than writing to a single disk, and the same applies for
> reading. It's not a 2:1 performance boost due to the overhead for
> splitting the data for writes and re-assembling it upon reads, but
> there is a significant performance improvement, and especially so if
> you use more than two disks.
Why doesn't a similar boost come out of a RAID5 with a large number of
disks? Merely because of the parity calculation overhead?
>
> There are however a few considerations you should take into account with
> both of these approaches, i.e. that you should not put the filesystem
> which holds the kernels and /initrd/ - and preferably not the root
> filesystem either[1] - on a stripe, because the bootloader recognizes
Luckily that is not needed. I have a separate drive to boot from. The RAID
is intended only for user /home dirs.
--
Rahul
Posted by Rahul on 1/19/2010 7:37:31 AM

David Brown <david.brown@hesbynett.removethisbit.no> wrote in
news:BtOdnakm8taiUsnWnZ2dnUVZ8qOdnZ2d@lyse.net:
Thanks David!
> Rahul wrote:
>
> LVM is for logical volume management, mdadm is for administering
> multiple disk setups (i.e., software raid). LVM /can/ do basic
> striping, in that if you have two physical volumes allocated to the
> same volume group, then a logical volume can be striped across the two
> physical volumes. As another poster has said, you won't notice a
> performance difference between striping via LVM or mdadm. But you
Will putting LVM on top of mdadm slow things down? Or does LVM not have a
significant performance penalty?
>
> My recommendation is that you use mdadm to create a raid from the raw
> drives or partitions on the drives, and if you want the volume
> management features of LVM (I find it very useful), put LVM on top of
> mdadm raid.
This is exactly what I was trying to do. But LVM asks "stripe" or "no
stripe". That I wasn't sure about.
> As for the type of raid to use, that depends on the number of disks
> you have and the redundancy you want. raid5 is well-known to be
> slower for writing, especially for smaller writes, and it can be risky
> for large disks in critical applications
Maybe if I explain my situation you can have some more comments.
I have 3 physical "storage boxes" (MD-1000's from Dell). Each takes 15
SAS 15k drives of 300 GB each. i.e. I have a total of 45 drives of 300 GB
each. Redundancy is important but not critical. Performance is more
important.
My original plan was to split each box into two RAID5 arrays of 7 disks
each and leave 1 as a hot spare. Thus I get 6 RAID5 arrays in all. They
are visible as /dev/sdb /dev/sdc etc. but I want to mount a single /home
on it. That's where I introduced LVM. But then LVM again introduces a
striping option. Should I be striping or not?
That's where I am confused about what my best option is. It's hard to
balance redundancy, performance and disk capacity.
Any other creative options that come to mind?
>(since rebuilding takes so
> long, and wears the other disks). Mirroring is safer, and mdadmin can
> happily do a raid10 (roughly a stripe of mirrors) on any number of
> disks for high speed and mirrored redundancy.
>
> Booting from raids is complicated, but not as difficult as suggested
Luckily I don't have to go down that path; I have a separate drive to
boot from.
--
Rahul
Posted by Rahul on 1/19/2010 7:44:53 AM

On Tuesday 19 January 2010 08:37 in comp.os.linux.misc, somebody
identifying as Rahul wrote...
> Aragorn <aragorn@chatfactory.invalid> wrote in news:hj1gta$2hp$5
> @news.eternal-september.org:
>
> Thanks for the great explaination!
Glad you appreciated it. ;-)
>> Writing to a RAID 5 is slower than writing to a single disk because
>> with each write, the parity block must be updated, which means
>> calculation of the parity data and writing that parity data to the
>> pertaining disk.
>
> This is where I get confused. Is writing to a RAID5 slower than a
> single disk irrespective of how many disks I throw at the RAID5?
Normally, yes, although it won't be *much* slower. But there is some
overhead in the calculation of the parity, yes. This is why RAID 6 is
even slower during writes: it stores *two* parity blocks per data
segment (and as such, it requires a minimum of 4 disks).
> I currently have a 7-disk RAID5. Will writing to this be slower than a
> single disk?
A little, yes. But reading from it will be significantly faster.
> Isn't the parity calculation a fairly fast process especially if one
> has a hardware based card?
Ah, but with a hardware-based RAID things are different. The actual
writing process will still be somewhat slower than writing to a single
disk, but considering that everything is taken care of by the hardware
and that such adapters have a very large cache - often backed by a
battery - this will not really have a noticeable performance impact.
With hardware RAID, the kernel treats the entire array as a single disk
and will simply write to the array. As far as the operating system is
concerned, that's where it ends, and the array takes care of everything
else from there, in a delayed fashion, but this is not something you
notice as your actual CPU(s) are freed up again as soon as the data is
transferred to the memory of the RAID adapter.
It is however advised if you have a hardware RAID adapter to disable the
write barriers. Write barriers are where the kernel forces the disk
drives to flush their caches. Since a hardware RAID adapter must be in
total control of the disk drives and has cache memory of its own, the
operating system should never force the disk drives to flush their
cache.
> And then if the write gets split into 6 parts shouldnt that speed up
> the process since each disk is writing only 1/6th of the chunk?
Yes, but the data has to be split up first - which is of course a lot
faster on hardware RAID since it is done by a dedicated processor on
the adapter itself then - and the parity has to be calculated. This is
overhead which you do not have with a single disk.
>> In this case, you don't have any redundancy. Writing to the
>> stripeset is faster than writing to a single disk, and the same
>> applies for reading. It's not a 2:1 performance boost due to the
>> overhead for splitting the data for writes and re-assembling it upon
>> reads, but there is a significant performance improvement, and
>> especially so if you use more than two disks.
>
> Why doesn;t a similar boost come out of a RAID5 with a large number of
> disks? Merely because of the parity calculation overhead?
Yes, that is the main difference. Like I said, RAID 6 is even slower
during writes (and has equal performance during reads).
>> There are however a few considerations you should take into account
>> with both of these approaches, i.e. that you should not put the
>> filesystem which holds the kernels and /initrd/ - and preferably not
>> the root filesystem either[1] - on a stripe, because the bootloader
>> recognizes
>
> Luckily that is not needed. I have a seperate drive to boot from. The
> RAID is intended only for user /home dirs.
Ah but wait a minute. As I understand it, you have a hardware RAID
adapter card. In that case - assuming that it is a real hardware RAID
adapter and not one of those on-board fake-RAID things - it doesn't
matter, because to the operating system (and even to the BIOS), the
entire array will be seen as a single disk. So then it is perfectly
possible to have your bootloader, your "/boot" and your "/" living on
the RAID array. (I am doing that myself on one of my machines, which
has two RAID 5 arrays of four disks each.)
And in this case - i.e. if you have a hardware RAID array - then your
original question regarding software RAID 0 versus striping via LVM is
also answered, because hardware RAID will always be a bit faster than
software RAID or striped LVM. Additionally, since you mention seven
disks, you could even opt for RAID 10 or 51 and even have a "hot spare"
or "standby spare". (Or you could use the extra disk as an individual,
standalone disk.)
RAID 10 is where you have a mirror (i.e. RAID 1) which is striped to
another mirror - you could instead also use RAID 01, which is a stripe
which is mirrored on another stripe. RAID 10 is better than RAID 01
though - there's a good article on Wikipedia about it. RAID 10 or 01
require at least four disks in total. Performance is very good for both reading
and writing *and* you have redundancy.
Similarly, RAID 51 is where you have a RAID 5 which is mirrored onto
another RAID 5. Or you could use RAID 15, which is a RAID 5 comprised
of mirrors. RAID 51 and 15 require a minimum of six disks.
(Similarly, there is RAID 61 and 16, which require a minimum of eight
disks.)
There is of course a trade-off. Except for RAID 0, which isn't really
RAID because it has no redundancy, all RAID solutions are expensive in
diskspace, and how expensive exactly depends on the chosen RAID type.
In a RAID 1, RAID 10 or RAID 01 set-up, you lose 50% of your storage
capacity.
With RAID 5, your storage capacity is reduced by the capacity of one
disk in the array, and with RAID 6 by the capacity of two disks in the
array. So, with a single RAID 5 array comprised of seven disks without
a standby or hot spare, your total storage capacity is that of six
disks.
And then there's the lost capacity of the hot spare or standby spare - a
hot spare is spinning but otherwise unused until one of the other disks
starts to fail, while a standby spare is spun down until one of the
other disks fails. Upon such failure, the array will be automatically
rebuilt using the parity blocks to write the missing data to the spare
disk.
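The capacity trade-offs described above can be tabulated with a small Python sketch. It assumes two-way mirroring for RAID 1, 10 and 01 (the "lose 50%" case) and a drive size of 300 GB, a figure taken from elsewhere in this thread; the function name and the choice of seven disks are for illustration only.

```python
def usable_disks(level, n_disks):
    """Number of disks' worth of usable capacity for a given RAID level.

    Assumes two-way mirrors for raid1/raid10/raid01; spares are not counted.
    """
    if level == "raid0":
        return n_disks
    if level in ("raid1", "raid10", "raid01"):
        return n_disks / 2
    if level == "raid5":
        return n_disks - 1
    if level == "raid6":
        return n_disks - 2
    raise ValueError(f"unknown RAID level: {level}")

DISK_GB = 300  # assumed drive size, as mentioned elsewhere in the thread

# Seven disks available; the mirrored layout uses six plus one spare.
for level, n in [("raid0", 7), ("raid5", 7), ("raid6", 7), ("raid10", 6)]:
    print(f"{level} on {n} disks: {usable_disks(level, n) * DISK_GB:.0f} GB usable")
```

Under these assumptions the seven-disk RAID 5 gives 1800 GB against 2100 GB for RAID 0, while a six-disk RAID 10 with one spare gives 900 GB.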
The bottom line...: A seven-disk RAID 0 would be faster than a RAID 5
during writes, but not really significantly faster during reads, and
you would have the full storage capacity of all disks in the array, but
there would be no redundancy at all. So, considering that you have
seven disks, I think you really should consider building in redundancy.
After all, with RAID 0, if a single disk in the array fails, then
you'll have lost all of your data. A RAID 5 would upon failure of a
single disk run slower, but at least you'd still have access to your
data.
--
*Aragorn*
(registered GNU/Linux user #223157)
Posted by Aragorn on 1/19/2010 9:28:39 AM

Aragorn <aragorn@chatfactory.invalid> wrote in news:hj3u07$etq$4
@news.eternal-september.org:
>> Thanks for the great explaination!
>
> Glad you appreciated it. ;-)
Of course I did! Your comments have been infinitely more helpful than the
hours I spent with my vendor's stupid helpdesk! :)
>
> Ah but wait a minute. As I understand it, you have a hardware RAID
> adapter card. In that case - assuming that it is a real hardware RAID
> adapter and not one of those on-board fake-RAID things - it doesn't
Ah yes. Fake-RAID. I have been trying to figure out if mine is real or fake.
I have a Dell PERC-e card and I hope it is "real". Is there a way to tell
"fake RAID" apart?
>
> And in this case - i.e. if you have a hardware RAID array - then your
> original question regarding software RAID 0 versus striping via LVM is
> also answered, because hardware RAID will always be a bit faster than
> software RAID or striped LVM.
Ok that's good to know. The reason I ask is I wasn't sure if, in spite of
having a hardware card, I ought to export individual drives and then use mdadm
to manage them.
With my 45 drives total there are a lot of options and it's hard to calculate
which is the best one....
--
Rahul
Posted by Rahul on 1/19/2010 6:43:10 PM

On Tuesday 19 January 2010 19:43 in comp.os.linux.misc, somebody
identifying as Rahul wrote...
> Aragorn <aragorn@chatfactory.invalid> wrote in news:hj3u07$etq$4
> @news.eternal-september.org:
>
>>> Thanks for the great explaination!
>>
>> Glad you appreciated it. ;-)
>
> Of course I did! Your comments have been infinitely more helpful than
> the hours I spent with my vendors stupid helpdesk! :)
Oh, I experienced quite recently that the people who
populate a helpdesk are nothing but clerks with a FAQ in front of their
nose. Ask them anything that's not listed in that FAQ and they're
clueless. ;-)
>> Ah but wait a minute. As I understand it, you have a hardware RAID
>> adapter card. In that case - assuming that it is a real hardware
>> RAID adapter and not one of those on-board fake-RAID things - it
>> doesn't
>
> Ah yes. Fake-RAID. I have been trying to figure out if mine is real or
> fake. I have a Dell PERC-e card and I hope it is "real", Is there a
> way to tell "fake RAID" apart?
Well, since you are speaking of a separate plug-in card and since you've
mentioned elsewhere that you're using SAS drives, I would be inclined
to think that it is a genuine hardware RAID adapter. I do seem to
remember that the Dell PERC cards are based upon an LSI, Qlogic or
Adaptec adapter.
If each of the arrays of disks is seen by the kernel as a single disk,
then it's hardware RAID. Fake-RAID is generally reserved for IDE and
SATA disks, but I haven't encountered it for SAS or SCSI yet.
>> And in this case - i.e. if you have a hardware RAID array - then your
>> original question regarding software RAID 0 versus striping via LVM
>> is also answered, because hardware RAID will always be a bit faster
>> than software RAID or striped LVM.
>
> Ok that's good to know. The reason I ask is I wasn't sure if in-spite
> of having a hardware card I ought to export individual drives and then
> use mdadm to manage.
No no, that's totally unnecessary. With true hardware RAID, the
operating system will see an entire RAID array as being a single disk
and will treat it accordingly. Everything else is done by the RAID
adapter itself. Just make sure - as I made reference to earlier - that
you disable write barriers - this is done at mount time, via a mount
option.
> With my 45 drives total there are a lot of options and its hard to
> calculate which is the best one....
I would personally not use all of them for "/home". You mention three
arrays, so I would suggest the following...:
° First array:
    - /boot
    - /
    - /usr
    - /usr/local
    - /opt
    - an optional rescue/emergency root filesystem

° Second array:
    - /var
    - /tmp (Note: you can also make this a /tmpfs/ instead.)
    - /srv (Note: use at your own discretion.)

° Third array:
    - /home
--
*Aragorn*
(registered GNU/Linux user #223157)
Posted by Aragorn on 1/19/2010 7:52:06 PM

Aragorn <aragorn@chatfactory.invalid> wrote in
news:hj52h6$lr7$2@news.eternal-september.org:
>
> I would personally not use all of them for "/home". You mention three
> arrays, so I would suggest the following...:
>
> ° First array:
> - /boot
> - /
> - /usr
> - /usr/local
> - /opt
> - an optional rescue/emergency root filesystem
>
> ° Second array:
> - /var
> - /tmp (Note: you can also make this a /tmpfs/ instead.)
> - /srv (Note: use at your own discretion.)
>
> ° Third array:
> - /home
>
Sorry, I should have clarified. For /boot, /usr etc. I have a
separate mirrored SAS drive. So those are taken care of. Besides, 15x
300GB would be too much storage for any of those trees.
I have all 45 drives bought just to provide a high performance /home.
The question is how best to configure them:
1. What RAID pattern?
2. Do I add LVM on top? This is cleaner than arbitrarily mounting /home1
/home2 etc. But the overhead of LVM worries me.
3. Do I use LVM striping or not? etc.
--
Rahul
Posted by Rahul on 1/19/2010 7:57:29 PM

On Tuesday 19 January 2010 20:57 in comp.os.linux.misc, somebody
identifying as Rahul wrote...
> Aragorn <aragorn@chatfactory.invalid> wrote in
> news:hj52h6$lr7$2@news.eternal-september.org:
>
>> I would personally not use all of them for "/home". You mention
>> three arrays, so I would suggest the following...:
>>
>> ° First array:
>> - /boot
>> - /
>> - /usr
>> - /usr/local
>> - /opt
>> - an optional rescue/emergency root filesystem
>>
>> ° Second array:
>> - /var
>> - /tmp (Note: you can also make this a /tmpfs/ instead.)
>> - /srv (Note: use at your own discretion.)
>>
>> ° Third array:
>> - /home
>>
>
> Sorry, I should have clarified. For /boot /usr etc. all I have a
> seperate mirrored SAS drive. So those are taken care of. Besides 15x
> 300GB would be too much storage for any of those trees.
Oh, okay then.
> I have all 45 drives bought just to provide a high performance /home.
> The question is how best to configure them:
>
> 1. What RAID pattern?
Since it is hardware RAID and you have a whole farm of disks, I'd set
them up as RAID 10 or 01 - i.e. a striped mirror or a mirrored stripe -
with a couple of standby spares. Saves up on some electricity too. ;-)
> 2. Do I add LVM on top? This is cleaner than arbitrarily mounting
> /home1 /home2 etc. But the overhead of LVM worries me.
> 3. Do I use LVM striping or not? etc.
Well, you can go two ways. Once everything is set up hardware-wise, you
can do any of the following:
° Use LVM with striping across the three arrays.
° Use LVM and combine the three partitions - one on each array -
  into a single linear volume. The volume will then fill up
  one array at a time.
° Use /mdadm/ and create a stripe with that, similar to LVM
  striping but using regular partitions. You won't need LVM
  anymore then.
° Use /mdadm/ and create a JBOD (alias "linear array"). This
  is similar to the approach where you fill up one array at a
  time, but now you're not using LVM.
The above range of suggestions is only for usability, mind you. Since
it is a hardware RAID, you will already be maximizing on performance
without any striping implemented via /mdadm/ or LVM. ;-)
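The difference between the linear and the striped options can be sketched in Python as follows. This is a simplified model that treats the volume as numbered, equally sized chunks; actual LVM extent allocation and mdadm chunk handling are more involved, and the function names and sizes here are hypothetical.

```python
def linear_target(chunk, chunks_per_array):
    """Linear volume: fill the first array completely, then move to the next.

    Returns (array index, chunk offset within that array).
    """
    return chunk // chunks_per_array, chunk % chunks_per_array

def striped_target(chunk, n_arrays):
    """Striped volume: rotate consecutive chunks across the arrays.

    Returns (array index, chunk offset within that array).
    """
    return chunk % n_arrays, chunk // n_arrays

# Three arrays of four chunks each (tiny numbers, for illustration only).
linear_arrays = [linear_target(c, 4)[0] for c in range(6)]
striped_arrays = [striped_target(c, 3)[0] for c in range(6)]
print("linear :", linear_arrays)
print("striped:", striped_arrays)
```

With the linear layout the second and third arrays stay idle until the first is full, whereas the striped layout touches all three arrays from the first write onwards.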
--
*Aragorn*
(registered GNU/Linux user #223157)
Posted by Aragorn on 1/19/2010 8:09:42 PM

Rahul wrote:
> David Brown <david.brown@hesbynett.removethisbit.no> wrote in
> news:BtOdnakm8taiUsnWnZ2dnUVZ8qOdnZ2d@lyse.net:
>
> Thanks David!
>
>> Rahul wrote:
>>
>> LVM is for logical volume management, mdadm is for administering
>> multiple disk setups (i.e., software raid). LVM /can/ do basic
>
>> striping, in that if you have two physical volumes allocated to the
>> same volume group, then a logical volume can be striped across the two
>> physical volumes. As another poster has said, you won't notice a
>> performance difference between striping via LVM or mdadm. But you
>
> Will putting LVM on top of mdadm slow things down? Or does LVM not have a
> significant performance penalty?
LVM does have a performance penalty, but it is not normally significant.
If you have a number of logical partitions which you then grow a
number of times, you end up with the actual physical blocks of the
partitions rather scattered across the disk(s), which may impact
performance for streaming or large files. The flexibility you get is
normally worth the slight cost (IMHO).
>> My recommendation is that you use mdadm to create a raid from the raw
>> drives or partitions on the drives, and if you want the volume
>> management features of LVM (I find it very useful), put LVM on top of
>> mdadm raid.
>
> This is exactly what I was trying to do. BUt LVM asks "stripe" or :no
> stripe". THat I wasn;t sure about.
>
>
>> As for the type of raid to use, that depends on the number of disks
>> you have and the redundancy you want. raid5 is well-known to be
>> slower for writing, especially for smaller writes, and it can be risky
>> for large disks in critical applications
>
> Maybe if I explain my situation you can have some more comments.
>
> I have 3 physical "storage boxes" (MD-1000's from Dell). Each takes 15
> SAS 15k drives of 300 GB each. i.e. I have a total of 45 drives of 300 GB
> each. Redundancy is important but not critical. Performance was more
> imporntant.
>
> My original plan was to split each box into two RAID5 arrays of 7 disks
> each and leave 1 as a hot spare. Thus I get 6 RAID5 arrays in all. They
> are visible as /dev/sdb /dev/sdc etc. but I want to mount a single /home
> on it. That's where I introduced LVM. But then LVM again introduces a
> striping option. Should I be striping or not?
>
Don't do any striping with LVM - set up your raid arrays (with hardware
raid and/or mdadm) until you have a single "disk", and put LVM on that.
> That's where I am confuesd about what my best option is. It's hard to
> balance redundancy, performance and disk capacity.
>
>
> Any other creative options that come to mind?
>
>
>
>> (since rebuilding takes so
>> long, and wears the other disks). Mirroring is safer, and mdadmin can
>> happily do a raid10 (roughly a stripe of mirrors) on any number of
>> disks for high speed and mirrored redundancy.
>>
>> Booting from raids is complicated, but not as difficult as suggested
>
> Luckily I don't have to go down that path; I have a seperate drive to
> boot from.
>
Posted by David on 1/19/2010 8:37:43 PM

Aragorn wrote:
> On Tuesday 19 January 2010 08:37 in comp.os.linux.misc, somebody
> identifying as Rahul wrote...
>
>> Aragorn <aragorn@chatfactory.invalid> wrote in news:hj1gta$2hp$5
>> @news.eternal-september.org:
>>
>> Thanks for the great explaination!
>
> Glad you appreciated it. ;-)
>
Unfortunately, there seem to me to be a number of misconceptions in
this post. I freely admit to having more theoretical knowledge from
trawling the net, reading mdadm documentation, etc., than personal
practical experience - so anyone reading this will have to judge for
themselves whether they think I am right, or Aragorn is right. Either
way, I hope to give you some things to think about.
>>> Writing to a RAID 5 is slower than writing to a single disk because
>>> with each write, the parity block must be updated, which means
>>> calculation of the parity data and writing that parity data to the
>>> pertaining disk.
>> This is where I get confused. Is writing to a RAID5 slower than a
>> single disk irrespective of how many disks I throw at the RAID5?
>
> Normally, yes, although it won't be *much* slower. But there is some
> overhead in the calculation of the parity, yes. This is why RAID 6 is
> even slower during writes: it stores *two* parity blocks per data
> segment (and as such, it requires a minimum of 4 disks).
>
Writing to RAID5 (or RAID6) /may/ be slower than writing to a single
disk - or it may be much faster (closer to RAID0 speeds). The actual
parity calculations are negligible with modern hardware, whether it be
the host CPU or a hardware raid card. What takes time is if existing
data has to be read in from the disks in order to calculate the parity -
this causes a definite delay. If you are writing a whole stripe, the
parity can be calculated directly and the write goes at N-1 speed as
each block in the stripe can be written in parallel. This is also the
case if the parts of the block are already in the cache from before.
Thus random writes are slow on RAID5 (and RAID6), but larger block
writes are full speed.
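The difference between a small write and a full-stripe write can be sketched in Python. The I/O counts below follow the usual textbook accounting for XOR parity (read old data and old parity, write new data and new parity) and ignore any caching; the function names and sample bytes are made up for illustration.

```python
def xor_bytes(a, b):
    """XOR two equal-length byte blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

def stripe_parity(data_blocks):
    """Parity of a whole stripe: XOR of all its data blocks."""
    parity = data_blocks[0]
    for block in data_blocks[1:]:
        parity = xor_bytes(parity, block)
    return parity

def small_write(old_data, old_parity, new_data):
    """Read-modify-write: the old data and old parity must be read first."""
    new_parity = xor_bytes(xor_bytes(old_parity, old_data), new_data)
    return new_parity, {"reads": 2, "writes": 2}

def full_stripe_write(new_blocks):
    """Every data block is replaced, so parity comes from the new data alone."""
    return stripe_parity(new_blocks), {"reads": 0, "writes": len(new_blocks) + 1}

# Six data blocks per stripe, as on a 7-disk RAID 5.
stripe = [bytes([i, 2 * i, 3 * i]) for i in range(1, 7)]
parity = stripe_parity(stripe)

# Overwrite a single block via read-modify-write.
new_block = b"\xde\xad\x42"
updated_parity, small_io = small_write(stripe[2], parity, new_block)
stripe[2] = new_block

# Rewriting the whole stripe needs no reads at all.
_, full_io = full_stripe_write(stripe)
print(small_io, full_io)
```

In this model one small write costs four disk operations to change a single block, while a full-stripe write puts six blocks of data on disk with seven writes and no reads.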
There can also be significant differences between the speed of mdadm
software RAID5, and hardware RAID5. With hardware raid, the card can
report a small write as "finished" before it has read in the block and
written out the data and new parity. This is safe for good hardware
with battery backup of its buffers, and gives fast writes (as far as the
host is concerned) even for small writes. Software raid5 cannot do
this. But on the other hand, software raid5 can take advantage of large
system memories for cache, and is thus far more likely to have the
required stripe data already in its cache (especially for metadata and
directory areas of the file system, which are commonly accessed but have
small writes).
This is perhaps also a good time to mention one of the risks of raid5
(and raid6) - the RAID5 Write Hole. When you are writing a stripe to
the disk, the system must write at least two blocks - data and the
updated parity block. These two writes cannot be done atomically - if
you get a system failure at this point, the data and parity blocks may
disagree, and the whole stripe then effectively becomes silent garbage.
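A tiny Python sketch of the write hole, assuming XOR parity on a three-disk array and a crash that lands after the data block is written but before the parity is updated. The block contents are arbitrary sample bytes.

```python
def xor_bytes(a, b):
    """XOR two equal-length byte blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

# A consistent stripe: two data blocks and their parity.
d1, d2 = b"AAAA", b"BBBB"
parity = xor_bytes(d1, d2)

# A write replaces d1, but the system fails before the parity is updated.
d1 = b"CCCC"  # this block reached the disk; the matching parity write did not

# Later the disk holding d2 fails and its contents are rebuilt from
# the surviving data block and the stale parity.
rebuilt_d2 = xor_bytes(d1, parity)
print(rebuilt_d2, rebuilt_d2 == b"BBBB")
```

Nothing reports an error here: the rebuild completes normally, and the block that was never being written at the time of the crash comes back with the wrong contents.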
>> I currently have a 7-disk RAID5. Will writing to this be slower than a
>> single disk?
>
> A little, yes. But reading from it will be significantly faster.
>
Not necessarily - writing will be slower if you do lots of small random
writes, but much faster if you write large blocks.
Also remember that with a 7 disk array under heavy use, you /will/ see a
disk failure at some point. Degraded performance of raid 5 is very
poor, and rebuilds are slow. Some people believe that the chance of a
second disk failure occurring during a rebuild is so large (rebuilds are
particularly intensive for the other disks) that raid 5 should be
considered unsafe for large arrays. Raid 6 is better since it can
survive a second failure, but mirrored raids are safer still.
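For a rough feel of the rebuild risk, here is a back-of-the-envelope Python sketch. It assumes independent failures at a constant rate, and the 3% annual failure rate and 24-hour rebuild time are purely hypothetical inputs, not measurements. It ignores unrecoverable read errors and the extra stress of the rebuild itself, so it likely understates the real risk.

```python
import math

def second_failure_probability(surviving_disks: int,
                               rebuild_hours: float,
                               annual_failure_rate: float) -> float:
    """Chance that at least one surviving disk fails during the rebuild window.

    Assumes independent disks with a constant failure rate (a simplification).
    """
    hourly_rate = -math.log(1.0 - annual_failure_rate) / (365 * 24)
    p_single = 1.0 - math.exp(-hourly_rate * rebuild_hours)
    return 1.0 - (1.0 - p_single) ** surviving_disks

# 7-disk RAID5 with one disk dead: six survivors must last through the rebuild.
p_seven = second_failure_probability(6, rebuild_hours=24, annual_failure_rate=0.03)
# A 14-disk array leaves thirteen survivors exposed for the same window.
p_fourteen = second_failure_probability(13, rebuild_hours=24, annual_failure_rate=0.03)
print(f"7 disks: {p_seven:.5f}  14 disks: {p_fourteen:.5f}")
```

Whatever numbers you plug in, the exposure grows with both the number of disks and the length of the rebuild.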
>> Isn't the parity calculation a fairly fast process especially if one
>> has a hardware based card?
>
A decent host processor will do the parity calculations /much/ faster
than the raid processor on most hardware cards. But the calculations
themselves are not the cause of the latency, it's the extra reads that
take time.
> Ah, but with a hardware-based RAID things are different. The actual
> writing process will still be somewhat slower than writing to a single
> disk, but considering that everything is taken care of by the hardware
> and that such adapters have a very large cache - often backed by a
> battery - this will not really have a noticeable performance impact.
>
> With hardware RAID, the kernel treats the entire array as a single disk
> and will simply write to the array. As far as the operating system is
> concerned, that's where it ends, and the array takes care of everything
> else from there, in a delayed fashion, but this is not something you
> notice as your actual CPU(s) are freed up again as soon as the data is
> transferred to the memory of the RAID adapter.
>
True, but see above for more information.
> It is however advised if you have a hardware RAID adapter to disable the
> write barriers. Write barriers are where the kernel forces the disk
> drives to flush their caches. Since a hardware RAID adapter must be in
> total control of the disk drives and has cache memory of its own, the
> operating system should never force the disk drives to flush their
> cache.
>
Make sure your raid controller has batteries, and that the whole system
is on an UPS!
>> And then if the write gets split into 6 parts shouldn't that speed up
>> the process since each disk is writing only 1/6th of the chunk?
>
> Yes, but the data has to be split up first - which is of course a lot
> faster on hardware RAID since it is done by a dedicated processor on
> the adapter itself then - and the parity has to be calculated. This is
> overhead which you do not have with a single disk.
>
Nonsense - a host CPU is perfectly capable of splitting a stripe into
its blocks in a fraction of a microsecond. It is also much faster at
doing the parity calculations - the host CPU typically runs at least ten
times as fast as the CPU or ASIC on the raid card. And again, the
splitting and parity calculations are not the bottleneck, it's the
latency of the reads needed to calculate the new parity that takes time.
Where a hardware raid card will win is if your IO is a bottleneck, which
can be the case for large fast arrays. In particular, if you have a
mirror raid with software raid, then the host CPU has to write out all
the data twice - with hardware raid, it's the raid card that doubles up
the data.
There are times when top-range hardware raid cards will beat software
raid on speed, but not often - especially with a fast multi-core modern
host cpu. It does, however, depend highly on your raid setup and the
type of load you have - there are no set answers here.
Software raid does of course have a reliability weak point - if your OS
crashes in the middle of a write, you have a bigger chance of hitting
the raid 5 write hole than you would with a hardware raid card with a
battery.
>>> In this case, you don't have any redundancy. Writing to the
>>> stripeset is faster than writing to a single disk, and the same
>>> applies for reading. It's not a 2:1 performance boost due to the
>>> overhead for splitting the data for writes and re-assembling it upon
>>> reads, but there is a significant performance improvement, and
>>> especially so if you use more than two disks.
>> Why doesn't a similar boost come out of a RAID5 with a large number of
>> disks? Merely because of the parity calculation overhead?
>
> Yes, that is the main difference. Like I said, RAID 6 is even slower
> during writes (and has equal performance during reads).
>
Assuming (again!) that you are doing a small write and the old data and
parity blocks are not in the cache, then you have the latency of the
reads (two reads for a single block write on raid 5, and three reads for
raid 6).
For reading, especially for large reads, raid 5 is approximately like
N-1 raid 0 drives, while raid 6 is like N-2 raid 0.
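The rules of thumb above can be written down as a tiny Python sketch. It only encodes the counts given here (cold cache, single-block write, large reads on a healthy array). The function names are made up, and the write count assumes each parity block is rewritten once.

```python
PARITY_BLOCKS = {"raid0": 0, "raid5": 1, "raid6": 2}

def small_write_ios(level: str) -> tuple:
    """(reads, writes) for a single-block update with nothing in the cache."""
    parity = PARITY_BLOCKS[level]
    if parity == 0:
        return (0, 1)          # plain stripe: just write the block
    reads = 1 + parity         # old data block + old parity block(s)
    writes = 1 + parity        # new data block + new parity block(s)
    return (reads, writes)

def large_read_speedup(level: str, n_disks: int) -> int:
    """Approximate multiple of single-disk throughput for large reads."""
    return n_disks - PARITY_BLOCKS[level]

for level in ("raid0", "raid5", "raid6"):
    print(level, small_write_ios(level), large_read_speedup(level, 7))
```

Treat the results as an ordering of the options rather than a prediction - caching and the controller will move the real numbers around.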
>>> There are however a few considerations you should take into account
>>> with both of these approaches, i.e. that you should not put the
>>> filesystem which holds the kernels and /initrd/ - and preferably not
>>> the root filesystem either[1] - on a stripe, because the bootloader
>>> recognizes
>> Luckily that is not needed. I have a separate drive to boot from. The
>> RAID is intended only for user /home dirs.
>
> Ah but wait a minute. As I understand it, you have a hardware RAID
> adapter card. In that case - assuming that it is a real hardware RAID
> adapter and not one of those on-board fake-RAID things - it doesn't
> matter, because to the operating system (and even to the BIOS), the
> entire array will be seen as a single disk. So then it is perfectly
> possible to have your bootloader, your "/boot" and your "/" living on
> the RAID array. (I am doing that myself on one of my machines, which
> has two RAID 5 arrays of four disks each.)
>
> And in this case - i.e. if you have a hardware RAID array - then your
> original question regarding software RAID 0 versus striping via LVM is
> also answered, because hardware RAID will always be a bit faster than
> software RAID or striped LVM. Additionally, since you mention seven
> disks, you could even opt for RAID 10 or 51 and even have a "hot spare"
> or "standby spare". (Or you could use the extra disk as an individual,
> standalone disk.)
>
> RAID 10 is where you have a mirror (i.e. RAID 1) which is striped to
> another mirror - you could instead also use RAID 01, which is a stripe
> which is mirrored on another stripe. RAID 10 is better than RAID 01
> though - there's a good article on Wikipedia about it. RAID 10 or 01
> require four disks in total. Performance is very good for both reading
> and writing *and* you have redundancy.
>
Yes, wikipedia /does/ have some useful information about raid - it's
worth reading.
One thing you are missing here is that Linux mdadm raid 10 is very much
more flexible than just a "stripe of mirrors", which is the standard
raid 10. In particular, you can use any number of disks (from 2
upwards), you can have more than 2 copies of each block (at the cost of
disk space, obviously) for greater redundancy, and you can have a layout
that optimises the throughput for different loads.
For example, a "f2" md raid 10 layout gives you full raid 0 performance
for large reads while being at least as fast as other raids for writing
and random reads (and much faster than raid 5 for small random writes).
It is normally the fastest raid layout with redundancy - though at a 50%
cost in disk space.
<http://en.wikipedia.org/wiki/Non-standard_RAID_levels#Linux_MD_RAID_10>
Raid10 performance is also much less affected by a disk failure, and
rebuilds are faster and less stressful on the system. And a single hot
spare will cover all the disks - you don't need a spare per
However, while a "f2" md raid 10 is probably the fastest setup for
directly connected drives, this is not what you have. You will also
suffer from bandwidth issues if you try to do all the mirroring of all
45 drives in software. In your case, I would recommend raid 10 on each
box - 7 raid1 pairs striped together with a hot spare (assuming the
hardware supports a common hot spare). Your host then sees these three
disks, which you should stripe together with mdadm raid0 - there is no
need for redundancy here, as that is handled at a lower level. Put your
LVM physical volume on top of this if you want the flexibility of LVM -
if you don't need it, don't bother.
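A quick Python sketch of what this layout costs in space, assuming three boxes of 15 drives and the 300GB drive size mentioned elsewhere in the thread. The helper names are hypothetical, and the RAID5 figure (two 7-disk sets plus a spare per box) is only there for comparison.

```python
def raid10_capacity_gb(drives: int, spares: int, drive_gb: int) -> int:
    """Usable space of a stripe of RAID1 pairs, with the spares set aside."""
    active = drives - spares
    if active < 2 or active % 2:
        raise ValueError("need an even number of active drives for RAID1 pairs")
    return (active // 2) * drive_gb

def raid5_capacity_gb(drives: int, drive_gb: int) -> int:
    """Usable space of one RAID5 set: one drive's worth goes to parity."""
    return (drives - 1) * drive_gb

BOXES, DRIVES_PER_BOX, DRIVE_GB = 3, 15, 300

# Suggested layout: 7 RAID1 pairs + 1 hot spare per box, RAID0 across the boxes.
raid10_total = BOXES * raid10_capacity_gb(DRIVES_PER_BOX, spares=1, drive_gb=DRIVE_GB)

# For comparison: two 7-disk RAID5 sets + 1 spare per box.
raid5_total = BOXES * 2 * raid5_capacity_gb(7, DRIVE_GB)

raw_total = BOXES * DRIVES_PER_BOX * DRIVE_GB
print(raid10_total, raid5_total, raw_total)
```

Under those assumptions the raid 10 layout gives 6300GB usable out of 13500GB raw, against 10800GB for the RAID5 variant. That is the price of the faster random writes and quicker rebuilds.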
> Similarly, RAID 51 is where you have a RAID 5 which is mirrored onto
> another RAID 5. Or you could use RAID 15, which is a RAID 5 comprised
> of mirrors. RAID 51 and 15 require a minimum of six disks.
> (Similarly, there is RAID 61 and 16, which require a minimum of eight
> disks.)
>
As a minor point, mdadm raid 5 can work on 2 disks (and raid 6 on three
disks). Such a 2-disk raid 5 is not much use in a working system, but
can be convenient when setting things up or upgrading drives, as you can
add more drives to the mdadm raid 5 later on. It's just an example of
how much more flexible mdadm is than hardware raid solutions.
> There is of course a trade-off. Except for RAID 0, which isn't really
> RAID because it has no redundancy, all RAID solutions are expensive in
> diskspace, and how expensive exactly depends on the chosen RAID type.
> In RAID 1, RAID 10 or RAID 01 set-up, you lose 50% of your storage
> capacity.
>
> With RAID 5, your storage capacity is reduced by the capacity of one
> disk in the array, and with RAID 6 by the capacity of two disks in the
> array. So, with a single RAID 5 array comprised of seven disks without
> a standby or hot spare, your total storage capacity is that of six
> disks.
>
> And then there's the lost capacity of the hot spare or standby spare - a
> hot spare is spinning but otherwise unused until one of the other disks
> starts to fail, while a standby spare is spun down until one of the
> other disks fails. Upon such failure, the array will be automatically
> rebuilt using the parity blocks to write the missing data to the spare
> disk.
>
I have never heard of a distinction between a "hot spare" that is
spinning, and a "standby spare" that is not spinning. Given that spinup
takes a few seconds, and a rebuild often takes many hours, I can't see
you have much to gain by keeping a spare drive spinning. To my mind, a
"hot spare" is a drive that will be used automatically to replace a dead
drive.
An "offline spare" is an extra drive that is physically attached, but
not in use automatically - in the event of a failure, it can be manually
assigned to a raid set. This makes sense if you have several hardware
raid sets defined and want to share a single spare, if the hardware raid
cannot support this (mdadm, of course, supports such a setup with a
shared hot spare).
> The bottom line...: A seven-disk RAID 0 would be faster than a RAID 5
> during writes, but not really significantly faster during reads, and
> you would have the full storage capacity of all disks in the array, but
> there would be no redundancy at all. So, considering that you have
> seven disks, I think you really should consider building in redundancy.
> After all, with RAID 0, if a single disk in the array fails, then
> you'll have lost all of your data. A RAID 5 would upon failure of a
> single disk run slower, but at least you'd still have access to your
> data.
>
David | 1/19/2010 9:57:01 PM
David Brown <david.brown@hesbynett.removethisbit.no> wrote in
news:NKOdnXtWJIFpt8vWnZ2dnUVZ7radnZ2d@lyse.net:
>
> themselves whether they think I am right, or Aragorn is right. Either
> way, I hope to give you some things to think about.
An alternative viewpoint is always good!
> Thus random writes are slow on RAID5 (and RAID6), but larger block
> writes are full speed.
And if I did a RAID10 at hardware level (as you later suggest) I'd get
the speedup on random writes as well? (which are otherwise slow on a
RAID5?) What other way do I have to speed up random writes?
> There can also be significant differences between the speed of mdadm
> software RAID5, and hardware RAID5. With hardware raid, the card can
> report a small write as "finished" before it has read in the block and
> written out the data and new parity. This is safe for good hardware
> with battery backup of its buffers, and gives fast writes (as far as
> the host is concerned) even for small writes. Software raid5 cannot
> do this. But on the other hand, software raid5 can take advantage of
> large system memories for cache, and is thus far more likely to have
> the required stripe data already in its cache (especially for metadata
> and directory areas of the file system, which are commonly accessed
> but have small writes).
Yes, I do have a battery backed up cache on my Hardware card. But from
the point you make above there's something to be said about a software
(mdadm or LVM) on top of hardware approach? This way I get the best of
both worlds? LVM / mdadm will serve out from RAM (I've 48 Gigs of it)
and speed up reads. Writes will be speeded up due to the caches of the
Hardware card. Does this make sense?
> This is perhaps also a good time to mention one of the risks of raid5
> (and raid6) - the RAID5 Write Hole.
This risk is reduced by a battery backed-up cache, correct?
>
>>> I currently have a 7-disk RAID5. Will writing to this be slower than
>>> a single disk?
>>
>> A little, yes. But reading from it will be significantly faster.
>
> Not necessarily - writing will be slower if you do lots of small
> random writes, but much faster if you write large blocks.
And will the reads and large-sequential-writes be even faster if I did a
14 disk RAID5 instead of a 7-disk RAID5?
>
> Make sure your raid controller has batteries, and that the whole
> system is on an UPS!
Yes! Both.
>
> For reading, especially for large reads, raid 5 is approximately like
> N-1 raid 0 drives, while raid 6 is like N-2 raid 0.
Problem is I haven't seen a similar formula mentioned for writes. Neither
large nor small writes. What's an approximate design equation to use to
rate options?
>
> However, while a "f2" md raid 10 is probably the fastest setup for
> directly connected drives, this is not what you have. You will also
> suffer from bandwidth issues
Which bandwidth are we talking about? The CPU-to-controller?
> if you try to do all the mirroring of all
> 45 drives in software. In your case, I would recommend raid 10 on
> each box - 7 raid1 pairs striped together with a hot spare (assuming
> the hardware supports a common hot spare). Your host then sees these
> three disks, which you should stripe together with mdadm raid0 - there
> is no need for redundancy here, as that is handled at a lower level.
> Put your LVM physical volume on top of this if you want the
> flexibility of LVM - if you don't need it, don't bother.
Ah! Thanks! That's a creative solution I hadn't thought about.
>
> I have never heard of a distinction between a "hot spare" that is
> spinning, and a "standby spare" that is not spinning.
Me neither.
>
>> The bottom line...: A seven-disk RAID 0 would be faster than a RAID 5
>> during writes, but not really significantly faster during reads, and
>> you would have the full storage capacity of all disks in the array,
>> but there would be no redundancy at all. So, considering that you
>> have seven disks, I think you really should consider building in
>> redundancy. After all, with RAID 0, if a single disk in the array
>> fails, then you'll have lost all of your data. A RAID 5 would upon
>> failure of a single disk run slower, but at least you'd still have
>> access to your data.
>>
Or I could do the RAID10 that you suggest and stripe on top of three such
arrays using mdadm. I'm thinking about this very interesting option.
Thanks!
--
Rahul
Rahul | 1/19/2010 11:32:39 PM
On 2010-01-19, Rahul <nospam@nospam.invalid> wrote:
> Aragorn <aragorn@chatfactory.invalid> wrote in
> news:hj52h6$lr7$2@news.eternal-september.org:
>
>>
>> I would personally not use all of them for "/home". You mention three
>> arrays, so I would suggest the following...:
>>
>> * First array:
>>     - /boot
>>     - /
>>     - /usr
>>     - /usr/local
>>     - /opt
>>     - an optional rescue/emergency root filesystem
>>
>> * Second array:
>>     - /var
>>     - /tmp (Note: you can also make this a /tmpfs/ instead.)
>>     - /srv (Note: use at your own discretion.)
>>
>> * Third array:
>>     - /home
>>
>
> Sorry, I should have clarified. For /boot, /usr, etc. I have a
> separate mirrored SAS drive. So those are taken care of. Besides 15x
> 300GB would be too much storage for any of those trees.
>
> I have all 45 drives bought just to provide a high performance /home.
> The question is how best to configure them:
>
> 1. What RAID pattern?
Do you want speed or do you want size or do you want redundancy?
I have just instituted raid0 (striped) across two partitions on two
disks (the disks are identical, and the partitioning of them is
identical). These are 500GB WD 7200rpm SATA disks. hdparm -t gives about
82MB/s
I used mdadm to set up a raid0 (first bringing in the raid0 module)
on two 450GB partitions, one on each of the drives, and mounted the
resultant /dev/md0, after formatting it as ext3, onto /local
I then did
cat /dev/zero > /local/a
for 12 sec, and a was then a 2GB file, so writing to that disk (assuming
writing all 0 from cat does not produce some sort of sparse file) went
at about 160MB/s, ie twice as fast as reading from a single disk.
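If you want to repeat that kind of timed-write test in a more controlled way, a small Python sketch along these lines would do. It assumes you only care about sequential writes of zeros, and the function name and sizes are made up. The fsync matters: without it you are partly timing the page cache rather than the disks.

```python
import os
import tempfile
import time

def write_throughput_mb_s(path: str, total_mb: int = 64, chunk_mb: int = 1) -> float:
    """Write total_mb of zeros to path sequentially and return MB/s."""
    chunk = bytes(chunk_mb * 1024 * 1024)   # a buffer of zero bytes
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(total_mb // chunk_mb):
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())                # make sure the data really hit the device
    elapsed = time.perf_counter() - start
    return total_mb / elapsed

# Small demo on a temporary directory; point it at the array's mount point instead.
with tempfile.TemporaryDirectory() as tmp:
    rate = write_throughput_mb_s(os.path.join(tmp, "a"), total_mb=8)
    print(f"{rate:.0f} MB/s")
```

For a real measurement use a file several times larger than RAM, or at least a few GB as above, so that caching does not dominate the result.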
> 2. Do I add LVM on top? This is cleaner than arbitrarily mounting /home1
> /home2 etc. But the overhead of LVM worries me.
> 3. Do I use LVM striping or not? etc.
You want to use lvm why?
>
>
unruh | 1/19/2010 11:50:51 PM
unruh <unruh@wormhole.physics.ubc.ca> wrote in
news:slrnhlchar.4le.unruh@wormhole.physics.ubc.ca:
Thanks unruh!
> Do you want speed or do you want size or do you want redundancy?
Mainly speed. Size and Redundancy are good but lesser goals. I guess it's
always a tradeoff between all 3.
>
>> 2. Do I add LVM on top? This is cleaner than arbitrarily mounting /home1
>> /home2 etc. But the overhead of LVM worries me.
>> 3. Do I use LVM striping or not? etc.
>
> You want to use lvm why?
>
Because I have 3 different "storage boxes" with 15 drives each. At best I
see three devices /dev/sda /dev/sdb /dev/sdc after I use the hardware RAID
controllers. Logically I just want to mount /home on them.
At worst (If I do 7 disk RAID5's) I might see 6 physical drives. Then again
LVM would aggregate them and I could mount /home.
I am open to other suggestions.
--
Rahul
Rahul | 1/20/2010 12:23:12 AM
On Tuesday 19 January 2010 22:57 in comp.os.linux.misc, somebody
identifying as David Brown wrote...
> Aragorn wrote:
>
>> On Tuesday 19 January 2010 08:37 in comp.os.linux.misc, somebody
>> identifying as Rahul wrote...
>>
>>> Aragorn <aragorn@chatfactory.invalid> wrote in news:hj1gta$2hp$5
>>> @news.eternal-september.org:
>>>
>>> Thanks for the great explanation!
>>
>> Glad you appreciated it. ;-)
>
> Unfortunately, there seems to me to be a number of misconceptions in
> this post. I freely admit to having more theoretical knowledge from
> trawling the net, reading mdadm documentation, etc., than personal
> practical experience - so anyone reading this will have to judge for
> themselves whether they think I am right, or Aragorn is right. Either
> way, I hope to give you some things to think about.
Having read your reply, I agree with most of it. I obviously made a
thinko in assuming that it was the parity calculation that slowed
things down, but I was writing my reply in a rather abstracted frame of
mind.
All things considered, you and I are both right. The difference in our
view is that you further dissected the slowdown to the reads needed in
order to calculate the parity, while in my abstraction, I did not get
into this any further. ;-)
>>>> Writing to a RAID 5 is slower than writing to a single disk because
>>>> with each write, the parity block must be updated, which means
>>>> calculation of the parity data and writing that parity data to the
>>>> pertaining disk.
>>>
>>> This is where I get confused. Is writing to a RAID5 slower than a
>>> single disk irrespective of how many disks I throw at the RAID5?
>>
>> Normally, yes, although it won't be *much* slower. But there is some
>> overhead in the calculation of the parity, yes. This is why RAID 6
>> is even slower during writes: it stores *two* parity blocks per data
>> segment (and as such, it requires a minimum of 4 disks).
>
> Writing to RAID5 (or RAID6) /may/ be slower than writing to a single
> disk - or it may be much faster (closer to RAID0 speeds). The actual
> parity calculations are negligible with modern hardware, whether it be
> the host CPU or a hardware raid card. What takes time is if existing
> data has to be read in from the disks in order to calculate the parity
> - this causes a definite delay. If you are writing a whole stripe,
> the parity can be calculated directly and the write goes at N-1 speed
> as each block in the stripe can be written in parallel. This is also
> the case if the parts of the block are already in the cache from
> before.
>
> Thus random writes are slow on RAID5 (and RAID6), but larger block
> writes are full speed.
I agree with this.
> There can also be significant differences between the speed of mdadm
> software RAID5, and hardware RAID5. With hardware raid, the card can
> report a small write as "finished" before it has read in the block and
> written out the data and new parity. This is safe for good hardware
> with battery backup of its buffers, and gives fast writes (as far as
> the host is concerned) even for small writes. Software raid5 cannot
> do this.
Correct.
> But on the other hand, software raid5 can take advantage of large
> system memories for cache, and is thus far more likely to have the
> required stripe data already in its cache (especially for metadata
> and directory areas of the file system, which are commonly accessed
> but have small writes).
Also correct, but dependent on available system RAM, of course. I'm not
sure how much RAM the RAID controllers in my two servers have - I think
one has 256 MB and the other one 128 MB - but on a system with, say, 4
GB of system RAM, caching capacity is of course much higher.
> This is perhaps also a good time to mention one of the risks of raid5
> (and raid6) - the RAID5 Write Hole. When you are writing a stripe to
> the disk, the system must write at least two blocks - data and the
> updated parity block. These two writes cannot be done atomically - if
> you get a system failure at this point, the blocks may be inconsistent
> and the whole stripe is inconsistent and effectively becomes silent
> garbage.
Indeed, the infamous RAID 5/6 write hole. I left that bit of
information out as my advice to the OP was not to use RAID 5 or RAID 6
but to use RAID 10 instead, considering the gigantic number of drives
he has at his disposal. ;-)
On the other hand - and purely in theory, as RAID 5/6 is not the right
solution for the OP - a battery-backed hardware RAID controller on a
machine hooked up to a UPS should be able to avoid the RAID 5/6 write
hole, since the controller itself has its own processor. I even
believe that my Adaptec SAS RAID controller runs a Linux kernel of its
own.
>>> I currently have a 7-disk RAID5. Will writing to this be slower than
>>> a single disk?
>>
>> A little, yes. But reading from it will be significantly faster.
>
> Not necessarily - writing will be slower if you do lots of small
> random writes, but much faster if you write large blocks.
Yes, of course, it all depends on the usage. That's why different types
of servers use different types of RAID configurations, and the same can
be said about the choice of filesystem types. There is no "one size
fits all". ;-)
> Also remember that with a 7 disk array under heavy use, you /will/ see
> a disk failure at some point.
Also correct. The risk of disk failure will increase with the number of
disks involved.
> Degraded performance of raid 5 is very poor, and rebuilds are slow.
Correct. RAID 5 and RAID 6 are trade-offs between redundancy, diskspace
consumption and performance. RAID 10 offers the best performance and
redundancy, but is the most costly in terms of wasted diskspace.
> Some people believe that the chance of a
> second disk failure occurring during a rebuild is so large (rebuilds
> are particularly intensive for the other disks) that raid 5 should be
> considered unsafe for large arrays. Raid 6 is better since it can
> survive a second failure, but mirrored raids are safer still.
>
>>> Isn't the parity calculation a fairly fast process especially if one
>>> has a hardware based card?
>
> A decent host processor will do the parity calculations /much/ faster
> than the raid processor on most hardware cards. But the calculations
> themselves are not the cause of the latency, it's the extra reads that
> take time.
Correct.
>> It is however advised if you have a hardware RAID adapter to disable
>> the write barriers. Write barriers are where the kernel forces the
>> disk drives to flush their caches. Since a hardware RAID adapter
>> must be in total control of the disk drives and has cache memory of
>> its own, the operating system should never force the disk drives to
>> flush their cache.
>>
>
> Make sure your raid controller has batteries, and that the whole
> system is on an UPS!
If data integrity is important, then I consider a UPS a necessity, even
without RAID. ;-)
>>> And then if the write gets split into 6 parts shouldn't that speed up
>>> the process since each disk is writing only 1/6th of the chunk?
>>
>> Yes, but the data has to be split up first - which is of course a lot
>> faster on hardware RAID since it is done by a dedicated processor on
>> the adapter itself then - and the parity has to be calculated. This
>> is overhead which you do not have with a single disk.
>
> Nonsense - a host CPU is perfectly capable of splitting a stripe into
> its blocks in a fraction of a microsecond. It is also much faster at
> doing the parity calculations - the host CPU typically runs at least
> ten times as fast as the CPU or ASIC on the raid card.
Yes, but the host CPU also has other things to take care of, while the
CPU or ASIC on a RAID controller is dedicated to just that one task.
> And again, the splitting and parity calculations are not the
> bottleneck, it's the latency of the reads needed to calculate the new
> parity that takes time.
True, but as I stated higher up, I considered the reads to be part of
the parity calculation process. You need those reads in order to
calculate the parity, so it's a matter of semantics. ;-)
> There are times when top-range hardware raid cards will beat software
> raid on speed, but not often - especially with a fast multi-core
> modern host cpu. It does, however, depend highly on your raid setup
> and the type of load you have - there are no set answers here.
Considering that many hardware RAID adapters have a battery-backed
cache, I'd say that's another argument in favor of true hardware RAID.
> Software raid does of course have a reliability weak point - if your
> OS crashes in the middle of a write, you have a bigger chance of
> hitting the raid 5 write hole than you would with a hardware raid card
> with a battery.
I think this is an important thing to consider for anyone looking into a
RAID 5 solution.
>>>> There are however a few considerations you should take into account
>>>> with both of these approaches, i.e. that you should not put the
>>>> filesystem which holds the kernels and /initrd/ - and preferably
>>>> not the root filesystem either[1] - on a stripe, because the
>>>> bootloader recognizes [...]
>>>
>>> Luckily that is not needed. I have a separate drive to boot from.
>>> The RAID is intended only for user /home dirs.
>>
>> Ah but wait a minute. As I understand it, you have a hardware RAID
>> adapter card. In that case - assuming that it is a real hardware
>> RAID adapter and not one of those on-board fake-RAID things - it
>> doesn't matter, because to the operating system (and even to the
>> BIOS), the entire array will be seen as a single disk. So then it is
>> perfectly possible to have your bootloader, your "/boot" and your "/"
>> living on the RAID array. (I am doing that myself on one of my
>> machines, which has two RAID 5 arrays of four disks each.)
>>
>> And in this case - i.e. if you have a hardware RAID array - then your
>> original question regarding software RAID 0 versus striping via LVM
>> is also answered, because hardware RAID will always be a bit faster
>> than software RAID or striped LVM. Additionally, since you mention
>> seven disks, you could even opt for RAID 10 or 51 and even have
>> a "hot spare" or "standby spare". (Or you could use the extra disk
>> as an individual, standalone disk.)
>>
>> RAID 10 is where you have a mirror (i.e. RAID 1) which is striped to
>> another mirror - you could instead also use RAID 01, which is a
>> stripe which is mirrored on another stripe. RAID 10 is better than
>> RAID 01 though - there's a good article on Wikipedia about it. RAID
>> 10 or 01 require four disks in total. Performance is very good for
>> both reading and writing *and* you have redundancy.
>
> Yes, wikipedia /does/ have some useful information about raid - it's
> worth reading.
You are preaching to the choir. ;-)
> One thing you are missing here is that Linux mdadm raid 10 is very
> much more flexible than just a "stripe of mirrors", which is the
> standard raid 10. In particular, you can use any number of disks
> (from 2 upwards), you can have more than 2 copies of each block (at
> the cost of disk space, obviously) for greater redundancy, and you can
> have a layout that optimises the throughput for different loads.
I'm not missing that, but since the OP had confirmed that he has a
hardware RAID set-up, I was addressing that aspect only. Linux
software RAID is applied on a partition basis and is thus indeed more
flexible than hardware RAID, which is applied on an entire disk basis.
> Raid10 performance is also much less affected by a disk failure, and
> rebuilds are faster and less stressful on the system. And a single
> hot spare will cover all the disks - you don't need a spare per
I think RAID 10 (at the hardware level) would be ideal for the OP.
> [...] Put your LVM physical volume on top of this if you want the
> flexibility of LVM - if you don't need it, don't bother.
He might want to use LVM in order to combine what the operating system sees as
being three independent drives - and which are in reality three RAID
arrays - into a single "/home" volume, but then again, he could instead
also create that "/home" as an /mdadm/ stripe.
Either way, he needs to choose between /mdadm/ and LVM for how he wants
to set up the "/home" volume from three separate arrays, be it striped
or linear - alias the JBOD approach - but not /mdadm/ and LVM both.
That would be unnecessary overhead.
>> Similarly, RAID 51 is where you have a RAID 5 which is mirrored onto
>> another RAID 5. Or you could use RAID 15, which is a RAID 5
>> comprised of mirrors. RAID 51 and 15 require a minimum of six disks.
>> (Similarly, there is RAID 61 and 16, which require a minimum of eight
>> disks.)
>
> As a minor point, mdadm raid 5 can work on 2 disks (and raid 6 on
> three disks). Such a 2-disk raid 5 is not much use in a working
> system, but can be convenient when setting things up or upgrading
> drives, as you can add more drives to the mdadm raid 5 later on. It's
> just an example of how much more flexible mdadm is than hardware raid
> solutions.
True, but considering the enormous number of drives involved and
hardware RAID already being present, I think the wisest approach would
be to use hardware RAID at the lower levels and only use /mdadm/ at the
final level.
>> [...]
>> With RAID 5, your storage capacity is reduced by the capacity of one
>> disk in the array, and with RAID 6 by the capacity of two disks in
>> the array. So, with a single RAID 5 array comprised of seven disks
>> without a standby or hot spare, your total storage capacity is that
>> of six disks.
>>
>> And then there's the lost capacity of the hot spare or standby spare
>> - a hot spare is spinning but otherwise unused until one of the other
>> disks starts to fail, while a standby spare is spun down until one of
>> the other disks fails. Upon such failure, the array will be
>> automatically rebuilt using the parity blocks to write the missing
>> data to the spare disk.
>
> I have never heard of a distinction between a "hot spare" that is
> spinning, and a "standby spare" that is not spinning.
This is quite a common distinction, mind you. There is even a "live
spare" solution, but to my knowledge this is specific to Adaptec - they
call it RAID 5E.
In a "live spare" scenario, the spare disk is not used as such but is
part of the live array, and both data and parity blocks are being
written to it, but with the distinction that each disk in the array
will also have empty blocks for the total capacity of a standard spare
disk. These empty blocks are thus distributed across all disks in the
array and are used for array reconstruction in the event of a disk
failure.
> Given that spinup takes a few seconds, and a rebuild often takes many
> hours, I can't see you have much to gain by keeping a spare drive
> spinning.
It might be required for some software RAID solutions where the spare
disk cannot be spun down via software. For instance in the event of
parallel SCSI drives in a software RAID array.
> To my mind, a "hot spare" is a drive that will be used automatically
> to replace a dead drive.
Semantics. ;-)
> An "offline spare" is an extra drive that is physically attached, but
> not in use automatically - in the event of a failure, it can be
> manually assigned to a raid set. This makes sense if you have several
> hardware raid sets defined and want to share a single spare, if the
> hardware raid cannot support this (mdadm, of course, supports such a
> setup with a shared hot spare).
Most modern hardware RAID controllers support this.
--
*Aragorn*
(registered GNU/Linux user #223157)
Posted by Aragorn on 1/20/2010 11:58:49 AM

Rahul wrote:
> David Brown <david.brown@hesbynett.removethisbit.no> wrote in
> news:NKOdnXtWJIFpt8vWnZ2dnUVZ7radnZ2d@lyse.net:
>
>> themselves whether they think I am right, or Aragorn is right. Either
>> way, I hope to give you some things to think about.
>
> An alternative viewpoint is always good!
>
>> Thus random writes are slow on RAID5 (and RAID6), but larger block
>> writes are full speed.
>
> And if I did a RAID10 at hardware level (as you later suggest) I'd get
> the speedup on random writes as well? (which are otherwise slow on a
> RAID5?) What other way do I have to speed up random writes?
>
Random writes will always be fairly fast with raid10, whether software
or hardware - there are no blocks that have to be read. With mirroring
(raid1 or raid10), you have to do twice as many writes as with raid0,
but they are to different drives and can thus be written in parallel.
>> There can also be significant differences between the speed of mdadm
>> software RAID5, and hardware RAID5. With hardware raid, the card can
>> report a small write as "finished" before it has read in the block and
>> written out the data and new parity. This is safe for good hardware
>> with battery backup of its buffers, and gives fast writes (as far as
>> the host is concerned) even for small writes. Software raid5 cannot
>> do this. But on the other hand, software raid5 can take advantage of
>> large system memories for cache, and is thus far more likely to have
>> the required stripe data already in its cache (especially for metadata
>> and directory areas of the file system, which are commonly accessed
>> but have small writes).
>
> Yes, I do have a battery backed up cache on my Hardware card. But from
> the point you make above there's something to be said about a software
> (mdadm or LVM) on top of hardware approach? This way I get the best of
> both worlds? LVM / mdadm will serve out from RAM (I've 48 Gigs of it)
> and speed up reads. Writes will be speeded up due to the caches of the
> Hardware card. Does this make sense?
>
The speedup you get with mdadm having large caches is only relevant for
raid5 (or raid6). The trouble with small raid5 writes is that you need
to read in the old data and parity block before you can write them anew,
and here a large cache increases the chance of having these blocks in
the cache. Once you are beyond the raid5 level (for example, if you
have the raid5 in hardware), caches on the host will not help.
If you have your three boxes set up with raid5 in hardware, then you
should stripe them (raid0) to form your final "disk". It is unlikely to
make any noticeable performance difference doing this in hardware or
mdadm, but mdadm is probably more flexible (though there is not much you
can do with raid0).
The worst choice you could make is to have raid5 on the host (software
or hardware). The issue here is that each stripe is going to be very
large, since it will cover all 45 disks. Since raid5 writes are slow
unless they cover an entire stripe, even fairly large writes are going
to be parts of a stripe and therefore slow.
>
>> This is perhaps also a good time to mention one of the risks of raid5
>> (and raid6) - the RAID5 Write Hole.
>
> This risk is reduced by a battery backed-up cache, correct?
>
Yes, if the disks themselves also have battery backup (i.e., an UPS).
If the controller is able to complete writes safely even in the event of
a power cut or a host OS crash, then the raid5 write hole should not be
an issue.
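To illustrate why an interrupted write is dangerous, here is a minimal Python sketch of the write hole on one stripe of a 3-disk raid5. The block contents and the helper name `xor` are made up for illustration; it only models the XOR arithmetic, not any real controller or mdadm behaviour.

```python
def xor(x, y):
    """XOR two equal-length byte strings (the RAID 5 parity operation)."""
    return bytes(a ^ b for a, b in zip(x, y))

# A consistent stripe: two data blocks and their parity.
d1, d2 = b"AAAA", b"BBBB"
parity = xor(d1, d2)

# Power is lost after the new d1 reaches its disk, but before the
# updated parity is written. The parity on disk is now stale.
d1 = b"ZZZZ"

# Later the disk holding d2 dies and d2 is rebuilt from d1 and the parity.
rebuilt_d2 = xor(d1, parity)

print(rebuilt_d2 == b"BBBB")  # False: d2 was never touched, yet it comes back wrong
```

Note that the damaged block is one that was not even being written, which is why the corruption is silent until a rebuild or degraded read needs the parity.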
>>>> I currently have a 7-disk RAID5. Will writing to this be slower than
>>>> a single disk?
>>> A little, yes. But reading from it will be significantly faster.
>> Not necessarily - writing will be slower if you do lots of small
>> random writes, but much faster if you write large blocks.
>
> And will the reads and large-sequential-writes be even faster if I did a
> 14 disk RAID5 instead of a 7-disk RAID5?
>
You have to try to think what is happening when you are doing different
types of access. Let's go through this for some different setups,
considering small and large reads and writes with N drives and raid0,
raid1, raid10, mdadm "far" raid10, and raid5.
For raid0, you have a layout like this:
    1   2   3   4
    5   6   7   8
A small read will require a single seek on one of the disks, followed by
a read - it will give the same performance as a single drive (though you
will get better average throughput if you have lots of unrelated reads
in parallel). Similarly with a small write. For large reads and
writes, you can read or write to all the disks in parallel -
theoretically you get N times the throughput.
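As a rough illustration of the raid0 layout, the following Python sketch maps a block number from the picture above to a disk and a row. The function names are my own, and it assumes one block per chunk; real arrays use a configurable chunk size.

```python
def raid0_locate(block, n_disks):
    """Map a 1-based block number to (disk, row), both 0-based."""
    index = block - 1
    return index % n_disks, index // n_disks

def disks_touched(first_block, count, n_disks):
    """Return the set of disks hit by a sequential access of `count` blocks."""
    return {raid0_locate(b, n_disks)[0]
            for b in range(first_block, first_block + count)}

# 4-disk layout from the picture: block 6 is on the second disk, second row.
print(raid0_locate(6, 4))      # (1, 1)
print(disks_touched(1, 1, 4))  # {0}: a small access keeps one disk busy
print(disks_touched(1, 8, 4))  # {0, 1, 2, 3}: a large access uses all of them
```

The small access touching a single disk is why it performs like one drive, while the large access spreads over all N.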
For raid1, you have this:
    1a  1b
    2a  2b
    3a  3b
    4a  4b
(Numbers are the data blocks, letters "a" and "b" indicate the
duplications).
With more disks but keeping two copies, this is effectively standard
raid10 (and raid01 is identical to raid10 performance-wise) :
    1a  1b  2a  2b
    3a  3b  4a  4b
    5a  5b  6a  6b
    7a  7b  8a  8b
Small reads are, as usual, a seek followed by a read. Seeks may be a
little faster than for 1 disk, since either half of the mirror can be
used - the one with the closest head can be picked. Small writes are
similar, though the same data must be written twice. However, since the
two copies are on different disks, these are done in parallel. For
large reads, you basically have half the disks running sequentially for
(N/2) speed - parallel reads from the second copy are only really useful
for reads of up to three stripes in size. Bulk writing is at up to N/2
speed.
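The standard raid10 picture can be generated with a few lines of Python. This is only a sketch of the drawing above (two copies side by side, an even number of disks, one block per chunk); the function names are assumptions of mine.

```python
def raid10_locate(block, n_disks):
    """Return (row, (disk_a, disk_b)) for a 1-based block number."""
    pairs = n_disks // 2
    index = block - 1
    pair = index % pairs
    return index // pairs, (2 * pair, 2 * pair + 1)

def raid10_rows(n_disks, rows):
    """Render the layout as text rows, e.g. '1a 1b 2a 2b'."""
    pairs = n_disks // 2
    out = []
    for row in range(rows):
        cells = []
        for pair in range(pairs):
            block = row * pairs + pair + 1
            cells += [f"{block}a", f"{block}b"]
        out.append(" ".join(cells))
    return out

for line in raid10_rows(4, 4):
    print(line)

# Block 4 lives on the third and fourth disk, second row.
print(raid10_locate(4, 4))  # (1, (2, 3))
```

Each row only holds N/2 distinct blocks, which is where the N/2 figure for bulk transfers comes from.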
mdadm "far" raid10 is a little different:
    1a  2a  3a  4a
    5a  6a  7a  8a
    ....
    2b  1b  4b  3b
    6b  5b  8b  7b
As with raid1, small reads are a seek followed by a read. Seeks may be
a little faster than for 1 disk, since either half of the mirror can be
used - the one with the closest head can be picked. Small writes are
similar, though the same data must be written twice in parallel. For
large reads - and this is a key difference from standard raid10 - the
layout looks like raid0, and runs at full N speed. Bulk writing is
again at up to N/2 speed.
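For comparison, here is a small Python sketch that reproduces the "far" drawing above. It follows the picture as drawn (second copies swapped between neighbouring disks) and assumes an even number of disks; the exact placement of the second copy in a real md array may differ, so treat it purely as an illustration.

```python
def far_layout_rows(n_disks, rows):
    """Return (first_half, second_half) as lists of text rows."""
    first, second = [], []
    for row in range(rows):
        blocks = [row * n_disks + d + 1 for d in range(n_disks)]
        # First copy: laid out exactly like raid0.
        first.append(" ".join(f"{b}a" for b in blocks))
        # Second copy: swap neighbours so the 'b' copy of a block
        # never lands on the same disk as its 'a' copy.
        swapped = [blocks[d ^ 1] for d in range(n_disks)]
        second.append(" ".join(f"{b}b" for b in swapped))
    return first, second

first, second = far_layout_rows(4, 2)
for line in first + ["...."] + second:
    print(line)
```

Since the first half is a plain raid0 stripe, a large read can stream from all N disks without touching the second half at all.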
raid5 looks like this:
    1     2     3     p123
    4     5     p456  6
Small reads are the same as for a single disk. Large reads are similar
to (N-1), but don't quite make it - the parity blocks disrupt the flow
of the sequential reads. Large writes - full stripes - are close to
(N-1) since the parity can be calculated on the host or controller and
written out directly. The killer is for small writes - imagine trying
to write to block 3 here. The host or controller must also calculate
the new p123, either by reading in the other data blocks of the stripe
(blocks 1 and 2 here; a single block for a 3-disk raid5) or by reading
the old block 3 and the old p123 (the cheaper route with more disks).
Then it calculates the parity (a quick task), and writes out blocks 3
and p123.
Waiting for these reads is what stalls the write process and gives
long latency on random writes.
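The two ways of getting the new parity for a small write can be checked with a short Python sketch. The block contents and function names are invented for illustration; it models only the XOR arithmetic and the number of reads, not actual disk timing.

```python
def xor_blocks(*blocks):
    """XOR any number of equal-length byte strings together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

# One stripe of the 4-disk raid5 above: blocks 1, 2, 3 and p123.
b1, b2, b3 = b"\x11" * 4, b"\x22" * 4, b"\x33" * 4
p123 = xor_blocks(b1, b2, b3)

new_b3 = b"\x7f" * 4

# Route 1: read the other data blocks and recompute the parity from scratch.
p_recomputed = xor_blocks(b1, b2, new_b3)

# Route 2: read the old block 3 and old parity, then patch the parity.
p_patched = xor_blocks(p123, b3, new_b3)

print(p_recomputed == p_patched)  # True: both routes give the same parity

def reads_needed(n_disks):
    """Reads before a one-block write: (recompute route, patch route)."""
    return n_disks - 2, 2

print(reads_needed(3))  # (1, 2)
print(reads_needed(7))  # (5, 2)
```

Either way, at least one read has to complete before the write can go out, and that wait is the latency described above.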
So in answer to your question (which you should be able to see yourself
now), large reads and writes scale with the number of disks for raid5.
But the cutoff point for a write to be "large" or "small", i.e., the
stripe size, is larger when you have more disks.
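To put numbers on that cutoff, here is a small Python sketch. The 64 KiB chunk size is purely an assumption for the example (the thread does not say what the OP's arrays use), and the function names are my own.

```python
def full_stripe_kib(n_disks, chunk_kib=64):
    """Data held by one full raid5 stripe: a chunk on every disk but one."""
    return (n_disks - 1) * chunk_kib

def covers_full_stripe(offset_kib, length_kib, n_disks, chunk_kib=64):
    """True if the write contains at least one complete, aligned stripe."""
    stripe = full_stripe_kib(n_disks, chunk_kib)
    first_start = -(-offset_kib // stripe) * stripe  # round up to a stripe boundary
    return first_start + stripe <= offset_kib + length_kib

for n in (7, 14):
    print(n, "disks:", full_stripe_kib(n), "KiB per full stripe")

# The same 512 KiB write is "large" on 7 disks but "small" on 14.
print(covers_full_stripe(0, 512, 7))   # True
print(covers_full_stripe(0, 512, 14))  # False
```

With these assumed numbers the full stripe grows from 384 KiB to 832 KiB, so a write that ran at full speed on the 7-disk array falls back to the slow partial-stripe path on the 14-disk one.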
>> Make sure your raid controller has batteries, and that the whole
>> system is on an UPS!
>
> Yes! Both.
>> For reading, especially for large reads, raid 5 is approximately like
>> N-1 raid 0 drives, while raid 6 is like N-2 raid 0.
>
> Problem is I haven't seen a similar formula mentioned for writes. Neither
> large nor small writes. What's an approximate design equation to use to
> rate options?
Large writes are approximately N-1. For small writes, you have longer
latency than for a simple single disk.
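The rules of thumb from this thread for large sequential transfers can be collected in a small Python lookup. These are best-case multiples of a single disk's throughput, not measurements; the raid6 write figure is my own assumption by analogy with raid5, and small random writes are deliberately left out because they are latency-bound rather than throughput-bound.

```python
def large_io_factor(level, n_disks):
    """Rough (read, write) throughput for large I/O, in single-disk units."""
    table = {
        "raid0": (n_disks, n_disks),
        "raid10": (n_disks / 2, n_disks / 2),
        "raid10-far": (n_disks, n_disks / 2),
        "raid5": (n_disks - 1, n_disks - 1),   # reads fall a little short of this
        "raid6": (n_disks - 2, n_disks - 2),   # write figure assumed by analogy
    }
    return table[level]

for level in ("raid0", "raid10", "raid10-far", "raid5", "raid6"):
    read, write = large_io_factor(level, 14)
    print(f"{level:11s} read ~{read:g}x  write ~{write:g}x")
```

Treat the output as an upper bound for comparing options; controller, bus and filesystem limits will usually cap the real numbers well below it.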
>> However, while a "f2" md raid 10 is probably the fastest setup for
>> directly connected drives, this is not what you have. You will also
>> suffer from bandwidth issues
>
> Which bandwidth are we talking about? The CPU-to-controller?
The bandwidth between the host memory, through the DMA controller (and
possibly the cpu), to the SAS controller. Simply put, with software
mirroring the host has to write the same data twice, using twice the
bandwidth.
>
>> if you try to do all the mirroring of all
>> 45 drives in software. In your case, I would recommend raid 10 on
>> each box - 7 raid1 pairs striped together with a hot spare (assuming
>> the hardware supports a common hot spare). Your host then sees these
>> three disks, which you should stripe together with mdadm raid0 - there
>> is no need for redundancy here, as that is handled at a lower level.
>> Put your LVM physical volume on top of this if you want the
>> flexibility of LVM - if you don't need it, don't bother.
>
> Ah! Thanks! That's a creative solution I hadn't thought about.
>> I have never heard of a distinction between a "hot spare" that is
>> spinning, and a "standby spare" that is not spinning.
>
> Me neither.
>
>>> The bottom line...: A seven-disk RAID 0 would be faster than a RAID 5
>>> during writes, but not really significantly faster during reads, and
>>> you would have the full storage capacity of all disks in the array,
>>> but there would be no redundancy at all. So, considering that you
>>> have seven disks, I think you really should consider building in
>>> redundancy. After all, with RAID 0, if a single disk in the array
>>> fails, then you'll have lost all of your data. A RAID 5 would upon
>>> failure of a single disk run slower, but at least you'd still have
>>> access to your data.
>>>
>
> Or I could do the RAID10 that you suggest and stripe on top of three such
> arrays using mdadm. I'm thinking about this very interesting option.
> Thanks!
>
>
Posted by David on 1/20/2010 12:17:55 PM

Aragorn wrote:
> On Tuesday 19 January 2010 22:57 in comp.os.linux.misc, somebody
> identifying as David Brown wrote...
>
>> Aragorn wrote:
>>
>>> On Tuesday 19 January 2010 08:37 in comp.os.linux.misc, somebody
>>> identifying as Rahul wrote...
>>>
>>>> Aragorn <aragorn@chatfactory.invalid> wrote in news:hj1gta$2hp$5
>>>> @news.eternal-september.org:
>>>>
>>>> Thanks for the great explanation!
>>> Glad you appreciated it. ;-)
>> Unfortunately, there seems to me to be a number of misconceptions in
>> this post. I freely admit to having more theoretical knowledge from
>> trawling the net, reading mdadm documentation, etc., than personal
>> practical experience - so anyone reading this will have to judge for
>> themselves whether they think I am right, or Aragorn is right. Either
>> way, I hope to give you some things to think about.
>
> Having read your reply, I agree with most of it. I obviously made a
> thinko in assuming that it was the parity calculation that slowed
> things down, but I was writing my reply in a rather abstracted frame of
> mind.
>
> All things considered, you and I are both right. The difference in our
> view is that you further dissected the slowdown to the reads needed in
> order to calculate the parity, while in my abstraction, I did not get
> into this any further. ;-)
>
It is also the case that calculating parity used to be significant in
the timing. When host cpus ran at a few hundred MHz, doing parity
calculations in software was slow, and took all of the host cpu
capacity. But since then, the host cpus are orders of magnitude faster
(and are even better suited to the sorts of streaming calculations
needed), while hard disk speeds have only increased a few times. And
with multiple cores on the host as standard, most setups can handle the
load without noticing.
Another point here is that you can improve some things on the host side
by using a faster processor and more ram (you can never have too much
ram in a file server, assuming you are not using windows). It is often
cheaper and easier to boost the host in this way to improve your
software raid than it is to change the hardware raid setup.
>>>>> Writing to a RAID 5 is slower than writing to a single disk because
>>>>> with each write, the parity block must be updated, which means
>>>>> calculation of the parity data and writing that parity data to the
>>>>> pertaining disk.
>>>> This is where I get confused. Is writing to a RAID5 slower than a
>>>> single disk irrespective of how many disks I throw at the RAID5?
>>> Normally, yes, although it won't be *much* slower. But there is some
>>> overhead in the calculation of the parity, yes. This is why RAID 6
>>> is even slower during writes: it stores *two* parity blocks per data
>>> segment (and as such, it requires a minimum of 4 disks).
>> Writing to RAID5 (or RAID6) /may/ be slower than writing to a single
>> disk - or it may be much faster (closer to RAID0 speeds). The actual
>> parity calculations are negligible with modern hardware, whether it be
>> the host CPU or a hardware raid card. What takes time is if existing
>> data has to be read in from the disks in order to calculate the parity
>> - this causes a definite delay. If you are writing a whole stripe,
>> the parity can be calculated directly and the write goes at N-1 speed
>> as each block in the stripe can be written in parallel. This is also
>> the case if the parts of the block are already in the cache from
>> before.
>>
>> Thus random writes are slow on RAID5 (and RAID6), but larger block
>> writes are full speed.
>
> I agree with this.
>
>> There can also be significant differences between the speed of mdadm
>> software RAID5, and hardware RAID5. With hardware raid, the card can
>> report a small write as "finished" before it has read in the block and
>> written out the data and new parity. This is safe for good hardware
>> with battery backup of its buffers, and gives fast writes (as far as
>> the host is concerned) even for small writes. Software raid5 cannot
>> do this.
>
> Correct.
>
>> But on the other hand, software raid5 can take advantage of large
>> system memories for cache, and is thus far more likely to have the
>> required stripe data already in its cache (especially for metadata
>> and directory areas of the file system, which are commonly accessed
>> but have small writes).
>
> Also correct, but dependent on available system RAM, of course. I'm not
> sure how much RAM the RAID controllers in my two servers have - I think
> one has 256 MB and the other one 128 MB - but on a system with, say, 4
> GB of system RAM, caching capacity is of course much higher.
>
You will probably find it is cheaper to add another 4 GB system ram to
the host than another 128 MB to the hardware raid controller.
And of course more system ram means more data held in the file-cache (as
well as the low-level block caches useful for raid5), reducing the reads
from the disk.
>> This is perhaps also a good time to mention one of the risks of raid5
>> (and raid6) - the RAID5 Write Hole. When you are writing a stripe to
>> the disk, the system must write at least two blocks - data and the
>> updated parity block. These two writes cannot be done atomically - if
>> you get a system failure at this point, the blocks may be inconsistent
>> and the whole stripe is inconsistent and effectively becomes silent
>> garbage.
>
> Indeed, the infamous RAID 5/6 write hole. I left that bit of
> information out as my advice to the OP was not to use RAID 5 or RAID 6
> but to use RAID 10 instead, considering the gigantic amount of drives
> he has at his disposal. ;-)
>
Agreed.
> On the other hand - and purely in theory, as RAID 5/6 is not the right
> solution for the OP - a battery-backed hardware RAID controller on a
> machine hooked up to a UPS should be able to avoid the RAID 5/6 writing
> hole, since the controller itself has its own processor. I even
> believe that my Adaptec SAS RAID controller runs a Linux kernel of its
> own.
>
That's often the case these days - "hardware" raid controllers are
frequently host processors running Linux (or sometimes other systems)
and software raid. This is especially true of SAN boxes.
>>>> I currently have a 7-disk RAID5. Will writing to this be slower than
>>>> a single disk?
>>> A little, yes. But reading from it will be significantly faster.
>> Not necessarily - writing will be slower if you do lots of small
>> random writes, but much faster if you write large blocks.
>
> Yes, of course, it all depends on the usage. That's why different types
> of servers use different types of RAID configurations, and the same can
> be said about the choice of filesystem types. There is no "one size
> fits all". ;-)
>
Yes - perhaps the OP will give more details on his expected usage
patterns. There are many other factors we haven't discussed that can
affect the "best" setup, such as what requirements he has for future
expansion.
>> Also remember that with a 7 disk array under heavy use, you /will/ see
>> a disk failure at some point.
>
> Also correct. The risk of disk failure will increase with the number of
> disks involved.
>
>> Degraded performance of raid 5 is very poor, and rebuilds are slow.
>
> Correct. RAID 5 and RAID 6 are trade-offs between redundancy, diskspace
> consumption and performance. RAID 10 offers the best performance and
> redundancy, but is the most costly in terms of wasted diskspace.
>
>> Some people believe that the chance of a
>> second disk failure occurring during a rebuild is so large (rebuilds
>> are particularly intensive for the other disks) that raid 5 should be
>> considered unsafe for large arrays. Raid 6 is better since it can
>> survive a second failure, but mirrored raids are safer still.
>>
>>>> Isn't the parity calculation a fairly fast process especially if one
>>>> has a hardware based card?
>> A decent host processor will do the parity calculations /much/ faster
>> than the raid processor on most hardware cards. But the calculations
>> themselves are not the cause of the latency, it's the extra reads that
>> take time.
>
> Correct.
>
>>> It is however advised if you have a hardware RAID adapter to disable
>>> the write barriers. Write barriers are where the kernel forces the
>>> disk drives to flush their caches. Since a hardware RAID adapter
>>> must be in total control of the disk drives and has cache memory of
>>> its own, the operating system should never force the disk drives to
>>> flush their cache.
>>>
>> Make sure your raid controller has batteries, and that the whole
>> system is on an UPS!
>
> If data integrity is important, then I consider a UPS a necessity, even
> without RAID. ;-)
>
And assuming the data is important, the OP must also think about backup
solutions. But that's worth its own thread.
>>>> And then if the write gets split into 6 parts shouldn't that speed up
>>>> the process since each disk is writing only 1/6th of the chunk?
>>> Yes, but the data has to be split up first - which is of course a lot
>>> faster on hardware RAID since it is done by a dedicated processor on
>>> the adapter itself then - and the parity has to be calculated. This
>>> is overhead which you do not have with a single disk.
>> Nonsense - a host CPU is perfectly capable of splitting a stripe into
>> its blocks in a fraction of a microsecond. It is also much faster at
>> doing the parity calculations - the host CPU typically runs at least
>> ten times as fast as the CPU or ASIC on the raid card.
>
> Yes, but the host CPU also has other things to take care of, while the
> CPU or ASIC on a RAID controller is dedicated to just that one task.
>
That's true, but not really relevant for modern CPUs - when you've got 4
or 8 cores running at ten times the speed of the raid controller's chip,
you are not talking about a significant load.
>> And again, the splitting and parity calculations are not the
>> bottleneck, it's the latency of the reads needed to calculate the new
>> parity that takes time.
>
> True, but as I stated higher up, I considered the reads to be part of
> the parity calculation process. You need those reads in order to
> calculate the parity, so it's a matter of semantics. ;-)
>
With those definitions, then I agree, of course.
>> There are times when top-range hardware raid cards will beat software
>> raid on speed, but not often - especially with a fast multi-core
>> modern host cpu. It does, however, depend highly on your raid setup
>> and the type of load you have - there are no set answers here.
>
> Considering that many hardware RAID adapters have a battery-backed
> cache, I'd say that's another argument in favor of true hardware RAID.
>
>> Software raid does of course have a reliability weak point - if your
>> OS crashes in the middle of a write, you have a bigger chance of
>> hitting the raid 5 write hole than you would with a hardware raid card
>> with a battery.
>
> I think this is an important thing to consider for anyone looking into a
> RAID 5 solution.
>
>>>>> There are however a few considerations you should take into account
>>>>> with both of these approaches, i.e. that you should not put the
>>>>> filesystem which holds the kernels and /initrd/ - and preferably
>>>>> not the root filesystem either[1] - on a stripe, because the
>>>>> bootloader recognizes [...]
>>>> Luckily that is not needed. I have a separate drive to boot from.
>>>> The RAID is intended only for user /home dirs.
>>> Ah but wait a minute. As I understand it, you have a hardware RAID
>>> adapter card. In that case - assuming that it is a real hardware
>>> RAID adapter and not one of those on-board fake-RAID things - it
>>> doesn't matter, because to the operating system (and even to the
>>> BIOS), the entire array will be seen as a single disk. So then it is
>>> perfectly possible to have your bootloader, your "/boot" and your "/"
>>> living on the RAID array. (I am doing that myself on one of my
>>> machines, which has two RAID 5 arrays of four disks each.)
>>>
>>> And in this case - i.e. if you have a hardware RAID array - then your
>>> original question regarding software RAID 0 versus striping via LVM
>>> is also answered, because hardware RAID will always be a bit faster
>>> than software RAID or striped LVM. Additionally, since you mention
>>> seven disks, you could even opt for RAID 10 or 51 and even have
>>> a "hot spare" or "standby spare". (Or you could use the extra disk
>>> as an individual, standalone disk.)
>>>
>>> RAID 10 is where you have a mirror (i.e. RAID 1) which is striped to
>>> another mirror - you could instead also use RAID 01, which is a
>>> stripe which is mirrored on another stripe. RAID 10 is better than
>>> RAID 01 though - there's a good article on Wikipedia about it. RAID
>>> 10 or 01 require four disks in total. Performance is very good for
>>> both reading and writing *and* you have redundancy.
>> Yes, wikipedia /does/ have some useful information about raid - it's
>> worth reading.
>
> You are preaching to the choir. ;-)
>
>> One thing you are missing here is that Linux mdadm raid 10 is very
>> much more flexible than just a "stripe of mirrors", which is the
>> standard raid 10. In particular, you can use any number of disks
>> (from 2 upwards), you can have more than 2 copies of each block (at
>> the cost of disk space, obviously) for greater redundancy, and you can
>> have a layout that optimises the throughput for different loads.
>
> I'm not missing that, but since the OP had confirmed that he has a
> hardware RAID set-up, I was addressing that aspect only. Linux
> software RAID is applied on a partition basis and is thus indeed more
> flexible than hardware RAID, which is applied on an entire disk basis.
>
I was discussing raid a little more generally, since the OP was asking
about mdadm and LVM, while I think you were talking more about hardware
raid since he has hardware raid devices already (mdadm raid might be
better value for money than hardware raid for most setups - but not if
you already have the hardware!). Just a difference of emphasis, really.
>> Raid10 performance is also much less affected by a disk failure, and
>> rebuilds are faster and less stressful on the system. And a single
>> hot spare will cover all the disks - you don't need a spare per
>
> I think RAID 10 (at the hardware level) would be ideal for the OP.
>
>> [...] Put your LVM physical volume on top of this if you want the
>> flexibility of LVM - if you don't need it, don't bother.
>
> He might want to use LVM in order to combine what the operating system sees as
> being three independent drives - and which are in reality three RAID
> arrays - into a single "/home" volume, but then again, he could instead
> also create that "/home" as an /mdadm/ stripe.
>
> Either way, he needs to choose between /mdadm/ and LVM for how he wants
> to set up the "/home" volume from three separate arrays, be it striped
> or linear - alias the JBOD approach - but not /mdadm/ and LVM both.
> That would be unnecessary overhead.
>
I'd recommend combining the three "drives" using mdadm raid0 rather than
LVM striping - it's a cleaner solution, and easier to get right (with
LVM striping it's all too easy to make a logical volume that is not
striped, since the striping must be stated explicitly in the lvcreate
command).
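To show the difference, here is a Python sketch that only builds and prints the two command lines; it does not run anything. The device names, volume group name and 64 KiB stripe size are hypothetical placeholders, so check the mdadm and lvcreate man pages for your versions before using anything like this.

```python
import shlex

def mdadm_raid0_cmd(md_device, members):
    """Command line for an mdadm raid0 over the given block devices."""
    return ["mdadm", "--create", md_device, "--level=0",
            f"--raid-devices={len(members)}", *members]

def lvcreate_striped_cmd(vg, name, stripes, stripe_kib=64):
    """Command line for a striped LV; leave out --stripes and you get linear."""
    return ["lvcreate", "--stripes", str(stripes),
            "--stripesize", str(stripe_kib),
            "--extents", "100%FREE", "--name", name, vg]

# Hypothetical names for the three hardware arrays as seen by the host.
arrays = ["/dev/sdb", "/dev/sdc", "/dev/sdd"]

print(shlex.join(mdadm_raid0_cmd("/dev/md0", arrays)))
print(shlex.join(lvcreate_striped_cmd("vg_home", "home", len(arrays))))
```

With mdadm the raid level is part of creating the device, so it cannot be forgotten; with LVM the striping is an option on each lvcreate call, which is the pitfall mentioned above.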
The point of LVM is for features such as resizing partitions, snapshots,
migration of partitions, etc. If you don't need these, don't use LVM.
LVM does not have much overhead, but it still leads to some slowdown.
LVM is fine for JBOD or linear setups - adding a new drive to your
volume group for more disk space. But I believe mdadm does a better job
- it is more dedicated to the task. I don't have any numbers, but I
have found LVM striping to be slower than expected.
>>> Similarly, RAID 51 is where you have a RAID 5 which is mirrored onto
>>> another RAID 5. Or you could use RAID 15, which is a RAID 5
>>> comprised of mirrors. RAID 51 and 15 require a minimum of six disks.
>>> (Similarly, there is RAID 61 and 16, which require a minimum of eight
>>> disks.)
>> As a minor point, mdadm raid 5 can work on 2 disks (and raid 6 on
>> three disks). Such a 2-disk raid 5 is not much use in a working
>> system, but can be convenient when setting things up or upgrading
>> drives, as you can add more drives to the mdadm raid 5 later on. It's
>> just an example of how much more flexible mdadm is than hardware raid
>> solutions.
>
> True, but considering the enormous number of drives involved and
> hardware RAID already being present, I think the wisest approach would
> be to use hardware RAID at the lower levels and only use /mdadm/ at the
> final level.
>
Indeed - that was another one of my general points, rather than specific
advice to the OP. I have perhaps mixed these up a bit in my posts.
>>> [...]
>>> With RAID 5, your storage capacity is reduced by the capacity of one
>>> disk in the array, and with RAID 6 by the capacity of two disks in
>>> the array. So, with a single RAID 5 array comprised of seven disks
>>> without a standby or hot spare, your total storage capacity is that
>>> of six disks.
>>>
>>> And then there's the lost capacity of the hot spare or standby spare
>>> - a hot spare is spinning but otherwise unused until one of the other
>>> disks starts to fail, while a standby spare is spun down until one of
>>> the other disks fails. Upon such failure, the array will be
>>> automatically rebuilt using the parity blocks to write the missing
>>> data to the spare disk.
>> I have never heard of a distinction between a "hot spare" that is
>> spinning, and a "standby spare" that is not spinning.
>
> This is quite a common distinction, mind you. There is even a "live
> spare" solution, but to my knowledge this is specific to Adaptec - they
> call it RAID 5E.
>
> In a "live spare" scenario, the spare disk is not used as such but is
> part of the live array, and both data and parity blocks are being
> written to it, but with the distinction that each disk in the array
> will also have empty blocks for the total capacity of a standard spare
> disk. These empty blocks are thus distributed across all disks in the
> array and are used for array reconstruction in the event of a disk
> failure.
>
Is there any real advantage of such a setup compared to using raid 6 (in
which case, the "empty" blocks are second parity blocks)? There would
be a slightly greater write overhead (especially for small writes), but
that would not be seen by the host if there is enough cache on the
controller.
>> Given that spinup takes a few seconds, and a rebuild often takes many
>> hours, I can't see you have much to gain by keeping a spare drive
>> spinning.
>
> It might be required for some software RAID solutions where the spare
> disk cannot be spun down via software. For instance in the event of
> parallel SCSI drives in a software RAID array.
>
Not all drive systems (controller and/or drives) support spinning down
idle drives.
>> To my mind, a "hot spare" is a drive that will be used automatically
>> to replace a dead drive.
>
> Semantics. ;-)
>
Yes.
>> An "offline spare" is an extra drive that is physically attached, but
>> not in use automatically - in the event of a failure, it can be
>> manually assigned to a raid set. This makes sense if you have several
>> hardware raid sets defined and want to share a single spare, if the
>> hardware raid cannot support this (mdadm, of course, supports such a
>> setup with a shared hot spare).
>
> Most modern hardware RAID controllers support this.
>
OK.
It looks like we agree on most things here - we just had a little
difference on the areas we wrote about (specific information for the OP,
or more general RAID discussions), and a few small differences in
terminology.
mvh.,
David
Posted by David on 1/20/2010 1:07:03 PM

On Wednesday 20 January 2010 14:07 in comp.os.linux.misc, somebody
identifying as David Brown wrote...
> Aragorn wrote:
>
>> [David Brown wrote:]
>>
>>> But on the other hand, software raid5 can take advantage of large
>>> system memories for cache, and is thus far more likely to have the
>>> required stripe data already in its cache (especially for metadata
>>> and directory areas of the file system, which are commonly accessed
>>> but have small writes).
>>
>> Also correct, but dependent on available system RAM, of course. I'm
>> not sure how much RAM the RAID controllers in my two servers have - I
>> think one has 256 MB and the other one 128 MB - but on a system with,
>> say, 4 GB of system RAM, caching capacity is of course much higher.
>
> You will probably find it is cheaper to add another 4 GB system ram to
> the host than another 128 MB to the hardware raid controller.
Well, that all depends... On a system that uses ECC registered RAM -
such as a genuine server - the cost of adding more RAM may be quite
daunting.
On the other hand, I'm not so sure whether a hardware RAID adapter can
be retrofitted with more memory than it already has out of the box.
>> On the other hand - and purely in theory, as RAID 5/6 is not the
>> right solution for the OP - a battery-backed hardware RAID controller
>> on a machine hooked up to a UPS should be able to avoid the RAID 5/6
>> write hole, since the controller itself has its own processor. I
>> even believe that my Adaptec SAS RAID controller runs a Linux kernel
>> of its own.
>
> That's often the case these days - "hardware" raid controllers are
> frequently host processors running Linux (or sometimes other systems)
> and software raid. This is especially true of SAN boxes.
Well, the line between hardware RAID and software RAID is rather blurry
in the case of a modern hardware RAID controller. Sure, it's all
firmware, but there is a software component involved as well,
presumably because of certain efficiencies in scheduling with set-ups
employing multiple disks, as with the nested RAID solutions.
>>> Make sure your raid controller has batteries, and that the whole
>>> system is on an UPS!
>>
>> If data integrity is important, then I consider a UPS a necessity,
>> even without RAID. ;-)
>
> And assuming the data is important, the OP must also think about
> backup solutions. But that's worth its own thread.
Ahh, but that is the First Rule in the Bible of any sysadmin: "Thou
shalt make backups, and lots of them too!" :p
>>> [...] a host CPU is perfectly capable of splitting a stripe
>>> into its blocks in a fraction of a microsecond. It is also much
>>> faster at doing the parity calculations - the host CPU typically
>>> runs at least ten times as fast as the CPU or ASIC on the raid card.
>>
>> Yes, but the host CPU also has other things to take care of, while
>> the CPU or ASIC on a RAID controller is dedicated to just that one
>> task.
>
> That's true, but not really relevant for modern CPUs - when you've got
> 4 or 8 cores running at ten times the speed of the raid controller's
> chip, you are not talking about a significant load.
Well, 8 cores might be a bit of a stretch, and not everyone has quadcore
CPUs yet, either. My Big Machine for instance has two dualcore
Opterons in it, so that makes for four cores in total. (The machine
also has a SAS/SATA RAID controller, so the RAID discussion is moot
here, but I'm just mentioning it.)
Another thing which must not be overlooked is that the CPU or ASIC on a
hardware RAID controller is typically a RISC chip, and so comparing
clock speeds would not really give an accurate impression of its
performance versus a mainboard processor chip. For instance, a MIPS or
Alpha processor running at 800 MHz still outperforms most (single core)
2+ GHz processors.
>>> I have never heard of a distinction between a "hot spare" that is
>>> spinning, and a "standby spare" that is not spinning.
>>
>> This is quite a common distinction, mind you. There is even a "live
>> spare" solution, but to my knowledge this is specific to Adaptec -
>> they call it RAID 5E.
>>
>> In a "live spare" scenario, the spare disk is not used as such but is
>> part of the live array, and both data and parity blocks are being
>> written to it, but with the distinction that each disk in the array
>> will also have empty blocks for the total capacity of a standard
>> spare disk. These empty blocks are thus distributed across all disks
>> in the array and are used for array reconstruction in the event of a
>> disk failure.
>
> Is there any real advantage of such a setup compared to using raid 6
> (in which case, the "empty" blocks are second parity blocks)? There
> would be a slightly greater write overhead (especially for small
> writes), but that would not be seen by the host if there is enough
> cache on the controller.
Well, the advantage of this set-up is that you don't need to replace a
failing disk, since there is already sufficient diskspace left blank on
all disks in the array, and so the array can recreate itself using that
extra blank diskspace. This is of course all nice in theory, but in
practice one would eventually replace the disk anyway.
In terms of performance, it would be similar to RAID 6 for reads -
because the empty blocks have to be skipped in sequential reads - but
for writing it would be slightly better than RAID 6 since only one set
of parity data per stripe needs to be (re)calculated and (re)written.
It does of course still tolerate only a single disk failure, whereas
RAID 6 tolerates two.
>>> An "offline spare" is an extra drive that is physically attached,
>>> but not in use automatically - in the event of a failure, it can be
>>> manually assigned to a raid set. This makes sense if you have
>>> several hardware raid sets defined and want to share a single spare,
>>> if the hardware raid cannot support this (mdadm, of course, supports
>>> such a setup with a shared hot spare).
>>
>> Most modern hardware RAID controllers support this.
>
> OK.
>
> It looks like we agree on most things here - we just had a little
> difference on the areas we wrote about (specific information for the
> OP, or more general RAID discussions), and a few small differences in
> terminology.
Well, you've made me reconsider my usage of RAID 5, though. I am not
contemplating on using two RAID 10 arrays instead of two RAID 5 arrays,
since each of the arrays has four disks. They are both different
arrays, though. They're connected to the same RAID controller but the
first array is comprised of 147 GB 15k Hitachi SAS disks and the second
array is comprised of 1 TB 7.2k Western Digital RAID Edition SATA-2
disks on a hotswap backplane.
I had always considered RAID 5 to be the best trade-off, considering the
loss of diskspace involved versus the retail price of the hard disks -
especially the SAS disks - but considering that the SAS array will be
used to house the main systems in a virtualized set-up (on Xen) and
will probably endure the most small and random writes, RAID 10 might
actually be a better solution. The cost of the lost diskspace on the
SATA-2 disks is smaller since this type of disks is far less expensive
than SAS.
See, this is one of the advantages of Usenet. People get to share not
only knowledge but also differing views and strategies, and in the end,
everyone will have gleaned something useful. ;-)
--
*Aragorn*
(registered GNU/Linux user #223157)
Aragorn | 1/20/2010 1:44:36 PM
On Wednesday 20 January 2010 14:44 in comp.os.linux.misc, somebody
identifying as Aragorn wrote...
> On Wednesday 20 January 2010 14:07 in comp.os.linux.misc, somebody
> identifying as David Brown wrote...
>
> Well, you've made me reconsider my usage of RAID 5, though. I am not
^^^
> contemplating on using two RAID 10 arrays instead of two RAID 5
> arrays, [...]
That should read "now" instead of "not". Typo again. ;-)
--
*Aragorn*
(registered GNU/Linux user #223157)
Aragorn | 1/20/2010 1:49:59 PM
Aragorn wrote:
> On Wednesday 20 January 2010 14:07 in comp.os.linux.misc, somebody
> identifying as David Brown wrote...
>
>> Aragorn wrote:
>>
>>> [David Brown wrote:]
>>>
>>>> But on the other hand, software raid5 can take advantage of large
>>>> system memories for cache, and is thus far more likely to have the
>>>> required stripe data already in its cache (especially for metadata
>>>> and directory areas of the file system, which are commonly accessed
>>>> but have small writes).
>>> Also correct, but dependent on available system RAM, of course. I'm
>>> not sure how much RAM the RAID controllers in my two servers have - I
>>> think one has 256 MB and the other one 128 MB - but on a system with,
>>> say, 4 GB of system RAM, caching capacity is of course much higher.
>> You will probably find it is cheaper to add another 4 GB system ram to
>> the host than another 128 MB to the hardware raid controller.
>
> Well, that all depends... On a system that uses ECC registered RAM -
> such as a genuine server - the cost of adding more RAM may be quite
> daunting.
>
> On the other hand, I'm not so sure whether a hardware RAID adapter can
> be retrofitted with more memory than it already has out of the box.
>
"Daunting" is less than "impossible" :-) Of course, it depends on your
setup and your needs - there are no fixed answers (I think that's been
mentioned before...)
>>> On the other hand - and purely in theory, as RAID 5/6 is not the
>>> right solution for the OP - a battery-backed hardware RAID controller
>>> on a machine hooked up to a UPS should be able to avoid the RAID 5/6
>>> write hole, since the controller itself has its own processor. I
>>> even believe that my Adaptec SAS RAID controller runs a Linux kernel
>>> of its own.
>> That's often the case these days - "hardware" raid controllers are
>> frequently host processors running Linux (or sometimes other systems)
>> and software raid. This is especially true of SAN boxes.
>
> Well, the line between hardware RAID and software RAID is rather blurry
> in the case of a modern hardware RAID controller. Sure, it's all
> firmware, but there is a software component involved as well,
> presumably because of certain efficiencies in scheduling with set-ups
> employing multiple disks, as with the nested RAID solutions.
>
>>>> Make sure your raid controller has batteries, and that the whole
>>>> system is on an UPS!
>>> If data integrity is important, then I consider a UPS a necessity,
>>> even without RAID. ;-)
>> And assuming the data is important, the OP must also think about
>> backup solutions. But that's worth its own thread.
>
> Ahh, but that is the First Rule in the Bible of any sysadmin: "Thou
> shalt make backups, and lots of them too!" :p
>
The zeroth rule, which is often forgotten (until you learn the hard
way!), is "thou shalt make a plan for restoring from backups, test that
plan, document that plan, and find a way to ensure that all backups are
tested and restorable in this way". /Then/ you can start making your
actual backups!
And the second rule is "thou shalt make backups of your backups",
followed by "thou shalt have backups of critical hardware". (That's
another bonus of software raid - if your hardware raid card dies, you
may have to replace it with exactly the same type of card to get your
raid working again - with mdadm raid, you can use any PC.)
>>>> [...] a host CPU is perfectly capable of splitting a stripe
>>>> into its blocks in a fraction of a microsecond. It is also much
>>>> faster at doing the parity calculations - the host CPU typically
>>>> runs at least ten times as fast as the CPU or ASIC on the raid card.
>>> Yes, but the host CPU also has other things to take care of, while
>>> the CPU or ASIC on a RAID controller is dedicated to just that one
>>> task.
>> That's true, but not really relevant for modern CPUs - when you've got
>> 4 or 8 cores running at ten times the speed of the raid controller's
>> chip, you are not talking about a significant load.
>
> Well, 8 cores might be a bit of a stretch, and not everyone has quadcore
> CPUs yet, either. My Big Machine for instance has two dualcore
> Opterons in it, so that makes for four cores in total. (The machine
> also has a SAS/SATA RAID controller, so the RAID discussion is moot
> here, but I'm just mentioning it.)
>
> Another thing which must not be overlooked is that the CPU or ASIC on a
> hardware RAID controller is typically a RISC chip, and so comparing
> clock speeds would not really give an accurate impression of its
> performance versus a mainboard processor chip. For instance, a MIPS or
> Alpha processor running at 800 MHz still outperforms most (single core)
> 2+ GHz processors.
>
As already mentioned, "hardware" raid is often done now with a general
purpose processor rather than an ASIC - and MIPS is a particularly
popular core for the job. But while you get a lot more work out of an
800 MHz MIPS for a given price, size or power than you do with an x86, you
don't get more for a given clock rate. Parity calculations are really
just a big stream of "xor"'s, and a modern x86 will chew through these
as fast as memory bandwidth allows. Internally, x86 assembly is mostly
converted to wide-word RISC-style instructions, so a decently written
parity function will be as efficient per clock on an x86 as it is on MIPS.
There are plenty of situations where a slower clock but cleaner
architecture gives more true speed, especially if latency is more
important than throughput, but this isn't one of them.
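To make the "big stream of xors" concrete, here is a minimal Python sketch of how a RAID 5 style parity block is computed and how a lost block is recovered from the survivors. It is only an illustration, not how mdadm or any controller firmware is actually written; the function names `xor_blocks` and `rebuild_missing` and the sample data are made up for the example.

```python
from functools import reduce

def xor_blocks(blocks):
    """Return the byte-wise XOR of equally sized blocks (the parity block)."""
    if not blocks:
        raise ValueError("need at least one block")
    size = len(blocks[0])
    if any(len(b) != size for b in blocks):
        raise ValueError("all blocks must have the same size")
    # Walk the blocks column by column and XOR the bytes in each column.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))

def rebuild_missing(surviving_blocks, parity):
    """Recover the single missing data block from the survivors plus parity."""
    return xor_blocks(list(surviving_blocks) + [parity])

# Hypothetical stripe: two data blocks and one parity block, as on a
# three-disk RAID 5.
a1 = b"small wr"
a2 = b"ite test"
a_parity = xor_blocks([a1, a2])
```

Since XOR is its own inverse, the same routine serves both for generating parity and for rebuilding, which is why the work is bounded by how fast the blocks can be streamed through memory.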
>>>> I have never heard of a distinction between a "hot spare" that is
>>>> spinning, and a "standby spare" that is not spinning.
>>> This is quite a common distinction, mind you. There is even a "live
>>> spare" solution, but to my knowledge this is specific to Adaptec -
>>> they call it RAID 5E.
>>>
>>> In a "live spare" scenario, the spare disk is not used as such but is
>>> part of the live array, and both data and parity blocks are being
>>> written to it, but with the distinction that each disk in the array
>>> will also have empty blocks for the total capacity of a standard
>>> spare disk. These empty blocks are thus distributed across all disks
>>> in the array and are used for array reconstruction in the event of a
>>> disk failure.
>> Is there any real advantage of such a setup compared to using raid 6
>> (in which case, the "empty" blocks are second parity blocks)? There
>> would be a slightly greater write overhead (especially for small
>> writes), but that would not be seen by the host if there is enough
>> cache on the controller.
>
> Well, the advantage of this set-up is that you don't need to replace a
> failing disk, since there is already sufficient diskspace left blank on
> all disks in the array, and so the array can recreate itself using that
> extra blank diskspace. This is of course all nice in theory, but in
> practice one would eventually replace the disk anyway.
>
The same is true of raid6 - if one disk dies, the degraded raid6 is very
similar to raid5 until you replace the disk.
And I still don't see any significant advantage of spreading the holes
around the drives rather than having them all on the one drive (i.e., a
normal hot spare). The rebuild still has to do as many reads and
writes, and takes as long. The rebuild writes will be spread over all
the disks rather than just on the one disk, but I can't see any
advantage in that.
I suppose read performance, especially for many parallel small reads,
will be slightly higher than for a normal hot spare, since you have more
disks with active data and therefore higher chances of parallelising
these accesses. But you get the same advantage with raid6.
> In terms of performance, it would be similar to RAID 6 for reads -
> because the empty blocks have to be skipped in sequential reads - but
> for writing it would be slightly better than RAID 6 since only one set
> of parity data per stripe needs to be (re)calculated and (re)written.
>
> It does of course still tolerate only a single disk failure, whereas
> RAID 6 tolerates two.
>
>>>> An "offline spare" is an extra drive that is physically attached,
>>>> but not in use automatically - in the event of a failure, it can be
>>>> manually assigned to a raid set. This makes sense if you have
>>>> several hardware raid sets defined and want to share a single spare,
>>>> if the hardware raid cannot support this (mdadm, of course, supports
>>>> such a setup with a shared hot spare).
>>> Most modern hardware RAID controllers support this.
>> OK.
>>
>> It looks like we agree on most things here - we just had a little
>> difference on the areas we wrote about (specific information for the
>> OP, or more general RAID discussions), and a few small differences in
>> terminology.
>
> Well, you've made me reconsider my usage of RAID 5, though. I am now
> contemplating on using two RAID 10 arrays instead of two RAID 5 arrays,
> since each of the arrays has four disks. They are both different
> arrays, though. They're connected to the same RAID controller but the
> first array is comprised of 147 GB 15k Hitachi SAS disks and the second
> array is comprised of 1 TB 7.2k Western Digital RAID Edition SATA-2
> disks on a hotswap backplane.
>
> I had always considered RAID 5 to be the best trade-off, considering the
> loss of diskspace involved versus the retail price of the hard disks -
> especially the SAS disks - but considering that the SAS array will be
> used to house the main systems in a virtualized set-up (on Xen) and
> will probably endure the most small and random writes, RAID 10 might
> actually be a better solution. The cost of the lost diskspace on the
> SATA-2 disks is smaller since this type of disks is far less expensive
> than SAS.
>
I gather that raid 10 (hardware or software) is now often considered a
better choice - raid 5 is often viewed as unreliable due to the risks of
a second failure during rebuilds, which are increasingly time-consuming
with larger disks. Where practical, I think mdadm "far" raid 10 is the
optimal choice if you are happy with losing 50% of your disk space - it is
faster than other redundant setups in many situations, and has a great
deal of flexibility. If you want more redundancy, you can use double
mirrors for 33% disk space and still have full speed. If you have the
chance, it would be very nice to try out some different arrangements and
see which is fastest in reality, not just in theory!
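The 50% and 33% figures follow from simple arithmetic. As a rough sketch, assuming equally sized disks and ignoring metadata overhead, the usable share of raw capacity can be worked out like this (the function name `usable_fraction` and the level labels are made up for the illustration):

```python
from fractions import Fraction

def usable_fraction(level, disks, copies=2):
    """Fraction of raw capacity left for data, assuming equally sized disks.

    `level` is one of "raid0", "raid5", "raid6" or "raid10". `copies` only
    matters for "raid10": 2 means plain mirrors, 3 means double mirrors.
    """
    if level == "raid0":
        return Fraction(1)
    if level == "raid5":
        if disks < 3:
            raise ValueError("RAID 5 needs at least 3 disks")
        # One disk's worth of space goes to parity.
        return Fraction(disks - 1, disks)
    if level == "raid6":
        if disks < 4:
            raise ValueError("RAID 6 needs at least 4 disks")
        # Two disks' worth of space go to parity.
        return Fraction(disks - 2, disks)
    if level == "raid10":
        if copies < 2 or disks < copies:
            raise ValueError("need at least 2 copies and one disk per copy")
        # Every block is stored `copies` times.
        return Fraction(1, copies)
    raise ValueError(f"unknown level: {level}")
```

For a four-disk array such as the ones discussed above, `usable_fraction("raid5", 4)` gives 3/4 and `usable_fraction("raid10", 4)` gives 1/2, while `usable_fraction("raid10", 6, copies=3)` gives the 1/3 mentioned for double mirrors.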
The other option is to go for a file system that handles multiple disks
and redundancy directly - ZFS is the best known, with btrfs the
experimental choice on Linux.
> See, this is one of the advantages of Usenet. People get to share not
> only knowledge but also differing views and strategies, and in the end,
> everyone will have gleaned something useful. ;-)
>
Absolutely - that's also why it's good to have a general discussion
every now and again, rather than just answering a poster's questions.
Good questions (such as in this thread) inspire an exchange of
information for many people's benefits (I've learned things here too).
David | 1/20/2010 2:48:03 PM
David Brown <david@westcontrol.removethisbit.com> wrote in
news:4b570006$0$6251$8404b019@news.wineasy.se:
Thanks, both Aragorn and David! This is some of the most comprehensive
advice about RAID issues that I have ever gotten. If you are ever in
Madison, WI, I owe you guys a beer! :)
> Aragorn wrote:
>>
>
> Yes - perhaps the OP will give more details on his expected usage
> patterns. There are many other factors we haven't discussed that can
> affect the "best" setup, such as what requirements he has for future
> expansion.
Sure. More details: It's a mixed bag of I/O, actually. This is part of a
High Performance Compute Cluster, so a wide variety of codes are in use.
We have tracked the nature of their I/O: some have large sequential
writes, others are dominated by random seeks. That is why I am not really
fine-tuning my setup for a particular access pattern but going for the
best overall performance. RAID5 and RAID10 fit the bill, it seems, RAID10
even more so. I do have the luxury of excess storage right now, so I am
convinced I ought to do a RAID10 like you guys suggested (at the HW
level).
For combining the 3 RAID10's I am still split between LVM and mdadm. The
performance advantages draw me towards mdadm, but the ease of
partition resizing etc. makes LVM attractive.
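Whichever tool does the combining, striping across the three arrays comes down to the same round-robin address mapping. A minimal Python sketch of that mapping follows, assuming fixed-size chunks dealt out in turn; the function name `locate_chunk` is made up, and neither LVM nor mdadm exposes anything by that name.

```python
def locate_chunk(logical_chunk, devices=3):
    """Map a logical chunk number to (device index, chunk offset on that device).

    Chunks are dealt out round-robin, so consecutive chunks land on
    consecutive devices and a large sequential transfer keeps all of them busy.
    """
    if logical_chunk < 0:
        raise ValueError("chunk number must not be negative")
    if devices < 1:
        raise ValueError("need at least one device")
    return logical_chunk % devices, logical_chunk // devices
```

With three devices, chunks 0, 1 and 2 land on devices 0, 1 and 2 at offset 0, and chunk 3 wraps around to device 0 at offset 1.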
>
> And assuming the data is important, the OP must also think about
> backup solutions. But that's worth its own thread.
Actually I am lucky. The data will *not* be backed up on tape. You might
think this strange but this is meant to be a store for jobs that are
staging or running. So people are expected to remove data to more secure
storage in ~10 days. In the worst case we take a 10-day hit, which is OK for our
scientific computing needs.
>
> I was discussing raid a little more generally, since the OP was asking
> about mdadm and LVM, while I think you were talking more about
> hardware raid since he has hardware raid devices already (mdadm raid
> might be better value for money than hardware raid for most setups -
> but not if you already have the hardware!). Just a difference of
> emphasis, really.
Absolutely. I appreciate the improvement in my overall RAID
understanding.
--
Rahul
Rahul | 1/20/2010 3:25:46 PM
On Wednesday 20 January 2010 15:48 in comp.os.linux.misc, somebody
identifying as David Brown wrote...
> Aragorn wrote:
>
>> On Wednesday 20 January 2010 14:07 in comp.os.linux.misc, somebody
>> identifying as David Brown wrote...
>>
>>> You will probably find it is cheaper to add another 4 GB system ram
>>> to the host than another 128 MB to the hardware raid controller.
>>
>> Well, that all depends... On a system that uses ECC registered RAM -
>> such as a genuine server - the cost of adding more RAM may be quite
>> daunting.
>>
>> On the other hand, I'm not so sure whether a hardware RAID adapter
>> can be retrofitted with more memory than it already has out of the
>> box.
>
> "Daunting" is less than "impossible" :-) Of course, it depends on
> your setup and your needs - there are no fixed answers (I think that's
> been mentioned before...)
Yeah... Most adapters I know of come with either 128 MB or 256 MB. I'd
have to check the specs for my Adaptec SAS RAID adapter again, but my
U320 RAID adapter - also from Adaptec - has only 128 MB.
The sad news is that the battery packs are often optional, so you need
to pay attention when ordering or buying such an adapter card.
>>> And assuming the data is important, the OP must also think about
>>> backup solutions. But that's worth its own thread.
>>
>> Ahh, but that is the First Rule in the Bible of any sysadmin: "Thou
>> shalt make backups, and lots of them too!" :p
>
> The zeroth rule, which is often forgotten (until you learn the hard
> way!), is "thou shalt make a plan for restoring from backups, test
> that plan, document that plan, and find a way to ensure that all
> backups are tested and restorable in this way". /Then/ you can start
> making your actual backups!
Well, so far I've always used the tried and tested approach of tar'ing
in conjunction with bzip2. Can't get any cleaner than that. ;-)
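In the spirit of the zeroth rule, here is a small Python sketch that writes a bzip2-compressed tar archive and then reads every file back out of it to compare against the source. It uses only the standard `tarfile` and `hashlib` modules; the helper names and the scratch files in `demo()` are made up for the illustration. Note that it is not a full restore test: it only shows that the archive is readable and that its contents match.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path

def source_digests(src_dir):
    """Map each file's relative path under src_dir to its SHA-256 digest."""
    root = Path(src_dir)
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in root.rglob("*")
        if p.is_file()
    }

def make_backup(src_dir, archive_path):
    """Write every regular file under src_dir into a bzip2-compressed tar archive."""
    root = Path(src_dir)
    with tarfile.open(str(archive_path), "w:bz2") as tar:
        for p in sorted(root.rglob("*")):
            if p.is_file():
                tar.add(str(p), arcname=p.relative_to(root).as_posix())

def archive_digests(archive_path):
    """Read every file back out of the archive and hash what was stored."""
    digests = {}
    with tarfile.open(str(archive_path), "r:bz2") as tar:
        for member in tar:
            if member.isfile():
                data = tar.extractfile(member).read()
                digests[member.name] = hashlib.sha256(data).hexdigest()
    return digests

def backup_matches_source(src_dir, archive_path):
    """True if the archive holds exactly the files currently in src_dir."""
    return archive_digests(archive_path) == source_digests(src_dir)

def demo():
    """Back up a scratch directory, check it, then change a file and check again."""
    with tempfile.TemporaryDirectory() as scratch:
        src = Path(scratch) / "data"
        (src / "sub").mkdir(parents=True)
        (src / "a.txt").write_text("hello")
        (src / "sub" / "b.bin").write_bytes(b"\x00\x01\x02")
        archive = Path(scratch) / "backup.tar.bz2"
        make_backup(src, archive)
        fresh = backup_matches_source(src, archive)
        (src / "a.txt").write_text("changed since the backup")
        stale = backup_matches_source(src, archive)
        return fresh, stale
```

A real restore drill would still unpack the archive onto separate hardware and boot or mount the result; the check above merely catches an unreadable or incomplete archive early.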
> And the second rule is "thou shalt make backups of your backups",
> followed by "thou shalt have backups of critical hardware". (That's
> another bonus of software raid - if your hardware raid card dies, you
> may have to replace it with exactly the same type of card to get your
> raid working again - with mdadm raid, you can use any PC.)
Well, considering that my Big Machine has drained my piggy bank for
about 17'000 Euros worth of hardware, having a duplicate machine is not
really an option. The piggy bank's on a diet now. :-)
I do on the other hand still have a slightly older dual Xeon machine
with 4 GB of RAM and a U320 SCSI RAID 1 (with two 73 GB disks), which
I will be setting up as an emergency replacement server, and to store
additional backups on - I store my other backups on Iomega REV disks.
>> Another thing which must not be overlooked is that the CPU or ASIC on
>> a hardware RAID controller is typically a RISC chip, and so comparing
>> clock speeds would not really give an accurate impression of its
>> performance versus a mainboard processor chip. For instance, a MIPS
>> or Alpha processor running at 800 MHz still outperforms most (single
>> core) 2+ GHz processors.
>
> As already mentioned, "hardware" raid is often done now with a general
> purpose processor rather than an ASIC - and MIPS is a particularly
> popular core for the job.
I'm not sure on the one on my SAS RAID adapter, but I think it's an
Intel RISC processor. It's not a MIPS or an Alpha, that much I am
certain of.
> But while you get a lot more work out of an 800 MHz MIPS for a given price,
> size or power than you do with an x86, you don't get more for a given
> clock rate. Parity calculations are really just a big stream
> of "xor"'s, and a modern x86 will chew through these as fast as memory
> bandwidth allows. Internally, x86 assembly is mostly converted to
> wide-word RISC-style instructions, so a decently written parity
> function will be as efficient per clock on an x86 as it is on MIPS.
True.
>>>>> I have never heard of a distinction between a "hot spare" that is
>>>>> spinning, and a "standby spare" that is not spinning.
>>>>
>>>> This is quite a common distinction, mind you. There is even a
>>>> "live spare" solution, but to my knowledge this is specific to
>>>> Adaptec - they call it RAID 5E.
>>>>
>>>> In a "live spare" scenario, the spare disk is not used as such but
>>>> is part of the live array, and both data and parity blocks are
>>>> being written to it, but with the distinction that each disk in the
>>>> array will also have empty blocks for the total capacity of a
>>>> standard spare disk. These empty blocks are thus distributed
>>>> across all disks in the array and are used for array reconstruction
>>>> in the event of a disk failure.
>>>
>>> Is there any real advantage of such a setup compared to using raid 6
>>> (in which case, the "empty" blocks are second parity blocks)? There
>>> would be a slightly greater write overhead (especially for small
>>> writes), but that would not be seen by the host if there is enough
>>> cache on the controller.
>>
>> Well, the advantage of this set-up is that you don't need to replace
>> a failing disk, since there is already sufficient diskspace left
>> blank on all disks in the array, and so the array can recreate itself
>> using that extra blank diskspace. This is of course all nice in
>> theory, but in practice one would eventually replace the disk anyway.
>
> The same is true of raid6 - if one disk dies, the degraded raid6 is
> very similar to raid5 until you replace the disk.
>
> And I still don't see any significant advantage of spreading the
> holes around the drives rather than having them all on the one drive
> (i.e., a normal hot spare). The rebuild still has to do as many reads
> and writes, and takes as long. The rebuild writes will be spread over
> all the disks rather than just on the one disk, but I can't see any
> advantage in that.
Well, the idea is simply to give the spare disk some exercise, i.e. to
use it as part of the live array while still offering the extra
redundancy of a spare. So in the event of a failure, the array can be
fully rebuilt without the need to replace the broken drive, rather than
staying in degraded mode until the broken drive is replaced.
> I suppose read performance, especially for many parallel small reads,
> will be slightly higher than for a normal hot spare, since you have
> more disks with active data and therefore higher chances of
> parallelising these accesses. But you get the same advantage with
> raid6.
Yes, but RAID 6 would be slower for small writes, and if one of the
drives fails, the array stays in degraded mode (since it considers
itself to be a RAID 6, not a RAID 5E).
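The small-write difference can be put in rough numbers. The Python sketch below counts disk operations for one small write that updates a single data block, assuming the textbook read-modify-write scheme and no cache hits at all; a real controller or the md driver can do better when the stripe is already cached. The function name `small_write_ios` is made up for the illustration.

```python
def small_write_ios(parity_blocks=0, mirror_copies=1):
    """Disk I/Os for one small write under a plain read-modify-write model.

    parity_blocks: parity blocks per stripe (1 for RAID 5 or 5E, 2 for RAID 6).
    mirror_copies: copies kept of each block (2 for RAID 1 or 10); only used
                   when parity_blocks is 0.
    """
    if parity_blocks < 0 or mirror_copies < 1:
        raise ValueError("parity_blocks must be >= 0 and mirror_copies >= 1")
    if parity_blocks == 0:
        # Mirroring or plain striping: write each copy, no reads needed.
        return {"reads": 0, "writes": mirror_copies}
    # Parity RAID: read the old data block and each old parity block,
    # then write the new data block and each new parity block.
    touched = 1 + parity_blocks
    return {"reads": touched, "writes": touched}
```

Under these assumptions a single-parity layout costs 2 reads and 2 writes per small write, RAID 6 costs 3 and 3, and a two-way mirror costs just 2 writes, which is the usual argument for RAID 10 under small random writes.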
>>> It looks like we agree on most things here - we just had a little
>>> difference on the areas we wrote about (specific information for the
>>> OP, or more general RAID discussions), and a few small differences
>>> in terminology.
>>
>> Well, you've made me reconsider my usage of RAID 5, though. I am now
>> contemplating on using two RAID 10 arrays instead of two RAID 5
>> arrays, since each of the arrays has four disks. They are both
>> different arrays, though. They're connected to the same RAID
>> controller but the first array is comprised of 147 GB 15k Hitachi SAS
>> disks and the second array is comprised of 1 TB 7.2k Western Digital
>> RAID Edition SATA-2 disks on a hotswap backplane.
>>
>> I had always considered RAID 5 to be the best trade-off, considering
>> the loss of diskspace involved versus the retail price of the hard
>> disks - especially the SAS disks - but considering that the SAS array
>> will be used to house the main systems in a virtualized set-up (on
>> Xen) and will probably endure the most small and random writes, RAID
>> 10 might actually be a better solution. The cost of the lost
>> diskspace on the SATA-2 disks is smaller since this type of disks is
>> far less expensive than SAS.
>
> I gather that raid 10 (hardware or software) is now often considered a
> better choice - raid 5 is often viewed as unreliable due to the risks
> of a second failure during rebuilds, which are increasingly
> time-consuming with larger disks. Where practical, I think
> mdadm "far" raid 10 is the optimal choice if you are happy with losing 50% of
> your disk space - it is faster than other redundant setups in many
> situations, and has a great deal of flexibility.
Well, 50% is the minimum storage capacity one loses when using any kind
of mirroring, be it RAID 1, RAID 10, RAID 0+1 or whatever.
> If you want more redundancy, you can use double mirrors for 33% disk
> space and still have full speed.
Yes, but that's a set-up which, due to understandable financial
considerations, would be reserved only for the corporate world. Many
people already consider me certifiably insane for having spent that
much money - 17'000 Euro, as I wrote higher up - on a privately owned
computer system. But then again, for the intended purposes, I need
fast and reliable hardware and a lot of horsepower. :-)
In the case of the OP on the other hand, 45 SAS disks of 300 GB each
and three SAS RAID storage enclosures hardly seem like an affordable
buy either, so I take it he intends to use it for a business.
That, or he's a maniac like me. :p
> If you have the chance, it would be very nice to try out some
> different arrangements and see which is fastest in reality, not just
> in theory!
Ahh, but whole books have been written about such tests, and it still
always boils down to "What are you planning to do with it?" For
instance, a database server has different needs from a mailserver, and
this has different needs from a fileserver or workstation, etc. ;-)
> The other option is to go for a file system that handles multiple
> disks and redundancy directly - ZFS is the best known, with btrfs the
> experimental choice on Linux.
I don't think Btrfs is considered stable enough yet. ZFS is of
course a great choice, but the GPL forbids linking ZFS into the Linux
kernel. If there is a "filesystem in userspace" implementation of it,
then it would of course be possible to legally use ZFS on a GNU/Linux
system.
I have been looking into NexentaOS (i.e. GNU/kOpenSolaris) for a while,
which uses ZFS, albeit that ZFS was not my reason for being interested
in the project. I was more interested in the fact that it supports
both Solaris Zones - of which the Linux equivalents are OpenVZ and
VServer - and running paravirtualized on top of Xen.
Doing that with OpenVZ requires the use of a 2.6.27 kernel which is
still considered unstable by the OpenVZ developers, and doing that with
VServer is as good as impossible, since they're still using a 2.6.16
kernel, and you can't apply the (now obsolete) Xen patches to that
because those are for 2.6.18. And thus, running VServer in a Xen
virtual machine would require that you run it via hardware
virtualization rather than paravirtualized.
The big problem with NexentaOS however is that it's based on Ubuntu and
that it uses binary .deb packages, whereas I would rather have a Gentoo
approach, where you can build the whole thing from sources without
having to go "the LFS way".
Oh well, I've put the whole thing off until the weekend, so I still have
plenty of time to think things over. ;-)
>> See, this is one of the advantages of Usenet. People get to share
>> not only knowledge but also differing views and strategies, and in
>> the end, everyone will have gleaned something useful. ;-)
>
> Absolutely - that's also why it's good to have a general discussion
> every now and again, rather than just answering a poster's questions.
> Good questions (such as in this thread) inspire an exchange of
> information for many people's benefits (I've learned things here too).
Maybe we should invite some politicians over to Usenet. Then *they*
might possibly learn something about the real world as well. :p
--
*Aragorn*
(registered GNU/Linux user #223157)
Aragorn | 1/20/2010 4:23:45 PM
On Wednesday 20 January 2010 16:25 in comp.os.linux.misc, somebody
identifying as Rahul wrote...
> David Brown <david@westcontrol.removethisbit.com> wrote in
> news:4b570006$0$6251$8404b019@news.wineasy.se:
>
> Thanks both Aragorn and David! This is some of the most comprehensive
> advice about RAID issues that I have ever got. If you are ever in Madison,
> WI, I owe you guys a beer! :)
Unfortunately, that offer will have to remain academic but it is
nevertheless appreciated. ;-)
>> Yes - perhaps the OP will give more details on his expected usage
>> patterns. There are many other factors we haven't discussed that can
>> affect the "best" setup, such as what requirements he has for future
>> expansion.
>
> Sure. More details: It's a mixed bag of I/O actually. This is a part of
> a High Performance Compute Cluster. So a wide variety of codes are in
> use. We have tracked the nature of the I/O. Some of them have large
> sequential writes. Others are dominated by random seeks. Which is why I
> am not really fine tuning my setup for a particular access pattern but
> going for the best overall performance. RAID5 and RAID10 fit the bill it
> seems. RAID10 even more so. I do have the luxury of excess storage
> right now so I am convinced I ought to do a RAID10 like you guys
> suggested (at the HW level).
>
> For combining the 3 RAID10's I am still split between LVM and mdadm.
> The performance advantages convince me towards mdadm. But the ease of
> partition resizing etc. make LVM attractive.
Well, if you're only going to be putting "/home" on the array, then LVM
is a moot point. Just set each array up as a RAID 10, possibly with a
spare on each array and format each array with a single partition, and
then you can use /mdadm/ to combine them into a stripeset. ;-)
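
As a rough sketch of that last step: the snippet below only assembles (and prints) an mdadm command line for a RAID 0 over three existing arrays; it does not run anything. The device names /dev/sdb1, /dev/sdc1 and /dev/sdd1 and the target /dev/md0 are placeholders, since the names under which the hardware RAID 10 arrays show up will differ per system, and the helper name is made up for illustration.

```python
import shlex


def mdadm_stripe_command(md_device, members, chunk_kib=None):
    """Build (but do not run) an mdadm command that stripes `members` as RAID 0."""
    if len(members) < 2:
        raise ValueError("a stripeset needs at least two member devices")
    argv = [
        "mdadm", "--create", md_device,
        "--level=0",                        # RAID 0, i.e. plain striping
        f"--raid-devices={len(members)}",   # number of member devices
    ]
    if chunk_kib is not None:
        argv.append(f"--chunk={chunk_kib}")  # chunk size in KiB
    argv.extend(members)
    return argv


# Placeholder device names: one partition per hardware RAID 10 array.
cmd = mdadm_stripe_command("/dev/md0", ["/dev/sdb1", "/dev/sdc1", "/dev/sdd1"])
print(shlex.join(cmd))
```

Printing the command rather than executing it is deliberate: creating an array is destructive to whatever is on the member devices, so it is worth reading the line before pasting it into a root shell.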
--
*Aragorn*
(registered GNU/Linux user #223157)
Aragorn | 1/20/2010 6:23:08 PM
Aragorn wrote:
> On Wednesday 20 January 2010 15:48 in comp.os.linux.misc, somebody
> identifying as David Brown wrote...
>> Aragorn wrote:
>>> On Wednesday 20 January 2010 14:07 in comp.os.linux.misc, somebody
>>> identifying as David Brown wrote...
<snip to save a little space>
>>>> And assuming the data is important, the OP must also think about
>>>> backup solutions. But that's worth its own thread.
>>> Ahh, but that is the First Rule in the Bible of any sysadmin: "Thou
>>> shalt make backups, and lots of them too!" :p
>> The zeroth rule, which is often forgotten (until you learn the hard
>> way!), is "thou shalt make a plan for restoring from backups, test
>> that plan, document that plan, and find a way to ensure that all
>> backups are tested and restoreable in this way". /Then/ you can start
>> making your actual backups!
>
> Well, so far I've always used the tried and tested approach of tar'ing
> in conjunction with bzip2. Can't get any cleaner than that. ;-)
>
rsync copying is even cleaner - the backup copy is directly accessible.
And when combined with hard link copies in some way (such as
rsnapshot) you can get snapshots.
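
For anyone wondering how hard links give you snapshots, here is a minimal sketch of the idea in Python. It is not how rsync or rsnapshot are implemented, just the same principle: files that are unchanged since the previous snapshot are hard-linked instead of copied, so every snapshot looks like a full copy but only changed files cost extra space. The function name and arguments are invented for this example, symlinks and special files are ignored, and all snapshots must live on the same filesystem for the links to work.

```python
import filecmp
import os
import shutil


def snapshot(source, previous, target):
    """Copy `source` into `target`, hard-linking files unchanged since `previous`.

    `previous` is the path of the last snapshot, or None for the first run.
    """
    for dirpath, _dirnames, filenames in os.walk(source):
        rel = os.path.relpath(dirpath, source)
        os.makedirs(os.path.join(target, rel), exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            new = os.path.join(target, rel, name)
            old = os.path.join(previous, rel, name) if previous else None
            if old and os.path.isfile(old) and filecmp.cmp(src, old, shallow=False):
                os.link(old, new)        # unchanged: share the data blocks
            else:
                shutil.copy2(src, new)   # new or modified: store a fresh copy
```

Deleting an old snapshot directory afterwards is safe, because the data of a hard-linked file is only freed once its last link is gone.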
Of course, .tar.bz2 is good too - /if/ you have it automated so that it
is actually done (or you are one of these rare people that can regularly
follow a manual procedure). It also needs to be saved in a safe and
reliable place - many people have had regular backups saved to tape only
to find later that the tapes were unreadable. And of course it needs to
be saved again, in a different place and stored at a different site.
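
A small sketch of what "automated and checked" could look like, using only Python's standard tarfile module. The function names are made up for this example, and reading every member back is only a basic readability test of the archive, not a substitute for a real trial restore.

```python
import tarfile


def make_backup(source_dir, archive_path):
    """Write `source_dir` into a bzip2-compressed tar archive."""
    with tarfile.open(archive_path, "w:bz2") as tar:
        tar.add(source_dir, arcname=".")


def verify_backup(archive_path):
    """Read every regular file back in full; return how many were checked.

    Decompressing the whole stream raises an error if the archive is
    truncated or corrupt, which is exactly what we want to find out early.
    """
    checked = 0
    with tarfile.open(archive_path, "r:bz2") as tar:
        for member in tar:
            if member.isfile():
                with tar.extractfile(member) as fh:
                    while fh.read(1 << 20):   # read in 1 MiB pieces
                        pass
                checked += 1
    return checked
```

Something like this can be called from a cron job, with the archive written outside the directory being backed up so that it does not try to include itself.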
I know I'm preaching to the choir here, as you said before - but there
may be others in the congregation.
>> And the second rule is "thou shalt make backups of your backups",
>> followed by "thou shalt have backups of critical hardware". (That's
>> another bonus of software raid - if your hardware raid card dies, you
>> may have to replace it with exactly the same type of card to get your
>> raid working again - with mdadm raid, you can use any PC.)
>
> Well, considering that my Big Machine has drained my piggy bank for
> about 17'000 Euros worth of hardware, having a duplicate machine is not
> really an option. The piggy bank's on a diet now. :-)
>
You don't need a duplicate machine - you just need duplicates of any
parts that are important, specific, and may not always be easily
available. There is no need to buy a new machine, but as soon as your
particular choice of hardware raid card starts going out of fashion, buy
a spare. Better still, buy a spare /now/ before the manufacturer
decides to update the firmware in new versions of the card and they
become incompatible with your raid drives. Of course, you can always
restore from backup in an emergency if the worst happens.
> I do on the other hand still have a slightly older dual Xeon machine
> with 4 GB of RAM and an U320 SCSI RAID 1 (with two 73 GB disks), which
> I will be setting up as an emergency replacement server, and to store
> additional backups on - I store my other backups on Iomega REV disks.
>
>>> Another thing which must not be overlooked is that the CPU or ASIC on
>>> a hardware RAID controller is typically a RISC chip, and so comparing
>>> clock speeds would not really give an accurate impression of its
>>> performance versus a mainboard processor chip. For instance, a MIPS
>>> or Alpha processor running at 800 MHz still outperforms most (single
>>> core) 2+ GHz processors.
>> As already mentioned, "hardware" raid is often done now with a general
>> purpose processor rather than an ASIC - and MIPS is a particularly
>> popular core for the job.
>
> I'm not sure on the one on my SAS RAID adapter, but I think it's an
> Intel RISC processor. It's not a MIPS or an Alpha, that much I am
> certain of.
>
Intel haven't made RISC processors for many years (discounting the
Itanium, which is an unlikely choice for a raid processor). They used
to have StrongArms, and long, long ago they had a few other designs, but
I'm pretty certain you don't have an Intel RISC processor on the card.
It also will not be an Alpha - they have not been made for years either
(they were very nice chips until DEC, then HP+Compaq totally screwed
them up, with plenty of encouragement from Intel). Realistic cores
include MIPS in many flavours, PPC, and for more recent designs, perhaps
an ARM of some kind. If the heavy lifting is being done by ASIC logic
rather than the processor core, there is a wider choice of possible cores.
>> But while you get a lot more work out of an 800 MHz RISC chip for a given price,
>> size or power than you do with an x86, you don't get more for a given
>> clock rate. Parity calculations are really just a big stream
>> of "xor"'s, and a modern x86 will chew through these as fast as memory
>> bandwidth allows. Internally, x86 assembly is mostly converted to
>> wide-word RISC-style instructions, so a decently written parity
>> function will be as efficient per clock on an x86 as it is on MIPS.
>
> True.
>
>>>>>> I have never heard of a distinction between a "hot spare" that is
>>>>>> spinning, and a "standby spare" that is not spinning.
>>>>> This is quite a common distinction, mind you. There is even a
>>>>> "live spare" solution, but to my knowledge this is specific to
>>>>> Adaptec - they call it RAID 5E.
>>>>>
>>>>> In a "live spare" scenario, the spare disk is not used as such but
>>>>> is part of the live array, and both data and parity blocks are
>>>>> being written to it, but with the distinction that each disk in the
>>>>> array will also have empty blocks for the total capacity of a
>>>>> standard spare disk. These empty blocks are thus distributed
>>>>> across all disks in the array and are used for array reconstruction
>>>>> in the event of a disk failure.
>>>> Is there any real advantage of such a setup compared to using raid 6
>>>> (in which case, the "empty" blocks are second parity blocks)? There
>>>> would be a slightly greater write overhead (especially for small
>>>> writes), but that would not be seen by the host if there is enough
>>>> cache on the controller.
>>> Well, the advantage of this set-up is that you don't need to replace
>>> a failing disk, since there is already sufficient diskspace left
>>> blank on all disks in the array, and so the array can recreate itself
>>> using that extra blank diskspace. This is of course all nice in
>>> theory, but in practice one would eventually replace the disk anyway.
>> The same is true of raid6 - if one disk dies, the degraded raid6 is
>> very similar to raid5 until you replace the disk.
>>
>> And I still don't see any significant advantage of spreading the
>> holes around the drives rather than having them all on the one drive
>> (i.e., a normal hot spare). The rebuild still has to do as many reads
>> and writes, and takes as long. The rebuild writes will be spread over
>> all the disks rather than just on the one disk, but I can't see any
>> advantage in that.
>
> Well, the idea is simply to give the spare disk some exercise, i.e. to
> use it as part of the live array while still offering the extra
> redundancy of a spare. So in the event of a failure, the array can be
> fully rebuilt without the need to replace the broken drive, as opposed
> to that the array would stay in degraded mode until the broken drive is
> replaced.
>
The array will be in degraded mode while the rebuild is being done, just
as if it were raid5 with a hot spare - and it will be equally slow
during the rebuild. So no points there.
In fact, according to Wikipedia, the controller will "compact" the
degraded raid set into a normal raid5, and when you replace the broken
drive it will "uncompact" it into the raid 5E arrangement again. The
"compact" and "uncompact" operations take much longer than a standard
raid5 rebuild.
So all you get here is a marginal increase in the parallelisation of
multiple simultaneous small reads, which you could get anyway with raid6
rather than raid5 with a spare.
>> I suppose read performance, especially for many parallel small reads,
>> will be slightly higher than for a normal hot spare, since you have
>> more disks with active data and therefore higher chances of
>> parallelising these accesses. But you get the same advantage with
>> raid6.
>
> Yes, but RAID 6 would be slower for small writes, and if one of the
> drives fails, the array stays in degraded mode (since it considers
> itself to be a RAID 6, not a RAID 5E).
>
Degraded raid5 and raid6 have varying speeds, depending on whether the
data you access is available directly or must be calculated from the
rest of the stripe and the parity. The same applies to a degraded raid
5E with a broken drive.
You are right that small writes to raid 6 would be slower than to a raid 5E.
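
To make the "calculated from the rest of the stripe and the parity" part concrete, here is a toy illustration of single (raid5-style) parity in Python. The three data blocks are made-up four-byte strings standing in for the chunks of one stripe; a real array does the same XOR over much larger chunks, and raid6's second parity uses a different calculation that is not shown here.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe on a hypothetical four-disk raid5: three data chunks plus parity.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# The disk holding data[1] dies. A read of data[0] or data[2] is served
# directly; a read of data[1] has to XOR everything that is left.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt == data[1])
```

This is why a degraded read costs a read from every surviving disk in the stripe, while reads of chunks on healthy disks are as fast as before.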
>>>> It looks like we agree on most things here - we just had a little
>>>> difference on the areas we wrote about (specific information for the
>>>> OP, or more general RAID discussions), and a few small differences
>>>> in terminology.
>>> Well, you've made me reconsider my usage of RAID 5, though. I am now
>>> contemplating on using two RAID 10 arrays instead of two RAID 5
>>> arrays, since each of the arrays has four disks. They are both
>>> different arrays, though. They're connected to the same RAID
>>> controller but the first array is comprised of 147 GB 15k Hitachi SAS
>>> disks and the second array is comprised of 1 TB 7.2k Western Digital
>>> RAID Edition SATA-2 disks on a hotswap backplane.
>>>
>>> I had always considered RAID 5 to be the best trade-off, considering
>>> the loss of diskspace involved versus the retail price of the hard
>>> disks - especially the SAS disks - but considering that the SAS array
>>> will be used to house the main systems in a virtualized set-up (on
>>> Xen) and will probably endure the most small and random writes, RAID
>>> 10 might actually be a better solution. The cost of the lost
>>> diskspace on the SATA-2 disks is smaller since this type of disks is
>>> far less expensive than SAS.
>> I gather that raid 10 (hardware or software) is now often considered a
>> better choice - raid 5 is often viewed as unreliable due to the risks
>> of a second failure during rebuilds, which are increasingly
>> time-consuming with larger disks. Where practical, I think
>> mdadm "far" raid 10 is the optimal if you are happy with losing 50% of
>> your disk space - it is faster than other redundant setups in many
>> situations, and has a great deal of flexibility.
>
> Well, 50% is the minimum storage capacity one loses when using any kind
> of mirroring, be it RAID 1, RAID 10, RAID 0+1 or whatever.
>
>> If you want more redundancy, you can use double mirrors for 33% disk
>> space and still have full speed.
>
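
For reference, the space trade-offs in this exchange are easy to tabulate. The sketch below uses the usual textbook formulas and the disk sizes quoted above (four 147 GB SAS disks, four 1 TB SATA disks); it ignores metadata overhead and hot spares, and the function name is just for illustration.

```python
def usable_gb(level, disks, disk_gb, copies=2):
    """Usable capacity in GB for a few common RAID levels."""
    if level == "raid0":
        return disks * disk_gb
    if level == "raid5":
        if disks < 3:
            raise ValueError("raid5 needs at least three disks")
        return (disks - 1) * disk_gb        # one disk's worth of parity
    if level == "raid6":
        if disks < 4:
            raise ValueError("raid6 needs at least four disks")
        return (disks - 2) * disk_gb        # two disks' worth of parity
    if level == "raid10":
        # `copies` mirrors of every block: 2 gives 50% usable, 3 about 33%.
        return disks * disk_gb // copies
    raise ValueError(f"unknown level: {level}")


for name, size in (("SAS", 147), ("SATA", 1000)):
    for level in ("raid5", "raid6", "raid10"):
        print(name, level, usable_gb(level, 4, size), "GB")
```

On four disks, raid6 and raid 10 cost the same space, and raid5 buys back one disk's worth at the price of slower small writes.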
> Yes, but that's a set-up which, due to understandable financial
> considerations, would be reserved only for the corporate world. Many
> people already consider me certifiably insane for having spent that
> much money - 17'000 Euro, as I wrote higher up - on a privately owned
> computer system. But then again, for the intended purposes, I need
> fast and reliable hardware and a lot of horsepower. :-)
>
I'm curious - what is the intended purpose? I think I would have a hard
job spending more than about three or four thousand Euros on a single
system.
> In the case of the OP on the other hand, 45 SAS disks of 300 GB each
> and three SAS RAID storage enclosures also don't seem like quite an
> affordable buy, so I take it he intends to use it for a business.
>
It also does not strike me as a high value-for-money system - I can't
help feeling that this is way more bandwidth than you could actually
make use of in the rest of the system, so it would be better to have
fewer, larger drives and fewer layers to reduce the latencies. Spend the
cash saved on even more ram :-)
45 disks at a throughput of say 75 MBps each gives about 3.4 GBps - say
3 GBps since some are hot spares. Ultimately, being a server, this is
going to be pumped out on Ethernet links. That's a lot of bandwidth -
roughly 24 Gbit/s, enough to saturate two 10 Gbit links and part of a third.
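
A quick back-of-the-envelope version of that calculation, as a sketch only. The 75 MBps per disk is the guess from above, the count of three hot spares (one per enclosure) is purely an assumption, units are decimal, and protocol overhead, controller limits and bus limits are all ignored.

```python
def aggregate_throughput(disks, mb_per_s_each, hot_spares=0):
    """Return (MB/s, Gbit/s) for the disks that actually carry data."""
    active = disks - hot_spares
    mb_per_s = active * mb_per_s_each
    gbit_per_s = mb_per_s * 8 / 1000     # 8 bits per byte, 1000 Mbit per Gbit
    return mb_per_s, gbit_per_s


for spares in (0, 3):
    mb, gbit = aggregate_throughput(45, 75, hot_spares=spares)
    links = gbit / 10                    # equivalent number of 10 Gbit links
    print(f"{spares} spares: {mb} MB/s = {gbit:.1f} Gbit/s = {links:.1f} x 10 Gbit")
```

Whatever the exact figure, the raw disk bandwidth comes out well above what a single 10 Gbit link can carry.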
I have absolutely no real-world experience with these sorts of systems,
and could therefore be totally wrong, but my gut feeling is that the
theoretical numbers will not scale with so many drives - something like
fifteen 1 TB SATA drives would be similar in speed in practice.
> That, or he's a maniac like me. :p
>
>> If you have the chance, it would be very nice to try out some
>> different arrangements and see which is fastest in reality, not just
>> in theory!
>
> Ahh, but whole books have been written about such tests, and it still
> always boils down to "What are you planning to do with it?" For
> instance, a database server has different needs from a mailserver, and
> this has different needs from a fileserver or workstation, etc. ;-)
>
It would still be fun!
>> The other option is to go for a file system that handles multiple
>> disks and redundancy directly - ZFS is the best known, with btrfs the
>> experimental choice on Linux.
>
> I don't think Btrfs is considered stable enough yet. ZFS is of
> course a great choice, but the GPL forbids linking ZFS into the Linux
> kernel. If there is a "filesystem in userspace" implementation of it,
> then it would of course be possible to legally use ZFS on a GNU/Linux
> system.
>
There /is/ a "filesystem in userspace" implementation of ZFS (using
fuse). But it is not feature complete, and not particularly fast.
btrfs is still a risk, and is still missing some features (such as
elegant handling of low free space...), but the potential is there.
> I have been looking into NexentaOS (i.e. GNU/kOpenSolaris) for a while,
> which uses ZFS, albeit that ZFS was not my reason for being interested
> in the project. I was more interested in the fact that it supports
> both Solaris Zones - of which the Linux equivalents are OpenVZ and
> VServer - and running paravirtualized on top of Xen.
>
> Doing that with OpenVZ requires the use of a 2.6.27 kernel which is
> still considered unstable by the OpenVZ developers, and doing that with
> VServer is as good as impossible, since they're still using a 2.6.16
> kernel, and you can't apply the (now obsolete) Xen patches to that
> because those are for 2.6.18. And thus, running VServer in a Xen
> virtual machine would require that you run it via hardware
> virtualization rather than paravirtualized.
>
> The big problem with NexentaOS however is that it's based on Ubuntu and
> that it uses binary .deb packages, whereas I would rather have a Gentoo
> approach, where you can build the whole thing from sources without
> having to go "the LFS way".
>
Why is it always so hard to get /everything/ you want when building a
system :-(
> Oh well, I've put the whole thing off until the weekend, so I still have
> plenty of time to think things over. ;-)
>
>>> See, this is one of the advantages of Usenet. People get to share
>>> not only knowledge but also differing views and strategies, and in
>>> the end, everyone will have gleaned something useful. ;-)
>> Absolutely - that's also why it's good to have a general discussion
>> every now and again, rather than just answering a poster's questions.
>> Good questions (such as in this thread) inspire an exchange of
>> information for many people's benefits (I've learned things here too).
>
> Maybe we should invite some politicians over to Usenet. Then *they*
> might possibly learn something about the real world as well. :p
>
David | 1/20/2010 10:59:44 PM
On Wednesday 20 January 2010 23:59 in comp.os.linux.misc, somebody
identifying as David Brown wrote...
> Aragorn wrote:
>
>> On Wednesday 20 January 2010 15:48 in comp.os.linux.misc, somebody
>> identifying as David Brown wrote...
>
> <snip to save a little space>
Yeah, these posts themselves are getting quite long, but at least, it's
one of those rare threads in which the conversation continues
on-topic. :-)
Quite honestly, I'm enjoying this thread, because I get to hear
interesting feedback - and I think you do too, from your point of view -
and I have a feeling that Rahul, the OP, is sitting there enjoying
himself over all the valid arguments being discussed here in the debate
over various RAID types. ;-)
This is a good thread, and I recommend that any lurking newbies
save the posts for later reference in the event that they are faced
with the decision on whether and how to implement RAID on one of their
machines. Newbies, heads up! :p
>>> The zeroth rule, which is often forgotten (until you learn the hard
>>> way!), is "thou shalt make a plan for restoring from backups, test
>>> that plan, document that plan, and find a way to ensure that all
>>> backups are tested and restoreable in this way". /Then/ you can
>>> start making your actual backups!
>>
>> Well, so far I've always used the tried and tested approach of
>> tar'ing in conjunction with bzip2. Can't get any cleaner than
>> that. ;-)
>
> rsync copying is even cleaner - the backup copy is directly
> accessible. And when combined with hard link copies in some way (such
> as rsnapshot) you can get snapshots.
I have seen this method being discussed before, but to be honest I've
never even looked into "rsnapshot". I do intend to explore it for the
future, since the ability to make incremental backups seems very
interesting.
So far I have always made either data backups only - and on occasion,
backups of important directories such as "/etc" - or complete
filesystem backups, but never incremental backups. For IRC logs - I
run an IRC server (which is currently inactive - see farther down) and
I log the channels I'm in - I normally use "zip" every month, and then
erase the logs themselves. This is not an incremental approach, of
course.
My reason for using "zip" rather than "tar" for IRC logs is that my
colleagues run Windoze and so their options are limited. ;-)
> Of course, .tar.bz2 is good too - /if/ you have it automated so that
> it is actually done (or you are one of these rare people that can
> regularly follow a manual procedure).
To be honest, so far I've been doing that manually, but like I said, my
approach is rather amateurish, in the sense that it's not a
systematic approach. But then again, so far the risk was rather
limited because I only needed to save my own files.
On the hosting server we used - which is now no longer operational as
such - the hosting software itself made regular backups of the domains,
but using the ".tar.bz2" approach. I'm not sure whether there was
anything incremental about the backups as it was my colleague who
occupied himself with the management of that machine - it was located
at his home.
> It also needs to be saved in a safe and reliable place - many people
> have had regular backups saved to tape only to find later that the
> tapes were unreadable.
That is always a risk, just as it was with the old music cassette tapes.
Magnetic storage is actually not advised for backups.
> And of course it needs to be saved again, in a different place and
> stored at a different site.
That would indeed be the best approach. Like I said in my previous
post, I use Iomega REV disks for backups to which I want to have
immediate access, but I also forgot to mention that I back up stuff to
DVDs, and I use DVD+RW media for that, since they tend to be of higher
quality than DVD-/+R - likewise I prefer CD-RW over CD-R - and the
advantage of optical storage is that it is the better choice in the
event of magnetic corruption, which you *can* and eventually *do* get
on tape drives.
Hard disks are relatively cheap these days - at least, if we're talking
about consumer-grade SATA disks - and they are magnetically better than
tapes, in the sense that the magnetic coating on the platters is more
time-resilient than that of tape. On the other hand, hard disks
contain lots of moving components and if a hard disk fails - and here
we go again - you lose all your data, unless you have a RAID set-up.
So one can use hard disks for backups - it's fast, reasonably affordable
and reasonably reliable, but it's not the final solution. If one
stores one's backups on hard disks, then one needs to make backups of
those backups on another kind of media.
My advice would therefore be to make redundant backups on different
types of media. Optical media are ideal in terms of the fact that they
are not susceptible to electromagnetic interference, but they might in
turn have other issues - especially older CDs and DVDs - since a writer
stores the data as marks burned by its laser into a dye layer (or, on
rewritable discs, a phase-change layer), and that layer can degrade.
And some readers will not accept media that were burned using other
CD/DVD writers. This is becoming more rare these days, but the problem
still exists.
> I know I'm preaching to the choir here, as you said before - but there
> may be others in the congregation.
Indeed, and people tend "not to care" until they burn their fingers. So
we can't stress this enough.
>>> And the second rule is "thou shalt make backups of your backups",
>>> followed by "thou shalt have backups of critical hardware". (That's
>>> another bonus of software raid - if your hardware raid card dies,
>>> you may have to replace it with exactly the same type of card to get
>>> your raid working again - with mdadm raid, you can use any PC.)
>>
>> Well, considering that my Big Machine has drained my piggy bank for
>> about 17'000 Euros worth of hardware, having a duplicate machine is
>> not really an option. The piggy bank's on a diet now. :-)
>>
>
> You don't need a duplicate machine - you just need duplicates of any
> parts that are important, specific, and may not always be easily
> available.
Well, just about everything in that machine is very expensive. And on
the other hand, I did have another server here - which was
malfunctioning but which has been repaired now - so I might as well put
that one to use as a back-up machine in the event that my main machine
would fail somehow - something which I am not looking forward to, of
course! ;-)
I also can't use the Xen live migration approach, because I intend to
set up my main machine with 64-bit software, while the other server is
a strictly 32-bit machine. But redundancy - i.e. a duplicate set-up of
the main servers - should be adequate enough for my purposes.
The other machine uses Ultra 320 SCSI drives, and I have a small stack
of those lying around, as well as a couple of Ultra 160s, which can
also be hooked up to the same RAID card.
> There is no need to buy a new machine, but as soon as your particular
> choice of hardware raid card starts going out of fashion, buy
> a spare. Better still, buy a spare /now/ before the manufacturer
> decides to update the firmware in new versions of the card and they
> become incompatible with your raid drives. Of course, you can always
> restore from backup in an emergency if the worst happens.
Well, considering that this is an entirely private project and that
there is no real risk involved in downtime - not that I don't care
about downtime - I think I've got it all sufficiently covered.
>> I'm not sure on the one on my SAS RAID adapter, but I think it's an
>> Intel RISC processor. It's not a MIPS or an Alpha, that much I am
>> certain of.
>
> Intel haven't made RISC processors for many years (discounting the
> Itanium, which is an unlikely choice for a raid processor).
The Itanium is not a RISC processor; it's an EPIC (VLIW-style) design.
It's just not an x86 either. ;-)
> They used to have StrongArms, and long, long ago they had a few other
> designs, but I'm pretty certain you don't have an Intel RISC processor
> on the card. It also will not be an Alpha - they have not been made
> for years either (they were very nice chips until DEC, then HP+Compaq
> totally screwed them up, with plenty of encouragement from Intel).
> Realistic cores include MIPS in many flavours, PPC, and for more
> recent designs, perhaps an ARM of some kind. If the heavy lifting is
> being done by ASIC logic rather than the processor core, there is a
> wider choice of possible cores.
Apparently it's an Intel 80333 processor, clocked at 800 MHz. Hmm, I
don't know whether that's a RISC processor; I've never heard of it
before, actually.
This is my RAID adapter card...
http://www.adaptec.com/en-US/products/Controllers/Hardware/sas/value/SAS-31205/
>>>>>> This is quite a common distinction, mind you. There is even a
>>>>>> "live spare" solution, but to my knowledge this is specific to
>>>>>> Adaptec - they call it RAID 5E.
>>>>>>
>>>>>> In a "live spare" scenario, the spare disk is not used as such
>>>>>> but is part of the live array, and both data and parity blocks
>>>>>> are being written to it, but with the distinction that each disk
>>>>>> in the array will also have empty blocks for the total capacity
>>>>>> of a standard spare disk. These empty blocks are thus
>>>>>> distributed across all disks in the array and are used for array
>>>>>> reconstruction in the event of a disk failure.
>>>>>
>>>>> Is there any real advantage of such a setup compared to using raid
>>>>> 6 (in which case, the "empty" blocks are second parity blocks)?
>>>>> There would be a slightly greater write overhead (especially for
>>>>> small writes), but that would not be seen by the host if there is
>>>>> enough cache on the controller.
>>>>
>>>> Well, the advantage of this set-up is that you don't need to
>>>> replace a failing disk, since there is already sufficient diskspace
>>>> left blank on all disks in the array, and so the array can recreate
>>>> itself using that extra blank diskspace. This is of course all
>>>> nice in theory, but in practice one would eventually replace the
>>>> disk anyway.
>>>
>>> The same is true of raid6 - if one disk dies, the degraded raid6 is
>>> very similar to raid5 until you replace the disk.
>>>
>>> And I still don't see any significant advantage of spreading the
>>> holes around the drives rather than having them all on the one
>>> drive (i.e., a normal hot spare). The rebuild still has to do as
>>> many reads and writes, and takes as long. The rebuild writes will
>>> be spread over all the disks rather than just on the one disk, but I
>>> can't see any advantage in that.
>>
>> Well, the idea is simply to give the spare disk some exercise, i.e.
>> to use it as part of the live array while still offering the extra
>> redundancy of a spare. So in the event of a failure, the array can
>> be fully rebuilt without the need to replace the broken drive, as
>> opposed to that the array would stay in degraded mode until the
>> broken drive is replaced.
>
> The array will be in degraded mode while the rebuild is being done,
> just like if it were raid5 with a hot spare - and it will be equally
> slow during the rebuild. So no points there.
Well, it's not really something that - at least, in my impression - is
advised as "a particular RAID solution", but rather as "a nice
extension to RAID 5".
> In fact, according to Wikipedia, the controller will "compact" the
> degraded raid set into a normal raid5, and when you replace the broken
> drive it will "uncompact" it into the raid 5E arrangement again. The
> "compact" and "uncompact" operations take much longer than a standard
> raid5 rebuild.
>
> So all you get here is a marginal increase in the parallelisation of
> multiple simultaneous small reads, which you could get anyway with
> raid6 rather than raid5 with a spare.
Well, yes, but the idea of RAID 5E is merely that you can have a RAID 5
with the extra disk being part of the array so as to spread the wear.
I know it's not of much use, but we began speaking of this with regards
to the terms "standby spare", "hot spare" and "live spare". ;-)
>>> If you want more redundancy, you can use double mirrors for 33% disk
>>> space and still have full speed.
>>
>> Yes, but that's a set-up which, due to understandable financial
>> considerations, would be reserved only for the corporate world. Many
>> people already consider me certifiably insane for having spent that
>> much money - 17'000 Euro, as I wrote higher up - on a privately owned
>> computer system. But then again, for the intended purposes, I need
>> fast and reliable hardware and a lot of horsepower. :-)
>
> I'm curious - what is the intended purpose? I think I would have a
> hard job spending more than about three or four thousand Euros on a
> single system.
Well, okay, here goes... It's intended to be a kind of "mainframe" -
which is what I call it on occasion when referring to that machine
among the other machines I own.
I have had this machine over at my place for two years already, but I
still needed a few extra hardware components - I want things pristine
before I begin my set-up so as to exclude nasty surprises with changes
to the hardware afterwards - and the person who was supposed to deliver
this hardware to me pulled a no-show on me. At first he kept on
stonewalling me - and, oh irony, I've been there before with another
hardware vendor - and eventually he wouldn't even return my phone calls
(to his voicemail) or my e-mails.
So eventually I directly contacted the people who had actually built the
machine, and for whom the other person was the mediator. These people
also needed a lot of time to get all the extra components, but
eventually they did, and the machine was delivered to my home again two
days ago, so I can begin the installation over the weekend.
As for the hardware, it's a Tyan Thunder n6650W (S2915) motherboard -
the original one, not the revised one - which is a twin-socket ccNUMA
board for AMD Opterons. There are two 2218HE Opterons installed -
dualcore, 68 Watt, 2.6 GHz. The motherboard has eight DIMM sockets (as
two nodes of four DIMM sockets each), all of which are populated with
ATP 4 GB ECC registered DDR-2 pc5300 modules, making for a total of 32
GB of RAM, or if you will, two 16 GB ccNUMA nodes.
I've already shown you what RAID adapter card is installed, and this
adapter connects to eight hard disks, four of which are 147 GB 15k
Hitachi disks mounted in a "hidden" drive cage and to be used for the
main system, and the four others being 1 TB 7.2k Western Digital RAID
Edition SATA-2 disks, mounted in an IcyDock hotswap backplane drive
cage. There is a Plextor PX810-SA SATA double layer DVD writer and no
floppy drive. The motherboard also has a non-RAID on-board SAS
controller (which I've disabled in the BIOS) and a Firewire controller.
The original PSU was a CoolerMaster EPS12V 800 Watt, but considering the
extra drives and certain negative reviews of that CoolerMaster PSU
under heavy load, I have had it replaced now with a Zippy 1200 Watt
EPS12V PSU. The chassis is a CoolerMaster CM832 Stacker, which is not
the more commonly known Stacker but a model that now still only exists
as the black-and-green "nVidia Edition" model. Mine is completely
black, however.
There are two videocards installed. One is an older GeCube PCI Radeon
9250 card (with 256 MB), connected to the second channel on one of my
two SGI 21" CRT monitors. The other one is an Asus PCIe GeForce 8800
GTS (with 640 MB), connected to the first channel on both SGI monitors.
There are also two keyboards and one mouse. One keyboard is connected
via PS/2, the other one (and the mouse) via USB. So far the
hardware. ;-)
Now, as for my intended purposes, I am going to set up this machine with
Xen, as I have mentioned earlier. There will be three primary XenLinux
virtual machines running on this system, all of which will be Gentoo
installations.
The three main virtual machines will be set up as follows:
(1) The Xen dom0 virtual machine. For those not familiar with Xen, it
is a hypervisor that itself normally runs on the bare metal
(although it can be nested if the hardware has virtualization
extensions) but unlike the more familiar virtual machine monitors
like VMWare Workstation/Player or VirtualBox which are commonly used
on desktops and laptops, Xen does not have a "host" system. Instead
Xen has a "privileged guest", and this is called "dom0", or "domain
0". This virtual machine is privileged because it is from there
that one starts and stops the other Xen guests. It is also the
system that has direct access to the hardware - i.e. "the driver
domain".
On my machine, this is the virtual machine that will be using the
PCI Radeon card for video output and the PS/2 keyboard for input.
It will however not have full access to all the hardware, because
- and Xen allows this - the PCIe GeForce card, the soundchip on the
motherboard and all USB hubs will be hidden from Xen and from dom0.
(2) A workstation virtual machine. This is an unprivileged guest -
which in a Xen context is called "domU" - but it will also be a
driver domain, i.e. it will have direct access to the GeForce, the
soundchip and the USB hubs. It'll boot up to runlevel 3, but it'll
have KDE 4.x installed, along with loads of applications. As it has
direct access to the USB hubs, it'll also be running a CUPS server
for my USB-connected multifunctional device, a Brother MFC-9880.
It'll also be running an NFS server for multimedia files.
(3) A server virtual machine which I intend to set up - if possible -
with an OpenVZ kernel. Again for those who are not familiar with
it, OpenVZ is a modified Linux kernel which offers operating system
level virtualization. This means that you have one common kernel
running multiple, virtualized userspaces, each with their own
filesystems and user accounts, and their own "init" set up. I am
not sure yet whether I will be hiding the second Gbit Ethernet
adapter on the motherboard from dom0 and have this server domU
access it directly, or whether I will simply have this domU connect
to the dom0's Ethernet bridge.
The OpenVZ system will be running several isolated userspaces -
which OpenVZ calls "containers", akin to Solaris "zones" - one of which
I intend to set up as the sole system from which /ssh/ login from
the internet is allowed, and doing nothing else. The idea is that
access to any other machine in the network - physical or virtual -
must pass through this one virtual machine, making it harder for
an eventual black hat to do any damage. Then, there will also be
a generic "zone" for a DNS server and one, possibly two websites,
and one, possibly two mailservers. Lastly, another "zone" will be
running an IRC server and an IRC services package, possibly also
with a few eggdrops.
Systems (1) and (2) will be installed on the SAS disks, which are
currently set up as a RAID 5, but which I am now going to set up as a
RAID 10. System (3) itself will be installed on the same array as well,
as far as the privileged userspace and the "ssh honeypot" are concerned.
The other "zones" will be installed on the SATA-2 array - currently
also set up as RAID 5 but also to be converted to RAID 10 - together
with the NFS share exported by system (2) and an additional volume for
backups. These backups will then be backed up themselves to the other
physical server - i.e. the 32-bit dual Xeon machine - as well as to
DVDs and REV disks.
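
To put rough numbers on the RAID 5 to RAID 10 conversion mentioned above, here is a minimal sketch of the usable-capacity arithmetic. It assumes four equal disks per array, two-way mirrors for RAID 10, decimal units (1 TB taken as 1000 GB), and it ignores metadata and filesystem overhead; the function name is made up for illustration.

```python
def usable_capacity_gb(level, disks, disk_gb):
    """Usable capacity of an array of equal-sized disks (overhead ignored)."""
    if level == "raid0":
        # Plain striping: every disk holds data.
        return disks * disk_gb
    if level == "raid5":
        if disks < 3:
            raise ValueError("RAID 5 needs at least three disks")
        # One disk's worth of space goes to parity.
        return (disks - 1) * disk_gb
    if level == "raid10":
        if disks < 4 or disks % 2:
            raise ValueError("this sketch assumes an even number of disks, at least four")
        # Two-way mirrors: half of the disks hold copies.
        return disks // 2 * disk_gb
    raise ValueError("unsupported RAID level: %s" % level)


if __name__ == "__main__":
    # Disk sizes as described in the post: 4 x 147 GB SAS, 4 x 1 TB SATA-2.
    for name, disk_gb in (("SAS", 147), ("SATA-2", 1000)):
        for level in ("raid5", "raid10"):
            print(name, level, usable_capacity_gb(level, 4, disk_gb), "GB")
```

Under these assumptions the conversion trades one third of the RAID 5 capacity of each array for the mirror copies.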
As for the IRC part, I'll try to cut a very, very long story short... A
number of years ago - in July 2002, to be precise - I was part of a
group of people who started a new and small IRC network. Actually, it
all started when we decided to take over an existing but dying IRC
network in order to save it, but that's a whole other story.
Over the years, people came and went in our team - and as our team was
quite large, there were also a number of intrigues and hidden agendas
going on, resulting in some people getting fired on the spot - and we
also experienced a number of difficulties with hosting solutions -
primarily, having to pay too much money for too poor a service - and so
a little over three years ago, the remaining team members decided
jointly that it would be more cost-effective if we started self-hosting
our domain. We obtained a few second-hand servers and regular
consumergrade PCs via eBay and some SCSI disks, and we set the whole
thing up on an entry-level professional ADSL connection, all housed at
the home of one guy of our team, who was and still is living at his
parents' house. We also made up a contract that each of us would pay a
monthly contribution for the rent of the ADSL connection and
electricity, with a small margin for unexpected expenses.
So far so good, but already right from the beginning, one of us squirmed
his way out of having to pay monthly contributions, and then some ego
clashes occurred within the team - both the guy at whose home the
servers are set-up and another team member who was his best buddy are
what you could consider "socially dysfunctional" - resulting in the
loss of virtually all our users. To cut /that/ story short as well,
the guy who was running the servers at his parents' home set up a
shadow network behind my back (and on a machine of his own to which I
had no /ssh/ access) and moved over all our users to that other domain.
I only found out about it because one of our users was confused over
the two different domains and came to ask me why we had two IRC
networks which were not linked to one another.
The guy who set up that shadow network did however stay true to the
contract and kept the servers up and running, contributed financially
to the costs for the domain, and even still offered some technical
support for when things went bad - it's old hardware, and every once in
a while something breaks down and needs to be replaced. He also
meticulously kept the accounting up to date in terms of contributions
and expenses.
Then, as our contract was drawn up for an effective term of three
years - since that was the minimum rental term for the "businessgrade"
ADSL connection - and as this contract was about to end (on November
1st 2009), the guy sent an e-mail to our mailing list - sufficiently in
advance - that he had decided to step out of the IRC team at the expiry
date of the contract, but that he would help those who were still
interested in moving the domain over, and that he would still keep the
servers running until that day. So far he's still keeping the IRC
server up until I've set everything up myself, but the mail- and
webservers are down.
So at present, the IRC network that we had jointly started in 2002 is
now in suspended animation, with only one or two users (apart from the
other guy and myself) still regularly connecting, and a bunch of people
who seek to leech MP3s and pr0n - both of which are not to be found on
our network because for legal reasons we have decided to ban public
filesharing. The fines for copyright infringement or illegal "warez"
distribution over here are quite high, and I'm not prepared to go to
jail over something that stupid.
I'm not sure how I am going to revive the IRC network again - and it
will be a network again (as opposed to a single server) because one of
our old users and a girl who was on my team have both offered to set up
a server and link it to my new server - but I feel that it would be a
shame to give up on something that I have co-founded now eight years
ago and of which I have all that time been the chairman. (I was
elected chairman from the start and when someone challenged my position
and demanded re-elections one year later - as he wanted a shot at the
position - I was, apart from that one person's vote, unanimously
re-elected as chairman.)
So there will eventually be three servers on the new network (plus the
IRC services, which are considered a separate server by the IRCd
software). My now ex-colleague at whose place the main server is at
present still running did however overdo it a bit in terms of the
required set-up, hardwarewise. As I wrote higher up, it was an
entry-level businessgrade ADSL connection with eight public IP
addresses. Way too much, but the guy's an IT maniac and even more so
than I am. He's also a lot younger and still lacks some wisdom in
terms of spending.
So I am simply going to convert my residential cable internet connection
to what they call an "Office Line" over here, i.e. a single static IP
address via cable, requiring no extra hardware (as the cable modem can
handle the higher speeds) and a larger threshold for the traffic
volume, with (non-guaranteed) down/up speeds of 20 Mb/sec and 10 Mb/sec
respectively. I have a simple Linksys WRT54GL router now with the
standard firmware - which is Linux, by the way ;-) - and it'll do well
enough to do port forwarding to the respective virtual machines.
Additional firewalling can be done via /iptables/ on the respective
virtual machines.
So there you have it. Not quite as short a description as I had
announced higher up, but then again, you wanted to know. :-)
>> In the event of the OP on the other hand, 45 SAS disks of 300 GB each
>> and three SAS RAID storage enclosures also doesn't seem like quite an
>> affordable buy, so I take it he intends to use it for a business.
>
> It also does not strike me as a high value-for-money system - I can't
> help feeling that this is way more bandwidth than you could actually
> make use of in the rest of the system, so it would be better to have
> fewer larger drives and fewer layers to reduce the latencies. Spend
> the cash saved on even more RAM :-)
Well, what I personally find overkill in this is that he intends to use
the entire array only for the "/home" filesystem. That seems like an
awful waste of some great resources that I personally would put to use
more efficiently - e.g. you could have the entire "/var" tree on it,
and an additional "/srv" tree.
Of course, a lot depends on the software. As I have come to experience
myself, lots of hosting software parks all the domains under "/home"
instead of under "/var" or "/srv". In fact, one could say that on a
general note, the implementation of "/srv" in just about every
GNU/Linux distribution is abominable. Some distros create a "/srv" dir
at install time but that's about as far as it goes. All the packages
are still configured to use "/var" for websites and FTP repositories -
which I suppose you could circumvent through symlinks - but like I
said, most hosting software typically parks everything under "/home".
> 45 disks at a throughput of say 75 MBps each gives about 3.4 GBps -
> say 3 GBps since some are hot spares. Ultimately, being a server,
> this is going to be pumped out on Ethernet links. That's a lot of
> bandwidth - it would more than saturate two 10 Gbit links.
Well, since he talks of a high performance computing set-up, I would
imagine that he has plenty of 10 Gbit links at his disposal, or
possibly something a lot faster still. ;-)
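
As a back-of-the-envelope check on the bandwidth figures quoted above, the following sketch redoes the arithmetic. It assumes decimal units, 75 MB/s sustained per disk as quoted, three hot spares (the number floated elsewhere in this thread), and perfectly linear scaling, which real arrays will not reach; the function name is made up for illustration.

```python
def aggregate_gbit_per_s(disks, mb_per_s_each):
    """Raw aggregate throughput in Gbit/s, assuming perfectly linear scaling."""
    megabytes_per_s = disks * mb_per_s_each
    # 8 bits per byte, 1000 Mbit per Gbit (decimal units).
    return megabytes_per_s * 8 / 1000


if __name__ == "__main__":
    all_disks = aggregate_gbit_per_s(45, 75)      # every disk active
    without_spares = aggregate_gbit_per_s(42, 75)  # three hot spares idle
    print("45 disks: %.1f Gbit/s" % all_disks)
    print("42 disks: %.1f Gbit/s" % without_spares)
    print("10 Gbit links needed: %.1f" % (without_spares / 10))
```

By this estimate the raw disk bandwidth is worth somewhere between two and three 10 Gbit links, before any protocol or controller overhead.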
> I have absolutely no real-world experience with these sorts of
> systems, and could therefore be totally wrong, but my gut feeling is
> that the theoretical numbers will not scale with so many drives -
> something like 15 1 TB SATA drives would be similar in speed in
> practice.
No real world experience with that sort of thing here either, but like I
said, in my opinion using 45 disks - or perhaps 42 if he keeps three
hot spares - for a single "/home" filesystem does seem like overkill to
me, and yes, there is the bandwidth issue too.
>> I have been looking into NexentaOS (i.e. GNU/kOpenSolaris) for a
>> while, which uses ZFS, albeit that ZFS was not my reason for being
>> interested in the project. I was more interested in the fact that it
>> supports both Solaris Zones - of which the Linux equivalents are
>> OpenVZ and VServer - and running paravirtualized on top of Xen.
>>
>> [...]
>> The big problem with NexentaOS however is that it's based on Ubuntu
>> and that it uses binary .deb packages, whereas I would rather have a
>> Gentoo approach, where you can build the whole thing from sources
>> without having to go "the LFS way".
>
> Why is it always so hard to get /everything/ you want when building a
> system :-(
True... Putting a binary "one size fits all"-optimized distribution on
an unimportant PC or laptop is okay by me, but for a system so
specialized and geared for performance as the one I have, I want
everything to be optimized for the underlying hardware, and I also
don't need or want all those typical "Windoze-style desktop
optimizations" most distribution vendors now build into their systems.
Gentoo is far from ideal - given some issues over at the Gentoo
Foundation itself and the fact that the developers seem mostly occupied
with discussing how cool they think they are, rather than to actually
do something sensible, and they've also started to implement a few
defaults of which they themselves say that these are not the best
choices but that they are the choices of which they think most users
will opt for them - but at least the basic premise is still there, i.e.
you do build it from sources, and as such you have more control over
how the resulting system will be set-up, both in terms of hardware
optimizations and in terms of software interoperability.
--
*Aragorn*
(registered GNU/Linux user #223157)
Aragorn | 1/21/2010 12:05:13 PM

On 2010-01-21, Aragorn <aragorn@chatfactory.invalid> wrote:
> On Wednesday 20 January 2010 23:59 in comp.os.linux.misc, somebody
> identifying as David Brown wrote...
>
>>
>> rsync copying is even cleaner - the backup copy is directly
>> accessible. And when combined with hard link copies in some way (such
>> as rsnapshot) you can get snapshots.
>
> I have seen this method being discussed before, but to be honest I've
> never even looked into "rsnapshot". I do intend to explore it for the
> future, since the ability to make incremental backups seems very
> interesting.
It is actually far better than that. Each of the backups is a complete
backup, i.e. you do not have to restore sequentially (full backup and
then each of the incrementals to get you back). What rsnapshot (via
rsync) does is to use hard links to store stuff that does not change, so
there is only one copy of those files, and then puts in new copies of
stuff which has changed. It thus has the advantage of incremental, but
also the advantage of a full backup in that each backup really is a full
backup. It of course has the disadvantage of incremental that there is
only one copy of files which have not changed, and if something alters
that copy, all copies are altered.
>
> So far I have always made either data backups only - and on occasion,
> backups of important directories such as "/etc" - or complete
> filesystem backups, but never incremental backups. For IRC logs - I
> run an IRC server (which is currently inactive - see farther down) and
> I log the channels I'm in - I normally use "zip" every month, and then
> erase the logs themselves. This is not an incremental approach, of
> course.
>
> My reason for using "zip" rather than "tar" for IRC logs is that my
> colleagues run Windoze and so their options are limited. ;-)
>
>> Of course, .tar.bz2 is good too - /if/ you have it automated so that
>> it is actually done (or you are one of these rare people that can
>> regularly follow a manual procedure).
>
> To be honest, so far I've been doing that manually, but like I said, my
> approach is rather amateuristic, in the sense that it's not a
> systematic approach. But then again, so far the risk was rather
> limited because I only needed to save my own files.
rsnapshot is easily automated. You ALWAYS find that if something goes
wrong, it does so when you forgot to make backups for 3 months.
unruh | 1/21/2010 5:32:30 PM

On Thursday 21 January 2010 18:32 in comp.os.linux.misc, somebody
identifying as unruh wrote...
> On 2010-01-21, Aragorn <aragorn@chatfactory.invalid> wrote:
>
>> On Wednesday 20 January 2010 23:59 in comp.os.linux.misc, somebody
>> identifying as David Brown wrote...
>>
>>> rsync copying is even cleaner - the backup copy is directly
>>> accessible. And when combined with hard link copies in some way
>>> (such as rsnapshot) you can get snapshots.
>>
>> I have seen this method being discussed before, but to be honest I've
>> never even looked into "rsnapshot". I do intend to explore it for
>> the future, since the ability to make incremental backups seems very
>> interesting.
>
> It is actually far better than that. Each of the backups is a complete
> backup, i.e. you do not have to restore sequentially (full backup and
> then each of the incrementals to get you back). What rsnapshot (via
> rsync) does is to use hard links to store stuff that does not change,
> so there is only one copy of those files, and then puts in new copies
> of stuff which has changed.
But I am curious... What exactly happens at the filesystem level if a
hard link is used for any given file when the filesystem holding the
backups is a different filesystem than the source? Wouldn't it make a
copy then anyway?
> It thus has the advantage of incremental, but also the advantage of a
> full backup in that each backup really is a full backup. It of course
> has the disadvantage of incremental that there is only one copy of
> files which have not changed and if something alters that copy all
> copies are altered.
Ah yes, I can see how that would of course not be desirable. Imagine
you terribly screw up a file and then its copies in the backups will be
screwed up as well. Hmm...
>>> Of course, .tar.bz2 is good too - /if/ you have it automated so that
>>> it is actually done (or you are one of these rare people that can
>>> regularly follow a manual procedure).
>>
>> To be honest, so far I've been doing that manually, but like I said,
>> my approach is rather amateuristic, in the sense that it's not a
>> systematic approach. But then again, so far the risk was rather
>> limited because I only needed to save my own files.
>
> rsnapshot is easily automated. You ALWAYS find that if something goes
> wrong, it does so when you forgot to make backups for 3 months.
Well, I'll be looking into it more closely once I'll have my new machine
set up... which is a *lot* of work... 8-)
--
*Aragorn*
(registered GNU/Linux user #223157)
Aragorn | 1/21/2010 5:47:55 PM

On 2010-01-21, Aragorn <aragorn@chatfactory.invalid> wrote:
> On Thursday 21 January 2010 18:32 in comp.os.linux.misc, somebody
> identifying as unruh wrote...
>
>> On 2010-01-21, Aragorn <aragorn@chatfactory.invalid> wrote:
>>
>>> On Wednesday 20 January 2010 23:59 in comp.os.linux.misc, somebody
>>> identifying as David Brown wrote...
>>>
>>>> rsync copying is even cleaner - the backup copy is directly
>>>> accessible. And when combined with hard link copies in some way
>>>> (such as rsnapshot) you can get snapshots.
>>>
>>> I have seen this method being discussed before, but to be honest I've
>>> never even looked into "rsnapshot". I do intend to explore it for
>>> the future, since the ability to make incremental backups seems very
>>> interesting.
>>
>> It is actually far better than that. Each of the backups is a complete
>> backup, i.e. you do not have to restore sequentially (full backup and
>> then each of the incrementals to get you back). What rsnapshot (via
>> rsync) does is to use hard links to store stuff that does not change,
>> so there is only one copy of those files, and then puts in new copies
>> of stuff which has changed.
>
> But I am curious... What exactly happens at the filesystem level if a
> hard link is used for any given file when the filesystem holding the
> backups is a different filesystem than the source? Wouldn't it make a
> copy then anyway?
The hard links are on the backups, not between the backups and the
source -- that is not a backup at all. I.e., say you have 3 backups with
files f1, f2 and f3: file f2 changed between backup 1 and 2 and then
stayed the same, and f3 was added after backup 2.
B1 will have f1 and f2. B2 will have f1 hard linked with the B1 version
of f1, and will have the different (new) f2'. B3 will have f1 and f2
hard linked to the B2 versions of f1 and f2, and will have the additional
file f3.
I.e., each backup has the full complement of files for its backup, and
you can restore them completely just from those backups.
rsync -av B2/ /home/tipple/
will restore all of the files to tipple that were there when backup B2
was made, and no reference to B1 or B3 need be made. In fact if you do
rm -r B1, this will make no difference to B2.
However, the storage space requirement will be that for only one copy of
f1, two different copies of f2 (the old B1 and the new B2 version) and
one copy of f3, instead of three copies of f1, three of f2 and one of f3.
You of course will at all times have a separate copy of the files in the
original.
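
The B1/B2/B3 layout described above can be reproduced with a few hard links. This is only a sketch of the resulting on-disk structure, not of how rsnapshot itself works internally: the helper names and file contents are made up, and the backups live in a throwaway temporary directory.

```python
import os
import shutil
import tempfile


def write(path, text):
    with open(path, "w") as fh:
        fh.write(text)


def build_backups(root):
    """Create B1, B2 and B3 as in the example: f2 changes once, f3 appears late."""
    b1, b2, b3 = (os.path.join(root, name) for name in ("B1", "B2", "B3"))
    for directory in (b1, b2, b3):
        os.mkdir(directory)
    # B1: first backup, real copies of f1 and f2.
    write(os.path.join(b1, "f1"), "f1 v1\n")
    write(os.path.join(b1, "f2"), "f2 v1\n")
    # B2: f1 unchanged, so hard link it to B1's copy; f2 changed, so new copy.
    os.link(os.path.join(b1, "f1"), os.path.join(b2, "f1"))
    write(os.path.join(b2, "f2"), "f2 v2\n")
    # B3: f1 and f2 unchanged since B2, so hard link both; f3 is new.
    os.link(os.path.join(b2, "f1"), os.path.join(b3, "f1"))
    os.link(os.path.join(b2, "f2"), os.path.join(b3, "f2"))
    write(os.path.join(b3, "f3"), "f3 v1\n")
    return b1, b2, b3


def distinct_files(*directories):
    """Count distinct inodes, i.e. files that actually occupy their own space."""
    seen = set()
    for directory in directories:
        for name in os.listdir(directory):
            st = os.stat(os.path.join(directory, name))
            seen.add((st.st_dev, st.st_ino))
    return len(seen)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        b1, b2, b3 = build_backups(root)
        entries = sum(len(os.listdir(d)) for d in (b1, b2, b3))
        print("directory entries:", entries)
        print("files stored on disk:", distinct_files(b1, b2, b3))
        # Removing the oldest backup does not affect the newer ones.
        shutil.rmtree(b1)
        print("B2 after removing B1:", sorted(os.listdir(b2)))
```

The three backups list seven files between them, but only four are stored: one f1, two versions of f2 and one f3.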
>
>> It thus has the advantage of incremental, but also the advantage of a
>> full backup in that each backup really is a full backup. It of course
>> has the disadvantage of incremental that there is only one copy of
>> files which have not changed and if something alters that copy all
>> copies are altered.
>
> Ah yes, I can see how that would of course not be desirable. Imagine
> you terribly screw up a file and then its copies in the backups will be
> screwed up as well. Hmm...
NONONONONO. There is no link from the files to the backups, only amongst
the backups.
If you screw up the original, you just replace it.
The problem comes if you happen for some reason to go into, say, B2 and
edit the file f1 in B2 (what the hell you are doing editing a backup
file I do not know -- maybe you are a prime minister trying to alter the
emails you sent your mistress); then all three backups B1/f1, B2/f1 and
B3/f1 will be altered. However, if you erase, say, B1/f1 then you will
still have the complete B2/f1 and B3/f1 (that is how hard links work).
>
>>>> Of course, .tar.bz2 is good too - /if/ you have it automated so that
>>>> it is actually done (or you are one of these rare people that can
>>>> regularly follow a manual procedure).
>>>
>>> To be honest, so far I've been doing that manually, but like I said,
>>> my approach is rather amateuristic, in the sense that it's not a
>>> systematic approach. But then again, so far the risk was rather
>>> limited because I only needed to save my own files.
>>
>> rsnapshot is easily automated. You ALWAYS find that if something goes
>> wrong, it does so when you forgot to make backups for 3 months.
>
> Well, I'll be looking into it more closely once I'll have my new machine
> set up... which is a *lot* of work... 8-)
So reduce the work of making backups.
unruh | 1/21/2010 6:36:31 PM

On Thursday 21 January 2010 19:36 in comp.os.linux.misc, somebody
identifying as unruh wrote...
> On 2010-01-21, Aragorn <aragorn@chatfactory.invalid> wrote:
>> On Thursday 21 January 2010 18:32 in comp.os.linux.misc, somebody
>> identifying as unruh wrote...
>>
>>> On 2010-01-21, Aragorn <aragorn@chatfactory.invalid> wrote:
>>>
>>>> On Wednesday 20 January 2010 23:59 in comp.os.linux.misc, somebody
>>>> identifying as David Brown wrote...
>>>>
>>>>> rsync copying is even cleaner - the backup copy is directly
>>>>> accessible. And when combined with hard link copies in some way
>>>>> (such as rsnapshot) you can get snapshots.
>>>>
>>>> I have seen this method being discussed before, but to be honest
>>>> I've never even looked into "rsnapshot". I do intend to explore it
>>>> for the future, since the ability to make incremental backups seems
>>>> very interesting.
>>>
>>> It is actually far better than that. Each of the backups is a
>>> complete backup, i.e. you do not have to restore sequentially (full
>>> backup and then each of the incrementals to get you back). What
>>> rsnapshot (via rsync) does is to use hard links to store stuff that
>>> does not change, so there is only one copy of those files, and then
>>> puts in new copies of stuff which has changed.
>>
>> But I am curious... What exactly happens at the filesystem level if
>> a hard link is used for any given file when the filesystem holding
>> the backups is a different filesystem than the source? Wouldn't it
>> make a copy then anyway?
>
> The hard links are on the backups, not between the backups and the
> source, -- that is not a backup at all. Ie, if you have 3 backups with
> files f1, f2 and f3. file f2 changed between backup 1 and 2 and then
> stayed the same, f3 was added after backup 2.
>
> B1 will have f1 and f2. B2 will have f1 hard linked with the B1 version
> of f1, and will have the different (new) f2'. B3 will have f1 and f2
> hard linked to the B2 versions of f1 and f2, and will have the
> additional file f3.
Ahh, okay, I get it now. ;-)
>>> It thus has the advantage of incremental, but also the advantage of
>>> a full backup in that each backup really is a full backup. It of
>>> course has the disadvantage of incremental that there is only one
>>> copy of files which have not changed and if something alters that
>>> copy all copies are altered.
>>
>> Ah yes, I can see how that would of course not be desirable. Imagine
>> you terribly screw up a file and then its copies in the backups will
>> be screwed up as well. Hmm...
>
> NONONONONO. There is no link from the files to the backups, only
> amongst the backups.
> If you screw up the original, you just replace it.
Okay, I see what you mean. ;-)
> The problem comes if you happen for some reason to go into, say, B2
> and edit the file f1 in B2 (what the hell you are doing editing a
> backup file I do not know-- maybe you are a prime minister trying to
> alter the emails you sent your mistress), [...
Well, that seems like a fashionable thing to do these days, but then
again, I've never been that fashionable. :p
> ...] then all three backups B1/f1, B2/f1 and B3/f1 will be altered.
> However if you erase say B1/f1 then you will still have the complete
> B2/f1 and B3/f1 (that is how hard links work).
Yes, I know what a hard link is. It just seemed weird to me - in my
original misunderstanding of your explanation - that one could create a
hard link across filesystem boundaries. ;-)
>>> rsnapshot is easily automated. You ALWAYS find that if something
>>> goes wrong, it does so when you forgot to make backups for 3 months.
>>
>> Well, I'll be looking into it more closely once I'll have my new
>> machine set up... which is a *lot* of work... 8-)
>
> So reduce the work of making backups.
Well, you've convinced me in favor of /rsnapshot/ so I will give that a
closer look. ;-)
--
*Aragorn*
(registered GNU/Linux user #223157)
Aragorn | 1/21/2010 7:08:59 PM

Aragorn wrote:
> On Wednesday 20 January 2010 23:59 in comp.os.linux.misc, somebody
> identifying as David Brown wrote...
>
>> Aragorn wrote:
>>
>>> On Wednesday 20 January 2010 15:48 in comp.os.linux.misc, somebody
>>> identifying as David Brown wrote...
>> <snip to save a little space>
>
> Yeah, these posts themselves are getting quite long, but at least, it's
> one of those rare threads in which the conversation continues
> on-topic. :-)
>
> Quite honestly, I'm enjoying this thread, because I get to hear
> interesting feedback - and I think you do too, from your point of view -
> and I have a feeling that Rahul, the OP, is sitting there enjoying
> himself over all the valid arguments being discussed here in the debate
> over various RAID types. ;-)
>
> This is a good thread, and I recommend that any lurking newbies would
> save the posts for later reference in the event that they are faced
> with the decision on whether and how to implement RAID on one of their
> machines. Newbies, heads up! :p
>
Yes, lots of interesting things are turning up here.
>>>> The zeroth rule, which is often forgotten (until you learn the hard
>>>> way!), is "thou shalt make a plan for restoring from backups, test
>>>> that plan, document that plan, and find a way to ensure that all
>>>> backups are tested and restorable in this way". /Then/ you can
>>>> start making your actual backups!
>>> Well, so far I've always used the tested and tried approach of
>>> tar'ing in conjunction with bzip2. Can't get any cleaner than
>>> that. ;-)
>> rsync copying is even cleaner - the backup copy is directly
>> accessible. And when combined with hard link copies in some way (such
>> as rsnapshot) you can get snapshots.
>
> I have seen this method being discussed before, but to be honest I've
> never even looked into "rsnapshot". I do intend to explore it for the
> future, since the ability to make incremental backups seems very
> interesting.
>
Another poster has given you a pretty good explanation of how rsync
snapshot backups work. I'll just give a few more points here.
rsync is designed to make a copy of a directory as efficiently as
possible. It will only copy over files that have changed or been added,
and even for changed files it can often copy over just the changes
rather than the whole file. And if you are doing the rsync over a slow
network, you can compress the transfers. There are additional flags to
delete files in the destination that are no longer present in the
source, and to omit certain files from the copy (amongst many other flags).
For snapshots, this is combined with the "cp -al" command that copies a
tree but hard-links files rather than copying them. So you do something
like rsync copy the source tree to a "current copy", then "cp -al" the
"current copy" to a dated backup snapshot directory. The next day, you
repeat the process - only changes from the source to the "current copy"
are transferred, and any files left untouched will be hardlinked each
time - you only ever have one real copy of each file, with hardlinks in
each snapshot. It's not perfect - a file rename, for example, will cause
a new transfer (the "--fuzzy" flag can be used to avoid the
transfer, but not the file duplication). And if you have partial
transfers you can end up breaking the hard-link chaining and end up with
extra copies of the files (and thus extra disk space).
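To make the hard-link idea concrete, here is a minimal Python sketch of what "cp -al" does to the "current copy". The function name hardlink_tree and the directory names are made up for illustration, and the sketch skips symlinks, ownership and permissions, all of which the real command takes care of.

```python
import os

def hardlink_tree(current, snapshot):
    """Recreate the directory tree of `current` under `snapshot`,
    hard-linking every file instead of copying it (roughly "cp -al")."""
    for dirpath, dirnames, filenames in os.walk(current):
        # Path of this directory relative to the top of the current copy.
        rel = os.path.relpath(dirpath, current)
        target_dir = os.path.normpath(os.path.join(snapshot, rel))
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            # A hard link is a second name for the same inode:
            # no file data is duplicated on disk.
            os.link(os.path.join(dirpath, name),
                    os.path.join(target_dir, name))
```

Calling hardlink_tree("/backup/current", "/backup/2010-01-17") after each rsync run would give one dated snapshot per day. This is only safe because rsync, by default, writes a changed file under a temporary name and renames it into place rather than overwriting it, so older snapshots keep pointing at the old contents.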
rsnapshot and dirvish are two higher level backup systems built on this
technique.
Another option, which is a bit more efficient for the transfer and can
help avoid duplicates if you have occasional hiccups, is to use the
"--link-dest" option to provide the source of your links. This avoids
the extra "cp -al" step for greater efficiency, and also lets you
specify a number of old snapshots - helpful if some of these were
incomplete.
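As a rough illustration of the "--link-dest" variant, the Python sketch below assembles the rsync command line and optionally runs it. The helper names and paths are hypothetical. It passes absolute paths for the old snapshots because rsync interprets a relative "--link-dest" directory relative to the destination.

```python
import os
import subprocess

def link_dest_command(source, new_snapshot, old_snapshots):
    """Build an rsync command that fills `new_snapshot` from `source`,
    hard-linking unchanged files against any of `old_snapshots`."""
    cmd = ["rsync", "-a"]
    for old in old_snapshots:
        # One --link-dest per older snapshot; rsync searches them in order.
        cmd.append("--link-dest=" + os.path.abspath(old))
    # Trailing slash: copy the contents of `source`, not the directory itself.
    cmd += [source.rstrip("/") + "/", new_snapshot]
    return cmd

def run_backup(source, new_snapshot, old_snapshots):
    """Run the command; raises CalledProcessError if rsync reports failure."""
    subprocess.run(link_dest_command(source, new_snapshot, old_snapshots),
                   check=True)
```

Listing the two or three most recent snapshots as old_snapshots is what gives the protection against an incomplete previous run: a file missing from yesterday's snapshot can still be linked from the day before.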
Remember also that rsync is aimed at network transfers - you want to
keep your backups on a different machine (although it's nice with local
snapshots of /etc). At the very least, keep them on a different file
system partition than the original - that way you have protection
against file system disasters. Obviously you want to avoid making any
changes to the files in the snapshots, though deleting them works
perfectly (files disappear only once all the hard links are gone). It
is also a good idea to hide the backup tree from locate/updatedb, if you
have it - your 100 daily snapshots may not take much more disk space
than a single copy, but it does take 100 times as many files and
directories.
> So far I have always made either data backups only - and on occasion,
> backups of important directories such as "/etc" - or complete
> filesystem backups, but never incremental backups. For IRC logs - I
> run an IRC server (which is currently inactive - see farther down) and
> I log the channels I'm in - I normally use "zip" every month, and then
> erase the logs themselves. This is not an incremental approach, of
> course.
>
> My reason for using "zip" rather than "tar" for IRC logs is that my
> colleagues run Windoze and so their options are limited. ;-)
>
Tell your Windoze colleagues to get a proper zipper program, instead of
relying on Windows' own half-hearted "zip folders" or illegal
unregistered copies of WinZip. 7-Zip is totally free, and vastly better
- amongst other features, it supports .tar.bz2 without problems.
>> Of course, .tar.bz2 is good too - /if/ you have it automated so that
>> it is actually done (or you are one of these rare people that can
>> regularly follow a manual procedure).
>
> To be honest, so far I've been doing that manually, but like I said, my
> approach is rather amateuristic, in the sense that it's not a
> systematic approach. But then again, so far the risk was rather
> limited because I only needed to save my own files.
>
> On the hosting server we used - which is now no longer operational as
> such - the hosting software itself made regular backups of the domains,
> but using the ".tar.bz2" approach. I'm not sure whether there was
> anything incremental about the backups as it was my colleague who
> occupied himself with the management of that machine - it was located
> at his home.
>
>> It also needs to be saved in a safe and reliable place - many people
>> have had regular backups saved to tape only to find later that the
>> tapes were unreadable.
>
> That is always a risk, just as it was with the old music cassette tapes.
> Magnetic storage is actually not advised for backups.
>
I agree - I dislike tapes for backup systems. But I also dislike
optical storage - disks are typically not big enough for complete
backups, so you have to have a really messy system with multiple disks
for a backup set, or even worse, incremental backup patch sets. Even if
you can fit everything on a single disk, it requires manual intervention
to handle the disks and store them safely off-site, and you need to test
them regularly (that's important!).
Hard disk space is cheap, especially if you are not bothered about
performance. It is, IMHO, the best backup medium these days. Make sure
you have two independent copies in case of disk crashes, of course. At
my office I have an onsite backup server (it's ideal for when someone
tells me they deleted an important folder a few weeks ago - I can browse
to the right dated backup snapshot, and copy out the data directly), and
I have an offsite backup with copies over the Internet at night.
>> And of course it needs to be saved again, in a different place and
>> stored at a different site.
>
> That would indeed be the best approach. Like I said in my previous
> post, I use Iomega REV disks for backups to which I want to have
> immediate access, but I also forgot to mention that I back up stuff to
> DVDs, and I use DVD+RW media for that, since they tend to be of higher
> quality than DVD-/+R - likewise I prefer CD-RW over CD-R - and the
> advantage of optical storage is that it is the better choice in the
> event of magnetic corruption, which you *can* and eventually *do* get
> on tape drives.
>
> Hard disks are relatively cheap these days - at least, if we're talking
> about consumergrade SATA disks - and they are magnetically better than
> tapes, in the sense that the magnetic coating on the platters is more
> time-resilient than with tape drives. On the other hand, hard disks
> contain lots of moving components and if a hard disk fails - and here
> we go again - you lose all your data, unless you have a RAID set-up.
>
> So one can use hard disks for backups - it's fast, reasonably affordable
> and reasonably reliable, but it's not the final solution. If one
> stores one's backups on hard disks, then one needs to make backups of
> those backups on another kind of media.
>
> My advice would therefore be to make redundant backups on different
> types of media. Optical media are ideal in terms of the fact that they
> are not susceptible to electromagnetic interference, but they might in
> turn have other issues - especially older CDs and DVDs - since storage
> there is in fact mechanical, i.e. the data is stored via physical
> indentations in a kind of resin, made by a fairly high-powered laser.
> And some readers will not accept media that were burned using other
> CD/DVD writers. This is becoming more rare these days, but the problem
> still exists.
>
>> I know I'm preaching to the choir here, as you said before - but there
>> may be others in the congregation.
>
> Indeed, and people tend "not to care" until they burn their fingers. So
> we can't stress this enough.
>
>>>> And the second rule is "thou shalt make backups of your backups",
>>>> followed by "thou shalt have backups of critical hardware". (That's
>>>> another bonus of software raid - if your hardware raid card dies,
>>>> you may have to replace it with exactly the same type of card to get
>>>> your raid working again - with mdadm raid, you can use any PC.)
>>> Well, considering that my Big Machine has drained my piggy bank for
>>> about 17'000 Euros worth of hardware, having a duplicate machine is
>>> not really an option. The piggy bank's on a diet now. :-)
>>>
>> You don't need a duplicate machine - you just need duplicates of any
>> parts that are important, specific, and may not always be easily
>> available.
>
> Well, just about everything in that machine is very expensive. And on
> the other hand, I did have another server here - which was
> malfunctioning but which has been repaired now - so I might as well put
> that one to use as a back-up machine in the event that my main machine
> would fail somehow - something which I am not looking forward to, of
> course! ;-)
>
> I also can't use the Xen live migration approach, because I intend to
> set up my main machine with 64-bit software, while the other server is
> a strictly 32-bit machine. But redundancy - i.e. a duplicate set-up of
> the main servers - should be adequate enough for my purposes.
>
Can't Xen cope with mixes of 64-bit and 32-bit machines? I've never
used it - my servers use OpenVZ (no problem with mixes of 32-bit and
64-bit virtual machines with a 64-bit host), and on desktops I use
Virtual Box (you can mix 32-bit and 64-bit hosts and guests in any
combination).
> The other machine uses Ultra 320 SCSI drives, and I have a small stack
> of those lying around, as well as a couple of Ultra 160s, which can
> also be hooked up to the same RAID card.
>
>> There is no need to buy a new machine, but as soon as your particular
>> choice of hardware raid cards start going out of fashion, buy
>> a spare. Better still, buy a spare /now/ before the manufacturer
>> decides to update the firmware in new versions of the card and they
>> become incompatible with your raid drives. Of course, you can always
>> restore from backup in an emergency if the worst happens.
>
> Well, considering that this is an entirely private project and that
> there is no real risk involved in downtime - not that I don't care
> about downtime - I think I've got it all sufficiently covered.
>
>>> I'm not sure on the one on my SAS RAID adapter, but I think it's an
>>> Intel RISC processor. It's not a MIPS or an Alpha, that much I am
>>> certain of.
>> Intel haven't made RISC processors for many years (discounting the
>> Itanium, which is an unlikely choice for a raid processor).
>
> The Itanium is not a RISC processor, it's a CISC. It's just not an
> x86. ;-)
>
The Itanium is not CISC (though it certainly is complex!) - it is
technically a VLIW processor (very long instruction word). But VLIW is
a subtype of RISC - one in which more than one RISC instruction is given
in each instruction group for explicit parallelisation. It's an
interesting theory, but it relies on non-existent super-compilers, is
inefficiently implemented, and is hopeless in practice for most types of
software.
>> They used to have StrongArms, and long, long ago they had a few other
>> designs, but I'm pretty certain you don't have an Intel RISC processor
>> on the card. It also will not be an Alpha - they have not been made
>> for years either (they were very nice chips until DEC, then HP+Compaq
>> totally screwed them up, with plenty of encouragement from Intel).
>> Realistic cores include MIPS in many flavours, PPC, and for more
>> recent designs, perhaps an ARM of some kind. If the heavy lifting is
>> being done by ASIC logic rather than the processor core, there is a
>> wider choice of possible cores.
>
> Apparently it's an Intel 80333 processor, clocked at 800 MHz. Hmm, I
> don't know whether that's a RISC processor; I've never heard of it
> before, actually.
>
After a bit of web searching, I see that the 80333 is a dedicated RAID
system-on-a-chip, not a general purpose processor. The core is indeed
RISC - it is an XScale processor, which is approximately the same thing
as the StrongARM and is basically a modification of an older ARM core.
> This is my RAID adapter card...
>
> http://www.adaptec.com/en-US/products/Controllers/Hardware/sas/value/SAS-31205/
>
>>>>>>> This is quite a common distinction, mind you. There is even a
>>>>>>> "live spare" solution, but to my knowledge this is specific to
>>>>>>> Adaptec - they call it RAID 5E.
>>>>>>>
>>>>>>> In a "live spare" scenario, the spare disk is not used as such
>>>>>>> but is part of the live array, and both data and parity blocks
>>>>>>> are being written to it, but with the distinction that each disk
>>>>>>> in the array will also have empty blocks for the total capacity
>>>>>>> of a standard spare disk. These empty blocks are thus
>>>>>>> distributed across all disks in the array and are used for array
>>>>>>> reconstruction in the event of a disk failure.
>>>>>> Is there any real advantage of such a setup compared to using raid
>>>>>> 6 (in which case, the "empty" blocks are second parity blocks)?
>>>>>> There would be a slightly greater write overhead (especially for
>>>>>> small writes), but that would not be seen by the host if there is
>>>>>> enough cache on the controller.
>>>>> Well, the advantage of this set-up is that you don't need to
>>>>> replace a failing disk, since there is already sufficient diskspace
>>>>> left blank on all disks in the array, and so the array can recreate
>>>>> itself using that extra blank diskspace. This is of course all
>>>>> nice in theory, but in practice one would eventually replace the
>>>>> disk anyway.
>>>> The same is true of raid6 - if one disk dies, the degraded raid6 is
>>>> very similar to raid5 until you replace the disk.
>>>>
>>>> And I still don't see any significant advantage of spreading the
>>>> holes around the drives rather than having them all on the one
>>>> drive (i.e., a normal hot spare). The rebuild still has to do as
>>>> many reads and writes, and takes as long. The rebuild writes will
>>>> be spread over all the disks rather than just on the one disk, but I
>>>> can't see any advantage in that.
>>> Well, the idea is simply to give the spare disk some exercise, i.e.
>>> to use it as part of the live array while still offering the extra
>>> redundancy of a spare. So in the event of a failure, the array can
>>> be fully rebuilt without the need to replace the broken drive,
>>> rather than staying in degraded mode until the broken drive is
>>> replaced.
>> The array will be in degraded mode while the rebuild is being done,
>> just like if it were raid5 with a hot spare - and it will be equally
>> slow during the rebuild. So no points there.
>
> Well, it's not really something that - at least, in my impression - is
> advised as "a particular RAID solution", but rather as "a nice
> extension to RAID 5".
>
I have to conclude it is more like "an inefficient and proprietary
extension to raid 5 that looked good on the marketing brochure - who
cares about reality?" :-)
>> In fact, according to wikipedia, the controller will "compact" the
>> degraded raid set into a normal raid5, and when you replace the broken
>> drive it will "uncompact" it into the raid 5E arrangement again. The
>> "compact" and "uncompact" operations take much longer than a standard
>> raid5 rebuild.
>>
>> So all you get here is a marginal increase in the parallelisation of
>> multiple simultaneous small reads, which you could get anyway with
>> raid6 rather than raid5 with a spare.
>
> Well, yes, but the idea of RAID 5E is merely that you can have a RAID 5
> with the extra disk being part of the array so as to spread the wear.
> I know it's not of much use, but we began speaking of this with regards
> to the terms "standby spare", "hot spare" and "live spare". ;-)
>
>>>> If you want more redundancy, you can use double mirrors for 33% disk
>>>> space and still have full speed.
>>> Yes, but that's a set-up which, due to understandable financial
>>> considerations, would be reserved only for the corporate world. Many
>>> people already consider me certifiably insane for having spent that
>>> much money - 17'000 Euro, as I wrote higher up - on a privately owned
>>> computer system. But then again, for the intended purposes, I need
>>> fast and reliable hardware and a lot of horsepower. :-)
>> I'm curious - what is the intended purpose? I think I would have a
>> hard job spending more than about three or four thousand Euros on a
>> single system.
>
> Well, okay, here goes... It's intended to be a kind of "mainframe" -
> which is what I call it on occasion when referring to that machine
> among the other machines I own.
>
> I have had this machine over at my place for two years already, but I
> still needed a few extra hardware components - I want things pristine
> before I begin my set-up so as to exclude nasty surprises with changes
> to the hardware afterwards - and the person who was supposed to deliver
> this hardware to me pulled a no-show on me. At first he kept on
> stonewalling me - and, oh irony, I've been there before with another
> hardware vendor - and eventually he wouldn't even return my phone calls
> (to his voicemail) or my e-mails.
>
> So eventually I directly contacted the people who had actually built the
> machine, and for whom the other person was the mediator. These people
> also needed a lot of time to get all the extra components, but
> eventually they did, and the machine was delivered at my home again two
> days ago now, so I can begin the installation over the weekend.
>
> As for the hardware, it's a Tyan Thunder n6650W (S2915) motherboard -
> the original one, not the revised one - which is a twin-socket ccNUMA
> board for AMD Opterons. There are two 2218HE Opterons installed -
> dualcore, 68 Watt, 2.6 GHz. The motherboard has eight DIMM sockets (as
> two nodes of four DIMM sockets each), all of which are populated with
> ATP 4 GB ECC registered DDR-2 pc5300 modules, making for a total of 32
> GB of RAM, or if you will, two 16 GB ccNUMA nodes.
>
> I've already shown you what RAID adapter card is installed, and this
> adapter connects to eight hard disks, four of which are 147 GB 15k
> Hitachi disks mounted in a "hidden" drive cage and to be used for the
> main system, and the four others being 1 TB 7.2k Western Digital RAID
> Edition SATA-2 disks, mounted in an IcyDock hotswap backplane drive
> cage. There is a Plextor PX810-SA SATA double layer DVD writer and no
> floppy drive. The motherboard also has a non-RAID on-board SAS
> controller (which I've disabled in the BIOS) and a Firewire controller.
>
> The original PSU was a CoolerMaster EPS12V 800 Watt, but considering the
> extra drives and certain negative reviews of that CoolerMaster PSU
> under heavy load, I have had it replaced now with a Zippy 1200 Watt
> EPS12V PSU. The chassis is a CoolerMaster CM832 Stacker, which is not
> the more commonly known Stacker but a model that now still only exists
> as the black-and-green "nVidia Edition" model. Mine is completely
> black, however.
>
> There are two videocards installed. One is an older GeCube PCI Radeon
> 9250 card (with 256 MB), connected to the second channel on one of my
> two SGI 21" CRT monitors. The other one is an Asus PCIe GeForce 8800
> GTS (with 640 MB), connected to the first channel on both SGI monitors.
>
> There are also two keyboards and one mouse. One keyboard is connected
> via PS/2, the other one (and the mouse) via USB. So far the
> hardware. ;-)
>
> Now, as for my intended purposes, I am going to set up this machine with
> Xen, as I have mentioned earlier. There will be three primary XenLinux
> virtual machines running on this system, all of which will be Gentoo
> installations.
>
> The three main virtual machines will be set up as follows:
>
> (1) The Xen dom0 virtual machine. For those not familiar with Xen, it
> is a hypervisor that itself normally runs on the bare metal
> (although it can be nested if the hardware has virtualization
> extensions) but unlike the more familiar virtual machine monitors
> like VMWare Workstation/Player or VirtualBox which are commonly used
> on desktops and laptops, Xen does not have a "host" system. Instead
> Xen has a "privileged guest", and this is called "dom0", or "domain
> 0". This virtual machine is privileged because it is from there
> that one starts and stops the other Xen guests. It is also the
> system that has direct access to the hardware - i.e. "the driver
> domain".
>
> On my machine, this is the virtual machine that will be using the
> PCI Radeon card for video output and the PS/2 keyboard for input.
> It will however not have full access to all the hardware, because
> - and Xen allows this - the PCIe GeForce card, the soundchip on the
> motherboard and all USB hubs will be hidden from Xen and from dom0.
>
> (2) A workstation virtual machine. This is an unprivileged guest -
> which in a Xen context is called "domU" - but it will also be a
> driver domain, i.e. it will have direct access to the GeForce, the
> soundchip and the USB hubs. It'll boot up to runlevel 3, but it'll
> have KDE 4.x installed, along with loads of applications. As it has
> direct access to the USB hubs, it'll also be running a CUPS server
> for my USB-connected multifunctional device, a Brother MFC-9880.
> It'll also be running an NFS server for multimedia files.
>
> (3) A server virtual machine which I intend to set up - if possible -
> with an OpenVZ kernel. Again for those who are not familiar with
> it, OpenVZ is a modified Linux kernel which offers operating system
> level virtualization. This means that you have one common kernel
> running multiple, virtualized userspaces, each with their own
> filesystems and user accounts, and their own "init" set up. I am
> not sure yet whether I will be hiding the second Gbit Ethernet
> adapter on the motherboard from dom0 and have this server domU
> access it directly, or whether I will simply have this domU connect
> to the dom0's Ethernet bridge.
>
> The OpenVZ system will be running several isolated userspaces -
> which OpenVZ calls "containers", akin to (Open)Solaris "zones" - one of which
> I intend to set up as the sole system from which /ssh/ login from
> the internet is allowed, and doing nothing else. The idea is that
> access to any other machine in the network - physical or virtual -
> must pass through this one virtual machine, making it harder for
> an eventual black hat to do any damage. Then, there will also be
> a generic "zone" for a DNS server and one, possibly two websites,
> and one, possibly two mailservers. Lastly, another "zone" will be
> running an IRC server and an IRC services package, possibly also
> with a few eggdrops.
>
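For readers who have not seen a Xen guest definition: with the xm toolstack, a domU is described by a file of plain Python assignments. The sketch below shows roughly what the workstation guest (2) could look like. Every name, path, size and PCI address in it is a placeholder, not taken from the actual machine.

```python
# Hypothetical xm-style configuration for the workstation domU.
# All values below are illustrative placeholders.
name = "workstation"
kernel = "/boot/vmlinuz-xen-domU"            # paravirtualised guest kernel (assumed path)
memory = 8192                                 # MiB of RAM given to this guest
vcpus = 2                                     # virtual CPUs visible to the guest
disk = ["phy:/dev/vg0/workstation,xvda,w"]    # block device exported by dom0
vif = ["bridge=xenbr0"]                       # attach to dom0's Ethernet bridge
root = "/dev/xvda ro"                         # root device passed to the guest kernel

# Devices hidden from dom0 and handed to this guest: here standing in
# for the PCIe video card, the sound chip and one USB controller.
pci = ["01:00.0", "00:1b.0", "00:1d.0"]
```

The devices in the pci list have to be hidden from dom0 first, which with the classic Xen kernels is typically done through the pciback.hide boot parameter.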
> Systems (1) and (2) will be installed on the SAS disks, which are
> currently set up as a RAID 5, but which I am now going to set up as a
> RAID 10. System (3) itself will be installed on the same array as well,
> as far as the privileged userspace and the "ssh honeypot" are concerned.
> The other "zones" will be installed on the SATA-2 array - currently
> also set up as RAID 5 but also to be converted to RAID 10 - together
> with the NFS share exported by system (2) and an additional volume for
> backups. These backups will then be backed up themselves to the other
> physical server - i.e. the 32-bit dual Xeon machine - as well as to
> DVDs and REV disks.
>
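For reference, the capacity cost of going from RAID 5 to RAID 10 can be worked out in a few lines of Python. This is only a sketch using the textbook formulas; it assumes equal-sized disks and ignores controller metadata and hot spares.

```python
def usable_capacity(level, disks, size_gb):
    """Return the usable capacity in GB of an array of equal-sized disks."""
    if level == "raid0":
        return disks * size_gb                 # pure striping, no redundancy
    if level == "raid5":
        if disks < 3:
            raise ValueError("RAID 5 needs at least three disks")
        return (disks - 1) * size_gb           # one disk's worth of parity
    if level == "raid6":
        if disks < 4:
            raise ValueError("RAID 6 needs at least four disks")
        return (disks - 2) * size_gb           # two disks' worth of parity
    if level == "raid10":
        if disks < 4 or disks % 2:
            raise ValueError("RAID 10 needs an even number of disks, four or more")
        return disks // 2 * size_gb            # every block is stored twice
    raise ValueError("unknown RAID level: " + level)

# The two arrays described above: 4 x 147 GB SAS and 4 x 1 TB SATA.
for label, n, size in (("SAS", 4, 147), ("SATA", 4, 1000)):
    print(label, usable_capacity("raid5", n, size),
          usable_capacity("raid10", n, size))
```

With four disks, RAID 10 gives up one disk's worth of capacity compared with RAID 5: 294 GB instead of 441 GB on the SAS array, and 2 TB instead of 3 TB on the SATA array.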
This sounds like a fun system! However, I would have split this into
two distinct machines - a server and a workstation. You are mixing two
very different types of use on a single machine, giving you something
that is bound to be a compromise (or a lot more expensive than necessary).
A good workstation will have a processor aimed at high peak performance
for single threads, with reasonable performance for up to 4 threads
(more if you do a lot of compiling or other highly parallel tasks).
Memory should be optimised for latency (this is more important than fast
bandwidth), as should the disks (main disk(s) should be SSD, possibly
with harddisks for bulk storage). You want good graphics and sound.
For software, you want your host OS to be the main working OS - put
guest OS's under Virtual Box if you want.
For a server, your processor is aimed at high throughput on multiple
threads, and memory should be large - even if that means slow. Disks
should be large and reliable (raid). Graphics can be integrated on the
motherboard - you need a console keyboard and screen for the initial
installation, and they are disconnected afterwards (except possibly for
disaster recovery). These days you want your host OS to be minimal, and
have the real work done in virtual machines. Go for OpenVZ as much as
possible - OpenVZ machines are very light, and very fast to set up (on
the server at work, I can set up a new OpenVZ virtual machine in a
couple of minutes). Use Xen or KVM if you need more complete
virtualisation.
Of course, when you already have the hardware, you use what you have.
> As for the IRC part, I'll try to cut a very, very long story short... A
> number of years ago - in July 2002, to be precise - I was part of a
> group of people who started a new and small IRC network. Actually, it
> all started when we decided to take over an existing but dying IRC
> network in order to save it, but that's a whole other story.
>
> Over the years, people came and went in our team - and as our team was
> quite large, there were also a number of intrigues and hidden agendas
> going on, resulting in some people getting fired on the spot - and we
> also experienced a number of difficulties with hosting solutions -
> primarily, having to pay too much money for too poor a service - and so
> a little over three years ago, the remaining team members decided
> jointly that it would be more cost-effective if we started self-hosting
> our domain. We obtained a few second-hand servers and regular
> consumergrade PCs via eBay and some SCSI disks, and we set the whole
> thing up on an entry-level professional ADSL connection, all housed at
> the home of one guy of our team, who was and still is living at his
> parents' house. We also made up a contract that each of us would pay a
> monthly contribution for the rent of the ADSL connection and
> electricity, with a small margin for unexpected expenses.
>
> So far so good, but already right from the beginning, one of us squirmed
> his way out of having to pay monthly contributions, and then some ego
> clashes occurred within the team - both the guy at whose home the
> servers are set up and another team member who was his best buddy are
> what you could consider "socially dysfunctional" - resulting in the
> loss of virtually all our users. To cut /that/ story short as well,
> the guy who was running the servers at his parents' home set up a
> shadow network behind my back (and on a machine of his own to which I
> had no /ssh/ access) and moved over all our users to that other domain.
> I only found out about it because one of our users was confused over
> the two different domains and came to ask me why we had two IRC
> networks which were not linked to one another.
>
> The guy who set up that shadow network did however stay true to the
> contract and kept the servers up and running, contributed financially
> to the costs for the domain, and even still offered some technical
> support for when things went bad - it's old hardware, and every once in
> a while something breaks down and needs to be replaced. He also
> meticulously kept the accounting up to date in terms of contributions
> and expenses.
>
> Then, as our contract was drawn up for an effective term of three
> years - since that was the minimum rental term for the "businessgrade"
> ADSL connection - and as this contract was about to end (on November
> 1st 2009), the guy sent an e-mail to our mailing list - sufficiently in
> advance - that he had decided to step out of the IRC team at the expiry
> date of the contract, but that he would help those who were still
> interested in moving the domain over, and that he would still keep the
> servers running until that day. So far he's still keeping the IRC
> server up until I've set everything up myself, but the mail- and
> webservers are down.
>
> So at present, the IRC network that we had jointly started in 2002 is
> now in suspended animation, with only one or two users (apart from the
> other guy and myself) still regularly connecting, and a bunch of people
> who seek to leech MP3s and pr0n - both of which are not to be found on
> our network because for legal reasons we have decided to ban public
> filesharing. The fines for copyright infringement or illegal "warez"
> distribution over here are quite high, and I'm not prepared to go to
> jail over something that stupid.
>
> I'm not sure how I am going to revive the IRC network again - and it
> will be a network again (as opposed to a single server) because one of
> our old users and a girl who was on my team have both offered to set up
> a server and link it to my new server - but I feel that it would be a
> shame to give up on something that I have co-founded now eight years
> ago and of which I have all that time been the chairman. (I was
> elected chairman from the start and when someone challenged my position
> and demanded re-elections one year later - as he wanted a shot at the
> position - I was, with the exception of that one person, unanimously
> re-elected as chairman.)
>
> So there will eventually be three servers on the new network (plus the
> IRC services, which are considered a separate server by the IRCd
> software). My now ex-colleague at whose place the main server is at
> present still running did however overdo it a bit in terms of the
> required set-up, hardwarewise. As I wrote higher up, it was an
> entry-level businessgrade ADSL connection with eight public IP
> addresses. Way too much, but the guy's an IT maniac and even more so
> than I am. He's also a lot younger and still lacks some wisdom in
> terms of spending.
>
> So I am simply going to convert my residential cable internet connection
> to what they call an "Office Line" over here, i.e. a single static IP
> address via cable, requiring no extra hardware (as the cable modem can
> handle the higher speeds) and a larger threshold for the traffic
> volume, with (non-guaranteed) down/up speeds of 20 Mb/sec and 10 Mb/sec
> respectively. I have a simple Linksys WRT54GL router now with the
> standard firmware - which is Linux, by the way ;-) - and it'll do well
> enough to do port forwarding to the respective virtual machines.
> Additional firewalling can be done via /iptables/ on the respective
> virtual machines.
>
I like to install OpenWRT on WRT54GL devices - it makes them far more
flexible than the original firmware. Of course, if the original
firmware does all you need, then that's fine.
> So there you have it. Not quite as short a description as I had
> announced higher up, but then again, you wanted to know. :-)
>
Not /quite/ as short as it sounded at the start... but yes, an
interesting history. And it explains where you got this collection of
hardware.
>>> In the case of the OP on the other hand, 45 SAS disks of 300 GB each
>>> and three SAS RAID storage enclosures don't seem like an
>>> affordable buy either, so I take it he intends to use it for a business.
>> It also does not strike me as a high value-for-money system - I can't
>> help feeling that this is way more bandwidth than you could actually
>> make use of in the rest of the system, so it would be better to have
>> fewer larger drives and fewer layers to reduce the latencies. Spend
>> the cash saved on even more ram :-)
>
> Well, what I personally find overkill in this is that he intends to use
> the entire array only for the "/home" filesystem. That seems like an
> awful waste of some great resources that I personally would put to use
> more efficiently - e.g. you could have the entire "/var" tree on it,
> and an additional "/srv" tree.
>
> Of course, a lot depends on the software. As I have come to experience
> myself, lots of hosting software parks all the domains under "/home"
> instead of under "/var" or "/srv". In fact, one could say that on a
> general note, the implementation of "/srv" in just about every
> GNU/Linux distribution is abominable. Some distros create a "/srv" dir
> at install time but that's about as far as it goes. All the packages
> are still configured to use "/var" for websites and FTP repositories -
> which I suppose you could circumvent through symlinks - but like I
> said, most hosting software typically parks everything under "/home".
>
The /srv directory has no specified standard function, so it is
perfectly reasonable for it to be empty. As for what goes in /var, and
where you put it, that also varies a lot. Mail server data, databases,
and web server data are often there - but not necessarily. Logs are
almost invariably under /var. But if the OP's server is mainly a file
server, it is not unreasonable for all the relevant files to be in
/home. He could even mount /var/log on a tmpfs filesystem for speed
(obviously it will be lost on reboot).
Personally, on file servers I generally have a /data directory for
shared data, and I have my OpenVZ directories under /vz.
>> 45 disks at a throughput of say 75 MBps each gives about 3.3 GBps -
>> say 3 GBps since some are hot spares. Ultimately, being a server,
>> this is going to be pumped out on Ethernet links. That's a lot of
>> bandwidth - it would effectively saturate two or three 10 Gbit links.
>
> Well, since he talks of a high performance computing set-up, I would
> imagine that he has plenty of 10 Gbit links at his disposal, or
> possibly something a lot faster still. ;-)
>
>> I have absolutely no real-world experience with these sorts of
>> systems, and could therefore be totally wrong, but my gut feeling is
>> that the theoretical numbers will not scale with so many drives -
>> something like 15 1 TB SATA drives would be similar in speed in
>> practice.
>
> No real world experience with that sort of thing here either, but like I
> said, in my opinion using 45 disks - or perhaps 42 if he keeps three
> hot spares - for a single "/home" filesystem does seem like overkill to
> me, and yes, there is the bandwidth issue too.
>
>>> I have been looking into NexentaOS (i.e. GNU/kOpenSolaris) for a
>>> while, which uses ZFS, albeit that ZFS was not my reason for being
>>> interested in the project. I was more interested in the fact that it
>>> supports both Solaris Zones - of which the Linux equivalents are
>>> OpenVZ and VServer - and running paravirtualized on top of Xen.
>>>
>>> [...]
>>> The big problem with NexentaOS however is that it's based on Ubuntu
>>> and that it uses binary .deb packages, whereas I would rather have a
>>> Gentoo approach, where you can build the whole thing from sources
>>> without having to go "the LFS way".
>> Why is it always so hard to get /everything/ you want when building a
>> system :-(
>
> True... Putting a binary "one size fits all"-optimized distribution on
> an unimportant PC or laptop is okay by me, but for a system so
> specialized and geared for performance as the one I have, I want
> everything to be optimized for the underlying hardware, and I also
> don't need or want all those typical "Windoze-style desktop
> optimizations" most distribution vendors now build into their systems.
>
> Gentoo is far from ideal - given some issues over at the Gentoo
> Foundation itself and the fact that the developers seem mostly occupied
> with discussing how cool they think they are, rather than to actually
> do something sensible, and they've also started to implement a few
> defaults of which they themselves say that these are not the best
> choices but that they are the choices of which they think most users
> will opt for them - but at least the basic premise is still there, i.e.
> you do build it from sources, and as such you have more control over
> how the resulting system will be set-up, both in terms of hardware
> optimizations and in terms of software interoperability.
>
Have you looked at Sabayon Linux? It's originally based on Gentoo, but
you might find the developer community more to your liking.
Posted by David, 1/21/2010 11:21:46 PM

Aragorn <aragorn@chatfactory.invalid> wrote in news:hj7amh$i8q$1
@news.eternal-september.org:
>
> In the event of the OP on the other hand, 45 SAS disks of 300 GB each
> and three SAS RAID storage enclosures also doesn't seem like quite an
> affordable buy, so I take it he intends to use it for a business.
>
> That, or he's a maniac like me. :p
>
Probably the latter! ;) This is an academic setting with a scientific
computing cluster.
--
Rahul
Posted by Rahul, 1/22/2010 3:55:00 AM

David Brown <david.brown@hesbynett.removethisbit.no> wrote in
news:8KednbJOTsi3FsrWnZ2dnUVZ8i2dnZ2d@lyse.net:
> It also does not strike me as a high value-for-money system - I can't
> help feeling that this is way more bandwidth than you could actually
> make use of in the rest of the system, so it would be better to have
> fewer larger drives and less layers to reduce the latencies.
Interesting! I actually spent money getting more but smaller drives. It
would have been easier getting larger drives but I thought more spindles
the better especially for random I/O.
Which latency could have been reduced had I used larger drives? There are
hardly any layers I can see that are extraneous. I have a storage box, a
RAID controller and then LVM / mdadm on top.
So long as I used more than one box having the LVM / mdadm layer seemed
pretty necessary anyways. And just 15 drives seemed too few for IOPS.
Besides the larger drives have bad seek times.
> Spend the
> cash saved on even more ram :-)
I already thought I maxed out on my RAM. I've 48 Gigs. Do you think
that's enough? I guess I can always add more later if necessary.
> 45 disks at a throughput of say 75 MBps each gives about 3.3 GBps - say
> 3 GBps since some are hot spares. Ultimately, being a server, this is
> going to be pumped out on Ethernet links. That's a lot of bandwidth -
> it would effectively saturate four 10 Gbit links.
Well, I only have two 10 Gbit links. But my calculations had shown that
I'd be maxing out the two RAID cards I have before that happens. But I
could be wrong.
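For what it's worth, the arithmetic in the quoted estimate is easy to check. The following is only a minimal Python sketch: the 45 disks, the ~75 MB/s per disk and the 10 Gbit links come from the thread, while the three hot spares, the decimal units and the complete neglect of RAID, NFS and TCP overhead are assumptions made for illustration.

```python
# Back-of-the-envelope check of the aggregate disk bandwidth discussed above.
# From the thread: 45 disks, ~75 MB/s each, 10 Gbit/s Ethernet links.
# Assumptions (not from the thread): 3 hot spares, decimal units,
# no RAID, NFS or TCP overhead.

def aggregate_gbit_per_s(disks, mb_per_s_each):
    """Sum of sequential disk throughput, converted from MB/s to Gbit/s."""
    return disks * mb_per_s_each * 8 / 1000

def links_needed(gbit_per_s, link_gbit=10):
    """Smallest number of links whose raw capacity covers the throughput."""
    return -(-gbit_per_s // link_gbit)  # ceiling division

all_disks = aggregate_gbit_per_s(45, 75)      # every disk streaming
active_disks = aggregate_gbit_per_s(42, 75)   # three disks kept as spares

print(f"45 disks: {all_disks:.1f} Gbit/s, {links_needed(all_disks):.0f} links")
print(f"42 disks: {active_disks:.1f} Gbit/s, {links_needed(active_disks):.0f} links")
```

On these raw numbers the disks alone could fill more than two 10 Gbit links, so whether the links or the RAID cards become the limit first depends on the cards' real throughput, which this sketch does not model.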
On the other hand bandwidth was just half the story as I saw it. I did
have a fair share of apps doing random I/O and seeks. Here I wanted to
maximise my IOPS. Splitting over more independent spindles should
hopefully boost my performance in that respect.
>
> I have absolutely no real-world experience with these sorts of systems,
> and could therefore be totally wrong, but my gut feeling is that the
> theoretical numbers will not scale with so many drives - something like
> 15 1 TB SATA drives would be similar in speed in practice.
I almost went with that option. The point was that I was scared by the
IOPS expectations and there was no real way to test under full load. So I
specced it out generously.
By way of application: This storage is supposed to be the NFS server that
will serve out NFS mounts to ~275 servers, each with 8 cores. Being an HPC
environment, there's pretty much full load 24x7.
--
Rahul
Posted by Rahul, 1/22/2010 4:19:38 AM

Aragorn <aragorn@chatfactory.invalid> wrote in
news:hj9ftp$7p4$1@news.eternal-september.org:
> Quite honestly, I'm enjoying this thread, because I get to hear
> interesting feedback - and I think you do too, from your point of view
> - and I have a feeling that Rahul, the OP, is sitting there enjoying
> himself over all the valid arguments being discussed here in the
> debate over various RAID types. ;-)
Absolutely. I am still lurking around. After all I have the blame as
being the OP who started us off on this interesting RAID discussion.
> I have seen this method being discussed before, but to be honest I've
> never even looked into "rsnapshot". I do intend to explore it for the
> future, since the ability to make incremental backups seems very
> interesting.
Jumping on the rsync topic: I've been using rsync+rsnapshot backups for
the last 6 months to keep around 1 Terabyte of user data safe. It works
like a charm. Originally there was some convincing required since the
bosses thought "it wasn't backup unless it was tape".
But we have a compromise now in that we do a full tape backup about once
every 6 months. But incremental daily, weekly and monthly backups are all
disk-to-disk.
> So far I have always made either data backups only - and on occasion,
> backups of important directories such as "/etc" -
I'd also add something like bazaar or mercurial to the backup equation.
I've found it invaluable to version my config files and protect myself
against sys-admin blunders. Besides, it gives me new-found confidence when
making system changes, knowing that I can instantly roll back to pretty
much any point in time when there is trouble.
>> It also needs to be saved in a safe and reliable place - many people
>> have had regular backups saved to tape only to find later that the
>> tapes were unreadable.
I'm moving away from tapes to disk-to-disk backups. In these days of
cheap disk it's starting to make so much more sense.
>
> Well, what I personally find overkill in this is that he intends to
> use the entire array only for the "/home" filesystem. That seems like
> an awful waste of some great resources that I personally would put to
> use more efficiently - e.g. you could have the entire "/var" tree on
> it, and an additional "/srv" tree.
Does it seem a waste when you think that the /home is being served out
via NFS to ~275 servers? Those are the "compute nodes" doing most of the
I/O so that's where I need high-performance storage.
Well, I already have a fast disk for /var etc. But those are merely local
to the server connected to the storage. This server is supposed to do
just one thing and do it well: Serve out the central storage via NFS.
> Well, since he talks of a high performance computing set-up, I would
> imagine that he has plenty of 10 Gbit links at his disposal, or
> possibly something a lot faster still. ;-)
I have twin 10Gig Links right now with options for two more at a later
date.
--
Rahul
Posted by Rahul, 1/22/2010 4:37:07 AM

Rahul wrote:
> David Brown <david.brown@hesbynett.removethisbit.no> wrote in
> news:8KednbJOTsi3FsrWnZ2dnUVZ8i2dnZ2d@lyse.net:
>
>> It also does not strike me as a high value-for-money system - I can't
>> help feeling that this is way more bandwidth than you could actually
>> make use of in the rest of the system, so it would be better to have
>> fewer larger drives and less layers to reduce the latencies.
>
> Interesting! I actually spent money getting more but smaller drives. It
> would have been easier getting larger drives but I thought more spindles
> the better especially for random I/O.
>
It is very difficult to judge these things - it is so dependent on the
load. Don't rate my gut feeling above /your/ gut feeling! The trouble
is, the only way to be sure is to try out both combinations and see,
which is a little impractical. If you are able, you could do testing
with only half the disks attached to see if it makes a measurable
difference.
More spindles /may/ reduce the seek time for random IO - it will be
especially effective if there are a lot of random reads in parallel.
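To put rough numbers on the "more spindles" argument, a common rule of thumb estimates the random IOPS of one disk as the reciprocal of average seek time plus average rotational latency. The Python sketch below applies that rule; the seek times and spindle speeds are assumed, typical-looking values and not measurements of the OP's drives, and the model ignores caches, queueing and RAID write penalties.

```python
# Rule-of-thumb random IOPS estimate: 1 / (avg seek + avg rotational latency).
# All drive parameters below are illustrative assumptions, not data from the thread.

def disk_iops(avg_seek_ms, rpm):
    """Approximate random IOPS of one spindle."""
    rotational_latency_ms = 0.5 * 60_000 / rpm   # half a revolution on average
    return 1000 / (avg_seek_ms + rotational_latency_ms)

def array_iops(disks, avg_seek_ms, rpm):
    """Upper bound for independent random reads spread over all spindles."""
    return disks * disk_iops(avg_seek_ms, rpm)

many_small = array_iops(42, avg_seek_ms=3.5, rpm=15_000)   # assumed 15k SAS
few_large = array_iops(15, avg_seek_ms=8.5, rpm=7_200)     # assumed 7.2k SATA

print(f"42 fast spindles: ~{many_small:.0f} IOPS")
print(f"15 slow spindles: ~{few_large:.0f} IOPS")
```

Under these assumptions the gap comes from both the spindle count and the per-disk latency, which is why parallel random reads are the case where many small drives should help most.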
> Which latency could have been reduced had I used larger drives? There's
> hardly any layers I can see that are extraneous. I have a storage box,
> RAID controller and then LVM / mdadm on top.
>
I have been imagining two layers of controllers here - your disks are
connected to one controller on a storage box, and that controller is
then connected to the controller in the host. With fewer disks, you can
cut out the middle man and connect the disks directly to raid
controllers on the host. Theoretically, that will reduce latency -
though I don't know if there will be a difference in real life.
> So long as I used more than one box having the LVM / mdadm layer seemed
> pretty necessary anyways. And just 15 drives seemed too few for IOPS.
> Besides the larger drives have bad seek times.
>
Have you considered using a clustered file system such as Lustre or GFS?
You then have a central server for metadata, which is easy to get fast
(everything will be in the server's ram cache), and the actual data is
spread around and duplicated on different servers.
>> Spend the
>> cash saved on even more ram :-)
>
> I already thought I maxed out on my RAM. I've 48 Gigs. Do you think
> that's enough? I guess I can always add more later if necessary.
>
640 KB ram is enough for anybody :-)
48 GB is actually quite a lot. Whether it is enough or not is hard to
say. Run the system you have got - if people complain that it is slow,
do some monitoring to find the bottlenecks. If they don't complain,
then 48 GB is enough!
>> 45 disks at a throughput of say 75 MBps each gives about 3.3 GBps - say
>> 3 GBps since some are hot spares. Ultimately, being a server, this is
>> going to be pumped out on Ethernet links. That's a lot of bandwidth -
>> it would effectively saturate four 10 Gbit links.
>
> Well, I only have two 10 Gbit links. But my calculations had shown that
> I'd be maxing out the two RAID cards I have before that happens. But I
> could be wrong.
>
You know the details of the system, and you've probably done the
calculations on paper rather than just in your head, so your numbers are
a better guess. How does the bandwidth of the raid cards compare to the
theoretical bandwidth of the disks?
> On the other hand bandwidth was just half the story as I saw it. I did
> have a fair share of apps doing random I/O and seeks. Here I wanted to
> maximise my IOPS. Splitting over more independent spindles should
> hopefully boost my performance in that respect.
>
>> I have absolutely no real-world experience with these sorts of systems,
>> and could therefore be totally wrong, but my gut feeling is that the
>> theoretical numbers will not scale with so many drives - something like
>> 15 1 TB SATA drives would be similar in speed in practice.
>
> I almost went with that option. The point was that I was scared by the
> IOPS expectations and there was no real way to test under full load. So I
> specced it out generously.
>
> By way of application: This storage is supposed to be the NFS server that
> will serve out NFS mounts to ~275 servers, each with 8 cores. Being an HPC
> environment, there's pretty much full load 24x7.
>
>
>
Posted by David, 1/22/2010 9:06:43 AM

On Friday 22 January 2010 00:21 in comp.os.linux.misc, somebody
identifying as David Brown wrote...
> Aragorn wrote:
>
>> On Wednesday 20 January 2010 23:59 in comp.os.linux.misc, somebody
>> identifying as David Brown wrote...
>>
>>> rsync copying is even cleaner - the backup copy is directly
>>> accessible. And when combined with hard link copies in some way
>>> (such as rsnapshot) you can get snapshots.
>>
>> I have seen this method being discussed before, but to be honest I've
>> never even looked into "rsnapshot". I do intend to explore it for
>> the future, since the ability to make incremental backups seems very
>> interesting.
>
> Another poster has given you a pretty good explanation of how rsync
> snapshot backups work.
Yep, Bill is no stranger to me - at least, Usenet-wise; I've never met
him in person. ;-) He's a longstanding regular of both
alt.os.linux.mandr* groups and hangs out - Usenet-wise - in just about
the same groups as I do. ;-)
> I'll just give a few more points here.
>
> rsync is designed to make a copy of a directory as efficiently as
> possible. It will only copy over files that have changed or been
> added, and even for changed files it can often copy over just the
> changes rather than the whole file. And if you are doing the rsync
> over a slow network, you can compress the transfers. There are
> additional flags to delete files in the destination that are no longer
> present in the source, and to omit certain files from the copy
> (amongst many other flags).
Okay, sounds interesting. And yes, part of the backing up will be over
the network to a separate machine.
> For snapshots, this is combined with the "cp -al" command that copies
> a tree but hard-links files rather than copying them. So you do
> something like rsync copy the source tree to a "current copy", then
> "cp -al" the "current copy" to a dated backup snapshot directory. The
> next day, you repeat the process - only changes from the source to
> the "current copy" are transferred, and any files left untouched will
> be hardlinked each time - you only ever have one real copy of each
> file, with hardlinks in each snapshot. It's not perfect - for
> example, a file rename will cause a new transfer
> (the "--fuzzy" flag can be used to avoid the transfer, but not the
> file duplication). And if you have partial transfers you can end up
> breaking the hard-link chaining and end up with extra copies of the
> files (and thus extra disk space).
I'd rather have a duplicate file with differing names than having to
find out that I've lost the only copy. ;-)
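The hard-link step described in the quoted text can be illustrated without rsync at all. What follows is a minimal Python sketch of the "cp -al" idea only, not of rsnapshot itself; the function name and directory names are hypothetical, and symlinks, permissions and special files are ignored for brevity.

```python
# Minimal sketch of the "cp -al" idea: recreate the directory tree, but
# hard-link every regular file instead of copying its data.
# Names are hypothetical; symlinks, ownership and special files are ignored.
import os
import tempfile

def hardlink_snapshot(current_copy, snapshot_dir):
    """Create snapshot_dir as a hard-linked mirror of current_copy."""
    for root, dirs, files in os.walk(current_copy):
        rel = os.path.relpath(root, current_copy)
        target_root = os.path.normpath(os.path.join(snapshot_dir, rel))
        os.makedirs(target_root, exist_ok=True)
        for name in files:
            # Same inode, second directory entry: no extra data blocks used.
            os.link(os.path.join(root, name), os.path.join(target_root, name))

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as base:
        current = os.path.join(base, "current")
        os.makedirs(os.path.join(current, "etc"))
        with open(os.path.join(current, "etc", "fstab"), "w") as fh:
            fh.write("example\n")
        snap = os.path.join(base, "snapshot-2010-01-22")
        hardlink_snapshot(current, snap)
        # Both names point at one inode, so the link count is now 2.
        print(os.stat(os.path.join(snap, "etc", "fstab")).st_nlink)
```

Because every snapshot entry is just another name for the same inode, deleting one snapshot frees disk space only for files that no other snapshot still references.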
> rsnapshot and dirvish are two higher level backup systems built on
> this technique.
I have heard "dirvish" being mentioned, but I had no idea what it was.
I guess I'll have to look into it, but so far I'm already impressed
by "rsnapshot" by simply reading what it can do and how it
operates. ;-)
> Another option, which is a bit more efficient for the transfer and can
> help avoid duplicates if you have occasional hiccups, is to use the
> "--link-dest" option to provide the source of your links. This avoids
> the extra "cp -al" step for greater efficiency, and also lets you
> specify a number of old snapshots - helpful if some of these were
> incomplete.
>
> Remember also that rsync is aimed at network transfers - you want to
> keep your backups on a different machine (although it's nice with
> local snapshots of /etc).
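As an illustration of the "--link-dest" variant quoted above, the Python sketch below only builds an rsync command line and prints it; it does not run anything. The helper name and all paths are hypothetical. The options used (-a, --delete, --link-dest) are standard rsync options, but the exact set you want will depend on your own backup policy.

```python
# Sketch: build (not run) an rsync command line that uses --link-dest.
# Helper name and paths are hypothetical, chosen only for illustration.

def rsync_snapshot_cmd(source, new_snapshot, previous_snapshots):
    """Return the argv list for one hard-linked snapshot run."""
    cmd = ["rsync", "-a", "--delete"]
    # --link-dest may be given more than once; rsync searches the listed
    # directories in order for an identical file to hard-link against.
    # Absolute paths are used because a relative --link-dest is interpreted
    # relative to the destination directory.
    for prev in previous_snapshots:
        cmd.append(f"--link-dest={prev}")
    # Trailing slash on the source: copy the directory contents,
    # not the directory itself.
    cmd += [source.rstrip("/") + "/", new_snapshot]
    return cmd

print(" ".join(rsync_snapshot_cmd(
    "/home", "/backup/2010-01-22",
    ["/backup/2010-01-21", "/backup/2010-01-20"])))
```

Listing more than one older snapshot is what gives the protection against an incomplete previous run mentioned in the quoted text.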
Yep, as I said, part of the backups will be stored on a separate
machine, and the rest locally - but on a different RAID array - on the
same machine, and on removable media. The REV drive I have can handle
70 GB of data - according to Iomega, it can handle 140 GB with
compression, but I suppose that they're talking of using their own
compression algorithms and their Designed For Windows Only proprietary
software.
In my experience, it is impossible to make such claims as that
compression would double the storage capacity, since a number of
document formats are already compressed to begin with and you cannot
compress them any further.
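The point about already-compressed data is easy to demonstrate. A small Python sketch, using zlib from the standard library: the synthetic "document" below is an assumption made purely for the demo, so real files will give different ratios.

```python
# Sketch: already-compressed data does not shrink much a second time.
# The synthetic text is an assumption for the demo; real data will differ.
import random
import zlib

rng = random.Random(0)  # fixed seed so the run is repeatable
words = ["raid", "stripe", "parity", "mirror",
         "spindle", "array", "volume", "backup"]
raw = " ".join(rng.choice(words) for _ in range(20_000)).encode()

once = zlib.compress(raw, 9)     # plain text: compresses well
twice = zlib.compress(once, 9)   # input is already compressed

print(f"raw:   {len(raw)} bytes")
print(f"once:  {len(once)} bytes ({len(once) / len(raw):.0%} of raw)")
print(f"twice: {len(twice)} bytes ({len(twice) / len(once):.0%} of once)")
```

A "140 GB with compression" figure therefore only holds if the data being backed up happens to compress about 2:1, which is unlikely for archives, images or video that are stored compressed already.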
> At the very least, keep them on a different file system partition than
> the original - that way you have protection against file system
> disasters.
Yep. Like I said, the locally stored backups will be backups of the
other RAID array, stored on the second array.
> Obviously you want to avoid making any changes to the files in the
> snapshots, though deleting them works perfectly (files disappear only
> once all the hard links are gone).
Yes, I'm aware of that.
> It is also a good idea to hide the backup tree from locate/updatedb,
> if you have it - your 100 daily snapshots may not take much more disk
> space than a single copy, but it does take 100 times as many files and
> directories.
I am thinking about simply not having the backup volume mounted during
normal operation and only have it mounted just prior to when the backup
scripts start running.
On the other hand, I could indeed also simply make sure that "updatedb"
doesn't "see" the backup partition.
>> My reason for using "zip" rather than "tar" for IRC logs is that my
>> colleagues run Windoze and so their options are limited. ;-)
>
> Tell your Windoze colleagues to get a proper zipper program, instead
> of relying on windows own half-hearted "zip folders" or illegal
> unregistered copies of WinZip. 7zip is totally free, and vastly
> better - amongst other features, it supports .tar.bz2 without
> problems.
I could also tell them to use GNU/Linux instead of Windoze - and it
would make a great difference in the amount of spam I get from
their /botnetted/ computers, but to them GNU/Linux is for servers and
PCs or laptops must run Windoze. Or OS-X - two of them have a Mac as
well.
>>> It also needs to be saved in a safe and reliable place - many people
>>> have had regular backups saved to tape only to find later that the
>>> tapes were unreadable.
>>
>> That is always a risk, just as it was with the old music cassette
>> tapes. Magnetic storage is actually not advised for backups.
>
> I agree - I dislike tapes for backup systems. But I also dislike
> optical storage - disks are typically not big enough for complete
> backups, so you have to have a really messy system with multiple disks
> for a backup set, or even worse, incremental backup patch sets. Even
> if you can fit everything on a single disk, it requires manual
> intervention to handle the disks and store them safely off-site, and
> you need to test them regularly (that's important!).
The limitation in volume is one of the reasons why optical storage is
not a good idea for backups, as it implies - as you said - human
intervention to switch the media around.
>> Well, just about everything in that machine is very expensive. And
>> on the other hand, I did have another server here - which was
>> malfunctioning but which has been repaired now - so I might as well
>> put that one to use as a back-up machine in the event that my main
>> machine would fail somehow - something which I am not looking forward
>> to, of course! ;-)
>>
>> I also can't use the Xen live migration approach, because I intend to
>> set up my main machine with 64-bit software, while the other server
>> is a strictly 32-bit machine. But redundancy - i.e. a duplicate
>> set-up of the main servers - should be adequate enough for my
>> purposes.
>
> Can't Xen cope with mixes of 64-bit and 32-bit machines? I've never
> used it - my servers use OpenVZ (no problem with mixes of 32-bit and
> 64-bit virtual machines with a 64-bit host), and on desktops I use
> Virtual Box (you can mix 32-bit and 64-bit hosts and guests in any
> combination).
Xen can deal with that, sure, but the problem is that the other physical
machine only has 32-bit processors, while I intend to use 64-bit
software for the virtual machines on my main computer. You can't
migrate a running 64-bit virtual machine to a physical machine which
only supports 32-bit. ;-)
>>>>>>>> There is even a "live spare" solution, but to my knowledge this
>>>>>>>> is specific to Adaptec - they call it RAID 5E.
>>>
>>> The array will be in degraded mode while the rebuild is being done,
>>> just like if it were raid5 with a hot spare - and it will be equally
>>> slow during the rebuild. So no points there.
>>
>> Well, it's not really something that - at least, in my impression -
>> is advised as "a particular RAID solution", but rather as "a nice
>> extension to RAID 5".
>
> I have to conclude it is more like "an inefficient and proprietary
> extension to raid 5 that looked good on the marketing brochure - who
> cares about reality?" :-)
Nobody who has anything to sell. ;-)
>>>> Many people already consider me certifiably insane for having spent
>>>> that much money - 17'000 Euro, as I wrote higher up - on a
>>>> privately owned computer system. But then again, for the intended
>>>> purposes, I need fast and reliable hardware and a lot of
>>>> horsepower. :-)
>>>
>>> I'm curious - what is the intended purpose? I think I would have a
>>> hard job spending more than about three or four thousand Euros on a
>>> single system.
>>
>> Well, okay, here goes... It's intended to be a kind of "mainframe" -
>> which is what I call it on occasion when referring to that machine
>> among the other machines I own.
>>
>> [...]
>> Now, as for my intended purposes, I am going to set up this machine
>> with Xen, as I have mentioned earlier. There will be three primary
>> XenLinux virtual machines running on this system, all of which will
>> be Gentoo installations.
>>
>> The three main virtual machines will be set up as follows:
>>
>> (1) The Xen dom0 virtual machine. [...]
>>
>> (2) A workstation virtual machine.
>> [...] but it will also be a driver domain, i.e. it will have
>> direct access to the GeForce, the soundchip and the USB hubs.
>> It'll boot up to runlevel 3, but it'll have KDE 4.x installed,
>> along with loads of applications. As it has direct access to
>> the USB hubs, it'll also be running a CUPS server for my
>> USB-connected multifunctional device, a Brother MFC-9880. It'll
>> also be running an NFS server for multimedia files.
>>
>> (3) A server virtual machine which I intend to set up - if possible -
>> with an OpenVZ kernel. [...]
>> The OpenVZ system will be running several isolated userspaces -
>> which are called "zones", just as in (Open)Solaris - one of which
>> I intend to set up as the sole system from which /ssh/ login from
>> the internet is allowed, and doing nothing else. The idea is
>> that access to any other machine in the network - physical or
>> virtual - must pass through this one virtual machine, making it
>> harder for an eventual black hat to do any damage. Then, there
>> will also be a generic "zone" for a DNS server and one, possibly
>> two websites, and one, possibly two mailservers. Lastly,
>> another "zone" will be running an IRC server and an IRC services
>> package, possibly also with a few eggdrops.
>
> This sounds like a fun system! However, I would have split this into
> two distinct machines - a server and a workstation. You are mixing
> two very different types of use on a single machine, giving you
> something that is bound to be a compromise (or a lot more expensive
> than necessary).
Not really a compromise, considering the horsepower of the machine. It
does however start to sound a lot like a mainframe, and part of my
reasons for wanting to do this is to create an exceptional set-up which
has not been done before in this form[1] and which should be able to
knock my Windoze-loving friends out of their socks.[2] ;-)
[1] OpenVZ on Xen has already been done before, but not with Gentoo, to
my knowledge. Using two keyboards and two videocards for separate
virtual machines via Xen has also already been done, but never with
Gentoo, and not in conjunction with an OpenVZ server running in one
of the domUs.
[2] "Can you do this with your Windows XP/Vista/7 too?" :pp
> A good workstation will have a processor aimed at high peak
> performance for single threads, with reasonable performance for up to
> 4 threads (more if you do a lot of compiling or other highly parallel
> tasks).
Since Gentoo is sources-based, compiling is part of the picture. ;-)
(I seem to remember that compiling/building a vanilla kernel on that
machine (with GNU/Linux running on the bare metal, not in a virtual
machine context) took about 50 seconds.)
> Memory should be optimised for latency (this is more important than
> fast bandwidth), [...]
I think pc5300 works very well with these Opterons.
> as should the disks (main disk(s) should be SSD, [...]
Ah, but with this I do not agree. SSDs, even the enterprise-grade
ones - which are still far more expensive than a SAS disk of equal or
even greater capacity - have a limited lifespan if they are being
written to a lot, so anyone using SSDs should consider putting only
read-only filesystems on them.
> possibly with harddisks for bulk storage). You want good graphics and
> sound. For software, you want your host OS to be the main working OS -
> put guest OS's under Virtual Box if you want.
Hmm... No, I again disagree. First of all, if the host operating
system (or privileged guest in a Xen-context) is being used for daily
work, then it becomes a liability to the guest. You want the host (or
privileged guest) as static and lean as possible. It's the Achilles
heel of your entire system.
Secondly, VirtualBox might be very popular with a number of desktop
users - and particularly so they could run Windows inside of it - but I
do not consider that a proper virtualization context. VirtualBox runs
on the host operating system as a process. It's not as efficient as
Xen. And not quite as cool either. :p
> For a server, your processor is aimed at high throughput on multiple
> threads, and memory should be large - even if that means slow. Disks
> should be large and reliable (raid). Graphics can be integrated on
> the motherboard - you need a console keyboard and screen for the
> initial installation, and they are disconnected afterwards (except
> possibly for disaster recovery).
In my case, as explained in my earlier post, the XenLinux privileged
guest will have its own videocard and keyboard - there is no on-board
video on this motherboard anyway - but the output of that videocard is
connected to the second channel of one of the two monitors, and the
other videocard - driven by the first domU - is connected to the first
channel on both monitors. I can switch video output to the screen with
the flick of a (mechanical) switch on the monitor.
> These days you want your host OS to be minimal, and have the real work
> done in virtual machines. Go for OpenVZ as much as possible - OpenVZ
> machines are very light, and very fast to set up (on the server at
> work, I can set up a new OpenVZ virtual machine in a couple of
> minutes). Use Xen or KVM if you need more complete virtualisation.
Well, it's going to be OpenVZ on top of Xen - if I can get my hands on
the sources for the OpenVZ 2.6.27 kernel, because they only seem to
supply .rpm packages and Gentoo's package manager can't handle those -
for the server domU, and the rest will probably be running the most
recent paravirtualized vanilla kernels.
> Of course, when you already have the hardware, you use what you have.
And it'll be a cool enough set-up the way I have it all planned. :-)
>> [...] I have a simple Linksys WRT54GL router now with the standard
>> firmware - which is Linux, by the way ;-) - and it'll do well enough
>> to do port forwarding to the respective virtual machines. Additional
>> firewalling can be done via /iptables/ on the respective virtual
>> machines.
>
> I like to install OpenWRT on WRT54GL devices - it makes them far more
> flexible than the original firmware. Of course, if the original
> firmware does all you need, then that's fine.
Yeah, I know quite a few people who use OpenWRT, but then again, at
present I see no reason why I would be putting myself through the
trouble of flashing the firmware on that thing. ;-) The standard
firmware is already quite good, mind you. ;-)
>> Gentoo is far from ideal - given some issues over at the Gentoo
>> Foundation itself and the fact that the developers seem mostly
>> occupied with discussing how cool they think they are, rather than to
>> actually do something sensible, and they've also started to implement
>> a few defaults of which they themselves say that these are not the
>> best choices but that they are the choices of which they think most
>> users will opt for them - but at least the basic premise is still
>> there, i.e. you do build it from sources, and as such you have more
>> control over how the resulting system will be set-up, both in terms
>> of hardware optimizations and in terms of software interoperability.
>
> Have you looked at Sabayon Linux? It's originally based on Gentoo,
> but you might find the developer community more to your liking.
I have only briefly looked into it. It appears to come as a Live DVD
only, and it appears to favor Gnome, so applying all the USE flags to
get it to support Qt3, Qt4 and KDE 4 might be a lot of work. I'll give
it a closer look, though. ;-)
--
*Aragorn*
(registered GNU/Linux user #223157)
Posted by Aragorn, 1/22/2010 1:42:38 PM

On Friday 22 January 2010 05:37 in comp.os.linux.misc, somebody
identifying as Rahul wrote...
> Aragorn <aragorn@chatfactory.invalid> wrote in
> news:hj9ftp$7p4$1@news.eternal-september.org:
>
>> Quite honestly, I'm enjoying this thread, because I get to hear
>> interesting feedback - and I think you do too, from your point of view
>> - and I have a feeling that Rahul, the OP, is sitting there enjoying
>> himself over all the valid arguments being discussed here in the
>> debate over various RAID types. ;-)
>
> Absolutely. I am still lurking around. After all I have the blame as
> being the OP who started us off on this interesting RAID discussion.
It *is* interesting - I guess we've all learned something new here - and
it is quite rare as well that a thread this deep hasn't strayed
off-topic yet. ;-)
> I'm moving away from tapes to disk-to-disk backups. In these days of
> cheap disk it's starting to make so much more sense.
On that I would agree. Tape cartridges are expensive, not as reliable
for long-term storage, and they are vulnerable. Hard disks are a much
more convenient choice in this day and age, and especially so the ones
that can be hot-swapped or unplugged from the system - e.g. USB or
eSATA.
>> Well, what I personally find overkill in this is that he intends to
>> use the entire array only for the "/home" filesystem. That seems
>> like an awful waste of some great resources that I personally would
>> put to use more efficiently - e.g. you could have the entire "/var"
>> tree on it, and an additional "/srv" tree.
>
> Does it seem a waste when you think that the /home is being served out
> via NFS to ~275 servers?
Ahh, no, but you didn't tell us that, Grasshopper! :p
> Those are the "compute nodes" doing most of the I/O so that's where I
> need high-performance storage.
Well, what I understood from your earlier mention was that it was
intended to *be* a high-performance computing node.
> Well, I already have a fast disk for /var etc. But those are merely
> local to the server connected to the storage. This server is supposed
> to do just one thing and do it well: Serve out the central storage via
> NFS.
Then it makes sense, indeed, and then the set-up that David and I both
agreed on would indeed be the best for you, i.e. (to summarize) six
disks per array set up as RAID 10 with the seventh disk being a spare,
and /mdadm/ on top of the three arrays to combine them into a
striped "/home".
Should be quite a "bit blaster". :-)
--
*Aragorn*
(registered GNU/Linux user #223157)
Aragorn | 1/22/2010 2:01:15 PM
On Friday 22 January 2010 15:01 in comp.os.linux.misc, somebody
identifying as Aragorn wrote...
> [...] i.e. (to summarize) six disks per array set up as RAID 10 with
> the seventh disk being a spare, and /mdadm/ on top of the three arrays
> to combine them into a striped "/home".
Hmm... Seems like I've butterfingered again. That or I just had a brain
fart. :p
You've got 45 disks, distributed over three enclosures. That makes 15
disks per enclosure, so you could make that into a 14-disk RAID 10
array per enclosure with the 15th disk as a spare. Or a 12-disk RAID
10 and the three remaining disks as spares. ;-)
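
To put numbers on those two options, here is a minimal sketch of the arithmetic. It assumes RAID 10 built from mirrored pairs (so an even disk count per array), and the disk size is a made-up placeholder since no capacity has been stated; only the 45 disks and three enclosures come from the thread.

```python
# Sketch: per-enclosure RAID 10 layouts for 45 disks in 3 enclosures.
# DISK_GB is a hypothetical placeholder, not a figure from the thread.

def raid10_layout(disks_per_enclosure, spares):
    """Return (array_disks, usable_disks) for one enclosure."""
    array_disks = disks_per_enclosure - spares
    # Assumption: RAID 10 as plain mirrored pairs, so the count must be even.
    if array_disks < 2 or array_disks % 2:
        raise ValueError("RAID 10 as mirrored pairs needs an even number of disks")
    # Half of the disks hold mirror copies, so only half count as capacity.
    return array_disks, array_disks // 2

TOTAL_DISKS = 45
ENCLOSURES = 3
DISK_GB = 300  # hypothetical

per_enclosure = TOTAL_DISKS // ENCLOSURES

for spares in (1, 3):
    array_disks, usable = raid10_layout(per_enclosure, spares)
    total_usable_gb = usable * DISK_GB * ENCLOSURES
    print(f"{array_disks}-disk RAID 10 + {spares} spare(s) per enclosure: "
          f"{usable} disks usable each, {total_usable_gb} GB striped over all three")
```

Going from one spare to three costs one disk's worth of capacity per enclosure, which is the trade-off between the two layouts.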
--
*Aragorn*
(registered GNU/Linux user #223157)
Aragorn | 1/22/2010 2:06:41 PM
On 22/01/2010 14:42, Aragorn wrote:
> On Friday 22 January 2010 00:21 in comp.os.linux.misc, somebody
> identifying as David Brown wrote...
>
>> Aragorn wrote:
>>
<snipping this bit to save space - I think we've covered the important
stuff about rsync, unless anyone else has questions or comments>
>
>>> My reason for using "zip" rather than "tar" for IRC logs is that my
>>> colleagues run Windoze and so their options are limited. ;-)
>>
>> Tell your Windoze colleagues to get a proper zipper program, instead
>> of relying on windows own half-hearted "zip folders" or illegal
>> unregistered copies of WinZip. 7zip is totally free, and vastly
>> better - amongst other features, it supports .tar.bz2 without
>> problems.
>
> I could also tell them to use GNU/Linux instead of Windoze - and it
> would make a great difference in the amount of spam I get from
> their /botnetted/ computers, but to them GNU/Linux is for servers and
> PCs or laptops must run Windoze. Or OS-X - two of them have a Mac as
> well.
>
I can understand it being difficult to persuade people to move from
Windows, but it should be a lot easier to persuade typical windows users
to install 7zip. Send them a file called "NakedPictures.7z" and a link
to the 7zip download page - that should be sufficient argument.
>>>> It also needs to be saved in a safe and reliable place - many people
>>>> have had regular backups saved to tape only to find later that the
>>>> tapes were unreadable.
>>>
>>> That is always a risk, just as it was with the old music cassette
>>> tapes. Magnetic storage is actually not advised for backups.
>>
>> I agree - I dislike tapes for backup systems. But I also dislike
>> optical storage - disks are typically not big enough for complete
>> backups, so you have to have a really messy system with multiple disks
>> for a backup set, or even worse, incremental backup patch sets. Even
>> if you can fit everything on a single disk, it requires manual
>> intervention to handle the disks and store them safely off-site, and
>> you need to test them regularly (that's important!).
>
> The limitation in volume is one of the reasons why optical storage is
> not a good idea for backups, as it implies - as you said - human
> intervention to switch the media around.
>
>>> Well, just about everything in that machine is very expensive. And
>>> on the other hand, I did have another server here - which was
>>> malfunctioning but which has been repaired now - so I might as well
>>> put that one to use as a back-up machine in the event that my main
>>> machine would fail somehow - something which I am not looking forward
>>> to, of course! ;-)
>>>
>>> I also can't use the Xen live migration approach, because I intend to
>>> set up my main machine with 64-bit software, while the other server
>>> is a strictly 32-bit machine. But redundancy - i.e. a duplicate
>>> set-up of the main servers - should be adequate enough for my
>>> purposes.
>>
>> Can't Xen cope with mixes of 64-bit and 32-bit machines? I've never
>> used it - my servers use OpenVZ (no problem with mixes of 32-bit and
>> 64-bit virtual machines with a 64-bit host), and on desktops I use
>> Virtual Box (you can mix 32-bit and 64-bit hosts and guests in any
>> combination).
>
> Xen can deal with that, sure, but the problem is that the other physical
> machine only has 32-bit processors, while I intend to use 64-bit
> software for the virtual machines on my main computer. You can't
> migrate a running 64-bit virtual machine to a physical machine which
> only supports 32-bit. ;-)
>
OK, running 64-bit virtual machines on a 32-bit processor is tough (QEMU
can do it, if you are happy to wait long enough!).
>>>>>>>>> There is even a "live spare" solution, but to my knowledge this
>>>>>>>>> is specific to Adaptec - they call it RAID 5E.
>>>>
>>>> The array will be in degraded mode while the rebuild is being done,
>>>> just like if it were raid5 with a hot spare - and it will be equally
>>>> slow during the rebuild. So no points there.
>>>
>>> Well, it's not really something that - at least, in my impression -
>>> is advised as "a particular RAID solution", but rather as "a nice
>>> extension to RAID 5".
>>
>> I have to conclude it is more like "an inefficient and proprietary
>> extension to raid 5 that looked good on the marketing brochure - who
>> cares about reality?" :-)
>
> Nobody who has anything to sell. ;-)
>
>>>>> Many people already consider me certifiably insane for having spent
>>>>> that much money - 17'000 Euro, as I wrote higher up - on a
>>>>> privately owned computer system. But then again, for the intended
>>>>> purposes, I need fast and reliable hardware and a lot of
>>>>> horsepower. :-)
>>>>
>>>> I'm curious - what is the intended purpose? I think I would have a
>>>> hard job spending more than about three or four thousand Euros on a
>>>> single system.
>>>
>>> Well, okay, here goes... It's intended to be a kind of "mainframe" -
>>> which is what I call it on occasion when referring to that machine
>>> among the other machines I own.
>>>
>>> [...]
>>> Now, as for my intended purposes, I am going to set up this machine
>>> with Xen, as I have mentioned earlier. There will be three primary
>>> XenLinux virtual machines running on this system, all of which will
>>> be Gentoo installations.
>>>
>>> The three main virtual machines will be set up as follows:
>>>
>>> (1) The Xen dom0 virtual machine. [...]
>>>
>>> (2) A workstation virtual machine.
>>> [...] but it will also be a driver domain, i.e. it will have
>>> direct access to the GeForce, the soundchip and the USB hubs.
>>> It'll boot up to runlevel 3, but it'll have KDE 4.x installed,
>>> along with loads of applications. As it has direct access to
>>> the USB hubs, it'll also be running a CUPS server for my
>>> USB-connected multifunctional device, a Brother MFC-9880. It'll
>>> also be running an NFS server for multimedia files.
>>>
>>> (3) A server virtual machine which I intend to set up - if possible -
>>> with an OpenVZ kernel. [...]
>>> The OpenVZ system will be running several isolated userspaces -
>>> which are called "zones", just as in (Open)Solaris - one of which
>>> I intend to set up as the sole system from which /ssh/ login from
>>> the internet is allowed, and doing nothing else. The idea is
>>> that access to any other machine in the network - physical or
>>> virtual - must pass through this one virtual machine, making it
>>> harder for an eventual black hat to do any damage. Then, there
>>> will also be a generic "zone" for a DNS server and one, possibly
>>> two websites, and one, possibly two mailservers. Lastly,
>>> another "zone" will be running an IRC server and an IRC services
>>> package, possibly also with a few eggdrops.
>>
>> This sounds like a fun system! However, I would have split this into
>> two distinct machines - a server and a workstation. You are mixing
>> two very different types of use on a single machine, giving you
>> something that is bound to be a compromise (or a lot more expensive
>> than necessary).
>
> Not really a compromise, considering the horsepower of the machine. It
> does however start to sound a lot like a mainframe, and part of my
> reasons for wanting to do this is to create an exceptional set-up which
> has not been done before in this form[1] and which should be able to
> knock my Windoze-loving friends out of their socks.[2] ;-)
>
> [1] OpenVZ on Xen has already been done before, but not with Gentoo, to
> my knowledge. Using two keyboards and two videocards for separate
> virtual machines via Xen has also already been done, but never with
> Gentoo, and not in conjunction with an OpenVZ server running in one
> of the domUs.
>
> [2] "Can you do this with your Windows XP/Vista/7 too?" :pp
>
>> A good workstation will have a processor aimed at high peak
>> performance for single threads, with reasonable performance for up to
>> 4 threads (more if you do a lot of compiling or other highly parallel
>> tasks).
>
> Since Gentoo is sources-based, compiling is part of the picture. ;-)
>
> (I seem to remember that compiling/building a vanilla kernel on that
> machine (with GNU/Linux running on the bare metal, not in a virtual
> machine context) took about 50 seconds.)
>
That's a fast kernel compile! Are you cheating and using ccache?
>> Memory should be optimised for latency (this is more important than
>> fast bandwidth), [...]
>
> I think pc5300 works very well with these Opterons.
>
>> as should the disks (main disk(s) should be SSD, [...]
>
> Ah, but with this I do not agree. SSDs, even the enterprise-grade
> ones - which are still far more expensive than a SAS disk of equal or
> even greater capacity - have a limited lifespan if they are being
> written to a lot, so anyone using SSDs should consider putting only
> read-only filesystems on them.
>
SSDs are certainly more expensive for their size (by a factor of about
30 or so, I think). But you can get a 64 GB SSD for a very reasonable
price, and it is more than sufficient for your OS, applications (except
perhaps large games), and main data files. Bulk data, such as video
files, will need to go on a hard disk. But everything else can
typically go on the SSD. If you are only using 30 GB for your system,
who cares if you have 30 GB free or 970 GB free? An SSD is not
sufficient on its own for a typical workstation, but it can form an
excellent base.
The "limited lifespan" of SSDs is mostly a myth, and partly outdated
information. With a modern SLC drive, you can write to them continually
at full speed for /decades/ before you would expect to suffer from wear.
Reliability is orders of magnitude better than for hard disks.
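
As a rough back-of-the-envelope check, wear-out time is approximately capacity times rated program/erase cycles divided by the daily write volume. The sketch below assumes perfect wear levelling, and the 100,000-cycle figure, the 20 GB/day workload and the 100 MB/s sustained rate are illustrative assumptions rather than numbers from any datasheet. The result swings a lot with the write rate you plug in.

```python
# Sketch: rough flash endurance estimate under idealised wear levelling.
# All workload and cycle figures below are assumptions for illustration.

def endurance_days(capacity_bytes, pe_cycles, bytes_written_per_day,
                   write_amplification=1.0):
    """Days until every cell has used up its rated program/erase cycles."""
    total_writable = capacity_bytes * pe_cycles / write_amplification
    return total_writable / bytes_written_per_day

GB = 10**9
CAPACITY = 64 * GB      # the 64 GB drive size mentioned above
PE_CYCLES = 100_000     # assumed order of magnitude for SLC flash

workloads = (
    ("20 GB/day desktop use", 20 * GB),                  # assumed workload
    ("continuous 100 MB/s", 100 * 10**6 * 86_400),       # assumed sustained rate
)

for label, per_day in workloads:
    years = endurance_days(CAPACITY, PE_CYCLES, per_day) / 365.25
    print(f"{label}: roughly {years:,.1f} years")
```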
>> possibly with harddisks for bulk storage). You want good graphics and
>> sound. For software, you want your host OS to be the main working OS -
>> put guest OS's under Virtual Box if you want.
>
> Hmm... No, I again disagree. First of all, if the host operating
> system (or privileged guest in a Xen-context) is being used for daily
> work, then it becomes a liability to the guest. You want the host (or
> privileged guest) as static and lean as possible. It's the Achilles
> heel of your entire system.
>
That's true for a server - not for a workstation. You are correct that
the guest is no safer (or more reliable) than the host, but for a
workstation, that's fine. You use the host for your main work, as
efficiently as possible (i.e., direct access to hardware, etc.). If you
have something risky (testing out software, running windows), you do it
in a guest without risking the host. But if you have something that has
to be safer and more reliable than the system you are using as your main
workhorse every day, it should be on a different physical system - the
server.
> Secondly, VirtualBox might be very popular with a number of desktop
> users - and particularly so they could run Windows inside of it - but I
> do not consider that a proper virtualization context. VirtualBox runs
> on the host operating system as a process. It's not as efficient as
> Xen. And not quite as cool either. :p
>
It is a somewhat different concept from Xen - you are correct that it
runs as a process on the host. But you are incorrect to write that off
as a disadvantage or "not proper virtualization". It is a different
type of virtualisation, with advantages as well as disadvantages. It is
easier to use, and easier to integrate (sharing files and clipboards,
moving smoothly between guest and host, mixing host and guest windows,
etc.). But it doesn't allow the guest the same level of controlled
hardware access that a hypervisor solution like Xen gives you. That's
why I recommend VBox for workstations, but it is probably not the best
choice for a server. Different tools for different jobs.
>> For a server, your processor is aimed at high throughput on multiple
>> threads, and memory should be large - even if that means slow. Disks
>> should be large and reliable (raid). Graphics can be integrated on
>> the motherboard - you need a console keyboard and screen for the
>> initial installation, and they are disconnected afterwards (except
>> possibly for disaster recovery).
>
> In my case, as explained in my earlier post, the XenLinux privileged
> guest will have its own videocard and keyboard - there is no on-board
> video on this motherboard anyway - but the output of that videocard is
> connected to the second channel of one of the two monitors, and the
> other videocard - driven by the first domU - is connected to the first
> channel on both monitors. I can switch video output to the screen with
> the flick of a (mechanical) switch on the monitor.
>
>> These days you want your host OS to be minimal, and have the real work
>> done in virtual machines. Go for OpenVZ as much as possible - OpenVZ
>> machines are very light, and very fast to set up (on the server at
>> work, I can set up a new OpenVZ virtual machine in a couple of
>> minutes). Use Xen or KVM if you need more complete virtualisation.
>
> Well, it's going to be OpenVZ on top of Xen - if I can get my hands on
> the sources for the OpenVZ 2.6.27 kernel, because they only seem to
> supply .rpm packages and Gentoo's package manager can't handle those -
> for the server domU, and the rest will probably be running the most
> recent paravirtualized vanilla kernels.
>
>> Of course, when you already have the hardware, you use what you have.
>
> And it'll be a cool enough set-up the way I have it all planned. :-)
>
>>> [...] I have a simple Linksys WRT54GL router now with the standard
>>> firmware - which is Linux, by the way ;-) - and it'll do well enough
>>> to do port forwarding to the respective virtual machines. Additional
>>> firewalling can be done via /iptables/ on the respective virtual
>>> machines.
>>
>> I like to install OpenWRT on WRT54GL devices - it makes them far more
>> flexible than the original firmware. Of course, if the original
>> firmware does all you need, then that's fine.
>
> Yeah, I know quite a few people who use OpenWRT, but then again, at
> present I see no reason why I would be putting myself through the
> trouble of flashing the firmware on that thing. ;-) The standard
> firmware is already quite good, mind you. ;-)
>
If it ain't broke, don't fix it? But it's more fun to break it, then
see if you can fix it again afterwards...
>>> Gentoo is far from ideal - given some issues over at the Gentoo
>>> Foundation itself and the fact that the developers seem mostly
>>> occupied with discussing how cool they think they are, rather than to
>>> actually do something sensible, and they've also started to implement
>>> a few defaults that they themselves say are not the best choices,
>>> but that they think most users will opt for - but at least the basic
>>> premise is still
>>> there, i.e. you do build it from sources, and as such you have more
>>> control over how the resulting system will be set-up, both in terms
>>> of hardware optimizations and in terms of software interoperability.
>>
>> Have you looked at Sabayon Linux? It's originally based on Gentoo,
>> but you might find the developer community more to your liking.
>
> I have only briefly looked into it. It appears to come as a Live DVD
> only, and it appears to favor Gnome, so applying all the USE flags to
> get it to support Qt3, Qt4 and KDE 4 might be a lot of work. I'll give
> it a closer look, though. ;-)
>
David | 1/22/2010 2:56:34 PM
On Friday 22 January 2010 15:56 in comp.os.linux.misc, somebody
identifying as David Brown wrote...
> On 22/01/2010 14:42, Aragorn wrote:
>
>> On Friday 22 January 2010 00:21 in comp.os.linux.misc, somebody
>> identifying as David Brown wrote...
>>
>>> Can't Xen cope with mixes of 64-bit and 32-bit machines? I've never
>>> used it - my servers use OpenVZ (no problem with mixes of 32-bit and
>>> 64-bit virtual machines with a 64-bit host), and on desktops I use
>>> Virtual Box (you can mix 32-bit and 64-bit hosts and guests in any
>>> combination).
>>
>> Xen can deal with that, sure, but the problem is that the other
>> physical machine only has 32-bit processors, while I intend to use
>> 64-bit software for the virtual machines on my main computer. You
>> can't migrate a running 64-bit virtual machine to a physical machine
>> which only supports 32-bit. ;-)
>
> OK, running 64-bit virtual machines on a 32-bit processor is tough
> (QEMU can do it, if you are happy to wait long enough!).
The last time I was over at the home of my now ex-colleague, he had a
64-bit OpenSuSE running on a 32-bit host operating system - which I
believe to have been Windows XP - using VMWare. There was an
incredible discrepancy in the time as shown on the clock in each
system's panel.
I don't remember whether the skewed clock was running fast or slow, but
it differed by about half an hour every five minutes. It was insane.
He had to install an ntpd in the OpenSuSE guest, because it refused to
connect to the ntpd running on the host - probably a firewalling issue,
although I seem to remember that he left that port open.
Of course, that's VMWare. And Windows. So it's not anything I care
enough about to troubleshoot it. :p
>>> A good workstation will have a processor aimed at high peak
>>> performance for single threads, with reasonable performance for up
>>> to 4 threads (more if you do a lot of compiling or other highly
>>> parallel tasks).
>>
>> Since Gentoo is sources-based, compiling is part of the picture. ;-)
>>
>> (I seem to remember that compiling/building a vanilla kernel on that
>> machine (with GNU/Linux running on the bare metal, not in a virtual
>> machine context) took about 50 seconds.)
>
> That's a fast kernel compile! Are you cheating and using ccache?
Nope. Only did a "make -j5". It's a ccNUMA system, with two dualcore
Opterons, running at 2.6 GHz. It *is* fast indeed!
I remember that using the same make options, my dual Xeon machine -
32-bit, 2.2 GHz, hyperthreading enabled, 400 MHz FSB, single memory
controller on the motherboard - needed about 5 minutes. With the
default "-j2" it took about 40 to 45 minutes, depending on the kernel
generation.
The first time I ever compiled a kernel from sources was the vanilla
2.6.5 kernel, and on that dual Xeon machine. At that time, I still
hadn't figured out that you could do parallel compiles using "-j" and a
number. The last kernel I've compiled on it was 2.6.17.something, with
the "-j5", and that was the one that took (little over) 5 minutes to
build.
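
For anyone else who hasn't discovered "-j" yet, here is a small sketch of picking the job count automatically. The "number of cores plus one" rule is only a common rule of thumb (it happens to give the -j5 I used on four cores), and make_command is a hypothetical helper name, not part of any build tool.

```python
# Sketch: build a "make -jN" command line from the number of CPUs.
import os
import shlex

def make_command(target="", jobs=None):
    """Return the make invocation as an argument list."""
    if jobs is None:
        # Assumption: cores + 1 keeps the CPUs busy while some jobs wait on I/O.
        jobs = (os.cpu_count() or 1) + 1
    cmd = ["make", f"-j{jobs}"]
    if target:
        cmd.append(target)
    return cmd

# Command for the machine this runs on, then the explicit -j5 case.
print(shlex.join(make_command()))
print(shlex.join(make_command(jobs=5)))
```

The list form can be handed straight to subprocess.run() from within a kernel source tree.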
The one I built on the new machine in 50 seconds was the 2.6.22 kernel
with Xen patches for dom0. Although this was in Gentoo, the 2.6.22
XenLinux kernels came from Ubuntu (but supplied by Gentoo). There were
also 2.6.20 XenLinux-patched kernels from RedHat, and the
Xen.org-supplied 2.6.18 kernels.
These days, those are all obsolete, of course, since as of 2.6.30,
vanilla Linux has all the code for running as dom0, domU or on the bare
metal, and you can have all three usages inside a single binary kernel.
The kernel will decide whether it is running on the bare metal or
paravirtualized on Xen (or KVM), and whether it's a privileged or
unprivileged Xen guest.
Of course, for a custom system like mine, building specialized kernels
for dom0 and domU is preferable to a one-size-fits-all kernel
image.
>>> [...] possibly with harddisks for bulk storage). You want good
>>> graphics and sound. For software, you want your host OS to be the
>>> main working OS - put guest OS's under Virtual Box if you want.
>>
>> Hmm... No, I again disagree. First of all, if the host operating
>> system (or privileged guest in a Xen-context) is being used for daily
>> work, then it becomes a liability to the guest. You want the host
>> (or privileged guest) as static and lean as possible. It's the
>> Achilles heel of your entire system.
>
> That's true for a server - not for a workstation. You are correct
> that the guest is no safer (or more reliable) than the host, but for a
> workstation, that's fine.
Well, only if you consider the guest to be something that needs to be
sandboxed. And of course, Windows *should* be. And sealed in behind
solid concrete walls by people wearing Hazmat suits. :p
> You use the host for your main work, as efficiently as possible (i.e.,
> direct access to hardware, etc.). If you have something risky
> (testing out software, running windows), you do it in a guest without
> risking the host.
Yes, the sandbox scenario. For that kind of usage, host-based virtual
machine monitors are a good choice, of course.
> But if you have something that has to be safer and more reliable than
> the system you are using as your main workhorse every day, it should
> be on a different physical system - the server.
Well, with hardware like mine and with something as powerful as Xen, I
consider my usage of that system quite efficient and more interesting.
You also have to keep in mind here that I view a GNU/Linux (or any
UNIX) workstation as something other than "a PC". To me, any UNIX
system is a client/server architecture and I treat and view it as such.
It's a different philosophy from the single-user thinking that is
typical for Windows users. And thus, once again, it's all in the eye
of the beholder. ;-)
>> Secondly, VirtualBox might be very popular with a number of desktop
>> users - and particularly so they could run Windows inside of it - but
>> I do not consider that a proper virtualization context. VirtualBox
>> runs on the host operating system as a process. It's not as
>> efficient as Xen. And not quite as cool either. :p
>
> It is a somewhat different concept from Xen - you are correct that it
> runs as a process on the host. But you are incorrect to write that
> off as a disadvantage or "not proper virtualization".
Well, it obviously *is* virtualization, but as I wrote higher up, I
consider that good for sandboxing, but not for my purposes. I like the
mainframe-style approach of Xen, where everything is a virtual machine,
including the management system.
It's a matter of taste, really, so when I said "I don't consider that"
and so on, I was really speaking of me, myself and I. ;-)
> It is a different type of virtualisation, with advantages as well as
> disadvantages. It is easier to use, and easier to integrate (sharing
> files and clipboards, moving smoothly between guest and host, mixing
> host and guest windows, etc.).
I have seen screenshots of a virtualization solution - don't ask me
which one because I don't remember - where GNU/Linux and Windows were
actually *sharing* the desktop. Neither was running in any windowed
context and one didn't have to switch full screen mode between the host
(GNU/Linux) and the guest (Windows). It was all seamless, with X
desktop's taskbar at the top and the Windows taskbar at the bottom of
the screen. It was weird. 8-)
> But it doesn't allow the guest the same level of controlled
> hardware access that a hypervisor solution like Xen gives you. That's
> why I recommend VBox for workstations, but it is probably not the best
> choice for a server. Different tools for different jobs.
And differing opinions on what constitutes "a workstation". :-)
>>> I like to install OpenWRT on WRT54GL devices - it makes them far
>>> more flexible than the original firmware. Of course, if the
>>> original firmware does all you need, then that's fine.
>>
>> Yeah, I know quite a few people who use OpenWRT, but then again, at
>> present I see no reason why I would be putting myself through the
>> trouble of flashing the firmware on that thing. ;-) The standard
>> firmware is already quite good, mind you. ;-)
>
> If it ain't broke, don't fix it?
Exactly! :-)
> But it's more fun to break it, then see if you can fix it again
> afterwards...
And if you can't, then you've got yourself an expensive paperweight. :p
--
*Aragorn*
(registered GNU/Linux user #223157)
Aragorn | 1/22/2010 3:41:56 PM
David Brown <david@westcontrol.removethisbit.com> wrote in
news:4b596aa4$0$6281$8404b019@news.wineasy.se:
> Rahul wrote:
>> David Brown <david.brown@hesbynett.removethisbit.no> wrote in
>> news:8KednbJOTsi3FsrWnZ2dnUVZ8i2dnZ2d@lyse.net:
>>
>
> It is very difficult to judge these things - it is so dependent on the
> load. Don't rate my gut feeling above /your/ gut feeling! The
> trouble is, the only way to be sure is to try out both combinations
> and see, which is a little impractical.
Exactly. Lies, damned lies and benchmarks. I spent a month on getting my
vendor to run benchmarks on the intended I/O boxes but it is very
difficult to estimate. In the end I decided to take a more robust
decision and thus bought this config. With 3 independent storage boxes,
45 SAS 15k drives, and a beefy server with 48 Gigs RAM, at least I have the
independence that if one layout does not give the performance I want I
can move things around and try a different strategy. Worst case even put
in 3 servers with one box attached to each and then spread out I/O via
Lustre etc. Or even have the hack of a /home1 /home2 and a /home3 etc.
> If you are able, you could do testing with only half the disks
> attached to see if it makes a measurable difference.
I will. What's your tool of choice? bonnie++, iozone etc.? Or just a dd
with changing parameters.
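
For what it's worth, the "dd with changing parameters" idea can be sketched in a few lines. This is only a crude sequential-write test, not a substitute for bonnie++ or iozone; the block sizes and the 16 MB total are arbitrary assumptions, and write_throughput and sweep are hypothetical helper names. Point the directory at a scratch area on the array under test.

```python
# Sketch: time sequential writes at several block sizes, dd-style.
import os
import tempfile
import time

def write_throughput(path, block_size, total_bytes):
    """Write total_bytes to path in block_size chunks; return MB/s."""
    block = b"\0" * block_size
    blocks = total_bytes // block_size
    start = time.perf_counter()
    # buffering=0 so every write() is its own system call, as with dd.
    with open(path, "wb", buffering=0) as f:
        for _ in range(blocks):
            f.write(block)
        os.fsync(f.fileno())  # include the time needed to reach the disk
    elapsed = time.perf_counter() - start
    return (blocks * block_size) / elapsed / 1e6

def sweep(directory, block_sizes, total_bytes):
    """Return {block_size: MB/s} for a scratch file in directory."""
    results = {}
    for bs in block_sizes:
        fd, path = tempfile.mkstemp(dir=directory)
        os.close(fd)
        try:
            results[bs] = write_throughput(path, bs, total_bytes)
        finally:
            os.remove(path)
    return results

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        sizes = [4096, 65536, 1048576]
        for bs, mbps in sweep(scratch, sizes, 16 * 1048576).items():
            print(f"bs={bs:>8} bytes: {mbps:8.1f} MB/s")
```

A total size this small mostly measures the cache, so for a real run the total should be well above the RAM of the server.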
> I have been imagining two layers of controllers here - your disks are
> connected to one controller on a storage box,
I have a MD-1000 from Dell. I don't think it has any controller on board
to speak of. It's a dumb box. The PERC-6e card in the host server is the
only controller that I know of.
>>
>
> Have you considered using a clustered file system such as Lustre or
> GFS? You then have a central server for metadata, which is easy to
> get fast (everything will be in the server's ram cache), and the
> actual data is spread around and duplicated on different servers.
I did consider Lustre, Gluster, HadoopFS and GFS. The problem is that they
seem very tricky to set up correctly and I was scared away. Have you
actually used any of these distributed filesystems? Any anecdotal
comments?
> 48 GB is actually quite a lot. Whether it is enough or not is hard to
> say. Run the system you have got - if people complain that it is
> slow, do some monitoring to find the bottlenecks. If they don't
> complain, then 48 GB is enough!
I'm not even sure what the absolute max RAM is that my server will
take.
--
Rahul
Rahul | 1/22/2010 6:24:11 PM
Aragorn wrote:
> On Friday 22 January 2010 15:56 in comp.os.linux.misc, somebody
> identifying as David Brown wrote...
>
>> On 22/01/2010 14:42, Aragorn wrote:
>>
>>> On Friday 22 January 2010 00:21 in comp.os.linux.misc, somebody
>>> identifying as David Brown wrote...
>>>
<snip to save space>
>>>> [...] possibly with harddisks for bulk storage). You want good
>>>> graphics and sound. For software, you want your host OS to be the
>>>> main working OS - put guest OS's under Virtual Box if you want.
>>> Hmm... No, I again disagree. First of all, if the host operating
>>> system (or privileged guest in a Xen-context) is being used for daily
>>> work, then it becomes a liability to the guest. You want the host
>>> (or privileged guest) as static and lean as possible. It's the
>>> Achilles heel of your entire system.
>> That's true for a server - not for a workstation. You are correct
>> that the guest is no safer (or more reliable) than the host, but for a
>> workstation, that's fine.
>
> Well, only if you consider the guest to be something that needs to be
> sandboxed. And of course, Windows *should* be. And sealed in behind
> solid concrete walls by people wearing Hazmat suits. :p
>
>> You use the host for your main work, as efficiently as possible (i.e.,
>> direct access to hardware, etc.). If you have something risky
>> (testing out software, running windows), you do it in a guest without
>> risking the host.
>
> Yes, the sandbox scenario. For that kind of usage, host-based virtual
> machine monitors are a good choice, of course.
>
>> But if you have something that has to be safer and more reliable than
>> the system you are using as your main workhorse every day, it should
>> be on a different physical system - the server.
>
> Well, with hardware like mine and with something as powerful as Xen, I
> consider my usage of that system quite efficient and more interesting.
> You also have to keep in mind here that I view a GNU/Linux (or any
> UNIX) workstation as something other than "a PC". To me, any UNIX
> system is a client/server architecture and I treat and view it as such.
>
You are in a slightly unusual situation, having such a powerful single
machine, so it makes sense for you to want to do everything on the same
physical machine.
I view a server as a multi-user system, but a "workstation" is just a
powerful PC. And PC means /personal/ computer - a PC should, I think,
be mainly for a single person. Different people want different things
from a PC, and they are often in conflict - one person wants fast and
powerful, another wants small and quiet. A server has to have an
emphasis on reliability and security, a workstation on ease of use and
speed of interactive tasks.
> It's a different philosophy from the single-user thinking that is
> typical for Windows users. And thus, once again, it's all in the eye
> of the beholder. ;-)
>
>>> Secondly, VirtualBox might be very popular with a number of desktop
>>> users - and particularly so they could run Windows inside of it - but
>>> I do not consider that a proper virtualization context. VirtualBox
>>> runs on the host operating system as a process. It's not as
>>> efficient as Xen. And not quite as cool either. :p
>> It is a somewhat different concept from Xen - you are correct that it
>> runs as a process on the host. But you are incorrect to write that
>> off as a disadvantage or "not proper virtualization".
>
> Well, it obviously *is* virtualization, but as I wrote higher up, I
> consider that good for sandboxing, but not for my purposes. I like the
> mainframe-style approach of Xen, where everything is a virtual machine,
> including the management system.
>
> It's a matter of taste, really, so when I said "I don't consider that"
> and so on, I was really speaking of me, myself and I. ;-)
>
>> It is a different type of virtualisation, with advantages as well as
>> disadvantages. It is easier in use, and easier to integrate (sharing
>> files and clipboards, moving smoothly between guest and host, mixing
>> host and guest windows, etc.).
>
> I have seen screenshots of a virtualization solution - don't ask me
> which one because I don't remember - where GNU/Linux and Windows were
> actually *sharing* the desktop. Neither was running in any windowed
> context and one didn't have to switch full screen mode between the host
> (GNU/Linux) and the guest (Windows). It was all seamless, with X
> desktop's taskbar at the top and the Windows taskbar at the bottom of
> the screen. It was weird. 8-)
>
Weird, but useful!
>> But it doesn't allow the guest the same level of controlled
>> hardware access that a hypervisor solution like Xen gives you. That's
>> why I recommend VBox for workstations, but it is probably not the best
>> choice for a server. Different tools for different jobs.
>
> And differing opinions on what constitutes "a workstation". :-)
>
>>>> I like to install OpenWRT on WRT54GL devices - it makes them far
>>>> more flexible than the original firmware. Of course, if the
>>>> original firmware does all you need, then that's fine.
>>> Yeah, I know quite a few people who use OpenWRT, but then again, at
>>> present I see no reason why I would be putting myself through the
>>> trouble of flashing the firmware on that thing. ;-) The standard
>>> firmware is already quite good, mind you. ;-)
>> If it ain't broke, don't fix it?
>
> Exactly! :-)
>
>> But it's more fun to break it, then see if you can fix it again
>> afterwards...
>
> And if you can't, then you've got yourself an expensive paperweight. :p
>

David | 1/24/2010 10:21:34 PM

Rahul wrote:
> David Brown <david@westcontrol.removethisbit.com> wrote in
> news:4b596aa4$0$6281$8404b019@news.wineasy.se:
>
>> Rahul wrote:
>>> David Brown <david.brown@hesbynett.removethisbit.no> wrote in
>>> news:8KednbJOTsi3FsrWnZ2dnUVZ8i2dnZ2d@lyse.net:
>>>
>> It is very difficult to judge these things - it is so dependent on the
>> load. Don't rate my gut feeling above /your/ gut feeling! The
>> trouble is, the only way to be sure is to try out both combinations
>> and see, which is a little impractical.
>
> Exactly. Lies, damned lies and benchmarks. I spent a month on getting my
> vendor to run benchmarks on the intended I/O boxes but it is very
> difficult to estimate. In the end I decided to take a more robust
> decision and thus bought this config. With 3 independent storage boxes,
> 45 SAS 15 drives, and a beefy server with 48 Gigs RAM at least I have the
> independence that if one layout does not give the performance I want I
> can move things around and try a different strategy. Worst case even put
> in 3 servers with one box attached to each and then spread out I/O via
> Lustre etc. Or even have the hack of a /home1 /home2 and a /home3 etc.
>
>> If you are able, you could do
>> testing with only half the disks attached to see if it makes a
>> measurable difference.
>
> I will. What's your tool of choice? bonnie++, iozone etc.? Or just a dd
> with changing parameters.
>
I've used bonnie++, but I have never done any specific task-related
testing. I haven't had to put together systems with performance
requirements - mostly I've picked parts based on solid value for money.
I think it's fun doing testing, but that's more along the lines of
"that's cool - this system does X more bogomips than the old one!".
For your tests, you'll want to spend a bit more time on them, and try to
get something that matches your real load. In particular, you'll want
something that can simulate a large number of parallel accesses, ideally
using a second machine to access the server over NFS.
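
One way to approximate "a large number of parallel accesses" without a
full benchmark suite is a small script along these lines. It is only a
sketch, not a replacement for bonnie++ or iozone: the function names
(write_then_read, parallel_io_test), the worker count and the block
sizes are all made up for illustration, and the read-back will mostly be
served from the page cache unless the files are larger than RAM.

```python
import os
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor


def write_then_read(path, block_size, blocks):
    """Write `blocks` blocks to `path`, fsync, read them back.

    Returns the total number of bytes written plus bytes read.
    """
    payload = os.urandom(block_size)
    with open(path, "wb") as f:
        for _ in range(blocks):
            f.write(payload)
        f.flush()
        os.fsync(f.fileno())  # push the data out of Python and the OS buffers
    moved = block_size * blocks
    with open(path, "rb") as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            moved += len(chunk)
    return moved


def parallel_io_test(directory, workers=8, block_size=64 * 1024, blocks=64):
    """Run `workers` concurrent write/read jobs in `directory`.

    Returns (bytes_moved, elapsed_seconds). Illustrative defaults only.
    """
    paths = [os.path.join(directory, f"iotest_{i}.dat") for i in range(workers)]
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        moved = sum(pool.map(lambda p: write_then_read(p, block_size, blocks), paths))
    elapsed = time.perf_counter() - start
    for p in paths:
        os.remove(p)  # clean up the scratch files
    return moved, elapsed


if __name__ == "__main__":
    # A temporary directory is used here; for a real test, pass a directory
    # on the filesystem (or NFS mount) you actually want to measure.
    with tempfile.TemporaryDirectory() as d:
        moved, elapsed = parallel_io_test(d)
        print(f"{moved / 2**20:.1f} MiB in {elapsed:.3f} s")
```

Running it from a second machine against a directory on the NFS mount,
with file sizes well above the server's RAM, gets closer to the real
load than a single dd on the server itself.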
>> I have been imagining two layers of controllers here - your disks are
>> connected to one controller on a storage box,
>
> I have a MD-1000 from Dell. I don't think it has any controller on board
> to speak of. It's a dumb box. The PERC-6e card in the host server is the
> only controller that I know of.
>
>> Have you considered using a clustered file system such as Lustre or
>> GFS? You then have a central server for metadata, which is easy to
>> get fast (everything will be in the server's ram cache), and the
>> actual data is spread around and duplicated on different servers.
>
> I did consider Lustre, Gluster, HadoopFS and GFS. Problem is that they
> seem very tricky to set up correctly and I was scared away. Have you
> actually tried any of those? Any anecdotal comments? Have you used any of
> these distributed filesystems?
>
No idea - sorry. I've read the Wikipedia pages, which makes me the
local expert, but I have never tried anything like this. My company
would have to increase its IT budget by an order of magnitude or two
before I could get the chance!
>> 48 GB is actually quite a lot. Whether it is enough or not is hard to
>> say. Run the system you have got - if people complain that it is
>> slow, do some monitoring to find the bottlenecks. If they don't
>> complain, then 48 GB is enough!
>
> I'm not even sure what's the absolute max RAM that my server will even
> take.
>
>
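
As a starting point for the monitoring mentioned above, something like
the following could show how much of the installed RAM the kernel is
currently using as file cache. It is a rough sketch that assumes a Linux
box with the usual /proc/meminfo fields (MemTotal, Cached, Buffers); the
helper names parse_meminfo and cache_summary are invented here, and it
says nothing about the maximum RAM the server hardware supports.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> integer.

    Most fields are reported in kB (KiB); a few, such as the HugePages
    counters, are plain counts.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def cache_summary(fields):
    """Return (total_gib, cached_gib, cached_fraction) from parsed fields."""
    total = fields["MemTotal"]
    cached = fields.get("Cached", 0) + fields.get("Buffers", 0)
    kib_per_gib = 1024 * 1024
    return total / kib_per_gib, cached / kib_per_gib, cached / total


if __name__ == "__main__":
    path = "/proc/meminfo"  # Linux only; skipped elsewhere
    if os.path.exists(path):
        with open(path) as f:
            total_gib, cached_gib, frac = cache_summary(parse_meminfo(f.read()))
        print(f"RAM {total_gib:.1f} GiB, cache + buffers {cached_gib:.1f} GiB ({frac:.0%})")
```

If the cache stays close to full while users complain, the disks or the
network are the more likely bottleneck than the amount of RAM; if it
never fills, 48 GB is probably plenty.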

David | 1/24/2010 10:29:16 PM