Hi
I'm looking for some input on file designs for scientific data - if
you're not interested in this sort of thing you should ignore this
posting completely. If you are, please read on:
We're soon to embark on the redesign of one of our systems. One of
the main areas of functionality of the system is data management. The
basic data is geophysical signal records from our instruments. The
records are processed, transformed, manipulated etc, and a variety of
modelling and inversion operations carried out. The data management
issues will be familiar: issues of auditability, traceability,
provenance etc: what operations were carried out to transform signal A
into signal A', who carried them out, what versions of what programs
were used, etc.
For each project we carry out we'll typically collect data from a
handful of lines of instrumentation (survey data), each of which
comprises 10s (that's 10-off, not 10 seconds) -- 100s of sensors, each
of which records 10s -- 100s of signals. Back home the data is
processed, so each input signal might give rise to multiple output
signals (different filters, different downsampling, etc), and so on.
In short, the data fits quite naturally into a hierarchical structure.
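To make the hierarchy and the provenance questions above concrete, here is a minimal Python sketch (Python being one of the languages named at the end of this post). Every class and field name in it (`Project`, `SurveyLine`, `Sensor`, `Signal`, `ProcessingStep`, `history`) is an assumption made for illustration and not part of the existing system. Each derived signal keeps a link to its parent and to the step that produced it, so the chain from A to A' can be read back instead of being copied into every header.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ProcessingStep:
    """One transformation applied to a signal (hypothetical provenance record)."""
    operation: str        # what was done, e.g. "moving average"
    operator: str         # who carried it out
    program: str          # which program was used
    program_version: str  # which version of that program


@dataclass
class Signal:
    name: str
    samples: List[float]
    # None for a raw recording; otherwise the signal this one was derived from.
    parent: Optional["Signal"] = None
    # The step that turned `parent` into this signal (None for raw data).
    step: Optional[ProcessingStep] = None

    def history(self) -> List[ProcessingStep]:
        """Return every step from the raw recording down to this signal."""
        steps = []
        node = self
        while node is not None and node.step is not None:
            steps.append(node.step)
            node = node.parent
        return list(reversed(steps))


@dataclass
class Sensor:
    name: str
    signals: List[Signal] = field(default_factory=list)


@dataclass
class SurveyLine:
    name: str
    sensors: List[Sensor] = field(default_factory=list)


@dataclass
class Project:
    name: str
    lines: List[SurveyLine] = field(default_factory=list)


# Invented example data: a raw signal and two signals derived from it.
raw = Signal("A", [1.0, 2.0, 3.0, 4.0])
filtered = Signal(
    "A'", [1.5, 2.5, 3.5],
    parent=raw,
    step=ProcessingStep("moving average", "mw", "smooth", "1.2"),
)
decimated = Signal(
    "A''", [1.5, 3.5],
    parent=filtered,
    step=ProcessingStep("downsample by 2", "mw", "decimate", "0.9"),
)
sensor = Sensor("s-001", [raw, filtered, decimated])
project = Project("demo", [SurveyLine("line-1", [sensor])])

for s in decimated.history():
    print(s.operation, s.operator, s.program, s.program_version)
```

The same shape maps onto nested groups in HDF5, nested elements in XML, or directories in a file forest; the sketch only fixes what has to be recorded, not where it is stored.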
Right now we manage our data in what I would describe as a 'file
forest' -- each signal is in a file, which is in a directory, which is
in a directory, and so on. In a typical project the directory
structure might be 8 layers deep, with an average branching factor of
3 - 4 -- seriously some of our projects ultimately comprise 20,000
files or more. Each of the files contains a bunch of (text) header
data, maybe 80 lines of up to 80 characters, and signal data which is
of the order of 4 columns of up to about 1000 rows of floating-point
numbers. There's a lot of redundant duplication of data -- when a
signal is transformed the output file might contain the same header
info as the input file, plus a few more lines describing the
transformation, and the same data, but with a new column of figures
added (the transform of one of the other columns of data). And
there's an awful lot of files. Mostly the directories and files are
created and managed by the existing system, but users can open up a
file and tweak it if they wish to.
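As a rough sanity check on the figures above, the following Python sketch works through the arithmetic. It assumes a perfectly uniform tree (every level branching by the same factor) and, for the size estimate, about 16 bytes of text per floating-point number; both are assumptions for illustration, and the header and data sizes are the "up to" values quoted above, so the result is an upper-end estimate.

```python
def leaf_count(branching: int, depth: int) -> int:
    """Number of leaves in a uniform tree with the given branching factor and depth."""
    return branching ** depth


DEPTH = 8
for branching in (3, 4):
    print(f"branching factor {branching}: {leaf_count(branching, DEPTH)} leaves")

# Size estimate for a 20,000-file project.
FILES = 20_000
HEADER_BYTES = 80 * 80       # 80 header lines of up to 80 characters
BYTES_PER_NUMBER = 16        # ASSUMPTION: text width of one number plus separator
DATA_BYTES = 4 * 1000 * BYTES_PER_NUMBER  # 4 columns of up to 1000 rows

total_bytes = FILES * (HEADER_BYTES + DATA_BYTES)
print(f"approximate project size: {total_bytes / 1e9:.2f} GB")
```

A uniform tree of depth 8 has 6,561 leaves at branching factor 3 and 65,536 at branching factor 4, so 20,000 files sits comfortably between the two, and the size estimate comes out at roughly 1.4 GB per large project.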
So, wearing my software engineering hat, I'm thinking -- this is a
mess, the new system should rationalise this data management and
introduce some structured file formats. I'm thinking XML, HDF5, maybe
Infosets. Now, there's a wealth of information on XML, and pretty
good documentation on the others, most of it directed to telling you
how to use the formats and their associated tools. But everything I
find leaves me with 3 questions, and this is what I'm hoping some of
you might be able to give me pointers on:
1. What are the relative merits of XML, HDF5 and Infosets ? I think
I've figured out that plain XML is limited in file size to about the
size of my computer's RAM since parsing an XML file requires it to be
read in one go (but that standards like Xlink (?) might help us round
this). And yes, XML is kind of bloated for scientific data sets, but
we're only talking GBs of data, not TBs, and there are good reasons
for having readable data sets -- much easier for geophysicists to
prototype bits and pieces of code for one. But is there any hard info
out there on why to choose one rather than the others ? Or why to
avoid one or all like the plague ?
2. OK, having chosen my favourite approach to file design, how do I
design a file ? If we choose to go XML, how do I decide what become
attributes, what are tags, and what are just text ? The published
info I've found is quite good at telling you how to implement a
design, but less good at how to derive a design in the first place.
3. How do I persuade my boss that the effort associated with
modifying our working practices will be repaid ? As a software
engineer the sheer inelegance of our current approach persuades me
that it is bad -- but I'm not sure that's strong enough. Another
argument is that our current approach duplicates a lot of data, but we
haven't yet had a horrendous incident arising from this (ie non-agreement
of what should be identical data) which would make a
compelling case for change. And the argument about wasting disk space
is not an argument now that disks are so cheap.
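Questions 1 and 2 can be illustrated together with a small Python sketch. On the RAM worry: it applies to parsers that build the whole document tree in memory, whereas event-driven (streaming) parsers read a file piece by piece; `xml.etree.ElementTree.iterparse` in the Python standard library is one such parser. On attributes versus tags versus text, the sketch follows one common rule of thumb, which is an assumption here and not something established in this thread: short identifying metadata goes in attributes, repeated or structured items (processing steps) become child elements, and bulk numbers go in element text. All element and attribute names are invented for illustration.

```python
import io
import xml.etree.ElementTree as ET


def build_signal(name, sensor, steps, samples):
    """Build one <signal> element (hypothetical layout)."""
    # Short identifying metadata -> attributes.
    sig = ET.Element("signal", {"name": name, "sensor": sensor})
    # Repeated, structured items -> child elements.
    prov = ET.SubElement(sig, "provenance")
    for operation, program, version in steps:
        step = ET.SubElement(prov, "step", {"program": program, "version": version})
        step.text = operation
    # Bulk numeric data -> element text.
    data = ET.SubElement(sig, "samples", {"count": str(len(samples))})
    data.text = " ".join(repr(x) for x in samples)
    return sig


root = ET.Element("project", {"name": "demo"})
root.append(build_signal("A", "s-001", [], [1.0, 2.0, 3.0]))
root.append(build_signal("A'", "s-001",
                         [("low-pass filter", "smooth", "1.2")], [1.5, 2.5]))
xml_bytes = ET.tostring(root, encoding="utf-8")


def stream_signals(source):
    """Yield (name, samples) one signal at a time instead of loading the whole file."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "signal":
            samples = [float(t) for t in elem.find("samples").text.split()]
            yield elem.get("name"), samples
            # Discard this signal's text and children once it has been handled.
            elem.clear()


for name, samples in stream_signals(io.BytesIO(xml_bytes)):
    print(name, samples)
```

In this sketch the emptied `<signal>` elements stay attached to the root, so memory use is small but not strictly constant; a reader for very large files would also detach them.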
OK, I realise I'm asking for a lot, and I don't really expect a load
of free consultancy, but if anyone cares to send me some links to
useful sites, or published papers, I'd really appreciate it.
And, in case you were wondering why this message is being posted to
comp.lang.fortran, it's because our language of choice is Fortran
(though we do dabble in C, C++, Python and Java too).
Thanks
Mark Westwood
markc.westwood (96), 2/13/2008 1:16:06 PM
On 2008-02-13 09:16:06 -0400, Mark Westwood <markc.westwood@gmail.com> said:
> <snip full quote of the original post>
The star gazers have an exchange format called FITS (Flexible Image
Transport System) that is the first hit on Google for "FITS data".
They must have some tales to tell of lessons, both good and bad, that
they have learned as well as other sources.
There are even a few FITS users who post here so you should get
more definitive information than just my pointer.
g.sande (1183), 2/13/2008 2:15:23 PM
Mark Westwood wrote:
> I'm looking for some input on file designs for scientific data - if
> you're not interested in this sort of thing you should ignore this
> posting completely. If you are, please read on:
>
> We're soon to embark on the redesign of one of our systems. One of
> the main areas of functionality of the system is data management. The
> basic data is geophysical signal records from our instruments. The
> records are processed, transformed, manipulated etc, and a variety of
> modelling and inversion operations carried out. The data management
> issues will be familiar: issues of auditability, traceability,
> provenance etc: what operations were carried out to transform signal A
> into signal A', who carried them out, what versions of what programs
> were used, etc.
[snip]
Some of my colleagues use the CDF format to collect and process their
fluid dynamics data. Originally this format was used in completely
different fields but it is designed to be useful as a general purpose
data container. It might be useful. Have a look here:
http://cdf.gsfc.nasa.gov/
dsnouck (6), 2/13/2008 2:45:43 PM
"Mark Westwood" <markc.westwood@gmail.com> wrote in message
news:6f1128ee-12ca-4544-a84e-8fda9a55dd7c@d70g2000hsb.googlegroups.com...
> Hi
>
> I'm looking for some input on file designs for scientific data - if
> you're not interested in this sort of thing you should ignore this
> posting completely. If you are, please read on:
>
> We're soon to embark on the redesign of one of our systems. One of
> the main areas of functionality of the system is data management. The
> basic data is geophysical signal records from our instruments. The
> records are processed, transformed, manipulated etc, and a variety of
> modelling and inversion operations carried out. The data management
> issues will be familiar: issues of auditability, traceability,
> provenance etc: what operations were carried out to transform signal A
> into signal A', who carried them out, what versions of what programs
> were used, etc.
>
> For each project we carry out we'll typically collect data from a
> handful of lines of instrumentation (survey data), each of which
> comprises 10s (that's 10-off, not 10 seconds) -- 100s of sensors, each
> of which records 10s -- 100s of signals. Back home the data is
> processed, so each input signal might give rise to multiple output
> signals (different filters, different downsampling, etc), and so on.
> In short, the data fits quite naturally into a hierarchical structure.
>
> Right now we manage our data in what I would describe as a 'file
> forest'
<snip description of file organization>
> Mostly the directories and files are
> created and managed by the existing system, but users can open up a
> file and tweak it if they wish to.
>
> So, wearing my software engineering hat, I'm thinking -- this is a
> mess, the new system should rationalise this data management and
> introduce some structured file formats. I'm thinking XML, HDF5, maybe
> Infosets. Now, there's a wealth of information on XML, and pretty
> good documentation on the others, most of it directed to telling you
> how to use the formats and their associated tools. But everything I
> find leaves me with 3 questions, and this is what I'm hoping some of
> you might be able to give me pointers on:
>
> 1. What are the relative merits of XML, HDF5 and Infosets ?
<snip>
You should seriously consider a fourth approach: A Relational DataBase
Management System (RDBMS). From the extensive description of your data
organization, it appears that this is the kind of problem that is well-suited to
an RDBMS.
There are several very good RDBMS packages out there. Some are supported on
a variety of operating systems while others are specific to a particular
operating system. E.g., Microsoft Access and Corel Paradox work only under
Windows.
You don't mention the operating system(s) that you are using. This is
important because the OS determines what products are available. And, if you
are using one or more of the various Unixes, you need to specify exactly which
Unix(es) you are using.
Different RDBMSes have different trade-offs with respect to cost, data types
supported, execution efficiency, scalability, programming languages supported,
and other considerations. You will need to evaluate each product with respect
to your needs.
All modern RDBMS packages use Structured Query Language (SQL) as an
interface language to the RDBMS. You can write SQL statements in a variety of
modes: interactive (from the keyboard), embedded (in program source code),
dynamic (programs construct SQL statements at run time), and SQL modules (work
like subroutines).
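As a small illustration of SQL embedded in program source, here is a Python sketch that stores the line/sensor/signal hierarchy relationally and reads a signal's processing history back with one query. It uses SQLite through Python's standard `sqlite3` module purely because it needs no server; SQLite is not one of the products named in this thread, and the table and column names are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sensor (
    id        INTEGER PRIMARY KEY,
    line_name TEXT NOT NULL,
    name      TEXT NOT NULL
);
CREATE TABLE signal (
    id              INTEGER PRIMARY KEY,
    sensor_id       INTEGER NOT NULL REFERENCES sensor(id),
    parent_id       INTEGER REFERENCES signal(id),  -- NULL for raw recordings
    name            TEXT NOT NULL,
    operation       TEXT,   -- what produced this signal from its parent
    program         TEXT,
    program_version TEXT
);
""")

conn.execute("INSERT INTO sensor (id, line_name, name) VALUES (1, 'line-1', 's-001')")
conn.executemany(
    "INSERT INTO signal (id, sensor_id, parent_id, name, operation, program,"
    " program_version) VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (1, 1, None, "A", None, None, None),
        (2, 1, 1, "A'", "low-pass filter", "smooth", "1.2"),
        (3, 1, 2, "A''", "downsample by 2", "decimate", "0.9"),
    ],
)


def lineage(connection, signal_id):
    """Follow parent links from a signal back to its raw recording, raw first."""
    return connection.execute(
        """
        WITH RECURSIVE chain(id, parent_id, name, operation, depth) AS (
            SELECT id, parent_id, name, operation, 0 FROM signal WHERE id = ?
            UNION ALL
            SELECT s.id, s.parent_id, s.name, s.operation, chain.depth + 1
            FROM signal AS s JOIN chain ON s.id = chain.parent_id
        )
        SELECT name, operation FROM chain ORDER BY depth DESC
        """,
        (signal_id,),
    ).fetchall()


for name, operation in lineage(conn, 3):
    print(name, operation)
```

The `?` placeholders keep values out of the SQL text itself. The recursive query is what replaces walking up a directory tree to find out how a signal was produced.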
Some major RDBMS products on mid-range and mainframe systems include:
Oracle, Sybase, and Informix. On IBM mainframes, there is DB2. On OpenVMS,
there is Oracle/Rdb, which is the current version of the original Rdb. Oracle
also offers its regular product on OpenVMS, but the two products are quite
different. Oracle/Rdb is very well tailored to the OpenVMS operating system
and, IMHO, easier to use than the regular Oracle product.
There is at least one RDBMS that is free: MySQL. However, I don't know how
well suited it is to your needs.
> 2. OK, having chosen my favourite approach to file design, how do I
> design a file ?
<snip>
This is an issue that is the subject of an entire branch of software
engineering. It is called DataBase Analysis. There are highly-trained
professionals called DataBase Analysts who evaluate such needs and develop
solutions on a full-time basis. You really need to retain a DataBase Analyst as
a consultant or else hire one to do this job. From the description of the
problem you provided us, this looks like a job for a fully-trained professional.
> 3. How do I persuade my boss that the effort associated with
> modifying our working practices will be repaid ? As a software
> engineer the sheer inelegance of our current approach persuades me
> that it is bad -- but I'm not sure that's strong enough. Another
> argument is that our current approach duplicates a lot of data, but we
> haven't yet had a horrendous incident arising from this (ie non-
> agreement of what should be identical data) which would make a
> compelling case for change. And the argument about wasting disk space
> is not an argument now that disks are so cheap.
This is another issue that is a separate discipline. This is really a
Project Management or Business Management issue. You will need to quantify the
expected benefits and costs of the current approach and the alternative future
approaches. You will also need to do a risk analysis of the current versus
future approaches. And, you need to include qualitative costs and benefits as
well as those that can be quantified.
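One quantity that can be measured cheaply on the existing system is the duplication itself. The Python sketch below counts how many files under a directory share an identical header block with at least one other file. The three-line header, the `.dat` extension and the function names are assumptions for illustration only; the real files are described in the original post as carrying around 80 header lines.

```python
import hashlib
import tempfile
from collections import Counter
from pathlib import Path


def header_digest(path, header_lines=3):
    """Hash the first `header_lines` lines of a text file."""
    with open(path, "r", encoding="ascii") as fh:
        head = [next(fh, "") for _ in range(header_lines)]
    return hashlib.sha256("".join(head).encode("ascii")).hexdigest()


def duplicated_headers(root, header_lines=3):
    """Count files whose header block also appears in at least one other file."""
    counts = Counter(
        header_digest(p, header_lines) for p in Path(root).rglob("*.dat")
    )
    return sum(n for n in counts.values() if n > 1)


# Tiny invented file forest: two files share a header, one does not.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / "line1" / "s001"
    folder.mkdir(parents=True)
    header = "project: demo\nsensor: s001\nunits: mV\n"
    (folder / "raw.dat").write_text(header + "1.0 2.0\n")
    (folder / "filtered.dat").write_text(header + "1.0 2.0 1.5\n")
    (folder / "other.dat").write_text(
        "project: demo\nsensor: s002\nunits: mV\n3.0 4.0\n"
    )
    n_dup = duplicated_headers(tmp)

print(f"{n_dup} files share a header block with another file")
```

Run over a real project, a count like this gives a number to put beside the cost of change, and the groups of files that ought to match but no longer do would be the evidence of non-agreement the original post says is missing.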
Be sure to pro-actively consider the option of doing nothing, i.e., continue
using the current system. Be sure to assess the risks and costs of doing
nothing. A frequent mistake in such analyses is to overlook the costs and risks
of doing nothing.
Perhaps a Project Management consultant would be useful. Also, if you do go
ahead with this data conversion, it is likely to be a major project so a
professional project manager would be useful.
> OK, I realise I'm asking for a lot, and I don't really expect a load
> of free consultancy, but if anyone cares to send me some links to
> useful sites, or published papers, I'd really appreciate it.
>
> And, in case you were wondering why this message is being posted to
> comp.lang.fortran, it's because our language of choice is Fortran
> (though we do dabble in C, C++, Python and Java too).
>
> Thanks
>
> Mark Westwood
Well, Fortran is my first choice for developing software, regardless of the
application area. I have done a lot of commercial application development in
Fortran, C, and C++ and I find Fortran to be far more productive and far less
error-prone than the other two.
FWIW, here are my educational credentials.
Bachelor of Science, Applied Mathematics, Engineering, and Physics
Master of Business Administration
Master of Project Management
Good luck with your data conversion.
--
Craig Dedo
17130 W. Burleigh Place
P. O. Box 423
Brookfield, WI 53008-0423
Voice: (262) 783-5869
Fax: (262) 783-5928
Mobile: (414) 412-5869
E-mail: <cdedo@wi.rr.com> or <craig@ctdedo.com>
cdedo (127), 2/13/2008 2:55:46 PM
In message <2008021310152316807-gsande@worldnetattnet>, Gordon Sande
<g.sande@worldnet.att.net> writes
>The star gazers have an exchange format called FITS (Flexible Image
>Transport System) that is the first hit on Google for "FITS data".
>
>They must have some tales to tell of lessons, both good and bad, that
>they have learned as well as other sources.
>
>There are even a few FITS users who post here so you should get
>more definitive information than just my pointer.
Indeed. I'm a member of a FITS committee and have been on a working
party recently revising the standard. FITS is indeed widely used in
astronomy, but it is good for two main reasons:
(a) already widely used in astronomy
(b) described to the bit level in the primary literature, so likely to
be a readable format in centuries to come.
But it was designed in the early 80s and has many awkward features and
restrictions which are only perpetuated for compatibility, so I simply
cannot recommend it for use ab initio in another field.
The HDF5 format is a lot more flexible and powerful, and on my limited
experience I would suggest you look at it seriously. The main drawback,
potentially, is that it is not defined as a file format but as a
software library. You have to assume that the C code on which it is
based will continue to be usable as compilers and platforms evolve, and
though the library is freely usable, I'm not sure that the code is open
source. Using HDF5 is probably a safe bet, but FITS has chosen the more
cautious approach of defining the file, not the software.
XML is a lot more buzzword-compliant, and will probably get you brownie
points, but it is basically a text format, so for scientific data there
are large overheads in processing time and data file size. Although
there are ways of encoding binary data in XML files, they are hard to
use, and most XML utilities won't handle them at all. Clearly some
people are using XML files with Fortran, as there was a course on it
recently at the e-science institute in Edinburgh, see
http://www.nesc.ac.uk/action/esi/contribution.cfm?Title=841
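For a feel of what encoding binary data in an XML file involves, here is a minimal Python sketch using base64, which is one common approach and not necessarily the one the paragraph above has in mind. The element name and the `encoding`, `dtype` and `count` attributes are invented for illustration; a reader has to know this convention in advance, which is the practical difficulty with generic XML tools.

```python
import base64
import struct
import xml.etree.ElementTree as ET

samples = [0.1 * i for i in range(1000)]

# Plain text: readable, but every number is spelled out in decimal digits.
as_text = " ".join(repr(x) for x in samples)

# Binary: little-endian 64-bit floats, then base64 so the bytes are XML-safe.
packed = struct.pack(f"<{len(samples)}d", *samples)
as_b64 = base64.b64encode(packed).decode("ascii")

elem = ET.Element(
    "samples",
    {"encoding": "base64", "dtype": "float64-le", "count": str(len(samples))},
)
elem.text = as_b64
xml_text = ET.tostring(elem, encoding="unicode")

# Reading it back needs the same out-of-band knowledge of the layout.
parsed = ET.fromstring(xml_text)
count = int(parsed.get("count"))
decoded = list(struct.unpack(f"<{count}d", base64.b64decode(parsed.text)))

print("text characters:  ", len(as_text))
print("base64 characters:", len(as_b64))
print("round trip exact: ", decoded == samples)
```

The base64 form round-trips the values bit for bit and has a fixed size per number, but the file is no longer something a geophysicist can read or tweak in a text editor.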
--
Clive Page
junk2470 (76), 2/13/2008 4:52:39 PM
Mark Westwood wrote:
<snip>
> So, wearing my software engineering hat, I'm thinking -- this is a
> mess, the new system should rationalise this data management and
> introduce some structured file formats. I'm thinking XML, HDF5, maybe
> Infosets.
Personally, I don't find the "file forest" idea to be inelegant -
especially if, as you say, "the data fits quite naturally into a
hierarchical structure". It's simple, but if it fits the requirement,
I'd say that made it elegant. I'd need a good reason before replacing
one hierarchical file system with another that was less transparent.
When looking at alternatives, don't forget that the file forest scheme
has some distinct advantages - e.g.
* anyone can understand that sort of structure in minutes,
* the scope of any data corruption is probably clear immediately
* your system is not complicated by reliance on 3rd party software. For
example if you change to a different compiler, you don't have to worry
about whether the third party software is compatible.
* the fact that "users can open up a file and tweak it if they wish to"
may be an advantage, but could also be a problem (though of course you
can set file permissions).
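The point about not relying on third-party software can be illustrated with a short Python sketch that inventories a file forest using nothing but the standard library. The directory and file names are invented, and the tree is created in a temporary directory so the example is self-contained.

```python
import tempfile
from collections import Counter
from pathlib import Path


def inventory(root):
    """Return {depth: number of files at that depth} for everything under root."""
    root = Path(root)
    counts = Counter()
    for path in root.rglob("*"):
        if path.is_file():
            # Number of directories between root and the file.
            depth = len(path.relative_to(root).parts) - 1
            counts[depth] += 1
    return dict(counts)


# Build a tiny invented forest: 2 lines x 2 sensors x 2 files, plus a README.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    for line in ("line1", "line2"):
        for sensor in ("s001", "s002"):
            folder = base / line / sensor
            folder.mkdir(parents=True)
            (folder / "raw.dat").write_text("header\n1.0 2.0\n")
            (folder / "filtered.dat").write_text("header\n1.0 2.0 1.5\n")
    (base / "README.txt").write_text("demo project\n")
    summary = inventory(base)

print(summary)
```

A report like this is a few lines in any language with directory access, which is part of the transparency being argued for here.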
I'm reminded of the mess that is the Windows registry. That was some
software engineer's idea to clear up the perceived "mess" of INI files.
The result is an ever growing pit, that progressively reduces the
speed of your machine, and which, if corrupted, can wipe out your entire
system. Fortunately Linux has not yet taken that route.
--
John Appleyard - (send email to john!news@.. rather than spamtrap@..)
Polyhedron Software
Programs for Programmers - QA, Compilers, Graphics, Consultancy
********* Visit our Web site on http://www.polyhedron.co.uk/ *********
spamtrap7925 (139), 2/13/2008 5:05:49 PM
In message <47b304f3$0$16656$4c368faf@roadrunner.com>, Craig Dedo
<cdedo@wi.rr.com> writes
> You should seriously consider a fourth approach: A Relational
>DataBase Management System (RDBMS). From the extensive description of
>your data organization, it appears that this is the kind of problem
>that is well-suited to an RDBMS.
Yes it might well be. I'd recommend using Postgres rather than MySQL.
Postgres is also open source, but a rather more solid product and much
more standards-compliant. Both can be used on all modern platforms
including Windows and Linux. See this page
http://troels.arvin.dk/db/rdbms/
for a comparison of features.
All the same, getting scientific data into and out of an RDBMS isn't
easy - typically you have to convert your data into text (CSV) to load
it, and have to generate SQL statements to extract it. This can all be
done in Fortran (and I've done it on data held originally in FITS files)
but you may wonder after a bit whether you are not exchanging one can of
worms for another. I've also used Fortran2003 Stream I/O (implemented
by g95 and gfortran) to read and write Postgres binary format, which
avoids handling CSV files, but that doesn't avoid the need to be
proficient in SQL.
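The CSV-in, SQL-out round trip described above looks roughly like this in Python. SQLite (via the standard `sqlite3` module) stands in for Postgres here only so the sketch runs without a server; the table layout and column names are invented, and a real Postgres load would normally go through its own bulk-load facilities instead of row-by-row inserts.

```python
import csv
import io
import sqlite3

# Step 1: convert the signal to text (CSV), as a loader would expect.
csv_buffer = io.StringIO()
writer = csv.writer(csv_buffer)
writer.writerow(["signal", "idx", "time_s", "value"])
for idx, (time_s, value) in enumerate([(0.0, 1.5), (0.1, 2.5), (0.2, 3.5)]):
    writer.writerow(["A", idx, time_s, value])
csv_buffer.seek(0)

# Step 2: load the CSV rows into a table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sample (signal TEXT, idx INTEGER, time_s REAL, value REAL)"
)
conn.executemany(
    "INSERT INTO sample VALUES (:signal, :idx, :time_s, :value)",
    csv.DictReader(csv_buffer),
)

# Step 3: generate SQL to get the data back out in the original order.
values = [
    row[0]
    for row in conn.execute(
        "SELECT value FROM sample WHERE signal = ? ORDER BY idx", ("A",)
    )
]
print(values)
```

Even in this toy form the data passes through a text conversion on the way in and an SQL statement on the way out, which is the overhead being weighed against the file-based options.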
>> 2. OK, having chosen my favourite approach to file design, how do I
>> design a file ?
> <snip>
>
> This is an issue that is the subject of an entire branch of software
>engineering. It is called DataBase Analysis. There are highly-trained
>professionals called DataBase Analysts who evaluate such needs and
>develop solutions on a full-time basis. You really need to retain a
>DataBase Analyst as a consultant or else hire one to do this job. From
>the description of the problem you provided us, this looks like a job
>for a fully-trained professional.
I rather think that a DataBase Analyst will tell you that a DBMS is the
solution, whatever the problem is (if all you have is a hammer, etc.).
Yes once you have decided to use an RDBMS then there are
well-established methods, entity relationship modelling and so forth,
but I've found these methods of limited use in designing scientific data
formats. Maybe there really are professionals out there with the right
skills to design scientific data files, but I have not been fortunate
enough to meet them. Some pointers would be helpful...
> Be sure to pro-actively consider the option of doing nothing,
Always a good idea :-)
--
Clive Page
junk2470 (76), 2/13/2008 5:10:21 PM
Craig
Thanks for your extended attention.
Funnily enough I spent about 10 years persuading companies of the
benefits of implementing RDBMSs back when they were the next big
thing in data management and I agree with most of what you've
written. But I've excluded them from consideration because:
-- I don't think that there's a snowball's chance in hell of
extracting that much money from the paymasters;
-- and if I did, then I'd need a DBMS expert to look after it all for
me.
Perhaps, one day, if we implement a successful data management scheme,
we'll be able to take the next step and implement a proper DBMS-based
solution.
As to hiring a professional in this area, well I have no objection in
principle, but I'm not aware of anyone who might justifiably be called
an expert. I know what I don't want to hire -- an (R)DBMS expert from
the commercial world learning about scientific data while I'm paying
the bills.
As I said, thanks for the input,
Mark
--
On 13 Feb, 14:55, "Craig Dedo" <cd...@wi.rr.com> wrote:
> "Mark Westwood" <markc.westw...@gmail.com> wrote in message
>
> news:6f1128ee-12ca-4544-a84e-8fda9a55dd7c@d70g2000hsb.googlegroups.com...
>
> > Hi
>
> > I'm looking for some input on file designs for scientific data - if
> > you're not interested in this sort of thing you should ignore this
> > posting completely. If you are, please read on:
>
> > We're soon to embark on the redesign of one of our systems. One of
> > the main areas of functionality of the system is data management. The
> > basic data is geophysical signal records from our instruments. The
> > records are processed, transformed, manipulated etc, and a variety of
> > modelling and inversion operations carried out. The data management
> > issues will be familiar: issues of auditability, traceability,
> > provenance etc: what operations were carried out to transform signal A
> > into signal A', who carried them out, what versions of what programs
> > were used, etc.
>
> > For each project we carry out we'll typically collect data from a
> > handful of lines of instrumentation (survey data), each of which
> > comprises 10s (that's 10-off, not 10 seconds) -- 100s of sensors, each
> > of which records 10s -- 100s of signals. Back home the data is
> > processed, so each input signal might give rise to multiple output
> > signals (different filters, different downsampling, etc), and so on.
> > In short, the data fits quite naturally into a hierarchical structure.
>
> > Right now we manage our data in what I would describe as a 'file
> > forest'
>
> <snip description of file organization>> Mostly the directories and files are
> > created and managed by the existing system, but users can open up a
> > file and tweak it if they wish to.
>
> > So, wearing my software engineering hat, I'm thinking -- this is a
> > mess, the new system should rationalise this data management and
> > introduce some structured file formats. I'm thinking XML, HDF5, maybe
> > Infosets. Now, there's a wealth of information on XML, and pretty
> > good documentation on the others, most of it directed to telling you
> > how to use the formats and their associated tools. But everything I
> > find leaves me with 3 questions, and this is what I'm hoping some of
> > you might be able to give me pointers on:
>
> > 1. What are the relative merits of XML, HDF5 and Infosets ?
>
> <snip>
>
> You should seriously consider a fourth approach: A Relational DataBase
> Management System (RDBMS). From the extensive description of your data
> organization, it appears that this is the kind of problem that is well-suited to
> an RDBMS.
>
> There are several very good RDBMS packages out there. Some are supported on
> a variety of operating systems while others are specific to a particular
> operating system. E.g., Microsoft Access and Corel Paradox work only under
> Windows.
>
> You don't mention the operating system(s) that you are using. This is
> important because the OS determines what products are available. And, if you
> are using one or more of the various Unixes, you need to specify exactly which
> Unix(es) you are using.
>
> Different RDBMSes have different trade-offs with respect to cost, data types
> supported, execution efficiency, scalability, programming languages supported,
> and other considerations. You will need to evaluate each product with respect
> to your needs.
>
> All modern RDBMS packages use Structured Query Language (SQL) as an
> interface language to the RDBMS. You can write SQL statements in a variety of
> modes: interactive (from the keyboard), embedded (in program source code),
> dynamic (programs construct SQL statements at run time), and SQL modules (work
> like subroutines).
>
> Some major RDBMS products on mid-range and mainframe systems include:
> Oracle, Sybase, and Informix. On IBM mainframes, there is DB2. On OpenVMS,
> there is Oracle/Rdb, which is the current version of the original Rdb. Oracle
> also offers its regular product on OpenVMS, but the two products are quitre
> different. Oracle/Rdb is very well tailored to the OpenVMS operating system
> and, IMHO, easier to use than the regular Oracle product.
>
> There is at least one RDBMS that is free: MySQL. However, I don't know how
> well suited it is to your needs.
>
> > 2. OK, having chosen my favourite approach to file design, how do I
> > design a file ?
>
> <snip>
>
> This is an issue that is the subject of an entire branch of software
> engineering. It is called DataBase Analysis. There are highly-trained
> professionals called DataBase Analysts who evaluate such needs and develop
> solutions on a full-time basis. You really need to retain a DataBase Analyst as
> a consultant or else hire one to do this job. From the description of the
> problem you provided us, this looks like a job for a fully-trained professional.
>
> > 3. How do I persuade my boss that the effort associated with
> > modifying our working practices will be repaid ? As a software
> > engineer the sheer inelegance of our current approach persuades me
> > that it is bad -- but I'm not sure that's strong enough. Another
> > argument is that our current approach duplicates a lot of data, but we
> > haven't yet had a horrendous incident arising from this (ie non-
> > agreement of what should be identical data) which would make a
> > compelling case for change. And the argument about wasting disk space
> > is not an argument now that disks are so cheap.
>
> This is another issue that is a separate discipline. This is really a
> Project Management or Business Management issue. You will need to quantify the
> expected benefits and costs of the current approach and the alternative future
> approaches. You will also need to do a risk analysis of the current versus
> future approaches. And, you need to include qualitative costs and benefits as
> well and those that can be quantified.
>
> Be sure to pro-actively consider the option of doing nothing, i.e., continue
> using the current system. Be sure to assess the risks and costs of doing
> nothing. A frequent mistake in such analyses is to overlook the costs and risks
> of doing nothing.
>
> Perhaps a Project Management consultant would be useful. Also, if you do go
> ahead with this data conversion, it is likely to be a major project so a
> professional project manager would be useful.
>
> > OK, I realise I'm asking for a lot, and I don't really expect a load
> > of free consultancy, but if anyone cares to send me some links to
> > useful sites, or published papers, I'd really appreciate it.
>
> > And, in case you were wondering why this message is being posted to
> > comp.lang.fortran, it's because our language of choice is Fortran
> > (though we do dabble in C, C++, Python and Java too).
>
> > Thanks
>
> > Mark Westwood
>
> Well, Fortran is my first choice for developing software, regardless of the
> application area. I have done a lot of commercial application development in
> Fortran, C, and C++ and I find Fortran to be far more productive and far less
> error-prone than the other two.
>
> FWIW, here are my educational credentials.
> Bachelor of Science, Applied Mathematics, Engineering, and Physics
> Master of Business Administration
> Master of Project Management
>
> Good luck with your data conversion.
>
> --
> Craig Dedo
> 17130 W. Burleigh Place
> P. O. Box 423
> Brookfield, WI 53008-0423
> Voice: (262) 783-5869
> Fax: (262) 783-5928
> Mobile: (414) 412-5869
> E-mail: <cd...@wi.rr.com> or <cr...@ctdedo.com>
markc.westwood (96), 2/13/2008 5:17:39 PM

On 2008-02-13, Clive Page <junk@main.machine> wrote:
> In message <2008021310152316807-gsande@worldnetattnet>, Gordon Sande
><g.sande@worldnet.att.net> writes
>>The star gazers have an exchange format called FITS (Flexible Image
>>Transport System) that is the first hit on Google for "FITS data".
>>
>>They must have some tales to tell of lessons, both good and bad, that
>>they have learned as well as other sources.
>>
>>There are even a few FITS users who post here so you should get
>>more definitive information than just my pointer.
>
> Indeed. I'm a member of a FITS committee and have been on a working
> party recently revising the standard. FITS is indeed widely used in
> astronomy, but it is good for two main reasons:
> (a) already widely used in astronomy
> (b) described to the bit level in the primary literature, so likely to
> be a readable format in centuries to come.
>
> But it was designed in the early 80s and has many awkward features and
> restrictions which are only perpetuated for compatibility, so I simply
> cannot recommend it for use ab initio in another field.
>
In the particle physics community there's an open-source project called,
unfortunately, ROOT. Its home is at http://root.cern.ch . It's an
object-oriented framework for processing scientific data. One of its
features is a file format which handles its data objects. Take a look
at http://root.cern.ch/root/InputOutput.html .
I'm not a big-time ROOT user, but I work with people who are. As a
stand-alone data cruncher it's rather annoying -- the scripting language
is C++ ! The API allows you to pull out the data and file management
functions, though.
pln1 (10), 2/13/2008 6:26:58 PM

On Feb 13, 11:05 am, John Appleyard <spamt...@polyhedron.com> wrote:
> Mark Westwood wrote:
>
<snip>
> I'm reminded of the mess that is the Windows registry. That was some
> software engineer's idea to clear up the perceived "mess" of INI files.
> The result is an ever growing pit, that progressively reduces the
> speed of your machine, and which, if corrupted, can wipe out your entire
> system. Fortunately Linux has not yet taken that route.
>
I was mildly against GINO adding all those registry manipulation
routines. It adds little truly needed functionality and serves to
increase the non-portability of the Windows API versus the Linux/Unix/
OPENVMS API.
> --
> John Appleyard - (send email to john!news@.. rather than spamtrap@..)
> Polyhedron Software
> Programs for Programmers - QA, Compilers, Graphics, Consultancy
> ********* Visit our Web site on http://www.polyhedron.co.uk/ *********
garylscott (1357), 2/13/2008 7:41:22 PM

Daniel Snouck wrote:
> Mark Westwood wrote:
>
>> I'm looking for some input on file designs for scientific data - if
>> you're not interested in this sort of thing you should ignore this
>> posting completely. If you are, please read on:
>>
>> We're soon to embark on the redesign of one of our systems. One of
>> the main areas of functionality of the system is data management. The
>> basic data is geophysical signal records from our instruments. The
>> records are processed, transformed, manipulated etc, and a variety of
>> modelling and inversion operations carried out. The data management
>> issues will be familiar: issues of auditability, traceability,
>> provenance etc: what operations were carried out to transform signal A
>> into signal A', who carried them out, what versions of what programs
>> were used, etc.
> [snip]
>
> Some of my colleagues use the CDF format to collect and process their
> fluid dynamics data. Originally this format was used in completely
> different fields but it is designed to be useful as a general purpose
> data container. It might be useful. Have a look here:
> http://cdf.gsfc.nasa.gov/
A related design is NetCDF, which has a set of F90 bindings.
http://www.unidata.ucar.edu/software/netcdf/
enigma (394), 2/13/2008 8:23:53 PM

"Clive Page" <junk@main.machine> wrote in message
news:vMkfInE9RysHJwaY@page.demo.co.uk...
> In message <47b304f3$0$16656$4c368faf@roadrunner.com>, Craig Dedo
> <cdedo@wi.rr.com> writes
>> You should seriously consider a fourth approach: A Relational DataBase
>> Management System (RDBMS). From the extensive description of your data
>> organization, it appears that this is the kind of problem that is well-suited
>> to an RDBMS.
>
> Yes it might well be. I'd recommend using Postgres rather than MySQL.
> Postgres is also open source, but a rather more solid product and much more
> standards-compliant. Both can be used on all modern platforms including
> Windows and Linux. See this page
> http://troels.arvin.dk/db/rdbms/
> for a comparison of features.
Thank you very much for the recommendation. Before you mentioned it, I had
never heard of it.
> All the same, getting scientific data into and out of an RDBMS isn't easy -
> typically you have to convert your data into text (CSV) to load it, and have
> to generate SQL statements to extract it. This can all be done in Fortran
> (and I've done it on data held originally in FITS files) but you may wonder
> after a bit whether you are not exchanging one can of worms for another. I've
> also used Fortran2003 Stream I/O (implemented by g95 and gfortran) to read and
> write Postgres binary format, which avoids handling CSV files, but that
> doesn't avoid the need to be proficient in SQL.
I disagree. Different RDBMSes have different methods of direct data entry.
Obviously, just about every RDBMS can accept CSV files, but that is not the only
way to do a mass load or mass update. If the RDBMS supports Fortran with
embedded SQL, then you could write a Fortran program that reads from unformatted
input files and writes to database tables that have binary representation, e.g.,
double precision floating point.
Also, the ease or difficulty of data load operations depends somewhat on how
the original data is formatted. How is it derived, e.g., from instruments
writing directly to a storage medium? If from instruments, is it binary or
human-readable text? Either way, how is it formatted? E.g., if binary, is it
single precision, double precision, or some other precision? Is it big endian
or little endian?
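The precision and byte-order questions above decide how a binary record has to be decoded before it can be loaded anywhere. A minimal sketch in Python using only the standard `struct` module follows. The record layout (a flat run of IEEE floats with nothing else in the file) and the function name `decode_samples` are assumptions made for illustration, not a description of any real instrument format. Fortran sequential unformatted files typically also carry compiler-specific record markers, which this sketch does not handle.

```python
import struct

def decode_samples(raw, precision="double", byteorder="little"):
    """Decode a flat run of IEEE floats (hypothetical layout, no record markers)."""
    code = {"single": "f", "double": "d"}[precision]
    prefix = {"little": "<", "big": ">"}[byteorder]
    size = struct.calcsize(prefix + code)  # 4 bytes for single, 8 for double
    if len(raw) % size:
        raise ValueError("byte count is not a whole number of samples")
    count = len(raw) // size
    return list(struct.unpack(f"{prefix}{count}{code}", raw))

# Three doubles written big-endian; the reader must be told the same
# precision and byte order to get the original values back.
raw = struct.pack(">3d", 1.0, 2.5, -4.0)
print(decode_samples(raw, "double", "big"))
```

Decoding the same bytes with the wrong byte order raises no error; it silently yields different numbers, which is why these questions need answering before any mass load.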
Also, SQL has a fairly easy learning curve. I learned it quickly on a job
11 years ago. There are quite a lot of self-teaching books available. I don't
think that the need to know SQL is that big of a difficulty.
>>> 2. OK, having chosen my favourite approach to file design, how do I
>>> design a file ?
>> <snip>
>>
>> This is an issue that is the subject of an entire branch of software
>> engineering. It is called DataBase Analysis. There are highly-trained
>> professionals called DataBase Analysts who evaluate such needs and develop
>> solutions on a full-time basis. You really need to retain a DataBase Analyst
>> as a consultant or else hire one to do this job. From the description of the
>> problem you provided us, this looks like a job for a fully-trained
>> professional.
>
> I rather think that a DataBase Analyst will tell you that a DBMS is the
> solution, whatever the problem is (if all you have is a hammer, etc.). Yes
> once you have decided to use an RDBMS then there are well-established methods,
> entity relationship modelling and so forth, but I've found these methods of
> limited use in designing scientific data formats. Maybe there really are
> professionals out there with the right skills to design scientific data files,
> but I have not been fortunate enough to meet them. Some pointers would be
> helpful...
I disagree that a DataBase Analyst will always recommend an RDBMS, if such a
person is a competent professional. Such people are expected to know the
trade-offs involved in various file and database structures. If the
architecture is really critical, then it would be wise to run some experiments
using fake data and numbers of records that simulate actual working conditions.
>> Be sure to pro-actively consider the option of doing nothing,
>
> Always a good idea :-)
> --
> Clive Page
--
Craig Dedo
17130 W. Burleigh Place
P. O. Box 423
Brookfield, WI 53008-0423
Voice: (262) 783-5869
Fax: (262) 783-5928
Mobile: (414) 412-5869
E-mail: <cdedo@wi.rr.com> or <craig@ctdedo.com>
cdedo (127), 2/13/2008 10:00:13 PM

On Wed, 13 Feb 2008 18:26:58 +0000 (UTC), Patrick Nolan
<pln@glast2.Stanford.EDU>
wrote in <slrnfr6dji.63r.pln@glast2.Stanford.EDU>:
> On 2008-02-13, Clive Page <junk@main.machine> wrote:
> In the particle physics community there's an open-source project called,
> unfortunately, ROOT. Its home is at http://root.cern.ch . It's an
> object-oriented framework for processing scientific data. One of its
> features is a file format which handles its data objects. Take a look
> at http://root.cern.ch/root/InputOutput.html .
I was thinking of ROOT too, more specifically the framework "we"
have set up to handle CMS data files when LHC starts operation (and before,
with simulated data...) called CMSSW. It basically takes data through
several steps of analysis, adding persistent data to the data object as
more analyses are performed, with pointers to provenance, etc. The basic
file holding the data object is a ROOT file. Unfortunately, CMSSW is
written in C++. ...and it's *huge* with all the different analyses and
sub-detectors that need to be covered.
> I'm not a big-time ROOT user, but I work with people who are. As a
> stand-alone data cruncher it's rather annoying -- the scripting language
> is C++ ! The API allows you to pull out the data and file management
> functions, though.
I've been doing a lot of ROOT work lately[1], much of it in python.
There's a python interface (and Ruby too, IIRC) so you don't _have_ to
use C++. You can also, with a bit of care, use C++ macros as compiled
code with g++ and link to other g++ driver code, but in fact that's only a
few percent faster than loading the macro into the ROOT C++ interpreter
as a compiled shared library.
[1] Developing and testing a new method of comparing two-dimensional
histograms to decide if they come from the same distribution or not.
--
Ivan Reid, School of Engineering & Design, _____________ CMS Collaboration,
Brunel University. Ivan.Reid@[brunel.ac.uk|cern.ch] Room 40-1-B12, CERN
KotPT -- "for stupidity above and beyond the call of duty".
Ivan.Reid (496), 2/13/2008 10:32:18 PM

Mark Westwood wrote:
> Craig
>
> Thanks for your extended attention.
>
> Funnily enough I spent about 10 years persuading companies of the
> benefits of implementing RDBMSs back when they were the next best
> thing in data management and I agree with most of what you've
> written. But I've excluded them from consideration because:
>
> -- I don't think that there's a snowball's chance in hell of
> extracting that much money from the paymasters;
> -- and if I did, then I'd need a DBMS expert to look after it all for
> me.
>
> Perhaps, one day, if we implement a successful data management scheme,
> we'll be able to take the next step and implement a proper DBMS-based
> solution.
>
> As to hiring a professional in this area, well I have no objection in
> principle, but I'm not aware of anyone who might justifiably be called
> an expert. I know what I don't want to hire -- an (R)DBMS expert from
> the commercial world learning about scientific data while I'm paying
> the bills.
>
> As I said, thanks for the input,
>
> Mark
In other posts people suggested Postgres and MySQL...
You might want to check into SQLite ( http://sqlite.org ).
SQLite is public domain and is "a software library that implements a
self-contained, serverless, zero-configuration, transactional SQL
database engine".
I haven't tried either of the following, but apparently interfacing
the database with Fortran is possible:
- An example that shows how to call SQLite functions from Fortran.
http://danial.org/sqlite/fortran/
- A fairly complete set of subroutines and functions to use SQLite
from a Fortran 90/95 program can be found on
http://flibs.sourceforge.net. A small part of the library has
been written in C, mostly to make the interaction with the SQLite
library itself easier.
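Because SQLite needs no server, a provenance table of the kind the original question asks about (which operation turned signal A into A', who ran it, with which program version) can be tried out in a few lines. Here is a sketch in Python with the standard `sqlite3` module. The table name, column names and helper functions are made up for illustration, and the Fortran routes listed above would go through their own interfaces rather than this one.

```python
import sqlite3

def open_db(path=":memory:"):
    """Create the (hypothetical) provenance table if it does not exist yet."""
    con = sqlite3.connect(path)
    # parent_id is NULL for raw survey data and points at the input signal otherwise.
    con.execute("""CREATE TABLE IF NOT EXISTS signal (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        parent_id INTEGER REFERENCES signal(id),
        operation TEXT,
        operator TEXT,
        program_version TEXT)""")
    return con

def add_signal(con, name, parent_id=None, operation=None,
               operator=None, program_version=None):
    cur = con.execute(
        "INSERT INTO signal (name, parent_id, operation, operator, program_version) "
        "VALUES (?, ?, ?, ?, ?)",
        (name, parent_id, operation, operator, program_version))
    return cur.lastrowid

def lineage(con, signal_id):
    """Walk from a signal back to its raw ancestor, most recent step first."""
    rows = con.execute("""
        WITH RECURSIVE chain(id, name, parent_id, operation, depth) AS (
            SELECT id, name, parent_id, operation, 0 FROM signal WHERE id = ?
            UNION ALL
            SELECT s.id, s.name, s.parent_id, s.operation, c.depth + 1
            FROM signal s JOIN chain c ON s.id = c.parent_id)
        SELECT name, operation FROM chain ORDER BY depth""", (signal_id,))
    return rows.fetchall()

con = open_db()
raw_id = add_signal(con, "A")
filtered_id = add_signal(con, "A'", raw_id, "lowpass filter", "mw", "1.2")
print(lineage(con, filtered_id))
```

One row per signal with a pointer to its parent is enough to reconstruct the whole processing chain, so the header lines copied from file to file in the current scheme would not need duplicating.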
--Mark
mstucky5 (52), 2/13/2008 10:39:56 PM

Craig Dedo wrote:
> "Mark Westwood" <markc.westwood@gmail.com> wrote in message
> news:6f1128ee-12ca-4544-a84e-8fda9a55dd7c@d70g2000hsb.googlegroups.com...
>
>> Hi
>>
>> I'm looking for some input on file designs for scientific data - if
>> you're not interested in this sort of thing you should ignore this
>> posting completely. If you are, please read on:
>>
>> We're soon to embark on the redesign of one of our systems. One of
>> the main areas of functionality of the system is data management. The
>> basic data is geophysical signal records from our instruments. The
>> records are processed, transformed, manipulated etc, and a variety of
>> modelling and inversion operations carried out. The data management
>> issues will be familiar: issues of auditability, traceability,
>> provenance etc: what operations were carried out to transform signal A
>> into signal A', who carried them out, what versions of what programs
>> were used, etc.
>>
>> For each project we carry out we'll typically collect data from a
>> handful of lines of instrumentation (survey data), each of which
>> comprises 10s (that's 10-off, not 10 seconds) -- 100s of sensors, each
>> of which records 10s -- 100s of signals. Back home the data is
>> processed, so each input signal might give rise to multiple output
>> signals (different filters, different downsampling, etc), and so on.
>> In short, the data fits quite naturally into a hierarchical structure.
>>
>> Right now we manage our data in what I would describe as a 'file
>> forest'
>
> <snip description of file organization>
>
>> Mostly the directories and files are
>> created and managed by the existing system, but users can open up a
>> file and tweak it if they wish to.
>>
>> So, wearing my software engineering hat, I'm thinking -- this is a
>> mess, the new system should rationalise this data management and
>> introduce some structured file formats. I'm thinking XML, HDF5, maybe
>> Infosets. Now, there's a wealth of information on XML, and pretty
>> good documentation on the others, most of it directed to telling you
>> how to use the formats and their associated tools. But everything I
>> find leaves me with 3 questions, and this is what I'm hoping some of
>> you might be able to give me pointers on:
>>
>> 1. What are the relative merits of XML, HDF5 and Infosets ?
>
> <snip>
>
> You should seriously consider a fourth approach: A Relational
> DataBase Management System (RDBMS). From the extensive description of
> your data organization, it appears that this is the kind of problem that
> is well-suited to an RDBMS.
>
> There are several very good RDBMS packages out there. Some are
> supported on a variety of operating systems while others are specific to
> a particular operating system. E.g., Microsoft Access and Corel Paradox
> work only under Windows.
>
> You don't mention the operating system(s) that you are using. This
> is important because the OS determines what products are available.
> And, if you are using one or more of the various Unixes, you need to
> specify exactly which Unix(es) you are using.
>
> Different RDBMSes have different trade-offs with respect to cost,
> data types supported, execution efficiency, scalability, programming
> languages supported, and other considerations. You will need to
> evaluate each product with respect to your needs.
>
> All modern RDBMS packages use Structured Query Language (SQL) as an
> interface language to the RDBMS. You can write SQL statements in a
> variety of modes: interactive (from the keyboard), embedded (in program
> source code), dynamic (programs construct SQL statements at run time),
> and SQL modules (work like subroutines).
>
> Some major RDBMS products on mid-range and mainframe systems include:
> Oracle, Sybase, and Informix. On IBM mainframes, there is DB2. On
> OpenVMS, there is Oracle/Rdb, which is the current version of the
> original Rdb. Oracle also offers its regular product on OpenVMS, but
> the two products are quite different. Oracle/Rdb is very well tailored
> to the OpenVMS operating system and, IMHO, easier to use than the
> regular Oracle product.
>
> There is at least one RDBMS that is free: MySQL. However, I don't
> know how well suited it is to your needs.
(snip)
There is another industrial-grade RDBMS that is free: Ingres (it is
even in the open-source world the last couple of years). It can run on
Windows, Linux, Unix platforms (possibly others, I haven't looked).
Scalable, handles embedded SQL just fine.
There is one drawback in regards to use with Fortran, and that is the
preprocessor for parsing out embedded SQL stuff. The last time I
checked (admittedly this was a couple of years ago), the preprocessor
only supported fixed-format 72-column code. That may have been updated,
I haven't been able to work on getting our old fixed-format legacy code
updated...
http://www.ingres.com
We're using Ingres to run our water database on a bunch of Sun servers
under Solaris. Mixture of Fortran "77 with extensions", C, and
auxiliary stuff like perl, tkg2, shell scripts, and so on. Tons of
embedded SQL in both C & Fortran.
Jim
JCornwall (184), 2/14/2008 3:09:31 AM

J. F. Cornwall wrote:
> Craig Dedo wrote:
>
>> "Mark Westwood" <markc.westwood@gmail.com> wrote in message
>> news:6f1128ee-12ca-4544-a84e-8fda9a55dd7c@d70g2000hsb.googlegroups.com...
>>
>>> Hi
>>>
>>> I'm looking for some input on file designs for scientific data - if
>>> you're not interested in this sort of thing you should ignore this
>>> posting completely. If you are, please read on:
>>>
>>> We're soon to embark on the redesign of one of our systems. One of
>>> the main areas of functionality of the system is data management. The
>>> basic data is geophysical signal records from our instruments. The
>>> records are processed, transformed, manipulated etc, and a variety of
>>> modelling and inversion operations carried out. The data management
>>> issues will be familiar: issues of auditability, traceability,
>>> provenance etc: what operations were carried out to transform signal A
>>> into signal A', who carried them out, what versions of what programs
>>> were used, etc.
>>>
>>> For each project we carry out we'll typically collect data from a
>>> handful of lines of instrumentation (survey data), each of which
>>> comprises 10s (that's 10-off, not 10 seconds) -- 100s of sensors, each
>>> of which records 10s -- 100s of signals. Back home the data is
>>> processed, so each input signal might give rise to multiple output
>>> signals (different filters, different downsampling, etc), and so on.
>>> In short, the data fits quite naturally into a hierarchical structure.
>>>
>>> Right now we manage our data in what I would describe as a 'file
>>> forest'
>>
>>
>> <snip description of file organization>
>>
>>> Mostly the directories and files are
>>> created and managed by the existing system, but users can open up a
>>> file and tweak it if they wish to.
>>>
>>> So, wearing my software engineering hat, I'm thinking -- this is a
>>> mess, the new system should rationalise this data management and
>>> introduce some structured file formats. I'm thinking XML, HDF5, maybe
>>> Infosets. Now, there's a wealth of information on XML, and pretty
>>> good documentation on the others, most of it directed to telling you
>>> how to use the formats and their associated tools. But everything I
>>> find leaves me with 3 questions, and this is what I'm hoping some of
>>> you might be able to give me pointers on:
>>>
>>> 1. What are the relative merits of XML, HDF5 and Infosets ?
>>
>>
>> <snip>
>>
>> You should seriously consider a fourth approach: A Relational
>> DataBase Management System (RDBMS). From the extensive description of
>> your data organization, it appears that this is the kind of problem
>> that is well-suited to an RDBMS.
>>
>> There are several very good RDBMS packages out there. Some are
>> supported on a variety of operating systems while others are specific
>> to a particular operating system. E.g., Microsoft Access and Corel
>> Paradox work only under Windows.
>>
>> You don't mention the operating system(s) that you are using. This
>> is important because the OS determines what products are available.
>> And, if you are using one or more of the various Unixes, you need to
>> specify exactly which Unix(es) you are using.
>>
>> Different RDBMSes have different trade-offs with respect to cost,
>> data types supported, execution efficiency, scalability, programming
>> languages supported, and other considerations. You will need to
>> evaluate each product with respect to your needs.
>>
>> All modern RDBMS packages use Structured Query Language (SQL) as an
>> interface language to the RDBMS. You can write SQL statements in a
>> variety of modes: interactive (from the keyboard), embedded (in
>> program source code), dynamic (programs construct SQL statements at
>> run time), and SQL modules (work like subroutines).
>>
>> Some major RDBMS products on mid-range and mainframe systems
>> include: Oracle, Sybase, and Informix. On IBM mainframes, there is
>> DB2. On OpenVMS, there is Oracle/Rdb, which is the current version of
>> the original Rdb. Oracle also offers its regular product on OpenVMS,
>> but the two products are quite different. Oracle/Rdb is very well
>> tailored to the OpenVMS operating system and, IMHO, easier to use than
>> the regular Oracle product.
>>
>> There is at least one RDBMS that is free: MySQL. However, I don't
>> know how well suited it is to your needs.
>
>
> (snip)
>
> There is another industrial-grade RDBMS that is free: Ingres (it is
> even in the open-source world the last couple of years). It can run on
> Windows, Linux, Unix platforms (possibly others, I haven't looked).
> Scalable, handles embedded SQL just fine.
>
> There is one drawback in regards to use with Fortran, and that is the
> preprocessor for parsing out embedded SQL stuff. The last time I
> checked (admittedly this was a couple of years ago), the preprocessor
> only supported fixed-format 72-column code. That may have been updated,
> I haven't been able to work on getting our old fixed-format legacy code
> updated...
>
Supporting only 72-column fixed format code would have been quite
short-sighted in the first place. Many compilers supported free form source
as early as 1980.
> http://www.ingres.com
>
> We're using Ingres to run our water database on a bunch of Sun servers
> under Solaris. Mixture of Fortran "77 with extensions", C, and
> auxiliary stuff like perl, tkg2, shell scripts, and so on. Tons of
> embedded SQL in both C & Fortran.
>
> Jim
--
Gary Scott
mailto:garylscott@sbcglobal dot net
Fortran Library: http://www.fortranlib.com
Support the Original G95 Project: http://www.g95.org
-OR-
Support the GNU GFortran Project: http://gcc.gnu.org/fortran/index.html
If you want to do the impossible, don't hire an expert because he knows
it can't be done.
-- Henry Ford
garylscott (1357), 2/14/2008 3:11:36 AM

"Gary Scott" <garylscott@sbcglobal.net> wrote in message
news:IjOsj.10173$Ch6.5547@newssvr11.news.prodigy.net...
> J. F. Cornwall wrote:
>>
>> There is one drawback in regards to use with Fortran, and that is the
>> preprocessor for parsing out embedded SQL stuff. The last time I checked
>> (admittedly this was a couple of years ago), the preprocessor only supported
>> fixed-format 72-column code. That may have been updated, I haven't been able
>> to work on getting our old fixed-format legacy code updated...
>>
> Supporting only 72-column fixed format code would have been quite
> short-sighted in the first place. Many compilers supported free form source as
> early as 1980.
> --
> Gary Scott
> mailto:garylscott@sbcglobal dot net
Unfortunately, almost all of the Fortran embedded SQL bindings support only
the old fixed source form. This is because one part of the SQL standard defines
the Fortran binding and it was written to be consistent with the FORTRAN 77
standard; it was developed long before Fortran 90 became the Fortran standard.
Since the original Fortran binding was adopted, there never has been much
interest in updating it to Fortran 95 and Fortran 2003, most likely due to lack
of someone to do the work. I would like to do the update of the Fortran
binding, but I have some other slightly more important responsibilities, such as
earning a living. I am not yet independently wealthy and I don't have an
academic research position, so I do not have the necessary leisure time right
now for such a project.
--
Craig Dedo
17130 W. Burleigh Place
P. O. Box 423
Brookfield, WI 53008-0423
Voice: (262) 783-5869
Fax: (262) 783-5928
Mobile: (414) 412-5869
E-mail: <cdedo@wi.rr.com> or <craig@ctdedo.com>
cdedo (127), 2/14/2008 4:47:42 AM

> All the same, getting scientific data into and out of an RDBMS isn't
> easy - typically you have to convert your data into text (CSV) to load
> it, and have to generate SQL statements to extract it.
BLOBs (binary large objects) might be a better fit in this case - unless
you really want to model the relationships of the data down to the
sensor or group of sensor level. But if you are using BLOBs, then the
question is whether you really need a _relational_ database to manage
their relationships - because managing the _data_ at that level is
trivial and your current solution as good as any.
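A minimal sketch of the BLOB approach in Python, using the standard `sqlite3` and `array` modules: each signal is stored as one opaque block of doubles next to a small amount of searchable metadata. The schema and function names are hypothetical, and SQLite simply stands in for whichever database might be chosen.

```python
import sqlite3
from array import array

def store_signal(con, sensor, samples):
    """Pack the samples as doubles and store them as a single BLOB."""
    # tobytes() uses the machine's native byte order, so the blob is only
    # portable between machines that share it.
    blob = array("d", samples).tobytes()
    cur = con.execute(
        "INSERT INTO signal_blob (sensor, n_samples, data) VALUES (?, ?, ?)",
        (sensor, len(samples), blob))
    return cur.lastrowid

def load_signal(con, signal_id):
    """Fetch one BLOB and unpack it back into a list of floats."""
    row = con.execute(
        "SELECT data FROM signal_blob WHERE id = ?", (signal_id,)).fetchone()
    out = array("d")
    out.frombytes(row[0])
    return out.tolist()

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE signal_blob ("
            "id INTEGER PRIMARY KEY, sensor TEXT, n_samples INTEGER, data BLOB)")
sid = store_signal(con, "line1/sensor07", [0.0, 1.5, -2.25])
print(load_signal(con, sid))
```

The database can answer questions about `sensor` or `n_samples`, but not about individual sample values, which is the trade-off raised above: with BLOBs the relational engine only manages the bookkeeping around the data.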
Jan
Jan.Vorbrueggen (131), 2/14/2008 9:32:36 AM

> The HDF5 format is a lot more flexible and powerful, and on my limited
> experience would suggest you look at it seriously.
Isn't that what STScI - the guys managing the Hubble space telescope -
are using? I do know that they have at least similar requirements, i.e.,
keeping track of a pipeline of data processing such that at any time in
the data's lifetime you know how and with which set of parameters it has
been processed.
> The main drawback,
> potentially, is that it is not defined as a file format but as a
> software library.
However, given the source code it should be easy to spec the file
format, and I think it's likely there is such a spec available.
Jan
Jan.Vorbrueggen (131), 2/14/2008 9:35:44 AM

In message <47b32356$0$510$bed64819@news.gradwell.net>, John Appleyard
<spamtrap@polyhedron.com> writes
>Personally, I don't find the "file forest" idea to be inelegant -
>especially if, as you say, "the data fits quite naturally into a
>hierarchical structure". It's simple, but if it fits the requirement,
>I'd say that made it elegant. I'd need a good reason before replacing
>one hierarchical file system with another that was less transparent.
I agree with John - for many purposes a file forest has a lot to be said
for it on grounds of simplicity, ease of use by lots of processes in
parallel, and not many single points of failure. In my professional
work I've been using file forests quite a lot for terabytes of data.
One slight warning on most Unix-like systems: try not to have more than
a few thousand files in each directory, or the directory searching gets
slow, and even simple things like wildcard expansion in say "ls *" don't
work. Maybe the same is true for Windows, I've never tried on data
collections of any size.
But file forests can tend towards anarchy. One option is to use a file
forest but with an RDBMS in addition, but used just to store every
filename and perhaps some limited metadata about each file. That may
make searching for a given file by reference to its contents, e.g.
date/time of data acquisition, much easier.
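The hybrid idea above (keep the file forest, add a small database holding every filename plus a little metadata) can be sketched in Python with the standard library alone. The header convention assumed here, `key = value` lines near the top of each file with an `acquired` entry, is invented for the example, as are the table and function names. Real headers would need their own parser.

```python
import os
import sqlite3
from itertools import islice

def read_header(path, max_lines=80):
    """Collect 'key = value' pairs from the first lines of a text file (assumed layout)."""
    meta = {}
    with open(path, "r", errors="replace") as fh:
        for line in islice(fh, max_lines):
            if "=" in line:
                key, value = line.split("=", 1)
                meta[key.strip().lower()] = value.strip()
    return meta

def index_forest(root, con):
    """Walk the directory tree and record each file with its acquisition time."""
    con.execute("CREATE TABLE IF NOT EXISTS file_index "
                "(path TEXT PRIMARY KEY, acquired TEXT)")
    count = 0
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            acquired = read_header(path).get("acquired")  # None if the header lacks it
            con.execute("INSERT OR REPLACE INTO file_index VALUES (?, ?)",
                        (path, acquired))
            count += 1
    con.commit()
    return count

def files_acquired_between(con, start, end):
    """Return paths whose timestamp falls in [start, end]; ISO 8601 text sorts correctly."""
    rows = con.execute(
        "SELECT path FROM file_index WHERE acquired BETWEEN ? AND ? ORDER BY path",
        (start, end))
    return [r[0] for r in rows]
```

Calling `index_forest("/path/to/project", sqlite3.connect("index.db"))` once, and again after each processing run, would keep the index current. The files themselves stay where they are, so the index can always be rebuilt from the forest if it is lost or goes stale.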
--
Clive Page
junk2470 (76), 2/14/2008 2:52:08 PM

Craig Dedo wrote:
> "Gary Scott" <garylscott@sbcglobal.net> wrote in message
> news:IjOsj.10173$Ch6.5547@newssvr11.news.prodigy.net...
>
>> J. F. Cornwall wrote:
>>
>>>
>>> There is one drawback in regards to use with Fortran, and that is the
>>> preprocessor for parsing out embedded SQL stuff. The last time I
>>> checked (admittedly this was a couple of years ago), the preprocessor
>>> only supported fixed-format 72-column code. That may have been
>>> updated, I haven't been able to work on getting our old fixed-format
>>> legacy code updated...
>>>
>> Supporting only 72-column fixed format code would have been quite
>> short-sighted in the first place. Many compilers supported free form
>> source as early as 1980.
>> --
True, but this is a preprocessor written by the Ingres folks, not a
compiler. They obviously did not write it to support those extensions.
>> Gary Scott
>> mailto:garylscott@sbcglobal dot net
>
>
> Unfortunately, almost all of the Fortran embedded SQL bindings
> support only the old fixed source form. This is because one part of the
> SQL standard defines the Fortran binding and it was written to be
> consistent with the FORTRAN 77 standard; it was developed long before
> Fortran 90 became the Fortran standard.
>
> Since the original Fortran binding was adopted, there never has been
> much interest in updating it to Fortran 95 and Fortran 2003, most likely
> due to lack of someone to do the work. I would like to do the update of
> the Fortran binding, but I have some other slightly more important
> responsibilities, such as earning a living. I am not yet independently
> wealthy and I don't have an academic research position, so I do not have
> the necessary leisure time right now for such a project.
>
I'd like to do an update of the Ingres code myself, but I have way too
much in the way of actual job taskings to even think about doing it.
And I see enough computer code during the day that I am not going to
tackle it in my own time...
Jim
JCornwall (184), 2/14/2008 11:23:40 PM

J. F. Cornwall <JCornwall@cox.net> wrote:
> Craig Dedo wrote:
> >> Many compilers supported free form
> >> source as early as 1980.
> >> --
>
> True, but this is a preprocessor written by the Ingres folks, not a
> compiler. They obviously did not write it to support those extensions.
Yes, and to me that is symptomatic of one of the inherent shortcomings
of preprocessor schemes like embedded SQL. You end up having to write
code that both the pre-processor and the compiler accept. It has been a
long time since I fiddled with embedded SQL (or maybe it was one of the
other embedded database languages - I forget), but I recall that as a
problem.
Interfaces using library routines might not be as handy in some ways,
but they minimize the interactions between the compiler and the database
interface. I tend to regard minimization of interactions like that as a
good thing - a lot like minimizing interactions between different
program units.
--
Richard Maine | Good judgement comes from experience;
email: last name at domain . net | experience comes from bad judgement.
domain: summertriangle | -- Mark Twain
nospam47 (9742)
2/15/2008 12:34:04 AM
Mark Westwood wrote:
>
> As to hiring a professional in this area, well I have no objection in
> principle, but I'm not aware of anyone who might justifiably be called
> an expert. I know what I don't want to hire -- an (R)DBMS expert from
> the commercial world learning about scientific data while I'm paying
> the bills.
>
You might be surprised...
My OH (an ex particle physicist now working with medical databases) has
commented more than once that there are a remarkable number of people
with physics doctorates and beyond working in the world of commercial
databases these days. Especially particle physicists.
--
Catherine Rees Lay
Polyhedron Software Ltd. Registered Office: Linden House,
93 High St, Standlake, Witney, OX29 7RH, United Kingdom.
Registered in England No.2541693. Vat Reg No. GB 537 3214 57
catherine.news (58)
2/15/2008 10:03:34 AM
"Catherine Rees Lay" <catherine.news@polyhedron.com> wrote in message
news:66mdneq0Wqrr_ijanZ2dnUVZ8qydnZ2d@eclipse.net.uk...
> Mark Westwood wrote:
>>
>> As to hiring a professional in this area, well I have no objection in
>> principle, but I'm not aware of anyone who might justifiably be called
>> an expert. I know what I don't want to hire -- an (R)DBMS expert from
>> the commercial world learning about scientific data while I'm paying
>> the bills.
>>
> You might be surprised...
>
> My OH (an ex particle physicist now working with medical databases) has
> commented more than once that there are a remarkable number of people with
> physics doctorates and beyond working in the world of commercial databases
> these days. Especially particle physicists.
That's because our brains are wired (or weird?) differently, and because of
Quantum, and the search for the Heisenbug :-)
(but I don't have a doctorate, just a mere graduate)
Les
>
> --
> Catherine Rees Lay
>
> Polyhedron Software Ltd. Registered Office: Linden House,
> 93 High St, Standlake, Witney, OX29 7RH, United Kingdom.
> Registered in England No.2541693. Vat Reg No. GB 537 3214 57
l.neilson825 (106)
2/15/2008 11:00:58 AM