Michael Black wrote:-
> Based on context from the original poster's previous posts, I think he
> wants to selectively download pages, or parts of pages. So it would seem
> he wants a "smart" browser, that will look at the context of the page and
> retrieve only what he wants. Sed won't work, since it requires the file
> to be transferred to his computer before he can start filtering it, and it
> would seem (based on previous context) that he is trying to avoid
> downloading the bloat in the first place.
OP often wonders why he can't write 'to be understood'.
But if one person CAN understand, it's OK.
Why should "to avoid downloading the bloat in the first place"
be based on previous context, rather than being the optimum
bandwidth saver, independent of context?
> I think the poster is heading down the wrong path, complicating things for
> the wrong reasons. I find that if I keep frequent pages like google and
> wikipedia local to my computer, it avoids some of the most common wasted
> transfers, the pages held locally act just like they were remote in terms
> of the browser, so it bumps out an intermediate step for common searches.
Sure, you can always argue "your time & convenience is more valuable
than physical resources", which leads to the N.American V8 mentality;
and that may be right -- for some.
Sylvain Robitaille wrote:-
> Examine the manual pages for wget and curl, with special attention to
> the "continue fetch" options to which you're referring. I think you'll
> find in both cases that all it does is skip forward some number of
> bytes (which with curl you can specify as an argument to that option),
> before starting to save the new download. They don't have any form
> of intelligence to help them determine whether any of the *content*
> already exists in the file being appended to. They're simply skipping
> forward some number of bytes before appending to the target file.
> If you can figure out a way to calculate how many bytes to skip from
> the start, (and can live with any duplicated material at the end of each
> download), I suspect that "curl" will be the better choice.
So finally the 4th respondent knows what I'm talking about.
Since I posted the OP, I've discovered that wget is just a
background (non-interactive) browser.
There are 2 classes of resources to be saved:
1. physical: bandwidth, file-space;
2. human: the effort of wading through clutter/noise.
Often you want to fetch a series of articles which all share the
same header (about 60% of each page), contain perhaps 10% of real
<article contents>, and end with a mostly redundant trailer making
up the remaining 30%.
Currently, I fetch and append all articles to a file [the book]
via `lynx -dump URL`, with the URL & a separator line for each
part/web-page of the book.
Then I delete the repeated parts while reading it.
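A minimal sketch of that append step, for one page (the file name,
URL and separator string here are placeholders, not my real ones):-

  # append one web-page to the 'book', preceded by its URL and
  # a separator line
  echo "=============================================" >> book.txt
  echo "URL: http://example.org/part1.html"            >> book.txt
  lynx -dump -width=76 http://example.org/part1.html   >> book.txt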
Sure, `curl` can <fetch the rest of the URL from byte N>,
provided the server supports that facility.
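For the record, curl has options for exactly that; a rough example
(the byte offset 12345 and the URL are just placeholders):-

  # resume the fetch, skipping the first 12345 bytes
  curl --continue-at 12345 -o rest.html http://example.org/page.html
  # or ask the server for an explicit byte range
  curl --range 12345- -o rest.html http://example.org/page.html

Either way, bandwidth is only saved if the server honours range
requests.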
But I'm just checking to see if others have optimising methods which
I don't yet use.
I've just realised that it won't work.
Oh, I'm wrong again, it can work:
when I look at the lynx-rendered text in my editor, I can see the
<byte count>. But that's not the byte-count of the <source-cut>.
OTOH, the mapping from a convenient 'transition' in the lynx-rendered
text to the byte-count of the corresponding source [html] should be
easy to find automatically? You appreciate the problem of validly
concatenating 2 html-sources?
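One way I can imagine automating that mapping (only a sketch: it
assumes GNU grep, and that the 'transition' text appears literally,
untagged, in the saved html source):-

  # find the byte offset, in the html source, of the text that
  # marks the start of the real <article contents>
  MARKER='some unique text seen at the transition'
  OFFSET=`grep -b -o -m 1 "$MARKER" page.html | cut -d: -f1`
  # drop everything before that offset
  tail -c +`expr $OFFSET + 1` page.html > body.html

Of course the result is no longer well-formed html on its own,
which is exactly the concatenation problem mentioned above.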
BTW, these optimisations may start as novel engineering quirks, but
you can't tolerate using the 'consumer' methods once you get used
to them.
Here's a typical on-line-auto-shopping-list-script:----
g1277 Dec14L Dec14D
cd /mnt/p11/Econ/Krugman g1277 Tmplt Dec14
cd /mnt/p11/Inet/USEnet nhd12 GrpLst GrpHdrsLst
-----------> which 3 fetches mean:-
1. Use lynx, set to line-length < 77, with 2 args:
   the File of URLs,
   the File to append the fetches to, with a URL header &
   separator-line for each fetch (a sketch follows after this list).
2. Go to the appropriate dir and get the latest blog & save it as
   Dec14.
   This is an example where the header is always repeated, but it's
   not so annoying, because you don't save multiple 'pages' in a
   'book' with the same garbage.
3. Go to the appropriate dir and take the list of newsgroups from
   GrpLst, then use lynx (via google) to get the list of headers
   [for each group], appended to the file GrpHdrsLst, with a
   suitable URL-header & separator-line for each fetch.
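Building on the per-page append sketched earlier, fetch 1 could be a
plain shell loop; this is only my reconstruction, guessing that g1277
takes the file-of-URLs and the book-file as its 2 args:-

  # g1277 <file-of-URLs> <file-to-append-dumps-to>
  URLLIST="$1"
  BOOK="$2"
  while read URL
  do
    echo "==================== $URL" >> "$BOOK"
    lynx -dump -width=76 "$URL"      >> "$BOOK"
  done < "$URLLIST"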
What I really need is to be able to get gmail without a graphical
browser. Can anyone help?
== Chris Glur.