Hello,
I have a problem that I do not know how to approach. Hopefully someone out
there will be able to help me out.
First off, I am not a computing whiz, merely a mathematician with a few
years' programming under his belt. My main language is Fortran 77, but I
have also handled C/C++. As for system or network programming: not my area
at all, nor do I have time to go into that.
OK, I wish to know how to query a given search engine, without a GUI (under
Linux), to find the position of a web site.
Example:
query Google for www.nasa.org -> reply 4
so Google has www.nasa.org as its 4th entry
Please, if you could keep it simple, no gibberishy stuff about calling
socket xyz with SSL under API sublayer for MFC.
Also, no point sending pieces of code since these are only useful in a very
specific environment (yours).
Now, this is pretty simple stuff (for computer-oriented people), it must be,
and so should the answer be. What I would expect would be along the lines of:
run such a command: command www.google.com www.nasa.org > output.txt
then search through output.txt for www.nasa.org; the line it is on is its
position.
I hope this is clear enough: I am NOT after techy info, merely a quick and
dirty method to do a very simple thing. I do not wish to re-invent the
wheel.
I very much thank anyone for any help whatsoever.
G.
igthibau (46)
7/23/2004 12:47:36 PM
igthibau wrote:
> I very much thank anyone for any help whatsoever.
The problem is the following: Google expects an HTTP form request and
delivers an HTML document that has to be parsed.
Submitting the query and retrieving the document is child's play
with an HTTP library such as curl.
The hard part is parsing the HTML document delivered by the search
engine. In practice you'd need a specialized backend for every
engine, since each one structures its results differently.
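A minimal sketch of the fetch step, using the curl command-line tool rather than the library. It assumes curl is installed; the function name fetch_page and the User-Agent string are illustrative choices, not anything a search engine documents.

```shell
#!/bin/sh
# fetch_page URL OUTFILE  (hypothetical helper name)
# Download URL into OUTFILE with the curl command-line tool.
#   -s  silent: no progress meter
#   -S  but still print error messages
#   -f  return a non-zero exit status on HTTP errors such as 403
#   -L  follow redirects
#   -A  User-Agent header to send (the string below is an arbitrary example)
#   -o  file to write the document to
fetch_page() {
    curl -sSfL -A 'Mozilla/5.0 (X11; Linux)' -o "$2" "$1"
}
```

Typical use would be fetch_page 'http://www.google.com/search?q=saturn' output.txt, with the URL in single quotes so the shell leaves it alone. Whether a given engine answers such a request is up to the engine.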
Wolfgang
wdraxinger (404)
7/23/2004 11:58:13 AM
igthibau <igthibau@wanadoo.fr> wrote:
> I have a problem that I do not know how to approach. Hopefully someone out
> there will be able to help me out.
> First off, I am not a computing wizz, merely a mathematician with a few
> years' programming under his belt. My main language is fortran 77 but I
> have also handled c/c++. As for system or network programming : not my area
> at all nor do I have time to go into that.
> Ok, I wish to know, without GUI (under linux), how to query a given search
> engine to know the position of a web site.
> Example :
> query google for www.nasa.org -> reply 4
> so google has www.nasa.org as its 4th entry
Sorry, but that question does not make sense to me. What exactly is
"the position of a web site"? Do you mean that you want to find out
for a certain query (e.g. searching for the word "saturn") what's
the number of the first hit where e.g. www.nasa.org is linked?
If it's that then the first thing you have to find out is how to
create a URL for that query, e.g. for Google you could use
http://www.google.com/search?q=saturn
But for other search engines it might be different. And there are
lots of further qualifiers you may wish to append, e.g. if you
want 100 results instead of the default 10 you would have to
change that to
http://www.google.com/search?q=saturn&num=100
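As a small sketch, the two URLs above can be produced by one shell function. The name google_url is made up for illustration, and the function assumes a single search word made of plain letters or digits (no spaces or special characters).

```shell
#!/bin/sh
# google_url TERM [NUM]  (hypothetical helper name)
# Print the search URL for the single word TERM. If NUM is given, also ask
# for NUM results per page via the num= parameter.
google_url() {
    url="http://www.google.com/search?q=$1"
    if [ -n "${2:-}" ]; then
        url="$url&num=$2"
    fi
    printf '%s\n' "$url"
}
```

For instance, google_url saturn 100 prints the second URL above. When typing such a URL directly on a command line, put it in single quotes, because an unquoted & has a special meaning to the shell.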
But what you will then get (and it's not that difficult to get the
results into a file, though Google seems to try to keep you from doing
this with a standard tool like wget) is a lot of HTML code that you
still have to parse (line numbers make no sense at all in this
context) to extract the bit of information you're interested in, and
that's going to be the hardest part of it...
Regards, Jens
--
\ Jens Thoms Toerring ___ Jens.Toerring@physik.fu-berlin.de
\__________________________ http://www.toerring.de
Jens.Toerring (807)
7/23/2004 12:14:47 PM
On Fri, 23 Jul 2004, igthibau wrote:
>
> Ok, I wish to know, without GUI (under linux), how to query a given search
> engine to know the position of a web site.
> Example :
> query google for www.nasa.org -> reply 4
> so google has www.nasa.org as its 4th entry
You realize, of course, that Google doesn't "rank" sites this way.
You have to actually enter a search term or phrase in the box, and then
Google will give you a list of sites containing that term. You can't
"query Google for www.nasa.org" (unless I've been missing something!).
But I'll assume Jens is right and you meant to query for, let's say,
"saturn," and see where NASA came up in the results.
> Also, no point sending pieces of code since these are only useful in
> a very specific environment (yours).
Well, if you don't want code, and you don't want to learn network
programming yourself, we're kind of in a bind, aren't we? :) But
luckily we /know/ what platform you use: Linux. So you can use any
source code that's guaranteed to work on Linux. For example, this
ksh one-liner:
% lynx -source 'http://www.google.com/search?q=saturn&num=100' |
    grep '^<br><font ' | grep -n 'nasa' | pr -W 72 -Tt
5:<br><font color=#008000>www.<b>saturn</b>.de/ - 3k - </font><nobr> <
6:<br><font color=#008000><b>saturn</b>.jpl.nasa.gov/home/index.cfm - 4
7:<br><font color=#008000><b>saturn</b>.jpl.nasa.gov/index.cfm - 43k -
16:<br><font color=#008000>www.dpo.uab.edu/~moudry/ - 8k - </font><nobr
17:<br><font color=#008000>pds.jpl.nasa.gov/planets/welcome/mars.htm -
18:<br><font color=#008000>pds.jpl.nasa.gov/planets/choices/<b>saturn</b
45:<br><font color=#008000>www.esa.int/SPECIALS/Cassini-Huygens/ - 42k
46:<br><font color=#008000>www.nasa.gov/mission_pages/cassini/main/ - 4
47:<br><font color=#008000>nssdc.gsfc.nasa.gov/photo_gallery/ photogalle
48:<br><font color=#008000>nssdc.gsfc.nasa.gov/planetary/planets/<b>satu
51:<br><font color=#008000>www.kuro5hin.org/story/2004/7/1/93459/66714 -
52:<br><font color=#008000>ringmaster.arc.nasa.gov/<b>saturn</b>/<b>satu
59:<br><font color=#008000>www.mindspring.com/~dhanon/home.htm - 3k - <
60:<br><font color=#008000>science.nasa.gov/headlines/y2004/09jul_hailst
61:<br><font color=#008000>science.nasa.gov/headlines/y2002/13dec_<b>sat
62:<br><font color=#008000>starchild.gsfc.nasa.gov/docs/ StarChild/solar
71:<br><font color=#008000>www.cosmicelk.co.uk/Saturn.htm - 1k - Jul 2
%
I'll break it down for you.
lynx -source 'http://www.google.com/search?q=saturn&num=100'
This fetches the Google results page for a query on "saturn". Replace
"saturn" with your own search phrase, separating the words with plus
signs for an ordinary search or with periods for an exact-phrase
search: searching on foo would be "q=foo", searching on foo bar would
be "q=foo+bar", and searching on the quoted phrase "foo bar" would be
"q=foo.bar". The "num=100" part asks for the top 100 results.
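The word-joining rule just described can be sketched as a pair of shell functions. The names join_words, plain_query and phrase_query are invented for this example, and the sketch assumes at least one word is given and that the words contain only plain letters or digits.

```shell
#!/bin/sh
# join_words SEP WORD [WORD...]  (hypothetical helper name)
# Print the WORDs joined by SEP; "join_words + foo bar" prints foo+bar.
join_words() {
    sep=$1; shift
    joined=$1; shift
    for word in "$@"; do
        joined="$joined$sep$word"
    done
    printf '%s\n' "$joined"
}

# Ordinary search on several words: q=foo+bar
plain_query()  { printf 'q=%s\n' "$(join_words + "$@")"; }

# Exact-phrase search, using the period convention described above: q=foo.bar
phrase_query() { printf 'q=%s\n' "$(join_words . "$@")"; }
```

The output of either function is what goes after the "?" in the URL, in place of "q=saturn".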
| grep '^<br><font '
This looks for a special combination of tags that we're going to
assume Google prepends to all search results. This is ad-hoc and
not guaranteed to work (it's certainly not documented anywhere!),
but it works this week. ;) Might require some hacking in a month
or two, but oh well. If you want robustness, you're just going to
have to use an official Google API, or learn Perl.
| grep -n 'nasa'
This takes all those search result lines and finds the ones containing
the word 'nasa' (in your case, perhaps 'www.nasa.org' would do better;
this will also return results from e.g. JPL). It returns those lines,
and also prepends their line numbers (1 for the first line of the
Google results, 2 for the second, and so on). NB: If a part of the
'nasa' string is the same as or similar to a part of the 'saturn'
string, beware! You'll need to account for the fact that Google
marks up parts of its output like this: 'www.<b>saturn</b>.de'. I'll
leave that to the regex gurus.
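One simple way around the markup problem, sketched here without any claim to robustness, is to delete the <b> and </b> tags before searching. The function name first_position is made up; it expects the result lines (the output of the grep '^<br><font ' step) on standard input.

```shell
#!/bin/sh
# first_position SITE  (hypothetical helper name)
# Read one search result per line on standard input and print the line
# number of the first result that mentions SITE. Prints nothing if SITE
# does not appear at all.
first_position() {
    sed -e 's|<b>||g' -e 's|</b>||g' |   # drop the <b>...</b> highlighting
        grep -n -F -- "$1" |             # -F: SITE is a fixed string, not a regex
        head -n 1 |                      # keep only the first match
        cut -d: -f1                      # keep the line number, drop the text
}
```

With the tags removed, a line such as '<b>saturn</b>.jpl.nasa.gov/...' can be found by searching for 'saturn.jpl.nasa.gov', which the plain grep would miss.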
| pr -W 72 -Tt
This takes all /those/ lines and truncates them to 72 characters,
making the output easier to read on a terminal display. Change the
'72' to something bigger if you like.
Now, you'll notice that this still returns a bunch of false positives,
but (1) using 'www.nasa.org' instead of just 'nasa' will partially fix
that, and (2) you'll probably be having a human look at this anyway,
won't you? And (3) learn Perl.
> Now, this is pretty simple stuff (for computer oriented people), it must be,
> and so should the answer. What I would expect would be along the lines of :
> run such command : command www.google.com www.nasa.org > output.txt
> then search thought output.txt for www.nasa.org, the line it on is its
> position.
Two words: Shell script. I'm no Linux expert; I'll leave that to
someone else too.
HTH,
-Arthur
ajo (1601)
7/23/2004 2:15:57 PM
Well, thank you, that's the most comprehensive reply so far (and close to
100% understandable too)
>
> You realize, of course, that Google doesn't "rank" sites this way.
> You have to actually enter a search term or phrase in the box, and then
> Google will give you a list of sites containing that term. You can't
> "query Google for www.nasa.org" (unless I've been missing something!).
Yep. Sorry about that; it seemed so obvious that I did not put it quite right: of
course one types in a keyword and the search engine returns a list of sites
containing the keyword.
And I am interested in finding out where a *specific* site is positioned in
that list.
>
> Well, if you don't want code, and you don't want to learn network
> programming yourself, we're kind of in a bind, aren't we? :)
Well, I get the feeling I'd have to learn quite a bit to do what represents
a very small part of a much larger project. This impression dates
from when I got into MFC: months of learning and 200k of code to write
before getting the first "hello world" on the screen! It discouraged me
completely.
As for getting code, I don't feel competent enough in that area to integrate
*rapidly* and *efficiently* someone else's code whose logic and programming
I would not understand.
But
> luckily we /know/ what platform you use: Linux. So you can use any
> source code that's guaranteed to work on Linux. For example, this
> ksh one-liner:
>
> % lynx -source 'http://www.google.com/search?q=saturn&num=100' | grep
> '^<br><font ' | grep -n 'nasa' | pr -W 72 -Tt
Now THAT'S what I'm talking about. I'll have to look at this in depth, but
it looks like pretty much exactly what I'm after.
>
> Now, you'll notice that this still returns a bunch of false positives,
> but (1) using 'www.nasa.org' instead of just 'nasa' will partially fix
> that, and (2) you'll probably be having a human look at this anyway,
> won't you? And (3) learn Perl.
'False positives'?
You mean the search is too broad?
Why learn Perl?
I don't really understand the point of learning these languages (Python,
Perl). What are they for? This is a genuine question; I am not disparaging
them.
>
>> Now, this is pretty simple stuff (for computer oriented people), it must
>> be, and so should the answer. What I would expect would be along the
>> lines of : run such command : command www.google.com www.nasa.org >
>> output.txt then search thought output.txt for www.nasa.org, the line it
>> on is its position.
>
> Two words: Shell script. I'm no Linux expert; I'll leave that to
> someone else too.
>
I thought the above did the job? Now you got me confused.
Nevertheless all this is precious help and I'll see what I can do with it.
Thank you very much.
G
igthibau (46)
7/23/2004 8:53:28 PM
On Fri, 23 Jul 2004, igthibau wrote:
>
>> Well, if you don't want code, and you don't want to learn network
>> programming yourself, we're kind of in a bind, aren't we? :)
>
> well, I get the feeling I'd have to learn quite a bit to do what represents
> a very small part of a much larger project. this impression I get dates
> from when I got into MFC : months of learning and 200k of code to write
> before getting the first "hello world" on the screen ! discouraged me
> completely.
(This is just a quick response; more when I get the time, perhaps.)
MFC is not all there is to programming. :) There are a lot of libraries
available for Linux that only require you to know how to run 'tar zxvf'
and 'make install', and then have very nice interfaces. I can't think
of any right now, but... ;)
>> % lynx -source 'http://www.google.com/search?q=saturn&num=100' | grep
>> '^<br><font ' | grep -n 'nasa' | pr -W 72 -Tt
>
> now THAT'S what I'm talking about. I'll have to look in depth at this, but
> it look's somewhat exactly what I'm after.
As I said, it has its problems, but it solves the main problem (how
to get Google to give you something useful) and makes a
first-approximation stab at the secondary problem (how to parse the
results to find what you're interested in).
>> Now, you'll notice that this still returns a bunch of false positives,
>> but (1) using 'www.nasa.org' instead of just 'nasa' will partially fix
>> that, and (2) you'll probably be having a human look at this anyway,
>> won't you? And (3) learn Perl.
>
> 'false positives'?
> you mean the search is too broad?
Yes. Note the results you get; the first one IIRC was "www.saturn.de".
That's not NASA. Thus our filter was too broad.
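One way to narrow the filter, sketched under the assumption that every result line has the shape seen in the sample output ('<br><font color=#008000>' followed by host/path and a size), is to cut each line down to its host name and then require an exact match. The function names result_hosts and host_positions are invented for this sketch.

```shell
#!/bin/sh
# result_hosts  (hypothetical helper name)
# Read result lines on standard input and print only the host name of each,
# so that line N of the output is the host of result N.
result_hosts() {
    sed -e 's|<b>||g' -e 's|</b>||g' \
        -e 's|^<br><font [^>]*>||' \
        -e 's|[/ ].*$||'
}

# host_positions HOST  (hypothetical helper name)
# Print the positions of all results whose host is exactly HOST.
#   -x  the whole line must match, so "nasa" alone no longer matches
#   -F  HOST is a fixed string, so its dots are not regex wildcards
host_positions() {
    result_hosts | grep -n -x -F -- "$1" | cut -d: -f1
}
```

Used as the last stage of the pipeline in place of grep -n 'nasa', host_positions www.nasa.gov would list only results served from exactly that host, so a line like the www.saturn.de one drops out.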
> why learn Perl ?
Perl has built-in and very magical support for /regular expressions/,
which are a way to quickly describe "patterns" in text. Google it. :)
> I don't really understand the point in learning these languages (python,
> perl), what are they for ? This is a genuine question, I am not downgrading
> them.
Another advantage of Perl, for you, is that it's interpreted rather
than compiled, so there's no compiler to cajole. (Well, /maybe/ that's
an advantage...)
>> Two words: Shell script. I'm no Linux expert; I'll leave that to
>> someone else too.
>
> I thought the above did the job? Now you got me confused.
Well, you want to be able to substitute /anything/ for "saturn"
and "nasa," right? That is, you want to have
./myprogram saturn www.nasa.gov
whereas with my suggestion, you're typing
>> lynx -source 'http://www.google.com/search?q=saturn&num=100' | grep
>> '^<br><font ' | grep -n 'nasa' | pr -W 72 -Tt
every time. Shell scripts let you encapsulate all that complexity
into a neat little package, with command-line arguments and stuff.
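A minimal sketch of such a script follows. It assumes lynx is installed and that the results page still uses the '<br><font ' layout discussed earlier in the thread; the function names are made up, and the script prints only the position of the first matching result.

```shell
#!/bin/sh
# myprogram: print the position of the first result mentioning SITE
# for the search word TERM.
# Usage: ./myprogram TERM SITE     e.g.  ./myprogram saturn www.nasa.gov

# position_in_page SITE  (hypothetical helper name)
# Read a results page on standard input and print the position of the
# first result line containing SITE (nothing if there is none).
position_in_page() {
    grep '^<br><font ' |                     # keep only the result lines
        sed -e 's|<b>||g' -e 's|</b>||g' |   # drop the <b> highlighting
        grep -n -F -- "$1" |                 # number and select matches
        head -n 1 |
        cut -d: -f1
}

# site_position TERM SITE  (hypothetical helper name)
# Fetch the top 100 results for TERM and report where SITE first appears.
site_position() {
    lynx -source "http://www.google.com/search?q=$1&num=100" |
        position_in_page "$2"
}

# Run only when called with exactly two arguments, so the functions can
# also be loaded into a shell and tried out on a saved page.
if [ "$#" -eq 2 ]; then
    site_position "$1" "$2"
fi
```

Saved as myprogram and made executable with chmod +x myprogram, it would be run as ./myprogram saturn www.nasa.gov. To try the parsing part on a page saved earlier: position_in_page www.nasa.gov < output.txt.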
You're welcome,
-Arthur
ajo (1601)
7/23/2004 9:07:55 PM
In article <cdrmtq$qfb$1@news-reader4.wanadoo.fr>, igthibau@wanadoo.fr
says...
> well, I get the feeling I'd have to learn quite a bit to do what represents
> a very small part of a much larger project. this impression I get dates
> from when I got into MFC : months of learning and 200k of code to write
> before getting the first "hello world" on the screen ! discouraged me
> completely.
Seems to me that if you start a Dialog project and select 'Hello World'
as the title, you will get it on the screen without any coding at all!
If that seems like cheating, adding the line "SetWindowText( "Hello
World" );" to OnInitDialog() should do the trick.
- Gerry Quinn
gerryq (1321)
7/24/2004 9:27:43 AM
That's if you ask the machine to create the project skeleton for you.
When you look into the code, somewhere in the middle you see "add your code
here" or something along those lines.
The skeleton takes care of movement, menus, windows, clicking, etc.
Gerry Quinn wrote:
> In article <cdrmtq$qfb$1@news-reader4.wanadoo.fr>, igthibau@wanadoo.fr
> says...
>> well, I get the feeling I'd have to learn quite a bit to do what
>> represents a very small part of a much larger project. this impression I
>> get dates
>> from when I got into MFC : months of learning and 200k of code to write
>> before getting the first "hello world" on the screen ! discouraged me
>> completely.
>
> Seems to me that if you start a Dialog project and select 'Hello World'
> as the title, you will get it on the screen without any coding at all!
>
> If that seems like cheating, adding the line "SetWindowText( "Hello
> World" );" to OnInitDialog() should do the trick.
>
> - Gerry Quinn
igthibau (46)
7/25/2004 2:32:11 PM
>>>>> "igthibau" == igthibau <igthibau@wanadoo.fr> writes:
igthibau> Hello, I have a problem that I do not know how to
igthibau> approach. Hopefully someone out there will be able to help me out.
igthibau> First off, I am not a computing wizz, merely a mathematician with
igthibau> a few years' programming under his belt. My main language is
igthibau> fortran 77 but I have also handled c/c++. As for system or network
igthibau> programming : not my area at all nor do I have time to go into
igthibau> that. Ok, I wish to know, without GUI (under linux), how to query
igthibau> a given search engine to know the position of a web site. Example
igthibau> : query google for www.nasa.org -> reply 4 so google has
igthibau> www.nasa.org as its 4th entry
Google has an API for the search engine. It is a SOAP service with a
published WSDL. There are two catches: you need to obtain a key to use the
API (this is free, but you have to register), and you can make only a limited
number of queries per day (about 1000, if I remember correctly). The
API comes with example programs in C# and Java, but since it is SOAP-based,
you can use other languages as well.
But to your question: the query result contains a list of
<resultelement>s. Since it is a list, I guess the position of an element in
that list (4th, in your example) is what you are after.
For more information, check api.google.com
--
Arto V. Viitanen av@cs.uta.fi
University of Tampere, Department of Computer Sciences
Tampere, Finland http://www.cs.uta.fi/~av/
av111 (38)
7/26/2004 6:37:57 AM
In article <ce09dt$clf$1@news-reader5.wanadoo.fr>, igthibau@wanadoo.fr
says...
> that's if you ask the machine to create the project skeleton for you.
> when you look into the code , somewhere in the middle you see "add your code
> here" or something along those lines.
>
> the skeleton takes care of movement, menus, windows, clicking, etc ...
Well, it depends what you ask it to take care of. If you create a
Dialog App, you can easily dump the dialog it gives you and insert
something else.
Have you some objection to things being easy?
- Gerry Quinn
> Gerry Quinn wrote:
>
> > In article <cdrmtq$qfb$1@news-reader4.wanadoo.fr>, igthibau@wanadoo.fr
> > says...
> >> well, I get the feeling I'd have to learn quite a bit to do what
> >> represents a very small part of a much larger project. this impression I
> >> get dates
> >> from when I got into MFC : months of learning and 200k of code to write
> >> before getting the first "hello world" on the screen ! discouraged me
> >> completely.
> >
> > Seems to me that if you start a Dialog project and select 'Hello World'
> > as the title, you will get it on the screen without any coding at all!
> >
> > If that seems like cheating, adding the line "SetWindowText( "Hello
> > World" );" to OnInitDialog() should do the trick.
> >
> > - Gerry Quinn
>
>
gerryq (1321)
7/26/2004 10:00:27 AM