


Why don't (or can't) search engines index all parts of a large file?

Why don't search engines, or why can't they, index all parts of a file
when it is large?

entrepreneur
6/16/2006 4:34:24 AM

entrepreneur wrote:
> why search engines don't or can't index all parts of a file when it's
> of large size?

A search engine cannot distinguish between files that are stored
physically on a web server and responses that are generated on the
fly, for example by scripts.

It's a simple exercise to write a CGI script
that generates a list of several million random words,
even non-stop-words.

So the programmers of a search engine must apply
different strategies to handle fake documents.
Such strategies might be
- index only the first n megabytes of a document
- index only documents with a ranking higher
   than m.
- index only documents from websites that
   seem to be honest.
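
The first strategy can be sketched roughly as follows. This is only an
illustration, not how any real search engine is known to work: the
function name index_prefix, the byte limit, and the whitespace
tokenizer are all assumptions made for the example.

```python
from collections import defaultdict


def index_prefix(doc_id, data, max_bytes, index):
    """Add only the first max_bytes of a document to an inverted index.

    Hypothetical helper: 'index' maps each token to the set of document
    ids that contain it within the indexed prefix.
    """
    prefix = data[:max_bytes]
    # A cut-off multi-byte character at the end of the prefix is dropped.
    text = prefix.decode("utf-8", errors="ignore")
    for token in text.lower().split():
        index[token].add(doc_id)
    return index


index = defaultdict(set)
# Only the first 10 bytes ("alpha beta") are seen by the indexer.
index_prefix("doc1", b"alpha beta gamma delta", 10, index)
```

Anything past the limit is invisible to queries, so a script that
emits millions of random words costs the indexer no more than an
ordinary page. A byte limit can also cut the last word in half; a real
system would presumably handle that more carefully.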

Of course, the search engine engineers won't hand
out their junk-fighting strategies; these are
business secrets.

Regards,
Joachim





Joachim
6/16/2006 9:27:18 AM

> A search engine cannot distinguish between files that are stored
> physically on a webserver and answers that are generated on the
> fly, like by scripts.
>
> It's a simple exercise to generate a cgi script
> that generates a list of some million random words,
> even non-stop-words.
>
> So the programmers of a search engine must apply
> different strategies to handle fake documents.
> Such strategies might be
> - index only the first n megabytes of a document
> - index only documents with a ranking higher
>    than m.
> - index only documents from websites that
>    seem to be honest.
>
> Of course, the search engine engineers won't hand
> out their junk fighting strategies, these are
> business secrets.
>
> Regards,
> Joachim
I didn't mean dynamic web pages; I'm only discussing static web
pages or other files, like PDFs.  When a static page or file is
large, maybe a search engine indexes just the first n megabytes. But I
still think that the search engine can identify that it's static, not
dynamic. So why doesn't a search engine do what it should and index
the document completely?

entrepreneur
6/16/2006 12:00:55 PM


entrepreneur wrote:
 > but i still think that the search engine can identify it's static not
 > dynamic. so why a search engine should do but doesn't do it, to index
 > the document completely.

A simple heuristic is: if a URL contains a "?", then most
probably the web server response comes from a script.
But there's a simple trick to foil this. Web server extensions
like mod_rewrite cloak URLs so that they appear without substrings like
"?", ".php", ".asp" and so on. Even spammers know that, so the
search engine programmers must provide countermeasures anyway.
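
A minimal sketch of that heuristic, and of how a rewritten URL slips
past it. The function name looks_dynamic, the suffix list, and the
example.com URLs are made up for illustration.

```python
from urllib.parse import urlsplit

# Assumed list of script-like extensions; not taken from any real crawler.
DYNAMIC_SUFFIXES = (".php", ".asp", ".cgi")


def looks_dynamic(url):
    """Guess from the URL alone whether a response is script-generated."""
    path = urlsplit(url).path.lower()
    return "?" in url or path.endswith(DYNAMIC_SUFFIXES)


plain = looks_dynamic("http://example.com/list.php?page=2")
# The same script behind a rewrite rule looks like a static page.
cloaked = looks_dynamic("http://example.com/articles/2")
```

Here plain is True and cloaked is False, which is exactly why the URL
by itself cannot settle the static-versus-dynamic question.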

BTW, some say Google keeps the entire web in RAM,
so it should be clear that there must be a limit on
how much from one website can be indexed.
http://blog.topix.net/archives/000011.html

Regards,
Joachim
Joachim
6/16/2006 1:35:50 PM
That still isn't an answer to the original question.

That question was "why do search engines generally not index all of
long documents".

The real answer is that indexing all of the document gives poor
retrieval results.  The reason is that long documents
generally contain enough sub-topics that it is difficult to build a
simple index-based scoring system that accurately characterizes how the
document uses different words.  With simple vector space models,
estimates of the relevance of a long document to a single word tend to go
down even though most of the other words in the document are closely
related to it (this is due to the commonly used cosine normalization).
Slightly better results can be had with different weighting schemes
such as the Okapi formulae or the normalization used by Lucene, but
these have problems as well.
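
The effect of cosine normalization can be seen in a toy example with
raw term counts. The documents and the helper cosine_weight are
invented for illustration; real systems use tf-idf or similar weights
rather than raw counts.

```python
import math


def cosine_weight(term_counts, term):
    """Cosine similarity between a one-word query and a document.

    With a one-word query the similarity reduces to the term's count
    divided by the Euclidean length of the document vector.
    """
    norm = math.sqrt(sum(c * c for c in term_counts.values()))
    return term_counts.get(term, 0) / norm if norm else 0.0


short_doc = {"index": 2, "search": 1, "engine": 1}

# Same counts for the original words, plus 100 other distinct words.
long_doc = dict(short_doc)
long_doc.update({f"word{i}": 1 for i in range(100)})

short_score = cosine_weight(short_doc, "index")
long_score = cosine_weight(long_doc, "index")
```

Both documents mention "index" twice, but the long one scores far
lower because every extra distinct word lengthens its vector.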

Two solutions are commonly tried.  The first is to simply break the
long document down into smaller sub-documents.  The second is to use
some alternative system that recognizes semantic proximity of terms so
that when you have a one-word query, it can still get a good score even
though there are lots of distinct words in the document.  Examples
include latent semantic indexing (LSI), which is getting pretty long in
the tooth, or latent Dirichlet allocation (LDA), which is on considerably
better theoretical ground, or even discrete component analysis (DCA),
which is an umbrella framework that describes most other similar
approaches.
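
The first solution might look something like the sketch below. The
window size and the overlap between neighbouring passages are
assumptions of this example, not something specified above; each
passage would then be indexed and scored as its own document.

```python
def split_into_passages(text, size=200, overlap=50):
    """Split text into passages of at most 'size' words.

    Neighbouring passages share 'overlap' words (an assumed design
    choice) so that a topic straddling a boundary is not cut in two.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    passages = []
    for start in range(0, len(words), step):
        passages.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return passages


sample = " ".join(str(i) for i in range(10))
passages = split_into_passages(sample, size=4, overlap=2)
```

Each passage is short enough that length normalization no longer
penalizes it, at the cost of more index entries per original document.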



Joachim Pimiskern wrote:
> entrepreneur wrote:
>  > but i still think that the search engine can identify it's static not
>  > dynamic. so why a search engine should do but doesn't do it, to index
>  > the document completely.
>
> A simple heuristic is: if a URL contains a "?", then most
> probably the webserver response comes from a script.
> But there's a simple trick to foil this. Web server extensions
> like mod_rewrite cloak URLs to appear without substrings like
> "?", ".php", ".asp" and so on. Even spammers know that, so the
> search engine programmers must provide countermeasures anyway.
>
> BTW, some say Google keeps the entire web in RAM,
> so it should be clear that there must be a limit on
> how much from one website can be indexed.
> http://blog.topix.net/archives/000011.html
> 
> Regards,
> Joachim

Ted
6/17/2006 12:09:37 AM