Finding files with similar names

	I am looking for some Linux utility (shell or whatever) to find 
files (recursively in a directory) with similar names. A Google search 
reveals a number of tools, but they all seem to be rather limited, in 
that they rely on basic features like extensions and case. I am looking 
for something more clever than that - e.g. a tool that would match 
"nameoftherose.pdf" and "The_Name_of_Ros (1985)". This is just an 
example; it is easy to come up with many other variations.

	Does anybody in this forum know of a tool along such lines?

	My first thought was that it would be possible to come up with a 
list of heuristics applicable to individual cases (like the one above) 
that could be augmented, based on experience. I guess this might work (I 
don't know how performant it would be though) but, if that is what I 
thought about immediately, I am sure that somebody else must have already 
thought about it, and implemented something based on it: I do not want to 
reinvent the wheel.

	Any pointers would be much appreciated.
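
The heuristic-list idea described above could be sketched roughly as follows. This is only an illustration under stated assumptions: bash plus sed with -E support, and two hypothetical helpers (normalize_name and list_by_key) that are not taken from any existing tool. The rules (drop the extension, drop a bracketed year, ignore case and punctuation) are just one possible starting list.

```shell
#!/usr/bin/env bash
# normalize_name NAME: reduce a filename to a comparison key.
# Hypothetical helper; the rules below are illustrative heuristics only.
normalize_name() {
  local name=${1##*/}              # drop any leading directory
  # Drop a final extension of 1-4 letters or digits, e.g. ".pdf".
  name=$(printf '%s' "$name" | sed -E 's/\.[A-Za-z0-9]{1,4}$//')
  # Drop bracketed four-digit years, lowercase, keep only letters and digits.
  printf '%s' "$name" \
    | sed -E 's/[([][0-9]{4}[])]//g' \
    | tr '[:upper:]' '[:lower:]' \
    | tr -cd 'a-z0-9'
  printf '\n'
}

# list_by_key DIR: print "key<TAB>path" for every file under DIR, sorted so
# that files sharing a key end up on adjacent lines.
# Assumes no newlines in filenames.
list_by_key() {
  find "$1" -type f | while IFS= read -r path; do
    printf '%s\t%s\n' "$(normalize_name "$path")" "$path"
  done | sort
}
```

With these rules, "Name_Of_The_Rose (1985).PDF" and "nameoftherose.pdf" reduce to the same key. The pair from the question does not: "The_Name_of_Ros (1985)" reduces to "thenameofros", so exact comparison of keys is not enough there and some form of approximate matching is still needed.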
Harold
12/16/2016 6:06:17 PM
comp.os.linux.misc

On 2016-12-16, Harold Johanssen <noemail@please.net> wrote:
> 	I am looking for some Linux utility (shell or whatever) to find 
> files (recursively in a directory) with similar names. A Google search 
> reveals a number of tools, but they all seem to be rather limited, in 
> that they rely on basic features like extensions and case. I am looking 
> for something more clever than that - e.g. a tool that would match 
> "nameoftherose.pdf" and "The_Name_of_Ros (1985)". This is just an 
> example; it is easy to come up with many other variations.
>
> 	Does anybody in this forum know of a tool along such lines?
>
> 	My first thought was that it would be possible to come up with a 
> list of heuristics applicable to individual cases (like the one above) 
> that could be augmented, based on experience. I guess this might work (I 
> don't know how performant it would be though) but, if that is what I 
> thought about immediately, I am sure that somebody else must have already 
> thought about it, and implemented something based on it: I do not want to 
> reinvent the wheel.

I think you have no real idea of what "similar" means. Is a file named car
similar to one named boat? (They both carry people -- does that make them
similar?) So, if I list all files for you, is that what you want? (I am
sure I can find some similarity between any two file names.)


>
> 	Any pointers would be much appreciated.

find . -name ...
with some globbing
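
One way to read that suggestion, as a minimal sketch: build a case-insensitive glob from a few words you expect in the name and pass it to find. The helper name find_loose is hypothetical, and -iname is an extension found in GNU and BSD find rather than in POSIX.

```shell
#!/usr/bin/env bash
# find_loose DIR WORD...: list files under DIR whose names contain all the
# WORDs in the given order, ignoring case, with anything in between.
# Hypothetical helper; glob characters inside a WORD keep their meaning.
find_loose() {
  local dir=$1; shift
  local pattern='*'
  local word
  for word in "$@"; do
    pattern+="${word}*"            # e.g. name ros -> *name*ros*
  done
  find "$dir" -type f -iname "$pattern"
}
```

Calling find_loose . name ros would list both "nameoftherose.pdf" and "The_Name_of_Ros (1985)", but only because the words were supplied by hand; it does not discover similar names on its own.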

William
12/16/2016 6:46:56 PM
On 16/12/16 20:46, William Unruh wrote:
> On 2016-12-16, Harold Johanssen <noemail@please.net> wrote:
>> 	I am looking for some Linux utility (shell or whatever) to find
>> files (recursively in a directory) with similar names.
>
> find . -name ...
> with some globbing
>
+1

The
12/16/2016 6:56:40 PM
On 2016-12-16 19:06, Harold Johanssen wrote:
> 	I am looking for some Linux utility (shell or whatever) to find 
> files (recursively in a directory) with similar names. A Google search 
> reveals a number of tools, but they all seem to be rather limited, in 
> that they rely on basic features like extensions and case. I am looking 
> for something more clever than that - e.g. a tool that would match 
> "nameoftherose.pdf" and "The_Name_of_Ros (1985)". This is just an 
> example; it is easy to come up with many other variations.
> 
> 	Does anybody in this forum know of a tool along such lines?

No... I know of a grep variant (cgrep) that matches on similar names,
but you have to give the name.

-- 
Cheers,
       Carlos E.R.
Carlos
12/16/2016 9:03:55 PM
Carlos E. R. <robin_listas@invalid.es> wrote:
> On 2016-12-16 19:06, Harold Johanssen wrote:
>>       I am looking for some Linux utility (shell or whatever) to find 
>> files (recursively in a directory) with similar names. A Google search 
>> reveals a number of tools, but they all seem to be rather limited, in 
>> that they rely on basic features like extensions and case. I am looking 
>> for something more clever than that - e.g. a tool that would match 
>> "nameoftherose.pdf" and "The_Name_of_Ros (1985)". This is just an 
>> example; it is easy to come up with many other variations.
>> 
>>       Does anybody in this forum know of a tool along such lines?
> 
> No... I know of a grep variant (cgrep) that matches on similar names,
> but you have to give the name.

He could perform an O(N^2) naive search looking for similar names using
cgrep:

Note: the script below assumes no newlines in filenames.

  TMP=$(mktemp)
  find /my/wonderful/dir -type f > "$TMP" # so we only run 'find' once
                                          # and to have a file for cgrep
                                          # to search within
  # IFS= and -r keep whitespace and backslashes in each name intact
  while IFS= read -r name ; do
    cgrep "$name" "$TMP" # pick your own cgrep output options
  done < "$TMP"
  rm -f "$TMP"

Each name will always match itself; filtering out such "self matches",
if desired, is left as an exercise.
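
For anyone without cgrep installed, the same naive O(N^2) comparison can be sketched in plain bash with an edit-distance function. This swaps in Levenshtein distance for cgrep's approximate matching, so it is a different technique, not a reimplementation of cgrep. Assumptions: bash 4 or later (for mapfile and ${x,,}), no newlines in filenames, and two hypothetical helper names. Starting the inner loop at i + 1 is what skips the self matches.

```shell
#!/usr/bin/env bash
# levenshtein A B: print the edit distance between two strings
# (insertions, deletions and substitutions each cost 1).
levenshtein() {
  local a=$1 b=$2
  local la=${#a} lb=${#b}
  local i j cost sub del ins min
  local -a prev cur
  for ((j = 0; j <= lb; j++)); do prev[j]=$j; done
  for ((i = 1; i <= la; i++)); do
    cur=("$i")                     # distance from a[0..i) to the empty string
    for ((j = 1; j <= lb; j++)); do
      [[ ${a:i-1:1} == "${b:j-1:1}" ]] && cost=0 || cost=1
      sub=$((prev[j-1] + cost))
      del=$((prev[j] + 1))
      ins=$((cur[j-1] + 1))
      min=$sub
      ((del < min)) && min=$del
      ((ins < min)) && min=$ins
      cur[j]=$min
    done
    prev=("${cur[@]}")
  done
  echo "${prev[lb]}"
}

# similar_pairs DIR MAX: print "path<TAB>path" for every pair of files whose
# lowercased base names differ by at most MAX edits. Each unordered pair is
# compared once, so a file is never compared with itself.
similar_pairs() {
  local dir=$1 max=$2 i j x y
  local -a paths
  mapfile -t paths < <(find "$dir" -type f | sort)
  for ((i = 0; i < ${#paths[@]}; i++)); do
    for ((j = i + 1; j < ${#paths[@]}; j++)); do
      x=${paths[i]##*/} y=${paths[j]##*/}
      if (( $(levenshtein "${x,,}" "${y,,}") <= max )); then
        printf '%s\t%s\n' "${paths[i]}" "${paths[j]}"
      fi
    done
  done
}
```

Expect this to be slow on anything but small trees: the edit distance is computed in interpreted shell for every pair of files.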
Rich
12/16/2016 11:45:37 PM
On 2016-12-16, Harold Johanssen <noemail@please.net> wrote:
> 	I am looking for some Linux utility (shell or whatever) to find 
> files (recursively in a directory) with similar names. A Google search 
> reveals a number of tools, but they all seem to be rather limited, in 
> that they rely on basic features like extensions and case. I am looking 
> for something more clever than that - e.g. a tool that would match 
> "nameoftherose.pdf" and "The_Name_of_Ros (1985)". This is just an 
> example; it is easy to come up with many other variations.
>
> 	Does anybody in this forum know of a tool along such lines?

IBM's Watson.
-- 
                                 Chick Tower

For e-mail:  colm DOT sent DOT towerboy AT xoxy DOT net
Chick
12/17/2016 4:10:06 AM
In article <o31aep$4qi$1@news.albasani.net>,
Harold Johanssen  <noemail@please.net> wrote:
>	I am looking for some Linux utility (shell or whatever) to find 
>files (recursively in a directory) with similar names. A Google search 
>reveals a number of tools, but they all seem to be rather limited, in 
>that they rely on basic features like extensions and case. I am looking 
>for something more clever than that - e.g. a tool that would match 
>"nameoftherose.pdf" and "The_Name_of_Ros (1985)". This is just an 
>example; it is easy to come up with many other variations.

As the good members of this newsgroup have made clear, the problem as
stated is not well-defined.  So, they are being their usual snarky selves
about it.  Although, in fact, the 'cgrep' (*) idea posted does seem
interesting...

(*) Which, interestingly enough, I'd never heard of before...

But looking at the problem a little deeper - reading between the lines -
the problem seems to be one that we've all probably faced at some point.
Namely, that of weeding out duplicate files.  So, it seems to me that a
better approach would be to try to find files with similar content, rather
than similar names.  Nitpickers note: Yes, this also is not completely
well-defined, but I think it is likely to work better than the idea of
looking for similar filenames.

Things you can try:
    1) Files with similar (or identical) sizes.
    2) Files with similar (or identical) dates.
    3) Content analysis.  This is the trickiest, but can be done.  You gave
	an example involving PDF files.  Now, PDF is a binary (mostly) file
	format, but there are bits of ASCII in there.  It seems to me that
	comparing the pattern of the ASCII strings (as revealed by the
	"strings" command) ought to give one a pretty good idea if two
	files are basically the same.
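
Points 1 and 3 above might be sketched like this. It is a rough illustration, not a finished tool: same_size relies on the -printf action of GNU find, shared_strings relies on the "strings" command being installed and on bash process substitution, and both function names are hypothetical.

```shell
#!/usr/bin/env bash
# same_size DIR: print "size<TAB>path" for every file under DIR whose size
# in bytes is shared with at least one other file. Assumes GNU find.
same_size() {
  find "$1" -type f -printf '%s\t%p\n' | sort -n |
    awk -F'\t' '{ count[$1]++; line[NR] = $0; size[NR] = $1 }
                END { for (i = 1; i <= NR; i++)
                        if (count[size[i]] > 1) print line[i] }'
}

# shared_strings A B: print how many distinct printable strings, as reported
# by "strings", occur in both files. A high count relative to the totals
# hints that the two files are basically the same.
shared_strings() {
  comm -12 <(strings "$1" | sort -u) <(strings "$2" | sort -u) | wc -l
}
```

Running same_size first keeps the more expensive content comparison limited to candidates that already look alike.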

Anyway, that's how I'd go about this task...

-- 
The randomly chosen signature file that would have appeared here is more than 4
lines long.  As such, it violates one or more Usenet RFCs.  In order to remain
in compliance with said RFCs, the actual sig can be found at the following URL:
	http://user.xmission.com/~gazelle/Sigs/Seneca
gazelle
12/17/2016 10:00:21 AM
On 2016-12-17 11:00, Kenny McCormack wrote:

> As the good members of this newsgroup have made clear, the problem as
> stated is not well-defined.  So, they are being their usual snarky selves
> about it.  Although, in fact, the 'cgrep' (*) idea posted does seem
> interesting...
> 
> (*) Which, interestingly enough, I'd never heard of before...

Because for many years cgrep was an internal-only program, available
only inside Lucent (or AT&T before the split). Around the time Lucent
disappeared and was swallowed up, they released those apps as public
source (though perhaps not the exact same version).

For instance, the tool could search the text from the event log printer
of a phone exchange for a word, then extract the whole event context
above and below the line where the word appeared. It was of particular
interest to 5EEE technicians (options -R and -T).

From its man page:

       cgrep - shows context of matching patterns found in files

       -A nerrors
             Allow up to nerrors mismatches in matching patterns or
             delimiters.  (Comparisons with trail_patterns are
             unaffected.)  nerrors must be a positive integer or zero.
             By default, a single mismatch is either the insertion of
             an extra character in the text, a substitution of a
             character in the text, or a deletion of a character in
             the text.  However, portions of a fixed string or regular
             expression that are enclosed in angle brackets (< and >)
             must be matched exactly.  NOTE: Angle brackets become
             meta-characters in all regular expressions or fixed
             strings for which approximate matching is used.  For
             performance reasons, it is a good idea to use angle
             brackets liberally when using approximate matching.  Like
             other meta-characters, an angle bracket may be escaped in
             a regular expression by preceding it with a backslash,
             and an angle bracket retains its literal meaning inside
             of regular expression character classes (square
             brackets).  In a fixed string, an angle bracket may be
             escaped by preceding it with a backslash and using the -a
             option.

             The cost of deletions, insertions, or substitutions may
             be set independently with the -C, +C, or -+C options
             respectively.  These options are explained below.  The
             section, APPROXIMATE MATCHING EXAMPLES, gives examples of
             the use of approximate matching.



There are tools that search the filesystem to locate occurrences of the
same file, by name. But I do not know of one that hits on approximations
of the name.

I started to write a tool that would locate the occurrences of the same
file, based on content. But it had to be the exact same content. My idea
was to search the entire given tree to produce a list of files, then
calculate the checksum for each file, then compare the list of checksums
with itself. But I never finished it.
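
The checksum idea described above fits in a short pipeline. This is a minimal sketch assuming GNU coreutils: sha256sum, plus the -w and --all-repeated options of GNU uniq. The function name exact_dupes is hypothetical. Sorting the checksum list puts identical checksums on adjacent lines, which stands in for comparing the list with itself.

```shell
#!/usr/bin/env bash
# exact_dupes DIR: print groups of files under DIR that have identical
# SHA-256 checksums, one "checksum  path" line per file, with a blank line
# between groups. Files whose checksum is unique are not printed.
exact_dupes() {
  find "$1" -type f -exec sha256sum {} + |
    sort |
    uniq -w 64 --all-repeated=separate   # compare only the 64 hex digits
}
```

One caveat: sha256sum marks names containing a backslash or newline by prefixing the output line with a backslash, which shifts the checksum by one column, so such files would not be grouped correctly by this sketch.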


-- 
Cheers,
       Carlos E.R.
Carlos
12/17/2016 2:21:23 PM