For some people out there this will be a "simple" task, but I have been
pounding my head for ages on this with no luck, so I hope someone might
help me out?
I need to "convert" (encrypt, hash, mash, whatever you want to call
this) a string into a 4-byte integer (values from -2147483648 to
2147483647) but also be able to "recover" the input string e.g.
foo = mash("This is the input string to be mashed")
foo = [-2147483648 to 2147483647]
bar = unmash(foo)
bar = "This is the input string to be mashed"
I think what is making this more complicated than it needs to be is that
there are more than 2^256 possible string input combinations so there is
going to be a real risk of "collisions". This is for a simple home
project, so if anyone has any ideas, such as limiting the input string
length, dictionary or some other pre-conditioning, I'd love to hear
them. Thanks in advance.
|
|
0
|
|
|
|
Reply
|
noemailplease
|
4/4/2010 1:50:36 PM |
|
noemailplease wrote:
> For some people out there this will be a "simple" task, but I have been
> pounding my head for ages on this with no luck, so I hope someone might
> help me out?
>
> I need to "convert" (encrypt, hash, mash, whatever you want to call
> this) a string into a 4-byte integer (values from -2147483648 to
> 2147483647) but also be able to "recover" the input string e.g.
>
> foo = mash("This is the input string to be mashed")
> foo = [-2147483648 to 2147483647]
> bar = unmash(foo)
> bar = "This is the input string to be mashed"
>
> I think what is making this more complicated than it needs to be is that
> there are more than 2^256 possible string input combinations so there is
> going to be a real risk of "collisions". This is for a simple home
> project, so if anyone has any ideas, such as limiting the input string
> length, dictionary or some other pre-conditioning, I'd love to hear
> them. Thanks in advance.
I should have added why I need this: I have a simple data array of
integers and I want to store the text with its corresponding integer
values. (The data is used in another external application too, so
unfortunately I cannot change that) I can limit the input text within
some limits but if I assume the text will contain all "normal" english
keyboard characters in some order, I can limit the length if need be.
I think the only other option I have is to use two arrays in the hope
they remain synchronised? i.e.
label[0] <--> data[0][0]...data[0][n]
label[1] <--> data[1][0]...data[1][n]
but I think that is very dangerous for the data.
|
|
0
|
|
|
|
Reply
|
noemailplease
|
4/4/2010 2:11:48 PM
|
|
On 10-04-04 8:50 AM, noemailplease wrote:
> I need to "convert" (encrypt, hash, mash, whatever you want to call
> this) a string into a 4-byte integer (values from -2147483648 to
> 2147483647) but also be able to "recover" the input string e.g.
> I think what is making this more complicated than it needs to be is that
> there are more than 2^256 possible string input combinations so there is
> going to be a real risk of "collisions".
What you ask is impossible in the general case. If there are N items
that have to be stored in M slots, and M < N, then by the "Pigeon Hole
Principle", at least two items are going to end up in the same slot.
Notice that this argument depends only on the number of slots, and not
only how fancifully or briefly the slots are named.
However, if your programs never in fact use more than 2^32 different
input values, and if you allow string encoding data to be stored and
retrieved by all the appropriate programs, then it is entirely possible
that the encoding table, appropriately done, could require less space
than the original strings.
Do I understand correctly that the match between particular 4-byte
integers and corresponding strings will be defined by some process that
is out of the control of the compression algorithm? If so it should be
possible to work that in, but it will not be as efficient as otherwise.
A key technical question for your purposes is the need for random access
to the strings, and the time constraints. For example if you only need
to output the corresponding strings once during a "report phase", then
decoding time of the string table or a need to decode sequentially would
not be important, whereas for more interactive programs that are not
permitted to decode the entire string array in memory, random access to
the encoded material might be key. That could reduce the encoding
efficiency.
|
|
0
|
|
|
|
Reply
|
Walter
|
4/5/2010 3:27:47 AM
|
|
Walter Roberson wrote:
> On 10-04-04 8:50 AM, noemailplease wrote:
>
>> I need to "convert" (encrypt, hash, mash, whatever you want to call
>> this) a string into a 4-byte integer (values from -2147483648 to
>> 2147483647) but also be able to "recover" the input string e.g.
>
>> I think what is making this more complicated than it needs to be is that
>> there are more than 2^256 possible string input combinations so there is
>> going to be a real risk of "collisions".
>
> What you ask is impossible in the general case. If there are N items
> that have to be stored in M slots, and M < N, then by the "Pigeon Hole
> Principle", at least two items are going to end up in the same slot.
> Notice that this argument depends only on the number of slots, and not
> only how fancifully or briefly the slots are named.
>
> However, if your programs never in fact use more than 2^32 different
> input values, and if you allow string encoding data to be stored and
> retrieved by all the appropriate programs, then it is entirely possible
> that the encoding table, appropriately done, could require less space
> than the original strings.
>
> Do I understand correctly that the match between particular 4-byte
> integers and corresponding strings will be defined by some process that
> is out of the control of the compression algorithm? If so it should be
> possible to work that in, but it will not be as efficient as otherwise.
>
> A key technical question for your purposes is the need for random access
> to the strings, and the time constraints. For example if you only need
> to output the corresponding strings once during a "report phase", then
> decoding time of the string table or a need to decode sequentially would
> not be important, whereas for more interactive programs that are not
> permitted to decode the entire string array in memory, random access to
> the encoded material might be key. That could reduce the encoding
> efficiency.
Thanks for the reply Walter.
To demonstrate more what I am hoping to achieve: The data I am storing
is the results of a series of processes which has a "name" as given to
it by the designer, a string; I have very little control over the actual
name used but I can impose some limits (such as length, character set etc).
Once the results array has been analysed, I want to see which process
generated that set of results. e.g. if the "best" set of data is on row
'x' of my array, then I want to read the value in the first column of
that row to identify the associated process and return this as plain text.
e.g.
Process Test1 Test2 ... TestN
1234567 100 94 100
3456789 -23 -2 22
....
9999999 100 1 50
In my example, let's say my analysis says the second row is the "best"
result, then I want to be able to recover "3456789" as the string say,
"Jones.Alutian.Sunrise.v27.1" (or whatever the designer called their
process).
Efficiency is not really an issue, as the string will only be "mashed"
and "unmashed" once per session.
I know I could use some sort of lookup table or other referential
system, but I would really like to keep all the data together in the one
place, if possible, and due to external programming constraints the data
must be a 4-byte integer.
Does this help? I hope so. Thanks for your time so far.
|
|
0
|
|
|
|
Reply
|
noemailplease
|
4/5/2010 7:11:52 AM
|
|
|
3 Replies
350 Views
(page loaded in 0.12 seconds)
|