Noob : Basic reversible string to integer "encryption"

  • Follow


For some people out there this will be a "simple" task, but I have been 
pounding my head for ages on this with no luck, so I hope someone might 
help me out?

I need to "convert" (encrypt, hash, mash, whatever you want to call 
this) a string into a 4-byte integer (values from -2147483648 to 
2147483647) but also be able to "recover" the input string e.g.

foo = mash("This is the input string to be mashed")
       foo = [-2147483648 to 2147483647]
bar = unmash(foo)
       bar = "This is the input string to be mashed"

I think what is making this more complicated than it needs to be is that 
there are more than 2^256 possible string input combinations so there is 
going to be a real risk of "collisions".  This is for a simple home 
project, so if anyone has any ideas, such as limiting the input string 
length, dictionary or some other pre-conditioning, I'd love to hear 
them.  Thanks in advance.
0
Reply noemailplease 4/4/2010 1:50:36 PM

noemailplease wrote:
> For some people out there this will be a "simple" task, but I have been 
> pounding my head for ages on this with no luck, so I hope someone might 
> help me out?
> 
> I need to "convert" (encrypt, hash, mash, whatever you want to call 
> this) a string into a 4-byte integer (values from -2147483648 to 
> 2147483647) but also be able to "recover" the input string e.g.
> 
> foo = mash("This is the input string to be mashed")
>       foo = [-2147483648 to 2147483647]
> bar = unmash(foo)
>       bar = "This is the input string to be mashed"
> 
> I think what is making this more complicated than it needs to be is that 
> there are more than 2^256 possible string input combinations so there is 
> going to be a real risk of "collisions".  This is for a simple home 
> project, so if anyone has any ideas, such as limiting the input string 
> length, dictionary or some other pre-conditioning, I'd love to hear 
> them.  Thanks in advance.

I should have added why I need this: I have a simple data array of 
integers and I want to store the text with its corresponding integer 
values.  (The data is used in another external application too, so 
unfortunately I cannot change that) I can limit the input text within 
some limits but if I assume the text will contain all "normal" english 
keyboard characters in some order, I can limit the length if need be.

I think the only other option I have is to use two arrays in the hope 
they remain synchronised? i.e.
label[0] <--> data[0][0]...data[0][n]
label[1] <--> data[1][0]...data[1][n]

but I think that is very dangerous for the data.
0
Reply noemailplease 4/4/2010 2:11:48 PM


On 10-04-04 8:50 AM, noemailplease wrote:

> I need to "convert" (encrypt, hash, mash, whatever you want to call
> this) a string into a 4-byte integer (values from -2147483648 to
> 2147483647) but also be able to "recover" the input string e.g.

> I think what is making this more complicated than it needs to be is that
> there are more than 2^256 possible string input combinations so there is
> going to be a real risk of "collisions".

What you ask is impossible in the general case. If there are N items 
that have to be stored in M slots, and M < N, then by the "Pigeon Hole 
Principle", at least two items are going to end up in the same slot. 
Notice that this argument depends only on the number of slots, and not 
only how fancifully or briefly the slots are named.

However, if your programs never in fact use more than 2^32 different 
input values, and if you allow string encoding data to be stored and 
retrieved by all the appropriate programs, then it is entirely possible 
that the encoding table, appropriately done, could require less space 
than the original strings.

Do I understand correctly that the match between particular 4-byte 
integers and corresponding strings will be defined by some process that 
is out of the control of the compression algorithm? If so it should be 
possible to work that in, but it will not be as efficient as otherwise.

A key technical question for your purposes is the need for random access 
to the strings, and the time constraints. For example if you only need 
to output the corresponding strings once during a "report phase", then 
decoding time of the string table or a need to decode sequentially would 
not be important, whereas for more interactive programs that are not 
permitted to decode the entire string array in memory, random access to 
the encoded material might be key. That could reduce the encoding 
efficiency.
0
Reply Walter 4/5/2010 3:27:47 AM

Walter Roberson wrote:
> On 10-04-04 8:50 AM, noemailplease wrote:
> 
>> I need to "convert" (encrypt, hash, mash, whatever you want to call
>> this) a string into a 4-byte integer (values from -2147483648 to
>> 2147483647) but also be able to "recover" the input string e.g.
> 
>> I think what is making this more complicated than it needs to be is that
>> there are more than 2^256 possible string input combinations so there is
>> going to be a real risk of "collisions".
> 
> What you ask is impossible in the general case. If there are N items 
> that have to be stored in M slots, and M < N, then by the "Pigeon Hole 
> Principle", at least two items are going to end up in the same slot. 
> Notice that this argument depends only on the number of slots, and not 
> only how fancifully or briefly the slots are named.
> 
> However, if your programs never in fact use more than 2^32 different 
> input values, and if you allow string encoding data to be stored and 
> retrieved by all the appropriate programs, then it is entirely possible 
> that the encoding table, appropriately done, could require less space 
> than the original strings.
> 
> Do I understand correctly that the match between particular 4-byte 
> integers and corresponding strings will be defined by some process that 
> is out of the control of the compression algorithm? If so it should be 
> possible to work that in, but it will not be as efficient as otherwise.
> 
> A key technical question for your purposes is the need for random access 
> to the strings, and the time constraints. For example if you only need 
> to output the corresponding strings once during a "report phase", then 
> decoding time of the string table or a need to decode sequentially would 
> not be important, whereas for more interactive programs that are not 
> permitted to decode the entire string array in memory, random access to 
> the encoded material might be key. That could reduce the encoding 
> efficiency.


Thanks for the reply Walter.

To demonstrate more what I am hoping to achieve: The data I am storing 
is the results of a series of processes which has a "name" as given to 
it by the designer, a string; I have very little control over the actual 
name used but I can impose some limits (such as length, character set etc).

Once the results array has been analysed, I want to see which process 
generated that set of results. e.g. if the "best" set of data is on row 
'x' of my array, then I want to read the value in the first column of 
that row to identify the associated process and return this as plain text.

e.g.
Process  Test1  Test2  ... TestN
1234567  100    94         100
3456789  -23    -2         22
....
9999999  100    1          50

In my example, let's say my analysis says the second row is the "best" 
result, then I want to be able to recover "3456789" as the string say, 
"Jones.Alutian.Sunrise.v27.1" (or whatever the designer called their 
process).

Efficiency is not really an issue, as the string will only be "mashed" 
and "unmashed" once per session.

I know I could use some sort of lookup table or other referential 
system, but I would really like to keep all the data together in the one 
place, if possible, and due to external programming constraints the data 
must be a 4-byte integer.

Does this help?  I hope so.  Thanks for your time so far.
0
Reply noemailplease 4/5/2010 7:11:52 AM

3 Replies
350 Views

(page loaded in 0.12 seconds)

Similiar Articles:





7/25/2012 6:34:42 PM


Reply: