A routine to calculate Shannon's information entropy. I'm open to improvements and suggestions.

Code is based on the formula and results here:
http://www.shannonentropy.netmark.pl
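
For reference, the quantity the script below computes can be written out as follows. This is the standard Shannon entropy formula as the code implements it, with N the message length and n_i the count of the i-th distinct character. The symbol names are chosen here for illustration and are not taken from the linked page.

```latex
H(X) = -\sum_{i} p_i \log_2 p_i, \qquad p_i = \frac{n_i}{N}, \qquad H_{\text{metric}} = \frac{H(X)}{N}
```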


entropy.py
======================================================================
import sys,math,string

msg=sys.argv[1]

#use only letters and numbers
if not msg.isalnum():
	print("Error: use letters and numbers only");
	exit(0)

msglen=len(msg)
print("\n%s characters in '%s'\n" % (msglen,msg))

i=1; Hrow=0.0; Htotal=0.0

#count and print details of each character in message
print(" #   Char  Freq   Dist    D*log2(D)  H(X) sum")
print("---  ----  ----  ------  ----------  ----------")
def printHrows(submsg):
	global i,Hrow,Htotal
	for t in sorted(set([(x,submsg.count(x)) for x in submsg])):
		char = t[0];	
		freq = t[1];
		dist = freq/float(msglen)
		Hrow = dist * (math.log(dist)/math.log(2))
		Htotal += Hrow;
		print(" %s    %s     %s    %.3f    %.3f     %.5f" % 
(i,char,freq,round(dist,3),Hrow,-1*Htotal))
		i+=1

#for sorting - process upper,lower,numbers separately
printHrows(filter(str.isupper, msg))
printHrows(filter(str.islower, msg))
printHrows(filter(str.isdigit, msg))
		
print("\nThe Shannon entropy of your message is %.5f" % (-1*Htotal))
print("The metric entropy of your message is  %.5f" % 
(float(-1*Htotal)/msglen))
======================================================================
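
A note on Python versions: the script relies on Python 2 behaviour, where filter() applied to a str returns a str. On Python 3, filter() returns an iterator, so submsg.count(x) raises AttributeError. A minimal sketch of a version-neutral workaround, assuming nothing else in the script changes, is to join the filtered characters back into a string. The helper name `only` is made up for illustration.

```python
def only(predicate, text):
    """Return the characters of text that satisfy predicate, as a str."""
    # "".join() gives a str on both Python 2 and Python 3,
    # so .count() and repeated iteration keep working.
    return "".join(filter(predicate, text))


msg = "Abbcccdddd1223334444"
upper = only(str.isupper, msg)
lower = only(str.islower, msg)
digits = only(str.isdigit, msg)
print(upper, lower, digits)
```

With this helper, the three calls at the end of the script would become printHrows(only(str.isupper, msg)) and so on.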


$ python entropy.py Abbcccdddd1223334444

20 characters in 'Abbcccdddd1223334444'

  #   Char  Freq   Dist    D*log2(D)  H(X) sum
---  ----  ----  ------  ----------  ----------
  1    A     1    0.050    -0.216     0.21610
  2    b     2    0.100    -0.332     0.54829
  3    c     3    0.150    -0.411     0.95883
  4    d     4    0.200    -0.464     1.42322
  5    1     1    0.050    -0.216     1.63932
  6    2     2    0.100    -0.332     1.97151
  7    3     3    0.150    -0.411     2.38205
  8    4     4    0.200    -0.464     2.84644

The Shannon entropy of your message is 2.84644
The metric entropy of your message is  0.14232
======================================================================
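
The totals above can be cross-checked independently of the script. The sketch below is not the original code: it assumes Python 3 and uses collections.Counter and math.log2 to recompute both figures for the same message.

```python
import math
from collections import Counter

msg = "Abbcccdddd1223334444"
n = len(msg)

# Shannon entropy: H = -sum(p * log2(p)) over the distinct symbols.
shannon = -sum((k / n) * math.log2(k / n) for k in Counter(msg).values())
# The script's "metric entropy" is H divided by the message length.
metric = shannon / n

print("%.5f %.5f" % (shannon, metric))  # 2.84644 0.14232
```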

Seems to work fine.

Any suggestions to improve the code?

DFS
12/15/2016 10:21:58 PM
comp.lang.python

On Fri, 16 Dec 2016 09:21 am, DFS wrote:

> Code is based on the formula and results here:
> http://www.shannonentropy.netmark.pl
[...]
> Seems to work fine.
> 
> Any suggestions to improve the code?


- Separate the calculation logic from the display logic as much
  as practical.

- Modular programming: use functions to make it easier to test and easier
  to modify the program in the future, e.g. to add command line options.

- Use standard tools where possible.

- Better, more descriptive names.

- Avoid global variables unless really needed.

- More error checking.

- Errors should exit with a non-zero return code, and should print to
  stderr rather than stdout.

You end up with about twice as much code, but hopefully it is easier to
understand, and it should certainly be easier to debug and maintain if you
decide to change it.



# --- cut ---

# only needed in Python 2
from __future__ import division

import sys
import math

from collections import Counter


def fatal_error(errmsg):
    # Python 3 syntax
    print(errmsg, file=sys.stderr)
    ## Python 2 syntax
    ## print >>sys.stderr, errmsg
    sys.exit(1)


def display_header(msg):
    print()
    print("%s characters in '%s'" % (len(msg), msg))
    print()
    print(" #   Char  Freq   Dist    D*log2(D)  H(X) sum")
    print("---  ----  ----  ------  ----------  ----------")


def display_row(row_number, c, freq, dist, entropy, running_total):
    args = (row_number, c, freq, dist, entropy, running_total)
    template = "%3d  %-4c  %4d  %6.3f  %10.3f  %10.5f"
    print(template % args)


def entropy(c, freqs, num_symbols):
    """Return character c's contribution to the entropy, per table freqs."""
    f = freqs[c]/num_symbols
    return -f*math.log(f, 2)


def display_results(freqs, Hs, num_symbols):
    """Display results including entropy of each symbol.

    Returns the total entropy of the message.
    """
    # Display rows with uppercase first, then lower, then digits.
    upper = sorted(filter(str.isupper, freqs))
    lower = sorted(filter(str.islower, freqs))
    digits = sorted(filter(str.isdigit, freqs))
    assert set(upper + lower + digits) == set(freqs)
    count = 1
    running_total = 0.0
    for chars in (upper, lower, digits):
        for c in chars:
            f = freqs[c]
            H = Hs[c]
            running_total += H
            display_row(count, c, f, f/num_symbols, H, running_total)
            count += 1
    total = running_total
    print()
    print("The Shannon entropy of your message is %.5f" % total)
    print("The metric entropy of your message is %.5f"% (total/num_symbols))
    return total


def main(args=None):
    if args is None:
        args = sys.argv[1:]
    if len(args) != 1:
        fatal_error("too many or too few arguments")
    msg = args[0]
    if not msg.isalnum():
        fatal_error("only alphanumeric symbols supported")
    display_header(msg)
    frequencies = Counter(msg)
    num_symbols = len(msg)
    # Calculate the entropy of each symbol and the total entropy.
    entropies = {}
    for c in frequencies:
        H = entropy(c, frequencies, num_symbols)
        entropies[c] = H
    total = display_results(frequencies, entropies, num_symbols)



if __name__ == "__main__":
    # Only run when module is being used as a script.
    main()


# --- cut ---
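
One practical payoff of routing errors through a single function that calls sys.exit(1) is that the failure path becomes testable from Python itself, because sys.exit raises SystemExit. A minimal sketch, assuming Python 3.5 or later and re-declaring a fatal_error like the one above so the snippet is self-contained. `check_message` and `run` are hypothetical helpers, not part of the posted code.

```python
import contextlib
import io
import sys


def fatal_error(errmsg):
    """Print errmsg to stderr and exit with a non-zero status."""
    print(errmsg, file=sys.stderr)
    sys.exit(1)


def check_message(msg):
    """Hypothetical validator: reject anything that is not alphanumeric."""
    if not msg.isalnum():
        fatal_error("only alphanumeric symbols supported")
    return msg


def run(msg):
    """Return (exit_code, stderr_text) for a validation attempt."""
    captured = io.StringIO()
    try:
        with contextlib.redirect_stderr(captured):
            check_message(msg)
    except SystemExit as exc:
        # sys.exit(1) surfaces here as SystemExit with code 1.
        return exc.code, captured.getvalue()
    return 0, captured.getvalue()


bad = run("not valid!")
good = run("Abbcccdddd1223334444")
print(bad, good)
```

The same check is much harder against a script that prints its error to stdout and exits with status 0, since a caller cannot tell failure from success.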





-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.

Steve
12/16/2016 5:38:20 AM
Thanks for this.  I'll reply with more tomorrow (Sat).




DFS
12/17/2016 4:53:13 AM
On 12/16/2016 12:38 AM, Steve D'Aprano wrote:
> [...]


Wow.  That's a whole different mindset, where 24 lines of code becomes a 
main plus 5 functions.

I get your reasoning, but in my eyes you made something simple very 
complicated.  For instance, you end up repeating
"count, c, f, f/num_symbols, H, running_total" three times.

And it runs half as fast as the original py27 code.
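
A speed comparison like this can be reproduced with the standard timeit module. The sketch below is not either poster's code: it times two stand-in ways of counting characters (repeated str.count versus collections.Counter) purely to show the measurement pattern, and the numbers will vary by machine and Python version.

```python
import timeit
from collections import Counter

msg = "Abbcccdddd1223334444"


def count_with_str_count(text):
    # One pass over the text per distinct character.
    return {ch: text.count(ch) for ch in set(text)}


def count_with_counter(text):
    # A single pass over the text.
    return dict(Counter(text))


# Both approaches must agree before timing them is meaningful.
assert count_with_str_count(msg) == count_with_counter(msg)

for func in (count_with_str_count, count_with_counter):
    seconds = timeit.timeit(lambda: func(msg), number=10000)
    print("%-22s %.4f s per 10000 calls" % (func.__name__, seconds))
```

For a message this short, interpreter start-up and printing are likely to dominate a whole-script timing, so timing the calculation in isolation gives a fairer comparison.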

The args and template separation is nice.  The vertical layout makes it 
easier to follow when you have a lot of vars to print.  I'll use that 
method for sure.

Thanks for the analysis and code reconstitution!  I updated it using 
some of your ideas; I'll post it in a couple days.

DFS
12/18/2016 3:36:12 AM