How to convert bytearray into integer?

Hi there,

Recently I've been facing the problem of converting 4 bytes in a bytearray
into a 32-bit integer.  As far as I can see, there are 3 ways: a) using the
struct module, b) using the ctypes module, and c) manual manipulation.

Are there any other ways?

My sample is as follows:

-----
import struct
import ctypes

def test_struct(buf, offset):
  return struct.unpack_from("I", buf, offset)[0]

def test_ctypes(buf, offset):
  return ctypes.c_uint32.from_buffer(buf, offset).value

def test_multi(buf, offset):
  return (buf[offset] + (buf[offset+1] << 8) + (buf[offset+2] << 16) +
          (buf[offset+3] << 24))

buf_w = bytearray(5)
buf_w[1] = 1
buf_r = buffer(buf_w)

if __name__ == '__main__':
  import timeit

  t1 = timeit.Timer("test_struct(buf_r, 1)",
                    "from __main__ import test_struct, buf_r")
  t2 = timeit.Timer("test_ctypes(buf_w, 1)",
                    "from __main__ import test_ctypes, buf_w")
  t3 = timeit.Timer("test_multi(buf_w, 1)",
                    "from __main__ import test_multi, buf_w")
  print t1.timeit(number=1000)
  print t2.timeit(number=1000)
  print t3.timeit(number=1000)
-----

Yet the results are a bit confusing:

-----
number = 10000
0.0081958770752
0.012549161911
0.0112121105194

number = 1000
0.00087308883667
0.00125789642334
0.00110197067261

number = 100
0.0000917911529541     # 9.17911529541e-05
0.000133991241455
0.00011420249939

number = 10
1.69277191162e-05
2.19345092773e-05
1.69277191162e-05

number = 1
1.00135803223e-05
1.00135803223e-05
5.96046447754e-06
-----

As the number of benchmarking loops decreases, method c (manual
manipulation) beats the other two methods.  However, if
number == 10K, the struct method wins.

Why does it happen?

Thanks,
Jacky (jacky.chao.wang#gmail.com)
Reply Jacky 8/16/2010 5:06:57 PM
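
The sample above targets Python 2 (buffer() and the print statement). For
readers on Python 3, here is a minimal sketch of the same three approaches.
The function names read_struct, read_ctypes and read_manual are made up for
this illustration, and the explicit '<I' (little-endian) format is an
assumption chosen so that the struct and manual versions agree on any platform.

```python
import ctypes
import struct


def read_struct(buf, offset):
    # '<I' = little-endian unsigned 32-bit integer (assumed byte order).
    return struct.unpack_from("<I", buf, offset)[0]


def read_ctypes(buf, offset):
    # c_uint32 uses the machine's native byte order and needs a
    # writable buffer, so pass the bytearray itself.
    return ctypes.c_uint32.from_buffer(buf, offset).value


def read_manual(buf, offset):
    # Assemble the value by hand, least significant byte first.
    return (buf[offset]
            | (buf[offset + 1] << 8)
            | (buf[offset + 2] << 16)
            | (buf[offset + 3] << 24))


buf = bytearray(5)
buf[1] = 1

print(read_struct(buf, 1), read_manual(buf, 1), read_ctypes(buf, 1))
```

Note that the ctypes result only matches the other two on a little-endian
machine, because it follows the native byte order.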

On Monday 16 August 2010, it occurred to Jacky to exclaim:
> Hi there,
> 
> Recently I'm facing a problem to convert 4 bytes on an bytearray into
> an 32-bit integer.  So far as I can see, there're 3 ways:

> a) using struct module,

Yes, that's what it's for, and that's what you should be using.

> b) using ctypes module, and

Yeeaah, that would work, but that's really not what it's for. from_buffer 
wants a writable buffer interface, which is unlikely to be what you want.

> c) manually manipulation.

Well, yes, you can do that, but it gets messy when you're working with more 
complex data structures, or you have to consider byte order.

> Are there any other ways?

You could write a C extension module tailored to your specific purpose ;-) 

> number = 1
> 1.00135803223e-05
> 1.00135803223e-05
> 5.96046447754e-06
> -----
> 
> As the number of benchmarking loops decreasing, method c which is
> manually manipulating overwhelms the former 2 methods.  However, if
> number == 10K, the struct method wins.
> 
> Why does it happen?

struct wins because it's built for the job.

As for the small numbers: don't take these numbers seriously. Just don't. This 
may be caused by the way your OS's scheduler handles things for all I know. If 
there is an explanation for this unscientific observation, I have two guesses 
what it might be:
 * struct and ctypes still need to do some setup work, or something
 * somebody is optimising something, but doesn't know what they should be
   optimising in the first place after only a few iterations.

Reply Thomas 8/16/2010 5:50:14 PM
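
One way to make the small-number timings less noisy is to repeat the
measurement several times and keep the minimum, which is what the timeit
documentation suggests. The following is only a sketch for Python 3; the
helper best_time and the function read_struct are hypothetical names, and
the loop counts are arbitrary.

```python
import struct
import timeit

buf = bytearray(5)
buf[1] = 1


def read_struct(buf, offset):
    # Little-endian unsigned 32-bit read at the given offset.
    return struct.unpack_from("<I", buf, offset)[0]


def best_time(func, number=10000, repeat=5):
    # Run `repeat` independent timings of `number` calls each and keep
    # the fastest one; slower runs mostly reflect interference from the
    # OS and other processes rather than the code under test.
    timings = timeit.repeat(lambda: func(buf, 1), number=number, repeat=repeat)
    return min(timings) / number


per_call = best_time(read_struct)
print("struct: %.3g s per call" % per_call)
```

With number=1 or number=10 a single timer tick or a one-off setup cost can
dominate the result, so comparisons at that scale say very little.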


Hi Thomas,

Thanks for your comments!  Please check mine inline.

On Aug 17, 1:50 am, Thomas Jollans <tho...@jollybox.de> wrote:
> On Monday 16 August 2010, it occurred to Jacky to exclaim:
>
> > Hi there,
>
> > Recently I'm facing a problem to convert 4 bytes on an bytearray into
> > an 32-bit integer.  So far as I can see, there're 3 ways:
> > a) using struct module,
>
> Yes, that's what it's for, and that's what you should be using.

My concern is that struct may need to parse the format string,
construct the list, and de-reference index=0 of this generated list
to get the int out.

Shouldn't there be a more efficient way?

>
> > b) using ctypes module, and
>
> Yeeaah, that would work, but that's really not what it's for. from_buffer
> wants a writable buffer interface, which is unlikely to be what you want.

Actually my buffer is writable --- it's a bytearray.  Turning it into
a read-only one takes extra effort: wrapping the bytearray in
buffer().

My question is: this operation seems much simpler than the
former one, and it's very straightforward as well.  Why is it slow?

>
> > c) manually manipulation.
>
> Well, yes, you can do that, but it gets messy when you're working with more
> complex data structures, or you have to consider byte order.

agree. :)

>
> > Are there any other ways?
>
> You could write a C extension module tailored to your specific purpose ;-)

Ha, yes.  Actually I've already modified socketmodule.c myself ---
it's hard to imagine why the socket object provides the interface
socket.recv_from(buf[, num_bytes[, flags]]) but forgets the more
generic one: socket.recv_from(buf[, offset[, num_bytes[, flags]]])

The same goes for socket.send(...).

>
> > number = 1
> > 1.00135803223e-05
> > 1.00135803223e-05
> > 5.96046447754e-06
> > -----
>
> > As the number of benchmarking loops decreasing, method c which is
> > manually manipulating overwhelms the former 2 methods.  However, if
> > number == 10K, the struct method wins.
>
> > Why does it happen?
>
> struct wins because it's built for the job.
>
> As for the small numbers: don't take these numbers seriously. Just don't. This
> may be caused by the way your OS's scheduler handles things for all I know. If
> there is an explanation for this unscientific observation, I have two guesses
> what it might be:
>  * struct and ctypes still need to do some setup work, or something
>  * somebody is optimising something, but doesn't know what they should be
>    optimising in the first place after only a few iterations.

Agree.  Thanks.

- Jacky
Reply jacky.chao.wang (3) 8/16/2010 7:08:34 PM

On Aug 16, 8:08 pm, Jacky <jacky.chao.w...@gmail.com> wrote:
> Hi Thomas,
>
> Thanks for your comments!  Please check mine inline.
>
> On Aug 17, 1:50 am, Thomas Jollans <tho...@jollybox.de> wrote:
>
> > On Monday 16 August 2010, it occurred to Jacky to exclaim:
>
> > > Hi there,
>
> > > Recently I'm facing a problem to convert 4 bytes on an bytearray into
> > > an 32-bit integer.  So far as I can see, there're 3 ways:
> > > a) using struct module,
>
> > Yes, that's what it's for, and that's what you should be using.
>
> My concern is that struct may need to parse the format string,
> construct the list, and de-reference index=0 for this generated list
> to get the int out.
>
> There should be some way more efficient?

Well, you can improve on the struct solution by using the
struct.Struct class to avoid parsing the format string repeatedly:

>>> import struct
>>> S = struct.Struct('<I')
>>> S.unpack_from(buffer(bytearray([1,2,3,4,5])))
(67305985,)

This doesn't make a huge difference on my machine (OS X 10.6.4, 64-bit
build of Python 2.6) though;  it's probably more effective for long
format strings. Adding:

def test_struct2(buf, offset, S=struct.Struct('<I')):
    return S.unpack_from(buf, offset)[0]

to your test code, I see a speedup of around 8% over your test_struct.

By the way, you may want to consider using an explicit byte-order/size
marker in your format string;  i.e., use '<I' instead of 'I'.  This
forces a 4-byte little-endian interpretation, regardless of the
platform you're running Python on.

--
Mark
Reply dickinsm (350) 8/16/2010 7:36:26 PM
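
To make the byte-order advice concrete, here is a small sketch for Python 3
that reads the same four bytes with an explicit little-endian and an explicit
big-endian precompiled Struct. The names LE_U32 and BE_U32 are invented for
this example.

```python
import struct

data = bytearray([1, 2, 3, 4, 5])

# Precompiled Struct objects with an explicit byte-order marker, so the
# result does not depend on the machine the code runs on.
LE_U32 = struct.Struct("<I")  # little-endian, 4 bytes
BE_U32 = struct.Struct(">I")  # big-endian, 4 bytes

little = LE_U32.unpack_from(data, 0)[0]  # bytes 01 02 03 04 -> 0x04030201
big = BE_U32.unpack_from(data, 0)[0]     # bytes 01 02 03 04 -> 0x01020304

print(little, big)  # 67305985 16909060
```

A bare 'I' would use the native byte order and alignment instead, which is
why the same file or network payload can decode differently on another
platform.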

On Monday 16 August 2010, it occurred to Jacky to exclaim:
> Hi Thomas,
> 
> Thanks for your comments!  Please check mine inline.
> 
> On Aug 17, 1:50 am, Thomas Jollans <tho...@jollybox.de> wrote:
> > On Monday 16 August 2010, it occurred to Jacky to exclaim:
> > > Hi there,
> > > 
> > > Recently I'm facing a problem to convert 4 bytes on an bytearray into
> > > an 32-bit integer.  So far as I can see, there're 3 ways:
> > > a) using struct module,
> > 
> > Yes, that's what it's for, and that's what you should be using.
> 
> My concern is that struct may need to parse the format string,
> construct the list, and de-reference index=0 for this generated list
> to get the int out.
> 
> There should be some way more efficient?

The struct module is written in C, not in Python. It does have to parse a 
string, yes, so, if you wrote your own, limited, C function to do the job, it 
might be marginally faster.

> 
> > > b) using ctypes module, and
> > 
> > Yeeaah, that would work, but that's really not what it's for. from_buffer
> > wants a writable buffer interface, which is unlikely to be what you want.
> 
> Actually my buffer is writable --- it's an bytearray.  Turning it into
> a R/O one make me to do extra effort: wrapping the bytearray into
> buffer().
> 
> My question is, this operation seems like to be much simpler than the
> former one, and it's very straightforward as well.  Why is it slow?

Unlike struct, it constructs an object you're not actually interested in 
around your int.

> it's hard to image why socket object provides the interface:
> socket.recv_from(buf[, num_bytes[, flags]]) but forget the more
> generic one: socket.recv_from(buf[, offset[, num_bytes[, flags]]])

Well, that's what pointer arithmetic (in C) or slices (in Python) are for! 
There's an argument to be made for sticking close to the traditional 
(originally C) interface here - it's familiar.


 - Thomas
Reply Thomas 8/16/2010 7:38:00 PM
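
The "object around your int" can be seen directly: from_buffer returns a
ctypes instance that shares memory with the bytearray rather than a plain
Python int. A short sketch for Python 3 follows; the variable names are
illustrative, and the value 0xFFFFFFFF is chosen so the result does not
depend on byte order.

```python
import ctypes

buf = bytearray(8)

# from_buffer builds a c_uint32 object that is a live view onto
# bytes 4..7 of buf; no data is copied.
view = ctypes.c_uint32.from_buffer(buf, 4)

# Writing through the view changes the underlying bytearray.
view.value = 0xFFFFFFFF
readback = view.value  # reading .value converts to a Python int

# Drop the view so the bytearray is no longer pinned by the export.
del view

print(readback, bytes(buf))
```

This is handy when you want a mutable, typed window onto a buffer, but it is
extra work when all you need is a single integer.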

On Aug 16, 8:36 pm, Mark Dickinson <dicki...@gmail.com> wrote:
> On Aug 16, 8:08 pm, Jacky <jacky.chao.w...@gmail.com> wrote:
> > My concern is that struct may need to parse the format string,
> > construct the list, and de-reference index=0 for this generated list
> > to get the int out.
>
> > There should be some way more efficient?
>
> Well, you can improve on the struct solution by using the
> struct.Struct class to avoid parsing the format string repeatedly:
>
> >>> import struct
> >>> S = struct.Struct('<I')
> >>> S.unpack_from(buffer(bytearray([1,2,3,4,5])))
>
> (67305985,)
>
> This doesn't make a huge difference on my machine (OS X 10.6.4, 64-bit
> build of Python 2.6) though;  it's probably more effective for long
> format strings.

Sorry, this was inaccurate:  this makes almost *no* significant
difference on my machine for large test runs (10000 and up).  For
small ones, though, it's faster.  The reason is that the struct module
caches (up to 100, in the current implementation) previously used
format strings, so with your tests you're only ever parsing the format
string once anyway.  Internally, the struct module converts that
format string to a Struct object, and squirrels that Struct object
away into its cache, which is implemented as a dict from format
strings to Struct objects.  So the next time that the format string is
used it's simply looked up in the cache, and the Struct object
retrieved.

By the way, in Python 3.2 there's yet another fun way to do this,
using int.from_bytes.

>>> int.from_bytes(bytearray([1,2,3,4]), 'little')
67305985

--
Mark
Reply dickinsm (350) 8/16/2010 7:53:21 PM
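
int.from_bytes takes no offset argument, so reading from the middle of a
buffer means slicing first. A minimal sketch for Python 3.2 or later, with
made-up variable names, checking the result against struct:

```python
import struct

buf = bytearray([0, 1, 2, 3, 4])
offset = 1

# Slice out the four bytes of interest, then convert them.
value = int.from_bytes(buf[offset:offset + 4], "little")

# The same read with struct, for comparison.
expected = struct.unpack_from("<I", buf, offset)[0]

# signed=True interprets the bytes as a two's-complement integer.
minus_one = int.from_bytes(b"\xff\xff\xff\xff", "little", signed=True)

print(value, expected, minus_one)  # 67305985 67305985 -1
```

Slicing a bytearray copies the four bytes; wrapping the buffer in a
memoryview before slicing avoids that copy if it matters.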

Hi Mark,

Thanks for your reply.  Agree and I'll use your suggestions.  Thanks!

-Jacky

Reply jacky.chao.wang (3) 8/17/2010 12:57:45 AM

On Aug 17, 3:38 am, Thomas Jollans <tho...@jollybox.de> wrote:
> On Monday 16 August 2010, it occurred to Jacky to exclaim:
> > it's hard to image why socket object provides the interface:
> > socket.recv_from(buf[, num_bytes[, flags]]) but forget the more
> > generic one: socket.recv_from(buf[, offset[, num_bytes[, flags]]])
>
> Well, that's what pointer arithmetic (in C) or slices (in Python) are for!
> There's an argument to be made for sticking close to the traditional
> (originally C) interface here - it's familiar.

Hi Thomas, I don't quite follow you.  It would be great if you could
show me some code on this part...

Reply Jacky 8/17/2010 1:00:09 AM

On Aug 17, 3:53 am, Mark Dickinson <dicki...@gmail.com> wrote:
> By the way, in Python 3.2 there's yet another fun way to do this,
> using int.from_bytes.
>
> >>> int.from_bytes(bytearray([1,2,3,4]), 'little')

Thanks!  It looks pretty much like the ctypes way. ;)

>
> 67305985
Reply jacky.chao.wang (3) 8/17/2010 1:05:26 AM

On Tuesday 17 August 2010, it occurred to Jacky to exclaim:
> On Aug 17, 3:38 am, Thomas Jollans <tho...@jollybox.de> wrote:
> > On Monday 16 August 2010, it occurred to Jacky to exclaim:
> > > it's hard to image why socket object provides the interface:
> > > socket.recv_from(buf[, num_bytes[, flags]]) but forget the more
> > > generic one: socket.recv_from(buf[, offset[, num_bytes[, flags]]])
> > 
> > Well, that's what pointer arithmetic (in C) or slices (in Python) are
> > for! There's an argument to be made for sticking close to the
> > traditional (originally C) interface here - it's familiar.
> 
> Hi Thomas, - I'm not quite follow you.  It will be great if you could
> show me some code no this part...

When I originally wrote that, I didn't check the Python docs, I just had a 
quick look at the manual page.

This is the signature of the BSD-socket recv function: (recv(2))

       ssize_t recv(int sockfd, void *buf, size_t len, int flags);

so, to receive data into a buffer, you pass it the buffer pointer.

	len = recv(sock, buf, full_len, 0);

To receive more data into another buffer, you pass it a pointer further on:

	len = recv(sock, buf+len, full_len-len, 0);
	/* or, this might be clearer, but it's 100% the same: */
	len = recv(sock, & buf[len], full_len-len, 0);

Now, in Python. I assume you were referring to socket.recv_into:

		socket.recv_into(buffer[, nbytes[, flags]])

It's hard to imagine why this method exists at all. I think the recv method is 
perfectly adequate:

	buf = bytearray()
	buf[:] = sock.recv(full_len)
	# then:
	lngth = len(buf)
	buf[lngth:] = sock.recv(full_len - lngth)

But still, nothing's stopping us from using recv_into:

	# create a buffer large enough. Oh this is so C...
	buf = bytearray([0]) * full_len
	lngth = sock.recv_into(buf, length_of_first_bit)
	# okay, now let's fill the rest!
	sock.recv_into(memoryview(buf)[lngth:])

In C, you can point your pointers where ever you want. In Python, you can 
point your memoryview at buffers in any way you like, but there tend to be 
better ways of doing things.

Cheers,

	Thomas
Reply Thomas 8/20/2010 7:36:31 PM
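
The recv_into-plus-memoryview idea can be tried without a network. The sketch
below, for Python 3, uses socket.socketpair() purely as a stand-in for a real
connection (an assumption made for the demonstration) and fills one
preallocated bytearray at increasing offsets.

```python
import socket

# A connected pair of sockets stands in for a real peer.
a, b = socket.socketpair()
payload = b"0123456789"
a.sendall(payload)
a.close()

# Preallocate the buffer and wrap it so slices do not copy.
buf = bytearray(len(payload))
view = memoryview(buf)

received = 0
while received < len(buf):
    # view[received:] is the "offset": recv_into writes into the
    # unfilled tail of buf and returns the number of bytes stored.
    n = b.recv_into(view[received:])
    if n == 0:  # peer closed the connection early
        break
    received += n
b.close()

print(received, bytes(buf))
```

Slicing the memoryview plays the role that buf+len plays in the C version,
which is why recv_into does not need a separate offset parameter.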
