socket flushing/buffering problem, app hangs on close

I've written a TCL app that receives data from a single TCP source and
distributes this data to multiple TCP receivers using a very simple
ASCII protocol. The server is non-blocking and uses TCL's event loop.
Most of the receivers are not under my control and sometimes behave
poorly: I don't have access to their code and, in some cases, not even
to the owners of those applications.

Here is my problem.

TCL has called my writable handler, indicating that a channel is ready
for data. I write data to the channel, but at some point the client
stops reading without closing the connection. TCP's flow control kicks
in and data ends up buffered in the receiver's TCP input buffer, my
host's TCP output buffer and finally my application's TCL channel
output buffer.

If at this point I connect to another port and issue a command for my
application to shut down, it hangs. I forced a core dump and saw that
it is hanging in send(). The man page for TCL's close says that TCL
will put the channel into blocking mode and attempt to flush any
remaining data; the interpreter does this for each open channel when
exit is called. However, if the TCP stack is not accepting data, the
application will never be able to exit, or for that matter close the
channel without exiting. This appears to be a pretty serious bug. I
need to 'kill -9' in order to force an exit... very ugly. It seems that
what is needed is an option to the close command that discards any data
buffered in the TCL channel's output buffer and closes the channel.

I coded a small extension in C that closes the OS-specific handle for
the channel and then unregisters the channel from the interpreter. This
causes send() to return -1, but the interpreter doesn't care at that
point and shutdown continues successfully.

Has anyone else run into this? Am I totally missing something here?

BTW I'm using TCL 8.4 on Linux and HP-UX, but from a review of the
current 8.5 API it seems like this deadlock could still exist.

Any input/ideas are greatly appreciated,
Wayne
Reply WC 2/2/2010 1:29:23 AM

On Feb 1, 5:29 pm, WC <wcu...@cox.net> wrote:

Right, so it sounds like you wrote an application which gets
stuck...probably due to poor coding. It also sounds like you ran it in
the background so you couldn't control it except via signals. The TCP
connection should still time out if you let it sit long enough.

BTW, a channel becomes readable/writable if an error occurs; it is
something of a blunt indicator. In this case it sounds like the
application is simply waiting around to send or receive data. I'm not
sure how this adds up to a bug.

Reply tom 2/2/2010 2:51:00 AM


tom.rmadilo wrote:
> Right, so it sounds like you wrote an application which gets
> stuck...probably due to poor coding. It also sounds like you ran it in
> the background so you couldn't control it except via signals. The TCP
> connection should still time out if you let it sit long enough.
>
> BTW, a channel becomes readable/writable if an error occurs; it is
> something of a blunt indicator. In this case it sounds like the
> application is simply waiting around to send or receive data. I'm not
> sure how this adds up to a bug.

Did you even read my post or were you just looking for someone to criticize?

1) Backgrounding does not imply that an application can only be
controlled via signals. In fact, as stated in my message, I'm using a
control socket on another port to send the app a stop message. But this
is beside the point; I'm not sure why you brought it up.

2) You need to go back and study blocking sockets. If the remote end
stops reading data, the TCP buffers on both ends are full and you
attempt to write more data, the write will block until the remote end
either begins to read data, thus clearing the buffers, or closes the
connection. Neither is happening. There is no timeout to wait for; TCP
is operating as designed in this case.

3) I know about read/write handlers; I have both installed on these
channels. The write handler is not getting called because the remote end
is not reading, and the read handler is not getting called because the
remote end is neither closing the socket nor sending my application any
data. I know this because netstat on my host shows around 40K in the TCP
send queue and the connection is in the ESTABLISHED state.

Perhaps "bug" is too strong a word; it appears that TCL is operating as
designed. But there should be a way to close an output channel and
instruct TCL to discard any data it has left rather than attempt to
send it, for exactly the reason cited above. It is not a good design if
a remote machine can cause my application to hang while closing a
channel or exiting, simply because the interpreter insists on flushing
all data from its queues.

Reply WC 2/2/2010 6:56:35 AM

Can't you just close them manually?  Off hand:

foreach sock [chan names sock*] {
   # enables dump on close
   fconfigure $sock -blocking no
   close $sock
}


Reply David 2/2/2010 7:34:09 AM

David Gravereaux wrote:
> Can't you just close them manually?  Off hand:
> 
> foreach sock [chan names sock*] {
>    # enables dump on close
>    fconfigure $sock -blocking no
>    close $sock
> }
> 
Unfortunately not. If I do this while the application is running and
leave the socket non-blocking, TCL returns from close immediately and
tries to flush the data in the background. So the script layer "thinks"
it's closed, but a file descriptor stays allocated to the interpreter
forever. Many opens and closes with the bad server eventually cause
file descriptor starvation in the process.

When the application finally attempts to exit, it hangs, since it is the
interpreter's policy to flush and close all open channels before it
exits. So all those background flushes prevent it from exiting.

If I instead put the channel in blocking mode, I don't even get the
benefit of the interp attempting to close the channel in the
background. It hangs in close until the other side reads the data or
terminates the connection, which means that none of my other socket
handlers are serviced as they are in the non-blocking scenario.
Essentially the application appears to be locked up at that point.

I appreciate the suggestion though!
Thanks.





Reply WC 2/2/2010 7:52:10 AM

WC wrote:
> Unfortunately not, if I do this while the application is running and I 
> leave the socket non-blocking. TCL will return from close immediately 
> and try to flush the data in the background. So the script layer 
> "thinks" it's closed but a file descriptor is forever allocated to the 
> interpreter. Many opens and closes with the bad server eventually causes 
> file descriptor starvation in the process.

N.B.
I once wrote a "self scriptable" (not Tcl ;-) multiplexer in C for
distributing messages (duplicating, logging) between different
processes. (Most processes were run-of-the-mill tty/cmdline-oriented
programs.) Outgoing problems were handled via SIGPIPE. (That would not
have caught an unresponsive client either.)

Limit the problem: don't try to reconnect? A dead client is dead, dead, dead.
Would that work for you?
Limit buffering space?


uwe
Reply Uwe 2/2/2010 8:49:48 AM

On Feb 1, 7:29 pm, WC <wcu...@cox.net> wrote:

Why does this work?

Interp 1:
% socket -server accept 1515
sock5
% proc accept {socket clientAddr clientPort} {
        puts "Accepted $socket."
        puts $socket "hello"
        return
}
% after 60000 exit
after#0
% vwait forever
Accepted sock7.
Accepted sock8.
MacBookPro:~ paul$

Interp 2:
% socket localhost 1515
sock5
% close sock5
% socket localhost 1515
sock5
% close sock5
% exit


I ran 'exit' in Interp 2 before the 'after' was triggered in Interp 1.
As you can see tclsh exits fine for me. Or is there a flaw in this
test? I'm on Mac OS 10.4.
Reply PaulWalton 2/2/2010 9:19:16 AM

PaulWalton wrote:

> 
> Why does this work?
> 
> Interp 1:
> % socket -server accept 1515
> sock5
> % proc accept {socket clientAddr clientPort} {
> ...
 > ...
 > ...
> 
> 
> I ran 'exit' in Interp 2 before the 'after' was triggered in Interp 1.
> As you can see tclsh exits fine for me. Or is there a flaw in this
> test? I'm on Mac OS 10.4.

Hi Paul,

Well, you're close, but it is not a valid test. The TCL IO system's
buffers were able to flush before it exited because you sent a very
small amount of data. Though you didn't do a read in your client app,
TCL was able to empty its IO buffers into the TCP stack, so the OS
close succeeds.

My application is streaming data to a number of clients and it is not
unusual for it to build up half a meg of data rather quickly. For the
test to be valid, the receiver's TCP input queue needs to be full as
well as the sender's TCP output queue. With modern TCP stacks this can
be several hundred K of data. Only then will TCL begin to buffer data
in its interp's IO buffers. That will definitely cause TCL to block
when attempting to clear those buffers.

Thanks,
Wayne
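
A test along these lines would have to overfill both kernel buffers before Tcl is asked to exit. The following is only a sketch under assumptions: the proc name flood and the 32 MB figure are made up for illustration (any amount comfortably above the combined TCP buffers should do), and port 1515 is borrowed from the test quoted above. The hang at exit is the behaviour described in this thread, not something the sketch guarantees on every platform.

```tcl
# Server-side accept handler (hypothetical name): queue far more data
# than the kernel's TCP buffers are likely to hold.
proc flood {sock addr port} {
    fconfigure $sock -blocking 0 -translation binary
    set chunk [string repeat x 65536]
    # 512 x 64 KB = 32 MB, an arbitrary amount. In non-blocking mode
    # puts returns at once; whatever the kernel will not take stays in
    # Tcl's channel output buffer.
    for {set i 0} {$i < 512} {incr i} {
        puts -nonewline $sock $chunk
    }
    puts "queued 32 MB on $sock"
}

# To try it, in the style of the two-interpreter test above:
#   Interp 1:  socket -server flood 1515
#              after 10000 exit
#              vwait forever
#   Interp 2:  socket localhost 1515   ;# never read, leave it running
```

If the analysis above holds, Interp 1 should fail to exit for as long as Interp 2 keeps the unread connection open.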
Reply WC 2/2/2010 2:56:36 PM

Uwe Klein wrote:
>
> N.B.
> I once wrote a "self scriptable" (not Tcl ;-) multiplexer in C for
> distributing messages (duplicating, logging) between different
> processes. (Most processes were run-of-the-mill tty/cmdline-oriented
> programs.) Outgoing problems were handled via SIGPIPE. (That would not
> have caught an unresponsive client either.)
>
> Limit the problem: don't try to reconnect? A dead client is dead, dead,
> dead.
> Would that work for you?
> Limit buffering space?
>
>
> uwe

LOL, that would work for me... But not for my boss:( We get paid for the
data we send them. Yes, this is an annoying scenario, since the problem
is the customer's application. But it is what it is.

I'm attempting to replace a version of this same application that I
wrote in C a few years ago. But it doesn't make a very strong case if I
need to include a C extension with the script in order to terminate
badly behaving clients:( The C application happily frees its back
queue, sets the TCP linger timer to 0 and closes the socket. It then
reconnects and sends the client data... until the client app stops
responding again.
Reply WC 2/2/2010 3:50:31 PM

On Feb 2, 2:29 am, WC <wcu...@cox.net> wrote:
>
> [...]
> will put the channel into blocking mode and attempt to flush the channel
> of any remaining data, the interpreter does this for each open channel
> when exit is called. However if the TCP stack is not accepting data the
> application will never be able to exit or close channels without exiting
> for that matter. This appears to be a pretty serious bug. I need to
> 'kill -9' in order to force an exit... very ugly. Seems like what is
> needed is an option to the close command to discard any data buffered in
> the TCL channel's output buffer and close the channel.

You are absolutely right: that's a design flaw. If our hands were
free, we'd fix that instantly. The problem is the existing base of Tcl
apps... So we can only extend, not reform. Something like [chan
unflush], or [chan discard].

You can file a TIP for that; however,  in the meantime, you can use
the following workaround:

 set ff [open "|cat >@ $sok" w]
 # do writes on $ff, reads on $sok
 # you can still fconfigure $ff -blocking 0

 # now assume it's time to close
 exec kill -INT [pid $ff]
 catch {close $ff}

Ugly, eh ? Yup. Just one percent simpler than an extension ;-)

-Alex
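
For anyone wanting to reuse the trick, it can be wrapped in two small procs. This is only a sketch: the names openDiscardable and closeDiscarding are invented here, and it assumes a Unix-like system with cat and kill on the PATH. It sticks to commands that exist in Tcl 8.4.

```tcl
# Hypothetical helper: route all writes for socket $sok through a cat
# child, so Tcl's own output buffer sits in front of a pipe rather than
# in front of the socket.
proc openDiscardable {sok} {
    set ff [open "|cat >@ $sok" w]
    fconfigure $ff -blocking 0 -buffering none -translation binary
    return $ff
}

# Hypothetical helper: drop whatever is still queued and close both ends.
proc closeDiscarding {ff sok} {
    # Kill the helper first, so nothing is left to drain the pipe.
    foreach p [pid $ff] {
        catch {exec kill -INT $p}
    }
    # Pending output on $ff now fails instead of blocking.
    catch {close $ff}
    # Nothing was ever written to $sok by Tcl itself, so there is no
    # Tcl-level data to flush here.
    catch {close $sok}
}
```

Writes go to the channel returned by openDiscardable; reads and readable handlers stay on the socket itself, as in the snippet above.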


Reply Alexandre 2/2/2010 4:23:12 PM

On Feb 2, 8:23 am, Alexandre Ferrieux <alexandre.ferri...@gmail.com> wrote:
> You are absolutely right: that's a design flaw. If our hands were
> free, we'd fix that instantly. The problem is the existing base of Tcl
> apps... So we can only extend, not reform. Something like [chan
> unflush], or [chan discard].
>
> You can file a TIP for that; however, in the meantime, you can use
> the following workaround:
>
>  set ff [open "|cat >@ $sok" w]
>  # do writes on $ff, reads on $sok
>  # you can still fconfigure $ff -blocking 0
>
>  # now assume it's time to close
>  exec kill -INT [pid $ff]
>  catch {close $ff}
>
> Ugly, eh ? Yup. Just one percent simpler than an extension ;-)


Why not try using [chan pending]?

In my recent experiment with htclient, I found that the only way to
avoid failure on read (a potential DoS attack) was to read only the
bytes available in the Tcl buffer.

The biggest deficit in the Tcl channel code is the lack of timeouts,
but the manpage for [chan puts] indicates that applications should
take care not to push too much data into the output channel with each
writable event.

Until the actual code is posted, it is hard to say whether this is a
Tcl failing, or exactly what the failure is.
Reply tom 2/2/2010 5:33:35 PM

On Feb 2, 8:56 am, WC <wcu...@cox.net> wrote:

Thank you for the explanation.
Reply PaulWalton 2/2/2010 6:46:09 PM

On Feb 2, 11:23 am, Alexandre Ferrieux <alexandre.ferri...@gmail.com> wrote:
>
> You are absolutely right: that's a design flaw. If our hands were
> free, we'd fix that instantly. The problem is the existing base of Tcl
> apps... So we can only extend, not reform. Something like [chan
> unflush], or [chan discard].
>
> You can file a TIP for that; however, in the meantime, you can use
> the following workaround:
>
>  set ff [open "|cat >@ $sok" w]
>  # do writes on $ff, reads on $sok
>  # you can still fconfigure $ff -blocking 0
>
>  # now assume it's time to close
>  exec kill -INT [pid $ff]
>  catch {close $ff}
>
> Ugly, eh ? Yup. Just one percent simpler than an extension ;-)
>
> -Alex

Alex,

Ugly... but clever my man!! Very Nice:) Do you think piping to
something like netcat would work bi-directionally so I can stick with
a single channel?

Ok, I will file a TIP for that as I feel it is very important to have
some mechanism in place for this condition. Either a separate function
as you mentioned or a flag to close. This would allow backward
compatibility.

close -noflush $chan

Thanks Alex,

Wayne
Reply Wayne 2/2/2010 6:49:05 PM

On Feb 2, 12:33 pm, "tom.rmadilo" <tom.rmad...@gmail.com> wrote:
>
> Why not try using [chan pending ]?
>
> In my recent experiment with htclient, I found that the only way to
> avoid failure on read (potential DOS attack) was to read only bytes
> available in the tcl buffer.
>
> The biggest deficit in the Tcl channel code is the lack of timeouts,
> but the manpage for [chan puts] indicates that applications should
> take care to not push too much data into the output channel with each
> writable event.
>
> Until the actual code is posted, hard to say this is a tcl failing, or
> exactly what the failure is.

Currently I'm only concerned with the [puts] case, as I'm not reading
from clients other than to notice when they close the connection.

Yeah, I saw that in the manpage. The problem with verbiage like that
is: how much is "too much"? In fact, just 1 byte in the interp's IO
buffers will cause the interp to block.

[chan pending] won't help, because by the time there is data in the
interp's buffer it's too late:(

I'm stuck with 8.4 on my production system right now:(... so no [chan]
command for me.

It would be nice if one could turn off TCL buffering completely. I
have [fconfigure $chan -buffering none], but in non-blocking mode the
interp still accepts data via [puts], as the [puts] manpage states.

I understand they can't change the semantics, since this API has
existed for many releases. But I would have expected that if you
disable buffering and write to a non-blocking channel which cannot
accept data, [puts] would return the number of bytes it was able to
write, so that the application could handle the buffering instead of
the interp. That is exactly what my C implementation does.
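
The closest script-level approximation I can think of is to keep the backlog in a Tcl variable and hand the channel one small chunk per writable event. A rough sketch, with every name (queue, enqueue, drain, dropClient, MAX_BACKLOG, CHUNK) and both sizes made up for illustration. It only limits how much ends up in the interp's buffer; it does not remove the blocking close, since even one unflushed chunk is enough for that.

```tcl
set MAX_BACKLOG 524288   ;# bytes of backlog tolerated per client (arbitrary)
set CHUNK       4096     ;# bytes handed to the channel per writable event

# Append data to the per-channel backlog; returns 0 if the client is too
# far behind, in which case the caller should drop it.
proc enqueue {chan data} {
    global queue MAX_BACKLOG
    if {![info exists queue($chan)]} { set queue($chan) "" }
    if {[string length $queue($chan)] + [string length $data] > $MAX_BACKLOG} {
        return 0
    }
    append queue($chan) $data
    fileevent $chan writable [list drain $chan]
    return 1
}

# Writable handler: move at most one chunk from the backlog to the channel.
proc drain {chan} {
    global queue CHUNK
    if {![info exists queue($chan)] || [string length $queue($chan)] == 0} {
        fileevent $chan writable {}   ;# nothing left, stop polling
        return
    }
    set piece [string range $queue($chan) 0 [expr {$CHUNK - 1}]]
    set queue($chan) [string range $queue($chan) $CHUNK end]
    if {[catch {puts -nonewline $chan $piece; flush $chan}]} {
        dropClient $chan
    }
}

# Forget the backlog and close; close may still block if the interp is
# holding an unflushed chunk for this channel.
proc dropClient {chan} {
    global queue
    unset -nocomplain queue($chan)
    catch {fileevent $chan writable {}}
    catch {close $chan}
}
```

Dropping a slow client then costs only the unset of its backlog, plus whatever single chunk the interp may still be holding.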
Reply Wayne 2/2/2010 7:10:50 PM

On Feb 2, 7:49 pm, Wayne <wcu...@gmail.com> wrote:
> Alex,
>
> Ugly... but clever my man!! Very Nice:) Do you think piping to
> something like netcat would work bi-directionally so I can stick with
> a single channel?

Oh yes of course, if you don't absolutely want Tcl sockets or close
monitoring of the connection attempt, [open "|nc ..." r+] will work
similarly.

> Ok, I will file a TIP for that as I feel it is very important to have
> some mechanism in place for this condition. Either a separate function
> as you mentioned or a flag to close. This would allow backward
> compatibility.
>
> close -noflush $chan

Not sure what you mean by backward compat here, since the current
[close] syntax doesn't have prefix options, but yes, I like this
syntax too :)

-Alex
Reply Alexandre 2/2/2010 7:27:26 PM
