Thread: [Overlayweaver-discuss] Overlay weaver seems to leave open sockets
From: Susanna M. <sus...@is...> - 2010-08-03 13:54:05
Hi All.

We created an application built on top of OW 0.9.10 that every 30 seconds does:

- value=get(key);
- doingChanges(value);
- put(key,value);

We started the application (plus OW) on 10 machines.
Right after I started the machines everything seemed to work fine, however after a few days OW gives errors in the logs:

ow.routing.RoutingException
    at ow.dht.impl.BasicDHTImpl.get(BasicDHTImpl.java:249)
    at eu.xtreemos.ads.dht.ow.OWDHT.get(OWDHT.java:173)

which corresponds to the failure of the first step (the get).

When I check the list of sockets opened by the java process I see a lot of (about 800!!) sockets in CLOSE_WAIT status. Here is a sample of the output:

java 13392 root 136u IPv6 30495239 0t0 TCP xtreemos7.xxx.xxx:3997->xtreemos3.xxx.xxx:45977 (CLOSE_WAIT)
java 13392 root 137u IPv6 30478306 0t0 TCP xtreemos7.xxx.xxx:3997->xtreemos.xxx.xxx:44935 (CLOSE_WAIT)
java 13392 root 138u IPv6 30478319 0t0 TCP xtreemos7.xxx.xxx:3997->xtreemos.xxx.xxx:44941 (CLOSE_WAIT)
java 13392 root 139u IPv6 30478334 0t0 TCP xtreemos7.xxx.xxx:3997->xtreemos.xxx.xxx:44944 (CLOSE_WAIT)
java 13392 root 140u IPv6 30478395 0t0 TCP xtreemos7.xxx.xxx:3997->xtreemos.xxx.xxx:44947 (CLOSE_WAIT)
java 13392 root 141u IPv6 30478437 0t0 TCP xtreemos7.xxx.xxx:3997->xtreemos.xxx.xxx:44953 (CLOSE_WAIT)
java 13392 root 142u IPv6 30478444 0t0 TCP xtreemos7.xxx.xxx:3997->xtreemos.xxx.xxx:44956 (CLOSE_WAIT)
java 13392 root 143u IPv6 30478451 0t0 TCP xtreemos7.xxx.xxx:3997->xtreemos.xxx.xxx:44959 (CLOSE_WAIT)
java 13392 root 144u IPv6 30479112 0t0 TCP xtreemos7.xxx.xxx:3997->xtreemos.xxx.xxx:44962 (CLOSE_WAIT)
java 13392 root 145u IPv6 30479135 0t0 TCP xtreemos7.xxx.xxx:3997->xtreemos.xxx.xxx:44968 (CLOSE_WAIT)
java 13392 root 146u IPv6 30479143 0t0 TCP xtreemos7.xxx.xxx:3997->xtreemos.xxx.xxx:44971 (CLOSE_WAIT)
java 13392 root 147u IPv6 30479200 0t0 TCP xtreemos7.xxx.xxx:3997->xtreemos.xxx.xxx:44974 (CLOSE_WAIT)
java 13392 root 148u IPv6 30479209 0t0 TCP xtreemos7.xxx.xxx:3997->xtreemos.xxx.xxx:44977 (CLOSE_WAIT)
java 13392 root 149u IPv6 30479218 0t0 TCP xtreemos7.xxx.xxx:3997->xtreemos.xxx.xxx:44980 (CLOSE_WAIT)
java 13392 root 150u IPv6 30479255 0t0 TCP xtreemos7.xxx.xxx:3997->xtreemos.xxx.xxx:44983 (CLOSE_WAIT)
java 13392 root 151u IPv6 30479269 0t0 TCP xtreemos7.xxx.xxx:3997->xtreemos.xxx.xxx:44986 (CLOSE_WAIT)
java 13392 root 152u IPv6 30479119 0t0 TCP xtreemos7.xxx.xxx:3997->xtreemos.xxx.xxx:44965 (CLOSE_WAIT)
java 13392 root 153u IPv6 30479307 0t0 TCP xtreemos7.xxx.xxx:3997->xtreemos.xxx.xxx:44989 (CLOSE_WAIT)
java 13392 root 154u IPv6 30479321 0t0 TCP xtreemos7.xxx.xxx:3997->xtreemos.xxx.xxx:44993 (CLOSE_WAIT)
java 13392 root 155u IPv6 30479337 0t0 TCP xtreemos7.xxx.xxx:3997->xtreemos1.xxx.xxx:47574 (CLOSE_WAIT)
java 13392 root 156u IPv6 30481247 0t0 TCP xtreemos7.xxx.xxx:3997->xtreemos1.xxx.xxx:47830 (CLOSE_WAIT)
java 13392 root 157u IPv6 30484056 0t0 TCP xtreemos7.xxx.xxx:3997->xtreemos1.xxx.xxx:47920 (CLOSE_WAIT)
java 13392 root 158u sock 0,4 0t0 30483999 can't identify protocol
java 13392 root 159u IPv6 30485036 0t0 TCP xtreemos7.xxx.xxx:3997->xtreemos3.yyy.yyy:53764 (CLOSE_WAIT)
java 13392 root 160u IPv6 30484994 0t0 TCP xtreemos7.xxx.xxx:3997->xtreemos3.xxx.xxx:37886 (CLOSE_WAIT)
java 13392 root 161u IPv6 30490877 0t0 TCP xtreemos7.xxx.xxx:3997->xtreemos.xxx.xxx:56459 (CLOSE_WAIT)
java 13392 root 162u IPv6 30486179 0t0 TCP xtreemos7.xxx.xxx:3997->xtreemos.xxx.xxx:56988 (CLOSE_WAIT)
java 13392 root 163u IPv6 30485086 0t0 TCP xtreemos7.xxx.xxx:3997->xtreemos8.xxx.xxx:51897 (CLOSE_WAIT)
java 13392 root 164u IPv6 30485109 0t0 TCP xtreemos7.xxx.xxx:3997->xtreemos3.xxx.xxx:37921 (CLOSE_WAIT)
java 13392 root 165u IPv6 30485117 0t0 TCP xtreemos7.xxx.xxx:3997->xtreemos4.xxx.xxx:34401 (CLOSE_WAIT)
java 13392 root 166u IPv6 30498463 0t0 TCP xtreemos7.xxx.xxx:3997->xtreemos.xxx.xxx:56470 (CLOSE_WAIT)
java 13392 root 167u IPv6 30490880 0t0 TCP xtreemos7.xxx.xxx:3997->xtreemos1.xxx.xxx:33411 (CLOSE_WAIT)
java 13392 root 168u IPv6 30491014 0t0 TCP xtreemos7.xxx.xxx:3997->xtreemos1.xxx.xxx:33493 (CLOSE_WAIT)
java 13392 root 169u IPv6 30491994 0t0 TCP xtreemos7.xxx.xxx:3997->xtreemos.xxx.xxx:56575 (CLOSE_WAIT)
java 13392 root 170u IPv6 30497339 0t0 TCP xtreemos7.xxx.xxx:3997->xtreemos.xxx.xxx:56415 (CLOSE_WAIT)
java 13392 root 171u IPv6 30494095 0t0 TCP xtreemos7.xxx.xxx:3997->xtreemos.xxx.xxx:56597 (CLOSE_WAIT)
java 13392 root 172u IPv6 30494272 0t0 TCP xtreemos7.xxx.xxx:3997->xtreemos.xxx.xxx:56333 (CLOSE_WAIT)
java 13392 root 173u IPv6 30496210 0t0 TCP xtreemos7.xxx.xxx:3997->xtreemos3.xxx.xxx:45993 (CLOSE_WAIT)
java 13392 root 174u IPv6 30496286 0t0 TCP xtreemos7.xxx.xxx:3997->xtreemos.xxx.xxx:56410 (CLOSE_WAIT)
java 13392 root 175u IPv6 30496257 0t0 TCP xtreemos7.xxx.xxx:3997->xtreemos3.yyy.yyy:56624 (CLOSE_WAIT)
java 13392 root 176u IPv6 30497349 0t0 TCP xtreemos7.xxx.xxx:3997->xtreemos.xxx.xxx:56419 (CLOSE_WAIT)
java 13392 root 177u IPv6 30500632 0t0 TCP xtreemos7.xxx.xxx:3997->xtreemos3.yyy.yyy:42447 (CLOSE_WAIT)
java 13392 root 178u IPv6 30498487 0t0 TCP xtreemos7.xxx.xxx:3997->xtreemos.xxx.xxx:56474 (CLOSE_WAIT)
java 13392 root 179u IPv6 30497380 0t0 TCP xtreemos7.xxx.xxx:3997->xtreemos.xxx.xxx:56422 (CLOSE_WAIT)
java 13392 root 180u IPv6 30497390 0t0 TCP xtreemos7.xxx.xxx:3997->xtreemos.xxx.xxx:56424 (CLOSE_WAIT)
java 13392 root 181u IPv6 30497423 0t0 TCP xtreemos7.xxx.xxx:3997->xtreemos.xxx.xxx:56429 (CLOSE_WAIT)
java 13392 root 182u IPv6 30497433 0t0 TCP xtreemos7.xxx.xxx:3997->xtreemos.xxx.xxx:56432 (CLOSE_WAIT)
java 13392 root 183u IPv6 30498205 0t0 TCP xtreemos7.xxx.xxx:3997->xtreemos.xxx.xxx:56437 (CLOSE_WAIT)
java 13392 root 184u IPv6 30498243 0t0 TCP xtreemos7.xxx.xxx:3997->xtreemos.xxx.xxx:56441 (CLOSE_WAIT)
java 13392 root 185u IPv6 30498257 0t0 TCP xtreemos7.xxx.xxx:3997->xtreemos.xxx.xxx:56444 (CLOSE_WAIT)
java 13392 root 186u IPv6 30498268 0t0 TCP xtreemos7.xxx.xxx:3997->xtreemos.xxx.xxx:56447 (CLOSE_WAIT)
java 13392 root 187u IPv6 30498280 0t0 TCP xtreemos7.xxx.xxx:3997->xtreemos.xxx.xxx:*56450* (CLOSE_WAIT)
java 13392 root 188u IPv6 30498284 0t0 TCP xtreemos7.xxx.xxx:3997->xtreemos.xxx.xxx:*56452* (CLOSE_WAIT)
java 13392 root 189u IPv6 30498324 0t0 TCP xtreemos7.xxx.xxx:3997->xtreemos.xxx.xxx:56455 (CLOSE_WAIT)
java 13392 root 190u IPv6 30498329 0t0 TCP xtreemos7.xxx.xxx:3997->xtreemos.xxx.xxx:56458 (CLOSE_WAIT)
java 13392 root 191u IPv6 30498346 0t0 TCP xtreemos7.xxx.xxx:3997->xtreemos.xxx.xxx:56461 (CLOSE_WAIT)
java 13392 root 192u IPv6 30498368 0t0 TCP xtreemos7.xxx.xxx:3997->xtreemos.xxx.xxx:56464 (CLOSE_WAIT)
java 13392 root 193u IPv6 30498006 0t0 TCP xtreemos7.xxx.xxx:3997->xtreemos4.xxx.xxx:37515 (CLOSE_WAIT)
java 13392 root 194u IPv6 30498380 0t0 TCP xtreemos7.xxx.xxx:3997->xtreemos.xxx.xxx:56466 (CLOSE_WAIT)

Many client connections are open on port 3997 (machine xtreemos7), some of them from the same machine but from different outgoing ports.

The client side seems to close the connection, but the server side does not close it.

I looked into the OW source code, and the class that manages client socket connections is ow.messaging.tcp.ConnectionPool. This class associates a client socket with each destination. The client Socket should be reused each time a message is sent to an existing destination.

I have two questions:

1) Why does xtreemos.xxx.xxx have two or more different connections toward xtreemos7.xxx.xxx:3997 on different outgoing ports? (e.g. 56450, 56452)

2) Why does the socket on the server side remain open (actually in CLOSE_WAIT status)? I would expect that once the client has used the connection on port *56450* toward *xtreemos7:3997* for the first time, it closes the socket.

This problem is very subtle, since after a certain amount of time it prevents opening any file because of the limit on open files in a Unix system.

Thanks
Susanna

--
"Reality is that which, when you stop believing in it, doesn't go away" -- Philip K. Dick
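
A note on the state reported above: a TCP socket sits in CLOSE_WAIT when the peer has already closed its end (its FIN has arrived) but the local application has not yet called close() on its own socket. The state therefore lasts until the receiving process closes the socket or exits. The following is only a minimal sketch of the receiver-side pattern that avoids this, not Overlay Weaver code; the class and method names (ReceiverCloseDemo, drainAndClose, roundTrip) are hypothetical and the demo runs over the loopback interface.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical demo class; it does not come from Overlay Weaver.
class ReceiverCloseDemo {

    /** Reads until the peer closes, then closes our end too; returns the bytes read. */
    static int drainAndClose(Socket accepted) throws IOException {
        int total = 0;
        // try-with-resources closes the socket even if read() throws,
        // so the connection cannot linger in CLOSE_WAIT.
        try (Socket s = accepted; InputStream in = s.getInputStream()) {
            byte[] buf = new byte[1024];
            int n;
            while ((n = in.read(buf)) != -1) { // -1 means the peer closed its side
                total += n;
            }
        }
        return total;
    }

    /** Sends payload over loopback, closes the client, and lets the receiver drain and close. */
    static int roundTrip(byte[] payload) {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            try (Socket client = new Socket(server.getInetAddress(), server.getLocalPort())) {
                client.getOutputStream().write(payload);
            } // the client closes here, which is what the sender side in the report did
            return drainAndClose(server.accept());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

If drainAndClose returned without closing the accepted socket, the receiving process would keep one descriptor in CLOSE_WAIT per finished connection, which matches the kind of accumulation shown in the lsof output.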
From: Kazuyuki S. <20...@sh...> - 2010-08-08 20:11:23
Hi Susanna,

> Message-ID: <AANLkTi=Fzf...@ma...>
> From: Susanna Martinelli <sus...@is...>
> Date: Tue, 3 Aug 2010 15:53:35 +0200

> We created an application built on top of OW 0.9.10 that every 30 seconds
> does:
>
> - value=get(key);
> - doingChanges(value);
> - put(key,value);
>
> We started the application (plus OW) on 10 machines.
> Right after I started the machines everything seemed to work fine, however
> after a few days OW gives errors in the logs:
>
> ow.routing.RoutingException
> at ow.dht.impl.BasicDHTImpl.get(BasicDHTImpl.java:249)
> at eu.xtreemos.ads.dht.ow.OWDHT.get(OWDHT.java:173)
>
> which corresponds to the failure of the first step (the get).
>
> When I check the list of sockets opened by the java process I see a lot of
> (about 800!!) sockets in CLOSE_WAIT status. Here is a sample of the output:
>
> java 13392 root 136u IPv6 30495239 0t0 TCP
> xtreemos7.xxx.xxx:3997->xtreemos3.xxx.xxx:45977 (CLOSE_WAIT)
> java 13392 root 137u IPv6 30478306 0t0 TCP
> xtreemos7.xxx.xxx:3997->xtreemos.xxx.xxx:44935 (CLOSE_WAIT)
> java 13392 root 138u IPv6 30478319 0t0 TCP
> xtreemos7.xxx.xxx:3997->xtreemos.xxx.xxx:44941 (CLOSE_WAIT)

Would you try Overlay Weaver 0.9.11?
It has just been released and includes a fix for the problem.

> The client side seems to close the connection, but the server side does
> not close it.

I could not reproduce it in a short experiment, but it is not surprising to see such remaining connections, because receiver-side nodes did not close a connection, as you pointed out.

> I have two questions:
>
> 1) Why does xtreemos.xxx.xxx have two or more different connections toward
> xtreemos7.xxx.xxx:3997 on different outgoing ports?
> (e.g. 56450, 56452)

The default size of a connection pool is very small (*). Once it overflows, it throws connections out and they cannot be reused.

(*) In src/ow/messaging/tcp/TCPMessageConfiguration.java:
  ...
  public final static int DEFAULT_CONNECTION_POOL_SIZE = 3;
    // Connection pool is disabled if 0 or a negative value is specified.

> 2) Why does the socket on the server side remain open (actually in CLOSE_WAIT
> status)? I would expect that once the client has used the connection on
> port *56450* toward *xtreemos7:3997* for the first time, it closes
> the socket.

The reason is not clear. I could not reproduce it.

Anyway, Overlay Weaver 0.9.11 closes a connection on both sides, the receiver and the sender. I believe the change resolves the problem.

Kazuyuki Shudo 20...@sh... http://www.shudo.net/
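
To illustrate the pool-overflow explanation above, here is a rough sketch of a bounded pool that evicts its least recently used connection once the limit is exceeded. It is not the code of ow.messaging.tcp.ConnectionPool: the class BoundedPool, its opener parameter, openedCount, and the LRU eviction policy are all assumptions made for illustration, and closing the evicted connection is likewise an assumption, not something the thread confirms about 0.9.10.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical bounded pool; not Overlay Weaver's ConnectionPool.
class BoundedPool<K, C extends Closeable> {
    private final Map<K, C> cache;
    private final Function<K, C> opener; // opens a new connection to a destination
    private int opened = 0;

    BoundedPool(final int capacity, Function<K, C> opener) {
        this.opener = opener;
        // accessOrder=true keeps entries in least-recently-used order.
        this.cache = new LinkedHashMap<K, C>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, C> eldest) {
                if (size() > capacity) {
                    // Assumption: close the evicted connection so it is not leaked.
                    try { eldest.getValue().close(); } catch (IOException ignored) { }
                    return true;
                }
                return false;
            }
        };
    }

    /** Returns the pooled connection for dest, opening a new one if it is not pooled. */
    synchronized C get(K dest) {
        C conn = cache.get(dest);
        if (conn == null) {
            conn = opener.apply(dest); // a new socket would get a new local port
            opened++;
            cache.put(dest, conn);
        }
        return conn;
    }

    /** Number of connections opened so far, including evicted ones. */
    synchronized int openedCount() { return opened; }
}
```

With a capacity of 3, contacting destinations a, b, c, d and then a again opens five connections in this sketch, because a was evicted when d arrived. Over real sockets each reopening would use a fresh outgoing port, which is consistent with one host showing several connections toward xtreemos7:3997 when a node talks to more than three peers.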