Thursday, October 26, 2006

Analysis of The WRT54G Failure

[Updated Nov 26, 2006]
This post has had over 50 hits from Google on the keywords "wrt54g hung torrent" in less than a month (for more info on the WRT54G issue, click here). If your WRT54G has the same problem, join me and complain to Linksys. I am boycotting Linksys products for the time being, until this problem is resolved.
We will have better products in the long run if we exercise our rights as consumers. Thank you.

[Updated Nov 11, 2006]
Two pieces of bad news: first, this router seems to have some interoperability issues with Linux, and second, Linksys tech support has proven useless.

Under Linux, I managed to get connected to this puppy. In fact I am typing up this blog post on my Slackware 11 machine. The catch is that the connection down time is random, anywhere from 10 minutes to a few hours.

Second, as mentioned below, this flimsy little router can't take the heat of full-throttle BitTorrent connections, so I tried my luck with the Linksys support staff. They were far from helpful: instead of looking at my problem, they now blame the router's gradual death after reset on the speed of my network, duh!

Dear Free-Chai,

You have mentioned that you have a fiber optic connection; our router is designed for Cable and DSL connection. The reason why you are encountering the problem might be that the router can’t sustain the speed provided by your fiber optic connection. I suggest please lower down the MTU to 1200. If this will not solve the issue, you need to put a hub between your connections (internet connection-hub-router).


Sincerely,


Cheed (Badge ID 19574)
Linksys – A Division of Cisco Systems, Inc.

P/S: I changed the MTU to 1200 as suggested, still going nowhere.

As for this support staff's theory on fibre optic speed, I totally disagree. The optical fibre is pulled to a main switch for my 25-storey apartment building and then distributed to individual units over standard Ethernet cable. This layer 1 thingie _shouldn't_ affect the functionality of the router, which operates mainly on layer 3 and above. Moreover, how do you explain the gradual death of the router after a couple of hours?

Too bad I can't return this piece of equipment (there is literally no law in Malaysia for the return/refund of faulty merchandise, or at least the shops will ignore you and laugh at you. Technically you can go to the small claims court and fight, but nah, not worth it).

If you are considering buying a WRT54G, you may want to reconsider your choice.

=====================================
I have more or less confirmed that my router locks up because TCP connections are created faster than they expire. Given the puny memory my router has (8MB), it crashes after a few hours as the number of connection states rockets.

The reason is simple: TCP is a reliable stream delivery service which preserves the order of TCP segments. Once a TCP virtual circuit is in place, it prepares itself for service and starts a count-down timer at the same time. When the timer expires, the computer declares the connection dead and returns an appropriate status code to the relevant application. This count-down timer is a well-thought-out design to address the unreliable nature of the underlying networks (99.99% of them IP networks; even on ultra-reliable ATM networks people want to run IPoA, duh). The price for this bit of reliability is that each TCP-aware device has to keep a state table of all TCP connections, scrubbing it regularly to weed out those connections with expired timers.
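To make the state table idea concrete, here is a rough sketch in Python of a table that starts a timer per connection and scrubs expired entries. It is only an illustration: the class name, the 600-second timeout, and the addresses are all made up by me, and this is not how the WRT54G firmware actually implements it.

```python
class ConnectionTable:
    """Toy connection state table: one entry per flow, each with an expiry time."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s  # hypothetical idle timeout, in seconds
        self.entries = {}           # flow tuple -> expiry timestamp

    def touch(self, flow, now):
        # A new or active flow gets its count-down timer (re)started.
        self.entries[flow] = now + self.timeout_s

    def scrub(self, now):
        # Weed out flows whose timers have expired; return how many were removed.
        expired = [flow for flow, expiry in self.entries.items() if expiry <= now]
        for flow in expired:
            del self.entries[flow]
        return len(expired)


table = ConnectionTable(timeout_s=600)
table.touch(("192.168.1.2", 51000, "10.0.0.1", 6881), now=0)
table.touch(("192.168.1.2", 51001, "10.0.0.2", 6881), now=300)
removed = table.scrub(now=600)     # only the first flow has timed out by now
remaining = len(table.entries)
```

Every entry that has not yet timed out costs memory, which is the whole point of the complaint that follows.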

The advent of P2P applications changes the whole landscape. P2P networks are characterised by extreme fluidity and high unreliability. TCP can handle the unreliability part reasonably well on its own, but the fluidity kills it. Fluidity refers to the situation where many peers are leaving and joining a node simultaneously, as each of them contributes a small portion of a certain download/upload. This seemingly harmless operation poses a serious problem for the device in terms of updating the TCP state table, especially for devices with a flimsy memory footprint like my WRT54G. Simple queueing theory is applicable here: if the input rate (connection creation rate) is higher than the output rate (connection time-out rate), the queue length grows without bound. This is exactly what our problem is: our maximum queue length, aka the memory size, is very modest, so over a run of several hours the router stores so much state that it doesn't even have enough memory left for swapping, and finally it says goodbye the Windows way, minus the blue screen.
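The queueing argument above can be played out with a small simulation. This is a sketch under my own assumptions: the creation rate, the timeouts, and the 4096-entry capacity are invented numbers, not measurements from the router. With a fixed timeout, the table settles at roughly (creation rate x timeout) entries, so the table overflows whenever that product exceeds what the memory can hold.

```python
from collections import deque


def simulate_table(create_per_s, timeout_s, capacity, duration_s):
    """Return (second at which the table overflows, or None; peak entry count)."""
    expiries = deque()  # expiry times, oldest first
    peak = 0
    for now in range(duration_s):
        # Scrub entries whose timers have expired.
        while expiries and expiries[0] <= now:
            expiries.popleft()
        # New peers connect; each one takes a slot in the table.
        for _ in range(create_per_s):
            expiries.append(now + timeout_s)
        peak = max(peak, len(expiries))
        if len(expiries) > capacity:
            return now, peak
    return None, peak


# Hypothetical numbers: 5 new connections per second, room for 4096 entries.
short_timeout = simulate_table(5, 600, 4096, 7200)   # settles at 5 * 600 = 3000 entries
long_timeout = simulate_table(5, 3600, 4096, 7200)   # would need 5 * 3600 = 18000 entries
```

In the first run the table levels off below capacity and survives the whole two hours; in the second it overflows in under 14 minutes, because entries pile up faster than their timers let them go.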

2 comments:

The Soothsayer said...

I think it's because you set the device to be a NAT device. If you configure it as a normal router, then it wouldn't have to handle TCP states. Maybe you can set it as a router and then use one of your PCs as the NAT device.

Cuppa Chai said...

I totally agree with you, and the whole reason the router wants to keep the state table is the NAT. However, if I were to use one of my PCs as the NAT device, it would be more straightforward to dump this crappy router and use the PC as the router as well, so this goes back to square one: why did I buy this router in the first place? Duh...