Opened 6 years ago

Closed 2 years ago

Last modified 7 months ago

#7552 closed defect (fixed)

bcm47xx/b43 wireless connection suddenly stops when router system load increases

Reported by: amain@…
Owned by: hauke
Priority: response-needed
Milestone:
Component: base system
Version: Backfire 10.03
Keywords: b43 hostapd link stability system load
Cc:

Description

This or related problems have affected the b43 driver on bcm47xx devices for more than a couple of years now. Things have improved, but b43 is still not stable, and problems I saw a year ago persist. I could never find a way to reproduce them, until now:

  1. Compile and install stock backfire 10.03 on the router
  2. Install usb2, usb-storage, ext3 using opkg
  3. mount an external USB disk partition on /mnt
  4. Setup a simple wireless connection with WPA2 and a passphrase
  5. Use 2 computers: LP1 and LP2.
  6. LP1 connects over wireless to the router
  7. LP2 connects over ethernet to a LAN port of the switch of the router
  8. Start a smb file transfer : LP1 receiving files from LP2
  9. Start a sftp file transfer: LP1 receiving files from LP2
  10. Start a ssh connection from LP1 to the router and start top
  11. Start ssh connection from LP2 to router and type: find /mnt

--> boom, within seconds the wireless connection stops functioning. Using tcpdump one can see ARP packets still arriving on wlan0 on the router, but the router no longer transmits any IP packets. The wireless link itself seems to remain intact: ARP packets can still be received over the wireless connection, but no other packets are sent. The AP remains visible to other wireless devices, but they cannot connect to it.

This scenario was tested with all variations of: backfire, trunk, Asus WL-500GD, Asus WL-500G Premium v1. All resulting in connection loss / sudden stop.

Another interesting observation: on all of the above-mentioned combinations I saw duplicate packets arriving on wlan0 on the router that were not transmitted by the client. So it seems that packets are duplicated at some level in certain cases.

Hopefully with this report it will be possible to fix the problems with the b43 driver. Let me know if I can help.

Attachments (10)

wireless (352 bytes) - added by amain@… 6 years ago.
/etc/config/wireless
arp_still_received (316 bytes) - added by amain@… 6 years ago.
ARP packets still received, but no response
network_restart (281 bytes) - added by amain@… 6 years ago.
/etc/init.d/network.sh restart output: shows No such device (-19)
lp1_supplicant.dump (4.3 KB) - added by amain@… 6 years ago.
tcpdump on LP1 showing the initial packets after successful WPA connection setup.
router_wlan0.dump (3.2 KB) - added by amain@… 6 years ago.
tcpdump on wlan0 on router of initial packets after WPA connection setup. See the duplicate packets received, which were not transmitted
kernel_debug_log.txt (5.8 KB) - added by Andreas Bräu <ab@…> 4 years ago.
840-b43-workaround-rx-fifo-overflow.patch (1.5 KB) - added by Dmitriy Taychenachev <dimichxp@…> 4 years ago.
for mac80211 openwrt package
978-b43_dmarx_adddisc.patch (4.3 KB) - added by thommyj@… 3 years ago.
patch for adding rx desc underrun
979-b43_addsysfs.patch (4.1 KB) - added by thommyj@… 3 years ago.
hack for displaying debug info in sysfs for above patch
980-b43_debugwifidown.patch (2.5 KB) - added by thommyj@… 3 years ago.
showing number of interrupts for b43 in sysfs

Change History (153)

comment:1 Changed 6 years ago by anonymous

You do not even need such a complex setup. Just do some CPU-intensive transfer like SCP (I even managed it with a plain FTP transfer) and your driver will lock up immediately. It looks like the driver's state machine remains in an invalid state caused by missed interrupts/packets due to high CPU load. This is definitely a driver problem and a serious showstopper for all the b43 folks. OpenWRT cannot be considered 'Working' on these devices.

comment:2 Changed 6 years ago by anonymous

I am seeing the exact same problem with similar setup on trunk of 10.03 on my wrt54GL 1.1.

comment:3 Changed 6 years ago by anonymous

Problem persists in 10.03.1 rc1

comment:4 Changed 6 years ago by th0ma7@…

Just switched from kernel 2.4 to 2.6 and I am having the exact same problem with a WRT54GL version 1.0 using 10.03.1-rc1 kernel 2.6 (brcm47xx).

Used to be really really stable with kernel 2.4 branch.

comment:5 follow-ups: Changed 6 years ago by Gina Häußge <gina@…>

Same issue here with 10.03 with a WL500g Deluxe. Easily reproducible by starting an SCP from wlan to lan (burnt bridge setup), connection on wlan machine drops promptly and is only regained after a reboot of the router.

comment:6 in reply to: ↑ 5 Changed 6 years ago by microcris <microcris@…>

Replying to Gina Häußge <gina@…>:

Same issue here with 10.03 with a WL500g Deluxe. Easily reproducible by starting an SCP from wlan to lan (burnt bridge setup), connection on wlan machine drops promptly and is only regained after a reboot of the router.

Same here :(
Asus WL500GPV2 Backfire (r23709) Kernel 2.6

comment:7 Changed 6 years ago by paul.r.ml@…

Confirmed on ASUS WL500 G, is there a fallback or a workaround to this issue please ?

comment:8 Changed 6 years ago by noway <popov@…>

confirmed on dir 320

comment:9 Changed 6 years ago by stabarinde@…

Confirmed on (XWRT) Kamikaze 8.09 on WRT54GL v1.1 . Hopefully I can get WDS working on 2.4 kernel instead...

comment:10 follow-up: Changed 6 years ago by Felix <felix@…>

If this has been confirmed as a driver issue, has it been reported upstream?

comment:11 Changed 5 years ago by Zajec

Please provide the output of dmesg | grep b43 so we can at least know what hardware is affected. Unfortunately, bcm47xx does not tell us much. The firmware version should also be visible in dmesg.
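For anyone collecting this information from several routers, the relevant fields can be pulled out of the kernel log with a small helper. This is only a sketch: the function name b43_summary is made up for illustration, and it assumes the "Broadcom NNNN WLAN found (core revision N)" and "Loading firmware version X" message formats that appear in the dmesg output pasted in this ticket.

```shell
#!/bin/sh
# Summarize b43 hardware details from kernel log text read on stdin.
# b43_summary is a hypothetical helper name, not part of OpenWrt.
# It relies on the message formats seen in this ticket's dmesg pastes.
b43_summary() {
    awk '
        /b43-phy[0-9]+: Broadcom [0-9]+ WLAN found/ {
            # The chip number is the word following "Broadcom".
            for (i = 1; i <= NF; i++) if ($i == "Broadcom") chip = $(i + 1)
            # "core revision " is 14 characters long; keep only the digits.
            if (match($0, /core revision [0-9]+/))
                rev = substr($0, RSTART + 14, RLENGTH - 14)
        }
        /Loading firmware version/ {
            # The firmware version is the word following "version".
            for (i = 1; i <= NF; i++) if ($i == "version") fw = $(i + 1)
        }
        END { printf "chip=%s core_rev=%s firmware=%s\n", chip, rev, fw }
    '
}

# Usage on the router (not run here): dmesg | b43_summary
```

If the firmware line is logged twice, as in some of the pastes, the last occurrence wins, which should be harmless because both lines carry the same version.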

comment:12 in reply to: ↑ 5 Changed 5 years ago by Gina Häußge <gina@…>

Quoting myself here...

Replying to Gina Häußge <gina@…>:

Same issue here with 10.03 with a WL500g Deluxe. Easily reproducible by starting an SCP from wlan to lan (burnt bridge setup), connection on wlan machine drops promptly and is only regained after a reboot of the router.

...but this time with the requested info:

# dmesg | grep b43
b43-phy0: Broadcom 4306 WLAN found (core revision 5)
Registered led device: b43-phy0::tx
Registered led device: b43-phy0::rx
Registered led device: b43-phy0::radio
b43 ssb1:0: firmware: requesting b43/ucode5.fw
b43 ssb1:0: firmware: requesting b43/pcm5.fw
b43 ssb1:0: firmware: requesting b43/b0g0initvals5.fw
b43 ssb1:0: firmware: requesting b43/b0g0bsinitvals5.fw
b43-phy0: Loading firmware version 478.104 (2008-07-01 00:50:23)

comment:13 follow-up: Changed 5 years ago by piotr@…

I have the same problem with my Asus WL-500G Premium (v1). Tested with trunk, rev26720. Restarting wifi on the AP helps only for a few seconds (while the system is under heavy load). There is no information about the problem in either syslog or dmesg.

When this problem occurs, "wifi down" command returns "No such device (-19)" error. I've straced "wifi down" command and it seems that this error is (probably) generated by "brctl delif" command located in unbridge() function in /lib/network/config.sh.

# dmesg|grep b43
b43-phy0: Broadcom 4318 WLAN found (core revision 9)
Registered led device: b43-phy0::tx
Registered led device: b43-phy0::rx
Registered led device: b43-phy0::assoc
Registered led device: b43-phy0::radio
b43-phy0: Loading firmware version 410.2160 (2007-05-26 15:32:10)

comment:14 in reply to: ↑ 13 Changed 5 years ago by anonymous

Replying to piotr@…:

oops. formatting correction :)

b43-phy0: Broadcom 4318 WLAN found (core revision 9)
Registered led device: b43-phy0::tx
Registered led device: b43-phy0::rx
Registered led device: b43-phy0::assoc
Registered led device: b43-phy0::radio
b43-phy0: Loading firmware version 410.2160 (2007-05-26 15:32:10)

comment:15 Changed 5 years ago by liberty@…

Same issue here! Wrt54gl 2.6.32.27.

comment:16 in reply to: ↑ 10 Changed 5 years ago by veox

Replying to Felix <felix@…>:

If this has been confirmed as a driver issue, has it been reported upstream?

I guess upstream knows about this. From a link in ticket #7366 (possibly related) and reading b43-dev mailing lists I reckon AP/Master code is not complete.

There are reverse-engineered specs at http://bcm-v4.sipsolutions.net/

I'm currently trying a workaround, as in http://permalink.gmane.org/gmane.linux.drivers.bcm54xx.devel/10680

comment:17 Changed 5 years ago by buckh@…

(sorry for the interruption. commenting in hopes i get added to cc list)

comment:18 Changed 5 years ago by hauke

  • Owner changed from developers to hauke
  • Status changed from new to accepted

comment:19 Changed 5 years ago by nbd

Please try http://nbd.name/860-b43_restart_config.patch on latest trunk (copy it to package/mac80211/patches)

comment:20 Changed 5 years ago by nbd

  • Priority changed from high to response-needed

comment:21 Changed 5 years ago by th0ma7@…

Is it possible to get a working build for the WRT54G v1.0 so the patch can be tested? Otherwise I won't be able to test the patch without considerable effort and the associated risk of bricking my only router (although that might become a good reason to replace it).

comment:22 Changed 5 years ago by lars.schotte@…

same issue here on Linux 3.0.3 #1 Wed Oct 26 13:19:24 CEST 2011 mips GNU/Linux
and WRT54GL v1.1

wlan stops working; it is still visible to other devices, but they cannot connect to it.

comment:23 follow-up: Changed 5 years ago by coderjoe@…

Same issue as all above using OpenWRT Backfire 10.03.1-RC6, r28680:

root@forest:~# dmesg | grep b43
b43-phy0: Broadcom 5352 WLAN found (core revision 9)
Registered led device: b43-phy0::tx
Registered led device: b43-phy0::rx
Registered led device: b43-phy0::radio
b43-phy0: Loading firmware version 508.1084 (2009-01-14 01:32:01)
b43-phy0: Loading firmware version 508.1084 (2009-01-14 01:32:01)

root@forest:~# uname -a
Linux forest 2.6.32.27 #11 Sun Oct 30 19:48:44 CET 2011 mips GNU/Linux

For me the high CPU state was achieved through the use of QoS during high load network conditions.

comment:24 in reply to: ↑ 23 Changed 5 years ago by dir1212

Same issue with release 10.03.1 on my WRT54GL v1.1 (bcm47xx).

comment:25 Changed 5 years ago by mpbryan@…

Same issue on Asus WL-500gD (bcm47xx) with 10.03.1.

comment:26 Changed 5 years ago by Michal Pomorski <misieck@…>

I am confirming this on Gargoyle 1.5.3 revision 63071b4, with OpenWrt backfire revision r29961. Running on Wrt54gl v.1.1 (Broadcom 5352 (core revision 9)).

Linux Gargoyle 2.6.32.27 #1 Wed Feb 1 15:59:46 EST 2012 mips GNU/Linux
b43-phy0: Loading firmware version 410.2160 (2007-05-26 15:32:10)

I must say this greatly disappoints me, as I expected this now-legendary router to cooperate painlessly with any attempt to hack it. As it stands now OpenWRT does not fully support the wrt54gl and many others with broadcom radios.
Any workarounds?

comment:27 Changed 5 years ago by przemator@…

Hello, I have an Asus WL-500G with the same issue. I have been installing the latest 47xx firmware for a long time, and this bug is still present and getting worse. With 10.03.1, when Wi-Fi is on, my router may freeze and even reboot itself. The worst case was when the router kept rebooting for 10 minutes without effect; I thought it was already "bricked", but after a few more power cycles it started.

On Kamikaze the wireless would work with no encryption (WPA2 would shut down wifi after a couple of days).

Now I have tried all kinds of encryption and my laptop always loses the connection after a few days; the network is STILL visible, but the laptop cannot establish a connection. I should mention I'm using the default configuration and only enabled wifi via LuCI.

comment:28 Changed 5 years ago by anonymous

I enabled verbose output (verbose=3) on the b43 module to try to help track it down. Also, another thread (need to find again) suggests disabling QoS (qos=0). So far, so good. I'm going to put some load on it tonight. If it lasts more than two hours on my setup, it's an improvement. I'll post back shortly.

comment:29 Changed 5 years ago by adrien@…

Hi,

I'm confirming this bug on a WRT54GL with Backfire 10.03.1, Linux 2.6.32.27 and a Broadcom BCM3302 V0.8 CPU (system type bcm47xx).

To trigger the bug, I simply navigate to http://enfluxlibre.tuxfamily.org/ : on this page my web browser tries to download big ogg files, which breaks the wifi. Another way is to download a Debian CD with the "axel" download manager (configured with 10 connections) and wait 2 minutes until the wifi breaks.

As a workaround, I found that I can limit the download throughput with "axel": if I download the Debian CD limited to 300 bytes per second, the download works without breaking the wifi.

So I wanted to use QoS with qos, wshaper or dsl-qos-queue to limit my download throughput, but I haven't had any success running these programs (I'm a beginner with OpenWrt and have no experience with network administration).

Do you think that using QoS is a possible workaround? Or does it also use too much CPU?

comment:30 Changed 5 years ago by offlinehacker

Hello,

Nbd, patch you provided does not work on trunk. Can you please

Applying ./patches/860-b43_restart_config.patch using plaintext:
patching file drivers/net/wireless/b43/main.c
Hunk #1 FAILED at 326.
Hunk #2 FAILED at 3762.
Hunk #3 succeeded at 4072 with fuzz 1 (offset 242 lines).
Hunk #4 FAILED at 3921.
Hunk #5 succeeded at 4813 with fuzz 2 (offset 111 lines).
Hunk #6 succeeded at 5557 with fuzz 2 (offset 782 lines).
Hunk #7 FAILED at 4934.
4 out of 7 hunks FAILED -- saving rejects to file drivers/net/wireless/b43/main.c.rej
Patch failed! Please fix ./patches/860-b43_restart_config.patch!

comment:31 Changed 5 years ago by offlinehacker

Sorry for ugly formatting. Nbd, patch you provided does not work on trunk.

Applying ./patches/860-b43_restart_config.patch using plaintext: 
patching file drivers/net/wireless/b43/main.c
Hunk #1 FAILED at 326.
Hunk #2 FAILED at 3762.
Hunk #3 succeeded at 4072 with fuzz 1 (offset 242 lines).
Hunk #4 FAILED at 3921.
Hunk #5 succeeded at 4813 with fuzz 2 (offset 111 lines).
Hunk #6 succeeded at 5557 with fuzz 2 (offset 782 lines).
Hunk #7 FAILED at 4934.
4 out of 7 hunks FAILED -- saving rejects to file drivers/net/wireless/b43/main.c.rej
Patch failed!  Please fix ./patches/860-b43_restart_config.patch!

comment:32 Changed 5 years ago by nbd

the patch is no longer necessary, it has already been applied.

comment:33 Changed 4 years ago by adrien@…

Hello!

I've noticed that your patch seems to work with a wired connection. BTW, if your patch reloads the router on CPU overload, it works, but only for the wired connection.

I have the same issue with a wireless connection: if I run too heavy a download over wireless (WPA2), the CPU is overloaded and the connection blocks. So is it possible to do the same patch/workaround for the wireless connection, please?

Thanks !

comment:34 Changed 4 years ago by anonymous

i am experiencing the exact same problem (under heavy download in managed mode wifi disconnects and has to be restarted for it to work) on a minipci atheros card ar5212 with the ath5k driver on both backfire svn and trunk

are minipci cards not well supported in openwrt?
here is the ticket https://dev.openwrt.org/ticket/10758

Changed 4 years ago by Andreas Bräu <ab@…>

comment:35 Changed 4 years ago by Andreas Bräu <ab@…>

we still have these problems with the current trunk, but we managed to get some information from debug at the time wifi dies until it restarts (see attachment kernel_debug_log.txt)

comment:36 follow-up: Changed 4 years ago by Bastian Bittorf <bittorf@…>

comment:37 in reply to: ↑ 36 ; follow-up: Changed 4 years ago by openwrt@…

Replying to Bastian Bittorf <bittorf@…>:

workaround found:

https://lists.openwrt.org/pipermail/openwrt-devel/2012-July/016031.html

I cannot confirm the effectiveness of this suggested workaround running the latest OpenWRT (10.03.1) on two Linksys WRT54GS routers separated by a distance of about 20 metres. I used the suggested commands on both units and then attempted to rsync a 150Mbyte file over the link. The router on the sending end of the link died within a few seconds as usual. When I brought that router back up (with the rsync command still running) the router on the receiving end promptly died. I had to stop the rsync to get the link to stay up. Normally I use rsync --bwlimit=500 to limit the transfer rate to 500kBytes/sec and that seems not to crash the (radio device driver?) very often.

comment:38 in reply to: ↑ 37 ; follow-ups: Changed 4 years ago by Bastian Bittorf <bittorf@…>

I cannot confirm the effectiveness of this suggested workaround running
latest OpenWRT (10.03.1) on two Linksys WRT54GS routers separated by a

please use trunk. also show us which command exactly you used. paste your commandline history. i tested your setup and it works without problems.

comment:39 in reply to: ↑ 38 Changed 4 years ago by Bastian Bittorf <bittorf@…>

please use trunk. also show us which command exactly you used.
paste your commandline history. i tested your setup and it
works without problems.

also post your /etc/config/wireless of both routers.
mine is this:

config wifi-device 'radio0'
	option type 'mac80211'
	option country 'US'
	option channel '1'
	option macaddr 'xx:xx:xx:xx:xx:xx'
	option hwmode '11g'

config wifi-iface
	option device 'radio0'
	option network 'wlan'
	option mode 'adhoc'
	option bssid '02:ca:ff:ee:ba:be'
	option ssid 'bb'

comment:40 follow-up: Changed 4 years ago by openwrt@…

Unfortunately I cannot now post the configurations of all the routers since I have reflashed them all with Tomato, which gives no problems. Here is the configuration of one of the access points, the client on this link would obviously have been set to 'station' mode. I tried all possible encryption methods (including no encryption) and I tried 11b instead of 11g, with the same results. Invariably one or other of the wireless devices would stop responding after a few tens of seconds under high load.

config wifi-device  radio0
        option type     mac80211
        option country  uk
        option maxassoc 1
        option channel  6
        option macaddr  xx:xx:xx:xx:xx:xx
        option hwmode   11g

config wifi-iface
        option device   radio0
        option network  wifi
        option mode     ap
        option ssid     xxxxxx
        option encryption WEP
        option key      xxxxxxxxxxxxxxxxxxxxxxxxxxx

I am willing to re-test OpenWRT when it is possible for me to do so but since these wireless links carry nightly backup traffic it is not acceptable for them to be unreliable. I have other routers which could be made available for testing but I do not have time to work on that until the backup systems are once again functioning reliably. It's close, but no cigar, at the moment.

comment:41 in reply to: ↑ 40 ; follow-up: Changed 4 years ago by Bastian Bittorf <bittorf@…>

and I tried 11b instead of 11g, with the same results.

it doesn't matter which "hwmode" you have in the config:
directly after the interface comes up, you have to set the
txrates on all nodes with:

iw dev $WIFIDEV set bitrates legacy-2.4 6 9 12 18 24 36 48 54

all other things are useless. read the stuff better...

comment:42 Changed 4 years ago by anonymous

I can confirm that running

iw dev wlan0 set bitrates legacy-2.4 6 9 12 18 24 36 48 54

on a WRT54G running as an access point makes the wireless connection stable.

Setting hwmode 11g is not enough, since 802.11g also includes the lower bitrates of 1, 2, 5.5 and 11 Mbps. Since these lower bitrates use a different modulation, the wifi drops are probably related to a change in the modulation?
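To make the split between the two rate families explicit, here is a small sketch that classifies the legacy 2.4 GHz rates. The function names are made up for illustration. The rate tables are the standard 802.11b (DSSS/CCK) and 802.11g (OFDM) sets, and the OFDM list is the same one passed to iw above.

```shell
#!/bin/sh
# Print the modulation family of a legacy 2.4 GHz rate given in Mbit/s.
# rate_modulation and ofdm_only are hypothetical helper names.
rate_modulation() {
    case "$1" in
        1|2|5.5|11) echo dsss_cck ;;            # 802.11b rates
        6|9|12|18|24|36|48|54) echo ofdm ;;     # 802.11g OFDM rates
        *) echo unknown; return 1 ;;
    esac
}

# Keep only the OFDM rates from the rates given as arguments.
ofdm_only() {
    out=""
    for r in "$@"; do
        if [ "$(rate_modulation "$r")" = "ofdm" ]; then
            out="${out:+$out }$r"
        fi
    done
    echo "$out"
}

# Example (not run here): ofdm_only 1 2 5.5 11 6 9 12 18 24 36 48 54
```

Filtering the full 802.11g rate set this way leaves exactly the rates used in the iw command, which is what keeps data frames on a single modulation.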

comment:43 follow-up: Changed 4 years ago by florian

Can you guys bring this to the Linux wireless development mailing-list so you get a chance to get a fix from the b43 developers?

comment:44 Changed 4 years ago by hauke

This was already posted on the wireless-testing mailing list, but I haven't seen anybody looking into this.

http://www.spinics.net/lists/linux-wireless/msg94438.html

comment:45 in reply to: ↑ 41 Changed 4 years ago by anonymous

Replying to Bastian Bittorf <bittorf@…>:

and I tried 11b instead of 11g, with the same results.

it doesn't matter which "hwmode" you have in the config:
directly after the interface comes up, you have to set the
txrates on all nodes with:

iw dev $WIFIDEV set bitrates legacy-2.4 6 9 12 18 24 36 48 54

Which is exactly what I did.

... read the stuff better...

Write the stuff better.

comment:46 in reply to: ↑ 43 Changed 4 years ago by anonymous

Replying to florian:

Can you guys bring this to the Linux wireless development mailing-list so you get a chance to get a fix from the b43 developers?

As apparently nobody else wants to take that on, it will be my pleasure.

Unfortunately you will have to be patient with me, since today is the first time that I have built the OpenWRT firmware and I really don't know what I'm doing.

comment:47 in reply to: ↑ 38 ; follow-up: Changed 4 years ago by anonymous

Replying to Bastian Bittorf <bittorf@…>:

I cannot confirm the effectiveness of this suggested workaround running
latest OpenWRT (10.03.1) on two Linksys WRT54GS routers separated by a

please use trunk. also show us which command exactly you used. paste your commandline history. i tested your setup and it works without problems.

Yesterday I built OpenWRT from trunk and installed it on two WRT54GS routers. Incidentally I noticed that when set up with one router as an access point and the other as a client, the client would associate for a maximum of about 20 milliseconds and then disassociate. I see that this might be a known bug in the kernel drivers and I guess that this is the reason that you used adhoc mode in your claimed working configuration.

[ 8946.780000] wlan0: RX AssocResp from xx:xx:xx:xx:xx:xx (capab=0x411 status=0 aid=1)
[ 8946.796000] wlan0: associated
[ 8946.812000] wlan0: disassociating from xx:xx:xx:xx:xx:xx by local choice (reason=3)

When I set up both routers in adhoc mode I was able to establish a link.

I then gave the following command to both routers (pasted from the terminal as requested)

root@OpenWrt:~# iw dev wlan0 set bitrates legacy-2.4 6 9 12 18 24 36 48 54
root@OpenWrt:~#

and started a large file transfer by rsync without a rate limit.

About ten seconds later, after transfer of about 2% of the file, the wireless link was lost until the routers' wireless devices were reconfigured. I repeated the procedure to confirm it and the fault is perfectly repeatable.

I shall now take this problem to the Linux Wireless list.

comment:48 in reply to: ↑ 47 Changed 4 years ago by Bastian Bittorf <bittorf@…>

When I set up both routers in adhoc mode I was able to establish a link.

can you please provide your exact /etc/config/wireless and
a commandline history for replaying your setup?

bye, bastian

comment:49 Changed 4 years ago by Bastian Bittorf <bittorf@…>

what may also be part of the problem is that all MANAGEMENT FRAMES are still
sent at 1mbit (Beacons/Probe Req/Resp/Association/Authentication). So there is a change between OFDM and non-OFDM, and IMHO this is the root cause of the problem.

with the proprietary driver it was possible to force 11g / b / mixed. does somebody
know how to force 11g-only?

bye, bastian

comment:50 Changed 4 years ago by anonymous

Using latest Attitude Adjustment (12.09-beta), same problem.

Router Model: Linksys WRT54G/GS/GL
Firmware Version: OpenWrt Attitude Adjustment 12.09-beta / LuCI Trunk (trunk+svn9220)
Kernel Version: 3.3.8

comment:51 Changed 4 years ago by Wipster

Confirmed with latest AA in svn.
Can we expose hostapd's supported_rates option through uci?
If we set up basic_rates and supported_rates to avoid the OFDM/CCK crossover bug, would that be an acceptable workaround until the issue is solved upstream?

comment:52 Changed 4 years ago by Branko Majic <branko@…>

What should the supported_rates and basic_rates be set to? Did you manage to work around the bug by using those settings?

comment:53 Changed 4 years ago by Wipster

Unfortunately those settings did not help, and I'm not sure the rate setting with iw helped for me either.

comment:54 Changed 4 years ago by richlv@…

i might be suffering from the same issue for years (since pre-kamikaze/wr days...)

my current setup is :
asus wl500gp v1
r34676 of svn://svn.openwrt.org/openwrt/branches/attitude_adjustment

wireless stops working every now and then. note - restarting hostapd makes it working for me again (i'd even be ready to have a script that does this when connection drops, i just don't know how to detect it on the router itself...)

# dmesg | grep b43
b43-pci-bridge 0000:00:02.0: setting latency timer to 64
b43-phy0: Broadcom 4318 WLAN found (core revision 9)
b43-phy0: Found PHY: Analog 3, Type 2 (G), Revision 7
Registered led device: b43-phy0::tx
Registered led device: b43-phy0::rx
Registered led device: b43-phy0::radio
b43-phy0: Loading firmware version 666.2 (2011-02-23 01:15:07)
# cat /etc/config/wireless 
config wifi-device  radio0
        option type     mac80211
        option channel  11
        option macaddr  00:22:15:27:3a:74
        option hwmode   11g

        # REMOVE THIS LINE TO ENABLE WIFI:
#       option disabled 1

config wifi-iface
        option device   radio0
        option network  lan
        option mode     ap
        option ssid     OpenWrt
        option encryption psk2
        option key      thepresharedkey

any additional information that would help ?

Last edited 4 years ago by hauke

comment:55 Changed 4 years ago by richlv@…

ok, i screwed up the formatting. its still mostly readable, so i won't spam with repeated content :)

comment:56 Changed 4 years ago by richlv@…

just happened again. nothing interesting in dmesg or logread. a couple of minutes after wifi stops working there are "deauthenticated due to local deauth request" entries for all wifi clients, but that seems to be an effect, not the cause.

another issue suggested looking at cat /sys/kernel/debug/ieee80211/phy0 for the specific device, but i do not have /sys/kernel/debug

comment:57 Changed 4 years ago by richlv@…

thanks to hauke for fixing formatting. anybody with debugging ideas ? i might have a few days available for testing this year :)

comment:58 Changed 4 years ago by AndreasKloeckner <inform@…>

(sorry for the interruption. commenting in hopes i get added to cc list)

comment:59 Changed 4 years ago by lucky0106@…

I can confirm that this issue persists on my WRT54Gv4 as well. This is sort of a biggie, hope it can be fixed soon :( Is the web UI crapping itself when attempting to make wireless config changes related?

comment:60 Changed 4 years ago by AndreasKloeckner <inform@…>

Both

  • rmmod b43; rmmod b43-legacy; insmod b43 qos=0
  • iw dev ... set bitrates

seem ineffective on a WRTSL54GS with "ATTITUDE ADJUSTMENT (12.09-beta, r33312)".

comment:61 Changed 4 years ago by bittorf@…

iw dev ... set bitrates

Must be called after wifi up (otherwise it gets overwritten).
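The ordering can be sketched as a wrapper that brings wifi up first and only then applies the rate restriction. This is an illustration, not a tested fix: the function names and the RUN switch are made up, and it assumes that "wifi up" returns only once the interface exists. By default it only prints what it would run.

```shell
#!/bin/sh
# Build the rate-restriction command for one wireless device.
# bitrates_cmd and apply_after_wifi_up are hypothetical helper names.
bitrates_cmd() {
    echo "iw dev $1 set bitrates legacy-2.4 6 9 12 18 24 36 48 54"
}

# Bring wifi up first, then restrict the rates, so that the setting
# is not overwritten by the wifi bring-up.
# RUN is a hypothetical switch: RUN=1 executes, anything else prints.
apply_after_wifi_up() {
    dev="$1"
    cmd=$(bitrates_cmd "$dev")
    if [ "${RUN:-0}" = "1" ]; then
        wifi up && $cmd
    else
        echo "wifi up && $cmd"
    fi
}

# Dry run (prints the commands): apply_after_wifi_up wlan0
# Real run on the router:        RUN=1 apply_after_wifi_up wlan0
```

Printing by default makes it possible to check the device name before anything touches the radio.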

comment:62 Changed 4 years ago by richlv@…

is there a reliable way to detect when it breaks ? i currently have a script check logread output every minute and restart hostapd if last entry is about "disassociate" or so, but it feels very hackish and wrong :)
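The log check described here could look roughly like the sketch below. It is the same heuristic, so just as hackish: the function name is made up, it assumes that hostapd lines in the log contain the word "hostapd", and it matches the "deauthenticated" wording quoted earlier in this ticket plus "disassociated". It reads log text on stdin and does not restart anything itself.

```shell
#!/bin/sh
# Succeed (exit 0) if the most recent hostapd line on stdin reports
# a deauthentication or disassociation, otherwise fail (exit 1).
# last_hostapd_line_is_drop is a hypothetical helper name.
last_hostapd_line_is_drop() {
    awk '
        /hostapd/ { last = $0 }      # remember the newest hostapd line
        END {
            if (last ~ /deauthenticated|disassociated/) exit 0
            exit 1
        }
    '
}

# Usage on the router (not run here):
#   logread | last_hostapd_line_is_drop && echo "wifi looks stuck"
```

A normal client leaving the network produces the same log lines, so a check like this can trigger without the driver being stuck.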

comment:63 Changed 4 years ago by Lieta

I get Unhandled kernel unaligned access on WL700gE in b43 when I start torrent download on PC connected to LAN (not WiFi) port of router.

[ 1064.912000] [sched_delayed] sched: RT throttling activated
[ 1123.212000] Unhandled kernel unaligned access#1:
[ 1123.212000] Cpu 0
[ 1123.212000] $ 0 : 00000000 1000dc00 b54e6225 b54e6235
[ 1123.212000] $ 4 : 00010000 00010000 8108220c 00000024
[ 1123.212000] $ 8 : 00000020 81080000 00000004 00000001
[ 1123.212000] $12 : 82c11f90 00000000 00000000 0000003c
[ 1123.212000] $16 : 82d81bc0 00000001 802acc10 802e0000
[ 1123.212000] $20 : 00000006 00000008 808e0000 808e0000
[ 1123.212000] $24 : 00000000 801f5e6c
[ 1123.212000] $28 : 82fa4000 82fa5ca0 00208140 801baa1c
[ 1123.212000] Hi : 00000000
[ 1123.212000] Lo : 00000000
[ 1123.212000] epc : 8006c140 put_page+0x34/0xbc
[ 1123.212000] Tainted: G O
[ 1123.212000] ra : 801baa1c skb_release_data+0xf8/0x178
[ 1123.212000] Status: 1000dc03 KERNEL EXL IE
[ 1123.212000] Cause : 00800010
[ 1123.212000] BadVA : b54e6235
[ 1123.212000] PrId : 00029006 (Broadcom BMIPS3300)
[ 1123.212000] Modules linked in: ide_gd_mod aec62xx ide_core usb_storage ohci_hcd nf_nat_irc nf_nat_ftp nf_conntrack_irc nf_conntrack_ftp ipt_MASQUERADE iptable_nat nf_nat pppoe xt_conntrack xt_CT xt_NOTRACK iptable_raw xt_state nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack ehci_hcd sd_mod pppox ipt_REJECT xt_TCPMSS xt_LOG xt_comment xt_multiport xt_mac xt_limit iptable_mangle iptable_filter ip_tables xt_tcpudp x_tables tun ppp_async ppp_generic slhc ext4 jbd2 mbcache b43legacy(O) b43(O) mac80211(O) usbcore usb_common scsi_mod nls_base crc16 crc_ccitt cfg80211(O) compat(O) ssb_hcd bcma_hcd arc4 aes_generic crypto_blkcipher cryptomgr aead crypto_hash crypto_algapi switch_robo(O) switch_core(O) b44 diag(O)
[ 1123.212000] Process irq/6-b43 (pid: 941, threadinfo=82fa4000, task=82c031b8, tls=00000000)
[ 1123.212000] Stack : 82d10bc8 00000000 00000040 838c8560 82d81bc0 82d81bc0 82c29460 801baab8

    802d0000 801c487c 838c8560 82d81bc0 82d81bc0 801c3a60 82fa5cd8 82fa5cd8
    808e1204 00000007 00000001 808e1208 00000003 00000002 00000100 800204d0
    00ff0000 8005826c 802d0000 00010000 fffffffe 802d0000 1000dc00 82d74000
    00ff0000 8005826c 802d0000 00010000 fffffffe 802d0000 80060000 80020760
    ...

[ 1123.212000] Call Trace:
[ 1123.212000] [<8006c140>] put_page+0x34/0xbc
[ 1123.212000] [<801baa1c>] skb_release_data+0xf8/0x178
[ 1123.212000] [<801baab8>] kfree_skb+0x1c/0x1b8
[ 1123.212000] [<801c3a60>] net_tx_action+0xa0/0x1e0
[ 1123.212000] [<800204d0>] do_softirq+0xc8/0x1b4
[ 1123.212000] [<80020760>] do_softirq+0x5c/0x94
[ 1123.212000] [<80020994>] irq_exit+0x4c/0x7c
[ 1123.212000] [<80005744>] ret_from_irq+0x0/0x4
[ 1123.212000] [<8025992c>] mutex_unlock+0x18/0xa0
[ 1123.212000] [<82f07a24>] b43_controller_restart+0x860/0xa3c [b43]
[ 1123.212000] [<801a3f24>] ssb_pci_write32+0x0/0x94
[ 1123.212000]
[ 1123.212000]
Code: 00a42024 10800008 00000000 <c0640000> 2485ffff e0650000 10a0fffc 00000000 0801b06c
[ 1123.472000] ---[ end trace 562f8b910bfe7f02 ]---
[ 1123.484000] Kernel panic - not syncing: Fatal exception in interrupt
[ 1123.484000] Rebooting in 3 seconds..

dmesg | grep b43

[    0.704000] b43-pci-bridge 0000:00:01.0: setting latency timer to 64
[   16.648000] b43-phy0: Broadcom 4318 WLAN found (core revision 9)
[   16.700000] b43-phy0: Found PHY: Analog 3, Type 2 (G), Revision 7
[   17.144000] Registered led device: b43-phy0::tx
[   17.144000] Registered led device: b43-phy0::rx
[   17.144000] Registered led device: b43-phy0::radio
[   47.632000] b43-phy0: Loading firmware version 666.2 (2011-02-23 01:15:07)
[   48.008000] b43-phy0: Loading firmware version 666.2 (2011-02-23 01:15:07)

Wireless and LAN are connected in a bridge:
brctl show

bridge name     bridge id               STP enabled     interfaces
br-lan          8000.001bfc295a54       no              eth0.0
                                                        wlan0

OpenWrt BARRIER BREAKER (Bleeding Edge, r35205)
Linux kernel 3.6.11
I created a ticket #12861
Haven't tried

iw dev wlan0 set bitrates legacy-2.4 6 9 12 18 24 36 48 54

yet.

comment:64 follow-ups: Changed 4 years ago by richlv@…

kernel errors do seem as a different issue, as at least in my case i do not see anything like that.

so in the case when it just stops working, is there any guess as to which component fails ? maybe it should be reported upstream to kernel, hostapd or whatever ?

comment:65 in reply to: ↑ 64 Changed 4 years ago by Lieta

Replying to richlv@…:

kernel errors do seem as a different issue, as at least in my case i do not see anything like that.

so in the case when it just stops working, is there any guess as to which component fails ? maybe it should be reported upstream to kernel, hostapd or whatever ?

According to this:
http://wireless.kernel.org/en/users/Drivers/b43/
Known issues:
BCM4318 chipset: AP mode does not work because of packet loss in high transmission rates. Hard to debug & fix.

comment:66 Changed 4 years ago by bittorf@…

I already reported this issue upstream:
http://marc.info/?l=linux-wireless&m=134261326621995&w=2

IMHO this is exactly the issue and the underlying reason.
It has nothing to do with high transmission rates but
with changing modulation during switching between e.g. 11 and 12mbit

for us/our community the problem is fixed with using the
"iw dev ..." command, but there is a fix needed within openwrt:

1)
allow enforcing a fixed rateset via uci
otherwise the "iw dev ..." command is always needed
after a "wifi up"

2)
make sure the beaconing propagates our new
"i can only speak these rates" thing

3)
let hostapd also do this

for "1" i can send a patch, but 2 and 3 are more important
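As an illustration of point 1, here is a minimal sketch of a wrapper that re-applies the rate restriction after every "wifi up". The function names, the DRY_RUN switch and the 5 second settle delay are assumptions made for this example; only the "iw dev ..." command itself comes from this ticket.

```sh
#!/bin/sh
# Sketch only: function names, DRY_RUN and the settle delay are assumptions.

# Print (do not run) the iw command that limits IFACE to the OFDM rates.
ofdm_rates_cmd() {
    iface="${1:-wlan0}"
    echo "iw dev $iface set bitrates legacy-2.4 6 9 12 18 24 36 48 54"
}

# Bring wifi up, then re-apply the restriction.
# With DRY_RUN=1 the commands are only printed.
wifi_up_fixed_rates() {
    cmd="$(ofdm_rates_cmd "$1")"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "wifi up"
        echo "$cmd"
    else
        # Give the interface a moment to come back before calling iw.
        wifi up && sleep 5 && $cmd
    fi
}
```

Calling `DRY_RUN=1 wifi_up_fixed_rates wlan0` shows what would be executed. This does not address points 2 and 3.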

comment:67 in reply to: ↑ 64 Changed 4 years ago by bittorf@…

Replying to richlv@…:

kernel errors do seem as a different issue, as at least in my case i do not see anything like that.

so in the case when it just stops working, is there any guess as to which component fails ? maybe it should be reported upstream to kernel, hostapd or whatever ?

there is no easy "sign" of a stopped wifi:

we log the incoming frame counter: in adhoc-mode, with at least 1 surrounding adhoc-node running OLSR, this lets us detect: no more incoming wifi frames -> 'wifi up'. in AP-mode there is really no sign of a stopped wifi
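A rough sketch of such a frame-counter watchdog, assuming the standard /sys/class/net/<iface>/statistics/rx_packets counter is a good enough stand-in for the incoming frame counter. The function names and the 60 second interval are made up for this example and are not the script actually used.

```sh
#!/bin/sh
# Sketch only: the counter source and the interval are assumptions.

# Print the kernel's RX packet counter for an interface.
rx_packets() {
    cat "/sys/class/net/${1:-wlan0}/statistics/rx_packets"
}

# Succeed (exit 0) when the counter did not advance between two samples.
rx_stalled() {
    [ "$1" -eq "$2" ]
}

# Poll forever; run 'wifi up' whenever no new frames arrived in an interval.
watch_wifi() {
    iface="${1:-wlan0}"
    interval="${2:-60}"
    prev="$(rx_packets "$iface")"
    while sleep "$interval"; do
        cur="$(rx_packets "$iface")"
        if rx_stalled "$prev" "$cur"; then
            logger -t wifi-watchdog "no frames on $iface for ${interval}s, restarting wifi"
            wifi up
            cur="$(rx_packets "$iface")"
        fi
        prev="$cur"
    done
}
```

This only makes sense where steady incoming traffic is guaranteed, e.g. adhoc-mode with OLSR neighbours. In AP-mode with no clients it would restart the wifi for no reason.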

comment:68 Changed 4 years ago by nbd

I'm still hoping that somebody will spend the time to go from saying 'it has to do with changing modulation' to actually figuring out what's going on.
I'm occasionally looking into the code myself as well, whenever I get some ideas on what could be the cause.
Workarounds may be OK as a stopgap measure, but I don't want them to become elaborate and permanent, that would get people to stop caring.

comment:69 Changed 4 years ago by Dmitriy Taychenachev <dimichxp@…>

I have tried to investigate the stability issues on the WL-500GP under high load and have found that some of the silent freezes are probably due to an overflow of the RX DMA buffer; it seems that b43 does not handle such a situation at all. I don't have a lot of experience with such low-level stuff though, so please correct me if I'm wrong.

The attached patch is a very dirty workaround. I've been using it on my router for a few weeks; freezes still occur sometimes, but at least I can run a BitTorrent client without worries (before, it was guaranteed to crash the wifi thanks to a lot of simultaneous connections). Of course it's not the solution, but maybe it'll inspire someone to make further progress.

Changed 4 years ago by Dmitriy Taychenachev <dimichxp@…>

for mac80211 openwrt package

comment:70 Changed 4 years ago by nbd

Thanks for looking into this! Please also start a thread about this on the linux-wireless list and describe what you've figured out so far. Maybe somebody there will help come up with a proper fix.

comment:71 follow-up: Changed 4 years ago by AndreasKloeckner <inform@…>

comment:72 in reply to: ↑ 71 Changed 4 years ago by anonymous

Replying to AndreasKloeckner <inform@…>:

Possibly related commit:
https://dev.openwrt.org/changeset/35671/

yes, already applied. b43 should be stable now, but needs some more
memory. please build and do heavy testing.

comment:73 follow-up: Changed 4 years ago by lucky0106@…

Sorry if this is off topic, but how do we follow a commit and know when it's implemented in a major release? I've been looking forward to getting back on OpenWRT, but I want to make sure that I do so once the fix has been included.

comment:74 in reply to: ↑ 73 Changed 4 years ago by anonymous

Replying to lucky0106@…:

Sorry if this is off topic, but how do we follow a commit and know when it's implemented in a major release? I've been looking forward to getting back on OpenWRT, but I want to make sure that I do so once the fix has been included.

at the moment this is "only" included in trunk r35671 so you have to wait for e.g. backfire rc5 which will be released soon (~may 2013). rc4 was snapshotted at r24045.

comment:75 Changed 4 years ago by hauke

  • Resolution set to fixed
  • Status changed from accepted to closed

This problem should mostly be fixed in trunk r35671 and in the attitude adjustment branch r35947.

There is an additional check which detects an overflow in the dma code needed, which will hopefully be added later.

The fix is included in the current trunk snapshots and will probably be included in the final attitude adjustment release.

comment:76 follow-up: Changed 4 years ago by AndreasKloeckner <inform@…>

FWIW, the fix from r35671 significantly reduces the rate of incidence of the issue and makes the wifi much more stable, but it does *not* fix it. I personally didn't find that terribly surprising--all the patch does is quadruple the size of some DMA ring. That makes it less likely that the CPU will not have a chance to service the IRQ before an overrun occurs.

comment:77 in reply to: ↑ 76 Changed 4 years ago by bittorf@…

Replying to AndreasKloeckner <inform@…>:

FWIW, the fix from r35671 significantly reduces the rate of incidence of the issue and makes the wifi much more stable, but it does *not* fix it. I personally didn't find that terribly surprising--all the patch does is quadruple the size of some DMA ring. That makes it less likely that the CPU will not have a chance to service the IRQ before an overrun occurs.

but to say it again clearly: the patch "840-b43-workaround-rx-fifo-overflow.patch" _must_ still be included till all the DMA-code gets a rewrite. hauke, please add it to the default patch-list.

comment:78 Changed 4 years ago by anonymous

unfortunately, situation doesn't seem to be any better with asus wl-500gp v1 and trunk r36083.
i can still kill wifi easily by opening slashdot from two devices at the same time or opening a single webpage with 106 elements (as reported by opera).

again, hostapd restart solves it till the next time i actually try to use wifi

comment:79 follow-ups: Changed 4 years ago by AndreasKloeckner <inform@…>

FWIW, Dmitriy's patch certainly helps, but it does cause router freezes for me, too. (on a WRTSL54GS) As such, even the combination of larger RX fifo and the overrun fix isn't much better than the larger FIFO. I'll try without the overrun patch and with an obscenely large RX FIFO. I'll report back once I've done that.

comment:80 in reply to: ↑ 79 Changed 4 years ago by bittorf@…

  • Resolution fixed deleted
  • Status changed from closed to reopened

Replying to AndreasKloeckner <inform@…>:

FWIW, Dmitriy's patch certainly helps, but it does cause router freezes for me, too. (on a WRTSL54GS) As such, even the combination of larger RX fifo and the overrun fix isn't much better than the larger FIFO. I'll try without the overrun patch and with an obscenely large RX FIFO. I'll report back once I've done that.

before you do this: please post the dmesg-output after you did a 'wifi up' when the box was frozen, so we can see the statistics of b43 (it must be compiled with b43-debugging on)

comment:81 in reply to: ↑ 79 Changed 4 years ago by anonymous

Replying to AndreasKloeckner <inform@…>:

FWIW, Dmitriy's patch certainly helps, but it does cause router freezes for me, too. (on a WRTSL54GS) As such, even the combination of larger RX fifo and the overrun fix isn't much better than the larger FIFO. I'll try without the overrun patch and with an obscenely large RX FIFO. I'll report back once I've done that.

How often do you experience freezes with the patch applied? I would really like to have a look at the issue, but it does not occur whenever I try to reproduce it during a debugging session. Moreover, although I've seen such freezes before, there have been no problems with wifi on my router for more than a month, so it's not possible for me to debug it even during everyday usage. Do you have any ideas on how to reproduce the bug? It would be really nice if you'd provide some debugging information; I'm afraid the default output of b43 (even compiled with the debug flag) as suggested by Bastian would not be very useful (although indeed better than nothing), but if you have time we could patch the driver with some more interesting printk's.

comment:82 Changed 4 years ago by richlv

if anybody can come up with step-by-step debugging instructions for dummies, i could try them on asus wl500gp :)
(i'm seeing wifi-going dead with vanilla trunk; restarting hostapd resolves the problem)

comment:83 Changed 4 years ago by AndreasKloeckner <inform@…>

How often do you experience freezes with the patch applied?

Happened once so far, after about 3 days of having the patched image running. Note that the patch caused a different "freeze"--the entire machine becomes unresponsive, including via ethernet.

I'd be happy to help you try and debug this. Just tell me what to do and what patches to apply.

comment:84 Changed 4 years ago by bittorf@…

For debugging / enforcing traffic and load i do this:

2 x Strong Router: ar71xx / 400 MHz or more
1 x brcm47xx / b43:

[A]StrongRouter -- ethernet -- [B]b43-Router/Adhoc -- wifi -- [C]StrongRouter/Adhoc

They all speak OLSR, but you can simply set routes, so A can reach C over B.
Then make an endless download from /dev/zero to /dev/null via wget from C to A
and from A to C. a useful cgi-script for uhttpd looks like this:

file: /www/cgi-bin-download.sh

#!/bin/sh
cat <<EOF
Content-type: application/octet-stream
Content-Disposition: attachment; filename="testdownload.bin"

EOF
dd if=/dev/zero bs=128k count=9999 2>&-

then let the traffic flow:

from A to C:

root@BoxC:~ wget -O /dev/null http://$IP_ROUTER_A/cgi-bin-download.sh

from C to A:

root@BoxA:~ wget -O /dev/null http://$IP_ROUTER_C/cgi-bin-download.sh

if you are satisfied with the speed or the box freezes, unload the wifi
driver on the b43-box and post your b43-statistics via:

root@b43box:~ uci set wireless.radio0.disabled=1
root@b43box:~ wifi
root@b43box:~ dmesg

Let the games begin...

comment:85 follow-up: Changed 4 years ago by bittorf@…

one of my tests:
http://intercity-vpn.de/files/openwrt/b43test2.dmesg.txt

here you can see this line:

b43-phy0 debug: DMA-32 tx_ring_AC_BE: Used slots 204/256, Failed frames 301/82665 = 0.3%, Average tries 1.15

so, we have used 204 out of max 256 slots. try to beat it 8-)

comment:86 in reply to: ↑ 85 Changed 4 years ago by hauke

Replying to bittorf@…:

one of my tests:
http://intercity-vpn.de/files/openwrt/b43test2.dmesg.txt

here you can see this line:

b43-phy0 debug: DMA-32 tx_ring_AC_BE: Used slots 204/256, Failed frames 301/82665 = 0.3%, Average tries 1.15

so, we have used 204 out of max 256 slots. try to beat it 8-)

This is one of the TX queues and *not* the RX queue, which seems to be the problem here. From your log 64 slots should be enough for the RX ring:

[27546.364000] b43-phy0 debug: DMA-32 rx_ring: Used slots 54/256, Failed frames 0/0 = 0.0%, Average tries 0.00

Could you please move this discussion to the b43 mailing list at b43-dev [at] lists.infradead.org , because no b43 developer is reading this ticket, I am not a b43 developer.
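To compare rings quickly, the "Used slots" figures can be pulled out of the b43 debug lines with a small filter. This is a sketch that assumes the exact line format quoted above; the function name is made up for this example.

```sh
#!/bin/sh
# Read b43 debug lines on stdin and print "<ring> <used> <max>" per DMA ring.
# Assumes the "DMA-32 <ring>: Used slots <used>/<max>, ..." format seen in this ticket.
ring_usage() {
    sed -n 's/.*DMA-[0-9]* \([A-Za-z_]*\): Used slots \([0-9]*\)\/\([0-9]*\),.*/\1 \2 \3/p'
}
```

Usage: `dmesg | ring_usage`. Lines that do not match the format are silently skipped.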


comment:87 follow-up: Changed 4 years ago by AndreasKloeckner <inform@…>

As an update, I haven't yet had a chance to move away from the image with the larger RX ring and Dmitriy's patch. I haven't experienced any more crashes since, so I'll give it some more time to observe. I unfortunately can't replicate the strong router/weak router setup--I've only got this one router... :)

comment:88 in reply to: ↑ 87 Changed 4 years ago by bittorf@…

Replying to AndreasKloeckner <inform@…>:

As an update, I haven't yet had a chance to move away from the image with the larger RX ring and Dmitriy's patch. I haven't experienced any more crashes since, so I'll give it some more time to observe. I unfortunately can't replicate the strong router/weak router setup--I've only got this one router... :)

you can also use a simpler setup:

LaptopA ---ethernet--- RouterB (b43,AP-Mode) ~ ~ ~ wifi ~ ~ ~ LaptopC (client-mode)

then copy a large file via scp from A to C and (at the same time) from C to A.

comment:89 Changed 4 years ago by buckh@…

i have an ASUS WL-520gU that i gave up on b/c of dma buffer issues that i'd be willing to mail to anybody who's commented on this ticket thus far who's in the U.S. i don't have the original image, media, instructions or anything

comment:90 follow-up: Changed 3 years ago by thommyj@…

Confirmed on Asus wl500gd (see ticket 13076 for logs). Increasing the buffer size seems to help, but sometimes it still crashes. Setting allowed speeds didn't make any difference.

I think the correct way of handling it is to throw away the packets in the ring buffer when it's full (letting the clients' TCP reduce its speed instead of buffering old data).

RX desc underflow seems not to be critical (according to http://bcm-v4.sipsolutions.net/802.11/Registers#DINT and documents for other open broadcom products). So I made a patch that enables that interrupt and just tells the device to continue reception (effectively overwriting already received packets).

It has been running a few days with high load now without any problems (except some expected drops of course). During the tests I used 16 slots in the ringbuffer to provoke it further. CPU Load for the interrupt thread was constantly above 50%, most of the time around 90%. Simple ping traffic run as background experienced ~1-2% packet drops.

The patch can of course be combined with an increase in slots to 256, but 64 actually works well enough, at least on my system. A couple of dropped packets doesn't really matter, especially on wifi. Keeping it at 64 makes sure that old packets (with high latency) are not kept in the pipe. Peers will also throttle down traffic quicker if packets are dropped early.
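To put a rough number on the latency argument, this is back-of-the-envelope arithmetic, not a measurement. It assumes every slot holds one full 1500-byte frame and that the ring is drained at about 19 Mbit/s, the goodput reported later in this ticket. The function name is made up for this example.

```sh
#!/bin/sh
# Milliseconds needed to drain a full ring:
#   slots * frame_bytes * 8 bits / drain rate in kbit/s  (integer division)
# Arguments: SLOTS FRAME_BYTES DRAIN_KBIT_PER_S
ring_drain_ms() {
    echo $(( $1 * $2 * 8 / $3 ))
}

# Under the assumptions above:
#   ring_drain_ms 256 1500 19000   -> 161
#   ring_drain_ms  64 1500 19000   -> 40
#   ring_drain_ms  16 1500 19000   -> 10
```

So under these assumptions a full 256-slot ring can hold frames that are roughly four times older than those in a full 64-slot ring.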

Changed 3 years ago by thommyj@…

patch for adding rx desc underrun

comment:91 in reply to: ↑ 90 Changed 3 years ago by thommyj@…

..forgot: the patch name is 978-b43_dmarx_adddisc.patch
I also made a sysfs hack just to see drops and some stats, see 979-b43_addsysfs.patch

Changed 3 years ago by thommyj@…

hack for displaying debug info in sysfs for above patch

comment:92 Changed 3 years ago by piotras@…

I'd like to report successful results from limited testing of 978-b43_dmarx_adddisc.patch + 979-b43_addsysfs.patch on top of backfire branch (r33081) using Linksys WRT54GL. This is with kernel 2.6.32.27.

Without the patch I was affected by the original issue. For testing I performed uploads to WRT54GL configured as access point. With wireless connection saturated (about 2MBytes/s) the issue was usually triggered within first 100MBytes of each upload.

After applying the patches I repeated the same tests and didn't hit the issue with over 16Gbytes of uploaded data. Downloads are also working correctly. I'll continue testing the code for next few days, but using more usual (lighter) traffic.

Let me know if any specific tests with WRT54GL would be of help. E.g. if we need reports with more recent version of compat-wireless before patches can be merged.

comment:93 Changed 3 years ago by thommyj@…

Great work testing piotras.

FYI I made a post about the problem at b43 mailinglist
http://lists.infradead.org/pipermail/b43-dev/2013-April/003047.html

comment:94 Changed 3 years ago by richlv@…

it looks like the patches did not help for my case - but maybe i did something wrong.

  • trunk r36346
  • asus wl500p
  • to apply the patches, i compiled the image first, then applied them into :

build_dir/target-mipsel_uClibc-0.9.33.2/linux-brcm47xx/compat-wireless-2013-02-22 and rebuilt the image
they seem to be applied as i did get slot_usage & discards files in /sys/

  • opening my web browser & email client at the same time seemed to be enough to kill wifi. right after that happened, connecting via a cable revealed :
    • slot_usage
      max: 30
      last: 1
      nullruns: 67
      
    • discards
      0
      
  • a bit later logread revealed (which might simply be killing off the now inactive wifi session) :
    Apr 17 17:49:48 OpenWrt daemon.info hostapd: wlan0: STA 14:36:05:e9:1a:c5 IEEE 802.11: deauthenticated due to local deauth request
    
  • at this point, "discards" and "slot_usage" files disappeared (i did not check whether any upper level directories disappeared, too)
  • restarting hostapd immediately solved the problem

i would be really glad to try out any other debugging ideas anybody might have...

comment:95 Changed 3 years ago by thommyj@…

That's too bad. If you got the slot_usage and discards files, it seems like you managed to apply at least the sysfs patch correctly. Just looking at your printout, it seems like the code that I added for handling rx descriptor underruns didn't execute (discards=0). I can think of a couple of different reasons:

  • somehow 978-b43_dmarx_adddisc.patch didn't take. That seems unlikely since 979 apparently was added. But to be sure, did you see any problems when building? Did you see the status line saying that the build system used the patch?
  • I think the wl500premium has a different wireless chipset than the wl500deluxe (which I have). I have only tested this on BCM4306/3. What does dmesg say that you have? I don't know what the WRT54GL that piotras tested on has. Maybe your card should be handled differently.
  • Differences between openwrt revisions? I have only tested on r34185
  • You see a different error. When I was testing I started by maxing out the link speed with data that the router was supposed to bridge over to the wireless. When sending that amount of data I got another error (but this gave me a kernel dump). So there are certainly more areas that could improve. Have you tried increasing the rx slots? Did it feel more stable? If so it seems like the same (or a similar) error
  • I did something wrong in my patch (which of course is highly unlikely =) )

It's really strange that the sys-files disappeared when you restarted hostapd. How did you restart it? Do you use some script that unloads b43 and loads it again (which then could be the old one)?

comment:96 Changed 3 years ago by Richlv <richlv@…>

  • i just applied the patches manually... was i supposed to place them somewhere they would be picked up by the build process ?
  • ok, so the official name is wl500gp. seemingly relevant dmesg lines :
    [   15.372000] b43-phy0: Broadcom 4318 WLAN found (core revision 9)
    [   15.424000] b43-phy0: Found PHY: Analog 3, Type 2 (G), Revision 7
    [   15.452000] Broadcom 43xx driver loaded [ Features: PNL ]
    [   15.844000] Broadcom 43xx-legacy driver loaded [ Features: PLID ]
    [   36.876000] b43-phy0: Loading firmware version 666.2 (2011-02-23 01:15:07)
    
  • as for files disappearing, they disappeared on their own when wifi dropped, and reappeared (on their own again) short time later (which did not restore wifi) - hostapd was not involved there. that was one reproduced lockup only, though, and did not seem to affect overall wifi availability

comment:97 Changed 3 years ago by thommyj@…

I'm no openwrt expert but I made the patches to be placed under package/mac80211/patches. If I place them there they seem to be applied automagically. But it doesn't matter, adding them manually is ok if you know what you do. Since you got the sys-files you must have succeeded in the patching.
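For anyone who wants the build system to pick the patches up, a small sketch of the copy step. The function name is made up for this example; the target directory is the one mentioned above, and the make targets in the comment are the usual OpenWrt per-package targets (verbosity flag may differ between tree versions).

```sh
#!/bin/sh
# Copy patch files into the mac80211 patch directory of an OpenWrt buildroot.
# Usage: install_mac80211_patches BUILDROOT PATCH...
install_mac80211_patches() {
    buildroot="$1"
    shift
    dest="$buildroot/package/mac80211/patches"
    # Refuse to guess if the tree does not look like expected.
    [ -d "$dest" ] || { echo "no such directory: $dest" >&2; return 1; }
    for p in "$@"; do
        cp "$p" "$dest/" || return 1
    done
}

# Afterwards, from the buildroot:
#   make package/mac80211/clean
#   make package/mac80211/compile V=99
```

Example: `install_mac80211_patches ~/openwrt 978-b43_dmarx_adddisc.patch 979-b43_addsysfs.patch`, where ~/openwrt is only a placeholder for your checkout.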

Ok weird with the files. So trying to understand if it is the same problem you see:

  • Have you tried increasing the number of rx slots?
  • When you get the error, is the cpu load high? Specifically the kernel thread irq/X-b43?

comment:98 Changed 3 years ago by Richlv <richlv@…>

  • as a non-coder, i must admit that i have no idea how to increase rx slots ;)
  • with current trunk, i can kill wifi just by any network traffic above loading a couple simple pages. when it dies, cpu load does not go above 0.46 (when measured every 2 seconds)

comment:99 Changed 3 years ago by thommyj@…

Sorry =) There was a patch earlier in this thread which increased the number of rx slots; with that I almost never saw the error. See https://dev.openwrt.org/browser/trunk/package/mac80211/patches/840-43-increase_number_of_rx_dma_slots.patch?rev=35671

Your problem seems to be a bit different, I needed to load down the router quite hard. When you say 0.46, is that before or after the Wifi goes down? My problem was that the load was very high (when Wifi still was up), so the CPU didn't have time to handle all the data coming in from the network.

Have you tried locking down the speed as suggested earlier in the thread?
iw dev wlan0 set bitrates legacy-2.4 6 9 12 18 24 36 48 54
That didn't work at all for me, but seems to have solved it for some people.

comment:100 follow-up: Changed 3 years ago by Richlv <richlv@…>

ah, thanks, will try rx change tomorrow.
it's a bit hard to say when exactly load is higher, as i use the same wifi connection to break it :)

running
iw dev wlan0 set bitrates legacy-2.4 6 9 12 18 24 36 48 54
does not seem to help, i can still trigger the lockup rather easily. this time, shortly after it locked up, the router rebooted, too... couldn't find any evidence as for why it did that.

and huge thanks for looking into this, right now it's almost unusable as a wireless router - things seem to be worse with current trunk, compared to the situation a year or so ago...

comment:101 in reply to: ↑ 100 Changed 3 years ago by thommyj@…

You could also try to add the patch below to see if you still get interrupts after the wifi goes down. I didn't test it myself, just built it, but it should add a file under sysfs called interrupts. In that you can see the number of interrupts for the driver, the number of interrupts for DMA channel 0 (which is used for rx) and which types of interrupts have been caught since startup.

What I'm trying to see is whether this is a new problem or the same one I saw but with some sort of twist. So please read out the values just after you have loaded the b43 module, and a couple of times after the crash, just to see if the counters are still moving.
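A small helper for that "are the counters still moving" check. This is a sketch: the function name and the defaults are made up for this example, and the path of the interrupts file has to be passed in because it depends on where the patch creates it.

```sh
#!/bin/sh
# Sample FILE up to SAMPLES times, INTERVAL seconds apart.
# Exit 0 if the content changed, 1 if it stayed the same,
# 2 if the file could not be read (e.g. it disappeared).
counters_moving() {
    file="$1"
    samples="${2:-3}"
    interval="${3:-2}"
    first="$(cat "$file" 2>/dev/null)" || return 2
    i=1
    while [ "$i" -lt "$samples" ]; do
        sleep "$interval"
        cur="$(cat "$file" 2>/dev/null)" || return 2
        [ "$cur" != "$first" ] && return 0
        i=$((i + 1))
    done
    return 1
}
```

Usage after the wifi has died: `counters_moving /path/to/interrupts 5 2 && echo "still getting interrupts"`, with the real sysfs path filled in.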

Changed 3 years ago by thommyj@…

showing number of interrupts for b43 in sysfs

comment:102 follow-up: Changed 3 years ago by Richlv <richlv@…>

ok, new results...

  • looks like rx slot increase is already applied to trunk, thus extra patch failed (and the increase does not help);
  • applied the interrupt patch. when the wifi goes down, total still increases; if i attempt to connect to wifi, ch0total increases (otherwise it does not). wifi connections fail and ask for the password again (using psk2).

so it might not be b43/kernel issues, but purely a hostapd issue instead ?

comment:103 in reply to: ↑ 102 Changed 3 years ago by thommyj@…

Ah ok, well if the slot increase is not making things better and ch0total increases (even though its only when you try to reconnect), it doesn't sound like the same problem. The problem that my patch was trying to fix, killed the traffic totally from the network device.

Maybe you can start hostapd with -dd for debug output? To avoid cluttering this thread any further, you should probably also open a new ticket

comment:104 follow-up: Changed 3 years ago by piotras@…

Continued testing of 978-b43_dmarx_adddisc.patch + 979-b43_addsysfs.patch on top of backfire branch (r33081) using Linksys WRT54GL. Works with no issues under different load and with different types of traffic since I started 6 days ago.

Tested router uses Broadcom 5352 SoC with integrated WLAN, based on dmesg:

b43-phy0: Broadcom 5352 WLAN found (core revision 9)

comment:105 Changed 3 years ago by anonymous

hauke have you tested the latest proposed solution at all?

comment:106 Changed 3 years ago by Richlv <richlv@…>

thanks thommyj, split out the hostapd related problem as ticket #13403
if anybody has debugging hints, that would be greatly appreciated

comment:107 Changed 3 years ago by Richlv <richlv@…>

damn. nginx/trac errors made me create a dupe. the first one about hostapd related problem was/is ticket #13402

comment:108 Changed 3 years ago by bittorf@…

after a lot of testing, i can confirm that an r36467 image with

978-b43_dmarx_adddisc.patch
979-b43_addsysfs.patch

is working well. i even added a patch to reduce the rx_dma_slots from 256 to 16.
i can see some of

b43-phy0 warning: Rx descriptor underrun (high cpu load?), throwing packets

so the code is working and there was no interruption in the upload/download tcp-stream.

b43-debug shows after driver unloading:

[ 3968.044000] b43-phy0 debug: Wireless interface stopped
[ 3968.044000] b43-phy0 debug: DMA-32 rx_ring: Used slots 0/16, Failed frames 0/0 = 0.0%, Average tries 0.00
[ 3968.044000] b43-phy0 debug: DMA-32 tx_ring_AC_BK: Used slots 0/256, Failed frames 0/0 = 0.0%, Average tries 0.00
[ 3968.052000] b43-phy0 debug: DMA-32 tx_ring_AC_BE: Used slots 256/256, Failed frames 1565/412013 = 0.3%, Average tries 1.20
[ 3968.060000] b43-phy0 debug: DMA-32 tx_ring_AC_VI: Used slots 0/256, Failed frames 0/0 = 0.0%, Average tries 0.00
[ 3968.068000] b43-phy0 debug: DMA-32 tx_ring_AC_VO: Used slots 12/256, Failed frames 7/574 = 1.2%, Average tries 1.10
[ 3968.320000] b43-phy0 debug: DMA-32 tx_ring_mcast: Used slots 8/256, Failed frames 0/1245 = 0.0%, Average tries 1.00

here are the full logs:
http://intercity-vpn.de/files/openwrt/dmesg_b43test_P1=978-b43_dmarx_adddisc_P2=979-b43_addsysfs.patch_P3=decrease_number_of_rx_dma_slots_to_16.txt

# fastest download/upload-rate was ~2.35MB/sec = 19mbit/s goodput @ 54mbit
# before driver-restart ('wifi up') this was done:

root@OpenWrt:/# cat /sys/devices/ssb0:3/slot_usage
max: 15
last: 1
nullruns: 7778

root@OpenWrt:/# cat /sys/kernel/debug/ieee80211/phy0/netdev:wlan0/stations/00:21:6a:32:7c:1c/rc_stats
rate      throughput  ewma prob  this prob  this succ/attempt   success    attempts
       1         1.0       98.0      100.0             0(  0)        67          77
       2         1.9       95.5      100.0             0(  0)        18          20
       5.5       5.1       95.7      100.0             0(  0)        11          11
    P 11         9.7       96.1      100.0             0(  0)       416         507
       6         5.8       97.0      100.0             0(  0)        17          18
       9         8.6       95.7      100.0             0(  0)        11          11
 B    12        10.6       90.2      100.0             0(  0)      1649        2189
      18         9.0       52.1       50.0             0(  0)      1237        1825
      24         9.0       39.6       25.0             0(  0)       359         966
   D  36         9.9       30.7       25.0             0(  0)       730        1202
  C   48        10.4       25.0       25.0             0(  0)     13074       18516
A     54        11.5       24.9       25.0             4( 16)    199052      234512

Total packet count::    ideal 7029      lookaround 371
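To condense an rc_stats table like the one above into one overall success ratio, here is a sketch that assumes the last two columns of every data row are "success" and "attempts", as in the table above. The function name is made up for this example.

```sh
#!/bin/sh
# Read an rc_stats table on stdin and print
# "<total success> <total attempts> <success percent>".
rc_stats_totals() {
    awk '
        # Data rows end in two integer columns: success and attempts.
        # The header and the "Total packet count" line do not match.
        NF >= 2 && $NF ~ /^[0-9]+$/ && $(NF-1) ~ /^[0-9]+$/ {
            succ += $(NF-1); att += $NF
        }
        END {
            if (att > 0) printf "%d %d %.1f\n", succ, att, 100 * succ / att
        }'
}
```

Usage: `cat /sys/kernel/debug/ieee80211/phy0/netdev:wlan0/stations/<mac>/rc_stats | rc_stats_totals`, with the station's MAC address filled in.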

comment:109 Changed 3 years ago by hauke

  • Resolution set to fixed
  • Status changed from reopened to closed

The version of the 978-b43_dmarx_adddisc.patch patch sent for mainline inclusion was applied in r36474. Now only 32 RX slots are allocated.

comment:110 Changed 3 years ago by amain <amain@…>

Thank you all. This bug has been bugging me for years. I'm glad a solution has been found. I think many people will have a more stable router now.

comment:111 follow-up: Changed 3 years ago by Przemator

Hi all,

I have installed this:

http://downloads.openwrt.org/attitude_adjustment/12.09/brcm47xx/generic/openwrt-brcm47xx-squashfs.trx

on my Asus WL-500G Premium and set the WiFi security to WPA2 PSK. After 2 weeks the WiFi died, just like it did for the last 5 years on the 2.6 branch. I had to manually disable and enable it again.

I believe the issue has not been resolved. Also, I wonder how can I stop receiving messages from this site...

comment:112 in reply to: ↑ 111 Changed 3 years ago by thommyj@…

Replying to Przemator:

  • What does the log say?
  • Does it happen under specific conditions? High load?
  • What wifi-hw is in the Premium?
  • You said previously that the device crashed and restarted, do you have a copy of the dump?
  • Have you tried the patches above that show the number of interrupts to see if they are increased when the wifi dies?

The reason why I ask is that it seems like there are multiple errors in this thread. For example, richlv's issue above seems to be a different one from what my patches fix (see #13402, where it seems like a better power adaptor fixes the problem).

comment:113 Changed 3 years ago by Przemator

  1. Log:
    [  361.208000] b43-phy0: Loading firmware version 666.2 (2011-02-23 01:15:07)
    [  361.344000] device wlan0 entered promiscuous mode
    [  361.512000] b43-phy0: Loading firmware version 666.2 (2011-02-23 01:15:07)
    [  362.224000] br-lan: port 2(wlan0) entered forwarding state
    [  362.228000] br-lan: port 2(wlan0) entered forwarding state
    [  364.232000] br-lan: port 2(wlan0) entered forwarding state
    [23789.948000] sched: RT throttling activated
    [1105912.400000] device wlan0 left promiscuous mode
    [1105912.408000] br-lan: port 2(wlan0) entered disabled state
    [1105920.964000] b43-phy0: Loading firmware version 666.2 (2011-02-23 01:15:07)
    [1105921.104000] device wlan0 entered promiscuous mode
    [1105921.268000] b43-phy0: Loading firmware version 666.2 (2011-02-23 01:15:07)
    [1105921.976000] br-lan: port 2(wlan0) entered forwarding state
    [1105921.980000] br-lan: port 2(wlan0) entered forwarding state
    [1105923.984000] br-lan: port 2(wlan0) entered forwarding state
    
  2. High load might increase the risk, but I'm not sure if it's the reason. I mean, it is capable of working for 1-2 weeks without fail.
  3. If by HW you mean hardware, then it's Broadcom BCM4704@264Mhz.
  4. I flashed the device recently with Attitude Adjustment and it has not crashed since. In LUCI I see the Wifi at 0% power, so i press disable/enable button to restart only the Wifi.
  5. Unfortunately I have no idea how to apply these patches and with this amount of effort I might as well go back to kernel 2.4 or buy a new router :P.
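For reading logs like the one above: the bracketed dmesg timestamps are seconds since boot, so they can be turned into an uptime to see how long the wifi ran before it was restarted. A minimal sketch; the function name is made up for this example.

```sh
#!/bin/sh
# Convert a dmesg timestamp (seconds since boot; the fraction is ignored)
# into "<days>d <hours>h".
dmesg_uptime() {
    secs="${1%%.*}"
    echo "$(( secs / 86400 ))d $(( secs % 86400 / 3600 ))h"
}

# dmesg_uptime 1105912.400000   -> 12d 19h
```

The restart at [1105912.400000] in the log above therefore happened almost 13 days after boot, which fits the "after 2 weeks" report.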

comment:114 Changed 3 years ago by nbd

Przemator, the 12.09 release does not have the fix yet. You either have to use snapshots (from trunk), wait for a 12.09.1 release, or build the 12.09 branch yourself.

comment:115 Changed 3 years ago by Przemator

Well, in this case I will wait for the 12.09.1, which is probably not until the next year.

comment:116 Changed 3 years ago by Przemator

Oh great, I installed the snapshot build and my router stopped working. It is booting normally but luci redirects to a non existing page and ash and telnet are not responding. Did I just kill my router?

comment:117 Changed 3 years ago by anonymous

Is there any reason why the 12.09 release can't include this fix?

comment:118 Changed 3 years ago by Przemator

I suppose it is because 12.09 has already been released. I managed to reinstall 10.03 with kernel 2.4 using TFTP, but I basically wasted a few hours after I made the mistake of installing the snapshot. Don't install snapshots unless you want to brick your router!

comment:119 Changed 3 years ago by anonymous

Oh, that's right, trac's display of up and coming releases is pretty confusing :P

You say you're using the 2.4 kernel. Does that mean users who need to stay on that kernel (e.g. WRT54G) will eventually see this patch in a release of Backfire?

comment:120 follow-up: Changed 3 years ago by cablop@…

How can we get the updated driver or package? I'm using Backfire 10.03.1 openwrt-brcm47xx-squashfs on a WRT54GSv2, but the packages on the website are from before this fix! Some of us just need the wifi working. Flashing the whole thing on such an old router is too risky.

comment:121 Changed 3 years ago by cablop@…

  • Resolution fixed deleted
  • Status changed from closed to reopened

How could it be fixed if the solution is not available to users?

comment:122 in reply to: ↑ 120 Changed 3 years ago by lucky0106@…

tgp1994

comment:123 Changed 3 years ago by lucky0106@…

Please excuse the autofill above.

According to nbd,

You either have to use snapshots (from trunk), wait for a 12.09.1 release, or build the 12.09 branch yourself.

Although I would like to be putting this on my WRT54G myself. DD-WRT's instructions are just a convoluted mess right now and there hasn't been even a beta release in years (or they forgot to update one of their many stickies)...

comment:124 Changed 3 years ago by cablop@…

It is a complete mess, for sure. Where can I get those snapshots (and for Backfire, not AA)? I cannot use 12.09, because the WRT54GS does not support it (it can be installed, but there's not enough memory to run it). So I want to be able to use this solution on Backfire. AFAIK, a few messages ago someone said they tested the fix on top of Backfire...

comment:125 in reply to: ↑ 104 Changed 3 years ago by cablop@…

Replying to piotras@…:

Continued testing of 978-b43_dmarx_adddisc.patch + 979-b43_addsysfs.patch on top of backfire branch (r33081) using Linksys WRT54GL. Works with no issues under different load and with different types of traffic since I started 6 days ago.

Tested router uses Broadcom 5352 SoC with integrated WLAN, based on dmesg:

b43-phy0: Broadcom 5352 WLAN found (core revision 9)

Can you help me or give me instructions on how to make a b43 (or the required) package for Backfire 10.03.1? Thanks in advance.

comment:126 follow-up: Changed 3 years ago by hauke

Backfire is not supported any more. This is a fix for b43, so it will not affect any broadcom-wl user or kernel 2.4 users.

The code is in trunk and in the 12.09 branch, so this ticket should be marked closed. The status is used to track whether a developer has to do something, not whether a new release should be built with this problem fixed.

comment:127 Changed 3 years ago by hauke

  • Resolution set to fixed
  • Status changed from reopened to closed

comment:128 Changed 3 years ago by anonymous

What are you people talking about?? There are daily, DAILY, updated builds here: http://downloads.openwrt.org/snapshots/trunk/ o_0

comment:129 in reply to: ↑ 126 Changed 3 years ago by cablop@…

Replying to hauke:

Backfire is not supported any more. This is a fix for b43, so it will not affect any broadcom-wl user or kernel 2.4 users.

The code is in trunk and in the 12.09 branch, so this ticket should be marked closed. The status is used to track whether a developer has to do something, not whether a new release should be built with this problem fixed.

well... openwrt-brcm47xx-squashfs does not have the broadcom-wl package... :/

comment:130 follow-up: Changed 2 years ago by pfalcon

Just to clarify - I don't see that the final fix (r36474) has been merged to branches/attitude_adjustment. More specifically, I see that the "841-b43-reduce-number-of-RX-slots.patch" patch was applied in r37266, but not the effective part of the patchset, "840-b43-Handle-DMA-RX-descriptor-underrun.patch".

Can the OpenWrt developers please clarify the situation and perhaps merge the fix to the attitude_adjustment branch, so that people can try it without risking the instabilities of trunk? Thanks.

comment:131 Changed 2 years ago by anonymous

I would like to start using OpenWrt on my router as well; this bug literally killed it for me in the past.

comment:132 in reply to: ↑ 130 Changed 2 years ago by anonymous

Replying to pfalcon:

Just to clarify - I don't see that the final fix (r36474) has been merged to branches/attitude_adjustment. More specifically, I see that the "841-b43-reduce-number-of-RX-slots.patch" patch was applied in r37266, but not the effective part of the patchset, "840-b43-Handle-DMA-RX-descriptor-underrun.patch".

I think that patch was removed because it was accepted upstream. So the fix should be incorporated into both trunk and 12.09. http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/drivers/net/wireless/b43?id=73b82bf0bfbf58e6ff328d3726934370585f6e78

comment:133 follow-up: Changed 2 years ago by pfalcon

Thanks for the clarification that the patch was merged upstream, that's good news.

And I'd like to share my experience with this issue. First of all, for me the issue is very easily reproducible in the following way: I have a NAS connected to the router (WL-500gD) via wired Ethernet, and whenever I try to download something sizable on the NAS (which thus goes to the router via Ethernet, which then does NAT on it and routes it to the ISP connection, also via Ethernet), I very soon experience a WiFi connection drop. To clarify again: yes, overloading the Ethernet connection causes WiFi to go down. From experimenting, this happens when the download speed is higher than ~1.3 MByte/s. CPU load and loadavg are high when this happens, so I'm sure this is the same issue as described in this ticket - fast-arriving Ethernet packets take quite an effort from the CPU to masquerade, which causes b43 interrupts to be delayed/missed, which causes WiFi to go down.

When I decided to investigate this, I had 12.09 installed. I downgraded to 10.03.1 (2.6 kernel), just to find out that the situation is worse there, as the wifi connection doesn't recover automatically (well, in full correspondence with this ticket's timeline). After I found this ticket, I upgraded to a trunk daily build: "BARRIER BREAKER (Bleeding Edge, r41302)", which, based on the discussion above, should have all the fixes.

Unfortunately, I cannot say that the issue, as formulated above, is resolved. Maybe it takes a bit longer from download start until the wifi connection drops, but that's the only effect I can see, which is hard to quantify and may be a placebo effect. Trying to download anything CD-size (like a movie) causes a wifi connection drop within a minute. And immediately after starting the Ethernet download, the wifi connection is effectively blocked - no web page can be opened, for example (though pings leisurely get through). It all ends in a wifi disconnect soon, and then the connection is not re-established until the download on the NAS finishes.

Precompiled builds from http://downloads.openwrt.org/snapshots/trunk/ don't have any of the debug files mentioned above, so I cannot provide more specific info regarding what happens when the connection drops. I hoped to be able to build my own image, but haven't been able to get to it for 3 weeks now, so I decided to just write down the above, in case someone is interested.
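
For anyone who wants to correlate system load with b43 interrupt activity without a debug build, here is a minimal sketch in Python. It only parses text in the format of /proc/loadavg and /proc/interrupts, so it can be run on any machine against output copied from the router. The function names are made up for this example, and it assumes the driver's interrupt row is labelled "b43" in /proc/interrupts, which has not been verified on these devices.

```python
def parse_loadavg(text):
    """Return the 1-, 5- and 15-minute load averages from /proc/loadavg text."""
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)


def irq_count(interrupts_text, name="b43"):
    """Sum the per-CPU counters of every /proc/interrupts row labelled `name`.

    Assumption: the driver registers its interrupt under the label "b43".
    """
    total = 0
    for line in interrupts_text.splitlines():
        fields = line.split()
        # Skip the CPU header row and anything that is not an "N:" row.
        if len(fields) < 2 or not fields[0].endswith(":"):
            continue
        # Shared interrupts list several names separated by commas.
        labels = [field.rstrip(",") for field in fields[1:]]
        if name not in labels:
            continue
        for field in fields[1:]:
            if not field.isdigit():
                break  # counters end where the controller name starts
            total += int(field)
    return total
```

Taking two samples a few seconds apart and subtracting the counts gives a rough interrupt rate; if that rate falls to zero while the load average stays high, that would be consistent with the stall described above, although it does not prove the cause.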

comment:134 Changed 2 years ago by anonymous

  • Resolution fixed deleted
  • Status changed from closed to reopened

comment:135 Changed 2 years ago by chris@…

I have just loaded 14.07 rc2 on a WRT54GS and I am experiencing the same issue.
I will build from trunk tomorrow to see what I get, but it looks like this bug is definitely still with us.
I can trigger it just by downloading the Ubuntu updates from a LAN-connected server.

comment:136 Changed 2 years ago by thommyj@…

I've been running my original patch (so not the updated kernel) ever since I first posted it in the thread. So far I have not experienced the same error. I have, however, seen another WiFi error with a similar outcome. I didn't analyze that error further than checking my added counters to see that it wasn't the same error. I've also seen an Ethernet driver crash under high load. Both happen very seldom for me (maybe 3 times in the last year or so). I'm running both an Asus WL-500gD and a Linksys WRT54G (I think ver 2.X). Those were also the ones I tested my original patch on, so I can't speak for other Broadcom chips, although if I remember correctly someone on the wireless mailing list tested the kernel patch on other hardware.

An easy check is to increase the number of available slots to see if it happens less often. The patch in this thread could be used (although it probably won't apply cleanly anymore).
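
To illustrate why the number of slots matters, here is a toy model in Python. It is not the b43 driver's logic: the function name, the tick-based timing and the parameters are all assumptions made for illustration. Frames arrive at a fixed rate, each needs a free RX slot, and the ring is only emptied every `stall_ticks` ticks, which stands in for interrupt handling being delayed under high load.

```python
def count_dropped_frames(num_slots, arrivals_per_tick, stall_ticks, total_ticks):
    """Toy model of an RX descriptor ring (hypothetical, not driver code).

    Every tick, `arrivals_per_tick` frames arrive and each takes one slot.
    The ring is drained only once every `stall_ticks` ticks.
    Returns the number of frames that found no free slot.
    """
    used = 0
    dropped = 0
    for tick in range(1, total_ticks + 1):
        for _ in range(arrivals_per_tick):
            if used < num_slots:
                used += 1
            else:
                dropped += 1  # no free descriptor: the frame is lost
        if tick % stall_ticks == 0:
            used = 0  # the delayed handler finally drains the ring
    return dropped
```

With the same arrival rate and the same stall, `count_dropped_frames(16, 4, 8, 80)` loses frames while `count_dropped_frames(64, 4, 8, 80)` does not. A larger ring only tolerates longer stalls and does not remove them, which matches the idea that more slots should make the error rarer rather than impossible.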

comment:137 Changed 2 years ago by PET

Any update on this, guys? I'm using an Asus WL-500g Premium and experiencing the same issue as mentioned above. I'm still on Backfire, but I hope it can still be solved soon!

comment:138 in reply to: ↑ 133 Changed 2 years ago by jogo

  • Resolution set to fixed
  • Status changed from reopened to closed

Replying to pfalcon:

Unfortunately, I cannot say that the issue, as formulated above, is resolved. Maybe it takes a bit longer from download start until the wifi connection drops, but that's the only effect I can see, which is hard to quantify and may be a placebo effect. Trying to download anything CD-size (like a movie) causes a wifi connection drop within a minute. And immediately after starting the Ethernet download, the wifi connection is effectively blocked - no web page can be opened, for example (though pings leisurely get through). It all ends in a wifi disconnect soon, and then the connection is not re-established until the download on the NAS finishes.

If the connection recovers after the download finished, then this is a different issue. This ticket is about the connection going down and not being able to recover until a reboot/wifi driver reload.

Please open a separate ticket for that, as the issue might be somewhere else, like e.g. the ethernet driver eating all cpu leaving none for the wifi driver etc.

comment:139 Changed 2 years ago by pfalcon

No, this is the same issue, namely "bcm47xx/b43 wireless connection suddenly stops when router system load increases", and yes, streaming a download via Ethernet (and via NAT) is an easy way to increase system load.

The connection recovers because, well, this was the *workaround* applied to make this issue manageable. I'm not going to reopen this ticket or open a new one though - there are limits to what the Open Source community can do to mend vendors' wicked ways. Sometimes it's just easier to say "$%*& off, Broadcom, I'll try hard to stay away from your products from now on", and buy human-friendly hardware (I'm astonished how much better Atheros works, for example).

Thanks to all who brought up the b43 driver though - it's great work, another example that the community is smarter than some vendors.

comment:140 follow-up: Changed 10 months ago by mmar@…

Hi folks,
I have recently been testing a WL500gp v1 with CHAOS CALMER (15.05).
I use b43 as a client, and when I enable any encryption, this issue occurs.
b43 now has a watchdog, so the router reboots every 30 seconds (the b43 IRQ is the top process).

Without encryption b43 works OK, but I have only tested it for a few hours so far.

Any news about this?

comment:141 in reply to: ↑ 140 Changed 10 months ago by anonymous

Replying to mmar@…:

Hi folks,
I have recently been testing a WL500gp v1 with CHAOS CALMER (15.05).
I use b43 as a client, and when I enable any encryption, this issue occurs.
b43 now has a watchdog, so the router reboots every 30 seconds (the b43 IRQ is the top process).

Without encryption b43 works OK, but I have only tested it for a few hours so far.

Any news about this?

I tested more and WPA2 AES seems stable too; it may be using hardware encryption.

comment:142 Changed 10 months ago by anonymous

Pretty sure I've been experiencing this issue for the last couple of years (last time this morning), through various versions of the Gargoyle firmware. The device is a WL500 G Premium v2. I just came across this ticket after finally getting fed up and searching to see whether anybody else is going through the same thing or whether this is a hardware problem. The symptoms match mine exactly: it typically happens under heavy load, like streaming HD video, especially when seeking forward/backward. Often the WiFi AP disappears, but the wired connection keeps functioning. Sometimes WiFi recovers after a couple of minutes, but often the router loses its WAN IP (DHCP) right after the incident and never acquires it again until power-cycled.

comment:143 Changed 7 months ago by anonymous

Any progress?
