Modify

Opened 4 years ago

Closed 4 years ago

Last modified 2 years ago

#11894 closed defect (fixed)

Hostapd stop working - EAPOL-Key timeout

Reported by: raptor2101@… Owned by: nbd
Priority: normal Milestone: Barrier Breaker 14.07
Component: packages Version: Trunk
Keywords: hostapd Cc:

Description

I'm running a trunk version of OpenWRT. Since two months I'm suffering Wifi-Problems

OpenWRT Version: r32739 (installed 17.07.2012)
Router: Buffalo WZR-HP-AG300H
WLAN: 2.4 GHz WPA2 EAP-TLS via Freeradius running on Ubuntu 12.04

Problem: after starting hostapd all runs fine. After a time (24h till 3 days) suddenly clients aren't able to connect. Connected Clients still can use use the wifi-connect and stay connected till next rekeying. Rekeying also fails.

The problem can be solved by restarting hostapd (wifi down && wifi up). Clients can reconnect immediately.

Note: cause i write the logs into mysql-db the logfile is a csv-file.

/etc/config/wireless

config wifi-device  radio0
        option log_level 0
        option type     mac80211
        option channel  10
        option macaddr  4c:e6:76:c2:5a:e8
        option hwmode   11ng
        option htmode   HT20
        option beacon_int 150
        list ht_capab   SHORT-GI-40
        list ht_capab   TX-STBC
        list ht_capab   RX-STBC1
        list ht_capab   DSSS_CCK-40

config wifi-iface
        option device           radio0
        option network          trustedlan
        option mode             ap
        option ssid             atlantic.aqua
        option encryption       wpa2
        option auth_server      192.168.99.2
        option auth_secret      #############
        option nasid            access-point
#        option iapp_interface   routerlan

if a client tries to connect wpa_supplicant logs

wlan0: direct probe to 4c:e6:76:c2:5a:e8 (try 1/3)
wlan0: direct probe to 4c:e6:76:c2:5a:e8 (try 2/3)
wlan0: direct probe to 4c:e6:76:c2:5a:e8 (try 3/3)
wlan0: direct probe to 4c:e6:76:c2:5a:e8 timed out

last log message from hostapd

wlan0: STA 18:87:96:18:c8:4e WPA: sending 1/2 msg of Group Key Handshake
wlan0: STA 18:87:96:18:c8:4e WPA: EAPOL-Key timeout
wlan0: STA 1c:b0:94:1c:78:a5 WPA: sending 1/2 msg of Group Key Handshake
wlan0: STA 1c:b0:94:1c:78:a5 WPA: EAPOL-Key timeout

Attachments (6)

ExportMessages.csv (7.7 KB) - added by raptor2101@… 4 years ago.
CSV LogMessages from hostapd
captured.pcap (67.2 KB) - added by raptor2101@… 4 years ago.
Captured MonitorDevice while hostapd stop working
ExportMessages.2.csv (72.9 KB) - added by raptor2101@… 4 years ago.
Logfile generated while hostapd runs with -dd
HostAPd-Strace.csv (63.3 KB) - added by anonymous 4 years ago.
500 rows strace and debug log
last_hostapd_msgs.txt (3.8 KB) - added by syzop 4 years ago.
last hostapd messages before hang
syslog.txt (545.1 KB) - added by syzop@… 4 years ago.
syslog of ap. wlan0* stopped on 10:11:06

Download all attachments as: .zip

Change History (78)

Changed 4 years ago by raptor2101@…

CSV LogMessages from hostapd

comment:1 Changed 4 years ago by anonymous

Problem still exist in r32878. Same log messages. THis could correlate with another error. If i try to setup a Kerberos-db on the router, the initilialisation stucks because /dev/random don't have enough entropy

comment:2 Changed 4 years ago by raptor2101@…

Ok done some further debugging. I'am running the hostapd with -dd flags and capturing managment-traffic with tcpdump.

Here are my resultes: It takes currently less then 24 hours till hostapd stop working. Logmessages don't show anything even with extended debugging. THis waths drives me crazy. The monitor-device captures incoming Probe-Requestes, Authentication-Requestes and (intern?) Acknownledge-Packages but nothing of this appears in the tracefile.

i attached some more files

Changed 4 years ago by raptor2101@…

Captured MonitorDevice while hostapd stop working

Changed 4 years ago by raptor2101@…

Logfile generated while hostapd runs with -dd

comment:3 Changed 4 years ago by Del <delonly@…>

It seems we have the same problem with the same router. I will be on travel for a week, but should be able to contribute after that. Thanks for your efforts!

comment:4 Changed 4 years ago by Del <delonly@…>

This issue affects both WPA and WPA2 encryption. It only affects the 2.4GHz radio, the 5GHz radio works fine.

Seems like it is a bug related to the Atheros AR9223. The same chip is found in the Netgear routers WNDR3700 and WNDR3800 both supported by OpenWrt. This really puzzles me, why don't we see more reports?

Any hints on what to look for or test is appreciated. The fact that the problem occurs after a couple of days makes trial and error a time consuming task.

comment:5 Changed 4 years ago by raptor2101@…

You can shorten the time between two failures by increasing the "reconnect" counts and/or using Radius.

If i have "many" reconnects on my network, it need less than 24h to get a "stalled" hostapd.

Could that caused by the "broken" /dev/random? On the WZR-HP-AG300H there are no entropy supplied by this device ... (f.e. Kerberos isn't able to create a new database)

comment:6 Changed 4 years ago by jow

Could you try to hook onto hostapd with strace -p when it hangs?
Alternatively could you compile hostapd with debugging enabled, then run it in foreground in e.g. a screen session until it stalls?

comment:7 Changed 4 years ago by anonymous

at the next stall you will get a strace log...

comment:8 Changed 4 years ago by raptor2101@…

Regarding to the strace output, hostapd hangs on a "sending"-request. In my case this took over 8 hours without returning... is this kernel/driver oder hostapd related?

sendto(9, "\2\3\0\177\2\23\202\0\20\0\0\0\0\0\0\0O#\362R\206\344\301\217\242\26 \211L\27f\32"..., 131, 0, {sa_family=AF_PACKET, proto=0x888e, if9, pkttype=PACKET_HOST, addr(6)={0, 00042028ef0a}, 20

with "kill -INT" leads to a crash of hostapd, all other signals are ignored..

comment:9 Changed 4 years ago by nbd

please try the latest version of openwrt to see if it happens there as well.

comment:10 Changed 4 years ago by Del

Latest trunk is on. I opted for keeping settings when flashing, please inform me is I need to wipe the settings to get the updated hostapd config.

comment:11 Changed 4 years ago by nbd

You don't need to wipe your settings

comment:12 Changed 4 years ago by anonymous

Updated to latest version last night, waiting for next hang/crash...

comment:13 Changed 4 years ago by Del <delonly@…>

Running r33482 now since Sunday, today the wireless refuses connections on the 2.4GHz radio. The 5GHz radio still works, but that may be related to the fact that I only have one device connecting there, while a number of devices connect to the 2.4GHz radio.

comment:14 Changed 4 years ago by raptor2101@…

running r33482 it takes 24 till freeze

hostapd hangs on the following:

sendto(9, "\2\3\0\177\2\23\202\0\20\0\0\0\0\0\0\0\6\274\245\267e\305\247\356\365\362+\367\262dA+"..., 131, 0, {sa_family=AF_PACKET, proto=0x888e, if9, pkttype=PACKET_HOST, addr(6)={0, 1cb0941c78a5}, 20

comment:15 Changed 4 years ago by jow

raptor2101, can you paste a log with more context? (e.g. the last 100-200 lines leading to this specific hang). In particular it would be interesting to see what fd 9 refers to.

comment:16 Changed 4 years ago by nbd

next time it gets into that state, please paste the output of

cat /sys/kernel/debug/ieee80211/phy0/queues

comment:17 Changed 4 years ago by raptor2101@…

strace runs in background... next stale you will get both more detailed logs and the cat of queues...

comment:18 Changed 4 years ago by raptor2101@…

Ok hostapd hangs again - here is the requested output

cat /sys/kernel/debug/ieee80211/phy0/queues
00: 0x00000000/0
01: 0x00000000/0
02: 0x00000000/0
03: 0x00000000/0

cat /sys/kernel/debug/ieee80211/phy0/ath9k/regdump 
Segmentation fault

the newly attached file contains 500 rows of hostapd straces and logmessages, if you need more, I currently storring ~ 1GB of hostapd logtraces ;)

Changed 4 years ago by anonymous

500 rows strace and debug log

comment:19 Changed 4 years ago by nbd

please try the latest version, i fixed a bug that might be related to this issue.

comment:20 Changed 4 years ago by raptor2101@…

can you give me a source
http://downloads.openwrt.org/snapshots/ is empty

comment:21 Changed 4 years ago by nigelr

I'm still having this problem with r33624 on a TP-LINK TL-WDR4300, and it's only on the 2.4GHz interface.

comment:22 Changed 4 years ago by nbd

found more issues that might have been causing this issue, please try r33635 or later

comment:23 Changed 4 years ago by Del <delonly@…>

Thanks, compiling r33666 as we speak. Raptor, missing images is a good opportunity to test the excellent openwrt build system ;-) I always enjoy the make menuconfig step.

comment:24 Changed 4 years ago by Del <delonly@…>

Sorry, the problem is still there. Unable to connect this evening after a couple of days. Thanks for the efforts, just keep them coming :-) Let me know if there is anything else I can do.

comment:25 Changed 4 years ago by nbd

followup fix committed in r33695, should work now.

comment:26 Changed 4 years ago by Del

r33725 is installed, will report back in couple of days. Thanks!

comment:27 Changed 4 years ago by Del <delonly@…>

Sorry, the connection was not able to connect a couple of minutes ago. This time I actually needed to reboot the router to get everything up again, /etc/init.d/network restart did not work. This has not happened before.

comment:28 Changed 4 years ago by nigelr

Yep, I'm still having the same problem after a couple of days on r33780.

comment:29 Changed 4 years ago by Del <delonly@…>

I have no idea why my router needed a reboot a week ago. Right now I get this:

root@OpenWrt:~# uptime
15:04:32 up 7 days, 23:59, load average: 0.00, 0.01, 0.04

and wireless is working nicely without my intervention. Fantastic, it seems like this issue is finally resolved. You did it again nbd, huge thanks! Please close the report.

comment:30 Changed 4 years ago by nbd

  • Resolution set to fixed
  • Status changed from new to closed

thanks for testing.

comment:31 Changed 4 years ago by Del <delonly@…>

  • Resolution fixed deleted
  • Status changed from closed to reopened

Whatever you did (I haven't looked at the commits), it did have an effect. However, after ten days of flawless wireless operation the problem re-surfaced. I am unable to connect this evening. Immediately after running /etc/init.d/network restart every client can connect again. Still it is the 2.4GHz radio giving problems, while the 5GHz (which is far less used) still accepts connections.

comment:32 Changed 4 years ago by Del <delonly@…>

Had a look at the patch, it seems you changed dev_kfree_skb_irq(skb); in favour of ieee80211_free_txskb(&local->hw, skb);

Not sure what effect it has, but there is still two places left where dev_kfree_skb_irq(skb); is used. One in iface.c and one in status.c

Does it make sense to change those too?

comment:33 Changed 4 years ago by nbd

No, those should not be changed.

comment:34 Changed 4 years ago by matt@…

I'm still seeing this issue in the latest Attitude Adjustment. After about 24 hours wireless clients can no longer connect to the 2.4ghz network. As reported by others, the 5ghz network is fine.

root@brooklyn:~# cat /sys/kernel/debug/ieee80211/phy0/queues
00: 0x00000000/0
01: 0x00000000/0
02: 0x00000000/0
03: 0x00000000/0

comment:35 Changed 4 years ago by nbd

I committed a fix today, try updating to latest attitude_adjustment svn.

comment:36 Changed 4 years ago by matt@…

Compiling and installing now. I'll check back in a couple of days.

Thanks!

comment:37 Changed 4 years ago by nigelr

I'm still having the problem with r33939 on AA branch.

comment:38 Changed 4 years ago by nbd

please try latest trunk or AA, the fix in r33991/r33992 might be relevant

comment:39 Changed 4 years ago by nbd

  • Owner changed from developers to nbd
  • Status changed from reopened to accepted

comment:40 Changed 4 years ago by Del <delonly@…>

Just installed a freshly baked r33991 (seems that was the latest in trunk). Thanks!

comment:41 Changed 4 years ago by nigelr

I've been running AA r34018 for 3 days with no problems so far. Fingers crossed it stays that way.

comment:42 Changed 4 years ago by syzop

I was about to post a new bug report, but then noticed this one.
I've 15 WNDR3700 access points deployed at a high school and we're experiencing the same issue: clients can no longer connect to the access point after a while. Even with hourly wifi down/up cronjobs it's not working well enough, apparently sometimes they even go unreachable within the hour.
I should point out that, since it's a highschool, we have a lot of associate/disassociate events when students migrate from class to class. And even when they are all IN class we still have 1 login every few seconds (globally).
We're running r33288 at the moment.
I'll see if I can upgrade the access points next week(s) and let you know if latest trunk fixes things for us. If so, that would be great!

comment:43 Changed 4 years ago by Del <delonly@…>

Five days of stable wireless and counting.

comment:44 Changed 4 years ago by Del <delonly@…>

Sorry, had to restart wireless this evening. The bug is still present as of r33991.

comment:45 Changed 4 years ago by syzop

Was hoping to help out with results, I would already be happy with the improvement Del described, unfortunately when I upgraded to r34064 the 5ghz radio doesn't seem to be transmitting anymore (laptop and another AP don't see the network). I created a new issue #12428 for this.

comment:46 Changed 4 years ago by matt@…

Just had to restart after 5 days uptime running r34056.

comment:47 Changed 4 years ago by nbd

found more issues of the same type. please try trunk or AA r34134, it should hopefully finally fix this.

comment:48 Changed 4 years ago by matt@…

Things seem to have gotten a bit worse for some 2.4ghz devices. Others seem to be working fine. I have an Android phone that can no longer ping the gateway after about 15 minutes of inactivity. I'm seeing a bunch of deauthenticated due to inactivity (timer DEAUTH/REMOVE) messages in the logs, which is something I haven't seen before.

comment:49 Changed 4 years ago by syzop

matt, did it get worse with the new r34134? or still testing with r34056?

I'm going to mass-upgrade a number of AP's next few days. Only sounds logical to use latest r34134, unless you are telling me it's worse than r34056 :)

comment:50 Changed 4 years ago by matt@…

It got worse with r34134. Pretty sure nothing changed on the Android device, and I've tried rebooting it with no change. One weird thing I noticed is that another desktop was unable to detect the 2.4ghz network while the Android device was unable to connect. Restarting wireless on the Android device restored connectivity, but it went away a few minutes later. I've switched my laptop over to the 2.4ghz network to try and replicate, but my guess is the issue is related to idle clients, so I'll need to let my laptop sit for a bit.

comment:51 Changed 4 years ago by syzop

Thanks for the feedback, matt.
I've upgraded 6 access points to r34122. Previously, with r33288 wireless of the AP's would quickly screw up, so I'm eager to see what happens in the next couple of days with this version...

comment:52 Changed 4 years ago by syzop

No improvement here with r34122.
I removed my hourly wifi restart cron job, so I could see how long it would survive without restarting wireless.
Access point worked fine, but within minutes when the 800+ students arrived you couldn't connect anymore (still a 5/5 signal).
Instead of restarting wifi to 'fix' things, I tried to debug a little.
This is what I found:

  • The access point still broadcasts it's SSID (both on 2.4 and 5GHz)
  • Connecting with one laptop works
  • Connecting with another laptop didn't work
  • Then I noticed that the laptop where things did work was only associated on the 5GHz side. So I did a little test: when I disabled the 2.4GHz radio on the laptop, it would still connect. But when I enable the 2.4 and disable the 5GHz it would no longer connect. So clearly a 2.4GHz issue.
  • I did not see any hostapd messages in syslog when I was unable to connect, nor any other syslog message.

So, and I think this is just confirming what everyone else already said, it seems hostapd on the 2.4GHz radio stops accepting new clients, it doesn't log anything.

Then when I killed (-15) hostapd of the 2.4GHz interface (wlan0) and started it in debug mode on the console, things worked again.

What can I do to help?
I wanted to strace hostapd but strace isn't installed, and there's no package by that name.

Off-topic: nothing new, but this dev site is often unreachable, which is quite annoying.

Changed 4 years ago by syzop

last hostapd messages before hang

comment:53 Changed 4 years ago by syzop@…

Don't think it helps, but, file attached with the last hostapd messages on this particular access point.
After this nothing is logged anymore with regards to the wlan0* interface.

Changed 4 years ago by syzop@…

syslog of ap. wlan0* stopped on 10:11:06

comment:54 Changed 4 years ago by syzop@…

On second thought, I've attached the full syslog of the access point since last night.
As said, the last message about wlan0 was on 10:11:06, then nothing anymore until I restart hostapd on 2.4GHz (wlan0) on 14:14:XX.
There are a number of ath: phy0: keycache entry 228 out of range messages. I don't know if they are related to this problem.

If you need anything else, let me know!

comment:55 Changed 4 years ago by Kufat

I've also noticed that this problem occurs on 2.4 but not 5. Of course, this might be an issue with some slightly misbehaving 2.4GHz client(s) common to both of our networks causing the state rather than being an issue with the 2.4 driver/firmware itself, but it could be worth investigating further.

If anybody needs any additional information gathered to help resolve this ticket, I'm almost always in #openwrt on freenode.

comment:56 Changed 4 years ago by anonymous

Clarification:
It is a bug in ath9k, but my point was that it might not be a bug specific to 2.4GHz code in ath9k. It could be an issue in code that's common to 2.4 and 5 that's only triggered in 2.4GHz due to client behavior.

comment:57 Changed 4 years ago by syzop@…

Could be that some specific 2.4GHz-client(s) trigger this.
Another possibility is that, because there are more clients using 2.4GHz (which don't have 5GHz capability), this larger number clients, and thus higher amount of associate/disassociate events, might cause the problem.

Just a few theories. Just to say that I concur that it doesn't have to be band related at all (code-wise / driver-wise), it just gets triggered on the 2.4GHz band all the time.

I'd also be happy to try any suggestions by a dev or anyone else. I'm really committed to solving this issue. Just tell me what you need / want me to do.

comment:58 Changed 4 years ago by Kufat

syzop, I should've mentioned that I usually only have one client on my 2.4GHz network (HTC Droid Incredible, an Android phone.)

comment:59 Changed 4 years ago by syzop@…

Oh really, that's interesting... maybe it isn't related at all to number of clients then. Or there are multiple issues.

I added my restart wifi cron job again (every 30 minutes) but now I got kernel dumps and wireless going down sometimes, I've opened a new bug #12480 for this since I don't think it's the same issue: it happens within seconds after wifi is restart.

comment:60 Changed 4 years ago by Aaaron16

I'm also experiencing this issue using ath9k_htc devices. In my application I have vehicle access points where most of the time no one is attempting to connect. Then after several days when someone tries to attach, it only works by restarting hostapd.

comment:61 Changed 4 years ago by syzop@…

Just adding some references to previous bugs which *seem* related:
#7363, #12144, #12159,
https://forum.openwrt.org/viewtopic.php?id=24350&p=3
And these may be unrelated, but are current ath9k issues on WNDR3700:
#11862 (Failed to stop TX DMA..) and #12480 (WARNING..at ath9k_beacon_tasklet)

Is it already clear if this is a driver issue or a hostapd issue?
To my knowledge it's still broadcasting it's SSID when this happens, does this mean anything? (eg: does this rule out a driver issue? or not?)

comment:62 Changed 4 years ago by Del <delonly@…>

I believe it is difficult to make categorical statements as to where the bug is until it is resolved. Again I have now had five days of stable wireless (and counting), this time with r34165. I do not believe I saw more the three days until nbd started working on it, so my best guess is to look at his patches to see where the problem is at. I just pointed another user to the relevant documentation here: https://dev.openwrt.org/ticket/12349#comment:8 Seems you are in a better position than the rest of us with respect to testing patches (trial and error is rather cumbersome with five days to wait), so feel free to dig in the code and test stuff out :)

comment:63 Changed 4 years ago by syzop@…

I can test latest trunk next week. I've succesfully generated 'my own image' using the openwrt build system earlier today.
My plan is to run hostapd in debugging mode and (s)trace it as was done earlier.
Not sure what else I can do at this point...

And yes, you're right, I can reproduce this issue within hours, or at least every working day for sure.

I didn't notice any improvement, though, between versions, unlike you (?).

comment:64 Changed 4 years ago by Del <delonly@…>

I am carefully optimistic, wireless still stable and:

root@OpenWrt:~# uptime
18:05:40 up 8 days, 19:31, load average: 0.00, 0.01, 0.04

Normal usage, five clients connecting multiple times each day on the 2.4GHz radio. Still running r34165.

comment:65 Changed 4 years ago by Del <delonly@…>

Fourteen days of stable wireless today. Hence I belive this bug is fixed, and the report can be closed. Syzop, it will be interesting to hear if this also solves your problems.

comment:66 Changed 4 years ago by syzop@…

I've upgraded two access points to r34200, they remained OK for 2 working days (now it's weekend).
I'm going to upgrade at least 6 more today. If by the end of the week they are still good, then I'm confident that this issue has been nailed (and then I'd be a very happy man ;p).
I'm just being cautiously optimistic, given that previously it also seemed resolved a number of times and then turned out not to be. This time it *looks* different, but 2 days is quite short to be sure, even for me ;)

I also wonder about Kufat and matt.

comment:67 Changed 4 years ago by Kufat

I've upgraded to r34336 and will let everyone know how it goes.

comment:68 Changed 4 years ago by matt@…

Things have stabilized here. I'm not sure what to say about the Android device as I've not upgraded since installing r34134.

comment:69 Changed 4 years ago by syzop@…

I have two users who can no longer connect after the firmware upgrade, but I'm not convinced yet if it's really related. Could be another problem, like maybe I made a config mistake when I did the upgrades.

Anyway: when I check out the logs of hostapd and look at the associate/disassociate events I can see the radio of all AP's are up throughout the day (both 2.4 and 5.X GHz). Previously hostapd would just stop logging any activity on some interfaces (usually in the morning already).

So, I'm pretty confident that the bug has been resolved. Thanks everyone! :)

comment:70 Changed 4 years ago by nbd

  • Resolution set to fixed
  • Status changed from accepted to closed

comment:71 Changed 4 years ago by Kufat

Well, it's post-resolution, but I wanted to say that the problem also seems to be resolved over here. Thanks for all the hard work, nbd (and all others who've helped!)

comment:72 Changed 2 years ago by jow

  • Milestone changed from Attitude Adjustment 12.09 to Barrier Breaker 14.07

Milestone Attitude Adjustment 12.09 deleted

Add Comment

Modify Ticket

Action
as closed .
The resolution will be deleted. Next status will be 'reopened'.
Author


E-mail address and user name can be saved in the Preferences.

 
Note: See TracTickets for help on using tickets.