Opened 9 years ago

Closed 8 years ago

Last modified 2 years ago

#2035 closed defect (fixed)

asus wl500gp - random bus errors / segmentation faults

Reported by: hause@…
Owned by: developers
Priority: high
Milestone: Barrier Breaker 14.07
Component: base system
Version:
Keywords: asus wl 500 g p random bus error segmentation fault
Cc:

Description

hello,

after flashing the asus with openwrt-brcm47xx-2.6-jffs2-64k.trx via atftp / restore button i get
random bus errors / segmentation faults.

i tried this with two asus wl500gp. result is the same on both devices.

i first noticed this with r7785.
the last kamikaze build i still have here that works without problems is r7346.

minicom.cap from the flashing-procedure is attached.

greets,

kh

Attachments (6)

minicom.cap (33.5 KB) - added by hause@… 9 years ago.
minicom logfile
minicom.2.cap (18.9 KB) - added by hause@… 9 years ago.
7971 nvram erase and testing
900-no-highpage.patch (2.1 KB) - added by jhansen@… 9 years ago.
Remove copy_user_highpage function
bcm47xx-prom.patch (596 bytes) - added by jhansen@… 9 years ago.
Get wl500gp mostly booting in 2.6.24-rc4
lzma_loader_cfe.patch (1.7 KB) - added by jhansen@… 8 years ago.
Don't nuke fw_argX from CFE
lzma_loader_cfe.2.patch (1.5 KB) - added by jhansen@… 8 years ago.
Don't nuke fw_argX from CFE (better)

Change History (147)

Changed 9 years ago by hause@…

minicom logfile

comment:1 follow-up: Changed 9 years ago by jhansen@…

Have you tried it since r7865? There were some problems with the cpu cache patches that should have been fixed.

comment:2 in reply to: ↑ 1 Changed 9 years ago by anonymous

Replying to jhansen@cardaccess-inc.com:

Have you tried it since r7865? There were some problems with the cpu cache patches that should have been fixed.

yes. last i tried was 7874 with the same result.

currently i go backwards till i catch the last one that works.
i also try the 7880 now.

comment:3 follow-up: Changed 9 years ago by nbd

any news?

comment:4 in reply to: ↑ 3 Changed 9 years ago by hause@…

Replying to nbd:

any news?

with 7963 the errors don't occur as often as before, but it's still unstable.
a command may work 5 to 10 times in a row, or it may not. it's like a lottery.

now i am trying 7971.

meanwhile i figured out that someone played around with the nvram-vars on the devices.
for example: the default ip in cfe was not set to 192.168.1.1

comment:5 follow-up: Changed 9 years ago by nbd

can you try erasing your nvram?

comment:6 in reply to: ↑ 5 Changed 9 years ago by hause@…

Replying to nbd:

can you try erasing your nvram?

sure. but i don't know which nvram-data is essential for cfe to boot or to detect the hardware.

comment:7 follow-up: Changed 9 years ago by nbd

after erasing the nvram partition using mtd, the boot loader will restore working defaults by itself after the reboot.
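
A minimal sketch of that erase-and-reboot step, assuming the OpenWrt `mtd` utility and a flash partition labelled `nvram` (the label that appears in the MTD partition list of the boot log in this ticket). The wrapper name `erase_nvram` and the `DRY_RUN` switch are made up for illustration; by default it only prints the commands, so nothing is erased by accident.

```shell
# Hypothetical helper, not part of OpenWrt.
# Prints the commands unless DRY_RUN=0 is set explicitly.
# After the erase and a reboot, the boot loader is expected to
# restore its own defaults, as described in the comment above.
erase_nvram() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "mtd erase nvram"
        echo "reboot"
    else
        mtd erase nvram && reboot
    fi
}
```

Running `erase_nvram` shows what would be executed; `DRY_RUN=0 erase_nvram` performs the erase on the router itself.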

Changed 9 years ago by hause@…

7971 nvram erase and testing

comment:8 in reply to: ↑ 7 ; follow-up: Changed 9 years ago by hause@…

Replying to nbd:

after erasing the nvram partition using mtd, the boot loader will restore working defaults by itself after the reboot.

7971 nvram erased. feels better but still errors. by the way. network configuration seems to be broken. see minicom.cap

comment:9 in reply to: ↑ 8 Changed 9 years ago by hause@…

Replying to hause@gmx.de:

Replying to nbd:

after erasing the nvram partition using mtd, the boot loader will restore working defaults by itself after the reboot.

7971 nvram erased. feels better but still errors. by the way. network configuration seems to be broken. see minicom.cap

8019... still the same

comment:10 follow-up: Changed 9 years ago by nbd

  • Resolution set to fixed
  • Status changed from new to closed

should be much more stable as of r8165

comment:11 in reply to: ↑ 10 Changed 9 years ago by anonymous

  • Resolution fixed deleted
  • Status changed from closed to reopened

Replying to nbd:

should be much more stable as of r8165

I'm using r8211 and still having random seg faults and bus errors quite often. I've attached a serial port capture of the boot process showing the errors.

Device eth0:  hwaddr 00-1B-FC-57-BA-B4, ipaddr 192.168.1.1, mask 255.255.255.0
        gateway not set, nameserver not set
Rescue Flag disable.
Loader:raw Filesys:raw Dev:flash0.os File: Options:(null)
Loading: .. 3740 bytes read
Entry at 0x80001000
Closing network.
Starting program at 0x80001000
Linux version 2.6.22.1 (keith@keith-desktop) (gcc version 4.1.2) #14 Sun Jul 29 00:51:19 EDT 2007
CPU revision is: 00029006
ssb: Core 0 found: ChipCommon (cc 0x800, rev 0x03, vendor 0x4243)
ssb: Core 1 found: Fast Ethernet (cc 0x806, rev 0x06, vendor 0x4243)
ssb: Core 2 found: Fast Ethernet (cc 0x806, rev 0x06, vendor 0x4243)
ssb: Core 3 found: USB 1.1 Hostdev (cc 0x808, rev 0x03, vendor 0x4243)
ssb: Core 4 found: PCI (cc 0x804, rev 0x08, vendor 0x4243)
ssb: Core 5 found: MIPS 3302 (cc 0x816, rev 0x03, vendor 0x4243)
ssb: Core 6 found: V90 (cc 0x807, rev 0x02, vendor 0x4243)
ssb: Core 7 found: IPSEC (cc 0x80B, rev 0x00, vendor 0x4243)
ssb: Core 8 found: MEMC SDRAM (cc 0x80F, rev 0x02, vendor 0x4243)
ssb: Initializing MIPS core...
ssb: set_irq: core 0x0806, irq 2 => 2
ssb: set_irq: core 0x0806, irq 3 => 3
ssb: set_irq: core 0x0804, irq 0 => 4
ssb: Sonics Silicon Backplane found at address 0x18000000
Determined physical RAM map:
 memory: 02000000 @ 00000000 (usable)
Initrd not found or empty - disabling initrd
Built 1 zonelists.  Total pages: 8128
Kernel command line: root=/dev/sda1 rootdelay=10 console=ttyS0,115200
Primary instruction cache 16kB, physically tagged, 2-way, linesize 16 bytes.
Primary data cache 16kB, 2-way, linesize 16 bytes.
Synthesized TLB refill handler (20 instructions).
Synthesized TLB load handler fastpath (32 instructions).
Synthesized TLB store handler fastpath (31 instructions).
Synthesized TLB modify handler fastpath (30 instructions).
PID hash table entries: 128 (order: 7, 512 bytes)
Using 132.000 MHz high precision timer.
Dentry cache hash table entries: 4096 (order: 2, 16384 bytes)
Inode-cache hash table entries: 2048 (order: 1, 8192 bytes)
Memory: 29440k/32768k available (2369k kernel code, 3328k reserved, 382k data, 124k init, 0k highmem)
Mount-cache hash table entries: 512
NET: Registered protocol family 16
SCSI subsystem initialized
usbcore: registered new interface driver usbfs
usbcore: registered new interface driver hub
usbcore: registered new device driver usb
ssb: PCIcore in host mode found
registering PCI controller with io_map_base unset
PCI: fixing up bridge
PCI: Fixing up device 0000:00:00.0
Time: MIPS clocksource has been installed.
NET: Registered protocol family 2
IP route cache hash table entries: 1024 (order: 0, 4096 bytes)
TCP established hash table entries: 1024 (order: 1, 8192 bytes)
TCP bind hash table entries: 1024 (order: 0, 4096 bytes)
TCP: Hash tables configured (established 1024 bind 1024)
TCP reno registered
squashfs: version 3.0 (2006/03/15) Phillip Lougher
Registering mini_fo version $Id$
JFFS2 version 2.2. (NAND) © 2001-2006 Red Hat, Inc.
io scheduler noop registered
io scheduler deadline registered (default)
Serial: 8250/16550 driver $Revision: 1.90 $ 2 ports, IRQ sharing enabled
serial8250: ttyS0 at MMIO 0x0 (irq = 3) is a 16550A
serial8250: ttyS1 at MMIO 0x0 (irq = 3) is a 16550A
b44.c:v1.01 (Jun 16, 2006)
eth0: Broadcom 10/100BaseT Ethernet 00:1b:fc:57:ba:b4
eth1: Broadcom 10/100BaseT Ethernet 40:10:18:00:00:2d
flash init: 0x1c000000 0x02000000
Physically mapped flash: Found 1 x16 devices at 0x0 in 16-bit bank
 Amd/Fujitsu Extended Query Table at 0x0040
Physically mapped flash: CFI does not contain boot bank location. Assuming top.
number of CFI chips: 1
cfi_cmdset_0002: Disabling erase-suspend-program due to code brokenness.
Flash device: 0x800000 at 0x1fc00000
bootloader size: 262144
Creating 4 MTD partitions on "Physically mapped flash":
0x00000000-0x00040000 : "cfe"
0x00040000-0x007f0000 : "linux"
0x00120000-0x007f0000 : "rootfs"
mtd: partition "rootfs" set to be root filesystem
split_squasfs: no squashfs found in "Physically mapped flash"
0x007f0000-0x00800000 : "nvram"
PCI: Enabling device 0000:00:03.2 (0000 -> 0002)
PCI: Fixing up device 0000:00:03.2
ehci_hcd 0000:00:03.2: EHCI Host Controller
ehci_hcd 0000:00:03.2: new USB bus registered, assigned bus number 1
ehci_hcd 0000:00:03.2: irq 6, io mem 0x40000000
ehci_hcd 0000:00:03.2: USB 2.0 started, EHCI 1.00, driver 10 Dec 2004
usb usb1: configuration #1 chosen from 1 choice
hub 1-0:1.0: USB hub found
hub 1-0:1.0: 4 ports detected
USB Universal Host Controller Interface driver v3.0
PCI: Enabling device 0000:00:03.0 (0000 -> 0001)
PCI: Fixing up device 0000:00:03.0
uhci_hcd 0000:00:03.0: UHCI Host Controller
uhci_hcd 0000:00:03.0: new USB bus registered, assigned bus number 2
uhci_hcd 0000:00:03.0: irq 6, io base 0x00000100
usb usb2: configuration #1 chosen from 1 choice
hub 2-0:1.0: USB hub found
hub 2-0:1.0: 2 ports detected
PCI: Enabling device 0000:00:03.1 (0000 -> 0001)
PCI: Fixing up device 0000:00:03.1
uhci_hcd 0000:00:03.1: UHCI Host Controller
uhci_hcd 0000:00:03.1: new USB bus registered, assigned bus number 3
uhci_hcd 0000:00:03.1: irq 6, io base 0x00000120
usb usb3: configuration #1 chosen from 1 choice
hub 3-0:1.0: USB hub found
hub 3-0:1.0: 2 ports detected
usb 1-1: new high speed USB device using ehci_hcd and address 2
Initializing USB Mass Storage driver...
usb 1-1: configuration #1 chosen from 1 choice
scsi0 : SCSI emulation for USB Mass Storage devices
usbcore: registered new interface driver usb-storage
USB Mass Storage support registered.
usbcore: registered new interface driver libusual
nf_conntrack version 0.5.0 (256 buckets, 2048 max)
ip_tables: (C) 2000-2006 Netfilter Core Team
TCP vegas registered
NET: Registered protocol family 1
NET: Registered protocol family 17
802.1Q VLAN Support v1.8 Ben Greear <greearb@candelatech.com>
All bugs added by David S. Miller <davem@redhat.com>
Waiting 10sec before mounting root device...
scsi 0:0:0:0: Direct-Access     Generic  USB Flash Disk   1.00 PQ: 0 ANSI: 2
sd 0:0:0:0: [sda] 4078080 512-byte hardware sectors (2088 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Assuming drive cache: write through
sd 0:0:0:0: [sda] 4078080 512-byte hardware sectors (2088 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Assuming drive cache: write through
 sda:<7>usb-storage: queuecommand called
 sda1 sda2
sd 0:0:0:0: [sda] Attached SCSI removable disk
EXT3-fs: INFO: recovery required on readonly filesystem.
EXT3-fs: write access will be enabled during recovery.
kjournald starting.  Commit interval 5 seconds
EXT3-fs: recovery complete.
EXT3-fs: mounted filesystem with ordered data mode.
VFS: Mounted root (ext3 filesystem) readonly.
Freeing unused kernel memory: 124k freed
Algorithmics/MIPS FPU Emulator v1.5
INIT: version 2.86 booting
Starting the hotplug events dispatcher: udevdudevd[192]: main: the kernel does not support inotify, udevd can't monitor configuration file changes
.
Synthesizing the initial hotplug events...done.
Waiting for /dev to be fully populated...udevd-event[293]: run_program: '/lib/udev/net.agent' abnormal exit
done.
Activating swap:/etc/rcS.d/S10checkroot.sh: line 24:   337 Segmentation fault      swapon -a -v
 failed!
Will now check root file system:fsck 1.40-WIP (14-Nov-2006)
[/sbin/fsck.ext3 (1) -- /] fsck.ext3 -a -C0 /dev/sda1
/dev/sda1: clean, 16951/222208 files, 174117/444016 blocks
.
EXT3 FS on sda1, internal journal
/etc/rcS.d/S10checkroot.sh: line 24:   336 Segmentation fault      swapon -a -v
 failed!
Will now check root file system:fsck 1.40-WIP (14-Nov-2006)
[/sbin/fsck.ext3 (1) -- /] fsck.ext3 -a -C0 /dev/sda1
/dev/sda1: clean, 16951/222208 files, 174117/444016 blocks
.
EXT3 FS on sda1, internal journal
Setting the system clock..
Cannot access the Hardware Clock via any known method.
Use the --debug option to see the details of our search for an access method.
System Clock set. Local time: Sat Jan  1 00:00:31 UTC 2000.
Cleaning up ifupdown....
Loading device-mapper support.
Will now check all file systems.
fsck 1.40-WIP (14-Nov-2006)
Checking all file systems.
Done checking file systems.
A log is being saved in /var/log/fsck/checkfs if that location is writable.
Setting kernel variables...done.
Will now mount local filesystems:.
Will now activate swapfile swap:swapon on /dev/sda2
Adding 262072k swap on /dev/sda2.  Priority:-1 extents:1 across:262072k
done.
Cleaning /tmp...done.
Cleaning /var/run...done.
/etc/init.d/bootclean: line 17:   567 Bus error               rm -f "$1"
* bootclean: Failure deleting '/var/run/.clean'.
Cleaning /var/lock...done.
Setting up networking....
Configuring network interfaces...

comment:12 in reply to: ↑ description Changed 9 years ago by anonymous

I see this issue as well. I'm running Kamikaze 7.07 openwrt-brcm47xx-2.6-squashfs.trx with no wireless (really barebones, even disabled httpd and crond). I get random "Segmentation Fault" errors. Occasionally, I get random output from the shell in the form of:

root@router:/proc# ls
-ash: : not found

"ls" will work fine right before or after the above. There is nothing special about /proc (same thing happens randomly elsewhere, e.g. /www).

I seem to be unable to "cat /proc/kmsg". I could do it a short while ago and it had a bunch of OOPS's in it.

On a positive note, all this weirdness does not seem to affect the actual routing. I've been torture testing the routing with no issues.

Thanks
Moh

comment:13 Changed 9 years ago by michu-at-neophob-com

same issues here:

root@OpenWrt:/# reboot
/bin/ash: : Permission denied
root@OpenWrt:/# reboot
The system is going down NOW !!
Sending SIGTERM to all processes.

perhaps related to ticket nr 2183?

regards
michu

comment:14 Changed 9 years ago by mountaindude

Yup, I see the same issues with Kamikaze 7.07 running on 2.6.22

comment:15 Changed 9 years ago by anonymous

Is anyone trying to fix this bug or has the 500gp lost support?

comment:16 Changed 9 years ago by jhansen@…

There has always been support for the 500gp with a 2.4 kernel. Unfortunately the 2.6 kernel is not stable on it yet.

comment:17 Changed 9 years ago by mangoo

There was a series of patches posted to linux-mips mailing list, which add BCM947xx support:

http://www.linux-mips.org/archives/linux-mips/2007-08/msg00103.html

comment:18 Changed 9 years ago by anonymous

Hmm, it mostly comes from OpenWRT though :)

comment:19 Changed 9 years ago by hause@…

hi,

after waiting some time i tried 8417.
it seems to be stable again.
actually i use 8461 and i don't have any error.

greets,

kh

comment:20 Changed 9 years ago by anonymous

Really ?

One week ago I tried a revision posterior to r8417 and it was very unstable on a WL-700gE ( not the same CPU though ). I retried r8479 yesterday and it doesn't even boot :o

comment:21 Changed 9 years ago by anonymous

s/posterior/later than
Sorry for my poor english :)

comment:22 Changed 9 years ago by mangoo

I just tried r8494 from today, and the bug is still there.

comment:23 Changed 9 years ago by jhansen@…

I'd work on this bug more but I just don't have time as of late. The WL-500g Premium actually worked perfectly in Kamikaze 7.06 with a 2.6.19.2 kernel. Something happened in 2.6.20 or 2.6.21 that causes the router to crash now.

So the approach I'd take is to use the old arch code from 2.6.19.2 (just the mainstream code in arch/mips and include/asm-mips) with the new BCM947XX BSP code in a 2.6.22 kernel, and see if it still boots/runs. If so, then the mips maintainers added a feature, etc. that doesn't work on this router. Just add each of the changes they made one by one and see which one breaks it, then add a flag so that that feature doesn't get used on this processor.

If using the old arch code with the new BSP still doesn't work, try using the new arch code with the old BSP. Perhaps something changed in the BCM947xx BSP that broke support.

This is how I fixed the kmap_coherent problems we were seeing on both the WGT634U and WL-500g premium, and I'm sure a similar approach should put this issue to rest. The mips maintainers added a feature that they assumed worked on all MIPS32's, but unfortunately these Broadcom CPUs couldn't handle it.

comment:24 Changed 9 years ago by mangoo

You mean, to check out the current svn revision, and try to apply these from revision 6580 (it's a 2.6.19.2 kernel that works fine):

$ ls -l 6580/trunk/target/linux/brcm47xx-2.6/patches
total 72
-rw-rw-r-- 1 build build 6083 2007-09-02 17:57 100-board_support.patch
-rw-rw-r-- 1 build build 1154 2007-09-02 17:57 110-flash_map.patch
-rw-rw-r-- 1 build build 45287 2007-09-02 17:57 120-b44_ssb_support.patch
-rw-rw-r-- 1 build build 3005 2007-09-02 17:57 130-remove_scache.patch
-rw-rw-r-- 1 build build 473 2007-09-02 17:57 140-export_uevent_handler.patch
-rw-rw-r-- 1 build build 842 2007-09-02 17:57 150-cpu_fixes.patch

instead of these:

$ ls -l 8577/target/linux/brcm47xx-2.6/patches-2.6.22
total 132
-rw-rw-r-- 1 build build 6668 2007-09-02 23:12 100-board_support.patch
-rw-rw-r-- 1 build build 1270 2007-09-02 23:12 110-flash_map.patch
-rw-rw-r-- 1 build build 45637 2007-09-02 23:12 120-b44_ssb_support.patch
-rw-rw-r-- 1 build build 3124 2007-09-02 23:12 130-remove_scache.patch
-rw-rw-r-- 1 build build 18418 2007-09-02 23:12 150-cpu_fixes.patch
-rw-rw-r-- 1 build build 2503 2007-09-02 23:12 160-kmap_coherent.patch
-rw-rw-r-- 1 build build 427 2007-09-02 23:12 170-cpu_wait.patch
-rw-rw-r-- 1 build build 8539 2007-09-02 23:12 200-b44_ssb_fixup.patch
-rw-rw-r-- 1 build build 12608 2007-09-02 23:12 210-ssb_fixes.patch
-rw-rw-r-- 1 build build 1149 2007-09-02 23:12 230-ssb_arch_setup.patch
-rw-rw-r-- 1 build build 5074 2007-09-02 23:12 240-extif_fixes.patch

?

comment:25 Changed 9 years ago by jhansen@…

Well, you can approach it either way; either try the 2.6.19.2 kernel with the r8577 patches, or the 2.6.21+ kernel with the r6580 patches (I'm assuming those are for pre-2.6.21). I personally would try using the 2.6.19.2 kernel + the new patches first, but we really don't know where the problem is, so either way is a good start. If 2.6.19.2 + the new patches works, then you'd apply each git changeset from 2.6.19 to 2.6.21 in arch/mips, and see which one breaks the platform.
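
The changeset-by-changeset search can be shortened by halving the range at each step. The sketch below is a generic binary search, not a script from this ticket: `first_bad` and the `is_good` predicate are hypothetical names, the changesets are assumed to be numbered in order, and the lower bound is assumed to be known good and the upper bound known bad. Because the crashes are intermittent, the predicate would need to stress the system long enough that a lucky pass is unlikely.

```shell
# first_bad LO HI CMD...
# LO is a changeset number known to be good, HI one known to be bad.
# CMD is called with a changeset number appended and must return 0
# when that changeset is good. Prints the first bad changeset.
first_bad() {
    lo=$1
    hi=$2
    shift 2
    while [ $((hi - lo)) -gt 1 ]; do
        mid=$(( (lo + hi) / 2 ))
        if "$@" "$mid"; then
            lo=$mid     # mid is good: the breakage is later
        else
            hi=$mid     # mid is bad: the breakage is here or earlier
        fi
    done
    echo "$hi"
}
```

With a predicate that builds, boots, and stress-tests changeset N, `first_bad 0 100 is_good` needs about seven builds instead of up to a hundred.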

comment:26 Changed 9 years ago by mangoo

Hmm, it's not trivial to apply 2.6.22.4 brcm47xx-2.6 patches to 2.6.19.2 - 2.6.19.2 lacks ssb, and it complicates the things a lot.

When we apply (after some minor modifications) 2.6.19.2 brcm47xx-2.6 patches to a 2.6.22.4 kernel, it doesn't compile... It's certainly more complicated.

comment:27 Changed 9 years ago by alex@…

Hi

Just to add to the information

I have been seeing the same random crashes on 7.07 (r8735) and 7.09 (r8779). Similar sort of problems: commands not working and then working.

comment:28 Changed 9 years ago by h.ruehl@…

Just to add some report. Same error with my wl500gp. Random segfaults and -ash: : not found. At the moment my router is not stable and it connects only one of three times to the Internet.

Hope the developers can increase priority for this bug. I believe the Asus Wl500gp is a very popular router and these Errors are annoying.

If you need some special Report. Tell me!

Regards Heinz

comment:29 Changed 9 years ago by anonymous

I reduced the number of SIGSEGVs by updating to squashfs 3.2 and configuring the LZMA decompression code to use 32-bit integers (define the macro _LZMA_PROB32). But I still get bus errors.

comment:30 Changed 9 years ago by mangoo

I'm getting errors even when I don't use squashfs, but jffs2 only.

comment:31 Changed 9 years ago by alex@…

I have tried 7.06 7.07 7.09. I have always gotten random segfaults (not sure what to look for in bus problems) in 7.07 7.09. I think as stated above there are some patches which seem to have destabilised the kernel for this environment. I haven't had any problems with 7.06 and I am trying to back port some of the packages

comment:32 follow-up: Changed 9 years ago by Wolfram Joost

I just saw two starts with 7.07 without any crashes. Before starting, I removed all lan cables. Can someone confirm this behaviour?

comment:33 in reply to: ↑ 32 Changed 9 years ago by anonymous

I'm only using the serial port and still getting lots of errors.

Replying to Wolfram Joost:

I just saw two starts with 7.07 without any crashes. Before starting, I removed all lan cables. Can someone confirm this behaviour?

comment:34 follow-up: Changed 9 years ago by Wolfram Joost

Currently, it looks to me like the (v)fork syscall causes the crashes.

comment:35 in reply to: ↑ 34 ; follow-up: Changed 9 years ago by mangoo

Replying to Wolfram Joost:

Currently, it looks to me like the (v)fork syscall causes the crashes.

Why do you think (v)fork syscall might be the cause?

comment:36 in reply to: ↑ 35 Changed 9 years ago by Wolfram Joost

Replying to mangoo:

Why do you think (v)fork syscall might be the cause?

Well, I saw problems only with the shell, either executing a script or in interactive mode. Sometimes, the parent process crashes but most of the time the child dies.
A program like dropbear runs without any problems if it survived the starting process.

I did some traces (printks in fs/exec.c and kernel/fork.c) and the crashes I saw were after the fork but before the execve syscall.

Currently I'm testing a kernel with the following change in include/asm-mips/cacheflush.h:

diff -Naurp linux-2.6.22.8.orig/include/asm-mips/cacheflush.h linux-2.6.22.8.patched/include/asm-mips/cacheflush.h
--- linux-2.6.22.8.orig/include/asm-mips/cacheflush.h   2007-09-25 08:05:13.000000000 +0200
+++ linux-2.6.22.8.patched/include/asm-mips/cacheflush.h        2007-09-27 10:54:40.158345323 +0200
@@ -32,7 +32,8 @@
 extern void (*flush_cache_all)(void);
 extern void (*__flush_cache_all)(void);
 extern void (*flush_cache_mm)(struct mm_struct *mm);
-#define flush_cache_dup_mm(mm) do { (void) (mm); } while (0)
+/* #define flush_cache_dup_mm(mm)      do { (void) (mm); } while (0) */
+#define flush_cache_dup_mm(mm) flush_cache_mm(mm)
 extern void (*flush_cache_range)(struct vm_area_struct *vma,
        unsigned long start, unsigned long end);
 extern void (*flush_cache_page)(struct vm_area_struct *vma, unsigned long page, unsigned long pfn);

Trial and error...
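
One way to probe the fork hypothesis from userspace, as a rough sketch: compare a loop that only forks with one that forks and then execs. The function names are invented here, and the sketch assumes the shell really forks for a `( ... )` subshell, which ash and bash do but some other shells optimize away. It cannot detect a crash of the parent shell itself.

```shell
# Count failed iterations of a fork-only workload: the subshell runs the
# ":" builtin, so the child exits without ever calling execve.
fork_only_failures() {
    n=$1
    fails=0
    i=0
    while [ "$i" -lt "$n" ]; do
        ( : ) || fails=$((fails + 1))
        i=$((i + 1))
    done
    echo "$fails"
}

# Same loop, but each child execs a fresh shell, adding execve to the path.
fork_exec_failures() {
    n=$1
    fails=0
    i=0
    while [ "$i" -lt "$n" ]; do
        sh -c ':' || fails=$((fails + 1))
        i=$((i + 1))
    done
    echo "$fails"
}
```

If `fork_only_failures 1000` already reports failures, the problem is reachable without execve, which would fit the traces described above.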

comment:37 Changed 9 years ago by mangoo

I wonder if this will help:

http://marc.info/?l=linux-kernel&m=119132339221219

What's in linux-mips.git for 2.6.24?

(...) Various cleanups, including some moving of code and
support for 32-bit Broadcom BCM47XX processors (...)

comment:38 Changed 9 years ago by nbd

it won't, since this code was merged from our codebase ;)

comment:39 Changed 9 years ago by mangoo

So, is it not the borked CPU support, but rather board support is fishy?

comment:40 Changed 9 years ago by nbd

unknown. looks like cpu issues

comment:41 Changed 9 years ago by jhansen@…

I've seen the problem since brcm47xx used 2.6.21.5, and the BSP was nearly identical to the BSP that operates perfectly in 2.6.19.2. So I'm guessing that MIPS upstream enabled a feature or added some sequence of operations that this CPU doesn't like, somewhere between 2.6.19 and 2.6.21.

I'm trying to see if it works in 2.6.20 or not.

comment:42 Changed 9 years ago by andrew@…

I've compiled r9118, and it looks OK!
The other router has r8770 on it.
I ran this simple script for a long time:
while true; do ps > /dev/null; ls -l > /dev/null; done;

r8770 produces segfaults, but r9118 none! Is it solved?
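
A variant of this loop that counts crashes instead of relying on someone watching the console, sketched under the assumption that the shell reports a child killed by a signal as exit status 128 plus the signal number. `count_crashes` is a made-up helper name, not something shipped with OpenWrt.

```shell
# count_crashes RUNS CMD...
# Run CMD the given number of times with its output discarded and print
# how many runs ended with an exit status above 128 (killed by a signal,
# e.g. a segmentation fault or bus error).
count_crashes() {
    runs=$1
    shift
    crashes=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        status=0
        "$@" > /dev/null 2>&1 || status=$?
        if [ "$status" -gt 128 ]; then
            crashes=$((crashes + 1))
        fi
        i=$((i + 1))
    done
    echo "$crashes"
}
```

For example, `count_crashes 1000 ps` and `count_crashes 1000 ls -l` give one number per revision that can be compared between r8770 and r9118.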

comment:43 follow-up: Changed 9 years ago by GreenLiteHosting

I've compiled r9130 (2.6.22.4) and I am observing the usual seg. faults and bus errors.

I executed the script suggested by Andrew and immediately received bus errors. The status of this bug remains unchanged.

comment:44 in reply to: ↑ 43 ; follow-up: Changed 9 years ago by anonymous

Replying to GreenLiteHosting:

I've compiled r9130 (2.6.22.4) and I am observing the usual seg. faults and bus errors.

I executed the script suggested by Andrew and immediately received bus errors. The status of this bug remains unchanged.

Same here ( r9128)

comment:45 in reply to: ↑ 44 Changed 9 years ago by anonymous

Replying to anonymous:

Replying to GreenLiteHosting:

I've compiled r9130 (2.6.22.4) and I am observing the usual seg. faults and bus errors.

I executed the script suggested by Andrew and immediately received bus errors. The status of this bug remains unchanged.

Same here ( r9128)

Let me add a bit more: I ran the loop a couple of times, and the first 2-3 times I got a few segfaults straight off. The last couple of times it ran without any faults. The last run did 8K loops.

Purely as a gut feeling, it is a bit better.

comment:46 follow-up: Changed 9 years ago by jhansen@…

I've tried 2.6.20 with 7.09 and see the same problem, so the problem was introduced between 2.6.19 and 2.6.20.

I have had one small breakthrough, though. If I revert the patch that introduced the copy_user_highpage function (git commit bcd022801ee514e28c32837f0b3ce18c775f1a7b), then the system is much, much more stable, albeit not perfectly stable. I will still get a seg fault in user-land (never in kernel-space at least, which is good) only if I try very hard to overload the system.

comment:47 in reply to: ↑ 46 Changed 9 years ago by alex@…

Replying to jhansen@cardaccess-inc.com:

I've tried 2.6.20 with 7.09 and see the same problem, so the problem was introduced between 2.6.19 and 2.6.20.

I have had one small breakthrough, though. If I revert the patch that introduced the copy_user_highpage function (git commit bcd022801ee514e28c32837f0b3ce18c775f1a7b), then the system is much, much more stable, albeit not perfectly stable. I will still get a seg fault in user-land (never in kernel-space at least, which is good) only if I try very hard to overload the system.

Will this make into trunk, or could you please explain how to undo the above mentioned patch ?

comment:48 follow-up: Changed 9 years ago by jhansen@…

I've posted the patch that removes copy_user_highpage. I don't see this specific patch making it into trunk any time soon since it certainly wouldn't be accepted into the mainline kernel. I'd like to clean it up as well as fix the remaining segfaults before anything gets committed.

I've also noticed that the segfaults are more frequent after the unit has been off for a while. Once it heats up, the problems go away.

Changed 9 years ago by jhansen@…

Remove copy_user_highpage function

comment:49 follow-up: Changed 9 years ago by jhansen@…

The trac patch viewer is messed up with this patch, so make sure that you download the patch and save it as a file; don't just go off of the trac patch viewer or your kernel won't link.

comment:50 in reply to: ↑ 48 Changed 9 years ago by andrew@…

Replying to jhansen@cardaccess-inc.com:

I've posted the patch that removes copy_user_highpage. I don't see this specific patch making it into trunk any time soon since it certainly wouldn't be accepted into the mainline kernel. I'd like to clean it up as well as fix the remaining segfaults before anything gets committed.

I've also noticed that the segfaults are more frequent after the unit has been off for a while. Once it heats up, the problems go away.

You're right, the segfaults only occur during the first "heatup"; after a while they seem to disappear for a long time.
Could someone post a better test script (one that triggers segfaults after the heatup)?

comment:51 Changed 9 years ago by mangoo

Are these patches in svn yet (the one that removes copy_user_highpage etc.)?

comment:52 in reply to: ↑ 49 Changed 9 years ago by alex@…

Replying to jhansen@cardaccess-inc.com:

The trac patch viewer is messed up with this patch, so make sure that you download the patch and save it as a file; don't just go off of the trac patch viewer or your kernel won't link.

Can you tell me where to place the patch in trunk, I am presuming here
target/linux/brcm47xx/patches-2.6.22

comment:53 Changed 9 years ago by jhansen@…

Yes, that's where it would go, but I don't know that I'd commit it until other people are commenting that they see an improvement in stability.

To try out the patch, just place it in the target/linux/brcm47xx/patches-2.6.22 directory, rm -rf build_mipsel/linux-2.6-brcm47xx/linux-2.6.22, and make.

I'm still working on eking out more stability because my system still crashes sometimes with this patch.
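
The steps above can be sketched as a small helper. The directory names are the ones given in this comment; the function name `stage_kernel_patch` and its arguments are hypothetical, and the patch file is assumed to have been downloaded as a raw attachment rather than copied from the trac patch viewer.

```shell
# stage_kernel_patch TOPDIR PATCHFILE
# Copy PATCHFILE into the brcm47xx 2.6.22 patch directory of the buildroot
# at TOPDIR and remove the unpacked kernel tree so that the next build
# re-extracts the kernel and applies all patches again.
stage_kernel_patch() {
    topdir=$1
    patch=$2
    patchdir="$topdir/target/linux/brcm47xx/patches-2.6.22"
    builddir="$topdir/build_mipsel/linux-2.6-brcm47xx/linux-2.6.22"
    if [ ! -d "$patchdir" ]; then
        echo "no such directory: $patchdir" >&2
        return 1
    fi
    cp "$patch" "$patchdir/" || return 1
    rm -rf "$builddir"
    echo "patch staged; now run make in $topdir"
}
```

Typical use would be `stage_kernel_patch ~/trunk ~/900-no-highpage.patch`, followed by `make` in the buildroot; both paths here are placeholders.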

comment:54 Changed 9 years ago by nbd

jhansen: I've tried your patch, but unfortunately it does not improve stability at all on my test system. The test case that I use is:
while true; do /etc/init.d/firewall restart; done

comment:55 Changed 9 years ago by nbd

  • Resolution set to fixed
  • Status changed from reopened to closed

Wolfram Joost, jhansen: Seems like a combination of your fixes is what makes it really work :)
Applied in [9285] along with a few cleanups of my own patches.

comment:56 Changed 9 years ago by anonymous

Or, an upgrade to 2.6.23 did it?

comment:57 Changed 9 years ago by nbd

Nope - I tested it. It was the fixes I mentioned above.

comment:58 Changed 9 years ago by alex@…

Guys thanks for the effort, about to compile and test

comment:59 Changed 9 years ago by mangoo

Works fine here, too - thanks.

Now, I can only complain about Broadcom wireless ;)

comment:60 Changed 9 years ago by nbd

I'm working on that right now

comment:61 Changed 9 years ago by anonymous

Ouch ouch - still not so good:

Kernel bug detected#1:
Cpu 0
$ 0 : 00000000 1000d800 00000001 00001fff
$ 4 : 810085a0 7fd2bbb8 7fd2bbb8 00100177
$ 8 : 0042d613 810103e0 00000000 00000000
$12 : 000000b7 0000003d 0000007a 0000003d
$16 : 810085a0 0000000e 804f5b54 7fd2bbb8
$20 : 00000000 804bfe30 804bfe34 00000001
$24 : 00000003 2ac0a118
$28 : 804be000 804bfdb0 81998148 8000d4b4
Hi : 00000000
Lo : 00000000
epc : 8000e78c Not tainted
ra : 8000d4b4 Status: 1000d803 KERNEL EXL IE
Cause : 00000034
PrId : 00029006
Modules linked in: sch_sfq cls_fw sch_htb pppoe pppox xt_mark ipt_tos xt_length xt_MARK ipt_ULOG ipt_recent xt_state ipt_REJECT iptable_filter xt_TCPMSS xc
Process ps (pid: 1635, threadinfo=804be000, task=81998578)
Stack : 81ec3540 0000000e 804f5b54 7fd2bbb8 80062c28 80062a94 804bfde0 81998148

804bfea0 ffffff9c 00000011 00000030 00000006 0000000e 81ec3540 7fd2bbb8
8081f000 0000002a 81ec3540 81ec3574 00000000 8081f000 802c0000 80062d58
8027d6c0 81ec3540 00000001 80055328 00000000 00000001 804bfe34 804bfe30
00000000 810085a0 81ec3540 00000000 0000002a 000007ff 00001000 8081f000
...

Call Trace:[<80062c28>][<80062a94>][<80062d58>][<80055328>][<800ba090>][<800bc654>][<800790a0>][<80079078>][<800795bc>][<800768f8>][<80077d4c>][<80001d48>]

Code: 8c820000 00021242 30420001 <00028036> 8f820014 00052b02 24420001 af820014 3c02802c

0:14 /bin/bash ick.sh

1635 ttyS0 R+ 0:00 ps x

Segmentation fault

Or:

mangoo:~# ps x

PID TTY STAT TIME COMMAND

1 ? Ss 0:00 init [2]
2 ? S< 0:00 [kthreadd]
3 ? S< 0:00 [ksoftirqd/0]
4 ? S< 0:00 [events/0]
5 ? S< 0:00 [khelper]

23 ? S< 0:00 [kblockd/0]
33 ? S< 0:00 [khubd]
65 ? D 0:00 [pdflush]
66 ? S 0:00 [pdflush]
67 ? D< 0:00 [kswapd0]
68 ? S< 0:00 [aio/0]
82 Kernel bug detected#1:

Cpu 0
$ 0 : 00000000 1000d800 00000001 00001fff
$ 4 : 810143a0 7ff7cf29 7ff7cf29 00100177
$ 8 : 00a1d603 8101a280 00000000 00000000
$12 : 000000b7 0000003d 0000007a 0000003d
$16 : 810143a0 0000000e 81f06c50 7ff7cf29
$20 : 00000000 8162de30 8162de34 00000001
$24 : 2aade8fe 2ac0a118
$28 : 8162c000 8162ddb0 802f8108 8000d4b4
Hi : 00000000
Lo : 00000000
epc : 8000e78c Not tainted
ra : 8000d4b4 Status: 1000d803 KERNEL EXL IE
Cause : 00000034
PrId : 00029006
Modules linked in: sch_sfq cls_fw sch_htb pppoe pppox xt_mark ipt_tos xt_length xt_MARK ipt_ULOG ipt_recent xt_state ipt_REJECT iptable_filter xt_TCPMSS xc
Process ps (pid: 1216, threadinfo=8162c000, task=81f55518)
Stack : 810746c0 0000000e 81f06c50 7ff7cf29 80062c28 80062a94 8162dde0 802f8108

8162dea0 ffffff9c 00000011 00000030 00000006 0000000e 810746c0 7ff7cf29
80d14000 0000000f 810746c0 810746f4 00000000 80d14000 802c0000 80062d58
8027d6c0 810746c0 00000001 800553a8 00000000 00000001 8162de34 8162de30
00000000 810143a0 810746c0 00000000 0000000f 000007ff 00001000 80d14000
...

Call Trace:[<80062c28>][<80062a94>][<80062d58>][<800553a8>][<800ba090>][<800bc654>][<800790a0>][<80079078>][<800795bc>][<800768f8>][<80077d4c>][<80001d48>]

Code: 8c820000 00021242 30420001 <00028036> 8f820014 00052b02 24420001 af820014 3c02802c
? S< 0:00 [mtdblockd]

137 ? S< 0:00 [scsi_eh_0]
138 ? S< 0:00 [usb-storage]
141 ? S< 0:00 [scsi_eh_1]
142 ? D< 0:00 [usb-storage]
165 ? S< 0:00 [kjournald]

Segmentation fault

That's all with revision 9287.

comment:62 Changed 9 years ago by anonymous

# strace -e open ps aux
(...)
root 165 0.0 0.0 0 0 ? S< 00:32 0:00 [kjournald]
open("/proc/165/status", O_RDONLY) = 6
open("/proc/246/stat", O_RDONLY) = 6
open("/proc/166/cmdline", O_RDONLY) = 6
+++ killed by SIGSEGV +++
Process 1320 detached

Everything OK:

# cd /proc
# find -name stat | xargs md5sum
# find -name status | xargs md5sum

Trying to open cmdline fails:

# find -name cmdline | xargs md5sum
(...)
d41d8cd98f00b204e9800998ecf8427e ./165/task/165/cmdline
d41d8cd98f00b204e9800998ecf8427e ./165/cmdline
xargs: md5sum: terminated by signal 11
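
A note for reading the hashes above: d41d8cd98f00b204e9800998ecf8427e is the MD5 of empty input, which is what a kernel thread such as pid 165 ([kjournald]) is expected to return for cmdline. So those two reads succeeded, and the crash came on a later entry. A quick way to confirm the hash, as a sketch (empty_md5 is a made-up helper name, and it assumes md5sum is available on the machine used):

```shell
# Print the MD5 of zero bytes, for comparison with the hash
# shown for ./165/cmdline in the listing above.
empty_md5() {
    printf '' | md5sum | cut -d' ' -f1
}

empty_md5
```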

comment:63 Changed 9 years ago by mangoo

mangoo:/proc/1058# while true ;

do
cat cmdline
done

/usr/sbin/dhcpd-qSegmentation fault
Segmentation fault
Segmentation fault
Segmentation fault
Segmentation fault

The kernel apparently doesn't like it when we read cmdline - I'll check whether it affects other files, too.

comment:64 Changed 9 years ago by nbd

have you tried make target/linux-clean world?

comment:65 Changed 9 years ago by mangoo

The kernel doesn't like when we touch "cmdline" and "environ" in /proc/<pid>/

comment:66 Changed 9 years ago by mangoo

have you tried make target/linux-clean world?

Hmm, no, why?

I downloaded the latest version:

svn co https://svn.openwrt.org/openwrt/trunk/

And built it.

comment:67 Changed 9 years ago by mangoo

Ooh, I used gcc 4.2.x. I wonder if this is related.

I'll build revision 9295 with the default settings (gcc 4.1.x).

comment:68 Changed 9 years ago by mangoo

Trying to open "environ" or "cmdline" doesn't always fail (though mostly it does):

find /proc -name environ -exec cat {} \;

For some pids, it doesn't fail at all; for other pids, it fails always.

Still, I have to build it with gcc 4.1.x.

comment:69 Changed 9 years ago by anonymous

Perhaps it doesn't have to do with the fact that I used gcc 4.2.x (but still, I'll build it to check).

"find /proc -name environ -exec cat {} \;" works fine just after reboot.

I use swap on my router. Once swap is used, trying to open "environ" or "cmdline" begins to fail.

comment:70 Changed 9 years ago by mangoo

This is how I can reproduce it reliably:

1) mount a swap partition
2) make sure some swap is used (my /dev/mtd1 is 8.1 MB):

# VAR=$(cat /dev/mtd1)

3) try to open "environ" or "cmdline":

# find /proc -name environ -exec cat {} \;
(...)
Kernel bug detected#1:
(...)
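
The steps above can be wrapped in a small script so that the swap check and the /proc probe run the same way each time. This is only a sketch: the function names are made up for illustration, and it assumes that a reader killed by SIGSEGV shows up as exit status 139 (128 + 11), as it does in bash and busybox ash.

```shell
# swap_used_kb [MEMINFO]
# Print SwapTotal minus SwapFree in kB. MEMINFO defaults to /proc/meminfo.
swap_used_kb() {
    awk '/^SwapTotal:/ {t = $2} /^SwapFree:/ {f = $2} END {print t - f}' \
        "${1:-/proc/meminfo}"
}

# count_segv_reads NAME
# Read /proc/<pid>/NAME for every pid and count how many times the
# reader (cat) was killed by SIGSEGV. Other failures, such as a
# process that exited in the meantime, are ignored.
count_segv_reads() {
    segv=0
    for f in /proc/[0-9]*/"$1"; do
        [ -e "$f" ] || continue
        rc=0
        cat "$f" >/dev/null 2>&1 || rc=$?
        if [ "$rc" -eq 139 ]; then
            segv=$((segv + 1))
        fi
    done
    echo "$segv"
}
```

Usage, after mounting swap and forcing some of it into use as in step 2: run swap_used_kb and make sure it prints a non-zero value, then run count_segv_reads environ and count_segv_reads cmdline. On an affected build the counts should be non-zero.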

comment:71 Changed 9 years ago by nbd

i think the cache architecture of bcm4704 is a bit different and breaks a few assumptions of the mips cache code. i have to talk to ralf baechle about this, he probably knows more.
until that is fixed, i'd suggest disabling swap

comment:72 Changed 9 years ago by anonymous

I've got 9299 with 2.6.23 and 512mb usb swap on a Wl500Gp. Working absolutely great - no segfaults at all, whereas I had them constantly before. Actually seems a bit faster - maybe the new scheduler.

comment:73 Changed 9 years ago by jhansen@…

Don't know that it helps, but I've seen the cmdline problem in the past even as early as 2.6.19, with no swap. Luckily it doesn't affect operation of the rest of the system, unless you use ps for something. I haven't seen it for a long time.

comment:74 Changed 9 years ago by mangoo

Luckily it doesn't affect operation of the rest of the system, unless you use ps for something. I haven't seen it for a long time.

It does affect operation of the system a bit - after a couple of such "Kernel bug detected", you can't shut the system down properly, some random commands will fail.

And it's not only ps which looks into various /proc entries.

comment:75 Changed 9 years ago by mangoo

I've got 9299 with 2.6.23 and 512mb usb swap on a Wl500Gp. Working absolutely great

Did you make sure swap is actually used by processes (if swap is used by tmpfs, the bugs don't happen):

VAR=$(cat /dev/mtd1)

find /proc -name environ -exec cat {} \;

?

comment:76 Changed 9 years ago by mangoo

I've got 9299 with 2.6.23 and 512mb usb swap on a Wl500Gp. Working absolutely great

Compiled with gcc 4.1.2, it fails when swap is used.

comment:77 Changed 9 years ago by anonymous

Or, an upgrade to 2.6.23 did it?

Nope - I tested it. It was the fixes I mentioned above.

I just tried it without the 310-no_highpage.patch, and I don't get any kernel errors.

That is, until processes begin to touch swap.

comment:78 Changed 9 years ago by jhansen@…

Do you see the same problems when using swap in Kamikaze 7.06?

comment:79 Changed 9 years ago by mangoo

I never used it.

What revision is Kamikaze 7.06?

And what kernel does it have for these Broadcom boards?

comment:80 Changed 9 years ago by mangoo

According to http://downloads.openwrt.org/kamikaze/7.06/brcm47xx-2.6/packages/ it is 2.6.19.2.

No, this kernel is perfectly stable for me (taking wireless aside).

comment:81 Changed 9 years ago by anonymous

I'm seeing some segfaults now with swap - they aren't random like before; they happen whenever I do a top or ps. Apps don't crash out like before, though - even the ones using the swap. It's annoying, but nothing like before.

comment:82 Changed 9 years ago by alex@…

  • Resolution fixed deleted
  • Status changed from closed to reopened

So, to summarise:

trunk 9287 is okay, if you don't touch swap
but nbd is going to talk to ralf baechle

So should the ticket be reopened?

comment:83 Changed 9 years ago by jhansen@…

It turns out that this bug has tipped Ralf Baechle off to a bug in the copy_user_highpage implementation and he is actively fixing the problem (a commit that should fix the problem made it into mainline yesterday). Perhaps swap will work OK in 2.6.24. Or if you can't wait 3 months, you can just apply commits b868868ae0f7272228c95cc760338ffe35bb739d and 985c30ef4d7c2a4f0e979a507a7e2f7f09b096c3 to 2.6.23.

comment:84 Changed 9 years ago by anonymous

Of course we can't wait ;)

comment:85 Changed 9 years ago by anonymous

how does one apply the commits mentioned above?

comment:86 Changed 9 years ago by michu

google is your friend.. enter commit id and
-> http://grmso.net:8090/commit/b868868ae0f7272228c95cc760338ffe35bb739d/

cheers michu

comment:87 Changed 9 years ago by jhansen@…

It looks like those two commits against 2.6.22 do not fix the problem; it is just as bad as before. I'll have to try the tip of the mainline git tree...

comment:88 Changed 9 years ago by anonymous

According to the patch description, 985c30ef4d7c2a4f0e979a507a7e2f7f09b096c3 should be applied.

Anyway, this commit collides with 160-kmap_coherent.patch.

comment:89 Changed 9 years ago by jhansen@…

Yes, they do collide; we were completely avoiding kmap_coherent, and Ralf was trying to avoid it only when necessary. You'd have to revert the kmap_coherent patch first.

For the record, applying 300-fork_cacheflush.patch and 310-no-highpage.patch to Kamikaze 7.09 works great (of course I don't use swapping).

comment:90 Changed 9 years ago by mangoo

Not sure about 300-fork_cacheflush.patch - but the device seems to work stable for me, whether 310-no-highpage.patch is applied or not (with swap disabled).

comment:91 Changed 9 years ago by michu-at-neophob-com

I can confirm that with the current svn build 9441 my asus wl500-g premium runs stable (without swap)

comment:92 Changed 9 years ago by andrew@…

r9442 running since 3 days, and so far seems stable. (without swap)

comment:93 Changed 9 years ago by mangoo

r9573 still not usable with swap. Which is perhaps no wonder, as there were no related changes ;)

comment:94 Changed 9 years ago by jhansen@…

Not sure where to put this, but I found that you can at least boot the WL-500gP (and most likely WGT634u, WRT54g, etc.) into 2.6.24-rc4 with the following patch. It doesn't boot all the way into userland yet, but hopefully this will save someone else some time bringing in the rest of the patches.

It looks like the nvram code from earlier OpenWRT patches will be necessary (since the CFE-trampoline on this platform won't return correct nvram values itself), as well as (obviously) SquashFS patches, and mtd partition patches.

Changed 9 years ago by jhansen@…

Get wl500gp mostly booting in 2.6.24-rc4

comment:95 Changed 9 years ago by anonymous

So what is the status of HEAD (9707 as I write this) with the Asus 500gP? Is it working without any patches or do I still need to do some patching to get any stability?

comment:96 Changed 9 years ago by anonymous

don't use swap and you should be fine

comment:97 Changed 9 years ago by anonymous

Any ideas what's wrong with the swap?

comment:98 Changed 8 years ago by anonymous

2.6.24 reportedly now has the driver that supports the Broadcom 4318 wireless chip (in G and B mode) used on the wl500gp (mini-pci card in 1.x and surface mount in 2.x).

http://forums.gentoo.org/viewtopic-t-547687-postdays-0-postorder-asc-start-0.html

comment:99 Changed 8 years ago by anonymous

I am experiencing the same problem using the latest trunk (r10204): random segmentation faults under high cpu load, tested with swap both enabled and disabled. Processes with high load (like ctorrent) are killed, and if I run top:
Kernel bug detected#1:
Cpu 0
$ 0 : 00000000 1000d800 00000001 00001fff
$ 4 : 81011ec0 7fc77f72 7fc77f72 000008f6
$ 8 : 00100177 008f679f 00080000 00020000
$12 : 80264d80 00000001 80cdde40 00100100
$16 : 81011ec0 80264d80 81c53aac 7fc77f72
$20 : 0000000e 00000000 80c5be60 80c5be64
$24 : 00000003 800b6318
$28 : 80c5a000 80c5bde8 00000001 8000d3f0
Hi : 00000000
Lo : 00000000
epc : 8000e640 Tainted: P
ra : 8000d3f0 Status: 1000d803 KERNEL EXL IE
Cause : 00000034
PrId : 00029006
Modules linked in: usb_storage ehci_hcd uhci_hcd ath_pci wlan_xauth wlan_wep wlan_tkip wlan_ccmp wlan_acl ath_rate_minstrel ath_hal(P) wlan_scan_sta wlan_scan_ap wlan sd_mod nf_nat_tftp nf_conntrack_tftp nf_nat_irc nf_conntrack_irc nf_nat_ftp nf_conntrack_ftp ppp_async ppp_generic slhc crc_ccitt ext3 jbd usbcore scsi_mod switch_robo switch_core diag
Process top (pid: 916, threadinfo=80c5a000, task=81dcc1a8)
Stack : 81d2e3c0 7fc77f72 81c53aac 7fc77f72 80061588 800613e4 00000051 0000000e

004e56bc 00000004 00000011 00000030 81d2e3c0 7fc77f72 80fc1000 00000004
81d2e3c0 81d2e3f4 00000000 80fc1000 81dcc1a8 800616b4 81dcc1a8 81d2e3c0
81dcc1a8 000000d0 00000000 00000001 80c5be64 80c5be60 80231540 81011ec0
81d2e3c0 00000000 00000004 00000021 80fc1000 00001000 81dcc1a8 00000004
...

Call Trace:[<80061588>][<800613e4>][<800616b4>][<800b3dac>][<800b6398>][<8007762c>][<80077b34>][<80077b34>][<80074fd8>][<8002ade8>][<8000c12c>]

Code: 8c820000 00021242 30420001 <00028036> 8f820014 00052b02 24420001 af820014 3c028027

I can provide more info / do more tests if needed.

comment:100 follow-up: Changed 8 years ago by anonymous

Do you have these segfaults without atheros modules loaded?

comment:101 Changed 8 years ago by anonymous

Any news about 2.6.24? Perhaps with that kernel this problem will go away?

comment:102 in reply to: ↑ 100 Changed 8 years ago by anonymous

Replying to anonymous:

Do you have these segfaults without atheros modules loaded?

No change whether the atheros modules are loaded or not. (But the case I reported was with atheros modules loaded.)

comment:103 Changed 8 years ago by Federico

Still not working with latest trunk r10459.

I still don't have an atheros card, so wifi is disabled.

  • with swap off, processes like amule run for a few hours before dying, and need just a manual restart to work again.
  • with swap on, after half a day (with amule running, so under high cpu load) most processes are dying; even /sbin/halt has no effect.

# lsmod
Module Size Used by Not tainted
usb_storage 28032 1
usblp 9344 0
ehci_hcd 28880 0
uhci_hcd 19920 0
sd_mod 18512 2
nf_nat_tftp 480 0
nf_conntrack_tftp 2480 1 nf_nat_tftp
nf_nat_irc 960 0
nf_conntrack_irc 2832 1 nf_nat_irc
nf_nat_ftp 1472 0
nf_conntrack_ftp 5152 1 nf_nat_ftp
ppp_async 9856 0
ppp_generic 20192 1 ppp_async
slhc 5376 1 ppp_generic
crc_ccitt 1024 1 ppp_async
ext3 98064 2
jbd 55408 1 ext3
nls_iso8859_15 3392 0
usbcore 106576 5 usb_storage,usblp,ehci_hcd,uhci_hcd
scsi_mod 72256 2 usb_storage,sd_mod
nls_base 4448 1 nls_iso8859_15
switch_robo 4048 0
switch_core 5056 1 switch_robo
diag 8400 0

comment:104 Changed 8 years ago by anonymous

Also, 2.6.24 has support for BCM47XX boards.

Is it what's already in OpenWrt, or something completely different?

comment:105 Changed 8 years ago by jhansen@…

It's based on the code that was in OpenWRT, but doesn't include several things that will still need to be applied as patches (CFE nvram reading, etc.). Much has been fixed/changed in 2.6.24 for the mips arch, so it will be interesting to see how well things work.

comment:106 Changed 8 years ago by alex@…

Hi

Just to add some more info to the mix: I have a hand-built 7.09 (r9455 => 2.6.23.1) with 900-no-highpage.patch applied. I do not use swap, I have an atheros card and it works fine; since applying the patch I haven't had any problems with the system. This would be over 2 months now.

I am guessing that, since we can get a stable build, the problem is understood and could be fixed in 2.6.24. I am waiting for a stable 2.6.24 so I can install it on my new 500gW (yes, I know N doesn't work), and I want to use the broadcom wifi.

I am willing to do some testing of image / patches if needed

comment:107 Changed 8 years ago by mangoo

Looks like this patch is needed - without it, we won't see PCI devices with 2.6.24 kernels:

http://git.aurel32.net/?p=bcm47xx.git;a=commit;h=2c4af8d64dd69e2f074690a036cd83e0a8f6946f

comment:108 Changed 8 years ago by mangoo

jhansen: how did you get 2.6.24 "mostly booting"?

Is it enough to:

  • edit trunk/target/linux/brcm47xx/Makefile, and change LINUX_VERSION:=2.6.24.2
  • apply the bcm47xx-prom.patch

Will it build an image which boots?
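
For the first bullet, the version bump can be scripted rather than edited by hand. A minimal sketch, assuming the Makefile carries a single line starting with LINUX_VERSION:= (as the bullet implies); set_linux_version is a hypothetical helper name. It only changes the version string and does nothing about patches or other kernel files.

```shell
# set_linux_version MAKEFILE VERSION
# Rewrite the LINUX_VERSION:= line of MAKEFILE in place.
# VERSION is expected to be a plain dotted version such as 2.6.24.2.
set_linux_version() {
    makefile=$1
    version=$2
    tmp=$(mktemp) || return 1
    if sed "s/^LINUX_VERSION:=.*/LINUX_VERSION:=$version/" "$makefile" > "$tmp"; then
        # Copy the contents back so the Makefile keeps its permissions.
        cat "$tmp" > "$makefile"
    fi
    rm -f "$tmp"
}
```

Usage: set_linux_version trunk/target/linux/brcm47xx/Makefile 2.6.24.2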

comment:109 Changed 8 years ago by jhansen@…

No, it's not quite that easy, since you need to delete some ssb and other kernel files in openwrt as well. I'm working on getting a patch out, but it looks like this processor still needs the highpage and flush_cache_dup_mm patches.

comment:110 Changed 8 years ago by anonymous

Still segfaults under high cpu load!!
Could someone please update the comment next to Broadcom BCM947xx/953xx [2.6] on http://wiki.openwrt.org/OpenWrtDocs/Hardware/Asus/WL500GP?
It still says there that the segfault is fixed, and this could matter to someone deciding whether to buy it!

comment:111 Changed 8 years ago by anonymous

The only thing I can do after a segfault is ping the router, because dropbear and p910nd have died. So the kernel is still alive, and the problem seems to lie in how applications access memory.

comment:112 Changed 8 years ago by anonymous

My seg faults disappeared when I've returned to default 16MB of memory.

comment:113 Changed 8 years ago by anonymous

So, there are no segfaults with 16 MB of memory, even if you use swap?

comment:114 Changed 8 years ago by jhansen@…

So far 2.6.24 needs the same highpage and flush_mm patches that 2.6.23 needed.

I just noticed that the comment for tiny_shmem (which is enabled by default with CONFIG_EMBEDDED) states that tiny_shmem "may be appropriate on small systems without swap." Try enabling full shmem in your kernel config and see if your router still crashes with swap enabled.

comment:115 Changed 8 years ago by anonymous

Are you talking about CONFIG_SHMEM? I had it already enabled.

But with swap, it crashes also when CONFIG_SHMEM is disabled.

comment:116 Changed 8 years ago by anonymous

Maybe 2.6.25 will be better?

comment:117 Changed 8 years ago by nthgr

Could somebody please summarize whether this problem is solved, and if so, how?

I have the same problem on an ASUS WL-500G Premium. I tried three releases (flashing the .trx with the Asus Restoration Tool):

  1. http://downloads.openwrt.org/kamikaze/7.09/brcm47xx-2.6/openwrt-brcm47xx-2.6-squashfs.trx
  2. http://downloads.x-wrt.org/xwrt/kamikaze/7.07/brcm47xx-2.6/openwrt-brcm47xx-2.6-squashfs.trx
  3. http://downloads.x-wrt.org/xwrt/kamikaze/7.09/brcm47xx-2.6/openwrt-brcm47xx-2.6-squashfs.trx

I can see something at http://wiki.openwrt.org/OpenWrtDocs/Hardware/Asus/WL500GP and then at https://dev.openwrt.org/changeset/9285, but I don't know how to install those patches. I also don't know whether this is the complete solution.

Thx

comment:118 Changed 8 years ago by nthgr

Or could somebody rebuild this package
(http://downloads.openwrt.org/kamikaze/7.09/brcm47xx-2.6/openwrt-brcm47xx-2.6-squashfs.trx) with changeset 9285 and release it publicly again?
Thx

comment:119 follow-up: Changed 8 years ago by anonymous

There seems to be 2.6.25 support in the latest revisions:

svn co https://svn.openwrt.org/openwrt/trunk/

See patches in trunk/target/linux/brcm47xx/patches-2.6.25.

But it's not enabled by default - trunk/target/linux/brcm47xx/Makefile still contains a 2.6.23.x kernel (LINUX_VERSION).

Anyone tried to change the kernel to 2.6.25.4 in this Makefile and build it yet? Does it work?

comment:120 in reply to: ↑ 119 Changed 8 years ago by noz

Replying to anonymous:

There seems to be 2.6.25 support in the latest revisions:
See patches in trunk/target/linux/brcm47xx/patches-2.6.25.
But it's not enabled by default - trunk/target/linux/brcm47xx/Makefile still contains a 2.6.23.x kernel (LINUX_VERSION).
Anyone tried to change the kernel to 2.6.25.4 in this Makefile and build it yet? Does it work?

There's a very good reason that it's not enabled! It Oopses solid as soon as it gets to userspace. I'm working on it, and BTW, it's patched against 2.6.25.1, so that's what I'd try against if you really want to try.

comment:121 Changed 8 years ago by anonymous

I just compared 2.6.25.1 patches you're talking about and 2.6.23.x which are enabled by default, and they are identical (md5sum), so it looks like it needs some more work?

comment:122 follow-up: Changed 8 years ago by jhansen@…

I am able to boot and run OpenWRT on the wgt634u with 2.6.25 after applying the following patch to the lzma loader. This passes the fw_argX arguments from CFE to linux, as they were intended to be passed on. The current lzma-loader just nukes them all, and doesn't pass on anything, so there's no way for Linux to know the CFE hook/handles.

Changed 8 years ago by jhansen@…

Don't nuke fw_argX from CFE

Changed 8 years ago by jhansen@…

Don't nuke fw_argX from CFE (better)

comment:123 Changed 8 years ago by jhansen@…

I've confirmed that this also works on a WL-500g Premium. OpenWRT works with no crashes with svn trunk (+2.6.25.4) and this patch. Haven't tried swap, but I don't plan on it, either. So someone else can test that out if they want.

comment:124 Changed 8 years ago by alex@…

thanks for that jhansen, haven't checked this thread in a while, but good timing to check now.

comment:125 Changed 8 years ago by anonymous

Have you tried wireless?

comment:126 follow-up: Changed 8 years ago by anonymous

Revision 11269 doesn't compile for me:

(...)

CC drivers/ssb/driver_pcicore.o
drivers/ssb/driver_pcicore.c: In function 'ssb_pcicore_fixup_pcibridge':
drivers/ssb/driver_pcicore.c:314: error: implicit declaration of function 'pcibios_enable_device'
make[7]: *** [drivers/ssb/driver_pcicore.o] Error 1
make[6]: *** [drivers/ssb] Error 2
make[5]: *** [drivers] Error 2
make[5]: Leaving directory `/home/tch-data/openwrt/11269/build_dir/linux-brcm47xx/linux-2.6.25.4'
make[4]: *** [/home/tch-data/openwrt/11269/build_dir/linux-brcm47xx/linux-2.6.25.4/.image] Error 2
make[4]: Leaving directory `/home/tch-data/openwrt/11269/target/linux/brcm47xx'
make[3]: *** [install] Error 2
make[3]: Leaving directory `/home/tch-data/openwrt/11269/target/linux'
make[2]: *** [target/linux/install] Error 2
make[2]: Leaving directory `/home/tch-data/openwrt/11269'
make[1]: *** [/home/tch-data/openwrt/11269/staging_dir/mipsel/stamp/.target_install] Error 2
make[1]: Leaving directory `/home/tch-data/openwrt/11269'
make: *** [world] Error 2

comment:127 Changed 8 years ago by jhansen@…

Wireless on Atheros NICs in both platforms (WL-500gP and WGT634u) works great! Haven't (and probably won't) try Broadcom NIC.

For the compile problem, you need to enable PCI with 'make kernel_menuconfig', then you should be fine. For some reason it's not configured in by default.
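
To check whether PCI ended up enabled after running 'make kernel_menuconfig', the generated kernel .config can be inspected. A sketch, assuming the .config sits in the kernel build directory shown in the make output above (build_dir/linux-brcm47xx/linux-2.6.25.4/.config); kernel_option_enabled is a hypothetical helper name.

```shell
# kernel_option_enabled OPTION CONFIG_FILE
# Succeed only if OPTION is built in (=y) in the given kernel .config.
# A line like "# CONFIG_PCI is not set" does not match.
kernel_option_enabled() {
    grep -q "^$1=y\$" "$2"
}
```

Usage: kernel_option_enabled CONFIG_PCI build_dir/linux-brcm47xx/linux-2.6.25.4/.config && echo "PCI is enabled"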

comment:128 Changed 8 years ago by mangoo

It compiled and it even runs, but I get an oops upon bootup:

Waiting for /dev to be fully populated...b43-phy0: Broadcom 4318 WLAN found

Data bus error, epc == 8013406c, ra == 801b28b4
Oops#1:
Cpu 0
$ 0 : 00000000 1000c801 81c85454 801b26a4
$ 4 : c00003f8 81c85454 00000001 802e9a20
$ 8 : 81d95fe0 0000c800 00000000 81ea2000
$12 : 312d0000 00000001 802ef588 81c5b13c
$16 : 81c85400 81c85454 000003f8 00000007
$20 : c00b4000 00000001 81daf448 81d69d40
$24 : 00008000 801b260c
$28 : 81d94000 81d95cd0 81d69d40 801b28b4
Hi : 00000000
Lo : 00000000
epc : 8013406c Not tainted
ra : 801b28b4 Status: 1000c803 KERNEL EXL IE
Cause : 0080001c
PrId : 00029006 (Broadcom BCM3302)
Modules linked in: b43(+) mac80211 cfg80211
Process modprobe (pid: 264, threadinfo=81d94000, task=81c2f068)
Stack : 00004318 00000005 0000017f 81daf000 0000017f 81daf000 00000002 c00681c4

00000001 00000001 00004318 00000005 81d95d28 00000000 81e679d8 81c8bb64
00000003 81c85400 81c59400 81d69ddc 00000001 81d95e00 c00842f4 81c85454
c0084310 81c8948c c00b4000 0000001b c00d8d98 c00d88b8 c00a709c 801b0a1c
81c85454 800c9e8c 81c7f288 8012265c 81c894a8 81c89400 80155530 801554d4
...

Call Trace:[<c00681c4>][<c00a709c>][<801b0a1c>][<800c9e8c>][<8012265c>][<80155530>][<801554d4>][<8025157c>][<80002c64>][<80002c64>][<801559c8>][<801540f8>]
Code: 90820000 03e00008 304200ff <94820000> 03e00008 3042ffff 8c820000 03e00008 00000000
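
For anyone reading these dumps: bits 2 to 6 of the MIPS Cause register hold the exception code. Cause 0080001c in this oops gives code 7 (data bus error), which matches the first line of the dump, while the Cause 00000034 in the earlier "Kernel bug detected" dumps gives code 13 (trap). A sketch of the decoding; exccode is a made-up helper name, and it assumes a shell whose arithmetic accepts 0x constants.

```shell
# exccode HEX
# Extract the exception code (bits 2..6) from a MIPS Cause register
# value given as hex digits without the 0x prefix.
exccode() {
    echo $(( (0x$1 >> 2) & 0x1f ))
}

exccode 0080001c   # Cause from the b43 oops above
exccode 00000034   # Cause from the "Kernel bug detected" dumps
```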

comment:129 Changed 8 years ago by mangoo

And some good news: it doesn't crash when swap is used!

comment:130 Changed 8 years ago by mangoo

It oopses on b43 module insertion.

comment:131 in reply to: ↑ 122 Changed 8 years ago by noz

Replying to jhansen@cardaccess-inc.com:

I am able to boot and run OpenWRT on the wgt634u with 2.6.25 after applying the following patch to the lzma loader. This passes the fw_argX arguments from CFE to linux, as they were intended to be passed on. The current lzma-loader just nukes them all, and doesn't pass on anything, so there's no way for Linux to know the CFE hook/handles.

Applied in r11275, thanks. Allows WRTSL54GS to boot also.

comment:132 follow-up: Changed 8 years ago by jhansen@…

noz, the assembly language part of the patch was missing in the commit. Can you bring that in, too, since just the decompress.c patch won't change anything?

comment:133 in reply to: ↑ 126 Changed 8 years ago by KanjiMonster

Replying to anonymous:

Revision 11269 doesn't compile for me:
drivers/ssb/driver_pcicore.c: In function 'ssb_pcicore_fixup_pcibridge':
drivers/ssb/driver_pcicore.c:314: error: implicit declaration of function 'pcibios_enable_device'

You have to set CONFIG_PCI=y in your kernel config; see this post from openwrt-devel.

comment:134 Changed 8 years ago by KanjiMonster

I just installed an image with 2.6.25.4 on my Asus 500gP (v1), and I can confirm it running (with an atheros card) and being rock stable (I had some random crashes with .23.16).

comment:135 Changed 8 years ago by anonymous

Does the wireless work? There are contradicting reports here:

http://forum.openwrt.org/viewtopic.php?id=15443

comment:136 in reply to: ↑ 132 Changed 8 years ago by noz

Replying to jhansen@cardaccess-inc.com:

noz, the assembly language part of the patch was missing in the commit. Can you bring that in, too, since just the decompress.c patch won't change anything?

Sorry. Now applied in r11340. This time I've checked and the correct CFE seal is being passed in as fw_arg3.

comment:137 Changed 8 years ago by jhansen@…

Can this bug be closed? Any objections? Any b43-related crashes should be posted on a new bug.

comment:138 Changed 8 years ago by KanjiMonster

+1 for closing (and moving b43-related things to separate bug report).
Router is stable, transferring data from and to it.
Still running on first startup after flashing, now at 11 days uptime; running samba, postgresql, subversion, cups.

(using 2.6.25.4, .6 came after that, but I don't expect any difference in stability).

comment:139 Changed 8 years ago by othiman

+1 for closing. Running with atheros wifi card and intensive load by downloading some torrents. With old kernel it crashed after a couple of minutes, now running 2 days without any problem. Kernel version 2.6.25.6 with swap turned on.

comment:140 Changed 8 years ago by nbd

  • Resolution set to fixed
  • Status changed from reopened to closed

comment:141 Changed 2 years ago by jow

  • Milestone changed from Attitude Adjustment 12.09 to Barrier Breaker 14.07

Milestone Attitude Adjustment 12.09 deleted
