Opened 2 years ago

Last modified 23 months ago

#14839 new defect

Kernel crashes on write operations on lantiq

Reported by: me@…
Owned by: developers
Priority: normal
Milestone: Chaos Calmer 15.05
Component: kernel
Version: Trunk
Keywords:
Cc:

Description

I have an ARV752DPW (Easybox 802).
Trunk r39332 crashes on a write operation somewhere in jffs2 compression code:
root@OpenWrt:/etc# opkg install luci
Installing luci (svn-r9951-1) to root...
Downloading http://downloads.openwrt.org/snapshots/trunk/lantiq/packages/luci_svn-r9951-1_lantiq.ipk.
Installing uhttpd (2013-11-21-cd66639800ee2882a0867ec54868502eb9b893d8) to root...
Downloading http://downloads.openwrt.org/snapshots/trunk/lantiq/packages/uhttpd_2013-11-21-cd66639800ee2882a0867ec54868502eb9b893d8_lantiq.ipk.
Installing uhttpd-mod-ubus (2013-11-21-cd66639800ee2882a0867ec54868502eb9b893d8) to root...
Downloading http://downloads.openwrt.org/snapshots/trunk/lantiq/packages/uhttpd-mod-ubus_2013-11-21-cd66639800ee2882a0867ec54868502eb9b893d8_lantiq.ipk.
Installing luci-mod-admin-full (svn-r9951-1) to root...
Downloading http://downloads.openwrt.org/snapshots/trunk/lantiq/packages/luci-mod-admin-full_svn-r9951-1_lantiq.ipk.
[ 183.532000] CPU 0 Unable to handle kernel paging request at virtual address 5bc3452e, epc == 8014795c, ra == 80149c0c
[ 183.540000] Oops#1:
[ 183.540000] CPU: 0 PID: 1135 Comm: opkg Not tainted 3.10.26 #1
[ 183.540000] task: 82927648 ti: 82866000 task.ti: 82866000
[ 183.540000] $ 0 : 00000000 00a76f58 000192a4 00000010
[ 183.540000] $ 4 : c0002000 00000000 c003454a 5bc3452e
[ 183.540000] $ 8 : 8264d683 8264d683 00000111 00000000
[ 183.540000] $12 : 8264d66d 00000014 00000000 00000014
[ 183.540000] $16 : 00000014 00000026 c0002000 00000012
[ 183.540000] $20 : 00000026 8264d684 00000014 8264d684
[ 183.540000] $24 : 00000683 00002001
[ 183.540000] $28 : 82866000 82867ae0 00000000 80149c0c
[ 183.540000] Hi : 00010b46
[ 183.540000] Lo : 66675044
[ 183.540000] epc : 8014795c GetPureRepPrice+0x34/0x14c
[ 183.540000] Not tainted
[ 183.540000] ra : 80149c0c LzmaEnc_MemEncode+0x1ce4/0x2e54
[ 183.540000] Status: 1100fc03 KERNEL EXL IE
[ 183.540000] Cause : 00800008
[ 183.540000] BadVA : 5bc3452e
[ 183.540000] PrId : 00019641 (MIPS 24KEc)
[ 183.540000] Modules linked in: rt2800pci rt2800mmio rt2800lib iptable_nat rt2x00pci rt2x00mmio rt2x00lib pppoe nf_nat_ipv4 nf_conntrack_ipv4 mac80211 ipt_MASQUERADE cfg80211 xt_time xt_tcpudp xt_state xt_nat xt_multiport xt_mark xt_mac xt_limit xt_conntrack xt_comment xt_TCPMSS xt_REDIRECT xt_LOG xt_CT pppox ppp_async nf_nat_irc nf_nat_ftp nf_nat nf_defrag_ipv4 nf_conntrack_irc nf_conntrack_ftp ltq_atm_danube iptable_raw iptable_mangle iptable_filter ipt_REJECT ip_tables drv_vmmc crc_itu_t crc_ccitt compat drv_dsl_cpe_api ltq_mei_danube ledtrig_usbdev ip6t_REJECT ip6t_rt ip6t_hbh ip6t_mh ip6t_ipv6header ip6t_frag ip6t_eui64 ip6t_ah ip6table_raw ip6table_mangle ip6table_filter ip6_tables x_tables nf_conntrack_ipv6 nf_conntrack nf_defrag_ipv6 pppoatm ppp_generic slhc tun br2684 atm drv_tapi ipv6 eeprom_93cx6 drv_ifxos arc4 crypto_blkcipher ltq_hcd_danube gpio_button_hotplug
[ 183.540000] Process opkg (pid: 1135, threadinfo=82866000, task=82927648, tls=7760c440)
[ 183.540000] Stack : 803b3c98 00000100 00000018 c0039660 00000001 c0032ca0 00000002 00000006
          00000000 00000000 00000000 00000000 00000015 00000031 000003ba 000000db
          801472f4 82a10000 00001000 00000000 00000000 00000014 8264d66d 00000111
          00000011 00000002 00000012 000002f5 00000308 00000012 0000097d 0000000a
          0000006d 8264d683 00000000 c0032ca0 00000682 00000001 00019298 00000671
          ...

[ 183.540000] Call Trace:
[ 183.540000] [<8014795c>] GetPureRepPrice+0x34/0x14c
[ 183.540000] [<80149c0c>] LzmaEnc_MemEncode+0x1ce4/0x2e54
[ 183.540000] [<80109e00>] jffs2_lzma_compress+0x78/0xb8
[ 183.540000] [<800f656c>] jffs2_selected_compress+0xd8/0x17c
[ 183.540000] [<800f6828>] jffs2_compress+0x218/0x2a4
[ 183.540000] [<800fd920>] jffs2_write_inode_range+0xd8/0x31c
[ 183.540000] [<800f822c>] jffs2_write_end+0x160/0x310
[ 183.540000] [<80065090>] generic_file_buffered_write+0x1cc/0x310
[ 183.540000] [<800668c4>] generic_file_aio_write+0x410/0x488
[ 183.540000] [<800669bc>] generic_file_aio_write+0x80/0xf4
[ 183.540000] [<8009ae0c>] do_sync_write+0x88/0xc0
[ 183.540000] [<8009ba0c>] vfs_write+0xd8/0x1a0
[ 183.540000] [<8009be34>] SyS_write+0x60/0xa4
[ 183.540000] [<80008144>] stack_done+0x20/0x40
[ 183.540000]
[ 183.540000]
Code: 00873821 00063040 00863021 <94e30000> 94c50000 3402c328 386307ff 00031902 00052902
[ 183.828000] ---[ end trace 9dc7ee51fb36f295 ]---
Segmentation fault

Afterwards, no writes are possible until reboot.
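
For anyone reading the oops: the word in angle brackets on the Code: line is the instruction at epc. Below is a minimal sketch of decoding it by hand, assuming the standard MIPS32 I-type field layout. The helper name decode_itype is made up for illustration, and the register values are copied from the dump above.

```python
# Hypothetical helper: split a 32-bit MIPS I-type instruction word into fields.
# Opcode numbers are the standard MIPS32 load/store encodings.
LOAD_STORE_OPCODES = {
    0x20: "lb", 0x21: "lh", 0x23: "lw", 0x24: "lbu", 0x25: "lhu",
    0x28: "sb", 0x29: "sh", 0x2B: "sw",
}


def decode_itype(word):
    opcode = (word >> 26) & 0x3F   # bits 31..26
    rs = (word >> 21) & 0x1F       # bits 25..21, base register
    rt = (word >> 16) & 0x1F       # bits 20..16, target register
    imm = word & 0xFFFF            # bits 15..0, signed offset
    if imm & 0x8000:
        imm -= 0x10000
    mnemonic = LOAD_STORE_OPCODES.get(opcode, "op_0x%02x" % opcode)
    return {"mnemonic": mnemonic, "rs": rs, "rt": rt, "imm": imm}


# Registers $4..$7 as printed in the oops.
regs = {4: 0xC0002000, 5: 0x00000000, 6: 0xC003454A, 7: 0x5BC3452E}

insn = decode_itype(0x94E30000)    # the <94e30000> word from the Code: line
fault_addr = (regs[insn["rs"]] + insn["imm"]) & 0xFFFFFFFF
print("%s $%d, %d($%d) -> address 0x%08x"
      % (insn["mnemonic"], insn["rt"], insn["imm"], insn["rs"], fault_addr))
```

The word decodes to a 16-bit load (lhu) through $7, and $7 holds 5bc3452e, the same value reported as BadVA. That address lies below 0x80000000, outside the kernel segments, so the address computed in GetPureRepPrice was already invalid when it was dereferenced.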

Attachments (0)

Change History (12)

comment:1 follow-up: ↓ 7 Changed 2 years ago by anonymous

Have you tried re-flashing, so that the JFFS2 partition gets renewed?

comment:2 Changed 2 years ago by anonymous

I have experienced the same problem on the BT Home Hub 2B.
If I recompile with lzo compression, or with no compression, I no longer get the crash, but just get a jffs2 warning:

jffs2_sum_write_data: Not enough space for summary, padsize = -400

The warning can be eliminated by doubling the value of MAX_SUMMARY_SIZE in summary.h.
My guess is that the jffs2 lzma code wrongly assumes that a summary is always present, whereas jffs2 sometimes does not create one if it would be too big.

comment:3 Changed 2 years ago by anonymous

Just in case I wasn't clear, increasing MAX_SUMMARY_SIZE makes the Oops go away too, even if you keep lzma, but I think something still needs to be fixed in the lzma code.

comment:4 Changed 2 years ago by benm1

Whether or not the bug occurs seems to depend on the actual data being compressed. One way I can reproduce it fairly reliably is to execute opkg install libopenssl_xxxxxxxxxxx.ipk.

Could there be some problem of alignment or a type mismatch when compiling lzma on mips?
On the off chance, I tried the following patch, and after applying it I can no longer reproduce the bug. But without understanding exactly what is going on, I can hardly call this a proper fix.

--- a/include/linux/lzma/LzmaDec.h 2014-02-26 21:18:31.000000000 +0100
+++ b/include/linux/lzma/LzmaDec.h 2014-02-26 19:56:52.000000000 +0100
@@ -10,7 +10,8 @@
 extern "C" {
 #endif
 
-/* #define _LZMA_PROB32 */
+
+#define _LZMA_PROB32
 /* _LZMA_PROB32 can increase the speed on some CPUs,
    but memory usage for CLzmaDec::probs will be doubled in that case */

comment:5 Changed 2 years ago by anonymous

This bug seems to have something to do with memory use as well; using this patch https://lists.openwrt.org/pipermail/openwrt-devel/2014-February/023751.html reduces the chance of encountering it. Disabling dsl_control from starting seems to reduce its frequency even further.

comment:6 Changed 2 years ago by malaakso@…

The _LZMA_PROB32-patch above increases the frequency of these problems in my case. I suspect it is due to the increased memory use.
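
To put a rough number on "doubled", here is a minimal sketch of the decoder probability-table size with and without _LZMA_PROB32. The constants LZMA_BASE_SIZE and LZMA_LIT_SIZE and the sizing formula are assumptions recalled from the LZMA SDK decoder source (LzmaDec.c), not taken from this ticket, so check them against your tree; the lc/lp values are illustrative only.

```python
# Assumed constants, recalled from the LZMA SDK decoder (LzmaDec.c).
# Verify them against the kernel tree you are actually building.
LZMA_BASE_SIZE = 1846
LZMA_LIT_SIZE = 768


def num_probs(lc, lp):
    """Number of probability entries for the given literal context/position bits."""
    return LZMA_BASE_SIZE + (LZMA_LIT_SIZE << (lc + lp))


def probs_bytes(lc, lp, prob32):
    """Table size in bytes: 16-bit entries by default, 32-bit with _LZMA_PROB32."""
    entry_size = 4 if prob32 else 2
    return num_probs(lc, lp) * entry_size


# Illustrative parameter pairs, not necessarily what jffs2 uses.
for lc, lp in [(0, 0), (3, 0)]:
    print("lc=%d lp=%d: %d entries, %d bytes (16-bit) vs %d bytes (32-bit)"
          % (lc, lp, num_probs(lc, lp),
             probs_bytes(lc, lp, False), probs_bytes(lc, lp, True)))
```

These are decoder-side figures only. The encoder, where the oops occurs, keeps its own and larger tables, so this shows the per-entry doubling rather than the total memory cost of the patch.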

comment:7 in reply to: ↑ 1 Changed 2 years ago by Kevin

Replying to anonymous:

Have you tried re-flashing, so that the JFFS2 partition gets renewed?

Sorry to have been out of the loop here. Yes, I've tried that, and it didn't change anything. I haven't tried the other patches proposed here.

comment:8 Changed 23 months ago by benm1

My suggestions in comments 2, 3, and 4 were probably red herrings I'm afraid. One way you can work round the problem is not to use lzma compression at all for jffs2. Select something else in make kernel_menuconfig.

That said, I am now using recent trunk builds on the BT Home Hub 2B and am no longer able to reproduce the bug even if I leave the default lzma compression on jffs2. What version are you using, Kevin?

comment:9 Changed 23 months ago by malaakso

That may be because the patch I sent (see comment:5) has been committed in r40325. I still see it on a board with 32 MiB RAM, but rarely.

comment:10 Changed 23 months ago by benm1

I wondered if that might have made a difference. But if the bug still occurs occasionally, do you think it is worth switching to zlib or lzo compression by default for jffs2 on nand (would require a separate xway-nand subtarget I suppose), or alternatively ditching jffs2 completely and using a squashfs/ubifs overlay instead?

comment:11 Changed 23 months ago by malaakso

I always switch my builds to LZO, but haven't given much thought to what would be a good default. I suppose UBIFS for NAND would be the best option, but NOR targets should be fixed as well...

comment:12 Changed 23 months ago by benm1

Silly me. I got the idea from somewhere (no idea where!) that this only affected NAND targets. If it affects NOR and NAND the simplest solution is just to specify lzo or zlib compression for jffs2 in target/linux/lantiq/config-default.
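
To confirm which jffs2 compressors a given kernel config actually enables after such a change, here is a minimal sketch that scans .config-style text. The function name jffs2_compressors and the SAMPLE fragment are hypothetical; the symbol names are as I understand them (CONFIG_JFFS2_LZMA comes from the OpenWrt patch set, the others from mainline Kconfig), so check them against your tree.

```python
import re

# Symbol names are assumptions; confirm them in your kernel's fs/jffs2/Kconfig.
JFFS2_COMPRESSORS = (
    "CONFIG_JFFS2_LZMA",
    "CONFIG_JFFS2_LZO",
    "CONFIG_JFFS2_ZLIB",
    "CONFIG_JFFS2_RTIME",
)


def jffs2_compressors(config_text):
    """Return {symbol: True/False} for each compressor; True only if set to 'y'."""
    state = {name: False for name in JFFS2_COMPRESSORS}
    for line in config_text.splitlines():
        match = re.match(r"^(CONFIG_JFFS2_\w+)=y$", line.strip())
        if match and match.group(1) in state:
            state[match.group(1)] = True
    return state


# Hypothetical fragment showing lzma disabled and lzo/zlib enabled.
SAMPLE = """\
# CONFIG_JFFS2_LZMA is not set
CONFIG_JFFS2_LZO=y
CONFIG_JFFS2_ZLIB=y
CONFIG_JFFS2_RTIME=y
"""

result = jffs2_compressors(SAMPLE)
for name in JFFS2_COMPRESSORS:
    print("%s: %s" % (name, "enabled" if result[name] else "disabled"))
```

To check a real build, pass in the contents of the generated kernel .config instead of SAMPLE.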
