PhantomJS on CentOS 7 with kernel 3.10.0-327 can crash the system, causing automatic reboots and hangs

Environment

OS: CentOS 7
Hardware: Alibaba Cloud, 16 cores, 64 GB RAM
Workload: Tomcat 7, PhantomJS 2.11

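Symptoms

The Tomcat cluster runs on several ECS instances. At random, one of the instances would hang outright or, occasionally, reboot on its own. After enabling kernel crash dumps, analysis of the resulting vmcore showed the following.

Analysis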

crash /usr/lib/debug/usr/lib/modules/3.10.0-327.36.3.el7.x86_64/vmlinux vmcore
crash 7.1.5-2.el7
Copyright (C) 2002-2016  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011  NEC Corporation
Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.
 
GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...
 
      KERNEL: /usr/lib/debug/usr/lib/modules/3.10.0-327.36.3.el7.x86_64/vmlinux
    DUMPFILE: vmcore  [PARTIAL DUMP]
        CPUS: 16
        DATE: Sun Dec  4 09:41:05 2016
      UPTIME: 22 days, 18:52:24
LOAD AVERAGE: 0.42, 0.48, 0.44
       TASKS: 2910
    NODENAME: XXXXXXX
     RELEASE: 3.10.0-327.36.3.el7.x86_64
     VERSION: #1 SMP Mon Oct 24 16:09:20 UTC 2016
     MACHINE: x86_64  (2494 Mhz)
      MEMORY: 64 GB
       PANIC: "general protection fault: 0000 [#1] SMP "
         PID: 9359
     COMMAND: "Qt bearer threa"
        TASK: ffff880bd1f7b980  [THREAD_INFO: ffff880c09978000]
         CPU: 2
       STATE: TASK_RUNNING (PANIC)

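KERNEL: the kernel image that was running when the system crashed
DUMPFILE: the kernel dump file
CPUS: number of CPUs on the machine
DATE: time of the crash
TASKS: number of tasks in memory at the time of the crash
NODENAME: hostname of the crashed system
RELEASE and VERSION: kernel release and build version
MACHINE: CPU architecture
MEMORY: physical memory of the crashed host
PANIC: the type of crash. Common types include:
SysRq (System Request): a crash triggered through the magic SysRq key combination, normally used for testing. Running echo c > /proc/sysrq-trigger triggers a system crash.
oops: can be thought of as a kernel-level segmentation fault. When an application makes an illegal memory access, it receives SIGSEGV; the default behavior is to dump core, although the application can also catch the signal and handle it itself. When the kernel itself makes this kind of mistake, it prints an oops message.

This gives us the PID of the process that triggered the crash and the command it was running.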
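When several machines in a cluster crash, it can help to pull these fields out of a saved crash banner automatically. The following is a minimal sketch in Python, assuming the banner has been saved as plain text. The function name parse_crash_summary is hypothetical and is not part of the crash utility; the sample lines are copied from the output above.

```python
import re

def parse_crash_summary(text):
    """Collect the 'FIELD: value' pairs of a crash banner into a dict."""
    fields = {}
    for line in text.splitlines():
        # Banner fields are right-aligned upper-case names followed by a colon.
        match = re.match(r"^\s*([A-Z][A-Z ]*[A-Z]):\s+(.*\S)\s*$", line)
        if match:
            # Drop the quotes crash puts around PANIC and COMMAND values.
            fields[match.group(1)] = match.group(2).strip('"').strip()
    return fields

# Excerpt of the banner shown above.
SAMPLE = '''\
       PANIC: "general protection fault: 0000 [#1] SMP "
         PID: 9359
     COMMAND: "Qt bearer threa"
         CPU: 2
       STATE: TASK_RUNNING (PANIC)
'''

summary = parse_crash_summary(SAMPLE)
print(summary["PANIC"], "| PID", summary["PID"], "|", summary["COMMAND"])
```

Collecting these three fields from every vmcore in a cluster makes it easy to see whether the same command shows up in each crash.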

crash> bt
PID: 9359   TASK: ffff880bd1f7b980  CPU: 2   COMMAND: "Qt bearer threa"
 #0 [ffff880c0997b7d8] machine_kexec at ffffffff81051e9b
 #1 [ffff880c0997b838] crash_kexec at ffffffff810f27e2
 #2 [ffff880c0997b908] oops_end at ffffffff8163f448
 #3 [ffff880c0997b930] die at ffffffff8101859b
 #4 [ffff880c0997b960] do_general_protection at ffffffff8163ed3e
 #5 [ffff880c0997b990] general_protection at ffffffff8163e5e8
    [exception RIP: netlink_compare+11]
    RIP: ffffffff815560bb  RSP: ffff880c0997ba40  RFLAGS: 00010246
    RAX: 0000000000000000  RBX: 0002000108000002  RCX: 00000000a55c3ca0
    RDX: 0000000000002484  RSI: ffff880c0997ba90  RDI: 0002000107fffb7a
    RBP: ffff880c0997ba78   R8: ffff880c0997ba8c   R9: ffff880c0997ba08
    R10: ffff880ffec03600  R11: 0000000000000293  R12: ffff880fe890a000
    R13: ffff880c0997ba90  R14: ffffffff815560b0  R15: ffff880fe6bd3bc0
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #6 [ffff880c0997ba48] rhashtable_lookup_compare at ffffffff813080d0
 #7 [ffff880c0997ba80] netlink_lookup at ffffffff815569ee
 #8 [ffff880c0997bab0] netlink_getsockbyportid at ffffffff81557d8f
 #9 [ffff880c0997bac8] netlink_alloc_skb at ffffffff81557dff
#10 [ffff880c0997bb00] netlink_dump at ffffffff81558037
#11 [ffff880c0997bb30] __netlink_dump_start at ffffffff81558a6b
#12 [ffff880c0997bb68] rtnetlink_rcv_msg at ffffffff8153a4a0
#13 [ffff880c0997bbd8] netlink_rcv_skb at ffffffff8155aa19
#14 [ffff880c0997bc00] rtnetlink_rcv at ffffffff8153a338
#15 [ffff880c0997bc18] netlink_unicast at ffffffff8155a02d
#16 [ffff880c0997bc60] netlink_sendmsg at ffffffff8155a420
#17 [ffff880c0997bcf8] sock_sendmsg at ffffffff815112d0
#18 [ffff880c0997be58] SYSC_sendto at ffffffff81511841
#19 [ffff880c0997bf70] sys_sendto at ffffffff815122ce
#20 [ffff880c0997bf80] system_call_fastpath at ffffffff81646b49
    RIP: 00007f0117be0dd3  RSP: 00007f00d0b92bb0  RFLAGS: 00000293
    RAX: 000000000000002c  RBX: ffffffff81646b49  RCX: ffffffffffffffff
    RDX: 0000000000000014  RSI: 00007f00d0b927c0  RDI: 000000000000000a
    RBP: 00007f00d0b92830   R8: 00007f00d0b927a0   R9: 000000000000000c
    R10: 0000000000000000  R11: 0000000000000293  R12: ffffffff815122ce
    R13: ffff880c0997bf78  R14: 0000000000000001  R15: 00007f00d0b91770
    ORIG_RAX: 000000000000002c  CS: 0033  SS: 002b

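The bt command prints the stack of the crashing task.
In the output above, the lines that start with "#" and a number are the call stack: the chain of kernel functions that were active when the system crashed, with frame #0 the most recent. This makes it quick to narrow down where the kernel failed. Here the fault was raised in netlink_compare, reached through a sendto system call on a netlink socket. The process list, as printed by crash's ps command, follows; a leading ">" marks the task that was active on each CPU.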
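The backtrace can also be read mechanically. The sketch below, in Python, pulls the frame names and the exception site out of an excerpt of the bt output above, and checks whether a register value is a canonical x86-64 address. On x86-64, dereferencing a non-canonical address typically raises a general protection fault rather than a page fault. The assumption that RDI (the first argument register in the System V calling convention) holds the pointer netlink_compare was dereferencing has not been verified against a disassembly, and all function names in the sketch are hypothetical helpers.

```python
import re

# Excerpt of the bt output shown above.
BT_EXCERPT = """\
 #4 [ffff880c0997b960] do_general_protection at ffffffff8163ed3e
 #5 [ffff880c0997b990] general_protection at ffffffff8163e5e8
    [exception RIP: netlink_compare+11]
    RIP: ffffffff815560bb  RSP: ffff880c0997ba40  RFLAGS: 00010246
    RAX: 0000000000000000  RBX: 0002000108000002  RCX: 00000000a55c3ca0
    RDX: 0000000000002484  RSI: ffff880c0997ba90  RDI: 0002000107fffb7a
 #6 [ffff880c0997ba48] rhashtable_lookup_compare at ffffffff813080d0
 #7 [ffff880c0997ba80] netlink_lookup at ffffffff815569ee
"""

def frames(bt_text):
    """Return (frame number, function name) pairs in the order printed."""
    pattern = r"^\s*#(\d+) \[[0-9a-f]+\] (\S+) at [0-9a-f]+"
    return [(int(n), fn) for n, fn in re.findall(pattern, bt_text, re.M)]

def exception_site(bt_text):
    """Return the 'function+offset' where the exception was raised."""
    match = re.search(r"\[exception RIP: ([^\]]+)\]", bt_text)
    return match.group(1) if match else None

def register(bt_text, name):
    """Return the first 64-bit value printed for the named register."""
    match = re.search(r"\b%s: ([0-9a-f]{16})" % name, bt_text)
    return int(match.group(1), 16) if match else None

def is_canonical(addr):
    """With 48-bit virtual addresses, bits 63..47 must all be equal."""
    top = addr >> 47
    return top == 0 or top == 0x1FFFF

print(exception_site(BT_EXCERPT))
print([fn for _, fn in frames(BT_EXCERPT)])
for reg in ("RSI", "RDI"):
    value = register(BT_EXCERPT, reg)
    print(reg, hex(value), "canonical" if is_canonical(value) else "non-canonical")
```

Under that assumption, the non-canonical value in RDI is consistent with the hash table lookup walking into a corrupted or already freed entry, but the dump excerpt alone does not prove it.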

   PID    PPID  CPU       TASK        ST  %MEM     VSZ    RSS  COMM
>     0      0   0  ffffffff81951440  RU   0.0       0      0  [swapper/0]
>     0      0   1  ffff880fe8a66780  RU   0.0       0      0  [swapper/1]
      0      0   2  ffff880fe8a67300  RU   0.0       0      0  [swapper/2]
      0      0   3  ffff880fe8ac0000  RU   0.0       0      0  [swapper/3]
>     0      0   4  ffff880fe8ac0b80  RU   0.0       0      0  [swapper/4]
      0      0   5  ffff880fe8ac1700  RU   0.0       0      0  [swapper/5]
>     0      0   6  ffff880fe8ac2280  RU   0.0       0      0  [swapper/6]
>     0      0   7  ffff880fe8ac2e00  RU   0.0       0      0  [swapper/7]
>     0      0   8  ffff880fe8ac3980  RU   0.0       0      0  [swapper/8]
>     0      0   9  ffff880fe8ac4500  RU   0.0       0      0  [swapper/9]
>     0      0  10  ffff880fe8ac5080  RU   0.0       0      0  [swapper/10]
      0      0  11  ffff880fe8ac5c00  RU   0.0       0      0  [swapper/11]
>     0      0  12  ffff880fe8ac6780  RU   0.0       0      0  [swapper/12]
>     0      0  13  ffff880fe8ac7300  RU   0.0       0      0  [swapper/13]
>     0      0  14  ffff880fe8af8000  RU   0.0       0      0  [swapper/14]
>     0      0  15  ffff880fe8af8b80  RU   0.0       0      0  [swapper/15]
      1      0   7  ffff880fe8918000  IN   0.0  190464   5500  systemd
...
>  9359      1   2  ffff880bd1f7b980  RU   0.1 2773168  52496  Qt bearer threa
   9360      1   6  ffff880bd1f7a280  IN   0.1 2773168  52496  phantomjs
   9361      1   2  ffff880bd1f7f300  IN   0.1 2773168  52496  phantomjs
   9362      1   2  ffff8800a51f5080  IN   0.1 2773168  52496  phantomjs
   9363      1   1  ffff8800a51f5c00  IN   0.1 2773168  52496  phantomjs
   9364      1   9  ffff8800a51f2e00  IN   0.1 2773168  52496  phantomjs
   9365      1  10  ffff880ed4cb8000  IN   0.1 2773168  52496  phantomjs
   9366      1   6  ffff880ed4cbae00  IN   0.1 2773168  52496  phantomjs
   9367      1   5  ffff880ed4cb9700  IN   0.1 2773168  52496  phantomjs
   9368      1   7  ffff880ed4cbc500  IN   0.1 2773168  52496  phantomjs
   9369      1   7  ffff880ed4cbd080  IN   0.1 2773168  52496  phantomjs
   9370      1  12  ffff880ed4cbb980  IN   0.1 2773168  52496  phantomjs
   9371      1   6  ffff880ed4cbdc00  IN   0.1 2773168  52496  phantomjs
   9372      1   7  ffff880ed4cbf300  IN   0.1 2773168  52496  phantomjs
...

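PID 9359 turns out to be "Qt bearer threa", which is really "Qt bearer thread": the kernel keeps at most 15 characters of a task name. It shows the same VSZ and RSS figures as the phantomjs tasks that follow, which indicates it is a thread of the same PhantomJS process.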
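The truncated name is easy to reproduce. The kernel stores a task name in a 16-byte buffer (TASK_COMM_LEN), one byte of which is the terminating NUL. The Python sketch below mimics that truncation; kernel_comm is a hypothetical helper, not a kernel or Qt API.

```python
TASK_COMM_LEN = 16  # size of the kernel's comm buffer, including the NUL byte

def kernel_comm(thread_name):
    """Return the name as it would fit into the kernel's comm field."""
    raw = thread_name.encode()[:TASK_COMM_LEN - 1]
    # Ignore a multi-byte character cut in half by the truncation.
    return raw.decode(errors="ignore")

print(kernel_comm("Qt bearer thread"))
print(kernel_comm("phantomjs"))
```

This is worth keeping in mind when searching source code for a name reported by crash or ps: the string in the source may be longer than what the kernel shows.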

crash> set 9359
    PID: 9359
COMMAND: "Qt bearer threa"
   TASK: ffff880bd1f7b980  [THREAD_INFO: ffff880c09978000]
    CPU: 2
  STATE: TASK_RUNNING (PANIC)
crash> files
PID: 9359   TASK: ffff880bd1f7b980  CPU: 2   COMMAND: "Qt bearer threa"
ROOT: /    CWD: /server/
 FD       FILE            DENTRY           INODE       TYPE PATH
  0 ffff880fe0990c00 ffff88009de98c00 ffff880a8c6d1030 FIFO 
  1 ffff880fe0990700 ffff88009de98000 ffff880a8c6d1bc0 FIFO 
  2 ffff880fe0990900 ffff88009de98480 ffff880a8c6d0250 FIFO 
  3 ffff880f801af000 ffff880a1334f200 ffff880ffebc3090 UNKN [eventfd]
  4 ffff880ef3e8cc00 ffff880a23a41800 ffff880ffebc3090 UNKN [eventfd]
  5 ffff880fe221a200 ffff880fbb539440 ffff880fbb543a00 REG  /usr/share/mime/mime.cache
  6 ffff880fe221b200 ffff880fc9ec0480 ffff880fc9ebf350 REG  /server/phantomjsdriver.log
  7 ffff880fe221b400 ffff880a1334efc0 ffff880fe6e942b0 SOCK TCP
  8 ffff880ee8db0f00 ffff88009dcc1380 ffff880ffebc3090 UNKN [eventfd]
  9 ffff880fe0991400 ffff88009de98a80 ffff880a13353730 SOCK TCP
 10 ffff880fe560da00 ffff880906509e00 ffff88090659c2b0 SOCK NETLINK
 21 ffff880fe439db00 ffff880ffe80a600 ffff880fe8639598 CHR  /dev/urandom
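The files listing can be tied back to the user-space registers at the bottom of the bt output. On x86-64 Linux the system call number is passed in RAX (saved as ORIG_RAX) and the first and third arguments in RDI and RDX; system call 44 is sendto. The Python sketch below uses values copied from the outputs above; fd_type is a hypothetical helper.

```python
SYS_SENDTO_X86_64 = 44  # sendto system call number on x86-64 Linux

# User-space register values from the end of the bt output above.
USER_REGS = {"ORIG_RAX": 0x2c, "RDI": 0xa, "RDX": 0x14, "R9": 0xc}

# Excerpt of the files output shown above.
FILES_EXCERPT = """\
  7 ffff880fe221b400 ffff880a1334efc0 ffff880fe6e942b0 SOCK TCP
  9 ffff880fe0991400 ffff88009de98a80 ffff880a13353730 SOCK TCP
 10 ffff880fe560da00 ffff880906509e00 ffff88090659c2b0 SOCK NETLINK
 21 ffff880fe439db00 ffff880ffe80a600 ffff880fe8639598 CHR  /dev/urandom
"""

def fd_type(files_text, fd):
    """Return the TYPE and PATH columns for a file descriptor, or None."""
    for line in files_text.splitlines():
        parts = line.split()
        if len(parts) >= 5 and parts[0] == str(fd):
            return " ".join(parts[4:])
    return None

is_sendto = USER_REGS["ORIG_RAX"] == SYS_SENDTO_X86_64
fd = USER_REGS["RDI"]      # first argument of sendto: the socket descriptor
length = USER_REGS["RDX"]  # third argument of sendto: the message length

print("sendto:", is_sendto, "| fd:", fd, "| bytes:", length)
print("fd", fd, "is", fd_type(FILES_EXCERPT, fd))
```

The thread was therefore sending a 20-byte message on file descriptor 10, which the files output identifies as the NETLINK socket. The value 12 in R9, the address length argument, matches the size of struct sockaddr_nl.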

Fix: downgrade the kernel to a known-stable release, or upgrade to the latest kernel and test again.
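Before and after changing kernels, it is useful to confirm which hosts are still running the series seen in the dump. The Python sketch below compares a kernel release string with the 3.10.0-327 series. Whether other kernel series are affected is not something this analysis establishes, so the prefix is an assumption taken only from the vmcore above, and is_suspect_kernel is a hypothetical helper.

```python
import platform

# Kernel series seen in the vmcore above; an assumption, not a verified range.
SUSPECT_SERIES = "3.10.0-327"

def is_suspect_kernel(release, series=SUSPECT_SERIES):
    """Return True when the release string belongs to the given series."""
    # Require a '.' after the series so that e.g. 3.10.0-3270 does not match.
    return release == series or release.startswith(series + ".")

print(is_suspect_kernel("3.10.0-327.36.3.el7.x86_64"))
print(is_suspect_kernel(platform.release()))
```

The second call checks the machine the script runs on, so it can be executed on each node of the cluster.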

References

https://www.ibm.com/developerworks/cn/linux/l-cn-kdump4/
