2019-01-28: Memory allocation failure

Symptoms:

1. After running for a while, the Erlang shell crashes because it runs out of memory
2. eheap_alloc: Cannot allocate 6801972448 bytes of memory (of type "heap").

Troubleshooting steps for the production incident:

1. Check the contents of the lager crash log
    1.1 Find the ranch_tcp crash entry and note the request path: /v2.0/xxx/xxx/xxx
    1.2 Use the "Stacktrace" entry to locate the crash point:
      xxx_servant,get_xxx_list,2
      xxxx_servant,'-get_xxxx_list/1-fun-0-',2
    ** Cowboy handler xxx_handler terminating in get/2
    for the reason error:{case_clause,{ok,[{... ...
    ** Handler state was {ctx,{dict,9,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[[<<"host">>|<<"10.128.133.167:8988">>]],[],[[ver|<<"v2.0">>]],[],[[user_id|<<"54173435024eeb336ff956bf31bc02a6">>]],[],[],[],[[<<"nonce">>|<<"329059209610">>]],[[<<"user-agent">>|<<"Apache-HttpClient/4.5.5 (Java/1.8.0_162)">>]],[[<<"token">>|<<"873a12b8c7fb4b71703870ba783a8a62">>]],[[<<"timestamp">>|<<"1548621043379">>]],[[<<"connection">>|<<"Keep-Alive">>]],[],[],[[<<"accept-encoding">>|<<"gzip,deflate">>]]}}},{dict,0,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]}}}}
    ** Request was [{socket,#Port<0.20540204>},{transport,ranch_tcp},{connection,keepalive},{pid,<0.13593.6298>},{method,<<"GET">>},{version,'HTTP/1.1'},{peer,{{10,128,134,47},57912}},{host,<<"10.128.133.167">>},{host_info,undefined},{port,8988},{path,<<"/v2.0/xxx/xxx/xxx">>},{path_info,undefined},{qs,<<>>},{qs_vals,undefined},{bindings,[{user_id,<<"xxxxx">>},{ver,<<"v2.0">>}]},{headers,[{<<"nonce">>,<<"xxxx">>},{<<"token">>,<<"xxxx">>},{<<"timestamp">>,<<"1548621043379">>},{<<"host">>,<<"xxxx:8988">>},{<<"connection">>,<<"Keep-Alive">>},{<<"user-agent">>,<<"Apache-HttpClient/4.5.5 (Java/1.8.0_162)">>},{<<"accept-encoding">>,<<"gzip,deflate">>}]},{p_headers,[{<<"if-modified-since">>,undefined},{<<"if-none-match">>,undefined},{<<"if-unmodified-since">>,undefined},{<<"if-match">>,undefined},{<<"accept">>,undefined},{<<"connection">>,[<<"keep-alive">>]}]},{cookies,undefined},{meta,[{charset,undefined},{media_type,{<<"*">>,<<"*">>,[]}}]},{body_state,waiting},{multipart,undefined},{buffer,<<>>},{resp_compress,false},{resp_state,waiting},{resp_headers,[{<<"content-type">>,[<<"*">>,<<"/">>,<<"*">>,<<>>]},{<<"X-Frame-Options">>,<<"SAMEORIGIN">>}]},{resp_body,<<>>},{onresponse,undefined}]
    ** Stacktrace: [{xxxx_servant,'-get_xxxx_list/1-fun-0-',2,[{file,[47,114,111...]},{line,82}]},{lists,foldl,3,[{file,"lists.erl"},{line,1263}]},{xxx_servant,get_xxx_list,2,[{file,[47,114,111... ]},{line,128}]},{cowboy_rest,call,3,[{file,"/root/.jenkins/workspace/octopus/_build/default/lib/cowboy/src/cowboy_rest.erl"},{line,1093}]},{cowboy_rest,set_resp_body,2,[{file,"/root/.jenkins/workspace/octopus/_build/default/lib/cowboy/src/cowboy_rest.erl"},{line,974}]},{cowboy_rest,upgrade,4,[{file,"/root/.jenkins/workspace/octopus/_build/default/lib/cowboy/src/cowboy_rest.erl"},{line,84}]},{cowboy_protocol,execute,4,[{file,"/root/.jenkins/workspace/octopus/_build/default/lib/cowboy/src/cowboy_protocol.erl"},{line,566}]}]
The steps above are usually enough to identify the source of the problem. For a closer look, analyze the erl_crash.dump file.
2. Analyze the erl_crash.dump file
  2.1 awk -v threshold=<queue size> -f queue_fun.awk erl_crash.dump
    ======================================
    1: ranch_conns_sup:start_protocol/2
    1: prim_inet:accept0/2
    1: ranch_conns_sup:start_protocol/2
    1: ranch_conns_sup:start_protocol/2
    1: ranch_conns_sup:start_protocol/2
    1: ranch_conns_sup:start_protocol/2
  2.2 ./erl_crashdump_analyzer.sh erl_crash.dump
    Processes Heap+Stack memory sizes (words) used in the VM (5 largest different):
    ===
      1 850246556
      1 410034027
      1 55185655
      1 18481566
      1 999631

    Processes OldHeap memory sizes (words) used in the VM (5 largest different):
    ===
      1 492040832
      1 410034027
      1 137319567
      1 7427328
      1 1199557
  2.3 Open erl_crash.dump and search for 850246556, the largest Heap+Stack value reported in 2.2:

    Internal State: ACT_PRIO_NORMAL | USR_PRIO_NORMAL | PRQ_PRIO_NORMAL | TRAP_EXIT | ON_HEAP_MSGQ
    =proc:<0.30.0>
    State: Waiting
    Name: error_logger
    Spawned as: proc_lib:init_p/5
    Spawned by: <0.3.0>
    Started: Mon Jan 28 03:25:13 2019
    Message queue length: 0
    Number of heap fragments: 0
    Heap fragment data: 0
    Link list: [<0.0.0>, <0.165.0>, <0.56.0>]
    Reductions: 229086
    Stack+heap: 850246556
    OldHeap: 0
    Heap unused: 492178286
    OldHeap unused: 0
    Memory: 6801973496
    Program counter: 0x00007feb0c8ae170 (gen_event:fetch_msg/5 + 48)
    CP: 0x0000000000000000 (invalid)
    arity = 0
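
Steps 2.2 and 2.3 can also be done in one pass. The following is a minimal sketch, not part of erl_crashdump_analyzer.sh: it reads the "=proc:" sections of a crash dump, prints the processes with the largest "Stack+heap" values, and converts words to bytes. The helper name top_heap_procs is hypothetical, and the 8-byte word size is an assumption that holds for a 64-bit VM.

```shell
#!/bin/sh
# Sketch: list the processes in an erl_crash.dump with the largest
# "Stack+heap" values, and convert those values from words to bytes.
# Assumption: 64-bit VM, so one word is 8 bytes.
# top_heap_procs is a hypothetical helper name, not part of any tool.
top_heap_procs() {
    dump="$1"          # path to erl_crash.dump
    count="${2:-5}"    # how many processes to print (default 5)
    awk -v wordsize=8 '
        # A new process section starts: remember its pid, reset the name.
        /^=proc:/            { pid = substr($0, 7); name = "-"; next }
        # Any other section (=port:, =proc_stack:, ...) ends the process.
        /^=/                 { pid = ""; next }
        # Registered processes carry a Name: line before Stack+heap:.
        pid != "" && /^Name: /        { name = substr($0, 7) }
        pid != "" && /^Stack\+heap: / {
            # Output: words pid name bytes
            printf "%.0f %s %s %.0f\n", $2, pid, name, $2 * wordsize
        }
    ' "$dump" | sort -rn | head -n "$count"
}
```

Usage: top_heap_procs erl_crash.dump 5. For the process above, 850246556 words * 8 bytes = 6801972448 bytes, which is exactly the size in the eheap_alloc error message. This suggests that the failed allocation was the heap of error_logger (<0.30.0>).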

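Notes:

1. The root cause of this issue was found directly from the lager crash log; erl_crash.dump was not needed.
2. Cause: when roomID is undefined, the code issues a MongoDB query with no filter conditions. The data set holds nearly 3 million records, so loading the result exhausts memory.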