Process Signal in Ruby

2020-08-31

Ruby
Signal

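Introduction

Although the GIL keeps Ruby threads from making use of multiple cores, multiple processes can still put those cores to work.

Master-worker is a common way to organize multiple processes: one master process handles management and scheduling, while several worker processes handle the actual business logic.

So how do the master and the workers communicate? Besides shared resources such as files, TCP, and UNIX sockets, the most primitive and also an efficient way is to use signals.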
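As a rough sketch of that idea, the snippet below has a master send USR1 to a single worker, which reports back over a pipe. The pipe names, the message text, and the choice of USR1 are assumptions made for illustration; a real master would manage many workers and handle failures.

```ruby
# Two pipes: one for the worker to say "my handler is installed",
# one for the worker to report what it received.
ready_r, ready_w = IO.pipe
result_r, result_w = IO.pipe

worker_pid = fork do
  ready_r.close
  result_r.close

  trap(:USR1) do
    result_w.syswrite("worker got USR1")
    exit!(0) # leave immediately, skipping at_exit handlers
  end

  ready_w.puts "ready" # tell the master the trap is in place
  sleep 5              # idle until the signal arrives
  exit!(1)             # reached only if no signal was delivered
end

# Master side: close the ends it does not use.
ready_w.close
result_w.close

ready_r.gets                      # wait until the worker is ready
Process.kill(:USR1, worker_pid)   # "talk" to the worker with a signal
message = result_r.read           # returns at EOF, when the worker exits
_, status = Process.wait2(worker_pid)

puts "#{message} (exit code #{status.exitstatus})"
```

The readiness pipe matters: without it the master could send USR1 before the worker has installed its handler, and the default action for USR1 would terminate the worker.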

This post is about how a master process can correctly handle signals from its workers. Before getting to that, let's review some basics.

fork

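fork spawns a new process. The child inherits a copy of everything in the parent's memory, including open file descriptors.

In the parent, fork returns the child's pid;
in the child, fork returns nil.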
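A minimal sketch of both return values, using the block-less form of fork. The pipe is only there so the child can report back to the parent; the variable names are made up for this example. It also shows the inherited file descriptors at work, since the child writes to a pipe the parent opened.

```ruby
reader, writer = IO.pipe

pid = fork
if pid.nil?
  # Child: fork returned nil. The pipe was inherited from the parent.
  reader.close
  writer.puts "child #{Process.pid} saw fork return #{pid.inspect}"
  writer.close
  exit!(0) # leave right away, skipping at_exit handlers
else
  # Parent: fork returned the child's pid.
  writer.close
  child_report = reader.read
  reader.close
  Process.wait(pid)
  puts child_report
  puts "parent #{Process.pid} saw fork return #{pid}"
end
```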

wait and the wait family

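parent_pid = Process.pid

# The block runs only in the child, which exits when the block returns.
child_pid = fork do
  puts "child>>"
  sleep 2
end

# Blocks until the child exits, then prints the child's pid.
p Process.wait(child_pid)
p "parent: #{parent_pid}, child: #{child_pid}"
=begin
child>>
6549
"parent: 6547, child: 6549"
=end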

Process.wait(pid) blocks by default: it returns only once the child with that pid has exited, and its return value is that pid.

parent_pid = Process.pid

child_pid = fork do
  puts "child>>"
  sleep 2
end

# Returns immediately; nil here because the child is still running.
p Process.wait(child_pid, Process::WNOHANG)
p "parent: #{parent_pid}, child: #{child_pid}"
=begin
nil
"parent: 6558, child: 6560"
child>>
=end

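Process.wait(pid, Process::WNOHANG) does not block:

if the child with that pid has already exited, it returns pid immediately;
if the child with that pid has not exited yet, it returns nil immediately.

The pid passed to wait may be -1 or omitted; both mean "wait for any child".

wait2 differs from wait only in its return value: pid, status = Process.wait2(pid).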
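A short sketch of what the extra return value is good for. The status is a Process::Status, so the parent can inspect how the child ended; the exit code 3 is an arbitrary value chosen for this example.

```ruby
child_pid = fork do
  exit!(3) # arbitrary non-zero exit code for illustration
end

# wait2 returns the pid together with a Process::Status object.
pid, status = Process.wait2(child_pid)

puts "pid:      #{pid}"
puts "exited?:  #{status.exited?}"
puts "code:     #{status.exitstatus}"
puts "success?: #{status.success?}"
```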

Signals and zombie processes

Signal.list shows all available signals; for more detail, see the relevant sections of man signal.

Signal      Standard   Action   Comment
────────────────────────────────────────────────────────────────────────
SIGABRT      P1990      Core    Abort signal from abort(3)
SIGALRM      P1990      Term    Timer signal from alarm(2)
SIGBUS       P2001      Core    Bus error (bad memory access)
SIGCHLD      P1990      Ign     Child stopped or terminated
SIGCLD         -        Ign     A synonym for SIGCHLD
SIGCONT      P1990      Cont    Continue if stopped
SIGEMT         -        Term    Emulator trap
SIGFPE       P1990      Core    Floating-point exception
SIGHUP       P1990      Term    Hangup detected on controlling terminal
                                or death of controlling process
SIGILL       P1990      Core    Illegal Instruction
SIGINFO        -                A synonym for SIGPWR
SIGINT       P1990      Term    Interrupt from keyboard
SIGIO          -        Term    I/O now possible (4.2BSD)
SIGIOT         -        Core    IOT trap. A synonym for SIGABRT
SIGKILL      P1990      Term    Kill signal
SIGLOST        -        Term    File lock lost (unused)
SIGPIPE      P1990      Term    Broken pipe: write to pipe with no
                                readers; see pipe(7)
SIGPOLL      P2001      Term    Pollable event (Sys V);
                                synonym for SIGIO
SIGPROF      P2001      Term    Profiling timer expired
SIGPWR         -        Term    Power failure (System V)
SIGQUIT      P1990      Core    Quit from keyboard
SIGSEGV      P1990      Core    Invalid memory reference
SIGSTKFLT      -        Term    Stack fault on coprocessor (unused)
SIGSTOP      P1990      Stop    Stop process
SIGTSTP      P1990      Stop    Stop typed at terminal
SIGSYS       P2001      Core    Bad system call (SVr4);
                                see also seccomp(2)
SIGTERM      P1990      Term    Termination signal
SIGTRAP      P2001      Core    Trace/breakpoint trap
SIGTTIN      P1990      Stop    Terminal input for background process
SIGTTOU      P1990      Stop    Terminal output for background process
SIGUNUSED      -        Core    Synonymous with SIGSYS
SIGURG       P2001      Ign     Urgent condition on socket (4.2BSD)
SIGUSR1      P1990      Term    User-defined signal 1
SIGUSR2      P1990      Term    User-defined signal 2
SIGVTALRM    P2001      Term    Virtual alarm clock (4.2BSD)
SIGXCPU      P2001      Core    CPU time limit exceeded (4.2BSD);
                                see setrlimit(2)
SIGXFSZ      P2001      Core    File size limit exceeded (4.2BSD);
                                see setrlimit(2)
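To connect the table with Ruby: Signal.list returns a hash that maps signal names (without the SIG prefix) to their numbers on the current platform, and Signal.signame goes the other way. The selection of names below is arbitrary. Some numbers, such as the one for CHLD, differ between operating systems, so the sketch prints them instead of assuming values.

```ruby
signals = Signal.list

# Print a few entries; the numbers come from the running platform.
%w[INT TERM KILL CHLD USR1].each do |name|
  puts "SIG#{name} = #{signals[name]}"
end

# Map a number back to its name.
int_name = Signal.signame(signals["INT"])
puts "signal #{signals["INT"]} is #{int_name}"
```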

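When a process exits, whether normally or abnormally, the kernel collects its exit information and queues it for the parent process to handle.

The parent can handle it in two ways:

wait: actively collect the child's exit information;
detach: start a separate thread that does the wait. See Rubinius Process#detach.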
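The detach option can be sketched as follows. Process.detach returns the waiting thread, and that thread's value is the child's Process::Status. The sleep duration and the exit code 7 are arbitrary choices for this example.

```ruby
child_pid = fork do
  sleep 0.2
  exit!(7) # arbitrary exit code for illustration
end

# detach starts a thread that waits for the child in the background.
waiter = Process.detach(child_pid)
puts "waiter is a #{waiter.class}"

# Thread#value blocks until the child has exited and been reaped.
status = waiter.value
puts "child #{status.pid} exited with #{status.exitstatus}"
```

In a real program the parent would normally not call value at all; the point of detach is that the child gets reaped without the parent having to block on it.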

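"Zombie process" is a vivid description: the process is dead, but its corpse has not gone away. In other words, after the process died its parent never reaped it, so its exit information lingers in the kernel and wastes kernel resources.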
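One way to observe the leftover entry is sketched below, assuming Linux behavior: sending signal 0 performs only an existence check, and it keeps succeeding for a child that has exited but has not been waited for. The helper name process_entry_exists? is made up for this example.

```ruby
# Hypothetical helper: true while the kernel still has an entry for pid.
def process_entry_exists?(pid)
  Process.kill(0, pid) # signal 0 checks existence, nothing is delivered
  true
rescue Errno::ESRCH
  false
end

child_pid = fork { exit!(0) }
sleep 0.5 # give the child time to exit; nobody has waited for it yet

exists_before_wait = process_entry_exists?(child_pid)
Process.wait(child_pid) # reap the child
exists_after_wait = process_entry_exists?(child_pid)

puts "entry before wait: #{exists_before_wait}"
puts "entry after wait:  #{exists_after_wait}"
```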

Demo Code

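When a child exits, a CHLD signal is sent to the parent, but delivery is not reliable: standard signals are not queued. If another signal arrives while one is being handled, it may be lost, and as a result a child may never be reaped.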

child_count = 3
dead_count = 0

child_count.times do
  fork do
    sleep 1
    puts "Child #{Process.pid} done."
  end
end

trap(:CHLD) do
  child_pid = Process.wait
  puts "process child pid: #{child_pid}"
  dead_count += 1
  exit if dead_count == child_count
end

sleep 10

output:

Child 5711 done.
Child 5712 done.
Child 5713 done.
process child pid: 5713
process child pid: 5712

This is a typical output: three children finished, but the parent handled only two of them, so the one it missed became a zombie.

Improved version:

child_count = 3
dead_count = 0

child_count.times do
  fork do
    sleep 1
    puts "Child #{Process.pid} done."
  end
end

trap(:CHLD) do
  begin
    while child_pid = Process.wait(-1, Process::WNOHANG)
      puts "process child pid: #{child_pid}"
      dead_count += 1
      exit if dead_count == child_count
    end
  rescue Errno::ECHILD
    # No children left to wait for; nothing to do.
  end
end

sleep 10

output:

Child 5916 done.
Child 5915 done.
Child 5914 done.
process child pid: 5916
process child pid: 5915
process child pid: 5914

With this change, the parent no longer misses children because of unreliable signal delivery.
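The draining loop can also be pulled out into a helper and tried without any signal handler, which makes the behavior of WNOHANG easier to see. This is only a sketch; reap_exited_children is a made-up name, and polling every 0.05 seconds is an arbitrary choice.

```ruby
# Hypothetical helper: reap every child that has already exited, without blocking.
def reap_exited_children
  reaped = []
  # With WNOHANG, wait returns nil when children exist but none has exited yet.
  while (pid = Process.wait(-1, Process::WNOHANG))
    reaped << pid
  end
  reaped
rescue Errno::ECHILD
  # No children left at all.
  reaped
end

child_pids = Array.new(3) { fork { exit!(0) } }

reaped_pids = []
until reaped_pids.size == child_pids.size
  reaped_pids.concat(reap_exited_children)
  sleep 0.05
end

puts "reaped: #{reaped_pids.sort.inspect}"
```

However many children exit between two calls, a single call collects all of them, which is why the number of CHLD signals actually delivered stops mattering.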

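More on `trap`

trap is rarely used in everyday code; it shows up almost exclusively in long-running background programs.

trap(:INT) in effect redefines how the process responds to the :INT signal, and the effect is global.

If a handler for :INT has already been defined and you want to add logic on top of it: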

puts "pid: #{Process.pid}"

trap(:INT) do
  puts "processing>>>"
  exit
end

# trap returns the previously installed handler.
old_handler = trap(:INT) do
  puts "new processing..."

  old_handler.call if old_handler.respond_to?(:call)
end

at_exit do
  puts "Bye~"
end

sleep 100

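Because trap is global, redefining the handler for the same signal more than once is not recommended. If all you want is to do something before exiting, at_exit is enough.

Also, a signal's default behavior cannot be invoked this way: when no handler has been installed, trap returns a string such as "DEFAULT" rather than a callable object.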
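A small sketch of both points, using USR1 instead of INT so the process can signal itself without being terminated. The names events, previous, and chained are made up for this example.

```ruby
events = []

# First handler. The return value describes what was installed before.
previous = trap(:USR1) { events << :first }
puts "before the first trap: #{previous.inspect}"

# Second handler, chained on top of the first one.
chained = trap(:USR1) do
  events << :second
  chained.call if chained.respond_to?(:call)
end

Process.kill(:USR1, Process.pid)

# Give the handler a moment to run, in case it has not run yet.
20.times do
  break if events.size == 2
  sleep 0.05
end

puts "events: #{events.inspect}"

# Put the default action back.
trap(:USR1, "DEFAULT")
```

The respond_to?(:call) guard is what keeps the chain safe: the first trap call returned a plain string, and only the second one returned a proc that can be called.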

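Signals are often tied to how a process exits.

After Ruby finishes running a script it exits normally, with exit code 0. Beyond that, there are several explicit ways to exit: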

Method	Exit code	Runs `at_exit`?	Options
exit	0	√	Specify the exit code: `exit(code)`
exit!	1	x	Specify the exit code: `exit!(code)`
abort	1	√	Print an error message: `abort(msg)`
raise(error)	1	√	Specify the exception: `raise(StandardError.new)`
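These behaviors can be checked with a small script. The sketch below runs each snippet in a fresh Ruby interpreter, so the at_exit hook and the exit code are not affected by the calling process. The helper name exit_behavior is made up, and exit! is called with an explicit code of 2 so the example does not depend on its default.

```ruby
require "open3"
require "rbconfig"

# Hypothetical helper: run a snippet in a new interpreter and report
# its exit code and whether an at_exit hook got to run.
def exit_behavior(snippet)
  script = %(at_exit { $stdout.print "at_exit ran" }\n#{snippet})
  stdout, _stderr, status = Open3.capture3(RbConfig.ruby, "-e", script)
  { code: status.exitstatus, at_exit: stdout.include?("at_exit ran") }
end

results = {
  "exit"     => exit_behavior("exit"),
  "exit!(2)" => exit_behavior("exit!(2)"),
  "abort"    => exit_behavior("abort('boom')"),
  "raise"    => exit_behavior("raise 'boom'")
}

results.each do |name, result|
  puts "#{name}: code=#{result[:code]} at_exit=#{result[:at_exit]}"
end
```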