Process Signal in Ruby

2020-08-31

Introduction

Although the GIL prevents Ruby threads from taking advantage of multiple cores, a program can still use those cores by running multiple processes.

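Master-worker is a common way to organize multiple processes: one master process handles management and scheduling, while several worker processes run the actual business logic.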

So how do the master and the workers communicate? Besides shared resources such as files, TCP, and UNIX sockets, the most primitive and still efficient option is signals.

This post is about how a master process correctly handles signals from its workers. Before getting to that, let's review some background.

fork

fork creates a new process. The child inherits everything in the parent's memory, including open file descriptors.

  • In the parent, fork returns the child's pid;
  • In the child, fork returns nil.
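
The two return values, and the inherited file descriptors, can be seen in a minimal sketch like the one below. The helper name fork_demo is made up for illustration; the child reports what fork returned through a pipe that was opened before the fork.

```ruby
# Sketch: fork without a block returns twice, once in each process.
# fork_demo is a hypothetical helper name used only for this example.
def fork_demo
  reader, writer = IO.pipe   # opened before fork, so both processes hold both ends

  pid = fork                 # the child's pid in the parent, nil in the child
  if pid.nil?
    # Child: use the inherited write end to report what fork returned here.
    reader.close
    writer.puts "fork returned #{pid.inspect} in child #{Process.pid}"
    writer.close
    exit!(0)                 # leave immediately, skipping any inherited at_exit handlers
  end

  # Parent: pid is the child's process id.
  writer.close
  message = reader.read      # returns once the child has closed its write end
  reader.close
  Process.wait(pid)          # reap the child
  [pid, message]
end

child_pid, message = fork_demo
puts "parent got child pid #{child_pid}"
puts message
```

Calling fork with a block, as the examples in the rest of this post do, is shorthand for the same thing: the block runs only in the child, and the child exits when the block finishes.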

wait and the wait family

father_pid = Process.pid

child_pid = fork do
  puts "child>>"
  sleep 2
end

p Process.wait(child_pid)
p "father: #{father_pid}, child: #{child_pid}"
=begin
child>>
6549
"father: 6547, child: 6549"
=end

Process.wait(pid) blocks by default: it returns only after the child identified by pid has exited, and its return value is that pid.

father_pid = Process.pid

child_pid = fork do
  puts "child>>"
  sleep 2
end

p Process.wait(child_pid, Process::WNOHANG)
p "father: #{father_pid}, child: #{child_pid}"
=begin
nil
"father: 6558, child: 6560"
child>>
=end

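Process.wait(pid, Process::WNOHANG) does not block:

  • If the child identified by pid has already exited, it returns pid immediately;
  • If the child identified by pid has not exited yet, it returns nil immediately.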
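
Because the non-blocking form returns nil while the child is still running, the parent can poll for it between other pieces of work. A minimal sketch, assuming a hypothetical helper named poll_child and arbitrary sleep durations:

```ruby
# Sketch: poll for a child without blocking the parent.
# poll_child and its interval parameter are illustrative names.
def poll_child(interval: 0.1)
  pid = fork { sleep 0.3 }
  polls = 0
  # Process.wait with WNOHANG returns nil while the child is still running.
  until (reaped = Process.wait(pid, Process::WNOHANG))
    polls += 1
    sleep interval           # the parent is free to do other work here
  end
  [pid, reaped, polls]
end

pid, reaped, polls = poll_child
puts "reaped #{reaped} after #{polls} unsuccessful polls"
```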

The pid passed to wait can be -1 or omitted; both mean "wait for any child process".

wait2 differs from wait only in its return value, which also includes the child's exit status (a Process::Status): pid, status = Process.wait2(pid).
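
As a small illustration of that second return value, the sketch below forks a child that exits with a given code and reads the code back from the Process::Status object. The helper name exit_status_of is an assumption made for this example.

```ruby
# Sketch: read a child's exit code from the status returned by Process.wait2.
# exit_status_of is a hypothetical helper name.
def exit_status_of(code)
  pid = fork { exit!(code) }
  reaped_pid, status = Process.wait2(pid)   # status is a Process::Status
  [pid == reaped_pid, status.exited?, status.exitstatus]
end

p exit_status_of(0)   # [true, true, 0]
p exit_status_of(3)   # [true, true, 3]
```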

Signals and zombie processes

Signal.list shows all signals; for more detail, see the relevant sections of man signal.

Signal      Standard   Action   Comment
────────────────────────────────────────────────────────────────────────
SIGABRT      P1990      Core    Abort signal from abort(3)
SIGALRM      P1990      Term    Timer signal from alarm(2)
SIGBUS       P2001      Core    Bus error (bad memory access)
SIGCHLD      P1990      Ign     Child stopped or terminated
SIGCLD         -        Ign     A synonym for SIGCHLD
SIGCONT      P1990      Cont    Continue if stopped
SIGEMT         -        Term    Emulator trap
SIGFPE       P1990      Core    Floating-point exception
SIGHUP       P1990      Term    Hangup detected on controlling terminal
                                or death of controlling process
SIGILL       P1990      Core    Illegal Instruction
SIGINFO        -                A synonym for SIGPWR
SIGINT       P1990      Term    Interrupt from keyboard
SIGIO          -        Term    I/O now possible (4.2BSD)
SIGIOT         -        Core    IOT trap. A synonym for SIGABRT
SIGKILL      P1990      Term    Kill signal
SIGLOST        -        Term    File lock lost (unused)
SIGPIPE      P1990      Term    Broken pipe: write to pipe with no
                                readers; see pipe(7)
SIGPOLL      P2001      Term    Pollable event (Sys V);
                                synonym for SIGIO
SIGPROF      P2001      Term    Profiling timer expired
SIGPWR         -        Term    Power failure (System V)
SIGQUIT      P1990      Core    Quit from keyboard
SIGSEGV      P1990      Core    Invalid memory reference
SIGSTKFLT      -        Term    Stack fault on coprocessor (unused)
SIGSTOP      P1990      Stop    Stop process
SIGTSTP      P1990      Stop    Stop typed at terminal
SIGSYS       P2001      Core    Bad system call (SVr4);
                                see also seccomp(2)
SIGTERM      P1990      Term    Termination signal
SIGTRAP      P2001      Core    Trace/breakpoint trap
SIGTTIN      P1990      Stop    Terminal input for background process
SIGTTOU      P1990      Stop    Terminal output for background process
SIGUNUSED      -        Core    Synonymous with SIGSYS
SIGURG       P2001      Ign     Urgent condition on socket (4.2BSD)
SIGUSR1      P1990      Term    User-defined signal 1
SIGUSR2      P1990      Term    User-defined signal 2
SIGVTALRM    P2001      Term    Virtual alarm clock (4.2BSD)
SIGXCPU      P2001      Core    CPU time limit exceeded (4.2BSD);
                                see setrlimit(2)
SIGXFSZ      P2001      Core    File size limit exceeded (4.2BSD);
                                see setrlimit(2)
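
On the Ruby side, Signal.list returns a Hash that maps signal names, without the SIG prefix, to their numbers. A short sketch, where signal_number is a hypothetical helper and the numbers in the comments are the usual values on Linux and macOS:

```ruby
# Sketch: look up signal numbers through Signal.list.
# signal_number is an illustrative helper, not part of Ruby.
def signal_number(name)
  # Signal.list keys have no "SIG" prefix, so strip it if present.
  Signal.list.fetch(name.to_s.sub(/\ASIG/, ""))
end

puts signal_number("INT")      # 2
puts signal_number("SIGKILL")  # 9
puts signal_number(:TERM)      # 15
puts Signal.list.key?("CHLD")  # true on Unix-like systems
```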

When a process exits, whether normally or unexpectedly, the kernel collects its exit information and queues it for its parent process to handle.

The parent process can handle it in two ways:

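"Zombie process" is a vivid term: the process is dead, but its corpse has not gone away. In other words, after the process died its parent never reaped it, so its exit information lingers in the kernel and wastes kernel resources.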

Demo Code

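When a child exits, a CHLD signal is sent to the parent, but delivery of this signal is not reliable. Standard signals are not queued: if another CHLD arrives while one is already pending or being handled, the two can collapse into a single delivery, and the parent then fails to reap one of the children.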

child_count = 3
dead_count = 0

child_count.times do
  fork do
    sleep 1
    puts "Child #{Process.pid} done."
  end
end

trap(:CHLD) do
  child_pid = Process.wait
  puts "process child pid: #{child_pid}"
  dead_count += 1
  exit if dead_count == child_count
end

sleep 10

output:

Child 5711 done.
Child 5712 done.
Child 5713 done.
process child pid: 5713
process child pid: 5712

This is a typical output: three children finished but the parent reaped only two of them, so the child it missed became a zombie process.


Improved version:

child_count = 3
dead_count = 0

child_count.times do
  fork do
    sleep 1
    puts "Child #{Process.pid} done."
  end
end

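trap(:CHLD) do
  begin
    # One CHLD may stand for several exited children, so reap every child
    # that has already exited. WNOHANG makes wait return nil instead of
    # blocking when the remaining children are still running.
    while child_pid = Process.wait(-1, Process::WNOHANG)
      puts "process child pid: #{child_pid}"
      dead_count += 1
      exit if dead_count == child_count
    end
  rescue Errno::ECHILD
    # Raised when no child processes remain; nothing left to reap.
  end
end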

sleep 10

output:

Child 5916 done.
Child 5915 done.
Child 5914 done.
process child pid: 5916
process child pid: 5915
process child pid: 5914

With this change, the parent no longer misses children because of unreliable signal delivery.

More on trap

trap is rarely needed in everyday code; it is used almost exclusively in long-running background programs.

trap(:INT) actually redefines how the process responds to the :INT signal, and the effect is global.

If a handler for :INT has already been defined and you want to add logic on top of it:

puts "pid: #{Process.pid}"

trap(:INT) do
  puts "processing>>>"
  exit
end

# trap returns the previously installed handler, so keep it and chain to it.
old_handler = trap(:INT) do
  puts "new processing..."

  old_handler.call if old_handler.respond_to?(:call)
end

at_exit do
  puts "Bye~"
end

sleep 100

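Because trap is global, defining a handler for the same signal more than once is not recommended. If all you want is to do something before exiting, at_exit is enough.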

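Also, a signal's default action cannot be invoked this way: when no custom handler has been installed, trap returns a string such as "DEFAULT" rather than a callable object.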
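
The sketch below shows what trap hands back when no handler block has been installed yet. The helper name previous_handler_for is made up, and USR1 is used only because it is a harmless signal to experiment with. The exact string depends on how the process was started, so the example prints it rather than assuming a value.

```ruby
# Sketch: inspect what trap returns as the "previous handler".
# previous_handler_for is a hypothetical helper name.
def previous_handler_for(signal)
  previous = trap(signal) { }             # install a temporary no-op handler
  trap(signal, previous || "DEFAULT")     # put the earlier disposition back
  previous
end

previous = previous_handler_for(:USR1)
puts "previous handler: #{previous.inspect}"
puts "callable? #{previous.respond_to?(:call)}"
```

When the previous value is a string like this, there is nothing to call. The default behavior can only be restored by passing the string back to trap, for example trap(:USR1, "DEFAULT").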


Signals are often involved in how a process exits.

When Ruby finishes running a script it exits normally with exit code 0. Besides that, there are several explicit ways to exit:

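Method         Exit code   Runs at_exit   Optional argument
────────────────────────────────────────────────────────────────────────
exit           0           yes            exit code: exit(code)
exit!          1           no             exit code: exit!(code)
abort          1           yes            error message to print: abort(msg)
raise(error)   1           yes            exception to raise: raise(StandardError.new)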
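
These four exit paths can be checked with a sketch like the following. The helper observe_exit is hypothetical: it runs a block in a forked child, uses a pipe to detect whether an at_exit handler ran, and reads the exit code with Process.wait2. The child's stderr is redirected so that the messages from abort and raise do not clutter the output.

```ruby
# Sketch: run a block in a child, then report [exit code, at_exit ran?].
# observe_exit is an illustrative helper, not part of Ruby.
def observe_exit
  reader, writer = IO.pipe
  pid = fork do
    reader.close
    $stderr.reopen(File::NULL, "w")   # hide the messages from abort and raise
    at_exit do
      writer.write("at_exit")         # only reaches the parent if handlers run
      writer.close
    end
    yield                             # the exit path under test
  end
  writer.close
  ran_at_exit = reader.read == "at_exit"
  reader.close
  _, status = Process.wait2(pid)
  [status.exitstatus, ran_at_exit]
end

p observe_exit { exit }            # [0, true]
p observe_exit { exit! }           # [1, false]
p observe_exit { abort "fatal" }   # [1, true]
p observe_exit { raise "boom" }    # [1, true]
```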

References

https://man7.org/linux/man-pages/man2/signal.2.html

https://man7.org/linux/man-pages/man7/signal.7.html

https://github.com/rubinius/rubinius/blob/master/core/process.rb