ablog

Notes from a clumsy, restless engineer

Can a process be killed when an I/O system call on NFS gets no response?

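NFS has two mount options, soft and hard. When a process issues an I/O system call, it switches from user mode to kernel mode. If the server then fails to respond, a soft mount retries a number of times and finally returns an I/O error, whereas a hard mount keeps waiting until a response arrives.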

  • hard + intr: the process can be killed*1, presumably because it sleeps in TASK_INTERRUPTIBLE.
  • hard + nointr: the process cannot be killed, presumably because it sleeps in TASK_UNINTERRUPTIBLE.
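
To make the soft/hard difference described above concrete, here is a minimal user-space sketch in C. It is only a model, not how the kernel's RPC layer is written: model_nfs_request, attempt_fn, never_answers, and answers_after are hypothetical names, and max_attempts exists only so that simulating a hard mount terminates. It assumes that a soft mount gives up after the initial transmission plus retrans retransmissions, which is how nfs(5) describes the retrans option.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical callback: returns true if the server answered this attempt. */
typedef bool (*attempt_fn)(void *ctx);

/* Simplified model of one NFS request (assumption, not kernel code).
 * soft: give up with -EIO after 1 + retrans unanswered attempts.
 * hard: keep retrying; max_attempts only bounds the simulation, and
 *       -EAGAIN is used here to mean "still waiting". */
int model_nfs_request(bool soft, int retrans, attempt_fn attempt,
                      void *ctx, int max_attempts)
{
    for (int sent = 1; sent <= max_attempts; sent++) {
        if (attempt(ctx))
            return 0;
        if (soft && sent > retrans)
            return -EIO;
    }
    return -EAGAIN;
}

/* Demo server that never responds. */
bool never_answers(void *ctx)
{
    (void)ctx;
    return false;
}

/* Demo server that ignores the first *ctx attempts, then responds. */
bool answers_after(void *ctx)
{
    int *remaining = ctx;

    if (*remaining > 0) {
        (*remaining)--;
        return false;
    }
    return true;
}
```

With retrans set to 3 and a server that only answers after ten lost attempts, the soft model returns -EIO while the hard model eventually returns 0. Against never_answers, the hard model never completes on its own, which is the situation where being able to kill the process matters.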

TASK_KILLABLE was introduced in kernel 2.6.25. The NFS client code was changed so that a process sleeps in TASK_KILLABLE after issuing an I/O system call, which means it can now be killed even when the hard mount option is specified.
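
The behavior described so far can be summarized as a small decision table. The following C sketch is a user-space model under stated assumptions, not kernel code: the enum and both function names are hypothetical, the mapping from mount options to sleep states follows the guesses made above, and signal masking (which the older NFS client code applied before sleeping) is ignored.

```c
#include <stdbool.h>

/* User-space model only; these are not the kernel's definitions. */
enum sleep_state {
    SLEEP_INTERRUPTIBLE,   /* any delivered signal wakes the sleeper */
    SLEEP_UNINTERRUPTIBLE, /* no signal wakes the sleeper            */
    SLEEP_KILLABLE         /* only fatal signals wake the sleeper    */
};

/* State a hard-mounted NFS client is assumed to wait in.
 * has_task_killable: the kernel is 2.6.25 or later.
 * intr: the intr mount option was given (ignored on newer kernels). */
enum sleep_state nfs_hard_wait_state(bool has_task_killable, bool intr)
{
    if (has_task_killable)
        return SLEEP_KILLABLE;
    return intr ? SLEEP_INTERRUPTIBLE : SLEEP_UNINTERRUPTIBLE;
}

/* Does a signal end the wait? fatal is true for a signal that would
 * kill the process, such as SIGKILL. */
bool signal_ends_wait(enum sleep_state state, bool fatal)
{
    switch (state) {
    case SLEEP_INTERRUPTIBLE:
        return true;
    case SLEEP_KILLABLE:
        return fatal;
    case SLEEP_UNINTERRUPTIBLE:
        return false;
    }
    return false;
}
```

In this model, signal_ends_wait(nfs_hard_wait_state(false, false), true) is false (hard + nointr on an old kernel cannot be killed), while the same call with has_task_killable set to true returns true regardless of intr.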

intr / nointr This option is provided for backward compatibility. It is ignored after kernel 2.6.25.

nfs(5) - Linux manual page

This change is presumably why the intr / nointr options are ignored from 2.6.25 onward. RHEL 5 (2.6.18) does not include the change, but RHEL 6 (2.6.32) and later presumably do*2.
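
As a rough way to check which side of that boundary a machine is on, the version comparison can be sketched in C. This is a minimal sketch: intr_option_ignored and running_kernel_ignores_intr are hypothetical helper names, and the check assumes the cutoff is "2.6.25 or later". Distribution kernels may carry backported patches, so the version number alone is only a hint.

```c
#include <stdio.h>
#include <sys/utsname.h>

/* Returns 1 if the given kernel version is 2.6.25 or later, i.e. a
 * kernel where intr/nointr is assumed to be ignored; 0 otherwise. */
int intr_option_ignored(int major, int minor, int patch)
{
    if (major != 2)
        return major > 2;
    if (minor != 6)
        return minor > 6;
    return patch >= 25;
}

/* Parses the release string reported by uname(2), for example
 * "2.6.32-754.el6.x86_64", and applies the check above.
 * Returns -1 if the release string cannot be obtained or parsed. */
int running_kernel_ignores_intr(void)
{
    struct utsname u;
    int major = 0, minor = 0, patch = 0;

    if (uname(&u) != 0)
        return -1;
    if (sscanf(u.release, "%d.%d.%d", &major, &minor, &patch) < 2)
        return -1;
    return intr_option_ignored(major, minor, patch);
}
```

For the two RHEL kernels mentioned above, intr_option_ignored(2, 6, 18) returns 0 and intr_option_ignored(2, 6, 32) returns 1.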

References

The Linux Programming Interface: A Linux and UNIX System Programming Handbook (English Edition)

  • 22.3 Interruptible and Uninterruptible Process Sleep States

We need to add a proviso to our earlier statement that SIGKILL and SIGSTOP always act immediately on a process. At various times, the kernel may put a process to sleep, and two sleep states are distinguished:

  • TASK_INTERRUPTIBLE: The process is waiting for some event. For example, it is waiting for terminal input, for data to be written to a currently empty pipe, or for the value of a System V semaphore to be increased. A process may spend an arbitrary length of time in this state. If a signal is generated for a process in this state, then the operation is interrupted and the process is woken up by the delivery of a signal. When listed by ps(1), processes in the TASK_INTERRUPTIBLE state are marked by the letter S in the STAT (process state) field.
  • TASK_UNINTERRUPTIBLE: The process is waiting on certain special classes of event, such as the completion of a disk I/O. If a signal is generated for a process in this state, then the signal is not delivered until the process emerges from this state. Processes in the TASK_UNINTERRUPTIBLE state are listed by ps(1) with a D in the STAT field.

Because a process normally spends only very brief periods in the TASK_UNINTERRUPTIBLE state, the fact that a signal is delivered only when the process leaves this state is invisible. However, in rare circumstances, a process may remain hung in this state, perhaps as the result of a hardware failure, an NFS problem, or a kernel bug. In such cases, SIGKILL won’t terminate the hung process. If the underlying problem can’t otherwise be resolved, then we must restart the system in order to eliminate the process.
The TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE states are present on most UNIX implementations. Starting with kernel 2.6.25, Linux adds a third state to address the hanging process problem just described:

  • TASK_KILLABLE: This state is like TASK_UNINTERRUPTIBLE, but wakes the process if a fatal signal (i.e., one that would kill the process) is received. By converting relevant parts of the kernel code to use this state, various scenarios where a hung process requires a system restart can be avoided. Instead, the process can be killed by sending it a fatal signal. The first piece of kernel code to be converted to use TASK_KILLABLE was NFS.
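
The S and D letters that ps(1) prints can also be read directly from /proc. The following is a minimal sketch in C, assuming a Linux system with /proc mounted; proc_state and self_state are hypothetical helper names. Since TASK_KILLABLE is defined in the kernel as TASK_UNINTERRUPTIBLE combined with an extra wake-on-kill flag, a process waiting in that state should still show up as D, so the letter alone does not tell you whether SIGKILL will work.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns the one-letter state from /proc/<pid>/stat (Linux only),
 * or '?' if the file cannot be read or parsed. */
char proc_state(pid_t pid)
{
    char path[64];
    char buf[512];
    FILE *fp;
    char *p;

    snprintf(path, sizeof(path), "/proc/%ld/stat", (long)pid);
    fp = fopen(path, "r");
    if (fp == NULL)
        return '?';
    if (fgets(buf, sizeof(buf), fp) == NULL) {
        fclose(fp);
        return '?';
    }
    fclose(fp);

    /* Format: "pid (comm) state ...". comm may itself contain ')',
     * so search for the last closing parenthesis. */
    p = strrchr(buf, ')');
    if (p == NULL || p[1] != ' ' || p[2] == '\0')
        return '?';
    return p[2];
}

/* State of the calling process; normally 'R' because it is running. */
char self_state(void)
{
    return proc_state(getpid());
}
```

Calling proc_state with the PID of a process that is stuck on an unresponsive NFS server would be expected to return 'D' on both old and new kernels.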

Linux kernel version 2.6.25 introduces TASK_KILLABLE, a new state for putting processes to sleep. A process sleeping in this new, killable state behaves as it would in TASK_UNINTERRUPTIBLE, except that it can also respond to fatal signals.

The NFS client code has been changed in several places to use this new process state. Listing 3 shows the differences in the nfs_wait_event macro between Linux kernels 2.6.18 and 2.6.26.

  • Listing 3. Changes to nfs_wait_event due to the introduction of TASK_KILLABLE
Linux Kernel 2.6.18
=====================
#define nfs_wait_event(clnt, wq, condition)
({
  int __retval = 0;
  if (clnt->cl_intr) {
    sigset_t oldmask;
    rpc_clnt_sigmask(clnt, &oldmask);
    __retval = wait_event_interruptible(wq, condition);
    rpc_clnt_sigunmask(clnt, &oldmask);
  } else
    wait_event(wq, condition);
  __retval;
})

Linux Kernel 2.6.26
=====================
#define nfs_wait_event(clnt, wq, condition)
({
  int __retval = wait_event_killable(wq, condition);
  __retval;
})

Listing 4 shows the differences in the definition of the nfs_direct_wait() function between Linux kernels 2.6.18 and 2.6.26.

  • Listing 4. Changes to nfs_direct_wait() due to the introduction of TASK_KILLABLE
Linux Kernel 2.6.18
=====================
static ssize_t nfs_direct_wait(struct nfs_direct_req *dreq)
{
  ssize_t result = -EIOCBQUEUED;

  /* Async requests don't wait here */
  if (dreq->iocb)
    goto out;

  result = wait_for_completion_interruptible(&dreq->completion);

  if (!result)
    result = dreq->error;
  if (!result)
    result = dreq->count;

out:
  kref_put(&dreq->kref, nfs_direct_req_release);
  return (ssize_t) result;
}



Linux Kernel 2.6.26
=====================
static ssize_t nfs_direct_wait(struct nfs_direct_req *dreq)
{
  ssize_t result = -EIOCBQUEUED;
  /* Async requests don't wait here */
  if (dreq->iocb)
    goto out;

  result = wait_for_completion_killable(&dreq->completion);
  if (!result)
    result = dreq->error;
  if (!result)
    result = dreq->count;
out:
  return (ssize_t) result;
}

For details of the NFS client changes that take advantage of this new feature, see the Linux Kernel Mailing List entry listed under "References".

Previously, specifying the NFS mount option intr made it possible to interrupt an NFS client process that was waiting for some event, but in that case all interrupts were permitted, not only a signal meant to kill the process (as with TASK_KILLABLE).

https://www.ibm.com/developerworks/jp/linux/library/l-task-killable/

Or maybe not. A while back, Matthew Wilcox realized that many of these concerns about application bugs do not really apply if the application is about to be killed anyway. It does not matter if the developer thought about the possibility of an interrupted system call if said system call is doomed to never return to user space. So Matthew created a new sleeping state, called TASK_KILLABLE; it behaves like TASK_UNINTERRUPTIBLE with the exception that fatal signals will interrupt the sleep.

...

The TASK_KILLABLE patch was merged for the 2.6.25 kernel, but that does not mean that the unkillable process problem has gone away. The number of places in the kernel (as of 2.6.26-rc8) which are actually using this new state is quite small - as in, one need not worry about running out of fingers while counting them. The NFS client code has been converted, which can only be a welcome development. But there are very few other uses of TASK_KILLABLE, and none at all in device drivers, which is often where processes get wedged.

https://lwn.net/Articles/288056/

NFS: Switch from intr mount option to TASK_KILLABLE

By using the TASK_KILLABLE infrastructure, we can get rid of the 'intr' mount option. We have to use _killable everywhere instead of _interruptible as we get rid of rpc_clnt_sigmask/sigunmask.

https://lkml.org/lkml/2007/12/6/329?cm_mc_uid=48289949268313906794256&cm_mc_sid_50200000=1446010393

*1: The process can be terminated by sending it a signal.

*2: For the correspondence between RHEL releases and kernel versions, see https://access.redhat.com/ja/node/16476