A look at the OS layer beneath HDFS

Big Data Forensics: Learning Hadoop Investigations

  • HDFS collections through the host operating system

Targeted collection from a Hadoop client
The third method for collecting HDFS data from the host operating system is a targeted collection. The HDFS data is stored in defined locations within the host operating system. This data can be collected on a per-node basis through logical file copies. Every node needs to be collected to ensure the HDFS files can be reconstructed in the analysis phase.

Targeted collections and imaging collections follow the same overall process: in both, the investigator collects the data, documents the process, and computes MD5/SHA-1 hash values. The differences lie in what is copied and hashed. An imaging collection acquires and hashes entire disk volumes, whereas a targeted collection copies individual files and directories. As a result, in a targeted collection the MD5/SHA-1 values are computed on the files rather than the volumes, the collection requires multiple copies rather than a single image file, and certain metadata is not preserved. Investigators also typically run targeted collections from scripts rather than typing commands manually at runtime.

The first step for performing the targeted collection is to identify the location where the host operating system stores the HDFS files. For Linux, Unix, OS X, and other Unix variants, this can be found in the hdfs-site.xml file. While typically stored in the /etc/hadoop directory, it can be stored in other locations, so the investigator first needs to find this location before beginning. In Windows, this information is typically located in the Windows Hadoop installation directory c:\hadoop. To find the directory location from the command line, run the following command:
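As a complement to the command-line lookup, the storage directories can also be read directly from hdfs-site.xml. The sketch below is a minimal illustration under stated assumptions: the property name `dfs.datanode.data.dir` is the one used by Hadoop 2 and later (older releases used `dfs.data.dir`), the sample XML and the second path `/mnt1/hdfs` are made up for the example, and the helper name `datanode_dirs` is hypothetical. Storage-type prefixes such as `[DISK]` are not handled.

```python
import xml.etree.ElementTree as ET

# Illustrative configuration only; a real file lives wherever the
# cluster keeps hdfs-site.xml (often /etc/hadoop).
SAMPLE = """<configuration>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///mnt/hdfs,/mnt1/hdfs</value>
  </property>
</configuration>
"""


def datanode_dirs(xml_text, key="dfs.datanode.data.dir"):
    """Return the local directories listed under the given property."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if (prop.findtext("name") or "").strip() != key:
            continue
        dirs = []
        # The value is a comma-separated list of paths or file:// URIs.
        for entry in (prop.findtext("value") or "").split(","):
            entry = entry.strip()
            if not entry:
                continue
            if entry.startswith("file://"):
                entry = entry[len("file://"):]
            dirs.append(entry)
        return dirs
    return []


if __name__ == "__main__":
    print(datanode_dirs(SAMPLE))
```

If the property is absent, the function returns an empty list, which signals that the cluster is relying on a default location or on a different configuration file.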


The investigator should collect the entire DataNode tree structure. The structure consists of the following directories and files:

  • BP-<integer>-<IP address>-<creation time>: This directory is the block pool that collects the blocks of data belonging to that DataNode.
  • finalized/rbw: The actual data blocks are stored in these directories. The finalized directory stores the blocks that have been completely written to disk. The name rbw stands for "replica being written"; this directory stores the blocks that are currently being written to HDFS.
  • VERSION: This text file stores property information. Each DataNode has a DataNode-wide VERSION file and also a VERSION file for each block pool.
  • blk_<block ID>: The binary data block content files.
  • blk_<block ID>_<generation stamp>.meta: The binary data block metadata files.
  • dncp_block_verification: This file tracks the times at which each block was last verified via checksum.
  • in_use.lock: This is a lock file used by the DataNode process to prevent multiple DataNode processes from modifying the directory.
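To make the naming scheme in the list above concrete, the following sketch sorts file names from a DataNode directory into the categories just described and reports block IDs that lack either a content file or a metadata file. It is an illustration only: the function names and category labels are hypothetical, and the regular expressions assume the `blk_<block ID>` and `blk_<block ID>_<generation stamp>.meta` patterns.

```python
import re

# Patterns assumed from the naming scheme described above.
BLOCK_RE = re.compile(r"^blk_(-?\d+)$")
META_RE = re.compile(r"^blk_(-?\d+)_(\d+)\.meta$")


def classify(name):
    """Return (kind, block_id, generation_stamp) for one file name."""
    match = META_RE.match(name)
    if match:
        return ("meta", int(match.group(1)), int(match.group(2)))
    match = BLOCK_RE.match(name)
    if match:
        return ("block", int(match.group(1)), None)
    if name == "VERSION":
        return ("version", None, None)
    if name == "in_use.lock":
        return ("lock", None, None)
    if name.startswith("dncp_block_verification"):
        return ("verification_log", None, None)
    return ("other", None, None)


def unpaired_block_ids(names):
    """Return block IDs that have a content file or a .meta file, but not both."""
    blocks, metas = set(), set()
    for name in names:
        kind, block_id, _stamp = classify(name)
        if kind == "block":
            blocks.add(block_id)
        elif kind == "meta":
            metas.add(block_id)
    return sorted(blocks ^ metas)


if __name__ == "__main__":
    listing = ["blk_1073743618", "blk_1073743618_2794.meta", "blk_1073743624"]
    print(unpaired_block_ids(listing))
```

A check like this can be run over the copied tree after collection to confirm that no block was copied without its metadata file.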





[root@ip-***-**-*-133 hdfs]# tree -d /mnt/hdfs
└── current
    └── BP-747367826-
        ├── current
        │   ├── finalized
        │   │   └── subdir0
        │   │       ├── subdir0
        │   │       ├── subdir1
        │   │       ├── subdir3
        │   │       ├── subdir4
        │   │       ├── subdir5
        │   │       ├── subdir6
        │   │       ├── subdir7
        │   │       └── subdir8
        │   └── rbw
        └── tmp

15 directories
  • Check the files
[root@ip-***-**-*-133 subdir7]# pwd
[root@ip-***-**-*-133 subdir7]# ls -lh|head
total 15G
-rw-r--r-- 1 hdfs hdfs  128M Sep 23 16:23 blk_1073743618
-rw-r--r-- 1 hdfs hdfs  1.1M Sep 23 16:23 blk_1073743618_2794.meta
-rw-r--r-- 1 hdfs hdfs  128M Sep 23 16:23 blk_1073743619
-rw-r--r-- 1 hdfs hdfs  1.1M Sep 23 16:23 blk_1073743619_2795.meta
-rw-r--r-- 1 hdfs hdfs  128M Sep 23 16:23 blk_1073743620
-rw-r--r-- 1 hdfs hdfs  1.1M Sep 23 16:23 blk_1073743620_2796.meta
-rw-r--r-- 1 hdfs hdfs  128M Sep 23 16:23 blk_1073743622
-rw-r--r-- 1 hdfs hdfs  1.1M Sep 23 16:23 blk_1073743622_2798.meta
-rw-r--r-- 1 hdfs hdfs  128M Sep 23 16:24 blk_1073743624
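The roughly 1.1M size of each .meta file next to a 128M block can be checked with simple arithmetic. The sketch below assumes the usual HDFS defaults, none of which are stated in this listing: one 4-byte CRC per 512 bytes of block data and a 7-byte header at the start of the metadata file. The function name `expected_meta_size` is hypothetical.

```python
def expected_meta_size(block_bytes, bytes_per_checksum=512,
                       checksum_bytes=4, header_bytes=7):
    """Estimate the .meta file size for a block of the given length.

    The defaults are assumptions (512-byte checksum chunks, 4-byte CRC,
    7-byte header) and may differ on a cluster with custom settings.
    """
    # Ceiling division: a final partial chunk still gets its own checksum.
    chunks = -(-block_bytes // bytes_per_checksum)
    return header_bytes + chunks * checksum_bytes


if __name__ == "__main__":
    size = expected_meta_size(128 * 1024 * 1024)
    print(size)  # 1048583
```

Under those assumptions a full 128 MiB block yields 1,048,583 bytes of metadata, just over 1 MiB, which is consistent with the 1.1M shown here because GNU `ls -h` rounds sizes up.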
  • /mnt is large because it is where the HDFS data is stored.
[root@ip-***-**-*-133 hdfs]# df
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs         16G   76K   16G   1% /dev
tmpfs            16G     0   16G   0% /dev/shm
/dev/xvda1       99G  3.7G   95G   4% /
/dev/xvdb1      5.0G   37M  5.0G   1% /emr
/dev/xvdb2      495G   43G  452G   9% /mnt ★
[root@ip-***-**-*-133 hdfs]# mount
proc on /proc type proc (rw,relatime)
sysfs on /sys type sysfs (rw,relatime)
devtmpfs on /dev type devtmpfs (rw,relatime,size=16460148k,nr_inodes=4115037,mode=755)
devpts on /dev/pts type devpts (rw,relatime,gid=5,mode=620,ptmxmode=000)
tmpfs on /dev/shm type tmpfs (rw,relatime)
/dev/xvda1 on / type ext4 (rw,noatime,data=ordered)
devpts on /dev/pts type devpts (rw,relatime,gid=5,mode=620,ptmxmode=000)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw,relatime)
/dev/xvdb1 on /emr type xfs (rw,relatime,attr2,inode64,noquota)
/dev/xvdb2 on /mnt type xfs★ (rw,relatime,attr2,inode64,noquota)
cgroup on /cgroup/blkio type cgroup (rw,relatime,blkio)
cgroup on /cgroup/cpu type cgroup (rw,relatime,cpu)
cgroup on /cgroup/cpuacct type cgroup (rw,relatime,cpuacct)
cgroup on /cgroup/cpuset type cgroup (rw,relatime,cpuset)
cgroup on /cgroup/devices type cgroup (rw,relatime,devices)
cgroup on /cgroup/freezer type cgroup (rw,relatime,freezer)
cgroup on /cgroup/hugetlb type cgroup (rw,relatime,hugetlb)
cgroup on /cgroup/memory type cgroup (rw,relatime,memory)
cgroup on /cgroup/perf_event type cgroup (rw,relatime,perf_event)