Disk health evaluation using IOSTAT

Published on: July 23, 2014 by George K.

Scenario:

iostat (input/output statistics) is a utility that reports Central Processing Unit (CPU) statistics and input/output statistics for devices and partitions. It can be used for a Linux disk health check, and it is handy for identifying which partition is being heavily used and whether any hardware issues exist. Explaining every parameter would make this article confusing, so I am focusing on the values that need to be monitored during a suspected server issue.

A sample iostat output looks like this:

[~]# iostat -x

Linux 2.6.18-408.el5.lve0.8.58 server1.ssages.com Monday 04 June 2012

avg-cpu: %user %nice %system %iowait %steal %idle

28.01 0.33 4.18 7.94 0.00 59.54

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util

sda 234.36 463.39 420.55 217.85 8883.26 5451.91 22.45 1.70 2.66 1.12 71.63

sda1 0.02 0.00 0.00 0.00 0.03 0.00 41.66 0.00 6.57 5.06 0.00

sda2 19.75 65.49 223.08 79.80 3372.78 1163.38 14.98 1.82 6.02 1.70 51.63

sda3 6.77 18.37 5.94 13.44 251.64 254.52 26.11 0.58 29.76 2.30 4.45

sda8 206.75 170.56 190.00 78.20 5232.34 1990.57 26.93 1.98 7.39 2.12 56.99

Explaining every parameter here would obscure the point, so let us focus on the core values.

CPU Statistics

As you can see from the result, the avg-cpu section summarizes how the CPUs spent their time, including the time spent waiting on I/O. You can get the CPU report alone using the command iostat -c.

The most crucial values in the avg-cpu output are %iowait and %idle.

%iowait is the percentage of time the CPUs were idle while the system had an outstanding disk I/O request. A high value here indicates that the CPUs are frequently waiting for the disks to complete requests, so requests are queuing at the storage layer and performance is degraded.

The %idle parameter is the percentage of time the CPUs were idle with no outstanding disk I/O request. A high value here indicates that the CPU is not busy. In the above example, the CPUs spent 7.94% of their time waiting on disk I/O and were idle for 59.54% of the time.
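If you want to pull these two values out of a captured report automatically, a minimal Python sketch could look like the following. It assumes the avg-cpu layout shown in the sample above (a header line followed by one line of values); the function name parse_avg_cpu is only illustrative and not part of iostat.

```python
# Sample avg-cpu section copied from the iostat -x output above.
SAMPLE = """\
avg-cpu: %user %nice %system %iowait %steal %idle

28.01 0.33 4.18 7.94 0.00 59.54
"""


def parse_avg_cpu(text):
    """Return a dict mapping each avg-cpu column name to its float value."""
    # Drop blank lines so the values line directly follows the header line.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    for i, ln in enumerate(lines):
        if ln.startswith("avg-cpu:"):
            names = ln.split()[1:]  # skip the "avg-cpu:" label
            values = [float(v) for v in lines[i + 1].split()]
            return dict(zip(names, values))
    raise ValueError("no avg-cpu section found")


cpu = parse_avg_cpu(SAMPLE)
print(f"iowait={cpu['%iowait']}% idle={cpu['%idle']}%")
```

The sketch only reads text, so it can be fed a saved report as well as live output.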

Disk statistics

The second portion of the output describes the I/O activity of the disks and partitions attached to the server. This can be used for a Linux disk health check. The disk details alone can be obtained by using the switch -d:

# iostat -xd

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util

sda 22.72 138.23 24.66 12.88 2227.01 1208.87 91.53 0.19 5.03 3.08 11.55

sda1 0.01 0.00 0.00 0.00 0.01 0.00 10.58 0.00 7.10 6.49 0.00

sda2 0.00 0.00 0.00 0.00 0.01 0.00 47.20 0.00 3.98 3.68 0.00

sda3 0.01 0.13 0.00 0.17 0.06 2.36 14.17 15.02 0.77 5857.96 99.98

sda4 0.00 0.00 0.00 0.00 0.00 0.00 2.00 1.00 0.00 89353908.00 99.99

sda5 22.71 138.10 24.66 12.71 2226.93 1206.51 91.88 0.19 5.05 3.09 11.55

The most crucial parameters in this result are svctm and %util. Let us see what they mean and why they matter.

svctm

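The average service time, in milliseconds, for I/O requests issued to the device, that is, the time the device itself takes to fulfil a request. The time a request takes from beginning to end, including queue time, is reported separately in the await column. Note that the iostat man page warns that svctm should not be trusted, and newer sysstat releases no longer report it.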

%util

The percentage of elapsed time during which I/O requests were issued to the device. As the name implies, this shows the device utilization: when the value approaches 100%, the device is saturated.
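The two values are related. As a rough cross-check, and assuming svctm is the average time the device is busy per request, %util can be estimated from the request rate and svctm. The short Python sketch below applies this to the sda row of the first sample (r/s 420.55, w/s 217.85, svctm 1.12); the function name estimated_util is hypothetical.

```python
def estimated_util(r_per_s, w_per_s, svctm_ms):
    """Estimate %util from requests per second and service time in ms."""
    # Milliseconds the device is busy during each second of wall-clock time.
    busy_ms_per_second = (r_per_s + w_per_s) * svctm_ms
    # Convert busy milliseconds per 1000 ms into a percentage.
    return busy_ms_per_second / 1000.0 * 100.0


# Values taken from the sda row of the first iostat -x sample.
est = estimated_util(420.55, 217.85, 1.12)
print(f"estimated %util = {est:.2f}, reported %util = 71.63")
```

The estimate lands close to the reported figure; the small difference comes from rounding in the printed columns.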

As you can see from the above example, svctm and %util for sda3 and sda4 stay at a very high level, indicating heavy activity on these partitions. Now let us see which file systems are mounted on these partitions.
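Spotting such rows by eye gets tedious on servers with many partitions. The following Python sketch flags rows whose %util is at or above a threshold, using the sample report above as input. The 90% threshold and the function names parse_devices and saturated are assumptions chosen for illustration, not values recommended by iostat.

```python
# Device section copied from the iostat -xd sample above.
SAMPLE = """\
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 22.72 138.23 24.66 12.88 2227.01 1208.87 91.53 0.19 5.03 3.08 11.55
sda1 0.01 0.00 0.00 0.00 0.01 0.00 10.58 0.00 7.10 6.49 0.00
sda2 0.00 0.00 0.00 0.00 0.01 0.00 47.20 0.00 3.98 3.68 0.00
sda3 0.01 0.13 0.00 0.17 0.06 2.36 14.17 15.02 0.77 5857.96 99.98
sda4 0.00 0.00 0.00 0.00 0.00 0.00 2.00 1.00 0.00 89353908.00 99.99
sda5 22.71 138.10 24.66 12.71 2226.93 1206.51 91.88 0.19 5.05 3.09 11.55
"""


def parse_devices(text):
    """Parse the device table into a list of dicts keyed by column name."""
    header = None
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0].startswith("Device"):
            header = fields[1:]  # column names after the "Device:" label
            continue
        # Accept only rows with a device name plus one value per column.
        if header and len(fields) == len(header) + 1:
            try:
                values = [float(v) for v in fields[1:]]
            except ValueError:
                continue  # not a numeric data row
            row = dict(zip(header, values))
            row["device"] = fields[0]
            rows.append(row)
    return rows


def saturated(rows, util_threshold=90.0):
    """Return the names of devices whose %util meets the threshold."""
    return [r["device"] for r in rows if r["%util"] >= util_threshold]


print(saturated(parse_devices(SAMPLE)))
```

Because the column names are read from the header line, the same sketch should tolerate reports whose columns differ from this sample, as long as a %util column is present.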

# df -h

Filesystem Size Used Avail Use% Mounted on

/dev/sda5 3.9G 1008M 2.7G 27% /

/dev/sda8 948G 576G 323G 65% /home

/dev/sda6 2.0G 546M 1.4G 30% /tmp

/dev/sda3 30G 18G 10G 64% /usr

/dev/sda4 97G 57G 36G 62% /var

/dev/sda1 198M 25M 164M 13% /boot

tmpfs 12G 12K 12G 1% /dev/shm

Here the active partitions are /var and /usr. Now check which processes are actively using these partitions. In this case, it was MySQL abuse, and the MySQL data directory was located under /var. Stopping the attack restored the I/O activity to normal.
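The lookup from partition name to mount point can also be scripted. The Python sketch below parses the df -h output shown above and maps each /dev partition to its mount point. The function name mount_points is hypothetical, and the sketch assumes each file system fits on a single line of df output.

```python
# df -h output copied from the example above.
DF_SAMPLE = """\
Filesystem Size Used Avail Use% Mounted on
/dev/sda5 3.9G 1008M 2.7G 27% /
/dev/sda8 948G 576G 323G 65% /home
/dev/sda6 2.0G 546M 1.4G 30% /tmp
/dev/sda3 30G 18G 10G 64% /usr
/dev/sda4 97G 57G 36G 62% /var
/dev/sda1 198M 25M 164M 13% /boot
tmpfs 12G 12K 12G 1% /dev/shm
"""


def mount_points(df_text):
    """Map partition names such as 'sda3' to their mount points."""
    mounts = {}
    for line in df_text.splitlines()[1:]:  # skip the header line
        fields = line.split()
        if len(fields) < 6:
            continue
        device = fields[0]
        if device.startswith("/dev/"):
            # Rejoin the tail in case the mount point contains spaces.
            mounts[device[len("/dev/"):]] = " ".join(fields[5:])
    return mounts


mounts = mount_points(DF_SAMPLE)
for name in ("sda3", "sda4"):
    print(f"{name} is mounted on {mounts[name]}")
```

Pseudo file systems such as tmpfs are skipped because they do not correspond to a row in the iostat report.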

[~]# iostat -xm 2

avg-cpu: %user %nice %system %iowait %steal %idle

48.56 0.28 6.06 1.20 0.00 43.91

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util

sda 57.46 281.08 61.40 173.13 1898.67 3689.77 23.83 0.36 4.28 0.88 20.74

sda1 0.01 0.00 0.00 0.00 0.03 0.00 24.49 0.00 3.98 3.89 0.00

sda2 6.33 62.28 28.13 80.88 425.01 1145.95 14.41 0.19 7.64 1.06 11.59

sda3 3.79 10.71 2.54 9.77 122.00 163.85 23.21 0.13 10.45 0.87 1.07

sda4 0.00 0.00 0.00 0.00 0.00 0.00 2.00 0.00 9.19 9.19 0.00

sda5 0.32 1.29 0.11 0.58 4.12 15.01 27.51 0.00 4.65 2.41 0.17

sda6 0.36 94.54 0.16 30.63 4.92 1001.49 32.68 0.37 11.93 0.36 1.12

sda7 0.00 0.00 0.00 0.00 0.02 0.02 48.97 0.00 4.94 4.28 0.00

sda8 46.66 112.26 30.44 51.27 1342.56 1363.44 33.12 0.31 3.81 1.52 12.41

For a continuous Linux disk health check, specify the interval in seconds as an argument. The following example gives you output at 2-second intervals:

# iostat -xdm 2
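For unattended monitoring, the interval report can be fed to a small script that prints only the busy devices. The following Python sketch is one possible approach, not a definitive implementation: busy_devices and watch are hypothetical names, the 90% threshold is an assumption, and watch expects iostat from the sysstat package to be on the PATH. Keep in mind that the first report iostat prints covers the time since boot rather than the last interval.

```python
import subprocess


def busy_devices(lines, util_threshold=90.0):
    """Yield (device, util) for each row at or above the threshold."""
    util_index = None
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0].startswith("Device"):
            # Locate the %util column from the header of each report.
            util_index = fields.index("%util")
            continue
        if util_index is None or len(fields) <= util_index:
            continue  # banner line or a row that is too short
        try:
            util = float(fields[util_index])
        except ValueError:
            continue
        if util >= util_threshold:
            yield fields[0], util


def watch(interval_seconds=2):
    """Run iostat -xd at the given interval and print busy devices."""
    cmd = ["iostat", "-xd", str(interval_seconds)]
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc:
        for device, util in busy_devices(proc.stdout):
            print(f"{device}: %util={util:.2f}")


# Offline demonstration using two rows from the earlier sample report.
DEMO_LINES = [
    "Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util",
    "sda 22.72 138.23 24.66 12.88 2227.01 1208.87 91.53 0.19 5.03 3.08 11.55",
    "sda3 0.01 0.13 0.00 0.17 0.06 2.36 14.17 15.02 0.77 5857.96 99.98",
]
print(list(busy_devices(DEMO_LINES)))
```

Calling watch() starts the live loop; it runs until iostat is interrupted, for example with Ctrl+C.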

Category : Linux, Troubleshooting
