# pgio

pgio is a Java CLI, compiled to a native executable with GraalVM, that captures disk IO
usage stats per process (optionally grouped by process type) and for the whole system.
The tool produces a stream of data that can be stored and interpreted (as CSV by
default, or exported to Prometheus) to find out which part of PostgreSQL uses the
disk most during a period of time.

## Stats collected

pgio runs only on modern Linux kernels (>= 2.6.x). It reads `/proc/vmstat` and
`/proc/<pid>/(cmdline|stat|io)`
(see http://man7.org/linux/man-pages/man5/proc.5.html).

Taking iotop as a reference (see https://unix.stackexchange.com/questions/248197/iotop-showing-1-5-mb-s-of-disk-write-but-all-programs-have-0-00-b-s/248218#248218):

iotop reads per-process information from `/proc/<pid>/io`, in particular:

```
rchar: characters read
     The number of bytes which this task has caused to be
     read from storage.  This is simply the sum of bytes
     which this process passed to read(2) and similar system
     calls.  It includes things such as terminal I/O and is
     unaffected by whether or not actual physical disk I/O
     was required (the read might have been satisfied from
     pagecache).

wchar: characters written
     The number of bytes which this task has caused, or
     shall cause to be written to disk.  Similar caveats
     apply here as with rchar.
```

iotop reads global information from `/proc/vmstat`, in particular:

```
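pgpgin – Number of kilobytes the system has paged in from disk (cumulative counter; a per-second rate is obtained by sampling).
pgpgout – Number of kilobytes the system has paged out to disk (cumulative counter; a per-second rate is obtained by sampling).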
```
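
As a minimal sketch of how these files can be read (not pgio's actual implementation, which is written in Java), the following Python parses the `name: value` lines of `/proc/<pid>/io` and the `name value` lines of `/proc/vmstat`. The function names and the sample values are hypothetical and only illustrate the two file formats.

```python
# Hypothetical sample contents; real values come from /proc/<pid>/io and /proc/vmstat.
SAMPLE_IO = """\
rchar: 323934931
wchar: 323929600
syscr: 632687
syscw: 632675
read_bytes: 0
write_bytes: 323932160
cancelled_write_bytes: 0
"""

SAMPLE_VMSTAT = """\
pgpgin 1048576
pgpgout 2097152
"""


def parse_proc_io(text):
    """Parse the 'name: value' lines of /proc/<pid>/io into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep:  # skip lines without a colon
            stats[name.strip()] = int(value)
    return stats


def parse_vmstat(text):
    """Parse the 'name value' lines of /proc/vmstat into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:  # skip anything that is not a name/value pair
            stats[fields[0]] = int(fields[1])
    return stats


io_stats = parse_proc_io(SAMPLE_IO)
vm_stats = parse_vmstat(SAMPLE_VMSTAT)
print(io_stats["wchar"], vm_stats["pgpgout"])
```

On a Linux system with I/O accounting enabled, the same function could be applied to the contents of `/proc/self/io` to read the counters of the current process.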

Combining per-process info with global info, iotop calculates the percentage of
read/write throughput that each process is consuming.
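
One way such a percentage could be derived is sketched below in Python. It is an illustration, not iotop's or pgio's actual code: it assumes that `wchar` is a cumulative byte counter, that `pgpgout` is a cumulative counter in kilobytes, and that both were sampled at the same two instants. The function name `write_share` and the sample numbers are hypothetical.

```python
def write_share(proc_before, proc_after, sys_before, sys_after):
    """Return the process share (in percent) of system write throughput.

    proc_* are dicts with a cumulative 'wchar' counter in bytes;
    sys_* are dicts with a cumulative 'pgpgout' counter in kilobytes
    (assumption: 1 kilobyte = 1024 bytes).
    """
    proc_bytes = proc_after["wchar"] - proc_before["wchar"]
    sys_bytes = (sys_after["pgpgout"] - sys_before["pgpgout"]) * 1024
    if sys_bytes <= 0:
        # No global write activity was observed in this interval.
        return 0.0
    return 100.0 * proc_bytes / sys_bytes


# Two hypothetical samples taken one interval apart.
share = write_share(
    {"wchar": 0}, {"wchar": 512 * 1024},   # process wrote 512 KiB
    {"pgpgout": 0}, {"pgpgout": 1024},     # system paged out 1024 KiB
)
print(share)
```

Because `wchar` also counts writes that never reach the disk (terminal I/O, for example), a share computed this way can exceed 100.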

The iotop referenced above is a Python rewrite of the original iotop
(https://github.com/analogue/iotop).

The original iotop code (https://github.com/Tomas-M/iotop) uses Taskstats
(https://www.kernel.org/doc/Documentation/accounting/taskstats.txt).
The stats obtained through Taskstats appear to include the following stats from
`/proc/<pid>/io`:


```
read_bytes: bytes read
     Attempt to count the number of bytes which this process
     really did cause to be fetched from the storage layer.
     This is accurate for block-backed filesystems.

write_bytes: bytes written
     Attempt to count the number of bytes which this process
     caused to be sent to the storage layer.

cancelled_write_bytes:
     The big inaccuracy here is truncate.  If a process
     writes 1MB to a file and then deletes the file, it will
     in fact perform no writeout.  But it will have been
     accounted as having caused 1MB of write.  In other
     words: this field represents the number of bytes which
     this process caused to not happen, by truncating
     pagecache.  A task can cause "negative" I/O too.  If this
     task truncates some dirty pagecache, some I/O which
     another task has been accounted for (in its
     write_bytes) will not be happening.
```

Since the global and per-process info can be out of sync, we include the `rchar` and
`wchar` stats (more closely correlated with `pgpgin` and `pgpgout`, but possibly
including IO not related to disk) together with the `read_bytes`, `write_bytes` and
`cancelled_write_bytes` stats (directly related to disk IO but less precise) to
allow a better analysis of the collected stats.

## Build

To build the project as a tar.gz with all Java dependencies:

```
mvn clean package -P assembler
```

To build the project as a tar.gz with an executable built using the GraalVM
`native-image` tool:

```
mvn clean package -P executable
```

## Run

The following command prints the collected stats of all processes every 3 seconds
(it should be run as root):

```
bin/pgio -D <postgresql data dir>
```

## Options

```
Option                      Description                                        
------                      -----------                                        
-D <String>                 Specifies the file system location of the database 
                              configuration files. If this is omitted, the     
                              environment variable PGDATA is used.             
-a, --advanced              Enable advanced options                            
-d, --debug                 Show debug messages                                
-g, --group <String>        Group results using specified group configuration  
                              file (advanced option)                           
-h, --help                  Displays this help message and quit                
-i, --interval <String>     Interval in milliseconds to gather stats (default: 
                              3000)                                            
--no-print-header           Suppress print of CSV header                       
-o, --show-other            Print read/write data not accounted by any listed  
                              process                                          
--ppid <String>             Parent pid of the process to scan (advanced option)
--prometheus-bind <String>  The bind address of prometheus service (advanced   
                              option)                                          
--prometheus-port <String>  The port of prometheus service (advanced option)   
--prometheus-service        Run as a prometheus service (advanced option)      
-s, --show-system           Print read/write data for the whole system         
-v, --version               Show version and quit        
```

### Group configuration file

Specifying this file enables a special mode in which groups with aggregated data
are printed instead of single processes.

The JSON file that defines the groups has the following syntax (regular expressions use
the pattern syntax of [java.util.regex.Pattern](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html)):

```
{
 "<group name>": [ <list of regexp> ],
 ...
}
```

For example, to use it with PostgreSQL, we create a template using the following `postgresql.json`:

```
{
  "archiver": [ ".*archiver.*" ],
  "wal sender": [ ".*wal sender.*" ],
  "bgwriter": [ ".*writer process.*" ],
  "autovacuum": [ ".*autovacuum worker process.*" ],
  "stats": [ ".*stats collector process.*" ],
  "wal writer": [ ".*wal writer process.*" ],
  "checkpoint": [ ".*checkpointer process.*" ],
  "query": [ "postgres: " ]
}
```

```
bin/pgio -D <postgresql data dir> --advanced --group postgresql.json
```

The expected output will include the following groups:

* archiver
* wal sender
* bgwriter
* autovacuum
* stats
* wal writer
* checkpoint
* query

The idea is that you can extend those groups to include other relevant
tools like wal-e or pgbouncer, or a particular user's queries.
Grouping stats by process type reduces the number of stats collected
and provides more understandable metrics.
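
To illustrate the idea of regex-based grouping, here is a minimal Python sketch. It is not pgio's implementation and rests on several assumptions: Python's `re` module stands in for `java.util.regex.Pattern`, a pattern counts as a match if it is found anywhere in the process command line, and the first group (in file order) with a matching pattern wins. The names `load_groups` and `group_of` and the sample command lines are hypothetical.

```python
import json
import re

# A shortened, hypothetical group configuration in the syntax described above.
GROUPS_JSON = """
{
  "checkpoint": [ ".*checkpointer process.*" ],
  "wal writer": [ ".*wal writer process.*" ],
  "query": [ "postgres: " ]
}
"""


def load_groups(text):
    """Return a list of (group name, compiled patterns), keeping file order."""
    raw = json.loads(text)
    return [(name, [re.compile(p) for p in patterns])
            for name, patterns in raw.items()]


def group_of(cmdline, groups):
    """Return the first group with a pattern found in cmdline, else None.

    Assumption: search semantics and first-match-wins ordering.
    """
    for name, patterns in groups:
        if any(p.search(cmdline) for p in patterns):
            return name
    return None


groups = load_groups(GROUPS_JSON)
for cmdline in ("postgres: checkpointer process",
                "postgres: alice mydb [local] idle",
                "sshd: root@pts/0"):
    print(cmdline, "->", group_of(cmdline, groups))
```

Under these assumptions the order of the groups matters: a broad pattern such as `postgres: ` placed first would capture every PostgreSQL process before the more specific groups are tried.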

## Prometheus service

To start pgio as a Prometheus service:

```
bin/pgio -D <postgresql data dir> --advanced --prometheus-service --prometheus-bind 0.0.0.0 --prometheus-port 9090
```