Rebuild a Volume: Rawx

Preparation
Set the incident date
Launch rebuilding
Clear the incident date
Distribute rebuilding
Options

Preparation

Identify the service you want to rebuild. Running openio cluster list rawx lists every rawx service ID along with its volume path.

Verify that the service was automatically scored to zero by running openio cluster list rawx.

If not, lock the score of the targeted rawx service to zero by running openio cluster lock rawx <RAWX_ID>, where RAWX_ID is the network address of the service (ip:port). This prevents the service from receiving upload requests and reduces the number of download requests it serves.
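
Because the lock command takes the service's network address, a small guard against malformed IDs can be handy in scripts. The following is a minimal sketch: the helper names is_rawx_id and lock_cmd are hypothetical, and the check only verifies the ip:port shape for IPv4 addresses, not that the service exists.

```shell
#!/bin/sh
# Hypothetical helper: succeed when $1 looks like an IPv4 ip:port pair.
# It checks only the shape of the string, not whether the service exists.
is_rawx_id() {
    printf '%s\n' "$1" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}:[0-9]{1,5}$'
}

# Print (do not run) the lock command from the text for a valid ID.
lock_cmd() {
    if is_rawx_id "$1"; then
        echo "openio cluster lock rawx $1"
    else
        echo "invalid rawx ID: $1" >&2
        return 1
    fi
}

lock_cmd 127.0.0.1:6023
```

Printing the command instead of running it keeps the sketch safe to try on a machine without a cluster.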

Set the incident date

Set an incident on the target rawx service by running the openio volume admin command:

# openio volume admin incident 127.0.0.1:6023
+----------------+------------+
| Volume         |       Date |
+----------------+------------+
| 127.0.0.1:6023 | 1558448872 |
+----------------+------------+

By default, the incident date is the current timestamp. You can change this incident date by using the parameter --date <TIMESTAMP>.
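
The incident date in the table above is a Unix timestamp (seconds since the epoch). As an illustration, assuming GNU date is available (its -d option is not portable to BSD or macOS date), a human-readable UTC time can be converted to that format and back. The last line only prints a command to show where the value would go; the position of --date on the command line is an assumption.

```shell
#!/bin/sh
# Convert a UTC date to a Unix timestamp (GNU date assumed).
ts=$(date -u -d '2019-05-21 14:27:52' +%s)
echo "$ts"

# Convert the timestamp back to a readable date to check it.
date -u -d "@$ts" '+%Y-%m-%d %H:%M:%S'

# Print, rather than run, the resulting command (option placement assumed).
echo "openio volume admin incident --date $ts 127.0.0.1:6023"
```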

Check that the incident date is correctly set:

# openio volume admin show 127.0.0.1:6023
+---------------+----------------+
| Field         | Value          |
+---------------+----------------+
| volume        | 127.0.0.1:6023 |
| incident_date | 1558448872     |
+---------------+----------------+
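
In a script, the incident date can be pulled out of the table that openio volume admin show prints. The sketch below parses the sample output shown above with awk. The function name incident_date is hypothetical, and the parsing assumes the table layout stays as in this example.

```shell
#!/bin/sh
# Hypothetical helper: read the table on stdin and print the value of
# the incident_date row (prints nothing if the row is absent).
incident_date() {
    awk -F'|' '{
        field = $2; value = $3
        gsub(/[ \t]/, "", field)
        gsub(/[ \t]/, "", value)
        if (field == "incident_date") print value
    }'
}

# Sample output copied from the example above.
incident_date <<'EOF'
+---------------+----------------+
| Field         | Value          |
+---------------+----------------+
| volume        | 127.0.0.1:6023 |
| incident_date | 1558448872     |
+---------------+----------------+
EOF
```

Against a live cluster, the same function would read from a pipe, for example openio volume admin show 127.0.0.1:6023 | incident_date.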

Launch rebuilding

You can now launch the rebuild by using the openio-admin rawx rebuild command:

# openio-admin rawx rebuild 127.0.0.1:6023
OPENIO|013FFAEF2FB62E06F4BC47404F5E990EE1B6BF4A1634922F0A928E929CF1F20B|DF88359A66890500A0B98110945C1DFF|72113572B7B7ACE0F9EEE6746C8A862F2C9E065ED88E02EEEBF7457E73E40FA6 OK None
OPENIO|069448EFB3312C88AC3C234914588B6CF4229F1A9F21F1AB48940524717037B8|B141D49A6689050021097E1F955C8E28|14F637345F98558E7A01DA4250198797A57CA5CE2E2819105C6F49AB3C8FF11A OK None
[...]
OPENIO|FF8D6B041D8BC55CD615EDCFC5706D1EC74CC171DBD7477A00E8B86D4A4181A2|038E2E9B668905007C453D78945CB5B9|D84DB096697171066FE74700423B41C7205E3972B0EBCCF653B70B0AD1B70538 OK None
OPENIO|FF8D6B041D8BC55CD615EDCFC5706D1EC74CC171DBD7477A00E8B86D4A4181A2|038E2E9B668905007C453D78945CB5B9|DD77FAC886C0861D637800C65079E6DDAFB6826BD96BC33797695F76F7BEC205 OK None
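
Each output line carries a status after the identifier, OK in the sample above. For a long run, a quick tally by status can help. This sketch assumes the second whitespace-separated field is always the status, as in the sample, and treats anything other than OK as a failure. The function name summarize_rebuild and the shortened identifiers in the demo are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: count OK and non-OK result lines read from stdin.
# Lines without a "|" in the first field (such as "[...]") are ignored.
summarize_rebuild() {
    awk '
        index($1, "|") > 0 { if ($2 == "OK") ok++; else failed++ }
        END { printf "ok=%d failed=%d\n", ok, failed }
    '
}

# Demo with shortened, made-up identifiers in the same shape as above.
summarize_rebuild <<'EOF'
OPENIO|013F|DF88|7211 OK None
OPENIO|0694|B141|14F6 OK None
[...]
OPENIO|FF8D|038E|D84D OK None
EOF
```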

Clear the incident date

Once the rebuild has completed without errors, you can clear the incident:

# openio volume admin clear --before-incident 127.0.0.1:6023
+----------------+---------+--------------------------------------------+
| Volume         | Success | Message                                    |
+----------------+---------+--------------------------------------------+
| 127.0.0.1:6023 | True    | {'removed': 0, 'repaired': 0, 'errors': 0} |
+----------------+---------+--------------------------------------------+
# openio volume admin show 127.0.0.1:6023
+--------+----------------+
| Field  | Value          |
+--------+----------------+
| volume | 127.0.0.1:6023 |
+--------+----------------+
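
Put together, the steps form a small sequence: set the incident, rebuild, then clear the incident only if the rebuild succeeded. The wrapper below is a sketch under assumptions: the names run and rebuild_volume are hypothetical, and it assumes each command returns a non-zero exit status on failure, which is not stated here and should be verified on your installation. With DRY_RUN=1 it only prints the commands.

```shell
#!/bin/sh
# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical wrapper: incident, rebuild, then clear. Stops at the
# first failing step (assumes non-zero exit status means failure).
rebuild_volume() {
    rawx_id=$1
    run openio volume admin incident "$rawx_id" || return 1
    run openio-admin rawx rebuild "$rawx_id" || return 1
    run openio volume admin clear --before-incident "$rawx_id"
}

# Dry run: prints the three commands without touching the cluster.
DRY_RUN=1
rebuild_volume 127.0.0.1:6023
```

Stopping before the clear step keeps the incident date in place when a rebuild fails, so the rebuild can be launched again.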

Distribute rebuilding

To distribute the rebuild, we use a master/slave model. Broken chunks are sent to beanstalkd tubes, and the slave rebuilders listen on those tubes.

You only need to start the master:

# openio-admin rawx distributed-rebuild 127.0.0.1:6027
OPENIO|01B5B9031119EBCDEEBC5D343174582E58F21FF83DA896D5916C94725BA89165|5C5D7DDF6689050003D9F2534571E1D7|3CCDCA0484F2E019B100311925D5BBB0B5EF9CE2D50F18E20FEFC2681A316EDE OK None
OPENIO|03CA2501B1BEA19AF7C858CD27F57DEFC7EB923C02929CA844C3936E024A5972|9C5AB8DE66890500FC92DA0A4571113B|07B04C2C254761EA95812D6ED2B54D9127BDF7EBB2692ED7FEAED52E4CFB6198 OK None
[...]
OPENIO|FD2B770544D314C0D67A95E77A295E6EC1B60802ECB7D6952A6F7B6E7ED39F89|71D9DADF66890500DDC105344271E3B3|2A616075632F2486A1F84F8C7196FDCD77198F170DBCD3F1D0C54742229F27B9 OK None
OPENIO|FD2B770544D314C0D67A95E77A295E6EC1B60802ECB7D6952A6F7B6E7ED39F89|71D9DADF66890500DDC105344271E3B3|175447C2BDB654AA94681FB9C480760B98D0DC179E4A3430C027415E42465366 OK None

Options

To get more frequent progress reports during a rebuild, change the report interval with the --report-interval option. The default is 3600 seconds; for a report every minute, launch the rebuild with openio-admin rawx rebuild --report-interval 60.

By default, a rebuild uses a single worker; set the number of workers with the --workers option. For example, openio-admin rawx rebuild --workers 42 launches the rebuild with 42 workers.
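
The report interval and worker count can presumably be given on the same command line. As a sketch, the hypothetical function below prints a rebuild command from two shell variables, REPORT_INTERVAL and WORKERS (both names invented here), falling back to the defaults stated above. The volume address is the one used in the earlier examples.

```shell
#!/bin/sh
# Hypothetical helper: print a rebuild command for the rawx given as $1.
# REPORT_INTERVAL and WORKERS are illustrative variable names; their
# fallbacks are the defaults given in the text (3600 seconds, 1 worker).
rebuild_cmd() {
    echo "openio-admin rawx rebuild" \
        "--report-interval ${REPORT_INTERVAL:-3600}" \
        "--workers ${WORKERS:-1}" \
        "$1"
}

REPORT_INTERVAL=60
WORKERS=4
rebuild_cmd 127.0.0.1:6023
```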

Workers are limited in the number of chunks they rebuild per second, 30 by default; the goal is to preserve cluster performance during the rebuild. You can change this value with the --items-per-second option. To remove the limit entirely, use openio-admin rawx rebuild --items-per-second 0.
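
The rate limit gives a rough lower bound on how long a rebuild takes. As a back-of-the-envelope sketch, assuming a single worker (the default) held at the limit for the whole run, the minimum duration is the chunk count divided by the rate. The function name min_rebuild_seconds and the chunk count of 1,000,000 are illustrative only.

```shell
#!/bin/sh
# Hypothetical helper: minimum seconds to rebuild $1 chunks at $2 chunks
# per second (default 30), rounded up. Assumes one worker at the limit.
min_rebuild_seconds() {
    chunks=$1
    rate=${2:-30}
    if [ "$rate" -le 0 ]; then
        echo "rate must be positive (0 means no limit)" >&2
        return 1
    fi
    echo $(( (chunks + rate - 1) / rate ))
}

min_rebuild_seconds 1000000      # 33334 seconds, a little over 9 hours
```

An estimate like this can help decide whether the default limit is acceptable or whether a higher --items-per-second value is worth the extra load on the cluster.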