Channel: StackExchange Replication Questions

Restoring MariaDB Galera cluster fails (IST/SST)

A few days ago one of our MariaDB Galera nodes crashed. I'm having serious problems syncing the offline node with its primary node again. For now, the two nodes are running independently, each with a copy of the same database (it doesn't change much).

I have two nodes, db4 and db5, where db4 is the primary one in the cluster. I've disabled the firewall on both machines to make sure that rsync (port 4444) and ports 4567/4568 are not being blocked over the network.
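
Disabling the firewall on the hosts does not rule out filtering elsewhere on the path, so it may be worth probing the ports directly from the other node. The following is a minimal sketch, assuming bash and coreutils `timeout` are available; the function name `check_port` is made up for illustration. Note that 4444 (rsync SST) and 4568 (IST) typically only listen while a transfer is in progress, so "closed" on those ports is not conclusive on its own.

```shell
#!/usr/bin/env bash
# Probe the TCP ports Galera uses by default on each host given as an argument.
# 4567 = group communication, 4568 = IST, 4444 = SST (rsync).

check_port() {
    local host=$1 port=$2
    # bash's /dev/tcp pseudo-device opens a TCP connection;
    # timeout bounds the wait when packets are silently dropped.
    if timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
        echo "open"
    else
        echo "closed"
    fi
}

# With no arguments this loop does nothing.
for host in "$@"; do
    for port in 4444 4567 4568; do
        printf '%s:%s %s\n' "$host" "$port" "$(check_port "$host" "$port")"
    done
done
```

Run it on db4 with db5's address as the argument, and the other way round, for example `./check_ports.sh 192.168.87.77`.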

First, some configs.

/etc/my.cnf.d/server.cnf on db4:

[galera]
wsrep_on=ON
wsrep_provider=/usr/lib64/galera/libgalera_smm.so
wsrep_provider_options='gcache.size=1G'
wsrep_cluster_address='gcomm://192.168.87.11,192.168.89.1,192.168.87.77,192.168.89.12'
wsrep_cluster_name='testcluster'
wsrep_node_address='192.168.89.1'
wsrep_node_name='db4'
wsrep_sst_method=rsync
wsrep_sst_auth=root:xxxxxxxxxxxx
bind-address=0.0.0.0
binlog_format=row
default_storage_engine=InnoDB
innodb_autoinc_lock_mode=2
innodb_doublewrite=1
query_cache_size=0
slave_run_triggers_for_rbr=1

/etc/my.cnf.d/server.cnf on db5:

[galera]
wsrep_on=ON
wsrep_provider=/usr/lib64/galera/libgalera_smm.so
wsrep_provider_options='gcache.size=1G'
wsrep_cluster_address='gcomm://192.168.87.11,192.168.89.1,192.168.87.77,192.168.89.12'
wsrep_cluster_name='testcluster'
wsrep_node_address='192.168.87.77'
wsrep_node_name='db5'
wsrep_sst_method=rsync
wsrep_sst_auth=root:xxxxxxxxxxxx
bind-address=0.0.0.0
binlog_format=row
default_storage_engine=InnoDB
innodb_autoinc_lock_mode=2
innodb_doublewrite=1
query_cache_size=0
slave_run_triggers_for_rbr=1

mysql_errors.log on joiner (db5):

2016-07-26 15:06:53 140100902053632 [Note] WSREP: Flow-control interval: [23, 23]
2016-07-26 15:06:53 140100902053632 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 12)
2016-07-26 15:06:53 140102141889280 [Note] WSREP: State transfer required:
    Group state: 9087af40-530c-11e6-9cd8-4aa7f3c5b5f8:12
    Local state: 00000000-0000-0000-0000-000000000000:-1
2016-07-26 15:06:53 140102141889280 [Note] WSREP: New cluster view: global state: 9087af40-530c-11e6-9cd8-4aa7f3c5b5f8:12, view# 49: Primary, number of nodes: 2, my index: 0, protocol version 3
2016-07-26 15:06:53 140102141889280 [Warning] WSREP: Gap in state sequence. Need state transfer.
2016-07-26 15:06:53 140100872697600 [Note] WSREP: Running: 'wsrep_sst_rsync --role 'joiner' --address '192.168.87.77' --datadir '/var/lib/mysql/'   --parent '38891'  '' '
2016-07-26 15:06:53 140102141889280 [Note] WSREP: Prepared SST request: rsync|192.168.87.77:4444/rsync_sst
2016-07-26 15:06:53 140102141889280 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2016-07-26 15:06:53 140102141889280 [Note] WSREP: REPL Protocols: 7 (3, 2)
2016-07-26 15:06:53 140100964579072 [Note] WSREP: Service thread queue flushed.
2016-07-26 15:06:53 140102141889280 [Note] WSREP: Assign initial position for certification: 12, protocol version: 3
2016-07-26 15:06:53 140100964579072 [Note] WSREP: Service thread queue flushed.
2016-07-26 15:06:53 140102141889280 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (9087af40-530c-11e6-9cd8-4aa7f3c5b5f8): 1 (Operation not permitted)
     at galera/src/replicator_str.cpp:prepare_for_IST():482. IST will be unavailable.
2016-07-26 15:06:53 140100902053632 [Note] WSREP: Member 0.0 (db5) requested state transfer from '*any*'. Selected 1.0 (db4)(SYNCED) as donor.
2016-07-26 15:06:53 140100902053632 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 12)
2016-07-26 15:06:53 140102141889280 [Note] WSREP: Requesting state transfer: success, donor: 1
2016-07-26 15:06:55 140100910446336 [Note] WSREP: (d1f58cf2, 'tcp://0.0.0.0:4567') turning message relay requesting off

The log above tells me that IST is not possible because of a gap in the state sequence. However, I can't seem to get past the "Local state UUID ... does not match group state UUID ... (Operation not permitted)" warning.
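
The all-zero local state UUID with seqno -1 is what a node reports when it has no saved Galera state, which is the expected situation after the data directory has been wiped; in that case the warning only means IST is skipped and a full SST is requested. The saved state normally lives in `grastate.dat` inside the data directory. Below is a rough sketch for inspecting it, assuming the `/var/lib/mysql/` datadir shown in the log; the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Print the saved Galera state as "uuid:seqno" from a grastate.dat file.
# The default path is an assumption based on the datadir in the logs.

saved_state() {
    local file=${1:-/var/lib/mysql/grastate.dat}
    if [ ! -r "$file" ]; then
        # No state file: the node will report the all-zero UUID with seqno -1.
        echo "00000000-0000-0000-0000-000000000000:-1"
        return 0
    fi
    awk '$1 == "uuid:"  { uuid = $2 }
         $1 == "seqno:" { seqno = $2 }
         END { print uuid ":" seqno }' "$file"
}

# A matching UUID is necessary for IST but not sufficient:
# the donor's gcache must also still hold the missing write-sets.
uuid_matches_group() {
    local group_uuid=$1 state
    state=$(saved_state "${2:-}")
    [ "${state%:*}" = "$group_uuid" ]
}

# Usage: ./check_state.sh <group-uuid> [path-to-grastate.dat]
if [ "$#" -ge 1 ]; then
    echo "saved state: $(saved_state "${2:-}")"
    if uuid_matches_group "$@"; then
        echo "UUID matches the group; IST may be possible"
    else
        echo "UUID differs from the group; expect a full SST"
    fi
fi
```

On db5 this could be called with the group UUID from the log, `9087af40-530c-11e6-9cd8-4aa7f3c5b5f8`.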

mysql_errors.log on donor (db4):

2016-07-26 15:00:07 140089736820480 [Note] WSREP:  cleaning up 908a89a8 (tcp://192.168.83.133:4567)
2016-07-26 15:00:07 140089736820480 [Note] WSREP: (f056ffe0, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers:
2016-07-26 15:00:08 140089736820480 [Note] WSREP: declaring e0dbb547 at tcp://192.168.83.133:4567 stable
2016-07-26 15:00:08 140089736820480 [Note] WSREP: Node f056ffe0 state prim
2016-07-26 15:00:08 140089736820480 [Note] WSREP: view(view_id(PRIM,e0dbb547,44) memb {
    e0dbb547,0
    f056ffe0,0
} joined {
} left {
} partitioned {
})
2016-07-26 15:00:08 140089736820480 [Note] WSREP: save pc into disk
2016-07-26 15:00:08 140089728427776 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 2
2016-07-26 15:00:08 140089728427776 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
2016-07-26 15:00:08 140089728427776 [Note] WSREP: STATE EXCHANGE: sent state msg: e1749860-5330-11e6-a814-ca2523c3fc4c
2016-07-26 15:00:08 140089728427776 [Note] WSREP: STATE EXCHANGE: got state msg: e1749860-5330-11e6-a814-ca2523c3fc4c from 0 (db5)
2016-07-26 15:00:08 140089728427776 [Note] WSREP: STATE EXCHANGE: got state msg: e1749860-5330-11e6-a814-ca2523c3fc4c from 1 (db4)
2016-07-26 15:00:08 140089728427776 [Note] WSREP: Quorum results:
    version    = 3,
    component  = PRIMARY,
    conf_id    = 42,
    members    = 1/2 (joined/total),
    act_id     = 10,
    last_appl. = 0,
    protocols  = 0/7/3 (gcs/repl/appl),
    group UUID = 9087af40-530c-11e6-9cd8-4aa7f3c5b5f8
2016-07-26 15:00:08 140089728427776 [Note] WSREP: Flow-control interval: [23, 23]
2016-07-26 15:00:08 140090968677120 [Note] WSREP: New cluster view: global state: 9087af40-530c-11e6-9cd8-4aa7f3c5b5f8:10, view# 43: Primary, number of nodes: 2, my index: 1, protocol version 3
2016-07-26 15:00:08 140090968677120 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2016-07-26 15:00:08 140090968677120 [Note] WSREP: REPL Protocols: 7 (3, 2)
2016-07-26 15:00:08 140089790953216 [Note] WSREP: Service thread queue flushed.
2016-07-26 15:00:08 140090968677120 [Note] WSREP: Assign initial position for certification: 10, protocol version: 3
2016-07-26 15:00:08 140089790953216 [Note] WSREP: Service thread queue flushed.
2016-07-26 15:00:08 140089728427776 [Note] WSREP: Member 0.0 (db5) requested state transfer from '*any*'. Selected 1.0 (db4)(SYNCED) as donor.
2016-07-26 15:00:08 140089728427776 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 10)
2016-07-26 15:00:08 140090968677120 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2016-07-26 15:00:08 140088805684992 [Note] WSREP: Running: 'wsrep_sst_rsync --role 'donor' --address '192.168.87.77:4444/rsync_sst' --socket '/var/lib/mysql/mysql.sock' --datadir '/var/lib/mysql/'     '' --gtid '9087af40-530c-11e6-9cd8-4aa7f3c5b5f8:10' --gtid-domain-id '0''
2016-07-26 15:00:08 140090968677120 [Note] WSREP: sst_donor_thread signaled with 0
2016-07-26 15:00:08 140088805684992 [Note] WSREP: Flushing tables for SST...
2016-07-26 15:00:08 140088805684992 [Note] WSREP: Provider paused at 9087af40-530c-11e6-9cd8-4aa7f3c5b5f8:10 (134)
2016-07-26 15:00:08 140088805684992 [Note] WSREP: Tables flushed.
2016-07-26 15:00:11 140089736820480 [Note] WSREP: (f056ffe0, 'tcp://0.0.0.0:4567') turning message relay requesting off
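
In both excerpts the last state change is the start of the transfer (PRIMARY -> JOINER on db5, SYNCED -> DONOR/DESYNCED on db4), with no later transition back towards SYNCED. To see where each node stalls over a longer log, the state transitions can be filtered out on their own. This is a small sketch that only relies on the "Shifting A -> B" line format visible above; the function name is made up.

```shell
#!/usr/bin/env bash
# List WSREP state transitions ("Shifting A -> B") from a MariaDB error log.

wsrep_transitions() {
    # Reads log lines on stdin; prints "<date> <time> <from> -> <to>".
    sed -n 's|^\([0-9-]* [0-9:]*\) .*WSREP: Shifting \(.*\) -> \([^ ]*\) (TO:.*|\1 \2 -> \3|p'
}

# Usage: ./transitions.sh <path-to-error-log>
if [ "$#" -ge 1 ]; then
    wsrep_transitions < "$1"
fi
```

Running it against the error log on both nodes and comparing timestamps should show whether the joiner ever leaves JOINER and whether the donor returns from DONOR/DESYNCED.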

I've made sure that the SST user has the right permissions on db4 and db5:

mysql -u root -pxxxxxxxxxxxx -e "GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY 'xxxxxxxxxxxx' WITH GRANT OPTION;"

Also, I can see the rsync processes starting on db4:

mysql     98081  59227  0 15:04 ?        00:00:00 /bin/bash -ue /usr//bin/wsrep_sst_rsync --role donor --address 192.168.87.77:4444/rsync_sst --socket /var/lib/mysql/mysql.sock --datadir /var/lib/mysql/  --gtid 9087af40-530c-11e6-9cd8-4aa7f3c5b5f8:12 --gtid-domain-id 0
mysql     98110  98081  0 15:04 ?        00:00:00 rsync --owner --group --perms --links --specials --ignore-times --inplace --dirs --delete --quiet --whole-file -f - /lost+found -f - /.fseventsd -f - /.Trashes -f + /wsrep_sst_binlog.tar -f + /ib_lru_dump -f + /ibdata* -f + /*/ -f - /* /var/lib/mysql// rsync://192.168.87.77:4444/rsync_sst

I've even tried completely removing /var/lib/mysql on db5, importing the database dump, and resyncing:

systemctl set-environment MYSQLD_OPTS="--wsrep_cluster_address=gcomm://192.168.87.11,192.168.89.1,192.168.87.77,192.168.89.12"
systemctl start mariadb.service
systemctl unset-environment MYSQLD_OPTS
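
After a start like the one above, whether the node actually rejoined can be read from the wsrep status variables rather than from the service state. A hedged sketch follows: the function name and the `--live` switch are hypothetical, and it assumes the `mysql` client can log in with credentials from its defaults file.

```shell
#!/usr/bin/env bash
# Decide whether a node is fully synced from "name<TAB>value" lines, as printed by:
#   mysql -N -B -e "SHOW GLOBAL STATUS LIKE 'wsrep_%'"

node_is_synced() {
    awk -F'\t' '
        $1 == "wsrep_cluster_status"      { cluster = $2 }
        $1 == "wsrep_local_state_comment" { state = $2 }
        $1 == "wsrep_ready"               { ready = $2 }
        END { exit !(cluster == "Primary" && state == "Synced" && ready == "ON") }'
}

# --live is a switch of this sketch only; it queries the local server.
if [ "${1:-}" = "--live" ]; then
    if mysql -N -B -e "SHOW GLOBAL STATUS LIKE 'wsrep_%'" | node_is_synced; then
        echo "node is synced"
    else
        echo "node is not synced"
    fi
fi
```

A node stuck in SST would be expected to report a `wsrep_local_state_comment` other than `Synced`, which this check treats as not synced.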
