repmgr安装测试自动故障切换-postgresql12

时间：2021-08-21 20:02:49 收藏：0 阅读：23

安装

环境准备

环境介绍

两台PG12 （host1 host2）host1为原主库，host2为原备库。主从异步流复制。

下载最新repmgr包：repmgr-5.2.1.tar.gz （https://repmgr.org/ ）

./configure
sudo make && sudo make install

配置

主库配置

listen_addresses = ‘*‘
shared_preload_libraries = ‘repmgr‘
wal_log_hints = ‘on‘
wal_level = replica
hot_standby = on

local repmgr repmgr trust
host repmgr repmgr 127.0.0.1/24 trust
host repmgr repmgr 192.168.73.140/24 trust
host repmgr repmgr 192.168.73.141/24 trust
host postgres repmgr 192.168.73.140/24 trust

local replication repmgr trust
host replication repmgr 127.0.0.1/24 trust
host replication repmgr 192.168.73.140/24 trust
host replication repmgr 192.168.73.141/24 trust

[postgres@pg1 opt]$ pg_config --sysconfdir
/opt/pgsql/etc

sudo mkdir -pv /opt/pgsql/etc/

touch   /opt/pgsql/etc/repmgr.conf
cat >> /opt/pgsql/etc/repmgr.conf << EOF
node_id=2
node_name=‘pg1‘
conninfo=‘host=192.168.73.140 port=5432 dbname=repmgr user=repmgr connect_timeout=2‘
data_directory =‘/opt/pgsql/data‘
pg_bindir=‘/opt/pgsql/bin‘
config_directory=‘/opt/pgsql/data‘
log_level=INFO
log_facility=STDERR
log_file=‘/opt/pgsql/etc/repmgr.log‘
EOF

注册主节点

repmgr primary register
repmgr service status
psql -h 192.168.73.140 -U repmgr -d repmgr -c ‘SELECT * FROM repmgr.nodes;‘

开启主库守护进程repmgrd
```
repmgrd -d
```
从库配置

sudo mkdir -pv /opt/pgsql/etc/
touch   /opt/pgsql/etc/repmgr.conf
cat >> /opt/pgsql/etc/repmgr.conf << EOF
node_id=2
node_name=‘pg2‘
conninfo=‘host=192.168.73.141 port=5432 dbname=repmgr user=repmgr connect_timeout=2‘
data_directory =‘/opt/pgsql/data‘
pg_bindir=‘/opt/pgsql/bin‘
config_directory=‘/opt/pgsql/data‘
log_level=INFO
log_facility=STDERR
log_file=‘/opt/pgsql/etc/repmgr.log‘
EOF

克隆主库

[postgres@pg2 etc]$ repmgr -h 192.168.73.140 -U repmgr -d repmgr -f /opt/pgsql/etc/repmgr.conf standby clone -F
WARNING: following problems with command line parameters detected:
  "config_directory" set in repmgr.conf, but --copy-external-config-files not provided
NOTICE: destination directory "/opt/pgsql/data" provided
INFO: connecting to source node
DETAIL: connection string is: host=192.168.73.140 user=repmgr dbname=repmgr
DETAIL: current installation size is 884 MB
INFO: replication slot usage not requested;  no replication slot will be set up for this standby
NOTICE: checking for available walsenders on the source node (2 required)
NOTICE: checking replication connections can be made to the source server (2 required)
WARNING: directory "/opt/pgsql/data" exists but is not empty
NOTICE: -F/--force provided - deleting existing data directory "/opt/pgsql/data"
NOTICE: starting backup (using pg_basebackup)...
HINT: this may take some time; consider using the -c/--fast-checkpoint option
INFO: executing:
  /opt/pgsql/bin/pg_basebackup -l "repmgr base backup"  -D /opt/pgsql/data -h 192.168.73.140 -p 5432 -U repmgr -X stream 
NOTICE: standby clone (using pg_basebackup) complete
NOTICE: you can now start your PostgreSQL server
HINT: for example: pg_ctl -D /opt/pgsql/data start
HINT: after starting the server, you need to register this standby with "repmgr standby register"

6.启动备库并注册

pg_ctl -D /opt/pgsql/data start

repmgr standby register

7.启动repmgr守护进程

repmgrd -d

8.查看从库repmgr状态

repmgr service status

[postgres@pg2 opt]$  repmgr service status
ID | Name | Role    | Status    | Upstream | repmgrd | PID  | Paused? | Upstream last seen
----+------+---------+-----------+----------+---------+------+---------+--------------------
1  | pg1  | primary | * running |          | running | 9603 | no      | n/a                
2  | pg2  | standby |   running | pg1      | running | 2551 | no      | 0 second(s) ago 

[postgres@pg2 opt]$ repmgr cluster show
ID | Name | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                                        
----+------+---------+-----------+----------+----------+----------+----------+-------------------------------------------------------------------------------------------
1  | pg1  | primary | * running |          | default  | 100      | 7        | host=192.168.73.140 port=5432 dbname=repmgr user=repmgr password=repmgr connect_timeout=2
2  | pg2  | standby |   running | pg1      | default  | 100      | 7        | host=192.168.73.141 port=5432 dbname=repmgr user=repmgr password=repmgr connect_timeout=2

使用repmgr执行switch over切换

测试切换repmgr  standby switchover --siblings-follow --dry-run
正式进行主备切换 repmgr -f repmgr.conf standby switchover
主备切换后主库需要重新注册 repmgr -f repmgr.conf standby register --force
重新注册后必须重启备库

使用repmgr执行fail over切换

1.手动关闭主库模拟数据库异常: pg_ctl -m fast stop

2.备库强制提升为主库：
postgres@HWFBSB6:~> repmgr -f repmgr.conf standby promote

4.启动原主库：
postgres@HWFBSB5:~> pg_ctl -D /pgdata/pg113 -l /pgdata/pg113/server.log start
waiting for server to start… done
server started

6.可以看到从主备库查看的状态都带有警告，这时如果要把剔除掉的原主库作为备库状态重新加入，需要repmgr node rejoin操作

? 首先关闭原主库,使用repmgr node rejoin重新添加原主库:

? 预命令

repmgr -f repmgr.conf node rejoin -d ‘host=197.0.1.105 user=repmgr dbname=repmgr’ --force-rewind --dry-run --verbose

正式执行repmgr node rejoin重新添加原主库命令：
repmgr -f repmgr.conf node rejoin -d ‘host=197.0.1.105 user=repmgr dbname=repmgr’ --force-rewind --verbose

一主一备利用repmgrd实现自动fail over

cat >> repmgr.conf << EOF
monitoring_history=true （启用监控参数）
monitor_interval_secs=5（定义监视数据间隔写入时间参数）
reconnect_attempts=10（故障转移之前，尝试重新连接主库次数（默认为6）参数）
reconnect_interval=5（每间隔5s尝试重新连接一次参数）
EOF

如下参数为自动故障转移所需配置的重要参数：

failover=automatic
promote_command=’/pgsql/pg113/bin/repmgr standby promote -f /home/postgres/repmgr.conf --log-to-file’
follow_command=’/pgsql/pg113/bin/repmgr standby follow -f /home/postgres/repmgr.conf --log-to-file --upstream-node-id=%n’

*附加参数：
replication_lag_critical（如果备用数据库的复制延迟（已秒为单位）超过此值，则将终止切换）

备：参数配置完成后需要重启主备库使其生效:

repmgr node service --action=restart

启动主备库的repmgrd：
repmgrd --pid-file /tmp/repmgrd.pid

练习环境自测

关闭主库

 repmgr cluster show
 ID | Name | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                                        
----+------+---------+-----------+----------+----------+----------+----------+-------------------------------------------------------------------------------------------
 1  | pg1  | standby |   running | pg2      | default  | 100      | 9        | host=192.168.73.140 port=5432 dbname=repmgr user=repmgr password=repmgr connect_timeout=2
 2  | pg2  | primary | * running |          | default  | 100      | 9        | host=192.168.73.141 port=5432 dbname=repmgr user=repmgr password=repmgr connect_timeout=2
[postgres@pg2 data]$ 
[postgres@pg2 data]$ pg_ctl stop

备库查看状态

[postgres@pg1 data]$ repmgr cluster show
 ID | Name | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                                        
----+------+---------+-----------+----------+----------+----------+----------+-------------------------------------------------------------------------------------------
 1  | pg1  | primary | * running |          | default  | 100      | 10       | host=192.168.73.140 port=5432 dbname=repmgr user=repmgr password=repmgr connect_timeout=2
 2  | pg2  | primary | - failed  | ?        | default  | 100      |          | host=192.168.73.141 port=5432 dbname=repmgr user=repmgr password=repmgr connect_timeout=2

WARNING: following issues were detected
  - unable to connect to node "pg2" (ID: 2)

HINT: execute with --verbose option to see connection error messages

原主库修改timeline

[postgres@pg2 data]$ cat postgresql.auto.conf 
# Do not edit this file manually!
# It will be overwritten by the ALTER SYSTEM command.
archive_mode = ‘on‘
archive_command = ‘cp -i %p /data/archive_wals/%f‘
wal_keep_segments = ‘512‘
primary_conninfo = ‘host=192.168.73.140 user=repmgr application_name=pg2 password=repmgr connect_timeout=2 port=5432‘
recovery_target_timeline = ‘10‘
wal_retrieve_retry_interval = ‘5000‘

拉起原主库，查看状态，切换成功

[postgres@pg1 data]$ repmgr cluster show
 ID | Name | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                                        
----+------+---------+-----------+----------+----------+----------+----------+-------------------------------------------------------------------------------------------
 1  | pg1  | primary | * running |          | default  | 100      | 10       | host=192.168.73.140 port=5432 dbname=repmgr user=repmgr password=repmgr connect_timeout=2
 2  | pg2  | primary | ! running |          | default  | 100      | 9        | host=192.168.73.141 port=5432 dbname=repmgr user=repmgr password=repmgr connect_timeout=2

WARNING: following issues were detected
  - node "pg2" (ID: 2) is running but the repmgr node record is inactive



[postgres@pg2 data]$ repmgr cluster show
 ID | Name | Role    | Status               | Upstream | Location | Priority | Timeline | Connection string                                                                        
----+------+---------+----------------------+----------+----------+----------+----------+-------------------------------------------------------------------------------------------
 1  | pg1  | standby | ! running as primary |          | default  | 100      | 10       | host=192.168.73.140 port=5432 dbname=repmgr user=repmgr password=repmgr connect_timeout=2
 2  | pg2  | primary | * running            |          | default  | 100      | 9        | host=192.168.73.141 port=5432 dbname=repmgr user=repmgr password=repmgr connect_timeout=2

WARNING: following issues were detected
  - node "pg1" (ID: 1) is registered as standby but running as primary

[postgres@pg2 data]$ pg_ctl start
server started

[postgres@pg2 data]$ repmgr cluster show
 ID | Name | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                                        
----+------+---------+-----------+----------+----------+----------+----------+-------------------------------------------------------------------------------------------
 1  | pg1  | primary | * running |          | default  | 100      | 10       | host=192.168.73.140 port=5432 dbname=repmgr user=repmgr password=repmgr connect_timeout=2
 2  | pg2  | standby |   running | pg1      | default  | 100      | 9        | host=192.168.73.141 port=5432 dbname=repmgr user=repmgr password=repmgr connect_timeout=2
[postgres@pg2 data]$ repmgr standby follow
NOTICE: attempting to find and follow current primary
INFO: timelines are same, this server is not ahead
DETAIL: local node lsn is 0/C9024B00, follow target lsn is 0/C9024B00
NOTICE: setting node 2‘s upstream to node 1
NOTICE: stopping server using "/opt/pgsql/bin/pg_ctl  -D ‘/opt/pgsql/data‘ -w -m fast stop"
2021-08-21 02:28:29.339 EDT [23663] LOG:  received fast shutdown request
2021-08-21 02:28:29.340 EDT [23663] LOG:  aborting any active transactions
2021-08-21 02:28:29.340 EDT [23672] FATAL:  terminating walreceiver process due to administrator command
2021-08-21 02:28:29.340 EDT [23676] FATAL:  terminating connection due to administrator command
2021-08-21 02:28:29.340 EDT [23674] FATAL:  terminating connection due to administrator command
2021-08-21 02:28:29.341 EDT [23669] LOG:  shutting down
2021-08-21 02:28:29.353 EDT [23663] LOG:  database system is shut down
NOTICE: starting server using "/opt/pgsql/bin/pg_ctl  -w -D ‘/opt/pgsql/data‘ start"
NOTICE: STANDBY FOLLOW successful
DETAIL: standby attached to upstream node "pg1" (ID: 1)

发现原主库时间线不对：

[postgres@pg2 data]$ repmgr standby follow
NOTICE: attempting to find and follow current primary
INFO: timelines are same, this server is not ahead
DETAIL: local node lsn is 0/C9024B00, follow target lsn is 0/C9024B00
NOTICE: setting node 2‘s upstream to node 1
NOTICE: stopping server using "/opt/pgsql/bin/pg_ctl  -D ‘/opt/pgsql/data‘ -w -m fast stop"
2021-08-21 02:28:29.339 EDT [23663] LOG:  received fast shutdown request
2021-08-21 02:28:29.340 EDT [23663] LOG:  aborting any active transactions
2021-08-21 02:28:29.340 EDT [23672] FATAL:  terminating walreceiver process due to administrator command
2021-08-21 02:28:29.340 EDT [23676] FATAL:  terminating connection due to administrator command
2021-08-21 02:28:29.340 EDT [23674] FATAL:  terminating connection due to administrator command
2021-08-21 02:28:29.341 EDT [23669] LOG:  shutting down
2021-08-21 02:28:29.353 EDT [23663] LOG:  database system is shut down
NOTICE: starting server using "/opt/pgsql/bin/pg_ctl  -w -D ‘/opt/pgsql/data‘ start"
NOTICE: STANDBY FOLLOW successful
DETAIL: standby attached to upstream node "pg1" (ID: 1)


[postgres@pg2 data]$ repmgr cluster show
 ID | Name | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                                        
----+------+---------+-----------+----------+----------+----------+----------+-------------------------------------------------------------------------------------------
 1  | pg1  | primary | * running |          | default  | 100      | 10       | host=192.168.73.140 port=5432 dbname=repmgr user=repmgr password=repmgr connect_timeout=2
 2  | pg2  | standby |   running | pg1      | default  | 100      | 10       | host=192.168.73.141 port=5432 dbname=repmgr user=repmgr password=repmgr connect_timeout=2

发现正常了~~

repmgr相关命令

repmgr cluster show
repmgr node status
repmgr  cluster event
repmgr cluster crosscheck

repmgr  node service --list-actions
测试切换repmgr  standby switchover --siblings-follow --dry-run
正式进行主备切换 repmgr -f repmgr.conf standby switchover
主备切换后主库需要重新注册 repmgr -f repmgr.conf standby register --force

利用repmgr switch over数据库：
repmgr -f /home/postgres/repmgr.conf standby switchover --siblings-follow --dry-run
repmgr -f/home/postgres/repmgr.conf standby switchover

利用repmgr promote备库：
repmgr -f /home/postgres/repmgr.conf standby promote

利用rpmgr rejoin备用数据库：（inactive时）
repmgr  node rejoin -d ‘host=192.168.73.141 user=repmgr dbname=repmgr‘ --force-rewind --dry-run –verbose
repmgr  node rejoin -d ‘host=192.168.73.141 user=repmgr dbname=repmgr‘ --force-rewind –verbose


repmgr primary register  安装pg的repmgr扩展并注册为主节点
repmgr primary unregister 注销不活动的主节点
repmgr standby clone 从其他节点复制数据到从节点
repmgr standby register 注册从节点（添加从的信息到repmgr元数据）
repmgr standby unregister repmgr  元数据中移除从的信息
repmgr standby promote 将从提升为主
repmgr standby follow  将从跟随新主
repmgr standby switchover 将从提升为主并将主降级为从
repmgr witness register  注册一个观察节点
repmgr witness unregister  移除一个观察节点
repmgr node status  显示节点的基本信息和复制状态
repmgr node check  从复制的角度对节点进行健康监测
repmgr node rejoin   重新加入一个失效节点到集群
repmgr cluster show  显示所有集群中注册的节点信息
repmgr cluster matrix 在所有节点运行show并汇总
repmgr cluster crosscheck  在节点间两两交叉监测连接
repmgr cluster event 输出时间记录
repmgr cluster cleanup   清理监控历史
repmgr service status  节点状态

留个坑，还差witness没有尝试。

原文：https://www.cnblogs.com/ddlearning/p/15169618.html