repmgr安装测试自动故障切换-postgresql12
安装
环境准备
- 环境介绍
两台PG12 (host1 host2)host1为原主库,host2为原备库。主从异步流复制。
下载最新repmgr包:repmgr-5.2.1.tar.gz (https://repmgr.org/ )
./configure
sudo make && sudo make install
配置
-
主库配置
listen_addresses = ‘*‘ shared_preload_libraries = ‘repmgr‘ wal_log_hints = ‘on‘ wal_level = replica hot_standby = on
local repmgr repmgr trust host repmgr repmgr 127.0.0.1/24 trust host repmgr repmgr 192.168.73.140/24 trust host repmgr repmgr 192.168.73.141/24 trust host postgres repmgr 192.168.73.140/24 trust local replication repmgr trust host replication repmgr 127.0.0.1/24 trust host replication repmgr 192.168.73.140/24 trust host replication repmgr 192.168.73.141/24 trust
[postgres@pg1 opt]$ pg_config --sysconfdir
/opt/pgsql/etcsudo mkdir -pv /opt/pgsql/etc/ touch /opt/pgsql/etc/repmgr.conf cat >> /opt/pgsql/etc/repmgr.conf << EOF node_id=2 node_name=‘pg1‘ conninfo=‘host=192.168.73.140 port=5432 dbname=repmgr user=repmgr connect_timeout=2‘ data_directory =‘/opt/pgsql/data‘ pg_bindir=‘/opt/pgsql/bin‘ config_directory=‘/opt/pgsql/data‘ log_level=INFO log_facility=STDERR log_file=‘/opt/pgsql/etc/repmgr.log‘ EOF
-
注册主节点
repmgr primary register repmgr service status psql -h 192.168.73.140 -U repmgr -d repmgr -c ‘SELECT * FROM repmgr.nodes;‘
-
开启主库守护进程repmgrd
repmgrd -d
-
从库配置
sudo mkdir -pv /opt/pgsql/etc/
touch /opt/pgsql/etc/repmgr.conf
cat >> /opt/pgsql/etc/repmgr.conf << EOF
node_id=2
node_name=‘pg2‘
conninfo=‘host=192.168.73.141 port=5432 dbname=repmgr user=repmgr connect_timeout=2‘
data_directory =‘/opt/pgsql/data‘
pg_bindir=‘/opt/pgsql/bin‘
config_directory=‘/opt/pgsql/data‘
log_level=INFO
log_facility=STDERR
log_file=‘/opt/pgsql/etc/repmgr.log‘
EOF
-
克隆主库
[postgres@pg2 etc]$ repmgr -h 192.168.73.140 -U repmgr -d repmgr -f /opt/pgsql/etc/repmgr.conf standby clone -F WARNING: following problems with command line parameters detected: "config_directory" set in repmgr.conf, but --copy-external-config-files not provided NOTICE: destination directory "/opt/pgsql/data" provided INFO: connecting to source node DETAIL: connection string is: host=192.168.73.140 user=repmgr dbname=repmgr DETAIL: current installation size is 884 MB INFO: replication slot usage not requested; no replication slot will be set up for this standby NOTICE: checking for available walsenders on the source node (2 required) NOTICE: checking replication connections can be made to the source server (2 required) WARNING: directory "/opt/pgsql/data" exists but is not empty NOTICE: -F/--force provided - deleting existing data directory "/opt/pgsql/data" NOTICE: starting backup (using pg_basebackup)... HINT: this may take some time; consider using the -c/--fast-checkpoint option INFO: executing: /opt/pgsql/bin/pg_basebackup -l "repmgr base backup" -D /opt/pgsql/data -h 192.168.73.140 -p 5432 -U repmgr -X stream NOTICE: standby clone (using pg_basebackup) complete NOTICE: you can now start your PostgreSQL server HINT: for example: pg_ctl -D /opt/pgsql/data start HINT: after starting the server, you need to register this standby with "repmgr standby register"
6.启动备库并注册
pg_ctl -D /opt/pgsql/data start repmgr standby register
7.启动repmgr守护进程
repmgrd -d
8.查看从库repmgr状态
repmgr service status [postgres@pg2 opt]$ repmgr service status ID | Name | Role | Status | Upstream | repmgrd | PID | Paused? | Upstream last seen ----+------+---------+-----------+----------+---------+------+---------+-------------------- 1 | pg1 | primary | * running | | running | 9603 | no | n/a 2 | pg2 | standby | running | pg1 | running | 2551 | no | 0 second(s) ago [postgres@pg2 opt]$ repmgr cluster show ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string ----+------+---------+-----------+----------+----------+----------+----------+------------------------------------------------------------------------------------------- 1 | pg1 | primary | * running | | default | 100 | 7 | host=192.168.73.140 port=5432 dbname=repmgr user=repmgr password=repmgr connect_timeout=2 2 | pg2 | standby | running | pg1 | default | 100 | 7 | host=192.168.73.141 port=5432 dbname=repmgr user=repmgr password=repmgr connect_timeout=2
使用repmgr执行switch over切换
测试切换repmgr standby switchover --siblings-follow --dry-run
正式进行主备切换 repmgr -f repmgr.conf standby switchover
主备切换后主库需要重新注册 repmgr -f repmgr.conf standby register --force
重新注册后必须重启备库
使用repmgr执行fail over切换
1.手动关闭主库模拟数据库异常: pg_ctl -m fast stop
2.备库强制提升为主库:
postgres@HWFBSB6:~> repmgr -f repmgr.conf standby promote
3.新主库查看主备库状态:repmgr -f repmgr.conf cluster show
WARNING: the following problems were found in the configuration file:
parameter “cluster” is deprecated and will be ignored
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----±--------±--------±----------±---------±---------±---------±---------±---------------------------------------
1 | HWFBSB5 | primary | - failed | | default | 100 | ? | host=HWFBSB5 user=repmgr dbname=repmgr
2 | HWFBSB6 | primary | * running | | default | 100 | 2 | host=HWFBSB6 user=repmgr dbname=repmgr
4.启动原主库:
postgres@HWFBSB5:~> pg_ctl -D /pgdata/pg113 -l /pgdata/pg113/server.log start
waiting for server to start… done
server started
5.原主库查看主备库状态:
postgres@HWFBSB5:~> repmgr -f repmgr.conf cluster show
WARNING: the following problems were found in the configuration file:
parameter “cluster” is deprecated and will be ignored
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----±--------±--------±---------------------±---------±---------±---------±---------±---------------------------------------
1 | HWFBSB5 | primary | * running | | default | 100 | 1 | host=HWFBSB5 user=repmgr dbname=repmgr
2 | HWFBSB6 | standby | ! running as primary | | default | 100 | 2 | host=HWFBSB6 user=repmgr dbname=repmgr
6.可以看到从主备库查看的状态都带有警告,这时如果要把剔除掉的原主库作为备库状态重新加入,需要repmgr node rejoin操作
? 首先关闭原主库,使用repmgr node rejoin重新添加原主库:
? 预命令
repmgr -f repmgr.conf node rejoin -d ‘host=197.0.1.105 user=repmgr dbname=repmgr’ --force-rewind --dry-run --verbose
正式执行repmgr node rejoin重新添加原主库命令:
repmgr -f repmgr.conf node rejoin -d ‘host=197.0.1.105 user=repmgr dbname=repmgr’ --force-rewind --verbose
一主一备利用repmgrd实现自动fail over
cat >> repmgr.conf << EOF
monitoring_history=true (启用监控参数)
monitor_interval_secs=5(定义监视数据间隔写入时间参数)
reconnect_attempts=10(故障转移之前,尝试重新连接主库次数(默认为6)参数)
reconnect_interval=5(每间隔5s尝试重新连接一次参数)
EOF
如下参数为自动故障转移所需配置的重要参数:
failover=automatic
promote_command=’/pgsql/pg113/bin/repmgr standby promote -f /home/postgres/repmgr.conf --log-to-file’
follow_command=’/pgsql/pg113/bin/repmgr standby follow -f /home/postgres/repmgr.conf --log-to-file --upstream-node-id=%n’
*附加参数:
replication_lag_critical(如果备用数据库的复制延迟(已秒为单位)超过此值,则将终止切换)
备:参数配置完成后需要重启主备库使其生效:
repmgr node service --action=restart
启动主备库的repmgrd:
repmgrd --pid-file /tmp/repmgrd.pid
练习环境自测
-
关闭主库
repmgr cluster show ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string ----+------+---------+-----------+----------+----------+----------+----------+------------------------------------------------------------------------------------------- 1 | pg1 | standby | running | pg2 | default | 100 | 9 | host=192.168.73.140 port=5432 dbname=repmgr user=repmgr password=repmgr connect_timeout=2 2 | pg2 | primary | * running | | default | 100 | 9 | host=192.168.73.141 port=5432 dbname=repmgr user=repmgr password=repmgr connect_timeout=2 [postgres@pg2 data]$ [postgres@pg2 data]$ pg_ctl stop
-
备库查看状态
[postgres@pg1 data]$ repmgr cluster show ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string ----+------+---------+-----------+----------+----------+----------+----------+------------------------------------------------------------------------------------------- 1 | pg1 | primary | * running | | default | 100 | 10 | host=192.168.73.140 port=5432 dbname=repmgr user=repmgr password=repmgr connect_timeout=2 2 | pg2 | primary | - failed | ? | default | 100 | | host=192.168.73.141 port=5432 dbname=repmgr user=repmgr password=repmgr connect_timeout=2 WARNING: following issues were detected - unable to connect to node "pg2" (ID: 2) HINT: execute with --verbose option to see connection error messages
-
原主库修改timeline
[postgres@pg2 data]$ cat postgresql.auto.conf # Do not edit this file manually! # It will be overwritten by the ALTER SYSTEM command. archive_mode = ‘on‘ archive_command = ‘cp -i %p /data/archive_wals/%f‘ wal_keep_segments = ‘512‘ primary_conninfo = ‘host=192.168.73.140 user=repmgr application_name=pg2 password=repmgr connect_timeout=2 port=5432‘ recovery_target_timeline = ‘10‘ wal_retrieve_retry_interval = ‘5000‘
-
拉起原主库,查看状态,切换成功
[postgres@pg1 data]$ repmgr cluster show ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string ----+------+---------+-----------+----------+----------+----------+----------+------------------------------------------------------------------------------------------- 1 | pg1 | primary | * running | | default | 100 | 10 | host=192.168.73.140 port=5432 dbname=repmgr user=repmgr password=repmgr connect_timeout=2 2 | pg2 | primary | ! running | | default | 100 | 9 | host=192.168.73.141 port=5432 dbname=repmgr user=repmgr password=repmgr connect_timeout=2 WARNING: following issues were detected - node "pg2" (ID: 2) is running but the repmgr node record is inactive [postgres@pg2 data]$ repmgr cluster show ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string ----+------+---------+----------------------+----------+----------+----------+----------+------------------------------------------------------------------------------------------- 1 | pg1 | standby | ! running as primary | | default | 100 | 10 | host=192.168.73.140 port=5432 dbname=repmgr user=repmgr password=repmgr connect_timeout=2 2 | pg2 | primary | * running | | default | 100 | 9 | host=192.168.73.141 port=5432 dbname=repmgr user=repmgr password=repmgr connect_timeout=2 WARNING: following issues were detected - node "pg1" (ID: 1) is registered as standby but running as primary
[postgres@pg2 data]$ pg_ctl start server started [postgres@pg2 data]$ repmgr cluster show ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string ----+------+---------+-----------+----------+----------+----------+----------+------------------------------------------------------------------------------------------- 1 | pg1 | primary | * running | | default | 100 | 10 | host=192.168.73.140 port=5432 dbname=repmgr user=repmgr password=repmgr connect_timeout=2 2 | pg2 | standby | running | pg1 | default | 100 | 9 | host=192.168.73.141 port=5432 dbname=repmgr user=repmgr password=repmgr connect_timeout=2 [postgres@pg2 data]$ repmgr standby follow NOTICE: attempting to find and follow current primary INFO: timelines are same, this server is not ahead DETAIL: local node lsn is 0/C9024B00, follow target lsn is 0/C9024B00 NOTICE: setting node 2‘s upstream to node 1 NOTICE: stopping server using "/opt/pgsql/bin/pg_ctl -D ‘/opt/pgsql/data‘ -w -m fast stop" 2021-08-21 02:28:29.339 EDT [23663] LOG: received fast shutdown request 2021-08-21 02:28:29.340 EDT [23663] LOG: aborting any active transactions 2021-08-21 02:28:29.340 EDT [23672] FATAL: terminating walreceiver process due to administrator command 2021-08-21 02:28:29.340 EDT [23676] FATAL: terminating connection due to administrator command 2021-08-21 02:28:29.340 EDT [23674] FATAL: terminating connection due to administrator command 2021-08-21 02:28:29.341 EDT [23669] LOG: shutting down 2021-08-21 02:28:29.353 EDT [23663] LOG: database system is shut down NOTICE: starting server using "/opt/pgsql/bin/pg_ctl -w -D ‘/opt/pgsql/data‘ start" NOTICE: STANDBY FOLLOW successful DETAIL: standby attached to upstream node "pg1" (ID: 1)
?
发现原主库时间线不对:
?
[postgres@pg2 data]$ repmgr standby follow
NOTICE: attempting to find and follow current primary
INFO: timelines are same, this server is not ahead
DETAIL: local node lsn is 0/C9024B00, follow target lsn is 0/C9024B00
NOTICE: setting node 2‘s upstream to node 1
NOTICE: stopping server using "/opt/pgsql/bin/pg_ctl -D ‘/opt/pgsql/data‘ -w -m fast stop"
2021-08-21 02:28:29.339 EDT [23663] LOG: received fast shutdown request
2021-08-21 02:28:29.340 EDT [23663] LOG: aborting any active transactions
2021-08-21 02:28:29.340 EDT [23672] FATAL: terminating walreceiver process due to administrator command
2021-08-21 02:28:29.340 EDT [23676] FATAL: terminating connection due to administrator command
2021-08-21 02:28:29.340 EDT [23674] FATAL: terminating connection due to administrator command
2021-08-21 02:28:29.341 EDT [23669] LOG: shutting down
2021-08-21 02:28:29.353 EDT [23663] LOG: database system is shut down
NOTICE: starting server using "/opt/pgsql/bin/pg_ctl -w -D ‘/opt/pgsql/data‘ start"
NOTICE: STANDBY FOLLOW successful
DETAIL: standby attached to upstream node "pg1" (ID: 1)
[postgres@pg2 data]$ repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+------+---------+-----------+----------+----------+----------+----------+-------------------------------------------------------------------------------------------
1 | pg1 | primary | * running | | default | 100 | 10 | host=192.168.73.140 port=5432 dbname=repmgr user=repmgr password=repmgr connect_timeout=2
2 | pg2 | standby | running | pg1 | default | 100 | 10 | host=192.168.73.141 port=5432 dbname=repmgr user=repmgr password=repmgr connect_timeout=2
发现正常了~~
repmgr相关命令
repmgr cluster show
repmgr node status
repmgr cluster event
repmgr cluster crosscheck
repmgr node service --list-actions
测试切换repmgr standby switchover --siblings-follow --dry-run
正式进行主备切换 repmgr -f repmgr.conf standby switchover
主备切换后主库需要重新注册 repmgr -f repmgr.conf standby register --force
利用repmgr switch over数据库:
repmgr -f /home/postgres/repmgr.conf standby switchover --siblings-follow --dry-run
repmgr -f/home/postgres/repmgr.conf standby switchover
利用repmgr promote备库:
repmgr -f /home/postgres/repmgr.conf standby promote
利用rpmgr rejoin备用数据库:(inactive时)
repmgr node rejoin -d ‘host=192.168.73.141 user=repmgr dbname=repmgr‘ --force-rewind --dry-run –verbose
repmgr node rejoin -d ‘host=192.168.73.141 user=repmgr dbname=repmgr‘ --force-rewind –verbose
repmgr primary register 安装pg的repmgr扩展并注册为主节点
repmgr primary unregister 注销不活动的主节点
repmgr standby clone 从其他节点复制数据到从节点
repmgr standby register 注册从节点(添加从的信息到repmgr元数据)
repmgr standby unregister repmgr 元数据中移除从的信息
repmgr standby promote 将从提升为主
repmgr standby follow 将从跟随新主
repmgr standby switchover 将从提升为主并将主降级为从
repmgr witness register 注册一个观察节点
repmgr witness unregister 移除一个观察节点
repmgr node status 显示节点的基本信息和复制状态
repmgr node check 从复制的角度对节点进行健康监测
repmgr node rejoin 重新加入一个失效节点到集群
repmgr cluster show 显示所有集群中注册的节点信息
repmgr cluster matrix 在所有节点运行show并汇总
repmgr cluster crosscheck 在节点间两两交叉监测连接
repmgr cluster event 输出时间记录
repmgr cluster cleanup 清理监控历史
repmgr service status 节点状态
留个坑,还差witness没有尝试。
原文:https://www.cnblogs.com/ddlearning/p/15169618.html