
pgBackRest preview: verify recovery target timeline on restore
Have you ever faced timeline issues during PostgreSQL recovery? It can be frustrating to complete a restore only to discover during recovery that the timeline is invalid, which often results in an error message like this:
FATAL: requested timeline 2 is not a child of this server's history DETAIL: Latest checkpoint is at 0/7000028 on timeline 1, but in the history of the requested timeline, the server forked off from that timeline at 0/600AA20.
While understanding the importance of PostgreSQL timelines is crucial, this topic can be confusing for those unfamiliar with it.
In this post, we’ll explore why this happens, what you can do about it, and how to detect it earlier.
Let’s start with a simple scenario: a full backup followed by some generated WAL archives.
$ pgbackrest info stanza: ro9pg status: ok cipher: aes-256-cbc db (current) wal archive min/max (17): 00000001000000000000000E/000000010000000000000031 full backup: 20250206-102246F timestamp start/stop: 2025-02-06 10:22:46+00 / 2025-02-06 10:22:49+00 wal start/stop: 00000001000000000000000E / 00000001000000000000000E database size: 22.2MB, database backup size: 22.2MB repo1: backup size: 2.7MB
Now, someone unexpectedly promotes a standby or performs a test restore to validate the backup but forgets to disable WAL archiving on the temporary cluster.
$ pgbackrest info ... wal archive min/max (17): 000000010000000000000003/00000002000000000000003A
$ pgbackrest repo-ls archive/ro9pg/17-1 0000000100000000 00000002.history 0000000200000000 $ pgbackrest repo-ls archive/ro9pg/17-1/0000000100000000 ... 000000010000000000000031-f35618b3a1176b39059b5aac8fd17f8ecd63d3be.zst 000000010000000000000032.partial-826c8a36ae1a5d4a624720382ece08f2b9f4bf29.zst $ pgbackrest repo-ls archive/ro9pg/17-1/0000000200000000 000000020000000000000032-e22f694a0bd949cda0be235f09b5620351eb3f4c.zst 000000020000000000000033-81d92d49082a857880affca6a084f0e6d9d3bd26.zst ... $ pgbackrest repo-get archive/ro9pg/17-1/00000002.history 1 0/32899B98 no recovery target specified
A new timeline has been generated, and WAL archives created on this timeline will have 00000002
in their filename. From the 00000002.history
file, we can see that the new timeline forked from the first one at 0/32899B98
.
Since this timeline was created accidentally, no one notices it. Meanwhile, the daily backup cron runs as usual, generating a new full backup.
$ pgbackrest info stanza: ro9pg status: ok cipher: aes-256-cbc db (current) wal archive min/max (17): 00000001000000000000000E/00000002000000000000003B full backup: 20250206-102246F timestamp start/stop: 2025-02-06 10:22:46+00 / 2025-02-06 10:22:49+00 wal start/stop: 00000001000000000000000E / 00000001000000000000000E database size: 22.2MB, database backup size: 22.2MB repo1: backup size: 2.7MB full backup: 20250206-104949F timestamp start/stop: 2025-02-06 10:49:49+00 / 2025-02-06 10:49:52+00 wal start/stop: 000000010000000000000033 / 000000010000000000000033 database size: 777.3MB, database backup size: 777.3MB repo1: backup size: 36.2MB
We can see that the new full backup (wal start/stop: 000000010000000000000033 / 000000010000000000000033
) is on an earlier timeline compared to what’s in the repository (wal archive min/max (17): 00000001000000000000000E / 00000002000000000000003B
). Now, let’s try restoring this new full backup and see what happens.
LOG: starting backup recovery with redo LSN 0/36000028, checkpoint LSN 0/36000080, on timeline ID 1 LOG: restored log file "00000002.history" from archive LOG: restored log file "000000020000000000000036" from archive LOG: invalid resource manager ID 32 at 0/36000080 LOG: invalid checkpoint record FATAL: could not locate required checkpoint record at 0/36000080 HINT: If you are restoring from a backup, touch "recovery.signal" or "standby.signal" and add required recovery options. If you are not restoring from a backup, try removing the file "backup_label". Be careful: removing "backup_label" will result in a corrupt cluster if restoring from a backup.
This error message sounds even more alarming than we might have expected. Since PostgreSQL 12, following the latest timeline has been the default behavior. This explains why PostgreSQL cannot find the checkpoint in the WAL – it is searching for it in the second timeline instead of the first one. To resolve this, we can either clean up everything related to the second timeline in the repository or set recovery_target_timeline
to current
, using the pgBackRest --target-timeline=current
restore option.
LOG: starting backup recovery with redo LSN 0/36000028, checkpoint LSN 0/36000080, on timeline ID 1 LOG: restored log file "000000010000000000000036" from archive LOG: starting archive recovery LOG: redo starts at 0/36000028 LOG: completed backup recovery with redo LSN 0/36000028 and end LSN 0/36000158 LOG: consistent recovery state reached at 0/36000158 ... LOG: selected new timeline ID: 3 LOG: archive recovery complete LOG: database system is ready to accept connections
The next pgBackRest release will include a new feature related to this: verify recovery target timeline, which checks the recovery target timeline during the restore process. This saves time by detecting issues before the restore starts and providing more information about potential problems. Let’s see how this feature would have handled our earlier example.
Going back to the beginning: a full backup, followed by some generated WAL archives.
$ pgbackrest info stanza: ro9pg status: ok cipher: aes-256-cbc db (current) wal archive min/max (17): 000000010000000000000074/000000010000000000000098 full backup: 20250206-120502F timestamp start/stop: 2025-02-06 12:05:02+00 / 2025-02-06 12:05:06+00 wal start/stop: 000000010000000000000074 / 000000010000000000000074 database size: 777.3MB, database backup size: 777.3MB repo1: backup size: 36.2MB
Now, someone unexpectedly promotes a standby or performs a test restore to validate the backup but forgets to disable WAL archiving on the temporary cluster, and even takes a differential backup afterward.
$ pgbackrest info stanza: ro9pg status: ok cipher: aes-256-cbc db (current) wal archive min/max (17): 000000010000000000000074/0000000200000000000000A5 full backup: 20250206-120502F timestamp start/stop: 2025-02-06 12:05:02+00 / 2025-02-06 12:05:06+00 wal start/stop: 000000010000000000000074 / 000000010000000000000074 database size: 777.3MB, database backup size: 777.3MB repo1: backup size: 36.2MB diff backup: 20250206-120502F_20250206-120917D timestamp start/stop: 2025-02-06 12:09:17+00 / 2025-02-06 12:09:20+00 wal start/stop: 0000000200000000000000A4 / 0000000200000000000000A4 database size: 179.1MB, database backup size: 152.6MB repo1: backup size: 6.7MB backup reference total: 1 full
$ pgbackrest repo-ls archive/ro9pg/17-1 0000000100000000 00000002.history 0000000200000000 $ pgbackrest repo-ls archive/ro9pg/17-1/0000000100000000 ... 000000010000000000000099-03270b2754c5e72013af497aa8d10b3d467abea2.zst 000000010000000000000099.partial-587f49dfa38308f08bfd9e37d7e2ddf8c09744e4.zst 00000001000000000000009A-6ec2014948eebb8de2ddc1451da839771897c8f0.zst $ pgbackrest repo-ls archive/ro9pg/17-1/0000000200000000 000000020000000000000099-4e03a26ae3e6b002f88f4f117d5e8134c112de79.zst 00000002000000000000009A-ecdf1ad475558d5a2b01257beae8423262091884.zst 00000002000000000000009B-da16dc2716dbb264b21d36ae2e69255a948075ae.zst ... $ pgbackrest repo-get archive/ro9pg/17-1/00000002.history 1 0/99000060 no recovery target specified
A new timeline has been generated and WAL archives generated on that timeline will have 00000002
in their filename. From the 00000002.history
file content, we can tell that the new timeline forked from the first one at 0/99000060
.
Since this timeline was created accidentally, no one will notice it, and the daily backup cron will run as usual, generating a new full backup.
$ pgbackrest info ... full backup: 20250206-121012F timestamp start/stop: 2025-02-06 12:10:12+00 / 2025-02-06 12:10:15+00 wal start/stop: 00000001000000000000009A / 00000001000000000000009A database size: 777.3MB, database backup size: 777.3MB repo1: backup size: 36.2MB
We can see that the new full backup (wal start/stop: 00000001000000000000009A / 00000001000000000000009A
) is on an earlier timeline compared to what’s in the repository (wal archive min/max (17): 000000010000000000000074/0000000200000000000000A5
). Now, let’s try restoring this new full backup and see what happens.
$ pgbackrest --stanza=ro9pg restore --set=20250206-121012F ERROR: [058]: target timeline 2 forked from backup timeline 1 at 0/99000060 which is before backup lsn of 0/9a000028 HINT: was the target timeline created by accidentally promoting a standby? HINT: was the target timeline created by testing a restore without --archive-mode=off? HINT: was the backup made after the target timeline was created?
This time, we know there is a potential timeline issue before starting the recovery or even retrieving files from the remote repository. Once again, using --target-timeline=current
would resolve the issue.
Once started, the new temporary PostgreSQL cluster will create a new third(!) timeline.
$ pgbackrest repo-ls archive/ro9pg/17-1 0000000100000000 00000002.history 0000000200000000 00000003.history 0000000300000000 $ pgbackrest repo-get archive/ro9pg/17-1/00000003.history 1 0/9B000000 no recovery target specified
Currently, our latest timeline is the third one. However, this third timeline is unaware of the second one since it was created during a different restore test. So, what happens if we restore the differential backup taken on the second timeline? PostgreSQL would likely fail to recover because it cannot locate the switch point from timeline 2 to timeline 3. Once again, setting recovery_target_timeline = 'current'
would resolve the issue. Thanks to the new pgBackRest feature mentioned above, we no longer have to wait until the restore process is complete to discover the problem.
$ pgbackrest --stanza=ro9pg restore --set=20250206-120502F_20250206-120917D ERROR: [058]: backup timeline 2, lsn 0/a4000028 is not in the history of target timeline 3 HINT: was the target timeline created by promoting from a timeline < latest?
Final Thoughts
Since the default value of recovery_target_timeline
changed in PostgreSQL 12, I’ve noticed that when answering your pgBackRest questions about recovery issues, we increasingly have to recommend the --target-timeline=current
option. Hopefully, detecting potential timeline issues early – without requiring you to check PostgreSQL logs, will be highly beneficial for many of you. 😊