pgBackRest preview: verify recovery target timeline on restore

Have you ever faced timeline issues during PostgreSQL recovery? It can be frustrating to complete a restore only to discover during recovery that the timeline is invalid, which often results in an error message like this:

FATAL: requested timeline 2 is not a child of this server's history
DETAIL: Latest checkpoint is at 0/7000028 on timeline 1, but in the history of the requested timeline, the server forked off from that timeline at 0/600AA20.

Understanding PostgreSQL timelines is crucial, but the topic can be confusing if you are not familiar with it.

In this post, we’ll explore why this happens, what you can do about it, and how to detect it earlier.


Let’s start with a simple scenario: a full backup followed by some generated WAL archives.

$ pgbackrest info
stanza: ro9pg
status: ok
cipher: aes-256-cbc

db (current)
wal archive min/max (17): 00000001000000000000000E/000000010000000000000031

full backup: 20250206-102246F
timestamp start/stop: 2025-02-06 10:22:46+00 / 2025-02-06 10:22:49+00
wal start/stop: 00000001000000000000000E / 00000001000000000000000E
database size: 22.2MB, database backup size: 22.2MB
repo1: backup size: 2.7MB

Now, someone unexpectedly promotes a standby or performs a test restore to validate the backup but forgets to disable WAL archiving on the temporary cluster.
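
For illustration, such a test restore might have looked like the first command below (the data directory path is made up for this sketch). The second form, with --archive-mode=off, is what a throwaway restore should use: it disables archiving on the restored cluster, so the temporary instance cannot push WAL for a new timeline into the repository once it reaches the end of recovery.

$ pgbackrest --stanza=ro9pg --pg1-path=/var/lib/pgsql/17/restore_test restore
$ pgbackrest --stanza=ro9pg --pg1-path=/var/lib/pgsql/17/restore_test --archive-mode=off restore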

$ pgbackrest info
...
wal archive min/max (17): 000000010000000000000003/00000002000000000000003A
$ pgbackrest repo-ls archive/ro9pg/17-1
0000000100000000
00000002.history
0000000200000000

$ pgbackrest repo-ls archive/ro9pg/17-1/0000000100000000
...
000000010000000000000031-f35618b3a1176b39059b5aac8fd17f8ecd63d3be.zst
000000010000000000000032.partial-826c8a36ae1a5d4a624720382ece08f2b9f4bf29.zst

$ pgbackrest repo-ls archive/ro9pg/17-1/0000000200000000
000000020000000000000032-e22f694a0bd949cda0be235f09b5620351eb3f4c.zst
000000020000000000000033-81d92d49082a857880affca6a084f0e6d9d3bd26.zst
...

$ pgbackrest repo-get archive/ro9pg/17-1/00000002.history
1 0/32899B98 no recovery target specified

A new timeline has been generated, and WAL archives created on this timeline will have 00000002 in their filename. From the 00000002.history file, we can see that the new timeline forked from the first one at 0/32899B98.
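
At this point, it can be worth confirming that the production cluster itself is still on the first timeline. A minimal sketch, run against the primary (both commands should report timeline 1 in this scenario):

$ pg_controldata "$PGDATA" | grep TimeLineID
$ psql -Atc "SELECT timeline_id FROM pg_control_checkpoint();"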

Since this timeline was created accidentally, no one notices it. Meanwhile, the daily backup cron runs as usual, generating a new full backup.
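
Such a daily backup is typically just a cron entry; a sketch, with a made-up schedule:

30 1 * * * postgres pgbackrest --stanza=ro9pg --type=full backup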

$ pgbackrest info
stanza: ro9pg
status: ok
cipher: aes-256-cbc

db (current)
wal archive min/max (17): 00000001000000000000000E/00000002000000000000003B

full backup: 20250206-102246F
timestamp start/stop: 2025-02-06 10:22:46+00 / 2025-02-06 10:22:49+00
wal start/stop: 00000001000000000000000E / 00000001000000000000000E
database size: 22.2MB, database backup size: 22.2MB
repo1: backup size: 2.7MB

full backup: 20250206-104949F
timestamp start/stop: 2025-02-06 10:49:49+00 / 2025-02-06 10:49:52+00
wal start/stop: 000000010000000000000033 / 000000010000000000000033
database size: 777.3MB, database backup size: 777.3MB
repo1: backup size: 36.2MB

We can see that the new full backup (wal start/stop: 000000010000000000000033 / 000000010000000000000033) is on an earlier timeline compared to what’s in the repository (wal archive min/max (17): 00000001000000000000000E / 00000002000000000000003B). Now, let’s try restoring this new full backup and see what happens.
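
The restore command itself is not shown here; it would presumably look something like the sketch below (backup label taken from the info output above), followed by starting the restored cluster, which produces the recovery log that follows:

$ pgbackrest --stanza=ro9pg restore --set=20250206-104949F
$ pg_ctl -D "$PGDATA" start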

LOG: starting backup recovery with redo LSN 0/36000028, checkpoint LSN 0/36000080, on timeline ID 1
LOG: restored log file "00000002.history" from archive
LOG: restored log file "000000020000000000000036" from archive
LOG: invalid resource manager ID 32 at 0/36000080
LOG: invalid checkpoint record
FATAL: could not locate required checkpoint record at 0/36000080
HINT: If you are restoring from a backup, touch "recovery.signal" or "standby.signal" and add required recovery options.
If you are not restoring from a backup, try removing the file "backup_label".
Be careful: removing "backup_label" will result in a corrupt cluster if restoring from a backup.

This error message sounds even more alarming than we might have expected. Since PostgreSQL 12, recovery_target_timeline defaults to latest, so recovery follows the latest timeline found in the archive. This explains why PostgreSQL cannot find the checkpoint in the WAL: it is looking for it on the second timeline instead of the first one. To resolve this, we can either clean up everything related to the second timeline in the repository or set recovery_target_timeline to current, using the pgBackRest --target-timeline=current restore option.
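
For reference, that restore might look like the sketch below, reusing the backup label from above. On PostgreSQL 12 and later, pgBackRest writes the corresponding recovery settings, including recovery_target_timeline = 'current', into postgresql.auto.conf:

$ pgbackrest --stanza=ro9pg restore --set=20250206-104949F --target-timeline=current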

LOG: starting backup recovery with redo LSN 0/36000028, checkpoint LSN 0/36000080, on timeline ID 1
LOG: restored log file "000000010000000000000036" from archive
LOG: starting archive recovery
LOG: redo starts at 0/36000028
LOG: completed backup recovery with redo LSN 0/36000028 and end LSN 0/36000158
LOG: consistent recovery state reached at 0/36000158
...
LOG: selected new timeline ID: 3
LOG: archive recovery complete
LOG: database system is ready to accept connections

The next pgBackRest release will include a new feature related to this: verify recovery target timeline. It checks the recovery target timeline as part of the restore process, detecting issues before the actual restore work starts and providing more information about potential problems. Let’s see how this feature would have handled our earlier example.

Going back to the beginning: a full backup, followed by some generated WAL archives.

$ pgbackrest info
stanza: ro9pg
status: ok
cipher: aes-256-cbc

db (current)
wal archive min/max (17): 000000010000000000000074/000000010000000000000098

full backup: 20250206-120502F
timestamp start/stop: 2025-02-06 12:05:02+00 / 2025-02-06 12:05:06+00
wal start/stop: 000000010000000000000074 / 000000010000000000000074
database size: 777.3MB, database backup size: 777.3MB
repo1: backup size: 36.2MB

Now, someone unexpectedly promotes a standby or performs a test restore to validate the backup but forgets to disable WAL archiving on the temporary cluster, and even takes a differential backup afterward.
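
That differential backup is just a regular backup call with --type=diff; a sketch, run against the accidentally promoted cluster:

$ pgbackrest --stanza=ro9pg --type=diff backup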

$ pgbackrest info
stanza: ro9pg
status: ok
cipher: aes-256-cbc

db (current)
wal archive min/max (17): 000000010000000000000074/0000000200000000000000A5

full backup: 20250206-120502F
timestamp start/stop: 2025-02-06 12:05:02+00 / 2025-02-06 12:05:06+00
wal start/stop: 000000010000000000000074 / 000000010000000000000074
database size: 777.3MB, database backup size: 777.3MB
repo1: backup size: 36.2MB

diff backup: 20250206-120502F_20250206-120917D
timestamp start/stop: 2025-02-06 12:09:17+00 / 2025-02-06 12:09:20+00
wal start/stop: 0000000200000000000000A4 / 0000000200000000000000A4
database size: 179.1MB, database backup size: 152.6MB
repo1: backup size: 6.7MB
backup reference total: 1 full
$ pgbackrest repo-ls archive/ro9pg/17-1
0000000100000000
00000002.history
0000000200000000

$ pgbackrest repo-ls archive/ro9pg/17-1/0000000100000000
...
000000010000000000000099-03270b2754c5e72013af497aa8d10b3d467abea2.zst
000000010000000000000099.partial-587f49dfa38308f08bfd9e37d7e2ddf8c09744e4.zst
00000001000000000000009A-6ec2014948eebb8de2ddc1451da839771897c8f0.zst

$ pgbackrest repo-ls archive/ro9pg/17-1/0000000200000000
000000020000000000000099-4e03a26ae3e6b002f88f4f117d5e8134c112de79.zst
00000002000000000000009A-ecdf1ad475558d5a2b01257beae8423262091884.zst
00000002000000000000009B-da16dc2716dbb264b21d36ae2e69255a948075ae.zst
...

$ pgbackrest repo-get archive/ro9pg/17-1/00000002.history
1 0/99000060 no recovery target specified

A new timeline has been generated, and WAL archives created on that timeline will have 00000002 in their filename. From the 00000002.history file, we can see that the new timeline forked from the first one at 0/99000060.

Since this timeline was created accidentally, no one will notice it, and the daily backup cron will run as usual, generating a new full backup.

$ pgbackrest info
...
full backup: 20250206-121012F
timestamp start/stop: 2025-02-06 12:10:12+00 / 2025-02-06 12:10:15+00
wal start/stop: 00000001000000000000009A / 00000001000000000000009A
database size: 777.3MB, database backup size: 777.3MB
repo1: backup size: 36.2MB

We can see that the new full backup (wal start/stop: 00000001000000000000009A / 00000001000000000000009A) is on an earlier timeline compared to what’s in the repository (wal archive min/max (17): 000000010000000000000074/0000000200000000000000A5). Now, let’s try restoring this new full backup and see what happens.

$ pgbackrest --stanza=ro9pg restore --set=20250206-121012F
ERROR: [058]: target timeline 2 forked from backup timeline 1 at 0/99000060 which is before backup lsn of 0/9a000028
HINT: was the target timeline created by accidentally promoting a standby?
HINT: was the target timeline created by testing a restore without --archive-mode=off?
HINT: was the backup made after the target timeline was created?

This time, we know there is a potential timeline issue before starting the recovery or even retrieving files from the remote repository. Once again, using --target-timeline=current would resolve the issue.
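
With that knowledge, the restore can be retried right away; a sketch reusing the same backup set:

$ pgbackrest --stanza=ro9pg restore --set=20250206-121012F --target-timeline=current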

Once started, the new temporary PostgreSQL cluster will create a third(!) timeline.

$ pgbackrest repo-ls archive/ro9pg/17-1
0000000100000000
00000002.history
0000000200000000
00000003.history
0000000300000000

$ pgbackrest repo-get archive/ro9pg/17-1/00000003.history
1 0/9B000000 no recovery target specified

Currently, our latest timeline is the third one. However, this third timeline is unaware of the second one since it was created during a different restore test. So, what happens if we restore the differential backup taken on the second timeline? PostgreSQL would likely fail to recover because it cannot locate the switch point from timeline 2 to timeline 3. Once again, setting recovery_target_timeline = 'current' would resolve the issue. Thanks to the new pgBackRest feature mentioned above, we no longer have to wait until the restore process is complete to discover the problem.

$ pgbackrest --stanza=ro9pg restore --set=20250206-120502F_20250206-120917D
ERROR: [058]: backup timeline 2, lsn 0/a4000028 is not in the history of target timeline 3
HINT: was the target timeline created by promoting from a timeline < latest?
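
As before, targeting the backup's own timeline lets the restore proceed; a sketch reusing the differential backup set:

$ pgbackrest --stanza=ro9pg restore --set=20250206-120502F_20250206-120917D --target-timeline=current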

Final Thoughts

Since the default value of recovery_target_timeline changed in PostgreSQL 12, I’ve noticed that, when answering your pgBackRest questions about recovery issues, we increasingly have to recommend the --target-timeline=current option. Hopefully, detecting potential timeline issues early, without requiring you to dig through PostgreSQL logs, will be highly beneficial for many of you. 😊
