No doubt, pg_dump/pg_restore are amazing tools for performing logical backups and restores in PostgreSQL. We have already explained pg_dump/pg_restore in detail here.
pgcopydb is an open source tool that automates running pg_dump and pg_restore between two PostgreSQL servers. It is a wrapper over pg_dump and pg_restore that simplifies the process and adds functionality. In this blog post, we will look at pgcopydb in detail.
Why do we need to consider pgcopydb?
pgcopydb achieves two important things that are not possible directly with pg_dump/pg_restore.
1. No intermediate files
To make dump and restore faster with pg_dump/pg_restore, we need parallelism, which means passing the --jobs parameter to both tools. But whenever --jobs is used, pg_dump has to write the dump to intermediate files on disk, because a parallel dump is only supported with the directory archive format.
For example, to copy 1 TB of data from source to destination, we first need to store the dump locally. That means additional disk space on the server just to hold the dump, and the disk space required depends on the size of the data.
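For instance, with the 16 parallel jobs used later in this post (and the same connection details as the demo), the whole dump lands on local disk before pg_restore can even start:

# a parallel dump only works with the directory format (-Fd),
# so the entire dump is written to local disk first
pg_dump -h 192.168.33.11 -U backup -Fd -j 16 -f backup demo
du -sh backup/    # the disk space needed grows with the data size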
With pgcopydb, we do not need any additional disk space for the dump, because the tool completely bypasses intermediate files for the table data while still supporting parallelism.
This is achieved by staying connected to the source database during the entire operation, whereas pg_restore never needs a connection to the source at all, since it restores data from the on-disk intermediate files.
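Conceptually, the table-data step in pgcopydb is equivalent to streaming a COPY from the source straight into the target, so nothing is written to disk in between. A rough, hand-rolled sketch for a single table (using the demo connection strings from later in this post) would be:

# stream one table directly from source to target with no intermediate file;
# pgcopydb does the equivalent per table, in parallel, from a consistent snapshot
psql "postgres://backup@192.168.33.11:5432/demo" \
     -c "COPY public.sbtest1 TO STDOUT" \
  | psql "postgres://backup@192.168.33.12:5432/demo" \
     -c "COPY public.sbtest1 FROM STDIN"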
2. Create all indexes concurrently
When pg_restore creates a primary key, it uses the ALTER TABLE .. ADD CONSTRAINT command directly. That command takes an ACCESS EXCLUSIVE lock on the table while the underlying index is built, which prevents building any other index on the same table concurrently.
pgcopydb instead first executes a CREATE UNIQUE INDEX statement and, once it completes, attaches the constraint with ALTER TABLE .. ADD CONSTRAINT .. PRIMARY KEY USING INDEX. This is how it is able to build all the indexes of a table concurrently.
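In SQL terms, the difference looks roughly like this (table and index names are taken from the demo later in this post):

-- pg_restore style: the primary key index is built inside ALTER TABLE,
-- holding an ACCESS EXCLUSIVE lock for the whole build
ALTER TABLE public.sbtest1 ADD CONSTRAINT sbtest1_pkey PRIMARY KEY (id);

-- pgcopydb style: build the unique index first (other indexes on the same
-- table can be built at the same time), then attach it as the primary key
CREATE UNIQUE INDEX sbtest1_pkey ON public.sbtest1 USING btree (id);
ALTER TABLE public.sbtest1
  ADD CONSTRAINT sbtest1_pkey PRIMARY KEY USING INDEX sbtest1_pkey;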
Installation
It is very straightforward to install on Ubuntu.
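For example, assuming the PGDG apt repository is already configured on the host (the demo below uses the PGDG package), the install is just:

# install pgcopydb from the PGDG packages and verify the version
sudo apt-get update
sudo apt-get install -y pgcopydb
pgcopydb --version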
Both the source and the destination should be accessible from the host where pgcopydb is running, and both must stay available throughout the entire process for the migration to succeed.
Even though pgcopydb does not need any storage for the data files themselves, it requires a small amount of storage on the server where it is running, to keep the schema backup and a few temporary files that track progress. This is very small, since it only contains the structure of the schema objects. By default it uses the path /tmp/pgcopydb; to use a different path, pass the --dir option.
pgcopydb depends on the pg_dump and pg_restore binaries on the server where it is running, and their version should match the PostgreSQL version of the target database.
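A quick way to check this on the pgcopydb host is to compare the client tool versions with the target server version (target connection string taken from the demo below):

# client tools on the pgcopydb host
pg_dump --version
pg_restore --version

# PostgreSQL version on the target server
psql "postgres://backup@192.168.33.12:5432/demo" -c "SELECT version();"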
Workflow
The following workflow explains how a database copy happens using pgcopydb between two different PostgreSQL servers.
1. pgcopydb first takes a schema-only backup of the source server using pg_dump in custom format (not plain text). This produces two files: a) the pre-data section and b) the post-data section.
2. The pre-data section file is restored on the destination server using pg_restore. This creates all the PostgreSQL objects on the destination.
3. pgcopydb gets the list of all tables from the source server and, for each of them, runs the copy from source to target in a dedicated sub-process. This continues until all the table data has been copied over.
4. An auxiliary process is started concurrently with the main copy process from point 3. This process is responsible for copying all the large objects from the source to the destination.
5. In each table-copy sub-process from point 3, as soon as the table data has been copied, pgcopydb fetches the list of indexes for that table and creates them in parallel on the destination server. Index creation happens concurrently, as discussed in the previous section.
6. As soon as the data and indexes are in place, VACUUM ANALYZE is run on each target table.
7. pgcopydb then gets the list of sequences and updates their values accordingly on the destination server (see the sketch after this list).
8. Finally, the post-data section, which contains the foreign key constraints, is restored into the target database.
Once all of these steps complete successfully, the database has been copied from the source to the destination.
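As an illustration of the sequence step, resetting one sequence on the target boils down to a setval() call with the value read from the source. The sequence name and value below are hypothetical, for a sysbench-style table:

-- on the source: read the sequence's current state
SELECT last_value, is_called FROM public.sbtest1_id_seq;

-- on the target: apply the same value (pgcopydb automates this for every sequence)
SELECT pg_catalog.setval('public.sbtest1_id_seq', 10000000, true);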
Demo
In this demo, let's see how to perform a complete database copy from the source to the destination and compare it with pg_dump/pg_restore.
It is not mandatory to run pgcopydb on a separate server; it can also be installed on either the source or the destination server.
Now let's copy the complete demo database from the source to the destination.
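The run below does not pass the source and target on the command line; assuming the connection strings are supplied through pgcopydb's environment variables (which match the log output that follows), the setup would look like this:

# connection strings picked up by pgcopydb (passwords omitted)
export PGCOPYDB_SOURCE_PGURI="postgres://backup@192.168.33.11:5432/demo"
export PGCOPYDB_TARGET_PGURI="postgres://backup@192.168.33.12:5432/demo"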
pgcopydb copy-db --table-jobs 16 --index-jobs 16
12:17:10 3971226 INFO Running pgcopydb version 0.6-1.pgdg20.04+1 from "/usr/bin/pgcopydb"
12:17:10 3971226 INFO [SOURCE] Copying database from "postgres://backup@192.168.33.11:5432/demo?password=****"
12:17:10 3971226 INFO [TARGET] Copying database into "postgres://backup@192.168.33.12:5432/demo?password=****"
12:17:10 3971226 INFO Using work dir "/tmp/pgcopydb"
12:17:10 3971226 INFO STEP 1: dump the source database schema (pre/post data)
12:17:10 3971226 INFO Exported snapshot "00000004-00000F28-1" from the source database
12:17:10 3971226 INFO /usr/bin/pg_dump -Fc --snapshot 00000004-00000F28-1 --section pre-data --file /tmp/pgcopydb/schema/pre.dump 'postgres://backup@192.168.33.11:5432/demo?'
12:17:10 3971226 INFO /usr/bin/pg_dump -Fc --snapshot 00000004-00000F28-1 --section post-data --file /tmp/pgcopydb/schema/post.dump 'postgres://backup@192.168.33.11:5432/demo?'
12:17:11 3971226 INFO STEP 2: restore the pre-data section to the target database
12:17:11 3971226 INFO /usr/bin/pg_restore --dbname 'postgres://backup@192.168.33.12:5432/demo?' /tmp/pgcopydb/schema/pre.dump
12:17:11 3971226 INFO STEP 3: copy data from source to target in sub-processes
12:17:11 3971226 INFO STEP 4: create indexes and constraints in parallel
12:17:11 3971226 INFO STEP 5: vacuum analyze each table
12:17:11 3971226 INFO Listing ordinary tables in source database
12:17:11 3971226 INFO Fetched information for 10 tables, with an estimated total of 100000108 tuples and 20 GB
12:17:11 3971226 INFO Now starting 10 processes
12:17:11 3971226 INFO Reset sequences values on the target database
12:17:11 3971226 INFO Listing sequences in source database
12:17:11 3971226 INFO Fetched information for 10 sequences
12:17:11 3971243 INFO COPY "public"."sbtest1";
12:17:11 3971244 INFO COPY "public"."sbtest10";
12:17:11 3971247 INFO COPY "public"."sbtest2";
12:17:11 3971245 INFO COPY "public"."sbtest3";
12:17:11 3971249 INFO COPY "public"."sbtest4";
12:17:11 3971251 INFO COPY "public"."sbtest5";
12:17:11 3971253 INFO COPY "public"."sbtest6";
12:17:11 3971256 INFO COPY "public"."sbtest7";
12:17:11 3971255 INFO COPY "public"."sbtest8";
12:17:11 3971257 INFO COPY "public"."sbtest9";
12:17:11 3971242 INFO Copying large objects
12:20:23 3971244 INFO Creating 2 indexes for table "public"."sbtest10"
12:20:23 3972757 INFO VACUUM ANALYZE "public"."sbtest10";
12:20:23 3972758 INFO CREATE UNIQUE INDEX sbtest10_pkey ON public.sbtest10 USING btree (id);
12:20:23 3972759 INFO CREATE INDEX k_10 ON public.sbtest10 USING btree (k);
12:20:23 3971243 INFO Creating 2 indexes for table "public"."sbtest1"
12:20:23 3972764 INFO VACUUM ANALYZE "public"."sbtest1";
12:20:23 3972765 INFO CREATE UNIQUE INDEX sbtest1_pkey ON public.sbtest1 USING btree (id);
12:20:23 3972766 INFO CREATE INDEX k_1 ON public.sbtest1 USING btree (k);
12:20:24 3971249 INFO Creating 2 indexes for table "public"."sbtest4"
12:20:24 3972789 INFO VACUUM ANALYZE "public"."sbtest4";
12:20:24 3972790 INFO CREATE UNIQUE INDEX sbtest4_pkey ON public.sbtest4 USING btree (id);
12:20:24 3972791 INFO CREATE INDEX k_4 ON public.sbtest4 USING btree (k);
12:20:24 3971251 INFO Creating 2 indexes for table "public"."sbtest5"
12:20:24 3972808 INFO VACUUM ANALYZE "public"."sbtest5";
12:20:24 3972809 INFO CREATE UNIQUE INDEX sbtest5_pkey ON public.sbtest5 USING btree (id);
12:20:24 3972810 INFO CREATE INDEX k_5 ON public.sbtest5 USING btree (k);
12:20:25 3971245 INFO Creating 2 indexes for table "public"."sbtest3"
12:20:25 3972815 INFO VACUUM ANALYZE "public"."sbtest3";
12:20:25 3972816 INFO CREATE UNIQUE INDEX sbtest3_pkey ON public.sbtest3 USING btree (id);
12:20:25 3972817 INFO CREATE INDEX k_3 ON public.sbtest3 USING btree (k);
12:20:25 3971253 INFO Creating 2 indexes for table "public"."sbtest6"
12:20:25 3972822 INFO VACUUM ANALYZE "public"."sbtest6";
12:20:25 3972823 INFO CREATE UNIQUE INDEX sbtest6_pkey ON public.sbtest6 USING btree (id);
12:20:25 3972824 INFO CREATE INDEX k_6 ON public.sbtest6 USING btree (k);
12:20:25 3971257 INFO Creating 2 indexes for table "public"."sbtest9"
It took a total of 6 minutes and 17 seconds to copy the 25 GB database from one server to the other. Let's try the same data copy with pg_dump/pg_restore directly.
time pg_dump -h192.168.33.11 -Ubackup --port=5432 -Fd demo -j 16 -f backup
real 2m34.974s
user 24m38.031s
sys 0m24.372s
time pg_restore -h192.168.33.12 -Ubackup --port=5432 -d demo -j 16 backup/
real 6m5.533s
user 1m58.733s
sys 0m8.671s
It took 2 minutes and 34 seconds for the backup, 6 minutes and 5 seconds for the restore, and 22 seconds for the vacuum, i.e., about 9 minutes in total. 16 parallel jobs were used in both tests.
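The vacuum step included in that total is not shown above; assuming it was run with vacuumdb, a parallel analyze pass would look roughly like this:

# analyze the restored tables in parallel (illustrative invocation,
# matching the 16 jobs used for the dump and restore)
vacuumdb -h 192.168.33.12 -U backup --analyze -j 16 demo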
From these numbers, we can see how efficient pgcopydb is at performing a logical backup and restore in PostgreSQL. So far, we have only discussed how to perform a complete database copy from source to destination. Starting with its latest release, pgcopydb also supports filtering (copying only specific objects). In our next blog post, we will look at how to use filtering in pgcopydb; this feature is really helpful for day-to-day database operations as well.