Dumpling – Logical backup tool for TiDB

Mydbops
Oct 1, 2024
3 Mins to Read

TiDB, developed by PingCAP, stands out as a distributed SQL database known for its exceptional scalability and strong consistency. To harness its full potential, effective data export tools are essential. Enter Dumpling—a powerful, open-source solution designed specifically for TiDB. In this blog, we'll delve into Dumpling's capabilities, showcasing how it can streamline your data management processes and optimize your export strategies.

Dumpling

  • Dumpling Overview: Dumpling is an open-source data export tool crafted by PingCAP specifically for the TiDB ecosystem. It provides a reliable and efficient solution for managing data exports seamlessly.
  • Data Format Support: Dumpling supports exporting data in multiple formats, including SQL and CSV files, giving users the flexibility to choose the format that best suits their data management needs.
  • Flexible Data Dumping: Users can perform full table dumps or selectively export partial datasets using the --where condition. This targeted approach allows for more efficient data management and minimizes resource usage.
  • Direct Cold Storage Export: One of Dumpling's standout features is its ability to export data directly into cold storage, such as Google Cloud Storage (GCS) and Amazon S3, simply by specifying the appropriate storage URI. This capability streamlines the backup process and eliminates the need for intermediary storage.
  • Performance Optimization: Dumpling is engineered for high-speed data exports, capable of efficiently handling large datasets. Its performance is further enhanced by features like parallel processing, which significantly accelerates the export process and reduces downtime.

Key Features of Dumpling

  • File Chunking: Dumpling allows you to set a maximum file size with the -F option, automatically chunking large datasets into manageable files. This enhances performance and simplifies file handling.
  • Data Compression: Use the --compress <format> option to compress CSV and SQL data files, as well as table structures. Supported algorithms include gzip, snappy, and zstd, saving storage space and improving transfer speeds.
  • Accelerated Export: The -t option lets you specify the number of threads used for exporting data, boosting concurrency and significantly speeding up the process for large datasets.
  • Reduced Memory Overhead: With the -r option, Dumpling reduces memory usage during data scans in TiDB and allows concurrent dumps within a table, improving export efficiency without overloading memory resources.
  • Memory Control: The --tidb-mem-quota-query option limits the memory usage of a single query statement in TiDB. A combined example using these options is sketched below.
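
As a quick illustration, the sketch below combines these options in a single export. The host, credentials, database, and output path are placeholders, so adjust them to your environment before running it.

./dumpling -u test -p ***** -P 4000 -h 127.0.0.1 -o /var/tmp/mysql/ \
  --filetype sql --database mydbops \
  -t 8 -r 200000 -F 256MiB --compress gzip \
  --tidb-mem-quota-query 8589934592   # 8 threads, chunked scans, 256 MiB files, gzip, ~8 GiB per-query memory quota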

Installation and testing

To get started with Dumpling, you can easily install it as part of the TiDB Toolkit, available for download on the official PingCAP website.
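
If you already use TiUP, installation can be as simple as the following sketch (it assumes internet access to the TiUP mirrors; downloading the TiDB Toolkit tarball from the PingCAP website works equally well):

# Install TiUP if it is not already present, then pull the Dumpling component
curl --proto '=https' --tlsv1.2 -sSf https://tiup-mirrors.pingcap.com/install.sh | sh
tiup install dumpling
tiup dumpling --version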

Let’s conduct a test by taking a data dump of an employee table where the employee's age is greater than 30.

mysql> select * from mydbops.employee;
+------+---------+------+
| id   | name    | age  |
+------+---------+------+
|    1 | Tom     |   20 |
|    2 | Raj     |   40 |
|    3 | Cathy   |   38 |
|    4 | jeya    |   22 |
|    5 | sekar   |   31 |
+------+---------+------+
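
If you want to reproduce this test, a minimal table like the following is enough (an assumed schema matching the sample data above):

CREATE TABLE mydbops.employee (
  id   INT PRIMARY KEY,
  name VARCHAR(50),
  age  INT
);
INSERT INTO mydbops.employee VALUES
  (1, 'Tom', 20), (2, 'Raj', 40), (3, 'Cathy', 38),
  (4, 'jeya', 22), (5, 'sekar', 31);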

Sample command

root@mydbops:~# ./dumpling -u test -p ***** -P 4000 -h 127.0.0.1 -o /var/tmp/mysql/ \
  --filetype csv --database mydbops --tables-list mydbops.employee --where "age > 30" \
  -F 300MiB --output-filename-template 'mydbops.employee.{{.Index}}'

Where:

  • -u <username>: Your TiDB username.
  • -p <password>: Your TiDB password.
  • -h <host>: The hostname of your TiDB server.
  • -P <port>: The port number of your TiDB server.
  • -o <output_directory>: The directory where the exported files will be saved.
  • -F <max_file_size>: The maximum size of a single output file.
  • --filetype <csv|sql>: The format of the exported data files.
  • --tables-list <db.table>: The specific tables to export.
  • --where <condition>: The row filter applied during the export.
  • --output-filename-template <template>: The naming pattern for the generated files.

After executing the command, you will find the generated dump files in the /var/tmp/mysql directory. The listing below shows the files produced:

root@mydbops:/var/tmp/mysql# ls -lrth
total 16K
-rw-r--r-- 1 root root   75 Aug 30 17:23 mydbops.employee.000000000.csv
-rw-r--r-- 1 root root  188 Aug 30 17:23 mydbops.employee-schema.sql
-rw-r--r-- 1 root root   96 Aug 30 17:23 mydbops-schema-create.sql
-rw-r--r-- 1 root root 1.2K Aug 30 17:23 metadata
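
Opening the CSV confirms that only rows matching the --where filter were exported. The exact header and quoting behaviour depend on Dumpling's CSV options, but with the defaults the content looks roughly like this:

root@mydbops:/var/tmp/mysql# cat mydbops.employee.000000000.csv
"id","name","age"
2,"Raj",40
3,"Cathy",38
5,"sekar",31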

You can use the same command structure to perform dumps from MySQL servers; simply adjust the username, port, host, and password details accordingly.

Pushing Data to Cold Storage

Dumpling offers a powerful feature that allows you to dump data directly into cold storage by specifying a storage URI, such as a Google Cloud Storage (GCS) or Amazon S3 path. This functionality works with both GCP and AWS, making it convenient to manage your data in cloud environments.

We can seamlessly push data to cold storage without the need for any intermediate storage medium, streamlining your data management workflow.
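
For AWS, the same approach works with an s3:// URI. The sketch below uses a hypothetical bucket and relies on the standard AWS environment variables for credentials (an attached IAM role works as well):

export AWS_ACCESS_KEY_ID=...        # placeholder, or rely on an attached IAM role
export AWS_SECRET_ACCESS_KEY=...
./dumpling -u test -p ***** -P 4000 -h 127.0.0.1 \
  -o "s3://test-archival/test/employee" \
  --filetype csv --database mydbops --tables-list mydbops.employee --where "age > 30" \
  -F 250MiB --compress gzip --output-filename-template 'employee_02_2022.{{.Index}}'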

Requirements for Pushing Data to GCP

To successfully push data to GCP, Dumpling requires the following:

  1. GCP Bucket Details: You need to specify the target GCP bucket where the data will be stored.
  2. Credentials File: A credentials.json file is necessary, containing vital information about your GCP service account (a sample layout is shown after this list), including:
    • Private Key
    • Client Email
    • Client ID
    • Other relevant details for authentication
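
For reference, a GCP service account key file typically has the following shape. All values below are placeholders, not real credentials:

{
  "type": "service_account",
  "project_id": "my-project",
  "private_key_id": "abc123",
  "private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
  "client_email": "dumpling-backup@my-project.iam.gserviceaccount.com",
  "client_id": "123456789012345678901",
  "token_uri": "https://oauth2.googleapis.com/token"
}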

To push data from a TiDB server to Google Cloud Platform (GCP) using Dumpling, you can use the following command:

./dumpling -u test -p ***** -P 4000 -h 127.0.0.1 \
  -o "gs://test-archival/test/employee?credentials-file=/home/credentials.json" \
  --filetype csv --database mydbops --tables-list mydbops.employee --where "age > 30" \
  -F 250MiB --compress gzip --output-filename-template 'employee_02_2022.{{.Index}}'

Here, -o "gs://test-archival/test/employee?credentials-file=/home/credentials.json" points Dumpling at the target GCS bucket and path, along with the credentials file used for authentication.

After executing the command, I verified the GCS bucket and confirmed that the backup files were successfully uploaded, ensuring the integrity and accessibility of the exported data.
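
A quick way to verify the upload is to list the target path, for example with gsutil (assuming the Google Cloud SDK is installed and authenticated):

gsutil ls -l gs://test-archival/test/employee/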

TiDB - Dumpling

Comparing the --where and --sql Options in Dumpling

When performing a backup with Dumpling, we observed a significant difference in performance when using the --where option compared to the --sql option for filtering records based on a timestamp column. Below is an analysis of our findings.

Using the --where Option

./dumpling -u user_ro -p -P 4000 -h 10.24.40.67 \
  -o "gs://archival/reserve/2022?credentials-file=/home/credentials.json" \
  --filetype csv --database order --tables-list order.items \
  --where "updated_at between '2022-01-01 00:00:00' AND '2022-02-01 00:00:00'" \
  -F 250MiB --compress gzip --output-filename-template 'items_01_2022.{{.Index}}'

  • When utilizing the --where clause to export filtered data, Dumpling initiated a full table scan.
  • The secondary index on the updated_at column was not utilized during this process, leading to inefficiencies.
  • The result was a prolonged export time of 50 minutes for 130,000 records.
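
Before passing a hinted query to --sql (shown in the next section), it can be worth checking the execution plan directly in TiDB. This is only a verification step and the exact plan depends on your schema and statistics; the schema name is quoted here because ORDER is a reserved word:

-- Verify that the optimizer can use the secondary index with the hint
EXPLAIN SELECT * FROM `order`.items USE INDEX(updated_at)
WHERE updated_at BETWEEN '2022-01-01 00:00:00' AND '2022-02-01 00:00:00';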

Using the --sql Option

./dumpling -u user_ro -p -P 4000 -h 10.24.40.67 \
  -o "gs://archival/reserve/2022?credentials-file=/home/credentials.json" \
  --filetype csv --database order --tables-list order.items \
  --sql "select * from order.items use index(updated_at) where updated_at between '2022-01-01 00:00:00' AND '2022-02-01 00:00:00'" \
  -F 250MiB --compress gzip --output-filename-template 'items_02_2022.{{.Index}}'

  • The --sql option lets you pass an actual SQL query with filter conditions, enabling the use of secondary indexes. In this case, Dumpling used the index on the updated_at column, dramatically improving performance compared to the --where option.

Option    Index usage                                 Export size / records    Export time
------    -----------                                 ---------------------    -----------
--where   No secondary index (full table scan)        83 MB / 130,000          50 minutes
--sql     Secondary index on updated_at used          83 MB / 130,000          5 seconds

Real-World Use Cases

  • Backup Strategies: Dumpling is a valuable tool for creating regular backups of your TiDB databases. You can schedule recurring exports, for example with cron as sketched below, to keep your backups up to date.
  • Migration Scenarios: If you’re migrating data between TiDB clusters or from other databases, Dumpling facilitates a smooth transition by exporting data in a compatible format.
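
As a minimal illustration, a nightly export could be driven by a small wrapper script and a cron entry. The paths, user, password placeholder, and schedule below are assumptions to adapt to your environment:

#!/bin/bash
# /usr/local/bin/tidb-dump.sh (hypothetical wrapper): nightly logical backup
set -euo pipefail
BACKUP_DIR=/backups/tidb/$(date +%F)
/usr/local/bin/dumpling -u backup_user -p '*****' -P 4000 -h 127.0.0.1 \
  -o "$BACKUP_DIR" --filetype sql -t 8 -F 256MiB --compress gzip

# Crontab entry (crontab -e) to run the wrapper every night at 01:30:
# 30 1 * * * /usr/local/bin/tidb-dump.sh >> /var/log/dumpling-backup.log 2>&1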

Dumpling is a versatile and efficient tool for TiDB data exports, ideal for backups and migrations. Its ability to handle large datasets and optimize export speeds, especially when using the --sql option, makes Dumpling an essential component for streamlining your TiDB data management strategies.

Looking to enhance your TiDB experience? Mydbops offers expert TiDB consulting services tailored to your needs. Our team can help you implement Dumpling for efficient data management, optimize your database performance, and ensure secure backup strategies. Get in touch with us today!


About the Author

Mydbops
