Efficient storage management is vital in TiDB and other distributed systems to maintain performance and optimize resource utilization. A key mechanism to achieve this is Garbage Collection (GC). TiDB's GC process plays a crucial role in cleaning up expired data, obsolete versions, and tombstones, ensuring optimal disk space usage, improved query performance, and database integrity.
In this blog post, we’ll explore TiDB’s GC lifecycle in depth—how it works, the available configuration options, and best practices to fine-tune it for a smooth and efficient TiDB instance.
What is Garbage Collection in TiDB?
Garbage Collection (GC) in TiDB is the process of removing outdated, obsolete, or expired data from TiKV, the underlying distributed key-value store powering TiDB. Built on top of TiKV, TiDB uses multi-version concurrency control (MVCC) to handle concurrent transactions. This approach, while effective, can result in older data versions persisting in the system even when they are no longer required for queries or transactional consistency.
The primary goal of GC is to clean up these outdated data versions to:
- Free up storage space
- Prevent excessive disk usage
- Retain only relevant and live data
- Optimize query performance by removing stale data
At the core of TiDB’s GC process is a GC safe point, a timestamp that marks the boundary for data eligible for cleanup. This ensures that only data no longer needed for queries or transactional integrity is removed, maintaining the system's efficiency and stability.
How Does GC Work in TiDB?
The GC process in TiDB is closely tied to TiKV’s MVCC mechanism. When a transaction is executed in TiDB, multiple versions of the data may exist over time. TiDB uses timestamps to track the versions, and each version of a key has an associated commit timestamp. The garbage collection mechanism relies on the following steps:
Step 1: Timestamp and GC Safe Point
Every time a transaction is committed in TiDB, a commit timestamp is assigned to it. These timestamps are then used for two purposes:
- Query Visibility: The commit timestamp determines whether a version of data is visible to a particular transaction or query.
- GC Safe Point: The GC safe point is a timestamp that marks the cut-off for GC. Versions committed before the safe point are candidates for deletion, with one exception: the newest version of each key at or before the safe point is kept, so that a read at the safe point still sees a value.
The GC process periodically runs in the background to remove the data that is older than the current GC safe point. The safe point is automatically updated by TiDB at regular intervals. This timestamp-driven approach allows TiDB to maintain optimal storage efficiency while ensuring high performance and consistency for queries.
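To make the timestamp logic above concrete, here is a minimal sketch of how a safe point could be derived. It assumes the safe point is the current time minus the GC lifetime, held back so it never passes the start of a transaction that is still running. The function names and the use of plain datetime values (rather than TiDB's internal timestamps) are illustrative assumptions, not TiDB's actual code.

```python
from datetime import datetime, timedelta
from typing import Iterable


def compute_safe_point(
    now: datetime,
    life_time: timedelta,
    active_txn_starts: Iterable[datetime] = (),
) -> datetime:
    """Return the cut-off before which old versions may be collected."""
    candidate = now - life_time
    # Assumption: the safe point never moves past a transaction that is
    # still running, so that transaction's snapshot stays readable.
    return min([candidate, *active_txn_starts])


def is_visible(commit_ts: datetime, read_ts: datetime) -> bool:
    """A version is visible to a reader whose snapshot is at or after its commit."""
    return commit_ts <= read_ts


now = datetime(2024, 1, 1, 12, 0, 0)
print(compute_safe_point(now, timedelta(minutes=10)))  # 2024-01-01 11:50:00
```

With a transaction that started at 11:45 still open, the same call returns 11:45 instead, which is why long-running transactions can delay cleanup.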
Step 2: GC Cycle
The GC (Garbage Collection) cycle in TiDB is a systematic process designed to identify and remove outdated data while ensuring the integrity of live data. Here’s how it works:
1. Marking Old Data
The GC process begins by scanning for data versions with commit timestamps older than the current GC safe point. These outdated versions are no longer needed for transactional consistency or queries.
2. Deleting Obsolete Versions
Once identified, the obsolete data versions are deleted from TiKV storage, TiDB's distributed key-value store. This step is crucial for freeing up storage space and maintaining system efficiency.
3. Triggering a New Safe Point
After completing the cleanup, TiDB recalculates and updates the GC safe point. This updated safe point ensures that subsequent GC cycles can safely remove older versions while preserving relevant and live data for ongoing transactions.
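The cycle above can be sketched on a toy in-memory store. This is a simplified model, not TiKV's implementation: timestamps are plain integers, the store is a dictionary, and the run_gc name is hypothetical. It also assumes that the newest version at or before the safe point is kept for each key, so that a read at the safe point still finds a value.

```python
from typing import Dict, List, Tuple

Version = Tuple[int, str]  # (commit_ts, value)


def run_gc(store: Dict[str, List[Version]], safe_point: int) -> int:
    """Remove versions that no snapshot at or after safe_point can read.

    Returns the number of versions removed.
    """
    removed = 0
    for key, versions in store.items():
        ordered = sorted(versions, key=lambda v: v[0])  # oldest first
        # Step 1: mark versions committed at or before the safe point.
        old = [v for v in ordered if v[0] <= safe_point]
        new = [v for v in ordered if v[0] > safe_point]
        # Step 2: delete all marked versions except the newest one.
        keep = old[-1:] + new
        removed += len(ordered) - len(keep)
        store[key] = keep
    return removed


store = {
    "a": [(10, "v1"), (20, "v2"), (30, "v3")],
    "b": [(5, "x")],
}
print(run_gc(store, safe_point=25))  # 1
print(store["a"])  # [(20, 'v2'), (30, 'v3')]
```

Key "b" keeps its only version even though it is older than the safe point, because it is still the live value.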
Step 3: Tombstones
In TiDB, when data is deleted or updated, the previous version is not immediately removed. An update writes a new version alongside the old one, and a delete writes a tombstone: a marker version signaling that the key is no longer valid from that timestamp onward.
Why Use Tombstones?
- Consistency and Durability: Tombstones allow TiDB to maintain transactional consistency by ensuring that a transaction can still access the original version of the data until the GC cycle completes.
- Deferred Deletion: The actual removal of data marked as a tombstone occurs during the Garbage Collection (GC) cycle, providing a controlled cleanup process.
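A small sketch may help show why tombstones allow deferred deletion. It models a tombstone as a version whose value is None, which is an assumption made for illustration. The read and gc_key helpers are hypothetical names, not TiDB or TiKV APIs.

```python
from typing import List, Optional, Tuple

TOMBSTONE = None  # assumption: a delete is stored as a version with no value
Version = Tuple[int, Optional[str]]  # (commit_ts, value or TOMBSTONE)


def read(versions: List[Version], read_ts: int) -> Optional[str]:
    """Return the value visible at read_ts, or None if absent or deleted."""
    visible = [v for v in versions if v[0] <= read_ts]
    if not visible:
        return None
    return max(visible, key=lambda v: v[0])[1]


def gc_key(versions: List[Version], safe_point: int) -> List[Version]:
    """Collect one key: keep the newest old version unless it is a tombstone."""
    ordered = sorted(versions, key=lambda v: v[0])
    old = [v for v in ordered if v[0] <= safe_point]
    new = [v for v in ordered if v[0] > safe_point]
    keep = old[-1:]
    if keep and keep[0][1] is TOMBSTONE:
        # Everything older is gone, so the delete marker hides nothing.
        keep = []
    return keep + new


history = [(10, "alice"), (20, TOMBSTONE)]
print(read(history, 15))    # alice
print(read(history, 25))    # None
print(gc_key(history, 30))  # []
```

A reader whose snapshot is at timestamp 15 still sees the original value until GC passes the tombstone, after which the key disappears from storage entirely.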
Step 4: GC Retention Time
The GC retention time in TiDB specifies the minimum duration for which data must be retained before it becomes eligible for deletion. This configuration ensures that data versions are kept long enough to satisfy the requirements for transactional consistency and durability.
Key Features of GC Retention Time
Configurable Period:
- The retention time can be adjusted to fit the needs of the system.
- For example, with a retention time of 3600s (1 hour), data versions older than one hour become eligible for garbage collection.
Impact on Transactions:
- Retention time ensures that older data is preserved for a sufficient period to handle long-running transactions or queries.
- This minimizes the risk of inconsistencies when accessing historical data.
Balance Between Performance and Durability:
- A shorter retention period reduces storage usage, but long-running transactions or historical reads may find that the versions they need have already been collected.
- A longer retention period ensures data availability but can lead to higher disk usage.
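The trade-off above can be put into rough numbers. The sketch below is a back-of-the-envelope model under stated assumptions: a snapshot is readable only if it is no older than the current time minus the retention time, and every update leaves one extra version behind until GC removes it. Both function names and the 500 updates per second figure are hypothetical.

```python
from datetime import datetime, timedelta


def snapshot_still_readable(
    snapshot_time: datetime, now: datetime, retention: timedelta
) -> bool:
    """True if a read at snapshot_time falls inside the retention window."""
    return snapshot_time >= now - retention


def estimated_extra_versions(updates_per_second: float, retention: timedelta) -> int:
    """Rough count of historical versions kept alive by the retention window."""
    return int(updates_per_second * retention.total_seconds())


now = datetime(2024, 1, 1, 12, 0, 0)
long_query_start = datetime(2024, 1, 1, 11, 45, 0)  # a query running for 15 minutes

print(snapshot_still_readable(long_query_start, now, timedelta(minutes=10)))  # False
print(snapshot_still_readable(long_query_start, now, timedelta(hours=1)))     # True
print(estimated_extra_versions(500, timedelta(minutes=10)))  # 300000
print(estimated_extra_versions(500, timedelta(hours=1)))     # 1800000
```

In this model, moving from 10 minutes to 1 hour lets the 15-minute query finish, at the cost of holding roughly six times as many old versions on disk.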
Configuring and Tuning GC in TiDB
Key Parameters
TiDB allows you to configure the GC behavior to suit your specific needs. Below are the key parameters for tuning GC:
tikv_gc_life_time
This parameter controls how long TiDB keeps historical data versions. In TiDB v5.0 and later, it is exposed as the system variable tidb_gc_life_time. By default, it's set to 10m (10 minutes), but it can be adjusted based on your workload. A longer GC lifetime may be necessary for systems with long-running transactions or queries, while a shorter GC lifetime frees up disk space more aggressively.
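tikv_gc_mode
TiDB has supported two GC modes:
- Distributed: TiDB uploads the safe point to PD, and each TiKV node performs GC on its own data. This has been the default since TiDB 3.0.
- Central: The GC leader on the TiDB side sends GC requests to each Region. This older mode is no longer supported as of TiDB 5.0.
In either mode, GC runs automatically in the background. To pause it, disable GC with the tidb_gc_enable system variable rather than switching modes.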
tikv_gc_run_interval
This parameter dictates how often TiDB runs the GC process, specifying the interval between consecutive GC cycles. In TiDB v5.0 and later, it is exposed as the system variable tidb_gc_run_interval. The default value is 10m0s (10 minutes), but you can raise it based on your workload and available system resources.
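tikv_gc_concurrency
This parameter determines the number of threads involved in the GC process. In TiDB v5.0 and later, it is exposed as the system variable tidb_gc_concurrency, whose default of -1 lets TiDB choose the thread count automatically. Increasing the number of threads can speed up GC, but it should be balanced against system resource usage to avoid affecting foreground workloads.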
Example GC configuration commands
Configuring GC parameters in TiDB allows you to optimize the garbage collection process for your specific workload and performance needs. Below is an example of how you can adjust key GC settings:
Commands to Configure GC Parameters
Set GC Retention Time to 30 Minutes
The retention time determines how long data versions are retained before becoming eligible for garbage collection. In TiDB v5.0 and later, it is set through the tidb_gc_life_time system variable.
SET GLOBAL tidb_gc_life_time = '30m';
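Adjust the GC Interval to 15 Minutes
The GC interval defines how often the garbage collection process runs in TiDB. In TiDB v5.0 and later, it is set through the tidb_gc_run_interval system variable, which does not accept values below 10 minutes.
SET GLOBAL tidb_gc_run_interval = '15m';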
Set the Number of GC Workers to 4
The number of GC workers determines how many threads are allocated to the garbage collection process. In TiDB v5.0 and later, it is set through the tidb_gc_concurrency system variable.
SET GLOBAL tidb_gc_concurrency = 4;
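If these settings are applied from a script, it can help to validate them before sending anything to the cluster. The following sketch only builds SQL strings and does not connect to a database. It assumes TiDB v5.0 or later, where the settings are the system variables tidb_gc_life_time, tidb_gc_run_interval, and tidb_gc_concurrency. The 10-minute lower bound and the helper names parse_duration and gc_statements are assumptions for illustration; check the limits for your TiDB version.

```python
import re
from datetime import timedelta
from typing import List

_UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}
MIN_DURATION = timedelta(minutes=10)  # assumed lower bound; verify for your version


def parse_duration(text: str) -> timedelta:
    """Parse a simple Go-style duration such as '10m0s', '30m' or '1h30m'."""
    parts = re.findall(r"(\d+)([hms])", text)
    if not parts or "".join(n + u for n, u in parts) != text:
        raise ValueError(f"unsupported duration: {text!r}")
    return timedelta(seconds=sum(int(n) * _UNIT_SECONDS[u] for n, u in parts))


def gc_statements(life_time: str, run_interval: str, concurrency: int) -> List[str]:
    """Validate the settings and return the SQL statements that apply them."""
    for name, value in (("life_time", life_time), ("run_interval", run_interval)):
        if parse_duration(value) < MIN_DURATION:
            raise ValueError(f"{name} {value!r} is below the assumed 10m minimum")
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    return [
        f"SET GLOBAL tidb_gc_life_time = '{life_time}';",
        f"SET GLOBAL tidb_gc_run_interval = '{run_interval}';",
        f"SET GLOBAL tidb_gc_concurrency = {concurrency};",
    ]


for statement in gc_statements("30m", "15m", 4):
    print(statement)
```

Rejecting a value such as '15s' in the script gives a clear error early, instead of relying on the server to refuse or adjust it.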
Best Practices for GC in TiDB
Use the following query to monitor GC progress, including the current safe point and the last run time, and to identify any issues, such as delayed GC or excessive disk usage:
SELECT VARIABLE_NAME, VARIABLE_VALUE FROM mysql.tidb WHERE VARIABLE_NAME LIKE 'tikv_gc%';

The GC lifecycle in TiDB is vital for maintaining efficient storage management and optimal system performance. Understanding how it works, its key components, and how to configure it can help you prevent disk usage issues and keep query performance consistent. Regular monitoring and fine-tuning are essential to adapt GC to your workload.
For expert assistance in managing and optimizing your TiDB environment, explore Mydbops TiDB Consulting Services and leverage our expertise to achieve peak performance.