Loading any large CSV file into MySQL server using the LOAD DATA INFILE is a time-consuming process , because it is single threaded and it is a single transaction too. But with modern hardware system resource is not a bottleneck. At Mydbops we focus on improving the efficiency of process as we value performance more. MySQL introduced the parallel load data operations in its latest minor release MySQL 8.0.17 . I had the curiosity to test this feature and wanted to know, how it can improve the data loading (CSV) compared to the existing single-threaded method . Through this blog I am going to compare the both methods .
Remember you can use the parallel data loading utility only via MySQL Shell .
Internal Work Flow :
This section describes the working flow of the parallel data loading in MySQL 8.0.17 .
- The importTable utility will analyse the input data file
- The importTable utility will divide the large file into chunks
- The importTable utility will update the chunks to destination MySQL server with parallel threads
Lab Environment :
Below is the server configuration, which I was used to test the data loading operation .
- 4 core CPU
- 8 GB RAM
- SSD Disk
- Centos 7
I have a CSV file to be loaded, the size of the file is around 8 GB .
Let’s start the test with the existing method ( single threaded LOAD DATA INFILE ) .
Loading via load data infile:
Well it took 39 minutes and 36 seconds . No, I am going to load same data on a different table , with new Muti threaded in MySQL shell utility .
With multi threaded in MySQL Shell.
Yes, It took only 6 minutes and 30 seconds. This is 6x faster than the single-threaded method and improved the speed tremendously.
Options needs to be taken care :
Below are the important options which involved in the performance of effective data loading. Every thresholds should be provided with the optimal values based on the available system resources ( CPU / RAM / Disk IOPS ) else it can degrade the performance.
- Threads
- BytesPerChunk
- MaxRate
Threads :
You can define the number of parallel threads to process the data from the input file to the target server. The default value is 8 threads
BytesPerChunk :
This defines the size of the chunk for each LOAD DATA call . All the threads will process the separate chunk during the operation. We can define the threshold based on the available core and file size .
MaxRate:
The maximum limit on data throughput in bytes per second per thread. Use this option if you need to avoid saturating the network or the I/O or CPU for the client host or target server.
Hope this blog helps to identify the difference between the existing load data operation and latest MySQL Shell utility parallel data loading operation . At Mydbops, We are keep testing the new things on MySQL and related tools, will be back soon with an exciting blog soon.
Featured image by Joao Marcelo Marques on Unsplash