MySQL load data infile (CSV) made faster

Mydbops
Oct 17, 2019
8 Mins to Read

Loading a large CSV file into MySQL Server using LOAD DATA INFILE is a time-consuming process, because it is single-threaded and runs as a single transaction. With modern hardware, system resources are rarely the bottleneck. At Mydbops we focus on improving the efficiency of such processes, as we value performance. MySQL introduced parallel data loading in its latest minor release, MySQL Shell 8.0.17. I was curious to test this feature and wanted to know how much it can improve CSV data loading compared to the existing single-threaded method. In this blog I am going to compare both methods.

Remember, the parallel data loading utility is available only via MySQL Shell.
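
For example, you can start MySQL Shell in JavaScript mode and connect to the server before running the import (a minimal sketch; the root user and local host are assumptions for this lab setup):

# Start MySQL Shell in JavaScript mode and connect to the local server
# (user and host are assumptions; adjust them for your environment)
$ mysqlsh root@localhost --js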

Internal Workflow:

This section describes the workflow of parallel data loading in MySQL Shell 8.0.17.

  • The importTable utility analyses the input data file
  • The importTable utility divides the large file into chunks
  • The importTable utility loads the chunks into the destination MySQL server using parallel threads
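
Note that the importTable utility uses LOAD DATA LOCAL INFILE under the hood, so the target server must have the local_infile system variable enabled, for example:

-- local_infile must be ON for the parallel import to work
-- (run as a user with privileges to set global variables)
SET GLOBAL local_infile = ON;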

Lab Environment:

Below is the server configuration that I used to test the data loading operation.

  • 4 core CPU
  • 8 GB RAM
  • SSD Disk
  • Centos 7

I have a CSV file to load; the size of the file is around 8 GB.

 
mysql-files $ ls -lrth
-rw-r--r-- 1 root _mysql 8.06G Sep 12 18:16 load_data.csv

Let’s start the test with the existing method (single-threaded LOAD DATA INFILE).
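
The destination tables must exist before the load. Here is a sketch of the kind of table definition used for both tests; the column data types below are illustrative guesses based on the column names, not the exact schema from this test:

-- Illustrative schema only; the actual data types may differ
CREATE DATABASE IF NOT EXISTS single;
CREATE TABLE single.single_load (
  user_id     INT,
  visible     TINYINT,
  name        VARCHAR(255),
  size        INT,
  latitude    DOUBLE,
  longitude   DOUBLE,
  timestamp   DATETIME,
  public      TINYINT,
  description TEXT,
  inserted    DATETIME
);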

Loading via LOAD DATA INFILE:

 
MySQL localhost:33060+ ssl osm SQL > load data infile '/usr/local/mysql/mysql-files/load_data.csv'
    into table single.single_load
    fields terminated by ',' optionally enclosed by '"'
    lines terminated by '\n'
    (user_id,visible,name,size,latitude,longitude,timestamp,public,description,inserted);

Query OK, 14244516 rows affected, 0 warnings (39 min 35.5036 sec)
Records: 14244516 Deleted: 0 Skipped: 0 Warnings: 0

Well, it took 39 minutes and 36 seconds. Now I am going to load the same data into a different table, using the new multi-threaded utility in MySQL Shell.

Loading via MySQL Shell (multi-threaded):

 
MySQL localhost osm JS > util.importTable("/usr/local/mysql/mysql-files/load_data.csv", {
    schema: "parallel", table: "parallel_load",
    columns: ["user_id","visible","name","size","latitude","longitude","timestamp","public","description","inserted"],
    dialect: "csv-unix", skipRows: 0, showProgress: true,
    fieldsOptionallyEnclosed: true, fieldsTerminatedBy: ",", linesTerminatedBy: "\n", fieldsEnclosedBy: '"',
    threads: 8, bytesPerChunk: "1G", maxRate: "2G"})

Importing from file '/usr/local/mysql/mysql-files/load_data.csv' to table `parallel`.`parallel_load` in MySQL Server at /tmp%2Fmysql.sock using 8 threads
[Worker01] parallel.parallel_load: Records: 1780564 Deleted: 0 Skipped: 0 Warnings: 0
[Worker02] parallel.parallel_load: Records: 1780564 Deleted: 0 Skipped: 0 Warnings: 0
[Worker03] parallel.parallel_load: Records: 1780564 Deleted: 0 Skipped: 0 Warnings: 0
[Worker04] parallel.parallel_load: Records: 1780564 Deleted: 0 Skipped: 0 Warnings: 0
[Worker05] parallel.parallel_load: Records: 1780564 Deleted: 0 Skipped: 0 Warnings: 0
[Worker06] parallel.parallel_load: Records: 1780564 Deleted: 0 Skipped: 0 Warnings: 0
[Worker07] parallel.parallel_load: Records: 1780564 Deleted: 0 Skipped: 0 Warnings: 0
[Worker08] parallel.parallel_load: Records: 1780564 Deleted: 0 Skipped: 0 Warnings: 0

100% (8.06 GB / 8.06 GB), 535.17 KB/s
File '/usr/local/mysql/mysql-files/load_data.csv' (8.06 GB) was imported in 6 min 30.0411 sec at 18.81 MB/s
Total rows affected in parallel.parallel_load: Records: 14244516 Deleted: 0 Skipped: 0 Warnings: 0

Yes, it took only 6 minutes and 30 seconds. That is about 6x faster than the single-threaded method, a tremendous improvement in speed.
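
As a quick sanity check, you can confirm that both tables ended up with the same row count (a minimal verification query):

-- Both counts should report 14244516 rows
SELECT (SELECT COUNT(*) FROM single.single_load)     AS single_rows,
       (SELECT COUNT(*) FROM parallel.parallel_load) AS parallel_rows;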

Options to take care of:

Below are the important options that influence the performance of data loading. Each threshold should be set to an optimal value based on the available system resources (CPU / RAM / disk IOPS); otherwise, it can degrade performance. A combined tuning example follows the option descriptions below.

  • Threads
  • BytesPerChunk
  • MaxRate

Threads:

You can define the number of parallel threads that process the data from the input file to the target server. The default value is 8 threads.

BytesPerChunk:

This defines the size of the chunk for each LOAD DATA call. Each thread processes a separate chunk during the operation. You can set this threshold based on the available cores and the file size.

MaxRate:

The maximum limit on data throughput, in bytes per second per thread. Use this option if you need to avoid saturating the network, or the I/O or CPU of the client host or target server.
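
Putting these options together, here is a sketch of an invocation tuned for this 4-core lab machine; the specific values below (threads: 4, bytesPerChunk: "512M", maxRate: "300M") are illustrative assumptions, not universal recommendations:

// Illustrative tuning for a 4-core host; adjust to your own hardware
util.importTable("/usr/local/mysql/mysql-files/load_data.csv", {
    schema: "parallel", table: "parallel_load",
    dialect: "csv-unix",
    threads: 4,             // roughly match the available CPU cores
    bytesPerChunk: "512M",  // size of each LOAD DATA chunk
    maxRate: "300M"         // per-thread throughput cap to avoid saturating disk or network
})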

Hope this blog helps you understand the difference between the existing LOAD DATA operation and the new MySQL Shell parallel data loading utility. At Mydbops, we keep testing new features of MySQL and related tools, and we will be back soon with another exciting blog.

Featured image by Joao Marcelo Marques on Unsplash


