In our work, we often receive requirements for replicating data from one data source to another. Our team has provided solutions to replicate data from MySQL to Vertica, Amazon Redshift, and Hadoop. Of these, Amazon Redshift replication is a bit more complicated, as Amazon Redshift is a Database as a Service (DBaaS) and the process is not straightforward.
So I am taking this opportunity to explain how to replicate a specific set of tables from MySQL to AWS Redshift using Tungsten Replicator.
1.0. Tungsten Replicator:
Tungsten Replicator is an open-source replication engine that supports data extraction from MySQL and MySQL variants such as RDS, Percona Server, and MariaDB, as well as Oracle, and allows the extracted data to be applied to other data sources such as Vertica, Cassandra, and Redshift.
Tungsten Replicator includes support for parallel replication and advanced topologies such as fan-in and multi-master, and it can be used efficiently in cross-site deployments.
1.1.0. General Architecture:
There are three major components in Tungsten Replicator:
1. Extractor / Master Service
2. Transaction History Log (THL)
3. Applier / Slave Service
1.1.1. Extractor / Master Service:
The extractor component reads data from MySQL’s binary log and writes that information into the Transaction History Log (THL).
1.1.2. Transaction History Log (THL):
The Transaction History Log (THL) acts as a translator between two different data sources. It stores transactional data from different data servers in a universal format, using the replicator service acting as a master, which can then be processed by the Applier / Slave service.
1.1.3. Applier / Slave Service:
All the raw row data recorded in the THL is re-assembled or constructed into another format such as JSON, BSON, or external CSV formats that enable the data to be loaded in bulk batches into a variety of different targets.
Statement-based information is therefore not supported for heterogeneous deployments, so it is mandatory that the binary log format on MySQL is ROW.
1.2.0. Prerequisites:
1.2.1. Server Packages:
JDK 7 or higher
Ant 1.8 or higher
Ruby
Net-SSH
1.2.2. MySQL:
All the tables to be replicated must have a primary key.
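Tables missing a primary key can be identified up front with a query against information_schema (shown here for the demo database new_year):
SELECT t.table_schema, t.table_name
FROM information_schema.tables t
LEFT JOIN information_schema.table_constraints tc
  ON tc.table_schema = t.table_schema
  AND tc.table_name = t.table_name
  AND tc.constraint_type = 'PRIMARY KEY'
WHERE t.table_schema = 'new_year'
  AND t.table_type = 'BASE TABLE'
  AND tc.constraint_name IS NULL;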
The following MySQL configuration should be enabled:
binlog-format = row
binlog-row-image = full
collation-server = utf8_general_ci
character-set-server = utf8
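These settings belong in the [mysqld] section of my.cnf and require a server restart to take effect. They can then be verified from a client session:
SHOW VARIABLES WHERE Variable_name IN
('binlog_format', 'binlog_row_image', 'character_set_server', 'collation_server');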
1.2.3. Redshift:
The Redshift database name and schema name should be the same as the MySQL database name of the tables to be replicated.
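For the demo database new_year used later in this post, the matching schema can be created on Redshift as follows:
CREATE SCHEMA IF NOT EXISTS new_year;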
1.2.4. S3 Bucket:
Read and write access to an AWS S3 bucket (an access key and secret key are required).
2.0. Requirement:
Consider that the servers with the below details are used for the demo:
AWS EC2 MySQL Server – 172.19.12.234
AWS Redshift – 172.19.12.116 (Database as a Service)
AWS S3 bucket – s3://mydbops-migration
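Access to the S3 bucket can be sanity-checked with the AWS CLI, assuming it is installed and configured with the access key and secret key:
# aws s3 ls s3://mydbops-migration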
As Redshift is a Database as a Service, we just have an endpoint to connect to. Therefore, we will be installing both the Tungsten master and slave services on the MySQL server itself.
We need to replicate the tables emp and emp_records from the new_year database on the MySQL server to Redshift. The structures of the tables are sketched below.
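Based on the sample rows used in the testing section, the tables are roughly along these lines (the column types here are assumptions):
CREATE TABLE emp (
  no INT NOT NULL PRIMARY KEY,
  city VARCHAR(50),
  state VARCHAR(50)
);
CREATE TABLE emp_records (
  no INT NOT NULL PRIMARY KEY,
  name VARCHAR(50),
  address VARCHAR(100)
);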
Once the configuration is prepared, we can install it using tpm.
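The exact tpm configure options depend on the Tungsten version and environment; as a minimal sketch (the replication user and password here are placeholders), the master (extractor) service could be staged like this:
# ./tools/tpm configure master \
  --install-directory=/opt/master \
  --enable-heterogeneous-service=true \
  --members=172.19.12.234 \
  --master=172.19.12.234 \
  --replication-user=tungsten \
  --replication-password=secret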
#./tools/tpm install
Configuration is now complete. For further information, please consult
Tungsten documentation, which is available at docs.continuent.com.
NOTE >> Command successfully completed
The master service is now installed under /opt/master/.
Similarly, once the slave configuration is prepared, we can install it using tpm.
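Again, only as a sketch (the Redshift credentials and database name are placeholders; consult docs.continuent.com for the options supported by your version), the slave (applier) service could be staged like this:
# ./tools/tpm configure slave \
  --install-directory=/opt/slave \
  --enable-heterogeneous-service=true \
  --members=172.19.12.234 \
  --master=172.19.12.234 \
  --datasource-type=redshift \
  --replication-host=172.19.12.116 \
  --replication-user=redshift_user \
  --replication-password=secret \
  --redshift-dbname=new_year \
  --batch-enabled=true \
  --batch-load-template=redshift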
#./tools/tpm install
Configuration is now complete. For further information, please consult
Tungsten documentation, which is available at docs.continuent.com.
NOTE >> Command successfully completed
Once it completes, copy the s3-config-slave.json file to the slave's share directory.
#cp s3-config-slave.json /opt/slave/share/
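For reference, this file carries the S3 details that the applier uses for staging CSV files; a minimal example (the keys are placeholders) looks roughly like this:
{
  "awsS3Path": "s3://mydbops-migration",
  "awsAccessKey": "<access-key>",
  "awsSecretKey": "<secret-key>",
  "cleanUpS3Files": "true"
}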
Now the slave is configured. Before starting it, we need to create the worker/stage tables used by Tungsten to replicate data to Redshift.
3.5. Generating the Worker / Stage Tables on Redshift:
Tungsten provides a utility named ddlscan to generate the Worker / Stage tables required for the replication functionality to work.
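As a sketch (the connection details are assumptions), the staging DDL for the new_year schema can be generated with the Redshift staging template and saved for execution on Redshift:
# /opt/slave/tungsten/tungsten-replicator/bin/ddlscan -user tungsten -pass secret \
  -url jdbc:mysql:thin://172.19.12.234:3306/new_year -db new_year \
  -template ddl-mysql-redshift-staging.vm > staging_ddl.sql
Once the generated DDL has been applied on Redshift, the slave service status can be checked: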
#/opt/slave/tungsten/tungsten-replicator/bin/trepctl services
NAME VALUE
---- -----
appliedLastSeqno: -1
appliedLatency : -1.0
role : slave
serviceName : slave
serviceType : local
started : true
state : ONLINE
Finished services command...
If the slave did not start properly, refer to the error log at /opt/slave/service_logs/trepsvc.log.
4.0. Testing:
Now both master and slave are in sync. I am going to insert a few records into the emp and emp_records tables on the MySQL server.
insert into emp values (1,'chennai','tamilnadu');
insert into emp values (2,'Bangalore','Karnataka');
insert into emp_records values (1,'suresh','North car street');
insert into emp_records values (2,'John','South car street');
The above records were inserted on the master server. At the same time, I checked on Redshift whether these records had been replicated.
new_year=# select * from new_year.emp;
 no |   city    |   state
----+-----------+-----------
  1 | chennai   | tamilnadu
  2 | Bangalore | Karnataka
(2 rows)
new_year=# select * from new_year.emp_records;
 no |  name  |     address
----+--------+------------------
  1 | suresh | North car street
  2 | John   | South car street
(2 rows)
5.0. Troubleshooting:
Replication can break due to incorrect data types. In such scenarios, we should analyze the issue, fix the data type, and resume replication.
Sample error:
# /opt/slave/tungsten/tungsten-replicator/bin/trepctl status
Processing status command...
NAME VALUE
---- -----
appliedLastEventId : NONE
appliedLastSeqno : -1
appliedLatency : -1.0
autoRecoveryEnabled : false
autoRecoveryTotal : 0
channels : -1
clusterName : slave
currentEventId : NONE
currentTimeMillis : 1526577299571
dataServerHost : 172.25.12.119
extensions :
host : 172.25.12.119
latestEpochNumber : -1
masterConnectUri : thl://mysql-db-master:2112/
masterListenUri : null
maximumStoredSeqNo : -1
minimumStoredSeqNo : -1
offlineRequests : NONE
pendingError : Stage task failed: stage=q-to-dbms seqno=75 fragno=0
pendingErrorCode : NONE
pendingErrorEventId : mysql-bin.000027:0000000000072461;-1
pendingErrorSeqno : 75
pendingExceptionMessage: CSV loading failed: schema=new table=doc_content CSV file=/opt/slave/tmp/staging/slave/staging0/yp-yp_user_doc_content-69.csv message=Wrapped org.postgresql.util.PSQLException: ERROR: Value too long for character type
Detail:
-----------------------------------------------
error: Value too long for character type
code: 8001
context: Value too long for type character varying(256)
query: 1566568
location: funcs_string.hpp:395
process: query0_75_1566568 [pid=10475]
-----------------------------------------------
(/opt/slave/tungsten/tungsten-replicator/appliers/batch/redshift.js#256)
pipelineSource : UNKNOWN
relativeLatency : -1.0
This error explains that a value is too long for the character data type of the table doc_content in the new database on Redshift.
In MySQL, the table doc_content has a column "content" with the TEXT data type.
In Redshift, content is also a TEXT column.
Here is the catch: in Redshift, the TEXT data type is implemented as VARCHAR(256).
So writing anything longer than 256 characters on MySQL will break replication.
The solution is to increase the VARCHAR length, but changing the data type in place does not work in Redshift:
yp=# alter table new.doc_content ALTER COLUMN content TYPE varchar(2000);
ERROR: ALTER COLUMN TYPE is not supported
We cannot increase the column size in Redshift without recreating the table.
The alternative is to add a new column with the required length, copy the data over, drop the old column, and rename the new one:
ALTER TABLE yp.yp_user_doc_content ADD COLUMN content_new VARCHAR(2000);
UPDATE yp.yp_user_doc_content SET content_new = content;
ALTER TABLE yp.yp_user_doc_content DROP COLUMN content;
ALTER TABLE yp.yp_user_doc_content RENAME COLUMN content_new TO content;
Now we are good to restart replication.
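For example, the slave replicator can be brought back online with trepctl:
# /opt/slave/tungsten/tungsten-replicator/bin/trepctl online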
6.0. Conclusion:
Tungsten Replicator is a great tool when it comes to replicating data between heterogeneous data sources. Once we understand how it works, it is easy to configure and operate.