In our work, we often receive requirements for replicating data from one data source to another. Our team has provided solutions to replicate data from MySQL to Vertica, Amazon Redshift, and Hadoop. Of these, Amazon Redshift replication is a bit more complicated, as Amazon Redshift is a Database as a Service (DBaaS) and the process is not straightforward.
So I am taking this opportunity to explain how to replicate a specific set of tables from MySQL to AWS Redshift using Tungsten Replicator.
1.0. Tungsten Replicator:
Tungsten Replicator is an open-source replication engine that supports data extraction from MySQL and MySQL variants such as RDS, Percona Server, and MariaDB, as well as Oracle, and allows the extracted data to be applied to other data sources such as Vertica, Cassandra, and Redshift.
Tungsten Replicator includes support for parallel replication and advanced topologies such as fan-in and multi-master, and it can be used efficiently in cross-site deployments.
1.1.0. General Architecture:
There are three major components in Tungsten Replicator:
1. Extractor / Master Service
2. Transaction History Log (THL)
3. Applier / Slave Service
1.1.1. Extractor / Master Service:
The extractor component reads data from MySQL’s binary log and writes that information into the Transaction History Log (THL).
1.1.2. Transaction History Log (THL):
The Transaction History Log (THL) acts as a translator between two different data sources. It stores transactional data from different data servers in a universal format, using the replicator service acting as a master, which can then be processed by the Applier / Slave service.
1.1.3. Applier / Slave Service:
All the raw row data recorded in the THL is re-assembled or constructed into another format such as JSON, BSON, or external CSV formats that enable the data to be loaded in bulk batches into a variety of different targets.
Statement-based information is therefore not supported for heterogeneous deployments, so it is mandatory that the binary log format on MySQL is ROW.
1.2.0. Prerequisites:
1.2.1. Server Packages:
JDK 7 or higher
Ant 1.8 or higher
Ruby
Net-SSH
1.2.2. MySQL:
All the tables to be replicated must have a primary key.
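Tables missing a primary key can be identified up front with a query against information_schema (shown here for the demo database new_year):
SELECT t.table_schema, t.table_name
FROM information_schema.tables t
LEFT JOIN information_schema.table_constraints tc
  ON tc.table_schema = t.table_schema
  AND tc.table_name = t.table_name
  AND tc.constraint_type = 'PRIMARY KEY'
WHERE t.table_schema = 'new_year'
  AND t.table_type = 'BASE TABLE'
  AND tc.constraint_name IS NULL;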
The following MySQL configuration should be enabled:
binlog-format = row
binlog-row-image = full
collation-server = utf8_general_ci
character-set-server = utf8
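These settings belong in the [mysqld] section of my.cnf and require a server restart to take effect. They can then be verified from a client session:
SHOW VARIABLES WHERE Variable_name IN
('binlog_format', 'binlog_row_image', 'character_set_server', 'collation_server');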
1.2.3. Redshift:
The Redshift database name and schema name should be the same as the MySQL database name of the tables to be replicated.
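For the demo database new_year used later in this post, the matching schema can be created on Redshift as follows:
CREATE SCHEMA IF NOT EXISTS new_year;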
1.2.4. S3 Bucket:
Read and write access to an AWS S3 bucket (an access key and secret key are required).
2.0. Requirement:
Consider that the servers with the below details are used for the demo:
AWS EC2 MySQL Server – 172.19.12.234
AWS Redshift – 172.19.12.116 (Database as a Service)
AWS S3 bucket – s3://mydbops-migration
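Access to the S3 bucket can be sanity-checked with the AWS CLI, assuming it is installed and configured with the access key and secret key:
# aws s3 ls s3://mydbops-migration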
As Redshift is a Database as a Service, we just have an endpoint to connect to. Therefore, we will be installing both the Tungsten master and slave services on the MySQL server itself.
We need to replicate the tables emp and emp_records from the new_year database on the MySQL server to Redshift. The structures of the tables are sketched below.
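Based on the sample rows used in the testing section, the tables are roughly along these lines (the column types here are assumptions):
CREATE TABLE emp (
  no INT NOT NULL PRIMARY KEY,
  city VARCHAR(50),
  state VARCHAR(50)
);
CREATE TABLE emp_records (
  no INT NOT NULL PRIMARY KEY,
  name VARCHAR(50),
  address VARCHAR(100)
);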
Once the configuration is prepared, we can install it using tpm.
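The exact tpm configure options depend on the Tungsten version and environment; as a minimal sketch (the replication user and password here are placeholders), the master (extractor) service could be staged like this:
# ./tools/tpm configure master \
  --install-directory=/opt/master \
  --enable-heterogeneous-service=true \
  --members=172.19.12.234 \
  --master=172.19.12.234 \
  --replication-user=tungsten \
  --replication-password=secret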
#./tools/tpm install
Configuration is now complete. For further information, please consult
Tungsten documentation, which is available at docs.continuent.com.
NOTE >> Command successfully completed
The master service is now installed under /opt/master/.
Similarly, once the slave configuration is prepared, we can install it using tpm.
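Again, only as a sketch (the Redshift credentials and database name are placeholders; consult docs.continuent.com for the options supported by your version), the slave (applier) service could be staged like this:
# ./tools/tpm configure slave \
  --install-directory=/opt/slave \
  --enable-heterogeneous-service=true \
  --members=172.19.12.234 \
  --master=172.19.12.234 \
  --datasource-type=redshift \
  --replication-host=172.19.12.116 \
  --replication-user=redshift_user \
  --replication-password=secret \
  --redshift-dbname=new_year \
  --batch-enabled=true \
  --batch-load-template=redshift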
#./tools/tpm install
Configuration is now complete. For further information, please consult
Tungsten documentation, which is available at docs.continuent.com.
NOTE >> Command successfully completed
Once it completes, copy the s3-config-slave.json file to the slave's share directory.
#cp s3-config-slave.json /opt/slave/share/
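For reference, this file carries the S3 details that the applier uses for staging CSV files; a minimal example (the keys are placeholders) looks roughly like this:
{
  "awsS3Path": "s3://mydbops-migration",
  "awsAccessKey": "<access-key>",
  "awsSecretKey": "<secret-key>",
  "cleanUpS3Files": "true"
}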
Now the slave is configured. Before starting it, we need to create the worker/stage tables used by Tungsten to replicate data to Redshift.
3.5. Generating the Worker / Stage Tables on Redshift:
Tungsten provides a utility named ddlscan to generate the Worker / Stage tables required for the replication functionality to work.
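As a sketch (the connection details are assumptions), the staging DDL for the new_year schema can be generated with the Redshift staging template and saved for execution on Redshift:
# /opt/slave/tungsten/tungsten-replicator/bin/ddlscan -user tungsten -pass secret \
  -url jdbc:mysql:thin://172.19.12.234:3306/new_year -db new_year \
  -template ddl-mysql-redshift-staging.vm > staging_ddl.sql
Once the generated DDL has been applied on Redshift, the slave service status can be checked: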
#/opt/slave/tungsten/tungsten-replicator/bin/trepctl services
NAME VALUE
---- -----
appliedLastSeqno: -1
appliedLatency : -1.0
role : slave
serviceName : slave
serviceType : local
started : true
state : ONLINE
Finished services command...
If the slave did not start properly, refer to the error log at /opt/slave/service_logs/trepsvc.log.
4.0. Testing:
Now both master and slave are in sync. I am going to insert a few records into the emp and emp_records tables on the MySQL server.
insert into emp values (1,'chennai','tamilnadu');
insert into emp values (2,'Bangalore','Karnataka');
insert into emp_records values (1,'suresh','North car street');
insert into emp_records values (2,'John','South car street');
The above records were inserted on the master server. At the same time, I checked on Redshift whether these records had been replicated.
new_year=# select * from new_year.emp;
 no |   city    |   state
----+-----------+-----------
  1 | chennai   | tamilnadu
  2 | Bangalore | Karnataka
(2 rows)
new_year=# select * from new_year.emp_records;
 no |  name  |     address
----+--------+------------------
  1 | suresh | North car street
  2 | John   | South car street
(2 rows)
5.0. Troubleshooting:
Replication can break due to incorrect data types. In such scenarios, we should analyze the issue, fix the data type, and resume replication.
Sample error:
# /opt/slave/tungsten/tungsten-replicator/bin/trepctl status
Processing status command...
NAME VALUE
---- -----
appliedLastEventId : NONE
appliedLastSeqno : -1
appliedLatency : -1.0
autoRecoveryEnabled : false
autoRecoveryTotal : 0
channels : -1
clusterName : slave
currentEventId : NONE
currentTimeMillis : 1526577299571
dataServerHost : 172.25.12.119
extensions :
host : 172.25.12.119
latestEpochNumber : -1
masterConnectUri : thl://mysql-db-master:2112/
masterListenUri : null
maximumStoredSeqNo : -1
minimumStoredSeqNo : -1
offlineRequests : NONE
pendingError : Stage task failed: stage=q-to-dbms seqno=75 fragno=0
pendingErrorCode : NONE
pendingErrorEventId : mysql-bin.000027:0000000000072461;-1
pendingErrorSeqno : 75
pendingExceptionMessage: CSV loading failed: schema=new table=doc_content CSV file=/opt/slave/tmp/staging/slave/staging0/yp-yp_user_doc_content-69.csv message=Wrapped org.postgresql.util.PSQLException: ERROR: Value too long for character type
Detail:
-----------------------------------------------
error: Value too long for character type
code: 8001
context: Value too long for type character varying(256)
query: 1566568
location: funcs_string.hpp:395
process: query0_75_1566568 [pid=10475]
-----------------------------------------------
(/opt/slave/tungsten/tungsten-replicator/appliers/batch/redshift.js#256)
pipelineSource : UNKNOWN
relativeLatency : -1.0
This error explains that a value is too long for the character data type of the table doc_content in the new database on Redshift.
In MySQL, the table doc_content has a column "content" with the TEXT data type.
In Redshift, content is also a TEXT column.
Here is the catch: in Redshift, the TEXT data type is implemented as VARCHAR(256).
So writing anything longer than 256 characters on MySQL will break replication.
The solution is to increase the VARCHAR length, but changing the data type in place does not work in Redshift:
yp=# alter table new.doc_content ALTER COLUMN content TYPE varchar(2000);
ERROR: ALTER COLUMN TYPE is not supported
We cannot increase the column size in Redshift without recreating the table.
The alternative is to add a new column with the required length, copy the data over, drop the old column, and rename the new one:
ALTER TABLE yp.yp_user_doc_content ADD COLUMN content_new VARCHAR(2000);
UPDATE yp.yp_user_doc_content SET content_new = content;
ALTER TABLE yp.yp_user_doc_content DROP COLUMN content;
ALTER TABLE yp.yp_user_doc_content RENAME COLUMN content_new TO content;
Now we are good to restart replication.
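For example, the slave replicator can be brought back online with trepctl:
# /opt/slave/tungsten/tungsten-replicator/bin/trepctl online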
6.0. Conclusion:
Tungsten Replicator is a great tool when it comes to replicating data between heterogeneous data sources. Once we understand how it works, it is easy to configure and operate.