We have been working with PMM for quite a long time; we do most of the performance analysis for most of our clients with PMM. It is also flexible enough that we have built our own custom dashboards on top of it. PMM has many advantages:
- Easy to deploy (docker based)
- Flexible
- Customizable
- Query Analytics
- One-stop solution for MySQL, MongoDB, ProxySQL & PostgreSQL
- Orchestrator
- Rich and deep metrics stats
It is highly recommended for any production deployment; it is equivalent to an enterprise-grade monitoring and graphing tool.
Recently we were working with a client on MySQL consulting to scale for the peak sale of the year, and we deployed PMM to view the performance insights of the databases.
We started onboarding a considerable number of servers under PMM, and everyone liked the way PMM was able to project the performance insights with a beautiful dashboard over a period of time.
Need for HA and Backup of PMM server
When PMM started covering a broader group of servers, the question of HA for the PMM server was raised; of course, it is good to have high availability wherever possible.
What if the server crashes due to hardware failure?
- Does PMM support redundancy?
- Is PMM server a single point of failure (SPOF) ?
- What would be the backup strategy?
We searched for many options to achieve HA with the containerized (Docker) PMM server. We explored:
- lsyncd – an rsync-based daemon that copies incremental changes at the block level; it failed when we tried to sync the entire /var/lib/docker and bring the Docker services back up
- DRBD – would work, but it adds complexity to setup and maintenance
The Accidental HA solution:
As the name suggests, it was accidental. Let's see how it's done. We had the below servers, running on Debian 9 (Stretch) with the same version of Docker.
The requirement here is to sync the metrics data between the source (live.mydbops.com) and the destination (livebackup.mydbops.com). On the source server we had a Prometheus data set of around 178G, so the initial sync took some time.
Stop the PMM server on the source.
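A minimal sketch, assuming the container is named pmm-server as in the standard Percona setup:

```
# Stop the PMM server container so the Prometheus data files
# are consistent while we copy them
docker stop pmm-server
```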
Copy the existing data out of the Docker volume.
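Something along these lines, assuming the default PMM 1.x pmm-data container and its standard volume paths (/backup is a hypothetical staging directory):

```
# Copy each volume out of the data container to a staging area
docker cp pmm-data:/opt/prometheus/data /backup/prometheus-data
docker cp pmm-data:/opt/consul-data /backup/consul-data
docker cp pmm-data:/var/lib/mysql /backup/mysql-data
docker cp pmm-data:/var/lib/grafana /backup/grafana-data
```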
Once the initial copy is done, start the source PMM server back up.
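Again assuming the standard container name:

```
# Bring the source PMM server back online once the copy completes
docker start pmm-server
```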
Next, to transfer the copied data to the destination server, I used SCP.
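The paths and user are assumptions; the idea is simply a recursive copy of the staged data to the destination host named earlier:

```
# Push the staged copy to the destination server
scp -r /backup root@livebackup.mydbops.com:/backup
```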
Now, on the destination server, make sure to have the same version of the PMM server as the source; I used version “1.17-1” here. Since the monitoring server used for testing did not have internet access enabled, I had to do an offline load of the PMM image, as below.
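A sketch of the offline load, assuming the image tag percona/pmm-server:1.17.1 corresponds to the “1.17-1” version mentioned above. On a machine with internet access:

```
# Pull and export the image on a host that can reach Docker Hub
docker pull percona/pmm-server:1.17.1
docker save percona/pmm-server:1.17.1 | gzip > pmm-server-1.17.1.tgz
```

Then transfer the archive to the destination and load it there:

```
# Load the image on the destination (no registry access needed)
gunzip -c pmm-server-1.17.1.tgz | docker load
```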
Create the Docker data volume exactly as on the source server.
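This follows the standard PMM 1.x data-container creation documented by Percona (same image tag as above):

```
# Create the data-only container holding all PMM volumes
docker create \
  -v /opt/prometheus/data \
  -v /opt/consul-data \
  -v /var/lib/mysql \
  -v /var/lib/grafana \
  --name pmm-data \
  percona/pmm-server:1.17.1 /bin/true
```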
Once the data volume is created, proceed to copy the data back into the Docker data volume as below.
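The reverse of the earlier copy, pushing the staged data into the pmm-data container's volumes (same assumed paths):

```
# The trailing /. copies directory contents into the existing volume paths
docker cp /backup/prometheus-data/. pmm-data:/opt/prometheus/data
docker cp /backup/consul-data/. pmm-data:/opt/consul-data
docker cp /backup/mysql-data/. pmm-data:/var/lib/mysql
docker cp /backup/grafana-data/. pmm-data:/var/lib/grafana
```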
Once the data copy is done, the PMM data volume is ready to be used after a change of ownership, as below.
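The user names inside the container are assumptions based on how PMM 1.x runs its components; a throwaway container sharing the pmm-data volumes can fix the ownership:

```
# chown via temporary containers that mount the pmm-data volumes
docker run --rm --volumes-from pmm-data percona/pmm-server:1.17.1 \
  chown -R pmm:pmm /opt/prometheus/data /opt/consul-data
docker run --rm --volumes-from pmm-data percona/pmm-server:1.17.1 \
  chown -R mysql:mysql /var/lib/mysql
docker run --rm --volumes-from pmm-data percona/pmm-server:1.17.1 \
  chown -R grafana:grafana /var/lib/grafana
```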
Now let's proceed with the final step: running the pmm-server container.
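This is the standard PMM 1.x run command; the ports and options must match whatever was used on the source:

```
# Must be the same run command as on the source server
docker run -d \
  -p 80:80 \
  --volumes-from pmm-data \
  --name pmm-server \
  --restart always \
  percona/pmm-server:1.17.1
```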
Note that this command should be exactly the same as the one executed on the source. After a few moments the PMM service started on the destination box, and we could see the live metrics data polling in (graph below).
We wanted to check the data sync between the source and destination, so we decided to let it run for a couple more days.
After this test run period, we verified the graphs between source and destination, and they appeared exactly the same. I am sharing a sample graph for a DB server (172.31.11.11) from both source and destination.
Graph from destination:
Graph from source:
How does it work?
The first time we saw data being polled into the backup, we were astonished and curious to know how it was actually working. We had a long discussion with our expert team on this.
What follows is just a conceptual understanding to explain this scenario. Let's go back to the official architecture of the PMM server by Percona, as below.
As the above architecture depicts, PMM has two parts: QAN (Query Analytics) and MM (Metrics Monitor). The Metrics Monitor side is where we have achieved the redundancy.
Metrics Monitor works on a “pull” mechanism, i.e., Prometheus collects the metrics data from the registered exporters. Since we duplicated the destination server with the same registered exporters as the source, the destination makes one more pull from the DB servers and stores the data in its own data volume. What we have achieved is shown in the architecture below.
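You can see the pull model in action by mimicking a scrape by hand; 42002 is the default mysqld_exporter port in PMM 1.x, and the host plus plain, unauthenticated HTTP are assumptions:

```
# Fetch the exporter's metrics endpoint, exactly as Prometheus does
curl -s http://172.31.11.11:42002/metrics | head
```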
To verify this, we decided to capture packets on the DB machine, and we could see scrapes coming in from both PMM servers.
TCP packet analysis via tcpdump:
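Something like the following on the DB host; the interface name is an assumption, and 42002 is again the default mysqld_exporter port. Traffic should appear from both PMM servers:

```
# Watch incoming scrape traffic to the mysqld_exporter port
sudo tcpdump -i eth0 -nn 'tcp port 42002'
```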
Key Takeaways:
- This works well for replicating exactly the same set of servers already registered with the source. If a new server is added, I believe the same steps have to be followed again.
- This is not officially supported, though it works and fulfills the requirement.
- QAN HA is not done here, but it can be done: the PMM server uses MySQL internally to store the QAN data, so it can be replicated by exposing the MySQL port (see the sketch after this list).
- There can be minor differences or deviations in the metrics data, since the times at which the two servers poll the data may differ.
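We did not test QAN replication; purely as a hypothetical sketch, publishing the internal MySQL port when starting the source PMM server would allow standard MySQL replication to be configured between the two PMM servers:

```
# Hypothetical: publish the internal MySQL port alongside the usual options
docker run -d \
  -p 80:80 \
  -p 3306:3306 \
  --volumes-from pmm-data \
  --name pmm-server \
  --restart always \
  percona/pmm-server:1.17.1
```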
As said earlier, it is good to have high availability for PMM, as it plays a critical role.