Pegasaas API Upgrade — How (and why) we went from a single-server to a multi-server network.

Since February 2019, when we switched from processing HTML optimizations inside the plugin to the more powerful and dynamic method of optimizing through a central API, the number of monthly page optimizations performed by the Pegasaas API has grown dramatically: from about 1,500 in the last two weeks of February to a projected 60,000 this month. Most sites do not re-optimize their entire set of pages monthly, although some rebuild bi-weekly, or even weekly.

Planning for Growth: We’re not in Kansas Anymore, Toto.

It became obvious after April that if growth continued as it had for the previous months, we were going to need more processing power. It also became clear that we were seeing failed connections from websites hosted on slow servers geographically distant from our API server in Austin, TX, USA; if a website is hosted in the middle of Australia, it takes a little longer for the connection to be made to an API server in the USA. And because we didn't want to leave connections open indefinitely, and because we also plan to further upgrade our service offerings in the next year, we realized that we needed to develop a system where we had "nodes" deployed across the globe.
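To give a sense of what "not leaving connections open indefinitely" means in practice, here's a rough sketch of a bounded-timeout API call. Our plugin is PHP, so this Python example, along with its endpoint path and timeout values, is purely illustrative:

```python
# Sketch: a client-side call to the optimization API with a bounded timeout,
# so a distant or slow route fails fast instead of hanging. The endpoint
# path and timeout values are illustrative, not the plugin's real settings.
import requests

try:
    resp = requests.post(
        "https://api.pegasaas.io/optimize",   # hypothetical endpoint path
        json={"url": "https://example.com/"},
        timeout=(3.0, 30.0),  # 3s to connect, 30s to read a response
    )
    resp.raise_for_status()
except requests.exceptions.ConnectTimeout:
    # A geographically distant or slow route: retry later, or try another node.
    print("connection timed out")
```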

It was about that time that we decided to begin planning for the development of a distributed network of API nodes that could accept and process requests.

Building A Network: But wait!  Are those flying monkeys?!

Work began at the beginning of July and progressed quickly. Most of the heavy lifting of developing the systems to "spin up" a "node" in any one of 16 AWS data centres across the globe was completed when we received an unexpected communication from our service provider (Hostway), which runs the data centre housing our Pegasaas API server. In this email, Hostway informed us that they were retiring the data centre and would be moving all of our assets to a new "state-of-the-art" facility on one of two dates. That sounded pretty good to me until I read how long they planned to have the server offline.
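For the curious, the idea behind "spinning up" a node can be sketched with boto3. The AMI ID, tags, and instance parameters below are placeholders, not our actual provisioning code:

```python
# Sketch: "spin up" an API node in a chosen AWS region with boto3.
# The AMI ID and tags are placeholders; the real provisioning system
# is internal to Pegasaas.
import boto3

def launch_node(region: str) -> str:
    """Launch one t2.medium API node in the given region; return its instance ID."""
    ec2 = boto3.client("ec2", region_name=region)
    resp = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",  # placeholder node image
        InstanceType="t2.medium",
        MinCount=1,
        MaxCount=1,
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [{"Key": "role", "Value": "pegasaas-api-node"}],
        }],
    )
    return resp["Instances"][0]["InstanceId"]

print(launch_node("us-west-1"))
```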

How long?  Six hours.

Yeah, the implications of being offline for more than 10 minutes caused my blood pressure to rise pretty sharply. Thinking back on it, I realize that expecting no downtime in this type of scenario was unrealistic, but still… what were we going to do?

Well, the answer came to me shortly thereafter: we would need to deploy our distributed node network in less than a month. We had planned for an August/September launch anyway, but there was no hard deadline. Now there was: August 22nd at 9pm PDT.

Infrastructure Hardening: If I only had a brain, maybe I’d have thought of this scenario.

Please understand, we have nightly backups of our servers (in the event of a catastrophic hardware failure), but I had never considered a situation where the server would actually be offline for an extended period of time. That's my bad. I hadn't been daydreaming about worst-case scenarios. Nevertheless, our infrastructure was not as robust as it should have been, and could have been.

As an aside, we had also recently learned of some clients running WordPress websites on multiple AWS EC2 servers at once. This was a situation we saw with what we would consider enterprise-level installations, which require redundant servers in the event that one or more servers experience a failure. We learned that Pegasaas Accelerator WP v2.x and the corresponding API are not capable of deploying optimized assets to multiple servers. We've formulated a plan for this, and it will be part of the 3rd generation product and API coming later this year.

Having seen how websites running multiple redundant servers could absorb a disruption such as a server going offline, we knew the same approach would let us handle the scheduled downtime of our own primary server.

I'm happy to say that August 22nd (yesterday, at the time of this post) has come and gone, and our new multi-server network has performed exceptionally well. In fact, there was no disruption in service at all for the Pegasaas API while our primary server (which also hosts a number of other websites that we run) was offline.

What We Did: Servers and Databases and DNS, oh my!

What we ended up doing was deploying two identical t2.medium AWS EC2 servers that accessed a shared Amazon RDS database. In fact, we also connected our primary API server to the same database. While there is a little latency when the primary server processes large amounts of data against the Amazon RDS instance, the two Pegasaas API "nodes" had a quick connection, as the database was hosted in the same us-west-1 region.
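In practice, that just means every node points at the same database endpoint rather than a local database. A minimal sketch, with a placeholder hostname, credentials, and schema (and pymysql as an arbitrary client library choice):

```python
# Minimal sketch: every API node connects to one shared RDS endpoint,
# so any node can serve any request. The hostname and credentials are
# placeholders, not the real Pegasaas configuration.
import pymysql

RDS_ENDPOINT = "pegasaas-api.example.us-west-1.rds.amazonaws.com"  # hypothetical

def get_connection():
    return pymysql.connect(
        host=RDS_ENDPOINT,
        user="api_node",          # placeholder user
        password="***",           # loaded from secrets in practice
        database="pegasaas_api",  # placeholder schema name
        connect_timeout=5,        # fail fast if the database is unreachable
    )

conn = get_connection()
try:
    with conn.cursor() as cur:
        cur.execute("SELECT 1")  # simple health check against the shared DB
        print(cur.fetchone())
finally:
    conn.close()
```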

We switched to the NS1.com DNS provider, which allows for geo-targeted DNS data. While we're not using this feature at the moment, we will use it when we spin up "nodes" in different geographic regions, to cut down on connection timeouts for plugins installed on servers geographically distant from our primary network. NS1.com also allows multiple IP addresses on a single DNS zone record, something we didn't have with our previous plain-vanilla DNS provider. This means that when a connection is requested to "api.pegasaas.io" and the first connection fails, the requesting client (a web browser, or the plugin installed on a website) will automatically try the second, or third, IP address. Automatically.
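For illustration, a multi-answer A record like ours can be published through NS1's REST API. The API key and IPs below are placeholders, and the request shape is a sketch to check against NS1's documentation:

```python
# Sketch: publish an A record with multiple answers via the NS1 REST API.
# Zone, IPs, and key are placeholders; this assumes NS1's documented
# PUT /v1/zones/{zone}/{domain}/{type} endpoint.
import requests

NS1_API_KEY = "YOUR-NS1-KEY"  # placeholder
record = {
    "zone": "pegasaas.io",
    "domain": "api.pegasaas.io",
    "type": "A",
    # One answer per node; a client can fall back to the others.
    "answers": [
        {"answer": ["203.0.113.10"]},  # node 1 (documentation IP)
        {"answer": ["203.0.113.11"]},  # node 2 (documentation IP)
    ],
}

resp = requests.put(
    "https://api.nsone.net/v1/zones/pegasaas.io/api.pegasaas.io/A",
    headers={"X-NSONE-Key": NS1_API_KEY},
    json=record,
    timeout=10,
)
resp.raise_for_status()
print(resp.json())
```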

How It Was Done: Just Follow The Yellow Brick Road, One Step at A Time.

In the days leading up to August 22nd, we tested the system rigorously, first by connecting the primary server to the Amazon RDS database, and then by adding the first node into the network. We then briefly shut down the primary server (by way of a reboot) to see if the DNS failover system would work. It did.
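A failover check like that can be scripted: resolve every published address for the API host and confirm that at least one still answers. The logic below is our own illustrative sketch, not the actual test harness we used:

```python
# Sketch of a failover check: resolve all A records for the API host and
# confirm at least one address accepts connections. The logic is an
# illustration, not Pegasaas's actual test harness.
import socket

HOST, PORT = "api.pegasaas.io", 443

def reachable_addresses(host: str, port: int, timeout: float = 5.0) -> list[str]:
    """Return the resolved IPs that accept a TCP connection."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    alive = []
    for *_, sockaddr in infos:
        ip = sockaddr[0]
        try:
            with socket.create_connection((ip, port), timeout=timeout):
                alive.append(ip)
        except OSError:
            pass  # this node is down; the next answer should still work
    return alive

alive = reachable_addresses(HOST, PORT)
print(f"{len(alive)} node(s) reachable: {alive}")
assert alive, "failover failed: no API node reachable"
```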

After the first node was working without any issues, we incorporated a second node.  We didn’t know how much traffic a t2.medium node could handle, and we wanted to ensure that there was double redundancy.

And In The End: Behind the Curtain

It was a major lesson in service hardening, one that I had not had to deal with in 17+ years of being responsible for websites and web servers. I'm happy to say, we have a suuuuuuuper solid API now.

And, with all of these lessons and the exposure to Amazon Web Services, we're going to be further developing our data management of optimized resources by employing Amazon S3 Buckets… but that's a topic for a different day and a different blog post.
