It's been an interesting week: along with the usual work, we've been fixing a webapp that had a major problem.
The problem:
Clicking any link would halt the app for up to 40 seconds before either showing the page or returning a server error (a 503; see HTTP Cats for error codes). A refresh after a 503 usually showed the page instantly.
Along with the main issue of the site being unresponsive, we started seeing Google Search Console errors, so SEO was taking a major downturn from the same problem. Only 2 weeks before this, the webapp had a Google Lighthouse score in the high 90s, and now we couldn't even get Lighthouse to produce a score.
We needed to debug and find out what was going on.
Checking
The first check we made was for hacking or hijacking of the site. After checking usage, logs and performance, there was no evidence of hacking or of the site being used by a 3rd party to run unauthorised code.
Next, we looked at the main codebase. Had any developers made code changes in the last week or so? We checked GitHub and there were no changes to the codebase in the last 4 weeks, so no, a code change didn't seem to be the problem.
So, we turned to code injection: a 'bad actor' changing the code and rebuilding. To combat this we destroyed the file system completely and rebuilt everything from scratch; any erroneous files or scripts would be purged and only the 'good' code would be left. If this had 'fixed' the issue, we would have needed to audit all security and find out how it was done. After the rebuild we tested and found the same issues, so even with known-good code we had the same problem. This was not the cause either.
We could say, with almost complete certainty, that the issue was not a hack or a change in code. Did we have something out of date that was conflicting with the server? Some module that was deprecated, where a server update had caused an issue?
The next check was to get our hands dirty in the code.
Coding updates
The first check on the code was to make sure this was a production (live) problem only, so we ran the code locally. We had no problems with either a development build or a production build; the code seemed good locally.
We had to make sure the code was completely up to date to rule out deprecated code or module issues, so we updated all the 3rd party modules and anything that might cause problems.
All was still good locally, but after uploading and building on the hosting, we had the same problem.
The code was not the issue.
Knowing the code was fine and up to date, it was time to move on and look at the hosting.
Hosting
The hosting we use for this webapp is Hostinger (PHP), and we use Node Version Manager (NVM) to run Node.js, with Apache 'proxying' any traffic through to the app. This was not the ideal solution, but it suited this webapp at the time we created it because there are other PHP apps on the same hosting, and it is a handy way to run a Node.js app on this type of server. It had worked for multiple webapps we created; however, this was the last app of its type on this hosting, so we couldn't compare against other apps without building something new and setting up a full DevOps workflow. We decided not to create that just yet, but kept it in mind in case we needed more checks.
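For context, that proxy layer is typically just a few rewrite directives in the PHP site's .htaccess. A sketch of the idea (the port is a placeholder, not our exact config, and the host must have mod_proxy/mod_proxy_http enabled):

```apache
# Sketch: reverse-proxy an Apache/PHP site to a local Node.js app run via NVM.
# Port 3000 is a placeholder; mod_proxy and mod_proxy_http must be enabled.
RewriteEngine On
# Pass anything that is not an existing file through to the Node.js app
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^(.*)$ http://127.0.0.1:3000/$1 [P,L]
```

The [P] flag makes Apache fetch the URL internally and relay the response, which is exactly the layer that stopped working when the hosting changed.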
Our hosting company had recently added a new Node.js hosting package to their portfolio, so we decided to kill 2 birds with one stone and ask: "Does this app work with the new type of hosting?" We created the new hosting with just a few clicks, gave it read-only access to only the webapp repository on GitHub, then used the subdomain app.*. After waiting about 20 seconds, the new webapp was ready to use. Very simple.
We tested, and the new subdomain was working and showing the app perfectly; the webapp was back to its usual, performant self. This was pretty much the evidence we needed: the NVM webapp with an Apache proxy was the issue, and changes the hosting company had made had caused it.
We made a few more checks and found the underlying server type, the proxying and a bunch of other things had been conflicting. Our best reconstruction: the LiteSpeed/Nginx edge tries to apply "Optimised Managed Routing", fails to find a valid socket for the root, and throws a 503 before Apache ever sees the request. The exact cause is conjecture, as there are no real logs, so we had to use traces to even put this together. As it stands:
- Cause: Hosting changes
- Fix: Change to the new hosting Node.js package
Time to set everything back to a working version, but as usual, there is one more thing...
Finalising
We had one more small issue. The hosting was on the main domain, and we had multiple databases, emails and subdomains on that main domain. The way the hosting company works made it impossible to simply change from PHP to Node.js hosting without losing all of this.
We knew proxying from Apache was not going to work, but we needed any traffic hitting the main domain to see the webapp on the new subdomain, not the old hosting. We also knew a DNS change was too broad and could damage other apps, so we didn't want to change any DNS settings at this point.
We had one more 'layer' to use.
We made sure all traffic hitting the main domain went to a single PHP file, using Apache .htaccess like this:
RewriteEngine On
# Redirect the bare domain to www
RewriteCond %{HTTP_HOST} ^huytonweb\.com$ [NC]
RewriteRule ^(.*)$ https://www.huytonweb.com/$1 [R=301,L]
# Send anything that is not an existing file or directory to the bridge script
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.*)$ default.php [L]
Now all traffic hitting the main domain is going to the PHP file default.php. We could then use that PHP script to curl out to the new app subdomain like:
<?php
// Bridge script: fetch the same path from the new app subdomain and relay it.
ob_start();
$target_url = 'https://www.app.huytonweb.com' . $_SERVER['REQUEST_URI'];
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $target_url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, true);
if ($_SERVER['REQUEST_METHOD'] === 'POST') {
curl_setopt($ch, CURLOPT_POST, true);
$post_data = file_get_contents('php://input');
curl_setopt($ch, CURLOPT_POSTFIELDS, $post_data);
}
$headers = [
'Accept: ' . ($_SERVER['HTTP_ACCEPT'] ?? '*/*'),
'User-Agent: ' . ($_SERVER['HTTP_USER_AGENT'] ?? 'PHP-Bridge'),
'X-Internal-Request: huyton-secret-44Xcw8TF0GHSO1M'
];
if (isset($_SERVER['CONTENT_TYPE'])) {
$headers[] = 'Content-Type: ' . $_SERVER['CONTENT_TYPE'];
}
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
$output = curl_exec($ch);
$info = curl_getinfo($ch);
curl_close($ch);
ob_end_clean();
if ($output === false) {
http_response_code(500);
exit;
}
http_response_code($info['http_code']);
if (!empty($info['content_type'])) {
header("Content-Type: " . $info['content_type']);
}
echo $output;
exit;

It was another 'hop' we didn't need, and ugly code, but having done this before and knowing it passes other blockers, we decided this was the 'best' way to get the system up and running while we look at other possibilities.
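One known rough edge with this kind of cURL bridge is response headers: the script only relays Content-Type, and relaying everything blindly would be worse, because cURL has already decoded the body, so the upstream Content-Length and Content-Encoding no longer match it. If we extend the bridge later, a small filter like this (the function name is ours, not from the script above) would decide what is safe to copy across:

```php
<?php
// Decide whether a response header from the upstream app can be relayed
// verbatim to the browser. Hop-by-hop headers and body-framing headers
// must be skipped: cURL has already decoded and buffered the body, so
// the original Content-Length / Content-Encoding would no longer match.
function shouldRelayHeader(string $name): bool
{
    $skip = [
        'transfer-encoding',
        'content-encoding',
        'content-length',
        'connection',
        'keep-alive',
    ];
    return !in_array(strtolower(trim($name)), $skip, true);
}
```

With CURLOPT_HEADERFUNCTION, each upstream header line could be run through this filter and the survivors passed to header(), instead of relaying only Content-Type.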
That was it: the site was back up, Lighthouse performance was back in the high 90s, and SEO is back up and running.
Who done it?
- Who did it? The hosting company.
- How? They added a new type of hosting.
- Why? To make things easier for Node.js users; in doing so, the change broke the NVM apps.
Simple as it seems, this caused a week of outage and 3 days of fixes, updates, changes and head-scratching. But what did we learn from it?
I think we completed the correct checks in the correct order, but the failure was the timing.
We didn't see the issue for 3-5 days after it started. The site looked fine on the home page, but clicking any link would have shown the issue. Our takeaway is that we need to check all apps daily, either automated or manually, and not become complacent just because an app has been working for a long time.
To do this we'll look at adding checks into our cache buster that will let us know if it finds issues.
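The daily check above can be sketched as a small script that fetches a deep link (not just the home page, which looked fine during this incident) and flags anything slow or non-200. The thresholds and function names below are placeholders, not our production monitor:

```php
<?php
// Classify a single health-check sample. Thresholds are illustrative:
// this incident showed ~40 s hangs and 503s, so anything non-200 or
// slower than a few seconds should raise an alert.
function isHealthy(int $statusCode, float $elapsedSeconds): bool
{
    return $statusCode === 200 && $elapsedSeconds < 3.0;
}

// Fetch a deep link and time it; intended to run from cron.
// The URL passed in would be an inner page, not the home page.
function checkUrl(string $url): bool
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10); // fail fast instead of hanging 40 s
    curl_exec($ch);
    $info = curl_getinfo($ch);
    curl_close($ch);
    return isHealthy((int) $info['http_code'], (float) $info['total_time']);
}
```

A cron job calling checkUrl() against a couple of deep links, and emailing on a false result, would have surfaced this problem on day one instead of day 3-5.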
If you have a similar story, would like us to look at your website or app, or have an idea or a new business, why not contact us or leave a comment.
