Dear All,

Thanks to Erik for this post-mortem, which will also serve as historical documentation of the system setup. I should note that Mark E was abroad and also out of pocket (at an extended wedding celebration) during this incident, and so was not able to devote uninterrupted attention to troubleshooting. So it was a bit of a "perfect storm": both of our volunteer administrators happened to be absorbed elsewhere, at the same time as a relatively unusual backward-compatibility-breaking deprecation in GitLab.

With that said, I'll speak for common-lisp.net's sponsoring organization, the CLF (as a Board member), and say that we consider any amount of downtime to be too much. After all, our policy for the past several years has been to invite and encourage CL-based projects to come to common-lisp.net for their main repository hosting, as an alternative to huge impersonal repository hosts à la GitHub (which arguably has its own whole set of issues and risks). We wish to continue encouraging this in good faith. To that end, at our next monthly teleconference, the CLF will discuss actions we can take to reduce the likelihood of an outage like this recurring. If anyone on this list has ideas they would like to contribute to the discussion, please feel free to describe them here.

You may also consider joining the upcoming CLF meeting on October 4 (on Google Hangout). Please write to me directly if you would like to receive the meeting link/invitation. 


Best Regards,

 Dave Cooper



On Fri, Sep 29, 2017 at 6:03 PM, Erik Huelsmann <ehuels@gmail.com> wrote:
Hi,

From last weekend through Wednesday, gitlab.common-lisp.net had issues, returning 500 Internal Server Errors on clones and pulls; additionally, the gitlab subdomain was down completely on Sunday.
This mail provides an analysis of what happened.

Some context first: common-lisp.net uses the so-called "omnibus" package to run its GitLab install. It's a batteries-included package provided by GitLab, meaning that everything down to OpenSSL, nginx and Ruby is included in the package and installed in a separate location that does not interfere with the system. This omnibus package also comes with its own configuration (script) in the form of a Chef recipe.

While the package provides a default configuration which uses an nginx reverse proxy and default ports for daemons to be accessed over TCP sockets, this default configuration doesn't quite work on common-lisp.net, because we use Apache 2.4 as our web-visible reverse proxy. Apache 2.4 also serves a truckload of other services, such as lisppaste, trac, abcl.org, cliki.net, darcsweb.cgi, etc.
Due to this entanglement of Apache, we can't just replace it with nginx. Also, due to the large number of reverse-proxied services, not all standard ports for GitLab's configuration are open.
This isn't a problem, because GitLab offers the ability to configure site-local deviations from the default configuration as input for the Chef recipe.

We have been running successfully with a configuration like this since GitLab 7.(something). The current GitLab version is 10.0.

GitLab started out at version 7 with "Unicorn"-based Rails workers (a standard Rails setup). As demand grew, a custom web server (gitlab-git-http-server) was developed to address Unicorn time-outs on long-running "git" processes (clones).
To keep supporting the "simple" setup with just Unicorn, Unicorn and gitlab-git-http-server were each configured to run on their own port.
Around GitLab 8.2, gitlab-git-http-server was renamed to gitlab-workhorse and the configuration keys were renamed with it, although the old config keys were still respected. Last Sunday our local override still contained these gitlab-git-http-server config keys, which were there because the default ports were already taken by other services.

As of version 10, the 'gitlab-git-http-server' configuration keys are no longer supported: the configuration *must* now be specified in terms of 'gitlab-workhorse' keys. Last Sunday, when I upgraded the system to the current version (10.0.2) in the morning, I missed this fact, which caused the system to remain unconfigured (and thus unavailable) until I received notification on #common-lisp.net of problems.
The cause at that time was quickly determined: the 'gitlab-git-http-server' configuration keys were removed, the reverse proxy rules were changed to point to GitLab's remaining open ports, and the system was redeployed. All seemed to work again.

On Monday I received more signals of problems. As I was at a conference with little to no Net access, Mark pitched in, but he was unable to determine the cause. When I *did* have access, everything looked fine, so I didn't check any further.
Then on Tuesday, I received more signals of problems, but since I was still at the same conference without Net access, I wasn't able to do much.
On Wednesday morning, with yet more reports of problems, it became apparent that I was checking the web frontend for availability, but that the people reporting issues were actually experiencing problems with clones/pulls/etc. So, the git-over-http component wasn't working.
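
To illustrate the difference between the two checks, here is a minimal Python sketch of a probe that exercises git-over-http rather than the web frontend. It is not what we run on common-lisp.net; the function names are made up for this example. It relies on the smart HTTP convention that a git server answers a request for info/refs?service=git-upload-pack with a dedicated content type, so an HTML page or a 500 on that URL shows that clones would fail even if the site itself loads.

```python
"""Probe git-over-http separately from the web frontend (illustrative sketch)."""
import sys
import urllib.error
import urllib.request

# Content type a smart-HTTP git server sends for an upload-pack ref advertisement.
GIT_ADVERT_TYPE = "application/x-git-upload-pack-advertisement"


def info_refs_url(repo_url):
    """Build the ref-discovery URL for a repository's HTTP(S) clone URL."""
    base = repo_url.rstrip("/")
    if not base.endswith(".git"):
        base += ".git"
    return base + "/info/refs?service=git-upload-pack"


def classify(status, content_type):
    """Interpret the status and Content-Type of a ref-discovery response."""
    if status >= 500:
        return "server-error"
    if status == 200 and content_type.startswith(GIT_ADVERT_TYPE):
        return "git-ok"
    # Anything else (an HTML page, a redirect target, 401/404, ...) is not a
    # usable git answer for an anonymous clone.
    return "not-git"


def probe(repo_url, timeout=10.0):
    """Request the ref advertisement and classify the outcome."""
    try:
        with urllib.request.urlopen(info_refs_url(repo_url), timeout=timeout) as resp:
            return classify(resp.status, resp.headers.get("Content-Type", ""))
    except urllib.error.HTTPError as err:
        headers = err.headers
        content_type = headers.get("Content-Type", "") if headers is not None else ""
        return classify(err.code, content_type)
    except OSError:
        # Covers URLError, connection failures and time-outs.
        return "unreachable"


if __name__ == "__main__" and len(sys.argv) > 1:
    # Pass the clone URL of any public repository on the server being checked.
    print(probe(sys.argv[1]))
```

Run against the clone URL of a public project, a result other than "git-ok" would have flagged the problem on Monday, when the web frontend still looked fine.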

With the actual problem identified and reproduced, it was quickly apparent that due to the removal of the gitlab-git-http-server config keys, 'gitlab-workhorse' was no longer being configured and started. With a bit of trial-and-error, it also turned out that gitlab-workhorse has a default configuration to run over Unix domain sockets; a configuration supported by Nginx, but not by Apache. With the configuration corrected and the system reconfigured, problems were solved by Wednesday noon.

In retrospect, the removal of these config keys was in the release notes, so I could have known. It was 22 <PgDown> clicks down, by which time I wasn't alert enough to realise the importance of the deprecation announcement.
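
A mechanical check before upgrading could catch this kind of thing without relying on a careful read of the release notes. The Python sketch below is only an illustration under assumptions: it assumes the local override is a text file of Ruby-style settings (on an omnibus install typically /etc/gitlab/gitlab.rb), and that the old keys are spelled with the prefix gitlab_git_http_server. Both the prefix list and the function name are hypothetical and would have to be maintained by hand from the release notes.

```python
"""Flag active settings in a config override that use deprecated key prefixes."""
import re

# Assumption: spelling of the deprecated setting prefix in the override file.
DEPRECATED_PREFIXES = ("gitlab_git_http_server",)


def find_deprecated_settings(config_text, prefixes=DEPRECATED_PREFIXES):
    """Return (line_number, line) pairs for uncommented settings using a deprecated prefix."""
    hits = []
    for number, line in enumerate(config_text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # ignore blank lines and commented-out settings
        for prefix in prefixes:
            # Match settings of the form: prefix['some_key'] = value
            if re.match(re.escape(prefix) + r"\s*\[", stripped):
                hits.append((number, stripped))
                break
    return hits


if __name__ == "__main__":
    # Made-up sample content; the values are placeholders, not our real settings.
    sample = (
        "# gitlab_git_http_server['listen_addr'] = 'localhost:8181'\n"
        "gitlab_git_http_server['listen_addr'] = 'localhost:8181'\n"
        "gitlab_workhorse['listen_network'] = 'tcp'\n"
    )
    for number, line in find_deprecated_settings(sample):
        print(f"line {number}: deprecated setting: {line}")
```

Such a check only knows about prefixes someone has already listed, so it complements reading the deprecation announcements rather than replacing it.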


Regards,

--
Bye,

Erik.

http://efficito.com -- Hosted accounting and ERP.
Robust and Flexible. No vendor lock-in.



--
My Best,

Dave Cooper, david.cooper@gen.works
genworks.com, gendl.org
+1 248-330-2979