Problems After CA 1.88-1
This document should be treated as expired as soon as we move to CA 1.89.1, since the bugs described here should have been fixed by then. If we are past that point, don't read any further.
Following the release of lcg-CA (etc.) 1.88-1, various authentication problems cropped up at several sites. Some of the types of problem are described in Appendix 1, along with a workaround that keeps a site going in the short term by rolling back the CA certificates. But the deadline for updating to 1.88-1 was 2017.12.04, which has already passed, so I'm listing here the basic cause of the problem, as determined by Robert Frank, and some other measures sites can take to maintain operations.
Simply stated, the factors necessary for this problem to happen are as follows (an unexpurgated, i.e. accurate, version of this explanation is given in Appendix 3).
- It happens where one system sends certificates to another to be authenticated by (old versions of) bouncycastle, i.e. by Java applications.
- The systems must have different versions of the CA certificates; one must have 1.87-1 and the other must have 1.88-1; it doesn't matter which way round, they just must be different.
- The system that is authenticating incoming certificates (e.g. CREAM or ARGUS, or whatever) must be on bouncycastle-1.46-1 (unpatched version - this version is very old, from 2013, and the bug was fixed from 1.48 onwards according to NIKHEF).
- The system that is authenticating incoming certificates must have a CRL (UKeScienceCA-2B or root) containing no extensions. (Note that in grid scenarios both sides are usually authenticating.)
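The bouncycastle condition in the list above can be checked roughly as follows. This is only a sketch: the function name bc_status is made up for illustration, it assumes the patched build can be recognised purely by the rpm name given in this document (bouncycastle-1.46-1.el6.1.noarch), it takes the "fixed from 1.48 onwards" statement at face value, and it relies on a sort that supports -V (as GNU sort does).

```shell
# Classify a bouncycastle package string, e.g. the output of `rpm -q bouncycastle`.
# bc_status is a hypothetical helper name, not part of any existing tool.
bc_status() {
    pkg=$1
    # Assumption: the patched rebuild is identified by its exact rpm name.
    case $pkg in
        bouncycastle-1.46-1.el6.1.noarch*) echo "patched"; return ;;
    esac
    # Pull out the upstream version (the field between the name and the release).
    ver=$(printf '%s\n' "$pkg" | sed -n 's/^bouncycastle-\([0-9][0-9.]*\)-.*$/\1/p')
    if [ -z "$ver" ]; then
        echo "unknown"
        return
    fi
    # If 1.48 sorts first (or equal) in version order, then ver >= 1.48.
    lowest=$(printf '%s\n%s\n' "1.48" "$ver" | sort -V | head -n 1)
    if [ "$lowest" = "1.48" ]; then
        echo "fixed-upstream"
    else
        echo "possibly-affected"
    fi
}
```

On a host it could be called as: bc_status "$(rpm -q bouncycastle)". A "possibly-affected" result only means this one condition is met; the other conditions in the list still have to hold for the error to appear.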
If any one of those conditions is not met, the error doesn't happen. So the fixes are:
- Jens is sending out a CRL with a "non-critical extension". A site could wait until the CRL propagates around the sites. Initial tests of this idea showed that the CA root certificate also needed changes, which has been done and now the process is said to work (preliminarily tested by Robert Frank.) Note: If this works, it could be a good choice for sites, since they'd just need to (a) wait for the CRLs to update automatically via fetch-crl, then (b) upgrade CAs to 1.88-1.
- Use a better version of bouncycastle. One way to get that on ARGUS is to install it on Centos7, perhaps with UMD3 or 4. I haven't tested this, but Chris Brew assured us that the problem doesn't occur on Centos7 (Chris, please confirm the version of BC on your ARGUS server).
- If you use ARGUS or CREAM, you could update bouncycastle to the "Robert Frank" patched version and then just update everything to 1.88-1. This has been tested and seems/is safe. The name of the rpm is bouncycastle-1.46-1.el6.1.noarch, and Robert provides details in Appendix 2.
- Any other options ...
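For the first fix, a site may want to confirm that the updated CRL has actually arrived before upgrading the CAs. The sketch below is one way to do that, under these assumptions: the function name crl_text_has_extensions is made up for illustration, and the check relies on the text dump from "openssl crl -noout -text" containing a "CRL extensions:" heading when the CRL carries any extensions.

```shell
# Read the text form of a CRL on stdin (as produced by `openssl crl -noout -text`)
# and report whether it lists any extensions.
# crl_text_has_extensions is a hypothetical helper name.
crl_text_has_extensions() {
    # Assumption: openssl prints a "CRL extensions:" heading only when
    # the CRL contains at least one extension.
    if grep -q 'CRL extensions:'; then
        echo "has-extensions"
    else
        echo "no-extensions"
    fi
}
```

It might be used like this, where the CRL path is an assumption about where fetch-crl writes its files on your system and HASH stands for the CA's hash: openssl crl -in /etc/grid-security/certificates/HASH.r0 -noout -text | crl_text_has_extensions. A result of "no-extensions" for the 2B or root CRL suggests the new CRL has not propagated yet.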
Appendix 1 - How to Roll Back
Important note: rolling back to 1.87-1 got our site running jobs again, but it was only a temporary workaround, and for some sites it may not work even as that. As an example, consider a site with a CREAM CE that has been rolled back to 1.87-1 and that is used by a submission client on 1.88-1. Those conditions might trigger the error. But since I, sj, don't run CREAM, I can't test that. We use ARC, which has no link to bouncycastle since it's written in C and Perl etc. UIs could also be affected. On a related note, Andrew Lahiff suggests using arcproxy (written in, I think, C) instead of voms-proxy-init (nowadays written in Java and hence susceptible to the issue).
So, let's look at the problem. In the last days of November 2017, a new set of root certificates was released, version 1.88-1. At Liverpool, the rpms on our central ARGUS server were updated automatically on the evening of 27th Nov. Once the already queued jobs had started, the number of running jobs began to dwindle, which I noticed the next morning. On the ARGUS server, the /var/log/argus/pepd/process.log file contained lots of errors like this:
2017-11-27 18:00:03.102Z - ERROR [TrustStoreValidationErrorLogger] - Validation error: error at position 0 in chain, problematic certificate subject: CN=hepgrid11.ph.liv.ac.uk,L=CSD,OU=Liverpool,O=eScience,C=UK (category: CRL): Can not verify the CRL as its issuer's public key is unknown or can not be validated Cause: Certification path could not be validated. Cause: NullPointerException
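To see quickly whether a server is hitting the error shown above, the matching log lines can be counted. This is a minimal sketch: the function name count_crl_errors is made up for illustration, and the pattern is simply taken from the one example line in this document, so other variants of the message would not be counted.

```shell
# Count log lines that look like the CRL validation failure shown above.
# count_crl_errors is a hypothetical helper name.
count_crl_errors() {
    logfile=$1
    [ -r "$logfile" ] || { echo "cannot read $logfile" >&2; return 1; }
    # grep -c prints the number of matching lines; it exits non-zero when
    # the count is 0, so the status is neutralised with `|| true`.
    grep -c 'category: CRL.*NullPointerException' "$logfile" || true
}
```

For example: count_crl_errors /var/log/argus/pepd/process.log. A count that keeps growing after the CA update would be consistent with the problem described here.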
The problem was affecting our CEs and our DPM headnode (hepgrid11). I got the site back up temporarily by following these steps.
I rolled back to 1.87-1 on our CEs, SE and our ARGUS server. To roll back, I first removed the existing references to the current repo, then put in the repo listed below. That points to the old 1.87-1 versions (the baseurl is different from the standard place).
# pwd
/etc/yum.repos.d
# cat EGI-trustanchors.repo
[EGI-trustanchors]
name=EGI-trustanchors
baseurl=https://egi-igtf.ndpf.info/distribution/egi-1.87-1/ca-policy-egi-core-1.87-1/
enabled=1
gpgcheck=0
priority=3
It may be possible to use “yum history” for this, but I used these commands to remove the newly installed 1.88-1 CAs.
# for p in $(rpm -qa | grep -F 1.88-1 | grep ca_); do yum -y remove "$p"; done
Then check for other packages of version 1.88-1 and, if there are any, remove those too, by hand.
# rpm -qa | grep -F 1.88-1
Then yum install (or update) lcg-CA (or ca-policy-egi-core, or whatever it is you use).
This is OK at Liverpool for the time being, but we’ll have to go to a new version of the CAs sometime soon.
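Before running the removal loop in this appendix, it may be worth previewing which packages it would pick up. The sketch below does only the filtering and removes nothing. The function name list_ca_pkgs is made up for illustration, and it assumes (as the loop does) that the individual CA packages have "ca_" in their names.

```shell
# Filter a package list (one name per line on stdin, e.g. from `rpm -qa`)
# down to CA packages at the given trust-anchor version. Nothing is removed.
# list_ca_pkgs is a hypothetical helper name.
list_ca_pkgs() {
    version=$1
    # -F treats the version as a fixed string, so the dots are not wildcards.
    grep -F -- "-${version}" | grep -F 'ca_' || true
}
```

For example: rpm -qa | list_ca_pkgs 1.88-1. Packages at that version without "ca_" in the name (the policy meta-packages, for instance) are not listed, which is why the separate by-hand check is still needed.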
Appendix 2 - Robert Frank's Bug Fix
Robert Frank's notes on his update to bouncycastle.
I've built an SL6 bouncycastle rpm which uses the fixed implementation of that function:
rpm: http://mirror.tier2.hep.manchester.ac.uk/Repositories/local/6/x86_64/bouncycastle-1.46-1.el6.1.noarch.rpm
src: http://mirror.tier2.hep.manchester.ac.uk/Repositories/local/6/sources/bouncycastle-1.46-1.el6.1.src.rpm
patch: http://mirror.tier2.hep.manchester.ac.uk/tier2/deltacrl.patch
After installing the rpm, my problems with the java voms clients disappeared. More testing is needed though.
Appendix 3 - Robert's Full Explanation
The summary above is an easy-to-understand but slightly simplified explanation. It's probably satisfactory for most sites. But for those who prefer the truth, the whole truth and nothing but the truth, here are full, unexpurgated explanations by Robert Frank that use more accurate prose. His email also contains some info on
- his progress testing Jens' effort to solve the problem using non-critical extensions (now known to work well, I believe),
- possible repositories that contain ARGUS 1.7 (with a patched copy of BC) for various UMD distributions, and
- related problems with the CANL libs, and how to work around them.
I reproduce verbatim the important sections below.
On 05/12/17 14:54, Stephen Jones wrote:
> The deadline for updating to 1.88-1 is 2017.12.04, i.e. already passed. David
> has asked me to document what to do about it so sites can update. So, please confirm that the factors
> necessary for this error to happen are:
>
> a) It happens where one system sends proxys to another to be authenticated by bouncycastle.

Not quite. It happens when a certificate chain containing the 2B CA certificate is sent across (either from server to client for server certificate validation, or from client to server for client certificate validation) and is validated with bouncycastle against the locally installed trust anchors.

> b) The systems must have different versions of the CA certificates; one must
> have 1.87-1 and the other must have 1.88-1; it doesn't matter which
> way round, they just must be different.

Correct.

> c) The system that is authenticating incoming proxies (e.g. CREAM or ARGUS, or
> whatever) must be on bouncycastle-1.46-1 (unpatched version).

Again, not quite. This has nothing to do with proxies. It's any system that uses an unpatched bouncycastle to validate a certificate chain it receives from the remote side against it's locally installed set of trust anchors.

> d) The system that is authenticating incoming proxies must have a CRL for
> the UK CA cert (UKeScienceCA-2B) that contains no "non-critical extensions".

It must have a CRL for any UK eScience CA (root or 2B) that contains no extensions at all.

To summarise, all of the following has to apply to trigger the problem:
* it effects the different versions of the UK eScience 2B CA as installed by the trust anchor releases 1.88 and 1.87 (or earlier)
* both sides use different versions of the 2B CA (installed by a trust anchor release, installed in the browser, etc)
* a certificate chain containing the 2B CA has to be transferred from one side to the other
* the side receiving the chain uses an unpatched version of bouncycastle to validate the received chain against a local installation of the trust anchors
* a CRL without any extensions issued by any of the CAs in the chain is present for the local trust anchors
The above can apply to a server, a client, or both.

> If any of those things is missing, then the error doesn't happen. So fixes are:
>
> 1) Jens is sending a CRL with a "non-critical extension". A site could
> wait until the CRL propagates around the sites. (Jens, pls confirm when you "know"
> this actually works!)

This might be needed for all CAs in the chain. I'll test it again once Jens issued an updated CRL for the Root CA. It's possible that having it in the CRL of the root CA is enough, but I won't know for sure until I've tested it.

> 2) Use a better version of bouncycastle. One way on ARGUS to get that it
> is install it with Centos7, and perhaps UMD3 or 4. I haven't tested this,
> but Chris Brew assured us that the problem doesn't come out on Centos7
> (Chris, please confirm version of BC on your ARGUS server).

You can get Argus 1.7 for SL6x from UMD 4, but not from UMD 3. Argus 1.7 ships with a newer version of BC which doesn't have the problem.

> 3) If you use ARC (sj: typo, I meant ARGUS) or CREAM, you could update bouncycastle to the "Robert Frank"
> patched version (details TBD) and then just update everything
> to 1.88-1. This has been tested and seems/is safe.

Correct.
If you use CREAM you can do the same, you just have to install the patched version on the CE as well (has been tested in Manchester).

Also, all services that use the CANL library to reload CA certificates and CRLs automatically need to be restarted after the update to 1.88.

Cheers,
Robert