Difference between revisions of "RAL Tier1 weekly operations castor 06/04/2018"

From GridPP Wiki
(Created page with "== Draft agenda == 1. Problems encountered this week 2. Upgrades/improvements made this week 3. What are we planning to do next week? 4. Long-term project updates (if not ...")
 
 
Line 31: Line 31:
 
== Operation problems ==
 
== Operation problems ==
  
   * Serious firewall misconfiguration resulted in the breaking of external xrootd access to RAL
+
   * CMS unroutable file issue which was dealt with
   * Chris reports that a functional test that used to run on lcgcadm05 doesn't work anymore (probably because the machine has been turned off)
+
   * ChrisP reports that a functional test that used to run on lcgcadm05 doesn't work anymore (probably because the machine has been turned off)
 
       * Should have been ported to castor-functional-test1 - maybe just needs repointing?
 
       * Should have been ported to castor-functional-test1 - maybe just needs repointing?
  
 
== Operation news ==
 
== Operation news ==
    
+
 
   * Aquilon SRM profiles fixed up and deployed to prod
+
   * Phani confirmed that a DB flip-over to R26 is possible
   * lcgcts12 (preprod) upgrade to SL7 properly attached to the preprod instance
+
   * Oracle patching on R26 and R89 DBs was successfully completed. Had a follow-up meeting to review...
 +
   * Four 2013 generation disk servers deployed into genTape
  
 
== Plans for next few weeks ==
 
== Plans for next few weeks ==
  
Patching of the Neptune and Pluto DB and testing of switching over to R26 postponed until 27th March.
+
  * Attend GridPP meeting
 
+
  * Review and submit the CC for macroheadnodes
GP: Continue work on 'macroheadnodes' and write the change control.
+
Deployment of new genTape disk servers.
+
  
 
== Long-term projects ==
 
== Long-term projects ==
Line 53: Line 52:
 
utility and SRM features in one node which passes functional tests
 
utility and SRM features in one node which passes functional tests
  
'Macroheadnode' configuration testable :D
+
There is no point in moving Atlas and CMS to the macroheadnode setup, since their d1t0 data are moving to Echo over the next two months
  
HA-proxyfication of the CASTOR SRMs: HA proxy is back and can be tested on preprod
+
HA-proxyfication of the CASTOR SRMs: HA proxy set up has been tested on the SRMs
  
 
Target: Combined headnodes running on SL7/Aquilon - implement CERN-style 'Macro' headnodes.
 
Target: Combined headnodes running on SL7/Aquilon - implement CERN-style 'Macro' headnodes.
  
Draining of 4 x 13 generation disk servers from Atlas that will be deployed on genTape - draining complete, waiting for Fabric.
+
== Actions ==
  
Draining of 10% of the 14 generation disk servers
+
Check whether a Facilities functional test that was running on lcgcadm05 is now running on castor-functional-test1
 
+
== Actions ==
+
  
 
RA/BD:  Run GFAL unit tests against CASTOR. Get them here: https://gitlab.cern.ch/dmc/gfal2/tree/develop/test/
 
RA/BD:  Run GFAL unit tests against CASTOR. Get them here: https://gitlab.cern.ch/dmc/gfal2/tree/develop/test/
  
RA to organise a meeting with the Fabric team to discuss outstanding issues with Data Services hardware
+
RA to check if there is any outstanding action for Fabric
  
 
GP/RA to write a Nagios test to check for large number of requests that remain for a long time
 
GP/RA to write a Nagios test to check for large number of requests that remain for a long time
Line 73: Line 70:
 
== Staffing ==
 
== Staffing ==
  
RA on call
+
GP on call, RA on 15/4

Latest revision as of 10:10, 6 April 2018

Draft agenda

1. Problems encountered this week

2. Upgrades/improvements made this week

3. What are we planning to do next week?

4. Long-term project updates (if not already covered)

  1. SL5 elimination from CIP and tape verification server
  2. CASTOR stress test improvement
  3. Generic CASTOR headnode setup
  4. Aquilonised headnodes

5. Special topics

6. Actions

7. Review Fabric tasks

  1.   Link

8. AoTechnicalB

9. Availability for next week

10. On-Call

11. AoOtherB

Operation problems

  * CMS unroutable file issue which was dealt with
  * ChrisP reports that a functional test that used to run on lcgcadm05 doesn't work anymore (probably because the machine has been turned off)
     * Should have been ported to castor-functional-test1 - maybe just needs repointing?

Operation news

  * Phani confirmed that a DB flip-over to R26 is possible
  * Oracle patching on R26 and R89 DBs was successfully completed. Had a follow-up meeting to review...
  * Four 2013 generation disk servers deployed into genTape

Plans for next few weeks

  * Attend GridPP meeting
  * Review and submit the CC for macroheadnodes

Long-term projects

Headnode migration to Aquilon - Stager, scheduler, utility and nameserver configuration mainly complete. Stager, scheduler and utility tested separately and all together on preprod. Stager, scheduler, utility and SRM features combined in one node, which passes functional tests.

There is no point in moving Atlas and CMS to the macroheadnode setup, since their d1t0 data are moving to Echo over the next two months

HA-proxyfication of the CASTOR SRMs: HA proxy set up has been tested on the SRMs

Target: Combined headnodes running on SL7/Aquilon - implement CERN-style 'Macro' headnodes.

Actions

Check whether a Facilities functional test that was running on lcgcadm05 is now running on castor-functional-test1

RA/BD: Run GFAL unit tests against CASTOR. Get them here: https://gitlab.cern.ch/dmc/gfal2/tree/develop/test/

RA to check if there is any outstanding action for Fabric

GP/RA to write a Nagios test to check for large number of requests that remain for a long time
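
As a starting point for this action, the decision logic of such a check could look roughly like the sketch below. It is only an illustration: the function name, the thresholds and the message format are assumptions, and how the ages of pending requests would be obtained from CASTOR is deliberately left out. The only fixed convention used is the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
from typing import Iterable, Tuple

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def check_stuck_requests(
    ages_seconds: Iterable[float],
    max_age_seconds: float = 3600,   # hypothetical: "a long time" = 1 hour
    warn_count: int = 100,           # hypothetical warning threshold
    crit_count: int = 500,           # hypothetical critical threshold
) -> Tuple[int, str]:
    """Return (exit_code, message) for a list of pending-request ages.

    A request counts as "stuck" when its age exceeds max_age_seconds.
    The caller supplies the ages; fetching them is outside this sketch.
    """
    if warn_count > crit_count:
        return UNKNOWN, "UNKNOWN - warning threshold exceeds critical threshold"

    stuck = sum(1 for age in ages_seconds if age > max_age_seconds)

    if stuck >= crit_count:
        code = CRITICAL
    elif stuck >= warn_count:
        code = WARNING
    else:
        code = OK

    # Text before the '|' is the status line; text after it is perfdata
    # in the usual label=value;warn;crit form.
    message = (
        f"{LABELS[code]} - {stuck} requests older than {max_age_seconds}s"
        f" | stuck={stuck};{warn_count};{crit_count}"
    )
    return code, message
```

A wrapper script would print the returned message and exit with the returned code, which is what Nagios reads to set the service state.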

Staffing

GP on call, RA on 15/4