Difference between revisions of "RAL Tier1 weekly operations castor 21/07/2014"
From GridPP Wiki
Revision as of 09:05, 21 July 2014
Operations News
- 2.1.14-13 upgrades now complete with the exception of switching off the compatibility mode.
- GEN SRM issues (not dteam) have been solved - a bug in srmbed was being exposed by customer config.
- Elastic Search infrastructure has been fixed by James – we need to put tools on the admin node.
- Plans to ensure PreProd represents production in terms of hardware generation are underway. A student will be starting soon with the task of investigating visualisation and querying solutions for CASTOR use.
- Deployment of disk servers is due to restart next week.
Operations Problems
- Incorrect service classes in castor.conf on disk servers - ATLAS issues resolved by Rob. Other non-production issues identified by Bruno - fix planned.
- Low-level DB locking issues continue (various VOs) - these have been reported to the developers at CERN.
- A potential race condition which could result in data loss has been seen on CMS (2.1.14-13) while investigating a file that would not migrate to tape. CERN have been notified.
- ATLAS Xrootd proxy issues - investigation starting.
- CMS xroot issues
- Facilities castor error
- LHCb possible IO problems - ticket raised and investigation started.
- Need to update dteam and ATLAS VOs' voms-servers
Blocking Issues
Planned, Scheduled and Cancelled Interventions
- Switch-off of compatibility mode for Tier 1 Name Server
- Upgrade of Facilities CASTOR from 2.1.14-11 to 2.1.14-13.
Advanced Planning
Tasks
- Put V13 servers in NonProd into production (once name server compatibility mode change complete)
- Resume draining on the ATLAS instance (again, once name server compatibility mode change complete)
- Switch admin machines from lcgccvm02 to lcgcadm05
- Replace DLF with Elastic Search
- Correct partition alignment issue (3rd CASTOR partition) on new CASTOR disk servers
Interventions
Staffing
- CASTOR on-call person
- Chris Monday - Friday
- Somebody - Weekend
- Staff absence/out of the office:
- Matt out Monday
- Brian out Monday/Tuesday