RAL Tier1 weekly operations Grid 20100503
From GridPP Wiki
Revision as of 12:23, 30 April 2010 by Matt hodges
Operational Issues
Description | Start | End | Affected VO(s) | Severity | Status |
---|---|---|---|---|---|
Job status monitoring from CREAMCE | 2-Feb-2010 | | CMS | medium | [10-Feb-2010] WMS patch available soon; CREAMCE new version available soon. [07-Apr-2010] CMS tests have shown that WMS patches resolve the problem; still waiting for patch to be installed on the production WMSs in Italy. |
Downtimes
Description | Hosts | Type | Start | End | Affected VO(s) |
---|---|---|---|---|---|
Blocking Issues
Description | Requested Date | Required By Date | Priority | Status |
---|---|---|---|---|
Hardware for Testbed | | | Medium | Required for change validation, load testing, etc. Also for phased rollout (which replaces PPS). Have initial hardware. [2010-02-22] More hardware expected by end of March. |
Developments/Plans
Highlights for Tier-1 Ops Meeting
Highlights for Tier-1 VO Liaison Meeting
Detailed Individual Reports
Alastair
- Working on ATLAS software server upgrade (testing with Jonathan starting tomorrow)
- Working on setting up and testing ATLASGROUP disk at RAL.
- Working with B-Physics Group on group analysis requirements (TAG based analysis).
- Looking into ATLAS PFC (Pool File Catalogue) problems.
Andrew
- APR
- Updated PBS gmetric scripts to include pilot accounts [Done]
- Added DNs back into FTS monitor, but they are now anonymized [Done]
- CMS data ops
- More MC reprocessing at FNAL; pile-up reprocessing at CNAF
- Installing & setting up PhEDEx on new VOBOX
- Script to check checksums in CASTOR of random files from specific datasets [Done]
Catalin
- ALICE VOBOXes gLite updates [done]
- various OS updates [done]
- Self Service Tools training [done]
- APR [ongoing]
- ATLAS, ALICE phone calls
- Install and configure squid on LHCb VOBOX [ongoing]
Derek
- Investigating scheduler avoidance of new WNs [Ongoing]
- Evaluating cloud technology for Grid Services testbed use [Ongoing]
- APR [Ongoing]
- SSC Training [Done]
- Requested renewal of 3 CE certificates
Matt
- APRs
- Distribute notes for deployment of 09 disk capacity meeting [Done]
- Catch up on change controls (batch related) [Done]
- Add Maui reservations to test CASTOR 32-bit libraries [Done]
- Fix CPU capacity plots (Viglen09) [Done]
- Look at draft User Board allocations [Done]
- Update short-range CPU/disk capacity profiles [Done]
Richard
- 1/2 days Oracle/SSC Training (Thu)
- Drafted a Change Control request to move some of the BDII servers to the Atlas building for greater resilience
- Working on a proposal on intra-/inter-team communication to meet an action from the team awayday
- Reviewing G/S process documentation
- Further Nagios items from the to-do list (https://wiki.e-science.cclrc.ac.uk/web1/bin/view/EScienceInternal/NagiosTasksToDo)
- CASTOR items:
- Continuing p/p stress testing
Mayo
- Implement feedback into TSBN web interface
- Set up scripts that update TSBN interface to run as scheduled jobs on a Windows machine
- Writing and configuring Nagios NRPE plugins [Done]
- Certificate viewer for NGS cert wizard
- Write PDU power controller query script [Done]
- Write a script to turn PDU ports off
VO Reports
ALICE
- Would like CREAM-CE v1.6 to be installed ASAP.
ATLAS
- Not heard anything about LHC technical stop. (Maybe I missed something/it will be announced tomorrow)
- LHC continuing to slowly increase luminosity.
- ATLAS Software and Computing week. (More chaotic than usual due to the volcano)
- Fast 'fast re-processing' could start this week.
- ATLAS announced that it would like all disk space in production by June 1st.
CMS
- PhEDEx on SL4 is no longer supported. 3.3.1 has just been released for SL5 only.
LHCb
OnCall/AoD Cover
- Primary OnCall: Catalin
- Grid OnCall:
- AoD: