Operations Bulletin Latest



Week commencing 4th August 2014
Task Areas
General updates

Tuesday 29th July


Tuesday 22nd July

  • The RAL FTS2 is due to be switched off on 2nd September.
  • HyperK enablement request still stands.
  • The biweekly WLCG status from yesterday's ops meeting is available here.
  • There was a general reminder last week about putting too much detail into messages that go to our public email lists. Please remember the Traffic Light Protocol!
  • There will be an IPv6 session at GridPP33. JANET will participate. Pete C and G are taking ideas for talks.
  • There was a GridPP technical meeting on Friday.
  • The final WLCG T2 availability and reliability figures for June 2014 are now available.
  • There was a first meeting last Wednesday of the HEP Software Foundation. The first step is to prepare a "call for volunteers" who can devote the time in the coming months to lead the work that has to be done.


WLCG Operations Coordination - Agendas

Tuesday 29th July

  • There was a WLCG coordination meeting last Thursday. Minutes are available.
  • News: MB agreed gstat no longer supported. Notes from WLCG workshop available.
  • Baseline: Changes FTS3 v3.2.26; DPM 1.8.8; perfSONAR 3.3.2-17; Storm 1.11.4.
  • Baseline: Issues: “Missing Key Usage Extension in delegated proxy” affecting CREAM; dCache upgrade; EOS (Brazilian proxies). Tickets for CVMFS upgrade to 2.1.19.
  • Tier-2 – feedback from GridPP given. CVMFS monitoring, gstat and ARGUS-SE status.
  • ALICE: Very high to low throughput due to input tasks. ALICE site admins to note VOMS changes
  • ATLAS: issues related to the new ATLAS frameworks commissioning/integration. Using MCORE and want more. A new way of measuring site availability, based on sites' availability for analysis, is under discussion within ATLAS.
  • CMS: Busy. CAS2014 soon - target 40k concurrent jobs. Want to make the xrootd-fallback test critical. Removal of individual release tags from CEs. 5 site reminders.
  • LHCb: Access to lcg-voms2 and voms2 from outside CERN is timing out. Recurring problem after upgrade of dCache sites. CVMFS clients not all upgraded. Question: what happens to a VM if it fails and needs re-instantiation - are the cores lost? No.
  • Tracking tools: Recommendation to close TF.
  • FTS3: Only a few CMS sites running over FTS2. Can now monitor FTS3 transfers against user DN (needed by CRAB3)
  • Glexec: no change. Some running unreleased rpms due to ARGUS status. Someone needs to pledge effort to support ARGUS; SWITCH is unable to pledge effort.
  • MW readiness: DPM is the pilot WLCG MW package used for the Readiness Verification effort. Progress is traced in a dedicated JIRA tracker. Next meeting 1st October.
  • Multicore: ATLAS steady flow. With CMS concurrent at 5 sites. How to track.
  • SHA-2: Intro new VOMS servers. Testing infrastructure with preprod SAM. Some good; many failures due to simple configuration issues of CREAM and/or Argus (Action to check). Aim for 15th September. Will open tickets. RFC no progress.
  • WMS decommissioning: Work ongoing to improve SAM Condor probe jobs lifecycle. End of September or October.
  • IPv6: All but ALICE now testing (migrating AliEN to Perl 5.18).
  • HTTP proxy discovery: NTR
  • Network and transfer metrics WG: proposed dates for kickoff are in the last week of August and second week in September.



Tuesday 22nd July

  • The next coordination meeting takes place this Thursday at 14:30 UK time. There is now a standing item for sites to raise issues of concern. Is there anything we would like to mention this week? We are invited to update the twiki up until 1 hour before the meeting.


Tier-1 - Status Page

Tuesday 29th July

  • All Castor instances have been upgraded to version 2.1.14. The upgrade is complete, including turning off compatibility mode on the nameserver component, which was done at the end of last week.
  • We have announced that we will shut down both the FTS2 service and the software server used by the small VOs on the 2nd September.
  • There was a site outage owing to a network problem for around 45 minutes last Tuesday, at the time of this DTEAM/OPS meeting.

Storage & Data Management - Agendas/Minutes

Wednesday 30 July 2014

  • Very optimistically hoping to get some Real Work™ done during the supposedly quiet time expected in August.
  • Suggestion to ask experiments to present their view of storage and data management at GridPP33: after all, it's their opinion which is most important.
  • Collating feedback on the storage meetings.

Wednesday 23 July 2014

  • We really should try to document our VO policy: the stuff in people's heads, experiences with "small" VOs (that tend to grow bigger), best practices. Also, much of the wiki needs reviewing. "Boring" old documentation - hey ho!

Tuesday 22nd July

  • Alert today advising to update DPM, mainly to get the new (bug fixed)

Wednesday 2 July

  • Guidance and policies for "small" VOs: how to get them started with stuff, without preventing them later growing bigger.

Tuesday 1st July


Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 29th July

  • Accounting gap last week for UCL and Durham.

Tuesday 22nd July

  • EGI accounting portal does not show any significant outages in publishing in recent days.


Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Tuesday 22nd July

  • Starting on revisions this week.
  • Is the alert system now working?

Monday 16th June

  • A review is starting of old and obsolete pages within the GridPP website - there are many! Please review sections that you have created and update them if necessary.

Tuesday 6th April

  • KeyDocs are going to be reviewed (in next 4 weeks) as the system is not working (or not adding anything) in some areas.


Interoperation - EGI ops agendas

Tuesday 15th July

  • Last meeting yesterday.
  • URT: see agenda for details
  • SR: In verification: gfal2 v. 2.5.5; active: globus-info-provider-service v. 0.2.1, cream v. 1.16.3; ready to be released: storm v. 1.11.4, lb v. 11.1, wms v. 3.6.5, dcache v. 2.6.28.
  • DMSU report: CREAM CLI/GridSite SegFaults at Long-Lived Proxies solved
  • Migration of Central SAM services: note to make sure that patches are applied if services are reinstalled.
  • EMI-2/APEL-2 - Looks like UCL is still publishing with the APEL-2 publisher.
  • Hoped that the gr.net issues were resolved on Monday. Summary of discussion to be in the minutes.
  • Next meeting placeholder 28th July, but may not happen (OMD depending).
  • Please fill out this UMD customer satisfaction survey in the next couple of weeks if you have a moment: https://www.surveymonkey.com/s/MQ6G8BZ

Tuesday 1st July

  • Today's ops meeting cancelled - partly due to forthcoming 4th EGI annual review.
  • EMI-2 decommissioning: The situation is followed by COD (GGUS 106354). "Please remember that we passed the decommissioning deadline and after today - Sites still deploying unsupported service end-points risk suspension, unless documented technical reasons prevent a Site Admin from updating these end-points (source PROC16)."
  • There is STILL use of UMD2/EMI2 APEL clients to send accounting data. As of today there are 20 sites (see latest list) still using UMD2/EMI2 APEL clients.



Monitoring - Links MyWLCG

Tuesday 22nd July

  • Chris noted yesterday that gstat reports most sites as critical. SB thought the underlying problem is that a value is supposed to be "Production" in GLUE 1 and "production" in GLUE 2; at some point they changed both to lower case.
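
For reference, a quick way to see what a site actually publishes for the two status attributes (GLUE 1 GlueCEStateStatus should read "Production", GLUE 2 GLUE2ComputingShareServingState should read "production") is an ldapsearch against the site BDII - a rough sketch below, with a placeholder BDII hostname:

  ldapsearch -x -LLL -H ldap://site-bdii.example.ac.uk:2170 -b o=grid \
      '(objectClass=GlueCE)' GlueCEStateStatus
  ldapsearch -x -LLL -H ldap://site-bdii.example.ac.uk:2170 -b o=glue \
      '(objectClass=GLUE2ComputingShare)' GLUE2ComputingShareServingState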

Tuesday 15th July

On-duty - Dashboard ROD rota

Tuesday 29th July

  • Dealing with low availability alarms in dashboard - notepad message from ROD?

Tuesday 22nd July

  • Another quiet week. Bham availability alarm ticket created.

Rollout Status WLCG Baseline

Monday 28th July

Tuesday 18th March

Tuesday 11th February

  • 31st May has been set as the deadline for EMI-2 decommissioning. There may be an issue for dCache (related to 3rd party/enstore component).

References


Security - Incident Procedure Policies Rota

Tuesday 29th July

Tuesday 22nd July

Monday 14th July

  • EGI CSIRT ADVISORY [EGI-ADV-20140625]



Services - PerfSonar dashboard | GridPP VOMS

- This includes notification of (inter)national services that will have an outage in the coming weeks or will be impacted by work elsewhere. (Cross-check the Tier-1 update.)

Tuesday 22nd July

  • There was a problem with VOMS admin that was noticed last Thursday: VO page requests resulted in the WEB-INF directory of the jetty app being displayed. A server restart fixed the problem.

Tuesday 17th June

  • The GridPP VOMS server was updated on 11/06/2014 - no issues reported.


Tickets

Monday 4th August 2014, 14.30 BST
20 Open UK tickets this week.

NGI/Other
107369(30/7)
NGIs are being asked to ask cloud sites to fill in a questionnaire about the security of grid-deployed cloud resources. This ticket was meant for 100IT - although shouldn't there be one for UKI-GridPP-Cloud-IC too? (I couldn't see such a ticket in the solved pile.) I've assigned it to uk-ngi ops and notified the site of the ticket. Assigned (4/8)

106615(2/7) Decommissioning ticket for the FTS2 service at RAL on 2/9/14. Nothing else to do really; on hold until closer to the time. On hold (14/7)

BRISTOL
106325(18/6)
CMS pilots losing their network connections. The Bristol admins are waiting to see how things pan out for a similar RAL ticket (106324) but I'm not sure if waiting for this is the right thing to do - it could be that the RAL problems are very RAL specific. On hold (14/7)

106554(29/6)
Another CMS ticket, about FTS backlogs between FNAL and Bristol. Although the original transfer has finished, a connectivity problem still seems to persist. Lukasz has offered some suggestions and asked CMS how they'd like to proceed. Waiting for reply (29/7)

Could these two issues be somehow related?

GLASGOW
107435(1/8)
CMS glideins were getting held up at Glasgow. Dave and the gang tracked down missing /cms/Role=pilot explicit mappings in their Argus, and have added them in. Things are looking better, with the ticket now in the customary "How's it look on your end?" state. Waiting for reply (4/8)
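
For anyone hitting the same thing, the fix boils down to an Argus policy that explicitly matches the pilot role. A minimal sketch in the Argus simplified policy language is below - illustrative only: the resource/action patterns and exact rules depend on the local setup, and policies are normally loaded via pap-admin rather than edited in place.

  resource ".*" {
      obligation "http://glite.org/xacml/obligation/local-environment-map" {}
      action ".*" {
          # explicit match for the CMS pilot role, alongside the plain VO rule
          rule permit { pfqan = "/cms/Role=pilot" }
          rule permit { vo = "cms" }
      }
  }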

ECDF
95303(1/7/13)
glexec deployment ticket. There's been some movement on the glexec tarball development front at last. Jeremy tweaked the reminder date and the assigned person. On hold (29/7)

SHEFFIELD
107217(24/7)
Sheffield failing site-bdii checks due to the old "all the 4s" problem, where the published waiting-jobs count is stuck at the 444444 placeholder (usually a sign of broken dynamic publishing). The ticket's acknowledged, and the expiry has been extended, but it could do with a proper update soon. In progress (1/8)
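
A quick way to confirm this from outside is to ask the site BDII for the published waiting-jobs counts - sketched below with a placeholder hostname. A value of 444444 is the static GIP placeholder, i.e. the dynamic information provider is not overwriting it.

  ldapsearch -x -LLL -H ldap://site-bdii.example.ac.uk:2170 -b o=grid \
      '(objectClass=GlueCE)' GlueCEUniqueID GlueCEStateWaitingJobs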

LANCASTER
100566(27/1)
Poor perfsonar performance. After reinstalling the node and establishing that there is no hard bottleneck, we're stuck. Currently waiting on some network engineer time at Lancaster, whilst scratching our heads over why the perfsonar isn't working right for us. On hold (4/8)
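
(For anyone wanting to repeat the "no hard bottleneck" check by hand, a manual throughput test between the perfSONAR host and a willing remote end is the usual starting point - a sketch below, with a placeholder hostname and assuming an iperf server is running at the far end.)

  # remote end: iperf -s
  iperf -c ps-remote.example.ac.uk -t 30 -i 5     # single TCP stream for 30s
  iperf -c ps-remote.example.ac.uk -t 30 -P 4     # four parallel streams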

95299(1/7/13)
Tarball glexec ticket. I've opened up a line with the glexec devs, who have been very helpful. They've given me some build tips, but then I went on holiday before I could use their advice. On hold (29/7)

UCL
101285(16/2)
Ben got the UCL perfsonar box back on its NICs, and is just waiting on getting it back into the WLCG mesh. Jeremy is on the case. Waiting for reply (29/7)

95298(1/7/13)
UCL's glexec ticket. Ben has stated this will have to wait until he is back from leave at the end of August. On hold (29/7)

RHUL
107436(1/8)
ATLAS having transfer problems to RHUL. Govind has tracked down a gridftp mapping problem that solves some of the errors; the rest seem to be due to his new pool nodes misbehaving. Perhaps a problem with the configuration of the latest version of DPM? In progress (3/8)

QMUL
107440(2/8)
LHCb seem to be having problems getting files from the input sandbox on what appears to be all the QMUL CREAM CEs. Chris is on the case. In progress (4/8)

107402(31/7)
Site BDII test failures. It looks like this problem has evaporated; Gareth has suggested that somebody at QMUL close the ticket if they're happy. Assigned (can be closed) (1/8)

Cloud-IC
106347(19/6)
The new cloud site was noted as hogging 12% of the CERN stratum one CVMFS connections. There was some discussion about this, and the Shoal installation at Oxford might well prevent it from happening again, but the site was down for maintenance so no confirmation could be made. When is the cloud site likely to be back in action? On Hold (14/7)
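
(The usual way to stop cloud VMs hitting the stratum ones directly is to point their CVMFS clients at a site or cloud squid - whether discovered via Shoal or configured statically - rather than DIRECT. A sketch of the client side, with an illustrative proxy hostname:)

  # /etc/cvmfs/default.local (sketch - proxy hostname is illustrative)
  CVMFS_REPOSITORIES=cms.cern.ch,atlas.cern.ch
  CVMFS_HTTP_PROXY="http://squid.example.ac.uk:3128;DIRECT"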

EFDA-JET
97485(21/9/13)
LHCb authentication errors at JET, which have survived OS and EMI upgrades. I've been all talk and no trousers about getting round to helping JET out - has anyone seen anything like this? Or have any ideas? On Hold (29/7)

TIER 1
107416 (31/7)
The RAL FTS has been accused of hammering the US MWT2 SRM. Andrew has suggested a course of action that might soothe things. Waiting for Reply (4/8)

106655 (4/7)
Castor failing ops tests. This was due to reasons that are understood by clever people. A fix was delayed, but hopefully it should roll out this week. In progress (31/7)

105405 (14/5)
Vidyo router firewall checking ticket. This ticket has been left fallow for a while, with some offline discussion between the Vidyo devs and RAL networking. Any news? On hold (1/7)

106324(18/6)
CMS pilots losing connection to their submission hosts. Firewall tweaks haven't fixed the problem, but a suggestion of changing the pilot "keepalive" parameter was put forward. There seems to be some confusion on the CMS end about the current state of this issue, but the last word says it persists. In progress (30/7)
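
(If the "keepalive" suggestion ends up being handled at the OS rather than the pilot level, it would look something like the sysctl sketch below - values purely illustrative, and only effective for connections that actually enable SO_KEEPALIVE.)

  sysctl -w net.ipv4.tcp_keepalive_time=600    # start probing after 10 min idle
  sysctl -w net.ipv4.tcp_keepalive_intvl=60    # probe every 60 seconds
  sysctl -w net.ipv4.tcp_keepalive_probes=5    # drop after 5 failed probes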



Tools - MyEGI Nagios

Monday 14th July

Winnie reported on Saturday 12th July that most of the UK sites were failing nagios tests. The problem started with an unscheduled power cut around 2PM on 11th July at a Greek site hosting an EGI message broker (mq.afroditi.hellasgrid.gr). The message broker was put in downtime, but top BDIIs continued to publish it for quite a long time. Stephen Burke mentioned in a TB-SUPPORT thread that the default caching time is now 4 days. When I checked on Monday morning only Manchester was still publishing mq.afroditi, and it went away after Alessandra manually restarted the top BDII. It seems that Imperial is configured with a much shorter cache time. Only Oxford and Imperial were almost unaffected, and the reason may be that Oxford WNs have the Imperial top BDII as the first option in BDII_LIST. Other NGIs have reported the same problem and this outage is likely to be considered when calculating availability/reliability. All Nagios tests have now come back to normal.

Emir reported this on the tools-admin mailing list: "We were planning to raise this issue at the next Operations meeting. In these extreme cases the 24h cache rule in the Top BDII has to be somehow circumvented."
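
(For a quick check of whether a given top BDII is still advertising the broker, an ldapsearch along these lines does the job - the top BDII hostname below is a placeholder.)

  ldapsearch -x -LLL -H ldap://top-bdii.example.ac.uk:2170 -b o=grid \
      '(GlueServiceUniqueID=*mq.afroditi.hellasgrid.gr*)' \
      GlueServiceUniqueID GlueServiceEndpoint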

Tuesday 1st July

  • There was a monitoring problem on 26th June. All ARC CEs were using storage-monit.physics.ox.ac.uk for replicating files as part of the nagios testing. storage-monit was updated but not re-yaimed until later. storage-monit was broken for the morning, leading to all ARC SRM tests failing.
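
("Re-yaimed" here just means re-running the YAIM configurator on the node after the update - roughly the command below, where the site-info path and node type are placeholders.)

  /opt/glite/yaim/bin/yaim -c -s /root/glitecfg/site-info.def -n SE_dpm_mysql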

Tuesday 24th June

  • An update from Janusz on DIRAC:
  • We had a stupid bug in DIRAC which affected the gridpp VO and storage. Now it is fixed and I was able to successfully upload a test file to Liverpool and register the file with the DFC.
  • The async FTS is still under study; there are some issues with this.
  • I have a link to software to sync the user database from a VOMS server, but I haven't looked into this in detail yet.
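
(For anyone wanting to repeat the upload test, the upload-and-register step maps onto the standard DIRAC data-management command - a sketch below, with an illustrative LFN, DIRAC group and SE name.)

  dirac-proxy-init -g gridpp_user
  dirac-dms-add-file /gridpp/user/s/someuser/test.file ./test.file UKI-NORTHGRID-LIV-HEP-disk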


VOs - GridPP VOMS VO IDs Approved VO table

"Monday 14th July 2014"

  • HyperK.org will initially use remote storage (irods at QMUL) - so CPU resources would be appreciated.

"Monday 30 June 2104"

  • HyperK.org request for support from other sites
    • 2TB storage requested.
    • CVMFS required
  • Cernatschool.org
    • WebDAV access to storage - world read works at QMUL.
    • Ideally will configure federated access with the DFC, as the LFC allows.


Monday 16 June 2014

  • CVMFS
    • Snoplus almost ready to move to CVMFS - waiting on two sites. Will use symlinks in existing software.
  • VOMS server: Snoplus has problems with some of the VOMS servers - see GGUS 106243 - may be related to the update.


Tuesday 15th April

  • Is there interest in an FTS3 web front end? (more details)


Site Updates

Tuesday 20th May

  • Various sites, but notably Oxford, have ARGUS problems, with hundreds of requests seen per minute. Performance issues have been noted after initial installation at RAL, QMUL and others.


Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports

Empty

GridPP ops meeting - Agendas Actions Core Tasks

Empty


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda Meeting takes place on Vidyo.

Wednesday 30th July 2014

  • Operations report
  • Business as usual....
  • The termination of the FTS2 service has been announced for the 2nd September.
  • The software server used by the smaller VOs will be turned off - also on 2nd September.
  • We are planning to turn off access to the CREAM CEs (possibly keeping them open for ALICE) - although no date has yet been decided for this.

WLCG Grid Deployment Board - Agendas MB agendas

Empty



NGI UK - Homepage CA

Empty

Events
UK ATLAS - Shifter view News & Links

Empty

UK CMS

Empty

UK LHCb

Empty

UK OTHER
  • N/A
To note

  • N/A