GDB 9th February 2011

GDB, 9th February 2011 (2011-02-09)

Notes by Pete Gronbech

JG Intro: Availability calculations are to include CREAM by the end of March. APEL should have closed RGMA at the end of the year. That has been delayed; the final deadline is the end of February, and sites that have not upgraded will not be listed in the WLCG reports.


ARGUS: Christoph Witzig

The following institutes are working on ARGUS: CNAF, HIP, NIKHEF, SWITCH.

In EGEE III, releases 1.0 and 1.1 were released for service and pilot operation. Now, in EMI, release 1.2 is in gLite and 1.3 will be in EMI 1. There will be further development into other services.

Authorization of pilot jobs on the WN; global banning; integration into the CE, to provide authorization of jobs at a site.

4 areas of the service: initial rules, banning/unbanning, pilot job.

Status of release: blacklisting by DN, FQAN, VO or CA. OSCT/WLCG/EGI CSIRT operates a global banning service. Sites can use it or not; it is disabled by default. A national or regional service could be added.
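A minimal sketch of how the blacklisting categories noted above (DN, FQAN, VO, CA) and the optional global ban list could combine at a site. This is not the ARGUS API or its policy format; the class names, function names and example DNs are all hypothetical, and the only behaviour taken from the notes is that the global service is disabled by default.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Credential:
    """Hypothetical view of a job's credential (not an ARGUS type)."""
    dn: str              # subject distinguished name
    ca: str              # issuing CA distinguished name
    vo: str              # virtual organisation
    fqans: tuple = ()    # VOMS FQANs carried by the proxy


@dataclass
class BanList:
    """Hypothetical ban list with one set per blacklisting category."""
    dns: set = field(default_factory=set)
    fqans: set = field(default_factory=set)
    vos: set = field(default_factory=set)
    cas: set = field(default_factory=set)

    def is_banned(self, cred):
        # A credential is banned if any one of its attributes is listed.
        return (
            cred.dn in self.dns
            or cred.ca in self.cas
            or cred.vo in self.vos
            or any(fqan in self.fqans for fqan in cred.fqans)
        )


def is_denied(cred, site_bans, global_bans=None, use_global=False):
    """Site bans always apply; the global list only if the site opts in."""
    if site_bans.is_banned(cred):
        return True
    if use_global and global_bans is not None:
        return global_bans.is_banned(cred)
    return False
```

With `use_global` left at its default, a DN present only in the global list is still accepted; a site that opts in passes `use_global=True`.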


1.3 is due in April 2011. The server part has very few differences from 1.2 (Aug 2010); the client side will have more changes. The C client API will be modified to make it thread safe for ARC compatibility.

(There is a C interface and a Java interface; the Java one is used by dCache and CREAM.)

The current 1.2 release will work with gLite 3.2 and EMI, so there is no need to wait for the new release: sites can use 1.2 and then upgrade at their leisure after April. Over the next year the developments will be to integrate with CREAM, giving consistent authZ between CE and WN. Development is finished; the release, CREAM 1.7, is due in March, and v1.6.4 is released today. Integration of ARGUS and ARC is in prototype.

Also being integrated into UNICORE.

Main benefit is consistent authZ between jobs and data management, at request level not file level. Targeted for DPM, dCache, LFC and StoRM. The new dCache interface gPlazma2 can use it. StoRM is planned for summer 2011. Execution Environment Service plugins are being developed by Nikhef. Security review by PSNC, Poznan, and comments integrated into 1.2; EMI review under way.

Markus Schulz: comment that EMI will not be an upgrade; it will almost certainly be a reinstall, as the dependencies will be in EPEL.

Michel Jouvin: Support for ARGUS is now available as a Quattor template.

Romain: There is an ARGUS server at CERN where DNs will be inserted when there is an incident, and removed afterwards. The format is an XML dialect called XACML.

MUPJ : Maarten Litmaath

/ops/Role=pilot job + glexec test, Feb 6th. Generally we see a mixture of CEs; the ones that support glexec should publish the fact. T1s that fail will get a GGUS ticket from ML. This is a requirement for WLCG but not for EGI, so they cannot require that the ops test can do this on all the grid. CMS are using glexec at OSG sites, where glexec is installed in suid mode. Some sites need to install it in a different location (e.g. IN2P3), so the variable $GLEXEC_LOCATION can be used, but it is not currently defined on the WN; a bug has been opened. OSG has another variable, $OSG_GLEXEC_LOCATION; it would be nice to unify them.
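Until the two variables are unified, a pilot could look for glexec along these lines. This is only a sketch under stated assumptions: the notes do not say whether each variable holds an install prefix or the full path to the binary, so the helper accepts either, and the `sbin/glexec` layout under a prefix is an assumption. The function name is hypothetical.

```python
import os

# Variables named in the notes, tried in this order.
GLEXEC_VARS = ("GLEXEC_LOCATION", "OSG_GLEXEC_LOCATION")


def glexec_path(env=None):
    """Return a candidate path to glexec, or None if no variable is set.

    Assumption: a value that is an existing directory is an install
    prefix with the binary at <prefix>/sbin/glexec; anything else is
    taken to be the path to the binary itself.
    """
    env = os.environ if env is None else env
    for var in GLEXEC_VARS:
        value = env.get(var, "").strip()
        if not value:
            continue
        if os.path.isdir(value):
            return os.path.join(value, "sbin", "glexec")
        return value
    return None
```

The helper only builds a candidate path; it does not check that the binary exists or is setuid, so a caller would still need to verify it before use.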

Deadlines: intend to get all T1s done by March (ML says they could get fixed this week!); T2s by June?

Monitoring URLs should be advertised better, so people don’t have to paste the URL from this talk. (https://samnag023.cern.ch/nagios/cgi-bin/status.cgi?servicegroup=SERVICE_CE&style=detail and https://samnag023.cern.ch/nagios/cgi-bin/status.cgi?servicegroup=SERVICE_CREAM-CE&style=detail)

Probes are available in the Regional Nagios, but not configured. Have to be careful about making it critical as not all sites need to support this.

EGI will support making this available…

The main expt VOs have their own testing pages. (Links in the talk.) Sites need to publish so they can be tested, e.g. TRIUMF have it but do not publish this at the moment.

Could it be in the GOCDB?

GS: Complained that we have been talking about this for 2 years+, but is it really worth it? Could do more on the logging side. Remains to be convinced that glexec will give extra security.

Ian Bird: Should not stop; agree we should re-look. If the frameworks can provide traceability then OK.

Romain: No, if there are different pilot jobs you do not have traceability.

Jeff Templon: Don’t forget if no id switching then whole VO would be blocked.

GS: ATLAS can switch off users. Debate…..

ML: Accepts this is not a perfect solution but better than anything else, hope Virtual machines

MJ: Problem was not glexec but SCAS or ARGUS which is required to make use of it. Need to make sites confident that there is a configuration that works easily.

Information System status report : Flavia Donno

Clear it is being used in a different way than it was conceived. Proposed to deploy a set of well managed (and specced) top level BDIIs. Parallel deployment of static/semi-dynamic top level BDIIs.

PIC asked why it had to be 99% reliable. It’s because currently many services do not support failover, so they point at a single instance. France uses a DNS alias to fail over to a second system. Suggestion to have 3 machines, but some object.
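A minimal sketch of the client-side failover that many services currently lack: try a list of top level BDIIs in order and use the first one that answers. The function names and the host names in the usage note are hypothetical, and the probe is only a TCP connect, not a real LDAP query.

```python
import socket


def tcp_probe(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def first_reachable(endpoints, probe=tcp_probe):
    """Return the first (host, port) that the probe accepts, else None.

    `endpoints` is an ordered list, so the preferred instance goes first
    and the others act as fallbacks.
    """
    for host, port in endpoints:
        if probe(host, port):
            return host, port
    return None
```

For example, `first_reachable([("bdii1.example.org", 2170), ("bdii2.example.org", 2170)])` would return the second entry if the first is down. A DNS alias, as used in France, achieves a similar effect without changing the clients.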

Cleaning up/consolidating IS used attributes: doc in preparation. Deployment of semi-static top level BDIIs: code available in 1 week, VO acceptance tests possible in 2 weeks if the plan is accepted. Deployment of well managed top level BDIIs: had a good response from T1s.

Some discussion about the deployment procedure and also how users' jobs will know which BDII to query. (PG: I suspect that we either have to have one type or the other, i.e. just change the existing ones to have a longer timeout so they become semi-static.)

WLCG Information System Data Quality Meter: an attempt to monitor how many sites are publishing correctly.

Installed Capacity : John Gordon

T1s are all publishing and exceeding the WLCG MoUs. That’s good, but we cannot test if the values are correct.

The need for this info is to present it to funding agencies. Could upload the info by hand. It’s a way of telling us what fraction of the capacity is intended for a VO. Sites need to publish the shares (i.e. what percentage of the cluster is available to a particular LHC VO).
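The arithmetic behind the published shares can be sketched as follows. The function name, the figures in the usage note and the choice of capacity unit are illustrative assumptions, not taken from the talk.

```python
def capacity_per_vo(installed_capacity, shares_percent):
    """Split a site's installed capacity according to published VO shares.

    installed_capacity: total for the cluster, in whatever unit the site
                        publishes.
    shares_percent:     mapping of VO name to percentage of the cluster.
    """
    if any(pct < 0 for pct in shares_percent.values()):
        raise ValueError("shares must not be negative")
    if sum(shares_percent.values()) > 100:
        raise ValueError("shares add up to more than 100%")
    return {
        vo: installed_capacity * pct / 100.0
        for vo, pct in shares_percent.items()
    }
```

For instance, a cluster publishing 10000 units with shares of 50% and 30% for two VOs would report 5000 and 3000 units for them respectively. The check on the total only catches shares that cannot be right; it does not show that the published values represent reality.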

In the RRB we have only ever reported on installed capacity at T1s. T2s are difficult. The usual debate about how to tell if the information being published is correct. The sites need to check it to ensure it represents reality.

EMI: Major release once per year. This is the right place to introduce new features and backwards-incompatible changes. Minimize the number of changes in a major release.

Clear separation between maintaining production and introducing new features. The lifetime of a release is 2 years: full support for 18 months, then security updates for 6 months more.
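The support windows work out as 18 + 6 = 24 months. A small sketch of the date arithmetic, with hypothetical function names and a release date in the usage note that is only an example:

```python
import calendar
from datetime import date


def add_months(start, months):
    """Return the date `months` calendar months after `start`.

    The day is clamped to the last day of the target month.
    """
    index = start.year * 12 + (start.month - 1) + months
    year, month0 = divmod(index, 12)
    month = month0 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def support_windows(release_date):
    """Full support for 18 months, security only until 24 months."""
    return {
        "full_support_until": add_months(release_date, 18),
        "security_only_until": add_months(release_date, 24),
    }
```

For a release dated 30 April 2011, `support_windows(date(2011, 4, 30))` gives full support until 30 October 2012 and security updates until 30 April 2013.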

EMI 1: SL5 64-bit + EPEL. SL6 and Debian available in EMI 2.

Will install in /usr not /opt/glite. (tar ball WN?)

Meta packages are just for installation. They cannot be used to version the installation of a certain product. In general there is no upgrade path between major releases.

Morris Riedel: Developments:

Change of Globus from VDT to EPEL. VOMS using standard SSL (instead of GSI). Started migration towards GLUE2. Improved info validation checking. Improvements for Data components: NFS4.1, WebDAV, better performance.

Lots of effort going into EPEL compliance. This should help sys admins after the first installation. VOMS: migrated from VOMRS to VOMS-Admin. CREAM CE (CREAM, BLAH, CEMON): GLUE 1.3 still supported, but adding 2.0 support; ARGUS integration.

WMS: work in progress to support GLUE2; manage GLUE 1.3 and 2 in JDL. ARC. dCache (GLUE2, NFS4.1 (SL6),

FTS: can support resumption of GridFTP. DPM: improved dpm-drain.

PG: Asked if there will be a WN tarball. Discussion about putting things in EPEL; they should be tested by EGI. A problem between EMI/EGI: worries that sites will not be able to use the EPEL repo because new versions will appear there and be loaded by accident.


EGI: Tiziana Ferrari

UMD (Unified Middleware Distribution), staged rollout etc. EGI will include any component requested by the users, from EMI or other places (e.g. WLCG).

WLCG is a candidate Virtual Research Community

Looking for more Early Adopters.

Short of EAs for FTS, VOMS, Oracle. EAs apply for components they are interested in. Some components are easier than others. For CREAM he has 9 EAs and in 2 days had 4 try it, so that’s about the right number.

Long discussion about the release process, worries that it may take longer, and be less reactive to expt needs. EMI says not. EGI says not. Others not convinced.

Staged rollout should be to production systems not test systems.

TF: We have a problem with the number of partners participating.


Middleware Support: Markus Schulz

UI/WN and VO Box are not in EMI at the moment. Very sceptical that sites will upgrade to EMI 1 as it’s on SL5, so there is no real benefit for the effort. More likely if doing it at the same time as SL6, as that has other benefits such as better IO in the kernel.

Need to reinstall the node, as removing rpms would leave the node in a messy state, possibly causing trouble. Reinstalling WN is not a problem but other services would be harder.

ML: Only SEs are difficult; all others can be wiped out and replaced with a little downtime.

Middleware update: gLite 3.1 retirement. New gLite web pages active: http://glite.cern.ch/ SA1 coordination meeting. No more lcg-CA releases!! This is now done by EGI. Installation and configuration guide will be updated: https://twiki.cern.ch/twiki/bin/view/LCG/GenericInstallGuide320

CREAM 1.6.4 is released today; 1.7 is due in March, the same as the EMI release.

CREAM CEs should support all VOs.


UI/WN/VOBOX were ready but held back to allow inclusion of CREAM client tools.

Torque_utils, client and server 2.3.13 to be integrated in gLite. Best effort by Nikhef.

Condor best effort by PIC

Send an email to PIC to say Condor support is required.

Effort in EMI is available for this, not clear who is responsible.

No… Nikhef EMI effort is for security not batch systems.

IB: Is there something holding up the Torque support for SL5? Configuration changes for YAIM etc.

People funded by EMI are not funded for batch system support.



SLC5.6 upgrade issue: Helge Meinhard

LHCb and LCD use tcmalloc.so, which is glibc dependent and may require recompilation when glibc changes. Code run on RHEL/SLC 5.6 without recompiling tcmalloc.so broke. Emergency work-around needed checking.

RH 5.6: 13 Jan. CERN: 20 Jan. SL 5.6: last week (but not on web?). Rollout at CERN on 28 Jan.

New version of glibc and subversion client (big change)

Put 10% of lxplus on to it. Did something go wrong with the certification procedure?

Minor releases (security, bug fixes, etc) usually not necessary.