GDB 12th February 2014

From GridPP Wiki

HEP SW collaboration [1]

Hardware architecture is changing, while the whole HEP software stack has been built following a serial paradigm. It needs to change, but the community does not have the expertise and needs to build it. To coordinate this within the HEP community and with external organisations, and to organise the activities, a formal collaboration is to be set up. Geant and ROOT are the obvious candidates to be rewritten, as they are at the core of the experiments' software and are also used by external partners. There is potential for funding (only for CERN?). There will be a 1.5-day initial meeting in April. Invitations have been sent to major HEP labs and organisations such as GridPP (have we received it? Anyone going? We should.).

IPV6 update [2]

There was a F2F meeting on January 23/24. There were reports from the CERN cloud and OpenStack on the use of IPv6, and other reports from sites. There was also a talk on xrootd v4 IPv6 compliance. A mesh of data transfers with *FTP over IPv6 was set up in preparation for CHEP, but then the situation at all sites started to degrade and transfers stopped working. This was due to people not taking care of the test instances, not a problem with IPv6 in itself. ATLAS is also involved now: an IPv6 flag has been introduced in AGIS, and the plan is to modify the application and run small data transfers and HC tests. There is a need to define use cases outside CERN and to continue testing transfer protocols, which of course needs volunteer sites. It is important to know which services outside CERN need to be dual-stack. Participating sites need to be compensated/forgiven if their availability suffers; the WLCG MB will discuss this. Software and tools IPv6 compliance is tracked in a database. dCache is not compliant yet: v2.8.0 is being tested at NDGF and the problems are being discussed with the dCache team; it is not trivial. There will be a survey on WLCG sites' status. A one-day workshop, possibly a pre-GDB, aimed primarily at T1 sites would be welcome.
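
As a side note, a minimal way to check whether a given service host is reachable as dual-stack is to look at its DNS records. The sketch below (the hostname is a made-up example, not a real endpoint) uses Python's socket module to see whether a name resolves over both IPv4 and IPv6.

    import socket

    def address_families(host):
        """Return which of IPv4/IPv6 the host name resolves to (illustrative check only)."""
        families = set()
        for family in (socket.AF_INET, socket.AF_INET6):
            try:
                if socket.getaddrinfo(host, None, family):
                    families.add(family)
            except socket.gaierror:
                pass
        return families

    # Hypothetical service host, used only as an example.
    fams = address_families("se.example-site.ac.uk")
    print("IPv4:", socket.AF_INET in fams, "IPv6:", socket.AF_INET6 in fams)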

Sites are strongly encouraged to participate and participating sites will be presented in a positive light to the funding agencies. It is better to disrupt the service before the end of LS1.

Future of Scientific Linux [3]

CentOS and RHEL are merging into a new open-source project. The SL team at Fermilab has proposed three ways to integrate with this project, with the emphasis on becoming a CentOS Special Interest Group. The SLC team would rather not put in that effort and would instead just adopt CentOS and set up add-on repositories.

Discussion on the model

Nikhef High Memory Observation [4]

What high-memory problem do we have? The limit at Nikhef is 4GB pvmem. Some jobs have out-of-control memory usage, and jobs don't ask for memory at all. Nikhef sets the pvmem limit at queue level, which translates to a ulimit on the process. What if users need more? It is possible; there is a link in the talk to the investigation done at Nikhef on how to do it, but there are some inconsistencies: for example, Torque allows two limits to be set, one on job mem and one on pvmem, and the one on pvmem is ignored. On the link between memory and cores, Maui can over-allocate memory for jobs if they are a minority. Maui had problems with 8-core jobs: it allocated 32GB of memory per core rather than 32GB to the whole job, and this needed to be corrected manually. Considering the current Maui status it might not be worth going and digging into the code.
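
For illustration only (this is not Nikhef's actual configuration, and the 4GB figure is just the value quoted above), a pvmem-style limit effectively becomes an address-space ulimit on the job's processes. The sketch below shows what such a limit looks like from inside a process, using Python's resource module.

    import resource

    # Assumed limit: 4 GB of virtual memory per process (pvmem-style ulimit).
    FOUR_GB = 4 * 1024**3

    # Apply the limit to the current process, as the batch system's ulimit would.
    resource.setrlimit(resource.RLIMIT_AS, (FOUR_GB, FOUR_GB))

    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    print("virtual memory limit: %.1f GB" % (soft / 1024**3))

    # Allocations beyond the limit now fail immediately with MemoryError
    # instead of the job being killed later by the batch system.
    try:
        blob = bytearray(5 * 1024**3)
    except MemoryError:
        print("allocation beyond the pvmem-style limit was refused")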

The discussion about Maui status is postponed to the pre-GDB in March.

WLCG Ops Coord F2F report [5]

glexec deployment

The glexec tests are about to become critical, but they are not yet. Some of the tests are not yet in an optimal state. There are still 20 sites with a GGUS ticket that haven't deployed; they are not CMS sites, so they do not have a problem with the tests. See the later presentation on SAM tests.

PerfSONAR

There are 20 sites without perfSONAR and 20 with old releases. This will be brought to the attention of the experiments.

Tracking Tools Evolution

The functionality needed in GGUS for the Savannah-to-JIRA migration is still not there. JIRA is still missing some functionality, and the full migration is still a long way off.

SHA-2 migration

Experiments need to adapt to the VOMS-admin interface. There is progress on the distribution campaign.

Machine/Job features

Progress on the bidirectional communication between job and batch system

Middleware readiness

The proposed model will rely on the experiments' testing frameworks, for example SAM/HC. Monitoring needs to be adapted. Sites willing to deploy testing instances of some services are needed. ATLAS already has a full-chain prototype. We have to find a mechanism to reward sites that volunteer.

Enforcement of baseline versions

There is a need to monitor the versions at sites and use that information to run ad hoc upgrade campaigns, and eventually to automate this. A possible approach is the usage of Pakiti on all WNs and services at sites.
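
Purely as an illustrative sketch (the package names and baseline versions below are invented, and this is not how Pakiti works internally), comparing reported package versions against a baseline to flag candidates for an upgrade campaign could look like this:

    # Hypothetical baseline versions and per-site reports; not real WLCG data.
    BASELINE = {"dcache": (2, 6, 0), "fts": (3, 1, 0)}

    SITE_REPORTS = {
        "SITE-A": {"dcache": (2, 2, 4), "fts": (3, 1, 1)},
        "SITE-B": {"dcache": (2, 6, 5), "fts": (3, 2, 0)},
    }

    def outdated(reports, baseline):
        """List (site, package) pairs running a version below the baseline."""
        return [(site, pkg)
                for site, pkgs in reports.items()
                for pkg, version in pkgs.items()
                if version < baseline.get(pkg, (0,))]

    for site, pkg in outdated(SITE_REPORTS, BASELINE):
        print("upgrade campaign candidate: %s needs a newer %s" % (site, pkg))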

WMS decommissioning

The shared and CMS WMS clusters will be shut down in April. There will be no WMSs at CERN anymore, nor support for them.

Multicore deployment

ATLAS and CMS have opposite deployment models. We do not discuss whether one model is better than the other; instead we will test a mixture of the two models to find the best compromise.

Maui's lack of support is also a concern for multicore.

FTS3 deployment

There is consensus on the FTS3 deployment model. There will be a limited number of FTS3 servers, but having multiple servers is useful. The aim is to deploy 3 or 4 servers in production with a shared configuration. The three LHC VOs will then distribute the load among the instances according to policies and needs.
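
As a toy illustration of the idea (the endpoints below are invented and this is not an agreed load-balancing algorithm), a VO could spread transfers across a few shared FTS3 instances, for example by hashing the source/destination pair so that a given channel always lands on the same server:

    import hashlib

    # Invented endpoint names, purely for illustration.
    FTS3_SERVERS = [
        "https://fts3-a.example.org:8446",
        "https://fts3-b.example.org:8446",
        "https://fts3-c.example.org:8446",
    ]

    def pick_server(source_se, dest_se, servers=FTS3_SERVERS):
        """Deterministically map a (source, destination) channel to one FTS3 instance."""
        key = ("%s->%s" % (source_se, dest_se)).encode()
        index = int(hashlib.sha1(key).hexdigest(), 16) % len(servers)
        return servers[index]

    print(pick_server("srm://se1.example.org", "srm://se2.example.org"))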

Experiment computing commissioning

A heavy commissioning exercise like STEP09 wasn't considered necessary, as the infrastructure is already there and working. Commissioning activities this time are experiment-specific, and things like multicore are already being followed.

SAM tests scheduling [6]

Progress report on the transition of SAM tests from the WMS framework to a Condor one. The problem of SAM tests without the lcgadmin role was highlighted: tests that are submitted with a production or pilot role might get stuck in the queue due to fairshares and time out. This concerns only the so-called WN tests such as glexec. ATLAS and CMS have two different opinions on how to solve this in the short term; the long term is still under discussion. There was a long discussion about a WMS timeout, which was a bit of a waste of time considering the WMS is going to be phased out in April.

WLCG dashboard [7]

Progress report on the WLCG dashboard and how things have been simplified underneath by removing the extra database interaction and only using each underlying dashboard's API. So, for example, while before the FTS, FAX, AAA, etc. results were stored in a WLCG dashboard DB, the WLCG dashboard now takes the results from every single dashboard via its API.
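
Schematically (the URL and the helper below are invented, not the actual dashboard API), the simplification means the WLCG dashboard fetches results on demand from each source dashboard instead of copying them into its own database first:

    import json
    import urllib.request

    # Hypothetical endpoint of an underlying dashboard API; not the real URL.
    TRANSFERS_API = "https://dashboard.example.org/api/transfers?hours=24"

    def fetch_results(url=TRANSFERS_API):
        """Pull results straight from the source dashboard's API (no intermediate DB)."""
        with urllib.request.urlopen(url) as response:
            return json.loads(response.read().decode())

    # The aggregated WLCG view is then built from the per-dashboard responses, e.g.:
    # results = fetch_results()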

Data preservation [8]

Sorry missed it

WLCG Monitoring Consolidation [9]

Status report. Description of what the consolidation group is trying to simplify, from service maintenance to the code behind the dashboards (the WLCG transfers dashboard was cited as an example). Of particular importance for sites is the reduction of the number of dashboards: most of what exists will be moved to an SSB-like solution (for example SAM tests and REBUS). The sites' contribution was highlighted, with special thanks to David. Most of the next steps will concentrate on the new SAM framework.

LHCOPN/ONE Evolution Workshop [10]

Experiments are mostly happy with the situation. In general it was recognised that T1s and T2s are becoming more similar: some T2s might need more bandwidth while others are already fine. There was a clear desire to support LHCOPN upgrades for T1s, and some desire to engage with T2s that want to move, but not much manpower was committed to this.

There was a comment on how the network people and WLCG are still distant and things are communicated with a certain delay. Things should improve in the future.

Perfsonar update [11]

Progress report on the past few months. Sites are encouraged to move to 3.3.2 and enable the mesh tests. Several sites are still broken. Several slides were dedicated to the new dashboard. Also mentioned were IPv6 and the work done by Duncan to integrate it in perfSONAR.