GDB 12th December 2012

Vidyo was playing up, so these notes aren't great I'm afraid.
-Matt

Agenda: http://indico.cern.ch/conferenceDisplay.py?confId=155075

Introduction

http://indico.cern.ch/getFile.py/access?contribId=3&sessionId=0&resId=0&materialId=slides&confId=155075

Missed the first talk, sorry about that.

CVMFS Task Force

http://indico.cern.ch/getFile.py/access?contribId=4&sessionId=0&resId=1&materialId=slides&confId=155075

-All major VOs require CVMFS.
-A large number of non-CVMFS sites haven't responded to requests for information about their CVMFS deployment, but I don't think that...
-The ATLAS & LHCb deadline for deployment at all grid sites is 30 April 2013.
-Tickets will be sent out in Feb.
-All SL6 binaries will be in CVMFS.

-CVMFS scaling has been tested up to 2000 job slots.
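A quick way for a site to sanity-check its client deployment is to confirm that the experiment repositories are mounted and readable; a minimal sketch (the repository list is a typical example, not an authoritative one):

<pre>
#!/usr/bin/env python
# Minimal sketch: check that experiment CVMFS repositories are mounted
# and readable on a worker node. Repository names are typical examples,
# not an authoritative list.
import os

REPOS = ["atlas.cern.ch", "lhcb.cern.ch", "cms.cern.ch", "alice.cern.ch"]

def check_repo(repo):
    path = os.path.join("/cvmfs", repo)
    try:
        # Listing the top level forces autofs to mount the repository.
        return len(os.listdir(path)) > 0
    except OSError:
        return False

if __name__ == "__main__":
    for repo in REPOS:
        print("%-20s %s" % (repo, "OK" if check_repo(repo) else "MISSING"))
</pre>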

perfSONAR

http://indico.cern.ch/getFile.py/access?contribId=5&sessionId=0&resId=1&materialId=slides&confId=155075

-As we know, perfSONAR is expected at all WLCG sites.
-Two boxes is the recommended number.
-The "first wave" of sites needing perfSONAR was chosen by the experiments.

-The perfSONAR mesh has been tested in the US & Italy.
-Currently installed via a bash script.
-Mesh configuration will be in the Q1 2013 release of perfSONAR.
-Don't wait for the new release.

-One mesh per region, with intra-region testing first.
-The second tier of tests will be T2s to a foreign T1.
-The third tier of tests is T2 vs T2 (the pairings this implies are sketched below).
-Standards for the tests will be laid out.
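A minimal sketch of the pairings the three tiers imply; the site names and region layout below are invented purely for illustration:

<pre>
# Sketch of the three tiers of perfSONAR test pairings described above.
# Site names and the region layout are invented for illustration only.
from itertools import combinations, product

region = {"T1": ["RAL"], "T2": ["UKI-LT2-QMUL", "UKI-NORTHGRID-LANCS"]}
foreign_t1s = ["FZK", "CNAF"]
foreign_t2s = ["DESY-HH", "INFN-MILANO"]

# Tier 1: full mesh inside the region (every pair of regional hosts).
intra_region = list(combinations(region["T1"] + region["T2"], 2))

# Tier 2: each regional T2 against foreign T1s.
t2_to_foreign_t1 = list(product(region["T2"], foreign_t1s))

# Tier 3: regional T2s against foreign T2s.
t2_vs_t2 = list(product(region["T2"], foreign_t2s))

for name, pairs in [("intra-region", intra_region),
                    ("T2 -> foreign T1", t2_to_foreign_t1),
                    ("T2 vs T2", t2_vs_t2)]:
    print(name, pairs)
</pre>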

-The UK got an honourable mention as "very involved".
-The UK is not meshed yet though.

-1 "name" per region/cloud asked for - who's ours? Is it Duncan?

-The dashboard is being put under a BSD licence and going onto GitHub. Lots of work is going into this, including a comprehensive user API.

Unsupported Middleware Migration Update

http://indico.cern.ch/getFile.py/access?contribId=6&sessionId=1&resId=1&materialId=slides&confId=155075

-No sites suspended.
-The new deadline on tickets from October is Dec 17th. All sites with affected services must put them in downtime (done for the UK IIRC). "At risk" downtimes seem to be sufficient.

-Deadline still 31st Jan.
-ARC probes needed.

-EMI1 probes will start raising warnings from January and alarms from March.

Glue 2 requirements

http://indico.cern.ch/getFile.py/access?contribId=13&sessionId=1&resId=1&materialId=slides&confId=155075

-The glue-validator tool is being put into SAM to monitor GLUE 2.0 compliance.
-Jan release?

-Eventually it will be integrated into the resource BDII - if it doesn't comply, it doesn't get published.
-This might take a while to get out.

-The ginfo tool was mentioned.
-Feedback requested, especially on which GLUE 2.0 attributes are most important for the experiments.
-EGI success with GLUE 2.0 is needed to convince OSG to get involved.
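For reference, GLUE 2.0 information can also be queried directly from a top-level BDII over LDAP; a minimal sketch using python-ldap (the BDII host, object class and attributes are just examples):

<pre>
# Sketch: query GLUE 2.0 objects from a top-level BDII over LDAP.
# GLUE 2.0 entries are published under the "o=glue" base on port 2170.
# Host, object class and attribute list are examples only.
import ldap

conn = ldap.initialize("ldap://lcg-bdii.cern.ch:2170")
results = conn.search_s("o=glue",
                        ldap.SCOPE_SUBTREE,
                        "(objectClass=GLUE2ComputingService)",
                        ["GLUE2ServiceID", "GLUE2ServiceType"])
for dn, attrs in results[:10]:
    print(dn, attrs)
</pre>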

Oracle at Tier1s: ATLAS and CMS database talks

ATLAS

http://indico.cern.ch/getFile.py/access?contribId=9&sessionId=2&resId=0&materialId=slides&confId=155075

-Happy with running Oracle + Frontier for access to the Geometry, Trigger and Conditions DBs.
-Of the Tier-1s surveyed (RAL included), only KIT had plans to phase out Oracle.
-This will be sufficient.
-Plan to work out how to split the sites that rely on KIT during Summer 2013.

-DB releases:
-Remove HOTDISK from CVMFS sites.
-Move from Conditions DB releases to conditions files for reprocessing.
-Then move towards Frontier.
-DB release technology to be used for stable, large-scale endeavours.
-Convince the detector people to move to COOL instead of conditions data files.

-Tag Evolution:
-Looking at using Hadoop/HBase instead of Oracle for the EventIndex.
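To illustrate the sort of access pattern an HBase-backed EventIndex implies, a minimal sketch using the happybase client; the Thrift host, table name, column family and key scheme are all invented for illustration:

<pre>
# Sketch: event-lookup-style key/value access against HBase via Thrift,
# using the happybase client. Host, table name, column family and key
# layout are hypothetical, purely to show the access pattern.
import happybase

conn = happybase.Connection("hbase-thrift.example.org")  # hypothetical host
# conn.create_table("eventindex", {"loc": dict()})       # one-off setup
table = conn.table("eventindex")

# Key on run number + event number; store where the event lives.
key = b"run00212967:evt000001234"
table.put(key, {b"loc:guid": b"DATASET.GUID.EXAMPLE",
                b"loc:lfn": b"/atlas/example/file.root"})

row = table.row(key)
print(row.get(b"loc:guid"))
</pre>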

-Muon Calibration:
-Not benefiting from the CERN Oracle external licence.
-Looking at cheaper alternatives.

CMS

http://indico.cern.ch/getFile.py/access?contribId=10&sessionId=2&resId=0&materialId=slides&confId=155075

-Use Frontier for conditions; a stable infrastructure (the access pattern is sketched below).
-Would like a backed-up Frontier Launchpad and a replicated Oracle instance at a Tier-1, primarily as insurance.
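As a reminder of how that access works: conditions requests are plain HTTP, fetched through the site squid cache(s) with fallback towards the central launchpad. A minimal sketch of that failover pattern (all hostnames, ports and the request are placeholders; the real Frontier client encodes SQL queries into the URL):

<pre>
# Sketch of the Frontier-style access pattern: try the site squids first,
# then fall back to direct launchpad access. Hostnames and the request
# itself are placeholders, not real endpoints.
import requests

PROXIES = [
    {"http": "http://squid1.example-site.ac.uk:3128"},
    {"http": "http://squid2.example-site.ac.uk:3128"},
    None,  # last resort: go direct to the launchpad
]
LAUNCHPAD_URL = "http://frontier.example.org:8000/ExampleFrontier/Frontier"

def fetch_conditions(params):
    last_error = None
    for proxies in PROXIES:
        try:
            resp = requests.get(LAUNCHPAD_URL, params=params,
                                proxies=proxies, timeout=10)
            resp.raise_for_status()
            return resp.content
        except requests.RequestException as err:
            last_error = err
    raise RuntimeError("all Frontier routes failed: %s" % last_error)
</pre>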

-Only need Oracle for FTS at Tier-1s.
-When an Oracle-less FTS is released, CMS are happy for Tier-1s to drop Oracle.

-CMS use a number of Oracle instances at CERN (PhEDEx, T0 tracking, organised processing, conditions).
-They also use CouchDB & MongoDB.
-If there are good savings to be had from reducing Oracle usage, more could migrate to such things.


DPM Workshop

http://indico.cern.ch/getFile.py/access?contribId=7&resId=0&materialId=slides&confId=155075

-A successful workshop.
-Plenty of "bread and butter" feedback.
-Complaints about draining were mentioned a few times.

-The current GridFTP doesn't redirect.
-Hope to work with Globus on this.
(Then things went silent for a bit - damn you Vidyo!)
-Positive view of DPM from the experiments.
-DMLite is being used in Rucio.
-LFC is being left behind.
-The live demo did well - it even used a Raspberry Pi as one of the servers!
-Next workshop Mon 18th March in Taipei, during ISGC.
-Creation of a DPM Collaboration confirmed.

Virtualised WNs

http://indico.cern.ch/getFile.py/access?contribId=8&resId=1&materialId=slides&confId=155075

-Tony points out that traceability is still an issue.
-The working group came up with trusted images, trusted by "image endorsers".
-Policy has been agreed upon; technical points have been defined.

-A model was shown that uses a shared image repository.
-The working group has dealt with many issues:
-CVMFS deals with some of the needs.
-Took too long testing static images in CernVM.
-Move to work on this now.
-Issues moving credentials into the VM image.
-How to integrate pilot factories?
-Avoidance of queuing virtual instantiation requests (which pilot factories could easily end up creating).
-Communication of resource availability to the experiments.

-Discussion of VMs:
-The "end of the batch system"? Cloud != batch.
-Traceability concerns for high volumes of jobs in a single VM.
-glexec would work as it would on a physical node.

-VM-per-job was mentioned. The overhead of this is considered by many too wasteful (see the rough numbers below).
-Thanks to CVMFS you don't necessarily need a VM to run in this fashion...
-Security and other updates in VMs would occur naturally in the VM "churn".
-I/O inefficiency was also mentioned, but as with many such factors the benefits outweigh the costs.
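A back-of-the-envelope illustration of why the overhead depends so strongly on job length (all numbers are assumptions for illustration, not measurements):

<pre>
# Rough illustration of VM-per-job overhead: fraction of wall time lost
# to instantiating/booting a VM for each job. All numbers are assumed.
boot_seconds = 120.0  # assumed time to instantiate and boot a VM

for job_hours in (0.5, 2, 6, 12):
    job_seconds = job_hours * 3600
    overhead = boot_seconds / (job_seconds + boot_seconds)
    print("%5.1f h job: %.1f%% overhead" % (job_hours, 100 * overhead))
</pre>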

And then Vidyo started to suck for a bit :-(


Software Defined Networks - talk from ESnet

http://indico.cern.ch/getFile.py/access?contribId=1&resId=1&materialId=slides&confId=155075

-100Gb infrastructure across the US with interlinks elsewhere. Access to 44 100Gb channels.
-Variable network conditions.
-TCP is not suited to big data transfer, especially on "long" links (see the rough throughput estimate below).
-Either replace/upgrade TCP or create a loss-free environment for TCP to work in.
-OpenFlow started as a network security tool.
-It uses flow tables to decide what to do with packets.
The talk got quite technical from here; my notes couldn't do it justice, so I advise you to read the slides if interested.
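To see why long, lossy paths hurt TCP so much, the standard Mathis et al. approximation for steady-state TCP throughput gives a feel for the numbers (the RTT and loss rate here are illustrative assumptions, not figures from the talk):

<pre>
# Steady-state TCP throughput estimate (Mathis et al. approximation):
#   rate ~ (MSS / RTT) * C / sqrt(loss), with C ~ 1.22 for standard TCP.
# RTT and loss values below are illustrative assumptions.
import math

def tcp_throughput_mbps(mss_bytes=1460, rtt_s=0.15, loss=1e-5):
    rate_bytes_per_s = (mss_bytes / rtt_s) * 1.22 / math.sqrt(loss)
    return rate_bytes_per_s * 8 / 1e6

# A transatlantic-scale RTT with even a tiny loss rate caps a single
# stream far below the capacity of a 100Gb link.
print("%.0f Mb/s per stream" % tcp_throughput_mbps())
</pre>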


Post EMI Lifecycle Management

http://indico.cern.ch/getFile.py/access?contribId=14&resId=0&materialId=slides&confId=155075

"No funding" paradigm coming up</br> SLAs will be based on good will.</br> Control (of a project/product/developement) will move with funding.</br>

(Some) overlap between EGI activities and what is needed after EMI.

The roles of ScienceSoft and the "EMI consortium" are unclear.

Different stacks:
-The OSG tool-chain is moving closer to us.
-For NDGF/ARC, not much coupling is needed/possible.

Product teams must decide on non-LHC VO requirements.
PTs are asked to remote into preGDBs, which will be the forum for discussion of LHC VO technical requirements and other issues.
Strong move to only two repos (EPEL & the OS).

Strong rejection of rollback.
-No centrally proposed mandatory rollback is foreseen.

Then we had some more Vidyo suckage.

-Rejection of versioned meta-packages.
-Alternatives are difficult to manage.
-Meta-packages will change very slowly.

-Version discovery will be done via querying GLUE.

OS platforms:
-gLite-descended PTs can use any OS as long as one of SL/SLC/CentOS/RHEL is covered.
-Coordination (between PTs) is only needed for major version moves.

YAIM core ends soon; not much love for it.
-Puppet support is being considered (like DPM).
-YAIM support must continue for at least a year to allow a transition.

Pilot services for some PTs (not required for all); a bit like early adopters.
Not known what to do for the CEs :-S - batch system integration makes things complicated.

Not all PTs are committed to using GGUS after EMI (19 yes, 14 maybe, 1 no).
Level 1 & 2 support remain unchanged; level 3 support will alter depending on the PT.

"No better tool then GGUS to track change requests."</br> -no real alternative -1 PT rejected GGUS, several more were luke-warm about it.

EPEL & Fedora: "strong on integration, weak on testing".
-Bitten by Globus before.

-Regular, frequent automated tests.
-Unclear who will do this.
-Verification under production conditions.
-Coordinated by the WLCG Ops team.

NDGF/NeIC status update

http://indico.cern.ch/getFile.py/access?contribId=15&resId=1&materialId=slides&confId=155075

DataGrid -> NorduGrid -> NDGF -> NeIC.
NeIC officially came into being on the 1st January 2012.
It still hosts the Nordic Tier-1.
-They still have the distributed dCache.