GDB January 2009

GDB Wednesday 14th January 2009

Agenda Notes: PDG

JG talked about possible future agenda items: there have been talks about Nagios monitoring SAM, but not many about how to use Nagios to monitor the fabric.

The WLCG Workshop is on 21-22 March in Prague, prior to CHEP. It will focus on 2009 operations: service operations, operations procedures and tools, downtime handling, support issues, monitoring, and VO views.

Wondering whether to have another in the summer or not.

==MB meeting yesterday (Ian Bird)==

Clarification of the running schedule: probably no real physics collisions before September. A restart in June seems achievable if nothing else goes wrong. Before Christmas the accelerator run was understood to finish in November, but this will be discussed at the Chamonix workshop in the first week of February. We should relax the requirement to have all the installed capacity by April; it can be delayed a few months, to August. The plan has to be presented to the LHCC. JG asked what resources are needed in 2009.

==Benchmarking (G. Merino)==

“GDB benchmarking WG”

The group was asked how to migrate to the new unit. Sites currently use SPECint2000, which is no longer supported; the recommendation is to move to SPECall_cpp2006. The group needs to publish the recipe for running the benchmark and agree on conversion factors to convert the equipment requirements and site pledges tables from SI2K units to the new units.

The ratio of new to old is roughly 4, with the results focused on the newest Intel quad cores as these are the most common at the moment. At CERN two different systems gave ratios of 3.84 and 4.05, and the whole cluster at GridKa came out at 4.19.

The proposal is to keep it simple and use 4. Sites should purchase SPEC CPU2006 and report their current power in the new units.
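As a rough worked example of applying the factor of 4 to the pledge tables, a minimal sketch; the site names and kSI2K figures below are invented for illustration, not real pledges:

<pre>
# Convert site pledges from kSI2K to the proposed SPEC CPU2006-based unit
# using the simple factor of 4 discussed above. The site names and pledge
# figures are invented for illustration.

SI2K_TO_NEW_UNIT = 4.0  # proposed new-unit / SPECint2000 ratio

pledges_ksi2k = {
    "Example-Tier1": 5000,   # hypothetical pledge in kSI2K
    "Example-Tier2": 1200,
}

for site, ksi2k in pledges_ksi2k.items():
    print(f"{site}: {ksi2k} kSI2K -> {ksi2k * SI2K_TO_NEW_UNIT:.0f} new units")
</pre>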

Accounting also needs thought, as the SPECint2000 figure is currently published; there needs to be a way to know which units a site is publishing.

==Working Group on WNs==

The main goal should solve many problems, such as using memory efficiently; we are trying to catch up with OSG. YAIM currently forces us to publish one sub-cluster per CE. The plan is to have a new node type, the Cluster Publisher, one per site: a very trivial service that publishes static LDIF (see the sketch below). Five steps can be taken now, and the sixth step is YAIM support for 'freestyle' Clusters and SubClusters. Many of the steps relate to publishing RTE tags. The new YAIM will require wider testing, on the PPS and at early production sites.

If you turn the new functionality on, you have to have one queue per GLUE sub-cluster, so you must know what you are doing. It should be off by default.
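To make the "static LDIF" idea above concrete, here is a minimal sketch of what a Cluster Publisher might emit. The site, host names, RTE tags and the heavily trimmed attribute set are my own illustrative assumptions against the GLUE 1.3 schema, not the working group's actual recipe:

<pre>
# Sketch: write static LDIF describing one cluster and one sub-cluster,
# in the spirit of the proposed Cluster Publisher node type.
# Names, RTE tags and the attribute subset are invented; a real entry
# needs the full set of GLUE 1.3 object classes and attributes.

SITE = "EXAMPLE-SITE"
CLUSTER_ID = "cluster.example.org"
SUBCLUSTER_ID = "subcluster-intel-quad.example.org"
RTE_TAGS = ["VO-atlas-production", "VO-cms-CMSSW_2_2_3"]  # examples only

def static_ldif():
    lines = [
        f"dn: GlueClusterUniqueID={CLUSTER_ID},mds-vo-name=resource,o=grid",
        "objectClass: GlueCluster",
        f"GlueClusterUniqueID: {CLUSTER_ID}",
        f"GlueClusterName: {SITE} cluster",
        "",
        f"dn: GlueSubClusterUniqueID={SUBCLUSTER_ID},"
        f"GlueClusterUniqueID={CLUSTER_ID},mds-vo-name=resource,o=grid",
        "objectClass: GlueSubCluster",
        "objectClass: GlueHostApplicationSoftware",
        f"GlueSubClusterUniqueID: {SUBCLUSTER_ID}",
        f"GlueChunkKey: GlueClusterUniqueID={CLUSTER_ID}",
        "GlueSubClusterLogicalCPUs: 8",
    ]
    lines += [f"GlueHostApplicationSoftwareRunTimeEnvironment: {tag}"
              for tag in RTE_TAGS]
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    with open("static-cluster.ldif", "w") as fh:
        fh.write(static_ldif())
</pre>

The point of the design is that the file is completely static, so the service publishing it can stay trivially simple.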

==Grid Configuration Data==

Or “What should be on the grid”

How to reliably find a list of sites > service > VO mappings in the EGEE production infrastructure.

The BDII publishes live data, so you cannot tell whether a service has just dropped out or has been decommissioned.

There are two options: either put more information in the GOCDB or put it into the site-BDII.
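For context, this is roughly how such mappings are pulled out of the live information system today: a minimal sketch querying a top-level BDII over LDAP for services that advertise a given VO. The BDII host, base DN and GLUE 1.3 attribute names follow the usual conventions rather than anything from the talk, and it assumes the python-ldap package is installed:

<pre>
# Sketch: list the services advertising support for one VO by querying a
# top-level BDII over LDAP. Host, base DN and attribute names follow the
# usual GLUE 1.3 conventions but should be checked against your deployment.
# Requires the python-ldap package.

import ldap

BDII_URI = "ldap://lcg-bdii.cern.ch:2170"  # example top-level BDII
BASE_DN = "o=grid"

def services_for_vo(vo):
    conn = ldap.initialize(BDII_URI)
    # Access-control rules are usually published as "VO:<name>";
    # older entries sometimes carry the bare VO name.
    flt = ("(&(objectClass=GlueService)"
           f"(|(GlueServiceAccessControlBaseRule=VO:{vo})"
           f"(GlueServiceAccessControlBaseRule={vo})))")
    attrs = ["GlueServiceType", "GlueServiceEndpoint", "GlueForeignKey"]
    return conn.search_s(BASE_DN, ldap.SCOPE_SUBTREE, flt, attrs)

if __name__ == "__main__":
    for dn, entry in services_for_vo("atlas"):
        stype = entry.get("GlueServiceType", [b"?"])[0].decode()
        endpoint = entry.get("GlueServiceEndpoint", [b"?"])[0].decode()
        print(stype, endpoint)
</pre>

The weakness discussed above is exactly that this only reflects whatever happens to be published at that moment.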


==WLCG GLUE 2.0==

Working with the OGF was a good experience and did not add very much overhead to the process. Work on GLUE 2.1 starts in April 2009.


==Christmas Experiences==

ALICE: overloading of the WMS caused trouble. Is there a setting for the maximum number of jobs? There was a huge backlog on WMS204 once it had 40,000 queued jobs. The WMS had to be babysat the whole time and switched to another one as soon as the problem occurred.

There was also a MyProxy server issue; new hardware will help. ALICE will change its submission method to make a proxy delegation only once per hour. The ALICE problems ran from around Christmas Day onwards.


ATLAS, CMS, LHCb: simulation jobs saw some reduction for four days around Christmas Day, dropping from around 400 jobs to 1000 jobs. A possible network issue at CERN is being investigated.


==ATLAS Ticketing in OSG==

They have basically converted to direct routing of tickets, avoiding the TPMs and the OSG GOC.

==Markus Schultz==

There is various feedback, but it does not always offer solutions. The CMS quote gave explicit examples and suggested that there were too many bug fixes per release.

They have approximately 50 bugs per month, so there are obviously many patches. We have to understand the difference between perceived impact and factual impact.

==Post Mortem report==

By Andreas Unterkircher: http://indico.cern.ch/getFile.py/access?sessionId=5&resId=1&materialId=1&confId=45461

He talked about a roll-back scheme which involves having pointers to different repositories. I did not like that at all: why can't they use higher RPM version numbers to force sites that have already updated to get back to the working older content? My suggestion was not favoured.
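To spell out my suggestion: repackaging the last good content with a version/release higher than the broken update means a routine update pulls sites back, with no repository pointer juggling. A minimal sketch of the comparison using the rpm Python bindings; the package versions are invented:

<pre>
# Sketch of the "higher version number wins" roll-back idea, using the
# rpm Python bindings that ship with RPM-based systems.
# The (epoch, version, release) strings are invented for illustration.

import rpm

broken = ("0", "3.1.2", "1")     # broken update already installed at sites
rollback = ("0", "3.1.2", "2")   # last good payload, repackaged with a
                                 # bumped release number

# labelCompare returns 1 when its first argument is newer, so a plain
# update would replace the broken package with the roll-back package.
if rpm.labelCompare(rollback, broken) > 0:
    print("roll-back package supersedes the broken one")
</pre>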

==VOMSRS status report==

…..


==Middleware==

CREAM CE: there is a version released specifically for ICE-WMS testing in the PPS, but it is not yet certified. There are three versions around! It still does not sound ready yet.

Many more instances are needed to really shake down the service; ROC managers will be encouraged to install one CREAM CE per region in the next few weeks.

Requirements for BLAH

==SL5 m/w status==

ATLAS is OK except for an SELinux issue. CERN uses the default setup, which has SELinux on; other sites may not do this.

Part of ROOT is the problem? People see the benefit of having SELinux switched on, so they wish to solve the problem rather than switch SELinux off.

==SCAS/LCAS/LCMAPS/GLEXEC==


ALICE is very keen to migrate to SL5, CMS seems keen, and the other experiments are not so sure. The WN software is available in the PPS and is scheduled for release at the beginning of February.