GDB-July 2008


== 09 July 2008 ==

Agenda

UK input/issues

- Concern about several recent releases not working and sites having to downgrade services following updates (LFC, VOMS).
- Bugs reaching production sites (e.g. the latest DPM release).

Meeting summary/report

Grid Deployment Board 9th July 2008 (Notes by Pete Gronbech)

Introduction - John Gordon


OPN Meeting summary (David)

1. Monitoring infrastructure, to be put in place; the Tier-1s will be contacted to sign an MoU. The non-European sites are happy that there is an agreement with CERN. The network to be monitored is the T0-T1 infrastructure: link up/down status and packet loss will be monitored (DANTE aims to get this signed with the Tier-1s).

2. Operational infrastructure. The network is built in a federated way and relies on the expertise of the people at the end points. The aim is not to create any new types of committee; the existing setup is OK, and a suggestion to create a completely new structure was rejected. The IP infrastructure set up by the end sites works well; a centralised network subject to the EU would mean giving up a considerable amount of sovereignty.

The LHC OPN is a bounded network of fixed size.

The August GDB meeting will be cancelled.

The September GDB will be more Tier-2 specific.


EGI-DS

It is not certain that the blueprint is public yet. It is a bit narrower than hoped, and there was little engagement with new communities. The operational model is based around the LHC, and will probably deliver for the LHC, but because of that other NGIs may not sign up to it, or the EC may not fund it.

In some countries the ROCs were not engaged with the NGIs. There will be a half-day workshop at the EGEE conference.


IHEPCCC/HEPiX Benchmarking WG - Helge Meinhard.

Experiment results scale (at the 95% level) with the SPEC 2006 suite. Random number generators vary across architectures.

Option 1: wait for a good understanding of the random numbers; no result before autumn.

Option 2: trust the reasonable correlation shown at HEPiX and choose one of the benchmarks (SPECint2006, SPECfp2006 or SPECall_cpp2006, with a defined OS, compiler and compilation options). This could go ahead now. The recommendation is not to use SPEC rate.

SPECall_cpp2006 is made up of 3 apps from SPECint and 4 from SPECfp (7 apps in total); it can be run in 6 hours, but there are no published values. The FP component of these tests has been shown to be very similar to our applications, and their sizable memory footprint is also similar to our use. All numbers are measured following the HEPiX rules: SL4 x86_64, gcc 3.4.6, flags as defined by LCG-SPI (-O2 -fPIC -pthread -m32), multiple independent parallel runs. These numbers tend to come out at 50-60% of the published values, and the ratio gives no surprises.
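As a rough illustration of those run conditions, here is a hypothetical Python sketch that launches one independent benchmark run per core. The config file name, the runspec options and the "all_cpp" set label are assumptions for illustration only, not the official HEPiX script.

<pre>
#!/usr/bin/env python
# Hypothetical sketch only: one independent SPEC CPU2006 run per core,
# in the spirit of the "multiple independent parallel runs" rule above.
# Config name, runspec options and "all_cpp" set label are assumptions.
import multiprocessing
import subprocess

# The compiler settings quoted in the talk (gcc 3.4.6 with the LCG-SPI
# flags -O2 -fPIC -pthread -m32, on SL4 x86_64) would normally live in
# the SPEC .cfg file referenced below.

def run_one(copy_id):
    """Run the C++ subset once; each copy is labelled separately."""
    cmd = ["runspec", "--config=lcg-gcc346.cfg", "--noreportable",
           "--define", "copy=%d" % copy_id, "all_cpp"]
    return subprocess.call(cmd)

if __name__ == "__main__":
    ncores = multiprocessing.cpu_count()
    pool = multiprocessing.Pool(ncores)
    # Independent parallel base runs, not SPEC "rate" mode.
    exit_codes = pool.map(run_one, range(ncores))
    print("exit codes: %s" % exit_codes)
</pre>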

There is a difference between SPEC rate and multiple SPEC base runs; multiple SPEC base is more realistic for our use.

The proposal is to use the C++ benchmark; a script will be made available. All conditions will be defined carefully and a name chosen for it.

Volunteers are sought for a small working group to manage the transition of use. It should make a proposal answering two questions: how to translate the experiment requirements into the new numbers, and how to calculate the value of existing resources (a toy sketch of these two conversions follows below).

Unitarity of the financial envelope is mandatory: the conversion must not change the total value of the pledges.
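As a toy illustration of the two questions and the fixed envelope, a minimal Python sketch follows; the scale factor here is a made-up placeholder, not an agreed number.

<pre>
# Minimal sketch of the two conversions the working group must define.
# The scale factor is a made-up placeholder, NOT an agreed number.
SI2K_PER_NEW_UNIT = 250.0  # placeholder, to be measured by the working group

def requirement_in_new_units(si2k_requirement):
    """Translate an experiment requirement quoted in SPECint2000."""
    return si2k_requirement / SI2K_PER_NEW_UNIT

def existing_resource_in_new_units(si2k_capacity):
    """Re-express an existing resource's installed capacity the same way."""
    return si2k_capacity / SI2K_PER_NEW_UNIT

# The financial envelope stays fixed because the same factor is applied
# to requirements and to installed capacity alike.
print(requirement_in_new_units(1.0e6))  # e.g. 1 MSI2K -> 4000.0 new units
</pre>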


Monitoring Working Group - James Casey

A new document has been produced: EGEE III Operations Automation Strategy.

It addresses how to reduce the number of people required for operations. The NGIs also want fewer local tools.

How can SAM be moved from central operation to the ROCs? The document is about 50 pages. In the new setup, alarms go into a regional dashboard, which has 24 hours to fix the problem before it goes to the COD dashboard; this attempts to reduce the workload on the COD (a sketch of this escalation rule follows below).
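A minimal sketch of that escalation rule, assuming a simple 24-hour window; the names and structure are illustrative, not the actual dashboard code.

<pre>
# Illustrative sketch of the alarm escalation rule described above:
# the regional (ROC) dashboard has 24 hours before the alarm goes to the COD.
from datetime import datetime, timedelta

ESCALATION_WINDOW = timedelta(hours=24)

def responsible_desk(raised_at, now, fixed_by_roc):
    """Decide who currently owns an alarm."""
    if fixed_by_roc:
        return "closed by ROC"
    if now - raised_at < ESCALATION_WINDOW:
        return "regional dashboard"
    return "COD"

# Example: an alarm raised 30 hours ago and still unfixed has escalated.
print(responsible_desk(datetime(2008, 7, 9, 9, 0),
                       datetime(2008, 7, 10, 15, 0), False))  # -> "COD"
</pre>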

We have to know who is the authoritative source of the information: GOCDB, the CIC DB, the BDII, or the VO information providers? There are talks with the RAL APEL people about migrating from R-GMA to ActiveMQ.

There is a question of trusting the ROCs; if ENOC results conflicted with a ROC's, an audit could be done later.

By EGEE '08, YAIM will configure the Nagios setup.

There will be short talks at the EGEE conference, and a repository of tools under the HEPiX/WLCG management WG.


Is the LHC Grid Ready for Data? - John Gordon

Tier-1 procurements should have met the MoU by April, but most had only met some of it by the time of CCRC08.

Looking at the procurement process: EU laws change over the years, and you may go over a threshold which means different rules apply. Suppliers could fold, and there are legal issues. The later you buy, the better value you may get, but the earlier you buy, the more likely you are to be ready on time.

Experience has shown that bringing storage online takes much longer than minutes. Are all the storage tokens defined at sites? ATLAS have just made some changes.

CMS: one extra space token that was not present at the Tier-2s (a cache for bringing data to the Tier-1). ACLs need to be defined.

LHCb defined their space tokens a long time ago. The ones required for CCRC are in place, but not all of the required tokens are.

The computing model will evolve depending on beam conditions. Storage: there are issues with dCache and CASTOR.

Job mix: there are issues with user jobs at Tier-1s. Can high-I/O and high-CPU jobs be scheduled together on multicore systems? Not yet.

There is concern about new releases bringing new problems. Once the data arrives, should we only be doing patching?

Tier-2s have not been stressed yet.

A review of critical services: the LFC being down caused big trouble.

Tier-2 talk - Vetterli

Pledged resources were compared with usage. Most sites used less than pledged, but it is not clear whether this is due to a lack of resources or a lack of use. In the US, installed CPU was more than pledged; the same is not true for disk.

The chaotic use of the Grid by a large number of inexperienced people is still to be tested.

Over 100 links were tested by CMS in CCRC08. There are two stages of analysis: controlled and chaotic.

The number of users per site is likely to be more than 50, but tests were mostly for sites with fewer than 10. There were some users (340) running jobs during the period.

Better communication is needed between Tier-2s and Tier-1s (especially those not in the same country). The Tier-2s need to be better integrated with the GDB and at the Overview Board.

Interoperability of EGEE, OSG and NDGF: is more capacity needed due to larger data sets? The installation software does not always work.

50% of MC production is done at Tier-2s.

Jamie Shiers

It was agreed to run the challenge against agreed metrics. Against those we were successful, but maybe the bar was not set high enough.

LHC machine: by July 20th the machine should be cold, with 2 sectors below 2 K; by August 9th the tunnel should be cleared.

There will be single-beam data taking from the day the cavern is closed. Close for lunch at 12:57 CERN time.

Markus Schulz

On average there are 30-40 patches.

During CCRC there were 18 patches to 3.2 32-bit (the normal rate), 2 updates to gLite 3.1 64-bit, and 1 to gLite 3.0.

There were no scalability problems during CCRC08. The WMS/LB for gLite 3.2/SL4 was released. The implementation of Job Priorities was released but not picked up by sites.


There is a factor-of-5 drop in load on the CE with the newer LCG-CE.

The versions of all packages were not specified; this was a mistake, and in future everything will be listed. It had just been assumed that the latest version would be used.

The LCG-CE is OK for next year, but we have to push hard to get the CREAM CE into production; this requires new clients and a new WMS. Work has started on support for SL5 and Debian 4, with some work on SUSE 9.

GLUE 2 will allow major steps forward. SRM 2.2 was used.

Testing of the CREAM CE has started, with some problems due to a lack of test nodes. glexec requires SCAS (to replace LCAS).

It requires extensive developer testing. Any ideas on dates? A few weeks, a month? It is a critical new service and requires deep certification.

FTS: a new version is in certification, with:

- SL4 support
- Handling of space tokens
- Error classification

Within 6 months:

- Split of SRM negotiations and gridFTP, for improved throughput
- Improved logging: syslog target, and a streamlined format for correlation with other logs
- Full VOMS support
- Short-term changes for the “Addendum to the SRM v2.2 WLCG Usage Agreement”
- Python client libraries

For the move to SL5 and VDT 1.10, the WN should be ready by the end of September.

GLUE 2 will not be backwards compatible, but it is worth it.

OSG

NorduGrid/NDGF gave an explanation in terms of interoperability: the next release of ARC will use the BDII instead of Globus MDS (the opposite of OSG).

All data goes through the CE, which could be a scalability problem for sites with more than 5k cores.

(CERN has 16,000 cores and 20 CEs, but only one site BDII and one batch system.)

SRM report:


dCache: an improvement in the bulk remove operation in 1.8.0 patch release 8, with a client fix for the LHC problem reported at Lyon. Work is ongoing to ensure that pre-staging works properly and to really improve ls for the pin operation.

There is no news on StoRM and DPM.

There is no news on 'multiple ACLs on space tokens in DPM'. JC raised concern that we recently had to downgrade two services: VOMS and LFC.

Tier-1s who don't attend the daily operations meeting should follow the minutes. What happened to the gLite web page when the VOMS issue was released?

The RPMs were not removed from the repository; there is now a comment on the page. The EMT can pull them back.


CREAM CE status report: the testbed has 11 WNs with 110 virtual CPUs, a UI, one CE for stress testing and one for installation and configuration tests.

Tests performed: 9,800 jobs per day with 49 users.


Some bugs: if the CAs get updated, Tomcat has to be restarted; files in pool accounts need cleaning up; and authentication fails with the new type of VO attribute in a VOMS proxy (now fixed).

System load is usually less than 1 (but up to 9 under heavy load); memory use is under 2 GB; disk usage can be high if the log level is high. 97% of submissions succeed, and 99.5% of those jobs then run successfully (see the quick check below).
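A quick arithmetic check of the overall success rate implied by those two figures:

<pre>
# Overall success rate implied by the figures above:
# 97% of submissions succeed, and 99.5% of those jobs then run successfully.
p_submit = 0.97
p_run = 0.995
print("overall: %.1f%%" % (100 * p_submit * p_run))  # -> overall: 96.5%
</pre>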

It will be tested against all batch systems eventually.

Mega table

Better reporting is needed, as resources pledged and delivered do not match.

There was much discussion of the GLUE schema, with contributions from Stephen Burke and Jeff Templon.

OSG Security setup.

They should also be interested in VDT security.

They should also link with EGEE and NDGF.


Follow-up actions