GDB 10th November 2010


GDB Meeting at CERN, 10.11.10

Notes by Pete Gronbech

Agenda

John Gordon gave the introduction. Meetings since last time:

  • Quattor Workshop at RAL
  • CHEP 10 in Taipei
  • OGF30, Brussels
  • HEPIX Cornell

Interesting talks on ARC; progress made on GLUE2; the EMI people gave a talk on accounting records.

Future GDBs: the December meeting may be in the ALICE building (160?); then Jan 12, Feb 9, March 9 (Lyon), April 6(?), May 11, June 8, etc.

Upcoming meetings:

  • EMI All Hands meeting, Prague, 22-24 Nov
  • VOMS and experiments, CERN
  • ISGC and OGF, Taipei, 21 March
  • EGI User Forum, Vilnius, 11 April

Last time we talked about a security incident where sites had to upgrade or be banned; sites upgraded quickly. For a recent, less severe problem, sites did not upgrade so quickly.

Quarterly operations experience will be reported in Dec.


Site Security Challenge

Graeme Stewart, Sven Gabriel

What's new in SSC4? An overview of the latest Site Security Challenge, recently submitted to the Tier-1s, was given. Generally sites improved, and a clear scoring system is in place that should give sites guidance. Follow the procedures!

Below are lots of notes taken during the talks:

Thanks to Graeme for installing a parallel job submission system. Alarms related to: the DN; network traffic between IP1 and IP2; the WN and an external host. Components involved: MyProxy, VOMS, WMS, lcg-CE, WN; ATLAS job submission; a malicious binary was changed.

  • Communications: what, to whom, in what expected time.
  • Containment: kill jobs, ban the user/cert.
  • Management: save the malicious code.
  • Forensics: almost all sites improved; communications response times were better; find/kill.

The UI was found by all sites; network analysis was done only by some sites; analysis of the binary was done by all sites.

Tier 1s

SSC4 role play: Star academy, ATLAS at CERN, some external host, a grid site.

Atlas, job monitor, job repo, pilot factory.

Target time is 4 hours. CE at a site, WN; the pilot job submitter uses the DN of Graeme; 'Jean-Luc Picard' sends a job…

A small script runs, looking at the environment. A 'pwned' IRC client talks to the PanDA port.

The user runs the IRC client and runs commands on the WN.

The site gets an alarm mail and should investigate, retrieving the input sandbox, the real DN, etc. from PanDA.

Heads-up to EGI CSIRT (abuse@cern.ch); communications with the VO manager; heads-up to the CA. A traffic-light scoring system.

PJS (pilot job submitter): the DN under which the pilot ran at the site (GS, ssc4). Only one mail was sent; the DN was found; log-file dumps were provided.

CERN: heads-up to EGI-CSIRT in 30 min with the DN; heads-up to the VO manager and ATLAS CSIRT; heads-up to the UK CA (or via the VO manager).

Job stopped in 1 h; PJU (Pilot Job User) banned after 30 min; the CREAM CE was missed.

All tasks were done within 4 h; the only team that spotted the PJU banning monitor.

One site spotted the prep job running but using no CPU.

Another site mentioned both the PJS and the PJU; the PanDA ID was sent to EGI-CSIRT.

Nikhef: forensics found all hosts, including 'my laptop'! The IRC bot maintained an open TCP connection on port 25443; Globus Toolkit 4.

Cron, at, daemonizing not mentioned.

Info in the logs was not always used. A script was provided to check 'wunderbar' for active clients.
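
(Ed's note: as an illustration only, not the SSC4 script itself, a minimal Python sketch of such a check, which scans a Linux WN's /proc/net/tcp for established connections to the bot's port 25443:)

    # Minimal sketch (not the SSC4 script): scan /proc/net/tcp on a Linux WN
    # for established connections to the bot's port 25443 ("6363" in hex).
    PORT_HEX = format(25443, "04X")  # -> "6363"

    def suspicious_connections(path="/proc/net/tcp"):
        hits = []
        with open(path) as f:
            next(f)  # skip the header line
            for line in f:
                fields = line.split()
                remote, state = fields[2], fields[3]
                # state "01" is TCP_ESTABLISHED; remote is "HEXIP:HEXPORT"
                if state == "01" and remote.split(":")[1] == PORT_HEX:
                    hits.append(remote)
        return hits

    for conn in suspicious_connections():
        print("established connection to port 25443:", conn)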

Several follow-ups but no final report. Binaries seen: gridssc.sh, grdissc.sh, lutra.

Heads-up to the UK CA; the PJS (Pilot Job Submitter) activity was described.

Conclusions:

  • A framework will be provided to the NGIs to test all sites.
  • Most sites did not send a dedicated close-out report.
  • Some sites have user-management problems; some found it hard to ban the user properly on all CEs, etc.
  • The second DN was not found by all sites; the PanDA ID URL was not always used; the user interface was not found by all sites.
  • Network forensics and analysis of the malicious binary need improvement; technical skills vary a lot.
  • An NGI/project-wide run is to be automated and set up; want to include some storage operations.

Graeme Stewart covered the ATLAS side of things.

SSC4 was run through PanDA. It was a chance to review ATLAS security practice and get on-the-job training. Some docs were improved.

Tried to ensure it did not interfere with production.

Grep for credentials using 'ssc' in the DN? The rogue payload was injected into PanDA with a special job tag. Contact: atlas-adc-csirt@cern.ch. Better instructions for banning a user and cancelling their running jobs.
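
(Ed's note: a minimal sketch of such a grep in Python; the one-DN-per-line log file name and format are assumptions, not the ATLAS tooling:)

    # Scan a job log for challenge credentials, i.e. DNs containing the
    # exercise's "ssc" marker. "job-dns.log" is a hypothetical file name.
    def find_ssc_dns(logfile="job-dns.log"):
        with open(logfile) as f:
            return sorted({line.strip() for line in f if "ssc" in line.lower()})

    print(find_ssc_dns())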

Triage by ADC manager.

ATLAS immediately responded to the site and asked for more info. Very good comms; the payload code could be obtained from ATLAS.

Useful although time-consuming; it provided real training for AMODs and other experts. A user can be banned in PanDA, or thrown out of ATLAS, but the VOMS proxy may remain in use for up to 96 hours. The cert can be revoked (this is part of the clean-up, not an emergency procedure, as it is hard to prove to the CA that the cert has really been compromised).

Follow the procedure: it says contact EGI CSIRT and the VO manager…

WLCG Information Officer

Flavia Donno

Static and dynamic info providers; 1200 resources, 374 sites, etc. The site-level BDII collects info and propagates it to the top-level BDII. Many services consume the info.
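
(Ed's note: for illustration, the BDII is an ordinary LDAP server, conventionally on port 2170 with base o=grid for GLUE 1.3, so consumers can query it with standard LDAP tools. A minimal Python sketch, requiring python-ldap, with error handling omitted; the hostname is the usual CERN top-level alias:)

    # Query a top-level BDII for CEs and their waiting-job counts (GLUE 1.3).
    import ldap

    conn = ldap.initialize("ldap://lcg-bdii.cern.ch:2170")
    results = conn.search_s(
        "o=grid", ldap.SCOPE_SUBTREE,
        "(objectClass=GlueCE)",
        ["GlueCEUniqueID", "GlueCEStateWaitingJobs"],
    )
    for dn, attrs in results:
        ce = attrs["GlueCEUniqueID"][0].decode()
        waiting = attrs.get("GlueCEStateWaitingJobs", [b"?"])[0].decode()
        print(ce, "waiting:", waiting)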

Single point of contact to try to ensure coherency and accuracy and to listen to customers.

  • ALICE: WLCG IS not used, interested in stable CREAM CE
  • ATLAS: the WLCG IS is a low priority; they consider the fact that the BDII has to continually publish a serious design flaw, which is why ATLAS does not rely on it.

So either have to statically configure or strongly cache the info.

Requirements for the info system have changed since it was designed for the WMS.

  • CMS no statement
  • LHCb seem interested in info consolidation.

They consider splitting dynamic and static info a priority.

Conclusions: semi-static info is mostly what is needed. Support fail-over and caching in services, à la FTS and WMS. No GLUE 2.0 in the short term.

Will launch a query on lhcb-rollout to get sites feedback.


Developers' feedback: need to define what is really needed.

Other consumers: monitoring, Gstat, accounting and management. Accuracy of the published info is important.

Plan for compilation of the WLCG profile by the end of Jan 2011.

Deployment strategy for the top-level BDII: multiple are needed per continent, plus a failover strategy (see the sketch below).
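
(Ed's note: one obvious shape for such a failover strategy, sketched in Python; the endpoint list and timeout are illustrative, and this was not presented as the agreed design. The client simply walks an ordered list of top-level BDIIs until one answers:)

    import ldap

    TOP_BDIIS = [
        "ldap://lcg-bdii.cern.ch:2170",         # example endpoints only
        "ldap://top-bdii.example-ngi.org:2170",
    ]

    def query_with_failover(base="o=grid", flt="(objectClass=GlueSite)"):
        last_err = None
        for url in TOP_BDIIS:
            try:
                conn = ldap.initialize(url)
                conn.set_option(ldap.OPT_NETWORK_TIMEOUT, 10)
                return conn.search_s(base, ldap.SCOPE_SUBTREE, flt, ["GlueSiteName"])
            except ldap.LDAPError as err:
                last_err = err  # this endpoint is down; try the next one
        raise RuntimeError(f"all top-level BDIIs failed: {last_err}")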

Ticket 63478

Installed capacity: shares are not always published. CPU scaling factors.

Flavia.donno@cern.ch

Jeff Templon: there is a third class of info, administrative, with no impact on operations, e.g. what is needed to avoid multiple counting across multiple CEs. GOCDB is also critical, for listing nodes and downtimes. Marcus: the move away from GOCDB was due to the info being unreliable, so people wanted to generate it automatically. The IS was designed as a feed for the WMS but is now used for monitoring; we should step back and take a fresh look at it.

Network Incident Handling

Jamie Shiers: a procedure that is simple, has a well-defined set of actors, and is applicable to both the LHCOPN and the GPN.

  • What is the problem?
    • 1. A clean cut of the link between sites A and B: well understood, and failover happens quickly, either automatically or manually.
    • 2. Degradation (lower than expected rates or high failure rates): takes longer to diagnose.

New procedure applies to both.

LHCOPN: want good throughput and low failure rates. Source and destination do basic diagnostics, then declare it a network issue. Both sites are ticketed, and they own the ticket.

Network people are responsible for ensuring that the link between the sites becomes clean, which includes interaction with the relevant parties in the middle. Tickets are not regularly updated; change-of-state info is needed (i.e. it has improved; why?). Regular ticket updates are an important part of the model. Escalation goes to the WLCG Project Leader (Ian Bird), but to whom on the networking side? Some time limits for escalation are still to be defined.

Lunch

CREAM

JG: there were some issues; new things have been tested this year. 159 CREAM CEs known, 127 working fine; 366 LCG-CEs OK out of 403.

Latest release and deployment status; issues with the availability calculation.

GridView availability calculations should OR the CEs, not AND them (see the example below).
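
(Ed's note: a worked example of the point, with illustrative values: a site with one healthy CE out of two should count as available, so the per-service availability must OR the CEs; AND-ing them wrongly marks the site down whenever any single CE fails.)

    ce_up = {"ce01.example.org": True, "ce02.example.org": False}

    site_available_or = any(ce_up.values())   # True  -- correct: one working CE suffices
    site_available_and = all(ce_up.values())  # False -- wrong: penalises redundant CEs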

GS: ATLAS and CREAM. Previous bugs: ATLAS uses CREAM via Condor-G, which used to die after 24 hrs, and GridFTP used to overload the factory. A new release in early summer fixed these two bugs. In September ATLAS asked sites to install CREAM in parallel for testing. The UK and DE clouds are now running 20% via CREAM, with e.g. Glasgow and Oxford at 90% via CREAM.

The UK total is about 10k jobs/day. Not completely trouble-free: memory blow-up in cream_gahp on some hosts; some patched binaries from the Condor team. NorduGrid needs 7.5.3.

State drift between Condor and CREAM: Condor keeps trying to fetch the OSB (Output Sandbox), which has disappeared from the CE.

Somehow Condor thinks it has not got the OSB, so it keeps trying. ML: if the Condor team are letting these issues slip, let the Management Board know.

LHCb: direct CREAM submission; the WMS is only used as a pilot deployment mechanism.

A pilot director does the CREAM submission in the DIRAC WMS: a dedicated site director (on a VO box) runs for each CREAM CE. It has to be a VO box with a UI, and it is part of the DIRAC framework. Next week there will be a real DIRAC release for CREAM CE direct submission. Legacy LCG-CEs keep WMS submission.

ALICE are using CREAM at all sites except CERN, where some LCG-CEs are used. No problems except at CERN, due to the size of the batch system and LSF. They use the resource BDII to decide if the CE should receive more jobs.
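
(Ed's note: a minimal sketch of that decision in Python, not ALICE's actual code; the CE hostname and threshold are made up. The idea is to ask the CE's resource BDII how many jobs are waiting and only feed it more if the count is low:)

    import ldap

    def ce_wants_more_jobs(ce_host="ce.example.org", max_waiting=10):
        conn = ldap.initialize(f"ldap://{ce_host}:2170")  # resource BDII
        res = conn.search_s(
            "o=grid", ldap.SCOPE_SUBTREE,
            "(objectClass=GlueCE)", ["GlueCEStateWaitingJobs"],
        )
        waiting = sum(
            int(attrs.get("GlueCEStateWaitingJobs", [b"0"])[0].decode())
            for _, attrs in res
        )
        return waiting < max_waiting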

Helgi M: a number of problems with the CREAM CE, partly to do with the LSF batch system. Rather painful and manpower-intensive to keep the service up. The developers have been good at providing fixes, but he does not feel it is production-ready. Worrying, as the LCG-CEs are running on SL4 and support for that finishes soon.

CMS: possible glide-in problems; configuration of the auth layer on EU sites still needs solving.

ALICE: an exciting time for ALICE physics due to the heavy-ion data at the moment. Direct job submission to CREAM CEs; CREAM is fast and reliable. Only 2 CEs are needed even for a big site (you have two for redundancy).

CREAM 1.6.1 is in gLite 3.1, 1.6.2 in 3.2.

1.6.3 was released today for 3.2 (and also for 3.1?). Not all CREAM CEs are in the BDII. CREAM 1.7 will only be on gLite 3.2; it is due for the end of the year, with Argus integration, and sysadmins will need to select Argus or the old model. Support: cream-support@lists.infn.it.

Jeff Templon asked why the RPM version numbers etc. differ from the CREAM CE version numbers. The release notes have the mapping.

EMI? Is it committed to supporting this throughout the life of the project, and not going to move on to the next greatest thing?

ML: Cream is the next greatest thing. Is that EMI’s view?


Installed Capacity

John Gordon

Talk previously shown to the MB; it compares installed capacity with pledges, all per the installed-capacity document. Some sites are not publishing; some are not meeting the CPU pledge, some the disk pledge. Only ScotGrid was listed for disk.

Fair shares are used to apportion installed CPU capacity to VOs; Gstat did not recognise non-integer fair shares (see the worked example below).

(Like the old reliability report, which did not support non-integer HS06 values until July 2010.)
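
(Ed's note: a worked example of the apportioning, with illustrative numbers, including the non-integer shares that Gstat failed to handle:)

    installed_hs06 = 12000.0
    fair_shares = {"atlas": 62.5, "cms": 25.0, "lhcb": 12.5}  # example shares

    total = sum(fair_shares.values())
    for vo, share in fair_shares.items():
        print(f"{vo}: {installed_hs06 * share / total:.1f} HS06")
    # atlas: 7500.0 HS06, cms: 3000.0 HS06, lhcb: 1500.0 HS06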


Multi-User Pilot Jobs

Maarten Litmaath

Glexec update.

Tests by ML against the T1s. RAL OK.

(Ed's note: why do t2ce04, t2ce05 and t2ce06 fail at Oxford?) ALICE is integrating glexec into AliEn.

User proxy handling being worked on.


Middleware

gLite 3.1 retirement calendar approved.

New 3.2 releases this week: CREAM 1.6.3, Argus 1.2 and LB 2.1.16. For 3.1: a new LB and WMS, and vomscerts 6.2.

Upcoming gLite 3.2 releases: GFAL and lcg_util for UI/WN/VOBOX (this was rejected this week).

DPM/LFC 1.8, and WMS & UI on SL5. PX on SL5 is waiting for VDT to provide MyProxy for SL5 64-bit. A new release of Torque.

CONDOR_utils: very slow progress; a patch request has been open since December 2009.

DPM/LFC 1.8 should be released in 2 weeks.

Remove the GlueLocation object for WLCG VOs.

There may be a compiler flag that could be used to allow SL libraries to work on SUSE. ML says it is not a simple thing and would require a lot of work. MS also agrees; no effort will be put in for now, and the EMI release will be a source release, so it can be fixed then. Far from trivial to get it working on SUSE.

T1 Services BOF

Maria Girone

Reported on a recent meeting for WLCG coordination.


HEPIX Fall Meeting 2010

Michel Jouvin

A dense agenda: private cloud testbeds at several sites; IPv6 is a new topic.

IPv6 has been in the background for a few meetings; the US federal CIO has asked US sites for concrete plans to move to IPv6 by the end of the year. A shortage of IPv4 addresses is expected next fall, and will affect Asia more. A WG will be set up to identify operational issues.

File systems and storage: a testbed at KIT to compare solutions. New applications provided by the experiments following the Amsterdam jamboree; an assessment of ROOT.

NFS 4.1: promising results based on dCache; DPM is also working on it. A CERN presentation on EOS.

Virtualization and clouds: a report from the group (CERN, FNAL, LBNL); CERN is the only one connected to the grid.

The StratusLab project released its first toolkit, to help run private-cloud WNs (Michel Jouvin…).

CERN and FNAL will work together for better SLC and SL integration.

Worries about Oracle only supporting s/w on their h/w.

E.g. Lustre: consortia have been created to further the development of Lustre; it is expensive to join the US one.

Grid Engine is a SourceForge project; Oracle is not happy with forks.

HEPiX sites are encouraged to participate.

Spring 2011 meeting: May 2-6 at GSI.

Fall 2011 at TRIUMF, a 20th-anniversary event. SL6 by the end of winter??? It is about 4 months of work after Red Hat releases 6, and Red Hat have said RHEL6 will be released when it's ready.