UKI-SCOTGRID-GLASGOW

From GridPP Wiki
Revision as of 14:24, 25 July 2012 by Stephen jones (Talk | contribs)



Topic: HEPSPEC06


UKI-SCOTGRID-GLASGOW
CPU + cores | OS | Kernel | 32/64 | Mem | gcc | Benchmark | Total | Per Core
Dual CPU, Dual Core, AMD Opteron 2.4GHz | SL4.6 | 2.6.9-78.0.17.ELsmp | 64 | 8GB | 3.4.6 | S2k6 all_cpp 32bit | 30.73 | 7.68
Dual CPU, Quad Core, Intel Xeon E5420 2.5GHz | SL4.6 | 2.6.9-78.0.17.ELsmp | 64 | 16GB | 3.4.6 | S2k6 all_cpp 32bit | 65.24 | 8.15
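The Per Core column can be checked against the Total column with a short calculation. The sketch below is illustrative only: the function name per_core_score is hypothetical, and it assumes the per-core figure is simply the whole-node total divided by the number of physical cores (CPUs x cores per CPU).

```python
def per_core_score(total_hs06: float, cpus: int, cores_per_cpu: int) -> float:
    """Return the HEPSPEC06 score per core for a whole-node total.

    Assumes the per-core figure is total / (cpus * cores_per_cpu).
    """
    cores = cpus * cores_per_cpu
    return total_hs06 / cores


# Totals and core counts taken from the two table rows above.
opteron = per_core_score(30.73, cpus=2, cores_per_cpu=2)
xeon = per_core_score(65.24, cpus=2, cores_per_cpu=4)

print(f"AMD Opteron 2.4GHz: {opteron:.4f} per core")
print(f"Intel Xeon E5420:   {xeon:.4f} per core")
```

Under that assumption the Opteron node gives 7.6825 and the Xeon node gives 8.155 per core, which agrees with the tabulated 7.68 and 8.15 if the second value was truncated rather than rounded.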



Topic: Middleware_transition


  • DPM MySQL head node. Upgrade planned.
  • Some DPM pool nodes - rolling update to SL5 (around half of the oldest disk servers to be decommissioned after next procurement).
  • WMS & L&B servers. Will remain until there is a supported combination of L&B and WMS that will run on the same host.
  • VOMS server (already under threat of upgrade due to incident) - upgrade planned.
  • SL4 VOBOX for PANDA. In discussion with the VO on that one.


No problems anticipated in update to SL5, once software requirements are handled.

gLite3.2/EMI


ARGUS: Will deploy when CREAM and DPM have Argus support (SCAS currently in use)

BDII_site: gLite 3.2: glite-BDII-3.2.9-0

CE (CREAM/LCG): gLite 3.2: glite-CREAM-3.2.8-2

glexec: gLite 3.2: glite-security-glexec-0.7.0-2

SE: gLite 3.2: DPM 1.8.0-1

UI: gLite 3.2: glite-UI-version-3.2.8-0

WMS: gLite 3.1: glite-WMS-3.1.31-0

WN: gLite 3.2: glite-WN-3.2.9 through 3.2.11 (rolling upgrade)

LB: gLite 3.1: glite-LB-3.1.20-2

VOMS: gLite 3.1: VOMS-Admin version 2.0.15

Comments

WMS and LB will be updated once a stable release for SL5 (EMI) is available.

We have an EMI CREAM instance and ARC CEs under testing.

We have a test version of DPM running on svr025.

Topic: Protected_Site_networking


  • Upgraded to 4 x 48x10 Gb/s core switches + 16 x 40 Gb/s interfaces. Device: Extreme Networks X670.
  • Upgraded to 12 x 48x1 Gb/s core switches + 16 x 10 Gb/s interfaces + 24 x 40 Gb/s interfaces. Device: Extreme Networks X460.
  • Upgraded internal backbone now capable of 320 Gb/s.
  • Cluster network now passes through the 2 main physics core switches directly to ClydeNET, with no interaction with University firewalls.
  • Primary WAN link 10 Gb/s; effective upper limit 8-9 Gb/s; address range 130.209.239.0/25.
  • Secondary WAN link 10 Gb/s; effective upper limit 8-9 Gb/s; address range 130.209.239.0/25. To be installed during summer 2012.
  • Monitoring: Nagios/Cacti/Ganglia/Ridgeline
  • In process of installing NagVis


File:Glasgow-network-new.png


Topic: Resiliency_and_Disaster_Planning


Backup Strategy

  • Reviewed backup strategy. All new machines are now included in backups.
  • Dirvish used for backups [10 days of daily backups, 3 months of weekly, 1 year of monthly].
  • Daily off-site backup of cluster administration server [svr031], allowing a full Tier-2 rebuild if necessary.
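The retention tiers above (10 days of daily, 3 months of weekly, 1 year of monthly) can be illustrated with a small sketch. This is not the site's Dirvish configuration. The function name is_retained is hypothetical, and the following details are assumptions made only for illustration: the weekly image is the one taken on a Sunday, the monthly image is the one taken on the first of the month, 3 months is taken as 92 days and 1 year as 365 days.

```python
from datetime import date


def is_retained(backup_day: date, today: date) -> bool:
    """Decide whether a backup image would still be kept under a
    10-daily / 3-month-weekly / 1-year-monthly retention scheme.

    Assumptions (not from the site configuration): weekly images are
    Sunday's, monthly images are those from the 1st of the month.
    """
    age = (today - backup_day).days
    if age < 0:
        return False  # image dated in the future, nothing to keep
    if age < 10:
        return True  # daily tier: the most recent 10 days
    if age <= 92 and backup_day.weekday() == 6:
        return True  # weekly tier: Sundays within roughly 3 months
    if age <= 365 and backup_day.day == 1:
        return True  # monthly tier: 1st of month within a year
    return False


today = date(2012, 7, 25)
for day in (date(2012, 7, 20), date(2012, 7, 8), date(2012, 7, 10),
            date(2012, 1, 1), date(2011, 7, 1)):
    print(day.isoformat(), "kept" if is_retained(day, today) else "expired")
```

Checking the tiers from newest to oldest keeps the rule easy to read: an image is kept as soon as any tier claims it, and expires only when none does.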


Tools

  • OSSEC installed on all machines at ScotGrid. Provides a web interface, alert generation, a rules engine, a rootkit checker and scriptable actions. The Glasgow installation was very noisy at first, so time was needed to tailor it for the site.
  • Splunk installed on all machines at ScotGrid. Log aggregator and indexer with a web interface for searching. The free version is limited to 500 MB a day; Glasgow uses 100 MB a day. A full license is very expensive. Use cases: searching for suspicious IPs, hardware faults.
  • OSSEC has Splunk integration, and the two work well together.


Local Procedures

  • Cold start procedures updated after power outages. This helped to highlight missing steps.
  • Appropriate machine room signage created after issues identifying server rooms, circuit breakers, switches etc.
  • Emergency contacts list created. Phone numbers distributed amongst team.


Topic: SL4_Survey_August_2011


  • DPM MySQL head node. Upgrade planned.
  • Some DPM pool nodes - rolling update to SL5.
  • WMS & L&B servers. Will remain until there is a supported combination of L&B and WMS that will run on the same host.
  • VOMS server (already under threat of upgrade due to incident) - upgrade planned.
  • SL4 VOBOX for PANDA. In discussion with the VO on that one.


No problems anticipated in update to SL5, once software requirements are handled.

Topic: Site_information


Memory

1. Real physical memory per job slot: 2GB

2. Real memory limit beyond which a job is killed: None

3. Virtual memory limit beyond which a job is killed: None

4. Number of cores per WN: 4 (~50% will have 8 cores from Nov 2008)

Comments:

Network

1. WAN problems experienced in the last year: None

2. Problems/issues seen with site networking: None foreseen

3. Forward look: Networking arrangements seem adequate for 2009 at least.

Comments:


Topic: Site_status_and_plans



SL5 WNs

Current status (date): Initial Migration Complete. 1912 cores total, 1848 SL5 on WN3.2.4-0, 48 SL4 on WN3.1.40-0

Planned upgrade: December move of remaining 48 SL4 cores to SL5.

Comments: Migration complete. Some SL4 capacity kept for local ATLAS users to run non-ported versions of Athena.

SRM

Current status (date): 2 DPMs migrated to SL5, DPM3.2.1-0

Planned upgrade: Possible upgrade from DPM-srm-server-mysql.x86_64 1.7.2-5 when available

Comments:

SCAS/glexec/ARGUS

Current status: 28/04/2011 ARGUS installation planned for May.

Current status (date): 10/11/2009 SCAS & GLEXEC with CREAM and GLEXEC on WN deployed in UAT.

Planned deployment: SCAS, GLEXEC with CREAM, GLEXEC with WN in Production on request.

Comments: Documenting install and info on wiki.

CREAM CE

Current status (date): 10/11/2009 Deployed in Production; currently running 3.1.22

Planned deployment: Completed. Migrated to svr014.gla.scotgrid.ac.uk, svr008.gla.scotgrid.ac.uk and svr026.gla.scotgrid.ac.uk

Comments: In Production and open to all VOs