Site information
Page for gathering site feedback and information
There are two areas to be completed in October 2008. The first is to provide information about batch system memory limits. The second is to give an update on networking issues that can feed into a GridPP networking review.
Background to the batch system memory request:
This concerns ATLAS: jobs have been killed off by batch systems even though they exceed the initially requested memory per job slot only during very short spikes. As Graeme put it: "The point is that the final memory is used only for a short time, so even on our 4-core nodes we don't see swapping or other problems. The second point is that nothing in the information system lets you know what the site policy is re. killing jobs off - and this is extremely important in knowing if a job can be safely run on a particular site. What I would suggest is really wanted is that sites provide information on
1. Real physical memory per job slot.
2. Real memory limit beyond which a job is killed.
3. Virtual memory limit beyond which a job is killed.
4. Number of cores per WN. If you have several configurations please list each one.
This can be per-queue if you have different limits (e.g., RAL have 2G and 3G queues)".
So, in what follows, each site has these areas to complete with this information, along with a comment prompt should you feel the need. We will then work this information into a table, which you can edit if the situation at your site changes.
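As background (an illustration only, not a description of any site's actual configuration): batch systems commonly enforce such limits either as per-process resource limits or by periodic monitoring. With a hard resource limit, a brief spike above the cap fails at the moment of allocation even if the spike itself would have been harmless, which is the behaviour described above. A minimal Python sketch, assuming a hypothetical 1GB virtual memory cap:

 # Sketch only: an assumed 1 GB address-space cap (RLIMIT_AS) makes a short
 # allocation spike fail, illustrating the failure mode described above.
 import resource
 
 ONE_GB = 1024 * 1024 * 1024
 
 # Impose a 1 GB address-space limit on this process, as a batch system might.
 resource.setrlimit(resource.RLIMIT_AS, (ONE_GB, ONE_GB))
 
 baseline = bytearray(256 * 1024 * 1024)   # steady-state usage of ~256 MB is fine
 
 try:
     spike = bytearray(2 * ONE_GB)         # short-lived 2 GB spike
 except MemoryError:
     print("spike allocation refused: this is where the job would die")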
Pete Clarke wishes to develop an overview of current network problems and issues together with a forward look from sites. Jeremy's summary of the driving discussion: "We agreed the best approach is to gather input from the individual Tier-2 sites and cross-check their status against the statements made in the readiness review (especially about problems to be overcome). Following this the feedback will be shared with the UK experiment representatives to gain their input (have they seen any problems?). Once all views have been gathered a network status and forward look document will be compiled with an initial draft being ready for early December". To gather your site input the following sections are available to complete:
1. WAN problems experienced in the last year. Please include here any connectivity problems, bandwidth capping issues or other factors that led (or may have led) to the WAN becoming a bottleneck for data transfers to/from your site (a rough way of quantifying throughput is sketched after this list).
2. Problems/issues seen with site networking. If there are any internal networking issues with which you are dealing that are likely to impact user analysis or prevent the full capacity of your CPU/storage resources from being realised, please mention them here.
3. Forward look. Is your site planning any changes to its LAN or WAN connectivity over the coming 12 months? If so please give details.
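When reporting suspected bandwidth capping under item 1, a measured figure helps. The sketch below is only a rough point-to-point throughput check, not part of the requested input; the port, chunk size and transfer volume are arbitrary assumptions, and a dedicated tool such as iperf would normally be preferred. Run it with "server" on one host and "client <server-host>" on the other.

 # Rough throughput check (sketch only). Reports an approximate Mbit/s figure.
 import socket
 import sys
 import time
 
 PORT = 5001            # arbitrary test port, assumed open in the site firewall
 CHUNK = 1024 * 1024    # send 1 MiB at a time
 TOTAL_MB = 200         # total amount of data pushed by the client
 
 def server():
     srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
     srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
     srv.bind(("", PORT))
     srv.listen(1)
     conn, addr = srv.accept()
     received = 0
     start = time.time()
     while True:
         data = conn.recv(CHUNK)
         if not data:
             break
         received += len(data)
     elapsed = time.time() - start
     print("received %.0f MB in %.1f s -> %.1f Mbit/s"
           % (received / 1e6, elapsed, 8 * received / 1e6 / elapsed))
 
 def client(host):
     sock = socket.create_connection((host, PORT))
     payload = b"x" * CHUNK
     for _ in range(TOTAL_MB):
         sock.sendall(payload)
     sock.close()
 
 if __name__ == "__main__":
     if sys.argv[1] == "server":
         server()
     else:
         client(sys.argv[2])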
London Tier2
UKI-LT2-BRUNEL
1. Real physical memory per job slot:
dgc-grid-35: 1G
dgc-grid-40: 2G
dgc-grid-44: 1G
2. Real memory limit beyond which a job is killed:
n/a
3. Virtual memory limit beyond which a job is killed:
n/a
4. Number of cores per WN:
dgc-grid-35: 2
dgc-grid-40: 8
dgc-grid-44: 4
Comments:
1. WAN problems experienced in the last year:
Main connection capped at 400Mb/s
2. Problems/issues seen with site networking:
Recently, problems with site DNS servers. These have now been addressed.
3. Forward look:
Comments:
Network link expected to increase to 1Gb/s in the next quarter
Move to a new data centre in April
UKI-LT2-IC-HEP
1. Real physical memory per job slot:
ce00: older nodes 1G, newer nodes 2G
2. Real memory limit beyond which a job is killed:
none
3. Virtual memory limit beyond which a job is killed:
none
4. Number of cores per WN:
older nodes 4, newer nodes 8
Comments:
1. WAN problems experienced in the last year:
none
2. Problems/issues seen with site networking:
none
3. Forward look:
Comments:
Generally good network.
UKI-LT2-QMUL
1. Real physical memory per job slot:
1G
2. Real memory limit beyond which a job is killed:
1G (rss) in the lcg_long_x86 queue and 2G (rss) in the lcg_long2_x86 queue
3. Virtual memory limit beyond which a job is killed:
INFINITY
4. Number of cores per WN:
2-8
Comments:
1. WAN problems experienced in the last year:
2. Problems/issues seen with site networking:
3. Forward look:
Comments:
Generally good networking.
UKI-LT2-RHUL
1. Real physical memory per job slot:
ce1: 0.5 GB
ce2: 2 GB
2. Real memory limit beyond which a job is killed:
ce1: none
ce2: none
3. Virtual memory limit beyond which a job is killed:
ce1: none
ce2: none
4. Number of cores per WN:
ce1: 2
ce2: 8
Comments:
ce1 o/s is RHEL3 so most VOs can't use it.
ce2 o/s is SL4.
1. WAN problems experienced in the last year:
Firewall problems related to ce2/se2, now resolved.
2. Problems/issues seen with site networking:
The LAN switch stack came up wrongly configured after a scheduled power cut at the end of November. It was very difficult to debug remotely and was not resolved until early January.
3. Forward look:
See below.
Comments:
The ce2/se2 cluster is due to be relocated from IC to the new RHUL machine room around mid-2009. Power and cooling should be good. More work is required to minimise the impact on network performance.
Currently at IC the cluster has a short, uncontended 1Gb/s connection direct to a PoP on the LMN, which has given very reliable performance for data transfers. The current network situation at RHUL is a 1Gb/s link shared with all other campus internet traffic. The point has been made that this move must not be a backward step in terms of networking, so RHUL CC are investigating options to address this.
UKI-LT2-UCL-CENTRAL
1. Real physical memory per job slot: 4G or amount requested
2. Real memory limit beyond which a job is killed: Still working on enforcement at the moment.
3. Virtual memory limit beyond which a job is killed: 12G or 3x amount requested
4. Number of cores per WN: 4
Comments: We've not had a working production cluster for a while as we threw our old cluster out to make room for the new one. Prior to that the old one was receiving minimal maintenance as we were anticipating the new cluster's delivery being imminent.
1. WAN problems experienced in the last year:
2. Problems/issues seen with site networking:
3. Forward look: We now have people in networks specifically tasked with looking after research computing and Grid facilities.
Comments:
UKI-LT2-UCL-HEP
1. Real physical memory per job slot: 1GB or 512MB
2. Real memory limit beyond which a job is killed: none at present
3. Virtual memory limit beyond which a job is killed: none at present
4. Number of cores per WN: 2
Comments: Will shortly replace all WNs. Will post up-to-date details when the new WNs are online.
1. WAN problems experienced in the last year: none
2. Problems/issues seen with site networking: none
3. Forward look: no specific problems foreseen
Comments:
NorthGrid
UKI-NORTHGRID-LANCS-HEP
1. Real physical memory per job slot:
2. Real memory limit beyond which a job is killed:
3. Virtual memory limit beyond which a job is killed:
4. Number of cores per WN:
Comments:
1. WAN problems experienced in the last year:
2. Problems/issues seen with site networking:
3. Forward look:
Comments:
UKI-NORTHGRID-LIV-HEP
1. Real physical memory per job slot: 1GB
2. Real memory limit beyond which a job is killed: None
3. Virtual memory limit beyond which a job is killed: None
4. Number of cores per WN: 1
Comments: A small number of nodes have 1.5GB per slot, and this will increase as older machines are retired. A possible link with the central cluster would provide some 8-core nodes with 2GB per slot.
1. WAN problems experienced in the last year: University firewall 1-hour timeouts cause lcg-cp transfers to fail to exit if they take more than 1 hour. The shared university 1G link limits transfers and the ability to merge the local and central clusters.
2. Problems/issues seen with site networking: 1G networking (at a 20-node:1G ratio in racks) is becoming a bottleneck, particularly for user analysis and storage.
3. Forward look: 10G links with central computer services. Investigate a dedicated 1G WAN link.
Comments:
UKI-NORTHGRID-MAN-HEP
1. Real physical memory per job slot: 2 GB
2. Real memory limit beyond which a job is killed: None
3. Virtual memory limit beyond which a job is killed: None
4. Number of cores per WN: 2
Comments:
1. WAN problems experienced in the last year: None
2. Problems/issues seen with site networking: None
3. Forward look:
Comments:
UKI-NORTHGRID-SHEF-HEP
1. Real physical memory per job slot: 1.975 GB
2. Real memory limit beyond which a job is killed: 2.1725 GB for over 10 minutes
3. Virtual memory limit beyond which a job is killed: No
4. Number of cores per WN: 2
Comments:
1. WAN problems experienced in the last year:
2. Problems/issues seen with site networking: The Sheffield University DNS server is less stable than we would like, so we are temporarily using a substitute DNS server.
3. Forward look:
Comments:
ScotGrid
UKI-SCOTGRID-DURHAM
1. Real physical memory per job slot: 2GB
2. Real memory limit beyond which a job is killed: 2GB
3. Virtual memory limit beyond which a job is killed: No limit
4. Number of cores per WN: 8
Comments: Figures apply from Jan 09, when the new cluster was installed.
1. WAN problems experienced in the last year: Eight breaks over the last 12 months according to the JISC Monitoring Unit, giving a total outage time of 507 (presumably minutes).
2. Problems/issues seen with site networking: The old cluster was on a 100Mbps switch; the new cluster has Gigabit networking. Bonding hasn't given us the performance increase we hoped for, so it is currently 1Gbps per 8 cores, but we are investigating how to bring that to 2Gbps.
3. Forward look: The WAN looks set to remain at 1Gbps from our campus to JANET, shared with all other university users.
Comments:
UKI-SCOTGRID-ECDF
1. Real physical memory per job slot:
2GB
2. Real memory limit beyond which a job is killed:
None.
3. Virtual memory limit beyond which a job is killed:
Depends on the VO: 6GB for ATLAS production jobs, 3GB for everyone else (as they have not had a problem yet).
4. Number of cores per WN: 4 or 8 (roughly the same number of nodes with each; all are dual-processor, either dual- or quad-core).
Comments:
1. WAN problems experienced in the last year:
2. Problems/issues seen with site networking:
3. Forward look:
Comments:
UKI-SCOTGRID-GLASGOW
1. Real physical memory per job slot: 2GB
2. Real memory limit beyond which a job is killed: None
3. Virtual memory limit beyond which a job is killed: None
4. Number of cores per WN: 4 (~50% will have 8 cores from Nov 2008)
Comments:
1. WAN problems experienced in the last year: None
2. Problems/issues seen with site networking: None foreseen
3. Forward look: Networking arrangements seem adequate for 2009 at least.
Comments:
SouthGrid
UKI-SOUTHGRID-BHAM-HEP
1. Real physical memory per job slot:
- PP Grid cluster: 2048MB/core
- eScience cluster: 1024MB/core
- Atlas cluster: 512MB/core
2. Real memory limit beyond which a job is killed: None
3. Virtual memory limit beyond which a job is killed: None
4. Number of cores per WN:
- PP Grid cluster: 8
- Mesc cluster: 2
- Atlas cluster: 2
Comments:
1. WAN problems experienced in the last year: None
2. Problems/issues seen with site networking:
- DNS problems, a faulty GBIC, and several reboots of core switches in summer 08
- Broken switch connecting the Mesc workers on 26/12/08 (a second-hand replacement 100Mb/s switch was installed on 12/01/09)
- Networking between the SE and WNs is poor according to Steve's networking tests - investigation ongoing
3. Forward look:
Replace the 100Mb/s switches with gigabit switches for the workers
Comments:
UKI-SOUTHGRID-BRIS-HEP
1. Real physical memory per job slot:
- PP-managed cluster: Old WN (being phased out ASAP): 512MB/core; New WN: 2GB/core
- HPC-managed cluster: 2GB/core
2. Real memory limit beyond which a job is killed:
- None known (Unix default = unlimited) (both clusters)
3. Virtual memory limit beyond which a job is killed:
- None known (Unix default = unlimited) (both clusters)
4. Number of cores per WN:
- PP-managed cluster: Old WN (being phased out ASAP) 2 cores; New WN: 8 cores
- HPC-managed cluster: 4 cores
Comments:
1. WAN problems experienced in the last year:
- None
2. Problems/issues seen with site networking:
- None
3. Forward look:
- The university link to SWERN either will be, or already has been, upgraded to 2.5Gbps AFAIK
Comments:
UKI-SOUTHGRID-CAM-HEP
1. Real physical memory per job slot:
- 2GB: 120 job slots
- >2GB: 168 job slots
2. Real memory limit beyond which a job is killed:
- None
3. Virtual memory limit beyond which a job is killed:
- None
4. Number of cores per WN:
- 4 cores - 2 WNs
- 8 cores - 16 WNs
- 12 cores - 2 WNs
- 16 cores - 8 WNs
Comments:
1. WAN problems experienced in the last year:
- None
2. Problems/issues seen with site networking:
3. Forward look:
Comments:
EFDA-JET
1. Real physical memory per job slot: 2GB
2. Real memory limit beyond which a job is killed: Not currently implemented
3. Virtual memory limit beyond which a job is killed: Not currently implemented
4. Number of cores per WN: 2 on some, 4 on others
Comments:
Our Worker Nodes have either 2 or 4 cores. There is 2GB of RAM for each core, i.e. the total RAM per node is either 4GB or 8GB. Each node has 3GB of swap.
1. WAN problems experienced in the last year: None
2. Problems/issues seen with site networking: None
3. Forward look:
Will move services to nodes with faster network interfaces.
Comments:
UKI-SOUTHGRID-OX-HEP
1. Real physical memory per job slot: Old nodes (due to be decommissioned in Nov 08): 1GB/core; newer nodes: 2GB/core
2. Real memory limit beyond which a job is killed: None specifically imposed.
3. Virtual memory limit beyond which a job is killed: None specifically imposed.
4. Number of cores per WN: Old: 2; New: 8
Comments: Our machines are run in 32-bit mode with the ordinary (as opposed to HUGEMEM) SL kernel, so a single process can only address a maximum of 3GB. The worker nodes are run with very little swap space, so if all the real memory in a machine is used it should bring the OOM killer into play, rather than just bogging down in swap. In practice this doesn't seem to happen; the eight-core WNs usually have enough free real memory to accommodate the larger jobs.
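For illustration only (not part of the Oxford configuration), a crude way to see where a single process actually tops out on such a node is to keep allocating until the allocation fails; on a 32-bit build the MemoryError arrives near the ~3GB address-space ceiling, whereas a node that has exhausted real memory and swap would instead see the process terminated by the OOM killer. A Python sketch:

 # Crude probe (sketch only): allocate 100 MB chunks until allocation fails.
 # The 4 GB stop is an arbitrary safety cap for the probe itself.
 blocks = []
 allocated_mb = 0
 try:
     while allocated_mb < 4096:
         blocks.append(bytearray(100 * 1024 * 1024))
         allocated_mb += 100
 except MemoryError:
     # On a 32-bit process this typically fires around the 3 GB ceiling;
     # if real memory and swap run out first, the OOM killer may instead
     # terminate the process before this line is reached.
     pass
 print("allocated roughly %d MB before stopping" % allocated_mb)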
1. WAN problems experienced in the last year:
2. Problems/issues seen with site networking: The site is connected to JANET at 2Gb/s. The cluster shares a 1Gb/s link, which could be upgraded as needed.
3. Forward look:
Comments:
UKI-SOUTHGRID-RALPP
1. Real physical memory per job slot: 1 or 2 GB/core depending on node type. We have VO queues that publish 985 MB/core and SubCluster queues that publish 500, 1000 and 2000 MB/core.
2. Real memory limit beyond which a job is killed: Not currently implemented, although if a node starts to run out of swap and we notice in time we may manually kill jobs.
3. Virtual memory limit beyond which a job is killed: See above
4. Number of cores per WN:
Comments:
We don't currently kill jobs for excessive memory use; we just try to use the memory information to give more info to the batch system. However, due to problem jobs killing worker nodes recently, we may implement a killing policy, probably at 125% or 150% of the published queue limit (possibly lower for the higher-memory queues).
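A minimal sketch of what such a check might look like (the queue names, limits, job records and the 125% factor below are hypothetical illustrations, not RALPP's actual tooling or policy values):

 # Hypothetical example: flag jobs whose resident memory exceeds 125% of the
 # published queue limit. All queue names and job figures are made up.
 KILL_FACTOR = 1.25
 
 queue_limit_mb = {"grid1000M": 1000, "grid2000M": 2000}   # assumed published limits
 
 jobs = [
     {"id": "1234.batch", "queue": "grid1000M", "rss_mb": 1100},
     {"id": "1235.batch", "queue": "grid2000M", "rss_mb": 2600},
 ]
 
 for job in jobs:
     limit = queue_limit_mb[job["queue"]]
     if job["rss_mb"] > KILL_FACTOR * limit:
         print("would kill %s: %d MB used vs %.0f MB allowed (125%% of %d MB)"
               % (job["id"], job["rss_mb"], KILL_FACTOR * limit, limit))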
1. WAN problems experienced in the last year:
None.
2. Problems/issues seen with site networking:
A slight issue with the link between the different sections of our cluster, which was only 1Gb/s; it has now been increased to 2Gb/s and things seem better.
3. Forward look:
In the future we will be separating the farm further with the Disk and CPU resources being hosted in different Machine Rooms and the link between these will become critical. We are looking at ways to upgrade this to 10Gb/s.
Comments:
Tier1
RAL-LCG2-Tier-1
1. Real physical memory per job slot:
All WNs have 2GB/core (1 job slot per core).
2. Real memory limit beyond which a job is killed:
Dependent on queue: 500M, 700M, 1000M, 2000M, 3000M
3. Virtual memory limit beyond which a job is killed:
no limit
4. Number of cores per WN:
4 or 8 depending on hardware.
Comments:
1. WAN problems experienced in the last year:
2. Problems/issues seen with site networking:
3. Forward look:
Plan for doubled 10GbE (20Gb/s) for internal links and doubling of existing links as needed.
Comments: