RAL Tier1 DCache SRM

Figure: RAL dCache Architecture (File:RAL-Production-DCache-Arch.png)

The RAL Tier1 runs a large DCache facility.

Each node in the diagram corresponds to a physical box. A number of protocols are generalised in the diagram under the single label "dcache"; these are dCache internal protocols.

Service Endpoints

https://dcache.gridpp.rl.ac.uk:8443/srm/managerv1.wsdl?/pnfs/gridpp.rl.ac.uk/data/<vo>

where <vo> is one of atlas, cms, dteam, lhcb, pheno, biomed, hone, zeus, ilc, esr, magic or t2k

https://dcache-tape.gridpp.rl.ac.uk:8443/srm/managerv1.wsdl?/pnfs/gridpp.rl.ac.uk/tape/<vo>

where <vo> is one of atlas, cms, dteam, lhcb or minos

Both endpoints are connected to the same dCache instance; however, dcache-tape is used for access to the RAL Atlas Data Store, the Mass Storage System at the RAL Tier1, and is therefore configured with a longer lifetime for SRM get requests.

A file with the path /pnfs/gridpp.rl.ac.uk/data/<vo>/ is entirely distinct from a file with the path /pnfs/gridpp.rl.ac.uk/tape/<vo>/.

Files with a path under /pnfs/gridpp.rl.ac.uk/data/ are stored permanently on disk.

Files with a path under /pnfs/gridpp.rl.ac.uk/tape/ are initially written to disk and then migrated to tape; eventually the file is removed from disk and is restored from tape if required.
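
To illustrate the difference, a hedged sketch using srmcp to copy the same local file into each namespace (the local file and the dteam paths are hypothetical, the exact SURL form may vary with SRM client version, and a valid grid proxy is assumed):

 # disk-only copy: the file stays on disk permanently
 bash$ srmcp file:////home/user/myfile.txt \
     "srm://dcache.gridpp.rl.ac.uk:8443/srm/managerv1?SFN=/pnfs/gridpp.rl.ac.uk/data/dteam/myfile.txt"

 # tape-backed copy: written to disk first, then migrated to tape
 bash$ srmcp file:////home/user/myfile.txt \
     "srm://dcache-tape.gridpp.rl.ac.uk:8443/srm/managerv1?SFN=/pnfs/gridpp.rl.ac.uk/tape/dteam/myfile.txt"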

RAL DCache Pools

Each disk server within dCache at the tier1 has at least one disk pool on it. Up to now these have corresponded to physical individual disk partitions. There are two types of disk pool currently deployed.

  • Shared pools that will offload and load files to tape. These are the ones mapped to the file space /pnfs/gridpp.rl.ac.uk/tape
  • VO specific pools that do not interact with tapes with their files remaining on disk permanently. These are mapped to VO specific paths such as /pnfs/gridpp.rl.ac.uk/data/lhcb.

The diagram below shows how these disk pools are arranged within the RAL Tier1. It is an illustration of the possibilities rather than an exact snapshot: for instance, the t2k VO has only one disk pool, and in some places CMS has more than one disk pool on a single disk server.

Basic Usage

The dCache file system, i.e. everything under /pnfs/gridpp.rl.ac.uk/, is mounted on all of the RAL batch workers and can be accessed in a POSIX-like way. To do this, the end user must set the following:

 bash$ export LD_PRELOAD=libpdcap.so

This will allow you to view files in dCache, for example by doing:

 bash$ cat /pnfs/gridpp.rl.ac.uk/data/dteam/myfile.txt


In principle we could set this for end users by default, but silently pre-loading libraries is generally not what users expect.
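
Putting this together, a minimal sketch of a batch job step that reads a file from dCache via the mounted namespace (the input file name is hypothetical):

 #!/bin/bash
 # Preload the dcap library so ordinary POSIX tools can read files under /pnfs.
 export LD_PRELOAD=libpdcap.so

 # Hypothetical input file in the dteam disk area.
 INPUT=/pnfs/gridpp.rl.ac.uk/data/dteam/myinput.dat

 # Any POSIX-style read now goes through dcap transparently.
 wc -c "$INPUT"
 cp "$INPUT" ./myinput.dat   # or stage a local working copy first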

Service Monitoring

Available Space

To find out the free and used disk space for a VO, use this LDAP query:

 $ ldapsearch -x -H ldap://site-bdii.gridpp.rl.ac.uk:2170 \
     -b 'Mds-Vo-name=RAL-LCG2,o=Grid' '(GlueSALocalID=<VO>)' \
     GlueSARoot GlueSAStateAvailableSpace GlueSAStateUsedSpace 

replacing <VO> with the VO name.

For example:

 $ ldapsearch -x -H ldap://site-bdii.gridpp.rl.ac.uk:2170 \
     -b 'Mds-vo-name=RAL-LCG2,o=Grid' '(GlueSALocalID=cms)' \
     GlueSARoot GlueSAStateAvailableSpace GlueSAStateUsedSpace

This shows the available space and used space respectively; the units are kilobytes.

   # cms, dcache.gridpp.rl.ac.uk, RAL-LCG2, grid
   dn: GlueSALocalID=cms,GlueSEUniqueID=dcache.gridpp.rl.ac.uk,mds-vo-name=RAL-LCG2,o=grid
   GlueSARoot: cms:/pnfs/gridpp.rl.ac.uk/data/cms
   GlueSAStateAvailableSpace: 382434948
   GlueSAStateUsedSpace: 25049346428
   # cms, dcache-tape.gridpp.rl.ac.uk, RAL-LCG2, grid
   dn: GlueSALocalID=cms,GlueSEUniqueID=dcache-tape.gridpp.rl.ac.uk,mds-vo-name=RAL-LCG2,o=grid
   GlueSARoot: cms:/pnfs/gridpp.rl.ac.uk/tape/cms
   GlueSAStateAvailableSpace: -9473007048
   GlueSAStateUsedSpace: 13767974344

In this case the negative value for available space shows that CMS is over quota.
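
For a quick human-readable check, the kilobyte figures can be converted to gigabytes on the command line. A small sketch combining the same query with awk (cms is just an example VO):

 $ ldapsearch -x -LLL -H ldap://site-bdii.gridpp.rl.ac.uk:2170 \
     -b 'Mds-Vo-name=RAL-LCG2,o=Grid' '(GlueSALocalID=cms)' \
     GlueSAStateAvailableSpace GlueSAStateUsedSpace \
   | awk '/^GlueSAStateAvailableSpace:/ {printf "available: %10.1f GB\n", $2/1048576}
          /^GlueSAStateUsedSpace:/      {printf "used:      %10.1f GB\n", $2/1048576}'

This prints one available/used pair for each of the two SRM endpoints (disk and tape).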

Local Deployment Information

  • 8 systems (gftp0440, gftp0444-gftp0447, gftp0450-gftp0452) are deployed as gridftp and gsidcap doors
  • 1 system (pg350) is deployed as a PostgreSQL database node for SRM request persistency and central dCache service data storage
  • 1 system (lcg0438) is deployed as a PostgreSQL database node for SRM request persistency
  • 1 system (dcache-tape) is deployed as an SRM door intended for access to the RAL Atlas Data Store
  • 1 system (dcache) is deployed as an SRM door for access to the RAL Tier1 Disk Servers
  • 1 system (pnfs) is deployed as the PNFS server
  • 1 system (dcache-head) is deployed as a central dCache service node
  • 1 system (csfnfs58) runs pools for smaller VOs:
    • 1 PhenoGrid pool
    • 1 BioMed pool
    • 1 H1 pool
    • 1 ZEUS pool
    • 1 ILC pool
    • 1 ESR pool
    • 1 Magic pool
    • 1 T2K pool
    • 1 Babar pool
    • 1 Minos pool
    • 1 Cedar pool
    • 1 SNO pool
    • 1 Fusion pool
    • 1 Geant 4 pool
  • 21 systems are deployed as dedicated dCache pool nodes
    • csfnfs39 - 2 LHCB pools
    • csfnfs42 - 2 ATLAS pools
    • csfnfs50 - 2 LHCB pools, 2 ATLAS pools, 2 MINOS pools
    • csfnfs54 - 4 ATLAS pools
    • csfnfs56 - 4 ATLAS pools
    • csfnfs57 - 4 LHCB pools
    • csfnfs60 - 3 ATLAS pools, 2 DTeam pools, 1 shared pool
    • csfnfs61 - 3 LHCb pools, 2 DTeam pools, 1 shared pool
    • csfnfs62 - 3 CMS pools, 2 DTeam pools, 1 shared pool
    • csfnfs63 - 3 CMS pools, 2 DTeam pools, 1 shared pool
    • csfnfs64 - 4 LHCB pools
    • gdss66 - 4 ATLAS pools
    • gdss67 - 4 ZEUS pools
    • gdss68 - 5 shared pools
    • gdss88 - 3 LHCB pools
    • gdss89 - 3 LHCB pools
    • gdss91 - 3 LHCB pools
    • gdss92 - 3 LHCB pools
    • gdss99 - 3 LHCB pools
    • gdss100 - 3 LHCB pools
    • gdss101 - 3 LHCB pools

System tuning

Disk Servers

See RAL Tier1 Disk Servers for general disk server tuning. We had to raise the number of open file descriptors that the dcache pool service could use.
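
The exact way to do this depends on how the pool service is started; the sketch below shows the two usual approaches, with an illustrative limit rather than the exact value used at RAL:

 # Option 1: raise the limit in the init script that starts the pool service,
 # just before the java process is launched (the value is illustrative).
 ulimit -n 16384

 # Option 2: raise it for the account the pool service runs as, via
 # /etc/security/limits.conf (the "dcache" user name is an assumption):
 #   dcache   soft   nofile   16384
 #   dcache   hard   nofile   16384

 # Check the limit in effect for the current shell:
 ulimit -n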

Transfer Systems

The gftp door systems have been upgraded to Java 1.5. They also have several lines added to their /etc/sysctl.conf files:

 
   #/afs/cern.ch/project/openlab/install/service_challenge/tmp/sysctl/sysctl_8MBbuf_4Mwin_n3

   ### IPV4 specific settings
   net.ipv4.tcp_timestamps = 0 # turns TCP timestamp support off, default 1, reduces CPU use
   net.ipv4.tcp_sack = 1 # leaves SACK support enabled, default on

   # on systems with a VERY fast bus -> memory interface this is the big gainer
   net.ipv4.tcp_rmem = 262144 4194304 8388608 # sets min/default/max TCP read buffer, default 4096 87380 174760
   net.ipv4.tcp_wmem = 262144 4194304 8388608 # sets min/default/max TCP write buffer, default 4096 16384 131072
   #net.ipv4.tcp_mem  = 262144 4194304 8388608 # sets min/pressure/max TCP buffer space, default 31744 32256 32768
   net.ipv4.tcp_mem  = 32768 65536 131072  # sets min/pressure/max TCP buffer space, default 31744 32256 32768

   ### CORE settings (mostly for socket and UDP effect)
   net.core.rmem_max = 4194303 # maximum receive socket buffer size, default 131071
   net.core.wmem_max = 4194303 # maximum send socket buffer size, default 131071
   net.core.rmem_default = 1048575 # default receive socket buffer size, default 65535
   net.core.wmem_default = 1048575 # default send socket buffer size, default 65535
   net.core.optmem_max = 1048575 # maximum amount of option memory buffers, default 10240
   net.core.netdev_max_backlog = 100000 # number of unprocessed input packets before kernel starts dropping them, default 300
   
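
These settings are applied at boot; to load them into a running system without a reboot, re-read the file with sysctl:

 # re-read /etc/sysctl.conf and apply the settings immediately (run as root)
 sysctl -p

 # confirm an individual value
 sysctl net.core.rmem_max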

Operational Procedures

See RAL Tier1 DCache Operational Procedures
