RALPP Logbook 200607

From GridPP Wiki

04/07/2006

0930 Added CPU and Walltime Scaling - Chris Brew

It was very simple really.

First I edited /opt/lcg/var/gip/ldif/static-file-Cluster.ldif to set:

GlueHostBenchmarkSI00: 1000

I.e. our reference system is a 1.000 kSI2k CPU.

Then I added:

$cputmult 1.068
$wallmult 1.068

to /var/spool/pbs/mom_priv/config on all the worker nodes, and ran

killall -HUP pbs_mom

on each node to cause pbs_mom to reread its config file.
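The config edit can be made repeatable so that running it twice does not duplicate the lines. The following is a minimal sketch, not the method actually used here: the function name add_scaling is made up for illustration, and it assumes the config file takes one directive per line as shown above.

```shell
# Sketch only: add_scaling is a hypothetical helper, not part of PBS.
# Usage: add_scaling CONFIG_FILE MULTIPLIER
add_scaling() {
    config=$1
    mult=$2
    for key in cputmult wallmult; do
        # Append the directive only if no line for it exists yet.
        if ! grep -q "^\\\$${key}[[:space:]]" "$config" 2>/dev/null; then
            printf '$%s %s\n' "$key" "$mult" >> "$config"
        fi
    done
}
```

On a worker node this would be called as add_scaling /var/spool/pbs/mom_priv/config 1.068, followed by the HUP to pbs_mom described above. It does not change an existing value; a line that is already present is left as it is.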

cputmult and wallmult scale the actual CPU and wall times respectively by the value given.
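As a worked example of the scaling, an hour of real CPU time on one of these nodes is accounted as slightly more than an hour on the reference system. This is a small sketch of the arithmetic only; scale_time is a hypothetical helper and the 3600 second job is an invented figure.

```shell
# Sketch only: scale_time is a hypothetical helper for the arithmetic.
# Usage: scale_time SECONDS MULTIPLIER
scale_time() {
    # Multiply the measured time by the multiplier, one decimal place.
    awk -v t="$1" -v m="$2" 'BEGIN { printf "%.1f\n", t * m }'
}

# 3600 s measured on a node with multiplier 1.068
scale_time 3600 1.068   # prints 3844.8
```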

Since the whole farm is currently 2.8 GHz PIVs, only one value is needed (taken from the T1).

I've also changed the glite-wn-setup_pp module and the template site-info.def file to add these values at install time.

Eventually these numbers should be taken from the database.

19/07/2006

1130 Investigation of OPS SFT-RM failures on heplnx204/165

The SFT-RM test was failing for the OPS VO with the error:

the server sent an error response: 530 530 User Name for GSI Identity/C=CH/O=CERN/OU=GRID/CN=Piotr Nyczyk 6217 not found.

After discussion during the TB-Support Meeting I discovered that the kpwd file (which dCache uses for the certificate to user ID mapping) wasn't being updated and so didn't have the entries for the OPS users.

I've run the generating script by hand. If we start passing now, I'll add an entry to cron to run it automatically every hour.

Chris brew 11:37, 19 Jul 2006 (BST)

Looks like it failed again. This time I found the following error in the logs:

Can't determine storageInfo : CacheException(rc=35;msg=OSM info not found in /pnfs/fs/.(access)(001500000000000000001638)(type=--I--d-----))

Poking around I found that the ops VO directory hadn't been tagged:

[root@heplnx204 ops]# cat '.(tags)()'

Returned nothing.

whereas in the dteam directory:

[root@heplnx204 dteam]# cat '.(tags)()'
.(tag)(OSMTemplate)
.(tag)(sGroup)

So I added the tags:

[root@heplnx204 ops]# echo 'StoreName    ops'>'.(tag)(OSMTemplate)'
[root@heplnx204 ops]# echo ops >'.(tag)(sGroup)'
[root@heplnx204 ops]# cat '.(tag)(OSMTemplate)'
StoreName    ops
[root@heplnx204 ops]# cat '.(tag)(sGroup)'
ops

I'll probably have to do this for all the VOs I added in the last round!
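Repeating the tagging for several VOs could be scripted along these lines. This is a sketch under assumptions: tag_vo_dirs is a made-up name, every VO is assumed to have its own directory under one common base directory, and each tag value is assumed to be the VO name, as it was for ops.

```shell
# Sketch only: tag_vo_dirs is a hypothetical helper.
# Usage: tag_vo_dirs BASE_DIR VO [VO...]
tag_vo_dirs() {
    base=$1
    shift
    for vo in "$@"; do
        dir="$base/$vo"
        if [ ! -d "$dir" ]; then
            echo "skipping $vo: $dir is not a directory" >&2
            continue
        fi
        # Same two writes as the manual commands, run from inside the VO directory.
        (
            cd "$dir" || exit 1
            echo "StoreName    $vo" > '.(tag)(OSMTemplate)'
            echo "$vo" > '.(tag)(sGroup)'
        )
    done
}
```

BASE_DIR is whichever directory holds the per-VO directories, and the VO names are supplied by the caller. The function overwrites any existing tag values, so it should only be pointed at directories that are known to be untagged.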

Chris brew 12:06, 19 Jul 2006 (BST)

24/07/2006

0930 DMA Errors spotted in central syslogger LogWatch

Tracked down to node heplnx26 reporting DMA errors to syslog. Marked it as offline in PBS and will wait for the current jobs to finish.

27/07/2006

Jobs finished (no more problems seen).
Booted to the Maxtor disk tester. The disk passed the quick test and also passed the full test, but the full test reported format problems.

Initiated full low level format and reinstalled the node.

Chris brew 12:15, 27 Jul 2006 (BST)