RAL Tier1 Fabric Team Closed Items 2008

2008 closed items from RAL Tier1 Fabric Team.

Action ID RT Number Priority Owner Action Title Target date Status Date closed Notes
T-64 19384 High JIT,JRHA Benchmarking of loan Viglen equipment and comparison with production servers 06/07/07 Closed 2008-01-02 Additional system drives delivered. Writing plan, complete by Fri 15/06/07.

20070613: James diverted to Castor Systems work.

20070620: Networking now available, JIT aims to start testing by close of Friday.

20070626: SL4.4 installation using driver from 3ware website and manually updated initrd.

20070627: Testing in progress. Need a standard viglen server for comparison.

20070704: Tests for RAID6/i386/ext3 done. RAID5 in progress. Got a 2006 server (gdss172) for comparison.

20070710: Tests for RAID5/i386/ext3 done. x86_64 install to be started as well as tests on current production kit (gdss170).

20070718: Some tests complete on gdss170/172. NFS test still to be done. James T to show James A how to run the tests. Graphs for 512MB and 16GB files plotted.

20070802: Local tests completed on all machines.

20070926: Report writing in progress.

20071010: Graphs mostly created.

20071024: No progress last week owing to A.O.D. duty and a broken Linux machine. Using JRHA's machine now.

2007-11-14: Andrew has suggested some more tests and would like the graphs plotted differently.

2007-11-23: Kit packed up for return to Viglen.

2007-11-29: Returned to Viglen.

T-95 Medium JIT,JRHA Test DDN kit Closed 2008-01-02 Disk/load testing.

20070808: Worked with Mark Adams from DDN to run some preliminary benchmarks after he spent Monday configuring the systems for our needs. First indications are promising; we will start to run more in-depth benchmarks when JIT returns from leave.

20070813: Remote access server installed, awaiting external connection.

20070824: Ran some iozone tests (results in /root/DDNtest2007/results/ on the machines). Some of the test machines dropped out before the end of the tests but that might just have been my SSH connection.

20070919: Need to move external access to standard network (MJB).

20070926: Some results sent to DDN. Starting NFS tests.

20071010: Collaborating with DDN to run tests, they have now run a few themselves as well.

20071024: No progress last week due to James T. being A.O.D. gdss170 (the NFS client) is broken at the moment.

2007-11-14: Test servers and access node blanked and powered off today. Report on what has been done required.

2007-11-23: Kit dismantled and ready for collection. Only Dell access node remains to be packed.

T-117 High JIT SL4 workers missing some packages 2007-10-03 Closed 2008-01-02 List of missing packages generated.

2007-10-10: Need to install CASTOR client and CERN libs on SL4 workers.

T-123 Low JIT Script to replace IP addresses with names in text files. 2007-12-31 Closed 2008-01-02 Script works but needs tidying up. Script is adequate.
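
The T-123 script itself is not reproduced on this page. As a rough illustration only, a minimal Python sketch of the same idea might look like the following. The function names, the regular expression and the reverse-DNS lookup are all assumptions, not the original implementation.

```python
import re
import socket

# Matches dotted-quad IPv4 candidates; octet ranges are validated separately.
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def default_lookup(ip):
    """Reverse-resolve an address; return None if it has no name."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def replace_ips(text, lookup=default_lookup):
    """Return text with each resolvable IPv4 address replaced by its name.

    `lookup` is a hypothetical hook so that a static table can be used
    instead of DNS; addresses that do not resolve are left untouched.
    """
    cache = {}

    def substitute(match):
        ip = match.group(0)
        if any(int(octet) > 255 for octet in ip.split(".")):
            return ip  # not a valid address, leave as-is
        if ip not in cache:
            cache[ip] = lookup(ip)
        return cache[ip] or ip

    return IPV4_RE.sub(substitute, text)
```

Passing a dictionary's `get` method as `lookup` allows the substitution to be tried out without touching DNS.
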
F-13 Medium MJB Compusys 2004 maintenance Closed 20080102 Investigate cost of extending the maintenance on the Compusys 2004 systems and arrays.

20070620: No progress.

20070627: No progress. JA, MJB to meet to discuss.

20070704: Meeting scheduled for tomorrow.

20071010: Arrange another meeting to discuss.

20080102: Closed - no point in getting paid maintenance. MJB and JRHA will discuss further spares acquisitions - c.f. F-15.

F-29 Medium NGHW Upgrade OS Closed Upgrade OS on 2003 and 2004 disk hardware. We already have one system of each upgraded. We have encountered problems with write-back caching on the 2004 hardware; changing the cache to write-through resolves them, but at a cost in performance.

04/07: New kernel in progress with updated aic drivers from Adaptec's web site. This will be our first kernel with drivers from Adaptec; previous drivers have been from Justin Gibbs, the original maintainer. However, this should afford us a forward maintenance route.

10/07: Latest Areca driver built into the kernel. Unfortunately, the instructions on incorporation leave a little to be desired. They allow the driver to be built but the help information is missing from the kernel configuration file.

20070723: Investigating whether we should use the Red Hat or Broadcom version of the tg3 driver. Reputedly, the Broadcom version is more performant.

20070925: New kernel base released. New base needs to be 55.0.6

20071010: New base of 55.0.9 installed. Kernel compiled after much trauma due to a compilation issue, resolved by installing and compiling 55.0.2. There is a future kernel in the pipeline due to a security alert in mountd. This will have to be our new base when it is released. The .config file for the new base has been rationalized with the deployed .config file line by line. There are a few more issues to be resolved though.

20071024: Moving ahead well. Very few lines in the config file are now unexplained.

20071114: New kernel built and tested on csfnfs34, 65 and gdss86. No problems found. However, the kernel base has moved forward, so we need to remake the kernel with the latest release and smoke test it.

20071218: New kernel smoke tested fine. Implemented on test servers. Deployed on 3rd Dec to csfnfs33 through csfnfs42, gdss43, csfnfs46 and csfnfs47, as well as gdss51, and to all 21 of the Areca-based servers (gdss66 through gdss86 inclusive). csfnfs48 and onwards need to be upgraded once we have sufficient spares to ensure hardware recovery.

20071218: New kernel released on 3rd December. Need new port.

20080107: Closed.

F-31 Low NGHW UKQCD transition Closed UKQCD currently run on their own grid. Ultimately, they will need transitioning onto the main LCG farm.

27/06: UKQCD not planning to move till early 2008.

20080107: Transferred this issue to Grid Services team

F-42 Low NGHW IFT Meeting Closed Arrange IFT meeting re discussion of their technologies.

20080107: Closed

T-119 21246 Medium JRHA New Castor servers requiring network cables 2007-10-12 Closed 2007-12-10

4 tape servers in rack C

Ctsc16.ads.rl.ac.uk - 5510b Port 32
Ctsc17.ads.rl.ac.uk - 5510b Port 33
Ctsc18.ads.rl.ac.uk - 5510b Port 34
Ctsc19.ads.rl.ac.uk - 5510b Port 35

3 services machines in rack B

Ccsb11.ads.rl.ac.uk - 5510b Port 36
Ccsb12.ads.rl.ac.uk - 5510b Port 37
Cdb13.ads.rl.ac.uk  - 5510b Port 38

2007-10-16: Cabled machines with reclaimed blue cables, awaiting cable numbers for labelling from networking.

2007-10-24: Still awaiting cable numbers.

2007-12-10: Numbered cables.

F-35 Low JRHA APC move 2007-12-03 Closed 2007-12-03 Currently, 3 APCs for the 2003 hardware are powering the 2002 hardware. These ideally should be swapped round at an appropriate time and their configuration changed (power failure or scheduled shutdown).

2007-11-14: Added to list of things to do when we 'stop' for the update to csfnfs02 RAL Tier1 Farm Shutdown.

2007-12-03: Move completed during shutdown. One PDU died but has been left dead as it only affects eight of the oldest batch workers.

F-79 High MJB VO software server for CMS 2007-06-29 Closed

2007-06-20: lcg0614 selected and disks replaced with 500GB Hitachi units for SW raid1.

2007-07-04: no progress

2007-10-03: Waiting for CMS to install their software stack from scratch.

2007-10-24: System available as server, progress now pending CMS testing.

20080102: System in production last autumn.

F-30 Medium NGHW csfnfs58 documentation Closed csfnfs58 holds many small partitions. The array partitioning, fdisk, filesystem parameters, directory mount points and permissions need documenting and scripting so that rebuilding the server or upgrading the OS is facilitated. Backups of data partitions are in place for those that need them.

20070925: CMS agitating to use new disk and server via VO machine (lcg0614). This will free up space on csfnfs58.

20071218: Since the shutdown, all space on csfnfs58 is available.

20080107: Need to allocate 100GB for MICE

20080114: 100GB allocated to MICE - /exportstage/datafs-sdb3

F-51 Low NGHW RAIDWATCH on Comp 2004 Closed Implement RAIDWATCH on IFT arrays. RAIDWATCH only scales to 7 machines. HPC have experience of installing it. It would be best to have a RAIDWATCH server per rack.

27/06: Medium->Low

20080114: Closing issue as RAIDWATCH is not a priority.

F-52 Low NGHW Resolver explanation for Castor Closed Explain process that the resolver uses to find a host. Order of preference depending on /etc/hosts, /etc/nsswitch.conf, /etc/host.conf etc...

20071218: This is intertwined with nsswitch.conf, whose man page, RedHat tips etc all disagree.

20080114: Closing issue.

T-82 Medium NGHW Need to update csfnfs58 functionality and VO space for MICE Closed csfnfs58 gives small VOs disk space and provides legacy repositories as well as a repository for LCG software. /etc/mtab has 77 entries, and mounting a new partition can hang, with the result that the system needs to be power cycled for the new mount to take effect, after which all filesystems need to be fsck'd. MICE need some space and we are no longer able to give space on the fly to small VOs without being at risk. We need to migrate csfnfs58 functionality towards newer hardware and OS so that new VOs can be added without being at risk.

2007-07-04: Still outstanding, raised at Monday meeting.

2007-09-25: No further progress

2007-10-10: Waiting go for farm stop from CMS (for nfs02) to include extra partitions and mount lcg software area with no_root_squash for DR.

2007-11-14: Waiting on whether a kernel update for CVE-2007-5191 will be released in a timely manner so we can incorporate the update with the farm shutdown. We will need to go ahead with kernel updates anyway. Added NGHW to the Red Hat bug list tracking system, so he should receive notification of the fix.

2007-12-18: During the shutdown made all partitions available for use.

20080107: Need to allocate 100GB to MICE

20080114: Allocated 100GB to MICE on /exportstage1/datafs-sdb3

T-98 High NGHW Disk server testing Closed 20080114 James Jackson from CMS is available for testing disk servers. He will need access to all the CMS Castor servers, gdss86 and gdss171. He will use gdss86 and 171 to perform benchmarking of the systems with multiple readers and writers.

20070723: James's keys are on all the above systems. All the above systems have default network tuning and a blockdev of 512 for both the system and data drives. As there was no test program to perform the required tests, James has written one. Installed C++ on gdss171 so that it can be compiled. It was already installed on gdss86. James has no roadblock to performing the testing.

20070731: James Jackson started testing on gdss171.

20070808: James Jackson now out on leave.

20071010: Discussed with James Jackson. This issue needs to be left open as it is still part of his task list when he becomes full time at RAL.

20071114: Will check again with James Jackson.

20080114: gdss171 is close to having to be in production. James J has copied his programs off the machine. He will need a server in the future to re-start his testing.

T-106 Medium NGHW 64 bit and XFS benchmarks Closed 20080114 Testing 64 bit machines with XFS filesystems

20080114: Amalgamated into F-21

T-104 High NGHW Stride, Stripe and FS blocksize testing Closed 20080114 Test hardware with different array stripe size, filesystem stride and filesystem block size.

20070829: Have an initial script written 3 years ago but not completed that automatically tests an array by changing the stride size and block size. Cannot change the array stripe size automatically though. The script needs finishing and running.

20080114: Closed and amalgamated into F-21
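
The unfinished script mentioned under T-104 is not available here. A minimal Python sketch of the parameter sweep it describes might look like the following. It assumes the usual ext2/ext3 convention that the stride is the per-disk chunk size divided by the filesystem block size. All names and sizes are illustrative, and the `-E stride=` spelling is that of current mke2fs (older e2fsprogs releases may need `-R stride=` instead).

```python
from itertools import product


def stride_blocks(chunk_kib, block_bytes):
    """Filesystem stride: RAID chunk size expressed in filesystem blocks."""
    chunk_bytes = chunk_kib * 1024
    if chunk_bytes % block_bytes:
        raise ValueError("chunk size must be a multiple of the block size")
    return chunk_bytes // block_bytes


def parameter_matrix(chunk_sizes_kib, block_sizes):
    """All (chunk KiB, block bytes, stride) combinations to be tested."""
    return [(chunk, block, stride_blocks(chunk, block))
            for chunk, block in product(chunk_sizes_kib, block_sizes)]


def mke2fs_argv(device, block_bytes, stride):
    """Argument vector for creating an ext3 filesystem with a given stride.

    The command is only built here, not executed; the device name is
    supplied by the caller.
    """
    return ["mke2fs", "-j", "-b", str(block_bytes),
            "-E", "stride=%d" % stride, device]
```

As noted in the 20070829 entry, the array stripe size cannot be changed automatically, so in practice the chunk size would be set by hand on the controller and only the block size and stride swept by the script.
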

F-38 Low NGHW Benchmark Evaluation Closed 20080114 Compile and run aiostress and fiobench for evaluation as to whether these will advance our testing methodology. Also, look at SPC benchmarks from Storage Performance Group.

04/07: Evaluate xdd for benchmarking.

10/07: xdd compiled up and ready to be benchmarked when time allows

20070723: Investigating the HPC Challenge benchmark suite and the SPC Storage benchmark from the Storage Performance Council.

20070814: Need to review testing strategy especially for 64 bit operation. xdd used by DDN for testing.

20080114: Closed and amalgamated into F-21

F-48 Medium NGHW Filesystem performance Closed 20080114 Investigate filesystem performance with Castor, ZFS, XFS, ext4, etc...

20071218: ext3/ext4 and XFS need testing on 64 bit and 32.

20080114: Closed and amalgamated into F-21

T-122 High RAS Network testing between RAL and Fermi Closed 20080114 Took over network testing from RAS during his absence.

20071024: The network performance between RAL and Fermi is poor compared to between RAL and FZK, where with 2 streams line speed can almost be attained. There are a lot of packet drops on the Broadcom chip and UDP messages being dropped in the network stack. Inserted an Intel card for comparison purposes but there is no major difference other than the TX carrier count is racking up. Fermi are using a non-standard TCP stack. We also see a lot of packet drops using iperf locally with 10 streams.

20071114: Used iptraf to monitor TCP window size. The size is very low on the link to Fermi but OK to IN2P3 on gdss95. Need to run further tests between RAL and Fermi on a system that is solely doing network activity, so that memory is not being used for disk I/O.

20080114: Closed and amalgamated into F-23

T-134 Medium JRHA Repurposing Tier2 disk servers Closed 2008-01-08 We are borrowing a rack of unused Viglen disk servers from the Tier2 to tide us over until our new deliveries are in production, these need recabling and reinstalling from the Tier1 installation system.

2008-01-06: Awaiting cable numbers and confirmation from Chris Brew.

2008-01-08: Confirmation received, cabled and installed.

T-111 High NGHW fsck during booting Closed 20080118 Even if data arrays have been dismounted prior to shutdown, replying "Y" to the request to fsck /, /tmp, /usr and /var during boot also checks the clean data arrays, which is rather time consuming with 10TB. The fsck check during boot appears to force a check of all filesystems in the fstab rather than only those that are dirty. This appears to be different behaviour from previous releases.

20071114: This can be a real issue as filesystems are being checked twice, when there is no need. Looking at changing the fstab entry to prevent checking on booting, if clean.

20071218: Confirmed by testing on csfnfs34 that this is a real problem. Raised a bug with RedHat. This "feature" is going away in RHEL5, so the bug has been closed. Going forward, we need to experiment with mounting the external array either in fstab or manually as in a script in /etc/rc.d/init.d/

20080118: Await RHEL5 and check whether this "issue" is no longer relevant

F-133 Medium Disk server profiling Closed 20080117 James Jackson will need a disk server so that the disk performance can be profiled. The disk server will probably come out of the Tier2 pool of disk servers that have been added to the Tier1s.

20080117: Closed as James Jackson will be using one of the CMS servers.

F-17 High JIT/JRHA Emergency shutdown process Closed 2008-01-21 The emergency shutdown process needs a review with respect to shedding load, stopping the scheduler, and shutting down the disk servers gracefully to prevent fscks on bring up. The process needs to become more graceful in the event of cooling problems.

20/06: no progress due to other issues.

27/06: ditto.

04/07: ditto. JIT will start by making sure all the existing hardware is covered by the emergency shutdown scripts.

20070926: Ongoing

20071010: JRHA has now completed work on APC reconfigs so this can now progress.

2007-11-14: SRPMs for the old system not available so will rebuild the whole system from scratch. Test of completed system scheduled for farm downtime on 03/12/2007.

2008-01-02: Rudimentary system in place that will, at least, shut off the correct APCs.

F-53 High NGHW Nagios Out of Memory Check Closed 2008-01-24 Need Nagios to be able to detect Out of Memory on Castor servers. There is a problem with a usual check as the server becomes so slow that NRPE timeouts will occur. ping is likely to work as the network interface is still up. There is also a danger that the OOM killer will kill a Nagios related process.

27/06: NGHW/JFW to discuss options for OoM detection.

04/07: On reflection, we will need to make a best effort as the OoM killer may kill the Nagios process. It would appear that we need to write a plug-in to alert Nagios, but be aware that lack of reporting may itself indicate an OoM issue.

20070723: Liaise with JFW over the Nagios check

20071114: Believe that we now have an OOM check

20080123: Check for string "Out of memory" in output of dmesg command exists; if nrpe process is killed by OOM killer, check from server will report failure as well

20080124: Bonny confirmed that this check was suitable
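
The check described in the 20080123 entry (searching dmesg output for the string "Out of memory") could be sketched roughly as below. This is not the deployed plug-in. The function names and message wording are assumptions; only the search string and the standard Nagios exit codes are taken as given.

```python
import subprocess

OK, CRITICAL, UNKNOWN = 0, 2, 3  # standard Nagios plug-in exit codes


def classify_dmesg(dmesg_text, needle="Out of memory"):
    """Return (status, message) for a block of dmesg output."""
    hits = [line for line in dmesg_text.splitlines() if needle in line]
    if hits:
        return CRITICAL, "OOM CRITICAL - %d match(es), last: %s" % (
            len(hits), hits[-1].strip())
    return OK, "OOM OK - no '%s' lines in dmesg" % needle


def run_check():
    """Run dmesg and classify its output; UNKNOWN if dmesg cannot be run."""
    try:
        output = subprocess.run(["dmesg"], capture_output=True,
                                text=True, check=True).stdout
    except (OSError, subprocess.CalledProcessError) as exc:
        return UNKNOWN, "OOM UNKNOWN - could not run dmesg: %s" % exc
    return classify_dmesg(output)
```

A wrapper invoked via NRPE would print the message and exit with the returned status. If the nrpe process itself has been killed, the check fails on the server side, which matches the behaviour noted in the 20080123 entry.
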

T-76 15335 Low JFW Move introductory documentation to Tier1 wiki and update it Closed 6/2/2008 Current user documentation is a) inaccessible from off-site, b) very out of date

3/10: Liaising with Catalin as he is also working on user documentation

6/2/2008: Closed as a) most user access will be stopped for Tier1, b) Services Team have a similar task (assigned to Catalin)

T-114 20792 High JIT gmetad restart problem Closed 23/1/2008 If gmetad is restarted when it has fallen over, it clobbers the latest ganglia data.

20071010: Working on checking for running process before rsyncing the RAM to disk.

2007-11-14: No further progress. Workaround is to check if gmetad is running and if it is not, use service gmetad start and not service gmetad restart.

2008-01-23: New version of the ganglia-tmpfs RPM (1.1-1) released and put on touch to fix this problem. It also updates the size of the RAM disk that is created. Closed.
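
The interim workaround from 2007-11-14 (use start rather than restart when gmetad is not running) can be expressed as a small helper. This is a sketch only: detecting the process with `pgrep -x` is an assumption, and the fix actually shipped was the updated ganglia-tmpfs RPM.

```python
import subprocess


def gmetad_is_running():
    """True if a process named exactly 'gmetad' exists.

    pgrep exits with status 0 when at least one process matches.
    """
    result = subprocess.run(["pgrep", "-x", "gmetad"],
                            stdout=subprocess.DEVNULL)
    return result.returncode == 0


def recovery_command(running):
    """Command to bring gmetad back, or None if it is already running.

    'start' is used rather than 'restart' because, per the workaround,
    a restart after gmetad has fallen over clobbers the latest data.
    """
    if running:
        return None
    return ["service", "gmetad", "start"]
```

A caller would pass the result of `gmetad_is_running()` to `recovery_command()` and execute the returned command, if any.
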

I-136 22847 22891 High JFW AFS performance problems Closed 06/2/2008 Many users experienced serious performance problems accessing AFS files.

18/1/08: First reported problem from Babar (late Friday evening)

21/1/08: First investigations show no problem for access from Tier1

28/1/08: More problems reported by PPD, including comment that access had been poor for more than 1 week

29/1/08: Noted that AFS server log files were large; corrected.

29/1/08: Problems with access to UKNF web pages (hosted from AFS) reported

30/1/08: Babar reported continuing problems

31/1/08: Noted that AFS log files had a large number of callback failures for sites on a specific external (class B) network. This problem is documented in the OpenAFS mailing list archive (www.openafs.org/pipermail/openafs-info/2006-April/022224.html) as causing long delays in accessing AFS files. Message sent to OpenAFS mailing list asking for suggestions (one specific reply suggested upgrading server software)

31/1/08: Martin reported continuing problems in EGEE broadcast message

31/1/08: Bristol reported similar problems with their AFS cell (which is using an even older version of the server software)

31/1/08: Message sent to helpdesk at site of external network that is causing problems (probably due to firewall)

1/2/08: Response from external network saying that they are not happy to open firewall to all sites for AFS ports (7000-7009). Replied suggesting opening ports for RAL AFS server only

5/2/08: As there had been no response from external network, requested emergency site firewall change to block all traffic from that network. This did not work as there is a much higher permissive rule allowing incoming traffic from all sites. Martin added switch to network path to AFS server with firewall block for traffic from offending external network; AFS response returned to normal within a few minutes

6/2/08: Checked with users who had reported problems; all is well again. Problem closed with this temporary fix; permanent fix will be replacement hardware to run latest AFS server code

T-80 Medium JFW nagios test for fsprobe Closed 13/2/2008 Need to investigate a Nagios test to ascertain whether the process is there, whether the file is still there and whether it is being written to.

2007-07-23: Liaise with JFW

2007-12-18: No progress

2008-02-12: Test for fsprobe process added
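
Only the process test is recorded as having been added. For illustration, the three conditions in the T-80 description (process present, file present, file still being written to) could be combined along the following lines. The function name, messages and the one-hour threshold are assumptions.

```python
OK, CRITICAL = 0, 2  # Nagios plug-in exit codes


def classify_fsprobe(process_running, log_mtime, now, max_age_seconds=3600):
    """Return (status, message) for the fsprobe check.

    log_mtime is the modification time of the fsprobe output file in
    seconds since the epoch, or None if the file does not exist.
    """
    if not process_running:
        return CRITICAL, "fsprobe CRITICAL - process not running"
    if log_mtime is None:
        return CRITICAL, "fsprobe CRITICAL - log file missing"
    age = now - log_mtime
    if age > max_age_seconds:
        return CRITICAL, "fsprobe CRITICAL - log not written for %d s" % age
    return OK, "fsprobe OK - log written %d s ago" % age
```

The inputs could be gathered with a process lookup, `os.path.getmtime()` and `time.time()`; keeping the decision logic separate makes it easy to exercise without a running fsprobe.
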

P-109 High JFW Stability of Nagios master server (nincom) Closed 13/2/2008 The Nagios master server (nincom) is now much more stable, but has crashed several times with out of memory errors since (but a lot less often).

2007-09-08: Added 4GB file /tmp/swap1 as a file swap area to see if more swap space helps

2007-09-19: nincom has not crashed since 6th September

2007-10-10: nincom had to be restarted on 1st October and 9th October to solve OOM problems

2007-12-18: Many reboots for OOM problems (some after kernel panic); checked Ganglia statistics, which suggested the problems restarted shortly after a major RPM update on 2007-10-23.

2008-01-02: Updated to latest kernel (2.6.9-67.0.1) to see if this helps

2008-01-03: nincom required reboot after OOM problems killed some processes; updated from SL 4.2 to SL 4.4, 80 new RPMS required, so rebooted.

2008-01-09: nincom required reboots on 07/01 and 08/01 (twice). Updated configuration to increase memory buffer slots available and restarted nincom (08/01 and 09/01)

2008-01-23: nincom failed on 21/1, 22/1 and 23/1 with OOM problems. Configuration updated again to increase memory buffer slots

2008-01-30: no reboots now for a week

2008-02-13: stability much improved, still very occasional crashes due to OOM problems

T-68 Low JIT Improve RPM procedure documentation Closed 2008-02-20 Linked to T-113 (signing local RPMs)

2008-02-20: Closed. Tracked in T-113

F-11 Low NGHW SNORT Closed 2008-02-20 Investigate whether the Tier1 should be running SNORT

2008-02-20: Merged into T-139.

F-27 Medium NGHW Tripwire Closed 2008-02-20 Tripwire or replacement thereof (aide).

2008-02-20: Merged into T-139.

F-28 Medium NGHW Checksumming critical files Closed 2008-02-20 Cron job to check system files on a daily basis to ensure that no unforeseen changes have occurred. Related to F-27.

2008-02-20: Merged into T-139.
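
No implementation is recorded for F-28 before it was merged into T-139. As a hedged sketch of what a daily checksum job might do, the following compares SHA-256 digests of a list of files against a stored baseline. The function names and the choice of SHA-256 are assumptions.

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(paths):
    """Map each path to its digest, or to None if it cannot be read."""
    result = {}
    for path in paths:
        try:
            result[path] = file_digest(path)
        except OSError:
            result[path] = None
    return result


def changed_files(baseline, current):
    """Sorted paths whose digest differs, including added or removed files."""
    return sorted(path for path in set(baseline) | set(current)
                  if baseline.get(path) != current.get(path))
```

A cron job would load yesterday's snapshot, take a new one, and report whatever `changed_files()` returns. Tools such as Tripwire or aide (see F-27) cover the same ground more thoroughly.
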

F-40 Medium NGHW NFS ports Closed 2008-02-20 It would be beneficial to pin NFS ports to a defined range. Also beneficial on touch.

2008-02-20: Merged into T-139.

T-131 Medium JIT OS update Closed 2008-02-13 csfnfs46 and 47 have been upgraded to SL4. csfnfs48 had a problem shutting down, which was traced to the BIOS battery. New batteries(CR2032) are being purchased for the upgrade to csfnfs48 through csfnfs65 except gdss51, which already is on SL4.

20080213: Obsolete. Compusys 2004 servers being updated with SL4/dCache configuration for migration of data from dCache to Castor. Closed.

P-127 Medium JIT Machine Room Walkthrough procedure Closed 2008-02-26 Develop a procedure for a machine room walkthrough which can be incorporated into the Admin on Duty process.

2008-01-09: Owner changed to JIT. No progress.

2008-02-13: Took notes during a walkthrough with Nick on Thursday morning.

2008-02-20: Still needs typing up.

2008-02-26: Done, see RAL_Tier1_Machine_Room_Walkthrough

T-73 High JIT Nagios tests for Viglen disk servers utilising tw_cli Closed 2008-04-02 Depends on T-65.

2007-06-27: T-65 complete so work starting on this soon.

2007-10-10 Required for F-61 - changed to High.

2007-11-29: Progress. Using a combination of tw_cli and 3dm2

2008-02-20: Working on it this week.

2008-03-26: Test works. I am testing it on gdss153 and on machines which get disk failures. Will give it to Jonathan to include in next plugins release.

2008-04-02: Testing showed no problems with the test so check_disk_3ware.sh has been checked into CVS.

T-120 High JIT Update chkrootkit RPMs to latest version. Closed 2008-04-03 2007-11-14: Changed to high priority.

2008-01-09: Latest chkrootkit source downloaded.

2008-02-13: Looked at installing RPM from Dag/AT but we have to rebuild it anyway as we have a source patch for chkrootkit to prevent it searching /afs.

2008-02-20: AFS patch located and source downloaded. Need to build the new RPM still.

2008-04-03: RPM built and added to repos on touch. Closed.

F-39 Low NGHW SL 4.4 install for Areca Closed 2008-05-21 The installation for Areca based machines needs updating to SL4.4. Currently, the installation process is based on SL4.2. A device driver disk needs building based on the tftp kernel in SL4.4.

27/06: Medium -> low. Now concentrate on SL4.5.

20071114: RHEL 5.1 released during Nov 2007 includes Areca driver

20080213: Issues with SL5 release kernels for NFS performance.

20080521: Merged with T-130. Closed.

T-99 High Data Integrity Closed 2008-05-21 CERN run memtester as a tool to identify memory corruption issues, which may translate through to filesystem corruption. fsprobe is currently running on all SL4 disk servers to identify filesystem corruption. Need to package up memtester with a low nice value. Understand what it does and what it gives us and whether we need a Nagios test for it and what is the scope of the test.

20070829: Discovered that there has been a new version of memtester. The bug fixed in the new version (4.0.7) is in the area of "failing to lock memory". Downloaded the new version and compiled it. Initial testing looking good on gdss86. Emailed Tim Bell at CERN regarding how they were running memtester (with which parameters and how much memory grabbed each time). However, it may be that between fsprobe and the old version of memtester (4.0.6), memory could be getting corrupted by the failure of memtester to lock the memory properly. Tim Bell is out of the office currently. No more action until version and running parameters resolved.

20070906: Tim Bell from CERN emailed procedure for how they run memtester. They also use ipmi sel logs and edac to monitor ECC errors.

20080521: Closed.

P-121 High JFW Nagios tests on Castor servers do not run reliably Closed 2008-05-21 2007-11-23: Increased timeout for some checks on Castor servers to 4 minutes (servers under heavy load)

2008-01-16: Evaluating a dual-card RAID set-up on gdss170, have shown no problems with time-outs so far, this configuration also solves the 250G OS array rebuilding onto the 500G hot-spare issue.

2008-02-13: Ordering extra 3ware cards for 86 x Viglen 2006 servers.

2008-05-21: Cards arrived, to be fitted. Currently no problems with test timeouts.

20071024: gdss138 has hit an issue with data integrity. Liaised with Peter Kelemen. Need to progress further with the issue on gdss138 by validating data, array and also with memtester, with parameters as defined by Tim Bell.

20071114: Ran an array verify, which picked up and fixed a single issue. Now running a checksum on the data files to check whether any holes have occurred.

20071218: The checksum went fine.

20080521: Closed.

T-118 21313 Medium JIT SL4 frontends Closed 2008-12-05 Meeting arranged with Matt for Monday morning.

2007-11-14: Initial kickstart done. Matt, Derek and Martin have provided changes that need making.

2007-11-23: RPMs for extra repos built. Small discussion re protection of repositories.

2008-01-02: Latest incarnation of kickstart installed on lcg0607 (see RT). Awaiting testing by grid services team.

2008-01-30: Fixed 17 problems during install. Catalin had a look and couldn't run some commands but he wasn't sure that they should work on SL4. Needs checking over by Matt/Derek.

2008-02-13: Only one issue left with this which is that it pulls torque/maui packages in from gLite repo instead of local torque repo.

2008-02-20: Install now pulls in correct versions of torque/maui but still offered as an update from gLite repo unless specifically excluded. DAG offers updates to a few packages which we may not want either.

2008-03-19: Turns out not to be correct versions of torque/maui. New build needed for SL4.

2008-06-18: Torque/maui build now available (built by MJB).

2008-12-05: Closed.

T-130 Medium JIT/JRHA Kernel build and SL 4.6 build for areca Closed. 2008-12-05 New SL4 kernel released. Need to build, insert Adaptec and Areca drivers and configure.

[Problem consolidation] F-39: The installation for Areca based machines needs updating to SL4.4. Currently, the installation process is based on SL4.2. A device driver disk needs building based on the tftp kernel in SL4.4.

27/06: Medium -> low. Now concentrate on SL4.5.

20071114: RHEL 5.1 released during Nov 2007 includes Areca driver

20080213: Issues with SL5 release kernels for NFS performance.

2008-06-18: Latest SL4 update kernel has Areca driver built in. [SL4.6 install kernel too?] Latest kernel built for Adaptec machines. Both kernels need testing. Will set up and test SL4.6 kickstart on Areca machines and test Adaptec kernel.

2008-12-05: Closed

F-135 High JIT tw_cli settings during install Closed. 2008-12-05 Script controller setup during kickstart.

2008-02-13: Looked at this today. Need to decide what settings we want to set, maybe need to wait for James Jackson's tests.

2008-05-28: Emailed James Jackson this morning to see if he has any particular recommendations.

2008-12-05: Closed

F-138 Medium Create 64-bit system build node Closed. 2008-12-05 Closed
T-142 Medium JIT Do background verifies affect performance? Closed 2008-12-05 [See also F-34]

2008-06-18: There is substantial impact on performance when running verifies, even on lowest priority:

  • Impact of I/O on verify times:
    • Verify without IOzone running: 4 hours, 28 minutes
    • Verify with IOzone running: 2 days, 11 hours, 31 minutes; that's about 13 times longer.
  • Impact of verifies on IOzone performance (percent decrease in I/O rate):
    • Initial write: 6.4%
    • Rewrite: 4.9%
    • Read: 13.9%
    • Re-read: 14.8%
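
The slowdown factor quoted for verify times can be reproduced from the two durations in the list above. This is a quick arithmetic check, not part of the original measurements.

```python
from datetime import timedelta

# Verify durations as reported in the 2008-06-18 entry
idle = timedelta(hours=4, minutes=28)               # without IOzone running
loaded = timedelta(days=2, hours=11, minutes=31)    # with IOzone running

ratio = loaded / idle  # dividing two timedeltas gives a plain float
print(round(ratio, 1))  # prints 13.3
```
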

We need to try out James Jackson's tests to see if there's the same impact on performance when running more CASTOR-like load.

2008-12-05: Closed.