RAL Tier1 Fabric Team Closed Items 2008
2008 closed items from RAL Tier1 Fabric Team.
Action ID | RT Number | Priority | Owner | Action Title | Target date | Status | Date closed | Notes |
---|---|---|---|---|---|---|---|---|
T-64 | 19384 | High | JIT,JRHA | Benchmarking of loan Viglen equipment and comparison with production servers | 06/07/07 | Closed | 2008-01-02 | Additional system drives delivered. Writing plan, complete by Fri 15/06/07.
20070613: James diverted to Castor Systems work. 20070620: Networking now available; JIT aims to start testing by close of Friday. 20070626: SL4.4 installation using the driver from the 3ware website and a manually updated initrd. 20070627: Testing in progress. Need a standard Viglen server for comparison. 20070704: Tests for RAID6/i386/ext3 done. RAID5 in progress. Got a 2006 server (gdss172) for comparison. 20070710: Tests for RAID5/i386/ext3 done. x86_64 install to be started, as well as tests on current production kit (gdss170). 20070718: Some tests complete on gdss170/172. NFS test still to be done. James T to show James A how to run the tests. Graphs for 512MB and 16GB files plotted. 20070802: Local tests completed on all machines. 20070926: Report writing in progress. 20071010: Graphs mostly created. 20071024: No progress last week due to A.O.D. and a broken Linux machine. Using JRHA's machine now. 2007-11-14: Andrew has suggested some more tests and would like the graphs plotted differently. 2007-11-23: Kit packed up for return to Viglen. 2007-11-29: Returned to Viglen. |
T-95 | | Medium | JIT,JRHA | Test DDN kit | | Closed | 2008-01-02 | Disk/load testing.
20070808: Worked with Mark Adams from DDN to run some preliminary benchmarks after he spent Monday configuring the systems for our needs. First indications are promising; we will start to run more in-depth benchmarks when JIT returns from leave. 20070813: Remote access server installed, awaiting external connection. 20070824: Ran some iozone tests (results in /root/DDNtest2007/results/ on the machines). Some of the test machines dropped out before the end of the tests, but that might just have been my SSH connection. 20070919: Need to move external access to standard network (MJB). 20070926: Some results sent to DDN. Starting NFS tests. 20071010: Collaborating with DDN to run tests; they have now run a few themselves as well. 20071024: No progress last week due to James T. being A.O.D. gdss170 (the NFS client) is broken at the moment. 2007-11-14: Test servers and access node blanked and powered off today. Report on what has been done required. 2007-11-23: Kit dismantled and ready for collection. Only the Dell access node remains to be packed. |
T-117 | | High | JIT | SL4 workers missing some packages | 2007-10-03 | Closed | 2008-01-02 | List of missing packages generated.
2007-10-10: Need to install CASTOR client and CERN libs on SL4 workers. |
T-123 | | Low | JIT | Script to replace IP addresses with names in text files | 2007-12-31 | Closed | 2008-01-02 | Script works but needs tidying up. Script is now adequate. |
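The T-123 script itself is not recorded here; a minimal sketch of the approach (reverse-resolving each dotted-quad via DNS and leaving unresolvable addresses untouched) might look like the following. The regex and function names are illustrative, not the actual script:

```python
import re
import socket

# Matches dotted-quad IPv4 addresses (illustrative, not the actual T-123 pattern).
IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def resolve(ip):
    """Return the DNS name for an IP, or the IP unchanged if the lookup fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        # Covers socket.herror/socket.gaierror: no PTR record, no DNS, etc.
        return ip

def replace_ips(text):
    """Replace every IPv4 address in text with its resolved host name."""
    return IP_RE.sub(lambda m: resolve(m.group(0)), text)
```

Run over each log file's contents, this substitutes names in place while passing through anything the resolver cannot identify.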
F-13 | | Medium | MJB | Compusys 2004 maintenance | | Closed | 2008-01-02 | Investigate cost of extending the maintenance on the Compusys 2004 systems and arrays.
20070620: No progress. 20070627: No progress. JA, MJB to meet to discuss. 20070704: Meeting scheduled for tomorrow. 20071010: Arrange another meeting to discuss. 20080102: Closed - no point in getting paid maintenance. MJB and JRHA will discuss further spares acquisitions - cf. F-15. |
F-29 | | Medium | NGHW | Upgrade OS | | Closed | 2008-01-07 | Upgrade OS on 2003 and 2004 disk hardware. We already have one system of each upgraded. We have encountered problems with write-back caching on the 2004 hardware, which is resolved by changing the cache to write-through, but at a performance cost.
04/07: New kernel in progress with updated aic drivers from Adaptec's website. This will be our first kernel with drivers from Adaptec; previous drivers have been from Justin Gibbs, the original maintainer. However, this should afford us a forward maintenance route. 10/07: Latest Areca driver built into the kernel. Unfortunately, the instructions on incorporation leave a little to be desired: they allow the driver to be built, but the help information is missing from the kernel configuration file. 20070723: Investigating whether we should use the Red Hat or Broadcom version of the tg3 driver. Reputedly, the Broadcom version is more performant. 20070925: New kernel base released. New base needs to be 55.0.6. 20071010: New base of 55.0.9 installed. Kernel compiled after much trauma due to a compilation issue, resolved by installing and compiling 55.0.2. There is a future kernel in the pipeline due to a security alert in mountd; this will have to be our new base when it is released. The .config file for the new base has been rationalized with the deployed .config file line by line. There are a few more issues to be resolved though. 20071024: Moving ahead well. Very few lines in the config file are now unexplained. 20071114: New kernel built and tested on csfnfs34, csfnfs65 and gdss86. No problems found. However, the kernel base has moved forward, so we need to remake the kernel with the latest release and smoke test it. 20071218: New kernel smoke tested fine. Implemented on test servers. Deployed on 3rd Dec to csfnfs33 through csfnfs42, gdss43, csfnfs46 and csfnfs47, as well as gdss51 and all 21 of the Areca-based servers (gdss66 through gdss86 inclusive). csfnfs48 and onwards need to be upgraded once we have sufficient spares to ensure hardware recovery. 20071218: New kernel released on 3rd December. Need new port. 20080107: Closed. |
F-31 | | Low | NGHW | UKQCD transition | | Closed | 2008-01-07 | UKQCD currently run on their own grid. Ultimately, they will need transitioning onto the main LCG farm.
27/06: UKQCD not planning to move until early 2008. 20080107: Transferred this issue to the Grid Services team. |
F-42 | | Low | NGHW | IFT meeting | | Closed | 2008-01-07 | Arrange IFT meeting to discuss their technologies.
20080107: Closed. |
T-119 | 21246 | Medium | JRHA | New Castor servers requiring network cables | 2007-10-12 | Closed | 2007-12-10 |
4 tape servers in rack C: ctsc16.ads.rl.ac.uk - 5510b port 32, ctsc17.ads.rl.ac.uk - 5510b port 33, ctsc18.ads.rl.ac.uk - 5510b port 34, ctsc19.ads.rl.ac.uk - 5510b port 35. 3 services machines in rack B: ccsb11.ads.rl.ac.uk - 5510b port 36, ccsb12.ads.rl.ac.uk - 5510b port 37, cdb13.ads.rl.ac.uk - 5510b port 38. 2007-10-16: Cabled machines with reclaimed blue cables, awaiting cable numbers for labelling from networking. 2007-10-24: Still awaiting cable numbers. 2007-12-10: Numbered cables. |
F-35 | | Low | JRHA | APC move | 2007-12-03 | Closed | 2007-12-03 | Currently, 3 APCs for the 2003 hardware are powering the 2002 hardware. These should ideally be swapped round at an appropriate time (a power failure or scheduled shutdown) and their configuration changed.
2007-11-14: Added to list of things to do when we 'stop' for the update to csfnfs02 (RAL Tier1 Farm Shutdown). 2007-12-03: Move completed during shutdown; one PDU died but has been left dead as it only affects eight of the oldest batch workers. |
F-79 | | High | MJB | VO software server for CMS | 2007-06-29 | Closed | 2008-01-02 |
2007-06-20: lcg0614 selected and disks replaced with 500GB Hitachi units for software RAID1. 2007-07-04: No progress. 2007-10-03: Waiting for CMS to install their software stack from scratch. 2007-10-24: System available as server; progress now pending CMS testing. 20080102: System went into production last autumn. |
F-30 | | Medium | NGHW | csfnfs58 documentation | | Closed | 2008-01-14 | csfnfs58 holds many small partitions. The array partitioning, fdisk layout, filesystem parameters, directory mount points and permissions need documenting and scripting so that rebuilding the server or upgrading the OS is facilitated. Backups are in place for those data partitions that need them.
20070925: CMS agitating to use new disk and server via VO machine (lcg0614). This will free up space on csfnfs58. 20071218: Since the shutdown, all space on csfnfs58 is available. 20080107: Need to allocate 100GB for MICE. 20080114: 100GB allocated to MICE - /exportstage/datafs-sdb3. |
F-51 | | Low | NGHW | RAIDWATCH on Compusys 2004 | | Closed | 2008-01-14 | Implement RAIDWATCH on IFT arrays. RAIDWATCH only scales to 7 machines. HPC have experience of installing it. It would be best to have a RAIDWATCH server per rack.
27/06: Medium -> Low. 20080114: Closing issue as RAIDWATCH is not a priority. |
F-52 | | Low | NGHW | Resolver explanation for Castor | | Closed | 2008-01-14 | Explain the process that the resolver uses to find a host: the order of preference depending on /etc/hosts, /etc/nsswitch.conf, /etc/host.conf, etc.
20071218: This is intertwined with nsswitch.conf, whose man page, Red Hat tips, etc. all disagree. 20080114: Closing issue. |
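For reference, on glibc-based systems of this era the lookup order is driven by the `hosts` line in /etc/nsswitch.conf, with /etc/hosts entries taking effect when `files` precedes `dns` (the `order` keyword in /etc/host.conf applies to the older resolver library rather than the NSS lookup path). An illustrative entry, not taken from the Tier1 configuration:

```
# /etc/nsswitch.conf (excerpt) - sources are tried left to right
hosts:      files dns
```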
T-82 | | Medium | NGHW | Need to update csfnfs58 functionality and VO space for MICE | | Closed | 2008-01-14 | csfnfs58 gives small VOs disk space and provides legacy repositories as well as a repository for LCG software. /etc/mtab has 77 entries, and when mounting a new partition the mount can hang, with the result that the system needs to be power cycled for the new mount to take effect and all filesystems then need to be fsck'd. MICE need some space, and we are no longer able to give space to small VOs on the fly without being at risk. We need to migrate csfnfs58 functionality towards newer hardware and OS so that new VOs can be added without that risk.
2007-07-04: Still outstanding, raised at Monday meeting. 2007-09-25: No further progress. 2007-10-10: Waiting for the go-ahead for a farm stop from CMS (for nfs02) to include extra partitions and mount the LCG software area with no_root_squash for DR. 2007-11-14: Waiting on whether a kernel update for CVE-2007-5191 will be released in a timely manner so we can incorporate the update with the farm shutdown. We will need to go ahead with kernel updates anyway. Added NGHW to the Red Hat bug tracking system so should receive notification of the fix. 2007-12-18: During the shutdown, made all partitions available for use. 20080107: Need to allocate 100GB to MICE. 20080114: Allocated 100GB to MICE on /exportstage1/datafs-sdb3. |
T-98 | | High | NGHW | Disk server testing | | Closed | 2008-01-14 | James Jackson from CMS is available for testing disk servers. He will need access to all the CMS Castor servers, gdss86 and gdss171. He will use gdss86 and gdss171 to perform benchmarking of the systems with multiple readers and writers.
20070723: James's keys are on all the above systems. All the above systems have default network tuning and a blockdev of 512 for both the system and data drives. As there was no test program to perform the required tests, James has written one. Installed a C++ compiler on gdss171 so that it can be compiled; it was already installed on gdss86. James has no roadblock to performing the testing. 20070731: James Jackson started testing on gdss171. 20070808: James Jackson now out on leave. 20071010: Discussed with James Jackson. This issue needs to be left open as it is still part of his task list for when he becomes full time at RAL. 20071114: Will check again with James Jackson. 20080114: gdss171 will soon be needed in production. James J has copied his programs off the machine. He will need a server in the future to restart his testing. |
T-106 | | Medium | NGHW | 64-bit and XFS benchmarks | | Closed | 2008-01-14 | Testing 64-bit machines with XFS filesystems.
20080114: Amalgamated into F-21. |
T-104 | | High | NGHW | Stride, stripe and FS blocksize testing | | Closed | 2008-01-14 | Test hardware with different array stripe sizes, filesystem strides and filesystem block sizes.
20070829: Have an initial script, written 3 years ago but never completed, that automatically tests an array by changing the stride size and block size. It cannot change the array stripe size automatically though. The script needs finishing and running. 20080114: Closed and amalgamated into F-21. |
F-38 | | Low | NGHW | Benchmark evaluation | | Closed | 2008-01-14 | Compile and run aiostress and fiobench to evaluate whether these will advance our testing methodology. Also look at the SPC benchmarks from the Storage Performance Council.
04/07: Evaluate xdd for benchmarking. 10/07: xdd compiled up and ready to be benchmarked when time allows. 20070723: Investigating the HPC Challenge benchmark suite and the SPC storage benchmark from the Storage Performance Council. 20070814: Need to review testing strategy, especially for 64-bit operation. xdd is used by DDN for testing. 20080114: Closed and amalgamated into F-21. |
F-48 | | Medium | NGHW | Filesystem performance | | Closed | 2008-01-14 | Investigate filesystem performance with Castor: ZFS, XFS, ext4, etc.
20071218: ext3/ext4 and XFS need testing on both 64-bit and 32-bit. 20080114: Closed and amalgamated into F-21. |
T-122 | | High | RAS | Network testing between RAL and Fermi | | Closed | 2008-01-14 | Took over network testing from RAS during his absence.
20071024: The network performance between RAL and Fermi is poor compared to that between RAL and FZK, where line speed can almost be attained with 2 streams. There are a lot of packet drops on the Broadcom chip and UDP messages being dropped in the network stack. Inserted an Intel card for comparison purposes, but there is no major difference other than the TX carrier count racking up. Fermi are using a non-standard TCP stack. We also see a lot of packet drops using iperf locally with 10 streams. 20071114: Used iptraf to monitor TCP window size. The size is very low on the link to Fermi but OK to IN2P3 on gdss95. Need to run further tests between RAL and Fermi on a system that is solely doing network activity, so that memory is not being used for disk I/O. 20080114: Closed and amalgamated into F-23. |
T-134 | | Medium | JRHA | Repurposing Tier2 disk servers | | Closed | 2008-01-08 | We are borrowing a rack of unused Viglen disk servers from the Tier2 to tide us over until our new deliveries are in production; these need recabling and reinstalling from the Tier1 installation system.
2008-01-06: Awaiting cable numbers and confirmation from Chris Brew. 2008-01-08: Confirmation received; cabled and installed. |
T-111 | | High | NGHW | fsck during booting | | Closed | 2008-01-18 | Even if data arrays have been dismounted prior to shutdown, replying "Y" to the request to fsck /, /tmp, /usr and /var during boot also checks the clean data arrays, which is of course rather time-consuming with 10TB. The fsck check during boot appears to force-check all filesystems in the fstab rather than only those that are dirty. This appears to be different behaviour from previous releases.
20071114: This can be a real issue as filesystems are being checked twice when there is no need. Looking at changing the fstab entry to prevent checking on booting if clean. 20071218: Confirmed by testing on csfnfs34 that this is a real problem. Raised a bug with Red Hat. This "feature" is going away in RHEL5, so the bug has been closed. Going forward, we need to experiment with mounting the external array either in fstab or manually via a script in /etc/rc.d/init.d/. 20080118: Await RHEL5 and check whether this "issue" is no longer relevant. |
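The fstab change discussed above works through the sixth field (the fsck pass number): setting it to 0 excludes a filesystem from boot-time checking entirely, which is one way to keep a clean 10TB data array out of the boot path. Illustrative entries only, not the actual Tier1 fstab:

```
# <device>   <mount point>   <fs>   <options>   <dump>  <fsck pass>
/dev/sda1    /               ext3   defaults    1       1
/dev/sdb1    /exportstage    ext3   defaults    0       0   # data array: never fsck'd at boot
```

The trade-off, as the 20071218 entry implies, is that a pass number of 0 means the array is never checked at boot even when dirty, so it must then be fsck'd and mounted by a script instead.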
F-133 | | Medium | | Disk server profiling | | Closed | 2008-01-17 | James Jackson will need a disk server so that the disk performance can be profiled. The disk server will probably come out of the Tier2 pool of disk servers that have been added to the Tier1.
20080117: Closed as James Jackson will be using one of the CMS servers. |
F-17 | | High | JIT/JRHA | Emergency shutdown process | | Closed | 2008-01-21 | The emergency shutdown process needs a review with respect to shedding load, stopping the scheduler, and shutting down the disk servers gracefully to prevent fscks on bring-up. The process needs to become more graceful in the event of cooling problems.
20/06: No progress due to other issues. 27/06: Ditto. 04/07: Ditto. JIT will start by making sure all the existing hardware is covered by the emergency shutdown scripts. 20070926: Ongoing. 20071010: JRHA has now completed work on the APC reconfigurations, so this can now progress. 2007-11-14: SRPMs for the old system are not available, so we will rebuild the whole system from scratch. Test of the completed system scheduled for the farm downtime on 03/12/2007. 2008-01-02: Rudimentary system in place that will, at least, shut off the correct APCs. |
F-53 | | High | NGHW | Nagios out-of-memory check | | Closed | 2008-01-24 | Need Nagios to be able to detect out-of-memory conditions on Castor servers. There is a problem with the usual check, as the server becomes so slow that NRPE timeouts occur. Ping is likely to work as the network interface is still up. There is also a danger that the OOM killer will kill a Nagios-related process.
27/06: NGHW/JFW to discuss options for OOM detection. 04/07: On reflection, we will need to make best efforts, as the OOM killer may kill the Nagios process. It would appear that we need to write a plugin to alert Nagios, but be aware that a lack of reporting may itself indicate an OOM issue. 20070723: Liaise with JFW over the Nagios check. 20071114: Believe that we now have an OOM check. 20080123: A check for the string "Out of memory" in the output of the dmesg command now exists; if the nrpe process is killed by the OOM killer, the check from the server will report failure as well. 20080124: Bonny confirmed that this check was suitable. |
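The check described in the 20080123 entry amounts to a string match on kernel messages. A hedged sketch of that logic (the real NRPE plugin is not reproduced here; function names are illustrative):

```python
import subprocess

def check_oom(dmesg_output):
    """Nagios plugin convention: return 0 (OK) or 2 (CRITICAL) plus a message."""
    if "Out of memory" in dmesg_output:
        return 2, "CRITICAL: out-of-memory events found in dmesg"
    return 0, "OK: no out-of-memory events in dmesg"

def run_check():
    """Collect kernel messages and evaluate them (intended to run under NRPE)."""
    out = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
    return check_oom(out)
```

As the log notes, this is best-effort only: if the OOM killer takes out nrpe itself, the server-side check fails outright, which is itself a usable signal.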
T-76 | 15335 | Low | JFW | Move introductory documentation to Tier1 wiki and update it | | Closed | 2008-02-06 | Current user documentation is a) inaccessible from off-site, b) very out of date.
3/10: Liaising with Catalin as he is also working on user documentation. 6/2/2008: Closed as a) most user access to the Tier1 will be stopped, b) the Services Team have a similar task (assigned to Catalin). |
T-114 | 20792 | High | JIT | gmetad restart problem | | Closed | 2008-01-23 | If gmetad is restarted when it has fallen over, it clobbers the latest Ganglia data.
20071010: Working on checking for a running process before rsyncing the RAM disk to disk. 2007-11-14: No further progress. Workaround is to check if gmetad is running and if it is not, use 2008-01-23: New version of the ganglia-tmpfs RPM (1.1-1) released and put on touch to fix this problem. It also updates the size of the RAM disk that is created. Closed. |
I-136 | 22847, 22891 | High | JFW | AFS performance problems | | Closed | 2008-02-06 | Many users experienced serious performance problems accessing AFS files.
18/1/08: First reported problem from BaBar (late Friday evening). 21/1/08: First investigations show no problem for access from the Tier1. 28/1/08: More problems reported by PPD, including a comment that access had been poor for more than a week. 29/1/08: Noted that AFS server log files were large; corrected. 29/1/08: Problems with access to UKNF web pages (hosted from AFS) reported. 30/1/08: BaBar reported continuing problems. 31/1/08: Noted that AFS log files had a large number of callback failures for sites on a specific external (class B) network. This problem is documented in the OpenAFS mailing list archive (www.openafs.org/pipermail/openafs-info/2006-April/022224.html) as causing long delays in accessing AFS files. Message sent to the OpenAFS mailing list asking for suggestions (one specific reply suggested upgrading the server software). 31/1/08: Martin reported continuing problems in an EGEE broadcast message. 31/1/08: Bristol reported similar problems with their AFS cell (which is using an even older version of the server software). 31/1/08: Message sent to the helpdesk at the site of the external network that is causing problems (probably due to a firewall). 1/2/08: Response from the external network saying that they are not happy to open their firewall to all sites for the AFS ports (7000-7009). Replied suggesting opening the ports for the RAL AFS server only. 5/2/08: As there had been no response from the external network, requested an emergency site firewall change to block all traffic from that network. This did not work, as there is a higher-priority permissive rule allowing incoming traffic from all sites. Martin added a switch to the network path to the AFS server with a firewall block for traffic from the offending external network; AFS response returned to normal within a few minutes. 6/2/08: Checked with users who had reported problems; all is well again. Problem closed with this temporary fix; the permanent fix will be replacement hardware running the latest AFS server code. |
T-80 | | Medium | JFW | Nagios test for fsprobe | | Closed | 2008-02-13 | Need to investigate a Nagios test to ascertain whether the fsprobe process is there, whether its output file is still there and whether it is being written to.
2007-07-23: Liaise with JFW. 2007-12-18: No progress. 2008-02-12: Test for the fsprobe process added. |
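The three conditions T-80 asks about (process present, output file present, file still being written to) can be sketched as a single Nagios-style check. This is an illustration under assumed names (the log file path and function names are not from the original, and the deployed test only covered the process condition):

```python
import os
import subprocess
import time

def fsprobe_ok(process_running, logfile, max_age=600):
    """Evaluate the three T-80 conditions (Nagios convention: 0=OK, 2=CRITICAL):
    the process is there, its output file exists, and it was written to recently."""
    if not process_running:
        return 2, "CRITICAL: fsprobe process not running"
    if not os.path.exists(logfile):
        return 2, "CRITICAL: fsprobe output file missing"
    age = time.time() - os.path.getmtime(logfile)
    if age > max_age:
        return 2, "CRITICAL: fsprobe output is stale (%d seconds old)" % age
    return 0, "OK: fsprobe running and writing output"

def run_check(logfile="/var/log/fsprobe.log"):
    """Wrapper for NRPE: pgrep exits 0 when a matching process exists."""
    running = subprocess.run(["pgrep", "-x", "fsprobe"],
                             capture_output=True).returncode == 0
    return fsprobe_ok(running, logfile)
```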
P-109 | | High | JFW | Stability of Nagios master server (nincom) | | Closed | 2008-02-13 | The Nagios master server (nincom) is now much more stable, but has still crashed several times with out-of-memory errors (although far less often than before).
2007-09-08: Added a 4GB file /tmp/swap1 as a file-backed swap area to see if more swap space helps. 2007-09-19: nincom has not crashed since 6th September. 2007-10-10: nincom had to be restarted on 1st October and 9th October to solve OOM problems. 2007-12-18: Many reboots for OOM problems (some after kernel panics); checked Ganglia statistics, which suggested the problems restarted shortly after a major RPM update on 2007-10-23. 2008-01-02: Updated to the latest kernel (2.6.9-67.0.1) to see if this helps. 2008-01-03: nincom required a reboot after OOM problems killed some processes; updated from SL 4.2 to SL 4.4 (80 new RPMs required), so rebooted. 2008-01-09: nincom required reboots on 07/01 and 08/01 (twice). Updated configuration to increase the memory buffer slots available and restarted nincom (08/01 and 09/01). 2008-01-23: nincom failed on 21/1, 22/1 and 23/1 with OOM problems. Configuration updated again to increase memory buffer slots. 2008-01-30: No reboots now for a week. 2008-02-13: Stability much improved; still very occasional crashes due to OOM problems. |
T-68 | | Low | JIT | Improve RPM procedure documentation | | Closed | 2008-02-20 | Linked to T-113 (signing local RPMs).
2008-02-20: Closed. Tracked in T-113. |
F-11 | | Low | NGHW | SNORT | | Closed | 2008-02-20 | Investigate whether the Tier1 should be running SNORT.
2008-02-20: Merged into T-139. |
F-27 | | Medium | NGHW | Tripwire | | Closed | 2008-02-20 | Tripwire or a replacement thereof (aide).
2008-02-20: Merged into T-139. |
F-28 | | Medium | NGHW | Checksumming critical files | | Closed | 2008-02-20 | Cron job to check system files on a daily basis to ensure that no unforeseen changes have occurred. Related to F-27.
2008-02-20: Merged into T-139. |
F-40 | | Medium | NGHW | NFS ports | | Closed | 2008-02-20 | It would be beneficial to pin NFS ports to a defined range. Also beneficial on touch.
2008-02-20: Merged into T-139. |
T-131 | | Medium | JIT | OS update | | Closed | 2008-02-13 | csfnfs46 and 47 have been upgraded to SL4. csfnfs48 had a problem shutting down, which was traced to the BIOS battery. New batteries (CR2032) are being purchased for the upgrade of csfnfs48 through csfnfs65, except gdss51, which is already on SL4.
20080213: Obsolete. Compusys 2004 servers being updated with an SL4/dCache configuration for migration of data from dCache to Castor. Closed. |
P-127 | | Medium | JIT | Machine room walkthrough procedure | | Closed | 2008-02-26 | Develop a procedure for a machine room walkthrough which can be incorporated into the Admin on Duty process.
2008-01-09: Owner changed to JIT. No progress. 2008-02-13: Took notes during a walkthrough with Nick on Thursday morning. 2008-02-20: Still needs typing up. 2008-02-26: Done, see RAL_Tier1_Machine_Room_Walkthrough. |
T-73 | | High | JIT | Nagios tests for Viglen disk servers utilising tw_cli | | Closed | 2008-04-02 | Depends on T-65.
2007-06-27: T-65 complete, so work starting on this soon. 2007-10-10: Required for F-61 - changed to High. 2007-11-29: Progress. Using a combination of tw_cli and 3dm2. 2008-02-20: Working on it this week. 2008-03-26: Test works. I am testing it on gdss153 and on machines which get disk failures. Will give it to Jonathan to include in the next plugins release. 2008-04-02: Testing showed no problems with the test, so closing. |
T-120 | | High | JIT | Update chkrootkit RPMs to latest version | | Closed | 2008-04-03 | 2007-11-14: Changed to high priority.
2008-01-09: Latest chkrootkit source downloaded. 2008-02-13: Looked at installing the RPM from Dag/AT, but we have to rebuild it anyway as we have a source patch for chkrootkit to prevent it searching /afs. 2008-02-20: AFS patch located and source downloaded. Still need to build the new RPM. 2008-04-03: RPM built and added to repos on touch. Closed. |
F-39 | | Low | NGHW | SL 4.4 install for Areca | | Closed | 2008-05-21 | The installation for Areca-based machines needs updating to SL4.4. Currently, the installation process is based on SL4.2. A device driver disk needs building based on the tftp kernel in SL4.4.
27/06: Medium -> Low. Now concentrate on SL4.5. 20071114: RHEL 5.1, released during Nov 2007, includes the Areca driver. 20080213: Issues with SL5 release kernels for NFS performance. 20080521: Merged with T-130. Closed. |
T-99 | | High | | Data integrity | | Closed | 2008-05-21 | CERN run memtester as a tool to identify memory corruption issues, which may translate through to filesystem corruption. fsprobe is currently running on all SL4 disk servers to identify filesystem corruption. Need to package up memtester with a low nice value, understand what it does and what it gives us, and decide whether we need a Nagios test for it and what the scope of that test would be.
20070829: Discovered that there has been a new version of memtester. The bug fixed in the new version (4.0.7) is in the area of "failing to lock memory". Downloaded the new version and compiled it. Initial testing looking good on gdss86. Emailed Tim Bell at CERN regarding how they run memtester (with which parameters and how much memory is grabbed each time). It may be that between fsprobe and the old version of memtester (4.0.6), memory could be getting corrupted by the failure of memtester to lock the memory properly. Tim Bell is out of the office currently. No more action until the version and running parameters are resolved. 20070906: Tim Bell from CERN emailed the procedure for how they run memtester. They also use IPMI SEL logs and EDAC to monitor ECC errors. 20080521: Closed. |
P-121 | | High | JFW | Nagios tests on Castor servers do not run reliably | | Closed | 2008-05-21 | 2007-11-23: Increased timeout for some checks on Castor servers to 4 minutes (servers under heavy load).
20071024: gdss138 has hit an issue with data integrity. Liaised with Peter Kelemen. Need to progress further with the issue on gdss138 by validating the data and the array, and also with memtester, with parameters as defined by Tim Bell. 20071114: Ran an array verify, which picked up and fixed a single issue. Now running a checksum on the data files to check whether any holes have occurred. 20071218: The checksum went fine. 2008-01-16: Evaluating a dual-card RAID set-up on gdss170; it has shown no problems with time-outs so far. This configuration also solves the issue of the 250G OS array rebuilding onto the 500G hot spare. 2008-02-13: Ordering extra 3ware cards for the 86 Viglen 2006 servers. 2008-05-21: Cards arrived, to be fitted. Currently no problems with test timeouts. Closed. |
T-118 | 21313 | Medium | JIT | SL4 frontends | | Closed | 2008-12-05 | Meeting arranged with Matt for Monday morning.
2007-11-14: Initial kickstart done. Matt, Derek and Martin have provided changes that need making. 2007-11-23: RPMs for extra repos built. Small discussion re protection of repositories. 2008-01-02: Latest incarnation of the kickstart installed on lcg0607 (see RT). Awaiting testing by the Grid Services team. 2008-01-30: Fixed 17 problems during install. Catalin had a look and couldn't run some commands, but he wasn't sure that they should work on SL4. Needs checking over by Matt/Derek. 2008-02-13: Only one issue left: it pulls torque/maui packages in from the gLite repo instead of the local torque repo. 2008-02-20: Install now pulls in the correct versions of torque/maui, but they are still offered as an update from the gLite repo unless specifically excluded. DAG offers updates to a few packages which we may not want either. 2008-03-19: Turns out these are not the correct versions of torque/maui after all. New build needed for SL4. 2008-06-18: torque/maui build now available (built by MJB). 2008-12-05: Closed. |
T-130 | | Medium | JIT/JRHA | Kernel build and SL 4.6 build for Areca | | Closed | 2008-12-05 | New SL4 kernel released. Need to build it, insert the Adaptec and Areca drivers, and configure.
[Problem consolidation] F-39: The installation for Areca-based machines needs updating to SL4.4. Currently, the installation process is based on SL4.2. A device driver disk needs building based on the tftp kernel in SL4.4. 27/06: Medium -> Low. Now concentrate on SL4.5. 20071114: RHEL 5.1, released during Nov 2007, includes the Areca driver. 20080213: Issues with SL5 release kernels for NFS performance. 2008-06-18: Latest SL4 update kernel has the Areca driver built in. [SL4.6 install kernel too?] Latest kernel built for Adaptec machines. Both kernels need testing. Will set up and test the SL4.6 kickstart on Areca machines and test the Adaptec kernel. 2008-12-05: Closed. |
F-135 | | High | JIT | tw_cli settings during install | | Closed | 2008-12-05 | Script the controller setup during kickstart.
2008-02-13: Looked at this today. Need to decide what settings we want to set; maybe need to wait for James Jackson's tests. 2008-05-28: Emailed James Jackson this morning to see if he has any particular recommendations. 2008-12-05: Closed. |
F-138 | | Medium | | Create 64-bit system build node | | Closed | 2008-12-05 | Closed. |
T-142 | | Medium | JIT | Do background verifies affect performance? | | Closed | 2008-12-05 | [See also F-34]
2008-06-18: There is a substantial impact on performance when running verifies, even at the lowest priority. We need to try out James Jackson's tests to see if there is the same impact on performance when running a more CASTOR-like load. 2008-12-05: Closed. |