RAL Tier1 Fabric Team Closed Items 2007


207 closed items from RAL Tier1 Fabric Team.

Each item below lists, in order: Action ID, RT number (where present), priority, owner, action title, target date (where set), status, date closed, and notes.
F-1 High Lex Ganglia 3 rollout 27/06/07 Closed 27/06/07 Rollout and documentation of the Ganglia 3 monitoring framework.

06/06: Workers done. Working on special metrics.

13/06: Metric scripts moved; disk servers in progress.

20/06: Final push to complete migration today. Turning off old server tomorrow. Some more edits to installation systems to configure new Ganglia. Documentation to do.

27/06: Migration complete except for machines in Services_CASTOR and Services_Datastore to which we have no root access. Documentation available on cvs.gridpp.rl.ac.uk in doc/ganglia/ganglia.txt.

R-4 19233 High JFW Pool Accounts Closed 27/06/07 Increase the number of pool accounts available to Grid jobs; accounts for the WLCG experiments are currently being recycled too frequently. 06/06: New accounts for WLCG + Biomed created on WNs. Waiting for more uids from Chris Brew. Waiting for Services Team to configure the CE to use them. 13/06: CE configured for the original extras. More extras being added in light of a survey of uid use by other VOs. Remaining actions passed to the Grid Services Team.
F-5 High NGHW Data Integrity - Fsprobe investigations Closed 23/07/07 The fsprobe tool is used to expose silent data corruptions - we need to test the storage systems at RAL to see if we suffer as CERN do.

06/06: RPM spec file being created for deployment of tools.

13/06: Installed for testing on CV/Areca servers running Castor. 20/06: Testing continues in Nick's absence.

27/06: RPM published by Keleman. Will need to modify to populate and configure on T1 systems.

04/07: This has now morphed into data integrity, as this is more than just fsprobe. It will include memtester and any other tools, especially verification of the 3Ware controllers on the Viglens. Will need help to configure all the Viglens.

10/07: Now installed on gdss125,126,127 and 128. These are all Viglen machines.

20070724: All SL4 based disk servers are running fsprobe.

P-6 18608 Medium JIT Provisioning of SL4 worker with Glite3 workarounds. 05/06/07 Closed 05/06/07 Handed over to Derek. Derek would like a new one with no workarounds as Glite3 for SL4 has just been released.
F-12 Low NGHW Memory in Castor servers Closed 04/07/07 Evaluate whether the memory in the Castor servers should be increased from 4GB to 8GB or more.

04/07: This will be addressed later: with Castor currently at version 2.1.3 and the experiments not using Castor correctly with Phedex, there is no point until the software and users settle down.

F-14 Medium JRHA Spare riser cards for Compusys 2004 Closed Investigate cost of purchasing 5 riser cards for the Compusys 2004 systems. Quote received.
F-36 High JRHA Power Labels 04/07/2007 Closed 28/07/2007 Power Labels (415V phase warnings) need attaching to 2006 hardware.
F-56 Medium JRHA Tier1 rack labels Closed 13/06/07 Tier1 rack labels to be affixed to Tier1 hardware doors.
T-57 High JRHA Mimic Move 20/06/07 Closed 13/06/07 Mimic display needs to be moved to a new server before the old Ganglia server is retired.

20070613 - Installed php-mysql and copied files to new mimic directory, move completed with no problems.

T-63 Medium MJB Create alias for mimic 20/06/07 Closed. 13/06/07 An alias (status.gridpp.rl.ac.uk) needs to be created to point at the new location of the mimic display at "status.gridpp.rl.ac.uk/mimic".
T-65 19278 Medium JIT Roll out tw_cli on Viglen servers. 22/06/07 Closed 25/06/07 RPM in place.

13/06: Check all existing Viglen servers and installation configs.

20/06: Ongoing.

25/06: Complete: installed on all Viglen servers.

P-66 18492 Medium JIT dCache client upgrade on worker nodes 29/06/07 Closed 03/07/07 20/06: Ongoing.
T-67 Low JIT Improve power off scripts on the consoles. 20/07/07 Closed 13/06/07 Duplicate of F-17
P-69 18608 Medium JIT Provisioning of SL4 worker (without Glite3 workarounds). 29/06/07 Closed 28/06/07 After P-6 was completed, Glite3 for SL4 was released, so the ticket was re-opened.

20/06: aiming for some progress later in week.

27/06: second WN installed. Check with Derek and close this action.

28/06: All working. Closed.

T-71 High JIT Castor Provisioning 22/06/07 Closed 04/07/07 Assist with provisioning of Castor instances for ATLAS. This takes highest priority. Record all steps and sequences for use in PXE/kickstart systems.

20/06: KS for OS now OK, but Castor stuff still done by hand.

27/06: Documenting what has been done; need a list of post-kickstart steps from Cheney.

04/07: ATLAS instance working.

F-72 Medium JFW Web sites 15/06/07 Closed 14/06/07 Respond to query from Georgina Brown concerning Web Sites in FBU.
F-77 Critical MJB/JA ADS network reconfiguration 26/06/2007 Closed 26/06/2007 20/06: JA, MJB have installed cables to racks D+E.

JA progressing with remaining cables and cable-to-port plans. On track for switch moves on Tuesday 26/06.

27/06: Successfully completed migration as planned.

T-81 Low JIT Document procedure for migrating a system to software RAID. Closed. 28/06/2007 26/06: Start of document added to CVS.

28/06: Final document committed to CVS.

T-84 Medium JRHA Mimic Nincom Checker 28/06/2007 Closed 28/06/2007 The warning notice on the mimic display rolled over at 59:59 to 00:00. This has now been corrected.
T-87 High NGHW Password and keys change on disk servers 28/06/2007 Closed 28/06/2007 Passwords and keys altered on all disk servers and installation server. Registered in Ops
T-89 Medium NGHW LSF Mount Closed 04/07/2007 The new version of LSF uses a mounted filesystem for job statistics and records the status in its configuration file. When the network has a problem, the disk server client machines hang and it is impossible to log in on the console. Discussed with the Storage group; the options are NFS or AFS (others are non-starters). They will look at using NFS mount options to alleviate the problem.
T-88 Medium JIT RPM for 3dm2 password update Closed 10/07/2007 Update 3dm2 config RPM with new password.

10/07: RPM created and updated on gdss87-172. Closed.

P-92 19911 Medium JIT Correct ganglia SL4 worker config Closed 12/07/2007 lcg0285 is no longer an SL4 worker so define a new data source in ganglia and update the SL4 workers.
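
The data-source change referred to above is of the gmetad.conf kind; a minimal sketch is shown below, assuming gmetad polls a named SL4 worker cluster (the cluster name and replacement hosts are illustrative, not the actual Tier1 configuration):

# /etc/gmetad.conf on the ganglia collector
# lcg0285 removed as a polled node; hypothetical replacements listed instead
data_source "Farm_SL4_Workers" lcg0300 lcg0301
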
P-94 19908 Medium JIT Modification to Ganglia Special Metrics 20/07/2007 Closed 16/07/2007 Catalin has asked for changes to the Services_Grid special metrics.

16/07: Done.

T-96 Medium JFW Create RPM to allow imap to use SSL certificates Closed 18/7/2007 13/7 Fix has been installed by hand on front-ends. Requires an RPM and insertion into the install for standard UI kickstart files.

16/7 RPM created as tier1-fix-imapcert and placed in yum/local/SL{3,4} on touch

18/7 RPM installed on front-ends

F-59 Critical JFW Root passwords on all farm nodes 31/07/07 Closed 20/07/07 This task is to check and set root passwords on all farm nodes as it is not clear that we know them.

27/06: This is now urgent.

29/06: root password changed on disk servers (Nick White)

10/07: root password changed on batch workers

11/07: root password changed on front ends

19/07: root password changed on farm service nodes

20/07: password hashes updated on kickstart files
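
A minimal sketch of how the updated hash is carried in a kickstart file (the hash shown is a placeholder, not a real value):

# Kickstart command section: store only the hash, never the plain-text password
rootpw --iscrypted $1$XXXXXXXX$YYYYYYYYYYYYYYYYYYYYYY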

T-85 19740 High JIT, JRHA Monitoring page 14/07/07 Closed 17/07/07 c.f. RT Ticket.

20070704: Current status display done. Needs work as not working in I.E.

20070706: Display progressing and working in all browsers, accessible at http://status.gridpp.rl.ac.uk/public/

20070710: JIT/JRHA attended a meeting about status screens in eScience. We seem to be progressing OK. Display works in I.E.

20070718: We have met the basic requirements. New feature requests will be new items.

F-58 11873 Low NGHW ntop for disk servers Closed 20070723 Requirement for ntop on the disk servers. To profile network traffic.

04/07: This is still a requirement.

10/07: ntop being evaluated on gdss86. Believe that it will be useful given the information it provides. Need to discuss with RAS.

20070723: ntop now available for SL4 disk servers. ntop is memory hungry and runs its own web server. It is not installed or running by default, so it has to be switched on manually. It does give us a tool to investigate whether issues are local network, disk or WAN related.
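
A minimal sketch of switching ntop on by hand on an SL4 disk server, assuming the package is published as "ntop" in the local yum repository and ships an init script:

# Install from the local repository and start manually (not enabled at boot)
yum install ntop
service ntop start
# Its built-in web interface then listens on ntop's configured port (typically 3000)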

F-91 High NGHW kickstart driver ordering Closed 20070723 kickstart can detect the controllers in an order in which we do not wish to install, which can cause the data drive to be overwritten by the OS installation. The rudimentary method of arranging the order is hardware dependent and can involve either unplugging the fibre to the disk array or removing the controller from the system. Consulted the anaconda source and found the "driverload" parameter. This is used in the kickstart file to specify the controller order. The syntax is as follows:

driverload=driver1:driver2:driver3... e.g. driverload=megaraid_sas:qla2xxx:qla2400 or driverload=3w-xxxx:aic7xxx

20070710 - Need to document the above

20070723: Documented on personal wiki

F-97 Critical NGHW Certificate about to expire on gdss43 20070726 Closed 20070723 Certificates updated on gdss43. Castor was stopped prior to installation of the new certificates and restarted without problems afterwards.
I-1 Critical NGHW bdata-data24 on csfnfs27 failed - 800GB at risk 20070723 Closed 20070730 Unable to access all the data on the array. Tracked issue down to a single file causing the array to go off-line. File moved to lost+found/corrupted so blocks cannot be re-allocated. File to be re-imported by Babar management team. All remaining files checksummed without any problems.
I-2 Critical NGHW hdd of the software system disk mirror failing on csfnfs59 holding bdata-data74,75,76 and 77 20070723 Closed 20070730 The hdd was getting progressively worse: 4 out of the 7 partitions had failed. A replacement drive was taken from csfnfs65 and wiped. The failing drive was then marked as faulty and removed from the software mirror. The system was shut down and the replacement drive installed and partitioned. Finally, the drive was added to the software mirror and rebuilding started automatically; it completed without any issues. The array filesystems were marked as not needing an fsck prior to the shutdown so that one would not occur on power-up. The automatic fsck was reinstated after the replacement.
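
A minimal sketch of the drive-replacement sequence described above, assuming the mirror is managed with mdadm (the md device and partition names are illustrative only):

# Mark the failing member faulty and remove it from the mirror
mdadm /dev/md0 --fail /dev/hdb1
mdadm /dev/md0 --remove /dev/hdb1
# After fitting and partitioning the replacement, add it back; rebuilding starts automatically
mdadm /dev/md0 --add /dev/hdb1
# Monitor the rebuild
cat /proc/mdstat
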
T-103 20489 Medium JRHA Network cables for new ADS equipment 20070815 Closed 20070815 Three extra network cables were required for monitoring ports on two new fibre-channel switches and an array.

Ran 15m cables from 3com 10/100 switch as follows:

Port 17 - RLT9523 - Upper FC Switch
Port 18 - RLT9524 - Lower FC Switch
Port 19 - RLT9525 - Array Unit
T-86 High JRHA Mimic Update 20070718 Closed 20070814 A new version of the mimic display needs to be created to interact with the new nagios 2.9 database format.

20070711: Received details of DB alterations from JFW, started looking into mimic re-write.

20070716: Completed preliminary version, appears to work correctly.

20070808: Awaiting rollout of nagios 2.9.

20070814: Nagios and Mimic switchover successful, minor problems corrected, new mimic version now in use.

T-103 Medium JRHA Memtest86+ 1.70 20070824 Closed 20070824 The PXE boot system has been updated with the latest version of memtest86+ while legacy versions have been removed.
F-60 High JFW Bring new Nagios slave servers into use 22/06/07 Closed 29/08/07 Adding new Nagios slave servers (nagios03/4/5) will allow the testing load on the existing slaves to be reduced and will hopefully eliminate the memory leaks on the master server.

13/06: Configuration of nagios03 under way. Nagios04 rtr. Nagios05 to be installed - possible network issue.

27/06: Progress on 03.

04/07: nagios03 in use, working on nagios04. Possible issues with networking on the blades causing communication hangs. Suggest nagios02 move to a non-blade box.

16/07: 5 batch workers reallocated as Nagios slave servers

19/07: nagios02 reinstalled, configured and brought into service; blade server powered off

25/07: nagios04 brought into use

27/07: nagios05 brought into use to take over some of tests from nagios01

31/07: nincom has hung at least twice after running out of normal memory rather than out of high memory, so the nature of the fault has changed since the hardware was changed.

08/08: Nincom has hung several times this week. dstat shows that all swap had been used in a 5 minute period. A 2GB swap file was added, but by the following evening even all of this had been used, i.e. 6GB of swap over a 15 minute period at 5:30AM. max_concurrent_checks was originally set to 0, which gives unlimited parallel service checks; nincom hung with this set to 2000. The service_reaper_frequency, whose default is 10 but which had been set to 4, was reduced to 2 (see the configuration sketch at the end of this item). There are several reports on the net of Nagios running out of memory, suggesting either a configuration change or a source-code change. Also tried the network tuning and VM tuning in an attempt to prevent the OOM killer kicking in; the system still hung overnight. Will pursue tunables and source-code changes. Moving to the next version of Nagios may be a better and more timely solution than spending further time on this version.

14/08: NGHW- Looked at various tunables and wrote a script for recording what is happening between 5 and 6 in the morning. Those logs show that there are lots of connections in CLOSED_WAIT state. There is no major disk activity by the kernel during that time except swapping. Reviewed nagios configuration file and changed the reaper and concurrent jobs parameters. The concurrent jobs parameter was set as unlimited. This was reduced to 2000 and then to 400. The reaper frequency was dropped from 4 to 2. Two swapfiles were added so that the machine had 8GB of swap. nincom managed to stay up for 6 days with tunables for network and vm. nagios on nincom has been updated to version 2.9 from 2.6. The logs on Wednesday morning show that there is a considerable difference between 2.6 and 2.9. There were no sessions in CLOSE_WAIT between 5 and 6. The memory seems fine and there was no swapping. This problem is being left open for another week and if all is well can be closed.

14/08: Updated to version 2.9 of Nagios server on nincom. Required updates to web page showing farm status as SQL tables had changed (see task T-86). Rebooted nincom to remove all tuning applied above

29/08: Only 1 system crash (on 28/08) due to OOM problems, but not the same as previous crashes. This task is now closed.
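
A minimal sketch of the nagios.cfg settings referred to in the 08/08 and 14/08 notes above, showing the final values (all other directives omitted):

# /etc/nagios/nagios.cfg excerpts (Nagios 2.x)
# 0 means unlimited parallel service checks; reduced in steps to limit memory use
max_concurrent_checks=400
# Interval in seconds at which finished check results are reaped (default 10)
service_reaper_frequency=2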

T-112 20881 Medium JRHA Network cables for new VTL equipment 20070917 Closed 20070917 Three extra network cables were required for ports on two new arrays and a server.

Ran 15m cables from 3com 10/100 switch as follows:

Port 41 - RLT9526 - VTL Server
Port 42 - RLT9527 - Array JS1
Port 43 - RLT9528 - Array JS2
F-20 Medium NGHW UKQCD Closed 20070925 Hand over the UKQCD support interface to the Grid Services Team. 04/07: Started handover to the Services team.

20070925: Grid Services team now interacting directly with UKQCD

F-43 High NGHW Castor Closed 20070925 Assist Castor team especially with OS related issues.

04/07: Tim Folkes' performance statistics revealed that performance on the Viglen servers was not good, and there was also a hang on gdss84. The hang was related to the network tuning; this tuning was for wanIn, but Phedex is currently using it incorrectly. The network tuning across all Castor Areca based machines has been restored to the defaults. On the Castor Viglen servers, the lack of performance (circa 50% of the Areca) is believed to be related to the default readahead. This has been upped to 512, which is the same as on the Areca.
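
A minimal sketch of the readahead change described above, assuming it is applied with blockdev in 512-byte sectors (the device name is illustrative):

# Check the current readahead on the data array, then raise it to 512 sectors
blockdev --getra /dev/sdb
blockdev --setra 512 /dev/sdb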

10/07: The hang on gdss84 revealed that the network tuning from SC06 was present on several of the Areca based servers. This has now been set to the system default values for all the Areca based servers. Unaware of any specific testing to evaluate whether the readahead tuning has increased performance. However, also found that the network tuning is different between the Areca and 3Ware based servers.

20070723: Minimal CMS testing in past week

20070808: Network tuning parameters checked out and ran OK on a test server. The workload on the CMS Castor disk servers was analysed and duplicated via a script, which was then run in several modes with each of the 4 I/O schedulers. Tunables were altered, but at no time were we able to achieve a read performance consistently above 20MB/s with 1 reader and 4 writers running at the same time. Moving the ext3 journal off the data disk did not make any difference, and neither did changing the readahead to 16K. Email sent to the Areca developers, with a response of "expected behaviour". Will need to re-test on 3Ware disk servers. Evidence from DDN indicates that not all is well with how the Linux 2.6 kernel implements the I/O schedulers and readahead with their hardware.

20070814: Network and filesystem tuning implemented on wanout. The ext3 journal is now being used in writeback mode rather than ordered, which gives a speed improvement but will leave us exposed in the event of an unscheduled power-off. Tuning still to be implemented on wanin.
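
A minimal sketch of the journal-mode change described above, assuming it is set through the mount options of the data filesystem (device and mount point are illustrative; changing data= generally needs a full umount/mount rather than a remount):

# /etc/fstab entry using writeback journalling instead of the default ordered mode
/dev/sdb1  /exportstage  ext3  defaults,data=writeback  0 2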

20070829: Tuning installed on all CMS FarmRead machines(gdss71,72,73,74,75 and 84). This tuning is for preservation of low memory and limiting memory for TCP usage. Tuning installed on all CMS WanIn(gdss93, 94, 95, 96, 97 and 98) and CMSWanOut(gdss85, 90, 125, 126,127 and 128) machines. This tuning is for low memory protection and maximizing the TCP stack usage by allocating extra space for buffers.

20070903: Tuning installed on all LHCB machines (gdss69, 70, 79, 121, 122, 123 and 124). This tuning is for preserving low memory and limiting memory for TCP usage.

20070905: Tuning installed on all Atlas machines(gdss80,81,82,83,103 through to and including 120). This tuning is for preserving low memory and maximising the TCP stack usage by allocating extra space for buffers.

20070925: Tuning running. New problem will be opened up for any changes to the existing tuning or if new tuning needs to be applied.
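
The "low memory preservation" and "TCP buffer" tuning recorded in the notes above is applied through sysctl; a minimal sketch of representative /etc/sysctl.conf entries follows (the exact keys and values used on the Tier1 servers are not recorded here, so the figures are illustrative only):

# Keep a reserve of free (low) memory so the OOM killer is less likely to trigger
vm.min_free_kbytes = 65536
# Allow larger TCP send/receive buffers for WAN transfers
net.core.rmem_max = 8388608
net.core.wmem_max = 8388608
net.ipv4.tcp_rmem = 4096 87380 8388608
net.ipv4.tcp_wmem = 4096 65536 8388608
# Load the settings with: sysctl -p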

F-54 Medium NGHW Castor profiling Closed 20070925 Investigate the system resources that Castor is using. Profiling is being run on gdss84 and, along with enabling the VM to dump the processes that are accessing it, should make it possible to figure out where system resources are being used. When the systems have 30 jobs, the machines start paging and swapping until the out-of-memory killer kicks in and kills processes. The number of jobs has been dropped to 20 to prevent the OOM killer kicking in. It looks like the kernel is copying to and from user space predominantly due to network traffic on both eth0 and lo.

04/07: This is still the same with version 2.1.3 of Castor.

10/07: gdss84 on which the profiling is running is not currently being used in the CMS Castor tests.

20070723: Minimal CMS testing in past week

20070731: RAS has suggested 9 theoretical tuning changes, which had not been tested. These have now at least been confirmed not to be detrimental to a system's health; however, neither do they improve the read performance. We now have a minimal test rig using NFS over TCP to generate network traffic into a test server, with a simple "dd" script that writes 4 streams whilst reading 1 stream (see the sketch below). So far no amount of tuning has enabled the read stream to be more performant than the write streams. This is the scenario we are witnessing with the CMS testing: we are in dire danger of filling up the disk while being unable to empty it fast enough. Found further tunables in the source code for the CFQ scheduler, but also looking at testing with other I/O schedulers.
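
A minimal sketch of the kind of "dd" load script described above, with four sequential writers running against one reader (paths, file names and sizes are illustrative):

#!/bin/sh
# Start 4 write streams in the background
for i in 1 2 3 4; do
    dd if=/dev/zero of=/data/write$i bs=1M count=20000 &
done
# Read one previously written large file while the writes run
dd if=/data/bigfile of=/dev/null bs=1M
wait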

20070808: Network tuning parameters checked out and ran OK on a test server. The workload on the CMS Castor disk servers was analysed and duplicated via a script, which was then run in several modes with each of the 4 I/O schedulers. Tunables were altered, but at no time were we able to achieve a read performance consistently above 20MB/s with 1 reader and 4 writers running at the same time. Moving the ext3 journal off the data disk did not make any difference, and neither did changing the readahead to 16K. Email sent to the Areca developers, with a response of "expected behaviour". Will need to re-test on 3Ware disk servers. Evidence from DDN indicates that not all is well with how the Linux 2.6 kernel implements the I/O schedulers and readahead with their hardware.

20070814: Tuning applied to gdss90 in wanout. Need to apply it to wanin. ext3 tuning applied by moving journal to writeback from ordered.

20070829: Tuning installed on all CMS FarmRead machines(gdss71,72,73,74,75 and 84). This tuning is for preservation of low memory and limiting memory for TCP usage. Tuning installed on all CMS WanIn(gdss93, 94, 95, 96, 97 and 98) and CMSWanOut(gdss85, 90, 125, 126,127 and 128) machines. This tuning is for low memory protection and maximizing the TCP stack usage by allocating extra space for buffers.

20070903: Tuning installed on all LHCB machines (gdss69, 70, 79, 121, 122, 123 and 124). This tuning is for preserving low memory and limiting memory for TCP usage.

20070905: Tuning installed on all Atlas machines(gdss80,81,82,83,103 through to and including 120). This tuning is for preserving low memory and maximising the TCP stack usage by allocating extra space for buffers.

20070925: Tuning applied now waiting for next service challenges for review.

T-101 Medium NGHW System Admin XWG Closed 20070925 Cross dept. working group to "identify preferred generic skill set, and how best to train new and existing staff to ensure an adequate and up-to-date skill set"

20070808: Initial meeting. Need to gather thoughts from team.

20070906: We have not been formally approached for inclusion in the working party, and neither has another member of the working group. This is now with the working group's sponsor.

20070925: No formal approach for participation in working group. Issue closed and will be resurrected if and when there is buy in from group leaders.

F-7 Medium JRHA/JIT Soak testing Tier2 equipment 20070630 Closed 20070926 Contract testing of Tier2 disk servers and worker nodes for Chris Brew. To involve Alistair Haig.

20070606: DHCP entries complete, waiting DNS update.

20070613: Disk server now kick-started. Still issues with DHCP for CPU nodes. Testing yet to start.

20070620: KS working for CPUs and Storage servers. Testing to start this pm.

20070627: Testing underway.

20070711: Received order number from PPD.

20070808: Issues with a few nodes have been resolved, testing has restarted on those nodes.

20070815: Disk servers brought out of test and blanked; worker nodes brought out of test; will hand over to T2 by next week.

20070919: One worker node has developed a BIOS issue, will work with Streamline to rectify this.

F-19 Medium NGHW File Systems Working Group Closed 20070926 Participate in the HEPiX FSWG. 06/06: Nick joined FSWG. 27/06: Compiling data for FSWG survey. Will need FT assistance for gathering data. 04/07: Some progress on questionnaire. 10/07: Minor progress on questionnaire

20070925: Questionnaire completed for NFS, AFS, dCache, Castor and xrootd. Need global figures for weekly data rates.

20070926: Weekly data rates estimated from Cacti plots.

T-74 High JRHA CASTOR Hardware Closed 20071003 Work with CASTOR team to resolve issues with their hardware.

20070614: Moved disk array from Rack D to Rack C.

20070702: Replacement parts arranged with OCF and IBM, engineer scheduled today.

20070703: Engineer resolved issues with c15.

20070711: Replacement PSUs arrived from IBM and Artesyn, fitted to c06 and c09.

20070717: Contacted OCF with regards to faulty array parts.

20070802: All outstanding faults have been resolved, bar one missing fan unit on an array.

T-108 High NGHW/JRHA Loss of data array Closed 20071005 On the Areca machines, occasionally one or more of the drives fails to be initialised or recognised in time, so the data array is incomplete when the OS loads. The suggestion is to increase the timeout in grub from 5 to 60 to give added time for the data array to become ready. If the array is not ready, the partition table can appear to have been lost.

20070905: The timeout parameter has been increased from 5 to 60 on all the Areca based machines(gdss66 through to and including gdss86).

20070925: This does not affect Viglen based servers as was first thought. The timeout parameter needs to be included in the Areca kickstart.

20071005: JRHA: Added the necessary fix to the postinstall script to change the grub timeout to 60 seconds.
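
A minimal sketch of the kind of post-install fix described above, assuming the stock SL4 grub.conf with its default timeout of 5 (the path is the SL4 default):

# In the kickstart %post section: give the Areca array longer to come ready
sed -i 's/^timeout=5$/timeout=60/' /boot/grub/grub.conf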

F-90 High JRHA PQQ reference questionnaire Closed 20071005 Run references questionnaire for PQQ respondents.

20070706: Arranged to meet with RAS to discuss procedures.

20070717: Received tender information, started process of issuing questionnaires.

20070808: Received five replies so far, many contacts appear to be on holiday.

20070812: Received more replies, many still outstanding.

20071005: Closed as tenders now closing.

F-10 Medium NGHW Castor memory usage Closed 20071010 Castor is eating memory, resulting in the out-of-memory killer kicking in. This kills processes, which leaves network connections without endpoints; these do not die and the memory remains allocated. rmnode allocates memory dependent on the number of filesystems.

04/07: Castor memory usage has changed since version 2.1.3 so will need to revisit this issue with this new release. We have also reduced the number of filesystems from 5 to 3 on the Areca based machines and increased from 1 to 3 on the Viglens.

10/07: To prevent network hangs all the network tuning for SC06 has been removed as gdss84 hung with low memory exhausted. gdss84 had the SC06 network tuning installed. The number of Castor job slots was decremented to 20. Now that we have some stability, we are considering raising this to 30 since the upgrade to Castor 2.1.3

20070723: Liaising with JFW regarding an OOM check for Nagios

20071010: JFW has an OOM check

F-49 High NGHW FSWG Questionnaire Closed 20071010 Fill out the FSWG questionnaire for AFS, NFS, dCache, Castor and xrootd.

20071010: Questionnaire completed for all access methods.

T-70 High MJB Thumper report 11/05/07 Closed Latest revision of report sent to MJB 31/05/07.

20/06: Report to be sent to Sun.

09/2007: Report sent to Sun.

T-100 Low NGHW Intel Core 2 chip issues Closed 20071010 Investigate the Linux implications of the Intel errata regarding Core 2 chip Bios fixes.

20070808: Nick Baron reviewed the Intel issues and concluded that the risk of a hardware error breaching your security is probably a lot less than that from an errant software error. Further info when the errata have been reviewed.

20071010: Being closed, as this issue is not being treated as a big deal in the wider computing world.

T-102 Medium NGHW Tender disk benchmarking Closed 20071010 Ran "dd" tests on hardware to gauge reasonable figures for the disk tender.

20070814: Initial figures provided. They also appear to reflect the performance that we see on the servers being used for the CMS Castor instance. Tests run on 3ware Viglen servers.

20070906: Need to run the same tests on Areca servers.

20071010: Being closed as the tender has been submitted to vendors

T-116 Medium JIT Investigate configuration management systems such as cfengine and puppet. 30/11/2007 Closed 10/10/2007 Duplicate of F-2
F-10 Medium NGHW Castor memory usage Closed 20071024 Castor is eating memory, resulting in the out-of-memory killer kicking in. This kills processes, which leaves network connections without endpoints; these do not die and the memory remains allocated. rmnode allocates memory dependent on the number of filesystems.

04/07: Castor memory usage has changed since version 2.1.3 so will need to revisit this issue with this new release. We have also reduced the number of filesystems from 5 to 3 on the Areca based machines and increased from 1 to 3 on the Viglens.

10/07: To prevent network hangs all the network tuning for SC06 has been removed as gdss84 hung with low memory exhausted. gdss84 had the SC06 network tuning installed. The number of Castor job slots was decremented to 20. Now that we have some stability, we are considering raising this to 30 since the upgrade to Castor 2.1.3

20070723: Liaising with JFW regarding an OOM check for Nagios

20071024: OOM check available

F-41a Medium NGHW Areca kernel update Closed 20071024 The kernel on Areca based servers needs updating. This is an opportunity to include the latest Areca driver, investigate changes to the Broadcom driver for NAPI, incorporate the relevant Adaptec driver and update the kernel configuration for performance issues such as read streaming. Garner OS kernel requirements from the team, such as extra debug info, where appropriate.

27/06: Consider Areca and Adaptec Drivers for SL4.5 kernels too.

04/07: New build in progress. There is no Areca driver in the 2.6.9-55.0.2 kernel.

10/07: Areca version 1.20.0X.13-61107 being used. The Areca web site also has 1.20.0X.13, which is a lower revision. However, 1.20.0X-61107 is an unusual Areca version number and might change. The aic79xx driver is at version 2.0.26, which should recognise the Adaptec cards in the Compusys 2004 hardware, and the aic7xxx driver is at version 6.3.11. The kernel has been built. Further investigation into the latest Broadcom tg3 driver, as there are suggestions on the web that the Broadcom-provided driver is more performant than the driver included by Red Hat. Evaluating source changes.

20071010: New base of 55.0.9 installed. Kernel compiled after much trauma due to a compilation issue, resolved by installing and compiling 55.0.2. There is a future kernel in the pipeline due to a security alert in mountd; this will have to be our new base when it is released. The .config file for the new base has been rationalised line by line against the deployed .config file. There are a few more issues to be resolved though.

20071024: Most of the information is in F29, so this issue particular to the Areca cards is being closed.

P-103 20009 Medium JIT dCache ganglia problem Closed 20071024 The data sources for the Storage_dCache ganglia cluster cannot see the multicast ganglia info from csfnfs39,42,50,54,56-58,60-64. Checked that they are joining the correct multicast group with netstat -ng. Tried restarting gmond on both the data sources (dcache and dcache-tape) and the csfnfs hosts. Derek suggests that it could be a problem with a switch; it would be the switch at the data source end (Compusys kit in A5 lower). The Viglen systems in the cluster can see the multicast packets.

20071010: Investigating why latest ganglia does not run on RH7.x. Jabbered Lex briefly this morning.

20071024: Fixed itself. Closed.

T-124 High NGHW csfnfs37 system disk Closed 20071024 The system disk on csfnfs37 failed with excessive reallocated sectors. The disk was replaced, and the system re-installed and reconfigured as closely as possible to its state prior to breaking, as we were able to recover the rpm list, chkconfig settings, /etc, /root, etc.

20071024: Closed.

F-18 Medium NGHW/JRHA APC Units Closed Review APC unit configurations.

20070627: (JRHA) Review of setup and consistency required.

20070710: Bumped up to medium.

20070815: Begun audit of current setup.

20070926: 133 and 135 are now testbeds for new firmware, with more to follow; experimental graphing is taking place.

20071003: All PDUs updated to v2.7 firmware; unable to locate an update for the MasterSwitch firmware. 133 & 135 have v3.5 and 115 has v3.7; these are being used as testbeds.

20071009: All PDUs (excluding MasterSwitches) now configured and in load graphing system.

20071010: Report drafted, sent to MJB.

T-107 High JRHA Network Port Speed Nagios Check Closed 2007-10-12 Nagios test for checking that network ports are running at the correct speed.

20071005: JRHA: Took ownership of this issue.

20071010: Learnt the test structure from JFW; began writing the test.

20071012: Test completed, JFW will test and generate new RPM.

F-26 High JRHA APC configuration Closed 09/10/2007 APC configurations need reviewing as changes have been made but are not reflected in the configuration. APC web page needs updating.

20070627: Related to F-18.

20071009: Completed. Updated web-page to cover changes made to setup.

F-50 Medium JIT Viglen 3Ware messages Closed 14/11/2007 Investigate cause of odd error messages from 3Ware controllers on Viglen based hardware.

20070926: Ongoing

2007-11-14: These errors have not been seen for many months despite heavy load on CASTOR servers so closed issue.

P-93 19823 Medium JIT Separate experimental Ganglia monitoring of CASTOR instances 2007-07-20 Closed 14/11/2007 James Jackson has enquired about the possibility of having separate ganglia clusters for each CASTOR instance. There was some objection to this at the Monday meeting but this needs clarifying.

2007-07-18 - This is possible and will be done. Chris Kruk has emailed James a list of servers to go into each instance in Ganglia.

2007-08-15 - Updated list from Chris. Working on it today.

2007-08-24 - LHCb instance moved to Storage_CASTOR_LHCb cluster. ATLAS not being moved yet as they may move some servers between CASTOR and dCache. CMS not done as they are in the middle of testing and they're referring to the ganglia plots.

2007-09-26 - Will check with Chris Kruk that all shuffling done and move rest of the servers to appropriate clusters.

2007-10-10: CMS moved to separate cluster

2007-10-17: ATLAS moved.

2007-10-24: Need to add the service nodes to appropriate clusters. Waiting for info from CASTOR team as to which machines should go into which cluster.

2007-11-14: No word from the CASTOR team on the service nodes. Closed this issue and will open a new one if/when the need for separate service clusters arises.

T-125 High JRHA EM Field Measuring Closed 2007-11-23 Measurement of EM Fields in A5 lower machine room with portable field detector.

2007-11-14: Discussed with GWR, measuring tool will only measure 30-300Hz so can measure mains frequency but little else. Measuring grid has been set at four floor tiles.

2007-11-23: Measurements have been taken, but are essentially useless. All agreed that equipment needed for meaningful measurements would be too expensive to justify, hence this item has been closed.

T-126 High NGHW Farm Shutdown 2007-12-03 Closed 2007-12-03 Farm Shutdown

Items requiring attention: see the RAL Tier1 Farm Shutdown page.

2007-11-14: Need to discuss at Fabric Team meeting. 2007-12-18: Shutdown occurred on Monday 3rd Dec.

F-61 High JFW Error messages for RAID disk systems 29/06/07 Closed 11/12/07 Different manufacturers have different error messages in their disk and RAID system drivers. Nagios needs to know these to be able to detect RAID disk problems.

13/06: Done for SL3/Infotrend.

27/06: No progress.

29/08: Done for SL4/3ware and progress on SL4/Areca

19/09: Completed for SL4/Areca

04/10: Tests started for SL4 Areca and 3ware systems, but quickly revealed problems with the test script and the message files. The tests have been stopped; these problems are easily corrected but need to be done before the tests can start again.

22/10: Test script updated to ensure checks are on full words. Message files updated to remove generic messages

11/12: More generic messages removed; checks re-enabled

P-130 Medium NGHW Remove tripwire config Closed 2007-12-18 Remove antiquated tripwire configuration on older RH7.3 disk servers to reduce farm root mail.

2007-12-18: Done as part of farm shutdown.