RAL Tier1 Fabric Team

From GridPP Wiki
Revision as of 13:05, 1 April 2010 by Jonathan wheeler (Talk | contribs)


The RAL Tier1 Fabric Team look after the hardware and operating systems for RAL Tier1.

Team members, roles and responsibilities

Name Role Responsibility
Martin Bly (MJB) Team Leader Team & Network
James Adams (JRHA) Systems Administrator CPU Farm
Ian Collier (IAC) Senior Systems Administrator Fabric Management System
Tim Folkes (GTF) Senior Systems Administrator Robotics & Tape systems
Kashif Hafeez (KH) Hardware Technician Hardware Support
Cheney Ketley (CFK) Systems Administrator Castor
James Thorne (JIT) Systems Administrator Storage Farm
Jonathan Wheeler (JFW) Systems Administrator Core Services

Tier1 Fabric Team Activities


Action ID prefix
F = From Fabric Team Meeting
I = Major incident resolution
T = Added by Team members or Team Leader
P = Created by other project members
R = Created by UKI ROC/Production Manager

Status
Open = Action has been created
Progress = Action is being worked on
Closed = Action is complete
Rejected = Action is rejected

Closed items

Lists of closed items are now maintained on separate pages by year as this page was getting too long:

Open items

The following list is now extremely out-of-date.

Action ID RT Number Priority Owner Action Title Status Notes
F-2 Medium JIT Review of system configuration tools Progress A comprehensive review of the needs of the Tier1 for provisioning, configuration and configuration management tools.

20070606: Lex was to have conducted this review but is now leaving so won't have the time.

2007-11-14: Puppet testing was started on James's desktop (hagrid) but that's now dead. Need to find somewhere else to test or fix James's desktop.

2007-12-06: Spoke to Guy and Chris in CASTOR group. Looks like it will be possible to use their puppet server.

2008-01-02: Now on JIT's job plan so progress soon.

2008-01-23: James has been following the puppet mailing list and researching other options (cfengine, quattor).

2008-02-13: Reading about the various options in more detail and started to write a preliminary report.

F-3 High JRHA Hardware database Progress Project to create a database of all hardware for asset management and tracking purposes, failure trend analysis etc.

2007-06-06: James has some ideas.

2007-06-20: JA and MJB need to meet and discuss options.

2007-07-04: Meeting scheduled for tomorrow.

2007-07-11: Meeting has been rescheduled for Thursday 19th.

2007-07-19: Prototype system agreed to be built by end of August.

2007-09-19: Prototype built, tentatively in use for test purposes only.

2007-10-10: Resumed work on database now that PDU config is stable. Priority changed to high.

2007-10-24: MySQL database is currently running on topaz. The basic web interface is experimental and should be accessible (on and off) [1]

2007-11-14: Most of web-interface completed.

2007-12-10: Requested IP and Hostname for dedicated server for this and the power monitoring system.

2008-01-08: Still awaiting hostname, then will get the system working on this and allow select fabric team members to evaluate and comment on it.

2008-01-15: thor.gridpp.rl.ac.uk is now up and running on SL5. The cacti instance has been moved off topaz onto thor and upgraded to the latest release. The hardware tracking database has also been moved but is currently broken due to MySQL DB issues.

2008-01-30: Up and running.

2008-02-20: Worked out a few more bugs in the web interface, hardware repair is taking priority at the moment.

2008-03-05: Web interface now mostly functional at above address, have begun backfilling with some historical data.

2008-05-21: Current state needs to be documented and then released for use.

F-8 Medium MJB Migrate UKQCD data to NFS mounted partition. Progress Need to migrate UKQCD off local arrays on csfnfs08 so we can retire the arrays, but the server must continue because we can't migrate the UKQCD grid software to SL4.

06/06: Data copied to gdss132 and /stage/ukqcd-daat1 exported and mounted on csfnfs08. Awaiting contact from the UKQCD project for a time to switch.

20/06: Still waiting for a time slot from UKQCD via Services team.

27/06: ditto.

04/07: Still waiting. However we now need to provision 16TB total for UKQCD. Need to prod Catalin to get involvement from UKQCD.

19/09: Can swap to new disk when we want - just tell UKQCD to stop for a bit.

F-9 Low Nagios Castor Loading Consideration Open CMS currently query Lemon at CERN to regulate the number of jobs CMS submits. Consider whether we should, and can, provide this functionality out of Nagios. This would prevent Castor becoming overloaded, effectively introducing a negative feedback loop into the architecture.

2008-02-13: Castor group investigating Lemon for their needs.
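As an illustration of the negative feedback idea in F-9, the sketch below scales the number of jobs a submitter may send down as a measured load figure approaches a limit. This is only a minimal sketch: the function name, the linear scaling and the notion of a single numeric "load" value are all assumptions for illustration, not how Lemon, Nagios or CMS actually work.

```python
def allowed_submissions(load, load_limit, max_jobs):
    """Return how many new jobs may be submitted at the given load.

    Hypothetical policy: permit max_jobs when the service is idle and
    reduce the allowance linearly to zero as load reaches load_limit.
    """
    if load_limit <= 0:
        raise ValueError("load_limit must be positive")
    # Fraction of capacity still free, clamped so overload gives zero.
    headroom = max(0.0, 1.0 - load / load_limit)
    return int(max_jobs * headroom)


if __name__ == "__main__":
    # Illustrative values only; no real monitoring data is used here.
    for load in (0, 5, 10, 20):
        print(load, allowed_submissions(load, load_limit=10, max_jobs=100))
```

The feedback is negative because a rising load reduces the submission rate, which in turn lets the load fall again.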

F-15 Medium MJB/JRHA Spares for Compusys 2004 Progress Investigate cost of purchasing 5 x riser cards, SCSI cards(39320A), SCSI cables, SATA 250GB drives and increasing the RAM from 4 to 8GB.

2007-06-20: No progress.

2007-06-27: depends on F-13

2007-10-22: MJB purchased ten 160GB disks, five set aside for disk server OS drives, five for worker nodes.

2007-11-14: JRHA Asked for new quote for five and ten riser cards.

2007-12-10: NGHW ordered riser cards, for delivery in 2-3 weeks.

2007-12-18: Need to order CMOS batteries type 2032 as csfnfs48 was unable to shutdown.

2008-01-07: Phoned Compusys on 2 occasions. Left voicemail for Chris Unwin.

2008-01-14: Compusys email address not working so email order not received. Ended up faxing the order through. However, since my GPC card has been shredded and my account frozen, Compusys were unable to process the order even though they have it. James A has now taken the issue over and has provided Compusys with his card details in a phone call to Chris Unwin.

2008-02-20: Riser cards have been purchased and received. SCSI cables, drives and RAM should now be evaluated. Older hardware has varying mixed types of SCSI cables. Spare system drives for these systems have been purchased, and a few original spare data drives remain.

F-16 Low Compusys 2004 performance Open Investigate performance of Compusys 2004 hardware with software RAID0 over 2 physical arrays

20070710: On hold until csfnfs65 has been repaired. Currently, it is required to keep csfnfs59 running.

20071024: New drives sourced. Need to install in csfnfs65.

20071114: New 160GB drive installed, partitioned and mirrored.

F-21 Medium JIT Review current disk test methodologies and benchmarking Progress The disk test methodologies need reviewing with respect to whether the tests are exercising the disks sufficiently and whether the workload is keeping pace with disk and controller technology. This covers both acceptance thrashing and performance measurement.

20070723: Discussion with Viglen regarding their test methodology. They use the HPC Challenge benchmark suite and iozone. Now investigating the SPC benchmark from the Storage Performance Council.

20070903: A possible test harness, used by the Linux kernel testers, is the AutoTest suite from http://test.kernel.org/autotest/

20070904: aiostress and fiobench to be evaluated to see whether they will advance our testing methodology. Also look at the SPC benchmarks from the Storage Performance Council.

20071001: xdd has been compiled and is now available as a test tool.

20070723: Investigating the HPC Challenge benchmark suite and the SPC Storage benchmark from the Storage Performance Council.

20070814: Need to review testing strategy especially for 64 bit operation. xdd used by DDN for testing.

20080114: SPC Storage benchmark is not available to institutions that do not give degrees. Therefore RAL would have to pay money for this benchmark.

20080114: Problem consolidation(T-106) - Test 64 bit machines with XFS filesystems

20080114: Problem consolidation(T-104) - Test hardware with different array stripe size, filesystem stride and filesystem block size.

20080114: Problem consolidation(F-48) - Investigate various filesystems performance with Castor such as ZFS, XFS, ext2, ext3 and ext4 etc...

F-22 Medium JIT/RAS Strategy for disk kernel maintenance Progress Review strategy, garner kernel requests from team and implement kernel build system to automate production of new kernels when updates are available. The system could be used for building tools or debug binaries. 04/07: csflnx356 to be used as build system. 10/07: Use of Xen to provide, build, test and integration platforms also could be used as a repository for RPM builds.

20070829: Initial draft of document almost ready for initial comment and distribution.

20070831: Initial draft sent to Andrew.

20071114: Need to update after review meeting with RAS

2008-06-18: Ongoing. The document is available but I've not heard any more about the "review" of it.

F-23 Low RAS Network Tuning Progress Evaluate whether the LAN and WAN tuning is optimized for performance, bandwidth and latency.

04/07: Delayed until later in the year when we have stability.

10/07: ntop being evaluated

20070723: ntop is now available as an RPM for investigating local network issues. It is not installed by default as it is memory hungry, and once installed it needs to be started manually. intop, the ASCII display tool, is no longer supported or available. ntop now has its own web server.

20070731: ntop installed on CMS Castor instance so we can evaluate whether issues are network or system related.

20070808: CMS testing on-going, network tuning not yet in place. Need to co-ordinate with Storage group. Tuning has been validated to ensure that system does not crash or hang.

20070814: Tuning for network and disk installed on gdss90. This is to be rolled out on the Wanin servers as well. The journaling has also been changed to writeback as this gives a disk performance improvement.

20070829: Tuning installed on all CMS FarmRead machines(gdss71,72,73,74,75 and 84). This tuning is for preservation of low memory and limiting memory for TCP usage. Tuning installed on all CMS WanIn(gdss93, 94, 95, 96, 97 and 98) and CMSWanOut(gdss85, 90, 125, 126,127 and 128) machines. This tuning is for low memory protection and maximizing the TCP stack usage by allocating extra space for buffers.

20070903: Tuning installed on all LHCB machines(gdss69,70,,79,121,122,123 and 124). This tuning is for preserving low memory and limiting memory for TCP usage.

20070905: Tuning installed on all Atlas machines(gdss80,81,82,83,103 through to and including 120). This tuning is for preserving low memory and maximising the TCP stack usage by allocating extra space for buffers.

20071024: Documentation started on wiki. Approximately 50% completed.

20071024: Took over network testing from RAS during his absence.

20071024: The network performance between RAL and Fermi is poor compared to between RAL and FZK, where with 2 streams line speed can almost be attained. There are a lot of packet drops on the Broadcom chip and UDP messages being dropped in the network stack. Inserted an Intel card for comparison purposes but there is no major difference other than the TX carrier count is racking up. Fermi are using a non-standard TCP stack. We also see a lot of packet drops using iperf locally with 10 streams.

20071114: Used iptraf to view network traffic on gdss93. The TCP window on the Fermi connection was at 183 bytes. This is extremely low and is the size of the window at IP level after the data link headers have been stripped off. At this level, data transfer will be very poor. A connection to FZK had a TCP window of 33KB and locally to the home file system a window of 63KB. The higher the number the better, as it is an indication of how much space is left in the buffer at the receive end as well as an indication that the two end points had to agree a level at which to transmit where extreme data loss did not occur. On a busy machine, memory for TCP buffers will be at a premium so ideally we may need to shed disk I/O to increase the available memory for TCP. More testing needs to be done between RAL and Fermi.
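To put the window sizes above in context: a single TCP stream can have at most one window of data in flight per round trip, so its throughput is bounded by window / RTT. The sketch below works through that arithmetic for the three observed windows. The 100 ms round-trip time is an assumed figure for illustration only (no RTT was recorded here), and the same value is applied to all three connections purely so the windows can be compared. The function name is hypothetical.

```python
def max_throughput_bytes_per_s(window_bytes, rtt_seconds):
    """Upper bound on single-stream TCP throughput: one window per round trip."""
    if rtt_seconds <= 0:
        raise ValueError("rtt_seconds must be positive")
    return window_bytes / rtt_seconds


# Assumed round-trip time, for illustration only (not measured).
ASSUMED_RTT_S = 0.1

# Window sizes as reported by iptraf on gdss93.
OBSERVED_WINDOWS = {
    "Fermi": 183,
    "FZK": 33 * 1024,
    "home file system": 63 * 1024,
}

if __name__ == "__main__":
    for name, window in OBSERVED_WINDOWS.items():
        bound = max_throughput_bytes_per_s(window, ASSUMED_RTT_S)
        print(f"{name}: window {window} B -> at most {bound:.0f} B/s per stream")
```

Whatever the real RTT is, the bound scales with the window, so at equal RTT the 33KB window permits roughly 185 times the per-stream throughput of the 183 byte window.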

20071120: Wiki documentation completed RAL_Tier1_Disk_Server_Tuning

20071218: Investigation into Viglen 3ware performance may affect the tuning of the FarmRead and WANIN/OUT systems

20080114: Amalgamated T-122 into F-23

20080114: Reassigned back to Andrew.

F-24 Low RAS Helpdesk review Open Review RT and Footprints helpdesks with team in order to recommend a way forward from our current position.

Will need team assistance in building prototypes so that a reasonable evaluation can take place.

20070906: Delayed until later in year.

20070925: SL4 machine being supplied to DR to move current RT implementation forward

20080107: No action from NGHW prior to departure.

F-25 Low NFS performance Open Although NFS compilation performance issues have been investigated briefly, there is scope for further improvement. Although this is for Babar, it would still be a useful exercise to understand why NFS performance has issues, especially during compilation.

04/07: No progress

20070724: OS does not allow investigation into whether the network card can be optimally tuned. Would have to move to a later OS

20071218: The RX and TX buffers could now be increased on the NIC. The default is 256 for each and the maximum is 4096. This should help performance. This needs to be implemented with a degree of care in coordination with Babar management team.

20080213: Implemented. Need to check performance change.

F-32 High JRHA Filesystem errors (gdss153 & gdss159) Closed Investigate spurious filesystem errors and causes such as filesystem going read only.

2008-01-16: NGHW -> JRHA

2008-01-18: fsck on gdss153 completed with many errors logged, cause uncertain.

2008-01-21: castor2 on gdss159 went read-only, fsck started on castor 1, 2 & 3, already showing MANY errors.

2008-01-22: fsck never got past castor1, stopped the fsck and started an array verify, this spewed out hundreds of data/parity mismatch errors.

2008-01-22: The verify continued to run until 05:13:57 when drive 13 dropped out of the array triggering a rebuild onto the hot-spare. A verify has been started again but the results will probably not be meaningful now the array has rebuilt. Once it has completed an fsck will be started again.

2008-02-18: the cause of gdss159's upset was traced to a particularly faulty drive, the machine has now been recommissioned into service for CASTOR.

2008-02-20: Have resumed discussion with Viglen about gdss153.

2008-03-03: gdss153 has now had a new backplane, SATA cables and RAID controller, but has started to show corruption once again. The current plan is to export the drives as JBOD and run fsprobe on them individually.

2008-03-19: Has now been running fsprobe for a week or so with no problems reported, however partitions were not occupying whole disk due to an error on my part, have restarted fsprobe on whole disks.

F-33 Low PCI space Open Investigate whether changing PCI space latency increases performance.
F-34 Low JIT Trickle read Progress Write program to trickle read the whole array to keep the disk heads moving. This program needs to be low priority so as not to affect performance. We often have multi disk loss when using an array heavily after it has spent time being inactive. The idea being to keep the array always slightly busy.

2008-02-06: Also run iozone during a verify to measure the performance hit of a verify.

2008-02-13: Use array verification systems from controller cards? This has significance due to possible array issues on 3ware controllers.

2008-02-18: Will co-ordinate effort with James Jackson to establish performance impact of running a low-priority background verify.

2008-03-19: Have spoken to James Jackson and agreed to run some tests when his current set have completed.

2008-05-28: -> JIT as tied in with his disk testing tasks.

2008-06-15: Have run IOzone/verify combinations and there is a substantial hit (see T-142).
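A minimal sketch of the trickle-read idea in F-34: read a device or file sequentially in chunks, pausing between reads and lowering the process priority so that real I/O is not disturbed. The function name, chunk size and pause length are assumptions for illustration. Note also that reads made this way may be satisfied from the page cache rather than the disks, so a real tool would probably need direct I/O, which is not shown here.

```python
import os
import time


def trickle_read(path, chunk_size=1024 * 1024, pause_s=0.5, niceness=19):
    """Read path from start to end in small chunks and return the bytes read.

    niceness is added to the current process nice value (0 leaves it alone);
    pause_s is the sleep between chunks that keeps the read rate low.
    """
    if niceness:
        try:
            os.nice(niceness)  # lower CPU priority where the platform allows
        except (AttributeError, OSError):
            pass  # os.nice is unavailable or not permitted; carry on anyway
    total = 0
    # buffering=0 avoids Python-level buffering; the OS cache still applies.
    with open(path, "rb", buffering=0) as dev:
        while True:
            chunk = dev.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
            time.sleep(pause_s)
    return total
```

For example, `trickle_read("/dev/sdX")` (a placeholder device name) would walk the whole device at roughly 2 MB/s with the default settings, ignoring the time taken by the reads themselves.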

F-37 Low JIT Cache on IFT arrays Progress Testing with SL4 on IFT arrays reveals that the "Synchronizing cache" command during shutdown or rmmod of the device driver will hang. The only resolution is to power the system off and on. This can result in a fsck. The array does not understand the command the OS is sending to flush the cache. IFT need more info to replicate. There is also a bypass available by coding in the kernel or using write-through cache but at a performance hit.

04/07: Will be attempting to address this issue with a new kernel release.

10/07: New kernel release progressing with inclusion of latest driver from Adaptec and Areca

20070723: Evaluating whether to use the Broadcom or Red Hat version of the tg3 driver.

20080107: Unable to check whether the 55.0.12 customized kernel is able to handle the array's inability to understand the "Synchronizing cache" command until csfnfs65 is available.

F-41 High JRHA Configure Areca cards Open Areca cards need configuring to provide alerts.

2007-06-27: JRHA to survey. Possible DNS entries. Also for 3ware cards.

2007-07-24: no progress

2007-10-10: Will start work on this after hardware database is stable.

2007-11-14: Begun reading up on config.

2008-01-08: Returned to this in new year. At least four of the cards have been set up fairly well; am evaluating the current config for duplication.

2008-02-20: Set aside 28th and 29th of February to look at this.

2008-03-05: Managed to spend a small amount of time looking at this, want to check the IP set-up of the cards is as it should be.

2008-03-19: Plan to return to this shortly.

F-44 Medium JRHA/JIT IOMeter Open Run IOMeter against Areca based servers. IOMeter runs in either raw or file mode. Initially, running it destroyed the filesystem until further documentation and communication with the user group revealed how to run it on a filesystem. Initial runs were not particularly performant but at least now they don't destroy the filesystem. Need further runs to obtain useful benchmarks.

04/07: No progress

10/07: No progress

20070925: Transferring to James and James after Fabric team meeting.

2007-11-14: Waiting for Nick to finish with gdss86. Machine is running kernel tests.

2008-01-07: gdss86 is available.

2008-06-18: gdss86 to be used for new kernel testing.

F-45 Medium JRHA Remote monitoring Open Investigate IPMI or KVM over IP, etc technologies for remote management of machines. Intel's AFT may also be a useful tool.

2008-02-20: Latest disk server delivery is fitted with IPMI (but not KVM over LAN) so will be using these to learn more.

F-46 Medium JRHA Machine room environment monitoring Progress Environmental monitoring.

2007-06-20: Waiting for RAS to discuss sensor placements.

2007-07-06: Arranged to meet with RAS (and others) to discuss and plan course of action.

2007-11-14: Taken responsibility for project, will work with Wadud from HPCSG and GWR to distribute sensors in a useful way.

2007-12-10: Arranging meeting with Wadud to discuss.

2008-01-08: Have made a lot of progress with Wadud, who has installed several of the sensor units in HPCSG's racks as an experimental and educational set-up, these have been given the host names [2] [3] [4]. One unit was installed in a Tier1 rack prior to the Christmas break to allow for power off monitoring and will be reinstalled in a more convenient place.

2008-02-20: Wadud has been working to rectify some intermittent faults with airflow sensors, will be talking to Graham very soon to discuss his needs for temperature monitoring around the machine room.

2008-03-05: In collaboration with Graham, have begun to install a trial grid of sixteen sensors above the aisles near the new deliveries of CPU nodes. This data will be available at http://thor.gridpp.rl.ac.uk/artemis/.

2008-03-19: ARTEMIS now fully featured with historical data recording via rrdtool, sensor grid will be expanded shortly.

F-47 Low JRHA Machine room power monitoring Progress Monitoring and recording voltage.

2007-06-27: Low, JRHA. This is really environment monitoring and would be useful in investigating transients.

2007-10-10: As part of F-18 (APC config) a Cacti system has been set up to monitor the load on the PDUs, all that remains now is voltage monitoring, this may be possible by accessing a UPS (although the Tier1 currently has none).

2008-01-20: The power monitoring cacti has been moved to thor where it will sit alongside the hardware database. Voltage monitoring is possible if unregulated DC power-packs are connected to the analogue inputs of one of the temperature monitors.

2008-02-20: Will evaluate an off-the-shelf power quality analyser.

F-55 Low MJB Machine Evaluation Rack Open When a rack becomes spare, use it for holding evaluation hardware rather than on a bench.

20080102: NGHW -> MJB

T-75 Medium MJB Upgrade of yumit to pakiti Progress Upgrade yumit to pakiti across the Tier1 farm

2007-10-10: Work ongoing by Steve Cobrin - provided Steve with SL4 box (csflnx264) to test Pakiti interactions.

2008-01-16: Cannot use csflnx264 now as it has become thor.gridpp.rl.ac.uk.

F-78 High MJB Testing Viglen Twin system Progress Set up and run SPECint tests on Viglen test unit, then add to batch system to take power readings.

2007-07-04: System provisioned and in use in batch system. James will monitor jobs, power, temperature twice daily.

2007-07-11: Measurements being taken.

2007-07-20: Power, airflow and load measurements complete.

2007-10-10: Received SPECint2000, JFW obtaining SPECint2006. MJB to provide JRHA with access to both of the twin systems.

2007-10-30: Drained one half of the twin (lcgtest03) installed spec2000, only gcc-3.2.3 is available for SL3 so compiled gcc-3.4.6 from source, discovered spec2000 v1.2 will not work due to use of legacy headers. JFW has obtained v1.3 from spec along with spec2006, so will continue when these arrive.

2007-12-10: Resumed work now that farm shut-down has passed and v1.3 has arrived.

T-83 Low JRHA IPMI Mac Addresses Open Generate a list of MAC addresses for the IPMI cards in the CASTOR IBM servers and attached arrays.

2008-01-08: This is near-impossible without opening the machines to examine their MAC stickers as no records seem to have been kept by the CASTOR team.

2008-05-21: Can use OpenIPMI tools to do this.

T-105 Medium MJB Superodoctor update Open Superodoctor update required.

2007-08-29: Downloaded the latest superodoctor binaries and configuration files from Supermicro's web site. They worked fine and gave temperatures on the Areca based machine. Need to turn these files into an RPM for general distribution.

2007-11-14: Pass files to MJB to re-cut RPM

2007-12-18: The 64 bit sdt binary segmentation faults in one of the tls libraries. Downloaded the latest version but the 64 bit sdt binary still seg faults. The 32 bit binary no longer gives reasonable information (e.g. all fans stopped). Need to revisit this.

20080114: In negotiation with Supermicro support in Holland regarding providing debug info as to why sdt is seg faulting.

T-110 High Unable to shutdown Castor servers reliably Progress 27% of the Atlas Castor instance of machines failed to shutdown properly and had to be crashed. This was not caused by castor-gridftp or issues with the home filesystem as the home filesystem was not mounted and castor-gridftp was manually shutdown before the intervention started, as was the case with the CMS instance.

20071010: Discussed with Chris K. LSF will open a directory called .lsbatch, which holds job information and status either in /tmp or on the home filesystem, if it is mounted. It should be able to switch between the two mount points automatically. eg /home/csf/cms001/.lsbatch. This directory being open on a disk server may be a contributory factor to them being unable to shut down cleanly.

20071114: Unmounting the LSF partition may be required before shutting down. Need to investigate more when a shutdown is needed.

20071218: During the shutdown, the Castor disk servers shutdown without any issues. The LSF mount is a hard mount.

20080213: Castor team looking at using http for lsf communication as an alternative to nfs mount at /lsf.

T-113 Medium JFW Investigate how to include GPG signatures on locally written RPMs Progress It will improve our security to have signatures on our locally written RPMs and allow us to use signature checking for all RPMs installed on the Tier1 farm.

27-09-2007: Method documented on Tier1 CVS repository under doc/RPM_GPG

10-10-2007: JFW and MJB to discuss

06-02-2008: Linked to T-68 (Improve RPM documentation)

2008-02-20: Closed T-68. We can track both here. Update docs once signing in place.

T-115 Medium JIT Disk test suite RPM Progress RPM of all disk testing scripts for vendors and the Tier1 to use.
P-128 Medium JIT Generate ssh logins for central sysloggers Progress Generate a list of abusive ssh login attempts from the central sysloggers rather than extracting them from Logwatch root mail.

2008-01-09: James has been looking at swatch and researching other log monitoring tools, and logging to a database. SSH logging to a separate file has been suggested, if possible with syslog-ng.

2008-02-13: Working with HPCSG on php-syslog-ng.
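A minimal sketch of how P-128 might extract abusive SSH login attempts directly from a central syslog file. It assumes the usual OpenSSH "Failed password for ... from ..." message format and a simple per-host threshold; both the regular expression and the function names are assumptions for illustration, and the message wording may differ between OpenSSH versions.

```python
import re
from collections import Counter

# Assumed sshd message format, e.g.
# "... sshd[1234]: Failed password for invalid user admin from 192.0.2.10 port 4000 ssh2"
FAILED_LOGIN = re.compile(
    r"sshd\[\d+\]: Failed password for (?:invalid user )?(?P<user>\S+) "
    r"from (?P<host>\S+)"
)


def count_failed_logins(lines):
    """Count failed password attempts per source host."""
    counts = Counter()
    for line in lines:
        match = FAILED_LOGIN.search(line)
        if match:
            counts[match.group("host")] += 1
    return counts


def abusive_hosts(lines, threshold=5):
    """Return the sorted hosts with at least threshold failed attempts."""
    counts = count_failed_logins(lines)
    return sorted(host for host, n in counts.items() if n >= threshold)
```

In use, the lines would come from the file the sysloggers write SSH messages to, for example `abusive_hosts(open(path))` where `path` is wherever that log lives.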

P-129 [5] Medium Review Logwatch configuration Open Review the farm Logwatch configuration to see if any items can be removed as unnecessary or redundant with the aim of reducing the number of e-mails sent to farm root mail

2008-01-09: JIT looked at logwatch messages during AoD last week. The df output can be removed. Faulty disks on marley and logger1 picked up via Logwatch. SSH logins still only in logwatch at the moment, c.f. item P-128.

F-132 Medium JFW csfnfs58 Documentation Open The array partitioning, fdisk, filesystem parameters, directory mount points and permissions need documenting and scripting so that rebuilding the server or upgrading the OS is facilitated. Backups of data partitions are in place for those that need them.
F-134 Medium JIT/JRHA/MJB Emergency shut-down script Open A more sophisticated shut-down script needs to be created to replace the ad-hoc "all-off" script currently in place.
I-137 RT# High Team Recovery from power failure on 7th Feb 2008 Open 20080213: Power feeds all turned off, system made safe before power restored to building. After power restoration, meeting to allocate tasks for restarts. General procedure: initiate boot to single user, force fsck. Determine start order to minimise need to turn off services (yumit etc) till service is up. Order: Network, NIS, home filesystem, DHCP/TFTP, consoles, loggers, core services, AFS. Then LCG global services, local LCG services, disk servers to fsck. Fsck overnight, then update kernels on disk servers, prioritising Castor servers. Limited batch restored late Friday with scheduler and CEs.

Failed hardware: Several older APCs, one new (7953). Pre-existing memory issues in 2 x Woodcrest WNs. Compusys 2004 node mobo failed (NIC chips). 200GB drive failed in array72 (csfnfs42/bottom). 200GB drive failed in array67 (csfnfs39/bottom). Riser card in csfnfs56. BIOS battery replaced in csfnfs62. gdss68: no power light (may be pre-existing).

Data issues: csfnfs55/cms-data26 - fsck issues, mountable read-only: copied data to spare partition and reconstituted file system, copied data back. csfnfs50/system drive (pre-failed): couldn't boot system on software raid pair as grub not installed on the second disk. JRHA fixed. ganglia/system drive: MBR corrupted, grub missing. gdss68/var: fsck issue, fixed in manual mode, rebooted and fscked all arrays again.

Updates: Most lcg-CA updates. Kernels on disk servers and most other nodes where necessary (including home filesystem).

Notes: JRHA fixed the existing issues with some of the PXE menu options.

T-139 RT# Medium JIT Security-related items Open 2008-02-20: Combined from F-11, F-27, F-28, F-40:
  • Investigate whether the Tier1 should be running SNORT.
  • Tripwire or replacement thereof (aide).
  • Cron job to check system files on a daily basis to ensure that no unforeseen changes have occurred.
  • It would be beneficial to pin NFS ports to a defined range. Also beneficial on touch.
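The daily system-file check listed under T-139 could be sketched as follows: record a SHA-256 digest for each watched file, then compare a fresh snapshot against the stored baseline. This is a minimal sketch and not a replacement for Tripwire or aide. The function names are assumptions, and where the baseline is stored and how the job is scheduled from cron are left out.

```python
import hashlib


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def snapshot(paths):
    """Map each path to its digest, or to None if it cannot be read."""
    result = {}
    for path in paths:
        try:
            result[path] = file_digest(path)
        except OSError:
            result[path] = None  # missing or unreadable file
    return result


def changed_files(baseline, current):
    """Return the sorted paths whose digest differs between two snapshots."""
    all_paths = set(baseline) | set(current)
    return sorted(p for p in all_paths if baseline.get(p) != current.get(p))
```

A cron job would load yesterday's baseline, call `snapshot()` on the same list of files, and mail the output of `changed_files()` if it is not empty.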
T-140 23155 Medium JIT New ganglia clusters Progress 2008-02-20: Need the following new clusters in ganglia:
  • Testing (but only if we can do that without affecting overall stats)
  • Storage_CASTOR_ALICE
  • Storage_AFS
  • Services_CASTOR_* for each LHC VO
  • Loggers (as a private, protected cluster)
T-141 High JRHA Viglen Cable Looms Open Need to make up two sets of network cable looms for the new purchases of Viglen disk servers.

These need to be two bundles of 100 with seven sub-bundles.

Action-ID (next free 143) RT# Priority Owner Title Status Notes