https://www.gridpp.ac.uk/w/index.php?title=RAL_Tier1_weekly_operations_Fabric_20091109&feed=atom&action=history
RAL Tier1 weekly operations Fabric 20091109 - Revision history
2024-03-29T08:46:32Z
https://www.gridpp.ac.uk/w/index.php?title=RAL_Tier1_weekly_operations_Fabric_20091109&diff=2259&oldid=prev
James thorne at 13:42, 11 November 2009
2009-11-11T13:42:40Z
<p><b>New page</b></p><div>== Summary of week gone ==<br />
<br />
=== Developments ===<br />
* All:<br />
** SSC HR rollout tasks<br />
<br />
* Martin:<br />
** Completed Disk Procurement eval<br />
** Work on EMC arrays problem<br />
** Survey of CPU tender responses<br />
** Meeting with Seagate and Viglen re 2008 disk acceptance<br />
<br />
* Ian:<br />
** Work on Quest FP7 bid<br />
** Attended Quattor Workshop (and QUEST F2F)<br />
** Making new kernels available to Quattor managed systems<br />
<br />
* James T:<br />
** Catch up<br />
** Viglen Disk Server Problems<br />
** CRISTAL2 preparation<br />
** A/L Friday<br />
<br />
* Jonathan:<br />
** updated RPMs and rebooted for new kernel on many systems<br />
** sorted out problems with atlasbackup for some nodes<br />
** sorted out Nagios problems for some servers<br />
** arranged for gdss411-413 to be installed as Castor disk servers prior to deployment<br />
** migrated farm home filesystems from /home/csf to /home/tier1 for remaining users (except for bfactory functional userids)<br />
** Nagios configuration of updates<br />
** with Kash, shutdown nagger and nagiosdb to replace faulty memory in nagger<br />
<br />
* James A:<br />
** Quattor Workshop @ Brussels<br />
<br />
* Kash:<br />
** Drive replacement.<br />
** Fixing broken WNs.<br />
** gdss154 and 168 fixed and back in production.<br />
** gdss383: replaced 4x2GB memory; fixed and ready for deployment.<br />
** gdss117, 139 and 154 fixed and given back to castor<br />
** gdss67 fixed after a long intervention (replaced the 24-port RAID card with a new one); data saved. Needs to be rebuilt from scratch.<br />
** gdss411, 412 and 413 fixed and ready for deployment.<br />
** Replaced memory in nagger with Mr. Wheeler.<br />
** Working on 2008 disk servers and worker nodes.<br />
** Working on gdss67, 125, 282 and 403.<br />
<br />
=== Operational Issues and Incidents ===<br />
<br />
{| border=1 align=center<br />
|- bgcolor="#7c8aaf"<br />
! Index<br />
! Description<br />
! Start<br />
! End<br />
! Severity<br />
! Affected VO(s)<br />
|-<br />
| <br />
| EMC arrays serving 3D/LFC/FTS databases made unstable by attempts to stabilise the Castor EMC arrays<br />
| Tuesday 6/Oct am<br />
| not in sight<br />
| Catastrophic<br />
| All<br />
|-<br />
|}<br />
== Summary of plans for week ahead ==<br />
<br />
=== Scheduled and Cancelled Down Times ===<br />
<br />
Entries of Type Down, At Risk or Cancelled that are in, or planned to go into, GOCDB<br />
<br />
{| border=1 align=center<br />
|- bgcolor="#7c8aaf"<br />
! Component<br />
! Description<br />
! Start<br />
! End<br />
! Affected VO(s)<br />
! Type<br />
|-<br />
|}<br />
<br />
=== Development priorities ===<br />
* All<br />
** Work on evacuating A1 Upper (Castor admin and LSF systems)<br />
<br />
* Martin:<br />
** Move EMC kit<br />
** Spares and additional hardware for database arrays<br />
** CPU ITT evaluation<br />
*** Testing sample hardware<br />
<br />
* Ian:<br />
** Further Quattor FP7 work (last two weeks)<br />
** Roll out new kernels on Quattor managed machines<br />
** Look at disk stats with Kash<br />
** Work on CPU procurements<br />
<br />
* James T:<br />
** Progress meeting with Viglen<br />
** CRISTAL2 preparation<br />
** Disk server cover in Kash's absence<br />
** Catch up on helpdesk tickets and meeting actions<br />
** TOASTER preparation<br />
<br />
* Jonathan:<br />
** Quattor implementation for Nagios slave<br />
** update environment for SL5 systems<br />
** updates to farm to allow Babar functional userids to migrate home filesystem<br />
** Nagios configuration updates<br />
<br />
* James A:<br />
** A/L<br />
<br />
* Kash:<br />
** Drive replacement.<br />
** Fixing broken WNs.<br />
** gdss67 rebuild from scratch and move to HPD room.<br />
** Continue working on 2008 disk servers and worker nodes.<br />
** Continue working on gdss67, 125, 282 and 403.<br />
<br />
=== Absences ===<br />
<br />
* Jonathan<br />
** A/L on Thursday (12th November)<br />
<br />
* James A<br />
** Annual Leave (Mon 9th - Fri 20th).<br />
<br />
=== Fabric On-Call ===<br />
<br />
* Mon-Sun: Ian is Primary On-Call<br />
<br />
=== Advanced Warning of Requirements and Blocking issues ===<br />
<br />
<br />
=== Services Issues ===<br />
<br />
* Various requests for hardware.<br />
<br />
[[:Category:RAL_Tier1]]<br />
<br />
[[RAL Tier1 weekly operations fabric]]</div>