Search results

Page title matches

Batch system status

* [https://twiki.cern.ch/twiki/bin/view/LCG/BatchSystemComparison Batch System Comparison Table] == Sites batch system status ==

11 KB (1,661 words) - 12:47, 21 June 2019
RAL Tier1 Batch System

...ent/uploads/2008/12/batchsystemconfig-nov08.pdf Configuration of the batch system at November 2008] [[Category:Batch Systems]]

325 B (40 words) - 12:23, 18 March 2014

Page text matches

Main Page

* [[New Information System]] * [[:Category:Batch_Systems|Batch Systems]]

8 KB (1,130 words) - 17:31, 17 April 2024
Tier1 Operations Report 2019-06-17

... | Limits on concurrent batch system jobs.

14 KB (1,386 words) - 09:37, 19 June 2019
Operations Bulletin Latest

* Technical Meeting last week about the New JSON based Information System: https://indico.cern.ch/event/821105/ ...on Thursday. All SAM tests failed until this was fixed the next morning. Batch farm also did not start any new jobs during this time. We used this accide

41 KB (5,018 words) - 14:09, 30 October 2019
Tier1 Operations Report 2016-03-16

...y LHCb of a low but persistent rate of failure when copying the results of batch jobs to Castor. There is also a further problem that sometimes occurs when * GDSS620 (GenTape - D0T1) Reported a read-only file system yesterday (Tuesday) morning and was taken out of production. Two T2K files

13 KB (1,356 words) - 09:59, 16 March 2016
Tier1 Operations Report 2015-12-09

...y LHCb of a low but persistent rate of failure when copying the results of batch jobs to Castor. There is also a further problem that sometimes occurs when ... hours of yesterday morning (8th Dec). This also reported a read-only file system.

13 KB (1,411 words) - 08:55, 10 December 2015
Tier1 Operations Report 2014-05-07

* There was a problem on Thursday with the batch farm caused by a particular (biomed) user running very large jobs. This led | Outage of tape system for update of library controller.

13 KB (1,357 words) - 12:47, 9 May 2014
Past Ticket Bulletins 2014

...n F ticketed the CA concerning a possible problem with the ticket reminder system. JK has responded with a reply, and asked that similar tickets in the futur LHCB having cvmfs trouble at IC, which was likely caused by a batch of naughty CMS jobs ruining it for everyone else. LHCB re-enabled IC to see

184 KB (30,332 words) - 17:18, 16 December 2014
Operations Bulletin 170314

...is week there is a [http://indico.cern.ch/event/272785/ pre-GDB meeting on batch systems] and a [http://indico.cern.ch/event/272619/other-view?view=standard ...re will be a [https://www.gridpp.ac.uk/wiki/Batch_system_status pre-GDB on batch systems] next Tuesday, and a [https://indico.cern.ch/event/272619/timetable

42 KB (5,176 words) - 11:12, 17 March 2014
Operations Bulletin 280414

* CERN batch capacity migrated to SLC6 was at 65% last week. * The APEL accounting system has been undergoing database maintenance to improve performance and reliabi

46 KB (5,930 words) - 18:40, 28 April 2014
Operations Bulletin 150413

...ve released a new [https://ggus.eu/pages/didyouknow.php page on using] the system. * Investigations are ongoing into problems at batch job set-up.

43 KB (5,533 words) - 08:50, 18 August 2014
Tier1 Operations Report 2018-07-09

... | Limits on concurrent batch system jobs.

17 KB (1,646 words) - 09:31, 11 July 2018
Tier1 Operations Report 2017-01-04

...y LHCb of a low but persistent rate of failure when copying the results of batch jobs to Castor. There is also a further problem that sometimes occurs when | Outage of Castor Storage System for patching

14 KB (1,476 words) - 14:02, 4 January 2017
Tier1 Operations Report 2015-05-20

* GDSS649 (LHCbUser - D1T0) failed on Saturday 16th May when the system hung up. Following tests a faulty drive was replaced. It was returned to se ...ew configuration of a batch of new worker nodes was reported. Most of this batch have now been re-set to have the usual worker node configuration.

13 KB (1,442 words) - 11:25, 20 May 2015
Tier1 Operations Report 2014-03-19

...r the hypervisor hosting this virtual machine rebooted and this particular system was not configured to re-start. This was resolved by the primary on-call. ...ed during the change. The batch system was also reconfigured such that new batch jobs would not start during this period. The change was successful. There

14 KB (1,553 words) - 11:36, 19 March 2014
Tier1 Operations Report 2016-01-20

* As reported last week the CMSTape system has been busy - and throughput was compromised by two out of its five disk .... Following the first rebuild another problematic disk was found and the system was returned to service on Monday (18th Jan) once that too had been resolve

13 KB (1,364 words) - 12:54, 20 January 2016
GridPP approved VOs

|MAGIC is a system of two imaging atmospheric Cherenkov telescopes (or IACTs). MAGIC-I started * high priority in the batch system for the atlassgm user;

78 KB (13,056 words) - 13:44, 23 April 2024
Tier1 Operations Report 2017-07-26

... | Limits on concurrent batch system jobs.

18 KB (1,971 words) - 14:03, 26 July 2017
Operations Bulletin 310314

* Last week there was a [http://indico.cern.ch/event/272785/ pre-GDB on batch systems] and a [http://indico.cern.ch/event/272619/other-view?view=standard ...is week there is a [http://indico.cern.ch/event/272785/ pre-GDB meeting on batch systems] and a [http://indico.cern.ch/event/272619/other-view?view=standard

48 KB (6,293 words) - 07:35, 31 March 2014
Operations Bulletin 240314

* Last week there was a [http://indico.cern.ch/event/272785/ pre-GDB on batch systems] and a [http://indico.cern.ch/event/272619/other-view?view=standard ...is week there is a [http://indico.cern.ch/event/272785/ pre-GDB meeting on batch systems] and a [http://indico.cern.ch/event/272619/other-view?view=standard

48 KB (6,293 words) - 07:36, 31 March 2014
Operations Bulletin 070414

* Last week there was a [http://indico.cern.ch/event/272785/ pre-GDB on batch systems] and a [http://indico.cern.ch/event/272619/other-view?view=standard ...reviewed are capable of supporting multicore jobs however a tuning of each system is required to be able to absorb them (draining/reservation of resources) w

45 KB (5,701 words) - 09:21, 7 April 2014
Operations Bulletin 140414

* Last week there was a [http://indico.cern.ch/event/272785/ pre-GDB on batch systems] and a [http://indico.cern.ch/event/272619/other-view?view=standard * CERN batch capacity migrated to SLC6 was at 65% last week.

52 KB (6,980 words) - 08:19, 15 April 2014
Tier1 Operations Report 2014-04-09

...ed. Multiple disk failures were being reported by the disk controller. The system was returned to production yesterday evening (8th April) and is being drain * The EMI3 Argus server is now in use everywhere in the batch farm.

14 KB (1,599 words) - 11:33, 14 April 2014
Operations Bulletin 210412

* CERN batch capacity migrated to SLC6 was at 65% last week. * The APEL accounting system has been undergoing database maintenance to improve performance and reliabi

45 KB (5,796 words) - 22:44, 21 April 2014
Tier1 Operations Report 2014-12-10

... CMS deleting files to make space and a reduction in the number of running batch jobs relieved the strain. ... brought into use. (Currently Atlas 3D/Frontier still uses the OGMA database system, although this was also changed to update from CERN using Oracle Golden Gat

14 KB (1,492 words) - 13:08, 10 December 2014
Tier1 Operations Report 2014-04-30

...bers of jobs (from T2K) submitted to the batch system by the WMSs. A batch system parameter (max number of gridftp connections on ARC CEs) has been increased | System be decommissioned. (Replaced by myproxy.gridpp.rl.ac.uk).

14 KB (1,557 words) - 13:24, 30 April 2014
Operations Bulletin 050514

* CERN batch capacity migrated to SLC6 was at 65% last week. * The APEL accounting system has been undergoing database maintenance to improve performance and reliabi

41 KB (5,106 words) - 19:52, 5 May 2014
Tier1 Operations Report 2014-05-14

* Testing CVMFS Client version 2.1.19 ongoing. This is now rolled out to one batch of worker nodes. So far so good. | Outage of tape system for update of tape library controller. (Postponed from 13th May).

13 KB (1,393 words) - 10:46, 14 May 2014
Tier1 Operations Report 2019-02-25

... | Limits on concurrent batch system jobs.

17 KB (1,612 words) - 11:29, 27 February 2019
Operations Bulletin 020614

...onday covered some site reports and OS related updates. Tuesday's focus is batch systems. Wednesday covers IPv6, security and benchmarking. Thursday storage ...naged services from Quattor to a new Puppet based Configuration Management system.

41 KB (5,148 words) - 09:38, 2 June 2014
Operations Bulletin 090614

...onday covered some site reports and OS related updates. Tuesday's focus is batch systems. Wednesday covers IPv6, security and benchmarking. Thursday storage ...naged services from Quattor to a new Puppet based Configuration Management system.

41 KB (5,148 words) - 07:10, 9 June 2014
Tier1 Operations Report 2014-06-11

* Today (11th June) a new tape controller system (ACSLS) is being installed. There have been some problems with the new serv | Castor (all SRM endpoints) and batch (all CEs)

15 KB (1,592 words) - 12:26, 11 June 2014
Imperial arc ce for cloud

0) Find and read the "ARC Computing Element System Administrator Guide". <br> 3) Ensure the machine can submit to the batch system & has all of the users. <br>

11 KB (1,578 words) - 15:50, 12 June 2014
Cloud Work at Imperial

...ieve this by using the (Condor) Submit module of a glideinWMS as the batch system and then channeling the jobs via the glideinWMS to the gridpp cloud. <br>

925 B (154 words) - 11:11, 23 August 2019
Operations Bulletin 160614

...r their resources into a ‘pool’ via the [https://e-grant.egi.eu eGrant system]. [https://wiki.egi.eu/wiki/Resource_Allocation_Process More information] i * Castor and batch services currently down for Castor Nameserver Upgrade (to version 2.1.14). I

39 KB (4,952 words) - 19:40, 13 June 2014
RAL Memory Limits

...n there is contention between other processes for physical memory will the system force physical memory into swap and push the physical memory used towards t

1 KB (241 words) - 10:28, 11 February 2015
Operations Bulletin 070714

...] needs updating and a consensus! Could the SEs implement some reservation system internally? Is there merit in the suggestion to make use of [https://www.gr * KeyDocs are going to be reviewed (in next 4 weeks) as the system is not working (or not adding anything) in some areas.

43 KB (5,584 words) - 12:52, 7 July 2014
Staged rollout emi3

'''UKI-NORTHGRID-MAN-HEP''': Multicore and passing parameters to the batch system testing requested by the experiments through the WLCG Task Force Alessandra

8 KB (1,155 words) - 11:09, 13 March 2015
Example Build of an EMI-UMD Cluster

...egi-trustanchors.repo Finally, for historical reasons related to our build system, we also installed these two repos from the glite 3.2 instructions - jpacka ...wever you do it, make a munge key using /usr/sbin/create-munge-key on some system that has munge installed on it (this one?) and use the resulting key on all

15 KB (2,429 words) - 10:18, 31 July 2015
Operations Team Completed Actions

| Email everyone on how to hack the publishing system to avoid publishing incorrect GlueSubClusterWNTmpDir. | Plan out the future of CE/Batch System integration. Torque/maui are not supported by EGI. Layout an agenda with pr

33 KB (5,297 words) - 10:13, 15 November 2017
Example Build of an ARC/Condor Cluster

...lable, called HTCondor (or CONDOR for short). We also decided to front the system with an ARC CE. You'll need a copy of the ARC System Admin Manual.

121 KB (17,569 words) - 08:26, 28 November 2019
Operations Bulletin 010914

...or allocation. It is a brokering service only. There is one request in the system for cloud resources. * News: CERN-IT to terminate the SLC5-based interactive and batch services (lxplus5 and lxbatch5) soon. The current target date is 30 Septemb

42 KB (5,358 words) - 10:48, 1 September 2014
Operations Bulletin 150914

... jobs at CCIN2P3 and of the method for passing job requirement arguments to batch systems via CE. ([https://indico.cern.ch/event/339461/ Agenda]) * OSG following up on how to discover HTCondor CEs in the information system.

46 KB (6,062 words) - 10:07, 15 September 2014
Tier1 Operations Report 2014-10-01

...ring Saturday evening. It was restarted and tested but no fault found. The system was returned to service yesterday (30th Sep). * One batch of worker nodes (64 machines) have had Linux cgroups configured to enforce

13 KB (1,429 words) - 10:06, 8 October 2014
RAL Tier1 Incident 20130626 Failure of RAL CVMFS Stratum1 Triggered Batch Farm Problems

==RAL Tier1 Incident 20130626 Failure of RAL CVMFS Stratum1 Triggered Batch Farm Problems=====Description:=== ...s over to use other replicas. However this did not happen across the Tier1 batch farm where many nodes were running a version of the CVMFS client in which t

12 KB (1,968 words) - 15:13, 16 September 2014
Monitoring

...ordinating/publicising local site-admin tools (Nagios plugins, local batch system dashboards)

906 B (116 words) - 08:35, 5 June 2018
Tier1 Operations Report 2014-09-17

...of the systems affected was the argus server and this caused a problem for batch job submissions for an hour or so. * The Atlas Frontier service will be switched to use the new database system that updates from CERN using Oracle "GoldenGate" on 24th Sep.

12 KB (1,195 words) - 14:07, 17 September 2014
Operations Bulletin 220914

... jobs at CCIN2P3 and of the method for passing job requirement arguments to batch systems via CE. ([https://indico.cern.ch/event/339461/ Agenda]) * OSG following up on how to discover HTCondor CEs in the information system.

48 KB (6,422 words) - 08:45, 23 September 2014
Operations Bulletin 200317

*** Durham: Batch system upgrade led to one outage and a University-wide internet connection loss le * Tests ongoing with some batch jobs for the LHC VOs running in SL6 containers on worker nodes running SL7.

42 KB (5,079 words) - 18:37, 19 March 2017
Tier1 Operations Report 2014-10-08

...h hosts the Atlas and GEN SRM databases) was moved to the standby database system. This required an outage of the Castor Atlas and GEN instances which lasted ...day morning (5th Oct). It was restarted and tested but no fault found. The system was returned to service this morning (8th Oct).

15 KB (1,740 words) - 10:50, 15 October 2014
Site information

...completed in October 2008. The first is to provide information about batch system memory limits. The second is to give an update on networking issues that ca [[Background to batch system memory request for details:]]

17 KB (2,669 words) - 11:14, 1 March 2016
Guide to Ganga

...ol for use with both local batch systems and the DIRAC workload management system. It's maintained by Ulrik Egede (ulrik<AT>monash.edu) - please email if you ...want to use it to submit jobs to the grid rather than just for local batch system submission), there are a few steps you need to go through:

15 KB (2,621 words) - 14:40, 27 May 2020
Tier1 Operations Report 2018-06-18

... | Limits on concurrent batch system jobs.

16 KB (1,535 words) - 13:37, 20 June 2018
Tier1 Operations Report 2018-05-14

... | Limits on concurrent batch system jobs.

16 KB (1,476 words) - 07:41, 16 May 2018
Publishing tutorial

...y maximise throughput. Experiments show that, in order to fully utilize a system, it is often necessary to choose a number of slots that is higher than th Sites have to transmit (via the BDII and the accounting system) a couple more things; the power of the site and the amount of work done.

8 KB (1,284 words) - 14:03, 2 October 2017
Tier1 Operations Report 2014-10-29

...tlasDataDisk - D1T0) had failed for the third time in around a month. This system has been completely drained and is undergoing further investigations. ...regular "PSU" patches will be applied to the Pluto Castor standby database system on Monday (27th Oct) and to the Pluto production database on Wednesday (29t

14 KB (1,569 words) - 13:13, 29 October 2014
Operations Bulletin 271014

* Machine/Job features: Concluded on a single architecture for cloud and batch implementations. * The OGMA database system (Atlas3D/Frontier) has been updated and switched to using Oracle GoldenGate

40 KB (4,976 words) - 10:25, 27 October 2014
Operations Bulletin 031114

* Machine/Job features: Concluded on a single architecture for cloud and batch implementations. * The OGMA database system (Atlas3D/Frontier) has been updated and switched to using Oracle GoldenGate

42 KB (5,228 words) - 10:37, 4 November 2014
Operations Bulletin 101114

* Machine/Job features: Concluded on a single architecture for cloud and batch implementations. LHCB having cvmfs trouble at IC, which was likely caused by a batch of naughty CMS jobs ruining it for everyone else. LHCB re-enabled IC to see

48 KB (6,138 words) - 09:19, 10 November 2014
Operations Bulletin 171114

* Multicore: Passing parameters to batch system discussion started. Limited tests. ATLAS 40% resources now MC. Still 37 sit

39 KB (4,698 words) - 18:46, 16 November 2014
OldEMITarball

...caster test cluster runs using torque, interfacing with a DPM SE, so other batch/storage combinations are not as well tested. ''This assumes that the workernode has been setup to work within the batch system, and the users and groups have been set up. It would technically be possibl

25 KB (4,174 words) - 09:57, 23 July 2015
EMITarball

The tarball versions listed may look convoluted, but there is a system to them! The first part denotes what middleware was used to build the tarba ... vomses, CA and CRLs. For a WN you will have to set up the users and batch system yourself.

11 KB (1,832 words) - 10:02, 23 January 2018
Operations Bulletin 081214

* Multicore: Passing parameters to batch systems [https://indico.cern.ch/event/272779/session/0/contribution/8/mater ...n F ticketed the CA concerning a possible problem with the ticket reminder system. JK has responded with a reply, and asked that similar tickets in the futur

50 KB (6,536 words) - 00:08, 7 December 2014
Past Ticket Bulletins 2015

...on (but as he also notes - what's getting loaded and causing the problem - Batch, CE or WNs?). Kashif reckons the argus server, and suggests a handy glexec Sno+ spotted malloc errors at Lancaster. The problems seemed to survive one batch of fixes, but I asked again if they still see problems after running a good

117 KB (18,736 words) - 11:05, 4 January 2016
Tier1 Operations Report 2014-12-17

* Following a restriction on numbers of CMS batch jobs imposed during problems a week or so ago the CMS jobs limits on the fa ... brought into use. (Currently Atlas 3D/Frontier still uses the OGMA database system, although this was also changed to update from CERN using Oracle Golden Gat

14 KB (1,504 words) - 14:50, 17 December 2014
Tier1 Operations Report 2015-01-07

... brought into use. (Currently Atlas 3D/Frontier still uses the OGMA database system, although this was also changed to update from CERN using Oracle Golden Gat | Due to Kernel patching of EGI ADV 20141217, the RAL tier1 batch farm worker nodes will need to be rebooted.

17 KB (1,780 words) - 12:56, 7 January 2015
Dirac Dictionary

WMS - Workload Management System. The central part of the DIRAC system

2 KB (306 words) - 12:29, 12 March 2015
Tier1 Operations Report 2015-01-14

...ostic tests were being run on the faulty router – however after that the system restarted and took over as the master router of the pair (which was not ant ...the week. Intermittent timeouts were seen on the tests. The number of LHCb batch jobs has been restricted to try and reduce the problem. In addition, during

14 KB (1,559 words) - 10:52, 21 January 2015
Tier1 Operations Report 2015-02-11

* We are now fully using cgroups to control job memory limits on the batch farm. ... brought into use. (Currently Atlas 3D/Frontier still uses the OGMA database system, although this was also changed to update from CERN using Oracle Golden Gat

13 KB (1,290 words) - 11:23, 11 February 2015
Tier1 Operations Report 2015-03-04

...Thursday (26th Feb) there was a problem with our Argus server that stopped batch job submission starting for an hour or so. * Cap on maximum number of ALICE batch jobs raised from 3500 to 6000.

12 KB (1,175 words) - 14:56, 4 March 2015
Cloud & VM status

...ese sites. There is a complementary page about [[Batch system status|batch system status]].

3 KB (378 words) - 09:57, 27 June 2017
Vacuum

...f this system is that there is no gate keeper service, head node, or batch system accepting and then directing jobs to particular worker nodes, avoiding seve

4 KB (628 words) - 12:52, 13 March 2015
RAL Tier1 Summary of Post Mortems

...fter the action refer to the Tier1 internal (Footprints) incident tracking system. * Investigate, and implement, an alternative method of connecting to the system to allow for a reconnection in the event of a network break.

8 KB (1,074 words) - 09:36, 18 September 2018
RAL Tier1 Incident 20150408 network intervention preceding Castor upgrade

... network changes were made at the start of a planned upgrade to the Castor system. A network problem was triggered that took most of the day to resolve (and | In response to a ticket from t2k, the non-LHC VOs were re-enabled on the batch farm.

15 KB (2,406 words) - 16:43, 17 August 2015
Operations Bulletin 270415

* A [https://indico.cern.ch/event/319821/ pre-GDB on batch systems] will take place in May. ...tops the HTC solution until after CHEP. Is there interest in testing other batch systems? Raul mentioned SLURM. There is also SGE and Torque.

43 KB (5,339 words) - 06:42, 27 April 2015
GridPP VO Incubator

|Enable the OSG VO on RAL CEs and batch system. |Test of the Transformation System

19 KB (3,141 words) - 12:14, 27 April 2020
Monitoring Resource Usage of Jobs with cAdvisor

...addition data is exported to a central database. For sites running a batch system with cgroups enabled, cAdvisor can provide information about running jobs o

4 KB (584 words) - 20:09, 12 May 2015
RAL Tier1 weekly operations castor 18/05/2015

* CMS CASTOR file open time issues affecting batch farm efficiency ...t dataset that is located almost entirely on one node. Shaun has devised a system to redistribute this dataset across the rest of the cmsDisk pool.

4 KB (566 words) - 14:12, 15 May 2015
Operations Bulletin 180515

* There is a [https://indico.cern.ch/event/319821/ pre-GDB on batch systems at CERN this week]. Tier-2 participation encouraged. ** "Consider open science as a production and dissemination system that needs integrated, easy and fair access to several types of shared reso

46 KB (5,803 words) - 11:48, 16 May 2015
RAL Tier1 CASTOR Experiments Completed Actions 2012

| 20120425-01 || Medium || || Gareth || Review batch system limits || Done. Limits have been removed or increased. || 2012-05-23

4 KB (566 words) - 09:26, 20 May 2015
Operations Bulletin 250515

...ch/twiki/bin/view/LCG/GDBMeetingNotes20150512 summary of the pre-GDB about batch systems] is available. * There is a [https://indico.cern.ch/event/319821/ pre-GDB on batch systems at CERN this week]. Tier-2 participation encouraged.

46 KB (5,732 words) - 18:32, 23 May 2015
Operations Bulletin 010615

* Tier-1 problems with secondary database system for Castor - resolved quickly. ...ch/twiki/bin/view/LCG/GDBMeetingNotes20150512 summary of the pre-GDB about batch systems] is available.

43 KB (5,271 words) - 22:18, 31 May 2015
Tier1 Operations Report 2015-06-10

* A problem with the Argus server affected batch job submissions for a while during the early evening of Friday 5th June. Th * The second batch of 2014 CPU purchases has been brought online.

16 KB (1,741 words) - 13:24, 10 June 2015
Operations Bulletin 080615

* Tier-1 problems with secondary database system for Castor - resolved quickly. ...ch/twiki/bin/view/LCG/GDBMeetingNotes20150512 summary of the pre-GDB about batch systems] is available.

43 KB (5,271 words) - 13:02, 6 June 2015
Operations Bulletin 150615

* Tier-1 problems with secondary database system for Castor - resolved quickly. ...ch/twiki/bin/view/LCG/GDBMeetingNotes20150512 summary of the pre-GDB about batch systems] is available.

43 KB (5,391 words) - 15:50, 14 June 2015
Operations Bulletin 220615

* Tier-1 problems with secondary database system for Castor - resolved quickly. ...9/ Agenda]. There will be presentations and discussions on the Information System.

45 KB (5,632 words) - 13:32, 21 June 2015
Tier1 Operations Report 2015-06-24

* The batch job limit for Alice has been completely removed. (It was set at 6000). ...this 15-minute period all services will be unavailable. The Castor storage system will be stopped at 12:45 UTC before the network break, and restarted once t

15 KB (1,738 words) - 13:30, 24 June 2015
Operations Bulletin 290615

* Highlights: Information System discussion started. Use cases and dependencies will be built up and reviewe * T2 feedback: UK response on Information System: Useful for service discovery; minor VO usage; contains too much informatio

45 KB (5,792 words) - 21:55, 28 June 2015
Operations Bulletin 060715

...setup for the discussion of batch and CE matters in WLCG: project-lcg-gdb-batch at cern.ch. * Highlights: Information System discussion started. Use cases and dependencies will be built up and reviewe

45 KB (5,742 words) - 21:15, 5 July 2015
Tier1 Operations Report 2015-07-15

...s down. Forcing the FTS system to use more typical settings unblocked the system. The backlog had cleared by the following morning. ...on (with grid middleware delivered via CVMFS) has been extended to a whole batch of WNs.

12 KB (1,261 words) - 11:39, 15 July 2015
Operations Bulletin 200715

...ntation proposing a new Task Force] studying the future of the Information System. ...tops the HTC solution until after CHEP. Is there interest in testing other batch systems? Raul mentioned SLURM. There is also SGE and Torque.

46 KB (5,777 words) - 09:41, 20 July 2015
Operations Bulletin 270715

...tops the HTC solution until after CHEP. Is there interest in testing other batch systems? Raul mentioned SLURM. There is also SGE and Torque. ...implement it involves additional complexity and possibly cost. The current system works fine and we therefore see no overriding reason to remove T1-T1 transi

47 KB (5,972 words) - 08:49, 27 July 2015
Tier1 Operations Report 2015-07-29

...iday morning (24th July) there was a warning level in the fire suppression system in the machine room. The cause seems to have been the failure of a PDU feed ...) redirection accessing Castor; Slow file open times using Xroot; and poor batch job efficiencies.

13 KB (1,341 words) - 12:17, 29 July 2015
Tier1 Operations Report 2015-08-12

...) redirection accessing Castor; Slow file open times using Xroot; and poor batch job efficiencies. * Deployed changes to remove glite-CLUSTER node from information system and shutdown cream-ce01 and cream-ce02.

13 KB (1,380 words) - 13:20, 12 August 2015
Tier1 Operations Report 2015-08-19

...ek triggered by the updates to the information provided to the information system by the ARC CEs. This was fixed on Thursday (13th). ...eferred to last week has improved data access rates for the worst cases of batch work (pile-up jobs).

14 KB (1,580 words) - 13:55, 19 August 2015
Operations Bulletin 100815

* Info system: Implementing [https://twiki.cern.ch/twiki/bin/view/EGEE/AllAboutREBUS#REBU ...tops the HTC solution until after CHEP. Is there interest in testing other batch systems? Raul mentioned SLURM. There is also SGE and Torque.

52 KB (6,730 words) - 22:58, 9 August 2015
Operations Bulletin 030815

* Info system: Implementing [https://twiki.cern.ch/twiki/bin/view/EGEE/AllAboutREBUS#REBU ...tops the HTC solution until after CHEP. Is there interest in testing other batch systems? Raul mentioned SLURM. There is also SGE and Torque.

52 KB (6,730 words) - 23:00, 9 August 2015
Operations Bulletin 170815

* Info system: Implementing [https://twiki.cern.ch/twiki/bin/view/EGEE/AllAboutREBUS#REBU ...tops the HTC solution until after CHEP. Is there interest in testing other batch systems? Raul mentioned SLURM. There is also SGE and Torque.

48 KB (6,103 words) - 23:03, 16 August 2015
Operations Bulletin 240815

* Info system: Implementing [https://twiki.cern.ch/twiki/bin/view/EGEE/AllAboutREBUS#REBU * Lydia's document - Setup a system to do data archiving using FTS3

45 KB (5,578 words) - 19:59, 22 August 2015
Operations Bulletin 310815

...ng filer migration. No dates yet but it will affect ARGUS, all CEs and all batch worker nodes (glExec) running GridJobs. The downtime is foreseen to last 1h * Lydia's document - Setup a system to do data archiving using FTS3

41 KB (5,000 words) - 04:11, 1 September 2015
Operations Bulletin 070915

...ng filer migration. No dates yet but it will affect ARGUS, all CEs and all batch worker nodes (glExec) running GridJobs. The downtime is foreseen to last 1h * Lydia's document - Setup a system to do data archiving using FTS3

43 KB (5,351 words) - 16:49, 6 September 2015
Operations Bulletin 140915

...ng filer migration. No dates yet but it will affect ARGUS, all CEs and all batch worker nodes (glExec) running GridJobs. The downtime is foreseen to last 1h * Lydia's document - Setup a system to do data archiving using FTS3

44 KB (5,604 words) - 10:22, 15 September 2015
Operations Bulletin 210915

...ng filer migration. No dates yet but it will affect ARGUS, all CEs and all batch worker nodes (glExec) running GridJobs. The downtime is foreseen to last 1h * Lydia's document - Setup a system to do data archiving using FTS3

44 KB (5,552 words) - 22:25, 19 September 2015
Tier1 Operations Report 2015-09-23

..., following the application of the updated FTS3 software to the production system last week a memory leak was introduced which also caused a set of problems * Updating the first batch of the remaining Castor disk servers (those in tape-backed service classes)

14 KB (1,604 words) - 12:01, 23 September 2015
Operations Bulletin 290915

* Lydia's document - Setup a system to do data archiving using FTS3 ...mblyness with the CEs. However, I understand much of this is caused by the batch farm being busy. There are low-availability tickets 'on hold' for Liverpool

45 KB (5,699 words) - 08:31, 28 September 2015
Operations Bulletin 191015

* Lydia's document - Setup a system to do data archiving using FTS3 ...mblyness with the CEs. However, I understand much of this is caused by the batch farm being busy. There are low-availability tickets 'on hold' for Liverpool

46 KB (5,818 words) - 10:21, 19 October 2015
Operations Bulletin 121015

* Lydia's document - Setup a system to do data archiving using FTS3 ...mblyness with the CEs. However, I understand much of this is caused by the batch farm being busy. There are low-availability tickets 'on hold' for Liverpool

52 KB (6,786 words) - 16:22, 12 October 2015
Operations Bulletin 261015

* We are investigating why LHCB batch jobs sometimes fail to write results back to Castor (and the sometimes fail * Lydia's document - Setup a system to do data archiving using FTS3

47 KB (6,004 words) - 20:08, 25 October 2015
Operations Bulletin 021115

* We are investigating why LHCb batch jobs sometimes fail to write results back to Castor (and they sometimes fail * Lydia's document - Setup a system to do data archiving using FTS3

48 KB (6,098 words) - 09:45, 2 November 2015
Tier1 Operations Report 2015-11-11

* We have been investigating the behaviour of some batch jobs as there is a low level of failures that are not understood. * gdss664 (AtlasTape - D0T1) was removed from service on the 28th Oct. The system was having some problems running some network commands which were resolved

12 KB (1,234 words) - 14:05, 11 November 2015
Tier1 Operations Report 2015-11-04

...ng a disk replacement and updating the firmware in the disk controller the system was re-run through the acceptance testing for 5 days before being returned ...d battery replacement and updating the firmware in the disk controller the system was re-run through the acceptance testing for 5 days before being returned

12 KB (1,253 words) - 10:56, 4 November 2015
Operations Bulletin 091115

* We are investigating why LHCb batch jobs sometimes fail to write results back to Castor (and they sometimes fail * Lydia's document - Setup a system to do data archiving using FTS3

54 KB (7,032 words) - 12:38, 8 November 2015
RAL Tier1 weekly operations castor 30/11/2015

* LHCb batch jobs failing to copy results into castor - changes made seem to have impro ...e that it is just not possible to simulate the behaviour on pre-production system. ACTIONS: RA to ensure the procedure for dealing with any recurrence of thi

5 KB (850 words) - 11:33, 27 November 2015
Operations Bulletin 161115

* We are investigating why LHCb batch jobs sometimes fail to write results back to Castor (and they sometimes fail * Lydia's document - Setup a system to do data archiving using FTS3

48 KB (6,095 words) - 12:37, 16 November 2015
Operations Bulletin 301115

* We are investigating why LHCb batch jobs sometimes fail to write results back to Castor (and they sometimes fail * Lydia's document - Setup a system to do data archiving using FTS3

48 KB (6,163 words) - 16:24, 29 November 2015
RAL Tier1 weekly operations castor 07/12/2015

...- also a possible improvement in configuration, frequent write of an Oracle system log is slow and can be improved by writing to a dedicated area with a diffe * LHCb batch jobs failing to copy results into castor - changes made seem to have impro

6 KB (1,018 words) - 12:25, 4 December 2015
Operations Bulletin 071215

...repository/Technical_Documents/WLCGFutureISUseCases_1.3.pdf An Information System future use cases document has been produced]. * We are investigating why LHCb batch jobs sometimes fail to write results back to Castor (and they sometimes fail

47 KB (5,899 words) - 20:49, 6 December 2015
Operations Bulletin 110116

* Approach for configuring batch systems (e.g. setting up mem limits). ...nical_Documents/WLCGFutureISUseCases_1.6.pdf PDF]). Looking at information system owned by WLCG (an interesting idea). Starting to prepare a Roadmap to GLUE

47 KB (5,834 words) - 10:11, 11 January 2016
RAL Tier1 weekly operations castor 14/12/2015

...- also a possible improvement in configuration, frequent write of an Oracle system log is slow and can be improved by writing to a dedicated area with a diffe * LHCb batch jobs failing to copy results into castor - changes made seem to have impro

7 KB (1,141 words) - 15:00, 11 December 2015
Operations Bulletin 141215

...nical_Documents/WLCGFutureISUseCases_1.6.pdf PDF]). Looking at information system owned by WLCG (an interesting idea). Starting to prepare a Roadmap to GLUE ...repository/Technical_Documents/WLCGFutureISUseCases_1.3.pdf An Information System future use cases document has been produced].

44 KB (5,454 words) - 14:43, 17 December 2015
Operations Bulletin 211215

...nical_Documents/WLCGFutureISUseCases_1.6.pdf PDF]). Looking at information system owned by WLCG (an interesting idea). Starting to prepare a Roadmap to GLUE ...repository/Technical_Documents/WLCGFutureISUseCases_1.3.pdf An Information System future use cases document has been produced].

53 KB (6,852 words) - 11:31, 21 December 2015
Tier1 Operations Report 2016-01-06

... This problem was initially reported at the last meeting as a high rate of batch job failures seen by LHCb since around the 9th December. ...nd all canbemigr files migrated to tape. A faulty disk drive was replaced. System returned to production on Christmas Day!

16 KB (1,824 words) - 12:29, 6 January 2016
Operations Bulletin 281215

...nical_Documents/WLCGFutureISUseCases_1.6.pdf PDF]). Looking at information system owned by WLCG (an interesting idea). Starting to prepare a Roadmap to GLUE ...repository/Technical_Documents/WLCGFutureISUseCases_1.3.pdf An Information System future use cases document has been produced].

53 KB (6,920 words) - 15:52, 4 January 2016
Deployment Team Completed Actions

| Setting of alarms in the GridLoad system. |Various discussions indicated that the only flexible system is for sites to raise events on a case by case basis. Sites should do this

68 KB (11,032 words) - 13:08, 16 September 2016
Past Ticket Bulletins 2016

BIRMINGHAM ticket, regarding small VOs and their batch system. Daniela had to reopen the ticket, which I think has meant it snuck by Mark Small VO acls on the Birmingham batch system, Mark is just getting round to look at this too. In progress (29/11)

150 KB (23,740 words) - 12:54, 9 January 2017
Tier1 Operations Report 2016-01-27

...rver failures we have reviewed the situation - particularly looking at one batch of systems which show very high drive failure rates. ...f service. One disk was showing a lot of errors. That was replaced and the system returned to service the following day (20th Jan).

13 KB (1,350 words) - 10:02, 27 January 2016
RAL Tier1 weekly operations castor 15/01/2016

...- also a possible improvement in configuration, frequent write of an Oracle system log is slow and can be improved by writing to a dedicated area with a diffe * LHCb batch jobs failing to copy results into castor - changes made seem to have impro

7 KB (1,085 words) - 16:11, 18 January 2016
Operations Bulletin 180116

* Approach for configuring batch systems (e.g. setting up mem limits). ...nical_Documents/WLCGFutureISUseCases_1.6.pdf PDF]). Looking at information system owned by WLCG (an interesting idea). Starting to prepare a Roadmap to GLUE

51 KB (6,516 words) - 23:59, 18 January 2016
Operations Bulletin 281116

So far, methods exist for ARC CE, and Torque batch system. Method for VAC still rough and being worked out by (e.g.) ...15th a partial but significant database corruption occurred on the signing system for the CA. Data was restored from (offline) backups but the rebuild was n

46 KB (5,834 words) - 12:12, 28 November 2016
RAL Tier1 weekly operations castor 25/01/2016

configuration, frequent write of an Oracle system log is slow and can be improved by writing to * LHCb batch jobs failing to copy results into castor - changes made seem to have impro

7 KB (1,203 words) - 17:47, 23 January 2016
Operations Bulletin 010216

* Approach for configuring batch systems (e.g. setting up mem limits). * We are investigating why LHCb batch jobs sometimes fail to write results back to Castor (and they sometimes fail

47 KB (5,867 words) - 21:18, 31 January 2016
Tier1 Operations Report 2016-02-10

... (AtlasScratchDisk - D1T0) Failed on Monday 18th Jan with a read-only file system. On investigation three disks in the RAID set had problems. Following a lot ...y LHCb of a low but persistent rate of failure when copying the results of batch jobs to Castor. There is also a further problem that sometimes occurs when

15 KB (1,664 words) - 09:48, 10 February 2016
Operations Bulletin 080216

* Approach for configuring batch systems (e.g. setting up mem limits). * We are investigating why LHCb batch jobs sometimes fail to write results back to Castor (and they sometimes fail

46 KB (5,812 words) - 23:00, 8 February 2016
Operations Bulletin 150216

* We are working on a refresh of the database system behind the LFC. * WLCG Information System Evolution Task Force is drafting refined definitions for LOG_CPU and PHYS_C

54 KB (7,071 words) - 09:10, 15 February 2016
Tier1 Operations Report 2017-04-26

... Limits on concurrent batch system jobs.

14 KB (1,457 words) - 10:30, 26 April 2017
Operations Bulletin 210316

* EGI now has a timeline for deployment of the ARGO central monitoring system. ... LHCb pilot scripts tested in VMs: same pilot scripts can be used on VM or batch sites in multiprocessor slots.

42 KB (5,278 words) - 01:55, 20 March 2016
Tier1 Operations Report 2016-03-23

...etc) were also rebooted at this time and there was a confusion that led to batch jobs not being re-allowed to start until later that evening. ...y LHCb of a low but persistent rate of failure when copying the results of batch jobs to Castor. There is also a further problem that sometimes occurs when

12 KB (1,283 words) - 21:43, 22 March 2016
Operations Bulletin 280316

* EGI now has a timeline for deployment of the ARGO central monitoring system. ... LHCb pilot scripts tested in VMs: same pilot scripts can be used on VM or batch sites in multiprocessor slots.

44 KB (5,455 words) - 02:03, 29 March 2016
Tier1 Operations Report 2016-03-30

* GDSS620 (GenTape - D0T1) Reported a read-only file system on the 15th March and was taken out of production. Two T2K files that were ...y LHCb of a low but persistent rate of failure when copying the results of batch jobs to Castor. There is also a further problem that sometimes occurs when

13 KB (1,394 words) - 11:01, 30 March 2016
Operations Bulletin 040416

* EGI now has a timeline for deployment of the ARGO central monitoring system. ... LHCb pilot scripts tested in VMs: same pilot scripts can be used on VM or batch sites in multiprocessor slots.

44 KB (5,446 words) - 23:24, 3 April 2016
Operations Bulletin 020516

... LHCb pilot scripts tested in VMs: same pilot scripts can be used on VM or batch sites in multiprocessor slots. ...las jobs fail due to a lost heartbeat. Alessandra's digging revealed batch system memory restrictions as the likely culprit, but we can chat about it if it d

46 KB (5,853 words) - 07:32, 9 May 2016
Operations Bulletin 090516

... LHCb pilot scripts tested in VMs: same pilot scripts can be used on VM or batch sites in multiprocessor slots. ...las jobs fail due to a lost heartbeat. Alessandra's digging revealed batch system memory restrictions as the likely culprit, but we can chat about it if it d

46 KB (5,853 words) - 07:33, 9 May 2016
Tier1 Operations Report 2016-05-18

...ring that afternoon full tape access (read & write) was restored. The tape system was left "at risk" over the weekend. ...y LHCb of a low but persistent rate of failure when copying the results of batch jobs to Castor. There is also a further problem that sometimes occurs when

14 KB (1,576 words) - 08:37, 25 May 2016
Tier1 Operations Report 2016-10-19

...squids are multiple-use this had a knock-on effect on CVMFS clients on the batch worker nodes. ...y LHCb of a low but persistent rate of failure when copying the results of batch jobs to Castor. There is also a further problem that sometimes occurs when

16 KB (1,868 words) - 12:28, 19 October 2016
Tier1 Operations Report 2016-05-11

...ch jobs were un-paused. In order to minimise load through the night no new batch jobs were started until the following morning. See blog post at: http://www ...y LHCb of a low but persistent rate of failure when copying the results of batch jobs to Castor. There is also a further problem that sometimes occurs when

15 KB (1,734 words) - 10:46, 11 May 2016
Tier1 Operations Report 2016-06-01

...y LHCb of a low but persistent rate of failure when copying the results of batch jobs to Castor. There is also a further problem that sometimes occurs when | At risk on tape system overnight following problem mounting tapes.

13 KB (1,454 words) - 07:42, 7 June 2016
Operations Bulletin 060616

...CG regarding the [https://indico.cern.ch/event/517084/ use of information system] (vs GOCDB)| [https://indico.cern.ch/event/517084/contributions/2151002/att ...hing. We have put in place various mitigations (e.g. a re-starter) and the system has worked through the weekend. The vendor is coming in tomorrow (Wed) to f

44 KB (5,505 words) - 16:19, 3 June 2016
Tier1 Operations Report 2016-06-15

... - this includes four Tier1 drives physically located in that library. The system ran stably during last night - with this very limited Tier1 tape capacity. ...y LHCb of a low but persistent rate of failure when copying the results of batch jobs to Castor. There is also a further problem that sometimes occurs when

13 KB (1,445 words) - 10:54, 15 June 2016
Tier1 Operations Report 2016-06-08

... weekend, although the control software (which has been running on a spare system) has been crashing a few times per day. Yesterday (Tuesday 7th June) we mov ...y LHCb of a low but persistent rate of failure when copying the results of batch jobs to Castor. There is also a further problem that sometimes occurs when

14 KB (1,602 words) - 11:11, 8 June 2016
Operations Bulletin 130616

...CG regarding the [https://indico.cern.ch/event/517084/ use of information system] (vs GOCDB)| [https://indico.cern.ch/event/517084/contributions/2151002/att ... LHCb pilot scripts tested in VMs: same pilot scripts can be used on VM or batch sites in multiprocessor slots.

51 KB (6,664 words) - 22:21, 13 June 2016
Operations Bulletin 200616

* Batch system monitoring HEPiX working group - contact A Lahiff. ...CG regarding the [https://indico.cern.ch/event/517084/ use of information system] (vs GOCDB)| [https://indico.cern.ch/event/517084/contributions/2151002/att

44 KB (5,489 words) - 22:21, 19 June 2016
Operations Bulletin 040716

* AL: a batch system used entirely by non-LHC users? * Batch system monitoring HEPiX working group - contact A Lahiff.

43 KB (5,335 words) - 16:28, 3 July 2016
Operations Bulletin 270616

* Batch system monitoring HEPiX working group - contact A Lahiff. ...CG regarding the [https://indico.cern.ch/event/517084/ use of information system] (vs GOCDB)| [https://indico.cern.ch/event/517084/contributions/2151002/att

45 KB (5,697 words) - 11:17, 27 June 2016
Tier1 Operations Report 2016-06-29

...m and we have worked closely with the vendor (Oracle). Since that date the system has been stable - with no crashes at all for a week. We do have a reduced n ...n Tuesday 21st June. It is being drained ahead of sorting out the re-named system.

15 KB (1,605 words) - 09:56, 29 June 2016
Operations Bulletin 110716

* AL: a batch system used entirely by non-LHC users? ...ng problems with the tape library control software: We are able to run the system stably but with a reduced number of the Tier1 tape drives enabled. The prob

50 KB (6,425 words) - 22:25, 11 July 2016
Tier1 Operations Report 2016-07-06

...ape drives. Initial results suggest this enables us to stably run the full system, with all tape drives in use, ...y LHCb of a low but persistent rate of failure when copying the results of batch jobs to Castor. There is also a further problem that sometimes occurs when

12 KB (1,248 words) - 12:21, 6 July 2016
Operations Bulletin 180716

* AL: a batch system used entirely by non-LHC users? ... LHCb pilot scripts tested in VMs: same pilot scripts can be used on VM or batch sites in multiprocessor slots.

45 KB (5,664 words) - 20:46, 17 July 2016
Operations Bulletin 250716

... LHCb pilot scripts tested in VMs: same pilot scripts can be used on VM or batch sites in multiprocessor slots. ...er asks if LHCB can take a look as the jobs are consistently hitting batch system limits and wasting CPU resources because of this. Waiting for reply (18/7)

46 KB (5,862 words) - 06:50, 25 July 2016
Operations Bulletin 010816

* The LSST VO has been enabled on the batch system. * Note: Upgrade of Database System behind the LFC on Monday (1st August).

42 KB (5,278 words) - 19:59, 1 August 2016
Tier1 Operations Report 2016-08-03

...y LHCb of a low but persistent rate of failure when copying the results of batch jobs to Castor. There is also a further problem that sometimes occurs when * The 2009 worker nodes are being drained from the batch system ahead of their use as test systems before final decommissioning.

12 KB (1,328 words) - 15:58, 9 August 2016
Operations Bulletin 080816

* The LSST VO has been enabled on the batch system. * Note: Upgrade of Database System behind the LFC on Monday (1st August).

42 KB (5,192 words) - 21:21, 7 August 2016
Operations Bulletin 220816

* The LSST VO has been enabled on the batch system. * The Database System behind the LFC was upgraded (to new hardware) at the start of last week (Mo

45 KB (5,779 words) - 09:19, 22 August 2016
Operations Bulletin 150816

* The LSST VO has been enabled on the batch system. * The Database System behind the LFC was upgraded (to new hardware) at the start of last week (Mo

50 KB (6,392 words) - 07:47, 16 August 2016
Tier1 Operations Report 2016-09-07

* GDSS776 (LHCbDst - D1T0) failed with a read-only file system on Thursday 1st September. It was put back in service the following day - i ...y LHCb of a low but persistent rate of failure when copying the results of batch jobs to Castor. There is also a further problem that sometimes occurs when

13 KB (1,414 words) - 14:58, 7 September 2016
Tier1 Operations Report 2016-09-14

* Atlas reported a problem with the batch system last Friday (9th Sep). It turned out that there was a problem on one partic ...y LHCb of a low but persistent rate of failure when copying the results of batch jobs to Castor. There is also a further problem that sometimes occurs when

12 KB (1,290 words) - 11:29, 14 September 2016
Operations Bulletin 260916

...w on use the [https://operations-portal.egi.eu/downtimes/subscription new system]. ... LHCb pilot scripts tested in VMs: same pilot scripts can be used on VM or batch sites in multiprocessor slots.

49 KB (6,409 words) - 00:39, 26 September 2016
Operations Bulletin 101016

...w on use the [https://operations-portal.egi.eu/downtimes/subscription new system]. ... LHCb pilot scripts tested in VMs: same pilot scripts can be used on VM or batch sites in multiprocessor slots.

54 KB (7,110 words) - 14:59, 10 October 2016
Operations Bulletin 300117

So far, methods exist for ARC CE, and Torque batch system. Method for VAC still rough and being worked out by (e.g.) ...15th a partial but significant database corruption occurred on the signing system for the CA. Data was restored from (offline) backups but the rebuild was n

46 KB (5,756 words) - 13:12, 30 January 2017
Operations Bulletin 211116

So far, methods exist for ARC CE, and Torque batch system. Method for VAC still rough and being worked out by (e.g.) ...15th a partial but significant database corruption occurred on the signing system for the CA. Data was restored from (offline) backups but the rebuild was n

47 KB (6,026 words) - 09:22, 21 November 2016
Tier1 Operations Report 2017-03-15

...nt - and attempted to move VMs to other nodes. It took a few hours for the system to recover. This affected a number of services including BDIIs, FTS nodes a 

15 KB (1,598 words) - 12:18, 15 March 2017
Tier1 Operations Report 2016-11-16

...y LHCb of a low but persistent rate of failure when copying the results of batch jobs to Castor. There is also a further problem that sometimes occurs when * There was a short (one to two hour) interruption to tape mounts while the system that runs the tape library control software was swapped on Tuesday morning

12 KB (1,270 words) - 10:53, 17 November 2016
Tier1 Operations Report 2016-10-26

...esponse a number of services were stopped. In essence we stopped the batch system on Monday (24th Oct). Storage (Castor) was able to continue running. At the ...day evening (19th Oct). This had a knock-on effect on CVMFS clients on the batch worker nodes and for some hours reduced the number of worker nodes availabl

14 KB (1,523 words) - 13:14, 26 October 2016
Operations Bulletin 311016

* US: PNNL LHCONE system outage planned October 17-21 * Some changes were made to increase the number of CMS batch jobs that we run in order to bring the number more into line with the pledg

50 KB (6,401 words) - 08:48, 31 October 2016
Tier1 Operations Report 2016-11-02

...e others that were more exposed - were stopped since the Monday. The batch system and most of the others were brought back up by the end of Wednesday afterno ... modules were swapped over. On re-test the fault had cleared. However, the system crashed on Friday 28th Oct. It was returned to service yesterday (1st Nov).

14 KB (1,581 words) - 17:12, 2 November 2016
Tier1 Operations Report 2017-10-04

... Limits on concurrent batch system jobs.

15 KB (1,509 words) - 10:08, 11 October 2017
Tier1 Operations Report 2016-11-23

...y LHCb of a low but persistent rate of failure when copying the results of batch jobs to Castor. There is also a further problem that sometimes occurs when * There was an intervention on the ECHO Ceph system last week to enable a reconfiguration of its underlying network.

13 KB (1,436 words) - 14:51, 23 November 2016
Tier1 Operations Report 2017-06-07

* We are seeing a high rate of reported disk problems on the OCF '14 batch of disk servers. In some of the cases the vendor finds no fault in the driv 

16 KB (1,833 words) - 13:11, 7 June 2017
Tier1 Operations Report 2016-11-30

...y LHCb of a low but persistent rate of failure when copying the results of batch jobs to Castor. There is also a further problem that sometimes occurs when * There was a restart test of the ECHO Ceph system yesterday; this was to understand how best to do this and set-up appropriat

13 KB (1,400 words) - 14:23, 30 November 2016
Operations Bulletin 051216

* We need to carry out firmware updates on a particular batch of servers - which are in use by Atlas, CMS and LHCb. Will arrange when thi So far, methods exist for ARC CE, and Torque batch system. Method for VAC still rough and being worked out by (e.g.)

44 KB (5,462 words) - 10:47, 6 December 2016
Operations Bulletin 121216

* We need to carry out firmware updates on a particular batch of servers - which are in use by Atlas, CMS and LHCb. Will arrange when thi ...nning HTCondor jobs in Vac/Vcycle compatible VMs: adapted for ATLAS, local batch, and now being tested for ALICE at Manchester using a pool of HTCondor jobs

45 KB (5,745 words) - 13:04, 12 December 2016
Operations Bulletin 090117

So far, methods exist for ARC CE, and Torque batch system. Method for VAC still rough and being worked out by (e.g.) ...15th a partial but significant database corruption occurred on the signing system for the CA. Data was restored from (offline) backups but the rebuild was n

49 KB (6,219 words) - 11:44, 9 January 2017
Operations Bulletin 201216

...ednesday (14th) we will carry out rolling firmware updates on a particular batch of servers - which are in use by Atlas, CMS and LHCb. ...nning HTCondor jobs in Vac/Vcycle compatible VMs: adapted for ATLAS, local batch, and now being tested for ALICE at Manchester using a pool of HTCondor jobs

46 KB (5,848 words) - 09:12, 20 December 2016
Past Ticket Bulletins 2017

The webdav/xroot ticket - after rebuilding the system from scratch and getting help from Dan it looks like xroot still isn't play ...info&ticket_id=130537 130537]) there's an invitation to the VO to test the system. Waiting for reply (13/9)

121 KB (19,081 words) - 12:04, 23 January 2018
Tier1 Operations Report 2016-12-21

...y LHCb of a low but persistent rate of failure when copying the results of batch jobs to Castor. There is also a further problem that sometimes occurs when ...) Firmware updates were applied to the RAID cards in the Clustervision '13 batch of disk servers.

14 KB (1,569 words) - 14:30, 21 December 2016
Operations Bulletin 020117

...ednesday (14th) we will carry out rolling firmware updates on a particular batch of servers - which are in use by Atlas, CMS and LHCb. ...nning HTCondor jobs in Vac/Vcycle compatible VMs: adapted for ATLAS, local batch, and now being tested for ALICE at Manchester using a pool of HTCondor jobs

49 KB (6,317 words) - 09:33, 3 January 2017
Tier1 Operations Report 2017-01-18

...ers but not much activity. At the end of the afternoon the number of ALICE batch jobs was cut back (to 500) as a temporary measure to reduce the load on the ...y LHCb of a low but persistent rate of failure when copying the results of batch jobs to Castor. There is also a further problem that sometimes occurs when

14 KB (1,561 words) - 14:46, 18 January 2017
Tier1 Operations Report 2017-01-11

* GDSS665 (LhcbRawRdst - D0T1) failed on Saturday 31st Dec. Two disks in the system were replaced and it was returned to service on Friday 6th Jan. ...y LHCb of a low but persistent rate of failure when copying the results of batch jobs to Castor. There is also a further problem that sometimes occurs when

14 KB (1,531 words) - 18:01, 17 January 2017
Operations Bulletin 160117

So far, methods exist for ARC CE, and Torque batch system. Method for VAC still rough and being worked out by (e.g.) ...15th a partial but significant database corruption occurred on the signing system for the CA. Data was restored from (offline) backups but the rebuild was n

51 KB (6,530 words) - 08:56, 16 January 2017
Operations Bulletin 230117

So far, methods exist for ARC CE, and Torque batch system. Method for VAC still rough and being worked out by (e.g.) ...15th a partial but significant database corruption occurred on the signing system for the CA. Data was restored from (offline) backups but the rebuild was n

47 KB (5,971 words) - 09:04, 23 January 2017
Tier1 Operations Report 2017-01-25

...y LHCb of a low but persistent rate of failure when copying the results of batch jobs to Castor. There is also a further problem that sometimes occurs when * GDSS780 (LHCbDst - D1T0) crashed at around 8am this morning (Wed 25th Jan). System under investigation.

15 KB (1,614 words) - 14:30, 25 January 2017
Operations Bulletin 070217

So far, methods exist for ARC CE, and Torque batch system. Method for VAC still rough and being worked out by (e.g.) ...15th a partial but significant database corruption occurred on the signing system for the CA. Data was restored from (offline) backups but the rebuild was n

44 KB (5,446 words) - 15:00, 8 February 2017
Operations Bulletin 130217

So far, methods exist for ARC CE, and Torque batch system. Method for VAC still rough and being worked out by (e.g.) ...15th a partial but significant database corruption occurred on the signing system for the CA. Data was restored from (offline) backups but the rebuild was n

49 KB (6,270 words) - 11:57, 13 February 2017
Tier1 Operations Report 2017-02-22

... Limits on concurrent batch system jobs.

14 KB (1,425 words) - 14:24, 22 February 2017
Tier1 Operations Report 2018-01-03

* The number of Atlas batch jobs being run is lower than expected. The batch (Condor) scheduling will be looked at to try and understand and improve thi 

17 KB (1,714 words) - 14:41, 3 January 2018
Operations Bulletin 270217

* Steve: ARC sites are getting a beating from the ARGO monitoring system. Why? * Tests ongoing with some batch jobs for the LHC VOs running in SL6 containers on worker nodes running SL7.

44 KB (5,413 words) - 16:17, 27 February 2017
Tier1 Operations Report 2017-03-01

... Limits on concurrent batch system jobs.

15 KB (1,577 words) - 14:44, 1 March 2017
Tier1 Operations Report 2017-08-16

... Limits on concurrent batch system jobs.

16 KB (1,690 words) - 13:58, 16 August 2017
Operations Bulletin 060317

* Steve: ARC sites are getting a beating from the ARGO monitoring system. Why? * Tests ongoing with some batch jobs for the LHC VOs running in SL6 containers on worker nodes running SL7.

44 KB (5,399 words) - 12:16, 6 March 2017
Operations Bulletin 130317

*** Durham: Batch system upgrade led to one outage and a University-wide internet connection loss le * Steve: ARC sites are getting a beating from the ARGO monitoring system. Why?

42 KB (5,126 words) - 10:15, 13 March 2017
Tier1 Operations Report 2017-03-08

...in the Condor configuration file was not erroneous and had no effect. LHCb batch jobs were not being limited. 

14 KB (1,423 words) - 15:14, 8 March 2017
GridPP5 Tier2 plans

* [https://twiki.cern.ch/twiki/bin/view/LCG/BatchSystemComparison Batch System Comparison Table] * [[Batch system status]]

4 KB (725 words) - 15:50, 4 April 2017
Tier1 Operations Report 2017-03-22

... Limits on concurrent batch system jobs.

14 KB (1,476 words) - 15:53, 22 March 2017
Tier1 Operations Report 2017-03-29

* Some batch job submission errors have been seen by CMS and LHCb. These are not yet und 

16 KB (1,752 words) - 15:22, 29 March 2017
Operations Bulletin 270317

*** Durham: Batch system upgrade led to one outage and a University-wide internet connection loss le * Tests ongoing with some batch jobs for the LHC VOs running in SL6 containers on worker nodes running SL7.

45 KB (5,641 words) - 09:51, 27 March 2017
Operations Bulletin 030417

*** Durham: Batch system upgrade led to one outage and a University-wide internet connection loss le * Tests ongoing with some batch jobs for the LHC VOs running in SL6 containers on worker nodes running SL7.

45 KB (5,539 words) - 09:16, 3 April 2017
Operations Bulletin 170417

...VMs were restarted elsewhere but a small number had a problem. This led to batch (CE and argus) problems during that day. * The number of batch jobs running in SL6 containers on worker nodes running SL7 is being ramped

45 KB (5,594 words) - 03:25, 18 April 2017
Tier1 Operations Report 2017-04-12

...bad state. This included one of the CEs and an argus system. This affected batch jobs submission during the day. The problem was resolved by the oncall team 

15 KB (1,663 words) - 13:18, 12 April 2017
Tier1 Operations Report 2017-04-05

* Some batch job submission errors have been seen by CMS and LHCb. These are not yet und 

14 KB (1,455 words) - 09:30, 5 April 2017
Operations Bulletin 100417

...VMs were restarted elsewhere but a small number had a problem. This led to batch (CE and argus) problems during that day. * The number of batch jobs running in SL6 containers on worker nodes running SL7 is being ramped

45 KB (5,594 words) - 03:26, 18 April 2017
Tier1 Operations Report 2017-04-19

... Limits on concurrent batch system jobs.

15 KB (1,526 words) - 09:57, 23 April 2017
Operations Bulletin 240417

...VMs were restarted elsewhere but a small number had a problem. This led to batch (CE and argus) problems during that day. * The number of batch jobs running in SL6 containers on worker nodes running SL7 is being ramped

45 KB (5,597 words) - 04:19, 23 April 2017
Operations Bulletin 010517

* The number of batch jobs running in SL6 containers on worker nodes running SL7 is being ramped * There's also a bunch of low availability tickets which clog up the system.

44 KB (5,391 words) - 21:07, 1 May 2017
Tier1 Operations Report 2017-05-03

* There is a problem with the UPS system in the R89 computer room. Internal capacitors overheated last Friday (28

16 KB (1,689 words) - 12:44, 4 May 2017
Tier1 Operations Report 2017-05-17

... Limits on concurrent batch system jobs.

16 KB (1,685 words) - 13:08, 17 May 2017
Tier1 Operations Report 2017-05-10

... Limits on concurrent batch system jobs.

17 KB (1,843 words) - 13:14, 10 May 2017
Operations Bulletin 290517

...ble use of the Tier1 by LIGO (batch, Echo storage and cvmfs) and CCP4 (for batch). ...RHEL7), I've written up how we build nodes at Liverpool for our ARC/Condor system.

42 KB (5,118 words) - 09:19, 29 May 2017
Tier1 Operations Report 2017-05-24

...Limits on concurrent batch system jobs.

15 KB (1,676 words) - 14:43, 24 May 2017
Tier1 Operations Report 2017-05-31

...cleaned up manually leaving a handful for debugging purposes. However, the system subsequently also deleted these remaining files without manual intervention 

15 KB (1,685 words) - 13:14, 31 May 2017
Operations Bulletin 050617

* LIGO and the MICE pilot role have now been enabled for batch access. ...RHEL7), I've written up how we build nodes at Liverpool for our ARC/Condor system.

41 KB (5,069 words) - 16:29, 5 June 2017
Tier1 Operations Report 2017-06-14

...t we are seeing a high rate of reported disk problems on one (the OCF '14) batch of disk servers. In some of the cases the vendor finds no fault in the driv 

14 KB (1,456 words) - 10:07, 14 June 2017
Tier1 Operations Report 2017-06-21

...Limits on concurrent batch system jobs.

14 KB (1,526 words) - 10:25, 21 June 2017
Tier1 Operations Report 2017-06-28

...Limits on concurrent batch system jobs.

15 KB (1,649 words) - 09:48, 5 July 2017
Tier1 Operations Report 2017-07-12

...es. These were done ahead of the updating of the remaining systems in this batch planned for next week. 

15 KB (1,670 words) - 15:00, 12 July 2017
Tier1 Operations Report 2017-07-05

...Limits on concurrent batch system jobs.

15 KB (1,595 words) - 14:25, 5 July 2017
Tier1 Operations Report 2017-07-19

... locking sessions and hot-spotting of files was seen. The problem affected batch access to files as well. Since then there have still been some indication o 

16 KB (1,785 words) - 07:47, 26 July 2017
Operations Bulletin 240717

...RHEL7), I've written up how we build nodes at Liverpool for our ARC/Condor system. ...ere doing something weird - which isn't the first time we've seen a dodgy batch of jobs recently. But following the panda link it looks like these errors a

42 KB (5,229 words) - 09:54, 31 July 2017
Operations Bulletin 310717

...RHEL7), I've written up how we build nodes at Liverpool for our ARC/Condor system. ...ere doing something weird - which isn't the first time we've seen a dodgy batch of jobs recently. But following the panda link it looks like these errors a

42 KB (5,229 words) - 09:54, 31 July 2017
Tier1 Operations Report 2017-08-09

* There was a problem with the test FTS3 service on Friday 28th July. The system hit a limit of having done 2 billion file transfers. An emergency update wa 

19 KB (2,046 words) - 13:05, 9 August 2017
Tier1 Operations Report 2017-08-23

...Limits on concurrent batch system jobs.

17 KB (1,707 words) - 09:27, 23 August 2017
Tier1 Operations Report 2017-08-30

...Limits on concurrent batch system jobs.

16 KB (1,671 words) - 16:03, 31 August 2017
Tier1 Operations Report 2017-09-06

...Limits on concurrent batch system jobs.

15 KB (1,513 words) - 13:38, 6 September 2017
RAL Tier1 Incident 20170818 first Echo data loss

...was backfilling data onto a new OSD following the introduction of the last batch of '15 generation storage hardware. This bug was triggered by the primary O ...as it was part of the data re-balancing after the introduction of the last batch of '15 generation storage hardware. This bug was triggered by the primary O

23 KB (3,827 words) - 18:08, 3 October 2017
Tier1 Operations Report 2017-09-27

...Limits on concurrent batch system jobs.

15 KB (1,498 words) - 12:23, 4 October 2017
Tier1 Operations Report 2017-09-20

...Limits on concurrent batch system jobs.

18 KB (1,946 words) - 15:50, 20 September 2017
Tier1 Operations Report 2017-10-18

...Limits on concurrent batch system jobs.

15 KB (1,513 words) - 15:44, 18 October 2017
Tier1 Operations Report 2017-10-11

...Limits on concurrent batch system jobs.

15 KB (1,589 words) - 14:02, 11 October 2017
Tier1 Operations Report 2017-10-25

...Limits on concurrent batch system jobs.

14 KB (1,387 words) - 07:31, 26 October 2017
Tier1 Operations Report 2017-11-22

...Limits on concurrent batch system jobs.

16 KB (1,623 words) - 13:52, 22 November 2017
Tier1 Operations Report 2017-11-01

...d CMS jobs needed to complete before the changes were picked up by all CMS batch jobs. 

14 KB (1,437 words) - 09:24, 7 November 2017
Tier1 Operations Report 2017-11-08

...failed. The physical box was replaced (using one from the Castor 'preprod' system) and CMS Castor service resumed during the afternoon. 

15 KB (1,557 words) - 16:32, 8 November 2017
Tier1 Operations Report 2017-11-15

...Limits on concurrent batch system jobs.

15 KB (1,515 words) - 14:36, 15 November 2017
Tier1 Operations Report 2017-11-29

...Limits on concurrent batch system jobs.

16 KB (1,686 words) - 15:36, 12 December 2017
Tier1 Operations Report 2017-12-13

...ed back. This had left us with some issues in our configuration/deployment system (Quattor/Aquilon) – but those were resolved quickly. We made a plan to ro 

17 KB (1,846 words) - 13:56, 13 December 2017
Tier1 Operations Report 2017-12-06

...ed back. This had left us with some issues in our configuration/deployment system (Quattor/Aquilon) – but those were resolved quickly. We made a plan to ro 

17 KB (1,884 words) - 08:35, 13 December 2017
Tier1 Operations Report 2018-02-27

...Limits on concurrent batch system jobs.

16 KB (1,553 words) - 11:43, 28 February 2018
Tier1 Operations Report 2018-02-20

...Limits on concurrent batch system jobs.

15 KB (1,442 words) - 13:35, 21 February 2018
Tier1 Operations Report 2018-01-24

...chillers failed to restart. It was planned to replace a faulty card in the system this morning (24th Jan) - which should have taken around 30 minutes. However 

17 KB (1,803 words) - 14:26, 24 January 2018
Tier1 Operations Report 2017-12-20

...Limits on concurrent batch system jobs.

16 KB (1,692 words) - 14:14, 20 December 2017
Tier1 Operations Report 2018-01-10

...Limits on concurrent batch system jobs.

15 KB (1,481 words) - 13:08, 17 January 2018
