Microsoft Data Protection Heaven and Backup Exec Hell: failed jobs

Saturday, 24 December 2016

DPM: Replica is inconsistent, Error 3106, "system cannot find the path specified" - despite restarting both DPM + Protected Server systems

It's been a while, but we've recently come across a ridiculous error situation with DPM that caused a random subset of Protected Servers to just stop backing up.

Each volume or resource would be in the "Replica is inconsistent" state. You'd play the usual game of running consistency checks, or consistency & synchronisation, and the job would go to "OK" for a while - but the ability to make a new recovery point would be missing (e.g. the "Create a recovery point after synchronizing" option is unavailable), and a short while later the protected volume would return to the "Replica is inconsistent" state.

After much head scratching and monitoring, we realised it was very simple... the installation path for DPM includes a temp folder, for example:

"C:\Program Files\Microsoft System Center 2012\DPM\DPM\Temp"

The folder "MTA" had been removed as part of a clear-up of old temporary resources, and despite this folder not being actively used during a backup, it seems that not having it breaks DPM.

Simply recreate a folder called "MTA" and you'll then find everything is working just fine again - re-run those consistency checks and then make a recovery point with synchronisation and all will be well.
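
If you'd rather script the check than click around in Explorer, here's a minimal sketch in Python. It assumes the "MTA" folder lives directly under the example Temp path quoted above (your install path may well differ), and the function name is just made up for illustration. Run it on the DPM server with rights to write to that folder.

```python
from pathlib import Path

# Assumption: the example Temp path from this post; adjust for your install.
DPM_TEMP = Path(r"C:\Program Files\Microsoft System Center 2012\DPM\DPM\Temp")


def ensure_mta_folder(temp_dir: Path, name: str = "MTA") -> bool:
    """Create temp_dir/name if it is missing.

    Returns True if the folder had to be created, False if it already existed.
    (Hypothetical helper, not part of DPM.)
    """
    target = temp_dir / name
    if target.is_dir():
        return False
    target.mkdir()
    return True


if __name__ == "__main__":
    if not DPM_TEMP.is_dir():
        # Don't guess: if the Temp folder isn't where we assumed, do nothing.
        print(f"Temp folder not found at {DPM_TEMP} - check your DPM install path")
    elif ensure_mta_folder(DPM_TEMP):
        print("MTA folder was missing and has been recreated")
    else:
        print("MTA folder already exists - look elsewhere for the cause")
```

This only puts the folder back; you still need to re-run the consistency checks and create the recovery point yourself.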

Hopefully this will help someone else with a similar issue!

Wednesday, 30 September 2009

Error E000FE30 every day on one server...

...for months. For months I've struggled with a problem on ONE server, that happens to be at a remote site on a different subnet, connected via a WAN VPN Link.

Every day, one or more jobs would fail with Backup Exec errors, mainly E000FE30 - with the useful and generic messages about "communications failure has occurred" and sometimes the "connection lost to the remote agent".

Needless to say, I've spent some time working on this, and tried all sorts. Reconfiguring the system to use a different WAN link to ensure the fault isn't with the WAN. Nothing. Checking to ensure the issue isn't with the server, reinstalling agents, trying all sorts.

I've updated network drivers, checked all sorts of patches etc - but nothing. Still this error - consistently failing jobs.

I even got a colleague to look at it for a fresh pair of eyes and he too tried all sorts. Given the error, we suspected "something" to do with comms, but never found any issue, and in hundreds of tests conducted could never replicate the issue - transferring large files to/fro the server worked fine etc.

Today I found the answer. The "Large TCP Offload" feature on the Network Card. While I've seen plenty of issues with this feature before, you normally see it with terrible throughput on the system in general and so on - but this machine is solid as a rock for everything else.

Still, with the setting off, we have the first complete, full backups in a few weeks... voila!

Top tip for anyone else facing this problem - don't just check the network drivers, but try turning off these features, even if you cannot see this issue at any other time on the machine.

Is this a Backup Exec issue? I'm not sure, but I'm happy to blame it since everything else works just fine.

Wednesday, 14 November 2007

A new morning, a new failure...

It's just another average morning, a little nippy outside, terrible morning TV still exists (when will it die?) and I've got a raft of Backup Exec failures on my hands.

It seems that one of our sites has a condition I'll call "fussity", whereby the CASO server will submit it the jobs it's supposed to do, and, for about 90% of them, it'll just spit the dummy. It throws "loading media" at you for a random period (could be 2 mins, could be 20 mins), then eventually gives up and fails the job. Only for the next one to work. And then the next 9 to fail, and so on.... The last time we had a case of "fussity" the only way to stop it being so picky about jobs was to completely uninstall, rename the server, reinstall, reapply the service pack, and then re-create your devices, media, jobs and so on. I'm hoping to avoid it this time.

Meanwhile the main site figured it wasn't in the mood last night, until around 8:30pm when I rebooted it. It just kept "recovering" its own jobs. Good stuff.

All in a day's admin of Backup Exec.

Tuesday, 13 November 2007

Queued. You mean "lost the plot"

Can we have a drum roll please....? The reason for our lovely "queued" status today is that at some point, when the CASO box rolled onto its belly and refused to play nicely with all the other devices, it managed to leave a "piece" of "media" in use in most of the Backup to Disk folders. End result - the "maximum concurrent jobs" limit has been reached.

The only fix that seems to work is completely restarting all the services (if you just restart Device and Media then the box tends to give up and never run a backup again). However, the trouble with this plan is that there are a couple of jobs running that I want to finish first.

We could of course up the concurrent jobs on each device but it gets a bit tedious moving the limits up and down every day.

Really of course, it should JUST BLOODY WORK.

I'm queuing in the rain...

I woke up this morning half expecting Backup Exec not to be working, and, as usual, it isn't. There are 2 possible causes for this:

a) My colleague's theory that Backup Exec works as long as you are physically staring at the Job Monitor is, in fact, completely true, and therefore because last night I figured I had other things to do and didn't watch it, I must now suffer its failures.

or

b) (and this is my most likely theory) it's just a crock of craaaaaaap.

I'm now staring at the screen, and, in typical fashion, with no useful or meaningful reason it's just "Queued" for about 15 jobs, all of which have been sitting there for 9 hours. Just queued. No alerts, no reasons. But there is one job running, that's been running for 6 hours, and is slowly (very slowly) notching up the byte count.

I am going to completely lose the will to live with this software soon.

Monday, 12 November 2007

More on "Loading Media"...

The earlier problem with "Loading Media" appears to be a false alarm - while the CASO server (the one which is supposed to mean we can manage things centrally without needing to constantly login to all the other boxes) swears blind the job status is "Loading Media", logging onto the actual media server reveals it's working away backing up quite happily.

Yet again the CASO option proves it's worthless.

"Loading Media"...

Fantastic news, it's not even 11am yet, but we've got more trusty errors to keep us busy.

Right now we've got the famous "Loading Media" nonsense. 3 jobs being backed up to 3 different "Backup Folders" (you know, tapeless, medialess stuff) are now sitting at Loading Media. No alerts naturally, no reason for them to do this, but they've all stopped and will undoubtedly now sit there forever not backing up.

A new day and new problems...

Ah it's a good feeling today, why? Because only 6 jobs have failed (sort of) at the first site over the last 24 hours. Of course, it isn't quite that simple, as most of those were re-run 2-3 times to make them run, and we haven't got any real backups from the other 2 sites to talk about, but it's progress Symantec-style.

Today's first problem (and it's only 10 past 9...) is that despite rebuilding Site 2, it's now developed a new problem - the first backup job (to one distinct backup to disk volume with its own media set) is running, but subsequent jobs joining the queue cause the job to move to "loading media" and there it just sits... good stuff. So don't schedule jobs either maybe? It's practically a manual setup anyway I guess, so sitting there like a hawk looking for an opportunity to backup won't hurt.

Sunday, 11 November 2007

It's just past 7...

...and the first batch of backups has appeared for the evening. Since I made some changes last week to split the jobs into even more policies, to reduce the chance that "Backup Exec Job Engine" (or "Ben-Gin" as we call it) will stall, it does seem to cope a little better. So far this hasn't translated into the system running correctly, but it has stopped it failing quite so often around job submission time.

Of course the whole point of policies is defeated by having to create lots of identical policies with slightly different start times (since now any change means changing them all...)... but we'll let that slide. Anything to get Backups to run even once a week would be nice!

A Little History

Since this blog is new, I thought I'd share a bit of history about how we came to be in this position of undesirable misery.

Between the 3 of us, we've been using Backup Exec for many years, back to when it was "Veritas Backup Exec". It was a reliable product with a few irritating quirks, but on the whole it got the job of backups done, and we managed a few basic single server setups for various customers, some to single DAT drives, others to large LTO libraries, but always with the reliability of Backup Exec, if not the tape drives.

Move forward a few years...

We're faced with an ever increasing number of servers, multiple sites and the need to backup everything regularly, and in many cases the old "one drive per server" setup just wasn't working anymore. So we built some multi-terabyte Backup Exec Media Servers. Armed with plenty of storage, RAID Arrays and lots of shiny new Backup Exec 10d (d for disk don't you know!) licenses, we set out...

Having looked at all the various promised features, synthetic backups, great support for "Disk Based Backups" (something we really wanted), we figured it would be the wise choice, and, having had no issues with our old v8.6 and v9 installations, didn't expect any problem. Yeah sure Symantec had bought Veritas, but the Veritas name was still there and it was the same product, right?

Forward a few months...

Missed and Failed Backups are par for the course, random errors are the norm, and most of the promised features just don't work, or don't work as you'd expect, and some of the most useful features are hindered by completely stupid limitations that render the feature worthless. Oh, and just before you say "It's OK, we'll just do simple backups", don't expect it to be any easier - they don't work either.

Between us we'll post over the next few weeks about some of the biggest problems this product has, and keep you up to date with the ongoing hell. If you're considering Backup Exec, don't. Try something else. Backup to 20,000 floppy disks manually copying files. Anything. It will work better.

And so it begins...

So here we are with a completely failing Backup Exec setup. As usual, it's having random fits and failures. This weekend was going quite well, or rather, "well" in Backup Exec terms.

We had 1 out of 3 media servers running, but the one that was running was happily dealing with jobs it was assigned, until 16:30, and then, for no reason at all as far as we can see, it did the usual "Server Paused" thing. You know, the one where it dumps jobs you've had running for hours for no real reason, and just sits like a lame duck until you reboot it (because nothing else works).

Meanwhile box number 2 is still in "paused" status via the CASO tool, because hell, asking it to be enabled causes jobs to just be sent into an infinite loop of "Queued" and "Ready, On Hold" mixed with a load of failures. Good stuff Symantec.

Box number 3 however is really screwed. We've ended up having to reinstall and then, when that failed, uninstall and reinstall from scratch. Of course, that's really upset things, because despite following the instructions, the orphaned "managed media server" is now stuck on the CASO box, as is the new reinstalled one of the same, and so are all the non-existent "Backup to Disk" drives, you know, for good measure, all of which it steadfastly refuses to remove.

No point calling Symantec for support, they never actually have an answer unless it's the most basic of problems.

So what shall we do?

I'm going to uninstall box 2 completely, rename the box on the network so it can't be confused anymore, reinstall, service pack 4 Backup Exec 10d and then configure it - once that's done I'll try a backup job, and if that runs (yeah, because it's so likely to work first time...) I'll try putting it onto managed media server mode so we can actually use it properly.

This is the first post here, but, inspired by a guy (see http://y33dave.wordpress.com/) who maintained a blog of the miserable journey he went on before eventually ditching Backup Exec for CommVault, we'll be keeping this Backup Exec Hell blog going...