Friday 23 November 2007

And now, for CPS...

Something truly amazing has started happening, as Backup Exec is *still* running, 6 days in, and it looks like we may have turned a corner with it in terms of making it think, act and work like a competent backup solution.

As a result we've been able to look at other things. My colleague tried the "CPS" (Continuous Protection Server) part of Backup Exec some time ago and found it was pretty good stuff - it worked straight out of the box - so we figured we'd give it a proper go. After all, it would be more than a little useful if we could just replicate data from one site to another as part of a backup system, as it would boost our protection against failure of one of our key servers.

Installation of the CPS Server was easy, and worked first time. However, getting it to talk to our remote server was a little more difficult, as it's on a different subnet, and, as a bonus, the CPS Server also happens to have multiple NICs on different networks. To annoy us, CPS decided to pick the wrong IP to bind to (and there's no indication it will do this, nor any GUI to choose it).

It's OK though, easily fixed - with a famous registry edit to set a "PreferredAddress" on the CPS Server, and, on the machine being copied, changing its "Gateway" address for the Veritas software to be the IP of the CPS machine rather than its host name (even though it resolves to the same address).
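
For what it's worth, the "PreferredAddress" change is just an ordinary string value in the registry. Here's a minimal sketch in Python (standard winreg module) of what setting it looks like - note that the key path and the IP below are placeholders for illustration, not gospel; check Symantec's own article for the correct location on your version, and export the key before changing anything.

    # Minimal sketch: set a "PreferredAddress" string value so CPS binds to the
    # NIC you actually want. The key path below is an ASSUMPTION for illustration
    # only - confirm the real location before touching a production server.
    import winreg

    KEY_PATH = r"SOFTWARE\VERITAS\Backup Exec CPS"   # placeholder path - verify first!
    PREFERRED_IP = "192.168.10.5"                    # the address CPS should bind to

    with winreg.CreateKeyEx(winreg.HKEY_LOCAL_MACHINE, KEY_PATH, 0,
                            winreg.KEY_SET_VALUE) as key:
        winreg.SetValueEx(key, "PreferredAddress", 0, winreg.REG_SZ, PREFERRED_IP)

    print("PreferredAddress set to", PREFERRED_IP, "- restart the CPS services to pick it up")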

Right now we have a working CPS job doing the initial copy of our data, after which it should continuously protect... rock on.

Tuesday 20 November 2007

I'm still in a dream

I must be. Because we've still got 3 working Backup Exec Media Servers, CASO is working, backups are working, and, with a couple of exceptions, they all run.

We've got 2 troublesome jobs which fail because they just have issues spitting the files out to the Media Servers - we'll work on those - and one that fails on System State every day despite Windows itself being able to back it up: Backup Exec insists that "a failure occurred reading an object" and later claims that "Shadow Copy Components" is a corrupt file.

So right now we've had 3 days and 14 hours of running backups - you know, we may even get to run test restores, because it's not crashed.

Sunday 18 November 2007

Did we hit April 1st? What has been done with the Real Backup Exec?

In a brave move, given the near 3 day successful run with Backup Exec over last week, and given that I was more interested in seeing Bill Bailey live over the weekend than staring into the "Alerts" list on Backup Exec, I figured we'd just see if it could cope again.

Amazingly, it has. So far. It's now late Sunday evening but everything is running, and, to top it off, I've got 3 media servers online, running, and doing what they're designed for. We've even found time to expand the storage on the smallest of the servers from around 1Tb to 3Tb, ready for when it starts to get a real workload.

Friday 16 November 2007

It was too good to be true.

Backups have run for 2 days, 23 hours, and just crashed. I actually thought we'd make 3 days then.

It appears that "BECatSrv.dll" decided to give up and die.

Now to see if it recovers...

Thursday 15 November 2007

"d" is for "tape"

Something is wrong. Very wrong. Backup Exec is still working. That's over a day. It's not crashed or had a funny 10 minutes where it doesn't work. None of this "server paused" rubbish. Meanwhile, in the world of the admins maintaining it, we've had a good old debate about how the whole thing is supposed to work, and how Symantec interpret "tape" backups when they're actually disk backups.

The only thing we did resolve is that the "d" in Backup Exec 10d means "tape". Yes, it means tape. Which is probably how Symantec would have preferred we kept things.

Eye Tests

I'm starting to think I need an eye test. In our office I'm the only one with good eyesight that doesn't already wear glasses. Amazingly though, I think that may have to change.

My rationale is that this morning I've opened the old Backup Exec status window, yet there isn't a "Server Paused", "Loading Media", "Queued" or other failure in sight, and over the past 24 hours just 13 out of 250-odd jobs have failed - and those 13 are largely "missed" backups, which just means I need to refine the windows so there's enough time for them all to run (I never normally get that far, as it crashes and everything ends up "missed").

So yeah, either my eyes are playing tricks on me, or Backup Exec has performed a minor miracle.

Wednesday 14 November 2007

Making it work: Tip #2

One of my pet hates of Backup Exec is its rather poor disk space management for Backup To Disk Folders. There are 2 approaches to disk space management: the Sane Way and the Stupid Way.

Guess which Backup Exec chooses...

A physical LTO-1 tape has a 100Gb storage capacity and that's that - if you run out, you run out - so you'd expect to be able to set a similar limit on a Backup To Disk folder. But while you can tell Backup Exec how big an individual Backup To Disk FILE is, you can't set a limit on the total space a "disk" can occupy. The only limit is the physical space on the drive itself.

The "workaround" is that if you had a drive, let's say it's 500Gb, and you have 6 Backup to Disk "Folders" on that drive. Set each one to have a Disk "Reserve" of something (the default is "nothing", naturally), let's say 10Gb. When you first use Backup Exec, it'll just keep making "B2Dnnnnn" files, but once you get to <10Gb left on the PHYSICAL drive, the Backup Exec Folders will then theoretically start to be overwritten, preventing the problem of running out of space completely.

Follow that? No? OK, try this:

PHYSICAL DRIVE (we'll call it Drive E:) has 500Gb Space, and it's only being used by Backup Exec.

You have seven folders, each set with a 10Gb "Disk Reserve":

Folder A, Folder B and so on, through to Folder G.

Roll forward 2 weeks...

Folder A has 40Gb
Folder B has 2Gb
Folder C has 19Gb
Folder D has 125Gb
Folder E has 1Gb
Folder F has 28Gb
Folder G has 6Gb
...and the physical drive has 279Gb Free.

Roll forward 4 weeks....

Folder A has 100Gb
Folder B has 10Gb
Folder C has 90Gb
Folder D has 160Gb
Folder E has 2Gb
Folder F has 118Gb
Folder G has 10Gb
...and the physical drive has 10Gb Free.

At some point between the week-two and week-four snapshots, free physical space hit 10Gb, so old "media" was overwritten. That's the theory of how this works.
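
If it helps, here's a tiny back-of-an-envelope model (in Python) of the decision Backup Exec is effectively making - purely illustrative, using the made-up 500Gb drive and 10Gb reserve from the example above, not anything from Symantec's actual code.

    # Illustrative sketch of the "disk reserve" behaviour described above.
    # This is an assumption-based model of the example, not Backup Exec's logic.

    DRIVE_CAPACITY_GB = 500
    DISK_RESERVE_GB = 10     # the per-folder "Disk Reserve" setting

    def free_space(folder_usage_gb):
        """Free physical space left on the drive, given per-folder usage in Gb."""
        return DRIVE_CAPACITY_GB - sum(folder_usage_gb.values())

    def next_action(folder_usage_gb):
        """What (in theory) happens when a job needs another B2D file."""
        if free_space(folder_usage_gb) > DISK_RESERVE_GB:
            return "allocate a new B2Dnnnnn file"
        return "overwrite the oldest overwritable media (if any exists)"

    week4 = {"A": 100, "B": 10, "C": 90, "D": 160, "E": 2, "F": 118, "G": 10}
    print(free_space(week4), "Gb free ->", next_action(week4))
    # 10 Gb free -> overwrite the oldest overwritable media (if any exists)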

Here's the reality:

* Make sure the PHYSICAL storage is MORE than the expected DATA need - that's full backups, storage for daily changes, incrementals, synthetics and so on, plus any other data you have on the same volume (perhaps Catalogs). It pays to have perhaps 50% more storage than you truly need, and more if you can.

* Make sure your storage calculations allow for the retention period. So if you've got a 4-week retention and do a weekly full, you need at least 4x the full backup size plus incremental storage - realistically, maybe 6 times the storage (there's a rough worked example after this list).

* If your overall data storage decreases, don't expect more space to appear on the physical drive. Because Backup Exec treats "media" like a tape, it handles it the same way. If you had 30 tapes and you only needed 20 of them now, the other 10 don't get erased from the earth - they just stay kicking about in your cupboard. Backup Exec keeps your virtual "tapes" in its cupboard. (We'll explain how to manage this better some other day.)

* Regularly monitor your Backup to Disk Folders, particularly once you're getting close to the space limits, because if your reserve is too big and the retention period is longer than your physical space can cope with, you'll start getting failed jobs when there is simply no overwritable media left to use.

* Having "overwriteable" media available is handy in some ways, as it means data beyond the retention period can still be available if the media hasn't yet been overwritten, almost building in a "last chance saloon" retention period, but it is also likely to consume all available space.

* If not all data is equally important, consider having different Backup to Disk Folders, so that more critical data can be given longer actual retention times - but don't forget that overwritable media in one folder can't be claimed by another to add space, so good planning of physical space, folder space and necessary space is required.
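
To put some rough numbers on the retention point above, here's the sort of back-of-an-envelope sum we mean (Python, just for the arithmetic). The figures are example assumptions - substitute your own full backup size and weekly change rate.

    # Rough sizing sketch for a 4-week retention with weekly fulls.
    # Every number here is an example assumption - plug in your own.

    full_backup_gb = 80        # size of one weekly full (example figure)
    weekly_change_gb = 15      # incrementals / daily changes per week (example figure)
    retention_weeks = 4
    headroom = 1.5             # ~50% extra physical space, as suggested above

    minimum_gb = retention_weeks * (full_backup_gb + weekly_change_gb)
    recommended_gb = round(minimum_gb * headroom)

    print(f"Bare minimum for the retention: {minimum_gb} Gb")      # 380 Gb
    print(f"Recommended physical space: {recommended_gb} Gb")      # 570 Gb, roughly 7x one full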

If you follow this guide you can reduce the misery of the lacklustre Backup Volume Management. It won't allow you to have absolute limits for a media set (which you'd no doubt want), but it does stop you just running out of space.

Uptime - 1 day!

A minor miracle has occurred - the Backup Exec services have decided to play nicely together, and as a result the CASO server has now been running for 1 day. It sounds like nothing, the sort of uptime that would have perhaps made a Windows 98 user proud, but in fact, with Backup Exec, a whole day without something crashing is certainly a miracle.

Right now we have 3 servers, all up, all running jobs, all receiving jobs delegated from the CASO box, and barely any failures (a couple of those are genuinely not the fault of Backup Exec). It will never last, surely?

d means "dumb"

I think that's what it means. Dumb as in "poorly thought-out", "stupidly implemented" and so on. At least that's what I believe. It certainly doesn't point to any amazing disk capabilities as they don't work.

What's the point of this 'd' nonsense?

So the 'd' in 10d is supposed to represent 'disk', as in you can now back up to disk rather than tape, but hang on... we've had disk-based backups since v8.6, so what's the beef?

Well, apparently the emphasis is more on the 'staging' side of things, i.e. you can back up disk-to-disk-to-tape. Brilliant, I thought, that's just what we need. We can do the first backup to disk, which is nice and fast, then we can put that media set onto tape and store it offsite, presumably using the new 'duplicate' feature. That way, we've got the data onsite in case we need a quick restore and a duplicate copy offsite for disaster recovery - what more could we want? Even better, it'll all be really easy to do now, because we have 'Policies', 'Selection Lists' and 'Templates' to save time.

NO, NO, NO! What was I thinking?! Did I actually expect the new features to work, as PROMISED? Silly me. That was HOPING for too much. Skip forward to REALITY, which actually means 'disks' are treated exactly the same as tapes. Why? How much more effort would it have been to expect that a 'd' product would have at least some slight comprehension of disk-based media? I'm sorry, but I'd expect a little more intelligence, such as being able to quote the maximum disk size, so that I could actually allocate a quota for a particular device I was backing up. Too much to ask obviously.

And another thing... Am I the only person that thought this new 'duplicate' media set feature would work between different Managed Media Servers? After all, they're connected via a LAN and they have the ability to share files and folders via normal, everyday UNC shares. You can even create a 'duplicate' job and set the source and target devices from different Media Servers - they're all there in the list - but when you try to submit it, it says no. Now that's something that would've been bloody useful. Imagine having a Managed Media Server on another site over a VPN link and being able to automatically duplicate your Media Set to it, so you could relax in the comfort of knowing that you always have an offsite backup. How much more effort would that have been?

Obviously Symantec are very good at painting pictures, but not so good when it comes to implementation. Maybe they should get out of this software game and get into bull**** marketing instead.

A new morning, a new failure...

It's just another average morning, a little nippy outside, terrible morning TV still exists (when will it die?) and I've got a raft of Backup Exec failures on my hands.

It seems that one of our sites has a condition I'll call "fussity", whereby the CASO server will submit the jobs it's supposed to do and, for about 90% of them, it'll just spit the dummy - throwing "Loading Media" at you for a random period (could be 2 mins, could be 20 mins) until eventually it gives up and fails the job. Only for the next one to work. And then the next 9 to fail, and so on... The last time we had a case of "fussity", the only way to stop it being so picky about jobs was to completely uninstall, rename the server, reinstall, re-service-pack, and then re-create your devices, media, jobs and so on. I'm hoping to avoid that this time.

Meanwhile the main site decided it wasn't in the mood last night until around 8:30pm, when I rebooted it. It just kept "recovering" its own jobs. Good stuff.

All in a day's admin of Backup Exec.

Tuesday 13 November 2007

Backup Exec Upgrade Cycle Explained:

[Diagram: the Backup Exec upgrade life cycle]

Here we attempt to explain the Symantec Backup Exec upgrade life cycle. Start at 'Promise' and work your way round clockwise.

A New Strategy

I'm going to try a new strategy. We're going to create new jobs, one by one, for the servers. Each job will back up using old-style Full/Incremental jobs, to the local server, with a basic 4-week rotation. Not even synthetics now, so we can kiss goodbye to even more space. Thanks Backup Exec!

Let's see in a few days whether that works (see the life cycle post below - I'm at the "hope" stage again here...).

Alternatives

Let's see now...

CommVault has good reports from a few people, but has a high price tag, and so far quite annoying sales processes (I hate any product you can't just try without asking nicely, agreeing to countless phone calls etc). Just give me the product, and I'll let you know if it's worth continuing discussions.

ArcServe - if hell freezes over, I guess, given it's supposedly "really good" at backups but just not good at bringing them back again when it matters (you know, the point of backing up...). Mind you, that's not exactly Backup Exec's strength either (mainly because you never get a backup to try restoring from anyhow).

Yosemite - never used it, but it seems to meet the "what it can do" on paper (like Backup Exec then).

Perhaps tin can, string and a 56k modem would help. We could TFTP the files around.

Legal Department at Symantec Please...

Right now I'm calling Symantec "Customer Care" to find out where I send a legal notice. I'm in the UK, so the website handily giving us U.S. addresses isn't ideal - but hell, if it has to go stateside, that's fine.

Why do I want it? Well the software doesn't work, it isn't fit for purpose. I spend more time per day trying to make this stupid system work than makes sense. I'd be better off hiring cheap labour and having them manually run backups.

Anyhow, the "agent" wasn't able to give me any useful help, so I ended up having to just hope that the UK Registered Address will do (it should do):

SYMANTEC (UK) LIMITED
350 BROOK DRIVE
GREEN PARK
READING
BERKSHIRE
RG2 6UH

Queued. You mean "lost the plot"

Can we have a drum roll please...? The reason for our lovely "Queued" status today is that at some point when the CASO box rolled onto its belly and refused to play nicely with all the other devices, it managed to leave a "piece" of "media" in use in most of the Backup to Disk folders. End result - the "maximum concurrent jobs" limit has been reached.

The only fix that seems to work is completely restarting the services (if you just restart Device and Media, the box tends to give up and never run a backup again). However, the trouble with this theory is that there are a couple of jobs running that I want to finish first.
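
For the record, when there's nothing running worth saving, the quickest bounce is just stopping and starting the whole Backup Exec service family in one go. A rough sketch of that (Python driving plain old net stop / net start) is below - the service display names are from memory and may well differ on your install, so treat them as assumptions and check services.msc first.

    # Rough sketch: bounce all the Backup Exec services in one go.
    # The display names below are assumptions from memory - confirm the exact
    # names in services.msc on your media server before relying on this.
    import subprocess

    SERVICES = [
        "Backup Exec Job Engine",
        "Backup Exec Server",
        "Backup Exec Device & Media Service",
        "Backup Exec Agent Browser",
    ]

    def bounce(services):
        # Stop in the order listed, then start in reverse so dependencies come up first.
        for name in services:
            subprocess.run(["net", "stop", name], check=False)
        for name in reversed(services):
            subprocess.run(["net", "start", name], check=False)

    if __name__ == "__main__":
        bounce(SERVICES)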

We could of course up the concurrent jobs on each device but it gets a bit tedious moving the limits up and down every day.

Really of course, it should JUST BLOODY WORK.

I'm queuing in the rain...

I woke up this morning half expecting Backup Exec not to be working, and, as usual, it isn't. There are 2 possible causes for this:

a) My colleague's theory - that Backup Exec works as long as you are physically staring at the Job Monitor - is in fact completely true, and because last night I figured I had other things to do and didn't watch it, I must now suffer its failures.

or

b) (and this is my most likely theory) it's just a crock of craaaaaaap.

I'm now staring at the screen, and, in typical fashion, with no useful or meaningful reason it's just "Queued" for about 15 jobs, all of which have been sitting there for 9 hours. Just queued. No alerts, no reasons. But there is one job running, which has been running for 6 hours and is slowly (very slowly) notching up the byte count.

I am going to completely lose the will to live with this software soon.

Monday 12 November 2007

Snow stopper (chuckle)

It's just gone 8pm, and naturally the backup system has lost the plot. Not that this helps me. It means the last 24 hours of backups have gone south too (or at least the 15-odd jobs that were running have).

Do we know what's wrong? No, of course not, but good old bengine.exe has started chewing 100% CPU and is now consigned to the pit of hell until further notice. No idea if it's working or not - we're not allowed to know, because the CASO box isn't giving status anymore - although I'm sure when I glanced at it just now we had the famous "Server Paused" nonsense again.

Give it an hour or so and I'll be able to say what died this time...

It's Snowing...

Actually it isn't. But it may as well be, as Backup Exec hasn't snuffed it yet. And we're past lunch...

More on "Loading Media"...

The earlier problem with "Loading Media" appears to be a false alarm - while the CASO server (the one which is supposed to mean we can manage things centrally without needing to constantly log in to all the other boxes) swears blind the job status is "Loading Media", logging onto the actual media server reveals it's working away, backing up quite happily.

Yet again the CASO option proves it's worthless.

"Loading Media"...

Fantastic news, it's not even 11am yet, but we've got more trusty errors to keep us busy.

Right now we've got the famous "Loading Media" nonsense. 3 jobs being backed up to 3 different Backup to Disk Folders (you know, tapeless, media-less stuff) are now sitting at "Loading Media". No alerts naturally, no reason for them to do this, but they've all stopped and will undoubtedly now sit there forever, not backing up.

Making it work: Tip #1

As this Backup Exec system has given us so much hell, we've learned a few things you won't find in the manual, so as we think of them, we'll make a note of them for you in case you stumble across this page determined to find fixes to your Backup Exec issues.

My first top tip is this - if you find your server regularly ends up at 100% CPU not long after jobs should have been submitted to the queue via your policy, the chances are you've come across "incapability bug #1": Backup Exec can't handle having LOTS of jobs submitted at once, despite this being impossible to control if you create one policy for a group of servers, which is what's recommended. So split your jobs into lots of smaller policies (no more than 20 jobs/selection lists per policy), starting them a few minutes apart from each other - we did 15 minutes apart, but from experiments tried here you can go closer. Result: "bengine" stops sitting there at 100% CPU and causing the dreaded "Server Paused" fault. There's a rough sketch of the carve-up below.
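
If it helps to picture the carve-up, here's a trivial Python sketch of splitting a server list into policy-sized chunks with staggered start times. The server names, the 60-server count and the 21:00 start are purely example assumptions.

    # Sketch: split a big selection list into policies of <= 20 jobs each,
    # staggering the start times so "bengine" never gets everything at once.
    # Server names and the 21:00 start time are example assumptions.
    from datetime import datetime, timedelta

    servers = [f"SERVER{n:02d}" for n in range(1, 61)]   # 60 servers to protect
    MAX_JOBS_PER_POLICY = 20
    STAGGER = timedelta(minutes=15)
    first_start = datetime(2007, 11, 14, 21, 0)

    for i in range(0, len(servers), MAX_JOBS_PER_POLICY):
        chunk = servers[i:i + MAX_JOBS_PER_POLICY]
        start = first_start + (i // MAX_JOBS_PER_POLICY) * STAGGER
        print(f"Policy {i // MAX_JOBS_PER_POLICY + 1}: starts {start:%H:%M}, "
              f"{len(chunk)} jobs ({chunk[0]} .. {chunk[-1]})")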

A new day and new problems...

Ah, it's a good feeling today - why? Because only 6 jobs have failed (sort of) at the first site over the last 24 hours. Of course, it isn't quite that simple, as most of those were re-run 2-3 times to make them run, and we haven't got any real backups from the other 2 sites to talk about, but it's progress, Symantec-style.

Today's first problem (and it's only 10 past 9...) is that despite rebuilding Site 2, it's now developed a new problem - the first backup job (to one distinct Backup to Disk volume with its own media set) is running, but subsequent jobs joining the queue cause it to move to "Loading Media", and there it just sits... good stuff. So maybe don't schedule jobs either? It's practically a manual setup anyway, I guess, so sitting there like a hawk looking for an opportunity to back up won't hurt.

Sunday 11 November 2007

It's just past 7...

...and the first batch of backups has appeared for the evening. Since I made some changes last week to split the jobs into even more policies - to reduce the chance that the "Backup Exec Job Engine" (or "Ben-Gin" as we call it) will stall - it does seem to cope a little better. So far this hasn't translated into the system running correctly, but it has stopped it failing quite so often around job submission time.

Of course the whole point of policies is defeated by having to create lots of identical policies with slightly different start times (since now any change means changing them all...)... but we'll let that slide. Anything to get Backups to run even once a week would be nice!

A Little History

Since this blog is new, I thought I'd share a bit of history about how we came to be in this position of undesirable misery.

Between the 3 of us, we've been using Backup Exec for many years, back when it was "Veritas Backup Exec". It was a reliable product with a few irritating quirks, but on the whole it got the job of backups done, and we managed a few basic single-server setups for various customers - some to single DAT drives, others to large LTO libraries - but always with the reliability of Backup Exec, if not of the tape drives.

Move forward a few years...

We're faced with an ever increasing number of servers, multiple sites and the need to backup everything regularly, and in many cases the old "one drive per server" setup just wasn't working anymore. So we built some multi-terabyte Backup Exec Media Servers. Armed with plenty of storage, RAID Arrays and lots of shiny new Backup Exec 10d (d for disk don't you know!) licenses, we set out...

Having looked at all the various promised features - synthetic backups, great support for "Disk Based Backups" (something we really wanted) - we figured it would be the wise choice, and, having had no issues with our old v8.6 and v9 installations, didn't expect any problems. Yeah, sure, Symantec had bought Veritas, but the Veritas name was still there and it was the same product, right?

Forward a few months...

Missed and failed backups are par for the course, random errors are the norm, most of the promised features just don't work, or don't work as you'd expect, and some of the most useful features are hindered by completely stupid limitations that render them worthless. Oh, and just before you say "It's OK, we'll just do simple backups", don't expect it to be any easier - they don't work either.

Between us we'll post over the next few weeks about some of the biggest problems this product has, and keep you up to date with the ongoing hell. If you're considering Backup Exec, don't. Try something else. Back up to 20,000 floppy disks, manually copying files. Anything. It will work better.

And so it begins...

So here we are with a completely failing Backup Exec setup. As usual, it's having random fits and failures. This weekend was going quite well, or rather, "well" in backup exec terms.

We had 1 out of 3 media servers running, but the one that was running was happily dealing with jobs it was assigned, until 16:30, and then, for no reason at all as far as we can see, it did the usual "Server Paused" thing. You know, the one where it dumps jobs you've had running for hours for no real reason, and just sits like a lame duck until you reboot it (because nothing else works).

Meanwhile box number 2 is still in "paused" status via the CASO tool, because, hell, asking for it to be enabled causes jobs to just be sent into an infinite loop of "Queued" and "Ready, On Hold" mixed with a load of failures. Good stuff, Symantec.

Box number 3 however is really screwed. We've ended up having to reinstall, and then, when that failed, uninstall and reinstall from scratch. Of course, that's really upset things, because despite following the instructions, the orphaned "managed media server" is now stuck on the CASO box, as is the newly reinstalled one of the same name, and all the non-existent "Backup to Disk" drives are there too, you know, for good measure - and it steadfastly refuses to remove any of them.

No point calling Symantec for support, they never actually have an answer unless it's the most basic of problems.

So what shall we do?

I'm going to uninstall box 2 completely, rename the box on the network so it can't be confused anymore, reinstall Backup Exec 10d, apply Service Pack 4, and then configure it - once that's done I'll try a backup job, and if that runs (yeah, because it's so likely to work first time...) I'll try putting it into managed media server mode so we can actually use it properly.

This is the first post here, but, inspired by a guy who eventually ditched Backup Exec for CommVault and maintained a blog of the miserable journey he went on (see http://y33dave.wordpress.com/), we'll be keeping this Backup Exec Hell blog going...