Saturday 28 April 2012

Terabyte migrations with minimal downtime

April 2012 sees me at the tail-end of a few terabytes' worth of data migrations for "a large government department" in Australia. The plan was to avoid any significant outage, due to legal requirements and public safety concerns. Enterprise Replication has been the key to the whole process, and quite a successful one at that.

The databases were on 32-bit Solaris, using IDS 10; in fact the largest system was still on 9.4 due to a fatal engine bug that manifested when they last attempted to migrate it to 10 along with the other engines. I understand they were offline for over a week on that occasion, with major disruptions to service.

The target machines are new x86_64 Intel Linux boxes running Red Hat. Clearly an archive-off, restore-on approach could not work across such different platforms, and neither could HDR be used. Performing unloads and reloads, even with assistance from HPL, would have taken a few days on the largest system, keeping it offline for far too long.

A prolonged migration was unacceptable to the site because, although the users can fall back to manual paperwork, their efficiency would have tanked, they would have had to enter days' worth of activity when the system came back online, and they would not have had access to existing data. OK, maybe we could have given them read-only access, but that would have been hellishly complicated because internal auditing causes (you guessed it) updates, which would then need further migration. Further, one of the systems had a legislative imperative to produce a response within 1 hour of a source document being lodged, at any time of the day. Even with ER this one was, shall we say, exhilarating during the cut-over of the database clients and ancillary systems.

The first two systems tended to work as a team, so the aim was to cut over the pair of them at the same time. I'll call them Baby Bear, since I can't think of any suitable Dr Seuss characters off the top of my head...

The smallest engine in the set had about 30 GB of data, none of it blobs, and a very clean, ER-friendly schema with only a handful of tables. The largest tables in the schema were the audit tables, larger than the rest by a factor of at least 10. The second smallest engine had about 56 GB of data, with a fair part of that being blobs. The schema was still quite clean, but some tables did not have primary keys declared explicitly. At least they had properly declared unique indexes that could be converted to primary keys without an outage.

As an aside, for those who don't know, ER requires either that a table has an explicit primary key declared or, with the latest version of IDS, that you alter the table to add hidden shadow columns (a sort of secret serial) which ER then uses as its key. That would mean altering the table, which would mean an outage for the unkeyed tables, so once again, not acceptable for this client.
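
For the curious, the shadow-column variant looks something like the sketch below. The table name is made up, and the ERKEY syntax and hidden column names are simply how I understand the 11.70 feature, so check the manual for your release before relying on it.

    -- Hypothetical table; the alter needs exclusive access, hence the outage.
    ALTER TABLE case_note ADD ERKEY;
    -- This adds hidden shadow columns (ifx_erkey_1/2/3 in 11.70, as I
    -- understand it) that ER can use in place of a real primary key.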

Assuming a table actually has a realistic candidate key, the options vary depending on how it's indexed and constrained. If a table has a unique index declared, it's trivial to declare a primary key which will piggy-back on the index without locking the table for more than a microsecond. If the table has no index on the candidate, then first creating the unique index and then adding the PK constraint is still safe enough on a live system if the table only runs to a few hundred thousand rows. Anything larger and you may have to kick the users off for a few minutes to build the index.
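
In Informix terms the recipe looks something like this; the table, column and constraint names are invented purely for the example, and the key column has to be NOT NULL already:

    -- Case 1: a unique index already exists on the candidate key.
    -- The constraint piggy-backs on that index, so it is near-instant.
    ALTER TABLE case_doc
        ADD CONSTRAINT PRIMARY KEY (doc_id) CONSTRAINT pk_case_doc;

    -- Case 2: no index on the candidate yet. Build it first (this is the
    -- slow part on a big table), then add the constraint as above.
    CREATE UNIQUE INDEX ux_lodgement ON lodgement (lodgement_id);
    ALTER TABLE lodgement
        ADD CONSTRAINT PRIMARY KEY (lodgement_id) CONSTRAINT pk_lodgement;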

Back to the story.

The next engine in line, let's call it Momma Bear, had about 400 GB of data, the majority of that being blobs of documents, videos, recordings, scans, images, etc. The schema had a large number of tables, which is not unusual for any big system, but these were swamped by thousands of permissions tables personalized for each individual user. Fortunately those could be reconstructed when a user logged in, so they didn't have to be migrated. But more on these beasts later... Once again, many tables didn't have primary keys, but they had, or could have, unique indexes, so no outage was required here. The other hassle for this schema was a large number of audit tables that were completely unindexed and therefore not suitable for ER. Tests showed that we could migrate them manually during the few hours of outage that this system could handle over the weekend. Happy Days.
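
For what it's worth, the manual migration of those audit tables needs nothing fancier than dbaccess and flat files; here's a sketch with a made-up table name (HPL would do the same job faster if the volumes demanded it):

    -- On the old engine, in dbaccess (UNLOAD and LOAD are dbaccess
    -- statements, not general SQL): dump the table to a delimited file.
    -- The file is plain ASCII, so the Solaris-to-Linux hop doesn't matter.
    UNLOAD TO 'audit_trail.unl' SELECT * FROM audit_trail;

    -- Copy the file across, then in dbaccess on the new engine:
    LOAD FROM 'audit_trail.unl' INSERT INTO audit_trail;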

And then there's Pappa Bear: 1.7 TB of data on the Solaris box, 1.2 TB of it being blobs, including audit rows holding copies of any blobs that were replaced during normal operations. The total table count was around the "geez that's a lot" mark. Once again, a bunch did not have primary keys but did have unique indexes. Tragically, about two dozen tables with a serial column only had a plain, non-unique index declared on that column, despite the fact that the values were always allocated by the engine. Damn! There was no getting around an outage so that I could put primary keys in place: first the existing indexes had to be dropped, then recreated as unique. Luckily the size of the tables involved meant that the outage took less than a minute, which was a blessing.

NOTE: if you have to do this too, just beware of foreign key constraints between your tables. If any of them rely on the index you are trying to drop, you'll have to drop the FK constraint temporarily and put it back afterwards, as sketched below.
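
The whole dance, with invented table, column and constraint names, goes roughly like this:

    -- 1. Drop any foreign key constraint that relies on the index (per the
    --    note above); it goes back at the end.
    ALTER TABLE audit_doc DROP CONSTRAINT fk_audit_doc;

    -- 2. Replace the plain index on the serial column with a unique one.
    DROP INDEX ix_document_id;
    CREATE UNIQUE INDEX ux_document_id ON document (document_id);

    -- 3. The new primary key piggy-backs on the unique index, so this bit
    --    is quick even on a big table.
    ALTER TABLE document
        ADD CONSTRAINT PRIMARY KEY (document_id) CONSTRAINT pk_document;

    -- 4. Reinstate the foreign key.
    ALTER TABLE audit_doc
        ADD CONSTRAINT FOREIGN KEY (document_id)
            REFERENCES document (document_id) CONSTRAINT fk_audit_doc;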

To compound it all, there were also development, testing and training engines which needed migrating, and sadly the schemas didn't necessarily match up exactly from development through to production. However, these extra engines did give me plenty of opportunity to practice the procedures and mature the scripts quite nicely. In fact it got boring using ER on these engines after the first few, so I tended to do a big-bang migration instead if I felt comfortable with the schema of that particular bear.

Another feature asked for in the tender was to provide disaster recovery capabilities. HDR is a doddle as far as that goes, but don't ever forget that your applications (typically a web server these days) need DR capability too. There's no point having live data if'n y'all can't get to it. Actually there is, because it saves the extra burden of recovery from tape in the event of a disaster, but if you're serious about DR you must also deal with your application machines, client and user connectivity, and probably a whole handful of other services wrapped around it all.

Still, getting an HDR cluster for your engines is a step forward, and as a bonus it allows you to upgrade the engines with zero downtime, because IDS now lets you progressively take down, upgrade and re-insert the engines in the cluster without losing availability.

To be continued...

Or not! In the unlikely case there is somebody hanging out for the next installment, sorry! It's just too long ago now to recall all the interesting tid-bits of advice I was hoping to save for myself and posterity. I should have written it up as I was going along. I guess if I ever do another large ER-based migration I can write it down, but until then the lessons learned will have to remain in my head, decaying slowly over time along with the neurones.
