Time over time, users make an unsurprising feature request on the AkeebaBackup.com forum: “Can you make it so that I can synchronize a live and dev site without a full backup?”. The typical answer they get is “No, because of technical issues”. I was surprised to see that a trending idea in ideas.joomla.org is exactly that – not to mention that it was submitted by one of the people very actively engaged with core Joomla! development. In the hope that anyone cares to read, I am going to make the case against such a feature, proving why it is a Really Bad Idea™.
Disclosure: I have the know-how to create such a feature and make it work on most servers and most sites. I even have code infrastructure in place to easily make it happen, without having to start from scratch. This article is a breakdown of my research and spec notes when I was doing the feasibility study of such a feature. After reading this lengthy article, you'll hopefully understand why I decided to never put it to code and, most likely, agree with my choice too.
What is a sync operation, anyway?
My professors used to say that you have to know the problem you are trying to solve before setting of to solve it. We all nod our head complacently on “being able to sync two sites”. How many of you can define what this means? Are there even two of you who have the same idea about what synchronization means? For some people it really means “get the components I upgraded and send them to the other site”. For other people it means “I've changes some articles and menus and I want them magically transferred to the other site”. For most people it means “huh, I never really thought of that!”.
Rule #1: You get what you ask for.
The corollary to Rule #1 is that if you, as users, do not spec out your expectations, you'll get a system which you might not like or be able to use. Think about the ACL in Joomla! 1.6: it's what we all asked for, a complicated and interwound contraption which can do everything but is virtually impossible to use without a dedicated team. For the sake of the argument, I'll make an assumption here. We'll consider a “full site synchronization” which aims at merging the changes made to the two sites, with the origin site's changes winning over the target site's changes. In other words, if a specific article exists on both sites, the one used to initiate the sync process (origin) will force its own copy to the one receiving the changes (target).
In theory, performing a synchronization consists of two steps: creating a changeset and applying it to the target. It sounds like a child's play, right? Well, it's as easy as saying “Let's go to Mars and get back home for a cold one before dinner”. Right... So, let's take each step apart.
Files are easy. Really.
Creating a changeset is an odd beast. What does a changeset consist of? Well, what does your site consist of? You have files and you have database data. Let's start with those files first. You might think they are trivial to manage, but are they? In the simplest case you could compare the date and time of each file against the last synchronization operation. If the file is newer, it is either added or modified and has to be synchronized to the target site. But what if the file was touched and not really modified? Bummer. We'll need to check the file size, ergo we need a database table to store that information.
Welcome to the world of pain. By creating a database table to store file information you are facing new challenges. First, for each file you need to make a database query to see if it has changed. You're thinking that you can make a single query for all the files in a specific directory. Yes, you can do that, but it comes at a premium: you use up lots of memory and on most shared hosts you'll get a memory outage error. However, if you don't do that some other shared hosts, with a MySQL query limit, will turn belly up and your operation will fail. The solution? Screw some of your users, you can't always win.
Another dire consequence of using a database table is that it, naturally, takes up MySQL storage space. With an average of 1Kb per row and 10,000 files on your typical small site, you're looking in the eyes of 10Mb of database overhead. Dammit, you just killed another segment of shared hosts which only give 10-15Mb of database space and charge extortionist amounts of money for extra space. As with the previous point made, the solution is simple: screw some more of your users.
The good news with using a database table is that you immediately know which files have been deleted. All you have to do is mark the files which you found with the current sync operation's date and time and then run a query against the files whose time stamp is less than that. Of course this implies that you need to run more queries, so you get to kill the shared hosts with MySQL query limits even faster but, hey, we already decided we screw users on those hosts anyway. The real challenge is how you notify the target that these files are gone, albeit it's trivial: use a manifest file or, in other words, a list of files to delete.
Regarding the files which were changed, you could either include the file in its entirety or use a diff file. In my experience, running diff is about ten times slower and comes with unique challenges when applying those changes, so I'd really want to go with the entire file and get done with it. Wastes some space, that's true, but it doesn't require us to screw some folks who are on slow hosts.
Speaking of which, how do you suppose such a process would run? I mean, we are assuming that we're writing a PHP script which has to run within the strict confines of PHP's max_execution_time (varies by host, it can be as low as 3 friggin' seconds). OK, I have the solution to that. It's well tested and it's the step chaining approach I use in Akeeba Backup.
One question which springs to mind is “Can I do the same thing without all this overhead”? Betcha. You can either directly compare the filesystems on the origin and target servers. In fact it's a problem already solved and it's called rsync! Which brings us to:
Rule #2: Do not reinvent the wheel
If the user wanted to use rsync (and deal with the problems of providing access to both filesystems over that protocol) he'd most likely use a dedicated server and already set it up. Why bother reinventing the wheel, if there is a very round one readily available round the corner? So, idea dismissed, we're stuck with our complex file comparison procedure.
Databases are way too complex
Ah, the database data. There are some oddities regarding databases. Databases have tables. Each table has a structure which is not guaranteed to be immutable. Each table has rows, whose data can be modified over time. To make it worse, each table row has columns. So, we have four (4) levels of complexity: tables, table structure, rows and columns. Let's try to tackle each one. Tip: get a pint of your favourite drink before continuing reading.
The first level of complexity is rather trivial. Either a table is there or it's not. All we need to do is to store the names of the tables we had during the last sync operation and look for any changes. New tables will be marked for creation and missing tables will be marked for deletion. Trivial, Dr. Watson, trivial.
Armed with self-confidence from the previous task, we set off to tackle the table structure problem. A table's structure can be changed over time. The first thing we need to know is a complete list of all the properties of the table at the time of the previous sync operation: each column's definition, the database engine used, regular indices, foreign keys, unique indices, collation and charset. Then we have to check them against the current properties. If anything differs (say, a column added, modified or dropped, an index changed, the collation changes) we have to create an ALTER TABLE command. We immediately have to face grave problems:
- Many hosts do not support ALTER TABLE commands. The solution to that is an overkill: create a temporary table with the new structure, migrate the data to the temporary table, drop the original table, create the table with the new structure, migrate the data from the temporary to the new table and drop the temporary table. Even though that works wonders for small-ish tables (up to ca. 1000 rows), it is a guaranteed timeout error on larger tables. So, we have to screw the users whose MySQL setup doesn't allow them to run ALTER TABLE commands.
- We have no idea whatsoever how the change was applied to the database in the first place. Don't laugh, I am dead serious here. Doing something as trivial as changing the charset of a column may result in truncated data or MySQL errors. We simply have no means to detect, anticipate or work around this potential problem. We just have to accept the fact that there is a potentially grave bug buried in our code and forewarn the user. Nobody will read the warning, some will be afflicted by this bug, we won't be able to do about it and we'll have disgruntled users. This brings us to Rule #3, a corollary to Murphy's Law:
Rule #3: If technical limitation are likely to cause a grave operational issue under rare circumstances, the probability of these circumstances happening is proportional to the square of the criticality of the affected data.
I am, unfortunately, very serious about this rule. I've seen it happening many times in the past and it always comes with a high WTF/minute rate (the Universal Code Quality Indicator, according to some) when trying to debug it. Things are getting more desperate along the way, so don't worry about that.
Rows. Millions of rows. This is the very essence of our sites. How can we tell if a row has changed? I'll tell you upfront: with the current database structure in core Joomla! and extensions' tables it's very difficult to virtually impossible. In order to figure out if a row has changed you need to do either of:
- Have an ON UPDATE CURRENT_TIMESTAMP column on all tables. This trick is used by enterprise-grade ERPs and MRPs and high-end CMS solutions. The idea is that every time you update a row (say, an article's hit count changes or you edit a menu item) a column with a standard name, common among all tables is updated with the date and time of that change. This way you can easily detect changed rows by comparing their timestamp with the timestamp of the last sync operation. On the downside, Joomla! can guarantee that only it core tables will have such a column (and restrict itself to running only on MySQL 4.1 or later, which Joomla! 1.6 already does). Adoption of such a column by third party extensions can not be enforced and, certainly, older extensions which are no longer maintained will not include it in their database tables' structure. As a workaround, we can have our code detect the missing column and add it, but this requires to run an ALTER TABLE. Guess what? We already said that this approach doesn't work on some shared hosts, but we've already decided to screw users on those hosts anyway. Furthermore, if the table already has such a timestamp column, we can't add a new one: there can be exactly one such column in any given MySQL table. So, we have to screw a little more users...
- Alternatively, we can store a hash of the contents of each row, i.e. using MD5 or SHA1. The first drawback is that we need a gigantic table to store this information. I've seen too many sites who have hundreds of tables with several thousand rows and dozens of sites with specific tables having several million rows. Let me put it that way: making Akeeba Backup work with such tables was NOT fun. Imagine now having a table with dozens of millions of rows. As you might have thought already, all the drawbacks mentioned in the file table earlier apply to this table as well. To top those drawbacks, selecting rows from such a table requires messing around with MySQL server variables which allow big selects. Big selects are painfully slow even on dedicated servers. You get the idea, it's a bloody train wreck and I am about to make it worse. Even using such a table, needs us to also store a reference back to the original table (so we know which row's hash we're storing). That would be fine if all tables had a unique integer index. But they don't. Some tables have a VARCHAR unique index and many tables (usually the “glue table” in a many-to-many relationship) have no unique indices whatsoever. Some of those tables are core tables, too. This would increase the complexity of our code to handle row changes by an order of magnitude. The result would be something extremely slow and certainly not very fun to maintain.
If you are still reading, you are a very brave individual. Most people have cried out “THIS IS TOO COMPLICATED, I QUIT!” long before. As a reward, I am going to give you more problems to contemplate on.
Columns. Per row. Oh, boy... The thing is that using either of the previous methods a whole row is marked as modified even if only one of the columns has only changes. This can pose a challenge, as the hits column will always* get updated (* if not using caching) when refreshing a page showing that item. Now, if you want to be able to select which columns of each table to synchronize and which to ignore, the code gets overly complex and dead slow. Which reminds me of...
Rule #4: Keep It Simple and Stupid
Come on, admit it. Complex solutions seem sexy but you quickly discover that the sheer amount of complexity makes them unstable and hard to use. My suggestion: don't even discuss about it, treat rows as the smallest division of database data you want to work with.
Delivery format of the changeset
A changeset is worth nothing if you can't send it to the target server in order to apply it. There are two approaches you can use: direct delivery and delivery by archive.
We'll start with direct delivery. This is a “do as you go” approach. Each file which needs to be changed / updated / deleted is synchronized with the target filesystem. This is how rsync works, but it may not be optimal if the target site is a live site. For the duration of the sync process, the live site will be in a “confused” state and interesting things can happen, from getting a 500 error page to losing orders and from full path disclosure issues (if you have stupidly set Error Reporting to Maximum) to accidentally let a user doing something he's not supposed to even contemplate. Same thing goes for the database, with a much bigger warning: all databases changes MUST happen with the target site taken off-line and run them without loading Joomla!'s application or Crap Will Happen™. More specifically, if Joomla! needs to log you in and you have just updated your #__users table but not the ACL tables, it is possible that login will fail and leave your site in a confused state.
Rule #5: Just because you can it doesn't mean you should
Why is that? Here's your China Syndrome scenario (and, frighteningly, very realistic): should a direct delivery sync go awry because of a fatal error in origin or target site, a network connection issue, or remaining bad luck from breaking a mirror six years ago you are stuck with a dead site. You did take a backup before synchronizing, didn't you? What do you mean you didn't? Tough luck, you're screwed: your site is FUBAR.
Then, we have the archive delivery. I would personally opt for that solution. The entire changeset is stored on the disk as a single file, containing both the file and database changes. Should the sync process fail, you can always re-apply it. Isolating the changeset creation and application you ensure that a fatal error on the origin site will not take down the target site as well. This is good design. If you are overly paranoid, you can always keep a backup of anything you change on the target site so that you can roll back the changeset in the case of a failure. This is even better design, at the cost of complicating the changeset application process even more.
Transfer and application
However, we didn't fully tackle two immediate issues arising from those ideas: transferring the changeset and applying it.
Transferring the changeset should ideally be done by FTP. However, most users do not understand what an FTP directory is or why and how it differs from the absolute filesystem directory to the site's root. If you don't believe me, monitor the new posts on AkeebaBackup.com for a week. Even when they finally do get it, it's not always certain that their server will do them this favor. Some hosts need special firewall settings to allow FTP connections to remote sites. Some will not allow FTP connections to sites on the same server (loopback connections). Some others do not support a ton of FTP features, such as passive mode or CHMOD'ing files. Even when all works correctly, the target directory's permissions may not allow for FTP uploads. Dealing with each of those problems is a royal pain in the ass and users expect you to do that, otherwise they'll consider your software to be crap. After all, that's why they are users and we are developers, right? As an alternative, given that the changeset is delivered as an archive, we can instruct the user to upload it to predefined directory and run the application process directly on the target server.
Which begs the next question. How do we launch the changeset application process? We obviously can not run it on the origin server as it doesn't necessarily have access to the filesystem or database of the target site. We wouldn't want to tell the user “log in to the target site and run the process there” either. We also can't use a Joomla! component on the target site, as the application process would leave the site's file or database in a confused state at some point, causing the process to error out with a fatal error. So, we need a new entry point (executable PHP file) to extract the archive, move the files and apply database changes. The major woe is that yet another entry point in your site means that you have yet another failure point and yet another potential security issue.
Assuming that we bite the bullet and create that extra entry point, we have to consider how it is going to work. Do note that this is the most crucial point of the whole design. If you spec out code which is working on anything but a real site, you've dug a big whole in the water. So...
Rule #6: Spec your software for use in the real world
Is it worth it?
Let me count how many times we've deliberately screwed our users due to technical limitations. Four. How many potential failure points which we can't recover from do we have? Too many to count them. Of course, failures can only happen on restrictive shared hosts and by people who have no idea how site synchronization works. Yes, I know that... but who was our target audience for this feature? Was it the enterprise user? No. They have a dedicated IT team and rsync. Was that the knowledgeable web site developer? No, he knows a dozen more ways to do that. Is it the guy with the dedicated server? No, he can just SSH to copy files around and merge database tables. So who is it? The “small guy”, who knows jack about how his site works and is hosted on a dirt cheap shared host because he doesn't know better. So, we're setting up to create a feature which is very likely to fail on the exact systems used the most by its target audience. Huh!
Rule #7: New features must make sense
This seems to be common sense, but it's not as self-explanatory as you might expect. Joomla!, like most prime time Open Source projects, fails miserably in a critical point: user expectations management. Users expect Joomla! to do everything and work flawlessly every time, on every server, under every circumstance. Including a feature with so many gotchas is a genuinely bad idea. Many users will fail to use it properly, many others will think it does something it doesn't and the end result is a bunch of aggravated users who cry “Joomla! sucks”. Of course it's going to be their fault, but not entirely. Joomla! has spoiled them by delivering meaningful features which work consistently across servers. Prime time Joomla! components followed suite. This is the biggest strength and the biggest weakness of Joomla!: he have raised the bar too high. Including a difficult to support feature in the core requires us to lower that bar because a. we don't have enough volunteers to support it b. volunteers do not necessarily have the technical expertise to do meaningful support c. the support volunteers are inherently disconnected from the developers and d. users can not accept that their host's limitations can pose an unsurmountable obstacle to a feature working as they expect.
So, no, I do not consider that feature worth the development and support effort and this is why I vehemently denied implementing it in Akeeba Backup even though I could (see rule #5). It's one of those features that can drag a project behind and do more harm than they can do good. In my humble opinion, the resources of the project should be put to a better use, for example developing a content versioning feature (easier to implement, more useful, more predictable behavior). I also believe that we should start doing some user expectations management. We have to let people know what is possible and, most importantly, what is not. In my experience, explaining why something can not be done works wonders and people appreciate it more than delivering a necessarily half-baked solution.
Feel free to flame me in the comments below. I'm putting my flameproof suit on. ;)