We live in a potentially hostile world. Spammers, scammers, hackers and - alas! - script kiddies are after our sites, for all we know. It's bad if - like most people - your site is your personal page. It's humiliating if - like many - it's the internet presence of your company. It's devastating if you are one of those people whose site is their business. Having regular, automated full site backups is a good first step, but they're only good at fixing a disaster after it has happened. Putting restrictions and controls in place (such as firewalls and tough passwords) is essential, but they only protect you as long as they don't fail. As Einstein bluntly put it, "Only two things are infinite, the universe and human stupidity, and I'm not sure about the former". An ingenious hacker, or a stupid script kiddie, might stumble upon a way to bypass your security controls and gain unauthorized access to your site. They can even hack you yesterday and exploit their back door today.
So, what can we do? Sit around, act casual until disaster strikes? No, not at all. What we need is a proactive check of our site files. If anything unusual is added, removed or modified the equivalent of a red alert should go off in our head and force us to take measures to contain and fix the problem before it's too late. It all boils down to an easy way to get a difference between the current state of our site and the last (and also known good) state of our site. This is the question I tried to answer with JoomlaPack SiteDiff.
If you have been using JoomlaPack - or Akeeba Backup - you already know how easy it is to make full site backups with it. If you are also using JoomlaPack Remote - or Akeeba Remote Control - (part of our Native Tools package) you also know how embarrassingly simple it is to schedule your backups and automatically download the backup file to your PC. Now, with so much simplicity involved, it's a shame not to take daily backups of your site, isn't it?
The cool thing with Akeeba Backup - in fact its raison d'être - is that the generated backup archives contain a full dump of your site's files and database. Each archive is a self-contained snapshot of your site at the time the backup was taken. This means that if we can check the contents of two archives against each other we have the information we seek: which files were modified, added or deleted between two points in time. The bad thing is that doing so by hand requires four steps: extracting the archives (a long process which requires lots of space, and deleting the tens of thousands of files afterwards is even slower), producing MD5 sums out of the files (very slow, I have yet to find a decent recursive MD5 sum generator for Windows), comparing the sums (importing text files to OpenOffice.org Calc and running VLOOKUPs is way off the chart for most people) and interpreting the results.
Did you notice a pattern here? Three out of four steps are good candidates for automation. Only the interpretation of results requires human intervention. Even better, the three steps which can be automated are also the most time consuming ones. Wait a minute! Don't I write software which automates laborious tasks? Yes, I do. So, that's how SiteDiff was born.
What does SiteDiff do, anyway?
SiteDiff is a Windows utility which reads two site backup archives, calculates the MD5 sums of their contents (file by file, nothing written to disk), compares the two file / MD5 sums lists and reports in an easy to read format which files are the same, which have been added, deleted or modified. Since it relies on MD5 sums to do all the file comparison it won't get fooled by modified files with the same size and creation/modification date as the originals. SiteDiff will report the truth and nothing but the truth.
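To make the "file by file, nothing written to disk" idea concrete, here is a minimal Python sketch. It is not SiteDiff's actual code: it assumes the backups are plain ZIP files (other archive formats would need their own reader), and the names hash_archive and make_zip are made up for this illustration.

```python
import hashlib
import io
import zipfile


def hash_archive(archive) -> dict[str, str]:
    """Map every file inside a ZIP archive to the MD5 hex digest of its contents.

    `archive` may be a path or a file-like object. Members are streamed in
    chunks, so nothing is extracted to disk.
    """
    digests = {}
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # directories have no contents to hash
            md5 = hashlib.md5()
            with zf.open(info) as member:
                for chunk in iter(lambda: member.read(64 * 1024), b""):
                    md5.update(chunk)
            digests[info.filename] = md5.hexdigest()
    return digests


def make_zip(files: dict[str, bytes]) -> io.BytesIO:
    """Hypothetical helper: build a small in-memory ZIP for demonstration."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    buf.seek(0)
    return buf


if __name__ == "__main__":
    # Two versions of a file with the same size but different contents.
    reference = hash_archive(make_zip({"index.php": b"<?php echo 'a';"}))
    current = hash_archive(make_zip({"index.php": b"<?php echo 'b';"}))
    print(reference["index.php"] != current["index.php"])  # prints True
```

The two demo files have exactly the same size, yet their hashes differ, which is the property that lets a hash-based comparison catch what a size and date check would miss.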
Update! SiteDiff is now part of the standard Akeeba Backup native tools distribution. The latest release is available on AkeebaBackup.com.
What does it look (and work) like?
Glad you asked! Here's what it looks like, when you fire it up:
The top of the application's window allows you to select the two archives which will be compared against each other. The first point in time is loaded from the reference archive, whereas the second point in time is loaded from the current archive. Think of it like this: the reference is the last known good state of our site, the current is the one we are trying to figure out whether it's good or not. You can use the small buttons (which look like open folders) to show a standard Windows file open dialog so as to easily pick the files. Since MD5 hashing is slow as molasses, ticking the "Enable caching of reference data" check box will allow SiteDiff to store a cache file (text format) in the same directory as the reference archive. If you constantly check new backups against the same reference archive, SiteDiff will load the MD5 hashes of the reference archive's contents off the cache file instead of calculating them from scratch, cutting the time required roughly in half.
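The caching idea can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration and not SiteDiff's real behavior: the ".md5cache" file name, the tab-separated "hash, then path" text layout, and the rule that a cache is trusted only if it is at least as new as the archive.

```python
from pathlib import Path
from typing import Callable


def cache_path_for(archive: Path) -> Path:
    """Cache file lives next to the archive (the file name is an assumption)."""
    return archive.with_name(archive.name + ".md5cache")


def save_cache(cache: Path, digests: dict[str, str]) -> None:
    """Write one 'md5<TAB>path' line per file; the format is an assumption."""
    lines = [f"{md5}\t{name}\n" for name, md5 in sorted(digests.items())]
    cache.write_text("".join(lines), encoding="utf-8")


def load_cache(cache: Path) -> dict[str, str]:
    digests = {}
    for line in cache.read_text(encoding="utf-8").split("\n"):
        if not line:
            continue  # skip the empty piece after the final newline
        md5, name = line.split("\t", 1)
        digests[name] = md5
    return digests


def reference_hashes(
    archive: Path,
    compute: Callable[[Path], dict[str, str]],
    use_cache: bool = True,
) -> dict[str, str]:
    """Return the archive's hashes, reusing a cache file when it looks fresh."""
    cache = cache_path_for(archive)
    if (
        use_cache
        and cache.exists()
        and cache.stat().st_mtime >= archive.stat().st_mtime
    ):
        return load_cache(cache)
    digests = compute(archive)  # the slow part: hash every file in the archive
    if use_cache:
        save_cache(cache, digests)
    return digests
```

Only the reference side benefits from the cache, since the current archive is new on every run. That fits the "about half" saving mentioned above. File names containing tabs or newlines would break this simple text layout.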
Clicking the Start button replaces the file edit boxes with two progress bars. As the archives are being processed, they fill up. When processing of both files finishes, a rolling progress bar (properly called a "marquee") is displayed next to the Start button. During this step the two lists of files and MD5 hashes are compared against each other, calculating which files are unmodified, added, removed or changed.
When this final step completes, the Filter and Results panes are unlocked. The Results pane contains a list of all the files found in either archive.
Each file's status is denoted by an icon next to the file name and its color:
- Equals sign (green color). The file is present in both archives and wasn't modified.
- Plus sign (yellow-ish color). The file is present in the current archive but not in the reference archive, i.e. it was added in the meantime between the two backups.
- X sign (gray-ish color). The file is present in the reference archive but not in the current archive, i.e. it was deleted in the meantime between the two backups.
- Not equals sign (red color). The file is present in both archives and its contents were modified in the meantime between the two backups.
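The four statuses above map directly onto a comparison of two "file name to MD5 hash" lists. The following Python sketch shows that logic under the assumption that each list is a plain dictionary. The Status enum and the compare function are illustrative names, not SiteDiff internals.

```python
from enum import Enum


class Status(Enum):
    UNCHANGED = "="   # in both archives, same hash
    ADDED = "+"       # only in the current archive
    REMOVED = "x"     # only in the reference archive
    MODIFIED = "!="   # in both archives, different hash


def compare(reference: dict[str, str], current: dict[str, str]) -> dict[str, Status]:
    """Classify every file that appears in either hash list."""
    result = {}
    for name in sorted(reference.keys() | current.keys()):
        if name not in reference:
            result[name] = Status.ADDED
        elif name not in current:
            result[name] = Status.REMOVED
        elif reference[name] == current[name]:
            result[name] = Status.UNCHANGED
        else:
            result[name] = Status.MODIFIED
    return result


if __name__ == "__main__":
    reference = {"index.php": "aaa", "old.php": "bbb", "logo.png": "ccc"}
    current = {"index.php": "zzz", "logo.png": "ccc", "shell.php": "ddd"}
    for name, status in compare(reference, current).items():
        print(status.value, name)
```

Because the comparison only looks at names and hashes, it costs next to nothing compared to the hashing itself.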
Interpreting the results
The interpretation of the results is the most important part, as it allows us to understand what happens to our site. Obviously, the unmodified (green) files are not important. There is, however, room for interpretation for the rest of the files.
First, you should note that changed files in the installation directory are not important. This is a directory normally not present on your site - otherwise Joomla! doesn't run - but added there by JoomlaPack. If you have upgraded JoomlaPack, or changed the embedded installer, there are going to be a lot of modified, deleted and added files. The database dump files, located inside the installation/sql directory, will always be modified between two backups, especially the main database dump file (joomla.sql). So, we can mostly ignore the contents of this directory. Let's see the other files now.
If you see any modified files, ask yourself two questions:
- Is this a file normally modified over time, e.g. a log file?
- Did I upgrade Joomla! or an extension which contains this file?
If it's a file normally modified over time you can safely ignore it. If it's not, and especially if it's a file containing executable code (e.g. PHP, JS), you have to be very careful. If the reference archive was taken right before the extension upgrade and the current archive right after that, seeing this extension's files - and only this extension's files! - modified is normal. If in doubt, extract the current archive and compare those files against those in the extension's archive.
The same goes for Joomla! upgrades. Especially for Joomla! upgrades you can use the full Joomla! distribution file as the reference archive and your latest site backup as the current archive. There should be no modified or deleted files, only added files. If there are modified or deleted files, someone has tampered with your core Joomla! files, which sounds like having been hacked, doesn't it?
In the case of removed files, you have to ask yourself similar questions. What were those files? If you believe they were part of an essential feature of your site, it sounds very suspicious. You get the picture, right?
Finally, do not overlook added files! If you see executable code files added to your site (e.g. PHP and JS files), or even archives and encoded files, which you didn't put there, it may be a sign of a silent intrusion. This is a nightmare scenario: somebody hacks into your site, places some "back door" and waits a while before exploiting it. The real nightmare of this scenario is that once they activate their exploit you may very well end up chasing your tail: once you think you have "cleaned" your site, the hack comes back. We have all read about this many times in forums and blog posts all over the Internet. So, if you see suspicious files added which you didn't put in there yourself, be alert! Analyze them and act accordingly. Do note that removing them is usually not enough. You also have to identify the point of intrusion to make sure this doesn't happen again.
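The rules of thumb from this section can be turned into a rough first-pass filter. The Python sketch below is only that, a sketch: the status strings, the three buckets and the suffix lists are assumptions chosen for illustration. It cannot replace the human judgment described above. For example, it knows nothing about the upgrades you performed or which files are logs.

```python
from pathlib import PurePosixPath

# Assumptions for illustration: adjust these to your own site.
IGNORED_PREFIXES = ("installation/",)      # added by the backup, not part of the live site
EXECUTABLE_SUFFIXES = {".php", ".js"}      # code that can run on the server or in the browser
ARCHIVE_SUFFIXES = {".zip", ".gz", ".tar"}  # unexpected archives are worth a look, too


def triage(results: dict[str, str]) -> dict[str, list[str]]:
    """Sort comparison results into 'ignore', 'review' and 'urgent' buckets.

    `results` maps a file path to one of the hypothetical status strings
    'unchanged', 'added', 'removed' or 'modified'.
    """
    report = {"ignore": [], "review": [], "urgent": []}
    for name, status in sorted(results.items()):
        suffix = PurePosixPath(name).suffix.lower()
        if status == "unchanged" or name.startswith(IGNORED_PREFIXES):
            report["ignore"].append(name)
        elif suffix in EXECUTABLE_SUFFIXES:
            # Added, removed or modified code: look at these first.
            report["urgent"].append(name)
        elif status == "added" and suffix in ARCHIVE_SUFFIXES:
            report["urgent"].append(name)
        else:
            report["review"].append(name)
    return report


if __name__ == "__main__":
    sample = {
        "installation/sql/joomla.sql": "modified",
        "index.php": "unchanged",
        "images/x.php": "added",
        "logs/error.log": "modified",
        "tmp/payload.zip": "added",
    }
    for bucket, names in triage(sample).items():
        print(bucket, names)
```

Files in the "urgent" bucket are not necessarily malicious. Right after an extension upgrade, that extension's modified PHP files will land there too. The bucket just tells you where to start reading.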
Updates
After this article was written, SiteDiff received an overhaul and was made a part of the Akeeba Backup native tools, rebranded as Akeeba SiteDiff. Stable and beta releases are always available in the AkeebaBackup.com download area. The Native Tools version is about 100x faster than the pre-release version originally posted here. Comparing two 200 MB archives with 11,000 files each should now take approximately 20 seconds!