Oct 17, 2009

Cache integrity vs. website speed

Over the last few days we have been working on improving the very core of the Web Optimizer algorithms: fetching files and checking cache integrity. Why is this so important? Once all cache files have been created (merged JavaScript and CSS, CSS Sprites and data:URI resources, gzipped versions, etc.), the website should load as fast as possible. It's not good if client side optimization wastes server time.

Right now we have reached 10x better performance for the full version (up to 20x with unobtrusive logic and multiple hosts disabled). How is this possible?

General flow

Here is a pie chart of time consumption in the Web Optimizer logic:

Time consumption

This chart is valid for both versions, demo and full, but the logic can be tuned for your website only with the full version. The demo version doesn't have the Performance group of options.

Cache integrity

Why is cache integrity so important? Because we need to be sure that all merged and minified files are up to date. It would be very bad if cache files could be created only once, so that every small change in website design or layout required re-creating the whole website cache. Web Optimizer can check cache integrity on the fly, and does this perfectly.

But there is a large performance cost: on every hit to your website Web Optimizer checks all files listed in the HTML document and re-calculates the checksum. Then it checks whether the corresponding cache files exist. Only then does it serve the optimized content. This is reliable, but excessive. Websites usually go unchanged for months or even years, so we don't need to check these files thousands of times a day.
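To make the per-hit cost concrete, here is a minimal Python sketch of such a check. It is not Web Optimizer's actual code: the function names, the use of MD5 and the ".css" suffix are assumptions made for illustration only.

```python
import hashlib
import os

def cache_name(paths):
    """Hash the contents of every listed file into one cache file name.

    Hypothetical helper: MD5 and the ".css" suffix are illustrative choices.
    """
    digest = hashlib.md5()
    for path in paths:
        with open(path, "rb") as handle:  # one file read per asset, on every hit
            digest.update(handle.read())
    return digest.hexdigest() + ".css"

def needs_rebuild(paths, cache_dir):
    """Return True when no cache file exists for the current file contents."""
    return not os.path.exists(os.path.join(cache_dir, cache_name(paths)))
```

Every call reads each listed file and then touches the cache directory once more, which is exactly the work the options below try to avoid.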

The first point: do not check files changes

We can skip re-calculating the checksum of files' content, which gives about a 2-3x speedup by eliminating very expensive file system calls. The trade-off is that the cache must be cleaned up every time the physical files change. On a live website that doesn't happen often, and it saves 50-70% of server time (spent on Web Optimizer actions).

The "Ignore file modification time stamp" option in the Performance section is responsible for this logic. The difference between this option and the fourth option below (which just skips file system calls) is the following.

If you change file content (e.g. add a few styles to main.css) with this option disabled (file modification time check enabled), Web Optimizer fetches the content of main.css and compares it with the previous one (by checksum). If the checksum is different, a new cache file is created. With this option enabled (file modification time check disabled), Web Optimizer takes into account only the file name (usually all content in the head section), but not its content.
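The difference between the two modes can be sketched in a few lines of Python. This is an illustration under assumptions, not the real implementation: `cache_key` and its `ignore_changes` flag are hypothetical names, and file contents are passed in directly so the example needs no file system.

```python
import hashlib

def cache_key(files, ignore_changes):
    """Build a cache key from an ordered list of (file name, content) pairs.

    With ignore_changes=True only the names are hashed, so editing a file
    leaves the key (and the served cache file) unchanged.
    """
    digest = hashlib.md5()
    for name, content in files:
        digest.update(name.encode("utf-8"))
        if not ignore_changes:
            digest.update(content.encode("utf-8"))
    return digest.hexdigest()

before = [("main.css", "a { color: red }")]
after = [("main.css", "a { color: red } b { color: blue }")]

# Option disabled: a content change produces a new key, hence a new cache file.
print(cache_key(before, False) != cache_key(after, False))
# Option enabled: the key is the same, so the old cache file keeps being served.
print(cache_key(before, True) == cache_key(after, True))
```

This is why the cache has to be cleaned by hand after an edit when the option is enabled.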

The second point: exclude regular expressions

Further investigation into what was slow in the Web Optimizer core logic shed light on a lot of regular expressions. Regular expressions are very good if you need to do something quickly and be sure of the result. But they are also very expensive, and in most cases they can be replaced with raw string functions (e.g. strpos, substr, etc.). Well-formed, standards-compliant websites can be parsed very quickly, so why should Web Optimizer be slow for them? It should be slow only for old-school websites that can't be parsed in a standard way.

This logic is managed by the "Do not use regular expressions" option. This approach saves about 15-20% everywhere in the Web Optimizer core.
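As an illustration of the trade-off, here is a small Python sketch that extracts a src attribute both ways. Python's str.find and slicing play the role of strpos and substr here; the function names and the sample tag are made up for this example.

```python
import re

TAG = '<script type="text/javascript" src="/js/main.js"></script>'

def src_with_regex(tag):
    """Tolerant path: accepts extra spaces and either quote style."""
    match = re.search(r'src\s*=\s*["\']([^"\']+)["\']', tag)
    return match.group(1) if match else None

def src_with_find(tag):
    """Fast path: assumes the attribute is written exactly as src="..."."""
    marker = 'src="'
    start = tag.find(marker)
    if start == -1:
        return None
    start += len(marker)
    end = tag.find('"', start)
    return tag[start:end] if end != -1 else None

print(src_with_regex(TAG))
print(src_with_find(TAG))
```

The string version only handles the well-formed case, which is the point: sloppy markup such as `src = 'x.js'` is not matched by it and would still need the slower regular expression path.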

The third point: quick check

So we have reduced calls to the file system (from dozens to 2-3) and replaced regular expressions with string functions. What else? The next step is to reduce the overall number of logic operations. While fetching all styles and scripts we perform a lot of operations: get tags, get attributes, correct attributes, check options, check values, etc.

All of this can be skipped for a general cache integrity check. Reducing this logic to a minimum (just enough to be sure that we can serve the same cached files for the same pack of styles and scripts) brings an additional 10% in performance. That's not much, but together with the other approaches it is enough to provide the fastest client side optimization solution.

This is the "Check cache integrity only with head" option.
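One way to picture such a quick check is to hash the raw head section instead of parsing every tag in it. The Python sketch below assumes that is roughly what the option does; `head_key` is a hypothetical name and the real logic may differ.

```python
import hashlib

def head_key(html):
    """Hash the raw <head> markup without parsing individual tags.

    Returns None when no head section is found, in which case a caller
    would fall back to the full tag-by-tag check.
    """
    start = html.find("<head")
    if start == -1:
        return None
    end = html.find("</head>", start)
    if end == -1:
        return None
    return hashlib.md5(html[start:end].encode("utf-8")).hexdigest()

page_a = '<html><head><link href="main.css"></head><body>One</body></html>'
page_b = '<html><head><link href="main.css"></head><body>Two</body></html>'

# Same pack of styles and scripts in the head gives the same key.
print(head_key(page_a) == head_key(page_b))
```

Two pages with the same head map to the same cached files regardless of their body, and no attributes or options are inspected along the way.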

The fourth point: reduce even more

OK, but a reasonable question arises: why do we need to perform any calls to the file system at all? The answer is simple: we need to force a cache reload on the client side when the cache file name stays the same (i.e. the same set of scripts whose content has changed, so it must be reloaded on the client side). Cache reload is forced by an additional GET parameter (in the demo version) or a changed file name (with mod_rewrite in the full version). These operations (checking for cache file existence and its mtime) can be avoided if we hard-code a 'version' of our website application. So calls to the file system can be reduced to 0.

But this is generally dangerous: we can't check cache integrity properly, and can serve files which don't exist. This should be done only after all cache files have been created and we are just tuning the server side for the best performance.

This is the "Cache version number" option (a zero value disables it).
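A rough Python sketch of the idea: with a non-zero version the URL is built without touching the file system, otherwise the cache file's mtime is used as the stamp. The function name, the /cache/ prefix and the ".v" file name pattern are assumptions for illustration, not Web Optimizer's real URL scheme.

```python
import os

def cache_url(cache_file, version=0, use_rewrite=False):
    """Build a cache-busting URL for a cache file.

    version > 0: no file system call at all (the hard-coded 'version').
    version == 0: fall back to the file's mtime, which costs a stat call
    and raises OSError if the cache file does not exist.
    """
    stamp = version if version > 0 else int(os.path.getmtime(cache_file))
    base = os.path.basename(cache_file)
    if use_rewrite:
        # Changed file name, to be mapped back by a rewrite rule.
        name, ext = os.path.splitext(base)
        return f"/cache/{name}.v{stamp}{ext}"
    # Additional GET parameter.
    return f"/cache/{base}?{stamp}"

print(cache_url("abc.css", version=7))
print(cache_url("abc.css", version=7, use_rewrite=True))
```

Note that the versioned call succeeds even though abc.css does not exist here, which is precisely the danger described above: nothing verifies that the file being referenced is really in the cache.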

Finally

For now we have the following picture (in relative numbers):

Web Optimizer logic: full version

As you can see, almost all parts are balanced to achieve exceptional server side performance (usually 1 to 5 ms in the full version versus 20 to 100 ms in the demo version). All options are included in nightly builds and will be available in 0.6.3 after complete testing.

3 comments:

  1. > do not check files changes

    Which files are checked for changes?
    All cache files or only local static files the page links to?

    Up to now I always thought that wbo will call the CMS and generate a new cache file if a cache file is older than specified in server side caching.

    In my case I never disabled the mtime check because I have some pages/articles with imported RSS news via PHP and I was afraid that the news will never be updated if I disable the mtime check because the generated will not be updated.
    The same is of course true for dynamic content blocks like 'newest articles'.

    How is mtime check related to server side caching of CMS generated HTML pages?

    > exclude regular expressions

    Congratulations. I love this optimization.

    > reduce even more

    If using a version number isn't it sufficient to clean/purge the cache to create new cache files?

    And again I am confused; which files are included into this version administration?
    All CMS generated files or only the static assets like CSS and layout files?

  2. Option "Don't check files' mtime" affects only static assets. HTML cache deals with its own group of options, Server Side caching.

  3. Well, in short: All options except 'version' activated. I am happy with the result. My Textpattern website reaction feels much more fluid. And it wasn't slow at all before as the Textpattern core offers a very fast CMS.
