An awful lot of files in SVN
Derick was always bitching at me about the huge amount of time needed to process the Webdav component's sub-directory when doing releases. We always suspected that the Subversion performance issue resulted from the Webdav test suite, which contains an awful lot of small test files and quite a few sub-directories. I finally found the time to refactor the tests, and the performance improvement is astonishing.
Long story short: In the eZ Webdav component we use semi-automated acceptance tests alongside the normal unit tests. These test cases ensure compatibility with different clients. Each of them holds captured request and response data from a manual, successful test run with a certain client. Every test case consists of about 20-200 data sets, and until yesterday each set consisted of 5 files.
I have now merged the data of each test run into a single file, so there is only 1 file per client acceptance test. Obviously, this reduced the number of files in the Webdav component's sub-directory drastically:
$ find Webdav.old/ | grep -v svn | wc -l
22629
$ find Webdav.new/ | grep -v svn | wc -l
548
To give you an idea of how many files these were: A checkout of the PHP source tree contains about 51357 files.
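The counts above include directories and drop every path that merely contains the string "svn". As a minimal sketch of a stricter count, the following prunes the .svn directories and counts regular files only. The helper name count_files is made up for illustration, and the numbers it prints will not match the ones quoted above exactly:

```shell
# Count regular files below a directory, skipping Subversion's .svn metadata.
# count_files is a hypothetical helper name, not part of any tool.
count_files() {
    # -prune stops find from descending into .svn; only regular files are printed.
    find "$1" -name .svn -prune -o -type f -print | wc -l
}
```

It would be called as count_files Webdav.old/ and count_files Webdav.new/ respectively.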
The speed improvement for a checkout of the current Webdav sub-directory, compared to a revision before this refactoring, is amazing:
$ time svn co -r10366 http://svn.ez.no/svn/ezcomponents/trunk/Webdav Webdav.old
...
real 8m32.405s
user 1m19.961s
sys 1m48.815s
$ time svn co http://svn.ez.no/svn/ezcomponents/trunk/Webdav Webdav.new
...
real 0m16.707s
user 0m1.028s
sys 0m0.640s
From more than 8 minutes to less than 17 seconds. Furthermore, the size of the checkouts differs drastically:
$ du -ch Webdav.old/ | tail -n1
261M total
$ du -ch Webdav.new/ | tail -n1
44M total
From my observation, the size difference results mostly from the reduction of meta information that Subversion needs to store:
$ find Webdav.old/ -name '.svn' | xargs du -ch | tail -n1
175M total
$ find Webdav.new/ -name '.svn' | xargs du -ch | tail -n1
24M total
Additionally, the average file size in the Webdav directory went up quite a bit (the numbers are in the block units du reports, not bytes):
$ find Webdav.old/ -type f | grep -v svn | xargs du -s | \
awk 'BEGIN { sum = 0 } { sum += $1 } END { print sum " / " NR " = " sum/NR }'
86464 / 22562 = 3.83228
$ find Webdav.new/ -type f | grep -v svn | xargs du -s | \
awk 'BEGIN { sum = 0 } { sum += $1 } END { print sum " / " NR " = " sum/NR }'
20324 / 497 = 40.8934
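Since du reports allocated disk blocks rather than byte lengths, very small files get rounded to whatever the file system allocates for them. As a rough sketch, the same average can be taken over the byte sizes instead. This assumes GNU find, because -printf is not part of POSIX, and avg_file_size is a made-up helper name:

```shell
# Average apparent file size in bytes, ignoring .svn directories.
# Assumes GNU find (-printf is a GNU extension); avg_file_size is a hypothetical name.
avg_file_size() {
    find "$1" -name .svn -prune -o -type f -printf '%s\n' |
        awk '{ sum += $1 }
             END {
                 if (NR > 0) printf "%d / %d = %.2f\n", sum, NR, sum / NR
                 else print "no files"
             }'
}
```

The output has the same "sum / count = average" shape as the awk one-liner above, only in bytes.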
So, what is the lesson learned here? Do not store too many small files in an SVN repository if you don't have to.
Disclaimer: No, this does not mean that you should store all your source code in a single file from now on! That is not the point of version control. ;)
A little side note as the closing word: If you store a lot of such information, like generated test data, in SVN, be sure to mark these files as binary. This saves you a lot of junk on your commit mailing list, since Subversion does not show content diffs for files marked as binary. You can do this using:
$ svn propset 'svn:mime-type' 'application/octet-stream' the/binary/file
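With many data files, setting the property one by one gets tedious. svn propset accepts several paths at once, so a sketch like the following can hand it everything below a directory. The helper name mark_binary and the SVN override variable are assumptions for illustration, and the files must already be under version control for propset to accept them:

```shell
# Mark every regular file below a directory as binary.
# mark_binary and the SVN variable are hypothetical; SVN defaults to the real svn client.
mark_binary() {
    # Skip the .svn metadata directories and pass all remaining files to propset in batches.
    find "$1" -name .svn -prune -o -type f \
        -exec "${SVN:-svn}" propset svn:mime-type application/octet-stream {} +
}
```

Calling it as SVN=echo mark_binary the/data/directory gives a dry run that only prints the svn command lines instead of executing them.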