An awful lot of files in SVN

2009-06-04

Tobias Schlitt

Derick was always bitching at me for the huge amount of time needed to process the Webdav component's sub-directory when doing releases. We always supposed that the Subversion performance issue here resulted from the Webdav test suite, which contained an awful lot of small test files and quite a few sub-directories. I finally found the time to refactor the tests, and the performance improvement is astonishing.

Long story short: In the eZ Webdav component we use semi-automated acceptance tests alongside the normal unit tests. These test cases ensure compatibility with different clients. Each of them holds captured request and response data from a manual, successful test run with a certain client. Every test case consists of about 20-200 data sets, and until yesterday each set consisted of 5 files.

I have now merged the data of each test run into a single file, so there is only 1 file per client acceptance test. Obviously, this reduced the number of files in the Webdav component's sub-directory drastically:

$ find Webdav.old/ | grep -v svn | wc -l
22629

$ find Webdav.new/ | grep -v svn | wc -l
548

To give you an idea of how many files these were: A checkout of the PHP source tree contains about 51357 files.
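
As a rough sketch of the idea, and explicitly not the conversion actually used for the Webdav tests: assuming a hypothetical layout where each test run is a directory of small files, the files can be concatenated into one file per run, with a marker line in front of each original file. The function name `merge_test_run` and the `=== name ===` marker are made up for this illustration.

```shell
# Sketch only: the directory layout and the "=== name ===" marker are
# assumptions for illustration, not the format used by the Webdav tests.
merge_test_run() {
    run_dir=$1      # directory with the small files (no trailing slash)
    out_file=$2     # merged file to create; must live outside run_dir

    : > "$out_file"
    # Skip .svn directories and sort the listing so the result is stable.
    find "$run_dir" -name .svn -prune -o -type f -print | LC_ALL=C sort |
    while IFS= read -r file; do
        # Marker line carrying the path relative to run_dir.
        printf '=== %s ===\n' "${file#"$run_dir"/}" >> "$out_file"
        cat "$file" >> "$out_file"
    done
}
```

Calling `merge_test_run some/run merged.txt` would then leave one file to version instead of many. Note that this simple version assumes every input file ends with a newline and that file names contain no newlines.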

The speed improvement for a recent checkout of the Webdav sub-directory, compared to a revision before this refactoring, is amazing:

$ time svn co -r10366 http://svn.ez.no/svn/ezcomponents/trunk/Webdav Webdav.old
...
real        8m32.405s
user        1m19.961s
sys         1m48.815s

$ time svn co http://svn.ez.no/svn/ezcomponents/trunk/Webdav Webdav.new
...
real        0m16.707s
user        0m1.028s
sys         0m0.640s

From more than 8 minutes to less than 17 seconds. Furthermore, the size of the checkouts differs drastically:

$ du -ch Webdav.old/ | tail -n1
261M    total

$ du -ch Webdav.new/ | tail -n1
44M     total

From my observation, the size difference results mostly from the reduced amount of meta information Subversion needs to store:

$ find Webdav.old/ -name '.svn' | xargs du -ch | tail -n1
175M        total

$ find Webdav.new/ -name '.svn' | xargs du -ch | tail -n1
24M         total

Additionally, the average file size in the Webdav directory went up quite a bit. The numbers below are disk usage as reported by du, presumably in 1 KiB blocks:

$ find Webdav.old/ -type f | grep -v svn | xargs du -s | \
  awk 'BEGIN { sum = 0 } { sum += $1 } END { print sum " / " NR " = " sum/NR }'
86464 / 22562 = 3.83228

$ find Webdav.new/ -type f | grep -v svn | xargs du -s | \
  awk 'BEGIN { sum = 0 } { sum += $1 } END { print sum " / " NR " = " sum/NR }'
20324 / 497 = 40.8934
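
The pipelines above are quick one-liners. A slightly more careful variant could look like the following sketch. It differs from the commands above in three ways, so its numbers will not match the ones shown: it measures file content in bytes instead of disk usage, it skips only directories named `.svn` (the `grep -v svn` filter also drops any path that merely contains "svn"), and it copes with spaces in file names. The function name `avg_file_size` is made up for this example.

```shell
# Print "<total bytes> / <file count> = <average bytes>" for all regular
# files below a directory, skipping Subversion's .svn metadata directories.
avg_file_size() {
    # wc -c < file prints only the byte count, one line per file.
    find "$1" -name .svn -prune -o -type f \
        -exec sh -c 'for f; do wc -c < "$f"; done' sh {} + |
    awk '{ sum += $1 } END { if (NR) print sum " / " NR " = " sum / NR }'
}
```

It would be called as `avg_file_size Webdav.old/` and `avg_file_size Webdav.new/`. For an empty directory it prints nothing rather than dividing by zero.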

So what is the lesson learned here? Do not store too many small files in an SVN repository if you don't have to.

Disclaimer: No, this does not mean that you should store all your source code in a single file from now on! That is not the point of version control. ;)

A little side note to close: If you store a lot of such information, like generated test data, in SVN, be sure to mark these files as binary. This saves you a lot of junk on your commit mailing list. You can do this using:

$ svn propset 'svn:mime-type' 'application/octet-stream' the/binary/file
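
With many data files, setting the property one file at a time gets tedious. The following is a minimal sketch for doing it in bulk. The function name `mark_binary`, the `SVN` override variable and the `*.ser` pattern in the usage note are assumptions for illustration only; the only Subversion feature it relies on is that `svn propset` accepts several paths at once.

```shell
# Set svn:mime-type on every file below a directory whose name matches a
# pattern. SVN can be overridden (e.g. SVN=echo) to preview the commands
# instead of running them.
mark_binary() {
    dir=$1          # working copy directory to search
    pattern=$2      # file name pattern, quoted so the shell does not expand it
    find "$dir" -name .svn -prune -o -type f -name "$pattern" \
        -exec "${SVN:-svn}" propset svn:mime-type application/octet-stream {} +
}
```

For example, `mark_binary tests/clients '*.ser'` would tag all matching files, and setting `SVN=echo` first only prints what would be run. The property change still has to be committed afterwards.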

Comments

Don't know if that's an option for you guys, but if you use a different protocol, svn+ssh:// instead of http(s)://, you'll see huge performance increases as well! That causes other changes though, as it's not running through Apache and its authentication anymore.

Olly at 2009-06-03

Hi Olly,
thanks for the hint. svn+ssh is not a real option, since we have quite a lot of committers, company internal as well as external contributors. Creating system accounts for all of these is not possible.
However, the general performance of the current setup is quite ok. Just this special case was really annoying.
Regards, Toby

Toby at 2009-06-04

Just for the sake of completeness: Instead of adapting your source code to the VCS, you could also think about switching to a more decent VCS:
http://www.koch.ro/blog/index.php?/archives/112-GIT-vs-SVN-performance-with-eZ-Components.html
Switching to a DVCS would also make packaging for distributions much easier. Currently one can't point to one SVN release in trunk and say: This is version 2009.2.

Thomas Koch at 2009-06-04

Doh! Ever tried storing the var/siteaccess/storage dir from an eZP install in svn? I have really bad memories about that - even doing the "delete+mark as ignored" after the initial checkin operation took ages!

Gaetano at 2009-06-04