Issue 30: Commit performance limited by physical memory size

Reported by Unknown User, Nov 23, 2005

(This entry was imported from the savannah tracker, original 
location: https://savannah.nongnu.org/bugs/index.php?15042)

I have been trying to import a subversion repository into monotone 
(head is 1.5g / 20k files), to compare the two systems.  I created 
scripts that would do:

1. svn update next revision #
2. monotone ls missing, to drop removed files
3. monotone add * for each top-level folder (to add new files)
4. monotone commit

Steps 1-3 are fast.  Step 4 was fast enough until the working folder 
size grew past the available physical memory in the machine (about 
800megs).  So I tried to figure out what the slowness was using 
strace; to import the whole repository in this way would take maybe 
2 weeks of constant disk activity, and in any case saying it'll take 
3+ minutes to commit a simple edit is absolutely ridiculous.

Turns out that monotone is rehashing every single file, which 
guarantees that once the files cannot completely fit into the disk 
cache that commit performance will be ridiculously slow.  I tested 
that it was not an svn-related issue by just adding a new small file 
by hand and committing it, with the same results.

The obvious solution is to store the last commit/update time in the 
MT folder and only stat files to see if their modification time 
indicates they were changed after that time.  Make the complete 
rehashing be a "--force" style option, if the user wants 
it.  The default certainly should not be the extremely slow 
behavior, or at the very least there should actually be a way to do 
a commit quickly.
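
The mtime idea above can be sketched roughly as follows. This is only an illustration in Python, not monotone code: the function names and the way the last commit time is passed in are assumptions, and a plain mtime comparison can miss edits that land within the timestamp granularity.

```python
import hashlib
import os


def changed_since(root, last_commit_time):
    """Return files under root whose mtime is later than last_commit_time.

    Only lstat() is called per file; no file contents are read.
    """
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.lstat(path).st_mtime > last_commit_time:
                changed.append(path)
    return sorted(changed)


def sha1_of(path, chunk_size=8192):
    """Hash one file's contents; only needed for files flagged as changed."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

With something like this, a commit would hash only the paths returned by changed_since(), and a "--force" style option would fall back to hashing everything.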

I also found a few minor performance issues:

A) When monotone reads from the database it does a read of 1k 
followed by a _llseek to the current file position, and repeats 
until the data is read.  I couldn't follow the C++ enough to figure 
out why this is, but it is a significant slowdown (it takes about as 
long to do a syscall as to copy the 1k of data).  Maybe it's needed 
for some sort of database consistency?
B) Data reads to hash the files are done in 8191-byte chunks.  This 
should be 8192 to be in page-sized chunks.  It should really mmap 
files larger than some size though.
C) It's calling lstat on each file twice.  Removing the duplicate 
would save maybe a tenth of a second on a large working folder.
D) It seems to do an "fchmod 0100755" on each file in the 
working folder, which may not be the permissions the user wants 
(have to set umask).
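
As an illustration of item B, here is a rough Python sketch of hashing with 8192-byte reads and switching to mmap above a size threshold. The function name, the 1 MiB threshold, and the use of SHA-1 are assumptions for the example, not a description of what monotone does internally.

```python
import hashlib
import mmap
import os


def hash_file(path, chunk_size=8192, mmap_threshold=1 << 20):
    """Hash a file with chunked reads, or via mmap for larger files."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        if size > 0 and size >= mmap_threshold:
            # Map the whole file read-only and hash it in one pass.
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
                h.update(mm)
        else:
            # Read in chunk_size pieces (8192 rather than 8191).
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
    return h.hexdigest()
```

Both paths produce the same digest; whether mmap is actually faster would have to be measured on the tree in question.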

monotone version:
-----------------
monotone 0.23 (base revision: 
e32d161b8d5cde9f0968998cc332f82f82c27006)
Running on: Linux 2.6.13-rc5-mm1 #2 SMP PREEMPT Mon Oct 24 09:04:47 
EDT 2005 i686

Comment 1 by Unknown User, Nov 23, 2005

Now I see there is a refresh_inodeprints command that does exactly 
this.  So sorry for wasting your time, but it would be a nice touch 
to check the time it takes to scan files for the commit, and print a 
message saying the user might want to enable this.  Also, 
"refresh_inodeprints" tells the user the low-level details 
but not why they would want to do it.  Maybe rename it to something 
like "enable_largetree" and "refresh", which 
could do the same thing.

It still takes quite a while, possibly b/c of the double stat / 
chmod mentioned above.
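
For readers unfamiliar with the idea, a stat-fingerprint cache can be sketched like this in Python. It only illustrates the general technique; the choice of fields and the in-memory dict are assumptions and do not reflect monotone's actual inodeprints format.

```python
import os


def inodeprint(path):
    """Build a cheap fingerprint from stat data, without reading the file."""
    st = os.lstat(path)
    return (st.st_dev, st.st_ino, st.st_mode, st.st_size,
            st.st_mtime_ns, st.st_ctime_ns)


def files_needing_rehash(paths, cache):
    """Return paths whose fingerprint differs from the cached one.

    cache maps path -> fingerprint and is updated in place; a real tool
    would store the new fingerprint only after rehashing succeeded.
    """
    stale = []
    for path in paths:
        fp = inodeprint(path)
        if cache.get(path) != fp:
            stale.append(path)
            cache[path] = fp
    return stale
```

Only the files returned here would need to be read and hashed; everything else costs one lstat.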

Comment 2 by Unknown User, Nov 24, 2005

Indeed, some more automaticity for the inodeprints stuff is on the 
todo list... most people don't have 1.5 gig trees, so the 
speed/safety tradeoff is reasonable :-).

The database code is all in the sqlite library; in particular, I 
believe it is sqlite/os_unix.c, sqlite3OsRead, sqlite3OsWrite, 
sqlite3OsSeek.  I'm sure these could be changed to use pread and 
pwrite if necessary, and possibly even mmap (though that might be a 
larger change).  Do you have numbers we can wave at the sqlite 
people to justify the effort?
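
To make the syscall difference concrete, here is a small Python sketch (Unix only) comparing a seek-plus-read loop with pread. It is not sqlite code; the function names and the 1k block size are assumptions taken from the strace description above, and it only counts calls rather than timing them.

```python
import os


def read_seek_then_read(fd, offset, length, block=1024):
    """Read a range using lseek + read per block; returns (data, syscalls)."""
    data, calls, pos = bytearray(), 0, offset
    end = offset + length
    while pos < end:
        os.lseek(fd, pos, os.SEEK_SET)
        chunk = os.read(fd, min(block, end - pos))
        calls += 2
        if not chunk:  # hit end of file
            break
        data += chunk
        pos += len(chunk)
    return bytes(data), calls


def read_with_pread(fd, offset, length, block=1024):
    """Read the same range using one pread per block; returns (data, syscalls)."""
    data, calls, pos = bytearray(), 0, offset
    end = offset + length
    while pos < end:
        chunk = os.pread(fd, min(block, end - pos), pos)
        calls += 1
        if not chunk:  # hit end of file
            break
        data += chunk
        pos += len(chunk)
    return bytes(data), calls
```

Wrapping each function in a timer over a large database file would be one way to produce the kind of numbers asked for here.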

Looking at it here, double-stat'ing seems to only occur when 
_not_ using inodeprints; should look into it, but it's probably not 
too important, since the stats are probably not the bottleneck in 
that mode anyway.

I can't seem to reproduce the chmod thing.  My guess is that it's 
actually only happening on files that are marked with the 
'executable' attribute?  We're a little over-zealous about applying 
such attributes at the moment; it's an area where cleanup 
discussions are underway.  If this is correct, then umask isn't 
strictly relevant; the make_executable function reads off the 
current mode and adds a+x to it.  I guess technically we should mask 
that against the umask too... hrm.
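
The behaviour described above, plus the possible umask masking, might look like this as a Python sketch. add_exec_bits and current_umask are hypothetical names for illustration; this is not monotone's make_executable.

```python
import os
import stat


def current_umask():
    """Read the process umask (os.umask can only be read by setting it)."""
    mask = os.umask(0)
    os.umask(mask)
    return mask


def add_exec_bits(path, respect_umask=True):
    """Add a+x to the file's current mode, optionally masked by the umask."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    exec_bits = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH
    if respect_umask:
        exec_bits &= ~current_umask()
    os.chmod(path, mode | exec_bits)
    return stat.S_IMODE(os.stat(path).st_mode)
```

For example, a 0644 file under umask 027 would end up 0754 with masking and 0755 without it.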

Please add your email address to this bug?  It is hard to have a 
conversation when it is not clear whether anyone is on the other end.

Comment 3 by Unknown User, Jan 24, 2007

> most people don't have 1.5 gig trees

I have data files that are larger than that, and I would like to 
keep them under version control. Today I tried to commit one of 
those, and my system started thrashing:

TIME  VIRT SWAP  RES CODE DATA  SHR %CPU %MEM S COMMAND
3:30 1203m 404m 799m 5160 1.2g 4020 19.6 70.3 R mtn ci

J.

Comment 4 by Unknown User, Jan 24, 2007

> Please add your email address to this bug?

BTW, I am not the person who originally reported the bug. I just hit 
the problem today and sent the comment before this one.

J.
