Reported by Unknown User, Nov 23, 2005
(This entry was imported from the savannah tracker, original location: https://savannah.nongnu.org/bugs/index.php?15042)

I have been trying to import a subversion repository into monotone (head is 1.5g / 20k files), to compare the two systems. I created scripts that would do:

1. svn update to the next revision #
2. monotone ls missing, to drop removed files
3. monotone add * for each top-level folder (to add new files)
4. monotone commit

Steps 1-3 are fast. Step 4 was fast enough until the working folder size grew past the available physical memory in the machine (about 800 megs). So I tried to figure out what the slowness was using strace; to import the whole repository in this way would take maybe 2 weeks of constant disk activity, and in any case saying it'll take 3+ minutes to commit a simple edit is absolutely ridiculous.

Turns out that monotone is rehashing every single file, which guarantees that once the files can no longer fit completely into the disk cache, commit performance will be ridiculously slow. I tested that it was not an svn-related issue by just adding a new small file by hand and committing it, with the same results.

The obvious solution is to store the last commit/update time in the MT folder and only stat files to see if their modification time indicates they were changed after that time. Make the complete rehashing a "--force" style option, if the user wants it. The default certainly should not be the extremely slow behavior, or at the very least there should actually be a way to do a commit quickly.

I also found a few minor performance issues:

A) When monotone reads from the database it does a read of 1k followed by an _llseek to the current file position, and repeats until the data is read. I couldn't follow the C++ enough to figure out why this is, but it is a significant slowdown (it takes about as long to do a syscall as to copy the 1k of data). Maybe it's needed for some sort of database consistency?

B) Data reads to hash the files read in 8191-byte chunks. This should be 8192 to be in page-sized chunks. It should really mmap files larger than some size though.

C) It's calling lstat on each file twice. Removing the duplicate would save maybe a tenth of a second on a large working folder.

D) It seems to do an "fchmod 0100755" on each file in the working folder, which may not be the permissions the user wants (have to set umask).

monotone version:
-----------------
monotone 0.23 (base revision: e32d161b8d5cde9f0968998cc332f82f82c27006)
Running on: Linux 2.6.13-rc5-mm1 #2 SMP PREEMPT Mon Oct 24 09:04:47 EDT 2005 i686
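
For illustration only, the stat-before-hash idea proposed in this report could be sketched roughly as below. This is not monotone's code: the names changed_files, fingerprint and hash_file, the in-memory cache layout, and the choice of size/mtime/inode as the fingerprint are all assumptions made for the example. It also reads in 8192-byte chunks (item B) and calls lstat once per file (item C).

```python
import hashlib
import os

CHUNK = 8192  # page-sized read, as suggested in item B (assumed page multiple)


def hash_file(path):
    """Return the SHA-1 hex digest of a file, read in CHUNK-sized blocks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(CHUNK), b""):
            h.update(block)
    return h.hexdigest()


def fingerprint(path):
    """Cheap change indicator from a single lstat call (hypothetical choice of fields)."""
    st = os.lstat(path)
    return (st.st_size, st.st_mtime_ns, st.st_ino)


def changed_files(paths, cache, force=False):
    """Return the paths whose content differs from what the cache recorded.

    cache maps path -> (fingerprint, digest) and is updated in place.
    A file is rehashed only if its fingerprint changed, it is new to the
    cache, or force is true (the "--force" style full rehash).
    """
    changed = []
    for path in paths:
        fp = fingerprint(path)
        old = cache.get(path)
        if not force and old is not None and old[0] == fp:
            continue  # stat data unchanged: skip reading the file entirely
        digest = hash_file(path)
        if old is None or old[1] != digest:
            changed.append(path)
        cache[path] = (fp, digest)
    return changed
```

A caveat with any scheme like this: an edit that keeps the size and lands within the timestamp granularity of the filesystem can go unnoticed, which is why a full rehash option still has to exist.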
Comment 1 by Unknown User, Nov 23, 2005
Now I see there is a refresh_inodeprints command that does exactly this. So sorry for wasting your time, but it would be a nice touch to check the time it takes to scan files for the commit, and print a message saying the user might want to enable this.

Also, "refresh_inodeprints" tells the user the low-level details but not why they would want to do it. Maybe rename it to something like "enable_largetree" and "refresh", which could do the same thing.

It still takes quite a while, possibly because of the double stat / chmod mentioned above.
Comment 2 by Unknown User, Nov 24, 2005
Indeed, some more automaticity for the inodeprints stuff is on the todo list... most people don't have 1.5 gig trees, so the speed/safety tradeoff is reasonable :-).

The database code is all in the sqlite library; in particular, I believe it is sqlite/os_unix.c, sqlite3OsRead, sqlite3OsWrite, sqlite3OsSeek. I'm sure these could be changed to use pread and pwrite if necessary, and possibly even mmap (though that might be a larger change). Do you have numbers we can wave at the sqlite people to justify the effort?

Looking at it here, double-stat'ing seems to only occur when _not_ using inodeprints; we should look into it, but it's probably not too important, since the stats are probably not the bottleneck in that mode anyway.

I can't seem to reproduce the chmod thing. My guess is that it's actually only happening on files that are marked with the 'executable' attribute? We're a little over-zealous about applying such attributes at the moment; it's an area where cleanup discussions are underway. If this is correct, then umask isn't strictly relevant; the make_executable function reads off the current mode and adds a+x to it. I guess technically we should mask that against the umask too... hrm.

Please add your email address to this bug? It is hard to have a conversation when it is not clear whether anyone is on the other end.
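
To make the umask point concrete, here is a rough sketch of "read the current mode, add a+x, mask against the umask". It is a Python illustration under assumptions, not monotone's make_executable: the respect_umask parameter and the current_umask helper are hypothetical names introduced for the example.

```python
import os
import stat


def current_umask():
    """Read the process umask. os.umask can only set-and-return, so the
    old value is restored immediately (note: this is not thread-safe)."""
    mask = os.umask(0)
    os.umask(mask)
    return mask


def make_executable(path, respect_umask=True):
    """Add execute bits to path's existing mode and return the new mode.

    With respect_umask=False this is a plain a+x on top of the current
    mode; with respect_umask=True the added bits are first filtered
    through the umask, so e.g. umask 027 never grants o+x.
    """
    mode = stat.S_IMODE(os.stat(path).st_mode)
    add = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH
    if respect_umask:
        add &= ~current_umask()
    new_mode = mode | add
    os.chmod(path, new_mode)
    return new_mode
```

With a umask of 027 and a file at mode 0640, the masked variant yields 0750, while the unmasked a+x yields 0751.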
Comment 3 by Unknown User, Jan 24, 2007
> most people don't have 1.5 gig trees

I have data files that are larger than that, and I would like to keep them under version control. Today I tried to commit one of those, and my system started thrashing:

TIME  VIRT   SWAP  RES   CODE  DATA  SHR   %CPU  %MEM  S  COMMAND
3:30  1203m  404m  799m  5160  1.2g  4020  19.6  70.3  R  mtn ci

J.
Comment 4 by Unknown User, Jan 24, 2007
> Please add your email address to this bug?

BTW, I am not the person who originally reported the bug. I just hit the problem today and sent the comment before this one.

J.