[p4] Client-side file fragmentation on NTFS

Frank Compagner frank.compagner at guerrilla-games.com
Thu Dec 6 17:04:49 PST 2007


Thanks for your comments, see mine below.

PW> Changing the work flow so that daily multi-gigabyte syncs are not needed may
PW> be worthwhile. Not always possible but as the number of clients doing this
PW> gets larger the more necessary and beneficial it will become.

Yes, that would be nice, but this is partly unavoidable as the amount
of data produced on a daily basis is just enormous. The only real way
of reducing this somewhat would be to split the departments over
two or more branches, but that means a major overhaul of our workflow.
We're thinking about this, but we need to think this through properly
first. Another complicating factor is that the Perforce branching process
doesn't work really well for this much (unmergeable) data. So this
might take a while.

PW> Yes I have observed this problem first hand and found similar behaviour when
PW> I investigated. The impact of this goes well beyond just sync times as the
PW> fragmentation quickly deteriorates the performance of any applications
PW> accessing the synced files (data / code builds) and leads to fragmentation
PW> of non-synced files being touched on the system.
PW> As you observed when p4 syncs for each new revision of a file being
PW> retrieved to a workspace it first creates a temporary file. Then even though
PW> the new file size is known it does not preallocate but instead appends in
PW> 4KB chunks. Once the file is completely downloaded it deletes the previous
PW> file and renames the temporary file.
PW> This is not an atypical process for applications, lots work in a similar way
PW> without causing undue problems. In the p4 case this becomes problematic as
PW> it is in effect placing a server filesystem workload pattern onto each
PW> client. Think of it as like having your whole team edit their files on your
PW> workstation. Fragmentation quickly becomes an issue even for non-gigabyte
PW> workspaces.

I have no issues with the temp file being created, that's just safe
practice in the event of an abort. But you're right that as the client
already knows the (approximate) size of the file before the sync
starts, it could do a better job of informing the filesystem.
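
A minimal sketch of that idea, assuming the caller already knows the
approximate size: the temp file is extended to the expected size before
any data is written, then trimmed to the real length and swapped into
place. The function name and parameters are hypothetical and this is not
how p4 itself is implemented. Whether setting the size up front actually
yields a contiguous allocation depends on the filesystem and the OS.

```python
import os
import tempfile

def write_preallocated(dest_path, chunks, expected_size):
    """Write chunks to a temp file sized up front, then swap it into place."""
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    # Create the temp file next to the destination so the final rename
    # stays on the same volume.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            # Tell the filesystem the expected size before any data arrives.
            tmp.truncate(expected_size)
            tmp.seek(0)
            written = 0
            for chunk in chunks:
                tmp.write(chunk)
                written += len(chunk)
            # The size was only approximate: trim to what was really written.
            tmp.truncate(written)
        # Replace the previous revision only after the download completed.
        os.replace(tmp_path, dest_path)
    except BaseException:
        # On abort, leave the old file untouched and drop the temp file.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return written
```

Because the old file is only replaced at the very end, an aborted
transfer still leaves the previous revision intact.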

PW> Yes. The impact is particularly easy to measure on automated build machines
PW> that run a build process on a daily/hourly/changelist basis. The build
PW> process would typically involve a sync step. Capturing the fragmentation
PW> before each step and the elapsed time for each step makes it possible to
PW> track down the steps causing fragmentation and the impact thereof. I found
PW> syncing to be a large contributor to fragmentation and its impact on
PW> subsequent steps substantial.

Our build machines do continuous integration and are all very heavily
fragmented. I'm almost ready to put my optimized sync tool to work on
a couple of them to see if that helps in reducing build times. We do
keep some stats on them, so hopefully we'll be able to measure the
difference.
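
The per-step measurement described above could be sketched roughly as
follows. Here `probe` is a hypothetical callable that returns a
fragmentation figure from whatever analysis tool is available on the
machine, and `steps` is a hypothetical list of (name, callable) pairs.
Neither is part of any existing build system.

```python
import time

def run_steps(steps, probe):
    """Run build steps in order, recording fragmentation and elapsed time."""
    results = []
    for name, step in steps:
        # Capture the fragmentation figure before the step runs.
        before = probe()
        start = time.perf_counter()
        step()
        elapsed = time.perf_counter() - start
        results.append({
            "step": name,
            "fragments_before": before,
            "seconds": elapsed,
        })
    return results
```

Comparing `fragments_before` of one step with that of the next shows how
much each step added, and `seconds` shows what it cost.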

>> Do you agree that improving this behaviour is worthwhile?

PW> Yes, but look a little further than the sync to make sure you do not have
PW> other processes contributing to the fragmentation.

Certainly a good idea, we've already gone through most of the rest of the
build steps and have eliminated fragmentation wherever possible.
There's still some left, but by far the biggest cause is the p4 sync.

PW> This sounds great. I have something similar that I coincidentally created
PW> recently (while working on a t-ntfs project), it does not use the p4api
PW> directly though and like yours needs to be polished. As I noted above p4
PW> sync works on a file by file basis, retrieving, deleting and then renaming.
PW> One thing I was considering doing is a delete pass first, and then all the
PW> retrieves with preallocation. The retrieves would optionally be
PW> multi-threaded but this was originally more to do with populating a perforce
PW> proxy and would most probably not help if you are disk io bound.

Sounds interesting as well, I'd be happy to comment. The temp files
shouldn't be a problem, though, you can preallocate those just as
well, right? I'm using the SetFileValidData() Windows function to
preallocate the file before I start writing, and it seems to work
well. I also have a multithreaded variant where one thread fills the
buffer with the p4api Write() calls, and the other tries to write the
buffers to disk as soon as the files are complete or the buffer is
full. Waterproofing the multithreaded variant isn't trivial, though,
plus it needs more memory to work with. As it looks now, the
performance of the SetFileValidData() version is close enough to the
multithreaded variant to make it my preferred approach.
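
For illustration, a simplified sketch of the two-thread arrangement: one
thread receives data and hands it over through a bounded buffer, while
the other writes to disk. It assumes whole files arrive as (path, data)
pairs, which is simpler than the buffer-full case described above. The
names `incoming`, `write_file` and `max_buffered` are hypothetical and
the p4api is not used here.

```python
import queue
import threading

def threaded_sync(incoming, write_file, max_buffered=8):
    """Receive files on the calling thread, write them on a second thread."""
    # A bounded queue caps how much data sits in memory at once.
    pending = queue.Queue(maxsize=max_buffered)
    errors = []

    def writer():
        while True:
            item = pending.get()
            if item is None:
                break  # sentinel: no more files
            if errors:
                continue  # keep draining so the receiver never blocks
            try:
                write_file(*item)
            except Exception as exc:
                errors.append(exc)

    thread = threading.Thread(target=writer)
    thread.start()
    try:
        for item in incoming:
            pending.put(item)  # blocks while the buffer is full
    finally:
        pending.put(None)
        thread.join()
    if errors:
        raise errors[0]
```

The bounded queue is where the extra memory goes, and the error handling
around the writer thread hints at why making this robust takes some
care.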

Speaking of proxies, we've found that fragmentation on the proxy
disk is also a noticeable bottleneck. Not much I can do about that
other than run a defragmenter, but we have 500 GB harddisks in the
proxies, running the defragmenter at 1 AM, and I've twice already
seen the defragmenter still hard at work at 10 AM when everybody is
back using it. So that could do with some improving as well.

----------------------------------------------------------------
Frank Compagner                                  Guerrilla Games


