summaryrefslogtreecommitdiff
path: root/fs/nfs/file.c
AgeCommit message (Collapse)AuthorFilesLines
2011-11-02Merge branch 'osd-devel' into nfs-for-nextTrond Myklebust1-8/+2
2011-11-02nfs: Fix unused variable warning from file.cRakib Mullick1-6/+3
Fix the following unused variable warning. fs/nfs/file.c: In function ‘nfs_file_release’: fs/nfs/file.c:140:17: warning: unused variable ‘dentry’ fs/nfs/file.c: In function ‘nfs_file_read’: fs/nfs/file.c:237:9: warning: unused variable ‘count’ Signed-off-by: Rakib Mullick <rakib.mullick@gmail.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-10-28nfs: drop unnecessary locking in llseekAndi Kleen1-9/+2
This makes NFS follow the standard generic_file_llseek locking scheme. Cc: Trond.Myklebust@netapp.com Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2011-10-28vfs: do (nearly) lockless generic_file_llseekAndi Kleen1-2/+3
The i_mutex lock use of generic _file_llseek hurts. Independent processes accessing the same file synchronize over a single lock, even though they have no need for synchronization at all. Under high utilization this can cause llseek to scale very poorly on larger systems. This patch does some rethinking of the llseek locking model: First the 64bit f_pos is not necessarily atomic without locks on 32bit systems. This can already cause races with read() today. This was discussed on linux-kernel in the past and deemed acceptable. The patch does not change that. Let's look at the different seek variants: SEEK_SET: Doesn't really need any locking. If there's a race one writer wins, the other loses. For 32bit the non atomic update races against read() stay the same. Without a lock they can also happen against write() now. The read() race was deemed acceptable in past discussions, and I think if it's ok for read it's ok for write too. => Don't need a lock. SEEK_END: This behaves like SEEK_SET plus it reads the maximum size too. Reading the maximum size would have the 32bit atomic problem. But luckily we already have a way to read the maximum size without locking (i_size_read), so we can just use that instead. Without i_mutex there is no synchronization with write() anymore, however since the write() update is atomic on 64bit it just behaves like another racy SEEK_SET. On non atomic 32bit it's the same as SEEK_SET. => Don't need a lock, but need to use i_size_read() SEEK_CUR: This has a read-modify-write race window on the same file. One could argue that any application doing unsynchronized seeks on the same file is already broken. But for the sake of not adding a regression here I'm using the file->f_lock to synchronize this. Using this lock is much better than the inode mutex because it doesn't synchronize between processes. => So still need a lock, but can use a f_lock. This patch implements this new scheme in generic_file_llseek. I dropped generic_file_llseek_unlocked and changed all callers. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2011-07-20fs: push i_mutex and filemap_write_and_wait down into ->fsync() handlersJosef Bacik1-3/+8
Btrfs needs to be able to control how filemap_write_and_wait_range() is called in fsync to make it less of a painful operation, so push down taking i_mutex and the calling of filemap_write_and_wait() down into the ->fsync() handlers. Some file systems can drop taking the i_mutex altogether it seems, like ext3 and ocfs2. For correctness sake I just pushed everything down in all cases to make sure that we keep the current behavior the same for everybody, and then each individual fs maintainer can make up their mind about what to do from there. Thanks, Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20fs: handle SEEK_HOLE/SEEK_DATA properly in all fs's that define their own llseekJosef Bacik1-2/+5
This converts everybody to handle SEEK_HOLE/SEEK_DATA properly. In some cases we just return -EINVAL, in others we do the normal generic thing, and in others we're simply making sure that the properly due-dilligence is done. For example in NFS/CIFS we need to make sure the file size is update properly for the SEEK_HOLE and SEEK_DATA case, but since it calls the generic llseek stuff itself that is all we have to do. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-31Fix common misspellingsLucas De Marchi1-1/+1
Fixes generated by 'codespell' and manually reviewed. Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
2011-03-24NFSv4.1 convert layoutcommit sync to booleanAndy Adamson1-1/+1
Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-03-23NFSv4.1: layoutcommitAndy Adamson1-0/+3
The filelayout driver sends LAYOUTCOMMIT only when COMMIT goes to the data server (as opposed to the MDS) and the data server WRITE is not NFS_FILE_SYNC. Only whole file layout support means that there is only one IOMODE_RW layout segment. Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Alexandros Batsakis <batsakis@netapp.com> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Dean Hildebrand <dhildeb@us.ibm.com> Signed-off-by: Fred Isaman <iisaman@citi.umich.edu> Signed-off-by: Mingyang Guo <guomingyang@nrchpc.ac.cn> Signed-off-by: Tao Guo <guotao@nrchpc.ac.cn> Signed-off-by: Zhang Jingwang <zhangjingwang@nrchpc.ac.cn> Tested-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: Fred Isaman <iisaman@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-03-11NFSv4.1: shift pnfs_update_layout locationsFred Isaman1-4/+0
Move the pnfs_update_layout call location to nfs_pageio_do_add_request(). Grab the lseg sent in the doio function to nfs_read_rpcsetup and attach it to each nfs_read_data so it can be sent to the layout driver. Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Andy Adamson <andros@citi.umich.edu> Signed-off-by: Dean Hildebrand <dhildeb@us.ibm.com> Signed-off-by: Fred Isaman <iisaman@citi.umich.edu> Signed-off-by: Fred Isaman <iisaman@netapp.com> Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Oleg Drokin <green@linuxhacker.ru> Signed-off-by: Tao Guo <guotao@nrchpc.ac.cn> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-12-07NFS: Fix fcntl F_GETLK not reporting some conflictsSergey Vlasov1-0/+2
The commit 129a84de2347002f09721cda3155ccfd19fade40 (locks: fix F_GETLK regression (failure to find conflicts)) fixed the posix_test_lock() function by itself, however, its usage in NFS changed by the commit 9d6a8c5c213e34c475e72b245a8eb709258e968c (locks: give posix_test_lock same interface as ->lock) remained broken - subsequent NFS-specific locking code received F_UNLCK instead of the user-specified lock type. To fix the problem, fl->fl_type needs to be saved before the posix_test_lock() call and restored if no local conflicts were reported. Reference: https://bugzilla.kernel.org/show_bug.cgi?id=23892 Tested-by: Alexander Morozov <amorozov@etersoft.ru> Signed-off-by: Sergey Vlasov <vsu@altlinux.ru> Cc: <stable@kernel.org> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-10-31locks: let the caller free file_lock on ->setlease failureChristoph Hellwig1-2/+0
The caller allocated it, the caller should free it. The only issue so far is that we could change the flp pointer even on an error return if the fl_change callback failed. But we can simply move the flp assignment after the fl_change invocation, as the callers don't care about the flp return value if the setlease call failed. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-30locks: fix setlease methods to free passed-in lockJ. Bruce Fields1-1/+2
We modified setlease to require the caller to allocate the new lease in the case of creating a new lease, but forgot to fix up the filesystem methods. Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Steve French <sfrench@samba.org> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26Merge branch 'nfs-for-2.6.37' of ↵Linus Torvalds1-0/+5
git://git.linux-nfs.org/projects/trondmy/nfs-2.6 * 'nfs-for-2.6.37' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6: net/sunrpc: Use static const char arrays nfs4: fix channel attribute sanity-checks NFSv4.1: Use more sensible names for 'initialize_mountpoint' NFSv4.1: pnfs: filelayout: add driver's LAYOUTGET and GETDEVICEINFO infrastructure NFSv4.1: pnfs: add LAYOUTGET and GETDEVICEINFO infrastructure NFS: client needs to maintain list of inodes with active layouts NFS: create and destroy inode's layout cache NFSv4.1: pnfs: filelayout: introduce minimal file layout driver NFSv4.1: pnfs: full mount/umount infrastructure NFS: set layout driver NFS: ask for layouttypes during v4 fsinfo call NFS: change stateid to be a union NFSv4.1: pnfsd, pnfs: protocol level pnfs constants SUNRPC: define xdr_decode_opaque_fixed NFSD: remove duplicate NFS4_STATEID_SIZE
2010-10-25Merge branch 'nfs-for-2.6.37' of ↵Linus Torvalds1-25/+56
git://git.linux-nfs.org/projects/trondmy/nfs-2.6 * 'nfs-for-2.6.37' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6: (67 commits) SUNRPC: Cleanup duplicate assignment in rpcauth_refreshcred nfs: fix unchecked value Ask for time_delta during fsinfo probe Revalidate caches on lock SUNRPC: After calling xprt_release(), we must restart from call_reserve NFSv4: Fix up the 'dircount' hint in encode_readdir NFSv4: Clean up nfs4_decode_dirent NFSv4: nfs4_decode_dirent must clear entry->fattr->valid NFSv4: Fix a regression in decode_getfattr NFSv4: Fix up decode_attr_filehandle() to handle the case of empty fh pointer NFS: Ensure we check all allocation return values in new readdir code NFS: Readdir plus in v4 NFS: introduce generic decode_getattr function NFS: check xdr_decode for errors NFS: nfs_readdir_filler catch all errors NFS: readdir with vmapped pages NFS: remove page size checking code NFS: decode_dirent should use an xdr_stream SUNRPC: Add a helper function xdr_inline_peek NFS: remove readdir plus limit ...
2010-10-24NFS: create and destroy inode's layout cacheBenny Halevy1-0/+5
At the start of the io paths, try to grab the relevant layout information. This will initiate the inode's layout cache, but stubs ensure the cache stays empty. Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: Dean Hildebrand <dhildebz@umich.edu> Signed-off-by: Marc Eshel <eshel@almaden.ibm.com> Signed-off-by: Tao Guo <guotao@nrchpc.ac.cn> Signed-off-by: Ricardo Labiaga <ricardo.labiaga@netapp.com> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Fred Isaman <iisaman@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-10-24Revalidate caches on lockRicardo Labiaga1-3/+16
Instead of blindly zapping the caches, attempt to revalidate them if the server has indicated that it uses high resolution timestamps. NFSv4 should be able to always revalidate the cache since the protocol requires the update of the change attribute on modification of the data. In reality, there are servers (the Linux NFS server for example) that do not obey this requirement and use ctime as the basis for change attribute. Long term, the server needs to be fixed. At this time, and to be on the safe side, continue zapping caches if the server indicates that it does not have a high resolution timestamp. Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-10-19NFS: Don't SIGBUS if nfs_vm_page_mkwrite races with a cache invalidationTrond Myklebust1-9/+8
In the case where we lock the page, and then find out that the page has been thrown out of the page cache, we should just return VM_FAULT_NOPAGE. This is what block_page_mkwrite() does in these situations. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: stable@kernel.org
2010-09-23nfs: introduce mount option '-olocal_lock' to make locks localSuresh Jayaraman1-13/+32
NFS clients since 2.6.12 support flock locks by emulating fcntl byte-range locks. Due to this, some windows applications which seem to use both flock (share mode lock mapped as flock by Samba) and fcntl locks sequentially on the same file, can't lock as they falsely assume the file is already locked. The problem was reported on a setup with windows clients accessing excel files on a Samba exported share which is originally a NFS mount from a NetApp filer. Older NFS clients (< 2.6.12) did not see this problem as flock locks were considered local. To support legacy flock behavior, this patch adds a mount option "-olocal_lock=" which can take the following values: 'none' - Neither flock locks nor POSIX locks are local 'flock' - flock locks are local 'posix' - fcntl/POSIX locks are local 'all' - Both flock locks and POSIX locks are local Testing: - This patch was tested by using -olocal_lock option with different values and the NLM calls were noted from the network packet captured. 'none' - NLM calls were seen during both flock() and fcntl(), flock lock was granted, fcntl was denied 'flock' - no NLM calls for flock(), NLM call was seen for fcntl(), granted 'posix' - NLM call was seen for flock() - granted, no NLM call for fcntl() 'all' - no NLM calls were seen during both flock() and fcntl() - No bugs were seen during NFSv4 locking/unlocking in general and NFSv4 reboot recovery. Cc: Neil Brown <neilb@suse.de> Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-09-12Remove incorrect do_vfs_lock messageFabio Olive Leite1-4/+0
The do_vfs_lock function on fs/nfs/file.c is only called if NLM is not being used, via the -onolock mount option. Therefore it cannot really be "out of sync with lock manager" when the local locking function called returns an error, as there will be no corresponding call to the NLM. For details, simply check the if/else on do_setlk and do_unlk on fs/nfs/file.c. Signed-Off-By: Fabio Olive Leite <fleite@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-08-11NFS: fix the return value of nfs_file_fsync()J. R. Okajima1-1/+1
By the commit af7fa16 2010-08-03 NFS: Fix up the fsync code close(2) became returning the non-zero value even if it went well. nfs_file_fsync() should return 0 when "status" is positive. Signed-off-by: J. R. Okajima <hooanon05@yahoo.co.jp> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-08-07Merge branch 'nfs-for-2.6.36' of ↵Linus Torvalds1-30/+21
git://git.linux-nfs.org/projects/trondmy/nfs-2.6 * 'nfs-for-2.6.36' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6: (42 commits) NFS: NFSv4.1 is no longer a "developer only" feature NFS: NFS_V4 is no longer an EXPERIMENTAL feature NFS: Fix /proc/mount for legacy binary interface NFS: Fix the locking in nfs4_callback_getattr SUNRPC: Defer deleting the security context until gss_do_free_ctx() SUNRPC: prevent task_cleanup running on freed xprt SUNRPC: Reduce asynchronous RPC task stack usage SUNRPC: Move the bound cred to struct rpc_rqst SUNRPC: Clean up of rpc_bindcred() SUNRPC: Move remaining RPC client related task initialisation into clnt.c SUNRPC: Ensure that rpc_exit() always wakes up a sleeping task SUNRPC: Make the credential cache hashtable size configurable SUNRPC: Store the hashtable size in struct rpc_cred_cache NFS: Ensure the AUTH_UNIX credcache is allocated dynamically NFS: Fix the NFS users of rpc_restart_call() SUNRPC: The function rpc_restart_call() should return success/failure NFSv4: Get rid of the bogus RPC_ASSASSINATED(task) checks NFSv4: Clean up the process of renewing the NFSv4 lease NFSv4.1: Handle NFS4ERR_DELAY on SEQUENCE correctly NFS: nfs_rename() should not have to flush out writebacks ...
2010-08-03NFS: Fix up the fsync codeTrond Myklebust1-30/+21
Christoph points out that the VFS will always flush out data before calling nfs_fsync(), so we can dispense with a full call to nfs_wb_all(), and replace that with a simpler call to nfs_commit_inode(). Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-07-30NFS: kswapd must not block in nfs_release_pageTrond Myklebust1-2/+11
See https://bugzilla.kernel.org/show_bug.cgi?id=16056 If other processes are blocked waiting for kswapd to free up some memory so that they can make progress, then we cannot allow kswapd to block on those processes. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: stable@kernel.org
2010-05-27drop unused dentry argument to ->fsyncChristoph Hellwig1-2/+3
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-05-14NFSv4: Allow attribute caching with 'noac' mounts if client holds a delegationTrond Myklebust1-6/+9
If the server has given us a delegation on a file, we _know_ that we can cache the attribute information even when the user has specified 'noac'. Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-03-30include cleanup: Update gfp.h and slab.h includes to prepare for breaking ↵Tejun Heo1-1/+1
implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-19NFS: Prevent another deadlock in nfs_release_page()Trond Myklebust1-1/+2
We should not attempt to free the page if __GFP_FS is not set. Otherwise we can deadlock as per http://bugzilla.kernel.org/show_bug.cgi?id=15578 Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: stable@kernel.org
2010-02-10NFS: Improve NFS iostat byte count accuracy for writesChuck Lever1-3/+12
The bytes counted by the performance counters for NFS writes should reflect write and sync errors. If the write(2) system call reports an error, the bytes should not be counted. And, if the write is short, the actual number of bytes that was written should be counted, not the number of bytes that was requested. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-02-10NFS: Account for NFS bytes read via the splice APIChuck Lever1-1/+4
Bytes read via the splice API should be accounted for in the NFS performance statistics. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-02-10NFS: Fix byte accounting for generic NFS readsChuck Lever1-2/+4
Currently, the NFS I/O counters count the number of bytes requested by applications, rather than the number of bytes actually read by the system calls. The number of bytes requested for reads is actually not that useful, because the value is usually a buffer size for reads. That is, that requested number is usually a maximum, and frequently doesn't reflect the actual number of bytes read. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-02-10NFS: Proper accounting for NFS VFS callsChuck Lever1-2/+2
Nit: The VFSOPEN and VFSFLUSH counters are function call counters. Count every call to these routines. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-01-26NFS: Try to commit unstable writes in nfs_release_page()Trond Myklebust1-0/+2
If someone calls nfs_release_page(), we presumably already know that the page is clean, however it may be holding an unstable write. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: stable@kernel.org Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
2009-12-10vfs: Implement proper O_SYNC semanticsChristoph Hellwig1-2/+2
While Linux provided an O_SYNC flag basically since day 1, it took until Linux 2.4.0-test12pre2 to actually get it implemented for filesystems, since that day we had generic_osync_around with only minor changes and the great "For now, when the user asks for O_SYNC, we'll actually give O_DSYNC" comment. This patch intends to actually give us real O_SYNC semantics in addition to the O_DSYNC semantics. After Jan's O_SYNC patches which are required before this patch it's actually surprisingly simple, we just need to figure out when to set the datasync flag to vfs_fsync_range and when not. This patch renames the existing O_SYNC flag to O_DSYNC while keeping it's numerical value to keep binary compatibility, and adds a new real O_SYNC flag. To guarantee backwards compatiblity it is defined as expanding to both the O_DSYNC and the new additional binary flag (__O_SYNC) to make sure we are backwards-compatible when compiled against the new headers. This also means that all places that don't care about the differences can just check O_DSYNC and get the right behaviour for O_SYNC, too - only places that actuall care need to check __O_SYNC in addition. Drivers and network filesystems have been updated in a fail safe way to always do the full sync magic if O_DSYNC is set. The few places setting O_SYNC for lower layers are kept that way for now to stay failsafe. We enforce that O_DSYNC is set when __O_SYNC is set early in the open path to make sure we always get these sane options. Note that parisc really screwed up their headers as they already define a O_DSYNC that has always been a no-op. We try to repair it by using it for the new O_DSYNC and redefinining O_SYNC to send both the traditional O_SYNC numerical value _and_ the O_DSYNC one. Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Grant Grundler <grundler@parisc-linux.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andreas Dilger <adilger@sun.com> Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com> Acked-by: Kyle McMartin <kyle@mcmartin.ca> Acked-by: Ulrich Drepper <drepper@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jan Kara <jack@suse.cz>
2009-09-27const: mark struct vm_struct_operationsAlexey Dobriyan1-2/+2
* mark struct vm_area_struct::vm_ops as const * mark vm_ops in AGP code But leave TTM code alone, something is fishy there with global vm_ops being used. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-16HWPOISON: Enable error_remove_page for NFSAndi Kleen1-0/+1
Enable hardware memory error handling for NFS Truncation of data pages at runtime should be safe in NFS, even when it doesn't support migration so far. Trond tells me migration is also queued up for 2.6.32. Acked-by: Trond.Myklebust@netapp.com Signed-off-by: Andi Kleen <ak@linux.intel.com>
2009-08-10NFS: read-modify-write page updatingPeter Staubach1-2/+46
Hi. I have a proposal for possibly resolving this issue. I believe that this situation occurs due to the way that the Linux NFS client handles writes which modify partial pages. The Linux NFS client handles partial page modifications by allocating a page from the page cache, copying the data from the user level into the page, and then keeping track of the offset and length of the modified portions of the page. The page is not marked as up to date because there are portions of the page which do not contain valid file contents. When a read call comes in for a portion of the page, the contents of the page must be read in the from the server. However, since the page may already contain some modified data, that modified data must be written to the server before the file contents can be read back in the from server. And, since the writing and reading can not be done atomically, the data must be written and committed to stable storage on the server for safety purposes. This means either a FILE_SYNC WRITE or a UNSTABLE WRITE followed by a COMMIT. This has been discussed at length previously. This algorithm could be described as modify-write-read. It is most efficient when the application only updates pages and does not read them. My proposed solution is to add a heuristic to decide whether to do this modify-write-read algorithm or switch to a read- modify-write algorithm when initially allocating the page in the write system call path. The heuristic uses the modes that the file was opened with, the offset in the page to read from, and the size of the region to read. If the file was opened for reading in addition to writing and the page would not be filled completely with data from the user level, then read in the old contents of the page and mark it as Uptodate before copying in the new data. If the page would be completely filled with data from the user level, then there would be no reason to read in the old contents because they would just be copied over. This would optimize for applications which randomly access and update portions of files. The linkage editor for the C compiler is an example of such a thing. I tested the attached patch by using rpmbuild to build the current Fedora rawhide kernel. The kernel without the patch generated about 269,500 WRITE requests. The modified kernel containing the patch generated about 261,000 WRITE requests. Thus, about 8,500 fewer WRITE requests were generated. I suspect that many of these additional WRITE requests were probably FILE_SYNC requests to WRITE a single page, but I didn't test this theory. The difference between this patch and the previous one was to remove the unneeded PageDirty() test. I then retested to ensure that the resulting system continued to behave as desired. Thanx... ps Signed-off-by: Peter Staubach <staubach@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-08-10NFS: Add a ->migratepage() aop for NFSTrond Myklebust1-0/+1
Make NFS a bit more friendly to NUMA and memory hot removal... Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-07-12headers: smp_lock.h reduxAlexey Dobriyan1-1/+0
* Remove smp_lock.h from files which don't need it (including some headers!) * Add smp_lock.h to files which do need it * Make smp_lock.h include conditional in hardirq.h It's needed only for one kernel_locked() usage which is under CONFIG_PREEMPT This will make hardirq.h inclusion cheaper for every PREEMPT=n config (which includes allmodconfig/allyesconfig, BTW) Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-17NFS: add support for splice writesSuresh Jayaraman1-0/+31
Adds support for splice writes. It effectively calls generic_file_splice_write() to do the writes. We need not worry about O_APPEND case as the combination of splice() writes and O_APPEND is disallowed. This patch propagates NFS write errors back to the caller. The number of bytes written via splice are being added to NFSIO_NORMALWRITTENBYTES as these are effectively cached writes. Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17NFSv4/NLM: Push file locking BKL dependencies down into the NLM layerTrond Myklebust1-6/+0
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-05-02NFS: Close page_mkwrite() racesTrond Myklebust1-3/+3
Follow up to Nick Piggin's patches to ensure that nfs_vm_page_mkwrite returns with the page lock held, and sets the VM_FAULT_LOCKED flag. See http://bugzilla.kernel.org/show_bug.cgi?id=12913 Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07NFS: Fix the return value in nfs_page_mkwrite()Trond Myklebust1-2/+0
Commit c2ec175c39f62949438354f603f4aa170846aabb ("mm: page_mkwrite change prototype to match fault") exposed a bug in the NFS implementation of page_mkwrite. We should be returning 0 on success... Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-03NFS: FS-Cache page managementDavid Howells1-4/+14
FS-Cache page management for NFS. This includes hooking the releasing and invalidation of pages marked with PG_fscache (aka PG_private_2) and waiting for completion of the write-to-cache flag (PG_fscache_write aka PG_owner_priv_2). Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Steve Dickson <steved@redhat.com> Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Tested-by: Daire Byrne <Daire.Byrne@framestore.com>
2009-04-03NFS: Add comment banners to some NFS functionsDavid Howells1-0/+26
Add comment banners to some NFS functions so that they can be modified by the NFS fscache patches for further information. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Steve Dickson <steved@redhat.com> Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Tested-by: Daire Byrne <Daire.Byrne@framestore.com>
2009-04-01Merge branch 'devel' into for-linusTrond Myklebust1-16/+16
2009-04-01mm: page_mkwrite change prototype to match faultNick Piggin1-1/+4
Change the page_mkwrite prototype to take a struct vm_fault, and return VM_FAULT_xxx flags. There should be no functional change. This makes it possible to return much more detailed error information to the VM (and also can provide more information eg. virtual_address to the driver, which might be important in some special cases). This is required for a subsequent fix. And will also make it easier to merge page_mkwrite() with fault() in future. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Chris Mason <chris.mason@oracle.com> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <joel.becker@oracle.com> Cc: Artem Bityutskiy <dedekind@infradead.org> Cc: Felix Blyakher <felixb@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-03-19NFS: Optimise NFS close()Trond Myklebust1-9/+2
Close-to-open cache consistency rules really only require us to flush out writes on calls to close(), and require us to revalidate attributes on the very last close of the file. Currently we appear to be doing a lot of extra attribute revalidation and cache flushes. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-03-11NFS: Kill the "defined but not used" compile error on nommu machinesTrond Myklebust1-7/+5
Bryan Wu reports that when compiling NFS on nommu machines he gets a "defined but not used" error on nfs_file_mmap(). The easiest fix is simply to get rid of the special casing in NFS, and just always call generic_file_mmap() to set up the file. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-03-11NFS: Throttle page dirtying while we're flushing to diskTrond Myklebust1-0/+9
The following patch is a combination of a patch by myself and Peter Staubach. Trond: If we allow other processes to dirty pages while a process is doing a consistency sync to disk, we can end up never making progress. Peter: Attached is a patch which addresses a continuing problem with the NFS client generating out of order WRITE requests. While this is compliant with all of the current protocol specifications, there are servers in the market which can not handle out of order WRITE requests very well. Also, this may lead to sub-optimal block allocations in the underlying file system on the server. This may cause the read throughputs to be reduced when reading the file from the server. Peter: There has been a lot of work recently done to address out of order issues on a systemic level. However, the NFS client is still susceptible to the problem. Out of order WRITE requests can occur when pdflush is in the middle of writing out pages while the process dirtying the pages calls generic_file_buffered_write which calls generic_perform_write which calls balance_dirty_pages_rate_limited which ends up calling writeback_inodes which ends up calling back into the NFS client to writes out dirty pages for the same file that pdflush happens to be working with. Signed-off-by: Peter Staubach <staubach@redhat.com> [modification by Trond to merge the two similar patches] Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>