summaryrefslogtreecommitdiff
path: root/fs/dcache.c
AgeCommit message (Collapse)AuthorFilesLines
2012-10-02vfs: dcache: use DCACHE_DENTRY_KILLED instead of DCACHE_DISCONNECTED in d_kill()Miklos Szeredi1-2/+2
commit b161dfa6937ae46d50adce8a7c6b12233e96e7bd upstream. IBM reported a soft lockup after applying the fix for the rename_lock deadlock. Commit c83ce989cb5f ("VFS: Fix the nfs sillyrename regression in kernel 2.6.38") was found to be the culprit. The nfs sillyrename fix used DCACHE_DISCONNECTED to indicate that the dentry was killed. This flag can be set on non-killed dentries too, which results in infinite retries when trying to traverse the dentry tree. This patch introduces a separate flag: DCACHE_DENTRY_KILLED, which is only set in d_kill() and makes try_to_ascend() test only this flag. IBM reported successful test results with this patch. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-05-03vfs: make word-at-a-time accesses handle a non-existing pageLinus Torvalds1-4/+22
It turns out that there are more cases than CONFIG_DEBUG_PAGEALLOC that can have holes in the kernel address space: it seems to happen easily with Xen, and it looks like the AMD gart64 code will also punch holes dynamically. Actually hitting that case is still very unlikely, so just do the access, and take an exception and fix it up for the very unlikely case of it being a page-crosser with no next page. And hey, this abstraction might even help other architectures that have other issues with unaligned word accesses than the possible missing next page. IOW, this could do the byte order magic too. Peter Anvin fixed a thinko in the shifting for the exception case. Reported-and-tested-by: Jana Saout <jana@saout.de> Cc: Peter Anvin <hpa@zytor.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-28vfs: fix d_ancestor() case in d_materialize_uniqueMichel Lespinasse1-1/+2
In d_materialise_unique() there are 3 subcases to the 'aliased dentry' case; in two subcases the inode i_lock is properly released but this does not occur in the -ELOOP subcase. This seems to have been introduced by commit 1836750115f2 ("fix loop checks in d_materialise_unique()"). Signed-off-by: Michel Lespinasse <walken@google.com> Cc: stable@vger.kernel.org # v3.0+ [ Added a comment, and moved the unlock to where we generate the -ELOOP, which seems to be more natural. You probably can't actually trigger this without a buggy network file server - d_materialize_unique() is for finding aliases on non-local filesystems, and the d_ancestor() case is for a hardlinked directory loop. But we should be robust in the case of such buggy servers anyway. ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-24Merge tag 'module-for-3.4' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux Pull cleanup of fs/ and lib/ users of module.h from Paul Gortmaker: "Fix up files in fs/ and lib/ dirs to only use module.h if they really need it. These are trivial in scope vs the work done previously. We now have things where any few remaining cleanups can be farmed out to arch or subsystem maintainers, and I have done so when possible. What is remaining here represents the bits that don't clearly lie within a single arch/subsystem boundary, like the fs dir and the lib dir. Some duplicate includes arising from overlapping fixes from independent subsystem maintainer submissions are also quashed." Fix up trivial conflicts due to clashes with other include file cleanups (including some due to the previous bug.h cleanup pull). * tag 'module-for-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: lib: reduce the use of module.h wherever possible fs: reduce the use of module.h wherever possible includecheck: delete any duplicate instances of module.h
2012-03-22fs: fix kernel-doc warnings in dcache.cRandy Dunlap1-1/+1
Fix kernel-doc warnings in fs/dcache.c: Warning(fs/dcache.c:1743): No description found for parameter 'seqp' Warning(fs/dcache.c:1743): Excess function parameter 'seq' description in '__d_lookup_rcu' Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-21Merge branch 'for-linus' of ↵Linus Torvalds1-24/+0
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs pile 1 from Al Viro: "This is _not_ all; in particular, Miklos' and Jan's stuff is not there yet." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (64 commits) ext4: initialization of ext4_li_mtx needs to be done earlier debugfs-related mode_t whack-a-mole hfsplus: add an ioctl to bless files hfsplus: change finder_info to u32 hfsplus: initialise userflags qnx4: new helper - try_extent() qnx4: get rid of qnx4_bread/qnx4_getblk take removal of PF_FORKNOEXEC to flush_old_exec() trim includes in inode.c um: uml_dup_mmap() relies on ->mmap_sem being held, but activate_mm() doesn't hold it um: embed ->stub_pages[] into mmu_context gadgetfs: list_for_each_safe() misuse ocfs2: fix leaks on failure exits in module_init ecryptfs: make register_filesystem() the last potential failure exit ntfs: forgets to unregister sysctls on register_filesystem() failure logfs: missing cleanup on register_filesystem() failure jfs: mising cleanup on register_filesystem() failure make configfs_pin_fs() return root dentry on success configfs: configfs_create_dir() has parent dentry in dentry->d_parent configfs: sanitize configfs_create() ...
2012-03-20vfs: d_alloc_root() goneAl Viro1-24/+0
all callers converted to d_make_root() by now Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-03-19Merge branch 'dcache-word-accesses'Linus Torvalds1-0/+23
* branch 'dcache-word-accesses': vfs: use 'unsigned long' accesses for dcache name comparison and hashing This does the name hashing and lookup using word-sized accesses when that is efficient, namely on x86 (although any little-endian machine with good unaligned accesses would do). It does very much depend on little-endian logic, but it's a very hot couple of functions under some real loads, and this patch improves the performance of __d_lookup_rcu() and link_path_walk() by up to about 30%. Giving a 10% improvement on some very pathname-heavy benchmarks. Because we do make unaligned accesses past the filename, the optimization is disabled when CONFIG_DEBUG_PAGEALLOC is active, and we effectively depend on the fact that on x86 we don't really ever have the last page of usable RAM followed immediately by any IO memory (due to ACPI tables, BIOS buffer areas etc). Some of the bit operations we do are a bit "subtle". It's commented, but you do need to really think about the code. Or just consider it black magic. Thanks to people on G+ for some of the optimized bit tricks.
2012-03-19vfs: get rid of batshit-insane pointless dentry hash calculationsLinus Torvalds1-3/+3
For some odd historical reason, the final mixing round for the dentry cache hash table lookup had an insane "xor with big constant" logic. In two places. The big constant that is being xor'ed is GOLDEN_RATIO_PRIME, which is a fairly random-looking number that is designed to be *multiplied* with so that the bits get spread out over a whole long-word. But xor'ing with it is insane. It doesn't really even change the hash - it really only shifts the hash around in the hash table. To make matters worse, the insane big constant is different on 32-bit and 64-bit builds, even though the name hash bits we use are always 32-bit (and the bits from the pointer we mix in effectively are too). It's all total voodoo programming, in other words. Now, some testing and analysis of the hash chains shows that the rest of the hash function seems to be fairly good. It does pick the right bits of the parent dentry pointer, for example, and while it's generally a bad idea to use an xor to mix down the upper bits (because if there is a repeating pattern, the xor can cause "destructive interference"), it seems to not have been a disaster. For example, replacing the hash with the normal "hash_long()" code (that uses the GOLDEN_RATIO_PRIME constant correctly, btw) actually just makes the hash worse. The hand-picked hash knew which bits of the pointer had the highest entropy, and hash_long() ends up mixing bits less optimally at least in some trivial tests. So the hash function overall seems fine, it just has that really odd "shift result around by a constant xor". So get rid of the silly xor, and replace the down-mixing of the bits with an add instead of an xor that tends to not have the same kind of destructive interference issues. Some stats on the resulting hash chains shows that they look statistically identical before and after, but the code is simpler and no longer makes you go "WTF?". Also, the incoming hash really is just "unsigned int", not a long, and there's no real point to worry about the high 26 bits of the dentry pointer for the 64-bit case, because they are all going to be identical anyway. So also change the hashing to be done in the more natural 'unsigned int' that is the real size of the actual hashed data anyway. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-08vfs: use 'unsigned long' accesses for dcache name comparison and hashingLinus Torvalds1-0/+23
Ok, this is hacky, and only works on little-endian machines with goo unaligned handling. And even then only with CONFIG_DEBUG_PAGEALLOC disabled, since it can access up to 7 bytes after the pathname. But it runs like a bat out of hell. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-04vfs: move dentry_cmp from <linux/dcache.h> to fs/dcache.cLinus Torvalds1-0/+20
It's only used inside fs/dcache.c, and we're going to play games with it for the word-at-a-time patches. This time we really don't even want to export it, because it really is an internal function to fs/dcache.c, and has been since it was introduced. Having it in that extremely hot header file (it's included in pretty much everything, thanks to <linux/fs.h>) is a disaster for testing different versions, and is utterly pointless. We really should have some kind of header file diet thing, where we figure out which parts of header files are really better off private and only result in more expensive compiles. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-02vfs: trivial __d_lookup_rcu() cleanupsLinus Torvalds1-5/+8
These don't change any semantics, but they clean up the code a bit and mark some arguments appropriately 'const'. They came up as I was doing the word-at-a-time dcache name accessor code, and cleaning this up now allows me to send out a smaller relevant interesting patch for the experimental stuff. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-02-28fs: reduce the use of module.h wherever possiblePaul Gortmaker1-1/+1
For files only using THIS_MODULE and/or EXPORT_SYMBOL, map them onto including export.h -- or if the file isn't even using those, then just delete the include. Fix up any implicit include dependencies that were being masked by module.h along the way. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2012-02-13vfs: fix panic in __d_lookup() with high dentry hashtable countsDimitri Sivanich1-4/+4
When the number of dentry cache hash table entries gets too high (2147483648 entries), as happens by default on a 16TB system, use of a signed integer in the dcache_init() initialization loop prevents the dentry_hashtable from getting initialized, causing a panic in __d_lookup(). Fix this in dcache_init() and similar areas. Signed-off-by: Dimitri Sivanich <sivanich@sgi.com> Acked-by: David S. Miller <davem@davemloft.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-13Merge branch 'for-linus' of ↵Linus Torvalds1-2/+9
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: ceph: ensure prealloc_blob is in place when removing xattr rbd: initialize snap_rwsem in rbd_add() ceph: enable/disable dentry complete flags via mount option vfs: export symbol d_find_any_alias() ceph: always initialize the dentry in open_root_dentry() libceph: remove useless return value for osd_client __send_request() ceph: avoid iput() while holding spinlock in ceph_dir_fsync ceph: avoid useless dget/dput in encode_fh ceph: dereference pointer after checking for NULL crush: fix force for non-root TAKE ceph: remove unnecessary d_fsdata conditional checks ceph: Use kmemdup rather than duplicating its implementation Fix up conflicts in fs/ceph/super.c (d_alloc_root() failure handling vs always initialize the dentry in open_root_dentry)
2012-01-12vfs: export symbol d_find_any_alias()Sage Weil1-2/+9
Ceph needs this. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sage Weil <sage@newdream.net>
2012-01-10fix shrink_dcache_parent() livelockMiklos Szeredi1-4/+11
Two (or more) concurrent calls of shrink_dcache_parent() on the same dentry may cause shrink_dcache_parent() to loop forever. Here's what appears to happen: 1 - CPU0: select_parent(P) finds C and puts it on dispose list, returns 1 2 - CPU1: select_parent(P) locks P->d_lock 3 - CPU0: shrink_dentry_list() locks C->d_lock dentry_kill(C) tries to lock P->d_lock but fails, unlocks C->d_lock 4 - CPU1: select_parent(P) locks C->d_lock, moves C from dispose list being processed on CPU0 to the new dispose list, returns 1 5 - CPU0: shrink_dentry_list() finds dispose list empty, returns 6 - Goto 2 with CPU0 and CPU1 switched Basically select_parent() steals the dentry from shrink_dentry_list() and thinks it found a new one, causing shrink_dentry_list() to think it's making progress and loop over and over. One way to trigger this is to make udev calls stat() on the sysfs file while it is going away. Having a file in /lib/udev/rules.d/ with only this one rule seems to the trick: ATTR{vendor}=="0x8086", ATTR{device}=="0x10ca", ENV{PCI_SLOT_NAME}="%k", ENV{MATCHADDR}="$attr{address}", RUN+="/bin/true" Then execute the following loop: while true; do echo -bond0 > /sys/class/net/bonding_masters echo +bond0 > /sys/class/net/bonding_masters echo -bond1 > /sys/class/net/bonding_masters echo +bond1 > /sys/class/net/bonding_masters done One fix would be to check all callers and prevent concurrent calls to shrink_dcache_parent(). But I think a better solution is to stop the stealing behavior. This patch adds a new dentry flag that is set when the dentry is added to the dispose list. The flag is cleared in dentry_lru_del() in case the dentry gets a new reference just before being pruned. If the dentry has this flag, select_parent() will skip it and let shrink_dentry_list() retry pruning it. With select_parent() skipping those dentries there will not be the appearance of progress (new dentries found) when there is none, hence shrink_dcache_parent() will not loop forever. Set the flag is also set in prune_dcache_sb() for consistency as suggested by Linus. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> CC: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-09vfs: new helper - d_make_root()Al Viro1-0/+17
d_alloc_root() with iput() in case of allocation failure... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-09dcache: use a dispose list in select_parentDave Chinner1-42/+21
select_parent currently abuses the dentry cache LRU to provide cleanup features for child dentries that need to be freed. It moves them to the tail of the LRU, then tells shrink_dcache_parent() to calls __shrink_dcache_sb to unconditionally move them to a dispose list (as DCACHE_REFERENCED is ignored). __shrink_dcache_sb() has to relock the dentries to move them off the LRU onto the dispose list, but otherwise does not touch the dentries that select_parent() moved to the tail of the LRU. It then passses the dispose list to shrink_dentry_list() which tries to free the dentries. IOWs, the use of __shrink_dcache_sb() is superfluous - we can build exactly the same list of dentries for disposal directly in select_parent() and call shrink_dentry_list() instead of calling __shrink_dcache_sb() to do that. This means that we avoid long holds on the lru lock walking the LRU moving dentries to the dispose list We also avoid the need to relock each dentry just to move it off the LRU, reducing the numebr of times we lock each dentry to dispose of them in shrink_dcache_parent() from 3 to 2 times. Further, we remove one of the two callers of __shrink_dcache_sb(). This also means that __shrink_dcache_sb can be moved into back into prune_dcache_sb() and we no longer have to handle referenced dentries conditionally, simplifying the code. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-03vfs: mnt_ns moved to struct mountAl Viro1-1/+1
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-03vfs: move mnt_mountpoint to struct mountAl Viro1-1/+1
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-03vfs: now it can be done - make mnt_parent point to struct mountAl Viro1-3/+4
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-03vfs: mnt_parent moved to struct mountAl Viro1-1/+1
the second victim... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-03vfs: spread struct mount - mnt_has_parentAl Viro1-1/+2
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-03vfs: more mnt_parent cleanupsAl Viro1-25/+0
a) mount --move is checking that ->mnt_parent is non-NULL before looking if that parent happens to be shared; ->mnt_parent is never NULL and it's not even an misspelled !mnt_has_parent() b) pivot_root open-codes is_path_reachable(), poorly. c) so does path_is_under(), while we are at it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-03vfs: new internal helper: mnt_has_parent(mnt)Al Viro1-3/+3
vfsmounts have ->mnt_parent pointing either to a different vfsmount or to itself; it's never NULL and termination condition in loops traversing the tree towards root is mnt == mnt->mnt_parent. At least one place (see the next patch) is confused about what's going on; let's add an explicit helper checking it right way and use it in all places where we need it. Not that there had been too many, but... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-12-06fix apparmor dereferencing potentially freed dentry, sanitize __d_path() APIAl Viro1-27/+44
__d_path() API is asking for trouble and in case of apparmor d_namespace_path() getting just that. The root cause is that when __d_path() misses the root it had been told to look for, it stores the location of the most remote ancestor in *root. Without grabbing references. Sure, at the moment of call it had been pinned down by what we have in *path. And if we raced with umount -l, we could have very well stopped at vfsmount/dentry that got freed as soon as prepend_path() dropped vfsmount_lock. It is safe to compare these pointers with pre-existing (and known to be still alive) vfsmount and dentry, as long as all we are asking is "is it the same address?". Dereferencing is not safe and apparmor ended up stepping into that. d_namespace_path() really wants to examine the place where we stopped, even if it's not connected to our namespace. As the result, it looked at ->d_sb->s_magic of a dentry that might've been already freed by that point. All other callers had been careful enough to avoid that, but it's really a bad interface - it invites that kind of trouble. The fix is fairly straightforward, even though it's bigger than I'd like: * prepend_path() root argument becomes const. * __d_path() is never called with NULL/NULL root. It was a kludge to start with. Instead, we have an explicit function - d_absolute_root(). Same as __d_path(), except that it doesn't get root passed and stops where it stops. apparmor and tomoyo are using it. * __d_path() returns NULL on path outside of root. The main caller is show_mountinfo() and that's precisely what we pass root for - to skip those outside chroot jail. Those who don't want that can (and do) use d_path(). * __d_path() root argument becomes const. Everyone agrees, I hope. * apparmor does *NOT* try to use __d_path() or any of its variants when it sees that path->mnt is an internal vfsmount. In that case it's definitely not mounted anywhere and dentry_path() is exactly what we want there. Handling of sysctl()-triggered weirdness is moved to that place. * if apparmor is asked to do pathname relative to chroot jail and __d_path() tells it we it's not in that jail, the sucker just calls d_absolute_path() instead. That's the other remaining caller of __d_path(), BTW. * seq_path_root() does _NOT_ return -ENAMETOOLONG (it's stupid anyway - the normal seq_file logics will take care of growing the buffer and redoing the call of ->show() just fine). However, if it gets path not reachable from root, it returns SEQ_SKIP. The only caller adjusted (i.e. stopped ignoring the return value as it used to do). Reviewed-by: John Johansen <john.johansen@canonical.com> ACKed-by: John Johansen <john.johansen@canonical.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Cc: stable@vger.kernel.org
2011-11-20VFS: Log the fact that we've given ELOOP rather than creating a loopDavid Howells1-1/+10
To prevent an NFS server from being used to create a directory loop in an NFS superblock on the client, the following patch was committed: commit 1836750115f20b774e55c032a3893e8c5bdf41ed Author: Al Viro <viro@zeniv.linux.org.uk> Date: Tue Jul 12 21:42:24 2011 -0400 Subject: fix loop checks in d_materialise_unique() This causes ELOOP to be reported to anyone trying to access the dentry that would otherwise cause the kernel to complete the loop. However, no indication is given to the caller as to why an operation that ought to work doesn't. The fault is with the kernel, which doesn't want to try and solve the problem as it gets horrendously messy if there's another mountpoint somewhere in the trees being spliced that can't be moved[*]. [*] The real problem is that we don't handle the excision of a subtree that gets moved _out_ of what we can see. This can happen on the server where a directory is merely moved between two other dirs on the same filesystem, but where destination dir is not accessible by the client. So, given the choice to return ELOOP rather than trying to reconfigure the dentry tree, we should give the caller some indication of why they aren't being allowed to make what should be a legitimate request and log a message. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Sachin Prabhu <sprabhu@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-11-07vfs: d_invalidate() should leave mountpoints aloneAl Viro1-2/+4
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-11-02vfs: add d_prune dentry operationSage Weil1-5/+35
This adds a d_prune dentry operation that is called by the VFS prior to pruning (i.e. unhashing and killing) a hashed dentry from the dcache. Wrap dentry_lru_del() and use the new _prune() helper in the cases where we are about to unhash and kill the dentry. This will be used by Ceph to maintain a flag indicating whether the complete contents of a directory are contained in the dcache, allowing it to satisfy lookups and readdir without addition server communication. Renumber a few DCACHE_* #defines to group DCACHE_OP_PRUNE with the other DCACHE_OP_ bits. Signed-off-by: Sage Weil <sage@newdream.net> Signed-off-by: Christoph Hellwig <hch@lst.de>
2011-08-06vfs: renumber DCACHE_xyz flags, remove some stale onesLinus Torvalds1-1/+1
Gcc tends to generate better code with small integers, including the DCACHE_xyz flag tests - so move the common ones to be first in the list. Also just remove the unused DCACHE_INOTIFY_PARENT_WATCHED and DCACHE_AUTOFS_PENDING values, their users no longer exists in the source tree. And add a "unlikely()" to the DCACHE_OP_COMPARE test, since we want the common case to be a nice straight-line fall-through. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-03fs/dcache.c: fix new kernel-doc warningRandy Dunlap1-0/+1
Fix new kernel-doc warning in fs/dcache.c: Warning(fs/dcache.c:797): No description found for parameter 'sb' Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-01VFS: Reorganise shrink_dcache_for_umount_subtree() after demise of dcache_lockDavid Howells1-17/+5
Reorganise shrink_dcache_for_umount_subtree() in light of the demise of dcache_lock. Without that dcache_lock, there is no need for the batching of removal of dentries from the system under it (we wanted to make intensive use of the locked data whilst we held it, but didn't want to hold it for long at a time). This works, provided the preceding patch is correct in its removal of locking on dentry->d_lock on the basis that no one should be locking these dentries any more as the whole superblock is defunct. With this patch, the calls to dentry_lru_del() and __d_shrink() are placed at the point where each dentry is detached handled. It is possible that, as an alternative, the batching should still be done - but only for dentry_lru_del() of all a dentry's children in one go. In such a case, the batching would be done under dcache_lru_lock. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-08-01VFS: Remove dentry->d_lock locking from shrink_dcache_for_umount_subtree()David Howells1-24/+24
Locks of the dcache_lock were replaced by locks of dentry->d_lock in commits such as: 2304450783dfde7b0b94ae234edd0dbffa865073 2fd6b7f50797f2e993eea59e0a0b8c6399c811dc as part of the RCU-based pathwalk changes, despite the fact that the caller (shrink_dcache_for_umount()) notes in the banner comment the reasons that d_lock is not necessary in these functions: /* * destroy the dentries attached to a superblock on unmounting * - we don't need to use dentry->d_lock because: * - the superblock is detached from all mountings and open files, so the * dentry trees will not be rearranged by the VFS * - s_umount is write-locked, so the memory pressure shrinker will ignore * any dentries belonging to this superblock that it comes across * - the filesystem itself is no longer permitted to rearrange the dentries * in this superblock */ So remove these locks. If the locks are actually necessary, then this banner comment should be altered instead. The hash table chains are protected by 1-bit locks in the hash table heads, so those shouldn't be a problem. Note that to make this work, __d_drop() has to be split so that the RCUwalk barrier can be avoided. This causes problems otherwise as it has an assertion that dentry->d_lock is locked - but there is no need for that as no one else can be trying to access this dentry, except to step over it (and that should be handled by d_free(), I think). Signed-off-by: David Howells <dhowells@redhat.com> Cc: Nick Piggin <npiggin@kernel.dk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-08-01VFS: Remove detached-dentry counter from shrink_dcache_for_umount_subtree()David Howells1-3/+0
Remove the detached-dentry counter from shrink_dcache_for_umount_subtree() as the value it computes is no longer used as of commit 312d3ca856d369bb04d0443846b85b4cdde6fa8a which made the nr_dentry counters summed per-CPU rather than global atomic. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-26vfs: document locking requirements for d_move, __d_move and d_materialise_uniqueJeff Layton1-4/+7
Adding a comment to d_materialise_unique per Al's request... d_move and __d_move have some pretty substantial locking requirements, but they are not clearly documented. Add some comments spelling them out. Also, document the requirement for the i_mutex of the parent in d_materialise_unique. Cc: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-22Merge branch 'for-linus' of ↵Linus Torvalds1-171/+91
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (107 commits) vfs: use ERR_CAST for err-ptr tossing in lookup_instantiate_filp isofs: Remove global fs lock jffs2: fix IN_DELETE_SELF on overwriting rename() killing a directory fix IN_DELETE_SELF on overwriting rename() on ramfs et.al. mm/truncate.c: fix build for CONFIG_BLOCK not enabled fs:update the NOTE of the file_operations structure Remove dead code in dget_parent() AFS: Fix silly characters in a comment switch d_add_ci() to d_splice_alias() in "found negative" case as well simplify gfs2_lookup() jfs_lookup(): don't bother with . or .. get rid of useless dget_parent() in btrfs rename() and link() get rid of useless dget_parent() in fs/btrfs/ioctl.c fs: push i_mutex and filemap_write_and_wait down into ->fsync() handlers drivers: fix up various ->llseek() implementations fs: handle SEEK_HOLE/SEEK_DATA properly in all fs's that define their own llseek Ext4: handle SEEK_HOLE/SEEK_DATA generically Btrfs: implement our own ->llseek fs: add SEEK_HOLE and SEEK_DATA flags reiserfs: make reiserfs default to barrier=flush ... Fix up trivial conflicts in fs/xfs/linux-2.6/xfs_super.c due to the new shrinker callout for the inode cache, that clashed with the xfs code to start the periodic workers later.
2011-07-21vfs: drop conditional inode prefetch in __do_lookup_rcuLinus Torvalds1-2/+0
It seems to hurt performance in real life. Yes, the inode will be used later, but the conditional doesn't seem to predict all that well (negative dentries are not uncommon) and it looks like the cost of prefetching is simply higher than depending on the cache doing the right thing. As usual. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-20Remove dead code in dget_parent()Al Viro1-5/+0
->d_parent is never NULL... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20switch d_add_ci() to d_splice_alias() in "found negative" case as wellAl Viro1-19/+5
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20superblock: introduce per-sb cache shrinker infrastructureDave Chinner1-109/+12
With context based shrinkers, we can implement a per-superblock shrinker that shrinks the caches attached to the superblock. We currently have global shrinkers for the inode and dentry caches that split up into per-superblock operations via a coarse proportioning method that does not batch very well. The global shrinkers also have a dependency - dentries pin inodes - so we have to be very careful about how we register the global shrinkers so that the implicit call order is always correct. With a per-sb shrinker callout, we can encode this dependency directly into the per-sb shrinker, hence avoiding the need for strictly ordering shrinker registrations. We also have no need for any proportioning code for the shrinker subsystem already provides this functionality across all shrinkers. Allowing the shrinker to operate on a single superblock at a time means that we do less superblock list traversals and locking and reclaim should batch more effectively. This should result in less CPU overhead for reclaim and potentially faster reclaim of items from each filesystem. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20make d_splice_alias(ERR_PTR(err), dentry) = ERR_PTR(err)Al Viro1-0/+3
... and simplify the living hell out of callers Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20Make ->d_sb assign-once and always non-NULLAl Viro1-36/+39
New helper (non-exported, fs/internal.h-only): __d_alloc(sb, name). Allocates dentry, sets its ->d_sb to given superblock and sets ->d_op accordingly. Old d_alloc(NULL, name) callers are converted to that (all of them know what superblock they want). d_alloc() itself is left only for parent != NULl case; uses __d_alloc(), inserts result into the list of parent's children. Note that now ->d_sb is assign-once and never NULL *and* ->d_parent is never NULL either. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20fs: add a DCACHE_NEED_LOOKUP flag for d_flagsJosef Bacik1-2/+32
Btrfs (and I'd venture most other fs's) stores its indexes in nice disk order for readdir, but unfortunately in the case of anything that stats the files in order that readdir spits back (like oh say ls) that means we still have to do the normal lookup of the file, which means looking up our other index and then looking up the inode. What I want is a way to create dummy dentries when we find them in readdir so that when ls or anything else subsequently does a stat(), we already have the location information in the dentry and can go straight to the inode itself. The lookup stuff just assumes that if it finds a dentry it is done, it doesn't perform a lookup. So add a DCACHE_NEED_LOOKUP flag so that the lookup code knows it still needs to run i_op->lookup() on the parent to get the inode for the dentry. I have tested this with btrfs and I went from something that looks like this http://people.redhat.com/jwhiter/ls-noreada.png To this http://people.redhat.com/jwhiter/ls-good.png Thats a savings of 1300 seconds, or 22 minutes. That is a significant savings. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-14fix loop checks in d_materialise_unique()Al Viro1-17/+34
Both __d_unalias() and __d_materialise_dentry() need loop prevention. Grab rename_lock in caller, check for loops there... As a side benefit, we have dentry_lock_for_move() called only under rename_lock, which seriously reduces deadlock potential of the execrable "locking order" used for ->d_lock. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-05-25vmscan: change shrinker API by passing shrink_control structYing Han1-2/+6
Change each shrinker's API by consolidating the existing parameters into shrink_control struct. This will simplify any further features added w/o touching each file of shrinker. [akpm@linux-foundation.org: fix build] [akpm@linux-foundation.org: fix warning] [kosaki.motohiro@jp.fujitsu.com: fix up new shrinker API] [akpm@linux-foundation.org: fix xfs warning] [akpm@linux-foundation.org: update gfs2] Signed-off-by: Ying Han <yinghan@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Minchan Kim <minchan.kim@gmail.com> Acked-by: Pavel Emelyanov <xemul@openvz.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Acked-by: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-20sanitize <linux/prefetch.h> usageLinus Torvalds1-0/+1
Commit e66eed651fd1 ("list: remove prefetching from regular list iterators") removed the include of prefetch.h from list.h, which uncovered several cases that had apparently relied on that rather obscure header file dependency. So this fixes things up a bit, using grep -L linux/prefetch.h $(git grep -l '[^a-z_]prefetchw*(' -- '*.[ch]') grep -L 'prefetchw*(' $(git grep -l 'linux/prefetch.h' -- '*.[ch]') to guide us in finding files that either need <linux/prefetch.h> inclusion, or have it despite not needing it. There are more of them around (mostly network drivers), but this gets many core ones. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-25add hlist_bl_lock/unlock helpersChristoph Hellwig1-16/+6
Now that the whole dcache_hash_bucket crap is gone, go all the way and also remove the weird locking layering violations for locking the hash buckets. Add hlist_bl_lock/unlock helpers to move the locking into the list abstraction instead of requiring each caller to open code it. After all allowing for the bit locks is the whole point of these helpers over the plain hlist variant. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-24vfs: get rid of insane dentry hashing rulesLinus Torvalds1-26/+16
The dentry hashing rules have been really quite complicated for a long while, in odd ways. That made functions like __d_drop() very fragile and non-obvious. In particular, whether a dentry was hashed or not was indicated with an explicit DCACHE_UNHASHED bit. That's despite the fact that the hash abstraction that the dentries use actually have a 'is this entry hashed or not' model (which is a simple test of the 'pprev' pointer). The reason that was done is because we used the normal 'is this entry unhashed' model to mark whether the dentry had _ever_ been hashed in the dentry hash tables, and that logic goes back many years (commit b3423415fbc2: "dcache: avoid RCU for never-hashed dentries"). That, in turn, meant that __d_drop had totally different unhashing logic for the dentry hash table case and for the anonymous dcache case, because in order to use the "is this dentry hashed" logic as a flag for whether it had ever been on the RCU hash table, we had to unhash such a dentry differently so that we'd never think that it wasn't 'unhashed' and wouldn't be free'd correctly. That's just insane. It made the logic really hard to follow, when there were two different kinds of "unhashed" states, and one of them (the one that used "list_bl_unhashed()") really had nothing at all to do with being unhashed per se, but with a very subtle lifetime rule instead. So turn all of it around, and make it logical. Instead of having a DENTRY_UNHASHED bit in d_flags to indicate whether the dentry is on the hash chains or not, use the hash chain unhashed logic for that. Suddenly "d_unhashed()" just uses "list_bl_unhashed()", and everything makes sense. And for the lifetime rule, just use an explicit DENTRY_RCUACCEES bit. If we ever insert the dentry into the dentry hash table so that it is visible to RCU lookup, we mark it DENTRY_RCUACCESS to show that it now needs the RCU lifetime rules. Now suddently that test at dentry free time makes sense too. And because unhashing now is sane and doesn't depend on where the dentry got unhashed from (because the dentry hash chain details doesn't have some subtle side effects), we can re-unify the __d_drop() logic and use common code for the unhashing. Also fix one more open-coded hash chain bit_spin_lock() that I missed in the previous chain locking cleanup commit. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-23vfs: get rid of 'struct dcache_hash_bucket' abstractionLinus Torvalds1-24/+21
It's a useless abstraction for 'hlist_bl_head', and it doesn't actually help anything - quite the reverse. All the users end up having to know about the hlist_bl_head details anyway, using 'struct hlist_bl_node *' etc. So it just makes the code look confusing. And the cost of it is extra '&b->head' syntactic noise, but more importantly it spuriously makes the hash table dentry list look different from the per-superblock DCACHE_DISCONNECTED dentry list. As a result, the code ended up using ad-hoc locking for one case and special helper functions for what is really another totally identical case in the very same function. Make it all look and work the same. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>