summaryrefslogtreecommitdiff
path: root/fs/btrfs
AgeCommit message (Collapse)AuthorFilesLines
2012-03-30Merge branch 'for-linus' of ↵Linus Torvalds41-2225/+4362
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes and features from Chris Mason: "We've merged in the error handling patches from SuSE. These are already shipping in the sles kernel, and they give btrfs the ability to abort transactions and go readonly on errors. It involves a lot of churn as they clarify BUG_ONs, and remove the ones we now properly deal with. Josef reworked the way our metadata interacts with the page cache. page->private now points to the btrfs extent_buffer object, which makes everything faster. He changed it so we write an whole extent buffer at a time instead of allowing individual pages to go down,, which will be important for the raid5/6 code (for the 3.5 merge window ;) Josef also made us more aggressive about dropping pages for metadata blocks that were freed due to COW. Overall, our metadata caching is much faster now. We've integrated my patch for metadata bigger than the page size. This allows metadata blocks up to 64KB in size. In practice 16K and 32K seem to work best. For workloads with lots of metadata, this cuts down the size of the extent allocation tree dramatically and fragments much less. Scrub was updated to support the larger block sizes, which ended up being a fairly large change (thanks Stefan Behrens). We also have an assortment of fixes and updates, especially to the balancing code (Ilya Dryomov), the back ref walker (Jan Schmidt) and the defragging code (Liu Bo)." Fixed up trivial conflicts in fs/btrfs/scrub.c that were just due to removal of the second argument to k[un]map_atomic() in commit 7ac687d9e047. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (75 commits) Btrfs: update the checks for mixed block groups with big metadata blocks Btrfs: update to the right index of defragment Btrfs: do not bother to defrag an extent if it is a big real extent Btrfs: add a check to decide if we should defrag the range Btrfs: fix recursive defragment with autodefrag option Btrfs: fix the mismatch of page->mapping Btrfs: fix race between direct io and autodefrag Btrfs: fix deadlock during allocating chunks Btrfs: show useful info in space reservation tracepoint Btrfs: don't use crc items bigger than 4KB Btrfs: flush out and clean up any block device pages during mount btrfs: disallow unequal data/metadata blocksize for mixed block groups Btrfs: enhance superblock sanity checks Btrfs: change scrub to support big blocks Btrfs: minor cleanup in scrub Btrfs: introduce common define for max number of mirrors Btrfs: fix infinite loop in btrfs_shrink_device() Btrfs: fix memory leak in resolver code Btrfs: allow dup for data chunks in mixed mode Btrfs: validate target profiles only if we are going to use them ...
2012-03-29Btrfs: update the checks for mixed block groups with big metadata blocksChris Mason1-12/+17
Dave Sterba had put in patches to look for mixed data/metadata groups with metadata bigger than 4KB. But these ended up in the wrong place and it wasn't testing the feature flag correctly. This updates the tests to make sure our sizes are matching Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-03-29Btrfs: update to the right index of defragmentLiu Bo1-0/+3
When we use autodefrag, we forget to update the index which indicates the last page we've dirty. And we'll set dirty flags on a same set of pages again and again. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-03-29Btrfs: do not bother to defrag an extent if it is a big real extentLiu Bo1-6/+3
$ mkfs.btrfs /dev/sdb7 $ mount /dev/sdb7 /mnt/btrfs/ -oautodefrag $ dd if=/dev/zero of=/mnt/btrfs/foobar bs=4k count=10 oflag=direct 2>/dev/null $ filefrag -v /mnt/btrfs/foobar Filesystem type is: 9123683e File size of /mnt/btrfs/foobar is 40960 (10 blocks, blocksize 4096) ext logical physical expected length flags 0 0 3072 10 eof /mnt/btrfs/foobar: 1 extent found Now we have a big real extent [0, 40960), but autodefrag will still defrag it. $ sync $ filefrag -v /mnt/btrfs/foobar Filesystem type is: 9123683e File size of /mnt/btrfs/foobar is 40960 (10 blocks, blocksize 4096) ext logical physical expected length flags 0 0 3082 10 eof /mnt/btrfs/foobar: 1 extent found So if we already find a big real extent, we're ok about that, just skip it. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-03-29Btrfs: add a check to decide if we should defrag the rangeLiu Bo1-1/+35
If our file's layout is as follows: | hole | data1 | hole | data2 | we do not need to defrag this file, because this file has holes and cannot be merged into one extent. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-03-29Btrfs: fix recursive defragment with autodefrag optionLiu Bo1-3/+5
$ mkfs.btrfs disk $ mount disk /mnt -o autodefrag $ dd if=/dev/zero of=/mnt/foobar bs=4k count=10 2>/dev/null && sync $ for i in `seq 9 -2 0`; do dd if=/dev/zero of=/mnt/foobar bs=4k count=1 \ seek=$i conv=notrunc 2> /dev/null; done && sync then we'll get to defrag "foobar" again and again. So does option "-o autodefrag,compress". Reasons: When the cleaner kthread gets to fetch inodes from the defrag tree and defrag them, it will dirty pages and submit them, this will comes to another DATA COW where the processing inode will be inserted to the defrag tree again. This patch sets a rule for COW code, i.e. insert an inode when we're really going to make some defragments. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-03-29Btrfs: fix the mismatch of page->mappingLiu Bo1-16/+19
commit 600a45e1d5e376f679ff9ecc4ce9452710a6d27c (Btrfs: fix deadlock on page lock when doing auto-defragment) fixes the deadlock on page, but it also introduces another bug. A page may have been truncated after unlock & lock. So we need to find it again to get the right one. And since we've held i_mutex lock, inode size remains unchanged and we can drop isize overflow checks. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-03-29Btrfs: fix race between direct io and autodefragLiu Bo1-1/+5
The bug is from running xfstests 209 with autodefrag. The race is as follows: t1 t2(autodefrag) direct IO invalidate pagecache dio(old data) add_inode_defrag invalidate pagecache endio direct IO invalidate pagecache run_defrag readpage(old data) set page dirty (old data) dio(new data, rewrite) invalidate pagecache (*) endio t2(autodefrag) will get old data into pagecache via readpage and set pagecache dirty. Meanwhile, invalidate pagecache(*) will fail due to dirty flags in pages. So the old data may be flushed into disk by flush thread, which will lead to data loss. And so does the case of user defragment progs. The patch fixes this race by holding i_mutex when we readpage and set page dirty. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-03-29Btrfs: fix deadlock during allocating chunksLiu Bo1-0/+50
This deadlock comes from xfstests 251. We'll hold the chunk_mutex throughout the whole of a chunk allocation. But if we find that we've used up system chunk space, we need to allocate a new system chunk, but this will lead to a recursion of chunk allocation and end up with a deadlock on chunk_mutex. So instead we need to allocate the system chunk first if we find we're in ENOSPC. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-03-29Btrfs: show useful info in space reservation tracepointLiu Bo3-25/+13
o For space info, the type of space info is useful for debug. o For transaction handle, its transid is useful. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-03-28Btrfs: don't use crc items bigger than 4KBChris Mason1-1/+3
With the big metadata blocks, we can have crc items that are much bigger than a page. There are a few places that we try to kmalloc memory to hold the items during a split. Items bigger than 4KB don't really have a huge benefit in efficiency, but they do trigger larger order allocations. This commits changes the csums to make sure they stay under 4KB. This is not a format change, just a #define to limit huge items. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-03-28Btrfs: flush out and clean up any block device pages during mountChris Mason2-0/+4
Btrfs puts the filesystem metadata into its own address space, and somehow the block device address space isn't getting onto disk properly before a mount. The end result is that a loop of mkfs and mounting the filesystem will sometimes find stale or incorrect data. This commit should fix it by sprinkling fdatawrites and invalidate_bdev calls around. This is a short term measure to make sure it is fixed. The block devices really should be flushed and cleaned up higher in the stack. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-03-28Merge git://git.jan-o-sch.net/btrfs-unstable into for-linusChris Mason5-55/+75
Conflicts: fs/btrfs/transaction.c Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-03-28Merge branch 'for-chris' of git://github.com/idryomov/btrfs-unstable into ↵Chris Mason4-139/+157
for-linus
2012-03-28Merge branch 'error-handling' into for-linusChris Mason37-1018/+1973
Conflicts: fs/btrfs/ctree.c fs/btrfs/disk-io.c fs/btrfs/extent-tree.c fs/btrfs/extent_io.c fs/btrfs/extent_io.h fs/btrfs/inode.c fs/btrfs/scrub.c Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-03-28btrfs: disallow unequal data/metadata blocksize for mixed block groupsDavid Sterba1-0/+8
With support for bigger metadata blocks, we must avoid mounting a filesystem with different block size for mixed block groups, this causes corruption (found by xfstests/083). Signed-off-by: David Sterba <dsterba@suse.cz>
2012-03-28Btrfs: enhance superblock sanity checksDavid Sterba1-5/+18
Validate checksum algorithm during mount and prevent BUG_ON later in btrfs_super_csum_size. Signed-off-by: David Sterba <dsterba@suse.cz>
2012-03-27Btrfs: change scrub to support big blocksStefan Behrens1-340/+1013
Scrub used to be coded for nodesize == leafsize == sectorsize == PAGE_SIZE. This is now changed to support sizes for nodesize and leafsize which are N * PAGE_SIZE. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-03-27Btrfs: minor cleanup in scrubStefan Behrens1-36/+25
Just a minor cleanup commit in preparation for the big block changes. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-03-27Btrfs: introduce common define for max number of mirrorsStefan Behrens2-5/+7
Readahead already has a define for the max number of mirrors. Scrub needs such a define now, the rest of the code will need something like this soon. Therefore the define was added to ctree.h and removed from the readahead code. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-03-27Btrfs: fix infinite loop in btrfs_shrink_device()Ilya Dryomov1-3/+2
If relocate of block group 0 fails with ENOSPC we end up infinitely looping because key.offset -= 1 statement in that case brings us back to where we started. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-03-27Btrfs: fix memory leak in resolver codeIlya Dryomov1-6/+1
init_ipath() allocates btrfs_data_container which is never freed. Free it in free_ipath() and nuke the comment for init_data_container() - we can safely free it with kfree(). Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-03-27Btrfs: allow dup for data chunks in mixed modeIlya Dryomov1-4/+9
Generally we don't allow dup for data, but mixed chunks are special and people seem to think this has its use cases. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-03-27Btrfs: validate target profiles only if we are going to use themIlya Dryomov1-16/+11
Do not run sanity checks on all target profiles unless they all will be used. This came up because alloc_profile_is_valid() is now more strict than it used to be. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-03-27Btrfs: improve the logic in btrfs_can_relocate()Ilya Dryomov1-6/+18
Currently if we don't have enough space allocated we go ahead and loop though devices in the hopes of finding enough space for a chunk of the *same* type as the one we are trying to relocate. The problem with that is that if we are trying to restripe the chunk its target type can be more relaxed than the current one (eg require less devices or less space). So, when restriping, run checks against the target profile instead of the current one. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-03-27Btrfs: add __get_block_group_index() helperIlya Dryomov1-5/+12
Add __get_block_group_index() helper to be able to derive block group index from an arbitary set of flags. Implement get_block_group_index() in terms of it. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-03-27Btrfs: add get_restripe_target() helperIlya Dryomov1-44/+50
Add get_restripe_target() helper and switch everybody to use it. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-03-27Btrfs: move alloc_profile_is_valid() to volumes.cIlya Dryomov3-30/+25
Header file is not a good place to define functions. This also moves a call to alloc_profile_is_valid() down the stack and removes a redundant check from __btrfs_alloc_chunk() - alloc_profile_is_valid() takes it into account. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-03-27Btrfs: make profile_is_valid() check more strictIlya Dryomov3-12/+17
"0" is a valid value for an on-disk chunk profile, but it is not a valid extended profile. (We have a separate bit for single chunks in extended case) Also rename it to alloc_profile_is_valid() for clarity. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-03-27Btrfs: add wrappers for working with alloc profilesIlya Dryomov3-30/+30
Add functions to abstract the conversion between chunk and extended allocation profile formats and switch everybody to use them. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-03-27Btrfs: stop silently switching single chunks to raid0 on balanceIlya Dryomov1-3/+2
This has been causing a lot of confusion for quite a while now and a lot of users were surprised by this (some of them were even stuck in a ENOSPC situation which they couldn't easily get out of). The addition of restriper gives users a clear choice between raid0 and drive concat setup so there's absolutely no excuse for us to keep doing this. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-03-27Btrfs: fix regression in scrub path resolvingJan Schmidt4-55/+73
In commit 4692cf58 we introduced new backref walking code for btrfs. This assumes we're searching live roots, which requires a transaction context. While scrubbing, however, we must not join a transaction because this could deadlock with the commit path. Additionally, what scrub really wants to do is resolving a logical address in the commit root it's currently checking. This patch adds support for logical to path resolving on commit roots and makes scrub use that. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-03-27Btrfs: check return value of btrfs_cow_block()Jan Schmidt1-2/+4
The two helper functions commit_cowonly_roots() and create_pending_snapshot() failed to check the return value from btrfs_cow_block(), which could at least in theory fail with -ENOSPC from btrfs_alloc_free_block(). This commit adds the missing checks. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-03-27Btrfs: actually call btrfs_init_lockdepJan Schmidt1-0/+2
btrfs_init_lockdep only makes our lockdep class names look prettier, thus it did never hurt we forgot to actually call it. This turns our lockdep identifier strings from lockdep auto-set #[id] into really pretty "btrfs-fs-01" or "btrfs-csum-03". Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-03-26Btrfs: deal with read errors on extent buffers differentlyJosef Bacik3-27/+66
Since we need to read and write extent buffers in their entirety we can't use the normal bio_readpage_error stuff since it only works on a per page basis. So instead make it so that if we see an io error in endio we just mark the eb as having an IO error and then in btree_read_extent_buffer_pages we will manually try other mirrors and then overwrite the bad mirror if we find a good copy. This works with larger than page size blocks. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-03-26Btrfs: don't use threaded IO completion helpers for metadata writesChris Mason1-4/+4
The metadata write IO completion code is now simple enough that we don't need the threaded helpers anymore. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-03-26Btrfs: adjust the write_lock_level as we unlockChris Mason1-6/+17
btrfs_search_slot sometimes needs write locks on high levels of the tree. It remembers the highest level that needs a write lock and will use that for all future searches through the tree in a given call. But, very often we'll just cow the top level or the level below and we won't really need write locks on the root again after that. This patch changes things to adjust the write lock requirement as it unlocks levels. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-03-26Btrfs: loop waiting on writebackChris Mason1-5/+5
lock_extent_buffer_for_io needs to loop around and make sure the writeback bits are not set. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-03-26Btrfs: add the ability to cache a pointer into the ebChris Mason3-30/+116
This cuts down on the CPU time used by map_private_extent_buffer Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-03-26Btrfs: ensure an entire eb is written at onceJosef Bacik4-209/+390
This patch simplifies how we track our extent buffers. Previously we could exit writepages with only having written half of an extent buffer, which meant we had to track the state of the pages and the state of the extent buffers differently. Now we only read in entire extent buffers and write out entire extent buffers, this allows us to simply set bits in our bflags to indicate the state of the eb and we no longer have to do things like track uptodate with our iotree. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-03-26Btrfs: introduce mark_extent_buffer_accessedJosef Bacik1-2/+15
Because an eb can have multiple pages we need to make sure that all pages within the eb are markes as accessed, since releasepage can be called against any page in the eb. This will keep us from possibly evicting hot eb's when we're doing larger than pagesize eb's. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2012-03-26Btrfs: introduce free_extent_buffer_staleJosef Bacik5-60/+201
Because btrfs cow's we can end up with extent buffers that are no longer necessary just sitting around in memory. So instead of evicting these pages, we could end up evicting things we actually care about. Thus we have free_extent_buffer_stale for use when we are freeing tree blocks. This will make it so that the ref for the eb being in the radix tree is dropped as soon as possible and then is freed when the refcount hits 0 instead of waiting to be released by releasepage. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2012-03-26Btrfs: only use the existing eb if it's count isn't 0Josef Bacik1-2/+8
We can run into a problem where we find an eb for our existing page already on the radix tree but it has a ref count of 0. It hasn't yet been removed by RCU yet so this can cause issues where we will use the EB after free. So do atomic_inc_not_zero on the exists->refs and if it is zero just do synchronize_rcu() and try again. We won't have to worry about new allocators coming in since they will block on the page lock at this point. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2012-03-26Btrfs: set page->private to the ebJosef Bacik3-93/+91
We spend a lot of time looking up extent buffers from pages when we could just store the pointer to the eb the page is associated with in page->private. This patch does just that, and it makes things a little simpler and reduces a bit of CPU overhead involved with doing metadata IO. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2012-03-26Btrfs: allow metadata blocks larger than the page sizeChris Mason6-189/+190
A few years ago the btrfs code to support blocks lager than the page size was disabled to fix a few corner cases in the page cache handling. This fixes the code to properly support large metadata blocks again. Since current kernels will crash early and often with larger metadata blocks, this adds an incompat bit so that older kernels can't mount it. This also does away with different blocksizes for nodes and leaves. You get a single block size for all tree blocks. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-03-26Btrfs: remove search_start and search_end from find_free_extent and callersJosef Bacik3-19/+9
We have been passing nothing but (u64)-1 to find_free_extent for search_end in all of the callers, so it's completely useless, and we've always been passing 0 in as search_start, so just remove them as function arguments and move search_start into find_free_extent. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2012-03-26Btrfs: remove the ideal caching codeJosef Bacik1-85/+8
This is a relic from before we had the disk space cache and it was to make bootup times when you had btrfs as root not be so damned slow. Now that we have the disk space cache this isn't a problem anymore and really having this code casues uneeded fragmentation and complexity, so just remove it. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2012-03-22btrfs: Fix busyloop in transaction_kthread()Jan Kara1-2/+7
When a filesystem got aborted due do error, transaction_kthread() will busyloop. Fix it by going to sleep in that case as well. Maybe we should just stop transaction_kthread() when filesystem is aborted but that would be more complex. Signed-off-by: Jan Kara <jack@suse.cz>
2012-03-22btrfs: replace many BUG_ONs with proper error handlingJeff Mahoney23-385/+980
btrfs currently handles most errors with BUG_ON. This patch is a work-in- progress but aims to handle most errors other than internal logic errors and ENOMEM more gracefully. This iteration prevents most crashes but can run into lockups with the page lock on occasion when the timing "works out." Signed-off-by: Jeff Mahoney <jeffm@suse.com>
2012-03-22btrfs: enhance transaction abort infrastructureJeff Mahoney8-56/+300
Signed-off-by: Jeff Mahoney <jeffm@suse.com>