path: root/drivers/md
Age | Commit message | Author | Files | Lines
2013-10-13 | bcache: Fix a null ptr deref regression | Kent Overstreet | 1 | -2/+1
commit 2fe80d3bbf1c8bd9efc5b8154207c8dd104e7306 upstream. Commit c0f04d88e46d ("bcache: Fix flushes in writeback mode") was fixing a reported data corruption bug, but it seems some last minute refactoring or rebasing introduced a null pointer deref. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Reported-by: Gabriel de Perthuis <g2p.code@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-10-05 | dm-raid: silence compiler warning on rebuilds_per_group. | NeilBrown | 1 | -1/+1
commit 3f6bbd3ffd7b733dd705e494663e5761aa2cb9c1 upstream. This doesn't really need to be initialised, but it doesn't hurt, silences the compiler, and as it is a counter it makes sense for it to start at zero. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-10-05 | dm mpath: disable WRITE SAME if it fails | Mike Snitzer | 2 | -1/+21
commit f84cb8a46a771f36a04a02c61ea635c968ed5f6a upstream.

Work around the SCSI layer's problematic WRITE SAME heuristics by disabling WRITE SAME in the DM multipath device's queue_limits if an underlying device disabled it.

The WRITE SAME heuristics, with both the original commit 5db44863b6eb ("[SCSI] sd: Implement support for WRITE SAME") and the updated commit 66c28f971 ("[SCSI] sd: Update WRITE SAME heuristics"), default to enabling WRITE SAME(10) even without successfully determining it is supported. After the first failed WRITE SAME the SCSI layer will disable WRITE SAME for the device (by setting sdkp->device->no_write_same, which results in 'max_write_same_sectors' in the device's queue_limits being set to 0).

When a device is stacked on top of such a SCSI device, any changes to that SCSI device's queue_limits do not automatically propagate up the stack. As such, a DM multipath device will not have its WRITE SAME support disabled. This causes the block layer to continue to issue WRITE SAME requests to the mpath device, which causes paths to fail and (if mpath IO isn't configured to queue when no paths are available) results in actual IO errors to the upper layers.

This fix doesn't help configurations that have additional devices stacked on top of the mpath device (e.g. LVM-created linear DM devices on top). A proper fix that restacks all the queue_limits from the bottom of the device stack up will need to be explored if SCSI will continue to use this model of optimistically allowing op codes and then disabling them after they fail for the first time.

Before this patch:

    EXT4-fs (dm-6): mounted filesystem with ordered data mode. Opts: (null)
    device-mapper: multipath: XXX snitm debugging: got -EREMOTEIO (-121)
    device-mapper: multipath: XXX snitm debugging: failing WRITE SAME IO with error=-121
    end_request: critical target error, dev dm-6, sector 528
    dm-6: WRITE SAME failed. Manually zeroing.
    device-mapper: multipath: Failing path 8:112.
    end_request: I/O error, dev dm-6, sector 4616
    dm-6: WRITE SAME failed. Manually zeroing.
    end_request: I/O error, dev dm-6, sector 4616
    end_request: I/O error, dev dm-6, sector 5640
    end_request: I/O error, dev dm-6, sector 6664
    end_request: I/O error, dev dm-6, sector 7688
    end_request: I/O error, dev dm-6, sector 524288
    Buffer I/O error on device dm-6, logical block 65536
    lost page write due to I/O error on dm-6
    JBD2: Error -5 detected when updating journal superblock for dm-6-8.
    end_request: I/O error, dev dm-6, sector 524296
    Aborting journal on device dm-6-8.
    end_request: I/O error, dev dm-6, sector 524288
    Buffer I/O error on device dm-6, logical block 65536
    lost page write due to I/O error on dm-6
    JBD2: Error -5 detected when updating journal superblock for dm-6-8.

    # cat /sys/block/sdh/queue/write_same_max_bytes
    0
    # cat /sys/block/dm-6/queue/write_same_max_bytes
    33553920

After this patch:

    EXT4-fs (dm-6): mounted filesystem with ordered data mode. Opts: (null)
    device-mapper: multipath: XXX snitm debugging: got -EREMOTEIO (-121)
    device-mapper: multipath: XXX snitm debugging: WRITE SAME I/O failed with error=-121
    end_request: critical target error, dev dm-6, sector 528
    dm-6: WRITE SAME failed. Manually zeroing.

    # cat /sys/block/sdh/queue/write_same_max_bytes
    0
    # cat /sys/block/dm-6/queue/write_same_max_bytes
    0

It should be noted that WRITE SAME support wasn't enabled in DM multipath until v3.10.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Hannes Reinecke <hare@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
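The limit-propagation problem in the dm mpath commit above can be illustrated with a small user-space model. This is only a sketch under stated assumptions: the `Device` class and its methods are hypothetical and are not kernel APIs. The only behavior taken from the commit message is that a stacked device inherits its limit when it is set up, that the SCSI layer zeroes the limit after the first failed WRITE SAME, and that the fix makes the multipath device zero its own limit when the failure is reported. The value 65535 sectors corresponds to the 33553920 bytes shown in the log, assuming 512-byte sectors.

```python
class Device:
    """Toy block device with one queue limit (hypothetical model, not kernel code)."""

    def __init__(self, name, max_write_same_sectors=0, lower=None):
        self.name = name
        self.lower = lower
        # A stacked device copies the limit from the device below it once,
        # at setup time. Later changes below are not propagated upwards.
        if lower is not None:
            max_write_same_sectors = lower.max_write_same_sectors
        self.max_write_same_sectors = max_write_same_sectors

    def write_same(self, patched):
        """Return True if a WRITE SAME request succeeds on this device."""
        if self.max_write_same_sectors == 0:
            return False  # the block layer would not issue the request
        if self.lower is None:
            # Model of hardware that rejects the command: the SCSI layer
            # disables WRITE SAME for the device after the first failure.
            self.max_write_same_sectors = 0
            return False
        ok = self.lower.write_same(patched)
        if not ok and patched:
            # The fix: mirror the lower device's decision in our own limits.
            self.max_write_same_sectors = 0
        return ok


for patched in (False, True):
    sdh = Device("sdh", max_write_same_sectors=65535)
    dm6 = Device("dm-6", lower=sdh)
    dm6.write_same(patched)
    print(patched, sdh.max_write_same_sectors * 512, dm6.max_write_same_sectors * 512)
```

In the unpatched run the upper device keeps advertising WRITE SAME although the lower device has disabled it, which is the mismatch visible in the two `write_same_max_bytes` values of the "before" log.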
2013-10-05 | dm-snapshot: fix performance degradation due to small hash size | Mikulas Patocka | 1 | -3/+2
commit 60e356f381954d79088d0455e357db48cfdd6857 upstream. LVM2, since version 2.02.96, creates the origin with zero size, then loads the snapshot driver and then loads the origin. Consequently, the snapshot driver sees an origin size of zero and sets the hash size to the lower bound of 64. Such a small hash table causes performance degradation. This patch changes it so that the hash size is determined by the size of the snapshot volume, not the minimum of the origin and snapshot sizes. It doesn't make sense to set the snapshot size significantly larger than the origin size, so we do not need to take the origin size into account when calculating the hash size. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
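The effect of the dm-snapshot sizing change above can be sketched as follows. This is a minimal illustration, not the driver's code: the function names are hypothetical, and sizing the table at one bucket per chunk is an assumption made for the example. The real driver may apply further caps or rounding that are not modeled here. Only the lower bound of 64 and the switch from "minimum of origin and snapshot" to "snapshot only" come from the commit message.

```python
MIN_HASH_SIZE = 64  # lower bound named in the commit message


def hash_size_old(origin_sectors, cow_sectors, chunk_sectors):
    """Before the patch: sized from the smaller of origin and snapshot."""
    # Assumption for illustration: one hash bucket per chunk.
    return max(MIN_HASH_SIZE, min(origin_sectors, cow_sectors) // chunk_sectors)


def hash_size_new(cow_sectors, chunk_sectors):
    """After the patch: sized from the snapshot volume alone."""
    return max(MIN_HASH_SIZE, cow_sectors // chunk_sectors)


# LVM2 >= 2.02.96 loads the snapshot while the origin still has size zero.
origin, cow, chunk = 0, 2 ** 21, 8  # sizes in 512-byte sectors (example values)
print(hash_size_old(origin, cow, chunk))
print(hash_size_new(cow, chunk))
```

With a zero-sized origin the old calculation collapses to the minimum, while the new one scales with the snapshot volume.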
2013-10-05 | dm snapshot: workaround for a false positive lockdep warning | Mikulas Patocka | 1 | -1/+1
commit 5ea330a75bd86b2b2a01d7b85c516983238306fb upstream. The kernel reports a lockdep warning if a snapshot is invalidated because it runs out of space. The lockdep warning was triggered by commit 0976dfc1d0cd80a4e9dfaf87bd87 ("workqueue: Catch more locking problems with flush_work()") in v3.5. The warning is a false positive. The real cause of the warning is that the lockdep engine treats different instances of md->lock as a single lock. This patch is a workaround - we use flush_workqueue instead of flush_work. This code path is not performance sensitive (it is called only on initialization or invalidation), thus it doesn't matter that we flush the whole workqueue. The real fix for the problem would be to teach the lockdep engine to treat different instances of md->lock as separate locks. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Acked-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-10-05 | bcache: Fix flushes in writeback mode | Kent Overstreet | 1 | -6/+9
commit c0f04d88e46d14de51f4baebb6efafb7d59e9f96 upstream. In writeback mode, when we get a cache flush we need to make sure we issue a flush to the backing device. The code for sending down an extra flush was wrong - by cloning the bio we were probably getting flags that didn't make sense for a bare flush, and also the old code was firing for FUA bios, for which we don't need to send a flush to the backing device. This was causing data corruption somehow - the mechanism was never determined, but this patch fixes it for the users that were seeing it. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-10-05 | bcache: Fix for handling overlapping extents when reading in a btree node | Kent Overstreet | 1 | -11/+28
commit 84786438ed17978d72eeced580ab757e4da8830b upstream. btree_sort_fixup() was overly clever, because it was trying to avoid pulling a key off the btree iterator in more than one place. This led to a really obscure bug where we'd break early from the loop in btree_sort_fixup() if the current key overlapped with keys in more than one older set, and the next key it overlapped with was zero size. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-10-05 | bcache: Fix a shrinker deadlock | Kent Overstreet | 1 | -1/+1
commit a698e08c82dfb9771e0bac12c7337c706d729b6d upstream. GFP_NOIO means we could be getting called recursively - mca_alloc() -> mca_data_alloc() - definitely can't use mutex_lock(bucket_lock) then. Whoops. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-10-05 | bcache: Fix a dumb CPU spinning bug in writeback | Kent Overstreet | 1 | -2/+1
commit 79e3dab90d9f826ceca67c7890e048ac9169de49 upstream. schedule_timeout() != schedule_timeout_uninterruptible() Signed-off-by: Kent Overstreet <kmo@daterainc.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-10-05 | bcache: Fix a flush/fua performance bug | Kent Overstreet | 1 | -0/+1
commit 1394d6761b6e9e15ee7c632a6d48791188727b40 upstream. bch_journal_meta() was missing the flush to make the journal write actually go down (instead of waiting up to journal_delay_ms)... Whoops Signed-off-by: Kent Overstreet <kmo@daterainc.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-10-05 | bcache: Fix a writeback performance regression | Kent Overstreet | 4 | -30/+43
commit c2a4f3183a1248f615a695fbd8905da55ad11bba upstream. Background writeback works by scanning the btree for dirty data and adding those keys into a fixed size buffer, then for each dirty key in the keybuf writing it to the backing device. When read_dirty() finishes and it's time to scan for more dirty data, we need to wait for the outstanding writeback IO to finish - they still take up slots in the keybuf (so that foreground writes can check for them to avoid races) - without that wait, we'll continually rescan when we'll be able to add at most a key or two to the keybuf, and that takes locks that starves foreground IO. Doh. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-10-05 | bcache: Fix for when no journal entries are found | Kent Overstreet | 1 | -12/+18
commit c426c4fd46f709ade2bddd51c5738729c7ae1db5 upstream. The journal replay code didn't handle this case, causing it to go into an infinite loop... Signed-off-by: Kent Overstreet <kmo@daterainc.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-10-05 | bcache: Strip endline when writing the label through sysfs | Gabriel de Perthuis | 1 | -1/+7
commit aee6f1cfff3ce240eb4b43b41ca466b907acbd2e upstream. sysfs attributes with unusual characters have crappy failure modes in Squeeze (udev 164); later versions of udev are unaffected. This should make these characters more unusual. Signed-off-by: Gabriel de Perthuis <g2p.code@gmail.com> Signed-off-by: Kent Overstreet <kmo@daterainc.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
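The idea behind the label fix above, dropping the trailing newline that a shell `echo` appends before the value is stored, can be sketched in a few lines. This is an illustration only: `store_label` and the 32-byte `LABEL_SIZE` are assumptions made for the example and are not taken from the patch.

```python
LABEL_SIZE = 32  # assumed fixed width of the on-disk label field


def store_label(raw: bytes) -> bytes:
    """Return the bytes that would be stored for a label written via sysfs."""
    label = raw[:LABEL_SIZE]
    # `echo name > label` delivers b"name\n"; strip that final newline so it
    # never becomes part of the stored label.
    if label.endswith(b"\n"):
        label = label[:-1]
    # Pad with NUL bytes to the fixed field width.
    return label.ljust(LABEL_SIZE, b"\0")


print(store_label(b"backup\n"))
```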
2013-10-05 | bcache: Fix a dumb journal discard bug | Kent Overstreet | 1 | -1/+1
commit 6d9d21e35fbfa2934339e96934f862d118abac23 upstream. That switch statement was obviously wrong, leading to some sort of weird spinning on rare occasion with discards enabled... Signed-off-by: Kent Overstreet <kmo@daterainc.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-29 | bcache: FUA fixes | Kent Overstreet | 3 | -4/+34
commit e49c7c374e7aacd1f04ecbc21d9dbbeeea4a77d6 upstream. Journal writes need to be marked FUA, not just REQ_FLUSH. And btree node writes have... weird ordering requirements. Signed-off-by: Kent Overstreet <koverstreet@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-29 | md: bcache: io.c: fix a potential NULL pointer dereference | Kumar Amit Mehta | 1 | -0/+2
commit 5c694129c8db6d89c9be109049a16510b2f70f6d upstream. bio_alloc_bioset returns NULL on failure. This fix adds a missing check for potential NULL pointer dereferencing. Signed-off-by: Kumar Amit Mehta <gmate.amit@gmail.com> Signed-off-by: Kent Overstreet <koverstreet@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-04 | dm verity: fix inability to use a few specific devices sizes | Mikulas Patocka | 1 | -3/+2
commit b1bf2de07271932326af847a3c6a01fdfd29d4be upstream. Fix a boundary condition that caused failure for certain device sizes. The problem is reported at http://code.google.com/p/cryptsetup/issues/detail?id=160 For certain device sizes the number of hashes at a specific level was calculated incorrectly. It happens, for example, for a device with data and metadata block size 4096 that has 16385 blocks and algorithm sha256. Users can test whether they are affected by this bug by running the "veritysetup verify" command and also by activating the dm-verity kernel driver and reading the whole block device. If it passes without an error, then the user is not affected. The condition for the bug is: Split the total number of data blocks (data_block_bits) into bit strings, each of hash_per_block_bits bits. hash_per_block_bits is rounddown(log2(metadata_block_size/hash_digest_size)). Equivalently, you can say that you convert data_block_bits to base 2^hash_per_block_bits. If there is some zero bit string below the most significant bit string and at least one bit below this zero bit string is set, then the bug happens. The same bug exists in the userspace veritysetup tool, so you must use a fixed veritysetup too if you want to use devices that are affected by this boundary condition. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: Milan Broz <gmazyland@gmail.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
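The bug condition stated in the dm verity commit above can be checked mechanically. The sketch below is a direct transcription of that prose rule into Python; the function names are hypothetical, and it models only the condition, not the hash-level calculation that the patch corrects.

```python
def hash_per_block_bits(metadata_block_size, hash_digest_size):
    """rounddown(log2(metadata_block_size / hash_digest_size))."""
    return (metadata_block_size // hash_digest_size).bit_length() - 1


def is_affected(data_blocks, bits):
    """Apply the condition from the commit message to a block count."""
    # Split the number into digits of `bits` bits, least significant first.
    mask = (1 << bits) - 1
    digits = []
    n = data_blocks
    while n:
        digits.append(n & mask)
        n >>= bits
    # Look for a zero digit below the most significant digit that has
    # at least one set bit somewhere below it.
    for i in range(1, len(digits) - 1):
        if digits[i] == 0 and any(digits[:i]):
            return True
    return False


# Example from the commit message: 4096-byte blocks, sha256 (32-byte digest).
bits = hash_per_block_bits(4096, 32)
print(bits, is_affected(16385, bits), is_affected(16384, bits))
```

For the example in the commit message, 16385 written in base 128 is the digit string 1, 0, 1. The middle digit is zero and a lower bit is set, so the device size is affected. 16384 (digits 1, 0, 0) is not.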
2013-08-04 | dm ioctl: set noio flag to avoid __vmalloc deadlock | Mikulas Patocka | 1 | -0/+3
commit 1c0e883e86ece31880fac2f84b260545d66a39e0 upstream. Set noio flag while calling __vmalloc() because it doesn't fully respect gfp flags to avoid a possible deadlock (see commit 502624bdad3dba45dfaacaf36b7d83e39e74b2d2). This should be backported to stable kernels 3.8 and newer. The kernel 3.8 doesn't have memalloc_noio_save(), so we should set and restore process flag PF_MEMALLOC instead. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-04 | dm mpath: fix ioctl deadlock when no paths | Hannes Reinecke | 2 | -7/+10
commit 6c182cd88d179cbbd06f4f8a8a19b6977940753f upstream. When multipath needs to retry an ioctl the reference to the current live table needs to be dropped. Otherwise a deadlock occurs when all paths are down: - dm_blk_ioctl takes a reference to the current table and spins in multipath_ioctl(). - A new table is being loaded, but upon resume the process hangs in dm_table_destroy() waiting for references to drop to zero. With this patch the reference to the old table is dropped prior to retry, thereby avoiding the deadlock. Signed-off-by: Hannes Reinecke <hare@suse.de> Cc: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-04 | md/raid10: remove use-after-free bug. | NeilBrown | 1 | -1/+7
commit 0eb25bb027a100f5a9df8991f2f628e7d851bc1e upstream. We always need to be careful when calling generic_make_request, as it can start a chain of events which might free something that we are using. Here is one place I wasn't careful enough. If the wbio2 is not in use, then it might get freed at the first generic_make_request call. So perform all necessary tests first. This bug was introduced in 3.3-rc3 (24afd80d99) and can cause an oops, so fix is suitable for any -stable since then. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-04 | md/raid5: fix interaction of 'replace' and 'recovery'. | NeilBrown | 2 | -5/+11
commit f94c0b6658c7edea8bc19d13be321e3860a3fa54 upstream. If a device in a RAID4/5/6 is being replaced while another is being recovered, then the writes to the replacement device currently don't happen, resulting in corruption when the replacement completes and the new drive takes over. This is because the replacement writes are only triggered when 's.replacing' is set and not when the similar 's.sync' is set (which is the case during resync and recovery - it means all devices need to be read). So schedule those writes when s.sync is set as well. In this case we cannot use "STRIPE_INSYNC" to record that the replacement has happened as that is needed for recording that any parity calculation is complete. So introduce STRIPE_REPLACED to record if the replacement has happened. For safety we should also check that STRIPE_COMPUTE_RUN is not set. This has a similar effect to the "s.locked == 0" test. The latter ensures that no IO has been flagged but not started. The former checks whether any parity calculation has been flagged but not started. We must wait for both of these to complete before triggering the 'replace'. Add a similar test to the subsequent check for "are we finished yet". This possibly isn't needed (it is subsumed in the STRIPE_INSYNC test), but it makes it more obvious that the REPLACE will happen before we think we are finished. Finally, if a NeedReplace device is not UPTODATE then that is an error. We really must trigger a warning. This bug was introduced in commit 9a3e1101b827a59ac9036a672f5fa8d5279d0fe2 (md/raid5: detect and handle replacements during recovery.) which introduced replacement for raid5. That was in 3.3-rc3, so any stable kernel since then would benefit from this fix. Reported-by: qindehua <13691222965@163.com> Tested-by: qindehua <qindehua@163.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-04 | md/raid1: fix bio handling problems in process_checks() | NeilBrown | 1 | -23/+30
commit 30bc9b53878a9921b02e3b5bc4283ac1c6de102a upstream. A recent change to use bio_copy_data() in raid1 when repairing an array is faulty. The underlying device may have changed the bio in various ways using bio_advance and these need to be undone not just for the 'sbio' which is being copied to, but also the 'pbio' (primary) which is being copied from. So perform the reset on all bios that were read from and do it early. This also ensures that the sbio->bi_io_vec[j].bv_len passed to memcmp is correct. This fixes a crash during a 'check' of a RAID1 array. The crash was introduced in 3.10 so this is suitable for 3.10-stable. Reported-by: Joe Lawrence <joe.lawrence@stratus.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-04 | md: Remove recent change which allows devices to skip recovery. | NeilBrown | 1 | -14/+0
commit 5024c298311f3b97c85cb034f9edaa333fdb9338 upstream. Commit 7ceb17e87bde79d285a8b988cfed9eaeebe60b86 ("md: Allow devices to be re-added to a read-only array.") allowed a bit more than just that. It also allows devices to be added to a read-write array and to end up skipping recovery. This patch removes the offending piece of code pending a rewrite for a subsequent release. More specifically: If the array has a bitmap, then the device will still need a bitmap based resync ('saved_raid_disk' is set under different conditions if a bitmap is present). If the array doesn't have a bitmap, then this is correct as long as nothing has been written to the array since the metadata was checked by ->validate_super. However there is no locking to ensure that there was no write. Bug was introduced in 3.10 and causes data corruption so patch is suitable for 3.10-stable. Reported-by: Joe Lawrence <joe.lawrence@stratus.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-07-28 | bcache: Journal replay fix | Kent Overstreet | 1 | -1/+6
commit faa5673617656ee58369a3cfe4a312cfcdc59c81 upstream. The journal replay code starts by finding something that looks like a valid journal entry, then it does a binary search over the unchecked region of the journal for the journal entries with the highest sequence numbers. Trouble is, the logic was wrong - journal_read_bucket() returns true if it found journal entries we need, but if the range of journal entries we're looking for loops around the end of the journal - in that case journal_read_bucket() could return true when it hadn't found the highest sequence number we'd seen yet, and in that case the binary search did the wrong thing. Whoops. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-07-28 | bcache: Fix GC_SECTORS_USED() calculation | Kent Overstreet | 1 | -1/+3
commit 29ebf465b9050f241c4433a796a32e6c896a9dcd upstream. Part of the job of garbage collection is to add up however many sectors of live data it finds in each bucket, but that doesn't work very well if it doesn't reset GC_SECTORS_USED() when it starts. Whoops. This wouldn't have broken anything horribly, but allocation tries to preferentially reclaim buckets that are mostly empty and that's not gonna work with an incorrect GC_SECTORS_USED() value. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
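Why the missing reset in the GC_SECTORS_USED() commit above matters can be seen in a toy model of two consecutive garbage-collection passes. This is a sketch with hypothetical names (`Bucket`, `run_gc`), not bcache code. It shows only that a per-bucket counter that is summed on every pass must be zeroed when the pass starts.

```python
class Bucket:
    """Toy bucket holding the per-bucket live-sector counter."""

    def __init__(self):
        self.gc_sectors_used = 0


def run_gc(buckets, live_extents, reset_first):
    """One GC pass. live_extents is a list of (bucket_index, sectors)."""
    if reset_first:
        # The fix: start every pass from zero.
        for bucket in buckets:
            bucket.gc_sectors_used = 0
    for index, sectors in live_extents:
        buckets[index].gc_sectors_used += sectors


live = [(0, 8), (0, 16), (1, 4)]

broken = [Bucket(), Bucket()]
fixed = [Bucket(), Bucket()]
for _ in range(2):  # two passes over the same, unchanged data
    run_gc(broken, live, reset_first=False)
    run_gc(fixed, live, reset_first=True)

print([b.gc_sectors_used for b in broken])
print([b.gc_sectors_used for b in fixed])
```

Without the reset the counters grow on every pass, so mostly empty buckets stop looking mostly empty, which is the allocation preference the commit message says breaks.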
2013-07-28 | bcache: Fix a sysfs splat on shutdown | Kent Overstreet | 2 | -1/+11
commit c9502ea4424b31728703d113fc6b30bfead14633 upstream. If we stopped a bcache device when we were already detaching (or something like that), bcache_device_unlink() would try to remove a symlink from sysfs that was already gone because the bcache dev kobject had already been removed from sysfs. So keep track of whether we've removed stuff from sysfs. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-07-28 | bcache: Shutdown fix | Kent Overstreet | 1 | -7/+11
commit 5caa52afc5abd1396e4af720469abb5843a71eb8 upstream. Stopping a cache set is supposed to make it stop attached backing devices, but somewhere along the way that code got lost. Fixing this mainly has the effect of fixing our reboot notifier. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-07-28 | bcache: Advertise that flushes are supported | Kent Overstreet | 2 | -1/+9
commit 54d12f2b4fd0f218590d1490b41a18d0e2328a9a upstream. Whoops - bcache's flush/FUA was mostly correct, but flushes get filtered out unless we say we support them... Signed-off-by: Kent Overstreet <kmo@daterainc.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-07-28 | bcache: Fix a dumb race | Kent Overstreet | 1 | -2/+4
commit 6aa8f1a6ca41c49721d2de4e048d3da8d06411f9 upstream. In the far-too-complicated closure code - closures can have destructors, for probably dubious reasons; they get run after the closure is no longer waiting on anything but before dropping the parent ref, intended just for freeing whatever memory the closure is embedded in. Trouble is, when remaining goes to 0 and we've got nothing more to run - we also have to unlock the closure, setting remaining to -1. If there's a destructor, that unlock isn't doing anything - nobody could be trying to lock it if we're about to free it - but if the unlock _is needed... that check for a destructor was racy. Argh. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-07-25 | md/raid10: fix two problems with RAID10 resync. | NeilBrown | 1 | -2/+9
commit 7bb23c4934059c64cbee2e41d5d24ce122285176 upstream. 1/ When a difference between blocks is found, data is copied from one bio to the other. However bv_len is used as the length to copy and this could be zero. So use r10_bio->sectors to calculate length instead. Using bv_len was probably always a bit dubious, but the introduction of bio_advance made it much more likely to be a problem. 2/ When preparing some blocks for sync, we don't set BIO_UPTODATE except on bios that we schedule for a read. This ensures that missing/failed devices don't confuse the loop at the top of sync_request_write(). Commit 8be185f2c9d54d6 "raid10: Use bio_reset()" removed a loop which set BIO_UPTODATE on all appropriate bios. So we need to re-add that flag. These bugs were introduced in 3.10, so this patch is suitable for 3.10-stable, and can remove a potential for data corruption. Reported-by: Brassow Jonathan <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-07-25 | md/raid10: fix two bugs affecting RAID10 reshape. | NeilBrown | 1 | -2/+2
commit 78eaa0d4cbcdb345992fa3dd22b3bcbb473cc064 upstream. 1/ If a RAID10 is being reshaped to fewer devices and is stopped while this is ongoing, then when the array is reassembled the 'mirrors' array will be allocated too small. This will lead to an access error or memory corruption. 2/ A sanity test performed when a reshaping RAID10 array is restarted is slightly incorrect. Due to the first bug, this is suitable for any -stable kernel since 3.5 where this code was introduced. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-07-25 | md/raid10: fix bug which causes all RAID10 reshapes to move no data. | NeilBrown | 1 | -5/+4
commit 1376512065b23f39d5f9a160948f313397dde972 upstream. The recent commit 7e83ccbecd608b971f340e951c9e84cd0343002f ("md/raid10: Allow skipping recovery when clean arrays are assembled") causes raid10 to skip a recovery in certain cases where it is safe to do so. Unfortunately it also causes a reshape to be skipped, which is never safe. The result is that an attempt to reshape a RAID10 will appear to complete instantly, but no data will have been moved so the array will now contain garbage. (If nothing is written, you can recover by simply performing the reverse reshape, which will also complete instantly.) Bug was introduced in 3.10, so this is suitable for 3.10-stable. Signed-off-by: NeilBrown <neilb@suse.de> Cc: Martin Wilck <mwilck@arcor.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-06-13 | Merge tag 'md-3.10-fixes' of git://neil.brown.name/md | Linus Torvalds | 4 | -26/+47
Pull md bugfixes from Neil Brown:
 "A few bugfixes for md

  Some tagged for -stable"

* tag 'md-3.10-fixes' of git://neil.brown.name/md:
  md/raid1,5,10: Disable WRITE SAME until a recovery strategy is in place
  md/raid1,raid10: use freeze_array in place of raise_barrier in various places.
  md/raid1: consider WRITE as successful only if at least one non-Faulty and non-rebuilding drive completed it.
  md: md_stop_writes() should always freeze recovery.
2013-06-13 | md/raid1,5,10: Disable WRITE SAME until a recovery strategy is in place | H. Peter Anvin | 3 | -5/+6
There are cases where the kernel will believe that the WRITE SAME command is supported by a block device which does not, in fact, support WRITE SAME. This currently happens for SATA drivers behind a SAS controller, but there are probably a hundred other ways that can happen, including drive firmware bugs. After receiving an error for WRITE SAME the block layer will retry the request as a plain write of zeroes, but mdraid will consider the failure as fatal and consider the drive failed. This has the effect that all the mirrors containing a specific set of data are each offlined in very rapid succession resulting in data loss. However, just bouncing the request back up to the block layer isn't ideal either, because the whole initial request-retry sequence should be inside the write bitmap fence, which probably means that md needs to do its own conversion of WRITE SAME to write zero. Until the failure scenario has been sorted out, disable WRITE SAME for raid1, raid5, and raid10. [neilb: added raid5] This patch is appropriate for any -stable since 3.7 when write_same support was added. Cc: stable@vger.kernel.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: NeilBrown <neilb@suse.de>
2013-06-13 | md/raid1,raid10: use freeze_array in place of raise_barrier in various places. | NeilBrown | 2 | -18/+18
Various places in raid1 and raid10 are calling raise_barrier when they really should call freeze_array. The former is only intended to be called from "make_request". The latter has extra checks for 'nr_queued' and makes a call to flush_pending_writes(), so it is safe to call it from within the management thread. Using raise_barrier will sometimes deadlock. Using freeze_array should not. As 'freeze_array' currently expects one request to be pending (in handle_read_error - the only previous caller), we need to pass it the number of pending requests (extra) to ignore. The deadlock was made particularly noticeable by commits 050b66152f87c7 (raid10) and 6b740b8d79252f13 (raid1) which appeared in 3.4, so the fix is appropriate for any -stable kernel since then. This patch probably won't apply directly to some early kernels and will need to be applied by hand. Cc: stable@vger.kernel.org Reported-by: Alexander Lyakas <alex.bolshoy@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2013-06-13 | md/raid1: consider WRITE as successful only if at least one non-Faulty and non-rebuilding drive completed it. | Alex Lyakas | 2 | -2/+22
Without that fix, the following scenario could happen:

- RAID1 with drives A and B; drive B was freshly-added and is rebuilding
- Drive A fails
- WRITE request arrives to the array. It is failed by drive A, so r1_bio is marked as R1BIO_WriteError, but the rebuilding drive B succeeds in writing it, so the same r1_bio is marked as R1BIO_Uptodate.
- r1_bio arrives to handle_write_finished, badblocks are disabled, md_error()->error() does nothing because we don't fail the last drive of raid1
- raid_end_bio_io() calls call_bio_endio()
- As a result, in call_bio_endio():
      if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
              clear_bit(BIO_UPTODATE, &bio->bi_flags);
  this code doesn't clear the BIO_UPTODATE flag, and the whole master WRITE succeeds, back to the upper layer.

So we returned success to the upper layer, even though we had written the data onto the rebuilding drive only. But when we want to read the data back, we would not read from the rebuilding drive, so this data is lost.

[neilb - applied identical change to raid10 as well]

This bug can result in lost data, so it is suitable for any -stable kernel.

Cc: stable@vger.kernel.org
Signed-off-by: Alex Lyakas <alex@zadarastorage.com>
Signed-off-by: NeilBrown <neilb@suse.de>
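The rule in the subject of the raid1 WRITE commit above can be restated as a small predicate. The sketch below is a user-space model only: `WriteResult` and both functions are hypothetical names, and the "old" rule is a simplification of the behavior the scenario describes (any successful device write marks the request as up to date).

```python
from dataclasses import dataclass


@dataclass
class WriteResult:
    ok: bool              # the device completed the write without error
    in_sync: bool         # False while the device is still rebuilding
    faulty: bool = False  # the device has been marked Faulty


def write_succeeded_old(results):
    """Simplified old rule: any device that wrote the data counts."""
    return any(r.ok for r in results)


def write_succeeded_new(results):
    """New rule: only a non-Faulty, non-rebuilding device counts."""
    return any(r.ok and r.in_sync and not r.faulty for r in results)


# Scenario from the commit message: A fails the write, rebuilding B succeeds.
drive_a = WriteResult(ok=False, in_sync=True)
drive_b = WriteResult(ok=True, in_sync=False)
print(write_succeeded_old([drive_a, drive_b]))
print(write_succeeded_new([drive_a, drive_b]))
```

Under the new rule the write in the scenario is reported as failed, because the only copy landed on a drive that reads will not be served from.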
2013-06-13 | md: md_stop_writes() should always freeze recovery. | NeilBrown | 1 | -1/+1
__md_stop_writes() will currently sometimes freeze recovery. So any caller must be ready for that to happen, and indeed they are. However if __md_stop_writes() doesn't freeze recovery, then a recovery could start before mddev_suspend() is called, which could be awkward. This can particularly cause problems for dm-raid. So change __md_stop_writes() to always freeze recovery. This is safe and more predictable. Reported-by: Brassow Jonathan <jbrassow@redhat.com> Tested-by: Brassow Jonathan <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
2013-06-12 | Merge branch 'for-linus' of git://git.kernel.dk/linux-block | Linus Torvalds | 6 | -124/+102
Pull block layer fixes from Jens Axboe:
 "Outside of bcache (which really isn't super big), these are all few-liners. There are a few important fixes in here:

   - Fix blk pm sleeping when holding the queue lock
   - A small collection of bcache fixes that have been done and tested since bcache was included in this merge window.
   - A fix for a raid5 regression introduced with the bio changes.
   - Two important fixes for mtip32xx, fixing an oops and potential data corruption (or hang) due to wrong bio iteration on stacked devices."

* 'for-linus' of git://git.kernel.dk/linux-block:
  scatterlist: sg_set_buf() argument must be in linear mapping
  raid5: Initialize bi_vcnt
  pktcdvd: silence static checker warning
  block: remove refs to XD disks from documentation
  blkpm: avoid sleep when holding queue lock
  mtip32xx: Correctly handle bio->bi_idx != 0 conditions
  mtip32xx: Fix NULL pointer dereference during module unload
  bcache: Fix error handling in init code
  bcache: clarify free/available/unused space
  bcache: drop "select CLOSURES"
  bcache: Fix incompatible pointer type warning
2013-05-30raid5: Initialize bi_vcntKent Overstreet1-0/+2
The patch that converted raid5 to use bio_reset() forgot to initialize bi_vcnt. Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: NeilBrown <neilb@suse.de> Cc: linux-raid@vger.kernel.org Tested-by: Ilia Mirkin <imirkin@alum.mit.edu> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-05-19dm thin: fix metadata dev resize detectionAlasdair G Kergon1-2/+2
Fix detection of the need to resize the dm thin metadata device. Because of a merging error with patch 24347e9 ("dm thin: detect metadata device resizing"), the code incorrectly tried to extend the metadata device when it didn't need to.
device-mapper: transaction manager: couldn't open metadata space map
device-mapper: thin metadata: tm_open_with_sm failed
device-mapper: thin: aborting transaction failed
device-mapper: thin: switching pool to failure mode
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-05-15Merge branch 'bcache-for-upstream' of git://evilpiepirate.org/~kent/linux-bcache into for-linusJens Axboe5-124/+100
Kent writes: Jens - couple more bcache patches. Bug fixes and a doc update.
2013-05-15bcache: Fix error handling in init codeKent Overstreet4-121/+99
This code appears to have rotted... fix various bugs and do some refactoring. Signed-off-by: Kent Overstreet <koverstreet@google.com>
2013-05-15bcache: drop "select CLOSURES"Paul Bolle1-1/+0
The Kconfig entry for BCACHE selects CLOSURES. But there's no Kconfig symbol CLOSURES. That symbol was used in development versions of bcache, but was removed when the closures code was no longer provided as a kernel library. It can safely be dropped. Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
2013-05-15bcache: Fix incompatible pointer type warningEmil Goode1-2/+1
The function pointer release in struct block_device_operations should point to functions declared as void. Sparse warnings:
drivers/md/bcache/super.c:656:27: warning: incorrect type in initializer (different base types)
drivers/md/bcache/super.c:656:27: expected void ( *release )( ... )
drivers/md/bcache/super.c:656:27: got int ( static [toplevel] *<noident> )( ... )
drivers/md/bcache/super.c:656:2: warning: initialization from incompatible pointer type [enabled by default]
drivers/md/bcache/super.c:656:2: warning: (near initialization for ‘bcache_ops.release’) [enabled by default]
Signed-off-by: Emil Goode <emilgoode@gmail.com> Signed-off-by: Kent Overstreet <koverstreet@google.com>
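The class of warning quoted above can be illustrated outside the kernel. This is a minimal sketch, assuming a made-up ops table: `struct toy_ops`, `struct toy_disk` and `toy_release()` are hypothetical stand-ins, not the bcache or block-layer definitions. It shows a release hook typed as returning void and an implementation declared to match.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a device whose open count is tracked. */
struct toy_disk {
    int open_count;
};

/* Hypothetical ops table; the release hook is typed as returning void. */
struct toy_ops {
    void (*release)(struct toy_disk *disk);
};

/*
 * Declared void to match the hook's type. Declaring this as returning int
 * would make the initializer below an incompatible pointer type.
 */
static void toy_release(struct toy_disk *disk)
{
    assert(disk != NULL);
    if (disk->open_count > 0)
        disk->open_count--;
}

static const struct toy_ops toy_disk_ops = {
    .release = toy_release,
};
```

Calling `toy_disk_ops.release(&disk)` on a disk with `open_count` of 2 leaves it at 1; the point is only that the function's declared return type agrees with the pointer type in the table.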
2013-05-10dm cache: set config valueJoe Thornber1-28/+31
Share configuration option processing code between the dm cache ctr and message functions. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-05-10dm cache: move config fnsAlasdair G Kergon1-17/+17
Move process_config_option() in dm-cache-target.c to make the next patch more readable. Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-05-10dm thin: generate event when metadata threshold passedJoe Thornber3-0/+58
Generate a dm event when the amount of remaining thin pool metadata space falls below a certain level. The threshold is taken to be a quarter of the size of the metadata device with a minimum threshold of 4MB. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-05-10dm persistent metadata: add space map threshold callbackJoe Thornber1-1/+76
Add a threshold callback to dm persistent data space maps. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2013-05-10dm persistent data: add threshold callback to space mapJoe Thornber3-3/+29
Add a threshold callback function to the persistent data space map interface for a subsequent patch to use. dm-thin and dm-cache are interested in knowing when they're getting low on metadata or data blocks. This patch introduces a new method for registering a callback against a threshold. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
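A rough sketch of what registering a callback against a threshold could look like follows. Everything here is an assumption made for illustration: `struct toy_space_map`, `toy_sm_register_threshold()` and `toy_sm_alloc()` are hypothetical names, not the dm persistent-data interface, and firing the callback only once is a design choice of this sketch rather than something the message states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*threshold_fn)(void *context);

/* Hypothetical space map that only tracks a free-block count. */
struct toy_space_map {
    uint64_t nr_free;
    uint64_t threshold; /* fire when nr_free drops below this */
    threshold_fn fn;
    void *context;
    bool triggered;     /* assumption: the callback fires at most once */
};

static void toy_sm_register_threshold(struct toy_space_map *sm,
                                      uint64_t threshold,
                                      threshold_fn fn, void *context)
{
    assert(sm != NULL);
    sm->threshold = threshold;
    sm->fn = fn;
    sm->context = context;
    sm->triggered = false;
}

/* Allocate one block; returns false when no free blocks remain. */
static bool toy_sm_alloc(struct toy_space_map *sm)
{
    if (sm->nr_free == 0)
        return false;
    sm->nr_free--;
    if (sm->fn && !sm->triggered && sm->nr_free < sm->threshold) {
        sm->triggered = true;
        sm->fn(sm->context);
    }
    return true;
}

/* Example callback: counts how many times it was invoked. */
static void count_event(void *context)
{
    (*(int *)context)++;
}
```

A caller such as a thin pool target would register a callback like `count_event` with its own context and raise an event from it; here the callback just increments a counter so the behaviour is easy to check.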
2013-05-10dm thin: detect metadata device resizingJoe Thornber3-3/+64
Allow the dm thin pool metadata device to be extended. Whenever a pool is resumed, detect whether the size of the metadata device has increased, and if so, extend the metadata to use the new space. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
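The resume-time check described above can be sketched as a size comparison. This is an illustrative model only: `struct toy_pool`, `check_metadata_resize()` and `toy_pool_resume()` are hypothetical names, sizes are plain block counts, and treating a smaller device as an error is an assumption of the sketch, not something the message specifies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum resize_action {
    RESIZE_NONE,         /* sizes match: nothing to do */
    RESIZE_EXTEND,       /* device grew: extend the metadata */
    RESIZE_ERROR_SHRUNK, /* assumption: a smaller device is an error */
};

/* Compare the size recorded in the metadata with the device's current size. */
static enum resize_action check_metadata_resize(uint64_t recorded_blocks,
                                                uint64_t device_blocks)
{
    if (device_blocks < recorded_blocks)
        return RESIZE_ERROR_SHRUNK;
    if (device_blocks > recorded_blocks)
        return RESIZE_EXTEND;
    return RESIZE_NONE;
}

/* Hypothetical pool state: only the recorded metadata size. */
struct toy_pool {
    uint64_t metadata_blocks;
};

/* Called on resume; extends only when the device really is larger. */
static int toy_pool_resume(struct toy_pool *pool, uint64_t device_blocks)
{
    assert(pool != NULL);
    switch (check_metadata_resize(pool->metadata_blocks, device_blocks)) {
    case RESIZE_EXTEND:
        pool->metadata_blocks = device_blocks;
        return 0;
    case RESIZE_ERROR_SHRUNK:
        return -1;
    default:
        return 0;
    }
}
```

The important case is `RESIZE_NONE`: when the sizes are equal, resume must leave the metadata alone rather than attempt an extension.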