summaryrefslogtreecommitdiff
path: root/fs/btrfs
AgeCommit message (Collapse)Author
2016-04-20Btrfs: fix file/data loss caused by fsync after rename and new inodeFilipe Manana
commit 56f23fdbb600e6087db7b009775b95ce07cc3195 upstream. If we rename an inode A (be it a file or a directory), create a new inode B with the old name of inode A and under the same parent directory, fsync inode B and then power fail, at log tree replay time we end up removing inode A completely. If inode A is a directory then all its files are gone too. Example scenarios where this happens: This is reproducible with the following steps, taken from a couple of test cases written for fstests which are going to be submitted upstream soon: # Scenario 1 mkfs.btrfs -f /dev/sdc mount /dev/sdc /mnt mkdir -p /mnt/a/x echo "hello" > /mnt/a/x/foo echo "world" > /mnt/a/x/bar sync mv /mnt/a/x /mnt/a/y mkdir /mnt/a/x xfs_io -c fsync /mnt/a/x <power failure happens> The next time the fs is mounted, log tree replay happens and the directory "y" does not exist nor do the files "foo" and "bar" exist anywhere (neither in "y" nor in "x", nor the root nor anywhere). # Scenario 2 mkfs.btrfs -f /dev/sdc mount /dev/sdc /mnt mkdir /mnt/a echo "hello" > /mnt/a/foo sync mv /mnt/a/foo /mnt/a/bar echo "world" > /mnt/a/foo xfs_io -c fsync /mnt/a/foo <power failure happens> The next time the fs is mounted, log tree replay happens and the file "bar" does not exists anymore. A file with the name "foo" exists and it matches the second file we created. Another related problem that does not involve file/data loss is when a new inode is created with the name of a deleted snapshot and we fsync it: mkfs.btrfs -f /dev/sdc mount /dev/sdc /mnt mkdir /mnt/testdir btrfs subvolume snapshot /mnt /mnt/testdir/snap btrfs subvolume delete /mnt/testdir/snap rmdir /mnt/testdir mkdir /mnt/testdir xfs_io -c fsync /mnt/testdir # or fsync some file inside /mnt/testdir <power failure> The next time the fs is mounted the log replay procedure fails because it attempts to delete the snapshot entry (which has dir item key type of BTRFS_ROOT_ITEM_KEY) as if it were a regular (non-root) entry, resulting in the following error that causes mount to fail: [52174.510532] BTRFS info (device dm-0): failed to delete reference to snap, inode 257 parent 257 [52174.512570] ------------[ cut here ]------------ [52174.513278] WARNING: CPU: 12 PID: 28024 at fs/btrfs/inode.c:3986 __btrfs_unlink_inode+0x178/0x351 [btrfs]() [52174.514681] BTRFS: Transaction aborted (error -2) [52174.515630] Modules linked in: btrfs dm_flakey dm_mod overlay crc32c_generic ppdev xor raid6_pq acpi_cpufreq parport_pc tpm_tis sg parport tpm evdev i2c_piix4 proc [52174.521568] CPU: 12 PID: 28024 Comm: mount Tainted: G W 4.5.0-rc6-btrfs-next-27+ #1 [52174.522805] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014 [52174.524053] 0000000000000000 ffff8801df2a7710 ffffffff81264e93 ffff8801df2a7758 [52174.524053] 0000000000000009 ffff8801df2a7748 ffffffff81051618 ffffffffa03591cd [52174.524053] 00000000fffffffe ffff88015e6e5000 ffff88016dbc3c88 ffff88016dbc3c88 [52174.524053] Call Trace: [52174.524053] [<ffffffff81264e93>] dump_stack+0x67/0x90 [52174.524053] [<ffffffff81051618>] warn_slowpath_common+0x99/0xb2 [52174.524053] [<ffffffffa03591cd>] ? __btrfs_unlink_inode+0x178/0x351 [btrfs] [52174.524053] [<ffffffff81051679>] warn_slowpath_fmt+0x48/0x50 [52174.524053] [<ffffffffa03591cd>] __btrfs_unlink_inode+0x178/0x351 [btrfs] [52174.524053] [<ffffffff8118f5e9>] ? iput+0xb0/0x284 [52174.524053] [<ffffffffa0359fe8>] btrfs_unlink_inode+0x1c/0x3d [btrfs] [52174.524053] [<ffffffffa038631e>] check_item_in_log+0x1fe/0x29b [btrfs] [52174.524053] [<ffffffffa0386522>] replay_dir_deletes+0x167/0x1cf [btrfs] [52174.524053] [<ffffffffa038739e>] fixup_inode_link_count+0x289/0x2aa [btrfs] [52174.524053] [<ffffffffa038748a>] fixup_inode_link_counts+0xcb/0x105 [btrfs] [52174.524053] [<ffffffffa038a5ec>] btrfs_recover_log_trees+0x258/0x32c [btrfs] [52174.524053] [<ffffffffa03885b2>] ? replay_one_extent+0x511/0x511 [btrfs] [52174.524053] [<ffffffffa034f288>] open_ctree+0x1dd4/0x21b9 [btrfs] [52174.524053] [<ffffffffa032b753>] btrfs_mount+0x97e/0xaed [btrfs] [52174.524053] [<ffffffff8108e1b7>] ? trace_hardirqs_on+0xd/0xf [52174.524053] [<ffffffff8117bafa>] mount_fs+0x67/0x131 [52174.524053] [<ffffffff81193003>] vfs_kern_mount+0x6c/0xde [52174.524053] [<ffffffffa032af81>] btrfs_mount+0x1ac/0xaed [btrfs] [52174.524053] [<ffffffff8108e1b7>] ? trace_hardirqs_on+0xd/0xf [52174.524053] [<ffffffff8108c262>] ? lockdep_init_map+0xb9/0x1b3 [52174.524053] [<ffffffff8117bafa>] mount_fs+0x67/0x131 [52174.524053] [<ffffffff81193003>] vfs_kern_mount+0x6c/0xde [52174.524053] [<ffffffff8119590f>] do_mount+0x8a6/0x9e8 [52174.524053] [<ffffffff811358dd>] ? strndup_user+0x3f/0x59 [52174.524053] [<ffffffff81195c65>] SyS_mount+0x77/0x9f [52174.524053] [<ffffffff814935d7>] entry_SYSCALL_64_fastpath+0x12/0x6b [52174.561288] ---[ end trace 6b53049efb1a3ea6 ]--- Fix this by forcing a transaction commit when such cases happen. This means we check in the commit root of the subvolume tree if there was any other inode with the same reference when the inode we are fsync'ing is a new inode (created in the current transaction). Test cases for fstests, covering all the scenarios given above, were submitted upstream for fstests: * fstests: generic test for fsync after renaming directory https://patchwork.kernel.org/patch/8694281/ * fstests: generic test for fsync after renaming file https://patchwork.kernel.org/patch/8694301/ * fstests: add btrfs test for fsync after snapshot deletion https://patchwork.kernel.org/patch/8670671/ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-20btrfs: fix crash/invalid memory access on fsync when using overlayfsFilipe Manana
commit de17e793b104d690e1d007dfc5cb6b4f649598ca upstream. If the lower or upper directory of an overlayfs mount belong to a btrfs file system and we fsync the file through the overlayfs' merged directory we ended up accessing an inode that didn't belong to btrfs as if it were a btrfs inode at btrfs_sync_file() resulting in a crash like the following: [ 7782.588845] BUG: unable to handle kernel NULL pointer dereference at 0000000000000544 [ 7782.590624] IP: [<ffffffffa030b7ab>] btrfs_sync_file+0x11b/0x3e9 [btrfs] [ 7782.591931] PGD 4d954067 PUD 1e878067 PMD 0 [ 7782.592016] Oops: 0002 [#6] PREEMPT SMP DEBUG_PAGEALLOC [ 7782.592016] Modules linked in: btrfs overlay ppdev crc32c_generic evdev xor raid6_pq psmouse pcspkr sg serio_raw acpi_cpufreq parport_pc parport tpm_tis i2c_piix4 tpm i2c_core processor button loop autofs4 ext4 crc16 mbcache jbd2 sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix virtio_pci libata virtio_ring virtio scsi_mod e1000 floppy [last unloaded: btrfs] [ 7782.592016] CPU: 10 PID: 16437 Comm: xfs_io Tainted: G D 4.5.0-rc6-btrfs-next-26+ #1 [ 7782.592016] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014 [ 7782.592016] task: ffff88001b8d40c0 ti: ffff880137488000 task.ti: ffff880137488000 [ 7782.592016] RIP: 0010:[<ffffffffa030b7ab>] [<ffffffffa030b7ab>] btrfs_sync_file+0x11b/0x3e9 [btrfs] [ 7782.592016] RSP: 0018:ffff88013748be40 EFLAGS: 00010286 [ 7782.592016] RAX: 0000000080000000 RBX: ffff880133b30c88 RCX: 0000000000000001 [ 7782.592016] RDX: 0000000000000001 RSI: ffffffff8148fec0 RDI: 00000000ffffffff [ 7782.592016] RBP: ffff88013748bec0 R08: 0000000000000001 R09: 0000000000000000 [ 7782.624248] R10: ffff88013748be40 R11: 0000000000000246 R12: 0000000000000000 [ 7782.624248] R13: 0000000000000000 R14: 00000000009305a0 R15: ffff880015e3be40 [ 7782.624248] FS: 00007fa83b9cb700(0000) GS:ffff88023ed40000(0000) knlGS:0000000000000000 [ 7782.624248] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 7782.624248] CR2: 0000000000000544 CR3: 00000001fa652000 CR4: 00000000000006e0 [ 7782.624248] Stack: [ 7782.624248] ffffffff8108b5cc ffff88013748bec0 0000000000000246 ffff8800b005ded0 [ 7782.624248] ffff880133b30d60 8000000000000000 7fffffffffffffff 0000000000000246 [ 7782.624248] 0000000000000246 ffffffff81074f9b ffffffff8104357c ffff880015e3be40 [ 7782.624248] Call Trace: [ 7782.624248] [<ffffffff8108b5cc>] ? arch_local_irq_save+0x9/0xc [ 7782.624248] [<ffffffff81074f9b>] ? ___might_sleep+0xce/0x217 [ 7782.624248] [<ffffffff8104357c>] ? __do_page_fault+0x3c0/0x43a [ 7782.624248] [<ffffffff811a2351>] vfs_fsync_range+0x8c/0x9e [ 7782.624248] [<ffffffff811a237f>] vfs_fsync+0x1c/0x1e [ 7782.624248] [<ffffffff811a24d6>] do_fsync+0x31/0x4a [ 7782.624248] [<ffffffff811a2700>] SyS_fsync+0x10/0x14 [ 7782.624248] [<ffffffff81493617>] entry_SYSCALL_64_fastpath+0x12/0x6b [ 7782.624248] Code: 85 c0 0f 85 e2 02 00 00 48 8b 45 b0 31 f6 4c 29 e8 48 ff c0 48 89 45 a8 48 8d 83 d8 00 00 00 48 89 c7 48 89 45 a0 e8 fc 43 18 e1 <f0> 41 ff 84 24 44 05 00 00 48 8b 83 58 ff ff ff 48 c1 e8 07 83 [ 7782.624248] RIP [<ffffffffa030b7ab>] btrfs_sync_file+0x11b/0x3e9 [btrfs] [ 7782.624248] RSP <ffff88013748be40> [ 7782.624248] CR2: 0000000000000544 [ 7782.661994] ---[ end trace 721e14960eb939bc ]--- This started happening since commit 4bacc9c9234 (overlayfs: Make f_path always point to the overlay and f_inode to the underlay) and even though after this change we could still access the btrfs inode through struct file->f_mapping->host or struct file->f_inode, we would end up resulting in more similar issues later on at check_parent_dirs_for_sync() because the dentry we got (from struct file->f_path.dentry) was from overlayfs and not from btrfs, that is, we had no way of getting the dentry that belonged to btrfs (we always got the dentry that belonged to overlayfs). The new patch from Miklos Szeredi, titled "vfs: add file_dentry()" and recently submitted to linux-fsdevel, adds a file_dentry() API that allows us to get the btrfs dentry from the input file and therefore being able to fsync when the upper and lower directories belong to btrfs filesystems. This issue has been reported several times by users in the mailing list and bugzilla. A test case for xfstests is being submitted as well. Fixes: 4bacc9c9234c ("overlayfs: Make f_path always point to the overlay and f_inode to the underlay") Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=101951 Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=109791 Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-09Btrfs: fix loading of orphan roots leading to BUG_ONFilipe Manana
commit 909c3a22da3b8d2cfd3505ca5658f0176859d400 upstream. When looking for orphan roots during mount we can end up hitting a BUG_ON() (at root-item.c:btrfs_find_orphan_roots()) if a log tree is replayed and qgroups are enabled. This is because after a log tree is replayed, a transaction commit is made, which triggers qgroup extent accounting which in turn does backref walking which ends up reading and inserting all roots in the radix tree fs_info->fs_root_radix, including orphan roots (deleted snapshots). So after the log tree is replayed, when finding orphan roots we hit the BUG_ON with the following trace: [118209.182438] ------------[ cut here ]------------ [118209.183279] kernel BUG at fs/btrfs/root-tree.c:314! [118209.184074] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC [118209.185123] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic ppdev xor raid6_pq evdev sg parport_pc parport acpi_cpufreq tpm_tis tpm psmouse processor i2c_piix4 serio_raw pcspkr i2c_core button loop autofs4 ext4 crc16 mbcache jbd2 sd_mod sr_mod cdrom ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio scsi_mod e1000 floppy [last unloaded: btrfs] [118209.186318] CPU: 14 PID: 28428 Comm: mount Tainted: G W 4.5.0-rc5-btrfs-next-24+ #1 [118209.186318] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014 [118209.186318] task: ffff8801ec131040 ti: ffff8800af34c000 task.ti: ffff8800af34c000 [118209.186318] RIP: 0010:[<ffffffffa04237d7>] [<ffffffffa04237d7>] btrfs_find_orphan_roots+0x1fc/0x244 [btrfs] [118209.186318] RSP: 0018:ffff8800af34faa8 EFLAGS: 00010246 [118209.186318] RAX: 00000000ffffffef RBX: 00000000ffffffef RCX: 0000000000000001 [118209.186318] RDX: 0000000080000000 RSI: 0000000000000001 RDI: 00000000ffffffff [118209.186318] RBP: ffff8800af34fb08 R08: 0000000000000001 R09: 0000000000000000 [118209.186318] R10: ffff8800af34f9f0 R11: 6db6db6db6db6db7 R12: ffff880171b97000 [118209.186318] R13: ffff8801ca9d65e0 R14: ffff8800afa2e000 R15: 0000160000000000 [118209.186318] FS: 00007f5bcb914840(0000) GS:ffff88023edc0000(0000) knlGS:0000000000000000 [118209.186318] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [118209.186318] CR2: 00007f5bcaceb5d9 CR3: 00000000b49b5000 CR4: 00000000000006e0 [118209.186318] Stack: [118209.186318] fffffbffffffffff 010230ffffffffff 0101000000000000 ff84000000000000 [118209.186318] fbffffffffffffff 30ffffffffffffff 0000000000000101 ffff880082348000 [118209.186318] 0000000000000000 ffff8800afa2e000 ffff8800afa2e000 0000000000000000 [118209.186318] Call Trace: [118209.186318] [<ffffffffa042e2db>] open_ctree+0x1e37/0x21b9 [btrfs] [118209.186318] [<ffffffffa040a753>] btrfs_mount+0x97e/0xaed [btrfs] [118209.186318] [<ffffffff8108e1c0>] ? trace_hardirqs_on+0xd/0xf [118209.186318] [<ffffffff8117b87e>] mount_fs+0x67/0x131 [118209.186318] [<ffffffff81192d2b>] vfs_kern_mount+0x6c/0xde [118209.186318] [<ffffffffa0409f81>] btrfs_mount+0x1ac/0xaed [btrfs] [118209.186318] [<ffffffff8108e1c0>] ? trace_hardirqs_on+0xd/0xf [118209.186318] [<ffffffff8108c26b>] ? lockdep_init_map+0xb9/0x1b3 [118209.186318] [<ffffffff8117b87e>] mount_fs+0x67/0x131 [118209.186318] [<ffffffff81192d2b>] vfs_kern_mount+0x6c/0xde [118209.186318] [<ffffffff81195637>] do_mount+0x8a6/0x9e8 [118209.186318] [<ffffffff8119598d>] SyS_mount+0x77/0x9f [118209.186318] [<ffffffff81493017>] entry_SYSCALL_64_fastpath+0x12/0x6b [118209.186318] Code: 64 00 00 85 c0 89 c3 75 24 f0 41 80 4c 24 20 20 49 8b bc 24 f0 01 00 00 4c 89 e6 e8 e8 65 00 00 85 c0 89 c3 74 11 83 f8 ef 75 02 <0f> 0b 4c 89 e7 e8 da 72 00 00 eb 1c 41 83 bc 24 00 01 00 00 00 [118209.186318] RIP [<ffffffffa04237d7>] btrfs_find_orphan_roots+0x1fc/0x244 [btrfs] [118209.186318] RSP <ffff8800af34faa8> [118209.230735] ---[ end trace 83938f987d85d477 ]--- So fix this by not treating the error -EEXIST, returned when attempting to insert a root already inserted by the backref walking code, as an error. The following test case for xfstests reproduces the bug: seq=`basename $0` seqres=$RESULT_DIR/$seq echo "QA output created by $seq" tmp=/tmp/$$ status=1 # failure is the default! trap "_cleanup; exit \$status" 0 1 2 3 15 _cleanup() { _cleanup_flakey cd / rm -f $tmp.* } # get standard environment, filters and checks . ./common/rc . ./common/filter . ./common/dmflakey # real QA test starts here _supported_fs btrfs _supported_os Linux _require_scratch _require_dm_target flakey _require_metadata_journaling $SCRATCH_DEV rm -f $seqres.full _scratch_mkfs >>$seqres.full 2>&1 _init_flakey _mount_flakey _run_btrfs_util_prog quota enable $SCRATCH_MNT # Create 2 directories with one file in one of them. # We use these just to trigger a transaction commit later, moving the file from # directory a to directory b and doing an fsync against directory a. mkdir $SCRATCH_MNT/a mkdir $SCRATCH_MNT/b touch $SCRATCH_MNT/a/f sync # Create our test file with 2 4K extents. $XFS_IO_PROG -f -s -c "pwrite -S 0xaa 0 8K" $SCRATCH_MNT/foobar | _filter_xfs_io # Create a snapshot and delete it. This doesn't really delete the snapshot # immediately, just makes it inaccessible and invisible to user space, the # snapshot is deleted later by a dedicated kernel thread (cleaner kthread) # which is woke up at the next transaction commit. # A root orphan item is inserted into the tree of tree roots, so that if a # power failure happens before the dedicated kernel thread does the snapshot # deletion, the next time the filesystem is mounted it resumes the snapshot # deletion. _run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT $SCRATCH_MNT/snap _run_btrfs_util_prog subvolume delete $SCRATCH_MNT/snap # Now overwrite half of the extents we wrote before. Because we made a snapshpot # before, which isn't really deleted yet (since no transaction commit happened # after we did the snapshot delete request), the non overwritten extents get # referenced twice, once by the default subvolume and once by the snapshot. $XFS_IO_PROG -c "pwrite -S 0xbb 4K 8K" $SCRATCH_MNT/foobar | _filter_xfs_io # Now move file f from directory a to directory b and fsync directory a. # The fsync on the directory a triggers a transaction commit (because a file # was moved from it to another directory) and the file fsync leaves a log tree # with file extent items to replay. mv $SCRATCH_MNT/a/f $SCRATCH_MNT/a/b $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/a $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foobar echo "File digest before power failure:" md5sum $SCRATCH_MNT/foobar | _filter_scratch # Now simulate a power failure and mount the filesystem to replay the log tree. # After the log tree was replayed, we used to hit a BUG_ON() when processing # the root orphan item for the deleted snapshot. This is because when processing # an orphan root the code expected to be the first code inserting the root into # the fs_info->fs_root_radix radix tree, while in reallity it was the second # caller attempting to do it - the first caller was the transaction commit that # took place after replaying the log tree, when updating the qgroup counters. _flakey_drop_and_remount echo "File digest before after failure:" # Must match what he got before the power failure. md5sum $SCRATCH_MNT/foobar | _filter_scratch _unmount_flakey status=0 exit Fixes: 2d9e97761087 ("Btrfs: use btrfs_get_fs_root in resolve_indirect_ref") Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-09btrfs: async-thread: Fix a use-after-free error for traceQu Wenruo
commit 0a95b851370b84a4b9d92ee6d1fa0926901d0454 upstream. Parameter of trace_btrfs_work_queued() can be freed in its workqueue. So no one use use that pointer after queue_work(). Fix the user-after-free bug by move the trace line before queue_work(). Reported-by: Dave Jones <davej@codemonkey.org.uk> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-09btrfs: Fix no_space in write and rm loopZhao Lei
commit e1746e8381cd2af421f75557b5cae3604fc18b35 upstream. I see no_space in v4.4-rc1 again in xfstests generic/102. It happened randomly in some node only. (one of 4 phy-node, and a kvm with non-virtio block driver) By bisect, we can found the first-bad is: commit bdced438acd8 ("block: setup bi_phys_segments after splitting")' But above patch only triggered the bug by making bio operation faster(or slower). Main reason is in our space_allocating code, we need to commit page writeback before wait it complish, this patch fixed above bug. BTW, there is another reason for generic/102 fail, caused by disable default mixed-blockgroup, I'll fix it in xfstests. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-09Btrfs: fix deadlock running delayed iputs at transaction commit timeFilipe Manana
commit c2d6cb1636d235257086f939a8194ef0bf93af6e upstream. While running a stress test I ran into a deadlock when running the delayed iputs at transaction time, which produced the following report and trace: [ 886.399989] ============================================= [ 886.400871] [ INFO: possible recursive locking detected ] [ 886.401663] 4.4.0-rc6-btrfs-next-18+ #1 Not tainted [ 886.402384] --------------------------------------------- [ 886.403182] fio/8277 is trying to acquire lock: [ 886.403568] (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs] [ 886.403568] [ 886.403568] but task is already holding lock: [ 886.403568] (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs] [ 886.403568] [ 886.403568] other info that might help us debug this: [ 886.403568] Possible unsafe locking scenario: [ 886.403568] [ 886.403568] CPU0 [ 886.403568] ---- [ 886.403568] lock(&fs_info->delayed_iput_sem); [ 886.403568] lock(&fs_info->delayed_iput_sem); [ 886.403568] [ 886.403568] *** DEADLOCK *** [ 886.403568] [ 886.403568] May be due to missing lock nesting notation [ 886.403568] [ 886.403568] 3 locks held by fio/8277: [ 886.403568] #0: (sb_writers#11){.+.+.+}, at: [<ffffffff81174c4c>] __sb_start_write+0x5f/0xb0 [ 886.403568] #1: (&sb->s_type->i_mutex_key#15){+.+.+.}, at: [<ffffffffa054620d>] btrfs_file_write_iter+0x73/0x408 [btrfs] [ 886.403568] #2: (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs] [ 886.403568] [ 886.403568] stack backtrace: [ 886.403568] CPU: 6 PID: 8277 Comm: fio Not tainted 4.4.0-rc6-btrfs-next-18+ #1 [ 886.403568] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014 [ 886.403568] 0000000000000000 ffff88009f80f770 ffffffff8125d4fd ffffffff82af1fc0 [ 886.403568] ffff88009f80f830 ffffffff8108e5f9 0000000200000000 ffff88009fd92290 [ 886.403568] 0000000000000000 ffffffff82af1fc0 ffffffff829cfb01 00042b216d008804 [ 886.403568] Call Trace: [ 886.403568] [<ffffffff8125d4fd>] dump_stack+0x4e/0x79 [ 886.403568] [<ffffffff8108e5f9>] __lock_acquire+0xd42/0xf0b [ 886.403568] [<ffffffff810c22db>] ? __module_address+0xdf/0x108 [ 886.403568] [<ffffffff8108eb77>] lock_acquire+0x10d/0x194 [ 886.403568] [<ffffffff8108eb77>] ? lock_acquire+0x10d/0x194 [ 886.403568] [<ffffffffa0538823>] ? btrfs_run_delayed_iputs+0x36/0xbf [btrfs] [ 886.489542] [<ffffffff8148556b>] down_read+0x3e/0x4d [ 886.489542] [<ffffffffa0538823>] ? btrfs_run_delayed_iputs+0x36/0xbf [btrfs] [ 886.489542] [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs] [ 886.489542] [<ffffffffa0533953>] btrfs_commit_transaction+0x8f5/0x96e [btrfs] [ 886.489542] [<ffffffffa0521d7a>] flush_space+0x435/0x44a [btrfs] [ 886.489542] [<ffffffffa052218b>] ? reserve_metadata_bytes+0x26a/0x384 [btrfs] [ 886.489542] [<ffffffffa05221ae>] reserve_metadata_bytes+0x28d/0x384 [btrfs] [ 886.489542] [<ffffffffa052256c>] ? btrfs_block_rsv_refill+0x58/0x96 [btrfs] [ 886.489542] [<ffffffffa0522584>] btrfs_block_rsv_refill+0x70/0x96 [btrfs] [ 886.489542] [<ffffffffa053d747>] btrfs_evict_inode+0x394/0x55a [btrfs] [ 886.489542] [<ffffffff81188e31>] evict+0xa7/0x15c [ 886.489542] [<ffffffff81189878>] iput+0x1d3/0x266 [ 886.489542] [<ffffffffa053887c>] btrfs_run_delayed_iputs+0x8f/0xbf [btrfs] [ 886.489542] [<ffffffffa0533953>] btrfs_commit_transaction+0x8f5/0x96e [btrfs] [ 886.489542] [<ffffffff81085096>] ? signal_pending_state+0x31/0x31 [ 886.489542] [<ffffffffa0521191>] btrfs_alloc_data_chunk_ondemand+0x1d7/0x288 [btrfs] [ 886.489542] [<ffffffffa0521282>] btrfs_check_data_free_space+0x40/0x59 [btrfs] [ 886.489542] [<ffffffffa05228f5>] btrfs_delalloc_reserve_space+0x1e/0x4e [btrfs] [ 886.489542] [<ffffffffa053620a>] btrfs_direct_IO+0x10c/0x27e [btrfs] [ 886.489542] [<ffffffff8111d9a1>] generic_file_direct_write+0xb3/0x128 [ 886.489542] [<ffffffffa05463c3>] btrfs_file_write_iter+0x229/0x408 [btrfs] [ 886.489542] [<ffffffff8108ae38>] ? __lock_is_held+0x38/0x50 [ 886.489542] [<ffffffff8117279e>] __vfs_write+0x7c/0xa5 [ 886.489542] [<ffffffff81172cda>] vfs_write+0xa0/0xe4 [ 886.489542] [<ffffffff811734cc>] SyS_write+0x50/0x7e [ 886.489542] [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f [ 1081.852335] INFO: task fio:8244 blocked for more than 120 seconds. [ 1081.854348] Not tainted 4.4.0-rc6-btrfs-next-18+ #1 [ 1081.857560] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1081.863227] fio D ffff880213f9bb28 0 8244 8240 0x00000000 [ 1081.868719] ffff880213f9bb28 00ffffff810fc6b0 ffffffff0000000a ffff88023ed55240 [ 1081.872499] ffff880206b5d400 ffff880213f9c000 ffff88020a4d5318 ffff880206b5d400 [ 1081.876834] ffffffff00000001 ffff880206b5d400 ffff880213f9bb40 ffffffff81482ba4 [ 1081.880782] Call Trace: [ 1081.881793] [<ffffffff81482ba4>] schedule+0x7f/0x97 [ 1081.883340] [<ffffffff81485eb5>] rwsem_down_write_failed+0x2d5/0x325 [ 1081.895525] [<ffffffff8108d48d>] ? trace_hardirqs_on_caller+0x16/0x1ab [ 1081.897419] [<ffffffff81269723>] call_rwsem_down_write_failed+0x13/0x20 [ 1081.899251] [<ffffffff81269723>] ? call_rwsem_down_write_failed+0x13/0x20 [ 1081.901063] [<ffffffff81089fae>] ? __down_write_nested.isra.0+0x1f/0x21 [ 1081.902365] [<ffffffff814855bd>] down_write+0x43/0x57 [ 1081.903846] [<ffffffffa05211b0>] ? btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs] [ 1081.906078] [<ffffffffa05211b0>] btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs] [ 1081.908846] [<ffffffff8108d461>] ? mark_held_locks+0x56/0x6c [ 1081.910409] [<ffffffffa0521282>] btrfs_check_data_free_space+0x40/0x59 [btrfs] [ 1081.912482] [<ffffffffa05228f5>] btrfs_delalloc_reserve_space+0x1e/0x4e [btrfs] [ 1081.914597] [<ffffffffa053620a>] btrfs_direct_IO+0x10c/0x27e [btrfs] [ 1081.919037] [<ffffffff8111d9a1>] generic_file_direct_write+0xb3/0x128 [ 1081.920754] [<ffffffffa05463c3>] btrfs_file_write_iter+0x229/0x408 [btrfs] [ 1081.922496] [<ffffffff8108ae38>] ? __lock_is_held+0x38/0x50 [ 1081.923922] [<ffffffff8117279e>] __vfs_write+0x7c/0xa5 [ 1081.925275] [<ffffffff81172cda>] vfs_write+0xa0/0xe4 [ 1081.926584] [<ffffffff811734cc>] SyS_write+0x50/0x7e [ 1081.927968] [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f [ 1081.985293] INFO: lockdep is turned off. [ 1081.986132] INFO: task fio:8249 blocked for more than 120 seconds. [ 1081.987434] Not tainted 4.4.0-rc6-btrfs-next-18+ #1 [ 1081.988534] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1081.990147] fio D ffff880218febbb8 0 8249 8240 0x00000000 [ 1081.991626] ffff880218febbb8 00ffffff81486b8e ffff88020000000b ffff88023ed75240 [ 1081.993258] ffff8802120a9a00 ffff880218fec000 ffff88020a4d5318 ffff8802120a9a00 [ 1081.994850] ffffffff00000001 ffff8802120a9a00 ffff880218febbd0 ffffffff81482ba4 [ 1081.996485] Call Trace: [ 1081.997037] [<ffffffff81482ba4>] schedule+0x7f/0x97 [ 1081.998017] [<ffffffff81485eb5>] rwsem_down_write_failed+0x2d5/0x325 [ 1081.999241] [<ffffffff810852a5>] ? finish_wait+0x6d/0x76 [ 1082.000306] [<ffffffff81269723>] call_rwsem_down_write_failed+0x13/0x20 [ 1082.001533] [<ffffffff81269723>] ? call_rwsem_down_write_failed+0x13/0x20 [ 1082.002776] [<ffffffff81089fae>] ? __down_write_nested.isra.0+0x1f/0x21 [ 1082.003995] [<ffffffff814855bd>] down_write+0x43/0x57 [ 1082.005000] [<ffffffffa05211b0>] ? btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs] [ 1082.007403] [<ffffffffa05211b0>] btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs] [ 1082.008988] [<ffffffffa0545064>] btrfs_fallocate+0x7c1/0xc2f [btrfs] [ 1082.010193] [<ffffffff8108a1ba>] ? percpu_down_read+0x4e/0x77 [ 1082.011280] [<ffffffff81174c4c>] ? __sb_start_write+0x5f/0xb0 [ 1082.012265] [<ffffffff81174c4c>] ? __sb_start_write+0x5f/0xb0 [ 1082.013021] [<ffffffff811712e4>] vfs_fallocate+0x170/0x1ff [ 1082.013738] [<ffffffff81181ebb>] ioctl_preallocate+0x89/0x9b [ 1082.014778] [<ffffffff811822d7>] do_vfs_ioctl+0x40a/0x4ea [ 1082.015778] [<ffffffff81176ea7>] ? SYSC_newfstat+0x25/0x2e [ 1082.016806] [<ffffffff8118b4de>] ? __fget_light+0x4d/0x71 [ 1082.017789] [<ffffffff8118240e>] SyS_ioctl+0x57/0x79 [ 1082.018706] [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f This happens because we can recursively acquire the semaphore fs_info->delayed_iput_sem when attempting to allocate space to satisfy a file write request as shown in the first trace above - when committing a transaction we acquire (down_read) the semaphore before running the delayed iputs, and when running a delayed iput() we can end up calling an inode's eviction handler, which in turn commits another transaction and attempts to acquire (down_read) again the semaphore to run more delayed iput operations. This results in a deadlock because if a task acquires multiple times a semaphore it should invoke down_read_nested() with a different lockdep class for each level of recursion. Fix this by simplifying the implementation and use a mutex instead that is acquired by the cleaner kthread before it runs the delayed iputs instead of always acquiring a semaphore before delayed references are run from anywhere. Fixes: d7c151717a1e (btrfs: Fix NO_SPACE bug caused by delayed-iput) Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-03btrfs: initialize the seq counter in struct btrfs_deviceSebastian Andrzej Siewior
commit 546bed631203344611f42b2af1d224d2eedb4e6b upstream. I managed to trigger this: | INFO: trying to register non-static key. | the code is fine but needs lockdep annotation. | turning off the locking correctness validator. | CPU: 1 PID: 781 Comm: systemd-gpt-aut Not tainted 4.4.0-rt2+ #14 | Hardware name: ARM-Versatile Express | [<80307cec>] (dump_stack) | [<80070e98>] (__lock_acquire) | [<8007184c>] (lock_acquire) | [<80287800>] (btrfs_ioctl) | [<8012a8d4>] (do_vfs_ioctl) | [<8012ac14>] (SyS_ioctl) so I think that btrfs_device_data_ordered_init() is not invoked behind a macro somewhere. Fixes: 7cc8e58d53cd ("Btrfs: fix unprotected device's variants on 32bits machine") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-03Btrfs: Initialize btrfs_root->highest_objectid when loading tree root and ↵Chandan Rajendra
subvolume roots commit f32e48e925964c4f8ab917850788a87e1cef3bad upstream. The following call trace is seen when btrfs/031 test is executed in a loop, [ 158.661848] ------------[ cut here ]------------ [ 158.662634] WARNING: CPU: 2 PID: 890 at /home/chandan/repos/linux/fs/btrfs/ioctl.c:558 create_subvol+0x3d1/0x6ea() [ 158.664102] BTRFS: Transaction aborted (error -2) [ 158.664774] Modules linked in: [ 158.665266] CPU: 2 PID: 890 Comm: btrfs Not tainted 4.4.0-rc6-g511711a #2 [ 158.666251] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 [ 158.667392] ffffffff81c0a6b0 ffff8806c7c4f8e8 ffffffff81431fc8 ffff8806c7c4f930 [ 158.668515] ffff8806c7c4f920 ffffffff81051aa1 ffff880c85aff000 ffff8800bb44d000 [ 158.669647] ffff8808863b5c98 0000000000000000 00000000fffffffe ffff8806c7c4f980 [ 158.670769] Call Trace: [ 158.671153] [<ffffffff81431fc8>] dump_stack+0x44/0x5c [ 158.671884] [<ffffffff81051aa1>] warn_slowpath_common+0x81/0xc0 [ 158.672769] [<ffffffff81051b27>] warn_slowpath_fmt+0x47/0x50 [ 158.673620] [<ffffffff813bc98d>] create_subvol+0x3d1/0x6ea [ 158.674440] [<ffffffff813777c9>] btrfs_mksubvol.isra.30+0x369/0x520 [ 158.675376] [<ffffffff8108a4aa>] ? percpu_down_read+0x1a/0x50 [ 158.676235] [<ffffffff81377a81>] btrfs_ioctl_snap_create_transid+0x101/0x180 [ 158.677268] [<ffffffff81377b52>] btrfs_ioctl_snap_create+0x52/0x70 [ 158.678183] [<ffffffff8137afb4>] btrfs_ioctl+0x474/0x2f90 [ 158.678975] [<ffffffff81144b8e>] ? vma_merge+0xee/0x300 [ 158.679751] [<ffffffff8115be31>] ? alloc_pages_vma+0x91/0x170 [ 158.680599] [<ffffffff81123f62>] ? lru_cache_add_active_or_unevictable+0x22/0x70 [ 158.681686] [<ffffffff813d99cf>] ? selinux_file_ioctl+0xff/0x1d0 [ 158.682581] [<ffffffff8117b791>] do_vfs_ioctl+0x2c1/0x490 [ 158.683399] [<ffffffff813d3cde>] ? security_file_ioctl+0x3e/0x60 [ 158.684297] [<ffffffff8117b9d4>] SyS_ioctl+0x74/0x80 [ 158.685051] [<ffffffff819b2bd7>] entry_SYSCALL_64_fastpath+0x12/0x6a [ 158.685958] ---[ end trace 4b63312de5a2cb76 ]--- [ 158.686647] BTRFS: error (device loop0) in create_subvol:558: errno=-2 No such entry [ 158.709508] BTRFS info (device loop0): forced readonly [ 158.737113] BTRFS info (device loop0): disk space caching is enabled [ 158.738096] BTRFS error (device loop0): Remounting read-write after error is not allowed [ 158.851303] BTRFS error (device loop0): cleaner transaction attach returned -30 This occurs because, Mount filesystem Create subvol with ID 257 Unmount filesystem Mount filesystem Delete subvol with ID 257 btrfs_drop_snapshot() Add root corresponding to subvol 257 into btrfs_transaction->dropped_roots list Create new subvol (i.e. create_subvol()) 257 is returned as the next free objectid btrfs_read_fs_root_no_name() Finds the btrfs_root instance corresponding to the old subvol with ID 257 in btrfs_fs_info->fs_roots_radix. Returns error since btrfs_root_item->refs has the value of 0. To fix the issue the commit initializes tree root's and subvolume root's highest_objectid when loading the roots from disk. Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-03Btrfs: fix transaction handle leak on failure to create hard linkFilipe Manana
commit 271dba4521aed0c37c063548f876b49f5cd64b2e upstream. If we failed to create a hard link we were not always releasing the the transaction handle we got before, resulting in a memory leak and preventing any other tasks from being able to commit the current transaction. Fix this by always releasing our transaction handle. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-03Btrfs: fix number of transaction units required to create symlinkFilipe Manana
commit 9269d12b2d57d9e3d13036bb750762d1110d425c upstream. We weren't accounting for the insertion of an inline extent item for the symlink inode nor that we need to update the parent inode item (through the call to btrfs_add_nondir()). So fix this by including two more transaction units. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-03Btrfs: send, don't BUG_ON() when an empty symlink is foundFilipe Manana
commit a879719b8c90e15c9e7fa7266d5e3c0ca962f9df upstream. When a symlink is successfully created it always has an inline extent containing the source path. However if an error happens when creating the symlink, we can leave in the subvolume's tree a symlink inode without any such inline extent item - this happens if after btrfs_symlink() calls btrfs_end_transaction() and before it calls the inode eviction handler (through the final iput() call), the transaction gets committed and a crash happens before the eviction handler gets called, or if a snapshot of the subvolume is made before the eviction handler gets called. Sadly we can't just avoid this by making btrfs_symlink() call btrfs_end_transaction() after it calls the eviction handler, because the later can commit the current transaction before it removes any items from the subvolume tree (if it encounters ENOSPC errors while reserving space for removing all the items). So make send fail more gracefully, with an -EIO error, and print a message to dmesg/syslog informing that there's an empty symlink inode, so that the user can delete the empty symlink or do something else about it. Reported-by: Stephen R. van den Berg <srb@cuci.nl> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-03btrfs: statfs: report zero available if metadata are exhaustedDavid Sterba
commit ca8a51b3a979d57b082b14eda38602b7f52d81d1 upstream. There is one ENOSPC case that's very confusing. There's Available greater than zero but no file operation succeds (besides removing files). This happens when the metadata are exhausted and there's no possibility to allocate another chunk. In this scenario it's normal that there's still some space in the data chunk and the calculation in df reflects that in the Avail value. To at least give some clue about the ENOSPC situation, let statfs report zero value in Avail, even if there's still data space available. Current: /dev/sdb1 4.0G 3.3G 719M 83% /mnt/test New: /dev/sdb1 4.0G 3.3G 0 100% /mnt/test We calculate the remaining metadata space minus global reserve. If this is (supposedly) smaller than zero, there's no space. But this does not hold in practice, the exhausted state happens where's still some positive delta. So we apply some guesswork and compare the delta to a 4M threshold. (Practically observed delta was 2M.) We probably cannot calculate the exact threshold value because this depends on the internal reservations requested by various operations, so some operations that consume a few metadata will succeed even if the Avail is zero. But this is better than the other way around. Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-03Btrfs: igrab inode in writepageJosef Bacik
commit be7bd730841e69fe8f70120098596f648cd1f3ff upstream. We hit this panic on a few of our boxes this week where we have an ordered_extent with an NULL inode. We do an igrab() of the inode in writepages, but weren't doing it in writepage which can be called directly from the VM on dirty pages. If the inode has been unlinked then we could have I_FREEING set which means igrab() would return NULL and we get this panic. Fix this by trying to igrab in btrfs_writepage, and if it returns NULL then just redirty the page and return AOP_WRITEPAGE_ACTIVATE; so the VM knows it wasn't successful. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-03Btrfs: add missing brelse when superblock checksum failsAnand Jain
commit b2acdddfad13c38a1e8b927d83c3cf321f63601a upstream. Looks like oversight, call brelse() when checksum fails. Further down the code, in the non error path, we do call brelse() and so we don't see brelse() in the goto error paths. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-02-25Btrfs: fix direct IO requests not reporting IO error to user spaceFilipe Manana
commit 1636d1d77ef4e01e57f706a4cae3371463896136 upstream. If a bio for a direct IO request fails, we were not setting the error in the parent bio (the main DIO bio), making us not return the error to user space in btrfs_direct_IO(), that is, it made __blockdev_direct_IO() return the number of bytes issued for IO and not the error a bio created and submitted by btrfs_submit_direct() got from the block layer. This essentially happens because when we call: dio_end_io(dio_bio, bio->bi_error); It does not set dio_bio->bi_error to the value of the second argument. So just add this missing assignment in endio callbacks, just as we do in the error path at btrfs_submit_direct() when we fail to clone the dio bio or allocate its private object. This follows the convention of what is done with other similar APIs such as bio_endio() where the caller is responsible for setting the bi_error field in the bio it passes as an argument to bio_endio(). This was detected by the new generic test cases in xfstests: 271, 272, 276 and 278. Which essentially setup a dm error target, then load the error table, do a direct IO write and unload the error table. They expect the write to fail with -EIO, which was not getting reported when testing against btrfs. Fixes: 4246a0b63bd8 ("block: add a bi_error field to struct bio") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-02-25Btrfs: fix hang on extent buffer lock caused by the inode_paths ioctlFilipe Manana
commit 0c0fe3b0fa45082cd752553fdb3a4b42503a118e upstream. While doing some tests I ran into an hang on an extent buffer's rwlock that produced the following trace: [39389.800012] NMI watchdog: BUG: soft lockup - CPU#15 stuck for 22s! [fdm-stress:32166] [39389.800016] NMI watchdog: BUG: soft lockup - CPU#14 stuck for 22s! [fdm-stress:32165] [39389.800016] Modules linked in: btrfs dm_mod ppdev xor sha256_generic hmac raid6_pq drbg ansi_cprng aesni_intel i2c_piix4 acpi_cpufreq aes_x86_64 ablk_helper tpm_tis parport_pc i2c_core sg cryptd evdev psmouse lrw tpm parport gf128mul serio_raw pcspkr glue_helper processor button loop autofs4 ext4 crc16 mbcache jbd2 sd_mod sr_mod cdrom ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel scsi_mod e1000 virtio floppy [last unloaded: btrfs] [39389.800016] irq event stamp: 0 [39389.800016] hardirqs last enabled at (0): [< (null)>] (null) [39389.800016] hardirqs last disabled at (0): [<ffffffff8104e58d>] copy_process+0x638/0x1a35 [39389.800016] softirqs last enabled at (0): [<ffffffff8104e58d>] copy_process+0x638/0x1a35 [39389.800016] softirqs last disabled at (0): [< (null)>] (null) [39389.800016] CPU: 14 PID: 32165 Comm: fdm-stress Not tainted 4.4.0-rc6-btrfs-next-18+ #1 [39389.800016] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014 [39389.800016] task: ffff880175b1ca40 ti: ffff8800a185c000 task.ti: ffff8800a185c000 [39389.800016] RIP: 0010:[<ffffffff810902af>] [<ffffffff810902af>] queued_spin_lock_slowpath+0x57/0x158 [39389.800016] RSP: 0018:ffff8800a185fb80 EFLAGS: 00000202 [39389.800016] RAX: 0000000000000101 RBX: ffff8801710c4e9c RCX: 0000000000000101 [39389.800016] RDX: 0000000000000100 RSI: 0000000000000001 RDI: 0000000000000001 [39389.800016] RBP: ffff8800a185fb98 R08: 0000000000000001 R09: 0000000000000000 [39389.800016] R10: ffff8800a185fb68 R11: 6db6db6db6db6db7 R12: ffff8801710c4e98 [39389.800016] R13: ffff880175b1ca40 R14: ffff8800a185fc10 R15: ffff880175b1ca40 [39389.800016] FS: 00007f6d37fff700(0000) GS:ffff8802be9c0000(0000) knlGS:0000000000000000 [39389.800016] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [39389.800016] CR2: 00007f6d300019b8 CR3: 0000000037c93000 CR4: 00000000001406e0 [39389.800016] Stack: [39389.800016] ffff8801710c4e98 ffff8801710c4e98 ffff880175b1ca40 ffff8800a185fbb0 [39389.800016] ffffffff81091e11 ffff8801710c4e98 ffff8800a185fbc8 ffffffff81091895 [39389.800016] ffff8801710c4e98 ffff8800a185fbe8 ffffffff81486c5c ffffffffa067288c [39389.800016] Call Trace: [39389.800016] [<ffffffff81091e11>] queued_read_lock_slowpath+0x46/0x60 [39389.800016] [<ffffffff81091895>] do_raw_read_lock+0x3e/0x41 [39389.800016] [<ffffffff81486c5c>] _raw_read_lock+0x3d/0x44 [39389.800016] [<ffffffffa067288c>] ? btrfs_tree_read_lock+0x54/0x125 [btrfs] [39389.800016] [<ffffffffa067288c>] btrfs_tree_read_lock+0x54/0x125 [btrfs] [39389.800016] [<ffffffffa0622ced>] ? btrfs_find_item+0xa7/0xd2 [btrfs] [39389.800016] [<ffffffffa069363f>] btrfs_ref_to_path+0xd6/0x174 [btrfs] [39389.800016] [<ffffffffa0693730>] inode_to_path+0x53/0xa2 [btrfs] [39389.800016] [<ffffffffa0693e2e>] paths_from_inode+0x117/0x2ec [btrfs] [39389.800016] [<ffffffffa0670cff>] btrfs_ioctl+0xd5b/0x2793 [btrfs] [39389.800016] [<ffffffff8108a8b0>] ? arch_local_irq_save+0x9/0xc [39389.800016] [<ffffffff81276727>] ? __this_cpu_preempt_check+0x13/0x15 [39389.800016] [<ffffffff8108a8b0>] ? arch_local_irq_save+0x9/0xc [39389.800016] [<ffffffff8118b3d4>] ? rcu_read_unlock+0x3e/0x5d [39389.800016] [<ffffffff811822f8>] do_vfs_ioctl+0x42b/0x4ea [39389.800016] [<ffffffff8118b4f3>] ? __fget_light+0x62/0x71 [39389.800016] [<ffffffff8118240e>] SyS_ioctl+0x57/0x79 [39389.800016] [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f [39389.800016] Code: b9 01 01 00 00 f7 c6 00 ff ff ff 75 32 83 fe 01 89 ca 89 f0 0f 45 d7 f0 0f b1 13 39 f0 74 04 89 c6 eb e2 ff ca 0f 84 fa 00 00 00 <8b> 03 84 c0 74 04 f3 90 eb f6 66 c7 03 01 00 e9 e6 00 00 00 e8 [39389.800012] Modules linked in: btrfs dm_mod ppdev xor sha256_generic hmac raid6_pq drbg ansi_cprng aesni_intel i2c_piix4 acpi_cpufreq aes_x86_64 ablk_helper tpm_tis parport_pc i2c_core sg cryptd evdev psmouse lrw tpm parport gf128mul serio_raw pcspkr glue_helper processor button loop autofs4 ext4 crc16 mbcache jbd2 sd_mod sr_mod cdrom ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel scsi_mod e1000 virtio floppy [last unloaded: btrfs] [39389.800012] irq event stamp: 0 [39389.800012] hardirqs last enabled at (0): [< (null)>] (null) [39389.800012] hardirqs last disabled at (0): [<ffffffff8104e58d>] copy_process+0x638/0x1a35 [39389.800012] softirqs last enabled at (0): [<ffffffff8104e58d>] copy_process+0x638/0x1a35 [39389.800012] softirqs last disabled at (0): [< (null)>] (null) [39389.800012] CPU: 15 PID: 32166 Comm: fdm-stress Tainted: G L 4.4.0-rc6-btrfs-next-18+ #1 [39389.800012] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014 [39389.800012] task: ffff880179294380 ti: ffff880034a60000 task.ti: ffff880034a60000 [39389.800012] RIP: 0010:[<ffffffff81091e8d>] [<ffffffff81091e8d>] queued_write_lock_slowpath+0x62/0x72 [39389.800012] RSP: 0018:ffff880034a639f0 EFLAGS: 00000206 [39389.800012] RAX: 0000000000000101 RBX: ffff8801710c4e98 RCX: 0000000000000000 [39389.800012] RDX: 00000000000000ff RSI: 0000000000000000 RDI: ffff8801710c4e9c [39389.800012] RBP: ffff880034a639f8 R08: 0000000000000001 R09: 0000000000000000 [39389.800012] R10: ffff880034a639b0 R11: 0000000000001000 R12: ffff8801710c4e98 [39389.800012] R13: 0000000000000001 R14: ffff880172cbc000 R15: ffff8801710c4e00 [39389.800012] FS: 00007f6d377fe700(0000) GS:ffff8802be9e0000(0000) knlGS:0000000000000000 [39389.800012] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [39389.800012] CR2: 00007f6d3d3c1000 CR3: 0000000037c93000 CR4: 00000000001406e0 [39389.800012] Stack: [39389.800012] ffff8801710c4e98 ffff880034a63a10 ffffffff81091963 ffff8801710c4e98 [39389.800012] ffff880034a63a30 ffffffff81486f1b ffffffffa0672cb3 ffff8801710c4e00 [39389.800012] ffff880034a63a78 ffffffffa0672cb3 ffff8801710c4e00 ffff880034a63a58 [39389.800012] Call Trace: [39389.800012] [<ffffffff81091963>] do_raw_write_lock+0x72/0x8c [39389.800012] [<ffffffff81486f1b>] _raw_write_lock+0x3a/0x41 [39389.800012] [<ffffffffa0672cb3>] ? btrfs_tree_lock+0x119/0x251 [btrfs] [39389.800012] [<ffffffffa0672cb3>] btrfs_tree_lock+0x119/0x251 [btrfs] [39389.800012] [<ffffffffa061aeba>] ? rcu_read_unlock+0x5b/0x5d [btrfs] [39389.800012] [<ffffffffa061ce13>] ? btrfs_root_node+0xda/0xe6 [btrfs] [39389.800012] [<ffffffffa061ce83>] btrfs_lock_root_node+0x22/0x42 [btrfs] [39389.800012] [<ffffffffa062046b>] btrfs_search_slot+0x1b8/0x758 [btrfs] [39389.800012] [<ffffffff810fc6b0>] ? time_hardirqs_on+0x15/0x28 [39389.800012] [<ffffffffa06365db>] btrfs_lookup_inode+0x31/0x95 [btrfs] [39389.800012] [<ffffffff8108d62f>] ? trace_hardirqs_on+0xd/0xf [39389.800012] [<ffffffff8148482b>] ? mutex_lock_nested+0x397/0x3bc [39389.800012] [<ffffffffa068821b>] __btrfs_update_delayed_inode+0x59/0x1c0 [btrfs] [39389.800012] [<ffffffffa068858e>] __btrfs_commit_inode_delayed_items+0x194/0x5aa [btrfs] [39389.800012] [<ffffffff81486ab7>] ? _raw_spin_unlock+0x31/0x44 [39389.800012] [<ffffffffa0688a48>] __btrfs_run_delayed_items+0xa4/0x15c [btrfs] [39389.800012] [<ffffffffa0688d62>] btrfs_run_delayed_items+0x11/0x13 [btrfs] [39389.800012] [<ffffffffa064048e>] btrfs_commit_transaction+0x234/0x96e [btrfs] [39389.800012] [<ffffffffa0618d10>] btrfs_sync_fs+0x145/0x1ad [btrfs] [39389.800012] [<ffffffffa0671176>] btrfs_ioctl+0x11d2/0x2793 [btrfs] [39389.800012] [<ffffffff8108a8b0>] ? arch_local_irq_save+0x9/0xc [39389.800012] [<ffffffff81140261>] ? __might_fault+0x4c/0xa7 [39389.800012] [<ffffffff81140261>] ? __might_fault+0x4c/0xa7 [39389.800012] [<ffffffff8108a8b0>] ? arch_local_irq_save+0x9/0xc [39389.800012] [<ffffffff8118b3d4>] ? rcu_read_unlock+0x3e/0x5d [39389.800012] [<ffffffff811822f8>] do_vfs_ioctl+0x42b/0x4ea [39389.800012] [<ffffffff8118b4f3>] ? __fget_light+0x62/0x71 [39389.800012] [<ffffffff8118240e>] SyS_ioctl+0x57/0x79 [39389.800012] [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f [39389.800012] Code: f0 0f b1 13 85 c0 75 ef eb 2a f3 90 8a 03 84 c0 75 f8 f0 0f b0 13 84 c0 75 f0 ba ff 00 00 00 eb 0a f0 0f b1 13 ff c8 74 0b f3 90 <8b> 03 83 f8 01 75 f7 eb ed c6 43 04 00 5b 5d c3 0f 1f 44 00 00 This happens because in the code path executed by the inode_paths ioctl we end up nesting two calls to read lock a leaf's rwlock when after the first call to read_lock() and before the second call to read_lock(), another task (running the delayed items as part of a transaction commit) has already called write_lock() against the leaf's rwlock. This situation is illustrated by the following diagram: Task A Task B btrfs_ref_to_path() btrfs_commit_transaction() read_lock(&eb->lock); btrfs_run_delayed_items() __btrfs_commit_inode_delayed_items() __btrfs_update_delayed_inode() btrfs_lookup_inode() write_lock(&eb->lock); --> task waits for lock read_lock(&eb->lock); --> makes this task hang forever (and task B too of course) So fix this by avoiding doing the nested read lock, which is easily avoidable. This issue does not happen if task B calls write_lock() after task A does the second call to read_lock(), however there does not seem to exist anything in the documentation that mentions what is the expected behaviour for recursive locking of rwlocks (leaving the idea that doing so is not a good usage of rwlocks). Also, as a side effect necessary for this fix, make sure we do not needlessly read lock extent buffers when the input path has skip_locking set (used when called from send). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-02-25Btrfs: fix page reading in extent_same ioctl leading to csum errorsFilipe Manana
commit 313140023026ae542ad76e7e268c56a1eaa2c28e upstream. In the extent_same ioctl, we were grabbing the pages (locked) and attempting to read them without bothering about any concurrent IO against them. That is, we were not checking for any ongoing ordered extents nor waiting for them to complete, which leads to a race where the extent_same() code gets a checksum verification error when it reads the pages, producing a message like the following in dmesg and making the operation fail to user space with -ENOMEM: [18990.161265] BTRFS warning (device sdc): csum failed ino 259 off 495616 csum 685204116 expected csum 1515870868 Fix this by using btrfs_readpage() for reading the pages instead of extent_read_full_page_nolock(), which waits for any concurrent ordered extents to complete and locks the io range. Also do better error handling and don't treat all failures as -ENOMEM, as that's clearly misleasing, becoming identical to the checks and operation of prepare_uptodate_page(). The use of extent_read_full_page_nolock() was required before commit f441460202cb ("btrfs: fix deadlock with extent-same and readpage"), as we had the range locked in an inode's io tree before attempting to read the pages. Fixes: f441460202cb ("btrfs: fix deadlock with extent-same and readpage") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-02-25Btrfs: fix invalid page accesses in extent_same (dedup) ioctlFilipe Manana
commit e0bd70c67bf996b360f706b6c643000f2e384681 upstream. In the extent_same ioctl we are getting the pages for the source and target ranges and unlocking them immediately after, which is incorrect because later we attempt to map them (with kmap_atomic) and access their contents at btrfs_cmp_data(). When we do such access the pages might have been relocated or removed from memory, which leads to an invalid memory access. This issue is detected on a kernel with CONFIG_DEBUG_PAGEALLOC=y which produces a trace like the following: 186736.677437] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC [186736.680382] Modules linked in: btrfs dm_flakey dm_mod ppdev xor raid6_pq sha256_generic hmac drbg ansi_cprng acpi_cpufreq evdev sg aesni_intel aes_x86_64 parport_pc ablk_helper tpm_tis psmouse parport i2c_piix4 tpm cryptd i2c_core lrw processor button serio_raw pcspkr gf128mul glue_helper loop autofs4 ext4 crc16 mbcache jbd2 sd_mod sr_mod cdrom ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel scsi_mod e1000 virtio floppy [last unloaded: btrfs] [186736.681319] CPU: 13 PID: 10222 Comm: duperemove Tainted: G W 4.4.0-rc6-btrfs-next-18+ #1 [186736.681319] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014 [186736.681319] task: ffff880132600400 ti: ffff880362284000 task.ti: ffff880362284000 [186736.681319] RIP: 0010:[<ffffffff81264d00>] [<ffffffff81264d00>] memcmp+0xb/0x22 [186736.681319] RSP: 0018:ffff880362287d70 EFLAGS: 00010287 [186736.681319] RAX: 000002c002468acf RBX: 0000000012345678 RCX: 0000000000000000 [186736.681319] RDX: 0000000000001000 RSI: 0005d129c5cf9000 RDI: 0005d129c5cf9000 [186736.681319] RBP: ffff880362287d70 R08: 0000000000000000 R09: 0000000000001000 [186736.681319] R10: ffff880000000000 R11: 0000000000000476 R12: 0000000000001000 [186736.681319] R13: ffff8802f91d4c88 R14: ffff8801f2a77830 R15: ffff880352e83e40 [186736.681319] FS: 00007f27b37fe700(0000) GS:ffff88043dda0000(0000) knlGS:0000000000000000 [186736.681319] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [186736.681319] CR2: 00007f27a406a000 CR3: 0000000217421000 CR4: 00000000001406e0 [186736.681319] Stack: [186736.681319] ffff880362287ea0 ffffffffa048d0bd 000000000009f000 0000000000001000 [186736.681319] 0100000000000000 ffff8801f2a77850 ffff8802f91d49b0 ffff880132600400 [186736.681319] 00000000000004f8 ffff8801c1efbe41 0000000000000000 0000000000000038 [186736.681319] Call Trace: [186736.681319] [<ffffffffa048d0bd>] btrfs_ioctl+0x24cb/0x2731 [btrfs] [186736.681319] [<ffffffff8108a8b0>] ? arch_local_irq_save+0x9/0xc [186736.681319] [<ffffffff8118b3d4>] ? rcu_read_unlock+0x3e/0x5d [186736.681319] [<ffffffff811822f8>] do_vfs_ioctl+0x42b/0x4ea [186736.681319] [<ffffffff8118b4f3>] ? __fget_light+0x62/0x71 [186736.681319] [<ffffffff8118240e>] SyS_ioctl+0x57/0x79 [186736.681319] [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f [186736.681319] Code: 0a 3c 6e 74 0d 3c 79 74 04 3c 59 75 0c c6 06 01 eb 03 c6 06 00 31 c0 eb 05 b8 ea ff ff ff 5d c3 55 31 c9 48 89 e5 48 39 d1 74 13 <0f> b6 04 0f 44 0f b6 04 0e 48 ff c1 44 29 c0 74 ea eb 02 31 c0 (gdb) list *(btrfs_ioctl+0x24cb) 0x5e0e1 is in btrfs_ioctl (fs/btrfs/ioctl.c:2972). 2967 dst_addr = kmap_atomic(dst_page); 2968 2969 flush_dcache_page(src_page); 2970 flush_dcache_page(dst_page); 2971 2972 if (memcmp(addr, dst_addr, cmp_len)) 2973 ret = BTRFS_SAME_DATA_DIFFERS; 2974 2975 kunmap_atomic(addr); 2976 kunmap_atomic(dst_addr); So fix this by making sure we keep the pages locked and respect the same locking order as everywhere else: get and lock the pages first and then lock the range in the inode's io tree (like for example at __btrfs_buffered_write() and extent_readpages()). If an ordered extent is found after locking the range in the io tree, unlock the range, unlock the pages, wait for the ordered extent to complete and repeat the entire locking process until no overlapping ordered extents are found. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-02-25btrfs: properly set the termination value of ctx->pos in readdirDavid Sterba
commit bc4ef7592f657ae81b017207a1098817126ad4cb upstream. The value of ctx->pos in the last readdir call is supposed to be set to INT_MAX due to 32bit compatibility, unless 'pos' is intentially set to a larger value, then it's LLONG_MAX. There's a report from PaX SIZE_OVERFLOW plugin that "ctx->pos++" overflows (https://forums.grsecurity.net/viewtopic.php?f=1&t=4284), on a 64bit arch, where the value is 0x7fffffffffffffff ie. LLONG_MAX before the increment. We can get to that situation like that: * emit all regular readdir entries * still in the same call to readdir, bump the last pos to INT_MAX * next call to readdir will not emit any entries, but will reach the bump code again, finds pos to be INT_MAX and sets it to LLONG_MAX Normally this is not a problem, but if we call readdir again, we'll find 'pos' set to LLONG_MAX and the unconditional increment will overflow. The report from Victor at (http://thread.gmane.org/gmane.comp.file-systems.btrfs/49500) with debugging print shows that pattern: Overflow: e Overflow: 7fffffff Overflow: 7fffffffffffffff PAX: size overflow detected in function btrfs_real_readdir fs/btrfs/inode.c:5760 cicus.935_282 max, count: 9, decl: pos; num: 0; context: dir_context; CPU: 0 PID: 2630 Comm: polkitd Not tainted 4.2.3-grsec #1 Hardware name: Gigabyte Technology Co., Ltd. H81ND2H/H81ND2H, BIOS F3 08/11/2015 ffffffff81901608 0000000000000000 ffffffff819015e6 ffffc90004973d48 ffffffff81742f0f 0000000000000007 ffffffff81901608 ffffc90004973d78 ffffffff811cb706 0000000000000000 ffff8800d47359e0 ffffc90004973ed8 Call Trace: [<ffffffff81742f0f>] dump_stack+0x4c/0x7f [<ffffffff811cb706>] report_size_overflow+0x36/0x40 [<ffffffff812ef0bc>] btrfs_real_readdir+0x69c/0x6d0 [<ffffffff811dafc8>] iterate_dir+0xa8/0x150 [<ffffffff811e6d8d>] ? __fget_light+0x2d/0x70 [<ffffffff811dba3a>] SyS_getdents+0xba/0x1c0 Overflow: 1a [<ffffffff811db070>] ? iterate_dir+0x150/0x150 [<ffffffff81749b69>] entry_SYSCALL_64_fastpath+0x12/0x83 The jump from 7fffffff to 7fffffffffffffff happens when new dir entries are not yet synced and are processed from the delayed list. Then the code could go to the bump section again even though it might not emit any new dir entries from the delayed list. The fix avoids entering the "bump" section again once we've finished emitting the entries, both for synced and delayed entries. References: https://forums.grsecurity.net/viewtopic.php?f=1&t=4284 Reported-by: Victor <services@swwu.com> Signed-off-by: David Sterba <dsterba@suse.com> Tested-by: Holger Hoffstätte <holger.hoffstaette@googlemail.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-02-25Revert "btrfs: clear PF_NOFREEZE in cleaner_kthread()"David Sterba
commit 80ad623edd2d0ccb47d85357ee31c97e6c684e82 upstream. This reverts commit 696249132158014d594896df3a81390616069c5c. The cleaner thread can block freezing when there's a snapshot cleaning in progress and the other threads get suspended first. From the logs provided by Martin we're waiting for reading extent pages: kernel: PM: Syncing filesystems ... done. kernel: Freezing user space processes ... (elapsed 0.015 seconds) done. kernel: Freezing remaining freezable tasks ... kernel: Freezing of tasks failed after 20.003 seconds (1 tasks refusing to freeze, wq_busy=0): kernel: btrfs-cleaner D ffff88033dd13bc0 0 152 2 0x00000000 kernel: ffff88032ebc2e00 ffff88032e750000 ffff88032e74fa50 7fffffffffffffff kernel: ffffffff814a58df 0000000000000002 ffffea000934d580 ffffffff814a5451 kernel: 7fffffffffffffff ffffffff814a6e8f 0000000000000000 0000000000000020 kernel: Call Trace: kernel: [<ffffffff814a58df>] ? bit_wait+0x2c/0x2c kernel: [<ffffffff814a5451>] ? schedule+0x6f/0x7c kernel: [<ffffffff814a6e8f>] ? schedule_timeout+0x2f/0xd8 kernel: [<ffffffff81076f94>] ? timekeeping_get_ns+0xa/0x2e kernel: [<ffffffff81077603>] ? ktime_get+0x36/0x44 kernel: [<ffffffff814a4f6c>] ? io_schedule_timeout+0x94/0xf2 kernel: [<ffffffff814a4f6c>] ? io_schedule_timeout+0x94/0xf2 kernel: [<ffffffff814a590b>] ? bit_wait_io+0x2c/0x30 kernel: [<ffffffff814a5694>] ? __wait_on_bit+0x41/0x73 kernel: [<ffffffff8109eba8>] ? wait_on_page_bit+0x6d/0x72 kernel: [<ffffffff8105d718>] ? autoremove_wake_function+0x2a/0x2a kernel: [<ffffffff811a02d7>] ? read_extent_buffer_pages+0x1bd/0x203 kernel: [<ffffffff8117d9e9>] ? free_root_pointers+0x4c/0x4c kernel: [<ffffffff8117e831>] ? btree_read_extent_buffer_pages.constprop.57+0x5a/0xe9 kernel: [<ffffffff8117f4f3>] ? read_tree_block+0x2d/0x45 kernel: [<ffffffff8116782a>] ? read_block_for_search.isra.34+0x22a/0x26b kernel: [<ffffffff811656c3>] ? btrfs_set_path_blocking+0x1e/0x4a kernel: [<ffffffff8116919b>] ? btrfs_search_slot+0x648/0x736 kernel: [<ffffffff81170559>] ? btrfs_lookup_extent_info+0xb7/0x2c7 kernel: [<ffffffff81170ee5>] ? walk_down_proc+0x9c/0x1ae kernel: [<ffffffff81171c9d>] ? walk_down_tree+0x40/0xa4 kernel: [<ffffffff8117375f>] ? btrfs_drop_snapshot+0x2da/0x664 kernel: [<ffffffff8104ff21>] ? finish_task_switch+0x126/0x167 kernel: [<ffffffff811850f8>] ? btrfs_clean_one_deleted_snapshot+0xa6/0xb0 kernel: [<ffffffff8117eaba>] ? cleaner_kthread+0x13e/0x17b kernel: [<ffffffff8117e97c>] ? btrfs_item_end+0x33/0x33 kernel: [<ffffffff8104d256>] ? kthread+0x95/0x9d kernel: [<ffffffff8104d1c1>] ? kthread_parkme+0x16/0x16 kernel: [<ffffffff814a7b5f>] ? ret_from_fork+0x3f/0x70 kernel: [<ffffffff8104d1c1>] ? kthread_parkme+0x16/0x16 As this affects a released kernel (4.4) we need a minimal fix for stable kernels. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=108361 Reported-by: Martin Ziegler <ziegler@uni-freiburg.de> CC: Jiri Kosina <jkosina@suse.cz> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-02-25Btrfs: fix fitrim discarding device area reserved for boot loader's useFilipe Manana
commit 8cdc7c5b00d945a3c823fc4277af304abb9cb43d upstream. As of the 4.3 kernel release, the fitrim ioctl can now discard any region of a disk that is not allocated to any chunk/block group, including the first megabyte which is used for our primary superblock and by the boot loader (grub for example). Fix this by not allowing to trim/discard any region in the device starting with an offset not greater than min(alloc_start_mount_option, 1Mb), just as it was not possible before 4.3. A reproducer test case for xfstests follows. seq=`basename $0` seqres=$RESULT_DIR/$seq echo "QA output created by $seq" tmp=/tmp/$$ status=1 # failure is the default! trap "_cleanup; exit \$status" 0 1 2 3 15 _cleanup() { cd / rm -f $tmp.* } # get standard environment, filters and checks . ./common/rc . ./common/filter # real QA test starts here _need_to_be_root _supported_fs btrfs _supported_os Linux _require_scratch rm -f $seqres.full _scratch_mkfs >>$seqres.full 2>&1 # Write to the [0, 64Kb[ and [68Kb, 1Mb[ ranges of the device. These ranges are # reserved for a boot loader to use (GRUB for example) and btrfs should never # use them - neither for allocating metadata/data nor should trim/discard them. # The range [64Kb, 68Kb[ is used for the primary superblock of the filesystem. $XFS_IO_PROG -c "pwrite -S 0xfd 0 64K" $SCRATCH_DEV | _filter_xfs_io $XFS_IO_PROG -c "pwrite -S 0xfd 68K 956K" $SCRATCH_DEV | _filter_xfs_io # Now mount the filesystem and perform a fitrim against it. _scratch_mount _require_batched_discard $SCRATCH_MNT $FSTRIM_PROG $SCRATCH_MNT # Now unmount the filesystem and verify the content of the ranges was not # modified (no trim/discard happened on them). _scratch_unmount echo "Content of the ranges [0, 64Kb] and [68Kb, 1Mb[ after fitrim:" od -t x1 -N $((64 * 1024)) $SCRATCH_DEV od -t x1 -j $((68 * 1024)) -N $((956 * 1024)) $SCRATCH_DEV status=0 exit Reported-by: Vincent Petry <PVince81@yahoo.fr> Reported-by: Andrei Borzenkov <arvidjaar@gmail.com> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=109341 Fixes: 499f377f49f0 (btrfs: iterate over unused chunk space in FITRIM) Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-02-25btrfs: handle invalid num_stripes in sys_arrayDavid Sterba
commit f5cdedd73fa71b74dcc42f2a11a5735d89ce7c4f upstream. We can handle the special case of num_stripes == 0 directly inside btrfs_read_sys_array. The BUG_ON in btrfs_chunk_item_size is there to catch other unhandled cases where we fail to validate external data. A crafted or corrupted image crashes at mount time: BTRFS: device fsid 9006933e-2a9a-44f0-917f-514252aeec2c devid 1 transid 7 /dev/loop0 BTRFS info (device loop0): disk space caching is enabled BUG: failure at fs/btrfs/ctree.h:337/btrfs_chunk_item_size()! Kernel panic - not syncing: BUG! CPU: 0 PID: 313 Comm: mount Not tainted 4.2.5-00657-ge047887-dirty #25 Stack: 637af890 60062489 602aeb2e 604192ba 60387961 00000011 637af8a0 6038a835 637af9c0 6038776b 634ef32b 00000000 Call Trace: [<6001c86d>] show_stack+0xfe/0x15b [<6038a835>] dump_stack+0x2a/0x2c [<6038776b>] panic+0x13e/0x2b3 [<6020f099>] btrfs_read_sys_array+0x25d/0x2ff [<601cfbbe>] open_ctree+0x192d/0x27af [<6019c2c1>] btrfs_mount+0x8f5/0xb9a [<600bc9a7>] mount_fs+0x11/0xf3 [<600d5167>] vfs_kern_mount+0x75/0x11a [<6019bcb0>] btrfs_mount+0x2e4/0xb9a [<600bc9a7>] mount_fs+0x11/0xf3 [<600d5167>] vfs_kern_mount+0x75/0x11a [<600d710b>] do_mount+0xa35/0xbc9 [<600d7557>] SyS_mount+0x95/0xc8 [<6001e884>] handle_syscall+0x6b/0x8e Reported-by: Jiri Slaby <jslaby@suse.com> Reported-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-12-18Merge branch 'for-linus-4.4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "A couple of small fixes" * 'for-linus-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: check prepare_uptodate_page() error code earlier Btrfs: check for empty bitmap list in setup_cluster_bitmaps btrfs: fix misleading warning when space cache failed to load Btrfs: fix transaction handle leak in balance Btrfs: fix unprotected list move from unused_bgs to deleted_bgs list
2015-12-15Merge branch 'for-chris-4.4' of ↵Chris Mason
git://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux into for-linus-4.4
2015-12-15Btrfs: check prepare_uptodate_page() error code earlierChris Mason
prepare_pages() may end up calling prepare_uptodate_page() twice if our write only spans a single page. But if the first call returns an error, our page will be unlocked and its not safe to call it again. This bug goes all the way back to 2011, and it's not something commonly hit. While we're here, add a more explicit check for the page being truncated away. The bare lock_page() alone is protected only by good thoughts and i_mutex, which we're sure to regret eventually. Reported-by: Dave Jones <dsj@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-12-15Btrfs: check for empty bitmap list in setup_cluster_bitmapsChris Mason
Dave Jones found a warning from kasan in setup_cluster_bitmaps() ================================================================== BUG: KASAN: stack-out-of-bounds in setup_cluster_bitmap+0xc4/0x5a0 at addr ffff88039bef6828 Read of size 8 by task nfsd/1009 page:ffffea000e6fbd80 count:0 mapcount:0 mapping: (null) index:0x0 flags: 0x8000000000000000() page dumped because: kasan: bad access detected CPU: 1 PID: 1009 Comm: nfsd Tainted: G W 4.4.0-rc3-backup-debug+ #1 ffff880065647b50 000000006bb712c2 ffff88039bef6640 ffffffffa680a43e 0000004559c00000 ffff88039bef66c8 ffffffffa62638d1 ffffffffa61121c0 ffff8803a5769de8 0000000000000296 ffff8803a5769df0 0000000000046280 Call Trace: [<ffffffffa680a43e>] dump_stack+0x4b/0x6d [<ffffffffa62638d1>] kasan_report_error+0x501/0x520 [<ffffffffa61121c0>] ? debug_show_all_locks+0x1e0/0x1e0 [<ffffffffa6263948>] kasan_report+0x58/0x60 [<ffffffffa6814b00>] ? rb_last+0x10/0x40 [<ffffffffa66f8af4>] ? setup_cluster_bitmap+0xc4/0x5a0 [<ffffffffa6262ead>] __asan_load8+0x5d/0x70 [<ffffffffa66f8af4>] setup_cluster_bitmap+0xc4/0x5a0 [<ffffffffa66f675a>] ? setup_cluster_no_bitmap+0x6a/0x400 [<ffffffffa66fcd16>] btrfs_find_space_cluster+0x4b6/0x640 [<ffffffffa66fc860>] ? btrfs_alloc_from_cluster+0x4e0/0x4e0 [<ffffffffa66fc36e>] ? btrfs_return_cluster_to_free_space+0x9e/0xb0 [<ffffffffa702dc37>] ? _raw_spin_unlock+0x27/0x40 [<ffffffffa666a1a1>] find_free_extent+0xba1/0x1520 Andrey noticed this was because we were doing list_first_entry on a list that might be empty. Rework the tests a bit so we don't do that. Signed-off-by: Chris Mason <clm@fb.com> Reprorted-by: Andrey Ryabinin <ryabinin.a.a@gmail.com> Reported-by: Dave Jones <dsj@fb.com>
2015-12-10btrfs: fix misleading warning when space cache failed to loadHolger Hoffstätte
When an inconsistent space cache is detected during loading we log a warning that users frequently mistake as instruction to invalidate the cache manually, even though this is not required. Fix the message to indicate that the cache will be rebuilt automatically. Signed-off-by: Holger Hoffstätte <holger.hoffstaette@googlemail.com> Acked-by: Filipe Manana <fdmanana@suse.com>
2015-12-10Btrfs: fix transaction handle leak in balanceFilipe Manana
If we fail to allocate a new data chunk, we were jumping to the error path without release the transaction handle we got before. Fix this by always releasing it before doing the jump. Fixes: 2c9fe8355258 ("btrfs: Fix lost-data-profile caused by balance bg") Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-12-10Btrfs: fix unprotected list move from unused_bgs to deleted_bgs listFilipe Manana
As of my previous change titled "Btrfs: fix scrub preventing unused block groups from being deleted", the following warning at extent-tree.c:btrfs_delete_unused_bgs() can be hit when we mount the a filesysten with "-o discard": 10263 void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info) 10264 { (...) 10405 if (trimming) { 10406 WARN_ON(!list_empty(&block_group->bg_list)); 10407 spin_lock(&trans->transaction->deleted_bgs_lock); 10408 list_move(&block_group->bg_list, 10409 &trans->transaction->deleted_bgs); 10410 spin_unlock(&trans->transaction->deleted_bgs_lock); 10411 btrfs_get_block_group(block_group); 10412 } (...) This happens because scrub can now add back the block group to the list of unused block groups (fs_info->unused_bgs). This is dangerous because we are moving the block group from the unused block groups list to the list of deleted block groups without holding the lock that protects the source list (fs_info->unused_bgs_lock). The following diagram illustrates how this happens: CPU 1 CPU 2 cleaner_kthread() btrfs_delete_unused_bgs() sees bg X in list fs_info->unused_bgs deletes bg X from list fs_info->unused_bgs scrub_enumerate_chunks() searches device tree using its commit root finds device extent for block group X gets block group X from the tree fs_info->block_group_cache_tree (via btrfs_lookup_block_group()) sets bg X to RO (again) scrub_chunk(bg X) sets bg X back to RW mode adds bg X to the list fs_info->unused_bgs again, since it's still unused and currently not in that list sets bg X to RO mode btrfs_remove_chunk(bg X) --> discard is enabled and bg X is in the fs_info->unused_bgs list again so the warning is triggered --> we move it from that list into the transaction's delete_bgs list, but we can have another task currently manipulating the first list (fs_info->unused_bgs) Fix this by using the same lock (fs_info->unused_bgs_lock) to protect both the list of unused block groups and the list of deleted block groups. This makes it safe and there's not much worry for more lock contention, as this lock is seldom used and only the cleaner kthread adds elements to the list of deleted block groups. The warning goes away too, as this was previously an impossible case (and would have been better a BUG_ON/ASSERT) but it's not impossible anymore. Reproduced with fstest btrfs/073 (using MOUNT_OPTIONS="-o discard"). Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-11-27Merge branch 'for-linus-4.4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "This has Mark Fasheh's patches to fix quota accounting during subvol deletion, which we've been working on for a while now. The patch is pretty small but it's a key fix. Otherwise it's a random assortment" * 'for-linus-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: btrfs: fix balance range usage filters in 4.4-rc btrfs: qgroup: account shared subtree during snapshot delete Btrfs: use btrfs_get_fs_root in resolve_indirect_ref btrfs: qgroup: fix quota disable during rescan Btrfs: fix race between cleaner kthread and space cache writeout Btrfs: fix scrub preventing unused block groups from being deleted Btrfs: fix race between scrub and block group deletion btrfs: fix rcu warning during device replace btrfs: Continue replace when set_block_ro failed btrfs: fix clashing number of the enhanced balance usage filter Btrfs: fix the number of transaction units needed to remove a block group Btrfs: use global reserve when deleting unused block group after ENOSPC Btrfs: tests: checking for NULL instead of IS_ERR() btrfs: fix signed overflows in btrfs_sync_file
2015-11-25btrfs: fix balance range usage filters in 4.4-rcHolger Hoffstätte
There's a regression in 4.4-rc since commit bc3094673f22 (btrfs: extend balance filter usage to take minimum and maximum) in that existing (non-ranged) balance with -dusage=x no longer works; all chunks are skipped. After staring at the code for a while and wondering why a non-ranged balance would even need min and max thresholds (..which then were not set correctly, leading to the bug) I realized that the only problem was the fact that the filter functions were named wrong, thanks to patching copypasta. Simply renaming both functions lets the existing btrfs-progs call balance with -dusage=x and now the non-ranged filter function is invoked, properly using only a single chunk limit. Signed-off-by: Holger Hoffstätte <holger.hoffstaette@googlemail.com> Fixes: bc3094673f22 ("btrfs: extend balance filter usage to take minimum and maximum") Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-11-25btrfs: qgroup: account shared subtree during snapshot deleteMark Fasheh
Commit 0ed4792 ('btrfs: qgroup: Switch to new extent-oriented qgroup mechanism.') removed our qgroup accounting during btrfs_drop_snapshot(). Predictably, this results in qgroup numbers going bad shortly after a snapshot is removed. Fix this by adding a dirty extent record when we encounter extents during our shared subtree walk. This effectively restores the functionality we had with the original shared subtree walking code in 1152651 (btrfs: qgroup: account shared subtrees during snapshot delete). The idea with the original patch (and this one) is that shared subtrees can get skipped during drop_snapshot. The shared subtree walk then allows us a chance to visit those extents and add them to the qgroup work for later processing. This ultimately makes the accounting for drop snapshot work. The new qgroup code nicely handles all the other extents during the tree walk via the ref dec/inc functions so we don't have to add actions beyond what we had originally. Signed-off-by: Mark Fasheh <mfasheh@suse.de> Signed-off-by: Chris Mason <clm@fb.com>
2015-11-25Btrfs: use btrfs_get_fs_root in resolve_indirect_refJosef Bacik
The backref code will look up the fs_root we're trying to resolve our indirect refs for, unfortunately we use btrfs_read_fs_root_no_name, which returns -ENOENT if the ref is 0. This isn't helpful for the qgroup stuff with snapshot delete as it won't be able to search down the snapshot we are deleting, which will cause us to miss roots. So use btrfs_get_fs_root and send false for check_ref so we can always get the root we're looking for. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Mark Fasheh <mfasheh@suse.de> Signed-off-by: Chris Mason <clm@fb.com>
2015-11-25btrfs: qgroup: fix quota disable during rescanJustin Maggard
There's a race condition that leads to a NULL pointer dereference if you disable quotas while a quota rescan is running. To fix this, we just need to wait for the quota rescan worker to actually exit before tearing down the quota structures. Signed-off-by: Justin Maggard <jmaggard@netgear.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-11-25Btrfs: fix race between cleaner kthread and space cache writeoutFilipe Manana
When a block group becomes unused and the cleaner kthread is currently running, we can end up getting the current transaction aborted with error -ENOENT when we try to commit the transaction, leading to the following trace: [59779.258768] WARNING: CPU: 3 PID: 5990 at fs/btrfs/extent-tree.c:3740 btrfs_write_dirty_block_groups+0x17c/0x214 [btrfs]() [59779.272594] BTRFS: Transaction aborted (error -2) (...) [59779.291137] Call Trace: [59779.291621] [<ffffffff812566f4>] dump_stack+0x4e/0x79 [59779.292543] [<ffffffff8104d0a6>] warn_slowpath_common+0x9f/0xb8 [59779.293435] [<ffffffffa04cb81f>] ? btrfs_write_dirty_block_groups+0x17c/0x214 [btrfs] [59779.295000] [<ffffffff8104d107>] warn_slowpath_fmt+0x48/0x50 [59779.296138] [<ffffffffa04c2721>] ? write_one_cache_group.isra.32+0x77/0x82 [btrfs] [59779.297663] [<ffffffffa04cb81f>] btrfs_write_dirty_block_groups+0x17c/0x214 [btrfs] [59779.299141] [<ffffffffa0549b0d>] commit_cowonly_roots+0x1de/0x261 [btrfs] [59779.300359] [<ffffffffa04dd5b6>] btrfs_commit_transaction+0x4c4/0x99c [btrfs] [59779.301805] [<ffffffffa04b5df4>] btrfs_sync_fs+0x145/0x1ad [btrfs] [59779.302893] [<ffffffff81196634>] sync_filesystem+0x7f/0x93 (...) [59779.318186] ---[ end trace 577e2daff90da33a ]--- The following diagram illustrates a sequence of steps leading to this problem: CPU 1 CPU 2 <at transaction N> adds bg A to list fs_info->unused_bgs adds bg B to list fs_info->unused_bgs <transaction kthread commits transaction N and wakes up the cleaner kthread> cleaner kthread delete_unused_bgs() sees bg A in list fs_info->unused_bgs btrfs_start_transaction() <transaction N + 1 starts> deletes bg A update_block_group(bg C) --> adds bg C to list fs_info->unused_bgs deletes bg B sees bg C in the list fs_info->unused_bgs btrfs_remove_chunk(bg C) btrfs_remove_block_group(bg C) --> checks if the block group is in a dirty list, and because it isn't now, it does nothing --> the block group item is deleted from the extent tree --> adds bg C to list transaction->dirty_bgs some task calls btrfs_commit_transaction(t N + 1) commit_cowonly_roots() btrfs_write_dirty_block_groups() --> sees bg C in cur_trans->dirty_bgs --> calls write_one_cache_group() which returns -ENOENT because it did not find the block group item in the extent tree --> transaction aborte with -ENOENT because write_one_cache_group() returned that error So fix this by adding a block group to the list of dirty block groups before adding it to the list of unused block groups. This happened on a stress test using fsstress plus concurrent calls to fallocate 20G and truncate (releasing part of the space allocated with fallocate). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-11-25Btrfs: fix scrub preventing unused block groups from being deletedFilipe Manana
Currently scrub can race with the cleaner kthread when the later attempts to delete an unused block group, and the result is preventing the cleaner kthread from ever deleting later the block group - unless the block group becomes used and unused again. The following diagram illustrates that race: CPU 1 CPU 2 cleaner kthread btrfs_delete_unused_bgs() gets block group X from fs_info->unused_bgs and removes it from that list scrub_enumerate_chunks() searches device tree using its commit root finds device extent for block group X gets block group X from the tree fs_info->block_group_cache_tree (via btrfs_lookup_block_group()) sets bg X to RO sees the block group is already RO and therefore doesn't delete it nor adds it back to unused list So fix this by making scrub add the block group again to the list of unused block groups if the block group is still unused when it finished scrubbing it and it hasn't been removed already. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-11-25Btrfs: fix race between scrub and block group deletionFilipe Manana
Scrub can race with the cleaner kthread deleting block groups that are unused (and with relocation too) leading to a failure with error -EINVAL that gets returned to user space. The following diagram illustrates how it happens: CPU 1 CPU 2 cleaner kthread btrfs_delete_unused_bgs() gets block group X from fs_info->unused_bgs sets block group to RO btrfs_remove_chunk(bg X) deletes device extents scrub_enumerate_chunks() searches device tree using its commit root finds device extent for block group X gets block group X from the tree fs_info->block_group_cache_tree (via btrfs_lookup_block_group()) sets bg X to RO (again) btrfs_remove_block_group(bg X) deletes block group from fs_info->block_group_cache_tree removes extent map from fs_info->mapping_tree scrub_chunk(offset X) searches fs_info->mapping_tree for extent map starting at offset X --> doesn't find any such extent map --> returns -EINVAL and scrub errors out to userspace with -EINVAL Fix this by dealing with an extent map lookup failure as an indicator of block group deletion. Issue reproduced with fstest btrfs/071. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-11-25btrfs: fix rcu warning during device replaceDavid Sterba
The test btrfs/011 triggers a rcu warning Reviewed-by: Anand Jain <anand.jain@oracle.com> =============================== [ INFO: suspicious RCU usage. ] 4.4.0-rc1-default+ #286 Tainted: G W ------------------------------- fs/btrfs/volumes.c:1977 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 1, debug_locks = 0 4 locks held by btrfs/28786: 0: (&fs_info->dev_replace.lock_finishing_cancel_unmount){+.+...}, at: [<ffffffffa00bc785>] btrfs_dev_replace_finishing+0x45/0xa00 [btrfs] 1: (uuid_mutex){+.+.+.}, at: [<ffffffffa00bc84f>] btrfs_dev_replace_finishing+0x10f/0xa00 [btrfs] 2: (&fs_devs->device_list_mutex){+.+.+.}, at: [<ffffffffa00bc868>] btrfs_dev_replace_finishing+0x128/0xa00 [btrfs] 3: (&fs_info->chunk_mutex){+.+...}, at: [<ffffffffa00bc87d>] btrfs_dev_replace_finishing+0x13d/0xa00 [btrfs] stack backtrace: CPU: 0 PID: 28786 Comm: btrfs Tainted: G W 4.4.0-rc1-default+ #286 Hardware name: Intel Corporation SandyBridge Platform/To be filled by O.E.M., BIOS ASNBCPT1.86C.0031.B00.1006301607 06/30/2010 0000000000000001 ffff8800a07dfb48 ffffffff8141d47b 0000000000000001 0000000000000001 0000000000000000 ffff8801464a4f00 ffff8800a07dfb78 ffffffff810cd883 ffff880146eb9400 ffff8800a3698600 ffff8800a33fe220 Call Trace: [<ffffffff8141d47b>] dump_stack+0x4f/0x74 [<ffffffff810cd883>] lockdep_rcu_suspicious+0x103/0x140 [<ffffffffa0071261>] btrfs_rm_dev_replace_remove_srcdev+0x111/0x130 [btrfs] [<ffffffff810d354d>] ? trace_hardirqs_on+0xd/0x10 [<ffffffff81449536>] ? __percpu_counter_sum+0x66/0x80 [<ffffffffa00bcc15>] btrfs_dev_replace_finishing+0x4d5/0xa00 [btrfs] [<ffffffffa00bc96e>] ? btrfs_dev_replace_finishing+0x22e/0xa00 [btrfs] [<ffffffffa00a8795>] ? btrfs_scrub_dev+0x415/0x6d0 [btrfs] [<ffffffffa003ea69>] ? btrfs_start_transaction+0x9/0x20 [btrfs] [<ffffffffa00bda79>] btrfs_dev_replace_start+0x339/0x590 [btrfs] [<ffffffff81196aa5>] ? __might_fault+0x95/0xa0 [<ffffffffa0078638>] btrfs_ioctl_dev_replace+0x118/0x160 [btrfs] [<ffffffff811409c6>] ? stack_trace_call+0x46/0x70 [<ffffffffa007c914>] ? btrfs_ioctl+0x24/0x1770 [btrfs] [<ffffffffa007ce43>] btrfs_ioctl+0x553/0x1770 [btrfs] [<ffffffff811409c6>] ? stack_trace_call+0x46/0x70 [<ffffffff811d6eb1>] ? do_vfs_ioctl+0x21/0x5a0 [<ffffffff811d6f1c>] do_vfs_ioctl+0x8c/0x5a0 [<ffffffff811e3336>] ? __fget_light+0x86/0xb0 [<ffffffff811e3369>] ? __fdget+0x9/0x20 [<ffffffff811d7451>] ? SyS_ioctl+0x21/0x80 [<ffffffff811d7483>] SyS_ioctl+0x53/0x80 [<ffffffff81b1efd7>] entry_SYSCALL_64_fastpath+0x12/0x6f This is because of unprotected use of rcu_dereference in btrfs_scratch_superblocks. We can't add rcu locks around the whole function because we read the superblock. The fix will use the rcu string buffer directly without the rcu locking. Thi is safe as the device will not go away in the meantime. We're holding the device list mutexes. Restructuring the code to narrow down the rcu section turned out to be impossible, we need to call filp_open (through update_dev_time) on the buffer and this could call kmalloc/__might_sleep. We could call kstrdup with GFP_ATOMIC but it's not absolutely necessary. Fixes: 12b1c2637b6e (Btrfs: enhance btrfs_scratch_superblock to scratch all superblocks) Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-11-25btrfs: Continue replace when set_block_ro failedZhaolei
xfstests/011 failed in node with small_size filesystem. Can be reproduced by following script: DEV_LIST="/dev/vdd /dev/vde" DEV_REPLACE="/dev/vdf" do_test() { local mkfs_opt="$1" local size="$2" dmesg -c >/dev/null umount $SCRATCH_MNT &>/dev/null echo mkfs.btrfs -f $mkfs_opt "${DEV_LIST[*]}" mkfs.btrfs -f $mkfs_opt "${DEV_LIST[@]}" || return 1 mount "${DEV_LIST[0]}" $SCRATCH_MNT echo -n "Writing big files" dd if=/dev/urandom of=$SCRATCH_MNT/t0 bs=1M count=1 >/dev/null 2>&1 for ((i = 1; i <= size; i++)); do echo -n . /bin/cp $SCRATCH_MNT/t0 $SCRATCH_MNT/t$i || return 1 done echo echo Start replace btrfs replace start -Bf "${DEV_LIST[0]}" "$DEV_REPLACE" $SCRATCH_MNT || { dmesg return 1 } return 0 } # Set size to value near fs size # for example, 1897 can trigger this bug in 2.6G device. # ./do_test "-d raid1 -m raid1" 1897 System will report replace fail with following warning in dmesg: [ 134.710853] BTRFS: dev_replace from /dev/vdd (devid 1) to /dev/vdf started [ 135.542390] BTRFS: btrfs_scrub_dev(/dev/vdd, 1, /dev/vdf) failed -28 [ 135.543505] ------------[ cut here ]------------ [ 135.544127] WARNING: CPU: 0 PID: 4080 at fs/btrfs/dev-replace.c:428 btrfs_dev_replace_start+0x398/0x440() [ 135.545276] Modules linked in: [ 135.545681] CPU: 0 PID: 4080 Comm: btrfs Not tainted 4.3.0 #256 [ 135.546439] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014 [ 135.547798] ffffffff81c5bfcf ffff88003cbb3d28 ffffffff817fe7b5 0000000000000000 [ 135.548774] ffff88003cbb3d60 ffffffff810a88f1 ffff88002b030000 00000000ffffffe4 [ 135.549774] ffff88003c080000 ffff88003c082588 ffff88003c28ab60 ffff88003cbb3d70 [ 135.550758] Call Trace: [ 135.551086] [<ffffffff817fe7b5>] dump_stack+0x44/0x55 [ 135.551737] [<ffffffff810a88f1>] warn_slowpath_common+0x81/0xc0 [ 135.552487] [<ffffffff810a89e5>] warn_slowpath_null+0x15/0x20 [ 135.553211] [<ffffffff81448c88>] btrfs_dev_replace_start+0x398/0x440 [ 135.554051] [<ffffffff81412c3e>] btrfs_ioctl+0x1d2e/0x25c0 [ 135.554722] [<ffffffff8114c7ba>] ? __audit_syscall_entry+0xaa/0xf0 [ 135.555506] [<ffffffff8111ab36>] ? current_kernel_time64+0x56/0xa0 [ 135.556304] [<ffffffff81201e3d>] do_vfs_ioctl+0x30d/0x580 [ 135.557009] [<ffffffff8114c7ba>] ? __audit_syscall_entry+0xaa/0xf0 [ 135.557855] [<ffffffff810011d1>] ? do_audit_syscall_entry+0x61/0x70 [ 135.558669] [<ffffffff8120d1c1>] ? __fget_light+0x61/0x90 [ 135.559374] [<ffffffff81202124>] SyS_ioctl+0x74/0x80 [ 135.559987] [<ffffffff81809857>] entry_SYSCALL_64_fastpath+0x12/0x6f [ 135.560842] ---[ end trace 2a5c1fc3205abbdd ]--- Reason: When big data writen to fs, the whole free space will be allocated for data chunk. And operation as scrub need to set_block_ro(), and when there is only one metadata chunk in system(or other metadata chunks are all full), the function will try to allocate a new chunk, and failed because no space in device. Fix: When set_block_ro failed for metadata chunk, it is not a problem because scrub_lock paused commit_trancaction in same time, and metadata are always cowed, so the on-the-fly writepages will not write data into same place with scrub/replace. Let replace continue in this case is no problem. Tested by above script, and xfstests/011, plus 100 times xfstests/070. Changelog v1->v2: 1: Add detail comments in source and commit-message. 2: Add dmesg detail into commit-message. 3: Limit return value of -ENOSPC to be passed. All suggested by: Filipe Manana <fdmanana@gmail.com> Suggested-by: Filipe Manana <fdmanana@gmail.com> Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-11-25btrfs: fix clashing number of the enhanced balance usage filterDavid Sterba
I've accidentally picked an already used number for the enhanced usage filter represented by BTRFS_BALANCE_ARGS_USAGE_RANGE, clashing with BTRFS_BALANCE_ARGS_CONVERT. Introduced during the development phase, no backward compatibility issues. Reported-by: Holger Hoffstätte <holger.hoffstaette@googlemail.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Fixes: bc3094673f22 ("btrfs: extend balance filter usage to take minimum and maximum") Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-11-25Btrfs: fix the number of transaction units needed to remove a block groupFilipe Manana
We were using only 1 transaction unit when attempting to delete an unused block group but in reality we need 3 + N units, where N corresponds to the number of stripes. We were accounting only for the addition of the orphan item (for the block group's free space cache inode) but we were not accounting that we need to delete one block group item from the extent tree, one free space item from the tree of tree roots and N device extent items from the device tree. While one unit is not enough, it worked most of the time because for each single unit we are too pessimistic and assume an entire tree path, with the highest possible heigth (8), needs to be COWed with eventual node splits at every possible level in the tree, so there was usually enough reserved space for removing all the items and adding the orphan item. However after adding the orphan item, writepages() can by called by the VM subsystem against the btree inode when we are under memory pressure, which causes writeback to start for the nodes we COWed before, this forces the operation to remove the free space item to COW again some (or all of) the same nodes (in the tree of tree roots). Even without writepages() being called, we could fail with ENOSPC because these items are located in multiple trees and one of them might have a higher heigth and require node/leaf splits at many levels, exhausting all the reserved space before removing all the items and adding the orphan. In the kernel 4.0 release, commit 3d84be799194 ("Btrfs: fix BUG_ON in btrfs_orphan_add() when delete unused block group"), we attempted to fix a BUG_ON due to ENOSPC when trying to add the orphan item by making the cleaner kthread reserve one transaction unit before attempting to remove the block group, but this was not enough. We had a couple user reports still hitting the same BUG_ON after 4.0, like Stefan Priebe's report on a 4.2-rc6 kernel for example: http://www.spinics.net/lists/linux-btrfs/msg46070.html So fix this by reserving all the necessary units of metadata. Reported-by: Stefan Priebe <s.priebe@profihost.ag> Fixes: 3d84be799194 ("Btrfs: fix BUG_ON in btrfs_orphan_add() when delete unused block group") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-11-25Btrfs: use global reserve when deleting unused block group after ENOSPCFilipe Manana
It's possible to reach a state where the cleaner kthread isn't able to start a transaction to delete an unused block group due to lack of enough free metadata space and due to lack of unallocated device space to allocate a new metadata block group as well. If this happens try to use space from the global block group reserve just like we do for unlink operations, so that we don't reach a permanent state where starting a transaction for filesystem operations (file creation, renames, etc) keeps failing with -ENOSPC. Such an unfortunate state was observed on a machine where over a dozen unused data block groups existed and the cleaner kthread was failing to delete them due to ENOSPC error when attempting to start a transaction, and even running balance with a -dusage=0 filter failed with ENOSPC as well. Also unmounting and mounting again the filesystem didn't help. Allowing the cleaner kthread to use the global block reserve to delete the unused data block groups fixed the problem. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-11-25Btrfs: tests: checking for NULL instead of IS_ERR()Dan Carpenter
btrfs_alloc_dummy_root() return an error pointer on failure, it never returns NULL. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-11-25btrfs: fix signed overflows in btrfs_sync_fileDavid Sterba
The calculation of range length in btrfs_sync_file leads to signed overflow. This was caught by PaX gcc SIZE_OVERFLOW plugin. https://forums.grsecurity.net/viewtopic.php?f=1&t=4284 The fsync call passes 0 and LLONG_MAX, the range length does not fit to loff_t and overflows, but the value is converted to u64 so it silently works as expected. The minimal fix is a typecast to u64, switching functions to take (start, end) instead of (start, len) would be more intrusive. Coccinelle script found that there's one more opencoded calculation of the length. <smpl> @@ loff_t start, end; @@ * end - start </smpl> CC: stable@vger.kernel.org Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-11-13Merge branch 'for-linus-4.4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes and cleanups from Chris Mason: "Some of this got cherry-picked from a github repo this week, but I verified the patches. We have three small scrub cleanups and a collection of fixes" * 'for-linus-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: btrfs: Use fs_info directly in btrfs_delete_unused_bgs btrfs: Fix lost-data-profile caused by balance bg btrfs: Fix lost-data-profile caused by auto removing bg btrfs: Remove len argument from scrub_find_csum btrfs: Reduce unnecessary arguments in scrub_recheck_block btrfs: Use scrub_checksum_data and scrub_checksum_tree_block for scrub_recheck_block_checksum btrfs: Reset sblock->xxx_error stats before calling scrub_recheck_block_checksum btrfs: scrub: setup all fields for sblock_to_check btrfs: scrub: set error stats when tree block spanning stripes Btrfs: fix race when listing an inode's xattrs Btrfs: fix race leading to BUG_ON when running delalloc for nodatacow Btrfs: fix race leading to incorrect item deletion when dropping extents Btrfs: fix sleeping inside atomic context in qgroup rescan worker Btrfs: fix race waiting for qgroup rescan worker btrfs: qgroup: exit the rescan worker during umount Btrfs: fix extent accounting for partial direct IO writes
2015-11-10btrfs: Use fs_info directly in btrfs_delete_unused_bgsZhao Lei
No need to use root->fs_info in btrfs_delete_unused_bgs(), use fs_info directly instead. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-11-10btrfs: Fix lost-data-profile caused by balance bgZhao Lei
Reproduce: (In integration-4.3 branch) TEST_DEV=(/dev/vdg /dev/vdh) TEST_DIR=/mnt/tmp umount "$TEST_DEV" >/dev/null mkfs.btrfs -f -d raid1 "${TEST_DEV[@]}" mount -o nospace_cache "$TEST_DEV" "$TEST_DIR" btrfs balance start -dusage=0 $TEST_DIR btrfs filesystem usage $TEST_DIR dd if=/dev/zero of="$TEST_DIR"/file count=100 btrfs filesystem usage $TEST_DIR Result: We can see "no data chunk" in first "btrfs filesystem usage": # btrfs filesystem usage $TEST_DIR Overall: ... Metadata,single: Size:8.00MiB, Used:0.00B /dev/vdg 8.00MiB Metadata,RAID1: Size:122.88MiB, Used:112.00KiB /dev/vdg 122.88MiB /dev/vdh 122.88MiB System,single: Size:4.00MiB, Used:0.00B /dev/vdg 4.00MiB System,RAID1: Size:8.00MiB, Used:16.00KiB /dev/vdg 8.00MiB /dev/vdh 8.00MiB Unallocated: /dev/vdg 1.06GiB /dev/vdh 1.07GiB And "data chunks changed from raid1 to single" in second "btrfs filesystem usage": # btrfs filesystem usage $TEST_DIR Overall: ... Data,single: Size:256.00MiB, Used:0.00B /dev/vdh 256.00MiB Metadata,single: Size:8.00MiB, Used:0.00B /dev/vdg 8.00MiB Metadata,RAID1: Size:122.88MiB, Used:112.00KiB /dev/vdg 122.88MiB /dev/vdh 122.88MiB System,single: Size:4.00MiB, Used:0.00B /dev/vdg 4.00MiB System,RAID1: Size:8.00MiB, Used:16.00KiB /dev/vdg 8.00MiB /dev/vdh 8.00MiB Unallocated: /dev/vdg 1.06GiB /dev/vdh 841.92MiB Reason: btrfs balance delete last data chunk in case of no data in the filesystem, then we can see "no data chunk" by "fi usage" command. And when we do write operation to fs, the only available data profile is 0x0, result is all new chunks are allocated single type. Fix: Allocate a data chunk explicitly to ensure we don't lose the raid profile for data. Test: Test by above script, and confirmed the logic by debug output. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-11-10btrfs: Fix lost-data-profile caused by auto removing bgZhao Lei
Reproduce: (In integration-4.3 branch) TEST_DEV=(/dev/vdg /dev/vdh) TEST_DIR=/mnt/tmp umount "$TEST_DEV" >/dev/null mkfs.btrfs -f -d raid1 "${TEST_DEV[@]}" mount -o nospace_cache "$TEST_DEV" "$TEST_DIR" umount "$TEST_DEV" mount -o nospace_cache "$TEST_DEV" "$TEST_DIR" btrfs filesystem usage $TEST_DIR We can see the data chunk changed from raid1 to single: # btrfs filesystem usage $TEST_DIR Data,single: Size:8.00MiB, Used:0.00B /dev/vdg 8.00MiB # Reason: When a empty filesystem mount with -o nospace_cache, the last data blockgroup will be auto-removed in umount. Then if we mount it again, there is no data chunk in the filesystem, so the only available data profile is 0x0, result is all new chunks are created as single type. Fix: Don't auto-delete last blockgroup for a raid type. Test: Test by above script, and confirmed the logic by debug output. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-11-10btrfs: Remove len argument from scrub_find_csumZhao Lei
It is useless. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-11-10btrfs: Reduce unnecessary arguments in scrub_recheck_blockZhao Lei
We don't need pass so many arguments for recheck sblock now, this patch cleans them. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>