summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2015-01-21Btrfs: sort raid_map before adding tgtdev stripesZhao Lei
It can avoid complex calculation of real stripes in sort, moreover, we can clean up code of sorting tgtdev_map because it will be in order initially. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21Btrfs: fix a out-of-bound access of raid_mapZhao Lei
We add the number of stripes on target devices into bbio->num_stripes if we are under device replacement, and we just sort the raid_map of those stripes that not on the target devices, so if when we need real raid_map, we need skip the stripes on the target devices. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21Btrfs: fix fsync log replay for inodes with a mix of regular refs and extrefsFilipe Manana
If we have an inode with a large number of hard links, some of which may be extrefs, turn a regular ref into an extref, fsync the inode and then replay the fsync log (after a crash/reboot), we can endup with an fsync log that makes the replay code always fail with -EOVERFLOW when processing the inode's references. This is easy to reproduce with the test case I made for xfstests. Its steps are the following: _scratch_mkfs "-O extref" >> $seqres.full 2>&1 _init_flakey _mount_flakey # Create a test file with 3001 hard links. This number is large enough to # make btrfs start using extrefs at some point even if the fs has the maximum # possible leaf/node size (64Kb). echo "hello world" > $SCRATCH_MNT/foo for i in `seq 1 3000`; do ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_`printf "%04d" $i` done # Make sure all metadata and data are durably persisted. sync # Now remove one link, add a new one with a new name, add another new one with # the same name as the one we just removed and fsync the inode. rm -f $SCRATCH_MNT/foo_link_0001 ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_3001 ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_0001 rm -f $SCRATCH_MNT/foo_link_0002 ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_3002 ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_3003 $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo # Simulate a crash/power loss. This makes sure the next mount # will see an fsync log and will replay that log. _load_flakey_table $FLAKEY_DROP_WRITES _unmount_flakey _load_flakey_table $FLAKEY_ALLOW_WRITES _mount_flakey # Check that the number of hard links is correct, we are able to remove all # the hard links and read the file's data. This is just to verify we don't # get stale file handle errors (due to dangling directory index entries that # point to inodes that no longer exist). echo "Link count: $(stat --format=%h $SCRATCH_MNT/foo)" [ -f $SCRATCH_MNT/foo ] || echo "Link foo is missing" for ((i = 1; i <= 3003; i++)); do name=foo_link_`printf "%04d" $i` if [ $i -eq 2 ]; then [ -f $SCRATCH_MNT/$name ] && echo "Link $name found" else [ -f $SCRATCH_MNT/$name ] || echo "Link $name is missing" fi done rm -f $SCRATCH_MNT/foo_link_* cat $SCRATCH_MNT/foo rm -f $SCRATCH_MNT/foo status=0 exit The fix is simply to correct the overflow condition when overwriting a reference item because it was wrong, trying to increase the item in the fs/subvol tree by an impossible amount. Also ensure that we don't insert one normal ref and one ext ref for the same dentry - this happened because processing a dir index entry from the parent in the log happened when the normal ref item was full, which made the logic insert an extref and later when the normal ref had enough room, it would be inserted again when processing the ref item from the child inode in the log. This issue has been present since the introduction of the extrefs feature (2012). A test case for xfstests follows soon. This test only passes if the previous patch titled "Btrfs: fix fsync when extend references are added to an inode" is applied too. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21Btrfs: fix fsync when extend references are added to an inodeFilipe Manana
If we added an extended reference to an inode and fsync'ed it, the log replay code would make our inode have an incorrect link count, which was lower then the expected/correct count. This resulted in stale directory index entries after deleting some of the hard links, and any access to the dangling directory entries resulted in -ESTALE errors because the entries pointed to inode items that don't exist anymore. This is easy to reproduce with the test case I made for xfstests, and the bulk of that test is: _scratch_mkfs "-O extref" >> $seqres.full 2>&1 _init_flakey _mount_flakey # Create a test file with 3001 hard links. This number is large enough to # make btrfs start using extrefs at some point even if the fs has the maximum # possible leaf/node size (64Kb). echo "hello world" > $SCRATCH_MNT/foo for i in `seq 1 3000`; do ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_`printf "%04d" $i` done # Make sure all metadata and data are durably persisted. sync # Add one more link to the inode that ends up being a btrfs extref and fsync # the inode. ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_3001 $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo # Simulate a crash/power loss. This makes sure the next mount # will see an fsync log and will replay that log. _load_flakey_table $FLAKEY_DROP_WRITES _unmount_flakey _load_flakey_table $FLAKEY_ALLOW_WRITES _mount_flakey # Now after the fsync log replay btrfs left our inode with a wrong link count N, # which was smaller than the correct link count M (N < M). # So after removing N hard links, the remaining M - N directory entries were # still visible to user space but it was impossible to do anything with them # because they pointed to an inode that didn't exist anymore. This resulted in # stale file handle errors (-ESTALE) when accessing those dentries for example. # # So remove all hard links except the first one and then attempt to read the # file, to verify we don't get an -ESTALE error when accessing the inodel # # The btrfs fsck tool also detected the incorrect inode link count and it # reported an error message like the following: # # root 5 inode 257 errors 2001, no inode item, link count wrong # unresolved ref dir 256 index 2978 namelen 13 name foo_link_2976 filetype 1 errors 4, no inode ref # # The fstests framework automatically calls fsck after a test is run, so we # don't need to call fsck explicitly here. rm -f $SCRATCH_MNT/foo_link_* cat $SCRATCH_MNT/foo status=0 exit So make sure an fsync always flushes the delayed inode item, so that the fsync log contains it (needed in order to trigger the link count fixup code) and fix the extref counting function, which always return -ENOENT to its caller (and made it assume there were always 0 extrefs). This issue has been present since the introduction of the extrefs feature (2012). A test case for xfstests follows soon. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21Btrfs: fix directory inconsistency after fsync log replayFilipe Manana
If we have an inode (file) with a link count greater than 1, remove one of its hard links, fsync the inode, power fail/crash and then replay the fsync log on the next mount, we end up getting the parent directory's metadata inconsistent - its i_size still reflects the deleted hard link and has dangling index entries (with no matching inode reference entries). This prevents the directory from ever being deletable, as its i_size can never decrease to BTRFS_EMPTY_DIR_SIZE even if all of its children inodes are deleted, and the dangling index entries can never be removed (as they point to an inode that does not exist anymore). This is easy to reproduce with the following excerpt from the test case for xfstests that I just made: _scratch_mkfs >> $seqres.full 2>&1 _init_flakey _mount_flakey # Create a test file with 2 hard links in the same directory. mkdir -p $SCRATCH_MNT/a/b echo "hello world" > $SCRATCH_MNT/a/b/foo ln $SCRATCH_MNT/a/b/foo $SCRATCH_MNT/a/b/bar # Make sure all metadata and data are durably persisted. sync # Now remove one of the hard links and fsync the inode. rm -f $SCRATCH_MNT/a/b/bar $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/a/b/foo # Simulate a crash/power loss. This makes sure the next mount # will see an fsync log and will replay that log. _load_flakey_table $FLAKEY_DROP_WRITES _unmount_flakey _load_flakey_table $FLAKEY_ALLOW_WRITES _mount_flakey # Remove the last hard link of the file and attempt to remove its parent # directory - this failed in btrfs because the fsync log and replay code # didn't decrement the parent directory's i_size and left dangling directory # index entries - this made the btrfs rmdir implementation always fail with # the error -ENOTEMPTY. # # The dangling directory index entries were visible to user space, but it was # impossible to do anything on them (unlink, open, read, write, stat, etc) # because the inode they pointed to did not exist anymore. # # The parent directory's metadata inconsistency (stale index entries) was # also detected by btrfs' fsck tool, which is run automatically by the fstests # framework when the test finishes. The error message reported by fsck was: # # root 5 inode 259 errors 2001, no inode item, link count wrong # unresolved ref dir 258 index 3 namelen 3 name bar filetype 1 errors 4, no inode ref # rm -f $SCRATCH_MNT/a/b/* rmdir $SCRATCH_MNT/a/b rmdir $SCRATCH_MNT/a To fix this just make sure that after an unlink, if the inode is fsync'ed, he parent inode is fully logged in the fsync log. A test case for xfstests follows soon. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21Btrfs: lookup for block group only if needed when freeing a tree blockFilipe Manana
Very often our extent buffer's header generation doesn't match the current transaction's id or it is also referenced by other trees (snapshots), so we don't need the corresponding block group cache object. Therefore only search for it if we are going to use it, so we avoid an unnecessary search in the block groups rbtree (and acquiring and releasing its spinlock). Freeing a tree block is performed when COWing or deleting a node/leaf, which implies we are holding the node/leaf's parent node lock, therefore reducing the amount of time spent when freeing a tree block helps reducing the amount of time we are holding the parent node's lock. For example, for a run of xfstests/generic/083, the block group cache object was needed only 682 times for a total of 226691 calls to free a tree block. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21btrfs: remove a no-op unfreeze superbock callbackDavid Sterba
Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21btrfs: switch extent_state state to unsignedDavid Sterba
Currently there's a 4B hole in the structure between refs and state and there are only 16 bits used so we can make it unsigned. This will get a better packing and may save some stack space for local variables. The size of extent_state gets reduced by 8B and there are usually a lot of slab objects. struct extent_state { u64 start; /* 0 8 */ u64 end; /* 8 8 */ struct rb_node rb_node; /* 16 24 */ wait_queue_head_t wq; /* 40 24 */ /* --- cacheline 1 boundary (64 bytes) --- */ atomic_t refs; /* 64 4 */ /* XXX 4 bytes hole, try to pack */ long unsigned int state; /* 72 8 */ u64 private; /* 80 8 */ /* size: 88, cachelines: 2, members: 7 */ /* sum members: 84, holes: 1, sum holes: 4 */ /* last cacheline: 24 bytes */ }; Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21btrfs: set proper message level for skinny metadataDavid Sterba
This has been confusing people for too long, the message is really just informative. CC: <stable@vger.kernel.org> # 3.10+ Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21btrfs: update message levels after checksum errorsDavid Sterba
The errors are worth noting and might get missed with INFO level. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21btrfs: update message levels during failed mountDavid Sterba
All error conditions from open_ctree shall be ERR. Warning would suggest that something's wrong and we can continue. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21btrfs: update message levels for errorsDavid Sterba
Several messages that point to some internal problem, level INFO is wrong here. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21Btrfs: fix setup_leaf_for_split() to avoid leaf corruptionFilipe Manana
We were incorrectly detecting when the target key didn't exist anymore after releasing the path and re-searching the tree. This could make us split or duplicate (btrfs_split_item() and btrfs_duplicate_item() are its only callers at the moment) an item when we should not. For the case of duplicating an item, we currently only duplicate checksum items (csum tree) and file extent items (fs/subvol trees). For the checksum items we end up overriding the item completely, but for file extent items we update only some of their fields in the copy (done in __btrfs_drop_extents), which means we can end up having a logical corruption for some values. Also for the case where we duplicate a file extent item it will make us produce a leaf with a wrong key order, as btrfs_duplicate_item() advances us to the next slot and then its caller sets a smaller key on the new item at that slot (like in __btrfs_drop_extents() e.g.). Alternatively if the tree search in setup_leaf_for_split() leaves with path->slots[0] == btrfs_header_nritems(path->nodes[0]), we end up accessing beyond the leaf's end (when we check if the item's size has changed) and make our caller insert an item at the invalid slot btrfs_header_nritems(path->nodes[0]) + 1, causing an invalid memory access if the leaf is full or nearly full. This issue has been present since the introduction of this function in 2009: Btrfs: Add btrfs_duplicate_item commit ad48fd754676bfae4139be1a897b1ea58f9aaf21 Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21Merge branch 'cleanup/blocksize-diet-part2' of ↵Chris Mason
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus
2015-01-21Merge branch 'fix/find-item-path-leak' of ↵Chris Mason
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus
2015-01-21Btrfs: track dirty block groups on their own listJosef Bacik
Currently any time we try to update the block groups on disk we will walk _all_ block groups and check for the ->dirty flag to see if it is set. This function can get called several times during a commit. So if you have several terabytes of data you will be a very sad panda as we will loop through _all_ of the block groups several times, which makes the commit take a while which slows down the rest of the file system operations. This patch introduces a dirty list for the block groups that we get added to when we dirty the block group for the first time. Then we simply update any block groups that have been dirtied since the last time we called btrfs_write_dirty_block_groups. This allows us to clean up how we write the free space cache out so it is much cleaner. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21Btrfs: change how we track dirty rootsJosef Bacik
I've been overloading root->dirty_list to keep track of dirty roots and which roots need to have their commit roots switched at transaction commit time. This could cause us to lose an update to the root which could corrupt the file system. To fix this use a state bit to know if the root is dirty, and if it isn't set we go ahead and move the root to the dirty list. This way if we re-dirty the root after adding it to the switch_commit list we make sure to update it. This also makes it so that the extent root is always the last root on the dirty list to try and keep the amount of churn down at this point in the commit. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-01-17Merge tag 'driver-core-3.19-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core fix from Greg KH: "Here is one kernfs fix for a reported issue for 3.19-rc5. It has been in linux-next for a while" * tag 'driver-core-3.19-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: kernfs: Fix kernfs_name_compare
2015-01-17Merge tag 'nfs-for-3.19-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds
Pull NFS client bugfixes from Trond Myklebust: "Highlights include: - Stable fix for a NFSv3/lockd race - Fixes for several NFSv4.1 client id trunking bugs - Remove an incorrect test when checking for delegated opens" * tag 'nfs-for-3.19-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: NFSv4: Remove incorrect check in can_open_delegated() NFS: Ignore transport protocol when detecting server trunking NFSv4/v4.1: Verify the client owner id during trunking detection NFSv4: Cache the NFSv4/v4.1 client owner_id in the struct nfs_client NFSv4.1: Fix client id trunking on Linux LOCKD: Fix a race when initialising nlmsvc_timeout
2015-01-16Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse fixes from Miklos Szeredi: "This fixes a regression in the latest fuse update plus a fix for a rather theoretical memory ordering issue" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: add memory barrier to INIT fuse: fix LOOKUP vs INIT compat handling
2015-01-14btrfs: expand btrfs_find_item if found_key is NULLDavid Sterba
If the found_key is NULL, then btrfs_find_item becomes a verbose wrapper for simple btrfs_search_slot. After we've removed all such callers, passing a NULL key is not valid anymore. Signed-off-by: David Sterba <dsterba@suse.cz>
2015-01-14btrfs: simplify insert_orphan_itemDavid Sterba
We can search and add the orphan item in one go, btrfs_insert_orphan_item will find out if the item already exists. Signed-off-by: David Sterba <dsterba@suse.cz>
2015-01-14btrfs: cleanup, remove inode_ref_info helperDavid Sterba
A simple wrapper around btrfs_find_item. Signed-off-by: David Sterba <dsterba@suse.cz>
2015-01-14btrfs: cleanup, remove inode_item_info helperDavid Sterba
It's only a simple wrapper around btrfs_find_item, the locally defined key is not used. Signed-off-by: David Sterba <dsterba@suse.cz>
2015-01-14btrfs: fix leak of path in btrfs_find_itemDavid Sterba
If btrfs_find_item is called with NULL path it allocates one locally but does not free it. Affected paths are inserting an orphan item for a file and for a subvol root. Move the path allocation to the callers. CC: <stable@vger.kernel.org> # 3.14+ Fixes: 3f870c289900 ("btrfs: expand btrfs_find_item() to include find_orphan_item functionality") Signed-off-by: David Sterba <dsterba@suse.cz>
2015-01-13locks: fix NULL-deref in generic_delete_leaseNeilBrown
commit 0efaa7e82f02fe69c05ad28e905f31fc86e6f08e locks: generic_delete_lease doesn't need a file_lock at all moves the call to fl->fl_lmops->lm_change() to a place in the code where fl might be a non-lease lock. When that happens, fl_lmops is NULL and an Oops ensures. So add an extra test to restore correct functioning. Reported-by: Linda Walsh <suse@tlinx.org> Link: https://bugzilla.suse.com/show_bug.cgi?id=912569 Cc: stable@vger.kernel.org (v3.18) Fixes: 0efaa7e82f02fe69c05ad28e905f31fc86e6f08e Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2015-01-11Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: "Misc fixes: group scheduling corner case fix, two deadline scheduler fixes, effective_load() overflow fix, nested sleep fix, 6144 CPUs system fix" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Fix RCU stall upon -ENOMEM in sched_create_group() sched/deadline: Avoid double-accounting in case of missed deadlines sched/deadline: Fix migration of SCHED_DEADLINE tasks sched: Fix odd values in effective_load() calculations sched, fanotify: Deal with nested sleeps sched: Fix KMALLOC_MAX_SIZE overflow during cpumask allocation
2015-01-09Merge branch 'for-3.19' of git://linux-nfs.org/~bfields/linuxLinus Torvalds
Pull two nfsd bugfixes from Bruce Fields. * 'for-3.19' of git://linux-nfs.org/~bfields/linux: rpc: fix xdr_truncate_encode to handle buffer ending on page boundary nfsd: fix fi_delegees leak when fi_had_conflict returns true
2015-01-09Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull two Ceph fixes from Sage Weil: "These are both pretty trivial: a sparse warning fix and size_t printk thing" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: libceph: fix sparse endianness warnings ceph: use %zu for len in ceph_fill_inline_data()
2015-01-09Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "None of these are huge, but my commit does fix a regression from 3.18 that could cause lost files during log replay. This also adds Dave Sterba to the list of Btrfs maintainers. It doesn't mean we're doing things differently, but Dave has really been helping with the maintainer workload for years" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: don't delay inode ref updates during log replay Btrfs: correctly get tree level in tree_backref_for_extent Btrfs: call inode_dec_link_count() on mkdir error path Btrfs: abort transaction if we don't find the block group Btrfs, scrub: uninitialized variable in scrub_extent_for_parity() Btrfs: add more maintainers
2015-01-09kernfs: Fix kernfs_name_compareRasmus Villemoes
Returning a difference from a comparison functions is usually wrong (see acbbe6fbb240 "kcmp: fix standard comparison bug" for the long story). Here there is the additional twist that if the void pointers ns and kn->ns happen to differ by a multiple of 2^32, kernfs_name_compare returns 0, falsely reporting a match to the caller. Technically 'hash - kn->hash' is ok since the hashes are restricted to 31 bits, but it's better to avoid that subtlety. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-09sched, fanotify: Deal with nested sleepsPeter Zijlstra
As per e23738a7300a ("sched, inotify: Deal with nested sleeps"). fanotify_read is a wait loop with sleeps in. Wait loops rely on task_struct::state and sleeps do too, since that's the only means of actually sleeping. Therefore the nested sleeps destroy the wait loop state and the wait loop breaks the sleep functions that assume TASK_RUNNING (mutex_lock). Fix this by using the new woken_wake_function and wait_woken() stuff, which registers wakeups in wait and thereby allows shrinking the task_state::state changes to the actual sleep part. Reported-by: Yuanhan Liu <yuanhan.liu@linux.intel.com> Reported-by: Sedat Dilek <sedat.dilek@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Takashi Iwai <tiwai@suse.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Eric Paris <eparis@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Eric Paris <eparis@redhat.com> Link: http://lkml.kernel.org/r/20141216152838.GZ3337@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-01-08vfs: renumber FMODE_NONOTIFY and add to uniqueness checkDavid Drysdale
Fix clashing values for O_PATH and FMODE_NONOTIFY on sparc. The clashing O_PATH value was added in commit 5229645bdc35 ("vfs: add nonconflicting values for O_PATH") but this can't be changed as it is user-visible. FMODE_NONOTIFY is only used internally in the kernel, but it is in the same numbering space as the other O_* flags, as indicated by the comment at the top of include/uapi/asm-generic/fcntl.h (and its use in fs/notify/fanotify/fanotify_user.c). So renumber it to avoid the clash. All of this has happened before (commit 12ed2e36c98a: "fanotify: FMODE_NONOTIFY and __O_SYNC in sparc conflict"), and all of this will happen again -- so update the uniqueness check in fcntl_init() to include __FMODE_NONOTIFY. Signed-off-by: David Drysdale <drysdale@google.com> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Jan Kara <jack@suse.cz> Cc: Heinrich Schuchardt <xypron.glpk@gmx.de> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Eric Paris <eparis@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-01-08ocfs2: fix the wrong directory passed to ocfs2_lookup_ino_from_name() when ↵Xue jiufei
link file In ocfs2_link(), the parent directory inode passed to function ocfs2_lookup_ino_from_name() is wrong. Parameter dir is the parent of new_dentry not old_dentry. We should get old_dir from old_dentry and lookup old_dentry in old_dir in case another node remove the old dentry. With this change, hard linking works again, when paths are relative with at least one subdirectory. This is how the problem was reproducable: # mkdir a # mkdir b # touch a/test # ln a/test b/test ln: failed to create hard link `b/test' => `a/test': No such file or directory However when creating links in the same dir, it worked well. Now the link gets created. Fixes: 0e048316ff57 ("ocfs2: check existence of old dentry in ocfs2_link()") Signed-off-by: joyce.xue <xuejiufei@huawei.com> Reported-by: Szabo Aron - UBIT <aron@ubit.hu> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Tested-by: Aron Szabo <aron@ubit.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-01-08ocfs2: remove bogus check in dlm_process_recovery_dataJoseph Qi
In dlm_process_recovery_data, only when dlm_new_lock failed the ret will be set to -ENOMEM. And in this case, newlock is definitely NULL. So test newlock is meaningless, remove it. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Reviewed-by: Alex Chen <alex.chen@huawei.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-01-08ceph: use %zu for len in ceph_fill_inline_data()Ilya Dryomov
len is size_t, should be printed with %zu. Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
2015-01-07nfsd: fix fi_delegees leak when fi_had_conflict returns trueJeff Layton
Currently, nfs4_set_delegation takes a reference to an existing delegation and then checks to see if there is a conflict. If there is one, then it doesn't release that reference. Change the code to take the reference after the check and only if there is no conflict. Signed-off-by: Jeff Layton <jlayton@primarydata.com> Cc: stable@vger.kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-01-06Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 bugfixes from Ted Ts'o: "Revert a potential seek_data/hole regression which shows up when using ext4 to handle ext3 file systems, plus two minor bug fixes" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: remove spurious KERN_INFO from ext4_warning call Revert "ext4: fix suboptimal seek_{data,hole} extents traversial" ext4: prevent online resize with backup superblock
2015-01-06fuse: add memory barrier to INITMiklos Szeredi
Theoretically we need to order setting of various fields in fc with fc->initialized. No known bug reports related to this yet. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2015-01-06fuse: fix LOOKUP vs INIT compat handlingMiklos Szeredi
Analysis from Marc: "Commit 7078187a795f ("fuse: introduce fuse_simple_request() helper") from the above pull request triggers some EIO errors for me in some tests that rely on fuse Looking at the code changes and a bit of debugging info I think there's a general problem here that fuse_get_req checks and possibly waits for fc->initialized, and this was always called first. But this commit changes the ordering and in many places fc->minor is now possibly used before fuse_get_req, and we can't be sure that fc has been initialized. In my case fuse_lookup_init sets req->out.args[0].size to the wrong size because fc->minor at that point is still 0, leading to the EIO error." Fix by moving the compat adjustments into fuse_simple_request() to after fuse_get_req(). This is also more readable than the original, since now compatibility is handled in a single function instead of cluttering each operation. Reported-by: Marc Dionne <marc.c.dionne@gmail.com> Tested-by: Marc Dionne <marc.c.dionne@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Fixes: 7078187a795f ("fuse: introduce fuse_simple_request() helper")
2015-01-05NFSv4: Remove incorrect check in can_open_delegated()Trond Myklebust
Remove an incorrect check for NFS_DELEGATION_NEED_RECLAIM in can_open_delegated(). We are allowed to cache opens even in a situation where we're doing reboot recovery. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-01-05NFS: Ignore transport protocol when detecting server trunkingChuck Lever
Detect server trunking across transport protocols. Otherwise, an RDMA mount and a TCP mount of the same server will end up with separate nfs_clients using the same clientid4. Reported-by: Dai Ngo <dai.ngo@oracle.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-01-05NFSv4/v4.1: Verify the client owner id during trunking detectionTrond Myklebust
While we normally expect the NFSv4 client to always send the same client owner to all servers, there are a couple of situations where that is not the case: 1) In NFSv4.0, switching between use of '-omigration' and not will cause the kernel to switch between using the non-uniform and uniform client strings. 2) In NFSv4.1, or NFSv4.0 when using uniform client strings, if the uniquifier string is suddenly changed. This patch will catch those situations by checking the client owner id in the trunking detection code, and will do the right thing if it notices that the strings differ. Cc: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-01-05NFSv4: Cache the NFSv4/v4.1 client owner_id in the struct nfs_clientTrond Myklebust
Ensure that we cache the NFSv4/v4.1 client owner_id so that we can verify it when we're doing trunking detection. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-01-05NFSv4.1: Fix client id trunking on LinuxTrond Myklebust
Currently, our trunking code will check for session trunking, but will fail to detect client id trunking. This is a problem, because it means that the client will fail to recognise that the two connections represent shared state, even if they do not permit a shared session. By removing the check for the server minor id, and only checking the major id, we will end up doing the right thing in both cases: we close down the new nfs_client and fall back to using the existing one. Fixes: 05f4c350ee02e ("NFS: Discover NFSv4 server trunking when mounting") Cc: Chuck Lever <chuck.lever@oracle.com> Cc: stable@vger.kernel.org # 3.7.x Tested-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-01-05LOCKD: Fix a race when initialising nlmsvc_timeoutTrond Myklebust
This commit fixes a race whereby nlmclnt_init() first starts the lockd daemon, and then calls nlm_bind_host() with the expectation that nlmsvc_timeout has already been initialised. Unfortunately, there is no no synchronisation between lockd() and lockd_up() to guarantee that this is the case. Fix is to move the initialisation of nlmsvc_timeout into lockd_create_svc Fixes: 9a1b6bf818e74 ("LOCKD: Don't call utsname()->nodename...") Cc: Bruce Fields <bfields@fieldses.org> Cc: stable@vger.kernel.org # 3.10.x Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-01-02ext4: remove spurious KERN_INFO from ext4_warning callJakub Wilk
Signed-off-by: Jakub Wilk <jwilk@jwilk.net> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-01-02Revert "ext4: fix suboptimal seek_{data,hole} extents traversial"Theodore Ts'o
This reverts commit 14516bb7bb6ffbd49f35389f9ece3b2045ba5815. This was causing regression test failures with generic/285 with an ext3 filesystem using CONFIG_EXT4_USE_FOR_EXT23. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-01-02Btrfs: don't delay inode ref updates during log replayChris Mason
Commit 1d52c78afbb (Btrfs: try not to ENOSPC on log replay) added a check to skip delayed inode updates during log replay because it confuses the enospc code. But the delayed processing will end up ignoring delayed refs from log replay because the inode itself wasn't put through the delayed code. This can end up triggering a warning at commit time: WARNING: CPU: 2 PID: 778 at fs/btrfs/delayed-inode.c:1410 btrfs_assert_delayed_root_empty+0x32/0x34() Which is repeated for each commit because we never process the delayed inode ref update. The fix used here is to change btrfs_delayed_delete_inode_ref to return an error if we're currently in log replay. The caller will do the ref deletion immediately and everything will work properly. Signed-off-by: Chris Mason <clm@fb.com> cc: stable@vger.kernel.org # v3.18 and any stable series that picked 1d52c78afbbf80b58299e076a159617d6b42fe3c
2015-01-02Btrfs: correctly get tree level in tree_backref_for_extentFilipe Manana
If we are using skinny metadata, the block's tree level is in the offset of the key and not in a btrfs_tree_block_info structure following the extent item (it doesn't exist). Therefore fix it. Besides returning the correct level in the tree, this also prevents reading past the leaf's end in the case where the extent item is the last item in the leaf (eb) and it has only 1 inline reference - this is because sizeof(struct btrfs_tree_block_info) is greater than sizeof(struct btrfs_extent_inline_ref). Got it while running a scrub which produced the following warning: BTRFS: checksum error at logical 42123264 on dev /dev/sde, sector 15840: metadata node (level 24) in tree 5 Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>