summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2012-01-06NFS: Remove pNFS bloat from the generic write pathTrond Myklebust
We have no business doing any this in the standard write release path. Get rid of it, and put it in the pNFS layer. Also, while we're at it, get rid of the completely bogus unlock/relock semantics that were present in nfs_writeback_release_full(). It is not only unnecessary, but actually dangerous to release the write lock just in order to take it again in nfs_page_async_flush(). Better just to open code the pgio operations in a pnfs helper. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-01-06pnfs-obj: Must return layout on IO errorBoaz Harrosh
As mandated by the standard. In case of an IO error, a pNFS objects layout driver must return it's layout. This is because all device errors are reported to the server as part of the layout return buffer. This is implemented the same way PNFS_LAYOUTRET_ON_SETATTR is done, through a bit flag on the pnfs_layoutdriver_type->flags member. The flag is set by the layout driver that wants a layout_return preformed at pnfs_ld_{write,read}_done in case of an error. (Though I have not defined a wrapper like pnfs_ld_layoutret_on_setattr because this code is never called outside of pnfs.c and pnfs IO paths) Without this patch 3.[0-2] Kernels leak memory and have an annoying WARN_ON after every IO error utilizing the pnfs-obj driver. [This patch is for 3.2 Kernel. 3.1/0 Kernels need a different patch] CC: Stable Tree <stable@kernel.org> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-01-06pnfs-obj: pNFS errors are communicated on iodata->pnfs_errorBoaz Harrosh
Some time along the way pNFS IO errors were switched to communicate with a special iodata->pnfs_error member instead of the regular RPC members. But objlayout was not switched over. Fix that! Without this fix any IO error is hanged, because IO is not switched to MDS and pages are never cleared or read. [Applies to 3.2.0. Same bug different patch for 3.1/0 Kernels] CC: Stable Tree <stable@kernel.org> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-01-05NFS: Cache state owners after files are closedChuck Lever
Servers have a finite amount of memory to store NFSv4 open and lock owners. Moreover, servers may have a difficult time determining when they can reap their state owner table, thanks to gray areas in the NFSv4 protocol specification. Thus clients should be careful to reuse state owners when possible. Currently Linux is not too careful. When a user has closed all her files on one mount point, the state owner's reference count goes to zero, and it is released. The next OPEN allocates a new one. A workload that serially opens and closes files can run through a large number of open owners this way. When a state owner's reference count goes to zero, slap it onto a free list for that nfs_server, with an expiry time. Garbage collect before looking for a state owner. This makes state owners for active users available for re-use. Now that there can be unused state owners remaining at umount time, purge the state owner free list when a server is destroyed. Also be sure not to reclaim unused state owners during state recovery. This change has benefits for the client as well. For some workloads, this approach drops the number of OPEN_CONFIRM calls from the same as the number of OPEN calls, down to just one. This reduces wire traffic and thus open(2) latency. Before this patch, untarring a kernel source tarball shows the OPEN_CONFIRM call counter steadily increasing through the test. With the patch, the OPEN_CONFIRM count remains at 1 throughout the entire untar. As long as the expiry time is kept short, I don't think garbage collection should be terribly expensive, although it does bounce the clp->cl_lock around a bit. [ At some point we should rationalize the use of the nfs_server ->destroy method. ] Signed-off-by: Chuck Lever <chuck.lever@oracle.com> [Trond: Fixed a garbage collection race and a few efficiency issues] Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-01-05NFS: Clean up nfs4_find_state_owners_locked()Chuck Lever
There's no longer a need to check the so_server field in the state owner, because nowadays the RB tree we search for state owners contains owners for that only server. Make nfs4_find_state_owners_locked() use the same tree searching logic as nfs4_insert_state_owner_locked(). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-01-05NFSv4: include bitmap in nfsv4 get acl dataAndy Adamson
The NFSv4 bitmap size is unbounded: a server can return an arbitrary sized bitmap in an FATTR4_WORD0_ACL request. Replace using the nfs4_fattr_bitmap_maxsz as a guess to the maximum bitmask returned by a server with the inclusion of the bitmap (xdr length plus bitmasks) and the acl data xdr length to the (cached) acl page data. This is a general solution to commit e5012d1f "NFSv4.1: update nfs4_fattr_bitmap_maxsz" and fixes hitting a BUG_ON in xdr_shrink_bufhead when getting ACLs. Fix a bug in decode_getacl that returned -EINVAL on ACLs > page when getxattr was called with a NULL buffer, preventing ACL > PAGE_SIZE from being retrieved. Cc: stable@kernel.org Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-01-05nfs: fix a minor do_div portability issueChris Metcalf
This change modifies filelayout_get_dense_offset() to use the functions in math64.h and thus avoid a 32-bit platform compile error trying to use do_div() on an s64 type. Signed-off-by: Chris Metcalf <cmetcalf@tilera.com> Reviewed-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-01-05NFSv4.1: cleanup comment and debug printkAndy Adamson
Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-01-05NFSv4.1: change nfs4_free_slot parameters for dynamic slotsAndy Adamson
Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-01-05NFSv4.1: cleanup init and reset of session slot tablesAndy Adamson
We are either initializing or resetting a session. Initialize or reset the session slot tables accordingly. Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-01-05NFSv4.1: fix backchannel slotid off-by-one bugAndy Adamson
Cc:stable@kernel.org Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-01-05nfs: fix regression in handling of context= option in NFSv4Jeff Layton
Setting the security context of a NFSv4 mount via the context= mount option is currently broken. The NFSv4 codepath allocates a parsed options struct, and then parses the mount options to fill it. It eventually calls nfs4_remote_mount which calls security_init_mnt_opts. That clobbers the lsm_opts struct that was populated earlier. This bug also looks like it causes a small memory leak on each v4 mount where context= is used. Fix this by moving the initialization of the lsm_opts into nfs_alloc_parsed_mount_data. Also, add a destructor for nfs_parsed_mount_data to make it easier to free all of the allocations hanging off of it, and to ensure that the security_free_mnt_opts is called whenever security_init_mnt_opts is. I believe this regression was introduced quite some time ago, probably by commit c02d7adf. Cc: stable@vger.kernel.org Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-01-05NFS - fix recent breakage to NFS error handling.NeilBrown
From c6d615d2b97fe305cbf123a8751ced859dca1d5e Mon Sep 17 00:00:00 2001 From: NeilBrown <neilb@suse.de> Date: Wed, 16 Nov 2011 09:39:05 +1100 Subject: [PATCH] NFS - fix recent breakage to NFS error handling. commit 02c24a82187d5a628c68edfe71ae60dc135cd178 made a small and presumably unintended change to write error handling in NFS. Previously an error from filemap_write_and_wait_range would only be of interest if nfs_file_fsync did not return an error. After this commit, an error from filemap_write_and_wait_range would mean that (the rest of) nfs_file_fsync would not even be called. This means that: 1/ you are more likely to see EIO than e.g. EDQUOT or ENOSPC. 2/ NFS_CONTEXT_ERROR_WRITE remains set for longer so more writes are synchronous. This patch restores previous behaviour. Cc: stable@kernel.org Cc: Josef Bacik <josef@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-01-05SUNRPC: Clean up the RPCSEC_GSS service ticket requestsTrond Myklebust
Instead of hacking specific service names into gss_encode_v1_msg, we should just allow the caller to specify the service name explicitly. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Acked-by: J. Bruce Fields <bfields@redhat.com>
2012-01-04minixfs: misplaced checks lead to dentry leakAl Viro
bitmap size sanity checks should be done *before* allocating ->s_root; there their cleanup on failure would be correct. As it is, we do iput() on root inode, but leak the root dentry... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Josh Boyer <jwboyer@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-04[CIFS] default ntlmv2 for cifs mount delayed to 3.3Steve French
Turned out the ntlmv2 (default security authentication) upgrade was harder to test than expected, and we ran out of time to test against Apple and a few other servers that we wanted to. Delay upgrade of default security from ntlm to ntlmv2 (on mount) to 3.3. Still works fine to specify it explicitly via "sec=ntlmv2" so this should be fine. Acked-by: Jeff Layton <jlayton@samba.org> Signed-off-by: Steve French <smfrench@gmail.com>
2012-01-03cifs: fix bad buffer length check in coalesce_t2Jeff Layton
The current check looks to see if the RFC1002 length is larger than CIFSMaxBufSize, and fails if it is. The buffer is actually larger than that by MAX_CIFS_HDR_SIZE. This bug has been around for a long time, but the fact that we used to cap the clients MaxBufferSize at the same level as the server tended to paper over it. Commit c974befa changed that however and caused this bug to bite in more cases. Reported-and-Tested-by: Konstantinos Skarlatos <k.skarlatos@gmail.com> Tested-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>
2011-12-30Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: ceph: disable use of dcache for readdir etc.
2011-12-29Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfsLinus Torvalds
* 'for-linus' of git://oss.sgi.com/xfs/xfs: xfs: log all dirty inodes in xfs_fs_sync_fs xfs: log the inode in ->write_inode calls for kupdate
2011-12-29procfs: do not confuse jiffies with cputime64_tAndreas Schwab
Commit 2a95ea6c0d129b4 ("procfs: do not overflow get_{idle,iowait}_time for nohz") did not take into account that one some architectures jiffies and cputime use different units. This causes get_idle_time() to return numbers in the wrong units, making the idle time fields in /proc/stat wrong. Instead of converting the usec value returned by get_cpu_{idle,iowait}_time_us to units of jiffies, use the new function usecs_to_cputime64 to convert it to the correct unit of cputime64_t. Signed-off-by: Andreas Schwab <schwab@linux-m68k.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Arnd Bergmann <arnd@arndb.de> Cc: "Artem S. Tashkinov" <t.artem@mailcity.com> Cc: Dave Jones <davej@redhat.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-12-29ceph: disable use of dcache for readdir etc.Sage Weil
Ceph attempts to use the dcache to satisfy negative lookups and readdir when the entire directory contents are in cache. Disable this behavior until lingering bugs in this code are shaken out; we'll re-enable these hooks once things are fully stable. Signed-off-by: Sage Weil <sage@newdream.net>
2011-12-26vfs: fix handling of lock allocation failure in lease-break caseLinus Torvalds
Bruce Fields notes that commit 778fc546f749 ("locks: fix tracking of inprogress lease breaks") introduced a possible error pointer dereference on failure to allocate memory. locks_conflict() will dereference the passed-in new lease lock structure that may be an error pointer. This means an open (without O_NONBLOCK set) on a file with a lease applied (generally only done when Samba or nfsd (with v4) is running) could crash if a kmalloc() fails. So instead of playing games with IS_ERROR() all over the place, just check the allocation failure early. That makes the code more straightforward, and avoids this possible bad pointer dereference. Based-on-patch-by: J. Bruce Fields <bfields@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-12-23Merge tag 'writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linuxLinus Torvalds
for linus: writeback reason binary tracing format fix * tag 'writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux: writeback: show writeback reason with __print_symbolic
2011-12-23Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: call d_instantiate after all ops are setup Btrfs: fix worker lock misuse in find_worker
2011-12-23xfs: log all dirty inodes in xfs_fs_sync_fsChristoph Hellwig
Since Linux 2.6.36 the writeback code has introduces various measures for live lock prevention during sync(). Unfortunately some of these are actively harmful for the XFS model, where the inode gets marked dirty for metadata from the data I/O handler. The older_than_this checks that are now more strictly enforced since writeback: avoid livelocking WB_SYNC_ALL writeback by only calling into __writeback_inodes_sb and thus only sampling the current cut off time once. But on a slow enough devices the previous asynchronous sync pass might not have fully completed yet, and thus XFS might mark metadata dirty only after that sampling of the cut off time for the blocking pass already happened. I have not myself reproduced this myself on a real system, but by introducing artificial delay into the XFS I/O completion workqueues it can be reproduced easily. Fix this by iterating over all XFS inodes in ->sync_fs and log all that are dirty. This might log inode that only got redirtied after the previous pass, but given how cheap delayed logging of inodes is it isn't a major concern for performance. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Tested-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2011-12-23xfs: log the inode in ->write_inode calls for kupdateChristoph Hellwig
If the writeback code writes back an inode because it has expired we currently use the non-blockin ->write_inode path. This means any inode that is pinned is skipped. With delayed logging and a workload that has very little log traffic otherwise it is very likely that an inode that gets constantly written to is always pinned, and thus we keep refusing to write it. The VM writeback code at that point redirties it and doesn't try to write it again for another 30 seconds. This means under certain scenarious time based metadata writeback never happens. Fix this by calling into xfs_log_inode for kupdate in addition to data integrity syncs, and thus transfer the inode to the log ASAP. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Tested-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2011-12-23Btrfs: call d_instantiate after all ops are setupAl Viro
This closes races where btrfs is calling d_instantiate too soon during inode creation. All of the callers of btrfs_add_nondir are updated to instantiate after the inode is fully setup in memory. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-12-23Btrfs: fix worker lock misuse in find_workerChris Mason
Dan Carpenter noticed that we were doing a double unlock on the worker lock, and sometimes picking a worker thread without the lock held. This fixes both errors. Signed-off-by: Chris Mason <chris.mason@oracle.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
2011-12-20Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: NFS: Fix a regression in nfs_file_llseek() NFSv4: Do not accept delegated opens when a delegation recall is in effect NFSv4: Ensure correct locking when accessing the 'lock_states' list NFSv4.1: Ensure that we handle _all_ SEQUENCE status bits. NFSv4: Don't error if we handled it in nfs4_recovery_handle_error SUNRPC: Ensure we always bump the backlog queue in xprt_free_slot SUNRPC: Fix the execution time statistics in the face of RPC restarts
2011-12-20nilfs2: potential integer overflow in nilfs_ioctl_clean_segments()Haogang Chen
There is a potential integer overflow in nilfs_ioctl_clean_segments(). When a large argv[n].v_nmembs is passed from the userspace, the subsequent call to vmalloc() will allocate a buffer smaller than expected, which leads to out-of-bound access in nilfs_ioctl_move_blocks() and lfs_clean_segments(). The following check does not prevent the overflow because nsegs is also controlled by the userspace and could be very large. if (argv[n].v_nmembs > nsegs * nilfs->ns_blocks_per_segment) goto out_free; This patch clamps argv[n].v_nmembs to UINT_MAX / argv[n].v_size, and returns -EINVAL when overflow. Signed-off-by: Haogang Chen <haogangchen@gmail.com> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-12-20nilfs2: unbreak compat ioctlThomas Meyer
commit 828b1c50ae ("nilfs2: add compat ioctl") incidentally broke all other NILFS compat ioctls. Make them work again. Signed-off-by: Thomas Meyer <thomas@m3y3r.de> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Tested-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Cc: <stable@vger.kernel.org> [3.0+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-12-18writeback: show writeback reason with __print_symbolicWu Fengguang
This makes the binary trace understandable by trace-cmd. CC: Dave Chinner <david@fromorbit.com> CC: Curt Wohlgemuth <curtw@google.com> CC: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-12-16Merge branches 'for-linus' and 'for-linus-3.2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: unplug every once and a while Btrfs: deal with NULL srv_rsv in the delalloc inode reservation code Btrfs: only set cache_generation if we setup the block group Btrfs: don't panic if orphan item already exists Btrfs: fix leaked space in truncate Btrfs: fix how we do delalloc reservations and how we free reservations on error Btrfs: deal with enospc from dirtying inodes properly Btrfs: fix num_workers_starting bug and other bugs in async thread BTRFS: Establish i_ops before calling d_instantiate Btrfs: add a cond_resched() into the worker loop Btrfs: fix ctime update of on-disk inode btrfs: keep orphans for subvolume deletion Btrfs: fix inaccurate available space on raid0 profile Btrfs: fix wrong disk space information of the files Btrfs: fix wrong i_size when truncating a file to a larger size Btrfs: fix btrfs_end_bio to deal with write errors to a single mirror * 'for-linus-3.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: btrfs: lower the dirty balance poll interval
2011-12-16btrfs: lower the dirty balance poll intervalWu Fengguang
Tests show that the original large intervals can easily make the dirty limit exceeded on 100 concurrent dd's. So adapt to as large as the next check point selected by the dirty throttling algorithm. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-12-15NFS: Fix a regression in nfs_file_llseek()Trond Myklebust
After commit 06222e491e663dac939f04b125c9dc52126a75c4 (fs: handle SEEK_HOLE/SEEK_DATA properly in all fs's that define their own llseek) the behaviour of llseek() was changed so that it always revalidates the file size. The bug appears to be due to a logic error in the afore-mentioned commit, which always evaluates to 'true'. Reported-by: Roel Kluin <roel.kluin@gmail.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: stable@vger.kernel.org [>=3.1]
2011-12-15Btrfs: unplug every once and a whileChris Mason
The btrfs io submission threads can build up massive plug lists. This keeps things more reasonable so we don't hand over huge dumps of IO at once. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-12-15Merge branch 'for-chris' of ↵Chris Mason
http://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-work into integration Conflicts: fs/btrfs/inode.c Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-12-15Btrfs: deal with NULL srv_rsv in the delalloc inode reservation codeChris Mason
btrfs_update_inode is sometimes called with a null reservation. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-12-15Btrfs: only set cache_generation if we setup the block groupJosef Bacik
A user reported a problem booting into a new kernel with the old format inodes. He was panicing in cow_file_range while writing out the inode cache. This is because if the block group is not cached we'll just skip writing out the cache, however if it gets dirtied again in the same transaction and it finished caching we'd go ahead and write it out, but since we set cache_generation to the transid we think we've already truncated it and will just carry on, running into cow_file_range and blowing up. We need to make sure we only set cache_generation if we've done the truncate. The user tested this patch and verified that the panic no longer occured. Thanks, Reported-and-Tested-by: Klaus Bitto <klaus.bitto@gmail.com> Signed-off-by: Josef Bacik <josef@redhat.com>
2011-12-15Btrfs: don't panic if orphan item already existsJosef Bacik
I've been hitting this BUG_ON() in btrfs_orphan_add when running xfstest 269 in a loop. This is because we will add an orphan item, do the truncate, the truncate will fail for whatever reason (*cough*ENOSPC*cough*) and then we're left with an orphan item still in the fs. Then we come back later to do another truncate and it blows up because we already have an orphan item. This is ok so just fix the BUG_ON() to only BUG() if ret is not EEXIST. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2011-12-15Btrfs: fix leaked space in truncateJosef Bacik
We were occasionaly leaking space when running xfstest 269. This is because if we failed to start the transaction in the truncate loop we'd just goto out, but we need to break so that the inode is removed from the orphan list and the space is properly freed. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2011-12-15Btrfs: fix how we do delalloc reservations and how we free reservations on errorJosef Bacik
Running xfstests 269 with some tracing my scripts kept spitting out errors about releasing bytes that we didn't actually have reserved. This took me down a huge rabbit hole and it turns out the way we deal with reserved_extents is wrong, we need to only be setting it if the reservation succeeds, otherwise the free() method will come in and unreserve space that isn't actually reserved yet, which can lead to other warnings and such. The math was all working out right in the end, but it caused all sorts of other issues in addition to making my scripts yell and scream and generally make it impossible for me to track down the original issue I was looking for. The other problem is with our error handling in the reservation code. There are two cases that we need to deal with 1) We raced with free. In this case free won't free anything because csum_bytes is modified before we dro the lock in our reservation path, so free rightly doesn't release any space because the reservation code may be depending on that reservation. However if we fail, we need the reservation side to do the free at that point since that space is no longer in use. So as it stands the code was doing this fine and it worked out, except in case #2 2) We don't race with free. Nobody comes in and changes anything, and our reservation fails. In this case we didn't reserve anything anyway and we just need to clean up csum_bytes but not free anything. So we keep track of csum_bytes before we drop the lock and if it hasn't changed we know we can just decrement csum_bytes and carry on. Because of the case where we can race with free()'s since we have to drop our spin_lock to do the reservation, I'm going to serialize all reservations with the i_mutex. We already get this for free in the heavy use paths, truncate and file write all hold the i_mutex, just needed to add it to page_mkwrite and various ioctl/balance things. With this patch my space leak scripts no longer scream bloody murder. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2011-12-15Btrfs: deal with enospc from dirtying inodes properlyJosef Bacik
Now that we're properly keeping track of delayed inode space we've been getting a lot of warnings out of btrfs_dirty_inode() when running xfstest 83. This is because a bunch of people call mark_inode_dirty, which is void so we can't return ENOSPC. This needs to be fixed in a few areas 1) file_update_time - this updates the mtime and such when writing to a file, which will call mark_inode_dirty. So copy file_update_time into btrfs so we can call btrfs_dirty_inode directly and return an error if we get one appropriately. 2) fix symlinks to use btrfs_setattr for ->setattr. For some reason we weren't setting ->setattr for symlinks, even though we should have been. This catches one of the cases where we were getting errors in mark_inode_dirty. 3) Fix btrfs_setattr and btrfs_setsize to call btrfs_dirty_inode directly instead of mark_inode_dirty. This lets us return errors properly for truncate and chown/anything related to setattr. 4) Add a new btrfs_fs_dirty_inode which will just call btrfs_dirty_inode and print an error if we have one. The only remaining user we can't control for this is touch_atime(), but we don't really want to keep people from walking down the tree if we don't have space to save the atime update, so just complain but don't worry about it. With this patch xfstests 83 complains a handful of times instead of hundreds of times. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2011-12-15Btrfs: fix num_workers_starting bug and other bugs in async threadJosef Bacik
Al pointed out we have some random problems with the way we account for num_workers_starting in the async thread stuff. First of all we need to make sure to decrement num_workers_starting if we fail to start the worker, so make __btrfs_start_workers do this. Also fix __btrfs_start_workers so that it doesn't call btrfs_stop_workers(), there is no point in stopping everybody if we failed to create a worker. Also check_pending_worker_creates needs to call __btrfs_start_work in it's work function since it already increments num_workers_starting. People only start one worker at a time, so get rid of the num_workers argument everywhere, and make btrfs_queue_worker a void since it will always succeed. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2011-12-15BTRFS: Establish i_ops before calling d_instantiateCasey Schaufler
The Smack LSM hook for security_d_instantiate checks the inode's i_op->getxattr value to determine if the containing filesystem supports extended attributes. The BTRFS filesystem sets the inode's i_op value only after it has instantiated the inode. This results in Smack incorrectly giving new BTRFS inodes attributes from the filesystem defaults on the assumption that values can't be stored on the filesystem. This patch moves the assignment of inode operation vectors ahead of the calls to d_instantiate, letting Smack know that the filesystem supports extended attributes. There should be no impact on the performance or behavior of BTRFS. Signed-off-by: Casey Schaufler <casey@schaufler-ca.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-12-15Btrfs: add a cond_resched() into the worker loopChris Mason
If we have a constant stream of end_io completions or crc work, we can hit softlockup messages from the async helper threads. This adds a cond_resched() into the loop to avoid them. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-12-15Btrfs: fix ctime update of on-disk inodeLi Zefan
To reproduce the bug: # touch /mnt/tmp # stat /mnt/tmp | grep Change Change: 2011-12-09 09:32:23.412105981 +0800 # chattr +i /mnt/tmp # stat /mnt/tmp | grep Change Change: 2011-12-09 09:32:43.198105295 +0800 # umount /mnt # mount /dev/loop1 /mnt # stat /mnt/tmp | grep Change Change: 2011-12-09 09:32:23.412105981 +0800 We should update ctime of in-memory inode before calling btrfs_update_inode(). Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-12-15btrfs: keep orphans for subvolume deletionArne Jansen
Since we have the free space caches, btrfs_orphan_cleanup also runs for the tree_root. Unfortunately this also cleans up the orphans used to mark subvol deletions in progress. Currently if a subvol deletion gets interrupted twice by umount/mount, the deletion will not be continued and the space permanently lost, though it would be possible to write a tool to recover those lost subvol deletions. This patch checks if the orphan belongs to a subvol (dead root) and skips the deletion. Signed-off-by: Arne Jansen <sensille@gmx.net> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-12-15Btrfs: fix inaccurate available space on raid0 profileMiao Xie
When we use raid0 as the data profile, df command may show us a very inaccurate value of the available space, which may be much less than the real one. It may make the users puzzled. Fix it by changing the calculation of the available space, and making it be more similar to a fake chunk allocation. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-12-15Btrfs: fix wrong disk space information of the filesMiao Xie
Btrfsck report errors after the 83th case of xfstests was run, The error number is 400, it means the used disk space of the file is wrong. The reason of this bug is that: The file truncation may fail when the space of the file system is not enough, and leave some file extents, whose offset are beyond the end of the files. When we want to expand those files, we will drop those file extents, and put in dummy file extents, and then we should update the i-node. But btrfs forgets to do it. This patch adds the forgotten i-node update. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>