path: root/net
AgeCommit message (Collapse)Author
2012-04-11batman-adv: form groups in the bridge loop avoidanceSimon Wunderlich
Backbone gateways may be part of the same LAN but participate in different meshes. With this patch, backbone gateways form groups: each one adopts the group id of another backbone gateway if that id is higher than its own. After forming the group, they only accept messages from backbone gateways of the same group. Signed-off-by: Simon Wunderlich <siwu@hrz.tu-chemnitz.de> Signed-off-by: Antonio Quartulli <ordex@autistici.org>
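The group rule described above can be modelled in a few lines. This is only a userspace sketch: the class name `BackboneGateway` and its methods are hypothetical and do not correspond to batman-adv symbols.

```python
class BackboneGateway:
    """Toy model of a backbone gateway with a group id (hypothetical names)."""

    def __init__(self, name, group_id):
        self.name = name
        self.group_id = group_id

    def hear_announcement(self, other_group_id):
        # Adopt the other gateway's group id only if it is higher than ours.
        if other_group_id > self.group_id:
            self.group_id = other_group_id

    def accepts(self, sender_group_id):
        # After grouping, only messages from the same group are accepted.
        return sender_group_id == self.group_id


gw = BackboneGateway("node1", 0x1234)
gw.hear_announcement(0x9999)   # higher id: adopted
gw.hear_announcement(0x0001)   # lower id: ignored
```

Because every gateway moves towards the highest id it hears, gateways on the same LAN converge on one group without any extra negotiation step.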
2012-04-11batman-adv: drop STP over batmanSimon Wunderlich
Signed-off-by: Simon Wunderlich <siwu@hrz.tu-chemnitz.de> Signed-off-by: Antonio Quartulli <ordex@autistici.org>
2012-04-11batman-adv: add broadcast duplicate checkSimon Wunderlich
When multiple backbone gateways relay the same broadcast from the backbone into the mesh, other nodes in the mesh may receive this broadcast multiple times. To avoid this, the CRC checksums of received broadcasts are recorded, and a new broadcast packet with the same content may be dropped if it is received via another gateway. Signed-off-by: Simon Wunderlich <siwu@hrz.tu-chemnitz.de> Signed-off-by: Antonio Quartulli <ordex@autistici.org>
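A minimal sketch of the duplicate check, under stated assumptions: it uses `zlib.crc32` and a fixed-size history, whereas the checksum width, history size and any expiry used by the actual patch are not given in the message above. The class name is hypothetical.

```python
import zlib
from collections import deque


class BroadcastDuplicateFilter:
    """Remember checksums of recent broadcasts and the gateway that relayed them."""

    def __init__(self, history=16):
        # Bounded history; the size 16 is an illustrative assumption.
        self._seen = deque(maxlen=history)

    def should_drop(self, payload: bytes, gateway: str) -> bool:
        crc = zlib.crc32(payload)
        for seen_crc, seen_gateway in self._seen:
            # Same content relayed by a *different* gateway: treat as duplicate.
            if seen_crc == crc and seen_gateway != gateway:
                return True
        self._seen.append((crc, gateway))
        return False


flt = BroadcastDuplicateFilter()
first = flt.should_drop(b"arp who-has 10.0.0.1", "gw1")
second = flt.should_drop(b"arp who-has 10.0.0.1", "gw2")
```

A checksum match is only a hint that two frames are identical, which is consistent with the message saying such packets "may be dropped".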
2012-04-11batman-adv: don't let backbone gateways exchange tt entriesSimon Wunderlich
As the backbone gateways are connected to the same backbone, they should announce the same clients on the backbone non-exclusively. Signed-off-by: Simon Wunderlich <siwu@hrz.tu-chemnitz.de> Signed-off-by: Antonio Quartulli <ordex@autistici.org>
2012-04-11batman-adv: allow multiple entries in tt_global_entriesSimon Wunderlich
As backbone gateways will all independently announce the same clients, the tt global table must also be able to hold multiple originators per client entry. Signed-off-by: Simon Wunderlich <siwu@hrz.tu-chemnitz.de> Signed-off-by: Antonio Quartulli <ordex@autistici.org>
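The data-structure change can be pictured as moving from one originator per client to a list of originators per client. The sketch below is a plain Python model with hypothetical names, not the kernel's tt_global_entry layout.

```python
from collections import defaultdict


class GlobalTranslationTable:
    """Client address -> list of originators announcing that client."""

    def __init__(self):
        self._originators = defaultdict(list)

    def add(self, client, originator):
        # Several originators may announce the same client; keep each once.
        if originator not in self._originators[client]:
            self._originators[client].append(originator)

    def originators(self, client):
        return list(self._originators.get(client, []))


tt = GlobalTranslationTable()
tt.add("02:00:00:00:00:aa", "gw1")
tt.add("02:00:00:00:00:aa", "gw2")
tt.add("02:00:00:00:00:aa", "gw1")  # repeated announcement, no new entry
```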
2012-04-11batman-adv: export claim tables through debugfsSimon Wunderlich
Signed-off-by: Simon Wunderlich <siwu@hrz.tu-chemnitz.de> Signed-off-by: Antonio Quartulli <ordex@autistici.org>
2012-04-11batman-adv: make bridge loop avoidance switchableSimon Wunderlich
Signed-off-by: Simon Wunderlich <siwu@hrz.tu-chemnitz.de> Signed-off-by: Antonio Quartulli <ordex@autistici.org>
2012-04-11batman-adv: add basic bridge loop avoidance codeSimon Wunderlich
This second version of the bridge loop avoidance for batman-adv avoids loops between the mesh and a backbone (usually a LAN).

By connecting multiple batman-adv mesh nodes to the same ethernet segment a loop can be created when the soft-interface is bridged into that ethernet segment. A simple visualization of the loop involving the most common case - a LAN as ethernet segment:

    node1  <-- LAN  -->  node2
      |                    |
    wifi   <-- mesh -->  wifi

Packets from the LAN (e.g. ARP broadcasts) will circle forever from node1 or node2 over the mesh back into the LAN.

With this patch, batman recognizes backbone gateways, nodes which are part of the mesh and backbone/LAN at the same time. Each backbone gateway "claims" clients from within the mesh to handle them exclusively. By restricting that only responsible backbone gateways may handle their claimed clients' traffic, loops are effectively avoided.

Signed-off-by: Simon Wunderlich <siwu@hrz.tu-chemnitz.de> Signed-off-by: Antonio Quartulli <ordex@autistici.org>
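The claim idea can be sketched as a table that maps each mesh client to the one backbone gateway responsible for it. This is a simplified model under assumptions: the first gateway to see a client claims it, and unclaiming or timeouts are not modelled. `ClaimTable` and `may_handle` are hypothetical names.

```python
class ClaimTable:
    """Client address -> backbone gateway that claimed it (toy model)."""

    def __init__(self):
        self._owner = {}

    def may_handle(self, client, gateway):
        # The first gateway that handles a client claims it; afterwards
        # only that gateway may forward the client's traffic.
        owner = self._owner.setdefault(client, gateway)
        return owner == gateway


claims = ClaimTable()
node1_ok = claims.may_handle("client-a", "node1")  # node1 claims client-a
node2_ok = claims.may_handle("client-a", "node2")  # node2 must not forward
```

Since only one gateway forwards a given client's frames between mesh and LAN, a broadcast from the LAN cannot re-enter the LAN through the second gateway.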
2012-04-11batman-adv: remove old bridge loop avoidance codeSimon Wunderlich
The functionality is to be replaced by an improved implementation, so first clean up. Signed-off-by: Simon Wunderlich <siwu@hrz.tu-chemnitz.de> Signed-off-by: Antonio Quartulli <ordex@autistici.org>
2012-04-11batman-adv: encourage batman to take shorter routes by changing the default hop penaltyMarek Lindner
Signed-off-by: Marek Lindner <lindner_marek@yahoo.de> Signed-off-by: Antonio Quartulli <ordex@autistici.org>
2012-04-11batman-adv: Remove declaration of only locally used functionsSven Eckelmann
Signed-off-by: Sven Eckelmann <sven@narfation.org> Acked-by: Antonio Quartulli <ordex@autistici.org> Signed-off-by: Antonio Quartulli <ordex@autistici.org>
2012-04-11batman-adv: Replace bitarray operations with bitmapSven Eckelmann
bitarray.c consists mostly of functionality that is already available as part of the standard kernel API. batman-adv could use architecture optimized code and reduce the binary size by switching to the standard functions. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Antonio Quartulli <ordex@autistici.org>
2012-04-11batman-adv: use ETH_ALEN instead of hardcoded numeric constantsAntonio Quartulli
In packet.h the numeric constant 6 is used instead of the more portable ETH_ALEN define. This patch substitutes every hardcoded value with that define. Signed-off-by: Antonio Quartulli <ordex@autistici.org> Acked-by: Sven Eckelmann <sven@narfation.org>
2012-04-11batman-adv: clean up KconfigAntonio Quartulli
Signed-off-by: Antonio Quartulli <ordex@autistici.org>
2012-04-10Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
2012-04-06Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking updates from David Miller:

 1) Fix inaccuracies in network driver interface documentation, from Ben Hutchings.
 2) Fix handling of negative offsets in BPF JITs, from Jan Seiffert.
 3) Compile warning, locking, and refcounting fixes in netfilter's xt_CT, from Pablo Neira Ayuso.
 4) phonet sendmsg needs to validate user length just like any other datagram protocol, fix from Sasha Levin.
 5) IPv6 multicast code uses wrong loop index, from RongQing Li.
 6) Link handling and firmware fixes in bnx2x driver from Yaniv Rosner and Yuval Mintz.
 7) mlx4 erroneously allocates 4 pages at a time, regardless of page size, fix from Thadeu Lima de Souza Cascardo.
 8) SCTP socket option wasn't extended in a backwards compatible way, fix from Thomas Graf.
 9) Add missing address change event emissions to bonding, from Shlomo Pongratz.
10) /proc/net/dev regressed because it uses a private offset to track where we are in the hash table, but this doesn't track the offset pullback that the seq_file code does, resulting in some entries being missed in large dumps. Fix from Eric Dumazet.
11) do_tcp_sendpage() unloads the send queue way too fast, because it invokes tcp_push() when it shouldn't. Let the natural sequence generated by the splice paths, and the associated MSG_MORE settings, guide the tcp_push() calls. Otherwise what goes out of TCP is spaghetti and doesn't batch effectively into GSO/TSO clusters. From Eric Dumazet.
12) Once we put a SKB into either the netlink receiver's queue or a socket error queue, it can be consumed and freed up, therefore we cannot touch it after queueing it like that. Fixes from Eric Dumazet.
13) PPP has this annoying behavior in that for every transmit call it immediately stops the TX queue, then calls down into the next layer to transmit the PPP frame. But if that next layer can take it immediately, it just un-stops the TX queue right before returning from the transmit method. Besides being useless work, it makes several facilities unusable, in particular things like the equalizers. Well behaved devices should only stop the TX queue when they really are full, and in PPP's case when it gets backlogged to the downstream device. David Woodhouse therefore fixed PPP to not stop the TX queue until its downstream can't take data any more.
14) IFF_UNICAST_FLT got accidentally lost in some recent stmmac driver changes, re-add. From Marc Kleine-Budde.
15) Fix link flaps in ixgbe, from Eric W. Multanen.
16) Descriptor writeback fixes in e1000e from Matthew Vick.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (47 commits)
  net: fix a race in sock_queue_err_skb()
  netlink: fix races after skb queueing
  doc, net: Update ndo_start_xmit return type and values
  doc, net: Remove instruction to set net_device::trans_start
  doc, net: Update netdev operation names
  doc, net: Update documentation of synchronisation for TX multiqueue
  doc, net: Remove obsolete reference to dev->poll
  ethtool: Remove exception to the requirement of holding RTNL lock
  MAINTAINERS: update for Marvell Ethernet drivers
  bonding: properly unset current_arp_slave on slave link up
  phonet: Check input from user before allocating
  tcp: tcp_sendpages() should call tcp_push() once
  ipv6: fix array index in ip6_mc_add_src()
  mlx4: allocate just enough pages instead of always 4 pages
  stmmac: re-add IFF_UNICAST_FLT for dwmac1000
  bnx2x: Clear MDC/MDIO warning message
  bnx2x: Fix BCM57711+BCM84823 link issue
  bnx2x: Clear BCM84833 LED after fan failure
  bnx2x: Fix BCM84833 PHY FW version presentation
  bnx2x: Fix link issue for BCM8727 boards.
  ...
2012-04-06net: fix a race in sock_queue_err_skb()Eric Dumazet
As soon as an skb is queued into socket error queue, another thread can consume it, so we are not allowed to reference skb anymore, or risk use after free. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-06netlink: fix races after skb queueingEric Dumazet
As soon as an skb is queued into socket receive_queue, another thread can consume it, so we are not allowed to reference skb anymore, or risk use after free. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-05phonet: Check input from user before allocatingSasha Levin
A phonet packet is limited to USHRT_MAX bytes, but this is never checked during tx, which means that the user can specify any size he wishes, and the kernel will attempt to allocate that size.

In the good case, it'll lead to the following warning, but it may also cause the kernel to kick in the OOM and kill a random task on the server.

[ 8921.744094] WARNING: at mm/page_alloc.c:2255 __alloc_pages_slowpath+0x65/0x730()
[ 8921.749770] Pid: 5081, comm: trinity Tainted: G W 3.4.0-rc1-next-20120402-sasha #46
[ 8921.756672] Call Trace:
[ 8921.758185] [<ffffffff810b2ba7>] warn_slowpath_common+0x87/0xb0
[ 8921.762868] [<ffffffff810b2be5>] warn_slowpath_null+0x15/0x20
[ 8921.765399] [<ffffffff8117eae5>] __alloc_pages_slowpath+0x65/0x730
[ 8921.769226] [<ffffffff81179c8a>] ? zone_watermark_ok+0x1a/0x20
[ 8921.771686] [<ffffffff8117d045>] ? get_page_from_freelist+0x625/0x660
[ 8921.773919] [<ffffffff8117f3a8>] __alloc_pages_nodemask+0x1f8/0x240
[ 8921.776248] [<ffffffff811c03e0>] kmalloc_large_node+0x70/0xc0
[ 8921.778294] [<ffffffff811c4bd4>] __kmalloc_node_track_caller+0x34/0x1c0
[ 8921.780847] [<ffffffff821b0e3c>] ? sock_alloc_send_pskb+0xbc/0x260
[ 8921.783179] [<ffffffff821b3c65>] __alloc_skb+0x75/0x170
[ 8921.784971] [<ffffffff821b0e3c>] sock_alloc_send_pskb+0xbc/0x260
[ 8921.787111] [<ffffffff821b002e>] ? release_sock+0x7e/0x90
[ 8921.788973] [<ffffffff821b0ff0>] sock_alloc_send_skb+0x10/0x20
[ 8921.791052] [<ffffffff824cfc20>] pep_sendmsg+0x60/0x380
[ 8921.792931] [<ffffffff824cb4a6>] ? pn_socket_bind+0x156/0x180
[ 8921.794917] [<ffffffff824cb50f>] ? pn_socket_autobind+0x3f/0x90
[ 8921.797053] [<ffffffff824cb63f>] pn_socket_sendmsg+0x4f/0x70
[ 8921.798992] [<ffffffff821ab8e7>] sock_aio_write+0x187/0x1b0
[ 8921.801395] [<ffffffff810e325e>] ? sub_preempt_count+0xae/0xf0
[ 8921.803501] [<ffffffff8111842c>] ? __lock_acquire+0x42c/0x4b0
[ 8921.805505] [<ffffffff821ab760>] ? __sock_recv_ts_and_drops+0x140/0x140
[ 8921.807860] [<ffffffff811e07cc>] do_sync_readv_writev+0xbc/0x110
[ 8921.809986] [<ffffffff811958e7>] ? might_fault+0x97/0xa0
[ 8921.811998] [<ffffffff817bd99e>] ? security_file_permission+0x1e/0x90
[ 8921.814595] [<ffffffff811e17e2>] do_readv_writev+0xe2/0x1e0
[ 8921.816702] [<ffffffff810b8dac>] ? do_setitimer+0x1ac/0x200
[ 8921.818819] [<ffffffff810e2ec1>] ? get_parent_ip+0x11/0x50
[ 8921.820863] [<ffffffff810e325e>] ? sub_preempt_count+0xae/0xf0
[ 8921.823318] [<ffffffff811e1926>] vfs_writev+0x46/0x60
[ 8921.825219] [<ffffffff811e1a3f>] sys_writev+0x4f/0xb0
[ 8921.827127] [<ffffffff82658039>] system_call_fastpath+0x16/0x1b
[ 8921.829384] ---[ end trace dffe390f30db9eb7 ]---

Signed-off-by: Sasha Levin <levinsasha928@gmail.com> Acked-by: Rémi Denis-Courmont <remi.denis-courmont@nokia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
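The fix amounts to rejecting an oversized request before any buffer is allocated. A minimal userspace sketch follows; the function name is hypothetical, and the use of EMSGSIZE as the error is an assumption, since the message above does not say which error the patch returns.

```python
import errno

USHRT_MAX = 0xFFFF  # largest phonet packet size named in the commit message


def check_send_length(length: int) -> int:
    """Validate a user-supplied send length before allocating anything."""
    if length > USHRT_MAX:
        # Assumed error code; reject instead of attempting a huge allocation.
        raise OSError(errno.EMSGSIZE, "message too long for a phonet packet")
    return length


accepted = check_send_length(USHRT_MAX)
try:
    check_send_length(USHRT_MAX + 1)
    rejected = False
except OSError as exc:
    rejected = exc.errno == errno.EMSGSIZE
```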
2012-04-05tcp: tcp_sendpages() should call tcp_push() onceEric Dumazet
commit 2f533844242 (tcp: allow splice() to build full TSO packets) added a regression for splice() calls using SPLICE_F_MORE. We need to call tcp_push() at the end of the last page processed in tcp_sendpages(), or else transmits can be deferred and future sends stall. Add a new internal flag, MSG_SENDPAGE_NOTLAST, acting like MSG_MORE, but with different semantics. For all sendpage() providers, it's a transparent change. Only sock_sendpage() and tcp_sendpages() can differentiate the two different flags provided by pipe_to_sendpage(). Reported-by: Tom Herbert <therbert@google.com> Cc: Nandita Dukkipati <nanditad@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: H.K. Jerry Chu <hkchu@google.com> Cc: Maciej Żenczykowski <maze@google.com> Cc: Mahesh Bandewar <maheshb@google.com> Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-05Merge branch 'akpm' (Andrew's patch-bomb)Linus Torvalds
Merge batch of fixes from Andrew Morton:

"The simple_open() cleanup was held back while I waited for laggards to merge things. I still need to send a few checkpoint/restore patches. I've been wobbly about merging them because I'm wobbly about the overall prospects for success of the project. But after speaking with Pavel at the LSF conference, it sounds like they're further toward completion than I feared - apparently davem is at the "has stopped complaining" stage regarding the net changes. So I need to go back and re-review those patches and their (lengthy) discussion."

* emailed from Andrew Morton <akpm@linux-foundation.org>: (16 patches)
  memcg swap: use mem_cgroup_uncharge_swap fix
  backlight: add driver for DA9052/53 PMIC v1
  C6X: use set_current_blocked() and block_sigmask()
  MAINTAINERS: add entry for sparse checker
  MAINTAINERS: fix REMOTEPROC F: typo
  alpha: use set_current_blocked() and block_sigmask()
  simple_open: automatically convert to simple_open()
  scripts/coccinelle/api/simple_open.cocci: semantic patch for simple_open()
  libfs: add simple_open()
  hugetlbfs: remove unregister_filesystem() when initializing module
  drivers/rtc/rtc-88pm860x.c: fix rtc irq enable callback
  fs/xattr.c:setxattr(): improve handling of allocation failures
  fs/xattr.c:listxattr(): fall back to vmalloc() if kmalloc() failed
  fs/xattr.c: suppress page allocation failure warnings from sys_listxattr()
  sysrq: use SEND_SIG_FORCED instead of force_sig()
  proc: fix mount -t proc -o AAA
2012-04-05simple_open: automatically convert to simple_open()Stephen Boyd
Many users of debugfs copy the implementation of default_open() when they want to support a custom read/write function op. This leads to a proliferation of the default_open() implementation across the entire tree.

Now that the common implementation has been consolidated into libfs we can replace all the users of this function with simple_open().

This replacement was done with the following semantic patch:

<smpl>
@ open @
identifier open_f != simple_open;
identifier i, f;
@@
-int open_f(struct inode *i, struct file *f)
-{
(
-if (i->i_private)
-f->private_data = i->i_private;
|
-f->private_data = i->i_private;
)
-return 0;
-}

@ has_open depends on open @
identifier fops;
identifier open.open_f;
@@
struct file_operations fops = {
...
-.open = open_f,
+.open = simple_open,
...
};
</smpl>

[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Julia Lawall <Julia.Lawall@lip6.fr> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-04-05net: replace continue with break to reduce unnecessary loop in xxx_xmarksourcesRongQing.Li
The conditional which decides to skip inactive filters does not change with the change of loop index, so it is unnecessary to check them many times. Signed-off-by: RongQing.Li <roy.qing.li@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
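Why `break` is enough can be seen in a small model: when the skip condition does not depend on the loop variable, it is either true for every iteration or for none. The function and parameter names below are hypothetical and only mimic the shape of the loop.

```python
def mark_sources(known_sources, requested, filter_inactive):
    """Return (marked, checks) for a simplified source-marking loop."""
    checks = 0
    marked = 0
    for src in requested:
        checks += 1
        # The condition does not depend on src, so if it holds once it
        # holds for every remaining iteration: leave the loop at once.
        if filter_inactive:
            break
        if src in known_sources:
            marked += 1
    return marked, checks


inactive = mark_sources({"a"}, ["a", "b", "c"], filter_inactive=True)
active = mark_sources({"a"}, ["a", "b", "c"], filter_inactive=False)
```

With `continue` the inactive case would evaluate the same condition once per requested source; with `break` it is evaluated once.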
2012-04-05net: remove k{un}map_skb_frag()Eric Dumazet
Since commit 3e4d3af501 (mm: stack based kmap_atomic()) we don't have to disable BH anymore while mapping skb frags. We can remove the kmap_skb_frag() / kunmap_skb_frag() helpers and use kmap_atomic() / kunmap_atomic(). Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-05net/dcb: Add an optional max rate attributeAmir Vadai
Although not specified in the 802.1Qaz spec, it could be useful to support drivers whose HW can set a rate limit for an ETS TC. This patch adds this optional attribute to DCB netlink. To use it, drivers should implement and register the callbacks ieee_setmaxrate and ieee_getmaxrate. The rate values are 64 bits wide and specified in Kbps so that they can be used on both slow and very fast networks. Signed-off-by: Amir Vadai <amirv@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-05net/route: export symbol ip_tos2prioAmir Vadai
Export this symbol so that drivers can use rt_tos2priority(). Signed-off-by: Amir Vadai <amirv@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-05ipv6: fix array index in ip6_mc_add_src()RongQing.Li
Index the array with the loop index instead of the loop bound. Also remove the (void) cast on the ip6_mc_del1_src() return value: it is unnecessary because ip6_mc_del1_src() is not marked __must_check (or a similar attribute), so no compiler will warn when the cast is removed. v2: enrich the commit header Signed-off-by: RongQing.Li <roy.qing.li@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
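The class of bug is easiest to see in a rollback loop: after a failure at position i, the entries 0..i-1 must be undone, so the inner loop has to index with its own counter j and not with the bound i. The sketch below is a hypothetical list-based model, not the ip6_mc_add_src() code.

```python
def add_sources(table, new_sources, capacity):
    """Append all new_sources to table, or roll back and return False."""
    for i, src in enumerate(new_sources):
        if len(table) >= capacity:
            # Roll back what was added so far. Indexing with j (the loop
            # index) is the fix; indexing with i (the loop bound) would
            # touch the entry that failed, i times over.
            for j in range(i):
                table.remove(new_sources[j])
            return False
        table.append(src)
    return True


full = ["x"]
failed = add_sources(full, ["a", "b", "c"], capacity=3)
roomy = []
succeeded = add_sources(roomy, ["a", "b"], capacity=3)
```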
2012-04-04sctp: Allow struct sctp_event_subscribe to grow without breaking binariesThomas Graf
getsockopt(..., SCTP_EVENTS, ...) performs a length check and returns an error if the user provides fewer bytes than the size of struct sctp_event_subscribe. Struct sctp_event_subscribe needs to be extended by a u8 for every new event or notification type that is added. This obviously makes getsockopt fail for binaries that are compiled against an older version of <net/sctp/user.h> which does not contain all event types. This patch changes getsockopt behaviour to no longer return an error if not enough bytes are being provided by the user. Instead, it returns as much of sctp_event_subscribe as fits into the provided buffer. This leads to the new behavior that users see what they have been aware of at compile time. The setsockopt(..., SCTP_EVENTS, ...) API is already behaving like this. Signed-off-by: Thomas Graf <tgraf@suug.ch> Acked-by: Vlad Yasevich <vladislav.yasevich@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
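The new getsockopt behaviour is a truncating copy: return the first min(user length, struct size) bytes instead of failing. A byte-string model, with a hypothetical function name and made-up field values, might look like this; handling of a zero or negative length is not modelled.

```python
def get_event_subscription(kernel_struct: bytes, user_len: int) -> bytes:
    """Return as much of the kernel's struct as fits into the user buffer."""
    copy_len = min(user_len, len(kernel_struct))
    return kernel_struct[:copy_len]


# Pretend the running kernel knows 10 one-byte event flags (values made up).
current = bytes([1, 0, 1, 1, 0, 0, 1, 0, 0, 1])

old_binary = get_event_subscription(current, 8)    # built against 8 flags
new_binary = get_event_subscription(current, 10)   # built against 10 flags
```

An old binary therefore sees exactly the fields it knew about at compile time, which is the property the commit message describes.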
2012-04-04ethtool: Add a common function for drivers with transmit time stamping.Richard Cochran
Currently, most drivers do not support transmit SO_TIMESTAMPING. For those that do support it, there is one appropriate response to the get_ts_info query. This patch adds a common function providing this response. Signed-off-by: Richard Cochran <richardcochran@gmail.com> Reviewed-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-04ethtool: Introduce a method for getting time stamping capabilities.Richard Cochran
This commit adds a new ethtool ioctl that exposes the SO_TIMESTAMPING capabilities of a network interface. In addition, user space programs can use this ioctl to discover the PTP Hardware Clock (PHC) device associated with the interface. Since software receive time stamps are handled by the stack, the generic ethtool code can answer the query correctly in case the MAC or PHY drivers lack special time stamping features. Signed-off-by: Richard Cochran <richardcochran@gmail.com> Reviewed-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-04ipv6: Fix 'inet6_rtm_getroute' to release 'rt->dst' in case of 'alloc_skb' failureShmulik Ladkani
In 72331bc [ipv6: Fix RTM_GETROUTE's interpretation of RTA_IIF to be consistent with ipv4] the code of 'inet6_rtm_getroute()' was re-ordered such that the reference to 'rt->dst' is incremented prior to skb allocation. Hence, if 'alloc_skb()' fails, we must drop a reference from 'rt->dst'. Add the missing 'dst_release()' call. Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-03netfilter: nf_conntrack: fix count leak in error path of __nf_conntrack_allocPablo Neira Ayuso
We have to decrement the conntrack counter if we fail to access the zone extension. Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-03netfilter: xt_CT: fix missing put timeout object in error pathPablo Neira Ayuso
The error path misses putting the timeout object. This patch adds new function xt_ct_tg_timeout_put() to put the timeout object. Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-03netfilter: xt_CT: allocation has to be GFP_ATOMIC under rcu_read_lock sectionPablo Neira Ayuso
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-03Merge branch 'master' of git://1984.lsi.us.es/netDavid S. Miller
2012-04-03filter: add XOR operationJiri Pirko
Add an XOR instruction for the BPF machine. Needed for computing packet hashes. Signed-off-by: Jiri Pirko <jpirko@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-03filter: Allow to create sk-unattached filtersJiri Pirko
Today, BPF filters are bound to sockets. Since the BPF machine is becoming handy for other purposes, this patch allows creating filters that are not attached to a socket. Signed-off-by: Jiri Pirko <jpirko@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-03bpf jit: Make the filter.c::__load_pointer helper non-static for the jitsJan Seiffert
The function is renamed to make it a little more clear what it does. It is not added to any .h because it is not for general consumption, only for bpf internal use (and so by the jits). Signed-off-by: Jan Seiffert <kaffeemonster@googlemail.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-03tcp: allow splice() to build full TSO packetsEric Dumazet
vmsplice()/splice(pipe, socket) call do_tcp_sendpages() one page at a time, adding at most 4096 bytes to an skb. (assuming PAGE_SIZE=4096)

The call to tcp_push() at the end of do_tcp_sendpages() forces an immediate xmit when the pipe is not already filled, and tso_fragment() tries to split these skbs into MSS multiples.

4096 bytes are usually split into an skb with 2 MSS, and a remaining sub-mss skb (assuming MTU=1500)

This makes slow start suboptimal because many small frames are sent to qdisc/driver layers instead of big ones (constrained by cwnd and packets in flight of course)

In fact, applications using sendmsg() (adding an additional memory copy) instead of vmsplice()/splice()/sendfile() are a bit faster because of this anomaly, especially if serving small files in environments with large initial [c]wnd.

Call tcp_push() only if MSG_MORE is not set in the flags parameter. This bit is automatically provided by splice() internals but for the last page, or on all pages if user specified SPLICE_F_MORE splice() flag.

In some workloads, this can reduce number of sent logical packets by an order of magnitude, making zero-copy TCP actually faster than one-copy :)

Reported-by: Tom Herbert <therbert@google.com> Cc: Nandita Dukkipati <nanditad@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: H.K. Jerry Chu <hkchu@google.com> Cc: Maciej Żenczykowski <maze@google.com> Cc: Mahesh Bandewar <maheshb@google.com> Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
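The "2 MSS plus a sub-MSS remainder" figure can be checked with simple arithmetic. The MSS values below are assumptions for MTU=1500 over IPv4: 1460 bytes without TCP options and 1448 bytes with the 12-byte timestamp option; the commit message itself only states the MTU.

```python
PAGE_SIZE = 4096


def split_into_mss(nbytes: int, mss: int):
    """Return (full MSS-sized segments, leftover bytes) for one pushed page."""
    return divmod(nbytes, mss)


with_timestamps = split_into_mss(PAGE_SIZE, 1448)
without_options = split_into_mss(PAGE_SIZE, 1460)
```

In both cases a single 4096-byte page yields two full segments and one short one, so pushing after every page sends a sub-MSS frame roughly once per page.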
2012-04-03net: fix /proc/net/dev regressionEric Dumazet
Commit f04565ddf52 (dev: use name hash for dev_seq_ops) added a second regression, as some devices are missing from /proc/net/dev if many devices are defined. When the seq_file buffer is filled, the last ->next/show() method is canceled (the pos value is reverted to its value prior to the ->next() call). The problem is that, after the above commit, we don't restart the lookup at the right position in the ->start() method. Fix this by removing the internal 'pos' pointer added in that commit, since we need to use the 'loff_t *pos' provided by the seq_file layer. This also reverts commit 5cac98dd0 (net: Fix corruption in /proc/*/net/dev_mcast), since it's not needed anymore. Reported-by: Ben Greear <greearb@candelatech.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Mihai Maruseac <mmaruseac@ixiacom.com> Tested-by: Ben Greear <greearb@candelatech.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-03af_unix: reduce high order page allocationsEric Dumazet
unix_dgram_sendmsg() currently builds linear skbs, and this can stress the page allocator with high order page allocations. When memory gets fragmented, this can eventually fail.

We can try to use order-2 allocations for the skb head (SKB_MAX_ALLOC) plus up to 16 page fragments to lower pressure on the buddy allocator.

This patch has no effect on messages of less than 16064 bytes. (on 64bit arches with PAGE_SIZE=4096)

For bigger messages (from 16065 to 81600 bytes), this patch brings reliability at the expense of a performance penalty caused by the extra page allocations.

netperf -t DG_STREAM -T 0,2 -- -m 16064 -s 200000
->4086040 Messages / 10s

netperf -t DG_STREAM -T 0,2 -- -m 16068 -s 200000
->3901747 Messages / 10s

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
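The size limits and the measured cost quoted above are consistent with each other, as a short calculation shows. The constants are taken from the commit message; the variable names are only for illustration.

```python
PAGE_SIZE = 4096
HEAD_BYTES = 16064        # largest message unaffected by the patch
MAX_PAGE_FRAGMENTS = 16   # "up to 16 page fragments"

# Largest message that fits in the head plus 16 page fragments.
max_message = HEAD_BYTES + MAX_PAGE_FRAGMENTS * PAGE_SIZE

# Relative throughput drop between the two netperf runs (messages per 10 s).
linear_rate = 4086040       # -m 16064, still a linear skb
fragmented_rate = 3901747   # -m 16068, needs a page fragment
drop_percent = (linear_rate - fragmented_rate) / linear_rate * 100
```

The sum matches the 81600-byte upper bound in the message, and the two netperf results differ by about 4.5 percent.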
2012-04-03netfilter: xt_CT: remove a compile warningPablo Neira Ayuso
If CONFIG_NF_CONNTRACK_TIMEOUT=n we have the following warning:

  CC [M] net/netfilter/xt_CT.o
  net/netfilter/xt_CT.c: In function ‘xt_ct_tg_check_v1’:
  net/netfilter/xt_CT.c:284: warning: label ‘err4’ defined but not used

Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2012-04-02Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller:

 1) Provide device string properly for USB i2400m wimax devices, also don't OOPS when providing firmware string. From Phil Sutter.
 2) Add support for sh_eth SH7734 chips, from Nobuhiro Iwamatsu.
 3) Add another device ID to USB zaurus driver, from Guan Xin.
 4) Loop index start in pool vector iterator is wrong causing MAC to not get configured in bnx2x driver, fix from Dmitry Kravkov.
 5) EQL driver assumes HZ=100, fix from Eric Dumazet.
 6) Now that skb_add_rx_frag() can specify the truesize increment separately, do so in f_phonet and cdc_phonet, also from Eric Dumazet.
 7) virtio_net accidentally uses net_ratelimit() not only on the kernel warning but also the statistic bump, fix from Rick Jones.
 8) ip_route_input_mc() uses fixed init_net namespace, oops, use dev_net(dev) instead. Fix from Benjamin LaHaise.
 9) dev_forward_skb() needs to clear the incoming interface index of the SKB so that it looks like a new incoming packet, also from Benjamin LaHaise.
10) iwlwifi mistakenly initializes a channel entry as 2GHZ instead of 5GHZ, fix from Stanislav Yakovlev.
11) Missing kmalloc() return value checks in orinoco, from Santosh Nayak.
12) ath9k doesn't check for HT capabilities in the right way, it is checking ht_supported instead of the ATH9K_HW_CAP_HT flag. Fix from Sujith Manoharan.
13) Fix x86 BPF JIT emission of 16-bit immediate field of AND instructions, from Feiran Zhuang.
14) Avoid infinite loop in GARP code when registering sysfs entries. From David Ward.
15) rose protocol uses memcpy instead of memcmp in a device address comparison, oops. Fix from Daniel Borkmann.
16) Fix build of lpc_eth due to dev_hw_addr_random() interface being renamed to eth_hw_addr_random(). From Roland Stigge.
17) Make ipv6 RTM_GETROUTE interpret RTA_IIF attribute the same way that ipv4 does. Fix from Shmulik Ladkani.
18) via-rhine has an inverted bit test, causing suspend/resume regressions. Fix from Andreas Mohr.
19) RIONET assumes 4K page size, fix from Akinobu Mita.
20) Initialization of imask register in sky2 is buggy, because bits are "or'd" into an uninitialized local variable. Fix from Lino Sanfilippo.
21) Fix FCOE checksum offload handling, from Yi Zou.
22) Fix VLAN processing regression in e1000, from Jiri Pirko.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (52 commits)
  sky2: dont overwrite settings for PHY Quick link
  tg3: Fix 5717 serdes powerdown problem
  net: usb: cdc_eem: fix mtu
  net: sh_eth: fix endian check for architecture independent
  usb/rtl8150 : Remove duplicated definitions
  rionet: fix page allocation order of rionet_active
  via-rhine: fix wait-bit inversion.
  ipv6: Fix RTM_GETROUTE's interpretation of RTA_IIF to be consistent with ipv4
  net: lpc_eth: Fix rename of dev_hw_addr_random
  net/netfilter/nfnetlink_acct.c: use linux/atomic.h
  rose_dev: fix memcpy-bug in rose_set_mac_address
  Fix non TBI PHY access; a bad merge undid bug fix in a previous commit.
  net/garp: avoid infinite loop if attribute already exists
  x86 bpf_jit: fix a bug in emitting the 16-bit immediate operand of AND
  bonding: emit event when bonding changes MAC
  mac80211: fix oper channel timestamp updation
  ath9k: Use HW HT capabilites properly
  MAINTAINERS: adding maintainer for ipw2x00
  net: orinoco: add error handling for failed kmalloc().
  net/wireless: ipw2x00: fix a typo in wiphy struct initilization
  ...
2012-04-02net: Report dev->promiscuity in netlink reports.Ben Greear
The standard ways of probing a device's promiscuity (ifi_flags, for instance) do not report the actual state of the device. This patch adds dev->promiscuity to the netlink netdevice report so that users can know for certain if the device is acting PROMISC or not. Signed-off-by: Ben Greear <greearb@candelatech.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-02net/ipv6/addrconf.c: Checkpatch cleanupsEldad Zack
net/ipv6/addrconf.c:340: WARNING: EXPORT_SYMBOL(foo); should immediately follow its function/variable
net/ipv6/addrconf.c:342: ERROR: "foo * bar" should be "foo *bar"
net/ipv6/addrconf.c:444: ERROR: "foo * bar" should be "foo *bar"
net/ipv6/addrconf.c:1337: WARNING: EXPORT_SYMBOL(foo); should immediately follow its function/variable
net/ipv6/addrconf.c:1526: ERROR: "(foo*)" should be "(foo *)"
net/ipv6/addrconf.c:1671: ERROR: open brace '{' following function declarations go on the next line
net/ipv6/addrconf.c:1914: ERROR: "foo * bar" should be "foo *bar"
net/ipv6/addrconf.c:2368: ERROR: "foo * bar" should be "foo *bar"
net/ipv6/addrconf.c:2370: ERROR: "foo * bar" should be "foo *bar"
net/ipv6/addrconf.c:2416: ERROR: "foo * bar" should be "foo *bar"
net/ipv6/addrconf.c:2437: ERROR: "foo * bar" should be "foo *bar"
net/ipv6/addrconf.c:2573: ERROR: "foo * bar" should be "foo *bar"
net/ipv6/addrconf.c:3797: ERROR: "foo* bar" should be "foo *bar"

Signed-off-by: Eldad Zack <eldad@fogrefinery.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-02net/ipv6/icmp.c: Checkpatch cleanupsEldad Zack
icmp.c:501: ERROR: "(foo*)" should be "(foo *)"
icmp.c:582: ERROR: "(foo*)" should be "(foo *)"
icmp.c:954: WARNING: EXPORT_SYMBOL(foo); should immediately follow its function/variable

Signed-off-by: Eldad Zack <eldad@fogrefinery.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-02net/ipv6/fib6_rules.c: Checkpatch cleanupEldad Zack
fib6_rules.c:26: ERROR: open brace '{' following struct go on the same line Signed-off-by: Eldad Zack <eldad@fogrefinery.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-02net/ipv6/exthdrs_core.c: Checkpatch cleanupsEldad Zack
exthdrs_core.c:113: WARNING: EXPORT_SYMBOL(foo); should immediately follow its function/variable
exthdrs_core.c:114: WARNING: EXPORT_SYMBOL(foo); should immediately follow its function/variable

Signed-off-by: Eldad Zack <eldad@fogrefinery.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-02net/ipv6/exthdrs.c: Checkpatch cleanupsEldad Zack
exthdrs.c:726: WARNING: EXPORT_SYMBOL(foo); should immediately follow its function/variable
exthdrs.c:741: ERROR: "(foo*)" should be "(foo *)"
exthdrs.c:741: ERROR: "(foo*)" should be "(foo *)"
exthdrs.c:744: ERROR: "(foo**)" should be "(foo **)"
exthdrs.c:746: ERROR: "(foo**)" should be "(foo **)"
exthdrs.c:748: ERROR: "(foo**)" should be "(foo **)"
exthdrs.c:750: ERROR: "(foo**)" should be "(foo **)"
exthdrs.c:755: WARNING: EXPORT_SYMBOL(foo); should immediately follow its function/variable
exthdrs.c:896: WARNING: EXPORT_SYMBOL(foo); should immediately follow its function/variable

Signed-off-by: Eldad Zack <eldad@fogrefinery.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-02net/ipv6/datagram.c: Checkpatch cleanupsEldad Zack
datagram.c:101: ERROR: "(foo*)" should be "(foo *)"
datagram.c:521: ERROR: space required before the open parenthesis '('
datagram.c:830: WARNING: braces {} are not necessary for single statement blocks
datagram.c:849: WARNING: braces {} are not necessary for single statement blocks

Signed-off-by: Eldad Zack <eldad@fogrefinery.com> Signed-off-by: David S. Miller <davem@davemloft.net>