summaryrefslogtreecommitdiff
path: root/net/ipv4
AgeCommit message (Collapse)Author
2014-01-21net: Add GRO support for UDP encapsulating protocolsOr Gerlitz
Add GRO handlers for protocols that do UDP encapsulation, with the intent of being able to coalesce packets which encapsulate packets belonging to the same TCP session. For GRO purposes, the destination UDP port takes the role of the ether type field in the ethernet header or the next protocol in the IP header. The UDP GRO handler will only attempt to coalesce packets whose destination port is registered to have gro handler. Use a mark on the skb GRO CB data to disallow (flush) running the udp gro receive code twice on a packet. This solves the problem of udp encapsulated packets whose inner VM packet is udp and happen to carry a port which has registered offloads. Signed-off-by: Shlomo Pongratz <shlomop@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-21ipv6: protect protocols not handling ipv4 from v4 connection/bind attemptsHannes Frederic Sowa
Some ipv6 protocols cannot handle ipv4 addresses, so we must not allow connecting and binding to them. sendmsg logic does already check msg->name for this but must trust already connected sockets which could be set up for connection to ipv4 address family. Per-socket flag ipv6only is of no use here, as it is under users control by setsockopt. Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-21tcp: delete redundant calls of tcp_mtup_init()Peter Pan(潘卫平)
As tcp_rcv_state_process() has already calls tcp_mtup_init() for non-fastopen sock, we can delete the redundant calls of tcp_mtup_init() in tcp_{v4,v6}_syn_recv_sock(). Signed-off-by: Weiping Pan <panweiping3@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-21ipv4: remove the useless argument from ip_tunnel_hash()Duan Jiong
Since commit c544193214("GRE: Refactor GRE tunneling code") introduced function ip_tunnel_hash(), the argument itn is no longer in use, so remove it. Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-19ipv6: make IPV6_RECVPKTINFO work for ipv4 datagramsHannes Frederic Sowa
We currently don't report IPV6_RECVPKTINFO in cmsg access ancillary data for IPv4 datagrams on IPv6 sockets. This patch splits the ip6_datagram_recv_ctl into two functions, one which handles both protocol families, AF_INET and AF_INET6, while the ip6_datagram_recv_specific_ctl only handles IPv6 cmsg data. ip6_datagram_recv_*_ctl never reported back any errors, so we can make them return void. Also provide a helper for protocols which don't offer dual personality to further use ip6_datagram_recv_ctl, which is exported to modules. I needed to shuffle the code for ping around a bit to make it easier to implement dual personality for ping ipv6 sockets in future. Reported-by: Gert Doering <gert@space.net> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-18ipv4: be friend with drop monitorEric Dumazet
Replace some dev_kfree_skb() with kfree_skb() calls when we drop one skb, this might help bug tracking. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-18net: add build-time checks for msg->msg_name sizeSteffen Hurrle
This is a follow-up patch to f3d3342602f8bc ("net: rework recvmsg handler msg_name and msg_namelen logic"). DECLARE_SOCKADDR validates that the structure we use for writing the name information to is not larger than the buffer which is reserved for msg->msg_name (which is 128 bytes). Also use DECLARE_SOCKADDR consistently in sendmsg code paths. Signed-off-by: Steffen Hurrle <steffen@hurrle.net> Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-18Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c net/ipv4/tcp_metrics.c Overlapping changes between the "don't create two tcp metrics objects with the same key" race fix in net and the addition of the destination address in the lookup key in net-next. Minor overlapping changes in bnx2x driver. Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-17ipv4: fix a dst leak in tunnelsEric Dumazet
This patch : 1) Remove a dst leak if DST_NOCACHE was set on dst Fix this by holding a reference only if dst really cached. 2) Remove a lockdep warning in __tunnel_dst_set() This was reported by Cong Wang. 3) Remove usage of a spinlock where xchg() is enough 4) Remove some spurious inline keywords. Let compiler decide for us. Fixes: 7d442fab0a67 ("ipv4: Cache dst in tunnels") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Cong Wang <cwang@twopensource.com> Cc: Tom Herbert <therbert@google.com> Cc: Maciej Żenczykowski <maze@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-17tcp: metrics: Avoid duplicate entries with the same destination-IPChristoph Paasch
Because the tcp-metrics is an RCU-list, it may be that two soft-interrupts are inside __tcp_get_metrics() for the same destination-IP at the same time. If this destination-IP is not yet part of the tcp-metrics, both soft-interrupts will end up in tcpm_new and create a new entry for this IP. So, we will have two tcp-metrics with the same destination-IP in the list. This patch checks twice __tcp_get_metrics(). First without holding the lock, then while holding the lock. The second one is there to confirm that the entry has not been added by another soft-irq while waiting for the spin-lock. Fixes: 51c5d0c4b169b (tcp: Maintain dynamic metrics in local cache.) Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-17ipv6: tcp: fix flowlabel value in ACK messages send from TIME_WAITFlorent Fourcot
This patch is following the commit b903d324bee262 (ipv6: tcp: fix TCLASS value in ACK messages sent from TIME_WAIT). For the same reason than tclass, we have to store the flow label in the inet_timewait_sock to provide consistency of flow label on the last ACK. Signed-off-by: Florent Fourcot <florent.fourcot@enst-bretagne.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-16net/ipv4: don't use module_init in non-modular gre_offloadPaul Gortmaker
Recent commit 438e38fadca2f6e57eeecc08326c8a95758594d4 ("gre_offload: statically build GRE offloading support") added new module_init/module_exit calls to the gre_offload.c file. The file is obj-y and can't be anything other than built-in. Currently it can never be built modular, so using module_init as an alias for __initcall can be somewhat misleading. Fix this up now, so that we can relocate module_init from init.h into module.h in the future. If we don't do this, we'd have to add module.h to obviously non-modular code, and that would be a worse thing. We also make the inclusion explicit. Note that direct use of __initcall is discouraged, vs. one of the priority categorized subgroups. As __initcall gets mapped onto device_initcall, our use of device_initcall directly in this change means that the runtime impact is zero -- it will remain at level 6 in initcall ordering. As for the module_exit, rather than replace it with __exitcall, we simply remove it, since it appears only UML does anything with those, and even for UML, there is no relevant cleanup to be done here. Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-14tcp: do not export tcp_gso_segment() and tcp_gro_receive()Eric Dumazet
tcp_gso_segment() and tcp_gro_receive() no longer need to be exported. IPv4 and IPv6 offloads are statically linked. Note that tcp_gro_complete() is still used by bnx2x, unfortunately. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-14net: avoid reference counter overflows on fib_rules in multicast forwardingHannes Frederic Sowa
Bob Falken reported that after 4G packets, multicast forwarding stopped working. This was because of a rule reference counter overflow which freed the rule as soon as the overflow happend. This patch solves this by adding the FIB_LOOKUP_NOREF flag to fib_rules_lookup calls. This is safe even from non-rcu locked sections as in this case the flag only implies not taking a reference to the rule, which we don't need at all. Rules only hold references to the namespace, which are guaranteed to be available during the call of the non-rcu protected function reg_vif_xmit because of the interface reference which itself holds a reference to the net namespace. Fixes: f0ad0860d01e47 ("ipv4: ipmr: support multiple tables") Fixes: d1db275dd3f6e4 ("ipv6: ip6mr: support multiple tables") Reported-by: Bob Falken <NetFestivalHaveFun@gmx.com> Cc: Patrick McHardy <kaber@trash.net> Cc: Thomas Graf <tgraf@suug.ch> Cc: Julian Anastasov <ja@ssi.bg> Cc: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-14net: replace macros net_random and net_srandom with direct calls to prandomAruna-Hewapathirane
This patch removes the net_random and net_srandom macros and replaces them with direct calls to the prandom ones. As new commits only seem to use prandom_u32 there is no use to keep them around. This change makes it easier to grep for users of prandom_u32. Signed-off-by: Aruna-Hewapathirane <aruna.hewapathirane@gmail.com> Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-14ipv4: register igmp_notifier even when !CONFIG_PROC_FSWANG Cong
We still need this notifier even when we don't config PROC_FS. It should be rare to have a kernel without PROC_FS, so just for completeness. Cc: Stephen Hemminger <stephen@networkplumber.org> Cc: David S. Miller <davem@davemloft.net> Cc: Patrick McHardy <kaber@trash.net> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-14Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
2014-01-13inet_diag: fix inet_diag_dump_icsk() to use correct state for timewait socketsNeal Cardwell
Fix inet_diag_dump_icsk() to reflect the fact that both TCP_TIME_WAIT and TCP_FIN_WAIT2 connections are represented by inet_timewait_sock (not just TIME_WAIT), and for such sockets the tw_substate field holds the real state, which can be either TCP_TIME_WAIT or TCP_FIN_WAIT2. This brings the inet_diag state-matching code in line with the field it uses to populate idiag_state. This is also analogous to the info exported in /proc/net/tcp, where get_tcp4_sock() exports sk->sk_state and get_timewait4_sock() exports tw->tw_substate. Before fixing this, (a) neither "ss -nemoi" nor "ss -nemoi state fin-wait-2" would return a socket in TCP_FIN_WAIT2; and (b) "ss -nemoi state time-wait" would also return sockets in state TCP_FIN_WAIT2. This is an old bug that predates 05dbc7b ("tcp/dccp: remove twchain"). Signed-off-by: Neal Cardwell <ncardwell@google.com> Cc: Eric Dumazet <edumazet@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-13gre_offload: simplify GRE header length calculation in gre_gso_segment()Neal Cardwell
Simplify the GRE header length calculation in gre_gso_segment(). Switch to an approach that is simpler, faster, and more general. The new approach will continue to be correct even if we add support for the optional variable-length routing info that may be present in a GRE header. Signed-off-by: Neal Cardwell <ncardwell@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: H.K. Jerry Chu <hkchu@google.com> Cc: Pravin B Shelar <pshelar@nicira.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-13gre_offload: fix sparse non static symbol warningWei Yongjun
Fixes the following sparse warning: net/ipv4/gre_offload.c:253:5: warning: symbol 'gre_gro_complete' was not declared. Should it be static? Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-13ipv4: introduce hardened ip_no_pmtu_disc modeHannes Frederic Sowa
This new ip_no_pmtu_disc mode only allowes fragmentation-needed errors to be honored by protocols which do more stringent validation on the ICMP's packet payload. This knob is useful for people who e.g. want to run an unmodified DNS server in a namespace where they need to use pmtu for TCP connections (as they are used for zone transfers or fallback for requests) but don't want to use possibly spoofed UDP pmtu information. Currently the whitelisted protocols are TCP, SCTP and DCCP as they check if the returned packet is in the window or if the association is valid. Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David Miller <davem@davemloft.net> Cc: John Heffner <johnwheffner@gmail.com> Suggested-by: Florian Weimer <fweimer@redhat.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-13ipv4: introduce ip_dst_mtu_maybe_forward and protect forwarding path against ↵Hannes Frederic Sowa
pmtu spoofing While forwarding we should not use the protocol path mtu to calculate the mtu for a forwarded packet but instead use the interface mtu. We mark forwarded skbs in ip_forward with IPSKB_FORWARDED, which was introduced for multicast forwarding. But as it does not conflict with our usage in unicast code path it is perfect for reuse. I moved the functions ip_sk_accept_pmtu, ip_sk_use_pmtu and ip_skb_dst_mtu along with the new ip_dst_mtu_maybe_forward to net/ip.h to fix circular dependencies because of IPSKB_FORWARDED. Because someone might have written a software which does probe destinations manually and expects the kernel to honour those path mtus I introduced a new per-namespace "ip_forward_use_pmtu" knob so someone can disable this new behaviour. We also still use mtus which are locked on a route for forwarding. The reason for this change is, that path mtus information can be injected into the kernel via e.g. icmp_err protocol handler without verification of local sockets. As such, this could cause the IPv4 forwarding path to wrongfully emit fragmentation needed notifications or start to fragment packets along a path. Tunnel and ipsec output paths clear IPCB again, thus IPSKB_FORWARDED won't be set and further fragmentation logic will use the path mtu to determine the fragmentation size. They also recheck packet size with help of path mtu discovery and report appropriate errors. Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David Miller <davem@davemloft.net> Cc: John Heffner <johnwheffner@gmail.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-10tcp: metrics: Allow selective get/del of tcp-metrics based on src IPChristoph Paasch
We want to be able to get/del tcp-metrics based on the src IP. This patch adds the necessary parsing of the netlink attribute and if the source address is set, it will match on this one too. Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-10tcp: metrics: Delete all entries matching a certain destinationChristoph Paasch
As we now can have multiple entries per destination-IP, the "ip tcp_metrics delete address ADDRESS" command deletes all of them. Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-10tcp: metrics: New netlink attribute for src IP and dumped in netlink replyChristoph Paasch
This patch adds a new netlink attribute for the source-IP and appends it to the netlink reply. Now, iproute2 can have access to the source-IP. Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-10tcp: metrics: Add source-address to tcp-metricsChristoph Paasch
We add the source-address to the tcp-metrics, so that different metrics will be used per source/destination-pair. We use the destination-hash to store the metric inside the hash-table. That way, deleting and dumping via "ip tcp_metrics" is easy. Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-10tcp: metrics: rename tcpm_addr to tcpm_daddrChristoph Paasch
As we will add also the source-address, we rename all accesses to the tcp-metrics address to use "daddr". Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-09Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nftables Pablo Neira Ayuso says: ==================== nf_tables updates for net-next The following patchset contains the following nf_tables updates, mostly updates from Patrick McHardy, they are: * Add the "inet" table and filter chain type for this new netfilter family: NFPROTO_INET. This special table/chain allows IPv4 and IPv6 rules, this should help to simplify the burden in the administration of dual stack firewalls. This also includes several patches to prepare the infrastructure for this new table and a new meta extension to match the layer 3 and 4 protocol numbers, from Patrick McHardy. * Load both IPv4 and IPv6 conntrack modules in nft_ct if the rule is used in NFPROTO_INET, as we don't certainly know which one would be used, also from Patrick McHardy. * Do not allow to delete a table that contains sets, otherwise these sets become orphan, from Patrick McHardy. * Hold a reference to the corresponding nf_tables family module when creating a table of that family type, to avoid the module deletion when in use, from Patrick McHardy. * Update chain counters before setting the chain policy to ensure that we don't leave the chain in inconsistent state in case of errors (aka. restore chain atomicity). This also fixes a possible leak if it fails to allocate the chain counters if no counters are passed to be restored, from Patrick McHardy. * Don't check for overflows in the table counter if we are just renaming a chain, from Patrick McHardy. * Replay the netlink request after dropping the nfnl lock to load the module that supports provides a chain type, from Patrick. * Fix chain type module references, from Patrick. * Several cleanups, function renames, constification and code refactorizations also from Patrick McHardy. * Add support to set the connmark, this can be used to set it based on the meta mark (similar feature to -j CONNMARK --restore), from Kristian Evensen. * A couple of fixes to the recently added meta/set support and nft_reject, and fix missing chain type unregistration if we fail to register our the family table/filter chain type, from myself. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-09netfilter: nf_tables: fix error path in the init functionsPablo Neira Ayuso
We have to unregister chain type if this fails to register netns. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-01-09ip_tunnel: fix sparse non static symbol warningWei Yongjun
Fixes the following sparse warning: net/ipv4/ip_tunnel.c:116:18: warning: symbol 'tunnel_dst_check' was not declared. Should it be static? Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-09netfilter: nf_tables: rename nft_do_chain_pktinfo() to nft_do_chain()Patrick McHardy
We don't encode argument types into function names and since besides nft_do_chain() there are only AF-specific versions, there is no risk of confusion. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-01-09netfilter: nf_tables: minor nf_chain_type cleanupsPatrick McHardy
Minor nf_chain_type cleanups: - reorder struct to plug a hoe - rename struct module member to "owner" for consistency - rename nf_hookfn array to "hooks" for consistency - reorder initializers for better readability Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-01-09netfilter: nf_tables: constify chain type definitions and pointersPatrick McHardy
Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-01-09netfilter: nf_tables: add missing module references to chain typesPatrick McHardy
In some cases we neither take a reference to the AF info nor to the chain type, allowing the module to be unloaded while in use. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-01-07netfilter: nf_tables: add "inet" table for IPv4/IPv6Patrick McHardy
This patch adds a new table family and a new filter chain that you can use to attach IPv4 and IPv6 rules. This should help to simplify rule-set maintainance in dual-stack setups. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-01-07netfilter: nf_tables: add support for multi family tablesPatrick McHardy
Add support to register chains to multiple hooks for different address families for mixed IPv4/IPv6 tables. Signed-off-by: Patrick McHardy <kaber@trash.net>
2014-01-07netfilter: nf_tables: make chain types override the default AF functionsPatrick McHardy
Currently the AF-specific hook functions override the chain-type specific hook functions. That doesn't make too much sense since the chain types are a special case of the AF-specific hooks. Make the AF-specific hook functions the default and make the optional chain type hooks override them. As a side effect, the necessary code restructuring reduces the code size, f.i. in case of nf_tables_ipv4.o: nf_tables_ipv4_init_net | -24 nft_do_chain_ipv4 | -113 2 functions changed, 137 bytes removed, diff: -137 Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-01-07net-gre-gro: Add GRE support to the GRO stackJerry Chu
This patch built on top of Commit 299603e8370a93dd5d8e8d800f0dff1ce2c53d36 ("net-gro: Prepare GRO stack for the upcoming tunneling support") to add the support of the standard GRE (RFC1701/RFC2784/RFC2890) to the GRO stack. It also serves as an example for supporting other encapsulation protocols in the GRO stack in the future. The patch supports version 0 and all the flags (key, csum, seq#) but will flush any pkt with the S (seq#) flag. This is because the S flag is not support by GSO, and a GRO pkt may end up in the forwarding path, thus requiring GSO support to break it up correctly. Currently the "packet_offload" structure only contains L3 (ETH_P_IP/ ETH_P_IPV6) GRO offload support so the encapped pkts are limited to IP pkts (i.e., w/o L2 hdr). But support for other protocol type can be easily added, so is the support for GRE variations like NVGRE. The patch also support csum offload. Specifically if the csum flag is on and the h/w is capable of checksumming the payload (CHECKSUM_COMPLETE), the code will take advantage of the csum computed by the h/w when validating the GRE csum. Note that commit 60769a5dcd8755715c7143b4571d5c44f01796f1 "ipv4: gre: add GRO capability" already introduces GRO capability to IPv4 GRE tunnels, using the gro_cells infrastructure. But GRO is done after GRE hdr has been removed (i.e., decapped). The following patch applies GRO when pkts first come in (before hitting the GRE tunnel code). There is some performance advantage for applying GRO as early as possible. Also this approach is transparent to other subsystem like Open vSwitch where GRE decap is handled outside of the IP stack hence making it harder for the gro_cells stuff to apply. On the other hand, some NICs are still not capable of hashing on the inner hdr of a GRE pkt (RSS). In that case the GRO processing of pkts from the same remote host will all happen on the same CPU and the performance may be suboptimal. I'm including some rough preliminary performance numbers below. Note that the performance will be highly dependent on traffic load, mix as usual. Moreover it also depends on NIC offload features hence the following is by no means a comprehesive study. Local testing and tuning will be needed to decide the best setting. All tests spawned 50 copies of netperf TCP_STREAM and ran for 30 secs. (super_netperf 50 -H 192.168.1.18 -l 30) An IP GRE tunnel with only the key flag on (e.g., ip tunnel add gre1 mode gre local 10.246.17.18 remote 10.246.17.17 ttl 255 key 123) is configured. The GRO support for pkts AFTER decap are controlled through the device feature of the GRE device (e.g., ethtool -K gre1 gro on/off). 1.1 ethtool -K gre1 gro off; ethtool -K eth0 gro off thruput: 9.16Gbps CPU utilization: 19% 1.2 ethtool -K gre1 gro on; ethtool -K eth0 gro off thruput: 5.9Gbps CPU utilization: 15% 1.3 ethtool -K gre1 gro off; ethtool -K eth0 gro on thruput: 9.26Gbps CPU utilization: 12-13% 1.4 ethtool -K gre1 gro on; ethtool -K eth0 gro on thruput: 9.26Gbps CPU utilization: 10% The following tests were performed on a different NIC that is capable of csum offload. I.e., the h/w is capable of computing IP payload csum (CHECKSUM_COMPLETE). 2.1 ethtool -K gre1 gro on (hence will use gro_cells) 2.1.1 ethtool -K eth0 gro off; csum offload disabled thruput: 8.53Gbps CPU utilization: 9% 2.1.2 ethtool -K eth0 gro off; csum offload enabled thruput: 8.97Gbps CPU utilization: 7-8% 2.1.3 ethtool -K eth0 gro on; csum offload disabled thruput: 8.83Gbps CPU utilization: 5-6% 2.1.4 ethtool -K eth0 gro on; csum offload enabled thruput: 8.98Gbps CPU utilization: 5% 2.2 ethtool -K gre1 gro off 2.2.1 ethtool -K eth0 gro off; csum offload disabled thruput: 5.93Gbps CPU utilization: 9% 2.2.2 ethtool -K eth0 gro off; csum offload enabled thruput: 5.62Gbps CPU utilization: 8% 2.2.3 ethtool -K eth0 gro on; csum offload disabled thruput: 7.69Gbps CPU utilization: 8% 2.2.4 ethtool -K eth0 gro on; csum offload enabled thruput: 8.96Gbps CPU utilization: 5-6% Signed-off-by: H.K. Jerry Chu <hkchu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-07ipv4: loopback device: ignore value changes after device is uppedJiri Pirko
When lo is brought up, new ifa is created. Then, devconf and neigh values bitfield should be set so later changes of default values would not affect lo values. Note that the same behaviour is in ipv6. Also note that this is likely not an issue in many distros (for example Fedora 19) because userspace sets address to lo manually before bringing it up. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-06gre_offload: statically build GRE offloading supportEric Dumazet
GRO/GSO layers can be enabled on a node, even if said node is only forwarding packets. This patch permits GSO (and upcoming GRO) support for GRE encapsulated packets, even if the host has no GRE tunnel setup. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: H.K. Jerry Chu <hkchu@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-06Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_pf.c net/ipv6/ip6_tunnel.c net/ipv6/ip6_vti.c ipv6 tunnel statistic bug fixes conflicting with consolidation into generic sw per-cpu net stats. qlogic conflict between queue counting bug fix and the addition of multiple MAC address support. Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-06tcp: out_of_order_queue do not use its lockEric Dumazet
TCP out_of_order_queue lock is not used, as queue manipulation happens with socket lock held and we therefore use the lockless skb queue routines (as __skb_queue_head()) We can use __skb_queue_head_init() instead of skb_queue_head_init() to make this more consistent. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-06Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nftables Pablo Neira Ayuso says: <pablo@netfilter.org> ==================== nftables updates for net-next The following patchset contains nftables updates for your net-next tree, they are: * Add set operation to the meta expression by means of the select_ops() infrastructure, this allows us to set the packet mark among other things. From Arturo Borrero Gonzalez. * Fix wrong format in sscanf in nf_tables_set_alloc_name(), from Daniel Borkmann. * Add new queue expression to nf_tables. These comes with two previous patches to prepare this new feature, one to add mask in nf_tables_core to evaluate the queue verdict appropriately and another to refactor common code with xt_NFQUEUE, from Eric Leblond. * Do not hide nftables from Kconfig if nfnetlink is not enabled, also from Eric Leblond. * Add the reject expression to nf_tables, this adds the missing TCP RST support. It comes with an initial patch to refactor common code with xt_NFQUEUE, again from Eric Leblond. * Remove an unused variable assignment in nf_tables_dump_set(), from Michal Nazarewicz. * Remove the nft_meta_target code, now that Arturo added the set operation to the meta expression, from me. * Add help information for nf_tables to Kconfig, also from me. * Allow to dump all sets by specifying NFPROTO_UNSPEC, similar feature is available to other nf_tables objects, requested by Arturo, from me. * Expose the table usage counter, so we can know how many chains are using this table without dumping the list of chains, from Tomasz Bursztyka. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== netfilter/IPVS updates for net-next The following patchset contains Netfilter updates for your net-next tree, they are: * Add full port randomization support. Some crazy researchers found a way to reconstruct the secure ephemeral ports that are allocated in random mode by sending off-path bursts of UDP packets to overrun the socket buffer of the DNS resolver to trigger retransmissions, then if the timing for the DNS resolution done by a client is larger than usual, then they conclude that the port that received the burst of UDP packets is the one that was opened. It seems a bit aggressive method to me but it seems to work for them. As a result, Daniel Borkmann and Hannes Frederic Sowa came up with a new NAT mode to fully randomize ports using prandom. * Add a new classifier to x_tables based on the socket net_cls set via cgroups. These includes two patches to prepare the field as requested by Zefan Li. Also from Daniel Borkmann. * Use prandom instead of get_random_bytes in several locations of the netfilter code, from Florian Westphal. * Allow to use the CTA_MARK_MASK in ctnetlink when mangling the conntrack mark, also from Florian Westphal. * Fix compilation warning due to unused variable in IPVS, from Geert Uytterhoeven. * Add support for UID/GID via nfnetlink_queue, from Valentina Giusti. * Add IPComp extension to x_tables, from Fan Du. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-04net: unify the pcpu_tstats and br_cpu_netstats as oneLi RongQing
They are same, so unify them as one, pcpu_sw_netstats. Define pcpu_sw_netstat in netdevice.h, remove pcpu_tstats from if_tunnel and remove br_cpu_netstats from br_private.h Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Li RongQing <roy.qing.li@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-03ipv4: Use percpu Cache route in IP tunnelsTom Herbert
percpu route cache eliminates share of dst refcnt between CPUs. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-03ipv4: Cache dst in tunnelsTom Herbert
Avoid doing a route lookup on every packet being tunneled. In ip_tunnel.c cache the route returned from ip_route_output if the tunnel is "connected" so that all the rouitng parameters are taken from tunnel parms for a packet. Specifically, not NBMA tunnel and tos is from tunnel parms (not inner packet). Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-03netfilter: nf_conntrack: remove dead codestephen hemminger
The following code is not used in current upstream code. Some of this seems to be old hooks, other might be used by some out of tree module (which I don't care about breaking), and the need_ipv4_conntrack was used by old NAT code but no longer called. Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-01-02ipv4: fix tunneled VM traffic over hw VXLAN/GRE GSO NICWei-Chun Chao
VM to VM GSO traffic is broken if it goes through VXLAN or GRE tunnel and the physical NIC on the host supports hardware VXLAN/GRE GSO offload (e.g. bnx2x and next-gen mlx4). Two issues - (VXLAN) VM traffic has SKB_GSO_DODGY and SKB_GSO_UDP_TUNNEL with SKB_GSO_TCP/UDP set depending on the inner protocol. GSO header integrity check fails in udp4_ufo_fragment if inner protocol is TCP. Also gso_segs is calculated incorrectly using skb->len that includes tunnel header. Fix: robust check should only be applied to the inner packet. (VXLAN & GRE) Once GSO header integrity check passes, NULL segs is returned and the original skb is sent to hardware. However the tunnel header is already pulled. Fix: tunnel header needs to be restored so that hardware can perform GSO properly on the original packet. Signed-off-by: Wei-Chun Chao <weichunc@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-02ipv4: switch and case should be at the same indentWeilong Chen
Signed-off-by: Weilong Chen <chenweilong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>