summaryrefslogtreecommitdiff
path: root/net/ipv4/route.c
AgeCommit message (Collapse)AuthorFilesLines
2014-08-14ip: make IP identifiers less predictableEric Dumazet1-3/+33
[ Upstream commit 04ca6973f7c1a0d8537f2d9906a0cf8e69886d75 ] In "Counting Packets Sent Between Arbitrary Internet Hosts", Jeffrey and Jedidiah describe ways exploiting linux IP identifier generation to infer whether two machines are exchanging packets. With commit 73f156a6e8c1 ("inetpeer: get rid of ip_id_count"), we changed IP id generation, but this does not really prevent this side-channel technique. This patch adds a random amount of perturbation so that IP identifiers for a given destination [1] are no longer monotonically increasing after an idle period. Note that prandom_u32_max(1) returns 0, so if generator is used at most once per jiffy, this patch inserts no hole in the ID suite and do not increase collision probability. This is jiffies based, so in the worst case (HZ=1000), the id can rollover after ~65 seconds of idle time, which should be fine. We also change the hash used in __ip_select_ident() to not only hash on daddr, but also saddr and protocol, so that ICMP probes can not be used to infer information for other protocols. For IPv6, adds saddr into the hash as well, but not nexthdr. If I ping the patched target, we can see ID are now hard to predict. 21:57:11.008086 IP (...) A > target: ICMP echo request, seq 1, length 64 21:57:11.010752 IP (... id 2081 ...) target > A: ICMP echo reply, seq 1, length 64 21:57:12.013133 IP (...) A > target: ICMP echo request, seq 2, length 64 21:57:12.015737 IP (... id 3039 ...) target > A: ICMP echo reply, seq 2, length 64 21:57:13.016580 IP (...) A > target: ICMP echo request, seq 3, length 64 21:57:13.019251 IP (... id 3437 ...) target > A: ICMP echo reply, seq 3, length 64 [1] TCP sessions uses a per flow ID generator not changed by this patch. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Jeffrey Knockel <jeffk@cs.unm.edu> Reported-by: Jedidiah R. Crandall <crandall@cs.unm.edu> Cc: Willy Tarreau <w@1wt.eu> Cc: Hannes Frederic Sowa <hannes@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-08-14inetpeer: get rid of ip_id_countEric Dumazet1-28/+19
[ Upstream commit 73f156a6e8c1074ac6327e0abd1169e95eb66463 ] Ideally, we would need to generate IP ID using a per destination IP generator. linux kernels used inet_peer cache for this purpose, but this had a huge cost on servers disabling MTU discovery. 1) each inet_peer struct consumes 192 bytes 2) inetpeer cache uses a binary tree of inet_peer structs, with a nominal size of ~66000 elements under load. 3) lookups in this tree are hitting a lot of cache lines, as tree depth is about 20. 4) If server deals with many tcp flows, we have a high probability of not finding the inet_peer, allocating a fresh one, inserting it in the tree with same initial ip_id_count, (cf secure_ip_id()) 5) We garbage collect inet_peer aggressively. IP ID generation do not have to be 'perfect' Goal is trying to avoid duplicates in a short period of time, so that reassembly units have a chance to complete reassembly of fragments belonging to one message before receiving other fragments with a recycled ID. We simply use an array of generators, and a Jenkin hash using the dst IP as a key. ipv6_select_ident() is put back into net/ipv6/ip6_output.c where it belongs (it is only used from this file) secure_ip_id() and secure_ipv6_id() no longer are needed. Rename ip_select_ident_more() to ip_select_ident_segs() to avoid unnecessary decrement/increment of the number of segments. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-28ipv4: irq safe sk_dst_[re]set() and ipv4_sk_update_pmtu() fixEric Dumazet1-7/+8
[ Upstream commit 7f502361531e9eecb396cf99bdc9e9a59f7ebd7f ] We have two different ways to handle changes to sk->sk_dst First way (used by TCP) assumes socket lock is owned by caller, and use no extra lock : __sk_dst_set() & __sk_dst_reset() Another way (used by UDP) uses sk_dst_lock because socket lock is not always taken. Note that sk_dst_lock is not softirq safe. These ways are not inter changeable for a given socket type. ipv4_sk_update_pmtu(), added in linux-3.8, added a race, as it used the socket lock as synchronization, but users might be UDP sockets. Instead of converting sk_dst_lock to a softirq safe version, use xchg() as we did for sk_rx_dst in commit e47eb5dfb296b ("udp: ipv4: do not use sk_dst_lock from softirq context") In a follow up patch, we probably can remove sk_dst_lock, as it is only used in IPv6. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Fixes: 9cb3a50c5f63e ("ipv4: Invalidate the socket cached route on pmtu events if possible") Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-30ipv4: initialise the itag variable in __mkroute_inputLi RongQing1-1/+1
[ Upstream commit fbdc0ad095c0a299e9abf5d8ac8f58374951149a ] the value of itag is a random value from stack, and may not be initiated by fib_validate_source, which called fib_combine_itag if CONFIG_IP_ROUTE_CLASSID is not set This will make the cached dst uncertainty Signed-off-by: Li RongQing <roy.qing.li@gmail.com> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-30ipv4: return valid RTA_IIF on ip route getJulian Anastasov1-1/+1
[ Upstream commit 91146153da2feab18efab2e13b0945b6bb704ded ] Extend commit 13378cad02afc2adc6c0e07fca03903c7ada0b37 ("ipv4: Change rt->rt_iif encoding.") from 3.6 to return valid RTA_IIF on 'ip route get ... iif DEVICE' instead of rt_iif 0 which is displayed as 'iif *'. inet_iif is not appropriate to use because skb_iif is not set. Use the skb->dev->ifindex instead. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-03-06ipv4: fix counter in_slow_totDuan Jiong1-2/+2
[ Upstream commit a6254864c08109c66a194612585afc0439005286 ] since commit 89aef8921bf("ipv4: Delete routing cache."), the counter in_slow_tot can't work correctly. The counter in_slow_tot increase by one when fib_lookup() return successfully in ip_route_input_slow(), but actually the dst struct maybe not be created and cached, so we can increase in_slow_tot after the dst struct is created. Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-08ipv4: fix race in concurrent ip_route_input_slow()Alexei Starovoitov1-2/+6
[ Upstream commit dcdfdf56b4a6c9437fc37dbc9cee94a788f9b0c4 ] CPUs can ask for local route via ip_route_input_noref() concurrently. if nh_rth_input is not cached yet, CPUs will proceed to allocate equivalent DSTs on 'lo' and then will try to cache them in nh_rth_input via rt_cache_route() Most of the time they succeed, but on occasion the following two lines: orig = *p; prev = cmpxchg(p, orig, rt); in rt_cache_route() do race and one of the cpus fails to complete cmpxchg. But ip_route_input_slow() doesn't check the return code of rt_cache_route(), so dst is leaking. dst_destroy() is never called and 'lo' device refcnt doesn't go to zero, which can be seen in the logs as: unregister_netdevice: waiting for lo to become free. Usage count = 1 Adding mdelay() between above two lines makes it easily reproducible. Fix it similar to nh_pcpu_rth_output case. Fixes: d2d68ba9fe8b ("ipv4: Cache input routes in fib_info nexthops.") Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-11-04ipv4: fix ineffective source address selectionJiri Benc1-1/+1
[ Upstream commit 0a7e22609067ff524fc7bbd45c6951dd08561667 ] When sending out multicast messages, the source address in inet->mc_addr is ignored and rewritten by an autoselected one. This is caused by a typo in commit 813b3b5db831 ("ipv4: Use caller's on-stack flowi as-is in output route lookups"). Signed-off-by: Jiri Benc <jbenc@redhat.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-05-27ipv4: fix redirect handling for TCP packetsMichal Kubecek1-1/+6
Unlike ipv4_redirect() and ipv4_sk_redirect(), ip_do_redirect() doesn't call __build_flow_key() directly but via ip_rt_build_flow_key() wrapper. This leads to __build_flow_key() getting pointer to IPv4 header of the ICMP redirect packet rather than pointer to the embedded IPv4 header of the packet initiating the redirect. As a result, handling of ICMP redirects initiated by TCP packets is broken. Issue was introduced by 4895c771c ("ipv4: Add FIB nexthop exceptions.") Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-22rtnetlink: Remove passing of attributes into rtnl_doit functionsThomas Graf1-1/+1
With decnet converted, we can finally get rid of rta_buf and its computations around it. It also gets rid of the minimal header length verification since all message handlers do that explicitly anyway. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-19net: ipv4: fix waring -Wunused-variableGao feng1-4/+5
the vars ip_rt_gc_timeout is used only when CONFIG_SYSCTL is selected. move these vars into CONFIG_SYSCTL. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-18net: proc: change proc_net_fops_create to proc_createGao feng1-2/+2
Right now, some modules such as bonding use proc_create to create proc entries under /proc/net/, and other modules such as ipv4 use proc_net_fops_create. It looks a little chaos.this patch changes all of proc_net_fops_create to proc_create. we can remove proc_net_fops_create after this patch. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-22ipv4: Fix route refcount on pmtu discoverySteffen Klassert1-2/+9
git commit 9cb3a50c (ipv4: Invalidate the socket cached route on pmtu events if possible) introduced a refcount problem. We don't get a refcount on the route if we get it from__sk_dst_get(), but we need one if we want to reuse this route because __sk_dst_set() releases the refcount of the old route. This patch adds proper refcount handling for that case. We introduce a 'new' flag to indicate that we are going to use a new route and we release the old route only if we replace it by a new one. Reported-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-21ipv4: Invalidate the socket cached route on pmtu events if possibleSteffen Klassert1-1/+41
The route lookup in ipv4_sk_update_pmtu() might return a route different from the route we cached at the socket. This is because standart routes are per cpu, so each cpu has it's own struct rtable. This means that we do not invalidate the socket cached route if the NET_RX_SOFTIRQ is not served by the same cpu that the sending socket uses. As a result, the cached route reused until we disconnect. With this patch we invalidate the socket cached route if possible. If the socket is owened by the user, we can't update the cached route directly. A followup patch will implement socket release callback functions for datagram sockets to handle this case. Reported-by: Yurij M. Plotnikov <Yurij.Plotnikov@oktetlabs.ru> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-17ipv4: Don't update the pmtu on mtu locked routesSteffen Klassert1-0/+3
Routes with locked mtu should not use learned pmtu informations, so do not update the pmtu on these routes. Reported-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-17ipv4: Remove output route check in ipv4_mtuSteffen Klassert1-1/+1
The output route check was introduced with git commit 261663b0 (ipv4: Don't use the cached pmtu informations for input routes) during times when we cached the pmtu informations on the inetpeer. Now the pmtu informations are back in the routes, so this check is obsolete. It also had some unwanted side effects, as reported by Timo Teras and Lukas Tribus. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Acked-by: Timo Teräs <timo.teras@iki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-07ipv4/route/rtnl: get mcast attributes when dst is multicastNicolas Dichtel1-2/+21
Commit f1ce3062c538 (ipv4: Remove 'rt_dst' from 'struct rtable') removes the call to ipmr_get_route(), which will get multicast parameters of the route. I revert the part of the patch that remove this call. I think the goal was only to get rid of rt_dst field. The patch is only compiled-tested. My first idea was to remove ipmr_get_route() because rt_fill_info() was the only user, but it seems the previous patch cleans the code a bit too much ;-) Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-25Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-3/+6
Conflicts: drivers/net/wireless/iwlwifi/pcie/tx.c Minor iwlwifi conflict in TX queue disabling between 'net', which removed a bogus warning, and 'net-next' which added some status register poking code. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-22ipv4: do not cache looped multicastsJulian Anastasov1-2/+5
Starting from 3.6 we cache output routes for multicasts only when using route to 224/4. For local receivers we can set RTCF_LOCAL flag depending on the membership but in such case we use maddr and saddr which are not caching keys as before. Additionally, we can not use same place to cache routes that differ in RTCF_LOCAL flag value. Fix it by caching only RTCF_MULTICAST entries without RTCF_LOCAL (send-only, no loopback). As a side effect, we avoid unneeded lookup for fnhe when not caching because multicasts are not redirected and they do not learn PMTU. Thanks to Maxime Bizon for showing the caching problems in __mkroute_output for 3.6 kernels: different RTCF_LOCAL flag in cache can lead to wrong ip_mc_output or ip_output call and the visible problem is that traffic can not reach local receivers via loopback. Reported-by: Maxime Bizon <mbizon@freebox.fr> Tested-by: Maxime Bizon <mbizon@freebox.fr> Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-18net: Don't export sysctls to unprivileged usersEric W. Biederman1-0/+4
In preparation for supporting the creation of network namespaces by unprivileged users, modify all of the per net sysctl exports and refuse to allow them to unprivileged users. This makes it safe for unprivileged users in general to access per net sysctls, and allows sysctls to be exported to unprivileged users on an individual basis as they are deemed safe. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-13xfrm: Fix the gc threshold value for ipv4Steffen Klassert1-1/+1
The xfrm gc threshold value depends on ip_rt_max_size. This value was set to INT_MAX with the routing cache removal patch, so we start doing garbage collecting when we have INT_MAX/2 IPsec routes cached. Fix this by going back to the static threshold of 1024 routes. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2012-10-18ipv4: Fix flushing of cached routing informationsSteffen Klassert1-3/+6
Currently we can not flush cached pmtu/redirect informations via the ipv4_sysctl_rtcache_flush sysctl. We need to check the rt_genid of the old route and reset the nh exeption if the old route is expired when we bind a new route to a nh exeption. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-10-10ipv4: fix route mark sparse warningstephen hemminger1-1/+1
Sparse complains about RTA_MARK which is should be host order according to include file and usage in iproute. net/ipv4/route.c:2223:46: warning: incorrect type in argument 3 (different base types) net/ipv4/route.c:2223:46: expected restricted __be32 [usertype] value net/ipv4/route.c:2223:46: got unsigned int [unsigned] [usertype] flowic_mark Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-10-08ipv4: Add FLOWI_FLAG_KNOWN_NHJulian Anastasov1-4/+17
Add flag to request that output route should be returned with known rt_gateway, in case we want to use it as nexthop for neighbour resolving. The returned route can be cached as follows: - in NH exception: because the cached routes are not shared with other destinations - in FIB NH: when using gateway because all destinations for NH share same gateway As last option, to return rt_gateway!=0 we have to set DST_NOCACHE. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-10-08ipv4: introduce rt_uses_gatewayJulian Anastasov1-22/+26
Add new flag to remember when route is via gateway. We will use it to allow rt_gateway to contain address of directly connected host for the cases when DST_NOCACHE is used or when the NH exception caches per-destination route without DST_NOCACHE flag, i.e. when routes are not used for other destinations. By this way we force the neighbour resolving to work with the routed destination but we can use different address in the packet, feature needed for IPVS-DR where original packet for virtual IP is routed via route to real IP. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-10-08ipv4: make sure nh_pcpu_rth_output is always allocatedJulian Anastasov1-3/+0
Avoid checking nh_pcpu_rth_output in fast path, abort fib_info creation on alloc_percpu failure. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-10-08ipv4: fix sending of redirectsJulian Anastasov1-14/+16
After "Cache input routes in fib_info nexthops" (commit d2d68ba9fe) and "Elide fib_validate_source() completely when possible" (commit 7a9bc9b81a) we can not send ICMP redirects. It seems we should not cache the RTCF_DOREDIRECT flag in nh_rth_input because the same fib_info can be used for traffic that is not redirected, eg. from other input devices or from sources that are not in same subnet. As result, we have to disable the caching of RTCF_DOREDIRECT flag and to force source validation for the case when forwarding traffic to the input device. If traffic comes from directly connected source we allow redirection as it was done before both changes. Avoid setting RTCF_DOREDIRECT if IN_DEV_TX_REDIRECTS is disabled, this can avoid source address validation and to help caching the routes. After the change "Adjust semantics of rt->rt_gateway" (commit f8126f1d51) we should make sure our ICMP_REDIR_HOST messages contain daddr instead of 0.0.0.0 when target is directly connected. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-10-08ipv4: Don't report stale pmtu values to userspaceSteffen Klassert1-8/+11
We report cached pmtu values even if they are already expired. Change this to not report these values after they are expired and fix a race in the expire time calculation, as suggested by Eric Dumazet. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-10-08ipv4: Don't create nh exeption when the device mtu is smaller than the ↵Steffen Klassert1-0/+3
reported pmtu When a local tool like tracepath tries to send packets bigger than the device mtu, we create a nh exeption and set the pmtu to device mtu. The device mtu does not expire, so check if the device mtu is smaller than the reported pmtu and don't crerate a nh exeption in that case. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-10-08ipv4: Always invalidate or update the route on pmtu eventsSteffen Klassert1-11/+11
Some protocols, like IPsec still cache routes. So we need to invalidate the old route on pmtu events to avoid the reuse of stale routes. We also need to update the mtu and expire time of the route if we already use a nh exception route, otherwise we ignore newly learned pmtu values after the first expiration. With this patch we always invalidate or update the route on pmtu events. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-28Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-7/+2
Conflicts: drivers/net/team/team.c drivers/net/usb/qmi_wwan.c net/batman-adv/bat_iv_ogm.c net/ipv4/fib_frontend.c net/ipv4/route.c net/l2tp/l2tp_netlink.c The team, fib_frontend, route, and l2tp_netlink conflicts were simply overlapping changes. qmi_wwan and bat_iv_ogm were of the "use HEAD" variety. With help from Antonio Quartulli. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-18netns: move net->ipv4.rt_genid to net->rt_genidNicolas Dichtel1-7/+2
This commit prepares the use of rt_genid by both IPv4 and IPv6. Initialization is left in IPv4 part. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-18net: rt_cache_flush() cleanupEric Dumazet1-17/+2
We dont use jhash anymore since route cache removal, so we can get rid of get_random_bytes() calls for rt_genid changes. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-18ipv4/route: arg delay is useless in rt_cache_flush()Nicolas Dichtel1-16/+3
Since route cache deletion (89aef8921bfbac22f), delay is no more used. Remove it. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-10netlink: Rename pid to portid to avoid confusionEric W. Biederman1-4/+4
It is a frequent mistake to confuse the netlink port identifier with a process identifier. Try to reduce this confusion by renaming fields that hold port identifiers portid instead of pid. I have carefully avoided changing the structures exported to userspace to avoid changing the userspace API. I have successfully built an allyesconfig kernel with this change. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-07net: rt_cache_flush() cleanupEric Dumazet1-17/+2
We dont use jhash anymore since route cache removal, so we can get rid of get_random_bytes() calls for rt_genid changes. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-07ipv4/route: arg delay is useless in rt_cache_flush()Nicolas Dichtel1-16/+3
Since route cache deletion (89aef8921bfbac22f), delay is no more used. Remove it. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-31ipv4: Minor logic clean-up in ipv4_mtuAlexander Duyck1-4/+1
In ipv4_mtu there is some logic where we are testing for a non-zero value and a timer expiration, then setting the value to zero, and then testing if the value is zero we set it to a value based on the dst. Instead of bothering with the extra steps it is easier to just cleanup the logic so that we set it to the dst based value if it is zero or if the timer has expired. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
2012-08-31Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-2/+4
Merge the 'net' tree to get the recent set of netfilter bug fixes in order to assist with some merge hassles Pablo is going to have to deal with for upcoming changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-30ipv4: must use rcu protection while calling fib_lookupEric Dumazet1-0/+2
Following lockdep splat was reported by Pavel Roskin : [ 1570.586223] =============================== [ 1570.586225] [ INFO: suspicious RCU usage. ] [ 1570.586228] 3.6.0-rc3-wl-main #98 Not tainted [ 1570.586229] ------------------------------- [ 1570.586231] /home/proski/src/linux/net/ipv4/route.c:645 suspicious rcu_dereference_check() usage! [ 1570.586233] [ 1570.586233] other info that might help us debug this: [ 1570.586233] [ 1570.586236] [ 1570.586236] rcu_scheduler_active = 1, debug_locks = 0 [ 1570.586238] 2 locks held by Chrome_IOThread/4467: [ 1570.586240] #0: (slock-AF_INET){+.-...}, at: [<ffffffff814f2c0c>] release_sock+0x2c/0xa0 [ 1570.586253] #1: (fnhe_lock){+.-...}, at: [<ffffffff815302fc>] update_or_create_fnhe+0x2c/0x270 [ 1570.586260] [ 1570.586260] stack backtrace: [ 1570.586263] Pid: 4467, comm: Chrome_IOThread Not tainted 3.6.0-rc3-wl-main #98 [ 1570.586265] Call Trace: [ 1570.586271] [<ffffffff810976ed>] lockdep_rcu_suspicious+0xfd/0x130 [ 1570.586275] [<ffffffff8153042c>] update_or_create_fnhe+0x15c/0x270 [ 1570.586278] [<ffffffff815305b3>] __ip_rt_update_pmtu+0x73/0xb0 [ 1570.586282] [<ffffffff81530619>] ip_rt_update_pmtu+0x29/0x90 [ 1570.586285] [<ffffffff815411dc>] inet_csk_update_pmtu+0x2c/0x80 [ 1570.586290] [<ffffffff81558d1e>] tcp_v4_mtu_reduced+0x2e/0xc0 [ 1570.586293] [<ffffffff81553bc4>] tcp_release_cb+0xa4/0xb0 [ 1570.586296] [<ffffffff814f2c35>] release_sock+0x55/0xa0 [ 1570.586300] [<ffffffff815442ef>] tcp_sendmsg+0x4af/0xf50 [ 1570.586305] [<ffffffff8156fc60>] inet_sendmsg+0x120/0x230 [ 1570.586308] [<ffffffff8156fb40>] ? inet_sk_rebuild_header+0x40/0x40 [ 1570.586312] [<ffffffff814f4bdd>] ? sock_update_classid+0xbd/0x3b0 [ 1570.586315] [<ffffffff814f4c50>] ? sock_update_classid+0x130/0x3b0 [ 1570.586320] [<ffffffff814ec435>] do_sock_write+0xc5/0xe0 [ 1570.586323] [<ffffffff814ec4a3>] sock_aio_write+0x53/0x80 [ 1570.586328] [<ffffffff8114bc83>] do_sync_write+0xa3/0xe0 [ 1570.586332] [<ffffffff8114c5a5>] vfs_write+0x165/0x180 [ 1570.586335] [<ffffffff8114c805>] sys_write+0x45/0x90 [ 1570.586340] [<ffffffff815d2722>] system_call_fastpath+0x16/0x1b Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Pavel Roskin <proski@gnu.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-24ipv4: take rt_uncached_lock only if neededEric Dumazet1-1/+1
Multicast traffic allocates dst with DST_NOCACHE, but dst is not inserted into rt_uncached_list. This slowdown multicast workloads on SMP because rt_uncached_lock is contended. Change the test before taking the lock to actually check the dst was inserted into rt_uncached_list. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-22ipv4: properly update pmtuEric Dumazet1-1/+1
Sylvain Munault reported following info : - TCP connection get "stuck" with data in send queue when doing "large" transfers ( like typing 'ps ax' on a ssh connection ) - Only happens on path where the PMTU is lower than the MTU of the interface - Is not present right after boot, it only appears 10-20min after boot or so. (and that's inside the _same_ TCP connection, it works fine at first and then in the same ssh session, it'll get stuck) - Definitely seems related to fragments somehow since I see a router sending ICMP message saying fragmentation is needed. - Exact same setup works fine with kernel 3.5.1 Problem happens when the 10 minutes (ip_rt_mtu_expires) expiration period is over. ip_rt_update_pmtu() calls dst_set_expires() to rearm a new expiration, but dst_set_expires() does nothing because dst.expires is already set. It seems we want to set the expires field to a new value, regardless of prior one. With help from Julian Anastasov. Reported-by: Sylvain Munaut <s.munaut@whatever-company.com> Signed-off-by: Eric Dumazet <edumazet@google.com> CC: Julian Anastasov <ja@ssi.bg> Tested-by: Sylvain Munaut <s.munaut@whatever-company.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-1/+0
2012-08-14ipv4: Cache local output routesYan, Zheng1-1/+0
Commit caacf05e5ad1abf causes big drop of UDP loop back performance. The cause of the regression is that we do not cache the local output routes. Each time we send a datagram from unconnected UDP socket, the kernel allocates a dst_entry and adds it to the rt_uncached_list. It creates lock contention on the rt_uncached_lock. Reported-by: Alex Shi <alex.shi@intel.com> Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-09net: Loopback ifindex is constant nowPavel Emelyanov1-3/+3
As pointed out, there are places, that access net->loopback_dev->ifindex and after ifindex generation is made per-net this value becomes constant equals 1. So go ahead and introduce the LOOPBACK_IFINDEX constant and use it where appropriate. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-04ipv4: Introduce IN_DEV_NET_ROUTE_LOCALNETEric Dumazet1-4/+7
performance profiles show a high cost in the IN_DEV_ROUTE_LOCALNET() call done in ip_route_input_slow(), because of multiple dereferences, even if cache lines are clean and available in cpu caches. Since we already have the 'net' pointer, introduce IN_DEV_NET_ROUTE_LOCALNET() macro avoiding two dereferences (dev_net(in_dev->dev)) Also change the tests to use IN_DEV_NET_ROUTE_LOCALNET() only if saddr or/and daddr are loopback addresse. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-02ipv4: route.c cleanupEric Dumazet1-4/+0
Remove unused includes after IP cache removal Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-31ipv4: Properly purge netdev references on uncached routes.David S. Miller1-4/+64
When a device is unregistered, we have to purge all of the references to it that may exist in the entire system. If a route is uncached, we currently have no way of accomplishing this. So create a global list that is scanned when a network device goes down. This mirrors the logic in net/core/dst.c's dst_ifdown(). Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-31ipv4: Cache routes in nexthop exception entries.David S. Miller1-46/+57
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-31ipv4: percpu nh_rth_output cacheEric Dumazet1-5/+13
Input path is mostly run under RCU and doesnt touch dst refcnt But output path on forwarding or UDP workloads hits badly dst refcount, and we have lot of false sharing, for example in ipv4_mtu() when reading rt->rt_pmtu Using a percpu cache for nh_rth_output gives a nice performance increase at a small cost. 24 udpflood test on my 24 cpu machine (dummy0 output device) (each process sends 1.000.000 udp frames, 24 processes are started) before : 5.24 s after : 2.06 s For reference, time on linux-3.5 : 6.60 s Signed-off-by: Eric Dumazet <edumazet@google.com> Tested-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>