Message ID | 1460628921-25635-1-git-send-email-bruce.richardson@intel.com (mailing list archive) |
---|---|
State | Superseded, archived |
Headers |
Return-Path: <dev-bounces@dpdk.org>
From: Bruce Richardson <bruce.richardson@intel.com>
To: dev@dpdk.org
Cc: Helin Zhang <helin.zhang@intel.com>, Jingjing Wu <jingjing.wu@intel.com>, Bruce Richardson <bruce.richardson@intel.com>
Date: Thu, 14 Apr 2016 11:15:21 +0100
Message-Id: <1460628921-25635-1-git-send-email-bruce.richardson@intel.com>
Subject: [dpdk-dev] [PATCH] i40e: improve performance of vector PMD
List-Id: patches and discussions about DPDK <dev.dpdk.org> |
Commit Message
Bruce Richardson
April 14, 2016, 10:15 a.m. UTC
An analysis of the i40e code using Intel® VTune™ Amplifier 2016 showed
that the code was unexpectedly causing stalls due to "Loads blocked by
Store Forwards". These occur when a load from memory has to wait
because the prior store to the same address was of a smaller size,
i.e. the stored value cannot be directly forwarded to the load.
[See ref: https://software.intel.com/en-us/node/544454]
These stalls are due to the way the data_len values are handled in the
driver. The lengths are extracted using vector operations, but those
16-bit lengths are then assigned using scalar operations, i.e. 16-bit
stores.
These regular 16-bit stores have two effects in the code:
* they cause the "Loads blocked by Store Forwards" stalls reported
* they also turn the earlier loads in the RX function into a load
followed by a store to an address on the stack, because a 16-bit
assignment cannot be done directly to an xmm register.
By converting the 16-bit store operations into a sequence of SSE blend
operations, we ensure that the descriptor loads occur only once, and
avoid both the additional store to, and loads from, the stack, as well
as the stalls caused by the second loads being blocked.
Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
---
drivers/net/i40e/i40e_rxtx_vec.c | 24 ++++++++++--------------
1 file changed, 10 insertions(+), 14 deletions(-)
Comments
On Thu, Apr 14, 2016 at 11:15:21AM +0100, Bruce Richardson wrote:
> [commit message quoted in full; snipped]

Self-NAK on this version. The blend instruction used is SSE4.1 so breaks
the "default" build.

Two obvious options to fix this:
1. Keep the old code with SSE4.1 #ifdefs separating old and new
2. Update the vpmd requirement to SSE4.1, and factor that in during runtime
   select of the RX code path.

Personally, I prefer the second option. Any objections?

/Bruce
> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson
> Sent: Thursday, April 14, 2016 2:50 PM
> To: dev@dpdk.org
> Cc: Zhang, Helin; Wu, Jingjing
> Subject: Re: [dpdk-dev] [PATCH] i40e: improve performance of vector PMD
>
> [commit message and self-NAK quoted in full; snipped]
>
> Two obvious options to fix this:
> 1. Keep the old code with SSE4.1 #ifdefs separating old and new
> 2. Update the vpmd requirement to SSE4.1, and factor that in during runtime
>    select of the RX code path.
>
> Personally, I prefer the second option. Any objections?

+1 for second one.

> /Bruce
Hi Bruce,

> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Ananyev, Konstantin
> Sent: Thursday, April 14, 2016 3:00 PM
> To: Richardson, Bruce <bruce.richardson@intel.com>; dev@dpdk.org
> Cc: Zhang, Helin <helin.zhang@intel.com>; Wu, Jingjing
> <jingjing.wu@intel.com>
> Subject: Re: [dpdk-dev] [PATCH] i40e: improve performance of vector PMD
>
> [earlier discussion quoted in full; snipped]
>
> > Two obvious options to fix this:
> > 1. Keep the old code with SSE4.1 #ifdefs separating old and new
> > 2. Update the vpmd requirement to SSE4.1, and factor that in during
> >    runtime select of the RX code path.
> >
> > Personally, I prefer the second option. Any objections?
>
> +1 for second one.

I am using the "default" build when building in VM's, will both options
work for me?

Regards,
Bernard.
This patchset improves the performance of the i40e SSE PMD by removing
operations that triggered CPU stalls. It also shortens the code and
cleans it up a little.

The base requirement for using the SSE code path has been pushed up to
SSE4.1 from SSE3, due to the use of the blend instruction. The instruction
set level is now checked at runtime as part of the driver selection process.

Bruce Richardson (3):
  i40e: require SSE4.1 support for vector driver
  i40e: improve performance of vector PMD
  i40e: simplify SSE packet length extraction code

 drivers/net/i40e/Makefile        |  6 ++++
 drivers/net/i40e/i40e_rxtx_vec.c | 59 ++++++++++++++--------------------------
 2 files changed, 27 insertions(+), 38 deletions(-)
On Thu, Apr 14, 2016 at 05:02:34PM +0100, Bruce Richardson wrote:
> [v2 cover letter quoted in full; snipped]

Acked-by: Zhe Tao <zhe.tao@intel.com>
On Sun, Apr 17, 2016 at 04:32:10PM +0800, Zhe Tao wrote:
> [v2 cover letter quoted in full; snipped]
>
> Acked-by: Zhe Tao <zhe.tao@intel.com>

Applied to dpdk-next-net/rel_16_07

/Bruce
diff --git a/drivers/net/i40e/i40e_rxtx_vec.c b/drivers/net/i40e/i40e_rxtx_vec.c
index 047aff5..d0a0cc9 100644
--- a/drivers/net/i40e/i40e_rxtx_vec.c
+++ b/drivers/net/i40e/i40e_rxtx_vec.c
@@ -192,11 +192,7 @@ desc_to_olflags_v(__m128i descs[4], struct rte_mbuf **rx_pkts)
 static inline void
 desc_pktlen_align(__m128i descs[4])
 {
-	__m128i pktlen0, pktlen1, zero;
-	union {
-		uint16_t e[4];
-		uint64_t dword;
-	} vol;
+	__m128i pktlen0, pktlen1;
 
 	/* mask everything except pktlen field*/
 	const __m128i pktlen_msk = _mm_set_epi32(PKTLEN_MASK, PKTLEN_MASK,
@@ -206,18 +202,18 @@ desc_pktlen_align(__m128i descs[4])
 	pktlen1 = _mm_unpackhi_epi32(descs[1], descs[3]);
 	pktlen0 = _mm_unpackhi_epi32(pktlen0, pktlen1);
 
-	zero = _mm_xor_si128(pktlen0, pktlen0);
-
 	pktlen0 = _mm_srli_epi32(pktlen0, PKTLEN_SHIFT);
 	pktlen0 = _mm_and_si128(pktlen0, pktlen_msk);
-	pktlen0 = _mm_packs_epi32(pktlen0, zero);
-	vol.dword = _mm_cvtsi128_si64(pktlen0);
-	/* let the descriptor byte 15-14 store the pkt len */
-	*((uint16_t *)&descs[0]+7) = vol.e[0];
-	*((uint16_t *)&descs[1]+7) = vol.e[1];
-	*((uint16_t *)&descs[2]+7) = vol.e[2];
-	*((uint16_t *)&descs[3]+7) = vol.e[3];
+	pktlen0 = _mm_packs_epi32(pktlen0, pktlen0);
+
+	descs[3] = _mm_blend_epi16(descs[3], pktlen0, 0x80);
+	pktlen0 = _mm_slli_epi64(pktlen0, 16);
+	descs[2] = _mm_blend_epi16(descs[2], pktlen0, 0x80);
+	pktlen0 = _mm_slli_epi64(pktlen0, 16);
+	descs[1] = _mm_blend_epi16(descs[1], pktlen0, 0x80);
+	pktlen0 = _mm_slli_epi64(pktlen0, 16);
+	descs[0] = _mm_blend_epi16(descs[0], pktlen0, 0x80);
 }
 
 /*