[dpdk-dev] net/i40e: add additional prefetch instructions for bulk rx

Message ID 20160714172719.17502-2-vladyslav.buslov@harmonicinc.com (mailing list archive)
State Changes Requested, archived
Delegated to: Bruce Richardson

Commit Message

Vladyslav Buslov July 14, 2016, 5:27 p.m. UTC
  Added prefetch of first packet payload cacheline in i40e_rx_scan_hw_ring
Added prefetch of second mbuf cacheline in i40e_rx_alloc_bufs

Signed-off-by: Vladyslav Buslov <vladyslav.buslov@harmonicinc.com>
---
 drivers/net/i40e/i40e_rxtx.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)
  

Comments

Ferruh Yigit Sept. 14, 2016, 1:24 p.m. UTC | #1
On 7/14/2016 6:27 PM, Vladyslav Buslov wrote:
> Added prefetch of first packet payload cacheline in i40e_rx_scan_hw_ring
> Added prefetch of second mbuf cacheline in i40e_rx_alloc_bufs
> 
> Signed-off-by: Vladyslav Buslov <vladyslav.buslov@harmonicinc.com>
> ---
>  drivers/net/i40e/i40e_rxtx.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
> index d3cfb98..e493fb4 100644
> --- a/drivers/net/i40e/i40e_rxtx.c
> +++ b/drivers/net/i40e/i40e_rxtx.c
> @@ -1003,6 +1003,7 @@ i40e_rx_scan_hw_ring(struct i40e_rx_queue *rxq)
>                 /* Translate descriptor info to mbuf parameters */
>                 for (j = 0; j < nb_dd; j++) {
>                         mb = rxep[j].mbuf;
> +                       rte_prefetch0(RTE_PTR_ADD(mb->buf_addr, RTE_PKTMBUF_HEADROOM));
>                         qword1 = rte_le_to_cpu_64(\
>                                 rxdp[j].wb.qword1.status_error_len);
>                         pkt_len = ((qword1 & I40E_RXD_QW1_LENGTH_PBUF_MASK) >>
> @@ -1086,9 +1087,11 @@ i40e_rx_alloc_bufs(struct i40e_rx_queue *rxq)
> 
>         rxdp = &rxq->rx_ring[alloc_idx];
>         for (i = 0; i < rxq->rx_free_thresh; i++) {
> -               if (likely(i < (rxq->rx_free_thresh - 1)))
> +               if (likely(i < (rxq->rx_free_thresh - 1))) {
>                         /* Prefetch next mbuf */
> -                       rte_prefetch0(rxep[i + 1].mbuf);
> +                       rte_prefetch0(&rxep[i + 1].mbuf->cacheline0);
> +                       rte_prefetch0(&rxep[i + 1].mbuf->cacheline1);
> +               }
> 
>                 mb = rxep[i].mbuf;
>                 rte_mbuf_refcnt_set(mb, 1);
> --
> 2.8.3
> 

i40e maintainers, can you please review the patch?

Thanks,
ferruh
  
Jingjing Wu Oct. 10, 2016, 1:25 p.m. UTC | #2
> -----Original Message-----
> From: Yigit, Ferruh
> Sent: Wednesday, September 14, 2016 9:25 PM
> To: Vladyslav Buslov <vladyslav.buslov@harmonicinc.com>; Zhang, Helin
> <helin.zhang@intel.com>; Wu, Jingjing <jingjing.wu@intel.com>
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH] net/i40e: add additional prefetch instructions for bulk rx
> 
> On 7/14/2016 6:27 PM, Vladyslav Buslov wrote:
> > Added prefetch of first packet payload cacheline in i40e_rx_scan_hw_ring
> > Added prefetch of second mbuf cacheline in i40e_rx_alloc_bufs
> >
> > Signed-off-by: Vladyslav Buslov <vladyslav.buslov@harmonicinc.com>
> > ---
> >  drivers/net/i40e/i40e_rxtx.c | 7 +++++--
> >  1 file changed, 5 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
> > index d3cfb98..e493fb4 100644
> > --- a/drivers/net/i40e/i40e_rxtx.c
> > +++ b/drivers/net/i40e/i40e_rxtx.c
> > @@ -1003,6 +1003,7 @@ i40e_rx_scan_hw_ring(struct i40e_rx_queue *rxq)
> >                 /* Translate descriptor info to mbuf parameters */
> >                 for (j = 0; j < nb_dd; j++) {
> >                         mb = rxep[j].mbuf;
> > +                       rte_prefetch0(RTE_PTR_ADD(mb->buf_addr,
> RTE_PKTMBUF_HEADROOM));

Why prefetch here? I think if the application needs to deal with the packet, it is more suitable to put the prefetch in the application.

> >                         qword1 = rte_le_to_cpu_64(\
> >                                 rxdp[j].wb.qword1.status_error_len);
> >                         pkt_len = ((qword1 &
> I40E_RXD_QW1_LENGTH_PBUF_MASK) >>
> > @@ -1086,9 +1087,11 @@ i40e_rx_alloc_bufs(struct i40e_rx_queue *rxq)
> >
> >         rxdp = &rxq->rx_ring[alloc_idx];
> >         for (i = 0; i < rxq->rx_free_thresh; i++) {
> > -               if (likely(i < (rxq->rx_free_thresh - 1)))
> > +               if (likely(i < (rxq->rx_free_thresh - 1))) {
> >                         /* Prefetch next mbuf */
> > -                       rte_prefetch0(rxep[i + 1].mbuf);
> > +                       rte_prefetch0(&rxep[i + 1].mbuf->cacheline0);
> > +                       rte_prefetch0(&rxep[i + 1].mbuf->cacheline1);
> > +               }
Agree with this change. However, when I test it with testpmd in iofwd mode, no performance increase is observed, only a minor decrease.
Can you share with us when it benefits performance in your scenario?


Thanks
Jingjing
  
Vladyslav Buslov Oct. 10, 2016, 5:05 p.m. UTC | #3
> -----Original Message-----
> From: Wu, Jingjing [mailto:jingjing.wu@intel.com]
> Sent: Monday, October 10, 2016 4:26 PM
> To: Yigit, Ferruh; Vladyslav Buslov; Zhang, Helin
> Cc: dev@dpdk.org
> Subject: RE: [dpdk-dev] [PATCH] net/i40e: add additional prefetch
> instructions for bulk rx
> 
> 
> 
> > -----Original Message-----
> > From: Yigit, Ferruh
> > Sent: Wednesday, September 14, 2016 9:25 PM
> > To: Vladyslav Buslov <vladyslav.buslov@harmonicinc.com>; Zhang, Helin
> > <helin.zhang@intel.com>; Wu, Jingjing <jingjing.wu@intel.com>
> > Cc: dev@dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH] net/i40e: add additional prefetch
> > instructions for bulk rx
> >
> > On 7/14/2016 6:27 PM, Vladyslav Buslov wrote:
> > > Added prefetch of first packet payload cacheline in
> > > i40e_rx_scan_hw_ring Added prefetch of second mbuf cacheline in
> > > i40e_rx_alloc_bufs
> > >
> > > Signed-off-by: Vladyslav Buslov <vladyslav.buslov@harmonicinc.com>
> > > ---
> > >  drivers/net/i40e/i40e_rxtx.c | 7 +++++--
> > >  1 file changed, 5 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/net/i40e/i40e_rxtx.c
> > > b/drivers/net/i40e/i40e_rxtx.c index d3cfb98..e493fb4 100644
> > > --- a/drivers/net/i40e/i40e_rxtx.c
> > > +++ b/drivers/net/i40e/i40e_rxtx.c
> > > @@ -1003,6 +1003,7 @@ i40e_rx_scan_hw_ring(struct i40e_rx_queue
> *rxq)
> > >                 /* Translate descriptor info to mbuf parameters */
> > >                 for (j = 0; j < nb_dd; j++) {
> > >                         mb = rxep[j].mbuf;
> > > +                       rte_prefetch0(RTE_PTR_ADD(mb->buf_addr,
> > RTE_PKTMBUF_HEADROOM));
> 
> Why did prefetch here? I think if application need to deal with packet, it is
> more suitable to put it in application.
> 
> > >                         qword1 = rte_le_to_cpu_64(\
> > >                                 rxdp[j].wb.qword1.status_error_len);
> > >                         pkt_len = ((qword1 &
> > I40E_RXD_QW1_LENGTH_PBUF_MASK) >>
> > > @@ -1086,9 +1087,11 @@ i40e_rx_alloc_bufs(struct i40e_rx_queue
> *rxq)
> > >
> > >         rxdp = &rxq->rx_ring[alloc_idx];
> > >         for (i = 0; i < rxq->rx_free_thresh; i++) {
> > > -               if (likely(i < (rxq->rx_free_thresh - 1)))
> > > +               if (likely(i < (rxq->rx_free_thresh - 1))) {
> > >                         /* Prefetch next mbuf */
> > > -                       rte_prefetch0(rxep[i + 1].mbuf);
> > > +                       rte_prefetch0(&rxep[i + 1].mbuf->cacheline0);
> > > +                       rte_prefetch0(&rxep[i + 1].mbuf->cacheline1);
> > > +               }
> Agree with this change. And when I test it by testpmd with iofwd, no
> performance increase is observed but minor decrease.
> Can you share will us when it will benefit the performance in your scenario ?
> 
> 
> Thanks
> Jingjing

Hello Jingjing,

Thanks for code review.

My use case: we have a simple distributor thread that receives packets from a port and distributes them among worker threads according to a VLAN and MAC address hash.

While working on performance optimization we determined that most of this thread's CPU usage is inside DPDK.
As an optimization we decided to switch to the rx burst alloc function; however, that caused additional performance degradation compared to scatter rx mode.
In the profiler the two major culprits were:
  1. Access to the packet's Ethernet header in application code (cache miss).
  2. Setting the mbuf's next field to NULL in the DPDK i40e_rx_alloc_bufs code (this field is in the second mbuf cache line, which was not prefetched).
After applying my fixes, performance improved compared to scatter rx mode.

I assumed that prefetching the first cache line of packet data belongs in DPDK because it is already done in scatter rx mode (in i40e_recv_scattered_pkts).
It could be moved to the application side, but IMO it is better to be consistent across all rx modes.

Regards,
Vladyslav
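
For illustration only, a minimal, untested sketch of the "move it to the application side" alternative mentioned above: prefetching the next packet's header one iteration ahead in a distributor loop. distributor_poll() and distribute_one() are made-up names, and nothing here is taken from the application discussed in this thread; the usual burst-rx setup is assumed.

#include <rte_ethdev.h>
#include <rte_ether.h>
#include <rte_mbuf.h>
#include <rte_prefetch.h>

#define BURST_SIZE 32

/* Placeholder for the application's VLAN/MAC hashing and enqueue logic. */
static void distribute_one(struct ether_hdr *eth, struct rte_mbuf *m);

static void
distributor_poll(uint16_t port_id, uint16_t queue_id)
{
	struct rte_mbuf *pkts[BURST_SIZE];
	uint16_t nb_rx, i;

	nb_rx = rte_eth_rx_burst(port_id, queue_id, pkts, BURST_SIZE);
	for (i = 0; i < nb_rx; i++) {
		/* Prefetch the next packet's first data cache line so the
		 * Ethernet header access below does not miss in cache. */
		if (i + 1 < nb_rx)
			rte_prefetch0(rte_pktmbuf_mtod(pkts[i + 1], void *));

		distribute_one(rte_pktmbuf_mtod(pkts[i], struct ether_hdr *),
			       pkts[i]);
	}
}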
  
Ananyev, Konstantin Oct. 11, 2016, 8:51 a.m. UTC | #4
Hi Vladyslav,

> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Vladyslav Buslov
> Sent: Monday, October 10, 2016 6:06 PM
> To: Wu, Jingjing <jingjing.wu@intel.com>; Yigit, Ferruh <ferruh.yigit@intel.com>; Zhang, Helin <helin.zhang@intel.com>
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH] net/i40e: add additional prefetch instructions for bulk rx
> 
> > -----Original Message-----
> > From: Wu, Jingjing [mailto:jingjing.wu@intel.com]
> > Sent: Monday, October 10, 2016 4:26 PM
> > To: Yigit, Ferruh; Vladyslav Buslov; Zhang, Helin
> > Cc: dev@dpdk.org
> > Subject: RE: [dpdk-dev] [PATCH] net/i40e: add additional prefetch
> > instructions for bulk rx
> >
> >
> >
> > > -----Original Message-----
> > > From: Yigit, Ferruh
> > > Sent: Wednesday, September 14, 2016 9:25 PM
> > > To: Vladyslav Buslov <vladyslav.buslov@harmonicinc.com>; Zhang, Helin
> > > <helin.zhang@intel.com>; Wu, Jingjing <jingjing.wu@intel.com>
> > > Cc: dev@dpdk.org
> > > Subject: Re: [dpdk-dev] [PATCH] net/i40e: add additional prefetch
> > > instructions for bulk rx
> > >
> > > On 7/14/2016 6:27 PM, Vladyslav Buslov wrote:
> > > > Added prefetch of first packet payload cacheline in
> > > > i40e_rx_scan_hw_ring Added prefetch of second mbuf cacheline in
> > > > i40e_rx_alloc_bufs
> > > >
> > > > Signed-off-by: Vladyslav Buslov <vladyslav.buslov@harmonicinc.com>
> > > > ---
> > > >  drivers/net/i40e/i40e_rxtx.c | 7 +++++--
> > > >  1 file changed, 5 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/drivers/net/i40e/i40e_rxtx.c
> > > > b/drivers/net/i40e/i40e_rxtx.c index d3cfb98..e493fb4 100644
> > > > --- a/drivers/net/i40e/i40e_rxtx.c
> > > > +++ b/drivers/net/i40e/i40e_rxtx.c
> > > > @@ -1003,6 +1003,7 @@ i40e_rx_scan_hw_ring(struct i40e_rx_queue
> > *rxq)
> > > >                 /* Translate descriptor info to mbuf parameters */
> > > >                 for (j = 0; j < nb_dd; j++) {
> > > >                         mb = rxep[j].mbuf;
> > > > +                       rte_prefetch0(RTE_PTR_ADD(mb->buf_addr,
> > > RTE_PKTMBUF_HEADROOM));
> >
> > Why did prefetch here? I think if application need to deal with packet, it is
> > more suitable to put it in application.
> >
> > > >                         qword1 = rte_le_to_cpu_64(\
> > > >                                 rxdp[j].wb.qword1.status_error_len);
> > > >                         pkt_len = ((qword1 &
> > > I40E_RXD_QW1_LENGTH_PBUF_MASK) >>
> > > > @@ -1086,9 +1087,11 @@ i40e_rx_alloc_bufs(struct i40e_rx_queue
> > *rxq)
> > > >
> > > >         rxdp = &rxq->rx_ring[alloc_idx];
> > > >         for (i = 0; i < rxq->rx_free_thresh; i++) {
> > > > -               if (likely(i < (rxq->rx_free_thresh - 1)))
> > > > +               if (likely(i < (rxq->rx_free_thresh - 1))) {
> > > >                         /* Prefetch next mbuf */
> > > > -                       rte_prefetch0(rxep[i + 1].mbuf);
> > > > +                       rte_prefetch0(&rxep[i + 1].mbuf->cacheline0);
> > > > +                       rte_prefetch0(&rxep[i + 1].mbuf->cacheline1);

I think there are rte_mbuf_prefetch_part1/part2 helpers defined in rte_mbuf.h
specifically for that case.
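
For reference, a sketch of the quoted i40e_rx_alloc_bufs() loop using those helpers (untested; variable names are taken from the hunk above, and this is not an actual re-spin from the thread):

	rxdp = &rxq->rx_ring[alloc_idx];
	for (i = 0; i < rxq->rx_free_thresh; i++) {
		if (likely(i < (rxq->rx_free_thresh - 1))) {
			/* Prefetch both cache lines of the next mbuf; part2 is
			 * a no-op on targets where the whole mbuf fits into the
			 * first cache line. */
			rte_mbuf_prefetch_part1(rxep[i + 1].mbuf);
			rte_mbuf_prefetch_part2(rxep[i + 1].mbuf);
		}

		mb = rxep[i].mbuf;
		rte_mbuf_refcnt_set(mb, 1);
		/* ... remainder of the per-mbuf reset unchanged ... */
	}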

> > > > +               }
> > Agree with this change. And when I test it by testpmd with iofwd, no
> > performance increase is observed but minor decrease.
> > Can you share will us when it will benefit the performance in your scenario ?
> >
> >
> > Thanks
> > Jingjing
> 
> Hello Jingjing,
> 
> Thanks for code review.
> 
> My use case: We have simple distributor thread that receives packets from port and distributes them among worker threads according to
> VLAN and MAC address hash.
> 
> While working on performance optimization we determined that most of CPU usage of this thread is in DPDK.
> As and optimization we decided to switch to rx burst alloc function, however that caused additional performance degradation compared to
> scatter rx mode.
> In profiler two major culprits were:
>   1. Access to packet data Eth header in application code. (cache miss)
>   2. Setting next packet descriptor field to NULL in DPDK i40e_rx_alloc_bufs code. (this field is in second descriptor cache line that was not
> prefetched)

I wonder what would happen if we removed the prefetches here entirely.
Would it make things better or worse (and by how much)?

> After applying my fixes performance improved compared to scatter rx mode.
> 
> I assumed that prefetch of first cache line of packet data belongs to DPDK because it is done in scatter rx mode. (in
> i40e_recv_scattered_pkts)
> It can be moved to application side but IMO it is better to be consistent across all rx modes.

I would agree with Jingjing here; the PMD should probably avoid prefetching the packet's data.
Konstantin
  
Vladyslav Buslov Oct. 11, 2016, 9:24 a.m. UTC | #5
Hi Konstantin,

> -----Original Message-----
> From: Ananyev, Konstantin [mailto:konstantin.ananyev@intel.com]
> Sent: Tuesday, October 11, 2016 11:51 AM
> To: Vladyslav Buslov; Wu, Jingjing; Yigit, Ferruh; Zhang, Helin
> Cc: dev@dpdk.org
> Subject: RE: [dpdk-dev] [PATCH] net/i40e: add additional prefetch
> instructions for bulk rx
> 
> Hi Vladislav,
> 
> > -----Original Message-----
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Vladyslav Buslov
> > Sent: Monday, October 10, 2016 6:06 PM
> > To: Wu, Jingjing <jingjing.wu@intel.com>; Yigit, Ferruh
> > <ferruh.yigit@intel.com>; Zhang, Helin <helin.zhang@intel.com>
> > Cc: dev@dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH] net/i40e: add additional prefetch
> > instructions for bulk rx
> >
> > > -----Original Message-----
> > > From: Wu, Jingjing [mailto:jingjing.wu@intel.com]
> > > Sent: Monday, October 10, 2016 4:26 PM
> > > To: Yigit, Ferruh; Vladyslav Buslov; Zhang, Helin
> > > Cc: dev@dpdk.org
> > > Subject: RE: [dpdk-dev] [PATCH] net/i40e: add additional prefetch
> > > instructions for bulk rx
> > >
> > >
> > >
> > > > -----Original Message-----
> > > > From: Yigit, Ferruh
> > > > Sent: Wednesday, September 14, 2016 9:25 PM
> > > > To: Vladyslav Buslov <vladyslav.buslov@harmonicinc.com>; Zhang,
> > > > Helin <helin.zhang@intel.com>; Wu, Jingjing
> > > > <jingjing.wu@intel.com>
> > > > Cc: dev@dpdk.org
> > > > Subject: Re: [dpdk-dev] [PATCH] net/i40e: add additional prefetch
> > > > instructions for bulk rx
> > > >
> > > > On 7/14/2016 6:27 PM, Vladyslav Buslov wrote:
> > > > > Added prefetch of first packet payload cacheline in
> > > > > i40e_rx_scan_hw_ring Added prefetch of second mbuf cacheline in
> > > > > i40e_rx_alloc_bufs
> > > > >
> > > > > Signed-off-by: Vladyslav Buslov
> > > > > <vladyslav.buslov@harmonicinc.com>
> > > > > ---
> > > > >  drivers/net/i40e/i40e_rxtx.c | 7 +++++--
> > > > >  1 file changed, 5 insertions(+), 2 deletions(-)
> > > > >
> > > > > diff --git a/drivers/net/i40e/i40e_rxtx.c
> > > > > b/drivers/net/i40e/i40e_rxtx.c index d3cfb98..e493fb4 100644
> > > > > --- a/drivers/net/i40e/i40e_rxtx.c
> > > > > +++ b/drivers/net/i40e/i40e_rxtx.c
> > > > > @@ -1003,6 +1003,7 @@ i40e_rx_scan_hw_ring(struct
> i40e_rx_queue
> > > *rxq)
> > > > >                 /* Translate descriptor info to mbuf parameters */
> > > > >                 for (j = 0; j < nb_dd; j++) {
> > > > >                         mb = rxep[j].mbuf;
> > > > > +                       rte_prefetch0(RTE_PTR_ADD(mb->buf_addr,
> > > > RTE_PKTMBUF_HEADROOM));
> > >
> > > Why did prefetch here? I think if application need to deal with
> > > packet, it is more suitable to put it in application.
> > >
> > > > >                         qword1 = rte_le_to_cpu_64(\
> > > > >                                 rxdp[j].wb.qword1.status_error_len);
> > > > >                         pkt_len = ((qword1 &
> > > > I40E_RXD_QW1_LENGTH_PBUF_MASK) >>
> > > > > @@ -1086,9 +1087,11 @@ i40e_rx_alloc_bufs(struct i40e_rx_queue
> > > *rxq)
> > > > >
> > > > >         rxdp = &rxq->rx_ring[alloc_idx];
> > > > >         for (i = 0; i < rxq->rx_free_thresh; i++) {
> > > > > -               if (likely(i < (rxq->rx_free_thresh - 1)))
> > > > > +               if (likely(i < (rxq->rx_free_thresh - 1))) {
> > > > >                         /* Prefetch next mbuf */
> > > > > -                       rte_prefetch0(rxep[i + 1].mbuf);
> > > > > +                       rte_prefetch0(&rxep[i + 1].mbuf->cacheline0);
> > > > > +                       rte_prefetch0(&rxep[i +
> > > > > + 1].mbuf->cacheline1);
> 
> I think there are rte_mbuf_prefetch_part1/part2 defined in rte_mbuf.h,
> specially for that case.

Thanks for pointing that out.
I'll submit a new patch if you decide to move forward with this change.

> 
> > > > > +               }
> > > Agree with this change. And when I test it by testpmd with iofwd, no
> > > performance increase is observed but minor decrease.
> > > Can you share will us when it will benefit the performance in your
> scenario ?
> > >
> > >
> > > Thanks
> > > Jingjing
> >
> > Hello Jingjing,
> >
> > Thanks for code review.
> >
> > My use case: We have simple distributor thread that receives packets
> > from port and distributes them among worker threads according to VLAN
> and MAC address hash.
> >
> > While working on performance optimization we determined that most of
> CPU usage of this thread is in DPDK.
> > As and optimization we decided to switch to rx burst alloc function,
> > however that caused additional performance degradation compared to
> scatter rx mode.
> > In profiler two major culprits were:
> >   1. Access to packet data Eth header in application code. (cache miss)
> >   2. Setting next packet descriptor field to NULL in DPDK
> > i40e_rx_alloc_bufs code. (this field is in second descriptor cache
> > line that was not
> > prefetched)
> 
> I wonder what will happen if we'll remove any prefetches here?
> Would it make things better or worse (and by how much)?

In our case it causes a few percent PPS degradation on the next=NULL assignment, but it seems that Jingjing's test doesn't confirm it.

> 
> > After applying my fixes performance improved compared to scatter rx
> mode.
> >
> > I assumed that prefetch of first cache line of packet data belongs to
> > DPDK because it is done in scatter rx mode. (in
> > i40e_recv_scattered_pkts)
> > It can be moved to application side but IMO it is better to be consistent
> across all rx modes.
> 
> I would agree with Jingjing here, probably PMD should avoid to prefetch
> packet's data.

Actually I can see some valid use cases where it is beneficial to have this prefetch in the driver.
In our sw distributor case it is trivial to just prefetch the next packet on each iteration because packets are processed one by one.
However, when we move this functionality to hardware by means of RSS/virtual functions/Flow Director (our long-term goal), worker threads will receive packets directly from the NIC's rx queues.
The first operation of a worker thread is a bulk lookup in a hash table keyed by destination MAC. This causes a cache miss on accessing each Ethernet header and can't be easily mitigated in application code.
I assume this is a ubiquitous use case for DPDK.

Regards,
Vladyslav
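
To make the worker-side pattern above concrete, here is a hypothetical sketch of a burst classifier that prefetches every Ethernet header before the bulk hash lookup. The rte_hash table, the key layout and the function names are illustrative assumptions, not code from this thread:

#include <rte_ether.h>
#include <rte_hash.h>
#include <rte_mbuf.h>
#include <rte_prefetch.h>

#define BURST_SIZE 32

static void
worker_classify(const struct rte_hash *mac_table,
		struct rte_mbuf **pkts, uint16_t nb_rx)
{
	const void *keys[BURST_SIZE];
	int32_t positions[BURST_SIZE];
	uint16_t i;

	/* nb_rx is assumed to be <= BURST_SIZE. Touch each packet's first
	 * data cache line up front; without a prefetch here (or in the PMD)
	 * every d_addr read below is a cache miss. */
	for (i = 0; i < nb_rx; i++)
		rte_prefetch0(rte_pktmbuf_mtod(pkts[i], void *));

	for (i = 0; i < nb_rx; i++)
		keys[i] = &rte_pktmbuf_mtod(pkts[i], struct ether_hdr *)->d_addr;

	/* Bulk lookup keyed on the 6-byte destination MAC. */
	rte_hash_lookup_bulk(mac_table, keys, nb_rx, positions);

	/* ... dispatch pkts[i] according to positions[i] ... */
}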
  
Ananyev, Konstantin Oct. 12, 2016, 12:04 a.m. UTC | #6
Hi Vladyslav,

> > > > >
> > > > > On 7/14/2016 6:27 PM, Vladyslav Buslov wrote:
> > > > > > Added prefetch of first packet payload cacheline in
> > > > > > i40e_rx_scan_hw_ring Added prefetch of second mbuf cacheline in
> > > > > > i40e_rx_alloc_bufs
> > > > > >
> > > > > > Signed-off-by: Vladyslav Buslov
> > > > > > <vladyslav.buslov@harmonicinc.com>
> > > > > > ---
> > > > > >  drivers/net/i40e/i40e_rxtx.c | 7 +++++--
> > > > > >  1 file changed, 5 insertions(+), 2 deletions(-)
> > > > > >
> > > > > > diff --git a/drivers/net/i40e/i40e_rxtx.c
> > > > > > b/drivers/net/i40e/i40e_rxtx.c index d3cfb98..e493fb4 100644
> > > > > > --- a/drivers/net/i40e/i40e_rxtx.c
> > > > > > +++ b/drivers/net/i40e/i40e_rxtx.c
> > > > > > @@ -1003,6 +1003,7 @@ i40e_rx_scan_hw_ring(struct
> > i40e_rx_queue
> > > > *rxq)
> > > > > >                 /* Translate descriptor info to mbuf parameters */
> > > > > >                 for (j = 0; j < nb_dd; j++) {
> > > > > >                         mb = rxep[j].mbuf;
> > > > > > +                       rte_prefetch0(RTE_PTR_ADD(mb->buf_addr,
> > > > > RTE_PKTMBUF_HEADROOM));
> > > >
> > > > Why did prefetch here? I think if application need to deal with
> > > > packet, it is more suitable to put it in application.
> > > >
> > > > > >                         qword1 = rte_le_to_cpu_64(\
> > > > > >                                 rxdp[j].wb.qword1.status_error_len);
> > > > > >                         pkt_len = ((qword1 &
> > > > > I40E_RXD_QW1_LENGTH_PBUF_MASK) >>
> > > > > > @@ -1086,9 +1087,11 @@ i40e_rx_alloc_bufs(struct i40e_rx_queue
> > > > *rxq)
> > > > > >
> > > > > >         rxdp = &rxq->rx_ring[alloc_idx];
> > > > > >         for (i = 0; i < rxq->rx_free_thresh; i++) {
> > > > > > -               if (likely(i < (rxq->rx_free_thresh - 1)))
> > > > > > +               if (likely(i < (rxq->rx_free_thresh - 1))) {
> > > > > >                         /* Prefetch next mbuf */
> > > > > > -                       rte_prefetch0(rxep[i + 1].mbuf);
> > > > > > +                       rte_prefetch0(&rxep[i + 1].mbuf->cacheline0);
> > > > > > +                       rte_prefetch0(&rxep[i +
> > > > > > + 1].mbuf->cacheline1);
> >
> > I think there are rte_mbuf_prefetch_part1/part2 defined in rte_mbuf.h,
> > specially for that case.
> 
> Thanks for pointing that out.
> I'll submit new patch if you decide to move forward with this development.
> 
> >
> > > > > > +               }
> > > > Agree with this change. And when I test it by testpmd with iofwd, no
> > > > performance increase is observed but minor decrease.
> > > > Can you share will us when it will benefit the performance in your
> > scenario ?
> > > >
> > > >
> > > > Thanks
> > > > Jingjing
> > >
> > > Hello Jingjing,
> > >
> > > Thanks for code review.
> > >
> > > My use case: We have simple distributor thread that receives packets
> > > from port and distributes them among worker threads according to VLAN
> > and MAC address hash.
> > >
> > > While working on performance optimization we determined that most of
> > CPU usage of this thread is in DPDK.
> > > As and optimization we decided to switch to rx burst alloc function,
> > > however that caused additional performance degradation compared to
> > scatter rx mode.
> > > In profiler two major culprits were:
> > >   1. Access to packet data Eth header in application code. (cache miss)
> > >   2. Setting next packet descriptor field to NULL in DPDK
> > > i40e_rx_alloc_bufs code. (this field is in second descriptor cache
> > > line that was not
> > > prefetched)
> >
> > I wonder what will happen if we'll remove any prefetches here?
> > Would it make things better or worse (and by how much)?
> 
> In our case it causes few per cent PPS degradation on next=NULL assignment but it seems that JingJing's test doesn't confirm it.
> 
> >
> > > After applying my fixes performance improved compared to scatter rx
> > mode.
> > >
> > > I assumed that prefetch of first cache line of packet data belongs to
> > > DPDK because it is done in scatter rx mode. (in
> > > i40e_recv_scattered_pkts)
> > > It can be moved to application side but IMO it is better to be consistent
> > across all rx modes.
> >
> > I would agree with Jingjing here, probably PMD should avoid to prefetch
> > packet's data.
> 
> Actually I can see some valid use cases where it is beneficial to have this prefetch in driver.
> In our sw distributor case it is trivial to just prefetch next packet on each iteration because packets are processed one by one.
> However when we move this functionality to hw by means of RSS/vfunction/FlowDirector(our long term goal) worker threads will receive
> packets directly from rx queues of NIC.
> First operation of worker thread is to perform bulk lookup in hash table by destination MAC. This will cause cache miss on accessing each
> eth header and can't be easily mitigated in application code.
> I assume it is ubiquitous use case for DPDK.

Yes, it is quite a common use-case.
Though in many cases it is possible to reorder user code to hide (or minimize) that data-access latency.
On the other hand, there are scenarios where this prefetch is excessive and can cause some drop in performance.
Again, as far as I know, none of the PMDs for Intel devices prefetches packet data in simple (single-segment) RX mode.
Another thing some people may then argue is why only one cache line is prefetched, when some use-cases might need to look at the second one as well.

Konstantin

> 
> Regards,
> Vladyslav
  
Bruce Richardson Oct. 13, 2016, 10:18 a.m. UTC | #7
On Wed, Oct 12, 2016 at 12:04:39AM +0000, Ananyev, Konstantin wrote:
> Hi Vladislav,
> 
> > > > > >
> > > > > > On 7/14/2016 6:27 PM, Vladyslav Buslov wrote:
> > > > > > > Added prefetch of first packet payload cacheline in
> > > > > > > i40e_rx_scan_hw_ring Added prefetch of second mbuf cacheline in
> > > > > > > i40e_rx_alloc_bufs
> > > > > > >
> > > > > > > Signed-off-by: Vladyslav Buslov
> > > > > > > <vladyslav.buslov@harmonicinc.com>
> > > > > > > ---
> > > > > > >  drivers/net/i40e/i40e_rxtx.c | 7 +++++--
> > > > > > >  1 file changed, 5 insertions(+), 2 deletions(-)
> > > > > > >
> > > > > > > diff --git a/drivers/net/i40e/i40e_rxtx.c
> > > > > > > b/drivers/net/i40e/i40e_rxtx.c index d3cfb98..e493fb4 100644
> > > > > > > --- a/drivers/net/i40e/i40e_rxtx.c
> > > > > > > +++ b/drivers/net/i40e/i40e_rxtx.c
> > > > > > > @@ -1003,6 +1003,7 @@ i40e_rx_scan_hw_ring(struct
> > > i40e_rx_queue
> > > > > *rxq)
> > > > > > >                 /* Translate descriptor info to mbuf parameters */
> > > > > > >                 for (j = 0; j < nb_dd; j++) {
> > > > > > >                         mb = rxep[j].mbuf;
> > > > > > > +                       rte_prefetch0(RTE_PTR_ADD(mb->buf_addr,
> > > > > > RTE_PKTMBUF_HEADROOM));
> > > > >
> > > > > Why did prefetch here? I think if application need to deal with
> > > > > packet, it is more suitable to put it in application.
> > > > >
> > > > > > >                         qword1 = rte_le_to_cpu_64(\
> > > > > > >                                 rxdp[j].wb.qword1.status_error_len);
> > > > > > >                         pkt_len = ((qword1 &
> > > > > > I40E_RXD_QW1_LENGTH_PBUF_MASK) >>
> > > > > > > @@ -1086,9 +1087,11 @@ i40e_rx_alloc_bufs(struct i40e_rx_queue
> > > > > *rxq)
> > > > > > >
> > > > > > >         rxdp = &rxq->rx_ring[alloc_idx];
> > > > > > >         for (i = 0; i < rxq->rx_free_thresh; i++) {
> > > > > > > -               if (likely(i < (rxq->rx_free_thresh - 1)))
> > > > > > > +               if (likely(i < (rxq->rx_free_thresh - 1))) {
> > > > > > >                         /* Prefetch next mbuf */
> > > > > > > -                       rte_prefetch0(rxep[i + 1].mbuf);
> > > > > > > +                       rte_prefetch0(&rxep[i + 1].mbuf->cacheline0);
> > > > > > > +                       rte_prefetch0(&rxep[i +
> > > > > > > + 1].mbuf->cacheline1);
> > >
> > > I think there are rte_mbuf_prefetch_part1/part2 defined in rte_mbuf.h,
> > > specially for that case.
> > 
> > Thanks for pointing that out.
> > I'll submit new patch if you decide to move forward with this development.
> > 
> > >
> > > > > > > +               }
> > > > > Agree with this change. And when I test it by testpmd with iofwd, no
> > > > > performance increase is observed but minor decrease.
> > > > > Can you share will us when it will benefit the performance in your
> > > scenario ?
> > > > >
> > > > >
> > > > > Thanks
> > > > > Jingjing
> > > >
> > > > Hello Jingjing,
> > > >
> > > > Thanks for code review.
> > > >
> > > > My use case: We have simple distributor thread that receives packets
> > > > from port and distributes them among worker threads according to VLAN
> > > and MAC address hash.
> > > >
> > > > While working on performance optimization we determined that most of
> > > CPU usage of this thread is in DPDK.
> > > > As and optimization we decided to switch to rx burst alloc function,
> > > > however that caused additional performance degradation compared to
> > > scatter rx mode.
> > > > In profiler two major culprits were:
> > > >   1. Access to packet data Eth header in application code. (cache miss)
> > > >   2. Setting next packet descriptor field to NULL in DPDK
> > > > i40e_rx_alloc_bufs code. (this field is in second descriptor cache
> > > > line that was not
> > > > prefetched)
> > >
> > > I wonder what will happen if we'll remove any prefetches here?
> > > Would it make things better or worse (and by how much)?
> > 
> > In our case it causes few per cent PPS degradation on next=NULL assignment but it seems that JingJing's test doesn't confirm it.
> > 
> > >
> > > > After applying my fixes performance improved compared to scatter rx
> > > mode.
> > > >
> > > > I assumed that prefetch of first cache line of packet data belongs to
> > > > DPDK because it is done in scatter rx mode. (in
> > > > i40e_recv_scattered_pkts)
> > > > It can be moved to application side but IMO it is better to be consistent
> > > across all rx modes.
> > >
> > > I would agree with Jingjing here, probably PMD should avoid to prefetch
> > > packet's data.
> > 
> > Actually I can see some valid use cases where it is beneficial to have this prefetch in driver.
> > In our sw distributor case it is trivial to just prefetch next packet on each iteration because packets are processed one by one.
> > However when we move this functionality to hw by means of RSS/vfunction/FlowDirector(our long term goal) worker threads will receive
> > packets directly from rx queues of NIC.
> > First operation of worker thread is to perform bulk lookup in hash table by destination MAC. This will cause cache miss on accessing each
> > eth header and can't be easily mitigated in application code.
> > I assume it is ubiquitous use case for DPDK.
> 
> Yes it is a quite common use-case.
> Though I many cases it is possible to reorder user code to hide (or minimize) that data-access latency.
> From other side there are scenarios where this prefetch is excessive and can cause some drop in performance.
> Again, as I know, none of PMDs for Intel devices prefetches packet's data in  simple (single segment) RX mode.
> Another thing that some people may argue then - why only one cache line is prefetched,
> in some use-cases might need to look at 2-nd one.  
> 
There is a build-time config setting for this behaviour for exactly the reasons
called out here - in some apps you get a benefit, in others you see a perf
hit. The default is "on", which makes sense for most cases, I think.
From common_base:

CONFIG_RTE_PMD_PACKET_PREFETCH=y$

/Bruce
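
For context, the Intel PMDs usually consume that build-time option through a local macro that compiles the data prefetch out when the option is disabled. A sketch of the pattern (based on the existing drivers, not on a revision of this patch):

	#ifdef RTE_PMD_PACKET_PREFETCH
	#define rte_packet_prefetch(p)	rte_prefetch0(p)
	#else
	#define rte_packet_prefetch(p)	do { } while (0)
	#endif

	/* A data prefetch that honours the build-time option would then be: */
	rte_packet_prefetch(RTE_PTR_ADD(mb->buf_addr, RTE_PKTMBUF_HEADROOM));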
  
Ananyev, Konstantin Oct. 13, 2016, 10:30 a.m. UTC | #8
> -----Original Message-----
> From: Richardson, Bruce
> Sent: Thursday, October 13, 2016 11:19 AM
> To: Ananyev, Konstantin <konstantin.ananyev@intel.com>
> Cc: Vladyslav Buslov <Vladyslav.Buslov@harmonicinc.com>; Wu, Jingjing <jingjing.wu@intel.com>; Yigit, Ferruh <ferruh.yigit@intel.com>;
> Zhang, Helin <helin.zhang@intel.com>; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH] net/i40e: add additional prefetch instructions for bulk rx
> 
> On Wed, Oct 12, 2016 at 12:04:39AM +0000, Ananyev, Konstantin wrote:
> > Hi Vladislav,
> >
> > > > > > >
> > > > > > > On 7/14/2016 6:27 PM, Vladyslav Buslov wrote:
> > > > > > > > Added prefetch of first packet payload cacheline in
> > > > > > > > i40e_rx_scan_hw_ring Added prefetch of second mbuf cacheline in
> > > > > > > > i40e_rx_alloc_bufs
> > > > > > > >
> > > > > > > > Signed-off-by: Vladyslav Buslov
> > > > > > > > <vladyslav.buslov@harmonicinc.com>
> > > > > > > > ---
> > > > > > > >  drivers/net/i40e/i40e_rxtx.c | 7 +++++--
> > > > > > > >  1 file changed, 5 insertions(+), 2 deletions(-)
> > > > > > > >
> > > > > > > > diff --git a/drivers/net/i40e/i40e_rxtx.c
> > > > > > > > b/drivers/net/i40e/i40e_rxtx.c index d3cfb98..e493fb4 100644
> > > > > > > > --- a/drivers/net/i40e/i40e_rxtx.c
> > > > > > > > +++ b/drivers/net/i40e/i40e_rxtx.c
> > > > > > > > @@ -1003,6 +1003,7 @@ i40e_rx_scan_hw_ring(struct
> > > > i40e_rx_queue
> > > > > > *rxq)
> > > > > > > >                 /* Translate descriptor info to mbuf parameters */
> > > > > > > >                 for (j = 0; j < nb_dd; j++) {
> > > > > > > >                         mb = rxep[j].mbuf;
> > > > > > > > +                       rte_prefetch0(RTE_PTR_ADD(mb->buf_addr,
> > > > > > > RTE_PKTMBUF_HEADROOM));
> > > > > >
> > > > > > Why did prefetch here? I think if application need to deal with
> > > > > > packet, it is more suitable to put it in application.
> > > > > >
> > > > > > > >                         qword1 = rte_le_to_cpu_64(\
> > > > > > > >                                 rxdp[j].wb.qword1.status_error_len);
> > > > > > > >                         pkt_len = ((qword1 &
> > > > > > > I40E_RXD_QW1_LENGTH_PBUF_MASK) >>
> > > > > > > > @@ -1086,9 +1087,11 @@ i40e_rx_alloc_bufs(struct i40e_rx_queue
> > > > > > *rxq)
> > > > > > > >
> > > > > > > >         rxdp = &rxq->rx_ring[alloc_idx];
> > > > > > > >         for (i = 0; i < rxq->rx_free_thresh; i++) {
> > > > > > > > -               if (likely(i < (rxq->rx_free_thresh - 1)))
> > > > > > > > +               if (likely(i < (rxq->rx_free_thresh - 1))) {
> > > > > > > >                         /* Prefetch next mbuf */
> > > > > > > > -                       rte_prefetch0(rxep[i + 1].mbuf);
> > > > > > > > +                       rte_prefetch0(&rxep[i + 1].mbuf->cacheline0);
> > > > > > > > +                       rte_prefetch0(&rxep[i +
> > > > > > > > + 1].mbuf->cacheline1);
> > > >
> > > > I think there are rte_mbuf_prefetch_part1/part2 defined in rte_mbuf.h,
> > > > specially for that case.
> > >
> > > Thanks for pointing that out.
> > > I'll submit new patch if you decide to move forward with this development.
> > >
> > > >
> > > > > > > > +               }
> > > > > > Agree with this change. And when I test it by testpmd with iofwd, no
> > > > > > performance increase is observed but minor decrease.
> > > > > > Can you share will us when it will benefit the performance in your
> > > > scenario ?
> > > > > >
> > > > > >
> > > > > > Thanks
> > > > > > Jingjing
> > > > >
> > > > > Hello Jingjing,
> > > > >
> > > > > Thanks for code review.
> > > > >
> > > > > My use case: We have simple distributor thread that receives packets
> > > > > from port and distributes them among worker threads according to VLAN
> > > > and MAC address hash.
> > > > >
> > > > > While working on performance optimization we determined that most of
> > > > CPU usage of this thread is in DPDK.
> > > > > As and optimization we decided to switch to rx burst alloc function,
> > > > > however that caused additional performance degradation compared to
> > > > scatter rx mode.
> > > > > In profiler two major culprits were:
> > > > >   1. Access to packet data Eth header in application code. (cache miss)
> > > > >   2. Setting next packet descriptor field to NULL in DPDK
> > > > > i40e_rx_alloc_bufs code. (this field is in second descriptor cache
> > > > > line that was not
> > > > > prefetched)
> > > >
> > > > I wonder what will happen if we'll remove any prefetches here?
> > > > Would it make things better or worse (and by how much)?
> > >
> > > In our case it causes few per cent PPS degradation on next=NULL assignment but it seems that JingJing's test doesn't confirm it.
> > >
> > > >
> > > > > After applying my fixes performance improved compared to scatter rx
> > > > mode.
> > > > >
> > > > > I assumed that prefetch of first cache line of packet data belongs to
> > > > > DPDK because it is done in scatter rx mode. (in
> > > > > i40e_recv_scattered_pkts)
> > > > > It can be moved to application side but IMO it is better to be consistent
> > > > across all rx modes.
> > > >
> > > > I would agree with Jingjing here, probably PMD should avoid to prefetch
> > > > packet's data.
> > >
> > > Actually I can see some valid use cases where it is beneficial to have this prefetch in driver.
> > > In our sw distributor case it is trivial to just prefetch next packet on each iteration because packets are processed one by one.
> > > However when we move this functionality to hw by means of RSS/vfunction/FlowDirector(our long term goal) worker threads will
> receive
> > > packets directly from rx queues of NIC.
> > > First operation of worker thread is to perform bulk lookup in hash table by destination MAC. This will cause cache miss on accessing
> each
> > > eth header and can't be easily mitigated in application code.
> > > I assume it is ubiquitous use case for DPDK.
> >
> > Yes it is a quite common use-case.
> > Though I many cases it is possible to reorder user code to hide (or minimize) that data-access latency.
> > From other side there are scenarios where this prefetch is excessive and can cause some drop in performance.
> > Again, as I know, none of PMDs for Intel devices prefetches packet's data in  simple (single segment) RX mode.
> > Another thing that some people may argue then - why only one cache line is prefetched,
> > in some use-cases might need to look at 2-nd one.
> >
> There is a build-time config setting for this behaviour for exactly the reasons
> called out here - in some apps you get a benefit, in others you see a perf
> hit. The default is "on", which makes sense for most cases, I think.
> From common_base:
> 
> CONFIG_RTE_PMD_PACKET_PREFETCH=y$

Yes, but right now the i40e and ixgbe non-scattered RX paths (both vector and scalar) just ignore that flag.
Though yes, it might be a good thing to make them obey that flag properly.
Konstantin
  
Ferruh Yigit Nov. 15, 2016, 12:19 p.m. UTC | #9
On 10/13/2016 11:30 AM, Ananyev, Konstantin wrote:

<...>

>>>>
>>>> Actually I can see some valid use cases where it is beneficial to have this prefetch in driver.
>>>> In our sw distributor case it is trivial to just prefetch next packet on each iteration because packets are processed one by one.
>>>> However when we move this functionality to hw by means of RSS/vfunction/FlowDirector(our long term goal) worker threads will
>> receive
>>>> packets directly from rx queues of NIC.
>>>> First operation of worker thread is to perform bulk lookup in hash table by destination MAC. This will cause cache miss on accessing
>> each
>>>> eth header and can't be easily mitigated in application code.
>>>> I assume it is ubiquitous use case for DPDK.
>>>
>>> Yes it is a quite common use-case.
>>> Though I many cases it is possible to reorder user code to hide (or minimize) that data-access latency.
>>> From other side there are scenarios where this prefetch is excessive and can cause some drop in performance.
>>> Again, as I know, none of PMDs for Intel devices prefetches packet's data in  simple (single segment) RX mode.
>>> Another thing that some people may argue then - why only one cache line is prefetched,
>>> in some use-cases might need to look at 2-nd one.
>>>
>> There is a build-time config setting for this behaviour for exactly the reasons
>> called out here - in some apps you get a benefit, in others you see a perf
>> hit. The default is "on", which makes sense for most cases, I think.
>> From common_base:
>>
>> CONFIG_RTE_PMD_PACKET_PREFETCH=y$
> 
> Yes, but right now i40e and ixgbe non-scattered RX (both vector and scalar) just ignore that flag.
> Though yes, might be a good thing to make them to obey that flag properly.

Hi Vladyslav,

Following Konstantin's comment, what do you think about updating the patch
to do the prefetch within a CONFIG_RTE_PMD_PACKET_PREFETCH ifdef?

But since the config option is enabled by default, the performance concern
is still valid and needs to be investigated.

Thanks,
ferruh
  
Vladyslav Buslov Nov. 15, 2016, 1:27 p.m. UTC | #10
> -----Original Message-----
> From: Ferruh Yigit [mailto:ferruh.yigit@intel.com]
> Sent: Tuesday, November 15, 2016 2:19 PM
> To: Ananyev, Konstantin; Richardson, Bruce
> Cc: Vladyslav Buslov; Wu, Jingjing; Zhang, Helin; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH] net/i40e: add additional prefetch
> instructions for bulk rx
> 
> On 10/13/2016 11:30 AM, Ananyev, Konstantin wrote:
> 
> <...>
> 
> >>>>
> >>>> Actually I can see some valid use cases where it is beneficial to have this
> prefetch in driver.
> >>>> In our sw distributor case it is trivial to just prefetch next packet on
> each iteration because packets are processed one by one.
> >>>> However when we move this functionality to hw by means of
> >>>> RSS/vfunction/FlowDirector(our long term goal) worker threads will
> >> receive
> >>>> packets directly from rx queues of NIC.
> >>>> First operation of worker thread is to perform bulk lookup in hash
> >>>> table by destination MAC. This will cause cache miss on accessing
> >> each
> >>>> eth header and can't be easily mitigated in application code.
> >>>> I assume it is ubiquitous use case for DPDK.
> >>>
> >>> Yes it is a quite common use-case.
> >>> Though I many cases it is possible to reorder user code to hide (or
> minimize) that data-access latency.
> >>> From other side there are scenarios where this prefetch is excessive and
> can cause some drop in performance.
> >>> Again, as I know, none of PMDs for Intel devices prefetches packet's
> data in  simple (single segment) RX mode.
> >>> Another thing that some people may argue then - why only one cache
> >>> line is prefetched, in some use-cases might need to look at 2-nd one.
> >>>
> >> There is a build-time config setting for this behaviour for exactly
> >> the reasons called out here - in some apps you get a benefit, in
> >> others you see a perf hit. The default is "on", which makes sense for most
> cases, I think.
> >> From common_base:
> >>
> >> CONFIG_RTE_PMD_PACKET_PREFETCH=y$
> >
> > Yes, but right now i40e and ixgbe non-scattered RX (both vector and scalar)
> just ignore that flag.
> > Though yes, might be a good thing to make them to obey that flag
> properly.
> 
> Hi Vladyslav,
> 
> According Konstantin's comment, what do you think updating patch to do
> prefetch within CONFIG_RTE_PMD_PACKET_PREFETCH ifdef?
> 
> But since config option is enabled by default, performance concern is still
> valid and needs to be investigated.
> 
> Thanks,
> ferruh

Hi Ferruh,

I'll update my patch according to the code review suggestions.

Regards,
Vlad
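
Combining the review suggestions above, a re-spun version of the hunks might look roughly like this (a sketch only; no such follow-up appears in this thread):

	/* i40e_rx_scan_hw_ring(): honour CONFIG_RTE_PMD_PACKET_PREFETCH */
	mb = rxep[j].mbuf;
	rte_packet_prefetch(RTE_PTR_ADD(mb->buf_addr, RTE_PKTMBUF_HEADROOM));

	/* i40e_rx_alloc_bufs(): prefetch both mbuf cache lines via helpers */
	if (likely(i < (rxq->rx_free_thresh - 1))) {
		rte_mbuf_prefetch_part1(rxep[i + 1].mbuf);
		rte_mbuf_prefetch_part2(rxep[i + 1].mbuf);
	}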
  

Patch

diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index d3cfb98..e493fb4 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -1003,6 +1003,7 @@  i40e_rx_scan_hw_ring(struct i40e_rx_queue *rxq)
 		/* Translate descriptor info to mbuf parameters */
 		for (j = 0; j < nb_dd; j++) {
 			mb = rxep[j].mbuf;
+			rte_prefetch0(RTE_PTR_ADD(mb->buf_addr, RTE_PKTMBUF_HEADROOM));
 			qword1 = rte_le_to_cpu_64(\
 				rxdp[j].wb.qword1.status_error_len);
 			pkt_len = ((qword1 & I40E_RXD_QW1_LENGTH_PBUF_MASK) >>
@@ -1086,9 +1087,11 @@  i40e_rx_alloc_bufs(struct i40e_rx_queue *rxq)
 
 	rxdp = &rxq->rx_ring[alloc_idx];
 	for (i = 0; i < rxq->rx_free_thresh; i++) {
-		if (likely(i < (rxq->rx_free_thresh - 1)))
+		if (likely(i < (rxq->rx_free_thresh - 1))) {
 			/* Prefetch next mbuf */
-			rte_prefetch0(rxep[i + 1].mbuf);
+			rte_prefetch0(&rxep[i + 1].mbuf->cacheline0);
+			rte_prefetch0(&rxep[i + 1].mbuf->cacheline1);
+		}
 
 		mb = rxep[i].mbuf;
 		rte_mbuf_refcnt_set(mb, 1);