[dpdk-dev,v3,0/5] vhost: optimize enqueue

Message ID 20160927102123.GL25823@yliu-dev.sh.intel.com (mailing list archive)
State Not Applicable, archived
Delegated to: Yuanhan Liu

Commit Message

Yuanhan Liu Sept. 27, 2016, 10:21 a.m. UTC
  On Thu, Sep 22, 2016 at 05:01:41PM +0800, Jianbo Liu wrote:
> On 22 September 2016 at 14:58, Wang, Zhihong <zhihong.wang@intel.com> wrote:
> >
> >
> >> -----Original Message-----
> >> From: Jianbo Liu [mailto:jianbo.liu@linaro.org]
> >> Sent: Thursday, September 22, 2016 1:48 PM
> >> To: Yuanhan Liu <yuanhan.liu@linux.intel.com>
> >> Cc: Wang, Zhihong <zhihong.wang@intel.com>; Maxime Coquelin
> >> <maxime.coquelin@redhat.com>; dev@dpdk.org
> >> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
> >>
> >> On 22 September 2016 at 10:29, Yuanhan Liu <yuanhan.liu@linux.intel.com>
> >> wrote:
> >> > On Wed, Sep 21, 2016 at 08:54:11PM +0800, Jianbo Liu wrote:
> >> >> >> > My setup consists of one host running a guest.
> >> >> >> > The guest generates as many 64-byte packets as possible using
> >> >> >>
> >> >> >> Have you tested with other packet sizes?
> >> >> >> My testing shows that performance drops when the packet size is
> >> >> >> larger than 256.
> >> >> >
> >> >> >
> >> >> > Hi Jianbo,
> >> >> >
> >> >> > Thanks for reporting this.
> >> >> >
> >> >> >  1. Are you running the vector frontend with mrg_rxbuf=off?
> >> >> >
> >> Yes, my testing uses mrg_rxbuf=off, but not the vector frontend PMD.
> >>
> >> >> >  2. Could you please specify what CPU you're running? Is it Haswell
> >> >> >     or Ivy Bridge?
> >> >> >
> >> It's an ARM server.
> >>
> >> >> >  3. What percentage of drop are you seeing?
> >> The testing results:
> >> size (bytes)     improvement (%)
> >> 64                   3.92
> >> 128                 11.51
> >> 256                  24.16
> >> 512                  -13.79
> >> 1024                -22.51
> >> 1500                -12.22
> >> A correction: performance drops if the packet size is larger than 512.
> >
> >
> > Jianbo,
> >
> > Could you please verify whether this patch really causes enqueue perf to drop?
> >
> > You can test the enqueue path only by setting the guest to do rxonly,
> > and compare the mpps via "show port stats all" in the guest.
> >
> >
> Tested with testpmd, host: txonly, guest: rxonly
> size (bytes)     improvement (%)
> 64                    4.12
> 128                   6
> 256                   2.65
> 512                   -1.12
> 1024                 -7.02

There is a difference between Zhihong's code and the old code that I
spotted the first time: Zhihong removed the avail_idx prefetch. I
understand the prefetch becomes a bit tricky when the mrg-rx code path
is considered; thus, I didn't comment on that.

That's one of the differences that, IMO, could cause a regression. I
finally got a chance to add it back.

A rough test shows it greatly improves the performance for the 1400B
packet size in the "txonly in host and rxonly in guest" case: +33% is
the number I get with my test server (Ivy Bridge).

I guess this might help your case as well. Would you mind running a
test and telling me the results?

BTW, I made it in a rush; I haven't tested the mrg-rx code path yet.
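
The core of the idea, as implemented in the patch attached at the
bottom of this page, is to read a batch of avail ring entries into a
local buffer up front instead of fetching them one at a time:

	/* excerpt from the attached patch */
	static void
	prefetch_avail_idx(struct vhost_virtqueue *vq)
	{
		int i;

		for (i = 0; i < NR_AVAIL_IDX_PREFETCH; i++) {
			vq->avail_idx_buf[i] =
				vq->avail->ring[(vq->last_avail_idx + i) &
					(vq->size - 1)];
		}
	}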

Thanks.

	--yliu
commit e5852d04bf87c02d6d0d8e6d8ded4c33030b9c9e
Author: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Date:   Tue Sep 27 17:51:15 2016 +0800

    xxxx
    
    Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
  

Comments

Zhihong Wang Sept. 27, 2016, 4:45 p.m. UTC | #1
> -----Original Message-----
> From: Yuanhan Liu [mailto:yuanhan.liu@linux.intel.com]
> Sent: Tuesday, September 27, 2016 6:21 PM
> To: Jianbo Liu <jianbo.liu@linaro.org>
> Cc: Wang, Zhihong <zhihong.wang@intel.com>; Maxime Coquelin
> <maxime.coquelin@redhat.com>; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
> 
> On Thu, Sep 22, 2016 at 05:01:41PM +0800, Jianbo Liu wrote:
> > On 22 September 2016 at 14:58, Wang, Zhihong <zhihong.wang@intel.com>
> wrote:
> > >
> > >
> > >> -----Original Message-----
> > >> From: Jianbo Liu [mailto:jianbo.liu@linaro.org]
> > >> Sent: Thursday, September 22, 2016 1:48 PM
> > >> To: Yuanhan Liu <yuanhan.liu@linux.intel.com>
> > >> Cc: Wang, Zhihong <zhihong.wang@intel.com>; Maxime Coquelin
> > >> <maxime.coquelin@redhat.com>; dev@dpdk.org
> > >> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
> > >>
> > >> On 22 September 2016 at 10:29, Yuanhan Liu <yuanhan.liu@linux.intel.com>
> > >> wrote:
> > >> > On Wed, Sep 21, 2016 at 08:54:11PM +0800, Jianbo Liu wrote:
> > >> >> >> > My setup consists of one host running a guest.
> > >> >> >> > The guest generates as many 64-byte packets as possible using
> > >> >> >>
> > >> >> >> Have you tested with other packet sizes?
> > >> >> >> My testing shows that performance drops when the packet size is
> > >> >> >> larger than 256.
> > >> >> >
> > >> >> >
> > >> >> > Hi Jianbo,
> > >> >> >
> > >> >> > Thanks for reporting this.
> > >> >> >
> > >> >> >  1. Are you running the vector frontend with mrg_rxbuf=off?
> > >> >> >
> > >> Yes, my testing uses mrg_rxbuf=off, but not the vector frontend PMD.
> > >>
> > >> >> >  2. Could you please specify what CPU you're running? Is it Haswell
> > >> >> >     or Ivy Bridge?
> > >> >> >
> > >> It's an ARM server.
> > >>
> > >> >> >  3. What percentage of drop are you seeing?
> > >> The testing results:
> > >> size (bytes)     improvement (%)
> > >> 64                   3.92
> > >> 128                 11.51
> > >> 256                  24.16
> > >> 512                  -13.79
> > >> 1024                -22.51
> > >> 1500                -12.22
> > >> A correction: performance drops if the packet size is larger than 512.
> > >
> > >
> > > Jianbo,
> > >
> > > Could you please verify whether this patch really causes enqueue perf to drop?
> > >
> > > You can test the enqueue path only by setting the guest to do rxonly,
> > > and compare the mpps via "show port stats all" in the guest.
> > >
> > >
> > Tested with testpmd, host: txonly, guest: rxonly
> > size (bytes)     improvement (%)
> > 64                    4.12
> > 128                   6
> > 256                   2.65
> > 512                   -1.12
> > 1024                 -7.02
> 
> There is a difference between Zhihong's code and the old code that I
> spotted the first time: Zhihong removed the avail_idx prefetch. I
> understand the prefetch becomes a bit tricky when the mrg-rx code path
> is considered; thus, I didn't comment on that.
> 
> That's one of the differences that, IMO, could cause a regression. I
> finally got a chance to add it back.
> 
> A rough test shows it greatly improves the performance for the 1400B
> packet size in the "txonly in host and rxonly in guest" case: +33% is
> the number I get with my test server (Ivy Bridge).

Thanks Yuanhan! I'll validate this on x86.

> 
> I guess this might help your case as well. Would you mind running a
> test and telling me the results?
> 
> BTW, I made it in a rush; I haven't tested the mrg-rx code path yet.
> 
> Thanks.
> 
> 	--yliu
  
Zhihong Wang Oct. 9, 2016, 12:09 p.m. UTC | #2
> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Wang, Zhihong
> Sent: Wednesday, September 28, 2016 12:45 AM
> To: Yuanhan Liu <yuanhan.liu@linux.intel.com>; Jianbo Liu
> <jianbo.liu@linaro.org>
> Cc: Maxime Coquelin <maxime.coquelin@redhat.com>; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
> 
> 
> 
> > -----Original Message-----
> > From: Yuanhan Liu [mailto:yuanhan.liu@linux.intel.com]
> > Sent: Tuesday, September 27, 2016 6:21 PM
> > To: Jianbo Liu <jianbo.liu@linaro.org>
> > Cc: Wang, Zhihong <zhihong.wang@intel.com>; Maxime Coquelin
> > <maxime.coquelin@redhat.com>; dev@dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
> >
> > On Thu, Sep 22, 2016 at 05:01:41PM +0800, Jianbo Liu wrote:
> > > On 22 September 2016 at 14:58, Wang, Zhihong
> <zhihong.wang@intel.com>
> > wrote:
> > > >
> > > >
> > > >> -----Original Message-----
> > > >> From: Jianbo Liu [mailto:jianbo.liu@linaro.org]
> > > >> Sent: Thursday, September 22, 2016 1:48 PM
> > > >> To: Yuanhan Liu <yuanhan.liu@linux.intel.com>
> > > >> Cc: Wang, Zhihong <zhihong.wang@intel.com>; Maxime Coquelin
> > > >> <maxime.coquelin@redhat.com>; dev@dpdk.org
> > > >> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
> > > >>
> > > >> On 22 September 2016 at 10:29, Yuanhan Liu
> <yuanhan.liu@linux.intel.com>
> > > >> wrote:
> > > >> > On Wed, Sep 21, 2016 at 08:54:11PM +0800, Jianbo Liu wrote:
> > > >> >> >> > My setup consists of one host running a guest.
> > > >> >> >> > The guest generates as many 64-byte packets as possible using
> > > >> >> >>
> > > >> >> >> Have you tested with other packet sizes?
> > > >> >> >> My testing shows that performance drops when the packet size is
> > > >> >> >> larger than 256.
> > > >> >> >
> > > >> >> >
> > > >> >> > Hi Jianbo,
> > > >> >> >
> > > >> >> > Thanks for reporting this.
> > > >> >> >
> > > >> >> >  1. Are you running the vector frontend with mrg_rxbuf=off?
> > > >> >> >
> > > >> Yes, my testing uses mrg_rxbuf=off, but not the vector frontend PMD.
> > > >>
> > > >> >> >  2. Could you please specify what CPU you're running? Is it
> Haswell
> > > >> >> >     or Ivy Bridge?
> > > >> >> >
> > > >> It's an ARM server.
> > > >>
> > > >> >> >  3. What percentage of drop are you seeing?
> > > >> The testing results:
> > > >> size (bytes)     improvement (%)
> > > >> 64                   3.92
> > > >> 128                 11.51
> > > >> 256                  24.16
> > > >> 512                  -13.79
> > > >> 1024                -22.51
> > > >> 1500                -12.22
> > > >> A correction: performance drops if the packet size is larger
> > > >> than 512.
> > > >
> > > >
> > > > Jianbo,
> > > >
> > > > Could you please verify whether this patch really causes enqueue
> > > > perf to drop?
> > > >
> > > > You can test the enqueue path only by setting the guest to do
> > > > rxonly, and compare the mpps via "show port stats all" in the guest.
> > > >
> > > >
> > > Tested with testpmd, host: txonly, guest: rxonly
> > > size (bytes)     improvement (%)
> > > 64                    4.12
> > > 128                   6
> > > 256                   2.65
> > > 512                   -1.12
> > > 1024                 -7.02
> >
> > There is a difference between Zhihong's code and the old code that I
> > spotted the first time: Zhihong removed the avail_idx prefetch. I
> > understand the prefetch becomes a bit tricky when the mrg-rx code path
> > is considered; thus, I didn't comment on that.
> >
> > That's one of the differences that, IMO, could cause a regression. I
> > finally got a chance to add it back.
> >
> > A rough test shows it greatly improves the performance for the 1400B
> > packet size in the "txonly in host and rxonly in guest" case: +33% is
> > the number I get with my test server (Ivy Bridge).
> 
> Thanks Yuanhan! I'll validate this on x86.

Hi Yuanhan,

It seems your code doesn't perform correctly. I wrote a new version
of the avail idx prefetch but didn't see any perf benefit.

To be honest, I doubt the benefit of this idea. The previous mrg_off
code used this method but didn't give any benefit.

Even if this were useful, the benefits should be more significant for
small packets; it's unlikely this simple idx prefetch could bring an
over-30% perf gain for large packets like 1400B ones.

But if you really do make it work like that, I'll be very glad to see it.

Thanks
Zhihong

> 
> >
> > I guess this might help your case as well. Would you mind running a
> > test and telling me the results?
> >
> > BTW, I made it in a rush; I haven't tested the mrg-rx code path yet.
> >
> > Thanks.
> >
> > 	--yliu
  
Yuanhan Liu Oct. 10, 2016, 2:44 a.m. UTC | #3
On Sun, Oct 09, 2016 at 12:09:07PM +0000, Wang, Zhihong wrote:
> > > > Tested with testpmd, host: txonly, guest: rxonly
> > > > size (bytes)     improvement (%)
> > > > 64                    4.12
> > > > 128                   6
> > > > 256                   2.65
> > > > 512                   -1.12
> > > > 1024                 -7.02
> > >
> > > There is a difference between Zhihong's code and the old code that I
> > > spotted the first time: Zhihong removed the avail_idx prefetch. I
> > > understand the prefetch becomes a bit tricky when the mrg-rx code path
> > > is considered; thus, I didn't comment on that.
> > >
> > > That's one of the differences that, IMO, could cause a regression. I
> > > finally got a chance to add it back.
> > >
> > > A rough test shows it greatly improves the performance for the 1400B
> > > packet size in the "txonly in host and rxonly in guest" case: +33% is
> > > the number I get with my test server (Ivy Bridge).
> > 
> > Thanks Yuanhan! I'll validate this on x86.
> 
> Hi Yuanhan,
> 
> It seems your code doesn't perform correctly. I wrote a new version
> of the avail idx prefetch but didn't see any perf benefit.
> 
> To be honest, I doubt the benefit of this idea. The previous mrg_off
> code used this method but didn't give any benefit.

Good point. I thought of that before, too. But as I said, I made it in
a rush and didn't think further or test more.

I looked at the code a bit closer this time and spotted a bug: the
prefetch actually didn't happen, due to the following code piece:

	if (vq->next_avail_idx >= NR_AVAIL_IDX_PREFETCH) {
		prefetch_avail_idx(vq);
		...
	}

Since vq->next_avail_idx is set to 0 at the entrance of the enqueue
path, prefetch_avail_idx() won't be called until NR_AVAIL_IDX_PREFETCH
entries have been consumed, so the first reads hit an unfilled
avail_idx_buf. The fix is easy though: just put prefetch_avail_idx
before invoking enqueue_packet.
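
A minimal sketch of that placement (hypothetical; the helper names are
the ones from the patch attached to the first message, and the
last_avail_idx bookkeeping would need a matching adjustment):

	/* entrance of the enqueue path */
	vq->next_avail_idx = 0;
	prefetch_avail_idx(vq);	/* fill avail_idx_buf[] up front */

	while (pkt_left && avail_idx != vq->last_used_idx) {
		if (enqueue_packet(dev, vq, avail_idx,
				pkts[pkt_idx], is_mrg_rxbuf))
			break;
		pkt_idx++;
		pkt_left--;
	}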

In summary, Zhihong is right: I see no more gains with that fix :(

However, as stated, that's about the only difference I found between
your code and the old code, so maybe it's still worthwhile to run a
test on ARM, Jianbo?

	--yliu

> Even if this were useful, the benefits should be more significant for
> small packets; it's unlikely this simple idx prefetch could bring an
> over-30% perf gain for large packets like 1400B ones.
> 
> But if you really do make it work like that, I'll be very glad to see it.
> 
> Thanks
> Zhihong
> 
> > 
> > >
> > > I guess this might help your case as well. Would you mind running a
> > > test and telling me the results?
> > >
> > > BTW, I made it in a rush; I haven't tested the mrg-rx code path yet.
> > >
> > > Thanks.
> > >
> > > 	--yliu
  
Jianbo Liu Oct. 10, 2016, 5:31 a.m. UTC | #4
On 10 October 2016 at 10:44, Yuanhan Liu <yuanhan.liu@linux.intel.com> wrote:
> On Sun, Oct 09, 2016 at 12:09:07PM +0000, Wang, Zhihong wrote:
>> > > > Tested with testpmd, host: txonly, guest: rxonly
>> > > > size (bytes)     improvement (%)
>> > > > 64                    4.12
>> > > > 128                   6
>> > > > 256                   2.65
>> > > > 512                   -1.12
>> > > > 1024                 -7.02
>> > >
>> > > There is a difference between Zhihong's code and the old code that I
>> > > spotted the first time: Zhihong removed the avail_idx prefetch. I
>> > > understand the prefetch becomes a bit tricky when the mrg-rx code
>> > > path is considered; thus, I didn't comment on that.
>> > >
>> > > That's one of the differences that, IMO, could cause a regression. I
>> > > finally got a chance to add it back.
>> > >
>> > > A rough test shows it greatly improves the performance for the 1400B
>> > > packet size in the "txonly in host and rxonly in guest" case: +33% is
>> > > the number I get with my test server (Ivy Bridge).
>> >
>> > Thanks Yuanhan! I'll validate this on x86.
>>
>> Hi Yuanhan,
>>
>> It seems your code doesn't perform correctly. I wrote a new version
>> of the avail idx prefetch but didn't see any perf benefit.
>>
>> To be honest, I doubt the benefit of this idea. The previous mrg_off
>> code used this method but didn't give any benefit.
>
> Good point. I thought of that before, too. But as I said, I made it in
> a rush and didn't think further or test more.
>
> I looked at the code a bit closer this time and spotted a bug: the
> prefetch actually didn't happen, due to the following code piece:
>
>         if (vq->next_avail_idx >= NR_AVAIL_IDX_PREFETCH) {
>                 prefetch_avail_idx(vq);
>                 ...
>         }
>
> Since vq->next_avail_idx is set to 0 at the entrance of the enqueue
> path, prefetch_avail_idx() won't be called until NR_AVAIL_IDX_PREFETCH
> entries have been consumed, so the first reads hit an unfilled
> avail_idx_buf. The fix is easy though: just put prefetch_avail_idx
> before invoking enqueue_packet.
>
> In summary, Zhihong is right: I see no more gains with that fix :(
>
> However, as stated, that's about the only difference I found between
> your code and the old code, so maybe it's still worthwhile to run a
> test on ARM, Jianbo?
>
I haven't tested it, but I think there would be no improvement for ARM
either.

A small suggestion for enqueue_packet:

.....
+       /* start copy from mbuf to desc */
+       while (mbuf_avail || mbuf->next) {
.....

Considering that pkt_len is in the first cache line (same as data_len),
while the next pointer is in the second cache line, is it better to
check the total packet length, instead of the last mbuf's next pointer,
to jump out of the while loop and avoid a possible cache miss?
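
A self-contained sketch of that idea (simplified structures, not the
actual rte_mbuf; only the loop condition is the point here):

	#include <stdint.h>
	#include <string.h>

	struct seg {
		uint32_t pkt_len;	/* total length, 1st cache line */
		uint32_t data_len;	/* segment length, 1st cache line */
		void *data;
		struct seg *next;	/* 2nd cache line in a real mbuf */
	};

	/* drive the copy off the total length read from the 1st cache
	 * line, so ->next is only dereferenced when a segment is
	 * actually exhausted */
	static void
	copy_pkt(const struct seg *m, uint8_t *dst)
	{
		uint32_t left = m->pkt_len;

		while (left > 0) {
			memcpy(dst, m->data, m->data_len);
			dst += m->data_len;
			left -= m->data_len;
			if (left > 0)
				m = m->next;
		}
	}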
  
Zhihong Wang Oct. 10, 2016, 6:22 a.m. UTC | #5
> -----Original Message-----
> From: Jianbo Liu [mailto:jianbo.liu@linaro.org]
> Sent: Monday, October 10, 2016 1:32 PM
> To: Yuanhan Liu <yuanhan.liu@linux.intel.com>
> Cc: Wang, Zhihong <zhihong.wang@intel.com>; Maxime Coquelin
> <maxime.coquelin@redhat.com>; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
> 
> On 10 October 2016 at 10:44, Yuanhan Liu <yuanhan.liu@linux.intel.com>
> wrote:
> > On Sun, Oct 09, 2016 at 12:09:07PM +0000, Wang, Zhihong wrote:
> >> > > > Tested with testpmd, host: txonly, guest: rxonly
> >> > > > size (bytes)     improvement (%)
> >> > > > 64                    4.12
> >> > > > 128                   6
> >> > > > 256                   2.65
> >> > > > 512                   -1.12
> >> > > > 1024                 -7.02
> >> > >
> >> > > There is a difference between Zhihong's code and the old code that I
> >> > > spotted the first time: Zhihong removed the avail_idx prefetch. I
> >> > > understand the prefetch becomes a bit tricky when the mrg-rx code
> >> > > path is considered; thus, I didn't comment on that.
> >> > >
> >> > > That's one of the differences that, IMO, could cause a regression. I
> >> > > finally got a chance to add it back.
> >> > >
> >> > > A rough test shows it greatly improves the performance for the 1400B
> >> > > packet size in the "txonly in host and rxonly in guest" case: +33% is
> >> > > the number I get with my test server (Ivy Bridge).
> >> >
> >> > Thanks Yuanhan! I'll validate this on x86.
> >>
> >> Hi Yuanhan,
> >>
> >> It seems your code doesn't perform correctly. I wrote a new version
> >> of the avail idx prefetch but didn't see any perf benefit.
> >>
> >> To be honest, I doubt the benefit of this idea. The previous mrg_off
> >> code used this method but didn't give any benefit.
> >
> > Good point. I thought of that before, too. But as I said, I made it in
> > a rush and didn't think further or test more.
> >
> > I looked at the code a bit closer this time and spotted a bug: the
> > prefetch actually didn't happen, due to the following code piece:
> >
> >         if (vq->next_avail_idx >= NR_AVAIL_IDX_PREFETCH) {
> >                 prefetch_avail_idx(vq);
> >                 ...
> >         }
> >
> > Since vq->next_avail_idx is set to 0 at the entrance of the enqueue
> > path, prefetch_avail_idx() won't be called until NR_AVAIL_IDX_PREFETCH
> > entries have been consumed, so the first reads hit an unfilled
> > avail_idx_buf. The fix is easy though: just put prefetch_avail_idx
> > before invoking enqueue_packet.
> >
> > In summary, Zhihong is right: I see no more gains with that fix :(
> >
> > However, as stated, that's about the only difference I found between
> > your code and the old code, so maybe it's still worthwhile to run a
> > test on ARM, Jianbo?
> >
> I haven't tested it, but I think there would be no improvement for ARM
> either.
> 
> A small suggestion for enqueue_packet:
> 
> .....
> +       /* start copy from mbuf to desc */
> +       while (mbuf_avail || mbuf->next) {
> .....
> 
> Considering that pkt_len is in the first cache line (same as data_len),
> while the next pointer is in the second cache line, is it better to
> check the total packet length, instead of the last mbuf's next pointer,
> to jump out of the while loop and avoid a possible cache miss?


Jianbo,

Thanks for the reply!

This idea sounds good, but it won't help the general perf in my
opinion, since the 2nd cache line is accessed anyway, earlier, in
virtio_enqueue_offload.

Also, this would bring a NULL check when actually accessing mbuf->next.

BTW, could you please publish the numbers for:

 1. mrg_rxbuf=on, comparison between original and original + this patch

 2. mrg_rxbuf=off, comparison between original and original + this patch

So we can have a whole picture of how this patch impacts the ARM
platform.

Thanks
Zhihong
  
Jianbo Liu Oct. 10, 2016, 6:57 a.m. UTC | #6
On 10 October 2016 at 14:22, Wang, Zhihong <zhihong.wang@intel.com> wrote:
>
>
>> -----Original Message-----
>> From: Jianbo Liu [mailto:jianbo.liu@linaro.org]
>> Sent: Monday, October 10, 2016 1:32 PM
>> To: Yuanhan Liu <yuanhan.liu@linux.intel.com>
>> Cc: Wang, Zhihong <zhihong.wang@intel.com>; Maxime Coquelin
>> <maxime.coquelin@redhat.com>; dev@dpdk.org
>> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
>>
>> On 10 October 2016 at 10:44, Yuanhan Liu <yuanhan.liu@linux.intel.com>
>> wrote:
>> > On Sun, Oct 09, 2016 at 12:09:07PM +0000, Wang, Zhihong wrote:
>> >> > > > Tested with testpmd, host: txonly, guest: rxonly
>> >> > > > size (bytes)     improvement (%)
>> >> > > > 64                    4.12
>> >> > > > 128                   6
>> >> > > > 256                   2.65
>> >> > > > 512                   -1.12
>> >> > > > 1024                 -7.02
>> >> > >
>> >> > > There is a difference between Zhihong's code and the old code
>> >> > > that I spotted the first time: Zhihong removed the avail_idx
>> >> > > prefetch. I understand the prefetch becomes a bit tricky when the
>> >> > > mrg-rx code path is considered; thus, I didn't comment on that.
>> >> > >
>> >> > > That's one of the differences that, IMO, could cause a regression.
>> >> > > I finally got a chance to add it back.
>> >> > >
>> >> > > A rough test shows it greatly improves the performance for the
>> >> > > 1400B packet size in the "txonly in host and rxonly in guest"
>> >> > > case: +33% is the number I get with my test server (Ivy Bridge).
>> >> >
>> >> > Thanks Yuanhan! I'll validate this on x86.
>> >>
>> >> Hi Yuanhan,
>> >>
>> >> It seems your code doesn't perform correctly. I wrote a new version
>> >> of the avail idx prefetch but didn't see any perf benefit.
>> >>
>> >> To be honest, I doubt the benefit of this idea. The previous mrg_off
>> >> code used this method but didn't give any benefit.
>> >
>> > Good point. I thought of that before, too. But as I said, I made it in
>> > a rush and didn't think further or test more.
>> >
>> > I looked at the code a bit closer this time and spotted a bug: the
>> > prefetch actually didn't happen, due to the following code piece:
>> >
>> >         if (vq->next_avail_idx >= NR_AVAIL_IDX_PREFETCH) {
>> >                 prefetch_avail_idx(vq);
>> >                 ...
>> >         }
>> >
>> > Since vq->next_avail_idx is set to 0 at the entrance of the enqueue
>> > path, prefetch_avail_idx() won't be called until NR_AVAIL_IDX_PREFETCH
>> > entries have been consumed, so the first reads hit an unfilled
>> > avail_idx_buf. The fix is easy though: just put prefetch_avail_idx
>> > before invoking enqueue_packet.
>> >
>> > In summary, Zhihong is right: I see no more gains with that fix :(
>> >
>> > However, as stated, that's about the only difference I found between
>> > your code and the old code, so maybe it's still worthwhile to run a
>> > test on ARM, Jianbo?
>> >
>> I haven't tested it, but I think there would be no improvement for ARM either.
>>
>> A small suggestion for enqueue_packet:
>>
>> .....
>> +       /* start copy from mbuf to desc */
>> +       while (mbuf_avail || mbuf->next) {
>> .....
>>
>> Considering that pkt_len is in the first cache line (same as data_len),
>> while the next pointer is in the second cache line, is it better to
>> check the total packet length, instead of the last mbuf's next pointer,
>> to jump out of the while loop and avoid a possible cache miss?
>
> Jianbo,
>
> Thanks for the reply!
>
> This idea sounds good, but it won't help the general perf in my
> opinion, since the 2nd cache line is accessed anyway, earlier, in
> virtio_enqueue_offload.
>
Yes, you are right. I'm thinking of prefetching beforehand.
And if it's a chained mbuf, virtio_enqueue_offload will not be called
in the next loop iteration.
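
Something like this inside the copy loop, perhaps (a hypothetical
placement, not tested):

	/* warm the next segment's first cache line while the current
	 * segment is still being copied */
	if (mbuf->next != NULL)
		rte_prefetch0(mbuf->next);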

> Also, this would bring a NULL check when actually accessing mbuf->next.
>
> BTW, could you please publish the numbers for:
>
>  1. mrg_rxbuf=on, comparison between original and original + this patch
>
>  2. mrg_rxbuf=off, comparison between original and original + this patch
>
> So we can have a whole picture of how this patch impacts the ARM
> platform.
>
I think you have already got many results in my previous emails.
Sorry, I can't test right now; I'm busy with other things.
  
Zhihong Wang Oct. 10, 2016, 7:25 a.m. UTC | #7
> -----Original Message-----
> From: Jianbo Liu [mailto:jianbo.liu@linaro.org]
> Sent: Monday, October 10, 2016 2:58 PM
> To: Wang, Zhihong <zhihong.wang@intel.com>
> Cc: Yuanhan Liu <yuanhan.liu@linux.intel.com>; Maxime Coquelin
> <maxime.coquelin@redhat.com>; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
> 
> On 10 October 2016 at 14:22, Wang, Zhihong <zhihong.wang@intel.com>
> wrote:
> >
> >
> >> -----Original Message-----
> >> From: Jianbo Liu [mailto:jianbo.liu@linaro.org]
> >> Sent: Monday, October 10, 2016 1:32 PM
> >> To: Yuanhan Liu <yuanhan.liu@linux.intel.com>
> >> Cc: Wang, Zhihong <zhihong.wang@intel.com>; Maxime Coquelin
> >> <maxime.coquelin@redhat.com>; dev@dpdk.org
> >> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
> >>
> >> On 10 October 2016 at 10:44, Yuanhan Liu <yuanhan.liu@linux.intel.com>
> >> wrote:
> >> > On Sun, Oct 09, 2016 at 12:09:07PM +0000, Wang, Zhihong wrote:
> >> >> > > > Tested with testpmd, host: txonly, guest: rxonly
> >> >> > > > size (bytes)     improvement (%)
> >> >> > > > 64                    4.12
> >> >> > > > 128                   6
> >> >> > > > 256                   2.65
> >> >> > > > 512                   -1.12
> >> >> > > > 1024                 -7.02
> >> >> > >
> >> >> > > There is a difference between Zhihong's code and the old code
> >> >> > > that I spotted the first time: Zhihong removed the avail_idx
> >> >> > > prefetch. I understand the prefetch becomes a bit tricky when
> >> >> > > the mrg-rx code path is considered; thus, I didn't comment on
> >> >> > > that.
> >> >> > >
> >> >> > > That's one of the differences that, IMO, could cause a
> >> >> > > regression. I finally got a chance to add it back.
> >> >> > >
> >> >> > > A rough test shows it greatly improves the performance for the
> >> >> > > 1400B packet size in the "txonly in host and rxonly in guest"
> >> >> > > case: +33% is the number I get with my test server (Ivy Bridge).
> >> >> >
> >> >> > Thanks Yuanhan! I'll validate this on x86.
> >> >>
> >> >> Hi Yuanhan,
> >> >>
> >> >> It seems your code doesn't perform correctly. I wrote a new version
> >> >> of the avail idx prefetch but didn't see any perf benefit.
> >> >>
> >> >> To be honest, I doubt the benefit of this idea. The previous mrg_off
> >> >> code used this method but didn't give any benefit.
> >> >
> >> > Good point. I thought of that before, too. But as I said, I made it
> >> > in a rush and didn't think further or test more.
> >> >
> >> > I looked at the code a bit closer this time and spotted a bug: the
> >> > prefetch actually didn't happen, due to the following code piece:
> >> >
> >> >         if (vq->next_avail_idx >= NR_AVAIL_IDX_PREFETCH) {
> >> >                 prefetch_avail_idx(vq);
> >> >                 ...
> >> >         }
> >> >
> >> > Since vq->next_avail_idx is set to 0 at the entrance of the enqueue
> >> > path, prefetch_avail_idx() won't be called until
> >> > NR_AVAIL_IDX_PREFETCH entries have been consumed, so the first reads
> >> > hit an unfilled avail_idx_buf. The fix is easy though: just put
> >> > prefetch_avail_idx before invoking enqueue_packet.
> >> >
> >> > In summary, Zhihong is right: I see no more gains with that fix :(
> >> >
> >> > However, as stated, that's about the only difference I found between
> >> > your code and the old code, so maybe it's still worthwhile to run a
> >> > test on ARM, Jianbo?
> >> >
> >> I haven't tested it, but I think there would be no improvement for ARM
> >> either.
> >>
> >> A small suggestion for enqueue_packet:
> >>
> >> .....
> >> +       /* start copy from mbuf to desc */
> >> +       while (mbuf_avail || mbuf->next) {
> >> .....
> >>
> >> Considering that pkt_len is in the first cache line (same as data_len),
> >> while the next pointer is in the second cache line, is it better to
> >> check the total packet length, instead of the last mbuf's next pointer,
> >> to jump out of the while loop and avoid a possible cache miss?
> >
> > Jianbo,
> >
> > Thanks for the reply!
> >
> > This idea sounds good, but it won't help the general perf in my
> > opinion, since the 2nd cache line is accessed anyway, earlier, in
> > virtio_enqueue_offload.
> >
> Yes, you are right. I'm thinking of prefetching beforehand.
> And if it's a chained mbuf, virtio_enqueue_offload will not be called
> in the next loop iteration.
> 
> > Also, this would bring a NULL check when actually accessing mbuf->next.
> >
> > BTW, could you please publish the numbers for:
> >
> >  1. mrg_rxbuf=on, comparison between original and original + this patch
> >
> >  2. mrg_rxbuf=off, comparison between original and original + this patch
> >
> > So we can have a whole picture of how this patch impacts the ARM
> > platform.
> >
> I think you have already got many results in my previous emails.
> Sorry, I can't test right now; I'm busy with other things.


We're still missing the mrg_rxbuf=on data.
  

Patch

diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 381dc27..41bfeba 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -61,6 +61,8 @@  struct buf_vector {
 	uint32_t desc_idx;
 };
 
+#define NR_AVAIL_IDX_PREFETCH	32
+
 /**
  * Structure contains variables relevant to RX/TX virtqueues.
  */
@@ -70,7 +72,7 @@  struct vhost_virtqueue {
 	struct vring_used	*used;
 	uint32_t		size;
 
-	/* Last index used on the available ring */
+	uint16_t		last_avail_idx;
 	uint16_t		last_used_idx;
 #define VIRTIO_INVALID_EVENTFD		(-1)
 #define VIRTIO_UNINITIALIZED_EVENTFD	(-2)
@@ -89,6 +91,9 @@  struct vhost_virtqueue {
 	/* Shadow used ring for performance */
 	struct vring_used_elem	*shadow_used_ring;
 	uint32_t		shadow_used_idx;
+
+	uint16_t		next_avail_idx;
+	uint16_t		avail_idx_buf[NR_AVAIL_IDX_PREFETCH];
 } __rte_cache_aligned;
 
 /* Old kernels have no such macro defined */
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 11a2c1a..1cc22fc 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -170,6 +170,41 @@  flush_used_ring(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	}
 }
 
+/* Fetch NR_AVAIL_IDX_PREFETCH avail entries at once */
+static void
+prefetch_avail_idx(struct vhost_virtqueue *vq)
+{
+	int i;
+
+	for (i = 0; i < NR_AVAIL_IDX_PREFETCH; i++) {
+		vq->avail_idx_buf[i] = vq->avail->ring[(vq->last_avail_idx + i) &
+					(vq->size - 1)];
+	}
+}
+
+static uint16_t
+next_avail_idx(struct vhost_virtqueue *vq)
+{
+	if (vq->next_avail_idx >= NR_AVAIL_IDX_PREFETCH) {
+		prefetch_avail_idx(vq);
+		vq->next_avail_idx = 0;
+		vq->last_avail_idx += NR_AVAIL_IDX_PREFETCH;
+	}
+
+	return vq->avail_idx_buf[vq->next_avail_idx++];
+}
+
+/*
+ * Just peek, but don't move forward the "next_avail_idx" pointer
+ * The caller also has to make sure the point doesn't go beyond
+ * the array.
+ */
+static uint16_t
+peek_next_avail_idx(struct vhost_virtqueue *vq)
+{
+	return vq->avail_idx_buf[vq->next_avail_idx];
+}
+
 static inline int __attribute__((always_inline))
 enqueue_packet(struct virtio_net *dev, struct vhost_virtqueue *vq,
 		uint16_t avail_idx, struct rte_mbuf *mbuf,
@@ -193,7 +228,7 @@  enqueue_packet(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	mbuf_avail = mbuf_len;
 
 	/* get the current desc */
-	desc_current = vq->avail->ring[(vq->last_used_idx) & (vq->size - 1)];
+	desc_current = next_avail_idx(vq);
 	desc_chain_head = desc_current;
 	desc = &vq->desc[desc_current];
 	desc_addr = gpa_to_vva(dev, desc->addr);
@@ -235,9 +270,7 @@  enqueue_packet(struct virtio_net *dev, struct vhost_virtqueue *vq,
 				if (avail_idx == vq->last_used_idx)
 					goto error;
 
-				desc_current =
-					vq->avail->ring[(vq->last_used_idx) &
-					(vq->size - 1)];
+				desc_current = next_avail_idx(vq);
 				desc_chain_head = desc_current;
 				desc_chain_len = 0;
 			} else
@@ -298,6 +331,7 @@  notify_guest(struct virtio_net *dev, struct vhost_virtqueue *vq)
 		eventfd_write(vq->callfd, (eventfd_t)1);
 }
 
+
 uint16_t
 rte_vhost_enqueue_burst(int vid, uint16_t queue_id,
 	struct rte_mbuf **pkts, uint16_t count)
@@ -331,14 +365,15 @@  rte_vhost_enqueue_burst(int vid, uint16_t queue_id,
 
 	/* start enqueuing packets 1 by 1 */
 	vq->shadow_used_idx = 0;
+	vq->next_avail_idx  = 0;
 	used_idx = vq->last_used_idx & (vq->size - 1);
 	avail_idx = *((volatile uint16_t *)&vq->avail->idx);
 	while (pkt_left && avail_idx != vq->last_used_idx) {
 		/* prefetch the next desc */
-		if (pkt_left > 1 && avail_idx != vq->last_used_idx + 1)
-			rte_prefetch0(&vq->desc[vq->avail->ring[
-					(vq->last_used_idx + 1) &
-					(vq->size - 1)]]);
+		if (pkt_left > 1 &&
+		    vq->next_avail_idx + 1 < NR_AVAIL_IDX_PREFETCH) {
+			rte_prefetch0(&vq->desc[peek_next_avail_idx(vq)]);
+		}
 
 		if (enqueue_packet(dev, vq, avail_idx, pkts[pkt_idx],
 					is_mrg_rxbuf))
@@ -347,6 +382,7 @@  rte_vhost_enqueue_burst(int vid, uint16_t queue_id,
 		pkt_idx++;
 		pkt_left--;
 	}
+	vq->last_avail_idx += vq->next_avail_idx;
 
 	/* batch update used ring for better performance */
 	if (likely(vq->shadow_used_idx > 0))