[dpdk-dev] mbuf: extend rte_mbuf_prefetch_part* to support more prefetching methods

Message ID 1464663966-8122-1-git-send-email-jianbo.liu@linaro.org (mailing list archive)
State Rejected, archived
Delegated to: Thomas Monjalon

Commit Message

Jianbo Liu May 31, 2016, 3:06 a.m. UTC
  Change the inline function to macro with parameters

Signed-off-by: Jianbo Liu <jianbo.liu@linaro.org>
---
 drivers/net/fm10k/fm10k_rxtx_vec.c      |  8 ++++----
 drivers/net/i40e/i40e_rxtx_vec.c        |  8 ++++----
 drivers/net/ixgbe/ixgbe_rxtx_vec.c      |  8 ++++----
 drivers/net/ixgbe/ixgbe_rxtx_vec_neon.c | 12 ++++++++----
 drivers/net/mlx4/mlx4.c                 |  4 ++--
 drivers/net/mlx5/mlx5_rxtx.c            |  4 ++--
 examples/ipsec-secgw/ipsec-secgw.c      |  2 +-
 lib/librte_mbuf/rte_mbuf.h              | 25 +++++++++++++------------
 8 files changed, 38 insertions(+), 33 deletions(-)
  

Comments

Olivier Matz May 31, 2016, 7:28 p.m. UTC | #1
Hi Jianbo,

On 05/31/2016 05:06 AM, Jianbo Liu wrote:
> Change the inline function to macro with parameters
> 
> Signed-off-by: Jianbo Liu <jianbo.liu@linaro.org>
>
> [...]
> --- a/lib/librte_mbuf/rte_mbuf.h
> +++ b/lib/librte_mbuf/rte_mbuf.h
> @@ -849,14 +849,15 @@ struct rte_mbuf {
>   * in the receive path. If the cache line of the architecture is higher than
>   * 64B, the second part will also be prefetched.
>   *
> + * @param method
> + *   The prefetch method: prefetch0, prefetch1, prefetch2 or
> + *                        prefetch_non_temporal.
> + *
>   * @param m
>   *   The pointer to the mbuf.
>   */
> -static inline void
> -rte_mbuf_prefetch_part1(struct rte_mbuf *m)
> -{
> -	rte_prefetch0(&m->cacheline0);
> -}
> +#define RTE_MBUF_PREFETCH_PART1(method, m)	\
> +	rte_##method(&(m)->cacheline0)

I'm not a big fan of this macro, because it allows really
anything to be called:

  RTE_MBUF_PREFETCH_PART1(pktmbuf_free, m)

would expand as:

  rte_pktmbuf_free(m)
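
The point above can be reproduced outside DPDK. The following is a minimal
sketch, assuming hypothetical stand-ins (struct mbuf and two stub functions
that only count calls) in place of the real DPDK definitions; only the macro
body has the same shape as the one in the patch. Note that, strictly, the
expansion passes &(m)->cacheline0 rather than m itself.

```c
#include <assert.h>

/* Hypothetical stand-in for struct rte_mbuf (assumption, not the DPDK type). */
struct mbuf { int cacheline0; };

static int prefetch_calls;
static int free_calls;

/* Stubs with the same names the macro will paste together; they only count. */
static void rte_prefetch0(const volatile void *p) { (void)p; prefetch_calls++; }
static void rte_pktmbuf_free(void *p) { (void)p; free_calls++; }

/* Same shape as the proposed macro: "method" is pasted onto "rte_". */
#define RTE_MBUF_PREFETCH_PART1(method, m) rte_##method(&(m)->cacheline0)

/* Returns how many "free" calls were made through the prefetch macro. */
static int demo_token_pasting(void)
{
	struct mbuf m = { 0 };
	struct mbuf *mp = &m;

	RTE_MBUF_PREFETCH_PART1(prefetch0, mp);    /* intended use */
	RTE_MBUF_PREFETCH_PART1(pktmbuf_free, mp); /* also accepted */
	assert(prefetch_calls >= 1);
	return free_calls;
}
```

Nothing in the macro restricts the first argument to a prefetch variant, so
the second call is accepted just like the first.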


I'd prefer to have a switch case like this, almost similar
to what Keith proposed in the initial discussion for my
patch:

enum rte_mbuf_prefetch_type {
	PREFETCH0,
	PREFETCH1,
...
};

static inline void
rte_mbuf_prefetch_part1(enum rte_mbuf_prefetch_type type,
	struct rte_mbuf *m)
{
	switch (type) {
	case PREFETCH0:
		rte_prefetch0(&m->cacheline0);
		break;
	case PREFETCH1:
		rte_prefetch1(&m->cacheline0);
		break;
	...
}
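
A filled-in version of the idea above might look like the sketch below. It
is an illustration under assumptions, not the code that was proposed: the
enum names, struct mbuf and stub_prefetch() are hypothetical, and the four
values simply mirror the methods named in the patch's doc comment. In DPDK
each case would call the matching rte_prefetch*() function instead of the
stub.

```c
#include <assert.h>

/* Hypothetical stand-in for struct rte_mbuf; only the first line is modelled. */
struct mbuf { char cacheline0[64]; };

/* Hypothetical names; values mirror the methods listed in the patch. */
enum mbuf_prefetch_type {
	MBUF_PREFETCH0,
	MBUF_PREFETCH1,
	MBUF_PREFETCH2,
	MBUF_PREFETCH_NON_TEMPORAL,
};

/* Records which variant ran, in place of a real prefetch instruction. */
static int last_variant = -1;

static inline void stub_prefetch(const volatile void *p, int variant)
{
	(void)p;
	last_variant = variant;
}

static inline void
mbuf_prefetch_part1(enum mbuf_prefetch_type type, struct mbuf *m)
{
	switch (type) {
	case MBUF_PREFETCH0:             /* rte_prefetch0() in DPDK */
		stub_prefetch(m->cacheline0, MBUF_PREFETCH0);
		break;
	case MBUF_PREFETCH1:             /* rte_prefetch1() in DPDK */
		stub_prefetch(m->cacheline0, MBUF_PREFETCH1);
		break;
	case MBUF_PREFETCH2:             /* rte_prefetch2() in DPDK */
		stub_prefetch(m->cacheline0, MBUF_PREFETCH2);
		break;
	case MBUF_PREFETCH_NON_TEMPORAL: /* rte_prefetch_non_temporal() in DPDK */
		stub_prefetch(m->cacheline0, MBUF_PREFETCH_NON_TEMPORAL);
		break;
	}
}

/* Dispatches one variant and reports which one actually ran. */
static int demo_switch_dispatch(void)
{
	struct mbuf m = { { 0 } };

	mbuf_prefetch_part1(MBUF_PREFETCH1, &m);
	assert(last_variant == MBUF_PREFETCH1);
	return last_variant;
}
```

When the type argument is a compile-time constant, an optimizing compiler
can typically reduce the inlined switch to the single selected call, so the
typed form need not cost anything at run time.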


Some questions: could you give some details about the use
of non-temporal prefetch in ixgbe_vec_neon? What are the
pros and cons, and would it be useful in other drivers?
Currently all drivers are doing prefetch0 when they prefetch
the mbuf structure. Some drivers use prefetch1 for data.


By the way, I did not try to apply the patch, but it looks like
it's on top of dpdk-next-net/rel_16_07, right?

Thanks,
Olivier
  
Stephen Hemminger May 31, 2016, 8 p.m. UTC | #2
On Tue, 31 May 2016 08:36:06 +0530
Jianbo Liu <jianbo.liu@linaro.org> wrote:

> Change the inline function to macro with parameters
> 
> Signed-off-by: Jianbo Liu <jianbo.liu@linaro.org>

Going from typed (inline) to untyped (macro) is a step backwards
in code safety.
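
To illustrate the typed-versus-untyped point, here is a small sketch under
assumptions: struct mbuf, struct unrelated and stub_prefetch0() are
hypothetical and stand in for the DPDK definitions. The macro accepts a
pointer to any structure that happens to have a cacheline0 member, while the
inline function only accepts a struct mbuf pointer.

```c
#include <assert.h>

/* Hypothetical types: only the first one is meant to be prefetched. */
struct mbuf { int cacheline0; };
struct unrelated { double cacheline0; };

static int calls;

/* Stub that counts calls instead of issuing a prefetch. */
static void stub_prefetch0(const volatile void *p) { (void)p; calls++; }

/* Untyped: any expression with a cacheline0 member is accepted. */
#define PREFETCH_PART1_MACRO(m) stub_prefetch0(&(m)->cacheline0)

/* Typed: the compiler checks the argument against struct mbuf *. */
static inline void prefetch_part1_inline(struct mbuf *m)
{
	stub_prefetch0(&m->cacheline0);
}

/* Returns the number of prefetch calls that were accepted. */
static int demo_type_checking(void)
{
	struct mbuf m = { 0 };
	struct unrelated u = { 0 };

	PREFETCH_PART1_MACRO(&m);
	PREFETCH_PART1_MACRO(&u);  /* accepted: the macro has no parameter type */
	prefetch_part1_inline(&m);
	/* prefetch_part1_inline(&u) would be diagnosed as an incompatible pointer. */
	assert(calls == 3);
	return calls;
}
```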
  
Jianbo Liu June 1, 2016, 3:29 a.m. UTC | #3
On 1 June 2016 at 03:28, Olivier MATZ <olivier.matz@6wind.com> wrote:
> Hi Jianbo,
>
> On 05/31/2016 05:06 AM, Jianbo Liu wrote:
>> Change the inline function to macro with parameters
>>
>> Signed-off-by: Jianbo Liu <jianbo.liu@linaro.org>
>>
>> [...]
>> --- a/lib/librte_mbuf/rte_mbuf.h
>> +++ b/lib/librte_mbuf/rte_mbuf.h
>> @@ -849,14 +849,15 @@ struct rte_mbuf {
>>   * in the receive path. If the cache line of the architecture is higher than
>>   * 64B, the second part will also be prefetched.
>>   *
>> + * @param method
>> + *   The prefetch method: prefetch0, prefetch1, prefetch2 or
>> + *                        prefetch_non_temporal.
>> + *
>>   * @param m
>>   *   The pointer to the mbuf.
>>   */
>> -static inline void
>> -rte_mbuf_prefetch_part1(struct rte_mbuf *m)
>> -{
>> -     rte_prefetch0(&m->cacheline0);
>> -}
>> +#define RTE_MBUF_PREFETCH_PART1(method, m)   \
>> +     rte_##method(&(m)->cacheline0)
>
> I'm not a big fan of this macro, because it allows really
> anything to be called:
>
>   RTE_MBUF_PREFETCH_PART1(pktmbuf_free, m)
>
> would expand as:
>
>   rte_pktmbuf_free(m)
>
>
> I'd prefer to have a switch case like this, almost similar
> to what Keith proposed in the initial discussion for my
> patch:
>
> enum rte_mbuf_prefetch_type {
>         PREFETCH0,
>         PREFETCH1,
> ...
> };
>
> static inline void
> rte_mbuf_prefetch_part1(enum rte_mbuf_prefetch_type type,
>         struct rte_mbuf *m)
> {
>         switch (type) {
>         case PREFETCH0:
>                 rte_prefetch0(&m->cacheline0);
>                 break;
>         case PREFETCH1:
>                 rte_prefetch1(&m->cacheline0);
>                 break;
>         ...
> }
>
How about adding these to forbid the illegal use of this macro?
enum rte_mbuf_prefetch_type {
         ENUM_prefetch0,
         ENUM_prefetch1,
 ...
};

#define RTE_MBUF_PREFETCH_PART1(type, m) \
    if (ENUM_##type == ENUM_prefetch0) \
        rte_prefetch0(&(m)->cacheline0);   \
    else if (ENUM_##type == ENUM_prefetch1) \
        rte_prefetch1(&(m)->cacheline0); \
    ....
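
One practical detail if a macro of this shape is used: a bare if/else chain
does not behave like a single statement when the macro itself appears inside
another if/else. The sketch below wraps the chain in the usual
do { } while (0) idiom. It is an illustration only; struct mbuf and the two
stubs are hypothetical, and only two variants are shown. An unknown name
such as pktmbuf_free would fail to compile here, because ENUM_pktmbuf_free
is not declared.

```c
#include <assert.h>

/* Hypothetical stand-in for struct rte_mbuf. */
struct mbuf { int cacheline0; };

enum mbuf_prefetch_type { ENUM_prefetch0, ENUM_prefetch1 };

/* Per-variant call counters, in place of real prefetch instructions. */
static int hits[2];

static void stub_prefetch0(const volatile void *p) { (void)p; hits[0]++; }
static void stub_prefetch1(const volatile void *p) { (void)p; hits[1]++; }

/* The do/while wrapper makes the expansion a single statement. */
#define MBUF_PREFETCH_PART1(type, m) do {           \
	if (ENUM_##type == ENUM_prefetch0)          \
		stub_prefetch0(&(m)->cacheline0);   \
	else if (ENUM_##type == ENUM_prefetch1)     \
		stub_prefetch1(&(m)->cacheline0);   \
} while (0)

/* Uses the macro in both branches of an if/else; returns hits as a code. */
static int demo_guarded_macro(int first)
{
	struct mbuf m = { 0 };
	struct mbuf *mp = &m;

	if (first)
		MBUF_PREFETCH_PART1(prefetch0, mp);
	else
		MBUF_PREFETCH_PART1(prefetch1, mp);
	assert(hits[0] + hits[1] > 0);
	return hits[0] * 10 + hits[1];
}
```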

>
> Some questions: could you give some details about the use
> of non-temporal prefetch in ixgbe_vec_neon? What are the
> pros and cons, and would it be useful in other drivers?
> Currently all drivers are doing prefetch0 when they prefetch
> the mbuf structure. Some drivers use prefetch1 for data.
>
It's for performance reasons, and only on the armv8a platform.

>
> By the way, I did not try to apply the patch, but it looks
> it's on top of dpdk-next-net/rel_16_07, right?
>
Yes
  
Jerin Jacob June 1, 2016, 6 a.m. UTC | #4
On Wed, Jun 01, 2016 at 11:29:47AM +0800, Jianbo Liu wrote:
> On 1 June 2016 at 03:28, Olivier MATZ <olivier.matz@6wind.com> wrote:
> > Hi Jianbo,
> >
> > On 05/31/2016 05:06 AM, Jianbo Liu wrote:
> >> Change the inline function to macro with parameters
> >>
> >> Signed-off-by: Jianbo Liu <jianbo.liu@linaro.org>
> >>
> >> [...]
> >> --- a/lib/librte_mbuf/rte_mbuf.h
> >> +++ b/lib/librte_mbuf/rte_mbuf.h
> >> @@ -849,14 +849,15 @@ struct rte_mbuf {
> >>   * in the receive path. If the cache line of the architecture is higher than
> >>   * 64B, the second part will also be prefetched.
> >>   *
> >> + * @param method
> >> + *   The prefetch method: prefetch0, prefetch1, prefetch2 or
> >> + *                        prefetch_non_temporal.
> >> + *
> >>   * @param m
> >>   *   The pointer to the mbuf.
> >>   */
> >> -static inline void
> >> -rte_mbuf_prefetch_part1(struct rte_mbuf *m)
> >> -{
> >> -     rte_prefetch0(&m->cacheline0);
> >> -}
> >> +#define RTE_MBUF_PREFETCH_PART1(method, m)   \
> >> +     rte_##method(&(m)->cacheline0)
> >
> > I'm not a big fan of this macro, because it allows really
> > anything to be called:
> >
> >   RTE_MBUF_PREFETCH_PART1(pktmbuf_free, m)
> >
> > would expand as:
> >
> >   rte_pktmbuf_free(m)
> >
> >
> > I'd prefer to have a switch case like this, almost similar
> > to what Keith proposed in the initial discussion for my
> > patch:
> >
> > enum rte_mbuf_prefetch_type {
> >         PREFETCH0,
> >         PREFETCH1,
> > ...
> > };
> >
> > static inline void
> > rte_mbuf_prefetch_part1(enum rte_mbuf_prefetch_type type,
> >         struct rte_mbuf *m)
> > {
> >         switch (type) {
> >         case PREFETCH0:
> >                 rte_prefetch0(&m->cacheline0);
> >                 break;
> >         case PREFETCH1:
> >                 rte_prefetch1(&m->cacheline0);
> >                 break;
> >         ...
> > }
> >
> How about adding these to forbid the illegal use of this macro?
> enum rte_mbuf_prefetch_type {
>          ENUM_prefetch0,
>          ENUM_prefetch1,
>  ...
> };
> 
> #define RTE_MBUF_PREFETCH_PART1(type, m) \
>     if (ENUM_##type == ENUM_prefetch0) \
>         rte_prefetch0(&(m)->cacheline0);   \
>     else if (ENUM_##type == ENUM_prefetch1) \
>         rte_prefetch1(&(m)->cacheline0); \
>     ....
> 
> >
> > Some questions: could you give some details about the use
> > of non-temporal prefetch in ixgbe_vec_neon? What are the
> > pros and cons, and would it be useful in other drivers?
> > Currently all drivers are doing prefetch0 when they prefetch
> > the mbuf structure. Some drivers use prefetch1 for data.
> >
> It's for performance consideration, and only on armv8a platform.

Strictly speaking, it is not armv8-specific; IA also implements this API,
with the _MM_HINT_NTA hint.

Do we really need the non-temporal/transient version of prefetch for ixgbe?
If so, wouldn't it make sense to use it for x86 as well?

The primary use case for the transient version would be pipeline mode,
where the same CPU won't consume the packet.

/**
 * Prefetch a cache line into all cache levels (non-temporal/transient
 * version)
 *
 * The non-temporal prefetch is intended as a hint that the processor
 * will use the prefetched data only once or for a short period, unlike
 * the rte_prefetch0() function, which implies that the prefetched data
 * will be used repeatedly.
 *
 * @param p
 *   Address to prefetch
 */
static inline void rte_prefetch_non_temporal(const volatile void *p); 
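
As a rough illustration of the difference between the two hints, the sketch
below uses the GCC/Clang __builtin_prefetch() builtin, whose third argument
is a temporal-locality level from 0 (none) to 3 (high). The function names
are hypothetical and this is not how DPDK implements the API; it only shows
the idea of a "use once" hint next to a "keep in cache" hint.

```c
#include <assert.h>

/* Hypothetical name: read prefetch with high temporal locality. */
static inline void sketch_prefetch0(const volatile void *p)
{
	__builtin_prefetch((const void *)p, 0, 3);
}

/* Hypothetical name: read prefetch with no temporal locality. */
static inline void sketch_prefetch_non_temporal(const volatile void *p)
{
	__builtin_prefetch((const void *)p, 0, 0);
}

/* Prefetching is only a hint: the data itself is never modified. */
static int demo_prefetch_hints(void)
{
	int data[16] = { 0 };

	data[3] = 7;
	sketch_prefetch0(data);
	sketch_prefetch_non_temporal(data);
	assert(data[3] == 7);
	return data[3];
}
```

Whether the non-temporal form helps depends on how soon, and on which core,
the data is touched again, which is the question raised above.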

> 
> >
> > By the way, I did not try to apply the patch, but it looks
> > it's on top of dpdk-next-net/rel_16_07, right?
> >
> Yes
  
Olivier Matz June 2, 2016, 7:10 a.m. UTC | #5
Hi Jianbo,

On 06/01/2016 05:29 AM, Jianbo Liu wrote:
>> enum rte_mbuf_prefetch_type {
>> >         PREFETCH0,
>> >         PREFETCH1,
>> > ...
>> > };
>> >
>> > static inline void
>> > rte_mbuf_prefetch_part1(enum rte_mbuf_prefetch_type type,
>> >         struct rte_mbuf *m)
>> > {
>> >         switch (type) {
>> >         case PREFETCH0:
>> >                 rte_prefetch0(&m->cacheline0);
>> >                 break;
>> >         case PREFETCH1:
>> >                 rte_prefetch1(&m->cacheline0);
>> >                 break;
>> >         ...
>> > }
>> >
> How about adding these to forbid the illegal use of this macro?
> enum rte_mbuf_prefetch_type {
>          ENUM_prefetch0,
>          ENUM_prefetch1,
>  ...
> };
> 
> #define RTE_MBUF_PREFETCH_PART1(type, m) \
>     if (ENUM_##type == ENUM_prefetch0) \
>         rte_prefetch0(&(m)->cacheline0);   \
>     else if (ENUM_##type == ENUM_prefetch1) \
>         rte_prefetch1(&(m)->cacheline0); \
>     ....
> 

As Stephen stated, a static inline is better than a macro, mainly
because it is understood by the compiler instead of being a dumb
code replacement.

Any reason why you would prefer a macro in that case?

Regards
Olivier
  
Jianbo Liu June 2, 2016, 9:04 a.m. UTC | #6
On 1 June 2016 at 14:00, Jerin Jacob <jerin.jacob@caviumnetworks.com> wrote:
> On Wed, Jun 01, 2016 at 11:29:47AM +0800, Jianbo Liu wrote:
>> On 1 June 2016 at 03:28, Olivier MATZ <olivier.matz@6wind.com> wrote:
>> > Hi Jianbo,
>> >
>> > On 05/31/2016 05:06 AM, Jianbo Liu wrote:
>> >> Change the inline function to macro with parameters
>> >>
>> >> Signed-off-by: Jianbo Liu <jianbo.liu@linaro.org>
>> >>
>> >> [...]
[...]
>> It's for performance consideration, and only on armv8a platform.
>
> Strictly it is not armv8 specific, IA also implemented this API with
> _MM_HINT_NTA hint.

I mean this patch is only for ixgbe vector PMD on armv8 platform.

>
> Do we really need non-temporal/transient version of prefetch for ixgbe?

Strictly speaking, we don't have to, since we don't know how applications
use the mbuf header.
But isn't it highly likely that the second part is used only once, or
only for a short period, because prefetching is done only when
split_packet is not NULL?

> If so, for x86 also it makes sense to keep it? Right?
>
> The primary use case for the transient version would be pipeline mode,
> where the same CPU won't consume the packet.
>
> /**
>  * Prefetch a cache line into all cache levels (non-temporal/transient
>  * version)
>  *
>  * The non-temporal prefetch is intended as a prefetch hint that
>  * processor will
>  * use the prefetched data only once or short period, unlike the
>  * rte_prefetch0() function which imply that prefetched data to use
>  * repeatedly.
>  *
>  * @param p
>  *   Address to prefetch
>  */
> static inline void rte_prefetch_non_temporal(const volatile void *p);
>
>>
>> >
>> > By the way, I did not try to apply the patch, but it looks
>> > it's on top of dpdk-next-net/rel_16_07, right?
>> >
>> Yes
  
Jianbo Liu June 2, 2016, 9:12 a.m. UTC | #7
On 2 June 2016 at 15:10, Olivier MATZ <olivier.matz@6wind.com> wrote:
> Hi Jianbo,
>
> On 06/01/2016 05:29 AM, Jianbo Liu wrote:
>>> enum rte_mbuf_prefetch_type {
>>> >         PREFETCH0,
>>> >         PREFETCH1,
>>> > ...
>>> > };
>>> >
>>> > static inline void
>>> > rte_mbuf_prefetch_part1(enum rte_mbuf_prefetch_type type,
>>> >         struct rte_mbuf *m)
>>> > {
>>> >         switch (type) {
>>> >         case PREFETCH0:
>>> >                 rte_prefetch0(&m->cacheline0);
>>> >                 break;
>>> >         case PREFETCH1:
>>> >                 rte_prefetch1(&m->cacheline0);
>>> >                 break;
>>> >         ...
>>> > }
>>> >
>> How about adding these to forbid the illegal use of this macro?
>> enum rte_mbuf_prefetch_type {
>>          ENUM_prefetch0,
>>          ENUM_prefetch1,
>>  ...
>> };
>>
>> #define RTE_MBUF_PREFETCH_PART1(type, m) \
>>     if (ENUM_##type == ENUM_prefetch0) \
>>         rte_prefetch0(&(m)->cacheline0);   \
>>     else if (ENUM_##type == ENUM_prefetch1) \
>>         rte_prefetch1(&(m)->cacheline0); \
>>     ....
>>
>
> As Stephen stated, a static inline is better than a macro, mainly
> because it is understood by the compiler instead of being a dumb
> code replacement.
>
> Any reason why you would prefer a macro in that case?
>
For simplicity. Otherwise, we may have to write several
similar functions for the different prefetch variants.
  
Jerin Jacob June 2, 2016, 9:30 a.m. UTC | #8
On Thu, Jun 02, 2016 at 05:04:13PM +0800, Jianbo Liu wrote:
> On 1 June 2016 at 14:00, Jerin Jacob <jerin.jacob@caviumnetworks.com> wrote:
> > On Wed, Jun 01, 2016 at 11:29:47AM +0800, Jianbo Liu wrote:
> >> On 1 June 2016 at 03:28, Olivier MATZ <olivier.matz@6wind.com> wrote:
> >> > Hi Jianbo,
> >> >
> >> > On 05/31/2016 05:06 AM, Jianbo Liu wrote:
> >> >> Change the inline function to macro with parameters
> >> >>
> >> >> Signed-off-by: Jianbo Liu <jianbo.liu@linaro.org>
> >> >>
> >> >> [...]
> [...]
> >> It's for performance consideration, and only on armv8a platform.
> >
> > Strictly it is not armv8 specific, IA also implemented this API with
> > _MM_HINT_NTA hint.
> 
> I mean this patch is only for ixgbe vector PMD on armv8 platform.
> 
> >
> > Do we really need non-temporal/transient version of prefetch for ixgbe?
> 
> Strictly speaking, we don't have to since we don't know how APPs use
> the mbuf header.

Then IMO it makes sense to keep the same behavior as the x86 ixgbe driver.
On the upside, we may not need the new macros for part prefetching.

Jerin

> But, is it high possibility that the second part is used only once or
> short period because prefetching is done only when split_packet is not
> NULL?
> 
> > If so, for x86 also it makes sense to keep it? Right?
> >
> > The primary use case for the transient version would be pipeline mode,
> > where the same CPU won't consume the packet.
> >
> > /**
> >  * Prefetch a cache line into all cache levels (non-temporal/transient
> >  * version)
> >  *
> >  * The non-temporal prefetch is intended as a prefetch hint that
> >  * processor will
> >  * use the prefetched data only once or short period, unlike the
> >  * rte_prefetch0() function which imply that prefetched data to use
> >  * repeatedly.
> >  *
> >  * @param p
> >  *   Address to prefetch
> >  */
> > static inline void rte_prefetch_non_temporal(const volatile void *p);
> >
> >>
> >> >
> >> > By the way, I did not try to apply the patch, but it looks
> >> > it's on top of dpdk-next-net/rel_16_07, right?
> >> >
> >> Yes
  
Olivier Matz June 21, 2016, 2:56 p.m. UTC | #9
Hi,

On 06/02/2016 11:30 AM, Jerin Jacob wrote:
> On Thu, Jun 02, 2016 at 05:04:13PM +0800, Jianbo Liu wrote:
>> On 1 June 2016 at 14:00, Jerin Jacob <jerin.jacob@caviumnetworks.com> wrote:
>>> On Wed, Jun 01, 2016 at 11:29:47AM +0800, Jianbo Liu wrote:
>>>> On 1 June 2016 at 03:28, Olivier MATZ <olivier.matz@6wind.com> wrote:
>>>>> Hi Jianbo,
>>>>>
>>>>> On 05/31/2016 05:06 AM, Jianbo Liu wrote:
>>>>>> Change the inline function to macro with parameters
>>>>>>
>>>>>> Signed-off-by: Jianbo Liu <jianbo.liu@linaro.org>
>>>>>>
>>>>>> [...]
>> [...]
>>>> It's for performance consideration, and only on armv8a platform.
>>>
>>> Strictly it is not armv8 specific, IA also implemented this API with
>>> _MM_HINT_NTA hint.
>>
>> I mean this patch is only for ixgbe vector PMD on armv8 platform.
>>
>>>
>>> Do we really need non-temporal/transient version of prefetch for ixgbe?
>>
>> Strictly speaking, we don't have to since we don't know how APPs use
>> the mbuf header.
> 
> Then IMO it makes sense to keep the same behavior as the x86 ixgbe driver.
> On the upside, we may not need the new macros for part prefetching.
> 
> Jerin

Knowing that http://www.dpdk.org/dev/patchwork/patch/13992/ has been
submitted, I think this patch can be marked as closed in patchwork.
  

Patch

diff --git a/drivers/net/fm10k/fm10k_rxtx_vec.c b/drivers/net/fm10k/fm10k_rxtx_vec.c
index ef256a5..0e4c91c 100644
--- a/drivers/net/fm10k/fm10k_rxtx_vec.c
+++ b/drivers/net/fm10k/fm10k_rxtx_vec.c
@@ -487,10 +487,10 @@  fm10k_recv_raw_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 		rte_compiler_barrier();
 
 		if (split_packet) {
-			rte_mbuf_prefetch_part2(rx_pkts[pos]);
-			rte_mbuf_prefetch_part2(rx_pkts[pos + 1]);
-			rte_mbuf_prefetch_part2(rx_pkts[pos + 2]);
-			rte_mbuf_prefetch_part2(rx_pkts[pos + 3]);
+			RTE_MBUF_PREFETCH_PART2(prefetch0, rx_pkts[pos]);
+			RTE_MBUF_PREFETCH_PART2(prefetch0, rx_pkts[pos + 1]);
+			RTE_MBUF_PREFETCH_PART2(prefetch0, rx_pkts[pos + 2]);
+			RTE_MBUF_PREFETCH_PART2(prefetch0, rx_pkts[pos + 3]);
 		}
 
 		/* D.1 pkt 3,4 convert format from desc to pktmbuf */
diff --git a/drivers/net/i40e/i40e_rxtx_vec.c b/drivers/net/i40e/i40e_rxtx_vec.c
index eef80d9..a5c4847 100644
--- a/drivers/net/i40e/i40e_rxtx_vec.c
+++ b/drivers/net/i40e/i40e_rxtx_vec.c
@@ -297,10 +297,10 @@  _recv_raw_pkts_vec(struct i40e_rx_queue *rxq, struct rte_mbuf **rx_pkts,
 		_mm_storeu_si128((__m128i *)&rx_pkts[pos+2], mbp2);
 
 		if (split_packet) {
-			rte_mbuf_prefetch_part2(rx_pkts[pos]);
-			rte_mbuf_prefetch_part2(rx_pkts[pos + 1]);
-			rte_mbuf_prefetch_part2(rx_pkts[pos + 2]);
-			rte_mbuf_prefetch_part2(rx_pkts[pos + 3]);
+			RTE_MBUF_PREFETCH_PART2(prefetch0, rx_pkts[pos]);
+			RTE_MBUF_PREFETCH_PART2(prefetch0, rx_pkts[pos + 1]);
+			RTE_MBUF_PREFETCH_PART2(prefetch0, rx_pkts[pos + 2]);
+			RTE_MBUF_PREFETCH_PART2(prefetch0, rx_pkts[pos + 3]);
 		}
 
 		/* avoid compiler reorder optimization */
diff --git a/drivers/net/ixgbe/ixgbe_rxtx_vec.c b/drivers/net/ixgbe/ixgbe_rxtx_vec.c
index 09f4892..55adb56 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx_vec.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx_vec.c
@@ -308,10 +308,10 @@  _recv_raw_pkts_vec(struct ixgbe_rx_queue *rxq, struct rte_mbuf **rx_pkts,
 		_mm_storeu_si128((__m128i *)&rx_pkts[pos+2], mbp2);
 
 		if (split_packet) {
-			rte_mbuf_prefetch_part2(rx_pkts[pos]);
-			rte_mbuf_prefetch_part2(rx_pkts[pos + 1]);
-			rte_mbuf_prefetch_part2(rx_pkts[pos + 2]);
-			rte_mbuf_prefetch_part2(rx_pkts[pos + 3]);
+			RTE_MBUF_PREFETCH_PART2(prefetch0, rx_pkts[pos]);
+			RTE_MBUF_PREFETCH_PART2(prefetch0, rx_pkts[pos + 1]);
+			RTE_MBUF_PREFETCH_PART2(prefetch0, rx_pkts[pos + 2]);
+			RTE_MBUF_PREFETCH_PART2(prefetch0, rx_pkts[pos + 3]);
 		}
 
 		/* avoid compiler reorder optimization */
diff --git a/drivers/net/ixgbe/ixgbe_rxtx_vec_neon.c b/drivers/net/ixgbe/ixgbe_rxtx_vec_neon.c
index 9c1d124..941b2d5 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx_vec_neon.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx_vec_neon.c
@@ -280,10 +280,14 @@  _recv_raw_pkts_vec(struct ixgbe_rx_queue *rxq, struct rte_mbuf **rx_pkts,
 		vst1q_u64((uint64_t *)&rx_pkts[pos + 2], mbp2);
 
 		if (split_packet) {
-			rte_prefetch_non_temporal(&rx_pkts[pos]->cacheline1);
-			rte_prefetch_non_temporal(&rx_pkts[pos + 1]->cacheline1);
-			rte_prefetch_non_temporal(&rx_pkts[pos + 2]->cacheline1);
-			rte_prefetch_non_temporal(&rx_pkts[pos + 3]->cacheline1);
+			RTE_MBUF_PREFETCH_PART2(prefetch_non_temporal,
+						rx_pkts[pos]);
+			RTE_MBUF_PREFETCH_PART2(prefetch_non_temporal,
+						rx_pkts[pos + 1]);
+			RTE_MBUF_PREFETCH_PART2(prefetch_non_temporal,
+						rx_pkts[pos + 2]);
+			RTE_MBUF_PREFETCH_PART2(prefetch_non_temporal,
+						rx_pkts[pos + 3]);
 		}
 
 		/* D.1 pkt 3,4 convert format from desc to pktmbuf */
diff --git a/drivers/net/mlx4/mlx4.c b/drivers/net/mlx4/mlx4.c
index 9ed1491..677ca02 100644
--- a/drivers/net/mlx4/mlx4.c
+++ b/drivers/net/mlx4/mlx4.c
@@ -3283,8 +3283,8 @@  mlx4_rx_burst(void *dpdk_rxq, struct rte_mbuf **pkts, uint16_t pkts_n)
 		 * Fetch initial bytes of packet descriptor into a
 		 * cacheline while allocating rep.
 		 */
-		rte_mbuf_prefetch_part1(seg);
-		rte_mbuf_prefetch_part2(seg);
+		RTE_MBUF_PREFETCH_PART1(prefetch0, seg);
+		RTE_MBUF_PREFETCH_PART2(prefetch0, seg);
 		ret = rxq->if_cq->poll_length_flags(rxq->cq, NULL, NULL,
 						    &flags);
 		if (unlikely(ret < 0)) {
diff --git a/drivers/net/mlx5/mlx5_rxtx.c b/drivers/net/mlx5/mlx5_rxtx.c
index 29bfcec..3d853c5 100644
--- a/drivers/net/mlx5/mlx5_rxtx.c
+++ b/drivers/net/mlx5/mlx5_rxtx.c
@@ -1134,8 +1134,8 @@  mlx5_rx_burst(void *dpdk_rxq, struct rte_mbuf **pkts, uint16_t pkts_n)
 		 * Fetch initial bytes of packet descriptor into a
 		 * cacheline while allocating rep.
 		 */
-		rte_mbuf_prefetch_part1(seg);
-		rte_mbuf_prefetch_part2(seg);
+		RTE_MBUF_PREFETCH_PART1(prefetch0, seg);
+		RTE_MBUF_PREFETCH_PART2(prefetch0, seg);
 		ret = rxq->poll(rxq->cq, NULL, NULL, &flags, &vlan_tci);
 		if (unlikely(ret < 0)) {
 			struct ibv_wc wc;
diff --git a/examples/ipsec-secgw/ipsec-secgw.c b/examples/ipsec-secgw/ipsec-secgw.c
index ebd7c23..2da94b3 100644
--- a/examples/ipsec-secgw/ipsec-secgw.c
+++ b/examples/ipsec-secgw/ipsec-secgw.c
@@ -298,7 +298,7 @@  prepare_tx_burst(struct rte_mbuf *pkts[], uint16_t nb_pkts, uint8_t port)
 	const int32_t prefetch_offset = 2;
 
 	for (i = 0; i < (nb_pkts - prefetch_offset); i++) {
-		rte_mbuf_prefetch_part2(pkts[i + prefetch_offset]);
+		RTE_MBUF_PREFETCH_PART2(prefetch0, pkts[i + prefetch_offset]);
 		prepare_tx_pkt(pkts[i], port);
 	}
 	/* Process left packets */
diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index 11fa06d..f01754c 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -849,14 +849,15 @@  struct rte_mbuf {
  * in the receive path. If the cache line of the architecture is higher than
  * 64B, the second part will also be prefetched.
  *
+ * @param method
+ *   The prefetch method: prefetch0, prefetch1, prefetch2 or
+ *                        prefetch_non_temporal.
+ *
  * @param m
  *   The pointer to the mbuf.
  */
-static inline void
-rte_mbuf_prefetch_part1(struct rte_mbuf *m)
-{
-	rte_prefetch0(&m->cacheline0);
-}
+#define RTE_MBUF_PREFETCH_PART1(method, m)	\
+	rte_##method(&(m)->cacheline0)
 
 /**
  * Prefetch the second part of the mbuf
@@ -866,19 +867,19 @@  rte_mbuf_prefetch_part1(struct rte_mbuf *m)
  * this function does nothing as it is expected that the full mbuf is
  * already in cache.
  *
+ * @param method
+ *   The prefetch method: prefetch0, prefetch1, prefetch2 or
+ *                        prefetch_non_temporal.
+ *
  * @param m
  *   The pointer to the mbuf.
  */
-static inline void
-rte_mbuf_prefetch_part2(struct rte_mbuf *m)
-{
 #if RTE_CACHE_LINE_SIZE == 64
-	rte_prefetch0(&m->cacheline1);
+#define RTE_MBUF_PREFETCH_PART2(method, m)	\
+	rte_##method(&(m)->cacheline1)
 #else
-	RTE_SET_USED(m);
+#define RTE_MBUF_PREFETCH_PART2(method, m)
 #endif
-}
-
 
 static inline uint16_t rte_pktmbuf_priv_size(struct rte_mempool *mp);