[dpdk-dev,1/4] eal/common: introduce rte_memset on IA platform

Message ID 1480926387-63838-2-git-send-email-zhiyong.yang@intel.com (mailing list archive)
State Superseded, archived
Delegated to: Thomas Monjalon
Headers

Checks

Context Check Description
checkpatch/checkpatch success coding style OK

Commit Message

Yang, Zhiyong Dec. 5, 2016, 8:26 a.m. UTC
A performance drop has been observed in some cases when DPDK code calls the glibc
function memset; see the discussion about memset at
http://dpdk.org/ml/archives/dev/2016-October/048628.html
A more efficient function is needed to fix it.
One important property of rte_memset is that it gives clear control
over which instruction flow is used. This patch supports instruction sets
such as SSE & AVX (128 bits), AVX2 (256 bits) and AVX512 (512 bits).
rte_memset makes full use of vectorization and inlining to improve
performance on IA. In addition, cache line and memory alignment are fully
taken into consideration.

Signed-off-by: Zhiyong Yang <zhiyong.yang@intel.com>
---
 .../common/include/arch/x86/rte_memset.h           | 376 +++++++++++++++++++++
 lib/librte_eal/common/include/generic/rte_memset.h |  51 +++
 2 files changed, 427 insertions(+)
 create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memset.h
 create mode 100644 lib/librte_eal/common/include/generic/rte_memset.h
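
The diff body is not reproduced in this archive view. As a rough illustration of the
approach the commit message describes, a 128-bit fill loop built on SSE2 intrinsics
might look like the sketch below; the function name, the fixed 16-byte block and the
scalar tail are illustrative only and need not match the submitted rte_memset.h.

#include <stdint.h>
#include <emmintrin.h>	/* SSE2 intrinsics */

/* Illustrative 128-bit fill loop, not the code from the patch. */
static inline void *
memset_sse_sketch(void *dst, int c, size_t n)
{
	uint8_t *p = dst;
	__m128i v = _mm_set1_epi8((char)c);

	/* bulk path: unaligned 16-byte stores */
	while (n >= 16) {
		_mm_storeu_si128((__m128i *)p, v);
		p += 16;
		n -= 16;
	}
	/* tail: scalar fill for the remainder */
	while (n--)
		*p++ = (uint8_t)c;
	return dst;
}

The real implementation additionally handles alignment and larger AVX2/AVX512 block
sizes, as the commit message notes.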
  

Comments

Thomas Monjalon Dec. 2, 2016, 10:25 a.m. UTC | #1
2016-12-05 16:26, Zhiyong Yang:
> +#ifndef _RTE_MEMSET_X86_64_H_

Is this implementation specific to 64-bit?

> +
> +#define rte_memset memset
> +
> +#else
> +
> +static void *
> +rte_memset(void *dst, int a, size_t n);
> +
> +#endif

If I understand well, rte_memset (as rte_memcpy) is using the most recent
instructions available (and enabled) when compiling.
It is not adapting the instructions to the run-time CPU.
There is no need to downgrade at run-time the instruction set as it is
obviously not a supported case, but it would be nice to be able to
upgrade a "default compilation" at run-time as it is done in rte_acl.
I explain this case more clearly for reference:

We can have AVX512 supported in the compiler but disable it when compiling
(CONFIG_RTE_MACHINE=snb) in order to build a binary running almost everywhere.
When running this binary on a CPU having AVX512 support, it will not
benefit of the AVX512 improvement.
Though, we can compile an AVX512 version of some functions and use them only
if the running CPU is capable.
This kind of miracle can be achieved in two ways:

1/ For generic C code compiled with a recent GCC, a function can be built
for several CPUs thanks to the attribute target_clones.

2/ For manually optimized functions using CPU-specific intrinsics or asm,
it is possible to build them with non-default flags thanks to the
attribute target.

3/ For manually optimized files using CPU-specific intrinsics or asm,
we use specifics flags in the makefile.

The function clone in case 1/ is dynamically chosen at run-time
through ifunc resolver.
The specific functions in cases 2/ and 3/ must chosen at run-time
by initializing a function pointer thanks to rte_cpu_get_flag_enabled().

Note that rte_hash and software crypto PMDs have a run-time check
with rte_cpu_get_flag_enabled() but do not override CFLAGS
in the Makefile. Next step for these libraries?

Back to rte_memset, I think you should try the solution 2/.
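
As a hedged sketch of what solution 2/ could look like for rte_memset: the function
names below are made up for illustration, rte_cpu_get_flag_enabled() and the
RTE_CPUFLAG_* values are the existing EAL CPU-flag API, and the constructor-based
selection is only one possible way to install the function pointer.

#include <stddef.h>
#include <string.h>
#include <rte_cpuflags.h>

/* AVX512 variant built with non-default flags via the target attribute,
 * even when the rest of the tree is compiled for an older machine. */
static __attribute__((target("avx512f"))) void *
rte_memset_avx512f(void *s, int c, size_t n)
{
	return memset(s, c, n);	/* placeholder body; would use 512-bit stores */
}

static void *
rte_memset_default(void *s, int c, size_t n)
{
	return memset(s, c, n);
}

/* Function pointer resolved once, from the run-time CPU flags. */
static void *(*rte_memset_ptr)(void *, int, size_t) = rte_memset_default;

static void __attribute__((constructor))
rte_memset_select(void)
{
	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F))
		rte_memset_ptr = rte_memset_avx512f;
}

rte_memset() itself could then stay inline and forward to rte_memset_ptr only for the
sizes where the indirect call pays off.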
  
Yang, Zhiyong Dec. 8, 2016, 7:41 a.m. UTC | #2
Hi, Thomas:
	Sorry for the late reply. I have been considering your suggestion.

> -----Original Message-----
> From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com]
> Sent: Friday, December 2, 2016 6:25 PM
> To: Yang, Zhiyong <zhiyong.yang@intel.com>
> Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce
> <bruce.richardson@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; De Lara Guarch, Pablo
> <pablo.de.lara.guarch@intel.com>
> Subject: Re: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on
> IA platform
> 
> 2016-12-05 16:26, Zhiyong Yang:
> > +#ifndef _RTE_MEMSET_X86_64_H_
> 
> Is this implementation specific to 64-bit?
> 

Yes.

> > +
> > +#define rte_memset memset
> > +
> > +#else
> > +
> > +static void *
> > +rte_memset(void *dst, int a, size_t n);
> > +
> > +#endif
> 
> If I understand well, rte_memset (as rte_memcpy) is using the most recent
> instructions available (and enabled) when compiling.
> It is not adapting the instructions to the run-time CPU.
> There is no need to downgrade at run-time the instruction set as it is
> obviously not a supported case, but it would be nice to be able to upgrade a
> "default compilation" at run-time as it is done in rte_acl.
> I explain this case more clearly for reference:
> 
> We can have AVX512 supported in the compiler but disable it when compiling
> (CONFIG_RTE_MACHINE=snb) in order to build a binary running almost
> everywhere.
> When running this binary on a CPU having AVX512 support, it will not benefit
> of the AVX512 improvement.
> Though, we can compile an AVX512 version of some functions and use them
> only if the running CPU is capable.
> This kind of miracle can be achieved in two ways:
> 
> 1/ For generic C code compiled with a recent GCC, a function can be built for
> several CPUs thanks to the attribute target_clones.
> 
> 2/ For manually optimized functions using CPU-specific intrinsics or asm, it is
> possible to build them with non-default flags thanks to the attribute target.
> 
> 3/ For manually optimized files using CPU-specific intrinsics or asm, we use
> specifics flags in the makefile.
> 
> The function clone in case 1/ is dynamically chosen at run-time through ifunc
> resolver.
> The specific functions in cases 2/ and 3/ must chosen at run-time by
> initializing a function pointer thanks to rte_cpu_get_flag_enabled().
> 
> Note that rte_hash and software crypto PMDs have a run-time check with
> rte_cpu_get_flag_enabled() but do not override CFLAGS in the Makefile.
> Next step for these libraries?
> 
> Back to rte_memset, I think you should try the solution 2/.

I have read the ACL code. If I understand it well, that approach is a good idea for
complex algorithm implementations, but choosing functions at run time brings some overhead.
For a frequently called function that consumes few cycles, the overhead may be more than the gain the optimization brings.
For example, for most applications in DPDK, memset only sets N = 10 or 12 bytes, which consumes few cycles.

Thanks
Zhiyong
  
Ananyev, Konstantin Dec. 8, 2016, 9:26 a.m. UTC | #3
Hi Zhiyong,

> 
> HI, Thomas:
> 	Sorry for late reply. I have been being always considering your suggestion.
> 
> > -----Original Message-----
> > From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com]
> > Sent: Friday, December 2, 2016 6:25 PM
> > To: Yang, Zhiyong <zhiyong.yang@intel.com>
> > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce
> > <bruce.richardson@intel.com>; Ananyev, Konstantin
> > <konstantin.ananyev@intel.com>; De Lara Guarch, Pablo
> > <pablo.de.lara.guarch@intel.com>
> > Subject: Re: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on
> > IA platform
> >
> > 2016-12-05 16:26, Zhiyong Yang:
> > > +#ifndef _RTE_MEMSET_X86_64_H_
> >
> > Is this implementation specific to 64-bit?
> >
> 
> Yes.
> 
> > > +
> > > +#define rte_memset memset
> > > +
> > > +#else
> > > +
> > > +static void *
> > > +rte_memset(void *dst, int a, size_t n);
> > > +
> > > +#endif
> >
> > If I understand well, rte_memset (as rte_memcpy) is using the most recent
> > instructions available (and enabled) when compiling.
> > It is not adapting the instructions to the run-time CPU.
> > There is no need to downgrade at run-time the instruction set as it is
> > obviously not a supported case, but it would be nice to be able to upgrade a
> > "default compilation" at run-time as it is done in rte_acl.
> > I explain this case more clearly for reference:
> >
> > We can have AVX512 supported in the compiler but disable it when compiling
> > (CONFIG_RTE_MACHINE=snb) in order to build a binary running almost
> > everywhere.
> > When running this binary on a CPU having AVX512 support, it will not benefit
> > of the AVX512 improvement.
> > Though, we can compile an AVX512 version of some functions and use them
> > only if the running CPU is capable.
> > This kind of miracle can be achieved in two ways:
> >
> > 1/ For generic C code compiled with a recent GCC, a function can be built for
> > several CPUs thanks to the attribute target_clones.
> >
> > 2/ For manually optimized functions using CPU-specific intrinsics or asm, it is
> > possible to build them with non-default flags thanks to the attribute target.
> >
> > 3/ For manually optimized files using CPU-specific intrinsics or asm, we use
> > specifics flags in the makefile.
> >
> > The function clone in case 1/ is dynamically chosen at run-time through ifunc
> > resolver.
> > The specific functions in cases 2/ and 3/ must chosen at run-time by
> > initializing a function pointer thanks to rte_cpu_get_flag_enabled().
> >
> > Note that rte_hash and software crypto PMDs have a run-time check with
> > rte_cpu_get_flag_enabled() but do not override CFLAGS in the Makefile.
> > Next step for these libraries?
> >
> > Back to rte_memset, I think you should try the solution 2/.
> 
> I have read the ACL code, if I understand well , for complex algo implementation,
> it is good idea, but Choosing functions at run time will bring some overhead. For frequently  called function
> Which consumes small cycles, the overhead maybe is more than  the gains optimizations brings
> For example, for most applications in dpdk, memset only set N = 10 or 12bytes. It consumes fewer cycles.

But then what is the point of having an rte_memset() using vector instructions at all?
From what you are saying, the most common case is even less than the SSE register size.
Konstantin

> 
> Thanks
> Zhiyong
  
Yang, Zhiyong Dec. 8, 2016, 9:53 a.m. UTC | #4
Hi, Konstantin:

> -----Original Message-----
> From: Ananyev, Konstantin
> Sent: Thursday, December 8, 2016 5:26 PM
> To: Yang, Zhiyong <zhiyong.yang@intel.com>; Thomas Monjalon
> <thomas.monjalon@6wind.com>
> Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce
> <bruce.richardson@intel.com>; De Lara Guarch, Pablo
> <pablo.de.lara.guarch@intel.com>
> Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on
> IA platform
> 
> 
> Hi Zhiyong,
> 
> >
> > HI, Thomas:
> > 	Sorry for late reply. I have been being always considering your
> suggestion.
> >
> > > -----Original Message-----
> > > From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com]
> > > Sent: Friday, December 2, 2016 6:25 PM
> > > To: Yang, Zhiyong <zhiyong.yang@intel.com>
> > > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce
> > > <bruce.richardson@intel.com>; Ananyev, Konstantin
> > > <konstantin.ananyev@intel.com>; De Lara Guarch, Pablo
> > > <pablo.de.lara.guarch@intel.com>
> > > Subject: Re: [dpdk-dev] [PATCH 1/4] eal/common: introduce
> rte_memset
> > > on IA platform
> > >
> > > 2016-12-05 16:26, Zhiyong Yang:
> > > > +#ifndef _RTE_MEMSET_X86_64_H_
> > >
> > > Is this implementation specific to 64-bit?
> > >
> >
> > Yes.
> >
> > > > +
> > > > +#define rte_memset memset
> > > > +
> > > > +#else
> > > > +
> > > > +static void *
> > > > +rte_memset(void *dst, int a, size_t n);
> > > > +
> > > > +#endif
> > >
> > > If I understand well, rte_memset (as rte_memcpy) is using the most
> > > recent instructions available (and enabled) when compiling.
> > > It is not adapting the instructions to the run-time CPU.
> > > There is no need to downgrade at run-time the instruction set as it
> > > is obviously not a supported case, but it would be nice to be able
> > > to upgrade a "default compilation" at run-time as it is done in rte_acl.
> > > I explain this case more clearly for reference:
> > >
> > > We can have AVX512 supported in the compiler but disable it when
> > > compiling
> > > (CONFIG_RTE_MACHINE=snb) in order to build a binary running almost
> > > everywhere.
> > > When running this binary on a CPU having AVX512 support, it will not
> > > benefit of the AVX512 improvement.
> > > Though, we can compile an AVX512 version of some functions and use
> > > them only if the running CPU is capable.
> > > This kind of miracle can be achieved in two ways:
> > >
> > > 1/ For generic C code compiled with a recent GCC, a function can be
> > > built for several CPUs thanks to the attribute target_clones.
> > >
> > > 2/ For manually optimized functions using CPU-specific intrinsics or
> > > asm, it is possible to build them with non-default flags thanks to the
> attribute target.
> > >
> > > 3/ For manually optimized files using CPU-specific intrinsics or
> > > asm, we use specifics flags in the makefile.
> > >
> > > The function clone in case 1/ is dynamically chosen at run-time
> > > through ifunc resolver.
> > > The specific functions in cases 2/ and 3/ must chosen at run-time by
> > > initializing a function pointer thanks to rte_cpu_get_flag_enabled().
> > >
> > > Note that rte_hash and software crypto PMDs have a run-time check
> > > with
> > > rte_cpu_get_flag_enabled() but do not override CFLAGS in the Makefile.
> > > Next step for these libraries?
> > >
> > > Back to rte_memset, I think you should try the solution 2/.
> >
> > I have read the ACL code, if I understand well , for complex algo
> > implementation, it is good idea, but Choosing functions at run time
> > will bring some overhead. For frequently  called function Which
> > consumes small cycles, the overhead maybe is more than  the gains
> optimizations brings For example, for most applications in dpdk, memset only
> set N = 10 or 12bytes. It consumes fewer cycles.
> 
> But then what the point to have an rte_memset() using vector instructions at
> all?
> From what you are saying the most common case is even less then SSE
> register size.
> Konstantin


In most cases, memset is used like memset(address, 0, sizeof(struct xxx));
The use case above is small only by accident; I just gave it as an example.
But rte_memset is introduced to cover the generic case.
sizeof(struct xxx) is not limited to very small sizes such as less than the SSE register size.
I just want to say that the size in the most common use cases is not very large, so the cycles consumed
are not large. Choosing the function at run time is not suitable once that overhead is considered.

thanks
Zhiyong
  
Bruce Richardson Dec. 8, 2016, 10:27 a.m. UTC | #5
On Thu, Dec 08, 2016 at 09:53:12AM +0000, Yang, Zhiyong wrote:
> Hi, Konstantin:
> 
> > -----Original Message-----
> > From: Ananyev, Konstantin
> > Sent: Thursday, December 8, 2016 5:26 PM
> > To: Yang, Zhiyong <zhiyong.yang@intel.com>; Thomas Monjalon
> > <thomas.monjalon@6wind.com>
> > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce
> > <bruce.richardson@intel.com>; De Lara Guarch, Pablo
> > <pablo.de.lara.guarch@intel.com>
> > Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on
> > IA platform
> > 
> > 
> > Hi Zhiyong,
> > 
> > >
> > > HI, Thomas:
> > > 	Sorry for late reply. I have been being always considering your
> > suggestion.
> > >
> > > > -----Original Message-----
> > > > From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com]
> > > > Sent: Friday, December 2, 2016 6:25 PM
> > > > To: Yang, Zhiyong <zhiyong.yang@intel.com>
> > > > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce
> > > > <bruce.richardson@intel.com>; Ananyev, Konstantin
> > > > <konstantin.ananyev@intel.com>; De Lara Guarch, Pablo
> > > > <pablo.de.lara.guarch@intel.com>
> > > > Subject: Re: [dpdk-dev] [PATCH 1/4] eal/common: introduce
> > rte_memset
> > > > on IA platform
> > > >
> > > > 2016-12-05 16:26, Zhiyong Yang:
> > > > > +#ifndef _RTE_MEMSET_X86_64_H_
> > > >
> > > > Is this implementation specific to 64-bit?
> > > >
> > >
> > > Yes.
> > >
> > > > > +
> > > > > +#define rte_memset memset
> > > > > +
> > > > > +#else
> > > > > +
> > > > > +static void *
> > > > > +rte_memset(void *dst, int a, size_t n);
> > > > > +
> > > > > +#endif
> > > >
> > > > If I understand well, rte_memset (as rte_memcpy) is using the most
> > > > recent instructions available (and enabled) when compiling.
> > > > It is not adapting the instructions to the run-time CPU.
> > > > There is no need to downgrade at run-time the instruction set as it
> > > > is obviously not a supported case, but it would be nice to be able
> > > > to upgrade a "default compilation" at run-time as it is done in rte_acl.
> > > > I explain this case more clearly for reference:
> > > >
> > > > We can have AVX512 supported in the compiler but disable it when
> > > > compiling
> > > > (CONFIG_RTE_MACHINE=snb) in order to build a binary running almost
> > > > everywhere.
> > > > When running this binary on a CPU having AVX512 support, it will not
> > > > benefit of the AVX512 improvement.
> > > > Though, we can compile an AVX512 version of some functions and use
> > > > them only if the running CPU is capable.
> > > > This kind of miracle can be achieved in two ways:
> > > >
> > > > 1/ For generic C code compiled with a recent GCC, a function can be
> > > > built for several CPUs thanks to the attribute target_clones.
> > > >
> > > > 2/ For manually optimized functions using CPU-specific intrinsics or
> > > > asm, it is possible to build them with non-default flags thanks to the
> > attribute target.
> > > >
> > > > 3/ For manually optimized files using CPU-specific intrinsics or
> > > > asm, we use specifics flags in the makefile.
> > > >
> > > > The function clone in case 1/ is dynamically chosen at run-time
> > > > through ifunc resolver.
> > > > The specific functions in cases 2/ and 3/ must chosen at run-time by
> > > > initializing a function pointer thanks to rte_cpu_get_flag_enabled().
> > > >
> > > > Note that rte_hash and software crypto PMDs have a run-time check
> > > > with
> > > > rte_cpu_get_flag_enabled() but do not override CFLAGS in the Makefile.
> > > > Next step for these libraries?
> > > >
> > > > Back to rte_memset, I think you should try the solution 2/.
> > >
> > > I have read the ACL code, if I understand well , for complex algo
> > > implementation, it is good idea, but Choosing functions at run time
> > > will bring some overhead. For frequently  called function Which
> > > consumes small cycles, the overhead maybe is more than  the gains
> > optimizations brings For example, for most applications in dpdk, memset only
> > set N = 10 or 12bytes. It consumes fewer cycles.
> > 
> > But then what the point to have an rte_memset() using vector instructions at
> > all?
> > From what you are saying the most common case is even less then SSE
> > register size.
> > Konstantin
> 
> For most cases, memset is used such as memset(address, 0, sizeof(struct xxx)); 
> The use case here is small by accident, I only give an example here. 
> but rte_memset is introduced to need consider generic case. 
> sizeof(struct xxx) is not limited to very small size, such as  less than SSE register size.
> I just want to say that the size for the most use case is not very large,  So cycles consumed
> Is not large. It is not suited to choose function at run-time since overhead  is considered.
> 
For small copies with sizes specified at compile time, do compilers not
fully inline the memset call with a fixed-size equivalent? I believe
some compilers used to do so with memcpy - which is why we had a macro
for it in DPDK, so that compile-time constant copies would use regular
memcpy. If that is also the case for memset, then we should perhaps
specify that rte_memset is only for relatively large copies, e.g. >64
bytes. In that case, run-time detection may be worthwhile.

/Bruce
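
A hedged sketch of the split Bruce suggests: the 64-byte cutoff is only an example, and
rte_memset_large() is a hypothetical name for the hand-written vector path.

#include <stddef.h>
#include <string.h>

void *rte_memset_large(void *s, int c, size_t n);	/* hypothetical vector path */

/* Let the compiler expand small or compile-time-constant fills inline,
 * and use the hand-optimized routine only above the example cutoff. */
static inline void *
rte_memset_sketch(void *s, int c, size_t n)
{
	if (__builtin_constant_p(n) || n <= 64)
		return memset(s, c, n);
	return rte_memset_large(s, c, n);
}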
  
Ananyev, Konstantin Dec. 8, 2016, 10:30 a.m. UTC | #6
> -----Original Message-----
> From: Yang, Zhiyong
> Sent: Thursday, December 8, 2016 9:53 AM
> To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Thomas Monjalon <thomas.monjalon@6wind.com>
> Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce <bruce.richardson@intel.com>; De Lara Guarch, Pablo
> <pablo.de.lara.guarch@intel.com>
> Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on IA platform
> 
> Hi, Konstantin:
> 
> > -----Original Message-----
> > From: Ananyev, Konstantin
> > Sent: Thursday, December 8, 2016 5:26 PM
> > To: Yang, Zhiyong <zhiyong.yang@intel.com>; Thomas Monjalon
> > <thomas.monjalon@6wind.com>
> > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce
> > <bruce.richardson@intel.com>; De Lara Guarch, Pablo
> > <pablo.de.lara.guarch@intel.com>
> > Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on
> > IA platform
> >
> >
> > Hi Zhiyong,
> >
> > >
> > > HI, Thomas:
> > > 	Sorry for late reply. I have been being always considering your
> > suggestion.
> > >
> > > > -----Original Message-----
> > > > From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com]
> > > > Sent: Friday, December 2, 2016 6:25 PM
> > > > To: Yang, Zhiyong <zhiyong.yang@intel.com>
> > > > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce
> > > > <bruce.richardson@intel.com>; Ananyev, Konstantin
> > > > <konstantin.ananyev@intel.com>; De Lara Guarch, Pablo
> > > > <pablo.de.lara.guarch@intel.com>
> > > > Subject: Re: [dpdk-dev] [PATCH 1/4] eal/common: introduce
> > rte_memset
> > > > on IA platform
> > > >
> > > > 2016-12-05 16:26, Zhiyong Yang:
> > > > > +#ifndef _RTE_MEMSET_X86_64_H_
> > > >
> > > > Is this implementation specific to 64-bit?
> > > >
> > >
> > > Yes.
> > >
> > > > > +
> > > > > +#define rte_memset memset
> > > > > +
> > > > > +#else
> > > > > +
> > > > > +static void *
> > > > > +rte_memset(void *dst, int a, size_t n);
> > > > > +
> > > > > +#endif
> > > >
> > > > If I understand well, rte_memset (as rte_memcpy) is using the most
> > > > recent instructions available (and enabled) when compiling.
> > > > It is not adapting the instructions to the run-time CPU.
> > > > There is no need to downgrade at run-time the instruction set as it
> > > > is obviously not a supported case, but it would be nice to be able
> > > > to upgrade a "default compilation" at run-time as it is done in rte_acl.
> > > > I explain this case more clearly for reference:
> > > >
> > > > We can have AVX512 supported in the compiler but disable it when
> > > > compiling
> > > > (CONFIG_RTE_MACHINE=snb) in order to build a binary running almost
> > > > everywhere.
> > > > When running this binary on a CPU having AVX512 support, it will not
> > > > benefit of the AVX512 improvement.
> > > > Though, we can compile an AVX512 version of some functions and use
> > > > them only if the running CPU is capable.
> > > > This kind of miracle can be achieved in two ways:
> > > >
> > > > 1/ For generic C code compiled with a recent GCC, a function can be
> > > > built for several CPUs thanks to the attribute target_clones.
> > > >
> > > > 2/ For manually optimized functions using CPU-specific intrinsics or
> > > > asm, it is possible to build them with non-default flags thanks to the
> > attribute target.
> > > >
> > > > 3/ For manually optimized files using CPU-specific intrinsics or
> > > > asm, we use specifics flags in the makefile.
> > > >
> > > > The function clone in case 1/ is dynamically chosen at run-time
> > > > through ifunc resolver.
> > > > The specific functions in cases 2/ and 3/ must chosen at run-time by
> > > > initializing a function pointer thanks to rte_cpu_get_flag_enabled().
> > > >
> > > > Note that rte_hash and software crypto PMDs have a run-time check
> > > > with
> > > > rte_cpu_get_flag_enabled() but do not override CFLAGS in the Makefile.
> > > > Next step for these libraries?
> > > >
> > > > Back to rte_memset, I think you should try the solution 2/.
> > >
> > > I have read the ACL code, if I understand well , for complex algo
> > > implementation, it is good idea, but Choosing functions at run time
> > > will bring some overhead. For frequently  called function Which
> > > consumes small cycles, the overhead maybe is more than  the gains
> > optimizations brings For example, for most applications in dpdk, memset only
> > set N = 10 or 12bytes. It consumes fewer cycles.
> >
> > But then what the point to have an rte_memset() using vector instructions at
> > all?
> > From what you are saying the most common case is even less then SSE
> > register size.
> > Konstantin
> 
> For most cases, memset is used such as memset(address, 0, sizeof(struct xxx));

Ok then I suppose for such cases you don't need any special function and memset()
would still be the best choice, right?

> The use case here is small by accident, I only give an example here.
> but rte_memset is introduced to need consider generic case.

We can have rte_memset_huge() or so instead, and document that
it should be used for sizes greater than some cutoff point.
Inside it you can just call a function pointer installed at startup (same as rte_acl_classify() does).
For big sizes, I suppose the price of the extra function pointer call would not affect performance much.
For sizes smaller than this cutoff point you can still use either rte_memset_scalar() or just normal rte_memset().
Something like that:

extern void *(*__rte_memset_vector)(void *s, int c, size_t n);

static inline void*
rte_memset_huge(void *s, int c, size_t n)
{
   return __rte_memset_vector(s, c, n);
}

static inline void *
rte_memset(void *s, int c, size_t n)
{
	if (n < XXX)
		return rte_memset_scalar(s, c, n);
	else
		return rte_memset_huge(s, c, n);
}

XXX could be either a define, or it could also be a variable, so it can be set up at startup,
depending on the architecture.

Would that work?
Konstantin

> sizeof(struct xxx) is not limited to very small size, such as  less than SSE register size.
> I just want to say that the size for the most use case is not very large,  So cycles consumed
> Is not large. It is not suited to choose function at run-time since overhead  is considered.
> 
> thanks
> Zhiyong
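
rte_memset_scalar() is only named in the sketch above and is not defined anywhere in
this thread; a minimal, hypothetical version of such a scalar helper (small enough for
the compiler to inline) could be:

#include <stddef.h>
#include <stdint.h>

static inline void *
rte_memset_scalar(void *s, int c, size_t n)
{
	uint8_t *p = s;

	/* plain byte fill; compilers typically inline this for small n */
	while (n--)
		*p++ = (uint8_t)c;
	return s;
}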
  
Thomas Monjalon Dec. 8, 2016, 3:09 p.m. UTC | #7
2016-12-08 07:41, Yang, Zhiyong:
> From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com]
> > 2016-12-05 16:26, Zhiyong Yang:
> > > +#ifndef _RTE_MEMSET_X86_64_H_
> > 
> > Is this implementation specific to 64-bit?
> > 
> 
> Yes.

So should we rename this file?
rte_memset.h -> rte_memset_64.h

You need also to create a file rte_memset.h for each arch.
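
Following the layout other EAL headers use for 32/64-bit splits, a hedged sketch of the
x86 wrapper Thomas suggests could be the following; the file names follow his proposal
and RTE_ARCH_X86_64 is the existing build macro.

/* lib/librte_eal/common/include/arch/x86/rte_memset.h (sketch) */
#ifndef _RTE_MEMSET_X86_H_
#define _RTE_MEMSET_X86_H_

#ifdef RTE_ARCH_X86_64
#include "rte_memset_64.h"	/* the 64-bit implementation from this patch */
#else
#include <string.h>
#define rte_memset memset	/* 32-bit: fall back to libc for now */
#endif

#endif /* _RTE_MEMSET_X86_H_ */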
  
Yang, Zhiyong Dec. 11, 2016, 12:04 p.m. UTC | #8
Hi, Thomas:

> -----Original Message-----
> From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com]
> Sent: Thursday, December 8, 2016 11:10 PM
> To: Yang, Zhiyong <zhiyong.yang@intel.com>
> Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce
> <bruce.richardson@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; De Lara Guarch, Pablo
> <pablo.de.lara.guarch@intel.com>
> Subject: Re: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on
> IA platform
> 
> 2016-12-08 07:41, Yang, Zhiyong:
> > From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com]
> > > 2016-12-05 16:26, Zhiyong Yang:
> > > > +#ifndef _RTE_MEMSET_X86_64_H_
> > >
> > > Is this implementation specific to 64-bit?
> > >
> >
> > Yes.
> 
> So should we rename this file?
> rte_memset.h -> rte_memset_64.h
> 
> You need also to create a file rte_memset.h for each arch.

Ok
  
Yang, Zhiyong Dec. 11, 2016, 12:32 p.m. UTC | #9
Hi, Konstantin, Bruce:

> -----Original Message-----
> From: Ananyev, Konstantin
> Sent: Thursday, December 8, 2016 6:31 PM
> To: Yang, Zhiyong <zhiyong.yang@intel.com>; Thomas Monjalon
> <thomas.monjalon@6wind.com>
> Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce
> <bruce.richardson@intel.com>; De Lara Guarch, Pablo
> <pablo.de.lara.guarch@intel.com>
> Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on
> IA platform
> 
> 
> 
> > -----Original Message-----
> > From: Yang, Zhiyong
> > Sent: Thursday, December 8, 2016 9:53 AM
> > To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Thomas
> > Monjalon <thomas.monjalon@6wind.com>
> > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce
> > <bruce.richardson@intel.com>; De Lara Guarch, Pablo
> > <pablo.de.lara.guarch@intel.com>
> > Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset
> > on IA platform
> >
> > Hi, Konstantin:
> >
> > > -----Original Message-----
> > > From: Ananyev, Konstantin
> > > Sent: Thursday, December 8, 2016 5:26 PM
> > > To: Yang, Zhiyong <zhiyong.yang@intel.com>; Thomas Monjalon
> > > <thomas.monjalon@6wind.com>
> > > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce
> > > <bruce.richardson@intel.com>; De Lara Guarch, Pablo
> > > <pablo.de.lara.guarch@intel.com>
> > > Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset
> > > on IA platform
> > >
> > >
> > > Hi Zhiyong,
> > >
> > > >
> > > > HI, Thomas:
> > > > 	Sorry for late reply. I have been being always considering your
> > > suggestion.
> > > >
> > > > > -----Original Message-----
> > > > > From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com]
> > > > > Sent: Friday, December 2, 2016 6:25 PM
> > > > > To: Yang, Zhiyong <zhiyong.yang@intel.com>
> > > > > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce
> > > > > <bruce.richardson@intel.com>; Ananyev, Konstantin
> > > > > <konstantin.ananyev@intel.com>; De Lara Guarch, Pablo
> > > > > <pablo.de.lara.guarch@intel.com>
> > > > > Subject: Re: [dpdk-dev] [PATCH 1/4] eal/common: introduce
> > > rte_memset
> > > > > on IA platform
> > > > >
> > > > > 2016-12-05 16:26, Zhiyong Yang:
> > > > > > +#ifndef _RTE_MEMSET_X86_64_H_
> > > > >
> > > > > Is this implementation specific to 64-bit?
> > > > >
> > > >
> > > > Yes.
> > > >
> > > > > > +
> > > > > > +#define rte_memset memset
> > > > > > +
> > > > > > +#else
> > > > > > +
> > > > > > +static void *
> > > > > > +rte_memset(void *dst, int a, size_t n);
> > > > > > +
> > > > > > +#endif
> > > > >
> > > > > If I understand well, rte_memset (as rte_memcpy) is using the
> > > > > most recent instructions available (and enabled) when compiling.
> > > > > It is not adapting the instructions to the run-time CPU.
> > > > > There is no need to downgrade at run-time the instruction set as
> > > > > it is obviously not a supported case, but it would be nice to be
> > > > > able to upgrade a "default compilation" at run-time as it is done in
> rte_acl.
> > > > > I explain this case more clearly for reference:
> > > > >
> > > > > We can have AVX512 supported in the compiler but disable it when
> > > > > compiling
> > > > > (CONFIG_RTE_MACHINE=snb) in order to build a binary running
> > > > > almost everywhere.
> > > > > When running this binary on a CPU having AVX512 support, it will
> > > > > not benefit of the AVX512 improvement.
> > > > > Though, we can compile an AVX512 version of some functions and
> > > > > use them only if the running CPU is capable.
> > > > > This kind of miracle can be achieved in two ways:
> > > > >
> > > > > 1/ For generic C code compiled with a recent GCC, a function can
> > > > > be built for several CPUs thanks to the attribute target_clones.
> > > > >
> > > > > 2/ For manually optimized functions using CPU-specific
> > > > > intrinsics or asm, it is possible to build them with non-default
> > > > > flags thanks to the
> > > attribute target.
> > > > >
> > > > > 3/ For manually optimized files using CPU-specific intrinsics or
> > > > > asm, we use specifics flags in the makefile.
> > > > >
> > > > > The function clone in case 1/ is dynamically chosen at run-time
> > > > > through ifunc resolver.
> > > > > The specific functions in cases 2/ and 3/ must chosen at
> > > > > run-time by initializing a function pointer thanks to
> rte_cpu_get_flag_enabled().
> > > > >
> > > > > Note that rte_hash and software crypto PMDs have a run-time
> > > > > check with
> > > > > rte_cpu_get_flag_enabled() but do not override CFLAGS in the
> Makefile.
> > > > > Next step for these libraries?
> > > > >
> > > > > Back to rte_memset, I think you should try the solution 2/.
> > > >
> > > > I have read the ACL code, if I understand well , for complex algo
> > > > implementation, it is good idea, but Choosing functions at run
> > > > time will bring some overhead. For frequently  called function
> > > > Which consumes small cycles, the overhead maybe is more than  the
> > > > gains
> > > optimizations brings For example, for most applications in dpdk,
> > > memset only set N = 10 or 12bytes. It consumes fewer cycles.
> > >
> > > But then what the point to have an rte_memset() using vector
> > > instructions at all?
> > > From what you are saying the most common case is even less then SSE
> > > register size.
> > > Konstantin
> >
> > For most cases, memset is used such as memset(address, 0,
> > sizeof(struct xxx));
> 
> Ok then I suppose for such cases you don't need any special function and
> memset() would still be the best choice, right?
> 


In fact, the bad performance drop has been found on IVB; please refer to
http://dpdk.org/ml/archives/dev/2016-October/048628.html
The following code causes the perf issue:
memset((void *)(uintptr_t)&(virtio_hdr->hdr), 0, dev->vhost_hlen);
vhost_hlen is 10 or 12 bytes, so glibc memset is not used here.

> > The use case here is small by accident, I only give an example here.
> > but rte_memset is introduced to need consider generic case.
> 
> We can have rte_memset_huge() or so instead, and document that it should
> be used for sizes greater than some cutoff point.
> Inside it you can just call a function pointer installed at startup (same as
> rte_acl_classify() does).
> For big sizes, I suppose the price of extra function pointer call would not
> affect performance much.
> For sizes smaller then this cutoff point you still can use either
> rte_memset_scalar() or just normal rte_memset().
> Something like that:
> 
> extern void *(*__rte_memset_vector)( (void *s, int c, size_t n);
> 
> static inline void*
> rte_memset_huge(void *s, int c, size_t n) {
>    return __rte_memset_vector(s, c, n);
> }
> 
> static inline void *
> rte_memset(void *s, int c, size_t n)
> {
> 	If (n < XXX)
> 		return rte_memset_scalar(s, c, n);
> 	else
> 		return rte_memset_huge(s, c, n);
> }
> 
> XXX could be either a define, or could also be a variable, so it can be setuped
> at startup, depending on the architecture.
> 
> Would that work?
> Konstantin
> 

The idea sounds good. It may be more feasible for rte_memcpy and rte_memset.
If I understand well, the idea from Bruce is similar, right?

> > sizeof(struct xxx) is not limited to very small size, such as  less than SSE
> register size.
> > I just want to say that the size for the most use case is not very
> > large,  So cycles consumed Is not large. It is not suited to choose function at
> run-time since overhead  is considered.
> >
> > thanks
> > Zhiyong
  
Yang, Zhiyong Dec. 15, 2016, 6:51 a.m. UTC | #10
Hi, Thomas, Konstantin:

> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Yang, Zhiyong
> Sent: Sunday, December 11, 2016 8:33 PM
> To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Thomas
> Monjalon <thomas.monjalon@6wind.com>
> Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce
> <bruce.richardson@intel.com>; De Lara Guarch, Pablo
> <pablo.de.lara.guarch@intel.com>
> Subject: Re: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on
> IA platform
> 
> Hi, Konstantin, Bruce:
> 
> > -----Original Message-----
> > From: Ananyev, Konstantin
> > Sent: Thursday, December 8, 2016 6:31 PM
> > To: Yang, Zhiyong <zhiyong.yang@intel.com>; Thomas Monjalon
> > <thomas.monjalon@6wind.com>
> > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce
> > <bruce.richardson@intel.com>; De Lara Guarch, Pablo
> > <pablo.de.lara.guarch@intel.com>
> > Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset
> > on IA platform
> >
> >
> >
> > > -----Original Message-----
> > > From: Yang, Zhiyong
> > > Sent: Thursday, December 8, 2016 9:53 AM
> > > To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Thomas
> > > Monjalon <thomas.monjalon@6wind.com>
> > > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce
> > > <bruce.richardson@intel.com>; De Lara Guarch, Pablo
> > > <pablo.de.lara.guarch@intel.com>
> > > Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset
> > > on IA platform
> > >
> > extern void *(*__rte_memset_vector)( (void *s, int c, size_t n);
> >
> > static inline void*
> > rte_memset_huge(void *s, int c, size_t n) {
> >    return __rte_memset_vector(s, c, n); }
> >
> > static inline void *
> > rte_memset(void *s, int c, size_t n)
> > {
> > 	If (n < XXX)
> > 		return rte_memset_scalar(s, c, n);
> > 	else
> > 		return rte_memset_huge(s, c, n);
> > }
> >
> > XXX could be either a define, or could also be a variable, so it can
> > be setuped at startup, depending on the architecture.
> >
> > Would that work?
> > Konstantin
> >
I have implemented the code for choosing the functions at run time.
rte_memcpy is used more frequently, so I tested it at run time.

typedef void *(*rte_memcpy_vector_t)(void *dst, const void *src, size_t n);
extern rte_memcpy_vector_t rte_memcpy_vector;
static inline void *
rte_memcpy(void *dst, const void *src, size_t n)
{
        return rte_memcpy_vector(dst, src, n);
}
In order to reduce the overhead at run time,
I assign the function address to the variable rte_memcpy_vector before main() starts, so the variable is initialized early.

static void __attribute__((constructor))
rte_memcpy_init(void)
{
	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
	{
		rte_memcpy_vector = rte_memcpy_avx2;
	}
	else if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
	{
		rte_memcpy_vector = rte_memcpy_sse;
	}
	else
	{
		rte_memcpy_vector = memcpy;
	}

}
I ran the same virtio/vhost loopback tests without a NIC.
I can see the throughput drop when choosing the functions at run time,
compared to the original code, as follows on the same platform (my machine is Haswell):
	Packet size	perf drop
	64		-4%
	256		-5.4%
	1024		-5%
	1500		-2.5%
Another thing: I ran memcpy_perf_autotest; when N <= 128,
the rte_memcpy perf gains almost disappear
when choosing functions at run time. For other values of N, the perf gains become narrower.

Thanks
Zhiyong
  
Bruce Richardson Dec. 15, 2016, 10:12 a.m. UTC | #11
On Thu, Dec 15, 2016 at 06:51:08AM +0000, Yang, Zhiyong wrote:
> Hi, Thomas, Konstantin:
> 
> > -----Original Message-----
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Yang, Zhiyong
> > Sent: Sunday, December 11, 2016 8:33 PM
> > To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Thomas
> > Monjalon <thomas.monjalon@6wind.com>
> > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce
> > <bruce.richardson@intel.com>; De Lara Guarch, Pablo
> > <pablo.de.lara.guarch@intel.com>
> > Subject: Re: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on
> > IA platform
> > 
> > Hi, Konstantin, Bruce:
> > 
> > > -----Original Message-----
> > > From: Ananyev, Konstantin
> > > Sent: Thursday, December 8, 2016 6:31 PM
> > > To: Yang, Zhiyong <zhiyong.yang@intel.com>; Thomas Monjalon
> > > <thomas.monjalon@6wind.com>
> > > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce
> > > <bruce.richardson@intel.com>; De Lara Guarch, Pablo
> > > <pablo.de.lara.guarch@intel.com>
> > > Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset
> > > on IA platform
> > >
> > >
> > >
> > > > -----Original Message-----
> > > > From: Yang, Zhiyong
> > > > Sent: Thursday, December 8, 2016 9:53 AM
> > > > To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Thomas
> > > > Monjalon <thomas.monjalon@6wind.com>
> > > > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce
> > > > <bruce.richardson@intel.com>; De Lara Guarch, Pablo
> > > > <pablo.de.lara.guarch@intel.com>
> > > > Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset
> > > > on IA platform
> > > >
> > > extern void *(*__rte_memset_vector)( (void *s, int c, size_t n);
> > >
> > > static inline void*
> > > rte_memset_huge(void *s, int c, size_t n) {
> > >    return __rte_memset_vector(s, c, n); }
> > >
> > > static inline void *
> > > rte_memset(void *s, int c, size_t n)
> > > {
> > > 	If (n < XXX)
> > > 		return rte_memset_scalar(s, c, n);
> > > 	else
> > > 		return rte_memset_huge(s, c, n);
> > > }
> > >
> > > XXX could be either a define, or could also be a variable, so it can
> > > be setuped at startup, depending on the architecture.
> > >
> > > Would that work?
> > > Konstantin
> > >
> I have implemented the code for  choosing the functions at run time.
> rte_memcpy is used more frequently, So I test it at run time. 
> 
> typedef void *(*rte_memcpy_vector_t)(void *dst, const void *src, size_t n);
> extern rte_memcpy_vector_t rte_memcpy_vector;
> static inline void *
> rte_memcpy(void *dst, const void *src, size_t n)
> {
>         return rte_memcpy_vector(dst, src, n);
> }
> In order to reduce the overhead at run time, 
> I assign the function address to var rte_memcpy_vector before main() starts to init the var.
> 
> static void __attribute__((constructor))
> rte_memcpy_init(void)
> {
> 	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
> 	{
> 		rte_memcpy_vector = rte_memcpy_avx2;
> 	}
> 	else if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
> 	{
> 		rte_memcpy_vector = rte_memcpy_sse;
> 	}
> 	else
> 	{
> 		rte_memcpy_vector = memcpy;
> 	}
> 
> }
> I run the same virtio/vhost loopback tests without NIC.
> I can see the  throughput drop  when running choosing functions at run time
> compared to original code as following on the same platform(my machine is haswell) 
> 	Packet size	perf drop
> 	64 		-4%
> 	256 		-5.4%
> 	1024		-5%
> 	1500		-2.5%
> Another thing, I run the memcpy_perf_autotest,  when N= <128, 
> the rte_memcpy perf gains almost disappears
> When choosing functions at run time.  For N=other numbers, the perf gains will become narrow.
> 
How narrow? How significant is the improvement that we gain from having
to maintain our own copy of memcpy? If the libc version is nearly as
good, we should just use that.

/Bruce
  
Ananyev, Konstantin Dec. 15, 2016, 10:53 a.m. UTC | #12
Hi Zhiyong,

> -----Original Message-----
> From: Yang, Zhiyong
> Sent: Thursday, December 15, 2016 6:51 AM
> To: Yang, Zhiyong <zhiyong.yang@intel.com>; Ananyev, Konstantin <konstantin.ananyev@intel.com>; Thomas Monjalon
> <thomas.monjalon@6wind.com>
> Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce <bruce.richardson@intel.com>; De Lara Guarch, Pablo
> <pablo.de.lara.guarch@intel.com>
> Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on IA platform
> 
> Hi, Thomas, Konstantin:
> 
> > -----Original Message-----
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Yang, Zhiyong
> > Sent: Sunday, December 11, 2016 8:33 PM
> > To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Thomas
> > Monjalon <thomas.monjalon@6wind.com>
> > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce
> > <bruce.richardson@intel.com>; De Lara Guarch, Pablo
> > <pablo.de.lara.guarch@intel.com>
> > Subject: Re: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on
> > IA platform
> >
> > Hi, Konstantin, Bruce:
> >
> > > -----Original Message-----
> > > From: Ananyev, Konstantin
> > > Sent: Thursday, December 8, 2016 6:31 PM
> > > To: Yang, Zhiyong <zhiyong.yang@intel.com>; Thomas Monjalon
> > > <thomas.monjalon@6wind.com>
> > > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce
> > > <bruce.richardson@intel.com>; De Lara Guarch, Pablo
> > > <pablo.de.lara.guarch@intel.com>
> > > Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset
> > > on IA platform
> > >
> > >
> > >
> > > > -----Original Message-----
> > > > From: Yang, Zhiyong
> > > > Sent: Thursday, December 8, 2016 9:53 AM
> > > > To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Thomas
> > > > Monjalon <thomas.monjalon@6wind.com>
> > > > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce
> > > > <bruce.richardson@intel.com>; De Lara Guarch, Pablo
> > > > <pablo.de.lara.guarch@intel.com>
> > > > Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset
> > > > on IA platform
> > > >
> > > extern void *(*__rte_memset_vector)( (void *s, int c, size_t n);
> > >
> > > static inline void*
> > > rte_memset_huge(void *s, int c, size_t n) {
> > >    return __rte_memset_vector(s, c, n); }
> > >
> > > static inline void *
> > > rte_memset(void *s, int c, size_t n)
> > > {
> > > 	If (n < XXX)
> > > 		return rte_memset_scalar(s, c, n);
> > > 	else
> > > 		return rte_memset_huge(s, c, n);
> > > }
> > >
> > > XXX could be either a define, or could also be a variable, so it can
> > > be setuped at startup, depending on the architecture.
> > >
> > > Would that work?
> > > Konstantin
> > >
> I have implemented the code for  choosing the functions at run time.
> rte_memcpy is used more frequently, So I test it at run time.
> 
> typedef void *(*rte_memcpy_vector_t)(void *dst, const void *src, size_t n);
> extern rte_memcpy_vector_t rte_memcpy_vector;
> static inline void *
> rte_memcpy(void *dst, const void *src, size_t n)
> {
>         return rte_memcpy_vector(dst, src, n);
> }
> In order to reduce the overhead at run time,
> I assign the function address to var rte_memcpy_vector before main() starts to init the var.
> 
> static void __attribute__((constructor))
> rte_memcpy_init(void)
> {
> 	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
> 	{
> 		rte_memcpy_vector = rte_memcpy_avx2;
> 	}
> 	else if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
> 	{
> 		rte_memcpy_vector = rte_memcpy_sse;
> 	}
> 	else
> 	{
> 		rte_memcpy_vector = memcpy;
> 	}
> 
> }

I thought we discussed a slightly different approach,
in which rte_memcpy_vector() (rte_memset_vector) would be called only after some cutoff point, i.e.:

void
rte_memcpy(void *dst, const void *src, size_t len)
{
	if (len < N) memcpy(dst, src, len);
	else rte_memcpy_vector(dst, src, len);
}

If you just always call rte_memcpy_vector() for every len,
then it means that the compiler most likely always has to generate a proper call
(no inlining happening).
For small lengths, the price of the extra function call would probably outweigh any
potential gain from the SSE/AVX2 implementation.

Konstantin 

> I run the same virtio/vhost loopback tests without NIC.
> I can see the  throughput drop  when running choosing functions at run time
> compared to original code as following on the same platform(my machine is haswell)
> 	Packet size	perf drop
> 	64 		-4%
> 	256 		-5.4%
> 	1024		-5%
> 	1500		-2.5%
> Another thing, I run the memcpy_perf_autotest,  when N= <128,
> the rte_memcpy perf gains almost disappears
> When choosing functions at run time.  For N=other numbers, the perf gains will become narrow.
> 
> Thanks
> Zhiyong
  
Yang, Zhiyong Dec. 16, 2016, 2:15 a.m. UTC | #13
Hi, Konstantin:

> -----Original Message-----
> From: Ananyev, Konstantin
> Sent: Thursday, December 15, 2016 6:54 PM
> To: Yang, Zhiyong <zhiyong.yang@intel.com>; Thomas Monjalon
> <thomas.monjalon@6wind.com>
> Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce
> <bruce.richardson@intel.com>; De Lara Guarch, Pablo
> <pablo.de.lara.guarch@intel.com>
> Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on
> IA platform
> 
> Hi Zhiyong,
> 
> > -----Original Message-----
> > From: Yang, Zhiyong
> > Sent: Thursday, December 15, 2016 6:51 AM
> > To: Yang, Zhiyong <zhiyong.yang@intel.com>; Ananyev, Konstantin
> > <konstantin.ananyev@intel.com>; Thomas Monjalon
> > <thomas.monjalon@6wind.com>
> > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce
> > <bruce.richardson@intel.com>; De Lara Guarch, Pablo
> > <pablo.de.lara.guarch@intel.com>
> > Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset
> > on IA platform
> >
> > Hi, Thomas, Konstantin:
> >
> > > -----Original Message-----
> > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Yang, Zhiyong
> > > Sent: Sunday, December 11, 2016 8:33 PM
> > > To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Thomas
> > > Monjalon <thomas.monjalon@6wind.com>
> > > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce
> > > <bruce.richardson@intel.com>; De Lara Guarch, Pablo
> > > <pablo.de.lara.guarch@intel.com>
> > > Subject: Re: [dpdk-dev] [PATCH 1/4] eal/common: introduce
> rte_memset
> > > on IA platform
> > >
> > > Hi, Konstantin, Bruce:
> > >
> > > > -----Original Message-----
> > > > From: Ananyev, Konstantin
> > > > Sent: Thursday, December 8, 2016 6:31 PM
> > > > To: Yang, Zhiyong <zhiyong.yang@intel.com>; Thomas Monjalon
> > > > <thomas.monjalon@6wind.com>
> > > > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce
> > > > <bruce.richardson@intel.com>; De Lara Guarch, Pablo
> > > > <pablo.de.lara.guarch@intel.com>
> > > > Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce
> > > > rte_memset on IA platform
> > > >
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Yang, Zhiyong
> > > > > Sent: Thursday, December 8, 2016 9:53 AM
> > > > > To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Thomas
> > > > > Monjalon <thomas.monjalon@6wind.com>
> > > > > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce
> > > > > <bruce.richardson@intel.com>; De Lara Guarch, Pablo
> > > > > <pablo.de.lara.guarch@intel.com>
> > > > > Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce
> > > > > rte_memset on IA platform
> > > > >
> > > > extern void *(*__rte_memset_vector)( (void *s, int c, size_t n);
> > > >
> > > > static inline void*
> > > > rte_memset_huge(void *s, int c, size_t n) {
> > > >    return __rte_memset_vector(s, c, n); }
> > > >
> > > > static inline void *
> > > > rte_memset(void *s, int c, size_t n) {
> > > > 	If (n < XXX)
> > > > 		return rte_memset_scalar(s, c, n);
> > > > 	else
> > > > 		return rte_memset_huge(s, c, n); }
> > > >
> > > > XXX could be either a define, or could also be a variable, so it
> > > > can be setuped at startup, depending on the architecture.
> > > >
> > > > Would that work?
> > > > Konstantin
> > > >
> > I have implemented the code for  choosing the functions at run time.
> > rte_memcpy is used more frequently, So I test it at run time.
> >
> > typedef void *(*rte_memcpy_vector_t)(void *dst, const void *src,
> > size_t n); extern rte_memcpy_vector_t rte_memcpy_vector; static inline
> > void * rte_memcpy(void *dst, const void *src, size_t n) {
> >         return rte_memcpy_vector(dst, src, n); } In order to reduce
> > the overhead at run time, I assign the function address to var
> > rte_memcpy_vector before main() starts to init the var.
> >
> > static void __attribute__((constructor))
> > rte_memcpy_init(void)
> > {
> > 	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
> > 	{
> > 		rte_memcpy_vector = rte_memcpy_avx2;
> > 	}
> > 	else if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
> > 	{
> > 		rte_memcpy_vector = rte_memcpy_sse;
> > 	}
> > 	else
> > 	{
> > 		rte_memcpy_vector = memcpy;
> > 	}
> >
> > }
> 
> I thought we discussed a bit different approach.
> In which rte_memcpy_vector() (rte_memeset_vector) would be called  only
> after some cutoff point, i.e:
> 
> void
> rte_memcpy(void *dst, const void *src, size_t len) {
> 	if (len < N) memcpy(dst, src, len);
> 	else rte_memcpy_vector(dst, src, len);
> }
> 
> If you just always call rte_memcpy_vector() for every len, then it means that
> compiler most likely has always to generate a proper call (not inlining
> happening).
> For small length(s) price of extra function would probably overweight any
> potential gain with SSE/AVX2 implementation.
> 
> Konstantin


Yes, in fact, from my tests, for small lengths rte_memset is far better than glibc memset.
For large lengths, rte_memset is only a bit better than memset,
because memset uses AVX2/SSE too. Of course, it will use AVX512 on future machines.

> For small length(s) price of extra function would probably overweight any
> potential gain.

This is the key point. I think it should include the scalar optimization, not only the vector optimization.
The value of rte_memset is that it is always inlined, and for small lengths it will be better,
whereas in some cases we are not sure that memset is always inlined by the compiler.
It seems that choosing the function at run time will lose those gains.
The following was tested on Haswell with the patch code.
** rte_memset() - memset perf tests
        (C = compile-time constant) **
======= =============== ===============
   Size memset in cache   memset in mem
(bytes)         (ticks)         (ticks)
------- --------------- ---------------
=========== 32B aligned ================
      3        3 -    8      19 -   128
      4        4 -    8      13 -   128
      8        2 -    7      19 -   128
      9        2 -    7      19 -   127
     12        2 -    7      19 -   127
     17        3 -    8      19 -   132
     64        3 -    8      28 -   168
    128        7 -   13      54 -   200
    255        8 -   20     100 -   223
    511       14 -   20     187 -   314
   1024       24 -   29     328 -   379
   8192      198 -  225    1829 -  2193

Thanks
Zhiyong
  
Yang, Zhiyong Dec. 16, 2016, 10:19 a.m. UTC | #14
Hi, Bruce:

> -----Original Message-----
> From: Richardson, Bruce
> Sent: Thursday, December 15, 2016 6:13 PM
> To: Yang, Zhiyong <zhiyong.yang@intel.com>
> Cc: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Thomas
> Monjalon <thomas.monjalon@6wind.com>; dev@dpdk.org;
> yuanhan.liu@linux.intel.com; De Lara Guarch, Pablo
> <pablo.de.lara.guarch@intel.com>
> Subject: Re: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on
> IA platform
> 
> On Thu, Dec 15, 2016 at 06:51:08AM +0000, Yang, Zhiyong wrote:
> > Hi, Thomas, Konstantin:
> >
> > > -----Original Message-----
> > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Yang, Zhiyong
> > > Sent: Sunday, December 11, 2016 8:33 PM
> > > To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Thomas
> > > Monjalon <thomas.monjalon@6wind.com>
> > > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce
> > > <bruce.richardson@intel.com>; De Lara Guarch, Pablo
> > > <pablo.de.lara.guarch@intel.com>
> > > Subject: Re: [dpdk-dev] [PATCH 1/4] eal/common: introduce
> rte_memset
> > > on IA platform
> > >
> > > Hi, Konstantin, Bruce:
> > >
> > > > -----Original Message-----
> > > > From: Ananyev, Konstantin
> > > > Sent: Thursday, December 8, 2016 6:31 PM
> > > > To: Yang, Zhiyong <zhiyong.yang@intel.com>; Thomas Monjalon
> > > > <thomas.monjalon@6wind.com>
> > > > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce
> > > > <bruce.richardson@intel.com>; De Lara Guarch, Pablo
> > > > <pablo.de.lara.guarch@intel.com>
> > > > Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce
> > > > rte_memset on IA platform
> > > >
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Yang, Zhiyong
> > > > > Sent: Thursday, December 8, 2016 9:53 AM
> > > > > To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Thomas
> > > > > Monjalon <thomas.monjalon@6wind.com>
> > > > > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce
> > > > > <bruce.richardson@intel.com>; De Lara Guarch, Pablo
> > > > > <pablo.de.lara.guarch@intel.com>
> > > > > Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce
> > > > > rte_memset on IA platform
> > > > >
> > > > extern void *(*__rte_memset_vector)( (void *s, int c, size_t n);
> > > >
> > > > static inline void*
> > > > rte_memset_huge(void *s, int c, size_t n) {
> > > >    return __rte_memset_vector(s, c, n); }
> > > >
> > > > static inline void *
> > > > rte_memset(void *s, int c, size_t n) {
> > > > 	If (n < XXX)
> > > > 		return rte_memset_scalar(s, c, n);
> > > > 	else
> > > > 		return rte_memset_huge(s, c, n); }
> > > >
> > > > XXX could be either a define, or could also be a variable, so it
> > > > can be setuped at startup, depending on the architecture.
> > > >
> > > > Would that work?
> > > > Konstantin
> > > >
> > I have implemented the code for  choosing the functions at run time.
> > rte_memcpy is used more frequently, So I test it at run time.
> >
> > typedef void *(*rte_memcpy_vector_t)(void *dst, const void *src,
> > size_t n); extern rte_memcpy_vector_t rte_memcpy_vector; static inline
> > void * rte_memcpy(void *dst, const void *src, size_t n) {
> >         return rte_memcpy_vector(dst, src, n); } In order to reduce
> > the overhead at run time, I assign the function address to var
> > rte_memcpy_vector before main() starts to init the var.
> >
> > static void __attribute__((constructor))
> > rte_memcpy_init(void)
> > {
> > 	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
> > 	{
> > 		rte_memcpy_vector = rte_memcpy_avx2;
> > 	}
> > 	else if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
> > 	{
> > 		rte_memcpy_vector = rte_memcpy_sse;
> > 	}
> > 	else
> > 	{
> > 		rte_memcpy_vector = memcpy;
> > 	}
> >
> > }
> > I run the same virtio/vhost loopback tests without NIC.
> > I can see the  throughput drop  when running choosing functions at run
> > time compared to original code as following on the same platform(my
> machine is haswell)
> > 	Packet size	perf drop
> > 	64 		-4%
> > 	256 		-5.4%
> > 	1024		-5%
> > 	1500		-2.5%
> > Another thing, I run the memcpy_perf_autotest,  when N= <128, the
> > rte_memcpy perf gains almost disappears When choosing functions at run
> > time.  For N=other numbers, the perf gains will become narrow.
> >
> How narrow. How significant is the improvement that we gain from having to
> maintain our own copy of memcpy. If the libc version is nearly as good we
> should just use that.
> 
> /Bruce

Zhihong sent a patch about rte_memcpy. From that patch,
we can see that the memcpy optimization work brings obvious perf improvements
over glibc for DPDK.
http://www.dpdk.org/dev/patchwork/patch/17753/
git log as follows:
This patch is tested on Ivy Bridge, Haswell and Skylake, it provides
up to 20% gain for Virtio Vhost PVP traffic, with packet size ranging
from 64 to 1500 bytes.

thanks
Zhiyong
  
Ananyev, Konstantin Dec. 16, 2016, 11:47 a.m. UTC | #15
Hi Zhiyong,

> > > > > >
> > > > > extern void *(*__rte_memset_vector)( (void *s, int c, size_t n);
> > > > >
> > > > > static inline void*
> > > > > rte_memset_huge(void *s, int c, size_t n) {
> > > > >    return __rte_memset_vector(s, c, n); }
> > > > >
> > > > > static inline void *
> > > > > rte_memset(void *s, int c, size_t n) {
> > > > > 	If (n < XXX)
> > > > > 		return rte_memset_scalar(s, c, n);
> > > > > 	else
> > > > > 		return rte_memset_huge(s, c, n); }
> > > > >
> > > > > XXX could be either a define, or could also be a variable, so it
> > > > > can be setuped at startup, depending on the architecture.
> > > > >
> > > > > Would that work?
> > > > > Konstantin
> > > > >
> > > I have implemented the code for  choosing the functions at run time.
> > > rte_memcpy is used more frequently, So I test it at run time.
> > >
> > > typedef void *(*rte_memcpy_vector_t)(void *dst, const void *src,
> > > size_t n); extern rte_memcpy_vector_t rte_memcpy_vector; static inline
> > > void * rte_memcpy(void *dst, const void *src, size_t n) {
> > >         return rte_memcpy_vector(dst, src, n); } In order to reduce
> > > the overhead at run time, I assign the function address to var
> > > rte_memcpy_vector before main() starts to init the var.
> > >
> > > static void __attribute__((constructor))
> > > rte_memcpy_init(void)
> > > {
> > > 	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
> > > 	{
> > > 		rte_memcpy_vector = rte_memcpy_avx2;
> > > 	}
> > > 	else if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
> > > 	{
> > > 		rte_memcpy_vector = rte_memcpy_sse;
> > > 	}
> > > 	else
> > > 	{
> > > 		rte_memcpy_vector = memcpy;
> > > 	}
> > >
> > > }
> >
> > I thought we discussed a bit different approach.
> > In which rte_memcpy_vector() (rte_memeset_vector) would be called  only
> > after some cutoff point, i.e:
> >
> > void
> > rte_memcpy(void *dst, const void *src, size_t len) {
> > 	if (len < N) memcpy(dst, src, len);
> > 	else rte_memcpy_vector(dst, src, len);
> > }
> >
> > If you just always call rte_memcpy_vector() for every len, then it means that
> > compiler most likely has always to generate a proper call (not inlining
> > happening).
> 
> > For small length(s) price of extra function would probably overweight any
> > potential gain with SSE/AVX2 implementation.
> >
> > Konstantin
> 
> Yes, in fact,  from my tests, For small length(s)  rte_memset is far better than glibc memset,
> For large lengths, rte_memset is only a bit better than memset.
> because memset use the AVX2/SSE, too. Of course, it will use AVX512 on future machine.

Ok, thanks for clarification.
From previous mails I got a wrong  impression that on big lengths
rte_memset_vector() is significantly faster than memset().

> 
> >For small length(s) price of extra function would probably overweight any
>  >potential gain.
> This is the key point. I think it should include the scalar optimization, not only vector optimization.
> 
> The value of rte_memset is always inlined and for small lengths it will be better.
> when in some case We are not sure that memset is always inlined by compiler.

Ok, so do you know in what cases memset() does not get inlined?
Is it when the len parameter can't be precomputed by the compiler
(i.e. is not a constant)?
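
A minimal illustration of that distinction (hypothetical snippet, not the patch code):

	#include <stddef.h>
	#include <string.h>

	void clear_fixed(char *buf)
	{
		/* length is a compile-time constant: the compiler can
		 * typically expand this memset inline */
		memset(buf, 0, 64);
	}

	void clear_runtime(char *buf, size_t n)
	{
		/* length is only known at run time: the compiler usually
		 * emits a real call to the libc memset, so the call
		 * overhead applies */
		memset(buf, 0, n);
	}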

So to me it sounds like:
- We don't need to have an optimized version of rte_memset() for big sizes.
- Which probably means we don't need arch-specific versions of rte_memset_vector() at all -
   for small sizes (<= 32B) scalar version would be good enough. 
- For big sizes we can just rely on memset().
Is that so?

> It seems that choosing function at run time will lose the gains.
> The following is tested on haswell by patch code.

Not sure what columns 2 and 3 in the table below mean? 
Konstantin

> ** rte_memset() - memset perf tests
>         (C = compile-time constant) **
> ======== ======= ======== ======= ========
>    Size memset in cache  memset in mem
> (bytes)        (ticks)        (ticks)
> ------- -------------- ---------------
> ============= 32B aligned ================
>       3            3 -    8       19 -  128
>       4            4 -    8       13 -  128
>       8            2 -    7       19 -  128
>       9            2 -    7       19 -  127
>      12           2 -    7       19 -  127
>      17          3 -    8        19 -  132
>      64          3 -    8        28 -  168
>     128        7 -   13       54 -  200
>     255        8 -   20       100 -  223
>     511        14 -   20     187 -  314
>    1024      24 -   29     328 -  379
>    8192     198 -  225   1829 - 2193
> 
> Thanks
> Zhiyong
  
Yuanhan Liu Dec. 19, 2016, 6:27 a.m. UTC | #16
On Fri, Dec 16, 2016 at 10:19:43AM +0000, Yang, Zhiyong wrote:
> > > I run the same virtio/vhost loopback tests without NIC.
> > > I can see the  throughput drop  when running choosing functions at run
> > > time compared to original code as following on the same platform(my
> > machine is haswell)
> > > 	Packet size	perf drop
> > > 	64 		-4%
> > > 	256 		-5.4%
> > > 	1024		-5%
> > > 	1500		-2.5%
> > > Another thing, I run the memcpy_perf_autotest,  when N= <128, the
> > > rte_memcpy perf gains almost disappears When choosing functions at run
> > > time.  For N=other numbers, the perf gains will become narrow.
> > >
> > How narrow. How significant is the improvement that we gain from having to
> > maintain our own copy of memcpy. If the libc version is nearly as good we
> > should just use that.
> > 
> > /Bruce
> 
> Zhihong sent a patch about rte_memcpy,  From the patch,  
> we can see the optimization job for memcpy will bring obvious perf improvements
> than glibc for DPDK.

Just a clarification: it's better than the __original DPDK__ rte_memcpy
but not the glibc one. That makes me wonder: has anyone tested the memcpy
with big packets? Does the one from DPDK outperform the one from glibc,
even for big packets?

	--yliu

> http://www.dpdk.org/dev/patchwork/patch/17753/
> git log as following:
> This patch is tested on Ivy Bridge, Haswell and Skylake, it provides
> up to 20% gain for Virtio Vhost PVP traffic, with packet size ranging
> from 64 to 1500 bytes.
> 
> thanks
> Zhiyong
  
Yao, Lei A Dec. 20, 2016, 2:41 a.m. UTC | #17
> On Fri, Dec 16, 2016 at 10:19:43AM +0000, Yang, Zhiyong wrote:
> > > > I run the same virtio/vhost loopback tests without NIC.
> > > > I can see the  throughput drop  when running choosing functions at run
> > > > time compared to original code as following on the same platform(my
> > > machine is haswell)
> > > > 	Packet size	perf drop
> > > > 	64 		-4%
> > > > 	256 		-5.4%
> > > > 	1024		-5%
> > > > 	1500		-2.5%
> > > > Another thing, I run the memcpy_perf_autotest,  when N= <128, the
> > > > rte_memcpy perf gains almost disappears When choosing functions at
> run
> > > > time.  For N=other numbers, the perf gains will become narrow.
> > > >
> > > How narrow. How significant is the improvement that we gain from
> having to
> > > maintain our own copy of memcpy. If the libc version is nearly as good we
> > > should just use that.
> > >
> > > /Bruce
> >
> > Zhihong sent a patch about rte_memcpy,  From the patch,
> > we can see the optimization job for memcpy will bring obvious perf
> improvements
> > than glibc for DPDK.
> 
> Just a clarification: it's better than the __original DPDK__ rte_memcpy
> but not the glibc one. That makes me think have any one tested the memcpy
> with big packets? Does the one from DPDK outweigh the one from glibc,
> even for big packets?
> 
> 	--yliu
> 
I have tested the loopback performance of rte_memcpy and glibc memcpy. For both small and
big packets, rte_memcpy has better performance. My test environment is the following:
CPU: BDW
Ubuntu 16.04
Kernel: 4.4.0
gcc: 5.4.0
Path: mergeable
Size      rte_memcpy performance gain
64        31%
128       35%
260       27%
520       33%
1024      18%
1500      12%

--Lei
> > http://www.dpdk.org/dev/patchwork/patch/17753/
> > git log as following:
> > This patch is tested on Ivy Bridge, Haswell and Skylake, it provides
> > up to 20% gain for Virtio Vhost PVP traffic, with packet size ranging
> > from 64 to 1500 bytes.
> >
> > thanks
> > Zhiyong
  
Yang, Zhiyong Dec. 20, 2016, 9:31 a.m. UTC | #18
Hi, Konstantin:

> -----Original Message-----
> From: Ananyev, Konstantin
> Sent: Friday, December 16, 2016 7:48 PM
> To: Yang, Zhiyong <zhiyong.yang@intel.com>; Thomas Monjalon
> <thomas.monjalon@6wind.com>
> Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce
> <bruce.richardson@intel.com>; De Lara Guarch, Pablo
> <pablo.de.lara.guarch@intel.com>
> Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on
> IA platform
> 
> Hi Zhiyong,
> 
> > > > > > >
> > > > > > extern void *(*__rte_memset_vector)( (void *s, int c, size_t
> > > > > > n);
> > > > > >
> > > > > > static inline void*
> > > > > > rte_memset_huge(void *s, int c, size_t n) {
> > > > > >    return __rte_memset_vector(s, c, n); }
> > > > > >
> > > > > > static inline void *
> > > > > > rte_memset(void *s, int c, size_t n) {
> > > > > > 	If (n < XXX)
> > > > > > 		return rte_memset_scalar(s, c, n);
> > > > > > 	else
> > > > > > 		return rte_memset_huge(s, c, n); }
> > > > > >
> > > > > > XXX could be either a define, or could also be a variable, so
> > > > > > it can be setuped at startup, depending on the architecture.
> > > > > >
> > > > > > Would that work?
> > > > > > Konstantin
> > > > > >
> > > > I have implemented the code for  choosing the functions at run time.
> > > > rte_memcpy is used more frequently, So I test it at run time.
> > > >
> > > > typedef void *(*rte_memcpy_vector_t)(void *dst, const void *src,
> > > > size_t n); extern rte_memcpy_vector_t rte_memcpy_vector; static
> > > > inline void * rte_memcpy(void *dst, const void *src, size_t n) {
> > > >         return rte_memcpy_vector(dst, src, n); } In order to
> > > > reduce the overhead at run time, I assign the function address to
> > > > var rte_memcpy_vector before main() starts to init the var.
> > > >
> > > > static void __attribute__((constructor))
> > > > rte_memcpy_init(void)
> > > > {
> > > > 	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
> > > > 	{
> > > > 		rte_memcpy_vector = rte_memcpy_avx2;
> > > > 	}
> > > > 	else if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
> > > > 	{
> > > > 		rte_memcpy_vector = rte_memcpy_sse;
> > > > 	}
> > > > 	else
> > > > 	{
> > > > 		rte_memcpy_vector = memcpy;
> > > > 	}
> > > >
> > > > }
> > >
> > > I thought we discussed a bit different approach.
> > > In which rte_memcpy_vector() (rte_memeset_vector) would be called
> > > only after some cutoff point, i.e:
> > >
> > > void
> > > rte_memcpy(void *dst, const void *src, size_t len) {
> > > 	if (len < N) memcpy(dst, src, len);
> > > 	else rte_memcpy_vector(dst, src, len); }
> > >
> > > If you just always call rte_memcpy_vector() for every len, then it
> > > means that compiler most likely has always to generate a proper call
> > > (not inlining happening).
> >
> > > For small length(s) price of extra function would probably
> > > overweight any potential gain with SSE/AVX2 implementation.
> > >
> > > Konstantin
> >
> > Yes, in fact,  from my tests, For small length(s)  rte_memset is far
> > better than glibc memset, For large lengths, rte_memset is only a bit better
> than memset.
> > because memset use the AVX2/SSE, too. Of course, it will use AVX512 on
> future machine.
> 
> Ok, thanks for clarification.
> From previous mails I got a wrong  impression that on big lengths
> rte_memset_vector() is significantly faster than memset().
> 
> >
> > >For small length(s) price of extra function would probably overweight
> > >any
> >  >potential gain.
> > This is the key point. I think it should include the scalar optimization, not
> only vector optimization.
> >
> > The value of rte_memset is always inlined and for small lengths it will be
> better.
> > when in some case We are not sure that memset is always inlined by
> compiler.
> 
> Ok, so do you know in what cases memset() is not get inlined?
> Is it when len parameter can't be precomputed by the compiler (is not a
> constant)?
> 
> So to me it sounds like:
> - We don't need to have an optimized verision of rte_memset() for big sizes.
> - Which probably means we don't need an arch specific versions of
> rte_memset_vector() at all -
>    for small sizes (<= 32B) scalar version would be good enough.
> - For big sizes we can just rely on memset().
> Is that so?

Using memset has actually caused trouble in some cases, such as
http://dpdk.org/ml/archives/dev/2016-October/048628.html

> 
> > It seems that choosing function at run time will lose the gains.
> > The following is tested on haswell by patch code.
> 
> Not sure what columns 2 and 3 in the table below mean?
> Konstantin

Column 1 shows the size in bytes.
Column 2 shows the rte_memset vs memset perf results in cache (ticks).
Column 3 shows the rte_memset vs memset perf results in memory (ticks).
The data is collected using rte_rdtsc().
The test can be run using [PATCH 3/4] app/test: add performance autotest for rte_memset.
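
A minimal sketch of how the ticks can be measured with rte_rdtsc() (illustrative
only, simplified compared to the autotest; it assumes the rte_memset.h header from
patch 1 is on the include path):

	#include <stddef.h>
	#include <stdint.h>
	#include <rte_cycles.h>
	#include <rte_memset.h>

	static uint64_t
	measure_ticks(void *buf, int c, size_t n, unsigned int iters)
	{
		uint64_t start = rte_rdtsc();
		unsigned int i;

		for (i = 0; i < iters; i++)
			rte_memset(buf, c, n);	/* or memset() for the glibc baseline */

		return (rte_rdtsc() - start) / iters;	/* average ticks per call */
	}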

Thanks
Zhiyong
> 
> > ** rte_memset() - memset perf tests
> >         (C = compile-time constant) **
> > ======== ======= ======== ======= ========
> >    Size memset in cache  memset in mem
> > (bytes)        (ticks)        (ticks)
> > ------- -------------- ---------------
> > ============= 32B aligned ================
> >       3            3 -    8       19 -  128
> >       4            4 -    8       13 -  128
> >       8            2 -    7       19 -  128
> >       9            2 -    7       19 -  127
> >      12           2 -    7       19 -  127
> >      17          3 -    8        19 -  132
> >      64          3 -    8        28 -  168
> >     128        7 -   13       54 -  200
> >     255        8 -   20       100 -  223
> >     511        14 -   20     187 -  314
> >    1024      24 -   29     328 -  379
> >    8192     198 -  225   1829 - 2193
> >
> > Thanks
> > Zhiyong
  
Yang, Zhiyong Dec. 27, 2016, 10:04 a.m. UTC | #19
DPDK code has suffered a bad performance drop in some cases when calling the
glibc function memset. See the discussions about memset in
http://dpdk.org/ml/archives/dev/2016-October/048628.html
It is necessary to introduce a more efficient function to fix this.
One important thing about rte_memset is that we get clear control
over which instruction flow is used.

This patchset introduces rte_memset as a more efficient
implementation; it brings an obvious perf improvement, especially
for small N bytes in most application scenarios.

Patch 1 implements rte_memset in the file rte_memset.h on the IA platform.
The file supports three types of instruction sets: SSE & AVX
(128 bits), AVX2 (256 bits) and AVX512 (512 bits). rte_memset makes use of
vectorization and inline functions to improve the perf on IA. In addition,
cache line and memory alignment are fully taken into consideration.

Patch 2 implements a functional autotest to validate that the function
works correctly.

Patch 3 implements a performance autotest, run separately in cache and in memory.
We can see that the perf of rte_memset is clearly better than glibc memset,
especially for small N bytes.

Patch 4: using rte_memset instead of copy_virtio_net_hdr brings a 3%~4%
performance improvement on the IA platform in virtio/vhost non-mergeable
loopback testing.

Changes in V2:

Patch 1:
Rename rte_memset.h -> rte_memset_64.h and create a file rte_memset.h
for each arch.

Patch 3:
Add the perf comparison data between rte_memset and memset on Haswell.

Patch 4:
Modify release_17_02.rst description.

Zhiyong Yang (4):
  eal/common: introduce rte_memset on IA platform
  app/test: add functional autotest for rte_memset
  app/test: add performance autotest for rte_memset
  lib/librte_vhost: improve vhost perf using rte_memset

 app/test/Makefile                                  |   3 +
 app/test/test_memset.c                             | 158 +++++++++
 app/test/test_memset_perf.c                        | 348 +++++++++++++++++++
 doc/guides/rel_notes/release_17_02.rst             |   7 +
 .../common/include/arch/arm/rte_memset.h           |  36 ++
 .../common/include/arch/ppc_64/rte_memset.h        |  36 ++
 .../common/include/arch/tile/rte_memset.h          |  36 ++
 .../common/include/arch/x86/rte_memset.h           |  51 +++
 .../common/include/arch/x86/rte_memset_64.h        | 378 +++++++++++++++++++++
 lib/librte_eal/common/include/generic/rte_memset.h |  52 +++
 lib/librte_vhost/virtio_net.c                      |  18 +-
 11 files changed, 1116 insertions(+), 7 deletions(-)
 create mode 100644 app/test/test_memset.c
 create mode 100644 app/test/test_memset_perf.c
 create mode 100644 lib/librte_eal/common/include/arch/arm/rte_memset.h
 create mode 100644 lib/librte_eal/common/include/arch/ppc_64/rte_memset.h
 create mode 100644 lib/librte_eal/common/include/arch/tile/rte_memset.h
 create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memset.h
 create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memset_64.h
 create mode 100644 lib/librte_eal/common/include/generic/rte_memset.h
  
Yang, Zhiyong Jan. 9, 2017, 9:48 a.m. UTC | #20
Hi, Thomas, Bruce, Konstantin:

	Any comments about the patchset?  Do I need to modify anything?

Thanks
Zhiyong 

> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Zhiyong Yang
> Sent: Tuesday, December 27, 2016 6:05 PM
> To: dev@dpdk.org
> Cc: yuanhan.liu@linux.intel.com; thomas.monjalon@6wind.com; Richardson,
> Bruce <bruce.richardson@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; De Lara Guarch, Pablo
> <pablo.de.lara.guarch@intel.com>
> Subject: [dpdk-dev] [PATCH v2 0/4] eal/common: introduce rte_memset and
> related test
> 
> DPDK code has met performance drop badly in some case when calling glibc
> function memset. Reference to discussions about memset in
> http://dpdk.org/ml/archives/dev/2016-October/048628.html
> It is necessary to introduce more high efficient function to fix it.
> One important thing about rte_memset is that we can get clear control on
> what instruction flow is used.
> 
> This patchset introduces rte_memset to bring more high efficient
> implementation, and will bring obvious perf improvement, especially for
> small N bytes in the most application scenarios.
> 
> Patch 1 implements rte_memset in the file rte_memset.h on IA platform The
> file supports three types of instruction sets including sse & avx (128bits),
> avx2(256bits) and avx512(512bits). rte_memset makes use of vectorization
> and inline function to improve the perf on IA. In addition, cache line and
> memory alignment are fully taken into consideration.
> 
> Patch 2 implements functional autotest to validates the function whether to
> work in a right way.
> 
> Patch 3 implements performance autotest separately in cache and memory.
> We can see the perf of rte_memset is obviously better than glibc memset
> especially for small N bytes.
> 
> Patch 4 Using rte_memset instead of copy_virtio_net_hdr can bring 3%~4%
> performance improvements on IA platform from virtio/vhost non-mergeable
> loopback testing.
> 
> Changes in V2:
> 
> Patch 1:
> Rename rte_memset.h -> rte_memset_64.h and create a file rte_memset.h
> for each arch.
> 
> Patch 3:
> add the perf comparation data between rte_memset and memset on
> haswell.
> 
> Patch 4:
> Modify release_17_02.rst description.
> 
> Zhiyong Yang (4):
>   eal/common: introduce rte_memset on IA platform
>   app/test: add functional autotest for rte_memset
>   app/test: add performance autotest for rte_memset
>   lib/librte_vhost: improve vhost perf using rte_memset
> 
>  app/test/Makefile                                  |   3 +
>  app/test/test_memset.c                             | 158 +++++++++
>  app/test/test_memset_perf.c                        | 348 +++++++++++++++++++
>  doc/guides/rel_notes/release_17_02.rst             |   7 +
>  .../common/include/arch/arm/rte_memset.h           |  36 ++
>  .../common/include/arch/ppc_64/rte_memset.h        |  36 ++
>  .../common/include/arch/tile/rte_memset.h          |  36 ++
>  .../common/include/arch/x86/rte_memset.h           |  51 +++
>  .../common/include/arch/x86/rte_memset_64.h        | 378
> +++++++++++++++++++++
>  lib/librte_eal/common/include/generic/rte_memset.h |  52 +++
>  lib/librte_vhost/virtio_net.c                      |  18 +-
>  11 files changed, 1116 insertions(+), 7 deletions(-)  create mode 100644
> app/test/test_memset.c  create mode 100644 app/test/test_memset_perf.c
> create mode 100644
> lib/librte_eal/common/include/arch/arm/rte_memset.h
>  create mode 100644
> lib/librte_eal/common/include/arch/ppc_64/rte_memset.h
>  create mode 100644 lib/librte_eal/common/include/arch/tile/rte_memset.h
>  create mode 100644
> lib/librte_eal/common/include/arch/x86/rte_memset.h
>  create mode 100644
> lib/librte_eal/common/include/arch/x86/rte_memset_64.h
>  create mode 100644 lib/librte_eal/common/include/generic/rte_memset.h
> 
> --
> 2.7.4
  
Yang, Zhiyong Jan. 17, 2017, 6:24 a.m. UTC | #21
Hi, Thomas:
	Does this patchset have a chance to be applied for the 17.02 release?
Thanks
Zhiyong

> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Yang, Zhiyong
> Sent: Monday, January 9, 2017 5:49 PM
> To: thomas.monjalon@6wind.com; Richardson, Bruce
> <bruce.richardson@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>
> Cc: yuanhan.liu@linux.intel.com; De Lara Guarch, Pablo
> <pablo.de.lara.guarch@intel.com>; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v2 0/4] eal/common: introduce rte_memset
> and related test
> 
> Hi, Thomas, Bruce, Konstantin:
> 
> 	Any comments about the patchset?  Do I need to modify anything?
> 
> Thanks
> Zhiyong
> 
> > -----Original Message-----
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Zhiyong Yang
> > Sent: Tuesday, December 27, 2016 6:05 PM
> > To: dev@dpdk.org
> > Cc: yuanhan.liu@linux.intel.com; thomas.monjalon@6wind.com;
> > Richardson, Bruce <bruce.richardson@intel.com>; Ananyev, Konstantin
> > <konstantin.ananyev@intel.com>; De Lara Guarch, Pablo
> > <pablo.de.lara.guarch@intel.com>
> > Subject: [dpdk-dev] [PATCH v2 0/4] eal/common: introduce rte_memset
> > and related test
> >
> > DPDK code has met performance drop badly in some case when calling
> > glibc function memset. Reference to discussions about memset in
> > http://dpdk.org/ml/archives/dev/2016-October/048628.html
> > It is necessary to introduce more high efficient function to fix it.
> > One important thing about rte_memset is that we can get clear control
> > on what instruction flow is used.
> >
> > This patchset introduces rte_memset to bring more high efficient
> > implementation, and will bring obvious perf improvement, especially
> > for small N bytes in the most application scenarios.
> >
> > Patch 1 implements rte_memset in the file rte_memset.h on IA platform
> > The file supports three types of instruction sets including sse & avx
> > (128bits),
> > avx2(256bits) and avx512(512bits). rte_memset makes use of
> > vectorization and inline function to improve the perf on IA. In
> > addition, cache line and memory alignment are fully taken into
> consideration.
> >
> > Patch 2 implements functional autotest to validates the function
> > whether to work in a right way.
> >
> > Patch 3 implements performance autotest separately in cache and memory.
> > We can see the perf of rte_memset is obviously better than glibc
> > memset especially for small N bytes.
> >
> > Patch 4 Using rte_memset instead of copy_virtio_net_hdr can bring
> > 3%~4% performance improvements on IA platform from virtio/vhost
> > non-mergeable loopback testing.
> >
> > Changes in V2:
> >
> > Patch 1:
> > Rename rte_memset.h -> rte_memset_64.h and create a file
> rte_memset.h
> > for each arch.
> >
> > Patch 3:
> > add the perf comparation data between rte_memset and memset on
> > haswell.
> >
> > Patch 4:
> > Modify release_17_02.rst description.
> >
> > Zhiyong Yang (4):
> >   eal/common: introduce rte_memset on IA platform
> >   app/test: add functional autotest for rte_memset
> >   app/test: add performance autotest for rte_memset
> >   lib/librte_vhost: improve vhost perf using rte_memset
> >
> >  app/test/Makefile                                  |   3 +
> >  app/test/test_memset.c                             | 158 +++++++++
> >  app/test/test_memset_perf.c                        | 348 +++++++++++++++++++
> >  doc/guides/rel_notes/release_17_02.rst             |   7 +
> >  .../common/include/arch/arm/rte_memset.h           |  36 ++
> >  .../common/include/arch/ppc_64/rte_memset.h        |  36 ++
> >  .../common/include/arch/tile/rte_memset.h          |  36 ++
> >  .../common/include/arch/x86/rte_memset.h           |  51 +++
> >  .../common/include/arch/x86/rte_memset_64.h        | 378
> > +++++++++++++++++++++
> >  lib/librte_eal/common/include/generic/rte_memset.h |  52 +++
> >  lib/librte_vhost/virtio_net.c                      |  18 +-
> >  11 files changed, 1116 insertions(+), 7 deletions(-)  create mode
> > 100644 app/test/test_memset.c  create mode 100644
> > app/test/test_memset_perf.c create mode 100644
> > lib/librte_eal/common/include/arch/arm/rte_memset.h
> >  create mode 100644
> > lib/librte_eal/common/include/arch/ppc_64/rte_memset.h
> >  create mode 100644
> > lib/librte_eal/common/include/arch/tile/rte_memset.h
> >  create mode 100644
> > lib/librte_eal/common/include/arch/x86/rte_memset.h
> >  create mode 100644
> > lib/librte_eal/common/include/arch/x86/rte_memset_64.h
> >  create mode 100644
> lib/librte_eal/common/include/generic/rte_memset.h
> >
> > --
> > 2.7.4
  
Thomas Monjalon Jan. 17, 2017, 8:14 p.m. UTC | #22
2017-01-17 06:24, Yang, Zhiyong:
> Hi, Thomas:
> 	Does this patchset have chance to be applied for 1702 release? 

It could be part of 17.02 but there are some issues:

The x86 part did not receive any ack from x86 maintainers.

checkpatch reports some warnings, especially about counting elements
of an array. Please use RTE_DIM.
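
For example, a hypothetical size table iterated with RTE_DIM instead of an
open-coded sizeof division (illustrative only, not the actual test code):

	#include <stddef.h>
	#include <stdio.h>
	#include <rte_common.h>	/* RTE_DIM(a): number of elements of array a */

	static const size_t test_sizes[] = { 3, 4, 8, 17, 64, 255, 1024, 8192 };

	static void
	list_sizes(void)
	{
		unsigned int i;

		/* RTE_DIM replaces sizeof(test_sizes) / sizeof(test_sizes[0]) */
		for (i = 0; i < RTE_DIM(test_sizes); i++)
			printf("%zu\n", test_sizes[i]);
	}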

The file in generic/ is for doxygen only.
Please check how it is done for other files.

The description is "Functions for vectorised implementation of memset()."
Does it mean memset from glibc does not use vector instructions?

The functional autotest is not integrated in the basic test suite.

I wish this kind of review would be done by someone else.
As it does not have a big performance impact, this series could wait for the next release.
By the way, have you tried to work on glibc, as I had suggested?
  
Vincent Jardin Jan. 18, 2017, 12:15 a.m. UTC | #23
On 17/01/2017 at 21:14, Thomas Monjalon wrote:
> By the way, have you tried to work on glibc, as I had suggested?

Nothing here:
 
https://sourceware.org/cgi-bin/search.cgi?wm=wrd&form=extended&m=all&s=D&ul=%2Fml%2Flibc-alpha%2F%25&q=memset
  
Yang, Zhiyong Jan. 18, 2017, 2:42 a.m. UTC | #24
hi, Thomas:
	Thanks for your reply.

> -----Original Message-----
> From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com]
> Sent: Wednesday, January 18, 2017 4:14 AM
> To: Yang, Zhiyong <zhiyong.yang@intel.com>
> Cc: Richardson, Bruce <bruce.richardson@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; yuanhan.liu@linux.intel.com; De Lara
> Guarch, Pablo <pablo.de.lara.guarch@intel.com>; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v2 0/4] eal/common: introduce rte_memset
> and related test
> 
> 2017-01-17 06:24, Yang, Zhiyong:
> > Hi, Thomas:
> > 	Does this patchset have chance to be applied for 1702 release?
> 
> It could be part of 17.02 but there are some issues:
> 
> The x86 part did not receive any ack from x86 maintainers.

Ok

> 
> checkpatch reports some warnings, especially about counting elements of an
> array. Please use RTE_DIM.

Ok, I ignored these warnings, taking the current release code as a reference. Cleaner code
will be sent in the future.

> 
> The file in generic/ is for doxygen only.
> Please check how it is done for other files.

Ok. I didn't know this before. :) Thank you.

> 
> The description is "Functions for vectorised implementation of memset()."
> Does it mean memset from glibc does not use vector instructions?
> 

Sorry for causing a misleading understanding.
Glibc memset() uses vector instructions for its optimization, of course.
I just wanted to say "the functions implementing the same functionality
as glibc memset()". My bad English. :)

> The functional autotest is not integrated in the basic test suite.
> 

I can run the command line "memset_autotest"; it seems that I left something out.

> I wish this kind of review would be done by someone else.
> As it has not a big performance impact, this series could wait the next release.

Ok.
Maybe memset() accounts for only a small share of the current DPDK data path.

> By the way, have you tried to work on glibc, as I had suggested?

I'm not familiar with the glibc contribution process; as far as I know, glibc uses x86 asm
rather than intrinsics. I will consider your suggestion.
  
Thomas Monjalon Jan. 18, 2017, 7:42 a.m. UTC | #25
2017-01-18 02:42, Yang, Zhiyong:
> From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com]
> > The functional autotest is not integrated in the basic test suite.
> 
> I can run command line "memset_autotest",  It seems that I leave something out.

Please check app/test/autotest_data.py
  
Yang, Zhiyong Jan. 19, 2017, 1:36 a.m. UTC | #26
> -----Original Message-----
> From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com]
> Sent: Wednesday, January 18, 2017 3:43 PM
> To: Yang, Zhiyong <zhiyong.yang@intel.com>
> Cc: Richardson, Bruce <bruce.richardson@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; yuanhan.liu@linux.intel.com; De Lara
> Guarch, Pablo <pablo.de.lara.guarch@intel.com>; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v2 0/4] eal/common: introduce rte_memset
> and related test
> 
> 2017-01-18 02:42, Yang, Zhiyong:
> > From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com]
> > > The functional autotest is not integrated in the basic test suite.
> >
> > I can run command line "memset_autotest",  It seems that I leave
> something out.
> 
> Please check app/test/autotest_data.py

Thanks, Thomas.
  

Patch

diff --git a/lib/librte_eal/common/include/arch/x86/rte_memset.h b/lib/librte_eal/common/include/arch/x86/rte_memset.h
new file mode 100644
index 0000000..3b2d3a3
--- /dev/null
+++ b/lib/librte_eal/common/include/arch/x86/rte_memset.h
@@ -0,0 +1,376 @@ 
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2016 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef _RTE_MEMSET_X86_64_H_
+#define _RTE_MEMSET_X86_64_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * @file
+ *
+ * Functions for vectorised implementation of memset().
+ */
+
+#include <stdio.h>
+#include <stdint.h>
+#include <string.h>
+#include <rte_vect.h>
+
+static inline void *
+rte_memset(void *dst, int a, size_t n) __attribute__((always_inline));
+
+static inline void
+rte_memset_less16(void *dst, int a, size_t n)
+{
+	uintptr_t dstu = (uintptr_t)dst;
+
+	if (n & 0x01) {
+		*(uint8_t *)dstu = (uint8_t)a;
+		dstu = (uintptr_t)((uint8_t *)dstu + 1);
+	}
+	if (n & 0x02) {
+		*(uint16_t *)dstu = (uint8_t)a | (((uint8_t)a) << 8);
+		dstu = (uintptr_t)((uint16_t *)dstu + 1);
+	}
+	if (n & 0x04) {
+		uint16_t b = ((uint8_t)a | (((uint8_t)a) << 8));
+
+		*(uint32_t *)dstu = (uint32_t)(b | (b << 16));
+		dstu = (uintptr_t)((uint32_t *)dstu + 1);
+	}
+	if (n & 0x08) {
+		uint16_t b = ((uint8_t)a | (((uint8_t)a) << 8));
+		uint32_t c = b | (b << 16);
+
+		*(uint32_t *)dstu = c;
+		*((uint32_t *)dstu + 1) = c;
+		dstu = (uintptr_t)((uint32_t *)dstu + 2);
+	}
+}
+
+static inline void
+rte_memset16(uint8_t *dst, int8_t a)
+{
+	__m128i xmm0;
+
+	xmm0 = _mm_set1_epi8(a);
+	_mm_storeu_si128((__m128i *)dst, xmm0);
+}
+
+static inline void
+rte_memset_17to32(void *dst, int a, size_t n)
+{
+	rte_memset16((uint8_t *)dst, a);
+	rte_memset16((uint8_t *)dst - 16 + n, a);
+}
+
+#ifdef RTE_MACHINE_CPUFLAG_AVX512
+
+/**
+ * AVX512 implementation below
+ */
+
+static inline void
+rte_memset32(uint8_t *dst, int8_t a)
+{
+	__m256i ymm0;
+
+	ymm0 = _mm256_set1_epi8(a);
+	_mm256_storeu_si256((__m256i *)dst, ymm0);
+}
+
+static inline void
+rte_memset64(uint8_t *dst, int8_t a)
+{
+	__m512i zmm0;
+
+	zmm0 = _mm512_set1_epi8(a);
+	_mm512_storeu_si512((void *)dst, zmm0);
+}
+
+static inline void
+rte_memset128blocks(uint8_t *dst, int8_t a, size_t n)
+{
+	__m512i zmm0;
+
+	zmm0 = _mm512_set1_epi8(a);
+	while (n >= 128) {
+		n -= 128;
+		_mm512_store_si512((void *)(dst + 0 * 64), zmm0);
+		_mm512_store_si512((void *)(dst + 1 * 64), zmm0);
+		dst = dst + 128;
+	}
+}
+
+static inline void *
+rte_memset(void *dst, int a, size_t n)
+{
+	void *ret = dst;
+	size_t dstofss;
+	size_t bits;
+
+	if (n < 16) {
+		rte_memset_less16(dst, a, n);
+		return ret;
+	} else if (n == 16) {
+		rte_memset16((uint8_t *)dst, a);
+		return ret;
+	}
+	if (n <= 32) {
+		rte_memset_17to32(dst, a, n);
+		return ret;
+	}
+	if (n <= 64) {
+		rte_memset32((uint8_t *)dst, a);
+		rte_memset32((uint8_t *)dst - 32 + n, a);
+		return ret;
+	}
+	if (n >= 256) {
+		dstofss = ((uintptr_t)dst & 0x3F);
+		if (dstofss > 0) {
+			dstofss = 64 - dstofss;
+			n -= dstofss;
+			rte_memset64((uint8_t *)dst, a);
+			dst = (uint8_t *)dst + dstofss;
+		}
+		rte_memset128blocks((uint8_t *)dst, a, n);
+		bits = n;
+		n = n & 127;
+		bits -= n;
+		dst = (uint8_t *)dst + bits;
+	}
+	if (n > 128) {
+		n -= 128;
+		rte_memset64((uint8_t *)dst, a);
+		rte_memset64((uint8_t *)dst + 64, a);
+		dst = (uint8_t *)dst + 128;
+	}
+	if (n > 64) {
+		rte_memset64((uint8_t *)dst, a);
+		rte_memset64((uint8_t *)dst - 64 + n, a);
+		return ret;
+	}
+	if (n > 0)
+		rte_memset64((uint8_t *)dst - 64 + n, a);
+	return ret;
+}
+
+#elif defined RTE_MACHINE_CPUFLAG_AVX2
+
+/**
+ *  AVX2 implementation below
+ */
+
+static inline void
+rte_memset32(uint8_t *dst, int8_t a)
+{
+	__m256i ymm0;
+
+	ymm0 = _mm256_set1_epi8(a);
+	_mm256_storeu_si256((__m256i *)dst, ymm0);
+}
+
+static inline void
+rte_memset_33to64(void *dst, int a, size_t n)
+{
+	rte_memset32((uint8_t *)dst, a);
+	rte_memset32((uint8_t *)dst - 32 + n, a);
+}
+
+static inline void
+rte_memset64blocks(uint8_t *dst, int8_t a, size_t n)
+{
+	__m256i ymm0;
+
+	ymm0 = _mm256_set1_epi8(a);
+	while (n >= 64) {
+		n -= 64;
+		_mm256_store_si256((__m256i *)((uint8_t *)dst + 0 * 32), ymm0);
+		_mm256_store_si256((__m256i *)((uint8_t *)dst + 1 * 32), ymm0);
+		dst = (uint8_t *)dst + 64;
+
+	}
+}
+
+static inline void *
+rte_memset(void *dst, int a, size_t n)
+{
+	void *ret = dst;
+	size_t dstofss;
+	size_t bits;
+
+	if (n < 16) {
+		rte_memset_less16(dst, a, n);
+		return ret;
+	} else if (n == 16) {
+		rte_memset16((uint8_t *)dst, a);
+		return ret;
+	}
+	if (n <= 32) {
+		rte_memset_17to32(dst, a, n);
+		return ret;
+	}
+	if (n <= 64) {
+		rte_memset_33to64(dst, a, n);
+		return ret;
+	}
+	if (n > 64) {
+		dstofss = (uintptr_t)dst & 0x1F;
+		if (dstofss > 0) {
+			dstofss = 32 - dstofss;
+			n -= dstofss;
+			rte_memset32((uint8_t *)dst, a);
+			dst = (uint8_t *)dst + dstofss;
+		}
+		rte_memset64blocks((uint8_t *)dst, a, n);
+		bits = n;
+		n = n & 63;
+		bits -= n;
+		dst = (uint8_t *)dst + bits;
+	}
+	if (n > 32) {
+		rte_memset_33to64(dst, a, n);
+		return ret;
+	}
+	if (n > 0)
+		rte_memset32((uint8_t *)dst - 32 + n, a);
+	return ret;
+}
+
+#else /* RTE_MACHINE_CPUFLAG */
+
+/**
+ * SSE && AVX implementation below
+ */
+
+static inline void
+rte_memset32(uint8_t *dst, int8_t a)
+{
+	__m128i xmm0 = _mm_set1_epi8(a);
+
+	_mm_storeu_si128((__m128i *)dst, xmm0);
+	_mm_storeu_si128((__m128i *)(dst + 16), xmm0);
+}
+
+static inline void
+rte_memset16blocks(uint8_t *dst, int8_t a, size_t n)
+{
+	__m128i xmm0 = _mm_set1_epi8(a);
+
+	while (n >= 16) {
+		n -= 16;
+		_mm_store_si128((__m128i *)(dst + 0 * 16), xmm0);
+		dst = (uint8_t *)dst + 16;
+	}
+}
+
+static inline void
+rte_memset64blocks(uint8_t *dst, int8_t a, size_t n)
+{
+	__m128i xmm0 = _mm_set1_epi8(a);
+
+	while (n >= 64) {
+		n -= 64;
+		_mm_store_si128((__m128i *)(dst + 0 * 16), xmm0);
+		_mm_store_si128((__m128i *)(dst + 1 * 16), xmm0);
+		_mm_store_si128((__m128i *)(dst + 2 * 16), xmm0);
+		_mm_store_si128((__m128i *)(dst + 3 * 16), xmm0);
+		dst = (uint8_t *)dst + 64;
+	}
+}
+
+static inline void *
+rte_memset(void *dst, int a, size_t n)
+{
+	void *ret = dst;
+	size_t dstofss;
+	size_t bits;
+
+	if (n < 16) {
+		rte_memset_less16(dst, a, n);
+		return ret;
+	} else if (n == 16) {
+		rte_memset16((uint8_t *)dst, a);
+		return ret;
+	}
+	if (n <= 32) {
+		rte_memset_17to32(dst, a, n);
+		return ret;
+	}
+	if (n <= 48) {
+		rte_memset32((uint8_t *)dst, a);
+		rte_memset16((uint8_t *)dst - 16 + n, a);
+		return ret;
+	}
+	if (n <= 64) {
+		rte_memset32((uint8_t *)dst, a);
+		rte_memset16((uint8_t *)dst + 32, a);
+		rte_memset16((uint8_t *)dst - 16 + n, a);
+		return ret;
+	}
+	if (n > 64) {
+		dstofss = (uintptr_t)dst & 0xF;
+		if (dstofss > 0) {
+			dstofss = 16 - dstofss;
+			n -= dstofss;
+			rte_memset16((uint8_t *)dst, a);
+			dst = (uint8_t *)dst + dstofss;
+		}
+		rte_memset64blocks((uint8_t *)dst, a, n);
+		bits = n;
+		n &= 63;
+		bits -= n;
+		dst = (uint8_t *)dst + bits;
+		rte_memset16blocks((uint8_t *)dst, a, n);
+		bits = n;
+		n &= 0xf;
+		bits -= n;
+		dst = (uint8_t *)dst + bits;
+		if (n > 0) {
+			rte_memset16((uint8_t *)dst - 16 + n, a);
+			return ret;
+		}
+	}
+	return ret;
+}
+
+#endif /* RTE_MACHINE_CPUFLAG */
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_MEMSET_X86_64_H_ */
diff --git a/lib/librte_eal/common/include/generic/rte_memset.h b/lib/librte_eal/common/include/generic/rte_memset.h
new file mode 100644
index 0000000..416a638
--- /dev/null
+++ b/lib/librte_eal/common/include/generic/rte_memset.h
@@ -0,0 +1,51 @@ 
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2016 Intel Corporation. All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef _RTE_MEMSET_H_
+#define _RTE_MEMSET_H_
+
+/**
+ * @file
+ *
+ * Functions for vectorised implementation of memset().
+ */
+#ifndef _RTE_MEMSET_X86_64_H_
+
+#define rte_memset memset
+
+#else
+
+static void *
+rte_memset(void *dst, int a, size_t n);
+
+#endif
+#endif /* _RTE_MEMSET_H_ */