August 23, 2005

Conference.NET Song Lyrics – King of the Code

Sung to "King of the Road"

By Chris Sells
Chinese translation by 荣耀

 


Platform for sale or rent
who cares where all the others went
no more pointers, pushes, pops or peeks
I ain’t got no memory leaks
I wrote ten lines of C# today
I was more productive than yesterday
I’m a programmer of means by no means
I’m a king of the code

I feel sorry for those guys at Sun
DOTNET has got them on the run
But when I hear Scott McNealy whine
I say stick your VM where the Sun don’t shine
Most folks agree VB DOT NET
Is the best version of Java that we’ve seen yet
I’m a programmer of means by no means
I’m a king of the code

You C guys think all this stuff is new
A friendly runtime that hides the goo
I’m a VB guy, better watch out for me
’cause I’ve been writing managed code since 93
Database code’s busted — no time to cry
Patch it all to the newest API
I’m a programmer of means by no means
I’m a king of the code

New waves they come and go so fast
COM was love, but love don’t last
If COM was love then what’s DOTNET
It’s love without the cigarette
DOTNET will be here till we all get bored
Then we’ll throw it away and invent DOT-ORG
I’m a programmer of means by no means
I’m a king of the code

For more, visit: http://www.sellsbrothers.com/fun/

August 16, 2005

Implementations of the strcpy and memcpy functions

I find these a good test of programming fundamentals, such as error checking of input parameters and coding style.
#include <assert.h>

char *strcpy(char *strDest, const char *strSrc)
{
    assert((strDest != NULL) && (strSrc != NULL)); // validate the inputs

    char *address = strDest;   // remember the start of the destination
    while ((*strDest++ = *strSrc++) != '\0')
        continue;

    return address;
}
 

#include <assert.h>
#include <stddef.h>

typedef unsigned char byte;

void *memcpy(void *pvTo, const void *pvFrom, size_t size)
{
    assert((pvTo != NULL) && (pvFrom != NULL)); // use assertions on the inputs

    byte *pbTo = (byte *) pvTo;                 // work on copies so that the
    const byte *pbFrom = (const byte *) pvFrom; // pvTo address is preserved

    while (size-- > 0)
        *pbTo++ = *pbFrom++;

    return pvTo;
}
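
If the two routines above live in one file, a small driver like the following can sanity-check them (just a sketch; it shadows the library versions, which most compilers merely warn about):

#include <stdio.h>

int main(void)
{
    char dst[16];
    strcpy(dst, "hello");                 // exercises the strcpy above
    printf("%s\n", dst);                  // prints "hello"

    int src[4] = {1, 2, 3, 4};
    int copy[4];
    memcpy(copy, src, sizeof src);        // exercises the memcpy above
    printf("%d %d\n", copy[0], copy[3]);  // prints "1 4"

    return 0;
}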

A Big Step Forward for Application Security: The Security Impact of the C Runtime and the Windows API

Michael Howard
Security Program Manager
Secure Windows Initiative team
Windows XP team
Microsoft Corporation

April 2001

Summary: This article discusses common mistakes made in C and C++ function calls and their security implications, and outlines the correct way to use certain functions. This is an ongoing effort; over the coming months we will continue the discussion and provide security information for more APIs.

Introduction

While reviewing C and C++ code for security weaknesses, I have found a number of common problems in calls to certain functions. Even when a function call is not itself security related, using it incorrectly can still introduce subtle security vulnerabilities.

This article discusses those mistakes and their security implications, and outlines the correct way to use a number of functions.

We have begun an ongoing effort to add security remarks to a number of the APIs documented in MSDN and the Platform SDK. In this first round, we cover the top function calls that have led to security weaknesses in Microsoft and non-Microsoft products alike.

We start with the following functions, which deserve particular attention from a security standpoint:

CopyMemory
CreateProcess、CreateProcessAsUser、CreateProcessWithLogonW
SetSecurityDescriptorDacl
Impersonation functions
memcpy
sprintf、swprintf
strcat、wcscat、_mbscat
strcpy、wcscpy、_mbscpy
strncat、wcsncat、_mbsncat
WinExec

CopyMemory

Security remarks

The first parameter, Destination, must be large enough to hold Length bytes of Source; otherwise a buffer overrun may occur. This can lead to a denial of service attack against the application if an access violation occurs or, worse, may allow an attacker to inject executable code into your process. This is especially true if Destination is a stack-based buffer. Be aware that the last parameter, Length, is the number of bytes to copy into Destination, not the size of Destination.

The following code example shows a safe way to use CopyMemory():

#include <windows.h>   // CopyMemory and min

void test(char *pbData, unsigned int cbData) {
    char buf[BUFFER_SIZE];
    // Copy at most BUFFER_SIZE bytes, no matter how large cbData is.
    CopyMemory(buf, pbData, min(cbData, BUFFER_SIZE));
}

CreateProcess, CreateProcessAsUser, CreateProcessWithLogonW

Security remarks

The first parameter, lpApplicationName, can be NULL, in which case the executable name must be the first whitespace-delimited string in lpCommandLine. However, if the executable or path name contains a space, there is a risk, because a malicious executable could be run if the space is handled incorrectly. The following example is dangerous, because the process will attempt to run "Program.exe", if it exists, instead of "foo.exe":

CreateProcess(NULL, "C:\\Program Files\\foo", ...)

If a malicious user were to create an application named "Program.exe" on the system, any program that incorrectly calls CreateProcess with the "Program Files" directory would launch the Trojan application instead of the intended one.

Take care not to pass NULL for lpApplicationName, to keep the function from parsing its runtime arguments to determine the executable path name. Otherwise, if lpApplicationName is NULL, enclose the executable path in lpCommandLine in quotation marks, as in the following example:

CreateProcess(NULL, "\"C:\\Program Files\\foo.exe\" -L -S", ...)

SetSecurityDescriptorDacl

Security remarks

Avoid creating a security descriptor with a NULL DACL (that is, pDacl is NULL), because such a DACL provides no security for the object. Indeed, an attacker could set an Everyone (Deny All Access) ACE on the object, denying everyone, including administrators, access to it. A NULL DACL gives the object no protection from attack whatsoever. A sketch of building a real DACL follows.
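
As an illustration, the following sketch builds a real DACL with SetEntriesInAcl instead of passing NULL; the choice of trustee and the error handling are assumptions for the example:

#include <windows.h>
#include <aclapi.h>

BOOL InitDescriptorWithRealDacl(SECURITY_DESCRIPTOR *pSD)
{
    EXPLICIT_ACCESSA ea = {0};
    PACL pDacl = NULL;   // caller should LocalFree this when done with pSD

    // Grant full control to Administrators only (illustrative trustee).
    BuildExplicitAccessWithNameA(&ea, "Administrators",
                                 GENERIC_ALL, SET_ACCESS, NO_INHERITANCE);

    if (SetEntriesInAclA(1, &ea, NULL, &pDacl) != ERROR_SUCCESS)
        return FALSE;

    if (!InitializeSecurityDescriptor(pSD, SECURITY_DESCRIPTOR_REVISION))
        return FALSE;

    // A real DACL, not NULL, so the object is actually protected.
    return SetSecurityDescriptorDacl(pSD, TRUE, pDacl, FALSE);
}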

Impersonation functions

Security remarks

If a call to an impersonation function fails for any reason, the client is not impersonated, and the client request is made in the security context of the calling process. If the process runs as a highly privileged account such as LocalSystem, or as a member of an administrative group, the user may be able to perform actions that would otherwise be denied. Therefore, always check the return value of the call, and if it reports failure, do not continue executing the client request; a sketch of this check follows the list below. Some examples:

RpcImpersonateClient
ImpersonateNamedPipeClient
SetThreadToken
ImpersonateSelf
CoImpersonateClient
ImpersonateDdeClientWindow
ImpersonateSecurityContext
ImpersonateLoggedOnUser
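
A minimal sketch of that pattern for one of these calls, ImpersonateNamedPipeClient; hPipe and the work performed under impersonation are assumed for the example:

#include <windows.h>
#include <stdio.h>

BOOL HandleClientRequest(HANDLE hPipe)
{
    if (!ImpersonateNamedPipeClient(hPipe)) {
        // Impersonation failed: we still hold the process token (possibly
        // LocalSystem), so do not continue with the client's request.
        fprintf(stderr, "Impersonation failed: %lu\n", GetLastError());
        return FALSE;
    }

    // ... act on the client's behalf in the client's security context ...

    RevertToSelf();   // drop the client token when finished
    return TRUE;
}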

memcpy

Security remarks

The first parameter, dest, must be large enough to hold count bytes of src; otherwise a buffer overrun may occur. This can lead to a denial of service attack against the application if an access violation occurs or, worse, may allow an attacker to inject executable code into your process. This is especially true if dest is a stack-based buffer. Be aware that the last parameter, count, is the number of bytes to copy into dest, not the size of dest.

The following code example shows a safe way to use memcpy():

#include <string.h>                       // memcpy
#define min(a, b) ((a) < (b) ? (a) : (b)) // not standard C, so define it

void test(char *pbData, unsigned int cbData) {
    char buf[BUFFER_SIZE];
    // Copy at most BUFFER_SIZE bytes, no matter how large cbData is.
    memcpy(buf, pbData, min(cbData, BUFFER_SIZE));
}

sprintf, swprintf

Security remarks

The first parameter, buffer, must be large enough to hold the formatted version of format plus a trailing NULL ('\0') character; otherwise a buffer overrun may occur. This can lead to a denial of service attack against the application if an access violation occurs or, worse, may allow an attacker to inject executable code into your process. This is especially true if buffer is a stack-based buffer.

Also be aware of the danger of letting the user or the application supply format as a variable. The following example is dangerous, because an attacker could set szTemplate to "%90s%10s", creating a 100-byte string:

void test(char *szTemplate,char *szData1, char *szData2) {
    char buf[BUFFER_SIZE];
    sprintf(buf,szTemplate,szData1,szData2);
}

Consider using _snprintf or _snwprintf instead, as sketched below.
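
A minimal sketch of the safer pattern: a fixed format string (never supplied by the user) and an explicit size limit. Note that _snprintf does not null-terminate on truncation, so terminate manually:

#include <stdio.h>

void test(char *szData1, char *szData2) {
    char buf[BUFFER_SIZE];
    _snprintf(buf, sizeof(buf) - 1, "%s %s", szData1, szData2);
    buf[sizeof(buf) - 1] = '\0';   // guarantee termination
}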

strcat, wcscat, _mbscat

Security remarks

The first parameter, strDestination, must be large enough to hold the current strDestination and strSource combined, plus a trailing '\0'; otherwise a buffer overrun may occur. This can lead to a denial of service attack against the application if an access violation occurs or, worse, may allow an attacker to inject executable code into your process. This is especially true if strDestination is a stack-based buffer. Consider using strncat, wcsncat, or _mbsncat instead.

strcpy, wcscpy, _mbscpy

Security remarks

The first parameter, strDestination, must be large enough to hold strSource and a trailing '\0'; otherwise a buffer overrun may occur. This can lead to a denial of service attack against the application if an access violation occurs or, worse, may allow an attacker to inject executable code into your process. This is especially true if strDestination is a stack-based buffer. Consider using strncpy, wcsncpy, or _mbsncpy instead, as sketched below.
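
A minimal sketch of the strncpy replacement: bound the copy by the destination size and terminate explicitly, because strncpy does not add a '\0' when the source fills the buffer:

#include <string.h>

void test(char *szSource) {
    char buf[BUFFER_SIZE];
    strncpy(buf, szSource, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';   // guarantee termination
}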

strncat, wcsncat, _mbsncat

Security remarks

The first parameter, strDestination, must be large enough to hold the current strDestination and strSource combined, plus a trailing NULL ('\0'); otherwise a buffer overrun may occur. This can lead to a denial of service attack against the application if an access violation occurs or, worse, may allow an attacker to inject executable code into your process. This is especially true if strDestination is a stack-based buffer. Be aware that the last parameter, count, is the number of bytes to copy into strDestination, not the size of strDestination.

Also note that strncat adds a trailing NULL only if there is room left in the strDestination buffer.

The following code example shows a safe way to use strncat:

void test(char *szWords1, char *szWords2) {
     char buf[BUFFER_SIZE];
      
     strncpy(buf,szWords1,sizeof buf - 1);
     buf[BUFFER_SIZE - 1] = '\0';               
     unsigned int cRemaining = (sizeof buf - strlen(buf)) - 1;
     strncat(buf,szWords2,cRemaining);
}

WinExec

Security remarks

The executable name is taken to be the first whitespace-delimited string in lpCmdLine. However, if the executable or path name contains a space, there is a risk, because a malicious executable could be run if the space is handled incorrectly. The following example is dangerous, because the process will attempt to run "Program.exe", if it exists, instead of "foo.exe":

WinExec("C:\\Program Files\\foo", ...)

If a malicious user were to create an application named "Program.exe" on the system, any program that incorrectly calls WinExec with the "Program Files" directory would launch the Trojan application instead of the intended one.

For security reasons, we strongly recommend that you use CreateProcess rather than WinExec. If you must use WinExec for legacy reasons, however, be sure to enclose the application name in quotation marks, as in this example:

WinExec("\"C:\\Program Files\\foo.exe\" -L -S", ...)

Summary

As we learn more about the misuse of functions, we will add to this list. If you have comments about these function calls, feel free to e-mail me at mikehow@microsoft.com.

©2001 Microsoft Corporation. All rights reserved. Terms of use.

Optimizing Memcpy improves speed

Source: http://www.embedded.com/shared/printableArticle.jhtml;jsessionid=IYGCWHD2ZB3HGQSNDBCCKH0CJUMEKJVN?articleID=19205567
By Michael Morrow, Courtesy of Embedded Systems Programming
29 2004 (17:00 H)
URL: http://www.embedded.com/showArticle.jhtml?articleID=19205567

Knowing a few details about your system-memory size, cache type, and bus width can pay big dividends in higher performance.

The memcpy() routine in every C library moves blocks of memory of arbitrary size. It’s used quite a bit in some programs and so is a natural target for optimization. Cross-compiler vendors generally include a precompiled set of standard class libraries, including a basic implementation of memcpy(). Unfortunately, since this same code must run on hardware with a variety of processors and memory architectures, it can’t be optimized for any specific architecture. An intimate knowledge of your target hardware and memory-transfer needs can help you write a much more efficient implementation of memcpy().

This article will show you how to find the best algorithm for optimizing the memcpy() library routine on your hardware. I’ll discuss three popular algorithms for moving data within memory and some factors that should help you choose the best algorithm for your needs. Although I used an Intel XScale 80200 processor and evaluation board for this study, the results are general and can be applied to any hardware.

A variety of hardware and software factors might affect your decision about a memcpy() algorithm. These include the speed of your processor, the width of your memory bus, the availability and features of a data cache, and the size and alignment of the memory transfers your application will make. I’ll show you how each of these factors affects the performance of the three algorithms. But let’s first discuss the algorithms themselves.

Three basic memcpy() algorithms
The simplest memory-transfer algorithm just reads one byte at a time and writes that byte before reading the next. We’ll call this algorithm byte-by-byte. Listing 1 shows the C code for this algorithm. As you can see, it has the advantage of implementation simplicity. Byte-by-byte, however, may not offer optimal performance, particularly if your memory bus is wider than 8 bits.

Listing 1: The byte-by-byte algorithm

void * memcpy(void * dst, void const * src, size_t len)
{
    char * pDst = (char *) dst;
    char const * pSrc = (char const *) src;

    while (len--)
    {
        *pDst++ = *pSrc++;
    }

    return (dst);
}

An algorithm that offers better performance on wider memory buses, such as the one on the evaluation board I used, can be found in GNU’s newlib source code. I’ve posted the code here. If the source and destination pointers are both aligned on 4-byte boundaries, my modified-GNU algorithm copies 32 bits at a time rather than 8 bits. Listing 2 shows an implementation of this algorithm.


Listing 2: The modified-GNU algorithm

void * memcpy(void * dst, void const * src, size_t len)
{
    long * plDst = (long *) dst;
    long const * plSrc = (long const *) src;

    // Copy a long at a time only while both pointers are 4-byte aligned.
    if (!((long)src & 0x3) && !((long)dst & 0x3))
    {
        while (len >= 4)
        {
            *plDst++ = *plSrc++;
            len -= 4;
        }
    }

    char * pcDst = (char *) plDst;
    char const * pcSrc = (char const *) plSrc;

    while (len--)
    {
        *pcDst++ = *pcSrc++;
    }

    return (dst);
}

A variation of the modified-GNU algorithm uses computation to adjust for address misalignment. I’ll call this algorithm the optimized algorithm. The optimized algorithm attempts to access memory efficiently, using 4-byte or larger reads and writes. It operates on the data internally to get the right bytes into the appropriate places. Figure 1 shows a typical step in this algorithm: memory is fetched on naturally aligned boundaries from the source of the block, the appropriate bytes are combined, then written out to the destination’s natural alignment. A C sketch of this shift-and-merge step follows.
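
The actual optimized routine is written in XScale assembly, but the central shift-and-merge step can be sketched in C. The helper name is mine, and the byte-merging expression assumes a little-endian, 32-bit target:

#include <stddef.h>
#include <stdint.h>

void *memcpy_shift(void *dst, void const *src, size_t len)
{
    uint8_t *d = (uint8_t *)dst;
    uint8_t const *s = (uint8_t const *)src;

    // Byte-copy until the destination is word aligned.
    while (((uintptr_t)d & 0x3) && len) {
        *d++ = *s++;
        len--;
    }

    size_t shift = (uintptr_t)s & 0x3;   // remaining source misalignment
    if (shift == 0) {
        while (len >= 4) {               // both aligned: plain word copy
            *(uint32_t *)d = *(uint32_t const *)s;
            d += 4; s += 4; len -= 4;
        }
    } else {
        // Fetch naturally aligned words and merge the pieces. Like many
        // word-wise copies, this may touch up to 3 bytes outside src's
        // exact bounds, though never outside the enclosing aligned words.
        uint32_t const *ws = (uint32_t const *)(s - shift);
        uint32_t lo = *ws++;
        while (len >= 4) {
            uint32_t hi = *ws++;
            *(uint32_t *)d = (lo >> (8 * shift)) | (hi << (8 * (4 - shift)));
            d += 4; s += 4; len -= 4;
            lo = hi;
        }
    }

    while (len--)                        // tail bytes
        *d++ = *s++;
    return dst;
}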



Figure 1: The optimized algorithm

Note that the optimized algorithm uses some XScale assembly language. You can download this algorithm here. The preload instruction is a hint to the ARM processor that data at a specified address may be needed soon. Processor-specific opcodes like these can help wring every bit of performance out of a critical routine. Knowing your target machine is a virtue when optimizing memcpy().

Having looked at all three of the algorithms in some detail, we can begin to compare their performance under various conditions.

Block size
What effect does data size have on the performance of our algorithms? To keep things simple, let’s assume that there’s no data cache (or that it’s been disabled) and that all of the source and destination addresses are aligned on 4-byte boundaries.
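
On a hosted system you can compare candidate implementations with a simple harness like the one below (a sketch: the iteration count and buffer sizes are arbitrary, and on a bare-metal board like the article's you would read a hardware cycle counter instead of clock()):

#include <stdio.h>
#include <string.h>
#include <time.h>

#define ITERATIONS 1000000L

static char src[4096], dst[4096];

static double time_copy(void *(*copy)(void *, void const *, size_t), size_t n)
{
    clock_t start = clock();
    for (long i = 0; i < ITERATIONS; i++)
        copy(dst, src, n);                    // copy one block of n bytes
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

int main(void)
{
    printf("20-byte blocks:  %.3f s\n", time_copy(memcpy, 20));
    printf("128-byte blocks: %.3f s\n", time_copy(memcpy, 128));
    return 0;
}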


Figure 2: Algorithm comparison for small data blocks (20 bytes)

As you can see in Figure 2, byte-by-byte does the best when the blocks are small, in this case 20 bytes. Byte-by-byte’s main advantage is that it lacks overhead. It just moves bytes, without looking at addresses. The actual copy loop only runs for a small number of iterations (20 in this case), and then the routine is complete.

Note also that the performance of byte-by-byte improves dramatically as the processor clock speed increases. This implies that the routine is CPU-bound. Until we saturate the memory bus with reads and writes, the byte-by-byte algorithm will continue to execute more quickly.

For larger blocks, the situation is different. For example, simply increasing the size of the data blocks to 128 bytes makes byte-by-byte the clear loser at all processor speeds, as shown in Figure 3.


Figure 3: Algorithm comparison for large data blocks (128 bytes)

Here the modified-GNU algorithm has the best performance, as it makes more efficient use of the memory bus. The optimized algorithm has comparable performance, but the effects of its additional computation take their toll in overhead.

Data alignment
What happens if the source and destination are not both aligned on a 4-byte boundary? Figure 4 shows the results on a 128-byte block with unaligned addresses. It’s obvious here that byte-by-byte is not affected by the misalignment of data blocks. It just moves a single byte of memory at a time and doesn’t really care about addresses.


Figure 4: Algorithm comparison for unaligned data blocks (128 bytes)

The modified-GNU algorithm performs worse than byte-by-byte in this situation, largely because it defaults to byte-by-byte (after a fixed overhead).

The big overhead in the GNU algorithm comes from register save-restore in its prologue-epilogue. The algorithm saves off four registers, where the byte-by-byte routine saves none. So, as memory speed decreases in relation to processor speed, the GNU algorithm suffers accordingly. By the way, "optimized" memcpy saves off nine registers, which is part of the reason it becomes less compelling at high core and bus speeds. This overhead matters less when the stack is cached (probably the normal case).

The optimized algorithm handles unaligned addresses the best, outperforming byte-by-byte. At slower clock speeds, the overhead of dealing with alignment is repaid by moving the memory four times as quickly. As CPU performance improves (with respect to the memory system), the elegance of an algorithm like optimized becomes less helpful.

Caching
Everything changes if your processor has a data cache. Let’s try the same memcpy tests we’ve already run, but with the data already in cache. Figures 5 and 6 show the results. The memory is no longer the bottleneck, so algorithm efficiency becomes the limiting factor. If your data is likely to reside mainly in the cache, use something other than byte-by-byte.


Figure 5: Data cache effect on memcpy throughput (333MHz)


Figure 6: Data cache effect on memcpy throughput (733MHz)

Note that the bar charts in Figures 5 and 6 look about the same, although the y-axis scales differ. This y-axis difference supports the contention that we’re limited by processing speed, not by memory. Also, the data for unaligned memcpy shows that the GNU memcpy performance degrades to that of byte-by-byte performance when addresses are not aligned. You may see severe degradation in memcpy performance if your data is not always aligned in memory.

Write policy
A write-through cache is one that updates both the cache and the memory behind it whenever the processor writes. This sort of cache tries to satisfy reads without going to memory.

A write-back cache, on the other hand, tries to satisfy both reads and writes without going to memory. Only when the cache needs storage will it evict some of its data to memory; this is called variously a write back, a cast out, or an eviction. Write-back caches tend to use less memory bandwidth than write-through caches.

The processor I used allows the cache to be configured using either policy. What effect does this have on memcpy? It depends. Figure 7 shows what happens when the cache is cold (no data in it). Figure 8 shows what happens if the cache contains only garbage data (data from other addresses).


Figure 7: Cache-policy effect (cold cache, 128 bytes, 333MHz)


Figure 8: Cache-policy effect (garbage cache, 128 bytes, 333MHz)

With a cold cache, optimized memcpy with write-back cache works best because the cache doesn’t have to write to memory and so avoids any delays on the bus.

For a garbage-filled cache, write-through caches work slightly better, because the cache doesn’t need to spend extra cycles evicting irrelevant data to memory. As usual, the more you know about your system, such as the likelihood of having certain data in the cache, the better you can judge the efficacy of one cache policy over another.


Figure 9: Performance of 4KB memcpy

Special situations
If you know all about the data you’re copying as well as the environment in which memcpy runs, you may be able to create a specialized version that runs very fast. Figure 9 shows the performance gain we accrue by writing a memcpy that handles only 4KB-aligned pages when the cache is in write-through mode. This example shows that writing a very specific algorithm may double the speed of a memcpy-rich program. I’ve posted one of these algorithms here.
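
As a concrete example of such specialization, here is a sketch of a copy that assumes exactly one 4KB-aligned, 4KB-long page (my own illustration, not the author's posted routine), unrolled to move 32 bytes per loop:

#include <stdint.h>

void copy_page(void *dst, void const *src)
{
    uint32_t *d = (uint32_t *)dst;
    uint32_t const *s = (uint32_t const *)src;

    // 4096 bytes / 32 bytes per iteration = 128 iterations.
    for (int i = 0; i < 128; i++) {
        d[0] = s[0]; d[1] = s[1]; d[2] = s[2]; d[3] = s[3];
        d[4] = s[4]; d[5] = s[5]; d[6] = s[6]; d[7] = s[7];
        d += 8; s += 8;
    }
}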

Optimize away
Some applications spend significant processor time transferring data within memory; by choosing the optimal algorithm, you could improve overall program performance significantly. The moral of the story: know your target hardware and the characteristics of your application. Armed with this knowledge, you can easily find the optimal algorithm.

Mike Morrow is a processor architect in the Intel XScale core group. He’s been developing embedded software/hardware for about 15 years. Mike earned an MS in computer science from the University of Tennessee. He can be reached at michael.morrow@intel.com.

Copyright 2005 © CMP Media LLC

Jserv’s blog

http://blog.linux.org.tw/~jserv/

Indri is a new search engine from the Lemur project, a cooperative effort between the University of Massachusetts and Carnegie Mellon University to build language-modeling information retrieval tools.

Effective

  • Best-in-class ad hoc retrieval performance

Flexible

  • Supports popular structured query operators from INQUERY
  • Open source, with a flexible BSD-inspired license
  • Parses PDF, HTML, XML, and TREC documents
  • Word and PowerPoint parsing (Windows only)

Usable

  • Supports UTF-8 encoded text
  • Includes both command line tools and a Java user interface
  • API can be used from Java, PHP, or C++
  • Works on Windows, Linux, Solaris and Mac OS X

Powerful

  • Can be used on a cluster of machines for faster indexing and retrieval
  • Field retrieval
  • Passage retrieval
  • Scales to terabyte-sized collections

July 24, 2005

 

How to Write a "Roses Are Red" Poem

by Bruce Lansky

Some roses are red, but some are gold, peach, or white. Just as roses can be different colors, this Valentine’s Day folk rhyme can be written in different ways.

Most people are familiar with this poem, and it’ll be easy and fun for you to change it around and create surprising, delightful, and funny variations.

First of all, in case you’ve forgotten, here’s the original:

Roses Are Red

Roses are red.
Violets are blue.
Sugar is sweet.
And so are you.

-Anonymous

Before I show you how easy and fun it is to create variations on this pattern, here’s one I think you’ll enjoy:

Roses are Blue

Roses are blue.
Violets are red.
If you agree,
You’ve got rocks in your head.

-by a student in Denver whose name I’ve forgotten

I hope you’re thinking, "That looks like fun. How did that student come up with such a cute rhyme?" Here’s the method:

Write the names of colors that contain only one syllable. Under the names, write words that rhyme with the colors. Eventually, your paper should look something like this:

white: sight, fright, tight, kite, fight
blue:  shoe, too, two, clue, you, few
red:   bed, head, dead, said, fed
pink:  stink, think, drink, fink
green: mean, seen, bean
black: back, sack, tack

OK, now you’re ready to get into a fun, creative session. Start off with a "fill in the blank" poetry completion exercise. Pick one of the colors on your list, say "pink." Recite the following stanza, then fill in the blank at the end.

Violets are blue.
Roses are pink.
Put on your shoes,
your feet really ____.

-Bruce Lansky

If you pick "black," say:

Roses are red.
Asphalt is black.
If you’re feeling hungry,
I’ll give you a _____.

If you pick "blue," say:

Roses are red.
Violets are blue.
Please flush the toilet
after you’re _______.

-Bruce Lansky


Once you’ve caught on to the "fill in the blank" idea, move on to a more difficult creative exercise. See if you can complete an entire poem. In other words, after picking a color, create the last two lines on your own. Arrange the first two lines according to your color choice as suggested above.

This is a challenging exercise that you’ll love to work on at home or on the bus. Who knows what you’ll come up with? I’ll bet some of your poems will be fun to read and worth sharing with your classmates and teachers, possibly in the form of illustrated posters that can be used to decorate your classrooms or hallways on Valentine’s Day.

Speaking of which, since I usually have the sniffles during the month of February, here’s my Valentine’s Day poem to you:

Kiss Me Not
 
Roses are red.
Violets are blue.
Please don’t kiss me,
’cuz I have the flu.

© 2000 Bruce Lansky, reprinted from If Pigs Could Fly…and Other Deep Thoughts published by Meadowbrook Press

 

-Bruce Lansky


July 19, 2005

1.

I first heard the term 境域易位 (what philosophy calls a paradigm shift) from Brother Gongdao. Conceptually, though, I don’t find "paradigm" mysterious: it is a notion even more fundamental than logic. All logical thinking rests on some paradigm; a paradigm is to thought what an axiom system is to mathematics, and a paradigm shift is, in mathematical terms, "switching from one axiom system to another." Brother Gongdao’s metaphor, "the Monkey King cannot leap out of the Buddha’s palm," vividly captures a predicament we meet in trying to know the world: we cannot fully know ourselves, and in knowing the external world we cannot pin down the bias introduced by our own subjectivity. These predicaments surface as paradoxes across the disciplines; I list a few here for the amusement of the Huayue netizens:

  • Early childhood: my son sitting in the bathtub, trying to lift himself up, tub and all
  • Commerce: a fair scale that cannot certify its own fairness
  • Religion: God created all things; is God one of those things? If so, how did God create himself? If not, who created God? And who created the creator of God, and the creator of that creator, and so on?
  • Wuxia fiction: Feng Qingyang, heir of Dugu Qiubai, the swordsman who sought defeat and could never find it
  • Mathematics: Gödel’s incompleteness theorems
  • Physics: the uncertainty principle of quantum mechanics
  • Software: recursion in computer programs
  • Philosophy: Russell’s barber paradox
  • ...

(I have to get off work, so I’ll stop here.)
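Since the list offers recursion as software’s version of self-reference, a trivial C sketch (my own choice of example, not from the original post) shows the shape of a definition that refers to itself:

/* A function defined in terms of itself: to compute factorial(n),
 * you must already be able to compute factorial(n - 1). The base
 * case is what keeps the self-reference from regressing forever. */
unsigned long factorial(unsigned int n)
{
    if (n == 0)
        return 1;
    return n * factorial(n - 1);
}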


2.

Do you know what a paradigm shift is?

"Paradigm shift" is generally rendered in Chinese as 范式转变.

The word "paradigm" originally meant an example or pattern. It derives from the Greek paradeigma. Since the 1960s it has commonly been used in science and scholarship to mean a theoretical framework.

"Paradigm shift" itself first appeared in The Structure of Scientific Revolutions, by the American philosopher Thomas Kuhn. In that book Kuhn proposed that "science is not a simple pile-up of facts, theories, and methods, and the development of science is not a simple accumulation of knowledge, but a process of continual revolution carried out through successive shifts of paradigm." That is, science does not advance by gradually knowing more and more; it advances when an old model runs into overwhelming difficulties and is overthrown by a new model or explanation. Copernicus’s heliocentrism overturning Ptolemy’s geocentrism, or Einstein’s relativity overturning Newton’s classical physics, are examples.

A paradigm shift, then, is what happens when anomalies or inconsistencies arise within the current paradigm that it cannot resolve: our view of reality must change, and with it the ways we perceive, think about, and evaluate the world. We are forced to adopt new assumptions and expectations, revising our theories, traditions, rules, and standards of practice, and to build a new paradigm that solves the problems the old one could not.

Ever since the term appeared, it has been a darling of intellectuals and has long served as a fashionable buzzword, and so a certain kind of self-styled profound scholar has emerged. This is what I dislike about such "scholars": they habitually dress simple ideas up in mystery, using vocabulary ordinary people don’t understand to inflate their own standing, and writing essays obscure in wording and empty in substance. My point is not whether "paradigm shift" is a good or bad term; it is that if all you want to tell people is to change their way of thinking, there is no need to announce, in tongue-twisting fashion, that they must undergo a paradigm shift.

We are a generation lost in a maze of words, and on this subject I am nearly aphasic. I often recall a joke the physicist Feynman tells in one of his books, about sitting in on a conference: "...There was a sociologist who had written a paper for us all to read... So I stopped, and read the sentence carefully. ‘The individual member of the social community often receives his information via visual, symbolic channels.’ I went back and forth over it and translated it. You know what it means? ‘People read.’" (Richard Feynman, Surely You’re Joking, Mr. Feynman!, p. 367)

PS.

Pity that we programmers have proved no exception to the fashion:

"There are currently two waves of enthusiasm in the IT world: one is inter-enterprise data interchange and process automation, the other is the software industry’s turn toward Web services. The former is already in full swing; the latter is an undercurrent gathering force. And the foundational technology directly enabling both of these paradigm shifts (典範轉移) is XML."
      ——勞虎

"‘C++ is a language that is hard to learn but easy to use.’ Many people will agree with heartfelt feeling. The difficulty of learning C++ lies partly in the language’s many behind-the-scenes mechanisms, and partly in the ‘paradigm shift’, the migration in one’s mode of thinking, that it demands."
      ——侯捷,《C++ 的沉迷與愛戀》

July 7, 2005



The layout of the Ximen chicken-wing stalls differs from that of every other place: a row of little storefronts right on the street, with plastic tables and chairs set out in front, ready to receive customers at any hour. Around noon or toward evening, PKU students who had finished their meals would each spend three yuan on a skewer of chicken wings (that was several months ago; a skewer still sells for three yuan today) and sit outside, eating them hot and taking their ease. Those willing to spend a few yuan more could buy some banjin (gristle skewers) or meat skewers to eat slowly alongside, and with twenty yuan one could order a proper dish; but these customers were mostly guang-guang (single fellows), and seldom so flush. Only those with an mm (a girl) on their arm would stroll into the private seating inside, order food and drink, and sit drinking at their leisure.

  From the age of sixteen I worked as a waiter at the Guizhou restaurant by the crossing. The manager said I looked too dull to wait on the customers who came with mm, so I had better do odd jobs outside. The guang-guang customers out front, though easier to deal with, included no small number who mumbled and nagged without end. They would insist on counting the skewers themselves, to see whether any were short, and count the bamboo sticks again when they had finished, and only then feel easy: under such strict surveillance, sneaking a bite was very difficult indeed. So after a few days the manager said I was no good at this job either. Luckily my recommender’s face carried weight, so I could not be laid off, and was switched instead to the dull business of wiping down tables.

  From then on I stood outside the shop all day, attending to my duties. Though I never neglected them, the work felt monotonous and a little dull. The manager wore a fierce face, and the customers had no pleasant words either, so there was no way to be lively; only when Kong Yiji came to the shop, always with a ppmm (a pretty girl) in tow, could I feast my eyes and laugh a little, and that is why I remember him still.

  Kong Yiji was the only customer who ate skewers outside yet brought an mm with him. He was a big, tall man with a pale complexion, scars that often showed among the wrinkles of his forehead, and a longish, scraggly beard. Though he wore Adidas, it was dirty and torn, as if it had not been changed or washed for months. His talk was studded with "re" and "sp" and the like, half intelligible to his hearers. Because his surname was also Kong, people fished a nickname for him out of that half-intelligible text in the middle-school Chinese reader, Lu Xun’s "Kong Yiji," and called him Kong Yiji. Whenever he came to the shop, everyone eating skewers would look at him and chuckle, and someone would call out, "Kong Yiji, you’ve got a new scar on your face!" He would make no reply, but say to the waitress mm, "Six chicken wings, and a bottle of Xianchengduo," and lay out fifty whole yuan. Then they would deliberately shout, "You must have been out chasing somebody’s mm again!" Kong Yiji would open his eyes wide: "How can you smear a man’s good name out of thin air..." "Good name? The day before yesterday I saw with my own eyes: you bg-ed an mm, her gg caught you, and you got a brick upside the head." Kong Yiji would flush red, the veins standing out on his forehead, and protest, "A bg doesn’t count as chasing an mm... a bg! Can something that happens on the BBS count as chasing an mm?" Then would follow a string of hard words, "a gentleman loves beauty but is not lewd," "sigh," and the like, setting the whole crowd roaring with laughter: inside and outside the shop the air would fill with merriment.

  According to gossip behind his back, Kong Yiji had once had an mm, but it came to nothing in the end; and since he didn’t care for grooming either, he went from bad to worse, until he was on the point of becoming a guang-guang for good. Fortunately he came from the CS department, so he could fix computers for mm, and now and then capture an mm’s heart that way. Unfortunately he had one bad habit: he was addicted to the BBS. Within minutes of fixing a computer he would grab the mm’s machine to flood the boards. After a few such episodes, no mm asked him to fix computers anymore. Kong Yiji had no way out, and so could not help occasionally finding a ppmm to bg. Yet in his department his conduct was better than anyone else’s: he never gouged a soul. Though occasionally he had no stock on hand and the order had to be jotted in his notebook for the time being, within a week he would settle it without fail, whether it was a CPU or a network cable.

  After Kong Yiji and his mm had eaten a skewer of wings, and the flush had gradually faded from his face, a bystander would ask, "Kong Yiji, are you really so famous on the BBS?" Kong Yiji would look at his questioner with an air of disdaining to argue. Then they would press on, "Then how is it you can’t land even half an mm?" At once Kong Yiji would look dejected and uneasy, his face clouding gray, muttering something; this time it was all "sigh" and "555" (the sound of sobbing), quite beyond understanding. Then the crowd would roar with laughter again: inside and outside the shop the air filled with merriment.

  At such times I could join in the laughter; the manager never scolded me for it. Indeed the manager himself, whenever he saw Kong Yiji, would put the same questions to him, to raise a laugh. Kong Yiji, knowing it was hopeless to converse with them, could only turn and talk to us boys. Once he asked me, "Have you ever been on a BBS?" I gave a slight nod. He said, "Been on a BBS... then let me test you. You telnet onto a BBS; how is it done?" I thought to myself: does a creep like this rate testing me? and turned my face away, paying him no more mind. Kong Yiji waited a long while, then said very earnestly, "You don’t know, do you? ... I’ll teach you; remember it! These methods should be remembered. When you become a banzhu (board moderator) someday, you’ll need them to manage a board." I thought to myself that moderator rank was still a very long way off for me, and that I only went on the BBS to read the news; amused and impatient at once, I drawled, "Who needs you to teach me? Don’t you just type telnet into Run?" Kong Yiji looked delighted, tapped the table with the two long nails of his fingers, and nodded: "Right, quite right! ... And there are three other client programs; do you know them?" I grew more impatient and sauntered off, pouting. Kong Yiji had just picked up a bamboo skewer, meaning to write on the table, but seeing my utter indifference, he sighed once more and looked extremely sorry.

  Several times, passing freshman mm would hear the laughter and crowd around Kong Yiji to join the fun. He would then hand out majia (spare accounts) on the Yitahutu BBS, one apiece. The mm memorized the passwords but still would not disperse, every eye fixed on him. Kong Yiji would grow flustered, hurriedly stuff the phone book in which he kept his majia back into his pocket, and say, bending down, "Not many left, I haven’t many left." Then, straightening up and taking the phone book out again, he would shake his head: "Not many, not many! Many? Nay, not many. Keep too many majia and the site admins will chop them." And with that the flock of mm would scatter amid laughter.

  So Kong Yiji made people merry; yet without him, people got along just the same.

  One day, two or three days before Mid-Autumn it must have been, the manager was slowly making up the accounts. Pushing the calculator aside, he suddenly said, "Kong Yiji hasn’t been in for a long time. He still owes nineteen yuan!" Then I too realized that indeed he had not been in for a long time. A man eating skewers said, "How could he come? ... He’s been dumped again." "Oh!" said the manager. "He will keep dreaming. This time he took leave of his senses and actually went after the campus belle of the Foreign Languages school. Would an mm from Foreign Languages look twice at a CS man?" "And then?" "Then? Rejected first, rejected again, rejected a dozen-odd times." "And after that?" "After that he said his heart was broken." "Broken, and then what?" "Then? ... Who knows? Maybe he’s gone off to chase an mm in Tsinghua CS." The manager asked no more, and went slowly on with his accounts.

  After Mid-Autumn the wind grew colder by the day, and early winter was drawing near; the tables and chairs outside were taken in, and I spent all day leaning against the radiator, needing a sweater even so. One afternoon, with not a single customer about, I was sitting with my eyes closed when I suddenly heard a voice: "A skewer of chicken wings." The voice was very low, but oddly familiar. I stood up and looked around: there was Kong Yiji, sitting in a corner by the door. His face was dark and thin, hardly recognizable; he wore a ragged jacket, not Adidas this time. Seeing me, he said again, "A skewer of chicken wings."

  The manager went over too, saying, "Kong Yiji, is it? You still owe nineteen yuan!" Kong Yiji looked up dejectedly and answered, "That... I’ll settle it next time. This time it’s cash; make the wings hot." The manager, just as usual, said to him with a laugh, "Kong Yiji, you’ve been out chasing mm again!" But this time he made no great defense; he said only, "Don’t make fun of me!" "Make fun? If you hadn’t been chasing an mm, how would you have gotten dumped?" Kong Yiji said in a low voice, "I dumped her. It was I who dumped her..." and his eyes seemed to beg the manager not to bring it up again. By now several people had gathered, and they and the manager all laughed.

  I fetched the skewer of wings and set it out on the table. From his ragged pocket he fished out a handful of small change and put it in my hand. Before long he had finished the wings, and amid the talk and laughter of the bystanders, he slowly walked away.

  From then on, another long time passed without my seeing Kong Yiji. When New Year came, the manager took down the account book and said, "Kong Yiji still owes nineteen yuan!"

  At the Dragon Boat Festival the next year he said again, "Kong Yiji still owes nineteen yuan!" At Mid-Autumn he said nothing, and when New Year came round again Kong Yiji was still nowhere to be seen.

  I have never seen him since. Most likely Kong Yiji really has found himself an mm at last.

June 28, 2005

Once upon a time, the feelings and qualities of humankind held a gathering.

When Boredom yawned for the third time, Madness proposed, "Let’s play hide-and-seek!"

Interest raised an intrigued eyebrow, and Curiosity couldn’t help asking, "Hide-and-seek? What’s that?"

"It’s a game," Madness explained. "I close my eyes and count from one to a million; in the meantime, all of you find somewhere to hide.

When I finish counting, the first one I catch takes my place and the game goes on."

Enthusiasm danced approvingly at Delight’s side; Joy leapt about, having talked Doubt into it, and even Indifference, who had never taken an interest in anything.

But not everyone wanted to join in.

Truth didn’t want to play, reasoning, "Why should I hide?";

Superiority thought it a stupid game (in truth, because the idea hadn’t been its own);

and Cowardice chose not to take the risk.

"One, two, three..." Madness began to count.

Laziness, as usual, hid first: it had no wish to go any farther than the nearest rock.

Faith went up to heaven; Envy hid behind Triumph, who by its own strength had won the highest treetop;

Generosity couldn’t find anywhere to hide, because every spot it found seemed better suited to one of its friends:

a lake clear as a mirror, for Beauty; the hollow of a tree, the perfect refuge for Shyness; and Freedom, surely, ought to soar with the butterflies.

So in the end Generosity chose to stand in a beam of sunlight.

Selfishness, on the contrary, found from the very start a place fit only for itself: airy and comfortable.

Falsehood hid at the bottom of the ocean (a lie: in fact it hid behind the rainbow);

Passion and Desire went down into the heart of a volcano.

And Forgetfulness? I’ve forgotten too; no matter.

When Madness had counted to nine hundred ninety-nine thousand nine hundred ninety-nine, Love still had not found a hiding place, for every spot it came to was already taken.

But then it noticed a rose bush, and tenderly it slipped in among the blossoms.

"One million!" Madness finished counting and set out to search.

The first one found was Laziness, hiding three steps from the rock.

Then it heard Faith up in heaven arguing with God, and felt Passion and Desire throbbing inside the volcano.

By accident it discovered Envy too, and with it, of course, Triumph standing just in front.

Madness had no need to search for Selfishness at all, for Selfishness had hidden in a hornets’ nest.

After all that walking, Madness grew thirsty.

And at the lake, unexpectedly, it found Beauty in the water, and Doubt sitting on the bank, still unsure which side to hide on.

One by one it found them all:

Potential in the grass; Anguish in a dark cave;

Falsehood at the bottom of the sea (a lie again: it was behind the rainbow);

and Forgetfulness, who had quite forgotten it was playing hide-and-seek. At last only Love was left.

Madness looked behind every tree, in every crevice of the planet, on every mountaintop, and Love was nowhere to be found.

Just as it was about to give up, it caught sight of the rose bush and its roses.

It seized the branches and shook the tree, and there came a scream of heart-rending pain.

The thorns of the roses had pierced Love’s eyes.

Madness wept and pleaded, begging Love’s forgiveness, and vowed to serve as Love’s guide dog for the rest of its life.

Ever since that first game of hide-and-seek on Earth, Love has been blind, and Madness has been forever at its side.