
Cache write-back with memset?

Started by Zurtan
6 comments, last by frob 4 years, 6 months ago

I noticed that doing one memset right after another takes very different amounts of CPU time.

The first memset of a million ints takes about a second, but the next memset of the same memory takes 60 ms.

I believe this is due to write-back caching.

The first memset brings the memory into the cache, and the second memset writes into the cache instead of directly to RAM.

However, can we benefit from write-back even if we memset a region of memory larger than the cache size?

It seems like the answer is yes. But how is that implemented?
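
A minimal sketch of the kind of back-to-back timing being described (illustrative sizes and names, not the original test code):

// Time a first memset of a freshly allocated buffer against a second memset
// of the same buffer. The buffer is deliberately left untouched after malloc
// so the first pass also pays the cost of faulting the pages in.
#include <chrono>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <iostream>

int main()
{
    const std::size_t count = 1000000;   // one million ints, about 4 MB
    int* data = static_cast<int*>(std::malloc(count * sizeof(int)));

    auto time_memset = [&](const char* label) {
        auto start = std::chrono::steady_clock::now();
        std::memset(data, 0, count * sizeof(int));
        auto stop = std::chrono::steady_clock::now();
        std::cout << label << ": "
                  << std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count()
                  << " us\n";
    };

    time_memset("first memset ");    // pays page faults plus cache misses
    time_memset("second memset");    // pages are mapped and data is (partly) cached

    std::free(data);
}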


There is no easy answer.

Yes, the speed of writing to memory is variable.

Some memory is kept in the cache of the individual processor. Sometimes it must be synchronized with other processors. When the data being touched is large enough, older entries get evicted from the cache.

Alignment matters, and it depends on the chip, often with 4-, 8-, or 16-byte boundaries. If you're within a single word or dword or qword, writing to the first byte or second byte (such as registers al or ah) often has exactly the same cost as writing two bytes (ax), four bytes (eax), or eight bytes (rax); but if you start writing in the upper ranges of the boundary there is a significant cost, as the other bytes must be manipulated.

There are natural boundaries at 1K, 4K, 16K, 32K, 256K, and many more. Not only can these trigger operations between levels of cache for L1, L2, L3, and main memory, they can also trigger virtual memory shuffles and potentially even swapping to disk.

Fortunately for you, memset is already heavily optimized. Don't call it if you don't need to, but if you need to call it, it's a compiler intrinsic (meaning the compiler takes special effort to optimize it) on every major compiler.
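
For example, a plain zeroing loop like the sketch below is a pattern that major optimizing compilers commonly recognize and compile down to the same code they would emit for a memset call; you can confirm this by inspecting the generated assembly at -O2.

#include <cstddef>

// A plain byte-zeroing loop. Optimizing compilers commonly recognize this
// pattern and emit the same code they would for memset(buf, 0, n).
void zero_buffer(unsigned char* buf, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        buf[i] = 0;
}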

Something weird is going on with your benchmark. 1 million ints is only 4MB. I'd expect a memset of that size to complete in under 2 milliseconds in the worst case. 4MB will also fit in the L3 cache on most modern CPUs. Can you share your test code?

You also might want to read https://randomascii.wordpress.com/2014/12/10/hidden-costs-of-memory-allocation/ which explains why freshly allocated memory can be much more expensive to access. The short version is that it's not just cache that speeds up the second access - the operating system will also need to process soft page faults for the first access to newly allocated memory.
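
On Linux those soft faults can be observed directly. A rough sketch using getrusage(), whose ru_minflt field counts minor (soft) page faults; the first pass over freshly allocated memory should show a large jump, the second pass almost none:

// Count minor (soft) page faults around each memset of a freshly
// malloc'd buffer. Linux/glibc specific.
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <iostream>
#include <sys/resource.h>

static long minor_faults()
{
    rusage usage{};
    getrusage(RUSAGE_SELF, &usage);
    return usage.ru_minflt;
}

int main()
{
    const std::size_t bytes = 4u * 1024 * 1024;   // about 4 MB
    void* buf = std::malloc(bytes);

    long before = minor_faults();
    std::memset(buf, 0, bytes);
    std::cout << "first memset:  " << minor_faults() - before << " soft faults\n";

    before = minor_faults();
    std::memset(buf, 0, bytes);
    std::cout << "second memset: " << minor_faults() - before << " soft faults\n";

    std::free(buf);
}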

I figured that allocating memory is O(1), just marking the required memory.

Once you read or write it with memset, it needs to be brought into the cache or have other memory management work done, which is O(n).

The second memset benefits from not having to do those memory management things, and in addition the memory is already in the cache.

I just didn't understand why, even when we go beyond 16 MiB, the second memset is still faster than the first one.

Zurtan said:
I figured that allocating memory is O(1), just marking the required memory.

Not exactly. You request a block of X bytes, the memory manager needs to find a good spot for allocating, and it needs to update its internal administration. This may mean it is going to search through the free blocks, for example.

Allocation is not O(1). For example:

Linux has a memory manager. This is the thing implementing mmap() and sbrk(), which is what malloc()/new in turn uses to allocate memory.

The Linux memory manager uses a tree to manage mmap() allocated chunks. When you have lots of them (either because you do a lot of memory-mapped I/O, or because you have called malloc() for many small allocations that are all still alive and not available for re-use), each successive call to mmap() will get slower.
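
For reference, a minimal sketch of what malloc() effectively does under the hood for a large request on Linux: ask the kernel for an anonymous mapping. Nothing here is populated up front; pages are faulted in on first touch.

#include <cstddef>
#include <iostream>
#include <sys/mman.h>

int main()
{
    const std::size_t bytes = 4u * 1024 * 1024;

    // Request 4 MB of demand-zero memory directly from the kernel.
    void* block = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (block == MAP_FAILED) {
        std::cerr << "mmap failed\n";
        return 1;
    }

    // ... use the block; the first write to each page triggers a soft fault ...

    munmap(block, bytes);
}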

enum Bool { True, False, FileNotFound };

Zurtan said:
Once you read or write it with memset, it needs to be brought into the cache or have other memory management work done, which is O(n).

That again depends.

You talk about the cache. But cache is CPU dependent.

Here is an arbitrary selection, with chips pulled from the Wikipedia page rather than for popularity or anything. If you're dealing with a single 4 MB block, it may fit in various places. On a somewhat dated i3-2100 it won't fit in cache at all. On a little bigger and newer chip, perhaps an i5-4460, there's 6 MB of L3 cache, but you'll still only have 1 MB of L2 cache, so a good chunk will likely be evicted by other stuff if your computer is doing anything. Throw it on a bigger i7-7800 and the entire thing fits in L2 cache.

That alone will affect your second pass through the data dramatically. In one case it must do it, in another case it doesn't need to.
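
If you want to know where a given block lands on the machine you are actually running on, glibc exposes the cache sizes through sysconf(). A sketch; these _SC_LEVEL* names are a glibc extension, not standard POSIX, and may report 0 when the information is unavailable:

#include <iostream>
#include <unistd.h>

int main()
{
    // Print the data cache sizes reported by the C library, if it knows them.
    std::cout << "L1d: " << sysconf(_SC_LEVEL1_DCACHE_SIZE) << " bytes\n"
              << "L2:  " << sysconf(_SC_LEVEL2_CACHE_SIZE)  << " bytes\n"
              << "L3:  " << sysconf(_SC_LEVEL3_CACHE_SIZE)  << " bytes\n";
}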


You talk about O(n). There was an era --- perhaps ending in the early 1990s --- when working with a 4 MB chunk of memory was a massive task; these days that's a thumbnail image on a web page, trivially handled. But there is tremendous variability, a full ten orders of magnitude or more, before we even touch big O.

Even on a single computer there may be variations between runs. Maybe in one run the program already has all of the memory allocated to the process, so the OS memory allocator must do nothing at all, requiring a few nanoseconds to check a data structure. Another run may require an OS allocation that happens to be completely clean and ready to go. Running yet again may require wiping a chunk of virtual memory, which is slower again. Another run may need the system to swap memory out to disk first, and maybe it's an old HDD that needs to wake up and start spinning its platters before anything else can happen.

That is all before touching any big-O notation, all on a single machine with exactly the same hardware situation between runs. One takes nanoseconds where the other can potentially take multiple seconds.

Attempts to measure anything like this are highly dependent on details.

Here's more reading on that, from Intel, about some of the many details.



Okay, enough on the fact that it is highly variable to begin with.

Getting back to the original question:

Historically, both AMD and Intel have switched things up, sometimes using write-through and sometimes write-back. Current chips use write-back caches.

Usually these days, but not always, the memory your code works with uses a write-back cache. Note that both AMD and Intel --- and other chip makers --- use both for various purposes. Memory can be marked for different caching behaviors, but those are more exotic than the usual case.



For the task you described of multiple writes, the best way to take advantage of a write-back cache is to do multiple different writes to the same block of memory in the same cache line. If not in the same cache line, then at least in the same cache page in L1, L2, or L3.


When you write to a block of data, everything in the cache hierarchy must be marked as dirty. How long that takes depends tremendously on the system. The L1 data cache line is marked as dirty, the L1 page is marked as dirty, the L2 page is marked as dirty, and possibly that is propagated to other cores on the same die, possibly even to other chips. The L3 cache, if present on the chip, will also get marked, and propagated if multiple cores and multiple chips exist.

So that's one source of variability. To minimize it, stick within a single cache line and keep modifying it over and over. It stays on die and everything is very, very fast. Usually it takes essentially no time at all, but not necessarily, depending on what else is going on in the system at the time.

In recent years that cache line size has been 64 bytes. As long as you're writing repeatedly to the same 64-byte cache line, the first time you write you pay a cost to mark the assorted caches as dirty, then you can play with those 64 bytes all you want. However, the size has changed many times over the years, slowly growing larger from 32 bytes, and 16 bytes before that. In the future it will probably grow again to 128-byte cache lines, then 256, then bigger.

Moving from one cache line to the next (currently that means writing to the 65th byte) triggers all the same dirty marks. The cache line is marked as dirty, the rest of the L1 cache gets the line marked as dirty if it wasn't already, the L2 cache gets marked as dirty if it wasn't already, the L3 cache (if present) gets marked as dirty if it wasn't already, other cores on the die have the memory marked as dirty if necessary, and other CPUs on the motherboard get the memory marked as dirty if necessary. Since you were just modifying the earlier memory, those were probably already marked, but maybe you crossed a boundary so it needs to be propagated.
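
A tiny illustration of that boundary, assuming 64-byte lines (the constant here is an assumption; real code should query the CPU rather than hard-code it):

#include <cstddef>
#include <iostream>

const std::size_t kCacheLine = 64;   // assumed line size

// Which cache line does a byte offset fall in? Offsets 0..63 share line 0;
// offset 64 (the 65th byte) is the first write that dirties a new line.
std::size_t line_index(std::size_t byte_offset)
{
    return byte_offset / kCacheLine;
}

int main()
{
    std::cout << line_index(0)  << " "    // 0
              << line_index(63) << " "    // 0, same line
              << line_index(64) << "\n";  // 1, a new line gets dirtied
}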

Those boundary sizes are again dependent entirely on the machine. Beyond the 64-byte cache line, an L1 cache typically holds 32 or 64 or 128 KB or more, L2 caches come in assorted sizes ranging from roughly 32 KB to 1 MB depending on the chip, and L3 cache can hold many megabytes on some chips, with assorted boundaries depending on the hardware.

So just like the first time, you can write to both of those dirty cache lines all you want without worrying about performance. Writing to them is basically free. You can expand to a third cache line, a fourth cache line, and so on many times over, and as long as what you are writing to remains resident in the L1 data cache, writing to it is free.

This remains free until something gets evicted or synchronized.
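
A rough way to see that boundary between "free" and "paying for eviction" (a sketch; the 64-byte line size and the buffer size are assumptions, and the exact numbers will vary wildly by machine): do the same number of byte writes, once staying inside a single cache line and once striding one line per write across a buffer far larger than cache.

#include <chrono>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

int main()
{
    const std::size_t line = 64;                          // assumed cache line size
    const std::size_t iterations = 100000000;             // 100 million writes
    std::vector<std::uint8_t> buffer(64 * 1024 * 1024);   // 64 MB, larger than cache

    auto time_writes = [&](std::size_t stride, const char* label) {
        auto start = std::chrono::steady_clock::now();
        std::size_t offset = 0;
        for (std::size_t i = 0; i < iterations; ++i) {
            buffer[offset] = static_cast<std::uint8_t>(i);
            offset = (offset + stride) % buffer.size();   // stride 0 stays on one line
        }
        auto stop = std::chrono::steady_clock::now();
        std::cout << label << ": "
                  << std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count()
                  << " ms\n";
    };

    time_writes(0,    "same cache line every write");
    time_writes(line, "new cache line every write ");
    std::cout << static_cast<int>(buffer[0]) << "\n";     // keep the stores observable
}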

Data can be evicted for many reasons. Only a few are within your control.

It can be evicted because your program is accessing more memory, so something new displaces something that is old. You can control this by only accessing a tiny block of memory, never anything new. It's hard to keep your program fed with interesting data, but some programming tasks are compute intensive, not memory intensive.

It can be evicted because a different program is accessing more memory, displacing the old content with their content. The operating system controls this. Maybe your program is being swapped out so another program can run. Maybe another program in another thread on a hyperthreaded processor is pulling it in.

It can be evicted because memory is being synchronized with another part of the system. Usually this is the case if you're writing multithreaded programs or doing advanced parallel processing; that's far beyond beginner gamedev, but it is frequently done for some processing in advanced games.

Once it is evicted, the data is written out to L2, or L3, or main memory, or synchronized to other cores, or synchronized to other CPUs, and the free lunch is over. The cache gets marked as clean, and any future write must incur the costs.


Now back to your example of writing 4MB of memory, and joining it up with the first things I wrote in this rather long post.

When you use memset your code is rapidly traveling across 4 megabytes of memory. All of that gets loaded up, potentially paged in from disk into main memory, potentially loaded from main memory to L3, potentially loaded from L3 to L2, potentially loaded from L2 to L1, and modified to your value. Cache prediction algorithms --- which vary based on the processor --- attempt to prefetch the data to help minimize cache misses, but there will still be many thousands of cache misses as roughly 65,000 cache lines (4 megabytes / 64 bytes) get loaded, written to, and evicted.

Depending on your processor, running through that same 4 megabytes a second time will take some time. On the first processor I mentioned above, all the memory would have been evicted back to main memory by the time of the second run. So on that processor there is absolutely no benefit from a writeback operation. On the second processor I mentioned all the memory could still be in L3 cache (unless evicted for other reasons), so you get some benefit from the writeback operation, saving time based on your memory timing speed, likely on the order of 500 microseconds relative to the first processor across the entire 4 megabyte span. On the third processor I mentioned everything is still in L2 cache, so you get a much bigger benefit from the writeback operation, saving likely on the order of 10 milliseconds relative to the first processor.

And note that every processor is different; check the detailed CPU data sheets for any specific processor and you'll find notes about cache sizes, cache associativity, cache data latency, and the time required for data dependency chains. If it touches main memory like the first one, then you also need to consider the timings of the memory chips in addition to what's on the CPU.


So YES, on SOME (but not all) processors, using memset to write to a 4MB block of memory CAN (depending on details mostly outside the program's control) get SOME (varying depending on the chip) benefit from writeback caching. But as I wrote in my first reply, it is better if you don't do the work at all, because not doing work is free.

Assuming you must do the work, you will get far better results by actively using smaller memory blocks. The details will still vary tremendously by the specific CPU and the specific use pattern.
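
As one illustration of what "actively using smaller memory blocks" can look like (a sketch; the 256 KB chunk size is an arbitrary guess intended to sit comfortably inside a typical L2, and the two passes stand in for whatever real work you need to do):

#include <algorithm>
#include <cstddef>
#include <cstring>
#include <vector>

// Instead of making several full passes over a large buffer (each pass
// re-loading lines that were evicted by the previous one), do all the work
// on one modest chunk before moving on, so later passes hit cache lines
// that are still resident and dirty.
void process_in_chunks(std::vector<unsigned char>& data)
{
    const std::size_t chunk = 256 * 1024;   // assumed to fit in L2

    for (std::size_t begin = 0; begin < data.size(); begin += chunk) {
        const std::size_t end = std::min(begin + chunk, data.size());

        // First pass over this chunk: clear it while pulling it into cache.
        std::memset(data.data() + begin, 0, end - begin);

        // Second pass over the same chunk: these writes land on lines that
        // are still hot, which is where write-back caching actually pays off.
        for (std::size_t i = begin; i < end; ++i)
            data[i] = static_cast<unsigned char>(i);
    }
}

int main()
{
    std::vector<unsigned char> big(16 * 1024 * 1024);   // 16 MB, larger than most caches
    process_in_chunks(big);
}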

This topic is closed to new replies.
