Matt Godbolt’s blog

Coding

Another weekend and some more data recovery

This weekend:

So, all in all, pretty damn funky weekend.

The data recovery was achieved in the end by booting a ‘backup‘ Windows 2000 ISO on QEMU, and then buying R-Studio Data Recovery and running that on the data. A bit of a disappointment, but I ran out of time either rolling my own or getting either ScroungeNTFS or testdisk to work.

As I write this I’m now waiting for a 150GB file copy from the virtual disk to a cleanly formatted 1TB drive to give to my Dad. Because I’m an idiot, I originally recovered the data onto the virtual disk, and not the final destination.

To add extra fun, rather than copy the data out somehow from the virtual booted Windows 2000, I’m copying using:

Filed under: Coding Blog

Posted at 22:20:00 BST on 18th August 2008.


Comments on hacking about with stacks

One day I’ll update this blog to support comments, so I don’t have to post a new article when I get some really smart emails. One such came in last night from Justin Fletcher, on my stack post.

I hope Justin’ll forgive me for posting his email in bits, with some thoughts of mine interspersed:

The allocation of pages only on access is a great way of saving memory that might not actually be needed by the process, especially on systems where the allocation of memory would be costly due to synchronisation issues (I’m thinking of earlier ARMs specifically here, where the cost of the cache and TLB flush as excessive compared to allocating pages immediately the program says it needs it — also a multi-processor system will need to synchronise its memory use if they are using a shared memory model).

Something I hadn’t considered — the cost of synchronising across multiple processors. From what I can gather Linux has pretty lightweight locking for this, but it is a consideration. A side point that occurs to me here is how multiple threads’ stacks are arranged in memory. How does the operating system prevent threads’ stacks from overrunning into each other’s memory areas? When I get more time I’ll have to look into it.

In the Windows case, probing the memory before use has a number of effects:

  • A failure can be detected (probably fatally and non-recoverably) at the point the potential allocation occurs. This might be deferred from the point of declaration, depending on the intelligence of the compiler (and I’ve seen their compiler do some nice things). This earlier indication of the failure can be seen as being more useful in many cases, particularly if you can guarantee it. For example if you had code that set up a transaction with a piece of hardware then you would want your code to crash before you put it into a state that you couldn’t recover from. (take, for example, the case of changing the internal memory map of a device to perform a transient, but important, operation). Same, of course, goes for non-hardware examples.

The _chkstk function throws a Stack Overflow exception at the point of the stack-based allocation — a hard stack limit check anyway. There’s nothing smart in the stack probe itself, the probes are literally just a read at each page, no clever kernel code is called. I would imagine that if one of the probes fails, then the process would be put to sleep until a page is available, just as if memory were exhausted for any other reason.

  • It increases page table churn. Your page tables get re-written a lot as the new pages are made available to the application even if it doesn’t use them. This may also mean in-use memory being paged out for just this allocation. Which the application might not use. Allocations from the heap might not do the probe, so might not matter.

From what I’ve seen of the heap code on Windows (and I’ve spent rather too long debugging heap issues), no such probe is made for allocations. This makes a lot of sense as it’s more often the case that you’ll do a very big heap allocation than you’ll make a big stack allocation. The cost of paging in and committing resources to feasibly hundreds of megabytes of the heap would seem too much.

  • If the page is probed then it’s got to have been cleared by the system as well, so not only have you allocated the memory by probing it, but you’ve also caused the system to clear it as well. For security, obviously — you don’t want your stack to contain data from other processes.

An incredibly important point! The memory would be cleared on demand as it was paged in anyway (assuming it was actually used) but it does front-load the cost at the point of the stack allocation. Also interesting: some modern processors have some fast ways of clearing memory in a cache efficient way.

  • You always know how much memory the application is actually using — it can’t fail because parts of the stack can’t be allocated at some random point, only if the entire allocation failed.

Again, on Linux and Windows I think the process is put to sleep if there are no free page frames; but on an embedded system with no backing store this would definitely be the case.

With the on-demand (non-probed) stack system, you have a few issues:

  • You never know if the space set aside by the application is actually used. This isn’t a problem per se, but it does make internal metering of the application harder — valgrind probably tells you more about this, though so it’s probably not a big deal.

Justin’s not wrong here, it’s not a problem as such. That being said I wonder how many programmers (myself included…) have written something like:

char buf[64*1024];
sprintf(buf, "file-%04d.dat", num);
  • Stack allocation failures go unnoticed until the stack is used. This is a serious problem because it makes your application completely non-deterministic in a low memory situation for a language feature. If you’ve allocated your memory (char buffer[bigsize]) then you expect it to be there, pretty much. Unless you explicitly write to every single byte of the buffer (unless you know the page allocation size — and remember that it is the prerogative of the operating system to change its own internal allocation size for pages as it deems fit — there may be different page sizes in use depending on the usage pattern of the application) you can’t be sure that access to those pages won’t fail. It’s not unreasonable for the developer to assume that just because the memory has been allocated that accessing it should work.

As above, this is an embedded system issue I think.

Many thanks to Justin for taking the time to reply. His embedded systems point of view has caused a lot of further thinking on my part

Filed under: Coding

Posted at 22:41:23 BST on 4th August 2008.


Hacking about with stacks

A stack of plates

At the most recent Google London Open Source Jam, I got chatting to a Googler about how operating systems deal with the processor stack. Specifically, what’s the largest object you can allocate on the stack. Not just the maximum stack limit itself, but the implementation of modern operating systems might limit the largest object that can be made on the stack. Take, for example, this code snippet:

void Function() {
  char buffer[32 * 1024];
  buffer[0] = 0; // uh-oh?
  ...
}

We were discussing whether this would work ok on all platforms. Well, why wouldn’t it?

Well, on some operating systems, pages of memory (4KB each on x86) aren’t actually allocated to a process until the last possible moment. This improves performance and in general means more physical memory is available to the system as a whole. This is achieved by the system marking the pages as “not present” in the memory management unit, and then catching the page faults which occur when the program attempts to access them. At this point the operating system actually goes and finds a spare page, gives it to the process, and restarts the process.

So far so good — but how does the system discriminate between an invalid memory access (a programming error), from a stack access just below the current bottom page of the stack?

In order to investigate this, I wrote some experimental code, and tested it on Windows XP and Linux 2.6.18. The code uses two ways to see how far it can poke under the stack: One using large stack-based character arrays, and secondly using assembly-level instructions to directly poke below the stack (which is pretty naughty, and not “supposed” to work.)

The main conclusion: On both Windows and Linux, the original code above is totally valid.

On Windows, whenever the compiler spots a large amount of stack is needed, instead of simply subtracting from the stack pointer, it calls a function _chkstk, passing how much stack it needs. This function probes the memory starting at the current stack pointer and stepping back one page at a time until it has walked the stack pointer down by the given amount. Each probe is a read operation that might fault, and means Windows ensures there’s a page of memory ready for each page in the stack.

On Linux, the kernel treats any access within 32 bytes below the current stack pointer as being valid. The 32-byte threshold is to deal with the fact that some of the x86 instructions (PUSH etc) write to memory before they update the stack pointer. (Having the source available makes this an easier diagnosis than on Windows.)

My experiment also revealed the default maximum stack size on Windows is 1MB, and on Linux it’s at least 4MB. Windows pre-allocates about 64KB of stack pages, and Linux seems to allocate about 128KB.

A surprising discovery for me was how smart the Microsoft compiler was at optimising the code! It took several iterations before I got it to not optimise out the calls to my “stack eater” function, and then for it not to optimise the usage of the stack so only one of my large stack buffers was ever used.

In conclusion then: Using large stack-based buffers is safe on both Windows and x86 Linux. On Windows, the compiler protects you, and forcibly probes every page so they’re all allocated by the operating system. On Linux, the operating system uses the stack pointer register to work out whether an access is valid, and will allocate pages to the stack on demand.

My gut feeling is that Linux’s way is slightly better, as it defers the page allocations until they’re actually needed.

Filed under: Coding

Posted at 23:59:01 BST on 2nd August 2008.


How to make a pure virtual call

If you’re staring at the run-time error “Pure virtual function call” you might be wondering what’s happened. Usually, this is due to calling pure virtual functions from a constructor or destructor. Consider the following code:

Yes, another totally contrived example.

class AllocBase
{
public:
  AllocBase() : allocated_(false) {}
  virtual ~AllocBase() { Shutdown(); }

  void Initialise() {
    if (!allocated_) {
      Alloc();
      allocated_ = true;
    }
  }

  void Shutdown() {
    if (allocated_)
      Dealloc();
    allocated_ = false;
  }

  // Overridden by subclasses:
  virtual void Alloc() = 0;
  virtual void Dealloc() = 0;

private:
  bool allocated_;
};

class AllocClient : public AllocBase
{
public:
  AllocClient() : data_(0) {}
  virtual void Alloc() { data_ = new char[10]; }
  virtual void Dealloc() { delete[] data_; }

private:
  char* data_;
};

int main(int, char**)
{
  AllocClient alloc_client;
  alloc_client.Initialise();

  return 0;
}

Looks fairly innocuous at first glance — the destructor calls Shutdown() if it hasn’t already been called. Shutdown() itself isn’t virtual, but it does call through to the pure virtual Dealloc(). This is where the error lies. During construction and destruction, virtual functions are prohibited. To see why, consider the case of destruction of the AllocClient object. The compiler must runs code similar to:

obj->~AllocClient(); // call the AllocClient's destructor
obj->~AllocBase(); // call the base class's destructor

So by the time ~AllocBase() is called, the AllocClient aspect of the object has already been destroyed. Any virtual calls it makes would potentially run code defined in the AllocClient part of the object — which is doomed to fail as it will expect the AllocClient members to be in a usable state.

So how does this generate a pure virtual function call error? Well, what the compiler usually actually runs is:

obj->~AllocClient(); // call the AllocClient's destructor
// Ensure any virtual function calls in ~AllocBase route to
// AllocBase functions, and not any derivee's
obj->__vtable__ = AllocBase::__vtable__;
obj->~AllocBase(); // call the base class's destructor

From here it’s easy to see how the pure virtual calls come about.

In some cases the compiler will catch these at compile time — if you directly call a virtual function in the destructor or constructor, for example. However, GCC and Microsoft’s compiler aren’t smart enough to notice nested calls (and in general it would be extremely hard to prove they actually get called).

Filed under: Coding

Posted at 13:50:00 BST on 23rd July 2008.