Tallan's Technology Blog

Tallan's Top Technologists Share Their Thoughts on Today's Technology Challenges

Analysis of a Winner!

James Curran

The Tallan CodeOff is an annual (*) event here at Tallan to challenge the development staff.  At the start of the contest, a simple programming task is emailed out to the participants, who then have an hour to complete it.  In the past, the winner was the first person to submit a correct entry.  This year, however, the contestants could take the full hour, and the winner would be whoever's entry ran the fastest.

The task for this year's contest was to take a directory full of HTML files, and strip out all unnecessary whitespace from them.  Per the HTML standard, in a run of contiguous whitespace only the first character is significant – the rest are ignored by the browser – so removing those extra characters beforehand saves disk space and shortens download time without having any effect on the appearance of the page.

The specific folder we were given to work on contained 1,017 files totaling 77 MB of text, with one file weighing in at 49 MB by itself.

My entry was the winner, converting all the files in just under 2 seconds, so I was asked to write up an explanation of it.

To start with, I should mention that my first personal computer (circa 1978) was a Radio Shack TRS-80, which came with a massive 16 KB (that's kilobytes with a K) of RAM (and that was the deluxe model!) and had a CPU which ran at 1.77 MHz, or about 1/2000th the speed of a modern Windows desktop. (**)  So, I was learning all the tricks for optimizing code for size and speed even before some of my coworkers were born.

Therefore, it somewhat pains me to offer the main piece of advice for this project: Forget the tricks, and trust the framework.

Getting into the specifics of it: before the start of the contest, we were all given a template to use, which had the code for getting a list of filenames to be processed (along with some housekeeping tasks), so all we had to do during the actual contest period was write a method which took that array of filenames, processed each file, and wrote the output to a different folder.

Mine looked like this:

private void CodeOff(IEnumerable<string> files)
{
    foreach (var pathname in files)
        ProcessFile(pathname);
}

That was simple enough.  Of course, the real work is done in ProcessFile() (but we're not done with the CodeOff() method just yet – we'll get back to it).

The most obvious solution is to use Regular Expressions:

Regex removeSpaces = new Regex(@"\s+");
private void ProcessFile(string pathname)
{
    using (var outfile = Output(Path.GetFileName(pathname)))
    {
        var lines = File.ReadAllText(pathname);
        var text = removeSpaces.Replace(lines, " ");
        outfile.Write(text);
    }
}
private StreamWriter Output(string filename)
{
    return new StreamWriter(Path.Combine(_outputFolder, filename));
}

The Regex pattern "\s+" says to seek out one or more instances ("+") of whitespace ("\s"), and the Replace() call swaps each match for a single space.  Now, this is a perfectly reasonable implementation, but not the fastest.  By its nature (the flexibility of the pattern being searched for), a Regex will almost always lose to a hand-crafted search algorithm.
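
A quick way to convince yourself the pattern does what it claims (a throwaway check, assuming a console app with System.Text.RegularExpressions imported):

var removeSpaces = new Regex(@"\s+");
// Tabs, newlines, and runs of spaces all collapse into one space:
Console.WriteLine(removeSpaces.Replace("hello \t\r\n   world", " "));
// prints: hello world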

Wait a minute…. Didn't I just say that the theme of this article is "Trust the Framework," and then, in the very first specific, I'm telling you NOT to use the framework?  Well, yes, but just in this case.  This is not to say that Regexes should be avoided in general.  The search pattern needs to get only slightly more complicated before crafting the search algorithm by hand becomes impractical (particularly when one is only given an hour to code it).  There's not much point taking several hours to write code which will only save you a couple of seconds.  (On the other hand, if you will be searching for a specific pattern a lot, such as in constantly incoming data in production, hard-coding the search – even a complex one – might be better than using a Regex.)

But before we move on to improving that routine, let's get back to our theme by looking at another line.  File.ReadAllText(pathname); is a simple .NET function which, given just a pathname, opens the file in text mode, reads the entire file into a string, and then closes the file – everything you need done with the input file, in one step.

Now, you might think that reading an entire file – which, as I've already said, could be as large as 49 MB in this contest – would lead to an out-of-memory error, and that I should be reading it in and processing it in segments.  But why?  In .NET, memory is managed and virtualized.  If we run out, the framework (and the OS) will automatically page it out to virtual memory (which is normally the hard disk, unless you are using ReadyBoost, in which case it'll be a faster SD card).  So, I could write my own scheme to read the file a segment at a time, or I can let the framework do the exact same thing.  The difference is that while I would have had only an hour to code it, there was probably a team at Microsoft who worked several months to several years to make these specific tasks as optimal as possible.
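
(For the record, the do-it-yourself segment scheme I was happy to skip would look roughly like this – a sketch only, with 64 KB as an arbitrary chunk size:)

// Reads a file in fixed-size chunks rather than all at once.
private void ProcessFileInSegments(string pathname, StreamWriter outfile)
{
    var buffer = new char[64 * 1024];
    using (var reader = new StreamReader(pathname))
    {
        int count;
        while ((count = reader.Read(buffer, 0, buffer.Length)) > 0)
        {
            // process buffer[0..count) here, remembering to carry the
            // lastSpace flag across chunk boundaries
        }
    }
}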

My next attempt at processing the files was only slightly longer, but significantly faster:

private void ProcessFile(string pathname)
{
    using (var outfile = Output(Path.GetFileName(pathname)))
    {
        var lines = File.ReadAllText(pathname);
        bool lastSpace = false;
        foreach (char c in lines)
        {
            if (Char.IsWhiteSpace(c))
                lastSpace = true;
            else
            {
                if (lastSpace)
                    outfile.Write(' ');
                outfile.Write(c);
                lastSpace = false;
            }
        }
    }
}

Here I manually step through each character in the file.  If the character is some form of whitespace, I just set a flag and ignore it.  If it's not whitespace, I write it out and reset the flag, stopping to write a single space first if the previous character was whitespace.

“But,” you say, “writing individual characters out to a file?  Isn’t that slow?  Shouldn’t you collect them in a StringBuilder first?”  Well, no, because I’m not writing them directly to the disk.  I’m writing to a StreamWriter, which is internally buffered.  It knows the properly sized block of data to write out to the disk far better than I do.
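
(If you did want a hand on that knob, StreamWriter has a constructor overload that takes an explicit buffer size – a sketch, with 64 KB as an arbitrary pick rather than a recommendation:)

private StreamWriter Output(string filename)
{
    return new StreamWriter(
        Path.Combine(_outputFolder, filename),
        false,                        // overwrite rather than append
        System.Text.Encoding.UTF8,
        64 * 1024);                   // explicit buffer size, in bytes
}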

But still there’s a way of squeezing a bit more speed out of this.  foreach is nice and neat and elegant, but it requires an IEnumerable object behind the scenes, which adds a tiny bit of overhead.  Coding it up with a for() loop instead cut the time a bit.

private void ProcessFile(string pathname)
{
    using (var outfile = Output(Path.GetFileName(pathname)))
    {
        var lines = File.ReadAllText(pathname);
        bool lastSpace = false;
        for (int i = 0; i < lines.Length; ++i)
        {
            if (Char.IsWhiteSpace(lines[i]))
                lastSpace = true;
            else
            {
                if (lastSpace)
                    outfile.Write(' ');
                outfile.Write(lines[i]);
                lastSpace = false;
            }
        }
    }
}

Two things to keep in mind here.  First, I greatly prefer using the foreach – I think it's a much clearer design (others disagree and prefer for() for the same reason) – and the difference is minor (a barely noticeable 10% in this case), so in most cases I'd stick with the foreach in my day-to-day programming.  Second, indexed array access ("lines[i]") is normally bounds checked.  With two indexed accesses inside that loop needing to be checked, this algorithm would be slower – except that Microsoft realized we often index through an entire array in a for loop, and wrote a special-case optimization just for that.  If a for() statement is written exactly like that (ranging from 0 to Length), then bounds checking is turned off inside the body of the loop.  Note, however, that any deviation from that exact pattern will not invoke the special case, and the bounds checks will be performed.  For example, "int len = lines.Length; for (int i = 0; i < len; ++i)" would be bounds checked.
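
If you'd like to check the foreach-versus-for gap on your own machine, a quick Stopwatch micro-benchmark along these lines will do (a rough sketch – the exact numbers will vary with the JIT and the hardware):

using System;
using System.Diagnostics;

class LoopTiming
{
    static void Main()
    {
        var lines = new string('x', 50000000);   // stand-in for a large file
        int count = 0;

        var sw = Stopwatch.StartNew();
        foreach (char c in lines)
            if (!Char.IsWhiteSpace(c))
                ++count;
        Console.WriteLine("foreach: {0} ms", sw.ElapsedMilliseconds);

        sw = Stopwatch.StartNew();
        for (int i = 0; i < lines.Length; ++i)
            if (!Char.IsWhiteSpace(lines[i]))
                ++count;
        Console.WriteLine("for:     {0} ms", sw.ElapsedMilliseconds);

        Console.WriteLine(count);   // keep 'count' live so the loops aren't optimized away
    }
}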

And, with that, it seemed I'd wrung every inefficiency out of the basic algorithm, and that was my winning entry.  However, during the conference call that was going on concurrently with the contest, some contestants mentioned that they had coded multi-threaded solutions – something that hadn't occurred to me to attempt, since we were working under a time limit.  But after the contest was over, I had a train ride home with a laptop and nothing else to do, so I figured I'd give it a go.

The simplest entry point for multi-threading would be to give each invocation of ProcessFile() its own thread in CodeOff() (told you we'd come back to that method).

My first attempt was to use PLINQ, which involved simply tacking .AsParallel() onto the foreach:

private void CodeOff(IEnumerable<string> files)
{
    foreach (var pathname in files.AsParallel())
        ProcessFile(pathname);
}

I never quite understood how this was supposed to work, and in this case, it didn't – I saw no difference in the run-times.  (The catch, it turns out, is that AsParallel() only parallelizes the query's own operators; a foreach still pulls the results back and runs its body on the single calling thread.)
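
For completeness, the PLINQ operator that does push the per-file work onto the worker threads is ForAll() – a sketch:

private void CodeOff(IEnumerable<string> files)
{
    // ForAll runs the action on the PLINQ worker threads themselves,
    // rather than merging results back to one consuming thread.
    files.AsParallel().ForAll(ProcessFile);
}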

OK, it’s going to make me work for this.  No problem.  There are means which involve only slightly more typing.  Since the task I wanted to run on separate threads was already a single method, which took a single parameter and returned nothing, the ThreadPool seemed like a good idea.

private void CodeOff(IEnumerable<string> files)
{
    foreach (var pathname in files)
    {
        ThreadPool.QueueUserWorkItem(p => ProcessFile(p as string), pathname);
    }
    WaitForThreadPool();
}

My original attempt using the ThreadPool differs from that code in two ways.  First, instead of using the lambda to call ProcessFile, I called it directly ("ThreadPool.QueueUserWorkItem(ProcessFile, pathname);").  The problem with that is that QueueUserWorkItem insists that the method called accept an object parameter, and ProcessFile took a string.  That meant I had to rewrite ProcessFile to accept an object, and then cast it to a string.  This seemed very messy, and the lambda had no noticeable effect on the run-time, so I went with the lambda.
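
For comparison, the no-lambda route would have meant giving ProcessFile an object overload and doing the cast inside it – a sketch of the messiness I was avoiding:

// QueueUserWorkItem's callback must match WaitCallback, i.e. void (object):
private void ProcessFile(object state)
{
    ProcessFile((string)state);   // cast, then forward to the real string version
}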

However, the bigger change was adding the call to WaitForThreadPool().  When I first ran the ThreadPool version of the program, I got very good times – dropping from 12 seconds to 1.5 seconds (I was running these on my apparently very slow client-issued laptop).  I was just about to loudly pronounce victory when I realized it wasn't saying "1.5 seconds", but actually "1.5 milliseconds".  Clearly, something was wrong.  Which brings us to one of the major liabilities of the ThreadPool – it's pretty much "Fire & Forget".  Once a task is queued on the thread pool, there's no direct way to get any information from (or about) it – including when it's complete.  I was queuing up a thousand tasks, and then not waiting for any of them to actually run.  A quick Google search gave me the code for WaitForThreadPool() (which I ended up rewriting a bit):

private static void WaitForThreadPool()
{
    int maxThreads;
    int placeholder;
    int availThreads;
    ThreadPool.GetMaxThreads(out maxThreads, out placeholder);

    do
    {
        Thread.Sleep(TimeSpan.FromMilliseconds(100));
        ThreadPool.GetAvailableThreads(out availThreads, out placeholder);
    } while (availThreads < maxThreads);
}

This should be fairly self-explanatory.  We get the maximum number of threads the pool will use, and sleep until the number of available threads climbs back up to that maximum, which will happen when there's no more work to do.
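
Polling isn't the only option, though.  On .NET 4, a tidier alternative is to count the queued items down with a CountdownEvent – a sketch, assuming the same ProcessFile as above (and System.Threading and System.Linq in scope):

private void CodeOff(IEnumerable<string> files)
{
    var fileList = files.ToList();
    using (var done = new CountdownEvent(fileList.Count))
    {
        foreach (var pathname in fileList)
        {
            ThreadPool.QueueUserWorkItem(p =>
            {
                try { ProcessFile((string)p); }
                finally { done.Signal(); }    // count down even if a file throws
            }, pathname);
        }
        done.Wait();    // blocks until every queued item has signaled
    }
}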

And with that, the run dropped to about half its previous time.  (I still haven't got an official timing on the PC used for the other benchmarks.)


(*) Well, I think it’s an annual event.  I’ve worked here nine months, and the last one was eight months ago….

(**) And that was 8 years before I got a PC with a hard disk.  I presently have over 100 MP3 files which are too big to fit on that disk.
