Dotnet Web Crawler Speedup

I’m writing a web crawler in C#, and getting it to perform well was really annoying.
I tried simply using ThreadPool.QueueUserWorkItem() to queue up my requests to multiple threads. Each thread just ran WebClient.DownloadString().
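A minimal sketch of that first approach, under some assumptions: the class name CrawlerSketch, the DownloadAll method and its optional fetch parameter are made up for illustration and are not the original crawler code. Only ThreadPool.QueueUserWorkItem() and WebClient.DownloadString() come from the description above. The fetch hook is there only so the download call can be swapped out without touching the network.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Net;
using System.Threading;

static class CrawlerSketch
{
    // Queues one thread-pool work item per URL and blocks until all have finished.
    // 'fetch' is a hypothetical hook; when omitted, each work item calls
    // WebClient.DownloadString() on its own WebClient instance.
    public static IDictionary<string, string> DownloadAll(
        IEnumerable<string> urls, Func<string, string> fetch = null)
    {
        if (fetch == null)
        {
            fetch = url =>
            {
                using (var client = new WebClient())
                {
                    return client.DownloadString(url);
                }
            };
        }

        var pages = new ConcurrentDictionary<string, string>();
        var urlList = new List<string>(urls);
        if (urlList.Count == 0)
        {
            return pages;
        }

        // The countdown reaches zero once every work item has signalled.
        using (var done = new CountdownEvent(urlList.Count))
        {
            foreach (var url in urlList)
            {
                ThreadPool.QueueUserWorkItem(state =>
                {
                    var target = (string)state;
                    try
                    {
                        pages[target] = fetch(target);
                    }
                    catch (WebException)
                    {
                        // A page that fails to download is skipped in this sketch.
                    }
                    finally
                    {
                        done.Signal();
                    }
                }, url);
            }

            done.Wait();
        }

        return pages;
    }
}
```

Calling CrawlerSketch.DownloadAll(urls) with a list of page addresses returns a dictionary from URL to page body. Timing that call is one way to get a pages-per-second figure.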

While the threads did run in parallel, the downloads themselves didn’t seem to: WebClient appeared to have some inherent lock.
I tried messing with the ConnectionManagementSection, but that turned out to be read-only.
After some Googling, I found that the configuration can only be changed by modifying the machine.config or user.config files! Seems pretty stupid to me.
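For illustration, the connection limit in a .NET Framework config file is expressed through the connectionManagement element under system.net. The fragment below is only a sketch: the wildcard address and the value 20 are arbitrary assumptions, not the settings from this crawler.

```xml
<configuration>
  <system.net>
    <connectionManagement>
      <!-- address="*" applies the limit to every host; 20 is an arbitrary example value -->
      <add address="*" maxconnection="20" />
    </connectionManagement>
  </system.net>
</configuration>
```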

That didn’t work either. Eventually I found this code, which helped me through. I still don’t know exactly why WebClient.DownloadString() doesn’t work, but after some tweaking I got to about 2.5 pages per second. Still not top speed, but way better than the 0.5 pages/second I started with.