« Project Server authors breaking radio silence | Main | The one best way to manage projects »

Do your own SEO - Using Excel and Grep to read log files

I'm always curious about how you arrived here. It seems that the majority find their way through a search on google. The reason I know this is because I take a look at the log files from my server. If someone got here by clicking on a link somewhere, that referring site shows up in the log. Each log entry looks something like this:

192.198.152.98 - - [21/Sep/2006:14:33:37 -0400] "GET /blog/archives/design/bob-colwell-on-computer-architecture.html HTTP/1.0" 200 12958 zo-d.com "http://www.google.com/search?q=bob+colwell&btnG=Search&hl= en&lr=" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.7) Gecko/20060909 Firefox/1.5.0.7" "unknown"

We can look at this and see that it the server is configured to write out in "combined log format". There are a few interesting things in this entry (and I'll skip the boring things) The first part is the IP address. Intel owns the IP addresses from 192.198.128.0 to 192.198.177.255 so I'm pretty confident that this is someone at Intel hitting the site. Moving to the right we see the file that was requested. And then skipping further right we see a URL which shows the referring site. In this case a google search for "Bob colwell".

Sometimes the referring URL tells us more. For example: http://www.google.de/search?q=%22Theory+of+Constraints%22&hl=de&lr=&start=90&sa=N tells us that it wasn't until the tenth page of results that this site showed up for a search on "Theory of Constraints". It also tells us that the person was searching on the German google site. Looking at this tells me that I should probably try harder if my goal is to be a useful and popular site for people looking for information about "Theory of Constraints".

The problem is that there are so many lines in the log file which don't contain referer data, or which tell me that the style sheet was used that it is difficult to pick out the interesting parts. The easiest way for me to do that is to have software do it for me. I use a this process. Save the log file to my computer as a text file. Previously I'd just open it in excel and then filter, but the weekly log is longer than excel can hangle, so the next step is filtering it by using grep. Note: Excel 2007 will solve this limitation by moving to a limit of 1 million rows - but it won't be out until next year) If you are using a mac or linux you can just open a terminal window, but for a windows box you may need to install windows grep or another program. Once you do that simply type something like: grep /search? logfile.txt > searches.txt. This finds all lines containing /search? in the file and then writes them to a file called searches.txt. The > tells it to write the result of the search into the file which comes after it.

That cuts the number of lines down to a manageable number and shows only those which match the characteristics of a search. Almost all search engines use the word "search" in their results, so this will also show yahoo and msn search results as well.

After this, just open the text file in excel. It should recognize it as a text file and open the conversion wizard. The lines are space delimited so check delimited and the space box and then you have all your data in excel to browse through or analyze.

This same technique can be used to filter out any sort of data in the file you like. Just change the parameters you are feeding to grep.

RELATED POSTS
  • VBA returns in Excel 2011 for Mac
  • Office VBA for Mac After 2008
  • Analyze Microsoft Project Resource Usage Data In Excel
  • Microsoft Project VBA - the Instr function
  • VBA Writing to a text file (MS Project, Excel)
  • MS Project VBA - Trim Function
  • Project Scheduling with Excel

  • About

    The previous article is Project Server authors breaking radio silence.

    The next article is The one best way to manage projects.

    Current articles are in the main index page and you can find a complete list of articles in the archives.

    Creative Commons License
    This weblog is licensed under a Creative Commons License.
    Powered by
    Movable Type 3.34