APLawrence - Information and Resources for Unix and Linux Systems, Bloggers and the self-employed



Analyzing web logs with grep


2008/01/17



If you have a website, your logs are a great source of information. For complete, in-depth analysis, you can't beat Google Analytics, but you can get a lot of quick and useful stats from the command line with just "grep".

Please don't tell me your website isn't running on Linux or BSD or that you don't have shell access. It is completely unnecessary and foolish to be running a website on Microsoft, and if your problem is that your host won't give you shell access, you need to find another host.

Let's start with the simplistic approach: I want to extract log entries for yesterday's "There's something about a Muntz TV" post. Obviously "grep" can do that; I just cd over to my logs directory and run:

grep muntz_tv.html access_log
 

But that gives me too much:

72.74.93.123 - - [16/Jan/2008:23:06:11 +0000] "GET /foo-web/muntz_tv.html
HTTP/1.1" 200 21849 "-" "Mozilla/5.0 (Macintosh; U; Intel Mac OS
X; en-US; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11"
168.75.65.67 - - [16/Jan/2008:23:06:12 +0000] "GET /foo-web/muntz_tv.html
HTTP/1.1" 200 21855 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows
NT 5.0;.NET CLR 1.0.3705; ContextAd Bot 1.0)"
66.249.72.165 - - [16/Jan/2008:23:06:13 +0000] "GET /foo-web/muntz_tv.html
HTTP/1.1" 200 21868 "-" "Mediapartners-Google"
72.74.93.123 - - [16/Jan/2008:23:06:15 +0000] "GET /foo-web.js
HTTP/1.1" 200 3590 "http://aplawrence.com/foo-web/muntz_tv.html"
"Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.1.11)
Gecko/20071127 Firefox/2.0.0.11"
72.74.93.123 - - [16/Jan/2008:23:06:15 +0000] "GET /images/147.jpg
HTTP/1.1" 200 2288 "http://aplawrence.com/foo-web/muntz_tv.html"
"Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.1.11)
Gecko/20071127 Firefox/2.0.0.11"
..
 

For one thing, one of those IP addresses is mine. For another, I'm not interested in lines like "GET /foo-web.js": those are other files (scripts, images) that the browser loads along with the page. So my next attempt improves things a bit:

grep "muntz_tv.html HTTP"  access_log | sed /72.74.93.123/d
 

The addition of " HTTP" eliminates everything that wasn't a GET of the page itself, and the "sed" deletes lines containing my IP address from the results (not my actual IP, by the way - just for illustrative purposes).

168.75.65.67 - - [16/Jan/2008:23:06:12 +0000] "GET /foo-web/muntz_tv.html
HTTP/1.1" 200 21855 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows
NT 5.0;.NET CLR 1.0.3705; ContextAd Bot 1.0)"
66.249.72.165 - - [16/Jan/2008:23:06:13 +0000] "GET /foo-web/muntz_tv.html
HTTP/1.1" 200 21868 "-" "Mediapartners-Google"
128.30.52.13 - - [16/Jan/2008:23:09:07 +0000] "GET /foo-web/muntz_tv.html
HTTP/1.1" 200 22175 "-" "W3C_Validator/1.575"
128.30.52.13 - - [16/Jan/2008:23:09:32 +0000] "GET /foo-web/muntz_tv.html
HTTP/1.1" 200 22428 "-" "W3C_Validator/1.575"
128.30.52.13 - - [16/Jan/2008:23:10:37 +0000] "GET /foo-web/muntz_tv.html
HTTP/1.1" 200 22346 "-" "W3C_Validator/1.575"
128.30.52.13 - - [16/Jan/2008:23:11:47 +0000] "GET /foo-web/muntz_tv.html
HTTP/1.1" 200 22056 "-" "W3C_Validator/1.575"
128.30.52.13 - - [16/Jan/2008:23:13:14 +0000] "GET /foo-web/muntz_tv.html
HTTP/1.1" 200 22004 "-" "W3C_Validator/1.575"
..
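
The same filter can also be written with "grep" alone. This is only a sketch, not the method used above: the sample log lines are abbreviated from the excerpts in this article and the temporary file stands in for access_log. The -F option treats the pattern as a fixed string, so the dots in the file name and in the IP address are not regular-expression wildcards, and -v prints the lines that do not match.

```shell
# Build a small sample log (abbreviated lines from the article's excerpts).
log=$(mktemp)
cat > "$log" <<'EOF'
72.74.93.123 - - [16/Jan/2008:23:06:11 +0000] "GET /foo-web/muntz_tv.html HTTP/1.1" 200 21849 "-" "Mozilla/5.0"
168.75.65.67 - - [16/Jan/2008:23:06:12 +0000] "GET /foo-web/muntz_tv.html HTTP/1.1" 200 21855 "-" "Mozilla/4.0"
66.249.72.165 - - [16/Jan/2008:23:06:13 +0000] "GET /foo-web/muntz_tv.html HTTP/1.1" 200 21868 "-" "Mediapartners-Google"
72.74.93.123 - - [16/Jan/2008:23:06:15 +0000] "GET /foo-web.js HTTP/1.1" 200 3590 "http://aplawrence.com/foo-web/muntz_tv.html" "Mozilla/5.0"
EOF

# Keep requests for the page itself, then drop lines containing my own address.
# -F: fixed-string match; -v: print only non-matching lines.
hits=$(grep -F 'muntz_tv.html HTTP' "$log" | grep -v -F '72.74.93.123 ')
echo "$hits"
rm -f "$log"
```

The trailing space in '72.74.93.123 ' keeps the pattern from also matching a longer address such as 72.74.93.1234 would if it could exist; it is a cheap habit rather than a necessity here.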
 

Of course I could pipe that through "wc -l" to get a count:

$ grep "muntz_tv.html HTTP"  access_log | sed /72.74.93.123/d | wc -l
  88
 

But that also includes a lot of repeated IP addresses. Those are easy to get rid of:

$ grep "muntz_tv.html HTTP"  access_log | sed '/72.74.93.123/d;s/- - .*//' | sort -u |  wc -l
  59
 

Adding "s/- - .*//" to the "sed" eliminates everything on the line except the leading IP address, and sending that through "sort -u" eliminates duplicates.
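
If you would rather see how many requests each address made, not just how many distinct addresses there were, "uniq -c" can tally them. A minimal sketch under the same assumptions as the commands above; the sample lines are abbreviated from this article's excerpts, and "awk" is used here in place of "sed" to pull out the first field.

```shell
# Sample log: abbreviated lines from the article's excerpts.
log=$(mktemp)
cat > "$log" <<'EOF'
168.75.65.67 - - [16/Jan/2008:23:06:12 +0000] "GET /foo-web/muntz_tv.html HTTP/1.1" 200 21855 "-" "Mozilla/4.0"
66.249.72.165 - - [16/Jan/2008:23:06:13 +0000] "GET /foo-web/muntz_tv.html HTTP/1.1" 200 21868 "-" "Mediapartners-Google"
128.30.52.13 - - [16/Jan/2008:23:09:07 +0000] "GET /foo-web/muntz_tv.html HTTP/1.1" 200 22175 "-" "W3C_Validator/1.575"
128.30.52.13 - - [16/Jan/2008:23:09:32 +0000] "GET /foo-web/muntz_tv.html HTTP/1.1" 200 22428 "-" "W3C_Validator/1.575"
128.30.52.13 - - [16/Jan/2008:23:10:37 +0000] "GET /foo-web/muntz_tv.html HTTP/1.1" 200 22346 "-" "W3C_Validator/1.575"
EOF

# awk prints only the first field (the client address).
# sort groups identical addresses, uniq -c prefixes each with its count,
# and sort -rn lists the busiest address first.
per_ip=$(grep -F 'muntz_tv.html HTTP' "$log" | awk '{print $1}' | sort | uniq -c | sort -rn)
echo "$per_ip"
rm -f "$log"
```

A list like this makes it obvious when one visitor (here the W3C validator) accounts for most of the hits.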

What if we want to know how many of those people visited other pages? That's easy enough:

grep "HTTP.*muntz_tv.html" access_log
 

is the base of it, because that gives us lines where "muntz_tv.html" is in the Referrer field: the visitor clicked on some link on that page. However, I need more. I'll still need to delete my own IP address and get rid of duplicates, but I also need to get rid of lines like this:

66.199.242.10 - - [17/Jan/2008:13:31:17 +0000] "GET /foo-web/muntz_tv.html
HTTP/1.0" 200 20384 "http://aplawrence.com/foo-web/muntz_tv.html"
"Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90)"
 

That's someone who did a GET of the page from the page - in other words, a reload. To eliminate those, I add another "sed":





grep "HTTP.*muntz_tv.html" access_log | sed '/muntz_tv.html HTTP/d'
 

Putting it all together:

grep "HTTP.*muntz_tv.html" access_log | sed '/muntz_tv.html HTTP/d;/72.74.93.123/d' | grep "html HTTP"
 

That gives us the visitors who went to another page from there. I can also pipe that through "sed 's/- - .*//'", "sort -u" and "wc -l" to count them.
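
For a stricter match, "awk" can split each line on double quotes and test the request and referrer fields separately instead of relying on where "HTTP" falls in the line. This is a sketch that assumes the combined log format shown in the excerpts, with no embedded quotes in the fields. The 192.0.2.10 address and the other_page.html and third_page.html names are made up for the demonstration.

```shell
# Sample log: a reload, an image fetch from my own address, and two
# hypothetical follow-on page views from a made-up visitor.
log=$(mktemp)
cat > "$log" <<'EOF'
66.199.242.10 - - [17/Jan/2008:13:31:17 +0000] "GET /foo-web/muntz_tv.html HTTP/1.0" 200 20384 "http://aplawrence.com/foo-web/muntz_tv.html" "Mozilla/4.0"
72.74.93.123 - - [16/Jan/2008:23:06:15 +0000] "GET /images/147.jpg HTTP/1.1" 200 2288 "http://aplawrence.com/foo-web/muntz_tv.html" "Mozilla/5.0"
192.0.2.10 - - [17/Jan/2008:13:40:02 +0000] "GET /foo-web/other_page.html HTTP/1.1" 200 1234 "http://aplawrence.com/foo-web/muntz_tv.html" "Mozilla/5.0"
192.0.2.10 - - [17/Jan/2008:13:41:30 +0000] "GET /foo-web/third_page.html HTTP/1.1" 200 1234 "http://aplawrence.com/foo-web/muntz_tv.html" "Mozilla/5.0"
EOF

# With '"' as the field separator, $2 is the request and $4 is the referrer.
# Keep .html requests referred by the Muntz page that are not the page itself
# (which drops reloads), skip my own address, and print the client address.
movers=$(awk -F'"' -v me='72.74.93.123' '
    $4 ~ /muntz_tv\.html/ && $2 ~ /\.html HTTP/ && $2 !~ /muntz_tv\.html/ {
        split($1, parts, " ")
        if (parts[1] != me) print parts[1]
    }' "$log" | sort -u)
echo "$movers"
rm -f "$log"
```

Piping the result through "wc -l" gives the number of distinct visitors who clicked through.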

How about getting the visitors who came to a page here from another page here? That sounds a little tricky, but it's actually not too bad:


grep "HTTP.*aplawrence.com" access_log | grep "html HTTP" | sed '/72.74.93.123/d'
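
A small extension of the same idea is to tally which pages on the site send the most internal traffic. Again only a sketch, assuming the combined log format; the 192.0.2.x addresses and the index.html, other_page.html and third_page.html names are invented sample data.

```shell
# Sample log with made-up internal referrals.
log=$(mktemp)
cat > "$log" <<'EOF'
72.74.93.123 - - [16/Jan/2008:23:06:11 +0000] "GET /foo-web/muntz_tv.html HTTP/1.1" 200 21849 "http://aplawrence.com/index.html" "Mozilla/5.0"
168.75.65.67 - - [16/Jan/2008:23:06:12 +0000] "GET /foo-web/muntz_tv.html HTTP/1.1" 200 21855 "-" "Mozilla/4.0"
192.0.2.10 - - [17/Jan/2008:13:40:02 +0000] "GET /foo-web/other_page.html HTTP/1.1" 200 1234 "http://aplawrence.com/foo-web/muntz_tv.html" "Mozilla/5.0"
192.0.2.11 - - [17/Jan/2008:13:41:30 +0000] "GET /foo-web/third_page.html HTTP/1.1" 200 1234 "http://aplawrence.com/foo-web/muntz_tv.html" "Mozilla/5.0"
192.0.2.12 - - [17/Jan/2008:13:45:10 +0000] "GET /foo-web/muntz_tv.html HTTP/1.1" 200 21849 "http://aplawrence.com/foo-web/other_page.html" "Mozilla/5.0"
EOF

# Drop my own address, then split on double quotes:
# $2 is the request, $4 is the referrer.
# Keep page requests referred from this site and count each referring page.
referrers=$(grep -v -F '72.74.93.123 ' "$log" |
    awk -F'"' '$4 ~ /aplawrence\.com/ && $2 ~ /\.html HTTP/ {print $4}' |
    sort | uniq -c | sort -rn)
echo "$referrers"
rm -f "$log"
```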
 

As you can see, simple tools can give useful information very quickly.









Fri Jan 18 23:44:12 2008: Subject:   BigDumbDinosaur
http://bcstechnology.net

Summed up, don't send out a battleship to sink a rowboat!

I see a frequent tendency on the part of youngish computer jocks to reach for the big, complicated tools to do the small, simple jobs. It must be some sort of Windows-induced mental disease.

Programs like grep, sed and awk, joined together with simple shell constructs, have been around for decades and are very good at what they do. Scanning the access log for interesting facts is a job that is tailor-made for these ancient and honorable UNIX tools. And, best of all, they're all supplied free of charge.
