What is the Apache access log?  Why do you need it?

Let me share a story.

I remember starting my first blog years and years ago. I paid for hosting and then installed a (much younger) version of WordPress on it.

For a while, I blogged into the void with nobody really paying attention. Then I started to get some comments. A trickle at first, and then a flood.

I was excited until I realized that they were all suspiciously vague and often non-sequiturs. “Super pro info site you have here, oPPS, I HITTED THE CAPSLOCK KEY.” And these comments tended to link back to what I’ll gently say weren’t the finest sites the internet had to offer.

Yep. Comment spam.

Somewhere between manually deleting these comments and eventually installing a WordPress plugin to help, I started to wonder where these comments were all coming from. They all seemed to magically appear in the middle of the night and they were spammy, but I was interested in patterns beyond that.

This is a perfect use case for the Apache access log. You can use it to examine a detailed log of who has been to your website.

The information about visitors can include their IP address, their browser, the actual HTTP request itself, the response, and plenty more.

In this post, you’ll get what’s promised in the title: a detailed introduction to this type of logging. Being an introduction, the post won’t venture into super-advanced topics.

But since it’s a detailed introduction, you can bet we’ll answer the basic questions so you can leave the post with solid knowledge of the fundamentals of the Apache access log.

We’ll start out by giving you a basic definition of the Apache access log. We’ll continue by explaining what Apache is, what it does, and where it stores its logs.

Then we’ll show how to read the log file as well as how to search it. Finally, we’ll wrap up by covering how to customize the format of the log entries.

Let’s get started.

Apache Access Log: The Basics

Well, at the broadest level, the Apache access log is a source of information about who is accessing your website and how.

But as you might expect, a lot more goes into it than just that.

After all, people visiting your website aren’t like guests at your wedding, politely signing a registry to record their presence. They’ll visit for a whole host of reasons, stay for seconds or hours, and do all sorts of interesting and improbable things.

And some of them will passively (or even actively) thwart information capture.

So, the Apache access log has a bit of nuance to it.  And it’s also a little…complicated at first glance.

But don’t worry—demystifying it is the purpose of this post.

Apache Access Log: the What and Where

Having established the value proposition, let’s get a little more specific about some things.

Apache is a widely used piece of software called a web server.  The definition of server is always a little fuzzy since we use it to describe two related but different concepts: an actual computer (or virtual computer) and a piece of software running on that computer.

Both exist to wait for and respond to requests, but they exist at different levels of abstraction.

Apache is a server of the software variety.  Specifically, it’s a web server.  This means that it lives to take in HTTP requests and give back HTTP responses.  Put more plainly, it gives you web pages.

As any piece of complex software should, Apache leaves a lot of log files in its wake.

The access log is one of those files.  It simply has a list of all inbound requests, formatted to allow you to consume them easily (and probably with automated tooling).

The overwhelming majority of the time you see it in the wild, Apache runs on Linux. (It runs on Windows too, but that’s not common.) And you’ll usually find the Apache access log somewhere under /var/log/, for example at /var/log/apache2/access.log on Debian and Ubuntu, or at /var/log/httpd/access_log on Red Hat-based distributions.

This is not universal, though, because you can configure Apache to store the log wherever you want, and conventions may vary by Linux distribution.
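
If you’re not sure where your server keeps the file, here’s a minimal Python sketch that checks a few conventional locations. The paths in the list are assumptions based on common distribution defaults, and the function name find_access_log is made up for this example. Your server may well be configured to log somewhere else entirely.

```python
from pathlib import Path

# Conventional locations only; any given server may be configured differently.
CANDIDATE_PATHS = [
    "/var/log/apache2/access.log",  # Debian and Ubuntu convention
    "/var/log/httpd/access_log",    # Red Hat-based convention
    "/var/log/apache/access.log",   # older or custom layouts
]


def find_access_log(candidates=CANDIDATE_PATHS):
    """Return the first candidate path that exists as a file, or None."""
    for candidate in candidates:
        if Path(candidate).is_file():
            return candidate
    return None


if __name__ == "__main__":
    print(find_access_log() or "No access log found in the usual places")
```

If this comes back empty, the log path set in your Apache configuration is the authoritative answer.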

You can read about locating the log file in more detail here.

Another important type of log file Apache uses is the error log. As its name suggests, this is where Apache records information about errors or other abnormal situations.

In practice, many of the so-called “errors” written to the log aren’t that critical, but rather very minor incidents such as 404 errors (i.e., a user requesting a resource that doesn’t exist).

Apache also records events that aren’t even errors but might be of help down the line. For instance, warnings that could indicate potential problems can be recorded.

How to Read the Access Log

When you look at the screenshot above of an access log, it may seem intimidating.

Part of that is viewing it through the SSH shell with the text wrapping around.  But there’s legitimately a whole lot of information packed in there, which makes it seem daunting at first.

Really, though, it’s just a series of log entries recorded one entry per line.

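Within an entry, the fields are separated by spaces, although a few fields (the bracketed timestamp and the quoted request, referrer, and user agent) contain spaces of their own. My log, in particular, is in the so-called “combined” format, which we’ll discuss in more detail in later sections.

So if you were to parse the access log, you would first split it by line and then split each line into a series of fields, treating anything inside brackets or quotes as a single field.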

Or, to think of it another way, consider that it would be pretty straightforward to import the Apache access log to Excel and view it there.
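
To make the tokenizing idea concrete, here’s a rough Python sketch. The log line is hypothetical (I made it up for illustration), and the regular expression is just one way to do it, assuming the combined format. It shows why a plain split on spaces isn’t quite enough: the timestamp, the request, and the user agent all contain spaces themselves.

```python
import re

# Hypothetical log line in the "combined" format, made up for illustration.
SAMPLE = (
    '203.0.113.7 - - [12/Mar/2024:02:14:07 +0000] '
    '"POST /wp-comments-post.php HTTP/1.1" 200 512 '
    '"http://example.com/hello-world/" "Mozilla/5.0 (X11; Linux x86_64)"'
)

# A token is a [bracketed] chunk, a "quoted" chunk, or a run of non-spaces.
TOKEN = re.compile(r'\[[^\]]*\]|"(?:\\.|[^"\\])*"|\S+')


def tokenize(line):
    """Split one log line into fields, keeping quoted and bracketed text whole."""
    return TOKEN.findall(line)


naive = SAMPLE.split(" ")
fields = tokenize(SAMPLE)
print(len(naive), "pieces from a plain split on spaces")
print(len(fields), "fields when quotes and brackets are respected")
for field in fields:
    print("  ", field)
```

Once each line is a clean list of fields like this, dropping the data into a spreadsheet or any other tool becomes a lot less painful.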

Searching the Log

Okay, so now you understand the conceptual formatting of the access log and how to read it.

But I’m guessing you’re probably not going to grab a cup of coffee, sit down, and start reading it like a newspaper.

So, how do you work with this thing?  How do you interpret it?

Probably the most common way you’re going to use the access log, especially at first, is with simple search.

In the past, I’ve written a guide to searching files with grep and regex, and these access logs make an excellent candidate for that.

For instance, you might do a search for all lines containing 404 to see all instances of users requesting non-existent pages.
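
One caveat with searching for a bare 404: it also matches lines where 404 happens to appear in a URL or a byte count. Here’s a small Python sketch that checks the status field specifically. It assumes the common or combined format, where the status code sits right after the quoted request. The sample lines and the function name lines_with_status are hypothetical.

```python
import re

# The status code is the three-digit number right after the quoted request.
STATUS_AFTER_REQUEST = re.compile(r'" (\d{3}) ')


def lines_with_status(lines, wanted="404"):
    """Yield only the lines whose status field equals `wanted`."""
    for line in lines:
        match = STATUS_AFTER_REQUEST.search(line)
        if match and match.group(1) == wanted:
            yield line


# Hypothetical entries, made up for illustration.
sample = [
    '203.0.113.7 - - [12/Mar/2024:02:14:07 +0000] '
    '"GET /missing.html HTTP/1.1" 404 196 "-" "curl/8.0"',
    '203.0.113.8 - - [12/Mar/2024:02:15:10 +0000] '
    '"GET /page-404-tips/ HTTP/1.1" 200 4040 "-" "curl/8.0"',
]

for hit in lines_with_status(sample):
    print(hit)
```

The second sample line contains “404” twice but was served with a 200, so only the first line is printed. To run this against a real log, you could pass an open file object in place of the sample list.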

If that’s not sufficient, you can also open the logs with a text editor or with Excel.  This is a little more involved and may suffer when your files are particularly large, though.

And finally, you might also consume these logs via a more sophisticated, broader log management strategy by folding them into a log aggregation scheme.

Conceptually, this is somewhat like the aforementioned “put it into Excel” concept.  You leverage tooling that parses the logs for you and puts them into easily digestible formats, potentially including dashboards and graphs.

Customizing the Access Log Format

You also have options beyond just how you consume the log file.  If you want (and have the correct permissions to change settings in Apache), you can also customize the information recorded in the file.

Consider the “combined” format we mentioned earlier. We’ll now describe it in more detail.

The following string is what you’d use to represent the combined format in the Apache settings:

"%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\""

Each of those %-prefixed tokens is a format directive, which Apache replaces with a value from the request or the response when it writes the entry.

In your Apache configuration files, you can specify the variables you want to see in the access log.

A detailed treatment of these configuration options is beyond the scope of this post, but we’ll give you a quick overview of the options we’ve just used above:

  • %h – Remote host (client IP address)
  • %l – Remote logname from identd, or a dash if unavailable (in practice, almost always a dash)
  • %u – Username, via HTTP authentication, or dash if not used
  • %t – Timestamp of when Apache received the HTTP request
  • \"%r\" – The actual request itself from the client
  • %>s – The status code Apache returns in response to the request
  • %b – The size of the response in bytes, not counting the HTTP headers (a dash if no bytes were sent)
  • \"%{Referer}i\" – Referer header, or dash if not sent (in other words, whether they clicked a link on another site to come to your site)
  • \"%{User-agent}i\" – User agent (contains information about the requester’s browser/OS/etc.)
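
To see how those directives line up with an actual entry, here’s a Python sketch that uses one named group per directive. Treat it as an illustration under assumptions rather than a complete parser: the log line is hypothetical, the group names are my own, and unusual entries (say, a username containing spaces) would not match.

```python
import re

# One named group per directive in the combined format string.
COMBINED = re.compile(
    r'(?P<host>\S+) '                        # %h
    r'(?P<logname>\S+) '                     # %l
    r'(?P<user>\S+) '                        # %u
    r'\[(?P<time>[^\]]+)\] '                 # %t
    r'"(?P<request>(?:\\.|[^"\\])*)" '       # \"%r\"
    r'(?P<status>\d{3}) '                    # %>s
    r'(?P<size>\d+|-) '                      # %b
    r'"(?P<referer>(?:\\.|[^"\\])*)" '       # \"%{Referer}i\"
    r'"(?P<user_agent>(?:\\.|[^"\\])*)"'     # \"%{User-agent}i\"
)


def parse_entry(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = COMBINED.match(line)
    return match.groupdict() if match else None


# Hypothetical log line, made up for illustration.
SAMPLE = (
    '203.0.113.7 - - [12/Mar/2024:02:14:07 +0000] '
    '"POST /wp-comments-post.php HTTP/1.1" 200 512 '
    '"http://example.com/hello-world/" "Mozilla/5.0 (X11; Linux x86_64)"'
)

entry = parse_entry(SAMPLE)
for name, value in entry.items():
    print(f"{name:<10} {value}")
```

If you change the format string in your Apache configuration, a pattern like this has to change along with it.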

As you can see, this is powerful stuff. You can get an awful lot of information about requests to your site and the people making them.

Understand the Opportunity Here

How you handle access to all of this information should go beyond just satisfying simple curiosity.

This information helps form the backbone of a good production strategy, and you should take advantage of it directly or indirectly.

You can mine the access log for all sorts of information.  Are you seeing an unusual number of errors or improper requests?  Is there a particular site referring a lot of traffic your way?  This is just a tiny sampling of the sorts of questions you can answer.
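
As a taste of what that mining can look like, here’s a short Python sketch that tallies status codes and referrers. It assumes the combined format, and the sample entries and the summarize function are hypothetical. A real log management tool does far more than this, but the idea is the same.

```python
import re
from collections import Counter

# Pulls the status code and the referrer out of a combined-format line.
STATUS_AND_REFERRER = re.compile(r'" (\d{3}) \S+ "([^"]*)"')


def summarize(lines):
    """Count status codes and referrers across an iterable of log lines."""
    statuses, referrers = Counter(), Counter()
    for line in lines:
        match = STATUS_AND_REFERRER.search(line)
        if not match:
            continue  # skip lines that do not look like combined-format entries
        status, referrer = match.groups()
        statuses[status] += 1
        if referrer != "-":  # a dash means no referrer was sent
            referrers[referrer] += 1
    return statuses, referrers


# Hypothetical entries, made up for illustration.
sample = [
    '203.0.113.7 - - [12/Mar/2024:02:14:07 +0000] '
    '"GET /a HTTP/1.1" 200 1024 "http://example.org/links" "Mozilla/5.0"',
    '203.0.113.8 - - [12/Mar/2024:02:15:10 +0000] '
    '"GET /b HTTP/1.1" 404 196 "-" "Mozilla/5.0"',
    '203.0.113.9 - - [12/Mar/2024:02:16:42 +0000] '
    '"GET /a HTTP/1.1" 200 1024 "http://example.org/links" "Mozilla/5.0"',
]

statuses, referrers = summarize(sample)
print("Status codes:", statuses.most_common())
print("Top referrers:", referrers.most_common(3))
```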

So take a look through the Apache access log and get comfortable with it.

Then, once you’re comfortable with it, I suggest you develop a strategy for making use of the information it contains.
