Setting up Logstash for Apache access logs

Configuration to get Apache Access logs

In this setup, we run Logstash on each server where an Apache web server is running. In our Apache configuration, we've enabled the combined access log format for each of our Apache virtual hosts.
Here we have one virtual machine running Apache 2.4.18 with a separate access log per virtual host, so an extract of the configuration looks like:

<VirtualHost *:80>
    ServerName site1
    DocumentRoot /var/www/site1
    <...>
    CustomLog /var/log/apache2/site1.log combined
</VirtualHost>
<VirtualHost *:80>
    ServerName site2
    DocumentRoot /var/www/site2
    <...>
    CustomLog /var/log/apache2/site2.log combined
</VirtualHost>
<...>


More information about configuring an Apache web server with virtual hosts can be found in the Apache documentation: http://httpd.apache.org/docs/2.4/vhosts/

We use the following configuration for this Logstash instance:
input {
  file {
    path => ["/var/log/apache2/drupal7.log", "/var/log/apache2/gallery.log", "/var/log/apache2/onebit.log"]
    exclude => "*access*.log"
    add_field => {
      "source" => "apache"
      "loglevel" => "info"
    }
  }
}
filter {
  if [source] == "apache" {
    grok {
      break_on_match => false
      match => { "message" => "%{COMBINEDAPACHELOG}" }
      match => { "path" => "%{GREEDYDATA}/%{GREEDYDATA:website}.log" }
    }
    geoip {
      source => "clientip"
    }
    if [agent] =~ /(?i)(bot|spider|crawler|bark|slurp|yandex)/ {
      mutate {
        replace => { "clienttype" => "Robot" }
      }
    } else if [agent] =~ /(?i).*wget.*/ {
      mutate {
        replace => { "clienttype" => "Wget" }
      }
    } else if [agent] =~ /(?i).*check_http.*/ {
      mutate {
        replace => { "clienttype" => "Nagios" }
      }
    } else {
      useragent {
        source => "agent"
        target => "decoded_agent"
      }
      mutate {
        replace => { "clienttype" => "Human" }
      }
    }
  }
}
output {
  if [source] == "apache" {
    elasticsearch {
      hosts => [ "es1:9200","es2:9200" ]
      index => "apache-%{+YYYY.MM}"
      template => "/etc/logstash/apache-tpl.json"
    }
  }
}


The input of this configuration follows the three files listed. Each time a new log line is appended to one of them, it is ingested by Logstash.
A field named "source" is added, so that later stages can act on Apache log lines only.
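
By default, the file input only picks up lines appended after Logstash starts. To ingest content already present in the files, the start_position and sincedb_path options of the file input can be used (a sketch; the sincedb path is an assumption, adapt it to your layout):

file {
  path => ["/var/log/apache2/site1.log"]
  start_position => "beginning"
  sincedb_path => "/var/lib/logstash/sincedb-apache"
}

Note that start_position only applies to files Logstash has never seen before; the sincedb file records how far each known file has been read.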

The filter first matches the log line against a pre-defined expression for the Apache combined access log format, splitting each element into a corresponding field. Another match is done on the field named "path", which contains the path and filename of the log file the event originates from. This allows us to add a field identifying the virtual host associated with the Apache event.
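
To check which fields the grok expression extracts, one way is to feed a sample log line to a minimal Logstash pipeline on the command line (a sketch for testing only, not part of the production configuration):

input { stdin {} }
filter {
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
}
output { stdout { codec => rubydebug } }

Paste a line from one of the access logs on stdin and the rubydebug codec prints every extracted field (clientip, timestamp, verb, request, response, bytes, referrer, agent, ...).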

The second action is a geographical IP lookup on the content of the field "clientip".
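
The geoip filter adds a whole set of sub-fields to the event (country, city, coordinates, ...). If only a few of them are needed, the lookup result can be restricted with the fields option (a sketch; pick the fields relevant to your dashboards):

geoip {
  source => "clientip"
  fields => ["country_name", "city_name", "location"]
}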

The third action block of the filter performs a basic inspection of the User-Agent string to determine whether we are facing a robot (a web crawler or any kind of indexing engine like Google's), a manual wget, a Nagios check, or a real human visiting our sites. If we think the visitor is a human, we decode the User-Agent string using the built-in Logstash useragent filter. In every branch of the if/else chain we add a field "clienttype" containing the kind of visitor we think it is.

Then we output the event into an Elasticsearch index, creating a new index each month and loading the user-defined mapping template from the file /etc/logstash/apache-tpl.json. We need a user-defined mapping to force Elasticsearch to map the field "bytes" to a numeric type instead of a string, which later allows us to perform numerical aggregations on it (average, sum, ...). See the page about mappings for more information.
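
An alternative, if you prefer not to maintain a mapping template, is to convert the field inside Logstash itself before it reaches Elasticsearch (a sketch using the mutate filter's convert option):

filter {
  mutate {
    convert => { "bytes" => "integer" }
  }
}

With the value already an integer in the event, Elasticsearch's default dynamic mapping will pick a numeric type on its own.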

An alternative way to follow many log files is the following definition:
    file {
        path => "/var/log/apache2/*"
        exclude => "*.gz"
    }


to include all log files present in the directory, excluding the ones that are gzipped.
Another possibility:
    file {
        path => "/var/log/apache2/*.log"
        exclude => [ "access.log", "error.log", "other_vhosts_access.log" ]
    }


All files with a ".log" extension will be followed, except the ones listed in the exclude array. Note that exclusions are matched against the filename, not the full path.

The file input is able to follow the rotation of logs, but if your process still writes to the old file, you won't get the new lines appended to it unless you also include that file's name in the configuration.
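
For example, if the rotated file keeps receiving writes for a short while, both the current and the rotated name can be followed (a sketch; the ".1" suffix is an assumption about your rotation scheme):

file {
  path => ["/var/log/apache2/site1.log", "/var/log/apache2/site1.log.1"]
}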