Setting up Logstash for Tweets

Configuration to get Tweets

input {
  twitter {
    consumer_key => "<your twitter API consumer key>"
    consumer_secret => "<your twitter API consumer secret>"
    oauth_token => "<your twitter API application token>"
    oauth_token_secret => "<your twitter API application token secret>"
    keywords => [ "thevoicebe" ]
    add_field => { "source" => "tweets" }
    full_tweet => false
  }
}
filter {
  if [source] == "tweets" {
    mutate {
      gsub => [ "message", "\n", " " ]
      remove_field => [ "retweeted" ]
    }
    if [message] =~ /^RT / {
      grok {
        match => { "message" => "RT @%{USERNAME:retweet_user}: %{GREEDYDATA:message}" }
        overwrite => [ "message" ]
        add_field => { "retweet" => "true" }
      }
    } else {
      mutate {
        add_field => { "retweet" => "false" }
      }
    }
  }
}
output {
  if [source] == "tweets" {
    elasticsearch {
      hosts => [ "es1:9200","es2:9200" ]
      index => "twitter-%{+YYYY}"
    }
    file {
      path => "/opt/tweets/twitter-%{+YYYY-MM-dd}.out"
      codec => line { format => "%{user} | %{message}" }
    }
  }
}


This configuration receives all tweets containing the keyword 'thevoicebe' (@TheVoiceBE, #theVoiceBE, and TheVoiceBE will match too). In the filter, every newline in the message is replaced with a space, and the 'retweeted' field is removed from the event.
If the message starts with "RT ", only the message part is kept: the "RT @username:" prefix is dropped, the retweeting user is stored in a 'retweet_user' field, and a 'retweet' field is added, set to "true". In all other cases 'retweet' is set to "false".
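The grok pattern above can be approximated in plain Python to see what it extracts (a rough sketch for illustration only: the USERNAME pattern is simplified to \w+, and this helper is of course not part of Logstash):

```python
import re

# Rough equivalent of: RT @%{USERNAME:retweet_user}: %{GREEDYDATA:message}
RT_PATTERN = re.compile(r"^RT @(?P<retweet_user>\w+): (?P<message>.*)$", re.DOTALL)

def parse_retweet(message):
    """Return (retweet_user, message, is_retweet) for a raw tweet text."""
    m = RT_PATTERN.match(message)
    if m:
        return m.group("retweet_user"), m.group("message"), True
    return None, message, False

print(parse_retweet("RT @alice: Loving #theVoiceBE tonight"))
# → ('alice', 'Loving #theVoiceBE tonight', True)
```

Like the grok filter, the original "RT @user: " prefix is discarded and only the inner message survives, with the retweeting user kept in its own field.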
Each message is saved into an Elasticsearch index whose name is built from the event date, e.g. "twitter-2017", so one index is created per year (acceptable here, since there are not many tweets with this keyword). Splitting an index by date this way is recommended to keep its size down.
A copy of the message, together with the user who sent it, is also written to a text file, with one file created per day.
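For reference, the %{+YYYY} and %{+YYYY-MM-dd} date references in the output section expand roughly like this (an illustrative sketch only; Logstash performs this substitution itself, using the event's @timestamp rather than the current clock time):

```python
from datetime import datetime, timezone

# Illustrates how the index name and file path from the config are built.
now = datetime.now(timezone.utc)
index_name = "twitter-%s" % now.strftime("%Y")                 # e.g. "twitter-2017"
file_path = "/opt/tweets/twitter-%s.out" % now.strftime("%Y-%m-%d")
print(index_name)
print(file_path)
```

Because the index name only changes once a year and the file path once a day, the output plugins transparently roll over to a new index and a new file as time passes.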