
NiFi and SSL for authorization

Introduction

By default, your NiFi installation is not protected at all: anyone who knows the hostname of your NiFi hosts can connect to them with a simple web browser.
To protect access to NiFi by adding user authentication and authorization, you will need to enable SSL. Client-side certificates, generated by the NiFi CA, are used not only to set up an encrypted link to the NiFi hosts but also to provide user authentication.
Once SSL has been enabled for NiFi, it is no longer possible to connect using HTTP.
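As an illustration, here is a minimal sketch of the nifi.properties entries involved; the host name, paths, port and password placeholders are made up for the example, and on an HDF cluster Ambari manages this file for you:

# leaving the HTTP port empty disables plain HTTP
nifi.web.http.port=
nifi.web.https.host=nifi-host.example.com
nifi.web.https.port=9091
nifi.security.keystore=/etc/nifi/conf/keystore.jks
nifi.security.keystorePasswd=<keystore password>
nifi.security.truststore=/etc/nifi/conf/truststore.jks
nifi.security.truststorePasswd=<truststore password>
# require a client certificate for authentication
nifi.security.needClientAuth=true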
 

In the Ambari GUI

Ambari usage and tweaking

Changing the default web port

By default, the Ambari web GUI listens on port 8080. It is easy to change this port.
On the command line of your Ambari server, do the following, as shown in the sketch below:
  1. Stop the Ambari server ( sudo ambari-server stop )
  2. Edit the file /etc/ambari-server/conf/ambari.properties
  3. Add the line client.api.port=<your port>
  4. Start the Ambari server again ( sudo ambari-server start )
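
For example, assuming you want Ambari to listen on port 8081 (an arbitrary choice), the whole sequence looks like this:

    sudo ambari-server stop
    echo "client.api.port=8081" | sudo tee -a /etc/ambari-server/conf/ambari.properties
    sudo ambari-server start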

Creating an HDF cluster

Setting up an HDF cluster with Ambari

To have a fully functional cluster running Hortonworks Data Flow (HDF).

 

Attention, read this first before starting the deployment of an HDF cluster

(Valid as of the end of June 2017)
The latest version of Ambari (2.5.1) is well supported on Ubuntu 16 LTS and Ubuntu 14 LTS, as is the full Hortonworks Data Platform stack (HDP, version 2.6.1). Both are also supported on Oracle Linux, SUSE, CentOS, Red Hat and Debian.
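
For reference, registering the Ambari 2.5.1 repository on Ubuntu 16 follows the usual Hortonworks pattern; the URL and key below are taken from that pattern, so double-check them against the official installation guide:

    sudo wget -O /etc/apt/sources.list.d/ambari.list http://public-repo-1.hortonworks.com/ambari/ubuntu16/2.x/updates/2.5.1.0/ambari.list
    sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys B9733A7A07513CAD
    sudo apt-get update && sudo apt-get install ambari-server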

Creating an HDP cluster

Setting up an HDP cluster with Ambari

To have a fully functional cluster running Hortonworks Data Platform (HDP).

 

Presentation

The Apache Ambari project implements a web GUI that helps with provisioning, managing and monitoring an Apache Hadoop cluster. Over time, it has added support for many open source projects that are part of the Hadoop ecosystem.
The Ambari server will enable you to:
  1. Create a new cluster
  2. Provision services on selected nodes of the cluster
  3. Manage multiple versions of the services configuration
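
Everything the web GUI does goes through Ambari's REST API, which you can also call directly; a quick sketch, assuming the default admin/admin credentials and the default port 8080:

    curl -u admin:admin http://<ambari-host>:8080/api/v1/clusters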

Setting up Logstash, for Syslog

Configuration to get Syslog messages

This is the configuration of this Logstash instance. We will use the syslog input plugin to listen for syslog messages from all our hosts.
We will start Logstash on the server "logstash-runner", then configure Rsyslog on the other hosts.

input {
  syslog {
    port => "10514"
    add_field => {
      "source" => "syslog"
    }
  }
}
filter {
  if [source] == "syslog" {
    grok {
      # the original post is truncated here; an assumed minimal pattern
      match => { "message" => "%{GREEDYDATA:syslog_message}" }
    }
  }
}
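
Each host then needs Rsyslog to forward its messages to that port. A minimal sketch to drop into a file under /etc/rsyslog.d/, assuming the Logstash host resolves as logstash-runner (a single @ would switch the forwarding from TCP to UDP):

*.* @@logstash-runner:10514

Restart rsyslog afterwards ( sudo systemctl restart rsyslog ).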

Setting up Logstash, for Tweets

Configuration to get Tweets

input {
  twitter {
    consumer_key => "<your twitter API consumer key>"
    consumer_secret => "<your twitter API consumer secret>"
    oauth_token => "<your twitter API application token>"
    oauth_token_secret => "<your twitter API application token secret>"
    keywords => [ "thevoicebe" ]
    add_field => { "source" => "tweets" }
    full_tweet => false
  }
}
filter {
  if [source] == "tweets" {    # the input tags events with "source", not "type"
    mutate {
      # the original post is truncated here; an assumed minimal completion
      remove_field => [ "@version" ]
    }
  }
}
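
Before starting the service with either of these configurations, you can ask Logstash to validate the file first; the paths below assume the default package layout, and twitter.conf is a made-up name:

    sudo /usr/share/logstash/bin/logstash -f /etc/logstash/conf.d/twitter.conf --config.test_and_exit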

Setting up Logstash - the basics

Setting up Logstash, the basics


Installing Logstash on the server (done on Ubuntu 16.04 LTS).
    server1(admin) ~$ wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -
    server1(admin) ~$ sudo apt-get install apt-transport-https
    server1(admin) ~$ echo "deb https://artifacts.elastic.co/packages/5.x/apt stable main" | sudo tee -a /etc/apt/sources.list.d/elastic-5.x.list
    server1(admin) ~$ sudo apt-get update && sudo apt-get install logstash
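
The 5.x package sets up a systemd unit at install time, so the service can then be enabled and started the usual way:

    server1(admin) ~$ sudo systemctl enable logstash
    server1(admin) ~$ sudo systemctl start logstash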


ElasticSearch field mapping customisation

Changing the field mapping ElasticSearch is using

With ElasticSearch, you don't need to explicitly define everything (field names, field types, indices, ...). It will try to do this automatically.
When you upload data through the REST API to an index that does not yet exist, the index is created with the name provided, and a default mapping (the type to use for each field) and default settings are applied.
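
If you want to control the mapping yourself, create the index with an explicit mapping before sending any data. A sketch against a local ElasticSearch 5.x node; the index name (logs-2017), type (syslog) and fields are invented for the example:

curl -XPUT 'http://localhost:9200/logs-2017' -H 'Content-Type: application/json' -d '
{
  "mappings": {
    "syslog": {
      "properties": {
        "source":  { "type": "keyword" },
        "message": { "type": "text" }
      }
    }
  }
}'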

Ceph and GlusterFS comparison

GlusterFS and Ceph are both software-defined storage solutions and part of the Red Hat portfolio.
Although at first sight they can seem identical in what they offer (storage distributed on commodity hardware, with fault resilience), a closer look reveals differences that can make one of the two solutions better suited to some use cases than the other, and vice versa.

So, let's dig into what these two solutions have to offer:

 
