Newest content

Setting up Logstash, for Apache access logs

Configuration to get Apache Access logs

In this case, we will run Logstash on each server where an Apache web server is running. In our Apache setup, we've enabled the Apache Combined Access Log for each of our Apache virtual hosts.
Here, we have one virtual machine running Apache 2.4.18 with a different access log per virtual host. An extract of the configuration looks like:

<VirtualHost *:80>
    ServerName site1
    DocumentRoot /var/www/site1
    <...>
    CustomLog /var/log/apache2/site1.log combined
</VirtualHost>
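As an illustrative sketch (not the exact configuration of this setup), the Logstash side could tail these per-vhost logs with the file input plugin and parse them with the standard COMBINEDAPACHELOG grok pattern; the path glob and the "apache-access" type name are assumptions:

    input {
      file {
        path => "/var/log/apache2/*.log"      # one combined access log per virtual host
        type => "apache-access"               # assumed type name, reused in the filter below
        start_position => "beginning"
      }
    }
    filter {
      if [type] == "apache-access" {
        grok {
          match => { "message" => "%{COMBINEDAPACHELOG}" }   # parse the Combined Log Format
        }
      }
    }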

Setting up Logstash, for Tweets

Configuration to get Tweets

input {
  twitter {
    consumer_key => "<your twitter API consumer key>"
    consumer_secret => "<your twitter API consumer secret>"
    oauth_token => "<your twitter API application token>"
    oauth_token_secret => "<your twitter API application token secret>"
    keywords => [ "thevoicebe" ]
    add_field => { "source" => "tweets" }
    full_tweet => false
  }
}
filter {
  if [type] == "tweets" {
    mutate {

Setting up Logstash, for Syslog

Configuration to get Syslog messages

This is the configuration of this Logstash instance. We will use the syslog input plugin to listen for syslog messages from all our hosts.
We will start Logstash on the server "logstash-runner", then we will configure Rsyslog on the clients.

input {
  syslog {
    port => "10514"
    add_field => {
      "source" => "syslog"
    }
  }
}
filter {
  if [source] == "syslog" {
    grok {

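On the client side, each host then only needs one rsyslog forwarding rule pointing at this listener. A minimal sketch, assuming a drop-in file and forwarding over TCP:

    # /etc/rsyslog.d/90-forward.conf (assumed file name)
    *.*  @@logstash-runner:10514    # "@@" forwards over TCP; a single "@" would use UDP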
GFS on iSCSI shared storage

This method, based on the software versions delivered with CentOS 6.0, uses dlm_controld.pcmk and gfs_controld.pcmk, which are special versions developed to be used directly by Pacemaker. After upgrading the OS to CentOS 6.2, the RPM providing dlm_controld.pcmk and gfs_controld.pcmk was replaced by cman, which provides the standard gfs_controld and dlm_controld. To use these two with Pacemaker, we need to enable CMAN with Corosync.

Databases

Databases (RDBMS or directories) are often seen as the foundation for almost all applications used within the enterprise.
  • RDBMS are often used when there are many reads and writes. RDBMS, the SQL databases, can hold any kind of data, ordered and linked just as you want, but the protocols to access them may be heavier than those for directories

Failover of database

When it comes to high-availability of databases, numerous strategies can be applied.

Do we need a complete active-active solution doing load-balancing? Or is a standby database enough?
The technical implications behind these choices are totally different, and so, of course, is the way each reacts to an outage.

Virtualization

With an ever-growing market share in the enterprise segment, and to stay aligned with the demand for Cloud-enabled technology, it was natural for Linux to support virtualization tools. As a precursor, VMware has supported the Linux operating system since its early days. But now, the market is wide open to other solutions too.
Having started as open source alternatives to VMware, the first commercial solution, these alternative solutions have since been acquired or integrated into sophisticated commercial offerings.
The valid alternatives to VMware are:

System

The Operating System

Well, this is obviously the first brick of our "Open Infrastructure". Among all the free OSes available today, we've chosen Linux, which after all is the title of this site. But which distribution? You don't have to know Linux very well to be aware that there are a lot of different distributions delivering a Linux OS. A distribution will deliver the Linux kernel and all that is needed to boot a server, but also a lot of supplemental applications doing many things, like web servers, database servers, graphical desktops or office applications.

Storage

Storage is a key component of any system. We have heard a lot about storage.
But what is what? Local storage versus remote storage.
Block storage, file storage, object storage, ... what are the differences?
Let's try to lift the curtain a bit on the storage aspect of your infrastructure.

Open Source Solutions

Or what an enterprise may need

We will go through a lot of different solutions that an enterprise can require to function properly. Going from the lowest layer to the highest in the company, we will find that the Open Source world can bring solutions regarding:
  • The virtualization layer to reduce the hardware cost
  • The network and applications management tools

MySQL active-passive cluster

We will use the iSCSI LUN defined in our iSCSI cluster as shared storage, and we will run MySQL in active-passive (fail-over) mode using the Pacemaker and Corosync cluster engine.

The cluster will have to connect to the iSCSI target, mount the iSCSI partition on one node and start a MySQL service which has all its data on this partition.

We will need the following resources and resource agents (RA) on this cluster:

  • virtual IP → ocf:heartbeat:IPaddr2 (a configuration sketch follows below)
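Purely as an illustration, the virtual IP resource could be declared with the crm shell along these lines (resource name, IP address and netmask are assumptions):

    crm configure primitive p_mysql_vip ocf:heartbeat:IPaddr2 \
        params ip="192.168.1.100" cidr_netmask="24" \
        op monitor interval="30s"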

Multi-master OpenLDAP with 3 or more nodes

Going from a two-node multi-master configuration to one with more than two nodes is not really complex, once you have understood what we do in the two-node configuration:

  • In the two-node configuration, each node has a different ServerID, and the same holds with N nodes. To let the local LDAP server differentiate between the various masters, the configuration will now list one ServerID directive per node, followed by the LDAP URI used to reach that server (see the sketch below).
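A minimal sketch of that part of the cn=config entry, for three masters (host names are assumptions); the same list is deployed on every node, and each server recognizes its own entry by matching the URI it listens on:

    olcServerID: 1 ldap://ldap1.example.com
    olcServerID: 2 ldap://ldap2.example.com
    olcServerID: 3 ldap://ldap3.example.com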

Detailed presentation of MongoDB

Features

  • NoSQL
  • Automatic sharding
  • Own query language, integrated into the driver, similar to JavaScript
  • BSON - Binary JSON
  • Aggregation framework (similar to GROUP BY)
    • $group
  • Lookup framework (similar to LEFT JOIN)
    • $lookup (both stages are shown in the sketch after this list)
  • Native support for MapReduce
  • MongoDB Connector for BI
    • Allows traditional BI tools to access semi-structured and unstructured data
  • Different types of indexing
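A minimal mongo shell sketch of these two stages (collection and field names are invented for the example): $group sums the orders per customer, then $lookup pulls in the matching customer document, LEFT JOIN style.

    // total per customer, then join the customer document
    db.orders.aggregate([
      { $group:  { _id: "$customerId", total: { $sum: "$amount" } } },
      { $lookup: { from: "customers", localField: "_id",
                   foreignField: "_id", as: "customer" } }
    ])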

Database decision criteria

Decision criteria for a (NoSQL) database

We base our main criteria on the CAP theorem.
This theorem, proposed by Eric Brewer, states that it is impossible for a distributed computer system to simultaneously provide more than two out of three of the following guarantees:
1) Consistency
Every read receives the most recent write or an error
2) Availability
Every request receives a (non-error) response – without guarantee that it contains the most recent write
3) Partition tolerance
The system continues to operate despite messages being dropped or delayed by the network between the nodes

Ceph and GlusterFS comparison

GlusterFS and Ceph are both software-defined storage solutions, part of the Red Hat solutions portfolio.
If at first sight they can seem identical in what they offer (storage distributed on commodity hardware, with fault resilience), when looking more in depth there are some differences that can make one of the two solutions better than the other for some use cases, and vice versa.

So, let's dig into what these 2 solutions have to offer:

 

Ceph Block Device usage

We saw previously the steps to follow to get a Ceph cluster up and running, with some distributed and protected storage.
Now, it is time to add some services on top of it to let your applications, VMs or servers access and use this storage.
We will add a block device, a Ceph filesystem and an Object gateway (compatible with OpenStack Swift and Amazon S3).
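As a first taste of the block device part, here is a sketch run from a client that already has a working ceph.conf and keyring (image name, size and mount point are assumptions):

    rbd create myimage --size 10240     # a 10 GB image in the default "rbd" pool
    rbd map myimage                     # exposes the image as a local block device, e.g. /dev/rbd0
    mkfs.ext4 /dev/rbd0                 # from here on, it behaves like any other disk
    mkdir -p /mnt/ceph-block
    mount /dev/rbd0 /mnt/ceph-block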

History of Open Source

Open Source didn't begin with Linux, nor did it begin with the first Unix operating system. The philosophy of what is today called Open Source was already present, in some respects, from the beginning of IT history. So, to be complete, a history of Open Source needs to start at a time when nobody imagined that one day there would be one or more computers in almost every house in the world, and that all of them would be interconnected by a worldwide network.

Implementing high-availability

In this section, we will cover the technical aspects of the implementation of various solutions to create a working highly-available setup. These are technical howtos covering some particular aspect(s) of the software in use, to achieve a given goal.

All the practical howtos in this section are based on the following software:
  • Distribution: CentOS 6, 64-bit

  • Other repositories in use: EPEL, RPMforge

Building an e-commerce web site

To build a functional and secure e-commerce web site, you will have to follow some steps. Here are the most important of them:

  1. At least one domain name
  2. A web hosting provider
  3. A software package to run your shop
  4. A subscription to one or more services to provide payment facilities to your shop
  5. Invoices archiving
  6. One or more Internet advertising subscriptions
  7. Presence on the social networks
  8. Search engine referencing (SEO)

Big data definitions

Before digging into this world made of huge amounts of data, streaming data flows and analytic applications, let's fix some basic ideas.
Let's define the ground concepts of this world.

Big Data

Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process data within a tolerable elapsed time. Big data "size" is a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data.

Evolution of data

The amount of digital information captured by today's enterprises is growing along a kind of hyperbolic curve.
The reduction of the storage cost, the reduction of the cost of the compute resources, the ubiquity of digital access (PC, Internet, smartphones, tablets, …), the evolution toward a digital world are factors creating more and more sources of digital information.

HA on filesystems

A RAID system, be it hardware or software, will protect your local filesystems against disk failure. But when these filesystems are shared with other systems, you also need to protect against the failure of the host sharing the filesystem. This can be done by using a SAN and cluster software that will mount and share the SAN LUNs again if a server fails.

But there are alternative solutions to the costly SAN. The Open Source community offers you various kinds of mechanisms to build highly-available filesystems.

Filesystem replication

Inventory management

One of the core components in any enterprise is the CMDB (Configuration Management DataBase). The CMDB is not just an inventory tool listing all the elements you have in your infrastructure, it is also a tool showing the dependencies between them.
Even in the case of a small infrastructure, it is valuable to have in place a good inventory tool with dependency links between the elements.
In such CMDB tools, each component is called a "Configuration Item" or CI. A CI may be a server or a CPU in a server, a piece of software, ...

Load balancing

Load balancing is the capability to direct queries, in turn, to the members of a server farm.

This can be done simply by using DNS Round-Robin, which is when you configure more than one A record for the same name in the DNS. At each query to resolve this name, the DNS server will respond with a different IP address, cycling through the set of A records provided for this unique name.
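For example, a zone extract with three A records for the same name could look like this (names and addresses are purely illustrative):

    www    IN  A  192.0.2.10
    www    IN  A  192.0.2.11
    www    IN  A  192.0.2.12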
But in this scenario, there are a lot of potential problems:

Document management

Document management systems will allow an organization to organize all sources of information according to its business rules:
  • Workflow to approve the changes made into a document
  • Metadata to allow a quick search and retrieval of the document
  • Indexing to be able to retrieve a document based on information contained into it

Management

Under Management, on this web site, we will cover the following topics:
  • the software related to keeping a list of all your assets and building your CMDB, if you are an ITIL addict
  • the tools to keep an eye on what's going on among your systems and to be warned in time when something unwanted happens

Knowledge sharing

In this category, we can find solutions like:
  • Document management systems: where you will have versioning of documents, workflows, check out / check in possibilities, integration with the Office applications
    • KnowledgeTree
    • O3Spaces
  • Web forums: create threads of conversation with a web browser, post answers or questions, search for pertinent information
    • phpBB
  • Wiki: share your knowledge by creating / modifying / deleting articles on a web site
    • TWiki

Document data stores

These data stores are schemaless or schema-free, meaning that the records in the same logical container (table or collection or ...) can each have a different structure. In other words, two consecutive records can have a different number of columns, each of a different type. Moreover, each column can hold another record with its own set of columns, creating nested records.
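For example, these two documents could sit side by side in the same collection (field names are invented); the second one nests an address record inside one of its columns:

    { "name": "Alice", "email": "alice@example.com" }

    { "name": "Bob",
      "phones": [ "+32 2 555 01 02" ],
      "address": { "city": "Brussels", "country": "BE" } }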

Detailed presentation of ElasticSearch

Features

Open Source real-time search and analytics engine with a dedicated ecosystem of tools to feed it, manage it and use it.

Fully-featured search

  • Relevance-ranked text search (a query sketch follows this list)
  • Scalable search
  • High-performance geo, temporal, range and key lookup
  • Highlighting
  • Support for complex / nested document types
  • Spelling suggestions
  • Powerful query DSL
  • “Standing” queries
  • Real-time results
  • Extensible via plugins
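As a small illustration of the relevance-ranked search and highlighting features, a query like this one could be sent over the REST API (index and field names are assumptions):

    curl -XGET 'http://localhost:9200/articles/_search?pretty' -H 'Content-Type: application/json' -d '
    {
      "query":     { "match": { "body": "open source storage" } },
      "highlight": { "fields": { "body": {} } }
    }'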

Powerful faceting/analysis

Search engines

This family consists of NoSQL data stores crafted for one specific goal: indexing and searching their data content using complex, full-text and / or distributed search queries.

Running Linux on Windows with Vagrant

Testing Linux software on Windows with Vagrant

There is no magic: you will not be able to run Linux software natively on Windows without third-party add-ons.

But thanks to a number of vendors, you have virtualization tools available for doing your tests.
Everything that has been tested and presented on this site, if not done directly on a physical Linux host, has been done with the help of virtualization software.
Even if your host OS is Linux, there are advantages to using virtualization:
  • you can use different versions of your distribution
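As an illustration, a minimal Vagrantfile bringing up an Ubuntu 16.04 test box on VirtualBox could look like this (box name and memory size are assumptions):

    Vagrant.configure("2") do |config|
      config.vm.box = "ubuntu/xenial64"        # an Ubuntu 16.04 LTS base box
      config.vm.provider "virtualbox" do |vb|
        vb.memory = 1024                       # RAM for the test VM, in MB
      end
    end

A simple "vagrant up" then starts the VM, and "vagrant ssh" logs you into it.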

ElasticSearch fields mapping customisation

Changing the field mapping ElasticSearch is using

With ElasticSearch, you don't need to explicitly define everything (field names, field types, indices, ...). It will try to do it automatically.
When uploading data using the REST API to an index which does not yet exist, a new one with the name provided will be created. A default mapping (the types to use for each field) and default settings will be applied.
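If you prefer to control the mapping yourself, one option is to create the index with an explicit mapping before sending any data. A sketch for ElasticSearch 5.x (index, type and field names are assumptions):

    curl -XPUT 'http://localhost:9200/weblogs' -H 'Content-Type: application/json' -d '
    {
      "mappings": {
        "logs": {
          "properties": {
            "clientip":   { "type": "ip" },
            "response":   { "type": "integer" },
            "@timestamp": { "type": "date" }
          }
        }
      }
    }'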

Setting up Logstash - the basics

Setting up Logstash, the basics


Installing Logstash on the server (done with Ubuntu 16.04 LTS).
    server1(admin) ~$ wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -
    server1(admin) ~$ sudo apt-get install apt-transport-https
    server1(admin) ~$ echo "deb https://artifacts.elastic.co/packages/5.x/apt stable main" | sudo tee -a /etc/apt/sources.list.d/elastic-5.x.list
    server1(admin) ~$ sudo apt-get update && sudo apt-get install logstash


Switching user (sudo)

On any Linux system, there is only one account with all the super-administrator powers: the root account, the account with ID 0.
It is not recommended to use this account directly, for different reasons, all linked to security and traceability:
  • If it is allowed to log on, half of the job is already done for the hackers: they just need to find the password, as the username giving super powers on the system is always the same
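This is where sudo comes in: users log on with their own account and are delegated only the rights they need. A minimal /etc/sudoers sketch, to be edited with visudo (group, user and command are assumptions):

    %admin  ALL=(ALL) ALL                                  # members of the "admin" group may run any command as any user
    alice   ALL=(root) /usr/sbin/service apache2 restart   # a single command delegated to one user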

Creating a HDF cluster

Setting up a HDF cluster with Ambari

To have a fully functional cluster running HortonWorks Data Flow

 

Attention, read this first before starting the deployment of an HDF cluster

(Valid as of the end of June 2017)
The latest version of Ambari (2.5.1) is well supported on Ubuntu 16 LTS and Ubuntu 14 LTS. This is also the case for the full Hortonworks Data Platform stack (HDP, version 2.6.1). Both are also supported on Oracle Linux, SUSE, CentOS, RedHat and Debian.

Ambari usage and tweaking

Changing the default web port

By default, the Ambari web GUI is listening on port 8080. It is really easy to change this port.
On the command line of your Ambari Server, you just have to do this (a shell sketch follows the steps):
  1. Stop the Ambari server ( sudo ambari-server stop )
  2. Edit the file /etc/ambari-server/conf/ambari.properties
  3. Add the line client.api.port=<your port>
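The same steps as a shell sketch on the Ambari server (8081 is just an example port):

    sudo ambari-server stop
    echo "client.api.port=8081" | sudo tee -a /etc/ambari-server/conf/ambari.properties
    sudo ambari-server start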

NiFi for Syslog

Let’s build with NiFi a flow similar to what we built with Logstash to store syslog messages in an ElasticSearch index.

 

Receiving the messages

We start with the ListenSyslog processor of NiFi, which can be configured to listen on any UDP or TCP port for syslog. When listening on TCP, you must specify the maximum number of concurrent TCP connections. This parameter will depend on the number of systems sending syslog messages simultaneously to your listener.

NiFi installation and implementation

NiFi introduction

NiFi allows you to create various data pipelines in a very nice web GUI.
Inside NiFi, one event sent to and handled by the system is called a flow file. Each event will be stored as a file, containing attributes. Flow files will be received, transformed, routed, split and transferred by processors. Tons of processors are provided by default; there are processors to:
  • Receive messages from Syslog, HTTP, FTP, HDFS, Kafka, …

NiFi and JSON

NiFi and JSON

Remark: with the introduction of record-oriented flow files, managing JSON with NiFi became easier than ever.
The how-to below about JSON manipulation makes extensive use of message content and attribute extraction / modification.
You will find later pages about the usage of records in NiFi.

NiFi and ElasticSearch

Custom mapping for the index you will update with NiFi flows

Unlike the Logstash "elasticsearch" output, you cannot associate a customized mapping with the processor. Therefore, if the dynamic mapping of ElasticSearch doesn’t assign the type you really want to one of your fields, you will have to use a default mapping template (see this chapter in the ElasticSearch section of the site).
If doing that, remember that:

Creating a HDP cluster

Setting up a HDP cluster with Ambari

To have a fully functional cluster running HortonWorks Data Platform

 

Presentation

The Apache Ambari project implements a web GUI that can be used to help in provisioning, managing and monitoring an Apache Hadoop cluster. Over time, it has introduced support for many Open Source projects that are part of the Hadoop ecosystem.
The Ambari server will enable you to:
  1. Create a new cluster
  2. Provision services on selected nodes of the cluster
  3. Manage multiple versions of the services configuration

NiFi and SSL for authorization

Introduction

By default, your NiFi installation is not protected at all. Anyone knowing the hostname of your NiFi hosts can connect to them with a simple web browser.
To protect access to NiFi by adding user authentication and authorization, you will need to enable SSL. Client-side certificates, generated by the NiFi CA, are going to be used not only to set up an encrypted link to the NiFi hosts but also to provide user authentication.
Once SSL has been enabled for NiFi, it is no longer possible to connect using HTTP.
 

In the Ambari GUI

NiFi for Apache - using MiNiFi

Presentation

In this guide, we will use the lightweight version of NiFi, MiNiFi, which will run on an Apache web server, looking for new events written in the Apache access logs.
MiNiFi is a lightweight version of NiFi, without the web interface and with only a limited set of processors. It doesn’t take a lot of resources on the host it is running on.
It can be used as a “forward-only” agent to any central NiFi server you have previously set up.

Configuring MiNiFi

NiFi for Apache - the flow

Presentation

In the previous guide, you have installed, configured and enabled the MiNiFi agent on each of your web servers. Now, it is time to build a flow on your central NiFi server to do something with the information that will be sent to it.

 

Building up a flow on the NiFi server

We are now back to the workspace of our NiFi server.
If you have followed this guide line by line, you should only have one input port called “RemoteMiNiFi” on it.

NiFi and the Hortonworks Registry

Introduction

The HortonWorks Registry is a service running on your Hortonworks Data Flow cluster that allows you to centrally store and distribute schemas describing how the data you are manipulating is organized.
The Registry is a web application offering:
  • A web interface to add and modify schemas
  • A REST API that can be used by any other service to retrieve schema information
The Registry retains the previous versions of a schema each time you perform an update on an existing schema.

NiFi for Apache - the flow using records and registry

Presentation

In a previous guide, we’ve set up MiNiFi on web servers to export Apache access log events to a central NiFi server. Then we saw an example of a flow built in this NiFi server to handle these events. That flow was using standard NiFi processors, manipulating each event as a string. Now, we will start a new flow, achieving the same purpose but using a record-oriented approach.
We will then discover the ease of use of record-oriented flow files and how they can speed up the deployment of a flow.

Pieces needed from before

My vision


My vision is an IT infrastructure where you are no longer kept prisoner by one or two vendors.

A flexible IT infrastructure that can easily be bent to your business, and no longer the opposite.

An IT infrastructure where you master all the components interactions, because you know exactly what's inside.