Why do we need Logstash?

Elasticsearch - A practical introduction

In the book, short messages from Twitter are accessed to demonstrate the properties of the different aggregations. Appendix B describes the installation of the Twitter River, which can be used to index tweets in Elasticsearch. Since the rivers are now marked as deprecated and are no longer included in version 2.0 of Elasticsearch, it is advisable to describe an alternative way of indexing tweets in Elasticsearch.

This article describes how we can use Logstash to read data via the Twitter input and write it to Elasticsearch via the Elasticsearch output. Large parts are inspired by an English-language article by David Pilato on the subject.

To use Logstash, we need to download the appropriate archive from the Elasticsearch website and unzip it. Logstash 1.5.4 is used in the example. The configuration takes place in a single file, which we can store anywhere in the file system.

A Logstash configuration file usually consists of three sections:

  • The input section configures where data is read from
  • Incoming data can be manipulated in the filter section
  • The output section describes where the data should be written

For our example we only need the input and the output section, because we are not modifying the data but writing it to Elasticsearch as it is. The complete example can be found among the configuration files in the repository with the example data for the book.
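A minimal skeleton of such a configuration file could look like this; the section bodies are filled in over the course of this article:

    # Skeleton of our Logstash configuration. The filter section is
    # omitted because the data is passed through unchanged.
    input {
      # the twitter input configured in the next section
    }
    output {
      # stdout for testing, later the elasticsearch output
    }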

Configure Twitter input

In the input section we configure the Twitter input.

Access to the Twitter API used to be open, but now you have to create an account in the Twitter developer portal, where the various credentials required for access are then available. However, only part of the data is freely available for this type of access; if too much data is retrieved, access may be temporarily throttled or blocked.

The plugin recently made it possible to access the Twitter sample stream, which is also used in the book with the Twitter River. In this example, however, we state via the keywords setting that we are only interested in tweets containing certain keywords.

Finally, via full_tweet we state that we want to receive the complete tweet data. This returns all the necessary fields.
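A sketch of the input section. The credentials are placeholders for the values from the Twitter developer portal, and the keywords are example values, not necessarily the ones from the original article:

    input {
      twitter {
        # credentials from the Twitter developer portal (placeholders)
        consumer_key       => "<consumer_key>"
        consumer_secret    => "<consumer_secret>"
        oauth_token        => "<access_token>"
        oauth_token_secret => "<access_token_secret>"
        # only deliver tweets containing these keywords (example values)
        keywords           => ["elasticsearch", "logstash"]
        # receive the complete tweet data instead of a reduced version
        full_tweet         => true
      }
    }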

Test Twitter input

We can test the Twitter input by configuring the stdout output, which simply writes the incoming data to the command line.
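For the test, the output section simply uses the stdout output with the rubydebug codec:

    output {
      stdout {
        # print each incoming event as a pretty-printed Ruby hash
        codec => rubydebug
      }
    }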

This outputs the data as a Ruby hash, a data structure that is similar to JSON. An incoming tweet can look like this in excerpts:
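The following excerpt is purely illustrative; the values are invented, only the field names follow the Twitter API:

    {
        "created_at" => "Thu Sep 03 09:01:15 +0000 2015",
              "text" => "Indexing tweets with #logstash and #elasticsearch",
              "user" => {
            "screen_name" => "example_user",
            ...
        },
          "entities" => {
            "hashtags" => [...],
            ...
        },
        ...
    }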

An important field for the evaluation is the text field, which contains the entire message text. Date, hashtag and user information can be evaluated using aggregations.

Configure Elasticsearch output

To actually index the data, we configure the Elasticsearch output. A configuration for Elasticsearch on the local host can look like this:
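A sketch for Logstash 1.5; the index name twitter, the type tweet and the template file name twitter_template.json are assumptions, not necessarily the names used in the book's repository:

    output {
      elasticsearch {
        # Elasticsearch on the same machine via HTTP (default port 9200)
        host          => "localhost"
        protocol      => "http"
        # index and document type for the tweets (assumed names)
        index         => "twitter"
        document_type => "tweet"
        # register the index template described below
        template      => "twitter_template.json"
        template_name => "twitter"
      }
    }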

We state that we want to access Elasticsearch over HTTP on the same system using the default port 9200, and assign an index name and a document type. So that the data can be mapped sensibly, an index template is also registered; the template file must be located in the same directory as the configuration file. This file describes the index settings and the mapping for the document type.

In the following we go through the template file step by step. So that each excerpt is displayed correctly, the blocks include opening and closing brackets that do not appear at that point in the original file.

At the beginning the template pattern is defined, which must correspond to the index name in the configuration. The order value describes the precedence to be used if several index templates match. Finally, the settings block specifies that only a single shard should be used for the index.
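The beginning of the template file, as a sketch with the assumed name twitter:

    {
        "template": "twitter",
        "order": 1,
        "settings": {
            "number_of_shards": 1
        }
    }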

The mappings section begins by deactivating the _all field, since access can always take place via the individual fields.
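The start of the mappings block, again with added enclosing brackets; the type name tweet is an assumption:

    {
        "mappings": {
            "tweet": {
                "_all": {
                    "enabled": false
                }
            }
        }
    }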

Using the interesting dynamic_templates mechanism, a mapping can be stored for whole groups of fields. In the example, two groups are defined, the first of which refers to only a single field that is to be treated differently. For all other string fields, an additional unanalyzed subfield named raw is added; values longer than 256 characters are ignored for this subfield. These fields can then mainly be used by aggregations that work on the index terms.
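A sketch of the two groups. The assumption here is that the special-cased field is the tweet text, which remains analyzed without a raw subfield, while all other string fields receive one:

    {
        "dynamic_templates": [
            {
                "message_field": {
                    "match": "text",
                    "match_mapping_type": "string",
                    "mapping": {
                        "type": "string",
                        "index": "analyzed"
                    }
                }
            },
            {
                "string_fields": {
                    "match": "*",
                    "match_mapping_type": "string",
                    "mapping": {
                        "type": "string",
                        "index": "analyzed",
                        "fields": {
                            "raw": {
                                "type": "string",
                                "index": "not_analyzed",
                                "ignore_above": 256
                            }
                        }
                    }
                }
            }
        ]
    }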

Finally, two more special fields are configured: text contains the entire content of the message, coordinates the geodata transmitted by Twitter. By mapping the latter as geo_point, the geo functionality of Elasticsearch can also be applied to the data.
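A sketch of the two fields; the nested path coordinates.coordinates follows the structure of the Twitter API and is an assumption here:

    {
        "properties": {
            "text": {
                "type": "string"
            },
            "coordinates": {
                "properties": {
                    "coordinates": {
                        "type": "geo_point"
                    }
                }
            }
        }
    }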

To test our tweets in Elasticsearch, we can for example request a terms aggregation on the unanalyzed content of the user name.
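A sketch of such a request; the index name twitter and the field user.screen_name.raw are assumptions consistent with the configuration above:

    curl -XGET "http://localhost:9200/twitter/_search?pretty" -d '
    {
        "size": 0,
        "aggs": {
            "users": {
                "terms": {
                    "field": "user.screen_name.raw"
                }
            }
        }
    }'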

Not all aggregation examples in the book can be transferred 1:1 to the data indexed via Logstash, since individual fields such as the user name are mapped differently by the Twitter River. Calling up the mapping via the _mapping endpoint should, however, be sufficient as a guide. The data structure is quite similar to that produced by the Twitter River.

Have fun aggregating!