Why do we need Logstash?
In the book, short messages from Twitter are used to demonstrate the properties of the different aggregations. Appendix B describes the installation of the Twitter River, which can be used to index tweets in Elasticsearch. Since rivers are now marked as deprecated and are no longer included in version 2.0 of Elasticsearch, it makes sense to describe an alternative way of indexing tweets in Elasticsearch.
This article describes how we can use Logstash to read in data via the Twitter input and write to Elasticsearch via the Elasticsearch output. Large parts are inspired by an English-language article by David Pilato on the subject.
To use Logstash, we download the appropriate archive from the Elasticsearch website and unzip it. Logstash 1.5.4 is used in the example. The configuration takes place in a single file, named for example `twitter.conf` (the name is arbitrary), which we can store anywhere in the file system.
A Logstash configuration file usually consists of three sections:
- the `input` section configures where data is read from
- incoming data can be manipulated in the `filter` section
- the `output` section describes where the data should be written
For our example we only need the `input` and the `output` sections, because we do not modify the data but write it to Elasticsearch as it is. The complete example can be found in the configuration files in the repository with the example data for the book.
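The skeleton of such a two-section configuration file can be sketched like this (the concrete options inside the blocks are filled in over the course of the article):

```
input {
  twitter {
    # Twitter access configuration
  }
}

output {
  elasticsearch {
    # Elasticsearch connection configuration
  }
}
```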
Configure Twitter input
In the `input` section we configure the `twitter` input.
Access to the Twitter API used to be open, but nowadays you have to create an account in the Twitter developer portal. The various credentials required for access are then available there. However, this type of access only provides part of the data for free; if too much data is retrieved, access can be temporarily throttled or blocked.
The plugin recently gained the ability to access the Twitter sample stream, which is also used in the book with the Twitter River. In this example, however, we use the `keywords` option to state that we are only interested in tweets containing certain keywords.
Finally, via `full_tweet => true` we state that we want to receive the complete tweet data. This returns all the necessary fields.
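A sketch of the resulting `input` section; the four credential values are placeholders for the keys from the developer portal, and the keyword list is only an example:

```
input {
  twitter {
    # credentials from the Twitter developer portal (placeholders)
    consumer_key       => "..."
    consumer_secret    => "..."
    oauth_token        => "..."
    oauth_token_secret => "..."
    # only receive tweets containing these keywords (example list)
    keywords           => ["elasticsearch"]
    # receive the complete tweet data instead of a reduced event
    full_tweet         => true
  }
}
```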
Test Twitter input
We can test the Twitter input by configuring the `stdout` output, which simply prints the incoming data to the command line.
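Such a test output can be sketched as follows; the `rubydebug` codec pretty-prints each event:

```
output {
  stdout {
    # print each incoming event as a Ruby hash
    codec => rubydebug
  }
}
```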
This outputs the data as a Ruby hash, a data structure similar to JSON. An important field for the evaluation is `text`, which contains the entire message text; date, hashtag and user information can also be evaluated using aggregations.
Configure Elasticsearch output
To really index the data, we configure the `elasticsearch` output for an Elasticsearch instance on the local host.
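A sketch of such an `output` section for Logstash 1.5; the index, type and template file names are assumptions for illustration, not values from the original article:

```
output {
  elasticsearch {
    host          => "localhost"
    protocol      => "http"        # talk to the default port 9200 via HTTP
    index         => "twitter"     # assumed index name
    document_type => "tweet"       # assumed type name
    # register an index template from the config file's directory
    template      => "twitter_template.json"  # assumed file name
    template_name => "twitter"                # assumed template name
  }
}
```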
We state that we want to access Elasticsearch over HTTP on the same system, using the default port 9200, and assign an index name and a document type. So that the data can be mapped sensibly, an index template is also registered; the template file must be located in the same directory as the configuration file, under the name given in the `template` option. This file describes the index settings and the mapping for the type.
In the following we go through the template file piece by piece. To keep each excerpt well-formed, opening and closing braces are shown in some blocks that do not appear at that point in the original file.
At the beginning the `template` name is defined, which must correspond to the index name in the Logstash configuration. `order` describes the precedence to be used if several index templates match. Finally, the `settings` block states that only a single shard should be used for the index.
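The beginning of the template file might therefore look like this; the template name is an assumption and must match whatever index name is used in the Logstash configuration:

```
{
  "template": "twitter",
  "order": 0,
  "settings": {
    "index.number_of_shards": 1
  }
}
```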
The mapping area begins by deactivating the `_all` field, since access can always take place via the individual fields.
Using the interesting `dynamic_templates` trick, a mapping can be stored for whole groups of fields. In the example, two groups are stored, one of which refers only to a single field that is to be treated differently. For all other string fields, an additional not-analyzed subfield named `raw` is added, which indexes the original content up to 256 characters. These fields can then mainly be used by aggregations that work on the index terms.
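A sketch of such a `dynamic_templates` block, modeled on the standard Logstash index template; the type name `tweet` and the field matched by the first group are assumptions:

```
"mappings": {
  "tweet": {
    "_all": { "enabled": false },
    "dynamic_templates": [
      {
        "message_field": {
          "match": "text",
          "mapping": { "type": "string" }
        }
      },
      {
        "string_fields": {
          "match": "*",
          "match_mapping_type": "string",
          "mapping": {
            "type": "string",
            "fields": {
              "raw": {
                "type": "string",
                "index": "not_analyzed",
                "ignore_above": 256
              }
            }
          }
        }
      }
    ]
  }
}
```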
Finally, two more special fields are configured: the `text` field contains the entire content of the message, and the `coordinates` field the geodata transmitted by Twitter. By mapping the latter as `geo_point`, the geo functionalities of Elasticsearch can also be applied to the data.
To test our tweets in Elasticsearch, we can, for example, request a terms aggregation on the original, not-analyzed content of the user name.
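Assuming the index is called `twitter` and the user name is stored under `user.screen_name` (both assumptions), such a request could look like this:

```
POST /twitter/_search
{
  "size": 0,
  "aggs": {
    "users": {
      "terms": {
        "field": "user.screen_name.raw"
      }
    }
  }
}
```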
Not all aggregation examples in the book can be mapped 1:1 to the data indexed via Logstash, since individual fields such as the user name are mapped differently than with the Twitter River. Retrieving the mapping via the `_mapping` endpoint should, however, be sufficient as a guide. The data structure is definitely similar enough to that of the Twitter River.
Have fun aggregating!