Tag: ELK

Logstash – Key Value Parsing

The KV filter plug-in is a quick way to split key/value pairs from message data. An example syslog message where there is some prefix information followed by key/value pairs. In this case, each pair is separated by a semicolon and they keys and values are separated by a colon.

<140>1 2023-04-13T17:43:00+01:00 DEVICENAME5@10.1.2.3 EVENT 2693 [meta sequenceId="33"]"time-stamp":2023-04-13T17:43:00+01:00;"session-id":;"user-name":;"id":0;"type":CREATE;"entity":not-alarmed-event-notification

The first thing you need to do is to parse the message so the key/value pair data is in a single field.

"message" => "<%{POSINT:syslog_pri}>%{NUMBER:stuff} %{DATA:syslog_timestamp}+%{DATA:syslog_timestamp_offset} %{SYSLOGHOST:logsource}@%{DATA:sourceip} %{DATA:log_type} %{NUMBER:event_id} \[meta sequenceId=\"%{DATA:meta_sequence_id}\"\] %{GREEDYDATA:kvfields}"

Now that the data is available in kvfields, the kv filter can be used to parse the data. Indicate which character splits fields, which character splits the key and value, and what field is the source of the key/value pair data. Additionally, if you need to trim data from keys (trim_key) or values (trim_value), you can do so. In this case, each of the keys is quoted. I do not wish to carry the quotes through on the field name, so I am trimming the double-quote character from keys.

kv {
     field_split => ";"
     value_split  => ":"
     trim_key  => '"'
     source  => "kvfields"
}

You can recursively parse data, if needed, and the key/values parsed from a value will be sub-elements of the parent key.

Ruby

Sometimes more advanced logic is required to parse message content. There is a ruby filter plugin that allows you to run ruby code. As an example, the “attributes” key contains key/value pairs but the same delimiter is used for both key/value and the list of pairs.

<140>1 2023-04-13T17:57:00+01:00 DEVICENAME5@10.1.2.3 EVENT 2693 [meta sequenceId="12"] "time-stamp":2023-04-13T17:57:00+01:00;"session-id":;"user-name":;"id":0;"type":CREATE;"entity":not-alarmed-event-notification;"attributes":"condition-type;T-BE-FEC;condition-description;Bit Error Forward Error Correction HT = 325651545656;location;near-end;direction;ingress;time-period;1min;service-affect;NSA;severity-level;cleared;fm-entity;och-os-1/2/2;fm-entity-type;OCH-OS;occurrence-date-time;2023-04-13T17:55:55+01:00;alarm-condition-type;standing;extension-description;;last-severity-level;not-applicable;alarm-id;85332F351D9EA5FC7BB52C1C75F85B5527251155;"

If you break the string into an array on the delimiter, even elements are the key and the +1 odd element is the corresponding value.

ruby {
     code => "
          strattributes = event.get('[attributes]')
          arrayattributebreakout = strattributes.split(';')
          if arrayattributebreakout.count > 0
               arrayattributebreakout.each_with_index do |element,index|
               if index.even?
                    event.set(arrayattributebreakout[index], arrayattributebreakout[index+1])
               end
          end
       end
       "
}

 

 

Increasing Kibana CSV Report Max Size

The default size limit for CSV reports in Kibana is 10 meg. Since that’s not enough for some of our users, I’ve been testing increases to the xpack.reporting.csv.maxSizeBytes value.

We’re still limited by the ES http.max_content_length value — which the documentation seems pretty confident shouldn’t be increased because the system can become unstable. Increasing the max Kibana report size to 100mb just yields a different error because ES doesn’t like it. 75 exhausted the JavaScript heap (?) – which I could get around by setting  NODE_OPTIONS=–max_old_space_size=4096 … but that just led to the server abending whenever a report was run (in fact, I had to remove the reports I tried to run from the server to get everything back into a working state). Increasing the limit to 50 meg, though, didn’t do anything unreasonable in dev. So somewhere between 50 and 75 meg is our upper limit, and 50 seemed like a nice round number to me.

Notes on resource usage – Data is held in memory as a report is created. We’d see an increase in memory/CPU usage while the report is being generated (or, I guess more accurately, a longer time during which the memory/CPU usage is increased because if a 10 meg report takes 30 seconds to run then a 50 meg report is going to take 2.5 minutes to run … and the memory/CPU usage is pretty much the same during the “a report is running” period).

Then, though, the report is stashed in ElasticSearch for user(s) to retrieve within .reporting* indicies. And that’s where things get a little silly — architecturally, this is just another index; it ages off with a lifecycle policy if one exists. But it looks like they never created a lifecycle management policy. So you can still retrieve reports run a little over two years ago! We will certainly want to set up a policy to clean up old reports … just have to decide how long is reasonable.

 

Visualizing GeoIP Information in Kibana

Before we can use map details in Kibana visualizations, we need to add fields with the geographic information. The first few steps are something the ELK admin staff will need to do in order to map source and/or destination IPs to geographic information.

First update the relevant index template to map the location information into geo-point fields – load this JSON (but, first, make sure there aren’t existing mappings otherwise you’ll need to merge the existing JSON in with the new elements for geoip_src and geoip_dst

{
  "_doc": {
    "_meta": {},
    "_source": {},
    "properties": {
      "geoip_dst": {
        "dynamic": true,
        "type": "object",
        "properties": {
          "ip": {
            "type": "ip"
          },
          "latitude": {
            "type": "half_float"
          },
          "location": {
            "type": "geo_point"
          },
          "longitude": {
            "type": "half_float"
          }
        }
      },
      "geoip_src": {
        "dynamic": true,
        "type": "object",
        "properties": {
          "ip": {
            "type": "ip"
          },
          "latitude": {
            "type": "half_float"
          },
          "location": {
            "type": "geo_point"
          },
          "longitude": {
            "type": "half_float"
          }
        }
      }
    }
  }
}

First, click on the index template name to view the settings. Click to the ‘mappings’ tab and copy what is in there

Munge in the two ‘properties’ in the above JSON. Edit the index template

Click to the “Mappings” section and use “Load JSON” to import the new mapping configuration

Paste in your JSON & click to “Load & Overwrite”

Voila – you will have geo-point items in the template.

Next, the logstash pipeline needs to be configured to enrich log records with geoip information. There is a geoip filter available, which uses the MaxMind GeoIP database (this is refreshed automatically; currently, we do not merge in any geoip information for the private network address spaces) . You just need to indicate what field(s) have the IP address and where the location information should be stored. You can have multiple geographic IP fields – in this example, we map both source and destination IP addresses.

        geoip {
                source => "src_ip"
                target => "geoip_src"
                add_field => [ "[geoip][location]", "%{[geoip][longitude]}" ]
                add_field => [ "[geoip][location]", "%{[geoip][latitude]}"  ]
        }
        geoip {
                source => "dst_ip"
                target => "geoip_dest"
                add_field => [ "[geoip][location]", "%{[geoip][longitude]}" ]
                add_field => [ "[geoip][location]", "%{[geoip][latitude]}"  ]
        }

E.G.

One logstash is restarted, the documents stored in Kibana will have geoip_src and geoip_dest fields:

Once relevant data is being stored, use the refresh-looking button on the index pattern(s) to refresh the field list from stored data. This will add the geo-point items into the index pattern.

Once GeoIP information is available in the index pattern, select the “Maps” visualization

Leave the road map layer there (otherwise you won’t see the countries!)

Select ‘Documents’ as the data source to link in ElasticSearch data

Select the index pattern that contains your data source (if your index pattern does not appear, then Kibana doesn’t recognize the pattern as containing geographic fields … I’ve had to delete and recreate my index pattern so the geographic fields were properly mapped).

And select the field(s) that contain geographic details:

You can name the layer

And add a tool tip that will include the country code or name

Under “Term joins”, add a new join. Click on “Join –select–” to link a field from the map to a field in your dataset.

In this case, I am joining the two-character country codes —

Normally, you can leave the “and use metric count” in place (the map is color coded by the number of requests coming from each country). If you want to add a filter, you can click the “where — add filter –” link to edit the filter.

In this example, I don’t want to filter the data, so I’ve left that at the default.

Click “Save & close” to save the changes to the map visualization. To view your map, you won’t find it under Visualizations – instead, click “Maps” along the left-hand navigation menu.

Voila – a map where the shading on a country gets darker the more requests have come from the country.

Internal Addresses

If we want to (and if we have information to map IP subnets to City/State/Zip/LatLong, etc), we can edit the database used for GeoIP mappings — https://github.com/maxmind/getting-started-with-mmdb provides a perl module that interacts with the database file. That isn’t currently done, so internal servers where traffic is sourced primarily from private address spaces won’t have particularly thrilling map data.

 

Filebeat – No Harvesters Starting

Using filebeat-7.17.4, we have seen instances where no harvesters will start and no IP communication is established with the logstash servers. Stopping the filebeat service, confirming the process and any associated network ports are closed, and then starting the service does not restore communication. In this situation, we have had to restart the ​logstash​ servers and immediately begin to see harvesters spin up in the log files:

2022-09-15T12:02:20.018-0400    INFO    [input.harvester]       log/harvester.go:309    Harvester started for paths: 
[/var/log/network/network.log /opt/splunk/var/log/syslog-ng/*/*.log]       
{"input_id": "bf04e307-7fb3-5555-87d5-55555d3fa8d6", "source": "/var/log/syslog-ng/mr01.example.net/network.log",
 "state_id": "native::2228458-65570", "finished": false, "os_id": "2225548-64550", "old_source": 
"/var/log/syslog-ng/mr01.example.net/network.log", "old_finished": true, "old_os_id": "2225548-64550", 
"harvester_id": "36555c83-455c-4551-9f55-dd5555552771"}

Logstash – Setting Config with Environment Variables

I took over management of an ElasticSearch environment that has a lot of configuration inconsistencies. Unfortunately, the previous owners weren’t the ones who built the environment either … so no one knew why ServerX did one thing and ServerY did another. Didn’t mess with it (if it’s working, don’t break it!) until we encountered some users who couldn’t find their data — because, depending on which logstash server information transits, stuff ends up in different indices. So now we’re consolidating configurations and I am going to pull the “right” config files into a git repo so we can easily maintain consistency.

Except … any repository becomes in scope for security scanning. And, really, typing your password in clear text isn’t a wonderful plan. So my first step is using environment variables as configuration parameters in logstash.

The first thing to do is set the environment variables somewhere logstash can use them. In my case, I’m using a unit file that sources its environment from /etc/default/logstash

Once the environment variables are there, enclose the variable name in ${} and use it in the config:

Logstash Config

Restart ElasticSearch and verify the pipeline(s) have started successfully.

OpenSearch Evaluation Overview

What is ElasticSearch?

ElasticSearch, based on the Lucene search software, is a distributed search and analytics application which ingests, stores, and indexes data. Kibana is a web-based front-end providing user access to data stored within ElasticSearch.

What is OpenSearch?

In short, it’s the same but different. OpenSearch is also based on the Lucene search software, is designed to be a distributed search and analytics application, and ingests/stores/indexes data. If it’s essentially the same thing, why does OpenSearch exist? ElasticSearch was initially licensed under the open-source Apache 2.0 license – a rather permissive free software license. ElasticCo did not agree with how their software was being used by Amazon; and, in 2021, the license for ElasticSearch was changed to Server Side Public License (SSPL). One of the requirements of SSPL is that anyone who implements the software and sells their implementation as a service needs to publish their source code under the SSPL license – not just changes made to the original program but all other software a user would require to run the software-as-a-service environment for themselves. Amazon used ElasticSearch for their Amazon Elasticsearch Service offering, but was unable/unwilling to continue doing so under the new license terms. In April of 2021, Amazon Web Services created a fork of ElasticSearch as the basis for OpenSearch.

Differences Between OpenSearch and ElasticSearch

After the OpenSearch fork was created, the product roadmap for ElasticSearch was driven by ElasticCo and the roadmap for OpenSearch was community driven (with significant oversight and input from Amazon) – this means the products are not identical although they provide the same core functionality. Elastic publishes a list of features unique to ElasticSearch, and the underlying machine learning algorithms are different. However, the important components of the “unique” feature list have been implemented in OpenSearch over time.

The biggest differences are price and support. OpenSearch is free software – there is no purchasing a license to unlock features. It does appear that Amazon has an internal iteration of OpenSearch as their as-a-service offering provides features not available in the open-source OpenSearch code base, but that is only available for cloud customers. ElasticCo offers ElasticSearch as free software with a limited feature set. One critical limitation is user authentication mechanisms – we are unable to implement PingID as an authentication source with the free feature set. Advanced features not currently used today – machine learning based anomaly detection, as an example – are also unavailable in the free iteration of ElasticSearch. With an ElasticSearch license, we would also get vendor support. OpenSearch does not offer vendor support, although there are third party companies that will provide support services.

Both OpenSearch and ElasticSearch have community-based support forums available – I have gotten responses from developers on both forums for questions regarding usage nuances.

Salient Feature Comparison

Most companies have a list differentiating their product from the products offered by competitors – but the important thing is how the products differ as it relates to how an individual customer uses the product. A car that can have a fresh cup of espresso waiting for you as you leave for work might be amazing to some people, but those who don’t drink coffee won’t be nearly as impressed. So how do the two products compare for me?

Data ingestion – Data is ingested using the same mechanisms – ElasticCo’s filebeat and logstash are important components of data ingestion, and these components remain unchanged. This means existing processes that feed data into ElasticSearch today would not need to be changed to begin ingesting data into OpenSearch.

Data storage – Both products distribute searchable data over a cluster of servers. Data storage is “tiered” as hot, warm, and cold which allows less used data to reside on slower, less expensive resources. We have confirmed that ingested data is properly housed on cluster nodes designated for ‘hot’ storage and moved to ‘warm’ and ‘cold’ storage as dictated by defined policies. The item count to size ratio is similar between both products (i.e. storing ten million documents takes about the same amount of disk space). OpenSearch provides the ability to alert on transition failures (moving from hot to warm, for instance) which will reduce the amount of manual “health checking” required for the environment.

Search and aggregation – Both products allow both GUI and API searches of indexed data. Data can be aggregated as it is searched – returning the max/min/average value from a search, a count of records matching search criterion, creating sub-aggregations. ElasticSearch does have aggregations not available in OpenSearch, although these could be handled through custom scripted aggregations and many have corresponding GitHub issues requesting such an aggregation be added to OpenSearch (e.g. weighted average, geohash grid, or geotile grid)

Aggregation Name ElasticSearch 8.x OpenSearch 2.x
auto-interval date histogram x
categorize text x
children x
composite x
frequent items x
geohex grid x
geotile grid x
ip prefix x
multi terms x
parent x
random sampler x
rare terms x
terms x
variable width histogram x
boxplot x
geo-centroid x
geo-line x
median absolute deviation x
rate x
string stats x
t-test x
top metrics x
weighted avg x

Alerting – ElastAlert2 can be used to provide the same index monitoring and alerting functionality that ElastAlert currently provides with ElasticSearch. Additionally, OpenSearch includes a built-in alerting capability that might allow us to streamline the functionality into the base OpenSearch implementation.

API Access – Both ElasticSearch and OpenSearch provide API-based access to data. Queries to the ElasticSearch API endpoint returned expected data when directed to the OpenSearch API endpoint. The ElasticSearch python module can be used to access OpenSearch data, although there is a specific OpenSearch module as well.

UX – ElasticSearch allows users to search and visualize data through Kibana; OpenSearch provides graphical user access in OpenSearch Dashboard. While the “look and feel” of the GUI differs (Kibana 8 looks different than the Kibana 7 we use today, too), the user functionality remains the same.

Kibana 7.7 OpenSearch Dashboards 2.2

Kibana uses “KQL” – Kibana Query Language – to compose searches while OpenSearch Dashboards uses “DQL” – Dashboards Query Language, but queries used in Kibana were used in OpenSearch Dashboard without modification.

Currently used visualizations are available in both Kibana and OpenSearch Dashboards

Kibana Visualization OpenSearch Dashboards Visualization

But there are some currently unused visualizations that are unique to each product.

Visualization Kibana OpenSearch Dashboard
Area x x
Controls x x
Coordinate Map x
Data Table x x
Gantt Chart x
Gauge x x
Goal x x
Heat Map x x
Horizonal Bar x x
Lens x
Line x x
Maps x
Markdown x x
Metric x x
Pie x x
Region Map x
Tag Cloud x x
Timeline x x
TSVB x x
Vega x x
Vertical Bar x x

Dashboards can be used to group visualizations.

Kibana OpenSearch Dashboards

New features will be available in either OpenSearch or a licensed installation of ElasticSearch. Currently data is either retained as written or aged out of the system to save disk space. Either path allows us to roll up data – as an example retaining the total number of users per month or total bytes per month instead of retaining each detailed record. Additionally, we will be able to use the “anomaly detection” which is able to monitor large volumes of index data and highlight unusual events. Both newer ElasticSearch versions and OpenSearch offer a Tableau connector which may make data stored in the platform more accessible to users.

 

Logstash – Filtering data with Ruby

I’ve been working on forking log data into two different indices based on an element contained within the record — if the filename being sent includes the string “BASELINE”, then the data goes into the baseline index, otherwise it goes into the scan index. The data being ingested has the file name in “@fields.myfilename”

It took a while to figure out how to get the value from the current data — event.get(‘[@fields][myfilename]’) to get the @fields.myfilename value.

The following logstash config accepts JSON inputs, parses the underscore-delimited filename into fields, replaces the dashes with underscores as KDL doesn’t handle dashes and wildcards in searches, and adds a flag to any record that should be a baseline. In the output section, that flag is then used to publish data to the appropriate index based on the baseline flag value.

input {
  tcp {
    port => 5055
    codec => json
  }
}
filter {
        # Sample file name: scan_ABCDMIIWO0Y_1-A-5-L2_BASELINE.json
        ruby {  code => "
                        strfilename = event.get('[@fields][myfilename]')
                        arrayfilebreakout = strfilename.split('_')
                        event.set('hostname', arrayfilebreakout[1])
                        event.set('direction',arrayfilebreakout[2])
                        event.set('parseablehost', strfilename.gsub('-','_'))

                        if strfilename.downcase =~ /baseline/
                                event.set('baseline', 1)
                        end" }
}
output {
        if [baseline] == 1 {
                elasticsearch {
                        action => "index"
                        hosts => ["https://elastic.example.com:9200"]
                        ssl => true
                        cacert => ["/path/to/logstash/config/certs/My_Chain.pem"]
                        ssl_certificate_verification => true
                        # Credentials go here
                        index => "ljr-baselines"
                }
        }
        else{
              elasticsearch {
                        action => "index"
                        hosts => ["https://elastic.example.com:9200"]
                        ssl => true
                        cacert => ["/path/to/logstash/config/certs/My_Chain.pem"]
                        ssl_certificate_verification => true
                        # Credentials go here
                        index => "ljr-scans-%{+YYYY.MM.dd}"
                }
        }
}

Kibana Visualization – Vega Line Chart with Baseline

There’s often a difference between hypothetical (e.g. the physics formula answer) and real results — sometimes this is because sciences will ignore “negligible” factors that can be, well, more than negligible, sometimes this is because the “real world” isn’t perfect. In transmission media, this difference is a measurable “loss” — hypothetically, we know we could send X data in Y delta-time, but we only sent X’. Loss also happens because stuff breaks — metal corrodes, critters nest in fiber junction boxes, dirt builds up on a dish. And it’s not easy, when looking at loss data at a single point in time, to identify what’s normal loss and what’s a problem.

We’re starting a project to record a baseline of loss for all sorts of things — this will allow individuals to check the current loss data against that which engineers say “this is as good as it’s gonna get”. If the current value is close … there’s not a problem. If there’s a big difference … someone needs to go fix something.

Unfortunately, creating a graph in Kibana that shows the baseline was … not trivial. There is a rule mark that allows you to draw a straight line between two points. You cannot just say “draw a line at ​y​ from 0 to some large value that’s going to be off the graph. The line doesn’t render (say, 0 => today or the year 2525). You cannot just get the max value of the axis.

I finally stumbled across a series of data contortions that make the baseline graphable.

The data sets I have available have a datetime object (when we measured this loss) and a loss value. For scans, there may be lots of scans for a single device. For baselines, there will only be one record.

The joinaggregate transformation method — which appends the value to each element of the data set — was essential because I needed to know the largest datetime value that would appear in the chart.

           , {“type”: “joinaggregate”, “fields”: [“transformedtimestamp”], “ops”: [“max”], “as”: [“maxtime”]}

The lookup transformation method — which can access elements from other data sets — allowed me to get that maximum timestamp value into the baseline data set. Except … lookup needs an exact match in the search field. Luckily, it does return a random (I presume either first or last … but it didn’t matter in this case because all records have the same max date value) record when multiple matches are found.

So I used a formula transformation method to add a constant to each record as well

           , {“type”: “formula”, “as”: “pi”, “expr”: “PI”}

Now that there’s a record to be found, I can add the max time from our scan data into our baseline data

                , {“type”: “lookup”, “from”: “scandata”, “key”: “pi”, “fields”: [“pi”], “values”: [“maxtime”], “as”: [“maxtime”]}

Voila — a chart with a horizontal line at the baseline loss value. Yes, I randomly copied a record to use as the baseline and selected the wrong one (why some scans are below the “good as it’s ever going to get” baseline value!). But … once we have live data coming into the system, we’ll have reasonable looking graphs.

The full Vega spec for this graph:

{
    "$schema": "https://vega.github.io/schema/vega/v4.json",
      "description": "Scan data with baseline",
    "padding": 5,

    "title": {
        "text": "Scan Data",
        "frame": "bounds",
        "anchor": "start",
        "offset": 12,
        "zindex": 0
      },
    "data": [
    {
        "name": "scandata",
        "url": {
            "%context%": true,
            "%timefield%": "@timestamp",
            "index": "traces-*",
            "body": {
            "sort": [{
                "@timestamp": {
                    "order": "asc"
                }
            }],
            "size": 10000,
            "_source":["@timestamp","Events.Summary.total loss"]
            }
        }
        ,"format": { "property": "hits.hits"}
        ,"transform":[
            {"type": "formula", "expr": "datetime(datum._source['@timestamp'])", "as": "transformedtimestamp"}
            , {"type": "joinaggregate", "fields": ["transformedtimestamp"], "ops": ["max"], "as": ["maxtime"]}
            , {"type": "formula", "as": "pi", "expr": "PI"}
        ]
    }
  ,
   {
        "name": "baseline",
        "url": {
            "%context%": true,
            "index": "baselines*",
            "body": {
                "sort": [{
                    "@timestamp": {
                        "order": "desc"
                    }
                }],
                "size": 1,
                "_source":["@timestamp","Events.Summary.total loss"]
            }
        }
        ,"format": { "property": "hits.hits" }
        ,"transform":[
                {"type": "formula", "as": "pi", "expr": "PI"}
                , {"type": "lookup", "from": "scandata", "key": "pi", "fields": ["pi"], "values": ["maxtime"], "as": ["maxtime"]}
        ]
  }
]      
,
    "scales": [
      {
        "name": "x",
        "type": "point",
        "range": "width",
        "domain": {"data": "scandata", "field": "transformedtimestamp"}
      },
      {
        "name": "y",
        "type": "linear",
        "range": "height",
        "nice": true,
        "zero": true,
        "domain": {"data": "scandata", "field": "_source.Events.Summary.total loss"}
      }
    ],
        "axes": [
      {"orient": "bottom", "scale": "x"},
      {"orient": "left", "scale": "y"}
    ],
     "marks": [
                {
            "type": "line",
            "from": {"data": "scandata"},
            "encode": {
              "enter": {
                "x": { "scale": "x", "field": "transformedtimestamp", "type": "temporal",
      "timeUnit": "yearmonthdatehourminute"},
                "y": {"scale": "y",       "type": "quantitative","field": "_source.Events.Summary.total loss"},
                "strokeWidth": {"value": 2},
                "stroke": {"value": "green"}
              }
            }
          }
                 ,        {
            "type": "rule",
            "from": {"data": "baseline"},
            "encode": {
              "enter": {
                "stroke": {"value": "#652c90"},
                "x": {"scale": "x", "value": 0},
                "y": {"scale": "y",      "type": "quantitative","field": "_source.Events.Summary.total loss"},
                "x2": {"scale": "x","field": "maxtime", "type": "temporal"},
                "strokeWidth": {"value": 4},
                "opacity": {"value": 0.3}
              }
            }
          }
     ]         
}

Vega Visualization when Data Element Name Contains At Symbol

We have data created by an external source (i.e. I cannot just change the names used so it works) — the datetime field is named @timestamp and I had an awful time figuring out out how to address that element within a transformation expression.

Just to make sure I wasn’t doing something silly, I created a copy of the data element named without the at symbol. Voila – transformedtimestamp is populated with a datetime element.

This works fine if the data element is named 'timestamp'

I finally figured it out – it appears that I have encountered a JavaScript limitation. Instead of using the dot-notation to access the element, the array subscript method works – not datum.@timestamp in any iteration or with any combination of escapes.

enter image description here

 

Logstash

General Info

Logstash is a pipeline based data processing service. Data comes into logstash, is manipulated, and is sent elsewhere. The source is maintained on GitHub by ElasticCo.

Installation

Logstash was downloaded from ElasticCo and installed from a gzipped tar archive to the /opt/elk/logstash folder.

Configuration

The Logstash server is configured using the logstash.yml file.

Logstash uses Log4J 2 for logging. Logging configuration is maintained in the log4j2.properties file

Logstash is java-based, and the JVM settings are maintained in the jvm.options file – this includes min heap space, garbage collection configuration, JRuby settings, etc.

Logstash loads the pipelines defined in /opt/elk/logstash/config/pipelines.yml – each pipeline needs an ID and a path to its configuration. The path can be to a config file or to a folder of config files for the pipeline. The number of workers for the pipeline defaults to the number of CPUs, so we normally define a worker count as well – this can be increased as load dictates.

– pipeline.id: LJR
pipeline.workers: 2
path.config: “/opt/elk/logstash/config/ljr.conf”

Each pipeline is configured in an individual config file that defines the input, any data manipulation to be performed, and the output.

Testing Configuration

As we have it configured, you must reload Logstash to implement any configuration changes. As errors in pipeline definitions will prevent the pipeline from loading, it is best to test the configuration prior to restarting Logstash.

/opt/elk/logstash/bin/logstash –config.test_and_exit -f ljr_firewall_logs_wip.conf

Some errors may occur – if the test ends with “Configuration OK”, then it’s OK!

Automatic Config Reload

The configuration can automatically be reloaded when changes to config files are detected. This doesn’t give you the opportunity to test a configuration prior to it going live on the server (once it’s saved, it will be loaded … or not loaded if there’s an error)

Input

Input instructs logstash about what format data the pipeline will receive – is JSON data being sent to the pipeline, is syslog sending log data to the pipeline, or does data come from STDIN? The types of data that can be received are defined by the input plugins. Each input has its own configuration parameters. We use Beats, syslog, JSON (a codec, not a filter type), Kafka, stuff

The input configuration also indicates which port to use for the pipeline – this needs to be unique!

Input for a pipeline on port 5055 receiving JSON formatted data

Input for a pipeline on port 5100 (both TCP and UDP) receiving syslog data

Output

Output is similarly simple – various output plugins define the systems to which data can be shipped. Each output has its own configuration parameters – ElasticSearch, Kafka, and file are the three output plug-ins we currently use.

ElasticSearch

Most of the data we ingest into logstash is processed and sent to ElasticSearch. The data is indexed and available to users through ES and Kibana.

Kafka

Some data is sent to Kafka basically as a holding queue. It is then picked up by the “aggregation” logstash server, processed some more, and relayed to the ElasticSearch system.

File

File output is generally used for debugging – seeing the output data allows you to verify your data manipulations are working property (as well as just make sure you see data transiting the pipeline without resorting to tcpdump!).

Filter

Filtering allows data to be removed, attributes to be added to records, and parses data into fields. The types of filters that can be applied are defined by the filter plugins. Each plugin has its own documentation. Most of our data streams are filtered using Grok – see below for more details on that.

Conditional rules can be used in filters. This example filters out messages that contain the string “FIREWALL”, “id=firewall”, or “FIREWALL_VRF” as the business need does not require these messages, so there’s no reason to waste disk space and I/O processing, indexing, and storing these messages.

This example adds a field, ‘sourcetype’, with a value that is based on the log file path.

Grok

The grok filter is a Logstash plugin that is used to extract data from log records – this allows us to pull important information into distinct fields within the ElasticSearch record. Instead of having the full message in the ‘message’ field, you can have success/failure in its own field, the logon user in its own field, or the source IP in its own field. This allows more robust reporting. If the use case just wants to store data, parsing the record may not be required. But, if they want to report on the number of users logged in per hour or how much data is sent to each IP address, we need to have the relevant fields available in the document.

Patterns used by the grok filter are maintained in a Git repository – the grok-patterns contains the data types like ‘DATA’ in %{DATA:fieldname}

The following are the ones I’ve used most frequently:

Name Field Type Pattern Notes Notes
DATA Text data .*? This does not expand to the most matching characters – so looking for foo.*?bar in “foobar is not really a word, but foobar gets used a lot in IT documentation” will only match the bold portion of the text
GREEDYDATA Text data .* Whereas this matches the most matching characters – so foo.*bar in “foobar is not really a word, but foobar gets used a lot in IT documentation” matches the whole bold portion of the text
IPV4 IPv4 address
IPV6 IPv6 address
IP IP address – either v4 or v6 (?:%{IPV6}|%{IPV4}) This provides some flexibility as groups move to IPv6 – but it’s a more complex filter, so I’ve been using IPV4 with the understanding that we may need to adjust some parsing rules in the future
LOGLEVEL Text data Regex to match list of standard log level strings – provides data validation over using DATA (i.e. if someone sets their log level to “superawful”, it won’t match)
SYSLOGBASE Text data This matches the standard start of a syslog record. Often used as “%{SYSLOGBASE} %{GREEDYDATA:msgtext}” to parse out the timestamp, facility, host, and program – the remainder of the text is mapped to “msgtext”
URI Text data protocol://stuff text is parsed into the protocol, user, host, path, and query parameters
INT Numeric data (?:[+-]?(?:[0-9]+)) Signed or unsigned integer
NUMBER Numeric data Can include a casting like %{NUMBER:fieldname;int} or %{NUMBER:fieldname;float}
TIMESTAMP_ISO8601 DateTime %{YEAR}-%{MONTHNUM}-%{MONTHDAY}[T ]%{HOUR}:?%{MINUTE}(?::?%{SECOND})?%{ISO8601_TIMEZONE}? There are various other date patterns depending on how the string will be formatted. This is the one that matches YYYYMMDDThh:mm:ss

Parsing an entire log string

In a system with a set format for log data, parsing the entire line is reasonable – and, often, there will be a filter for well-known log types. I.E. if you are using the default Apache HTTPD log format, you don’t need to write a filter for each component of the log line – just match either the HTTPD_COMBINEDLOG or HTTPD_COMMONLOG pattern.

match => { “message” => “%{HTTPD_COMMONLOG}” }

But you can create your own filter as well – internally developed applications and less common vendor applications won’t have prebuilt filter rules.

match => { “message” => “%{TIMESTAMP_ISO8601:logtime} – %{IPV4:srcip} – %{IPV4:dstip} – %{DATA:result}” }

Extracting an array of data

Instead of trying to map an entire line at once, you can extract individual data elements by matching an array of patterns within the message.

match => { “message” => [“srcip=%{IPV4:src_ip}”
, “srcport=%{NUMBER:srcport:int}”
,”dstip=%{IPV4:dst_ip}”
,”dstport=%{NUMBER:dstport:int}”] }

This means the IP and port information will be extracted regardless of the order in which the fields are written in the log record. This also allows you to parse data out of log records where multiple different formats are used (as an example, the NSS Firewall logs) instead of trying to write different parsers for each of the possible string combinations.

Logstash, by default, breaks when a match is found. This means you can ‘stack’ different filters instead of using if tests. Sometimes, though, you don’t want to break when a match is found – maybe you are extracting a bit of data that gets used in another match. In these cases, you can set break_on_match to ‘false’ in the grok rule.

I have also had to set break_on_match when extracting an array of values from a message.

Troubleshooting

Log Files

Logstash logs output to /opt/elk/logstash/logs/logstash-plain.log – the logging level is defined in the /opt/elk/logstash/config/log4j2.properties configuration file.

Viewing Data Transmitted to a Pipeline

There are several ways to confirm that data is being received by a pipeline – tcpdump can be used to verify information is being received on the port. If no data is being received, the port may be offline (if there is an error in the pipeline config, the pipeline will not load – grep /opt/elk/logstash/logs/logstash-plain.log for the pipeline name to view errors), there may be a firewall preventing communication, or the sender could not be transmitting data.

tcpdump dst port 5100 -vv

If data is confirmed to be coming into the pipeline port, add a “file” output filter to the pipeline.

Issues

Data from filebeat servers not received in ElasticSearch

We have encountered a scenario were data from the filebeat servers was not being transmitted to ElasticSearch. Monitoring the filebeat server did not show any data being sent. Restarting the Logstash servers allowed data to be transmitted as expected.