Varobj

2021-04-08

Logstash Usage Guide: Processing nginx Logs



Introduction

Logstash is an open-source data collection engine with real-time pipelining capabilities. It can collect data from disparate sources in real time, normalize it, and ship it to the destinations of your choice, cleansing and democratizing the data for a wide variety of advanced downstream analytics and visualization use cases.

Although Logstash was originally built for log collection, its rich plugin ecosystem has taken it well beyond that. Any type of data can be extracted, cleansed, and even aggregated through the three generic stages of input, filter, and output; more than 200 plugins are available today.

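A minimal pipeline showing the three stages end to end (an illustrative sketch, separate from this article's setup):

input  { stdin { } }                                         # stage 1: read events, one per line, from stdin
filter { mutate { add_field => { "stage" => "demo" } } }     # stage 2: transform/enrich each event
output { stdout { codec => rubydebug } }                     # stage 3: pretty-print each event to stdout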

Common plugins

The default input plugins include:

$ /usr/share/logstash/bin/logstash-plugin list|grep input
logstash-input-elasticsearch
logstash-input-file
logstash-input-http
logstash-input-redis
logstash-input-stdin
logstash-input-syslog
logstash-input-tcp
logstash-input-udp
logstash-input-unix
logstash-input-jdbc
logstash-input-kafka
logstash-input-rabbitmq
..

Commonly used filter plugins include:

$ /usr/share/logstash/bin/logstash-plugin list|grep filter
logstash-filter-aggregate
logstash-filter-csv
logstash-filter-date
logstash-filter-dns
logstash-filter-drop
logstash-filter-elasticsearch
logstash-filter-geoip
logstash-filter-grok
logstash-filter-json
logstash-filter-kv
logstash-filter-memcached
logstash-filter-ruby
logstash-filter-sleep
logstash-filter-split
logstash-filter-throttle
logstash-filter-translate
logstash-filter-truncate
logstash-filter-urldecode
logstash-filter-useragent
logstash-filter-uuid
logstash-filter-xml
..

Commonly used output plugins include:

$ /usr/share/logstash/bin/logstash-plugin list|grep output
logstash-output-csv
logstash-output-elasticsearch
logstash-output-email
logstash-output-file
logstash-output-http
logstash-output-redis
logstash-output-stdout
logstash-output-tcp
logstash-output-udp
logstash-output-kafka
logstash-output-rabbitmq
..

Requirement

We need to export the nginx access log, parsed according to its format, to CSV for later import into a database or similar. The nginx access log format:

log_format  main  '$remote_addr --- $remote_user --- [$time_local] --- $request --- '
                  '"$status" --- $body_bytes_sent --- "$http_referer" --- '
                  '"$http_user_agent" --- "$http_x_forwarded_for"';

Add one line of test data to access.log:

101.80.200.46 --- - --- [06/Apr/2021:17:42:46 +0800] --- GET /api/material/library/get-group-list?_cost= HTTP/1.1 --- "200" --- 2940 --- "-" --- "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.81 Safari/537.36 Maxthon/5.3.8.2000" --- "-"

Install logstash and filebeat

filebeat's job is to deliver the log file, line by line, to logstash's input.

wget https://artifacts.elastic.co/downloads/logstash/logstash-7.12.0-x86_64.rpm

yum -y localinstall logstash-7.12.0-x86_64.rpm

wget https://artifacts.elastic.co/downloads/beats/filebeat/filebeat-7.12.0-x86_64.rpm

yum -y localinstall filebeat-7.12.0-x86_64.rpm
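
To confirm both installs, print the versions (paths assume the rpm layout used above):

/usr/share/logstash/bin/logstash --version
filebeat version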

Add the logstash configuration

$ vim /etc/logstash/conf.d/nginx_access.conf

input {
  beats {
    port => "5044"
  }
}

filter {
  grok {
    match => { "message" => "%{IPORHOST:ip} --- %{USER:user} --- \[(?<datetime>.*)\] --- %{WORD:method} %{NOTSPACE:request} HTTP/%{NUMBER:httpversion} --- \"%{NUMBER:status}\" --- %{NUMBER:bytes} --- \"(?<referer>\S+)\" --- \"(?<ua>.*)\" --- \"(?<forwarded>\S+)\"" }
  }
}

output {
  stdout { codec => rubydebug }
}

Pattern templates you can use are listed at https://hub.fastgit.org/elastic/logstash/blob/v1.4.0/patterns/grok-patterns , and the online tool https://grokdebug.herokuapp.com lets you check whether a grok expression actually matches.

Modify the filebeat configuration

$ vim /etc/filebeat/filebeat.yml

filebeat.inputs:
- type: log
  enabled: true
  paths:
    - /vagrant/access.log
output.logstash:
  hosts: ["localhost:5044"]

/vagrant/access.log contains just the one test line. First, start logstash:

/usr/share/logstash/bin/logstash -f /etc/logstash/conf.d/nginx_access.conf --config.test_and_exit

The --config.test_and_exit flag validates the configuration; if it is correct, the output is:

[INFO ] 2021-04-07 11:12:46.508 [LogStash::Runner] runner - Using config.test_and_exit mode. 
Config Validation Result: OK. Exiting Logstash

Then remove --config.test_and_exit and run logstash again:
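
/usr/share/logstash/bin/logstash -f /etc/logstash/conf.d/nginx_access.conf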

In a new terminal window, start filebeat:

systemctl start filebeat.service
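
If everything is wired up correctly, the logstash window prints the parsed event, roughly as follows (abridged; logstash adds a few bookkeeping fields of its own):

{
             "ip" => "101.80.200.46",
           "user" => "-",
       "datetime" => "06/Apr/2021:17:42:46 +0800",
         "method" => "GET",
        "request" => "/api/material/library/get-group-list?_cost=",
    "httpversion" => "1.1",
         "status" => "200",
          "bytes" => "2940",
        "referer" => "-",
             "ua" => "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.81 Safari/537.36 Maxthon/5.3.8.2000",
      "forwarded" => "-",
    ...
}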

Add GeoIP resolution

$ vim /etc/logstash/conf.d/nginx_access.conf

filter {
  grok ...
  geoip {
    source => "ip"
  }
}
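
The geoip filter resolves the field named in source against the GeoLite2 City database bundled with Logstash, so no extra download is needed for basic lookups.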

After resolution, the output gains an extra item:

"geoip" => {
        "country_name" => "China",
        "region_code" => "SH",
        "city_name" => "Shanghai",
        "ip" => "101.80.200.46",
        "timezone" => "Asia/Shanghai",
        "longitude" => 121.4012,
        "location" => {
            "lat" => 31.0449,
            "lon" => 121.4012
        },
        "latitude" => 31.0449,
        "country_code2" => "CN",
        "region_name" => "Shanghai",
        "continent_code" => "AS",
        "country_code3" => "CN"
}

Add CSV output to the output configuration

$ vim /etc/logstash/conf.d/nginx_access.conf

output {
  csv {
    fields => ["datetime", "ip", "user", "method", "httpversion", "status", "bytes", "request", "ua", "referer", "forwarded", "[geoip][timezone]", "[geoip][country_name]", "[geoip][city_name]", "[geoip][longitude]", "[geoip][latitude]"]
    path => "/vagrant/nginx_access_all.csv"
    csv_options => {
      "write_headers" => true
      "headers" => ["datetime", "ip", "user", "method", "httpversion", "status", "bytes", "request", "ua", "referer", "forwarded", "[geoip][timezone]", "[geoip][country_name]", "[geoip][city_name]", "[geoip][longitude]", "[geoip][latitude]"]
      "col_sep" => ";"
    }
  }
}

Delete filebeat's registry file (so the log is re-read from the beginning) and rerun filebeat:

rm -rf /var/lib/filebeat/registry/filebeat/log.json
systemctl restart filebeat.service

View the CSV file:

$ cat /vagrant/nginx_access_all.csv

datetime;ip;user;method;httpversion;status;bytes;request;ua;referer;forwarded;[geoip][timezone];[geoip][country_name];[geoip][city_name];[geoip][longitude];[geoip][latitude]
06/Apr/2021:17:42:46 +0800;101.80.200.46;-;GET;1.1;200;2940;/api/material/library/get-group-list?is_null_creative=&is_null_cost=&channel_id=0&label_id=0&consume_time_start=2021-03-07&consume_time_end=2021-04-06&make_time_start=2021-03-07&make_time_end=2021-04-06&label_status=0&video_author=%E5%BC%A0%E6%96%87%E6%9D%B0&product_id=0&video_type=0&video_source=0&keywords=&page=1&size=30&sort_field=&sort_type=;;-;-;Asia/Shanghai;China;Shanghai;121.4012;31.0449
datetime;ip;user;method;httpversion;status;bytes;request;ua;referer;forwarded;[geoip][timezone];[geoip][country_name];[geoip][city_name];[geoip][longitude];[geoip][latitude]
06/Apr/2021:17:42:54 +0800;101.80.200.50;-;POST;1.1;200;206;/api/xh-material-library/upload;;-;-;Asia/Shanghai;China;Shanghai;121.4012;31.0449

Optimization: write the CSV header only on the first line

Modify the configuration: at initialization, check whether the CSV file already exists and, if not, write the header to it once; then drop the header options from the csv output block, since (as the output above shows) write_headers repeats the header for every batch of events rather than only on the first line.

filter {
  ruby {
    # At pipeline startup, write the CSV header once if the file is missing or empty.
    init => "
      require 'csv'
      @@csv_file    = '/vagrant/nginx_access_all.csv'
      @@csv_headers = ['datetime', 'ip', 'user', 'method', 'httpversion', 'status', 'bytes', 'request', 'ua', 'referer', 'forwarded', 'geoip_timezone', 'geoip_country_name', 'geoip_city_name', 'geoip_longitude', 'geoip_latitude']
      if !File.exist?(@@csv_file) || File.zero?(@@csv_file)
        CSV.open(@@csv_file, 'w') do |csv|
          csv << @@csv_headers
        end
      end
    "
    # Expose the path and headers to later pipeline stages via @metadata.
    code => "
      event.set('[@metadata][csv_file]', @@csv_file)
      event.set('[@metadata][csv_headers]', @@csv_headers)
    "
  }
  ..
}
..
output {
  csv {
    fields => ["datetime", "ip", "user", "method", "httpversion", "status", "bytes", "request", "ua", "referer", "forwarded", "[geoip][timezone]", "[geoip][country_name]", "[geoip][city_name]", "[geoip][longitude]", "[geoip][latitude]"]
    path => "/vagrant/nginx_access_all.csv"
  }
}

Rerun it; the CSV file now looks like this:

$ cat /vagrant/nginx_access_all.csv
datetime,ip,user,method,httpversion,status,bytes,request,ua,referer,forwarded,geoip_timezone,geoip_country_name,geoip_city_name,geoip_longitude,geoip_latitude
06/Apr/2021:17:42:46 +0800,101.80.200.46,-,GET,1.1,200,2940,/api/material/library/get-group-list?is_null_creative=&is_null_cost=&channel_id=0&label_id=0&consume_time_start=2021-03-07&consume_time_end=2021-04-06&make_time_start=2021-03-07&make_time_end=2021-04-06&label_status=0&video_author=%E5%BC%A0%E6%96%87%E6%9D%B0&product_id=0&video_type=0&video_source=0&keywords=&page=1&size=30&sort_field=&sort_type=,"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.81 Safari/537.36 Maxthon/5.3.8.2000",-,-,Asia/Shanghai,China,Shanghai,121.4012,31.0449
06/Apr/2021:17:42:54 +0800,101.80.200.50,-,POST,1.1,200,206,/api/xh-material-library/upload,"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3861.400 QQBrowser/10.7.4313.400",-,-,Asia/Shanghai,China,Shanghai,121.4012,31.0449

Processing speed

A 2-core / 2 GB machine takes about half an hour to process a 5 GB file. Logstash runs on the JVM and its memory consumption is hefty, around 700 MB for this job.
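
If memory or throughput matters, the usual knobs are the JVM heap in /etc/logstash/jvm.options and the pipeline settings in /etc/logstash/logstash.yml. A sketch with illustrative values (the heap lines match the stock 7.x defaults; tune them for your machine):

# /etc/logstash/jvm.options
-Xms1g
-Xmx1g

# /etc/logstash/logstash.yml
pipeline.workers: 2       # defaults to the number of CPU cores
pipeline.batch.size: 125  # events per worker per batch (the default)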