I covered how to install baze and introduced suc yesterday. Today, we're looking at statistik.

If you have a bunch of numbers and want a quick look at their distribution, statistik is the simple command for you.
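As a quick sanity check (assuming statistik is already on your PATH), you can feed it a known distribution:
> seq 1 100 | statistik
Given the implementation shown at the end of this post, that should report a size of 100, a sum of 5050.0, and percentiles that track the inputs themselves.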
For example, let's look at how many requests each IP made to ident.me. First, we'll check the first 5 requests of the last hour to confirm we're extracting the data correctly (I censored the results):
> doas head -n5 /var/log/nginx/access.log | jq -r .r
9.255.1.7
2600:40:4201:7a10::abc7
8.136.246.156
3.2.132.135
3.2.144.1
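This works because these access logs are JSON lines with the remote address in an r field, which jq's .r filter extracts one per line. The other keys below are hypothetical, but a log line might look roughly like:
{"r": "203.0.113.7", "m": "GET", "u": "/", "s": 200}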
We wonder if some IPs make more requests than others, and what the distribution looks like. Let's find out by looking at the first 100,000 requests now, passing them through suc to get the number of requests per IP, then statistik for analysis:
> doas head -n100000 /var/log/nginx/access.log | jq -r .r | suc | statistik
size: 45157
sum: 100000.0
avg: 2.2144960914143987
min: 1.0
p10: 1.0
p20: 1.0
p25: 1.0
p30: 1.0
p40: 1.0
p50: 1.0
p60: 1.0
p70: 1.0
p75: 2.0
p80: 2.0
p90: 3.0
p99: 18.0
p999: 115.0
p9999: 277.0
max: 741.0
We see here that those 100,000 requests came from 45,157 unique IPs. A bit over 70% of IPs made only 1 request; at least 90% made 3 or fewer requests. The top 1% of IPs made 18 or more requests, with the top IP making 741.
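For this pipeline to work, suc's output must lead with the count (much like sort | uniq -c), since statistik, as we'll see below, parses each line with to_f and only the leading number matters. A made-up line would look something like:
741 203.0.113.7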
I occasionally run public analytics on one hour of service. For comparison, here is the latest graph of the number of IPs per request count; in my opinion, it's much harder to interpret:
You might not be surprised to learn that statistik's implementation is trivial:
#!/usr/bin/env ruby

# Read one number per line; to_f ignores anything after the leading number.
s = []
sum = 0
ARGF.each_line do |l|
  n = l.to_f
  s << n
  sum += n
end

s = s.sort
cnt = s.length

print "size: #{cnt}\n"
print "sum: #{sum}\n"
print "avg: #{sum/cnt}\n\n"

# Percentiles are nearest-rank lookups into the sorted array.
print "min: #{s.first}\n"
[10, 20, 25, 30, 40, 50, 60, 70, 75, 80, 90, 99].each do |p|
  print "p#{p}: #{s[s.length*p/100]}\n"
end
print "p999: #{s[s.length*999/1000]}\n"
print "p9999: #{s[s.length*9999/10000]}\n"
print "max: #{s.last}\n"