splunk oddity #17681 – stats vs table

It’s fairly common to want to table the data you’ve found in a search in Splunk – heck, if you’re not prettying the data up somewhy, why are you bothering with the tool?

But I digress.

There are two (at least) ways of making a table – you can use the |table <field(s)> syntax, or you can use |stats <some function> <field(s)> approach.

Interestingly, in my testing in both test and production environments, using the |stats... approach is consistently 10-15% faster than the |table... option.

Why? I don’t know. He’s on third. And I don’t give a darn!

This is another case of technical intricacies mattering … but I don’t know what is going on under the hood that makes the apparently-more-complex option run faster than the apparently-simler option.

Maybe someday someone in Splunk engineering will be able to enlighten me to that.

This reminds me a bit of an optimization I was able to help a friend with upwards of 12 years ago – they had queries running in MySQL that were taking forever to complete (and by “forever”, I mean they were running sometimes 4-5 times a long as the interval between running them (they ran every 5 mintues, but could take 20+ minutes to finish!)).

What I found, at least back in the dark days of MySQL 3.x was that using IN(...) was loads faster than using OR statements.

So a query that had a clause WHERE name IN("bob","sarah","mike","terry","sue") would run anywhere from 20-90% quicker than the logically-equivalent WHERE name="bob" OR name="sarah" OR name="mike" OR name="terry" OR name="sue" (given a large enough dataset overwhich it was running … on small [enough] tables (say up to a couplefew thousand records), the OR version would run equally, or occasionally faster).

In their case, by switching to the IN(...) form, queries went from taking 20+ minutes to finishing in ~20 seconds!

Bonus tidbit:

It is well-known in Splunkland that using dedup is an “expensive” operation. Want a clever way around it (that is much faster)? Instead of doing something like index=myndx | fields ip host | dedup host, run index=myndx | fields ip host | stats count by host | fields -count. The |stats .. |fields -count trick seems to run anywhere from 15-30% faster than dedup.

Also published on Medium.