Category Archives: insights

ben thompson missed *a lot* in his microsoft-github article

Ben Thompson is generally spot-on in his analysis of industry goings-on. But he missed a lot in The Cost of Developers this week.

Here’s what he got right about this acquisition:

  • Developers can be quite expensive (though $7.5B (in equity) works out to only ~$265 per user across GitHub’s ~28 million users – which is pretty cheap)
  • Microsoft is betting that open-source, cloud-based applications that exist independent of platforms will be a large and increasing share of the future
  • That there is room in that future for a company to win by offering a superior user experience for developers directly, not simply exerting leverage on them
  • Microsoft is the best possible acquirer for GitHub
  • GitHub, having raised $350 million in venture capital, was not going to make it as an independent entity
  • Purely enterprise-focused companies like IBM or Oracle would be tempted to wring every possible bit of profit out of the company
  • What Microsoft wants is much fuzzier: it wants to be developers’ friend
  • [Microsoft] will be ever more invested in a world with no gatekeepers, where developer tools and clouds win by being better on the merits, not by being able to leverage users

And here’s what he missed and/or got wrong:

  • [Microsoft] is in second place in the cloud. Moreover, that second place is largely predicated on shepherding existing corporate customers to cloud computing; it is not clear why any new company — or developer — would choose Microsoft
  • It is very hard to imagine GitHub ever generating the sort of revenue that justifies this purchase price

Some of the below I commented on Google+ yesterday. The rest is in response to more idiocy & paranoia I’ve seen on some technical community mailing lists (bet you didn’t know those still existed) in the last 24 hours, or in response to specific items in Ben’s essay that are shortsighted, misguided, or incredibly wrong.

  • If you cannot see why new users, developers, and companies would go to Microsoft Azure offerings, you don’t understand what they’re doing
    • AWS is huge – but Azure and Google Cloud Platform (GCP) have major technical (and economic) advantages
    • Amazon likes to throw new cloud features at the wall like spaghetti to see what sticks; Google and Microsoft have clearly thought through this whole cloud business, and their platforms make incredibly solid business & technical sense over AWS in most use cases (the only [occasional] real exception being “but we already use AWS”). Have you not seen the Azure IoT offerings?
  • GitHub has not yet been profitable, and would probably have IPO’d (poorly) in the next year to keep from running out of cash
    • Arguably, GitHub would never become profitable on their own
  • Microsoft has a long history of contributing to OSS projects (most-to-all of which are on GitHub)
    • If they were going to acquire anyone in this space, GitHub is the only one that makes any sense
  • (This was tangentially mentioned in Ben’s essay by linking to his analysis of the Microsoft-LinkedIn acquisition in 2016.) Alongside the LinkedIn acquisition a couple years back (which has an obvious play for an eventual IDaaS – fully-and-forever integration with Office365 regardless of where you work, with everything following automagically), offering better integrations with their existing tools (Visual Studio already had git integrations – they should only get better with this acquisition) is a Good Thing™ for devs and end users alike, because making those excellent developer tools even better means they’ll be better whether devs are using GitHub, Bitbucket, GitLab, etc
  • The more-or-less instantaneous expansion of offered items in the Windows Store (some kind of cloud-based/distributed build-on-demand for software when you want it (and which fork you want)) to “everything” on GitHub is a brilliant possibility
    • In light of Apple’s announcement yesterday about enabling iOS apps to come to macOS over the next releases of iOS and macOS, this should have been at the forefront of most people’s thought processes (after the keynote was done, of course)
    • Through this acquisition, it’s likely more developers will use Microsoft APIs (.NET, etc) in their projects
  • Echoing Ballmer’s chant, “Developers! Developers! Developers!”, while Microsoft doesn’t really care about Windows anymore (just look at the recent reorg), it is still THE most widespread end-user platform in the world – and bringing millions more developers “into the fold” is genius
    • Even if some small percentage will opt to go elsewhere, most won’t change because, well, change is hard
    • All the developers Microsoft had that weren’t yet using GitHub will have a huge reason to start
  • Microsoft has typically been a buy-don’t-build shop (there are exceptions, but look at the original DOS, PowerPoint, SQL Server, Skype, their failed attempt at Yahoo!, etc): they could have spent 5-10x as much building something “as good as” GitHub, or they could buy it. They opted to buy – and in equity, note, not cash, which is smart from several business viewpoints: not least, it gives the GitHub subsidiary (with its new CEO, etc) an “enforced” interest in continuing to ensure it is The place for developers to put their projects – after all, if that standing drops considerably, so does the value of the equity GitHub got in the deal

on internet sales tax

The debate is raging again as the Supreme Court of the United States is getting ready to make a decision on collecting sales tax for online sales.

I’ve read as many viewpoints as I could find, both supporting and detracting from requiring businesses to collect sales tax from their customers.

And my [current] view is that all businesses conducting business online should collect the sales tax you would have paid if you went in person.

Company in Oregon? No sales tax. Company in Kentucky? Sales tax.

Don’t collect it for wherever the buyer happens to be: collect it based on where the seller is.

Simple.

Straightforward.

And it’s something the merchant is already set up to do.
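
For illustration, here’s a minimal sketch of origin-based collection (the rate table, function name, and rates are hypothetical – real rates vary by state and locality):

```python
# Hypothetical origin-based sales tax: the rate depends only on the
# seller's location, never on where the buyer happens to be.
SELLER_STATE_RATES = {
    "OR": 0.00,   # Oregon: no sales tax
    "KY": 0.06,   # Kentucky: 6% (illustrative; check current rates)
}

def sales_tax(seller_state: str, subtotal: float) -> float:
    """Return tax owed, based solely on the seller's state."""
    return round(subtotal * SELLER_STATE_RATES.get(seller_state, 0.0), 2)

print(sales_tax("OR", 100.00))  # 0.0 – Oregon seller collects nothing
print(sales_tax("KY", 100.00))  # 6.0 – Kentucky seller collects 6%
```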


more thoughts on `|stats` vs `|dedup` in splunk

Yesterday I wrote up a neat little find in Splunk wherein running `stats count by ...` is substantially faster than running `dedup ...`.

After some further reflection over dinner, I figured out the major portion of why this is – and I feel a little dumb for not having thought of it before. (A coworker added some more context, but it’s a smaller reason of why one is faster than the other.)

The major reason `stats count by ...` is faster than `dedup ...` is that `stats` can hand off the counting process to something else (though, even if it doesn’t, incrementing a hashtable entry by 1 every time you encounter an instance isn’t terribly computationally complex) and keep going.

In contrast, `dedup` must compare every individual returned event’s field that matches what you’re trying to dedup to its growing list of unique entries for that field.

In the particular case I was seeing yesterday, that means that every single event in the list of 4,000,000 events returned by the search has to be compared one at a time to a list (that I know is going to top out at about 11,000). To use Big-O notation, this is an O(n·m) operation (bordering on O(n²))!

That initial list of length m fills pretty quickly (it is, after all, only going to get to ~11,000 total entries (in this case)), but as it grows to its max, it gets progressively harder and harder to check whether or not the next event has already been dedup’d.

At ~750,000 events returned (roughly 1/5 my total), the list of unique field values was 98% complete – yet there were still ~3.2 million events left to go (to find just 2% more unique field values).

Those last 3.2 million events each need to be checked against the list of >10,500 entries – which means, roughly, 16.8 billion comparisons still need to be made!

(Because linear searching finds what it’s looking for, on average, by the time it has traversed half the list. If the list is being created in a slightly more efficient manner (say a heap or [balanced] binary search tree), it will still take ~43 million comparisons (3.2 million * log2(11,000)).)
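
A quick back-of-the-envelope check of those figures (a sketch, using the numbers from my search above):

```python
import math

events_left = 3_200_000   # events remaining once the unique list was ~98% full
list_size   = 10_500      # entries already in the unique-values list

# Naive linear scan: on average you traverse half the list per event.
print(f"~{events_left * list_size / 2 / 1e9:.1f} billion comparisons")   # ~16.8

# Balanced BST / heap: ~log2(m) comparisons per membership check.
print(f"~{events_left * math.log2(11_000) / 1e6:.0f} million comparisons")  # ~43
```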

Compare this to the relative complexity of using `|stats count by ...` – it still has to run through all 4 million events, but all it is doing is adding one to the tally for every value that shows up in that particular field – IOW, it “only” has to do a total of 4 million [simple] things (because it does need to look at every event returned). `dedup` at a minimum is going to do ~54 million comparisons (and probably a lot more – given it doesn’t merely take 13x the time to run, but closer to 25x).
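
To see the difference concretely, here’s a toy Python rendition of the two strategies – my own sketch, not Splunk’s actual internals (Splunk surely does something smarter than a naive list scan for `dedup`, which is why its real-world penalty is 13-25x rather than the thousands-of-x a pure linear scan would suggest). Scaled down ~10x so the naive version finishes quickly:

```python
import random
import time

# Scaled-down stand-in for the search: 400k events, ~1,100 unique field values.
events = [f"host-{random.randrange(1_100)}" for _ in range(400_000)]

def stats_count(events):
    """The `stats count by ...` approach: one hashtable increment per event, O(n)."""
    counts = {}
    for value in events:
        counts[value] = counts.get(value, 0) + 1
    return counts

def naive_dedup(events):
    """A naive `dedup ...`: scan the growing list of seen values per event, O(n*m)."""
    seen = []
    for value in events:
        if value not in seen:   # linear scan of up to ~1,100 entries
            seen.append(value)
    return seen

for fn in (stats_count, naive_dedup):
    start = time.perf_counter()
    fn(events)
    print(f"{fn.__name__}: {time.perf_counter() - start:.2f}s")
```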

The secondary contributing factor – important, but not as much a factor as what I covered above – is that `dedup` must process the whole event, whereas `stats` chucks everything that isn’t part of what it’s counting (so if an event is 1KB in size, `dedup` has to return the whole kilobyte, while `stats` is only looking at maybe 1/10 the total (if you include a couple extra fields)).

Another neat aspect of using `|stats` is that it creates a table for you – if you’re running `|dedup`, you then have to `|table ...` to get the fields you want displayed how you want.

And adding `|table` adds to the run time.

So there you have it – turns out those CompSci 201 classes do come in handy 18 years later 🤓

document what didn’t work

In a recent episode of Paul’s Security Weekly, an off-hand comment was made about documentation: you shouldn’t merely document what to do, nor even just why – you should also document what you tried that didn’t work (ie, augment the status quo).

The upshot being: to save whomever comes to this note next (especially if it turns out to be yourself) the effort you spent that was in vain.

This is similar to a famous quote attributed to Edison,

I have not failed. I’ve just found 10,000 ways that won’t work.

In light of my recommended, preferred practice and policy of “terse verbosity”, I would strongly suggest not placing the “doesn’t work” in-line, typically. Instead, put it in footnotes, an appendix, etc. But always

explain everything you did, but use bullet points if possible, rather than prose form

Loads of other goodies in that episode, too – but this one jumped out as applicable to everyone.

but, i got them on sale!

Back in August 2008, I had a one-week “quick start” professional services engagement in Nutley, New Jersey. It was supposed to be a super-simple week: install HP Server Automation at BT Global.

Another ProServe engineer was onsite to setup HP Network Automation.

Life was gonna be easy-peasy – the only deliverable was to setup and verify a vanilla HPSA installation.

Except, like every Professional Services engagement in history, all was not as it seemed.

First monkey wrench: our primary technical contact / champion was an old-hat Sun Solaris fan (to the near-exclusion of any other OS for any purpose – he even wanted to run SunOS on his laptop).

Second monkey wrench: expanding on the first, our technical contact was super excited about the servers he’d gotten just the weekend before from Sun because they were “on sale”.

It’s time for a short background digression. Because technical intricacies matter.

HP Server Automation was written on Red Hat Linux. It worked great on RHEL. But, due to some [large] customer requests, it also supported running on Sun Solaris.

In late 2005, Sun introduced a novel architecture dubbed “Niagara”, or UltraSPARC T1, which they offered in their T1000 and T2000 series servers. Niagara did several clever things – it ran multiple threads per core, with as many as 32 hardware threads executing simultaneously.

According to AnandTech, the UltraSPARC T1 was a “72 W, 1.2 GHz chip almost 3 times (in SpecWeb2005) as fast as four Xeon cores at 2.8 GHz”.

But there is always a tradeoff. The tradeoff Sun chose for the first CPU in the product line was to share a single FPU (floating point unit) between the integer cores and pipelines. For workloads that mostly involve static / simple data (ie, not much in the way of calculation), they were blazingly fast.

But sharing an FPU brings problems when you need to actually do floating-point math – as cryptographic algorithms and protocols all end up relying upon for gathering entropy for their random value generation processes. Why does this matter? Well, in the case of HPSA, not only is all interprocess, intraserver, and interserver communication secured with HTTPS certificates, but because large swaths are written in Java, each JVM needs to emulate its own FPU – so not only is the single FPU shared between all of the integer cores of the T1 CPU, it is further time-sliced and shared amongst every JRE instance.

At the time, the “standard” reboot time for a server running in an SA Core was generally benchmarked at ~15-20 minutes. That time encompassed all of the following:

  • stop all SA processes (in the proper order)
  • stop Oracle
  • restart the server
  • start Oracle
  • start all SA components (in the proper order)

As you’ll recall from my article on the Sun JRE 1.4.x from 6.5 years ago, there is a Java component (the Twist) that already takes a long time to start as it seeds its entropy pool.

So when it is sharing the single FPU not only between other JVMs, but between every other process which might end up needing it, the total start time is extended dramatically.

How dramatically? Shutdown alone was taking upwards of 20 minutes. Startup was north of 35 minutes.

That’s right – instead of ~15-20 minutes for a full restart cycle, if you ran HPSA on a T1-powered server, you were looking at ~60+ minutes to restart.

Full restarts, while not incredibly common, are not all that out of the ordinary, either.

At the time, it was not unusual to want to fully restart an HPSA Core 2-3 times per month. And during initial installation and configuration, restarts need to happen 4-5 times in addition to the number of times various components are restarted during installation as configuration files are updated, new processes and services are started, etc.

What should have been about a one-day setup, with 2-3 days of knowledge transfer, turned into nearly 3 days just to install and initially configure the software.

And why were we stuck on this “revolutionary” hardware? Because of what I noted earlier: our main technical contact was a die-hard Solaris fanboi who’d gotten these servers “on sale” (because their Sun rep “liked them”).

How big a “sale” did he get? Well, his sales rep told him they were getting these last-model-year boxes for 20% off list plus an additional 15% off! That sounds pretty good – depending on how you do the math, he was getting somewhere between 32% off (taken sequentially, the discounts compound: 1 − 0.80 × 0.85 = 0.32) and 35% off (simply added together) the list price – for a little over $14,000 a piece (they’d bought two servers – one to run Oracle RDBMS (which Oracle themselves recommended not running on the T1 CPU family), and the other to run HPSA proper).

Except his sales rep lied. Flat-out lied. How do I know? Because I used Sun’s own server configurator site and was able to configure two identical servers for just a smidge over $15,000 each – with no discounts. That means they got 7% off list … tops.

So not only were they running hardware barely discounted off list (and, interestingly, only slightly cheaper (by less than $2,000) than the next-generation T2-powered servers, which had an FPU per core rather than per CPU (the T2 still had some performance issues, but at least wasn’t dog-vomit slow)), but they were running on Solaris – which had always been a second-class citizen when it came to HPSA performance: all things being roughly equal, x86 hardware running RHEL would always smack the pants off SPARC hardware running Solaris under Server Automation.

For kicks, I configured a pair of servers from Dell (because their online server configurator worked a lot better than any other I knew of, and because I wanted to demonstrate that just because SA was an HP product didn’t mean you had to run HP servers), and was able to massively out-spec two x86 servers for less than $14,000 a pop (more CPU cores, more RAM, more storage, etc) and present my findings as part of our write-up of the week.

Also for kicks, I demoed SA running in a 2-CPU, 4GB VM on my laptop, rebooting faster than either T1000 server they had purchased could.

What’s the moral of this story? There are two (at least):

  1. Always always always find out from your vendor if they have a preferred or suggested architecture before namby-pamby buying hardware from your favorite sales rep, and
  2. Be ever ready and willing to kick your preconceived notions to the sidelines when presented with evidence that they are not merely ill thought out, but out and out, objectively wrong

This is a fundamental tenet of automation:

“Too many people try to take new tools and make them fit their current processes, procedures, and policies – rather than seeing what policies, procedures, and processes are either made redundant by the new tools, or can be improved, shortened, or – wait for it – automated!”

You must always be reviewing and rethinking your preconceived notions, what policies you’re currently following, etc. As I heard recently, you need to reverse your benchmarks: don’t ask, “why are we doing X?”; ask, “what would happen if we didn’t do X?”

That was a question never asked by anyone prior to our arrival to implement what sales had sold them.

wonder how many zombie film/tv/game creators are/were computer science nerds

As you all know, I am a huge zombie fan.

And, as you probably know, I was a CIS/CS major/minor at Elon.

A concept I was introduced to at both Shodor and Elon was ant colony simulations.

And I realized today that many people have been introduced to the basic concepts of ant colony simulations through films like Night of the Living Dead or World War Z and shows like Z Nation or The Walking Dead.

In short, ant colony optimization simulations, a part of swarm intelligence, use the “basic rules” of ant intelligence to game out problems of traffic patterns, crowd control, logistics planning, and all kinds of other things.

Those basic rules the ants follow more or less come down to the following:

  • pick a random direction to wander
  • continue walking straight until you hit something
    • if you hit a wall
      • turn a random number of degrees between 1 and 359
      • loop up one level
    • if you hit food
      • if you are carrying trash
        • turn a random number of degrees between 1 and 179 or 181 and 359
        • loop up two levels
      • if you are carrying food
        • drop it
        • turn 180 degrees, loop up two levels
      • if you are not carrying anything
        • pick it up
        • either turn 180 degrees and loop up two levels, or
        • loop up two levels (ie, continue walking straight)
    • if you hit trash (dead ants, etc)
      • if you are carrying trash
        • drop it
        • turn 180 degrees, loop up two levels
      • if you are carrying food
        • turn a random number of degrees between 1 and 179 or 181 and 359
        • loop up two levels
      • if you are carrying nothing
        • pick it up
        • either turn 180 degrees and loop up two levels, or
        • loop up two levels (ie, continue walking straight)
    • if you hit an ant
      • a new ant spawns in a random cell next to the two existing ants (with a 1/grid-shape probability: in a square grid, this would be a ~10% chance of spawning a new ant; in a hex grid, it would be a ~12.5% chance of a spawn), IF there is an empty cell next to either ant
      • if you are both carrying the same thing
        • start a new drop point
        • turn around 180 degrees
        • loop up two levels
      • if you are carrying different things (or not carrying anything)
        • turn a random number of degrees between 1 and 359
        • loop up two levels
  • if you have been alive “too long” (parameterizable), you die and become trash (dropping whatever you were carrying “next” to you in a random grid point – for example, if the grid is square and you’re in position “5”, your cargo could land in any of positions 1-4 or 6-9)

There are more rules you can add or modify (maybe weight your choice of direction when you turn based on whether an ant has been there recently (ie, simulated pheromone trails)), but those are the basics. With randomly-distributed “stuff” (food, walls, ants, trash, etc) on a board of size B, an ant population P of 10% × B, a generation frequency F of 9% × B, an iteration count of 5 × B, and a life span L of 10% × B, let it run and you will see piles of trash, food, etc accumulate on the board.
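
If you want to see the emergent piling for yourself, below is a deliberately stripped-down sketch of the wander/pick-up/drop loop – my own toy rendition of the rules above, not anyone’s course code. It implements only the food rules, on a square grid with wraparound, using the parameter proportions suggested above (spawning, trash, and lifespan extend the same pattern):

```python
import random

SIZE = 40                          # board is SIZE x SIZE cells (B = SIZE*SIZE)
B = SIZE * SIZE
EMPTY, FOOD, WALL = ".", "F", "#"
DIRS = [(-1,-1), (-1,0), (-1,1), (0,-1), (0,1), (1,-1), (1,0), (1,1)]

random.seed(42)                    # reproducible runs; remove for variety

# Randomly distribute "stuff": ~10% food, ~5% walls, the rest empty.
grid = [[FOOD if random.random() < 0.10 else
         WALL if random.random() < 0.05 else EMPTY
         for _ in range(SIZE)] for _ in range(SIZE)]

# Ant population P = 10% of B; each ant has a position, heading, and cargo.
ants = [{"pos": (random.randrange(SIZE), random.randrange(SIZE)),
         "dir": random.choice(DIRS), "cargo": None}
        for _ in range(B // 10)]

for _ in range(5 * B):             # iteration count of 5x board size
    for ant in ants:
        r, c = ant["pos"]
        dr, dc = ant["dir"]
        nr, nc = (r + dr) % SIZE, (c + dc) % SIZE   # wrap at the edges
        cell = grid[nr][nc]
        if cell == WALL:
            ant["dir"] = random.choice(DIRS)        # hit a wall: random turn
        elif cell == FOOD and ant["cargo"] is None:
            grid[nr][nc] = EMPTY                    # hit food, empty-handed:
            ant["cargo"] = FOOD                     # pick it up...
            ant["dir"] = (-dr, -dc)                 # ...and turn 180 degrees
        elif cell == FOOD and ant["cargo"] == FOOD:
            if grid[r][c] == EMPTY:                 # hit food while carrying:
                grid[r][c] = FOOD                   # drop cargo where you stand
                ant["cargo"] = None
            ant["dir"] = (-dr, -dc)                 # and turn 180 degrees
        else:
            ant["pos"] = (nr, nc)                   # keep walking straight

print("\n".join("".join(row) for row in grid))      # crude render of the board
```

Even with just those two rules – pick up food when empty-handed, drop when you bump into more food – the food should consolidate into a handful of piles within a few thousand iterations.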

They may accumulate differently on each run due to the random nature of the inputs, but they’ll accumulate – showing how large numbers of relatively unintelligent things can do things that look intelligent to an outside observer.

And that’s how zombies appear in most pop culture depictions: they wander more-or-less aimlessly until attracted by something (sound, a food source (aka the living), fire, etc). And while they seem to exhibit mild group/swarm intelligence, it’s just that – an appearance to outside observers.

So if you like zombie stories, you might like computer science.

what is happening with news publishers?

I think, closer to the lines of thought that Ben Thompson of Stratechery has laid out, that news publishing is about to undergo a major nichification – the days of everyone trying to report everything are over.

“Local” (whether by geography, interest, or some other grouping mechanism) publishing in narrowly-defined niches is basically going to finish gobbling the Old Line news publishers in the next 3-5 years. And I see automated “curation” (though, if it’s automated, it’s technically not “curating”) as a clever way to cross-cut unforeseen niches from other niches (and from the handful of “major” publishers that will refuse to die – even though they’re going to dramatically shrink very soon) – think applying pivot-table data analysis concepts to news and publishing, rather than mere data.

Jean-Louis Gassée wrote in February the following about Facebook and Google vis-à-vis news publishers: “If they are really willing to contribute to a sustainable news ecosystem, as they claim, both should allow publishers to sell subscriptions on their platforms (while collecting a fee, obviously).”

And that’s certainly an interesting idea – but one that I think will only last, if it even comes to fruition, for a very short period of time. It’s the Napster of news publishing.

I see news publishing undergoing the same sea change the music industry did starting in the late 90s with the rise of #Napster. Until Napster came along, if you wanted to listen to a specific song, you had to either a) wait for it to be on the radio, b) get the vinyl/tape/CD, or c) get a friend to record it for you from the radio or some media they had.

Then Napster and its ilk came along with peer-to-peer file-sharing, crazy lawsuits from the #RIAA, and services like #Apple’s #iTunes charging a mere $0.99 per track (and $9.99 per digital album) made file-sharing (which became a major attack vector for malware) far far less interesting: why spend hours searching for and downloading songs (which might be lousy quality, not the “real” song, etc) when you could just go to iTunes and get what you want in a couple minutes for 99¢?

Then came Pandora. And Spotify. And probably all kinds of other services I don’t know anything about. Why? Because people wanted what they wanted when they wanted it.

The same is true for “news”. How much of an average newspaper issue does the historically-average newspaper reader actually read? 10%? 30%? 50%? I’d bet anything north of 20% is highly unlikely overall.

And what do you have to do to “read” the news in a newspaper? You need to skip past ads, you need to flip between pages (and sometimes sections), you need to physically get the paper. And on and on. Paginated websites (like diply, just to name one) try to replicate the newspaper feel (flipping pages, skipping ads, not being able to see everything until you get to the end, etc) in a move to make money by selling ads and forcing eyeballs to look at them. (To combat that, folks like me run tools like pihole and ublock origin.)

Nichifying news is going to be a huge thing very soon: somewhat akin to the idea of targeted newsletters, but for “real” news, and not just something related to a website.