Or "The power of conventions"

It was a Wednesday, many years ago.

It was a Wednesday because it was Cheating Wednesday. After a couple of Wednesdays in close succession on which we organised different after-office outings, we started calling these events that - even though no participant had a partner to cheat on (let alone the will to) who wasn't invited to, or part of, the event itself. It was a bit of a weird name (I can't recall why we started calling them that), but it did the trick - people got used to maybe having a plan on Wednesday afternoons, and naming the event gave it some sort of standing and importance.

It was many years ago because... well, because after offices? Like, in the meatspace. And regular visits to the office, for that matter.

We were all discussing that night's plan (bowling night) when one of our clients notified us there was some data missing from their e-shop's admin panel.

- All of the retailers that start with the letter A and the letter B have vanished from the production Admin frontend.

Understanding how software works is no easy task[1], so hearing a report like this one from a user puts you in a detective-like position - the software itself didn't have any feature to filter retailers by first letter, so it couldn't be an accidental deletion of the retailers starting with A and B, but it would also have been really weird for an attacker to cherry-pick what to delete by this criterion. The main hypothesis was that the report was somehow bogus[2].

Except it was true - the admin panel showed no retailers that started with those letters, and we had evidence that there used to be some. And, even worse - the database agreed they were gone.

To better paint the picture, I must mention that this was a then-new customer. We had inherited a moderately sized project consisting of an already published mobile app and a backoffice site that also served the app's API - but the code was missing lots of love, to say the least[3]. Having been on board for so little time, we didn't really know the app that well, so a few of us kept looking at the admin site & database for any useful hint on the issue, while others skimmed through the source code, hoping for an explanation.

We already knew (and had checked) that the database backups were available and ready to be restored, and we knew the retailers table didn't change often - so it was quite likely that we would not lose any data in the end. But we first wanted to understand the cause of the issue, to avoid it in the future.

Time kept going on (I haven't yet seen it do differently), people kept discussing bowling plans, we kept looking for an explanation... And then our QA engineer told us we were now missing some more retailers!

We were even more puzzled than before. Was it actually an attack? It was really unlikely that someone had accidentally triggered the same bug twice in a few hours, but not once in the few weeks we'd spent on the project by then. But, on the other hand, what kind of sadist was this attacker, deciding to delete data in slow motion? Something felt really off.

When we discussed the issue further with our QA engineer, he agreed it was strange - he couldn't understand why this had happened at the very same time he was checking the Retailers page for the very first time. And it was curious, he said, since all he had been able to check was that the page's links worked just fine - as reported by his link-validator browser plugin.

And that's when it clicked.

The link validator plugin looked for every link in a page and followed it to check that it took visitors to a valid page. The Retailers page was your typical CRUD[4] page, with a DELETE button next to each item. But, unlike your typical CRUD page, these DELETE buttons were just standard HTML links that triggered a GET request to a URL on the server that deleted the item from the database.

So the "button" was at odds with the HTTP standard, which says that "the GET and HEAD methods SHOULD NOT have the significance of taking an action other than retrieval"[5]. Using a GET request to delete an item did in fact "work" in the "make the intended thing happen" sense, but violating the convention set by the standard[6] made an otherwise harmless plugin delete a page's worth of data items each time our QA engineer tried to test if everything was good on that CRUD[7].
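To make the mechanism concrete, here is a minimal in-memory sketch of what happened. None of it is the original project's code: `RetailerApp`, `check_links`, the `/retailers/<id>/delete` URLs and the retailer names are all made up for illustration, and the "link checker" is just a loop that issues a GET for every `<a href>` it finds, which is my assumption of roughly what the plugin did.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag, as a link checker would."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


class RetailerApp:
    """Hypothetical stand-in for the backoffice (not the real code)."""

    def __init__(self, names, delete_via_get):
        self.retailers = dict(enumerate(names, start=1))
        self.delete_via_get = delete_via_get  # True reproduces the bug

    def render_page(self):
        rows = []
        for rid, name in self.retailers.items():
            url = f"/retailers/{rid}/delete"
            if self.delete_via_get:
                # The inherited behaviour: DELETE is a plain link.
                control = f'<a href="{url}">DELETE</a>'
            else:
                # The fix: DELETE is a form that submits a POST.
                control = (f'<form method="post" action="{url}">'
                           "<button>DELETE</button></form>")
            rows.append(f'<li><a href="/retailers/{rid}">{name}</a> '
                        f"{control}</li>")
        return "<ul>" + "".join(rows) + "</ul>"

    def handle(self, method, path):
        """Returns an HTTP-like status code for a request."""
        parts = path.strip("/").split("/")
        rid = int(parts[1])
        if rid not in self.retailers:
            return 404
        is_delete = len(parts) == 3 and parts[2] == "delete"
        if not is_delete:
            return 200 if method == "GET" else 405
        if method == "POST" or (method == "GET" and self.delete_via_get):
            del self.retailers[rid]
            return 200
        return 405  # GET is not allowed to delete anything


def check_links(app):
    """Issues a GET for every link on the page, like the plugin did."""
    parser = LinkExtractor()
    parser.feed(app.render_page())
    return {href: app.handle("GET", href) for href in parser.hrefs}


buggy = RetailerApp(["Acme", "Bravo"], delete_via_get=True)
check_links(buggy)   # every link "works", and buggy.retailers ends up empty

fixed = RetailerApp(["Acme", "Bravo"], delete_via_get=False)
check_links(fixed)   # only the detail links are followed; nothing is deleted
```

In the buggy variant the checker reports every link as fine while emptying the page, which is pretty much what our QA engineer saw. In the fixed variant the delete URL only reacts to POST, so a tool that restricts itself to GET has nothing destructive to follow.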

On the bright side, the existence of (and the abiding by) that very same convention lets different parties cooperate passively, each helping the other do their job better - the link checker helps us test our stuff, but it couldn't exist if there was no way for it to know which links are safe to follow.

Back to our story, we managed to restore the lost data from the database backups, and we eventually changed those DELETE buttons to use a proper HTTP method[8].

And we went bowling. Long live Cheating Wednesdays.

  1. Obligatory XKCD reference: #1425 - Tasks 

  2. You can read The 500 mile email for a similar "things don't work that way" feeling. 

  3. We eventually went down the path of writing all new functionality in a new API that worked side-by-side with the old one, until we were able to shut down the old one - but that's a story for another post. 

  4. CRUD: Create, Read, Update, Delete. 

  5. As per RFC 2119, "SHOULD NOT" means there may be exceptions to the rule, but they have to be well thought out and justified.

  6. And we had exactly as many retailers starting with A and B as fit on a single page, which is why the user reported the issue that way.

  7. I really can't remember if we used a DELETE method (which requires some extra code to make it work with web browsers), but we at least changed it to a POST to avoid this issue.