As this little blogling turns 2 months old, I’ve changed enough digital diapers (viz. formally reviewed 25 different whiskies) to have learned a good deal about the challenges of reviewerhood. For me, one of the greatest challenges has been choosing a scoring system and sticking to it. In this post I’ll explain my thinking on reviewing and scoring, and how it has recently evolved.
Ultimately, a spirit review is attempting the impossible: to capture in static language a multifaceted sensory experience that evolves with time and is shaped by personal memories and associations. Describing a whisky is difficult enough even with one’s full powers of language; reducing it to a final score in a meaningful way is harder still.
Whisky is commonly scored on a 100-point scale (e.g. WhiskyBase, WhiskyFun), similar to the standard for wine. Most of the range is in practice (or by definition) not used, and just about every whisky falls in the 75–95 point range. Sometimes a whisky score is split into 25+25+25+25 for nose, palate, finish, and totality / balance.
Numerical review scales have all sorts of issues:
- They implicitly assume it’s meaningful to rank whiskies from worst to best.
- It’s extremely difficult to score consistently, due to variations in the whisky, the setting, and one’s own condition.
- An individual’s scoring will ‘drift’ over time, unless the scale is tied to reference whiskies that one often returns to for calibration (and rarely is such rigour adhered to).
- It’s not in and of itself clear what each numerical level means. How good is ’82’?
These limitations notwithstanding, an upside to numerical scores is that they lend themselves well to statistics — and statistics are fun!
The nerdgasmically amazing site WhiskyAnalysis.com describes in depth what can be learned from aggregating reviewer scores into a “MetaScore”, in this case based on two dozen “expert” whisky reviewers. This is more complicated than just naively averaging their scores, since one has to correct for individual reviewers’ differences in the numerical scale used, and biases in how generously scores are applied, as well as the partial overlap between reviewers in which whiskies they have scored, and so on and so forth. But in the end some really interesting results emerge from this study:
- Neither the individual reviewers’ score distributions nor the MetaScore distribution are normally distributed (a “bell curve”); they have a prominent left skew: there is a wider range of worse-than-typical than better-than-typical whiskies.
- Any two reviewers typically differ a lot in their scores of individual whiskies. On average, the correlation between any two reviewers’ normalized scores is 0.44 (sometimes as little as 0.18, as for Ralfy and Jim Murray).
- However, each reviewer’s score correlates on average 0.73 with the MetaScore. This is quite high!
The overall picture here is that the scores of one person are generally not a reliable guide for another person, but it’s not at all arbitrary what is good whisky and what isn’t.
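To make the aggregation idea a bit more concrete, here is a rough sketch of how one might put reviewers on a common scale and compare them. This is not WhiskyAnalysis.com’s actual method: the reviewer names and scores are made up, and I am assuming the simplest possible correction (converting each reviewer’s scores to z-scores and averaging whatever reviews exist for each whisky).

```python
from statistics import mean, pstdev

# Made-up scores: reviewer -> {whisky: score}. Note the different scales
# and the only partial overlap in which whiskies each reviewer has scored.
scores = {
    "reviewer_a": {"W1": 88, "W2": 82, "W3": 91, "W4": 79},
    "reviewer_b": {"W1": 8.0, "W2": 7.5, "W3": 9.0},
    "reviewer_c": {"W2": 85, "W3": 90, "W4": 86},
}


def normalize(reviewer_scores):
    """Convert one reviewer's scores to z-scores (mean 0, std 1)."""
    mu = mean(reviewer_scores.values())
    sigma = pstdev(reviewer_scores.values())
    return {w: (s - mu) / sigma for w, s in reviewer_scores.items()}


def meta_scores(all_scores):
    """Average the normalized scores of whoever reviewed each whisky."""
    normalized = {r: normalize(s) for r, s in all_scores.items()}
    whiskies = {w for s in all_scores.values() for w in s}
    return {
        w: mean(z[w] for z in normalized.values() if w in z)
        for w in sorted(whiskies)
    }


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def correlation_between(a, b):
    """Correlate two score mappings over the whiskies both of them cover."""
    shared = sorted(set(a) & set(b))
    return pearson([a[w] for w in shared], [b[w] for w in shared])


meta = meta_scores(scores)
r_ab = correlation_between(scores["reviewer_a"], scores["reviewer_b"])
r_a_meta = correlation_between(scores["reviewer_a"], meta)
```

Here `r_ab` plays the role of the reviewer-to-reviewer correlation and `r_a_meta` that of the reviewer-to-MetaScore correlation. A real analysis would need far more care (more whiskies per pair, skewed distributions, reviewers who only score what they like), so take this as an illustration of the idea only.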
With everything said so far about numerical scores in mind, I decided from the start not to use them on this blog. Still, the idea of a condensed verdict of some sort appeals to me. It leaves a sense of accomplishment and closure as the last drop of a dram is drunk.
In my first non-numerical scoring system, I sought to evaluate whisky along a few more-or-less independent ‘dimensions’ of quality. This idea was loosely based on the intuition behind principal component analysis. In a complex space, such as that made up by the many determinants of whisky quality, if there are significant patterns, a few orthogonal ‘principal axes’ can be found that explain most of the statistical variation of the sample. Indeed, this methodology has been applied to whisky flavour profiles. Only in my case, rather than statistics I used armchair philosophizing (and a bit of ChatGPT advice) to decide that variety, balance, intensity, distinctiveness and a summary dimension of overall enjoyability would suffice to parametrize ‘quality space’. I associated a symbol with each dimension; for example ⚖️ for balance. If the experience was ‘remarkably’ balanced I would award a ⚖️, and if it was ‘exceptionally’ balanced, a double ⚖️⚖️. A normally (i.e. unremarkably) balanced whisky would simply not get that symbol; and there was also a symbol for remarkably unbalanced. Thus, a hypothetical score for an exceptional whisky might be ⚖️💥💎🧠🧠/⭐⭐ (remarkably balanced, intense and distinct, with exceptional variety leading to an exceptional overall enjoyability). A less enthusiastic verdict might read 💤/✔️ (boring: lacking in intensity, distinctiveness and variety; overall not more than okay). I even made a cute infographic:

An eccentric and somewhat complicated system, but it seemed to pack a lot of meaningful information into a compact format, and therefore I liked it. Until I tried to put it into practice…
I discovered two main problems. First, I thought the unremarkable / remarkable / exceptional hierarchy would be easy enough to apply, because if, for example, the distinctiveness (i.e. uniqueness, character) of the whisky didn’t stand out to me on its own, then by definition it would be unremarkable in that regard. And ‘exceptional’ really has to sweep me off my feet. But I found myself in situations where I had awarded 💎 to one whisky and not to another, yet when I compared them side by side the difference seemed much smaller than the gap between ‘unremarkable’ and ‘remarkable’ ought to be. Second, while the meaning of the dimensions was relatively clear on paper, and I wrote long descriptors defining each, upon tasting they were less readily identifiable. In particular, balance and variety are not really so independent from one another, and their combination could be seen as complexity, which I had therefore excluded as a dimension of its own, perhaps in error. So in addition to doubting my ability to consistently score along my chosen dimensions, I also started to doubt my chosen dimensions.
To attempt to rectify the situation, I devised a second, simplified version of the scoring system. I’m stating it here more for my own record-keeping than to make an actual point. I reorganized the previous dimensions into five primary positive qualities (balanced variety, mature complexity, distinctiveness, boldness, crispness) and three negative ones (boring, unbalanced, off-note), all with a designated emoji symbol, and would have just two levels for each: not notable or notable. An overall mark of ok/good, great, outstanding or sublime would crown the string of awarded symbols, similar to before, with more positive qualities expected of a higher overall mark. But as one can imagine, it again proved difficult to apply the judgments consistently. In general, it really breaks the flow of the experience when one feels compelled to make many specific semi-judgments for every single review! Again, back to the drawing board.
So what have I converged on? Simplicity and flexibility. Various aspects of quality (or lack thereof) will simply be reflected in the review text itself, and won’t be shoehorned into pre-defined categories. I have now made a habit of writing an “experience” and a “verdict” section for each review. In terms of marks, I now only place each whisky into one of six ‘quality categories’ as part of the verdict:
❓— inconclusive: a fair review was not possible, e.g. due to tainted sample
❌— bad: from mildly unpleasant to completely undrinkable
💤— boring: of very little interest for a seasoned palate (most entry-level blends and malts)
✔️— decent: enjoyable in its own right, but either lacks some character or has notable flaws
⭐— high quality: well-crafted, worthwhile, stimulating, with few if any flaws
❤️— personal favourite: of notable quality and especially appealing to my personal preferences
These categories would not be well represented by 0-5, as they are qualitatively very different, not just ordered. A ❤️ for me would be ⭐ for someone else, and vice versa. But my idea is that a quality whisky (⭐) can be recognized as such regardless of one’s specific personal preferences. There might still be edge cases between, for example, ✔️ and ⭐, but this is inevitable with any categorical system.
An advantage of having ended up with such a simple scale is that I believe I can go back and retroactively update earlier reviews to conform to it, without compromising the integrity of those reviews. But should I decide to move on to some new scale in the future, I would probably not edit older posts. Hence, I have this post as a historical record of past review scales.
I’ll end on a bit of a cheesy note (Ben Nevis, anyone?). I found a lot of enjoyment pouring hours, and accompanying writing-inspiration drams, into devising my original eccentric review system and notation. As dysfunctional as it was, I’m quite proud and fond of it still. Even so, I have found the good sense to overcome this self-indulgence, and I’ve killed my darling. And this I take as a tiny drop of character development in the spirit cask of life.

