Summary: I have been in a reflective mindset to start off 202(thrive), and I decided to explore my evolving usage of Generative AI models over the last two years. By not committing to a single AI model, I have unintentionally built a mental toolbox of AI models that I find valuable in my daily life. Until one AI model reaches human-level capabilities in all valuable tasks, it does not make sense for me to rely on a single model for all tasks. This post starts off reflective and ends by trying to convince you to stop relying on Claude or ChatGPT (calling it Chat moving forward) for all of your AI-related work.
The table below took a while to put together, but it summarizes the history of my Generative AI use since mid-2023.
The takeaway from this table is that my exploration of the various AI models has led me to a mental toolbox of models that excel at specific applications (the last column of the table). For example, Perplexity AI is the ideal model when internet search is a core part of the task, while Pi is much better for interpersonal topics that require high Emotional Intelligence, etc.
The Chatbot Arena LLM Leaderboard below shows that my assumption that no single AI model is best in class at all tasks also holds in the wild with other users. The leaderboard is built by crowdsourcing user feedback on head-to-head comparisons of two models' outputs to establish which model gives better responses.
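To make the crowdsourcing mechanic concrete, here is a minimal sketch of an Elo-style rating update, the kind of scheme arena-style leaderboards use to turn pairwise votes into a ranking. The model names and votes below are made up for illustration, and the real leaderboard's statistical machinery is more sophisticated than this:

```python
# Toy Elo-style aggregation of pairwise "which response was better?" votes.
# Model names and vote data are invented for illustration.

def elo_update(r_a, r_b, winner_is_a, k=32):
    """Update two ratings after one head-to-head comparison."""
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    score_a = 1.0 if winner_is_a else 0.0
    r_a += k * (score_a - expected_a)
    r_b += k * ((1 - score_a) - (1 - expected_a))
    return r_a, r_b

ratings = {"model-x": 1000.0, "model-y": 1000.0}
votes = [("model-x", "model-y", True),   # user preferred model-x
         ("model-x", "model-y", True),
         ("model-x", "model-y", False)]  # user preferred model-y

for a, b, a_won in votes:
    ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], a_won)

print(sorted(ratings, key=ratings.get, reverse=True))
```

The point of the scheme is that many noisy individual votes converge to a stable ordering, which is why a crowdsourced leaderboard can be informative even though any single comparison is subjective.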
Some notable observations from the latest leaderboard:
The top end of the leaderboard is dominated by various versions of Gemini and Chat.
It was a bit surprising to see the latest Claude model miss the top 10 and get stuck in a five-way tie with Grok-2 and older Chat models, considering it has been my preferred model over the past year and is generally popular among people on the internet.
The main reason for my multi-model approach is my aversion to falling for the “Man with a Hammer” tendency of using a single AI model for all AI tasks. The peril of using just Chat for all your AI needs is that you are limiting yourself to the weaknesses of a single model, when a different model with different training data and different Reinforcement Learning (RL) methods could provide better responses.
By using multiple models, we can help avoid the Einstellung effect, which limits our potential for holistic thinking and for diverse AI outputs.
Definition: The Einstellung effect refers to the human tendency to get stuck in a particular way of thinking or problem-solving, even when better or more appropriate methods are available. It's a cognitive bias where prior experience with a particular solution can blind us to alternative, potentially superior, approaches.
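The “mental toolbox” routing I describe above can be sketched as a simple lookup that sends each task category to a preferred model. The categories and picks below are my own subjective mapping from the table earlier in the post, not any official taxonomy or API:

```python
# A toy "mental toolbox": route each task category to the model I reach for.
# The mapping is personal preference drawn from my table, not a benchmark result.
TOOLBOX = {
    "internet search": "Perplexity",   # search is core to the task
    "interpersonal topics": "Pi",      # high EQ, therapy-style conversations
    "long documents": "Gemini",        # very long context window
    "real-time news": "Grok",          # trained on real-time Twitter data
}

def pick_model(task_category: str) -> str:
    # Fall back to a general-purpose model when no specialist stands out.
    return TOOLBOX.get(task_category, "Chat")

print(pick_model("internet search"))  # Perplexity
print(pick_model("poetry"))           # Chat (fallback)
```

The fallback line is the anti-Einstellung move in miniature: a general-purpose default is fine, as long as it does not crowd out the specialists.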
The multi-model approach that my reflections above highlight ties very nicely to the concept of ‘the latticework of mental models’ used in modern decision making.
Mental models, a term popularized by the late Charlie Munger, are useful for the following reasons:
provide a simplified representation of an idea, belief, or system found in the real world
promote multi-disciplinary thinking
serve as a useful aid in decision making by helping predict the future behavior of people and systems
help avoid cognitive biases
In short, mental models enhance our quality of thinking by promoting multidisciplinary thinking and by giving us more options in our “mental toolbox” to solve problems, make decisions, and explain phenomena in the world.
As an exercise in making this connection stronger between AI models and mental models, I explored the 100+ mental models listed on Farnam Street to see which mental models best reflect each of the AI models I have used so far.
Swiss Army Knife
General-purpose model, good at many tasks. Easy to use and accessible for a wide range of users.
Inversion via High Margin of Safety
Claude prioritizes safety by refusing requests that follow common paths of failure, avoiding harmful AI outputs for users.
Trust
Anthropic is very transparent about the system design and ethical framework that guide Claude to optimize responses to be 'Helpful, Honest, and Harmless'.
Surface Area
Making AI more narrow with niche tools like NotebookLM and Deep Research reduces the number of ways they can go wrong and increases their practical usefulness to modern knowledge workers.
Scale
A long context window can allow you to have a conversation as long as a 3,000-page book. This helps maintain coherent conversations across extended interactions without worrying about running out of chat space. It is the equivalent of not having range anxiety when you take an EV on a long weekend trip out of town.
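A quick back-of-the-envelope check of the 3,000-page figure, using rough conversion factors that are assumptions on my part (about 500 words per page and about 1.3 tokens per word for English text):

```python
# Rough sanity check: how many tokens is a 3,000-page book?
# Both conversion factors below are common rules of thumb, not exact values.
pages = 3_000
words_per_page = 500      # assumed average for a dense book page
tokens_per_word = 1.3     # assumed average for English text
total_tokens = pages * words_per_page * tokens_per_word
print(f"~{total_tokens:,.0f} tokens")  # on the order of 2 million tokens
```

That lands in the neighborhood of the multi-million-token context windows the largest Gemini models advertise, so the book analogy holds up under these assumptions.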
Second-Order Thinking
A new AI model (Gemini 2.0) that simulates “thinking” via chain-of-thought inference promotes critical engagement with problems instead of blurting out the first thing that pops up. This is the cutting edge of AI models, as it tries to simulate System 2 thinking and erode the set of cognitive tasks that only humans can currently do.
Leverage
Amplifies the user's ability to access, process, and summarize information from the internet, to aid in research and knowledge acquisition.
Positive Lollapalooza Effect
Provides access to the latest AI models from the big labs, which allows users to quickly compare outputs from different models side-by-side and choose the best one for their given use case.
The conversational chat style and focus on high EQ enable therapy-style sessions, giving users a safe space to frame their experiences into narratives in order to process difficult emotions, gain new perspectives, and improve mental well-being.
Wisdom (or Folly) of the Crowds
Can quickly catch you up on what questionable act a public figure has done when you are out of the loop on the news.
Grok's training on real-time Twitter data can be seen as an attempt to tap into the "wisdom of the crowds," but the limitations of Twitter as a "crowd" must be considered.
Low/Moderate Margin of Safety
The willingness to discuss a wide range of topics, including those that might be considered harmful or controversial, indicates a lower margin of safety compared to models like Claude.
Swiss Army Knife: General-purpose model, good at many tasks. Easy to use and accessible for a wide range of users.
Frictionless
Captures the ease of access to the tool across multiple devices and mediums: the OpenAI app, Chat.com, and 1-800-CHAT-GPT.
Important disclaimer: In this exercise, I explore which mental models come to mind when I think of each of these AI models, based on my personal experience interacting with them over the past few months. This is a subjective and metaphorical assessment, intended to highlight some of the dominant operational principles of these tools, rather than provide an exhaustive or definitive analysis. My intention is to offer a framework for thinking about these AI models and to spark reflection and discussion, not to present a rigid categorization. (Suggestion inspired by Gemini 2.0)
Parting Thoughts
For most people, I think a single model should be sufficient. However, exploring different models is a worthwhile exercise to:
build a more well-rounded experience as an AI user,
expose us to different model “thinking” styles,
help us understand the efficiencies and limitations of different AI models, and
most importantly, give us the agency to choose the right tool for each task and promote critical engagement with these tools instead of letting AI do all the work.
Links to the models used:
https://chatgpt.com/
https://claude.ai
https://pi.ai
https://www.perplexity.ai/
https://x.com/i/grok
https://chat.mistral.ai/chat
https://gemini.google.com