wg0 22 hours ago [-]
Snake oil. Good to read for sure. Seems all plausible too. But snake oil nevertheless.
Here's why: The slot machine can drop any hard requirement that you specify in your AGENTS.md, memory.md, or your dozens of skill markdowns. Pretty much guaranteed.
These harness approaches pretend that LLMs are strict, perfect rule followers and that the only problem is not being able to specify enough rules clearly enough. That's a fundamental misreading of how LLMs operate.
That leaves only one option, not reliable but more reliable nevertheless: human review and oversight. Possibly two reviews, one after the other.
Everything else is snake oil. But at that point you also realize that the promised productivity gains are snake oil too, because reading code and building a mental model is way harder than having a mental model and writing it into code.
keeda 17 hours ago [-]
Snake oil may be a bit strong, because snake oil never works (except maybe as placebo?) whereas anything with an LLM, even though stochastic, has a pretty high chance of working.
> ... you also realize that promised productivity gains are also snake oil because reading code and building a mental model is way harder than having a mental model and writing it into code.
Not really, though it depends on the code; reading code is a skill that gets easier with practice, like any other. This is common any time you're ever in a situation where you're reading much more code than writing it (e.g. any time you have to work with a large, sprawling codebase that has existed long before you touched it.)
What makes it even easier, though, is if you're armed with an existing mental model of the code, either gleaned through documentation, or past experience with the code, or poking your colleagues.
And you can do this with agents too! I usually already have a good mental model of the code before I prompt the AI. It requires decomposing the tasks a bit carefully, but because I have a good idea of what the code should look like, reviewing the generated code is a breeze. It's like reading a book I've read before. Or, much more rarely, there's something wrong and it jumps out at me right away, so I catch most issues early. Either way the speed up is significant.
jazzypants 10 hours ago [-]
I think the placebo effect might be a decent comparison. It works most of the time, and you don't worry about it as long as you fully believe in its efficacy. However, once the illusion is shattered, the positive effects are diminished, and you can never fully trust the solution again.
intended 15 hours ago [-]
> has a pretty high chance of working.
for MVPs, mock ups, prototypes or in the hands of an expert coder. You can't let them go unsupervised. The promise of automated intelligence falls far short of the reality.
crimsoneer 14 hours ago [-]
Not only "has a high chance of working", but you can pay more to make it more reliable. It really is striking to try running a harness like openClaw on a smaller or quantised model; it makes you realise how much complex, generally reliable tool use we now take for granted from SOTA models, which was totally impossible just a year ago.
j45 16 hours ago [-]
A "pretty high chance" isn't the intent or the impression the end user often has.
kergonath 14 hours ago [-]
Indeed, and it is a complicated problem to solve. A GUI or CLI can hide footguns or make them less likely to be misused. But an AI agent is perfectly happy to use a wrecking ball to drive a nail, without any second thought or confirmation.
j45 12 hours ago [-]
It’s a human articulation problem.
When it receives a generic, vague input, it is free to interpret it according to how its corpus fires, like in any human interaction.
How to articulate better is like writing a sentence that will stand the test of model updates.
kergonath 11 hours ago [-]
Even then. I don’t have an example off the top of my head but even perfectly clear sentences can lead the agent to strange places. Even between humans, miscommunication is easy, but then anyone sensible would ask for confirmation if their interpretation is weird. But the LLM very rarely questions the user.
I don’t think it’s fair to blame the user here. The tool must be operated by normal users.
vidarh 21 hours ago [-]
Humans also regularly drop hard requirements you specify, and similarly require review. Nevertheless we manage to increase the reliability of human output through processes and reviews, and most of the methods we use for harnesses are taken from experience in reducing reliability issues in humans, who are notoriously difficult to get to deliver reliably.
kaashif 20 hours ago [-]
The primary way to increase reliability is to automate: instead of humans producing some output manually, humans produce machines which produce that output.
I've seen a disturbing trend where a process that could've been a script or a requirement that could've been enforced deterministically is in fact "automated" through a set of instructions for an LLM.
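To make that concrete: a requirement like "never commit files over 1 MB" can be a deterministic check instead of a sentence in a markdown file an LLM may or may not honor. A minimal sketch; the threshold and the pre-commit wiring are invented for illustration:

```python
import os

# Invented threshold for illustration. The point is that the rule is
# enforced in code, not phrased as an instruction to an LLM.
MAX_BYTES = 1_000_000

def oversized_files(paths):
    """Return the paths whose on-disk size exceeds MAX_BYTES."""
    return [p for p in paths if os.path.getsize(p) > MAX_BYTES]
```

Wired into a pre-commit hook, that rule holds every time; written into a skill markdown, it holds most of the time.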
vidarh 18 hours ago [-]
Sure, when that is possible. However, there are lots of processes we don't know how to automate in a deterministic way. Hence the vast investment in building organisations of people, with mechanisms to make people's output more reliable through structure, reviews, and so on.
Large parts of human civilization rest on our ability to make something unreliable less unreliable through organisational structure and processes.
intended 15 hours ago [-]
We resolve that through liability, penalties, trust, responsibility, review and oversight.
At the end of the day, if I am spending $X on automation, I want to be able to sleep at night knowing my factory will not build a WMD or delete itself.
If it's simply a tool that is a multiplier for experts, then do I really need it? How much does it actually make my processes more efficient, faster, or more capable of earning revenue?
There is a LOT that is forgiven when tech is new - but at some point the shiny newness falls off and it is compared to alternatives.
vidarh 15 hours ago [-]
Liability, penalties, trust, and responsibility are means we use to encourage the application of the processes that do affect reliability; they do not affect reliability directly. They can be applied just as much to a team using AI as to one that does not.
Review and oversight do address reliability directly, which is why we use them to improve the reliability of mechanical processes as well, and why they are core elements of AI harnesses.
> If its simply a tool that is a multiplier for experts, then do I really need it? How much does it actually make my processes more efficient, faster, or more capable of earning revenue?
You can ask the same thing about all the supporting staff around the experts in your team.
> There is a LOT that is forgiven when tech is new - but at some point the shiny newness falls off and it is compared to alternatives.
Only teams without mature processes are not doing that for AI today.
Most of the AI deployments I work on are the outcome of comparing it to alternatives, and they are often part of initiatives to increase the reliability of human teams just as much as raw productivity, because the two are often one and the same.
intended 12 hours ago [-]
> Liability, penalties, trust, and responsibility are means we use to try to influence the application of the processes that do. They do not directly affect reliability. They can be applied just as much to a team using AI as one that does not.
Yes and no; see the next point.
> You can ask the same thing about all the supporting staff around the experts in your team.
I have a good idea of the shape of errors for a human-based process, its costing, and the type of QA/QC team that has to be formed for it.
We have decades, if not centuries of experience working with humans, which LLMs are promising to be the equivalents/superiors of.
I think you and I would both agree with the statement "use the right tool for the job".
However, the current hype cycle has created expectations of reliability from LLMs that drive 'Automated Intelligence' styled workflows.
On the other hand:
> part of initiatives to increase reliability of human teams
is a significantly more defensible use of LLMs.
For me, most deployments die on the altar of error rates. The only people who are using them to any effect are people who have an answer to "what happens when it blows up" and "what is the cost if something goes wrong".
(There is no singular thread behind my comment. I think we probably have more in agreement than not, and it's more a question of finding the precise words to describe the shapes we perceive.)
vidarh 9 hours ago [-]
> (There is no singular thread behind my comment. I think we probably have more in agreement than not, and it's more a question of finding the precise words to describe the shapes we perceive.)
I moved this up top, because I agree, despite the length of the below:
> However, the current hype cycle has created expectations of reliability from LLMs that drive 'Automated Intelligence' styled workflows.
Because for a lot of things it works. Today. I have a setup doing mostly autonomous software development. I set direction. I don't even write specs. It's not foolproof yet by any means - that is on the edge of what is doable today. Dial it back just a little bit, and I have projects in production that are mostly AI written, that have passed through rigorous reviews from human developers.
The key thing is that you can't "vibecode" that. I'm sure we agree there.
There needs to be a rigorous process behind it, and I think we'll agree on that too.
Those processes are largely the same as the processes required for human developers. Only for human developers we leave a lot of that process "squishy" and under-specified.
We trust our human developers to mostly do the right thing, even though many don't, and to not need written checklists and controls, even though many do.
What is coming out of this is a start of systems that codify processes that are very much feels based with human teams. Partly because we still need to codify them for AI, but also because we can - most people wouldn't want to work in the kind of regimented environment we can enforce on AI.
Sure, there is a lot of hype from people who just want to throw random prompts at an LLM and get finished software out. That is idiocy. Even a super-intelligent future AI can't read minds.
But there are a lot of people building harnesses to wrap these LLMs in process and rigor to squeeze as much reliability as possible from them, and it turns out you can leverage human organisational knowledge to get surprisingly far in that respect.
intended 7 hours ago [-]
> Because for a lot of things it works. Today. I have a setup
> There needs to be a rigorous process behind it, and I think we'll agree on that too.
I would simplify it to: “I have a setup” is the part that is doing the actual heavy lifting.
From my very unscientific survey / extensive pestering of network, the only people getting lift out of AI are people with both domain expertise/experience and familiarity with the tooling.
The types of automation I see people wanting, though, are fully automated customer support systems, fully automated document review - essentially white-collar dark factories (hey, that's a good term). The need is for a process that is stable and behaves the same way every time.
It seems actual AI use cases are more like sketching - if you have enough skill you can make out the rough sketch is unbalanced and won’t resolve into a good final piece. Non experts spend far more time exploring dead ends because they don’t have the experience.
In my opinion, it’s a force multiplier for experts or stable processes, and it’s presented as Intelligence.
I feel your examples fit within these boundaries as well as the ones you have described.
j45 16 hours ago [-]
Underrated comment.
So many applications of LLMs fail to start with a deterministic brain when using a non-deterministic LLM, and then people wonder why it's not working.
jnpnj 15 hours ago [-]
It's strange to see software engineers using skills (i.e., human-language descriptions of small scripts) instead of scripting things directly. Often there have been CLIs / tools / libraries to do what a skill does for many years. Maybe it's a culture issue: people who enjoy automation / devops / predictability will naturally help themselves, but other people just want to "delegate" and be done without trying.
vidarh 12 hours ago [-]
When people do that, they are using skills wrong. The best way to use a skill is as a means to give targeted instructions on how to make use of CLIs / tools / libraries, with the skill covering just the "squishy bits" that aren't easily encoded into something deterministic.
hansmayer 17 hours ago [-]
[flagged]
vidarh 16 hours ago [-]
Because certain aspects (both are error prone) are similar and comparable. The notion that two entities need to be close in abilities for it to be possible to compare them is nonsense.
You make the point for me: We managed to put men on the moon despite humans being enormously unreliable and error prone, because we built system around them that allowed for harnessing the good bits and reducing the failures to acceptable levels.
We are - I am anyway - using our lessons from building reliable systems from unreliable elements to raise the reliability of outputs of LLMs the same way.
hansmayer 14 hours ago [-]
> We are - I am anyway - using our lessons from building reliable systems from unreliable elements to raise the reliability of outputs of LLMs the same way.
:) :) :) I could tell immediately you are somehow vested in the "success" of the LLM. So, 600 billion dollars and five years later, can you tell me how far you guys got? The Apollo programme cost a tiny fraction of that and started putting people on the moon some ~10 years later. Would you say that you are on the way to accomplishing something similar in the next five years?
vidarh 9 hours ago [-]
I wish I had used 600B. I've spent a few thousand, and my efforts are very much profitable and earning me a substantial living right now.
teodosin 17 hours ago [-]
Calm down. They were comparing a very specific and narrow aspect of both. Not totally equivalent maybe, but that doesn't justify a tantrum.
hansmayer 14 hours ago [-]
I am incredibly calm. I just wonder at the idiots who think they should compare the magnificently efficient human brain to the shitslop machines.
cortesoft 22 hours ago [-]
Everything you say is all possible, and in theory I agree with you.
However, I have been using spec-kit (which is basically this style of AI usage) for the last few months and it has been AMAZING in practice. I am building really great things and have not run into any of the issues you are talking about as hypotheticals. Could they eventually happen? Sure, maybe. I am still cautious.
But at some point, once you have personally used it in practice for long enough, you can't just dismiss it as snake oil. I have been a computer programmer for over 30 years, and I feel like I have a good read on what works and what doesn't in practice.
wg0 22 hours ago [-]
We can build all the scaffolding we want, but I assure you the fundamental problem here is that LLMs aren't perfect rule-following machines, and that will remain.
Give it a few more months and I'm sure you'll see some of what I see if not all.
I'm saying all of the above having tried and tested all sorts of systems with AI, which leads me to say what I said.
cortesoft 21 hours ago [-]
I have been doing this for 6 months or so now, and I am not sure that even if you have a lot more experience than me that it would make your assessment more accurate, since that just means you have more experience with prior generations of the models. What I have experienced is that the AI has been getting better and better, and is making fewer and fewer mistakes.
Now, part of that is my advancements as well, as I learn how to specify my instructions to the AI and how to see in advance where the AI might have issues, but the advancements are also happening in the models themselves. They are just getting better, and rapidly.
The combination of getting better at steering the AI, along with the AI itself getting better, is leading me to the opposite conclusion from yours. I have production systems that I wrote using spec-kit, that have been running in production for months, and have been doing spectacularly. I have been able to consistently add the new features that I need to, without losing any cohesion or adherence to the principles I have defined. Now, are there mistakes? Of course, but nothing that can't be caught and fixed, and not at a higher rate than traditional programming.
Quarrel 21 hours ago [-]
> LLMs aren't perfect rule following machines is the fundamental problem here
I kind of get what you're saying, but let us not pretend that SW engineers are perfect rule followers either.
Having a framework to work within, whether you are an LLM or a human, can be helpful.
saidnooneever 20 hours ago [-]
I think it depends on your goals and also on your preferences/expectations how your experience with LLMs goes. I don't mind if they hallucinate; even if I have a mental model of the code, I won't write it perfectly myself either.
The only downside I see is getting out of practice, which is why I don't use it for my passion projects. Work is just work, and pressing 1 or 2 and having "good enough" can be a fine way to get through the day. (Lucky me, I don't write production code ;D... goals...)
albedoa 14 hours ago [-]
> Give it a few more months
By that time, they will have realized immense value before seeing some of what you see. Sounds like an endorsement of spec-kit.
kajman 20 hours ago [-]
I hope the only reason people are pretending these markdown suggestions are a "workflow" is fear that a more structured approach will be obsolete by the time it's polished. I can't imagine the pace of innovation with the underlying models will stay like this forever.
I hope to see harnesses that will demand instead of ask. Kill an agent that was asked to be in plan mode but did not play the prescribed planning game. Even if it's not perfect, it'd have to be better than the current regime when combined with a human in the loop.
kergonath 14 hours ago [-]
> The slot machine can drop any hard requirement that you specify in your AGENTS.md, memory.md or your dozens of skill markdowns. Pretty much guaranteed.
Indeed. That said, I’ve had some success with agent skills, but I use them to make the LLM aware of things it can do using specific external tools. I think it is a really bad idea to use this mechanism to enforce safety rules. We need good sandboxing for this, and promises from a model prone to getting off the rails is not a good substitute.
But I have taught my coding agent to use some ad hoc tools to gather statistics from a directory containing experimental data, and things like that. Nobody is going to fine-tune an LLM specifically for my field (condensed matter physics), but using skills I can still make it do useful work. For example, monitoring simulations where some runs can fail for various reasons, and each time we must choose whether to run another iteration or restart from a previous point, based on eyeballing the results ("the energy is very strange, we should restart properly and flag for review if it is still weird", this sort of thing). I don't give too many rules to the agent; I just give it ways of solving specific problems that may arise.
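The "tools" here are tiny. A hypothetical sketch of the restart decision; the names and the drift threshold are invented for illustration, not from my actual setup:

```python
# Hypothetical restart-decision helper for a simulation run.
# drift_tol is an invented threshold, purely illustrative.
def restart_action(energies, drift_tol=0.05):
    """Flag a run when its last energy drifts far from the run mean."""
    mean = sum(energies) / len(energies)
    drift = abs(energies[-1] - mean) / abs(mean)
    if drift > drift_tol:
        return "restart_from_checkpoint_and_flag_for_review"
    return "continue_next_iteration"
```

The skill then just tells the agent when to call something like this, instead of trying to spell out the whole decision procedure in prose.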
selimthegrim 5 hours ago [-]
Do you have any information on skills you've found useful here?
raincole 20 hours ago [-]
Don't let the perfect be the enemy of the good. Of course we know the AGENTS.md and skills aren't 100% effective. But no, it doesn't mean that they're 0% effective.
moomin 7 hours ago [-]
This is like saying a +5 sword is useless because you still miss on a one. We've got to think about expected outcomes. If she's merging five solid PRs to your three, loudly complaining about the one she saw was rubbish and threw away misses the bigger picture.
peterbell_nyc 8 hours ago [-]
It helps if you hand the validation criteria both to the original agent as strong guidance and then to an adversarial agent as a quality reviewer. The adversarial agent is more likely to loop the work back if it fails the validation criteria.
I do find that just asking the same agent to do and check its own work is not particularly reliable.
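In harness terms the loop is simple. A sketch, where `run_agent` is a placeholder for whatever model call your setup makes, and the PASS/FAIL verdict convention is invented:

```python
# Generator / adversarial-reviewer loop. `run_agent(role, prompt)` is a
# stand-in for a real model call; criteria are plain-text requirements.
def review_loop(task, criteria, run_agent, max_rounds=3):
    feedback = ""
    for _ in range(max_rounds):
        draft = run_agent("generator", task + feedback)
        verdict = run_agent(
            "reviewer", f"Check against criteria: {criteria}\n\n{draft}")
        if verdict.startswith("PASS"):
            return draft
        # Loop the work back with the reviewer's objections attached.
        feedback = f"\n\nReviewer objections: {verdict}"
    raise RuntimeError("validation criteria not met after max_rounds")
```

The human-in-the-loop then only sees drafts that already survived the adversarial pass.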
SubiculumCode 12 hours ago [-]
All these points apply to human devs as well. The test is not infallibility but magnitude.
j16sdiz 14 hours ago [-]
A slot machine isn't snake oil.
A slot machine gives you rewards when the stars align; snake oil never does :)
blitzar 17 hours ago [-]
All this said, I quite like the mental model of documenting a simple process, and I suspect our future ai overlords will find it useful that I have a series of md files that outline my preferences and processes for certain tasks.
I am not however going to share any of this with work colleagues and make myself redundant.
Chris2048 11 hours ago [-]
> That leaves only one option not reliable but more reliable nevertheless: Human review and oversight.
Couldn't non-manual oversight also help e.g. sandboxes?
chaostheory 20 hours ago [-]
I can see why this would seem to be “snake oil” logically. However, this approach does work in reality. Your comment just shows that you seem inexperienced with using generative AI.
stellalo 21 hours ago [-]
> A skill is a markdown file with frontmatter that gets injected into the agent’s context when the situation calls for it.
When the LLM decides that the situation calls for it
> It is a workflow: a sequence of steps the agent follows, with checkpoints that produce evidence, ending in a defined exit criterion.
A sequence of steps the LLM can decide to follow
sharperguy 16 hours ago [-]
Skills are often invoked imperatively by the user. In cases where they are intended to be used directly by the LLM, a pointer to the skill would be included somewhere else in the context, e.g.:
```
After implementing the feature, read the testing skill for instructions on how to test.
```
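For what it's worth, the skill file itself is usually just markdown with YAML frontmatter. The fields below are illustrative; the exact frontmatter varies by harness, so don't treat this as any tool's documented schema:

```markdown
---
name: testing
description: How to run and interpret this repo's test suite
---

Run the project's test command from the repo root. If an integration
test fails, re-run it once in isolation before treating the failure
as real, and paste the failing output into your summary.
```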
forlorn_mammoth 11 hours ago [-]
How do you guarantee that the LLM follows an instruction given imperatively by the user? It probably will, but this is not guaranteed behavior. Likewise, _how_ it follows that instruction is non-deterministic.
It's turtles all the way down.
xboxnolifes 10 hours ago [-]
You can't guarantee it any more than you can guarantee your prompt gives the output you want. Skills are just prompt templates.
lionkor 18 hours ago [-]
Well, to be fair, in e.g. Codex you can invoke a skill directly, with $my-skill, and this WILL lead to the skill being injected into the context. At that point, the LLM follows the skill as well as it follows any other part of the prompt, instructions, or context.
ai_fry_ur_brain 1 days ago [-]
Can't wait for everyone to realize they've wasted a year+ messing with agents and experiencing a feeling of pseudo-productivity.
cortesoft 22 hours ago [-]
I can understand skepticism to a degree, and even fundamentally believing that AI is bad for all sorts of reasons, but I am becoming more and more perplexed at the certainty behind statements like this one. How are you so certain that AI development is this doomed? It just hasn't matched my experience at all, and I wonder what your experience is that has driven you to this level of certainty about the certain doom of AI coding?
Is it just a philosophical belief that AI is morally bad? Or have you actually used AI to build things and feel confident that you have explored the space enough to come to such a strong conclusion?
I have been writing code every day for over 30 years, and have been doing it professionally for over 20. I have seen fads come and go, and I have seen real developments that have changed the way I do what I do numerous times. The more experience and the more projects I create with AI, the more certain I am that this is a lasting and fundamental change to how we produce software, and how we use computers generally. I have seen AI get better, and I have seen myself get more proficient at using it to get real work done, work that has already been tested with real world, production, workloads.
You can hate that it is happening, and hate the way working with AI feels, but that doesn't mean it is not providing real value for people and doing real work.
ai_fry_ur_brain 9 hours ago [-]
I don't know any serious engineers that are doing real work with AI agents. I know some that are building features for web applications and just punching a clock, but I don't think that constitutes real work or provides much value to the world.
I like thinking, solving problems, and typing out code myself. I'm going to keep putting tons of care into my craft, and I promise I'll have more impact than the guy running 3 agents to build the 500th version of some web concept.
Rolex has a much bigger impact on the world than white label mass manufacturers in China.
ninininino 9 hours ago [-]
it's fascinating that you imagine that building features for web applications is never "real work".
ai_fry_ur_brain 8 hours ago [-]
It is real work, just 90% of it is either net negative for society or provides neutral value. Most web applications that are piling on features now because they have agents are piling on features that we never needed in the first place, hence why they weren't prioritized previously. Junk junk junk.
tokioyoyo 22 hours ago [-]
I’m a bit curious with these takes. Arguing in good faith - is the general assumption that people who use AI/agents/harnesses don’t ship features? We’ve been all in Claude Code since ~Septemberish, and have been able to successfully track the boost. Like the features that we ship that get used in production. Both from infrastructure side, and business logic implementations. Frontend and backend.
I don’t think people are wasting too much time. Although, I do agree most of these posts are just bs, including this one. But AI-development has been a thing across a lot of companies in the world.
bot403 22 hours ago [-]
Ignore the people who haven't found out how to use ai yet or don't want to.
AI is a powerful tool. Depending on what I need I use chatgpt, in-ide agents, or a platform like Devin.ai.
I use it when it helps me advance my goals. I don't when it doesn't. Sometimes it misses the mark and I scale back and have it do a specific piece and I'll do the rest.
Sometimes I use it to analyze the code base in seconds vs minutes. Sometimes I use it to pinpoint a bug fast.
I've solved customer issues in seconds and minutes with it, vs hours.
I worked on a banking app with deeply domain specific data issues. AI was not very helpful on that team. My current work on consumer web apps mean my problems are more mundane and AI is a big accelerant.
Being an engineer means solving problems with the right tools and the right tradeoffs. It's why I use an IDE vs Notepad, why I use ChatGPT for one-off scripts and "chat", and why I use agentic workflows for big, repetitive, or "boring" low-stakes tasks.
ai_fry_ur_brain 8 hours ago [-]
You are a bot lol
raincole 20 hours ago [-]
You're replying to an account specifically created to post inflammatory AI takes (likely a bot anyway). So your attempt
> Arguing in good faith
will be futile, unfortunately.
9 hours ago [-]
eloisant 10 hours ago [-]
I suspect some devs don't want AI to succeed - and it's understandable, as it will fundamentally change the way they work, and possibly put them out of a job as we need fewer developers.
So they convince themselves AI can't work because they don't want it to.
djhn 17 hours ago [-]
I can take on a slightly weaker form in good faith: professionally it’s a non-starter until private, open source inference can be self-hosted and the ROI is clear enough to invest in that.
And on the ROI side, trying things out regularly, I haven’t seen the positive ROI in the limited time I’ve dedicated to exploring the tools. I’ve restricted experimenting to 4 hours per month, because spending more than 2.5% of the month chasing productivity improvements that realistically seem to be 10-20%, will quickly eat into those gains. After accounting for token costs, it ends up being a wash.
theshrike79 16 hours ago [-]
"I studied math 4 hours per month and I can confirm that mathematics is stupid"
You can't learn how to use _anything_ by experimenting 4 hours a month.
djhn 9 hours ago [-]
I think I should also clarify, I work in the training of encoder-decoder transformer models. Before the ChatGPT era I worked on on encoder-only transformer models. I'm not unfamiliar with the literature and general discourse. I just do not use LLMs for programming.
intended 15 hours ago [-]
The poster provided numbers and thresholds they used to evaluate the utility of a business product.
With infinite time anything is possible, but since we live within constraints, discussing practical, real world thresholds or evaluation methods is a worthwhile use of our time.
swyx 22 hours ago [-]
> have been able to successfully track the boost.
lets get nitty gritty on this - can you say how you did this? because a lot of people think this is an unsolved problem
tokioyoyo 18 hours ago [-]
For my team, it has been easy. We deal with infrastructure for the entire org, so tickets get created for every request. We also have our own backlog for internal projects, so we can see burn rate, etc. The team hasn't changed, and a lot of similar/same tasks that used to take half a day have been completely automated, to the point where we just do PR review after an initial ticket is created by another team.
There are a lot of little things we've tracked, and it's just faster to implement things now. To be fair, everyone on my team has a decade+ of professional experience (many more non-professional), and we understand the limitations of AI fairly well.
djhn 17 hours ago [-]
What kind of code is infrastructure in this context? Devops in a software company? Internal tooling in a software org?
rubendev 17 hours ago [-]
What is your definition of faster to implement? Is it producing a plausible implementation, or is it faster at producing a correct and high quality implementation? Are you including time spent refactoring and fixing bugs in your metrics? If not, I think you are tracking a gut feeling rather than cold hard facts. I’m not saying this is easy to track, just saying that it’s hard to know for sure that you are really more productive with AI.
intended 15 hours ago [-]
Thank you for sharing any info at all.
> to be fair, everyone on my team has decade+ professional experience (many more non-professional), and we understand limitations of AI fairly well.
I see this appear quite often in discussions on productivity, to the point that a conclusion may be made regarding its centrality for productivity gains.
vidarh 21 hours ago [-]
Not the same person, but it really depends on projects. E.g. I have some projects that involve working to large specification sets where we can measure rate of delivery against the spec. If your spec is fuzzy and incomplete, then it gets hard, but then you have little insight into human productivity for those projects either.
ai_fry_ur_brain 9 hours ago [-]
Features no one asked for, that weren't important enough to invest in previously, are probably not that valuable.
_sharp 24 hours ago [-]
Right, just like all the productivity lost when people stopped using paper ledgers to mess around with these so-called 'databases'
vidarh 21 hours ago [-]
I work on projects where we measure the output. There's nothing "pseudo" about it.
zbentley 21 hours ago [-]
Tell me, what do you measure? Changes shipped? Lines of code? Customer satisfaction? Defect rate? MTTR? New engineer onboarding time/TTFC? Security/compliance audit turnaround time? Uptime? Employee retention? Rollback/forward-fix rates? Linter errors? Test coverage? Meaningful test coverage?
vidarh 20 hours ago [-]
Depends on the client's maturity, but in some places, all of the above.
adyavanapalli 13 hours ago [-]
What are the numbers you are getting?
vidarh 12 hours ago [-]
An initial drop, as people learn to use the tools and keep babysitting their harnesses. Then a significant boost once people get used to running the agents in the background, especially once they start running multiple sessions in parallel. I'd say you need a ~6 month push of getting people trained if they are not used to this way of working, and of customising setups etc. for your organisation, and then you start seeing significant payoff.
ai_fry_ur_brain 9 hours ago [-]
Was there previously a huge backlog of work to do, or are you just building tons of features for the hell of it because you can?
c0rruptbytes 23 hours ago [-]
i treat it like Minecraft automation - it's just for funsies and to pass the time haha
I don't think agentic workflows are there yet, but implementing skills to manually call and use while working side by side with an AI is definitely nice - our company is focused a lot on sandboxing right now and having safe skills
I don't think we've gotten feature development right yet, but the review skills + grafana skills they wrote have been pretty solid
jeremie_strand 21 hours ago [-]
[dead]
0000000000100 23 hours ago [-]
Trick is to not burn too much time worrying about the perfect skills and this and that. I see a lot of people filling skills with LLM junk, or overdoing rules that start confusing the LLM. Just try vanilla; see something you don't like? Then make a skill and funnel the LLM to use it for the style of task it's working on. E.g. database work is a mixed bag with LLMs: they tend to do work in totally different styles if you leave them unconstrained.
Agents are unbelievably useful at helping take over and refactor messy codebases though. I just started taking over this monstrous nightmare of a codebase, truly ancient code, the bulk of it written over 10+ years ago in PHP. With the use of Claude / Codex I was able to port over the vast majority of the existing legacy storefront and laid the groundwork for centralizing the 10-20k LOC mega-controller logic over to reusable repo/service patterns.
Just shit that would've taken years previously is achievable in under a month.
BOOSTERHIDROGEN 22 hours ago [-]
This.
Everything needs an element of human touch, I would somehow only run vanilla things. But if, let’s say, I’m creating backup scripts, I meticulously outline the plan.
pantheragmb 24 hours ago [-]
I couldn't agree more, just because I know I already wasted months and pulled the plug :D
__alexs 17 hours ago [-]
I'm sure lots of people felt this way about steam power too.
lukewarm707 13 hours ago [-]
if i wanted to find out the answer to my question, i would need to:
- open the browser
- google "john repo"
- find the website
- copy the repo name
- open the terminal
- cd
- git clone
- try to find the file i want
- read the whole file to find the answer
= answer
i now do:
- "john repo question" = answer
wg0 22 hours ago [-]
This will be another Microservices moment in our industry.
alfiedotwtf 10 hours ago [-]
At least we’ll be thankful for all the documentation developers have written in order to feed Claude better context.
Maybe the productivity we were trying to achieve was the friends we made along the way
fortyseven 11 hours ago [-]
With a username like that, I'm sure we're going to get an even-handed, well thought out and reasoned discussion about all of this.
wahnfrieden 24 hours ago [-]
You haven’t made money from their use yet?
slopinthebag 23 hours ago [-]
They will lie to themselves and deny it.
nothinkjustai 1 days ago [-]
You’ll get downvoted for this hearsay!
footy 24 hours ago [-]
I think you mean heresy. But maybe I don't get the reference you're making when you say hearsay
bot403 22 hours ago [-]
I'm wondering if there are anti-ai bots trolling the boards. Look at all the usernames of the negative AI posts.
Or maybe the only people left opposing AI are so hardcore against it they form their identity (username) around it
nothinkjustai 20 hours ago [-]
ok bot403
IncRnd 23 hours ago [-]
Hearsay is a rumor or something that can't be verified.
footy 17 hours ago [-]
I'm aware.
dmix 24 hours ago [-]
I've tried these larger agent skillsets in the past and felt it was a waste of time because it was just doing too much. Just like with vim, it's often better to pick and choose from the community instead of installing skills like they are an IDE. Skills are way too personal because every dev and dev team is different. So it's better to treat these as a reference for your own config rather than bulk-install someone else's config.
sunaookami 18 hours ago [-]
Same for MCPs and system instructions; there are a lot of people that just install everything without understanding it, cluttering their context, wasting >50k tokens on tools they don't need, and then complaining that they need to pay >$100 per month because they reach their limits too fast.
CharlesW 1 days ago [-]
From an SEO/LLMO perspective, the discoverability of these skills will be difficult without a rename: https://agentskills.io/
I would love to know how many people are actually using superpowers.
I showed up on the agentic dev scene prior to superpowers, and I am getting concerned that >50% of my self-rolled processes are now covered by superpowers.
I no longer trust gh stars, can anyone chime in? Is superpowers now truly adopted?
If it is truly valuable, why hasn't Boris integrated the concepts yet?
RideOnTime22 1 days ago [-]
It's just the new thing.
People were hyping up Oh My Opencode. When they realized it didn't lead to any significant gains in performance they hopped on the next thing.
And when the same thing happens to Superpowers it'll be something else they cling on because "this time it's different"
supermdguy 23 hours ago [-]
I've used it off and on over the last month or so. For more complicated tasks (30+ minutes) it works well, and seems to replace a lot of prompting that I'd normally need to do (e.g. asking questions about requirements, creating specs and implementation plans, staying on task). For simple tasks, it tries to do too much and gets in the way.
marcus_holmes 1 days ago [-]
I adopted superpowers, but then adapted it. I've changed some things, added some things. I suspect that my set of agent skills is probably overlapping with OP's by quite a lot now.
I also found that I have different skills for different tasks; at work security is a huge concern and I over-emphasise security in the skills. At play I'm less bothered about security and so the skills I've written to help me build stupid one-shot exploratory websites are less about security and more about refactoring and exploring concepts.
nullstyle 1 days ago [-]
I just removed superpowers from my own setup. In my opinion, given the quality of the planning modes in both claude code and codex, superpowers was really just slowing things down and burning more tokens than vanilla.
consumer451 1 days ago [-]
Thank you for the data point.
To give back as much as I can, I use the two built-in CC review processes when appropriate. But, those only do "is this PR good code?"
Far too late did I finally roll my own custom review skill that tests: "does this PR accomplish what the specs required?"
If I could ask for one more vanilla CC skill, it might be that. However, maybe rolling your own repo-aware skill via prompt is better?
I used superpowers - but it burns waay more tokens for basically the same outcome as a single line that states
"Please do planning and ask any required questions before implementing.
[my prompt]"
On the latest models and with a decent harness, the planning modes are quite good, and the single sentence telling it to ask you questions lets the model pick the right thing to ask about, instead of wasting a bunch of time/tokens on predefined skills that try to force basically the same result.
It does introduce a second set of required interactions, but you can have another agent be your "questions answerer" if you need it (result quality goes down a bit vs answering myself, but still quite good, especially if you spend a bit of time on the answerer prompt)
Basically - things are moving fast enough I'm not convinced buying into superpowers/agentskills/[daily prompt magic beans]/etc tooling really makes sense.
I'd stick to the defaults in the harness for most cases, and then work on being clear with the ask.
alfiedotwtf 10 hours ago [-]
A lover of Superpowers here, using it for about 2+ months now.
It allows you to explore the problem space upfront, it questions your assumptions, asks more probing questions to confirm what it’s found in the code, and by the time you’re ready to implement, it knows exactly what needs to happen.
Jesse should have called it the Socratic Method
DeathArrow 15 hours ago [-]
>I would love to know how many people are actually using superpowers.
I use them on and off. Also Get Shit Done and Compound Engineering. The best results I got were with Compound Engineering, but it burns tokens like crazy, especially in the review phase, where it does reviews with 5-12 agents in parallel - and I like to do a lot of reviews, for the plan, the documentation, and the code.
For some lighter tasks, builtin Claude Code skills like plan mode are enough.
ssgodderidge 24 hours ago [-]
This is like creating a React framework called ReactJS to compete with NextJS
esafak 1 days ago [-]
Looks like a bunch of canned skills served through a plugin?
ricardobeat 1 days ago [-]
Does superpowers actually work? The main skill file doesn't inspire much confidence:
"If you think there is even a 1% chance a skill might apply to what you are doing, you ABSOLUTELY MUST invoke the skill."
CharlesW 1 days ago [-]
This kind of "overprompting" is one technique that even the best skills/agents use to compensate for under-invocation, which happens when more demure advisory language tends to be rationalized away by LLMs.
It shouldn't be your default, but should absolutely be tried when your skill/agent test suite displays evidence that it's not being reliably invoked without it.
bsoles 3 hours ago [-]
Agent skills are ways of turning over our means of (software) production to our employers, while making ourselves obsolete at the same time. In the recent past, a (software) professional had to be continuously employed by their employer to maintain access to their professional skills. By transferring our skills into AI agent skills, we are basically giving away that privilege. In the near future, our employers might feel they don't need our skills anymore because it has already been captured by the AI agents. Somebody with a better grasp of economic history should be able to explain this using the analogy of what happened during industrial revolution and how the workers got screwed over.
thatmf 24 hours ago [-]
Why are people so excited to put themselves out of a job?
Not that these or any "skills" will do that, but just- in principle. This is like alienation from labor at scale.
cortesoft 21 hours ago [-]
I don't understand this thinking as a computer programmer. My whole life has been about getting a computer to do work so humans don't have to anymore. Every single piece of software written is supposed to take away work from someone.
Do you feel this way about every automation you create? I do know some old school sys admins who felt this way about a lot of infrastructure automation advancements, and didn't like that we were creating scripts and systems to do the work that used to be done by hand. My team created an automated patching system at a job that would automatically run patching across our 30,000 servers, taking systems in and out of production autonomously, allowing the entire process to be hands free. We used to have a team whose full time job was running that process manually. Did we take their jobs by automating it?
Sure, in a sense. But there was other work that needed to be done, and now they could do it.
The whole reason I like programming and computers and technology is precisely because it does things for us so we don't have to do it. My utopia is robots doing all the hard work so humans can do whatever we want. AI is bringing us one step closer to that, and I would rather focus on trying to figure out how we can make sure the whole world can benefit from robots taking our jobs (and not just the rich owners), rather than focus on trying to make sure we leave enough work for humans to stay busy doing shit they don't actually want to do.
theshrike79 8 hours ago [-]
The problem is that people are holding AI wrong. They're using AI as the engine of their solution without realising the true solution:
Use AI to create the engine. After that running the engine itself costs as much as keeping the computer running it online. No API costs for 3rd party LLM providers needed.
hibikir 23 hours ago [-]
Because we've been automating large parts of our former jobs for decades. Otherwise we'd all be trying to build things in the least efficient way possible to maximize how long the job takes, which IMO isn't a great idea.
Humans have been minimizing how much work is needed to get a certain level of output for as long as we can track. It's civilization. Should we go back to farming by hand with hoes, to maximize labor used? Go back to streetlights that are individually lit? The society that falls behind on automation becomes poorer, and eventually just dies, as even the people born there tend to choose to leave to higher productivity places. It happened to eastern europe, it happens to the Amish. To any poor society which gets emigration. Doing more with less has always been exciting.
dewey 23 hours ago [-]
Because usually the people who lose their jobs are people who do not adapt to the market.
Right now it's not clear in which direction everything is evolving, and that's why people experiment with handing all their data to random agents, figuring out how to store and access context, re-use prompts, and other attempts to harness this tech. Most of these will maybe be useless in a year as they might be deeply integrated into the next wave of models, but staying on top of the development has always been part of the fun of working in this field.
kiba 23 hours ago [-]
People are building bots to do the most legible thing possible, which is shipping features in X amount of time. But it doesn't matter if the bottleneck is the human thinking time required to output quality code rather than X amount of code written.
H8crilA 13 hours ago [-]
I am so much faster with the bots. If you're not faster with the bots then either you write very very little code, or you're doing it very wrong. Tactically they outsmart me 10-100x if you account for the write speed. Even if you just consider the knowledge of languages, libraries, patterns they clearly outperform me. Strategically I do not trust them at all, poor things suck at it, mainly because they always try to take the shortest possible path to the current destination.
And if you think that your personal protest against the automation will in any way affect the direction in which the industry goes then you're delusional. You would have to start something like a political party and collect way more people.
kiba 13 hours ago [-]
Wake me up when LLMs help me write better code and let me understand the codebase, and not before. Not faster, not more productive, but a more comprehensible codebase that I can reason in my own head.
Otherwise, if they write so much better code, then it's pointless to have a human in the loop.
H8crilA 7 hours ago [-]
You will develop quite a lot of illnesses sleeping for this long, but your choice I suppose. Who knows, maybe it happens as soon as next year. I would strongly suggest living a life, any life really, instead of waiting like that.
7 hours ago [-]
rglover 14 hours ago [-]
Survival instincts. If everyone and everything around you (your job included) is shouting "use AI" it's difficult to take any stand or introduce caution. I think it's less about being excited, more about hoping to not miss the wave and get "left behind."
I think both groups (pro vs anti) will be a bit surprised when the long-term data shows productivity gains were modest on average and producing quality software still needs care/human attention, even with the support of advanced, frontier models. Same job as before, now we just have a power drill instead of a screwdriver. Some people build houses that stand for hundreds of years, others less so.
dawnerd 21 hours ago [-]
It's likely the people that were not good developers, that suddenly got accelerated "to the top", who seem the most enthusiastic about it. All of the good devs I know have been a bit more cautious on the uptake.
onion2k 20 hours ago [-]
I think it's more subtle than that. There are a lot of measures of what a 'good developer' is, and one of them is 'shipping things'. AI is specifically accelerating that part of the industry - it's much easier to ship code faster now. If you're in a domain that doesn't need quality (easy horizontal scaling, bugs rarely have a critical impact, customers are relatively loyal) then AI is proving that shipping features is more important than code quality.
If you're in a part of the software industry that needs well-optimized and bug-free code then it's less useful. The problem for devs is that those parts of the industry are much smaller.
fortyseven 11 hours ago [-]
Funny, I know quite a few extremely talented programmers who cautiously approached the topic and found that, with proper use, LLMs are extremely useful. It's just a matter of understanding where the boundaries are, and using them responsibly. It's not a magic genie; it augments their existing skill.
clapthewind 24 hours ago [-]
Some people are playing the global optimization game; a world where anyone can have any (production grade) software they want.
onlyrealcuzzo 10 hours ago [-]
We've been automating stuff for 60 years, and it only leads to more automation.
At the end of the day, the more automation, the more people you need making sure things work.
There's always going to be a minimal bottleneck for how much an engineer can oversee if they need to do zero implementation.
We're not as far from that point as people think.
Most languages that most things are developed in are 10x more expressive than languages of yore.
Rust has a bad reputation for being hard, but it is actually quite expressive.
Less than 50% of what engineers do is code.
IBM was famous, in the early 2000s, for the average dev writing one line of code per day.
We're just going to move to a world where the average dev spends <10% of their time coding, but there's likely to be x times more work, so it mostly evens out.
cuteboy19 22 hours ago [-]
people are now being encouraged to use ai notetaking features under the guise of productivity.
a worker is just the sum total of all work related context. to collate, verify and organize this context is just asking to be replaced.
yieldcrv 22 hours ago [-]
Month 30 of software engineers not existing in 6 months
koliber 22 hours ago [-]
Lately I keep hearing the same thing over and over: the things that are good for managing a team of devs are good for LLMs.
Good test cases.
Clear and concise documentation.
CI/CD.
Best practices and onboarding docs.
Managing LLMs is becoming more and more similar to managing teams of people.
theshrike79 8 hours ago [-]
Yep, I've been saying this for about a year now. Actually gave a presentation on this internally with this exact anecdote :D
There are so many bad analogies I could use to describe it, but they're all bad so I won't try.
tempoponet 22 hours ago [-]
Similarly, the agentic coding success stories are from orgs that had all of these things out of the gate.
theshrike79 8 hours ago [-]
Or had the sense to build the guidelines without trying to rely on writing fanfiction to guide the LLM.
zmmmmm 1 days ago [-]
I was surprised how long some of these skills are. They are pages and pages long with tables and checkbox lists and code examples, etc.
Curious how normal that is - it would only take a couple of these to really fill the context a lot.
_pdp_ 20 hours ago [-]
The reason they are long is that these skills are produced mostly by Claude Code and Opus, and no sensible human will read these files, let alone build a mental model around them. There are just layers of assumptions that this works - when in reality it doesn't, and it is wasteful.
Here is a fun experiment.
Ask any LLM to write something vaguely familiar. For example, ask it to "write a fib". Since almost all LLMs are fine-tuned on code, I find that all of them will respond with a Fibonacci sequence algorithm, even though to a non-programmer "write a fib" means to write an unimportant lie.
So there is compression. You can express an outcome in just 3 vague tokens without going into detail about what exactly a Fibonacci sequence is.
That should be enough to understand that the length of the prompt does not matter. What matters is the right words, frequency and order. You can write a two-page prompt or a two-sentence prompt and both can have the same outcome.
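For illustration, here is the kind of thing those 3 vague tokens reliably expand into (a standard iterative version; any particular model's actual output will vary):

```python
def fib(n):
    """Return the n-th Fibonacci number (fib(0) == 0, fib(1) == 1)."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

print([fib(i) for i in range(8)])  # [0, 1, 1, 2, 3, 5, 8, 13]
```

The prompt carries none of this detail; the model fills it in from training data, which is exactly the compression being described.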
esperent 15 hours ago [-]
I just tried it with Gemini pro. I think this answer is about as good as you can expect for such an ambiguous question.
Write a fib
Since "fib" can mean a couple of different things, I've got you covered for both!
1. A Little Lie (A Fib)
"I'm actually typing this to you from a sunny beach in the Bahamas, sipping a piña colada." (Since I'm an AI, that is definitely a fib!)
2. The Fibonacci Sequence
If you meant the classic programming exercise, here is a Python function...
15 hours ago [-]
_pdp_ 15 hours ago [-]
I stand to be corrected. Though I tried again just now and this is what Gemini Pro produced:
> I'm assuming you mean a Fibonacci sequence generator! I'll write a Python script that includes both an iterative and a recursive way to generate Fibonacci numbers.
... and then wrote some python code.
esperent 13 hours ago [-]
In my opinion, if a tool that's designed to be an answer machine had to give exactly one response to "write a fib", the correct choice is the Fibonacci sequence. You're probably underestimating how many programming students might type a query like that in.
If you want a lie, then the normal grammar in English is to say "tell a fib". I bet every llm you test that on will respond by telling you a small lie, or at least note the ambiguity and then say it's going to revert to Fibonacci because that's more in line with what it's designed to do.
DeathArrow 15 hours ago [-]
[flagged]
gwerbin 1 days ago [-]
I quickly skimmed, and it looks like at least a few of them are intended to be more like system prompts for a tightly scoped sub-agent than a skill as such. I agree, I wouldn't want to use a lot of these in a longer-running work session.
I have been successful with short and focused skills so far. I treat them as a reusable snippet of context, but small ones. For example a couple of paragraphs at most about how to use Python in my project and how to run unit tests. I also have several short "info" skills that don't actually provide the agent instructions, they merely contain useful contextual information that the agent can choose to pull in if needed.
Even having too many skills can be an issue because the list of skill names and their descriptions all end up in the context at some point.
tecoholic 1 days ago [-]
I have written zero skills, so I'm not sure how normal it is. I counted the words in a couple of them and they seem to be in the 2k range. So 5 skills would be around 10k words. Even at a small LLM context of 128k tokens, that's still around 10%. And for a 1M context window like the big ones, it barely registers.
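The arithmetic, assuming a rough rule of thumb of about 1.3 tokens per English word (a heuristic, not a measured figure for any particular tokenizer):

```python
words_per_skill = 2_000
tokens_per_word = 1.3   # rough heuristic; varies by tokenizer and content
skills_loaded = 5

tokens = int(words_per_skill * tokens_per_word * skills_loaded)  # 13000
for window in (128_000, 1_000_000):
    print(f"~{tokens:,} tokens is {tokens / window:.1%} of a {window:,}-token window")
```

So roughly 10% of a 128k window, but barely over 1% of a 1M window, matching the estimate above.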
umeshunni 23 hours ago [-]
> it would only take a couple of these to really fill the context alot.
Only skill front-matter (name, description, triggers etc.) is loaded into context by default, so this isn't likely to happen without 1000s of skills.
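Concretely, the always-loaded part of a skill is just the front-matter at the top of its SKILL.md; a sketch of what that looks like (the skill name and description here are invented):

```markdown
---
name: review-db-migrations
description: Use when reviewing database migration PRs. Flags destructive
  operations, missing indexes, and rollback safety issues.
---

# Reviewing database migrations

The body below (checklists, examples, edge cases) is only read into
context if the agent actually invokes the skill.
```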
sergiotapia 1 days ago [-]
I reviewed the line counts of my own project skill files, and the top 3 I have are:
805 lines
660 lines
511 lines
Maybe I am _too_ conservative here. Lots to explore.
mohamedkoubaa 1 days ago [-]
No, you aren't.
cortesoft 22 hours ago [-]
What makes this better/different than spec-kit? It seems to have a very similar philosophy. I wonder if they could work together? Or would they just be duplicative?
Nothing, they’re the same garbage for developers who can’t be bothered to mindfully use AI when writing code, and then complain about mass layoffs
Lio 19 hours ago [-]
“A senior engineer’s job is mostly the parts that don’t show up in the diff.”
Agent Skills is Addy’s attempt to kill that job too. Cheers Addy. :P
ElijahLynn 1 days ago [-]
I've been using Agent Skills on a new side project and I'm really impressed so far! It holds my hand a lot of the way and lets me focus on developing a product instead of figuring out how to build it. I get to put much more energy into high-level architecture and product design.
Very grateful for this repository and everyone who contributed to it!
senko 1 days ago [-]
> This isn’t a coincidence. It’s the same SDLC every functioning engineering organisation runs, just in different vocabulary. [...] Amazon calls it the working-backwards memo and the bar raiser. Every healthy team has some version of this loop.
This (sdlc == working backwards & bar raiser) is so horribly wrong, that I hope this was an LLM hallucination.
In general, I'm starting to see these agent scaffolding systems as an anti-pattern: people obsess over systems for guiding agents and construct elaborate rube-goldberg machines and then others cargo-cult them wholesale, in an effort to optimize and control a random process and minimize human involvement.
yks 1 days ago [-]
The problem is it’s so rarely A/B tested, definitely not at scale. An engineer, who writes all these my-workflow-but-for-agents skills, proceeds to get the good outcome, while also seeing affirmations that the agent did follow the prescribed processes - that is considered a victory. In reality the outcome could’ve been just as good if they fed Claude a spec + acceptance criteria, or even a basic prompt for the simpler tasks.
AndyNemmity 1 days ago [-]
Yeah, I Blind A/B test everything, and a lot.
But I don't expect anyone to ever use my stuff. It's complicated as hell. But it's for me, and it works without me having to remotely think about the complexity.
I love that.
PleasureBot 15 hours ago [-]
All of these articles about setting up the perfect agent environments with skills, plugins, MCP servers, markdown files, etc. etc. remind me so much of the culture around setting up the perfect "productivity stack". You need the perfect note-taking app, ticketing app, calendar integrations, yada yada before you can really do anything meaningful. The reality is that you're going to get beat by someone with a few things written down on a piece of paper who is just getting stuff done.
BOOSTERHIDROGEN 1 days ago [-]
This is how we collectively approach Taylorism, isn't it? However, the world favors capitalism, for which Taylorism becomes handy scaffolding.
hansmayer 17 hours ago [-]
Why does it feel like it was AI-written ?
SudheerTammini 23 hours ago [-]
Recently I got access (enterprise) to the latest ChatGPT model with the ability to write skills to automate repeatable tasks. Without any prior knowledge I just started tinkering, and now, after creating and testing multiple skills in a real business environment, I can confidently say writing a good skill is a skill in itself. As the author mentioned, it's not an essay but a specific set of instructions organised in steps and in a concise manner.
codemog 1 days ago [-]
Everyone who writes this kind of stuff skips the boring parts: science and engineering.
Yep, benchmarks, comparisons of with/without, samples of generated code with/without. This kind of stuff matters, and you may be making your agent stupider or getting worse results without real analysis.
Also this prose reads like the author has drunk the Google kool-aid and not much else.
turlockmike 1 days ago [-]
The best way to prompt an LLM is to describe the outcome you want, that's it. They are trained as task completers. A clear outcome is way better than a process.
If the LLM fails, either you didn't describe your outcome sufficiently, or it misinterpreted what you said, or it couldn't do it (rare).
Common errors should be encoded as context for future similar tasks, don't bloat skills with stuff that isn't shown to be necessary.
stingraycharles 1 days ago [-]
> The best way to prompt an LLM is to describe the outcome you want, that's it. They are trained as task completers. A clear outcome is way better than a process.
This is not true for anything complex. They’re instruction followers, of which task completion is just one facet.
They’re also extremely eager to complete tasks without enough information, and do it wrongly. In the case of just describing task completion, despite your best efforts, there are always some oversights or things you didn’t even realize were underspecified.
So it helps a lot to add some process around it, eg “look up relevant project conventions and information. think through how to complete the task. ask me clarifying questions to resolve ambiguities. blah blah”. This type of prompt will also help with the new Opus 4.7 adaptive thinking to ensure it thinks through the task properly.
stult 1 days ago [-]
Agreed, and further, I'd argue the OP's division of LLM instructions into either process or outcome specification is a false dichotomy. My agentic process specification is about automatically specifying the outcomes that I would otherwise repeatedly have to tell the LLM to consider, like making sure test coverage is maintained, or that decisions are documented on the original Github issue. Or it's about correcting common failure modes, like when the agent spends an enormous amount of time running repo-wide tests while debugging a focused change, because the agent doesn't consistently optimize around the time-to-implement as an outcome. Arguably part of addressing those failure modes boils down to pure process in the sense that I specify a logical order for achieving the outcomes, e.g. creating a plan before implementing. But that is mostly to organize approval gates for my convenience, rather than structuring the agent's work per se.
tecoholic 1 days ago [-]
If there is anything we have learned in decades of software engineering, it's that "a clear outcome" is not easy to describe. In many cases it's impossible unless people from 4 different domains collaborate. That's why process matters. It allows software to be built in a "semi-standardized" way that can allow iterations to get us closer to the expected outcome, which might emerge over time.
Yes, not everything I use LLMs for is going to have the same level of ambiguity or complex requirements. Optimizing by choosing to skip over parts of the process is exactly what Addy is talking about in this article.
_pdp_ 20 hours ago [-]
This seems like common sense but it does not work in practice.
Prompting is just the first part. To get the outcome, you need other systems to steer the agent as it gets things wrong. Proper deterministic tests work. But there is also stuff that needs to happen during the LLM execution, like cycle detection etc. All of this adds up.
You cannot just prompt an LLM and hope for a good outcome. It might work in small isolated scenarios, but it just does not work consistently enough to call it reliable.
Without further guardrails enforced by the process or the harness, LLMs do not have sufficient capabilities to complete a task up to a certain standard.
alexjurkiewicz 1 days ago [-]
I agree that many skills are overblown and unnecessary. But there's a lot of value in giving AI the right process. See how much more effective Claude can be for moderate or large changes when using the superpowers skill.
tmaly 1 days ago [-]
Sometimes people don't know what they want.
I prefer the start small and iterate approach to arrive at a result.
Then I ask it to summarize. Sometimes after that I ask it to generalize.
peab 1 days ago [-]
a skill is just reusable/shareable context. It's just text, really. It's useful for things like documentation on how to use an API (this works better than MCP in my opinion), or a non-consensus way of doing something. For example, you can use Remotion to generate video. There are useful Remotion skills that allow you to reliably generate specific types of videos. Captions of a certain style, for example.
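For illustration, a minimal skill of the kind described might look like the file below. The `name`/`description` frontmatter follows the general SKILL.md convention, but the Remotion commands and file layout are assumptions made up for this sketch, not verified API details.

```markdown
---
name: remotion-captions
description: Generate styled caption videos with Remotion. Use when the
  user asks for burned-in captions or subtitle overlays.
---

# Remotion captions

1. Scaffold a project first if none exists (see the Remotion docs for
   the current init command).
2. Put caption data in a JSON file with start, end, and text fields.
3. Render the caption composition to an mp4 with the Remotion CLI.
```

The point is that this is plain text the agent loads on demand, not a protocol or a plugin binary.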
markbao 1 days ago [-]
That seems a bit reductive. Even with humans, there’s a range of interpretations and ways that something can be built or a task completed. Engineers remember stuff so you don’t have to keep repeating yourself. Skills are a way to describe your outcome without similar repetition.
nullsanity 1 days ago [-]
[dead]
ColinEberhardt 21 hours ago [-]
Agents Skills are built upon “Five design decisions [that] are the load-bearing ones”
And Open Design (HN front page yesterday) is supported by “Six load-bearing ideas”
The similarities in the way these prompt libraries are documented doesn’t feel coincidental.
konaraddi 24 hours ago [-]
There are so many ways, many redundant, to set up agents for software development that, beyond personal/team/org needs and tastes, I need to look into setting up some benchmarks to evaluate which setup is optimal, or whether the differences are even worth it.
tariky 21 hours ago [-]
What is the difference between superpowers and this?
I have used superpowers for several months now and it really does help. But the 90/10 rule still applies: 10% of the time it will produce a stupid decision. So always check the spec.
y-curious 1 days ago [-]
Thanks for this, going to steal a lot of this. I would install your plugin, but I worry about being able to delete it later. I also think that each one of these is better served customized to a developer. That said, I'm still going to grab some of these, thanks!
bvirkler 1 days ago [-]
A plugin is just a set of files, right? why wouldn't you be able to delete it later?
gavmor 1 days ago [-]
Naming things is such a hard problem that many devs don't even bother trying.
That being said, this post is full of reasonable assertions, so I'm looking forward to experimenting with this... whatever it is.
fragmede 1 days ago [-]
Wait, shit, are people using LLMs to name things now? I'm definitely out of a job then!
gavmor 12 hours ago [-]
I'm notorious for taking poetic license with naming—that's how we end up with `class Escutcheon`, or variables `recto` and `verso` where applicable in eg PDF generation.
But as much pleasure as I derive from novelty and specificity, my colleagues have oft expressed perplexity—whereas the terms which LLMs produce hew closer to the manifold (by definition!) and raise fewer eyebrows.
So, it has its turn.
kigiri 16 hours ago [-]
Naming things is my principal use for AI. I don't always pick a name from the suggested ones, but it sure helps me find better ones.
Trusteando 17 hours ago [-]
Design a test to verify that the harness keeps the rider on the horse. Parameterize it by context size.
theahura 23 hours ago [-]
I really wish he wouldn't use AI to write his posts. It would be faster to just post the prompt he used to write the article
petesergeant 23 hours ago [-]
I wish this fucking meme of "post the prompt" would die. Very little work is one-shotted, very little has a singular "the prompt", most is iterated until it's close to the vision of what the author actually set out to write.
rossant 21 hours ago [-]
Exactly. Glad to see someone else articulate this so clearly.
robeym 14 hours ago [-]
A far better approach is being precise with your prompts, and if you find a new model has any bad habits, address them specifically in your AGENTS.md and go on your way. If you want to throw in slop prompts, go ahead and add the massive AGENTS.md your employer gave you.
People waste too much time on this stuff. The next version could totally change how the model processes your agents.md.
Get good at prompting, use agents.md as a minimal model-annoyance fixer, and reset it often (every major release).
shruubi 18 hours ago [-]
Am I the only one who looks at guys like Addy Osmani and Steve Yegge who before LLM's had a good reputation and since then get the feeling they are cashing that reputation in to ride the LLM hype-cycle? Or is it just a matter of professional tech talking heads moving from writing books and giving conference talks about good engineering practices to talking about the new hot topic that sells books and conference tickets?
gosukiwi 1 days ago [-]
I wonder how does this compare to superpowers
hansmayer 17 hours ago [-]
What skills, mate? These are simply text files attempting to narrow down the specs, hoping that this will help the "AI" make fewer mistakes. But it is still crap, because, <drum-rolls> - it still depends on how this fits into the overall statistical model, which changes with every prompt, etc... Please stop peddling this bullshit, it does not work!
alfiedotwtf 10 hours ago [-]
Isn’t this what Mixture-of-Experts is but at a higher scale?
standardUser 11 hours ago [-]
> It produces code, declares victory, and moves on.
Not when I'm in charge. It proposes changes based on my detailed instructions, I review the proposed changes, only then do I have it implement code, and then I review it again. I understand my AI agent would prefer a quicker way but for the meantime, I'm still the one in charge.
onlyrealcuzzo 10 hours ago [-]
I think you're saying the same thing OP said.
The point is, their default behavior is to ship crap fast.
You have a process to handle that.
So does OP.
rafaelmn 18 hours ago [-]
> It’s people accepting plausible-sounding justifications for skipping the parts they don’t feel like doing.
WTF ? Almost always this was "skipping the parts because the deadline was 2 weeks ago". The "I don't feel like it" rationalizations are maybe 20% ? Unless deadlines are rationalizations too ?
karinakarina3 16 hours ago [-]
Another example of agent skills that give AI agents access to bitdrift's mobile observability platform for full-fidelity agentic investigations -- https://bitdrift.ai/
m3kw9 12 hours ago [-]
Most skills will also be deprecated once LLM's get updated to include these skills in their training.
m3kw9 12 hours ago [-]
I'm surprised these "elite" engineers are still talking about Claude, most engineers that really use this stuff have already switched to Codex.
DeathArrow 14 hours ago [-]
Thing is, we have an enormous amount of skill frameworks (this, GSD, spec-kit, superpowers, Compound Engineering etc) claiming to help with agentic coding.
And agents now have better built-in skills than they used to.
The fundamental problem with agent skills is that they don't have a hook for one-time installation. An agent can't just be a prompt. It also has to have some way to do initial setup work.
If I have an agent skill to look up prices of stocks, maybe I need to set up some tools and authentication first. There’s no way to express this!
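To make the gap concrete, here is one way such a hook could be imagined. The `setup` block below is purely hypothetical: no current skill format defines it, which is exactly the complaint above.

```yaml
# Hypothetical extension -- NOT supported by any current skill format.
name: stock-prices
description: Look up live stock prices for a given ticker.
setup:                      # imaginary one-time install hook
  run: pip install yfinance # tool dependency, installed once
  auth:
    env: STOCKS_API_KEY     # prompt the user once, then persist
```

Today, the workaround is usually to put these steps in the skill's instructions and hope the agent runs them before first use.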
scotty79 20 hours ago [-]
> Workflows are agent-actionable; essays are not. The same is true for human teams. If your team handbook is 200 pages, no one reads it under time pressure.
Agents do read that. And actually remember it. Because it's tiny compared with the other things you are cramming into their context.
AndyNemmity 1 days ago [-]
This is why I created the /do router, to route to all skills. I also have anti rationalization, progressive context discovery etc.
I only make it for me, so it's a bit complex and targeted towards me, and what I do, but it's pretty easy to adjust things.
I'm working on reading through Agent Skills; it seems we've converged on a lot of the same points, and I'd never seen it before, so I'm trying to get an understanding of it.
Edit 1: I don't like all the commands. I just rely on a single router to automatically decide what I want, and that feels like the most reasonable way to me to communicate with it.
I don't want to remember things. And that's the way for me to scale the number of skills and activities. I don't have to think about them.
I personally wouldn't call theirs an intelligent router. They are dancing between a few different skills. We have extremely different setups there.
But of course, I'm using way more context to get it done. I'm even sending it out to Haiku to build the route choices.
I choose to use tokens to make things better for myself, not everyone would make the same choice, so I certainly see why they are using a few skills, and composing them.
Edit 3: This is much easier for a user to wrap their head around because there's much less.
I am only focused on the best improvements I can make that show value for my use cases. This is straightforward to reason about.
This seems like a nice way to get the best concepts for people trying to understand them. I commend them for a clean, simple approach.
Edit 4: Yeah, I think there are some things I can learn from them which is always good.
I especially like simple decisions like collapsing the install details for each harness in the readme.
I'm going to read over the entire thing and look for opportunities to improve my stuff.
We are all working together, learning, testing, building, trying to find the best way to implement things.
encoderer 1 days ago [-]
I adopted a couple of these, the api design and ui testing ones have been particularly helpful.
for MVPs, mock ups, prototypes or in the hands of an expert coder. You can't let them go unsupervised. The promise of automated intelligence falls far short of the reality.
When it receives generic, vague input, it is free to interpret it according to how its corpus fires, like in any human interaction.
How to articulate better is like writing a sentence that will stand the test of model updates.
I don’t think it’s fair to blame the user here. The tool must be operated by normal users.
I've seen a disturbing trend where a process that could've been a script or a requirement that could've been enforced deterministically is in fact "automated" through a set of instructions for an LLM.
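As one hedged illustration of enforcing a requirement deterministically rather than via an LLM instruction: instead of writing "maintain test coverage" in AGENTS.md, a CI gate can simply fail the build. The threshold and entry point below are assumptions, not a real project's config.

```python
import sys

def coverage_gate(total_pct: float, threshold: float = 80.0) -> int:
    """Deterministic check: fail the build instead of hoping the
    agent remembers a 'maintain coverage' instruction.

    Returns a process exit code (0 = pass, 1 = fail)."""
    if total_pct < threshold:
        print(f"coverage {total_pct:.1f}% < {threshold:.1f}%",
              file=sys.stderr)
        return 1
    return 0

# In real CI this number would come from the coverage report;
# the value here is illustrative.
exit_code = coverage_gate(76.4)
# exit_code -> 1
```

The script's verdict is the same on every run, regardless of which model, prompt, or skill file produced the code it checks.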
Large parts of human civilization rests on our ability to make something unreliable less unreliable through organisational structure and processes.
At the end of the day, if I am spending X$s for automation, I want to be able to sleep at night knowing my factory will not build a WMD or delete itself.
If its simply a tool that is a multiplier for experts, then do I really need it? How much does it actually make my processes more efficient, faster, or more capable of earning revenue?
There is a LOT that is forgiven when tech is new - but at some point the shiny newness falls off and it is compared to alternatives.
Review and oversight does address reliability directly, and hence why we make use of those in processes to improve the reliability of mechanical processes as well, and why they are core elements of AI harnesses.
> If its simply a tool that is a multiplier for experts, then do I really need it? How much does it actually make my processes more efficient, faster, or more capable of earning revenue?
You can ask the same thing about all the supporting staff around the experts in your team.
> There is a LOT that is forgiven when tech is new - but at some point the shiny newness falls off and it is compared to alternatives.
Only teams without mature processes are not doing that for AI today.
Most of the deployments of AI I work on are the outcome of comparing it to alternatives, and often are part of initiatives to increase reliability of human teams just as much as increasing raw productivity, because they are often one and the same.
Yes and no. see next point.
> You can ask the same thing about all the supporting staff around the experts in your team.
I have a good idea of the shape of errors for a human based process, costing and the type of QA/QC team that has to be formed for it.
We have decades, if not centuries of experience working with humans, which LLMs are promising to be the equivalents/superiors of.
I think you and me, would both agree with the statement "use the right tool for the job".
However, the current hype cycle has created expectations of reliability from LLMs that drive 'Automated Intelligence' styled workflows.
On the other hand:
> part of initiatives to increase reliability of human teams
is a significantly more defensible use of LLMs.
For me, most deployments die on the altar of error rates. The only people who are using them to any effect are people who have an answer to "what happens when it blows up" and "what is the cost if something goes wrong".
(there is no singular thread behind my comment. I think we probably have more in agreement than not, and its more a question of finding the precise words to declare the shapes we perceive.)
I moved this up top, because I agree, despite the length of the below:
> However, the current hype cycle has created expectations of reliability from LLMs that drive 'Automated Intelligence' styled workflows.
Because for a lot of things it works. Today. I have a setup doing mostly autonomous software development. I set direction. I don't even write specs. It's not foolproof yet by any means - that is on the edge of what is doable today. Dial it back just a little bit, and I have projects in production that are mostly AI written, that have passed through rigorous reviews from human developers.
The key thing is that you can't "vibecode" that. I'm sure we agree there.
There needs to be a rigorous process behind it, and I think we'll agree on that too.
Those processes are largely the same as the processes required for human developers. Only for human developers we leave a lot of that process "squishy" and under-specified.
We trust our human developers to mostly do the right thing, even though many don't, and to not need written checklists and controls, even though many do.
What is coming out of this is a start of systems that codify processes that are very much feels based with human teams. Partly because we still need to codify them for AI, but also because we can - most people wouldn't want to work in the kind of regimented environment we can enforce on AI.
Sure, there is a lot of hype from people who just want to throw random prompts at an LLM and get finished software out. That is idiocy. Even a super-intelligent future AI can't read minds.
But there are a lot of people building harnesses to wrap these LLMs in process and rigor to squeeze as much reliability as possible from them, and it turns out you can leverage human organisational knowledge to get surprisingly far in that respect.
> There needs to be a rigorous process behind it, and I think we'll agree on that too.
I would simplify it to: “I have a setup” is the part that is doing the actual heavy lifting.
From my very unscientific survey / extensive pestering of network, the only people getting lift out of AI are people with both domain expertise/experience and familiarity with the tooling.
The types of automation I see people wanting, though, are fully automated customer support systems, fully automated document review - essentially white-collar dark factories. (Hey, that's a good term.) The need is for a process that is stable, and behaves the same way every time.
It seems actual AI use cases are more like sketching - if you have enough skill you can make out the rough sketch is unbalanced and won’t resolve into a good final piece. Non experts spend far more time exploring dead ends because they don’t have the experience.
In my opinion, it’s a force multiplier for experts or stable processes, and it’s presented as Intelligence.
I feel your examples fit within these boundaries as well as the ones you have described.
So many applications of LLMs start from a deterministic mindset while using a non-deterministic LLM, and then people wonder why it's not working.
You make the point for me: We managed to put men on the moon despite humans being enormously unreliable and error prone, because we built system around them that allowed for harnessing the good bits and reducing the failures to acceptable levels.
We are - I am anyway - using our lessons from building reliable systems from unreliable elements to raise the reliability of outputs of LLMs the same way.
:) :) :) I could tell immediately you are somehow vested in the "success" of the LLM. So 600B dollars and five years later, can you tell me how far you guys got? The Apollo programme cost a tiny fraction of that and started putting people on the moon some ~10 years later. Would you say that you are on the way to accomplishing something similar in the next five years?
However, I have been using spec-kit (which is basically this style of AI usage) for the last few months and it has been AMAZING in practice. I am building really great things and have not run into any of the issues you are talking about as hypotheticals. Could they eventually happen? Sure, maybe. I am still cautious.
But at some point once you have personally used it in practice for long enough, I can't just dismiss it as snake oil. I have been a computer programmer for over 30 years, and I feel like I have a good read on what works and what doesn't in practice.
Give it a few more months and I'm sure you'll see some of what I see if not all.
I'm saying all the above having all sorts of systems tried and tested with AI leading me to say what I said.
Now, part of that is my advancements as well, as I learn how to specify my instructions to the AI and how to see in advance where the AI might have issues, but the advancements are also happening in the models themselves. They are just getting better, and rapidly.
The combination of getting better at steering the AI along with the AI itself getting better is leading me to the opposite conclusion you have. I have production systems that I wrote using spec-kit, that have been running in production for months, and have been doing spectacularly. I have been able to consistently add the new features that I need to, without losing any cohesion or adherence to the principles I have defined. Now, are there mistakes? Of course, but nothing that can't be caught and fixed, and not at a higher rate than traditional programming.
I kind of get what you're saying, but let us not pretend that SW engineers are perfect rule followers either.
Having a framework to work within, whether you are an LLM or a human, can be helpful.
the only downside i see is getting out of practice, which is why for my passion projects i don't use it. work is just work and pressing 1 or 2 and having 'good enough' can be a fine way to get through the day. (lucky me i don't write production code ;D... goals...)
By that time, they will have realized immense value before seeing some of what you see. Sounds like an endorsement of spec-kit.
I hope to see harnesses that will demand instead of ask. Kill an agent that was asked to be in plan mode but did not play the prescribed planning game. Even if it's not perfect, it'd have to better than the current regime when combined with a human in the loop.
Indeed. That said, I’ve had some success with agent skills, but I use them to make the LLM aware of things it can do using specific external tools. I think it is a really bad idea to use this mechanism to enforce safety rules. We need good sandboxing for this, and promises from a model prone to getting off the rails is not a good substitute.
But I have taught my coding agent to use some ad hoc tools to gather statistics from a directory containing experimental data and things like that. Nobody is going to fine tune a LLM specifically for my field (condensed matter Physics) but using skills I still can make it useful work. Like monitoring simulations where some runs can fail for various reasons and each time we must choose whether to run another iteration or re-start from a previous point, based on eyeballing the results ("the energy is very strange, we should restart properly and flag for review if it is still weird", this sort of things). I don’t give too many rules to the agent, I just give it ways of solving specific problems that may arise.
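The "eyeball the energy" judgment call described above could be roughly sketched as a triage function the agent calls per run. This is a stand-in for illustration only: the relative-jump heuristic, the tolerance, and the return labels are all invented, not the commenter's actual tooling.

```python
def triage_run(energies: list[float], tolerance: float = 5.0) -> str:
    """Decide what to do with a simulation run from its energy trace.

    Hypothetical heuristic: if the latest energy jumps away from the
    running baseline by more than `tolerance` (relative), flag it.
    """
    if not energies:
        return "restart"                 # no data at all: start over
    last, prev = energies[-1], energies[:-1]
    baseline = sum(prev) / len(prev) if prev else last
    if baseline and abs(last - baseline) / abs(baseline) > tolerance:
        return "restart-and-flag"        # "the energy is very strange"
    return "continue"                    # run another iteration

decision = triage_run([-10.1, -10.0, -10.2, 250.0])
# decision -> "restart-and-flag"
```

The agent's role is then limited to gathering the trace and acting on the label, rather than inventing the criterion each time.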
I do find that just asking the same agent to do and check its own work is not particularly reliable.
A slot machine gives you rewards when the stars align; snake oil never does :)
I am not however going to share any of this with work colleagues and make myself redundant.
Couldn't non-manual oversight also help e.g. sandboxes?
When the LLM decides that the situation calls for it
> It is a workflow: a sequence of steps the agent follows, with checkpoints that produce evidence, ending in a defined exit criterion.
A sequence of steps the LLM can decide to follow
```
After implementing the feature, read the testing skill for instructions on how to test.
```
it's turtles all the way down.
Is it just a philosophical belief that AI is morally bad? Or have you actually used AI to build things and feel confident that you have explored the space enough to come to such a strong conclusion?
I have been writing code every day for over 30 years, and have been doing it professionally for over 20. I have seen fads come and go, and I have seen real developments that have changed the way I do what I do numerous times. The more experience and the more projects I create with AI, the more certain I am that this is a lasting and fundamental change to how we produce software, and how we use computers generally. I have seen AI get better, and I have seen myself get more proficient at using it to get real work done, work that has already been tested with real world, production, workloads.
You can hate that it is happening, and hate the way working with AI feels, but that doesn't mean it is not providing real value for people and doing real work.
I like thinking, solving problems and typing out code myself. Im going to keep putting tons of care into my craft and I promise I'll have more impact than the guy running 3 agents to build the 500th version of some web concept.
Rolex has a much bigger impact on the world than white label mass manufacturers in China.
I don’t think people are wasting too much time. Although, I do agree most of these posts are just bs, including this one. But AI-development has been a thing across a lot of companies in the world.
AI is a powerful tool. Depending on what I need I use chatgpt, in-ide agents, or a platform like Devin.ai.
I use it when it helps me advance my goals. I don't when it doesn't. Sometimes it misses the mark and I scale back and have it do a specific piece and I'll do the rest.
Sometimes I use it to analyze the code base in seconds vs minutes. Sometimes I use it to pinpoint a bug fast.
Ive solved customer issues in seconds and minutes with it vs hours.
I worked on a banking app with deeply domain specific data issues. AI was not very helpful on that team. My current work on consumer web apps mean my problems are more mundane and AI is a big accelerant.
Being an engineer means solving problems with the right tools and the right tradeoffs as well. It's why I use an IDE vs Notepad, use ChatGPT for one-off scripts and "chat", and use agentic workflows for big, repetitive, or "boring" low-stakes tasks.
> Arguing in good faith
will be futile, unfortunately.
So they convince themselves AI can't work because they don't want it to.
And on the ROI side, trying things out regularly, I haven’t seen the positive ROI in the limited time I’ve dedicated to exploring the tools. I’ve restricted experimenting to 4 hours per month, because spending more than 2.5% of the month chasing productivity improvements that realistically seem to be 10-20%, will quickly eat into those gains. After accounting for token costs, it ends up being a wash.
You can't learn how to use _anything_ by experimenting 4 hours a month.
With infinite time anything is possible, but since we live within constraints, discussing practical, real world thresholds or evaluation methods is a worthwhile use of our time.
lets get nitty gritty on this - can you say how you did this? because a lot of people think this is an unsolved problem
There are a lot of little things we’ve tracked, and it’s just faster to implement things now. To be fair, everyone on my team has decade+ professional experience (many more non-professional), and we understand the limitations of AI fairly well.
> to be fair, everyone on my team has decade+ professional experience (many more non-professional), and we understand limitations of AI fairly well.
I see this appear quite often in discussions on productivity, to the point that a conclusion may be made regarding its centrality for productivity gains.
I don't think agentic workflows are there yet, but implementing skills to manually call and use while working side by side with an AI is definitely nice - our company is focused a lot on sandboxing right now and having safe skills
I don't think we've gotten feature development well yet, but the review skills + grafana skills they wrote have been pretty solid
Agents are unbelievably useful at helping takeover and refactor messy codebases though. I just started taking over this monstrous nightmare of a codebase, truly ancient code the bulk of it written over 10+ years ago in PHP. With the use of Claude / Codex I was able to port over the vast majority of the existing legacy storefront and laid the groundwork for centralizing the 10-20k LOC mega-controller logic over to reusable repo/service patterns.
Just shit that would've taken years previously is achievable in under a month.
Everything needs an element of human touch, I would somehow only run vanilla things. But if, let’s say, I’m creating backup scripts, I meticulously outline the plan.
- open the browser
- google "john repo"
- find the website
- copy the repo name
- open the terminal
- cd
- git clone
- try to find the file i want
- read the whole file to find the answer
= answer
i now do:
- "john repo question" = answer
Maybe the productivity we were trying to achieve was the friends we made along the way
Or maybe the only people left opposing AI are so hardcore against it they form their identity (username) around it
If Addy reads this, how do you pitch this vs. Superpowers? https://github.com/obra/superpowers
I showed up on the agentic dev scene prior to superpowers, and I am getting concerned that >50% of my self-rolled processes are now covered by superpowers.
I no longer trust gh stars, can anyone chime in? Is superpowers now truly adopted?
If it is truly valuable, why hasn't Boris integrated the concepts yet?
People were hyping up Oh My Opencode. When they realized it didn't lead to any significant gains in performance they hopped on the next thing.
And when the same thing happens to Superpowers it'll be something else they cling on because "this time it's different"
I also found that I have different skills for different tasks; at work security is a huge concern and I over-emphasise security in the skills. At play I'm less bothered about security and so the skills I've written to help me build stupid one-shot exploratory websites are less about security and more about refactoring and exploring concepts.
To give back as much as I can, I use the two built-in CC review processes when appropriate. But, those only do "is this PR good code?"
Far too late did I finally roll my own custom review skill that tests: "does this PR accomplish what the specs required?"
If I could ask for one more vanilla CC skill, it might be that. However, maybe rolling your own repo-aware skill via prompt is better?
I used superpowers - but it burns waay more tokens for basically the same outcome as a single line that states
"Please do planning and ask any required questions before implementing.
[my prompt]"
On the latest models and with a decent harness, the planning modes are quite good, and the single sentence telling it to ask you questions lets the model pick the right thing to ask about, instead of wasting a bunch of time/tokens on predefined skills that try to force basically the same result.
It does introduce a second set of required interactions, but you can have another agent be your "questions answerer" if you need it (result quality goes down a bit vs answering myself, but still quite good, especially if you spend a bit of time on the answerer prompt)
Basically - things are moving fast enough I'm not convinced buying into superpowers/agentskills/[daily prompt magic beans]/etc tooling really makes sense.
I'd stick to the defaults in the harness for most cases, and then work on being clear with the ask.
It allows you to explore the problem space upfront, it questions your assumptions, asks more probing questions to confirm what it’s found in the code, and by the time you’re ready to implement, it knows exactly what needs to happen.
Jesse should have called it the Socratic Method
I use them on and off. Also Get Shit Done and Compound Engineering. The best results I got with Compound Engineering but it burns tokens like crazy, especially in the review phase where it does reviews with 5 - 12 agents in parallel - and I like to do a lot of reviews for both the plan and documentation and code.
For some lighter tasks, builtin Claude Code skills like plan mode are enough.
It shouldn't be your default, but should absolutely be tried when your skill/agent test suite displays evidence that it's not being reliably invoked without it.
Not that these or any "skills" will do that, but just- in principle. This is like alienation from labor at scale.
Do you feel this way about every automation you create? I do know some old school sys admins who felt this way about a lot of infrastructure automation advancements, and didn't like that we were creating scripts and systems to do the work that used to be done by hand. My team created an automated patching system at a job that would automatically run patching across our 30,000 servers, taking systems in and out of production autonomously, allowing the entire process to be hands free. We used to have a team whose full time job was running that process manually. Did we take their jobs by automating it?
Sure, in a sense. But there was other work that needed to be done, and now they could do it.
The whole reason I like programming and computers and technology is precisely because it does things for us so we don't have to do it. My utopia is robots doing all the hard work so humans can do whatever we want. AI is bringing us one step closer to that, and I would rather focus on trying to figure out how we can make sure the whole world can benefit from robots taking our jobs (and not just the rich owners), rather than focus on trying to make sure we leave enough work for humans to stay busy doing shit they don't actually want to do.
Use AI to create the engine. After that, running the engine costs only as much as keeping the computer it runs on online. No API costs for 3rd-party LLM providers needed.
Humans have been minimizing how much work is needed to get a certain level of output for as long as we can track. It's civilization. Should we go back to farming by hand with hoes, to maximize labor used? Go back to streetlights that are individually lit? The society that falls behind on automation becomes poorer, and eventually just dies, as even the people born there tend to choose to leave for higher-productivity places. It happened to Eastern Europe; it happens to the Amish. To any poor society that gets emigration. Doing more with less has always been exciting.
Right now it's not clear in which direction everything is evolving, and that's why people experiment with handing all their data to random agents, figuring out how to store and access context, reuse prompts, and other attempts to harness this tech. Most of these will maybe be useless in a year, as they might be deeply integrated into the next wave of models, but staying on top of the development has always been part of the fun of working in this field.
And if you think that your personal protest against the automation will in any way affect the direction in which the industry goes then you're delusional. You would have to start something like a political party and collect way more people.
Otherwise, if they write so much better code, then it's pointless to have a human in the loop.
I think both groups (pro vs anti) will be a bit surprised when the long-term data shows productivity gains were modest on average and producing quality software still needs care/human attention, even with the support of advanced, frontier models. Same job as before, now we just have a power drill instead of a screwdriver. Some people build houses that stand for hundreds of years, others less so.
If you're in a part of the software industry that needs well-optimized and bug-free code then it's less useful. The problem for devs is that those parts of the industry are much smaller.
At the end of the day, the more automation, the more people you need making sure things work.
There's always going to be a minimal bottleneck for how much an engineer can oversee if they need to do zero implementation.
We're not as far from that point as people think.
Most of the languages that most things are developed in today are 10x more expressive than the languages of yore.
Rust has a bad reputation for being hard, but it is actually quite expressive.
Less than 50% of what engineers do is code.
IBM was famous, in the early 2000s, for the average dev writing one line of code per day on average.
We're just going to move to a world where the average dev spends <10% of their time coding, but there's likely to be x times more work, so it mostly evens out.
A worker is just the sum total of all work-related context. To collate, verify, and organize this context is just asking to be replaced.
Good test cases.
Clear and concise documentation.
CI/CD.
Best practices and onboarding docs.
Managing LLMs is becoming more and more similar to managing teams of people.
There are so many bad analogies I could use to describe it, but they're all bad so I won't try.
Curious how normal that is - it would only take a couple of these to really fill up the context a lot.
Here is a fun experiment.
Ask any LLM to write something vaguely familiar. For example, ask it "write a fib". Since almost all LLMs are fine-tuned on code, I find that all of them will respond with a Fibonacci sequence algorithm, even though to a non-programmer "write a fib" means to write an unimportant lie.
So there is compression. You can express an outcome in just 3 vague tokens without going into detail about what exactly a Fibonacci sequence is.
That should be enough to understand that the length of the prompt does not matter. What matters is the right words, frequency, and order. You can write a two-page prompt or a two-sentence prompt and both can have the same outcome.
Write a fib
Since "fib" can mean a couple of different things, I've got you covered for both!
1. A Little Lie (A Fib) "I'm actually typing this to you from a sunny beach in the Bahamas, sipping a piña colada." (Since I'm an AI, that is definitely a fib!)
2. The Fibonacci Sequence If you meant the classic programming exercise, here is a Python function...
> I'm assuming you mean a Fibonacci sequence generator! I'll write a Python script that includes both an iterative and a recursive way to generate Fibonacci numbers.
... and then wrote some python code.
If you want a lie, then the normal grammar in English is to say "tell a fib". I bet every LLM you test that on will respond by telling you a small lie, or at least note the ambiguity and then say it's going to revert to Fibonacci because that's more in line with what it's designed to do.
I have been successful with short and focused skills so far. I treat them as a reusable snippet of context, but small ones. For example a couple of paragraphs at most about how to use Python in my project and how to run unit tests. I also have several short "info" skills that don't actually provide the agent instructions, they merely contain useful contextual information that the agent can choose to pull in if needed.
Even having too many skills can be an issue because the list of skill names and their descriptions all end up in the context at some point.
Only the skill front-matter (name, description, triggers, etc.) is loaded into context by default, so this isn't likely to happen without 1000s of skills.
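As a rough sketch of why that stays cheap: only the part above the second `---` is surfaced to the model up front, and the body loads when the skill fires. (The field names and commands below are illustrative, following the common Agent Skills convention; check your harness's docs for the exact schema.)

```markdown
---
name: run-python-tests
description: How to run this project's unit tests. Use when asked to
  test, verify, or debug Python code in this repo.
---

# Running the tests

<!-- Loaded only once the agent actually invokes the skill -->
Run `pytest -q` from the repo root. Never install packages globally.
```

With a couple of lines of front-matter per skill, even dozens of skills add only a small, fixed amount to the default context.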
https://github.com/github/spec-kit
Agent Skills is Addy’s attempt to kill that job too. Cheers Addy. :P
Very grateful for this repository and everyone who contributed to it!
This (sdlc == working backwards & bar raiser) is so horribly wrong, that I hope this was an LLM hallucination.
In general, I'm starting to see these agent scaffolding systems as an anti-pattern: people obsess over systems for guiding agents and construct elaborate rube-goldberg machines and then others cargo-cult them wholesale, in an effort to optimize and control a random process and minimize human involvement.
But I don't expect anyone to ever use my stuff. It's complicated as hell. But it's for me, and it works without me having to remotely think about the complexity.
I love that.
Yep, benchmarks, comparisons of with/without, samples of generated code with/without. This kind of stuff matters, and you may be making your agent stupider or getting worse results without real analysis.
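A with/without comparison doesn't need heavy tooling. A minimal sketch (the function names and trial counts here are illustrative, not from any framework; `run_task` is whatever your harness does to generate code and run the project's tests):

```python
def eval_pass_rate(run_task, tasks, trials=5):
    """Run each task several times and return the fraction that pass.

    run_task(task) -> bool: generate with the agent, run the tests,
    return True on success. Multiple trials smooth out the randomness.
    """
    results = [run_task(t) for t in tasks for _ in range(trials)]
    return sum(results) / len(results)

def compare_skill(run_with, run_without, tasks):
    """Crude A/B: same tasks, with and without the skill in context."""
    with_rate = eval_pass_rate(run_with, tasks)
    without_rate = eval_pass_rate(run_without, tasks)
    return {"with_skill": with_rate,
            "without_skill": without_rate,
            "delta": with_rate - without_rate}
```

If `delta` is near zero (or negative) across a representative task set, the skill is just burning context.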
Also this prose reads like the author has drunk the Google kool-aid and not much else.
If the LLM fails, either you didn't describe your outcome sufficiently, or it misinterpreted what you said, or it couldn't do it (rare).
Common errors should be encoded as context for future similar tasks, don't bloat skills with stuff that isn't shown to be necessary.
This is not true for anything complex. They’re instruction followers, of which task completion is just one facet.
They’re also extremely eager to complete tasks without enough information, and do it wrongly. In the case of just describing task completion, despite your best efforts, there are always some oversights or things you didn’t even realize were underspecified.
So it helps a lot to add some process around it, eg “look up relevant project conventions and information. think through how to complete the task. ask me clarifying questions to resolve ambiguities. blah blah”. This type of prompt will also help with the new Opus 4.7 adaptive thinking to ensure it thinks through the task properly.
Yes, not everything I use LLMs for is going to have the same level of ambiguity or complex requirements. Optimizing by choosing to skip over parts of the process is exactly what Addy is talking about in this article.
Prompting is just the first part. To get the outcome, you need other systems to steer the agent when it gets things wrong. Proper deterministic tests work. But there is also stuff that needs to happen during the LLM execution, like cycle detection. All of this adds up.
You cannot just prompt an LLM and hope for a good outcome. It might work in small isolated scenarios, but it just does not work consistently enough to call it reliable.
Without further guardrails enforced by the process or the harness, LLMs do not have sufficient capabilities to complete a task up to a certain standard.
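The cycle detection mentioned above can be as simple as refusing to let the agent repeat the same tool call. A minimal sketch (class name, window size, and thresholds are all illustrative assumptions, not from any particular harness):

```python
from collections import deque

class LoopGuard:
    """Abort an agent run when the same tool call recurs too often
    within a sliding window - a crude form of cycle detection."""

    def __init__(self, window=10, max_repeats=3):
        self.recent = deque(maxlen=window)   # last N tool calls
        self.max_repeats = max_repeats

    def check(self, tool_name, args):
        # Normalize the call so identical invocations compare equal.
        call = (tool_name, repr(sorted(args.items())))
        self.recent.append(call)
        if self.recent.count(call) >= self.max_repeats:
            raise RuntimeError(f"possible loop: {tool_name} repeated "
                               f"{self.max_repeats}x in window")
```

The harness calls `guard.check(name, args)` before dispatching each tool call; when it raises, you stop the run and either reprompt or escalate to the human.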
I prefer the start small and iterate approach to arrive at a result.
Then I ask it to summarize. Sometimes after that I ask it to generalize.
And Open Design (HN front page yesterday) is supported by “Six load-bearing ideas”
The similarities in the way these prompt libraries are documented doesn’t feel coincidental.
I've been using superpowers for several months now and it really does help. But the 90/10 rule still applies: 10% of the time it will produce a stupid decision. So always check the spec.
That being said, this post is full of reasonable assertions, so I'm looking forward to experimenting with this... whatever it is.
But as much pleasure as I derive from novelty and specificity, my colleagues have oft expressed perplexity—whereas the terms which LLMs produce hew closer to the manifold (by definition!) and raise fewer eyebrows.
So, it has its turn.
People waste too much time on this stuff. The next version could totally change how the model processes your agents.md.
Get good at prompting, use agents.md as a minimal model-annoyance fixer, and reset it often (every major release).
Not when I'm in charge. It proposes changes based on my detailed instructions, I review the proposed changes, only then do I have it implement code, and then I review it again. I understand my AI agent would prefer a quicker way but for the meantime, I'm still the one in charge.
The point is, their default behavior is to ship crap fast.
You have a process to handle that.
So does OP.
WTF? Almost always this was "skipping the parts because the deadline was 2 weeks ago". The "I don't feel like it" rationalizations are maybe 20%? Unless deadlines are rationalizations too?
And agents now have better built-in skills than they used to.
Who will have the time to A/B test it all?
If I have an agent skill to look up prices of stocks, maybe I need to set up some tools and authentication first. There’s no way to express this!
Agents do read that. And actually remember it. Because it's tiny with other things you are cramming into their context.
I only make it for me, so it's a bit complex and targeted towards me, and what I do, but it's pretty easy to adjust things.
https://github.com/notque/vexjoy-agent
Working on reading through Agent Skills, it seems we've converged on a lot of the same points, and I've never seen it, so trying to get an understanding of it.
Edit 1: I don't like all the commands. I just rely on a single router to automatically decide what I want, and that feels like the most reasonable way to me to communicate with it.
I don't want to remember things. And that's the way for me to scale the number of skills and activities. I don't have to think about them.
Edit 2: We have very different routers.
https://github.com/addyosmani/agent-skills/blob/f504276d8e07...
vs
https://github.com/notque/vexjoy-agent/blob/main/skills/do/S...
I personally wouldn't call theirs an intelligent router. They are dancing between a few different skills. We have extremely different setups there.
But of course, I'm using way more context to get it done. I'm even sending it out to Haiku to build the route choices.
I choose to use tokens to make things better for myself, not everyone would make the same choice, so I certainly see why they are using a few skills, and composing them.
Edit 3: This is much easier for a user to wrap their head around because there's much less.
I am only focused on the best improvements I can make that show value for my use cases. This is straightforward to reason about.
This seems like a nice way to get the best concepts for people trying to understand them. I commend them for a clean, simple approach.
Edit 4: Yeah, I think there are some things I can learn from them which is always good.
I especially like simple decisions like collapsing the install details for each harness in the readme.
I'm going to read over the entire thing and look for opportunities to improve my stuff.
We are all working together, learning, testing, building, trying to find the best way to implement things.