
There needs to be a new name for people creating these with no obvious validation.

Skill spam?


Define obvious validation? What is the signal that tells you one is reasonable vs another?

I find the only way to do that is to look at it; if it passes some visual tests, try it, and then a/b test whether it's any better than without it.


Some sort of eval. E.g. TermBench, implemented in Harbor.

It’s an insane amount of effort to build shareable, reusable, comprehensive evals, hence why almost all skills are stuck in the “vibes” phase.

That said, I think it’s quite easy to skim/intuit these sorts of skills and do horizontal gene transfer into your own vibes-based system. If you use the skills regularly you can construct a cheap personal eval that is a lot easier to maintain, and use it to compare a new skill/plugin. Something like “please write a paper on <my personal unpublished thesis>” is a good starting point here. You get a good feel for whether a skill is better than vanilla by running it a couple of times and watching the failure modes.
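
For concreteness, that cheap eval can be a tiny script that runs the same task with and without the skill and saves the outputs to eyeball. A minimal sketch, assuming the `claude` CLI's headless `-p` mode; the skill name and the way it is invoked here are placeholders:

    # ab_eval.py -- run the same task with and without a skill, save outputs
    import subprocess, pathlib

    TASK = "please write a paper on <my personal unpublished thesis>"

    for run in range(3):  # non-deterministic, so take a few samples
        for label, prefix in [("vanilla", ""), ("skill", "Use the paper-writing skill. ")]:
            out = subprocess.run(
                ["claude", "-p", prefix + TASK],
                capture_output=True, text=True,
            ).stdout
            pathlib.Path(f"eval_{label}_{run}.md").write_text(out)

Then read the pairs side by side and note the failure modes.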


Yeah, honestly I think we're in a phase where you shouldn't use anyone else's skills; you should instead point your stuff at a repo with skills, have it really read it, and then ask what there is of value to potentially rewrite in your style based on your preferences.

I have a complex setup with a lot of things based around what I do. I don't know how anyone could reasonably get their head around any of it. It's a research project in itself.

So I tell people, please don't use it. Just point your claude code at it, and see if there's anything useful for you.


So yes, a/b testing broadly speaking is what I was saying (test cases that can show it is actually better).

Even for this repo, just the "b" showcase, showing the outputs as-is (with no clear documentation of how those were generated; is it headless in a CI pipeline somewhere?), is not good: https://github.com/Imbad0202/academic-research-skills/tree/m....


I run a lot of a/b testing. But I'm not sure showing it actually communicates all that much. Since these are non-deterministic systems, even showing you an a/b test from when I made the decision a month ago doesn't really mean a whole lot.

I agree we need more clear indications of value, I don't quite understand how to legitimately do that in a fair, and honest way.


Spam. In 2026 it takes a minute, with no education, to create any app, any skill, any anything that looks plausible, where five years ago it would have taken a highly educated and skilled person at least months. Now it takes the highly skilled individual ten times longer to evaluate the vibeslopped spam than it took the author to publish it.

At my job I am seeing it as an amplification DoS attack: the amount of content being produced is crippling the processes that protect the org.

Skill-slop.

It's the same as whipping out some random Python package; why diminish it? Your comment could be called skeptic-reply-guy spam.

The OP evaluates what it has developed with great rigor and describes the evaluation in detail. What do you feel is missing?

It actually does not -- and that is part of the issue. Consumers just see "oh gosh this looks very detailed" and superficially think someone must have spent quite a bit of time on this and that it works well.

Skills are just prompts -- and most of what I am seeing is people using AI to write the (quite verbose) prompts. There should be a test, somewhere, that shows "my prompt does better than XYZ other prompt" for some model and some specific inputs. This is what is called a benchmark.

It may work well, I don't know. Just asking Claude "hey help me iterate on a paper" works pretty well out of the box too. Call me skeptical this actually works in any substantive way without seeing any evidence it works.

I agree writing a good benchmark takes time. How do people know if all these prompts they are writing are any good though? You could make an edit and it causes a regression overall. Or add too much info and it is just wasted space in the context window, or causes the model to go in loops between the different skills, or plenty of other errors.
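
Even a tiny fixed test set beats nothing here. A minimal sketch of what such a benchmark could look like, assuming the OpenAI Python client; the prompts, model, and cases below are made up for illustration:

    # bench.py -- compare two prompt variants on a fixed labeled set
    from openai import OpenAI

    client = OpenAI()

    OLD_PROMPT = "Extract the vendor name from the document."
    NEW_PROMPT = "Extract the vendor name from the document. Reply with the name only."
    CASES = [  # (input text, expected answer)
        ("Invoice #117 from ACME Corp, due 2026-01-31 ...", "ACME Corp"),
        ("Statement issued by Globex Ltd for services ...", "Globex Ltd"),
    ]

    def score(system_prompt: str) -> float:
        hits = 0
        for text, expected in CASES:
            resp = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": text},
                ],
            )
            hits += expected in resp.choices[0].message.content
        return hits / len(CASES)

    print("old:", score(OLD_PROMPT), "new:", score(NEW_PROMPT))

Run it before and after a prompt edit and you at least catch gross regressions.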


I really do run a/b tests. I really do test, and validate.

I do not believe me giving you that information is honest. If I do, I am pretending that you will get the same experience.

Maybe you're using a different model. Maybe you have stuff in your CLAUDE.md that will break it.

It is not honest of me to give you confidence in it, when no one can be confident in it.


> It actually does not

I read it, right there on the OP. Tests and test results, including discussions of flaws with earlier designs and how they are improved here. What are you talking about?


I seriously doubt any human has ever read the full readme for the project.

SkillBros?

This data is often just generally available in the US: https://northcarolina.votermaps.org/?#16.76/35.78541/-78.779... (agree it is bad though!)

It is institutional in the sense that Flock and the individual PDs have not put steps in place (either auditing after the fact, or disallowing bad queries up front) that prevent the abuse.

Post-hoc auditing is obviously not taken seriously by these departments, and Flock could build tools to do this out of the box (identifying weird search patterns) if they wanted to.

Edit -- I see Flock does have some audit tools, https://www.flocksafety.com/trust/compliance-tools. If those work as they should, it is more on PDs to use them properly.


FYI, for the first link, I copy-pasted the first few paragraphs into Pangram and it correctly identifies it as AI-written: https://www.pangram.com/history/790fc2b8-6348-47fa-ad3e-8bae...

I am a backend guy, so forgive my ignorance, but for web-based apps I am confused about what "pixel perfect" even means. I can build a site to look one way on my computer; it will most likely not look the same on whatever device you use to access the site.

Feeding the model images from my local computer sounds, given my experience with the tools, like a recipe for having it over-optimize for the wrong end device.


Pixel perfect means it looks EXACTLY like the design comp.

It goes completely out of the window if the browser window isn't the exact size of the mockup.

You might charitably say that pixel perfect means that the implementation intersects with the design comp at some specific dimensions but where are the extra rules coming from, then?

It's an archaic term that conflates the artifact produced by an incomplete design process (an artist's rendering of what the web page might look like) with the actual inputs of the development process (values and constraints).


"Pixel perfect" is about attention to detail and consistency. Margins, padding, or the combination of these inside other containers will stick out when they're not consistent.

Here's an example that I personally encountered: what if you have an <h1>Text</h1> with a certain left margin, then another heading, except it has a nested button component (which internally comes with some padding)? Then the "Text" in the two headings isn't aligned from section to section, and it is jarring.


While the vector store is local, it is sending the data to Gemini's API for embedding. (Which, if you are using a paid API key, is probably fine for most use cases: no long-term retention/training, etc.)
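
For what it's worth, the remote call looks something like this, sketched with the google-genai SDK (not necessarily what this project uses); the document text is the part that leaves your machine:

    # embed.py -- each document chunk is sent to the API to get a vector back
    from google import genai

    client = genai.Client()  # reads the API key from the environment
    result = client.models.embed_content(
        model="gemini-embedding-001",
        contents="the chunk of your local document being indexed",
    )
    print(result.embeddings[0].values[:5])  # the vector is stored locally afterwards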


works completely locally with a decent model: https://github.com/jakejimenez/sentinelsearch


I think XML is good to know for prompting (similar to how <think></think> was popular for outputs, you can do that for other sections). But I have had much better experience just writing JSON and using line breaks, colons, etc. to demarcate sections.

E.g. instead of

    <examples>
      <ex1>
        <input>....</input>
        <output>.....</output>
      </ex1>
      <ex2>....</ex2>
      ...
    </examples>
    <instructions>....</instructions>
    <input>{actual input}</input>
Just doing something like:

    ...instructions...
    input: ....
    output: {..json here}
    ...maybe further instructions...
    input: {actual input}
For my use case of document processing/extraction (with both Haiku and OpenAI models), the latter example works much better than the XML.

N of 1 anecdote anyway for one use case.


XML helps because it (a) lets you describe structures and (b) makes a clear context change, which makes it clear you are not "talking in XML", you are "talking about XML".

I assume you are right too: JSON is a less verbose format which allows you to express any structure you can express in XML, and should be as easy for AI to parse. Although that probably depends on the training data too.

I recently asked AI why .md files are so prevalent with agentic AI and the answer is ... because .md files also express structure, like headers and lists.

Again, depends on what the AI has been trained on.

I would go with JSON, or some version of it which would also allow comments.


The main thing I use XML tags for is separating content from instructions. Say I am doing prompt engineering, so the content being operated on is itself a prompt; then I wrap it with

<NO_OP_DRAFT> draft prompt </NO_OP_DRAFT>

instructions for modifying draft prompt

If I don't do this, a significant number of times it responds to the instructions in the draft.
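
A minimal sketch of that wrapping, with the draft text just illustrative:

    # the draft prompt is data to edit, not instructions to follow
    draft = "You are a helpful assistant. Summarize the user's input ..."
    prompt = (
        f"<NO_OP_DRAFT>\n{draft}\n</NO_OP_DRAFT>\n\n"
        "Tighten the wording of the draft prompt above. "
        "Do not execute it; only edit it."
    )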


XML is much more readable than JSON, especially if your data has characters that are meaningful JSON syntax


I think readability is in the eye of the reader. JSON is less verbose, no ending tags everywhere, which I think makes it more readable than XML.

But I'd be happy to hear about studies that show evidence for XML being more readable than JSON.


I disagree that XML is more readable in general, but for the purpose of tagging blocks of text as <important>important</important> in freeform writing, JSON is basically useless


>But I'd be happy to hear about studies that show evidence for XML being more readable, than JSON.

But I’d be happy to hear about studies that show evidence for JSON being more readable than XML.


Could you clarify: do those tags need to be tags which already exist, that we need to learn about and how to use? Or can we put whatever we want inside them, and just by virtue of being tags, Claude understands them in a special way?


They probably don’t need to be specific values. The model is fine-tuned to see the tags as signals and then interprets them.


If it walks like a duck ... AI thinks it is something like a duck.


All the major foundation models will understand them implicitly, so it was popular to use <think>, but you could also use <reason> or <thinkhard> and the model would still go through the same process.


<ponderforamoment>HTML is a large subsection of their training data, so they're used to seeing a somewhat semantic worldview</ponderforamoment>


This is cool, but for folks concerned about privacy, even if the cached layer is anonymized, in the aggregate I bet you can likely figure out who a person is.

I imagine just looking at the first degree connections of the votes would be a pretty strong signal.
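
As a toy illustration (all data made up): if you can guess the anonymized IDs of a few of someone's contacts, the node whose neighbors overlap those guesses the most is a strong candidate for that person.

    # toy re-identification by first-degree neighbor overlap
    anon_votes = {           # anonymized account -> accounts it voted with
        "u1": {"u2", "u3", "u4"},
        "u2": {"u1", "u5"},
        "u5": {"u2"},
    }
    alice_contacts_guess = {"u2", "u3", "u4"}  # outside knowledge

    best = max(anon_votes, key=lambda n: len(anon_votes[n] & alice_contacts_guess))
    print(best)  # "u1" -- likely alice, despite the anonymized layer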


I view them as more idiosyncratic docs, but focused on how to write code (there is so much huggingface code floating around the internet, the models do quite well with it already).

I have not had much success with skills that have tree-based logic (if a, do x; else do y); they just tend to do everything in the skill (so it will do both x and y).

But just as "hey follow this outline of steps a,b,c" it works quite well in my experience.


Claude Code inherits the environment from the shell. So it could create a Python program (or whatever language) to read the file:

    # get_info.py
    import os

    # open() does not expand "~", so expand the home directory first
    path = os.path.expanduser('~/.claude/secrets.env')
    with open(path, 'r') as file:
        print(file.read())
And then run `python get_info.py`.

While this inheritance is convenient for testing code, it makes it difficult to isolate Claude in a way that lets you run/test your application without giving up access to secrets.

If you can, I recommend IP whitelisting your secrets, so that a leak is not a problem.
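
On the isolation point, one blunt approximation, assuming Docker is available, is to run the agent in a container that only mounts the project directory, so host files like `~/.claude/secrets.env` simply aren't visible:

    # run the agent in a container that only sees the project directory
    import subprocess

    subprocess.run([
        "docker", "run", "--rm", "-it",
        "-v", "/path/to/project:/work",  # mount the code, nothing else
        "-w", "/work",
        "node:22",                       # base image choice is illustrative
        "npx", "@anthropic-ai/claude-code",
    ])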

