Hacker News — vinext + Cloudflare Workers

Show HN: New Benchmark from SWE-bench team is 0% solved (programbench.com)

14 points by lieret 12 hours ago | 2 comments

dnnehgf 26 minutes ago [-]

figures 10 and 11 in the paper are interesting.

i suppose at a high level this works because it is much easier for the evaluator to generate tests with fuzzing than it is for the model to probe.

this method makes clear the sense in which code generation is just curve fitting, where the output curve is some linear transformation of the inputs.

ivarv 7 hours ago [-]

This looks pretty interesting, but I don't understand why decompilers are not allowed. If this benchmark were aimed at recreating a SaaS/server-based product, then it might make more sense, but given that decompilers are readily available in practice, the "no read" restriction seems to artificially increase the challenge level.
