Hacker News — vinext + Cloudflare Workers

new
past
show
ask
show
jobs
submit

▲Docker 29 has changed its default image store for new installs (docs.docker.com)

118 points by neitsab 3 days ago | 75 comments

fabian2k 11 hours ago [-]

> This difference is particularly noticeable with multiple images sharing the same base layers. With legacy storage drivers, shared base layers were stored once locally, and reused images that depended on them. With containerd, each image stores its own compressed version of shared layers, even though the uncompressed layers are still de-duplicated through snapshotters.

This seems like a really weird decision. If base images are duplicated for every image you have, that will add up quickly.

kodama-lens 10 hours ago [-]

I think there is an Issue/PR right now to change this. See: https://github.com/containerd/containerd/issues/13307

epistasis 9 hours ago [-]

Oh, very glad to see this, ML applications that were mentioned in it are exactly why I was thinking this was such a disastrous change.

However, the tedium of the reply chain reminds me why I tend to focus most energy on internal projects rather than external open source...

Docker may have been built for a specific type of use case that most developers are familiar with (e.g. web apps backed by a DB container) but containerization is useful across so much of computing that are very different. Something that seems trivial in the python/DB space, having one or two different small duplicates of OS layers, is very different once you have 30 containers for different models+code, and then ~100 more dev containers lying around as build artifacts from building and pushing, and pulling, each at ~10GB, that the inefficient new system is just painful.

The smallest PyTorch container I ever built was 1.8GB, and that was just for some CPU-only inference endpoints, and that took several hours of yak shaving to achieve, and after a month or two of development it had ballooned back to 8GB. Containers with CUDA, or using significant other AI/ML libraries, get really big. YAGNI is a great principle for your own code when writing from scratch, but YAGNI is a bit dangerous when there's been an entire ecosystem built on your product and things are getting rewritten from scratch, because the "you" is far larger than the developer making the change. Docker's core feature has always been reusable and composable layers, so seeing it abandoned seems that somebody took YAGNI far too extreme on their own corner of the computing world.

a_t48 5 hours ago [-]

Docker(Hub) just isn't built for this use case. I've built https://clipper.dev to better handle ML/large images. It consists of a registry+pull client that breaks apart layers and does content addressing of individual chunks by _uncompressed_ hash, so that content can be better shared. My pull client has better parallelization and wastes much less bandwidth. It annoys the heck out of me when I change one file in a layer and have to redownload bytes my device already has. By sharing across layers I've seen 80-90% improvements in pull times for "patches".

I'm also in the process of building a BuildKit builder, I'm seeing large improvements on the speed of exporting images. The same image that takes Docker >3 minutes to export and push takes me under a minute. https://github.com/clipper-registry/benchmarks/actions/runs/...

epistasis 11 hours ago [-]

This is hell for a lot of ML containers, that have gigabytes of CUDA and PyTorch. Before at least you could keep your code contained to a layer. But if I understand this correctly every code revision duplicates gigabytes of the same damn bloated crap.

a_t48 5 hours ago [-]

It's even worse when you end up installing PyTorch as a separate package in some other layer. It's not shared between layers at all with regular Docker.

spwa4 10 hours ago [-]

If you have problems with 13 (I believe) GB of docker layers ... how do you deal with terabytes or petabytes of AI training data?

epistasis 10 hours ago [-]

Petabytes of training data is only one application of PyTorch, which is going to use tens of thousands of containers, but...

Inference, development cycles, any of the application domains of PyTorch that don't involve training frontier models... all of those are complicated by excessive container layers.

But mostly dev really sucks with writing out an extra 10GB for a small code change.

a_t48 5 hours ago [-]

Going to self promote one last time here - I've built a fix for this, at least for the registry/image export side, at https://clipper.dev. Docker(Hub) can't share large files between layers, but I can.

StableAlkyne 10 hours ago [-]

You don't even need MB of training data for some ML applications. AI is the sexy thing nowadays, but neural networks (Torch is a NN library) are generally useful for even small regression and clarification problems.

For some problems you might even be able to get away with single digit numbers of training points (classic example of this regime being Physics-Informed Neural Networks)

nijave 5 hours ago [-]

Yeah, our handful of models we just commit to the git repo--usually only a few MB.

Image still ends up being like 6-8Gi tho. iirc pytorch had a hard dependency on CUDA libs which pulled in a bunch of different hardware-specific kernel binaries. The models ran on CPU and didn't even need CUDA but it was incredibly hard to remove them--there was some pytorch init code that expected the CUDA crap to exist even on CPU-only.

Normal_gaussian 10 hours ago [-]

the training data is on a separate drive; or the training data isn't that large for this use case; or they aren't training.

0cf8612b2e1e 9 hours ago [-]

You don’t train petabytes on your laptop.

IsTom 11 hours ago [-]

Docker is already hogging a lot of disk space and needs to be pruned regularly. I can't imagine what's it's going to be like now.

embedding-shape 7 hours ago [-]

"really weird decision" seems like an understatement, I thought the entire point of the specific storage design with the whole layering shebang was so things could be shared? If you remove that, just get rid of layers as a whole, what's the point otherwise?

UltraSane 35 minutes ago [-]

Enterprise grade dedupe helps a lot with this. Dedupe on them has gotten very good.

tetha 5 hours ago [-]

It does. It's also very nice that this moves storage usage from /var/lib/docker over to /var/lib/containerd.

Due to that, a careless installation of a few new dev-systems under the new docker version immediately blew up storage usage on the root-disk, while happily ignoring hundreds of gigabytes on a volume on /var/lib/docker.. because that's where it needs the storage, right? A few older systems also were upgraded but didn't, which was quite confusing at first.

Sorry for being salty, but that was a pretty hectic afternoon with those new agents trashing builds, and now we have a pretty annoying migration plan to plan for the rest. And yes yes it's just a reinstallation, but we have other things to do as well.

Oxodao 12 hours ago [-]

Docker already fills up my dev machines yet they decided for this insane solution:

> The containerd image store uses more disk space than the legacy storage drivers for the same images. This is because containerd stores images in both compressed and uncompressed formats, while the legacy drivers stored only the uncompressed layers.

Why ?

giobox 10 hours ago [-]

> https://docs.docker.com/reference/cli/docker/system/prune/

Just in case - I'm always amazed how many Docker users don't know about the prune command for cleaning up the caches and deleting unused container images and just slowly let their docker image cache eat their disk.

johannes1234321 9 hours ago [-]

Prune is nice, but if you have a bunch of containers which run shirt time for a build step or similar prune would collect those, too. A filter "last used a few months ago" would be useful.

giobox 9 hours ago [-]

I think you can filter on last created, but agree last used would be helpful:

  docker image prune -a --filter "until=24h"

> https://docs.docker.com/reference/cli/docker/image/prune/#fi...

9 hours ago [-]

kevinmgranger 7 hours ago [-]

Any reason why those containers can't be run ephemerally?

ElevenLathe 12 hours ago [-]

Sounds like a straightforward time-space tradeoff: if you have the compressed layers sitting around when you need them, you can avoid the expense and time of compressing them.

Filligree 11 hours ago [-]

Why would I need the compressed layers?

XYen0n 7 hours ago [-]

The OCI manifest references the hashes of these compressed layers, and re-compressing them does not guarantee obtaining the same hash

flakes 7 hours ago [-]

Recompressing should be guaranteed deterministic. It’s the packing/unpacking of tar archives to/from directories on disk that leads to the non-determinism (such as timestamps and ownership metadata). If the tar is left intact, both zstd and gzip should produce byte for byte identical outputs given the same compression parameters.

cpuguy83 4 hours ago [-]

That is not correct. You would have to use the same compression tool (and likely version) for this to match.

Old docker discarded the compressed bits but kept some metadata about the the so it can at least recreate the tar.

It also recreated the manifest o push.

flakes 4 hours ago [-]

Thanks for the correction. I did mean given the same tooling version/parameters, but (as you and others pointed out) preserving and recreating that state is not at all straightforward.

XYen0n 6 hours ago [-]

You are correct; I confused archiving with compression. However, even considering only the compression process, same compression parameters cannot be guaranteed, as it is unknown which compression parameters the image publisher used.

flakes 4 hours ago [-]

Thats true. And regardless of compressed vs regular tar, I think the OCI format working with opaque archives is extremely limiting. I hope the industry will eventually redesign to use content addressable storage per file and have metadata to describe the layer/disk layout instead. That would allow per file deduplication, and we can use tar for just bulk transfer over the wire, rather than using tar for the data at rest.

cpuguy83 4 hours ago [-]

containerd 2.3 has support for erofs which does a direct import of the layer. It can even convert the tar based layers to erofs, faster than extracting the tar normally.

Also looking at block-based content store so that blocks can be deduped across images.

mort96 6 hours ago [-]

If that's the purpose, couldn't you store the hash and throw away the compressed image?

(As others said, compression is deterministic for the same algorithm, parameters and input data)

a_t48 5 hours ago [-]

Zstd for example only promises determinism on the same version of the library. I've personally seen the hashes mutate between pull and export. Things like tar padding also make a difference. Really, the thing to do is to hash on the _uncompressed_ data and let compression be a transport/registry detail. That's what I've done, at least.

mort96 5 hours ago [-]

I didn't know that about zstd, that's a bit unfortunate.

Tar isn't related here though, we're talking about compression not archival formats

thaJeztah 1 hours ago [-]

Yes, compression being part of the OCI image's digest was (in hindsight) a poor decision. _Technically_ OCI images allow uncompressed layers, and the layers could be included without compression (and transport compression to be used); this would allow layers to be fully reproducible. We explored some options to do this (and made some preparations; https://github.com/containerd/containerd/pull/8166), but also discovered that various implementations of registry clients didn't handle transport-compression correctly (https://github.com/distribution/distribution/pull/3754), which could result in client either pulling the full, uncompressed, content, or image validation failing.

a_t48 1 minutes ago [-]

For my registry/pull client fork I hash on the uncompressed content and store as compressed under the uncompressed digest. This lets me have my cake and eat it, too - compression free digests, smaller storage costs, be able to set consistent compression settings, have the ability to spend extra CPU to recompress on the backend without breaking hashes, etc. I control both pull client and registry, so it works.

cpuguy83 4 hours ago [-]

The whole entire reason is compression is not deterministic across tooling.

NewJazz 11 hours ago [-]

Pushing

mort96 6 hours ago [-]

What about pushing? Computers are fast enough to compress stuff as it's being transmitted, you don't need to store the compressed copy anywhere...

cryptonym 11 hours ago [-]

To save disk space /s

12 hours ago [-]

colechristensen 11 hours ago [-]

I'm not sure about the fastest macbook disk access, but even with NVMe storage I've found lz4 to be faster than the disk. That is (it's hard to say this exactly correct) compressed content gets read/written FASTER than uncompressed content because fewer bytes need to transit the disk interface and the CPU is able to compress/decompress significantly faster than data is able to go through whatever disk bus you've got.

fpoling 11 hours ago [-]

On my 2 years old ThinkPad laptop SSD is faster than lz4. On a fat EC2 server lz4 is faster. So one really has to test a particular config.

colechristensen 9 hours ago [-]

Yeah, I'm not surprised the PCIe 5.0 transfer speeds matched with top tier SSD chips win that race.

It still bothers me that the fastest most performant computer I have access to is almost always my laptop, and that by a considerable margin.

Someone should do some lz4 vs. ssd benchmarks across hardware to make my argument more solid and the boundaries clear.

UltraSane 33 minutes ago [-]

You can get AWS instances with very fast local NVMe drives.

freedomben 11 hours ago [-]

did you mean the first "compressed" to be "uncompressed" ?

awestroke 9 hours ago [-]

But if it stores the uncompressed layers, why store the compressed ones too? Why both at the same time?

black3r 7 hours ago [-]

Also this doesn't just mean more disk space usage, but also longer local build times... for the app I'm working on `exporting to image` takes 71.5 seconds with containerd, without containerd it's 4.3s (the rest of the build takes ~180 seconds). And that's just a 5.76GB image.

cpuguy83 4 hours ago [-]

That's almost certainly nothing to do with the change. Please report this if you can with commands used.

Buildkit isn't changing behavior here. Internally in docker there is a shim to make the legacy storage behave like containerd snapshotters (as well as it can, anyway--not perfect due to hard to resolve issues in the old storage). But it still kept both the compressed and uncompressed versions of images.

sschueller 11 hours ago [-]

[flagged]

stingraycharles 11 hours ago [-]

What does Apple have to do with any of this?

mschuster91 11 hours ago [-]

> It is shameful for apple to hard solder their disks. There is no benefit to the user

Actually, it is. The speed and latency difference does matter, that is how even an 8GB RAM MacBook feels snappier than many a 32GB Windows machine - it can use the disk as swap.

giobox 10 hours ago [-]

This explanation for the soldered in SSD on some models has never fully made sense, because Apple make computers with removable fast SSDs right now: the M4 Mac Mini, and their range topping Mac Studios.

I absolutely agree Apple typically ship a fast SSD in their computers. I am not convinced they had to solder them to achieve the performance.

newsoftheday 11 hours ago [-]

I had to work on a Mac M3 for a year, it sucked, it did not feel snappier than any Windows or Linux machine (including this one) that I've ever used and that is going back to the 1980's.

stingraycharles 11 hours ago [-]

I suggest you judge based on benchmarks rather than vibes.

If you believe the latest M3 does not perform better than machines you’ve used in the 80s, I have no idea how to even start a reasonable discussion about this.

newsoftheday 8 hours ago [-]

> If you believe the latest M3 does not perform better than machines you’ve used in the 80s

That wasn't what I was trying to say, I apologize, I should have been clearer. What I intended to say was that I've been using various, many computers since the 1980's so I have a wide and deep sampling of experiences with them and to that end...the M3 did NOT feel to me like it performed better. Regardless the benchmarks, I know how the machine should feel and I know M3 did not feel any better than any other machine I've used (and that is a lot of laptops).

neitsab 3 days ago [-]

Docker v29 (released 2025-11) switched to using containerd for its image store for new installs.

This means `/var/lib/docker` is no longer "hermetic": images and container snapshots are located in `/var/lib/containerd` now.

More info about the switch: https://www.docker.com/blog/docker-engine-version-29/

To configure this directory, see https://docs.docker.com/engine/storage/containerd/.

neitsab 3 days ago [-]

I noticed the change because I wanted to persist Docker-related data between container instantiations on IncusOS. I couldn't understand why the custom volume I had mounted on /var/lib/docker didn't contain the downloaded images.

To keep both /var/lib/{containerd,docker} in sync, I use a single ZFS dataset ("custom filesystem volume" in Incus parlance) and mount subpaths inside the container:

  incus storage volume create local docker-data
  incus config device add docker docker disk pool=local source=docker-data/docker path=/var/lib/docker
  incus config device add docker containerd disk pool=local source=docker-data/containerd path=/var/lib/containerd

There are other ways to achieve the same of course.

hommelix 8 hours ago [-]

> Docker Engine includes an experimental feature that can automatically switch to the containerd image store under certain conditions. This feature is experimental. It's provided for those who want to test it, but starting fresh is the recommended approach.

How bad did we fall with the ship often, ship early and fix later idea? Make a major change, release it and the migration feature is experimental and not recommended.

0xbadcafebee 10 hours ago [-]

It sounds like this breaks all Docker installs that use userns-remap? Are they really shipping a breaking change with no fix? In addition to bloating the disk? In addition to breaking all old systems that relied on mapping /var/lib/docker?

I can't believe Docker finally shit the bed. Time to replace Docker with Podman.... sigh

shaun_docker 2 hours ago [-]

The page is documenting current compatibility. No users are automatically migrated to an incompatible setup. If you used userns-remap then you should currently still be using the previous Docker storage stack.

tdemin 6 hours ago [-]

[dead]

cyberax 7 hours ago [-]

Just use Podman. Docker's development is driven by managers who want to shove hosted services everywhere.

Meanwhile, the basic stuff like caching doesn't work properly.

thaJeztah 1 hours ago [-]

I'm a maintainer of the Moby project (which is used to build the Docker Engine), and saw this post got some attention, so let me try to outline some of the changes and motivation. Happy to answer questions if there's any.

First of all, some history; the Docker Engine was a monolith daemon that provided many services; this worked well when using Docker as a standalone solution, but when used as runtime for Kubernetes, this wasn't ideal; many components were not designed for this purpose, which meant they had to be replaced / overridden with hacks to make it work. The containerd project was created to provide a more modular runtime for the container ecosystem, providing separate subcomponents (a containerd runtime, image/content storage) for the container ecosystem to build on. It was created "from scratch" with lessons learned over the Years, providing a modern foundation.

While docker has used containerd as a runtime for many Years, it still used its own implementation for storing images ("graph-drivers"); this implementation started to show its age and had many limitations; graph-drivers have no native support for multi-platform ("multi-arch") images, no support for OCI Artifacts, and no reproducible images when pushing to different registries (among others).

Around 4 Years ago, we started to re-implement the image storage using containerd "snapshotters"; our initial goal was to provide a mostly seamless transition; add multi-arch support, but keep the UX as close as possible to the graph-drivers. Around 2 Years ago, Docker Desktop changed to using the containerd image storage (snapshotters) as a default for new installations, and Docker v29 made it the default for Linux installations.

While we kept most of the UX similar, there are some differences; when storing an image with graph-drivers, docker would pull the OCI image, extract the content (layers), and discard the (compressed) layers. While this reduced storage, it also made images non-reproducible as the image had to be re-constructed when pushing to a registry (which also resulted in slower pushes).

The containerd image storage uses a different design, where a copy of the compressed artifacts are preserved (by default); this requires more storage to keep these extra blobs, but reduces duplication and increases push performance. It was the decision containerd maintainers made early in their design process, and all containerd-based tools have used this model since the start of the containerd project.

We have a couple of roadmap items to improve this in future; some are outlined in this ticket; https://github.com/moby/moby/issues/51581, but there's other options that will become availeble through the containerd image store; support for erofs as an alternative to (tar) compressed image layers, as well as automatic garbage-collection (which would reduce the need for manually pruning content through `docker system prune` (and related commands).

(FWIW; docker still provides graph-drivers as an alternative https://docs.docker.com/engine/storage/drivers/select-storag...)

jillesvangurp 4 hours ago [-]

Since people are mentioning alternatives, worth calling out colima if maybe you don't want to jump onto Red Hat's podman or Suse's Rancher. Both are open source but there are some corporate entities behind them with their own agendas that you might want to consider.

My reasoning is simply that I don't really want to swap out one overly complicated thing for another. I'm sure Podman is fine and amazing. But I'm just not in the IBM/Red Hat ecosystem and I have some reservations their generally a bit overly complex solutions. There's a reason IBM is involved, just saying. And as I'm not planning to use podman in production I see no reason to have it on my laptop.

As for Rancher, that seems to me a bit like moving the problem than solving it as it seems to be a for profit solution around an OSS core with its own complexity and potentially similar risks to Docker Desktop down the road.

With colima, it's all open source and easy to install/upgrade via homebrew. Nice simple wrapper around qemu. There's no UI, and I don't really miss having one. Lazydocker works fine as a TUI if you crave a UI and so do other generic docker UIs/IDEs. I mainly use docker and docker compose on the command line and that works fine for me. It has Kubernetes support as well if you need that but that's not something I use or need.

DeathArrow 11 hours ago [-]

I should start looking into Podman.

newsoftheday 11 hours ago [-]

The article says to regularly run prune, how regularly? Currently I run the following once per day from cron:

    docker system prune -a -f
    docker volume prune -a -f

wolttam 9 hours ago [-]

This would depend entirely on how much churn your system is doing on containers/volumes/images. Once a day sounds really often for most situations.

"Regularly" = when you're running out of space because of a bunch of built up old stuff.

6 hours ago [-]

tetha 5 hours ago [-]

Monitor your disks to see if they grow full, and have an idea what your storage baseline should be. Storage in /var/lib/docker/overlay2 can also leak, even if you prune regularly.

arnitdo 11 hours ago [-]

From the docs, you can just run `docker system prune -a --volumes`

Ref: https://docs.docker.com/reference/cli/docker/system/prune/

bravetraveler 6 hours ago [-]

Personally, I'd recommend the pointed 'docker {container,image,volume} prune' commands for scheduling granularity/control. At least, filtering as you've also shown.

The 'system' context captures networks; much to my dismay, this has been a problem for no fewer than three employers. It's painfully common for things to expect the networks to persist. They don't really consume resources, so I see no reason to invite the systematic heartburn.

When? When there's disk pressure. Maybe some longer term (weekly, monthly?) to keep a lid on things. The image cache provides a benefit, no sense fighting it. At our rate, daily pruning means I might lose hours (through a week) repeatedly pulling the same images.

mrichman 12 hours ago [-]

Why not just use podman at this point?

nitinreddy88 12 hours ago [-]

They are adopting to containerd standard, not sure why negative sentiment

mgrandl 9 hours ago [-]

Where did you see that? I just did a deep dive into podman/quadlets/bootc/composefs and never once seen a mention of that. A google search also didn’t bring anything like that up.

eikenberry 6 hours ago [-]

I think the "They" mentioned was Docker, not Podman. That Docker was adopting the containerd standard.

mgrandl 6 hours ago [-]

Jup that’s definitely it, not sure how I got it that wrong.

aljgz 4 hours ago [-]

It's not just you. I interpreted it similarly

pjmlp 8 hours ago [-]

In case you missed it, recent Rancher Desktop versions also went through this.

xiaod 5 hours ago [-]

[flagged]

BoldBrook418 3 hours ago [-]

[dead]

QuietLedge375 9 hours ago [-]

[dead]

Rendered at 01:30:33 GMT+0000 (Coordinated Universal Time) with Cloudflare Workers.