gberger 3 days ago [-]
S3: Simple Storage Service. It's a building block, and it's only natural other abstractions are built on top of it.
sagiba 2 days ago [-]
Agree it doesn't have to be part of S3 itself. My point is that there is a missing semantic layer.
In practice, many teams use S3 directly without any layer on top. So without better organizational capabilities, they can't keep track of what they have stored where, who created it, whether it is still used, etc.
And when teams do use a catalog, it's usually detached from the storage layer itself, so you can't easily view a dataset in the catalog and know how much it costs, who accessed it, and so on.
Have you seen places that figured out a better way to handle this? Without a ton of custom tooling?
FridgeSeal 2 days ago [-]
No but why doesn’t this object-storage-primitive accommodate all my specific requirements already?
They should also accommodate my need for all POSIX filesystem APIs, including cheap moves and renames!!!!!
/s
sagiba 2 days ago [-]
POSIX isn't the ask. Datasets are. The need to keep track of what data you have stored is universal, not my specific requirement.
FridgeSeal 2 days ago [-]
I make the (glib) comment because it's similar to an argument that was popular a few years ago.
S3 is an object store; treat it more like a KV store. As other comments have pointed out, the solution here is pick-your-favourite-metadata-store, be it Postgres or what Iceberg does, with the data itself on S3.
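For instance, a minimal sketch of that pattern, assuming a hypothetical Postgres "datasets" table (the table name, columns, and connection string are illustrative, not from any standard):

    import boto3
    import psycopg2

    s3 = boto3.client("s3")
    db = psycopg2.connect("dbname=catalog")  # hypothetical catalog database

    def register_dataset(bucket, prefix, owner, description):
        # The bytes live in S3; the semantics live in Postgres, where you
        # can index and query them however you like.
        with db, db.cursor() as cur:
            cur.execute(
                "INSERT INTO datasets (bucket, prefix, owner, description, created_at)"
                " VALUES (%s, %s, %s, %s, now())",
                (bucket, prefix, owner, description),
            )

    def write_object(bucket, prefix, name, body):
        # Writes always land under a registered prefix.
        s3.put_object(Bucket=bucket, Key=f"{prefix}/{name}", Body=body)

Anything you'd want to ask about a dataset (owner, age, purpose) then becomes an ordinary SQL query.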
hilariously 3 days ago [-]
The prefixes are not meaningless; they're performance boundaries as well, if you read the docs.
sagiba 2 days ago [-]
Good point: prefixes are performance boundaries too. Per-prefix rate scaling means you can spread load across prefixes to get aggregate throughput well above the per-prefix limit of 3,500 writes (or 5,500 reads) per second [1].
But that's a different thing from what the post is about. Even teams that use prefixes for performance don't have an S3-native way to ask what a prefix represents, who owns it, whether it's still accessed, and so on. The semantic layer is missing whether you're hashing for throughput or just laying data out the obvious way.
[1] https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimi...
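To illustrate that performance mechanism, a minimal sketch of prefix sharding (bucket name and key layout are hypothetical):

    import hashlib
    import boto3

    s3 = boto3.client("s3")

    def sharded_key(logical_key, shards=16):
        # Hash the logical key and prepend a short hex shard, spreading
        # requests across `shards` prefixes, each with its own rate limit.
        h = int(hashlib.md5(logical_key.encode()).hexdigest(), 16)
        return f"{h % shards:02x}/{logical_key}"

    s3.put_object(
        Bucket="my-bucket",
        Key=sharded_key("events/2024/01/part-0001"),
        Body=b"...",
    )

The catch, which is exactly the post's point, is that the resulting layout encodes throughput concerns rather than meaning: readers need the same hash function to find anything.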
hilariously 2 days ago [-]
Absolutely right, and if I disagreed with any other point I would have said something; it's a very solid post about metadata in general. I just don't want people to ever leave with an "oh well, we can just put it all in one prefix because it MEANS the same thing," because that way is slow and bad.
skybrian 2 days ago [-]
Why do they put everything into one huge bucket? Wouldn't the best way to clean it up be to create more buckets?
sagiba 2 days ago [-]
You can have lots of buckets, but each one typically still contains many datasets.
Think of a team doing ML, for example. They work with data all day across many different tools, each reading some inputs from S3 and writing outputs to S3. They won't create a bucket for every output; that's not practical. So they write to a single bucket with outputs organized under prefixes.
Buckets are more of an administrative boundary (IAM, cost, replication) than a data organization unit. So even with more buckets, the dataset abstraction is still missing - there's no good native way to track what a prefix represents, who created it, whether it's still accessed, how much it costs, etc.
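One lightweight workaround some teams adopt (purely a convention, nothing S3-native) is a small manifest object at the root of each dataset prefix; the "_dataset.json" name here is made up:

    import json
    import boto3

    s3 = boto3.client("s3")

    def tag_dataset(bucket, prefix, owner, description):
        # A marker object makes the prefix self-describing; tooling can
        # list these markers to enumerate datasets, since S3 itself won't.
        manifest = {"owner": owner, "description": description}
        s3.put_object(
            Bucket=bucket,
            Key=f"{prefix.rstrip('/')}/_dataset.json",
            Body=json.dumps(manifest).encode(),
            ContentType="application/json",
        )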
skybrian 2 days ago [-]
It sounds like they're missing the concept of sub-buckets that a team member can easily create for a small project.
CodesInChaos 2 days ago [-]
The "dataset" abstraction this article proposes feels rather specific for their use-case, not universal. None of my S3 use-cases would benefit from it.
Just store such metadata in your database, where you can organize, index and aggregate it whatever way you like.
sagiba 2 days ago [-]
Curious what your use cases look like. If you're storing data where you always know what's there, who created it, and whether it's still in use without needing to query for it, that's actually a great place to be. The post is about the much messier middle ground most teams I've talked to are in.
CodesInChaos 1 day ago [-]
Some of the most important ones are:
1. Invoice PDFs. Individually small, but there are a hundred million of them. Deleted after 10 years or when the tenant deletes their account.
2. Reports and exports. Few but potentially big files. If an export logically consists of multiple files, it's stored as a zip file. They live 30 days or until the tenant deletes their account.
3. Streaming database exports using AWS Database Migration Service for replication into Snowflake.
Every file has an entry in the database tracking its storage location and status.
Grouping them by tenant, (sub)type or time interval makes sense for these. But "dataset" isn't an applicable concept.
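A rough sketch of that row-per-file pattern, with illustrative table and column names, including the cleanup job that enforces the lifecycle described above:

    import boto3
    import psycopg2

    s3 = boto3.client("s3")
    db = psycopg2.connect("dbname=app")  # hypothetical application database

    def delete_expired(bucket):
        # Delete the object before flipping the row's status, so a crash
        # leaves at worst a dangling row, never an untracked live object.
        with db, db.cursor() as cur:
            cur.execute(
                "SELECT id, s3_key FROM files"
                " WHERE expires_at < now() AND status = 'live'"
            )
            for file_id, key in cur.fetchall():
                s3.delete_object(Bucket=bucket, Key=key)
                cur.execute(
                    "UPDATE files SET status = 'deleted' WHERE id = %s",
                    (file_id,),
                )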
sagiba 24 hours ago [-]
That's a clean architecture, and the dataset abstraction isn't really needed when every file has a DB row and a clear lifecycle.
The post is more about the pipeline / ML / log / export world where ownership isn't enforced by application code.
The DMS case sits somewhere in between - there's a per-table grouping that could be useful, but the files are usually transient enough that it doesn't matter much.
Different problem from yours.
mdavid626 2 days ago [-]
Store the key to the item in a database. Then you can query it however you want.
I could imagine, though, that S3 could offer something similar. We can already list the bucket items, so why not add some querying ability?
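Today the closest thing is paginated listing plus client-side filtering (or pointing Athena at an S3 Inventory report). A minimal boto3 sketch of the listing approach:

    import boto3

    s3 = boto3.client("s3")

    def find_large_objects(bucket, prefix, min_bytes=100 * 1024 * 1024):
        # "Query" the bucket the only way the API allows: list everything
        # under the prefix and filter on the client.
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
            for obj in page.get("Contents", []):
                if obj["Size"] > min_bytes:
                    yield obj["Key"], obj["Size"]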
dchess 2 days ago [-]
I feel like this is what delta lake and ducklake are largely solving for. And then some.
sagiba 2 days ago [-]
They solve it, partially, for tabular data. Delta, Iceberg, DuckLake are all table formats. And yeah, they do more than dataset abstraction (transactions, time travel, schema evolution).
But that's just one slice of storage. Most teams also have logs, media, ML artifacts, raw dumps, etc., none of which fit into a table format. And even with tables, you often can't easily look at a Delta table and know what the underlying storage is costing you, whether it's still accessed, etc.
Another system might solve it for your media files, another for your log streams, and so on. That's the thing: you have a set of management nice-to-haves that are quite generic and aren't universally supported today, so you end up reinventing them separately across each domain. And even if you did, you still wouldn't have a central aggregated view across all your storage.
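You can approximate a per-dataset cost view yourself by rolling object sizes up to the top-level prefix, though it's a client-side scan over the listing, not something S3 maintains for you:

    from collections import defaultdict
    import boto3

    s3 = boto3.client("s3")

    def bytes_per_prefix(bucket):
        # Sum object sizes by first path segment: a crude stand-in for
        # per-dataset storage cost, recomputed from scratch on every run.
        totals = defaultdict(int)
        pages = s3.get_paginator("list_objects_v2").paginate(Bucket=bucket)
        for page in pages:
            for obj in page.get("Contents", []):
                totals[obj["Key"].split("/", 1)[0]] += obj["Size"]
        return dict(totals)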
FridgeSeal 2 days ago [-]
> logs, media, ML artifacts, raw dumps, etc., none of which fit into a table format.
You would be appalled at the kind of stuff I have seen teams stuff into parquet and iceberg tables.
sagiba 2 days ago [-]
Ha. The fact that teams reach for iceberg to organize things that aren't really tables is itself a symptom of needing better management tools for other types of data.
FridgeSeal 11 hours ago [-]
Sure, but that’s not an S3 concern, because the vast majority of people use S3 as it is, without needing additional management machinery.
The solution is just to spin up the machinery you need for your solution, rather than making S3 cover all possible bases.