<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><generator uri="https://jekyllrb.com/" version="4.3.4">Jekyll</generator><link href="https://aistore.nvidia.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://aistore.nvidia.com/" rel="alternate" type="text/html" hreflang="en" /><updated>2026-04-29T00:07:36+00:00</updated><id>https://aistore.nvidia.com/feed.xml</id><title type="html">AIStore</title><subtitle>AIStore is a lightweight object storage system with the capability to linearly scale-out with each added storage node and a special focus on petascale deep learning. See more at: github.com/NVIDIA/aistore
</subtitle><author><name>NVIDIA AIStore Team</name></author><entry><title type="html">Eliminating Cluster Authentication Risks: AIStore with RSA and OIDC Issuer Discovery</title><link href="https://aistore.nvidia.com/blog/2026/04/09/rsa-and-oidc" rel="alternate" type="text/html" title="Eliminating Cluster Authentication Risks: AIStore with RSA and OIDC Issuer Discovery" /><published>2026-04-09T00:00:00+00:00</published><updated>2026-04-09T00:00:00+00:00</updated><id>https://aistore.nvidia.com/blog/2026/04/09/rsa-and-oidc</id><content type="html" xml:base="https://aistore.nvidia.com/blog/2026/04/09/rsa-and-oidc"><![CDATA[<p>Back in February 1997, <a href="https://datatracker.ietf.org/doc/html/rfc2104">RFC 2104</a> introduced HMAC as a mechanism for authenticating messages based on a shared secret key.</p>

<p>Symmetric signing algorithms like HMAC <strong>can</strong> be used to securely sign access tokens, but with two extremely important caveats:</p>
<ol>
  <li>The secret key must be strong enough to avoid simple brute-force attacks (high entropy, sufficiently long, and randomly generated)</li>
  <li>The secret key must NEVER be leaked</li>
</ol>

<p>Unfortunately for the first point, <a href="https://hashcat.net/hashcat/">hashcat</a> received its public release in 2009.
Since then, advances in GPU hardware and frameworks like <a href="https://developer.nvidia.com/cuda">NVIDIA CUDA</a> have turned tools such as hashcat into <a href="https://chiomaibeakanma.hashnode.dev/exploiting-weak-jwt-hmac-secrets-from-account-takeover-to-admin-privilege-escalation">increasingly effective</a> brute-force engines, capable of breaking secrets that were once considered safe.
Still, given a sufficiently long and random key, this is <a href="https://specopssoft.com/blog/sha256-hashing-password-cracking/">not a concern</a> with <code class="language-plaintext highlighter-rouge">HMAC-SHA256</code>.</p>

<p>The second issue is a much larger problem for symmetric signing approaches like HMAC.
Since the signing key is also used for validation, it must be provided to the server, not just the token issuer. 
And this secret key must never be accidentally exposed in deployment pipelines, configuration files, or logs.
This increased attack surface is a massive risk!</p>
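
<p>To see why, consider a minimal HS256-style sketch (standard-library Python, not AuthN’s actual implementation): the same secret computes both the signature and the verification MAC, so every validating service necessarily holds a token-minting key.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import base64, hashlib, hmac, json

# Minimal HS256-style illustration: the SAME secret both signs and verifies,
# so any service that can validate tokens can also mint them.
def b64url(raw):
    return base64.urlsafe_b64encode(raw).rstrip(b"=")

secret = b"shared-secret-also-held-by-every-validator"  # hypothetical key
header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
payload = b64url(json.dumps({"sub": "admin"}).encode())
signing_input = header + b"." + payload
signature = b64url(hmac.new(secret, signing_input, hashlib.sha256).digest())
token = b".".join((header, payload, signature))

# Verification recomputes the MAC with the very same key:
expected = b64url(hmac.new(secret, signing_input, hashlib.sha256).digest())
assert hmac.compare_digest(signature, expected)
</code></pre></div></div>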

<p>What’s worse, a compromised signing key gives server owners no indication that anything is wrong. 
With no key rotation, a stolen key can be used to sign tokens with ANY level of access indefinitely. 
Attackers can use this key to quietly read or corrupt sensitive data without revealing their access.
For any AIStore deployments that are not carefully gated in a protected environment, this could spell disaster.</p>

<p>With the 4.3 and subsequent 4.4 releases, AIStore AuthN now supports RSA signing keys and OIDC Issuer Discovery -- two essential features to mitigate the risk of this total security collapse.</p>

<hr />

<h2 id="table-of-contents">Table of Contents</h2>

<ul>
  <li><a href="#rsa-jwt-signing">RSA JWT Signing</a></li>
  <li><a href="#oidc-issuer-discovery">OIDC Issuer Discovery</a>
    <ul>
      <li><a href="#static-key-distribution">Static Key Distribution</a></li>
      <li><a href="#trusted-issuers">Trusted Issuers</a></li>
      <li><a href="#oidc-in-authn">OIDC in AuthN</a></li>
      <li><a href="#drawbacks-and-limitations">Drawbacks and Limitations</a></li>
    </ul>
  </li>
  <li><a href="#complete-kubernetes-deployment">Complete Kubernetes Deployment</a>
    <ul>
      <li><a href="#running-the-deployment">Running the Deployment</a></li>
      <li><a href="#authn-config">AuthN Config</a></li>
      <li><a href="#ais-config">AIS Config</a></li>
    </ul>
  </li>
  <li><a href="#conclusion-and-future-work">Conclusion and Future Work</a>
    <ul>
      <li><a href="#signing-key-rotation">Signing Key Rotation</a></li>
      <li><a href="#multi-replica-support">Multi-replica Support</a></li>
      <li><a href="#service-account-authentication">Service Account Authentication</a></li>
    </ul>
  </li>
  <li><a href="#references">References</a></li>
</ul>

<hr />

<h2 id="rsa-jwt-signing">RSA JWT Signing</h2>

<p>Previously, AIStore AuthN relied on HS256, which uses HMAC-SHA256 with a shared secret key.
This is a symmetric algorithm, where the same secret is used for both signing <a href="https://datatracker.ietf.org/doc/html/rfc7519">JWTs</a> and validating them.</p>

<p>This meant the signing key was distributed and could potentially exist in files, K8s secrets, K8s Pod specs, or environment variables in the actual AIS deployment.</p>

<p>We needed to be able to distribute a key publicly without exposing the ability to sign new tokens.
That’s where <strong>asymmetric</strong> RSA signing key pairs come into the picture. 
With RSA, the private key never leaves the AuthN service. 
JWT signatures are validated only by a public key that cannot be used to sign new tokens.</p>
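
<p>As a rough sketch of that split -- using <a href="https://pyjwt.readthedocs.io/">PyJWT</a> and the <code class="language-plaintext highlighter-rouge">cryptography</code> package, not AuthN’s internal code -- only the private key can produce a valid signature, while validators need nothing more than the public half:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import jwt  # pip install pyjwt cryptography
from cryptography.hazmat.primitives.asymmetric import rsa

# Illustrative RS256 sign/verify split; key size and claims are arbitrary.
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

# Only the holder of the private key can mint tokens...
token = jwt.encode({"sub": "user"}, private_key, algorithm="RS256")

# ...while validators need only the public key, which cannot sign.
claims = jwt.decode(token, public_key, algorithms=["RS256"])
print(claims)  # {'sub': 'user'}
</code></pre></div></div>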

<p>AuthN also now supports encrypting the private key locally with a passphrase, so the key is never stored unprotected on disk, even within the service itself.</p>

<p>See <a href="https://github.com/NVIDIA/aistore/blob/main/docs/authn.md#rsa-signing">RSA Signing</a> in the AuthN docs for more details.</p>

<hr />

<h2 id="oidc-issuer-discovery">OIDC Issuer Discovery</h2>

<h3 id="static-key-distribution">Static Key Distribution</h3>

<p>Even with the improved security of RSA keys, relying on static key distribution presents challenges.</p>

<p>First, this still doesn’t fully address the issue of compromised keys. 
Private key leaks are less likely, since the private key is never distributed, but we still risk silent exposure.
Without key rotation, a compromised private key can be used to mint fraudulent JWTs indefinitely.
And by using a static public key in AIS config, we can’t simply rotate the validation key in AIS without invalidating all existing tokens.</p>

<p>The static config also adds friction to deployment, since AuthN generates the key pair. 
Any AIS cluster deployment would need to inject the generated public key into its config.</p>

<h3 id="trusted-issuers">Trusted Issuers</h3>

<p>OIDC issuer lookup solves all of this by validating JWTs with a cached set of keys from trusted issuers. 
Instead of checking a JWT signature with a static public key, AIS uses the <code class="language-plaintext highlighter-rouge">iss</code> and <code class="language-plaintext highlighter-rouge">kid</code> claims from the JWT to look up the associated public key.</p>

<p>AIS itself has supported the concept of <a href="https://github.com/NVIDIA/aistore/blob/main/docs/auth_validation.md#oidc-lookup">OIDC issuer discovery</a> since version 4.1, but this was restricted to third-party JWT issuers, which needed additional configuration to support the custom JWT format for AIS access.</p>

<p>This update brings that functionality to the native AIStore AuthN service, offering much better security and simplified deployment compared to the previous approach of symmetric, static signing keys.</p>

<h3 id="oidc-in-authn">OIDC in AuthN</h3>

<p>AuthN does NOT fully implement the <a href="https://openid.net/specs/openid-connect-core-1_0.html">OIDC spec</a>. 
It simply exposes the path <code class="language-plaintext highlighter-rouge">/.well-known/openid-configuration</code>, which responds with a “discovery document” containing <code class="language-plaintext highlighter-rouge">jwks_uri</code>. 
That <code class="language-plaintext highlighter-rouge">jwks_uri</code> path then returns the complete set of valid public <a href="https://datatracker.ietf.org/doc/html/rfc7517">JSON Web Keys (JWK)</a>. 
A JWK is a generic JSON container for different key types.
In the case of AuthN, it represents an encoded RSA public key with some extra metadata.</p>
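
<p>For illustration, the discovery walk can be done by hand with the <code class="language-plaintext highlighter-rouge">requests</code> package. The endpoint paths follow the OIDC discovery convention; the issuer URL is the one used in the deployment later in this post, and <code class="language-plaintext highlighter-rouge">verify=False</code> is only appropriate for a self-signed local setup:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import requests

# Fetch the discovery document, then the JWK set it points to.
issuer = "https://ais-authn.ais.svc.cluster.local:52001"
doc = requests.get(
    issuer + "/.well-known/openid-configuration", verify=False
).json()
jwks = requests.get(doc["jwks_uri"], verify=False).json()
for key in jwks["keys"]:
    print(key["kid"], key["kty"])  # key ID and key type, e.g. RSA
</code></pre></div></div>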

<p>This JWK set (JWKS) is then cached on the AIStore proxies, where the keys are used to validate JWT signatures.</p>

<p>Below is a diagram showing the full flow; see the <a href="https://github.com/NVIDIA/aistore/blob/main/docs/authn.md#oidc-issuer">AuthN docs</a> for more implementation details.</p>

<p><img src="/assets/rsa_and_oidc/OIDC_issuer.png" alt="OIDC Issuer flow" /></p>

<h3 id="drawbacks-and-limitations">Drawbacks and Limitations</h3>

<p>One disadvantage is that AIS previously had no runtime dependency on the availability of the AuthN service. 
Now, AIS expects AuthN to be reachable to refresh its local cache of key sets on a regular basis, which raises the reliability requirements for AuthN.
<a href="https://github.com/NVIDIA/ais-k8s/tree/main/helm/authn">Deploying in K8s</a> simplifies this, but multi-replica support for AuthN is still ongoing work (see <a href="#conclusion-and-future-work">future work</a>).</p>

<p>However, AIS does not need to query AuthN on every request; it caches the key sets locally thanks to the <a href="https://github.com/lestrrat-go/jwx">JWX library</a>.</p>

<blockquote>
  <p>Note: AIS currently only refreshes its cached key sets for a specific issuer on proxy restart. 
This is a known deficiency that limits the usability of key rotation and will be fixed in a future release.
See the <a href="#signing-key-rotation">signing key rotation</a> section below.</p>
</blockquote>

<hr />

<h2 id="complete-kubernetes-deployment">Complete Kubernetes Deployment</h2>

<p>With RSA signing and OIDC discovery, the signing key is no longer shared, keys can be rotated without touching AIS config, and AIStore and AuthN can be deployed in any order without pre-distributing keys.</p>

<p>To demonstrate, we’ll show a local AIS cluster deployed in K8s alongside AuthN, runnable in KinD via a single script.</p>

<p>See the full deployment <a href="https://github.com/NVIDIA/ais-k8s/tree/main/local">scripts on the ais-k8s repo</a>.</p>

<h3 id="running-the-deployment">Running the Deployment</h3>

<p>See the <a href="https://github.com/NVIDIA/ais-k8s/blob/main/local/README.md">guide in ais-k8s</a> for full details.
First, you’ll need a few prerequisites:</p>

<ul>
  <li><a href="https://www.docker.com/">Docker</a> or <a href="https://podman.io/">Podman</a></li>
  <li><a href="https://kind.sigs.k8s.io/">Kubernetes in Docker</a></li>
  <li><a href="https://kubernetes.io/docs/reference/kubectl/">kubectl</a></li>
  <li><a href="https://helm.sh/docs/intro/install/">Helm</a></li>
  <li><a href="https://github.com/helmfile/helmfile#installation">Helmfile</a></li>
</ul>

<p>Next, to create the local deployment, clone <a href="https://github.com/NVIDIA/ais-k8s">ais-k8s</a> and navigate to <code class="language-plaintext highlighter-rouge">local</code>.</p>

<p>Then run <code class="language-plaintext highlighter-rouge">./test-cluster.sh --auth</code>.</p>

<p>That’s it! 
The script will bootstrap a local K8s cluster with all dependencies and an entire stack for AIS: K8s operator, AIS cluster, AIS AuthN, and an admin client deployment.</p>

<p>Once deployed, run the following to drop into a shell on the admin client pod inside the cluster:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl <span class="nb">exec</span> <span class="nt">-it</span> <span class="nt">-n</span> ais deploy/ais-client <span class="nt">--</span> /bin/bash
</code></pre></div></div>

<p>Initially, the AIS CLI won’t have access because AIS is enforcing authentication:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@ais-client-7d869f99bf-dp76b:/# ais <span class="nb">ls
</span>Error: token required
</code></pre></div></div>

<p>This pod is pre-configured with environment variables for accessing the AuthN service. 
Run <code class="language-plaintext highlighter-rouge">ais auth login $AIS_AUTHN_USERNAME -p $AIS_AUTHN_PASSWORD</code> to fetch a token. 
Now the client in this pod has full admin access to the local cluster.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@ais-client-7d869f99bf-dp76b:/# ais auth login <span class="nv">$AIS_AUTHN_USERNAME</span> <span class="nt">-p</span> <span class="nv">$AIS_AUTHN_PASSWORD</span>
Logged <span class="k">in</span> <span class="o">(</span>/root/.config/ais/cli/auth.token<span class="o">)</span>
<span class="c"># Successful request</span>
root@ais-client-7d869f99bf-dp76b:/# ais <span class="nb">ls
</span>No buckets <span class="k">in </span>the cluster.
</code></pre></div></div>

<p>Below is a simplified diagram showing the entire setup:</p>

<p><img src="/assets/rsa_and_oidc/k8s_authn.png" alt="K8s AuthN Deployment" /></p>

<h3 id="authn-config">AuthN Config</h3>

<p>In recent versions of AuthN, RSA is the default signing method, and the service auto-generates a key pair on initial startup.</p>

<p>The relevant configuration for enabling OIDC lookup in the <a href="https://github.com/NVIDIA/ais-k8s/blob/main/helm/authn/config/authn/local.yaml.gotmpl">AuthN local helm environment</a> is <code class="language-plaintext highlighter-rouge">net.externalURL</code>:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">net</span><span class="pi">:</span>
  <span class="na">externalURL</span><span class="pi">:</span> <span class="s2">"</span><span class="s">https://ais-authn.ais.svc.cluster.local:52001"</span>
</code></pre></div></div>

<p>This tells the AuthN service what to use when building the <code class="language-plaintext highlighter-rouge">jwks_uri</code> in the <code class="language-plaintext highlighter-rouge">openid-configuration</code> response.
The URL that clients can use to access the service depends on the deployment, so it must be configured in advance.</p>

<h3 id="ais-config">AIS Config</h3>

<p>Because the AIS cluster runs in the same local deployment, we can use the K8s service DNS to access AuthN directly.</p>

<p>In the <a href="https://github.com/NVIDIA/ais-k8s/blob/main/helm/ais/config/ais/local-auth.yaml">local-auth helm values</a> for AIS, we set <code class="language-plaintext highlighter-rouge">configToUpdate</code> to update the AIS internal configuration to trust JWTs signed by the given allowed issuer.</p>

<p>The <code class="language-plaintext highlighter-rouge">auth</code> section configures how the operator and admin clients connect and provision an admin token by using credentials from a K8s secret.</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Configure AIS to trust JWTs issued by the local AuthN issuer</span>
<span class="na">configToUpdate</span><span class="pi">:</span>
  <span class="na">auth</span><span class="pi">:</span>
    <span class="na">enabled</span><span class="pi">:</span> <span class="kc">true</span>
    <span class="c1"># Instead of signature.key, we configure a list of issuers that we trust</span>
    <span class="na">oidc</span><span class="pi">:</span>
      <span class="na">allowed_iss</span><span class="pi">:</span> <span class="pi">[</span><span class="s2">"</span><span class="s">https://ais-authn.ais.svc.cluster.local:52001"</span><span class="pi">]</span>

<span class="c1"># Client AuthN API config used by operator and admin client</span>
<span class="na">auth</span><span class="pi">:</span>
  <span class="na">serviceURL</span><span class="pi">:</span> <span class="s2">"</span><span class="s">https://ais-authn.ais.svc.cluster.local:52001"</span>
  <span class="c1"># Currently, AuthN only supports username and password login to fetch tokens</span>
  <span class="na">usernamePassword</span><span class="pi">:</span>
    <span class="na">secretName</span><span class="pi">:</span> <span class="s">ais-authn-su-creds</span>
</code></pre></div></div>
<hr />

<h2 id="conclusion-and-future-work">Conclusion and Future Work</h2>

<p>Moving towards RSA signing and OIDC issuer lookup is important for AuthN, but there’s more we want to build:</p>

<h3 id="signing-key-rotation">Signing Key Rotation</h3>

<p>Issuer lookup supports multiple active signing keys, which in theory allows for seamless rotation. 
Currently, AuthN keys can be rotated manually via <code class="language-plaintext highlighter-rouge">ais auth rotate-key</code>.
However, AIStore version 4.4 won’t accept tokens signed by the new keys until the cached key set is refreshed. 
This refresh is only triggered by the previous JWK’s expiry date (currently unset by AuthN) or by a proxy restart.</p>

<p>Once live rotation is fully supported on the AIStore side, automated signing key rotation for AuthN is a natural progression.
Configurable intervals for automated rotation would eliminate the manual step and reduce the window of exposure if a key is compromised.</p>

<h3 id="multi-replica-support">Multi-replica Support</h3>

<p>Since AIS now actively queries AuthN for key sets, a single-replica AuthN becomes a potential availability bottleneck.
Supporting multiple replicas would bring AuthN to production-grade availability.</p>

<p>This is a non-trivial development that requires more than a simple scale-up. 
AuthN currently uses <a href="https://github.com/tidwall/buntdb">BuntDB</a> as its underlying storage. 
Support for distributed DBs will be required for multi-replica access. 
Signing keys must also be distributed and synchronized between instances to support consistency between multiple signers.</p>

<h3 id="service-account-authentication">Service Account Authentication</h3>

<p>A current limitation is that the AIS operator requires K8s secrets for admin credentials to manage AuthN-enabled clusters.
One proposed alternative is to support AuthN token provisioning via a <a href="https://dev.to/piyushjajoo/understanding-kubernetes-projected-service-account-tokens-205f">K8s projected service account token</a>.
This would move the access control used for the operator and admin client deployments to K8s RBAC and away from static credentials.</p>

<p>Follow our progress on the <a href="https://github.com/NVIDIA/aistore">main AIStore repo</a> or try out the <a href="https://github.com/NVIDIA/ais-k8s/tree/main/local">local deployment</a> yourself!</p>

<hr />

<h2 id="references">References</h2>

<p><strong>AIStore Authentication</strong></p>
<ul>
  <li><a href="https://github.com/NVIDIA/aistore/blob/main/docs/authn.md">AuthN Documentation</a></li>
  <li><a href="https://github.com/NVIDIA/aistore/blob/main/docs/auth_validation.md">AIS Auth Validation</a></li>
  <li><a href="https://github.com/NVIDIA/aistore/blob/main/docs/cli/auth.md">AIS Auth CLI</a></li>
  <li><a href="https://github.com/NVIDIA/ais-k8s/tree/main/local">ais-k8s Local Deployment</a></li>
  <li><a href="https://github.com/NVIDIA/ais-k8s/blob/main/docs/authn.md">AuthN in K8s docs</a></li>
  <li><a href="https://github.com/NVIDIA/ais-k8s/tree/main/helm/authn">AuthN in K8s Helm</a></li>
</ul>

<p><strong>Standards and Specs</strong></p>
<ul>
  <li><a href="https://openid.net/specs/openid-connect-core-1_0.html">OpenID Connect Core 1.0</a></li>
  <li><a href="https://openid.net/specs/openid-connect-discovery-1_0.html">OpenID Connect Discovery 1.0</a></li>
  <li><a href="https://datatracker.ietf.org/doc/html/rfc2104">HMAC (RFC 2104)</a></li>
  <li><a href="https://datatracker.ietf.org/doc/html/rfc7519">JSON Web Token (RFC 7519)</a></li>
  <li><a href="https://datatracker.ietf.org/doc/html/rfc7517">JSON Web Key (RFC 7517)</a></li>
  <li><a href="https://datatracker.ietf.org/doc/html/rfc7518#section-3.3">JSON Web Algorithms (RFC 7518) -- RSA</a></li>
</ul>

<p><strong>Libraries and Tools</strong></p>
<ul>
  <li>AIStore JWK caching library: <a href="https://github.com/lestrrat-go/jwx">lestrrat-go/jwx</a></li>
  <li><a href="https://kind.sigs.k8s.io/">Kubernetes in Docker (KinD)</a></li>
  <li>AuthN storage: <a href="https://github.com/tidwall/buntdb">BuntDB</a></li>
</ul>

<p><strong>General</strong></p>
<ul>
  <li><a href="https://github.com/NVIDIA/aistore">AIStore GitHub</a></li>
  <li><a href="https://aistore.nvidia.com/blog">AIStore Blog</a></li>
</ul>]]></content><author><name>Aaron Wilson</name></author><category term="aistore" /><category term="authn" /><category term="security" /><category term="authentication" /><summary type="html"><![CDATA[Back in February 1997, RFC 2104 introduced HMAC as a mechanism for authenticating messages based on a shared secret key.]]></summary></entry><entry><title type="html">Native Bucket Inventory: Up to 17x Faster Remote Bucket Listing</title><link href="https://aistore.nvidia.com/blog/2026/04/06/native-bucket-inventory" rel="alternate" type="text/html" title="Native Bucket Inventory: Up to 17x Faster Remote Bucket Listing" /><published>2026-04-06T00:00:00+00:00</published><updated>2026-04-06T00:00:00+00:00</updated><id>https://aistore.nvidia.com/blog/2026/04/06/native-bucket-inventory</id><content type="html" xml:base="https://aistore.nvidia.com/blog/2026/04/06/native-bucket-inventory"><![CDATA[<p>AIStore 4.3 introduces Native Bucket Inventory (NBI), a new mechanism for accelerating large remote-bucket listings by turning a repeatedly expensive operation into a local, reusable metadata path. Instead of traversing a remote bucket on every <code class="language-plaintext highlighter-rouge">ais ls</code>, AIS can precompute the bucket inventory once, persist it as compact binary chunks in the cluster, and answer subsequent listing requests directly from that local snapshot.</p>

<p>In our benchmarks, NBI delivers roughly <strong>15x to 17x speedup</strong> for repeated listing of an <code class="language-plaintext highlighter-rouge">s3://</code> bucket with 3.2 million objects, highlighting how effective a precomputed local snapshot can be for large datasets. In this post, we walk through the design of NBI, the internal create and list workflows, the benchmark results, and how to use it from the AIStore Python SDK and CLI.</p>

<h3 id="table-of-contents">Table of Contents</h3>

<ul>
  <li><a href="#motivation">Motivation</a></li>
  <li><a href="#workflow">Workflow</a></li>
  <li><a href="#usage">Usage</a></li>
  <li><a href="#benchmark">Benchmark</a></li>
  <li><a href="#current-limitations">Current Limitations</a></li>
  <li><a href="#when-to-use-nbi-and-when-not-to">When to Use NBI (and When Not To)</a></li>
  <li><a href="#conclusion">Conclusion</a></li>
  <li><a href="#references">References</a></li>
</ul>

<h2 id="motivation">Motivation</h2>

<p>Remote bucket listing becomes expensive when the bucket is both large and repeatedly accessed. A full listing requires AIS to retrieve and assemble a large volume of object metadata from the backend before it can return a complete result to the client. When that bucket is relatively stable and listed again and again, the system ends up redoing essentially the same work each time, even though the contents change very little between requests.</p>

<p>The core issue is that the object metadata returned by listing is often reusable, but the system keeps rebuilding it from scratch. The larger the bucket, the more bandwidth, latency, and backend API work AIS must spend to reconstruct information it has effectively already seen.</p>

<p>NBI addresses that mismatch by treating the bucket listing results as cacheable metadata. Instead of rebuilding the full listing on every request, AIS captures it once, stores it locally in a compact form, and reuses that snapshot for subsequent listings.</p>

<blockquote>
  <p>While our benchmarks use S3, NBI is backend-agnostic and works identically with any remote backend — AWS S3, Google Cloud Storage, Azure Blob, OCI Object Storage, and remote AIS clusters.</p>
</blockquote>

<h2 id="workflow">Workflow</h2>

<p>NBI runs in two phases: <strong>creation</strong> and <strong>listing</strong>.</p>

<h3 id="creation">Creation</h3>

<p>When the user requests inventory creation, the <a href="https://github.com/NVIDIA/aistore/blob/main/docs/terminology.md#proxy">proxy</a> distributes the job to all <a href="https://github.com/NVIDIA/aistore/blob/main/docs/terminology.md#target">targets</a>. Each target independently walks the <a href="https://github.com/NVIDIA/aistore/blob/main/docs/providers.md">remote backend</a>, but only keeps the entries whose names hash to that target. The entries are sorted, grouped into chunks of ~20K names each, encoded as compressed <code class="language-plaintext highlighter-rouge">msgpack</code>, and written to the AIS system bucket <code class="language-plaintext highlighter-rouge">ais://.sys-inventory</code>. The resulting object path follows the pattern <code class="language-plaintext highlighter-rouge">{provider}/@#/{bucket}/inv-{uuid}</code>.</p>
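
<p>The partitioning idea can be pictured with a toy sketch: the simple stable hash below stands in for AIS’s actual target mapping, and the chunk encoding is likewise simplified.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import zlib
import msgpack  # pip install msgpack

CHUNK_LEN = 20_000  # ~20K names per chunk, as described above

# Toy sketch: each "target" keeps only the names that hash to it, sorts
# them, and packs sorted runs into compressed msgpack chunks.
def build_inventory_chunks(names, target_idx, num_targets):
    mine = sorted(
        n for n in names
        if zlib.crc32(n.encode()) % num_targets == target_idx
    )
    for i in range(0, len(mine), CHUNK_LEN):
        yield zlib.compress(msgpack.packb(mine[i:i + CHUNK_LEN]))
</code></pre></div></div>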

<h3 id="listing">Listing</h3>

<p>When a list request carries the inventory flag, the proxy broadcasts to all targets instead of sending the request to the remote backend. Each target reads its local inventory chunks, binary-searches for the continuation token, and returns a page of entries. The proxy merges per-target pages to assemble a globally sorted result. No S3 calls are made.</p>
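
<p>Conceptually, the listing side reduces to a binary search plus a k-way merge. A minimal sketch with the Python standard library (ignoring real paging metadata):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import bisect
import heapq

# Each target serves the slice of its sorted inventory that follows the
# continuation token; the proxy merges the per-target pages in order.
def target_page(sorted_names, token, page_size):
    start = bisect.bisect_right(sorted_names, token)  # binary search
    return sorted_names[start:start + page_size]

def proxy_merge(pages, page_size):
    return list(heapq.merge(*pages))[:page_size]

pages = [
    target_page(["a", "c", "e"], token="", page_size=2),
    target_page(["b", "d", "f"], token="", page_size=2),
]
print(proxy_merge(pages, page_size=3))  # ['a', 'b', 'c']
</code></pre></div></div>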

<h2 id="usage">Usage</h2>

<h3 id="python-sdk">Python SDK</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="n">aistore</span> <span class="kn">import</span> <span class="n">Client</span>

<span class="n">client</span> <span class="o">=</span> <span class="nc">Client</span><span class="p">(</span><span class="sh">"</span><span class="s">http://ais-endpoint:51080</span><span class="sh">"</span><span class="p">)</span>
<span class="n">bck</span> <span class="o">=</span> <span class="n">client</span><span class="p">.</span><span class="nf">bucket</span><span class="p">(</span><span class="sh">"</span><span class="s">my-bucket</span><span class="sh">"</span><span class="p">,</span> <span class="n">provider</span><span class="o">=</span><span class="sh">"</span><span class="s">s3</span><span class="sh">"</span><span class="p">)</span>

<span class="c1"># Create the inventory (one time)
</span><span class="n">job_id</span> <span class="o">=</span> <span class="n">bck</span><span class="p">.</span><span class="nf">create_inventory</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="sh">"</span><span class="s">trainset-v1</span><span class="sh">"</span><span class="p">,</span> <span class="n">prefix</span><span class="o">=</span><span class="sh">"</span><span class="s">images/</span><span class="sh">"</span><span class="p">)</span>
<span class="n">client</span><span class="p">.</span><span class="nf">job</span><span class="p">(</span><span class="n">job_id</span><span class="p">).</span><span class="nf">wait</span><span class="p">()</span>

<span class="c1"># List via inventory — no S3 calls made
</span><span class="n">page</span> <span class="o">=</span> <span class="n">bck</span><span class="p">.</span><span class="nf">list_objects</span><span class="p">(</span><span class="n">inventory_name</span><span class="o">=</span><span class="sh">"</span><span class="s">trainset-v1</span><span class="sh">"</span><span class="p">,</span> <span class="n">prefix</span><span class="o">=</span><span class="sh">"</span><span class="s">images/train/</span><span class="sh">"</span><span class="p">)</span>
<span class="k">for</span> <span class="n">entry</span> <span class="ow">in</span> <span class="n">page</span><span class="p">.</span><span class="n">entries</span><span class="p">:</span>
    <span class="nf">print</span><span class="p">(</span><span class="n">entry</span><span class="p">.</span><span class="n">name</span><span class="p">)</span>

<span class="c1"># Clean up when no longer needed
</span><span class="n">bck</span><span class="p">.</span><span class="nf">destroy_inventory</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="sh">"</span><span class="s">trainset-v1</span><span class="sh">"</span><span class="p">)</span>
</code></pre></div></div>

<h3 id="cli">CLI</h3>

<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">#</span><span class="w"> </span>Create inventory
<span class="gp">$</span><span class="w"> </span>ais nbi create s3://my-bucket
<span class="go">
</span><span class="gp">#</span><span class="w"> </span>Monitor creation
<span class="gp">$</span><span class="w"> </span>ais show job create-inventory
<span class="go">
</span><span class="gp">#</span><span class="w"> </span>Show inventory metadata
<span class="gp">$</span><span class="w"> </span>ais nbi show s3://my-bucket
<span class="go">
</span><span class="gp">#</span><span class="w"> </span>List via inventory
<span class="gp">$</span><span class="w"> </span>ais <span class="nb">ls </span>s3://my-bucket <span class="nt">--inventory</span>
<span class="gp">$</span><span class="w"> </span>ais <span class="nb">ls </span>s3://my-bucket <span class="nt">--inventory</span> <span class="nt">--prefix</span> images/train/
<span class="go">
</span><span class="gp">#</span><span class="w"> </span>Destroy inventory
<span class="gp">$</span><span class="w"> </span>ais nbi <span class="nb">rm </span>s3://my-bucket
</code></pre></div></div>

<h2 id="benchmark">Benchmark</h2>

<p>We measured listing latency across 15 scale points from 1K to 3.2M objects in an <code class="language-plaintext highlighter-rouge">s3://</code> bucket, with 3 runs per point. The chart below shows p50 latency for AIS regular listing, S3 direct (boto3), NBI creation (one-time), and NBI listing.</p>

<p><img src="/assets/nbi/nbi_scale.png" alt="NBI benchmark result" /></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>==========================================================================================================================
NBI latency-vs-scale  (3 runs each)
==========================================================================================================================
Objects     Creation    Regular          NBI          S3 Direct    Speedup
                         p50      sd     p50     sd    p50      sd   (R/N)
--------------------------------------------------------------------------
   1K          449ms    405ms   144ms    21ms    2ms   676ms    2.9s  19.0x
   2K          616ms    596ms    21ms    34ms    4ms   480ms   251ms  17.6x
   5K          811ms     1.2s   161ms    51ms   33ms    1.4s   186ms  22.4x
  10K           1.8s     1.9s   177ms   147ms   35ms    2.7s   165ms  12.6x
  20K           3.5s     3.4s   106ms   248ms   20ms    6.2s   154ms  13.8x
  40K           7.2s     7.2s   358ms   550ms   95ms   10.8s   278ms  13.2x
  50K           8.6s     8.8s   689ms   631ms   30ms   12.9s   242ms  13.9x
  80K          13.9s    13.8s    1.1s    1.0s   22ms   21.4s   995ms  13.2x
 100K          17.0s    17.9s   387ms    1.4s   45ms   27.3s   822ms  12.6x
 200K          31.8s    37.6s    2.1s    2.9s  225ms   53.8s    6.6s  13.1x
 400K         1m  5s   1m 16s   364ms    6.3s  698ms  1m 50s    4.8s  12.3x
 600K         1m 43s    2m 0s    1.8s    9.1s  276ms  2m 39s    3.5s  13.3x
 800K         2m 10s   2m 46s    2.2s   12.5s  939ms  3m 43s   901ms  13.4x
   1M         2m 42s   3m 38s   920ms   16.4s  213ms  4m 42s   10.7s  13.3x
 3.2M         8m 34s  16m 55s    4.3s   1m 0s   2.8s 14m 46s   23.1s  16.9x
==========================================================================================================================
</code></pre></div></div>

<p>NBI listing latency still increases with object count because it scans locally stored inventory data, but its absolute latency remains far lower than regular listing. On the chart, the NBI curve appears almost flat compared to a regular AIS list or a direct S3 list. The speedup holds across the entire 1K-3.2M range, from roughly 12x to 22x, reaching <strong>16.9x</strong> at 3.2M objects.</p>

<h2 id="current-limitations">Current Limitations</h2>

<p>NBI is <strong>experimental</strong> in AIStore 4.3, and the current implementation keeps inventory management intentionally simple. At the moment, AIStore supports only one inventory per bucket, so concurrent inventories for the same bucket are not supported. Inventories are created manually via CLI or SDK and remain static until they are recreated or removed; if an inventory already exists and you want a new one, you can recreate it with <code class="language-plaintext highlighter-rouge">--force</code>, or simply remove it first with <code class="language-plaintext highlighter-rouge">ais nbi rm</code> and then create it again.</p>

<p>The current creation path is also optimized for correctness and simplicity rather than minimum backend work. During inventory creation, all targets walk the remote bucket in parallel and each keeps only its own portion of the results. Automatic refresh and more efficient creation strategies are planned for future releases.</p>

<blockquote>
  <p><strong>Note:</strong> NBI replaces the older S3-specific <code class="language-plaintext highlighter-rouge">--s3-inventory</code> path, which depended on provider-generated CSV/Parquet inventory files. The new implementation is AIS-native, backend-agnostic, and does not require external tooling.</p>
</blockquote>

<h2 id="when-to-use-nbi-and-when-not-to">When to Use NBI (and When Not To)</h2>

<p><strong>Good fit:</strong></p>

<ul>
  <li>Large remote buckets (100K+ objects) that are listed repeatedly</li>
  <li>Training pipelines that enumerate a dataset before each epoch</li>
  <li>Data audits or dashboards that scan bucket contents periodically</li>
  <li>Any workflow where the bucket is relatively stable between listings</li>
</ul>

<p><strong>Not a good fit:</strong></p>

<ul>
  <li>Small buckets — creation cost exceeds the listing savings</li>
  <li>Rapidly changing buckets — the snapshot goes stale quickly, and frequent recreation negates the benefit</li>
  <li><code class="language-plaintext highlighter-rouge">ais://</code> buckets — metadata is already local (to each <em>listing</em> target), so NBI provides no speedup</li>
  <li>One-off listings — if you only list a bucket once, the creation overhead is pure cost</li>
</ul>

<h2 id="conclusion">Conclusion</h2>

<p>NBI delivers roughly <strong>15x better listing performance</strong> for large remote buckets, with measured speedups of <strong>12-22x</strong> across the range we tested. That makes it a practical solution for repeated listing of multi-million-object <code class="language-plaintext highlighter-rouge">s3://</code>, <code class="language-plaintext highlighter-rouge">gs://</code>, and other remote buckets where rebuilding the full result from the backend on every request is too slow and too expensive.</p>

<h2 id="references">References</h2>

<ul>
  <li><a href="https://github.com/NVIDIA/aistore/blob/main/docs/nbi.md">NBI documentation</a></li>
  <li><a href="https://github.com/NVIDIA/aistore/blob/main/python/tests/perf/nbi/bench.py">NBI benchmark script</a></li>
</ul>]]></content><author><name>Tony Chen, Abhishek Gaikwad</name></author><category term="aistore" /><category term="nbi" /><category term="benchmark" /><category term="optimization" /><summary type="html"><![CDATA[AIStore 4.3 introduces Native Bucket Inventory (NBI), a new mechanism for accelerating large remote-bucket listings by turning a repeatedly expensive operation into a local, reusable metadata path. Instead of traversing a remote bucket on every ais ls, AIS can precompute the bucket inventory once, persist it as compact binary chunks in the cluster, and answer subsequent listing requests directly from that local snapshot.]]></summary></entry><entry><title type="html">Parallel Download: 9x Lower Latency for Large-Object Reads</title><link href="https://aistore.nvidia.com/blog/2026/03/25/parallel-download" rel="alternate" type="text/html" title="Parallel Download: 9x Lower Latency for Large-Object Reads" /><published>2026-03-25T00:00:00+00:00</published><updated>2026-03-25T00:00:00+00:00</updated><id>https://aistore.nvidia.com/blog/2026/03/25/parallel-download</id><content type="html" xml:base="https://aistore.nvidia.com/blog/2026/03/25/parallel-download"><![CDATA[<p>In AIStore 4.3, we introduced parallel download APIs to accelerate reads of large objects in an AIS cluster. Instead of pulling the entire object through one long sequential GET request stream, parallel download breaks the read into coordinated range-reads and fetches multiple chunks at the same time. Those chunks are then either consumed in order as a reader stream or written directly into their final offsets on the client side. By turning one serialized read path into many concurrent chunk transfers, parallel download can engage more disks on AIS targets, better utilize available network bandwidth, and significantly increase single-object throughput.</p>

<p>Our benchmarks confirm the impact: fetching a 128 GiB object via parallel download is up to <strong>9x faster</strong> than a standard single-stream GET request. When integrated with PyTorch DataLoader, parallel download reduces per-batch fetch latency by <strong>11x</strong> compared to single-stream GET on a 10 TiB bucket.</p>

<p>This post describes parallel download’s design, internal workflow, and the trade-offs behind its performance improvements. It also summarizes the current benchmark results and shows how to use it from the AIStore Python SDK and PyTorch.</p>

<h3 id="table-of-contents">Table of Contents</h3>

<ul>
  <li><a href="#motivation-why-parallel-download-scales-better-for-large-objects">Motivation</a></li>
  <li><a href="#architecture-and-workflow">Architecture and Workflow</a></li>
  <li><a href="#usage">Usage</a></li>
  <li><a href="#benchmark">Benchmark</a></li>
  <li><a href="#conclusion">Conclusion</a></li>
  <li><a href="#references">References</a></li>
</ul>

<h2 id="motivation-why-parallel-download-scales-better-for-large-objects">Motivation: Why Parallel Download Scales Better for Large Objects</h2>

<p>The motivation for parallel download starts with a simple observation: once an object becomes large enough, reading it through one sequential <code class="language-plaintext highlighter-rouge">GET</code> leaves a lot of the cluster’s available bandwidth unused. Starting from <a href="https://github.com/NVIDIA/aistore/releases/tag/v1.4.0#chunked-objects">AIStore 4.0</a>, AIStore has supported chunked objects as a first-class storage representation: the cluster actively creates the chunks, places them across storage devices, and manages that layout internally. Once the object is stored that way, the natural next step is to build a read path that can exploit the layout instead of collapsing everything back into one serialized stream. That is the role of parallel download. It turns one logical object read into multiple coordinated chunk reads and, in practice, unlocks two distinct performance gains:</p>

<ul>
  <li>
    <p><strong>Breaking the Single-Disk Limit</strong>: A single disk can only deliver so much read throughput, often well below the bandwidth of a modern data-center NIC. If a large object is fetched as one sequential stream, read throughput is effectively capped by the disk serving that stream. AIStore’s chunked object representation removes that bottleneck by distributing object chunks across the target’s available disks, allowing one logical object read to engage multiple disks in parallel.</p>
  </li>
  <li>
    <p><strong>Taking Advantage of NVMe Parallelism</strong>: NVMe SSDs are built around deep queues and internal parallelism (<a href="https://nvmexpress.org/wp-content/uploads/NVMe_Overview.pdf">NVMe Overview</a>), so they perform best when multiple read requests are in flight at the same time. Parallel chunk reads give the device more work to schedule concurrently across its internal resources, which often raises effective read throughput well beyond what one long sequential request can sustain. This is exactly the behavior we will see later in the benchmark results.</p>
  </li>
</ul>

<p>Taken together, these two effects point to the same strategy: concurrent chunk fetching. The client needs to understand the object’s chunk boundaries and issue multiple range-read requests in parallel while preserving the correct chunk order at the destination. That is exactly what parallel download does. When the object is large enough, parallel download can improve single-object throughput both by engaging more of the cluster’s storage layout and by driving more of the underlying NVMe read parallelism.</p>

<h2 id="architecture-and-workflow">Architecture and Workflow</h2>

<p>Parallel download uses a coordinator-worker design, but it has two distinct execution patterns depending on whether the caller consumes the object as a stream or materializes the full object in memory.</p>

<h3 id="1-streaming-mode-ring-buffer-transfer">1. Streaming Mode: Ring-Buffer Transfer</h3>

<p>When the caller consumes the object incrementally, the parallel download API uses a bounded ring buffer to preserve ordered streaming semantics while multiple chunk fetches stay in flight.</p>

<p><img src="/assets/multipart_download/mpd_streaming_workflow_diagram.png" alt="Parallel download streaming workflow" /></p>

<p>At a high level, the workflow is:</p>

<ol>
  <li>The client issues a <code class="language-plaintext highlighter-rouge">HEAD</code> request to fetch the object’s metadata, including total object size and chunk size.</li>
  <li>Parallel download allocates a shared buffer of size <code class="language-plaintext highlighter-rouge">num_workers * chunk_size</code>, giving each worker one slot in the ring, and spawns <code class="language-plaintext highlighter-rouge">num_workers</code> subprocesses.</li>
  <li>Each subprocess worker issues range-read <code class="language-plaintext highlighter-rouge">GET</code> requests for its assigned chunk.</li>
  <li>As chunks arrive, workers place them into their assigned buffer slots.</li>
  <li>The main process consumes the slots in order, preserving the original byte order of the object as it copies data into the reader output stream.</li>
  <li>Once a slot is fully consumed, the main process marks it reusable and signals the corresponding worker to fetch the next chunk.</li>
</ol>

<p>This loop continues until the entire object has been streamed to the caller.</p>

<p>The ring-buffer design matters for two reasons:</p>

<ul>
  <li><strong>Bounded memory usage</strong>: the buffer stays fixed at <code class="language-plaintext highlighter-rouge">num_workers * chunk_size</code> no matter how large the object is.</li>
  <li><strong>A full download pipeline</strong>: as soon as the consumer releases a slot, another range-read can begin, keeping the configured level of parallelism active until the final chunk is fetched.</li>
</ul>
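
<p>To make the slot handoff concrete, here is a toy single-process model of the ring buffer, with threads standing in for the subprocess workers and a fake <code class="language-plaintext highlighter-rouge">fetch()</code> in place of range-read GETs:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import threading

NUM_WORKERS = 4
CHUNK_SIZE = 4
DATA = bytes(range(37))  # pretend this is the remote object

def fetch(i):  # stand-in for a range-read GET of chunk i
    return DATA[i * CHUNK_SIZE:(i + 1) * CHUNK_SIZE]

num_chunks = -(-len(DATA) // CHUNK_SIZE)         # ceiling division
slots = [None] * NUM_WORKERS                     # one ring slot per worker
ready = [threading.Semaphore(0) for _ in slots]  # slot filled
free = [threading.Semaphore(1) for _ in slots]   # slot reusable

def worker(w):
    # Worker w fetches chunks w, w+NUM_WORKERS, w+2*NUM_WORKERS, ...
    for i in range(w, num_chunks, NUM_WORKERS):
        free[w].acquire()   # wait until our slot has been consumed
        slots[w] = fetch(i)
        ready[w].release()  # signal the consumer

threads = [threading.Thread(target=worker, args=(w,)) for w in range(NUM_WORKERS)]
for t in threads:
    t.start()

out = bytearray()
for i in range(num_chunks):  # consume slots strictly in chunk order
    w = i % NUM_WORKERS
    ready[w].acquire()
    out += slots[w]
    free[w].release()        # slot freed: the next fetch can start

for t in threads:
    t.join()
assert bytes(out) == DATA
</code></pre></div></div>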

<h3 id="2-full-object-mode-direct-shared-memory">2. Full-Object Mode: Direct Shared Memory</h3>

<p>When the caller needs the full object materialized in memory, the parallel download API does not use the ring buffer. Instead, it allocates one shared-memory segment large enough to hold the full object and downloads directly into that destination.</p>

<p><img src="/assets/multipart_download/mpd_full_object_workflow.png" alt="Parallel download full-object workflow" /></p>

<p>At a high level, the workflow is:</p>

<ol>
  <li>The client issues a <code class="language-plaintext highlighter-rouge">HEAD</code> request to fetch the object’s metadata, including total object size and chunk size.</li>
  <li>Parallel download allocates a shared-memory buffer to hold the full object.</li>
  <li>Worker subprocesses issue parallel range-read <code class="language-plaintext highlighter-rouge">GET</code> requests for their assigned byte ranges.</li>
  <li>Each worker writes directly into its exact offset inside the final shared-memory destination.</li>
  <li>Once all workers finish, the caller receives a view over that full shared-memory segment.</li>
</ol>

<p>This pattern avoids the extra copy from ring-buffer slots into a streaming output, but it trades that for a much larger memory reservation because the full object must fit in shared memory at once.</p>
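
<p>A minimal stand-alone model of the direct-write pattern, using Python’s <code class="language-plaintext highlighter-rouge">multiprocessing</code> and a fake <code class="language-plaintext highlighter-rouge">fetch_range()</code> instead of real GETs:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from multiprocessing import Pool
from multiprocessing.shared_memory import SharedMemory

DATA = bytes(range(256)) * 64  # pretend this is the remote object
CHUNK = 1024

def fetch_range(start, end):  # stand-in for a range-read GET
    return DATA[start:end]

def worker(args):
    name, start, end = args
    shm = SharedMemory(name=name)                 # attach to the segment
    shm.buf[start:end] = fetch_range(start, end)  # write at final offset
    shm.close()

if __name__ == "__main__":
    shm = SharedMemory(create=True, size=len(DATA))
    ranges = [(shm.name, s, min(s + CHUNK, len(DATA)))
              for s in range(0, len(DATA), CHUNK)]
    with Pool(4) as pool:
        pool.map(worker, ranges)
    assert bytes(shm.buf) == DATA  # every chunk landed at its offset
    shm.close()
    shm.unlink()
</code></pre></div></div>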

<h2 id="usage">Usage</h2>

<p>AIStore currently exposes parallel download through four interfaces: the Python SDK, PyTorch integration, the native Go API, and the CLI.</p>

<h3 id="python-sdk">Python SDK</h3>

<p>Use <code class="language-plaintext highlighter-rouge">get_reader(num_workers=...)</code> to enable parallel download for a single object read. The returned reader can be consumed as a streaming iterator:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="n">aistore</span> <span class="kn">import</span> <span class="n">Client</span>

<span class="n">client</span> <span class="o">=</span> <span class="nc">Client</span><span class="p">(</span><span class="sh">"</span><span class="s">AIS_ENDPOINT</span><span class="sh">"</span><span class="p">)</span>
<span class="n">bucket</span> <span class="o">=</span> <span class="n">client</span><span class="p">.</span><span class="nf">bucket</span><span class="p">(</span><span class="sh">"</span><span class="s">my_bucket</span><span class="sh">"</span><span class="p">)</span>

<span class="n">reader</span> <span class="o">=</span> <span class="n">bucket</span><span class="p">.</span><span class="nf">object</span><span class="p">(</span><span class="sh">"</span><span class="s">large-object.bin</span><span class="sh">"</span><span class="p">).</span><span class="nf">get_reader</span><span class="p">(</span><span class="n">num_workers</span><span class="o">=</span><span class="mi">8</span><span class="p">)</span>
<span class="k">for</span> <span class="n">chunk</span> <span class="ow">in</span> <span class="n">reader</span><span class="p">:</span>
    <span class="c1"># ...process the chunk
</span></code></pre></div></div>

<p>If your application needs the entire object materialized in memory, the same reader also supports <code class="language-plaintext highlighter-rouge">read_all()</code>. It returns a <code class="language-plaintext highlighter-rouge">ParallelBuffer</code> backed by shared memory. From there, you can either copy into a regular <code class="language-plaintext highlighter-rouge">bytes</code> object or access the underlying buffer directly and avoid the extra copy:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">with</span> <span class="n">bucket</span><span class="p">.</span><span class="nf">object</span><span class="p">(</span><span class="sh">"</span><span class="s">large-object.bin</span><span class="sh">"</span><span class="p">).</span><span class="nf">get_reader</span><span class="p">(</span><span class="n">num_workers</span><span class="o">=</span><span class="mi">8</span><span class="p">).</span><span class="nf">read_all</span><span class="p">()</span> <span class="k">as</span> <span class="n">buf</span><span class="p">:</span>
    <span class="n">raw</span> <span class="o">=</span> <span class="n">buf</span><span class="p">.</span><span class="nf">tobytes</span><span class="p">()</span>  <span class="c1"># option 1: copy into a new bytes object
</span>    <span class="n">raw</span> <span class="o">=</span> <span class="n">buf</span><span class="p">.</span><span class="n">buf</span>        <span class="c1"># option 2: use the memoryview directly
</span></code></pre></div></div>

<blockquote>
  <p><strong>Note:</strong> <code class="language-plaintext highlighter-rouge">read_all()</code> does not use the streaming ring buffer. It allocates a full-size shared-memory segment for the object and downloads the entire object into that buffer. On Linux, those shared-memory objects are normally created through POSIX shared memory and exposed via <code class="language-plaintext highlighter-rouge">/dev/shm</code>. As a result, very large objects can consume shared-memory capacity quickly and also contribute to overall memory pressure. If you use this path on Linux, monitor <code class="language-plaintext highlighter-rouge">/dev/shm</code> usage during testing, for example with <code class="language-plaintext highlighter-rouge">df -h /dev/shm</code>. Prefer the streaming iterator when the full object does not need to be materialized in memory at once.</p>
</blockquote>

<p><strong>Use Case</strong>: High-throughput reads for a single large object from an AIS cluster.</p>

<h3 id="pytorch-integration">PyTorch Integration</h3>

<p><code class="language-plaintext highlighter-rouge">AISParallelMapDataset</code> plugs directly into the standard PyTorch <code class="language-plaintext highlighter-rouge">DataLoader</code>. Each <code class="language-plaintext highlighter-rouge">__getitem__</code> call downloads one object using parallel range-reads and returns a <code class="language-plaintext highlighter-rouge">ParallelBuffer</code>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="n">torch.utils.data</span> <span class="kn">import</span> <span class="n">DataLoader</span>
<span class="kn">from</span> <span class="n">aistore</span> <span class="kn">import</span> <span class="n">Client</span>
<span class="kn">from</span> <span class="n">aistore.pytorch</span> <span class="kn">import</span> <span class="n">AISParallelMapDataset</span>

<span class="n">bucket</span> <span class="o">=</span> <span class="nc">Client</span><span class="p">(</span><span class="sh">"</span><span class="s">AIS_ENDPOINT</span><span class="sh">"</span><span class="p">).</span><span class="nf">bucket</span><span class="p">(</span><span class="sh">"</span><span class="s">training-data</span><span class="sh">"</span><span class="p">)</span>
<span class="n">dataset</span> <span class="o">=</span> <span class="nc">AISParallelMapDataset</span><span class="p">(</span><span class="n">bucket</span><span class="p">,</span> <span class="n">num_workers</span><span class="o">=</span><span class="mi">8</span><span class="p">)</span>

<span class="n">loader</span> <span class="o">=</span> <span class="nc">DataLoader</span><span class="p">(</span><span class="n">dataset</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="mi">8</span><span class="p">,</span> <span class="n">num_workers</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">collate_fn</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">)</span>
<span class="k">for</span> <span class="n">batch</span> <span class="ow">in</span> <span class="n">loader</span><span class="p">:</span>
    <span class="k">for</span> <span class="n">name</span><span class="p">,</span> <span class="n">buf</span> <span class="ow">in</span> <span class="n">batch</span><span class="p">:</span>
        <span class="n">tensor</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="nf">frombuffer</span><span class="p">(</span><span class="n">buf</span><span class="p">.</span><span class="n">buf</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">torch</span><span class="p">.</span><span class="n">float32</span><span class="p">)</span>
        <span class="c1"># ...train on tensor
</span>        <span class="n">buf</span><span class="p">.</span><span class="nf">close</span><span class="p">()</span> <span class="c1"># must be closed to avoid resource leak
</span></code></pre></div></div>

<blockquote>
  <p><strong>Note:</strong> There are two different <code class="language-plaintext highlighter-rouge">num_workers</code> settings here, and they control different kinds of parallelism. <code class="language-plaintext highlighter-rouge">AISParallelMapDataset(..., num_workers=N)</code> controls the workers used <em>inside each object download</em>. <code class="language-plaintext highlighter-rouge">DataLoader(..., num_workers=M)</code> controls PyTorch subprocesses that prefetch samples <em>across the batch pipeline</em>. Setting both to high values multiplies total concurrency, which can oversubscribe CPU resources and make shared-memory buffer lifetime harder to manage. In practice, treat these as two knobs competing for the same client-side resources, not as independent speedups you can increase without limit.</p>
</blockquote>

<p><strong>Use Case</strong>: Loading large objects (video tensors, audio clips, high-resolution images) into a PyTorch training pipeline where per-sample download latency is the bottleneck.</p>

<h3 id="go-api-stream-mode">Go API: Stream Mode</h3>

<p>The Go streaming variant is <code class="language-plaintext highlighter-rouge">api.MultipartDownloadStream()</code>. It is the Go equivalent of the Python reader-based API: it returns an <code class="language-plaintext highlighter-rouge">io.ReadCloser</code> and performs concurrent range-reads behind the scenes while keeping only a bounded ring buffer in memory.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">reader</span><span class="p">,</span> <span class="n">attrs</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">api</span><span class="o">.</span><span class="n">MultipartDownloadStream</span><span class="p">(</span><span class="n">bp</span><span class="p">,</span> <span class="n">bck</span><span class="p">,</span> <span class="n">objName</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">api</span><span class="o">.</span><span class="n">MpdStreamArgs</span><span class="p">{</span>
    <span class="n">NumWorkers</span><span class="o">:</span> <span class="m">8</span><span class="p">,</span>
    <span class="n">ChunkSize</span><span class="o">:</span>  <span class="m">8</span> <span class="o">*</span> <span class="n">cos</span><span class="o">.</span><span class="n">MiB</span><span class="p">,</span>
<span class="p">})</span>
<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
    <span class="k">return</span> <span class="n">err</span>
<span class="p">}</span>
<span class="k">defer</span> <span class="n">reader</span><span class="o">.</span><span class="n">Close</span><span class="p">()</span>

<span class="n">_</span><span class="p">,</span> <span class="n">err</span> <span class="o">=</span> <span class="n">io</span><span class="o">.</span><span class="n">Copy</span><span class="p">(</span><span class="n">dst</span><span class="p">,</span> <span class="n">reader</span><span class="p">)</span>
</code></pre></div></div>

<p><strong>Use Case</strong>: Streaming large-object reads from Go applications with bounded client-side memory.</p>

<h3 id="cli">CLI</h3>

<p>The AIS CLI exposes parallel download through the <code class="language-plaintext highlighter-rouge">--mpd</code> option for large-object downloads. Under the hood, it uses the Go direct-write API <code class="language-plaintext highlighter-rouge">api.MultipartDownload()</code>, which writes each chunk directly into its final offset in the destination file.</p>

<p><strong>Use Case</strong>: Downloading a large object directly into a local file or other seekable destination with minimal client-side buffering.</p>

<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">#</span><span class="w"> </span>Use <span class="sb">`</span><span class="nt">--mpd</span><span class="sb">`</span> option to download a single large object with parallel chunk fetching.
<span class="gp">$</span><span class="w"> </span>ais get ais://my-bucket/large-object.bin /tmp/large-object.bin <span class="nt">--mpd</span> <span class="nt">--num-workers</span> 8
</code></pre></div></div>

<h2 id="benchmark">Benchmark</h2>

<p>The following measurements show how much performance parallel download can unlock in practice.</p>

<h3 id="1-single-large-object-read">1. Single Large-Object Read</h3>

<p>Results in this section were produced with the <a href="https://github.com/NVIDIA/aistore/blob/main/python/tests/perf/parallel_download/single_object_grid_bench.py">single-object benchmark script</a>. We evaluated single large-object reads on two AIStore clusters. Both used the same overall configuration:</p>

<ul>
  <li><strong>Kubernetes Cluster</strong>: 3 bare-metal nodes, each hosting one AIS proxy (gateway) and one AIS target (storage server)</li>
  <li><strong>Benchmark Client</strong>: 1 client machine</li>
  <li><strong>Benchmark Object</strong>: one 128 GiB object</li>
  <li><strong>Target CPU</strong>: 48 cores per node</li>
  <li><strong>Target Memory</strong>: 995 GiB per node</li>
  <li><strong>Client CPU</strong>: 48 cores</li>
  <li><strong>Client Memory</strong>: 995 GiB</li>
  <li><strong>Client Network Bandwidth</strong>: 100 Gbps</li>
</ul>

<p>The two environments differed mainly in storage media and capacity:</p>

<h4 id="nvme-based-cluster-16--58-tib-nvme-ssds-per-target">NVMe-based Cluster: 16 × 5.8 TiB NVMe SSDs per Target</h4>

<p><img src="/assets/multipart_download/mpd_nvme_chunk_workers.png" alt="Parallel download throughput on NVMe" /></p>

<p>On the NVMe cluster, parallel download reached up to <strong>9x</strong> speedup over a standard single-stream GET in the large-object benchmark. The chart includes both chunked and non-chunked cases: the <code class="language-plaintext highlighter-rouge">monolithic</code> label means the object was stored as a regular non-chunked object, while the other labels are AIS chunk sizes used to distribute the object across disks. Across the full sweep, throughput rises sharply for nearly all chunk sizes once multiple read requests are in flight. The best results come from combining sufficiently large chunks with enough workers to keep the device busy - the NVMe-internal parallelism discussed earlier.</p>

<h4 id="hdd-based-cluster-10--91-tib-drives-per-target">HDD-based Cluster: 10 × 9.1 TiB Drives per Target</h4>

<p><img src="/assets/multipart_download/mpd_hdd_chunk_workers.png" alt="Parallel download throughput on HDD" /></p>

<p>On the HDD cluster, parallel download still delivered up to <strong>6.9x</strong> speedup, but the pattern is different. Here, the gain depends much more on the object being properly chunked across disks so that parallel download can read from multiple devices in parallel. Unlike NVMe, HDDs do not provide the same internal parallelism, so the improvement is more sensitive to chunk size and tapers off sooner for very large chunks.</p>

<p>Taken together, these two charts show that parallel download does not have a single best configuration that works everywhere. The optimal chunk size and worker count depend on your client-side resources, storage media, and object size distribution. For that reason, we encourage users to benchmark a small set of chunk-size and worker combinations on their own workload, find the sweet spot, and then use that setting for the full training or data-loading job. In our case, the best region was around <code class="language-plaintext highlighter-rouge">64-128 MiB</code> chunks with <code class="language-plaintext highlighter-rouge">64</code> workers, and we will carry that tuning into the next benchmark.</p>

<h3 id="2-full-data-loading-job-via-pytorch">2. Full Data-Loading Job via PyTorch</h3>

<p>Results in this section were produced with the <a href="https://github.com/NVIDIA/aistore/blob/main/python/tests/perf/pytorch/parallel_download.py">PyTorch data-loading benchmark script</a>. To measure end-to-end impact, we ran that benchmark on the same NVMe-based cluster described above. The workload used a 10.61 TiB bucket containing 1,589 large training-sample objects ranging from 2.51 GiB to 17.32 GiB (average 6.84 GiB).</p>

<p>Based on the single-object benchmark, <code class="language-plaintext highlighter-rouge">64 MiB</code> was the best chunk size on this cluster, so we rechunked the dataset before running the job:</p>

<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">$</span><span class="w"> </span>ais bucket rechunk ais://mpd-bench <span class="nt">--chunk-size</span> 64MiB <span class="nt">--objsize-limit</span> 1
</code></pre></div></div>

<p>We then compared two end-to-end configurations over 64 batches with <code class="language-plaintext highlighter-rouge">batch_size=8</code>:</p>

<ul>
  <li><strong>GET</strong>: standard single-stream reads via <code class="language-plaintext highlighter-rouge">AISMapDataset</code></li>
  <li><strong>Parallel</strong>: per-object parallel downloads via <code class="language-plaintext highlighter-rouge">AISParallelMapDataset</code> with <code class="language-plaintext highlighter-rouge">workers=48</code></li>
</ul>

<p><img src="/assets/multipart_download/pytorch_batch_latency.png" alt="PyTorch batch latency: GET vs Parallel" /></p>

<p>The per-batch latency chart shows a clean separation between the two modes across the entire run. Standard GET stays in the 150-265 second range per batch, while the parallel mode stays near 14-23 seconds. The gap is not limited to a few outliers or warm-up effects; it persists across all 64 batches.</p>

<p>The same pattern is visible at the cluster level. During the benchmark run, total GET throughput stays near the single-stream baseline while the GET phase is running, then jumps sharply when the parallel phase begins:</p>

<p><img src="/assets/multipart_download/pytorch_grafana_throughput_transition.png" alt="Grafana throughput during GET-to-Parallel transition" /></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ AIS_ENDPOINT=&lt;cluster-endpoint&gt; AIS_BUCKET=mpd-bench BATCH_SIZE=8 NUM_BATCHES=64 AIS_WORKERS=48 python3 python/tests/perf/pytorch/parallel_download.py
Bucket: ais://mpd-bench
Objects: 1589  total=10865.5 GiB  avg=6.84 GiB  min=2.51 GiB  max=17.32 GiB
Config:  batch_size=8  num_batches=64  parallel_workers=48
...
                               GET    Parallel   Speedup
──────────────────────────────────────────────────────────
Throughput (GiB/s)            0.28        3.09     11.0x
Samples/sec                   0.04        0.46     11.0x
Total wall time (s)       12322.12     1117.81     11.0x
Batch latency mean (s)      192.53       17.47     11.0x
Batch latency med (s)       187.31       17.23     10.9x
Batch latency p95 (s)       231.40       20.74     11.2x
Time-to-first-batch (s)     161.36       15.79     10.2x
</code></pre></div></div>

<p>The same gap appears in the aggregate results. The parallel mode raises throughput from <strong>0.28 GiB/s</strong> to <strong>3.09 GiB/s</strong>, cuts mean batch latency from <strong>192.53s</strong> to <strong>17.47s</strong>, and reduces total wall time from <strong>12,322s</strong> to <strong>1,118s</strong>. Across the full benchmark, the improvement stays consistently around <strong>10-11x</strong>.</p>

<h2 id="conclusion">Conclusion</h2>

<p>Parallel download gives AIStore a parallel read path for large objects by turning one logical <code class="language-plaintext highlighter-rouge">GET</code> into multiple coordinated chunk fetches. In practice, that allows the client to take advantage of chunked object placement across disks and, on NVMe-based systems, to drive much more of the storage device’s internal read parallelism.</p>

<p>In our benchmarks, parallel download improved single-object throughput by up to <strong>9x</strong> and reduced PyTorch per-batch latency by about <strong>11x</strong>. Those gains carried through from synthetic single-object reads to a realistic end-to-end data-loading job, showing that parallel download can translate directly into shorter training input pipelines when large objects dominate the workload.</p>

<h2 id="references">References</h2>

<ul>
  <li><a href="https://github.com/NVIDIA/aistore/releases/tag/v1.4.0#chunked-objects">AIStore 4.0 Release – Chunked Objects</a></li>
  <li><a href="https://github.com/NVIDIA/aistore/blob/main/python/aistore/sdk/obj/object.py">AIStore Python Object Reader</a></li>
  <li><a href="https://github.com/NVIDIA/aistore/blob/main/python/aistore/pytorch/parallel_map_dataset.py">AIStore PyTorch <code class="language-plaintext highlighter-rouge">AISParallelMapDataset</code></a></li>
  <li><a href="https://nvmexpress.org/wp-content/uploads/NVMe_Overview.pdf">NVM Express: NVMe Overview</a></li>
</ul>]]></content><author><name>Tony Chen</name></author><category term="aistore" /><category term="mpd" /><category term="benchmark" /><category term="optimization" /><category term="pytorch" /><summary type="html"><![CDATA[In AIStore 4.3, we introduced parallel download APIs to accelerate reads of large objects in an AIS cluster. Instead of pulling the entire object through one long sequential GET request stream, parallel download breaks the read into coordinated range-reads and fetches multiple chunks at the same time. Those chunks are then either consumed in order as a reader stream or written directly into their final offsets on the client side. By turning one serialized read path into many concurrent chunk transfers, parallel download can engage more disks on AIS targets, better utilize available network bandwidth, and significantly increase single-object throughput.]]></summary></entry><entry><title type="html">The Many Lives of a Dataset Called ‘data’</title><link href="https://aistore.nvidia.com/blog/2025/12/15/s3-data-with-namespace" rel="alternate" type="text/html" title="The Many Lives of a Dataset Called ‘data’" /><published>2025-12-15T00:00:00+00:00</published><updated>2025-12-15T00:00:00+00:00</updated><id>https://aistore.nvidia.com/blog/2025/12/15/s3-data-with-namespace</id><content type="html" xml:base="https://aistore.nvidia.com/blog/2025/12/15/s3-data-with-namespace"><![CDATA[<p>For whatever reason, a bucket called <code class="language-plaintext highlighter-rouge">s3://data</code> shows up with remarkable frequency as we deploy AIStore (AIS) clusters and populate them with user datasets. Likely for the same reason that <code class="language-plaintext highlighter-rouge">password = password</code> remains a popular choice.</p>

<p>At NVIDIA, for example, SwiftStack (an S3-compatible object store) is widely used internally. But it is rarely present alone.
Other S3-compatible systems appear more often than not: cloud accounts, regional replicas, compliance copies. It is the rule rather than the exception for several storage backends to quietly coexist in the workloads run by any given team.</p>

<p>Hence, same-name datasets get copied, mutated, and passed across accounts, eventually finding their way back to us for concurrent use - e.g., <code class="language-plaintext highlighter-rouge">s3://data</code> in its many incarnations.</p>

<p>Same bucket name.<br />
Different endpoints.<br />
Different credentials.<br />
Different contents.</p>

<hr />

<h2 id="same-name-many-buckets">Same Name, Many Buckets</h2>

<p>In real deployments, what <code class="language-plaintext highlighter-rouge">s3://data</code> actually refers to often looks like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>s3://data exists in:
├── SwiftStack (on-prem)
├── OCI (region ABC)
├── AWS S3 (us-east-1)
├── (and more)
</code></pre></div></div>

<p>From a human perspective, these buckets feel interchangeable. From a system’s perspective, they absolutely are not.</p>

<div style="display: flex; justify-content: center; margin: 50px 0;">
<img src="/assets/s3-data-with-namespace.png" width="800" style="max-width: 100%;" alt="The many lives of s3://data" />
</div>

<hr />

<h2 id="whats-in-the-name">What’s in the Name</h2>

<p>Traditional object storage APIs quietly assume that a bucket name uniquely identifies a dataset. That assumption breaks down the moment environments span multiple providers.</p>

<p>In AIS, a bucket is a triplet (see below) with <a href="https://github.com/NVIDIA/aistore/blob/main/docs/bucket.md#bucket-properties">properties</a>:</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>          ┌────────── Bucket Identity ───────────┐
          │ ( provider, namespace, bucket name ) │
          └────────────────┬─────────────────────┘
                 ┌─────────┴─────────┐
                 │ bucket properties │
                 └───────────────────┘
</code></pre></div></div>

<p>Two buckets may share the same name and the same provider, yet belong to different namespaces - and therefore represent entirely different datasets. Credentials, policies, lifecycle rules, and contents remain isolated.</p>
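<p>As an illustration, the triplet maps naturally onto the Python SDK. The sketch below assumes the SDK’s <code class="language-plaintext highlighter-rouge">Namespace</code> type and the <code class="language-plaintext highlighter-rouge">bucket()</code> parameters shown - check the SDK documentation for exact signatures:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from aistore.sdk import Client
from aistore.sdk.types import Namespace  # import path is an assumption

client = Client("AIS_ENDPOINT")

# Same provider ("ais"), same name ("data"), different namespaces:
# two entirely different datasets with isolated properties and contents.
data_a = client.bucket("data", provider="ais", namespace=Namespace(name="team-a"))
data_b = client.bucket("data", provider="ais", namespace=Namespace(name="team-b"))
</code></pre></div></div>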

<p>Bucket namespaces are not necessarily static (although they usually are).
In AIS, namespace resolution itself <em>can</em> be a runtime decision, one that entails distributing updated bucket metadata - typically a split-second operation.</p>

<hr />

<h2 id="dynamic-binding">Dynamic Binding</h2>

<p>Separately from namespace, AIS allows a logical bucket to be bound to another bucket as its backing data source.</p>

<blockquote>
  <p>Note: dynamic binding is <strong>not</strong> request forwarding or caching. It specifies where a dataset <strong>physically resides and how it is accessed remotely</strong>.</p>
</blockquote>

<p>A logical bucket (e.g., <code class="language-plaintext highlighter-rouge">ais://my-training-data</code>) may source its contents from:</p>

<ul>
  <li>an on-prem S3-compatible system,</li>
  <li>a public cloud bucket,</li>
  <li>a regional replica,</li>
  <li>or a derived dataset produced by a processing pipeline.</li>
</ul>

<p>Consider two related datasets:</p>

<ul>
  <li>Original: raw images, audio, or video with minimal labeling</li>
  <li>Processed: augmented, re-labeled, and reordered for efficient training</li>
</ul>

<p>Both represent the same logical corpus. Training code references a single name: <code class="language-plaintext highlighter-rouge">ais://my-training-data</code>.
At runtime, the platform decides which backing data to bind:</p>

<ul>
  <li>training --&gt; processed dataset</li>
  <li>validation --&gt; raw dataset</li>
  <li>debugging --&gt; local copy (or a subset thereof)</li>
  <li>compliance --&gt; immutable regional mirror</li>
</ul>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>          ┌─────────────────────────────────────────────┐
          │                Application                  │
          │            ais://my-training-data           │
          └───────────────────────┬─────────────────────┘
                                  │
          ┌───────────────────────┴─────────────────────┐
          │               Bucket Identity               │
          │   (provider + namespace + bucket name)      │
          └───────────────────────┬─────────────────────┘
                                  │
          ┌───────────────────────┴─────────────────────┐
          │               Backend Binding               │
          │                 (at runtime)                │
          └───────────────────────┬─────────────────────┘
                                  │ (current binding)
          ┌──────────────┬────────┴─────┬───────────────┐
          │  SwiftStack  │    AWS S3    │     OCI       │
          │   s3://data  │   s3://data  │  s3://data    │
          └──────────────┴──────────────┴───────────────┘
</code></pre></div></div>
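<p>The selection step itself can be as simple as a configuration lookup. The sketch below is purely illustrative - the phases and bucket names are hypothetical, and in AIS the actual binding lives in bucket metadata rather than application code:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical phase-to-backing map maintained by the platform, not the app.
BACKINGS = {
    "training":   "s3://data-processed",  # augmented, re-labeled, reordered
    "validation": "s3://data",            # raw corpus
    "compliance": "s3://data-mirror",     # immutable regional copy
}

def backing_for(phase):
    # Application code keeps referencing ais://my-training-data;
    # the platform binds that name to one of these buckets at runtime.
    return BACKINGS[phase]
</code></pre></div></div>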
<hr />

<h2 id="recap">Recap</h2>

<p>Bucket names are not identities.<br />
Dataset selection is a configuration and/or runtime decision, not an application concern.<br />
Infrastructure must absorb the complexity.</p>

<h2 id="references">References</h2>

<ul>
  <li><a href="https://github.com/NVIDIA/aistore">AIStore: scalable storage for AI applications</a></li>
  <li><a href="https://github.com/NVIDIA/aistore/blob/main/docs/bucket.md#bucket-properties">Bucket Properties</a></li>
  <li><a href="https://github.com/NVIDIA/aistore/blob/main/docs/providers.md">Backend Providers</a></li>
</ul>

<hr />
<p>PS. I’ve changed SwiftStack, OCI and AWS specifics in this post; the underlying problem and the solution are real.</p>

<p>Our benchmarks confirm the impact: fetching a 4GiB remote object via blob downloader is now <strong>4x faster</strong> than a standard cold-GET. When integrated with the prefetch job, this approach delivers a <strong>2.28x performance gain</strong> compared to monolithic fetch operations on a 1.56TiB S3 bucket.</p>

<p>This post describes the blob downloader’s design, internal workflow, and the optimizations that drive its performance improvements. It also outlines the benchmark setup, compares blob downloader against regular monolithic cold GETs, and shows how to use the blob downloader API from the supported clients.</p>

<h3 id="table-of-contents">Table of Contents</h3>

<ul>
  <li><a href="#motivation-why-blob-downloader-scales-better-for-large-object">Motivation</a></li>
  <li><a href="#architecture-and-workflow">Architecture and Workflow</a></li>
  <li><a href="#usage">Usage</a></li>
  <li><a href="#benchmark">Benchmark</a></li>
  <li><a href="#conclusion">Conclusion</a></li>
  <li><a href="#references">References</a></li>
</ul>

<h2 id="motivation-why-blob-downloader-scales-better-for-large-object">Motivation: Why Blob Downloader Scales Better for Large Objects</h2>

<p>Splitting large objects into smaller, manageable chunks for parallel downloading is a proven strategy to increase throughput and resilience. In fact, cloud providers like <a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance-guidelines.html#optimizing-performance-guidelines-get-range">AWS</a> and <a href="https://cloud.google.com/blog/products/storage-data-transfer/improve-throughput-with-cloud-storage-client-libraries/">GCP</a> explicitly recommend concurrent <a href="https://www.rfc-editor.org/rfc/rfc7233#section-2.1">range-read</a> requests for optimal performance. The core advantages include:</p>

<ul>
  <li>
    <p><strong>Isolating Failures and Reducing Retries</strong>: With a single sequential stream, a network hiccup can force a restart or large rollback. With range-reads, failures are isolated to individual chunks, so only the affected chunk needs to be retried.</p>
  </li>
  <li>
    <p><strong>Leveraging Distributed Server Throughput</strong>: Cloud objects are typically spread across many disks and nodes. Concurrent range-reads allow the client to pull data from multiple storage nodes in parallel. This aligns with the provider’s internal architecture and bypasses the single-node or per-disk I/O limits.</p>
  </li>
</ul>

<p>Beyond these standard benefits, AIStore leverages the concurrent range-read pattern to unlock an architectural advantage: <strong>chunked object representation</strong>. <a href="https://github.com/NVIDIA/aistore/releases/tag/v1.4.0#chunked-objects">Introduced in AIStore 4.0</a>, this capability allows objects to be stored as separate chunk files, which are automatically distributed across all available disks on a target. This enables the blob downloader to stream each range-read payload directly to a local chunk file, achieving zero-copy efficiency and aggregating the full write bandwidth of all underlying disks.</p>

<h2 id="architecture-and-workflow">Architecture and Workflow</h2>

<p><img src="/assets/blob_downloader/blob_downloader_workflow.png" alt="Blob Downloader Workflow" /></p>

<p>The blob downloader uses a coordinator-worker pattern to execute the download process. When a request is initiated, the main coordinator thread fetches the remote object’s metadata to determine its total size and logically segments it into smaller chunks.</p>

<blockquote>
  <p>This is the same general pattern often referred to as a worker pool, a work-queue with a pool of workers, or a producer–consumer pipeline.</p>
</blockquote>

<p>Once segmentation is complete, the coordinator initializes a pool of worker threads and begins dispatching work. It assigns specific byte ranges to available workers, which then independently issue concurrent range-read requests to the remote storage backend.</p>

<p>As workers receive data, they write each chunk directly to a separate local file and report back to the coordinator for their next assignments. This loop continues until every segment of the object has been successfully persisted.</p>
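<p>For intuition, here is a generic Python sketch of the same coordinator-worker pattern. It is not AIStore’s actual Go implementation, and for simplicity it writes each range into one pre-sized file rather than separate chunk files:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import concurrent.futures
import requests

def parallel_range_download(url, dst, chunk_size=4 * 1024 * 1024, num_workers=8):
    # Coordinator: read total size from object metadata, then segment it.
    total = int(requests.head(url, allow_redirects=True).headers["Content-Length"])
    ranges = [(off, min(off + chunk_size, total) - 1)
              for off in range(0, total, chunk_size)]

    def fetch(rng):
        start, end = rng
        # Worker: one HTTP range-read (RFC 7233) per chunk.
        resp = requests.get(url, headers={"Range": f"bytes={start}-{end}"}, timeout=60)
        resp.raise_for_status()
        return start, resp.content

    # Pre-size the destination, then persist each chunk at its final offset.
    with open(dst, "wb") as f:
        f.truncate(total)
        with concurrent.futures.ThreadPoolExecutor(max_workers=num_workers) as pool:
            for start, data in pool.map(fetch, ranges):
                f.seek(start)
                f.write(data)
</code></pre></div></div>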

<h3 id="load-aware-runtime-adaptation">Load-Aware Runtime Adaptation</h3>

<p>Blob downloader is wired into AIStore’s <a href="https://github.com/NVIDIA/aistore/blob/main/cmn/load/README.md"><code class="language-plaintext highlighter-rouge">load</code> system</a>, which continuously grades node pressure (memory, CPU, goroutines, disk) and returns throttling advice.</p>

<p>At a high level, blob downloader:</p>
<ul>
  <li><strong>checks load once before starting</strong> a job and may reject or briefly delay it when the node is already under heavy memory pressure,</li>
  <li><strong>derives a safe chunk size</strong> from current memory conditions instead of blindly honoring the user’s request, and</li>
  <li><strong>lets workers occasionally back off</strong> (sleep) when disks are too busy while downloads are in progress.</li>
</ul>

<p>The result is that blob downloads run at full speed when the cluster has headroom, but automatically slow down instead of pushing the node into memory or disk overload.</p>
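<p>In rough pseudocode, the policy looks like the sketch below. The thresholds and function names are invented for illustration; the real logic lives in the Go <code class="language-plaintext highlighter-rouge">load</code> package:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import time

def admit_job(mem_pressure, requested_chunk_size):
    # Check node load once before starting; reject under heavy memory pressure.
    if mem_pressure &gt; 0.95:
        raise RuntimeError("node overloaded; blob-download job rejected")
    # Derive a safe chunk size rather than blindly honoring the request.
    if mem_pressure &gt; 0.80:
        return min(requested_chunk_size, 4 * 1024 * 1024)
    return requested_chunk_size

def maybe_backoff(disk_utilization):
    # Workers occasionally sleep when disks are too busy.
    if disk_utilization &gt; 0.90:
        time.sleep(0.05)
</code></pre></div></div>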

<h2 id="usage">Usage</h2>

<p>AIStore exposes blob download functionality through three distinct interfaces, each suited to different use cases.</p>

<h3 id="1-single-object-blob-download-job">1. Single Object Blob Download Job</h3>

<p>Start a blob download job for one or more specific objects.</p>

<p><strong>Use Case</strong>: Direct control over blob downloads, monitoring individual jobs.</p>

<p><strong>AIS CLI Example</strong>:</p>

<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">#</span><span class="w"> </span>Download single large object
<span class="gp">$</span><span class="w"> </span>ais blob-download s3://my-bucket/large-model.bin <span class="nt">--chunk-size</span> 4MiB <span class="nt">--num-workers</span> 8 <span class="nt">--progress</span>
<span class="go">blob-download[X-def456]: downloading s3://my-bucket/large-model.bin
Progress: [████████████████████] 100% | 50.00 GiB/50.00 GiB | 2m30s

</span><span class="gp">#</span><span class="w"> </span>Download multiple objects
<span class="gp">$</span><span class="w"> </span>ais blob-download s3://my-bucket <span class="nt">--list</span> <span class="s2">"obj1.tar,obj2.bin,obj3.dat"</span> <span class="nt">--num-workers</span> 4
</code></pre></div></div>

<h3 id="2-prefetch--blob-downloader">2. Prefetch + Blob Downloader</h3>

<p>The <code class="language-plaintext highlighter-rouge">prefetch</code> operation is integrated with blob downloader via a configurable <strong>blob-threshold</strong> parameter. When this threshold is set (by default, it is disabled), prefetch routes objects whose size meets or exceeds the value to an internal blob-download job, while smaller objects continue to use standard cold GET.</p>

<p><strong>Use Case</strong>: Batch prefetching of remote buckets where some objects are very large, letting the job automatically decide when to engage blob downloader behind the scenes.</p>

<p><strong>AIS CLI Example</strong>:</p>

<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">#</span><span class="w"> </span>List remote bucket
<span class="gp">$</span><span class="w"> </span>ais <span class="nb">ls </span>s3://my-bucket
<span class="go">NAME             SIZE            CACHED
model.ckpt       12.50GiB        no
dataset.tar      8.30GiB         no
config.json      4.20KiB         no

</span><span class="gp">#</span><span class="w"> </span>Prefetch with 1 GiB threshold:
<span class="gp">#</span><span class="w"> </span>- objects ≥ threshold use blob downloader <span class="o">(</span>parallel chunks<span class="o">)</span>
<span class="gp">#</span><span class="w"> </span>- objects &lt; threshold use standard cold GET
<span class="gp">$</span><span class="w"> </span>ais prefetch s3://my-bucket <span class="nt">--blob-threshold</span> 1GiB <span class="nt">--blob-chunk-size</span> 8MiB
<span class="go">prefetch-objects[E-abc123]: prefetch entire bucket s3://my-bucket
</span></code></pre></div></div>

<h3 id="3-streaming-get">3. Streaming GET</h3>

<p>The blob downloader splits the object into chunks, downloads them concurrently into the cluster, and simultaneously streams the assembled result to the client as it arrives.</p>

<p><strong>Use Case</strong>: Stream a large object directly to the client while simultaneously caching it in the cluster.</p>

<p><strong>Python SDK Example</strong>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="n">aistore</span> <span class="kn">import</span> <span class="n">Client</span>
<span class="kn">from</span> <span class="n">aistore.sdk.blob_download_config</span> <span class="kn">import</span> <span class="n">BlobDownloadConfig</span>

<span class="c1"># Set up AIS client and bucket
</span><span class="n">client</span> <span class="o">=</span> <span class="nc">Client</span><span class="p">(</span><span class="sh">"</span><span class="s">AIS_ENDPOINT</span><span class="sh">"</span><span class="p">)</span>
<span class="n">bucket</span> <span class="o">=</span> <span class="n">client</span><span class="p">.</span><span class="nf">bucket</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="sh">"</span><span class="s">my_bucket</span><span class="sh">"</span><span class="p">,</span> <span class="n">provider</span><span class="o">=</span><span class="sh">"</span><span class="s">aws</span><span class="sh">"</span><span class="p">)</span>

<span class="c1"># Configure blob downloader (4MiB chunks, 16 workers)
</span><span class="n">blob_config</span> <span class="o">=</span> <span class="nc">BlobDownloadConfig</span><span class="p">(</span><span class="n">chunk_size</span><span class="o">=</span><span class="sh">"</span><span class="s">4MiB</span><span class="sh">"</span><span class="p">,</span> <span class="n">num_workers</span><span class="o">=</span><span class="sh">"</span><span class="s">16</span><span class="sh">"</span><span class="p">)</span>

<span class="c1"># Stream large object using blob downloader settings
</span><span class="n">reader</span> <span class="o">=</span> <span class="n">bucket</span><span class="p">.</span><span class="nf">object</span><span class="p">(</span><span class="sh">"</span><span class="s">my_large_object</span><span class="sh">"</span><span class="p">).</span><span class="nf">get_reader</span><span class="p">(</span><span class="n">blob_download_config</span><span class="o">=</span><span class="n">blob_config</span><span class="p">)</span>
<span class="nf">print</span><span class="p">(</span><span class="n">reader</span><span class="p">.</span><span class="nf">readall</span><span class="p">())</span>
</code></pre></div></div>

<h2 id="benchmark">Benchmark</h2>

<p>The benchmark was run on an AIStore cluster using the following system configuration:</p>

<ul>
  <li><strong>Kubernetes Cluster</strong>: 3 bare-metal nodes, each hosting one AIS proxy (gateway) and one AIS target (storage server)</li>
  <li><strong>Storage</strong>: 16 × 5.8 TiB NVMe SSDs per target</li>
  <li><strong>CPU</strong>: 48 cores per node</li>
  <li><strong>Memory</strong>: 995 GiB per node</li>
  <li><strong>Network</strong>: dual 100 GbE (100000 Mb/s) NICs per node</li>
</ul>

<h3 id="1-single-blob-download-request">1. Single Blob Download Request</h3>

<p><img src="/assets/blob_downloader/blob_download_cold_get_comparison.png" alt="Blob Download vs. Cold GET" /></p>

<p>The chart above compares the time to fetch a single remote object using blob download versus a standard cold GET across a range of object sizes (16 MiB to 8 GiB).</p>

<p>For smaller objects, cold GET performs slightly better due to the coordination overhead inherent in blob download. However, once objects exceed <strong>256 MiB</strong>, blob download begins to show clear advantages, and the speedup grows with object size.</p>

<p>These results validate the architectural benefits discussed earlier: concurrent range-read requests combined with distributed chunk writes deliver substantial gains for large objects.</p>

<h3 id="2-prefetch-with-blob-download-threshold">2. Prefetch with Blob Download Threshold</h3>

<p>In the prefetch benchmark, we created an S3 bucket containing <strong>4,443 remote objects</strong>, ranging from <strong>10.68 MiB</strong> to <strong>3.53 GiB</strong> in size, for a total remote footprint of <strong>1.56 TiB</strong>.</p>

<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">$</span><span class="w"> </span>ais bucket summary s3://ais-tonyche/blob-bench
<span class="go">NAME                     OBJECTS (cached, remote)        OBJECT SIZES (min, avg, max)            TOTAL OBJECT SIZE (cached, remote)
s3://ais-tonyche         0    4443                       10.68MiB   305.77MiB  3.53GiB           0         1.56TiB
</span></code></pre></div></div>

<p><img src="/assets/blob_downloader/prefetch_blob_threshold_comparison.png" alt="Prefetch Threshold Comparison" /></p>

<p>The chart above compares different <code class="language-plaintext highlighter-rouge">--blob-threshold</code> values for this mixed-size workload and reports both <strong>total prefetch duration</strong> and <strong>aggregate disk write throughput</strong>. In our environment, a threshold around <strong>256 MiB</strong> strikes the best balance by routing large objects through blob download while letting smaller objects use regular cold GET.</p>

<ul>
  <li><strong>If the threshold is set too high</strong>: blob downloader is underutilized because more parallelizable large objects fall back to monolithic GETs.</li>
  <li><strong>If the threshold is set too low</strong>: blob downloader is overused on small objects, flooding the system with chunked downloads and adding coordination overhead without improving throughput.</li>
</ul>

<p>Across all thresholds, the key pattern is that assigning a reasonable share of large objects to blob downloader raises aggregate disk write throughput, which in turn shortens total prefetch time. When the threshold is tuned so that genuinely large objects are handled via blob download, the cluster is able to drive the highest parallel writes across targets. In our setup, a threshold of about <strong>256 MiB</strong> achieved this balance, delivering a <strong>2.28×</strong> shorter prefetch duration than a pure monolithic cold GET of the same bucket.</p>

<h2 id="conclusion">Conclusion</h2>

<p>The key takeaway is simple: on real workloads with multi‑GiB objects, blob downloader reduces time to fetch large remote objects by up to <strong>4×</strong> in our benchmarks. It achieves this by driving much higher aggregate disk throughput than a single cold GET can sustain.</p>

<p>Benchmarks also show that performance is highly sensitive to the <code class="language-plaintext highlighter-rouge">--blob-threshold</code> setting: in our 1.56 TiB S3 bucket, a threshold around <strong>256 MiB</strong> maximized disk write throughput during the prefetch job. The ideal value in your deployment will depend on cluster configuration, network conditions, backend provider, and object size distribution, but there will almost always be a sweet spot where blob downloader is neither underutilized nor overused.</p>

<p>In practice, the guidance is simple: use a small benchmark to pick a reasonable threshold for your environment, and let blob downloader plus <code class="language-plaintext highlighter-rouge">load</code> advice handle the rest. Today, that choice is exposed as the <code class="language-plaintext highlighter-rouge">--blob-threshold</code> knob on prefetch jobs, while the <code class="language-plaintext highlighter-rouge">load</code> system ensures that even an aggressive setting won’t push targets into memory or disk overload. Longer term, the goal is to make this decision mostly internal — using observed object sizes and node load to engage blob downloader automatically — so most users can rely on sane defaults and only reach for explicit tuning when they really need it.</p>

<h2 id="references">References</h2>

<ul>
  <li><a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance-guidelines.html#optimizing-performance-guidelines-get-range">AWS S3 performance guidelines – byte-range / parallel downloads</a></li>
  <li><a href="https://cloud.google.com/blog/products/storage-data-transfer/improve-throughput-with-cloud-storage-client-libraries/">GCP Cloud Storage – improving throughput with client libraries</a></li>
  <li><a href="https://www.rfc-editor.org/rfc/rfc7233#section-2.1">HTTP Range Requests (RFC 7233)</a></li>
  <li><a href="https://github.com/NVIDIA/aistore/releases/tag/v1.4.0#chunked-objects">AIStore 4.0 release – chunked objects</a></li>
  <li><a href="https://github.com/NVIDIA/aistore/blob/main/docs/blob_downloader.md">AIStore Blob Downloader documentation</a></li>
</ul>]]></content><author><name>Tony Chen</name></author><category term="aistore" /><category term="mpd" /><category term="benchmark" /><category term="optimization" /><category term="enhancements" /><summary type="html"><![CDATA[In AIStore 4.1, we extended blob downloader to leverage the chunked object representation and speed up fetching remote objects. This design enables blob downloader to parallelize work across storage resources, yielding a substantial performance improvement for large-object retrieval.]]></summary></entry><entry><title type="html">GetBatch API: faster data retrieval for ML workloads</title><link href="https://aistore.nvidia.com/blog/2025/10/06/get-batch-sequential" rel="alternate" type="text/html" title="GetBatch API: faster data retrieval for ML workloads" /><published>2025-10-06T00:00:00+00:00</published><updated>2025-10-06T00:00:00+00:00</updated><id>https://aistore.nvidia.com/blog/2025/10/06/get-batch-sequential</id><content type="html" xml:base="https://aistore.nvidia.com/blog/2025/10/06/get-batch-sequential"><![CDATA[<p>ML training and inference typically operate on batches of samples or data items. To simplify such workflows, AIStore 4.0 introduces the <code class="language-plaintext highlighter-rouge">GetBatch</code> API.</p>

<p>The API returns a single ordered archive - TAR by default - containing the requested objects and/or sharded files.</p>

<p>A given <code class="language-plaintext highlighter-rouge">GetBatch</code> may specify any number of items and span any number of buckets.</p>

<p>From the caller’s perspective, each request behaves like a regular synchronous GET, but you can read multiple batches in parallel.</p>

<p>Inputs may mix plain objects with any of the four supported shard formats (.tar, .tgz/.tar.gz, .tar.lz4, .zip), and outputs can use the same formats (default: TAR).</p>

<p>Ordering is strict: ask for data items named <code class="language-plaintext highlighter-rouge">A, B, C</code> - and the resulting batch will contain <code class="language-plaintext highlighter-rouge">A</code>, then <code class="language-plaintext highlighter-rouge">B</code>, then <code class="language-plaintext highlighter-rouge">C</code>.</p>

<blockquote>
  <p>Items A, B, C, etc. can reference plain objects or sharded files, stored locally or in remote cloud buckets.</p>
</blockquote>
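<p>Because the payload is a standard archive, a client can consume a batch in that guaranteed order with just the standard library. A minimal sketch, assuming the batch has already been fetched as TAR bytes (via SDK or raw HTTP):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import io
import tarfile

def iter_batch(tar_bytes):
    # Entries arrive in exactly the requested order: A, then B, then C.
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:*") as tar:
        for member in tar:
            fobj = tar.extractfile(member)
            if fobj is not None:  # skip non-file entries
                yield member.name, fobj.read()
</code></pre></div></div>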

<p>Two delivery modes are available. The <strong>streaming</strong> path starts sending as the resulting payload is assembled. The <strong>multipart</strong> path returns two parts: a small JSON header (<code class="language-plaintext highlighter-rouge">apc.MossOut</code>) with per-item status and sizes, followed by the archive payload.</p>

<p>GetBatch provides the largest gains for small-to-medium object sizes, where it effectively amortizes TCP and connection-setup overheads across multiple requests. For larger objects, the overall performance improvement tapers off because data transfer time dominates total latency, making the per-request network overhead negligible in comparison.</p>

<p><img src="/assets/get-batch-sequential.png" alt="GetBatch: single-worker speed-up" /></p>

<p>Fig. 1. Up to 25x single-worker speed-up in early benchmarks.</p>

<p>The graph plots speed-up factor (Y-axis) against object size (X-axis), showing how batch size (<strong>100, 1K, 10K</strong> objects per batch) and object size affect performance. Each test used 10k objects on a 3-node AIStore cluster (48 CPUs, 187 GiB RAM, 10×9.1 TiB disks per node). The gains come from reducing per-request TCP overhead and parallelizing object fetches.</p>

<p>PS. Cluster-wide multi-worker benchmarks are in progress and will be shared soon.</p>]]></content><author><name>Abhishek Gaikwad</name></author><category term="aistore" /><category term="ml" /><category term="lhotse" /><category term="benchmark" /><category term="optimization" /><category term="enhancements" /><summary type="html"><![CDATA[ML training and inference typically operate on batches of samples or data items. To simplify such workflows, AIStore 4.0 introduces the GetBatch API.]]></summary></entry><entry><title type="html">Automated API Documentation Generation with GenDocs</title><link href="https://aistore.nvidia.com/blog/2025/08/29/automated-api-documentation-generation-with-gendocs" rel="alternate" type="text/html" title="Automated API Documentation Generation with GenDocs" /><published>2025-08-29T00:00:00+00:00</published><updated>2025-08-29T00:00:00+00:00</updated><id>https://aistore.nvidia.com/blog/2025/08/29/automated-api-documentation-generation-with-gendocs</id><content type="html" xml:base="https://aistore.nvidia.com/blog/2025/08/29/automated-api-documentation-generation-with-gendocs"><![CDATA[<h1 id="automated-api-documentation-generation-with-gendocs">Automated API Documentation Generation with GenDocs</h1>

<p>Maintaining accurate and up-to-date HTTP API documentation is critical for the developer experience when building and debugging SDKs. Clear HTTP documentation saves developers from digging through AIStore source code to understand expected endpoints, actions, query parameters, and request formats—whether implementing new features or troubleshooting issues in the SDK. With REST API endpoints spanning object management, cluster operations, ETL workflows, and administrative functions, manually maintaining this documentation quickly becomes a bottleneck that leads to inconsistencies and outdated information.</p>

<p>This is where <strong>GenDocs</strong> comes in—a powerful tool that automatically generates comprehensive <a href="https://spec.openapis.org/oas/latest.html">OpenAPI</a>/<a href="https://swagger.io/tools/swagger-ui/">Swagger</a> documentation directly from AIStore’s Go source code using descriptive annotation-based parsing.</p>

<p>GenDocs streamlines AIStore’s documentation workflow, eliminates manual maintenance overhead, and ensures that API documentation stays perfectly synchronized with the codebase as it evolves.</p>

<h2 id="the-challenge-scale-and-consistency">The Challenge: Scale and Consistency</h2>

<p>AIStore’s REST API surface is extensive, covering everything from basic object operations to complex multi-cloud data management and ETL transformations. Each endpoint requires documentation that includes:</p>

<ul>
  <li>HTTP methods and paths with parameter definitions</li>
  <li>Request/response schemas and examples</li>
  <li>Action-based operations with multiple model variants</li>
  <li>Interactive code samples and curl commands</li>
  <li>Proper categorization and cross-references</li>
</ul>

<p>Maintaining this manually across a rapidly evolving codebase presents several challenges:</p>

<ul>
  <li><strong>Synchronization Drift</strong>: Documentation inevitably falls behind code changes</li>
  <li><strong>Human Error</strong>: Manual updates are prone to inconsistencies and omissions</li>
  <li><strong>Developer Overhead</strong>: Engineers spend valuable time on documentation maintenance</li>
  <li><strong>Scalability</strong>: As the API grows, manual processes become increasingly unsustainable</li>
</ul>

<h2 id="the-gendocs-solution">The GenDocs Solution</h2>

<p>GenDocs solves these problems through <strong>annotation-driven documentation generation</strong>. Instead of maintaining separate documentation files, developers add lightweight annotations directly in the Go source code alongside their API handlers. GenDocs then parses these annotations to automatically generate comprehensive OpenAPI specifications which are rendered into a formatted website that developers can easily reference.</p>

<h3 id="core-design-principles">Core Design Principles</h3>

<ol>
  <li><strong>Developer-Friendly</strong>: Minimal annotation syntax that doesn’t clutter code</li>
  <li><strong>Source of Truth</strong>: Documentation lives alongside implementation code</li>
  <li><strong>Automatic Generation</strong>: Zero manual steps to update documentation</li>
  <li><strong>Universal format</strong>: Generates standard OpenAPI spec (YAML/JSON)</li>
</ol>

<h3 id="annotation-syntax">Annotation Syntax</h3>

<p>GenDocs uses a simple but powerful annotation format. Here’s how developers document an API endpoint:</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// +gen:endpoint GET /v1/buckets/{bucket-name}/objects/{object-name} [provider=string]</span>
<span class="c">// Retrieves an object from the specified bucket.</span>
<span class="c">// Supports streaming for large objects and conditional requests.</span>
<span class="k">func</span> <span class="n">GetObject</span><span class="p">(</span><span class="n">w</span> <span class="n">http</span><span class="o">.</span><span class="n">ResponseWriter</span><span class="p">,</span> <span class="n">r</span> <span class="o">*</span><span class="n">http</span><span class="o">.</span><span class="n">Request</span><span class="p">)</span> <span class="p">{</span>
    <span class="c">// implementation...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This single annotation automatically generates:</p>
<ul>
  <li>OpenAPI endpoint definition</li>
  <li>Parameter documentation</li>
  <li>HTTP examples with proper curl commands</li>
</ul>

<h3 id="advanced-features">Advanced Features</h3>

<h4 id="action-based-endpoints">Action-Based Endpoints</h4>

<p>Many AIStore endpoints support multiple operations through action parameters. In AIStore, an “action” is a JSON message in the request body that at minimum includes an <code class="language-plaintext highlighter-rouge">{"action":"..."}</code> string; some actions also carry a structured <code class="language-plaintext highlighter-rouge">value</code> field. The action constants (e.g. <code class="language-plaintext highlighter-rouge">apc.ActCopyBck</code>) map to the action string used in the body, and the associated model defines the <code class="language-plaintext highlighter-rouge">value</code>.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// +gen:endpoint PUT /v1/buckets/{bucket-name} action=[apc.ActCopyBck=apc.TCBMsg|apc.ActETLBck=apc.TCBMsg]</span>
<span class="c">// +gen:payload apc.ActCopyBck={"action": "copy-bck", "value": {"dry_run": false}}</span>
<span class="c">// +gen:payload apc.ActETLBck={"action": "etl-bck", "value": {"id": "ETL_NAME"}}</span>
<span class="c">// Administrative bucket operations including copy and ETL transformations.</span>
<span class="k">func</span> <span class="n">BucketHandler</span><span class="p">(</span><span class="n">w</span> <span class="n">http</span><span class="o">.</span><span class="n">ResponseWriter</span><span class="p">,</span> <span class="n">r</span> <span class="o">*</span><span class="n">http</span><span class="o">.</span><span class="n">Request</span><span class="p">)</span> <span class="p">{</span>
    <span class="c">// implementation...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This generates comprehensive documentation showing:</p>
<ul>
  <li>All supported actions and their models</li>
  <li>Complete JSON payload examples</li>
</ul>

<h4 id="automatic-model-discovery">Automatic Model Discovery</h4>

<p>GenDocs automatically discovers Go structs marked with <code class="language-plaintext highlighter-rouge">// swagger:model</code> and incorporates them into the API documentation:</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// swagger:model</span>
<span class="k">type</span> <span class="n">Transform</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="n">Name</span>     <span class="kt">string</span>       <span class="s">`json:"id,omitempty"`</span>
    <span class="n">Pipeline</span> <span class="p">[]</span><span class="kt">string</span>     <span class="s">`json:"pipeline,omitempty"`</span>
    <span class="n">Timeout</span>  <span class="n">cos</span><span class="o">.</span><span class="n">Duration</span> <span class="s">`json:"request_timeout,omitempty" swaggertype:"primitive,integer"`</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Note: <code class="language-plaintext highlighter-rouge">swaggertype</code> is only needed by swagger when mapping custom Go types (e.g., cos.Duration) to primitive types (e.g. <code class="language-plaintext highlighter-rouge">integer</code>) in the generated OpenAPI spec. Primitive fields like <code class="language-plaintext highlighter-rouge">string</code>, <code class="language-plaintext highlighter-rouge">int</code>, <code class="language-plaintext highlighter-rouge">bool</code>, etc. do not require it.</p>

<h4 id="intelligent-payload-generation">Intelligent Payload Generation</h4>

<p>For simple actions that only require an action name, GenDocs automatically generates basic payloads, reducing annotation overhead:</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// +gen:endpoint PUT /v1/cluster action=[apc.ActResetConfig=apc.ActMsg|apc.ActRotateLogs=apc.ActMsg]</span>
<span class="c">// These simple actions auto-generate: {"action": "reset-config"} and {"action": "rotate-logs"}</span>
</code></pre></div></div>

<h2 id="integration-with-aistores-workflow">Integration with AIStore’s Workflow</h2>

<p>GenDocs is seamlessly integrated into AIStore’s development workflow and CI pipeline:</p>

<h3 id="documentation-website-deployment-workflow">Documentation Website Deployment Workflow</h3>

<ol>
  <li><strong>Code Changes</strong>: Developers add/modify API endpoints with annotations</li>
  <li><strong>Local Testing</strong>: <code class="language-plaintext highlighter-rouge">make api-docs-website</code> generates documentation locally</li>
  <li><strong>CI Pipeline</strong>: GitHub Actions automatically regenerates docs on merge</li>
  <li><strong>Website Deployment</strong>: Updated documentation is deployed to the AIStore website</li>
</ol>

<h3 id="build-process">Build Process</h3>

<p><img src="/assets/gendocs/gendocs-workflow.png" alt="GenDocs Workflow" />
<em>Figure: GenDocs multi-phase pipeline transforming source code annotations into comprehensive API documentation</em></p>

<p>The documentation generation process is a multi-stage pipeline that transforms source code annotations into an OpenAPI specification and markdown.</p>

<p>(1) The process begins when GenDocs scans the entire AIStore codebase, discovering every <code class="language-plaintext highlighter-rouge">+gen:endpoint</code> annotation and building a complete inventory of API endpoints, parameters, and data models. (2) During this discovery phase, the tool also collects <code class="language-plaintext highlighter-rouge">+gen:payload</code> definitions and (3) action mappings that will be used to generate realistic examples.</p>

<p>(4) Once the scanning is complete, GenDocs transforms these annotations into standard Swagger comments that can be processed by the OpenAPI toolchain. This transformation includes generating operation IDs, parameter documentation, and request/response schemas for each endpoint.</p>

<p>(5) The OpenAPI specification is then generated using the Swagger tooling, producing both YAML and JSON formats that contain the complete API definition. However, the standard OpenAPI specification lacks some of the rich metadata that makes AIStore’s documentation particularly useful.</p>

<p>(6) This is where GenDocs’ vendor extension system comes into play. The tool injects AIStore-specific extensions into the OpenAPI specification, including action-to-model mappings and complete HTTP examples with curl commands. These extensions are what enable the interactive features and comprehensive examples in the final documentation.</p>

<p>(7) The final step involves converting the enhanced OpenAPI specification into markdown format using the OpenAPI Generator CLI with custom templates. (8) This produces the website-ready documentation that is integrated into AIStore’s Jekyll-based documentation site.</p>

<p>In practice, the CI pipeline runs this workflow automatically—developers only need to provide the GenDocs annotation syntax.</p>

<h3 id="user-experience-enhancements">User Experience Enhancements</h3>

<p>The auto-generated documentation provides users with:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Working curl examples for every endpoint</span>
curl <span class="nt">-i</span> <span class="nt">-L</span> <span class="nt">-X</span> PUT <span class="se">\</span>
  <span class="nt">-H</span> <span class="s1">'Content-Type: application/json'</span> <span class="se">\</span>
  <span class="nt">-d</span> <span class="s1">'{"action": "copy-bck", "value": {"dry_run": false}}'</span> <span class="se">\</span>
  <span class="s1">'AIS_ENDPOINT/v1/buckets/source-bucket'</span>
</code></pre></div></div>

<h2 id="technical-architecture-and-annotations">Technical Architecture and Annotations</h2>

<h3 id="parsing-engine">Parsing Engine</h3>

<p>Maintaining accurate API documentation is difficult when complex model structs are spread across the codebase. Manually discovering these structs and keeping cross-references between endpoints, actions, and data models in sync is time-consuming and error-prone.</p>

<p>To solve this, GenDocs uses Abstract Syntax Tree (AST) parsing to analyze the codebase directly. It automatically discovers model structs, builds a complete inventory of API models and their relationships, and maintains precise links between <code class="language-plaintext highlighter-rouge">+gen:endpoint</code> annotations and their corresponding handler functions.</p>

<p>A second challenge is flexibility: developers often want to place annotations close to the logic they describe, even if that means spreading them across multiple files. For example, a payload definition might live near a helper function rather than in the main endpoint file.</p>

<p>GenDocs uses a file walker that recursively scans the codebase, collecting every <code class="language-plaintext highlighter-rouge">+gen:payload</code> annotation. It then parses endpoints file-by-file, ensuring that payload definitions are correctly applied to their endpoints regardless of where they are declared.</p>
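<p>Conceptually, the payload-collection pass looks like the sketch below - written in Python for brevity (GenDocs itself is a Go tool), with a deliberately simplified regex:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import os
import re

# Matches lines such as:  // +gen:payload apc.ActCopyBck={"action": "copy-bck", ...}
PAYLOAD_RE = re.compile(r'^\s*//\s*\+gen:payload\s+(\S+?)=(.+)$')

def collect_payloads(root):
    # Recursively walk the tree, gathering every +gen:payload annotation
    # no matter which file declares it.
    payloads = {}
    for dirpath, _, files in os.walk(root):
        for name in files:
            if not name.endswith(".go"):
                continue
            with open(os.path.join(dirpath, name), encoding="utf-8") as f:
                for line in f:
                    m = PAYLOAD_RE.match(line)
                    if m:
                        payloads[m.group(1)] = m.group(2).strip()
    return payloads
</code></pre></div></div>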

<p>To further reduce drift, we adopted <a href="https://pkg.go.dev/cmd/go#hdr-Generate_Go_files_by_processing_source"><code class="language-plaintext highlighter-rouge">go generate</code></a>, since the primary goal was to keep documentation annotations in line with the code. Annotations live next to handlers and regeneration runs with builds, so the docs track the exact code state—no separate “docs repo,” less drift, and less context‑switching for developers.</p>

<p>To prevent annotations from becoming too verbose, we auto‑generate simple <code class="language-plaintext highlighter-rouge">{ "action":"..." }</code> payloads where possible. When an action takes a structured <code class="language-plaintext highlighter-rouge">value</code> or a <code class="language-plaintext highlighter-rouge">name</code>, we add a <code class="language-plaintext highlighter-rouge">+gen:payload</code>. S3‑compatible endpoints are the exception—they expect XML. For those, we point to an XML body via <code class="language-plaintext highlighter-rouge">payload=</code>, and the generator switches the <code class="language-plaintext highlighter-rouge">Content‑Type</code> automatically.</p>

<p>On the spec side, <a href="https://github.com/swaggo/swag">Swaggo</a> scans Go code and inline annotations and emits an OpenAPI document that feeds straight into the website pipeline. For custom wrappers (for example, <code class="language-plaintext highlighter-rouge">cos.Duration</code>), the <code class="language-plaintext highlighter-rouge">swaggertype</code> tag tells the generator how the field should appear in the spec, keeping models faithful to the API’s serialization.</p>

<h3 id="descriptive-comments">Descriptive Comments</h3>

<p>Right after a <code class="language-plaintext highlighter-rouge">+gen:endpoint</code> line, GenDocs reads the plain comment lines and encapsulates them into the endpoint’s summary. Those few sentences become the description on the website detailing what it does, why a developer would call it, and any guardrails (auth, permissions, size limits).</p>

<p>Separately, model struct fields can include Go comments alongside their JSON which become per‑field descriptions in the generated schema (e.g., allowed values, units, defaults). Keeping these comments close to the code ensures the final API docs reflect the intended behavior and field semantics without manual editing. In addition, the vendor extension framework enables injection of AIStore-specific metadata while maintaining full OpenAPI specification compliance.</p>

<h3 id="case-study-isolating-a-client-issue-with-a-direct-api-call">Case Study: Isolating a client issue with a direct API call</h3>
<p>A bucket deletion operation failed when invoked via CLI. To determine whether the issue was in the client or the AIStore cluster, the operation was executed directly using the documented HTTP example generated by GenDocs. The direct API call succeeded, confirming server behavior was correct and narrowing the problem to the CLI implementation. This illustrates how canonical HTTP examples enable developers to easily isolate client‑versus‑server issues, reduce time to root cause, and focus fixes on the right component.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl <span class="nt">-i</span> <span class="nt">-L</span> <span class="nt">-X</span> DELETE <span class="se">\</span>
  <span class="nt">-H</span> <span class="s1">'Content-Type: application/json'</span> <span class="se">\</span>
  <span class="nt">-d</span> <span class="s1">'{"action":"destroy-bck"}'</span> <span class="se">\</span>
  <span class="s1">'AIS_ENDPOINT/v1/buckets/BUCKET_NAME'</span>
</code></pre></div></div>

<h2 id="conclusion">Conclusion</h2>

<p>GenDocs is a shift in AIStore’s approach to API documentation—from manual 
maintenance to automated generation that scales with the codebase. By embedding documentation 
directly in source code through lightweight annotations, the tool eliminates synchronization 
issues while significantly improving documentation quality and the developer experience.</p>

<p>In practice, this yielded measurable benefits: comprehensive endpoint coverage, immediate updates with code changes, and consistent formatting across the API surface. Developers can focus on feature development rather than documentation maintenance, while users receive accurate, up‑to‑date documentation with working examples.</p>

<h2 id="references">References</h2>

<ul>
  <li><a href="https://github.com/NVIDIA/aistore/tree/main/tools/gendocs">GenDocs Tool Documentation</a></li>
  <li><a href="https://aistore.nvidia.com/docs/http-api">AIStore HTTP API Documentation</a></li>
  <li><a href="https://spec.openapis.org/oas/latest.html">OpenAPI Specification</a></li>
  <li><a href="https://github.com/NVIDIA/aistore">AIStore Repository</a></li>
  <li><a href="https://swagger.io/tools/swagger-ui/">Swagger UI Documentation</a></li>
</ul>]]></content><author><name>Anshika Ojha</name></author><category term="aistore" /><category term="tools" /><category term="documentation" /><category term="api" /><category term="swagger" /><category term="openapi" /><summary type="html"><![CDATA[Automated API Documentation Generation with GenDocs]]></summary></entry><entry><title type="html">AIStore + HuggingFace: Distributed Downloads for Large-Scale Machine Learning</title><link href="https://aistore.nvidia.com/blog/2025/08/22/huggingface-integration" rel="alternate" type="text/html" title="AIStore + HuggingFace: Distributed Downloads for Large-Scale Machine Learning" /><published>2025-08-22T00:00:00+00:00</published><updated>2025-08-22T00:00:00+00:00</updated><id>https://aistore.nvidia.com/blog/2025/08/22/huggingface-integration</id><content type="html" xml:base="https://aistore.nvidia.com/blog/2025/08/22/huggingface-integration"><![CDATA[<h1 id="aistore--huggingface-distributed-downloads-for-large-scale-machine-learning">AIStore + HuggingFace: Distributed Downloads for Large-Scale Machine Learning</h1>

<p>Machine learning teams increasingly rely on large datasets from <a href="https://huggingface.co/">HuggingFace</a> to power their models. But traditional download tools struggle with terabyte-scale datasets containing thousands of files, creating bottlenecks that slow development cycles.</p>

<p>This post introduces AIStore’s new HuggingFace download integration, which enables efficient downloads of large datasets with parallel batch jobs.</p>

<h2 id="table-of-contents">Table of contents</h2>
<ol>
  <li><a href="#background">Background</a></li>
  <li><a href="#cli-integration-simplified-workflows">CLI Integration: Simplified Workflows</a></li>
  <li><a href="#download-optimizations">Download Optimizations</a></li>
  <li><a href="#complete-walkthrough-nonverbaltts-dataset">Complete Walkthrough: NonverbalTTS Dataset</a></li>
  <li><a href="#next-steps">Next Steps</a></li>
  <li><a href="#conclusion">Conclusion</a></li>
</ol>

<h2 id="background">Background</h2>

<p>Sequential downloads create significant bottlenecks when dealing with complex datasets that have hundreds of thousands of files distributed across multiple directories.</p>

<p><a href="https://aistore.nvidia.com/">AIStore</a> addresses this by parallelizing downloads within each target using multiple workers (one per mountpath), batching jobs based on file size, and collecting file metadata in parallel. This approach leverages the network throughput from each individual target to the HuggingFace servers.</p>

<h2 id="cli-integration-simplified-workflows">CLI Integration: Simplified Workflows</h2>

<h3 id="prerequisites"><strong>Prerequisites</strong></h3>

<p>The following examples assume an active AIStore cluster. If the destination buckets (e.g., <code class="language-plaintext highlighter-rouge">ais://datasets</code>, <code class="language-plaintext highlighter-rouge">ais://models</code>) don’t exist, they will be created automatically with default properties.</p>

<p>AIStore’s <a href="https://aistore.nvidia.com/docs/cli">CLI</a> includes HuggingFace-specific flags for the <code class="language-plaintext highlighter-rouge">ais download</code> command that handle distributed operations behind the scenes.</p>

<h3 id="basic-download-commands"><strong>Basic Download Commands</strong></h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Download entire dataset</span>
<span class="nv">$ </span>ais download <span class="nt">--hf-dataset</span> squad ais://datasets/squad/

<span class="c"># Download entire model  </span>
<span class="nv">$ </span>ais download <span class="nt">--hf-model</span> bert-base-uncased ais://models/bert/

<span class="c"># Download specific file</span>
<span class="nv">$ </span>ais download <span class="nt">--hf-dataset</span> squad <span class="nt">--hf-file</span> train/0.parquet ais://datasets/squad/
</code></pre></div></div>

<h3 id="authentication-and-configuration"><strong>Authentication and Configuration</strong></h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Export your HuggingFace token and use for private/gated content</span>
<span class="nv">$ </span><span class="nb">export </span><span class="nv">HF_TOKEN</span><span class="o">=</span>your_hf_token_here
<span class="nv">$ </span>ais download <span class="nt">--hf-dataset</span> private-dataset <span class="nt">--hf-auth</span> <span class="nv">$HF_TOKEN</span> ais://private-data/

<span class="c"># Control batching with blob threshold</span>
<span class="nv">$ </span>ais download <span class="nt">--hf-dataset</span> large-dataset <span class="nt">--blob-threshold</span> 200MB ais://datasets/large/
</code></pre></div></div>

<h3 id="progress-monitoring"><strong>Progress Monitoring</strong></h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Real-time progress tracking</span>
<span class="nv">$ </span>ais show job <span class="nt">--refresh</span> 2s

<span class="c"># Detailed job information</span>
<span class="nv">$ </span>ais show job download <span class="nt">--verbose</span>
</code></pre></div></div>

<h2 id="download-optimizations">Download Optimizations</h2>

<p>The system uses several key techniques to improve download performance:</p>

<h3 id="job-batching-size-based-distribution"><strong>Job Batching: Size-Based Distribution</strong></h3>
<p>Job batching categorizes files based on configurable size thresholds:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Configure blob threshold for job batching</span>
<span class="nv">$ </span>ais download <span class="nt">--hf-dataset</span> squad <span class="nt">--blob-threshold</span> 100MB ais://ml-datasets/
</code></pre></div></div>

<p>Files are categorized into two groups:</p>
<ul>
  <li><strong>Large files</strong> (above blob threshold): Get individual download jobs for maximum parallelism</li>
  <li><strong>Small files</strong> (below threshold): Batched together to reduce overhead</li>
</ul>

<p><img src="/assets/huggingface-integration/job-batching-diagram.png" alt="Job Batching Diagram" />
<em>Figure: How AIStore batches files based on size threshold (100MB in this example)</em></p>
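
<p>The split itself is easy to sketch. A minimal Python illustration (the real logic lives in the AIStore downloader; file names and sizes here are made up):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>BLOB_THRESHOLD = 100 * 1024 * 1024  # 100MB, matching the example above

def split_by_threshold(files, threshold=BLOB_THRESHOLD):
    """Large files get individual download jobs; small files are batched."""
    large = [f for f in files if f["size"] &gt;= threshold]
    small = [f for f in files if f["size"] &lt; threshold]
    return large, small

files = [
    {"name": "train/0.parquet", "size": 512 * 1024 * 1024},
    {"name": "README.md", "size": 11 * 1024},
]
large, small = split_by_threshold(files)
# one download job per entry in large; all of small goes into one batch
</code></pre></div></div>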

<h3 id="concurrent-metadata-collection"><strong>Concurrent Metadata Collection</strong></h3>
<p>Before downloading files, AIStore makes parallel HEAD requests to the HuggingFace API to collect file metadata (like file sizes) concurrently rather than sequentially. This reduces setup time for datasets with many files.</p>
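
<p>The idea is straightforward to sketch. The Python snippet below assumes the standard <code class="language-plaintext highlighter-rouge">requests</code> library and HuggingFace’s <code class="language-plaintext highlighter-rouge">resolve</code> URL scheme; the dataset and file names are hypothetical:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from concurrent.futures import ThreadPoolExecutor
import requests

BASE = "https://huggingface.co/datasets/squad/resolve/main/"

def head_size(filename):
    # follow the CDN redirect to reach the actual object
    resp = requests.head(BASE + filename, allow_redirects=True, timeout=10)
    return filename, int(resp.headers.get("Content-Length", 0))

filenames = ["train/0.parquet", "validation/0.parquet"]
with ThreadPoolExecutor(max_workers=16) as pool:
    sizes = dict(pool.map(head_size, filenames))
# sizes feeds the size-based job batching described above
</code></pre></div></div>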

<h2 id="complete-walkthrough-nonverbaltts-dataset">Complete Walkthrough: NonverbalTTS Dataset</h2>

<p>Let’s walk through an example downloading a machine learning dataset and processing it with ETL operations:</p>

<h3 id="walkthrough-prerequisites"><strong>Walkthrough Prerequisites</strong></h3>

<p>For this walkthrough, we’ll create and use three buckets:</p>
<ul>
  <li><code class="language-plaintext highlighter-rouge">ais://deepvs</code> - for the initial dataset download</li>
  <li><code class="language-plaintext highlighter-rouge">ais://ml-dataset</code> - for ETL-processed files</li>
  <li><code class="language-plaintext highlighter-rouge">ais://ml-dataset-parsed</code> - for the final parsed dataset</li>
</ul>

<p>If these buckets don’t exist, they will be created automatically with default properties.</p>

<h3 id="step-1-download-dataset-with-configurable-job-batching"><strong>Step 1: Download Dataset with Configurable Job Batching</strong></h3>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Download deepvk/NonverbalTTS dataset with job batching</span>
<span class="nv">$ </span>ais download <span class="nt">--hf-dataset</span> deepvk/NonverbalTTS ais://deepvs <span class="nt">--blob-threshold</span> 500MB <span class="nt">--max-conns</span> 5
Found 11 parquet files <span class="k">in </span>dataset <span class="s1">'deepvk/NonverbalTTS'</span>
Created 7 individual <span class="nb">jobs </span><span class="k">for </span>files <span class="o">&gt;=</span> 500MiB
Started download job dnl-B-oOHruKH9
To monitor the progress, run <span class="s1">'ais show job dnl-B-oOHruKH9 --progress'</span>
</code></pre></div></div>

<h3 id="step-2-monitor-distributed-job-execution"><strong>Step 2: Monitor Distributed Job Execution</strong></h3>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Watch configurable job distribution across cluster targets</span>
<span class="nv">$ </span>ais show job
download <span class="nb">jobs  
</span>JOB ID           XACTION         STATUS          ERRORS  DESCRIPTION
dnl-B-oOHruKH9   D6JOGa7PH9      1 pending       0       multi-download -&gt; ais://deepvs
dnl-zoOHr7PG3    D6JOGa7PH9      1 pending       0       https://huggingface.co/api/datasets/deepvk/NonverbalTTS/parquet/default/other/0.parquet -&gt; ais://deepvs/0.parquet
dnl-oJOHruKG3    D6JOGa7PH9      1 pending       0       https://huggingface.co/api/datasets/deepvk/NonverbalTTS/parquet/default/train/1.parquet -&gt; ais://deepvs/1.parquet
dnl-F_ogHauKH9   D6JOGa7PH9      1 pending       0       https://huggingface.co/api/datasets/deepvk/NonverbalTTS/parquet/default/train/2.parquet -&gt; ais://deepvs/2.parquet
dnl-PoOHr7KG9    D6JOGa7PH9      1 pending       0       https://huggingface.co/api/datasets/deepvk/NonverbalTTS/parquet/default/train/3.parquet -&gt; ais://deepvs/3.parquet
....
</code></pre></div></div>
<h3 id="step-3-verify-download-completion"><strong>Step 3: Verify Download Completion</strong></h3>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Check bucket summary after download</span>
<span class="nv">$ </span>ais <span class="nb">ls </span>ais://deepvs <span class="nt">--summary</span>
NAME             PRESENT         OBJECTS         SIZE <span class="o">(</span>apparent, objects, remote<span class="o">)</span>        USAGE<span class="o">(</span>%<span class="o">)</span>
ais://deepvs     <span class="nb">yes             </span>6 0             2.76GiB 2.76GiB 0B                      0%
</code></pre></div></div>

<h3 id="options-for-using-downloaded-data"><strong>Options for Using Downloaded Data</strong></h3>

<p>At this point, you have several options:</p>

<ol>
  <li><strong>Use directly</strong>: Work with the downloaded files as-is if they meet your requirements</li>
  <li><strong>Transform with ETL</strong>: Apply preprocessing for format conversion, file organization, or data standardization</li>
  <li><strong>Custom processing</strong>: Use your own tools for data preparation</li>
</ol>

<p><strong>Why transform?</strong> HuggingFace datasets often have complex paths or formats that benefit from standardization. This walkthrough demonstrates ETL transformations for file organization (consistent naming) and format conversion (Parquet → JSON for framework compatibility).</p>
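
<p>Conceptually, the Parquet-to-JSON step of this walkthrough boils down to a format conversion like the following local-file sketch with <code class="language-plaintext highlighter-rouge">pandas</code> (the actual transformer runs inside an ETL container on the cluster):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import pandas as pd  # requires a Parquet engine such as pyarrow

# Read one Parquet shard and re-serialize it as a list of JSON records.
df = pd.read_parquet("train_0.parquet")
df.to_json("train_0.json", orient="records")
</code></pre></div></div>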

<h3 id="step-4-initialize-etl-transformers"><strong>Step 4: Initialize ETL Transformers</strong></h3>

<blockquote>
  <p><strong>Note:</strong> ETL operations require AIStore to be deployed on Kubernetes. See <a href="https://github.com/NVIDIA/aistore/blob/main/docs/etl.md">ETL documentation</a> for deployment requirements and setup instructions.</p>
</blockquote>

<p>Before applying transformations, initialize the required ETL containers:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Initialize batch-rename ETL transformer for file organization</span>
<span class="nv">$ </span>ais etl init <span class="nt">-f</span> https://raw.githubusercontent.com/NVIDIA/ais-etl/main/transformers/batch_rename/etl_spec.yaml

<span class="c"># Initialize parquet-parser ETL transformer for data parsing</span>
<span class="nv">$ </span>ais etl init <span class="nt">-f</span> https://raw.githubusercontent.com/NVIDIA/ais-etl/main/transformers/parquet-parser/etl_spec.yaml

<span class="c"># Verify ETL transformers are running</span>
<span class="nv">$ </span>ais etl show
</code></pre></div></div>

<h3 id="step-5-preprocessing-using-etl"><strong>Step 5: Preprocessing using ETL</strong></h3>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Organize and rename files using batch rename ETL</span>
<span class="nv">$ </span>ais etl bucket batch-rename-etl ais://deepvs ais://ml-dataset
etl-bucket[BatchRename] ais://deepvs <span class="o">=&gt;</span> ais://ml-dataset

<span class="c"># Verify renamed files with structured naming</span>
<span class="nv">$ </span>ais <span class="nb">ls </span>ais://ml-dataset/
NAME                        SIZE            
train_0.parquet             485MiB          
train_1.parquet             492MiB          
train_2.parquet             511MiB          
...
</code></pre></div></div>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Convert parquet files to JSON format for easier ML framework integration</span>
<span class="nv">$ </span>ais etl bucket parquet-parser-etl ais://ml-dataset ais://ml-dataset-parsed
etl-bucket[xO_sVT3Im] ais://ml-dataset <span class="o">=&gt;</span> ais://ml-dataset-parsed

<span class="c"># Verify processed dataset ready for ML training</span>
<span class="nv">$ </span>ais <span class="nb">ls </span>ais://ml-dataset-parsed <span class="nt">--summary</span>
NAME                         PRESENT         OBJECTS         SIZE <span class="o">(</span>apparent, objects, remote<span class="o">)</span>        USAGE<span class="o">(</span>%<span class="o">)</span>
ais://ml-dataset-parsed      <span class="nb">yes             </span>7 0             8.68GiB 8.68GiB 0B                      1%
</code></pre></div></div>

<h3 id="step-6-ml-pipeline-integration"><strong>Step 6: ML Pipeline Integration</strong></h3>

<p>AIStore integrates seamlessly with popular ML frameworks. Here’s how to use the processed dataset in your training pipeline:</p>

<h4 id="option-a-direct-sdk-usage-simple"><strong>Option A: Direct SDK Usage (Simple)</strong></h4>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="n">aistore.sdk</span> <span class="kn">import</span> <span class="n">Client</span>
<span class="kn">import</span> <span class="n">json</span>

<span class="n">client</span> <span class="o">=</span> <span class="nc">Client</span><span class="p">(</span><span class="sh">"</span><span class="s">http://localhost:51080</span><span class="sh">"</span><span class="p">)</span>
<span class="n">bucket</span> <span class="o">=</span> <span class="n">client</span><span class="p">.</span><span class="nf">bucket</span><span class="p">(</span><span class="sh">"</span><span class="s">ml-dataset-parsed</span><span class="sh">"</span><span class="p">)</span>

<span class="c1"># Load processed training data
</span><span class="k">for</span> <span class="n">obj</span> <span class="ow">in</span> <span class="n">bucket</span><span class="p">.</span><span class="nf">list_objects</span><span class="p">():</span>
    <span class="k">if</span> <span class="n">obj</span><span class="p">.</span><span class="n">name</span><span class="p">.</span><span class="nf">startswith</span><span class="p">(</span><span class="sh">"</span><span class="s">train_</span><span class="sh">"</span><span class="p">):</span>
        <span class="n">data</span> <span class="o">=</span> <span class="n">json</span><span class="p">.</span><span class="nf">loads</span><span class="p">(</span><span class="n">obj</span><span class="p">.</span><span class="nf">get_reader</span><span class="p">().</span><span class="nf">read_all</span><span class="p">())</span>
        <span class="c1"># Process individual training samples
</span>        <span class="k">for</span> <span class="n">sample</span> <span class="ow">in</span> <span class="n">data</span><span class="p">:</span>
            <span class="c1"># Your training logic here
</span>            <span class="k">pass</span>
</code></pre></div></div>

<h4 id="option-b-pytorch-integration-recommended-for-ml-training"><strong>Option B: PyTorch Integration (Recommended for ML Training)</strong></h4>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="n">aistore.sdk</span> <span class="kn">import</span> <span class="n">Client</span>
<span class="kn">from</span> <span class="n">aistore.pytorch</span> <span class="kn">import</span> <span class="n">AISIterDataset</span>
<span class="kn">from</span> <span class="n">torch.utils.data</span> <span class="kn">import</span> <span class="n">DataLoader</span>
<span class="kn">import</span> <span class="n">json</span>

<span class="c1"># Create dataset that reads directly from the cluster
</span><span class="n">client</span> <span class="o">=</span> <span class="nc">Client</span><span class="p">(</span><span class="sh">"</span><span class="s">http://localhost:51080</span><span class="sh">"</span><span class="p">)</span>
<span class="n">dataset</span> <span class="o">=</span> <span class="nc">AISIterDataset</span><span class="p">(</span><span class="n">ais_source_list</span><span class="o">=</span><span class="n">client</span><span class="p">.</span><span class="nf">bucket</span><span class="p">(</span><span class="sh">"</span><span class="s">ml-dataset-parsed</span><span class="sh">"</span><span class="p">))</span>

<span class="c1"># Configure DataLoader with multiprocessing
</span><span class="n">loader</span> <span class="o">=</span> <span class="nc">DataLoader</span><span class="p">(</span>
    <span class="n">dataset</span><span class="p">,</span>
    <span class="n">batch_size</span><span class="o">=</span><span class="mi">32</span><span class="p">,</span>
    <span class="n">num_workers</span><span class="o">=</span><span class="mi">4</span><span class="p">,</span>  <span class="c1"># Parallel data loading across multiple cores
</span><span class="p">)</span>

<span class="c1"># Training loop
</span><span class="k">for</span> <span class="n">batch_names</span><span class="p">,</span> <span class="n">batch_data</span> <span class="ow">in</span> <span class="n">loader</span><span class="p">:</span>
    <span class="c1"># Parse JSON data
</span>    <span class="n">parsed_samples</span> <span class="o">=</span> <span class="p">[</span><span class="n">json</span><span class="p">.</span><span class="nf">loads</span><span class="p">(</span><span class="n">data</span><span class="p">)</span> <span class="k">for</span> <span class="n">data</span> <span class="ow">in</span> <span class="n">batch_data</span><span class="p">]</span>
    
    <span class="c1"># Convert to tensors and train your model
</span>    <span class="c1"># model.train_step(parsed_samples)
</span>    <span class="k">pass</span>
</code></pre></div></div>

<h2 id="next-steps">Next Steps</h2>

<p>The HuggingFace integration opens up some practical areas for expansion:</p>

<p><strong>Download and Transform API</strong>: AIStore supports combining download and ETL transformation in a single API call, eliminating the two-step process shown in the walkthrough. This allows downloading HuggingFace datasets with immediate transformation (e.g., Parquet → JSON) in one operation. CLI integration for this functionality is in development.</p>

<p><strong>Additional Dataset Formats</strong>: Beyond the current Parquet support, HuggingFace datasets are available in multiple formats that teams commonly need:</p>
<ul>
  <li><strong>JSON format</strong> - Direct JSON downloads for frameworks requiring this format</li>
  <li><strong>CSV format</strong> - For traditional data processing workflows</li>
  <li><strong>WebDataset format</strong> - For large-scale ML pipelines using WebDataset</li>
</ul>

<h2 id="conclusion">Conclusion</h2>

<p>AIStore’s HuggingFace integration addresses common dataset download bottlenecks in machine learning workflows. Job batching and concurrent metadata collection enable efficient, <strong>parallel</strong> downloads of terabyte-scale datasets that would otherwise overwhelm traditional tools. Once stored in AIStore, teams can leverage local ETL operations to transform and prepare data without additional network transfers. This approach provides a streamlined path from raw downloads to training-ready datasets, eliminating the typical download-wait-process cycle that slows ML development.</p>

<hr />

<h2 id="references">References:</h2>

<p><strong>AIStore Core Documentation</strong></p>
<ul>
  <li><a href="https://github.com/NVIDIA/aistore">AIStore GitHub</a></li>
  <li><a href="https://aistore.nvidia.com/blog">AIStore Blog</a></li>
  <li><a href="https://github.com/NVIDIA/aistore/blob/main/docs/cli/download.md">AIStore Downloader Documentation</a></li>
  <li><a href="https://github.com/NVIDIA/aistore/tree/main/python/aistore/sdk">AIStore Python SDK</a></li>
  <li><a href="https://aistore.nvidia.com/blog/2024/08/28/pytorch-integration">AIStore PyTorch Integration</a> - High-performance data loading for ML training</li>
</ul>

<p><strong>ETL (Extract, Transform, Load) Resources</strong></p>
<ul>
  <li><a href="https://github.com/NVIDIA/aistore/blob/main/docs/etl.md">ETL Documentation</a> - Comprehensive guide to AIStore ETL capabilities and Kubernetes deployment</li>
  <li><a href="https://github.com/NVIDIA/aistore/blob/main/docs/cli/etl.md">ETL CLI Reference</a> - Command-line interface for ETL operations</li>
  <li><a href="https://github.com/NVIDIA/ais-etl/tree/main/transformers/batch_rename">Batch-Rename Transformer</a> - File organization and renaming</li>
  <li><a href="https://github.com/NVIDIA/ais-etl/tree/main/transformers/parquet-parser">Parquet Parser Transformer</a> - Parquet to JSON conversion</li>
  <li><a href="https://github.com/NVIDIA/ais-k8s">AIStore Kubernetes Deployment</a> - Production Kubernetes deployment tools and documentation</li>
</ul>

<p><strong>External Resources</strong></p>
<ul>
  <li><a href="https://huggingface.co/docs">HuggingFace Documentation</a></li>
  <li><a href="https://huggingface.co/docs/datasets/en/package_reference/main_classes">HuggingFace Datasets API Reference</a></li>
  <li><a href="https://parquet.apache.org/docs/overview/">Apache Parquet Format Specification</a></li>
</ul>]]></content><author><name>Nihal Nooney</name></author><category term="aistore" /><category term="huggingface" /><category term="machine-learning" /><category term="datasets" /><category term="cli" /><category term="performance" /><summary type="html"><![CDATA[AIStore + HuggingFace: Distributed Downloads for Large-Scale Machine Learning]]></summary></entry><entry><title type="html">The Perfect Line</title><link href="https://aistore.nvidia.com/blog/2025/07/26/smooth-max-line-speed" rel="alternate" type="text/html" title="The Perfect Line" /><published>2025-07-26T00:00:00+00:00</published><updated>2025-07-26T00:00:00+00:00</updated><id>https://aistore.nvidia.com/blog/2025/07/26/smooth-max-line-speed</id><content type="html" xml:base="https://aistore.nvidia.com/blog/2025/07/26/smooth-max-line-speed"><![CDATA[<p>I didn’t want to write this blog.</p>

<p>AIStore’s performance and scale-out story dates back to at least 2020, when we first presented our work at the IEEE Big Data Conference (<a href="https://arxiv.org/pdf/2001.01858">arxiv:2001.01858</a>). The linear scalability story was told and re-told, and the point was made. And so I really did not want to talk about it any longer.</p>

<p>But something has changed with our latest <a href="https://github.com/NVIDIA/aistore/releases/tag/v1.3.30">v3.30</a> benchmarks on a 16-node OCI cluster:</p>

<p><img src="/assets/smooth-max-line/cluster-throughput.png" alt="Aggregated cluster throughput" />
<strong>Fig. 1.</strong> Aggregated cluster throughput.</p>

<p>That’s a 100% random read workload at <code class="language-plaintext highlighter-rouge">10MiB</code> transfer size from an <code class="language-plaintext highlighter-rouge">87TB</code> dataset, with <code class="language-plaintext highlighter-rouge">1536 GB</code> RAM on each storage node (ensuring the data is served from disks).</p>

<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">$</span><span class="w"> </span>ais <span class="nb">ls </span>ais://ais-bench-10mb <span class="nt">--summary</span>
<span class="go">NAME                     PRESENT     OBJECTS     SIZE (apparent, objects, remote)    USAGE(%)
ais://ais-bench-10mb     yes         8421329 0   80.31TiB 80.31TiB 0B                7%
</span></code></pre></div></div>

<h2 id="the-theoretical-limit">The Theoretical Limit</h2>

<p>When we talk about the raw power of our 16-node cluster, each node equipped with a 100Gbps NIC, the numbers are impressive: 186 GiB/s.</p>

<blockquote>
  <p>Quick math. Per‑node link speed: 100Gbps = 12.5 GB/s ≈ 11.64 GiB/s. Cluster aggregate (16 nodes): 11.64 GiB/s × 16 ≈ 186 GiB/s</p>
</blockquote>

<p>This is the sheer, unadulterated, theoretical maximum aggregate throughput if every single bit could fly across the network without any processing, any protocol, or any pause. It represents the absolute ceiling of what the hardware could achieve under those idealized conditions.</p>

<p>However, in the real world, data doesn’t just teleport. It needs to be packaged, routed, error-checked, and processed by the operating system and applications. This is where the networking stack overhead comes into play. Think of it like the ‘friction’ or ‘tax’ on the raw bandwidth.</p>

<p>This overhead isn’t just one thing; it’s a stack of small but measurable costs:</p>

<ul>
  <li>L3/L4 protocol headers: IPv4 + TCP add a minimum of 40B (52B with common SACK/TSopt options). This is likely the smallest cost, especially with jumbo frames, and also because LRO/GRO reduce the number of packet headers the host sees (by coalescing them).</li>
  <li>TLS handshake, TLS 5-byte headers, and TLS encryption (if HTTPS is used).</li>
  <li>Memory copies: the default TCP path copies payload once into kernel space and once via DMA.</li>
  <li>Context‑switch overhead (syscalls + IRQs).</li>
</ul>
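
<p>For perspective: 52B of TCP/IP headers per standard 1500B frame is roughly 3.5% overhead, while with 9000B jumbo frames it drops to about 0.6%, which is why headers are the smallest item on the list above.</p>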

<blockquote>
  <p>More about context switching: consider read-only HTTP traffic (no <code class="language-plaintext highlighter-rouge">sendfile</code>) in which the server (like AIStore) transmits large payloads — large enough to utilize reusable 128K buffers. In other words, the Tx path and a standard <code class="language-plaintext highlighter-rouge">io.CopyBuffer</code> at 128 KiB chunks. Each iteration performs two syscalls – <code class="language-plaintext highlighter-rouge">read(2)</code> on the local XFS file and <code class="language-plaintext highlighter-rouge">write(2)</code> on the socket – and therefore 4 (four) context switches (user =&gt; kernel and back for each call). Unlike <code class="language-plaintext highlighter-rouge">sendfile(2)</code>, this path touches userland twice: kernel =&gt; Go slice on <code class="language-plaintext highlighter-rouge">read()</code>, then Go slice =&gt; kernel on <code class="language-plaintext highlighter-rouge">write()</code>. At full network speed, those two extra copies of an ~11.64 GiB/s stream add another ~23 GiB/s of DRAM traffic.</p>
</blockquote>

<p>Long story short, the actual achievable throughput is always lower due to various networking (and non-networking) overheads. The realistic percentage, bounded of course by the physical link, is highly contingent on the entire software stack and underlying infrastructure.</p>

<p>Industry sources typically cite the <strong>85-95%</strong> range as the realistic maximum efficiency for high-speed Ethernet (on this cluster, roughly 158-177 GiB/s). Generally, 85% is considered <em>very good</em> while 95% is <em>exceptional</em> to the point of being almost unachievable.</p>

<h2 id="95">95%</h2>

<p>The observed performance is what ultimately prompted me to reconsider the blog. As the monitoring graphs clearly show, our AIStore v3.30 cluster consistently achieves a sustained GET throughput with a mean of 175 GiB/s, frequently hitting peaks of 177 GiB/s for extended periods.</p>

<p><img src="/assets/smooth-max-line/node-throughput-times-16.png" alt="Node throughput (16 nodes)" />
<strong>Fig. 2.</strong> Node throughput (16 nodes).</p>

<p><img src="/assets/smooth-max-line/disk-utilization-times-16.png" alt="Disk (min, avg, max) utilizations (16 nodes)" />
<strong>Fig. 3.</strong> Disk (min, avg, max) utilizations (16 nodes).</p>

<blockquote>
  <p>As an aside, the disk utilization numbers might serve as a hint for OCI to consider adding another 100G port.</p>
</blockquote>

<p>This means we are effectively operating at <strong>95%</strong> of the theoretical maximum raw network capacity (177 ÷ 186 ≈ 95%) — exceeding what most industry sources consider the practical ceiling. But the numbers tell only part of the story. What really stands out is the consistency:</p>

<ul>
  <li>Time variance: under 2% during sustained runs</li>
  <li>Node variance: under 3% spread across all 16 nodes</li>
  <li>Disk utilization: a rock-steady 55–57% across all 192 NVMe drives</li>
  <li>Workload distribution: each node contributing roughly 11 GiB/s</li>
</ul>

<p>In short, the graphs show something you rarely encounter in practice: a distributed system operating at the physical limits of the underlying infrastructure.</p>

<p>Not bad for a “boring” benchmark that’s just a straight line.</p>]]></content><author><name>Alex Aizman</name></author><category term="aistore" /><category term="100GE" /><category term="line-rate" /><category term="linear-scalability" /><summary type="html"><![CDATA[I didn’t want to write this blog.]]></summary></entry><entry><title type="html">Single-Object Copy/Transform Capability</title><link href="https://aistore.nvidia.com/blog/2025/07/25/single-object-copy-transformation-capability" rel="alternate" type="text/html" title="Single-Object Copy/Transform Capability" /><published>2025-07-25T00:00:00+00:00</published><updated>2025-07-25T00:00:00+00:00</updated><id>https://aistore.nvidia.com/blog/2025/07/25/single-object-copy-transformation-capability</id><content type="html" xml:base="https://aistore.nvidia.com/blog/2025/07/25/single-object-copy-transformation-capability"><![CDATA[<h1 id="single-object-copytransform-capability">Single-Object Copy/Transform Capability</h1>

<p>In version 3.30, AIStore introduced a lightweight, flexible API to copy or transform a single object between buckets. It provides a simpler alternative to existing batch-style operations, ideal for fast, one-off object copy or transformation without the overhead of a full-scale job.</p>

<p>Notably, both the source and destination can be the local AIStore cluster or any of its <a href="https://github.com/NVIDIA/aistore/blob/main/docs/images/supported-backends.png">remote backends</a> (e.g., <code class="language-plaintext highlighter-rouge">s3://src/a</code> =&gt; <code class="language-plaintext highlighter-rouge">gs://dest/b</code>), making this feature especially useful for ad-hoc workflows and lightweight data preparation.</p>

<p>In this post, we’ll walk through the design and internal workflow that make this capability possible. We’ll also demonstrate how to use it with various supported clients, and compare it with existing copy mechanisms in AIStore to help you choose the right one for your use case.</p>

<h2 id="features-highlight">Features Highlight</h2>

<p>AIStore supports a variety of copy-object features, including <a href="/docs/cli/bucket.md#copy-cloud-bucket-to-another-cloud-bucket">bucket copy</a> and <a href="/docs/cli/bucket.md#copy-list-range-andor-prefix-selected-objects-or-entire-in-cluster-or-remote-buckets">multi-object copy</a>. However, these operations are designed as batch jobs that involve a more complex setup across the cluster to ensure all storage targets are ready and connected. While cluster-wide coordination ensures jobs can be executed or aborted seamlessly, it introduces noticeable overhead upfront — an unnecessary cost when the operation doesn’t require participation from all storage targets.</p>

<p>In contrast, the newly introduced single-object copy operation takes a simpler and more lightweight approach. It directly transfers the object from the source to the destination target in a single, synchronous step, bypassing the need for cluster-wide coordination and setup.</p>

<p>This direct transmission also takes the client out of the data path entirely. Unlike a GET-and-PUT sequence, the client never needs to fetch or upload the object: the data moves entirely within the cluster, directly from the source to the destination target, and all the client does is send the command. This becomes especially beneficial as object size increases, reducing client-side overhead and network usage.</p>

<p><img src="/assets/copy_object_diagram.png" alt="Copy Object Diagram" /></p>

<p>Additionally, the single-object copy workflow integrates seamlessly with ETL transformations. When an ETL is specified in the request’s parameter, the source target streams the object bytes to a local ETL container for transformation. Once processed, the transformed bytes are forwarded directly to the destination target — again, without routing through the client.</p>

<blockquote>
  <p>For more details on the direct put optimization, please refer to <a href="/docs/etl.md#direct-put-optimization">this documentation</a>.</p>
</blockquote>

<h2 id="usage">Usage</h2>

<h3 id="aistore-cli">AIStore CLI</h3>

<p>Here’s a quick example of how to use the single-object copy feature with the CLI:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Upload a local file to the source bucket
$ ais object put README.md ais://src/aaa
PUT "README.md" =&gt; ais://src/aaa

# Copy the object from AIStore to a GCP bucket
$ ais object cp ais://src/aaa gs://dest/bbb
COPY ais://src/aaa =&gt; gs://dest/bbb

# Download and verify the copied object
$ ais object get gs://dest/bbb
GET bbb from gs://dest as bbb (11.24KiB)
</code></pre></div></div>

<p>Using the feature with ETL transformation is just as straightforward. It follows the standard <code class="language-plaintext highlighter-rouge">ais etl</code> command pattern: specify the subcommand (<code class="language-plaintext highlighter-rouge">object</code>), provide the ETL name, and pass in the arguments.
Here’s an example that computes an object’s MD5 hash via single-object transformation:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Initialize an ETL transformer to compute MD5 hash values
$ ais etl init --name md5-etl -f https://raw.githubusercontent.com/NVIDIA/ais-etl/main/transformers/md5/etl_spec.yaml

# Perform a single-object transformation
$ ais etl object md5-etl cp ais://src/aaa ais://dest/bbb
ETL[md5-etl]: ais://src/aaa =&gt; ais://dest/bbb

# Retrieve the transformed object (MD5 hash value)
$ ais object get ais://dest/bbb -
# MD5 hash value of the original object "ais://src/aaa"
</code></pre></div></div>

<h3 id="aistore-python-sdk">AIStore Python SDK</h3>

<p>The <a href="/docs/python_sdk.md">Python SDK</a> provides an intuitive interface for using the single-object copy API.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Create source and destination buckets
</span><span class="n">src_bck</span> <span class="o">=</span> <span class="n">client</span><span class="p">.</span><span class="nf">bucket</span><span class="p">(</span><span class="sh">"</span><span class="s">src</span><span class="sh">"</span><span class="p">).</span><span class="nf">create</span><span class="p">(</span><span class="n">exist_ok</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">dest_bck</span> <span class="o">=</span> <span class="n">client</span><span class="p">.</span><span class="nf">bucket</span><span class="p">(</span><span class="sh">"</span><span class="s">dest</span><span class="sh">"</span><span class="p">).</span><span class="nf">create</span><span class="p">(</span><span class="n">exist_ok</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

<span class="c1"># Upload an object to the source bucket
</span><span class="n">src_obj</span> <span class="o">=</span> <span class="n">src_bck</span><span class="p">.</span><span class="nf">object</span><span class="p">(</span><span class="sh">"</span><span class="s">aaa</span><span class="sh">"</span><span class="p">)</span>
<span class="n">src_obj</span><span class="p">.</span><span class="nf">get_writer</span><span class="p">().</span><span class="nf">put_content</span><span class="p">(</span><span class="sa">b</span><span class="sh">"</span><span class="s">Hello World!</span><span class="sh">"</span><span class="p">)</span>

<span class="c1"># Prepare a destination object handle, and perform the copy operation
</span><span class="n">dest_obj</span> <span class="o">=</span> <span class="n">dest_bck</span><span class="p">.</span><span class="nf">object</span><span class="p">(</span><span class="sh">"</span><span class="s">bbb</span><span class="sh">"</span><span class="p">)</span>
<span class="n">src_obj</span><span class="p">.</span><span class="nf">copy</span><span class="p">(</span><span class="n">dest_obj</span><span class="p">)</span>

<span class="c1"># Verify that the object was copied correctly
</span><span class="nf">print</span><span class="p">(</span><span class="n">dest_obj</span><span class="p">.</span><span class="nf">get_reader</span><span class="p">().</span><span class="nf">read_all</span><span class="p">())</span>
<span class="c1"># Output: b'Hello World!'
</span></code></pre></div></div>

<p>To apply an ETL transformation as part of the copy operation, simply pass an <code class="language-plaintext highlighter-rouge">ETLConfig</code> to the <code class="language-plaintext highlighter-rouge">copy()</code> method. The SDK automatically handles the required parameter population:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Define and initialize a simple ETL that reverses object content
</span><span class="n">etl_reverse</span> <span class="o">=</span> <span class="n">client</span><span class="p">.</span><span class="nf">etl</span><span class="p">(</span><span class="sh">"</span><span class="s">etl-reverse</span><span class="sh">"</span><span class="p">)</span>

<span class="nd">@etl_reverse.init_class</span><span class="p">()</span>
<span class="k">class</span> <span class="nc">UpperCaseETL</span><span class="p">(</span><span class="n">FastAPIServer</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">transform</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">data</span><span class="p">,</span> <span class="o">*</span><span class="n">_args</span><span class="p">):</span>
        <span class="k">return</span> <span class="n">data</span><span class="p">[::</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>

<span class="c1"># Perform a copy with ETL transformation applied
</span><span class="kn">from</span> <span class="n">aistore.sdk.etl</span> <span class="kn">import</span> <span class="n">ETLConfig</span>
<span class="n">src_obj</span><span class="p">.</span><span class="nf">copy</span><span class="p">(</span><span class="n">to_obj</span><span class="o">=</span><span class="n">dest_obj</span><span class="p">,</span> <span class="n">etl</span><span class="o">=</span><span class="nc">ETLConfig</span><span class="p">(</span><span class="n">etl_reverse</span><span class="p">.</span><span class="n">name</span><span class="p">))</span>

<span class="c1"># Confirm the transformation result
</span><span class="nf">print</span><span class="p">(</span><span class="n">dest_obj</span><span class="p">.</span><span class="nf">get_reader</span><span class="p">().</span><span class="nf">read_all</span><span class="p">())</span>
<span class="c1"># Output: b'!dlroW olleH'
</span></code></pre></div></div>

<h3 id="s3-client">S3 Client</h3>

<p>The single-object copy feature is also accessible via any S3-compatible client. For example, using <a href="https://s3tools.org/s3cmd"><code class="language-plaintext highlighter-rouge">s3cmd</code></a>, you can copy objects between buckets without any changes to your existing S3-based workflows.</p>

<p>First, install <code class="language-plaintext highlighter-rouge">s3cmd</code> and configure it to connect to your AIStore cluster by following the <a href="/docs/s3compat.md#configuring-clients">S3 client configuration guide</a>.</p>

<p>Once configured, here’s how you can perform a simple object copy:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Confirm the source object exists
$ ais ls ais://src
NAME             SIZE            
README.md        11.24KiB        

# Confirm the destination is initially empty
$ ais ls ais://dest
No objects in ais://dest

# Use s3cmd to copy the object from src to dest
$ s3cmd cp s3://src/README.md s3://dest
remote copy: 's3://src/README.md' -&gt; 's3://dest/README.md'  [1 of 1]

# Verify the copied object is accessible in the destination bucket
$ ais object get ais://dest/README.md
GET README.md from ais://dest as README.md (11.24KiB)
</code></pre></div></div>

<h2 id="performance-comparison">Performance Comparison</h2>

<p>To better understand when to use the single-object copy API versus the job-based copy bucket mechanism, we ran a set of performance benchmarks across varying object sizes and workloads.</p>

<p>This scenario focuses on copying just one object at a time. We evaluated the three supported approaches across different object sizes.</p>

<ul>
  <li><strong>Client-Side Copy</strong>: The simplest method. It retrieves the object with a GET, then re-uploads it to the destination with a PUT. The client handles the full object payload.</li>
  <li><strong>Single-Object Copy API</strong>: Performs a direct, in-cluster transfer from source to destination, bypassing the client entirely.</li>
  <li><strong>Job-Type Copy Bucket API</strong>: Launches a cluster-wide job to move the object, even when there’s only one object involved.</li>
</ul>
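
<p>A rough version of this comparison is easy to reproduce. A minimal timing harness, reusing the <code class="language-plaintext highlighter-rouge">src</code> and <code class="language-plaintext highlighter-rouge">dst</code> object handles from the earlier SDK sketch:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import time

def timed(label, fn):
    t0 = time.perf_counter()
    fn()
    print(f"{label}: {time.perf_counter() - t0:.3f}s")

timed("client-side copy",
      lambda: dst.get_writer().put_content(src.get_reader().read_all()))
timed("single-object copy", lambda: src.copy(dst))
</code></pre></div></div>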

<p><img src="/assets/copy_performance.png" alt="Copy Performance Comparison" /></p>

<table>
  <thead>
    <tr>
      <th>Method</th>
      <th>64 KB</th>
      <th>1 MB</th>
      <th>16 MB</th>
      <th>256 MB</th>
      <th>1 GB</th>
      <th>4 GB</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Client-Side Copy</td>
      <td>0.007s</td>
      <td>0.021s</td>
      <td>0.670s</td>
      <td>2.515s</td>
      <td>16.283s</td>
      <td>56.407s</td>
    </tr>
    <tr>
      <td>Single-Object Copy API</td>
      <td>0.004s</td>
      <td>0.006s</td>
      <td>0.027s</td>
      <td>0.338s</td>
      <td>1.172s</td>
      <td>5.006s</td>
    </tr>
    <tr>
      <td>Job-Type API (Copy Bucket)</td>
      <td>13.08s</td>
      <td>13.07s</td>
      <td>13.08s</td>
      <td>13.08s</td>
      <td>14.123s</td>
      <td>19.147s</td>
    </tr>
  </tbody>
</table>

<p>As expected, the single-object copy API significantly outperforms the client-side method, especially as object size increases. Involving the client introduces unnecessary latency — effectively pulling data out of and back into the cluster. The job-type API introduces coordination overhead that isn’t justified for single-object transfers.</p>

<blockquote>
  <p><strong>Note:</strong> The relative performance order remains consistent even when an ETL transformation is applied during the copy. In each case, the transformation just adds one extra network step between the target and its ETL container. We ran the same tests with ETL included and confirmed that performance ranking across the three approaches did not change.</p>
</blockquote>

<h2 id="conclusion">Conclusion</h2>

<p>The single-object copy API is a fast, low-overhead solution tailored for one-off object transfers. Whether you’re moving data between internal buckets or bridging between cloud backends, it delivers consistent performance without the setup cost of a full job. It’s the ideal choice for lightweight workflows and ad-hoc object manipulation where efficiency matters.</p>

<h2 id="references">References</h2>

<ul>
  <li><a href="https://github.com/NVIDIA/aistore">AIS Repository</a></li>
  <li><a href="/docs/etl.md">AIStore ETL Overview</a></li>
  <li><a href="https://pypi.org/project/aistore/">AIS Python SDK PyPI</a></li>
  <li><a href="/docs/cli.md">AIS CLI Documentation</a></li>
</ul>]]></content><author><name>Tony Chen</name></author><category term="aistore" /><category term="cli" /><category term="etl" /><category term="benchmark" /><category term="optimization" /><category term="enhancements" /><summary type="html"><![CDATA[Single-Object Copy/Transform Capability]]></summary></entry></feed>