Add client-side batching for CreateMultiple/UpdateMultiple/UpsertMultiple #156

@suyask-msft

Description

Problem

CreateMultiple, UpdateMultiple, and UpsertMultiple each send all records in a single POST regardless of count. There is no client-side chunking.

  • _create_multiple (data/_odata.py:316-376) — builds one {"Targets": [...]} payload with every record and POSTs it.
  • _update_multiple (data/_odata.py:656-697) — same pattern.
  • _upsert_multiple (data/_odata.py:440-493) — same pattern.

Dataverse has a server-side limit (typically 1,000 records per *Multiple call). Sending more can result in 400/413 errors or timeouts. Today, callers must chunk manually in their scripts. The SDK should handle this internally.

Proposed changes

1. Client-side batching (correctness fix)

Split large record lists into 1,000-record chunks and send each as a separate POST. This is the minimum viable fix.

# Pseudocode for _create_multiple with batching
BATCH_SIZE = 1000

def _create_multiple(self, entity_set, table_schema_name, records):
    all_ids = []
    for i in range(0, len(records), BATCH_SIZE):
        chunk = records[i:i + BATCH_SIZE]
        ids = self._create_multiple_batch(entity_set, table_schema_name, chunk)
        all_ids.extend(ids)
    return all_ids

Atomicity trade-off: Today a single POST is atomic (all-or-nothing). Splitting into batches makes partial success possible: batch 1 can succeed and batch 2 fail, leaving the caller with a partial import. This should be clearly documented. Callers who need atomicity should limit their input to at most 1,000 records per call.
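One way the partial-success case could be surfaced to callers is an exception that carries the IDs already created. The sketch below is only illustrative: PartialBatchError, create_in_chunks, and the send_chunk callable (a stand-in for one CreateMultiple POST) are hypothetical names, not part of the SDK.

```python
BATCH_SIZE = 1000  # per-call record limit cited in this issue


class PartialBatchError(Exception):
    """Hypothetical error: a chunk failed after earlier chunks succeeded."""

    def __init__(self, created_ids, failed_offset, cause):
        super().__init__(
            f"chunk starting at record {failed_offset} failed after "
            f"{len(created_ids)} records were created: {cause}"
        )
        self.created_ids = created_ids      # IDs already committed server-side
        self.failed_offset = failed_offset  # index of the first record in the failed chunk


def create_in_chunks(records, send_chunk, batch_size=BATCH_SIZE):
    """Send records chunk by chunk; send_chunk(chunk) returns the created IDs."""
    all_ids = []
    for start in range(0, len(records), batch_size):
        chunk = records[start:start + batch_size]
        try:
            all_ids.extend(send_chunk(chunk))
        except Exception as exc:
            # Earlier chunks are not rolled back, so report what was created.
            raise PartialBatchError(all_ids, start, exc) from exc
    return all_ids
```

With this shape a caller can resume from failed_offset or clean up created_ids, rather than guessing how far the import got.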

2. Optional concurrent batch dispatch (performance, follow-on)

After batching exists, add an opt-in max_workers parameter to dispatch batches concurrently via concurrent.futures.ThreadPoolExecutor (stdlib, no new dependency).

def create(self, table, data, *, max_workers=1):
    # max_workers=1 (default): sequential, identical to today
    # max_workers=4: up to 4 concurrent batch POSTs
    ...

Default must be 1 (sequential) to avoid any regression:

  • No extra threads on slow machines
  • No extra memory overhead
  • No concurrent request spike hitting Dataverse rate limits
  • Identical behavior to today unless user explicitly opts in

When max_workers > 1:

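  • Uses ThreadPoolExecutor (each thread reserves its own stack, commonly up to ~8MB of virtual memory on Linux with platform-dependent defaults; total is bounded by max_workers)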
  • Respects 429 (rate limit) responses — backs off all workers
  • Connection pooling via existing _HttpClient session support
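The opt-in dispatch and the shared 429 back-off described above might look roughly like the following. This is a sketch under assumptions: RateLimited, BackoffGate, dispatch_batches, and send_chunk are hypothetical names, and how the real _HttpClient reports a 429 or its Retry-After value is not specified in this issue.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class RateLimited(Exception):
    """Hypothetical signal for an HTTP 429; retry_after is in seconds."""

    def __init__(self, retry_after):
        super().__init__(f"rate limited, retry after {retry_after}s")
        self.retry_after = retry_after


class BackoffGate:
    """Shared pause point: a 429 seen by one worker delays every worker."""

    def __init__(self):
        self._lock = threading.Lock()
        self._resume_at = 0.0

    def wait(self):
        with self._lock:
            delay = self._resume_at - time.monotonic()
        if delay > 0:
            time.sleep(delay)

    def trip(self, retry_after):
        with self._lock:
            self._resume_at = max(self._resume_at, time.monotonic() + retry_after)


def dispatch_batches(records, send_chunk, batch_size=1000, max_workers=1, max_retries=3):
    """Send chunks sequentially (max_workers=1) or on a bounded thread pool."""
    chunks = [records[i:i + batch_size] for i in range(0, len(records), batch_size)]
    gate = BackoffGate()

    def run(chunk):
        for _ in range(max_retries + 1):
            gate.wait()  # honor a back-off requested by any worker
            try:
                return send_chunk(chunk)
            except RateLimited as exc:
                gate.trip(exc.retry_after)
        raise RuntimeError("chunk still rate limited after retries")

    if max_workers <= 1:
        results = [run(c) for c in chunks]  # default path: no threads created
    else:
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            results = list(pool.map(run, chunks))  # map() preserves chunk order
    return [rid for ids in results for rid in ids]
```

Using pool.map keeps the returned IDs in input order even when chunks finish out of order, so the result lines up with the caller's record list just as the sequential path does.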

3. Page pre-fetching in _get_multiple (separate enhancement)

_get_multiple (data/_odata.py:821-826) fetches pages sequentially in a while next_link loop. Each page blocks until complete before the next is requested.

Pre-fetching 1 page ahead while the caller processes the current page would overlap I/O with processing:

def _get_multiple(self, ..., prefetch_pages=0):
    # prefetch_pages=0 (default) = sequential, identical to today
    # prefetch_pages=1 = fetch next page while caller processes current

Default must be 0 to avoid buffering extra pages in memory. A single pre-fetched page for a 5,000-record default page size is ~5-20MB depending on column count — acceptable when opted in, but shouldn't be forced.
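A one-page lookahead could be sketched as a generator that submits the next GET before yielding the current page. The names iter_pages and fetch_page are hypothetical; fetch_page(url) is assumed to return a (records, next_link) pair, with next_link set to None on the last page.

```python
from concurrent.futures import ThreadPoolExecutor


def iter_pages(fetch_page, first_url, prefetch=False):
    """Yield pages of records; with prefetch=True, overlap one fetch with processing."""
    if not prefetch:
        # Sequential path: request the next page only after the caller resumes.
        url = first_url
        while url:
            records, url = fetch_page(url)
            yield records
        return

    # A single worker keeps at most one extra page buffered in memory.
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch_page, first_url)
        while pending is not None:
            records, next_link = pending.result()
            # Start the next fetch before handing this page to the caller.
            pending = pool.submit(fetch_page, next_link) if next_link else None
            yield records
```

One consequence of this design is that a caller who stops iterating early still pays for one page it never reads, which is another reason to keep prefetching opt-in.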

4. Picklist cache warming (separate enhancement)

_optionset_map (data/_odata.py:1219-1331) makes 2 HTTP calls per string field on cache miss. The cache works well for subsequent records, but the first record with N string fields triggers 2N sequential HTTP calls.

A warm_picklist_cache(table) method that fetches all picklist metadata for a table in a single request would eliminate the cold-start penalty for bulk operations.
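The cold-start difference can be illustrated with a small cache that supports both per-field lookups and a table-level warm-up. Everything here is hypothetical: PicklistCache, fetch_field, and fetch_table are illustrative stand-ins, and the actual metadata request that would back fetch_table is left unspecified.

```python
class PicklistCache:
    """Sketch of a label-to-value cache with an optional table-level warm-up."""

    def __init__(self, fetch_field, fetch_table):
        # fetch_field(table, field) -> {label: value}, or None for non-picklist fields
        # fetch_table(table) -> {field: {label: value}} for every picklist field
        self._fetch_field = fetch_field
        self._fetch_table = fetch_table
        self._cache = {}
        self._warmed = set()

    def warm(self, table):
        """Load all picklist options for a table with a single fetch."""
        for field, options in self._fetch_table(table).items():
            self._cache[(table, field)] = options
        self._warmed.add(table)

    def lookup(self, table, field):
        key = (table, field)
        if key in self._cache:
            return self._cache[key]
        if table in self._warmed:
            # The warm-up saw every picklist field, so this one is not a picklist.
            return None
        self._cache[key] = self._fetch_field(table, field)  # cold miss: per-field fetch
        return self._cache[key]
```

After warm(table), lookups for that table never trigger a per-field fetch, including for plain string fields that are not picklists.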

APIs NOT proposed for parallelism

| API | Why not |
| --- | --- |
| Chunked file upload (_upload.py:117-195) | Protocol is sequential by design: it uses a session token with Content-Range headers, and each chunk must return 206 before the next can be sent |
| Column creation (_odata.py:1712-1762) | Dataverse metadata locks on the same table can cause conflicts with concurrent POSTs |
| Column deletion (_odata.py:1764-1831) | Same metadata lock concern |
| Relationship creation (_relationships.py) | Same metadata lock concern |
| BulkDelete (_odata.py:548-618) | Already async server-side; splitting into concurrent jobs adds complexity with minimal benefit |

Context

Identified during end-to-end validation of a 21-table dataset import. The agent-generated script had to implement its own chunking (chunk_size=1000) because the SDK doesn't handle it. Client-side batching should be an SDK responsibility, not something every caller reinvents.

Metadata

Labels

enhancement: New feature or request
