Importing documents in bulk
Open Zaak data can be supplied in various ways. One can of course make API calls,
for example to create EnkelvoudigInformatieObject’s but whenever a large amount
of data needs to be created in a relatively short time period, this can become
a not-so-pleasant experience. To prevent having to deal with these kinds of situations,
Open Zaak users can make use of the import functionality for EnkelvoudigInformatieObject’s.
Open Zaak exposes several API endpoints (which are not part of the Documenten API standard)
to help aid the user to import larger amounts of EnkelvoudigInformatieObject’s.
Note
This bulk import was specifically implemented for Documenten API using the (default) filestorage backend. It does function correctly for other backends such as Azure blob storage or S3 storage, but it might not be as performant.
In addition to this, the behavior of the bulk import for filesystem storage compared to blob storage and S3 storage with regard to existing files is slightly different: with filesystem storage, if a file already exists on storage at the path defined in the metadata file, this existing file will be overwritten by the imported file. For blob storage and s3 storage, a suffix is appended to the filename to make sure the original file is not overwritten.
Configuration
Environment variables related to the import functionality are also described in Environment configuration reference
IMPORT_DOCUMENTEN_BASE_DIRis used to determine the absolute import path for each row in the import metadata file. All file paths specified in the import metadata file should be relative to the directory specified for this setting. By default this is the same directory as the projects directory (BASE_DIR) and can either be configured through an environment variable or directly through the django’s settings file being used.IMPORT_DOCUMENTEN_BATCH_SIZEis the number of rows that will be processed at a time.IMPORT_RETENTION_DAYS: an integer which specifies the number of days after whichImportinstances will be deleted
Process
The import process consists of several steps with each step having an API endpoint. The process consists of the following steps:
Creating the
ImportUploading the
Importmetadata file(Optionally) Retrieving the status of the
Import(Optionally) Retrieving the report file of the
Import(Optionally) Deleting the
Import
An visual representation of the process can be seen below:
Permissions
Importing EnkelvoudigInformatieObjecten is not possible for all authenticated users
for Open Zaak. The reasoning behind this is that not all users, for example have
the rights to create EnkelvoudigInformatieObjecten with certain
InformatieObjectType’s. Various permissions exist within Open Zaak and
therefore to make use of the import functionality, only users with an
Applicatie with the heeft_alle_autorisaties set to True are allowed
to perform any import operation.
Creating an Import
The first step in the import process is that of making an Import resource.
This is done through a POST request. The request body of this request can be empty
for this step.
Creating an Import is only possible whenever no other Import instance exist
with the statusses pending or active.
Whenever no other import is pending or active, the endpoint will provide the user
three URLs: a URL to upload an import metadata file, a URL to retrieve the status
of an import and another URL for downloading a report of the import. These three
urls provide the user the ability to progress further in the import process.
After this request the Import instance will have it’s status changed to
pending.
See the API documentation for more details.
Starting an Import
After creating an Import instance, users can upload an import metadata file. This
should be a CSV which consists of rows with the data needed to create an
EnkelvoudigInformatieObject. This data is roughly the same as the data needed
for creating an EnkelvoudigInformatieObject through a “regular” API call
with some exceptions. For more details about the format and the requirements of
the CSV file, the API documentation should be consulted. The request for this
endpoint should be a POST request containing the CSV data in its request body.
Whenever the CSV file contains invalid and/or missing headers, the import process will not be started and the error response will contain any missing headers.
The bestandspad column, which is required for each row in the CSV file,
is the path to the file which will be imported and will be assosciated to the
EnkelvoudigInformatieObject. This should be a relative path from the directory
configured in IMPORT_DOCUMENTEN_BASE_DIR.
The import will only start if the IMPORT_DOCUMENTEN_BASE_DIR setting is set
correctly. Unknown directories or the IMPORT_DOCUMENTEN_BASE_DIR leading to a
a file instead of a directory are examples of incorrect configurations. An
incorrectly configured IMPORT_DOCUMENTEN_BASE_DIR setting will cause the import
to not start.
Just like in the step to create the Import instance, no other pending or
active import instances can be active before starting the specified Import.
After this request passes validation (with the above mentioned checks) the
actual import process is started through a background task and the status of the
Import instance is changed to active.
See the API documentation for more details.
Retrieving the status
The status of the Import can be retrieved when an import process is started
and has the status pending, active, finished or error. This endpoint
can be called through a GET request. The data of the response contains
information, for example, about the total amount of rows the import metadata file
has and the amount of rows the Import at that time has processed.
If the background task is finished the status of the Import is either finished
or error in case of unrecoverable error situations.
See the API documentation for more details.
Retrieving a report file
When an Import instance is finished or has an error, a report
file of the process can be downloaded. This report is a CSV file the same as
the provided metadata file with an additional two columns that specify whether
a row was imported successfully and if there any comments about the row.
See the API documentation for more details.
Deleting an Import
When an Import instance has the status finished, error or pending
it can be deleted. Deletion of Import instances that are older than the environment
variable IMPORT_RENTENTION_DAYS days and have one of the above mentioned
statuses is done in the background through a daily occurring task.
See the API documentation for more details.
Import behavior
The import process is a background task and imports each row in
batches (configured through IMPORT_DOCUMENTEN_BATCH_SIZE). During each batch,
a validation error can occur, for example an existing uuid being present in the
database. This will not cause other rows to not be imported.
If a row does not cause any validation errors, the file associated with that row will be copied to Open Zaak’s storage. If the file already exists there, it will be overwritten.
Another situation can occur where the import process cannot proceed, for example
a database connection loss. This will stop the import process
(the background task). In this situation the database cannot be reached and the
data of the Import instance (e.g statistics) will be out-of-sync. However, logging
is done and the report file will have comments for all rows in that
specified batch.
It is important to note that no notifications will be sent during or after the import process. If you use the import process please notify the subscribers of your API about the new documents. so they won’t have inconsistent data.
Examples
The following steps are an example of how the import process could look like. For these examples the curl tool is used.
Creating an import
curl --request POST \
--header "Authorization: Bearer <token>" \
https://<domain-name>/documenten/api/v1/import/create
Starting an import
curl --request POST \
--header "Authorization: Bearer <token>" \
--header "Content-Type: text/csv" \
--upload-file <path-to-metadata-file> \
https://<domain-name>/documenten/api/v1/import/<import-uuid>/upload
Retrieving the status of an import
curl --request GET \
--header "Authorization: Bearer <token>" \
https://<domain-name>/documenten/api/v1/import/<import-uuid>/status
Retrieving the report of an import
curl --request GET \
--header "Authorization: Bearer <token>" \
https://<domain-name>/documenten/api/v1/import/<import-uuid>/report
Deleting an import
curl --request DELETE \
--header "Authorization: Bearer <token>" \
https://<domain-name>/documenten/api/v1/import/<import-uuid>/delete