
Stream decompress archive files on S3

Sometimes I need to uncompress large archives stored in remote locations - for instance, in Amazon S3. Most of the usual approaches to uncompressing archives are resource-intensive, like downloading the entire file into memory first - an approach that isn't tenable for multi-gigabyte files.

The solution is to stream the file and perform the decompression and extraction as you go - similar to piping the output of one command into another. If you write out each extracted file while you uncompress it, you only ever need to hold the current file's data in memory. Much more efficient!

Enter the smart_open library - a wrapper around many different cloud storage providers (and more) that handles the gory details of opening those files in a streamable manner. With this library, I was able to quickly put together a script that would open my zip file (located in S3) and upload the resulting files back to S3 again. No local disk, no huge memory overhead!
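
To get a feel for smart_open on its own, here's a minimal sketch of opening an S3 object as a stream. The bucket, key, and AWS profile below are placeholders, and the transport_params argument is how recent versions of smart_open accept a pre-configured boto3 client - adjust as needed:

from smart_open import open
import boto3

# Placeholder profile and URI - substitute your own
session = boto3.Session(profile_name='my-profile')
params = {'client': session.client('s3')}

with open('s3://my-bucket/archives/big.zip', 'rb', transport_params=params) as f:
	# The object is fetched lazily as you read it, not downloaded up front
	print(f.read(4))  # b'PK\x03\x04' for a zip file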

Requirements: smart_open, installed with S3 support (e.g. pip install 'smart_open[s3]')

Code:

from smart_open import open
import zipfile

source = 'local file path / s3:// URI'
dest = 'local file path / s3:// URI'

# Iterate over all entries in the zip file
with open(source, 'rb') as file_data:
	with zipfile.ZipFile(file_data) as z:
		for file_info in z.infolist():

			new_filename = dest + file_info.filename

			# Skip directories - prefixes aren't explicitly created in S3
			if not file_info.is_dir():

				# Stream the uncompressed file directly to the dest
				with open(new_filename, 'wb') as dest_file_data:
					with z.open(file_info.filename) as zip_file_data:
						dest_file_data.write(zip_file_data.read())
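
One note on the script above: opening the destination with a context manager matters, since smart_open finalizes the S3 upload when the file object is closed. Also, zip_file_data.read() still holds each entire member in memory while it is written out. If individual members are themselves huge, a chunked copy keeps memory bounded - here's a sketch of that variant with placeholder URIs, using shutil.copyfileobj (not part of the original script):

import shutil
import zipfile
from smart_open import open

source = 's3://my-bucket/archive.zip'  # placeholder URIs
dest = 's3://my-bucket/extracted/'

with open(source, 'rb') as file_data:
	with zipfile.ZipFile(file_data) as z:
		for file_info in z.infolist():
			if file_info.is_dir():
				continue

			with open(dest + file_info.filename, 'wb') as dest_file_data:
				with z.open(file_info.filename) as zip_file_data:
					# Copy in fixed-size chunks instead of one big read()
					shutil.copyfileobj(zip_file_data, dest_file_data, length=16 * 1024 * 1024)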

Tags: python

