How to Upload a CSV File in Django

3 Techniques for Importing Big CSV Files Into a Django App

[Image: pixelated character over an orange backdrop. Image by the author.]

Problem Overview and App Configuration

It's often the case that you want to load data into your database from a CSV file. Usually, it's not a problem at all, but there are cases when performance issues can occur, especially if you want to load a massive amount of data. In this case, "massive" means a CSV file that has 500MB to 1GB of data and millions of rows.

In this article, I will focus on a situation where using database utilities to load CSV files (like PostgreSQL's COPY) is not possible because you need to transform the data in the process.

It is also worth noting that a data load of this size should always be questioned, and you should try to find more suitable ways to do it. Always check whether you can copy the data directly into the database using database engine utilities like COPY. These kinds of operations will almost always be much more performant than going through the ORM and your application code.

Let's say that we have two models: Product and ProductCategory. We get the data from a different department of the organization and we have to load it into the system. Our Django models will look like this:

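The models themselves are embedded as a gist in the original post. A minimal sketch of what they could look like, assuming fields that match the CSV columns used later (field sizes and the on_delete behavior are assumptions):

    # models.py -- a minimal sketch; field names follow the CSV columns
    from django.db import models


    class ProductCategory(models.Model):
        name = models.CharField(max_length=255)
        code = models.CharField(max_length=64)


    class Product(models.Model):
        name = models.CharField(max_length=255)
        code = models.CharField(max_length=64)
        price = models.DecimalField(max_digits=10, decimal_places=2)
        product_category = models.ForeignKey(
            ProductCategory, on_delete=models.CASCADE, related_name="products"
        )
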
The data structure is pretty simple, but it will be enough to show the issues with a massive data load. One thing worth noting here is the relationship between Product and ProductCategory. In this case, we can expect the number of product categories to be several orders of magnitude lower than the number of products. We will use this knowledge later.

We also need a generator for the CSV files. The CSV file has the following columns (a sketch of a generator script is shown after the list):

  • product_name
  • product_code
  • price
  • product_category_name
  • product_category_code

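The generator script is embedded in the original post and not reproduced here. A minimal stand-in that writes those five columns could look like this (the category count and value formats are assumptions):

    # csv_mock_data_create.py -- a simplified stand-in for the original script
    import csv
    import random
    import sys

    # Assumption: far fewer categories than products, as noted above
    NUM_CATEGORIES = 20


    def main(row_count):
        categories = [(f"Category {i}", f"CAT-{i:03d}") for i in range(NUM_CATEGORIES)]
        with open("products.csv", "w", newline="") as csv_file:
            writer = csv.writer(csv_file)
            # No header row is written for now (see the note below)
            for i in range(row_count):
                category_name, category_code = random.choice(categories)
                writer.writerow([
                    f"Product {i}",
                    f"P-{i:07d}",
                    round(random.uniform(1, 500), 2),
                    category_name,
                    category_code,
                ])


    if __name__ == "__main__":
        main(int(sys.argv[1]))
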
Using the script above, you can create a CSV file with the data we will need for load testing. You can pass a number as an argument when calling the script, and this will be the number of rows in the generated file:

            python3 csv_mock_data_create.py 10000          

The command above will create a file with 10,000 products. Note that the script skips the CSV header for now. I will get back to that later.

Be careful here, as 10 million rows will create a file of around 600MB.

Now we just need a simple Django management command to load the file. We will not do it via a view because, as we already know, the files are huge. That would mean uploading ~500MB files through a request handler and, as a result, loading them into memory. This is inefficient.

The command now has a naive implementation of the data loading and also shows the time needed to process the CSV file:

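The naive command is embedded as a gist in the original article. A sketch of what it likely looks like, assuming an app named products and the models sketched above (the problematic lines discussed below are included as-is):

    # products/management/commands/load_csv.py -- a sketch of the naive version
    import csv
    import time

    from django.core.management.base import BaseCommand

    from products.models import Product, ProductCategory  # assumed app name


    class Command(BaseCommand):
        help = "Load products from a CSV file (naive version)"

        def add_arguments(self, parser):
            parser.add_argument("file_path", type=str)

        def handle(self, *args, **options):
            start = time.time()
            with open(options["file_path"]) as csv_file:
                # Problem 1: the whole file is materialized as a list
                data = list(csv.reader(csv_file, delimiter=","))
                # Problem 1 again: data[1:] makes a second copy while skipping the header
                for row in data[1:]:
                    # Problem 2: a query per row just to get the category
                    product_category, _ = ProductCategory.objects.get_or_create(
                        name=row[3], code=row[4]
                    )
                    # Problem 3: one INSERT per row
                    Product.objects.create(
                        name=row[0],
                        code=row[1],
                        price=row[2],
                        product_category=product_category,
                    )
            self.stdout.write(f"Processed in {time.time() - start:.6f} seconds")

The command takes the path to the CSV file as its only argument:
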
            python3 manage.py load_csv /path/to/your/file.csv                      

For 200 products, the code above executed in 0.220191 seconds. For 100,000 products, it took 103.066553 seconds. And it would likely take ten times longer for 1 million products. Can we make it faster?

1. Do Not Load the Whole File Into Memory

The first thing to note is that the code above loads the whole CSV into memory. Even more interestingly, it does it twice. These two lines are really bad:

    data = list(csv.reader(csv_file, delimiter=","))
    for row in data[1:]:
        ...

It's a common mistake to try to skip the header from processing like that. The code is iterating from the second element of the list, but csv.reader is an iterator, which means it's memory-efficient. If a programmer forces the list conversion, the whole CSV file is loaded into a list and thus into the memory of the process. On instances without enough RAM, that can be an issue. The second copy of the data is made when data[1:] is used in the for loop. So how can we handle it?

    data = csv.reader(csv_file, delimiter=",")
    next(data)
    for row in data:
        ...

Calling next will move the iterator to the next item, so we can skip the CSV header (in most cases, it's not needed for the processing). Also, the memory footprint of the process will be much lower. This change has a negligible effect on the execution time, but it has a big impact on the memory used by the process.

2. Do Not Make Unnecessary Queries When Iterating

I am talking about this line in particular:

    product_category, _ = ProductCategory.objects.get_or_create(name=row[3], code=row[4])

What we are doing here is fetching (or creating) the ProductCategory instance on every loop iteration using the category name and code. How can we solve this?

We can load the categories before the for loop and only create the ones that don't exist in the database yet:

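The updated command is again a gist in the original. The core of the change, sketched here, is to build an in-memory lookup of categories keyed by (name, code) before the loop and fall back to a single create for unseen categories:

    # Sketch: cache the categories in memory instead of querying on every row
    categories = {
        (category.name, category.code): category
        for category in ProductCategory.objects.all()
    }

    for row in data:
        key = (row[3], row[4])
        if key not in categories:
            # Only hit the database for categories we have not seen yet
            categories[key] = ProductCategory.objects.create(name=row[3], code=row[4])
        product_category = categories[key]
        ...
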
This change alone decreases the time for 100,000 products by 34 seconds (around 30%). The command executes in 69 seconds after the change.

3. Do Not Save One Element at a Time

When we are creating the Product instance, we are asking our database to commit the change on each loop iteration:

    Product.objects.create(
        name=row[0],
        code=row[1],
        price=row[2],
        product_category=product_category,
    )

This is an I/O operation on each loop iteration, and it is costly. Even though a single insert is pretty fast, the problem is that there can be millions of such operations, and we can decrease their number significantly. How? By using Django's bulk_create method:

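The corresponding gist is not reproduced here either. A sketch of the batched version, assuming a batch size of 5,000 (the figure referenced just below):

    # Sketch: collect Product instances and insert them in batches of 5,000
    BATCH_SIZE = 5000
    products = []

    for row in data:
        key = (row[3], row[4])
        if key not in categories:
            categories[key] = ProductCategory.objects.create(name=row[3], code=row[4])
        products.append(
            Product(
                name=row[0],
                code=row[1],
                price=row[2],
                product_category=categories[key],
            )
        )
        if len(products) >= BATCH_SIZE:
            Product.objects.bulk_create(products)
            products = []
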
And this change has a tremendous effect. For 100,000 products, the command executes in just 3.5 seconds. You need to remember that the last loop iteration can still leave items in the products list (fewer than 5,000 items in our case). This needs to be handled after the loop:

    if products:
        Product.objects.bulk_create(products)

These three changes together allowed us to improve the performance of the command by more than 96%. Code matters. Good code matters even more. The final command looks like this:

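The final gist is not reproduced here. Combining the three changes under the same assumptions as before, the full command could look roughly like this:

    # products/management/commands/load_csv.py -- a sketch of the optimized version
    import csv
    import time

    from django.core.management.base import BaseCommand

    from products.models import Product, ProductCategory  # assumed app name

    BATCH_SIZE = 5000


    class Command(BaseCommand):
        help = "Load products from a CSV file (optimized version)"

        def add_arguments(self, parser):
            parser.add_argument("file_path", type=str)

        def handle(self, *args, **options):
            start = time.time()
            # Technique 2: cache the categories before the loop
            categories = {
                (c.name, c.code): c for c in ProductCategory.objects.all()
            }
            products = []
            with open(options["file_path"]) as csv_file:
                # Technique 1: keep the reader as an iterator, skip the header
                data = csv.reader(csv_file, delimiter=",")
                next(data)
                for row in data:
                    key = (row[3], row[4])
                    if key not in categories:
                        categories[key] = ProductCategory.objects.create(
                            name=row[3], code=row[4]
                        )
                    products.append(
                        Product(
                            name=row[0],
                            code=row[1],
                            price=row[2],
                            product_category=categories[key],
                        )
                    )
                    # Technique 3: insert in batches instead of one row at a time
                    if len(products) >= BATCH_SIZE:
                        Product.objects.bulk_create(products)
                        products = []
            if products:
                Product.objects.bulk_create(products)
            self.stdout.write(f"Processed in {time.time() - start:.6f} seconds")
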
With the code above, 1 million products are loaded in 30 seconds!

Pro Tip: Use Multi-Processing

Yet another idea for improving the loading speed of a massive CSV would be to use multiprocessing. I will just present the idea here. In the command above, you could split the one big CSV file into multiple smaller batches (the best approach would be to split by row indexes) and hand each batch of work to a separate process. If you can use multiple CPUs on your machine, the scaling will be linear (2x CPUs = 2 times faster, 4x CPUs = 4 times faster).

Imagine that you have one million rows to process. The first process takes rows 0 to 99,999, the second takes rows 100,000 to 199,999, and so on, until the last one takes rows 900,000 to 999,999.

The only downside here is that you need ten free CPUs.
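
A minimal sketch of that idea, which is not part of the original article: the chunking helper and worker below are assumptions, the example assumes a Unix-like fork start method, and each worker should use its own database connection (hence the close_all before forking).

    # Sketch: split the row range across worker processes (not from the original post)
    import csv
    import itertools
    from multiprocessing import Pool

    from django.db import connections


    def load_rows(args):
        """Worker: process rows [start, end) of the CSV file."""
        file_path, start, end = args
        with open(file_path) as csv_file:
            data = csv.reader(csv_file, delimiter=",")
            next(data)  # skip the header
            # islice skips the first `start` rows and stops at `end`
            for row in itertools.islice(data, start, end):
                ...  # same per-row logic as in the final command


    def load_in_parallel(file_path, total_rows, processes=10):
        chunk = total_rows // processes
        ranges = [(file_path, i * chunk, (i + 1) * chunk) for i in range(processes)]
        # Close inherited connections so each forked worker opens its own
        connections.close_all()
        with Pool(processes=processes) as pool:
            pool.map(load_rows, ranges)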

Summary

  • You should avoid loading the whole file into memory. Use iterators instead.
  • If you are processing the file line by line, avoid queries to the database in the for loop body.
  • Do not save one element per loop iteration. Use the bulk_create method.

Thanks for reading!


Source: https://betterprogramming.pub/3-techniques-for-importing-large-csv-files-into-a-django-app-2b6e5e47dba0
