3 Techniques for Importing Large CSV Files Into a Django App
Loading huge amounts of data into a database just got easier
Problem Overview and App Configuration
It's often the case that you want to load data into your database from a CSV file. Usually, it's not a problem at all, but there are some cases when performance issues can occur, especially if you want to load a massive amount of data. In this case, "massive" means a CSV file that has 500MB to 1GB of data and millions of rows.
In this article, I will focus on a situation where using the database utilities to load CSV files (like PostgreSQL COPY) is not possible, as you need to do a transformation in the process.
Also, it is worth noting here that a data load of this size should always be questioned, and you should try to find more suitable ways to do it. Always check if you can copy the data directly to the database using database engine utilities like COPY. These kinds of operations will almost always be much more performant than using the ORM and your application code.
Let's say that we have two models: Product and ProductCategory. We get the data from a different department of the organization, and we have to load it into the system. Our Django models will look like this:
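The models themselves are not reproduced here, so below is a minimal sketch of what they might look like; the field names and types are assumptions based on the CSV columns described later.

# models.py -- a sketch of the two models (field names and types are assumptions)
from django.db import models


class ProductCategory(models.Model):
    name = models.CharField(max_length=255)
    code = models.CharField(max_length=32)


class Product(models.Model):
    name = models.CharField(max_length=255)
    code = models.CharField(max_length=32)
    price = models.DecimalField(max_digits=10, decimal_places=2)
    product_category = models.ForeignKey(ProductCategory, on_delete=models.CASCADE)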
The data structure is pretty simple, but it will be enough to show the issues with a massive data load. One thing worth noting here is the relationship between Product and ProductCategory. In this case, we can expect that the number of product categories will be several orders of magnitude lower than the number of products. We will use this knowledge later.
We also need a generator for the CSV files (a sketch of one follows the column list below). The CSV file has the following columns:
- product_name
- product_code
- price
- product_category_name
- product_category_code
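The generator script itself is not included here; the following is a minimal sketch of what such a script might look like, with the file name csv_mock_data_create.py taken from the command below and everything else (category count, value formats) assumed.

# csv_mock_data_create.py -- a sketch of a mock-data generator (details are assumptions)
# Usage: python3 csv_mock_data_create.py 10000
import csv
import random
import sys

NUM_CATEGORIES = 100  # far fewer categories than products, as noted above


def main(num_rows):
    with open("products.csv", "w", newline="") as csv_file:
        writer = csv.writer(csv_file, delimiter=",")
        writer.writerow([
            "product_name", "product_code", "price",
            "product_category_name", "product_category_code",
        ])
        for i in range(num_rows):
            category_id = random.randint(1, NUM_CATEGORIES)
            writer.writerow([
                f"Product {i}",
                f"P-{i:08d}",
                round(random.uniform(1, 1000), 2),
                f"Category {category_id}",
                f"C-{category_id:04d}",
            ])


if __name__ == "__main__":
    main(int(sys.argv[1]))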
Using the script above, you can create a CSV file with the data we will need for load testing. You can pass a number when calling the script, and this will be the number of rows in the generated file:
python3 csv_mock_data_create.py 10000
The command above will create a file with 10,000 products. Note that the script is skipping the CSV header for now. I will get back to that later.
Be careful here, as 10 meg rows will create a file of size around 600MB.
Now we just need a simple Django management command to load the file. We will not do it via a view because, as we already know, the files are huge. That would mean uploading ~500MB files through a request handler and, as a result, loading the files into memory. This is inefficient.
The command has a naive implementation of the data load and also shows the time needed to process the CSV file:
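The command itself is not reproduced here; the sketch below reconstructs a naive load_csv command from the code fragments quoted in the sections that follow, so the file location, app name, and timing output are assumptions.

# products/management/commands/load_csv.py -- a sketch of the naive command
# (reconstructed from the fragments discussed below; exact structure is assumed)
import csv
import time

from django.core.management.base import BaseCommand

from products.models import Product, ProductCategory  # assumed app name


class Command(BaseCommand):
    help = "Load products from a CSV file"

    def add_arguments(self, parser):
        parser.add_argument("file_path", type=str)

    def handle(self, *args, **options):
        start = time.perf_counter()
        with open(options["file_path"], "r") as csv_file:
            # Naive approach: the whole file is read into a list in memory...
            data = list(csv.reader(csv_file, delimiter=","))
            # ...and copied again by the slice that skips the header.
            for row in data[1:]:
                product_category, _ = ProductCategory.objects.get_or_create(
                    name=row[3], code=row[4]
                )
                Product.objects.create(
                    name=row[0],
                    code=row[1],
                    price=row[2],
                    product_category=product_category,
                )
        self.stdout.write(f"Done in {time.perf_counter() - start:.6f} seconds")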
python3 manage.py load_csv /path/to/your/file.csv
For 200 products, the code above was executed in 0.220191 seconds. For 100,000 products, it took 103.066553 seconds. And it would likely take ten times longer for 1 million products. Can we make it faster?
1. Do Not Load the Whole File Into Memory
The first thing to note is that the code above loads the whole CSV into memory. Even more interestingly, it's doing it twice. These two lines are really bad:
data = list(csv.reader(csv_file, delimiter=","))
for row in data[1:]:
    ...
It's a common mistake to try to skip the header from processing like that. The code is iterating from the second element of the list, but csv.reader is an iterator, which means it's memory-efficient. If a programmer forces list conversion, then the CSV file will be loaded into a list and thus into the memory of the process. On instances without enough RAM, that can be an issue. The second copy of the data is made when data[1:] is used in the for loop. So how can we handle it?
data = csv.reader(csv_file, delimiter=",")
next(data)
for row in data:
    ...
Calling next will move the iterator to the next item, and we will be able to skip the CSV header (in most cases, it's not needed for the processing). Also, the memory footprint of the process will be much lower. This change has a negligible effect on the execution time, but it has a big impact on the memory used by the process.
2. Do Not Make Unnecessary Queries When Iterating
I am talking about this line in particular:
product_category, _ = ProductCategory.objects.get_or_create(name=row[3], code=row[4])
What we are fetching here is the ProductCategory instance on each loop, using the category name and code. How can we solve this?
We can load the categories before the for loop and add them only when they don't exist in the database:
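The snippet for this change is not reproduced here; a minimal sketch of the idea, caching categories in a dictionary keyed by (name, code) (the key choice is an assumption), could look like this:

# Cache all categories once, before the loop (a sketch; the dict key is an assumption)
categories = {
    (category.name, category.code): category
    for category in ProductCategory.objects.all()
}

for row in data:
    key = (row[3], row[4])
    product_category = categories.get(key)
    if product_category is None:
        # Create the missing category once and keep it in the cache
        product_category = ProductCategory.objects.create(name=row[3], code=row[4])
        categories[key] = product_category
    ...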
This change alone decreases the time for 100,000 products by 34 seconds (around 30%). The command executes in 69 seconds after the change.
3. Do Not Save One Element at a Time
When we are creating the instance of the Product, we are asking our database to commit the changes in each loop:
Product.objects.create(
    name=row[0],
    code=row[1],
    price=row[2],
    product_category=product_category
)
This is an I/O operation in each loop, and it is costly. Even though a single operation is pretty fast, the problem here is that there can be millions of them, and we can decrease the number of such operations significantly. How? By using Django's bulk_create method:
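The batched version is not reproduced here; a minimal sketch of the idea, assuming a batch size of 5,000 (the number mentioned below), could look like this:

# Collect Product instances and write them in batches (a sketch; batch size assumed)
BATCH_SIZE = 5000
products = []

for row in data:
    ...
    products.append(
        Product(
            name=row[0],
            code=row[1],
            price=row[2],
            product_category=product_category,
        )
    )
    if len(products) >= BATCH_SIZE:
        Product.objects.bulk_create(products)
        products = []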
And this change has a tremendous effect. For 100,000 products, the command executes in just 3.5 seconds. You need to remember that the last loop iteration can still leave items in the products list (fewer than 5,000 items in our case). This needs to be handled after the loop:
if products:
    Product.objects.bulk_create(products)
The three changes we made together allowed us to increase the performance of the command by more than 96%. Code matters. Good code matters even more. The final command looks like this:
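The final command is not reproduced here either; a sketch that combines all three changes, under the same assumptions as the earlier snippets (app name, file layout, batch size), could look like this:

# products/management/commands/load_csv.py -- a sketch of the final command
# combining the iterator, the category cache, and batched bulk_create
import csv
import time

from django.core.management.base import BaseCommand

from products.models import Product, ProductCategory  # assumed app name

BATCH_SIZE = 5000  # assumed batch size


class Command(BaseCommand):
    help = "Load products from a CSV file"

    def add_arguments(self, parser):
        parser.add_argument("file_path", type=str)

    def handle(self, *args, **options):
        start = time.perf_counter()
        categories = {
            (category.name, category.code): category
            for category in ProductCategory.objects.all()
        }
        products = []
        with open(options["file_path"], "r") as csv_file:
            data = csv.reader(csv_file, delimiter=",")
            next(data)  # skip the header without loading the file into memory
            for row in data:
                key = (row[3], row[4])
                product_category = categories.get(key)
                if product_category is None:
                    product_category = ProductCategory.objects.create(
                        name=row[3], code=row[4]
                    )
                    categories[key] = product_category
                products.append(
                    Product(
                        name=row[0],
                        code=row[1],
                        price=row[2],
                        product_category=product_category,
                    )
                )
                if len(products) >= BATCH_SIZE:
                    Product.objects.bulk_create(products)
                    products = []
        if products:
            Product.objects.bulk_create(products)
        self.stdout.write(f"Done in {time.perf_counter() - start:.6f} seconds")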
With the code above, 1 million products are loaded in 30 seconds!
Pro Tip: Use Multi-Processing
Yet another idea for improving the loading speed of a massive CSV would be to use multi-processing; I will just present the idea here. In the command above, you could split the one big CSV file into multiple smaller batches (the best approach would be to use row indexes) and put each batch of work into a separate process. If you can use multiple CPUs on your machine, the scaling will be linear (2x CPUs, two times faster; 4x CPUs, four times faster).
Imagine that you have one million rows to process. Then the first process can take rows with the numbers 0–99999, the second takes rows with the numbers 100000–199999, and so on until the last one takes rows with the numbers 900000–999999.
The only downside here is that you need to have ten free CPUs.
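A minimal sketch of this idea, splitting the work by row ranges across a multiprocessing pool (the worker count, chunking, and per-row logic are assumptions carried over from the command above), might look like this:

# A sketch of row-range multi-processing, assumed to run inside a management command
import csv
from itertools import islice
from multiprocessing import Pool

from django import db


def load_range(args):
    file_path, start, stop = args
    with open(file_path, "r") as csv_file:
        data = csv.reader(csv_file, delimiter=",")
        next(data)  # skip the header
        for row in islice(data, start, stop):
            ...  # same per-row logic as in the command above


def load_in_parallel(file_path, total_rows, workers=10):
    chunk = total_rows // workers
    ranges = [(file_path, i * chunk, (i + 1) * chunk) for i in range(workers)]
    db.connections.close_all()  # let each forked worker open its own DB connection
    with Pool(processes=workers) as pool:
        pool.map(load_range, ranges)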
Summary
- You should avoid loading the file into memory. Use iterators instead.
- If you are processing the file line by line, avoid queries to the database in the for loop body.
- Do not save one element per loop. Use the bulk_create method.
Thanks for reading!
Source: https://betterprogramming.pub/3-techniques-for-importing-large-csv-files-into-a-django-app-2b6e5e47dba0