Is your data too big for your RAM?

Ed warothe
4 min read · Feb 18, 2021

How to analyze large data on your local machine

Photo by Ishan @seefromthesky on Unsplash

Introduction

I recently had to analyze a roughly 5 GB dataset on my local machine (Windows 7, Core CPU, 8 GB RAM). With the entire dataset taking up more than half of my machine's RAM, everything from loading the data to analysis was painfully slow. Considering that 5 GB is not what you would call big data, I did an internet search and tried several solutions to the problem. Below are the most common of the tools and techniques I came across.

Dask

Dask is a Python library for distributed/parallel computing. It has a task scheduler and uses syntax similar to the most widely used libraries in Python machine learning today: NumPy and pandas. This made it my first option. Unfortunately, operations were really slow and there wasn't much of an improvement over my earlier attempts. I had a closer look at Dask's documentation and, guess what? It makes clear that the library is aimed at data exceeding 100 GB, running on a cluster or a very powerful machine.
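If you do want to try it, the pandas-like API is easy to pick up. Here is a minimal sketch, with a file path and column names borrowed from the dataset described later in this post (purely illustrative):

import dask.dataframe as dd

# lazily partition the CSV instead of loading it into memory at once
ddf = dd.read_csv('train.csv')
# this only builds a task graph; nothing is read yet
daily_sales = ddf.groupby('date')['unit_sales'].sum()
# .compute() triggers the actual work
result = daily_sales.compute()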

Blaze

Blaze is a library that lets you query data across multiple databases. I did not try it, for two simple reasons: one, it is also optimized for big data (>100 GB); and two, at the time its integration with the pandas and NumPy libraries was not at a level I was comfortable with.

Vaex

Vaex is another out-of-core Python DataFrame library, and it was also extremely slow on my machine. Working out of core means data has to be fetched from the hard disk for reading, so read times are inevitably slower than reading data that already sits in memory.
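For reference, a minimal sketch of the Vaex out-of-core workflow might look like this (file path and column name are illustrative):

import vaex

# convert=True writes an HDF5 copy once, then memory-maps it on later runs
df = vaex.from_csv('train.csv', convert=True)
# aggregations stream from disk rather than loading everything into RAM
print(df.mean(df.unit_sales))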

SQLite

This was a very common recommendation in online forums. After unsuccessfully trying to load the data into my SQLite database directly, the only other option was to get the Python library Odo and load the data via pandas into the SQLite database. Obviously, this defeats the purpose of finding a faster alternative to pandas.
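For completeness, a minimal sketch of that pandas-to-SQLite route, loading the CSV in chunks (paths, table name, and chunk size are illustrative):

import sqlite3
import pandas as pd

conn = sqlite3.connect('train.db')
# stream the CSV in chunks and append each one to the same table
for chunk in pd.read_csv('train.csv', chunksize=1000000):
    chunk.to_sql('train', conn, if_exists='append', index=False)
conn.close()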

Scikit-Learn

sklearn supports incremental learning in its stochastic gradient descent classifier and regressor classes through the partial_fit method, which lets you train on one batch of data at a time. I am yet to try this method.
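A minimal sketch of what that could look like, combined with the chunked reading used later in this post (the choice of feature and target columns is purely illustrative):

import pandas as pd
from sklearn.linear_model import SGDRegressor

model = SGDRegressor()
# each chunk updates the model without holding the full dataset in memory
for chunk in pd.read_csv('train.csv', chunksize=1000000):
    X = chunk[['store_nbr', 'item_nbr']]
    y = chunk['unit_sales']
    model.partial_fit(X, y)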

PostgreSQL

PostgreSQL worked really well. I loaded the data into a database with psql's COPY command, which, although it took about an hour, was successful. With the data in the database I could query it, but read times were still too long, which necessitated indexing some of the columns. Partitioning was also an option, but since I was running version 9.6, where you have to set up a trigger in order to partition (unlike later versions), I chose instead to split the data into subtables for faster operations. The data could then be queried through psycopg for seamless analysis.

COPY train FROM 'C:\train.csv' DELIMITER ',' CSV HEADER;
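Once the table is loaded, indexing and querying from Python is straightforward. A minimal sketch with psycopg2 (connection details, index column, and query are illustrative):

import psycopg2

conn = psycopg2.connect(dbname='sales', user='postgres', host='localhost')
cur = conn.cursor()
# index a frequently filtered column to cut read times
cur.execute('CREATE INDEX IF NOT EXISTS idx_train_date ON train (date);')
conn.commit()
# pull one year of rows into memory for analysis
cur.execute("SELECT * FROM train WHERE date BETWEEN '2013-01-01' AND '2013-12-31';")
rows = cur.fetchall()
cur.close()
conn.close()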

Pandas

Eventually, this was the method I settled on, because the data was just small enough and the analysis techniques were easier and more versatile in pandas. I used chunking to read the data as an iterable, then split it into subtables by year using the date column. Downcasting some data types, for instance storing integer-valued float64 columns as int32, is extremely important, as it can dramatically reduce the space the data takes up. The subtables are then saved to my local drive in Parquet format rather than CSV, for faster read speeds (up to 50% faster).

import pandas as pd
import numpy as np
import time

# set each chunk at 1 million rows; the full dataset has 125 million rows
dtype = {'store_nbr': np.int16, 'item_nbr': np.int32,
         'unit_sales': np.float32, 'onpromotion': np.int8}
chunks = pd.read_csv('C:/pathway/train.csv', parse_dates=['date'], dtype=dtype,
                     chunksize=1000000)
type(chunks)  # TextFileReader: an iterator of DataFrame chunks, not a DataFrame

# stitch the chunks back together, logging the elapsed time per chunk
train = pd.DataFrame()
count = 0
start = time.time()
for chunk in chunks:
    train = pd.concat([train, chunk])
    print('chunk', count, 'time taken', time.time() - start)
    count += 1

# split into subtables by year using the date column
year1 = train[train['date'].between('2013-01-01', '2013-12-31')]
year2 = train[train['date'].between('2014-01-01', '2014-12-31')]
year3 = train[train['date'].between('2015-01-01', '2015-12-31')]
# persist each subtable to the local drive in Parquet format
year1.to_parquet('C:/pathway/year1.parquet')
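Each subtable can later be read back quickly for analysis; a memory_usage check is a handy way to confirm that the downcast dtypes survived the round trip (reading Parquet requires pyarrow or fastparquet to be installed):

year1 = pd.read_parquet('C:/pathway/year1.parquet')
year1.info(memory_usage='deep')  # reports dtypes and the true memory footprint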

Conclusion

I saw fit to write this article as a general guide to these libraries and methods. But, as always, what works for me might not be the best method for you. The reader should also note that this list is not exhaustive; for example, I did not try command-line tools to split the data or store it in a database, nor did I try loading the data bit by bit through a deep learning library.

Catch more stories similar to this one on my website.
