How to deal with large datasets using Kaggle while working on Google Colab
While working on a machine learning project, the dataset is sometimes really big. The problem with using such a dataset on Colab is that you have to upload it to the runtime, and every time the kernel restarts the upload is lost, so you have to upload it all over again.
One solution to this problem is to upload the dataset to Google Drive and access it from Colab, but Google Drive has a storage limit of 15 GB for free accounts.
There is one more elegant solution: upload the dataset to Kaggle and access it in Colab through the Kaggle API.
Here are the steps for using a Kaggle dataset on Google Colab:
1. Download kaggle.json: To use a Kaggle dataset, we need a Kaggle API key. After signing in to Kaggle, click on My Account in the user profile section. In the API section, click the “Create New API Token” link; this downloads a kaggle.json file containing your API credentials.
2. Upload the kaggle.json file to your Colab notebook.
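In a Colab cell, this upload can be done with the `files` helper from the `google.colab` package. A minimal sketch (the try/except is just so the same cell runs outside Colab without failing; the helper name is mine):

```python
def upload_kaggle_json():
    """Prompt for a file upload inside Colab; return the uploaded file names, or None elsewhere."""
    try:
        from google.colab import files  # only available inside a Colab runtime
    except ImportError:
        print("Not running in Colab; place kaggle.json in the working directory manually.")
        return None
    uploaded = files.upload()  # opens a browser file picker; choose kaggle.json
    return list(uploaded)
```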
3. Install the Kaggle API client:
!pip install -q kaggle
4. The Kaggle API client expects this file to be in ~/.kaggle, so move it there:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
5. Change the file permissions to avoid a warning on Kaggle tool startup:
!chmod 600 ~/.kaggle/kaggle.json
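Steps 4 and 5 can equally be done from Python with the standard library, which is handy outside a notebook. A sketch, assuming the default location the Kaggle client looks in (the function name is mine):

```python
import os
import shutil
from pathlib import Path

def install_kaggle_credentials(src="kaggle.json", kaggle_dir=None):
    """Copy kaggle.json into ~/.kaggle and restrict it to owner read/write."""
    kaggle_dir = Path(kaggle_dir) if kaggle_dir else Path.home() / ".kaggle"
    kaggle_dir.mkdir(parents=True, exist_ok=True)   # mkdir -p ~/.kaggle
    dest = kaggle_dir / "kaggle.json"
    shutil.copy(src, dest)                          # cp kaggle.json ~/.kaggle/
    os.chmod(dest, 0o600)                           # chmod 600 ~/.kaggle/kaggle.json
    return dest
```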
6. Get the dataset path and store it in the dataset_path variable. Go to https://www.kaggle.com/datasets and click on your dataset. I am using the ‘goodreadsbooks’ dataset uploaded by ‘jealousleopard’, so my dataset path is ‘jealousleopard/goodreadsbooks’.
dataset_path = "jealousleopard/goodreadsbooks"
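The path is just the last two segments of the dataset’s URL, so it can also be derived programmatically. A small helper for illustration (the function name is mine, and the URL format is assumed):

```python
def dataset_slug(url: str) -> str:
    """Return 'owner/dataset-name' from a Kaggle dataset URL."""
    parts = url.rstrip("/").split("/")
    return "/".join(parts[-2:])

dataset_path = dataset_slug("https://www.kaggle.com/datasets/jealousleopard/goodreadsbooks")
```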
7. Download a copy of the dataset locally:
!kaggle datasets download -d $dataset_path
That’s it! Now we can use this dataset in Colab.
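One note: the Kaggle CLI delivers the dataset as a .zip archive (you can also pass the --unzip flag to the download command). A sketch of extracting it with the standard library; the archive file name below is an assumption based on the dataset name:

```python
import zipfile
from pathlib import Path

def extract_dataset(zip_path, dest="data"):
    """Unzip a downloaded Kaggle archive into dest and return the extracted file names."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest)
    return sorted(p.name for p in dest.iterdir())

# e.g. extract_dataset("goodreadsbooks.zip")
```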
Here is the demo notebook link for the code above:
Feel free to clap if you find it helpful!
Follow my telegram channel to get awesome blogs, projects, and learning opportunities for Python, Machine Learning, and Data Science Stuff.
Keep Learning!!