A Suite of Data Tools for Beginners

14 September 2022

, posted by

Nida Amalia

I have too many tools, said no woodworker ever!
Lisa Stronzi

This article will share the tools used by Data Scientists in our daily work, and explore how they can also be helpful for other teams. Because Python is where we do most of our work, the tools we are using are also mainly collections of Python libraries.

What does a Data Scientist do?

Data Acquisition and Exploration
Model Development
Model Deployment
Monitoring

Life of a Data Scientist, day in – day out.

We will share some of the tools for the first 2 steps (Data Acquisition and Exploration, and Model Development) in this article.

How to get these tasks done efficiently?

Below are the tools in python to help us work efficiently with a variety of customisation

Before sharing about Python and its libraries, we will first go through SQL and its use for analysis and predictive modelling.

SQL

SQL is a standard language for storing, manipulating and retrieving data in databases. SQL can be used in database systems like MySQL, SQL Server, MS Access, Postgres, as well as data warehouse systems like BigQuery and Redshift.

The skills necessary to be a good data scientist include being able to retrieve and work with data, and to do that we need to be well versed in SQL, the standard language for communicating with database systems. Every organisation has a database and to read and retrieve data from it for data modelling, we use SQL either by querying using python, directly or other means.

General Tools

About Python and its libraries

jupyter notebook

The name, “jupyter”, comes from the core supported programming languages that it supports: Julia, Python, and R and “notebook” denotes a document that contains both code and rich text elements. Because of the mix of 2, it’s a great place to bring the interactive analysis code and outcome in one place.

How to install jupyter notebook

Using Anaconda: Install Python and Jupyter using the Anaconda Distribution, which includes Python, the Jupyter Notebook, and other commonly used packages for scientific computing and data science.
Using PIP: Install Jupyter using the PIP package manager used to install and manage software packages/libraries written in Python. – When using Jupyter, we can choose to execute individual cells rather than run the whole script all at once.

You will see a few examples of jupyter notebooks for libraries mentioned below.
For more details you can go to jupyter.org

Data Acquisition and Exploration

Jupyter Notebook for Data Exploration

One of the reason why Jupyter Notebook is often used is that we can run step by step codes and see the outcome, as well as to visualise the data. In this way, it helps us in the early phases of the project, especially when exploring data and when initially developing the model.

Import `pandas` as pd

pandas is one of the first libraries we use when we want to load the data. It creates a 2-dimension tabular data structure (called as DataFrame in pandas) from the database and interfaces with data using DataFrames, which have named columns for data manipulation. It is similar to a spreadsheet, a SQL table or the data.frame in R.

Source : https://www.datacamp.com/tutorial/tutorial-jupyter-notebook

pandas make it easier to explore and manipulate data by providing different features mentioned in image.

 💡 pandas is preferred over multiple other tools like SQL, R, etc.. because it is faster and easier to perform data analyses for given DataFrame/Series.

Import `NumPy` as np

While pandas helps us manipulate and analyze data in tabular format, NumPy helps us

To provide various method/function for Array, Metrics, and linear algebra.
It stands for Numerical Python and provides lots of useful features for operations on n-arrays and matrices in Python.
This library provides vectorisation of mathematical operations on the NumPy array type, which enhances performance and speeds up the execution.
It’s very easy to work with large multidimensional arrays and matrices using NumPy.

A small example of data handling using numpy

💡 pandas vs NumPy - NumPy performs better than pandas when it comes to complex mathematical calculations on multidimensional arrays with data lesser than 50k rows. We often use both tools when manipulating and exploring data

Data Visualisation

import `matplotlib.pyplot` as plt

One important part of storytelling through data is “Correct visualisation”.

There are multiple libraries in python which helps us draw plots with an option to customise and save the image. matplotlib is one of them!

matplotlib is a collection of command style functions that make matplotlib work like MATLAB.

💡 matplotlib works efficiently with DataFrames and arrays. There are other libraries like seaborn , plotly which are equally used for interactive and customizable plots, and often provide better looking graphs

Model Development

import `scikit-learn` as sklearn

There are multiple open source and paid tools available for predictive analytics, one of which is scikit-learn library used by many Data Scientists to build ML models since it is:

Simple and efficient tools for predictive data analysis
Accessible to everybody, and reusable in various contexts
Built on numpy, scipy, and matplotlib
Open source, commercially usable – BSD license

sklearn provides access to high-quality, easy-to-use, implementations of popular algorithms so scikit-learn is a great place to start.

Source: https://github.com/microsoft/hummingbird/wiki/Supported-Operators

Example of logistic regression model in sklearn

While sklearn provides a large library of model types, data scientists also make frequent use of other packages, especially if they are creating more specialised types of machine learning models. For example, there are libraries for creating tree-based models which have better performance especially when training with large amount of data, such as CatBoost and XGBoost. Neural network models, which are also very popular especially for natural language processing and computer vision models, are commonly built using Tensorflow or PyTorch .

Examples of packages for tree-based models

Model built in Catboost. It shows the performance metric of train and test datasets as samples are included.

Examples of neural network model packages

Neural network model built in Tensorflow and visualised using Tensorboard

💡 sklearn is one of the most commonly used library and provides a good baseline for learning machine learning and experimenting with many basic models.

Conclusion

Python libraries are preferred by data teams as it is easier and faster to perform data analytics tasks all in one place. Especially, for people who prefer MS Excel, they can easily switch to these python libraries to fast track their work when there is big data involved or otherwise too.

You can refer to below documents for the libraries

pandas – to analyze DataFrame
numpy – to compute numerical operations on multidimensional array
matplotlib – to visualize data
scikit-learn – to conduct predictive analytics

Written by Mitali Chotrani

If you are passionate about data and want to be involved in making the future of finance accessible to businesses of all scales, let’s be part of our team. Find out more about our available vacancy on this link.

Post Views: 6,690

Share

Nida Amalia

6+ years working in Content Marketing and SEO in various industries such as Travel, Health Tech, SaaS, Fintech, and Education Technology.

Reach out to unlimit your business

Fresh Resources

Fazz Rapid Relief for Businesses in Need

Company

A Suite of Data Tools for Beginners

, posted by

Nida Amalia

What does a Data Scientist do?

How to get these tasks done efficiently?

General Tools

Data Acquisition and Exploration

Jupyter Notebook for Data Exploration

Import pandas as pd

Import NumPy as np

Data Visualisation

import matplotlib.pyplot as plt

Model Development

import scikit-learn as sklearn

Examples of packages for tree-based models

Examples of neural network model packages

Conclusion

Nida Amalia

Reach out to unlimit your business

Fresh Resources

Fazz rapid relief for businesses in need

SME’s guide to survive economic uncertainty

SME Financial Access in Singapore and Indonesia

Built for growth

Perusahaan

Hubungi Kami

Business

Agent

Business

Agen Keuangan

Merchant

Business

Agent

Business

Billfazz

Business

Agent

Business

Where is your business registered?

Account you would like to open?

Browse by product

Import `pandas` as pd

Import `NumPy` as np

import `matplotlib.pyplot` as plt

import `scikit-learn` as sklearn