Machine learning pipeline with pyspark | pyspark ml pipeline example

In this article I illustrate how to perform machine learning classification in PySpark with a Pipeline. A pipeline is a sequence of stages used to perform a specific task, in which the output of one stage acts as the input to the next. A machine learning pipeline is composed of multiple stages such as data cleaning, filling of missing values, encoding, modelling and evaluation. It is a more organised and structured way to code, and it helps speed up the process by automating the workflows and chaining them together.
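To make the idea of chained stages concrete, here is a minimal sketch in plain Python rather than the PySpark Pipeline API. The stage functions (fill_missing, encode_sex), the run_pipeline helper and the three sample rows are made up for illustration; the only point is that the output of each stage becomes the input of the next.

```python
from functools import reduce


def fill_missing(rows):
    """Stage 1: replace missing ages with the mean of the known ages."""
    known = [r["age"] for r in rows if r["age"] is not None]
    mean_age = sum(known) / len(known)
    return [{**r, "age": mean_age if r["age"] is None else r["age"]} for r in rows]


def encode_sex(rows):
    """Stage 2: encode the categorical 'sex' column as 0 (male) or 1 (female)."""
    return [{**r, "sex": 1 if r["sex"] == "female" else 0} for r in rows]


def run_pipeline(stages, data):
    """Feed the output of each stage into the next one, in order."""
    return reduce(lambda current, stage: stage(current), stages, data)


# Hypothetical sample rows, loosely shaped like passenger records.
rows = [
    {"age": 22.0, "sex": "male"},
    {"age": None, "sex": "female"},
    {"age": 38.0, "sex": "female"},
]

result = run_pipeline([fill_missing, encode_sex], rows)
for row in result:
    print(row)
```

Adding another step, such as a model-fitting stage, would only mean appending one more function to the list of stages.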

What is Big Data?

Big data refers to vast, complex datasets from various sources, including social media, sensors, and more. The features of these datasets are often summarised as the 5 V’s:

Volume: The data is massive, often terabytes to exabytes, beyond what traditional systems can handle. Companies that process or analyze a huge number of transactions per unit time, e.g. Walmart, fall into this category.

Velocity: Data is generated at high speed, as with social media data or sensor data.

Variety: The data is diverse, ranging from structured databases to unstructured text and images.

Veracity: The quality of the data is uncertain or unreliable. Data available from the web is noisy and chaotic, with missing values or inconsistent entries.

Value: Insights are extracted for data-driven decisions. The data can generate the business insights needed to take important decisions or to provide business intelligence.

Tools like Hadoop, Spark, NoSQL databases, and machine learning help analyze big data for industries like business, healthcare, finance, and marketing.

Introduction to Spark

Spark is an open source big data analytics framework which started in 2009 as a small project at UC Berkeley’s AMPLab to improve on the performance of Hadoop. Spark uses in-memory computation, in contrast to Hadoop MapReduce, which writes its intermediate files to persistent storage. Thanks to in-memory computation, Spark can be up to 100 times faster than Hadoop MapReduce for some workloads. Spark became a top-level Apache project in February 2014.

Spark does not have its own storage. It can use Hadoop HDFS, cloud storage on AWS or GCP, or other storage systems. Spark provides a user-friendly API in multiple programming languages: Python, Scala, R and Java.



Spark has a master-slave architecture. A master node, or driver, acts as a coordinator for multiple worker nodes, or executors, which perform the actual processing on the data. SparkContext (sc) is the entry point for working with the Spark driver. It is used to create RDDs (Resilient Distributed Datasets), which are immutable chunks of data formed by partitioning the datasets imported into Spark. Operations on RDDs are processed in parallel wherever possible, for example when generating summarisations. In the Spark shell, the SparkContext (sc) is created automatically by the environment. In a standalone PySpark program, the SparkContext needs to be initialized explicitly.
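The driver and executor roles can be pictured with a small plain-Python analogy. This is not Spark code: the partition and partial_sum helpers and the sample numbers are hypothetical, and a thread pool merely stands in for the executors.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(data, num_partitions):
    """Split a list into roughly equal, immutable chunks (tuples)."""
    size = -(-len(data) // num_partitions)  # ceiling division
    return [tuple(data[i:i + size]) for i in range(0, len(data), size)]


def partial_sum(chunk):
    """Work done by one 'executor' on its own partition."""
    return sum(chunk)


values = [3, 1, 4, 1, 5, 9, 2, 6]
chunks = partition(values, num_partitions=4)

# The 'driver' hands one partition to each worker and combines the partial results.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(partial_sum, chunks))

total = sum(partials)
print(chunks)
print(partials, total)
```

In real Spark the partitions live on different machines and the scheduling is far more involved, but the split, process and combine pattern is the same idea.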


The spark variable is the entry point to the Spark DataFrame API. It is created as an instance of SparkSession, and it can also be used to execute SQL queries in Spark.










PySpark is the Python API for working with Spark. You need to import the pyspark module in Python and then create the environment for working with the Spark framework. The instances of SparkContext and SparkSession need to be created explicitly in Python code. For PySpark to work, Spark must be installed on your machine, and the installed pyspark package must have the same version as that Spark installation.








The dataset used is the Titanic dataset from Kaggle. It has both train and test datasets.


Keyword planner tools


Why plan your keywords?


When you blog regularly, posting at least 2 to 3 days a week, you may sometimes run out of ideas. At other times you may be a passionate writer or an efficient professional in your field who wants to document things as you do them. So you post articles on your blog, e.g. technical coding, recipes, travel experiences, photography shoots and so on, without prior planning.

As a blogger, it is very common to find that some posts do not do well in spite of promotion on social media channels or even paid promotion, while others readily rank in search engines and get displayed on the first page. The posts that do well get a lot of organic traffic from different sources. So how can we bring consistency and reduce the chance factor a little? The answer is to plan and focus on your keywords.

You can find the trending keywords related to a specific topic, or search for relevant or high-ranking keywords, using various keyword analyzers. Many are free or offer a trial version, at least for first-time users.

Tools to plan your keywords


I will discuss three tools which I use for planning my keywords. Even so, I may upload some posts despite a low score, just because I loved writing the article, loved teaching the topic, have thorough knowledge of it, or it is purely passion driven. If you decide not to plan your keywords, then you may go for a keyword analysis tool, which I plan to discuss in my next article.

Let us first understand the metrics used for judging keywords.

Keyword Difficulty score or competition score:


This metric is expressed as a percentage and indicates how difficult it is for the keyword to rank in the search engine. So you should choose a keyword with a lower competition or difficulty score. The score depends on the domains that already contain posts on this keyword, and on how strong those domains are in terms of the number of backlinks they have.


 Search volume:


This is a number that changes dynamically, depending on the average number of times a search query containing the keyword is entered in the search engine. It can be shown as the average number of searches per month or, in some tools, as a trend line showing the increase or decrease in search volume over time.


Suggested bid prices


The Keyword Planner provides a suggested price to bid on keywords in Google Ads (formerly AdWords). It acts as an indication of the value of these keywords in the search engine.
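As a rough illustration of how these metrics might be combined, the sketch below shortlists keywords that have a low difficulty score and a reasonable search volume. The keyword figures and the two thresholds are invented purely for the example; real values would come from whichever planner tool you use.

```python
def shortlist(keywords, max_difficulty, min_searches):
    """Keep keywords that are easy enough to rank for and searched often enough."""
    picked = [
        k for k in keywords
        if k["difficulty"] <= max_difficulty and k["monthly_searches"] >= min_searches
    ]
    # Easiest keywords first.
    return sorted(picked, key=lambda k: k["difficulty"])


# Made-up numbers, for illustration only.
keywords = [
    {"keyword": "prawn recipe", "difficulty": 62, "monthly_searches": 9000},
    {"keyword": "jalebi recipe", "difficulty": 28, "monthly_searches": 5400},
    {"keyword": "python coding", "difficulty": 45, "monthly_searches": 12000},
    {"keyword": "diwali recipes", "difficulty": 30, "monthly_searches": 800},
]

chosen = shortlist(keywords, max_difficulty=50, min_searches=1000)
chosen_names = [k["keyword"] for k in chosen]
print(chosen_names)
```

Where to set the thresholds is a judgment call that depends on how strong your own domain is.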


How can we check keyword traffic with a free tool?


To plan your keywords with Google Keyword Planner, which is a free tool, you have to create a Google Ads account, specifying your domain details. The account is meant for creating Google ads to promote your posts, but it can also be used as a keyword planner: the tool displays the monetary value of the keyword, the competition score and the monthly search volume.

[Figure: Home page of Google Keyword Planner]

[Figure: Researching the keyword “prawn recipe”]

[Figure: Researching the keyword in your specific domain]
  • Google trends

    It is a tool which shows how the search volume for a specific keyword has grown over time. It shows a graph representing the growth or decline of interest in that keyword. I have included some interesting examples showing a growing trend and seasonal trends for specific topics. The trends of keywords related to the search are also shown, either as a percentage rise or as “Breakout”. If you see “Breakout” instead of a percentage, it means that the search term grew by more than 5000%.
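The percentage and “Breakout” labels come down to simple arithmetic, sketched below. Google Trends reports relative interest rather than raw counts, so the growth_label helper and the numbers passed to it are hypothetical; only the 5000% cut-off is taken from the description above.

```python
def growth_label(previous, current):
    """Return the percentage change as text, or 'Breakout' above 5000% growth."""
    if previous <= 0:
        raise ValueError("previous must be positive to compute a percentage")
    growth = (current - previous) / previous * 100
    return "Breakout" if growth > 5000 else f"{growth:+.0f}%"


print(growth_label(200, 500))
print(growth_label(10, 600))
```

With these inputs the first term grew by 150% and the second by 5900%, so only the second is labelled as a breakout.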

[Figure: Google Trends home page]

[Figure: Trends of the search term “Prawn recipe”]

[Figure: Growth of the search term “Python coding”, which shows stable and increasing interest over time]

[Figure: Trend for “Diwali recipes”, which shows a seasonal spike only during the festival]


Hoth is a keyword planner tool which is used to find high-volume keywords with a low competition score.

[Figure: Hoth keyword planner home page]

[Figure: Hoth keyword research results for “Jalebi recipe”, showing a fair volume and a low competition score]






VidIQ is a free keyword research tool, available as a Chrome extension, which can be downloaded, added to your Chrome browser and used to analyze keywords for YouTube videos. The upgraded paid version also gives suggestions and recommendations for your keyword research and access to keyword trends. Add the extension to Chrome and search for the keyword you plan to make a video on. Analyze the search volume and competition score discussed above. It also gives an overall score which indicates the opportunity of your keyword; this should be above 40 for maximum benefit.

[Figure: VidIQ home page]

[Figure: Keyword research in VidIQ]






Anonymous PL/SQL block

Structure of PL/SQL block

Anonymous Block:

The anonymous block is the simplest unit in PL/SQL. It is called an anonymous block because it is not saved in the database. It is a PL/SQL block without a name.

Named Block:

A named block starts with the HEADER section, which specifies the name and the type of the block. There are 2 types of named blocks, namely:

Procedure: It is a collection of statements which collectively perform a certain task. It receives values through parameters and can return one or more values through parameters.

Function: It is a series of statements performing a specific task and returning only one value.

Structure of a PL/SQL Block:

Each PL/SQL program consists of SQL and PL/SQL statements which form a PL/SQL block. A PL/SQL block consists of four sections:

The Header section.

The Declaration section.

The Execution section.

The Exception (or Error) Handling section.

HEADER
<Type and name of block>

DECLARE
<All variables and cursors are declared here>

BEGIN
<All programming logic, queries and program statements are written here>

EXCEPTION
<All error handling code is written here>

END;
--It ends the program


Creation of an anonymous PL/SQL Block:

The anonymous block is a type of PL/SQL block which has no name associated with it. In fact, the anonymous block is missing the header section altogether.

Instead it simply uses the DECLARE reserved word to mark the beginning of its optional declaration section.

For example,

To create a PL/SQL block which inserts the following 3 records into the table “Prod_bill”:

Record #1: B1, 02-FEB-2012, Rohini, Washing Machine, 1, 10000

Record #2: B2, 25-MAR-2012, Mahesh, Refrigerator, 1, 12000

Record #3: B3, 30-MAR-2012, Arpita, Mixer, 2, 8000


SQL vs PL/SQL engine

Overview of PL/SQL

PL/SQL stands for Procedural Language extension of SQL. PL/SQL combines SQL with the features of procedural programming languages. Oracle Corporation developed this language to enhance the capabilities of SQL.
Oracle uses a PL/SQL engine to process PL/SQL statements. PL/SQL code can be stored on the client side or on the server side. Oracle applications can be built on a client/server architecture.
PL/SQL programs can be written on the client side, and queries can be written to manipulate or retrieve data from the server on the server side.

Difference between SQL and PL/SQL

  1. SQL is a Structured Query Language which fires a query to create or modify database objects at the server. PL/SQL is a procedural extension of SQL which adds programming elements like variables, loops, operators etc. to SQL queries.
  2. SQL is a data-oriented language which selects and manipulates sets of data.
    PL/SQL is a procedural language used to create applications.
  3. SQL queries get executed one by one, each causing two trips between the client and the server: one for firing the request and another for bringing back the result set. PL/SQL can execute multiple SQL queries together in a block in one go.

Benefits of PL/SQL:

  1. Structured blocks: PL/SQL consists of blocks of code, which can be nested within each other. Each block forms a unit of a task or a logical module. These blocks can be stored in the database and support reusability.
  2. Integration with SQL: The PL/SQL language is tightly integrated with SQL. The user does not have to perform data conversion between SQL and PL/SQL data types. This integration saves both learning time and processing time. PL/SQL lets you use all the SQL data manipulation, cursor control, and transaction control commands, as well as all the SQL functions and operators.
  3. Full Portability: Applications written in PL/SQL can run on any operating system and platform where the Oracle database runs.
  4. Procedural Language: PL/SQL includes procedural language constructs such as control flow statements (e.g. IF-ELSE statements) and loops (e.g. FOR loops).
  5. Efficient Error handling mechanism: PL/SQL handles errors or exceptions effectively during the execution of a PL/SQL program. Run time errors can be trapped with Exception handling statements.
    Once an exception is caught, specific actions can be taken depending upon the type of the exception.
  6. Security: PL/SQL stored procedures protect the application code from tampering, hide the internal details, and restrict user’s access.
  7. Object oriented programming support: Oracle object types are user-defined types that make it possible to model real-world entities such as customers and purchase orders as objects in the database.
    It supports encapsulation, modularity, maintainability and re-usability.
  8. Better efficiency and performance: The PL/SQL engine processes multiple SQL statements together as a single block, thereby reducing network traffic.
SQL vs PL/SQL engine

Common Language Runtime

Common language runtime: What is an application domain?

The common language runtime (CLR) establishes a boundary around objects created within the same application scope. All objects which belong to the same application or website are kept in memory isolated from other applications, so that they do not conflict with each other and corrupt data. For example, a variable “x” created in the application “Bank_app” is distinguished from the variable “x” in the application “Employee_app”.

The CLR helps isolate objects created in one application from those created in other applications, so that run-time behavior is predictable and variables do not display garbage values (which is common in C/C++).

Application Domain:

Application domains provide a flexible and secure method of isolating running applications. The common language runtime uses application domains to provide isolation, and hence security, between applications. Several applications which need to communicate with each other can be executed in the same process and can be accessed without switching between processes. The ability to run multiple applications within a single process increases the scalability and efficiency of the server. Using application domains ensures that code running in one domain cannot affect other applications in the process.

In the example below, an application domain called “MyDomain” is created and two applications are launched in the same domain.

  • The first application is an employee Windows application for adding data to and retrieving data from the employee database named “empproj”.
  • The second application, named “consoleadd”, is a console application which prints the sum of two long integers. Both applications have been compiled independently beforehand to create .exe files.

Then, with the help of the following C# code, an application domain is created and both applications are loaded into the same application domain and executed.

using System;

namespace ConsoleApplication1
{
    public class AppDomain1
    {
        public static void Main()
        {
            Console.WriteLine("Creating new AppDomain.");
            // Create a child application domain inside the current process
            AppDomain domain = AppDomain.CreateDomain("MyDomain");

            Console.WriteLine("Host domain: " + AppDomain.CurrentDomain.FriendlyName); // prints parent domain name
            Console.WriteLine("child domain: " + domain.FriendlyName); // prints child domain name

            // Run both compiled applications inside the same child domain
            domain.ExecuteAssembly(@"C:\Users\Indrani Sen\Documents\Visual Studio 2008\Projects\Empproj\Empproj\bin\Debug\Empproj.exe");
            domain.ExecuteAssembly(@"C:\Users\Indrani Sen\Documents\Visual Studio 2008\Projects\consoleadd\consoleadd\bin\Debug\consoleadd.exe");
        }
    }
}





Application domain

Entity Relationship model case study

Case study

“Saboo-Car-Rental-Services” is a car rental showroom which wants to automate its business.
They offer different types of cars on rent, such as small car, SUV and MUV.
Each type of car has a maximum seating capacity and a tariff per kilometer.
The management wants the system to show the number of cars of each type available, for serving inquiries. The system should have a provision for booking a car.
Before a booking is made, the customer needs to provide personal information and driving license details.
A booking is typically stored as booking date, date of rent, duration in hours and type of vehicle.
Once the booking is done, a unique booking number is provided to the customer for reference, which they need to produce when they come to collect the car.
A new transaction record is created for each booking after the car is returned, specifying the kilometers used, the amount to be paid and the date of payment.


Cross language interoperability

How does a program get loaded and executed in the Dot Net framework? (CLR)

Loading and execution of programs:

The source code is first compiled by the respective language compiler into an intermediate, machine-independent language called MSIL (Microsoft Intermediate Language). It is then the responsibility of the CLR (Common Language Runtime) to load the classes concerned into memory and execute them.

The Dot Net framework supports working with multiple programming languages (cross-language interoperability), since each language has a compiler that transforms its source code to MSIL.

  • The CLR requires the metadata containing the information of the project concerned to determine which compiler it requires.
  • After the information regarding the compiler is obtained with the aid of the Metadata Engine, the classes referred to by the program are loaded from the Base Class Library and linked with the application.
  • Program execution is dynamic with the JIT (Just-in-time) compiler, i.e. the classes are instantiated only at run time.
  • The Just-in-time compiler is responsible for transforming MSIL into machine-dependent code for executing the program on the target machine.

common language runtime

Dot Net Framework Evolution

The Dot Net framework is a software development environment which aids faster software development with features such as cross-language interoperability, the common language runtime, just-in-time compilation, dynamic coding etc. The Dot Net Framework is not a programming language. It is a platform that provides tools and technologies to develop Windows, Web and Enterprise applications.


Evolution of .NET

.NET technology was introduced by Microsoft to capture market share from Sun Microsystems’ Java. Java had become an extremely popular language, especially for web development, and Microsoft had only VC++ and VB to compete with it. With the world more and more dependent on the Internet and Java-related tools, Microsoft seemed to be losing the battle. The VC++ foundation classes were difficult to learn, while Visual Basic, though popular, was too simple for serious applications. Microsoft developed the OLE and COM technologies before Dot Net.


OLE Technology

OLE stands for Object Linking and Embedding, which enables one to create objects with one application and then link or embed them in a second application. For example, Microsoft Excel worksheets can be embedded in a Visual Basic form or in a PowerPoint presentation, such that when the original object changes, the change is reflected in the linked application. Embedded objects retain their original format and links to the application that created them. This technology enables users to develop applications which require interoperability between various products.

COM Technology

The Component Object Model (COM), on the other hand, makes it possible to divide software into independent, executable and reusable components. These components can be collected to form a library which can be referenced in future projects if required.




What makes COM and OLE unique is that the architecture is based on reusable design and reusable code. One of the fundamental concepts in COM, the interface, is based on the idea of design reuse. COM and OLE introduce a programming model based on reusable designs.

Microsoft started development of the .NET Framework in the late 1990s, secretly, under the name Next Generation Windows Services (NGWS) and under the direct supervision of Mr. Bill Gates. In July 2000, Microsoft announced a whole new software development framework for Windows called .NET at the Professional Developers Conference (PDC). Microsoft also released a PDC version of the software for developers to test. By late 2000 the first beta versions of .NET 1.0 were released. The first version of the .NET Framework was released on 13 February 2002, bringing managed code to Windows NT 4.0, 98, 2000, ME and XP.



Entity Relationship diagram

Entity Relationship Diagram: all about Relationships

When we design entity sets during database design through an Entity Relationship Diagram, there have to be some kind of connections between them too. These connections are called “relationships”, a term derived from the real world. A relationship can be classified according to its degree or connectivity. The following are the important points to remember about relationships in Entity Relationship diagrams:

  • A relationship is an association between entity sets.
  • The entity sets that are involved in the relationship are also known as participants.
  • Relationships are named and the name is generally a verb.
  • Relationships are always two-way, i.e. they operate in both directions, so the N:1 between Students and Teachers could also be thought of as a 1:N between Teachers and Students.
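
The last point can be seen with a small Python sketch. The student and teacher names are hypothetical, and the dictionaries only stand in for the two entity sets: the same relationship is stored once as N:1 and then read back as 1:N.

```python
from collections import defaultdict

# N:1 view: each student is taught by exactly one teacher (hypothetical names).
teacher_of = {"Asha": "Mr. Rao", "Bala": "Mr. Rao", "Chitra": "Ms. Iyer"}

# 1:N view of the same relationship: each teacher teaches many students.
students_of = defaultdict(list)
for student, teacher in teacher_of.items():
    students_of[teacher].append(student)

print(dict(students_of))
```

No new information is added when the mapping is inverted, which is why the two directions describe one and the same relationship.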