Machine learning pipeline with pyspark | pyspark ml pipeline example

In this article I illustrate how to perform machine learning classification in PySpark with a Pipeline. A pipeline is a sequence of stages used to perform a specific task, where the output of one stage acts as the input to the next. A machine learning pipeline is composed of multiple stages, such as data cleaning, filling of missing values, encoding, modelling and evaluation. A pipeline is a more organised and structured way to code, and it helps to speed up the process by automating the workflows and synchronizing them.
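The idea of chaining stages can be sketched in plain Python before bringing Spark in. This is only an illustration: the stage functions (fill_missing_age, encode_sex), the run_pipeline helper and the tiny in-memory records are hypothetical, and the sketch does not use the pyspark.ml API.

```python
from functools import reduce
from typing import Callable, Dict, List

Record = Dict[str, object]
Stage = Callable[[List[Record]], List[Record]]


def fill_missing_age(rows: List[Record]) -> List[Record]:
    """Stage 1: replace a missing age with the mean of the known ages."""
    known = [r["age"] for r in rows if r["age"] is not None]
    mean_age = sum(known) / len(known)
    return [{**r, "age": mean_age if r["age"] is None else r["age"]} for r in rows]


def encode_sex(rows: List[Record]) -> List[Record]:
    """Stage 2: encode the categorical column as a number."""
    codes = {"male": 0, "female": 1}
    return [{**r, "sex": codes[r["sex"]]} for r in rows]


def run_pipeline(stages: List[Stage], rows: List[Record]) -> List[Record]:
    """Feed the output of each stage into the next one, in order."""
    return reduce(lambda data, stage: stage(data), stages, rows)


# Hypothetical toy records, not taken from the Titanic dataset.
raw = [
    {"age": 20.0, "sex": "male"},
    {"age": None, "sex": "female"},
    {"age": 40.0, "sex": "female"},
]
clean = run_pipeline([fill_missing_age, encode_sex], raw)
```

Each stage returns new records instead of modifying its input, so the order of the stages is the only thing that connects them.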

What is Big Data?

Big data refers to vast, complex datasets from various sources, including social media, sensors, and more. The features of these complex datasets can be referred to as the 5 V’s:

Volume: The data is massive, often terabytes to exabytes, beyond what traditional systems can handle. Companies that process or analyze a huge number of transactions per unit time, e.g. Walmart, fall in this category.

Velocity: Data is generated at high speed, as with social media data or sensor data.

Variety: It’s diverse, from structured databases to unstructured text and images.

Veracity: Dealing with uncertain or unreliable data quality. Data available from the web is noisy and chaotic, with missing values or inconsistent records.

Value: Extracting insights for data-driven decisions. The data can generate the business insights required to take important decisions or to provide business intelligence.

Tools like Hadoop, Spark, NoSQL databases, and machine learning help analyze big data for industries like business, healthcare, finance, and marketing.

Introduction to Spark

Spark is an open source big data analytics framework which started in 2009 as a small project at Berkeley’s AMPLab to improve on the performance of Hadoop. Spark uses in-memory computation, in contrast to Hadoop MapReduce, which writes its intermediate files to persistent storage. Because of this, Spark can run up to 100 times faster than Hadoop MapReduce on workloads that fit in memory. Spark became a top-level Apache project in February 2014.

Spark does not have its own storage. It can use Hadoop HDFS or cloud storage on AWS, GCP or other providers. Spark provides a user-friendly API in multiple programming languages, including Python, Scala, R and Java.

SparkContext

Spark has a master-worker architecture. A master node, the Driver, acts as a coordinator for multiple worker nodes, whose executors perform the actual processing on the data. SparkContext (sc) is the entry point for working with the Spark Driver. It is used to create RDDs (Resilient Distributed Datasets), which are immutable collections formed by partitioning the datasets imported into Spark. Operations on RDDs are processed in parallel across the partitions wherever possible. In the interactive Spark shell, the SparkContext (sc) is created automatically by the environment. In a standalone PySpark program, the SparkContext needs to be initialized explicitly.
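The partition-then-combine idea behind RDDs can be mimicked in plain Python. The sketch below is not Spark code: partition and parallel_sum are hypothetical helpers, and a thread pool stands in for the executors.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def partition(data: Sequence[int], num_partitions: int) -> List[List[int]]:
    """Split the data into roughly equal chunks, one per worker."""
    if not data:
        return []
    size = -(-len(data) // num_partitions)  # ceiling division
    return [list(data[i:i + size]) for i in range(0, len(data), size)]


def parallel_sum(data: Sequence[int], num_partitions: int = 4) -> int:
    """Sum each partition on its own worker, then combine on the 'driver'."""
    chunks = partition(data, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_sums = list(pool.map(sum, chunks))
    return sum(partial_sums)


total = parallel_sum(range(1, 101))
```

The workers never see each other's chunks; only the small partial results travel back to be combined.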

Spark

The spark variable is the entry point to the Spark DataFrame API. It is created as an instance of SparkSession and can also be used to execute SQL queries in Spark.


Pyspark

PySpark is the Python API for working with Spark. One needs to import the pyspark module in Python and then create the environment for working with the Spark framework. The instances of SparkContext and SparkSession need to be created explicitly in Python code. For PySpark to work, Spark should be installed on your machine, and the installed pyspark package should have the same version as that Spark installation.
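A minimal sketch of that explicit setup follows. It assumes pyspark is installed and matches the local Spark version; the function name create_spark, the application name "titanic-pipeline" and the local[*] master are placeholder choices, not values from the article. The import is kept inside the function so that the definition can be loaded even where pyspark is missing.

```python
def create_spark(app_name: str = "titanic-pipeline"):
    """Create the SparkSession explicitly and return it with its SparkContext."""
    # Imported lazily: this only works where pyspark is installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[*]")   # assumption: run locally on all available cores
        .appName(app_name)
        .getOrCreate()
    )
    sc = spark.sparkContext   # the SparkContext behind the session
    return spark, sc
```

Calling spark, sc = create_spark() gives the two entry points described above, and spark.stop() releases them when the work is done.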

Dataset

The dataset used is the Titanic dataset from Kaggle. It has both train and test datasets.

https://www.kaggle.com/competitions/titanic/data


September 29, 2020 | September 30, 2020

Keyword planner tools

Why plan your keywords?

When you blog regularly, posting at least 2 to 3 days a week, you may sometimes run out of ideas. At other times you may be a passionate writer, or an efficient professional in your field who wants to document things as you do them. So you post articles on your blog, e.g. technical coding, recipes, travel experiences or photography shoots, without prior planning.

As bloggers, it is a very common experience to find some of our posts not doing well in spite of promotion on social media channels or even paid promotions, while others readily rank in search engines and get displayed on the first page. The posts that do well get a lot of organic traffic from different sources. So how can we bring consistency and reduce the chance factor a little? The answer is to plan and focus on your keywords.

You can find the trending keywords related to a specific topic, or search for relevant or high-ranking keywords, using various keyword analyzers, which are free or offer a trial version, at least for first-time users.

Tools to plan your keywords

I will discuss three tools which I use for planning my keywords, though I may still upload some posts despite a low score, just because I loved writing the article or teaching it, I have thorough knowledge about it, or it is purely passion driven. If you decide not to plan your keywords, then you may go for a keyword analysis tool, which I am planning to discuss in my next article.

Let us first understand the metrics or the measurements used for judging the keywords

Keyword Difficulty score or competition score:

This metric is represented as a percentage and depicts how difficult it is for the keyword to rank in the search engine. So you need to choose a keyword which has a lower competition or difficulty score. The score depends on the domains which contain posts on this specific keyword and on how strong those domains are with respect to the number of backlinks they have.

Search volume:

This is represented by a number which changes dynamically, depending on the average number of times a search query containing the specific keyword is searched in the search engine. It can also be represented by the average number of searches per month or, in some tools, by a trend line showing the increase or decrease in search volume over time.

Suggested bid prices

The Keyword Planner provides a suggested price to bid on keywords in AdWords. It acts as a representation of the value of these keywords in the search engine.

How can we check keyword traffic with a free tool?

Google Ads keyword planner

To plan your keywords with this free tool, you have to create a Google Ads account specifying your domain details. The account is used to create Google ads to promote your posts, but it can also be used as a keyword planner: the tool displays the monetary value of the keyword, the competition score and the monthly search volume.

google keyword planner — Home page of Google keyword planner

google planner — Researching the keyword in your specific domain

Google trends

It is a tool which shows the growth of search volume for a specific keyword over time. It shows a graph representing the growth or decline of interest in that keyword. I have included some interesting examples showing a growing trend and a seasonal trend for specific topics. The keywords related to the search are also listed, each with either a percentage rise or the label “Breakout”. If you see “Breakout” instead of a percentage, it means that the search term grew by more than 5000%.

Google trends — Google keyword planner Home page

python coding: growth of the search term “Python coding” over time, which shows stable and increasing interest

diwali recipes: the trend shows a seasonal spike only during the festival

Hoth: the Google keyword planner tool

Hoth is a keyword planner tool which is used to find high-volume keywords with a low competition score.

Hoth Keyword research — Hoth keyword showing the results of “Jalebi recipe” showing a fair volume and low competition score

VidIQ

VidIQ is a free keyword research tool, available as a Chrome extension, which can be downloaded, added to your Chrome browser and used for analyzing your keywords for YouTube videos. You can also get suggestions and recommendations for your keyword research, and access keyword trends in the upgraded paid version. Add the extension to Chrome and search for the keyword you plan to make a video on. Analyze the search volume and competition score discussed above. It also gives an overall score which indicates the opportunity of your keyword; this should be above 40 to maximize benefits.

July 4, 2017

Structure of PL/SQL block

Anonymous Block:

The anonymous block is the simplest unit in PL/SQL. It is called an anonymous block because it is not saved in the database. It is a PL/SQL block without a name.

Named Block:

A named block is a type of block which starts with the HEADER section, which specifies the name and the type of the block. There are 2 types of named blocks, namely:

Procedures:

It is a collection of statements which collectively perform a certain task. It accepts values through parameters and can return one or more values through parameters.

Functions:

It is a series of statements performing a specific task and returning only one value.

Structure of a PL/SQL Block:

Each PL/SQL program consists of SQL and PL/SQL statements which form a PL/SQL block. A PL/SQL block consists of four sections:

The Header section.

The Declaration section.

The Execution section.

The Exception (or Error) Handling section.

HEADER
<Type and name of block>
DECLARE
<All variables and cursors are declared here>
BEGIN
<All programming logic, queries and program statements are written here>
EXCEPTION
<All error handling code is written here>
END;
-- It ends the program


Creation of an anonymous PL/SQL Block:

The anonymous block is a type of PL/SQL block which has no name associated with it. In fact, the anonymous block is missing the header section altogether.

Instead, it simply uses the DECLARE reserved word to mark the beginning of its optional declaration section.

For example, consider a PL/SQL block which inserts the following 3 records into the table “Prod_bill”.

Record #1: B1, 02-FEB-2012, Rohini, Washing Machine, 1, 10000

Record #2: B2, 25-MAR-2012, Mahesh, Refrigerator, 1, 12000

Record #3: B3, 30-MAR-2012, Arpita, Mixer, 2, 8000
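As a rough stand-in for the anonymous block, the same three inserts can be tried from Python with the built-in sqlite3 module in place of Oracle. The column names below are assumptions, since the records are given here without the column list of Prod_bill, and the dates are stored as plain text.

```python
import sqlite3

# Hypothetical column names; the source only gives the record values.
CREATE_SQL = """
CREATE TABLE Prod_bill (
    bill_no   TEXT PRIMARY KEY,
    bill_date TEXT,
    customer  TEXT,
    product   TEXT,
    quantity  INTEGER,
    amount    INTEGER
)
"""

RECORDS = [
    ("B1", "02-FEB-2012", "Rohini", "Washing Machine", 1, 10000),
    ("B2", "25-MAR-2012", "Mahesh", "Refrigerator", 1, 12000),
    ("B3", "30-MAR-2012", "Arpita", "Mixer", 2, 8000),
]

conn = sqlite3.connect(":memory:")
conn.execute(CREATE_SQL)

# Using the connection as a context manager commits all three inserts
# together, or rolls them back if any one of them fails.
with conn:
    conn.executemany("INSERT INTO Prod_bill VALUES (?, ?, ?, ?, ?, ?)", RECORDS)

row_count = conn.execute("SELECT COUNT(*) FROM Prod_bill").fetchone()[0]
```

Grouping the inserts into one unit of work is the part that carries over to PL/SQL, where the three INSERT statements would sit between BEGIN and END.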

June 26, 2017 | July 4, 2017

Overview of PLSQL

PL/SQL stands for Procedural Language extension of SQL. PL/SQL combines SQL with the features of procedural programming languages. Oracle Corporation developed this language to enhance the capabilities of SQL.
Oracle uses a PL/SQL engine to process PL/SQL statements. PL/SQL code can be stored on the client side or on the server side. Oracle applications can be built on a client/server architecture.
PL/SQL programs can be written on the client side, with the queries that manipulate or retrieve data executed on the server side.

Difference between SQL and PL/SQL

SQL is a Structured Query Language which fires a query to create or modify database objects at the server. PL/SQL is a procedural extension of SQL which adds programming elements like variables, loops and operators to SQL queries.
SQL is a data-oriented language which selects and manipulates sets of data.
PL/SQL is a procedural language used to create applications.
SQL queries are executed one at a time, each causing two trips between the client and the server: one for sending the request and another for bringing back the result set. PL/SQL can execute multiple SQL queries together in a block in one go.

Benefits of PL/SQL:

Structured blocks: PL/SQL consists of blocks of code, which can be nested within each other. Each block forms a unit of a task or a logical module. These blocks can be stored in the database and support re-usability.
Integration with SQL: The PL/SQL language is tightly integrated with SQL. The user does not have to perform data conversion between SQL and PL/SQL data types. This integration saves both learning time and processing time. PL/SQL lets you use all the SQL data manipulation, cursor control, and transaction control commands, as well as all the SQL functions and operators.
Full Portability: Applications written in PL/SQL can run on any operating system and platform where the Oracle database runs.
Procedural Language: PL/SQL includes procedural language constructs such as control flow statements (IF-ELSE statements) and loops (such as FOR loops).
Efficient Error handling mechanism: PL/SQL handles errors or exceptions effectively during the execution of a PL/SQL program. Run-time errors can be trapped with exception handling statements. Once an exception is caught, specific actions can be taken depending upon the type of the exception.
Security: PL/SQL stored procedures protect the application code from tampering, hide the internal details, and restrict users’ access.
Object oriented programming support: Oracle object types are user-defined types that make it possible to model real-world entities such as customers and purchase orders as objects in the database. This supports encapsulation, modularity, maintainability and re-usability.
Better efficiency and performance: The PL/SQL engine processes multiple SQL statements together as a single block, thereby reducing network traffic.

April 10, 2017

Common language runtime: What is an application domain?

The common language runtime establishes a boundary around objects created within the same application scope. All objects which belong to the same application or website are kept in memory isolated from other applications, so that they do not conflict with each other and corrupt data. For example, a variable “x” created in the application “Bank_app” is distinguished from the variable “x” in the application “Employee_app”.

The CLR helps isolate objects created in one application from those created in other applications, so that run-time behavior is predictable and variables do not display garbage values (which is common in C/C++).

Application Domain:

Application domains provide a flexible and secure method of isolating running applications. An application domain is the unit the common language runtime uses to provide isolation, and therefore security, between applications. Several applications which need to communicate with each other can be executed in the same process and can be accessed without switching between processes. The ability to run multiple applications within a single process increases the scalability and efficiency of the server. Using application domains ensures that code running in one domain cannot affect other applications in the process.

In the example below, an application domain called “MyDomain” is created and two applications are launched in the same domain.

The first application, “empproj”, is an employee Windows application for adding and retrieving data from the employee database.
The second application, named “consoleadd”, is a console application which prints the sum of two long integers. Both applications have been compiled independently beforehand to create .exe files.

Then, with the help of the following C# code, an application domain is created and both applications are loaded into the same application domain and executed.

using System;

namespace ConsoleApplication1
{
    public class AppDomain1
    {
        public static void Main()
        {
            Console.WriteLine("Creating new AppDomain.");
            AppDomain domain = AppDomain.CreateDomain("MyDomain");

            Console.WriteLine("Host domain: " + AppDomain.CurrentDomain.FriendlyName); // prints parent domain name
            Console.WriteLine("child domain: " + domain.FriendlyName); // prints child domain name
            domain.ExecuteAssembly(@"C:\Users\Indrani Sen\Documents\Visual Studio 2008\Projects\Empproj\Empproj\bin\Debug\Empproj.exe");
            domain.ExecuteAssembly(@"C:\Users\Indrani Sen\Documents\Visual Studio 2008\Projects\consoleadd\consoleadd\bin\Debug\consoleadd.exe");

            Console.ReadLine();
        }
    }
}

January 20, 2017

Entity Relationship model case study

Case study

“Saboo-Car-Rental-Services” is a car rental showroom which wants to automate its business.
They offer different types of cars on rent, such as small car, SUV and MUV.
Each type of car has a maximum seating capacity and a tariff per kilometer.
The management wants the system to show the number of cars of each type available, for serving inquiries. The system should have a provision for booking a car.
Before the booking is made, the customer needs to provide personal information and driving license details.
A booking is typically stored as booking date, date of rent, duration in hours and type of vehicle.
Once the booking is done, a unique booking number is provided to the customer for their reference, which they need to produce when they come to collect the car.
A new transaction record is created for each booking after the car is returned, specifying the kilometers used, the amount to be paid and the date of payment.”
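Before drawing the diagram, the entities and attributes named in the case study can be jotted down as Python dataclasses. This is one possible reading and not a finished design: the class and field names are hypothetical, and computing the amount as kilometers used times the tariff per kilometer is an assumption the case study does not state outright.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class CarType:
    name: str                 # e.g. small car, SUV, MUV
    max_seating: int
    tariff_per_km: float
    cars_available: int


@dataclass
class Customer:
    name: str
    license_number: str


@dataclass
class Booking:
    booking_number: int       # unique reference given to the customer
    customer: Customer
    car_type: CarType
    booking_date: date
    rent_date: date
    duration_hours: int


@dataclass
class Transaction:
    booking: Booking
    kilometers_used: float
    payment_date: date

    @property
    def amount(self) -> float:
        """Assumed rule: kilometers used times the tariff per kilometer."""
        return self.kilometers_used * self.booking.car_type.tariff_per_km


# Hypothetical sample values, for illustration only.
suv = CarType("SUV", max_seating=7, tariff_per_km=12.0, cars_available=3)
customer = Customer("Sample Customer", "DL-0000")
booking = Booking(1, customer, suv, date(2017, 1, 20), date(2017, 1, 22), 8)
payment = Transaction(booking, kilometers_used=150, payment_date=date(2017, 1, 23))
```

The references between the classes (a Booking holds a Customer and a CarType, a Transaction holds a Booking) correspond to the relationships that the ER diagram would draw between the entity sets.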

January 12, 2017 | April 11, 2017

How does a program get loaded and executed in the Dot Net framework? (CLR)

Loading and execution of programs:

It is the responsibility of the CLR (Common Language Runtime) to load the class concerned into memory. The source code is first compiled by the respective language compiler into an intermediate, machine-independent language called MSIL (Microsoft Intermediate Language).

The Dot Net framework supports working with multiple programming languages (cross-language interoperability); each language has its own compiler to transform the source code to MSIL.

The CLR requires the metadata containing the information of the project concerned to determine which compiler it requires.
After the information regarding the compiler is obtained with the aid of the Metadata Engine, the classes referred to by the program are loaded from the Base Class Library and linked with the application.
Program execution is dynamic with the JIT (Just In Time) compiler, i.e. the classes are instantiated only during run time.
The Just In Time compiler is responsible for transforming MSIL to machine-dependent code for executing the program on the target machine.
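The two-step idea (source code to an intermediate form, then execution by the runtime) has a loose analogue in Python, sketched below. This is only an analogy and not how the CLR works internally: CPython compiles source to bytecode and interprets it, while the CLR JIT-compiles MSIL to native code.

```python
import dis

source = "def add(a, b):\n    return a + b\n"

# Step 1: compile the source text to a machine-independent code object (bytecode).
code_object = compile(source, "<demo>", "exec")

# Step 2: let the runtime execute the compiled code, which defines add().
namespace = {}
exec(code_object, namespace)
result = namespace["add"](2, 3)

# The intermediate instructions can be listed with the dis module.
instruction_names = [ins.opname for ins in dis.get_instructions(namespace["add"])]
```

The exact instruction names depend on the Python version, much as the native code produced from the same MSIL depends on the target machine.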

January 8, 2017 | January 11, 2017

Architecture of .NET Framework

The .NET Framework consists of the common language runtime and the .NET Framework class library.

The common language runtime is the foundation of the .NET Framework. CLR is responsible for services such as

memory management
thread management
automatic garbage collection
enforcing strict type safety constraints that promote security.

Code developed in the Dot Net Framework is thus more robust and is known as managed code. Code developed in an external application and later imported into the Dot Net Framework does not have the intrinsic security that the CLR provides to managed code, and is known as unmanaged code. The class library is a collection of reusable types that you can use to develop applications ranging from traditional console-driven or graphical user interface (GUI) applications to Web applications and mobile apps based on ASP.NET technology.

Components of .NET Framework:-

1.Common Language Runtime (CLR)

.Net Framework provides a runtime environment called the Common Language Runtime (CLR), which runs all .Net programs. It introduces the concept of managed code, in which the runtime performs tasks that improve the quality and performance of code written in the Dot Net environment. Developers do not need to handle, in their own code, garbage values or data getting corrupted due to overlapping memory between two applications. The CLR provides memory management, thread management and automatic garbage collection. We will talk about the CLR in detail in our next few posts.

2.Metadata

.NET metadata for an application, in the Microsoft .NET framework, refers to certain data structures embedded alongside the Microsoft Intermediate Language code, which describe the classes to be loaded and the type of compiler required for the application. Metadata is data about data: it has information about all the types that are defined in the assembly. A .NET language compiler generates the metadata and stores it in the assembly containing the MSIL. When the CLR executes MSIL, it checks that the metadata of the called method matches the metadata stored in the calling method. This ensures that a method can only be called with exactly the right number of parameters and exactly the right parameter types.
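The kind of check described here, matching a call against the recorded description of a method, can be imitated in Python with the standard inspect module. This is an analogy only: transfer and call_is_valid are hypothetical names, and the CLR performs its verification on the metadata in the assembly, not at the Python level.

```python
import inspect
from typing import get_type_hints


def transfer(account_id: int, amount: float) -> str:
    return f"moved {amount} from account {account_id}"


def call_is_valid(func, *args) -> bool:
    """Check argument count and argument types against the function's signature."""
    signature = inspect.signature(func)
    try:
        bound = signature.bind(*args)   # raises TypeError on a wrong argument count
    except TypeError:
        return False
    hints = get_type_hints(func)
    return all(
        isinstance(value, hints[name])
        for name, value in bound.arguments.items()
        if name in hints
    )
```

For instance, call_is_valid(transfer, 1, 2.5) passes both checks, while call_is_valid(transfer, 1) fails on the argument count and call_is_valid(transfer, "1", 2.5) fails on the first argument's type.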

3. Assemblies

A Microsoft .Net assembly is the smallest unit of deployment of a .Net application.
Assemblies include both executable application files that you can run directly from Windows without the need for any other programs (.exe files), and libraries (.dll files) for use by other applications.
Assemblies are the building blocks of .NET Framework applications.
During compile time, data is created for tracking, identifying and describing each program.
This data, also known as metadata, is created along with the Microsoft Intermediate Language (MSIL) and stored in the Assembly Manifest.
Both metadata and MSIL are wrapped together in a Portable Executable (PE) file.
The Assembly Manifest contains information about the members, types, references and all the other data that is needed for the execution of the program during run time. Every assembly contains one or more program files and a manifest.

Why are assemblies necessary?

Assemblies in Dot Net are used to prevent the problem of DLL Hell, a set of errors thrown when multiple software programs or applications attempt to register a dynamic link library (DLL) with the same name. The reason for this issue was that the version information about the different components of an application was not recorded by the system. During the installation of an application, the DLL of that application gets registered in the registry. If we then install another application that has a DLL with the same name, the previously installed DLL gets overwritten by the new one. The previously installed application can then no longer be executed, because the DLL it needs is missing. This problem, in the context of versions of the same DLL, is known as DLL Hell. It is resolved through versioning.
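The difference that versioning makes can be shown with two small lookup tables in Python. This is a toy model, not the Windows registry or the .NET assembly loader: the file name report.dll, the version numbers and the resolve helper are all hypothetical.

```python
from typing import Dict, Tuple

# Without versioning: one slot per library name, so a later install overwrites it.
by_name: Dict[str, str] = {}
by_name["report.dll"] = "build used by application A"
by_name["report.dll"] = "build used by application B"   # A's copy is lost

# With versioning: the key is (name, version), so both builds coexist.
by_name_and_version: Dict[Tuple[str, str], str] = {}
by_name_and_version[("report.dll", "1.0.0.0")] = "build used by application A"
by_name_and_version[("report.dll", "2.0.0.0")] = "build used by application B"


def resolve(name: str, version: str) -> str:
    """Return the build that matches both the requested name and version."""
    return by_name_and_version[(name, version)]
```

With the version as part of the key, application A can still ask for version 1.0.0.0 after application B has been installed.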

4.Framework Class Library(FCL)

Dot Net supports more than 15 languages and provides a set of common class libraries used by source code written in any of the supported languages. Programmers trained in any one language of the Dot Net environment can switch to any other language with minimum training.

In short, developers just need to import the Base Class Library in their language code and use its predefined methods and properties to implement common and complex functions like reading and writing to file, graphic rendering, database interaction, and XML document manipulation.

For example, the System.Data.SqlClient namespace in the Dot Net Framework is used for accessing databases in SQL Server.

To open a database connection, the code in the two languages compares as follows.

VB.NET:

Imports System.Data.SqlClient

Private Sub con()
    Dim cn As SqlConnection
    cn = New SqlConnection("Data Source=.\sqlexpress;Initial Catalog=employee;Integrated Security=True")
    cn.Open()
End Sub

C#.NET:

using System.Data.SqlClient;

private void con()
{
    SqlConnection cn = new SqlConnection(@"Data Source=.\sqlexpress;Initial Catalog=employee;Integrated Security=True");
    cn.Open();
}

5.Common Type System

Dot Net has a generic type system which is common to all the languages supported in the Dot Net framework.

The Common Type System is a standard that specifies how type definitions are represented in computer memory. It provides cross-language interoperability.
A program developed in Dot net framework is a collection of classes.
Each class consists of data members and methods.
The storage and retrieval of the data in an identifier is controlled through “Types”.
C#.NET is a strongly “Typed” language. Thus all operations on variables are performed with consideration of what the variable’s “Type” is. Operands should normally be of the same type.
For example, if you are doing subtraction with an Integer variable, you should subtract it from another Integer variable, and store the result to a variable of type Integer as well.

6.Application Domain:

7.Common Language Specification:

The .NET Framework includes classes, interfaces, and value types that provide access to system functionality. The CLS defines a subset of the Common Type System, together with some stricter rules. Most .NET Framework types are CLS-compliant and can therefore be used from any programming language whose compiler conforms to the Common Language Specification (CLS).

8.Managed Code and Unmanaged code

Dot Net supports two kinds of code:

Managed Code
Unmanaged Code


December 23, 2016

Dot Net Framework Evolution

The Dot Net framework is a software development environment which aids faster software development with features such as cross-language interoperability, the common language runtime, just-in-time compilation and dynamic coding. The Dot Net Framework is not a programming language. It is a platform that provides tools and technologies to develop Windows, Web and Enterprise applications.

Evolution of .NET

.NET technology was introduced by Microsoft to capture market share from Sun Microsystems’ Java. Java had become an extremely popular language, especially for web development, and Microsoft had only VC++ and VB to compete with it. With the world more and more dependent on the Internet and Java-related tools, Microsoft seemed to be losing the battle. The VC++ foundation classes were difficult to learn, while Visual Basic, though popular, was too simple for serious applications. Microsoft had developed the OLE and COM technologies before Dot Net.

OLE Technology

OLE stands for Object Linking and Embedding, which enables one to create objects with one application and then link or embed them in a second application. For example, Microsoft Excel worksheets can be embedded in a Visual Basic form or in a PowerPoint presentation, such that when the original object changes, the change is reflected in the linked application. Embedded objects retain their original format and links to the application that created them. This technology enables users to develop applications which require interoperability between various products.

COM Technology

The Component Object Model (COM), on the other hand, enables software to be divided into independent, executable and reusable components. These components can be collected to form a library which can be referenced in future projects if required.

THE ADD REFERENCE DIALOG BOX FOR ADDING AN EXTERNAL COMPONENT TO THE PROJECT

What makes COM and OLE unique is that the architecture is based on reusable design and reusable code. One of the fundamental concepts in COM, the interface, is based on the idea of design reuse. COM and OLE introduce a programming model based on reusable designs.

Microsoft started development of the .NET Framework in the late 1990s, secretly, under the name Next Generation Windows Services (NGWS), under the direct supervision of Mr. Bill Gates. In July 2000, Microsoft announced a whole new software development framework for Windows called .NET at the Professional Developers Conference (PDC). Microsoft also released a PDC version of the software for developers to test. By late 2000 the first beta versions of .NET 1.0 were released. The first version of the .NET Framework was released on 13 February 2002, bringing managed code to Windows NT 4.0, 98, 2000, ME and XP.

December 16, 2016 | January 11, 2017

Entity Relationship Diagram..all about Relationships

When we design entity sets during database design through an Entity Relationship Diagram, there have to be some kind of connections between them too. These connections are called “Relationships”, a term derived from the real world. A relationship can be classified according to its degree or connectivity. The following are the important points to remember about relationships in Entity Relationship diagrams:

A relationship is an association between entity sets.
The entity sets that are involved in the relationship are also known as participants.
Relationships are named and the name is generally a verb.
Relationships are always two-way, i.e. they operate in both directions, so the N:1 between Students and Teachers could also be thought of as a 1:N between Teachers and Students.
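The two-way reading of a relationship can be shown with a pair of Python dictionaries. The student and teacher names are hypothetical, and the sketch assumes each student has exactly one teacher.

```python
from collections import defaultdict
from typing import Dict, List

# N:1 direction: each student is taught by one teacher (hypothetical names).
teacher_of: Dict[str, str] = {
    "Asha": "Mr. Rao",
    "Bilal": "Mr. Rao",
    "Chitra": "Ms. Iyer",
}

# 1:N direction: the same relationship read from the teacher's side.
students_of: Dict[str, List[str]] = defaultdict(list)
for student, teacher in teacher_of.items():
    students_of[teacher].append(student)
```

Both dictionaries hold the same facts; only the direction in which the relationship is read has changed.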