Code sample to read data from a text file?

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext("local", "besant")
sqlContext = SQLContext(sc)
lines = sc.textFile("input.txt")  # path to the text file

Code sample to read data from MySQL?

df = sqlContext.read.format('jdbc').options(
    driver='com.mysql.jdbc.Driver',
    url='jdbc:mysql://<host>:3306/<db>?user=<usr>&password=<pass>',
    dbtable='besant',
    numPartitions=4
).load()


Say I have a huge list of numbers in an RDD (say myRDD), and I wrote the following code to compute the average:

What is wrong with it and how would you correct it?

The average function is neither commutative nor associative. I would simply sum the numbers and then divide by the count.
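The original snippet is not reproduced here, but a typical flawed version and the sum-then-divide fix might look as follows — a sketch runnable on a plain Python list, with the equivalent Spark calls in comments (myRDD and the sample numbers are illustrative):

```python
from functools import reduce

nums = [10.0, 20.0, 30.0, 40.0]

# Flawed: averaging pairwise is neither commutative nor associative,
# so the result depends on how Spark groups the elements.
# In Spark this would be: myRDD.reduce(lambda a, b: (a + b) / 2)
flawed = reduce(lambda a, b: (a + b) / 2, nums)

# Correct: sum everything, then divide by the count.
# In Spark: myRDD.sum() / myRDD.count()
avg = sum(nums) / len(nums)

print(flawed)  # 31.25 -- not the true average
print(avg)     # 25.0
```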

The only problem with the above code is that the total might become very big and thus overflow. So I would rather divide each number by the count first and then sum, in the following way.
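A sketch of that variant, with the assumed Spark calls in comments (the sample list is illustrative):

```python
nums = [10.0, 20.0, 30.0, 40.0]
count = len(nums)  # In Spark: count = myRDD.count()

# Dividing first keeps every partial sum no larger than the true average,
# so the running total cannot overflow.
# In Spark: avg = myRDD.map(lambda x: x / count).sum()
avg = sum(x / count for x in nums)
print(avg)  # 25.0
```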

The problem with the above code is that it uses two jobs: one for the count and another for the sum. We can do it in a single pass as follows:
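A single-pass sketch that reduces (sum, count) pairs, with the assumed Spark call in a comment (names and sample values are illustrative):

```python
from functools import reduce

nums = [10.0, 20.0, 30.0, 40.0]

# Carry (running_sum, running_count) pairs so one reduction yields both.
# In Spark:
#   total, count = myRDD.map(lambda x: (x, 1)).reduce(
#       lambda a, b: (a[0] + b[0], a[1] + b[1]))
total, count = reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]),
                      [(x, 1) for x in nums])
avg = total / count
print(avg)  # 25.0
```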

Again, it might overflow, because we are summing a huge number of values. We could instead carry running averages and counts, and compute the combined average from the averages and counts of the two parts being reduced.

If you have two parts having average and counts as (a1, c1) and (a2, c2), the overall average is:
total/count = (total1 + total2)/(count1 + count2) = (a1*c1 + a2*c2)/(c1 + c2)

If we let R = c2/c1, this can be rewritten as a1/(1+R) + a2*R/(1+R).
If we further let Ri = 1/(1+R), we can write it as a1*Ri + a2*R*Ri.
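This combine rule translates directly into a reducer over (average, count) pairs; a sketch runnable on a plain list, with the assumed Spark call in a comment (names and sample values are illustrative):

```python
from functools import reduce

def combine(p1, p2):
    # Each part is an (average, count) pair; merge them using
    # R = c2/c1 and Ri = 1/(1+R) from the derivation above.
    (a1, c1), (a2, c2) = p1, p2
    R = c2 / c1
    Ri = 1 / (1 + R)
    return (a1 * Ri + a2 * R * Ri, c1 + c2)

nums = [10.0, 20.0, 30.0, 40.0]
# In Spark: avg, count = myRDD.map(lambda x: (x, 1.0)).reduce(combine)
avg, count = reduce(combine, [(x, 1.0) for x in nums])
print(avg, count)  # average ~= 25.0, count 4.0
```

Since the intermediate values stay on the scale of an average rather than a total, this avoids the overflow risk of summing everything.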

Say I have a huge list of numbers in a file in HDFS. Each line has one number, and I want to compute the square root of the sum of squares of these numbers. How would you do it?

Is the following approach correct? Is sqrtOfSumOfSq a valid reducer?

Yes. The approach is correct and sqrtOfSumOfSq is a valid reducer: it is commutative and associative, because every intermediate result is non-negative and squaring it recovers the running sum of squares, so any grouping of the inputs yields the square root of the total sum of squares.
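The snippet itself is not reproduced here, but a reducer of the shape the question implies (only the name sqrtOfSumOfSq comes from the text; the path and demo values are illustrative) shows why it is valid:

```python
import math
from functools import reduce

def sqrtOfSumOfSq(x, y):
    # sqrt(x^2 + y^2); commutative, and associative because
    # sqrt(sqrt(x^2 + y^2)^2 + z^2) == sqrt(x^2 + y^2 + z^2).
    return math.sqrt(x * x + y * y)

# In Spark (sketch):
#   sc.textFile("hdfs://host/path/numbers.txt").map(float).reduce(sqrtOfSumOfSq)
# The same reduction on a plain list:
result = reduce(sqrtOfSumOfSq, [3.0, 4.0, 12.0])
print(result)  # sqrt(9 + 16 + 144) = 13.0
```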

In a very large text file, you want to check whether a particular keyword exists. How would you do this using Spark?
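One sketch (the keyword, path, and sample lines are illustrative); this naive full-scan version is the one the next answer criticizes:

```python
lines = [
    "spark makes distributed computing simple",
    "hadoop mapreduce is batch oriented",
]
keyword = "mapreduce"

# Naive full scan -- in Spark:
#   found = sc.textFile("hdfs://host/path/big.txt") \
#             .filter(lambda line: keyword in line).count() > 0
# count() forces every line on every node to be examined.
found = sum(1 for line in lines if keyword in line) > 0
print(found)  # True
```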

Can you improve the performance of the code in the previous answer?

Yes. The search does not stop even after the word we are looking for has been found; the map code keeps executing on all the nodes, which is very inefficient.

We could use an accumulator to report whether the word has been found and then stop further work. Something along these lines:
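A sketch of that idea (all names, paths, and sample data are illustrative). One caveat worth hedging: in Spark an accumulator's value cannot be read inside tasks, only on the driver, so the early exit below is per partition and the accumulator merely reports the find:

```python
# In Spark (sketch):
#
#   found = sc.accumulator(0)
#
#   def search_partition(lines):
#       for line in lines:
#           if keyword in line:
#               found.add(1)
#               break          # stop scanning the rest of this partition
#       return []
#
#   sc.textFile("hdfs://host/path/big.txt").mapPartitions(search_partition).count()
#   exists = found.value > 0   # accumulator is read back on the driver
#
# The same early-exit logic on plain lists standing in for partitions:
def search_partition(lines, keyword, counter):
    for line in lines:
        if keyword in line:
            counter.append(1)  # stands in for found.add(1)
            break              # skip the remaining lines of this partition

partitions = [["alpha beta", "gamma"], ["delta spark epsilon", "zeta"]]
counter = []
for part in partitions:
    search_partition(part, "spark", counter)
exists = len(counter) > 0
print(exists)  # True
```

A simpler alternative with a similar effect is filter(lambda line: keyword in line).take(1), since take(1) lets Spark launch only as many tasks as it needs instead of scanning the whole file.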