How to merge multiple CSV files with the same header into one file

by Rajesh Meher   Last Updated September 12, 2019 02:26 AM

Input files arrive continuously in my HDFS folder. I want to merge the CSV files from the last 15 minutes, all of which have the same header, into a single CSV file with one header. I tried -getmerge, but it did not work. Any pointers, please?

Tags : pyspark


Answers 1


I am referring to the link below to get the list of files that were processed in the last 5 minutes (set MIN to 15 in the script for a 15-minute window):

Get the list of files processed in last 5 minutes

Since you want to skip the individual headers and merge all the listed files under a single header, you can first copy those files to the local Unix file system as shown below:

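#!/bin/bash

# List the part files, keep the date, time and path columns, and print the
# paths whose modification time is less than MIN minutes old (needs GNU date).
filenames=$(hdfs dfs -ls /user/vikct001/dev/hadoop/external/csvfiles/part* | tr -s " " | cut -d' ' -f6-8 | grep "^[0-9]" | awk 'BEGIN{ MIN=5;LAST=60*MIN; "date +%s" | getline NOW } { cmd="date -d'\''"$1" "$2"'\'' +%s"; cmd | getline WHEN; close(cmd); DIFF=NOW-WHEN; if(DIFF < LAST){ print $3 }}')

# Copy each selected file from HDFS to the local directory.
for file in $filenames
do
   #echo $file
   hdfs dfs -get "${file}" /home/vikct001/user/vikrant/shellscript/testfiles
done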
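
The time-window check can be tried without a cluster. The following is only a sketch: recent_files is a hypothetical helper (not part of HDFS or of the script above), the /data/... paths are made-up sample input imitating the date, time and path columns of hdfs dfs -ls, and GNU date is assumed for the -d option. It uses a 15-minute window to match the question.

```shell
#!/bin/bash
# Hypothetical helper: reads "date time path" lines on stdin and prints the
# paths whose timestamp is less than $1 minutes old. Assumes GNU date (-d).
recent_files() {
    local minutes=$1
    local now cutoff day time path when
    now=$(date +%s)
    cutoff=$(( now - minutes * 60 ))
    while read -r day time path; do
        # Skip lines whose first two fields are not a parsable timestamp.
        when=$(date -d "$day $time" +%s 2>/dev/null) || continue
        if (( when > cutoff )); then
            printf '%s\n' "$path"
        fi
    done
}

# Made-up sample lines: one file from 2 minutes ago, one from 2 hours ago.
fresh=$(date -d '2 minutes ago' '+%Y-%m-%d %H:%M')
stale=$(date -d '2 hours ago' '+%Y-%m-%d %H:%M')
sample_result=$(printf '%s /data/part-0001.csv\n%s /data/part-0002.csv\n' \
    "$fresh" "$stale" | recent_files 15)

# Only the recent file should be listed.
echo "$sample_result"
```

In the real script the input would come from the hdfs dfs -ls pipeline instead of the printf sample.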

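Once the listed files are on your local file system, you can use the command below to merge them into one file with a single header. NR == 1 keeps the first line of the first file (the header), and FNR > 1 skips the first line of every file.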

awk '(NR == 1) || (FNR > 1)' /home/vikct001/user/vikrant/shellscript/testfiles/part*.csv > bigfile.csv
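
To see what that awk condition does, here is a small self-contained sketch that runs it on two sample files. The file names and rows are invented for illustration, and the files are written to a scratch directory created with mktemp.

```shell
#!/bin/bash
# Build two small sample CSV files (made-up data) in a scratch directory.
workdir=$(mktemp -d)
printf 'id,name\n1,alpha\n2,beta\n' > "$workdir/part-0001.csv"
printf 'id,name\n3,gamma\n' > "$workdir/part-0002.csv"

# NR == 1 keeps the very first line read (the header of the first file);
# FNR > 1 keeps every line after the first line of each file.
awk '(NR == 1) || (FNR > 1)' "$workdir"/part*.csv > "$workdir/merged.csv"

# The merged file has one header followed by the three data rows.
cat "$workdir/merged.csv"
```

This relies on every input file starting with the same header line; a file without a header would lose its first data row.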

For more details on this, see the linked question: Merge csv with a single header

vikrant rana
September 12, 2019 02:25 AM
