How Can I Merge Two Csv Files By A Common Column, In The Case Of Unequal Rows?

August 20, 2024 Post a Comment

I have a set of 100 files. 50 files containing census information for each US state. The other fifty are geographic data that need to be merged with the correct file for each state

Solution 1:

Read data from the shorter file into memory, into a dictionary keyed on the LOGRECNO row:

import csv

withopen('sample_state_census.csv', 'rb') as census_file:
    reader = csv.reader(census_file, delimiter='\t')
    census_header = next(reader, None)  # store header
    census = {row[9]: row for row in reader}

then use this dictionary to match against the geo data, write out matches:

withopen('sample_state_geo.csv', 'rb') as geo_file:
    withopen('outputfile.csv', 'wd') as outfile:
        reader = csv.reader(geo_file, delimiter='\t')
        geo_header = next(reader, None)  # grab header
        geo_header.pop(6) # no need to list LOGRECNO header twice

        writer = csv.writer(outfile, delimiter='\t')
        writer.writerow(census_header + geo_header)

        for row in reader:
            if row[6] notin census:
                # no census data for this LOGRECNO entrycontinue# new row is all of the census data plus all of geo minus column 7
            newrow = census[row[6]] + row[:6] + row[7:]
            writer.writerow(newrow)

This all assumes the census file is not so big as to take up too much memory. If that's the case you'll have to use a database instead (read all data into a SQLite database, match in the same vein agains the geo data).

Solution 2:

For merging multiple files (even > 2) based on one or more common columns, one of the best and efficient approaches in python would be to use "brewery". You could even specify what fields need to be considered for merging and what fields need to be saved.

import brewery
from brewery
import ds
importsyssources= [
    {"file": "grants_2008.csv",
     "fields": ["receiver", "amount", "date"]},
    {"file": "grants_2009.csv",
     "fields": ["id", "receiver", "amount", "contract_number", "date"]},
    {"file": "grants_2010.csv",
     "fields": ["receiver", "subject", "requested_amount", "amount", "date"]}
]

Create list of all fields and add filename to store information

about origin of data records

all_fields = brewery.FieldList(["file"])

Go through source definitions and collect the fields

forsourcein sources:
    for field insource["fields"]:
        if field not in all_fields:

out = ds.CSVDataTarget("merged.csv")
out.fields = brewery.FieldList(all_fields)
out.initialize()

forsourcein sources:

    path = source["file"]

# Initialize data source: skip reading of headers# use XLSDataSource for XLS files# We ignore the fields in the header, because we have set-up fields# previously. We need to skip the header row.

    src = ds.CSVDataSource(path,read_header=False,skip_rows=1)

    src.fields = ds.FieldList(source["fields"])

    src.initialize()


    for record in src.records():

   # Add file reference into ouput - to know where the row comes from
    record["file"] = path

        out.append(record)

# Close the source stream

    src.finalize()


cat merged.csv | brewery pipe pretty_printer

tmahurin