Description
The documentation claims:
csvlink operates in much the same way as csvdedupe, but will flatten both CSVs in to one output file similar to a SQL OUTER JOIN statement. You can use the --inner_join flag to exclude rows that don't match across the two input files, much like an INNER JOIN.
When I run csvlink against two CSV files like so:
csvlink CFF-LWP-CSR-Combined-to-Merge-deduped.csv space-finder-name-scrape.csv --field_names_1 Name --field_names_2 "Space Finder Name" > CFF-LWP-CSR-Space-Finder-Combined.csv
I can verify that there are no duplicates in the Name column of the first input table:
xsv frequency -s Name CFF-LWP-CSR-Combined-to-Merge-deduped.csv
field,value,count
Name,Artist Lofts at WestingHouse ^ The,1
Name,Whistler House Museum of Art,1
Name,Boston Photo Collaborative,1
Name,"Martha's Vineyard Film Society, Inc.",1
Name,Provincetown Center for Coastal Studies,1
Name,119 Braintree,1
Name,Arts United Fall River,1
Name,"Gardner Museum, Inc.^Isabella Stewart",1
Name,Andover Studio Building,1
Name,300 Summer Street,1
Only a couple in the second file:
xsv frequency -s "Space Finder Name" space-finder-name-scrape.csv
field,value,count
Space Finder Name,Norwood Space Center: Creative Studio,2
Space Finder Name,Boston Fit Body Bootcamp: 850ft - Dancing/Yoga/Barre Studio for rent (available now),2
Space Finder Name,Berkshire South Regional Community Center,2
Space Finder Name,Gateway City Arts,2
Space Finder Name,Hope & Feathers Framing and Printing: Hope & Feathers Gallery,1
Space Finder Name,Mass Audubon - Arcadia Wildlife Sanctuary: Event Space,1
Space Finder Name,The Rivers School Conservatory: A. Ramon Rivera Recital Hall,1
Space Finder Name,Lydia Pinkham Building: Lydia Pinkham Artist Studios,1
Space Finder Name,The Westfield Athenaeum: Elizabeth Stewart Reed Room,1
Space Finder Name,Williams Inn: Main Ballroom,1
I note that csvlink outputs multiple rows for matches even when the names weren't listed twice in the above files:
xsv frequency -s Name CFF-LWP-CSR-Space-Finder-Combined.csv
field,value,count
Name,(NULL),501
Name,Indian Orchard Mills,2
Name,Fountain Street Studios,2
Name,Fine Arts Work Center,2
Name,Third Life Studio,2
Name,South Shore Art Center,2
Name,Hopkinton Center for the Arts,2
Name,Sound Museum,2
Name,Historic Beaver Mill,2
Name,Eclipse Mill,2
My understanding is that OUTER JOIN is defined as follows:
For those rows that do match, a single row will be produced in the result set (containing columns populated from both tables).
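To make the expectation concrete, here is a minimal sketch of that definition in plain Python. It assumes exact string equality on the key columns, which is not how csvlink matches (it links records fuzzily), and the helper `full_outer_join` and the sample rows are made up for illustration.

```python
def full_outer_join(left, right, left_key, right_key):
    """Full outer join of two lists of dicts on exact key equality.

    Illustrative only: csvlink does fuzzy record linkage, not exact matching.
    """
    # Index the right-hand rows by their key value.
    right_by_key = {}
    for row in right:
        right_by_key.setdefault(row[right_key], []).append(row)

    joined = []
    matched_keys = set()
    for lrow in left:
        matches = right_by_key.get(lrow[left_key], [])
        if matches:
            matched_keys.add(lrow[left_key])
            # One output row per (left, right) pair that matches.
            for rrow in matches:
                joined.append({**lrow, **rrow})
        else:
            # Unmatched left row: right-hand column is NULL.
            joined.append({**lrow, right_key: None})

    # Unmatched right rows: left-hand column is NULL.
    for key, rows in right_by_key.items():
        if key not in matched_keys:
            for rrow in rows:
                joined.append({left_key: None, **rrow})
    return joined


# Hypothetical sample rows, each key unique within its own table.
left = [{"Name": "Eclipse Mill"}, {"Name": "119 Braintree"}]
right = [{"Space Finder Name": "Eclipse Mill"},
         {"Space Finder Name": "Gateway City Arts"}]

rows = full_outer_join(left, right, "Name", "Space Finder Name")
```

With unique keys on both sides, the matched name appears in exactly one output row. A key that occurs twice in either input would produce two joined rows under this definition, so duplicates in the second file alone could explain some repeats, but not repeats for names that are unique in both files.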
While this is a minor issue that I can solve using Pandas after the fact, it led to confusion today and makes me think I am either missing something or the documentation is not quite correct.
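For anyone else hitting this, the after-the-fact cleanup could look roughly like the sketch below. It uses only the standard library rather than Pandas, and it assumes that keeping the first row per Name is acceptable and that rows with an empty Name (unmatched rows from the second file) should all be kept. The helper `drop_repeated_names` and the sample data are hypothetical, not real csvlink output.

```python
import csv
import io


def drop_repeated_names(rows, key="Name"):
    """Keep the first row for each non-empty key value; keep all empty-key rows."""
    seen = set()
    kept = []
    for row in rows:
        name = row.get(key) or ""
        if name and name in seen:
            continue  # a later repeat of a name we already kept
        if name:
            seen.add(name)
        kept.append(row)
    return kept


# Made-up stand-in for the combined output file.
sample = io.StringIO(
    "Name,Space Finder Name\n"
    "Eclipse Mill,Eclipse Mill\n"
    "Eclipse Mill,Eclipse Mill\n"
    ",Gateway City Arts\n"
    ",Williams Inn: Main Ballroom\n"
)

deduped = drop_repeated_names(list(csv.DictReader(sample)))
```

In Pandas, `DataFrame.drop_duplicates(subset=["Name"])` is the analogous call, though rows with an empty Name would need to be excluded from the check first so that they are not collapsed into one.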