Have you ever needed to process some large datasets? Well, it all depends on how you define “large”. In this post, I will refer to large datasets as to relational tables or delimited text files of several GBs of size containing 10+ mln rows.
When I have to do some data processing or filtering, I often start analyzing what kind of toolset or framework should I use to perform the work. Say you have a large .csv file and you want to create a new .csv file that is filtered by a column (so you basically want to create a new .csv based on the SQL where clause).
Well, you can do that with ArcGIS. It is possible to add text files including .csv files into ArcMap and open its attribute table. You will be able to make selections based on attributes, calculate statistics, and many more. After having a selection created, you can export selected rows into a new comma separated file. However, in order to make selection work on the whole dataset (and not only the first 2000 rows visible in the attribute table), it seems as you need to move to the end of table. This can take rather long time; so it’s probably best to use the Definition Query tab in the Table Properties dialog box. This will make visible only those rows that match your criteria. Now you can export the whole table with the query on to a new text file; pay particular attention to what kind of default demiter will used when exporting a text file from ArcMap and whether any new columns such as OBJECTID is added to each of the source rows. Another caveat is to watch out carefully for any decimal delimiters that are used because they can cause problems for comma-delimited text tables. Refer to this KB article to learn more.
Another software used by many GIS professionals is FME and this task can be done within this framework, too. Probably the easiest way to do this is to use the AttributeFilter transformer to obtain only those rows that meet your criteria. You will be able to supply a Separator Character and Character Encoding among other options. Very handy and easy to build. Filtering 20mln rows csv file with two columns and writing to a new csv file with FME 2015 wouldn’t take more than several minutes on a decent machine.
Most DBMS will let you to import and export csv files, too. SQL Server, for instance, is also capable of importing a .csv file using the SQL Server Import and Export Wizard. You will be able to load a csv file into a SQL Server table, then create a new table with SELECT INTO with a where clause to keep only needed rows. You will also be able to script this process by using the SSIS packages or by using the BULK INSERT to insert csv rows into a table.
And… you can do this in Python! There are many ways to do this such as reading file lines or using built-in modules such as csv or external ones such as unicodecsv. Be careful when reading large files with Python; very often I see people reading all the file rows into a single object, and this can cause out of memory error even on powerful machines. Learn more about iterators in Python – the reader objects that will be created are generators that you can iterate (go through just once).
Here is a couple of samples on how to read/write csv with Python to help you get started: