If you have ever needed to merge multiple spatial datasets into a single one using ArcGIS, you have probably used the Merge geoprocessing tool. This tool takes multiple datasets and creates a single one by merging all the features together. However, when your datasets are stored on disk as multiple files and you only want a subset of their features, running the Merge tool to collect all the features into a single feature class may not be the best approach.
First, merging features takes time, particularly if your datasets are large and there are several of them. Second, even after you have merged the features into a single feature class, you still need to iterate over it to pick out the features you really need.
Let’s say you have a number of feature classes, each storing the cities (as points) of one state. Your task is to find the 10 most populated cities across all of the feature classes. You could run the Merge tool and then use arcpy.da.SearchCursor with the sql_clause argument (which accepts an ORDER BY SQL clause) to iterate over the cities in sorted order. Alternatively, you could chain multiple cursor objects and then use the built-in sorted function to get only the top 10 items. I have already blogged about using chains to combine multiple arcpy.da.SearchCursor objects in an earlier post.
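As a rough illustration of the chain-and-sort approach, here is a minimal sketch. It assumes plain Python lists of (city, population) tuples standing in for arcpy.da.SearchCursor objects; the city names, the populations, and the top_cities_sorted helper are all made up for the example.

```python
from itertools import chain

# Hypothetical stand-ins for per-state cursors; each row is (city, population).
# With arcpy, each of these would be an arcpy.da.SearchCursor over one feature class.
state_a = [("Alpha", 120000), ("Bravo", 45000), ("Charlie", 980000)]
state_b = [("Delta", 310000), ("Echo", 15000)]
state_c = [("Foxtrot", 760000), ("Golf", 52000), ("Hotel", 8000)]


def top_cities_sorted(cursors, n=10):
    """Return the n most populated rows by sorting every row in memory."""
    all_rows = chain(*cursors)
    # sorted() pulls every row from every cursor into one list before slicing.
    return sorted(all_rows, key=lambda row: row[1], reverse=True)[:n]


top3 = top_cities_sorted([state_a, state_b, state_c], n=3)
print(top3)
```

This works, but the whole combined dataset is held in a list just to keep a handful of rows.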
However, this can also be done without the Merge geoprocessing tool or the sorted function (which constructs a list object in memory), using only arcpy.da.SearchCursor and Python's built-in heapq module. Arguably, the most important advantage of the heapq module is that it avoids constructing full lists in memory, which can be critical when operating on many large datasets.
The heapq module is present in Python 2.7, which makes it available to ArcGIS Desktop users. In Python 3.5, however, the heapq.merge function gained two new optional arguments, key and reverse, which make it very similar to the built-in sorted function. ArcGIS Pro users therefore have a certain advantage, because they can choose to sort the iterator items in a custom way.
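To make the similarity with sorted concrete, here is a small sketch of the key and reverse arguments of heapq.merge. The two row lists are invented for illustration, and each is assumed to be already sorted by population in descending order, since heapq.merge only merges pre-sorted inputs and does not sort them itself.

```python
import heapq
from operator import itemgetter

# Hypothetical (city, population) rows; each input is already sorted
# by population, largest first, which is what reverse=True expects.
north = [("Charlie", 980000), ("Alpha", 120000), ("Bravo", 45000)]
south = [("Foxtrot", 760000), ("Delta", 310000), ("Echo", 15000)]

# heapq.merge returns a lazy iterator; nothing is read until it is consumed.
merged = heapq.merge(north, south, key=itemgetter(1), reverse=True)
merged_rows = list(merged)

# The same call shape with sorted, which needs all rows in one list up front.
sorted_rows = sorted(north + south, key=itemgetter(1), reverse=True)

print(merged_rows == sorted_rows)
```

The two calls take the same key and reverse arguments; the difference is that heapq.merge relies on the inputs being ordered already and yields rows one at a time.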
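Here is sample code that showcases the efficiency of using heapq.merge over constructing a sorted list in memory. Note that heapq.merge expects each input to be sorted already, in the same order as the merged output. Please mind that the key and reverse arguments are used, so this code can be run only with Python 3.5 or later.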
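A minimal sketch of that comparison follows. It does not use arcpy: the fake_cursor generator and the make_state helper are hypothetical stand-ins for arcpy.da.SearchCursor objects that return rows ordered by population in descending order (for example, via an ORDER BY in sql_clause). The counter is there only to show how many rows are actually read.

```python
import heapq
from itertools import islice
from operator import itemgetter


def fake_cursor(rows, counter):
    """Yield rows one at a time, counting how many were actually read."""
    for row in rows:
        counter[0] += 1
        yield row


def make_state(state_id, size):
    """Build made-up (city, population) rows, sorted by population descending."""
    return [
        ("city_{}_{}".format(state_id, i), (size - i) * 10 + state_id)
        for i in range(size)
    ]


# Five hypothetical "feature classes" with 1000 cities each.
states = [make_state(s, 1000) for s in range(5)]

# Lazy approach: merge the pre-sorted cursors and stop after 10 rows.
read_count = [0]
cursors = [fake_cursor(rows, read_count) for rows in states]
merged = heapq.merge(*cursors, key=itemgetter(1), reverse=True)
top10 = list(islice(merged, 10))

# Eager approach: build and sort one list holding all 5000 rows.
all_rows = [row for rows in states for row in rows]
full_sort_top10 = sorted(all_rows, key=itemgetter(1), reverse=True)[:10]

print(top10 == full_sort_top10)
print("rows read by heapq.merge:", read_count[0], "of", len(all_rows))
```

Both approaches return the same ten rows, but the merged iterator reads only a small number of rows from each input, while the sorted call has to load every row first.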