Leaving the GIS industry: bye for now

After working for more than 10 years in the geospatial industry, I have decided to change fields and focus on pure software engineering. I have quite enjoyed using GIS tools to solve various computational problems and software development tools to build useful GIS products and services for my customers. However, as my interests started shifting towards clean code and programming per se – I noticed I was reading Refactoring by Martin Fowler in bed far more often than The ESRI Guide to GIS Analysis by Andy Mitchell – I thought it would be worthwhile to try changing career paths.

As I won’t be using any GIS software any longer, I won’t be able to post new practical material that would be useful to GIS analysts and developers. However, in this post I would like to share some final thoughts that could be of interest to fellow GIS professionals.

This blog becomes an archive of hopefully useful resources. Good luck!

Skills

SQL

I cannot stress enough how important it is for anyone working with GIS data to master SQL. It is an integral part of nearly any GIS processing workflow, and being able to process and manage data stored in a DBMS using SQL is crucial. As GIS datasets grow in size, ad hoc scripts for data processing often won’t suffice, and pushing the work into the database tends to be more consistent, fast, and reliable. Using PostGIS or SQL Server native spatial types can often get you farther than any open-source Python package. Don’t stop mastering SQL after learning the basics – there is so much more to it.
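
To make this concrete, below is a minimal sketch of pushing a spatial query into PostGIS from Python with psycopg2; the connection settings and the parcels/flood_zones tables and their columns are hypothetical.

# A minimal sketch: let PostGIS do the spatial work instead of looping over
# features in Python. The connection details, table names, and columns are
# hypothetical - adjust them to your own database.
import psycopg2

conn = psycopg2.connect(host='localhost', dbname='gisdata',
                        user='postgres', password='postgres')

query = """
    SELECT p.parcel_id,
           ST_Area(p.geom::geography) AS area_m2
    FROM parcels AS p
    JOIN flood_zones AS f
      ON ST_Intersects(p.geom, f.geom)
    ORDER BY area_m2 DESC
    LIMIT 10;
"""

with conn, conn.cursor() as cur:
    cur.execute(query)  # the database does the heavy lifting
    for parcel_id, area_m2 in cur.fetchall():
        print(parcel_id, round(area_m2, 1))

conn.close()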

I have two massive posts about using SQL functions for GIS:

Math

Don’t worry too much about studying math. I am not sure why many people I have spoken to think that in order to be good at GIS one has to be good at math. Being comfortable with high school algebra and geometry is definitely useful (the angle between intersecting roads, the area of a lake, the percentage of a parcel covered by forest), but there is no need to study more sophisticated math. For casual GIS analyst work, having the basics covered will be more than enough.

Of course, if you are building a new routing algorithm that will be more efficient than A*, you will have to review a lot of graph theory material, and if you are building a new Voronoi diagram generation procedure, you will find yourself reading computational geometry books. If you do a lot of GIS analysis involving spatial statistics, then you should definitely get the excellent book Elementary Statistics for Geographers, Third Edition.

If you do a lot of computationally intensive programming and would like to catch up on math specifically for GIS, review:

Testing in GIS

Review these two posts:

Python

Learn Python. It is the most widely used programming language in the GIS industry and its usage will only expand. I have written the post A progression path for GIS analyst, which could be used as a development road map. You should be very comfortable using Python; having Python skills will let you have a stronger influence on operations within your organization and automate more manual workflows, leading to a better workplace.

Linux and bash

Learn Linux and bash. I think I should have started using Linux earlier. There are a few (1, 2) ready-to-use VirtualBox images with a ton of open-source GIS software installed, configured, and ready to use. Using those machines will save you a lot of time. Learning bash is extremely helpful because it lets you be much more productive running small commands and building data processing pipelines than you normally would on Windows using a programming language. Obviously, bash, Linux, and Python are part of an industry-agnostic skill set you can benefit from at any later point in time.

Readings

There are so many excellent GIS books that I would like to recommend. You can find the most popular titles online. What I’d like to do instead is to share some of the hidden gems I have discovered and really enjoyed reviewing. You can find those in the post Useful resources in computer science/math for GIS Analysts.

Programming

The ad hoc mentality is very difficult to fight. It is 7 pm. You have a job to get done before releasing the data to a customer tomorrow morning. You are adding a missing domain to a geodatabase that your colleague’s Python script failed to add. Then you are changing the data type of a field because you have just realized that you need to store text instead of numbers. And… you find a few other things you are so glad to have spotted before the release. You fix them, zip the database, and upload it onto an FTP site. It is 10 pm; you are tired but happy and head off to bed.

Success! … Or is it? First thing the next morning you want to document the manual changes you introduced yesterday, but you get dragged into some other urgent job… and you never do. A week later, a customer sends an email telling you she is not able to run her in-house tools against the database you have prepared, while the one you prepared a month ago works fine. Now it is 9 pm again and you are writing some odd-looking script trying to compare the two databases and recall what you did that evening… You are in a mess.

Doing what you have done may look natural because you just want to get stuff done. However, I want you to look at this from another perspective. You want your steps to be reproducible. You want to be able to track the changes you have made. Not only you, but any colleague of yours, should be able to pick up the updates that have been made to any piece of data or a script. So resist the urge to just get stuff done, pace yourself, and track your work with one of the following methods.

Documenting manually

If you are not comfortable with programming or scripting at all, you should document each step you take while making modifications to a dataset. At least then you can see what has been done in written form. I cannot stress this enough – document as you go, not at the end. Write down each change right after you make it, not after all the work is done. This is how it can look:

  1. You add field Area of Double type to the table dbo.Parcels.
  2. You write down: “Added field Area of Double type to the table dbo.Parcels.”
  3. You drop field OldArea of Double type from the table dbo.Parcels.
  4. You write down: “Dropped field OldArea of Double type from the table dbo.Parcels.”

One of the disadvantages of this approach is that the changes can get out of sync with the documentation. You could make an error documenting a data type or a field name. Another issue is that the very same step can be done in many ways – what if you add a field to a database using a GIS application while a colleague of yours uses a DBMS command line tool? Documenting the exact procedure for making changes soon becomes tedious, and you end up with tons of instructions that easily become misleading or plain obsolete. However, if you are rigorous, it is still possible to maintain a decent level of change tracking with sufficient discipline.

Simply programming with VCS

Another approach is to write a program that makes the changes. When you write code, you don’t need to document what you are doing, because a reader familiar with the syntax of the programming language will understand what happens. You can, of course, add comments explaining why certain fields need to be added. So, if you are building a database with a few tables, you can write a SQL script that can be re-run to recreate your database at any point in time. If you never make manual changes to the database and only write and keep SQL commands, your main SQL data compilation script will never get out of sync.
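
For GIS data specifically, the same idea applies to scripting schema changes instead of clicking through an application. Here is a minimal sketch using arcpy; the geodatabase path, table, and field names are made up for the example. Checked into a repository, a script like this is both the change and its documentation.

# A minimal sketch of making schema changes in code rather than by hand so
# that the change itself can be kept under version control and re-run.
# The geodatabase path, table name, and field names are hypothetical.
import arcpy

arcpy.env.workspace = r'C:\GIS\Temp\parcels.gdb'
table = 'Parcels'

# add the new Area field and populate it from the geometry
arcpy.AddField_management(table, 'Area', 'DOUBLE')
arcpy.CalculateField_management(table, 'Area', '!shape.area!', 'PYTHON_9.3')

# drop the field that is no longer needed
arcpy.DeleteField_management(table, 'OldArea')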

This leads us to the concept of version tracking, where it is possible to see how your SQL script has changed since the last version. Who is not guilty of having had, at some point in their career, a dozen files named something like “production_final_compilation_truly_final_12.sql“? To avoid this mess, you should really use a VCS.

The main argument against this approach is that setting up version control tools looks like overkill for someone doing simple GIS work. However, you will see how much safer your work becomes in the long run; it will pay off very soon. Invest some time in learning a VCS such as Git for managing your source code. All major players – Bitbucket, GitLab, and GitHub – provide free private repositories. Find out whether there is a VCS solution deployed in-house within your organization, such as Microsoft TFS, which you could use to check in your code. Should you like to dive deeper into Git, the Pro Git book can be read for free online. If you are not comfortable putting anything into the cloud (which is just someone else’s computer), use Git locally on your machine or on a local server where you can securely check in your code, and ask your system administrator to take backups of those repositories.

Open source vs proprietary software

Throughout your GIS career, you will most likely be exposed to both proprietary and open-source software. You might have a Windows machine with QGIS, or a Linux machine with Esri ArcGIS Server. It would be naive to think that either of these technologies is superior to the other. You should be able to get the job done with whatever tools you have available, because you will not always be able to decide what your employer will use.

I suggest becoming comfortable with both and widening your toolset as much as possible. As you become exposed to different tools, you will soon realize that commercial software can be much better for certain jobs than open-source or free software. For instance, certain types of spatial joins may run faster in ArcGIS Desktop than in PostGIS, while some GDAL-based raster masking may outperform the ArcGIS Spatial Analyst tools. Creating a map layout with data driven pages is a pleasure in ArcMap, but can be tedious in QGIS. Always do the benchmarking to understand which tools work best, and document your findings. Keep in mind that the very same tool can take 1 second to process 1,000 features and 5 minutes to process 10,000 features. Review the Big O notation briefly to avoid surprises.
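
Here is a minimal benchmarking sketch along these lines, using timeit from the standard library to see how a processing step scales with input size; make_test_features and process_features are hypothetical stand-ins for whatever tool you are measuring.

# A minimal sketch of benchmarking how a processing step scales with input
# size. make_test_features() and process_features() are hypothetical
# placeholders for the tool or function you actually want to measure.
import timeit

def make_test_features(n):
    # stand-in for generating or selecting n test features
    return [(i, i * 0.5) for i in range(n)]

def process_features(features):
    # stand-in for the spatial operation being benchmarked
    return [(fid, value * 2) for fid, value in features]

for size in (1000, 10000, 100000):
    features = make_test_features(size)
    seconds = timeit.timeit(lambda: process_features(features), number=3) / 3
    print('{:>7} features: {:.4f} s per run'.format(size, seconds))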

I have always encouraged people using a particular tool to understand what it really does. Having a clear understanding of the underlying process will make it possible for you to extend an existing tool or write your own. For instance, if you are applying raster masking, you should understand what a matrix is. If you do a spatial join, you should understand how having a spatial index helps.
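
To make the raster masking point concrete, here is a minimal numpy sketch of masking a raster held as a matrix; the array values, the mask, and the nodata convention are made up for the example.

# A minimal sketch of raster masking as a plain matrix operation with numpy.
# The raster values, the mask, and the nodata value are made up.
import numpy as np

raster = np.array([[3, 7, 2],
                   [9, 4, 6],
                   [1, 8, 5]], dtype=float)

mask = np.array([[1, 0, 1],
                 [0, 1, 0],
                 [1, 1, 0]], dtype=bool)

NODATA = -9999.0

# keep cells where the mask is True, write nodata everywhere else
masked = np.where(mask, raster, NODATA)
print(masked)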

Always look to expand your toolset and be prepared to apply the tool you think is right for a particular job. A customer only has PostGIS and needs to do some polygon intersection? Use the native ST_Intersects function. Don’t have access to QGIS? Know which ArcToolbox tool does the same job. You have to process a file on a Linux machine you SSH into (so no Excel-like software)? Use bash or pandas to wrangle the data as needed. You shouldn’t be constrained by the environment you are in or the tools you have at your disposal. You should be able to get the job done no matter what.
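
For the last scenario, here is a minimal pandas sketch of wrangling a delimited file on a machine with no spreadsheet software; the file name and columns are hypothetical.

# A minimal sketch of wrangling a delimited file with pandas on a machine
# without any spreadsheet software. The file name and columns are made up.
import pandas as pd

df = pd.read_csv('parcels.csv')

# keep only the columns of interest and the largest parcels
subset = df[['parcel_id', 'land_use', 'area_m2']]
large = subset[subset['area_m2'] > 10000].sort_values('area_m2', ascending=False)

large.to_csv('large_parcels.csv', index=False)
print(large.head())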

Keeping up with the industry

I have been a user of GIS StackExchange since 2013 and have blogged about my experience and why it is useful to be active on a forum in the post 4 years with GIS at StackExchange. Make a habit of reading the most popular questions of the week, for instance, every weekend. If you see a question you know the answer to, post it. It also helps to post an answer to a question you asked yourself and then spent a week solving before finally finding a solution. You will save a fellow GIS professional some effort, and you will also be able to find that answer when you search the web for the same issue in a few years’ time. If you have some time, review the most popular questions using the most voted questions option; there is so much to learn there.

Geospatial careers: list of companies

I have published a curated list of companies in the geospatial industry which you can use when looking for a job. The list is sorted by country and, to some extent, by industry domain. I have compiled it over many years while I was part of the GIS industry.

https://github.com/AlexArcPy/geospatial-careers

Each company is documented in the format [company website]: very brief description of what they do. Quite a few companies work with a variety of things, so they can be hard to categorize. A company may have jobs for GIS developers, digital mappers, photogrammetry experts, machine learning professionals, and so forth; I suggest exploring their home page to learn more. It is also possible that a company did just one thing when this list was published, but has since expanded and now advertises other job titles, too.

It is also possible that a company is listed under a certain country section but has offices in other countries, too; keep in mind, though, that they may hire for geospatial positions only in a particular country or office. The company descriptions are not comprehensive: a company may do many other things apart from geospatial operations, but I don’t mention them as they are irrelevant in this context. Some companies may permit working remotely – again, please explore the company website to double-check. Due to the dynamic nature of the Internet, if a URL is broken, just use a web search engine to find the company’s website.

This page won’t be updated on a regular basis, so it is pretty static. If you know a company in the geospatial sector, by all means, please do submit a pull request so we can expand the list. The geospatial industry is fairly small, so I thought sharing this list with the community would benefit both the companies looking for talent and peer professionals looking for a job.

Good luck with job hunting!

Edit files in a mounted Linux directory in Windows

Sometimes it is very useful to be able to edit files stored on a Linux machine in a Windows application. This can be a handy setup when you want to store your source code on Linux so you can execute it against a Linux Python interpreter, but you would like to edit it in a rich GUI application such as PyCharm or Eclipse. To achieve this, you can use an open-source framework that mounts a Linux directory as a Windows drive, from which you can add your files to a PyCharm project.

Another use case is when an application you need to use is available under Windows only, but copying the files from Windows to Linux upon every edit is tedious.

To mount a Linux directory in Windows:

  1. Install https://www.microsoft.com/en-US/download/details.aspx?id=40784 (install x64).

  2. Install https://github.com/dokan-dev/dokany/releases/tag/v0.7.4 (if a message asks whether you want to download the VS runtime, just click Cancel – it only installs x86 libraries you will not need).

  3. Download and run https://github.com/feo-cz/win-sshfs/releases/tag/1.5.12.8. It will be available in your tray.

  4. Mount a drive as described in the section Using Win-SSHFS to Mount Remote File Systems on Windows at https://www.digitalocean.com/community/tutorials/how-to-use-sshfs-to-mount-remote-file-systems-over-ssh.

You can optionally choose to mount a Linux directory on Windows start up. Extremely handy.

Drop multiple PostgreSQL databases matching name pattern

I once had a long data processing job that created multiple database snapshots on a PostgreSQL instance. This process sometimes crashed, leaving the databases behind. I had to automate dropping the databases whose names matched a certain pattern. A Bash script came in handy!

In the script, I write a series of SQL drop database statements to a text file and then execute each line using psql. It took me a while to get it right, as I wasn’t able to find a ready-to-use Bash script online, so this may be helpful to others.


echo "Drop PostgreSQL database matching a pattern"
cd C:/Temp
sql_query="select 'drop database \"'||datname||'\";' from pg_database WHERE datistemplate = false and datname like 'operat%';"
echo $sql_query
PGPASSWORD=postgres psql -p 5432 -d postgres -U postgres -t -A -F"," -c "${sql_query}" > drop_dbs.sql
cat drop_dbs.sql | while read line
do
echo $line;
PGPASSWORD=postgres psql -p 5432 -d postgres -U postgres -t -A -F"," -c "${line}"
done

Handy Bash scripts to run OGR command line tools

If you have a GDAL installation (which includes OGR), you can take advantage of the OGR command line tools such as ogrinfo and ogr2ogr, which are very handy and often more efficient at data processing than Python scripts written with shapely or fiona. As a rule of thumb, if your Bash script stays within 20-30 lines of code, you are doing okay. If it gets longer, it is worth switching to Python for readability and maintainability.

Make sure to review the Python GDAL/OGR Cookbook; it has a ton of useful examples. Below are some snippets you can use; they will also help you learn Bash if you are not familiar with it yet.

If you are going to write a lot of Bash, I suggest using an IDE that supports it. I have been using PyCharm with the excellent BashSupport plugin, which takes the experience of writing Bash scripts to a new level: it provides syntax highlighting, auto-completion, and hover hints.


echo "Working directory: $PWD";
echo "List files in a given directory"
for file in C:/GISData/*; do
echo $file;
done
echo "Get basic info about all shapefiles in a given folder"
for file in C:/GISData/*; do
if [[ ${file: -4} == ".shp" ]]
then
ogrinfo -so -al $file;
fi
done
echo "Get basic info about all shapefiles in a given folder recursively"
cd C:/GISData
for file in C:/GISData/**/*; do
if [[ ${file: -4} == ".shp" ]]
then
ogrinfo -so $file;
fi
done
echo "Merge shapefiles recursively"
BASEPATH=C:/GISData;
POSTFIX=SRTM;
SHP=output.shp;
cd ${BASEPATH};
for file in ${BASEPATH}/${POSTFIX}*/*; do
if [[ ${file: -4} == ".shp" ]]
then
echo ${file};
if [[ ! -e ${SHP} ]]; then
ogr2ogr -f "esri shapefile" ${SHP} ${file}
else
ogr2ogr -f "esri shapefile" -update -append ${SHP} ${file}
fi
fi
done
echo "Get unique values in a given field recursively"
cd C:/GISData
field='Type'
for file in C:/GISData/Data*/*; do
if [[ ${file: -4} == ".shp" ]]
then
filename=$(basename — "$file");
ogrinfo -sql "SELECT DISTINCT ${field} FROM ${filename%%.*}" ${file} -nomd;
fi
done
echo "Load a shapefile into a PostGIS database table keeping only certain columns"
DB_NAME=gfm
DB_USER=postgres
SRC_SHAPEFILE="C:/GISData/CountriesEsri/ESRI_Online_World_Countries_Detailed_WGS84.shp"
DEST_NAME=countries
DEST_COLUMNS="COUNTRY,ISO_CC,COUNTRYAFF"
ogr2ogr -f "PostgreSQL" PG:"dbname=${DB_NAME} user=${DB_USER}" ${SRC_SHAPEFILE} -nln ${DEST_NAME} -nlt geometry -lco GEOMETRY_NAME=geom -select ${DEST_COLUMNS}
echo "Export a PostGIS database table into a shapefile"
DB_NAME=gisdata
DB_USER=postgres
DB_PASSWORD=postgres
PORT=5433
cd C:/GISData/PostGISData
ogr2ogr -f "ESRI Shapefile" countries.shp PG:"dbname=${DB_NAME} user=${DB_USER} password=${DB_PASSWORD} port=${PORT}" "countries"


How to print line being executed when running a Python script

If you have ever needed to run a rather large data processing script, you know that it may be difficult to track the progress of the script execution. If you copy a number of files from one directory into another, you can easily show the progress by figuring out the total size of all files and then printing how much has already been copied or how much is left. However, if your program does many things and executes code from 3rd party packages, there is a risk you won’t have a clue about how much time is left, or at least where you are in the program – that is, which line of code is currently being executed.

A simple solution is to spread print statements around the program to show the progress. This approach works great when you have just a few key checkpoints you are paying attention to; however, as your program grows, it may become vital to be able to tell exactly which line of code is being executed. This may come in handy if the program does not seem to be doing anything any longer and you would like to re-execute it, but do not want to re-run the code that has already completed successfully. Adding a print statement after each line of code will soon become impractical.

Fortunately, there is a built-in module, trace, available in both Python 2 and 3, which you can use to show the progress of your program execution. To learn more about the trace module, take a look at the standard library docs page and the Python MOTW page.

To provide a simple example, if you have a Python script run.py containing the following:

import math
import time

print(math.pow(2, 3))
print(math.pow(2, 4))
print(math.pow(2, 5))

then you can run python -m trace -t --ignore-dir C:\Python36 .\run.py to see live updates on which line of your program is being executed. This means you can start your long-running script in a terminal and then get back to it now and then to see its progress, because it will print each line that is currently being executed. The handy --ignore-dir option lets you filter out calls to the internal Python modules so your terminal won’t be polluted with unnecessary details.

On Windows, be aware of a bug in CPython caused by directory comparison working incorrectly on case-insensitive file systems (such as NTFS). Be sure to specify the path to the Python interpreter directory using the right case (C:\Python36 would work, but c:\python36 would not).

You can also provide multiple directories to ignore, but be aware of which environment you use to run your Python script on Windows, because the syntax differs:

  • Git Bash: $ python -m trace -t --ignore-dir 'C:/Python36;C:/Util' run.py
  • cmd: python -m trace -t --ignore-dir C:\Python36;C:\Util .\run.py
  • PowerShell: python -m trace -t --ignore-dir 'C:\Python36;C:\Util' .\run.py

On Linux, it seems you don’t have to provide the --ignore-dir argument at all to filter out the system calls:

linuxuser@LinuxMachine:~/Development$ python -m trace -t run.py
--- modulename: run, funcname: <module>
run.py(1): import math
run.py(2): import time
run.py(4): print(math.pow(2, 3))
8.0
run.py(5): print(math.pow(2, 4))
16.0
run.py(6): print(math.pow(2, 5))
32.0
--- modulename: trace, funcname: _unsettrace
trace.py(80): sys.settrace(None)
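
The same tracing can also be driven from Python code through the trace.Trace class instead of the command line; in this minimal sketch the traced function is just a stand-in.

# A minimal sketch of using the trace module programmatically instead of
# "python -m trace". The traced function is just a stand-in.
import sys
import trace

def long_running_job():
    total = 0
    for i in range(3):
        total += i ** 2
    return total

# print executed lines, skip the standard library, don't collect counts
tracer = trace.Trace(trace=True, count=False,
                     ignoredirs=[sys.prefix, sys.exec_prefix])
result = tracer.runfunc(long_running_job)
print(result)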

The trace module also has other uses, such as generating code coverage and branching reports, which can be useful if you would like to see which branch of an if/else was taken during program execution. However, you wouldn’t use the trace module only to generate code coverage, because the coverage.py package provides much richer functionality for that, so be sure to use it instead.

Python static code analysis and linting

Over the past years, I have been using various tools that help me write better Python code and catch common errors before committing it. A linter is a piece of software that can help with that, and there are a few Python linters capable of finding and reporting issues in your code. I would like to split the types of issues a linter can report into three groups:

Obvious code errors that would cause runtime errors.

Those are easy ones. To mention a few:

  • you forgot to declare a variable before using it;
  • you supplied the wrong number of arguments to a function;
  • you tried to access a non-existing class property or method.

Linters help you catch those errors, so it is a good idea to run a linter on your Python modules before executing them. You still need to fix the reported issues manually. PyLint or flake8 could be used.

Style related issues that do not cause runtime errors.

Those are easy ones, too. To mention a few:

  • a code line is too long, making it difficult to read;
  • a docstring has single quotes (instead of double quotes);
  • you have two statements on the same line of code (separated by a semicolon);
  • you have too many spaces around certain operators, such as assignment.

Linters can help you catch these issues too, so again it is a good idea to run a linter on your Python modules before executing them. These issues are less critical, as you won’t get any runtime errors because of them. Fixing them, however, will make your code more consistent and easier to work with.

Making changes such as separating function arguments with a space or breaking a longer line manually can be tedious. It becomes even more difficult if you are working with a legacy Python module that was written without any style guidelines: you could need to reformat every single line of the code, which is very impractical.

Fortunately, there are a couple of Python code formatters (such as autopep8 and yapf) that can reformat a Python module in place, meaning you don’t have to do it manually. Formatters depend on a configuration that defines how certain issues should be handled, for instance, the maximum line length or whether all function arguments should each be supplied on a separate line. The configuration file is read every time the formatter runs, which makes it possible to share the same code style – of utter importance when you are part of a team of Python developers.
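
As a rough illustration of the kind of rewrite these formatters perform (the exact output depends on the formatter and its configuration; the function and argument names are made up), compare a cramped but valid function with its cleaned-up equivalent:

# A rough before/after illustration of what a formatter such as autopep8 or
# yapf does; the exact output depends on the configuration, and the function
# and argument names are made up.

# before: valid Python, but cramped and hard to read
def buffer_area(geometry,distance,segments=8,dissolve=True):
    result=geometry;print('buffering');return result

# after: the same logic laid out along PEP-8 lines
def buffer_area_formatted(geometry, distance, segments=8, dissolve=True):
    result = geometry
    print('buffering')
    return result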

The general style guidelines can be found in PEP 8, which is the de facto standard for code formatting in the Python community. However, if some PEP 8 suggestions don’t work for your team, you can tweak them; the important thing is that everyone agrees on the standard you decide to use and sticks to it.

Code quality, maintainability, and compliance to best practices

Those are more complex mainly because it is way harder to identify less obvious issues in the code. To mention a few:

  • a class has too many methods and properties;
  • there are too many nested if and else statements;
  • a function is too complex and should be split into multiple ones.

There are just a few linters that can give some hints on those. You would obviously need to modify your code manually.

As I used linters more and more, I was exposed to various issues I had not thought of earlier. At that point I realized that the linters I used could not catch every kind of issue, leaving some of them for me to debug. So I started searching for other Python linters and Python rulesets that I could learn from to write better code.

There are so many Python linters and programs that could help you with static analysis – a dedicated list is maintained under awesome-static-analysis repository.

I would like to share a couple of helpful tools not listed there that could be of great help. The obvious errors and stylistic issues are less interesting because so many Python linters, such as pylint and flake8, will report them. For more complex programs, however, being able to get insights on possible refactoring can often be more important. Knowing the best practices and idioms of the language can also make your Python code easier to understand and maintain. There are even companies that develop products which check the quality of your code and report possible issues; reading through their rulesets is also very helpful.

wemake-python-styleguide is a flake8 plugin that aggregates many other flake8 plugins reporting a huge number of issues of all three categories we have discussed above. Custom rules (not reported by any flake8 plugin) can be found at the docs page.

SonarSource linter is available in multiple IDEs and can report all kinds of bugs, code smells, and pieces of code that are too complex and should be refactored. Make sure to read through the ruleset, it is a great one.

Semmle (part of LGTM.com) is an open-source product, and its ruleset is very helpful and should be reviewed.

SourceMeter is not an open-source product (there is a free version, however), but its ruleset is also very helpful and should be reviewed.

Open source community in 2018

I was recently searching for a virtual machine with open-source GIS software pre-installed, knowing that one had been available for many years, which I blogged about 8 years ago. It was funny to find and read my 8-year-old blog post, which I started by saying that “I am not a big fan of Linux and open source software.” How much has changed since then!

True, back then the open-source GIS community was not what it is today; QGIS has since grown into a fully-fledged desktop GIS, quite a few Python geospatial packages have been written, and it has become a whole lot easier to start using open-source software. Today, I love open source. Just like proprietary software, it has its own pros and cons, though. But to illustrate the beauty of open source, I’d like to share a couple of personal stories that I think are very illustrative.

One evening, I was playing with QGIS and noticed an annoying bug – pasting data from the clipboard into the Python console caused the text cursor to move to the end of the row. What is funny about this bug is that the same behavior can be seen in the ArcMap Python console. I decided to report the bug on the QGIS issues web page. You have no idea how surprised I was to receive a notification within a couple of hours that the issue had been fixed in the QGIS source code and the next release would not have it. Some time later, I reported a typo in the installation dialog text – it was fixed within an hour of being reported. OK, I do understand that those were not very complicated issues; however, I found it astonishing to get fixes into a fairly large and complex desktop application that quickly. This is how the open-source community operates: the next time I upgraded QGIS, those bugs were gone.

Another night, I was playing with mypy, generating Python interface files. I found a bug, which I reported on the mypy GitHub page. Later the same day, Guido van Rossum himself confirmed the bug and suggested a fix. I forked the mypy repo and fixed the issue; Guido reviewed the change and suggested refactoring; I refactored the code; Guido reviewed it again and merged my pull request. It took just a few hours to fix an issue in a package used daily by thousands of users. In addition, having this personal interaction with the author of Python and having him approve the code you write is very inspiring. This is what I love about the Python community. This is what I love about open source.

If you have not done so yet, I encourage everyone to find a product, a project, or a program that is open source and start contributing. You have no idea how much you will learn by reading code written by other people and how fast you will grow as a developer by working in a virtual team with peers. If you are not a programmer, you can always work on finding and reporting issues, improving the docs, or writing a tutorial. Answering or improving questions on the GIS StackExchange website is another great way to contribute to the public knowledge base available to all GIS professionals.

I have myself authored a few programs with the source code published on GitHub. It is hard to describe what a joy it is to hear from users of those fairly simple programs that they found them helpful in their work. Yes, you may be writing software as part of a job you get paid for, and that software is then used by your happy customers, but having a complete stranger praise a program you have written and shared is a whole different story. Give it a try!

Getting geodatabase features with arcpy and heapq Python module

If you have ever needed to merge multiple spatial datasets into a single one using ArcGIS, you have probably used the Merge geoprocessing tool. This tool takes multiple datasets and creates a single one by merging all the features together. However, when your datasets are stored on disk as multiple files and you only want a subset of features from them, running the Merge tool just to get all the features into a single feature class may not be very smart.

First, merging features will take some time, particularly if your datasets are large and there are quite a few of them. Second, even after you have merged the features into a single feature class, you still need to iterate over it to get the features you really need.

Let’s say you have a number of feature classes, each storing cities (as points) for a particular state (one feature class per state). Your task is to find the 10 most populated cities across all of the feature classes. You could definitely run the Merge tool and then use arcpy.da.SearchCursor with the sql_clause argument to iterate over sorted cities (the sql_clause argument can have an ORDER BY SQL clause). Alternatively, you could chain multiple cursor objects and then use the sorted built-in function to get only the top 10 items. I have already blogged about using chains to combine multiple arcpy.da.SearchCursor objects in this post.

However, this can also be done without the Merge geoprocessing tool or the sorted function (which constructs a list object in memory), solely with the help of arcpy.da.SearchCursor and the built-in Python heapq module. Arguably, the most important advantage of the heapq module lies in its ability to avoid constructing lists in memory, which can be critical when operating on many large datasets.

The heapq module is present in Python 2.7, which makes it available to ArcGIS Desktop users. However, in Python 3.5, heapq.merge gained two new optional arguments, key and reverse, which made it very similar to the built-in sorted function. So ArcGIS Pro users have a certain advantage because they can choose to sort the iterator items in a custom way.
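
Before the arcpy example below, here is a minimal, self-contained sketch of heapq.merge with the key and reverse arguments, using plain tuples of made-up cities instead of cursors; note that each input must already be sorted the same way.

# A minimal sketch of heapq.merge with key/reverse, using plain tuples of
# (city, population) instead of arcpy cursors. The data is made up and each
# input list is already sorted by population in descending order.
import heapq
import itertools

state_a = [('City A', 900000), ('City B', 500000), ('City C', 120000)]
state_b = [('City D', 750000), ('City E', 300000), ('City F', 100000)]

merged = heapq.merge(state_a, state_b, key=lambda x: x[1], reverse=True)
top_three = list(itertools.islice(merged, 3))
print(top_three)  # the three most populated cities across both inputs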

Here is a sample that showcases the efficiency of using heapq.merge over constructing a sorted list in memory. Please note that the key and reverse arguments are used, so this code can only be run with Python 3.


'''
Demonstrating efficient use of heapq.merge when working with
arcpy.da.SearchCursor structures. The heapq.merge function will
merge multiple sorted inputs into a single sorted output which
can be useful when iterating over multiple arcpy.da.SearchCursor
iterators.
'''
from pprint import pprint
import heapq
import itertools

import arcpy

arcpy.env.overwriteOutput = True
arcpy.env.workspace = r'C:\GIS\Temp\ArcGISHomeFolder\sample.gdb'

STATES = ['California', 'Texas', 'Arizona']
FC = (r'C:\Program Files (x86)\ArcGIS\Desktop10.5\TemplateData'
      r'\TemplateData.gdb\USA\cities')

# preparing the sample data by exporting individual feature classes
states_written = []
for state_name in STATES:
    state_name_normalized = state_name.replace(' ', '_')
    arcpy.Select_analysis(
        in_features=FC,
        out_feature_class='cities_{}'.format(state_name),
        where_clause='''STATE_NAME = '{}' '''.format(state_name_normalized),
    )
    states_written.append(state_name_normalized)

# getting the list of cursors for each individual feature class
cursors = [
    arcpy.da.SearchCursor(
        'cities_{}'.format(state_name),
        ['CITY_NAME', 'POP1990', 'STATE_NAME'],
        sql_clause=(None, 'ORDER BY POP1990 DESC'),
    ) for state_name in states_written
]

# @profile is injected by the profiler runner (e.g. kernprof/line_profiler or
# memory_profiler); run the script under one of those to compare the solutions

# ----------------------------------------------------------------------
@profile
def heapq_solution():
    # sort by the POP1990 field DESC
    merged = heapq.merge(*cursors, key=lambda x: x[1], reverse=True)
    # print 10 most populated cities in specified states
    result = list(itertools.islice(merged, 10))
    return result

# ----------------------------------------------------------------------
@profile
def sorted_solution():
    result = sorted(
        itertools.chain(*cursors), key=lambda x: x[1], reverse=True)
    return result

pprint(heapq_solution())

# resetting all da.SearchCursor objects to iterate over them again
for cur in cursors:
    cur.reset()

pprint(sorted_solution()[:10])


Printing pretty tables with Python in ArcGIS

This post would be of interest to ArcGIS users authoring custom Python script tools who need to print out tables in the tool dialog box. You would also benefit from the following information if you need to print out some information in the Python window of ArcMap while doing ad hoc data exploration.

Fairly often, your only way to communicate the results of a tool’s execution is to print out a table that the user can look at. It is possible to create an Excel file using a Python package such as xlsxwriter, or by exporting an existing data structure such as a pandas data frame into an Excel or .csv file which the user can open. Keep in mind that it is possible to start Excel with the file open using the os.system command:

os.system('start excel.exe {0}'.format(excel_path))

However, if you only need to print out some simple information in a table format within the dialog box of the running tool, you can construct such a table using built-in Python. This is particularly helpful when you cannot guarantee that the end user will have 3rd party Python packages installed, or when the output table is really small and not supposed to be analyzed or processed further.

However, as soon as you try to build something flexible, with varying column widths or where you don’t know beforehand which columns and what data the table will be printed with, it gets very tedious. You need to manipulate multiple strings and tuples, making sure everything draws properly.

In these cases, it is much nicer to take advantage of external Python packages where all of these concerns have already been taken care of. I have been using tabulate, but there are a few others, such as PrettyTable and texttable, both of which generate a formatted text table using ASCII characters.

To give you a sense of the tabulate package, look at the code necessary to produce a nice table using plain format strings (the first part) and using the tabulate package (the second part):


# using the built-in Python modules
import arcpy

arcpy.env.workspace = r'C:\Program Files (x86)\ArcGIS\Desktop10.5\TemplateData\TemplateData.gdb'
fc = r'USA\cities'
data = ['CITY_FIPS', 'CITY_NAME', 'STATE_FIPS', 'STATE_NAME', 'POP1990']
feats = [f for f in arcpy.da.SearchCursor(fc, field_names=data)][:10]

arcpy.AddMessage(" ")
row_format = u"{:>20}" * len(data)
arcpy.AddMessage(row_format.format(*data))
for feat in feats:
    arcpy.AddMessage(row_format.format(*feat))
arcpy.AddMessage(" ")

# using the 3rd party tabulate module
import arcpy
from tabulate import tabulate

arcpy.env.workspace = r'C:\Program Files (x86)\ArcGIS\Desktop10.5\TemplateData\TemplateData.gdb'
fc = r'USA\cities'
data = [['CITY_FIPS', 'CITY_NAME', 'STATE_FIPS', 'STATE_NAME', 'POP1990']]
feats = [f for f in arcpy.da.SearchCursor(fc, field_names=data[0])][:10]
data.extend(feats)

arcpy.AddMessage(" ")
arcpy.AddMessage(tabulate(data))
arcpy.AddMessage(" ")

The output of the table produced using the built-in modules only:

[Screenshot: table printed with built-in format strings]

The output of the table produced using the tabulate module:

[Screenshot: table printed with the tabulate module]