How to print line being executed when running a Python script

If you have ever needed to run a rather large data processing script, you know that it may be rather difficult to track the progress of the script execution. If you copy a number of files from a directory into another one you can easily show the progress by figuring out the total size of all files and then print how much has already been copied or how much is left. However, if your program does many things and executes code from some 3rd party packages, there is a risk you won’t have a clue about how much time is left or at least where you are in the program, that is what line of code is currently being executed.

A simple solution to this is to spread print statements around the program to show the progress. This approach works great when you have just a few key checkpoints you are paying attention to. However, as your program grows, it may become vital to be able to tell exactly which line of code is being executed. This may come in handy if the program does not seem to be doing anything any longer and you would like to re-execute it without re-running the code that has already run successfully. Adding a print statement after each line of code soon becomes impractical.

Fortunately, the built-in trace module, available in both Python 2 and 3, can show the progress of your program’s execution. To learn more about the trace module, take a look at the standard library docs page and the Python MOTW page.

To provide a simple example, if you have a Python script containing the following:

import math
import time

print(math.pow(2, 3))
print(math.pow(2, 4))
print(math.pow(2, 5))

then you can run python -m trace -t --ignore-dir C:\Python36 .\ to see live updates on which line of your program is being executed. This means you can start a long-running script in a terminal and come back to it now and then to check its progress, because it prints each line as it is executed. The handy --ignore-dir option lets you filter out calls into Python’s own modules so your terminal won’t be polluted with unnecessary details.
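The same line-by-line output can also be produced from inside a script with the trace.Trace class. The snippet below is a minimal sketch rather than a drop-in recipe: the process function is a hypothetical stand-in for a long-running job, and using sys.prefix and sys.exec_prefix for ignoredirs is an assumption about where the interpreter's own modules live on your machine.

```python
import sys
import trace


def process():
    """Stand-in for a long-running job (hypothetical example function)."""
    total = 0
    for value in (2, 3, 4):
        total += 2 ** value
    return total


# trace=1 prints each line as it runs; count=0 skips the coverage counters.
# ignoredirs plays the role of --ignore-dir on the command line.
tracer = trace.Trace(
    trace=1,
    count=0,
    ignoredirs=[sys.prefix, sys.exec_prefix],
)

# runfunc calls the function under the tracer and hands back its result.
result = tracer.runfunc(process)
print(result)
```

This form is handy when only one function is worth tracing and the rest of the script should run at full speed.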

On Windows, be aware of a bug in CPython: the directory comparison behind --ignore-dir does not work correctly on case-insensitive file systems (such as NTFS on Windows). So be sure to specify the path to the Python interpreter directory using the right case (C:\Python36 would work, but c:\python36 would not).

You can also provide multiple directories to ignore, but on Windows the syntax depends on the shell you run your Python script from:

  • Git Bash: $ python -m trace -t --ignore-dir 'C:/Python36;C:/Util'
  • cmd: python -m trace -t --ignore-dir C:\Python36;C:\Util .\
  • PowerShell: python -m trace -t --ignore-dir 'C:\Python36;C:\Util' .\

On Linux, it seems you don’t have to provide the --ignore-dir argument at all to filter out the calls into the standard library:

linuxuser@LinuxMachine:~/Development$ python -m trace -t
 --- modulename: run, funcname: <module>
import math
import time
print(math.pow(2, 3))
8.0
print(math.pow(2, 4))
16.0
print(math.pow(2, 5))
 --- modulename: trace, funcname: _unsettrace
sys.settrace(None)

The trace module also has other uses, such as reporting code coverage and branching, which can be useful if you would like to see which branch of your if-else was picked during the program execution. However, you wouldn’t use the trace module just to generate code coverage, because there is a dedicated Python package that provides much richer functionality for this, so be sure to use that instead.


Python static code analysis and linting

Over the past years, I have been using various tools that helped me write better Python code and catch common errors before committing the code. A linter is a piece of software that can help with that, and there are a few Python linters that are capable of finding and reporting issues in your code. I would like to split the types of issues a linter can report into three groups:

Obvious code errors that would cause runtime errors.

Those are easy ones. To mention a few:

  • you forgot to declare a variable before it is used;
  • you supplied wrong number of arguments to a function;
  • you try to access a non-existing class property or method.

Linters help you catch those errors, so it is a good idea to run a linter on your Python modules before executing them. You would need to fix the code manually. PyLint or flake8 could be used.
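To make the three examples above concrete, here is a small sketch of code that imports fine but fails as soon as the faulty function is called. All names in it (Layer, area, runtime_errors, undefined_total) are made up for illustration. A linter such as pylint would typically flag each of the three marked lines without running anything; the exact message names depend on the tool and its version.

```python
class Layer:
    def __init__(self, name):
        self.name = name


def area(width, height):
    return width * height


def runtime_errors():
    """Never called here; each line below fails only when it is executed."""
    print(undefined_total)      # name used before it is defined
    area(3)                     # one positional argument is missing
    Layer("roads").extent       # attribute that the class does not define
```

Because Python resolves names and arguments at call time, none of these mistakes surface until runtime_errors() actually runs, which is exactly why a static check beforehand pays off.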

Style related issues that do not cause runtime errors.

Those are easy ones, too. To mention a few:

  • a code line is too long, making it difficult to read;
  • a docstring has single quotes (instead of double quotes);
  • you have two statements on the same line of code (separated with semicolon ;);
  • you have too many spaces around certain operators such as assignment.

Linters can also help you catch those issues so it is great to run the linter on your Python modules before executing them. They are less critical as you won’t get any runtime errors due to those issues found. Fixing those issues, however, will make your code more consistent and easier to work with.

Making changes such as separating function arguments with a space or breaking a longer line by hand can be tedious. It becomes even more difficult if you are working with a legacy Python module that was written without any style guidelines. You might need to reformat every single line of the code, which is very impractical.

Fortunately, there are a couple of Python code formatters (such as autopep8 and yapf) that can reformat a Python module in place, meaning you don’t have to do it manually. Formatters depend on a configuration that defines how certain issues should be handled, for instance, the maximum line length or whether all function arguments should be supplied each on a separate line. The configuration file is read every time the formatter runs, which makes it possible to apply the same code style everywhere. This is of utmost importance when you are a team of Python developers.

The general style guidelines can be found in PEP-8, which is the de facto standard for code formatting used by the Python community. However, if the PEP-8 suggestions don’t work for your team, you can tweak them; the important thing is that everyone agrees on and sticks to the standard you decide to use.

Code quality, maintainability, and compliance to best practices

Those are more complex mainly because it is way harder to identify less obvious issues in the code. To mention a few:

  • a class has too many methods and properties;
  • there are too many nested if and else statements;
  • a function is too complex and should be split into multiple ones.

There are just a few linters that can give some hints on those. You would obviously need to modify your code manually.
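To give a feel for how a tool can judge that a function is too complex, here is a rough sketch built on the standard ast module. It simply counts branching constructs inside each function. This is a simplification of my own for illustration, not the algorithm used by any particular linter, and the rough_complexity name and the choice of node types are assumptions.

```python
import ast

# Node types treated as "one more path through the code" (an assumption).
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                ast.BoolOp, ast.IfExp)


def rough_complexity(source):
    """Map each function name to 1 + the number of branching nodes in it."""
    scores = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            branches = sum(isinstance(child, BRANCH_NODES)
                           for child in ast.walk(node))
            scores[node.name] = 1 + branches
    return scores


sample = '''
def classify(value):
    if value < 0:
        return "negative"
    elif value == 0:
        return "zero"
    else:
        for digit in str(value):
            if digit == "7":
                return "lucky"
    return "positive"
'''

scores = rough_complexity(sample)
print(scores)
```

A real linter would compare such a score against a configurable threshold and report the functions that exceed it.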

As I used linters more and more, I was exposed to various issues I had not thought of earlier. At that point I realized that the linters I used could not catch all the kinds of issues that could exist, leaving some of them for me to find while debugging. So, I started searching for other Python linters and Python rulesets that I could learn from to write better code.

There are so many Python linters and programs that could help you with static analysis – a dedicated list is maintained under awesome-static-analysis repository.

I would like to share a couple of helpful tools not listed there that could be of great help. The obvious errors and stylistic issues are less interesting because there are so many Python linters, such as pylint and flake8, that report them. However, for more complex programs, getting some insight into possible refactoring can often be more important. Knowing the best practices and idioms of the language can also make your Python code easier to understand and maintain. There are even some companies that develop products that check the quality of your code and report any possible issues. Reading through their rulesets is also very helpful.

wemake-python-styleguide is a flake8 plugin that aggregates many other flake8 plugins reporting a huge number of issues of all three categories we have discussed above. Custom rules (not reported by any flake8 plugin) can be found at the docs page.

SonarSource linter is available in multiple IDEs and can report all kinds of bugs, code smells, and pieces of code that are too complex and should be refactored. Make sure to read through the ruleset; it is a great one.

Semmle ruleset is not an open-source product, but their ruleset is very helpful and should be reviewed.

SourceMeter ruleset is not an open-source product (however, there is a free version) but their ruleset is also very helpful and should be reviewed.

Open source community in 2018

I was recently searching for a virtual machine with open source GIS software pre-installed, knowing that one had been available for many years and that I had blogged about it 8 years ago. It was funny to find and read my 8-year-old blog post, which I started by saying that “I am not a big fan of Linux and open source software.” How much has changed since then!

True, back then open source GIS community was not what it is today; QGIS has really grown into a fully-fledged desktop GIS, quite a few Python geospatial packages have been written, and it became a whole lot easier to start using open-source software. Today, I love open-source. Just as proprietary software, it has its own pros and cons, though. But to illustrate the beauty of open-source, I’d like to share a couple of personal stories that I think are very illustrative.

One evening, I was playing with QGIS and noticed an annoying bug: pasting data from the clipboard inside the Python console causes the text cursor to be moved to the end of the row. What is funny about this bug is that the same behavior can be seen in the ArcMap Python console. I decided to report the QGIS bug on their issues web page. You have no idea how surprised I was when I received a notification a couple of hours later that the issue was fixed in the QGIS source code and that the next release wouldn’t have it. Some time later, I reported a typo in the installation dialog text, and it was fixed within an hour of being reported. OK, I do understand that those were not very complicated issues; however, I found it astonishing to be able to get fixes in a fairly large and complex desktop application that quickly. This is how the open-source community operates: the next time I upgraded QGIS, those bugs were not there any longer.

Another night, I was playing with mypy, generating Python interface files. I found a bug, which I reported on the mypy GitHub page. Later the same day, Guido van Rossum himself confirmed the bug and suggested a fix. I forked the mypy repo and fixed the issue; Guido reviewed the change and suggested refactoring; I refactored the code; Guido reviewed it again and merged my pull request. It took just a few hours to fix an issue in a package used daily by thousands of users. In addition, having this personal interaction with the author of Python and having him approve the code you write is very inspiring. This is what I love about the Python community. This is what I love about open-source.

If you have not done so yet, I encourage you to find a product, a project, or a program that is open-sourced and start contributing. You have no idea how much you will learn by reading code written by other people and how fast you will grow as a developer by working in a virtual team with your peers. If you are not a programmer, you can always work on finding and reporting issues, improving the docs, or writing a tutorial. Answering or improving questions on the GIS StackExchange website is another great way to contribute to the public knowledge base available to all GIS professionals.

I have myself authored a few programs with open source code published on GitHub. It is hard to describe what a joy it is to hear from the users of those fairly simple programs that they found my programs to be helpful in their work. Yes, you may be writing software as a part of your job you get paid for and this software is then used by your happy customers, but having a complete stranger praising the program you have written and shared is a whole different story. Give it a try!

Getting geodatabase features with arcpy and heapq Python module

If you have ever needed to merge multiple spatial datasets into a single one using ArcGIS, you have probably used the Merge geoprocessing tool. This tool can take multiple datasets and create a single one by merging all the features together. However, when your datasets are stored on disk as multiple files and you only want to get a subset of features from them, running the Merge tool to get all the features together into a single feature class may not be very smart.

First, merging features will take some time particularly if your datasets are large and there are a few of them. Second, even after you have merged the features together into a single feature class, you still need to iterate it getting the features you really need.

Let’s say you have a number of feature classes and each of them stores cities (as points) in one state (one feature class per state). Your task is to find the 10 most populated cities across all of the feature classes. You could definitely run the Merge tool and then use arcpy.da.SearchCursor with the sql_clause argument to iterate over sorted cities (the sql_clause argument can have an ORDER BY SQL clause). Alternatively, you could chain multiple cursor objects and then use the sorted built-in function to get only the top 10 items. I have already blogged about using chains to combine multiple arcpy.da.SearchCursor objects in this post.

However, this can also be done without using the Merge geoprocessing tool or sorted function (which will construct a list object in memory) solely with the help of arcpy.da.SearchCursor and the built-in Python heapq module. Arguably, the most important advantage of using the heapq module lies in ability to avoid constructing lists in memory which can be critical when operating on many large datasets.

The heapq module is present in Python 2.7, which makes it available to ArcGIS Desktop users. However, in Python 3.5, heapq.merge got two new optional arguments, key and reverse, which made it very similar to the built-in sorted function. So, ArcGIS Pro users have a certain advantage because they can choose to sort the iterator items in a custom way.

Here is sample code that showcases the efficiency of using heapq.merge over constructing a sorted list in memory. Please mind that the key and reverse arguments are used, so this code can be run only with Python 3.
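The sketch below illustrates the idea with plain Python lists standing in for the cursors, so it runs without arcpy. The city names and population figures are invented, and each list is assumed to arrive already sorted by population in descending order, as an ORDER BY ... DESC in the sql_clause of each arcpy.da.SearchCursor would deliver it. heapq.merge relies on that ordering; it does not sort its inputs.

```python
import heapq
from itertools import islice

# Hypothetical (city, population) rows standing in for three cursors.
# Each list is already sorted by population, largest first.
state_a = [("Ashford", 910000), ("Brookfield", 120000), ("Clayton", 45000)]
state_b = [("Dunmore", 640000), ("Eastvale", 300000)]
state_c = [("Fairview", 1500000), ("Glenwood", 80000), ("Harlow", 7000)]

# heapq.merge yields rows lazily, one at a time, instead of building a list.
merged = heapq.merge(
    state_a, state_b, state_c,
    key=lambda row: row[1],   # compare rows by population
    reverse=True,             # inputs are sorted largest to smallest
)

# Take only the first three rows of the merged stream.
top_three = list(islice(merged, 3))
print(top_three)
```

If the inputs cannot be pre-sorted, heapq.nlargest(n, iterable, key=...) over a chained iterator is an alternative that keeps only n items in memory at a time.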


Printing pretty tables with Python in ArcGIS

This post would be of interest to ArcGIS users authoring custom Python script tools who need to print out tables in the tool dialog box. You would also benefit from the following information if you need to print out some information in the Python window of ArcMap while doing some ad hoc data exploration.

Fairly often your only way to communicate the results of the tool execution is to print out a table that the user can look at. It is possible to create an Excel file using a Python package such as xlsxwriter, or to export an existing data structure such as a pandas data frame into an Excel or .csv file which the user can open. Keep in mind that it is possible to start Excel with the file open using the os.system command:

os.system('start excel.exe {0}'.format(excel_path))

However, if you only need to print out some simple information into a table format within the dialog box of the running tool, you could construct such a table using built-in Python. This is particularly helpful in those cases where you cannot guarantee that the end user will have the 3rd party Python packages installed or where the output table is really small and it is not supposed to be analyzed or processed further.

However, as soon as you would try to build something flexible with the varying column width or when you don’t know beforehand what output columns and what data the table will be printed with, it gets very tedious. You need to manipulate multiple strings and tuples making sure everything draws properly.

In these cases, it is so much nicer to be able to take advantage of external Python packages where all these concerns have already been taken care of. I have been using tabulate, but there are a few others, such as PrettyTable and texttable, both of which will generate a formatted text table using ASCII characters.

To give you a sense of the tabulate package, compare the code necessary to produce a nice table using ugly formatted strings (the first part) with the code that uses the tabulate package (the second part):
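As an illustration of the built-in route, here is a minimal sketch that measures the column widths and pads every cell with str.format. The format_table helper and the sample rows are hypothetical, and the layout (two spaces between columns, a dashed rule under the header) is just one possible choice.

```python
def format_table(headers, rows):
    """Return a plain-text table built with string formatting only."""
    # Convert every cell to text so its width can be measured.
    text_rows = [[str(cell) for cell in row] for row in rows]
    # Column width = longest cell in that column, header included.
    widths = [
        max(len(cell) for cell in column)
        for column in zip(headers, *text_rows)
    ]
    # One left-aligned placeholder per column, e.g. "{:<7}  {:<8}".
    template = "  ".join("{{:<{0}}}".format(width) for width in widths)
    lines = [template.format(*headers)]
    lines.append("  ".join("-" * width for width in widths))
    lines.extend(template.format(*row) for row in text_rows)
    return "\n".join(lines)


headers = ["Layer", "Features"]
rows = [("roads", 1520), ("parcels", 87)]
table = format_table(headers, rows)
print(table)
```

With the tabulate package installed, the whole helper collapses to a single call along the lines of tabulate(rows, headers=headers), which is the main argument for using it whenever a third-party dependency is acceptable.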

The output of the table produced using the built-in modules only:


The output of the table produced using the tabulate module:




Warning: new GDB_GEOMATTR_DATA column in ArcGIS geodatabase 10.5

This post would be of interest to ArcGIS users who are upgrading enterprise geodatabases from ArcGIS 10.1-10.4 to 10.5 or later. According to the Esri documentation and resources (link1, link2, link3):

Feature classes created in an ArcGIS 10.5 or 10.5.1 geodatabase using a 10.5 or 10.5.1 client use a new storage model for geometry attributes, which stores them in a new column (GDB_GEOMATTR_DATA). The purpose of this column is to handle complex geometries such as curves. Since a feature class can have only one shape column, the storage of circular geometries must be stored separately and then joined to the feature class when viewed in ArcGIS.

This means that if you create a new feature class in an enterprise geodatabase (either manually or by using a geoprocessing tool), three fields will be created: the OID field (OBJECTID), the geometry field (SHAPE), and this special GDB_GEOMATTR_DATA field. It is very important to be aware of this because you will not be able to see this column when working in ArcGIS Desktop or when using arcpy.

The GDB_GEOMATTR_DATA field is not shown when accessing a feature class using arcpy.

[f.name for f in arcpy.ListFields('samplefc')]
[u'OBJECTID', u'SHAPE', u'SHAPE.STArea()', u'SHAPE.STLength()']

Querying the table using SQL, however, does show the field.

select * from dbo.samplefc

If you are working with your enterprise geodatabase only using ArcGIS tools, you may not notice anything. However, if you have existing SQL scripts that work with the feature class schema, it is a good time to check that those scripts will not remove the GDB_GEOMATTR_DATA column from the feature class. This could happen if you are re-constructing the schema based on another table and have previously needed to keep only the OBJECTID and the SHAPE columns. After moving to 10.5, you also need to keep the GDB_GEOMATTR_DATA column.

Keep in mind that deleting the GDB_GEOMATTR_DATA column will make the feature class unusable in ArcGIS. Moreover, if this feature class stores any complex geometries such as curves, deleting the GDB_GEOMATTR_DATA column would result in data loss.

Trying to preview a feature class without the GDB_GEOMATTR_DATA column in ArcCatalog would produce the following error:

database.schema.SampleFC: Attribute column not found [42S22:[Microsoft][SQL Server Native Client 11.0][SQL Server]Invalid column name ‘GDB_GEOMATTR_DATA’.] [database.schema.SampleFC]

Even though very unlikely to happen, trying to add a new field called exactly GDB_GEOMATTR_DATA to a valid feature class using ArcGIS tools would also result in an error:

ERROR 999999: Error executing function.
Underlying DBMS error [Underlying DBMS error [[Microsoft][SQL Server Native Client 11.0][SQL Server]Column names in each table must be unique. Column name ‘GDB_GEOMATTR_DATA’ in table ‘SDE.SAMPLE1’ is specified more than once.][database.schema.sample1.GDB_GEOMATTR_DATA]]
Failed to execute (AddField).

Obviously, trying to add the GDB_GEOMATTR_DATA column using plain SQL would not work either:

ALTER TABLE sde.samplefc

Column names in each table must be unique. Column name ‘GDB_GEOMATTR_DATA’ in table ‘sde.samplefc’ is specified more than once.

Multiple Ring Buffer with PostGIS and SQL Server

Recently I needed to generate multiple ring buffers around some point features. This can be done with a dozen tools: the Multiple Ring Buffer geoprocessing tool in ArcGIS, arcpy (generating multiple buffer polygons with the buffer() method of the arcpy.Geometry() object and merging them into a single feature class), or open source GIS tools such as QGIS. It is also possible to achieve this using a relational database that has support for spatial functions. In this post, I would like to show you how this can be done using the ST_Buffer spatial function in PostGIS and SQL Server.

In order to generate multiple buffer distance values (for instance, from 100 to 500 with a step of 100) in SQL Server, I would probably need to use a CTE or just create a plain in-memory table using declare; in other words, this is what it takes to reproduce what range(100, 501, 100) does in Python.

In the gist below, there are two ways to generate multiple buffers – using the plain table and the CTE.

Generating a sequence of distances in Postgres is a lot easier thanks to the generate_series function, whose arguments (start, stop, step) resemble those of range in Python, except that the stop value is included in the result.
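To make the comparison with Python concrete, the sketch below wraps range in a small helper that includes the stop value, the way the Postgres function does. The Python function named generate_series here is a hypothetical stand-in written for this illustration, and it only handles positive integer steps.

```python
def generate_series(start, stop, step=1):
    """Return start, start + step, ... up to and including stop.

    Python stand-in for the Postgres function; positive integer steps only.
    """
    return range(start, stop + 1, step)


# Buffer distances from 100 to 500 with a step of 100.
distances = list(generate_series(100, 500, 100))
print(distances)

# The plain range call needs stop + 1 to produce the same values.
same_distances = list(range(100, 501, 100))
```

Each of these distances would then be passed as the second argument of ST_Buffer to produce one ring per value.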