
Leaving the GIS industry: bye for now

After working for more than 10 years in the geospatial industry, I have decided to change fields and focus on pure software engineering. I have quite enjoyed using GIS tools to solve various computational problems, and software development tools to build useful GIS products and services for my customers. However, as my interests started shifting towards clean code and programming per se – I noticed I was reading Refactoring by Martin Fowler in bed far more often than The ESRI Guide to GIS Analysis by Andy Mitchell – I thought it would be useful to try changing my career path.

As I won’t be using GIS software any longer, I won’t be able to post new practical material useful to GIS analysts and developers. However, in this post I would like to share a few parting thoughts that could be of interest to fellow GIS professionals.

This blog becomes an archive of hopefully useful resources. Good luck!

Skills

SQL

I cannot stress enough how important it is for anyone using GIS data to master SQL. It is an integral part of nearly any GIS processing workflow, and being able to process or manage data stored in a DBMS using SQL is crucial. As GIS datasets grow in size, ad-hoc scripting for data processing will not suffice, as there are still no spatial data processing packages efficient enough to make the process consistent, fast, and reliable. Using PostGIS or SQL Server native spatial types can often get you farther than any open-source Python package. Don’t stop mastering SQL after learning the basics as there is so much more.
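As a tiny illustration of pushing spatial work into the database – a minimal sketch, assuming a PostGIS database with made-up tables parcels(name, geom) and forests(geom) and made-up connection details – a single spatial SQL query can replace a whole feature-by-feature script:

import psycopg2

sql = """
SELECT p.name,
       SUM(ST_Area(ST_Intersection(p.geom, f.geom))) AS forest_area
FROM parcels AS p
JOIN forests AS f
  ON ST_Intersects(p.geom, f.geom)
GROUP BY p.name;
"""

# compute the forest-covered area per parcel entirely inside the database
conn = psycopg2.connect(dbname='gisdb', user='gis', password='gis')
with conn.cursor() as cur:
    cur.execute(sql)
    for name, forest_area in cur.fetchall():
        print(name, forest_area)
conn.close()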

I have two massive posts about using SQL functions for GIS:

Math

Don’t bother too much with studying math. I am not sure why many people I have spoken to think that in order to be good at GIS one has to be good at math. Being comfortable with high school algebra and geometry is definitely useful (the angle between intersecting roads, the area of a lake, the percentage of a parcel covered by forest), but there is no need to study more sophisticated math. For casual GIS analyst work, having the basics covered will be more than enough.

Of course, if you are building a new routing algorithm that is more efficient than A*, you will have to review a lot of graph theory material, and if you are building a new Voronoi diagram generation procedure, you will find yourself reading computational geometry books. If you do a lot of GIS analysis involving spatial statistics, then you should definitely get the amazing book Elementary Statistics for Geographers (Third Edition).

If you do a lot of computationally intensive programming and would like to catch up on math specifically for GIS, review the materials referenced in the Readings section below.

Testing in GIS

Review these two posts:

Python

Learn Python. It is the most widely used programming language in the GIS industry and its usage will only expand. I have written the post A progression path for GIS analyst which can be used as a development roadmap. You should be very comfortable using Python; strong Python skills will give you more influence over operations within your organization and let you automate more manual workflows, leading to a better workplace.

Linux and bash

Learn Linux and bash. I think I should have started using Linux earlier. There are a few (1, 2) ready-to-use VirtualBox images with a ton of open-source GIS software installed and configured. Using those machines will save you a lot of time. Learning bash is extremely helpful because it lets you be much more productive running small commands and building data processing pipelines than you would normally be on Windows using a programming language. Obviously, learning bash, Linux, and Python is part of an industry-agnostic skill set that will benefit you at any later point in time.

Readings

There are so many excellent GIS books that I would like to recommend. You can find the most popular titles online. What I’d like to do instead is to share some of the hidden gems I have discovered and really enjoyed reviewing. You can find those in the post Useful resources in computer science/math for GIS Analysts.

Programming

The ad-hoc mentality is very difficult to fight. It is 7 pm. You have a job to get done before releasing the data to a customer tomorrow morning. You are adding a missing domain to a geodatabase that your colleague’s Python script failed to add. Then you are changing the data type of a field because you have just realized you need to store text instead of numbers. And… you find a few other things you are so glad to have spotted before the release. You fix them, zip the database, and upload it to an FTP site. It is 10 pm; you are tired but happy and are heading off to bed.

Success! … Or is it? First thing the next morning you want to document the manual changes you introduced yesterday, but you are dragged into some other urgent job… and you never do. A week later, a customer sends an email telling you she is not able to run her in-house tools using the database you have prepared for her, although the one you prepared a month ago works. Now it is 9 pm again and you are writing an odd-looking script trying to compare the two databases and recall what you did that evening… You are in a mess.

Doing what you have done may look natural because you just want to get stuff done. However, I want you to look at this from another perspective. You want your steps to be reproducible. You want to be able to track the changes you have made. Not only you, but any colleague of yours should be able to pick up the updates that have been made to any piece of data or a script. So resist the urge to just get stuff done, pace yourself, and track your work with one of the following methods.

Documenting manually

If you are not comfortable with programming or scripting at all, you should document each step you take while making modifications to a dataset, so that at least what has been done can be seen in written form. I cannot stress this enough: document as you go, writing down each change right after you make it, not after you have done all the work. This is how it can look:

  1. You add the field Area of Double type to the table dbo.Parcels.
  2. You write: “Add field Area of Double type to the table dbo.Parcels.”
  3. You drop the field OldArea of Double type in the table dbo.Parcels.
  4. You write: “Drop field OldArea of Double type in the table dbo.Parcels.”

One of the disadvantages of this approach is that the changes can get out of sync with the documentation: you could make an error documenting a data type or a field name. Another issue is that the very same step can be done in many ways – what if you add a field to a database using some GIS application while a colleague of yours uses a DBMS command-line tool? Documenting the exact procedure of making changes soon becomes tedious, and you end up with tons of instructions that easily become misleading or plain obsolete. However, if you are rigorous, it is still possible to maintain a decent level of change tracking with sufficiently rigid discipline.

Simply programming with VCS

Another approach is to write a program that makes the changes. When you write code, you don’t need to document what you are doing, because a reader familiar with the syntax of the programming language will understand what happens; you can, of course, still add comments explaining why certain fields are required. So, if you are building a database with a few tables, you can write a SQL script that can be re-run to recreate your database at any point in time. If you never make manual changes to the database and only write and keep SQL commands, your main SQL data compilation script will never get out of sync.
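For example – a minimal sketch, where the file name and connection details are assumptions – the whole database can live in a single SQL file that is kept under version control and re-run from scratch whenever needed:

import psycopg2

# the single source of truth, e.g. DROP/CREATE TABLE statements
with open('build_database.sql') as f:
    build_sql = f.read()

conn = psycopg2.connect(dbname='gisdb', user='gis', password='gis')
with conn, conn.cursor() as cur:  # commits on success, rolls back on error
    cur.execute(build_sql)
conn.close()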

This leads us to the concept of version tracking, where it is possible to see how your SQL script has changed since the last version. Who is not guilty of having had, at some point in their career, a dozen files named “production_final_compilation_truly_final_12.sql”? To avoid this mess, you should really use a VCS.

The main argument against this approach is that setting up all these version control tools looks like overkill for someone doing simple GIS work. However, you will see how much safer your work becomes in the long run; it will pay off very soon. Invest some time in learning about a VCS such as Git for managing the source code. All major players – Bitbucket, GitLab, and GitHub – provide free private repositories. Find out whether there is a VCS solution deployed in-house within your organization, such as Microsoft TFS, which you could use to check in the code. Should you like to dive deeper into Git, read the Pro Git book for free online. If you are not comfortable putting anything into the cloud (which is just someone else’s computer), use Git locally on your machine or on a local server where you can securely check in your code, and ask your system administrator to take backups of those repositories.

Open source vs proprietary software

Throughout your GIS career, you will most likely be exposed to both proprietary and open-source software. You may have Windows on your machine with QGIS, or a Linux machine with Esri ArcGIS Server. It would be naive to think that either of these technologies is superior to the other. You should be able to get the job done with whatever tools you have available, because you will not always be able to decide what your employer will be using.

I suggest instead being comfortable with both of them and widening your toolset as much as possible. As you become exposed to different tools, you will soon realize that commercial software can be much better for certain jobs than open-source or free alternatives, and vice versa. For instance, certain types of spatial joins can run faster in ArcGIS Desktop than in PostGIS, while some GDAL-based raster masking may outperform the ArcGIS Spatial Analyst tools. Creating a map layout with data driven pages is a pleasure in ArcMap but can be tedious in QGIS. Always benchmark to understand which tools work best, and document your findings. Keep in mind that the very same tool can take 1 second to process 1,000 features and 5 minutes to process 10,000 features. Briefly review Big O notation to avoid surprises.

I have always encouraged people using a particular tool to understand what it really does. Having a clear understanding of the underlying process will make it possible for you to extend an existing tool or write your own. For instance, if you are applying some raster masking, you should understand what a matrix is. If you do a spatial join, you should understand how having a spatial index helps.

Always look to expand your toolset and be prepared to apply the tool you think is right for a particular job. A customer only has PostGIS and needs some polygon intersection done? Use the native ST_Intersects and ST_Intersection functions. Don’t have access to QGIS? Know which ArcToolbox tool does the same job. Have to process a file on a Linux machine you SSH into (so no Excel-like software)? Use bash or pandas to wrangle the data as needed. You shouldn’t be constrained by the environment you are in or the tools you have at your disposal. You should be able to get the job done no matter what.

Keeping up with the industry

I have been a user of GIS StackExchange since 2013 and have blogged about my experience and why it is useful to be active on a forum in the post 4 years with GIS at StackExchange. Make a habit of reading the most popular questions of the week, for instance every weekend. If you see a question you know the answer to, post it. It also helps to post a question you struggled with yourself, even if you spent a week solving it and finally found a solution – please post an answer to your own question. You will save a peer GIS professional some effort, and you will also be able to find that answer when you do a web search for the same issue in a few years’ time. If you have some time, review the most popular questions using the most voted questions option; there is so much to learn there.


Warning: new GDB_GEOMATTR_DATA column in ArcGIS geodatabase 10.5

This post will be of interest to ArcGIS users who are upgrading enterprise geodatabases from ArcGIS 10.1–10.4 to 10.5+. According to the Esri documentation and resources (link1, link2, link3):

Feature classes created in an ArcGIS 10.5 or 10.5.1 geodatabase using a 10.5 or 10.5.1 client use a new storage model for geometry attributes, which stores them in a new column (GDB_GEOMATTR_DATA). The purpose of this column is to handle complex geometries such as curves. Since a feature class can have only one shape column, the storage of circular geometries must be stored separately and then joined to the feature class when viewed in ArcGIS.

This means that if you create a new feature class in an enterprise geodatabase (either manually or by using a geoprocessing tool), three fields will be created: the OID field (OBJECTID), the geometry field (SHAPE), and this special GDB_GEOMATTR_DATA field. It is very important to be aware of this because you will not be able to see this column when working in ArcGIS Desktop or when using arcpy.

The GDB_GEOMATTR_DATA field is not shown when accessing a feature class using arcpy.

[f.name for f in arcpy.ListFields('samplefc')]
[u'OBJECTID', u'SHAPE', u'SHAPE.STArea()', u'SHAPE.STLength()']

Querying the table using SQL, however, does show the field.

select * from dbo.samplefc
[OBJECTID],[SHAPE],[GDB_GEOMATTR_DATA]

If you work with your enterprise geodatabase only using ArcGIS tools, you may not notice anything. However, if you have existing SQL scripts that work with the feature class schema, now is a good time to check that those scripts will not remove the GDB_GEOMATTR_DATA column from the feature class. This could happen if you are re-constructing the schema based on another table and have previously needed to keep only the OBJECTID and SHAPE columns. After moving to 10.5, you would also need to keep the GDB_GEOMATTR_DATA column.
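A quick way to guard against this – a sketch only, where the DSN and table names are assumptions – is to check the information schema after your scripts have run:

import pyodbc

sql = """
SELECT COUNT(*)
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'dbo'
  AND TABLE_NAME = 'samplefc'
  AND COLUMN_NAME = 'GDB_GEOMATTR_DATA'
"""

conn = pyodbc.connect('DSN=gisdb;Trusted_Connection=yes')
if conn.cursor().execute(sql).fetchone()[0] == 0:
    raise RuntimeError('GDB_GEOMATTR_DATA is missing - '
                       'the feature class would be unusable in ArcGIS')
conn.close()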

Keep in mind that deleting the GDB_GEOMATTR_DATA column will make the feature class unusable in ArcGIS. Moreover, if this feature class stores any complex geometries such as curves, deleting the GDB_GEOMATTR_DATA column would result in data loss.

Trying to preview a feature class without the GDB_GEOMATTR_DATA column in ArcCatalog will produce the following error:

database.schema.SampleFC: Attribute column not found [42S22:[Microsoft][SQL Server Native Client 11.0][SQL Server]Invalid column name ‘GDB_GEOMATTR_DATA’.] [database.schema.SampleFC]

Even though it is very unlikely to happen, trying to add a new field called exactly GDB_GEOMATTR_DATA to a valid feature class using ArcGIS tools will also result in an error:

ERROR 999999: Error executing function.
Underlying DBMS error [Underlying DBMS error [[Microsoft][SQL Server Native Client 11.0][SQL Server]Column names in each table must be unique. Column name ‘GDB_GEOMATTR_DATA’ in table ‘SDE.SAMPLE1’ is specified more than once.][database.schema.sample1.GDB_GEOMATTR_DATA]]
Failed to execute (AddField).

Obviously, trying to add the GDB_GEOMATTR_DATA column using plain SQL will not work either:

ALTER TABLE sde.samplefc
ADD GDB_GEOMATTR_DATA varchar(100)

Column names in each table must be unique. Column name ‘GDB_GEOMATTR_DATA’ in table ‘sde.samplefc’ is specified more than once.

Multiple Ring Buffer with PostGIS and SQL Server

Recently I needed to generate multiple ring buffers around some point features. This can be done with a dozen tools – the Multiple Ring Buffer geoprocessing tool in ArcGIS, using arcpy to generate multiple buffer polygons with the buffer() method of the arcpy.Geometry object and merging them into a single feature class, or using open-source GIS tools such as QGIS. It is also possible to achieve this with a relational database that supports spatial functions. In this post, I would like to show you how this can be done using the ST_Buffer spatial function in PostGIS and SQL Server.

To generate multiple buffer distance values (for instance, from 100 to 500 with a step of 100) in SQL Server, I would need to use a CTE or just create a plain in-memory table using DECLARE; in other words, this is what it takes to run range(100, 501, 100) in Python.

In the gist below, there are two ways to generate multiple buffers – using the plain table and the CTE.
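In case the embedded gist does not render, here is a minimal sketch of the CTE variant only – the table dbo.Points(ID, Shape) and the connection details are assumptions:

import pyodbc

sql = """
WITH distances AS (
    SELECT 100 AS distance
    UNION ALL
    SELECT distance + 100 FROM distances WHERE distance < 500
)
SELECT p.ID,
       d.distance,
       p.Shape.STBuffer(d.distance).STAsText() AS ring_wkt
FROM dbo.Points AS p
CROSS JOIN distances AS d
ORDER BY p.ID, d.distance;
"""

conn = pyodbc.connect('DSN=gisdb;Trusted_Connection=yes')
rings = conn.cursor().execute(sql).fetchall()
conn.close()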

Generating a sequence of distances in Postgres is a lot easier thanks to the generate_series function, which provides nearly the same syntax as range in Python.
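A minimal PostGIS counterpart could look like this – again a sketch, where the table public.points(id, geom) and the connection details are assumptions:

import psycopg2

sql = """
SELECT p.id,
       d.distance,
       ST_AsText(ST_Buffer(p.geom, d.distance)) AS ring_wkt
FROM public.points AS p
CROSS JOIN generate_series(100, 500, 100) AS d(distance)
ORDER BY p.id, d.distance;
"""

conn = psycopg2.connect(dbname='gisdb', user='gis', password='gis')
with conn.cursor() as cur:
    cur.execute(sql)
    rings = cur.fetchall()
conn.close()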

SQL Server spatial functions for GIS users: part 2

I was curious to see which posts of my blog have been visited the most over time. The most popular one is SQL Server spatial functions for GIS users, which I published 3 years ago. It seems there are many GIS users who need to use SQL Server spatial functions, since search requests such as “sql server spatial functions” appear fairly often in the search engines. Today, I will share more examples of using SQL Server spatial functions for GIS users.

Below you will find just a few sample SQL queries that have been executed against a SQL Server database containing the datasets from the PostGIS workshop materials by Boundless, which can be downloaded from this link.

This post is an extract of the Jupyter notebook published on GitHub. Open the notebook to see the result tables, or jump to the end of the post where I have it published.

Find points that have the same coordinates

To find coincident points (often referred to as duplicates), you could use various SQL Server spatial functions.

Table1.Shape.STDistance(Table2.Shape) < 1000 --distance value
Table1.Shape.STEquals(Table2.Shape) = 1 --whether shapes are identical
Table1.SHAPE.STX and Table1.SHAPE.STY --compare points XY coordinates

All of these methods would give the same result while having different execution performance.

Find using STDistance

--Find duplicate points within a certain borough
SELECT CAST(T1.ID AS INT) AS FirstPoint,
CAST(T2.ID AS INT) SecondPoint,
T1.Shape.STDistance(T2.Shape) Distance
FROM
dbo.Homicides T1
JOIN
dbo.Homicides T2
ON
T1.ID < T2.ID
and
T1.Shape.STDistance(T2.Shape) = 0
and
T1.BORONAME = 'Queens'
ORDER BY
Distance, FirstPoint

Find using STEquals

SELECT CAST(T1.ID AS INT) AS FirstPointId,
CAST(T2.ID AS INT) SecondPointId,
T1.Shape.STDistance(T2.Shape) Distance
FROM
dbo.Homicides T1
JOIN
dbo.Homicides T2
ON
T1.ID < T2.ID
and
T1.Shape.STEquals(T2.Shape) = 1
and
T1.BORONAME = 'Queens'
ORDER BY
FirstPointId, SecondPointId

Find using STX and STY

SELECT T1.SHAPE.STX, T1.SHAPE.STY, COUNT(*) AS COUNT
FROM
dbo.Homicides T1
WHERE
T1.BORONAME = 'Queens'
GROUP BY
T1.SHAPE.STX, T1.SHAPE.STY
HAVING
COUNT(*) > 1

Calculating distances between points stored in tables

The STDistance function can be used to find the planar distance between points, whether they are stored in the same table or in two different tables.

Calculating distances between points stored in the same table

SELECT CAST(T1.ID AS INT) AS FirstPointId
,CAST(T2.ID AS INT) SecondPointId
,T1.Shape.STDistance(T2.Shape) Distance
FROM
dbo.Homicides T1
JOIN dbo.Homicides T2
ON T1.ID < T2.ID
and
T1.Shape.STDistance(T2.Shape) BETWEEN 10 AND 20
and
T1.BORONAME = 'Queens'
ORDER BY
Distance

Calculating distances between points stored in two tables

Find homicide points that are located within a specified number of meters of the subway station points. ArcGIS tool: Point Distance (Analysis).

SELECT * FROM
(SELECT CAST(Homi.ID AS int) AS HomicideId
,Homi.WEAPON AS Weapon
,CAST(Subway.ID AS int) AS SubwayId
,Subway.NAME AS Name
,Homi.Shape.STDistance(Subway.Shape) AS Distance
FROM dbo.HOMICIDES Homi
cross join dbo.Subway_stations Subway
where Homi.BORONAME = 'Manhattan' AND Subway.BOROUGH = 'Manhattan')
AS
data
WHERE
Distance < 20
ORDER BY
Distance

Counting points in polygons

To count points in polygons, you could use the STContains function. Table1.Shape.STContains(Table2.Shape) would return 0 or 1.

Find neighborhoods with the largest number of crimes committed (count number of homicides in each neighborhood)

SELECT TOP 10
Polys.Name AS NeighborhoodName, Count(*) AS CrimeCount
FROM
dbo.Homicides AS Points
JOIN
dbo.Neighborhoods AS Polys
ON
Polys.Shape.STContains(Points.Shape) = 1
GROUP BY
Polys.Name
ORDER BY
CrimeCount DESC

We can also calculate a column in the Neighborhoods table to contain the number of points within each neighborhood. For that, we will first need to add a new column to the table, then populate it, and then drop it to leave the data clean for the further queries.

This query, adding a new field and calculating the number of points located within each polygon, does what the ArcGIS Spatial Join GP tool does.

ALTER TABLE dbo.Neighborhoods
ADD PointCount int;

UPDATE
Polys
SET
[PointCount] = COUNTS.CrimeCount
FROM
dbo.Neighborhoods AS Polys
JOIN
(
SELECT
Polys.Name AS NeighborhoodName, Count(*) AS CrimeCount
FROM
dbo.Homicides AS Points
JOIN
dbo.Neighborhoods AS Polys
ON
Polys.Shape.STContains(Points.Shape) = 1
GROUP BY
Polys.Name
) AS COUNTS
ON
Polys.Name = COUNTS.NeighborhoodName

Add polygon name to points located within the polygon

To enrich the points layer with information about which polygon each point is located within, you can use the STWithin function. In this example, we will add a new column to the homicides table so we know in what neighborhood each crime has been committed.

Again, this query, adding a new field and calculating the neighborhood name for the points located within each polygon, does what the ArcGIS Spatial Join GP tool does.

SELECT TOP 10
Points.OBJECTID
,Points.INCIDENT_D
,Points.BORONAME
,Points.NUM_VICTIM
,Points.PRIMARY_MO
,Points.ID
,Points.WEAPON
,Points.LIGHT_DARK
,Points.YEAR
,Polys.Name AS NeighborhoodName
FROM
dbo.Neighborhoods AS Polys
JOIN
dbo.Homicides AS Points
ON
Points.Shape.STWithin(Polys.Shape) = 1
ORDER BY
OBJECTID

ALTER TABLE dbo.Homicides
ADD NeighborhoodName varchar(50);

UPDATE
Points
SET
[NeighborhoodName] = PointsInPolys.NeighborhoodName
FROM
dbo.Homicides AS Points
JOIN
(
SELECT
Points.OBJECTID
,Points.INCIDENT_D
,Points.BORONAME
,Points.NUM_VICTIM
,Points.PRIMARY_MO
,Points.ID
,Points.WEAPON
,Points.LIGHT_DARK
,Points.YEAR
,Polys.Name AS NeighborhoodName
FROM
dbo.Neighborhoods AS Polys
JOIN
dbo.Homicides AS Points
ON
Points.Shape.STWithin(Polys.Shape) = 1
) AS PointsInPolys
ON
PointsInPolys.ID = Points.ID

Summary statistics and frequency

This is a simplistic implementation of the Frequency GP tool in ArcGIS using a T-SQL stored procedure.

-- @Columns must start with an alias for COUNT(*), e.g. FREQUENCY
ALTER PROCEDURE dbo.FrequencyTable
@TableName varchar(100),
@Columns varchar(500)
AS
BEGIN
EXEC ('SELECT COUNT(*) AS ' + @Columns +
' FROM ' + @TableName +
' WHERE WEAPON <> '''' AND LIGHT_DARK <> ''''
GROUP BY WEAPON,
LIGHT_DARK ORDER BY FREQUENCY DESC;');
END
GO
EXEC dbo.FrequencyTable 'dbo.HOMICIDES', 'FREQUENCY, WEAPON, LIGHT_DARK';

The Jupyter notebook with the result sets is published on GitHub.


Removing redundant polyline vertices using arcpy

If you maintain a polyline feature class of entities such as city roads, you may have seen that some of your features have redundant vertices – vertices that are not necessary to preserve the feature’s shape. Not only do they increase the size of your dataset, but they may also slow down processing, as it takes longer to process geometries with a larger number of vertices.

You can see the redundant vertices (small green points) highlighted on a source polyline (thin green line). The cleaned shape (thick blue line) has only the essential vertices (large blue points).


To automate the process of removing the redundant vertices, I’ve written a Python script. The tricky part was to find out whether the points one iterates over are collinear – see the sketch after the notes below.

Take a backup of your data before running this script as it modifies the features in place!

  1. The script works for single-part as well as multi-part lines.
  2. You can choose what to do with curves: they are either preserved without removing any redundant vertices the feature might have OR densified with a specified distance value. Look up the .densify() method on the arcpy.Geometry object. Alternatively, use the Densify GP tool to fix your curves before running this script if you would like to experiment with the deviation values.
  3. Only polyline feature classes are supported (no points/polygons). It could be extended to support polygons, though.
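To illustrate the tricky part, here is a sketch of the collinearity test alone (plain x, y tuples; not the full arcpy script). Three points are collinear when the 2D cross-product of the vectors they span is close to zero:

def collinear(p1, p2, p3, tolerance=1e-9):
    """Return True when the points p1, p2, p3 lie on one straight line."""
    cross = ((p2[0] - p1[0]) * (p3[1] - p1[1]) -
             (p2[1] - p1[1]) * (p3[0] - p1[0]))
    return abs(cross) <= tolerance

def essential_vertices(vertices):
    """Drop middle vertices sitting on the line between their neighbours."""
    keep = [vertices[0]]
    for i in range(1, len(vertices) - 1):
        if not collinear(keep[-1], vertices[i], vertices[i + 1]):
            keep.append(vertices[i])
    keep.append(vertices[-1])
    return keep

# essential_vertices([(0, 0), (1, 0), (2, 0), (2, 1)]) -> [(0, 0), (2, 0), (2, 1)]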

Access ArcMap UI from Python comtypes

I have blogged earlier about the possibility of calling methods defined in classes stored within a .NET library, either using clr (with the help of pythonnet) or using native ctypes, depending on how your libraries have been compiled. I have also shared some examples of how one can use comtypes to access the COM objects of ArcGIS Desktop and interact with ArcObjects via Python without writing any C#, Java, or C++ code.

This time, I needed to combine these approaches. I had to execute a tool on a toolbar in ArcMap (this can be a custom .NET add-in or a built-in tool such as Full Extent) from Python code while working in an open ArcMap session. This is different from what I had done before, because I had always worked with comtypes in stand-alone mode, accessing existing map documents stored on disk. This time, I had to use comtypes within an ArcMap session, hooking into the currently open map document.

Fortunately, this wasn’t as hard as I expected. I just had to use the standard interface to get into the application, then into the map document, and then into the available command bars. The last thing to do is to find the tool you need and execute it. I’ve published a snippet that will get you started.

The first sample is for calling the Full Extent button on the Tools toolbar in ArcMap. The second one is for calling a button on a custom toolbar that is part of a custom .NET add-in. You can use both of these samples as part of a custom Python script tool or a Python add-in.
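In case the published snippet does not render here, below is a heavily simplified sketch of the approach. The type library paths, the application index in the running object table, and the command ProgID are all assumptions to adjust for your setup; the overall chain is AppROT -> IApplication -> IDocument.CommandBars -> ICommandItem.Execute():

from comtypes.client import GetModule, CreateObject

# wrap the Esri type libraries (ArcGIS 10.5 paths shown as an example)
esri_com = r'C:\Program Files (x86)\ArcGIS\Desktop10.5\com'
GetModule(esri_com + r'\esriFramework.olb')
GetModule(esri_com + r'\esriSystem.olb')
import comtypes.gen.esriFramework as esriFramework
import comtypes.gen.esriSystem as esriSystem

# get the running ArcMap application from the application running object table
app_rot = CreateObject(esriFramework.AppROT, interface=esriFramework.IAppROT)
app = app_rot.Item(0)  # assumption: the first registered app is our ArcMap

# find the command by its ProgID and execute it
uid = CreateObject(esriSystem.UID, interface=esriSystem.IUID)
uid.Value = 'esriArcMapUI.ZoomFullExtentCommand'  # assumption: verify the ID
command_item = app.Document.CommandBars.Find(uid)
command_item.Execute()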


ArcPy: Batch export of attachments from geodatabase feature class with custom naming

If you have worked with attachments in an ArcGIS geodatabase, you might have needed to take the attachments out of the geodatabase to use them in another system or to process them separately. Unfortunately, there is no geoprocessing tool that would export all attachments to files; however, this workflow is supported with arcpy and Python. There is a KB article on the Esri site that shows the code necessary to export the attachments of a feature class into an output folder.

That workflow is based, however, on using the attributes from the attachments table – %featureclassname%__ATTACH. But what if you would like to export the attachments naming the output files after attributes of the source feature the attachments are associated with? In this case, you need to find out which feature every attachment is associated with (this is exposed via a relationship class) and then use this value to name the output file. To get all the information needed about the relationship class properties, a Describe object for the relationship class can be used. Of course, when naming output files after a source feature attribute, you have to make sure the attribute is unique; otherwise, attachment files exported later will overwrite earlier ones.

The tool handles Unicode characters in the attribute field used for naming the output files, so it will work for users in various locales. However, there is no validation of the output file names against operating system restrictions on legal file names; this is easy to fix, for instance, with a regex.

Below is a sketch of the code that will do this, with some extra comments to help you follow along:
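This is a sketch, not the original script: the geodatabase path, feature class, attachment table, and field names are all examples to adapt.

import os
import arcpy

gdb = r'C:\GIS\data.gdb'
fc = os.path.join(gdb, 'Roads')
attach_table = os.path.join(gdb, 'Roads__ATTACH')
out_folder = r'C:\GIS\attachments'
name_field = 'ROAD_NAME'  # source feature attribute used to name the files

# map each source feature OID to the attribute value used for naming;
# make sure this attribute is unique to avoid overwriting exported files
oid_to_name = dict(arcpy.da.SearchCursor(fc, ['OID@', name_field]))

# assumption: the attachment table links to the feature class via
# REL_OBJECTID; some geodatabases use REL_GLOBALID instead
with arcpy.da.SearchCursor(
        attach_table, ['DATA', 'ATT_NAME', 'REL_OBJECTID']) as cursor:
    for blob, att_name, rel_oid in cursor:
        ext = os.path.splitext(att_name)[1]  # keep the original extension
        out_name = u'{0}{1}'.format(oid_to_name[rel_oid], ext)
        with open(os.path.join(out_folder, out_name), 'wb') as f:
            f.write(blob.tobytes())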