Working with SQL in Jupyter notebook and dumping pandas into a SQL database

I have posted previously an example of using the SQL magic inside Jupyter notebooks. Today, I will show you how to execute a SQL query against a PostGIS database, get the results back into a pandas DataFrame object, manipulate it, and then dump the DataFrame into a brand new table inside the very same database. This can be very handy if some of your operations are better done using plain SQL, for instance, with the help of spatial SQL functions, but other ones are easier to perform using pandas​, for instance, adding and calculating new columns when the data you need to access is not stored inside the database.

The steps are as below:

  • Connect to a database using the SQL magic syntax
  • Execute a SELECT SQL query getting back a result set
  • Read the result set into a pandas DataFrame object
  • Do stuff with the DataFrame object
  • Dump it to a new table in the database you are connected to using the PERSISTcommand

Again, this is very handy for prototyping when you need to produce some tables doing a database design or when you need to have a new temporary table created for some application to read or when running some integration tests and needing to have a mock-up table to work with.

The sample notebook is available as a gist:

 

Advertisements

SQL Server spatial functions for GIS users: part 2

I was curious to see what posts of my blog have been visited by most people over the time. The most popular post is SQL Server spatial functions for GIS users which I have published 3 years ago. It seems as there are many GIS users who need to use SQL Server spatial functions as similar requests such as “sql server spatial functions” are appearing fairly often in the search engines. Today, I will share more examples of using SQL Server spatial functions if you are a GIS user.

Below you find just a few sample SQL queries that have been executed against a SQL Server database containing datasets accessible via PostGIS workshop materials by Boundless which can be downloaded from this link.

This post is an extraction of the Jupyter notebook published on GitHub. Open the notebook to see the result tables or jump to the end of the post where I have it published.

Find points that have the same coordinates

To find coincident points (often referred to as duplicates), you could use various SQL Server spatial functions.

Table1.Shape.STDistance(Table2.Shape) < 1000 --distance value
Table1.Shape.STEquals(Table2.Shape) = 1 --whether shapes are identical
Table1.SHAPE.STX and Table1.SHAPE.STY --compare points XY coordinates

All of these methods would give the same result while having a different execution performance.

Find using STDistance

--Find duplicate points within a certain borough
SELECT CAST(T1.ID AS INT) AS FirstPoint,
CAST(T2.ID AS INT) SecondPoint,
T1.Shape.STDistance(T2.Shape) Distance
FROM
dbo.Homicides T1
JOIN
dbo.Homicides T2
ON
T1.ID < T2.ID
and
T1.Shape.STDistance(T2.Shape) = 0
and
T1.BORONAME = 'Queens'
ORDER BY
Distance, FirstPoint

Find using STDistance

SELECT CAST(T1.ID AS INT) AS FirstPointId,
CAST(T2.ID AS INT) SecondPointId,
T1.Shape.STDistance(T2.Shape) Distance
FROM
dbo.Homicides T1
JOIN
dbo.Homicides T2
ON
T1.ID < T2.ID
and
T1.Shape.STEquals(T2.Shape) = 1
and
T1.BORONAME = 'Queens'
ORDER BY
FirstPointId, SecondPointId

Find using STEquals

SELECT CAST(T1.ID AS INT) AS FirstPointId,
CAST(T2.ID AS INT) SecondPointId,
T1.Shape.STDistance(T2.Shape) Distance
FROM
dbo.Homicides T1
JOIN
dbo.Homicides T2
ON
T1.ID < T2.ID
and
T1.Shape.STEquals(T2.Shape) = 1
and
T1.BORONAME = 'Queens'
ORDER BY
FirstPointId, SecondPointId

Find using STX and STY

SELECT T1.SHAPE.STX, T1.SHAPE.STY, COUNT(*) AS COUNT
FROM
dbo.Homicides T1
WHERE
T1.BORONAME = 'Queens'
GROUP BY
T1.SHAPE.STX, T1.SHAPE.STY
HAVING
COUNT(*) > 1

Calculating distances between points stored in tables

The STDistance function could be used to find the plain distance between the points stored within the same table.

SELECT CAST(T1.ID AS INT) AS FirstPointId
,CAST(T2.ID AS INT) SecondPointId
,T1.Shape.STDistance(T2.Shape) Distance
FROM
dbo.Homicides T1
JOIN dbo.Homicides T2
ON T1.ID < T2.ID
and
T1.Shape.STDistance(T2.Shape) BETWEEN 10 AND 20
and
T1.BORONAME = 'Queens'
ORDER BY
Distance

Calculating distances between points stored in the same table

SELECT CAST(T1.ID AS INT) AS FirstPointId
,CAST(T2.ID AS INT) SecondPointId
,T1.Shape.STDistance(T2.Shape) Distance
FROM
dbo.Homicides T1
JOIN dbo.Homicides T2
ON T1.ID < T2.ID
and
T1.Shape.STDistance(T2.Shape) BETWEEN 10 AND 20
and
T1.BORONAME = 'Queens'
ORDER BY
Distance

Calculating distances between points stored in two tables

Find homicide points that are located within the specified number of meters to the subway stations points. ArcGIS tool: Point Distance (Analysis)

SELECT * FROM
(SELECT CAST(Homi.ID AS int) AS HomicideId
,Homi.WEAPON AS Weapon
,CAST(Subway.ID AS int) AS SubwayId
,Subway.NAME AS Name
,Homi.Shape.STDistance(Subway.Shape) AS Distance
FROM dbo.HOMICIDES Homi
cross join dbo.Subway_stations Subway
where Homi.BORONAME = 'Manhattan' AND Subway.BOROUGH = 'Manhattan')
AS
data
WHERE
Distance < 20
ORDER BY
Distance

Counting points in polygons

To count points in polygons, you could use the STContains function. Table1.Shape.STContains(Table2.Shape) would return 0 or 1.

Find neighborhoods with the largest number of crimes committed (count number of homicides in each neighborhood)

SELECT TOP 10
Polys.Name AS NeighborhoodName, Count(*) AS CrimeCount
FROM
dbo.Homicides AS Points
JOIN
dbo.Neighborhoods AS Polys
ON
Polys.Shape.STContains(Points.Shape) = 1
GROUP BY
Polys.Name
ORDER BY
CrimeCount DESC

We can also calculate a column in the Neighborhoods table to contain the number of points within each neighborhood. For that, we will first need to add a new column to the table, then populate it, and then drop to leave the data clean for the further queries.

This query adding a new fields and calculating the number of points located within each polygon is what is done by the ArcGIS GP tool Spatial Join.

ALTER TABLE dbo.Neighborhoods
ADD PointCount int;

UPDATE
Polys
SET
[PointCount] = COUNTS.CrimeCount
FROM
dbo.Neighborhoods AS Polys
JOIN
(
SELECT
Polys.Name AS NeighborhoodName, Count(*) AS CrimeCount
FROM
dbo.Homicides AS Points
JOIN
dbo.Neighborhoods AS Polys
ON
Polys.Shape.STContains(Points.Shape) = 1
GROUP BY
Polys.Name
) AS COUNTS
ON
Polys.Name = COUNTS.NeighborhoodName

Add polygon name to points located within the polygon

To enrich the points layer with the information what polygon each point is located within you would need to use the STWithin function. In this example, we will add a new column to the homicides table so we know what neighborhood the crime has been committed.

Again, this query adding a new field and calculating the neighborhood name for the points located within each polygon is what is done by the ArcGIS GP tool Spatial Join.

SELECT TOP 10
Points.OBJECTID
,Points.INCIDENT_D
,Points.BORONAME
,Points.NUM_VICTIM
,Points.PRIMARY_MO
,Points.ID
,Points.WEAPON
,Points.LIGHT_DARK
,Points.YEAR
,Polys.Name AS NeighborhoodName
FROM
dbo.Neighborhoods AS Polys
JOIN
dbo.Homicides AS Points
ON
Points.Shape.STWithin(Polys.Shape) = 1
ORDER BY
OBJECTID
ALTER TABLE dbo.Homicides
ADD NeighborhoodName varchar(50);

UPDATE
Points
SET
[NeighborhoodName] = PointsInPolys.NeighborhoodName
FROM
dbo.Homicides AS Points
JOIN
(
SELECT
Points.OBJECTID
,Points.INCIDENT_D
,Points.BORONAME
,Points.NUM_VICTIM
,Points.PRIMARY_MO
,Points.ID
,Points.WEAPON
,Points.LIGHT_DARK
,Points.YEAR
,Polys.Name AS NeighborhoodName
FROM
dbo.Neighborhoods AS Polys
JOIN
dbo.Homicides AS Points
ON
Points.Shape.STWithin(Polys.Shape) = 1
) AS PointsInPolys
ON
PointsInPolys.ID = Points.ID

Summary statistics and frequency

This is a simplistic implementation of the Frequency GP tool in ArcGIS using a T-SQL stored procedure.

ALTER PROCEDURE dbo.FrequencyTable
@Columns varchar(500)
AS
BEGIN
EXEC ('SELECT COUNT(*) AS ' + @Columns +
' FROM dbo.HOMICIDES
WHERE WEAPON <> '''' AND LIGHT_DARK <> ''''
GROUP BY WEAPON,
LIGHT_DARK ORDER BY FREQUENCY DESC;');
END
GO
EXEC dbo.FrequencyTable 'dbo.HOMICIDES', 'FREQUENCY, WEAPON, LIGHT_DARK';

The Jupyter notebook with the result sets:

 

Adding IPython SQL magic to Jupyter notebook

If you do not use the %%sql magic in your Jupyter notebook, the output of your SQL queries will be just a plain list of tuples. A better way to work with the result sets returned is to draw them as a table with the headers. This is where the IPython SQL magic gets very handy. You can install it using pip install ipython-sql. Refer to its GitHub repository for details of the implementation.

You have to connect to a database and then all your subsequent SQL queries will be aware of this connection and the result sets are also drawn nicely in a table. Another neat feature of the %%sql magic is that you will be able to get the result of your SQL query as a pandas data frame object if you would like to proceed working with the result set using Python. The result object of SQL query execution can be accessed from a variable _. This is because IPython’s output caching system defines several global variables; _ (a single underscore) stores previous output just like the IPython interpreter.

Look into this sample Jupyter notebook for illustration.

 

Patching subprocess module for Python 2.7

Some of the scripts I code in Python still target version 2.7 since this is the Python version that is being shipped with ArcGIS Desktop installation. I have recently found a bug in Python 2.7 subprocess module. This is a module that will let you execute your programs from a Python script. A very typical use case would be to run a console application such as an .exe or a .bat file to do some data processing outside of Python script.

The bug has been registered at bugs.python.org with the id 1759845:

subprocess.call fails with unicode strings in command line

So if you would try to execute the following in Python 2.7 (note the file containing the Unicode é):

import subprocess
subprocess.Popen(['python.exe', u"C:\Temp\file_char_é.py"])

You would get the error message:

UnicodeEncodeError: ‘ascii’ codec can’t encode character u’\xe9′ in position 28: ordinal not in range(128)

This means that you would not be able to use the Python 2.7 subprocess module on Windows if any of your call arguments contain Unicode. I have tried to search for alternatives to the subprocess module, but other packages I found such as envoy only provide a nice wrapper around the subprocess module. This means the bug would still make it impossible to call your programs if any of the arguments contains a Unicode character. Python 3.5 does not have this bug, but since I need to target the Python 2.7, I had to find a workaround.

After many hours of search, I finally found a patched version of subprocess – subprocessww which can be obtained from PyPi.

Version of subprocess with patches to allow unicode arguments. This allows Unicode parameters to be passed to POpen on Python 2.7.

Usage:

# Use patched version of subprocess module for Unicode on Windows
import subprocessww

# Load the regular POpen, which is now patched
from subprocess import Popen

I am very thankful to the The SABnzbd Team for providing the patch. When using the patched version, it is possible to supply arguments containing the Unicode characters.

Desktop PyQt application for executing SQL queries against Esri file geodatabase

As I was always interested in building GUIs for some GIS operations, I thought that exploring PyQt deeper would be fun. A project has started as an experimental playground to see what functionality PyQt provides. As I spent more time working with PyQt, I have started wondering what it would take to build a useful desktop application.

Because I often find myself in need of querying a file geodatabase’s datasets, I have decided to build a GUI-based SQL editor that would let me execute SQL queries against a table or a feature class and draw the result set in a table form for further visual inspection. I have thought that other GIS users and developers may find this application useful and I therefore have decided to start a GDB: GitHub repository to let others take advantage of my work. Here it comes, check it out!

GDBee_sample

GDBee is a PyQt5 desktop application which you can use to write and execute SQL queries against tables and feature classes stored inside an Esri file geodatabase. It provides a tabbed interface which lets you connect to multiple geodatabases within a single session. It has a rich code editor featuring auto-completion (with suggestions), syntax highlight, and result set export.

If you are a QGIS Desktop user, you are already able to execute SQL against file geodatabases using QGIS DBManager plugin, but GDBee has some extra features that the DBManager is missing (for instance, you do not need to add your datasets as layers first and you can choose to copy individual cells instead of the whole row) from the result table.

Because Python is so widely used in the GIS community, I thought it would make sense to take advantage of Python bindings of GDAL (via GEOS) to be able to connect to a file geodatabase and execute SQL queries. Working with a file geodatabase via GEOS makes it possible to take advantage of SQL spatial functions that are otherwise inaccessible to an ArcGIS user!

The application provides multiple features:

  • Working with multiple geodatabases using multiple tabs (single geodatabase connection per tab)
  • Exporting result sets into various formats (WKT strings to paste into QGIS using QuickWKT plugin, arcpy code to paste into ArcMap Python window, pandas data frame via .csv file (which can be taken into geopandas), and Markdown table via .md file or plain text)
  • Executing SQL query with respect to the user selection (only selected text is executed)
  • Loading/saving SQL queries from and to text files on disk
  • Convenient keyboard shortcuts for query execution (F5 and Ctrl-Enter) and tab interaction (Ctrl-N and Ctrl-W for opening and closing tabs)
  • Copying data from the result set table (either individual cell values or row(s) with the headers preserved) – ready to paste properly into an Excel sheet
  • Choosing whether you want to have geometry column in the result set as WKT

You can look at its GitHub repository: GDBee: GitHub repo. You may find this PyQt desktop application useful if:

  • You would like to be able to interrogate your file geodatabase datasets using SQL (instead of Python-based interface such as Esri arcpy or open-source ogr)
  • You are an ArcGIS user that does not want to have QGIS Desktop installed just to be able to execute SQL against a file geodatabase
  • You use SQL on a daily basis working with spatial databases (such as PostgreSQL or Microsoft SQL Server) and want to be able to execute ad hoc SQL queries against file geodatabase datasets without loading them into a proper DBMS database
  • You already have a lot SQL code targeting tables stored in a DBMS spatial database and you would like to be able to reuse this code when targeting a file geodatabase

Do you think there is some other functionality that should be added? Please let me know by submitting an issue in the repository.

Building concave hulls (alpha shapes) with PyQt, shapely, and arcpy

In the previous post, I have shared some information about how to set up PyQt development environment for ArcGIS Pro Python developers. If you are interested in building interactive desktop applications using Python and PyQt and running them within Pro, by all means, do check the post as it will guide you through the process.

In this post, I’d like to share a sample application that I’ve put together to illustrate the process of building concave hulls (also known as alpha shapes). I’ve built a fairly simple PyQt5 application that can be run from Pro or as a standalone application (without the functionality of adding the data to the map of course). With this application, I wanted to show how easy it is to integrate a PyQt window into ArcGIS Pro, but also to show how to build the concave hulls and what kind of results you get when adjusting the alpha value.

This application can be helpful for anyone who would like to create polygons around a set of points but creating convex hulls does not produce polygons that are good enough in terms of the boundary bounds. This work is inspired by two blog posts:

Let’s do some work with the points that form an outline that looks like an “H”.

This is what you would get generating the convex hull:

convex_hull

This is what you would get generating the concave hull with a specific alpha value:

concave_hull_better

And one more concave hull with another alpha value:

concave_hull_good

The basic workflow is that you interactively remove the triangulation edges for the points that are too far from each other. Adjusting the alpha value to tune the boundaries is a tricky process because it may improve the boundaries in one place (areas 1 and 2) but degrade them in another place (areas 3, 4, and 5). Look at the figure below; we are trying to increase the alpha value to improve the figure above:

concave_hull_best

So, as you can see, it is all about finding a sweet spot where the produced results are good enough and trying to improve the boundaries will only make them worse.

The animated figure to demonstrate the process of increasing the alpha value. Removing too many edges will result in having separate disconnected polygons:

alphas

Building concave hulls using multiple alpha values can be very helpful when you have clusters of points you would like to generate separate polygons for. Because the points in these two clusters are located fairly far from each other, they don’t have any connecting triangulation edges. As the end result, you get two polygons – one polygon for each cluster of points.

two_clusters

The source code behind this application in available here:

 

Developing Python GUI in ArcGIS Pro with PyQt

Many ArcGIS analysts and developers working with Python tools in ArcGIS Desktop or ArcGIS Pro would like to be able to build custom Python script tools extending the built-in framework of the GP tools dialog. I have answered a question Recommendations about graphic interface for ArcPy/Python script providing useful links on how to build an own GUI inside ArcGIS desktop application. I have also blogged about this topic in another post – Building custom UI tools for ArcGIS with Python.

Other users have also shown some progress. I have seen that Erik Pimpler has a couple of videos (Building GUI Applications for ArcGIS Desktop with Python – Part 1, Building GUI Applications for ArcGIS Desktop with Python – Part 2) that showcase how one can build custom GUI inside ArcMap using wxPython – a Python framework for GUI development. I have also seen Andrew Chapkowski integrating Tkinter into a Python add-in, but this is basically it. However, I do see that many users would like to be able to build own GUI inside ArcGIS using Python and/or using a designer or form builder to facilitate the process of the layout preparation. There is even an idea Form Builder for Python Tools on the ArcGIS Ideas website with quite a few votes.

When building a GUI app with Python, my personal choice is PyQt. It’s extremely mature and robust framework for GUI development which is a pleasure to work with. I have built some GUI utilities for text editing (think some of the Linux grep functionality in a Windows app) as well as SQL editor to work with file geodatabases. I have been incredibly productive building applications really quickly. There are so many useful tutorials on PyQt, so I have never really have stuck at any problem. You can build a helpful utility application in one evening.

Resources on PyQt development:

QGIS developers are familiar with this framework because they build QGIS plugins using PyQt, however if you are an ArcGIS developer, this is pretty new to you unless you have used PyQt for other projects. So, let’s see how you can set up an PyQt application with ArcGIS Pro.

  1. Install ArcGIS Pro (further just Pro). You will get a default conda environment called arcgispro-py3 which you can copy first into a new environment so you could restore it should something bad happen with the default one which we will be playing with.
  2. With the help of Python Package Manager in Pro, you need to install pyqt 5.6.0 into the default environment. This will quite take some time (5-10 mins).
  3. Now you are ready to write a PyQt application that can be called from a Python script tool.
  4. Run Pro, create and then save a new empty project.
  5. Create a new custom script tool in a toolbox and point it to a Python file with the following content.

Now you can run the script tool just like you would any other custom script tool and you should get this tiny app window open:

2017-11-25 16_45_14-Basic App

Note: if you are getting an error message trying to run the tool:

This application failed to start because it could not find or load the Qt platform plugin “windows”.

then you need to add a system environment variable QT_QPA_PLATFORM_PLUGIN_PATH with the path to the plugins folder used by PyQt5. If you have installed ArcGIS Pro into a default location for all users, you would need to provide this value: C:\Program Files\ArcGIS\Pro\bin\Python\envs\arcgispro-py3\Library\plugins\platforms. Restart ArcGIS Pro to let it grab the environment variable on the next start.

If you did get this window – congratulations, you have just accessed arcpy from a Python program that was run as a PyQt5 application. You have accessed the currently open project in Pro and accessed its property. So, what does this imply?

This means you can write code within a PyQt5 application that will be accessing your current map and its layers as well as basically everything you can reach with arcpy. Now it is up to you to choose what GUI to build as Qt framework provides excellent set of widgets to choose from.

Note: you may have noticed that after you have run the script tool and the PyQt app window was opened, Pro thought that the script tool is still running, so the script tool will complete only upon closing the application. This behavior can be changed if you run your PyQt application in own thread, so as your application starts, the script tool will complete its execution. This can be done using the threading module.

Create a new Python Python module to use as the source code for the script tool:

This is what you would see.

2017-11-25 17_02_54-Basic App

Note: since the application runs in its own thread, it won’t be able to access the current project: it will run as if you were running it as a standalone script. You can of course still use arcpy and other packages you have in your Pro conda environment.

To give you a feeling of how powerful PyQt is and what kind of amazing tools you can build to let Pro users to work with, I’ve created a sample app that can showcase some of the PyQt power. In this application, you can choose an input point feature class and it will generate something called alpha shape (or concave hull). I’ve adopted some of the code you can find in the blog posts Drawing Boundaries In Python by HumanGeo and The fading shape of alpha by Sean Gillies.

In the application, you can choose what edges of the triangulation you would like to keep based on the distance between the edge vertices by using an interactive slider. To visualize the produced data, embedded version of Matplotlib plot is drawn. When you are happy with the result, you can add the produced data to ArcGIS Pro current map as layers. I will post the source code of this application in another post later on.

Below is the screen recording of running the Pro application and starting a custom script tool that in turn starts a PyQt application window.

alphas.gif