
Leaving the GIS industry: bye for now

After working for more than 10 years in the geospatial industry, I have decided to change fields and focus on pure software engineering. I have quite enjoyed using GIS tools to solve various computational problems and software development tools to build useful GIS products and services for my customers. However, my interests started shifting towards clean code and programming per se – I noticed I was reading Refactoring by Martin Fowler in bed way more often than The ESRI Guide to GIS Analysis by Andy Mitchell – so I thought it would be useful to try changing my career path.

As I won’t be using any GIS software any longer, I won’t be able to post any new practical material that would be useful to GIS analysts and developers. However, in this post I would like to share some last thoughts that could be of interest to fellow GIS professionals.

This blog becomes an archive of hopefully useful resources. Good luck!

Skills

SQL

I cannot stress enough how important it is for anyone using GIS data to master SQL. It is an integral part of nearly any GIS processing workflow, and being able to process or manage data stored in a DBMS using SQL is crucial. As GIS datasets grow in size, writing scripts for data processing won’t suffice, as there are still no efficient spatial data processing packages that make the process consistent, fast, and reliable. Using PostGIS or SQL Server native spatial types can often get you farther than any open-source Python package. Don’t stop mastering SQL after learning the basics – there is so much more.

I have two massive posts about using SQL functions for GIS:

Math

Don’t bother too much with studying math. I am not sure why many people I have spoken to think that to be good at GIS one has to be good at math. Being able to operate with high-school algebra and geometry is definitely useful (the angle between intersecting roads, the area of a lake, the percentage of a parcel covered by forest), but there is no need to study more sophisticated math. For casual GIS analyst work, having the basics covered will be more than enough.

Of course, if you are building a new routing algorithm that will be more efficient than A*, you will have to review a lot of graph theory material, and if you are building a new Voronoi diagram generation procedure, you will find yourself reading computational geometry books. If you are doing a lot of GIS analysis involving spatial statistics, then you should definitely get the amazing book Elementary Statistics for Geographers, Third Edition.

If you do a lot of computationally intensive programming and would like to catch up on math specifically for GIS, review

Testing in GIS

Review these two posts:

Python

Learn Python. It is the most widely used programming language in the GIS industry, and its usage will only expand. I have written the post A progression path for GIS analyst, which can be used as a development road map. You should be very comfortable using Python; having Python skills will let you have a stronger influence on operations within your organization and automate more manual workflows, leading to a better workplace.

Linux and bash

Learn Linux and bash. I think I should have started using Linux earlier. There are a few (1, 2) ready-to-use VirtualBox images with a ton of open-source GIS software installed, configured, and ready to use. Using those machines will save you a lot of time. Learning bash is extremely helpful because it lets you be much more productive executing small commands and building data processing pipelines than you would normally be on Windows using a programming language. Obviously, bash, Linux, and Python are part of the industry-agnostic skill set you will benefit from having at any later point in time.

Readings

There are so many excellent GIS books that I could recommend, and you can find the most popular titles online. What I’d like to do instead is share some of the hidden gems I have discovered and really enjoyed reviewing. You can find those in the post Useful resources in computer science/math for GIS Analysts.

Programming

Ad-hoc mentality is very difficult to fight. It is 7 pm. You have a job to get done before releasing the data to a customer tomorrow morning. You are adding a missing domain to a geodatabase that your colleague’s Python script failed to add. Then you are changing a field’s data type because you have just realized that you need to store text instead of numbers. And… you find a few other things you are so glad to have spotted before the release. You fix them, zip the database, and upload it onto an FTP site. It is 10 pm; you are tired but happy and are heading off to bed.

Success! … Or is it? First thing the next morning, you want to document the manual changes you introduced yesterday, but you are dragged into some other urgent job… and you never do. A week later, a customer sends an email telling you she’s not able to run their in-house tools on the database you’ve prepared for them, but the one you prepared a month ago works. Now it is 9 pm again and you are writing some odd-looking script trying to compare the two databases and recall what you did that evening… You are in a mess.

Doing what you have done may look natural because you just want to get stuff done. However, I want you to look at this from another perspective. You want your steps to be reproducible. You want to be able to track the changes you have made. Not only you, but any colleague of yours, should be able to pick up the updates that have been made to any piece of data or a script. So resist the urge to just get stuff done, pace yourself, and track your work with one of the following methods.

Documenting manually

If you are not comfortable programming or scripting at all, you should document each step you take while making modifications to a dataset. At least you will then have, in written form, what has been done. I cannot stress this enough: document as you go, writing down each step right after you perform it, not after you have finished all the work. This is how it can look:

  1. You add the field Area of Double type to the table dbo.Parcels.
  2. You write: “Added field Area of Double type to the table dbo.Parcels.”
  3. You drop the field OldArea of Double type in the table dbo.Parcels.
  4. You write: “Dropped field OldArea of Double type in the table dbo.Parcels.”

One disadvantage of this approach is that the changes can get out of sync with the documentation: you could make an error documenting a data type or a field name. Another is that the very same step can be done in many ways – what if you add a field to a database using some GIS application while a colleague of yours uses a DBMS command-line tool? Documenting the exact procedure of making changes soon becomes tedious, and you end up with tons of instructions that easily become misleading or plain obsolete. Still, with sufficiently rigid discipline, it is possible to maintain a decent level of change tracking.

Simply programming with VCS

Another approach is to write a program that makes the changes. When you write code, you don’t need to document what you are doing, because a reader familiar with the syntax of the programming language will understand what happens (you can of course add comments explaining why adding certain fields is required). So, if you are building a database with a few tables, you can write a SQL script that re-creates your database at any point in time. If you never make any manual changes to the database and only write and keep SQL commands, your main SQL data compilation script will never get out of sync.
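For instance, a minimal sketch of such a re-runnable script (the table and field names are illustrative):

-- Re-running this script always produces the same dbo.Parcels table
IF OBJECT_ID('dbo.Parcels', 'U') IS NOT NULL
    DROP TABLE dbo.Parcels;

CREATE TABLE dbo.Parcels (
    ParcelId int NOT NULL PRIMARY KEY,
    Area float NULL -- required for the area reports
);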

This leads us to the concept of version tracking, where it is possible to track how your SQL script has changed since the last version. Who is not guilty of having, at some point in their career, a dozen files with scripts named “production_final_compilation_truly_final_12.sql“? To avoid this mess, you should really use a VCS.

The main argument against this approach is that setting up all these version control tools looks like overkill for someone doing simple GIS work. However, you will see how much safer your work becomes in the long run; it will pay off very soon. Invest some time in learning about a VCS such as Git for managing the source code. All major players – BitBucket, GitLab, and GitHub – provide free private repositories. Find out whether there is a VCS solution deployed in-house within your organization, such as Microsoft TFS, which you could use to check in the code. Should you like to dive deeper into Git, read the Pro Git book for free online. If you are not comfortable putting anything into the cloud (which is just someone else’s computer), use Git locally on your machine or on a local server where you can securely check in your code, and ask your system administrator to take backups of those repositories.

Open source vs proprietary software

Throughout your GIS career, you will most likely be exposed to both proprietary and open-source software. You may have Windows on your machine with QGIS, or a Linux machine with Esri ArcGIS Server. It would be naive to think that either of these technologies is superior to the other. You should be able to get the job done with whatever tools you have available, because you will not always be able to decide what your employer will use.

I suggest instead becoming comfortable with both of them and widening your toolset as much as possible. As you become exposed to different tools, you will soon realize that commercial software can be much better for certain jobs than open-source or free software, and vice versa. For instance, certain types of spatial joins can run faster in ArcGIS Desktop than in PostGIS, but some GDAL-based raster masking may outperform ArcGIS Spatial Analyst tools. Creating a map layout with data driven pages is a pleasure in ArcMap, but can be tedious in QGIS. Always do the benchmarking to understand what tools work best, and document your findings. Keep in mind that the very same tool can take 1 second to process 1,000 features and 5 minutes to process 10,000 features; review the Big O notation briefly to avoid surprises.

I have always encouraged people using a particular tool to understand what it really does. A clear understanding of the underlying process will make it possible for you to extend an existing tool or write your own. For instance, if you are applying some raster masking, you should understand what a matrix is. If you do a spatial join, you should understand how having a spatial index helps.

Always look to expand your toolset and be prepared to apply the tool that is right for a particular job. A customer only has PostGIS and needs to find intersecting polygons? Use the native ST_Intersects function. Don’t have access to QGIS? Know which ArcToolbox tool does the same job. Have to process a file on a Linux machine you SSH into (so no Excel-like software)? Use bash or pandas to wrangle the data as needed. You shouldn’t be constrained by the environment you are in or the tools you have at your disposal. You should be able to get the job done no matter what.
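For instance, a quick PostGIS sketch (the table and column names are assumed):

-- Parcels that intersect at least one flood zone polygon
SELECT DISTINCT p.parcel_id
FROM parcels AS p
JOIN flood_zones AS f
ON ST_Intersects(p.geom, f.geom);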

Keeping up with the industry

I have been a user of GIS StackExchange since 2013 and have blogged about my experience and why it is useful to be active on a forum in the post 4 years with GIS at StackExchange. Make a habit of reading the weekly most popular questions, for instance every weekend. If you see a question you know the answer to, post it. It also helps to post a question you have struggled with yourself: if you spent a week solving it and finally found a solution, please post an answer to your own question. You will save a peer GIS professional some effort, and you will also be able to find this answer when you search the web for the same issue in a few years’ time. If you have some time, you can review the most popular questions using the most voted questions option; there is so much to learn there.

Warning: new GDB_GEOMATTR_DATA column in ArcGIS geodatabase 10.5

This post will be of interest to ArcGIS users who are upgrading enterprise geodatabases from ArcGIS 10.1–10.4 to 10.5+. According to the Esri documentation and resources (link1, link2, link3):

Feature classes created in an ArcGIS 10.5 or 10.5.1 geodatabase using a 10.5 or 10.5.1 client use a new storage model for geometry attributes, which stores them in a new column (GDB_GEOMATTR_DATA). The purpose of this column is to handle complex geometries such as curves. Since a feature class can have only one shape column, circular geometries must be stored separately and then joined to the feature class when viewed in ArcGIS.

This means that if you create a new feature class in an enterprise geodatabase (either manually or by using a geoprocessing tool), three fields will be created: the OID field (OBJECTID), the geometry field (SHAPE), and this special GDB_GEOMATTR_DATA field. It is very important to be aware of this, because you will not be able to see this column when working in ArcGIS Desktop or when using arcpy.

The GDB_GEOMATTR_DATA field is not shown when accessing a feature class using arcpy.

[f.name for f in arcpy.ListFields('samplefc')]
[u'OBJECTID', u'SHAPE', u'SHAPE.STArea()', u'SHAPE.STLength()']

Querying the table using SQL, however, does show the field.

select * from dbo.samplefc
[OBJECTID],[SHAPE],[GDB_GEOMATTR_DATA]

If you work with your enterprise geodatabase only through ArcGIS tools, you may not notice anything. However, if you have existing SQL scripts that work with the feature class schema, now is a good time to check that those scripts will not remove the GDB_GEOMATTR_DATA column from the feature class. This could happen if you re-construct the schema based on another table and previously only needed to keep the OBJECTID and SHAPE columns. After moving to 10.5, you would need to keep the GDB_GEOMATTR_DATA column as well.

Keep in mind that deleting the GDB_GEOMATTR_DATA column will make the feature class unusable in ArcGIS. Moreover, if this feature class stores any complex geometries such as curves, deleting the GDB_GEOMATTR_DATA column would result in data loss.

Trying to preview a feature class without the GDB_GEOMATTR_DATA column in ArcCatalog shows the following error:

database.schema.SampleFC: Attribute column not found [42S22:[Microsoft][SQL Server Native Client 11.0][SQL Server]Invalid column name ‘GDB_GEOMATTR_DATA’.] [database.schema.SampleFC]

Even though it is very unlikely to happen, trying to add a new field called exactly GDB_GEOMATTR_DATA to a valid feature class using ArcGIS tools also results in an error:

ERROR 999999: Error executing function.
Underlying DBMS error [Underlying DBMS error [[Microsoft][SQL Server Native Client 11.0][SQL Server]Column names in each table must be unique. Column name ‘GDB_GEOMATTR_DATA’ in table ‘SDE.SAMPLE1’ is specified more than once.][database.schema.sample1.GDB_GEOMATTR_DATA]]
Failed to execute (AddField).

Obviously, trying to add the GDB_GEOMATTR_DATA column using plain SQL would not work either:

ALTER TABLE sde.samplefc
ADD GDB_GEOMATTR_DATA varchar(100)

Column names in each table must be unique. Column name ‘GDB_GEOMATTR_DATA’ in table ‘sde.samplefc’ is specified more than once.

Multiple Ring Buffer with PostGIS and SQL Server

Recently I needed to generate multiple ring buffers around some point features. This can be done with a dozen tools: the Multiple Ring Buffer geoprocessing tool in ArcGIS, arcpy (generating multiple buffer polygons with the buffer() method of arcpy.Geometry and merging them into a single feature class), or open-source GIS tools such as QGIS. It is also possible in a relational database that supports spatial functions. In this post, I would like to show you how this can be done using the ST_Buffer spatial function in PostGIS and SQL Server.

In order to generate multiple buffer distance values (for instance, from 100 to 500 with a step of 100) in SQL Server, I would need to use a CTE or just create a plain in-memory table using DECLARE; in other words, this is what it takes to emulate range(100, 501, 100) from Python.

In the gist below, there are two ways to generate multiple buffers – using the plain table and the CTE.
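A minimal T-SQL sketch along those lines (assuming a dbo.Points table with a geometry Shape column; names are illustrative):

-- 1. Plain in-memory table of distances
DECLARE @distances TABLE (Distance int);
INSERT INTO @distances (Distance)
VALUES (100), (200), (300), (400), (500);

SELECT p.ID, d.Distance, p.Shape.STBuffer(d.Distance) AS BufferShape
FROM dbo.Points AS p
CROSS JOIN @distances AS d;

-- 2. Recursive CTE emulating range(100, 501, 100)
WITH distances AS (
SELECT 100 AS Distance
UNION ALL
SELECT Distance + 100 FROM distances WHERE Distance < 500
)
SELECT p.ID, d.Distance, p.Shape.STBuffer(d.Distance) AS BufferShape
FROM dbo.Points AS p
CROSS JOIN distances AS d;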

Generating a sequence of distances in Postgres is a lot easier thanks to the generate_series function, which works much like range in Python (note that generate_series includes the stop value).
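For instance (again assuming a points table with a geometry column geom):

SELECT p.id, d AS distance, ST_Buffer(p.geom, d) AS buffer_geom
FROM points AS p
CROSS JOIN generate_series(100, 500, 100) AS d;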

SQL Server spatial functions for GIS users: part 2

I was curious to see which posts of my blog have been visited the most over time. The most popular post is SQL Server spatial functions for GIS users, which I published 3 years ago. It seems there are many GIS users who need SQL Server spatial functions, as requests such as “sql server spatial functions” appear fairly often in the search engines. Today, I will share more examples of using SQL Server spatial functions as a GIS user.

Below are just a few sample SQL queries that have been executed against a SQL Server database containing the datasets from the PostGIS workshop materials by Boundless, which can be downloaded from this link.

This post is an extraction of a Jupyter notebook published on GitHub. Open the notebook to see the result tables, or jump to the end of the post where it is embedded.

Find points that have the same coordinates

To find coincident points (often referred to as duplicates), you could use various SQL Server spatial functions.

Table1.Shape.STDistance(Table2.Shape) < 1000 --distance value
Table1.Shape.STEquals(Table2.Shape) = 1 --whether shapes are identical
Table1.SHAPE.STX and Table1.SHAPE.STY --compare points XY coordinates

All of these methods would give the same result while having a different execution performance.

Find using STDistance

--Find duplicate points within a certain borough
SELECT CAST(T1.ID AS INT) AS FirstPoint,
CAST(T2.ID AS INT) SecondPoint,
T1.Shape.STDistance(T2.Shape) Distance
FROM
dbo.Homicides T1
JOIN
dbo.Homicides T2
ON
T1.ID < T2.ID
and
T1.Shape.STDistance(T2.Shape) = 0
and
T1.BORONAME = 'Queens'
ORDER BY
Distance, FirstPoint

Find using STEquals

SELECT CAST(T1.ID AS INT) AS FirstPointId,
CAST(T2.ID AS INT) SecondPointId,
T1.Shape.STDistance(T2.Shape) Distance
FROM
dbo.Homicides T1
JOIN
dbo.Homicides T2
ON
T1.ID < T2.ID
and
T1.Shape.STEquals(T2.Shape) = 1
and
T1.BORONAME = 'Queens'
ORDER BY
FirstPointId, SecondPointId

Find using STX and STY

SELECT T1.SHAPE.STX, T1.SHAPE.STY, COUNT(*) AS COUNT
FROM
dbo.Homicides T1
WHERE
T1.BORONAME = 'Queens'
GROUP BY
T1.SHAPE.STX, T1.SHAPE.STY
HAVING
COUNT(*) > 1

Calculating distances between points stored in tables

The STDistance function could be used to find the plain distance between the points stored within the same table.

Calculating distances between points stored in the same table

SELECT CAST(T1.ID AS INT) AS FirstPointId
,CAST(T2.ID AS INT) SecondPointId
,T1.Shape.STDistance(T2.Shape) Distance
FROM
dbo.Homicides T1
JOIN dbo.Homicides T2
ON T1.ID < T2.ID
and
T1.Shape.STDistance(T2.Shape) BETWEEN 10 AND 20
and
T1.BORONAME = 'Queens'
ORDER BY
Distance

Calculating distances between points stored in two tables

Find homicide points that are located within a specified number of meters of the subway station points. ArcGIS tool: Point Distance (Analysis).

SELECT * FROM
(SELECT CAST(Homi.ID AS int) AS HomicideId
,Homi.WEAPON AS Weapon
,CAST(Subway.ID AS int) AS SubwayId
,Subway.NAME AS Name
,Homi.Shape.STDistance(Subway.Shape) AS Distance
FROM dbo.HOMICIDES Homi
cross join dbo.Subway_stations Subway
where Homi.BORONAME = 'Manhattan' AND Subway.BOROUGH = 'Manhattan')
AS
data
WHERE
Distance < 20
ORDER BY
Distance

Counting points in polygons

To count points in polygons, you could use the STContains function. Table1.Shape.STContains(Table2.Shape) would return 0 or 1.

Find neighborhoods with the largest number of crimes committed (count number of homicides in each neighborhood)

SELECT TOP 10
Polys.Name AS NeighborhoodName, Count(*) AS CrimeCount
FROM
dbo.Homicides AS Points
JOIN
dbo.Neighborhoods AS Polys
ON
Polys.Shape.STContains(Points.Shape) = 1
GROUP BY
Polys.Name
ORDER BY
CrimeCount DESC

We can also calculate a column in the Neighborhoods table to contain the number of points within each neighborhood. For that, we first need to add a new column to the table, then populate it, and later drop it to leave the data clean for the further queries.

This query, which adds a new field and calculates the number of points located within each polygon, does what the ArcGIS Spatial Join GP tool does.

ALTER TABLE dbo.Neighborhoods
ADD PointCount int;

UPDATE
Polys
SET
[PointCount] = COUNTS.CrimeCount
FROM
dbo.Neighborhoods AS Polys
JOIN
(
SELECT
Polys.Name AS NeighborhoodName, Count(*) AS CrimeCount
FROM
dbo.Homicides AS Points
JOIN
dbo.Neighborhoods AS Polys
ON
Polys.Shape.STContains(Points.Shape) = 1
GROUP BY
Polys.Name
) AS COUNTS
ON
Polys.Name = COUNTS.NeighborhoodName

Add polygon name to points located within the polygon

To enrich the points layer with information about which polygon each point is located within, you would need to use the STWithin function. In this example, we will add a new column to the homicides table so we know in what neighborhood each crime was committed.

Again, this query, which adds a new field and calculates the neighborhood name for the points located within each polygon, does what the ArcGIS Spatial Join GP tool does.

SELECT TOP 10
Points.OBJECTID
,Points.INCIDENT_D
,Points.BORONAME
,Points.NUM_VICTIM
,Points.PRIMARY_MO
,Points.ID
,Points.WEAPON
,Points.LIGHT_DARK
,Points.YEAR
,Polys.Name AS NeighborhoodName
FROM
dbo.Neighborhoods AS Polys
JOIN
dbo.Homicides AS Points
ON
Points.Shape.STWithin(Polys.Shape) = 1
ORDER BY
OBJECTID

ALTER TABLE dbo.Homicides
ADD NeighborhoodName varchar(50);

UPDATE
Points
SET
[NeighborhoodName] = PointsInPolys.NeighborhoodName
FROM
dbo.Homicides AS Points
JOIN
(
SELECT
Points.OBJECTID
,Points.INCIDENT_D
,Points.BORONAME
,Points.NUM_VICTIM
,Points.PRIMARY_MO
,Points.ID
,Points.WEAPON
,Points.LIGHT_DARK
,Points.YEAR
,Polys.Name AS NeighborhoodName
FROM
dbo.Neighborhoods AS Polys
JOIN
dbo.Homicides AS Points
ON
Points.Shape.STWithin(Polys.Shape) = 1
) AS PointsInPolys
ON
PointsInPolys.ID = Points.ID

Summary statistics and frequency

This is a simplistic implementation of the Frequency GP tool in ArcGIS using a T-SQL stored procedure.

ALTER PROCEDURE dbo.FrequencyTable
@TableName varchar(100),
@Columns varchar(500)
AS
BEGIN
-- Build and run the frequency query dynamically
EXEC ('SELECT COUNT(*) AS ' + @Columns +
' FROM ' + @TableName +
' WHERE WEAPON <> '''' AND LIGHT_DARK <> ''''
GROUP BY WEAPON,
LIGHT_DARK ORDER BY FREQUENCY DESC;');
END
GO
EXEC dbo.FrequencyTable 'dbo.HOMICIDES', 'FREQUENCY, WEAPON, LIGHT_DARK';

The Jupyter notebook with the result sets is published on GitHub.

Using SQL Server constraints and geodatabase domains

Attribute constraints

Many ArcGIS users use geodatabase domains to let data editors enter only certain values for certain attributes, either within a range or from a set of coded values. This functionality streamlines data editing and is very helpful indeed. However, having a geodatabase domain set for a field doesn’t actually prevent users from typing in a value that is either outside of the range or not in the list of coded values.

Illegal values can be entered either programmatically, using arcpy or SQL, or manually, for instance by editing the attribute table of a feature class with the Field Calculator. To find out which features have illegal attribute values, you would need to select all of your features in an editing session and click Editor > Validate Features, which selects the features with illegal values.

But what if you would like to let your users pick only certain values when editing features and prohibit entering any illegal values at all? To do this, you could use database constraints, such as a foreign key constraint. In fact, I have already answered this exact question on GIS.SE: Restrict values to domain codes (beyond the attribute table).

The code snippet below shows what this could look like in SQL.
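A minimal sketch of this approach (the dbo.CityType and dbo.Cities names follow the text below; the coded values are illustrative):

-- Lookup table holding the allowed coded values
CREATE TABLE dbo.CityType (
Type varchar(50) NOT NULL PRIMARY KEY,
Description varchar(100) NULL
);

INSERT INTO dbo.CityType (Type, Description)
VALUES ('capital', 'Capital city'), ('town', 'Small town');

-- The foreign key rejects any value that is not present in dbo.CityType
ALTER TABLE dbo.Cities
ADD CONSTRAINT FK_Cities_CityType
FOREIGN KEY (Type) REFERENCES dbo.CityType (Type);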

Now you can just use the Table To Domain GP tool, which lets you create a geodatabase domain from the dbo.CityType table and add the coded values into it. Then you can assign this domain to the Type field in the Cities feature class using the Assign Domain To Field GP tool.

Now users will get an error window in ArcMap (returned from SQL Server) every time they try to enter an illegal value into the field and save the feature. One thing to keep in mind when embracing this workflow: you need to go to Editor toolbar > Options > Attributes tab and enable the option Display the attributes dialog before storing new features. This is necessary if you don’t specify a default value for this field from among those stored in the dbo.CityType table; otherwise, newly created features would have no value in the Type attribute, and you wouldn’t be able to digitize a feature on the map without getting the error message.

Spatial constraints

Another thing that may bug you is geodatabase topology. It’s very handy when you have inherited a large suite of feature classes and would like to enforce some integrity rules concerning the spatial relationships between features in those feature classes. However, if your data is stored in a multi-user geodatabase, you can create your own rules that prohibit users from creating features that break them. With ArcGIS geodatabase topology, it is still possible to create a feature that would be considered invalid in terms of its relationship with another feature (say, a school point inside a lake polygon); the only way to find this out is to validate the topology on existing features.

Using SQL Server triggers, it is possible to specify spatial rules and prevent the creation of features that don’t follow them. Below is a simple example of a trigger that won’t let ArcMap users digitize a point on the map to create a point feature located outside the boundaries of the state of California.
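A minimal sketch of such a trigger (assuming an unversioned dbo.SchoolPoints feature class with a geometry Shape column, so edits hit the base table directly, and a dbo.States table holding the California polygon; all names are illustrative):

CREATE TRIGGER dbo.EnforcePointsInCalifornia
ON dbo.SchoolPoints
AFTER INSERT, UPDATE
AS
BEGIN
-- Reject the edit if any inserted or updated point falls outside California
IF EXISTS (
SELECT 1
FROM inserted AS i
JOIN dbo.States AS s ON s.Name = 'California'
WHERE i.Shape.STWithin(s.Shape) = 0
)
BEGIN
RAISERROR ('Features must be located within California.', 16, 1);
ROLLBACK TRANSACTION;
END
END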

Changing the order of fields in a geodatabase feature class

At some point, it might be necessary to change the order of fields in a feature class. Perhaps a new field has been added, and you want all users who add this feature class to their map documents to see the field in the right place when opening the attribute table.

An easy solution that could work is to create a layer file (.lyr) that contains, among other things, the order of fields that was set up for the map layer at the moment of export. Your users can then add this layer file into an ArcMap session to see the fields in the needed order when opening the attribute table or editing the feature class. Creating and updating layer files can be automated with Python and arcpy scripting.

If your data don’t reside in an enterprise geodatabase such as SQL Server, your only choice is to copy the feature class to a new feature class, providing a field mapping object that contains the needed field order, and then replace the source feature class with the newly created one. This approach works for a standalone dataset that is not part of any complex container such as a topology, network dataset, or geometric network, and that doesn’t have any domains specified for its fields. Otherwise, you would need to take the feature class out of its containers (for instance, you cannot delete a feature class that is part of a network dataset), replace it with the feature class that has the right field order, and then reassign the domains, add the feature class back to its containers, and so forth. Many of those operations cannot be automated, which makes re-ordering the fields in a feature class quite complicated.

If your data reside in a DBMS such as SQL Server, though, you can use your DBMS tools for defining the order of the fields. If you want to provide your users access to a feature class with a particular order of the fields, you could create a view using SQL or ArcGIS geoprocessing tools. However, a view cannot be edited in ArcMap (you won’t be able to update the underlying feature class through a view).

If you do need to change the original feature class field order and let users edit it, you can use SQL Server itself. The detailed instructions are available on SO in this answer, and this can be done using SSMS without writing any SQL queries.

I have tested this on a feature class that was part of a network dataset and a topology and also had domains associated with its fields. After changing the column order in SQL Server, I was able to use the feature class in ArcMap, and all behaviors associated with its containers seemed to be intact and working properly. The domains were left in place, too.

If you want to take a look at the SQL query that is run behind the scenes, look here at the DBA site of SE.

To see the changes in the field order, restart ArcMap, as it seems to cache the order of fields in a feature class within a session. I’m not sure whether changing the order of fields in a feature class is officially supported by Esri, so be careful if you decide to do this – always back your data up first.

When writing any code that works with feature class columns, always refer to columns by name, not by index. If you are using arcpy and Python, it is super easy to use named tuples (I blogged about this earlier).

Then you no longer depend on the order of fields in your feature class: if the order changes, your code won’t break as long as the fields keep their names.
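A minimal sketch of that pattern (the feature class and field names are illustrative):

from collections import namedtuple
import arcpy

fields = ['PARCEL_ID', 'Area']
Row = namedtuple('Row', fields)

with arcpy.da.SearchCursor('dbo.Parcels', fields) as cursor:
    for row in map(Row._make, cursor):
        print(row.PARCEL_ID, row.Area)  # fields accessed by name, not by index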

Working with csv files in Python

Have you ever needed to process some large datasets? Well, it all depends on how you define “large”. In this post, I will refer to large datasets as relational tables or delimited text files several GBs in size containing 10+ mln rows.

When I have to do some data processing or filtering, I often start by analyzing what kind of toolset or framework I should use to perform the work. Say you have a large .csv file and you want to create a new .csv file filtered by a column (so you basically want to create a new .csv based on a SQL where clause).

Well, you can do that with ArcGIS. It is possible to add text files, including .csv files, into ArcMap and open their attribute tables. You will be able to make selections based on attributes, calculate statistics, and more. Having made a selection, you can export the selected rows into a new comma-separated file. However, in order to make the selection work on the whole dataset (and not only on the first 2000 rows visible in the attribute table), it seems you need to move to the end of the table, which can take a rather long time; so it’s probably best to use the Definition Query tab in the Table Properties dialog box. This will make visible only those rows that match your criteria, and you can then export the whole table with the query applied to a new text file. Pay particular attention to the default delimiter that will be used when exporting a text file from ArcMap and to whether any new columns such as OBJECTID are added to the source rows. Another caveat is to watch out carefully for decimal delimiters, because they can cause problems for comma-delimited text tables; refer to this KB article to learn more.

FME, another framework used by many GIS professionals, can handle this task, too. Probably the easiest way is to use the AttributeFilter transformer to obtain only those rows that meet your criteria. You will be able to supply a Separator Character and Character Encoding, among other options. Very handy and easy to build. Filtering a 20-mln-row csv file with two columns and writing the result to a new csv file with FME 2015 takes no more than several minutes on a decent machine.

Most DBMS will let you import and export csv files, too. SQL Server, for instance, is capable of importing a .csv file using the SQL Server Import and Export Wizard. You can load a csv file into a SQL Server table, then create a new table with SELECT INTO and a where clause to keep only the needed rows. You can also script this process using SSIS packages or by using BULK INSERT to insert csv rows into a table.
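A sketch of the BULK INSERT route (the file path and table schema are assumed):

-- Stage the csv rows, then keep only the rows that match the filter
CREATE TABLE dbo.SourceRows (Id int, Category varchar(50));

BULK INSERT dbo.SourceRows
FROM 'C:\data\large.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2);

SELECT *
INTO dbo.FilteredRows
FROM dbo.SourceRows
WHERE Category = 'Residential';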

And… you can do this in Python! There are many ways: reading the file line by line, using the built-in csv module, or using external ones such as unicodecsv. Be careful when reading large files with Python: I often see people read all the file rows into a single object, which can cause an out-of-memory error even on powerful machines. Learn more about iterators in Python – the reader objects you create are generators that you can iterate over just once.

Here is a couple of samples of how to read and write csv files with Python to help you get started:
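A minimal sketch using the built-in csv module (the column name and filter value are illustrative):

import csv

# Stream the source file row by row so only one row is held in memory
with open('large.csv', newline='') as src, \
        open('filtered.csv', 'w', newline='') as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        if row['category'] == 'Residential':
            writer.writerow(row)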

Design of WebGIS back-end: architecture considerations

I have spent the last two years doing a lot of Python development and designing and implementing Web GIS, which included ArcGIS Server, geoprocessing services, and an ArcGIS API for JavaScript (hereafter JS) web client. What I would like to do is share an idea I have come to like.

If you need to do something, try doing it at the back-end

Imagine you have a JS web application where users work with some feature services via a web map. They can select multiple features and calculate the sum of the values those features have in a field (or fields). Let’s go through the alternatives you have.

  1. Pre-calculate the values you think your users will query and store them in the database.
    This works fine when you know your users will generate reports on certain fields often and performance is crucial. It might make sense to calculate certain values beforehand and store them. The disadvantages are the additional storage and the need to keep the values updated – the calculated field depends on other fields whose values can change. This implies re-calculating the report field as part of a daily or weekly routine, depending on the workflow.
  2. Get the feature’s data from the ArcGIS Server feature service and calculate the requested value on the fly in the client.
    Unless you are retrieving complex geometry, this operation won’t cost you much. The problem is that the volume of JS (or TypeScript) code will grow, and every modification implies a new release, which can be a painful process if you need to compress your code and move things around. Another thing is that if the amount of data you work with is rather large, there is a good chance the web browser will get slow and performance will degrade significantly.
  3. Use the database server to calculate the values (a sketch of this approach follows the list).
    This became my favorite over the last years. It has multiple advantages.
    First, the operation runs on the database server machine with enough RAM and CPU resources, so you are not limited by the web browser capacity. Database servers are very good at calculating values: this kind of operation is inexpensive because in most cases it does not involve cursors. You also have the privilege of working in a transaction, which provides a higher level of data integrity (it would be hard to mess up the database since you can roll back).
    Second, you can use SQL. It might not sound like an advantage at first, but remember that code is written once and read many times. Readability counts. SQL is a clean way of communicating the workflow, and database code (such as stored procedures) is very easy to maintain. Unlike JS, you work with just one database object and don’t really have any dependencies on the system, provided you have a database server of a certain version and the privileges required to create and execute stored procedures.
    Finally, by letting the database server do the work, you expose the procedure to other clients that can work with it. You don’t need to modify the client code: by updating the SQL code in one place, you automatically make it available to all applications that use it.

How to be efficient as a GIS professional (part 3)

6. Automate, automate, automate

Whatever you are doing, take a second to think whether you will need to run the sequence of steps you’ve just completed again. It may seem at first that you are unlikely to run the same sequence again, but in fact you may find yourself performing it over and over later on.

Automating is not only about saving time. It is also about quality assurance. When you are doing something manually, there is always a chance to forget a certain step or detail, which can potentially lead to an error. When the workflow is automated, you can always see what steps are being performed. An automated workflow is already a piece of documentation which you can share with others or use yourself as a reference.

Don’t trust your memory: you think you know what columns you’ve added to the table and why. Get back in two weeks and you will be surprised by how little of those memories you have left. If you leave the job and hand the work over to a new person, they will be happy to inherit well-maintained documentation and a clear description of the workflow they will be responsible for.

Considering desktop GIS automation, think about using Python for geospatial operations (think truncating tables + appending new data + performing data checks). For database automation, use SQL (adding new columns + altering column data types). Feel free to build SQL scripts with commands for adding/deleting/calculating columns and copying data, too. By preserving those scripts, you will always be able to re-run them on another table or in another database, or modify a script to match your needs. This also gives you a way to review the changes performed in your database. It is just like adding a field manually and then writing down that you have added a field of type X into table Y at time Z – except that building a SQL script saves you the writing.

7. SQL, SQL, SQL

Another advantage of SQL for data processing is that it is very vendor-neutral and can be executed as-is, or with really minor adjustments, on most DBMS platforms. This applies to the SQL spatial functions, too, which provide ISO- and OGC-compliant access to the geodatabase and the database. Being able to execute SQL queries and perform data management operations is really advantageous when you work in a large IT environment: you won’t always have a network connection to the production environment for a data update, and using ArcGIS might not be possible. Running a Python script would require a Python installation on some machine, and if you use arcpy – ArcGIS Desktop. Running SQL code, which has no dependencies, might be your only alternative.

Many folks don’t know that one can use pure SQL with an enterprise geodatabase stored in any supported DBMS. This is just a short list of what you can do with SQL: read and update attributes and geometries of feature classes, create views, calculate and alter columns, enforce attribute and spatial integrity with constraints and triggers, and generate new geometries such as buffers and envelopes.

8. Python, Python, Python

I have blogged about using the spatial functions of SQL Server earlier. Remember that you can also execute some SQL from Python code using the arcpy.ArcSDESQLExecute class. Here is the SQL reference for query expressions used in ArcGIS, some of which you can use in the where clauses of arcpy.da cursors. Learn some of the useful Python libraries which could save you time. Look at:

  • Selenium for automating ftp data downloads if this happens often and you have to browse through a set of pages;
  • the scipy.spatial module for spatial analysis, such as building Voronoi diagrams, finding distances between arrays, constructing convex hulls in N dimensions, and many other things;
  • NumPy, a fundamental package for scientific computing with Python, for handling huge GIS datasets (both vectors and rasters) with arcpy.

Read more about What are the Python tools/modules/add-ins crucial in GIS and watch an Esri Video on Python: Useful Libraries for the GIS Professional.

Get a chance to learn more about SQL and Python and how you could take advantage of them in your work!

SQL Server spatial functions for GIS users

If you have been using SQL Server for some time, you’ve probably heard of its spatial data support. This might be particularly interesting for anyone who uses a desktop GIS for data management and analysis. If you are an ArcGIS user and have enterprise geodatabases stored within SQL Server databases, you might have wondered whether it is possible to interact with the spatial data using SQL directly. This is useful when you don’t have a chance to use ArcMap to access the database due to some restrictions (permissions, network connections, or software compatibility).

Well, you can actually do a whole lot with your geographic data using just SQL. It is important that the Shape field is defined as being of the Geometry or Geography data type. For most GIS work, you’d probably choose the Geometry type, which represents data in a Euclidean (flat) coordinate system. As soon as you have a geodatabase feature class whose Shape field is of the Geometry type, you can use native SQL Server tools to interact with both the feature class attributes and its geometry.

Beginning with ArcGIS 10.1, feature classes created in geodatabases in SQL Server use the Microsoft Geometry type by default. To move your existing feature classes to the Geometry storage type, use the Migrate Storage geoprocessing tool or a Python script.

Alright, so after you have copied your file geodatabase feature class into a SQL Server geodatabase, you are ready to use native SQL to interact with the spatial data.

Let’s select all the features from the Parcels feature class.

SELECT * FROM dbo.PARCELS

Because we have a SHAPE column of the Geometry type, we get another tab in the results grid – Spatial results – where you can see your geometries visualized.

Let’s see what coordinate system our feature class was defined in.

DECLARE @srid INT = (SELECT TOP 1 shape.STSrid FROM dbo.PARCELS)
SELECT @srid AS SRID,
srtext AS Name FROM sde.SDE_spatial_references WHERE auth_srid = @srid

Here we use <GeometryColumnName>.STSrid to get the spatial reference id (SRID) of the coordinate system of the first feature. Because our geographic data is stored in a projected coordinate system (and the Geometry type), we cannot get its name from the core SQL Server spatial reference table, sys.spatial_reference_systems.

Here is why:

The coordinate systems in this table are for the geography type only as it contains information about the ellipsoid that is required to perform calculations. No such information is required to perform calculations for projected coordinate systems on the plane used by the geometry type, so you are free to use any reference system you like. For the calculations done by the geometry type, it is the same no matter what you use.

Let us explore next what kind of geometry is stored within a table. It is possible to store different types of geometry (such as polygon and polyline) within one table in SQL Server.

Let us see if it is true:

SELECT Id,GeomData AS Geometry,GeomData.STAsText() AS GeometryData
FROM [testgdb].[dbo].[GeneralizedData]


Yes, indeed, we store features of different geometry types in one table, and SQL Server has no problem with that. To be able to visualize this table in ArcMap, though, you would need to use a query layer, which is basically a stand-alone table defined by a SQL query. ArcGIS can only handle one type of geometry stored within each feature class, which is why you will get a choice of which type of geometry you want to look at.

After adding this layer into ArcMap, you will be able to see the polygons (provided you’ve chosen polygons). The query layer is read-only, so you cannot edit its features in ArcMap. If you have a SQL Server table (not registered with the geodatabase) that stores multiple types of geometries, you can switch between them easily by adding multiple query layers into ArcMap, each defining the kind of geometry you want to work with.

Let us keep working with a geodatabase feature class that has only polygons, and check that this is true:

SELECT Shape.STGeometryType() AS GeometryType FROM dbo.PARCELS

Alright, so we know already what kind of coordinate system the data is stored in and we know that there are polygons. Let us get the perimeter (length) and the area of those polygons.

SELECT PARCEL_ID, SHAPE.STArea() AS Area,
SHAPE.STLength() AS Perimeter
FROM dbo.PARCELS 

We can also get the coordinates of each polygon within our feature class. Note that the start point and the end point are identical – that is because each polygon is considered to be closed:

SELECT Shape.STAsText() AS GeometryType FROM dbo.PARCELS

This is what we will get:
POLYGON ((507348.9687482774 687848.062502546, 507445.156252367 687886.06251058145, 507444.18750036607 687888.56250258372, 507348.9687482774 687848.062502546))

There are similar functions, such as STAsBinary(), which returns the Open Geospatial Consortium (OGC) Well-Known Binary (WKB) representation of a geometry instance, and AsGml(), which returns the Geography Markup Language (GML) representation.
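For instance (using the same parcels feature class):

SELECT TOP 1
Shape.STAsBinary() AS WKB,
Shape.AsGml() AS GML
FROM dbo.PARCELS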

We can also check the number of vertices per polygon:

SELECT PARCEL_id,
Shape.STAsText() AS GeometryDesc,
Shape.STNumPoints() AS NumVertices
FROM dbo.PARCELS
ORDER BY NumVertices DESC

Alright, that was probably enough querying data. Let us check what kind of GIS analysis is available to us with native SQL. The easiest way to get started is probably to process the features of a feature class and write the resultant geometries into a new table.

DROP TABLE [ParcelEnvelope]
CREATE TABLE [dbo].[ParcelEnvelope]([Id] [int] NOT NULL,
[PolyArea] int,[GeomData] [geometry] NOT NULL) ON [PRIMARY]

INSERT INTO ParcelEnvelope (Id,GeomData,PolyArea)
SELECT PARCEL_ID AS Id,
SHAPE.STEnvelope() AS GeomData,
SHAPE.STArea() AS PolyArea
FROM dbo.PARCELS
ORDER BY OBJECTID

This will create a new table where the envelopes of each parcel polygon will be written to.

Let us now do some buffers on a road centerlines geodatabase feature class:

DROP TABLE [RoadBuffer]
CREATE TABLE [dbo].[RoadBuffer]([Id] [int] NOT NULL,
[GeomData] [geometry] NOT NULL) ON [PRIMARY]

INSERT INTO [RoadBuffer] (Id,GeomData)
SELECT OBJECTID AS Id,
SHAPE.STBuffer(50)
FROM dbo.Road_cl
ORDER BY OBJECTID

You can, of course, write the newly generated features into a geodatabase feature class, not just a SQL Server database table. You need to create a new polygon feature class first and then run the SQL below. This will create buffer zones for every line found in the Road_cl feature class.

DELETE FROM FC_ROADBUFFERS
INSERT INTO FC_ROADBUFFERS(OBJECTID,SHAPE)
SELECT OBJECTID AS OBJECTID,
SHAPE.STBuffer(50) AS SHAPE
FROM dbo.Road_cl
ORDER BY OBJECTID

Please refer to the Microsoft Geometry Data Type Method Reference to get a full list of available functions and more detailed description.

Try doing some other analysis, such as finding which features intersect or overlap, or how many points are located within a certain polygon. There is so much you can do! To learn more, get the book Beginning Spatial with SQL Server 2008, which has tons of examples and will also help you understand the basics of spatial data structures. I have read this book and really liked it; I think it is a must-read for anyone using spatial SQL.

I hope this short introduction to what you as a GIS user can do with SQL Server will help you take advantage of the native SQL functions whenever using a desktop GIS is not an option.