Glue crawler exclude patterns
I have an S3 bucket that I'm trying to crawl and catalog. The layout is something like this, where the SQL files are DDL queries (CREATE TABLE statements) that match the schema of the different data files, i.e. data1, data2, etc.
s3://my-bucket/somedata/20180101/data1/stuff.txt.gz
s3://my-bucket/somedata/20180101/data2/stuff.txt.gz
s3://my-bucket/somedata/20180101/data1.sql
s3://my-bucket/somedata/20180101/data2.sql
s3://my-bucket/somedata/20180102/data1/stuff.txt.gz
s3://my-bucket/somedata/20180102/data2/stuff.txt.gz
...
I just want to catalog data1, so I am trying to use the exclude patterns in the Glue crawler, i.e. *.sql and data2/*.
Unfortunately the crawler is still classifying everything within the root path of s3://my-bucket/somedata/. I can live with having data2 cataloged; I'm most concerned/annoyed by the sql files.
Does anyone have experience with exclude patterns, or can anyone point out what is wrong here?
1 Answer
The * in the exclude pattern does not cross directories, but the ** does span across directories.
To exclude all .sql files you can use: **.sql
The full path of your data2/* exclusion is s3://my-bucket/somedata/data2/*, but it's missing your date partition folders. This is remedied by adding a * segment in front to stand in for the date folder.
To exclude the data2/ directories use: */data2/*
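To see why the original patterns miss and the revised ones hit, here is a rough local sketch of the matching rules described above. It is not Glue's actual matcher. It only assumes that * stays within one folder level, that ** can cross folder boundaries, and that patterns are matched against the object key relative to the include path s3://my-bucket/somedata/. The helper names glob_to_regex and is_excluded are made up for illustration.

```python
import re


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a simplified glob into a regex (hypothetical helper).

    Assumed rules: '**' matches any characters including '/',
    '*' matches any characters except '/', '?' matches one non-'/' character.
    """
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")      # crosses folder boundaries
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")   # stays inside one folder level
            i += 1
        elif pattern[i] == "?":
            parts.append("[^/]")
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def is_excluded(relative_key: str, patterns: list) -> bool:
    """Return True if the key (relative to the include path) matches any pattern."""
    return any(glob_to_regex(p).fullmatch(relative_key) for p in patterns)


# Keys relative to s3://my-bucket/somedata/, taken from the question.
keys = [
    "20180101/data1/stuff.txt.gz",
    "20180101/data2/stuff.txt.gz",
    "20180101/data1.sql",
    "20180101/data2.sql",
]

original = ["*.sql", "data2/*"]     # patterns from the question
revised = ["**.sql", "*/data2/*"]   # patterns from the answer

for key in keys:
    print(key, is_excluded(key, original), is_excluded(key, revised))
```

Under these assumptions the original patterns exclude none of the sample keys, because every key starts with a date folder, while the revised patterns exclude everything except the data1 data file.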