wiki.gimslab.com: Presto And Hive Training Session

# Hive & Presto Architecture
- DFS (HDFS, S3) > ResourceManager (YARN) > Processing (MapReduce, Tez) > SQL Engine (Hive)
- DFS (HDFS, S3) > ResourceManager (YARN) > Processing (Spark Engine) > SQL Engine (SparkSQL)
- DFS (HDFS, S3) > SQL Engine (Presto, Impala, Tajo)

Redshift has its own storage, so the contents of S3 are copied into it through ETL or similar processes.

- Disk-based: Hive (MR, Tez)
- Memory-based: SparkSQL (Spark Engine), Presto, Impala, Tajo

Presto cancels a query when there is not enough memory left for it (for example, when earlier users have already claimed the memory and are still using it).

# Optimize file format in storage
- Row-oriented: good for data processing & transformation
-- Textfile, Avro, SequenceFile
- Column-oriented: good for analytic and aggregation functions
-- ORC, Parquet
-- reads only the needed columns, can apply various compression

Redshift is a column store, so queries that name only the needed columns run faster.

When creating new files on S3, one of the formats above can be chosen in SQL.
ex) create table ... stored as ORC

Recommended: ORC file + Snappy compression. The storage format of an existing partition can be inspected with describe formatted.
ex) describe formatted ods.cs_cancel_order_status partition (regdttm_day = 20190101);
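As a minimal sketch of the ORC + Snappy recommendation in HiveQL: the schema `demo`, the tables `orders_text` and `orders_orc`, and all column names are hypothetical and only serve as illustration. `orc.compress` is the Hive table property that selects the ORC compression codec.

```sql
-- Hypothetical names: demo.orders_text (existing text table), demo.orders_orc (new table).
-- Create an empty ORC table compressed with Snappy.
CREATE TABLE demo.orders_orc (
  order_id  BIGINT,
  member_id BIGINT,
  amount    DOUBLE
)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'SNAPPY');

-- Copy the data from the row-oriented table, selecting only the needed columns.
INSERT INTO TABLE demo.orders_orc
SELECT order_id, member_id, amount
FROM demo.orders_text;
```

Running describe formatted on the new table should then list the ORC input/output formats and the `orc.compress` property.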

Partitioning: use it when the table is large (> 50GB). The unit to partition by should be decided by the unit in which the data is actually queried.
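A short HiveQL sketch of day-level partitioning, assuming the data is mostly queried one day at a time. The table `demo.orders_part` and its columns are hypothetical. The partition column name `regdttm_day` is borrowed from the describe formatted example in these notes, and its INT type is an assumption.

```sql
-- Hypothetical table; one partition per day.
CREATE TABLE demo.orders_part (
  order_id  BIGINT,
  member_id BIGINT,
  amount    DOUBLE
)
PARTITIONED BY (regdttm_day INT)
STORED AS ORC;

-- A filter on the partition column lets the engine read only that day's files.
SELECT order_id, amount
FROM demo.orders_part
WHERE regdttm_day = 20190101;
```

If most queries filtered by month instead, a month-level partition column would likely be the better fit, since the partition unit should follow the usage pattern.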

- Hive on Tez (or MR): best for batch processing
- Presto: best for ad hoc queries (when the result set is small)
- SparkSQL: best for batch processing (if you know the resource requirements)

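# Optimize SELECT
- Use only the necessary columns. Don't use '*' in SELECT.
- Avoid distinct where possible.
- For count(distinct ..), do the distinct in a subquery and then count.
- Presto: count(distinct ..) --> approx_distinct. The count is approximate, so an error of roughly 2-3% is possible.
- Don't use unnecessary joins.
- When joining, even if there is a join key condition, it is better to specify the filter conditions in both the subquery and the main query.
- Presto: does not reorder joins. It is faster to put the table with the large result set on the left and the small table on the right (this does not matter in Hive).
- Presto: when joining without the join keyword (= Oracle style), the order in which the tables are listed matters.
- Don't use functions in join conditions. Apply the function in a subquery to build a result set, then join on that.
- Don't apply functions to columns in where conditions.
- When where contains several like conditions combined with or, use regexp_like if they can be expressed as a single regex.
- Hive: list group by columns in order of highest distinct count (= cardinality) first (ex: group by order_dt, gender).
- Presto: writing numbers instead of column names in group by is marginally better.
- Order by: the fewer columns the better. If only the top few rows are needed, always add limit.
- Hive: order by a, b --> distribute by a sort by a, b (reducers are created using a).
- Presto: order by is recommended for about 10,000 rows or fewer. There is no need for order by when pulling data into Tableau.
- Large order by: build an intermediate table in Hive and use it from Presto.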
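Several of the rewrites above can be sketched as follows. The tables `demo.orders` and `demo.members` and all column names are hypothetical. `approx_distinct` and `regexp_like` are Presto functions, and `distribute by` / `sort by` are HiveQL clauses. Note that `sort by` orders rows within each reducer only, so statement 4 is not a strict equivalent of a global `order by`.

```sql
-- 1. count(distinct ..): deduplicate in a subquery, then count.
SELECT count(*) AS member_cnt
FROM (
  SELECT DISTINCT member_id
  FROM demo.orders
  WHERE regdttm_day = 20190101
) t;

-- 2. Presto: approximate distinct count, when a small error is acceptable.
SELECT approx_distinct(member_id) AS member_cnt_approx
FROM demo.orders
WHERE regdttm_day = 20190101;

-- 3. Presto: one regexp_like instead of several LIKE ... OR ... conditions.
--    Before: WHERE product_name LIKE '%shoe%' OR product_name LIKE '%boot%'
SELECT order_id
FROM demo.orders
WHERE regexp_like(product_name, 'shoe|boot');

-- 4. Hive: ORDER BY order_dt, gender rewritten with DISTRIBUTE BY / SORT BY.
SELECT order_dt, gender, amount
FROM demo.orders
DISTRIBUTE BY order_dt
SORT BY order_dt, gender;

-- 5. Presto join: larger input on the left, smaller on the right,
--    filters inside each subquery, and the function applied before the join
--    rather than inside the join condition.
SELECT o.order_id, m.gender
FROM (
  SELECT order_id, lower(member_email) AS email_key
  FROM demo.orders
  WHERE regdttm_day = 20190101
) o
JOIN (
  SELECT lower(email) AS email_key, gender
  FROM demo.members
  WHERE is_active = true
) m
  ON o.email_key = m.email_key;

-- 6. Top-N: always pair ORDER BY with LIMIT.
SELECT order_id, amount
FROM demo.orders
WHERE regdttm_day = 20190101
ORDER BY amount DESC
LIMIT 10;
```

Statements 2, 3 and 5 target Presto, statement 4 targets Hive, and statements 1 and 6 should run on either engine.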

last modified 2019-05-09 11:59:37